SLIDE 1

Operational Research in Assessment Programs as a Window into Task and Item Design Principles: Examples from NAEP

Panel: Madeleine Keehner, Hilary Persky, and Luis Saldivia, Educational Testing Service. Discussant: Robin Hill, Kentucky Department of Education. 2018 NCSA, San Diego, CA

SLIDE 2

Overview of key aspects of human cognition that are relevant to item and task design

Research findings and theory from cognitive science

Madeleine Keehner

SLIDE 3

Design Decisions in Assessment Development: Constructs – Target KSAs

Task structure, item types, response modes, interactive capabilities, design devices, graphics, text, media, layouts… What we measure How we measure

SLIDE 4

These Design Decisions Impact Key Processes

Task structure, item types, response modes, interactive capabilities, design devices, graphics, text, media, layouts… Cognitive

Perception and attention WM load, exec functions Intrinsic/extraneous load LTM schema activation Metacognition

Behavioral

Affordances for action Embodiment

Social

Collaborative Communicative

Affective

Engagement Motivation Enjoyment Frustration Boredom

SLIDE 5

Zooming in on Cognition and Behavior

Perception and attention Long term schema Action planning and control Working memory

How do external item and task design features influence these internal cognitive processes?

SLIDE 6

How External Design Features affect Internal Processes

Perception and attention Long term schema Action planning and control Working memory

Attention can be captured by salient features; it can be directed through signaling. Perception can be overloaded by too much information.

SLIDE 7

How External Design Features affect Internal Processes

Perception and attention Long term schema Action planning and control Working memory

Total processing load may exceed WM capacity. With good design, extraneous load can be minimized and intrinsic load can be optimized.

SLIDE 8

How External Design Features affect Internal Processes

Perception and attention Long term schema Action planning and control Working memory

Familiar response modes, technology, or task types can activate learned schema and reduce WM load. Schema may be inappropriately triggered by familiar-feeling formats.

SLIDE 9

How External Design Features affect Internal Processes

Perception and attention Long term schema Action planning and control Working memory

The affordances of a display can make some behaviors more likely. We may not know what behaviors we are ‘inviting’ with our design.

SLIDE 10

Conclusion: External Representations affect Internal Processes

Perception and attention Long term schema Action planning and control Working memory

External item and task design features interact with internal cognitive processes

SLIDE 11

NAEP Reading Example: Insights from Pretesting an Innovative Interface Design

NAEP eReader design problem:

  • How to present reading passages and items on tablet
  • Allow students to interact fluently with them
  • Gather evidence of reading processes
  • Full-screen presentation would allow for widest variety of passages
  • Items presented in a separate window or panel would allow for wide variety of item types
  • Navigational aids provided to facilitate navigation between items and passage

SLIDE 12

Comparison of Different Layouts

[Screenshots of passage layouts: “Fish Fossils” and “Dinosaur Skeleton”]

1 vs 2 column passage; items swiped in from the right side

  • WM load if items not always visible?
  • How do interactive behaviors differ with visual occlusion?

Look-back buttons in items

  • Schema for use?
  • Sufficiently salient?
SLIDE 13

Interaction Behaviors: Swiping Items On and Off

  • Swiping (L and R) happened more in layouts where items overlap text (two-column passages)

– Where there was no overlap (one-column, blue), students still swipe L (on) but hardly ever swipe R (off)

  • Item is visible all the time
  • Is this too different from P&P?
  • Does it change the way students read/search?
  • Two-column layouts: 4th and 8th graders differed

– 4th graders: swiped on and then off
– 8th graders: swiped on, did other actions, then swiped off

Some performance differences: G4 did a little better with 1-column, G8 had longer CRs with 1-column

SLIDE 14

Overall Insights and Eventual Design Decisions

  • Different behavioral affordances from 1 and 2 column layouts

– Students do not remove items if they are not occluding text

  • Suggests less cognitive effort to leave on – only removed when in the way

– Performance similar but not identical (note: no P&P baseline)
– More process information when swiping on and off
– Always-visible items might change reading strategy/approach (diff from P&P)
– Expert committee decision: two-column layout appropriate operational trade-off
– (Note: interface design still evolving)

  • Use of look-back buttons in items hardly ever observed

– Interview questions indicated students had not noticed them
– Suggests no schema to look for them and not salient enough to capture attention
– Design tweak: visual salience was enhanced, instruction added to tutorial

SLIDE 15

Take-Home 1: Design Decisions Impact Basic Processes, and the Reverse Should Also Be True

Task structure, item types, response modes, interactive capabilities, design devices, graphics, text, media, layouts… Cognitive

Perception and attention WM load, exec functions Intrinsic/extraneous load LTM schema activation Metacognition

Behavioral

Affordances for action Embodiment

Social

Collaborative Communicative

Affective

Engagement Motivation Enjoyment Frustration Boredom

Knowledge of these basic processes should also impact our design decisions

SLIDE 16

Take-Home 2: Interdisciplinary Collaboration is Needed to do Justice to both What and How

  • Assessment developers

– Subject-matter content expertise, item and task design experience

  • Learning scientists

– Subject-relevant cognitive and learning expertise

  • Cognitive scientists

– Expertise in general cognitive, metacognitive, behavioral, social, and affective processes; usability and cognitive research methods; human-computer interaction, etc.
– (And many others, of course…)

SLIDE 17

Take-Home 3: More and Better Research Needed

  • Traditional items are supported by decades of psychometric research

– Empirical data: item response characteristics, validity studies, etc.

  • Digital assessments allow many more options for:

– Varied stimuli and representations – Different response modes and response behaviors – Other kinds of behaviors and interactions

  • Psychometric approach alone may not be enough

– Basic properties of cognition need to be examined, and considered a priori
– Requires experimental cognitive research methods and analyses
– Meanwhile, let’s look at some insights from operational pretesting studies…

SLIDE 18

A Pretesting Study: Effects of avatars (and leveling) in SBTs

Hilary Persky

SLIDE 19

Background

  • The affordances of DBA allow assessments to better reflect authentic reading experiences, which are purpose driven, at times collaborative, and involve various types and levels of support.
  • Many believe the construct of reading comprehension has broadened with the advent of digital literacies.
  • Purpose-driven tasks have been taken up by the next generation state assessments as well as national and international assessments (PIRLS and PISA).

SLIDE 20

Why the study?

  • Avatars used in new NAEP reading tasks to:
    • introduce and reaffirm overall task and specific activity purposes
    • simulate conversation/collaboration
    • assist in task transitions
    • reset student understanding (leveling)
  • Some stakeholder concerns:
    • Do avatars add cognitive load?
    • Are avatars actually engaging?
    • Does “leveling” negatively affect students?

SLIDE 21

Study Questions

  • Main focus: Does having student avatars affect
    • Test performance?
    • Test-taking behaviors?
    • Affective responses?
  • Do we see any effects of leveling?

SLIDE 22

Study Design

  • Two assessment tasks: literary and informational
  • Two versions of each task
    – Avatar vs Non-avatar
  • Leveling in both versions
  • Student survey on
    – Preferences and affective responses
    – Background information (digital access; reading motivation)

SLIDE 23

Study Approach

  • Tryout (like normal admin):
    • 100 students recruited from the DC area
    • Randomly assigned to the Avatar or Non-avatar conditions (each student took only one task)
  • Cog labs (one on one; think aloud, eye tracking, post-task interview):
    • 12 students, recruited from Trenton, Ewing, Princeton
    • Randomly assigned to the Avatar and Non-avatar conditions

SLIDE 24

Tryout Performance Results

  • No significant effects on total task scores or item scores
  • The number of high- and low-performing students was similarly distributed in the avatar and non-avatar conditions.
  • No significant interactions with gender, race/ethnicity, SES, or digital access (based on survey items included in the tryout).

SLIDE 25

Tryout Process Data Results

  • No significant effect of avatars on reading behaviors such as reading speed or the number of page turns.
  • No significant effect of avatars on question-answering behaviors such as the number of times answers are changed, back navigation, or specific item behavior, such as select-in-passage behavior.
  • No significant effects of avatars on time use (that is, time on reading or items)
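Measures like these are derived from raw event logs. The sketch below shows one way such process measures could be computed; the event schema and field names are hypothetical illustrations, not NAEP's actual log format.

```python
# Hypothetical sketch: deriving simple process measures from an event log.
# The event schema (dicts with "student", "type", "time") is an assumption,
# not the actual NAEP log format.

def process_measures(events):
    """Summarize page turns, answer changes, and time on task per student."""
    stats = {}
    for ev in events:
        s = stats.setdefault(
            ev["student"], {"page_turns": 0, "answer_changes": 0, "times": []}
        )
        if ev["type"] == "page_turn":
            s["page_turns"] += 1
        elif ev["type"] == "answer_change":
            s["answer_changes"] += 1
        s["times"].append(ev["time"])
    return {
        student: {
            "page_turns": s["page_turns"],
            "answer_changes": s["answer_changes"],
            "time_on_task": max(s["times"]) - min(s["times"]),
        }
        for student, s in stats.items()
    }

# Invented three-event log for one student:
log = [
    {"student": "s1", "type": "page_turn", "time": 0.0},
    {"student": "s1", "type": "answer_change", "time": 30.0},
    {"student": "s1", "type": "page_turn", "time": 95.0},
]
print(process_measures(log)["s1"])
# {'page_turns': 2, 'answer_changes': 1, 'time_on_task': 95.0}
```

Condition comparisons (Avatar vs Non-avatar) would then be run on these per-student summaries.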

SLIDE 26

BUT: Tryout survey affective results show differences…

SLIDE 27

[Bar charts: “Lit task. How easy or hard was this task?” and “Inf task. How easy or hard was this task?”; counts of responses from 1 Very easy to 5 Very hard, Avatar vs Non-Avatar]

Results suggest students in the avatar conditions perceived the tasks as easier.

SLIDE 28

[Bar charts of agreement ratings (1 Strongly Disagree to 6 Strongly Agree): “Lit task. The pictures and conversations made the task more interesting.”; “Lit task. The conversations made the task less interesting.”; “Inf task AV. The pictures and conversations made the task more interesting.”; “Inf task AV. The conversations made the task less interesting.”]

SLIDE 29

Leveling survey responses

  • I felt annoyed when the task gave me answers to questions I had just answered: for both tasks, significantly more students disagreed.
  • Getting an answer to a question I had just completed made me more confident about answering the next question: for both tasks, significantly more students agreed.

SLIDE 30

Tryout Survey Summary

  • Students perceived the Avatar version as easier and equally or more interesting.
  • On average, students in the Avatar condition were positive to neutral about the images and conversations.
  • On average, students were positive to neutral about leveling.

SLIDE 31

Cog lab student comments

SLIDE 32
  • They made me feel like I already knew the book and read it several times, like they would come to me for help if I was the teacher. It gave me specific parts of the book that I would read, and then the avatars would ask questions about it so I felt like I was explaining it to them.
  • They made you look at things you wouldn’t think about if reading it by yourself.
  • Seeing their interpretation helped me connect back to the story, made it a little easier. I think the avatars helped me to understand the story because they had similar questions that I had.
  • Classmates (avatars) make it easier because they do the reading of what you would normally have to read and give your brain a rest. With avatars is more interactive because you don’t get bored and zone out as you would in normal reading tasks in school.
  • They provided guidance and direction. It was more personal, not all directions given by the computer. It felt almost real and like I was working with them a little bit…
  • When they communicate with each other it was like working with students in class, like when two other students are talking to each other and I am listening to them.
  • It was different. Usually we just get the questions and multiple choice answers. It was cool but didn’t help or hurt me.
  • I guess, kind of collaborating, but they couldn’t actually talk to me or respond to what I was saying. It didn’t feel like a real interaction, they can’t comment on my statements.
  • Leveling by avatar: Some of it was funny because the answers were kind of obvious, but it was cool to feel like you were having a conversation with someone and see what they are thinking and where they are coming from and explain why. Good to hear opinions other than my own.
  • Leveling by avatar: I guess it could have been that answer, but it doesn’t matter that much what she (avatar) said. Not annoying, just whatever. It had no effect and wouldn’t change my approach/answers.

SLIDE 33

Take-aways

  • Avatars do not seem to add cognitive load, and students do seem to find them (mostly) engaging.
  • On average students perceived leveling as not annoying, and it gave them confidence to answer the next question in the task.
  • Avatars and leveling are not just surface design features, but construct-relevant features afforded by DBA to measure reading comprehension.

SLIDE 34

New study to dig further

  • Purpose: Study effect of SBT features on students’ reading performance, reading behaviors, and engagement.
  • In context of full NAEP pilot, examine students’ performance and processes on SBTs in comparison with discrete (DI) blocks using the same texts and items as the SBTs, but without any of the SBT features (e.g., avatars, leveling, sequencing).
  • Developed special study student questionnaire items from recent literature on student engagement, motivation, persistence, and self-efficacy (Guthrie & Klauda, 2014) as well as established NAEP survey design principles.
  • Analysis about to begin!

SLIDE 35

NAEP Mathematics Pretesting Findings

Luis Saldivia

NCSA, June 27, 2018

SLIDE 36

NAEP Mathematics Item Types

2015 Operational: Multiple Choice, Constructed Response

2017 Operational: Multiple Choice, Constructed Response, Multiple Select, Matching, Zones, Grids, In-line Choice (Drop-down), Bar Graph, Box Plot

SLIDE 37

Purpose of the Pretest Study

  • The study consisted of small-scale tryouts of a selection of NAEP mathematics 2017 discrete items. In tryouts, students answer items in timed, assessment-like conditions. Goals:
    • Gather data about item response times (RTs)
    • Investigate item performance
    • Systematically test the effects of presentation format and response mode by varying item type while holding constant the item content

SLIDE 38

Design

  • Inline vs SSMC: Compare response times and scores. Examine whether inline choice formats appear to produce greater usability or construct-irrelevant cognitive challenges, compared with traditional SSMC.
  • MSMC: Compare two variants of MSMC items, with and without the number of selections specified. Compare number and range of selections made and resulting scores.
  • Zone: Compare two variants of MS Zone-selection items, with and without the number of selections specified. Compare number and range of selections made, and resulting scores.
  • Grid vs MSMC: Compare selection behaviors, specifically number of choices selected and number of options left blank

SLIDE 39

Results – Timing

SLIDE 40

Inline Choice vs. SSMC

  • Six pairs of items were compared at each grade
  • The findings suggest that inline choice is equivalent to SSMC in terms of effects of presentation format and response selection mode on performance and speed.
  • Conclusion: The item content should drive the selection of the best item format to meet the requirements of the item. For instance, inline choice can be used for content that requires linking ideas, such as claims with evidence.

SLIDE 41

Zone and MSMC With vs. Without Number of Selections Specified

SLIDE 42

Zone and MSMC With vs. Without Number of Selections Specified

  • Two Zone and two MSMC versions were given at each grade
  • It is clear from the data that students do understand the requirement to select more than one option
  • There is some indication that specifying the number of selections reduces the variance in the number of selections made
  • However, students do not consistently adhere to the instruction
    – It is not possible to know from the present data
      • whether students do not notice (or forget) the instruction
      • if they do attend to it but deliberately choose a different number
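
A claim like "specifying the number of selections reduces the variance in the number of selections made" can be checked by comparing the spread of selection counts across the two conditions. A minimal sketch, using invented counts rather than the study's data:

```python
# Minimal sketch: comparing variance of per-student selection counts between
# a number-specified and an unspecified condition. The counts below are
# invented for illustration -- they are not the study's data.
from statistics import pvariance

specified = [3, 3, 2, 3, 4, 3, 3]      # condition: "Select 3 options"
unspecified = [1, 3, 5, 2, 4, 3, 2]    # condition: "Select all that apply"

var_spec = pvariance(specified)
var_unspec = pvariance(unspecified)
print(var_spec < var_unspec)  # True: the specified condition is less variable
```

With real data, a formal test of variance equality would follow the same comparison.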
SLIDE 43
  • In two cases out of eight contrasts, scores were significantly higher in the number-specified variant
  • Given that students did not adhere to the number specified, it is not clear whether giving this instruction is systematically beneficial
  • Item contents were at least as important for raw score differences as the instruction to select a particular number of options
  • Side notes:
    – There does not appear to be a trend for MSMC items to be easier or harder than Zone items
    – There is no difference in the response times for these item types, and overall there is no evidence from either scores or RTs that students have difficulty with the zone-selection response mode

SLIDE 44

MSMC vs. Grid Items

SLIDE 45

MSMC vs. Grid Items

Four pairs of variants per grade. In Grid items, students were almost universally likely to fill all rows. By contrast, in MSMC items, the number of response selections varied considerably and tended to cluster around the middle of the available range, with very few students making the maximum number of selections.

  • Original scoring rubrics tended to benefit the MSMC
    – Partial scores allowed answers with some blank responses, which rarely occurs in Grid items
    – Penalizing incorrect selections is more likely in Grid items, since students rarely leave options blank even if they are unsure of the correct selection
    – For these reasons, Grid items tended not to receive partial scores

SLIDE 46

MSMC vs. Grid Items

  • Dichotomous rubrics also advantaged MSMC items
    – Grid items tend to encourage attempts on all rows. Students may be more likely to guess when they are not sure, since they believe they must provide a response for all instances.
    – We cannot assume that the unselected options in MSMC items are equivalent to False selections on Grid items: some may be equivalent to False, while others may be equivalent to Don’t Know, and in those cases students may not choose to guess
    – In a Grid format, we assume that those same Don’t Know instances tend to get instantiated in a selection, which in those cases would be a guess or a less-than-certain selection

SLIDE 47

MSMC vs. Grid Items

  • It is important to develop scoring rubrics for Grid items that take account of the affordances of this layout and produce equivalent scores for cognitively equivalent items
  • Grid item rubrics should not assume that any options will be left blank, and the treatment of incorrect selections should take into account the greater likelihood of guessing
  • Grid items have distinct measurement properties, and are by design not analogous to MSMC items
  • Grid items appear to have different cognitive and even metacognitive affordances

SLIDE 48

MSMC vs. Grid Items

  • One potential advantage of Grid items might come from their tendency to make students attempt all selections.
  • If a rubric is designed carefully, this property might be helpful for distinguishing students who are leaving MSMC options blank to indicate ‘not True’ versus those whose blank responses indicate ‘Don’t Know’
  • Given the layout of Grid items, it might even be possible to incorporate a third column (e.g., “Cannot be determined”) that could make such a distinction explicit, which is something that would be difficult to do with MSMC formats
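The scoring asymmetry discussed on these slides can be made concrete. In this hypothetical sketch (the answer key and rubric are invented, not NAEP's), an MSMC response is a set of selected options, while a Grid response forces a True/False mark on every row, so a Don't Know state turns into a guess:

```python
# Hypothetical sketch of the scoring asymmetry between MSMC and Grid formats.
# The answer key and rubric (invented here) penalize incorrect selections.

KEY = {"A": True, "B": False, "C": True, "D": False}

def score_msmc(selected):
    """MSMC: +1 per correct selection, -1 per incorrect selection, floor 0.
    Unselected options earn nothing: a blank may mean 'False' OR 'don't know'."""
    pts = sum(1 if KEY[opt] else -1 for opt in selected)
    return max(pts, 0)

def score_grid(responses):
    """Grid: every row is marked True/False, so a 'don't know' becomes a guess."""
    pts = sum(1 if responses[opt] == KEY[opt] else -1 for opt in responses)
    return max(pts, 0)

# A student sure only that A is True, uncertain about the rest:
print(score_msmc({"A"}))  # 1: blanks cost nothing in MSMC
print(score_grid({"A": True, "B": True, "C": False, "D": False}))  # 0: two wrong guesses cancel two right
```

The same knowledge state scores differently by format, which is why the slides argue Grid rubrics need to be designed around the layout's affordances rather than copied from MSMC.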

SLIDE 49

Next Steps – Other Item Types

[Matching-item diagram: Source options A B C D E mapped to Target slots 1 2 3]

SLIDE 50

Number of Actions

SLIDE 51

Number of Actions

SLIDE 52

Sequence of Student Actions – Representing numbers symbolically

  • Analyze patterns in sequence of actions (exactly 3 choices)
  • Do students work from graphical to symbolic representation (“target-focused”) or symbolic to graphical (“source-focused”) representation?
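One plausible heuristic for this classification is to look at which representation a student acts on first. The sketch below is an assumption-laden illustration: the action labels and the first-touch rule are invented here, not NAEP's actual event vocabulary or coding scheme.

```python
# Hypothetical sketch: classifying an action sequence as "target-focused"
# (first acts on the symbolic target slots) or "source-focused" (first acts
# on the graphical source options). Labels and rule are invented for
# illustration, not NAEP's actual process-data coding.

def classify_focus(actions):
    """Classify a sequence by the representation the student touches first."""
    for a in actions:
        if a.startswith("target"):
            return "target-focused"
        if a.startswith("source"):
            return "source-focused"
    return "no clear focus"

print(classify_focus(["target:1", "source:A", "target:2"]))  # target-focused
print(classify_focus(["source:B", "target:1"]))              # source-focused
```

A real analysis would likely use a richer rule than first touch (e.g., the dominant direction across all drag actions), but the shape of the computation is the same.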

SLIDE 53

Number of Actions

All students: 69% exactly 3 actions, 31% more than 3 actions
44% target-focused, 13% source-focused, 12% no clear focus

SLIDE 54

Target- vs. Source-focused: Item Scores

SLIDE 55

Turn and Talk

SLIDE 56

Discussant/Reactant

  • High quality assessments: Can states have them without considering item and assessment constructs and structures, including the student experience? (Performance, Reaction, Engagement)
  • How should states begin to examine all of the underlying metadata regarding how students interact with the items and the construct of the assessment?
  • What are the potential barriers for large scale assessments?
    – Lack of collaboration with cognitive scientists, content scientists and assessment developers – from Keehner’s slide (17).
    – Lack of funding to support this research.
    – Time to complete research and react to findings.

SLIDE 57

Takeaways

  • This is not just a “NAEP” issue.
  • Design decisions matter beyond just having more robust items and assessment constructs.
  • DBAs should not simply replicate P&P assessments. There is so much to gain from the underlying metadata that a DBA can provide.
  • Refreshing to know this research is happening related to the NAEP items and construct; however, more research and collaboration efforts are needed to develop high quality DBAs in order to make appropriate judgements of what students know and understand.

SLIDE 58

Operational Research in Assessment Programs as a Window into Task and Item Design Principles: Examples from NAEP

Panel: Madeleine Keehner, Hilary Persky, and Luis Saldivia, Educational Testing Service. Discussant: Robin Hill, Kentucky Department of Education