This post first ran in Education Week’s Learning Deeply blog on July 2, 2015.
In a meeting room at the U.S. Department of Education in late September of 2014, leaders from New Hampshire braced for the question they all knew was coming. They had arrived at the Department to request flexibility from Secretary of Education Arne Duncan to pilot a new assessment model: one built on a backbone of flexible, performance-based demonstrations of student knowledge and deeper learning, rather than relying entirely on the standardized end-of-year exams that are common in classrooms today. Unlike most statewide summative assessments, curriculum-embedded performance-based assessments require students to apply knowledge in tangible ways (for example, through original research projects, presentations, or live performances) that are often locally designed and may vary in nature from student to student. To the state and district leaders in the room, this new model was the essential product of their multi-year effort to reorient their education system around shared responsibility for every student’s success. To Secretary Duncan and his senior staff, however, one vexing question stood out: how will the state be able to compare assessment results across the pilot – or across the state – if each student’s experience with performance assessment is different?
Concern for what testing gurus call assessment comparability is not limited to the U.S. Department of Education or New Hampshire. It is a primary reason why some civil rights organizations are fighting to protect the provisions in current federal law that mandate the use of standardized assessments for every child in every year: so that all students are “measured up” using the same measuring stick; so that no students are told they are proficient when, writ large, they are not; and ultimately, so that schools and districts (and sometimes individual educators) that do not achieve comparably high outcomes for students receive appropriate supports and consequences.
Nevertheless, growing public discontent with the standardized assessment paradigm has intensified interest among schools, districts, states, and even some federal lawmakers in reinventing “high stakes” assessment in ways that make room for more flexible performance-based assessments.
Still, a big question must be asked of any state or district transitioning to a new assessment model: how can they be sure that performance assessments – which may differ from student to student – are comparable to standardized assessments during the transition?
Leading the field in contending with these issues are assessment gurus Chris Domaleski and Scott Marion of the National Center for the Improvement of Educational Assessment (the Center for Assessment). I owe gratitude to them not only for their body of work over the years, but also for countless “bonus” hours they have spent helping a layperson like me to understand. First, it should be noted that the Center for Assessment offers guidance on determining the quality of performance-based assessments, as presented in a recent publication co-authored with Achieve. But specifically on the issue of comparability, what follows is my best interpretation of a few (grossly oversimplified) insights gleaned from conversations with Domaleski and Marion.
- Comparable does not mean Equivalent.
Consider a hungry third-grader seeking a nutritious snack before she heads off to school. Her father offers her an apple, but she’s never liked apples and pleads for something else. Were her father to give her something equivalent to this apple, he would give her another apple. On the other hand, he could give her something comparable to the apple – say, an orange, or a pear. Although they may vary in taste and color, these fruits provide comparable nutrients. And both are better than a cupcake for breakfast.
Likewise with assessments: performance-based assessments that are embedded throughout the curriculum at one school should not be expected to be equivalent to curriculum-embedded performance assessments administered somewhere else (unless both schools agreed on a number of standardizing procedures). Rather, in order to be more relevant to instruction, these assessments intentionally vary, permitting different students to complete different tasks to demonstrate mastery of a topic, sometimes at different points in time throughout the year. Therefore, unlike standardized assessments, which are, well, standardized so that all kids take the same test on the same day, curriculum-embedded performance assessments are likely too different to say that scores from one assessment are equivalent to – or interchangeable with – scores on another.
They can, however, be shown to be valid predictors of the same outcomes (carrying the same “nutritional value” according to the analogy), which brings me to the second insight:
- Comparability can be demonstrated when different assessment systems predict the same outcomes.
Although somewhat oversimplified, the basic idea is this: even if two assessment systems (such as a performance-based system and a standardized assessment system) are not equivalent or interchangeable, if they each can be shown to predict a certain set of outcomes, then they can be said to be comparable. In the analogy, this is like saying that although apples and oranges are not the same, if they deliver the same nutritional value to the body, they can be used in comparable ways. In the testing world, even though two assessment systems may be different, if each is designed to demonstrate a similar outcome (say, “college and career readiness”) and each can be shown to predict that outcome (say, college performance as measured by college course grades) with similar precision, we can be confident that comparisons between the two assessments are trustworthy.
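To make the logic of this kind of predictive comparability concrete, here is a minimal illustrative sketch. All of the numbers below are invented for illustration – real validity studies involve far larger samples and far more sophisticated psychometric methods – but the sketch shows the core check: whether two different assessment systems correlate with the same later outcome to a similar degree.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical data: scores for the same six students from two
# different assessment systems, plus a later outcome of interest
# (first-year college GPA).
standardized = [52, 61, 70, 78, 85, 90]
performance  = [55, 58, 72, 80, 83, 92]
college_gpa  = [2.1, 2.4, 2.9, 3.2, 3.4, 3.8]

r_std  = pearson(standardized, college_gpa)
r_perf = pearson(performance, college_gpa)

# If both correlations are similarly strong, the two systems predict
# the outcome with comparable precision, even though individual
# scores are not interchangeable.
print(round(r_std, 2), round(r_perf, 2))
```

Again, this is only the shape of the argument: in practice, demonstrating comparable predictive validity requires years of outcome data and careful controls, as the post goes on to note.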
This kind of demonstration of comparability is perhaps a longer-term goal for states like New Hampshire, which will need to demonstrate that their performance assessment system is as valid a predictor of postsecondary outcomes as the Smarter Balanced standardized assessments that non-pilot districts will administer. It takes time – years – to gather the kinds of postsecondary outcome data required for such an analysis. Meanwhile, however, there are still other methods to demonstrate comparability, some of which are already employed in other countries. Which brings us to the next insight:
- Comparability can be established by examining student work.
To determine whether two assessments are comparable, other nations, such as the United Kingdom and Australia, have developed approaches in which expert judges examine student work – and associated artifacts such as the tasks and rubrics – from various assessments against a pre-determined set of criteria. The idea, like orchestra members tuning before a performance, is to calibrate judgments about student work across the assessments so that we can be sure that students who are deemed “proficient” by one assessment produce the same high-quality work as students deemed “proficient” by another assessment. Depending on the intended purposes of each assessment, these expert judges may adjust scores or scoring processes until the results from both assessments are “in tune.”
To further illustrate, in a recent article in the Education Policy Analysis Archives, Marion and New Hampshire Department of Education Deputy Commissioner Paul Leather write:
[C]onsider students applying for a competitive music program. Students will play different songs, perhaps using different instruments, but judges will have to determine who should be admitted to the program. We accept that judges are able to weigh the different types of evidence to make “comparable judgments.” Why do we accept this? Because we have great trust in expert judges and their shared criteria. When the criteria are not explicit and applied systematically, then people have concerns (remember some of the Olympic figure skating fiascos in past years).
Engaging not only expert judges but also teachers in reviewing student work can multiply the benefits, as documented in this week’s post describing how engaging teachers in scoring assessments helps to deepen their understanding of standards and of student learning. Such explicit attention to how teachers apply scoring criteria is essential to ensuring comparability. As Marion and Domaleski advise, the fact that performance assessments are “looser” on some parameters necessitates being “tight” elsewhere, such as taking measures to ensure accurate and consistent scoring across classrooms.
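A simple way to picture what being “tight” on scoring might look like in practice is a calibration check: two judges score the same student work samples against a shared rubric, and their exact-agreement rate is computed before any scores count, with disagreements flagged for discussion. The rubric scale, scores, and process below are hypothetical, offered only as a sketch of the idea.

```python
# Hypothetical calibration check: two judges independently score the
# same eight work samples on a 1-4 rubric.
judge_a = [3, 4, 2, 3, 1, 4, 3, 2]
judge_b = [3, 4, 2, 2, 1, 4, 3, 3]

# Exact-agreement rate across the shared samples.
matches = sum(a == b for a, b in zip(judge_a, judge_b))
agreement = matches / len(judge_a)

# Flag the samples where the judges disagree, so they can revisit the
# shared criteria together and "tune up" before live scoring.
disagreements = [i for i, (a, b) in enumerate(zip(judge_a, judge_b)) if a != b]

print(f"exact agreement: {agreement:.0%}")  # 6 of 8 samples match
```

Real calibration processes typically use chance-corrected agreement statistics and structured moderation sessions, but the underlying discipline is the same: measure consistency, surface disagreements, and re-calibrate against the shared criteria.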
- Double-testing is not always the answer.
In the midst of New Hampshire’s campaign to secure federal permission for its performance assessment pilot, state leaders were often asked why they don’t simply run the performance assessment system concurrently with the full suite of Smarter Balanced standardized assessments – at least during a couple of transition years. After all, isn’t the easiest way to demonstrate comparability to run both systems simultaneously and then directly compare the results?
State leaders and practitioners in New Hampshire point out that, although it may be theoretically and psychometrically convenient, “double-testing” during one or more transition years proves to be practically inconvenient, both because of the limited capacity of teachers to invest in doing both systems well, and because of the corollary disincentive to invest in something new if the results don’t fully “count.”
In addition, Marion points to another underlying issue with double-testing: the issue of transfer. Suppose students are instructed in a classroom that regularly employs performance-based assessment. In addition to their performance-based assessments, suppose we also administer a completely different type of assessment, such as a standardized test. We compare the results directly to see if the two tests are comparable. In doing so, however, we have unwittingly made a huge assumption: that the student has learned the material deeply enough to transfer their knowledge to the new and unfamiliar standardized assessment. If the student’s results on the two assessments do not align, it might be because the assessments are not comparable, but it might also be because that student’s learning is still too fragile to be transferred from one application to another. It is hard to disentangle the two possible interpretations. Therefore, by resting judgment of assessment comparability solely on a direct comparison of students’ scores in a double-testing approach, one risks a “false negative” conclusion.
Either way, the essential message is that double-testing may not be the panacea that some would suggest. We must have other ways for states or districts to transition into a new system without double-testing – and must employ strategies, such as those listed above, to ensure that assessment systems deliver comparable results during the transition.
Many of you already know the ending to New Hampshire’s meeting with Secretary Duncan: months later, after additional rounds of exchange, Secretary Duncan approved their request for a pilot. But for New Hampshire and other interested states, the pilot represents only the beginning. New Hampshire is in the midst of a multi-year quality review process through which they hope to demonstrate the validity and comparability of their pilot assessment system. Participating pilot districts will anchor judgments of “proficiency” to the Smarter Balanced achievement level descriptors (ALDs), and will participate in a common standard-setting process to compare student work. They will also participate in a peer review process during the first two years of implementation in order to examine system design and performance assessment results, and to provide technical assistance to districts where needed.
If districts and states are to successfully invent the next paradigm for state assessments, while continuing to hold themselves accountable for high levels of achievement for all students, they must be able to understand and monitor student performance not just on one assessment but across an entire system. And as these new systems introduce greater variability through the use of performance assessments, they must tighten other areas, such as scoring. Any threat to comparability must be evaluated and adjusted, so that every student in every school receives a meaningful opportunity to achieve college and career readiness.
Note: None of the statements herein are intended to oversimplify what is actually a very complex process of determining comparability.