No. 33 February 2003
Testing High Stakes Tests: Can We Believe the Results of Accountability Tests?
Jay P. Greene, Ph.D.
Senior Fellow, Manhattan Institute for Policy Research
Marcus A. Winters
Research Associate, Manhattan Institute for Policy Research
Greg Forster, Ph.D.
Senior Research Associate, Manhattan Institute for Policy Research
Do standardized tests that are used to reward or sanction schools for their academic performance, known as “high stakes” tests, effectively measure student proficiency? Opponents of high stakes testing argue that it encourages schools to “teach to the test,” thereby improving results on high stakes tests without improving real learning. Since many states have implemented high stakes testing and it is also central to President Bush’s No Child Left Behind Act, this is a crucial question to answer.
This report tackles that important policy issue by comparing schools’ results on high stakes tests with their results on other standardized tests that are not used for accountability purposes, and thus are “low stakes” tests. Schools have no incentive to manipulate scores on these nationally respected tests, which are administered around the same time as the high stakes tests. If high stakes tests and low stakes tests produce similar results, we can have confidence that the stakes attached to high stakes tests are not distorting test outcomes, and that high stakes test results accurately reflect student achievement.
The report finds that score levels on high stakes tests closely track score levels on other tests, suggesting that high stakes tests provide reliable information on student performance. When a state’s high stakes test scores go up, we should have confidence that this represents real improvements in student learning. If schools are “teaching to the test,” they are doing so in a way that conveys useful general knowledge as measured by nationally respected low stakes tests. Test score levels are heavily influenced by factors that are outside schools’ control, such as student demographics, so some states use year-to-year score gains rather than score levels for accountability purposes. The report’s analysis of year-to-year score gains finds that some high stakes tests are less effective than others in measuring schools’ effects on student performance.
The report also finds that Florida, which has the nation’s most aggressive high stakes testing program, has a very strong correlation between high and low stakes test results on both score levels and year-to-year score gains. This justifies a high level of confidence that Florida’s high stakes test is an accurate measure of both student performance and schools’ effects on that performance. The case of Florida shows that a properly designed high stakes accountability program can provide schools with an incentive to improve real learning rather than artificially improving test scores.
The report’s specific findings are as follows:
- On average in the two states and seven school districts studied, representing 9% of the nation’s total public school enrollment, there was a very strong population adjusted average correlation (0.88) between high and low stakes test score levels, and a moderate average correlation (0.45) between the year-to-year score gains on high and low stakes tests. (If the high and low stakes tests produced identical results, the correlation would be 1.00.)
- The state of Florida had by far the strongest correlations, with a 0.96 correlation between high and low stakes test score levels, and a 0.71 correlation between the year-to-year gains on high and low stakes tests.
- The other state studied, Virginia, had a strong 0.77 correlation between test score levels, and a weak correlation of 0.17 between year-to-year score gains.
- The Chicago school district had a strong correlation of 0.88 between test score levels, and no correlation (-0.02) between year-to-year score gains.
- The Boston school district had a strong correlation of 0.75 between test score levels, and a moderate correlation of 0.27 between year-to-year score gains.
- The Toledo school district had a strong correlation of 0.79 between test score levels, and a weak correlation of 0.14 between year-to-year score gains.
- The Fairfield, Ohio, school district had a moderate correlation of 0.49 between test score levels, and a moderate negative correlation of -0.56 between year-to-year score gains.
- The Blue Valley, Kansas, school district had a moderate correlation of 0.53 between test score levels, and a weak correlation of 0.12 between year-to-year score gains.
- The Columbia, Missouri, school district had a strong correlation of 0.82 between test score levels, and a weak negative correlation of -0.14 between year-to-year score gains.
- The Fountain Fort Carson, Colorado, school district had a moderate correlation of 0.35 between test score levels, and a weak correlation of 0.15 between year-to-year score gains.
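One plausible reading of a "population adjusted average correlation" is an enrollment-weighted mean of the per-system correlations. The sketch below assumes that reading; the weighting formula is not spelled out in this summary, and the function name and enrollment figures are hypothetical placeholders rather than data from the study.

```python
def weighted_mean_correlation(correlations, enrollments):
    """Average per-system correlations, weighting each by student enrollment.

    Assumption: "population adjusted" means enrollment-weighted.
    """
    if len(correlations) != len(enrollments) or not correlations:
        raise ValueError("need one enrollment figure per correlation")
    total = sum(enrollments)
    if total <= 0:
        raise ValueError("total enrollment must be positive")
    return sum(r * n for r, n in zip(correlations, enrollments)) / total


# Hypothetical example: two made-up systems, one three times larger than the other.
example = weighted_mean_correlation([0.9, 0.5], [300_000, 100_000])
print(round(example, 2))
```

Under this kind of weighting, the largest school systems dominate the average, which is worth keeping in mind when reading the 0.88 and 0.45 figures alongside the much lower correlations reported for some small districts.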
About the Authors
Jay P. Greene is a Senior Fellow at the Manhattan Institute for Policy Research where he conducts research and writes about education policy. He has conducted evaluations of school choice and accountability programs in Florida, Charlotte, Milwaukee, Cleveland, and San Antonio. He also recently published a report and a number of articles on the role of funding incentives in special education enrollment increases.
His research was cited four times in the Supreme Court’s opinions in the landmark Zelman v. Simmons-Harris case on school vouchers. His articles have appeared in policy journals, such as The Public Interest, City Journal, and Education Next, in academic journals, such as The Georgetown Public Policy Review, Education and Urban Society, and The British Journal of Political Science, as well as in major newspapers, such as the Wall Street Journal and Christian Science Monitor.
Greene has been a professor of government at the University of Texas at Austin and the University of Houston. He received his B.A. in history from Tufts University in 1988 and his Ph.D. from the Government Department at Harvard University in 1995. He lives with his wife and three children in Weston, Florida.
Marcus Winters is a Research Associate at the Manhattan Institute’s Education Research Office where he studies and writes on education policy. He recently graduated from Ohio University with a B.A. in political science, for which he received departmental honors, and a minor in economics.
Greg Forster is a Senior Research Associate at the Manhattan Institute’s Education Research Office. He is the co-author of several education studies and op-ed articles. He received a Ph.D. with distinction in Political Science from Yale University in May 2002, and his B.A. from the University of Virginia, where he double-majored in Political and Social Thought and Rhetoric and Communications Studies, in 1995.
The authors would like to thank Chester E. Finn, Jr. and Rick Hess, who reviewed this manuscript and provided valuable comments and suggestions. The authors would also like to thank the state and local education officials who helped make test score data available for the study.
“High stakes” testing, the use of standardized tests to reward or sanction schools for their academic performance, is among the most contentious issues in education policy. As the centerpiece of President Bush’s No Child Left Behind Act, it is also among the most prominent education reform strategies. The idea behind it is that rewarding or sanctioning schools for their performance provides schools with incentives necessary to improve academic achievement.
But what if schools respond to the incentives of high stakes testing by developing ways to improve results on the high stakes tests without actually improving real learning? This is the principal objection raised by opponents of high stakes testing. Opponents contend that schools will “teach to the test,” or cheat even more directly by manipulating the test answer sheets. The concern raised by opponents is that high stakes testing causes schools to teach skills or adopt policies that are only useful for passing the high stakes tests and are not more generally useful in helping prepare students for later life.
Whether the high stakes of high stakes testing are in fact motivating schools to manipulate results without actually improving real student achievement is a question that can be investigated empirically. By comparing results from high stakes tests with results from other standardized tests administered around the same time, we can determine whether the high stakes associated with high stakes tests are distorting test results. If high stakes tests produce results that are similar to the results of other tests where there are no incentives to manipulate scores, which we might call “low stakes” tests, then we can have confidence that the high stakes do not themselves distort the outcomes. If, on the other hand, high stakes tests produce results that are not similar to the results of low stakes tests, then we should be concerned that schools have managed to produce results on high stakes tests that are inaccurate reflections of actual student achievement.
This report investigates the validity of high stakes testing by comparing the results of high and low stakes tests administered to students around the same time in two states and in seven school districts nationwide. The states and districts examined contain 9% of all public school students in the country. We find that scores on high and low stakes tests generally produce results that are similar to each other. The population adjusted average correlation between high and low stakes test results in all the school systems we examined was 0.88, which is a very strong correlation (if the high and low stakes tests produced identical results, the correlation would be 1.00).
We also find that year-to-year improvement on high stakes testing is strongly correlated with year-to-year improvement on low stakes standardized tests in some places, but weakly correlated in others. The population adjusted average correlation between year-to-year gain on high stakes tests and year-to-year gain on low stakes tests in all the school systems we examined was 0.45, which is a moderately strong correlation. But the correlation between year-to-year gains on Florida’s high and low stakes tests was extremely high, 0.71, while the correlation in other locations was considerably lower. These analyses lead us to conclude that well-designed high stakes accountability systems can and do produce reliable measures of student progress, as they appear to have done in Florida, but we can have less confidence that other states’ high stakes tests are as well designed and administered as Florida’s.
A Variety of Testing Policies
There is considerable diversity in testing policies nationwide. States and school districts around the country vary in the types of tests they use, the number of subjects they test, the grades in which they administer the tests, and the seriousness of the sanctions or rewards they attach to test results. Some states, such as Minnesota, report scores on state-mandated tests to the public in order to shame school districts into performing better; other states, such as Ohio and Massachusetts, require students to pass the state exam before receiving a high school diploma. Chicago public school students must perform well on the Iowa Test of Basic Skills in specified grades in order to be promoted to the next grade, even though neither the test nor the sanction is required by the state of Illinois.
Perhaps the nation’s most aggressive test-based accountability measure is Florida’s A+ program. Florida uses results on the Florida Comprehensive Assessment Test (FCAT) to hold students accountable by requiring all students to pass the 3rd grade administration of the exam before moving to the 4th grade, and by withholding diplomas from students who have not passed all sections of the 10th grade administration of the exam. It also holds schools and districts accountable by using FCAT results to grade schools from A to F on school report cards that are very widely publicized and scrutinized. However, what really makes Florida’s program stand out is that the state holds schools and districts accountable for their students’ performance on FCAT by offering vouchers to all students in schools that have earned an F on their report cards in any two of the previous four years. These chronically failing schools face the possibility of the ultimate consequence—they could lose their students and the state funding that accompanies them.
Two states, Florida and Virginia, and several school districts gave their students both a high stakes test and a commercially-designed low stakes test during the school year. The low stakes tests are used to assess how well students are doing compared to national norms and to decide what curriculum changes should be implemented to better serve students. Since parents and school officials see the results of the tests and use them for their own purposes, it would be incorrect to say that there are no stakes attached to them at all. However, the stakes attached to these tests are small enough that schools have little or no incentive to manipulate the results in the way that some fear high stakes tests may be manipulated. Thus a student’s performance on a low stakes test is most likely free from potential distortion.
Several objections have been raised against using standardized testing for accountability purposes. Most concerns about high stakes testing revolve around the adverse incentives created by the tests. Some have worried that pressures to produce gains in test scores have led to poor test designs or questionable revisions in test designs that exaggerate student achievement (for example, see Koretz and Barron 1998 on Kentucky’s test; Haney 2000 on Texas’ test; and Haney et al 1999 on Massachusetts’ test). Others have written that instead of teaching generally useful skills, teachers are teaching skills that are unique only to a particular test (for example, see Amrein and Berliner 2002; Klein et al 2000; McNeil and Valenzuela 2000; Haney 2000; and Koretz and Barron 1998). Still others have directly questioned the integrity of those administering and scoring the high stakes tests, suggesting that cheating has produced much of the claimed rise in student achievement on such exams (for example, see Cizek 2001; Dewan 1999; Hoff 1999; and Lawton 1996).
Most of these criticisms fail to withstand scrutiny. Much of the research done in this area has been largely theoretical, anecdotal, or limited to one or another particular state test. For example, Linda McNeil and Angela Valenzuela’s critique of the validity of high stakes testing lacks an analysis of data (see McNeil and Valenzuela 2000). Instead, their arguments are based largely on theoretical expectations and anecdotal reports from teachers, whose resentment of high stakes testing for depriving them of autonomy may cloud their assessments of the effectiveness of testing policies. Their reports of cases in which high stakes tests were manipulated are intriguing, but they do not present evidence on whether these practices are sufficiently widespread to fundamentally distort testing results.
Other researchers have compared high stakes test results to results on other tests, as we do in this study. Prior research in this area, however, has failed to use tests that accurately mirror the population of students taking the high stakes test or the level of knowledge needed to pass the state mandated exam.
Amrein and Berliner find a weak relationship between the adoption of high stakes tests and improvement in other test indicators, such as NAEP, SAT, ACT, and AP results (see Amrein and Berliner 2002).1 Koretz and Barron find that Kentucky’s high stakes test results show increases that are not similarly found in the state’s NAEP results (see Koretz and Barron 1998). Klein et al similarly claim that gains on the Texas high stakes test appear to be larger than are shown by NAEP (see Klein, Hamilton, McCaffrey, and Stecher 2000).
Comparing state-mandated high stakes tests with college entrance and AP exams is misleading because the college-oriented exams are primarily taken by the best high school students, who represent a minority of all students. Though the percentage of students taking these exams has increased to the point that test-takers now include more than the most elite students, they still are not taken by all students, and this hinders their usefulness for assessing the validity of near-universally administered high stakes tests. Only a third of all high school students take the SAT, and even fewer take the ACT or AP. Furthermore, college-oriented tests tell us nothing about the academic progress of the student population that high stakes testing is most intended to benefit: low performing students in underserved communities. In addition, because these tests are intended only for college bound students they test a higher level of knowledge than most high stakes tests, which are used to make sure students have the most basic knowledge necessary to earn a diploma. Any discrepancy between the results of college-oriented tests and high stakes tests could be attributable to the difference in the populations taking these tests and the different sets of skills they demand.
Comparisons between high stakes tests and NAEP are more meaningful than comparisons to college-oriented tests, though NAEP-based analyses also fall short of the mark. NAEP is administered infrequently and only to certain grades. Any weak correlation between NAEP and high stakes tests could be attributable to such factors. When tests are not administered around the same time, or are not administered to the same students, their results are less likely to track each other. This will soon change with the new, more frequent NAEP testing schedule required under the No Child Left Behind Act—although NAEP will also become a high stakes test under No Child Left Behind, so its usefulness for evaluating other tests may not be improved.
Rather than focusing on statewide outcomes, like NAEP or college-oriented exam results, Haney uses classroom grades to assess the validity of Texas’ high stakes test. He finds a weak correlation between Texas high stakes results and classroom grades, from which he concludes that the Texas high stakes test results lack credibility (see Haney 2000). However, it is far more likely that classroom grades lack credibility. Classroom grades are highly subjective and inconsistently assigned, and are thus likely to be misleading indicators of student progress (see Barnes and Finn 2002). Supporting this suspicion of classroom grades, Figlio and Lucas correlated school grades in Florida with scores on the state’s high stakes test and found that teacher-given grades were inflated (see Figlio and Lucas 2001).
There have also been a number of responses to these critiques of state testing validity. For example, Hanushek and Phelps have written a series of methodological critiques of the work by Haney and Klein (see Hanushek 2001 and Phelps 2001). Hanushek points out that Klein’s finding of stronger gains on the Texas state test than on NAEP should come as no surprise given that Texas school curricula are more closely aligned with the Texas test than with NAEP (see Hanushek 2001). Phelps takes Haney and Klein to task for a variety of errors, alleging (for example) that Haney used incorrect NAEP figures on exemption rates in Texas and that Klein failed to note more significant progress on NAEP by Texas students because of excessive disaggregation of scores (see Phelps 2000).
Other analyses, such as those by Grissmer, et al, and Greene, also contradict Haney and Klein’s results. Contrary to Haney and Klein, Grissmer and Greene find that Texas made exceptional gains on the NAEP as state-level test results were increasing dramatically (see Grissmer, Flanagan, Kawata, and Williamson 2000; and Greene 2000). Unfortunately, our inability to correlate individual-level or school-level performance on the NAEP and the Texas test, together with the infrequent administration of NAEP, prevents any clear resolution of this dispute.
This report differs from other analyses in that it focuses on the comparison of school-level results on high stakes tests and commercially-designed low stakes tests. By focusing on school-level results we are comparing test results from the same or similar students, reducing the danger that population differences may hinder the comparison. Examining school-level results also allows for a more precise correlation of the different kinds of test results than is possible by looking only at state-level results, which provide fewer observations for analysis. In addition, school-level analyses are especially appropriate because in most cases the accountability consequences of high stakes test results are applied at the school level. By comparing school-level scores on high stakes and low stakes tests, this study attempts to find where, if anywhere, we can believe high stakes test results. If we see that high stakes and low stakes tests produce similar results, we have reason to believe that results on the high stakes test were not affected by any of the adverse incentives tied to the test.
The first step in conducting this study was to locate states and school districts that administer both high stakes and low stakes tests. We examined the information about testing programs available on each state’s Department of Education website, and phoned states whose information was unclear. A test was considered high stakes if any of the following depended upon it: student promotion or graduation, accreditation, funding cuts, teacher bonuses, a widely publicized school grading or ranking system, or state assumption of at least some school responsibilities. We found two states, Florida and Virginia, that administered both a high stakes test and a low stakes test.2 Test scores in Florida were available on the Florida Department of Education’s website, and we were able to obtain scores from Virginia through a data request.
We next attempted to find individual school districts that also administered both high stakes and low stakes tests. We first investigated the 58 member districts of the Council of the Great City Schools, which includes many of the largest school districts in the nation. Next, through internet searches, we looked for other school districts that administer multiple tests. After locating several of these districts, we contacted them by phone and interviewed education staffers about the different types of tests the districts administered.
Because we were forced to rely on internet searches and non-systematic phone interviews to find school districts that gave both high and low-stakes tests, our search was certainly not exhaustive.3 But the two states and seven school districts included in this study, which did administer both high and low stakes tests, contain approximately 9% of all public school students in the United States and a significantly higher percentage of all students who take a high stakes test. We therefore have reason to believe that our results provide evidence on the general validity of high stakes testing nationwide.
In each of the states and school districts we studied, we compared scores on each test given in the same subject and in the same school year. In total, we examined test scores from 5,587 schools in nine school systems. When possible, we also compared the results of high and low stakes tests given at the same grade levels. We were not able to do this for all the school systems we studied, however, because several districts give their low stakes tests at different grade levels from those at which they give their high stakes tests. When a high or low stakes test was administered in multiple grade levels of the same school level (elementary, middle, or high school), we took an average of the tests for that school level. Though this method does not directly compare test scores for the same students on both tests, the use of school-level scores does reflect the same method used in most accountability programs.
Because we sometimes had to compute an average test score for a school, and because scores were reported in different ways (percentiles, scale scores, percent passing, etc.), we standardized scores from each separate test administration by converting them into what are technically known as “z-score” results. To standardize the test scores into z-scores, we subtracted the average score on that administration throughout the district/state from the score a school received on the test administration. We then divided that number by the standard deviation of the test administration. The standardized test score is therefore equal to the number of standard deviations each school’s result is from the sample average.
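The standardization step can be sketched in a few lines: take every school's score on one test administration, subtract the system-wide mean, and divide by the standard deviation. This is a minimal illustration rather than the authors' actual code; the function name is hypothetical, and the choice of the population standard deviation (rather than the sample version) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert one test administration's school scores into z-scores.

    Assumption: the population standard deviation is used as the divisor.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("all schools have the same score; z-scores are undefined")
    return [(s - mu) / sigma for s in scores]


# Hypothetical percent-passing figures for three schools on one administration.
z = standardize([40.0, 50.0, 60.0])
print([round(v, 2) for v in z])
```

Because each administration is standardized separately, a percentile-based test and a percent-passing test end up on the same scale, which is what makes the later correlations comparable across school systems.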
In school systems with accountability programs, there is debate over how to evaluate test results. School systems evaluate test results in one of two ways: either they look at the actual average test score in each school or they look at how much each school improved its test scores from one year to another. Each method has its advantages and disadvantages. Looking at score levels tells us whether or not students are performing academically at an acceptable level, but it does not isolate the influence of schools from other factors that contribute to student performance, such as family and community factors. Looking at year-to-year score gains is a “value added” approach, telling us how much educational value each school added to its students in each year.
For the school systems we studied, we computed the correlation between high and low stakes test results for both the score level and the year-to-year gain in scores. We found the year-to-year gain scores for each test by subtracting the standardized score on the test administration in one year from the standardized score on the test administration in the next year. For example, in Florida we subtracted each school’s standardized score on the 4th grade reading FCAT in 2000 from the same school’s standardized score on the 4th grade reading FCAT in 2001. This showed us whether a school was gaining or losing ground on the test.
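As a rough sketch of the gain calculation, each school's gain is its standardized score in the later year minus its standardized score in the earlier year. The function name and school identifiers below are hypothetical, and dropping schools that lack a score in either year is an assumption; the text does not describe how such cases were handled.

```python
def gain_scores(z_earlier, z_later):
    """Return each school's change in z-score between two administrations.

    Both arguments map a school identifier to that school's z-score.
    Assumption: schools missing from either year are left out.
    """
    common = z_earlier.keys() & z_later.keys()
    return {school: z_later[school] - z_earlier[school] for school in common}


# Hypothetical z-scores for the same test and grade in two consecutive years.
year_2000 = {"school_a": 0.5, "school_b": -1.0}
year_2001 = {"school_a": 1.0, "school_b": -1.5, "school_c": 0.25}
gains = gain_scores(year_2000, year_2001)
print(gains)
```

Since the inputs are z-scores, a positive gain means a school moved up relative to the other schools in its system, not necessarily that its raw score rose.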
We used a Pearson’s correlation to measure how similar the results from the high and low stakes tests were, both in terms of score levels and in terms of the year-to-year gain in scores. For example, for score levels we measured the correlation between the high stakes FCAT 3rd grade reading test in 2001 and the low stakes Stanford-9 3rd grade reading test in 2001. Similarly, for year-to-year score gains we measured the correlation between the 2000–2001 score gain on the FCAT and the 2000–2001 score gain on the Stanford-9.4
Where there is a high correlation between high and low stakes test results, we conclude that the high stakes of the high stakes test do not distort test results, and where there is a low correlation we have significantly less confidence in the validity of the high stakes test results.5
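For readers who want to see the statistic itself, the following is a minimal sketch of a Pearson correlation between two lists of school-level results (levels or gains), one from the high stakes test and one from the low stakes test. The function name and the sample values are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between paired school results on two tests."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical z-scores for four schools on a high and a low stakes test.
high_stakes = [-1.2, -0.3, 0.4, 1.1]
low_stakes = [-1.0, -0.5, 0.6, 0.9]
r = pearson(high_stakes, low_stakes)
print(round(r, 2))
```

A value near 1.00 means schools rank almost identically on the two tests, a value near zero means results on one test say little about results on the other, and a negative value means schools that do better on one test tend to do worse on the other.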
There are many factors that could explain a low correlation between high and low stakes test results. One possibility would be that the high stakes test is poorly designed, such that schools can successfully target their teaching on the skills required for the high stakes test without also conveying a more comprehensive set of skills that would be measured by other standardized tests. It is also possible that the implementation of high stakes tests in some school systems could be poorly executed. Administering high stakes tests in only a few grades may allow schools to reallocate their best teachers to those grades, creating false improvements that are not reflected in the low stakes test results from other grades. The security of high stakes tests could also be compromised, such that teachers and administrators could teach the specific items needed to answer the questions on the high stakes test without at the same time teaching a broader set of skills covered by the low stakes standardized test. It is even possible that in some places teachers and administrators have been able to manipulate the high stakes test answers to inflate the apparent performance of students on the high stakes test.
More benign explanations for weak correlations between high and low stakes test results are also available. When we analyze year-to-year gains in the test scores, we face the problem of having to measure student performance twice, thus introducing more measurement error. Weak correlations could also partially be explained by the fact that the score gains we examine do not track a cohort of the same students over time. Such data are not available, forcing us to compute the difference in scores between one year’s students against the previous year’s students in the same grade. While this could suppress the correlation of gain scores, it is important to note that our method is comparable to the method of evaluation used in virtually all state high stakes accountability systems that have any kind of value-added measurement. In addition, if a school as a whole is in fact improving, we would expect to observe similar improvement on high and low stakes tests when comparing the same grades over time.
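The measurement-error point can be illustrated with a small simulation. In the sketch below, both tests measure the same underlying school quality honestly, with independent random noise on every administration; nothing is manipulated. Every parameter (number of schools, noise size, size of true year-to-year change) is an arbitrary assumption chosen for illustration and is not calibrated to the data in this report.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def simulate(n_schools=2000, noise_sd=0.3, growth_sd=0.2, seed=0):
    """Return (level correlation, gain correlation) for two honest tests.

    All default values are hypothetical.
    """
    rng = random.Random(seed)
    hi_level, lo_level, hi_gain, lo_gain = [], [], [], []
    for _ in range(n_schools):
        quality = rng.gauss(0, 1)          # true school level in year 1
        growth = rng.gauss(0, growth_sd)   # true change from year 1 to year 2
        # Four noisy measurements: two tests in each of two years.
        hi_year1 = quality + rng.gauss(0, noise_sd)
        lo_year1 = quality + rng.gauss(0, noise_sd)
        hi_year2 = quality + growth + rng.gauss(0, noise_sd)
        lo_year2 = quality + growth + rng.gauss(0, noise_sd)
        hi_level.append(hi_year2)
        lo_level.append(lo_year2)
        hi_gain.append(hi_year2 - hi_year1)
        lo_gain.append(lo_year2 - lo_year1)
    return pearson(hi_level, lo_level), pearson(hi_gain, lo_gain)


level_r, gain_r = simulate()
print(f"levels: {level_r:.2f}  gains: {gain_r:.2f}")
```

In this setup the level correlation comes out high while the gain correlation is much lower, because each gain combines the noise from two administrations while the true year-to-year change is small. A weak gain correlation is therefore consistent with, though not proof of, two well-behaved tests.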
Correlations between results on high and low stakes tests could also be reduced to some extent by differences in the material covered by different tests. High stakes tests are generally geared to a particular state or local curriculum, while low stakes tests are generally national. But this can be no more than a partial explanation of differences in test results. There is no reason to believe that the set of skills students should be expected to acquire in a particular school system would differ dramatically from the skills covered by nationally-respected standardized tests. Students in Virginia need to be able to perform arithmetic and understand what they read just like students in other places, especially if students in Virginia hope to attend colleges or find employment in other places.
If low correlations between results on high and low stakes tests are attributable to differences between the skills required for the two tests, we might reasonably worry that the high stakes test is not guiding educators to cover the appropriate academic material. It might be the case that the high stakes test is too narrowly drawn, such that it does not effectively require teachers to convey to their students a broad set of generally useful skills. The low stakes tests used in the school systems we studied are all nationally respected tests that are generally acknowledged to measure whether or not students have successfully achieved just this kind of broad skill learning, so if the high stakes test results in these systems do not correlate with their low stakes test results, this may be an indication that poorly-designed high stakes tests are failing to cover a broad set of skills. On the other hand, if their high stakes test results are strongly correlated with their results on low stakes tests that are nationally respected as measurements of broad skill learning, this would give us a high degree of confidence that the high stakes tests are indeed testing a broad set of generally useful skills and not just a narrow set of skills needed only to pass the test itself.
Interpretation of our results is made somewhat problematic because we cannot know with absolute certainty the extent to which factors other than school quality influence test score levels. Family background, population demographics, and other factors are known to have a significant effect on students’ level of achievement on tests, but we have no way of knowing how large this effect is. To an unknown extent, score level correlations reflect other factors in addition to the reliability of the high stakes test. However, the higher the correlation between score levels on high and low stakes tests, the less we have reason to believe that poor test design or implementation undermines the reliability of high stakes test results. Furthermore, where a high correlation between year-to-year score gains accompanies a high correlation between score levels, we can be very confident that the high stakes test is reliably measuring school quality because family and demographic factors have no significant effect on score gains. On the other hand, even where score level correlations are high, score gain correlations could be low if student background factors are causing the high score level correlations.
No doubt some will object that a high correlation between high and low stakes test scores does not support the credibility of high stakes tests, because they do not believe that low stakes standardized tests are any better than high stakes standardized tests as a measure of student achievement. Some may question whether students put forth the necessary effort on a test with no real consequences tied to their scores. This concern would be borne out if we found low correlations between score levels on the two tests. If a large number of students randomly fill in answers on the low stakes test, then that randomness will produce low correlations with the high stakes tests, on which the students surely gave their best effort. But where we find high correlations between score levels, we can be confident that students gave comparable effort on the two tests.
Others may object entirely to the use of standardized testing to assess student performance. To those readers, no evidence would be sufficient to support the credibility of high stakes testing, because they are fundamentally opposed to the notion that academic achievement can be systematically measured and analyzed by standardized tests. The difficulty with this premise is that it leads to educational nihilism. If academic achievement cannot be systematically measured, we cannot ever know whether or not students in general are making progress, nor can we ever know in general whether schools are helping, hurting, or having no effect on student progress. If we cannot know these things, then we cannot identify which educational techniques are likely to be effective or which policy interventions are likely to be desirable.
This study begins with a different premise: that achievement is measurable. Its purpose is to address the reasonable concern that measurements of achievement are distorted by the accountability incentives that are designed to spur improvement in achievement. By comparing scores on tests where there may be incentives to distort the results with scores on tests where there are almost no incentives to distort the results, we are able to isolate the extent to which the incentives of high stakes testing are in fact distorting information on student achievement.
For all the school systems examined in our study, we generally found high correlations between score levels on high and low stakes tests.6 We also found some high correlations for year-to-year gains in scores on high and low stakes tests, but the correlations of score gains were not as consistently high, and in some places were quite low.
This greater variation on score gain correlations might be partially explained by the increased measurement error involved in calculating score gains as opposed to score levels. It is also possible that high stakes tests provide less credible measures of student progress in some school systems than in others. In places where high stakes tests are poorly designed (such that teaching to the test is an effective strategy for boosting performance on the high stakes test without also conveying useful skills that are captured by the low stakes test) or where the security of tests has been compromised (such that teachers can teach the exact items to be included in the high stakes test, or help students cheat during the test administration), the correlations between score gains on high and low stakes tests may be quite low. The high correlations between score level results on high and low stakes tests do not rule out these possibilities because, to an unknown extent, score levels reflect family and demographic factors in addition to school quality. However, despite this, the high correlations between score level results do justify a moderate level of confidence in the reliability of these systems’ high stakes tests.
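The point about measurement error in score gains can be illustrated with a small simulation. The sketch below is not the analysis performed in this study. It assumes hypothetical schools whose true score levels vary widely, whose true year-to-year gains vary little, and whose high and low stakes tests each observe the true score plus independent random error. All names and parameter values (`simulate`, `level_sd`, `gain_sd`, `noise_sd`) are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n_schools=2000, level_sd=20.0, gain_sd=3.0, noise_sd=3.0, seed=0):
    """Return (level correlation, gain correlation) for two noisy tests.

    Assumption: both tests measure the same true score, and each test score
    carries its own independent measurement error in each year.
    """
    rng = random.Random(seed)
    hi_levels, lo_levels, hi_gains, lo_gains = [], [], [], []
    for _ in range(n_schools):
        true_y1 = rng.gauss(0.0, level_sd)            # true level, year 1
        true_y2 = true_y1 + rng.gauss(0.0, gain_sd)   # true level, year 2
        hi1 = true_y1 + rng.gauss(0.0, noise_sd)      # high stakes test, year 1
        hi2 = true_y2 + rng.gauss(0.0, noise_sd)      # high stakes test, year 2
        lo1 = true_y1 + rng.gauss(0.0, noise_sd)      # low stakes test, year 1
        lo2 = true_y2 + rng.gauss(0.0, noise_sd)      # low stakes test, year 2
        hi_levels.append(hi2)
        lo_levels.append(lo2)
        hi_gains.append(hi2 - hi1)  # a gain carries the error of both years
        lo_gains.append(lo2 - lo1)
    return pearson(hi_levels, lo_levels), pearson(hi_gains, lo_gains)


level_corr, gain_corr = simulate()
print(f"score level correlation: {level_corr:.2f}")
print(f"score gain correlation:  {gain_corr:.2f}")
```

Under these assumptions the two tests measure exactly the same thing, yet the gain correlation comes out well below the level correlation. A gain is the difference of two noisy scores, so it carries the error of both years, while the true gains being measured vary far less than the true levels. Lower gain correlations alone therefore do not establish that a high stakes test has been distorted.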
Perhaps the most intriguing results we found came from the state of Florida. We might expect the especially large magnitude of the stakes associated with Florida’s high stakes test to make it highly vulnerable to adverse responses because of the incentives created by high stakes testing. It was in Florida, however, that we found the highest correlations between high and low stakes test results, for both score levels in each given year and for the year-to-year score gains.
Florida’s high stakes test, the FCAT, produced score levels that correlated with the score levels of the low stakes Stanford-9 standardized test across all grade levels and subjects at 0.96. If the two tests had produced identical results, the correlation would have been 1.00. The year-to-year score gains on the FCAT correlated with the year-to-year score gains on the Stanford-9 at 0.71. (See the Appendix Tables for the average correlations as well as the separate correlations between each test, in each subject, and for each test administration.) Both of these correlations are very strong, suggesting that the high and low stakes tests produced very similar information about student achievement and progress. Because the high stakes FCAT produces results very similar to those from the low stakes Stanford-9, we can be confident that the high stakes associated with the FCAT did not distort its results. If teachers were “teaching to the test” on the FCAT, they were teaching generally useful skills that were also reflected in the results of the Stanford-9, a nationally respected standardized test.
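For readers who want to see how the two kinds of correlation are computed, the following is a minimal sketch. It assumes school-level average scores on a high stakes and a low stakes test in two consecutive years. The data values and the helper names (`pearson`, `gains`) are hypothetical and are not drawn from the study's data. The score level correlation compares the two tests within a single year, while the score gain correlation compares each school's year-to-year change on the two tests.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


def gains(previous_year, current_year):
    """Year-to-year change for each school (current minus previous)."""
    return [curr - prev for prev, curr in zip(previous_year, current_year)]


# Hypothetical school averages, one entry per school (illustrative only).
high_stakes_2001 = [310, 295, 330, 280, 305]
high_stakes_2002 = [318, 297, 333, 290, 306]
low_stakes_2001 = [62, 55, 70, 48, 60]
low_stakes_2002 = [64, 58, 70, 51, 63]

# Score level correlation: the two tests compared in the same year.
level_corr = pearson(high_stakes_2002, low_stakes_2002)

# Score gain correlation: each school's change on one test compared
# with its change on the other test.
gain_corr = pearson(
    gains(high_stakes_2001, high_stakes_2002),
    gains(low_stakes_2001, low_stakes_2002),
)

print(f"score level correlation: {level_corr:.2f}")
print(f"score gain correlation:  {gain_corr:.2f}")
```

Because the correlation coefficient does not depend on the scale of either test, the two tests need not report scores in the same units. Two tests that rank and space schools identically would correlate at 1.00.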
In other school systems we found very strong correlations between score levels for high and low stakes test results in each given year, but relatively weak or even negative correlations between the year-to-year score gains on the two types of tests. For example, in Virginia the correlation between score levels on the state’s high stakes Standards of Learning test (SOL) and the low stakes Stanford-9 was 0.77, but the correlation between the year-to-year score gains on these two tests was only 0.17. Similarly, in Boston the correlation between the level of the high stakes Massachusetts Comprehensive Assessment System (MCAS) and the low stakes Stanford-9 was 0.75, but the correlation on the gain in scores between these two tests was a moderate 0.27. In Toledo, Ohio, the correlation between the level of the high and low stakes tests was 0.79, while the correlation between the score gains on the same tests was only 0.14.
In Chicago, the Iowa Test of Basic Skills (ITBS) is administered as a high stakes test in some grades and a low stakes test in other grades. The correlation between score levels on the high and low stakes administrations of this test is a very strong 0.88. But the year-to-year score gain in the results of the ITBS in high stakes grades is totally uncorrelated (-0.02) with the year-to-year score gain from the same test given in grades where the stakes are low. Similarly, in Columbia, Missouri, the high correlation (0.82) of score levels on the high and low stakes tests is accompanied by a weak negative correlation (-0.14) between the year-to-year score gain on the two types of tests.
In some school systems even the score levels on high and low stakes tests correlate only moderately well. In Blue Valley, Kansas, the high and low stakes tests produce score levels that correlate at 0.53 and score gains that correlate at only 0.12. In Fairfield, Ohio, the score levels on the high and low stakes tests correlate at 0.49, while, oddly, the year-to-year score gains have a moderate negative correlation of -0.56. In Fountain Fort Carson, Colorado, the score level correlation is only 0.35, while the score gain correlation is an even weaker 0.15.
The finding that high and low stakes tests produce very similar score level results tells us that the stakes of the tests do not distort information about the general level at which students are performing. If high stakes testing is only being used to assure that students can perform at certain academic levels, then the results of those high stakes tests appear to be reliable policy tools. The generally strong correlations between score levels on high and low stakes tests in all the school systems we examined suggest that teaching to the test, cheating, or other manipulations are not causing high stakes tests to produce results that look very different from tests where there are no incentives for distortion.
But policymakers have increasingly recognized that score level test results are strongly influenced by a variety of factors outside of a school system’s control. These include student family background, family income, and community factors. If policymakers want to isolate the difference that schools and educators make in student progress, they need to look at year-to-year score gains, or “value-added” measures, as part of a high stakes accountability system.
Florida has incorporated value-added measures into its high stakes testing and accountability system, and the evidence shows that Florida has designed and implemented a high stakes testing system where the year-to-year score gains on the high stakes test correspond very closely with year-to-year score gains on standardized tests where there are no incentives to manipulate the results. This strong correlation suggests that the value-added results produced by Florida’s high stakes testing system provide credible information about the influence schools have on student progress.
In all of the other school systems we examined, however, the correlations between score gains on high and low stakes tests are much weaker. We cannot be completely confident that those high stakes tests provide accurate information about school influence over student progress. However, the consistently high correlations we found between score levels on high and low stakes tests do justify a moderate level of confidence in the reliability of those high stakes tests.
Our examination of school systems containing 9% of all public school students shows that accountability systems that use high stakes tests can, in fact, be designed to produce credible results that are not distorted by teaching to the test, cheating, or other manipulations of the testing system. We know this because we have observed at least one statewide system, Florida’s, where high stakes have not distorted information either about the level of student performance or about the value that schools add to students’ year-to-year progress. In other school systems we have found that high stakes tests produce very credible information on the level of student performance and somewhat credible information on the academic progress of students over time. Further research is needed to identify ways in which other school systems might modify their practices to produce results more like those in Florida.