Manhattan Institute for Policy Research.

Civic Report
No. 33 February 2003

Testing High Stakes Tests: Can We Believe the Results of Accountability Tests?



  1. For a critique specifically of the Amrein and Berliner study, see Greene and Forster 2003.
  2. A number of states and school districts administer a standardized test in addition to the state criterion-referenced test, but many of those standardized tests had high stakes attached to the results. For example, Houston and Dallas, Texas; Arizona; and California all administered multiple tests to their students, but all of those tests had high stakes. We could not include those states or school districts in our sample.
  3. Because school-level test scores are public information and usually covered under state freedom of information laws, we might have expected obtaining the scores to be relatively easy. Unfortunately, we encountered numerous delays and refusals from school officials. Some school districts were very helpful and provided us with the necessary data. Others, however, were less helpful, and in some cases were downright hostile. The Maynard, Massachusetts school district, for instance, refused to give us the data. We spoke directly to the Assistant Superintendent of the district, who said she was in charge of testing. She informed us that she would not release the test score information because she was “philosophically opposed” to our study. We are unaware of how her philosophical opposition trumps public information laws, but since we had neither the time nor the resources to pursue the matter in the courts, she was successful in denying us her test score information. The Maynard case was by far the most blatant obstruction we faced while attempting to obtain the necessary test scores, but some other districts were reluctant to provide the information until we informed them that they were legally required to do so. We found this rather disturbing, considering that public schools claim transparency as one of their greatest virtues. In performing this study, at least, we certainly did not find public schools to be transparent.
  4. Our method can be illustrated by using Virginia’s administration of the high stakes SOL and the low stakes Stanford-9 elementary math tests in 2000 as an example. In that year, Virginia gave the SOL to students in the 3rd and 5th grades and gave the Stanford-9 to 4th graders. We averaged the 3rd and 5th grade scores on the SOL test to get a single school score on that test.

    We next standardized the scores on each of the tests. The SOL scores were reported as mean scaled scores and the Stanford-9 scores were reported as mean percentiles. We calculated both the average school score and the standard deviation for each test administration. On the SOL the average school mean scaled score was 431.93 and the standard deviation was 39.31. On the Stanford-9 the average school percentile was 57.93 and the standard deviation was 15.24. For each school we subtracted the average school score on the test from that individual school’s score on the test and divided the resulting number by the standard deviation. So for Chincoteague Elementary School, which posted a mean percentile score of 60 on the Stanford-9, the calculation was:

    (60 - 57.93) / 15.24 = .14

    After standardizing scores for every school in the state on each of the two test administrations in question (Stanford-9 4th grade math, 2000, and SOL elementary average math, 2000), we correlated the standard scores on the two tests. In this instance we found a correlation of .80. This high correlation leads us to conclude that in this case the stakes of the tests had no effect on the results of the tests.

    We then found and correlated the gain scores for each test. Building on our example, we subtracted the standardized scores on the 1999 administration of the tests from the standardized scores on the 2000 administration of the tests to find the gain or loss the school made on each test in that year. In our example school, this meant a .01 standard score gain on the Stanford-9 and a .10 standard score gain on the SOL. We calculated the gain scores for each school in the state and correlated the results. In this example we found a correlation of .34, a moderate correlation between the two tests.

    Next we combined the standardized scores on each test across grades, keeping them separate by year and subject, and correlated the results. In our example this meant combining all 2000 administrations of the Stanford-9 math test (elementary, middle, and high school scores), doing the same for the SOL math 2000 test, and correlating the results. In this example we found a high correlation of .77. We then repeated this combining and correlating for the difference scores. In our example we found that the difference between the 2000 and 1999 standardized scores on the SOL in all grades correlated with the difference between the 2000 and 1999 standardized scores on the Stanford-9 in all grades at a level of .29, a moderate correlation.
  5. There is one distortion that the incentives of high stakes tests might cause that this method cannot detect: if school systems are excluding low-performing students from the testing pool altogether, such as by labeling them as disabled or non-English speaking, a high correlation between scores on high and low stakes tests would not reveal it. However, the research that has been done so far on exclusion from high stakes testing gives us no good reason to believe that this is occurring to a significant extent. Most studies of this phenomenon are methodologically suspect, and those that are not have found no significant relationship between high stakes testing and testing exclusion (for a full discussion, see Greene and Forster 2002).
  6. It is conventional to describe correlations between .75 and 1 as strong correlations, correlations between .25 and .75 as moderate correlations, and correlations between 0 and .25 as weak correlations (Mason et al. 1999).
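
The procedure described in note 4, together with the labels in note 6, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual code: the function names are hypothetical, the note does not say whether a population or sample standard deviation was used (the sketch assumes population), and the handling of values exactly at .25 or .75 and of negative correlations is also an assumption.

```python
from statistics import mean, pstdev


def z_score(score, average, sd):
    """Standard score: distance from the average in standard deviations."""
    return (score - average) / sd


def standardize(scores):
    """Standardize one test administration's school scores.

    Assumption: population standard deviation (pstdev); the note does
    not specify which form was used.
    """
    average = mean(scores)
    sd = pstdev(scores)
    return [z_score(s, average, sd) for s in scores]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def gains(z_current, z_previous):
    """Year-over-year change in each school's standard score."""
    return [now - before for now, before in zip(z_current, z_previous)]


def strength(r):
    """Label a correlation using the bands in note 6.

    Assumptions: boundary values fall in the higher band, and the sign
    of the correlation is ignored.
    """
    r = abs(r)
    if r >= 0.75:
        return "strong"
    if r >= 0.25:
        return "moderate"
    return "weak"


# Worked example from note 4: Chincoteague Elementary on the Stanford-9.
print(round(z_score(60, 57.93, 15.24), 2))  # 0.14

# Correlations reported in note 4, labeled with the note 6 bands.
for r in (0.80, 0.34, 0.77, 0.29):
    print(r, strength(r))
```

To reproduce the level and gain correlations for a real data set, one would call `standardize` on each test administration (after averaging grades within a school where needed), pass the two standardized lists to `pearson`, and then repeat with the output of `gains` for consecutive years. The school lists must be aligned so that the same position refers to the same school in every list.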


Center for Civic Innovation.




This report examines whether "high stakes" tests can effectively measure students' academic proficiency. It finds that, contrary to critics' assertion that schools would merely "teach to the test," improving results without increasing real learning, high stakes tests generally provide reliable information on students' actual academic performance.


Executive Summary

About the Authors



A Variety of Testing Policies

Previous Research





Table 1: Average Correlations

Table 2: Florida

Table 3: Virginia

Table 4: Chicago, IL

Table 5: Boston, MA

Table 6: Toledo, OH

Table 7: Blue Valley, KS

Table 8: Columbia, MO

Table 9: Fairfield, OH

Table 10: Fountain Fort Carson, CO




Copyright © 2015 Manhattan Institute for Policy Research, Inc. All rights reserved.
52 Vanderbilt Avenue, New York, N.Y. 10017
phone (212) 599-7000 / fax (212) 599-3494