Sixteen states and several urban school districts require students to score above a minimum threshold on standardized tests (always reading, and sometimes math) in order to be default-promoted past one or more “gateway grades” to the next grade. The large majority of these policies apply to students being promoted from the third to the fourth grade, on the theory that, by the third grade, students “stop learning to read and start reading to learn.”
Most studies of test-based promotion policies focus on measuring the effect of retention (being left back) on later student outcomes, and the evidence is fairly mixed. However, test-based promotion policies do not only affect the students who are retained. Presumably, they also affect students and schools as they try to improve reading performance in order to avoid being retained or having to retain students. The threat of retention could plausibly have either positive or negative effects on students within the gateway grade. On the one hand, the pressure to score above a particular threshold on a standardized test might backfire by overwhelming both students and schools. On the other hand, test-based promotion policies might incentivize students and schools to make academic improvements within the targeted grade in order to avoid retention.
The effect of test-based promotion policies on student performance prior to the retention decision has not been studied enough. Filling this hole in the literature is important in order to understand the full impact that these policies have on students within a given jurisdiction. Even if relatively few students are actually retained under these policies, many students are in danger of scoring below the benchmark when they enter the gateway grade in the fall, and thus might be motivated by the policy to do better during the year than they would have otherwise. Therefore, even a small effect on students within the gateway grade could have a larger overall effect on student learning in a school system.
We apply a difference-in-difference design to statewide longitudinal school-by-grade data from two states (Florida and Arizona) and to longitudinal student-level data from a large public school district (Hillsborough County, Florida) to investigate the effect of introducing a third-grade test-based promotion requirement on third-grade test scores. We measure whether third-grade test scores increased in the policy’s first year relative to scores in other grades within the school that were not directly targeted by the policy. We find evidence that enacting the policy led to a statistically significant and meaningful increase in average third-grade test scores in both states. The magnitude of the effect is very similar across the two states, even though the policies were enacted more than a decade apart and differ in their details.
We then use student-level data from public schools in Hillsborough County, where the school district independently tested second-grade students (even though statewide testing begins in the third grade), to evaluate whether the effects of the policy differed based on the student’s reading proficiency at the end of the second grade. We find that the effect of adopting the policy on third-grade test scores was similar for students regardless of their reading ability at the end of the second grade.
Our analysis adds to a handful of previous studies evaluating the impact that test-based promotion policies have on student performance prior to the retention decision. An evaluation of Chicago’s test-based promotion policy found that it led to substantially improved student performance in both math and reading for students in grades three, six, and eight. In the third grade, there was a positive treatment effect in math across the distribution of previous performance, though it was largest for students most at risk of retention. But while third-grade students, on average, made substantial improvements on the reading exam, higher-performing students with little to no risk of falling below the policy threshold experienced substantial test-score declines. Another study found that the positive effect from Chicago’s policy was not apparent on tests other than those used to determine promotion. An evaluation of the fifth-grade test-based promotion policy in New York City compared the performance of entering fifth-graders in the first year of the policy to that of a matched comparison group comprising entering fifth-graders from the previous year. It found a substantial positive effect within the gateway grade on the English Language Arts exam for students with considerable risk of falling below the policy threshold without intervention, but no effect in math.
Our study also contributes to the more general literature on the effects of accountability policies on student performance. Though retention under test-based promotion policies is intended not as a punishment but rather as an educational intervention, these policies are rightly classed within a family of accountability policies that attach an undesirable consequence to the failure to meet a performance benchmark. Previous studies have found that such policies tend to increase average student achievement but that the effect often varies according to the student’s previous performance relative to the benchmark. Test-based promotion policies are perhaps especially interesting to consider within an accountability framework. They could engage the participation of students in a way that other accountability policies that only directly affect schools (for example, grading schools from A to F) do not.
Florida’s test-based promotion policy requires students to demonstrate basic reading proficiency before they are promoted to the fourth grade. Students scoring at Level 1 (the lowest of five levels) on the reading portion of the Florida Comprehensive Assessment Test (FCAT) are flagged for retention. Third-grade students in the 2002–03 school year were the first subjected to the policy.
Students scoring below the threshold could receive one of several exemptions and be promoted. About 46% of students scoring below the threshold in the first year of the policy were nonetheless promoted. Still, the policy considerably increased the use of grade retention in the state: 2.8% of third-grade students were retained in the year prior to the policy’s implementation, compared with 13.5% in the first year of the policy.
Florida used its high-stakes criterion-referenced test, the FCAT, for accountability purposes in a way that is problematic for estimating the treatment effect from the test-based promotion policy. Fortunately, at this time the state also administered a commercial norm-referenced exam, the Stanford-9, in math and reading to all students in grades three through 10. Results on the Stanford-9 exam were used for informational purposes only and played no role in student or school accountability. For this reason, they were not likely to have been influenced by factors such as teaching-to-the-test or other manipulations. We thus rely on data for student scores on the Stanford-9 to estimate the treatment effect.
The Arizona legislature adopted the Move On When Reading (MOWR) policy in 2010. The policy requires third-grade students to demonstrate a minimal level of reading proficiency by scoring above the threshold for “Falls Far Below” on the state’s annual standardized test in order to be default-promoted to the fourth grade. Students in the third grade in the 2013–14 school year were the first subjected to the policy.
MOWR was modeled in part on Florida’s test-based promotion policy. However, a potentially important difference is that Arizona’s policy set the performance standard that students were required to reach on the state’s reading test at a much lower level than did Florida’s policy, and the policy actually retained a much smaller percentage of the student population. Approximately 3% of Arizona’s third-grade students scored in the Falls Far Below category in the policy’s first year and thus were targeted for retention, and less than 1% of third-grade students were retained under the policy.
It is plausible that Arizona’s significantly lower standard would reduce the potential effect of the policy. However, a recent study, based on interviews with teachers and administrators and observations of several third-grade classrooms in five Arizona school districts, found that districts and schools made intentional efforts to avoid student retention under the policy in ways that could lead to improvements in student performance before the retention decision. Teachers described an increased awareness of the importance of building literacy foundations, and administrators reported that they allocated additional financial and curricular resources toward early-grade instruction. Districts reported that students as well as parents felt pressure from the policy to improve performance.
We evaluate the impact of the test-based promotion policies statewide in Florida and Arizona using longitudinal school-by-grade-level test scores and demographic characteristics. For each statewide analysis, we use data from the first year that the policy was adopted (2002–03 in Florida and 2013–14 in Arizona) and the two prior years. Data from Florida are publicly available and were downloaded from the Florida Department of Education website. In Arizona, we acquired aggregated data from a data request to the Arizona Department of Education.
To estimate effects in Hillsborough County, Florida, we use longitudinal student-level data for the universe of students in grades two through five from school years 1999–2000 through 2002–03. Student-level data are beneficial not only because of the increased precision but because Hillsborough is one of several districts in the state that administered the Stanford-9 exam district-wide in the second grade.
The availability of second-grade scores enables us to examine whether the effect of introducing the test-based promotion policy differed according to the student’s reading ability when entering the third grade. The estimation samples include all grades (or in Hillsborough, all students within grades) in schools that have valid test-score data in grades three, four, and five in an observed year. Results are similar if we also include grades six through eight, which are found in several K–8 schools.
Our goal is to determine whether, in the first year that the policy was implemented, there was a significant change in the trajectory of third-grade test scores compared with the trajectory in other grades in the school. The intuition underlying this approach is that the test-based promotion policy would provide an incentive within the third grade (and perhaps earlier grades) but not in later grades, where students faced no danger of retention under the policy. Specifically, we employ a difference-in-difference design, where the first difference is across grades and the second difference is over time.
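The logic of the two differences can be illustrated with a small worked example. The sketch below uses made-up average scores (they are placeholders, not values from our data) and a hypothetical helper named diff_in_diff; it only shows the arithmetic of subtracting the comparison-grade change from the third-grade change.

```python
# Hypothetical average scale scores; NOT values from the study.
mean_scores = {
    ("grade3", "pre"): 300.0,
    ("grade3", "post"): 312.0,
    ("comparison", "pre"): 310.0,   # fourth and fifth grades combined
    ("comparison", "post"): 315.0,
}


def diff_in_diff(means, treated="grade3", control="comparison"):
    """Change in the treated grade minus change in the comparison grades."""
    treated_change = means[(treated, "post")] - means[(treated, "pre")]
    control_change = means[(control, "post")] - means[(control, "pre")]
    return treated_change - control_change


estimate = diff_in_diff(mean_scores)
print(estimate)  # 7.0
```

In this illustration, third-grade scores rise by 12 points while comparison-grade scores rise by 5, so 7 points of the third-grade gain are not shared with the other grades.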
The unit of observation for the statewide analysis is a grade within a school. The dependent variable is the average math or reading score for students within that grade and year. The primary regression analysis includes fixed effects for each school, grade, and year, as well as our variable of interest, which equals 1 if the observation is of the third grade during the first treated year, and equals 0 otherwise (i.e., a treatment indicator). The coefficient on the treatment variable represents the differential change in third-grade test scores relative to fourth- and fifth-grade scores in the first year that the policy was in effect. We weight the regression according to the number of students who took the test within the school in a given year, and we cluster the standard errors by school.
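One way to write the specification described above is shown below. The notation is introduced here purely for illustration; the symbols are assumptions and are not taken from the underlying analysis files.

```latex
Y_{sgt} = \beta \, T_{gt} + \alpha_s + \gamma_g + \delta_t + \varepsilon_{sgt},
\qquad
T_{gt} = \mathbf{1}\{\, g = 3 \ \text{and} \ t = t^{*} \,\}
```

Here Y_{sgt} is the average score in school s, grade g, and year t; \alpha_s, \gamma_g, and \delta_t are the school, grade, and year fixed effects; t^{*} is the first treated year; and \beta is the coefficient of interest. Observations are weighted by the number of test-takers, and standard errors are clustered by school.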
The data from Hillsborough County enable us to evaluate whether the treatment effect differs by the student’s reading-test score in the prior year. We first designate six performance thresholds within the scores for the second-grade exam. Then we interact the treatment indicator variable (indicating that the observation is of a third-grade student during the treated year) with variables that indicate how the student performed on the prior year’s reading exam. The coefficients on these interaction variables represent the effect of the treatment for students whose second-grade reading score was within a particular performance category.
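A minimal sketch of how such interaction variables could be constructed follows. The cut points are placeholders chosen for illustration (the actual thresholds are not listed here), and the function names are hypothetical.

```python
from bisect import bisect_right

# Placeholder percentile cut points; the actual thresholds are an assumption.
CUTS = [10, 20, 30, 50, 70, 90]


def category(percentile, cuts=CUTS):
    """Index of the performance category a second-grade percentile falls into."""
    return bisect_right(cuts, percentile)


def interaction_dummies(treated, percentile, cuts=CUTS):
    """One indicator per category; all zeros unless the student is treated."""
    dummies = [0] * (len(cuts) + 1)
    if treated:
        dummies[category(percentile, cuts)] = 1
    return dummies


# A treated third-grader who scored at the 15th percentile in second grade.
print(interaction_dummies(True, 15))   # [0, 1, 0, 0, 0, 0, 0]
# An untreated student contributes zeros to every interaction term.
print(interaction_dummies(False, 15))  # [0, 0, 0, 0, 0, 0, 0]
```

Each position in the list corresponds to one interaction term in the regression, so the coefficient on a given position is identified only by treated students in that prior-score category.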
Lack of data for more than two years of pretreatment outcomes in any of the jurisdictions is the most important limitation of our analysis because it severely limits our ability to test an essential underlying assumption used to interpret the estimates. Interpreting the results as the causal effect of adopting test-based promotion on average third-grade scores assumes that the trajectory of average fourth- and fifth-grade scores after the introduction of the policy would have been the trajectory for average third-grade scores if the policy were not enacted. That assumption is intuitively plausible—there is no particular reason to believe that the trajectory of third-grade scores across either state prior to the adoption of the policy systematically differed from the trajectory of fourth- and fifth-grade scores. Nonetheless, the inability to observe these trends over a longer period means that we cannot confirm whether those grades were on similar trajectories before the introduction of the policy. Though our estimates remain policy-relevant and are at least suggestive of trends in outcomes across grades, readers should be cautious about interpreting the estimate as the causal effect of the policy.
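The identifying assumption described above can be stated compactly. The notation below is an illustrative assumption: Y(0) denotes the score that would have been observed without the policy, the subscript c denotes the comparison grades (fourth and fifth), and t^{*} is the first treated year.

```latex
\mathbb{E}\left[ Y_{3,t^{*}}(0) - Y_{3,t^{*}-1}(0) \right]
=
\mathbb{E}\left[ Y_{c,t^{*}} - Y_{c,t^{*}-1} \right]
```

The left-hand side is never observed, because third-graders in year t^{*} were all subject to the policy. With only two pretreatment years, the data offer limited scope for checking whether the two sides moved together before adoption.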
Another limitation of the analysis is that we are not able to evaluate whether the benefits caused by the threat of retention persisted in later years or if the effect faded over time. Unfortunately, such an analysis is impossible because many of the students (about 13.5% of the first cohort in Florida) were retained under the policy the following year. When evaluating later test scores, it is not clear how one would disentangle the effect of retention from that of the response within the third grade in an attempt to avoid retention.
Further, we are not able to measure the effect of the threat of retention on the performance of later-entering third-grade cohorts. Winters (2012) presents descriptive evidence that the test scores of entering third-grade students in Florida grew substantially over time after adoption of the policy. Causal analysis of these data is complicated by the large increase in retention due to the policy fundamentally altering the student bodies in each grade over time. Finally, the state continued its focus on improving early-grade literacy during this time period, which may have had an effect on third-grade scores.
Effect of Implementing Test-Based Promotion on Average Third-Grade Test Scores
We first consider the results from estimating the effect of the introduction of a third-grade test-based promotion policy on average math and reading scores. Figure 1 reports the main statewide effect estimates using school-by-grade-level data from Arizona and Florida, as well as the student-level estimates using data from Hillsborough County. In each case, we find a statistically significant increase in third-grade scores in math and reading/ELA in the first year of the test-based policy, relative to the students in higher elementary grades who were not subjected to the policy.
The coefficients in the table represent the additional increase in average third-grade reading scores that occurred in the policy’s first year in scale-score points on the state test. After they are converted to comparable standard deviation units, the magnitude of the estimated treatment effect is similar across jurisdictions. The magnitude of the effect is “medium,” according to a recently proposed taxonomy. That the estimated treatment effect is similar in Hillsborough County and statewide in Florida is somewhat expected. However, it is reassuring that the effects in the Florida jurisdictions are similar to the estimated impact in Arizona, despite the policies having been implemented more than a decade apart.
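The conversion to standard deviation units is a simple rescaling. The sketch below assumes the coefficient is divided by the standard deviation of scores on the relevant test; the function name and the input values are hypothetical and are not estimates from this report.

```python
def to_sd_units(coef_points, score_sd):
    """Express a scale-score coefficient in standard deviation units."""
    if score_sd <= 0:
        raise ValueError("score_sd must be positive")
    return coef_points / score_sd


# Hypothetical inputs, NOT estimates from this report.
effect = to_sd_units(coef_points=5.0, score_sd=40.0)
print(effect)  # 0.125
```

Because each test has its own scale, only the rescaled values are comparable across Arizona, Florida, and Hillsborough County.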
Figure 2 illustrates the results from models that use the Hillsborough County data to measure the impact of the test-based promotion policy by the student’s second-grade reading score. The figure illustrates the coefficient and 95% confidence interval for each performance group in reading and math, respectively. Overall, the results suggest that the implementation of Florida’s test-based promotion policy, at least in Hillsborough, had similar impacts on third-grade students regardless of their reading score in the previous year. In most cases, the estimated effects by percentile category are similar, and the differences are not statistically significant.
Discussions about whether to introduce test-based promotion policies often focus only on the students who are actually retained. While the impact of retention on those students is a first-order concern, far-reaching policies such as test-based promotion likely produce a broad set of effects beyond the main treatment, which should not be ignored. Our goal with this report is to give policymakers a broader scope when considering the impacts of existing and future test-based promotion policies, including the effects on students before they even take the decisive test.
We find evidence that introducing third-grade test-based promotion policies in Florida and Arizona led to statistically significant and meaningful average test-score improvements within the third grade before the policy retained any students. We supplement the statewide analyses with an evaluation of longitudinal student-level data from Hillsborough County, in which we find that the effect of the policy on third-grade students was similar for those who scored well or poorly on the second-grade reading test.
In addition to test-based promotion policies, our findings are relevant for the more general set of policies that incentivize better student outcomes by linking a consequence to the failure to meet a particular performance standard. Our estimates are quite consistent with previous research in this area. Our findings effectively rule out the concern that test-based promotion policies have unintended negative impacts for students in the gateway grade. For instance, some have suggested that raising the stakes of performance via the threat of retention would backfire and actually reduce student outcomes. Our results show that such concern is unfounded. Indeed, rather than declines, in both Florida and Arizona, adoption of test-based promotion led to substantial improvements for students in gateway grades.
The evidence on the outcomes for students who are retained under a test-based promotion policy is currently mixed, although it is notable that several earlier studies have found benefits from retention under Florida’s policy.
Our results, however, suggest that earlier studies, which focus entirely on retained students, substantially understate the benefits of test-based promotion policies on student achievement. The test-score improvements that we find within the third grade for students in Arizona and Florida apply to a much larger group of students than those who were eventually retained by the policies. Indeed, our results show that the threat of retention improves student academic achievement, thus reducing the need for retention.
Our approach looks for a particularly large increase in third-grade test scores, relative to test scores in other grades, in the first year that the respective state implemented its test-based promotion policy. Figure 3 presents results from a placebo test by grade level. It is possible that by comparing the third grade to the combination of students in fourth and fifth grades, the main analysis masks unexpected gains in the non-affected grades. For this test, we add to our primary regression an interaction between the treatment year and the fourth grade (Placebo Grade 4). Thus, we separately estimate the differential gain during the initial year of the policy in both the third (treated) grade and the fourth grade against the comparison group of the fifth grade. Though there are significant differences in fourth- and fifth-grade scores during the treated year, the gain in the third grade is significantly and substantially larger than in either of the comparison grades. Models that control for the grade-specific trend find no difference in the treatment-year gain between the untreated fourth and fifth grades, with the one exception of the reading test in Hillsborough. These results are consistent with applying a causal interpretation to the primary estimates.
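The grade-level placebo model can be sketched as follows. The notation is an illustrative assumption rather than the exact form used in estimation: Y_{sgt} is the average score in school s, grade g, and year t, and t^{*} is the first treated year.

```latex
Y_{sgt} = \beta \, T_{gt} + \pi \, P_{gt} + \alpha_s + \gamma_g + \delta_t + \varepsilon_{sgt},
\qquad
T_{gt} = \mathbf{1}\{\, g = 3,\ t = t^{*} \,\},
\quad
P_{gt} = \mathbf{1}\{\, g = 4,\ t = t^{*} \,\}
```

With the fourth grade given its own term, the fifth grade alone serves as the comparison group. A value of \pi near zero and a \beta that is clearly larger than \pi would be consistent with a gain specific to the treated grade.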
Figure 4 reports the results from a placebo test evaluating differences in third-grade performance over time. In this model, we add to our primary regression an interaction between a variable indicating third grade and a variable indicating the year before policy introduction. The results from this test suggest that the increase in third-grade scores that occurred in the first treated year is significantly and substantially larger than any differential change in third-grade test scores in either of the observed years prior to the treatment.
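The placebo test over time can be sketched in the same way. The notation is again an illustrative assumption: Y_{sgt} is the average score in school s, grade g, and year t; t^{*} is the first treated year; and L_{gt} flags third-grade observations in the year before adoption.

```latex
Y_{sgt} = \beta \, T_{gt} + \rho \, L_{gt} + \alpha_s + \gamma_g + \delta_t + \varepsilon_{sgt},
\qquad
T_{gt} = \mathbf{1}\{\, g = 3,\ t = t^{*} \,\},
\quad
L_{gt} = \mathbf{1}\{\, g = 3,\ t = t^{*} - 1 \,\}
```

With three years of data, the earliest year is the reference period, so \rho captures any differential third-grade change in the year before the policy and \beta captures the change in the first treated year.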