Executive Summary
Social promotion has long been the normal practice
in American schools. Critics of this practice,
whereby students are promoted to the next grade
regardless of academic preparation, have suggested
that students would benefit academically if
they were made to repeat a grade. Supporters
of social promotion claim that retaining students
(i.e., holding them back) disrupts them socially,
producing greater academic harm than promotion
would. A number of states and school districts,
including Florida, Texas, Chicago, and New York
City, have attempted to curtail social promotion by requiring students to demonstrate academic
preparation on a standardized test before they
can be promoted to the next grade.
This study analyzes the effects of Florida's
test-based promotion policy on student achievement
two years after initial retention. It builds
upon our previous evaluation of the policy in
two ways. First, we examine whether the initial
benefits of retention observed in the previous
study continue, expand, or contract in the second
year after students are retained. Second, we
determine whether discrepancies between our
evaluation and the evaluation of a test-based
promotion policy in Chicago are caused by differences
in how researchers examined the issue, or by
differences in the nature of the programs.
Our analysis shows that, after two years of
the policy, retained Florida students made significant
reading gains relative to the control group
of socially promoted students. These academic
benefits grew substantially from the first to
the second year after retention. That is, students
lacking in basic skills who are socially promoted
appear to fall farther behind over time, whereas
retained students appear to be able to catch
up on the skills they are lacking.
Further, we find these positive results in
Florida both when we use the same research design
that we used in our previous study and when
we use a design similar to that employed by
the evaluation of the program in Chicago. The
differences between the Chicago and Florida
evaluations appear to be caused by differences
in the details of the programs, and not by differences
in how the programs were evaluated.
About the Authors
Jay P. Greene, Ph.D., is Endowed Chair
and Head of the Department of Education Reform
at the University of Arkansas and a Senior Fellow
at the Manhattan Institute. He has conducted evaluations
of school choice and accountability programs in
Florida, Charlotte, Milwaukee, Cleveland, and
San Antonio. He has also recently published research
on high school graduation rates, social promotion,
and special education. His articles have appeared
in policy journals, such as The Public Interest,
City Journal, and Education Next,
in academic journals, such as the Teachers
College Record, the Georgetown Public Policy
Review, and the British Journal of Political
Science, as well as in major newspapers, such
as the Wall Street Journal, the Washington
Post, and USA Today. Dr. Greene is
the author of Education
Myths (Rowman & Littlefield, 2005).
His education research has been cited in U.S.
Supreme Court opinions and has appeared in scholarly
and popular publications. Dr. Greene received
his doctorate in political science from Harvard
University in 1995.
Marcus A. Winters is a Senior Research
Associate at the Manhattan Institute and a Doctoral
Academy Fellow at the University of Arkansas.
He has performed several studies on a variety
of education policy issues, including high-stakes
testing, charter schools, and the effects of
vouchers on the public school system. His op-ed
articles have appeared in numerous newspapers,
including the Washington Post, USA
Today, and the Chicago Sun-Times.
He received his B.A. in political science with
departmental honors from Ohio University in
2002 and an M.A. in economics from the University
of Arkansas in 2006.
Introduction
Social promotion is the practice of promoting
students to the next grade regardless of their
academic preparation. While some students have
always been made to repeat a grade, the prevailing
view among educators has been that it is in
the best academic and social interests of students
to advance to the next grade. When students
have been retained, it has generally been at
the discretion of teachers in consultation with
administrators and parents, and not based on
the results of standardized tests.
This practice of social promotion has recently
been replaced by test-based promotion
in a number of states and school districts around
the country, including Florida, Texas, Chicago,
and New York City. Under test-based promotion,
students are required to demonstrate a certain
level of academic preparation on a standardized
test before they can be promoted to the next
grade. There are usually various exemptions
and alternative routes to promotion, but the
default outcome under test-based promotion is
that students with low test results are retained
in the same grade.
There has been considerable debate among educators,
policymakers, and researchers about the consequences
of this shift away from social promotion and
toward test-based promotion. This study adds
evidence to that debate by analyzing the effects
of being retained under Florida's test-based
promotion policy on student achievement two
years after initial retention. This study builds
upon our previous evaluation of the policy in
two ways. First, we are able to examine whether
the initial benefits of retention under a test-based
policy observed in the previous study continue,
expand, or contract in the second year after
students are retained. Second, we are able to
determine whether the different findings of
our evaluation and a high-quality evaluation
of a test-based promotion policy in Chicago
are caused by differences in how the researchers
examined the issue, or by differences in the
nature of the programs.
The results of this new analysis show that
retained students in Florida made significant
reading gains relative to the control group
of socially promoted students two years after
being subjected to the policy. These academic
benefits of being retained grew substantially
from the first to the second year after retention.
That is, students lacking in basic skills who
are socially promoted appear to fall farther
behind over time, whereas retained students
appear to be able to catch up on the skills
they are lacking. In addition, we find these
positive results for the test-based promotion
policy in Florida whether we use the same research
design that we used in our previous study or
a design similar to that employed by the evaluation
of the program in Chicago. The differences in
outcomes from the Chicago and Florida evaluations
appear to be caused by differences in the details
of the programs and not by differences in how
the programs were evaluated.
PREVIOUS RESEARCH ON DISCRETIONARY RETENTION
Under the practice of social promotion, some
students have always been retained, but retention
was rare and was based on the discretion of
educators, not the results of standardized tests.
Several previous studies have evaluated the
academic impact of this discretionary retention
under social promotion regimes. Meta-analyses
indicate that the cumulative finding of this
previous research is that retaining a student
leads to substantial academic harm (Holmes and
Matthews 1984; Holmes 1989; Jimerson 2001).
These findings on the effects of discretionary
retention are plagued by two serious limitations.
First, it is very hard for those studies to
find an appropriate control group against which
retained students could be compared. Even if
control-group students have similar test scores
and other observable characteristics, students
retained at the discretion of educators may
differ significantly in unobservable ways. When
educators use their discretion to retain students,
they are aware of detailed contextual information
that may lead them to recommend retaining one
student while promoting another student with
similar test scores and other recorded characteristics.
The fact that educators chose to retain one
student and not another means that the two are
not likely to be similar in their future prospects.
After all, if the two really had been identical,
educators would probably have made the same
decision about their retention. The retained students' unrecorded disadvantages, not their retention, may account for their lower future achievement. Unfortunately, most of
the previous studies used in the meta-analyses
that draw negative conclusions about retention
failed to address this difficulty with proper
techniques or research design to produce valid
apples-to-apples comparisons. While these meta-analyses
are often cited as conclusive, there is legitimate
reason to doubt the findings of previous studies
on discretionary grade retention.[1]
Second, it is not at all clear that the findings
from studies of discretionary retention under
social promotion regimes would apply to retention
under test-based promotion policies. Studies
of discretionary retention are essentially evaluations
of whether educators use their discretion wisely
in identifying students who ought to be retained.
If that discretion is used wisely, only students
who could benefit from retention are retained
and all others are promoted.
Under test-based promotion policies, the discretion
of educators is greatly restricted. Retention
decisions are based primarily or exclusively
on the results of standardized tests. This shift
to test-based promotion has been motivated by
the belief that educators have generally not
used their discretion wisely, either by failing
to retain more students or by failing to retain
the right students. It would therefore be inappropriate
to extrapolate from evaluations of discretionary
retention to the effects of retention under
test-based policies meant to restrict or alter
the use of that discretion.
PREVIOUS RESEARCH ON TEST-BASED RETENTION
In addition to our previous evaluation of Florida's
test-based promotion policy (Greene and Winters
2006), there is another high-quality study of
test-based retention.[2]
Roderick and Nagaoka (2005) evaluated the impact
of a test-based promotion policy in Chicago
on reading-test scores. Since 1996, students
in Chicago have been required to reach minimal
benchmarks on the reading and math portions
of the Iowa Test of Basic Skills (ITBS) in the
third, sixth, and eighth grades in order to
be promoted to the next grade. Roderick and
Nagaoka found that the retention policy led
to small improvements in reading scores relative
to socially promoted students during the first
year after the retention decision but that these
gains disappeared or turned negative in the
following year.
The existence of a test-based promotion policy
in Chicago allowed Roderick and Nagaoka (2005)
to develop more appropriate comparison groups
than had been available to previous researchers.
They utilized two comparison groups in the study.
First, they took advantage of a change in the
policy's design that made it likely that
students with scores just below the test-score
cutoff would get an exemption and thus be promoted
in a later year. Prior to this change, students
with scores just below the cutoff were likely
to be retained; after the change, students with
these same scores were likely to be promoted.
Roderick and Nagaoka (2005) compared the test-score
gains of these two groups on the assumption
that the only difference between them was the
year in which the student happened to have been
born. This was the "across-year" research design.
In a second comparison, Roderick and Nagaoka
(2005) took advantage of the existence of an
observable cutoff for the promotion policy and
utilized a regression discontinuity design.
In this design, they included only students
with test scores that were very close but on
either side of the cutoff score. That is, they
compared the test-score gains of students whose
original score was just above the
necessary threshold (most of whom were promoted)
with those of students in the same year whose
score was just below the threshold
(most of whom were retained). This was their "discontinuity" research design.
Using multiple analytical models on both the
across-year and discontinuity research designs,
Roderick and Nagaoka (2005) found similar results.
They found that the retention policy in Chicago
had a mild positive impact on the test-score
performance of retained students relative to
promoted students in the year that the students
were retained. However, in their analysis of
test scores two years after the baseline year,
each specification found that the effect of
retention was either statistically insignificant
or negative.
But this negative result from Roderick and
Nagaoka's study in Chicago may not be generalizable
to all test-based promotion policies in other
school systems. Perhaps Chicago's test-based
promotion policy has been counterproductive
while Florida's has been beneficial. While
both programs use test-based promotion, differences
in the characteristics of the two programs could
lead the policies to have different effects.
For example, the Chicago program did not have
a clear policy permitting exemptions to test-based
promotion requirements, while Florida did. Perhaps
the restricted but guided discretion that educators exercise over retention decisions under Florida's test-based policy has significant advantages
over the unguided policy in Chicago. In addition,
recent allegations of testing impropriety in
Chicago (see Jacob and Levitt 2003 and Greene,
Winters, and Forster 2002) compared with validation
of testing integrity in Florida (see Greene,
Winters, and Forster 2004; West and Peterson
2005) may produce different findings from the
Chicago and Florida programs. If Chicago schools
are manipulating test results in response to
student retention, rather than addressing the
needs of those students, test-based retention
may indeed be counterproductive.
The current paper analyzes student performance
one and two years after retention in Florida,
using both across-year and discontinuity research
designs. If an analysis in Florida were to produce
negative results, like those found by Roderick
and Nagaoka (2005) in Chicago, we could have
greater confidence that test-based retention
policies truly harm student achievement. However,
if the results differ even when similar analyses
are performed, we have reason to be more encouraged
about the prospects of test-based promotion
as practiced and implemented in Florida. Especially
given the clearer exemption policy and superior
test integrity in Florida, a positive result
from Florida in a second-year study using multiple
research designs would suggest that test-based
promotion is likely to add significantly to
student learning under the proper conditions.
FLORIDA'S TEST-BASED PROMOTION
POLICY
In 2002, the Florida legislature voted to require
third-grade students to meet at least the Level
2 benchmark (the second-lowest of five levels)
on the FCAT reading test in order to be promoted
to the fourth grade. According to the state's testing website, students who score at Level 2 are considered to have "limited success" with the challenging content on the test.[3]
The third-grade class of 2002–03 was the
first that was subjected to the mandate.
The legislature allowed for several exemptions
to the retention policy: students with limited
English proficiency who had had less than two
years of instruction in English; disabled students
whose individual educational plans indicated
that testing would be inappropriate; students
who scored above the 51st percentile on another
standardized reading test; disabled students
who received intensive remediation in reading;
students who demonstrated proficiency through
a student portfolio; and students who had been
retained twice previously.
Table 1 shows the promotion characteristics
of third-grade students in the first year that
the policy was in place, whose test scores were
below Level 2 and for whom baseline test scores
were reported in our dataset. The table shows
that only 57 percent of students who had test
scores below the threshold necessary to be promoted
were actually retained in the third grade. Some students (13 percent)
with scores below the threshold were coded as
having been promoted without any explanation
for their exemption. Even after we discussed this with the Florida data-warehouse personnel, it remains unclear why these students were promoted or whether there was an error in their coding.[4]
Schools must develop an academic improvement
plan for any student who does not meet the standards
for promotion. These plans must address the
student's specific academic needs and create "success-based intervention strategies" for his improvement.[5]
Students who fail to meet the necessary test-score
cutoff are also required to attend a summer
reading camp, where they receive literacy instruction.
The only substantial change to Florida's retention policy since its implementation is that, beginning in the 2004–05 school year,
retained students became eligible to receive
a midyear promotion if they demonstrate possession
of necessary skills. In the time period evaluated
in this paper, retained students remained in
the third grade for the entirety of the retained
year.
RESEARCH DESIGN
The most difficult problem for previous studies
evaluating the academic effect of grade retention
has been the identification of a proper group
with which to compare retained students. The
existence of a test-based retention policy helps
solve this problem by reducing (but not eliminating)
the impact of subjective teacher assessments
that made comparisons difficult in the past.
With the increased reliance on objective, test-based
criteria for promotion, we can identify treatment
and control groups that are similar on those
criteria and are less likely to differ in other,
unrecorded ways.
In this paper, we utilize two strategies for
identifying comparison groups with which to
evaluate the effect of grade retention. In the
first analysis, we compare students with similar
reading-test scores who differ by the year in
which they entered the third grade. In the second
analysis, we utilize the discontinuity in retention
created by the test-score threshold and compare
the achievement of students who were just above
and just below the retention benchmark.
Across-Year Comparison
In our first analysis, we focus only on Florida
students in the third grade in 2001–02 or 2002–03 whose test scores were below
the Level 2 benchmark on the FCAT reading test.
The score required to reach Level 2 was identical
in both years.[6] We compare the test-score gains of students with these low test scores who were in the first third-grade class (subject to the retention mandate) with the gains of students with the same low baseline scores who entered the third grade in the year prior to the policy (and who were thus not subjected to the program). That
is, our treatment group consists of the first
cohort of low-achieving students subject to
the test-based retention policy, and our control
group consists of similarly low-achieving students
who were not subject to the policy because they
happened to be born a year earlier. On average,
the two groups should be very similar, and any
observed differences can be controlled statistically.
We compare the test-score gains of students
in the first and second years after their initial
third-grade year. For each group of students, we measure the test-score gains that they made between the baseline year and each of the two following years.
Thus, in the evaluation of gains after one year, we compare the gains that the control group made between 2001–02 and 2002–03 with the gains made by the treatment group between 2002–03 and 2003–04. For the analysis of gains in the second year after retention, we compare the gains that the control group made between 2001–02 and 2003–04 with the gains that the treatment group made between 2002–03 and 2004–05.
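To make the mechanics of the across-year comparison concrete, the sketch below computes the raw gap in mean DSS gains between the two cohorts. It is a simplified illustration under stated assumptions, not our actual estimation code: the column names ("cohort", "dss_baseline", "dss_followup") and the cutoff parameter are hypothetical, and the real analysis also controls statistically for the demographic differences shown in Table 2.

```python
import pandas as pd

def across_year_gain_gap(df: pd.DataFrame, level2_cutoff: float) -> float:
    """Mean DSS gain of the treatment cohort minus that of the control cohort.

    cohort == "2002-03": first third-grade class subject to retention (treatment)
    cohort == "2001-02": the year-earlier class, not subject to it (control)
    """
    low = df[df["dss_baseline"] < level2_cutoff]        # low scorers only
    gain = low["dss_followup"] - low["dss_baseline"]    # DSS gains are comparable
                                                        # across grades and years
    mean_gain = gain.groupby(low["cohort"]).mean()
    return mean_gain["2002-03"] - mean_gain["2001-02"]
```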
The test scores of students in our two comparison
groups differ not only in the year of the evaluation but also, in most cases, in the grades evaluated. Since most students in the treatment
group were retained after their baseline year,
in the second year after baseline (2004–05) most of them were in the fourth grade. However, since they were not subjected to the retention policy, most of the students in the control group were initially promoted, and thus in the second year after baseline (2003–04), most
of them were in the fifth grade.
The existence of Developmental Scale Scores
(DSS) allows us to compare student gains on
the FCAT reading test regardless of the year
and grade in which the test was administered.
These scores were developed by the Florida Department
of Education as a uniform measure of proficiency
across grades and years. For example, a third-grade student who earns a DSS of 1000 on the FCAT reading test in 2002–03 has the same proficiency as a fourth-grade student who earns a DSS of 1000 on the FCAT reading test in 2004–05.
Similar scale scores have also been developed
for other commercial standardized tests such
as the Stanford testing series. Previous research
has shown that the FCAT produces results that
are very similar to those of the Stanford-9
test (Greene, Winters, and Forster 2004; West
and Peterson 2005).[7]
Table 2 reports descriptive statistics on the
treatment and control groups and compares them
using a one-way ANOVA analysis. The table shows
that the two groups of students are, in fact,
statistically different on all observed dimensions.
Students in the control group, who entered third grade with low test scores the year before the policy was in place, are slightly more likely to be white or Asian (and consequently less likely to be Hispanic or African-American) and have test scores that are below those of the treatment group. However, though each of these
differences is statistically significant, most
are quite insubstantial. Only whether the individual
is white or whether he is Hispanic differs by
more than a single percentage point between
the groups. These modest differences that do
exist can be controlled statistically.
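The balance check itself is straightforward. A minimal sketch, assuming "treat" and "control" are arrays holding one observed characteristic per student (say, an indicator for being Hispanic, or the baseline DSS score); the variable names are hypothetical:

```python
from scipy.stats import f_oneway

def balance_check(treat, control):
    """One-way ANOVA comparing one characteristic across the two groups.

    With two groups this is equivalent to a two-sample t-test; a small
    p-value means the groups differ statistically on that characteristic.
    """
    f_stat, p_value = f_oneway(treat, control)
    return f_stat, p_value
```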

The across-year comparison approach is limited
because our treatment and control groups entered
the third grade in different years. It is possible
that students in our treatment and control groups
were not uniformly affected by reforms other
than the retention policy that might have occurred
in Florida. In fact, Florida has experimented
with many educational reforms, including vouchers,
charter schools, and other forms of test-based
accountability. Our results could be biased
if our treatment and control groups were affected
by these other policies in different ways. Further,
it is possible that schools responded to the
implementation of the retention policy by improving
the education provided in the third grade so
that fewer students would be retained. The statistically
higher baseline reading scores for our treatment
group reported in Table 2 indicate that this
bias could exist. The difference in baseline
test scores highlights the importance of controlling
for these scores in all the analyses.
Regression Discontinuity Comparison
As a check on the robustness of the results of our across-year approach, and to compare our results more directly with those of Roderick and Nagaoka (2005), we further analyze the effect of Florida's retention policy using a regression
discontinuity design. The use of regression
discontinuity has been growing in popularity
as a design for evaluating public policy. This
design is useful in cases such as this one, in which treatment is primarily determined by whether an individual reaches a threshold of some kind. Van der Klaauw
(2001) shows that if obtaining a treatment is
conditioned on meeting a certain known threshold,
an analysis of individuals in a narrow margin
around the threshold approximates random assignment.
That is, chance has a large influence over whether
students are just above or just below the promotion
threshold, so students on either side of the
threshold should be very similar at baseline.
Differences in their progress over time can
then be attributed to whether they happened
to be promoted or retained, since the two groups
were nearly identical at the start.
We take advantage of the existence of a known
cutoff score below which students were more
likely to be retained and above which they were
more likely to be promoted. The design we utilize
is very similar to that used by Roderick and
Nagaoka (2005) in their evaluation of Chicago's
objective retention policy as well as to other
studies outside of education (see, for example,
Van der Klaauw 2001; Angrist and Lavy 1999;
DiNardo and Lee 2004).
In this evaluation, we compare the test-score gains of students whose reading scores in 2002–03 were just below the threshold required for promotion with those of students who were in the third grade that same year and whose scores were just above this threshold. Unlike in the across-year analysis, all students in this design were in the third grade in 2002–03 and were subject
to the policy if they did not score above the
necessary threshold. Since all students were
in the same grade and age cohort, they were
all uniformly affected by policies other than
the retention policy. Thus, the regression discontinuity
approach does not suffer from the limitation
of the previous across-year analysis that other
policies could affect the results.
In their evaluation of Chicago's policy,
Roderick and Nagaoka (2005) use grade-equivalency
scores and draw the discontinuity line at scores
that were within three months of the threshold.[8]
However, DSS scores are not directly convertible
into grade equivalents, so we are left to produce
our own definition of those "just" above and below the threshold.
Lacking a formal definition of those who are "just" below or above a threshold, we use two alternative definitions in
the regression discontinuity design. We draw
the discontinuity first for those whose score
on the third-grade FCAT reading test in 2002–03
(the test used for the retention decision) was
within 50 DSS points of the threshold for retention
and then for whether it was within 25 points
of the threshold. In the baseline year, the
mean DSS score on the FCAT reading test for
all students was 1290.9 with a standard deviation
of 381.2. Thus, both definitions of those close
to the threshold severely limit the sample,
and the 25-point definition is quite strict.
The comparison of descriptive statistics of our treatment and control groups using the regression discontinuity cutoffs is recorded in Table 3. Within the 25-point definition of "close,"
the observed demographic characteristics of
the treatment and control groups are statistically
identical, except, of course, for their baseline
reading-test score and whether or not they were
retained. When we compare those within 50 points
of the threshold, there are only minor differences
in the percentage of students who are white
and African-American and who are ineligible
for the free or reduced-price lunch program.
Thus, the regression discontinuity helps to
confirm the robustness of the findings from
the across-year model. In particular, the regression
discontinuity approach has the advantage of
helping to address concerns about unobserved
demographic differences between the treatment
and control groups in the across-year analysis.

Our method follows the so-called "fuzzy" discontinuity design, as do many other such
papers. That is, the discontinuity of student
baseline test scores is not strict. Many students
with test scores below the cutoff score were
exempted from the policy. Further, some students
who scored above the cutoff were nonetheless
retained. Table 3 also reports the percentage
of students in the treatment and control groups
of the discontinuity approach who were retained
and exempted from the policy. Under the 25-point
definition, the table shows that 59 percent
of students with scores below the test-score
cutoff were actually promoted (did not receive
the treatment) while 4.5 percent of students
whose scores were just above the cutoff were
actually retained (did receive the treatment).
When there are many exemptions, we risk
running into the same methodological dangers
that beset earlier studies of discretion-based
retention. If exemptions are granted on a discretionary
basis, perhaps retained students will once again
be incomparable in key unobserved ways. To address
this problem, we use a two-stage model. In a
two-stage approach, we essentially identify
who would have been retained if exemptions did
not distort the pool of retained students. Then
we predict the effect of this undistorted retention
on academic achievement. This technique removes
bias that could be introduced by the subjective
use of exemptions.
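The sketch below shows the logic of such a two-stage estimate under stated assumptions. It is illustrative rather than our exact specification: the variable names are hypothetical, scoring below the cutoff serves as the instrument for actually being retained, and a production analysis would use a dedicated IV/2SLS routine to obtain valid standard errors.

```python
import numpy as np
import statsmodels.api as sm

def fuzzy_rd_effect(gain, retained, below_cutoff, baseline):
    """Two-stage estimate of the retention effect for the narrow band
    of students around the cutoff (all inputs are 1-D arrays)."""
    # Stage 1: predict retention from the cutoff indicator,
    # controlling for the baseline DSS score.
    X1 = sm.add_constant(np.column_stack([below_cutoff, baseline]))
    retained_hat = sm.OLS(retained, X1).fit().predict(X1)

    # Stage 2: regress gains on *predicted* retention, which strips out
    # bias from the discretionary (possibly non-random) exemptions.
    X2 = sm.add_constant(np.column_stack([retained_hat, baseline]))
    fit = sm.OLS(gain, X2).fit()
    # Caveat: second-stage standard errors from this hand-rolled version
    # are not valid; a dedicated IV routine would correct them.
    return fit.params[1]  # retention effect in DSS points
```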
One limitation of the discontinuity approach
is that by including only those students whose
baseline reading score falls within a very narrow
range, we eliminate many potentially useful
observations. While our number of observations
in the across-year comparison is 78,039 in the
second year, under the regression discontinuity
this falls to 13,841 under the 50-point threshold
and only 7,326 under the 25-point definition.
The regression discontinuity approach also
suffers from a potential problem with external
validity, not faced by our across-year approach.
By limiting the analysis to only those students
whose baseline score is within a quite narrow
region of the cutoff score, we are only able
to make inferences about the effect of the policy
on this small group of marginally affected students.
If the impact of the policy is not identical
for all students below the retention cutoff (for example, if students with very low baseline proficiency are more or less affected by the policy), then our estimates will not indicate
the true effect of retention.
Of course, the across-year design has its limitations
as well, such as the danger that different cohorts
differ in unobserved ways or are differentially
affected by changes in school practices over
time. The point of using multiple designs and
multiple analyses is to gauge one's confidence
in results by seeing if they are robust across
different specifications.
RESULTS
The results using multiple research strategies
are consistent with the theory that test-based
retention of low-proficiency students increases
their reading proficiency and that these gains
increase over time.
The results of our analyses on the test-score
gains made in reading are reported in Table
4. The first column of the table shows the test-score
gains in the first year after retention, and
the second column shows the test-score gains
two years after retention.[9]
These results can be interpreted as the gains
made by retained students above those made by
comparable students who were promoted. Table
4 also contains the results from the three different
analyses we performed: the across-year comparison;
the discontinuity comparison, using 50 DSS points
as the definition of "close" to the promotion threshold; and another discontinuity comparison, using 25 DSS points as the definition of "close."
In both the first and second year, the effects
of being retained are statistically significant
and positive in all three comparisons. Test-based
retention has significant benefits that grow
over time and are robust across multiple analytical
strategies. In the across-year comparison, the
effect of retention on reading scores after
one year is small but statistically significant
(4.1 DSS points). Two years after students are
retained, however, their reading achievement outstrips that of their promoted counterparts by 40.9 DSS points.
These results are confirmed by the regression
discontinuity comparisons. In the discontinuity
comparison of students whose FCAT reading score
was within 50 points of the cutoff score, retained
students made test-score improvements over promoted
students of 16.3 DSS points in the first year
after retention and 57.8 in the second year
after retention. We find similar results using
the very strict discontinuity comparison of
those within 25 points of the promotion threshold.
After one year, retained students made reading
gains on the FCAT that were 17.9 DSS points
higher than students with similar characteristics
who were promoted, and these relative gains
grew to 60.3 DSS points in the second year after
retention.
The true size of the retention effect is difficult
to interpret from the above results because the estimates differ substantially depending upon the comparison group utilized. This is, however,
somewhat to be expected given that the regression
discontinuity approach is limited to evaluating
only the impact of the policy on those with
test scores in a very narrow margin near the
cutoff, while the across-year approach measures
the impact of the policy for all students who
were subjected to it. Thus, the true size of
the effect is most likely found in the across-year
comparison. However, the fact that in all analyses
the effect of retention is positive, highly
statistically significant, and grows from the
first year to the second year after retention
provides confidence that the overall effect
of the policy is distinctly positive.
It is also difficult for most people to interpret
how large a benefit these improvements in DSS
scores really represent. To put them in better
perspective, we have converted the results into
standard deviations and percentiles in Table
5. A standard deviation is a measure that helps
education researchers compare results across
different studies that use different tests.
A standard deviation represents a portion of
a bell curve (or normal curve). If all students
were arrayed in a bell curve, 95 percent of
them would be within two standard deviations
of the average student and 68 percent would
be within one standard deviation (more students
are packed into the middle of a bell curve).
After one year, retained students benefit by
between .01 and .05 standard deviations, depending
upon the analysis. These represent small, but
statistically significant, effects. After two
years, the benefit of retention grows to between
.11 and .16 standard deviations, which education
researchers would generally regard as moderate
benefits. Gains of this size are somewhat smaller
than have been observed in evaluations of class-size
reduction or voucher programs, which are around
one-quarter of a standard deviation, but they
are larger than the effects of charter-school
programs or increased per-pupil spending, which
tend to be between zero and one-tenth of a standard
deviation.
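As a worked example of the conversion, dividing a DSS effect by the statewide baseline standard deviation of 381.2 points reported earlier reproduces the magnitudes above (assuming that is the scaling used in Table 5; the exact procedure may differ in detail):

```python
SD_DSS = 381.2  # statewide baseline SD on the FCAT reading test

def dss_to_sd(effect_dss: float) -> float:
    """Convert an effect in DSS points into standard-deviation units."""
    return effect_dss / SD_DSS

print(round(dss_to_sd(4.1), 2))   # 0.01 SD: across-year effect after one year
print(round(dss_to_sd(40.9), 2))  # 0.11 SD: across-year effect after two years
print(round(dss_to_sd(60.3), 2))  # 0.16 SD: 25-point discontinuity, year two
```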

While measuring effects in standard deviations
permits comparisons with other studies of other
programs, these units are still relatively unfamiliar
to most non-researchers. To help people understand
the magnitude of the effects, we have also converted
them into percentiles in Table 5. Percentiles
rank all students so that 1 percent would be
in each percentile. A student performing at
the 50th percentile outperforms 50 percent of
all students. Students in our across-year treatment
group (those who entered the third grade in
2002–03 with FCAT reading scores below
the necessary threshold) had an average score
at the 23rd percentile on a nationally normed
test also administered to all students in the
state. A student at the 23rd percentile outperforms
23 percent of all students but trails the other
77 percent. A gain of five percentile points
is easier closer to the middle of the pack,
where most students are grouped, and harder
on the tails, just as passing other students
in a foot race is easier if one is running in
the middle of the pack than if one is way ahead
or way behind, where there is more distance
between each runner. Given that retained students
start at the 23rd percentile in reading, they
would barely gain one percentile point one year
after being retained but would gain between
three and 5.1 percentile points two years after
being retained.
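The same arithmetic can be sketched for the percentile conversion, assuming test scores follow a normal (bell) curve and using the 23rd-percentile starting point noted above; the exact Table 5 procedure may differ in detail:

```python
from scipy.stats import norm

def percentile_gain(start_pct: float, effect_sd: float) -> float:
    """Percentile points gained by moving `effect_sd` SDs up a bell curve."""
    z_start = norm.ppf(start_pct / 100)              # 23rd pct -> z of -0.74
    return norm.cdf(z_start + effect_sd) * 100 - start_pct

print(round(percentile_gain(23, 0.107), 1))  # ~3.4: across-year, year two
print(round(percentile_gain(23, 0.158), 1))  # ~5.1: 25-point RD, year two
```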
COMPARING FLORIDA WITH CHICAGO
Using several analytical strategies, we find
that Florida's test-based retention policy
has led to significant improvements in reading
scores for those students who were retained.
These results contradict those of Roderick and
Nagaoka (2005), who also found initial benefits
after the first year of the program but found
that these benefits disappeared in the second
year after retention. Because we use a basic analytical model similar to Roderick and Nagaoka's,
the different results most likely stem from
differences in the policies and their implementation
in Chicago and Florida, not from differences
in the research designs.[10]
Although we are unable to test the effects of
the different characteristics of the two programs
empirically, some key policy differences deserve
discussion.
One important difference between the two policies
is that Florida's policy regulated and guided exemptions, while Chicago's had no formal rules for
promotion of students with scores below the
minimal threshold. The idea of allowing exemptions
in Florida is to accommodate the needs of students
whose test scores, for some reason, do not truly
demonstrate their academic proficiency or who
have some exceptional characteristic that could
explain low test scores (such as a disability
or limited proficiency in English). If these
exemptions effectively promote students for
whom retention would be harmful, they would
add to the effectiveness of the policy overall.
Thus, part of the negative findings in Chicago
could be attributed to the fact that the policy
in that city retained some students who would
have benefited from promotion. Without formal
rules for promoting students, it is likely that
the exemption strategy was not well tailored
to identifying individuals who would benefit
from promotion, and it could have been quite
arbitrary. In Florida, on the other hand, the
procedures for exempting students from retention
may have more effectively guided educators about
who would benefit most from being exempted from
test-based retention.
Another difference between the policies in
Chicago and Florida is that the Chicago policy
underwent several changes in its implementation,
while Florida's policy has remained consistent.
Changes in the policy might cause uncertainty
in the response of schools and thus inconsistent
results. If educators believe that a retained
student will be promoted because of a change
in the retention policy rather than because
of improved skills, their incentives to improve
student skills are undermined.
In addition, recent allegations of testing
impropriety in Chicago (see Jacob and Levitt
2003 and Greene, Winters, and Forster 2004)
compared with validation of testing integrity
in Florida (see Greene, Winters, and Forster
2004; West and Peterson 2005) may help explain
the different findings from the Chicago and
Florida programs. If Chicago schools are manipulating
test results in response to student retention,
rather than addressing the needs of those students,
test-based retention may indeed be counterproductive.
If that explains the different findings, the
lesson would be that test-based promotion with
a valid testing system is beneficial while the
same policy without testing integrity may be
harmful.
Of course, these possible explanations for
the differences in the findings in Florida and
Chicago are only hypotheses and require further
empirical examination. What is clear, however,
is that there are differences in the effect
of test-based retention across these two jurisdictions
and that these differences do not appear to
have been caused by variation in the way the
programs were evaluated.
CONCLUSION
While we can have confidence that test-based
retention in Florida has academic benefits,
we do not know a number of things. We do not
know whether the gains we have observed two
years after students are retained will continue
to hold, expand, or disappear over time. We
intend to continue tracking their progress to
find out.
We do not know whether test-based retention
policies in other school systems, such as Texas
and New York City, have benefits similar to
those in Florida. The results from Florida tell
us that test-based retention, when implemented
under the right conditions, improves student
learning, but the evidence from Chicago reminds
us that the same policy improperly implemented
can be counterproductive. These programs in
other school systems need to be carefully evaluated
to determine if they are producing benefits
or if their features need to be modified to
achieve results similar to those found in Florida.
We do not know whether the benefits of test-based
retention in Florida justify the additional
costs involved. Retained students may spend an additional year in public schools. With national per-pupil spending topping
$10,000, adding another year of school for a
large number of students requires significant
additional spending over time. Of course, additional
spending that significantly improves outcomes
for students may well be worth it. Without tracking
the benefits over the long term, and without
a careful cost-benefit analysis, it is difficult
to draw conclusions on this.
What we can know is that test-based retention
in Florida is helping students improve their
reading. This evaluation supports the theory
that students with low test scores who are promoted
appear to lack the minimum skills to prosper
in the next grade. Retaining low-scoring students
gives those students a chance to catch up on
their skills so that they have the wherewithal
to progress academically.
Given the frustrating stagnation in student
achievement over the last three decades, despite
the significant increase in resources and efforts
to improve learning, any large-scale policy
that produces progress is promising. Test-based
retention should continue to be tried and carefully
evaluated to see whether this promise can become a reality of higher achievement for students nationwide.