About the Author
MARCUS A. WINTERS
is a senior fellow at the Manhattan Institute.
He has conducted studies of a variety of education
policy issues including high-stakes testing,
performance pay for teachers, and the effects
of vouchers on the public school system. His
research has been published in the journals
Education Finance and Policy, Economics of
Education Review, Teachers College Record,
and Education Next. His op-ed articles
have appeared in numerous newspapers, including
the Wall Street Journal, Washington
Post, and USA Today. He received
a B.A. in political science from Ohio University
in 2002 and a Ph.D. in economics from the University
of Arkansas in 2008.
Introduction
Several public school systems in the U.S.
have recently adopted policies intended to hold
schools accountable for student outcomes, as
typically measured by standardized math and
English exams. One type of accountability policy
centers on what are often referred to as progress
reports. Programs using them provide schools
with what are essentially public report cards,
which grade them from A to F and often bring
them material rewards or sanctions.
The New York City public school system, the
largest school district in the United States,
adopted a progress report program, which first
graded schools on the basis of their performance
at the end of the 2006–07 school year.
Under the program, schools accumulate points
based on their students' performance on
standardized exams and a variety of other factors.
Besides risking public disgrace, schools that
repeatedly receive F or D grades are subject
to review and ultimately face takeover by the
city. Schools that earn high grades are eligible
to receive rewards.
The goal of the city's policy is twofold:
to inform parents about school quality and
to encourage schools to improve in response
to incentives. Many argue, however, that the
policy harms public schools by depressing the
morale of teachers and others. For example,
a November 11, 2007, editorial in the New York
Times argued that the practice of giving,
say, an F, to an otherwise high-performing school
that lags in student improvement for a single
year stigmatizes the entire school and angers
parents.
Unfortunately, the often vigorous debate between
those who argue that the grading policy is essential
to improving New York City's public schools
and those who believe that it is detrimental
to them has thus far occurred in a data vacuum.
This paper seeks to inform this debate with
empirical evidence on the program's effectiveness.
In particular, we follow the strategy of previous
work on Florida's schools in order to evaluate
the impact of grading New York City schools
on their students' achievement one year
later. The design of New York's program
allows for the use of a regression discontinuity
approach, which, under certain reasonable assumptions,
allows for a causal interpretation of the impact
of earning a particular grade (A, B, C,
D, or F) on school productivity as measured
by student academic performance.
Our findings are somewhat mixed. Using data
on students in grades four through eight, we
find that students in schools that received
an F grade in 2007 made academic improvements
in English that were on a par with the improvements
of students in schools that received better
grades. However, we find that students in F-
and D-graded schools made meaningful improvements
in math relative to other schools, though this
result appears to have been caused primarily
by student progress in the fifth grade. In summary,
our results suggest that schools may have responded
to the F sanction by improving their performance
and that there is no reason to believe that
the sanction of a low grade harmed student achievement.
The paper continues in six parts. Section 2
provides a brief overview of previous research
evaluating a similar progress report program
in Florida. In Section 3, we discuss the design
of New York City's policy. In order to
give our results on the relative improvement
of D- and F-graded schools greater context,
we present some information about overall school
progress in New York City in Section 4. We then
devote Section 5 to discussing our methodology
and data. We report results from estimation
in Section 6, including a replication of results
obtained by Rockoff and Turner (2008), who employed
a similar design to study New York's policy
but used aggregate data. Section 7 states our
conclusions.
Previous Research
For a thorough review of research evaluating
accountability policies overall, see the recent
survey by Figlio and Ladd (2008). For our current
purposes, we focus on previous evidence evaluating
progress reports as a specific form of accountability.
In particular, New York City's progress
report policy is quite similar in its design
to Florida's A+ Program, which has graded
schools in that state since 1998. Though the
programs provide different incentives for schools
that earn certain grades (in particular,
until recently, students in Florida public schools
that received more than one failing grade in
a four-year period became eligible for private
school vouchers) both utilize a point-based
system that determines whether schools receive
certain letter grades that carry important consequences.
Florida's A+ Program has been the subject
of several studies. Greene (2001) and Greene
and Winters (2004) used aggregate school-level
data to directly compare the educational gains
made by differently graded schools. They found
that schools that received an F grade made substantial
academic improvements relative to other schools.
These results were confirmed by Chakrabarti
(2005), who went on to find evidence that the
results were not driven by regression to the
mean.
Some recent studies of the A+ Program have
used student-level data and have taken advantage
of its known grade thresholds to pursue a regression-discontinuity
approach. West and Peterson (2006) limited their
sample to students in schools that earned point
totals barely qualifying them for an F grade
or barely missing the benchmark and thus earning
them a D grade. Like other studies using aggregate
data, West and Peterson's study found evidence
that the incentives of schools earning an F
grade had a positive impact on student academic
proficiency.
Using a more flexible regression-discontinuity
design, Rouse and others (2007) also drew upon
individual-level data, but they included students
in all schools throughout Florida. Their model
incorporated a control for a cubic function
of the point total earned by a school, which,
under certain reasonable assumptions, allows
for causal interpretations of the impact on
a school of the receipt of a particular grade.
They found additional evidence that the incentives
of the F-grade sanction led to increased school
performance. These findings were replicated
by Winters, Greene, and Trivitt (2008), who
also utilized this procedure and found that
the school-grading policy improved student proficiency
in science, which is not part of the grading
process.
A recent study by Rockoff and Turner (2008)
follows the procedure of Rouse and others (2007)
in evaluating the impact of progress reports
in New York using aggregated data. The earlier
paper found that schools that received an F
or a D grade in 2006–07 had statistically
significantly and substantially higher scores in math and
reading in 2007–08. One value of the present
paper is its replication of these previous results,
which provides confidence in both papers'
estimates.
Though there are other differences, the primary
difference between the Rockoff and Turner (2008)
work and the present paper is that here we utilize
student-level data. Use of student-level data
allows for more precise estimation and lends
itself to a value-added approach to account
for unobserved student heterogeneity that is
not available in the school-aggregated data.
New York City's Progress Report Policy
New York City's progress report policy
rates schools on a variety of factors, according
to a cumulative point system based on the
weighted average of metrics intended to measure
school environment, student performance, and
student academic progress. It then assigns grades
from A to F to schools, according to certain
benchmarks. Progress reports were first issued
for the 2006–07 school
year. The city claims that the reports are designed
to "help principals and teachers accelerate academic
achievement" and that the policy enables
"students, parents, and the public to hold the
DOE and its schools accountable for student
outcomes and improvement" (New York City
Department of Education, 2007).
The first factor in determining the school's
grade is school environment. This metric uses
information from school and parent surveys about
school safety and parental engagement. The environment
index accounts for 15 percent of the total points
that can be awarded.
The remainder of the points earned by a school
are linked to performance on the state's
standardized math and English exams. The value
assigned to the percentage of students with
test scores that meet the proficient or advanced
benchmark on these tests is 30 percent of the
total score that can be awarded. This measure
rewards those schools in which students meet
a particular academic level, although it may
put schools in which students have lower beginning
proficiency at a disadvantage. To compensate,
55 percent of the school's potential points
are linked to the progress that students make
on the standardized math and English tests during
the year. This value-added measure takes into
account the percentage of students making at
least a year's worth of academic progress
as well as the average change in proficiency
scores of students who began the year with proficiency
in the bottom third of the school. Schools can
earn additional bonus points if students deemed
"high need" make "exemplary gains" on
the state exams.

The resulting scores on each of these factors are
then further adjusted to account for the school's
performance relative to the rest of the schools
in the district and a grouping of schools with
similar characteristics. The scores on each
of these elements are weighted, as indicated
above, in order to produce the school's
total points under the system.
Schools are then assigned grades from A to
F on the basis of the number of total points
earned. The range of overall points that yield
particular grades is reported in Table 1. The
table shows that there are slightly different
point requirements for elementary, middle, and
K–8 schools, for which we account in the
analysis.
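
To make these mechanics concrete, the sketch below shows one way the point-and-grade logic could be computed. The 15/30/55 weights come from the description above, and the F/D line near 31 points matches the elementary-school example discussed later in this paper; the remaining cutoffs, the function names, and the treatment of bonus points are illustrative assumptions rather than the Department of Education's actual formula.

```python
# Illustrative sketch only: the weights follow the text above, but the
# cutoffs (aside from the F/D line near 31 points for elementary schools)
# and the handling of bonus points are placeholder assumptions.
def total_points(environment, performance, progress, bonus=0.0):
    """Weighted average of the three category scores, plus any bonus points."""
    return 0.15 * environment + 0.30 * performance + 0.55 * progress + bonus

def letter_grade(points, cutoffs=(31.0, 40.0, 50.0, 65.0)):
    """Map a point total to a grade using (F/D, D/C, C/B, B/A) cutoffs."""
    for grade, cut in zip("FDCB", cutoffs):
        if points < cut:
            return grade
    return "A"

print(letter_grade(total_points(70, 48, 10)))  # 30.4 points -> F
print(letter_grade(total_points(70, 50, 12)))  # 32.1 points -> D
```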
Many commentators in New York have argued that
the grading system does not accurately measure
school quality. For example, it
has often been pointed out in the popular press
that many of the same schools earning poor grades
under the city's system receive high marks
under the different accountability system that
the No Child Left Behind Act calls for, and
vice versa.
This study is not particularly interested in
the extent to which the progress report policy
accurately measures school quality. Under no
circumstances should this paper be interpreted
as suggesting that the program is accurately
or inaccurately identifying successful or failing
schools. The identification of those factors
that underlie a successful school involves value
judgments that only communities, the school
district, and elected representatives can make.
Nor is this paper concerned with whether progress
reports have improved school effectiveness generally
in New York. It may be that every school responds
positively or negatively to the grading policy,
regardless of whether it receives a high grade
or a low grade at the outset. Our procedure
does not lend itself to measuring general improvements
throughout the school system.
Rather, the goal of this paper is to measure
how schools that are officially and publicly
deemed to be failing and thus face sanction
respond to that designation. In particular,
we evaluate whether student proficiency in such
schools suffers or increases as a consequence
of the school's failure to earn a higher
grade. As we will see in the next section,
the point system used to grade schools allows
us to control for unobserved differences in
school quality during estimation. We emphasize,
however, that for our purposes, it is the grade
and points themselves that were earned under
the system that are important, not the particular
factors that are responsible for a higher score
or grade.
Schools that receive a poor grade under the
program face unspecified sanctions and even
restructuring or closure if they fail to improve.[1]
However, the act of stigmatizing schools as
"failing" could have a motivating
effect on them. Several researchers have speculated
that accountability policies could "shame"
schools into better performance (Figlio and
Rouse 2005; Ladd 2001; Carnoy 2001; Harris 2001).
In this paper, we are not particularly concerned
with the causes of improvements in student performance,
though inquiry into the causes is a clear avenue
for future research.
Overall Performance in New York City
As mentioned above, our research did not directly
measure whether the progress report program
led to general improvements or declines in the
performance of public schools in New York City.
However, reviewing some summary statistics about
overall progress in the school system can help
place any results about the relative performance
of schools earning particular grades into a
broader context.
Table 2 summarizes the performance of New York City
schools on fourth- and eighth-grade math and
English exams in 2006, 2007, and 2008, using
data aggregated to the school level in our data
set.[2] Between 2006 and
2008, schools made statistically significant
progress in both grades and in both math and
English. The gains were largest in eighth-grade
math and smallest in fourth-grade English. In
fact, scores in English in 2007 were statistically
significantly lower than in 2006, but schools made
progress in the next year.
Thus, the overall story on the state tests
is one of relative improvement from 2006 to
2008. Though these gains are statistically significant
(that is, we can have high confidence that the
true gain, once we take into account measurement
error, is greater than zero), we are not able
to say that these overall gains are directly
related to the progress report program, nor
is it the place of this paper to conclude that
such improvements are substantial enough to
warrant overall optimism about the city's
schools.
Data and Method
We utilize a student-level data set provided
by the New York City Department of Education.
The data set includes demographics and test
scores on the state's standardized math
and English exams for the universe of New York
City public school students enrolled in grades
three through eight from the 2006–07 through
the 2007–08 school years. We are also able
to link students to the schools they attend
and thus to the grades and points their schools earned
under the policy at the end of the 2006–07 school
year.
Table 3 presents descriptive information about
the schools in our data set overall and disaggregated
by the letter grade earned by the school at
the end of 2006–07. These descriptive statistics
are not identical to, but do closely match,
those reported by Rockoff and Turner (2008),
who used data aggregated by the Department of
Education.

One difficulty with the data set is that the
state does not claim that the results of its
math and English exams are "vertically
aligned" across grades. When results are
vertically aligned, a particular score should
indicate a certain level of proficiency regardless
of the grade for which the exam was prepared.
So a fifth-grade student with a score of 600
would have the same level of reading proficiency,
as measured by a fifth-grade test, as a third-grade
student with a score of 600, as measured by
a third-grade test, and so on. Our lack of a
vertically aligned score is important for measurement
because it means that the relationship between
a student's previous year's score
and current score could be affected by grade
level. This causes a difficulty in estimation
because our method is to pool students across
grades into a single regression equation. Specification
checks (not reported here) suggested that there
are slight but significant differences in the
relationship between previous and current student
proficiency across grades. To account for these,
along with estimating models that include all
grade levels, we report models restricted to
each grade level tested individually.
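
As a rough illustration of such a check, one could estimate the lagged-score cubic with and without grade-level interactions and test the interactions jointly. The sketch below uses pandas and statsmodels; the file name and all column names are hypothetical stand-ins, since the restricted DOE data layout is not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sketch of the specification check described above; "nyc_students.csv" and
# all column names are hypothetical stand-ins for the restricted DOE data.
df = pd.read_csv("nyc_students.csv")
cubic = "score_2007 + I(score_2007**2) + I(score_2007**3)"
restricted = smf.ols(f"score_2008 ~ {cubic}", data=df).fit()
flexible = smf.ols(
    f"score_2008 ~ {cubic} + C(grade_level):({cubic})", data=df
).fit()
# Joint F-test: does the lagged-to-current score relationship differ by grade?
f_stat, p_value, df_diff = flexible.compare_f_test(restricted)
print(f_stat, p_value)
```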
We follow the regression-discontinuity method
first presented in this context by Rouse and
others (2007) to study Florida's similar
school-grading program. This method takes advantage
of the discrete cutoffs in the continuous point
system utilized to assign schools particular
letter grades. We slightly modify this procedure
to better fit the particular design of the New
York program.[3]
We use a cross-sectional regression model to
measure how student
and school characteristics affect the student's
math or English score on the 2007–08 administration
of the exam. In particular, we run regressions
where the student's 2007–08 test score
is the dependent variable and independent variables
include a cubic function of the student's
score on the exam in 2006–07,[4]
observable characteristics about the student
(race, ethnicity, special-education status,
etc.), observable characteristics about the
school (percentage of students who are of a
particular race or ethnicity, etc.), and whether
the school is listed as an elementary, K–8,
or middle school. The model also controls for
a cubic function of the number of points earned
by the student's school in each of the
categories of the overall point system at the
end of the 2006–07 school year and the
letter grade earned by the school at the end
of that year. Finally, we include an interaction
between points earned on each of the input factors
and the school type (elementary, K–8, or
middle school). These interactions account for
the fact that the cutoffs from the point system
vary somewhat by school type, as shown in Table
1.
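
A minimal sketch of this specification follows, assuming hypothetical column names (the paper does not publish its variable list); clustering standard errors by school is our own conventional choice rather than something stated in the text.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Minimal sketch of the specification described above. All column names are
# hypothetical; clustering by school is a common choice, not taken from the text.
df = pd.read_csv("nyc_students.csv")  # hypothetical student-level extract

def cubic(v):
    """Cubic polynomial terms for a variable, in patsy formula syntax."""
    return f"{v} + I({v}**2) + I({v}**3)"

formula = (
    "score_2008 ~ " + cubic("score_2007")                   # lagged achievement
    + " + C(letter_grade, Treatment(reference='C'))"        # grade effects of interest
    + " + " + cubic("env_pts")                              # points: school environment
    + " + " + cubic("perf_pts")                             # points: student performance
    + " + " + cubic("prog_pts")                             # points: student progress
    + " + C(school_type)"                                   # elementary / K-8 / middle
    + " + (env_pts + perf_pts + prog_pts):C(school_type)"   # cutoffs vary by school type
    + " + black + hispanic + special_ed"                    # student demographics
    + " + sch_pct_black + sch_pct_hispanic"                 # school demographics
)
fit = smf.ols(formula, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]}
)
print(fit.params.filter(like="letter_grade"))  # F and D coefficients relative to C
```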
The central assumption of our procedure is
that there is no difference in school quality
that is conveyed in the school's grade
that is not also accounted for in (a cubic function
of) the number of points that a public school
earned in each category under the formula. If
this assumption holds, we can interpret the
estimate as the impact of a schools receipt
of a particular grade on a students academic
proficiency.
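
Stated somewhat more formally (our rendering; the paper expresses the assumption only in words), let ε_is denote the unobserved determinants of achievement for student i in school s, p_sc the points school s earned in category c, f_c a cubic polynomial, and G_s the school's letter grade. The assumption is that

```latex
% The grade adds no information about unobserved quality beyond the points:
\mathbb{E}\bigl[\varepsilon_{is} \mid f_1(p_{s1}),\ldots,f_C(p_{sC}),\, G_s\bigr]
  = \mathbb{E}\bigl[\varepsilon_{is} \mid f_1(p_{s1}),\ldots,f_C(p_{sC})\bigr]
```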
The basic idea behind this technique is to
take advantage of the known cutoffs above or
below which schools are assigned different letter
grades according to the policy. The continuous
point system provides a direct measure of the
quality of each public school as determined
by the school system. Though important for policy
purposes, the cutoffs on the point scale at
which a school earns an A, B, C, D, or F grade
are set at somewhat arbitrary points and thus
convey little to no additional information about
the school's performance that is not already
represented in the point total. Schools with
similar point totals are likely to be similar
in their effectiveness, but whether their score
falls on one side or the other of a cutoff will
determine whether they receive the F-grade sanction.
For instance, a public elementary school earning
a point total of 31.0 under the grading system
is likely educating its students just as well
as another public elementary school that earns
30.8 points. However, these schools face very
different incentives, since the former would
receive a D grade and the latter an F grade.
By controlling for each school's point
total, we are thus able to measure the
impact of earning a particular grade, independent of the
school's previous productivity level.
We make a couple of sampling restrictions that
are worth mentioning. First, we exclude students
who were tested in the third grade in 2007–08.
Since we utilize a lagged dependent variable,
and testing begins in the third grade, these
students must have been retained in the third
grade and thus may categorically differ from
students in other grades. Second, we exclude
students whose school is listed as a high school
on its progress report. The data set contains
observations of students taking the seventh-
and eighth-grade exams who are listed as attending
a high school, though the progress report definition
suggests that such identified schools would
teach grades nine through twelve. The data appear
to indicate that these are specialty schools
(for the arts, etc.), so we chose to eliminate
them from the data set. However, our results
remain robust when these sampling restrictions
are relaxed.
Any study that focuses on treatments
that disproportionately affect students in low-performing
schools, as this one does, may be affected
by regression to the mean. Schools and students
at the bottom of the achievement distribution
may have such low scores partly because of random
error. If negative random errors were more
prevalent in these schools, improvements made
on tests in later years could be an inflated
measure of a child's academic progress.[5]
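
A toy simulation, entirely our own construction, illustrates the mechanism: when observed scores equal fixed true quality plus independent noise each year, the lowest scorers in one year improve the next year even though nothing real has changed.

```python
import numpy as np

# Toy illustration of regression to the mean (our construction, not the
# paper's data): fixed true quality plus independent noise each year.
rng = np.random.default_rng(0)
true_quality = rng.normal(0.0, 1.0, 10_000)
year1 = true_quality + rng.normal(0.0, 0.5, 10_000)
year2 = true_quality + rng.normal(0.0, 0.5, 10_000)
bottom = year1 < np.quantile(year1, 0.10)  # bottom decile in year 1
# The bottom group "improves" purely because its year-1 noise was unusually bad.
print(year1[bottom].mean(), year2[bottom].mean())
```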
In their similar study in Florida, Rouse and
others (2007) present a series of specification
tests indicating that regression to the mean
is not the driving force behind their results.
Unfortunately, the timing of the beginning of
wide-scale testing in New York City does not
allow us to adopt similar tests there. Thus,
it remains possible that our results are affected
by regression to the mean.
Results
We first aggregate our data set in order to
replicate the recent results reported by Rockoff
and Turner (2008). The results of this test
are reported in Table 4. Though not identical,
the coefficient and standard-error estimates
in Table 4 closely mirror those reported in
Rockoff and Turner's paper.
This replication suggests that the Rockoff and Turner
paper's finding that F- and D-graded schools
made bigger improvements than higher-graded
schools continues to hold. It also lends some
confidence that the data utilized to estimate
our models of primary interest using student-level
data are accurate. This confirmation is particularly
important in view of the fact that Rockoff and
Turner (2008) rely on data that were reported
at the school-aggregated level, while we utilize
data aggregated from our individual-level data
set.
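
For concreteness, the aggregation step might look like the following sketch; the file and column names are again hypothetical.

```python
import pandas as pd

# Collapse the (hypothetical) student-level file to school means before
# re-running the school-level specification of Rockoff and Turner (2008).
df = pd.read_csv("nyc_students.csv")
schools = df.groupby("school_id").agg(
    mean_score_2008=("score_2008", "mean"),
    mean_score_2007=("score_2007", "mean"),
    letter_grade=("letter_grade", "first"),  # constant within a school
    school_type=("school_type", "first"),
    n_students=("score_2008", "size"),
).reset_index()
```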
Table 5 reports the results of estimation in
math overall and for each particular grade level.
As in the aggregate data, the overall model
that includes all grade levels continues to
find a statistically significant and substantial
positive effect after a school has received
an F or a D grade. Here we find that students
in F-graded schools made test-score improvements
that were 3.5 scale points higher than those of students
in C-graded schools. In standard-deviation terms,
attending an F school has a positive one-year
impact of about 0.18 standard deviations in
math proficiency relative to students in schools
that earned a C grade.
The additional columns of Table 5 report results
in math from regressions restricted to a
particular grade level. We do find evidence
that the result varies across grade levels.
In particular, we find a strong positive impact
on students attending an F or a D school in
the fifth grade. However, the results in other
grades are statistically insignificant, and
the coefficients of interest are negative in
the fourth grade. The reasons for such different
effects across grades are unclear. However,
it is worth noting that the coefficient estimates
on each of the cubic factors of the student's
lagged math score are similar across grades, though the small
differences are statistically significant. This
gives some confidence in our overall estimate
(column 1) because it indicates that lack of
vertical alignment in the test scores across
grades is probably not having a large impact
on estimation of the model.

Table 6 reports our results in reading, following
a similar format. We find no significant difference
between the reading performance of students
in F- or D-graded schools and that of students
in schools receiving better grades. This null
result holds both
in the overall regression and in each of the
grade-level regressions.

Conclusions
In this paper, we have evaluated the impact
of schools earning particular grades under New
York City's progress report policy on student
academic proficiency. The regression-discontinuity
methodology, by taking advantage of the city's
continuous point system for assigning school
grades, allows us to make causal interpretations
of the impact of such school grades on student
progress.
Our results can be construed as indicating
a mixed-positive effect from receipt of an F
or a D grade under the policy. We find that
students in F-graded schools made significant
and substantial improvements in math, though
these results appear to be primarily the result
of progress made by fifth-grade students. We
find no evidence that a school's grade
has a significant impact on student proficiency
in English.
References
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics:
Methods and Applications. New York: Cambridge
University Press.
Carnoy, M. 2001. "Do School Vouchers Improve
Student Performance?" American Prospect
12, no. 1: 42–45.
Chakrabarti, R. 2005. "Do Public Schools
Facing Vouchers Behave Strategically? Evidence
from Florida." Manuscript. Program on Education
Policy and Governance.
Figlio, D. N., and H. F. Ladd. 2008. "School
Accountability and Student Achievement."
In Handbook of Research in Education Finance
and Policy, ed. Helen F. Ladd and Edward B.
Fiske. New York: Routledge.
Figlio, D. N., and C. Rouse. 2005. "Do Accountability
and Voucher Threats Improve Low-Performing Schools?"
Journal of Public Economics 90: 239–55.
Greene, J. P. 2001. "An Evaluation of the
Florida A-Plus Accountability and School Choice
Program." Manuscript. Manhattan Institute.
Greene, J. P., and M. A. Winters. 2004. "Competition
Passes the Test." Education Next 4,
no. 3: 66–71.
Harris, D. 2001. "What Caused the Effects
of the Florida A+ Program: Ratings or Vouchers?"
In School Vouchers: Examining the Evidence,
ed. Martin Carnoy. Economic Policy Institute.
Ladd, H. F. 2001. "School-Based Educational
Accountability Systems: The Promise and the Pitfalls."
National Tax Journal 54, no. 2: 385–400.
New York City Department of Education. 2007.
"Educator Guide: The New York City Progress
Report."
Rockoff, J. E., and L. J. Turner. 2008. "Short-Run
Impacts of Accountability on School Quality."
Unpublished manuscript.
Rouse, C. E., J. Hannaway, D. Goldhaber, and
D. N. Figlio. 2007. "Feeling the Florida
Heat? How Low-Performing Schools Respond to Voucher
and Accountability Pressure." National Center
for Analysis of Longitudinal Data in Education
Research, Working Paper 13.
West, M. R., and P. E. Peterson. 2006. "The
Efficacy of Choice Threats within School Accountability
Systems: Results from Legislatively Induced Experiments."
Economic Journal 116, no. 510: C46–62.
Winters, M. A., J. P. Greene, and J. Trivitt.
2008. "Building on the Basics: The Impact
of High-Stakes Testing on Student Proficiency
in Low-Stakes Subjects." Manuscript. Manhattan
Institute.