The Mission of the Manhattan Institute is to develop and disseminate new ideas that foster greater economic choice and individual responsibility.


Civic Report

No. 56 April 2009


The NYC Teacher Pay-for-Performance Program:

Early Evidence from a Randomized Trial

Matthew G. Springer, Ph.D., Research Assistant Professor of Public Policy and Education, Vanderbilt University; Director, National Center on Performance Incentives

Marcus A. Winters, Ph.D., Senior Fellow, Manhattan Institute

 

Executive Summary

Paying teachers varying amounts on the basis of how well their students perform is an idea that has been winning increasing support, both in the United States and abroad, and many school systems have adopted some version of it. Proponents claim that linking teacher pay to student performance is a powerful way to encourage talented and highly motivated people to enter the teaching profession and then to motivate them further inside the classroom. Critics, on the other hand, contend that an extrinsic incentive like bonus pay may have unfortunate consequences, including rivalry instead of cooperation among teachers and excessive focus on the one or two subjects used to measure academic progress.

In this paper, a researcher from the Manhattan Institute for Policy Research and another from the National Center on Performance Incentives at Vanderbilt University present evidence on the short-run impact of a group-level incentive pay program operating in the New York City Public School System. The School-Wide Performance Bonus Program (SPBP) is a pay-for-performance program that was implemented in approximately 200 K–12 public schools midway into the 2007–08 school year. Participating schools can earn bonus awards of up to $3,000 per full-time union member working at the school if the school meets performance targets defined by the city’s accountability program.

This study examines the impact of the SPBP on student outcomes and the school learning environment. More specifically, the study is designed to address three research questions.

  1. Did students enrolled in schools eligible for the SPBP perform better on the high-stakes mathematics assessment than students enrolled in schools that were not eligible?
  2. Did participating schools with disparate characteristics perform differently from one another? And did subgroups of students in these schools perform differently from one another?
  3. Did the SPBP have an impact on students’, parents’, and teachers’ perceptions of the school learning environment or on the quality of a school’s instructional program?

Although a well-executed random-assignment study is the gold standard for making causal inferences, readers should be aware that the analyses reported in this paper can address only the short-run effects of the SPBP because the period between the inception of schools’ participation in the SPBP and the administration of New York State’s high-stakes math exam was less than three months. The purpose of this study is to establish a baseline for subsequent analyses of student outcomes, teacher behavior, and school environment.

The authors did not discern any impact of a school’s participation in the SPBP on math test scores. The performance of students enrolled in schools participating in the SPBP did not differ statistically from the performance of students enrolled in schools assigned to the control group. The same holds true after adjusting estimates of student performance to account for whether an eligible school voted in favor of participating in the program, and thus actually enrolled in it.

The authors also investigated whether an effect of participation might be observable in particular subgroups of students or schools, if not among students or schools overall. But they could not find evidence that two possible factors—students’ race/ethnicity and their level of proficiency at the beginning of the academic year—affected the impact of the SPBP to any extent. The authors find some evidence that the math performance of students in smaller schools participating in the SPBP remained static, while the scores of students in participating schools with larger enrollments decreased. However, the relationship between school size and the impact of the SPBP warrants further study when data from year two of the SPBP become available.

The authors also examined the impact of the SPBP on students’, teachers’, and parents’ perceptions of the school learning environment, as well as an external evaluator’s assessment of a school’s instructional program. Once again, no significant differences between the outcomes of schools participating in the SPBP and those of schools that were assigned to the control group could be found.

Overall, the authors found that the SPBP had little to no impact on student proficiency or school environment in its first year. However, the authors emphasize that the short-run results reported in this study provide only very limited evidence of the program’s true effectiveness. An evaluation of the program’s impact after two years should provide more meaningful information about the impact of the SPBP. The authors intend to perform such a study and release its results in the near future.

 

About the Authors

MATTHEW G. SPRINGER is a research assistant professor of public policy and education at Vanderbilt University’s Peabody College and director of the federally funded National Center on Performance Incentives. His research interests are in the broad area of education policy, with a particular focus on educational innovations and productivity. He has conducted studies on the impact of the No Child Left Behind law on student outcomes; the impact of school finance litigation on resource allocation; and the impact of incentive pay programs on student achievement and teacher mobility. His research has been published in Economics of Education Review, Education Economics, Education Next, Journal of Education Finance, Journal of Policy Analysis and Management, and Peabody Journal of Education. He received a B.A. in psychology and education from Denison University and a Ph.D. in education finance and policy from Vanderbilt University.

MARCUS A. WINTERS is a senior fellow at the Manhattan Institute. He has conducted studies of a variety of education policy issues, including high-stakes testing, performance pay for teachers, and the effects of vouchers on the public school system. His research has been published in the journals Education Finance and Policy, Economics of Education Review, Teachers College Record, and Education Next. His op-ed articles have appeared in numerous newspapers, including the Wall Street Journal, the Washington Post, and USA Today. He received a B.A. in political science from Ohio University and a Ph.D. in economics from the University of Arkansas in 2008.

Acknowledgments

This paper was supported, in part, by the National Center on Performance Incentives at Vanderbilt University. We appreciate helpful comments and suggestions from Dale Ballou, Julie Marsh, Dan McCaffrey, and Patrick Wolf. We would also like to acknowledge Terry Bowman and the many individuals at the New York City Public Schools for providing data and technical support to conduct our analyses. Any errors remain the sole responsibility of the authors. The views expressed in this paper do not necessarily reflect those of sponsoring agencies or individuals acknowledged.


1. Introduction

Teacher pay for performance has resurfaced as a popular reform strategy in the United States and abroad.[1] These proposals are grounded in the argument that current compensation policies give teachers weak incentives to act in the best interest of their students and that rigidities in those policies create inefficiencies. Proponents claim that linking teacher pay to student performance is a powerful way to affect teacher motivation and labor-market selection. Critics, on the other hand, contend that extrinsic incentives may compromise the intrinsic motivation of teachers and possibly lead to dysfunctional behaviors or negative spillover effects.[2] Another frequent criticism of this reform strategy is that output in the education sector is difficult to define because it is not readily measured in a reliable, valid, and fair manner.

Recent experimental and quasi-experimental evidence paints a mixed picture of the impact of teacher pay-for-performance programs. Muralidharan and Sundararaman (2008) and Lavy (2002, 2007) found that teacher incentive programs in India and Israel, respectively, improved student outcomes and promoted positive changes in teacher behavior and/or classroom pedagogy. Glewwe, Ilias, and Kremer (2008) similarly reported that students instructed by teachers eligible to receive an award in a teacher incentive program in Kenya demonstrated better scores on high-stakes tests; however, no discernible impact was found on low-stakes tests taken by a sample of students or on the same students when they took high-stakes tests during the post-intervention school year. Furthermore, a comprehensive evaluation of a long-standing incentive program in Mexico detected a negligible impact on elementary students’ test scores and small, positive effects at the secondary level (Santibanez et al., 2007).

This paper contributes to the evaluation literature on teacher incentive systems by assessing the short-run impact of a group incentive program implemented in the New York City Department of Education (NYCDOE). The School-Wide Performance Bonus Program (SPBP) was implemented midway into the 2007–08 school year and was designed to provide financial rewards to educators in schools serving disadvantaged students. The SPBP sets expected incentive payments as a fixed performance standard, meaning that schools participating in the program are not competing against one another for a fixed sum of money. All participating schools can earn bonus awards of up to $3,000 per full-time union member working at the school if the school meets predetermined performance targets defined by the NYCDOE’s accountability program, with the idea that this sum will be used to award bonuses to teachers and staff found to be deserving. The SPBP rules further mandate that schools participating in the program establish a four-person site-based compensation committee to determine how bonus awards will be distributed to school personnel.

The SPBP is interesting to study for a number of reasons. First, the NYCDOE randomly assigned schools qualifying for the program to either treatment or control status. Since true random assignment balances, in expectation, the observed and unobserved factors that can lead to systematic differences between schools receiving the SPBP treatment and those not eligible to do so, any significant differences in future outcomes can be attributed to the SPBP intervention rather than to other confounding factors associated with outcomes of interest. Moreover, even though the United States has a long history of testing various teacher compensation reforms, this study is the first to report the causal effect of a domestic teacher pay-for-performance program.[3]

Second, design and implementation of the SPBP addressed potential obstacles that can diminish teachers’ receptiveness to compensation reform. The SPBP was developed collaboratively by the NYCDOE and the United Federation of Teachers (UFT), which is the sole bargaining agent for school district personnel.[4] Program guidelines that they developed required that at least 55 percent of school personnel in SPBP-eligible schools vote in favor of participation and that all school personnel within a school be eligible to receive an award.[5] On the other hand, some observers contend that using a school as the unit of accountability makes for a weak incentive policy, since school personnel may feel unable to influence the chances that their school qualifies for an award.[6] And despite the SPBP’s assignment of responsibility to site-based “compensation committees” for determining how bonus awards are distributed, similar reforms in Texas suggest that schools tend to adopt very egalitarian award-distribution plans when teachers have a role in designing school-level incentive systems (Springer et al., 2008; Taylor, Springer, and Ehlert, forthcoming).

This study also takes advantage of school-level data on institutional and organizational practices collected by the NYCDOE. The district collects survey data on student, parent, and teacher perceptions of the school learning environment, including items on academic expectations, communication, engagement, and safety. Teams of experienced educators also conduct two- to three-day on-site visits to review the quality of a school’s institutional and instructional program. The school learning-environment survey and external quality reviews provide a means for studying the causal effect of the SPBP on intermediate outcomes. If significant differences in student achievement among schools assigned to the treatment and control conditions are detected, data on institutional and organizational practices could shed light on the types of changes that affected student achievement.

Our evaluation focuses on the impact of the SPBP on student achievement in mathematics during the first year of implementation. A series of analyses also uses school-level survey data on student, parent, and teacher perceptions of the school learning environment, as well as school-level data from enumerators’ assessments of how well a school is organized for the purpose of improving student learning. In addition, we explore the first-year impact on a variety of student and school characteristics.

Our sample includes 186 SPBP-eligible elementary, K–8, and middle schools and 137 control-condition schools in New York City over a two-year period. The 2006–07 school year is the baseline. The first year of implementation was the 2007–08 school year, though less than three months of school elapsed between the end of the period in which an eligible campus had to vote on whether to participate and the point at which New York State administered the high-stakes mathematics tests.[7] Test scores in mathematics for more than 100,000 students in grades three through eight were collected and reviewed. We do not include the English language-arts test because it was administered a few weeks after the SPBP was implemented, and before the distributional rules of the SPBP reward system had been finalized by each school’s compensation committee.

We found that the SPBP had no discernible effect on overall student achievement in mathematics during the first year of the program’s implementation. The sign on the SPBP coefficient is negative in virtually all models, though the average treatment effect is always insignificant at any conventional level. There were no discernible impacts when adjusting estimates for SPBP-eligible schools that declined participation. The same holds true when using different achievement specifications.
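
To make the comparison concrete, the sketch below shows one way an intent-to-treat estimate of this kind could be specified. It is a minimal illustration, not the model actually estimated in this study; the file name, variable names (math_z_2008, treated, school_id), and the choice of controls are assumptions.

    # Illustrative sketch only: compare math z-scores in SPBP-eligible (treated)
    # and control schools, clustering standard errors at the school level.
    # File name, variable names, and controls are assumptions, not the paper's model.
    import pandas as pd
    import statsmodels.formula.api as smf

    students = pd.read_csv("student_panel.csv")  # hypothetical analysis file

    model = smf.ols(
        "math_z_2008 ~ treated + math_z_2007 + C(grade)",
        data=students,
    ).fit(cov_type="cluster", cov_kwds={"groups": students["school_id"]})

    print(model.params["treated"], model.pvalues["treated"])  # ITT estimate and p-value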

An important question is whether any particular group of students or schools benefited from participating in the SPBP. Some previous studies of other pay-for-performance programs have found differential effects on student outcomes by student race (Ladd, 1996), prior student achievement (Lavy, 2008), family affluence (Muralidharan and Sundararaman, 2008), and parent education level (Lavy, 2002). Studies have also found no evidence of a significant difference attributable to student or teacher characteristics (Lavy, 2008; Muralidharan and Sundararaman, 2008; Lavy, 2002). We found that neither student race nor initial student achievement produced statistically significant differences in the impact of the SPBP. We did not have access to data on other student characteristics such as free and reduced-price lunch status, parental education, or gender, or data on teacher characteristics such as years of experience, salary, or gender.

Organizational theory on group incentive programs suggests that social penalties and other strong forms of reciprocity can positively affect effort if the size of the team is not too large (Kandel and Lazear, 1992; Besley and Coate, 1995; Bowles and Gintis, 2002). Teachers in SPBP-eligible schools with large student enrollments may not respond to the SPBP because they feel less able to affect the performance measures that establish qualification for bonuses. Contrary to previous theory, our findings suggest that mathematics achievement by students enrolled in schools with fewer students remained static in response to the SPBP, while student achievement in schools with larger enrollments decreased. The potential moderating effect of school size on the direction and/or strength of the relationship between the SPBP and mathematics achievement will be revisited when data from the 2008–09 school year become available.

We also examined the impact of the SPBP on student, teacher, and parent perceptions of the school learning environment, as well as external enumerators’ assessments of a school’s instructional program. We found no discernible differences in intermediate outcomes between SPBP schools and schools assigned to the control condition. Admittedly, these estimates may lack precision: the data are aggregated at the school level, response rates on the survey vary considerably among schools, and enumerators’ quality reviews are not available for all the schools in our sample. In addition, a positive and significant effect on teacher perceptions of the school learning environment would need to be interpreted cautiously, in light of the fact that scores count toward a school’s overall Progress Report Card rating, which determines whether its teachers can qualify for a bonus award.

We use a regression discontinuity design within the randomized evaluation to examine whether the difficulty of performance thresholds that SPBP schools needed to reach to earn a bonus contributed to the treatment effect. Schools had to meet different performance targets determined by their overall Progress Report Card score ranking to earn a bonus award. We use discrete cutoffs in performance target scores to identify these impacts. We found no evidence of a differential treatment impact among schools in response to the performance targets that they had to meet to earn a bonus award.
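
For readers who want to see what exploiting such discrete cutoffs might look like in practice, the sketch below is a hypothetical illustration: it interacts treatment status with an indicator for lying just above one performance-target cutoff within a narrow bandwidth. The cutoff value, bandwidth, file name, and variable names are assumptions, not the specification used in this paper.

    # Hypothetical sketch of a discontinuity check around one performance-target
    # cutoff: does the SPBP effect differ for schools just above versus just below
    # the cutoff that determines their target gain? Cutoff, bandwidth, and variable
    # names are assumptions for illustration only.
    import pandas as pd
    import statsmodels.formula.api as smf

    schools = pd.read_csv("school_panel.csv")   # hypothetical school-level file
    cutoff, bandwidth = 75.0, 10.0              # assumed percentile cutoff and window

    near = schools[(schools.percentile_rank - cutoff).abs() <= bandwidth].copy()
    near["above"] = (near.percentile_rank >= cutoff).astype(int)
    near["dist"] = near.percentile_rank - cutoff

    rd = smf.ols("math_z_2008 ~ treated*above + dist + treated:dist", data=near).fit()
    print(rd.params["treated:above"])           # differential treatment effect at the cutoff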

Note that this evaluation examines the impact of the SPBP after the program had been in operation for less than three months. Even though a randomized evaluation study of incentive programs in Andhra Pradesh, India, observed a modest impact on student achievement after a single year, the governance structures in rural schools there are very different from the operational context of New York City schools. The incentive structure facing teachers and schools in Andhra Pradesh is very weak compared to the accountability measures found in New York City (and the United States, more generally).[8] Furthermore, a series of educational reforms in New York City operating concurrently with the SPBP potentially makes it difficult to distinguish the short-run effects that the SPBP generated. These reforms focus on the same outcome measures used in this study to assess the impact of the SPBP on student outcomes. Consequently, we will be able to offer a more comprehensive understanding than we have today of the impact of the SPBP on student outcomes, teacher behavior, and schooling practices as more years of data become available.

Finally, we evaluate the impact of the SPBP on the productivity of existing teachers and school personnel. A system that remunerates employees on the basis of individual or group performance is likely to do a better job of retaining high performers and attracting new ones than a system that does not. The sorting effect has been reported to be as large as the incentive effect on the productivity of existing workers (Lazear, 2000).[9] The impact of the SPBP on teacher sorting and selection, and the implications of that sorting for student achievement and other outcomes of interest, still await study.[10]

The remainder of this paper is organized as follows: Section 2 discusses the components of pay-for-performance programs. Section 3 reviews key findings from the relevant literature on teacher pay-for-performance programs. Section 4 provides a complete description of the SPBP. Section 5 describes the data, sample, and random assignment of schools to treatment. Section 6 offers a description of the analysis plan, which sets the stage for Section 7, which is a discussion of overall results. Section 8 provides an analysis of potential differential treatment effects. Section 9 is the conclusion.


2. Understanding Components of Pay-for-Performance Programs

An organization’s compensation system is arguably its most important human-resource management system (Ehrenberg and Milkovich, 1987; Lawler, 1981). Providing employees with financial incentives is believed to increase organizational productivity by strengthening employee motivation and attracting and retaining more effective individuals. However, in the public education sector, many contend that sufficient incentives reside in the work itself and that rewards can suppress teachers’ intrinsic motivation (Johnson, 1986; Lortie, 1975). Social psychologists refer to the trade-off between intrinsic and extrinsic motivation as the “hidden cost of rewards” (Lepper and Greene, 1978) and “the corruption effect of intrinsic motivation” (Deci, 1975), or what behavioral economists have labeled the “crowding out” of intrinsic motivation (Frey, 1997).

However, numerous design components need to be understood before education reformers can conclude that teacher pay-for-performance programs are practical. For example, whose performance should determine bonus award eligibility? What performance indicators will monitor and appraise employee performance? Will the program reward school personnel according to a relative or an absolute performance standard? Who is part of the pay-for-performance system, and how will bonus awards be distributed to school personnel?

2.a. Forms of Teacher Pay-for-Performance Programs

The unit of accountability describes the entity whose performance determines award eligibility. It can be an individual teacher, a group or team of teachers (e.g., grade-level, department, interdisciplinary team, or school), or some combination thereof. Some literature indicates that pay-for-performance programs that are focused on the individual as the unit of accountability will achieve the best outcomes, particularly if output can be easily attributed to a single individual, the criteria for performance appraisals are observable and objective, and the work does not depend on the interdependence of employees (Deutsch, 1985; Milgrom and Roberts, 1990; Bowles and Gintis, 2002). A common critique of individual incentive programs rests on observation of dysfunctional behavior and system gaming (Prendergast, 1999; Murnane and Cohen, 1986).[11] Furthermore, in education and other sectors involving complex tasks and multiple goals, individuals have greater opportunity to maximize their own utility by reallocating effort to metered, rewarded activities (Holmstrom and Milgrom, 1991; Baker, 1992; Courty and Marschke, 2004).

Pay-for-performance programs that are focused on the group as the unit of accountability may contribute to greater productivity in organizations such as schools, where employees work interdependently. Group incentives can promote social cohesion and feelings of fairness and generate productivity norms (Lazear, 1998; Rosen, 1986; Pfeffer, 1995). A frequently cited threat to group incentive structures is free-riding or shirking, which suggests that some workers may underperform because they assume that others will take up the slack. However, the free-rider problem can be solved through mutual monitoring and the enforcement of social penalties if the team unit is not too large (Kandel and Lazear, 1992; Nalbantian and Schotter, 1997; Bowles and Gintis, 2002). Group structures may also create a perverse incentive by motivating effective teachers in low-performing schools to move to higher-performing schools, where their potential to earn a bonus award increases (Ladd, 2001; Clotfelter, Ladd, and Vigdor, 2005).

Pay-for-performance programs may use any number of performance indicators to monitor and appraise individual or group performance. Test scores are the most heavily weighted performance metric in most output-focused systems. These systems may also incorporate graduation or promotion rates, student or teacher attendance, a reduction in disciplinary referrals, increased test participation, and the like. On the other hand, some pay-for-performance programs may focus more heavily on input-based measures, particularly those that were developed and implemented prior to 2002 (e.g., teacher career ladder or knowledge- and skill-based pay programs).

Past compensation reforms in the education sector have been faulted for measuring what exists rather than proposing and testing what might be useful and important to measure. Today, most agree that a pay-for-performance system must use multiple measures and cannot focus exclusively on test scores. A structural misalignment between performance measures and a school’s mission, or volatility in the outcome measure from one point in time to a later one, can create discontent among teachers and distort policy.

Incentive structure is another key component of teacher pay-for-performance programs. Programs can reward a teacher, a team of teachers, or an entire school on the basis of how its performance compares with that of similarly situated individuals, groups, or schools, using a rank-ordered tournament; or they can adopt a fixed performance standard by which any teacher or group of teachers meeting a predefined threshold wins (Lazear and Rosen, 1981; Green and Stokey, 1983).[12] Tournament incentive structures create competition among individuals or groups for a fixed pool of bonus awards, thus removing the financial risk inherent in operating a fixed performance-standard scheme. Incentive structures based on a relative performance standard may also be more practical when no obvious performance target exists or performance metrics are volatile. However, tournament incentive structures for teachers or teams of teachers have not received much support because schools have strong work interdependencies, though it is possible to design tournaments in which groups within schools are not competing against one another.

Bonus award distribution systems determine how evenly a pay-for-performance system distributes rewards to eligible employees. An egalitarian distribution plan distributes incentive money widely, in contrast to those plans that reward some individuals far more than others. Proponents argue that individualist reward plans are more cost-effective and help create a meritocracy that retains an organization’s highest performers, attracts similar talent over the long run, and sends a clear signal to the lowest performers to improve or move elsewhere (Milgrom and Roberts, 1992; Zenger, 1992; Ehrenberg and Smith, 1994; Pfeffer and Langston, 1993). At the same time, a growing body of research suggests that egalitarian pay distribution promotes cooperation and group performance, which are critical in participative organizations. Furthermore, Milgrom and Roberts (1992) suggest that greater pay dispersion may elevate the performance of the lowest performers, who also like receiving awards.


3. Review of Relevant Research and Experiments on Pay for Performance

This section offers a review of previous research studying the impact of teacher pay-for-performance programs on student outcomes, teacher behavior, and institutional dynamics. Our review focuses on evaluations that use experimental designs or regression discontinuity (RD) designs in a quasi-experimental framework.[13] When implemented properly, such designs are ideal for assessing whether a specific intervention truly produces changes in the outcomes under study or whether observed changes are simply artifacts of pretreatment differences between two or more groups under study.

Muralidharan and Sundararaman (2008) studied the impact of two output-based incentive systems (an individual teacher incentive program and a group-level teacher incentive program) and two input-based resource interventions (one providing an extra paraprofessional teacher and another providing block grants). In what was known as the Andhra Pradesh Randomized Evaluation Study (AP RESt), 500 rural schools in Andhra Pradesh, India, were randomly selected to participate and then assigned to one of the four treatment conditions or to the control group. These schools had a weak incentive structure for teachers, with 90 percent of noncapital education spending going to regular teacher salaries and benefits. The AP RESt intervention was developed in partnership with the government of Andhra Pradesh, a large nonprofit organization interested in education issues in India (the Azim Premji Foundation), and the World Bank.

The individual incentive program awarded bonus payments to teachers for every percentage point of improvement above five percentage points in their students’ average test score. All recipients received the same bonus for every percentage point of improvement. The bonus award scheme was structured as a fixed performance standard, which means that awards were distributed to any teacher or school that was selected to be in the AP RESt intervention and that exceeded the performance threshold.
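
Expressed as a formula, the rule described above pays a constant amount per percentage point of average improvement beyond the five-point threshold; the sketch below is a minimal illustration in which the per-point rate is a made-up placeholder, not the actual program amount.

    # Minimal illustration of the AP RESt individual-bonus rule described above.
    # The per-point rate is a placeholder; only the threshold logic is from the text.
    def ap_rest_bonus(avg_gain_pct_points, rate_per_point=500.0):
        return rate_per_point * max(0.0, avg_gain_pct_points - 5.0)

    print(ap_rest_bonus(9.0))  # 4 points above the threshold -> 4 * rate_per_point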

Muralidharan and Sundararaman (2008) reported that student test scores on high-stakes tests increased between 0.12 and 0.19 standard deviations in the first year of the program and between 0.16 and 0.27 standard deviations in the second. Students enrolled in classrooms presided over by teachers eligible to receive a bonus award scored 0.11 to 0.18 standard deviations higher on low-stakes tests than those students whose teachers were not eligible to earn a bonus award. Students in treatment-condition classrooms also scored higher on a separate test that assessed higher-order thinking, which Muralidharan and Sundararaman (2008) indicate represents “genuine improvements” in learning, as opposed to better test-taking skills or perhaps other strategies employed by teachers to increase their chances of receiving a bonus award.

Muralidharan and Sundararaman (2008) also found that the schools assigned to the output-based intervention (i.e., individual- or group-incentive conditions) outperformed those schools assigned to the input-based resource interventions (i.e., paraprofessional or block grant conditions). Students enrolled in a classroom instructed by a teacher selected for the group incentive intervention also outperformed students in control-condition classrooms on the mathematics and language tests (0.28 and 0.16 standard deviations, respectively). At the same time, students enrolled in schools assigned to the individual incentive condition outperformed students in both the group incentive condition and the control condition following the second year of implementation.

Another interesting feature of the AP RESt study is that external evaluators collected data on intermediate outcomes in interviews and through classroom observation. Teacher interviews offered anecdotal evidence that teachers in the individual or group incentive intervention were more likely to assign homework, offer support outside of class time, have students complete practice tests, and focus attention on low-performing students. However, Muralidharan and Sundararaman (2008), using data collected by the observational protocol, found no significant differences between treatment- and control-condition classrooms.

Glewwe, Ilias, and Kremer (2008) studied the impact of the International Child Support Incentive Program (ICSIP), a group incentive intervention that randomly assigned 100 schools in rural Kenya to either a treatment or a control condition. ICSIP’s bonus scheme was structured as a rank-ordered tournament, and prizes ranged between 21 percent and 43 percent of average monthly base salary.[14] The ICSIP appraised school performance on the basis of student drop-out rates and test scores, with the twelve highest-performing and the twelve most-improved schools that were assigned to the ICSIP intervention receiving a prize.

Glewwe et al. (2008) found that students enrolled in schools participating in the ICSIP intervention had noticeably higher scores on high-stakes tests than students enrolled in schools assigned to the control condition. However, when comparing the performance of students enrolled in control- and treatment-group schools on a low-stakes test, Glewwe et al. (2008) found no differences in student test scores. It appeared that students enrolled in schools participating in the ICSIP intervention were coached in test-taking skills; an analysis of item-level test data revealed, for example, that treatment-condition students were significantly less likely to leave a test question blank.

Glewwe et al. (2008) also examined the impact of the ICSIP on teacher behavior. The authors found no differences in teacher attendance or pedagogy (behavior in classroom, instructional practices, number of homework assignments) among teachers in schools assigned to the ICSIP intervention and those working in a control-condition school. At the same time, teachers working in schools eligible for an ICSIP prize were 7.4 percentage points more likely to offer test-preparation sessions for students outside of normal school hours (typically when students were on vacation). In total, Glewwe et al. (2008) question the probability of the ICSIP program’s improving long-run education outcomes, given the current state of schooling in the Busia and Teso districts of western Kenya.

Unlike the above-mentioned controlled trials, in which teachers or schools were randomly assigned to research groups, the next several studies exploited the fact that teachers or schools assigned to intervention and control-group conditions differ solely in whether they fall above or below a cutoff point along some pre-intervention assignment variable. When implemented properly, an RD design allows for unbiased estimation of the average treatment effect for teachers or schools that fall just to either side of such selection cutoffs.[15] The remainder of this subsection presents an overview of major findings from three RD studies of education incentive interventions: two programs implemented in Israel and a program operating in Mexico since 1992.

Lavy (2002) evaluated a group incentive program that was implemented in sixty-two Israeli high schools and designed to reduce student drop-out rates and improve student achievement. The program rewarded school performance on the basis of three factors: mean test scores, mean number of credit hours, and school drop-out rate. The bonus scheme was designed as a rank-ordered tournament, with the schools in the top third of performers competing for $1.44 million in awards. Schools earning a bonus had to distribute to their teachers 75 percent of the school-level award funds in amounts proportional to their gross annual compensation, regardless of their performance during the school year; the remaining 25 percent was to be used for improving school facilities for teachers. Lavy (2002) reported that top-performing schools received between $13,000 and $105,000 during the first year of implementation, with teacher bonuses ranging from $250 to $1,000 per teacher.

Lavy (2002) found a positive and statistically significant effect on student outcomes. Following the second year of implementation, for example, the group incentive program was found to have had a positive effect on average credit hours earned, average science credits earned, average test scores, and proportion of students taking Israel’s matriculation test. Estimates further indicated that the program affected particular groups of students more than others—for instance, students at the low end of the ability distribution performed much better than expected on Israel’s exit tests.

Lavy (2002) also compared the effectiveness of Israel’s group incentive intervention with an input-based intervention that had been implemented several years earlier. The input-based intervention provided twenty-two secondary schools with additional resources to implement professional training programs, reduce class size, and offer tutoring to below-average students. Although both programs improved student outcomes, Lavy (2002) concluded that the group incentive program was more cost-effective per marginal dollar spent. Muralidharan and Sundararaman (2008) similarly found that both the individual and group incentive programs were more cost-effective than either the “extra paraprofessional” teacher or the block-grant treatment condition. The relative effectiveness of these interventions is particularly relevant to U.S. education policy because input-based reforms generally have been implemented more widely than output-based interventions such as New York City’s SPBP.[16]

Lavy (2008) studied an individual incentive program in Israel that awarded bonuses to high school teachers in grades ten, eleven, and twelve based on their students’ performance on national exit tests. The program was structured as a rank-ordered tournament and operated for a single semester (January–June 2001). Teachers in the intervention could earn a bonus for each class of students they prepared for the national exit tests, with awards ranging from $1,750 to $7,500 per class prepared. As reported by Lavy (2008), of the 302 teachers (48 percent of eligible teachers) awarded a bonus following the June 2001 exit tests, sixteen won bonuses for two of their classes.

Lavy (2008) creatively exploited two subtle features of the pay-for-performance program—measurement error in the assignment variable and a break along the pre-intervention assignment variable—to estimate the causal impact of the incentive program by using regression discontinuity design. Estimates of the net intervention effect indicated that the number of exit-exam credits earned by students instructed by a teacher in the incentive program increased by 18 percent in mathematics and 17 percent in English, while data from a survey of teacher attitudes and behaviors suggested positive changes in teaching practices, teacher effort, and instruction tailored to low-performing students. When investigating gaps in performance between the results of school tests and national tests taken by students enrolled in treatment and comparison schools, Lavy (2008) did not find evidence of opportunistic behavior or negative spillover effects.

Santibanez et al. (2008) used an RD design to estimate the impact of Mexico’s Carrera Magisterial (CM) on student test scores. Implemented in 1992, CM is a teacher incentive program that was designed collaboratively by state and federal education departments and the national teachers’ union. Teachers participating in the program can earn a financial bonus if they accumulate enough points on a variety of measures defined by CM guidelines, including input criteria such as years of experience, highest degree held, and professional development activities, as well as output criteria such as their performance on a subject-matter knowledge test and their students’ test scores (Santibanez et al., 2008). Awards ranged from 24.5 to 197 percent of a teacher’s annual earnings (McEwan and Santibanez, 2005; Ortiz-Jiminez, 2003).

Santibanez et al. (2008) take advantage of the financial incentive that individual teachers have to improve their students’ test performance. Since the program appraises teachers on most performance measures before students take the high-stakes tests each school year, teachers participating in the CM program have a general sense of how many additional points they need to earn on the strength of their students’ performance on the high-stakes test to receive an award. Santibanez et al. (2008) detected a negligible impact on test scores of students enrolled in elementary school classrooms taught by teachers facing a strong incentive, while they detected small, positive effects at the secondary level. The authors note that their identification strategy relies on a factor in the CM program that may be worth too few points to motivate teachers to exert more effort to improve student test scores.

Pay-for-Performance Experiments in the United States

Despite more than a quarter-century of sustained debate over teacher compensation reform, research on pay-for-performance programs in the U.S. has tended to focus on short-run motivational effects and to be highly diverse in terms of methodology, population targeted, and programs evaluated (Podgursky and Springer, 2007). Indeed, the four pay-for-performance programs known to us to employ a random-assignment design, as this study does, are still being implemented or evaluated. Building a solid research base is necessary for making firm judgments about pay-for-performance programs generally and for deciding whether specific types of design features have promise.

In August 2006, the National Center on Performance Incentives (NCPI) implemented the Project on Incentives in Teaching (POINT) intervention in the Metropolitan Nashville Public Schools (MNPS) system.[17] The POINT experiment recruited 297 teachers of middle-school mathematics in grades five through eight and randomly assigned these teachers to the treatment or control condition. Teachers assigned to the intervention are eligible to receive bonuses of up to $15,000 per year for a three-year period on the basis of two factors: the progress of a teacher’s math students over a year, as measured by their gains on the Tennessee Comprehensive Assessment Program (TCAP); and the progress of a teacher’s nonmath students over a year, as measured by their gains on the TCAP as well.

The POINT experiment is designed as an individual incentive intervention in which performance is judged according to a fixed performance standard. Because this standard was determined at the beginning of the POINT experiment and will remain fixed for three years, all teachers have the opportunity to be rewarded for having improved over time. The experiment concludes following the 2008–09 school year, and preliminary results will be available sometime during the following year.

In October 2008, the NCPI implemented a demonstration project to study a group incentive intervention. Eighty-two grade-level teams of teachers in grades six, seven, or eight were randomly assigned to either the treatment or the control condition. A team is defined as a group of academic teachers who meet regularly to discuss a common set of students, performance goals, and outcomes for which they are collectively accountable. Teachers assigned to the incentive intervention are eligible to receive an award if their team is selected as one of the four highest-performing teams at their grade level, as measured by standardized achievement scores in reading, mathematics, science, and social studies. Treatment teachers are projected to earn a bonus of about $6,000 if their team qualifies for an award.

Glazerman et al. (2007) designed and implemented an impact evaluation of the Teacher Advancement Program (TAP), a program being implemented by the Chicago Public Schools using a federal Teacher Incentive Fund grant. The TAP is a comprehensive school-reform model consisting of four elements: (1) multiple career paths; (2) ongoing, applied professional growth; (3) instructionally focused accountability; and (4) performance-based compensation.[18] At the beginning of the 2007–08 school year, Glazerman and colleagues randomly assigned eight schools to receive the TAP intervention and eight schools to the control condition. The latter set of schools delayed implementation of TAP for a two-year period while serving as controls. Another sixteen schools were then recruited and randomly assigned to the TAP intervention or control conditions for the 2009–10 and 2010–11 school years.

4. New York City’s School-Wide Performance Bonus Program

The SPBP is a group incentive program developed collaboratively by the NYCDOE and the UFT. The SPBP sets expected incentive payments using fixed performance standards, not by constructing a rank-ordered tournament. The SPBP was conceived as a two-year pilot program, with the number of eligible schools increasing from approximately 216 to 400 in the second year. However, because of budgetary constraints, the number of SPBP-eligible schools did not grow in the 2008–09 school year. Stakeholders are currently exploring whether to fund the SPBP for a third year with money from the Obama administration’s American Recovery and Reinvestment Act (Hernandez, 2009).

Participating schools earn bonus awards if they meet performance targets established by the NYCDOE’s Progress Report Card system, which is the primary accountability program in the school district.[19] The Progress Report Card system evaluates each public school on the basis of three factors: student attendance and student, parent, and teacher perceptions of the school learning environment (15 percent); student performance on New York State’s high-stakes test in English language arts and mathematics (30 percent); and student progress in English language arts and mathematics (55 percent). All schools receive an overall Progress Report Card score and grade—from A to F—which is based on how well they performed in these three areas in comparison with a set of schools serving a similar population of students.[20] The Progress Report Card system then assigns each public school a performance target for the subsequent school year based on the rank of its overall Progress Report Card score.
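
To make the weighting concrete, the sketch below combines the three components using the percentages given above; treating each component as a subscore on a common 0–100 scale is an assumption made purely for illustration.

    # Illustrative combination of the three Progress Report Card components using
    # the weights described above: 15% environment/attendance, 30% performance,
    # 55% progress. Treating each component as a 0-100 subscore is an assumption.
    def overall_progress_report_score(environment, performance, progress):
        return 0.15 * environment + 0.30 * performance + 0.55 * progress

    print(overall_progress_report_score(environment=70, performance=60, progress=50))
    # 0.15*70 + 0.30*60 + 0.55*50 = 10.5 + 18.0 + 27.5 = 56.0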

Table 1 displays descriptive information on the relationship between overall performance-score rankings and performance targets. For example, if a school’s overall Progress Report Card score ranked it in the 75th percentile of schools—that is, in Category 2—its target improvement for the next year’s score would have been 12.5 points for the 2007–08 school year. In other words, the school’s overall performance target score for that school year was its overall performance score from the 2006–07 school year plus 12.5 points. Table 1 also displays the number and percentage of schools in our sample according to their Progress Report Card performance rankings and their target gains.

Schools participating in the SPBP that meet 100 percent of their performance target score receive $3,000 per UFT member in their school. Schools that meet 75 percent of their performance target score receive $1,500 per UFT member in their school.[21] As displayed in Table 2, of ninety-three SPBP-eligible schools meeting their performance target, sixty-five met 100 percent of their performance target and twenty-eight schools met 75 percent of their target. In total, $14.25 million was awarded to these schools, with bonus awards ranging between $51,000 and $351,000 per school (with an average award of $160,095 per school).
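
The award logic described in the preceding two paragraphs can be summarized in a short sketch. The example school, its 12.5-point target gain (the Category 2 figure from Table 1), and the reading of “75 percent of the performance target” as 75 percent of the target gain are assumptions for illustration.

    # Sketch of the SPBP award rule described above: the target is last year's
    # overall score plus the category-specific target gain; meeting the full
    # target pays $3,000 per UFT member and meeting 75 percent of it pays $1,500.
    # Interpreting "75 percent of the target" as 75 percent of the target gain
    # is an assumption.
    def spbp_bonus_pool(prior_score, current_score, target_gain, uft_members):
        if current_score >= prior_score + target_gain:
            return 3000 * uft_members
        if current_score >= prior_score + 0.75 * target_gain:
            return 1500 * uft_members
        return 0

    # Hypothetical Category 2 school (12.5-point target gain) with 60 UFT members.
    print(spbp_bonus_pool(prior_score=50.0, current_score=63.0,
                          target_gain=12.5, uft_members=60))  # 180000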

Nearly all schools in our sample entered the lottery because of the challenges posed by the nature of their student bodies, not their previous achievement.[22] All schools in the lottery served students with difficult backgrounds. As illustrated in Table 1, some schools were identified as being more effective than others at improving student outcomes (and earned a high number of points under the Progress Report Card system or a high grade under No Child Left Behind [NCLB]). Furthermore, the percentage of schools in our sample in each accountability rating category is similar to the percentage of all schools in the NYCDOE in those categories. The impact of the SPBP should be generalizable to schools of varying productivity with high percentages of disadvantaged student populations.

A school’s receipt of a bonus under the SPBP does not necessarily indicate that program eligibility caused improvement on the performance indicators: student attendance, the school learning environment, and performance and progress on high-stakes tests in mathematics and English language arts. The Progress Report Card system is going to identify high- and low-performing schools irrespective of the SPBP. We would expect some schools to meet their Progress Report Card targets and earn bonuses even if there were no treatment effects from the SPBP.

At the same time, since schools are assigned target gain scores on the basis of their overall Progress Report Card performance ranking, some schools may have a greater chance than others of meeting the bonus performance threshold. Table 2 reports the number and percentage of schools assigned by lottery to an SPBP treatment group according to their Progress Report Card category and how many schools in each category met all, some, or none of their performance target during the 2007–08 school year. It is clear that the great majority of Category 4 and Category 5 schools met 100 percent of their performance target, while only about half of Category 2 and Category 3 schools met at least 75 percent of their performance target. Furthermore, even though 65 percent of Category 1 schools met at least part of their performance target, 70 percent of Category 1 schools received an award for earning two consecutive A-grades—a performance metric that was established after the first year of the program concluded.

We also estimated a simple binomial logit model to understand the relationship between Progress Report Card categories and the probability that a school met part of its performance target. Specifically, we estimated the odds of a school meeting at least 75 percent of its performance target when controlling for school level, breakdown of students by race/ethnicity, and peer index rating. We found that the odds of a Category 4 or 5 school’s earning at least part of its performance bonus award are about ten times greater than the odds of a Category 2 or 3 school’s meeting part of its performance target. Category 1 schools are about two to three times more likely to earn a performance bonus than Category 2 or 3 schools. The difference is explained by the two consecutive A-grades that these Category 1 schools had earned.
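
A minimal sketch of the kind of binomial logit described above appears below; the file name, variable names, and exact set of controls are assumptions rather than the model actually estimated.

    # Illustrative version of the binomial logit described above: the outcome is
    # whether a school met at least 75 percent of its performance target, with
    # controls for school level, racial/ethnic composition, and peer index.
    # File and variable names, and the exact control set, are assumptions.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    schools = pd.read_csv("school_panel.csv")   # hypothetical school-level file

    logit = smf.logit(
        "met_75pct_target ~ C(report_card_category) + C(school_level)"
        " + pct_black + pct_hispanic + peer_index",
        data=schools,
    ).fit()

    print(np.exp(logit.params))  # odds ratios for each category relative to the baseline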

The SPBP stipulates that schools participating in the SPBP establish a site-based compensation committee to determine how bonus awards will be distributed to school personnel. Compensation committees consist of the school principal, an individual appointed by the principal, and two staff people who are UFT members. A school’s compensation committee “has complete discretion, without interference from either the [NYCDOE] or the UFT, to decide how to distribute the pool of bonus money available to the school. The compensation committee could choose to give every employee the same amount, give employees who did exceptional work more, give employees in one title (for instance, teachers) more, give employees who only worked a partial year less, etc.” (SPBP background document, August 1, 2008).

Table 3 provides descriptive statistics for the ninety-two elementary, middle, and K–8 schools that earned a performance-award bonus following the 2007–08 school year, and Figure 1 illustrates the range of award amounts in those schools. Each vertical bar represents a single school, its lower end being the minimum distributed award (other than zero) and its upper end being the maximum award distributed. Averaged at the school level, the mean bonus awarded to school personnel was $2,417. About three-quarters of all schools awarded a maximum individual bonus of $3,000 or less. When restricting the sample to only those school personnel classified as teachers, we find that the average bonus increases to $3,000, with more than 90 percent of all teachers receiving a bonus of between $2,500 and $3,500.

The average size of the bonus awards received by teachers in SPBP schools that met their performance target is around the size thought to be large enough to influence teacher behavior. For example, an average teacher bonus award of $3,000 is 45 percent of monthly base salary, or 5 percent of annual base salary, assuming a $60,000 average base salary and a nine-month pay period. Case studies suggest that bonus awards of 5–8 percent should be large enough to elicit a behavioral response (Odden, 2001). Furthermore, experimental studies that detected behavioral changes in response to teacher pay-for-performance interventions reported average bonus awards equivalent to about 40 percent of a single month’s base salary.
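
The arithmetic behind these percentages, under the stated assumption of a $60,000 average base salary paid over a nine-month period, is straightforward:

    # The arithmetic behind the percentages above, using the assumptions stated in
    # the text: a $60,000 average annual base salary paid over a nine-month period.
    annual_salary = 60_000
    monthly_salary = annual_salary / 9   # about $6,667 per month
    bonus = 3_000

    print(bonus / monthly_salary)        # 0.45 -> 45% of one month's base pay
    print(bonus / annual_salary)         # 0.05 -> 5% of annual base pay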


5. Data, Sample, and Random Assignment

5.a. Data

The data for this study come from multiple sources. Student-level data were provided by the NYCDOE’s Office of Accountability. The data set contains student demographic information, including race/ethnicity, special-education status, and English-language learner status. It also contains scores on New York State’s mathematics and English language arts (ELA) tests administered during the 2006–07 and 2007–08 school years. Using data for the universe of students in the NYCDOE, we standardized student test scores in math and ELA by grade and school year. A negative z-score indicates that the score is below the mean for all tested students in that subject, grade, and year, while a positive z-score indicates that the score is above the distribution mean.
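
A minimal sketch of this standardization step is shown below; the file and column names are assumptions.

    # Minimal sketch of the standardization described above: convert each raw math
    # score to a z-score relative to all tested NYCDOE students in the same grade
    # and school year. File and column names are assumptions.
    import pandas as pd

    scores = pd.read_csv("student_scores.csv")   # hypothetical file of raw scores

    by_cell = scores.groupby(["grade", "school_year"])["math_scale_score"]
    scores["math_z"] = (
        scores["math_scale_score"] - by_cell.transform("mean")
    ) / by_cell.transform("std")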

A second data set contained information on the SPBP. It identified eligible schools that voted in favor of, or against, participation. A separate annotated file provided details on both lotteries and documented any violations of the random assignment process between the first and second lotteries. The NYCDOE also provided a teacher-level file setting out the size of the actual bonus awards given in autumn 2008 to personnel who worked during the 2007–08 school year in an SPBP school that met at least 75 percent of its performance target or earned two consecutive A-grades.

We supplemented these files with school Progress Report Card data available on the NYCDOE website. Files contained aggregated data on student demographics, student attendance rates, and student enrollment, as well as information on the following accountability-system ratings: overall accountability, student performance, student progress, environment, engagement, communication, academic expectations, percentile rank, performance target score, and NCLB Adequate Yearly Progress (AYP) status.

We also obtained school-level data from a survey of students’, teachers’, and parents’ perceptions of the school learning environment. The surveys were administered from April 30 to June 6, 2007, and from April 4 to April 18, 2008. Surveys were sent to all parents and teachers, and to students in grades six through twelve. Response rates increased significantly from 2007 to 2008, with parents’ response rate increasing from 26 percent to 40 percent, teachers’ response rate increasing from 44 percent to 66 percent, and students’ response rate increasing from 65 percent to 78 percent (NYCDOE, 2008).

Finally, we downloaded and keyed data from quality reviews completed for all NYCDOE schools during the 2007–08 school year. The quality review process consists of trained teams of enumerators conducting a prereview of a school and then visiting that school for two to three days. School site visits included a thirty-minute campus tour, ten to fifteen classroom observations lasting twenty minutes each, and structured and unstructured interviews with teachers and students (NYCDOE, 2008). Enumerators assess schools on the basis of five criteria indicating relative quality, each of which contains seven ratings, the lowest being “underdeveloped” and the highest being “outstanding.”

5.b. Sample

Our sample includes 186 SPBP-eligible elementary, K–8, and middle schools and 137 control-condition schools over a two-year period comprising the baseline year (the 2006–07 school year) and the first treatment year (the 2007–08 school year). Student test scores are available in mathematics for more than 100,000 students in grades three through eight. We restrict the sample to schools identified on their Progress Report Cards as being an elementary, K–8, or middle school because test scores are unavailable in high school grades. We focus on student achievement in mathematics because the ELA test was administered weeks after the SPBP was implemented and before the distributional rules of the SPBP reward system had been finalized by each school’s compensation committee.

To qualify for the SPBP, schools had to be among the highest-need elementary, middle, and high schools in the NYCDOE. The NYCDOE determines a school’s “need” using a peer index ranking system in which elementary and K–8 school rankings are based on a composite measure of student demographic factors, such as the percentage of English-language learners, black students, Hispanic students, special-education students, and Title I free lunch-program students, and middle school and high school rankings are based on the average proficiency ratings in mathematics and ELA in a single grade (fourth grade for middle schools and eighth grade for high schools).

Table 4 displays descriptive statistics on demographic and performance measures for treatment schools, control-group schools, and all public schools in the NYCDOE. About 59 percent of schools eligible for the SPBP were elementary schools, 11 percent were K–8 schools, and 29 percent were middle schools. The average school size is slightly under 600 students, with elementary schools being modestly smaller than K–8 and middle schools. On average, more than 95 percent of these schools’ students are identified as Hispanic (56 percent) or black (41 percent), 19 percent are identified as English-language learners, and 22 percent receive some level of special-education services. Standardized scores in mathematics and reading are, respectively, approximately 0.36 and 0.37 standard deviations below the mean test scores in the district.

More than half of the SPBP-eligible schools in our sample (53 percent) were in good standing according to New York State’s NCLB accountability plan. Another 27 percent were restructuring, while approximately 19 percent were either in need of improvement or under corrective action. Interestingly, Progress Report Card grades assigned to schools by the NYCDOE’s accountability program suggest that schools in our sample are distributed more evenly: 23 percent of schools received an A; 32 percent received a B; 27 percent received a C; 10 percent received a D; and 9 percent received an F.

5.c. Random Assignment of Schools

The NYCDOE intended for approximately 200 schools (including high schools) to participate in the SPBP during the 2007–08 school year. It could not know in advance how many schools would need to be invited to reach that number, because schools randomly assigned (lotteried in) to the SPBP intervention still had to vote in favor of participation. Schools not randomly assigned (lotteried out) to the SPBP intervention were assigned to the control group. The NYCDOE’s Research and Policy Support Group therefore implemented a two-stage clustered randomized trial, which is summarized in Figure 2.

In early November 2007, the Research and Policy Support Group identified 429 schools meeting the eligibility criteria for the SPBP. Almost all of them (404) were entered into the first lottery, from which the Research and Policy Support Group randomly selected 233 schools and invited them to participate in the SPBP. Schools had six weeks to vote for or against participation. Of the initial 233 schools lotteried into the SPBP, 195 voted to participate, 35 voted not to participate, and 3 were excluded because of complicating factors.

In December 2007, the NYCDOE’s Research and Policy Support Group held a second lottery. The second lottery included only the 189 schools that were not selected during the first lottery. Twenty-one schools were randomly selected and then invited to participate in the SPBP. Nineteen of these schools voted in favor of participation, and two schools declined participation. In total, 254 of 404 schools entered into the lottery were randomly selected to participate in SPBP. Thirty-seven schools lotteried into the SPBP declined participation.

Figure 2 indicates a few irregularities in the lottery process. To begin with, 25 schools were barred from the lottery even though these schools, on the basis of observable characteristics, met the selection criteria for entering the SPBP lottery. These schools also were similar to those included in the lottery on the basis of observable student and school characteristics (see Tables 4 and 5). In a conversation with the authors, the NYCDOE indicated that the 25 schools were ruled ineligible prior to the lottery process. While their exclusion could impair the external validity of our findings, it should not have any effect on their internal validity.

Noncompliance in the form of “no-shows” and “crossovers” may blur the contrast in outcomes between treatment groups, understating the average SPBP treatment effect (Bloom, 2006). Thirty-seven schools that were lotteried into the treatment condition declined participation following a vote among school personnel (no-shows). Another eight schools were permitted to participate in the SPBP despite never having been lotteried into it (crossovers). The thirty-seven schools that were lotteried in but declined to participate were coded as having been deemed eligible for the policy but as not having participated. The eight schools that received treatment under special circumstances were coded as ineligible for participation but as having received treatment.[23] We address noncompliance by using the local average treatment effect (LATE) framework developed by Angrist, Imbens, and Rubin (1996), which is a refinement of Bloom’s (1984) average impact of treatment-on-the-treated strategy.

Our analytic strategy assumes that both the observed and unobserved characteristics of treatment and control schools are, on average, identical. We cannot, of course, verify that schools’ unobserved characteristics are identical. But we can test whether there are observed differences between the two groups of schools and, finding none, infer that they are likely to be similar in unobserved ways as well. We tested for differences on observables using the Kruskal-Wallis one-way analysis of variance, a nonparametric method for testing the equality of population medians among groups (Kruskal and Wallis, 1952).
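
As an illustration, a covariate-by-covariate balance check of this kind can be run with scipy's Kruskal-Wallis implementation. The sketch below assumes a school-level DataFrame with a hypothetical treatment indicator and covariate names loosely based on Table 4; it is a sketch of the procedure, not the authors' code.

```python
import pandas as pd
from scipy.stats import kruskal

def kruskal_balance_test(schools: pd.DataFrame, covariate: str,
                         group_col: str = "treatment"):
    """Kruskal-Wallis test that a baseline covariate has the same distribution
    across experimental groups (e.g., lotteried-in vs. lotteried-out schools)."""
    samples = [group[covariate].dropna().values
               for _, group in schools.groupby(group_col)]
    statistic, p_value = kruskal(*samples)
    return statistic, p_value

# Example: loop over the baseline covariates reported in Table 4.
# for cov in ["pct_black", "pct_hispanic", "pct_ell", "pct_sped", "enrollment"]:
#     print(cov, kruskal_balance_test(schools, cov))
```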

Table 4 displays descriptive statistics on demographic and performance measures by experimental status. We find that the schools assigned to the SPBP treatment (column 1) are statistically indistinguishable from the schools assigned to the control condition (column 2) on most demographic characteristics and performance measures in the baseline year (the 2006–07 school year). A slightly greater proportion of control-group schools than of SPBP-condition schools received a D-grade under the NYCDOE’s Progress Report Card system (0.17 vs. 0.10). We also find that a greater proportion of control-group schools than eligible schools were identified as being “in need of improvement” under NCLB (0.16 vs. 0.10).

Table 4 also displays the extent to which the group of eligible schools that voted in favor of participating in the SPBP differs from the group that chose not to participate. Interestingly, we find few differences between the observed characteristics of the eligible schools that voted to participate in the program and those of the schools that voted against the SPBP. Schools that voted to participate in the SPBP had slightly lower ELA scores in the 2006–07 school year (-0.37 vs. -0.27). Furthermore, schools that voted to participate in the SPBP were slightly more likely to have earned an F-grade under the Progress Report Card system (0.10 vs. 0.00) and to have been labeled “in need of improvement” under the NCLB accountability system (0.15 vs. 0.00) than schools in the nonparticipating group. All other observed demographic and performance characteristics of the participating and declining schools are statistically indistinguishable.

Table 5 displays descriptive statistics by experimental status for key constructs and response rates on the NYCDOE’s school learning-environment survey. We find that schools eligible for the SPBP (column 1) and those not lotteried into the SPBP (column 2) are similar in their scores on all characteristics during the baseline year (the 2006–07 school year). The same holds true for schools voting in favor of participating in the SPBP (column 3) and those schools that declined to participate (column 4). Furthermore, we do not detect any significant differences in student, parent, and teacher responses to the school learning-environment survey during the baseline year.

We used Hotelling’s T-test to determine whether there were baseline imbalances between schools eligible for the SPBP and control-condition schools (Hotelling, 1940). We say that the lottery is balanced if, after examining all the observable characteristics identified in Tables 4 and 5, we cannot on statistical grounds dismiss the possibility that the treatment group and the control group are the same. Hotelling’s T-test is the analog of a t-test when multiple variables are considered simultaneously. We fail to reject the hypothesis that the means of the treatment (column 1) and control (column 2) conditions are the same. Hotelling’s T-test also detects no significant differences in means between the eligible schools that voted to participate (column 3) and the eligible schools that declined (column 4).
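
Hotelling's two-sample test is not packaged in scipy, but it can be computed directly from the group means and pooled covariance matrix. The sketch below is a generic implementation of that calculation under stated assumptions (two school-by-covariate arrays with no missing values); it is not the authors' code.

```python
import numpy as np
from scipy.stats import f

def hotelling_two_sample(X1: np.ndarray, X2: np.ndarray):
    """Two-sample Hotelling's T-squared test that two groups (e.g., treatment
    and control schools) share the same mean vector across several covariates.
    X1, X2 are (n_schools, n_covariates) arrays with no missing values."""
    n1, p = X1.shape
    n2, _ = X2.shape
    mean_diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance matrix of the covariates.
    pooled_cov = ((n1 - 1) * np.cov(X1, rowvar=False) +
                  (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t_squared = (n1 * n2) / (n1 + n2) * mean_diff @ np.linalg.solve(pooled_cov, mean_diff)
    # Convert T-squared to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom.
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t_squared
    p_value = f.sf(f_stat, p, n1 + n2 - p - 1)
    return t_squared, f_stat, p_value
```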


6. Analytic Strategy

6.a. Average Impact of Intention to Treat

We first estimate the average impact of the SPBP on student achievement using a standard intention-to-treat (ITT) approach. An ITT analysis compares all schools lotteried into the SPBP with all schools lotteried out, regardless of whether a lotteried-in school actually elected to participate; approximately 14 percent of eligible schools in our sample did not. ITT estimates are relevant to policy because, by all accounts, if the SPBP is sustained in future years, imperfect treatment implementation is likely to continue. Thus, to judge the overall impact of the SPBP as implemented, the combined effect of the SPBP intervention and the effect of a school’s decision not to comply with the policy can be expressed as:

(1) ITT = E[Y|Z=1]-E[Y|Z=0]

where Z=1 indicates a school’s assignment to the SPBP intervention and Z=0 indicates a school’s assignment to the control condition. Subscripts are suppressed for simplicity.
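
In code, the unadjusted ITT contrast in equation (1) is simply a difference in mean outcomes by lottery assignment. The sketch below assumes a student-level DataFrame with hypothetical columns math_z (the standardized math score) and eligible (the Z indicator); it illustrates the estimand, not the authors' estimation code.

```python
import pandas as pd

def itt_difference_in_means(students: pd.DataFrame) -> float:
    """Unadjusted ITT estimate: mean standardized math score in lotteried-in
    schools (Z = 1) minus the mean in lotteried-out schools (Z = 0)."""
    group_means = students.groupby("eligible")["math_z"].mean()
    return group_means.loc[1] - group_means.loc[0]
```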

We also estimate a series of cross-sectional regression models to measure how student and school characteristics affected a student’s math test score in the 2007–08 school year. A binary variable is set to equal one if a student was enrolled in a school that was lotteried into an SPBP treatment group and zero if a student was enrolled in a school that was not lotteried into a treatment condition. The average impact of the ITT effect is reported with and without regression adjustments. The most inclusive estimates control for a large number of observable student- and school-level covariates. Our most basic estimation strategy controls only for student grade.

Because SPBP eligibility was determined by lottery, and commonly used tests indicate balance across observable student and school characteristics, we interpret the relationship between the SPBP intervention and student achievement in mathematics as a direct consequence of the SPBP intervention. More formally, ITT estimates of the SPBP intervention are given by the ordinary least squares estimate of δ4, which can be defined as:

(2) Y_ist = δ0 + δ1 f(Y_ist-1) + δ2 Student_ist + δ3 School_st + δ4 Eligible_s + ε_ist

where Y_ist represents the math test score of student i in school s at the end of program year t (April 2008); f(Y_ist-1) is a cubic function of the student’s math test score in that subject at the end of year t-1; Student is a vector of observable student-level variables, including race/ethnicity, special-education status, limited English proficiency (LEP) status, and so forth; School is a vector of observable school-level attributes, including level of schooling (elementary, middle, or K–8) and percentage of students by race/ethnicity and borough; Eligible is an indicator variable that equals one if student i is enrolled in a school that was lotteried into the SPBP intervention and zero if the school was not; and ε_ist is a stochastic error term; the subscript s reminds us that this random error is clustered by school.[24]

We also tested for differential SPBP treatment effects by student and school characteristics. For example, previous research has documented system gaming and opportunistic behavior among school personnel in response to high-powered incentive policies.[25] School personnel may respond strategically to the SPBP intervention because the availability of bonus awards is determined by a school’s overall Progress Report Card score, and schools can earn bonus points if high-needs students make exemplary progress on the high-stakes tests. We explore differential SPBP treatment effects by including in equation (2) a simple interaction term between Eligible and a particular student or school characteristic.

We typically report the average impact of ITT effects using a lagged achievement specification of (2), where the standardized form of a student’s previous test score in mathematics at time t-1 is an explanatory variable. Controlling for lagged achievement helps to account for unobservable student attributes such as prior knowledge that students bring to the classroom. We also control for a cubic polynomial of a student’s previous test score, which allows for the relationship between previous and current test scores to differ with reference to the student’s previous score. Furthermore, the lagged achievement specification does not impose a specific assumption about the rate of decay in student achievement over time.
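
A minimal sketch of this lagged-achievement ITT specification using statsmodels is shown below. The formula and column names (math_z, math_z_lag, eligible, school_id, and the covariate dummies) are hypothetical stand-ins for the variables described above, and the sketch uses analytic cluster-robust standard errors rather than the bootstrapped errors the authors report (see endnote 24).

```python
import statsmodels.formula.api as smf

def estimate_itt(students):
    """Regression-adjusted ITT: current math z-score on a cubic in the lagged
    score, student and school covariates, and the lottery indicator, with
    standard errors clustered by school (the unit of randomization)."""
    formula = (
        "math_z ~ eligible"
        " + math_z_lag + I(math_z_lag**2) + I(math_z_lag**3)"  # cubic in prior score
        " + C(race) + sped + ell + C(grade)"                   # student covariates
        " + C(school_type) + C(borough)"                       # school covariates
    )
    # Drop rows with missing values so the cluster groups align with the estimation sample.
    data = students.dropna()
    model = smf.ols(formula, data=data)
    return model.fit(cov_type="cluster", cov_kwds={"groups": data["school_id"]})

# res = estimate_itt(students); res.params["eligible"] is the ITT estimate (delta_4 in equation 2).
```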

6.b. Impact of Treatment on the Treated

The ability of schools lotteried into the SPBP intervention group to vote against participation means that the ITT effect does not directly measure the intervention effect on schools that adopted the policy. A handful of schools that were never lotteried into the program received SPBP treatment, which complicates estimating the direct effect on student achievement. Specifically, the presence of forty-three noncompliant schools in the sample (thirty-five no-show and eight crossover schools) may be responsible for the understatement of the average SPBP treatment effect. We therefore estimate models according to a form of the treatment on the treated (TOT) framework developed by Bloom (1984) and advanced by Angrist, Imbens, and Rubin (1996) and others.

Bloom (1984) developed a strategy for estimating the average impact of TOT when no-shows are present in an experimental design (i.e., subjects are assigned to treatment but do not participate). A TOT approach isolates the impact of the SPBP intervention on the subset of schools lotteried into the SPBP condition that actually received the treatment, and it then compares the achievement scores of students enrolled in schools participating in the SPBP with those of the sample of students enrolled in schools that were not lotteried into the SPBP. In contrast to the basic ITT approach identified in equation (2), the TOT effect can be expressed as:

(3) ITT = E[D|Z=1]·TOT + [1 - E[D|Z=1]]·0 = E[D|Z=1]·TOT

where Z=1 if the school was lotteried into the SPBP intervention and Z=0 otherwise, and D=1 for schools that receive the SPBP treatment and D=0 for those that do not.[26]
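
To make the Bloom adjustment concrete, consider a purely illustrative example (the numbers are hypothetical, not estimates from this study). If the ITT estimate were 0.02 standard deviations and 86 percent of lotteried-in schools actually participated, roughly the participation rate in this sample, the implied TOT effect would be TOT = ITT / E[D|Z=1] = 0.02 / 0.86 ≈ 0.023 standard deviations: somewhat larger in magnitude than the ITT estimate, because the same total impact is attributed only to the schools that actually received the treatment.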

However, the TOT effect assumes that schools assigned to the control group did not participate in the SPBP treatment. Not accounting for crossovers may produce a downward bias in estimates of the average treatment effect. We therefore estimate the local average treatment effect (LATE) developed by Angrist, Imbens, and Rubin (1996). LATE not only accounts for the lack of participation among schools randomly assigned to the SPBP treatment group (no-shows); it also adjusts for SPBP participation by schools that were not lotteried into the treatment group but participated in the SPBP nonetheless (crossovers).[27] As noted in Bloom (2006), the LATE can be expressed as:

(4) LATE = (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0]) = ITT / (E[D|Z=1] - E[D|Z=0])

Equation (4) is equivalent to the Wald estimator, a special case of an instrumental variables (IV) strategy. An IV strategy captures the effect of the SPBP intervention on compliers, that is, schools whose participation in the SPBP was determined by their lottery assignment. The LATE is estimated by two-stage least squares: a first-stage regression estimates each school’s probability of receiving the SPBP treatment as a function of its lottery assignment, and that estimated probability then replaces the treatment indicator in a second-stage, student-level regression model. More formally, to establish the probability that a school actually received the SPBP treatment, our first-stage regression model takes the form:

(5) T_st = π0 + π1 School_st + π2 Eligible_s + ν_st

where T_st indicates the school’s actual participation in the SPBP during the first year of implementation (the 2007–08 school year), and all other variables are as previously defined. We then use the resulting coefficient estimates π0, π1, and π2 to establish the probability that each school received the SPBP treatment.

The instrument in equation (5) is the variable Eligible, which indicates whether the school was lotteried into the SPBP treatment condition. Because program eligibility was determined randomly, the School variables add little: the estimated probabilities that a school was actually treated are nearly identical whether or not we include them, and the coefficient on Eligible, π2, is very similar to the percentage of eligible schools that voted to participate in the policy. For the sake of completeness, however, we continue to include the School variables in all estimates reported below. The first-stage (or instrumenting) model is estimated at the school level because schools, not students, were randomly assigned to the treatment condition; estimating the first stage at the student level would imply that individual students within a school had different probabilities of receiving treatment depending on their observed characteristics. The estimated probability that a school received treatment, T, is merged onto the student achievement data file. We then estimate the impact of the SPBP on student achievement using the estimated probability that the student’s school received the treatment, which can be expressed as:

(6) Y_ist = θ0 + θ1 f(Y_ist-1) + θ2 Student_ist + θ3 School_st + θ4 T_s + ε_ist

where all variables are as previously defined in equation (5) and the coefficient on the probability of treatment, θ4, provides a consistent estimate of the impact of the actual SPBP treatment on student mathematics proficiency.
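
A minimal sketch of the two-stage procedure described above is given below, using hypothetical column names (participated, eligible, school_id, and so on). It reproduces the logic of equations (5) and (6), a school-level first stage followed by a student-level second stage using the predicted participation probability, but it is not the authors' code; in particular, the paper reports bootstrapped standard errors (see endnote 24), which this sketch omits.

```python
import statsmodels.formula.api as smf

def estimate_late(schools, students):
    """Two-stage procedure: (i) school-level linear probability model of actual
    SPBP participation on lottery assignment and school covariates (equation 5);
    (ii) student-level outcome regression in which the predicted participation
    probability replaces the Eligible indicator (equation 6)."""
    # First stage, estimated at the school level because schools, not students,
    # were randomly assigned.
    first = smf.ols("participated ~ eligible + C(school_type) + C(borough)",
                    data=schools).fit()
    schools = schools.assign(t_hat=first.fittedvalues)

    # Merge the predicted probability of treatment onto the student file.
    students = students.merge(schools[["school_id", "t_hat"]], on="school_id").dropna()

    # Second stage: the coefficient on t_hat is the LATE estimate (theta_4).
    second = smf.ols(
        "math_z ~ t_hat + math_z_lag + I(math_z_lag**2) + I(math_z_lag**3)"
        " + C(race) + sped + ell + C(grade) + C(school_type) + C(borough)",
        data=students,
    ).fit(cov_type="cluster", cov_kwds={"groups": students["school_id"]})
    return first, second
```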


7. Average Impact of the School-Wide Performance Bonus Program

Table 6 presents results for a series of estimates of the impact of the SPBP on student achievement in mathematics. Panel A reports ITT estimates with and without regression adjustments. Estimates of the ITT effect indicate no significant relationship between SPBP eligibility and student performance in mathematics. The sign on the coefficient estimates is always negative but never significant at conventional levels. Panel B indicates that the same holds true when we use an IV strategy to estimate the LATE: the average treatment effect in the subpopulation of compliant schools is statistically indistinguishable from zero at conventional levels.

We are also interested in whether particular student subgroups benefit more from the SPBP. We test for differential effects by interacting Eligible with a binary student demographic variable; the differential LATE is estimated by interacting the predicted treatment from equation (5), T, with the same student demographic variables. These subgroups matter because NYC’s Progress Report Card system gives schools extra credit, or bonus points, when a high-needs student makes exemplary gains on the state’s high-stakes tests. Using a basic χ² test, we also report whether the estimated interaction coefficients are jointly equal to zero.
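
As an illustration, the interaction specification and the joint test can be sketched as follows; column names are hypothetical, and the joint test is implemented as a Wald (χ²) test of the interaction coefficients under the cluster-robust covariance matrix. This is a sketch of the approach described above, not the authors' code.

```python
import numpy as np
import statsmodels.formula.api as smf

def differential_effect_by_race(students):
    """Interact the lottery indicator with race dummies and jointly test
    whether all of the interaction coefficients equal zero."""
    data = students.dropna()
    res = smf.ols(
        "math_z ~ eligible * C(race)"
        " + math_z_lag + I(math_z_lag**2) + I(math_z_lag**3)"
        " + sped + ell + C(grade)",
        data=data,
    ).fit(cov_type="cluster", cov_kwds={"groups": data["school_id"]})

    # Build a restriction matrix selecting every eligible-by-race interaction term.
    names = list(res.params.index)
    rows = [i for i, name in enumerate(names) if name.startswith("eligible:C(race)")]
    R = np.zeros((len(rows), len(names)))
    for r, i in enumerate(rows):
        R[r, i] = 1.0
    joint_test = res.wald_test(R, use_f=False)  # chi-squared test that all interactions are zero
    return res, joint_test
```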

Table 7 reports the average impact of the ITT effect (column 1) and the LATE (column 2), allowing for heterogeneous treatment effects by student race. Regression-adjusted estimates indicate no discernible differences across student racial groups. Estimates are robust irrespective of whether we control for student- and school-level covariates or exclude a student’s previous test score in mathematics from the regression equation. Furthermore, both the ITT effect and the LATE are robust when student attainment (rather than the lagged-achievement, or value-added, approach) is the dependent variable in comparing schools lotteried into the SPBP with those randomly assigned to the control condition.

Table 8 reports results when we allow the estimate of the SPBP treatment effect to vary by student ability, where ability is defined by the quartile of a student’s previous test score in the tested subject. ITT estimates do not provide much evidence of the SPBP’s benefiting students of a particular ability group. We find no statistical difference in the performance of students according to the quartile of their baseline math score. However, students in the third quartile scored, on average, 0.0328 standard-deviation units below the typical student enrolled in a school participating in the SPBP. Students whose previous achievement scores were in the bottom performance quartile also performed worse than expected. Furthermore, we find that the LATE estimates in Panel B of Table 8 are qualitatively similar to estimates reported for the average impact of the ITT effect.

We also examined whether achievement scores in mathematics varied by school type. Our sample includes three types of schools: elementary, middle, and K–8 schools. Approximately 60 percent of students enrolled in SPBP-eligible schools attend an elementary school, while about 29 percent attend middle schools and 11 percent attend K–8 schools. Panel A of Table 9 displays estimates of the ITT effect. We do not find a significant difference in achievement between students enrolled in SPBP treatment schools and those enrolled in schools assigned to the control condition. Furthermore, the TOT estimates indicate that the mathematics achievement gains of students at schools that actually received the treatment are not statistically different from those at untreated schools. Estimates reported in Table 9 are similar whether or not we adjust for student or school characteristics or for a student’s previous achievement score.

In sum, we find no evidence of a significant SPBP treatment effect during the first partial year of implementation. Perhaps this is unsurprising. There was a limited window of opportunity for school personnel working in schools that were lotteried into the SPBP to respond to the SPBP intervention (less than three months), assuming that school personnel were disposed to respond to the program. We will repeat these analyses following the 2009–10 school year, when more years of data become available, including scores on student achievement in ELA.

8. Potential Mediators of the Treatment Effects

This section focuses on potential mediators of the SPBP treatment effect, including school size and the rigor of the performance target that a school has to meet to earn a performance bonus award. We also examine the association between institutional and organizational practices and student achievement in mathematics, as measured in surveys of student, teacher, and parent perceptions of the school learning environment. Finally, we examine data from appraisals of institutional practices conducted by an independent team of experienced educators.

8.a. Differential Impact by School Size

The SPBP is a group incentive program, and school size may affect the strength of the incentives it offers school personnel in SPBP-eligible schools (Kandel and Lazear, 1992). Our intuition is that, in larger schools, the probability that social penalties can influence group performance diminishes. Further, an individual teacher in a large school has relatively little direct impact on the school’s overall performance, which is the unit of accountability; in smaller schools, the impact of an individual teacher’s performance is proportionately larger. It may also be easier for teachers in larger schools to free-ride, while in a smaller school the group incentive may shape teacher behavior in ways closer to those of an individual-level pay-for-performance program.

To evaluate whether there is a differential SPBP treatment effect by school size, we interact the Eligible variable with school size. School size is defined as the number of unique students with a valid mathematics test score in the 2007–08 school year. The mean school size was slightly fewer than 600 students, with a standard deviation of approximately 260. The regression model is a modified form of equation (2) and can be expressed as:

(7) Y_ist = π0 + π1 f(Y_ist-1) + π2 Student_ist + π3 School_st + π4 Size_s + π5 Eligible_s + π6 (Eligible_s × Size_s) + ε_ist

where Size is the number of students in a school with a valid test score in mathematics during the baseline school year, and all other variables are as previously defined. The estimate on the interaction term, π6, indicates the direction and strength of the association between school size and student achievement gains in mathematics.

We also estimate the differential effect of actually receiving the treatment by school size using a two-stage least squares regression. Following the four types of compliance behaviors identified by Angrist et al. (1996), we replace the Eligible variable in (2) with estimates of a school’s probability of participating in the SPBP, generated from a linear probability model. We then run a second-stage regression model in which the predicted participation probability from the first stage, T, is used to estimate the relationship between school size and student achievement. Because a weak instrument can reduce the precision of the estimators, we report regression-adjusted estimates.

Panel A of Table 10 reports estimates of the differential effect of SPBP eligibility by school size. Model (1) suggests that student achievement gains in mathematics in SPBP-eligible schools tend to decline as school size increases. The coefficient on the average effect of SPBP eligibility is no longer significant at the 10 percent level in Model (3) of Panel B. Furthermore, there is a negative and significant relationship between school size and receipt of the SPBP treatment, which suggests that some schools participating in the SPBP may be large enough for participation to have a negative effect on students’ achievement gains in mathematics.

Table 10 also reports estimates of the association between treatment status and school size measured in quartiles. Models (2) and (4) compare student achievement gains in Quartile 1, Quartile 2, and Quartile 4 schools with the achievement gains of students enrolled in Quartile 3 schools, using an interaction term between the SPBP indicator and a dummy variable for each quartile. Interestingly, the estimates suggest that students enrolled in SPBP schools in the quartile containing the largest schools performed worse than students enrolled in control-group schools in the same quartile. Average estimates of both the ITT effect and the LATE are statistically significant at the 5 percent level.

We can also see this effect in the model incorporating an interaction between overall enrollment and treatment. Differentiating (7) with respect to Eligible shows that the overall impact of SPBP eligibility in this model is π5 + π6·Size. Setting this expression equal to zero and solving for Size (that is, Size* = -π5/π6) recovers the school size at which the estimated effect of treatment is no longer positive. Doing so for the ITT model in math yields a school size of 529 students, the point at which the coefficient estimate for SPBP eligibility reaches zero and then turns negative with the enrollment of every additional student.

We performed a series of χ² tests to identify the points at which any positive effect (when school size is below 529) and any negative impact (when school size is greater than 529) are statistically different from zero. We find that school size must drop to under 120 students to produce a positive treatment effect that is statistically significant at the 10 percent level. No school in our sample is this small. We also find that the overall SPBP treatment effect becomes significantly negative in schools with more than 693 students, which represent about 30 percent of the schools in our sample.[28]

These results are inconsistent with economic theory, which suggests that the relationship between school size and the treatment effect may be spurious. Previous theoretical models hypothesize an inverse relationship between school size and the positive effects of a group incentive program like the SPBP, but they offer no practical explanation for why larger groups would be harmed by the program. At most, we would expect some schools to be large enough to neutralize the positive effects of a group incentive program, in which case teachers’ behavior at large schools should simply return to its nontreatment norm (i.e., a treatment effect indistinguishable from zero). We plan to revisit these analyses when data from the 2008–09 school year become available.

8.b. Differences in Target Score to Earn Bonus Awards and Treatment Response

Variation in the performance target that an SPBP-eligible school must meet to receive a bonus award is an interesting design feature of the SPBP. Schools in the treatment condition receive bonus awards if they make significant improvements under the NYCDOE’s Progress Report Card system. More specifically, schools that ended the 2006–07 school year with higher overall point totals than the rest of the city’s schools and their peer group were required to make smaller point gains in the following year to receive a bonus than were schools with lower overall scores (see Table 1). In effect, a school’s target gain score affects the strength of the incentives acting upon it.

We take advantage of this variation to evaluate whether the targets set by the SPBP might send a signal to schools about the amount of effort they need to exert to raise their students’ test scores. It is possible that schools that needed to make greater gains tried harder than schools with easier targets. However, there is also a chance that schools discouraged by targets that seemed unattainable would end up expending less effort than schools with easier targets.

Although schools were not randomly assigned improvement targets, we take advantage of the nonlinear structure of the performance targets reported in Table 1 to examine whether there is a differential response attributable to the particular performance targets defined by the SPBP intervention. Discrete performance thresholds facilitate an RD design within the context of the randomized evaluation design. Under certain reasonable assumptions, we can estimate how the perceived rigorousness of performance targets affects student achievement in mathematics.[29]

Our analytic strategy follows the RD framework described in Rouse et al. (2007), and subsequently applied to a number of education-related studies (Winters, Greene, and Trivitt, 2008; Winters, 2008; Rockoff and Turner, 2008). We add a number of independent variables to equation (2), including a cubic function for the number of points earned by a student’s school during the 2006–07 school year, dummy variables indicating the performance target category that a school needed to reach to earn a bonus award, and an interaction between school target score category and the SPBP treatment. The regression model can be expressed as:

(8) Y_ist = ψ0 + ψ1 f(Y_ist-1) + ψ2 Student_ist + ψ3 School_st + ψ4 Eligible_s + ψ5 Cat_s + ψ6 (Cat_s × Eligible_s) + g(Percentile_s) + ε_ist


where g(Percentile) is a cubic function of the percentile of the school’s overall points (less the additional bonus points), relative to the rest of the city’s schools of its type, under the Progress Report Card system in the 2006–07 school year, which was used to assign schools to target categories; Cat is a vector of binary variables indicating which of the five target levels of performance the school was required to meet in order to receive a bonus; and all other variables are as previously defined.

The estimated coefficients on the vector of interaction terms, ψ6, indicate any differential SPBP treatment effect between an included and an excluded category. In addition, we want to recover the estimated treatment impact on students enrolled in schools in each individual category. Differentiating (8) with respect to Eligible, we see that the overall treatment effect on an eligible school in a particular category is the sum of the coefficient for eligibility and the coefficient on the interaction term for that category (ψ4 + ψ6). We measure this relationship for each category and test its significance with a χ² test.
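
The spirit of equation (8) can be sketched with a single formula-based regression, as below. The actual category definitions come from Table 1, so the column names here (target_cat for the performance-target category and percentile for the school's Progress Report percentile) are hypothetical placeholders; this is a sketch of the specification, not the authors' code.

```python
import statsmodels.formula.api as smf

def target_category_specification(students):
    """Equation (8)-style model: cubic in the school's Progress Report percentile,
    dummies for the performance-target category, and category-by-eligibility
    interactions. The effect for a given category is the coefficient on eligible
    plus the relevant interaction (psi_4 + psi_6 in the notation above)."""
    data = students.dropna()
    formula = (
        "math_z ~ eligible * C(target_cat)"
        " + percentile + I(percentile**2) + I(percentile**3)"  # g(Percentile)
        " + math_z_lag + I(math_z_lag**2) + I(math_z_lag**3)"
        " + C(race) + sped + ell + C(grade) + C(school_type) + C(borough)"
    )
    return smf.ols(formula, data=data).fit(
        cov_type="cluster", cov_kwds={"groups": data["school_id"]})
```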

The Eligible variable continues to be identified by random assignment. The identifying assumption for estimating the variables in the vector Cat is that there is no difference in school performance represented in the target-level category that is not conveyed in a cubic function of the percentile of the number of points that a school earned under the Progress Report Card system. We can then interpret the estimated effect of a school’s being in a particular category (and thus facing a particular performance target) as the causal influence of assignment to that categorization (and consequently, the different performance target attached to the categorization) on student achievement. Furthermore, we can interpret the interaction of Cat and Eligible as a consistent estimate of the differential impact of the SPBP on schools facing varying performance targets.

The basic idea behind this technique is to take advantage of the cutoffs that determine which scoring target a school is assigned. In essence, the technique compares the performance of students in schools that fell just barely on either side of a benchmark cutoff. The cutoffs on the point scale that determine a school’s performance category are set at somewhat arbitrary points: they convey little, if any, information about a school’s effectiveness that is not already represented in its overall Progress Report Card score or in the percentile of that score. Though schools with similar point totals may be similar in their effectiveness, the rank that their overall performance score gives them determines the target that the school must meet in order to earn a bonus award under the SPBP.

Although RD designs are a powerful evaluation technique, several limitations are worth mentioning. RD designs focus on a highly localized impact of the SPBP—that is, on schools that are very close to either side of the cutoffs. These estimates will not necessarily hold globally—that is, for all schools. Furthermore, RD designs require much larger sample sizes than random-assignment designs to produce impact estimates with sufficient statistical power (Cappelleri et al., 1994; Schochet, 2008a, 2008b; Bloom et al., 2005).

It is also worth emphasizing here that while all New York City schools are given performance targets, and thus could be affected by them, this analysis is not particularly concerned with the overall impact of the targets themselves on schools in our sample. We use the estimate on the interaction term to focus on the differential response of schools to performance target thresholds, not to recover the overall impact of the performance target scores. Even though it is plausible that both treatment and control schools would be affected by how they were categorized, only SPBP-eligible schools had the additional incentive of a performance bonus award if they met their performance target, which is what we focus on here.

The results from estimating various forms of (8) in mathematics are displayed in Table 11. None of the models finds that any kind of SPBP treatment makes a significant difference, regardless of a school’s Progress Report Card target score.

8.c. Pay for Performance and School Learning Environment

The results above indicate that on average, the SPBP treatment had no effect on student achievement in mathematics. However, student attendance and student, parent, and teacher perceptions of the school learning environment account for 15 percent of a school’s overall Progress Report Card score. Thus, schools may have sought to increase scores in ways unrelated to advancing student achievement.

We measure directly whether SPBP-eligible schools made larger improvements in their overall Progress Report Card scores, as well as in the individual components of those scores. We use publicly available school-level data to estimate regressions that explain the number of points a school earned on its 2007–08 Progress Report Card as a function of the SPBP treatment and observed school characteristics (including the number of points earned in the 2006–07 school year). We estimate equations taking the form:

(9) Points_st = λ0 + λ1 Points_st-1 + λ2 School_st + λ3 Eligible_s + ε_st

where, depending on the specification, Points is a school’s overall Progress Report Card score or its score on a component of the Progress Report Card system, such as the environment score, the progress score, or extra credit earned (bonus points). All other variables are as previously defined.
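
A sketch of the school-level regression in equation (9) is shown below, with hypothetical column names; points_2008 stands in for whichever overall or component score is being modeled, and the covariates are placeholders for the observed school characteristics described above. It is illustrative only.

```python
import statsmodels.formula.api as smf

def progress_report_regression(schools, outcome="points_2008"):
    """Regress a school's 2007-08 Progress Report score (overall or a component)
    on lottery assignment, its 2006-07 score, and observed school characteristics."""
    formula = (
        f"{outcome} ~ eligible + points_2007"
        " + C(school_type) + C(borough) + enrollment"
        " + pct_black + pct_hispanic + pct_ell + pct_sped"
    )
    data = schools.dropna()
    return smf.ols(formula, data=data).fit(cov_type="HC1")  # heteroskedasticity-robust SEs
```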

Table 12 displays results comparing SPBP-eligible and comparison-group schools on the individual components that make up a school’s overall Progress Report Card score. We find no relationship between SPBP eligibility and any component score of the school grading system. Nonetheless, the regression model evaluating a school’s performance score (Model 3 of Table 12) is of particular interest. A school’s performance score is determined by the percentage of its students meeting particular proficiency benchmarks on the New York State high-stakes mathematics and ELA tests. It might be thought that SPBP-eligible schools would respond to the importance of this component by focusing their efforts on students falling just short of the proficiency benchmarks.[30] However, the lack of statistical differences between the performance scores of SPBP-eligible schools and those of control-group schools suggests that eligible schools have not responded in this way.

We also compare school-level scores from the student, teacher, and parent school learning-environment surveys. Recall that a school’s learning-environment survey score accounts for 15 percent of its overall Progress Report Card score. If schools responded to SPBP eligibility, the learning-environment survey might therefore reflect some of those short-run responses. We also evaluated the individual components of the student, teacher, and parent survey results, all of which are reported in Table 13. Once again, we find no difference between SPBP-eligible and control-group schools on any component of the teacher, student, or parent surveys.


9. Summary and Conclusion

In this paper we present evidence on the impact of the NYCDOE’s SPBP during the program’s first year of implementation. Because funding the program at every school that met the SPBP eligibility criteria would have required more money than was budgeted, the NYCDOE’s Research and Policy Support Group assigned schools to the SPBP intervention by random lottery. Our evaluation design takes advantage of the fact that schools were randomly lotteried into the SPBP intervention.

Our findings suggest that the SPBP has had negligible short-run effects on student achievement in mathematics. The same holds true for intermediate outcomes such as student, parent, and teacher perceptions of the school learning environment. We also find little evidence that the treatment effect differed on the basis of student or school characteristics. One exception is the differential effect of SPBP eligibility by school size, which suggests that student performance in larger schools decreased when the SPBP was implemented. The potential moderating effect of school size on the direction and/or strength of the relationship between the SPBP and mathematics achievement will be revisited when data from the 2008–09 school year become available.

Although a well-implemented experimental evaluation design would suggest that our estimates have strong internal validity, readers should interpret these initial findings with caution when considering the possible impact of this or any other program. First, the estimates presented here are of the short-run effects of the SPBP, which may limit our ability to identify any aspect or degree of the program’s effectiveness. Schools learned that they were eligible for the program less than three months before New York State’s high-stakes mathematics tests were administered. An evaluation of the SPBP’s impact following the 2008-09 school year should provide much more reliable information.

Furthermore, readers should not lose sight of the fact that additional experimental and quasi-experimental evaluations of various forms of teacher compensation reform are needed. Pay-for-performance programs can vary along several design dimensions, including the unit of accountability, the performance measure, the incentive structure, and the way bonuses are distributed. The education policy community needs to study a greater number of forward-thinking school systems such as the NYCDOE before it can build a knowledge base large enough to permit sound policy decisions on whether teacher pay-for-performance is a useful strategy for enhancing teacher effectiveness and school quality.


Endnotes
  1. A number of school districts and states in the United States have recently adopted performance-related compensation reforms. Performance is part of compensation packages in the Denver, New York City, Dallas, and Houston public school systems. Florida, Minnesota, and Texas allocate over $550 million to incentive programs that reward teacher performance. The U.S. Congress advanced policy dialogues around teacher compensation reform: first, in 2006, with the appropriation of $495 million over a five-year period to provide Teacher Incentive Fund grants to select districts and states across the country; and in 2009, with part of a massive economic stimulus package earmarking around $200 million for the development and implementation of teacher pay-for-performance programs.

    High-profile teacher pay-for-performance plans have also been implemented abroad: for example, Chile’s Sistema Nacional de Evaluación del Desempeño de los Establecimientos Educacionales (SNED) (Mizala and Romaguera, 2003); Mexico’s Carrera Magisterial (McEwan and Santibanez, 2005; Santibanez et al., 2007); programs developed by Israel’s Ministry of Education (Lavy, 2002, 2007); and experiments in Andhra Pradesh, India (Muralidharan and Sundararaman, 2008), and in the Busia and Teso districts of western Kenya (Glewwe, Ilias, and Kremer, 2008).
  2. See, for example, Holmstrom and Milgrom (1991), Baker (1992), and Prendergast (1999).
  3. A few pay-for-performance experiments are running concurrently in the U.S. public school system. The National Center on Performance Incentives has implemented an individual teacher incentive program in Nashville and a team-level incentive program in Round Rock, Texas. Mathematica Policy Research, Inc. is evaluating a five-year demonstration project examining the impact of the Teacher Advancement Program in the Chicago Public Schools.
  4. For more information on the role of teacher associations and collective bargaining agreements in teacher compensation reform, see Eberts (2007), Koppich (2008), Goldhaber (forthcoming), and Hannaway and Rotherham (2008).
  5. The NYCDOE secured funding from private sources to operate the SPBP during the first year of implementation. The district also appropriated public funding for year two of the program. Heinrich (2004) notes: “Districts and states rarely provide consistent funding for these programs, significantly reducing their motivational value.”
  6. Sager (2009) contends that New York City should “take it up a notch” by implementing an individual-based incentive system.
  7. The SPBP was formally announced on October 23, 2007. The randomization of schools into treatment- and control-group conditions was announced in November and December of the same year. New York State’s high-stakes English language-arts exams were administered from January 8 to 17, 2008. The high-stakes mathematics tests were implemented two months later (March 4–11, 2008).
  8. Kremer et al. (2004) reported the average absence rate for teachers in Andhra Pradesh, India was about 25 percent while only about half of the teachers in a nationally representative sample of government primary schools in India were actually teaching when external enumerators conducted unannounced visits. Teacher absenteeism in the United States is around five or six percent (Ehrenberg et al., 1991; Ballou, 1996; Podgursky, 2003; Clotfelter, Ladd, and Vigdor, 2007; Miller, Murnane, and Willett, 2007).
  9. In a case study of Safelite Glass Corporation, Lazear (2000) estimated that the compensation system’s transition from hourly wages to piece rates was associated with a 44 percent increase in productivity (as measured by individual worker output per month). Interestingly, half of this effect was attributed to workers becoming more motivated, an incentive effect; the other half resulted from the sorting of more able workers largely through the hiring process.
  10. For more information on the relationship between teacher incentive programs and teacher mobility, see Taylor and Springer (2009), Springer et al. (2008), Springer et al. (2009), and Clotfelter, Glennie, Ladd, and Vigdor (2008).
  11. A growing body of education research documents dysfunctional behavior in response to high-stakes accountability programs, including systematically excluding low-scoring students from testing, reclassifying students’ assignment to particular subgroups, altering student answer sheets, and focusing on marginally performing students. See, for example, Cullen and Reback (2006), Figlio and Getzler (2002), Figlio and Winicki (2002), Jacob and Levitt (2003), and Neal and Schanzenbach (forthcoming).
  12. Neal (forthcoming) contends that it is important to come up with incentive pay designs specially suited to public education. He recommends rank-ordered tournaments of comparable schools that measure and reward school-wide performance. He identifies three challenges that the design of incentive-pay systems face: (1) defining the intended outcomes of public education; (2) the inability of existing assessment tools to identify and measure the contribution of specific teachers or schools to student learning; and (3) the lack of true market forces in the public education system.
  13. It is important to note that RD studies generate highly localized estimates of a treatment effect, and estimates tend to be low-powered in many applications because they are reliant on a subset of observations immediately above and below a cutoff point.
  14. Unlike other incentive programs discussed in this section of the paper, ICSIP awarded teachers with prizes rather than cash bonuses. As noted by Glewwe, Ilias, and Kremer (2008), the ICSIP awarded prizes such as a suit worth about $50, plates, glasses and cutlery worth about $40, a tea set worth about $30, and bed linens and blankets worth about $25.
  15. For a discussion of RD designs, see Thistlewaite and Campbell (1960); Hahn, Todd, and van der Klaauw (2001); and Lee and Lemieux (2009).
  16. Hanushek (2003) provides a critical review of evidence on input-based schooling policies in the United States and abroad.
  17. The NCPI, a state and local policy research and development center funded by the U.S. Department of Education’s Institute of Education Sciences, was established in 2006 to conduct independent and scientific studies on the individual and institutional effects of pay-for-performance programs and other incentive policies. The NCPI is located at Vanderbilt University’s Peabody College and core institutional partners include the RAND Corporation and the University of Missouri – Columbia. More information can be found at www.performanceincentives.org.
  18. More information on the TAP can be found at www.talentedteachers.org. For a recent, non-experimental evaluation of the TAP see Springer, Ballou, and Peng (2008). The Center for Educator Compensation reform also provides an overview of a related program in Chicago’s Public Schools (http://www.cecr.ed.gov/initiatives/profiles/pdfs/Chicago.pdf).
  19. Although performance targets were eliminated from the Progress Report Card system for the 2008–09 school year, the NYCDOE and the UFT elected to use the same metric from 2007-08 school year for schools participating in the second year of the SPBP. For an evaluation of the NYCDOE Progress Report Card system, see Rockoff and Turner (2008) and Winters (2008).
  20. A school can also earn bonus points, which are added to its overall Progress Report Card score when high-needs students make exemplary progress on New York State’s high-stakes tests. The Progress Report Card system identifies five categories of high-needs students: (1) any student identified as having special needs; (2) any student identified as being limited English proficient; (3) Hispanic students in the bottom third of all NYCDOE students; (4) black students in the bottom third of all NYCDOE students; (5) all other students in the bottom third of all NYCDOE students. “Exemplary” gains are those in the highest 40 percent of all student gains per school type in the NYCDOE. For more information, see New York City Public Schools (2007).
  21. In June 2008, the NYCDOE and the UFT announced a third way that schools participating in the SPBP could earn a bonus award: by achieving two consecutive A-grades under the Progress Report Card system. Doing so entitles them to receive $1,500 per UFT member. However, this alternative does not have any bearing on our analysis of the first year because schools were unaware of the policy during the school year and thus could not have responded to it.
  22. Middle schools are the exception. Middle schools were identified on the basis of their average proficiency ratings in mathematics and English language arts (ELA) in the fourth grade. Our sample contains fifty-five middle schools in the treatment sample and forty-one middle schools in the control sample. These schools make up 29.63 percent of schools in our sample.
  23. Eight additional schools were offered treatment for some “special case.” School personnel at six of these schools voted to participate in SPBP. Special-case schools were not entered into a lottery, so we removed them from our sample.
  24. When estimating equation (4) and all other equations at the student level, we calculate standard errors using the bootstrap method with 300 iterations. Among other advantages, the bootstrap method calculates consistent standard errors in light of potential autocorrelation in regression models, such as the value-added specification, that included a lagged dependent variable as a regressor (Cameron and Trivedi, 2006; Mackinnon, 2002).
  25. See Endnote 17.
  26. As noted by Bloom (2006), equation (3) shows that the average effect of ITT equals the weighted mean of the TOT effect for schools that were lotteried into and participated in the SPBP and of zero for the no-show schools, where the weights are the SPBP treatment receipt rate (E[D|Z=1]) and the no-show rate (1 - E[D|Z=1]). Equation (3) therefore implies that TOT = ITT/E[D|Z=1].
  27. The LATE is also known as the complier-average causal effect of treatment (CACE).
  28. We experimented with polynomials of school size and found that the relationship between school size and treatment was quite linear, so we keep the more parsimonious model here.
  29. McEwan and Santibanez (2005) and Santibanez et al. (2008) implemented a similar approach when evaluating Mexico’s Carrera Magisterial.
  30. For studies on educational triage in response to high-stakes accountability systems, see Booher-Jennings (2005), Neal and Schanzenbach (forthcoming), Reback (forthcoming), Ballou and Springer (2008), and Springer (2007).
References

Angrist, J., G. Imbens, and D. Rubin. (1996). “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association, 91 (434), 444–55.

Baker, G. (1992). “Incentive Contracts and Performance Measurement.” Journal of Political Economy, 100 (3), 598–614.

Ballou, D. (1996). “Do Public Schools Hire the Best Applicants?” Quarterly Journal of Economics, 111 (1), 97–133.

———, and M. G. Springer. (2008). “Achievement Trade-Offs and No Child Left Behind.” Mimeograph. Nashville, TN: Vanderbilt University.

Besley, T., and S. Coate. (1995). “The Design of Income Maintenance Programs.” Review of Economic Studies, 62 (1), 187–221.

Bloom, H. (1984). “Accounting for No-Shows in Experimental Evaluation Designs.” Evaluation Review, 8 (2), 225–46.

———. (2006). “The Core Analytics of Randomized Experiments for Social Research.” Working Paper. New York: Manpower Demonstration Research Corporation.

———, L. Richburg-Hayes, and A. Black. (2005). “Using Covariates to Improve Precision: Empirical Guidance for Studies That Randomize Schools to Measure the Impacts of Educational Interventions.” Working Paper. New York: Manpower Demonstration Research Corporation.

Booher-Jennings, J. (2005). “Below the Bubble: ‘Educational Triage’ and the Texas Accountability System.” American Educational Research Journal, 42 (2), 231–68.

Bowles, S, and H. Gintis. (2002). “Social Capital and Community Governance.” Economic Journal, 112 (November), F419-F436.

Cameron, A., and P. Trivedi. (2006). Microeconometrics: Methods and Applications. New York: Cambridge University Press.

Cappelleri, J., R. Darlington, and W. Trochim. (1994). “Power Analysis of Cutoff-Based Randomized Clinical Trials.” Evaluation Review, 18 (2), 141–52.

Clotfelter, C., E. Glennie, H. Ladd, and J. Vigdor. (2008). “Would Higher Salaries Keep Teachers in High-Poverty Schools? Evidence from a Policy Intervention in North Carolina.” Journal of Public Economics, 92 (5–6), 1352–70.

Clotfelter, C., H. Ladd, and J. Vigdor. (2007). “Are Teacher Absences Worth Worrying About in the U.S.?” Working Paper 13648. Cambridge, Mass.: National Bureau of Economic Research.

———. (2005). “Who Teaches Whom? Race and the Distribution of Novice Teachers.” Economics of Education Review, 24 (4), 377–92.

Courty, P., and G. Marschke. (2004). “An Empirical Investigation of Gaming Responses to Explicit Performance Incentives.” Journal of Labor Economics, 22 (1), 23–56.

Cullen, J., and R. Reback. (2006). “Tinkering toward Accolades: School Gaming under a Performance Accountability System.” Working Paper 12286. Cambridge, Mass.: National Bureau for Economic Research.

Deci, E. (1975). Intrinsic Motivation. New York: Plenum.

Deutsch, M. (1985). Distributive Justice: A Social-Psychological Perspective. New Haven, Conn.: Yale University Press.

Eberts, R. W. (2007). “Teacher Unions and Student Performance: Help or Hindrance?” The Future of Children, 17 (1), 175–200.

Ehrenberg, R., and G. Milkovich. (1987). “Compensation and Firm Performance.” In Human Resources and the Performance of Firms, ed. M. Kleiner. Madison, Wisc.: Industrial Relations Research Association.

Ehrenberg, R., and R. Smith. (1994). Modern Labor Economics: Theory and Public Policy, 5th ed. New York: Harper Collins College Publishers.

Ehrenberg, R., D. Rees, and E. Ehrenberg. (1991). “School District Leave Policies, Teacher Absenteeism, and Student Achievement.” Journal of Human Resources, 26 (1), 72–105.

Figlio, D., and L. Getzler. (2002). “Accountability, Ability and Disability: Gaming the System?” Working Paper 9307. Cambridge, Mass.: National Bureau for Economic Research.

Figlio, D., and J. Winicki. (2002). “Food for Thought? The Effects of School Accountability Plans on School Nutrition.” Working Paper 9319. Cambridge, Mass.: National Bureau for Economic Research.

Frey, B. (1997). “A Constitution for Knaves Crowds Out Civic Virtues.” Economic Journal, 107 (443), 1043–53.

Glazerman, S., et al. (2007). Options for Studying Teacher Pay Reform Using Natural Experiments. Washington, D.C.: Mathematica Policy Research.

Glewwe, P., N. Ilias, and M. Kremer. (2008). “Teacher Incentives.” Mimeograph, Cambridge, Mass: Harvard University.

———. (2008). Teacher Incentives in the Developing World. Mimeograph. Cambridge, Mass.: Harvard University.

Goldhaber, D. (Forthcoming). “The Politics of Teacher Pay Reform.” In Performance Incentives: Their Growing Impact on American K–12 Education, ed. M. G. Springer. Washington, D.C.: Brookings Institution Press.

Green, J., and N. Stokey. (1983). “A Comparison of Tournaments and Contracts.” Journal of Political Economy, 91 (3), 349–64.

Hahn, J., P. Todd, and W. van der Klaauw. (2001). “Identification and Estimation of Treatment Effects with a Regression Discontinuity Design.” Econometrica, 69, 201–9.

Hamilton, B. H., J. A. Nickerson, and H. Owan. (2003). “Team Incentives and Worker Heterogeneity: An Empirical Analysis of the Impact of Teams on Productivity and Participation.” Journal of Political Economy, 111 (3), 465–97.

Hannaway, J., and J. Rotherham. (2008). Collective Bargaining in Education and Pay for Performance. Nashville: National Center on Performance Incentives.

Hanushek, E. (2003). “The Failure of Input-Based Resource Policies.” Economic Journal, 113, F64–68.

Heinrich, C. (2004). “Outcomes-Based Performance Management in the Public Sector: Implications for Government Accountability and Effectiveness.” Public Administration Review, 62 (6), 712–25.

Hernandez, J. C. (2009). “New Education Secretary Visits Brooklyn School.” New York Times, February 19. Retrieved from http://cityroom.blogs.nytimes.com/2009/02/19/new-education-secretary-visits-brooklyn-school/.

Holmstrom, B., and P. Milgrom. (1991). “Multitask Principal-Agent Analysis: Incentive Contracts, Asset Ownership, and Job Design.” Journal of Law, Economics, and Organization, 7, 24–52.

Hotelling, H. (1940). “The Selection of Variates for Use in Prediction with Some Comments on the General Problem of Nuisance Parameters.” Annals of Mathematical Statistics, 11, 271–83.

Jacob, B., and S. Levitt. (2003). “Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating.” Quarterly Journal of Economics, 118, 843–77.

Johnson, S. (1986). “Incentives for Teachers: What Motivates, What Matters.” Educational Administration Quarterly, 22 (3), 54–79.

Kandel, E., and E. P. Lazear. (1992). “Peer Pressure and Partnerships.” Journal of Political Economy, 100 (4), 801–17.

Kelley, C., and K. Finnigan. (2004). “Teacher Compensation and Teacher Workforce Development.” Yearbook of the National Society for the Study of Education, 103, 253–73.

Koppich, J. (2008). “Toward a More Comprehensive Model of Teacher Pay.” Nashville: National Center on Performance Incentives.

Kremer, M., et al. (2004). “Teacher Absence in India.” Working Paper. Washington, D.C.: World Bank.

Kruskal, W., and W. Wallis. (1952). “Use of Ranks in One-Criterion Variance Analysis.” Journal of the American Statistical Association, 47 (260), 583–621.

Ladd, H. F., ed. (1996). Holding Schools Accountable: Performance-Based Reform in Education. Washington, D.C.: Brookings Institution.

Ladd, H. (2001). “School-Based Education Accountability Systems: The Promise and the Pitfalls.” National Tax Journal, 54 (2), 385–400.

Lavy, V. (2002). “Evaluating the Effect of Teachers’ Group Performance Incentives on Pupil Achievement.” Journal of Political Economy, 110 (6), 1286–1317.

———. (Forthcoming). “Performance Pay and Teachers’ Effort, Productivity and Grading Ethics.” American Economic Review.

Lawler, E. (1981). Pay and Organization Development. Reading, Mass.: Addison-Wesley.

Lazear, E. P. (2000). “Performance Pay and Productivity.” American Economic Review, 90, 1346–61.

———. (1998). Personnel Economics for Managers. New York: Wiley.

———, and S. Rosen. (1981). “Rank-Order Tournaments as Optimum Labor Contracts.” Journal of Political Economy, 89 (5), 841–64.

Lee, D., and T. Lemieux. (2009). “Regression Discontinuity Designs in Economics.” Working Paper 14723. Cambridge, Mass.: National Bureau of Economic Research.

Lepper, M., and D. Greene. (1978). The Hidden Costs of Reward. Hillsdale, N.J.: Erlbaum.

Lortie, D. (1975). Schoolteacher. Chicago: University of Chicago Press.

MacKinnon, J. (2002). “Bootstrap Inference in Econometrics.” Canadian Journal of Economics, 35, 615–35.

McEwan, P., and L. Santibañez. (2005). “Teacher and Principal Incentives in Mexico.” In Incentives to Improve Teaching: Lessons from Latin America, ed. E. Vegas, 213–53. Washington, D.C.: World Bank Press.

Milgrom, P., and J. Roberts. (1992). Economics, Organization, and Management. Englewood Cliffs, N.J.: Prentice-Hall.

———. (1990). “Rationalizability, Learning, and Equilibrium in Games with Strategic Complementarities.” Econometrica, 58 (6), 1255–77.

Miller, R., J. Murnane, and J. Willett. (2007). “Do Teacher Absences Impact Student Achievement? Longitudinal Evidence from One Urban School District.” Working Paper 13356. Cambridge, Mass.: National Bureau of Economic Research.

Mizala, A., and P. Romaguera. (2003). “Scholastic Performance and Performance Awards.” Working Paper. Centro de Economía Aplicada, Universidad de Chile.

Muralidharan, K., and V. Sundararaman. (2008). “Teacher Incentives in Developing Countries: Experimental Evidence from India.” Working Paper. Nashville: National Center on Performance Incentives.

Murnane, R. J., and D. K. Cohen. (1986). “Merit Pay and the Evaluation Problem: Why Most Merit Plans Fail and a Few Survive.” Harvard Educational Review, 56 (1), 1–17.

Nalbantian, H., and A. Schotter. (1997). “Productivity under Group Incentives: An Experimental Study.” American Economic Review, 87 (3), 314–41.

Neal, D. (Forthcoming). “Designing Incentive Systems for Schools.” In Performance Incentives: Their Growing Impact on American K–12 Education, ed. M. G. Springer. Washington, D.C.: Brookings Institution Press.

———, and D. Schanzenbach. (Forthcoming). “Left Behind by Design: Proficiency Counts and Test-Based Accountability.” Review of Economics and Statistics.

New York City Public Schools. (2007). “Educator Guide: New York City Progress Report. Elementary/Middle School.” Retrieved from http://schools.nyc.gov/NR/rdonlyres/DEFA8A3D-7BB8-4502-BEFC-F977FB206542/43571/ProgressReportEducatorGuide_EMS_091608.pdf.

———. (2008, July 1). “Parent, Teacher, Student Learning Environment Surveys: 2008 Citywide Results.” Retrieved from http://schools.nyc.gov/NR/rdonlyres/4C0235D3-AE5A-4E9B-98F4-8B4F5697497F/40759/lesresults.pdf.

Odden, A. (2001). “Rewarding Expertise.” Education Matters, 1 (1), 16–25.

Ortiz-Jiménez, M. (2003). “Carrera Magisterial: Un proyecto de desarrollo profesional.” Cuadernos de Discusión 12. Mexico City: Secretaría de Educación Pública.

Pfeffer, J. (1995). Competitive Advantage through People: Unleashing the Power of the Workforce. Boston: Harvard Business School Press.

———, and N. Langston. (1993). “The Effect of Wage Dispersion on Satisfaction, Productivity, and Working Collaboratively: Evidence from College and University Faculty.” Administrative Science Quarterly, 38, 382–407.

Podgursky, M. (2003). “Fringe Benefits.” Education Next, 3 (3), 71–76.

———, and M. Springer. (2007). “Teacher Performance Pay: A Review.” Journal of Policy Analysis and Management, 26 (4), 909–49.

Prendergast, C. (1999). “The Provision of Incentives in Firms.” Journal of Economic Literature, 37, 7–63.

Reback, R. (Forthcoming). “Teaching to the Rating: School Accountability and the Distribution of Student Achievement.” Journal of Public Economics, 92 (5–6), 1394–1415.

Rockoff, J., and L. Turner. (2008). “Short-Run Impacts of Accountability on School Quality.” Working Paper 14564. Cambridge, Mass.: National Bureau of Economic Research.

Rosen, S. (1986). “The Theory of Equalizing Differences.” In Handbook of Labor Economics, ed. O. Ashenfelter and R. Layard, vol. 1. Amsterdam: North-Holland.

Rouse, C., et al. (2007). “Feeling the Florida Heat? How Low-Performing Schools Respond to Voucher and Accountability Pressure.” CALDER Working Paper. Washington, D.C.: Urban Institute.

Sager, R. (2009). “Prez’s Challenge to NYC Teachers.” New York Post, March 2. Retrieved March 31, 2009.

Santibañez, L., et al. (2007). “Breaking Ground: Analysis of the Assessment System and Impact of Mexico’s Teacher Incentive Program ‘Carrera Magisterial.’ ” RAND Corporation.

Schochet, P. (2008a). “Statistical Power for Random Assignment Evaluations of Education Programs.” Journal of Educational and Behavioral Statistics, 33 (1), 62–87.

———. (2008b). “The Late Pretest Problem in Randomized Control Trials of Education Interventions.” Jessup, Md.: National Center for Education Evaluation and Regional Assistance.

Springer, M. (2007). “Accountability Incentives: Do Failing Schools Practice Educational Triage?” Education Next, 8 (1), 74–79.

———, D. Ballou, and A. Peng. (2008). “Impact of the Teacher Advancement Program on Student Test Score Gains: Findings from an Independent Appraisal.” Working Paper. Nashville: National Center on Performance Incentives.

Springer, M., et al. (2009). “Governor’s Educator Excellence Grant (GEEG) Program: Year Two Evaluation Report.” Nashville: National Center on Performance Incentives.

———. (2008). “Texas Educator Excellence Grant (TEEG) Program: Year Two Evaluation Report.” Nashville: National Center on Performance Incentives.

Taylor, L., and M. Springer. (2009). “Optimal Incentives for Public Sector Workers: The Case of Teacher-Designed Incentive Pay in Texas.” Working Paper. Nashville: National Center on Performance Incentives.

———, and M. Ehlert. (Forthcoming). “Characteristics and Determinants of Teacher-Designed Pay for Performance Plans: Evidence from Texas’ Governor’s Educator Excellence Grant Program.” In Performance Incentives: Their Growing Impact on American K–12 Education, ed. M. G. Springer. Washington, D.C.: Brookings Institution Press.

Thistlewaite, D., and D. Campbell. (1960). “Regression-Discontinuity Analysis: An Alternative to the Ex-Post Facto Experiment.” Journal of Educational Psychology, 51, 309–17.

Winters, M. (2008). “Grading New York: An Evaluation of New York City’s Progress Report Program.” Manhattan Institute.

———, J. Greene, and J. Trivitt. (2008). “Building on the Basics: The Impact of High-Stakes Testing on Student Proficiency in Low-Stakes Subjects.” Manuscript. Manhattan Institute.

Zenger, T. (1992). “Why Do Employers Only Reward Extreme Performance? Examining the Relationships among Performance, Pay, and Turnover.” Administrative Science Quarterly, 37 (2), 198–219.

 

