In early 2009, the United States was engaged in an intense public debate over a proposed $800 billion stimulus bill designed to boost economic activity through government borrowing and spending.
James Buchanan, Edward Prescott, Vernon Smith and Gary Becker, all Nobel laureates in economics, argued that while the stimulus might be an important emergency measure, it would fail to improve economic performance. Nobel laureates Paul Krugman and Joseph Stiglitz, on the other hand, argued that the stimulus would improve the economy and indeed that it should be bigger. Fierce debates can be found in frontier areas of all the sciences, of course, but this was as if, on the night before the Apollo moon launch, half of the world's Nobel laureates in physics were asserting that rockets couldn't reach the moon and the other half were saying that they could.
Prior to the launch of the stimulus program, the only thing that anyone could conclude with high confidence was that several Nobelists would be wrong about it.
Unlike physics or biology, the social sciences have not demonstrated the capacity to produce a substantial body of useful, nonobvious and reliable predictive rules about what they study: that is, human social behavior, including the impact of proposed government programs. The missing ingredient is controlled experimentation, which is what allows science positively to settle certain kinds of debates.
How do we know that our physical theories concerning the wing are true? In the end, not because of equations on blackboards or compelling speeches by famous physicists but because airplanes stay up. Social scientists may make claims as fascinating and counterintuitive as the proposition that a heavy piece of machinery can fly, but these claims are frequently untested by experiment, which means that debates like the one in 2009 will never be settled.
Over many decades, social science has groped toward the goal of applying the experimental method to evaluate its theories for social improvement. Recent developments have made this much more practical, and the experimental revolution is finally reaching social science. The most fundamental lesson that emerges from such experimentation to date is that our scientific ignorance of the human condition remains profound. Very few social-program interventions can be shown in controlled experiments to create real improvement in outcomes of interest.
An experiment is the (always imperfect) attempt to demonstrate a cause-and-effect relationship by holding all potential causes of an outcome constant, consciously changing only the potential cause of interest and then observing whether the outcome changes. Scientists may try to discern patterns in observational data in order to develop theories. But central to the scientific method is the stricture that such theories should ideally be tested through controlled experiments before they are accepted as reliable.
Thanks to scientists like Galileo and methodologists like Francis Bacon, the experimental method became widespread in physics and chemistry. Later, it invaded the realm of medicine. Though comparisons designed to determine the effect of medical therapies have appeared around the globe many times over thousands of years, James Lind is conventionally credited with executing the first clinical trial in the modern sense of the term. In 1747, he divided 12 scurvy-stricken crew members on the British ship Salisbury into six treatment groups of two sailors each. He treated each group with a different therapy, tried to hold all other potential causes of change to their condition as constant as possible, and observed that the two patients treated with citrus juice showed by far the greatest improvement.
The fundamental concept of the clinical trial has not changed in the 250 years since. Scientists attempt to find two groups of people alike in all respects possible, apply a treatment to one group (the test group) but not to the other (the control group), and ascribe the difference in outcome to the treatment. The power of this approach is that the experimenter doesn't need a detailed understanding of the mechanism by which the treatment operates; Lind, for example, didn't have to know about vitamin C and human biochemistry to conclude that citrus juice somehow ameliorated scurvy.
But clinical trials place an enormous burden on being sure that the treatment under evaluation is the only difference between the two groups. And as experiments began to move from fields like classical physics to fields like therapeutic biology, the number and complexity of potential causes of the outcome of interest (what I term “causal density”) rose substantially. It became difficult even to identify, never mind actually hold constant, all these causes.
In 1884, the brilliant but erratic American polymath C.S. Peirce hit upon a solution when he randomly assigned participants to the test and control groups. Random assignment permits a medical experimentalist to conclude reliably that differences in outcome are caused by differences in treatment. That's because even causal differences among individuals of which the experimentalist is unaware (say, a genetic predisposition to a disease) should be roughly equally distributed between the test and control groups, and therefore not bias the result.
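The balancing effect of random assignment can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the 30 percent share of people carrying a hidden trait, and the function name are all hypothetical and appear nowhere in the research described here.

```python
import random


def simulate_random_assignment(n=10_000, trait_rate=0.3, seed=0):
    """Randomly split n people into test and control groups and report
    what share of each group carries a trait the experimenter cannot see.

    All parameters are illustrative assumptions, not real study data.
    """
    rng = random.Random(seed)

    # Hidden trait (for example, a genetic predisposition) unknown to the experimenter.
    has_trait = [rng.random() < trait_rate for _ in range(n)]

    # Random assignment: shuffle everyone, then cut the list in half.
    people = list(range(n))
    rng.shuffle(people)
    half = n // 2
    test_group, control_group = people[:half], people[half:]

    test_share = sum(has_trait[i] for i in test_group) / len(test_group)
    control_share = sum(has_trait[i] for i in control_group) / len(control_group)
    return test_share, control_share


test_share, control_share = simulate_random_assignment()
print(f"hidden trait in test group:    {test_share:.3f}")
print(f"hidden trait in control group: {control_share:.3f}")
```

With groups this large, the two shares come out close to each other even though the assignment never looked at the trait, which is the property that lets the experimenter attribute a difference in outcomes to the treatment. With very small groups the balance is much rougher.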
In theory, social scientists, too, can use that approach to evaluate proposed government programs. In the social sciences, such experiments are normally termed “randomized field trials,” or RFTs. From the late 1960s through the early 1980s, RFTs often attempted to evaluate entirely new programs or large-scale changes to existing ones, considering such topics as the negative income tax, employment programs, housing allowances and health insurance.
By about a quarter-century ago, however, it had become obvious to sophisticated experimentalists that the idea that we could settle a given policy debate with a sufficiently robust experiment was naive. The reason had to do with generalization, which is the Achilles' heel of any experiment, whether randomized or not.
In medicine, for example, what we really know from a given clinical trial is that this particular list of patients who received this exact treatment delivered in these specific clinics on these dates by these doctors had these outcomes, as compared with a specific control group. But when we want to use the trial's results to guide future action, we must generalize them into a reliable predictive rule for as-yet-unseen situations.
Even if the experiment was correctly executed, how do we know that our generalization is correct?
A physicist generally answers that question by assuming that predictive rules like the law of gravity apply everywhere, even in regions of the universe that have not been subject to experiments, and that gravity will not suddenly stop operating one second from now. No matter how many experiments we run, we can never escape the need for such assumptions.
Even in classical therapeutic experiments, the assumption of uniform biological response is often a tolerable approximation that permits researchers to assert, say, that the polio vaccine that worked for a test population will also work for human beings beyond the test population.
But we cannot safely assume that a literacy program that works in one school will work in all schools. Just as high causal densities in biology created the need for randomization, even higher causal densities in the social sciences create the need for even greater rigor when we try to generalize the results of an experiment.
Criminology provides an excellent illustration of the way experimenters have grappled with the problem. Crime, like any human social behavior, has complex causes and is therefore difficult to predict reliably. Though criminologists have repeatedly used the nonexperimental statistical method called regression analysis to try to understand the causes of crime, regression doesn't even demonstrate good correlation with historical data, never mind predict future outcomes reliably.
So since the early 1980s, criminologists have increasingly turned to randomized experiments. One of the most widely publicized tried to determine the best way for police officers to handle domestic violence. In 1981 and 1982, Lawrence Sherman, a respected criminology professor at the University of Cambridge, randomly assigned one of three responses to Minneapolis cops responding to misdemeanor domestic-violence incidents: They were required to arrest the assailant, to provide advice to both parties or to send the assailant away for eight hours.
The experiment showed a statistically significant lower rate of repeat calls for domestic violence for the mandatory-arrest group. The media and many politicians seized upon what seemed like a triumph for scientific knowledge, and mandatory arrest for domestic violence rapidly became a widespread practice in many large jurisdictions in the United States.
But sophisticated experimentalists understood that because of the issue's high causal density, there would be hidden conditionals to the simple rule that “mandatory-arrest policies will reduce domestic violence.” The only way to unearth these conditionals was to conduct replications of the original experiment under a variety of conditions. So researchers replicated the RFT six times in cities across the country. In three of those studies, the test groups exposed to the mandatory-arrest policy again experienced a lower rate of re-arrest than the control groups did. But in the other three, the test groups had a higher re-arrest rate.
Why? In 1992, Sherman surveyed the replications and concluded that in stable communities with high rates of employment, arrest shamed the perpetrators, who then became less likely to re-offend; in less stable communities with low rates of employment, arrest tended to anger the perpetrators, who would therefore be likely to become more violent.
The problem with this kind of conclusion, though, is that it is not itself the outcome of an experiment: How do we know if it is right? By running an experiment to test it: that is, by conducting still more RFTs in both kinds of communities and seeing if they bear it out. Only if they do can we stop this seemingly endless cycle of tests begetting more tests.
Even then, the very high causal densities that characterize human society guarantee that no matter how refined our predictive rules become, there will always be conditionals lurking undiscovered.
So what have we learned about reducing crime? First, that most promising ideas have not been shown to work reliably. Second, that nuisance abatement, which is at the core of what is often called “Broken Windows” policing, tentatively appears to work. Even that conclusion needs qualification: It's a safe bet that there is some jurisdiction in the United States where even Broken Windows would fail.
Experimentation does not create absolute knowledge but rather changes both the burden and the standard of proof for those who disagree with its findings.
At the same time that the social sciences began struggling with the problem of dismayingly high causal densities, the same problem was being addressed by the business world. A key event occurred in 1988, when Rich Fairbank and Nigel Morris left a small strategy-consulting firm where the three of us worked to found credit card company Capital One.
The company was designed precisely as an application of the experimental method to business, and that method quickly permeated Capital One, to an extent never before seen. Suppose marketers wanted to know whether a credit card solicitation would meet with greater success if it was mailed in a blue envelope or in a white one. Rather than debate the question, the company would simply mail, say, 50,000 randomly selected households the solicitation in a blue envelope and 50,000 randomly selected households the same solicitation in a white envelope, and then measure the relative profitability of the resulting customer relationships from each group.
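One way to read the result of such an envelope test is a standard two-proportion z-test. The sketch below is hypothetical in every specific: the response counts are invented, the function name is illustrative, and it compares response rates as a simplification, whereas the company described here measured the relative profitability of the resulting customer relationships.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two_sided_p_value) for the difference between two rates,
    using the pooled-variance normal approximation."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b

    # Pooled rate under the assumption that envelope color makes no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 50,000 blue envelopes with 1,100 responses,
# 50,000 white envelopes with 1,000 responses.
z, p_value = two_proportion_z_test(1_100, 50_000, 1_000, 50_000)
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Because the households were assigned to envelopes at random, a small p-value suggests the envelope itself, rather than some difference between the two groups of households, produced the gap in response.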
By 2000, Capital One was reportedly running more than 60,000 tests per year. And by 2009, it had gone from an idea in a conference room to a public corporation worth $35 billion, transforming not only the credit card industry but most financial services marketed through direct channels.
The Internet is even better for experimentation than the direct-mail and telemarketing channels that Capital One originally used. Executing a randomized experiment (say, to determine whether a pop-up ad should appear in the upper-left or upper-right corner of a webpage) is close to costless on a modern e-commerce platform. The leaders in this sector, such as Google, Amazon and eBay, are inveterate experimenters. These days, experimentation is something that one assumes from a successful online commerce company.
What businesses have figured out is that they can deal with the problem of causal density by scaling up the testing process. Run enough tests, and you can find predictive rules that are sufficiently nuanced to be of practical use in the very complex environment of real-world human decision making.
Many of the same techniques that businesses use to lower the cost per test (integration with operational data systems, standardization of test design and so on) could be applied to social policy experiments. In fact, they were applied in a limited way during the execution of more than 30 randomized experiments during the welfare-reform debate of the 1990s, which was one of the most fruitful sequences of social policy experiments ever done. Businesses have demonstrated that the concept of replication of field experiments can be pushed much further than most social scientists had imagined.
And what do we know from the social-science experiments we have already conducted? After reviewing experiments not just in criminology but also in welfare-program design, education and other fields, I propose that three lessons emerge consistently from them.
First, few programs can be shown to work in properly randomized and replicated trials. We should be very skeptical of claims for the effectiveness of new, counterintuitive programs and policies, and we should be reluctant to trump the trial-and-error process of social evolution in matters of economics or social policy.
Second, within this universe of programs that are far more likely to fail than succeed, programs that try to change people are even more likely to fail than those that try to change incentives.
A litany of program ideas designed to push welfare recipients into the workforce failed when tested in those randomized experiments of the welfare-reform era; only adding mandatory work requirements succeeded in moving people from welfare to work in a humane fashion. And mandatory work-requirement programs that emphasize just getting a job are far more effective than those that emphasize skills-building.
Similarly, the list of failed attempts to change people to make them less likely to commit crimes is almost endless (prisoner counseling, transitional aid to prisoners, intensive probation, juvenile boot camps), but the only program concept that tentatively demonstrated reductions in crime rates in replicated RFTs was nuisance abatement, which changes the environment in which criminals operate.
And third, there is no magic. Those rare programs that do work usually lead to improvements that are quite modest, compared with the size of the problems they are meant to address or the dreams of advocates.
Experiments are surely changing the way we conduct social science. It is tempting to argue that we are at the beginning of an experimental revolution in social science that will ultimately lead to unimaginable discoveries. But we should be skeptical of that argument.
At the moment, we do not have anything remotely approaching a scientific understanding of human society. And the methods of experimental social science are not close to providing one within the foreseeable future. Science may someday allow us to predict human behavior comprehensively and reliably. Until then, we need to keep stumbling forward with trial-and-error learning as best we can.
Original Source: http://www.dallasnews.com/sharedcontent/dws/dn/opinion/points/stories/dn-100903-manzi.edi.baee91ba.html