A series of small wars has recently erupted between researchers in labor, development, and other fields of economics. The disputes center on follow-up studies claiming that some earlier study’s result cannot be “replicated.”
Replication is the bedrock of science. The pathological science underlying “cold fusion” was uncovered after one lab claimed to generate boundless energy from a cheap, room-temperature nuclear reaction but other labs could not reproduce the result. Likewise in social science, when others learn that a famous result could not be replicated, they wonder about the original researchers, suspecting at best carelessness or freak chance, and at worst deliberate deception or outright fraud.
This ambiguity makes the claim of a failed replication extremely serious. Heated, protracted controversies are common. The original authors (usually more senior) often attribute the new study’s results to differences of method that disqualify it as a replication, and complain of perverse incentives for attention-getting “gotcha” studies. The new study’s authors (usually more junior) complain of intimidating blowback that deters others from doing any replication studies at all.
Failed “replications” carry a stinging stigma
A new IZA Discussion Paper by Michael Clemens proposes one modest step to move beyond this impasse: social science needs a single, clear definition of what a “replication” is, so that everyone knows what it means when a study “fails to replicate” some earlier result.
The definition is simple but technical: a follow-up study replicates an original study when it estimates a parameter drawn from the same “sampling distribution” as the original estimate. In other words, the methods of the original study and the follow-up should be so similar that, if each were repeated countless times, the two sets of estimates would be practically indistinguishable.
For example, if a follow-up study substantially alters the statistical methods or the sampled population of the original study, this definition makes it wrong to say that the original results could not be replicated. One could say instead that the original results were not robust to reanalysis with different methods, or to extending the data to a different population. But those classifications do not carry the stinging stigma of a failed “replication”.
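To make the distinction concrete, here is a minimal simulation sketch in Python, under invented assumptions: a normal population, a sample size of 200, and parameter values chosen purely for illustration; none of this comes from the paper itself. Each “study” draws one sample and estimates its mean. Repeating the original design and a faithful follow-up design many times traces out practically the same sampling distribution, while a follow-up that samples a different population does not:

```python
import numpy as np

def run_study(population_mean, n=200, seed=None):
    """One hypothetical 'study': draw a sample of size n and estimate its mean."""
    rng = np.random.default_rng(seed)
    sample = rng.normal(loc=population_mean, scale=1.0, size=n)
    return sample.mean()

# Repeat the original design many times to trace out its sampling distribution.
original = [run_study(0.5, seed=s) for s in range(5_000)]

# A replication: same population, same estimator, same sample size,
# so its estimates come from the same sampling distribution.
replication = [run_study(0.5, seed=s + 5_000) for s in range(5_000)]

# A robustness test: the sampled population differs, so its estimates
# come from a different sampling distribution.
robustness = [run_study(0.8, seed=s + 10_000) for s in range(5_000)]

for name, estimates in [("original", original), ("replication", replication),
                        ("robustness", robustness)]:
    print(f"{name:12s} mean {np.mean(estimates):.3f}  sd {np.std(estimates):.3f}")
```

Under the proposed definition, only the second design replicates the first; the third, however informative, is a robustness test.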
The discussion paper digs into the literature to show why this definition makes a difference. First, it spells out the definition of a replication test, contrasts it with the mutually exclusive category of “robustness” tests, and gives several examples of each. Second, it catalogs 41 different (and often conflicting) definitions of “replication” in the social science literature. Third, it classifies numerous recent and prominent follow-up studies, finding that only about one third of them qualify as replication studies by the proposed definition.