Open Science Collaboration – CASRAI Dictionary

For most of the twentieth century, the published literature was treated, in practice, as a reasonably trustworthy record: if a finding appeared in a peer-reviewed journal, it was presumed to be real until something specific cast doubt on it. That presumption rested on an assumption that was rarely tested directly: that published results would reappear if someone repeated the study. Beginning in the early 2010s, a series of deliberate, large-scale efforts set out to test exactly that assumption by repeating published studies systematically, and what they found unsettled whole disciplines. The episode came to be called the replication crisis, and the work it provoked has reshaped how researchers think about the reliability of their own findings. This article looks at the major replication projects and the lessons they taught, drawing on the reproducibility domain of the CASRAI Dictionary.

From unease to evidence

Concerns that some published findings might be fragile were not new; what was new was the decision to measure the problem rather than merely worry about it. The crucial move was to treat replication itself as a research programme — to take a defined set of published studies, repeat them carefully using the original methods, and report honestly how many produced consistent results. This turned a diffuse anxiety into an empirical question rather than a matter of faith.

The Reproducibility Project: Psychology

The best-known of these efforts is the Reproducibility Project: Psychology, coordinated by the Open Science Collaboration and led through the Center for Open Science. A large group of researchers worked together to repeat a substantial sample of studies drawn from prominent psychology journals, following the original methods as closely as possible and, where they could, working with the original authors to get the protocols right. The headline finding was sobering: a considerable proportion of the replication attempts did not reproduce the original results, and where effects did appear again, they were often smaller than first reported. The project did not claim that the original findings were necessarily false — a failed replication can have many causes — but it demonstrated, at scale and in public, that a worrying share of published findings could not simply be taken on trust. It became a reference point for the entire debate.

The Many Labs studies

A complementary approach came from the Many Labs projects. Rather than each replication being attempted once by one team, Many Labs had numerous laboratories around the world each attempt the same set of studies using shared protocols. This answered a different question: not just whether a finding replicates once, but how consistent it is across many independent settings, samples and contexts. Some effects proved robust, reappearing reliably across nearly all the participating laboratories; others were inconsistent or largely absent. Many Labs also helped separate genuine variability in a phenomenon from the noise of any single replication attempt. The lesson was that replication is not a simple pass or fail but a way of mapping how dependable and how context-sensitive a finding really is.
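One common way to express the distinction between genuine variability and single-study noise is a heterogeneity statistic. The sketch below is illustrative only: it is not the analysis the Many Labs teams ran, and the per-lab effect sizes and variances are invented for the example. It computes an inverse-variance pooled estimate, Cochran's Q, and the I² statistic (the share of variation across labs that exceeds what sampling error alone would produce).

```python
def heterogeneity(effects, variances):
    """Return (pooled estimate, Cochran's Q, I^2) for per-lab effect estimates.

    effects   -- one effect-size estimate per lab
    variances -- the sampling variance of each estimate
    """
    # Inverse-variance weights: more precise labs count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

    # I^2: proportion of Q beyond its expected value (df) under pure noise.
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical data: five labs, equal precision (variance 0.01 each).
lab_variances = [0.01] * 5
consistent_effects = [0.50, 0.48, 0.52, 0.51, 0.49]   # labs roughly agree
scattered_effects = [0.60, 0.05, 0.45, -0.10, 0.30]   # labs disagree

stable = heterogeneity(consistent_effects, lab_variances)
mixed = heterogeneity(scattered_effects, lab_variances)

print("consistent labs: pooled=%.2f Q=%.2f I2=%.2f" % stable)
print("scattered labs:  pooled=%.2f Q=%.2f I2=%.2f" % mixed)
```

In this toy data the first set of labs differs by no more than sampling error would predict, so I² is zero, while the second set shows far more spread than noise can explain. That is the sense in which a multi-lab design maps how dependable an effect is instead of issuing a single pass or fail.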

Cancer biology and beyond

The replication question was not confined to psychology. The Reproducibility Project: Cancer Biology, a collaboration involving the Center for Open Science and an independent laboratory network, set out to repeat key experiments from high-profile preclinical cancer studies. Replicating biological experiments proved genuinely difficult, often because the original papers lacked enough methodological detail to repeat the work without extensive back-and-forth with the original authors, and sometimes that detail could not be recovered at all. Where replications could be completed, the picture was mixed, with many original effects appearing weaker than first reported. The Brazilian Reproducibility Initiative extended the same spirit to biomedical research within a national research system, coordinating multiple laboratories to repeat published experiments built on a common set of experimental methods. Across these efforts a recurring finding emerged: incomplete reporting is itself a major obstacle to reproducibility, quite apart from whether the underlying result is real.

What the projects taught

Taken together, the large replication projects yielded several durable lessons:

The problem is real and measurable. A meaningful proportion of published findings do not straightforwardly replicate, and this can be demonstrated rather than merely asserted.
Reporting matters enormously. Many replication difficulties stem not from false results but from methods described too thinly to repeat.
Replication is informative, not punitive. A single failed replication rarely settles anything; replication is most valuable for estimating how robust and context-dependent an effect is.
Practices can be reformed. The findings spurred the adoption of pre-registration, registered reports, open data and better reporting standards.

The rise of metascience

Perhaps the most lasting consequence is the maturing of metascience — the scientific study of science itself. The replication projects showed that the research process can be studied empirically: that questions about how reliable findings are, what practices improve reliability, and how incentives shape behaviour can be investigated with the same rigour applied to any other subject. Metascience has since examined publication bias, statistical practice and the effects of pre-registration. The replication crisis, in this light, was not an embarrassment to be buried but the beginning of research becoming more willing to examine its own foundations. Reproducibility ceased to be assumed and became something to be designed for, measured and improved.

A shared vocabulary for reliability

For reproducibility to be improved across disciplines, institutions and publishers, the concepts involved must be described consistently: what a replication is, what counts as the materials and methods needed to repeat a study, and how outputs such as data and protocols are identified and shared. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the elements underpinning reproducible research are understood the same way wherever they are recorded. And because conducting replications and sharing the data and methods behind them is genuine, recognisable work, it can be described in the same framework used for every other contribution: the CRediT taxonomy, whose full set of contribution roles covers investigation, data curation and the rest. Building replication into routine practice is therefore a task for research administration as well as for researchers. The replication projects taught research to test its own claims; a shared vocabulary helps ensure the lessons travel.

The replication crisis and large-scale replication projects: what systematic replication has taught us