Tag: replication crisis

  • The Replication Crisis in Psychology and the Open Science Response

    The replication crisis is the recognition that a substantial share of published findings, notably in psychology, fail to reproduce when independent teams repeat the studies. It prompted a wide-ranging reform movement built around transparency and pre-specified methods. Rather than discrediting the discipline, the crisis has driven psychology to strengthen the reliability of its evidence.

    The 2015 reproducibility project

    A landmark moment was the Open Science Collaboration’s Reproducibility Project: Psychology, published in 2015. A large network of researchers attempted to replicate 100 studies from leading psychology journals using high-powered designs. Only about a third of the replications produced a statistically significant result, against nearly all of the originals, and the replication effects were on average about half the size of those originally reported. The result was a wake-up call: publication did not guarantee a finding was robust. Crucially, the project was itself a model of open practice: its protocols were shared, its analyses were transparent, and its data were made public, so its own conclusions could be scrutinised and re-examined by others. It demonstrated that large-scale, coordinated replication was feasible, and it gave the reform movement a concrete, quantified anchor rather than anecdote. Subsequent multi-lab projects in psychology and adjacent fields extended the approach, confirming that the pattern was systemic rather than confined to a handful of studies.

    What drives non-replication

    Several interacting causes are now well understood:

    Cause                         | How it inflates false findings
    ------------------------------+---------------------------------------------------------------
    P-hacking                     | Flexible analysis choices made until results cross significance, producing false positives
    Publication bias              | Journals favour positive, novel results, so null findings stay unpublished (the “file drawer”)
    Low statistical power         | Small samples yield unstable estimates and exaggerated effect sizes
    Researcher degrees of freedom | Undisclosed choices in design and analysis enable selective reporting

    These pressures interact with weak measurement: instruments with poor reliability and validity add noise that low-powered studies are ill-equipped to handle.
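
    One form of p-hacking, checking the data repeatedly and stopping as soon as the result is significant, can be illustrated with a small simulation. The sketch below is a toy model rather than a description of any particular study: it assumes normally distributed data with known unit variance, no true effect at all, and hypothetical settings (a look after every 5 observations, from 10 up to 50). The function names are illustrative.

```python
import math
import random
from statistics import NormalDist, mean


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = mean(sample) * math.sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek, n_sims=2000, start_n=10, max_n=50, step=5,
                        alpha=0.05, seed=1):
    """Share of null studies declared significant.

    peek=False: one test at the planned sample size (max_n).
    peek=True:  test after every batch and stop at the first p < alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # The true effect is zero, so every "significant" result is a false positive.
        data = [rng.gauss(0, 1) for _ in range(start_n)]
        if peek:
            while True:
                if z_test_p(data) < alpha:
                    hits += 1
                    break
                if len(data) >= max_n:
                    break
                data.extend(rng.gauss(0, 1) for _ in range(step))
        else:
            data.extend(rng.gauss(0, 1) for _ in range(max_n - start_n))
            if z_test_p(data) < alpha:
                hits += 1
    return hits / n_sims


fixed = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"fixed sample size: {fixed:.3f}")
print(f"optional stopping: {peeking:.3f}")
```

    With a single planned test the false-positive rate should sit near the nominal 5%; with repeated looks it climbs well above that, even though each individual test looks legitimate.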

    The Open Science response

    The reform agenda answers each cause directly. Preregistration records hypotheses and analysis plans before data are seen, separating confirmatory tests from exploratory ones and curbing p-hacking. Registered Reports go further: a journal peer-reviews the introduction and methods and grants in-principle acceptance before results exist, so publication no longer hinges on whether the result is positive—directly tackling publication bias. Data and materials sharing lets others reanalyse and reuse work, and adequately powered designs reduce false positives at source.
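
    Why results-blind acceptance matters can be seen in a toy simulation of the file drawer. The sketch assumes a small true effect (a standardised difference of 0.2), 20 participants per group and known unit variance; all of these numbers are hypothetical. It compares the average estimate across every study, which is what a literature of Registered Reports would approximate, with the average across only the significant positive results.

```python
import math
import random
from statistics import NormalDist, mean

TRUE_EFFECT = 0.2   # hypothetical true standardised difference
N_PER_GROUP = 20    # hypothetical (small) sample size per group


def simulate_study(rng):
    """Return (estimated difference, two-sided p-value) for one two-group study."""
    treated = [rng.gauss(TRUE_EFFECT, 1) for _ in range(N_PER_GROUP)]
    control = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    diff = mean(treated) - mean(control)
    se = math.sqrt(2 / N_PER_GROUP)  # standard error under known unit variance
    p = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, p


rng = random.Random(7)
studies = [simulate_study(rng) for _ in range(5000)]

all_effects = [d for d, _ in studies]
# "File drawer" regime: only significant results in the expected direction appear.
published = [d for d, p in studies if p < 0.05 and d > 0]

print(f"true effect:               {TRUE_EFFECT}")
print(f"mean of all studies:       {mean(all_effects):.2f}")
print(f"mean of 'published' only:  {mean(published):.2f}")
print(f"share published:           {len(published) / len(studies):.2f}")
```

    In this setup a study can only reach significance if its estimate is several times the true effect, so the selected literature overstates the effect even though no individual study did anything wrong.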

    The role of the Center for Open Science

    Much of this infrastructure is coordinated by the Center for Open Science, the non-profit behind the Open Science Framework, a platform for preregistration, data sharing and project management. By making transparent practice easy and rewarded—through badges, registries and tooling—it has helped shift norms across psychology and beyond. The movement aligns closely with CASRAI’s interest in reproducibility and clear research metadata.

    The difference between direct and conceptual replication

    Not all replications are the same, and the distinction matters for interpreting the crisis. A direct replication repeats the original method as closely as possible to test whether the same procedure yields the same result. A conceptual replication tests the same underlying idea using a different method or measure. Conceptual replications are valuable for generalisation, but they cannot substitute for direct ones: if a different method fails, it is ambiguous whether the original finding was false or the new method simply tapped a different construct. Part of what the reform movement restored was respect for direct replication, which had been undervalued by journals that prized novelty over verification.

    Beyond p-values: estimation and transparency

    A recurring theme is over-reliance on the binary question “is p below 0.05?”. A single significant p-value says little about how large or reliable an effect is, and the threshold is easy to cross by chance or by flexible analysis. Reformers therefore emphasise reporting effect sizes with confidence intervals, planning sample sizes in advance through power analysis, and distinguishing pre-specified confirmatory tests from exploratory ones. None of this forbids exploration; it simply asks researchers to label it honestly so readers can weight the evidence appropriately. These habits depend on sound measurement, since unreliable instruments undermine even a well-powered, preregistered design—linking the crisis back to reliability and validity.
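
    As a minimal sketch of what estimation-focused reporting involves, the code below computes a standardised mean difference (Cohen's d) with an approximate confidence interval and a per-group sample size for a planned two-group comparison. It relies on large-sample normal approximations, so the interval is rough for small samples, and the scores shown are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def cohens_d_with_ci(group_a, group_b, level=0.95):
    """Cohen's d (pooled SD) with a large-sample normal-approximation interval."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * stdev(group_a) ** 2 +
                  (n2 - 1) * stdev(group_b) ** 2) / (n1 + n2 - 2)
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    # Common large-sample approximation to the standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, (d - z * se, d + z * se)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Hypothetical scores, for illustration only.
treatment = [5.1, 6.3, 5.8, 7.0, 6.1, 5.5, 6.7, 6.0]
control = [4.9, 5.2, 5.6, 5.0, 6.0, 4.7, 5.4, 5.1]

d, (low, high) = cohens_d_with_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
print("n per group for d = 0.5:", n_per_group(0.5))
print("n per group for d = 0.2:", n_per_group(0.2))
```

    The width of the interval conveys what a bare p-value hides: with eight participants per group, even a large observed d is compatible with a wide range of true effects. The sample-size function shows why small effects demand samples far larger than has been typical.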

    A cultural shift, not just a checklist

    The most durable change has been cultural. Open practices—sharing data, code and materials, posting preprints, and crediting replication work—are increasingly expected rather than exceptional, and funders and journals now reward them. Many psychology journals offer Registered Reports, and badges for open data and open materials have become common. The shift reframes transparency as a normal part of doing science well rather than an optional extra, and it has begun to spread to neighbouring fields facing similar pressures.

    What it means for everyday research practice

    The crisis has practical consequences for how studies are designed and read. Single, striking results deserve caution until replicated; effect sizes and confidence intervals matter more than a lone p-value; and vivid claims—the kind that circulate as popular psychology, such as strong readings of the Dunning-Kruger effect—warrant scrutiny against replication evidence. These habits sit alongside responsible assessment of the instruments a study relies upon.

    What the crisis does and does not imply

    It is important to state the limits of the lesson. A failed replication does not automatically prove the original effect is false; replications themselves can be underpowered, can differ subtly in method, or can be run on different populations. Equally, the crisis is not unique to psychology—medicine, economics and other empirical fields have confronted comparable problems—nor does it mean that nothing in psychology is true. Many core findings replicate robustly. The accurate reading is that the proportion of fragile results in the literature was higher than assumed, that publishing incentives rewarded surprising single studies over careful verification, and that the remedy is structural rather than a matter of individual blame. Framed this way, the crisis is a sign of a discipline maturing, not collapsing.
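
    The point that replications can themselves be underpowered can be made concrete. The sketch below uses a normal approximation for a two-group comparison and hypothetical numbers: a replication sized for 80% power on a reported effect of d = 0.5 is evaluated again under the assumption that the true effect is only half as large.

```python
import math
from statistics import NormalDist


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for effect size d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected value of the test statistic
    cdf = NormalDist().cdf
    # Probability of landing in either rejection region.
    return (1 - cdf(z_crit - shift)) + cdf(-z_crit - shift)


reported_d = 0.5   # hypothetical effect size from an original paper
n = 63             # per-group n giving about 80% power for d = 0.5

power_if_reported_true = power_two_groups(reported_d, n)
power_if_half_as_large = power_two_groups(reported_d / 2, n)

print(f"power if the true effect is {reported_d}: {power_if_reported_true:.2f}")
print(f"power if the true effect is {reported_d / 2}: {power_if_half_as_large:.2f}")
```

    Under these assumptions, a replication that looks well planned on paper would miss a real but smaller effect most of the time, which is one reason a single non-significant replication is weak evidence that an effect is absent.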

    Standards, terminology and authors

    Reproducibility also depends on mundane infrastructure: consistent terms, well-described methods and shareable metadata. Defining concepts in a controlled research dictionary reduces ambiguity across studies, and clear expectations for authors—preregister where possible, report all measures, share data—turn the lessons of the crisis into routine. The goal is not to publish less but to publish findings that hold up.

    Frequently asked questions

    What is the replication crisis?

    It is the finding that many published results, especially in psychology, do not reproduce when independent teams repeat the studies. It exposed weaknesses in research and publishing practices and sparked reform.

    What did the 2015 Open Science Collaboration project find?

    The Reproducibility Project: Psychology attempted to replicate 100 studies and found that a large proportion did not reproduce, with replication effects typically smaller than the originals.

    What causes findings to fail replication?

    Key causes include p-hacking, publication bias against null results, low statistical power and undisclosed analytic flexibility, often compounded by measures with weak reliability and validity.

    What are preregistration and Registered Reports?

    Preregistration logs hypotheses and analysis plans before data collection. Registered Reports take this further, with journals accepting a study based on its methods before results are known, reducing publication bias.

  • The replication crisis and large-scale replication projects: what systematic replication has taught us

    For most of the twentieth century, the published literature was treated, in practice, as a reasonably trustworthy record: if a finding appeared in a peer-reviewed journal, it was presumed to be real until something specific cast doubt on it. That presumption rested on an assumption rarely tested directly: that published results would reappear if someone repeated the study. Beginning in the early 2010s, a series of deliberate, large-scale efforts set out to test exactly that assumption by repeating published studies systematically, and what they found unsettled whole disciplines. The episode came to be called the replication crisis, and the work it provoked has reshaped how researchers think about the reliability of their own fields. This article looks at the major replication projects and the lessons they taught, drawing on the reproducibility domain of the CASRAI Dictionary.

    From unease to evidence

    Concerns that some published findings might be fragile were not new; what was new was the decision to measure the problem rather than merely worry about it. The crucial move was to treat replication itself as a research programme — to take a defined set of published studies, repeat them carefully using the original methods, and report honestly how many produced consistent results. This turned a diffuse anxiety into an empirical question rather than a matter of faith.

    The Reproducibility Project: Psychology

    The best-known of these efforts is the Reproducibility Project: Psychology, coordinated by the Open Science Collaboration and led through the Center for Open Science. A large group of researchers worked together to repeat a substantial sample of studies drawn from prominent psychology journals, following the original methods as closely as possible and, where they could, working with the original authors to get the protocols right. The headline finding was sobering: a considerable proportion of the replication attempts did not reproduce the original results, and where effects did appear again, they were often smaller than first reported. The project did not claim that the original findings were necessarily false — a failed replication can have many causes — but it demonstrated, at scale and in public, that a worrying share of published findings could not simply be taken on trust. It became a reference point for the entire debate.

    The Many Labs studies

    A complementary approach came from the Many Labs projects. Rather than each replication being attempted once by one team, Many Labs had numerous laboratories around the world each attempt the same set of studies using shared protocols. This answered a different question: not just whether a finding replicates once, but how consistent it is across many independent settings, samples and contexts. Some effects proved robust, reappearing reliably across nearly all the participating laboratories; others were inconsistent or largely absent. Many Labs also helped separate genuine variability in a phenomenon from the noise of any single replication attempt. The lesson was that replication is not a simple pass or fail but a way of mapping how dependable and how context-sensitive a finding really is.
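
    One standard way to quantify how consistent an effect is across sites is to compute heterogeneity statistics from the per-lab estimates. The sketch below uses Cochran's Q, the DerSimonian-Laird estimate of between-lab variance (tau squared) and the I-squared index. The lab estimates and variances are invented for illustration, and this is not presented as the analysis the Many Labs teams themselves ran.

```python
def heterogeneity(effects, variances):
    """Fixed-effect pooled estimate plus Cochran's Q, DerSimonian-Laird tau^2 and I^2."""
    weights = [1 / v for v in variances]          # inverse-variance weights
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c)                 # between-lab variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"pooled": pooled, "Q": q, "df": df, "tau2": tau2, "I2": i2}


# Hypothetical standardised effects from five labs running the same protocol.
lab_effects = [0.30, 0.10, 0.45, -0.05, 0.25]
lab_variances = [0.01, 0.01, 0.01, 0.01, 0.01]

result = heterogeneity(lab_effects, lab_variances)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

    A Q close to its degrees of freedom suggests the labs differ by no more than sampling error would predict; a Q well above it, and a correspondingly high I-squared, points to genuine variation in the effect across settings.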

    Cancer biology and beyond

    The replication question was not confined to psychology. The Reproducibility Project: Cancer Biology, a collaboration involving the Center for Open Science and an independent laboratory network, set out to repeat key experiments from high-profile preclinical cancer studies. Replicating biological experiments proved genuinely difficult, often because the original papers lacked enough methodological detail to repeat the work without extensive back-and-forth with the original authors, and sometimes that detail could not be recovered at all. Where replications could be completed, the picture was mixed, with many original effects appearing weaker than first reported. The Brazilian Reproducibility Initiative extended the same spirit to biomedical research within a national research system, coordinating multiple laboratories to repeat published experiments that rely on a common set of experimental methods. Across these efforts a recurring finding emerged: incomplete reporting is itself a major obstacle to reproducibility, quite apart from whether the underlying result is real.

    What the projects taught

    Taken together, the large replication projects yielded several durable lessons:

    • The problem is real and measurable. A meaningful proportion of published findings do not straightforwardly replicate, and this can be demonstrated rather than merely asserted.
    • Reporting matters enormously. Many replication difficulties stem not from false results but from methods described too thinly to repeat.
    • Replication is informative, not punitive. A single failed replication rarely settles anything; replication is most valuable for estimating how robust and context-dependent an effect is.
    • Practices can be reformed. The findings spurred pre-registration, registered reports, open data and better reporting standards.

    The rise of metascience

    Perhaps the most lasting consequence is the maturing of metascience — the scientific study of science itself. The replication projects showed that the research process can be studied empirically: that questions about how reliable findings are, what practices improve reliability, and how incentives shape behaviour can be investigated with the same rigour applied to any other subject. Metascience has since examined publication bias, statistical practice and the effects of pre-registration. The replication crisis, in this light, was not an embarrassment to be buried but the beginning of research becoming more willing to examine its own foundations. Reproducibility ceased to be assumed and became something to be designed for, measured and improved.

    A shared vocabulary for reliability

    For reproducibility to be improved across disciplines, institutions and publishers, the concepts involved must be described consistently — what a replication is, what counts as the materials and methods needed to repeat a study, and how outputs such as data and protocols are identified and shared. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the elements underpinning reproducible research are understood the same way wherever they are recorded. And because conducting replications and sharing the data and methods behind them is genuine, recognisable work, it can be described in the same framework used for every other contribution — the CRediT taxonomy, whose full set of contribution roles covers investigation, data curation and the rest. Building replication into practice is part of research administration. The replication projects taught research to test its own claims; a shared vocabulary helps ensure the lessons travel.