What Is Statistical Significance? p-Values Explained | CASRAI

Statistical significance describes how incompatible an observed result is with the null hypothesis: a result is called significant when data at least as extreme would rarely occur if the null hypothesis were true. It is conventionally indexed by the p-value and has been central to hypothesis testing in science since Ronald Fisher formalised it in the 1920s.

The p-value and how it works

Ronald Fisher introduced the p-value in his 1925 book Statistical Methods for Research Workers as a convenient index of evidence against the null hypothesis. The p-value is the probability of obtaining a result at least as extreme as the one observed, if the null hypothesis were true. It does not tell you the probability that your hypothesis is correct, or the probability that the result arose by chance — both common misinterpretations. Fisher himself described p < 0.05 as a useful rule of thumb, not a sacred threshold, and his approach was descriptive: a small p meant the data deserved further investigation.
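To make the definition concrete, here is a minimal sketch of an exact p-value calculation. The coin-flip data (16 heads in 20 flips) and the function name `binomial_p_value` are illustrative assumptions, not part of the original explainer. The null hypothesis is that the coin is fair, and "at least as extreme" is taken to mean "at least this many heads" (a one-sided test).

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= heads) when X ~ Binomial(flips, p_null)."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Hypothetical data: 16 heads in 20 flips; H0 says the coin is fair.
p = binomial_p_value(16, 20)
print(f"p = {p:.4f}")  # prints "p = 0.0059"
```

The result says that a fair coin would produce 16 or more heads in about 0.6% of such experiments. It does not say there is a 0.6% chance the coin is fair.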

Type I and Type II errors

Neyman and Pearson formalised hypothesis testing in the 1930s by introducing decision rules with known error rates. A Type I error (false positive) occurs when the null hypothesis is true but is wrongly rejected; its probability is α, so the conventional threshold α = 0.05 means accepting a 5% risk of rejecting the null hypothesis when it is in fact true. A Type II error (false negative) occurs when a true effect exists but is not detected; its probability is β, and statistical power is 1 − β. The Fisher and Neyman–Pearson frameworks are conceptually distinct: Fisher’s p-values are measures of evidence, whereas Neyman–Pearson testing produces binary decisions with controlled error rates. Most scientific practice conflates the two.
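The relationship between α, β and power can be sketched for one simple case: a one-sided, one-sample z-test with known variance. The function name `z_test_power`, the effect size of 0.5 standard deviations and the sample size of 50 are assumptions chosen for illustration only.

```python
from statistics import NormalDist


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 in a one-sided, one-sample z-test.

    effect_size is the true mean shift in standard-deviation units.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)
    # Under the true effect, the test statistic is Normal(effect_size * sqrt(n), 1).
    return 1 - std_normal.cdf(critical_z - effect_size * n**0.5)


type_1_rate = z_test_power(0.0, n=50)  # H0 true: rejection rate equals alpha
power = z_test_power(0.5, n=50)        # H0 false: rejection rate is the power
type_2_rate = 1 - power                # beta

print(f"Type I rate: {type_1_rate:.3f}")
print(f"Power: {power:.3f}, Type II rate: {type_2_rate:.3f}")
```

When the true effect is zero, the rejection rate is α by construction. When a real effect exists, the same rule rejects with probability equal to the power, and misses the effect with probability β.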

Statistical significance vs effect size vs practical significance

A result can be statistically significant without being practically important. In very large samples, trivially small effects produce highly significant p-values because standard errors shrink with sample size. Conversely, in small samples, clinically important effects may not reach significance. Effect sizes (Cohen’s d, correlation r, odds ratio) communicate the magnitude of an effect on a scale that does not grow with sample size, and confidence intervals show how precisely that magnitude has been estimated. Clinical significance (or practical importance) asks whether the effect is large enough to matter in practice, a judgement that p-values cannot make.
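A small sketch shows how sample size alone can move a p-value across the 0.05 line. It assumes a one-sample z-test in which the observed standardised mean difference is always 0.02 (a negligible effect by most conventions); the function name `two_sided_p` and the sample sizes are hypothetical.

```python
from statistics import NormalDist


def two_sided_p(effect_size: float, n: int) -> float:
    """Two-sided p-value for a one-sample z-test.

    effect_size is the observed mean difference in standard-deviation units.
    """
    z = effect_size * n**0.5  # the standard error shrinks as 1 / sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny effect, increasingly large samples.
for n in (100, 10_000, 100_000):
    print(f"n = {n:>7}: p = {two_sided_p(0.02, n):.3g}")
```

With 100 observations the p-value is about 0.84; with 10,000 it is about 0.046; with 100,000 it is far below any conventional threshold. The effect size is 0.02 in every case, so the p-value alone says nothing about whether the effect matters.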

The replication crisis and the future of significance

From the 2010s, the replication crisis revealed that many findings across psychology, medicine, and other fields did not replicate. Causes include p-hacking (selectively reporting analyses that yield p < 0.05), the multiple comparisons problem (running many tests inflates the chance of at least one false positive, which can be addressed by controlling the family-wise error rate with a Bonferroni correction or by controlling the false discovery rate), and publication bias favouring significant results. A 2019 Nature commentary by Amrhein, Greenland and McShane, signed by more than 800 scientists, called for retiring statistical significance in favour of reporting precise estimates and confidence intervals. In the same year, an editorial by Wasserstein, Schirm and Lazar in The American Statistician recommended abandoning the term "statistically significant". Many journals now encourage or require pre-registration and the reporting of effect sizes alongside p-values.
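The multiple comparisons problem and its two usual remedies can be sketched in a few lines. The family-wise error formula below assumes independent tests in which every null hypothesis is true; the p-values are made up for illustration, and the function names are this sketch's own. The false discovery rate method shown is the Benjamini-Hochberg step-up procedure.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests (all H0 true)."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject H0 only where p <= alpha / m; controls the family-wise error rate."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


print(f"{family_wise_error_rate(20):.2f}")  # prints "0.64"

p_values = [0.001, 0.008, 0.025, 0.039, 0.20]  # hypothetical results of five tests
print(bonferroni_reject(p_values))
print(benjamini_hochberg_reject(p_values))
```

Running 20 independent tests at α = 0.05 gives roughly a 64% chance of at least one false positive. On the example p-values, Bonferroni keeps only the two smallest, while Benjamini-Hochberg keeps four: it is less conservative because it limits the expected share of false discoveries rather than the chance of any false positive at all.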

Key facts

At a glance

Definition: Probability of a result at least as extreme as observed if H₀ were true
Introduced by: Ronald Fisher (1925) — p < 0.05 as informal rule of thumb
Threshold: p < 0.05 conventional; not a natural law
Type I error: Rejecting a true null hypothesis (false positive, rate = α)
Type II error: Failing to reject a false null hypothesis (false negative, rate = β)
Limitation: Significance ≠ importance; affected by sample size
Reform: 2019 Nature commentary called for abandoning the term entirely

Common misconceptions

What people often get wrong

Often heard: p < 0.05 means there is a 95% chance the result is real.

Actually: No — the p-value is the probability of the observed data (or more extreme) given that H₀ is true, not the probability that H₀ is false. It cannot be interpreted as a direct probability that your finding is correct.

Often heard: A statistically significant result is always practically important.

Actually: No — with a large enough sample, even negligible effects achieve statistical significance. Effect size and practical or clinical significance must be assessed separately.

Often heard: If p > 0.05, there is no effect.

Actually: No — a non-significant result means the data do not provide sufficient evidence to reject H₀ at the chosen threshold. It does not prove that no effect exists, especially when the study is underpowered.

Going deeper