Correlation vs Causation in Research: Knowing the Difference

Correlation describes the degree to which two variables move together, while causation means that a change in one variable actually produces a change in another. The central principle, often summarised as “correlation does not imply causation”, is that observing two things vary together is not sufficient to conclude that one causes the other. Distinguishing the two is one of the most important and most frequently neglected tasks in research.

Measuring correlation with Pearson’s r

The most common measure of linear correlation is Pearson’s correlation coefficient, written r. It ranges from minus one to plus one. A value of plus one indicates a perfect positive linear relationship, minus one a perfect negative linear relationship, and zero no linear relationship at all. Pearson’s r captures only the strength and direction of a straight-line association; it can miss strong but non-linear relationships, and it is sensitive to outliers. A high r tells you two variables track each other closely, but says nothing about why.

| Pearson’s r | Interpretation |
| --- | --- |
| +1.0 | Perfect positive linear relationship |
| 0 | No linear relationship |
| −1.0 | Perfect negative linear relationship |
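
As a rough illustration of these properties, the sketch below computes Pearson’s r directly from its definition (the covariance of the two variables divided by the product of their spreads) using only the Python standard library. The function name pearson_r and all of the data are made up for this example; they are assumptions, not taken from any study.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from its definition (hypothetical helper for illustration)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    spread_x = sum(a * a for a in dx)
    spread_y = sum(b * b for b in dy)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return co_variation / math.sqrt(spread_x * spread_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
straight_line = [2 * x + 1 for x in xs]  # perfectly linear in x
parabola = [x * x for x in xs]           # fully determined by x, but not linear

r_linear = pearson_r(xs, straight_line)  # perfect positive linear relationship
r_parabola = pearson_r(xs, parabola)     # no linear relationship, despite dependence

# Invented data: a weak-to-moderate association, then the same data plus one
# extreme point that dominates the estimate.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 1, 3, 2, 3]
r_without_outlier = pearson_r(base_x, base_y)
r_with_outlier = pearson_r(base_x + [20], base_y + [20])

print(r_linear, r_parabola)
print(round(r_without_outlier, 2), round(r_with_outlier, 2))
```

The parabola case shows how r can sit at zero even though one variable is completely determined by the other, and the last pair of values shows how a single extreme point can push a moderate r close to one.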

Why correlation does not imply causation

Two variables can be correlated for several reasons that have nothing to do with one causing the other. The direction of causation may be reversed, both may be driven by a third factor, or the association may simply be a coincidence in the data. The classic example is the correlation between ice cream sales and drowning incidents. Neither causes the other; both rise in hot weather, which is a confounding variable. A confounder is a variable associated with both the supposed cause and the supposed effect, creating a spurious link.

| Reason for correlation | Example |
| --- | --- |
| Genuine causation | Smoking raises lung cancer risk |
| Reverse causation | Assuming illness causes a behaviour when the behaviour causes the illness |
| Confounding | Ice cream sales and drownings both driven by hot weather |
| Coincidence | Two unrelated trends that happen to move together |
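
The confounding case can be sketched with a small simulation. In the toy model below, temperature drives both ice cream sales and drownings, and neither affects the other. Every number (the coefficients, the noise levels, the sample size) is an invented assumption chosen only to make the pattern visible, and pearson_r and residuals are hypothetical helpers. Adjusting for temperature is done here by correlating the residuals left after regressing each variable on temperature, which is one simple way of computing a partial correlation.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r from its definition (hypothetical helper for illustration)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    return co_variation / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000

# Assumed toy model: temperature influences both outcomes independently.
temperature = [rng.uniform(10, 35) for _ in range(n)]
ice_cream = [5.0 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson_r(ice_cream, drownings)
adjusted_r = pearson_r(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f"raw correlation: {raw_r:.2f}")
print(f"after adjusting for temperature: {adjusted_r:.2f}")
```

In this model the raw correlation is clearly positive, while the adjusted correlation should sit near zero, because the only link between the two variables runs through temperature. Real data are rarely this clean: the adjustment only works for confounders that have actually been measured.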

Criteria for causal inference

Because correlation alone is insufficient, researchers use additional reasoning to assess causation. In epidemiology the Bradford Hill considerations, set out by Austin Bradford Hill in 1965, offer a widely cited framework. They include the strength of the association, its consistency across studies, specificity, the correct temporal sequence (the cause must precede the effect), a biological gradient or dose-response relationship, plausibility, coherence with existing knowledge, experimental evidence and analogy. These are considerations to weigh, not a checklist to tick mechanically, and no single one proves causation on its own.

Randomisation and experiments

The strongest evidence for causation usually comes from a randomised controlled experiment. By randomly assigning participants to conditions, randomisation tends to balance both known and unknown confounders across groups, so that a difference in outcomes can more credibly be attributed to the intervention. Where experiments are impossible, careful observational designs attempt to control for confounders statistically, but they remain more vulnerable to hidden bias. Extreme data points can also distort correlation estimates, which connects to the separate task of outlier handling.
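
A minimal simulation can show why randomisation helps. The sketch below assumes a hidden confounder, labelled health, that improves the outcome and also makes people more likely to choose the treatment. The true treatment effect is fixed at 2.0 by construction; that value, the other coefficients and all names are assumptions made for illustration.

```python
import random

TRUE_EFFECT = 2.0       # assumed effect of treatment on the outcome
rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 20000


def mean(values):
    return sum(values) / len(values)


def difference_in_means(treated_flags, outcomes):
    """Naive effect estimate: mean outcome of treated minus mean of untreated."""
    treated = [y for t, y in zip(treated_flags, outcomes) if t]
    control = [y for t, y in zip(treated_flags, outcomes) if not t]
    return mean(treated) - mean(control)


def outcome(health, treated):
    """Outcome depends on hidden health, on treatment, and on noise."""
    return 3.0 * health + TRUE_EFFECT * treated + rng.gauss(0, 1)


# Hidden confounder: never observed by the analyst in this toy model.
health = [rng.gauss(0, 1) for _ in range(n)]

# Observational setting: healthier people are more likely to opt in.
self_selected = [h + rng.gauss(0, 1) > 0 for h in health]
observational_outcomes = [outcome(h, t) for h, t in zip(health, self_selected)]
observational_estimate = difference_in_means(self_selected, observational_outcomes)

# Randomised setting: a coin flip decides treatment, independent of health.
randomised = [rng.random() < 0.5 for _ in range(n)]
randomised_outcomes = [outcome(h, t) for h, t in zip(health, randomised)]
randomised_estimate = difference_in_means(randomised, randomised_outcomes)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {observational_estimate:.2f}")
print(f"randomised estimate:    {randomised_estimate:.2f}")
```

Under these assumptions the observational estimate overstates the effect, because the treated group was healthier to begin with, while the randomised estimate lands close to the true value without health ever being measured.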

Sound causal reasoning draws on the wider discipline of statistics and on transparent reporting of methods, both essential for reproducible findings. Related concepts such as statistical significance describe whether an association as strong as the one observed would be unlikely if chance alone were at work; a significant association is still not evidence of causation, because confounding and reverse causation can produce significant results too. For terminology, see the CASRAI dictionary, the reproducibility category and the author guidance.
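
To make the point about statistical significance concrete, here is a sketch of a permutation test, one of several ways to ask how often chance alone would produce a correlation at least as strong as the one observed. The data are invented, and the function names, the number of shuffles and the seed are arbitrary assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r from its definition (hypothetical helper for illustration)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    return co_variation / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented data with a strong upward trend.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]

observed_r = pearson_r(xs, ys)
p_value = permutation_p_value(xs, ys)

print(f"observed r = {observed_r:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value here says only that the pairing of xs and ys is unlikely to be an accident of shuffling. It is silent on whether xs drives ys, ys drives xs, or a third variable drives both.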

Frequently asked questions

What does Pearson’s r actually measure?

Pearson’s r measures the strength and direction of a linear relationship between two continuous variables, on a scale from minus one to plus one. It does not capture non-linear relationships and does not establish that one variable causes the other.

What is a confounding variable?

A confounder is a third variable associated with both the supposed cause and the supposed effect. It can create a correlation between two variables that are not causally linked, which is why controlling for confounders is central to causal inference.

How can researchers establish causation?

Randomised controlled experiments provide the strongest evidence by balancing confounders across groups. Where experiments are not feasible, frameworks such as the Bradford Hill considerations, combined with careful adjustment for confounders, help build a case for causation, though no single study proves it conclusively.
