An outlier is an observation that lies unusually far from the rest of a dataset, far enough that it may distort summary statistics, model fit or test results. Outliers are not automatically errors to be removed; they are signals to be investigated, justified and reported.
Why outliers matter for reproducibility
A single extreme value can inflate a mean, balloon a variance or drag a regression line toward itself, changing a study’s conclusions. Because the decision about whether to keep or exclude such a point is a researcher degree of freedom, undocumented outlier handling is a well-known threat to reproducibility. Transparent reporting of what you found, what you did and why is the antidote.
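The effect is easy to see on a toy dataset. The sketch below uses invented numbers (ten points lying exactly on the line y = 2x, then the same points with one response mis-recorded as 200 instead of 20) and a hypothetical helper, `ols_slope`, written here purely for illustration.

```python
from statistics import mean, variance


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Invented data: y is exactly 2 * x for x = 1..10.
xs = list(range(1, 11))
ys = [2 * x for x in xs]

# Same data, but the last response is recorded as 200 instead of 20.
ys_bad = ys[:-1] + [200]

print("mean:    ", mean(ys), "->", mean(ys_bad))
print("variance:", variance(ys), "->", variance(ys_bad))
print("slope:   ", ols_slope(xs, ys), "->", ols_slope(xs, ys_bad))
```

One altered value out of ten moves all three summaries. Whether that point is kept, corrected or dropped therefore changes the reported result.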
Two causes: error versus genuine extreme
Outliers arise from two broad sources, and the cause dictates the response.
- Error outliers come from data-entry mistakes, instrument faults, unit mix-ups or sampling problems. A recorded human age of 250 years is an error. These can legitimately be corrected or excluded once verified.
- Genuine extremes are real but unusual observations — a true high earner in an income survey, a rare strong responder in a trial. These carry information and should generally be retained, possibly with a robust analysis.
The crucial point is that you cannot tell the two apart from the number alone. Investigation of the source — the raw record, the instrument log, the data-collection notes — is what separates them.
Detection methods
Several established methods flag candidate outliers. None is definitive; each makes assumptions and each has a different sensitivity. Visual inspection should always accompany any rule.
| Method | How it works | Best suited to |
|---|---|---|
| Z-score | Flags points that lie more than a set number of standard deviations (commonly 3) from the mean | Roughly normal, larger samples |
| IQR / boxplot | Flags points beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR | Skewed data; robust, distribution-light |
| Grubbs’ test | Formal hypothesis test for a single outlier in a normal sample | One suspected outlier, normality assumed |
| Modified z-score (MAD) | Uses the median and median absolute deviation, resisting masking | Small samples or multiple outliers |
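The z-score is intuitive but breaks down precisely when it matters most: a strong outlier inflates the mean and standard deviation used to judge it and can mask itself. In small samples the problem is absolute. With the sample standard deviation, the largest possible z-score among n points is (n − 1)/√n, so with ten or fewer observations no point can reach the usual threshold of 3. The IQR rule, built on quartiles, is more robust and makes few distributional assumptions, which is why the boxplot remains the everyday workhorse. Grubbs’ test offers a formal, probabilistic answer when a single outlier is suspected in approximately normal data. Robust alternatives based on the median and MAD resist both masking (an outlier hiding itself or others by inflating the spread) and swamping (outliers shifting the mean so that ordinary points get flagged), the two failure modes that trip up mean-based rules.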
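As a rough illustration, three of the rules in the table can be sketched with Python's standard library. The function names and the sample data are invented for this example. The 3.5 cutoff and 0.6745 scaling constant for the modified z-score follow a commonly cited convention rather than anything specified here, and the quartile method (`"inclusive"`) is one choice among several that can shift the fences slightly.

```python
from statistics import mean, median, quantiles, stdev


def z_score_flags(data, threshold=3.0):
    """Flag points more than `threshold` sample SDs from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]


def iqr_flags(data, k=1.5):
    """Flag points outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def modified_z_flags(data, threshold=3.5):
    """Flag points using the median and median absolute deviation (MAD)."""
    med = median(data)
    mad = median([abs(x - med) for x in data])
    if mad == 0:
        return []  # the score is undefined when more than half the values tie
    return [x for x in data if 0.6745 * abs(x - med) / mad > threshold]


# Invented sample: eight ordinary values and one extreme one.
data = [10, 11, 12, 12, 13, 13, 14, 15, 95]

print("z-score:   ", z_score_flags(data))
print("IQR:       ", iqr_flags(data))
print("modified z:", modified_z_flags(data))
```

On this sample the z-score rule flags nothing, because the value 95 inflates the standard deviation it is measured against, while the IQR and MAD rules both flag it. That is the masking effect in miniature. A flag from any of these functions is a prompt to investigate, not a verdict.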
Principled handling: never delete silently
The cardinal rule is that you do not quietly drop inconvenient points. A defensible workflow looks like this:
- Detect and flag candidates using a pre-specified rule, ideally chosen before seeing the results.
- Investigate the source to classify each as error or genuine extreme.
- Decide and document — correct verified errors, retain genuine extremes, and record every decision with its rationale.
- Report sensitivity — run the analysis with and without the contested points and show whether conclusions change.
- Prefer robust methods where extremes are genuine, such as medians, trimmed means or rank-based tests, instead of deletion.
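The last two steps of the workflow above can be sketched in a few lines. This is a minimal illustration with invented income figures and hypothetical helper names (`trimmed_mean`, `sensitivity_report`). A real analysis would run the same comparison on whichever estimate the study actually reports.

```python
from statistics import mean, median


def trimmed_mean(data, proportion=0.1):
    """Mean after dropping `proportion` of the points from each tail."""
    xs = sorted(data)
    k = int(len(xs) * proportion)
    return mean(xs[k:len(xs) - k])


def sensitivity_report(data, contested):
    """Compare estimates with and without the contested values."""
    kept = list(data)
    for value in contested:
        kept.remove(value)  # removes one occurrence; raises ValueError if absent
    return {
        "n_all": len(data),
        "n_without_contested": len(kept),
        "mean_all": mean(data),
        "mean_without_contested": mean(kept),
        "median_all": median(data),
        "trimmed_mean_all": trimmed_mean(data),
    }


# Invented incomes (thousands); 400 is a verified, genuine high earner.
incomes = [31, 34, 36, 38, 40, 41, 43, 45, 48, 400]

report = sensitivity_report(incomes, contested=[400])
for name, value in report.items():
    print(f"{name}: {value}")
```

Here the mean nearly halves when the contested point is left out, whereas the median and trimmed mean computed on the full data barely notice it. Reporting both versions lets readers see how much the conclusion depends on one observation.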
Pre-registering the outlier rule removes the temptation to choose a definition that produces a desired result. For more on transparent analysis decisions, see our reproducibility coverage and the CASRAI dictionary. Software choices also shape how outliers are detected and reported; see our review of statistical software in research.
Frequently asked questions
Should I always remove outliers?
No. Removing outliers by default is a common analytic error. Verified data-entry errors can be corrected or excluded, but genuine extreme values usually contain information and should be retained, often with a robust method. Always report what you did either way.
Which detection method is best?
There is no universal best. The IQR/boxplot rule is robust and assumption-light for skewed data; the z-score suits larger, roughly normal samples; Grubbs’ test is appropriate for a single suspected outlier under normality. Combine a numeric rule with a plot.
How do I report outlier handling?
State the detection rule, how many points were flagged, how each was classified, what action was taken and why, and the result of a sensitivity analysis with and without them. This level of detail is what makes the analysis reproducible. Our author guidance covers transparent methods reporting.
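One way to keep those details together is a small machine-readable log saved alongside the analysis. The sketch below is only a suggestion: the class name, field names, record IDs and figures are all hypothetical, not a reporting standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OutlierDecision:
    record_id: str
    value: float
    classification: str  # "error" or "genuine extreme"
    action: str          # "corrected", "excluded" or "retained"
    rationale: str


def build_report(rule, decisions, estimate_with, estimate_without):
    """Bundle the rule, every decision and the sensitivity result."""
    return {
        "detection_rule": rule,
        "n_flagged": len(decisions),
        "decisions": [asdict(d) for d in decisions],
        "sensitivity": {
            "estimate_with_flagged": estimate_with,
            "estimate_without_flagged": estimate_without,
        },
    }


# Invented entries for illustration only.
decisions = [
    OutlierDecision("P-017", 250.0, "error", "corrected",
                    "Age mistyped; source form shows 25."),
    OutlierDecision("P-102", 400.0, "genuine extreme", "retained",
                    "Income confirmed against the original record."),
]

report = build_report("Q1 - 1.5*IQR / Q3 + 1.5*IQR, pre-specified",
                      decisions, estimate_with=75.6, estimate_without=39.6)
print(json.dumps(report, indent=2))
```

A file like this can go into supplementary materials as is, and it answers each item in the list above without relying on anyone's memory of what was done.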
Do outliers affect meta-analyses too?
Yes. An aberrant study can dominate a pooled estimate just as a point dominates a sample. Sensitivity and influence analyses are standard, as discussed in our explainer on systematic reviews versus meta-analyses.
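A leave-one-out analysis is one simple form of influence check. The sketch below assumes a fixed-effect, inverse-variance pooled estimate and uses invented effect sizes and variances. Random-effects models and formal influence diagnostics are beyond this example.

```python
def pooled_estimate(effects, variances):
    """Fixed-effect pooled estimate using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances):
    """Pooled estimate recomputed with each study omitted in turn."""
    results = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        results.append(pooled_estimate(rest_effects, rest_variances))
    return results


# Invented studies: four small, similar effects and one large, precise one.
effects = [0.10, 0.12, 0.08, 0.11, 0.90]
variances = [0.04, 0.04, 0.04, 0.04, 0.01]

overall = pooled_estimate(effects, variances)
print(f"all studies: {overall:.4f}")
for i, estimate in enumerate(leave_one_out(effects, variances), start=1):
    print(f"without study {i}: {estimate:.4f} (shift {estimate - overall:+.4f})")
```

Omitting the fifth study changes the pooled estimate far more than omitting any other, which marks it as the one to investigate and to report in a sensitivity analysis.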







