Tag: statistics

  • Outliers in Statistics: Definition, Detection and Principled Handling

    An outlier is an observation that lies a markedly unusual distance from the rest of a dataset — far enough that it may distort summary statistics, model fit or test results. Outliers are not automatically errors to be removed; they are signals to be investigated, justified and reported.

    Why outliers matter for reproducibility

    A single extreme value can inflate a mean, balloon a variance or drag a regression line toward itself, changing a study’s conclusions. Because the decision about whether to keep or exclude such a point is a researcher degree of freedom, undocumented outlier handling is a well-known threat to reproducibility. Transparent reporting of what you found, what you did and why is the antidote.

    Two causes: error versus genuine extreme

    Outliers arise from two broad sources, and the cause dictates the response.

    • Error outliers come from data-entry mistakes, instrument faults, unit mix-ups or sampling problems. A recorded human age of 250 years is an error. These can legitimately be corrected or excluded once verified.
    • Genuine extremes are real but unusual observations — a true high earner in an income survey, a rare strong responder in a trial. These carry information and should generally be retained, possibly with a robust analysis.

    The crucial point is that you cannot tell the two apart from the number alone. Investigation of the source — the raw record, the instrument log, the data-collection notes — is what separates them.

    Detection methods

    Several established methods flag candidate outliers. None is definitive; each makes assumptions and each has a different sensitivity. Visual inspection should always accompany any rule.

    Method | How it works | Best suited to
    Z-score | Flags points whose distance from the mean exceeds a threshold of standard deviations (commonly 3) | Roughly normal, larger samples
    IQR / boxplot | Flags points beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR | Skewed data; robust, distribution-light
    Grubbs’ test | Formal hypothesis test for a single outlier in a normal sample | One suspected outlier, normality assumed
    Modified z-score (MAD) | Uses the median and median absolute deviation, resisting masking | Small samples or multiple outliers

    The z-score is intuitive but breaks down precisely when it matters most: a strong outlier inflates the standard deviation and can mask itself. The IQR rule, built on quartiles, is more robust and makes few distributional assumptions, which is why the boxplot remains the everyday workhorse. Grubbs’ test offers a formal, probabilistic answer when a single outlier is suspected in approximately normal data. Robust alternatives based on the median and MAD resist the masking and swamping that trip up mean-based rules.
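
    The contrast between the mean-based and robust rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data are hypothetical, the quartiles use the standard library's "inclusive" convention (other software may compute them differently), and the 3.5 cut-off with the 0.6745 scaling constant for the modified z-score is a commonly cited convention rather than something stated in the text above.

```python
import statistics

def zscore_flags(values, threshold=3.0):
    """Flag values more than `threshold` sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / sd > threshold]

def iqr_flags(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

def modified_z_flags(values, threshold=3.5):
    """Flag values using the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        return []  # at least half the values equal the median; the score is undefined
    # 0.6745 rescales the MAD so the score is comparable to a z-score under normality
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]

data = [22, 24, 25, 26, 28, 30, 95]  # hypothetical values with one extreme point

z_flagged = zscore_flags(data)
iqr_flagged = iqr_flags(data)
mad_flagged = modified_z_flags(data)
print(z_flagged, iqr_flagged, mad_flagged)
```

    With these seven values the z-score rule flags nothing, because the extreme point inflates the standard deviation it is measured against, while the IQR and MAD rules both flag 95.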

    Principled handling: never delete silently

    The cardinal rule is that you do not quietly drop inconvenient points. A defensible workflow looks like this:

    1. Detect and flag candidates using a pre-specified rule, ideally chosen before seeing the results.
    2. Investigate the source to classify each as error or genuine extreme.
    3. Decide and document — correct verified errors, retain genuine extremes, and record every decision with its rationale.
    4. Report sensitivity — run the analysis with and without the contested points and show whether conclusions change.
    5. Prefer robust methods where extremes are genuine, such as medians, trimmed means or rank-based tests, instead of deletion.
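
    Step 4 of the workflow above can be sketched as a side-by-side summary. The measurements and the flagged index are hypothetical, and a real sensitivity analysis would rerun the full model or test, not only the descriptive statistics shown here.

```python
import statistics

def summarise(values):
    """Summaries to compare across the two analyses."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

values = [4.1, 3.9, 4.4, 4.0, 4.2, 9.8]  # hypothetical measurements
contested = {5}                           # index of the flagged, still-disputed point

with_all = summarise(values)
without_contested = summarise(
    [v for i, v in enumerate(values) if i not in contested]
)

# Report both versions so readers can see whether the conclusion depends on the point
for name in ("n", "mean", "median", "sd"):
    print(f"{name:>6}: all = {with_all[name]:.2f}  "
          f"excluding contested = {without_contested[name]:.2f}")
```

    Selecting by index rather than by value keeps the exclusion traceable to a specific record, which is what the documentation step asks for.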

    Pre-registering the outlier rule removes the temptation to choose a definition that produces a desired result. For more on transparent analysis decisions see our reproducibility coverage and the CASRAI dictionary. Software choices also shape how outliers are detected and reported — see our review of statistical software in research.

    Frequently asked questions

    Should I always remove outliers?

    No. Removing outliers by default is one of the most common analytic errors. Verified data-entry errors can be corrected or excluded, but genuine extreme values usually contain information and should be retained, often with a robust method. Always report what you did either way.

    Which detection method is best?

    There is no universal best. The IQR/boxplot rule is robust and assumption-light for skewed data; the z-score suits larger, roughly normal samples; Grubbs’ test is appropriate for a single suspected outlier under normality. Combine a numeric rule with a plot.

    How do I report outlier handling?

    State the detection rule, how many points were flagged, how each was classified, what action was taken and why, and the result of a sensitivity analysis with and without them. This level of detail is what makes the analysis reproducible. Our author guidance covers transparent methods reporting.

    Do outliers affect meta-analyses too?

    Yes. An aberrant study can dominate a pooled estimate just as a point dominates a sample. Sensitivity and influence analyses are standard, as discussed in our explainer on systematic reviews versus meta-analyses.

  • Mean, Median and Mode: Measures of Central Tendency

    Measures of central tendency are summary statistics that describe the centre, or typical value, of a dataset using a single number. The three most common are the mean, the median and the mode. Each captures the centre in a different way, and choosing the right one depends on the shape of the data and the presence of outliers.

    The mean

    The mean, or arithmetic average, is the sum of all values divided by the number of values. It uses every data point, which makes it efficient, but also sensitive to extreme values. The mean is the natural choice for roughly symmetric data and underlies many statistical methods, including variance and the t-test.

    The median

    The median is the middle value when the data are arranged in order, splitting the dataset into two equal halves. If there is an even number of values, the median is the average of the two central ones. Because it depends only on rank, the median is resistant to outliers and is the preferred measure of centre for skewed distributions such as incomes or house prices.

    The mode

    The mode is the value that occurs most frequently. A dataset can have one mode, several modes or none at all. The mode is the only measure of central tendency that can be used with nominal (unordered) categorical data, such as the most common blood type or eye colour, where calculating a mean or median would be meaningless.

    When to use each measure

    Measure | Best for | Sensitive to outliers?
    Mean | Symmetric, continuous data | Yes
    Median | Skewed data or data with outliers | No
    Mode | Categorical or multimodal data | No

    The effect of skew and outliers

    In a perfectly symmetric, unimodal distribution, such as the normal distribution, the mean, median and mode coincide. When data are skewed, they separate. In a right-skewed distribution, a long tail of high values pulls the mean above the median, while in a left-skewed distribution the mean is dragged below it. The gap between mean and median is therefore a useful, quick indicator of skew. Because the mean is pulled towards extreme values, reporting the median alongside it for skewed data gives a more honest picture of the centre.

    A worked example

    Consider seven salaries, in thousands of pounds: 22, 24, 25, 26, 28, 30 and 95. The mean is the sum, 250, divided by 7, which is about 35.7. The median is the fourth value, 26, since the data are already in order. There is no repeated value, so there is no mode. The single high salary of 95 inflates the mean to nearly 36, well above what most people earn in this group, whereas the median of 26 represents the typical salary far better. This illustrates why the median is usually reported for income data. Choosing and stating the appropriate measure supports reproducible reporting, in line with the CASRAI dictionary and our guidance for authors.
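
    The arithmetic in this example can be checked with Python's standard `statistics` module. The snippet is only a sketch; the `has_mode` helper variable is an illustrative convention for "no value repeats", not a standard definition.

```python
import statistics

salaries = [22, 24, 25, 26, 28, 30, 95]  # thousands of pounds, from the example above

mean_salary = statistics.mean(salaries)      # 250 / 7
median_salary = statistics.median(salaries)  # fourth of the seven ordered values
modes = statistics.multimode(salaries)       # every value tied for most frequent

# When each value occurs exactly once, multimode returns all of them,
# which is treated here as "no mode".
has_mode = len(modes) < len(set(salaries))

print(round(mean_salary, 1), median_salary, has_mode)
```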

    Frequently asked questions

    Which measure of central tendency is best?

    There is no single best measure. The mean suits symmetric data, the median suits skewed data or data with outliers, and the mode suits categorical data. The right choice depends on the distribution and the question.

    Why does the mean differ from the median in skewed data?

    The mean is influenced by every value, including extremes in the tail, so it is pulled in the direction of the skew. The median depends only on the middle rank and so stays closer to the bulk of the data.

    Can a dataset have more than one mode?

    Yes. A dataset with two equally common peaks is bimodal, and one with several is multimodal. This can signal that the data come from distinct subgroups worth investigating separately.

  • T-Tests Explained: Comparing Two Means

    A t-test is a statistical test that assesses whether the difference between two means is larger than would be expected by chance alone. It compares the size of an observed difference against the variability in the data, producing a t-statistic that can be converted into a p-value. The t-test is one of the most common tools for comparing groups in research.

    How a t-test works

    The t-statistic is essentially the difference between means divided by the standard error of that difference. A large t-statistic indicates that the difference is large relative to the spread of the data, making it less likely to have arisen by chance. The t-statistic is then evaluated against the t-distribution, which resembles the normal distribution but has heavier tails to account for the extra uncertainty in small samples.
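
    As a rough illustration of "difference divided by standard error", the sketch below computes the statistic for two independent groups. It uses the unequal-variance (Welch) form with the Welch-Satterthwaite degrees of freedom, which is one choice among several. The data are hypothetical, and the p-value is left out because the standard library has no t-distribution.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for two independent samples without assuming equal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se = math.sqrt(va / na + vb / nb)  # standard error of the difference in means
    t = (statistics.mean(a) - statistics.mean(b)) / se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df

treatment = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]  # hypothetical scores
control = [4.2, 4.8, 4.5, 5.0, 4.4, 4.7]    # hypothetical scores

t_stat, dof = welch_t(treatment, control)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```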

    The three types of t-test

    There are three principal forms of the t-test, each suited to a particular comparison.

    Type | What it compares | Typical use
    One-sample | A sample mean against a known or hypothesised value | Testing whether a mean differs from a reference standard
    Independent-samples | The means of two separate, unrelated groups | Comparing a treatment group with a control group
    Paired | Two measurements from the same subjects | Before-and-after measurements on the same participants

    Choosing the right type is essential. Using an independent-samples test on paired data, for instance, ignores the correlation between the two measurements and usually reduces the power of the analysis.
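
    The loss of power is easy to see on a small, made-up before-and-after dataset. The sketch below computes both statistics by hand; the function names and the numbers are illustrative assumptions, and the independent version uses the unequal-variance standard error.

```python
import math
import statistics

def paired_t(before, after):
    """t-statistic computed on the within-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def independent_t(x, y):
    """t-statistic that (wrongly, for these data) treats the two columns as unrelated groups."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(y) - statistics.mean(x)) / se

before = [72, 85, 64, 90, 78, 69]  # hypothetical scores, same six participants
after = [75, 87, 68, 92, 81, 73]

t_paired = paired_t(before, after)
t_independent = independent_t(before, after)
print(f"paired t = {t_paired:.2f}, independent t = {t_independent:.2f}")
```

    Every participant improves by a similar amount, so the differences vary little and the paired statistic is large. The independent version compares the same mean shift against the much larger spread between participants.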

    Assumptions of the t-test

    The validity of a t-test rests on several assumptions. The data should be approximately normally distributed, particularly in small samples, although the central limit theorem makes the test fairly robust at larger sample sizes. Observations should be independent, except in the paired test where the pairing is deliberate. For the independent-samples test, the two groups are traditionally assumed to have equal variances; when this assumption is doubtful, Welch’s t-test, which does not require equal variances, is a safer default. Outliers can distort the result and should be inspected beforehand.

    Relationship to p-values and significance

    The t-test does not by itself prove that two groups differ; it quantifies the evidence against the null hypothesis that the means are equal. The resulting p-value is the probability of observing a difference at least as large as the one found, assuming the null hypothesis is true. A small p-value, conventionally below 0.05, suggests the difference is statistically significant, but it says nothing about the size or practical importance of the effect. Reporting the mean difference and a confidence interval alongside the p-value gives a fuller picture.

    Reporting t-tests transparently

    Good practice is to report the type of t-test used, the t-statistic, the degrees of freedom, the p-value, the effect size and a confidence interval. Stating which test was chosen and why, and confirming that its assumptions were checked, supports the reproducibility goals described in the CASRAI dictionary and our guidance for authors. An adequately powered design, discussed in our piece on statistical power, is equally important.

    Frequently asked questions

    When should I use a t-test rather than ANOVA?

    Use a t-test to compare two means. When you need to compare three or more group means simultaneously, analysis of variance (ANOVA) is the appropriate extension, as running multiple t-tests inflates the chance of a false positive.
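
    The inflation can be approximated with a short calculation. The sketch assumes each pairwise test is run at the same alpha and treats the tests as independent, which pairwise comparisons that share groups are not, so the figures are a rough guide only.

```python
from math import comb

def familywise_error(groups, alpha=0.05):
    """Number of pairwise comparisons and the approximate chance of at least one false positive."""
    comparisons = comb(groups, 2)
    return comparisons, 1 - (1 - alpha) ** comparisons

pairs_3, fwer_3 = familywise_error(3)  # three groups
pairs_5, fwer_5 = familywise_error(5)  # five groups
print(pairs_3, round(fwer_3, 3), pairs_5, round(fwer_5, 3))
```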

    What if my data are not normally distributed?

    For small, clearly non-normal samples, consider a non-parametric alternative such as the Mann-Whitney U test for independent groups or the Wilcoxon signed-rank test for paired data.

    What is the difference between a one-tailed and two-tailed t-test?

    A two-tailed test detects a difference in either direction and is the default. A one-tailed test only looks for a difference in one specified direction and should be used only when justified in advance.

  • Effect Size: Why It Matters Beyond Statistical Significance

    An effect size is a standardised measure of the magnitude of a difference or relationship, telling you how large an effect is rather than merely whether it is statistically detectable. Where a p-value answers “is there an effect?”, an effect size answers the more useful question “how big is it?”. Reporting effect sizes is now expected by major journals and statistical bodies, because significance alone can mislead.

    Why a p-value is not enough

    A p-value depends heavily on sample size. With a large enough sample, a trivially small difference can become statistically significant; with a small sample, a substantial effect can fail to reach significance. This means a significant result tells you an effect probably exists, but nothing about whether it is large enough to matter in practice. The American Statistical Association’s 2016 statement on p-values explicitly cautioned against treating statistical significance as a measure of importance and urged researchers to report effect sizes and uncertainty. For the foundations, see our explainer on p-values and statistical significance.
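
    The dependence on sample size can be made concrete with a simplified calculation. The sketch below uses a z-test for a difference between two group means and assumes a known, common standard deviation, which is a simplification of what a real analysis would do. The difference, standard deviation and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value from a z-test, assuming a known common SD in both groups."""
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small difference (0.05 standard deviations) at two sample sizes
p_small_n = two_sided_p(mean_diff=0.5, sd=10, n_per_group=50)
p_large_n = two_sided_p(mean_diff=0.5, sd=10, n_per_group=20000)
print(f"n = 50 per group: p = {p_small_n:.2f}")
print(f"n = 20000 per group: p = {p_large_n:.1e}")
```

    The effect is identical in both calls; only the amount of data changes, and with it the p-value.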

    Common effect size measures

    Different designs call for different effect size statistics. The table below summarises the most widely used.

    Measure | Used with | What it expresses | Rough benchmarks
    Cohen’s d | Difference between two means | Difference in standard-deviation units | 0.2 small, 0.5 medium, 0.8 large
    Eta-squared | ANOVA | Proportion of variance explained by a factor | 0.01 small, 0.06 medium, 0.14 large
    Pearson’s r | Correlation between two variables | Strength and direction of association | 0.1 small, 0.3 medium, 0.5 large
    Cramer’s V | Categorical association | Strength of relationship in a contingency table | Depends on table size

    These benchmarks, popularised by Jacob Cohen, are useful starting points but are not universal laws. What counts as a meaningful effect depends on the field: a small standardised effect in a public-health intervention can have enormous real-world value, while a large effect in a tightly controlled lab study may be unremarkable.
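
    For two groups, Cohen's d can be computed directly from the definition in the table. The sketch below uses the pooled sample standard deviation, one common variant; the data and the `benchmark` helper are illustrative assumptions, and the labels simply restate the rough benchmarks above.

```python
import math
import statistics

def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

def benchmark(d):
    """Map |d| onto Cohen's rough benchmarks (0.2, 0.5, 0.8)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

group_a = [10, 12, 14, 16, 18]  # hypothetical scores
group_b = [8, 10, 12, 14, 16]   # hypothetical scores

d = cohens_d(group_a, group_b)
print(f"d = {d:.2f} ({benchmark(d)})")
```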

    Effect size in context: ANOVA and categorical data

    Effect sizes pair naturally with the tests that produce p-values. After an ANOVA, eta-squared (or partial eta-squared) quantifies how much variance each factor explains. After a chi-square test, Cramer’s V or the phi coefficient gives the strength of association that the chi-square statistic alone cannot. Reporting the test statistic and the effect size together turns “there is an effect” into “there is an effect of this size”.

    Practical versus statistical significance

    Statistical significance concerns whether an effect is distinguishable from chance. Practical significance concerns whether the effect is large enough to matter for decisions, policy or theory. The two can diverge sharply. A drug that lowers blood pressure by a statistically significant but clinically negligible amount is significant without being meaningful. Effect sizes, ideally reported with confidence intervals, are what let readers judge practical importance for themselves.

    Reporting standards and reproducibility

    Effect size reporting is not optional in many venues. The APA Publication Manual has long required effect sizes alongside test results, and reporting guidelines across disciplines echo this. Effect sizes also power meta-analysis and a-priori power analysis: you cannot plan an adequately powered study without an expected effect size, as our guide to sample size and statistical power explains. Recording effect sizes, confidence intervals and the measure used is part of the transparent reporting we champion across our reproducibility coverage and codify in our guidance for authors.

    Frequently asked questions

    What is the difference between a p-value and an effect size?

    A p-value indicates how surprising the observed data would be if there were no true effect; it is not the probability that the effect is real. An effect size indicates how large the effect is. They answer different questions and should always be reported together.

    Which effect size should I report?

    Match the measure to the design: Cohen’s d for two-group mean differences, eta-squared for ANOVA, Pearson’s r for correlations, and Cramer’s V for categorical associations. Always state which measure you used.

    Can a result be statistically significant but practically meaningless?

    Yes. With a large sample, tiny differences become significant. The effect size, especially with a confidence interval, reveals whether the difference is large enough to matter in the real world.

    Why do journals now require effect sizes?

    Because significance alone gives an incomplete picture and contributes to overstated findings. Bodies such as the American Statistical Association and APA emphasise effect sizes to improve transparency and reproducibility. See the CASRAI dictionary for the standardised terms used in reporting.

  • The Normal Distribution Explained

    The normal distribution, also called the Gaussian distribution, is a continuous probability distribution that is symmetric about its mean and forms a bell-shaped curve. It is fully described by two parameters: the mean, which locates the centre of the curve, and the standard deviation, which controls its width. Most values lie near the mean, and values become increasingly rare as they move further away in either direction.

    Shape, symmetry and parameters

    A normal curve is perfectly symmetric, so its mean, median and mode coincide at the centre. The two tails extend infinitely in both directions, approaching but never touching the horizontal axis. Changing the mean shifts the curve left or right; changing the standard deviation stretches or compresses it. A larger standard deviation produces a flatter, wider bell; a smaller one produces a taller, narrower peak.

    The 68-95-99.7 rule

    For any normal distribution, a fixed proportion of values falls within a given number of standard deviations of the mean. This is known as the empirical rule, or the 68-95-99.7 rule.

    Within | Approximate proportion
    ±1 standard deviation | 68%
    ±2 standard deviations | 95%
    ±3 standard deviations | 99.7%

    This rule underpins the interpretation of confidence intervals and the identification of outliers, since values beyond about three standard deviations are unusual under normality.
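
    The three proportions can be recovered from the standard normal cumulative distribution function. The following is a minimal check using `statistics.NormalDist` from the Python standard library; the helper name is an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.4f}")
```

    The exact values are slightly different from the rounded figures in the table, which is why the rule is described as approximate.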

    The central limit theorem

    The normal distribution is central to statistics largely because of the central limit theorem. This theorem states that the sampling distribution of the mean of a sufficiently large number of independent observations is approximately normal, regardless of the shape of the underlying population, provided the population has a finite variance. In practice, sample means tend towards normality as sample size increases, often by around n = 30 for moderately skewed data. This is why many tests that compare means, such as the t-test, can be applied even when the raw data are not perfectly normal.
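
    A small simulation gives a feel for the theorem. The sketch draws repeated samples of size 30 from an exponential distribution, chosen here only as a convenient right-skewed population with mean 1 and standard deviation 1, and looks at how the sample means behave. The seed, sample size and number of repetitions are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a right-skewed exponential population (mean 1, SD 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(30) for _ in range(5000)]

centre = statistics.fmean(means)  # expected to sit near the population mean, 1
spread = statistics.stdev(means)  # expected to sit near 1 / sqrt(30)
print(f"centre = {centre:.3f}, spread = {spread:.3f}, theory = {1 / math.sqrt(30):.3f}")
```

    A histogram of `means` would look far more bell-shaped than the exponential population the values were drawn from.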

    Why it matters for inference

    Because the behaviour of the normal distribution is exactly known, it provides the mathematical basis for many inferential procedures, including the calculation of p-values and significance tests. Standardising a value into a z-score, by subtracting the mean and dividing by the standard deviation, lets researchers compare observations on a common scale and look up exact probabilities.

    What is and is not normally distributed

    Many measurements approximate a normal distribution, including heights, blood pressure and measurement errors. However, normality should never be assumed. Reaction times, incomes and counts of rare events are typically skewed, and some variables are bounded or bimodal. Always check the distribution using histograms or quantile-quantile plots before applying methods that assume normality. Defining variables and their distributions clearly supports the reproducibility standards set out in the CASRAI dictionary and our guidance for authors.

    Frequently asked questions

    What is the difference between the normal and standard normal distribution?

    The standard normal distribution is a special case with a mean of 0 and a standard deviation of 1. Any normal distribution can be converted to the standard normal by calculating z-scores.

    Does my data have to be normal to use statistics?

    Not always. Thanks to the central limit theorem, tests based on means are robust to non-normality at larger sample sizes. For small samples or strongly skewed data, non-parametric alternatives or transformations may be more appropriate.

    How can I check whether data are normally distributed?

    Use graphical tools such as histograms and quantile-quantile plots, supplemented by formal tests like Shapiro-Wilk. Visual inspection is often the most informative, as formal tests can be over-sensitive in large samples.

  • Regression Analysis: An Introduction for Researchers

    Regression analysis is a statistical method for modelling the relationship between an outcome variable and one or more predictor variables. In its simplest form, linear regression fits a straight line through a scatter of points to describe how the outcome changes, on average, as a predictor changes. It is one of the most widely used tools for prediction and for quantifying associations in research.

    The linear regression equation

    Simple linear regression summarises the relationship between a predictor x and an outcome y with the equation y = a + bx, where a is the intercept and b is the slope. The intercept is the predicted value of y when x is zero, and the slope is the average change in y for a one-unit increase in x. A positive slope indicates that y rises with x; a negative slope indicates that it falls.

    Least squares estimation

    The line is chosen by the method of ordinary least squares, which finds the slope and intercept that minimise the sum of the squared vertical distances between the observed points and the fitted line. These distances are called residuals. Squaring them, as with variance, prevents positive and negative residuals from cancelling and penalises large errors more heavily. The result is the best-fitting line in the least squares sense.

    Interpreting R-squared

    The coefficient of determination, R², measures the proportion of variance in the outcome that is explained by the model. It ranges from 0 to 1: an R² of 0 means the predictors explain none of the variation, while an R² of 1 means they explain all of it. An R² of 0.64, for example, indicates that 64% of the variation in the outcome is accounted for by the predictor. R² alone does not confirm that a model is correct, however; it should be read alongside residual plots and an assessment of the model’s assumptions.
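
    The least squares fit and R² described above can be computed from first principles for a single predictor. This is a minimal sketch with hypothetical data; in practice a statistics package would also report standard errors and diagnostics.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx    # slope
    a = my - b * mx  # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)   # unexplained variation
    ss_tot = sum((yi - my) ** 2 for yi in y)  # total variation
    return a, b, 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]               # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical outcomes

intercept, slope, r_squared = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f}x, R² = {r_squared:.3f}")
```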

    Multiple regression

    Multiple regression extends the model to include several predictors at once, taking the form y = a + b₁x₁ + b₂x₂ + … + bₖxₖ. Each slope coefficient estimates the effect of its predictor while holding the others constant, which helps to adjust for confounding variables. This makes multiple regression valuable when several factors plausibly influence an outcome.

    Assumptions of linear regression

    Assumption | Meaning
    Linearity | The relationship between predictor and outcome is linear
    Independence | Residuals are independent of one another
    Homoscedasticity | Residual variance is constant across the range of predictions
    Normality of residuals | Residuals are approximately normally distributed

    When these assumptions are violated, estimates and p-values can be misleading. Diagnostic plots help to detect problems before results are reported.

    Correlation is not causation

    A statistically significant slope shows that two variables are associated, not that one causes the other. Unmeasured confounders, reverse causation or coincidence can all produce a relationship. Causal claims require careful study design, such as randomised experiments, not regression alone. Stating this limitation clearly is part of transparent, reproducible reporting, as encouraged by the CASRAI dictionary and our author guidance.

    Frequently asked questions

    What is the difference between correlation and regression?

    Correlation measures the strength and direction of a linear association with a single number between −1 and 1. Regression goes further, producing an equation that predicts the outcome and quantifies the effect of each predictor.

    What counts as a good R-squared value?

    It depends entirely on the field. In physical sciences an R² above 0.9 may be expected, whereas in social or biological research values of 0.2 to 0.4 can still be meaningful. Always interpret R² in context.

    Can regression prove causation?

    No. Regression quantifies association and can adjust for measured confounders, but it cannot establish causation on its own. Causal inference requires appropriate design, such as randomisation or robust quasi-experimental methods.

  • The Chi-Square Test for Categorical Data: A Practical Guide

    The chi-square test is a statistical method for categorical data that compares the frequencies you actually observe with the frequencies you would expect if a given hypothesis were true. The larger the gap between observed and expected counts, the larger the chi-square statistic, and the stronger the evidence against the hypothesis of no relationship. It is the workhorse test for counts, proportions and contingency tables across the social, biological and medical sciences.

    Observed versus expected frequencies

    Every chi-square test rests on the same intuition. You record how many cases fall into each category (the observed frequencies), then calculate how many should fall there under your null hypothesis (the expected frequencies). The statistic sums the squared difference between observed and expected, divided by expected, across all cells:

    χ² = Σ (observed − expected)² / expected

    A value near zero means observation matches expectation. A large value, evaluated against the chi-square distribution with the appropriate degrees of freedom, produces a small p-value and signals a meaningful departure. For background on interpreting those probabilities, see our explainer on p-values and statistical significance.
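
    As a sketch of the calculation, the snippet below applies the formula to a goodness-of-fit question about a six-sided die. The roll counts are hypothetical, and the p-value is omitted because the Python standard library has no chi-square distribution; the statistic would be compared against a chi-square distribution with the degrees of freedom shown.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rolls = [8, 12, 9, 11, 14, 6]         # hypothetical counts for faces 1 to 6
expected = [sum(rolls) / 6] * 6      # a fair die: 60 rolls spread evenly, 10 per face

statistic = chi_square(rolls, expected)
df = len(rolls) - 1                  # categories minus one for goodness-of-fit
print(f"chi-square = {statistic:.1f}, df = {df}")
```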

    Two common forms of the test

    There are two principal versions, which answer different questions.

    Feature | Goodness-of-fit | Test of independence
    Variables | One categorical variable | Two categorical variables
    Question | Do observed counts match an expected distribution? | Are the two variables associated?
    Data layout | Single row of category counts | Contingency (cross-tabulation) table
    Expected counts from | A theoretical or known distribution | Row and column marginal totals
    Example | Is a die fair across its six faces? | Is treatment outcome related to dosage group?

    The goodness-of-fit test checks whether a single variable follows a hypothesised distribution. The test of independence checks whether two variables in a contingency table are related or vary independently. A closely related variant, the test of homogeneity, asks whether several populations share the same category distribution.

    Assumptions and small-sample cautions

    The chi-square test relies on a handful of conditions. The data must be frequency counts, not percentages or means. Observations should be independent, with each case appearing in only one cell. Finally, expected counts should be reasonably large: a common rule of thumb is that every cell should have an expected frequency of at least 5. A widely cited looser version, attributed to Cochran, tolerates up to 20% of cells below 5 provided none falls below 1. When tables are small or sparse, Fisher’s exact test is the safer choice, and for 2×2 tables Yates’s continuity correction is sometimes applied. Reporting which test variant and corrections were used is part of transparent, replicable analysis, a theme across our reproducibility coverage.

    Interpreting and reporting the result

    A significant chi-square tells you that an association or departure exists, but not how strong it is. Because the statistic scales with sample size, even trivial differences become significant in very large datasets. For this reason you should accompany the test with a measure of association such as Cramer’s V or the phi coefficient, which behave like an effect size for categorical data. Report the chi-square value, degrees of freedom, sample size and p-value together, for example: chi-square(2, N = 240) = 11.3, p = .004.
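
    A test of independence and its accompanying Cramer's V can be sketched as follows. The contingency table is hypothetical. The p-value line relies on the fact that a chi-square distribution with exactly 2 degrees of freedom has the closed-form tail probability exp(−x/2); for any other table shape a statistics library would be needed.

```python
import math

def independence_test(table):
    """Chi-square statistic, degrees of freedom and Cramer's V for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # from the marginal totals
            chi2 += (observed - expected) ** 2 / expected
    rows, cols = len(table), len(table[0])
    df = (rows - 1) * (cols - 1)
    v = math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    return chi2, df, v, n

# Hypothetical counts: outcome (rows) by dosage group (columns)
table = [[30, 50, 40],
         [50, 40, 30]]

chi2, df, cramers_v, n = independence_test(table)
p_value = math.exp(-chi2 / 2)  # valid only because df == 2 for this 2 x 3 table
print(f"chi-square({df}, N = {n}) = {chi2:.2f}, p = {p_value:.3f}, V = {cramers_v:.2f}")
```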

    Adequate planning matters too. As with mean comparisons in ANOVA, the power to detect a true association depends on having enough observations, a point we expand on in our guide to sample size and statistical power.

    Frequently asked questions

    When should I use a chi-square test rather than a t-test or ANOVA?

    Use chi-square when your outcome is categorical and you are working with counts in categories. Use a t-test or ANOVA when your outcome is a continuous measurement whose means you want to compare across groups.

    What is the difference between goodness-of-fit and the test of independence?

    Goodness-of-fit examines one variable against an expected distribution. The test of independence examines whether two variables in a contingency table are associated. They share the same formula but answer different questions.

    What happens if my expected counts are too small?

    The chi-square approximation becomes unreliable when expected cell counts fall below about 5. In that case, combine sparse categories where it makes sense, or use Fisher’s exact test, which is valid for small samples.

    Does a significant chi-square tell me how strong the relationship is?

    No. It only indicates that a relationship is unlikely to be due to chance. To judge strength, report an association measure such as Cramer’s V alongside the result. The CASRAI dictionary and our author guidance describe the reporting metadata that keeps such analyses auditable.

  • Standard Deviation in Research: A Clear Statistical Definition

    Standard deviation is a measure of how spread out a set of values is around its mean. It expresses, in the original units of the data, the typical distance of an observation from the average. A small standard deviation means values cluster tightly around the mean; a large standard deviation means they are widely dispersed. It is one of the most widely reported summary statistics in quantitative research because it captures variability that a mean alone conceals.

    Standard deviation and the mean

    Two datasets can share an identical mean yet behave very differently. Consider two classes whose mean test score is 70. In the first, scores fall between 68 and 72; in the second, they range from 40 to 100. Both means are 70, but the second class is far more variable. The standard deviation quantifies that difference, which is why reporting a mean without a measure of spread is incomplete.

    Standard deviation is the square root of the variance. Variance is the average of the squared deviations of each value from the mean. Squaring removes negative signs and emphasises larger departures, but it also leaves variance in squared units. Taking the square root returns the figure to the original units, making standard deviation the more interpretable companion to the mean.
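
    To make the two-class comparison and the step from variance to standard deviation concrete, here is a small sketch. The individual scores are invented to match the description above (both classes average 70), and the variance is computed exactly as defined in the previous paragraph, as the average of the squared deviations.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# Hypothetical scores for two classes that share a mean of 70.
class_a = [68, 69, 70, 71, 72]
class_b = [40, 55, 70, 85, 100]

for name, scores in (("class A", class_a), ("class B", class_b)):
    variance = population_variance(scores)
    print(f"{name}: mean = {sum(scores) / len(scores):.0f}, "
          f"variance = {variance:.1f} (squared points), "
          f"SD = {math.sqrt(variance):.2f} (points)")
```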

    Population versus sample

    The formula differs depending on whether the data represent an entire population or a sample drawn from one. The population standard deviation divides the sum of squared deviations by N, the number of values. The sample standard deviation divides by n minus 1 rather than n. This adjustment, known as Bessel’s correction, compensates for the tendency of a sample to underestimate the spread of the population it came from. Because most research analyses a sample and infers something about a wider population, the sample formula with n minus 1 is the one most often applied.

    | Quantity | Divisor | Used when |
    | --- | --- | --- |
    | Population standard deviation | N | Every member of the population is measured |
    | Sample standard deviation | n − 1 | A sample is used to estimate the population |
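
    The two divisors can be compared side by side. The sketch below uses made-up data and a hypothetical helper, `standard_deviation`, then checks the results against `statistics.pstdev` and `statistics.stdev` from the Python standard library, which apply the N and n − 1 divisors respectively.

```python
import math
import statistics


def standard_deviation(values, sample=True):
    """Divide the sum of squared deviations by n - 1 (sample) or N (population)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return math.sqrt(sum_of_squares / divisor)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
print(f"population SD (divide by N):   {standard_deviation(data, sample=False):.3f}")
print(f"sample SD (divide by n - 1):   {standard_deviation(data, sample=True):.3f}")

# The standard library applies the same two divisors.
assert math.isclose(standard_deviation(data, sample=False), statistics.pstdev(data))
assert math.isclose(standard_deviation(data, sample=True), statistics.stdev(data))
```

    The sample figure is always the larger of the two for the same data, and the gap narrows as n grows.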

    The 68-95-99.7 rule

    When data follow a normal (bell-shaped) distribution, standard deviation maps onto predictable proportions of the data. This is the empirical rule, often called the 68-95-99.7 rule. Approximately 68% of values fall within one standard deviation of the mean, about 95% fall within two standard deviations, and roughly 99.7% fall within three. These figures hold only for a normal distribution and are approximations for real data that merely resemble one; skewed or heavy-tailed distributions will not obey them.

    | Range from the mean | Approximate share of data (normal distribution) |
    | --- | --- |
    | ±1 standard deviation | 68% |
    | ±2 standard deviations | 95% |
    | ±3 standard deviations | 99.7% |
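
    These proportions can be recovered from the normal cumulative distribution function. A short sketch using `NormalDist` from the Python standard library (available from Python 3.8); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```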

    A worked conceptual example

    Suppose adult resting heart rates in a sample have a mean of 70 beats per minute and a standard deviation of 8. If the distribution is roughly normal, then about 68% of people in that sample have a resting rate between 62 and 78 (the mean plus or minus one standard deviation). About 95% fall between 54 and 86 (two standard deviations), and almost everyone, around 99.7%, falls between 46 and 94 (three standard deviations). A reading of 100 would lie more than three standard deviations above the mean and would therefore be unusual relative to this sample. Examining such extreme values links directly to outlier detection, a related step in data quality assessment.
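
    The arithmetic behind this example is simple enough to sketch in a few lines. The mean of 70 and standard deviation of 8 come from the example above; the function names are illustrative.

```python
mean_bpm = 70.0
sd_bpm = 8.0


def band(k):
    """Interval covering the mean plus or minus k standard deviations."""
    return mean_bpm - k * sd_bpm, mean_bpm + k * sd_bpm


def z_score(reading):
    """Number of standard deviations a reading lies from the mean."""
    return (reading - mean_bpm) / sd_bpm


for k in (1, 2, 3):
    low, high = band(k)
    print(f"within {k} SD: {low:.0f} to {high:.0f} bpm")
print(f"a reading of 100 bpm has z = {z_score(100):.2f}")
```

    A reading of 100 sits 3.75 standard deviations above the mean, which is why it falls outside the three-standard-deviation band.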

    Standard deviation versus standard error

    A frequent source of confusion is the difference between standard deviation and standard error. Standard deviation describes the variability of individual observations in the data. The standard error of the mean describes the variability of the sample mean itself as an estimate of the population mean, and it equals the standard deviation divided by the square root of the sample size. Because of that division, the standard error is smaller than the standard deviation for any sample of more than one observation, and it shrinks as the sample grows.

    The choice between them depends on what is being communicated. To describe how much individuals differ from one another, report the standard deviation. To express how precisely the mean has been estimated, report the standard error or, more informatively, a confidence interval. Reporting a standard error where a standard deviation is meant can mislead readers into thinking data are far less variable than they are. For practical reporting conventions, see the CASRAI author guidance and the CASRAI dictionary.
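
    The relationship between the two quantities can be sketched as follows. The standard deviation of 8 and the sample sizes are hypothetical values chosen for illustration.

```python
import math


def standard_error(standard_deviation, sample_size):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return standard_deviation / math.sqrt(sample_size)


sd = 8.0  # hypothetical standard deviation, in the units of the data
for n in (4, 16, 64, 256):
    print(f"n = {n:>3}: SD = {sd:.1f}, SE = {standard_error(sd, n):.2f}")
```

    Quadrupling the sample size halves the standard error, while the standard deviation itself is unchanged.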

    Frequently asked questions

    Why divide by n minus 1 for a sample?

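    Dividing by n minus 1 corrects a bias. Deviations are measured from the sample mean, which sits closer to the sample’s own values than the true population mean does, so the sum of squared deviations comes out slightly too small on average. Dividing by the smaller divisor offsets this and gives an unbiased estimate of the population variance. This is Bessel’s correction.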
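
    A small simulation can make the bias visible. This is a sketch under stated assumptions: samples of five are drawn from a standard normal population (true variance 1), and the sample size, number of trials and seed are arbitrary choices, not recommendations.

```python
import random


def mean_variance_estimates(sample_size=5, trials=20000, seed=0):
    """Average the n-divisor and (n - 1)-divisor variance estimates over many samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_divide_by_n = 0.0
    total_divide_by_n_minus_1 = 0.0
    for _ in range(trials):
        # Draw from a standard normal population, whose true variance is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        sum_of_squares = sum((x - sample_mean) ** 2 for x in sample)
        total_divide_by_n += sum_of_squares / sample_size
        total_divide_by_n_minus_1 += sum_of_squares / (sample_size - 1)
    return total_divide_by_n / trials, total_divide_by_n_minus_1 / trials


biased, corrected = mean_variance_estimates()
print(f"divide by n: {biased:.3f}   divide by n - 1: {corrected:.3f}   true variance: 1.000")
```

    With samples of five, the n divisor should average close to 0.8, that is (n − 1)/n of the true variance, while the n − 1 divisor should average close to 1.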

    Can standard deviation be negative?

    No. It is the square root of an average of squared quantities, so it is always zero or positive. A standard deviation of zero means every value is identical to the mean.

    Should I report standard deviation or standard error?

    Report the standard deviation to describe variability among observations, and the standard error or a confidence interval to describe the precision of the mean. For wider context on variability and uncertainty, see our guide to confidence intervals and the reproducibility news category.

  • What Is Statistics? The Discipline and Its Role in Research

    Statistics is the discipline concerned with collecting, organising, analysing, interpreting and presenting data. At its core it is the science of reasoning under uncertainty: it provides methods for drawing conclusions about a whole population from a limited sample, and for quantifying how much confidence those conclusions deserve. Statistics underpins quantitative research across every field, from medicine and economics to ecology and the social sciences.

    Descriptive versus inferential statistics

    The discipline divides into two broad branches. Descriptive statistics summarise and describe the features of a dataset without claiming anything beyond it. Measures of central tendency such as the mean, median and mode, measures of spread such as the range and standard deviation, and visual summaries such as histograms all belong here. Descriptive statistics tell you what the data at hand look like.

    Inferential statistics go further: they use a sample to make estimates or test claims about a larger population that has not been fully observed. Estimation, hypothesis testing, confidence intervals and regression modelling are all inferential tools. The defining feature of inference is that it carries uncertainty, and statistics provides the machinery to measure that uncertainty rather than ignore it.

    | Branch | Purpose | Typical tools |
    | --- | --- | --- |
    | Descriptive | Summarise observed data | Mean, median, standard deviation, charts |
    | Inferential | Draw conclusions about a population | Confidence intervals, hypothesis tests, regression |

    Populations and samples

    The distinction between a population and a sample is fundamental. A population is the entire set of units a researcher wishes to understand: all adults in a country, every transaction in a year, all stars in a galaxy. A sample is a subset of that population actually measured. Because studying an entire population is usually impractical, researchers work from samples and infer to the whole. A numerical fact about a population is a parameter; the corresponding figure calculated from a sample is a statistic, and statistics as a discipline is largely the study of how well sample statistics estimate population parameters.
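
    The parameter and statistic distinction can be illustrated with a toy population that is small enough to measure in full. Everything here is hypothetical: the population is simply the integers 1 to 100, and the seed and sample size are arbitrary.

```python
import random
import statistics

# Hypothetical population: the integers 1 to 100, standing in for any fully known set of units.
population = list(range(1, 101))
parameter = statistics.mean(population)  # the population mean is a parameter

rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample_a = rng.sample(population, 10)
sample_b = rng.sample(population, 10)
statistic_a = statistics.mean(sample_a)  # each sample mean is a statistic
statistic_b = statistics.mean(sample_b)

print(f"parameter (population mean): {parameter}")
print(f"statistic from sample A:     {statistic_a}")
print(f"statistic from sample B:     {statistic_b}")
```

    The parameter is fixed at 50.5, whereas each sample yields its own estimate of it. How far such estimates typically stray from the parameter is the question inferential methods address.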

    Estimation and hypothesis testing

    Two complementary tasks dominate inferential work. Estimation asks how large a quantity is and how precisely we know it, producing point estimates and interval estimates such as confidence intervals. Hypothesis testing asks whether the data are compatible with a specific claim, typically a null hypothesis of no effect, and summarises that compatibility with measures such as p-values. Both rest on the idea that random sampling produces variation, and that this variation can be modelled probabilistically.

    Variability and probability

    Underlying all of statistics is the recognition that data vary. Two samples from the same population will rarely give identical results, and statistics describes this sampling variation using probability. Measures such as the standard deviation quantify spread within data, while probability distributions describe how estimates would behave across repeated sampling. This probabilistic foundation is what allows statisticians to attach honest measures of uncertainty to their conclusions.

    Why statistics is central to research

    Statistics is not an optional add-on to research; it shapes how studies are designed, how large samples need to be, how data are analysed and how findings are reported. Sound statistical practice is essential for reproducibility, because it disciplines researchers against over-interpreting noise and helps others judge whether a result is robust. Poor statistical practice, by contrast, is a recognised driver of irreproducible findings. CASRAI’s work on standardised reporting and the CASRAI dictionary supports clearer, more comparable statistical reporting across the scholarly record, and the reproducibility category tracks developments in this area.

    Frequently asked questions

    Is statistics a branch of mathematics?

    Statistics uses mathematics, particularly probability theory, but it is usually regarded as a distinct discipline. Its focus is on data, inference and the practical business of learning from observation under uncertainty, not on abstract mathematical structure alone.

    What is the difference between a parameter and a statistic?

    A parameter is a fixed numerical characteristic of a population, such as the population mean. A statistic is the corresponding figure computed from a sample, such as the sample mean. Statistics as a discipline studies how to estimate parameters from statistics.

    Why does statistics matter for reproducibility?

    Reproducibility depends on whether a reported result reflects a genuine effect or random variation. Statistical methods quantify that uncertainty and guard against over-claiming, so transparent statistical reporting is one foundation of a trustworthy scholarly record. See the CASRAI author guidance for reporting practices.

  • ANOVA (Analysis of Variance) Explained: Comparing Means Across Groups

    Analysis of variance (ANOVA) is a statistical method that tests whether the means of three or more groups differ by more than would be expected from random variation alone. It does this by comparing the variance between group means against the variance within groups, summarised in a single F-statistic. ANOVA is one of the most widely used inferential tests in experimental research, and reporting it transparently is central to reproducible analysis.

    Why ANOVA instead of multiple t-tests?

    A t-test compares two group means. When you have three or more groups, it is tempting to run a separate t-test for every pair. The problem is the family-wise error rate: each test carries its own chance of a false positive, and those chances accumulate. With three groups there are three pairwise comparisons; at a 5% significance level the probability of at least one false positive rises to roughly 14% (1 − 0.95³, treating the tests as independent), and it climbs further as groups are added. ANOVA solves this by performing a single omnibus test that asks one question: are any of the group means different?

    This control of error is why ANOVA underpins so much of experimental design. For a refresher on what significance thresholds mean in practice, see our explainer on p-values and statistical significance.
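
    The accumulation of false-positive risk described above can be sketched numerically. The calculation assumes the pairwise tests are independent, which is a simplification because tests that share a group are correlated, so the figures are best read as approximate. The function names are illustrative.

```python
from math import comb


def pairwise_comparisons(groups):
    """Number of distinct pairs among `groups` groups."""
    return comb(groups, 2)


def familywise_error_rate(comparisons, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** comparisons


for groups in (3, 4, 5):
    m = pairwise_comparisons(groups)
    print(f"{groups} groups: {m} comparisons, "
          f"family-wise error rate about {familywise_error_rate(m):.1%}")
```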

    The F-statistic and how it works

    ANOVA partitions the total variability in the data into two components. The between-groups variance reflects how far each group mean sits from the overall (grand) mean. The within-groups variance reflects the natural spread of observations inside each group. The F-statistic is the ratio of these two:

    F = between-groups variance / within-groups variance

    If the groups truly share a common mean, both quantities estimate the same underlying variability and F sits near 1. When real differences exist, the between-groups term grows and F rises. A large F, evaluated against the F-distribution with the appropriate degrees of freedom, yields a small p-value and signals that at least one mean differs.
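
    A hand-rolled sketch of the ratio for a one-way design, using invented data. Here the between-groups and within-groups variances are the usual mean squares: each sum of squares divided by its degrees of freedom (k − 1 and N − k). The function name is illustrative, and turning F into a p-value would additionally require the F-distribution, which this sketch leaves out.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical measurements for three groups (illustrative numbers only).
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

    In this made-up data the group means (5, 7 and 9) are far apart relative to the spread inside each group, so F comes out well above 1.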

    One-way versus two-way ANOVA

    The design depends on how many factors you are manipulating.

    | Feature | One-way ANOVA | Two-way ANOVA |
    | --- | --- | --- |
    | Number of factors | One independent variable | Two independent variables |
    | Example question | Does diet type affect plant growth? | Do diet type and watering frequency affect plant growth? |
    | Main effects | One | Two (one per factor) |
    | Interaction | Not assessed | Tests whether factors combine non-additively |
    | Output | Single F-statistic | F-statistic for each main effect plus interaction |

    The key advantage of two-way ANOVA is the interaction effect: it reveals whether the influence of one factor depends on the level of another, something separate analyses would miss.
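
    What an interaction looks like in the cell means can be sketched with a hypothetical 2×2 version of the plant-growth example. The numbers are invented, and the sketch only describes the pattern; whether it is statistically significant is what the interaction F-test in a two-way ANOVA assesses.

```python
# Hypothetical mean plant growth (cm) for each combination of the two factors.
cell_means = {
    ("diet_A", "daily"): 10.0,
    ("diet_A", "weekly"): 8.0,
    ("diet_B", "daily"): 15.0,
    ("diet_B", "weekly"): 9.0,
}


def watering_effect(diet):
    """Difference in mean growth between daily and weekly watering for one diet."""
    return cell_means[(diet, "daily")] - cell_means[(diet, "weekly")]


def interaction_contrast():
    """How much the watering effect changes from diet A to diet B (0 means additive)."""
    return watering_effect("diet_B") - watering_effect("diet_A")


print(f"watering effect under diet A: {watering_effect('diet_A'):.1f} cm")
print(f"watering effect under diet B: {watering_effect('diet_B'):.1f} cm")
print(f"interaction contrast:         {interaction_contrast():.1f} cm")
```

    Daily watering adds 2 cm under diet A but 6 cm under diet B, so the effect of watering depends on diet. Two separate one-way analyses would report each factor’s average effect and miss that dependence.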

    Assumptions you must check

    ANOVA rests on three core assumptions. Observations should be independent. The residuals should be approximately normally distributed. And the groups should show roughly equal variances, a property called homogeneity of variance (homoscedasticity). When variances differ markedly, a Welch ANOVA is a robust alternative; when normality fails, a non-parametric Kruskal-Wallis test may be more appropriate. Stating which assumptions were tested, and how, is good practice and supports replication, as we discuss across our reproducibility coverage.

    Post-hoc tests: locating the difference

    A significant ANOVA tells you that some mean differs, but not which one. Post-hoc tests answer that follow-up while still controlling the family-wise error rate. Tukey’s HSD is the standard choice for all pairwise comparisons with equal sample sizes; Bonferroni correction is conservative and simple; Scheffé’s test is flexible for complex contrasts. Crucially, you should not revert to uncorrected t-tests after a significant ANOVA, as that reintroduces the inflated error the test was designed to prevent.
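
    Of the post-hoc options, the Bonferroni correction is the easiest to sketch: each comparison is judged against the significance level divided by the number of comparisons. The p-values below are hypothetical and the function name is illustrative.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant if it is at or below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from three pairwise comparisons (illustrative only).
p_values = [0.004, 0.020, 0.300]
threshold = 0.05 / len(p_values)
print(f"per-comparison threshold: {threshold:.4f}")
for p, significant in zip(p_values, bonferroni_significant(p_values)):
    print(f"p = {p:.3f}: {'significant' if significant else 'not significant'}")
```

    The second comparison would pass an uncorrected 0.05 threshold but fails the corrected one, which illustrates the conservatism mentioned above.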

    Equally important, statistical significance does not measure how large a difference is. Always pair ANOVA results with an effect size such as eta-squared, as covered in our companion piece on why effect size matters beyond significance. Authors planning a study should also budget adequate sample size and statistical power so a real effect can actually be detected.
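
    Eta-squared is the share of total variability attributable to group membership: the between-groups sum of squares divided by the total sum of squares. A minimal sketch with invented data and an illustrative function name:

```python
def eta_squared(groups):
    """Between-groups sum of squares divided by total sum of squares."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


# Hypothetical measurements for three groups (illustrative numbers only).
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
print(f"eta-squared = {eta_squared(groups):.2f}")
```

    Here group membership accounts for 80% of the variability, a very large effect that reflects how cleanly separated these made-up groups are.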

    Frequently asked questions

    What does a significant ANOVA result actually tell you?

    It tells you that at least one group mean differs from the others by more than chance would explain. It does not identify which groups differ or how large the difference is; you need post-hoc tests and effect sizes to answer those questions.

    Can ANOVA be used for only two groups?

    Yes. With two groups a one-way ANOVA gives results mathematically equivalent to an independent-samples t-test (F equals t squared). ANOVA’s real value appears with three or more groups, where it prevents the error inflation of multiple t-tests.
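
    The equivalence can be checked numerically. This sketch computes the pooled-variance (equal variances assumed) independent-samples t statistic and the one-way ANOVA F for the same two hypothetical groups; the function names and data are illustrative.

```python
import math


def pooled_t(a, b):
    """Independent-samples t statistic with a pooled variance estimate."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_variance = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_variance * (1 / len(a) + 1 / len(b)))


def two_group_f(a, b):
    """One-way ANOVA F statistic for exactly two groups."""
    n = len(a) + len(b)
    grand_mean = (sum(a) + sum(b)) / n
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss_between = len(a) * (mean_a - grand_mean) ** 2 + len(b) * (mean_b - grand_mean) ** 2
    ss_within = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    return (ss_between / 1) / (ss_within / (n - 2))  # df_between = 1 for two groups


# Hypothetical data for two groups of unequal size (illustrative only).
group_a = [4, 5, 6]
group_b = [6, 7, 8, 9]
t = pooled_t(group_a, group_b)
f = two_group_f(group_a, group_b)
print(f"t = {t:.3f}, t squared = {t ** 2:.3f}, F = {f:.3f}")
```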

    What is the difference between a main effect and an interaction?

    A main effect is the overall influence of one factor averaged across the others. An interaction means the effect of one factor changes depending on the level of another. Detecting interactions is the principal reason to use two-way rather than one-way designs.

    How should ANOVA results be reported for reproducibility?

    Report the F-statistic with both degrees of freedom, the p-value, an effect size, the post-hoc method used, and confirmation that assumptions were checked. The CASRAI dictionary and our guidance for authors set out the metadata that makes such results auditable.