Tag: p-values

  • T-Tests Explained: Comparing Two Means

    A t-test is a statistical test that assesses whether the difference between two means is larger than would be expected by chance alone. It compares the size of an observed difference against the variability in the data, producing a t-statistic that can be converted into a p-value. The t-test is one of the most common tools for comparing groups in research.

    How a t-test works

    The t-statistic is essentially the difference between means divided by the standard error of that difference. A large t-statistic indicates that the difference is large relative to the spread of the data, making it less likely to have arisen by chance. The t-statistic is then evaluated against the t-distribution, which resembles the normal distribution but has heavier tails to account for the extra uncertainty in small samples.
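
    As a rough illustration of the ratio described above, the sketch below computes an independent-samples t-statistic by hand using only the Python standard library. The function name and the measurements are invented for this example, and the standard error is the unequal-variance form; treat it as a sketch rather than a substitute for a statistics package.

```python
import math
import statistics


def t_statistic(group_a, group_b):
    """Difference between means divided by the standard error of that difference."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Standard error of the difference, without assuming equal variances:
    # sqrt(s_a^2 / n_a + s_b^2 / n_b), using sample variances
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / standard_error


# Hypothetical measurements, invented purely for illustration
treatment = [5.1, 5.8, 6.2, 5.9, 6.4]
control = [4.8, 5.0, 5.3, 4.9, 5.2]

print(round(t_statistic(treatment, control), 2))
```

    The larger the result in absolute value, the larger the difference is relative to the noise in the two samples.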

    The three types of t-test

    There are three principal forms of the t-test, each suited to a particular comparison.

    | Type | What it compares | Typical use |
    |---|---|---|
    | One-sample | A sample mean against a known or hypothesised value | Testing whether a mean differs from a reference standard |
    | Independent-samples | The means of two separate, unrelated groups | Comparing a treatment group with a control group |
    | Paired | Two measurements from the same subjects | Before-and-after measurements on the same participants |

    Choosing the right type is essential. Using an independent-samples test on paired data, for instance, ignores the correlation between the two measurements and usually reduces the power of the analysis.
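
    To make the cost of choosing the wrong test concrete, the sketch below runs both a paired and an independent-samples calculation on the same before-and-after data. The readings and function names are hypothetical, and only the t-statistics are compared (no p-values are computed).

```python
import math
import statistics


def paired_t(before, after):
    """Paired t-statistic: a one-sample test on the within-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def independent_t(before, after):
    """Unequal-variance independent-samples t-statistic, which ignores the pairing."""
    standard_error = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )
    return (statistics.mean(after) - statistics.mean(before)) / standard_error


# Hypothetical readings for six participants, invented for illustration
before = [120, 135, 150, 142, 128, 160]
after = [116, 130, 147, 137, 125, 154]

print("paired t:     ", round(paired_t(before, after), 2))
print("independent t:", round(independent_t(before, after), 2))
```

    Every participant drops by a few units, so the differences are very consistent and the paired statistic is large in magnitude. The independent version buries that consistent change under the large between-participant spread.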

    Assumptions of the t-test

    The validity of a t-test rests on several assumptions. The data should be approximately normally distributed, particularly in small samples, although the central limit theorem makes the test fairly robust at larger sample sizes. Observations should be independent, except in the paired test where the pairing is deliberate. For the independent-samples test, the two groups are traditionally assumed to have equal variances; when this assumption is doubtful, Welch’s t-test, which does not require equal variances, is a safer default. Outliers can distort the result and should be inspected beforehand.
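
    Welch's test adjusts the degrees of freedom rather than pooling the variances. A minimal sketch of that adjustment (the Welch-Satterthwaite approximation) follows; the function name and data are assumptions made for this example.

```python
import statistics


def welch_degrees_of_freedom(group_a, group_b):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    # Each term is that group's contribution to the squared standard error
    term_a = statistics.variance(group_a) / len(group_a)
    term_b = statistics.variance(group_b) / len(group_b)
    numerator = (term_a + term_b) ** 2
    denominator = (
        term_a ** 2 / (len(group_a) - 1) + term_b ** 2 / (len(group_b) - 1)
    )
    return numerator / denominator


# Hypothetical groups with clearly different spreads and sizes
tight = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
wide = [8.0, 12.5, 9.5, 14.0]

print(round(welch_degrees_of_freedom(tight, wide), 2))
```

    When the two groups have equal sizes and equal sample variances, the result matches the familiar n_a + n_b - 2; otherwise it is smaller, reflecting the extra uncertainty.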

    Relationship to p-values and significance

    The t-test does not by itself prove that two groups differ; it quantifies the evidence against the null hypothesis that the means are equal. The resulting p-value is the probability of observing a difference at least as large as the one found, assuming the null hypothesis is true. A small p-value, conventionally below 0.05, is labelled statistically significant, but it says nothing about the size or practical importance of the effect. Reporting the mean difference and a confidence interval alongside the p-value gives a fuller picture.
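
    The definition of a p-value can be illustrated without the t-distribution at all. The sketch below is an exact permutation test, a different technique from the t-test: it enumerates every way of relabelling the observations and counts how often the relabelled difference in means is at least as large as the observed one. Its null hypothesis is that group labels are interchangeable, which is close to but not identical to "the means are equal". The data and function name are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every possible assignment of len(group_a) observations to group A
    for indices in combinations(range(len(pooled)), len(group_a)):
        chosen = set(indices)
        relabelled_a = [pooled[i] for i in indices]
        relabelled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so exact ties are not lost to floating-point rounding
        if abs(mean(relabelled_a) - mean(relabelled_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical measurements, invented purely for illustration
treatment = [5.1, 5.8, 6.2, 5.9, 6.4]
control = [4.8, 5.0, 5.3, 4.9, 5.2]

print(round(permutation_p_value(treatment, control), 4))
```

    Full enumeration is only practical for small samples; with larger groups a random subset of relabellings is normally used instead.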

    Reporting t-tests transparently

    Good practice is to report the type of t-test used, the t-statistic, the degrees of freedom, the p-value, the effect size and a confidence interval. Stating which test was chosen and why, and confirming that its assumptions were checked, supports the reproducibility goals described in the CASRAI dictionary and our guidance for authors. An adequately powered design, discussed in our piece on statistical power, is equally important.

    Frequently asked questions

    When should I use a t-test rather than ANOVA?

    Use a t-test to compare two means. When you need to compare three or more group means simultaneously, analysis of variance (ANOVA) is the appropriate extension, as running multiple t-tests inflates the chance of a false positive.
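
    The inflation is easy to quantify under a simplifying assumption. The sketch below treats the pairwise tests as independent, which is not strictly true when they share groups, so the figures are an approximation. The function name is hypothetical.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Approximate chance of at least one false positive across all pairwise tests.

    Assumes the tests are independent, which overstates the rate somewhat
    when comparisons share a group.
    """
    n_tests = comb(n_groups, 2)  # number of distinct pairs of groups
    return n_tests, 1 - (1 - alpha) ** n_tests


for groups in (3, 4, 5):
    tests, rate = familywise_error_rate(groups)
    print(f"{groups} groups -> {tests} t-tests, error rate about {rate:.3f}")
```

    Even with three groups, the chance of at least one false positive is well above the nominal 0.05 for a single test.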

    What if my data are not normally distributed?

    For small, clearly non-normal samples, consider a non-parametric alternative such as the Mann-Whitney U test for independent groups or the Wilcoxon signed-rank test for paired data.

    What is the difference between a one-tailed and two-tailed t-test?

    A two-tailed test detects a difference in either direction and is the default. A one-tailed test only looks for a difference in one specified direction and should be used only when justified in advance.

  • Effect Size: Why It Matters Beyond Statistical Significance

    An effect size is a standardised measure of the magnitude of a difference or relationship, telling you how large an effect is rather than merely whether it is statistically detectable. Where a p-value answers “is there an effect?”, an effect size answers the more useful question “how big is it?”. Reporting effect sizes is now expected by major journals and statistical bodies, because significance alone can mislead.

    Why a p-value is not enough

    A p-value depends heavily on sample size. With a large enough sample, a trivially small difference can become statistically significant; with a small sample, a substantial effect can fail to reach significance. This means a significant result tells you that the observed difference would be unlikely if there were no true effect, but nothing about whether the effect is large enough to matter in practice. The American Statistical Association’s 2016 statement on p-values explicitly cautioned against treating statistical significance as a measure of importance and urged researchers to report effect sizes and uncertainty. For the foundations, see our explainer on p-values and statistical significance.
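
    The dependence on sample size can be sketched numerically. The example below holds a small standardised difference fixed and varies only the group size. It assumes two equal-sized groups and uses a normal approximation to the t-distribution, which is reasonable at these sample sizes but not exact; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_two_sided_p(standardised_diff, n_per_group):
    """Approximate two-sided p-value for a fixed standardised mean difference.

    With two equal groups of size n, the test statistic is roughly
    d * sqrt(n / 2); the normal distribution stands in for the t-distribution.
    """
    z = standardised_diff * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same small effect (0.1 standard deviations) at increasing sample sizes
for n in (20, 200, 2000, 20000):
    print(n, round(approx_two_sided_p(0.1, n), 4))
```

    The effect never changes, yet the p-value moves from clearly non-significant to far below 0.05 as the groups grow.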

    Common effect size measures

    Different designs call for different effect size statistics. The table below summarises the most widely used.

    | Measure | Used with | What it expresses | Rough benchmarks |
    |---|---|---|---|
    | Cohen’s d | Difference between two means | Difference in standard-deviation units | 0.2 small, 0.5 medium, 0.8 large |
    | Eta-squared | ANOVA | Proportion of variance explained by a factor | 0.01 small, 0.06 medium, 0.14 large |
    | Pearson’s r | Correlation between two variables | Strength and direction of association | 0.1 small, 0.3 medium, 0.5 large |
    | Cramer’s V | Categorical association | Strength of relationship in a contingency table | Depends on table size |

    These benchmarks, popularised by Jacob Cohen, are useful starting points but are not universal laws. What counts as a meaningful effect depends on the field: a small standardised effect in a public-health intervention can have enormous real-world value, while a large effect in a tightly controlled lab study may be unremarkable.
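
    For readers who want to see how the first row of the table is obtained, here is a minimal sketch of Cohen's d using the pooled standard deviation, one common formulation. The function name and data are invented for illustration.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Mean difference expressed in pooled standard-deviation units."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_variance = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / math.sqrt(pooled_variance)


# Hypothetical test scores, invented purely for illustration
new_method = [72, 75, 78, 80, 85, 79, 74]
old_method = [70, 73, 76, 77, 82, 75, 71]

print(round(cohens_d(new_method, old_method), 2))
```

    The sign shows the direction of the difference; the benchmarks in the table apply to its absolute value.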

    Effect size in context: ANOVA and categorical data

    Effect sizes pair naturally with the tests that produce p-values. After an ANOVA, eta-squared (or partial eta-squared) quantifies how much variance each factor explains. After a chi-square test, Cramer’s V or the phi coefficient gives the strength of association that the chi-square statistic alone cannot. Reporting the test statistic and the effect size together turns “there is an effect” into “there is an effect of this size”.
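
    Both measures can be computed from quantities the tests already produce. The sketch below derives eta-squared for a one-way design from sums of squares, and Cramer's V from a chi-square statistic computed without a continuity correction. Function names and inputs are hypothetical, and partial eta-squared for multi-factor designs is not covered.

```python
import math
from statistics import mean


def eta_squared(groups):
    """Between-group sum of squares as a share of the total sum of squares."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    return ss_between / ss_total


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    smaller_dimension = min(len(row_totals), len(col_totals))
    return math.sqrt(chi_square / (n * (smaller_dimension - 1)))


# Hypothetical inputs, invented purely for illustration
print(round(eta_squared([[4, 5, 6], [5, 6, 7], [8, 9, 10]]), 3))
print(round(cramers_v([[30, 10], [15, 25]]), 3))
```

    Both values fall between 0 and 1, with larger values indicating a stronger effect.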

    Practical versus statistical significance

    Statistical significance concerns whether an effect is distinguishable from chance. Practical significance concerns whether the effect is large enough to matter for decisions, policy or theory. The two can diverge sharply. A drug that lowers blood pressure by a statistically significant but clinically negligible amount is significant without being meaningful. Effect sizes, ideally reported with confidence intervals, are what let readers judge practical importance for themselves.

    Reporting standards and reproducibility

    Effect size reporting is not optional in many venues. The APA Publication Manual has long required effect sizes alongside test results, and reporting guidelines across disciplines echo this. Effect sizes also power meta-analysis and a-priori power analysis: you cannot plan an adequately powered study without an expected effect size, as our guide to sample size and statistical power explains. Recording effect sizes, confidence intervals and the measure used is part of the transparent reporting we champion across our reproducibility coverage and codify in our guidance for authors.
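
    The link between expected effect size and sample size can be sketched with a standard normal-approximation formula for a two-sided, two-group comparison of means. This is a planning approximation that slightly understates what an exact t-based calculation gives; the function name and default values are assumptions for the example.

```python
import math
from statistics import NormalDist


def n_per_group(expected_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a given Cohen's d.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) / d)^2, a normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / expected_d) ** 2)


# Cohen's small, medium and large benchmarks for d
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

    Halving the expected effect size roughly quadruples the required sample, which is why an honest estimate of the effect matters so much at the design stage.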

    Frequently asked questions

    What is the difference between a p-value and an effect size?

    A p-value indicates how surprising the observed result would be if there were no true effect. An effect size indicates how large the effect is. They answer different questions and should be reported together.

    Which effect size should I report?

    Match the measure to the design: Cohen’s d for two-group mean differences, eta-squared for ANOVA, Pearson’s r for correlations, and Cramer’s V for categorical associations. Always state which measure you used.

    Can a result be statistically significant but practically meaningless?

    Yes. With a large enough sample, even tiny differences can become significant. The effect size, especially with a confidence interval, reveals whether the difference is large enough to matter in the real world.

    Why do journals now require effect sizes?

    Because significance alone gives an incomplete picture and contributes to overstated findings. Bodies such as the American Statistical Association and APA emphasise effect sizes to improve transparency and reproducibility. See the CASRAI dictionary for the standardised terms used in reporting.