Tag: hypothesis testing

  • P-Values and Statistical Significance Explained Correctly

    A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. It is a measure of how compatible the data are with a specified statistical model in which there is no effect or no difference. A small p-value indicates that the observed data would be unusual if the null hypothesis held; it does not, by itself, prove that the null hypothesis is false or that an effect is real or important.

    What the null hypothesis represents

    Hypothesis testing begins with a null hypothesis, typically a statement of no effect, no difference or no association. The test asks how surprising the observed data would be if that null hypothesis were true. The p-value quantifies that surprise: the smaller it is, the less compatible the data are with the null model. Critically, the p-value is calculated under the assumption that the null is true, which is why it cannot be read as the probability that the null is true.
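
    The definition can be made concrete with a minimal sketch. It assumes a hypothetical experiment of 20 coin flips that produced 15 heads, with a null hypothesis that the coin is fair; the function name and the data are illustrative assumptions, not taken from any study. The two-sided p-value is computed exactly from the binomial distribution by adding up the null probability of every outcome that is no more probable than the observed one.

```python
from math import comb

def binomial_two_sided_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value under the null: total probability of all
    outcomes that are no more probable than the observed outcome."""
    def pmf(k: int) -> float:
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # A small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))

p_value = binomial_two_sided_p(15, 20)  # hypothetical: 15 heads in 20 flips
print(f"p = {p_value:.4f}")
```

    With these hypothetical numbers the result is about 0.041. Every probability in the sum is computed under the fair-coin assumption, which is why the output cannot be read as the probability that the coin is fair.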

    The American Statistical Association’s 2016 statement

    In 2016 the American Statistical Association (ASA) published a formal statement on p-values, the first time it had issued such guidance, in response to widespread misuse. The statement set out six principles. In summary, it affirmed that p-values can indicate how incompatible data are with a specified model, but warned that a p-value does not measure the probability that the hypothesis under study is true, nor the probability that the data arose by chance alone. It cautioned that scientific conclusions should not be based only on whether a p-value passes a threshold, that proper reporting requires full transparency, that a p-value does not measure the size or importance of an effect, and that by itself a p-value is a poor measure of evidence regarding a model or hypothesis.

    Common misinterpretations

    Several persistent errors surround p-values. Avoiding them is essential for sound, reproducible reporting.

    Misinterpretation | Why it is wrong
    The p-value is the probability the null hypothesis is true | It is calculated assuming the null is true; it cannot also be that probability
    p = 0.05 means a 5% chance the result is a fluke | The p-value is the probability of data at least this extreme if the null is true, not the probability that chance produced the finding
    A non-significant result proves no effect exists | Absence of significance is not evidence of absence; the study may simply lack power
    A small p-value means a large or important effect | The p-value depends on sample size as well as effect magnitude, so a trivial effect can yield a small p-value in a large sample

    The limits of the 0.05 convention

    The threshold of 0.05 for declaring statistical significance is a convention, not a law of nature. Treating 0.05 as a bright line encourages dichotomous thinking in which a result at p = 0.049 is celebrated and one at p = 0.051 dismissed, despite negligible difference between them. This convention has fed practices such as selective reporting and p-hacking, where analyses are adjusted until a result crosses the threshold, both serious threats to reproducibility. The ASA statement explicitly warned against basing conclusions solely on whether a p-value clears a cut-off.

    Effect sizes and intervals

    Because a p-value says nothing about magnitude, it should be accompanied by an effect size, which describes how large the observed effect is, and ideally a confidence interval, which expresses the precision of the estimate. Reporting these alongside, or instead of, a bare p-value gives readers far more information for judging whether a finding matters. The underpinning ideas come from the wider discipline of statistics, and transparent reporting of all of them supports the goals tracked in our reproducibility category. For terminology and reporting conventions, consult the CASRAI dictionary.
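
    A rough sketch shows why the p-value alone is not enough. It assumes a hypothetical standardised mean difference of 0.1 between two equal-sized groups and uses a large-sample normal approximation with unit variance; the function name and all numbers are illustrative assumptions. The same effect size is evaluated at two sample sizes.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_summary(effect_d: float, n_per_group: int, level: float = 0.95):
    """Two-sided p-value and confidence interval for a standardised mean
    difference, using a normal approximation (assumes unit variance)."""
    se = sqrt(2 / n_per_group)  # standard error of the difference, in SD units
    z = effect_d / se
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return p_value, (effect_d - z_crit * se, effect_d + z_crit * se)

for n in (50, 5000):  # hypothetical sample sizes per group
    p, (low, high) = z_test_summary(0.1, n)
    print(f"n per group = {n:>4}: p = {p:.3g}, 95% CI = ({low:.3f}, {high:.3f})")
```

    The effect size is identical in both runs; only the p-value and the width of the interval change with sample size, which is the reason to report all three together.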

    Frequently asked questions

    Does a p-value below 0.05 prove an effect is real?

    No. It indicates the data would be unusual if the null hypothesis were true, but it does not prove the null is false, nor that the effect is large or important. Replication, effect sizes and intervals are needed to judge that.

    What did the ASA 2016 statement conclude?

    The statement set out six principles emphasising that p-values measure compatibility with a model, are not the probability the hypothesis is true, do not measure effect size, and should never be the sole basis for scientific conclusions. It urged full transparency in reporting.

    Should we abandon p-values altogether?

    Not necessarily. P-values can be informative when interpreted correctly and reported alongside effect sizes and confidence intervals. The problem lies in misuse and over-reliance on a single threshold, not in the statistic itself. See the CASRAI author guidance for reporting practices.

  • T-Tests Explained: Comparing Two Means

    A t-test is a statistical test that assesses whether the difference between two means is larger than would be expected by chance alone. It compares the size of an observed difference against the variability in the data, producing a t-statistic that can be converted into a p-value. The t-test is one of the most common tools for comparing groups in research.

    How a t-test works

    The t-statistic is essentially the difference between means divided by the standard error of that difference. A large t-statistic indicates that the difference is large relative to the spread of the data, so a difference of that size would be unusual if the null hypothesis were true. The t-statistic is then evaluated against the t-distribution, which resembles the normal distribution but has heavier tails to account for the extra uncertainty in small samples.
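
    As an illustration of that ratio, the sketch below computes an independent-samples t-statistic by hand, assuming equal variances (the pooled form). The data and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's independent-samples t: mean difference divided by its
    standard error, using a pooled variance (equal-variance assumption)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, n_a + n_b - 2

treatment = [5.1, 4.9, 5.6, 5.8, 6.0]  # hypothetical measurements
control = [4.2, 4.8, 4.5, 5.0, 4.4]
t_stat, df = pooled_t_statistic(treatment, control)
print(f"t({df}) = {t_stat:.2f}")
```

    The degrees of freedom returned alongside the statistic determine which t-distribution it is compared against.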

    The three types of t-test

    There are three principal forms of the t-test, each suited to a particular comparison.

    Type | What it compares | Typical use
    One-sample | A sample mean against a known or hypothesised value | Testing whether a mean differs from a reference standard
    Independent-samples | The means of two separate, unrelated groups | Comparing a treatment group with a control group
    Paired | Two measurements from the same subjects | Before-and-after measurements on the same participants

    Choosing the right type is essential. Using an independent-samples test on paired data, for instance, ignores the correlation between the two measurements and usually reduces the power of the analysis.
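
    The cost of ignoring the pairing can be sketched with hypothetical before-and-after scores for six participants who differ widely from one another but all improve by a similar amount. The data are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance

before = [72, 85, 90, 65, 78, 88]  # hypothetical scores, same six participants
after = [75, 87, 94, 68, 80, 92]

# Paired test: a one-sample t on the within-subject differences.
diffs = [a - b for a, b in zip(after, before)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Independent-samples form: treats the two columns as unrelated groups,
# so the large between-subject spread inflates the standard error.
se_indep = sqrt(variance(after) / len(after) + variance(before) / len(before))
t_indep = (mean(after) - mean(before)) / se_indep

print(f"paired t = {t_paired:.2f}, independent t = {t_indep:.2f}")
```

    The mean difference is the same in both calculations; only the standard error changes, because the paired analysis removes the variation between participants.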

    Assumptions of the t-test

    The validity of a t-test rests on several assumptions. The data should be approximately normally distributed, particularly in small samples, although the central limit theorem makes the test fairly robust at larger sample sizes. Observations should be independent, except in the paired test where the pairing is deliberate. For the independent-samples test, the two groups are traditionally assumed to have equal variances; when this assumption is doubtful, Welch’s t-test, which does not require equal variances, is a safer default. Outliers can distort the result and should be inspected beforehand.
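
    A minimal sketch of Welch's version follows, with hypothetical data in which one group is far more variable than the other. It computes the statistic with separate variances and the Welch-Satterthwaite approximation for the degrees of freedom; the function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom.
    Each group contributes its own variance; none is pooled."""
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1))
    return t, df

group_a = [10.1, 9.8, 10.3, 10.0, 9.9]         # hypothetical, low spread
group_b = [8.0, 12.5, 9.1, 13.2, 7.4, 11.0]    # hypothetical, high spread
t_welch, df_welch = welch_t(group_a, group_b)
print(f"t = {t_welch:.2f}, df = {df_welch:.1f}")
```

    The degrees of freedom are generally not a whole number and fall below the pooled value of n1 + n2 - 2 when the variances differ, which is how the test accounts for the extra uncertainty.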

    Relationship to p-values and significance

    The t-test does not by itself prove that two groups differ; it quantifies the evidence against the null hypothesis that the means are equal. The resulting p-value is the probability of observing a difference at least as large as the one found, assuming the null hypothesis is true. A p-value below the conventional 0.05 threshold is labelled statistically significant, but it says nothing about the size or practical importance of the effect. Reporting the mean difference and a confidence interval alongside the p-value gives a fuller picture.

    Reporting t-tests transparently

    Good practice is to report the type of t-test used, the t-statistic, the degrees of freedom, the p-value, the effect size and a confidence interval. Stating which test was chosen and why, and confirming that its assumptions were checked, supports the reproducibility goals described in the CASRAI dictionary and our guidance for authors. An adequately powered design, discussed in our piece on statistical power, is equally important.

    Frequently asked questions

    When should I use a t-test rather than ANOVA?

    Use a t-test to compare two means. When you need to compare three or more group means simultaneously, analysis of variance (ANOVA) is the appropriate extension, as running multiple t-tests inflates the chance of a false positive.

    What if my data are not normally distributed?

    For small, clearly non-normal samples, consider a non-parametric alternative such as the Mann-Whitney U test for independent groups or the Wilcoxon signed-rank test for paired data.

    What is the difference between a one-tailed and two-tailed t-test?

    A two-tailed test detects a difference in either direction and is the default. A one-tailed test only looks for a difference in one specified direction and should be used only when justified in advance.

  • The Chi-Square Test for Categorical Data: A Practical Guide

    The chi-square test is a statistical method for categorical data that compares the frequencies you actually observe with the frequencies you would expect if a given hypothesis were true. The larger the gap between observed and expected counts, the larger the chi-square statistic, and the stronger the evidence against the hypothesis of no relationship. It is the workhorse test for counts, proportions and contingency tables across the social, biological and medical sciences.

    Observed versus expected frequencies

    Every chi-square test rests on the same intuition. You record how many cases fall into each category (the observed frequencies), then calculate how many should fall there under your null hypothesis (the expected frequencies). The statistic sums the squared difference between observed and expected, divided by expected, across all cells:

    chi-square = Σ (observed − expected)² / expected

    A value near zero means observation matches expectation. A large value, evaluated against the chi-square distribution with the appropriate degrees of freedom, produces a small p-value and signals a meaningful departure. For background on interpreting those probabilities, see our explainer on p-values and statistical significance.
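
    The calculation can be sketched for a goodness-of-fit question, assuming hypothetical counts from 60 rolls of a die and a null hypothesis that every face is equally likely. The function name and the counts are illustrative.

```python
def chi_square_statistic(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rolls = [8, 12, 9, 11, 6, 14]                 # hypothetical counts for faces 1 to 6
expected = [sum(rolls) / len(rolls)] * len(rolls)  # fair die: equal expected counts
stat = chi_square_statistic(rolls, expected)
df = len(rolls) - 1                           # categories minus one
print(f"chi-square({df}) = {stat:.1f}")
```

    For these counts the statistic is 4.2 on 5 degrees of freedom, well below the 5% critical value of about 11.07, so the hypothetical data are compatible with a fair die.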

    Two common forms of the test

    There are two principal versions, which answer different questions.

    Feature | Goodness-of-fit | Test of independence
    Variables | One categorical variable | Two categorical variables
    Question | Do observed counts match an expected distribution? | Are the two variables associated?
    Data layout | Single row of category counts | Contingency (cross-tabulation) table
    Expected counts from | A theoretical or known distribution | Row and column marginal totals
    Example | Is a die fair across its six faces? | Is treatment outcome related to dosage group?

    The goodness-of-fit test checks whether a single variable follows a hypothesised distribution. The test of independence checks whether two variables in a contingency table are related or vary independently. A closely related variant, the test of homogeneity, asks whether several populations share the same category distribution.
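
    For the test of independence, the expected count in each cell is the row total times the column total divided by the grand total. The sketch below applies that rule to a hypothetical 2×2 table; the table values and function name are assumptions made for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [[30, 10],   # hypothetical: rows are dosage groups,
            [20, 40]]   # columns are outcome categories
expected = expected_counts(observed)
stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"expected = {expected}, chi-square({df}) = {stat:.2f}")
```

    The degrees of freedom for a contingency table are (rows - 1) × (columns - 1), which gives 1 for a 2×2 table.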

    Assumptions and small-sample cautions

    The chi-square test relies on a handful of conditions. The data must be frequency counts, not percentages or means. Observations should be independent, with each case appearing in only one cell. And expected counts should be reasonably large: a common rule of thumb is that every cell should have an expected frequency of at least 5 (a looser version, due to Cochran, allows up to 20% of cells to fall below 5 provided none falls below 1). When tables are small or sparse, Fisher’s exact test is the safer choice, and for 2×2 tables Yates’s continuity correction is sometimes applied. Reporting which test variant and corrections were used is part of transparent, replicable analysis, a theme across our reproducibility coverage.

    Interpreting and reporting the result

    A significant chi-square tells you that an association or departure exists, but not how strong it is. Because the statistic scales with sample size, even trivial differences become significant in very large datasets. For this reason you should accompany the test with a measure of association such as Cramér’s V or the phi coefficient, which behave like an effect size for categorical data. Report the chi-square value, degrees of freedom, sample size and p-value together, for example: chi-square(2, N = 240) = 11.3, p = .003.
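
    As a sketch of how such a measure is obtained, the code below computes Cramér's V from the example result above, chi-square = 11.3 with N = 240. It assumes the result came from a 2×3 contingency table, which is consistent with 2 degrees of freedom but is not stated in the example; the function name is illustrative.

```python
from math import sqrt

def cramers_v(chi_square: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V: sqrt(chi-square / (N * (min(rows, cols) - 1)))."""
    return sqrt(chi_square / (n * (min(n_rows, n_cols) - 1)))

# Assumed 2x3 table for the reported chi-square(2, N = 240) = 11.3.
v = cramers_v(11.3, 240, n_rows=2, n_cols=3)
print(f"Cramér's V = {v:.2f}")

# The same strength of association in a sample 100 times larger
# yields a chi-square 100 times larger.
chi_square_large = v**2 * 24000 * (min(2, 3) - 1)
print(f"chi-square at N = 24000: {chi_square_large:.0f}")
```

    Under that assumption V is about 0.22. The second calculation shows the scaling described above: the statistic grows with N while V stays fixed.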

    Adequate planning matters too. As with mean comparisons in ANOVA, the power to detect a true association depends on having enough observations, a point we expand on in our guide to sample size and statistical power.

    Frequently asked questions

    When should I use a chi-square test rather than a t-test or ANOVA?

    Use chi-square when your outcome is categorical and you are working with counts in categories. Use a t-test or ANOVA when your outcome is a continuous measurement whose means you want to compare across groups.

    What is the difference between goodness-of-fit and the test of independence?

    Goodness-of-fit examines one variable against an expected distribution. The test of independence examines whether two variables in a contingency table are associated. They share the same formula but answer different questions.

    What happens if my expected counts are too small?

    The chi-square approximation becomes unreliable when expected cell counts fall below about 5. In that case, combine sparse categories where it makes sense, or use Fisher’s exact test, which is valid for small samples.

    Does a significant chi-square tell me how strong the relationship is?

    No. It only indicates that the observed counts would be unusual if the variables were unrelated. To judge strength, report an association measure such as Cramér’s V alongside the result. The CASRAI dictionary and our author guidance describe the reporting metadata that keeps such analyses auditable.

  • ANOVA (Analysis of Variance) Explained: Comparing Means Across Groups

    Analysis of variance (ANOVA) is a statistical method that tests whether the means of three or more groups differ by more than would be expected from random variation alone. It does this by comparing the variance between group means against the variance within groups, summarised in a single F-statistic. ANOVA is one of the most widely used inferential tests in experimental research, and reporting it transparently is central to reproducible analysis.

    Why ANOVA instead of multiple t-tests?

    A t-test compares two group means. When you have three or more groups, it is tempting to run a separate t-test for every pair. The problem is the family-wise error rate: each test carries its own chance of a false positive, and those chances accumulate. With three groups there are three pairwise comparisons; at a 5% significance level the probability of at least one false positive rises to roughly 14% (1 − 0.95³, treating the tests as independent), and it climbs further as groups are added. ANOVA solves this by performing a single omnibus test that asks one question: are any of the group means different?
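
    The arithmetic behind that figure can be sketched as follows. The calculation treats the pairwise tests as independent, which is a simplifying assumption (comparisons that share a group are correlated), and the function name is illustrative.

```python
from math import comb

def family_wise_error_rate(n_groups: int, alpha: float = 0.05):
    """Number of pairwise comparisons and the chance of at least one
    false positive, assuming independent tests each run at level alpha."""
    n_comparisons = comb(n_groups, 2)
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons

for k in (3, 4, 5, 6):
    m, fwer = family_wise_error_rate(k)
    print(f"{k} groups: {m:>2} comparisons, FWER = {fwer:.3f}")
```

    For three groups this gives 1 - 0.95³ ≈ 0.143; for five groups, with ten comparisons, it is already about 0.40.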

    This control of error is why ANOVA underpins so much of experimental design. For a refresher on what significance thresholds mean in practice, see our explainer on p-values and statistical significance.

    The F-statistic and how it works

    ANOVA partitions the total variability in the data into two components. The between-groups variance reflects how far each group mean sits from the overall (grand) mean. The within-groups variance reflects the natural spread of observations inside each group. The F-statistic is the ratio of these two:

    F = between-groups variance / within-groups variance

    If the groups truly share a common mean, both quantities estimate the same underlying variability and F sits near 1. When real differences exist, the between-groups term grows and F rises. A large F, evaluated against the F-distribution with the appropriate degrees of freedom, yields a small p-value and signals that at least one mean differs.
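
    The partition can be sketched by hand for a one-way design. The three small groups below are hypothetical, and the function name is illustrative; each variance term is a sum of squares divided by its degrees of freedom.

```python
from statistics import mean

def one_way_f(groups):
    """F-statistic for a one-way ANOVA, with its two degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]  # hypothetical observations
f_stat, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

    Here the between-groups mean square is 19 and the within-groups mean square is 1, so F(2, 6) = 19.0, far from the value near 1 expected when the groups share a common mean.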

    One-way versus two-way ANOVA

    The design depends on how many factors you are manipulating.

    Feature | One-way ANOVA | Two-way ANOVA
    Number of factors | One independent variable | Two independent variables
    Example question | Does diet type affect plant growth? | Do diet type and watering frequency affect plant growth?
    Main effects | One | Two (one per factor)
    Interaction | Not assessed | Tests whether factors combine non-additively
    Output | Single F-statistic | F-statistic for each main effect plus interaction

    The key advantage of two-way ANOVA is the interaction effect: it reveals whether the influence of one factor depends on the level of another, something separate analyses would miss.

    Assumptions you must check

    ANOVA rests on three core assumptions. Observations should be independent. The residuals should be approximately normally distributed. And the groups should show roughly equal variances, a property called homogeneity of variance (homoscedasticity). When variances differ markedly, a Welch ANOVA is a robust alternative; when normality fails, a non-parametric Kruskal-Wallis test may be more appropriate. Stating which assumptions were tested, and how, is good practice and supports replication, as we discuss across our reproducibility coverage.

    Post-hoc tests: locating the difference

    A significant ANOVA tells you that some mean differs, but not which one. Post-hoc tests answer that follow-up while still controlling the family-wise error rate. Tukey’s HSD is the standard choice for all pairwise comparisons with equal sample sizes; Bonferroni correction is conservative and simple; Scheffé’s test is flexible for complex contrasts. Crucially, you should not revert to uncorrected t-tests after a significant ANOVA, as that reintroduces the inflated error the test was designed to prevent.
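
    Of the methods named, the Bonferroni correction is the simplest to sketch: each p-value is compared against the significance level divided by the number of comparisons. The p-values below are hypothetical, and the function name is illustrative.

```python
def bonferroni(p_values, alpha: float = 0.05):
    """Bonferroni correction: returns (adjusted p-value, significant?) pairs.
    Each raw p-value is tested against alpha / number of comparisons."""
    m = len(p_values)
    return [(min(p * m, 1.0), p <= alpha / m) for p in p_values]

raw_p = [0.004, 0.020, 0.300]  # hypothetical p-values from three pairwise tests
for p, (adjusted, significant) in zip(raw_p, bonferroni(raw_p)):
    print(f"raw p = {p:.3f}, adjusted p = {adjusted:.3f}, significant: {significant}")
```

    The second comparison would pass an uncorrected 0.05 threshold but does not survive the correction, which illustrates both the protection the method offers and why it is described as conservative.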

    Equally important, statistical significance does not measure how large a difference is. Always pair ANOVA results with an effect size such as eta-squared, as covered in our companion piece on why effect size matters beyond significance. Authors planning a study should also budget adequate sample size and statistical power so a real effect can actually be detected.
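
    For a one-way design, eta-squared is the between-groups sum of squares divided by the total sum of squares, and it can be recovered from a reported F-statistic and its degrees of freedom. The sketch below assumes a hypothetical result of F(2, 6) = 19.0; the function name is illustrative.

```python
def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta-squared for a one-way ANOVA from F and its degrees of freedom:
    SS_between / SS_total = F * df_b / (F * df_b + df_w)."""
    return f_stat * df_between / (f_stat * df_between + df_within)

eta_sq = eta_squared_from_f(19.0, 2, 6)  # hypothetical reported result
print(f"eta-squared = {eta_sq:.2f}")
```

    The identity follows from F = (SS_between / df_b) / (SS_within / df_w), so it applies to one-way designs; factorial designs usually report partial eta-squared instead.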

    Frequently asked questions

    What does a significant ANOVA result actually tell you?

    It tells you that at least one group mean differs from the others by more than chance would explain. It does not identify which groups differ or how large the difference is; you need post-hoc tests and effect sizes to answer those questions.

    Can ANOVA be used for only two groups?

    Yes. With two groups a one-way ANOVA gives results mathematically equivalent to an equal-variance (Student’s) independent-samples t-test (F equals t squared). ANOVA’s real value appears with three or more groups, where it prevents the error inflation of multiple t-tests.
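
    The equivalence can be checked numerically. The sketch below uses two hypothetical groups and computes both the pooled-variance t-statistic and the one-way F-statistic by hand.

```python
from math import sqrt
from statistics import mean, variance

group_a = [5.1, 4.9, 5.6, 5.8, 6.0]  # hypothetical measurements
group_b = [4.2, 4.8, 4.5, 5.0, 4.4]
n_a, n_b = len(group_a), len(group_b)

# Student's t with a pooled variance (equal-variance assumption).
ss_within = (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
pooled_var = ss_within / (n_a + n_b - 2)
t_stat = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# One-way ANOVA F for the same two groups (1 between-groups degree of freedom).
grand_mean = mean(group_a + group_b)
ss_between = (n_a * (mean(group_a) - grand_mean) ** 2
              + n_b * (mean(group_b) - grand_mean) ** 2)
f_stat = (ss_between / 1) / (ss_within / (n_a + n_b - 2))

print(f"t squared = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

    The two printed values agree. The identity holds for the pooled-variance t-test; Welch's t-test corresponds to Welch's ANOVA rather than to the classical F-test.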

    What is the difference between a main effect and an interaction?

    A main effect is the overall influence of one factor averaged across the others. An interaction means the effect of one factor changes depending on the level of another. Detecting interactions is the principal reason to use two-way rather than one-way designs.

    How should ANOVA results be reported for reproducibility?

    Report the F-statistic with both degrees of freedom, the p-value, an effect size, the post-hoc method used, and confirmation that assumptions were checked. The CASRAI dictionary and our guidance for authors set out the metadata that makes such results auditable.