Hypothesis testing

Hypothesis testing is the formal statistical procedure for deciding whether the evidence in a sample is strong enough to reject a null hypothesis of no effect in favour of an alternative.

State the null and alternative hypotheses

Every test begins by framing two competing, mutually exclusive statements. The null hypothesis (H0) is the default claim of no effect, no difference or no relationship — for example, that two group means are equal. The alternative hypothesis (H1 or Ha) states the effect the researcher expects to find. The alternative may be non-directional (two-tailed), asserting only that a difference exists, or directional (one-tailed), predicting which way it runs. Crucially, the test is built to challenge the null: the burden of proof falls on the alternative, and the hypotheses should be fixed before the data are examined, ideally through preregistration.
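As a brief illustration, the two-group example above can be written formally. The symbols \mu_1 and \mu_2 (the two population means) are notation assumed here for the sketch; they do not come from the guide itself.

```latex
\[
\begin{aligned}
H_0 &: \mu_1 = \mu_2 && \text{(null: no difference)} \\
H_1 &: \mu_1 \neq \mu_2 && \text{(non-directional, two-tailed)} \\
H_1 &: \mu_1 > \mu_2 && \text{(directional, one-tailed)}
\end{aligned}
\]
```

Only one form of H1 is chosen for a given test, and the choice is made before the data are examined.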

Choose the significance level (alpha)

Next, the researcher sets the significance level, denoted alpha (α), before collecting or inspecting the data. Alpha is the probability of wrongly rejecting a true null hypothesis that the researcher is willing to accept — the risk of a false positive. By long-standing convention it is often set at 0.05, with 0.01 used when a stricter standard is wanted. Alpha defines the threshold for the decision: it marks how extreme the evidence must be before the null is rejected. Setting it in advance, rather than after seeing the results, is essential to keeping the false-positive rate at the stated level.
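One way to see what alpha controls is a small simulation. The sketch below is illustrative only: it assumes a two-sided z-test on normally distributed data with a known standard deviation of 1, and the function name, sample size, trial count and seed are all hypothetical choices. Every sample is drawn with the null hypothesis true, so every rejection is a false positive.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(alpha=0.05, n=20, trials=2000, seed=0):
    """Share of true-null samples that a two-sided z-test rejects."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cut-off
    sigma = 1.0  # assumed known population standard deviation
    rejections = 0
    for _ in range(trials):
        # The null is true by construction: the population mean is 0.
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        z = (sum(sample) / n) / (sigma / sqrt(n))
        if abs(z) > critical:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
print(f"Observed false-positive rate: {rate:.3f}")
```

Under these assumptions the observed rate should land close to 0.05, and it moves with alpha if a different threshold is passed in.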

Compute a test statistic

With hypotheses and alpha fixed, the data are summarised into a single test statistic, such as a z, t, chi-square or F value, that measures how far the observed result departs from what the null hypothesis predicts, scaled by the variability in the data. For z and t statistics, values far from zero indicate data that are unusual under the null; chi-square and F statistics are never negative, so for them it is large values that count against the null. The appropriate statistic depends on the question and the data: a t-test compares means, a chi-square test examines categorical frequencies, and analysis of variance (ANOVA) compares several group means. Each statistic has a known sampling distribution under the null, which is what makes the next step possible.
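As a minimal sketch of this step, the code below computes a one-sample z statistic. It assumes the simplest case, where the population standard deviation is known; the data, the null mean of 50 and the standard deviation of 3 are invented for illustration, and z_statistic is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean

def z_statistic(sample, mu0, sigma):
    """One-sample z: (sample mean - null mean) / standard error."""
    standard_error = sigma / sqrt(len(sample))
    return (mean(sample) - mu0) / standard_error

# Hypothetical measurements; H0 says the population mean is 50.
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56]
z = z_statistic(sample, mu0=50, sigma=3)
print(z)  # 2.0: the sample mean sits two standard errors above 50
```

When the standard deviation has to be estimated from the sample, a t statistic would normally be used instead, with the same overall structure.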

Compare with a critical value or p-value, then decide

There are two equivalent ways to reach a decision. The critical-value approach finds the cut-off on the test statistic’s distribution that corresponds to alpha; if the computed statistic is more extreme than the critical value, the result falls in the rejection region and the null is rejected. The p-value approach instead computes the probability of obtaining a result at least as extreme as the one observed, assuming the null is true, and rejects the null when the p-value is less than or equal to alpha. Both lead to the same conclusion. If the null is rejected, the result is called statistically significant; otherwise, you fail to reject the null — which is not the same as proving it true.
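The equivalence of the two routes can be checked directly. This sketch assumes a two-sided z-test, so the reference distribution is the standard normal; two_sided_decision is a hypothetical helper name, and the z values passed in are made up for illustration.

```python
from statistics import NormalDist

def two_sided_decision(z, alpha=0.05):
    """Return the critical value, the p-value and both reject decisions."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    critical = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    reject_by_critical = abs(z) >= critical  # critical-value route
    reject_by_p = p_value <= alpha           # p-value route
    return critical, p_value, reject_by_critical, reject_by_p

critical, p_value, by_critical, by_p = two_sided_decision(2.0)
print(round(critical, 2), round(p_value, 4), by_critical, by_p)
# 1.96 0.0455 True True
```

A statistic of 1.5 gives a p-value above 0.05 and fails both checks, so the two routes agree there as well.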

Type I and Type II errors

Because a decision is made from limited data, two kinds of error are possible. A Type I error (false positive) occurs when a true null hypothesis is rejected; its probability is held at alpha, provided the test’s assumptions are met. A Type II error (false negative) occurs when a false null hypothesis is not rejected, so a real effect is missed; its probability is denoted beta (β). The statistical power of a test is 1 − β, the probability of correctly detecting a true effect. The two error types trade off: lowering alpha to reduce false positives, all else equal, raises the chance of a Type II error. Increasing the sample size is the principal way to ease this trade-off: with alpha held fixed, a larger sample lowers beta and raises power by sharpening the test’s ability to distinguish signal from noise.
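The trade-off can be made concrete with a power calculation. The sketch below assumes a two-sided one-sample z-test with known standard deviation; power_two_sided_z is a hypothetical helper name, and the effect of 0.5 with a standard deviation of 1 is an invented scenario.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided_z(effect, sigma, n, alpha=0.05):
    """Probability of rejecting H0 when the true mean is off by `effect`."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sigma  # true effect in standard-error units
    upper_tail = 1 - std_normal.cdf(critical - shift)
    lower_tail = std_normal.cdf(-critical - shift)
    return upper_tail + lower_tail

# Power (1 - beta) grows with sample size at a fixed alpha.
for n in (10, 20, 40, 80):
    print(n, round(power_two_sided_z(effect=0.5, sigma=1.0, n=n), 3))
```

With the effect set to zero the function returns alpha itself, which is the Type I error rate. Passing alpha=0.01 in place of 0.05 lowers the power at every sample size, which is the trade-off described above.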

Interpreting and reporting the result

A complete report goes beyond a bare significant-or-not verdict. Statistical significance indicates that data at least as extreme as those observed would be unlikely if the null hypothesis were true, but it does not measure how large or important the effect is. That requires an effect size and a confidence interval, which together convey magnitude and precision. A result can be statistically significant yet trivially small, or non-significant simply because the sample was too small to detect a real effect. Good practice reports the test used, the test statistic, the exact p-value, the effect size and a confidence interval, so that readers can judge both the reliability and the practical importance of the finding.
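As a rough sketch of the extra quantities a report needs, the code below computes a confidence interval for a mean and a standardised effect size (the mean difference in standard-deviation units). It assumes the known-sigma z setting; the data, the null mean and the helper name z_interval_and_effect are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval_and_effect(sample, mu0, sigma, confidence=0.95):
    """Return (lower, upper, effect_size) for a one-sample z analysis."""
    sample_mean = mean(sample)
    standard_error = sigma / sqrt(len(sample))
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = critical * standard_error
    effect_size = (sample_mean - mu0) / sigma  # difference in SD units
    return sample_mean - margin, sample_mean + margin, effect_size

# Hypothetical measurements; H0 says the population mean is 50.
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56]
low, high, d = z_interval_and_effect(sample, mu0=50, sigma=3)
print(round(low, 2), round(high, 2), round(d, 3))
# 50.04 53.96 0.667
```

Here the interval narrowly excludes 50, which matches a two-sided rejection at the 0.05 level, while the effect size tells the reader how large the difference is relative to the spread of the data.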

Key facts

At a glance

Definition: a procedure for deciding whether evidence justifies rejecting the null
Step 1: state the null (H0) and alternative (H1) hypotheses
Step 2: choose the significance level, alpha, in advance (often 0.05)
Step 3: compute a test statistic, then a p-value or critical value
Decision: reject H0 if the result is more extreme than the threshold
Two errors: Type I (false positive, probability α) and Type II (false negative, probability β)

Common questions

FAQ

What is the difference between a Type I and a Type II error?

A Type I error is a false positive: rejecting a null hypothesis that is actually true, so you claim an effect that does not exist. Its probability equals the significance level, alpha. A Type II error is a false negative: failing to reject a false null, so you miss a real effect. Its probability is beta, and 1 − beta is the test’s statistical power.

Does failing to reject the null hypothesis prove it is true?

No. Failing to reject the null means the evidence was not strong enough to overturn it, not that no effect exists — absence of evidence is not evidence of absence. A non-significant result can simply reflect a small sample or low statistical power. You can only reject the null or fail to reject it, never prove it.

What does a significance level of 0.05 mean?

An alpha of 0.05 means you are willing to accept a 5% chance of a Type I error — wrongly rejecting a true null hypothesis. It sets the threshold for the decision: the null is rejected when the p-value is 0.05 or lower. Alpha should be chosen before seeing the data, and stricter studies may use 0.01.
