Test-Retest Reliability: Definition, Meaning & Examples | CASRAI

Test-retest reliability is the consistency of a measure when it is administered to the same people on two separate occasions. It indicates how stable scores are over time, assuming the underlying trait has not genuinely changed.

CASRAI plain-language explainers — clear answers to recurring research-administration questions

Measuring stability over time

Test-retest reliability assesses temporal stability: administer the same instrument to the same people on two occasions and see how closely the scores agree. The result is typically expressed as a test-retest correlation (often a Pearson or intraclass correlation coefficient). A measure with high test-retest reliability yields nearly the same score each time for someone whose true standing has not changed, which is essential for any instrument used to track, compare, or screen individuals reliably.
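As a rough illustration of the two coefficients named above, the sketch below computes a Pearson correlation and an intraclass correlation for two administrations. The scores are invented for the example, and the choice of ICC(2,1) (two-way random effects, absolute agreement, single measurement) is an assumption; other ICC forms exist and may suit a given study better.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def icc_2_1(x, y):
    """ICC(2,1) for two occasions, from two-way ANOVA mean squares."""
    n, k = len(x), 2                      # n respondents, k occasions
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]   # per respondent
    col_means = [sum(x) / n, sum(y) / n]              # per occasion
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-respondent mean square
    msc = ss_cols / (k - 1)               # between-occasion mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores: every respondent scores exactly 3 points higher at retest.
time1 = [10, 12, 14, 16, 18]
time2 = [13, 15, 17, 19, 21]

print(round(pearson_r(time1, time2), 3))  # 1.0
print(round(icc_2_1(time1, time2), 3))    # 0.69
```

With these made-up scores the rank order is perfectly preserved, so the Pearson coefficient is 1.0, while the absolute-agreement ICC drops to about 0.69 because it also penalises the systematic 3-point shift. Which coefficient is appropriate depends on whether absolute agreement or only relative ordering matters for the instrument's intended use.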

Choosing the retest interval

The time between administrations is a design decision with real consequences. If the gap is too short, respondents may remember and repeat their earlier answers (a carry-over or memory effect), inflating apparent reliability. If it is too long, genuine change in the trait, in circumstances, or in the respondent can lower the correlation even though the instrument itself is sound. Reports of test-retest reliability should therefore always state the interval used, because a coefficient cannot be interpreted without knowing the gap it spans.
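The effect of the interval can be sketched with a small simulation. Everything here is an illustrative assumption rather than an empirical model: the function name `simulate`, the error size, the `memory` parameter (how strongly the second occasion's measurement error repeats the first), and the `change_sd` parameter (how much the true trait drifts between occasions).

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate(n=5000, memory=0.0, change_sd=0.0, error_sd=0.5, seed=0):
    """Return the test-retest correlation for one simulated design.

    memory    : 0..1, share of the first occasion's error carried over
                (stands in for remembered answers after a short gap)
    change_sd : spread of genuine trait change between occasions
                (stands in for real change over a long gap)
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)
        e1 = rng.gauss(0, error_sd)
        e2_fresh = rng.gauss(0, error_sd)
        # Mix carried-over and fresh error so total error variance stays equal.
        e2 = memory * e1 + sqrt(1 - memory ** 2) * e2_fresh
        drift = rng.gauss(0, change_sd)
        first.append(trait + e1)
        second.append(trait + drift + e2)
    return pearson_r(first, second)

ideal = simulate()                    # independent errors, stable trait
short_gap = simulate(memory=0.7)      # remembered answers inflate r
long_gap = simulate(change_sd=0.8)    # genuine change deflates r

print(f"ideal={ideal:.2f} short_gap={short_gap:.2f} long_gap={long_gap:.2f}")
```

Under these assumptions the same instrument, with the same measurement error, produces a higher coefficient after a short gap and a lower one after a long gap, which is why a coefficient reported without its interval is hard to compare with any other.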

When it is and is not appropriate

Test-retest reliability suits constructs that are expected to be stable — traits, abilities, attitudes held over time. It is the wrong yardstick for constructs that genuinely vary, such as mood, stress, or symptom severity, where a low correlation reflects real fluctuation rather than a defective measure. For such state-like constructs, internal consistency or other reliability evidence is more informative. Matching the reliability method to the nature of the construct is part of using it correctly.

One of several reliability types

Test-retest reliability is one member of a family. Internal consistency (such as Cronbach’s alpha) asks whether items measure the same thing within a single administration; inter-rater reliability asks whether different observers agree; parallel-forms reliability asks whether two versions of a test agree. These capture different sources of error — time, item sampling, raters, forms — and a thorough validation of an instrument typically reports more than one, because high reliability of one kind does not guarantee another.
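To make the contrast with internal consistency concrete, here is a minimal sketch of Cronbach's alpha computed from a single administration. The response matrix is invented, and the function name `cronbach_alpha` is a hypothetical helper, not part of any particular library.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical data: 5 respondents answering 3 items on a 1-5 scale,
# collected on ONE occasion (no retest is involved).
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.97
```

Note that the calculation never uses a second occasion: it addresses item sampling as a source of error, not time, so a high alpha says nothing on its own about stability across administrations.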

Key facts

At a glance

Definition: Consistency of a measure across two separate occasions
Quantified: Correlation between the two administrations (e.g. Pearson r or ICC)
Suited to: Stable traits — personality, aptitude, attitudes
Poor fit: Fluctuating states such as mood or symptom severity
Key factor: The retest interval (too short or too long both distort)
Family: One of several reliability types, alongside internal consistency, inter-rater, and parallel-forms reliability

Common misconceptions

What people often get wrong

Often heard: A low test-retest correlation always means the measure is unreliable.

Actually: No — if the construct genuinely changes between occasions (e.g. mood), a low correlation reflects real change, not a faulty instrument.

Often heard: The interval between tests does not affect the result.

Actually: No — a short gap invites memory effects that inflate reliability, while a long gap lets real change deflate it. The interval must always be reported.

Often heard: High test-retest reliability proves the measure is valid.

Actually: No — reliability is necessary but not sufficient for validity. A measure can be perfectly consistent yet consistently measure the wrong thing.
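A toy example of this last point, using invented numbers: suppose shoe size were (absurdly) used as a measure of reading ability. It would repeat almost perfectly on retest, yet bear no relation to an independent reading score. The variable names and values below are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical "measure" taken on two occasions (highly stable).
measure_t1 = [1, 2, 3, 4, 5]
measure_t2 = [1.1, 2.0, 2.9, 4.1, 5.0]

# Hypothetical criterion: the thing the measure is supposed to capture.
criterion = [2, 5, 3, 1, 4]

reliability = pearson_r(measure_t1, measure_t2)   # consistency over time
validity = pearson_r(measure_t1, criterion)       # agreement with criterion

print(f"test-retest r = {reliability:.3f}")
print(f"criterion r   = {validity:.3f}")
```

In this constructed case the test-retest correlation is above 0.99 while the correlation with the criterion is zero, so consistency alone gives no evidence that the intended construct is being measured.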

Going deeper