Inter-Rater Reliability: Definition, Meaning & Examples | CASRAI

Inter-rater reliability is the degree of agreement between two or more independent raters or observers assessing the same thing. It shows whether a rating reflects what is being judged rather than the idiosyncrasies of who is judging.

Agreement between independent judges

Inter-rater reliability addresses a basic worry about any judgement-based measurement: would a different observer have reached the same conclusion? When two coders categorise the same interview transcripts, or two examiners mark the same scripts, the extent to which they agree tells us how much the resulting data reflect the object being rated rather than the particular rater. High inter-rater reliability is a prerequisite for trusting subjective measures and for combining ratings made by different people.

Correcting for chance agreement

Raw percentage agreement can be misleading because some agreement would occur by chance alone, especially when there are few categories or one category dominates. Chance-corrected statistics address this. Cohen’s kappa handles two raters with nominal categories; Fleiss’ kappa generalises to more than two raters; weighted kappa penalises disagreements on ordered categories according to how far apart the two ratings are; and the intraclass correlation coefficient (ICC) is used for continuous or ordinal ratings. Reporting a chance-corrected coefficient, rather than bare agreement, is standard good practice.
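To make the chance correction concrete, here is a minimal Python sketch of Cohen’s kappa for two raters. It uses the textbook form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater’s own category frequencies. The function names and the screening data are hypothetical, invented for illustration only.

```python
from collections import Counter
import math


def percent_agreement(ratings_a, ratings_b):
    """Share of items on which the two raters gave the same label."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(ratings_a)

    # Observed agreement: proportion of items with identical labels.
    p_o = percent_agreement(ratings_a, ratings_b)

    # Chance agreement: for each label, multiply the two raters' marginal
    # proportions, then sum over labels.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both raters used a single identical label throughout, kappa is 0/0.
    if math.isclose(p_e, 1.0):
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: two reviewers screening 20 abstracts.
# They disagree on only two abstracts, but "include" dominates.
rater_a = ["include"] * 18 + ["exclude"] * 2
rater_b = ["include"] * 17 + ["exclude"] + ["include"] + ["exclude"]

raw = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"raw agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 18 of 20 items (0.90), yet kappa is only about 0.44, because two raters who each say "include" 90% of the time would already agree on 82% of items by chance.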

Improving agreement

When inter-rater reliability is low, the usual remedy is clearer criteria and better training rather than blaming individuals. Detailed coding manuals with explicit definitions and worked examples, calibration sessions where raters discuss discrepancies, and pilot rounds before the main study all raise agreement. In systematic reviews, two reviewers independently screen records and extract data, then reconcile their disagreements; this step is built in precisely to make the process reproducible and to surface ambiguous criteria.

Reliability of raters, not of items

Inter-rater reliability isolates one specific source of measurement error — the rater — and so complements other reliability types. Test-retest reliability concerns stability over time, internal consistency concerns agreement among items, and inter-rater reliability concerns agreement among people. A measure can be internally consistent yet show poor inter-rater reliability if the scoring rules are vague. Studies relying on observation or coding should report inter-rater reliability explicitly, as it underwrites the credibility of all judgement-based data.

Key facts

At a glance

Definition: Agreement between independent raters judging the same thing
Matters for: Coding, scoring, diagnosis, systematic-review screening
Two raters: Cohen’s kappa (chance-corrected, nominal categories)
Many raters: Fleiss’ kappa (chance-corrected, more than two raters)
Continuous or ordinal: Intraclass correlation coefficient (ICC)
Improved by: Clear criteria, coding manuals and rater training
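
For the "many raters" case in the list above, the following is a minimal Python sketch of Fleiss’ kappa. It assumes every item is rated by the same number of raters (at least two) and that the categories are nominal. The function name, the input layout (one list of labels per item) and the example codes are hypothetical, chosen for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item):
    """Fleiss' kappa; ratings_per_item[i] holds every rater's label for item i."""
    n_items = len(ratings_per_item)
    if n_items == 0:
        raise ValueError("At least one rated item is required.")
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("Each item needs the same number (>= 2) of ratings.")

    category_totals = Counter()  # how often each label was used overall
    per_item_agreement = []
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Proportion of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    # Mean observed agreement across items.
    p_bar = sum(per_item_agreement) / n_items

    # Chance agreement from the overall share of each label.
    total_ratings = n_items * n_raters
    p_e = sum((c / total_ratings) ** 2 for c in category_totals.values())

    # Only one label was ever used: kappa is undefined (0/0).
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: three coders assign one of two codes to four transcripts.
coded = [
    ["theme_a", "theme_a", "theme_a"],
    ["theme_a", "theme_a", "theme_b"],
    ["theme_b", "theme_b", "theme_b"],
    ["theme_a", "theme_b", "theme_b"],
]
kappa_many = fleiss_kappa(coded)
print(f"Fleiss' kappa = {kappa_many:.2f}")
```

Here the coders agree fully on two transcripts and split two-to-one on the other two, giving a kappa of about 0.33. For real analyses, an established statistics package is a safer choice than a hand-rolled function.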

Common misconceptions

What people often get wrong

Often heard: Percentage agreement is the best measure of inter-rater reliability.

Actually: No — raw agreement ignores chance. Chance-corrected statistics such as Cohen’s kappa or the ICC give a more honest picture, especially with few categories.

Often heard: Low agreement just means the raters were careless.

Actually: No — it more often signals ambiguous criteria or insufficient training. Clearer definitions and calibration usually fix it.

Often heard: Good inter-rater reliability means the rating is valid.

Actually: No — raters can agree consistently on the wrong thing. Reliability is necessary but not sufficient for validity.

Going deeper