Examples
Worked examples
- Is an instance
An LLM technical report that lists MMLU 5-shot accuracy among its headline results.
- Is an instance
A leaderboard ranking open-weight models by MMLU score.
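To make "MMLU 5-shot accuracy" concrete, the sketch below builds a five-shot multiple-choice prompt and scores predictions by exact match on the answer letter. The prompt layout loosely follows the style commonly used for MMLU evaluation. The names `format_question`, `build_few_shot_prompt`, and `accuracy` are hypothetical helpers chosen for illustration, not part of any official harness, and how a model's answer letter is obtained is left out.

```python
from typing import Sequence, Tuple

LETTERS = "ABCD"

# One item: (question text, four answer choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]


def format_question(item: Item, include_answer: bool) -> str:
    """Render one question with lettered choices; optionally reveal the answer."""
    question, choices, answer_idx = item
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    answer = f" {LETTERS[answer_idx]}" if include_answer else ""
    lines.append("Answer:" + answer)
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, shots: Sequence[Item], target: Item) -> str:
    """Prefix the target question with solved examples (five of them for 5-shot)."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    blocks = [format_question(shot, include_answer=True) for shot in shots]
    blocks.append(format_question(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted letters that match the gold letters exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip().upper() == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

A leaderboard figure is then the accuracy over all test questions, usually aggregated across the 57 subjects. Aggregation conventions (per-subject average versus pooled average) differ between reports, so compare numbers with care.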
Counter-examples
Looks similar, but isn't
- Not an instance
BIG-bench (a separate benchmark with its own task set and methodology).
- Not an instance
HumanEval (code-only benchmark).
Editorial commentary
MMLU (Hendrycks et al., 2021) was the dominant headline benchmark for general LLM capability from 2022 through 2024. Its limitations (a multiple-choice format, contamination risk from a publicly available test set, and shrinking headroom as top scores approached the ceiling) motivated successors such as MMLU-Pro and GPQA. MMLU is still widely reported so that new results remain comparable with earlier ones.
References
- Hendrycks et al., 'Measuring Massive Multitask Language Understanding' (ICLR 2021).
Also known as
Massive Multitask Language Understanding
Machine-readable encodings
Use in your systems
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="MMLU benchmark"
      vocab-term-identifier="https://casrai.org/dictionary/term/mmlu-benchmark" />

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "MMLU benchmark",
  "identifier": "https://casrai.org/dictionary/term/mmlu-benchmark",
  "description": "The Massive Multitask Language Understanding benchmark, a 57-subject multiple-choice test covering elementary, high-school, college, and professional knowledge, designed to probe broad-coverage language-model knowledge.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/mmlu-benchmark",
  "sameAs": [
    "Massive Multitask Language Understanding"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
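
One way to consume the JSON-LD record above is to parse it with a standard JSON library and pull out the fields you need. The sketch below uses Python's built-in `json` module. The `summarize_term` helper and the inlined `RECORD` string (a copy of the record with the long description omitted) are assumptions made for illustration. In practice the record would be read from wherever your system stores it.

```python
import json

# Inlined copy of the JSON-LD record, description omitted for brevity.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "MMLU benchmark",
  "identifier": "https://casrai.org/dictionary/term/mmlu-benchmark",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/mmlu-benchmark",
  "sameAs": ["Massive Multitask Language Understanding"],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""


def summarize_term(raw: str) -> dict:
    """Parse a DefinedTerm record and return its name, identifier, and aliases."""
    data = json.loads(raw)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    return {
        "name": data["name"],
        "identifier": data["identifier"],
        # This record lists its alternative name under "sameAs".
        "aliases": list(data.get("sameAs", [])),
    }


term = summarize_term(RECORD)
```

Keying on `identifier` rather than `name` is the safer choice when linking records, since the identifier is a stable URL while display names can vary.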







