Examples
Worked examples
- Is an instance
A model report including the HELM scenario-metric matrix as appendix evidence.
- Is an instance
A research lab using HELM scenarios for internal model comparison.
Counter-examples
Looks similar, but isn't
- Not an instance
A single MMLU accuracy score.
- Not an instance
A latency-only benchmark.
Editorial commentary
HELM (Liang et al., 2023) emphasises holistic evaluation: each model is scored across many metrics on many scenarios, and the resulting scenario-metric matrix is presented as the principal output. This contrasts with single-metric leaderboards and aligns with the multi-property framing of trustworthy AI.
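As a rough illustration of the scenario-metric matrix idea, the sketch below arranges per-cell scores for one model into a scenario-by-metric grid. The metric names follow the term description on this page; the scenario names, the scores, and the helper functions `build_matrix` and `coverage` are hypothetical and are not part of the HELM codebase.

```python
# Metric names taken from the term description; everything else is illustrative.
METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]


def build_matrix(results, scenarios, metrics):
    """Return a scenario-by-metric matrix (list of rows); missing cells are None."""
    return [[results.get((s, m)) for m in metrics] for s in scenarios]


def coverage(matrix):
    """Fraction of scenario-metric cells that actually have a score."""
    cells = [c for row in matrix for c in row]
    return sum(c is not None for c in cells) / len(cells)


# Hypothetical scenarios and made-up scores for a single model.
scenarios = ["question_answering", "summarization"]
results = {
    ("question_answering", "accuracy"): 0.71,
    ("question_answering", "calibration"): 0.12,
    ("summarization", "accuracy"): 0.44,
    ("summarization", "toxicity"): 0.01,
}

matrix = build_matrix(results, scenarios, METRICS)
for name, row in zip(scenarios, matrix):
    print(name, row)
print("coverage:", coverage(matrix))
```

A single-metric leaderboard corresponds to reading one column of this grid, whereas a report that counts as an instance surfaces the whole grid, including the empty cells.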
References
- Liang et al., 'Holistic Evaluation of Language Models' (Transactions on Machine Learning Research, 2023).
Also known as
Holistic Evaluation of Language Models
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="HELM benchmark"
vocab-term-identifier="https://casrai.org/dictionary/term/helm-benchmark" />
{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "HELM benchmark",
"identifier": "https://casrai.org/dictionary/term/helm-benchmark",
"description": "The Holistic Evaluation of Language Models benchmark, a multi-metric framework evaluating language models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on a fixed set of scenarios.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
"url": "https://casrai.org/dictionary/term/helm-benchmark",
"alternateName": "Holistic Evaluation of Language Models",
"license": "https://creativecommons.org/licenses/by/4.0/"
}
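One possible way to consume the JSON-LD record in a downstream system is sketched below using only the Python standard library. The embedded record is a trimmed copy of the one above, and the helper `load_defined_term` is a hypothetical name introduced for illustration.

```python
import json

# Trimmed copy of the JSON-LD record above; only a few fields are kept for brevity.
record_text = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "HELM benchmark",
  "identifier": "https://casrai.org/dictionary/term/helm-benchmark",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/"
}
"""


def load_defined_term(text):
    """Parse a JSON-LD string and check that it describes a schema.org DefinedTerm."""
    record = json.loads(text)
    if record.get("@type") != "DefinedTerm":
        raise ValueError("expected a DefinedTerm record")
    return record


term = load_defined_term(record_text)
# Build a simple human-readable label from the term name and its identifier.
label = f'{term["name"]} <{term["identifier"]}>'
print(label)
```

This treats the record as plain JSON; a system that needs full JSON-LD processing (context expansion, compaction) would use a dedicated JSON-LD library instead.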







