The Holistic Evaluation of Language Models benchmark, a multi-metric framework evaluating language models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on a fixed set of scenarios.

By CASRAI Editorial Board

Last updated 21 May 2026

Examples

Worked examples

Is an instance
A model report including the HELM scenario-metric matrix as appendix evidence.
Is an instance
A research lab using HELM scenarios for internal model comparison.

Counter-examples

Looks similar, but isn't

Not an instance
A single-accuracy MMLU score.
Not an instance
A latency-only benchmark.

Editorial commentary

HELM (Liang et al., 2023) emphasises holistic evaluation: each model is scored on many metrics across many scenarios, and the resulting scenario-metric matrix, rather than a single aggregate score, is the principal output. This contrasts with single-metric leaderboards and aligns with the multi-property framing of trustworthy AI, in which accuracy is only one of several properties assessed.
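
To make the scenario-metric matrix concrete, here is a minimal Python sketch of how such a matrix might be assembled and displayed. It is not HELM's own code or data format: the function names (`build_matrix`, `coverage`, `render`), the scenario names, and every score are hypothetical placeholders. Only the seven metric names come from the definition above.

```python
# Metric names taken from the definition of the term.
METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]


def build_matrix(records, metrics=METRICS):
    """Arrange (scenario, metric, score) records into {scenario: {metric: score}}.

    Cells with no record stay None so that gaps in coverage remain visible.
    """
    matrix = {}
    for scenario, metric, score in records:
        if metric not in metrics:
            raise ValueError(f"unknown metric: {metric}")
        row = matrix.setdefault(scenario, {m: None for m in metrics})
        row[metric] = score
    return matrix


def coverage(matrix):
    """Return the fraction of scenario-metric cells that hold a score."""
    cells = [value for row in matrix.values() for value in row.values()]
    if not cells:
        return 0.0
    return sum(value is not None for value in cells) / len(cells)


def render(matrix, metrics=METRICS):
    """Format the matrix as a plain-text table, one row per scenario."""
    lines = ["scenario".ljust(12) + "".join(m.rjust(12) for m in metrics)]
    for scenario, row in matrix.items():
        cells = "".join(
            ("-" if row[m] is None else f"{row[m]:.2f}").rjust(12)
            for m in metrics
        )
        lines.append(scenario.ljust(12) + cells)
    return "\n".join(lines)


# Placeholder values for illustration only; these are NOT real HELM results.
records = [
    ("scenario_a", "accuracy", 0.5),
    ("scenario_a", "robustness", 0.25),
    ("scenario_b", "accuracy", 0.75),
    ("scenario_b", "toxicity", 0.0),
]

matrix = build_matrix(records)
print(render(matrix))
print(f"coverage: {coverage(matrix):.2f}")
```

Keeping unmeasured cells as `None` instead of dropping them is a deliberate choice in this sketch: a single-metric leaderboard hides which properties were never measured, whereas the matrix view shows them as gaps.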

References

Liang et al., 'Holistic Evaluation of Language Models' (Transactions on Machine Learning Research, 2023).

Also known as

Holistic Evaluation of Language Models

Machine-readable encodings

Use in your systems

JATS XML <role> element

xml

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="HELM benchmark"
      vocab-term-identifier="https://casrai.org/dictionary/term/helm-benchmark" />

Schema.org DefinedTerm (JSON-LD)

json

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/helm-benchmark",
  "name": "HELM benchmark",
  "identifier": "https://casrai.org/dictionary/term/helm-benchmark",
  "description": "The Holistic Evaluation of Language Models benchmark, a multi-metric framework evaluating language models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on a fixed set of scenarios.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "url": "https://casrai.org/dictionary/term/helm-benchmark",
  "alternateName": [
    "Holistic Evaluation of Language Models"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T02:22:51",
  "inLanguage": "en"
}
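
As a rough illustration of consuming this record, the Python sketch below parses a trimmed copy of it with the standard-library `json` module and reads a few fields. It treats the JSON-LD as plain JSON and performs no JSON-LD expansion or context resolution, which a production consumer may need. The helper name `summarize_defined_term` is hypothetical.

```python
import json

# Trimmed copy of the record above; only the fields this sketch reads.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/helm-benchmark",
  "name": "HELM benchmark",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set"
}
"""


def summarize_defined_term(raw):
    """Parse a JSON-LD string and return (name, id, term set) for a DefinedTerm."""
    data = json.loads(raw)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("not a schema.org DefinedTerm")
    # inDefinedTermSet is optional, so a missing value comes back as None.
    return data["name"], data["@id"], data.get("inDefinedTermSet")


name, term_id, term_set = summarize_defined_term(RECORD)
print(f"{name} -> {term_id}")
```

Checking `@type` before reading other fields keeps the helper from silently accepting an unrelated schema.org record.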
