v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

HELM benchmark

The Holistic Evaluation of Language Models benchmark, a multi-metric framework evaluating language models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on a fixed set of scenarios.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A model report including the HELM scenario-metric matrix as appendix evidence.

  • Is an instance

    A research lab using HELM scenarios for internal model comparison.

Counter-examples

Looks similar, but isn't

  • Not an instance

    A single MMLU accuracy score.

  • Not an instance

    A latency-only benchmark.

Editorial commentary

HELM (Liang et al., 2023) emphasises holistic evaluation: a single model is scored across many metrics on many scenarios, with the resulting matrix surfaced as the principal output. This contrasts with single-metric leaderboards and aligns with the multi-property framing of trustworthy AI.
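
As a rough illustration of the scenario-metric matrix described above, the following Python sketch arranges one model's results into a grid with one row per scenario and one column per metric. The scenario names, the scores, and the helpers `build_matrix` and `coverage` are hypothetical placeholders chosen for this example. They are not HELM data and not part of any HELM tooling.

```python
from typing import Dict, List, Optional, Tuple

# The seven metric categories named in the definition of this term.
METRICS: List[str] = [
    "accuracy", "calibration", "robustness", "fairness",
    "bias", "toxicity", "efficiency",
]

# Hypothetical results for one model, keyed by (scenario, metric).
# The numbers are placeholders, not real HELM scores.
Results = Dict[Tuple[str, str], float]
Matrix = List[List[Optional[float]]]


def build_matrix(results: Results, scenarios: List[str]) -> Matrix:
    """Return a scenario-by-metric grid; unmeasured cells are None."""
    return [[results.get((s, m)) for m in METRICS] for s in scenarios]


def coverage(matrix: Matrix) -> float:
    """Fraction of cells in the grid that hold a measurement."""
    cells = [cell for row in matrix for cell in row]
    return sum(cell is not None for cell in cells) / len(cells)


scenarios = ["question_answering", "summarization"]  # assumed names
results: Results = {
    ("question_answering", "accuracy"): 0.71,
    ("question_answering", "calibration"): 0.12,
    ("summarization", "accuracy"): 0.44,
    ("summarization", "toxicity"): 0.01,
}

matrix = build_matrix(results, scenarios)
for scenario, row in zip(scenarios, matrix):
    print(scenario, row)
print("coverage:", round(coverage(matrix), 2))
```

Keeping empty cells as `None` rather than dropping them makes gaps in the matrix visible, which is the point of reporting the full grid instead of a single headline number.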

References

  • Liang et al., 'Holistic Evaluation of Language Models' (Transactions on Machine Learning Research, 2023).

Also known as

Holistic Evaluation of Language Models

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="HELM benchmark"
      vocab-term-identifier="https://casrai.org/dictionary/term/helm-benchmark" />
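
One way a consuming system might read the element above is sketched below, using Python's standard `xml.etree.ElementTree` module. The function name `read_role` and the keys of the returned dictionary are assumptions made for this example; only the attribute names come from the snippet itself.

```python
import xml.etree.ElementTree as ET

# The <role> element exactly as given in the encoding above.
ROLE_XML = """<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="HELM benchmark"
      vocab-term-identifier="https://casrai.org/dictionary/term/helm-benchmark" />"""


def read_role(xml_text: str) -> dict:
    """Extract the vocabulary attributes from a single <role> element."""
    element = ET.fromstring(xml_text)
    if element.tag != "role":
        raise ValueError(f"expected <role>, got <{element.tag}>")
    return {
        "term": element.get("vocab-term"),
        "term_uri": element.get("vocab-term-identifier"),
        "vocabulary_uri": element.get("vocab-identifier"),
    }


role = read_role(ROLE_XML)
print(role["term"], "->", role["term_uri"])
```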
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "HELM benchmark",
  "identifier": "https://casrai.org/dictionary/term/helm-benchmark",
  "description": "The Holistic Evaluation of Language Models benchmark, a multi-metric framework evaluating language models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on a fixed set of scenarios.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/helm-benchmark",
  "sameAs": [
    "Holistic Evaluation of Language Models"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
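
A minimal sketch of loading the JSON-LD record above with Python's standard `json` module follows. The set of required keys and the helper `load_defined_term` are assumptions for this example, not a validation rule published by CASRAI or Schema.org.

```python
import json

# The DefinedTerm record as given in the encoding above.
TERM_JSONLD = """{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "HELM benchmark",
  "identifier": "https://casrai.org/dictionary/term/helm-benchmark",
  "description": "The Holistic Evaluation of Language Models benchmark, a multi-metric framework evaluating language models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on a fixed set of scenarios.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/helm-benchmark",
  "sameAs": [
    "Holistic Evaluation of Language Models"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}"""

# Keys this sketch treats as mandatory (an assumption, not a spec rule).
REQUIRED_KEYS = ("@type", "name", "identifier", "description")


def load_defined_term(text: str) -> dict:
    """Parse a JSON-LD string and check it looks like a DefinedTerm."""
    record = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if record["@type"] != "DefinedTerm":
        raise ValueError(f"unexpected @type: {record['@type']}")
    return record


term = load_defined_term(TERM_JSONLD)
# Collect the preferred name plus any alternative labels for lookup.
labels = [term["name"], *term.get("sameAs", [])]
print(labels)
```

Gathering the name and the "Also known as" labels into one list lets a local index match either form to the same identifier.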
