v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

Model evaluation suite

A defined collection of benchmarks, tasks, and metrics, with standardised prompting and decoding rules, used to characterise a model's capabilities and behaviour across a range of dimensions.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A model report listing results from lm-evaluation-harness v0.4.0 across MMLU, HellaSwag, ARC-c, TruthfulQA.

  • Is an instance

    A new domain-specific evaluation suite covering 14 medical-coding benchmarks.

Counter-examples

Looks similar, but isn't

  • Not an instance

    A single accuracy figure without specification of suite or template.

  • Not an instance

    A leaderboard with undisclosed methodology.

Editorial commentary

Evaluation suites range from single-purpose (HumanEval for code) to broad-coverage (lm-evaluation-harness, HELM). Reproducible reporting requires specifying the suite, the suite version, the prompting template, the decoding configuration (temperature, top-p), and the few-shot example selection. EleutherAI's lm-evaluation-harness has emerged as a community-default open implementation.
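
The reporting requirements above can be sketched as a small record type. This is a minimal illustration, not a CASRAI or lm-evaluation-harness schema: the class `EvalReportSpec`, the helper `missing_fields`, and all field names are hypothetical, and the template, decoding, and few-shot values are placeholders chosen only to make the example complete.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalReportSpec:
    """Hypothetical record of the settings a reproducible report should state."""
    suite: str               # name of the evaluation suite
    suite_version: str       # version of the suite
    tasks: tuple             # benchmarks included in the report
    prompt_template: str     # identifier of the prompting template
    temperature: float       # decoding configuration
    top_p: float             # decoding configuration
    num_fewshot: int         # number of few-shot examples per prompt
    fewshot_selection: str   # how the few-shot examples were chosen


def missing_fields(spec: EvalReportSpec) -> list:
    """Return the names of fields that were left empty (None, '' or ())."""
    missing = []
    for f in fields(spec):
        value = getattr(spec, f.name)
        if value is None or (isinstance(value, (str, tuple)) and len(value) == 0):
            missing.append(f.name)
    return missing


# Suite, version, and tasks come from the worked example in this entry;
# the remaining values are illustrative assumptions.
spec = EvalReportSpec(
    suite="lm-evaluation-harness",
    suite_version="v0.4.0",
    tasks=("MMLU", "HellaSwag", "ARC-c", "TruthfulQA"),
    prompt_template="suite-default",
    temperature=0.0,
    top_p=1.0,
    num_fewshot=5,
    fewshot_selection="first-n from the training split",
)

# A bare accuracy figure with no suite or template, as in the counter-examples.
bare_score = EvalReportSpec("", "", (), "", 0.0, 1.0, 0, "")

print(missing_fields(spec))
print(missing_fields(bare_score))
```

A record with no missing fields corresponds to the "Is an instance" cases; the bare score fails on every descriptive field, matching the first counter-example.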

References

  • Gao et al., 'A framework for few-shot language model evaluation' (lm-evaluation-harness GitHub project, 2021-).

Also known as

LLM eval suite · evaluation harness

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Model evaluation suite"
      vocab-term-identifier="https://casrai.org/dictionary/term/model-evaluation-suite" />

Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Model evaluation suite",
  "identifier": "https://casrai.org/dictionary/term/model-evaluation-suite",
  "description": "A defined collection of benchmarks, tasks, and metrics, with standardised prompting and decoding rules, used to characterise a model's capabilities and behaviour across a range of dimensions.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/model-evaluation-suite",
  "alternateName": [
    "LLM eval suite",
    "evaluation harness"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
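
As a rough sketch of consuming the JSON-LD encoding, the snippet below parses a trimmed copy of the record with Python's standard `json` module and builds a display label. The function name `term_label` and the label format are assumptions made for illustration; only the `@type`, `name`, and `identifier` keys from the record above are used.

```python
import json

# Trimmed copy of the DefinedTerm record shown above.
RAW = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Model evaluation suite",
  "identifier": "https://casrai.org/dictionary/term/model-evaluation-suite"
}
"""


def term_label(record: dict) -> str:
    """Return 'name <identifier>' for a Schema.org DefinedTerm record."""
    if record.get("@type") != "DefinedTerm":
        raise ValueError("expected a Schema.org DefinedTerm record")
    return f'{record["name"]} <{record["identifier"]}>'


record = json.loads(RAW)
label = term_label(record)
print(label)
```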
