A defined collection of benchmarks, tasks, and metrics, with standardised prompting and decoding rules, used to characterise a model's capabilities and behaviour across a range of dimensions.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

Is an instance
A model report listing results from lm-evaluation-harness v0.4.0 across MMLU, HellaSwag, ARC-c, and TruthfulQA.
Is an instance
A new domain-specific evaluation suite covering 14 medical-coding benchmarks.

Counter-examples

Looks similar, but isn't

Not an instance
A single accuracy figure reported without specifying the suite or the prompting template.
Not an instance
A leaderboard with undisclosed methodology.

Editorial commentary

Evaluation suites range from single-purpose (HumanEval for code) to broad-coverage (lm-evaluation-harness, HELM). Reproducible reporting requires specifying the suite, the suite version, the prompting template, the decoding configuration (such as temperature and top-p), and how the few-shot examples were selected. EleutherAI's lm-evaluation-harness has emerged as a de facto community default among open implementations.
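The reporting requirements above can be made concrete as a record that travels with each result. The following is a minimal sketch, not a published schema: the class name `EvalReportEntry`, its field names, the helper `missing_fields`, and all example values (including the metric value, which is a placeholder and not a real result) are assumptions introduced for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvalReportEntry:
    """One reported result plus the settings needed to reproduce it.

    Field names are illustrative assumptions, not a standard.
    """
    suite: Optional[str] = None               # which evaluation suite
    suite_version: Optional[str] = None       # exact suite version
    task: Optional[str] = None                # benchmark or task name
    prompt_template: Optional[str] = None     # template identifier or text
    temperature: Optional[float] = None       # decoding configuration
    top_p: Optional[float] = None             # decoding configuration
    few_shot_selection: Optional[str] = None  # how few-shot examples were chosen
    metric: Optional[str] = None
    value: Optional[float] = None


def missing_fields(entry: EvalReportEntry) -> List[str]:
    """Return the names of fields left unspecified, in declaration order."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) is None]


# A fully specified entry (all values are hypothetical placeholders).
complete = EvalReportEntry(
    suite="lm-evaluation-harness",
    suite_version="v0.4.0",
    task="MMLU",
    prompt_template="hypothetical-template-id",
    temperature=0.0,
    top_p=1.0,
    few_shot_selection="5 fixed examples (hypothetical)",
    metric="accuracy",
    value=0.5,  # placeholder, not a real result
)

# A bare figure of the kind listed under the counter-examples.
bare = EvalReportEntry(task="MMLU", metric="accuracy", value=0.5)

print(missing_fields(complete))
print(missing_fields(bare))
```

An entry whose `missing_fields` list is non-empty corresponds to the counter-example case: a figure that cannot be reproduced or compared because part of its configuration is undisclosed.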

References

Gao et al., 'A framework for few-shot language model evaluation' (lm-evaluation-harness GitHub project, 2021-).

Also known as

LLM eval suite · evaluation harness

Machine-readable encodings

Use in your systems

JATS XML <role> element

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Model evaluation suite"
      vocab-term-identifier="https://casrai.org/dictionary/term/model-evaluation-suite" />

Schema.org DefinedTerm (JSON-LD)

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/model-evaluation-suite",
  "name": "Model evaluation suite",
  "identifier": "https://casrai.org/dictionary/term/model-evaluation-suite",
  "description": "A defined collection of benchmarks, tasks, and metrics, with standardised prompting and decoding rules, used to characterise a model's capabilities and behaviour across a range of dimensions.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "url": "https://casrai.org/dictionary/term/model-evaluation-suite",
  "alternateName": [
    "LLM eval suite",
    "evaluation harness"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T02:22:51",
  "inLanguage": "en"
}
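As a hedged illustration of consuming the JSON-LD record in a downstream system, the sketch below parses a trimmed copy of it with Python's standard `json` module and pulls out the identifying fields. The helper name `read_defined_term` and the shape of its return value are assumptions made for this example; only keys that appear in the record above are read.

```python
import json

# Trimmed copy of the DefinedTerm record, kept short for the example.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/model-evaluation-suite",
  "name": "Model evaluation suite",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set"
}
"""


def read_defined_term(raw: str) -> dict:
    """Parse a JSON-LD string and return the term's identifying fields.

    Hypothetical helper: raises ValueError if the record is not a DefinedTerm.
    """
    data = json.loads(raw)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("not a schema.org DefinedTerm record")
    return {
        "id": data["@id"],
        "name": data["name"],
        "term_set": data.get("inDefinedTermSet"),
    }


term = read_defined_term(RECORD)
print(term["name"], term["id"])
```

Storing the `@id` value rather than the display name keeps references stable if the label is later reworded.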
