Examples
Worked examples
- Is an instance
A model report listing results from lm-evaluation-harness v0.4.0 across MMLU, HellaSwag, ARC-c, and TruthfulQA.
- Is an instance
A new domain-specific evaluation suite covering 14 medical-coding benchmarks.
Counter-examples
Looks similar, but isn't
- Not an instance
A single accuracy figure reported without specifying the suite or the prompting template.
- Not an instance
A leaderboard with undisclosed methodology.
Editorial commentary
Evaluation suites range from single-purpose (HumanEval for code) to broad-coverage (lm-evaluation-harness, HELM). Reproducible reporting requires specifying the suite, the suite version, the prompting template, the decoding configuration (temperature, top-p), and the number and selection of few-shot examples. EleutherAI's lm-evaluation-harness has emerged as a community-default open implementation.
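As a rough illustration of the reporting requirements above, the sketch below checks whether a reported result carries the details needed to reproduce it. The field names (`suite`, `suite_version`, `prompt_template`, and so on), the helper `missing_fields`, and the example values are assumptions made for this example only; they are not part of lm-evaluation-harness or of any published schema, and the score is a placeholder rather than a real result.

```python
# Hypothetical field names for the reproducibility details listed above.
REQUIRED_FIELDS = (
    "suite",              # which evaluation suite was run
    "suite_version",      # exact release of that suite
    "prompt_template",    # identifier or text of the prompting template
    "temperature",        # decoding configuration
    "top_p",
    "num_fewshot",        # how many few-shot examples were used
    "fewshot_selection",  # how those examples were chosen
)


def missing_fields(record):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# A report that states every detail (all values are illustrative).
full_report = {
    "suite": "lm-evaluation-harness",
    "suite_version": "0.4.0",
    "task": "MMLU",
    "prompt_template": "default",
    "temperature": 0.0,
    "top_p": 1.0,
    "num_fewshot": 5,
    "fewshot_selection": "fixed examples from the dev split",
    "metric": "accuracy",
    "value": 0.5,  # placeholder, not a measured score
}

# A bare accuracy figure, as in the counter-example.
bare_figure = {"task": "MMLU", "metric": "accuracy", "value": 0.5}

print(missing_fields(full_report))
print(missing_fields(bare_figure))
```

A record with an empty result from `missing_fields` would, under these assumptions, state enough to be rerun; the bare figure is flagged for every reproducibility field.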
References
- Gao et al., 'A framework for few-shot language model evaluation' (lm-evaluation-harness GitHub project, 2021-).
Also known as
LLM eval suite · evaluation harness
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Model evaluation suite"
vocab-term-identifier="https://casrai.org/dictionary/term/model-evaluation-suite" />
{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Model evaluation suite",
"identifier": "https://casrai.org/dictionary/term/model-evaluation-suite",
"description": "A defined collection of benchmarks, tasks, and metrics, with standardised prompting and decoding rules, used to characterise a model's capabilities and behaviour across a range of dimensions.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
"url": "https://casrai.org/dictionary/term/model-evaluation-suite",
"alternateName": [
"LLM eval suite",
"evaluation harness"
],
"license": "https://creativecommons.org/licenses/by/4.0/"
}
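One possible way to consume the JSON-LD record above is sketched here, using only the Python standard library. The helper name `term_reference` is hypothetical, and the embedded record is an abridged copy that keeps just the fields the sketch reads; a real integration might use a dedicated JSON-LD library instead.

```python
import json

# Abridged copy of the JSON-LD record above (only the fields read below).
RECORD_TEXT = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Model evaluation suite",
  "identifier": "https://casrai.org/dictionary/term/model-evaluation-suite"
}
"""


def term_reference(text):
    """Parse a DefinedTerm record and return its (name, identifier) pair."""
    data = json.loads(text)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    return data["name"], data["identifier"]


name, identifier = term_reference(RECORD_TEXT)
print(f"{name} <{identifier}>")
```

Storing the identifier alongside the display name lets a downstream system keep pointing at the same term even if its label is later revised.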







