v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

Synthetic benchmark

A benchmark whose evaluation items are wholly or partially generated by another model or procedural method, rather than collected from natural human-produced sources, used to probe specific capabilities or to scale evaluation cheaply.

By CASRAI Editorial Board
· Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A procedurally generated benchmark of 100k multi-step arithmetic problems.

  • Is an instance

    An LLM-generated multiple-choice probe of moral reasoning, validated by human raters on a 5% sample.
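
The first worked example can be illustrated with a minimal sketch of a procedural generator. The names `ArithmeticItem`, `make_item`, and `generate_benchmark`, the operand ranges, and the choice of operators are assumptions made for illustration; they are not part of this entry or of any published benchmark.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ArithmeticItem:
    """One evaluation item: a prompt and its reference answer (hypothetical schema)."""
    prompt: str
    answer: int


def make_item(rng: random.Random, steps: int = 3) -> ArithmeticItem:
    """Chain `steps` operations onto a starting value, tracking the answer as we go."""
    value = rng.randint(1, 20)
    expr = str(value)
    for _ in range(steps):
        op = rng.choice(["+", "-", "*"])
        operand = rng.randint(1, 9)
        # Wrap the running expression in parentheses so the written form
        # evaluates in the same left-to-right order used to compute `value`.
        expr = f"({expr} {op} {operand})"
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
    return ArithmeticItem(prompt=f"Compute {expr}.", answer=value)


def generate_benchmark(n_items: int, seed: int = 0, steps: int = 3) -> List[ArithmeticItem]:
    """Generate `n_items` items; a fixed seed makes the benchmark reproducible."""
    rng = random.Random(seed)
    return [make_item(rng, steps) for _ in range(n_items)]


items = generate_benchmark(n_items=1000, seed=42)
```

Because every item is derived from the seed, publishing the seed and the generator code is enough for others to regenerate the same items. Raising `n_items` to 100,000 matches the scale mentioned in the example.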

Counter-examples

Looks similar, but isn't

  • Not an instance

    MMLU (human-curated benchmark).

  • Not an instance

    A real-world dataset of clinical notes.

Editorial commentary

Synthetic benchmarks support controlled capability testing (e.g., procedurally generated multi-step reasoning problems) and large-scale evaluation, but they raise validity concerns: a benchmark generated by model A may favor the strengths of model A and may not represent natural data distributions. Best practice includes human validation of a sample of items, disclosure of the generator, and use alongside complementary natural benchmarks.
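
Two of these practices, human validation of a sample and disclosure of the generator, can be sketched as follows. The function names, the record fields, and the generator label `hypothetical-generator-v1` are all assumptions made for illustration; the 5% fraction mirrors the worked example above and is not a prescribed threshold.

```python
import random
from typing import Dict, List, Sequence


def validation_sample(item_ids: Sequence[str], fraction: float = 0.05, seed: int = 0) -> List[str]:
    """Draw a reproducible random subset of item IDs for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not item_ids:
        return []
    # Review at least one item, even for very small benchmarks.
    k = max(1, round(len(item_ids) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(list(item_ids), k))


def agreement_rate(generated_labels: Dict[str, str], human_labels: Dict[str, str]) -> float:
    """Share of human-reviewed items whose generated label matches the human label."""
    if not human_labels:
        raise ValueError("no human labels supplied")
    matches = sum(generated_labels[item_id] == label for item_id, label in human_labels.items())
    return matches / len(human_labels)


def disclosure_record(generator: str, n_items: int, sample_ids: Sequence[str], agreement: float) -> dict:
    """Bundle the facts a reader needs to judge the benchmark (hypothetical field names)."""
    return {
        "generator": generator,
        "total_items": n_items,
        "validated_items": len(sample_ids),
        "human_agreement": agreement,
    }


ids = [f"item-{i:04d}" for i in range(1000)]
sample = validation_sample(ids, fraction=0.05, seed=7)
record = disclosure_record("hypothetical-generator-v1", len(ids), sample, agreement=0.0)
```

The `agreement=0.0` value is only a placeholder. In practice it would come from `agreement_rate` once raters have labelled the sampled items.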

References

  • Perez et al., 'Discovering Language Model Behaviors with Model-Written Evaluations' (ACL Findings 2023).

Also known as

model-generated benchmark · synthetic eval

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Synthetic benchmark"
      vocab-term-identifier="https://casrai.org/dictionary/term/synthetic-benchmark" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Synthetic benchmark",
  "identifier": "https://casrai.org/dictionary/term/synthetic-benchmark",
  "description": "A benchmark whose evaluation items are wholly or partially generated by another model or procedural method, rather than collected from natural human-produced sources, used to probe specific capabilities or to scale evaluation cheaply.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/synthetic-benchmark",
  "alternateName": [
    "model-generated benchmark",
    "synthetic eval"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}

Adopted by research universities worldwide

  • University of Cambridge
  • Columbia University
  • University of Edinburgh
  • Harvard University
  • Massachusetts Institute of Technology
  • University of Oxford
  • Princeton University
  • Stanford School of Medicine
  • University College London