A benchmark whose evaluation items are wholly or partially generated by another model or procedural method, rather than collected from natural human-produced sources, used to probe specific capabilities or to scale evaluation cheaply.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

Is an instance
A procedurally generated benchmark of 100k multi-step arithmetic problems.
Is an instance
An LLM-generated multiple-choice probe of moral reasoning, validated by human raters on a 5% sample.

Counter-examples

Looks similar, but isn't

Not an instance
MMLU (human-curated benchmark).
Not an instance
A real-world dataset of clinical notes.

Editorial commentary

Synthetic benchmarks support controlled capability testing (e.g., procedurally generated multi-step reasoning problems) and large-scale evaluation, but they raise validity concerns: a benchmark generated by model A may favor the strengths of model A and may not represent natural distributions. Best practice includes human validation of a sample of items, disclosure of the generator, and use of complementary natural benchmarks.
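The practices named in the commentary can be illustrated with a small sketch. It assumes a procedurally generated arithmetic benchmark; the names `Item`, `make_item`, `build_benchmark`, and the manifest keys are hypothetical and not part of any CASRAI specification. The sketch records the generator settings (disclosure) and sets aside a 5% sample of item indices for human review (validation).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str
    answer: int


def make_item(rng: random.Random, steps: int = 3) -> Item:
    """Build one fully parenthesized multi-step arithmetic problem."""
    value = rng.randint(1, 20)
    expr = str(value)
    for _ in range(steps):
        op = rng.choice("+-*")
        operand = rng.randint(1, 9)
        # Wrap the running expression so evaluation order is unambiguous.
        expr = f"({expr} {op} {operand})"
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
    return Item(prompt=f"Compute {expr}.", answer=value)


def build_benchmark(n_items: int, seed: int, steps: int = 3) -> dict:
    """Generate items and a manifest (hypothetical layout, for illustration)."""
    rng = random.Random(seed)  # fixed seed makes the benchmark reproducible
    items = [make_item(rng, steps) for _ in range(n_items)]
    # Reserve 5% of the items (at least one) for human validation.
    sample_size = max(1, n_items * 5 // 100)
    validation_ids = sorted(rng.sample(range(n_items), sample_size))
    return {
        # Disclosure of the generator: method and parameters.
        "generator": {"method": "procedural", "seed": seed, "steps": steps},
        "items": items,
        "human_validation_ids": validation_ids,
    }


if __name__ == "__main__":
    bench = build_benchmark(n_items=200, seed=0)
    print(bench["generator"])
    print(len(bench["items"]), "items;",
          len(bench["human_validation_ids"]), "flagged for human review")
```

Seeding the generator means anyone with the manifest can regenerate the same items, which is one simple way to make the disclosure verifiable. A model-generated benchmark would record the generating model and prompt in place of the seed.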

References

Perez et al., 'Discovering Language Model Behaviors with Model-Written Evaluations' (ACL Findings 2023).

Also known as

model-generated benchmark · synthetic eval

Machine-readable encodings

Use in your systems

JATS XML <role> element

xml

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Synthetic benchmark"
      vocab-term-identifier="https://casrai.org/dictionary/term/synthetic-benchmark" />

Schema.org DefinedTerm (JSON-LD)

json

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/synthetic-benchmark",
  "name": "Synthetic benchmark",
  "identifier": "https://casrai.org/dictionary/term/synthetic-benchmark",
  "description": "A benchmark whose evaluation items are wholly or partially generated by another model or procedural method, rather than collected from natural human-produced sources, used to probe specific capabilities or to scale evaluation cheaply.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "url": "https://casrai.org/dictionary/term/synthetic-benchmark",
  "alternateName": [
    "model-generated benchmark",
    "synthetic eval"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T02:22:51",
  "inLanguage": "en"
}
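As a rough illustration of consuming the record above, the following sketch parses a trimmed copy of it with Python's standard `json` module and checks the `@type` before reading fields. The helper name `load_defined_term` and the returned dictionary layout are assumptions made for this example. It treats the record as plain JSON and does not perform JSON-LD context expansion.

```python
import json

# Trimmed copy of the record above, kept inline so the example is self-contained.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/synthetic-benchmark",
  "name": "Synthetic benchmark",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set"
}
"""


def load_defined_term(raw: str) -> dict:
    """Parse a DefinedTerm record and return the fields a catalogue might index."""
    data = json.loads(raw)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("not a schema.org DefinedTerm record")
    return {
        "id": data["@id"],
        "name": data["name"],
        "term_set": data.get("inDefinedTermSet"),  # optional in this sketch
    }


term = load_defined_term(RECORD)
print(term["name"], "->", term["id"])
```

Keying stored references on the `@id` URL rather than on the display name keeps them stable if the label or its aliases change.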
