Examples
Worked examples
- Is an instance
A procedurally generated benchmark of 100k multi-step arithmetic problems.
- Is an instance
An LLM-generated multiple-choice probe of moral reasoning, validated by human raters on a 5% sample.
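The first worked example (a procedurally generated arithmetic benchmark) can be sketched roughly as below. This is a minimal illustration, not the method of any particular benchmark: the function names `make_problem` and `generate_benchmark`, the operand ranges, and the default depth of three steps are assumptions chosen for brevity.

```python
import random


def make_problem(rng, steps=3):
    """Build one multi-step arithmetic item with its ground-truth answer.

    Hypothetical helper: operand ranges and operators are illustrative only.
    """
    value = rng.randint(1, 20)
    expr = str(value)
    for _ in range(steps):
        op = rng.choice(["+", "-", "*"])
        operand = rng.randint(1, 9)
        # Parenthesize each step so the written expression matches the
        # left-to-right computation performed here.
        expr = f"({expr} {op} {operand})"
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
    return {"question": f"Compute {expr}.", "answer": value}


def generate_benchmark(n_items, seed=0, steps=3):
    """Generate n_items problems; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [make_problem(rng, steps) for _ in range(n_items)]


benchmark = generate_benchmark(1000, seed=42)
```

Because the answer is computed alongside the question, every item carries a verified label without human annotation, and the seed lets others regenerate the same item set.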
Counter-examples
Looks similar, but isn't
- Not an instance
MMLU (human-curated benchmark).
- Not an instance
A real-world dataset of clinical notes.
Editorial commentary
Synthetic benchmarks support controlled capability testing (e.g., procedurally generated multi-step reasoning problems) and large-scale evaluation, but they raise validity concerns: a benchmark generated by model A may favor the strengths of model A and may not represent the distribution of naturally occurring data. Best practice includes human validation of a sample of items, disclosure of the generator, and complementary evaluation on natural benchmarks.
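Two of the practices named above, human validation of a sample and disclosure of the generator, can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `validation_sample` and `disclosure_card`, the field names in the returned record, and the placeholder generator name are all hypothetical and do not come from any standard.

```python
import random


def validation_sample(items, fraction=0.05, seed=0):
    """Draw a reproducible random subset of items for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    # Review at least one item, even for very small benchmarks.
    k = max(1, round(len(items) * fraction))
    rng = random.Random(seed)
    return rng.sample(items, k)


def disclosure_card(generator, n_items, n_validated, seed):
    """Record how the benchmark was produced (hypothetical field names)."""
    return {
        "generator": generator,
        "n_items": n_items,
        "n_validated": n_validated,
        "validated_fraction": n_validated / n_items,
        "sampling_seed": seed,
    }


# Placeholder items standing in for generated benchmark questions.
items = [{"id": i} for i in range(2000)]
sample = validation_sample(items, fraction=0.05, seed=7)
card = disclosure_card("example-generator-v1", len(items), len(sample), seed=7)
```

Publishing a record like `card` alongside the benchmark lets readers see which generator produced the items and how much of the set was checked by people.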
References
- Perez et al., 'Discovering Language Model Behaviors with Model-Written Evaluations' (Findings of ACL 2023).
Also known as
model-generated benchmark · synthetic eval
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Synthetic benchmark"
vocab-term-identifier="https://casrai.org/dictionary/term/synthetic-benchmark" />

{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Synthetic benchmark",
"identifier": "https://casrai.org/dictionary/term/synthetic-benchmark",
"description": "A benchmark whose evaluation items are wholly or partially generated by another model or procedural method, rather than collected from natural human-produced sources, used to probe specific capabilities or to scale evaluation cheaply.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
"url": "https://casrai.org/dictionary/term/synthetic-benchmark",
"alternateName": [
"model-generated benchmark",
"synthetic eval"
],
"license": "https://creativecommons.org/licenses/by/4.0/"
}







