Examples
Worked examples
- Is an instance: GAN-generated chest X-rays used to augment a limited real training set
- Is an instance: LLM-generated patient narratives used to test a triage classifier
Counter-examples
Looks similar, but isn't
- Not an instance: Bootstrap resamples of an empirical dataset are not synthetic data in this sense (they are resamples of real observations).
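The distinction can be made concrete with a small sketch. It assumes a toy list of measurements (`real`, invented for illustration) and uses a fitted normal distribution as a stand-in for a generative model; neither comes from this entry.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical "real" observations, invented purely for illustration.
real = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9]

# Bootstrap resample: draw with replacement from the observed values.
# Every element is still one of the real observations.
bootstrap = rng.choices(real, k=len(real))

# Synthetic sample: fit a simple model (a normal distribution here, standing
# in for any generative model) and draw new values from that model.
mu = statistics.mean(real)
sigma = statistics.stdev(real)
synthetic = [rng.gauss(mu, sigma) for _ in range(len(real))]

bootstrap_is_all_real = all(x in real for x in bootstrap)
synthetic_has_new_values = any(x not in real for x in synthetic)
```

The bootstrap list can only repeat or omit observed values, while the model-drawn list contains values that were never observed, which is what places it under this term.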
Editorial commentary
Synthetic data must be disclosed as such in any analysis where the distinction matters (e.g., training-set composition, statistical inference, claims of empirical support). The generating model, its parameters, and the validation procedure used to demonstrate fidelity to real data should be reported.
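One way to capture these reporting items is a small structured record kept alongside the dataset. The sketch below is illustrative only: the class name `SyntheticDataDisclosure`, its field names, and the placeholder values are assumptions, not a schema defined by this dictionary.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SyntheticDataDisclosure:
    """Hypothetical disclosure record; all field names are illustrative."""
    generating_model: str          # what produced the data
    parameters: dict               # settings needed to reproduce the generation
    validation_procedure: str      # how fidelity to real data was checked
    used_for: list = field(default_factory=list)  # where the distinction matters
    # Term identifier taken from this dictionary entry.
    term_identifier: str = "https://casrai.org/dictionary/term/synthetic-data"


# Placeholder values only; replace with the details of the actual analysis.
disclosure = SyntheticDataDisclosure(
    generating_model="example GAN (placeholder name)",
    parameters={"random_seed": 0, "n_generated": 500},
    validation_procedure="placeholder: compare summary statistics with held-out real data",
    used_for=["training-set augmentation"],
)

# Serialize so the record can travel with the dataset or a manuscript supplement.
disclosure_json = json.dumps(asdict(disclosure), indent=2)
```

Keeping the record machine-readable makes it straightforward to attach the same disclosure to the dataset, the analysis code, and the published report.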
References
- Jordon et al. 2022 ‘Synthetic Data — what, why and how?’ Royal Society
- OECD Synthetic Data Guidance (2023)
Also known as
Generated data · Simulated data
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Synthetic data"
vocab-term-identifier="https://casrai.org/dictionary/term/synthetic-data" />

{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Synthetic data",
"identifier": "https://casrai.org/dictionary/term/synthetic-data",
"description": "Data generated by a model or algorithm rather than collected from real-world observations or experiments, designed to mimic the statistical structure of real data for purposes such as augmentation, privacy-preservation, or model training.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
"url": "https://casrai.org/dictionary/term/synthetic-data",
"alternateName": [
"Generated data",
"Simulated data"
],
"license": "https://creativecommons.org/licenses/by/4.0/"
}







