Examples
Worked examples
- Is an instance
A reported GPT-4 score on MMLU that may be inflated because MMLU items appeared in the model's training data
- Is an instance
A consumer chatbot that surfaces another user's pasted manuscript text in a response
Counter-examples
Looks similar, but isn't
- Not an instance
A model that memorises rare facts from public training data is not necessarily showing leakage. Leakage concerns data that should have been excluded from training, such as held-out test sets or proprietary content.
Editorial commentary
The term covers two distinct concerns: (1) benchmark contamination, where a model trained on web-scraped text has already seen its evaluation set; and (2) confidential-input leakage, where users’ prompts are fed back into training and surface in other users’ outputs. Consider both when reporting AI use, especially when the AI system was given sensitive data.
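As a rough illustration of the first concern, a benchmark item can be screened against candidate training text by measuring word n-gram overlap. The sketch below is a minimal example and not a method prescribed by this entry; the function names `ngrams` and `overlap_ratio`, the whitespace tokenisation, and the default `n=8` are all assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenisation)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item, training_text, n=8):
    """Fraction of the item's n-grams that also occur in training_text.

    Returns 0.0 when the item is shorter than n words.
    The default n=8 is an illustrative choice, not a standard threshold.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    shared = item_grams & ngrams(training_text, n)
    return len(shared) / len(item_grams)


# Hypothetical data: one benchmark question and two candidate training snippets.
item = "which planet is known as the red planet"
scraped = "trivia night which planet is known as the red planet answer mars"
unrelated = "the committee approved the budget for the next fiscal year"

contaminated_score = overlap_ratio(item, scraped, n=4)
clean_score = overlap_ratio(item, unrelated, n=4)
```

A high ratio only suggests that the item may have been seen in training. Paraphrased or translated copies would evade a check like this, so a low ratio does not rule contamination out.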
References
- Sainz et al. 2023 ‘NLP Evaluation in Trouble’ EMNLP
- OpenAI Enterprise Privacy documentation (2024)
Also known as
Training-set contamination · Benchmark contamination
Machine-readable encodings
Use in your systems
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Data leakage (training)"
      vocab-term-identifier="https://casrai.org/dictionary/term/data-leakage-training" />

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data leakage (training)",
  "identifier": "https://casrai.org/dictionary/term/data-leakage-training",
  "description": "The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
  "url": "https://casrai.org/dictionary/term/data-leakage-training",
  "sameAs": [
    "Training-set contamination",
    "Benchmark contamination"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
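One possible way to consume the JSON-LD record in a downstream system is to parse it and collect the preferred name together with its alternative labels, for example to match free-text disclosures against the term. This is a minimal sketch using only the Python standard library; the helper `known_labels` is hypothetical, and the embedded record is a trimmed copy of the one above.

```python
import json

# Trimmed copy of the JSON-LD record (description, url, and license omitted for brevity).
record_text = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data leakage (training)",
  "identifier": "https://casrai.org/dictionary/term/data-leakage-training",
  "sameAs": ["Training-set contamination", "Benchmark contamination"]
}
"""


def known_labels(record):
    """Return the preferred name followed by any alternative labels (hypothetical helper)."""
    aliases = record.get("sameAs", [])
    if isinstance(aliases, str):  # JSON-LD permits a single value in place of a list
        aliases = [aliases]
    return [record["name"], *aliases]


term = json.loads(record_text)
labels = known_labels(term)
```

Matching on the `identifier` URL rather than on the labels is likely to be more robust, since labels can change between dictionary versions.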







