The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.

By CASRAI Editorial Board

· Last updated 21 May 2026

Examples

Worked examples

Is an instance
A reported GPT-4 score on MMLU that may be inflated because MMLU items appeared in training
Is an instance
A consumer chatbot that surfaces another user's pasted manuscript text in a response

Counter-examples

Looks similar, but isn't

Not an instance
A model that memorises rare facts from public training data is not necessarily showing leakage; leakage specifically concerns data that should have been excluded from training

Editorial commentary

The term covers two distinct concerns: (1) benchmark contamination, where a model trained on internet-scale data has already seen its evaluation set; and (2) confidential-input leakage, where users’ prompts are fed back into training and later surface in other users’ outputs. Both should be considered when reporting AI use, especially when sensitive data was supplied to the AI system.
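As an illustration of the first concern, the sketch below estimates how much of a benchmark item appears verbatim in a sample of training text by comparing word n-grams. It is a minimal heuristic under stated assumptions, not the method of any cited reference: the function names `ngrams` and `overlap_ratio`, the default window of 8 words, and the example strings are all hypothetical.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams found in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in any corpus document.

    Assumption: exact n-gram overlap is used as a rough proxy for
    contamination; paraphrased or translated copies are not detected.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Items shorter than n words cannot be scored with this heuristic.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical benchmark item and two hypothetical corpus samples.
item = "which planet in the solar system has the largest number of known moons"
leaked = ["Quiz dump: " + item + " Answer: Saturn"]
clean = ["an unrelated paragraph about cooking pasta at home tonight"]

print(overlap_ratio(item, leaked))
print(overlap_ratio(item, clean))
```

A high ratio suggests the item may have been seen during training, but the cut-off is a judgement call, and a low ratio does not rule out contamination. The check addresses benchmark contamination only; confidential-input leakage depends on how a provider handles user prompts and cannot be detected this way.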

References

Sainz et al. (2023), ‘NLP Evaluation in Trouble’, EMNLP
OpenAI (2024), Enterprise Privacy documentation

Also known as

Training-set contamination · Benchmark contamination

Machine-readable encodings

Use in your systems

JATS XML <role> element

xml

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Data leakage (training)"
      vocab-term-identifier="https://casrai.org/dictionary/term/data-leakage-training" />
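
For systems that generate JATS programmatically, the element above could be produced with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the helper name `build_role_element` is hypothetical, and the attribute values are copied from the snippet above rather than taken from any additional specification.

```python
import xml.etree.ElementTree as ET


def build_role_element(term: str, term_uri: str) -> str:
    """Serialise a <role> element carrying the term as vocabulary attributes."""
    role = ET.Element(
        "role",
        {
            # Attribute values mirror the snippet above.
            "vocab": "credit",
            "vocab-identifier": "https://casrai.org/dictionary/",
            "vocab-term": term,
            "vocab-term-identifier": term_uri,
        },
    )
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(role, encoding="unicode")


xml_text = build_role_element(
    "Data leakage (training)",
    "https://casrai.org/dictionary/term/data-leakage-training",
)
print(xml_text)
```

Building the element through an XML library rather than string concatenation means special characters in a term name are escaped automatically.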

Schema.org DefinedTerm (JSON-LD)

json

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/data-leakage-training",
  "name": "Data leakage (training)",
  "identifier": "https://casrai.org/dictionary/term/data-leakage-training",
  "description": "The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/genai-disclosure#set",
  "url": "https://casrai.org/dictionary/term/data-leakage-training",
  "alternateName": [
    "Training-set contamination",
    "Benchmark contamination"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T01:55:52",
  "inLanguage": "en"
}
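
A consuming system might read the record above with Python's standard `json` module, as in the following minimal sketch. The helper name `read_defined_term` and the returned key names are hypothetical, and `RECORD` is a trimmed copy of the JSON-LD above with several fields omitted for brevity.

```python
import json

# Trimmed copy of the record above; description and other fields are omitted.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/data-leakage-training",
  "name": "Data leakage (training)",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/genai-disclosure#set"
}
"""


def read_defined_term(raw: str) -> dict:
    """Parse a JSON-LD string and return the term's identifier, name and term set."""
    doc = json.loads(raw)
    if doc.get("@type") != "DefinedTerm":
        raise ValueError("not a schema.org DefinedTerm record")
    return {
        "id": doc["@id"],
        "name": doc["name"],
        # inDefinedTermSet is treated as optional in this sketch.
        "term_set": doc.get("inDefinedTermSet", ""),
    }


term = read_defined_term(RECORD)
print(term["name"], "->", term["id"])
```

This treats the record as plain JSON and does not resolve the `@context`; a full JSON-LD processor would be needed for expansion or compaction.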
