v2026.1714 entries · CC-BY 4.0
Dictionary term · Track A · Stable · v2026.2

Data leakage (training)

The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.

By CASRAI Editorial Board
· Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A reported GPT-4 score on MMLU that may be inflated because MMLU items appeared in its training data

  • Is an instance

    A consumer chatbot that surfaces another user's pasted manuscript text in a response

Counter-examples

Looks similar, but isn't

  • Not an instance

    A model that memorises rare facts from public training data is not necessarily showing leakage — leakage is specifically about data that should have been excluded

Editorial commentary

The term covers two distinct concerns: (1) benchmark contamination, where a model trained on web-scraped data has already seen its evaluation set; and (2) confidential-input leakage, where users’ prompts are fed back into training and later surface in other users’ outputs. Both should be considered when reporting AI use, especially when the AI tool was given sensitive data.
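As a rough illustration of the first concern, one common heuristic is to look for long word n-grams that a benchmark item shares with text the model may have been trained on. The sketch below is a minimal example under stated assumptions, not the procedure used in the cited references: the function names, the 8-word window, and the sample strings are all hypothetical, and a high overlap only flags an item for closer inspection.

```python
import re


def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, so there is nothing to compare
    corpus_grams = word_ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical strings, for illustration only.
item = "Which of the following best describes the function of the mitochondria in a cell"
seen = ("quiz dump: which of the following best describes the function "
        "of the mitochondria in a cell? answer b")
unseen = "mitochondria are organelles that produce most of the chemical energy a cell needs"

print(overlap_ratio(item, seen))    # every n-gram of the item occurs in `seen`
print(overlap_ratio(item, unseen))  # no n-gram of the item occurs in `unseen`
```

Exact n-gram matching misses paraphrased or translated copies of a benchmark, so a low ratio does not show that an item was held out.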

References

  • Sainz et al. 2023 ‘NLP Evaluation in Trouble’ EMNLP
  • OpenAI Enterprise Privacy documentation (2024)

Also known as

Training-set contamination · Benchmark contamination

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Data leakage (training)"
      vocab-term-identifier="https://casrai.org/dictionary/term/data-leakage-training" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data leakage (training)",
  "identifier": "https://casrai.org/dictionary/term/data-leakage-training",
  "description": "The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
  "url": "https://casrai.org/dictionary/term/data-leakage-training",
  "alternateName": [
    "Training-set contamination",
    "Benchmark contamination"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
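For systems that store the JSON-LD record and need to emit the JATS form, the two encodings above can be bridged with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `role_from_defined_term` and the trimmed record are illustrative assumptions, and the attribute values are copied from the snippets above rather than taken from any official CASRAI tooling.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the JSON-LD record above (only the fields used here).
TERM_JSONLD = """
{
  "@type": "DefinedTerm",
  "name": "Data leakage (training)",
  "identifier": "https://casrai.org/dictionary/term/data-leakage-training"
}
"""

# Vocabulary attributes copied from the JATS snippet above.
VOCAB = "credit"
VOCAB_IDENTIFIER = "https://casrai.org/dictionary/"


def role_from_defined_term(record: dict) -> ET.Element:
    """Build a JATS <role> element from a DefinedTerm-style record (hypothetical helper)."""
    return ET.Element(
        "role",
        {
            "vocab": VOCAB,
            "vocab-identifier": VOCAB_IDENTIFIER,
            "vocab-term": record["name"],
            "vocab-term-identifier": record["identifier"],
        },
    )


record = json.loads(TERM_JSONLD)
role = role_from_defined_term(record)
print(ET.tostring(role, encoding="unicode"))
```

Building the element through an XML library rather than string formatting means any special characters in a term name are escaped automatically.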

Adopted by research universities worldwide

  • University of Cambridge
  • Columbia University
  • University of Edinburgh
  • Harvard University
  • Massachusetts Institute of Technology
  • University of Oxford
  • Princeton University
  • Stanford School of Medicine
  • University College London
