v2026.1714 entries · CC-BY 4.0
CASRAI
Dictionary term · Track A · Stable · v2026.2

Data leakage (training)

The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A reported GPT-4 score on MMLU that may be inflated because MMLU items appeared in the model's training data

  • Is an instance

    A consumer chatbot that surfaces another user's pasted manuscript text in a response

Counter-examples

Looks similar, but isn't

  • Not an instance

    A model that memorises rare facts from public training data is not necessarily showing leakage — leakage is specifically about data that should have been excluded

Editorial commentary

Two distinct concerns fall under this term: (1) benchmark contamination, where a model trained on web-scraped data has already seen its evaluation set; (2) confidential-input leakage, where users’ prompts are fed back into training and later surface in other users’ outputs. Both should be considered when reporting AI use, especially when the AI tool was given sensitive data.
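
As an illustration of concern (1), benchmark contamination is often screened for with a word n-gram overlap check between benchmark items and a sample of the training corpus. The sketch below is a minimal version of that heuristic and is not part of the dictionary entry: the function names, the n-gram length of 8, and the 0.5 threshold are all illustrative assumptions. Exact matching of this kind will miss paraphrased or translated copies, so a low score does not prove an item is clean.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_ngram_index(corpus_docs: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram seen in the (sampled) training corpus."""
    index: Set[NGram] = set()
    for doc in corpus_docs:
        index |= word_ngrams(doc, n)
    return index


def overlap_ratio(item: str, index: Set[NGram], n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    grams = word_ngrams(item, n)
    if not grams:
        return 0.0  # too short to judge with this n
    return len(grams & index) / len(grams)


def flag_contaminated(items: Iterable[str], index: Set[NGram],
                      n: int = 8, threshold: float = 0.5) -> List[str]:
    """Return items whose overlap ratio meets the (assumed) threshold."""
    return [it for it in items if overlap_ratio(it, index, n) >= threshold]
```

In use, one would build the index once from the available corpus sample with `build_ngram_index`, then pass each benchmark item through `flag_contaminated` and review the flagged items by hand before reporting a score.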

References

  • Sainz et al. 2023 ‘NLP Evaluation in Trouble’ EMNLP
  • OpenAI Enterprise Privacy documentation (2024)

Also known as

Training-set contamination · Benchmark contamination

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Data leakage (training)"
      vocab-term-identifier="https://casrai.org/dictionary/term/data-leakage-training" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data leakage (training)",
  "identifier": "https://casrai.org/dictionary/term/data-leakage-training",
  "description": "The contamination of an AI model's training corpus with data that should have remained held-out — including evaluation benchmarks, test sets, or proprietary content — such that the model's apparent performance overstates its true generalisation ability or it can reproduce content it should not have seen.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
  "url": "https://casrai.org/dictionary/term/data-leakage-training",
  "alternateName": [
    "Training-set contamination",
    "Benchmark contamination"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}