v2026.1714 entries · CC-BY 4.0
Dictionary term · Track A · Stable · v2026.2

Training data provenance

Documentation of the sources, collection methods, licensing, consent basis, time range, and processing steps applied to the data used to train an AI model, sufficient to assess fitness-for-purpose, legal compliance, and potential bias.
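
As a rough illustration, the six elements named in the definition can be held in a small record type. The sketch below is hypothetical: the class name `ProvenanceRecord`, its field names, and the `missing_elements` helper are assumptions made for this example, not part of the CASRAI dictionary or any standard schema, and the values are placeholders.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical container for the elements listed in the definition."""
    sources: List[str] = field(default_factory=list)
    collection_methods: List[str] = field(default_factory=list)
    licensing: Optional[str] = None
    consent_basis: Optional[str] = None
    time_range: Optional[str] = None
    processing_steps: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are empty or undocumented."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


record = ProvenanceRecord(
    sources=["example web crawl"],           # placeholder value
    collection_methods=["automated crawl"],  # placeholder value
    processing_steps=["deduplication"],      # placeholder value
)
print(record.missing_elements())
```

A record with undocumented elements is still useful, provided the gaps are reported explicitly rather than left silent.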

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    The C4 corpus paper documenting Common Crawl filtering for T5 training

  • Is an instance

    LAION-5B datasheet documenting image-text pair sourcing

Counter-examples

Looks similar, but isn't

  • Not an instance

    A model vendor's statement that 'we trained on a diverse mix of internet text' is not provenance in the operational sense: it identifies no specific sources, licensing, or processing steps that could be assessed

Editorial commentary

For closed foundation models, full provenance is rarely disclosed by vendors; for open models, provenance is often documented in datasheets or papers. Scholarly use of any AI model should at minimum cite what is publicly known about the training data and flag known gaps.
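
One way to act on this recommendation is to generate a short disclosure statement that cites what is publicly documented and names the remaining gaps. The sketch below is a minimal illustration under stated assumptions: the function `disclosure_statement`, the element labels, and the example model name and values are all hypothetical.

```python
from typing import Dict

# Element labels follow the definition; the exact strings are an assumption.
ELEMENTS = [
    "sources", "collection methods", "licensing",
    "consent basis", "time range", "processing steps",
]


def disclosure_statement(model: str, known: Dict[str, str]) -> str:
    """Summarise publicly known provenance and list undocumented elements."""
    documented = [f"{name}: {known[name]}" for name in ELEMENTS if known.get(name)]
    gaps = [name for name in ELEMENTS if not known.get(name)]
    return (
        f"Training data for {model}. "
        f"Publicly documented: {'; '.join(documented) or 'none'}. "
        f"Known gaps: {', '.join(gaps) or 'none'}."
    )


statement = disclosure_statement(
    "ExampleModel",  # placeholder model name
    {"sources": "filtered web crawl", "processing steps": "deduplication"},
)
print(statement)
```

Listing the gaps alongside the documented elements keeps the statement honest for closed models, where most elements are likely to be undisclosed.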

References

  • Gebru et al. (2021), ‘Datasheets for Datasets’, Communications of the ACM
  • EU AI Act training-data transparency requirements (2024)

Also known as

Training data lineage · Data provenance (ML)

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Training data provenance"
      vocab-term-identifier="https://casrai.org/dictionary/term/training-data-provenance" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Training data provenance",
  "identifier": "https://casrai.org/dictionary/term/training-data-provenance",
  "description": "Documentation of the sources, collection methods, licensing, consent basis, time range, and processing steps applied to the data used to train an AI model, sufficient to assess fitness-for-purpose, legal compliance, and potential bias.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
  "url": "https://casrai.org/dictionary/term/training-data-provenance",
  "alternateName": [
    "Training data lineage",
    "Data provenance (ML)"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
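
For systems that generate this metadata rather than paste it, a minimal Python sketch follows. It uses only the standard-library `json` module and the term URL shown in the encoding above; the helper name `term_reference` and the choice of which properties to include are assumptions for illustration, not a CASRAI-provided API.

```python
import json

TERM_URL = "https://casrai.org/dictionary/term/training-data-provenance"


def term_reference(name: str, url: str) -> dict:
    """Build a compact Schema.org DefinedTerm reference (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "DefinedTerm",
        "name": name,
        "identifier": url,
        "url": url,
    }


# Serialise the reference so it can be embedded in a JSON-LD script block.
snippet = json.dumps(term_reference("Training data provenance", TERM_URL), indent=2)
print(snippet)
```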

Adopted by research universities worldwide

  • University of Cambridge
  • Columbia University
  • University of Edinburgh
  • Harvard University
  • Massachusetts Institute of Technology
  • University of Oxford
  • Princeton University
  • Stanford School of Medicine
  • University College London
