v2026.1714 entries · CC-BY 4.0
CASRAI
Dictionary term · Track A · Stable · v2026.2

Training data provenance

Documentation of the sources, collection methods, licensing, consent basis, time range, and processing steps applied to the data used to train an AI model, sufficient to assess fitness-for-purpose, legal compliance, and potential bias.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    The C4 corpus paper documenting Common Crawl filtering for T5 training

  • Is an instance

    LAION-5B datasheet documenting image-text pair sourcing

Counter-examples

Looks similar, but isn't

  • Not an instance

    A model vendor's statement that 'we trained on a diverse mix of internet text' is not provenance in the operational sense

Editorial commentary

For closed foundation models, full provenance is rarely disclosed by vendors; for open models, provenance is often documented in datasheets or papers. Scholarly use of any AI model should at minimum cite what is publicly known about the training data and flag known gaps.
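One way to make the "flag known gaps" practice concrete is to record the six elements named in the definition as fields and report whichever are undocumented. The sketch below is illustrative only: the `ProvenanceRecord` class, its field names, and the `known_gaps` function are hypothetical names chosen for this example, not part of any CASRAI schema, and the sample values are placeholders rather than facts about a real model.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record mirroring the six elements in the definition."""
    model_name: str
    sources: Optional[str] = None
    collection_methods: Optional[str] = None
    licensing: Optional[str] = None
    consent_basis: Optional[str] = None
    time_range: Optional[str] = None
    processing_steps: Optional[str] = None


def known_gaps(record: ProvenanceRecord) -> List[str]:
    """Return the provenance elements that are undocumented (None or blank)."""
    gaps = []
    for f in fields(record):
        if f.name == "model_name":
            continue  # an identifier, not a provenance element
        value = getattr(record, f.name)
        if value is None or not value.strip():
            gaps.append(f.name)
    return gaps


# Placeholder values for an imaginary model; not a description of any real system.
record = ProvenanceRecord(
    model_name="example-model",
    sources="filtered web crawl, as described in the accompanying paper",
    processing_steps="deduplication and language filtering",
)
print(known_gaps(record))
```

A citation note could then list the returned field names as the known gaps alongside whatever is documented. Free-text fields are used here for simplicity; a real workflow might prefer structured values such as licence identifiers or date ranges.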

References

  • Gebru et al. 2021 ‘Datasheets for Datasets’ CACM
  • EU AI Act training-data transparency requirements (2024)

Also known as

Training data lineage · Data provenance (ML)

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Training data provenance"
      vocab-term-identifier="https://casrai.org/dictionary/term/training-data-provenance" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Training data provenance",
  "identifier": "https://casrai.org/dictionary/term/training-data-provenance",
  "description": "Documentation of the sources, collection methods, licensing, consent basis, time range, and processing steps applied to the data used to train an AI model, sufficient to assess fitness-for-purpose, legal compliance, and potential bias.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
  "url": "https://casrai.org/dictionary/term/training-data-provenance",
  "alternateName": [
    "Training data lineage",
    "Data provenance (ML)"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
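
For systems that generate this encoding rather than paste it, a minimal Python sketch using only the standard `json` module might look like the following. The `defined_term_jsonld` helper and its parameter names are hypothetical, introduced here for illustration; the URLs and strings are copied from this entry. The sketch emits aliases under `alternateName`, the Schema.org property intended for aliases, which is an editorial choice of this example.

```python
import json


def defined_term_jsonld(name, term_url, description, term_set_url, aliases):
    """Build a Schema.org DefinedTerm as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "DefinedTerm",
        "name": name,
        "identifier": term_url,
        "description": description,
        "inDefinedTermSet": term_set_url,
        "url": term_url,
        "alternateName": list(aliases),
        "license": "https://creativecommons.org/licenses/by/4.0/",
    }


term = defined_term_jsonld(
    name="Training data provenance",
    term_url="https://casrai.org/dictionary/term/training-data-provenance",
    description=(
        "Documentation of the sources, collection methods, licensing, "
        "consent basis, time range, and processing steps applied to the "
        "data used to train an AI model, sufficient to assess "
        "fitness-for-purpose, legal compliance, and potential bias."
    ),
    term_set_url=(
        "https://casrai.org/dictionary/domain/"
        "generative-ai-use-and-disclosure/"
    ),
    aliases=["Training data lineage", "Data provenance (ML)"],
)

# ensure_ascii=False keeps any non-ASCII characters readable in the output.
print(json.dumps(term, indent=2, ensure_ascii=False))
```

Building the record as a dict and serializing once keeps the term URL in a single argument, so `identifier` and `url` cannot drift apart.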