Examples
Worked examples
- Is an instance
The C4 corpus paper documenting Common Crawl filtering for T5 training
- Is an instance
LAION-5B datasheet documenting image-text pair sourcing
Counter-examples
Looks similar, but isn't
- Not an instance
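A model vendor's statement that 'we trained on a diverse mix of internet text' is not provenance in the operational sense: it names no specific sources, collection methods, licenses, or processing steps that a reader could assess.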
Editorial commentary
For closed foundation models, full provenance is rarely disclosed by vendors; for open models, provenance is often documented in datasheets or papers. Scholarly use of any AI model should at minimum cite what is publicly known about the training data and flag known gaps.
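One way to make "cite what is publicly known and flag known gaps" concrete is to keep a small structured record per model and derive the gaps from it. The following is a minimal sketch, not a prescribed schema: the `ProvenanceRecord` class, its field names (taken from the facets listed in this term's definition), and the `example-model` values are all hypothetical illustrations.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record; None means 'not publicly documented'."""
    model_name: str
    sources: Optional[str] = None
    collection_methods: Optional[str] = None
    licensing: Optional[str] = None
    consent_basis: Optional[str] = None
    time_range: Optional[str] = None
    processing_steps: Optional[str] = None


def known_gaps(record: ProvenanceRecord) -> List[str]:
    """Return the provenance facets that have no documented value."""
    return [
        f.name
        for f in fields(record)
        if f.name != "model_name" and getattr(record, f.name) is None
    ]


def disclosure_note(record: ProvenanceRecord) -> str:
    """Build a one-line note stating what is and is not documented."""
    gaps = known_gaps(record)
    if not gaps:
        return f"{record.model_name}: all provenance facets documented."
    return f"{record.model_name}: undocumented facets: {', '.join(gaps)}."


# Illustrative values only; they do not describe any real model.
record = ProvenanceRecord(
    model_name="example-model",
    sources="web crawl (per the model's datasheet)",
    processing_steps="filtering described in the accompanying paper",
)
print(disclosure_note(record))
```

Keeping undocumented facets as explicit `None` values, rather than omitting them, means the gaps are reported instead of silently dropped.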
References
- Gebru et al. (2021), 'Datasheets for Datasets', Communications of the ACM
- EU AI Act (2024), training-data transparency requirements
Also known as
Training data lineage · Data provenance (ML)
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Training data provenance"
vocab-term-identifier="https://casrai.org/dictionary/term/training-data-provenance" />

{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Training data provenance",
"identifier": "https://casrai.org/dictionary/term/training-data-provenance",
"description": "Documentation of the sources, collection methods, licensing, consent basis, time range, and processing steps applied to the data used to train an AI model, sufficient to assess fitness-for-purpose, legal compliance, and potential bias.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/generative-ai-use-and-disclosure/",
"url": "https://casrai.org/dictionary/term/training-data-provenance",
"alternateName": [
"Training data lineage",
"Data provenance (ML)"
],
"license": "https://creativecommons.org/licenses/by/4.0/"
}
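As a rough illustration of consuming the JSON-LD record above, the sketch below parses an abbreviated copy of it with Python's standard `json` module and pulls out the label and persistent identifier, for example to tag a disclosure statement. The `term_reference` helper and the `TERM_JSONLD` constant are hypothetical names introduced here, and the description field is omitted for brevity.

```python
import json

# Abbreviated copy of the DefinedTerm record (description omitted).
TERM_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Training data provenance",
  "identifier": "https://casrai.org/dictionary/term/training-data-provenance",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""


def term_reference(jsonld_text: str) -> dict:
    """Extract the label and identifier from a schema.org DefinedTerm record."""
    record = json.loads(jsonld_text)
    if record.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    return {"label": record["name"], "id": record["identifier"]}


ref = term_reference(TERM_JSONLD)
print(f'{ref["label"]} <{ref["id"]}>')
```

Referring to the term by its identifier rather than its label keeps downstream records stable if the label or its synonyms change.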







