Examples
Worked examples
- Is an instance
A model card declaring '40% web text, 22% code, 15% books, 13% scientific papers, 10% other'.
- Is an instance
A research model trained exclusively on the openly published Common Corpus.
Counter-examples
Looks similar, but isn't
- Not an instance
A single figure for the total number of training tokens, with no breakdown by source.
- Not an instance
A statement 'trained on public data' with no proportions.
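As a rough illustration of the distinction drawn in the examples above, the following Python sketch treats a disclosure as a composition only when it names sources and gives shares that sum to 100%. The function name `is_composition`, the dictionary layout, and the 0.5-point rounding tolerance are assumptions made for this illustration; they are not part of the dictionary definition.

```python
from math import isclose


def is_composition(shares: dict[str, float], total: float = 100.0) -> bool:
    """Hypothetical check: True if named sources carry shares summing to `total`.

    `shares` maps a source label to its percentage of the training data.
    """
    if not shares:
        # No named sources at all, e.g. a bare token count or 'public data'.
        return False
    if any(value < 0 for value in shares.values()):
        return False
    # Assumed tolerance of 0.5 points to allow for rounded percentages.
    return isclose(sum(shares.values()), total, abs_tol=0.5)


# The model-card example: five named sources with proportions.
model_card = {
    "web text": 40,
    "code": 22,
    "books": 15,
    "scientific papers": 13,
    "other": 10,
}

# A model trained exclusively on one openly published corpus.
single_corpus = {"Common Corpus": 100}

# The counter-examples disclose no per-source shares, modelled here as empty.
token_count_only: dict[str, float] = {}
public_data_claim: dict[str, float] = {}
```

Under these assumptions the two worked examples pass the check and the two counter-examples fail it, because neither counter-example names any source or share.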
Editorial commentary
Training-data composition is the principal lens through which model behaviour is interpreted: a model trained predominantly on English-language code will perform differently from one trained predominantly on multilingual literary text. Datasheets, data statements, and model cards each cover aspects of composition. Frontier-model providers vary widely in disclosure detail.
References
- Gao et al., 'The Pile' (arXiv 2020); Together AI RedPajama documentation; Common Corpus documentation.
Also known as
data mix · training mixture
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Training data composition"
vocab-term-identifier="https://casrai.org/dictionary/term/training-data-composition" />
{
"@context": "https://schema.org",
"@type": "DefinedTerm",
"name": "Training data composition",
"identifier": "https://casrai.org/dictionary/term/training-data-composition",
"description": "The mixture of data sources, by domain, language, modality, and provenance, used to train a model, including the proportions and any filtering or deduplication applied.",
"inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
"url": "https://casrai.org/dictionary/term/training-data-composition",
"alternateName": [
"data mix",
"training mixture"
],
"license": "https://creativecommons.org/licenses/by/4.0/"
}







