The mixture of data sources, by domain, language, modality, and provenance, used to train a model, including the proportions and any filtering or deduplication applied.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

Is an instance
A model card declaring '40% web text, 22% code, 15% books, 13% scientific papers, 10% other'.
Is an instance
A research model trained exclusively on the openly published Common Corpus.

Counter-examples

Looks similar, but isn't

Not an instance
A single figure for the total number of training tokens, with no breakdown by source.
Not an instance
A bare statement such as 'trained on public data', with no proportions.
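
The distinction drawn by the examples and counter-examples above can be sketched as a simple check. This is a minimal illustration, not a CASRAI-defined validation rule: the function name `is_composition_declaration`, the mapping layout (source name to percentage share), and the 0.5-point tolerance are all assumptions made for the example.

```python
from math import isclose


def is_composition_declaration(shares, tolerance=0.5):
    """Return True if `shares` maps named sources to positive percentage
    shares that sum to 100 (within `tolerance` percentage points).

    Hypothetical helper for illustration only.
    """
    # A bare token count or an empty statement is not a breakdown by source.
    if not isinstance(shares, dict) or not shares:
        return False
    # Every named source needs a positive numeric share.
    if any(not isinstance(v, (int, float)) or v <= 0 for v in shares.values()):
        return False
    return isclose(sum(shares.values()), 100.0, abs_tol=tolerance)


# The model-card example: five named sources with proportions.
model_card = {
    "web text": 40,
    "code": 22,
    "books": 15,
    "scientific papers": 13,
    "other": 10,
}
# The single-corpus example: one named source accounts for everything.
single_corpus = {"Common Corpus": 100}

print(is_composition_declaration(model_card))     # True
print(is_composition_declaration(single_corpus))  # True
print(is_composition_declaration({}))             # False: no sources or proportions
```

A check like this covers only the proportions; the definition also asks for any filtering or deduplication applied, which a fuller record would need to describe separately.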

Editorial commentary

Training-data composition is the principal lens for interpreting model behaviour: a model trained predominantly on English-language code will perform differently from one trained predominantly on multilingual literary text. Datasheets, data statements, and model cards each cover some aspects of composition. Frontier-model providers vary widely in how much detail they disclose.

References

Gao et al., 'The Pile' (arXiv 2020); Together AI RedPajama documentation; Common Corpus documentation.

Also known as

data mix · training mixture

Machine-readable encodings

Use in your systems

JATS XML <role> element

xml

<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Training data composition"
      vocab-term-identifier="https://casrai.org/dictionary/term/training-data-composition" />

Schema.org DefinedTerm (JSON-LD)

json

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/training-data-composition",
  "name": "Training data composition",
  "identifier": "https://casrai.org/dictionary/term/training-data-composition",
  "description": "The mixture of data sources, by domain, language, modality, and provenance, used to train a model, including the proportions and any filtering or deduplication applied.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-ml-research-outputs#set",
  "url": "https://casrai.org/dictionary/term/training-data-composition",
  "alternateName": [
    "data mix",
    "training mixture"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "publisher": {
    "@id": "https://casrai.org/#organization"
  },
  "dateModified": "2026-05-21T02:22:51",
  "inLanguage": "en"
}
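
As one possible way to consume the JSON-LD record above, the sketch below parses a trimmed copy with Python's standard `json` module and reduces it to an identifier and a label. The helper name `term_reference` and the shape of the returned dictionary are assumptions for illustration; the record is treated as plain JSON, with no JSON-LD context processing.

```python
import json

# Trimmed copy of the record above; only the fields this sketch reads.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "@id": "https://casrai.org/dictionary/term/training-data-composition",
  "name": "Training data composition",
  "inLanguage": "en"
}
"""


def term_reference(raw: str) -> dict:
    """Hypothetical helper: reduce a DefinedTerm record to id, label, language."""
    record = json.loads(raw)
    # Refuse anything that is not declared as a schema.org DefinedTerm.
    if record.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    return {
        "id": record["@id"],
        "label": record["name"],
        "lang": record.get("inLanguage"),
    }


ref = term_reference(RECORD)
print(ref["label"])  # Training data composition
```

Storing the `@id` rather than the label keeps references stable if the display name is later revised.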
