v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

Training data composition

The mixture of data sources, by domain, language, modality, and provenance, used to train a model, including the proportions and any filtering or deduplication applied.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A model card declaring '40% web text, 22% code, 15% books, 13% scientific papers, 10% other'.

  • Is an instance

    A research model trained exclusively on the openly published Common Corpus.

Counter-examples

Looks similar, but isn't

  • Not an instance

    A single number for tokens of training data.

  • Not an instance

    A statement 'trained on public data' with no proportions.
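
One way to make the distinction above checkable is to record a declared mix as a mapping from source to percentage share and test that the shares are present and add up. The following is a minimal sketch, not part of the dictionary entry: the function name `is_composition`, the 0.5-point tolerance, and the choice to model an undeclared mix as an empty mapping are all assumptions made for illustration.

```python
from math import isclose

# Shares (percent of training data) taken from the model-card worked example.
MODEL_CARD_MIX = {
    "web text": 40.0,
    "code": 22.0,
    "books": 15.0,
    "scientific papers": 13.0,
    "other": 10.0,
}


def is_composition(shares: dict[str, float], tolerance: float = 0.5) -> bool:
    """Return True if `shares` names at least one source and the
    percentages are non-negative and sum to roughly 100.

    Hypothetical helper: the tolerance is an assumed allowance for rounding.
    """
    if not shares:
        # No per-source proportions declared, e.g. only "trained on public data".
        return False
    if any(value < 0 for value in shares.values()):
        return False
    return isclose(sum(shares.values()), 100.0, abs_tol=tolerance)


print(is_composition(MODEL_CARD_MIX))            # model-card example
print(is_composition({"Common Corpus": 100.0}))  # single openly published corpus
print(is_composition({}))                        # no proportions declared
```

Under these assumptions the two worked examples pass and an empty declaration fails. A bare token count has no per-source breakdown to put in the mapping at all, which is why it does not qualify.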

Editorial commentary

Training-data composition is the principal lens through which model behaviour is interpreted: a model trained predominantly on English-language code will perform differently from one trained predominantly on multilingual literary text. Datasheets, data statements, and model cards each cover aspects of composition. Frontier-model providers vary widely in disclosure detail.

References

  • Gao et al., 'The Pile' (arXiv 2020); Together AI RedPajama documentation; Common Corpus documentation.

Also known as

data mix · training mixture

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Training data composition"
      vocab-term-identifier="https://casrai.org/dictionary/term/training-data-composition" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Training data composition",
  "identifier": "https://casrai.org/dictionary/term/training-data-composition",
  "description": "The mixture of data sources, by domain, language, modality, and provenance, used to train a model, including the proportions and any filtering or deduplication applied.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/training-data-composition",
  "alternateName": [
    "data mix",
    "training mixture"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
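
To consume the JSON-LD record in a pipeline, it can be parsed with a standard JSON library and checked for the expected type before use. The sketch below is illustrative only: the helper name `term_label_and_id` is hypothetical, and the embedded record is a shortened copy containing just the `name` and `identifier` fields shown above.

```python
import json

# Shortened copy of the DefinedTerm record above (illustrative subset of fields).
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Training data composition",
  "identifier": "https://casrai.org/dictionary/term/training-data-composition"
}
"""


def term_label_and_id(jsonld_text: str) -> tuple[str, str]:
    """Parse a DefinedTerm record and return its (name, identifier) pair.

    Hypothetical helper: raises ValueError if the record is not a DefinedTerm.
    """
    data = json.loads(jsonld_text)
    if data.get("@type") != "DefinedTerm":
        raise ValueError("expected a Schema.org DefinedTerm record")
    return data["name"], data["identifier"]


label, identifier = term_label_and_id(RECORD)
print(label)
print(identifier)
```

Storing the identifier URL rather than the label keeps references stable if the term's wording is later revised.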

Adopted by research universities worldwide

  • University of Cambridge
  • Columbia University
  • University of Edinburgh
  • Harvard University
  • Massachusetts Institute of Technology
  • University of Oxford
  • Princeton University
  • Stanford School of Medicine
  • University College London
