v2026.1714 entries · CC-BY 4.0
CASRAI

Data science & AI · Reference

What is cross-validation?

Cross-validation is a resampling method for estimating how well a model will generalise to unseen data, by repeatedly partitioning the dataset into training and validation subsets and averaging the results.

Why cross-validation is needed

Evaluating a model on the same data it was trained on gives an over-optimistic estimate of performance, because the model has effectively seen the answers. A single train–test split is better but wastes data and can give a noisy estimate that depends heavily on which examples happened to fall in the test set. Cross-validation addresses both problems by reusing the data systematically: every example serves both for training and, at some point, for validation, producing a more stable and honest estimate of how well the model will generalise.
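The optimism of training-set evaluation can be seen in a small sketch. Everything here is an illustrative assumption rather than part of any particular library: the toy data, the 40/20 split, and the helper names nearest_neighbor_predict and mse. A 1-nearest-neighbour model memorises its training points, so its training error is zero even though it makes real errors on unseen examples.

```python
import random
from statistics import mean

def nearest_neighbor_predict(train, x):
    """Return the target of the closest training point (1-nearest-neighbour)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(train, points):
    """Mean squared error of the 1-NN model built from `train` on `points`."""
    return mean((y - nearest_neighbor_predict(train, x)) ** 2 for x, y in points)

rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.uniform(-1, 1)
    data.append((x, x * x + rng.gauss(0, 0.2)))  # noisy toy targets (assumed)

train, held_out = data[:40], data[40:]
train_mse = mse(train, train)        # scored on data the model memorised
held_out_mse = mse(train, held_out)  # scored on data the model never saw
print(train_mse, held_out_mse)
```

The held-out figure here comes from a single split, so it still depends on which 20 examples were set aside. That is the noise cross-validation is meant to reduce.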

How k-fold cross-validation works

In k-fold cross-validation, the data is divided into k parts of roughly equal size, called "folds". The model is trained on k−1 folds and validated on the remaining fold; this is repeated k times so that each fold serves once as the validation set.

The k results are then averaged to give the overall estimate. Common choices are k = 5 or k = 10. A special case, leave-one-out cross-validation, sets k equal to the number of data points, validating on a single example at a time.
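The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names k_fold_indices, cross_val_score, fit_mean and neg_mse are hypothetical, and the "model" is a toy that predicts the mean of its training targets.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(fit, score, X, y, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, return per-fold scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return scores

# Toy model (assumed for illustration): predict the mean training target.
def fit_mean(X_train, y_train):
    return mean(y_train)

def neg_mse(model, X_val, y_val):
    return -mean((yi - model) ** 2 for yi in y_val)

X = [[float(i)] for i in range(23)]
y = [2.0 * i for i in range(23)]
fold_scores = cross_val_score(fit_mean, neg_mse, X, y, k=5)
print(len(fold_scores), mean(fold_scores))  # k scores, then their average
```

Passing k equal to len(X) turns the same generator into leave-one-out cross-validation, with one example in each validation fold.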

Uses and pitfalls

Cross-validation serves two main purposes: estimating generalisation performance and selecting between models or hyperparameters. A key pitfall is data leakage: any preprocessing step that learns from the data, such as scaling or feature selection, must be fitted on the training folds only within each iteration and then applied to the held-out fold. Fitting it on the whole dataset before splitting makes the estimate optimistic. For grouped or time-ordered data, the splitting must respect that structure: keep all examples from one group in the same fold, and do not split a time series randomly, so that no information from the validation examples (including future observations) leaks into training.
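Both precautions can be sketched together in plain Python. The helper names fit_scaler, apply_scaler and forward_chaining_splits are hypothetical, the series is toy data, and the expanding-window scheme is only one possible way to respect time order. The scaling statistics are learned from the training portion alone, and every validation block lies strictly after its training block.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn centring and scaling statistics from the training fold only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for constant data
    return mu, sigma

def apply_scaler(values, scaler):
    """Standardise values with statistics learned elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) where validation always follows training in time."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        # The last validation block absorbs any leftover samples.
        val_end = n_samples if i == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

series = [float(t) for t in range(12)]  # toy time-ordered feature (assumed)
for train_idx, val_idx in forward_chaining_splits(len(series), n_splits=3):
    train = [series[i] for i in train_idx]
    val = [series[i] for i in val_idx]
    scaler = fit_scaler(train)              # statistics come from training data only
    train_scaled = apply_scaler(train, scaler)
    val_scaled = apply_scaler(val, scaler)  # reused as-is, never refitted on validation
    print(train_idx[-1], val_idx[0], len(val_scaled))
```

Calling fit_scaler on the full series before the loop would be the leaky variant: the validation values would then influence the mean and standard deviation used in training.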

Cross-validation in research

Cross-validation is a standard tool for honest model evaluation and for guarding against overfitting during model selection. To report results credibly, the final performance should ideally be measured on a separate test set not used in any cross-validation, since repeatedly tuning against cross-validation scores can itself overfit. Reporting the cross-validation scheme, the number of folds, and how preprocessing was handled is part of a reproducible methodology.
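One way to keep the final measurement separate from tuning is sketched below. The assumptions are all illustrative: the toy data, the 20/80 split, the candidate list, and a small 1-D k-nearest-neighbour regressor written for the example (knn_predict, mse and cv_mse are hypothetical helpers). The test set is set aside first, cross-validation on the remaining data picks the hyperparameter, and the test set is scored once at the end.

```python
import random
from statistics import mean

def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)

def mse(train, points, n_neighbors):
    """Mean squared error on `points` of a k-NN model built from `train`."""
    return mean((y - knn_predict(train, x, n_neighbors)) ** 2 for x, y in points)

def cv_mse(data, n_neighbors, k=5):
    """Average validation error over k folds (data is assumed already shuffled)."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(train, folds[i], n_neighbors))
    return mean(scores)

rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(-1, 1)
    data.append((x, x * x + rng.gauss(0, 0.05)))  # noisy toy targets (assumed)
rng.shuffle(data)

test, dev = data[:20], data[20:]  # test set is set aside before any tuning

candidates = [1, 3, 5, 9, 15]     # hypothetical hyperparameter grid
best = min(candidates, key=lambda n: cv_mse(dev, n))  # selection uses dev only
final_test_mse = mse(dev, test, best)                 # evaluated once, after selection
print(best, final_test_mse)
```

Reporting final_test_mse alongside the scheme used (5 folds here, with no fitted preprocessing) gives a reader what they need to reproduce the evaluation.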

Key facts

At a glance

  • Definition: resampling to estimate generalisation
  • Most common form: k-fold cross-validation
  • Typical k: 5 or 10
  • Each fold used once as the validation set
  • Special case: leave-one-out (k = number of data points)
  • Key pitfall: data leakage across folds

Common questions

FAQ

How does k-fold cross-validation work?

The data is split into k folds of roughly equal size. The model is trained on k−1 folds and validated on the remaining one, repeating until each fold has served as the validation set once. The k scores are averaged to estimate generalisation performance.

Why use cross-validation instead of a single train–test split?

A single split can give a noisy estimate that depends on which examples land in the test set, and it uses the data less efficiently. Cross-validation averages over multiple splits, giving a more stable and reliable estimate of performance.

What is data leakage in cross-validation?

Data leakage occurs when information from outside the training folds influences the model, for example when features are scaled or selected using the whole dataset before splitting. It makes results look better than they are; preprocessing should be fitted on the training folds only and then applied to the validation fold.
