What is cross-validation?

Cross-validation is a resampling method for estimating how well a model will generalise to unseen data, by repeatedly partitioning the dataset into training and validation subsets and averaging the results.

Why cross-validation is needed

Evaluating a model on the same data it was trained on gives an over-optimistic estimate of performance, because the model has effectively seen the answers. A single train–test split is better but wastes data and can give a noisy estimate that depends heavily on which examples happened to fall in the test set. Cross-validation addresses both problems by reusing the data systematically: every example serves both for training and, at some point, for validation, producing a more stable and honest estimate of how well the model will generalise.

How k-fold cross-validation works

In k-fold cross-validation, the data is divided into k parts, or "folds", of equal or near-equal size. The model is trained on k−1 folds and validated on the remaining fold; this is repeated k times, with the model retrained from scratch each time, so that each fold serves once as the validation set.

The k results are then averaged to give the overall estimate. Common choices are k = 5 or k = 10. A special case, leave-one-out cross-validation, sets k equal to the number of data points, validating on a single example at a time.
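
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the toy mean-predictor model and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Train on k-1 folds, score on the held-out fold, and repeat k times."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every index that is not in fold i belongs to the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return scores


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def mse(model, val_x, val_y):
    """Mean squared error of the constant prediction on the validation fold."""
    return mean((y - model) ** 2 for y in val_y)


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
cv_estimate = mean(fold_scores)  # the k results averaged into one estimate
```

Calling `cross_validate` with `k=len(xs)` gives leave-one-out cross-validation, since each fold then holds exactly one example.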

Uses and pitfalls

Cross-validation serves two main purposes: estimating generalisation performance and selecting between models or hyperparameters. A key pitfall is data leakage: any preprocessing step that learns from the data, such as scaling or feature selection, must be fitted on the training folds only and then applied to the validation fold. Fitting it on the whole dataset before splitting lets validation data influence training and makes the estimate optimistic. For grouped or time-ordered data, the splitting must respect that structure: keep all records from one group in the same fold, and do not split a time series randomly, to avoid leaking future information into the past.
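
One common way to keep preprocessing inside each fold is to bundle it with the model, so that the two are always fitted together. The sketch below assumes scikit-learn is available and uses synthetic data; the choice of `StandardScaler` and `LogisticRegression` is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler lives inside the pipeline, so cross_val_score refits it on the
# training folds of each split and only applies it to the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())

# For time-ordered rows, TimeSeriesSplit keeps every training index
# earlier than every validation index, so no future data leaks backwards.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()
```

For grouped data, scikit-learn's `GroupKFold` plays a similar role by keeping all rows that share a group label in the same fold.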

Cross-validation in research

Cross-validation is a standard tool for honest model evaluation and for guarding against overfitting during model selection. To report results credibly, the final performance should ideally be measured on a separate test set that played no part in cross-validation or tuning, since repeatedly tuning against cross-validation scores can itself overfit to the validation folds. Reporting the cross-validation scheme, the number of folds, and how preprocessing was handled is part of a reproducible methodology.
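
A minimal sketch of that workflow, assuming scikit-learn and synthetic data: a test set is held out first, hyperparameters are tuned by cross-validation on the remaining data, and the test set is scored once at the end. The model, the grid of `C` values and the 80/20 split are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Set the test set aside before any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the development data selects C.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_dev, y_dev)

cv_score = search.best_score_              # used for selection, so may be optimistic
test_score = search.score(X_test, y_test)  # measured once, on untouched data
print(search.best_params_, cv_score, test_score)
```

Reporting `test_score` alongside the fold count and the pipeline definition covers the reproducibility points listed above.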

Key facts

At a glance

Definition: resampling to estimate generalisation
Most common form: k-fold cross-validation
Typical k: 5 or 10
Each fold used once as the validation set
Special case: leave-one-out (k = number of data points)
Key pitfall: data leakage across folds

Common questions

FAQ

How does k-fold cross-validation work?

The data is split into k folds of equal or near-equal size. The model is trained on k−1 folds and validated on the remaining one, repeating until each fold has served as the validation set once. The k scores are averaged to estimate generalisation performance.

Why use cross-validation instead of a single train–test split?

A single split can give a noisy estimate that depends on which examples land in the test set, and it uses the data less efficiently. Cross-validation averages over multiple splits, giving a more stable and reliable estimate of performance.

What is data leakage in cross-validation?

Data leakage occurs when information from outside the training folds influences the model, for example when features are scaled or selected using the whole dataset before splitting. It makes results look better than they are; preprocessing should be fitted on the training folds only and then applied to the validation fold.
