What is cross-validation?
Cross-validation is a resampling method for estimating how well a model will generalise to unseen data, by repeatedly partitioning the dataset into training and validation subsets and averaging the results.
Why cross-validation is needed
Evaluating a model on the same data it was trained on gives an over-optimistic estimate of performance, because the model has effectively seen the answers. A single train–test split is better but wastes data and can give a noisy estimate that depends heavily on which examples happened to fall in the test set. Cross-validation addresses both problems by reusing the data systematically: every example serves both for training and, at some point, for validation, producing a more stable and honest estimate of how well the model will generalise.
How k-fold cross-validation works
In k-fold cross-validation, the data is divided into k parts of roughly equal size, called "folds" (exactly equal only when the number of examples is divisible by k). The model is trained on k−1 folds and validated on the remaining fold; this is repeated k times so that each fold serves exactly once as the validation set.
The k results are then averaged to give the overall estimate. Common choices are k = 5 or k = 10. A special case, leave-one-out cross-validation, sets k equal to the number of data points, validating on a single example at a time.
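The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `k_fold_indices` and `cross_validate`, the toy "mean" model, and the synthetic data are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate(fit, score, X, y, k=5, seed=0):
    """Return the per-fold scores and their mean."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return scores, sum(scores) / len(scores)

# Toy usage (hypothetical): the "model" is just the training-set mean of y,
# scored by mean absolute error on the validation fold.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def mean_abs_error(model, X_val, y_val):
    return sum(abs(v - model) for v in y_val) / len(y_val)

X = list(range(23))
y = [2.0 * x + 1.0 for x in X]
fold_scores, mean_score = cross_validate(fit_mean, mean_abs_error, X, y, k=5)
print(len(fold_scores), round(mean_score, 3))
```

Calling `k_fold_indices(n, n)` gives leave-one-out cross-validation, since every validation fold then contains a single example. The shuffle suits independent examples only; it would not be appropriate for time-ordered data.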
Uses and pitfalls
Cross-validation serves two main purposes: estimating generalisation performance and selecting between models or hyperparameters. A key pitfall is data leakage: any preprocessing that learns from the data (such as scaling or feature selection) must be fitted on the training folds only and then applied to the validation fold. If it is fitted on the whole dataset before splitting, the validation examples have already influenced the model and the estimate becomes optimistically biased. For grouped or time-ordered data, the splitting must respect that structure (for example, keeping all records from one group in the same fold, and not splitting a time series randomly) to avoid leaking information between folds or from the future into the past.
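A small sketch may make the two precautions concrete: fitting a scaler on the training fold only, and splitting time-ordered data so that validation always comes after training. The function names (`fit_scaler`, `scale_within_fold`, `forward_chaining_splits`) and the example values are hypothetical, and the expanding-window scheme shown is one common choice among several.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation from training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std

def apply_scaler(values, scaler):
    """Standardise values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

def scale_within_fold(x, train_idx, val_idx):
    """Standardise one feature using statistics from the training fold."""
    scaler = fit_scaler([x[i] for i in train_idx])
    x_train = apply_scaler([x[i] for i in train_idx], scaler)
    x_val = apply_scaler([x[i] for i in val_idx], scaler)  # no refitting here
    return x_train, x_val

def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) where validation always follows training in time."""
    block = n_samples // (n_splits + 1)
    if block < 1:
        raise ValueError("not enough samples for this many splits")
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * block))
        # The last validation block absorbs any leftover samples.
        val_end = (i + 1) * block if i < n_splits else n_samples
        val_idx = list(range(i * block, val_end))
        yield train_idx, val_idx

# Illustrative values: the validation points lie far outside the training range.
x = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train_idx, val_idx = [0, 1, 2, 3], [4, 5]
x_train, x_val = scale_within_fold(x, train_idx, val_idx)

for tr, va in forward_chaining_splits(10, 3):
    print(tr, va)
```

Had the scaler been fitted on all six values, the two large validation points would have shifted the mean and spread used to transform the training data, which is exactly the leakage described above.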
Cross-validation in research
Cross-validation is a standard tool for honest model evaluation and for guarding against overfitting during model selection. To report results credibly, the final performance should ideally be measured on a separate test set not used in any cross-validation, since repeatedly tuning against cross-validation scores can itself overfit. Reporting the cross-validation scheme, the number of folds, and how preprocessing was handled is part of a reproducible methodology.
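The workflow described here (hold out a test set, tune by cross-validation on the rest, then evaluate once) might look like the following sketch. Everything in it is assumed for illustration: the synthetic data, the simple one-dimensional nearest-neighbour regressor, the candidate values, and the 80/20 split.

```python
import random

def k_fold_indices(indices, k):
    """Split a list of indices into k (train, validation) pairs."""
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def knn_predict(x_train, y_train, x_new, n_neighbors):
    """Predict by averaging the targets of the nearest training points (1-D)."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    nearest = order[:n_neighbors]
    return sum(y_train[i] for i in nearest) / len(nearest)

def mse(x, y, train_idx, eval_idx, n_neighbors):
    """Mean squared error on eval_idx for a model built from train_idx."""
    x_tr = [x[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    errors = [(knn_predict(x_tr, y_tr, x[i], n_neighbors) - y[i]) ** 2 for i in eval_idx]
    return sum(errors) / len(errors)

# Synthetic data, purely illustrative: a noisy quadratic.
rng = random.Random(0)
x = [rng.uniform(-3, 3) for _ in range(120)]
y = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]

# 1. Set aside a test set before any tuning happens.
indices = list(range(len(x)))
rng.shuffle(indices)
test_idx, dev_idx = indices[:24], indices[24:]

# 2. Choose the hyperparameter by cross-validation on the development data only.
candidates = [1, 3, 5, 9, 15]
cv_error = {}
for n_neighbors in candidates:
    fold_errors = [mse(x, y, tr, va, n_neighbors) for tr, va in k_fold_indices(dev_idx, 5)]
    cv_error[n_neighbors] = sum(fold_errors) / len(fold_errors)
best = min(cv_error, key=cv_error.get)

# 3. Evaluate once on the untouched test set, using all development data for training.
test_error = mse(x, y, dev_idx, test_idx, best)
print(best, round(cv_error[best], 3), round(test_error, 3))
```

The cross-validation error of the selected candidate is the minimum over several noisy estimates, so it tends to flatter the chosen model slightly. The test error, computed once at the end, is the figure to report.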
Key facts
At a glance
- Definition: resampling to estimate generalisation
- Most common form: k-fold cross-validation
- Typical k: 5 or 10
- Each fold used once as the validation set
- Special case: leave-one-out (k = number of data points)
- Key pitfall: data leakage across folds
Common questions
FAQ
How does k-fold cross-validation work?
The data is split into k folds of roughly equal size. The model is trained on k−1 folds and validated on the remaining one, repeating until each fold has served as the validation set once. The k scores are averaged to estimate generalisation performance.
Why use cross-validation instead of a single train–test split?
A single split can give a noisy estimate that depends on which examples land in the test set, and it uses the data less efficiently. Cross-validation averages over multiple splits, giving a more stable and reliable estimate of performance.
What is data leakage in cross-validation?
Data leakage occurs when information from outside the training folds influences the model, for example when features are scaled or selected using the whole dataset before splitting. It makes results look better than they are; preprocessing should be fitted on the training folds and only applied to the validation fold.