Overfitting occurs when a machine learning model learns the noise and quirks of its training data so closely that it performs well on that data but poorly on new, unseen data. Its opposite, underfitting, occurs when a model is too simple to capture the underlying pattern and performs poorly even on the training data. Balancing these two failure modes is one of the central challenges of building reliable, reproducible models.
The bias-variance trade-off
Underfitting and overfitting are two sides of the bias-variance trade-off. Bias is error from overly simplistic assumptions; a high-bias model misses real structure and underfits. Variance is error from excessive sensitivity to the training sample; a high-variance model chases noise and overfits. As you make a model more flexible, bias tends to fall while variance tends to rise. The aim is to find the point where total expected error, which is made up of squared bias, variance and the irreducible noise in the data, is lowest. A model that generalises well sits between the extremes.
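For squared-error loss the trade-off can be written down explicitly. The following is a sketch under stated assumptions: the data follow y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over training samples and noise. The symbols f, f̂, ε and σ are notation introduced here for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term does not depend on the model, so choosing model complexity comes down to trading squared bias against variance.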
| Aspect | Underfitting | Good fit | Overfitting |
|---|---|---|---|
| Model complexity | Too low | Appropriate | Too high |
| Bias | High | Balanced | Low |
| Variance | Low | Balanced | High |
| Training accuracy | Poor | Good | Excellent |
| Test accuracy | Poor | Good | Poor |
The tell-tale sign of overfitting is a large gap between strong training performance and weak test performance. Underfitting shows up as poor performance on both.
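The gap described above can be reproduced on synthetic data. This is a minimal sketch, not a benchmark: it assumes a noisy sine curve as the ground truth and uses k-nearest-neighbour regression as a stand-in for "a model whose flexibility can be tuned". The function names, the noise level and the values of k are all illustrative choices.

```python
import math
import random


def make_data(n, rng, noise_sd=0.3):
    """Sample n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(60, rng)
test_x, test_y = make_data(60, rng)

# k = 1 is the most flexible setting; k = 60 is the least flexible,
# because with 60 training points it always predicts their overall mean.
results = {}
for k in (1, 8, 60):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k = 1 the training error is exactly zero, because every training point is its own nearest neighbour, while the test error is not: that is memorisation rather than learning. With k = 60 the model ignores x entirely, which is the underfitting extreme.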
Train, validation and test splits
Diagnosing these problems requires holding data back. The convention is a three-way split: the training set fits the model, the validation set tunes choices such as model complexity and stopping point, and the test set is touched only once, at the end, to estimate real-world performance. Evaluating on data the model trained on always flatters it and hides overfitting. Keeping the test set genuinely untouched is fundamental to honest evaluation, a point we stress across our AI and ML research outputs coverage.
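A three-way split can be sketched in a few lines. The function name `three_way_split` and the 60/20/20 proportions are assumptions made for illustration; a plain random shuffle also assumes the rows are independent, which does not hold for time series or grouped data.

```python
import random


def three_way_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, validation, test) lists."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must leave room for training data")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be reproduced
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```

Fixing and reporting the seed means someone else can recreate exactly the same split, which matters as much for honest evaluation as the split itself.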
Regularisation: penalising complexity
Regularisation discourages a model from becoming too complex by adding a penalty to the training objective for large or numerous parameters. L1 (lasso) regularisation penalises the sum of the absolute weights and can shrink some of them to exactly zero, effectively performing feature selection. L2 (ridge) regularisation penalises the sum of the squared weights and shrinks them smoothly towards zero without eliminating them. In neural networks, dropout, which randomly disables units during training, and early stopping, which halts training once validation performance stops improving, serve the same goal without an explicit penalty. Each nudges the model towards simpler, more generalisable solutions.
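The difference between the two penalties is easiest to see with a single weight. The sketch below assumes a one-feature model y ≈ w·x with no intercept, where both penalised problems have a closed-form solution; the data values and the penalty strengths `lam` are made up for illustration.

```python
def ols_weight(xs, ys):
    """Least-squares weight for y ~ w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ridge_weight(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2 (L2 penalty)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def lasso_weight(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * |w| (L1 penalty) by soft-thresholding."""
    rho = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(rho) - lam / 2.0, 0.0)  # clipped at zero: this is where weights vanish
    sign = 1.0 if rho >= 0 else -1.0
    return sign * shrunk / sum(x * x for x in xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 0.9, 1.6, 2.1]  # made-up values, roughly y = 0.5 * x

for lam in (0.0, 10.0, 100.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

At `lam = 0` both functions return the unpenalised least-squares weight. As `lam` grows, the ridge weight keeps shrinking but never reaches zero, whereas the lasso weight becomes exactly zero once the penalty is large enough.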
Cross-validation: a more robust check
A single train-validation split can be lucky or unlucky. Cross-validation guards against this by rotating the validation role across the data. In k-fold cross-validation, the data is divided into k parts; the model trains on k-1 parts and validates on the remaining one, repeating until every part has served as validation once. Averaging the results gives a more stable estimate of how the model will generalise and reduces the chance of being fooled by a single fortunate split. To learn how these ideas fit into the wider discipline, see "What is machine learning".
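The rotation described above can be sketched as an index generator. The names `k_fold_indices` and `cross_val_mse` are hypothetical, and the "model" is deliberately trivial (it predicts the mean of the training targets) so that the fold mechanics stay in view.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs so each index is validated exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k folds whose sizes differ by at most one
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def cross_val_mse(ys, k):
    """Average validation MSE of a mean-only model across k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)
        fold_mse = sum((ys[j] - prediction) ** 2 for j in val_idx) / len(val_idx)
        scores.append(fold_mse)
    return sum(scores) / len(scores)


ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(cross_val_mse(ys, k=5))
```

In practice the mean-only model would be replaced by whatever is being evaluated, refitted from scratch inside each fold so that no information from the validation part reaches training.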
Why this threatens reproducible ML
Overfitting is a leading cause of results that fail to replicate. A model tuned too tightly to one dataset, or evaluated with leakage between training and test data, can report impressive accuracy that collapses when applied elsewhere. Honest splits, regularisation, cross-validation and full reporting of hyperparameters are the defences. We discuss these safeguards in depth in reproducibility of machine learning research, and the consistent terminology to describe them lives in the CASRAI dictionary. As with classical statistics, adequate data matters: too few examples make overfitting almost inevitable, echoing the concerns in our guide to sample size and statistical power.
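One of the simplest leakage checks is to look for test rows that also appear in the training set. The sketch below uses a hypothetical helper, `leaked_rows`, on made-up rows. It only catches exact duplicates; other forms of leakage, such as preprocessing fitted on the full dataset, need separate checks.

```python
def leaked_rows(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training rows."""
    seen = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in seen]


# Made-up example rows: two numeric features and a label.
train_rows = [(5.1, 3.5, "a"), (4.9, 3.0, "a"), (6.2, 2.9, "b")]
test_rows = [(6.2, 2.9, "b"), (5.9, 3.0, "b")]

overlap = leaked_rows(train_rows, test_rows)
print(f"{len(overlap)} of {len(test_rows)} test rows also occur in the training set")
```

Reporting the result of a check like this alongside the hyperparameters and random seeds gives readers a concrete basis for trusting the reported accuracy.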
Frequently asked questions
How can I tell if my model is overfitting?
Compare performance on the training data with performance on held-out data. A model that scores very high on the training data but noticeably worse on held-out validation or test data is overfitting. If it performs poorly on both, it is underfitting.
What is the simplest way to reduce overfitting?
Gather more representative data, simplify the model, and apply regularisation or early stopping. Cross-validation helps you confirm that your fix genuinely improves generalisation rather than reflecting a lucky split.
What is the bias-variance trade-off in one sentence?
It is the tension between a model being too simple to capture the pattern (high bias, underfitting) and so flexible that it fits noise (high variance, overfitting), with the best model balancing the two.
Why does overfitting harm reproducibility?
An overfitted model reports performance specific to one dataset that does not carry over to new data, so its results fail to replicate. Honest data splits and transparent reporting, as described in our guidance for authors, are the remedy.







