
What is overfitting?

Overfitting is when a machine-learning model fits the training data too closely, capturing noise as well as signal, so that it performs well on training examples but generalises poorly to new, unseen data.

Why models overfit

A model overfits when it is flexible enough to fit not only the genuine pattern in the training data but also its random noise. This usually happens when the model has many parameters relative to the amount of data, or when it is trained for too long. The tell-tale sign is a large gap between performance on the training set and on held-out data: training error keeps falling while test error stops improving or rises. The aim of machine learning is to generalise to new data, so overfitting directly undermines a model's purpose.
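The gap between training and held-out error can be seen in a small experiment. The sketch below is illustrative only: it assumes synthetic data (noisy samples of a sine curve) and uses a k-nearest-neighbours regressor written from scratch, neither of which comes from the text above. The helper names `make_data`, `knn_predict` and `mse` are hypothetical.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Noisy samples of sin(x) on [0, 2*pi]; a synthetic stand-in for real data."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, noise)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, data, k):
    """Mean squared error of the k-NN model (fitted on train) over data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(40, rng)
test = make_data(200, rng)

# k=1 is the most flexible setting, k=40 the least (it predicts the overall mean).
results = {}
for k in (1, 5, 40):
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k=1 each training point is its own nearest neighbour, so the training error is zero by construction while the test error is not: the model has memorised the noise. With k=40 the model ignores x entirely and does poorly on both sets, which is the underfitting end of the range.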

The bias–variance trade-off

Overfitting and underfitting are two sides of the bias–variance trade-off. A model that is too simple has high bias: it underfits, missing real structure. A model that is too complex has high variance: it overfits, reacting strongly to the particular training sample.

Good generalisation comes from balancing the two. The skill in modelling is choosing a complexity that is rich enough to capture the real pattern but not so rich that it chases noise.
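For squared-error loss the trade-off has a standard decomposition. As a sketch, assume the data follow y = f(x) + ε with zero-mean noise of variance σ², and let f̂ be the model fitted to a randomly drawn training sample; these symbols are introduced here for illustration. The expectation is taken over training samples and noise.

```latex
\[
\mathbb{E}\!\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
\]
```

Underfitting corresponds to a large bias term and overfitting to a large variance term; the noise term cannot be reduced by any choice of model.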

How overfitting is prevented

Several techniques detect or reduce overfitting. Cross-validation estimates generalisation more honestly than training error does, and helps in choosing model complexity. Regularisation adds a penalty for complexity, such as large coefficient values, to the training objective. Gathering more data, and more representative data, reduces the chance of fitting noise. Simplifying the model (fewer features or parameters) and, for iterative methods, early stopping also help. Ensemble methods such as random forests reduce variance by averaging many models.
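As one illustration of using cross-validation to choose model complexity, the sketch below runs 5-fold cross-validation over several settings of a k-nearest-neighbours regressor. It is a minimal example under stated assumptions: the data are synthetic, and `knn_predict`, `k_fold_mse` and the candidate list are hypothetical names and values chosen for the example.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def k_fold_mse(data, k_neighbours, n_folds=5):
    """Mean validation MSE over n_folds; each point is held out exactly once."""
    fold_errors = []
    for i in range(n_folds):
        val = data[i::n_folds]                                     # held-out fold
        fit = [p for j, p in enumerate(data) if j % n_folds != i]  # remaining folds
        err = sum((y - knn_predict(fit, x, k_neighbours)) ** 2 for x, y in val) / len(val)
        fold_errors.append(err)
    return sum(fold_errors) / n_folds

# Synthetic data: noisy samples of sin(x), generated in random order.
rng = random.Random(1)
data = [(x, math.sin(x) + rng.gauss(0, 0.3))
        for x in (rng.uniform(0, 2 * math.pi) for _ in range(60))]

# Small k = flexible model, large k = rigid model (k=48 averages every fitting point).
candidates = [1, 3, 5, 9, 15, 48]
cv_scores = {k: k_fold_mse(data, k) for k in candidates}
best_k = min(cv_scores, key=cv_scores.get)

for k in candidates:
    print(f"k={k:2d}  cross-validated MSE={cv_scores[k]:.3f}")
print("selected k:", best_k)
```

Because every score is computed on points the model did not see while fitting, the selected setting is chosen for how well it generalises rather than how closely it fits. The cross-validated score of the winner is still slightly optimistic, so a final untouched test set remains useful.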

Overfitting in research

Overfitting is a central threat to the validity of data-driven research. A model or analysis tuned to fit one dataset may not replicate on another, which contributes to reproducibility problems. The standard safeguard is to evaluate on data not used in training or model selection, and to avoid "data leakage", where information from the test set influences training. Separating exploratory model-building from confirmatory evaluation guards against reporting noise as a finding.
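A common source of leakage is preprocessing that is fitted on the full dataset before the split. The sketch below shows one way to avoid it for a simple standardisation step: split first, then compute the scaling statistics from the training rows only. The feature values are synthetic and the names (`values`, `scale`) are hypothetical.

```python
import random
import statistics

rng = random.Random(2)
values = [rng.gauss(50, 10) for _ in range(100)]  # hypothetical single feature

# Split first, so nothing computed below can see the test rows.
train, test = values[:80], values[80:]

# Fit the preprocessing step on the training rows only.
mu = statistics.fmean(train)
sigma = statistics.stdev(train)

def scale(v):
    """Standardise a value using training-set statistics."""
    return (v - mu) / sigma

train_scaled = [scale(v) for v in train]
test_scaled = [scale(v) for v in test]  # reuses mu and sigma; never refitted on test

print(f"train mean after scaling: {statistics.fmean(train_scaled):.3f}")
print(f"test mean after scaling:  {statistics.fmean(test_scaled):.3f}")
```

The test rows are transformed with the training statistics, so their scaled mean will generally not be exactly zero. That is expected: it reflects how the model would meet genuinely new data.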

Key facts

At a glance

Definition: fitting training noise, not just the pattern
Symptom: good on training data, poor on new data
Common cause: model too complex for the data
Opposite: underfitting (model too simple)
Framed by: the bias–variance trade-off
Remedies: cross-validation, regularisation, more data, early stopping

Common questions

FAQ

How do you detect overfitting?

Compare performance on the training data with performance on a held-out test set. A small gap is normal, but a model that scores much better on training data than on unseen data is probably overfitting: it has learned noise specific to the training set rather than the general pattern.

What is the difference between overfitting and underfitting?

Overfitting means a model is too complex and fits noise, doing well on training data but poorly on new data. Underfitting means a model is too simple to capture the pattern, doing poorly on both. Both reflect the bias–variance trade-off.

How can overfitting be reduced?

Common methods include cross-validation, regularisation, gathering more representative data, simplifying the model, early stopping during training, and using ensemble methods that average several models to reduce variance.
