Tag: overfitting

  • Overfitting and Underfitting in Machine Learning Explained

    Overfitting occurs when a machine learning model learns the noise and quirks of its training data so closely that it performs well on that data but poorly on new, unseen data. Its opposite, underfitting, occurs when a model is too simple to capture the underlying pattern and performs poorly even on the training data. Balancing these two failure modes is one of the central challenges of building reliable, reproducible models.

    The bias-variance trade-off

    Underfitting and overfitting are two sides of the bias-variance trade-off. Bias is error from overly simplistic assumptions; a high-bias model misses real structure and underfits. Variance is error from excessive sensitivity to the particular training sample; a high-variance model chases noise and overfits. As a model becomes more flexible, bias typically falls while variance rises. The aim is to find the level of flexibility at which total expected error (squared bias plus variance, on top of noise that no model can remove) is lowest. A model that generalises well sits between the extremes.
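
    For squared-error loss, the trade-off is usually written as the decomposition below. It is a standard textbook identity rather than something derived in this article, and the notation is assumed here: f is the true function, \hat{f} the fitted model, y = f(x) + \varepsilon with zero-mean noise of variance \sigma^2, and the expectation runs over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

    Only the first two terms depend on the model, which is why the trade-off is framed in terms of bias and variance.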

    Aspect            | Underfitting | Good fit    | Overfitting
    Model complexity  | Too low      | Appropriate | Too high
    Bias              | High         | Balanced    | Low
    Variance          | Low          | Balanced    | High
    Training accuracy | Poor         | Good        | Excellent
    Test accuracy     | Poor         | Good        | Poor

    The tell-tale sign of overfitting is a large gap between strong training performance and weak test performance. Underfitting shows up as poor performance on both.
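
    The pattern described above can be reproduced on toy data. The sketch below is illustrative only: it uses k-nearest-neighbour regression as a stand-in model (not a method taken from this article), synthetic data drawn from y = sin(x) plus noise, and hypothetical helper names. A small k gives a very flexible model and a large k a very rigid one.

```python
import math
import random


def make_data(n, rng, noise=0.3):
    """Sample n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(200, rng)

results = {}
for k in (1, 10, 100):  # k=1: most flexible; k=100: least flexible
    results[k] = (
        mse(train_x, train_y, train_x, train_y, k),  # training error
        mse(train_x, train_y, test_x, test_y, k),    # test error
    )
    print(f"k={k:3d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

    With k=1 each training point is its own nearest neighbour, so the training error is exactly zero while the test error is not: the signature of overfitting. With k=100 the model predicts the overall mean everywhere and does poorly on both sets, which is underfitting.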

    Train, validation and test splits

    Diagnosing these problems requires holding data back. The convention is a three-way split: the training set fits the model, the validation set tunes choices such as model complexity and stopping point, and the test set is touched only once, at the end, to estimate real-world performance. Evaluating a model on the data it was trained on flatters it and hides overfitting. Keeping the test set genuinely untouched is fundamental to honest evaluation, a point we stress across our AI and ML research outputs coverage.
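
    A minimal sketch of such a split, assuming a 70/15/15 ratio and a fixed shuffle seed. Neither is prescribed by the text, and `three_way_split` is a hypothetical helper name.

```python
import random


def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

    Recording the seed alongside the results lets someone else recreate exactly the same partition.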

    Regularisation: penalising complexity

    Regularisation discourages a model from becoming too complex by adding a penalty for large or numerous parameters. L1 (lasso) regularisation can shrink some weights to zero, effectively performing feature selection. L2 (ridge) regularisation shrinks weights smoothly towards zero without eliminating them. In neural networks, techniques such as dropout, which randomly disables units during training, and early stopping, which halts training before the model starts memorising, serve the same goal. Each nudges the model towards simpler, more generalisable solutions.
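
    The difference between the two penalties is easiest to see in a simplified setting. The sketch below assumes orthonormal features, an L1 penalty of lam * |w| and an L2 penalty of (lam / 2) * w^2 added to half the squared error. Under those assumptions each weight can be regularised independently with a closed-form rule. The weights and the value of `lam` are hypothetical.

```python
def lasso_shrink(w, lam):
    """Soft-thresholding: the L1 (lasso) solution for one weight."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # weights with |w| <= lam are removed entirely


def ridge_shrink(w, lam):
    """Proportional shrinkage: the L2 (ridge) solution for one weight."""
    return w / (1.0 + lam)


unpenalised = [3.0, -2.0, 0.4, -0.1]  # hypothetical least-squares weights
lam = 0.5                             # hypothetical penalty strength

lasso = [lasso_shrink(w, lam) for w in unpenalised]
ridge = [ridge_shrink(w, lam) for w in unpenalised]

print(lasso)                         # [2.5, -1.5, 0.0, 0.0]
print([round(w, 3) for w in ridge])  # [2.0, -1.333, 0.267, -0.067]
```

    The lasso rule sets the two small weights to exactly zero, while the ridge rule scales every weight down and leaves all of them non-zero. With correlated features the solutions no longer have this per-weight form, but the qualitative behaviour is the same.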

    Cross-validation: a more robust check

    A single train-validation split can be lucky or unlucky. Cross-validation guards against this by rotating the validation role across the data. In k-fold cross-validation, the data is divided into k parts; the model trains on k-1 parts and validates on the remaining one, repeating until every part has served as validation once. Averaging the k results gives a more stable estimate of how the model will generalise and reduces the chance of being fooled by a single fortunate split. To learn how these ideas fit into the wider discipline, see our explainer on what machine learning is.
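
    The k-fold procedure can be sketched in a few lines. The function names, the default of five folds and the toy "model" (which simply predicts the mean of the training targets) are assumptions made for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index is in validation exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def cross_val_score(xs, ys, fit, score, k=5):
    """Average a validation score over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / k


# Toy usage: the "model" is just the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
print(cross_val_score(xs, ys, fit_mean, mse, k=5))
```

    Passing `fit` and `score` as functions keeps the fold logic independent of any particular model, so the same loop can compare several candidates on identical folds.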

    Why this threatens reproducible ML

    Overfitting is a leading cause of results that fail to replicate. A model tuned too tightly to one dataset, or evaluated with leakage between training and test data, can report impressive accuracy that collapses when applied elsewhere. Honest splits, regularisation, cross-validation and full reporting of hyperparameters are the defences. We discuss these safeguards in depth in reproducibility of machine learning research, and the consistent terminology to describe them lives in the CASRAI dictionary. As with classical statistics, adequate data matters: too few examples make overfitting almost inevitable, echoing the concerns in our guide to sample size and statistical power.

    Frequently asked questions

    How can I tell if my model is overfitting?

    Compare training and test performance. A model that scores very high on training data but noticeably worse on held-out test data is overfitting. If it performs poorly on both, it is underfitting.

    What is the simplest way to reduce overfitting?

    Gather more representative data, simplify the model, and apply regularisation or early stopping. Cross-validation helps you confirm that a fix genuinely improves generalisation rather than reflecting a lucky split.

    What is the bias-variance trade-off in one sentence?

    It is the tension between a model being too simple to capture the pattern (high bias, underfitting) and so flexible that it fits noise (high variance, overfitting); the best model balances the two.

    Why does overfitting harm reproducibility?

    An overfitted model reports performance specific to one dataset that does not carry over to new data, so its results fail to replicate. Honest data splits and transparent reporting, as described in our guidance for authors, are the remedy.

  • What Is Machine Learning? Concepts and Methods

    Machine learning (ML) is the subfield of artificial intelligence concerned with algorithms that learn patterns from data and improve at a task with experience, rather than being explicitly programmed with rules. Instead of an engineer writing the logic, the engineer specifies a model and an objective, and the model adjusts its internal parameters to fit examples. The central scientific question is not whether a model fits the data it has seen, but whether it generalises to data it has not.

    Features, labels and the learning objective

    A machine-learning problem is usually framed in terms of features (the input variables describing each example) and, for supervised tasks, labels (the target output to be predicted). For a model predicting house prices, features might include floor area and location, and the label is the sale price. Learning means searching for model parameters that minimise a loss function measuring the gap between predictions and the truth.
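
    As a rough illustration of "searching for parameters that minimise a loss", the sketch below fits a one-feature linear model by gradient descent on mean squared error. The data are hypothetical and generated from a known line so the correct answer is clear; the learning rate and step count are arbitrary choices that happen to suit this data.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared gap between predictions w*x + b and the true labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_linear(xs, ys, lr=0.1, steps=5000):
    """Minimise the MSE loss over (w, b) with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= lr * grad_w  # step against the gradient
        b -= lr * grad_b
    return w, b


# Hypothetical data: floor area (feature) -> sale price (label), in arbitrary units,
# generated exactly from price = 2.0 * area + 0.5 so the right answer is known.
areas = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0]
prices = [2.0 * a + 0.5 for a in areas]

w, b = fit_linear(areas, prices)
print(round(w, 3), round(b, 3))  # 2.0 0.5
```

    Real problems have many features, noisy labels and more capable optimisers, but the loop has the same shape: predict, measure the loss, adjust the parameters.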

    Machine learning is one paradigm within the broader discipline described in our explainer on artificial intelligence definition and history. Where symbolic AI encodes knowledge by hand, ML infers it statistically from examples.

    The three main paradigms

    Machine learning is conventionally divided into three families, distinguished by what kind of feedback the algorithm receives.

    Type                   | Data used                             | Goal                                          | Typical examples
    Supervised learning    | Labelled examples (features + targets) | Predict a label for new inputs                | Classification, regression
    Unsupervised learning  | Unlabelled data                       | Discover structure                            | Clustering, dimensionality reduction
    Reinforcement learning | Rewards from an environment           | Learn a policy that maximises long-term reward | Control, game playing, sequential decisions

    Supervised learning trains on examples paired with correct answers and learns to predict those answers for unseen inputs; classification predicts categories and regression predicts continuous values. Unsupervised learning works with unlabelled data and seeks hidden structure, for instance grouping similar items (clustering) or compressing many variables into a few (dimensionality reduction). Reinforcement learning learns by trial and error: an agent takes actions, receives rewards or penalties, and gradually improves a policy that maximises cumulative reward.
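
    The trial-and-error loop of reinforcement learning is perhaps the least familiar of the three, so here is a minimal sketch. It uses a multi-armed bandit, the simplest reinforcement-learning setting (one state, a few actions), with an epsilon-greedy agent. The reward probabilities, the exploration rate and the function name are hypothetical.

```python
import random


def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which action pays best purely from observed rewards."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms    # times each action was tried
    values = [0.0] * n_arms  # running mean reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random action
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit best estimate
        reward = 1 if rng.random() < reward_probs[arm] else 0  # environment responds
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return values, counts, total_reward


# Hypothetical environment: three actions with hidden success probabilities.
values, counts, total = epsilon_greedy([0.2, 0.5, 0.8])
best = max(range(3), key=lambda a: values[a])
print("estimated best action:", best)
print("times each action was chosen:", counts)
```

    The agent is never told which action is correct. It only sees rewards, yet its estimates steer it towards the action with the highest payoff.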

    The train, validation and test split

    To estimate how well a model will generalise, data is partitioned into three disjoint sets. The training set is used to fit the model’s parameters. The validation set is used to tune choices the algorithm does not learn directly, such as model size or learning rate (the hyperparameters), and to compare candidate models. The test set is held back and used only once, at the end, to give an unbiased estimate of performance on unseen data.

    The cardinal rule is that the test set must not influence training or model selection. Repeatedly peeking at the test set leaks information and inflates reported performance, a subtle but common source of irreproducible results. We discuss safeguards at length in our guide to reproducibility of machine learning research.
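
    A small simulation shows how peeking inflates results. The labels below are coin flips, so no model can truly beat 50% accuracy. Even so, choosing the best of many arbitrary candidates by their test-set score produces a number well above chance. Everything here (the data, the 200 random linear rules, the set sizes) is a hypothetical setup for illustration.

```python
import random

rng = random.Random(0)
DIM = 10


def make_data(n):
    """Random features with coin-flip labels: there is nothing real to learn."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]
    ys = [rng.randrange(2) for _ in range(n)]
    return xs, ys


def accuracy(weights, xs, ys):
    """Accuracy of the linear rule 'predict 1 if weights . x > 0'."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0
        hits += int(pred == y)
    return hits / len(ys)


test_x, test_y = make_data(100)

# 200 arbitrary candidate models, with the winner picked by looking at the test set.
candidates = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(200)]
best = max(candidates, key=lambda w: accuracy(w, test_x, test_y))
peeked_acc = accuracy(best, test_x, test_y)

# Fresh data the selection never saw gives the honest estimate.
fresh_x, fresh_y = make_data(20000)
fresh_acc = accuracy(best, fresh_x, fresh_y)

print(f"accuracy on the reused test set: {peeked_acc:.2f}")
print(f"accuracy on fresh data:          {fresh_acc:.2f}")
```

    The gap between the two numbers is pure selection bias: the test set was used to choose the model, so it can no longer give an unbiased estimate for that model.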

    Overfitting and generalisation

    Overfitting occurs when a model learns the noise and idiosyncrasies of its training data rather than the underlying pattern, performing well on training examples but poorly on new ones. The opposite failure, underfitting, occurs when a model is too simple to capture the real structure. The art of machine learning lies in finding the balance, the so-called bias-variance trade-off, that yields the best generalisation to unseen data. Techniques such as regularisation, early stopping and cross-validation all serve this goal.

    Why method reporting matters

    Because performance depends so heavily on the data split, the loss function and the hyperparameters, a machine-learning result is only as credible as its reporting. Standardised vocabulary, captured in the casrai.org research dictionary, helps authors describe their methods consistently, and contribution frameworks such as CRediT help assign credit for the data, software and analysis work involved. Coverage of these issues continues in our AI and ML research outputs category.

    Frequently asked questions

    What is the difference between supervised and unsupervised learning?

    Supervised learning trains on data with known correct answers (labels) and predicts those answers for new inputs. Unsupervised learning works with unlabelled data and instead discovers structure, such as clusters or compressed representations, without a target to predict.

    Why split data into training, validation and test sets?

    The training set fits the model, the validation set tunes hyperparameters and compares models, and the held-out test set gives an unbiased estimate of real-world performance. Mixing these roles inflates results and undermines reproducibility.

    What is overfitting?

    Overfitting is when a model memorises the noise in its training data and therefore performs well on that data but poorly on new examples. The goal of machine learning is generalisation, not memorisation.

    Is machine learning the same as artificial intelligence?

    No. Machine learning is a subfield of artificial intelligence focused on learning from data. AI also includes symbolic reasoning, search and planning that do not learn from examples.