Tag: machine learning

  • Supervised vs Unsupervised Learning Explained

    Supervised learning is a machine-learning paradigm in which a model is trained on labelled examples — inputs paired with known correct outputs — so that it can predict the output for new, unseen inputs. Its counterpart, unsupervised learning, works with unlabelled data and seeks to discover structure, patterns or groupings without being told what the “right” answer is. The presence or absence of labels is the defining distinction between the two.

    Both belong to the wider field of machine learning, and choosing between them depends on whether you have labelled data and what question you are asking. A third paradigm, reinforcement learning, sits apart from both.

    Supervised learning: learning from labels

    In supervised learning, each training example carries a label: the email is “spam” or “not spam”, the tumour in the image is “malignant” or “benign”, the house has a known sale price. The algorithm learns a mapping from inputs (features) to outputs (labels), and its performance is judged by how accurately it predicts labels for data it has not seen. Two main task types exist. Classification predicts a discrete category (spam or not, species A, B or C). Regression predicts a continuous quantity (a price, a temperature, a concentration).
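
    As a minimal sketch of the two task types, the example below fits a least-squares line (regression) and a nearest-centroid rule (classification) in plain Python. The data are made up, and the function names and the choice of a nearest-centroid classifier are illustrative assumptions rather than methods named in this article.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    spread = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / spread
    return slope, y_bar - slope * x_bar

def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value per class."""
    return {
        c: mean(x for x, lab in zip(xs, labels) if lab == c) for c in set(labels)
    }

def classify(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))

# Regression: floor area -> sale price (made-up numbers).
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 100 + intercept   # a continuous output

# Classification: count of suspicious words -> "spam" / "not spam".
counts = [0, 1, 2, 8, 9, 10]
labels = ["not spam"] * 3 + ["spam"] * 3
centroids = fit_centroids(counts, labels)
predicted_label = classify(centroids, 7)    # a discrete output
```

    In both cases the labels (prices, spam flags) are supplied during fitting, which is what makes the tasks supervised.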

    Common supervised methods include linear and logistic regression, decision trees, support vector machines and neural networks. The key practical requirement — and often the key cost — is obtaining enough accurately labelled data, which may require expert annotation.

    Unsupervised learning: finding structure

    Unsupervised learning has no labels. Instead, the algorithm looks for inherent structure in the data. Clustering groups similar items together — for example, segmenting samples into subtypes without prior categories, using methods such as k-means or hierarchical clustering. Dimensionality reduction, such as principal component analysis, compresses many variables into fewer while preserving structure, aiding visualisation and downstream analysis. Because there is no ground-truth label, evaluating unsupervised results is harder and often relies on domain judgement.
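
    The clustering idea can be sketched with a bare-bones k-means (Lloyd's algorithm). This is an illustration under simplifying assumptions: the sample points are invented, and the first k points are used as starting centroids, which is a convenience here rather than a recommended initialisation.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Two visibly separate groups, with no labels supplied.
samples = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (2.0, 1.5), (9.5, 8.0)]
centroids, assignment = k_means(samples, k=2)
```

    The algorithm never sees a category name; the cluster indices it returns are arbitrary, and deciding what each group means is left to domain judgement.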

    Reinforcement learning: a third paradigm

    Reinforcement learning differs from both. Here an agent learns by interacting with an environment, taking actions and receiving rewards or penalties, and gradually improving a policy to maximise cumulative reward. It is neither labelled-example learning nor pure pattern discovery; it learns from consequences over time. Reinforcement learning underlies advances in game-playing and robotics, and is noted here for completeness rather than treated in depth.
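
    To make "learning from consequences" concrete, here is a small sketch of tabular Q-learning, one common reinforcement-learning method. The corridor environment, the reward of 1 at the far end, and the parameter values are all assumptions chosen for illustration.

```python
import random

def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a corridor: reward 1 for reaching the last state."""
    rng = random.Random(seed)
    actions = (-1, +1)                      # step left or step right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:        # an episode ends at the goal
            a = rng.randrange(2)            # explore with uniformly random actions
            nxt = min(max(state + actions[a], 0), n_states - 1)
            reward = 1.0 if nxt == n_states - 1 else 0.0
            # Move the estimate towards reward + discounted best future value.
            target = reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q

q_table = q_learning()
# Greedy policy: in each non-terminal state, pick the higher-valued action.
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
```

    No example is ever labelled "correct"; the agent only observes rewards, and the preference for moving right emerges from the discounted value of reaching the goal sooner.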

    Comparing the paradigms

    Feature         | Supervised learning                       | Unsupervised learning
    ----------------+-------------------------------------------+------------------------------------------
    Data            | Labelled (input–output pairs)             | Unlabelled
    Goal            | Predict known outputs                     | Discover hidden structure
    Main tasks      | Classification, regression                | Clustering, dimensionality reduction
    Example methods | Logistic regression, decision trees, SVMs | k-means, hierarchical clustering, PCA
    Evaluation      | Accuracy against held-out labels          | Often qualitative; needs domain judgement

    Research uses and good practice

    In research, supervised learning suits prediction and classification where labelled outcomes exist — diagnosing from images, predicting properties from features. Unsupervised learning suits exploration — finding subgroups, detecting anomalies, reducing dimensionality before further analysis. The two are often combined: unsupervised methods can pre-process or explore data that a supervised model then uses.

    Whichever paradigm is used, the outputs are research outputs that require careful reporting: how data were labelled or collected, how the model was validated, and how results were evaluated. Sharing data, code and methods using consistent terminology supports reproducibility, and our guidance for authors covers documenting such computational work. For foundational background, see our overview of machine learning concepts and methods and the explainer on neural networks and deep learning.

    Frequently asked questions

    What is the main difference between supervised and unsupervised learning?

    Labels. Supervised learning trains on labelled examples — inputs paired with known correct outputs — to make predictions, whereas unsupervised learning works with unlabelled data to discover structure such as clusters or lower-dimensional representations without any predefined answer.

    What are classification and regression?

    They are the two main supervised tasks. Classification predicts a discrete category, such as spam or not spam. Regression predicts a continuous value, such as a price or temperature. Both learn a mapping from input features to known output labels.

    Where does reinforcement learning fit?

    It is a separate, third paradigm. An agent learns by acting in an environment and receiving rewards or penalties, improving its policy over time to maximise cumulative reward. It learns from consequences rather than from labelled examples or pure pattern discovery.

    Can the two approaches be combined?

    Yes, frequently. Unsupervised methods such as clustering or PCA can explore or pre-process data, and a supervised model can then make predictions from the result. Many research pipelines use both, so they are complementary rather than mutually exclusive.

  • Overfitting and Underfitting in Machine Learning Explained

    Overfitting occurs when a machine learning model learns the noise and quirks of its training data so closely that it performs well on that data but poorly on new, unseen data. Its opposite, underfitting, occurs when a model is too simple to capture the underlying pattern and performs poorly even on the training data. Balancing these two failure modes is one of the central challenges of building reliable, reproducible models.

    The bias-variance trade-off

    Underfitting and overfitting are two sides of the bias-variance trade-off. Bias is error from overly simplistic assumptions; a high-bias model misses real structure and underfits. Variance is error from excessive sensitivity to the training sample; a high-variance model chases noise and overfits. As you make a model more flexible, bias typically falls but variance rises. The art is to find the sweet spot where their combined contribution to expected error (squared bias plus variance, on top of irreducible noise that no model can remove) is lowest. A model that generalises well sits between the extremes.

    Aspect            | Underfitting | Good fit    | Overfitting
    ------------------+--------------+-------------+------------
    Model complexity  | Too low      | Appropriate | Too high
    Bias              | High         | Balanced    | Low
    Variance          | Low          | Balanced    | High
    Training accuracy | Poor         | Good        | Excellent
    Test accuracy     | Poor         | Good        | Poor

    The tell-tale sign of overfitting is a large gap between strong training performance and weak test performance. Underfitting shows up as poor performance on both.
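
    The train/test gap can be observed directly with a toy model. The sketch below uses k-nearest-neighbour regression, where a small k gives a very flexible model and a large k a very rigid one. The synthetic data (a quadratic pattern plus Gaussian noise), the seed and the values of k are assumptions made for the demonstration.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return mean(y for _, y in nearest)

def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in data)

rng = random.Random(0)

def sample(n):
    """Noisy samples of the assumed true pattern y = x**2 on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, x ** 2 + rng.gauss(0, 0.1)) for x in xs]

train, test = sample(40), sample(40)

for k in (1, 5, len(train)):   # very flexible, moderate, very rigid
    print(f"k={k:2d}  train MSE={mse(train, train, k):.4f}  "
          f"test MSE={mse(train, test, k):.4f}")
```

    With k=1 every training point is its own nearest neighbour, so the training error is exactly zero while the test error is not: the signature of overfitting. With k equal to the training-set size the model predicts one constant everywhere and does poorly on both sets.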

    Train, validation and test splits

    Diagnosing these problems requires holding data back. The convention is a three-way split: the training set fits the model, the validation set tunes choices such as model complexity and stopping point, and the test set is touched only once, at the end, to estimate real-world performance. Evaluating on data the model trained on always flatters it and hides overfitting. Keeping the test set genuinely untouched is fundamental to honest evaluation, a point we stress across our AI and ML research outputs coverage.
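
    A three-way split can be sketched in a few lines. The 70/15/15 proportions and the function name are assumptions; the article does not prescribe particular fractions.

```python
import random

def three_way_split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a recorded seed, then cut into disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
```

    Because the three lists are slices of one shuffled sequence, no example can appear in more than one of them, and recording the seed lets the same split be recreated later.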

    Regularisation: penalising complexity

    Regularisation discourages a model from becoming too complex by adding a penalty for large or numerous parameters. L1 (lasso) regularisation can shrink some weights to zero, effectively performing feature selection. L2 (ridge) regularisation shrinks weights smoothly towards zero without eliminating them. In neural networks, techniques such as dropout, which randomly disables units during training, and early stopping, which halts training before the model starts memorising, serve the same goal. Each nudges the model towards simpler, more generalisable solutions.
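
    The different behaviour of L1 and L2 penalties shows up even in the simplest case: one feature, one weight, no intercept, where both have closed-form solutions. The objective scaling, the data and the penalty strength below are assumptions chosen to keep the arithmetic visible.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, clipping to exactly zero inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_weight(xs, ys, penalty=None, lam=0.0):
    """One-feature, no-intercept least squares with an optional penalty.

    ridge minimises sum((y - w*x)**2) + lam * w**2
    lasso minimises sum((y - w*x)**2) + lam * abs(w)
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if penalty == "ridge":
        return sxy / (sxx + lam)
    if penalty == "lasso":
        return soft_threshold(sxy, lam / 2) / sxx
    return sxy / sxx

xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.3]          # a weak, made-up relationship

w_plain = fit_weight(xs, ys)
w_ridge = fit_weight(xs, ys, "ridge", lam=4.0)
w_lasso = fit_weight(xs, ys, "lasso", lam=4.0)
```

    Here the ridge weight is smaller than the unpenalised one but still non-zero, whereas the lasso weight is exactly zero: the feature has been dropped.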

    Cross-validation: a more robust check

    A single train-validation split can be lucky or unlucky. Cross-validation guards against this by rotating the validation role across the data. In k-fold cross-validation, the data is divided into k parts; the model trains on k-1 parts and validates on the remaining one, repeating until every part has served as validation once. Averaging the results gives a more stable estimate of how the model will generalise, and a smaller chance of being fooled by a single fortunate split. To learn how these ideas fit into the wider discipline, see what is machine learning.
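
    The rotation described above can be sketched as an index generator. The helper names and the `score_fold` callback are hypothetical; the callback stands in for whatever "train on these rows, score on those rows" routine a project uses.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; every index is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def cross_validate(n, k, score_fold):
    """Average a user-supplied score over the k folds."""
    scores = [score_fold(tr, va) for tr, va in k_fold_indices(n, k)]
    return sum(scores) / k

folds = list(k_fold_indices(10, 3))
```

    The folds here are contiguous blocks, so the data would normally be shuffled (with a recorded seed) before the indices are applied.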

    Why this threatens reproducible ML

    Overfitting is a leading cause of results that fail to replicate. A model tuned too tightly to one dataset, or evaluated with leakage between training and test data, can report impressive accuracy that collapses when applied elsewhere. Honest splits, regularisation, cross-validation and full reporting of hyperparameters are the defences. We discuss these safeguards in depth in reproducibility of machine learning research, and the consistent terminology to describe them lives in the CASRAI dictionary. As with classical statistics, adequate data matters: too few examples make overfitting almost inevitable, echoing the concerns in our guide to sample size and statistical power.

    Frequently asked questions

    How can I tell if my model is overfitting?

    Compare training and test performance. A model that scores very high on training data but noticeably worse on held-out test data is overfitting. If it performs poorly on both, it is underfitting.

    What is the simplest way to reduce overfitting?

    Gather more representative data, simplify the model, and apply regularisation or early stopping. Cross-validation helps you confirm that a fix genuinely improves generalisation rather than reflecting a lucky split.

    What is the bias-variance trade-off in one sentence?

    It is the tension between a model being too simple to capture the pattern (high bias, underfitting) and so flexible that it also fits noise (high variance, overfitting), with the best model balancing the two.

    Why does overfitting harm reproducibility?

    An overfitted model reports performance specific to one dataset that does not carry over to new data, so its results fail to replicate. Honest data splits and transparent reporting, as described in our guidance for authors, are the remedy.

  • Reproducibility for AI/ML research: model cards, seeds and compute disclosure

    Machine-learning research has a reproducibility problem, and the awkward truth is that most of it is not about anything exotic. A reported result fails to reproduce not because the science is fraudulent or the maths is wrong, but because of mundane omissions: a random seed that was never recorded, a library version that was never pinned, a preprocessing step that lived only in someone’s notebook, a hardware configuration nobody thought to mention. The good news is that, precisely because the causes are mundane, the fixes are tractable: they are matters of documentation and discipline rather than fundamental breakthroughs. This article sets out the practical components of reproducible AI/ML work, drawing on the definitions in the AI/ML research outputs domain of the CASRAI Dictionary and the broader principles in the reproducibility domain.

    Why ML is especially fragile

    Several features of machine learning conspire to make results fragile. Models are stochastic: random initialisation, shuffling and sampling mean that two runs of the same code can produce different numbers unless randomness is controlled. They are dependency-heavy: results can shift with a change in a framework version, a numerical library, or even a hardware driver. They are data-sensitive: a different split, a different preprocessing choice, or an undocumented filtering step can change a headline metric. And they are increasingly compute-bound: some results depend on hardware and scale that are themselves part of the experiment. None of these is a flaw to be ashamed of, but each is a source of irreproducibility unless it is documented and controlled.

    Model cards and datasheets: documenting what you built

    The first pillar is structured documentation of the model itself. A model card is a short, standardised document that accompanies a trained model and records what it is, what it was trained and evaluated on, how it performs across relevant conditions, its intended uses, and its known limitations and ethical considerations. The point of a model card is that it travels with the model, so that anyone using or building on it inherits the context they need rather than reconstructing it from a paper’s prose.

    The complementary artefact for data is the datasheet for datasets, which documents a dataset’s motivation, composition, collection process, preprocessing, recommended uses and limitations. Together, model cards and datasheets address the two halves of an ML experiment whose details most often go unrecorded — the model and the data — and they turn ‘trust me, it works’ into something a reader can interrogate. Both are concrete examples of treating documentation as a first-class research output rather than an afterthought.
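
    One way to make a model card travel with the model is to keep it as structured data saved next to the model file. The sketch below is only an illustration: the field names are assumptions derived from the description above, not a formal model-card schema, and every value is a placeholder.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ModelCard:
    """Illustrative fields only; not a formal model-card schema."""
    name: str
    description: str
    training_data: str
    evaluation_data: str
    metrics: dict = field(default_factory=dict)
    intended_uses: list = field(default_factory=list)
    limitations: list = field(default_factory=list)
    ethical_considerations: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the card so it can be stored next to the model file."""
        return json.dumps(asdict(self), indent=2)

card = ModelCard(
    name="example-classifier",               # placeholder values throughout
    description="Binary text classifier used as a worked example.",
    training_data="Hypothetical corpus A, training portion",
    evaluation_data="Hypothetical corpus A, held-out portion",
    metrics={"accuracy": None},              # fill in with measured results
    intended_uses=["Demonstration only"],
    limitations=["Not evaluated outside corpus A"],
)
card_json = card.to_json()
```

    A datasheet could be recorded the same way, with fields for motivation, composition, collection process, preprocessing, recommended uses and limitations.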

    Seeds and determinism: making runs repeatable

    The second pillar is the humble random seed. Setting and recording seeds for every source of randomness — the framework, the numerical libraries, the data loaders — is the single cheapest reproducibility measure available, and one of the most frequently neglected. Recording the seed lets someone reproduce a specific run; reporting results across several seeds, with variation shown, lets readers judge whether a result is robust or an artefact of a lucky initialisation.

    It is worth being honest about the limits here. Even with fixed seeds, full bit-for-bit determinism can be elusive, because some operations on parallel hardware are non-deterministic by default and because results can differ across hardware and library versions. The realistic goal is not always perfect determinism but documented randomness: a reader should know what was fixed, what was not, and how much the results varied as a consequence. A result reported as a mean across seeds with a measure of spread is far more credible than a single number with no indication of how stable it is.
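
    Seeding and multi-seed reporting can be sketched with the standard library alone. The `run_experiment` function is a hypothetical stand-in for a real training run, and the score it returns is simulated, not a measured result.

```python
import random
from statistics import mean, stdev

def run_experiment(seed):
    """Stand-in for a training run: returns a simulated 'accuracy' score."""
    rng = random.Random(seed)          # a private generator, seeded explicitly
    return 0.80 + rng.gauss(0, 0.02)   # placeholder for a real metric

seeds = [0, 1, 2, 3, 4]                # record these alongside the results
scores = [run_experiment(s) for s in seeds]

# Same seed, same number: a specific run can be repeated.
assert run_experiment(0) == scores[0]

# Report the spread, not a single figure.
print(f"accuracy = {mean(scores):.3f} ± {stdev(scores):.3f} over {len(seeds)} seeds")
```

    A real project would also seed each library it relies on (for example via `numpy.random.seed` or `torch.manual_seed`, where those libraries are in use) and note which sources of randomness could not be fixed.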

    Compute and environment disclosure

    The third pillar is disclosure of the compute and environment in which the work was done. This means recording the hardware used, the software environment (framework and library versions, ideally captured in a pinned dependency specification or a container image), and the scale of the experiment — training time, the amount of computation involved, and the resources required. This serves two purposes at once. It supports reproducibility, because a result obtained on particular hardware with particular software may not reproduce elsewhere without that context. And it supports honesty and sustainability, because the computational and environmental cost of large-scale training is itself a material fact that readers, reviewers and funders increasingly expect to see stated rather than hidden.

    Capturing the environment in a reusable form — a container, a pinned environment file, a recorded command line — is what lets a reader move from reading about a result to re-running it, which is the real test of reproducibility.
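
    Part of this record can be captured automatically at the start of a run. The sketch below uses only the Python standard library; the function name and the package list are assumptions, and it does not capture accelerator details, training time or container images, which would need to be recorded separately.

```python
import json
import platform
import sys
from importlib import metadata

def capture_environment(packages=()):
    """Collect basic facts about the machine and software running an experiment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None          # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "command_line": sys.argv,
        "packages": versions,
    }

# The package names are examples; list whatever the experiment imports.
env = capture_environment(packages=["numpy", "torch"])
env_json = json.dumps(env, indent=2)
```

    Writing `env_json` to a file alongside the results gives readers the versions and command line that produced them.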

    Software and the FAIR4RS principles

    Underlying all of this is the recognition that the code is a research output, to be shared, versioned, identified and cited like any other. The FAIR4RS principles — FAIR for Research Software — adapt the familiar Findable, Accessible, Interoperable and Reusable framework to software, acknowledging that code has characteristics (executability, dependencies, versions) that data alone does not. Treating ML code as a citable, archived output with a persistent identifier, rather than as a transient artefact, is what makes the model card, the seeds and the compute disclosure add up to something reproducible rather than merely well-described.

    Crediting the work properly

    Reproducible ML research is rarely the work of one person, and the contributions are varied: building the model, curating the data, writing the evaluation, managing the compute. Recording who did what through structured contributorship — the roles set out in the CRediT taxonomy — makes that division of labour visible and creditable, which matters all the more in collaborative ML projects where data, code, models and evaluation are often distinct workstreams. The consistent vocabulary for describing AI/ML outputs, their documentation and their reproducibility is maintained in the CASRAI Dictionary, so that a claim of reproducibility can be expressed, recorded and checked across the systems that track research outputs.

  • What Is Machine Learning? Concepts and Methods

    Machine learning (ML) is the subfield of artificial intelligence concerned with algorithms that learn patterns from data and improve at a task with experience, rather than being explicitly programmed with rules. Instead of an engineer writing the logic, the engineer specifies a model and an objective, and the model adjusts its internal parameters to fit examples. The central scientific question is not whether a model fits the data it has seen, but whether it generalises to data it has not.

    Features, labels and the learning objective

    A machine-learning problem is usually framed in terms of features (the input variables describing each example) and, for supervised tasks, labels (the target output to be predicted). For a model predicting house prices, features might include floor area and location, and the label is the sale price. Learning means searching for model parameters that minimise a loss function measuring the gap between predictions and the truth.
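
    The search for loss-minimising parameters can be sketched with gradient descent on a one-parameter model. The toy data, the mean-squared-error loss and the learning rate are assumptions chosen so the behaviour is easy to follow.

```python
def mse_loss(w, xs, ys):
    """Mean squared gap between predictions w*x and the labels."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, lr=0.01, steps=200):
    """Adjust the single parameter w to reduce the loss step by step."""
    w = 0.0
    for _ in range(steps):
        # Derivative of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# Toy data: price is exactly 2 units per unit of floor area (made up).
floor_area = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [2.0, 4.0, 6.0, 8.0, 10.0]

w = gradient_descent(floor_area, price)
```

    The engineer supplies the model form (`w * x`) and the objective (`mse_loss`); the value of `w` is found from the examples rather than written by hand.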

    Machine learning is one paradigm within the broader discipline described in our explainer on artificial intelligence definition and history. Where symbolic AI encodes knowledge by hand, ML infers it statistically from examples.

    The three main paradigms

    Machine learning is conventionally divided into three families, distinguished by what kind of feedback the algorithm receives.

    Type                   | Data used                              | Goal                                           | Typical examples
    -----------------------+----------------------------------------+------------------------------------------------+--------------------------------------------
    Supervised learning    | Labelled examples (features + targets) | Predict a label for new inputs                 | Classification, regression
    Unsupervised learning  | Unlabelled data                        | Discover structure                             | Clustering, dimensionality reduction
    Reinforcement learning | Rewards from an environment            | Learn a policy that maximises long-term reward | Control, game playing, sequential decisions

    Supervised learning trains on examples paired with correct answers and learns to predict those answers for unseen inputs; classification predicts categories and regression predicts continuous values. Unsupervised learning works with unlabelled data and seeks hidden structure, for instance grouping similar items (clustering) or compressing many variables into a few (dimensionality reduction). Reinforcement learning learns by trial and error: an agent takes actions, receives rewards or penalties, and gradually improves a policy that maximises cumulative reward.

    The train, validation and test split

    To estimate how well a model will generalise, data is partitioned into three disjoint sets. The training set is used to fit the model’s parameters. The validation set is used to tune choices the algorithm does not learn directly, such as model size or learning rate (the hyperparameters), and to compare candidate models. The test set is held back and used only once, at the end, to give an unbiased estimate of performance on unseen data.

    The cardinal rule is that the test set must not influence training or model selection. Repeatedly peeking at the test set leaks information and inflates reported performance, a subtle but common source of irreproducible results. We discuss safeguards at length in our guide to reproducibility of machine learning research.
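
    The division of roles can be sketched as a small model-selection loop: the validation set chooses a hyperparameter, and the test set is consulted once afterwards. The k-nearest-neighbour model, the synthetic data, the split sizes and the candidate values are all assumptions made for illustration.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return mean(y for _, y in nearest)

def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in data)

rng = random.Random(0)
xs = [rng.random() for _ in range(90)]
data = [(x, 3 * x + rng.gauss(0, 0.2)) for x in xs]   # synthetic examples
train, val, test = data[:60], data[60:75], data[75:]

# Model selection uses the validation set only.
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: mse(train, val, k))

# The test set is consulted once, after every choice is fixed.
test_error = mse(train, test, best_k)
```

    If `best_k` were instead chosen by comparing test errors, `test_error` would no longer be an unbiased estimate, which is the leak the rule guards against.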

    Overfitting and generalisation

    Overfitting occurs when a model learns the noise and idiosyncrasies of its training data rather than the underlying pattern, performing well on training examples but poorly on new ones. The opposite failure, underfitting, occurs when a model is too simple to capture the real structure. The art of machine learning lies in finding the balance, the so-called bias-variance trade-off, that yields the best generalisation to unseen data. Techniques such as regularisation, early stopping and cross-validation all serve this goal.

    Why method reporting matters

    Because performance depends so heavily on the data split, the loss function and the hyperparameters, a machine-learning result is only as credible as its reporting. Standardised vocabulary, captured in the casrai.org research dictionary, helps authors describe their methods consistently, and contribution frameworks such as CRediT help assign credit for the data, software and analysis work involved. Coverage of these issues continues in our AI and ML research outputs category.

    Frequently asked questions

    What is the difference between supervised and unsupervised learning?

    Supervised learning trains on data with known correct answers (labels) and predicts those answers for new inputs. Unsupervised learning works with unlabelled data and instead discovers structure, such as clusters or compressed representations, without a target to predict.

    Why split data into training, validation and test sets?

    The training set fits the model, the validation set tunes hyperparameters and compares models, and the held-out test set gives an unbiased estimate of real-world performance. Mixing these roles inflates results and undermines reproducibility.

    What is overfitting?

    Overfitting is when a model memorises the noise in its training data and therefore performs well on that data but poorly on new examples. The goal of machine learning is generalisation, not memorisation.

    Is machine learning the same as artificial intelligence?

    No. Machine learning is a subfield of artificial intelligence focused on learning from data. AI also includes symbolic reasoning, search and planning that do not learn from examples.