Tag: research standards

  • Reproducibility of Machine Learning Research

    Machine-learning (ML) reproducibility is the ability of an independent party to obtain results consistent with a published study using the same code, data and computational configuration. It is a persistent challenge: many ML papers report results that others cannot reproduce, not through misconduct but because critical details, such as random seeds, data versions and compute settings, go unrecorded. Fixing this is a matter of disciplined reporting rather than new science, and a set of practical standards has emerged to make ML results reliably reproducible.

    Why ML results are hard to reproduce

    Several sources of variation conspire against reproducibility. ML training is inherently stochastic: random weight initialisation, data shuffling and randomised algorithms mean two runs of the same code can yield different models. Results are also sensitive to the exact data version and preprocessing, to hyperparameters, and to the software and hardware environment, since different library versions or GPU behaviour can change outcomes. When a paper omits these, the reported numbers cannot be regenerated. The train/validation/test discipline that guards against inflated results is covered in our explainer on machine learning concepts and methods.

    Random seeds and reporting variance

    Setting and recording random seeds for every source of randomness makes a single run repeatable, provided the software and hardware also behave deterministically (some GPU operations do not by default). But a fixed seed is not the whole story: because results vary across seeds, robust practice is to report performance across multiple seeds with a measure of spread, such as the mean and standard deviation, not a single best run. This distinguishes a genuine improvement from one that merely got a lucky initialisation.
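
    The practice of reporting across seeds can be sketched in a few lines. This is a minimal illustration, not a prescribed method: train_and_evaluate is a hypothetical stand-in for a real training run, and the score it returns is synthetic.

```python
import random
import statistics


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for a real training run.

    A private, seeded generator plays the role of every source of
    randomness (initialisation, shuffling and so on). The returned
    "accuracy" is synthetic: a fixed base value plus seed-dependent noise.
    """
    rng = random.Random(seed)
    return 0.90 + rng.gauss(0.0, 0.01)


def summarise(seeds: list[int]) -> dict:
    """Run once per seed and report the mean and sample standard deviation."""
    scores = [train_and_evaluate(s) for s in seeds]
    return {
        "seeds": seeds,
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
    }


report = summarise([0, 1, 2, 3, 4])
print(f"accuracy = {report['mean']:.4f} +/- {report['std']:.4f} "
      f"over {len(report['seeds'])} seeds")
```

    Recording the list of seeds next to the summary lets a reader rerun any individual run as well as the aggregate.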

    Data and model versioning

    Reproducibility requires knowing exactly which data and which model produced a result. Data versioning records the precise dataset snapshot, including any cleaning, filtering and splits, so the same inputs can be reconstructed. Model versioning records the trained weights and the configuration that produced them. This provenance is the engineering counterpart to the documentation artefacts described in our piece on AI model documentation: datasheets and model cards describe what the data and model are, while versioning lets others retrieve the exact instances used.
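
    One lightweight way to pin down "exactly which data and which model" is to record content hashes. The sketch below is an assumption-laden illustration: the manifest layout, the function names and the in-memory byte strings are all invented for the example, and dedicated versioning tools offer far more.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def make_manifest(dataset: bytes, weights: bytes, config: dict) -> dict:
    """Bundle content hashes with the configuration that produced the model."""
    # sort_keys makes the serialised config independent of key order.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "data_sha256": fingerprint(dataset),
        "weights_sha256": fingerprint(weights),
        "config_sha256": fingerprint(config_bytes),
        "config": config,
    }


# Illustrative in-memory stand-ins; a real project would read files from disk.
dataset = b"id,feature,label\n1,0.5,0\n2,1.5,1\n"
weights = b"\x00\x01\x02\x03"
config = {"learning_rate": 0.01, "epochs": 10, "split_seed": 42}

manifest = make_manifest(dataset, weights, config)
print(json.dumps(manifest, indent=2))
```

    Any change to the data, weights or configuration changes the corresponding digest, so a published manifest lets others confirm they hold the same instances.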

    Practice                | What it captures                    | Why it matters
    Random seeds            | All sources of randomness           | Makes a run repeatable; report across seeds for variance
    Data versioning         | Exact dataset snapshot and splits   | Lets others reconstruct the same inputs
    Model versioning        | Trained weights and configuration   | Identifies exactly which model produced a result
    Environment reporting   | Library versions, hardware, compute | Controls for software and hardware variation
    Shared code and weights | The implementation itself           | Enables direct re-execution and scrutiny

    Environment and compute reporting

    Results depend on the computational environment, so reproducible studies report the software stack (framework and library versions), the hardware used (such as the GPU type and count), and the compute budget, including training time and the number of runs. Capturing dependencies, for example through a pinned environment file or a container, lets others recreate the conditions rather than guess at them. Compute reporting also supports honest comparison, since a method that wins only with vastly more compute is a different claim from one that wins under equal budgets.
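
    Part of the software stack can be captured programmatically. The following is a minimal sketch using only the Python standard library; the package names queried are examples, and hardware details such as GPU type and count are not covered here and would need framework-specific calls.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def capture_environment(packages: list[str]) -> dict:
    """Collect interpreter, operating-system and package versions."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
    }


# The package list is illustrative; record whatever the study depends on.
env = capture_environment(["numpy", "torch"])
print(json.dumps(env, indent=2))
```

    Saving this record with each run's results complements, rather than replaces, a pinned environment file or container.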

    Sharing code, weights and reproducibility checklists

    The single most effective step is to share the code and trained weights alongside the paper, so reviewers and readers can re-run the experiments directly. To make expectations concrete, the community has adopted reproducibility checklists, such as the machine-learning reproducibility checklist used by major conferences, which prompt authors to confirm that they have reported data, code, hyperparameters, compute and statistical significance. Treating these checklists as standard practice raises the floor for the whole field. We track these standards across our AI and ML research outputs coverage, with shared terminology anchored in the casrai.org research dictionary and contribution credit handled through CRediT.

    Frequently asked questions

    Why are machine-learning results often hard to reproduce?

    Because training is stochastic and results depend on random seeds, exact data versions, hyperparameters and the software and hardware environment. When papers omit these details, the reported numbers cannot be regenerated.

    Is setting a random seed enough for reproducibility?

    No. A fixed seed makes one run repeatable on a deterministic software and hardware stack, but because results vary across seeds, robust practice is to report performance over multiple seeds with a measure of spread, not a single best run.

    What is a reproducibility checklist?

    It is a structured list, adopted by major ML venues, that prompts authors to confirm they have reported data, code, hyperparameters, compute and statistical significance, raising the baseline standard for the field.

    What is the single most effective reproducibility step?

    Sharing the code and trained weights alongside the paper, together with the exact data and environment, so that others can directly re-run and scrutinise the experiments.

  • Neural Networks and Deep Learning Explained

    An artificial neural network is a machine-learning model composed of many simple interconnected units, loosely inspired by biological neurons, that transform input data through successive layers of weighted connections. Deep learning is the use of neural networks with many such layers to learn rich, hierarchical representations directly from data. Together they underpin most of the recent advances in artificial intelligence, from image recognition to the large language models behind generative systems.

    Neurons, weights and activations

    The basic unit, often called a neuron or node, computes a weighted sum of its inputs, adds a bias term, and passes the result through a non-linear activation function. The weights are the model’s learnable parameters; they determine how strongly each input influences the unit’s output. The activation function, such as the rectified linear unit (ReLU) or the sigmoid, introduces non-linearity, which is essential: without it, stacking layers would collapse into a single linear transformation incapable of modelling complex patterns.
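
    The computation of a single unit can be written out directly. This is a minimal sketch with hand-picked inputs, weights and bias chosen only for illustration.

```python
import math


def relu(z: float) -> float:
    """Rectified linear unit: pass positives through, clip negatives to zero."""
    return max(0.0, z)


def sigmoid(z: float) -> float:
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


x = [1.0, 2.0]
w = [0.5, -1.0]
b = 0.25

# Pre-activation: 0.5*1.0 + (-1.0)*2.0 + 0.25 = -1.25
print(neuron(x, w, b, relu))      # 0.0, because ReLU clips negatives
print(neuron(x, w, b, sigmoid))   # about 0.2227
```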

    Neurons are organised into layers: an input layer that receives the data, one or more hidden layers that transform it, and an output layer that produces the prediction. Information flows forward through these layers in a process called the forward pass. This architecture is one realisation of the machine-learning ideas described in our explainer on machine learning concepts and methods.
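
    A forward pass through stacked layers can be sketched as a loop over layers. The network shape and parameter values below are invented for the example; real networks learn their weights rather than having them written by hand.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Pass the input through each (weights, biases, activation) in turn."""
    activations = x
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations


# Illustrative, hand-picked parameters: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),   # hidden layer
    ([[2.0, 1.0]], [0.5], identity),                  # output layer
]

# Hidden layer gives [2.0, 1.0]; output is 2*2.0 + 1*1.0 + 0.5.
print(forward([3.0, 1.0], layers))   # [5.5]
```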

    What “deep” means

    The word deep refers simply to the number of layers. A network with many hidden layers is “deep”, and depth allows the model to build representations in stages: early layers may detect simple features such as edges in an image, while later layers combine these into increasingly abstract concepts such as shapes and objects. This automatic, layered feature learning is what distinguishes deep learning from earlier methods that relied on hand-engineered features. The historical shift to deep networks is traced in our overview of artificial intelligence definition and history.

    Component           | Role
    Neuron (node)       | Computes a weighted sum plus bias, then an activation
    Weight              | Learnable parameter scaling each input
    Activation function | Adds non-linearity (e.g. ReLU, sigmoid)
    Layer               | A group of neurons; depth is the number of layers
    Loss function       | Measures error between prediction and target

    Training: backpropagation and gradient descent

    A neural network learns by adjusting its weights to reduce a loss function that measures how wrong its predictions are. Training proceeds in two coupled steps. First, the forward pass produces predictions and computes the loss. Second, backpropagation uses the chain rule of calculus to compute the gradient of the loss with respect to every weight, efficiently propagating error signals backward from the output layer to the input layer.
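
    The chain-rule bookkeeping can be made concrete on a deliberately tiny network. The sketch below assumes one input, one tanh hidden unit, one output weight and a squared-error loss, all chosen for illustration; a finite-difference check is included only to confirm the analytic gradients.

```python
import math


def forward(w1, w2, x):
    """Two-weight network: hidden = tanh(w1 * x), prediction = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    prediction = w2 * hidden
    return hidden, prediction


def loss_fn(w1, w2, x, target):
    """Half the squared error between prediction and target."""
    _, prediction = forward(w1, w2, x)
    return 0.5 * (prediction - target) ** 2


def backprop(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    hidden, prediction = forward(w1, w2, x)
    d_prediction = prediction - target        # dL/d(prediction)
    d_w2 = d_prediction * hidden              # dL/dw2
    d_hidden = d_prediction * w2              # dL/d(hidden)
    d_pre = d_hidden * (1.0 - hidden ** 2)    # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_pre * x                          # dL/dw1
    return d_w1, d_w2


def numerical_grad(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    g1 = (loss_fn(w1 + eps, w2, x, target)
          - loss_fn(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss_fn(w1, w2 + eps, x, target)
          - loss_fn(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2


analytic = backprop(0.5, -1.5, 2.0, 1.0)
numeric = numerical_grad(0.5, -1.5, 2.0, 1.0)
print("backprop:          ", analytic)
print("finite differences:", numeric)
```

    The two results should agree to several decimal places. Backpropagation obtains all gradients from one forward and one backward pass, whereas finite differences need extra loss evaluations per weight.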

    These gradients tell an optimiser how to change each weight to reduce the loss. Gradient descent, usually in its stochastic mini-batch form, then nudges the weights a small step in the direction that lowers the loss, controlled by a learning rate. Repeating this over many passes through the data (epochs) gradually improves the model. Because the outcome depends on random initialisation, data ordering and these hyperparameters, careful reporting is essential, as discussed in our guide to reproducibility of machine learning research.
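
    The update loop can be sketched on a problem small enough to follow by eye. The example below assumes a one-feature linear model and synthetic, noise-free data; the learning rate, batch size and epoch count are illustrative choices rather than recommendations.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.1, epochs=2000,
                          batch_size=5, seed=0):
    """Fit y ~ w*x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)          # recorded seed: shuffling is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # new data order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the loss 0.5 * mean((w*x + b - y)^2) over the batch.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            # Step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic, noise-free data generated from y = 3x + 1.
xs = [i / 20 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")   # close to w = 3, b = 1
```

    Because the shuffle is seeded, repeating the call with the same seed reproduces the same weights, which is the property the reporting practices above rely on.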

    Why documentation matters for neural networks

    Because a trained network is defined by its learned weights, often millions or more, rather than by human-readable rules, transparency depends on documentation: what data trained it, how it was evaluated, and what its limits are. Structured artefacts such as model cards, covered in our piece on AI model documentation, address exactly this need, and the controlled terminology in the casrai.org research dictionary helps keep descriptions consistent across the literature.

    Frequently asked questions

    What makes a neural network “deep”?

    Depth refers to the number of layers. A deep network has many hidden layers, which lets it learn features in stages, from simple patterns in early layers to abstract concepts in later ones.

    What is backpropagation?

    Backpropagation is the algorithm that computes the gradient of the loss with respect to each weight by applying the chain rule backward through the network. These gradients tell the optimiser how to adjust the weights.

    What is the role of an activation function?

    An activation function adds non-linearity to each neuron. Without it, stacking layers would be equivalent to a single linear transformation, and the network could not model complex relationships.

    How does gradient descent train a network?

    Gradient descent repeatedly adjusts the weights by a small step in the direction that reduces the loss, using the gradients from backpropagation and a learning rate to control the step size.

  • What Is Artificial Intelligence? Definition and History

    Artificial intelligence (AI) is the branch of computer science concerned with building systems that perform tasks normally requiring human intelligence, such as perception, reasoning, language understanding and decision-making. As a research field it spans both symbolic approaches, which encode knowledge and rules explicitly, and statistical approaches, which learn patterns from data. For the research community, AI is best understood not as a single technology but as a long-standing discipline with a measurable history, contested definitions and evolving documentation standards.

    A working definition of artificial intelligence

    There is no universally agreed definition of artificial intelligence, partly because the goalposts move: tasks once considered to require intelligence, such as optical character recognition, become routine engineering and stop being called AI. A durable, standards-friendly definition treats AI as the study and construction of agents that perceive their environment and take actions to maximise a defined objective. This framing accommodates everything from a rule-based expert system to a modern neural network without privileging any one method.

    Because the term is so elastic, research-standards bodies encourage authors to describe the specific method used, rather than the marketing label. A paper that says it “used AI” tells a reader very little; one that names the model class, training data and evaluation protocol is reproducible. The casrai.org research dictionary exists precisely to stabilise this vocabulary across disciplines.

    Narrow AI versus general AI

    Almost all systems deployed today are examples of narrow AI (also called weak AI): they are built for a single, bounded task such as translating text, recommending content or classifying images. A narrow system that excels at one task cannot, without retraining or adaptation, transfer that competence to an unrelated domain.

    Artificial general intelligence (AGI) refers to a hypothetical system with the broad, flexible competence of a human across arbitrary tasks. AGI remains a research aspiration rather than an existing artefact, and claims of its arrival should be treated with scholarly caution. Keeping the narrow/general distinction explicit prevents the overstatement that often clouds reporting on AI in research outputs.

    Symbolic AI versus statistical approaches

    The field has long been organised around two broad paradigms. Symbolic AI (sometimes called “good old-fashioned AI”) represents knowledge as symbols and manipulates them with explicit logical rules; expert systems and classical search and planning belong here. Its strengths are transparency and the ability to explain a decision step by step.
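
    The step-by-step explainability of rule-based systems can be illustrated with a toy forward-chaining loop. The facts and rules below are invented for the example, and real expert systems are considerably richer.

```python
def forward_chain(facts, rules):
    """Apply if-then rules until no new fact can be derived.

    Returns the final set of facts and a step-by-step explanation trace.
    """
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                trace.append(f"{' and '.join(premises)} => {conclusion}")
                changed = True
    return known, trace


# Illustrative toy knowledge base; the rules are invented for this example.
rules = [
    (("has_feathers",), "is_bird"),
    (("is_bird", "cannot_fly", "swims"), "is_penguin"),
]
facts = {"has_feathers", "cannot_fly", "swims"}

known, trace = forward_chain(facts, rules)
for step in trace:
    print(step)
```

    The trace lists each rule that fired, in order, which is the kind of explicit justification the symbolic paradigm offers.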

    Statistical or machine-learning approaches instead infer behaviour from data. Rather than hand-coding rules, an engineer specifies a model and an objective, and the system learns parameters that fit observed examples. This paradigm now dominates practical AI, and it underpins the techniques discussed in our companion piece on machine learning concepts and methods. The two paradigms are increasingly combined in neuro-symbolic systems that pair learned perception with explicit reasoning.

    A brief history: Dartmouth 1956 to the deep-learning era

    The field takes its name from the Dartmouth Summer Research Project on Artificial Intelligence, a 1956 workshop organised by John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon, whose 1955 proposal introduced the term. Early optimism produced symbolic reasoning programs and early neural-network models, but progress stalled when problems proved harder than expected, producing the so-called AI winters of reduced funding and interest in the 1970s and again in the late 1980s.

    The modern resurgence, often dated to the early 2010s, came from the convergence of large datasets, graphics-processing hardware and improved training methods, ushering in the deep-learning era. These advances are explored further in our overview of neural networks and deep learning, and they set the stage for today’s generative models.

    Period            | Milestone                                       | Significance
    1950              | Turing’s “Computing Machinery and Intelligence” | Proposed the imitation game (Turing test)
    1956              | Dartmouth workshop                              | Gave the field the name “artificial intelligence”
    1970s, late 1980s | AI winters                                      | Funding and interest contracted
    2010s             | Deep-learning breakthroughs                     | Data plus GPUs revived neural networks

    The Turing test

    In 1950 Alan Turing proposed what is now called the Turing test: rather than asking whether a machine can “think”, he asked whether a human interrogator, conversing by text, could reliably distinguish the machine from a person. The test reframed an unanswerable philosophical question as an operational one. It remains a touchstone for discussion, though contemporary researchers treat it as a thought experiment rather than a benchmark of genuine understanding, and it does not measure reasoning, safety or factual accuracy.

    Why definitions matter for the research record

    Precise terminology is not pedantry; it is the foundation of reproducibility and credit. When AI methods feature in a study, readers and reviewers need to know exactly what was done. This connects to broader work on contribution transparency captured by the CRediT taxonomy and to the emerging disclosure norms tracked in our AI and ML research outputs coverage.

    Frequently asked questions

    Is artificial intelligence the same as machine learning?

    No. Machine learning is a subfield of artificial intelligence concerned with learning from data. AI is the broader discipline and also includes symbolic reasoning, search and planning that need not learn at all.

    Does any current system count as general AI?

    No. All systems in production are narrow AI, built for specific tasks. Artificial general intelligence remains a research aspiration, and claims of its existence should be treated sceptically.

    What was the significance of the 1956 Dartmouth workshop?

    It gave the field the name “artificial intelligence”, a term introduced in the workshop’s 1955 proposal, and it is where the field was effectively founded as a distinct research discipline, setting a shared agenda for the decades that followed.

    Does passing the Turing test prove a machine is intelligent?

    Not in any deep sense. The test measures whether a machine can imitate human conversation convincingly, not whether it understands, reasons soundly or is factually reliable.

  • What Is Machine Learning? Concepts and Methods

    Machine learning (ML) is the subfield of artificial intelligence concerned with algorithms that learn patterns from data and improve at a task with experience, rather than being explicitly programmed with rules. Instead of an engineer writing the logic, the engineer specifies a model and an objective, and the model adjusts its internal parameters to fit examples. The central scientific question is not whether a model fits the data it has seen, but whether it generalises to data it has not.

    Features, labels and the learning objective

    A machine-learning problem is usually framed in terms of features (the input variables describing each example) and, for supervised tasks, labels (the target output to be predicted). For a model predicting house prices, features might include floor area and location, and the label is the sale price. Learning means searching for model parameters that minimise a loss function measuring the gap between predictions and the truth.
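
    The house-price framing can be made concrete with a few lines of code. Everything numeric below is invented for illustration, and distance to the city centre is used as an assumed numeric stand-in for location.

```python
def predict(features, weights, bias):
    """Linear model: price = weights . features + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(predictions, labels):
    """Average squared gap between predictions and the true labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Invented toy data. Features: [floor area in m^2, distance to centre in km].
features = [[50.0, 10.0], [80.0, 5.0], [120.0, 2.0]]
labels = [150.0, 260.0, 400.0]          # sale price, in thousands

# Two candidate parameter settings, also invented.
candidates = {
    "A": ([3.0, 0.0], 0.0),
    "B": ([3.5, -2.0], 0.0),
}

losses = {}
for name, (weights, bias) in candidates.items():
    predictions = [predict(f, weights, bias) for f in features]
    losses[name] = mean_squared_error(predictions, labels)
    print(name, round(losses[name], 2))
```

    Candidate B has the lower loss on these examples, so a learning algorithm searching over parameters would prefer it to A.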

    Machine learning is one paradigm within the broader discipline described in our explainer on artificial intelligence definition and history. Where symbolic AI encodes knowledge by hand, ML infers it statistically from examples.

    The three main paradigms

    Machine learning is conventionally divided into three families, distinguished by what kind of feedback the algorithm receives.

    Type                   | Data used                              | Goal                                          | Typical examples
    Supervised learning    | Labelled examples (features + targets) | Predict a label for new inputs                | Classification, regression
    Unsupervised learning  | Unlabelled data                        | Discover structure                            | Clustering, dimensionality reduction
    Reinforcement learning | Rewards from an environment            | Learn a policy that maximises long-term reward | Control, game playing, sequential decisions

    Supervised learning trains on examples paired with correct answers and learns to predict those answers for unseen inputs; classification predicts categories and regression predicts continuous values. Unsupervised learning works with unlabelled data and seeks hidden structure, for instance grouping similar items (clustering) or compressing many variables into a few (dimensionality reduction). Reinforcement learning learns by trial and error: an agent takes actions, receives rewards or penalties, and gradually improves a policy that maximises cumulative reward.

    The train, validation and test split

    To estimate how well a model will generalise, data is partitioned into three disjoint sets. The training set is used to fit the model’s parameters. The validation set is used to tune choices the algorithm does not learn directly, such as model size or learning rate (the hyperparameters), and to compare candidate models. The test set is held back and used only once, at the end, to give an unbiased estimate of performance on unseen data.
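
    A three-way partition can be sketched as a single seeded shuffle followed by two cuts. The 70/15/15 proportions and the function name are assumptions made for the example, not a recommended standard.

```python
import random


def three_way_split(n_examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle example indices once and cut them into three disjoint sets."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # seeded so the split can be recorded
    n_test = round(n_examples * test_fraction)
    n_val = round(n_examples * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(100)
print(len(train), len(val), len(test))   # 70 15 15

# The three sets share no example and together cover the whole dataset.
assert not (set(train) & set(val) or set(train) & set(test) or set(val) & set(test))
assert sorted(train + val + test) == list(range(100))
```

    Recording the seed and fractions, or simply the resulting index lists, lets others rebuild exactly the same split.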

    The cardinal rule is that the test set must not influence training or model selection. Repeatedly peeking at the test set leaks information and inflates reported performance, a subtle but common source of irreproducible results. We discuss safeguards at length in our guide to reproducibility of machine learning research.

    Overfitting and generalisation

    Overfitting occurs when a model learns the noise and idiosyncrasies of its training data rather than the underlying pattern, performing well on training examples but poorly on new ones. The opposite failure, underfitting, occurs when a model is too simple to capture the real structure. The art of machine learning lies in finding the balance, the so-called bias-variance trade-off, that yields the best generalisation to unseen data. Techniques such as regularisation, early stopping and cross-validation all serve this goal.
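
    Overfitting can be demonstrated on synthetic data by comparing a model that memorises its training set with a simpler one. In this sketch the true pattern, the alternating "noise" and both models are assumptions chosen so that the example is exactly repeatable.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbour(train_x, train_y, x):
    """Memorising model: return the label of the closest training point."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Synthetic data: the true pattern is y = 2x; training labels carry
# deterministic +1/-1 "noise" so the example is exactly repeatable.
train_x = [float(i) for i in range(20)]
train_y = [2.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
test_x = [x + 0.25 for x in train_x]
test_y = [2.0 * x for x in test_x]

slope, intercept = fit_line(train_x, train_y)


def line(x):
    return slope * x + intercept


def memoriser(x):
    return nearest_neighbour(train_x, train_y, x)


results = {}
for name, model in (("line", line), ("memoriser", memoriser)):
    results[name] = (
        mse([model(x) for x in train_x], train_y),
        mse([model(x) for x in test_x], test_y),
    )
    print(f"{name}: train MSE = {results[name][0]:.3f}, "
          f"test MSE = {results[name][1]:.3f}")
```

    The memoriser reaches zero training error by reproducing the noise, yet its test error is far higher than the fitted line's, which is the signature of overfitting described above.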

    Why method reporting matters

    Because performance depends so heavily on the data split, the loss function and the hyperparameters, a machine-learning result is only as credible as its reporting. Standardised vocabulary, captured in the casrai.org research dictionary, helps authors describe their methods consistently, and contribution frameworks such as CRediT help assign credit for the data, software and analysis work involved. Coverage of these issues continues in our AI and ML research outputs category.

    Frequently asked questions

    What is the difference between supervised and unsupervised learning?

    Supervised learning trains on data with known correct answers (labels) and predicts those answers for new inputs. Unsupervised learning works with unlabelled data and instead discovers structure, such as clusters or compressed representations, without a target to predict.

    Why split data into training, validation and test sets?

    The training set fits the model, the validation set tunes hyperparameters and compares models, and the held-out test set gives an unbiased estimate of real-world performance. Mixing these roles inflates results and undermines reproducibility.

    What is overfitting?

    Overfitting is when a model memorises the noise in its training data and therefore performs well on that data but poorly on new examples. The goal of machine learning is generalisation, not memorisation.

    Is machine learning the same as artificial intelligence?

    No. Machine learning is a subfield of artificial intelligence focused on learning from data. AI also includes symbolic reasoning, search and planning that do not learn from examples.