Machine-learning (ML) reproducibility is the ability of an independent party to obtain results consistent with a published study using the same code, data and computational configuration. It is a persistent challenge: many ML papers report results that others cannot reproduce, not through misconduct but because critical details, such as random seeds, data versions and compute settings, go unrecorded. Fixing this is a matter of disciplined reporting rather than new science, and a set of practical standards has emerged to make ML results reliably reproducible.
Why ML results are hard to reproduce
Several sources of variation conspire against reproducibility. ML training is inherently stochastic: random weight initialisation, data shuffling and randomised algorithms mean two runs of the same code can yield different models. Results are also sensitive to the exact data version and preprocessing, to hyperparameters, and to the software and hardware environment, since different library versions or GPU behaviour can change outcomes. When a paper omits these, the reported numbers cannot be regenerated. The train/validation/test discipline that guards against inflated results is covered in our explainer on machine learning concepts and methods.
Random seeds and reporting variance
Setting and recording a seed for every source of randomness makes a single run repeatable, at least on the same software and hardware stack, since some GPU operations remain non-deterministic even with fixed seeds. But a fixed seed is not the whole story: because results vary across seeds, robust practice is to report performance across multiple seeds with a measure of spread, such as the standard deviation, not a single best run. This distinguishes a genuine improvement from one that merely got a lucky initialisation.
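As a rough illustration of multi-seed reporting, the sketch below runs a toy experiment once per seed and reports the mean and sample standard deviation. The function `train_and_evaluate` is a hypothetical stand-in for a real training run, and the score it returns is simulated, not a measured result. In a real project the numerical and deep-learning libraries in use would each need seeding as well.

```python
import random
import statistics


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for a training run; returns a simulated accuracy."""
    rng = random.Random(seed)  # isolated generator: the result depends only on the seed
    # Simulate seed-dependent noise around a nominal score of 0.80.
    return 0.80 + rng.gauss(0.0, 0.01)


def summarise_over_seeds(seeds: list) -> dict:
    """Run once per seed and report the mean and sample standard deviation."""
    scores = [train_and_evaluate(s) for s in seeds]
    return {
        "seeds": list(seeds),
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # requires at least two seeds
    }


report = summarise_over_seeds([0, 1, 2, 3, 4])
print(
    f"accuracy = {report['mean']:.4f} +/- {report['stdev']:.4f} "
    f"over {len(report['seeds'])} seeds"
)
```

Recording the seed list alongside the summary lets a reader re-run any individual seed as well as check the aggregate.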
Data and model versioning
Reproducibility requires knowing exactly which data and which model produced a result. Data versioning records the precise dataset snapshot, including any cleaning, filtering and splits, so the same inputs can be reconstructed. Model versioning records the trained weights and the configuration that produced them. This provenance is the engineering counterpart to the documentation artefacts described in our piece on AI model documentation: datasheets and model cards describe what the data and model are, while versioning lets others retrieve the exact instances used.
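One lightweight way to pin down a dataset snapshot is to record a content hash and the exact split membership in a manifest stored next to the model weights. The sketch below assumes a small in-memory dataset; the records, the `fingerprint` and `split_indices` helpers, and the configuration values are all invented for illustration, and dedicated data-versioning tools would normally take this role for large datasets.

```python
import hashlib
import json
import random


def fingerprint(records: list) -> str:
    """SHA-256 over a canonical JSON encoding of the records."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def split_indices(n: int, test_fraction: float, seed: int) -> dict:
    """Deterministic train/test split of indices 0..n-1 for a given seed."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    return {"test": sorted(indices[:n_test]), "train": sorted(indices[n_test:])}


# Toy dataset, invented for illustration only.
records = [{"id": i, "text": f"example {i}", "label": i % 2} for i in range(10)]

manifest = {
    "data_sha256": fingerprint(records),
    "split_seed": 42,
    "splits": split_indices(len(records), test_fraction=0.2, seed=42),
    "model_config": {"learning_rate": 0.001, "epochs": 3},  # hypothetical values
}
print(json.dumps(manifest, indent=2))
```

Storing the explicit split indices, not only the seed, means the split can be recovered even if a later library version shuffles differently.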
| Practice | What it captures | Why it matters |
|---|---|---|
| Random seeds | All sources of randomness | Makes a run repeatable; report across seeds for variance |
| Data versioning | Exact dataset snapshot and splits | Lets others reconstruct the same inputs |
| Model versioning | Trained weights and configuration | Identifies exactly which model produced a result |
| Environment reporting | Library versions, hardware, compute | Controls for software and hardware variation |
| Shared code and weights | The implementation itself | Enables direct re-execution and scrutiny |
Environment and compute reporting
Results depend on the computational environment, so reproducible studies report the software stack (framework and library versions), the hardware used (such as the GPU type and count), and the compute budget, including training time and the number of runs. Capturing dependencies, for example through a pinned environment file or a container, lets others recreate the conditions rather than guess at them. Compute reporting also supports honest comparison, since a method that wins only with vastly more compute is a different claim from one that wins under equal budgets.
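A minimal sketch of environment capture using only the Python standard library is shown below. The package names passed in are examples, not a recommendation, and the `capture_environment` helper is hypothetical. GPU type and count are not covered here, because reading them requires the framework's or the vendor's own tooling.

```python
import json
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def capture_environment(packages: list) -> dict:
    """Collect interpreter, operating system and package versions in one record."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
    }


# Example package names; replace with the libraries the study actually uses.
env = capture_environment(["numpy", "torch", "pip"])
print(json.dumps(env, indent=2))
```

Saving this record with each set of results complements, but does not replace, a pinned environment file or container image.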
Sharing code, weights and reproducibility checklists
The single most effective step is to share the code and trained weights alongside the paper, so reviewers and readers can re-run the experiments directly. To make expectations concrete, the community has adopted reproducibility checklists, such as the machine-learning reproducibility checklist used by major conferences, which prompt authors to confirm that they have reported data, code, hyperparameters, compute and statistical significance. Treating these checklists as standard practice raises the floor for the whole field. We track these standards across our AI and ML research outputs coverage, with shared terminology anchored in the casrai.org research dictionary and contribution credit handled through CRediT.
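A checklist can also be enforced mechanically before submission. The sketch below checks a submission record against the five items named above; the item names, the `missing_items` helper and the example record are illustrative assumptions, not the wording or format of any conference's official checklist.

```python
# Items taken from the list above; the key names are an assumption of this sketch.
REQUIRED_ITEMS = (
    "data",
    "code",
    "hyperparameters",
    "compute",
    "statistical_significance",
)


def missing_items(submission: dict) -> list:
    """Return the checklist items that are absent or left empty."""
    return [item for item in REQUIRED_ITEMS if not submission.get(item)]


# Hypothetical submission record with two gaps.
submission = {
    "data": "dataset snapshot and splits described in an appendix",
    "code": "repository link supplied with the paper",
    "hyperparameters": {"learning_rate": 0.001, "batch_size": 32},
    "compute": "",  # left blank, so it is flagged
}
print("Missing:", missing_items(submission))
```

A check like this only confirms that each item has been filled in; whether the reported detail is sufficient still needs human review.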
Frequently asked questions
Why are machine-learning results often hard to reproduce?
Because training is stochastic and results depend on random seeds, exact data versions, hyperparameters and the software and hardware environment. When papers omit these details, the reported numbers cannot be regenerated.
Is setting a random seed enough for reproducibility?
No. A fixed seed makes one run repeatable in a fixed environment, but because results vary across seeds, robust practice is to report performance over multiple seeds with a measure of spread, not a single best run.
What is a reproducibility checklist?
It is a structured list, adopted by major ML venues, that prompts authors to confirm they have reported data, code, hyperparameters, compute and statistical significance, raising the baseline standard for the field.
What is the single most effective reproducibility step?
Sharing the code and trained weights alongside the paper, together with the exact data and environment, so that others can directly re-run and scrutinise the experiments.







