Reproducibility for AI/ML research: model cards, seeds and compute disclosure

Machine-learning research has a reproducibility problem, and the awkward truth is that most of it has nothing exotic behind it. A reported result usually fails to reproduce not because the science is fraudulent or the maths is wrong, but because of mundane omissions: a random seed that was never recorded, a library version that was never pinned, a preprocessing step that lived only in someone’s notebook, a hardware configuration nobody thought to mention. The good news is that, precisely because the causes are mundane, the fixes are tractable: they are matters of documentation and discipline rather than fundamental breakthroughs. This article sets out the practical components of reproducible AI/ML work, drawing on the definitions in the AI/ML research outputs domain of the CASRAI Dictionary and the broader principles in the reproducibility domain.

Why ML is especially fragile

Several features of machine learning conspire to make results fragile. Models are stochastic: random initialisation, shuffling and sampling mean that two runs of the same code can produce different numbers unless randomness is controlled. They are dependency-heavy: results can shift with a change in a framework version, a numerical library, or even a hardware driver. They are data-sensitive: a different split, a different preprocessing choice, or an undocumented filtering step can change a headline metric. And they are increasingly compute-bound: some results depend on hardware and scale that are themselves part of the experiment. None of these is a flaw to be ashamed of, but each is a source of irreproducibility unless it is documented and controlled.

Model cards and datasheets: documenting what you built

The first pillar is structured documentation of the model itself. A model card is a short, standardised document that accompanies a trained model and records what it is, what it was trained and evaluated on, how it performs across relevant conditions, its intended uses, and its known limitations and ethical considerations. The point of a model card is that it travels with the model, so that anyone using or building on it inherits the context they need rather than reconstructing it from a paper’s prose.
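As a rough illustration of a model card that travels with the model, the sketch below stores the sections named above in a small Python dataclass and serialises them to JSON alongside the weights. The class name, field names, model name and metric values are all assumptions made for this example (the numbers are placeholders, not results); real model card templates vary.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ModelCard:
    """Illustrative model card; the field names are assumptions, not a standard schema."""
    name: str
    description: str
    training_data: str
    evaluation_data: str
    metrics: dict = field(default_factory=dict)
    intended_uses: list = field(default_factory=list)
    limitations: list = field(default_factory=list)
    ethical_considerations: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the card so it can be saved next to the model file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def missing_sections(self) -> list:
        """Names of fields left empty, so gaps are visible before release."""
        return [key for key, value in asdict(self).items() if not value]


# Hypothetical model and placeholder numbers, for illustration only.
card = ModelCard(
    name="toy-sentiment-classifier",
    description="Binary sentiment classifier for short English reviews.",
    training_data="Hypothetical review corpus, training split.",
    evaluation_data="Hypothetical review corpus, held-out test split.",
    metrics={"accuracy/overall": 0.0, "accuracy/short_reviews": 0.0},
    intended_uses=["Research on sentiment analysis"],
    limitations=["Not evaluated on non-English text"],
)

print(card.to_json())
print("Sections still to fill in:", card.missing_sections())
```

Checking for empty sections before release is one cheap way to stop a card shipping with its limitations or ethical considerations left blank.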

The complementary artefact for data is the datasheet for datasets, which documents a dataset’s motivation, composition, collection process, preprocessing, recommended uses and limitations. Together, model cards and datasheets address the two halves of an ML experiment whose details most often go unrecorded — the model and the data — and they turn ‘trust me, it works’ into something a reader can interrogate. Both are concrete examples of treating documentation as a first-class research output rather than an afterthought.

Seeds and determinism: making runs repeatable

The second pillar is the humble random seed. Setting and recording seeds for every source of randomness — the framework, the numerical libraries, the data loaders — is the single cheapest reproducibility measure available, and one of the most frequently neglected. Recording the seed lets someone reproduce a specific run; reporting results across several seeds, with variation shown, lets readers judge whether a result is robust or an artefact of a lucky initialisation.
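A minimal sketch of setting and recording seeds might look like the following. The helper name `seed_everything` is an assumption for this example, and it only covers Python's built-in generator and, if installed, NumPy's legacy global generator; a deep-learning framework needs its own seeding call as well (for instance `torch.manual_seed` in PyTorch), which is omitted here to keep the example self-contained.

```python
import os
import random


def seed_everything(seed: int) -> dict:
    """Seed the randomness sources we can find and return a record of what was seeded."""
    record = {"seed": seed, "python_random": True, "numpy": False}

    # Python's built-in pseudo-random generator.
    random.seed(seed)

    # Hash randomisation is fixed at interpreter start-up, so this only
    # affects child processes launched after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)

    # NumPy is optional here; record whether it was actually seeded.
    try:
        import numpy as np
        np.random.seed(seed)
        record["numpy"] = True
    except ImportError:
        pass

    return record


record = seed_everything(1234)
first = random.random()

# Re-seeding with the same value reproduces the same draw.
seed_everything(1234)
second = random.random()

print(record)
print("Same draw after re-seeding:", first == second)
```

Returning a record of what was seeded, rather than seeding silently, gives you something concrete to write into the run log.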

It is worth being honest about the limits here. Even with fixed seeds, full bit-for-bit determinism can be elusive, because some operations on parallel hardware are non-deterministic by default and because results can differ across hardware and library versions. The realistic goal is not always perfect determinism but documented randomness: a reader should know what was fixed, what was not, and how much the results varied as a consequence. A result reported as a mean across seeds with a measure of spread is far more credible than a single number with no indication of how stable it is.
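To make the idea of reporting a mean with a measure of spread concrete, here is a small sketch. The function `run_experiment` is a stand-in invented for this example: it returns a noisy number that depends only on the seed, where a real project would train and evaluate a model.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Stand-in for a training run: a noisy 'accuracy' that depends only on the seed."""
    rng = random.Random(seed)  # local generator, so runs do not interfere with each other
    return 0.80 + rng.gauss(0.0, 0.01)


def summarise(seeds) -> dict:
    """Run one experiment per seed and report the mean and the spread."""
    scores = [run_experiment(seed) for seed in seeds]
    return {
        "seeds": list(seeds),
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


summary = summarise([0, 1, 2, 3, 4])
print(
    f"accuracy = {summary['mean']:.3f} +/- {summary['stdev']:.3f} "
    f"(min {summary['min']:.3f}, max {summary['max']:.3f}) "
    f"over seeds {summary['seeds']}"
)
```

Listing the seeds alongside the summary lets a reader re-run any individual run as well as check the aggregate.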

Compute and environment disclosure

The third pillar is disclosure of the compute and environment in which the work was done. This means recording the hardware used, the software environment (framework and library versions, ideally captured in a pinned dependency specification or a container image), and the scale of the experiment: training time, the amount of computation involved, and the resources required. Disclosure serves two purposes at once. It supports reproducibility, because a result obtained on particular hardware with particular software may not reproduce elsewhere without that context. And it supports honesty and sustainability, because the computational and environmental cost of large-scale training is itself a material fact that readers, reviewers and funders increasingly expect to see stated rather than hidden.
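The scale figures can be derived with simple arithmetic. The sketch below assumes a hypothetical run on 8 accelerators for 12.5 hours at an average draw of 300 W per device and a facility overhead factor (PUE) of 1.2, which works out to 100 device-hours and about 36 kWh. Every number and field name here is an assumption for illustration; real disclosures should use measured values where possible.

```python
from dataclasses import dataclass


@dataclass
class ComputeDisclosure:
    """Illustrative record of experiment scale; field names are assumptions."""
    accelerator: str
    num_devices: int
    wall_clock_hours: float
    avg_power_watts_per_device: float  # measured if possible, otherwise an estimate
    pue: float = 1.0  # power usage effectiveness; 1.0 means no facility overhead

    @property
    def device_hours(self) -> float:
        """Total accelerator time: devices multiplied by wall-clock hours."""
        return self.num_devices * self.wall_clock_hours

    @property
    def energy_kwh(self) -> float:
        """Estimated energy: device-hours x watts per device / 1000, scaled by PUE."""
        return self.device_hours * self.avg_power_watts_per_device / 1000 * self.pue


# Hypothetical run, for illustration only.
run = ComputeDisclosure(
    accelerator="example-gpu",
    num_devices=8,
    wall_clock_hours=12.5,
    avg_power_watts_per_device=300.0,
    pue=1.2,
)

print(f"{run.device_hours:.1f} device-hours, about {run.energy_kwh:.1f} kWh")
```

This estimate covers only the accelerators; CPU, memory, storage and failed or exploratory runs would need to be added for a fuller picture.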

Capturing the environment in a reusable form — a container, a pinned environment file, a recorded command line — is what lets a reader move from reading about a result to re-running it, which is the real test of reproducibility.
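One lightweight way to capture part of the environment from inside a run is sketched below, using only the Python standard library. The function name `capture_environment` and the package list are assumptions for this example; it complements, rather than replaces, a pinned environment file or container image.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages) -> dict:
    """Record interpreter, platform, command line and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; record the absence explicitly

    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "command_line": list(sys.argv),
        "packages": versions,
    }


# The package names are examples; list whatever the experiment actually imports.
env = capture_environment(["numpy", "torch"])
print(json.dumps(env, indent=2))
```

Writing this record to a file next to the results means each reported number can be traced back to the versions and command that produced it.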

Software and the FAIR4RS principles

Underlying all of this is the recognition that the code is a research output, to be shared, versioned, identified and cited like any other. The FAIR4RS principles — FAIR for Research Software — adapt the familiar Findable, Accessible, Interoperable and Reusable framework to software, acknowledging that code has characteristics (executability, dependencies, versions) that data alone does not. Treating ML code as a citable, archived output with a persistent identifier, rather than as a transient artefact, is what makes the model card, the seeds and the compute disclosure add up to something reproducible rather than merely well-described.

Crediting the work properly

Reproducible ML research is rarely the work of one person, and the contributions are varied: building the model, curating the data, writing the evaluation, managing the compute. Recording who did what through structured contributorship — the roles set out in the CRediT taxonomy — makes that division of labour visible and creditable, which matters all the more in collaborative ML projects where data, code, models and evaluation are often distinct workstreams. The consistent vocabulary for describing AI/ML outputs, their documentation and their reproducibility is maintained in the CASRAI Dictionary, so that a claim of reproducibility can be expressed, recorded and checked across the systems that track research outputs.
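Structured contributorship can be kept in machine-readable form and checked against the taxonomy. The sketch below lists the fourteen CRediT roles and flags anything that does not match; the contributor names and the helper `unknown_roles` are hypothetical, introduced only for this example.

```python
# The fourteen roles of the CRediT taxonomy.
CREDIT_ROLES = frozenset({
    "Conceptualization",
    "Data curation",
    "Formal analysis",
    "Funding acquisition",
    "Investigation",
    "Methodology",
    "Project administration",
    "Resources",
    "Software",
    "Supervision",
    "Validation",
    "Visualization",
    "Writing – original draft",
    "Writing – review & editing",
})


def unknown_roles(contributions: dict) -> dict:
    """Return, per contributor, any listed roles that are not CRediT roles."""
    problems = {}
    for person, roles in contributions.items():
        bad = sorted(set(roles) - CREDIT_ROLES)
        if bad:
            problems[person] = bad
    return problems


# Hypothetical contributors, for illustration only.
contributions = {
    "Contributor A": ["Software", "Methodology"],
    "Contributor B": ["Data curation", "Validation"],
    "Contributor C": ["Resources", "Model training"],  # "Model training" is not a CRediT role
}

print(unknown_roles(contributions))
```

In this example, managing the compute maps most naturally onto "Resources", while an ad hoc label such as "Model training" is flagged so it can be replaced with the closest taxonomy role.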
