Tag: AI/ML research outputs

  • Reproducibility for AI/ML research: model cards, seeds and compute disclosure

    Machine-learning research has a reproducibility problem, and the awkward truth is that most of it is not about anything exotic. A reported result usually fails to reproduce not because the science is fraudulent or the maths is wrong, but because of mundane omissions: a random seed that was never recorded, a library version that was never pinned, a preprocessing step that lived only in someone’s notebook, a hardware configuration nobody thought to mention. The good news is that, precisely because the causes are mundane, the fixes are tractable: they are matters of documentation and discipline rather than fundamental breakthroughs. This article sets out the practical components of reproducible AI/ML work, drawing on the definitions in the AI/ML research outputs domain of the CASRAI Dictionary and the broader principles in the reproducibility domain.

    Why ML is especially fragile

    Several features of machine learning conspire to make results fragile. Models are stochastic: random initialisation, shuffling and sampling mean that two runs of the same code can produce different numbers unless randomness is controlled. They are dependency-heavy: results can shift with a change in a framework version, a numerical library, or even a hardware driver. They are data-sensitive: a different split, a different preprocessing choice, or an undocumented filtering step can change a headline metric. And they are increasingly compute-bound: some results depend on hardware and scale that are themselves part of the experiment. None of these is a flaw to be ashamed of, but each is a source of irreproducibility unless it is documented and controlled.

    Model cards and datasheets: documenting what you built

    The first pillar is structured documentation of the model itself. A model card is a short, standardised document that accompanies a trained model and records what it is, what it was trained and evaluated on, how it performs across relevant conditions, its intended uses, and its known limitations and ethical considerations. The point of a model card is that it travels with the model, so that anyone using or building on it inherits the context they need rather than reconstructing it from a paper’s prose.

    The complementary artefact for data is the datasheet for datasets, which documents a dataset’s motivation, composition, collection process, preprocessing, recommended uses and limitations. Together, model cards and datasheets address the two halves of an ML experiment whose details most often go unrecorded — the model and the data — and they turn ‘trust me, it works’ into something a reader can interrogate. Both are concrete examples of treating documentation as a first-class research output rather than an afterthought.

    Seeds and determinism: making runs repeatable

    The second pillar is the humble random seed. Setting and recording seeds for every source of randomness — the framework, the numerical libraries, the data loaders — is the single cheapest reproducibility measure available, and one of the most frequently neglected. Recording the seed lets someone reproduce a specific run; reporting results across several seeds, with variation shown, lets readers judge whether a result is robust or an artefact of a lucky initialisation.

    It is worth being honest about the limits here. Even with fixed seeds, full bit-for-bit determinism can be elusive, because some operations on parallel hardware are non-deterministic by default and because results can differ across hardware and library versions. The realistic goal is not always perfect determinism but documented randomness: a reader should know what was fixed, what was not, and how much the results varied as a consequence. A result reported as a mean across seeds with a measure of spread is far more credible than a single number with no indication of how stable it is.
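
    As a minimal sketch of documented randomness, the Python below runs a toy experiment once per seed and reports the mean with its spread. The function names and the simulated accuracy are illustrative assumptions, not a real training run; an actual project would also need to seed its numerical libraries, framework and data loaders through their own seeding calls.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Toy stand-in for a training run: returns a noisy 'accuracy'.

    Hypothetical: the 0.90 baseline and the noise level are made up
    purely so that different seeds give different scores.
    """
    rng = random.Random(seed)  # private generator, no hidden global state
    return 0.90 + rng.gauss(0.0, 0.01)


def report_across_seeds(seeds: list[int]) -> dict:
    """Run once per seed and summarise the spread alongside the mean."""
    scores = [run_experiment(s) for s in seeds]
    return {
        "seeds": seeds,  # recorded so any single run can be repeated
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
    }


summary = report_across_seeds([0, 1, 2, 3, 4])
print(
    f"accuracy = {summary['mean']:.4f} ± {summary['stdev']:.4f} "
    f"over seeds {summary['seeds']}"
)
```

    Recording the seed list next to the summary is the point of the design: the mean and spread tell a reader how stable the result is, and the seeds let them repeat any individual run.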

    Compute and environment disclosure

    The third pillar is disclosure of the compute and environment in which the work was done. This means recording the hardware used, the software environment (framework and library versions, ideally captured in a pinned dependency specification or a container image), and the scale of the experiment — training time, the amount of computation involved, and the resources required. This serves two purposes at once. It supports reproducibility, because a result obtained on particular hardware with particular software may not reproduce elsewhere without that context. And it supports honesty and sustainability, because the computational and environmental cost of large-scale training is itself a material fact that readers, reviewers and funders increasingly expect to see stated rather than hidden.

    Capturing the environment in a reusable form — a container, a pinned environment file, a recorded command line — is what lets a reader move from reading about a result to re-running it, which is the real test of reproducibility.
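
    One lightweight way to record part of that environment is sketched below, using only the Python standard library. The package names passed in are illustrative assumptions, and the sketch deliberately does not attempt to capture accelerator models or driver versions, which would need tooling specific to the hardware in use.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages: list[str]) -> dict:
    """Collect interpreter, OS and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed: recorded rather than hidden
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "command_line": sys.argv,  # the recorded command line
        "packages": versions,
    }


# Illustrative package names; list whatever the experiment actually imports.
env = capture_environment(["numpy", "torch"])
print(json.dumps(env, indent=2))
```

    Writing this record to a file alongside the results complements, rather than replaces, a pinned dependency specification or container image.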

    Software and the FAIR4RS principles

    Underlying all of this is the recognition that the code is a research output, to be shared, versioned, identified and cited like any other. The FAIR4RS principles — FAIR for Research Software — adapt the familiar Findable, Accessible, Interoperable and Reusable framework to software, acknowledging that code has characteristics (executability, dependencies, versions) that data alone does not. Treating ML code as a citable, archived output with a persistent identifier, rather than as a transient artefact, is what makes the model card, the seeds and the compute disclosure add up to something reproducible rather than merely well-described.

    Crediting the work properly

    Reproducible ML research is rarely the work of one person, and the contributions are varied: building the model, curating the data, writing the evaluation, managing the compute. Recording who did what through structured contributorship — the roles set out in the CRediT taxonomy — makes that division of labour visible and creditable, which matters all the more in collaborative ML projects where data, code, models and evaluation are often distinct workstreams. The consistent vocabulary for describing AI/ML outputs, their documentation and their reproducibility is maintained in the CASRAI Dictionary, so that a claim of reproducibility can be expressed, recorded and checked across the systems that track research outputs.

  • Crediting contributions to AI/ML research: data, code, models and evaluation

    Machine-learning research distributes its intellectual labour differently from a conventional empirical study. The work that determines whether a result is any good is spread across data collection and annotation, code, model training, and evaluation — and the people who do each of those things are often different people. So how well do the 14 roles of CRediT describe who did what on an AI/ML paper? Better than one might fear, with a few well-understood friction points. This article walks through the mapping, role by role, for the benefit of anyone writing a CRediT author statement for ML work.

    Start from the lifecycle, not the role list

    The cleanest way to assign CRediT roles to ML work is to walk the lifecycle and ask, at each stage, who contributed and which role names that contribution. A typical AI/ML project moves through: framing the problem and research goals; designing the method or model architecture; assembling, cleaning, and annotating data; implementing and training; evaluating; and writing it up. Each stage has a natural CRediT home.

    Conceptualization and Methodology: the ideas and the design

    The framing of the research question — what problem the model is meant to solve, what would count as success — is Conceptualization, exactly as in any other field. The design of the method is where ML gets its own texture. A genuinely novel architecture, training objective, or learning algorithm is Methodology in the canonical sense: “development or design of methodology; creation of models.” The phrase “creation of models” sits slightly oddly here, because in ML “model” can mean either the conceptual method or the concrete trained weights; the CRediT definition means the former. Designing the experimental protocol — what gets held out, how runs are seeded, what ablations are performed — is also Methodology.

    Data curation and Investigation: the part that decides the result

    In ML, data quality usually matters more than model cleverness, and the people who do data work are frequently undercredited. CRediT offers two relevant roles. Investigation covers “performing the experiments, or data/evidence collection” — the gathering of the raw data, the running of the training experiments themselves. Data curation covers “management activities to annotate (produce metadata), scrub data and maintain research data… for initial use and later re-use” — which is an almost exact description of dataset cleaning, labelling, deduplication, and the construction of the documented, reusable dataset.

    The practical advice is to use both roles deliberately and not to let Investigation swallow everything. The person who designed the annotation scheme and produced the dataset’s metadata is doing Data curation, and saying so makes visible a contribution that is otherwise invisible — and that, by the field’s own lights, often determines the outcome. The datasheet for the dataset is, in effect, a written artefact of that Data curation work.

    Software: central, and overloaded

    Almost all ML work involves code, so Software — “programming, software development; designing computer programs; implementation of the computer code and supporting algorithms; testing of existing code components” — is the most frequently assigned role. It is also the most overloaded. On a real project, “Software” can cover the researcher who implemented the novel method, the engineer who built the training pipeline, the person who wrote the data-loading code, and whoever maintains the evaluation harness. CRediT gives all of them the same role name.

    This is the same limitation we have documented for software papers: the Software role lacks sub-roles for implementation, testing, infrastructure, and maintenance. The current best practice is to use the degree-of-contribution qualifier (lead / equal / supporting) to differentiate, and to carry finer-grained per-component contributorship in the repository’s own metadata — a CITATION.cff file or the model card’s authorship section — rather than trying to force it all into the paper’s CRediT statement.

    Validation: evaluation is its own contribution

    The single most useful point in this whole mapping is that Validation exists and should be used. Its definition — “verification… of the overall replication/reproducibility of results/experiments and other research outputs” — fits the work of building and running an evaluation suite almost perfectly. The person who designed the evaluation, guarded against test-set contamination, ran the baselines, and confirmed that the reported numbers reproduce is doing Validation, and in ML that is frequently the difference between a trustworthy result and a misleading one.

    Because evaluation is so central to ML and so often distinct from the modelling work, assigning Validation as a lead role to the person who owned evaluation is one of the highest-value things a CRediT statement for ML can do. It is also under-used, because the habit of treating evaluation as an undifferentiated part of “the experiments” persists.

    The remaining roles

    The rest map without surprises. Producing figures, training curves, and visualisations is Visualization. Providing compute is Resources, whose definition explicitly includes “computing resources, or other analysis tools”; on compute-intensive projects, the contribution of whoever secured and managed the GPU allocation is real and nameable. Writing the paper is Writing – original draft and Writing – review & editing. Leading the project is Supervision and Project administration; securing the grant is Funding acquisition.

    Where AI assistance fits, and where it does not

    One thing CRediT deliberately does not represent is the use of AI tools to do the work, such as an AI coding assistant that helped write the training code or a model that drafted prose. That is a disclosure matter, not a contributorship matter. The community has largely settled on the position that AI systems are not contributors, and the prevailing view is that AI use should be tracked as a separate dimension rather than as a CRediT role. CASRAI has written separately on authorship and AI; the short version is that a human who used an AI tool to discharge a role still gets that role, and the AI use is disclosed elsewhere.

    A worked statement

    A. Okonkwo: Conceptualization, Methodology (lead), Writing – original draft. B. Lindqvist: Data curation (lead), Investigation. C. Nakamura: Software (lead), Methodology (supporting). D. Rossi: Validation (lead), Software (supporting). E. Mwangi: Visualization, Writing – review & editing. F. Schmidt: Resources, Supervision, Funding acquisition.

    Read off, this says: someone designed the method and wrote the paper; someone else built the dataset; someone else implemented the system; someone else owned evaluation; someone made the figures and edited; and someone provided the compute and led the project. That is a far truer account of an ML project than “six authors,” and it is exactly what CRediT is for.
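
    For anyone who keeps contributorship in machine-readable form, the worked statement above could be encoded roughly as follows. This is a sketch under assumptions: the dictionary layout and the helper names check_statement and leads are hypothetical, not a CASRAI or CRediT schema. The role names are the 14 CRediT roles and the degrees are the lead / equal / supporting qualifiers.

```python
# The 14 CRediT roles, spelled as in the taxonomy.
CREDIT_ROLES = {
    "Conceptualization", "Data curation", "Formal analysis",
    "Funding acquisition", "Investigation", "Methodology",
    "Project administration", "Resources", "Software", "Supervision",
    "Validation", "Visualization", "Writing – original draft",
    "Writing – review & editing",
}
DEGREES = {"lead", "equal", "supporting"}

# (role, degree) pairs; degree is None where the statement gives no qualifier.
statement = {
    "A. Okonkwo": [("Conceptualization", None), ("Methodology", "lead"),
                   ("Writing – original draft", None)],
    "B. Lindqvist": [("Data curation", "lead"), ("Investigation", None)],
    "C. Nakamura": [("Software", "lead"), ("Methodology", "supporting")],
    "D. Rossi": [("Validation", "lead"), ("Software", "supporting")],
    "E. Mwangi": [("Visualization", None),
                  ("Writing – review & editing", None)],
    "F. Schmidt": [("Resources", None), ("Supervision", None),
                   ("Funding acquisition", None)],
}


def check_statement(stmt: dict) -> list[str]:
    """Return a list of problems: unknown roles or unknown degrees."""
    problems = []
    for person, roles in stmt.items():
        for role, degree in roles:
            if role not in CREDIT_ROLES:
                problems.append(f"{person}: unknown role {role!r}")
            if degree is not None and degree not in DEGREES:
                problems.append(f"{person}: unknown degree {degree!r}")
    return problems


def leads(stmt: dict) -> dict:
    """Map each role that has a named lead to that person."""
    return {
        role: person
        for person, roles in stmt.items()
        for role, degree in roles
        if degree == "lead"
    }


print(check_statement(statement))
print(leads(statement))
```

    A check like this catches informal role names, such as “Evaluation” written where Validation is meant, before the statement reaches a publisher’s submission system.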

    What to do now

    Use the full role set, not just Software and Writing. Credit Data curation and Validation explicitly — they are where ML results are won or lost. Use the degree-of-contribution qualifier to differentiate within overloaded roles, and push fine-grained software contributorship into the repository’s own metadata. Disclose AI use separately from contributorship. CASRAI’s author-statement guidance has the templates.


  • System cards, benchmarks and evaluation suites as first-class research outputs

    A previous article in this series made the case for model cards and datasheets as documented research outputs. That is the well-established end of the spectrum. A wider and faster-moving set of AI/ML artefacts is now emerging as citable scholarship: system cards, evaluation benchmarks, and the evaluation suites that run against them. These are less settled than model cards, but they are arguably more consequential for how the field measures its own progress, and they belong in the conversation about what counts as a research output. CASRAI’s AI/ML research outputs domain treats them as first-class entries.

    From model cards to system cards

    A model card describes a model. A system card describes a deployed system — which is usually a model plus a great deal else: input and output filters, retrieval components, tool-use scaffolding, safety mitigations, usage policies, and the human processes wrapped around the whole thing. The distinction matters because the behaviour a user encounters is a property of the system, not the model in isolation.

    System cards became prominent as the major model developers began publishing them alongside frontier model releases, documenting not just capabilities and evaluations but the risk assessments, red-teaming exercises, and mitigation decisions that shaped the deployment. A system card typically records what the system can do, what failure modes were identified, what was done to mitigate them, and what residual risks remain. It is closer in spirit to a safety case than to a specification sheet.

    For the research record, the significance is that a system card documents decisions, not just facts. It is the contemporaneous account of how a team reasoned about deploying a powerful artefact responsibly. That reasoning is itself a research contribution — and, when the system later behaves in unexpected ways, the system card is the record against which those decisions can be assessed.

    Benchmarks as scholarly infrastructure

    An evaluation benchmark is a standardised dataset paired with a scoring protocol, used to measure and compare the capability of models on a defined task. Benchmarks are not new — they have organised whole subfields of machine learning for decades — but their status as research outputs has historically been ambiguous. A benchmark is often introduced in a paper, but the artefact that matters and gets reused for years afterwards is the dataset-and-protocol, not the paper describing it.

    The honest view is that benchmarks are among the highest-leverage outputs in the field. A widely adopted benchmark shapes what thousands of researchers optimise for; a flawed one steers the field toward the wrong objective. Building a good benchmark — defining the task crisply, assembling a representative dataset, designing a scoring procedure that resists gaming, and documenting all of it — is substantial and genuinely scholarly work. It deserves to be cited and credited as such, not buried as the apparatus of a single paper.

    Evaluation suites and the contamination problem

    An evaluation suite bundles multiple benchmarks and tasks together with the harness that runs them, so that a model release can be assessed consistently and reproducibly. The suite is the executable counterpart to the benchmark: where a benchmark is the standard, the suite is the instrument that applies it.

    Treating evaluation suites as research outputs surfaces a problem that the field takes seriously: test-set contamination. When benchmark data leaks into the training corpus of the models being evaluated, the resulting scores measure memorisation rather than capability, and the benchmark silently loses its validity. Documenting an evaluation suite as a versioned, identifier-bearing output, with a clear record of which data is held out and when it was published, is part of how the community defends against contamination. A benchmark whose provenance and release date are pinned by a persistent identifier can be checked against a model’s training cut-off; an undocumented one cannot.
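
    The date check described here can be sketched in a few lines of Python. The class name, benchmark names, versions and dates are all hypothetical, and the check is deliberately conservative: a release date on or before the cut-off does not prove contamination, it only means leakage cannot be ruled out on timing grounds.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkRelease:
    """Provenance for one benchmark version (all values hypothetical)."""
    name: str
    version: str
    released: date  # date this version became publicly available


def contamination_possible(benchmark: BenchmarkRelease,
                           training_cutoff: date) -> bool:
    """True if the benchmark was public on or before the model's cut-off."""
    return benchmark.released <= training_cutoff


bench_v1 = BenchmarkRelease("example-bench", "1.0", date(2023, 3, 1))
bench_v2 = BenchmarkRelease("example-bench", "2.0", date(2024, 9, 15))
cutoff = date(2024, 1, 31)  # hypothetical training cut-off from a model card

for bench in (bench_v1, bench_v2):
    if contamination_possible(bench, cutoff):
        flag = "check for leakage"
    else:
        flag = "post-dates cut-off"
    print(f"{bench.name} v{bench.version}: {flag}")
```

    The comparison is only as good as its inputs, which is why the release date needs to hang on a versioned identifier rather than on memory.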

    Why first-class status matters

    The recurring theme across these artefact types is that the thing that gets reused is not the paper. It is the system card, the benchmark dataset, the evaluation harness. When these are treated as mere supplementary material, three things go wrong:

    • Credit is misallocated. The team that built a benchmark used by the whole field gets a single citation to a paper, while the artefact’s real influence is invisible to any contribution-based assessment.
    • Versioning is lost. Benchmarks and suites evolve — tasks are added, errors corrected, contaminated items removed. Without identifiers and versioning, it becomes impossible to say which version of a benchmark a given result was measured against.
    • Provenance breaks. The contamination defence above depends entirely on knowing exactly what a benchmark contained and when it was released. That is metadata, and metadata needs an identifier to hang on.

    How persistent identifiers and CRediT apply

    The infrastructure pattern mirrors that for model cards and datasets. A benchmark or evaluation suite is deposited in a repository and receives a DataCite DOI; each substantive revision receives its own version DOI, so that results can cite the exact version measured against. The executable harness is pinned by a Software Heritage ID. Contributors carry ORCID iDs, institutions carry ROR IDs, and where the work belongs to a larger programme a RAiD ties it together. CASRAI’s persistent-identifier guidance covers the deposit pattern.
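
    A reported result that follows this pattern might carry its identifiers as in the Python sketch below. The record layout and field names are assumptions for illustration, and the DOI and Software Heritage ID shown are placeholders, not real deposits. The patterns check only the general shape of the identifiers; they do not confirm that an identifier resolves.

```python
import re
from dataclasses import dataclass

# Shape checks only. The DOI pattern is a common heuristic, not a full grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
SWHID_PATTERN = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")


@dataclass(frozen=True)
class ReportedResult:
    """One metric value, tied to the exact benchmark version and harness."""
    metric: str
    value: float
    benchmark_version_doi: str  # version DOI, not the concept DOI
    harness_swhid: str          # pins the exact code state of the harness

    def missing_provenance(self) -> list[str]:
        """Name the identifiers that are absent or malformed."""
        problems = []
        if not DOI_PATTERN.match(self.benchmark_version_doi):
            problems.append("benchmark version DOI")
        if not SWHID_PATTERN.match(self.harness_swhid):
            problems.append("harness SWHID")
        return problems


# Placeholder identifiers for illustration only.
good = ReportedResult(
    metric="accuracy",
    value=0.87,
    benchmark_version_doi="10.1234/example-bench.v2",
    harness_swhid="swh:1:rev:" + "0" * 40,
)
vague = ReportedResult("accuracy", 0.87, "example-bench (latest)", "main")

print(good.missing_provenance())
print(vague.missing_provenance())
```

    A result like the second one, measured against “latest” on an unpinned branch, is exactly the case where nobody can later say which version of the benchmark was used.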

    Contributorship maps onto CRediT with the same caveats discussed for models and datasets. Designing the benchmark task is Conceptualization and Methodology; assembling and annotating the data is Investigation and Data curation; building the harness is Software; running and verifying the evaluation is Validation. As with software papers, the Software role tends to be overloaded — a known limitation of applying CRediT to code-heavy outputs — but the statement is still more informative than an undifferentiated author list.

    What to do now

    For researchers: deposit benchmarks and evaluation suites as versioned outputs with DataCite DOIs and Software Heritage IDs, document them with the rigour of a datasheet, record the release date and held-out contents explicitly, and attach a CRediT statement. For funders and institutions: recognise system cards, benchmarks, and evaluation suites in CRIS systems and assessment as the high-leverage outputs they are, rather than as apparatus. For the field’s collective health, the benchmark you cite should be one you can name a version of.


  • Model cards and datasheets: documenting AI/ML research outputs

    For most of the history of the scholarly record, the unit of documentation was the paper. A piece of empirical research was described, peer-reviewed, and citable as an article; the underlying data and code were, at best, supplementary. Machine-learning research has been quietly rewriting that assumption. A trained model and the dataset it learned from are research outputs in their own right, and the community has developed its own documentation conventions for them: the model card and the datasheet for datasets. This piece sets out what they are, where they came from, and why they belong in the formal research record that CASRAI’s AI/ML research outputs domain is designed to describe.

    Model cards: a short, structured account of a model

    The model card was proposed by Margaret Mitchell and colleagues in their 2019 paper “Model Cards for Model Reporting”. The idea is disarmingly simple: every trained model should ship with a short, structured document that answers the questions a responsible user would need to ask before relying on it. Who built it and when? What is it intended to do, and what is it explicitly not intended to do? What data was it trained on? How was it evaluated, and on which populations or subgroups? What are its known limitations, failure modes, and ethical considerations?

    The motivating insight was that aggregate performance numbers conceal more than they reveal. A model that is 95% accurate overall can be 99% accurate for one group and 70% for another. A model card’s evaluation section is expected to report performance disaggregated across relevant factors, so that the user can see where the model works and where it does not. This is documentation in service of accountability, not marketing.
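
    The arithmetic behind that example is easy to reproduce. The Python sketch below uses made-up counts (500 examples in group A, 80 in group B) chosen so that the figures in the paragraph come out exactly; the group labels and the function name are illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation outcomes: group A is large, group B is small.
records = (
    [("A", True)] * 495 + [("A", False)] * 5
    + [("B", True)] * 56 + [("B", False)] * 24
)

overall = sum(ok for _, ok in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall: {overall:.2%}")          # overall: 95.00%
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2%}")    # A: 99.00%, B: 70.00%
```

    Because group A supplies most of the examples, its 99% dominates the aggregate and the 70% for group B disappears unless the results are reported per group.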

    Model cards have since become near-ubiquitous in practice. The Hugging Face Hub, a widely used model registry, renders each model repository’s README as its model card, and the convention has spread to internal model registries across industry and academia. The format is loose enough to suit a small fine-tuned classifier or a large foundation model, but the core sections (intended use, training data, evaluation, limitations) are stable.

    Datasheets for datasets: provenance for the data

    The companion convention for data is the datasheet for datasets, proposed by Timnit Gebru and colleagues in 2018 (revised and published in Communications of the ACM in 2021). The analogy in the title is to the datasheets that accompany electronic components: a structured specification that lets an engineer decide whether a part is fit for their purpose.

    A datasheet works through a dataset’s full lifecycle in a series of question prompts. Motivation: why was the dataset created, and by whom? Composition: what does each instance represent, are there labels, are there sensitive subpopulations? Collection process: how was the data acquired, was consent obtained, were people aware they were being recorded? Preprocessing and cleaning: what was done to the raw data, and is the raw data preserved? Uses: what has the dataset been used for, and what uses should be avoided? Distribution and maintenance: how is it licensed, who maintains it, and how will errors be corrected?
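
    The question prompts lend themselves to a simple completeness check. The sketch below is an assumption about how a team might store a datasheet as a dictionary keyed by section, following the grouping used in the paragraph above; it is not the official template, and the example answers are invented.

```python
# Section keys follow the grouping in the text above (hypothetical naming).
DATASHEET_SECTIONS = [
    "motivation",
    "composition",
    "collection_process",
    "preprocessing_and_cleaning",
    "uses",
    "distribution_and_maintenance",
]


def unanswered_sections(datasheet: dict[str, str]) -> list[str]:
    """List sections that are missing or contain only whitespace."""
    return [
        section
        for section in DATASHEET_SECTIONS
        if not datasheet.get(section, "").strip()
    ]


# Invented example: a draft datasheet with two sections still to write.
draft = {
    "motivation": "Created to study label noise in product reviews.",
    "composition": "One record per review, with a star rating as the label.",
    "collection_process": "Collected with consent from a partner site.",
    "preprocessing_and_cleaning": "Deduplicated; raw data is preserved.",
    "uses": "   ",
}

print(unanswered_sections(draft))
```

    A check like this only confirms that every section has some text. Whether the answers are honest and useful still needs a human reader.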

    The point of the datasheet is to make the provenance and limitations of a dataset legible to people who did not collect it. A dataset reused without understanding its collection context is a well-documented source of downstream harm; the datasheet is the mechanism for transmitting that context with the data.

    Why these belong in the research record

    It is tempting to treat model cards and datasheets as engineering hygiene — useful, but not scholarly in the way a paper is. We think that view is mistaken, for three reasons.

    • They are how ML researchers are increasingly evaluated. A well-constructed datasheet or a rigorous disaggregated model card represents real intellectual labour: the careful articulation of provenance, intended use, and limitation. Under responsible-assessment regimes such as the narrative CV, this kind of output is exactly the contribution a researcher should be able to claim.
    • They are the documentation layer that makes a model or dataset FAIR. A trained model with a DataCite DOI but no model card is findable and accessible but not meaningfully reusable. The card supplies the metadata that the FAIR principles require for reuse.
    • They carry the accountability that the research record is supposed to preserve. When a model is later found to behave badly, the model card is the contemporaneous record of what its builders claimed and disclosed. That is precisely the function the published record has always served for empirical claims.

    How persistent identifiers apply

    For a model card or datasheet to function as a citable research output, it needs the same identifier infrastructure as any other output. The pattern that has emerged, and that CASRAI’s guidance on persistent identifiers recommends, is straightforward.

    The dataset or model receives a DataCite DOI, minted by a generalist repository (Zenodo, Figshare) or a domain-specific one. The datasheet or model card is published as part of that deposit, so that resolving the DOI reaches both the artefact and its documentation. Where source code is involved, a Software Heritage ID pins the exact code state. Contributors are identified by ORCID iD and institutions by ROR ID, so that the people and organisations behind the artefact are unambiguous. Where the model or dataset belongs to a larger project, a RAiD ties it to the project record. The model card’s documentation of its training data should, ideally, cite the dataset’s DOI directly — closing the provenance loop between model and data.

    How CRediT applies

    Contributorship for these outputs maps onto CRediT better than one might expect, though not perfectly. The person who designed the data-collection protocol is doing Methodology; the people who collected, cleaned, and annotated the data are doing Investigation and Data curation; the person who trained the model is doing Software and, where the training method is itself novel, Methodology; the person who built and ran the evaluation suite is doing Validation. We have written separately about the friction points in this mapping — the Software role in particular tends to absorb too much — but the basic correspondence holds, and a model or dataset deposit should carry a CRediT statement just as a paper does.

    Quality varies, and that matters

    A note of realism. Because model cards and datasheets are not yet enforced by peer review in the way a methods section is, their quality varies enormously. A thorough datasheet that honestly documents consent gaps and known biases is a genuine contribution; a model card that lists only headline accuracy and a boilerplate licence is documentation theatre. The value of folding these artefacts into the formal research record — with identifiers, contributorship, and eventually review — is precisely that it creates the incentive and the scrutiny to make them good.

    What to do now

    For researchers releasing a model or dataset: write the model card or datasheet using the established Mitchell et al. and Gebru et al. templates; deposit it with the artefact under a DataCite DOI; attach a CRediT statement and ORCID iDs; and cite the dataset’s DOI from the model card where the model was trained on a citable dataset. For institutions and funders: recognise these outputs in CRIS systems and assessment processes as first-class, identifier-bearing research outputs, not as supplementary material.
