Tag: model cards

  • AI Model Documentation: Datasheets and Model Cards

    Model cards are short, structured documents that report what an AI model does, how it was evaluated, and the conditions under which it should and should not be used. Together with datasheets for datasets, which document the data a model is trained and tested on, they form the backbone of responsible-AI documentation. Both were proposed to bring the same rigour to AI artefacts that established disciplines bring to materials and reagents, and both directly support reproducibility, accountability and the integrity of the research record.

    Model cards (Mitchell et al. 2019)

    Model cards were introduced by Mitchell and colleagues in 2019 as a framework for transparent model reporting. A model card accompanies a trained model and records, in a consistent format, the essential facts a user needs to decide whether the model is appropriate for their purpose. Crucially, model cards emphasise disaggregated evaluation: reporting performance not only in aggregate but across relevant subgroups, so that uneven performance is visible rather than hidden behind a single headline number.

    A typical model card covers model details (who built it, version, architecture), intended use and out-of-scope uses, evaluation data and metrics, performance across conditions, and ethical considerations, limitations and caveats. By stating intended and prohibited uses explicitly, a model card reduces the risk of a model being deployed in a context it was never validated for.
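
    The sections listed above can be held in a small data structure, so that an empty section is easy to spot before release. What follows is a minimal sketch, not a standard schema: the class name `ModelCard`, its field names and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Hypothetical container for the sections a typical model card covers."""
    model_details: str                 # who built it, version, architecture
    intended_use: str
    out_of_scope_uses: list[str]
    evaluation_data: str
    metrics: dict[str, float]          # aggregate metrics
    performance_by_condition: dict[str, dict[str, float]] = field(default_factory=dict)
    limitations: list[str] = field(default_factory=list)
    ethical_considerations: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that were left empty."""
        return [name for name, value in vars(self).items() if not value]

    def to_text(self) -> str:
        """Render the card as plain text, one headed block per section."""
        lines = []
        for name, value in vars(self).items():
            lines.append(name.replace("_", " ").capitalize())
            if isinstance(value, dict):
                lines.extend(f"  {key}: {val}" for key, val in value.items())
            elif isinstance(value, list):
                lines.extend(f"  - {item}" for item in value)
            else:
                lines.append(f"  {value}")
            lines.append("")
        return "\n".join(lines)


# Illustrative values only; none of this describes a real model.
card = ModelCard(
    model_details="toy-sentiment v0.1, logistic regression",
    intended_use="Sentiment scoring of English product reviews",
    out_of_scope_uses=["Medical or legal text", "Languages other than English"],
    evaluation_data="Held-out review sample (hypothetical)",
    metrics={"accuracy": 0.91},
)
print(card.missing_sections())
print(card.to_text())
```

    Here `missing_sections()` flags the disaggregated performance, limitations and ethical considerations as still unwritten, which are the sections most often left blank.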

    Datasheets for datasets (Gebru et al.)

    Datasheets for datasets, proposed by Gebru and colleagues, apply the same documentation philosophy to data. A datasheet answers questions about a dataset’s whole life cycle: the motivation for creating it, its composition (what the instances represent, how many, whether sensitive data is present), the collection process, any preprocessing, cleaning or labelling, intended and discouraged uses, distribution terms, and arrangements for maintenance. Because so many problems in machine learning originate in the data, documenting it is often more consequential than documenting the model.

    Artefact               | Documents       | Key contents
    Model card             | A trained model | Intended use, evaluation, disaggregated performance, limitations
    Datasheet for datasets | A dataset       | Motivation, composition, collection, preprocessing, uses, maintenance

    How they support reproducibility and accountability

    Documentation turns an opaque artefact into an auditable one. A model card tells a future researcher exactly which model version and evaluation protocol produced a published result, while a datasheet records the data provenance needed to interpret or rebuild that result. This is the documentation layer that complements the engineering practices in our guide to reproducibility of machine learning research: code and seeds make a result re-runnable, while cards and datasheets make it interpretable and accountable.

    These artefacts also support the broader disclosure expectations now common in scholarly publishing. When generative AI features in a study, documenting the model and its data complements the editorial requirements covered in our explainer on generative AI and research disclosure norms and across our GenAI disclosure coverage.

    Embedding documentation in the research record

    For documentation to be useful it must be findable and citable as part of the scholarly record, not buried in a code repository. Treating model cards and datasheets as first-class research outputs supports proper credit assignment through frameworks such as CRediT and consistent description through the casrai.org research dictionary. Doing so recognises the substantial work of data curation and evaluation that these documents describe.

    Frequently asked questions

    What is a model card?

    A model card is a structured document, proposed by Mitchell et al. in 2019, that reports an AI model’s intended use, evaluation results (including across subgroups), limitations and ethical considerations, so users can judge whether it suits their purpose.

    What is a datasheet for datasets?

    A datasheet, proposed by Gebru et al., documents a dataset’s motivation, composition, collection and preprocessing, intended uses and maintenance, capturing the data provenance needed to interpret or reproduce results.

    How do model cards differ from datasheets?

    Model cards document a trained model; datasheets document the dataset behind it. Used together, they describe both the artefact and the data that shaped it.

    Why does AI documentation matter for reproducibility?

    It records which model version, evaluation protocol and data produced a result, turning an opaque artefact into an auditable one that others can interpret, scrutinise and rebuild.

  • Reproducibility for AI/ML research: model cards, seeds and compute disclosure

    Machine-learning research has a reproducibility problem, and the awkward truth is that most of it is not about anything exotic. A reported result usually fails to reproduce not because the science is fraudulent or the maths is wrong, but because of mundane omissions: a random seed that was never recorded, a library version that was never pinned, a preprocessing step that lived only in someone’s notebook, a hardware configuration nobody thought to mention. The good news is that, precisely because the causes are mundane, the fixes are tractable: they are matters of documentation and discipline rather than fundamental breakthroughs. This article sets out the practical components of reproducible AI/ML work, drawing on the definitions in the AI/ML research outputs domain of the CASRAI Dictionary and the broader principles in the reproducibility domain.

    Why ML is especially fragile

    Several features of machine learning conspire to make results fragile. Models are stochastic: random initialisation, shuffling and sampling mean that two runs of the same code can produce different numbers unless randomness is controlled. They are dependency-heavy: results can shift with a change in a framework version, a numerical library, or even a hardware driver. They are data-sensitive: a different split, a different preprocessing choice, or an undocumented filtering step can change a headline metric. And they are increasingly compute-bound: some results depend on hardware and scale that are themselves part of the experiment. None of these is a flaw to be ashamed of, but each is a source of irreproducibility unless it is documented and controlled.

    Model cards and datasheets: documenting what you built

    The first pillar is structured documentation of the model itself. A model card is a short, standardised document that accompanies a trained model and records what it is, what it was trained and evaluated on, how it performs across relevant conditions, its intended uses, and its known limitations and ethical considerations. The point of a model card is that it travels with the model, so that anyone using or building on it inherits the context they need rather than reconstructing it from a paper’s prose.

    The complementary artefact for data is the datasheet for datasets, which documents a dataset’s motivation, composition, collection process, preprocessing, recommended uses and limitations. Together, model cards and datasheets address the two halves of an ML experiment whose details most often go unrecorded — the model and the data — and they turn ‘trust me, it works’ into something a reader can interrogate. Both are concrete examples of treating documentation as a first-class research output rather than an afterthought.

    Seeds and determinism: making runs repeatable

    The second pillar is the humble random seed. Setting and recording seeds for every source of randomness — the framework, the numerical libraries, the data loaders — is the single cheapest reproducibility measure available, and one of the most frequently neglected. Recording the seed lets someone reproduce a specific run; reporting results across several seeds, with variation shown, lets readers judge whether a result is robust or an artefact of a lucky initialisation.
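
    A minimal sketch of setting and recording seeds, using only the Python standard library and seeding NumPy when it happens to be installed. The function names `seed_everything` and `shuffled_order` are assumptions for illustration; a deep-learning framework would need its own seeding call added in the marked place.

```python
import random


def seed_everything(seed: int) -> dict:
    """Seed the random sources available here and return a record of what was set."""
    record = {"seed": seed, "seeded": []}

    random.seed(seed)
    record["seeded"].append("random")

    try:
        import numpy as np  # optional: only seeded if installed
        np.random.seed(seed)
        record["seeded"].append("numpy")
    except ImportError:
        pass

    # A framework would be seeded here as well,
    # for example torch.manual_seed(seed) when using PyTorch.
    return record


def shuffled_order(n_items: int, seed: int) -> list[int]:
    """Stand-in for a data loader: a shuffle driven by its own seeded generator."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    return order


record = seed_everything(2024)
print(record)  # store this alongside the results
print(shuffled_order(5, seed=record["seed"]))
```

    Returning the record, rather than only setting the seed, is the design choice that matters: the value that was used ends up written down next to the result it produced.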

    It is worth being honest about the limits here. Even with fixed seeds, full bit-for-bit determinism can be elusive, because some operations on parallel hardware are non-deterministic by default and because results can differ across hardware and library versions. The realistic goal is not always perfect determinism but documented randomness: a reader should know what was fixed, what was not, and how much the results varied as a consequence. A result reported as a mean across seeds with a measure of spread is far more credible than a single number with no indication of how stable it is.
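
    Reporting a mean across seeds with a measure of spread takes only a few lines. In this sketch `run_experiment` is a hypothetical stand-in for a real training run: it returns a synthetic score with seed-dependent noise, so the numbers it prints are illustrative only.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Stand-in for a training run: a synthetic 'accuracy' that varies with the seed."""
    rng = random.Random(seed)
    return 0.90 + rng.gauss(0.0, 0.01)


def summarise(seeds: list[int]) -> dict:
    """Run once per seed and report the mean together with the spread."""
    scores = [run_experiment(seed) for seed in seeds]
    return {
        "seeds": seeds,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


summary = summarise([0, 1, 2, 3, 4])
print(
    f"accuracy = {summary['mean']:.3f} ± {summary['stdev']:.3f} "
    f"over {len(summary['seeds'])} seeds "
    f"(range {summary['min']:.3f} to {summary['max']:.3f})"
)
```

    Listing the seeds in the summary keeps the report reproducible as well as honest: a reader can re-run any single seed and compare.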

    Compute and environment disclosure

    The third pillar is disclosure of the compute and environment in which the work was done. This means recording the hardware used, the software environment (framework and library versions, ideally captured in a pinned dependency specification or a container image), and the scale of the experiment — training time, the amount of computation involved, and the resources required. This serves two purposes at once. It supports reproducibility, because a result obtained on particular hardware with particular software may not reproduce elsewhere without that context. And it supports honesty and sustainability, because the computational and environmental cost of large-scale training is itself a material fact that readers, reviewers and funders increasingly expect to see stated rather than hidden.

    Capturing the environment in a reusable form — a container, a pinned environment file, a recorded command line — is what lets a reader move from reading about a result to re-running it, which is the real test of reproducibility.
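
    As a rough sketch of what a recorded environment might contain, the snippet below collects the interpreter version, platform, command line and selected package versions using only the standard library. The function name `capture_environment` and the package list are assumptions; accelerator models and driver versions are not covered here and would need to be recorded separately.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages: list[str]) -> dict:
    """Record interpreter, platform, command line and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "command_line": list(sys.argv),
        "packages": versions,
    }


# The package names are examples; list whatever the experiment imports.
env = capture_environment(["numpy", "torch"])
print(json.dumps(env, indent=2))
```

    Saving this JSON next to each set of results complements, rather than replaces, a pinned environment file or container image: it records what was actually running, not what was meant to be installed.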

    Software and the FAIR4RS principles

    Underlying all of this is the recognition that the code is a research output, to be shared, versioned, identified and cited like any other. The FAIR4RS principles — FAIR for Research Software — adapt the familiar Findable, Accessible, Interoperable and Reusable framework to software, acknowledging that code has characteristics (executability, dependencies, versions) that data alone does not. Treating ML code as a citable, archived output with a persistent identifier, rather than as a transient artefact, is what makes the model card, the seeds and the compute disclosure add up to something reproducible rather than merely well-described.

    Crediting the work properly

    Reproducible ML research is rarely the work of one person, and the contributions are varied: building the model, curating the data, writing the evaluation, managing the compute. Recording who did what through structured contributorship — the roles set out in the CRediT taxonomy — makes that division of labour visible and creditable, which matters all the more in collaborative ML projects where data, code, models and evaluation are often distinct workstreams. The consistent vocabulary for describing AI/ML outputs, their documentation and their reproducibility is maintained in the CASRAI Dictionary, so that a claim of reproducibility can be expressed, recorded and checked across the systems that track research outputs.

  • Model cards and datasheets: documenting AI/ML research outputs

    For most of the history of the scholarly record, the unit of documentation was the paper. A piece of empirical research was described, peer-reviewed, and citable as an article; the underlying data and code were, at best, supplementary. Machine-learning research has been quietly rewriting that assumption. A trained model and the dataset it learned from are research outputs in their own right, and the community has developed its own documentation conventions for them: the model card and the datasheet for datasets. This piece sets out what they are, where they came from, and why they belong in the formal research record that CASRAI’s AI/ML research outputs domain is designed to describe.

    Model cards: a short, structured account of a model

    The model card was proposed by Margaret Mitchell and colleagues in their 2019 paper Model Cards for Model Reporting. The idea is disarmingly simple: every trained model should ship with a short, structured document that answers the questions a responsible user would need to ask before relying on it. Who built it and when? What is it intended to do, and what is it explicitly not intended to do? What data was it trained on? How was it evaluated, and on which populations or subgroups? What are its known limitations, failure modes, and ethical considerations?

    The motivating insight was that aggregate performance numbers conceal more than they reveal. A model that is 95% accurate overall can be 99% accurate for one group and 70% for another. A model card’s evaluation section is expected to report performance disaggregated across relevant factors, so that the user can see where the model works and where it does not. This is documentation in service of accountability, not marketing.
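
    The gap between aggregate and disaggregated figures is easy to demonstrate. The sketch below uses synthetic records for two hypothetical groups, 900 examples in group A and 100 in group B, chosen to resemble the figures above rather than reproduce them; the function name `disaggregated_accuracy` is likewise an assumption.

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Compute overall and per-group accuracy.

    records: iterable of (group, y_true, y_pred) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        correct[group] += int(y_true == y_pred)
    per_group = {group: correct[group] / totals[group] for group in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_group


# Synthetic data: group A is large and well served, group B is small and poorly served.
records = (
    [("A", 1, 1)] * 891 + [("A", 1, 0)] * 9
    + [("B", 1, 1)] * 70 + [("B", 1, 0)] * 30
)

overall, per_group = disaggregated_accuracy(records)
print(f"overall: {overall:.1%}")
for group, accuracy in sorted(per_group.items()):
    print(f"group {group}: {accuracy:.1%}")
```

    With these inputs the overall figure is 96.1%, while group B sits at 70%. Because group B contributes only a tenth of the examples, its weak performance barely moves the headline number, which is why the per-group rows need to be reported.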

    Model cards have since become near-ubiquitous in practice. The Hugging Face Hub, the dominant model registry, renders the README of each model repository as its model card, and the convention has spread to internal model registries across industry and academia. The format is loose enough to suit a small fine-tuned classifier or a large foundation model, but the core sections (intended use, training data, evaluation, limitations) are stable.

    Datasheets for datasets: provenance for the data

    The companion convention for data is the datasheet for datasets, proposed by Timnit Gebru and colleagues in 2018 (revised and published in Communications of the ACM in 2021). The analogy in the title is to the datasheets that accompany electronic components: a structured specification that lets an engineer decide whether a part is fit for their purpose.

    A datasheet works through a dataset’s full lifecycle in a series of question prompts. Motivation: why was the dataset created, and by whom? Composition: what does each instance represent, are there labels, are there sensitive subpopulations? Collection process: how was the data acquired, was consent obtained, were people aware they were being recorded? Preprocessing and cleaning: what was done to the raw data, and is the raw data preserved? Uses: what has the dataset been used for, and what uses should be avoided? Distribution and maintenance: how is it licensed, who maintains it, and how will errors be corrected?
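
    The question prompts above lend themselves to a simple completeness check. This is a sketch under stated assumptions: the section keys follow the grouping used in this article rather than the exact headings of the Gebru et al. template, and the names `DATASHEET_SECTIONS` and `unanswered`, along with the draft answers, are invented for illustration.

```python
# Section keys follow this article's grouping of the datasheet questions.
DATASHEET_SECTIONS = {
    "motivation": "Why was the dataset created, and by whom?",
    "composition": "What does each instance represent? Are there labels or sensitive subpopulations?",
    "collection_process": "How was the data acquired, and was consent obtained?",
    "preprocessing": "What was done to the raw data, and is the raw data preserved?",
    "uses": "What has the dataset been used for, and what uses should be avoided?",
    "distribution_and_maintenance": "How is it licensed, who maintains it, and how are errors corrected?",
}


def unanswered(datasheet: dict[str, str]) -> list[str]:
    """Return the sections that have no answer, or only whitespace."""
    return [
        section
        for section in DATASHEET_SECTIONS
        if not datasheet.get(section, "").strip()
    ]


# A hypothetical draft with two sections still to be written.
draft = {
    "motivation": "Created to benchmark review-sentiment classifiers.",
    "composition": "One product review per instance, with a binary label.",
    "collection_process": "Collected from a public forum with user consent.",
    "preprocessing": "",
    "uses": "Sentiment research; not suitable for profiling individuals.",
}

for section in unanswered(draft):
    print(f"{section}: {DATASHEET_SECTIONS[section]}")
```

    A check like this catches only absent answers, not evasive ones; whether the answers are honest still depends on human review.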

    The point of the datasheet is to make the provenance and limitations of a dataset legible to people who did not collect it. A dataset reused without understanding its collection context is a well-documented source of downstream harm; the datasheet is the mechanism for transmitting that context with the data.

    Why these belong in the research record

    It is tempting to treat model cards and datasheets as engineering hygiene — useful, but not scholarly in the way a paper is. We think that view is mistaken, for three reasons.

    • They are how ML researchers are increasingly evaluated. A well-constructed datasheet or a rigorous disaggregated model card represents real intellectual labour: the careful articulation of provenance, intended use, and limitation. Under responsible-assessment regimes such as the narrative CV, this kind of output is exactly the contribution a researcher should be able to claim.
    • They are the documentation layer that makes a model or dataset FAIR. A trained model with a DataCite DOI but no model card is findable and accessible but not meaningfully reusable. The card supplies the metadata that the FAIR principles require for reuse.
    • They carry the accountability that the research record is supposed to preserve. When a model is later found to behave badly, the model card is the contemporaneous record of what its builders claimed and disclosed. That is precisely the function the published record has always served for empirical claims.

    How persistent identifiers apply

    For a model card or datasheet to function as a citable research output, it needs the same identifier infrastructure as any other output. The pattern that has emerged, and that CASRAI’s guidance on persistent identifiers recommends, is straightforward.

    The dataset or model receives a DataCite DOI, minted by a generalist repository (Zenodo, Figshare) or a domain-specific one. The datasheet or model card is published as part of that deposit, so that resolving the DOI reaches both the artefact and its documentation. Where source code is involved, a Software Heritage identifier (SWHID) pins the exact code state. Contributors are identified by ORCID iD and institutions by ROR ID, so that the people and organisations behind the artefact are unambiguous. Where the model or dataset belongs to a larger project, a RAiD ties it to the project record. The model card’s documentation of its training data should, ideally, cite the dataset’s DOI directly, closing the provenance loop between model and data.
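
    One way to picture this pattern is as a deposit record whose identifier fields can be checked before publication. The sketch below is not a repository API or a metadata standard: the `Deposit` class, its field names and both helper functions are assumptions, and every identifier value is a placeholder rather than a real DOI, ORCID iD or ROR ID.

```python
from dataclasses import dataclass, field


@dataclass
class Deposit:
    """Hypothetical record of the identifiers attached to a model or dataset deposit."""
    title: str
    doi: str = ""                  # DataCite DOI of the artefact
    documentation: str = ""        # model card or datasheet file included in the deposit
    swhid: str = ""                # Software Heritage identifier, where code is involved
    contributor_orcids: list[str] = field(default_factory=list)
    institution_rors: list[str] = field(default_factory=list)
    raid: str = ""                 # project identifier, where part of a larger project
    cites_dois: list[str] = field(default_factory=list)


def missing_identifiers(deposit: Deposit) -> list[str]:
    """List the fields this sketch treats as required that are still empty."""
    required = ("doi", "documentation", "contributor_orcids", "institution_rors")
    return [name for name in required if not getattr(deposit, name)]


def closes_provenance_loop(model: Deposit, dataset: Deposit) -> bool:
    """True when the model deposit cites the DOI of the dataset it was trained on."""
    return bool(dataset.doi) and dataset.doi in model.cites_dois


# Placeholder values throughout; none of these identifiers is real.
dataset = Deposit(
    title="Example review dataset",
    doi="10.1234/example-dataset",
    documentation="DATASHEET.txt",
    contributor_orcids=["ORCID-PLACEHOLDER"],
    institution_rors=["ROR-PLACEHOLDER"],
)
model = Deposit(
    title="Example sentiment model",
    doi="10.1234/example-model",
    documentation="MODEL_CARD.txt",
    contributor_orcids=["ORCID-PLACEHOLDER"],
    institution_rors=["ROR-PLACEHOLDER"],
    cites_dois=[dataset.doi],
)

print(missing_identifiers(model))
print(closes_provenance_loop(model, dataset))
```

    Which fields count as required is a policy decision; here the SWHID and RAiD are left optional because the text applies them only where code or a larger project is involved.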

    How CRediT applies

    Contributorship for these outputs maps onto CRediT better than one might expect, though not perfectly. The person who designed the data-collection protocol is doing Methodology; the people who collected, cleaned, and annotated the data are doing Investigation and Data curation; the person who trained the model is doing Software and, where the training method is itself novel, Methodology; the person who built and ran the evaluation suite is doing Validation. We have written separately about the friction points in this mapping — the Software role in particular tends to absorb too much — but the basic correspondence holds, and a model or dataset deposit should carry a CRediT statement just as a paper does.
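
    The mapping described above can be written down as a structured CRediT statement and checked against the taxonomy's fourteen role names. This is a sketch: the function name `credit_statement` and the contributor labels are hypothetical, and the output format is one plausible rendering rather than a prescribed one.

```python
# The fourteen contributor roles defined by the CRediT taxonomy.
CREDIT_ROLES = frozenset({
    "Conceptualization",
    "Data curation",
    "Formal analysis",
    "Funding acquisition",
    "Investigation",
    "Methodology",
    "Project administration",
    "Resources",
    "Software",
    "Supervision",
    "Validation",
    "Visualization",
    "Writing – original draft",
    "Writing – review & editing",
})


def credit_statement(contributions: dict[str, list[str]]) -> str:
    """Render a contributor statement, rejecting role names outside the taxonomy."""
    used = {role for roles in contributions.values() for role in roles}
    unknown = used - CREDIT_ROLES
    if unknown:
        raise ValueError(f"Not CRediT roles: {sorted(unknown)}")
    return " ".join(
        f"{person}: {', '.join(roles)}." for person, roles in contributions.items()
    )


# Hypothetical contributors, following the mapping described in the text.
statement = credit_statement({
    "Contributor A": ["Methodology"],                    # designed the collection protocol
    "Contributor B": ["Investigation", "Data curation"], # collected, cleaned and annotated
    "Contributor C": ["Software", "Methodology"],        # trained the model with a novel method
    "Contributor D": ["Validation"],                     # built and ran the evaluation suite
})
print(statement)
```

    Rejecting unknown role names keeps the statement machine-readable, at the cost of forcing ML-specific work into the nearest existing role, which is the friction noted above.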

    Quality varies, and that matters

    A note of realism. Because model cards and datasheets are not yet enforced by peer review in the way a methods section is, their quality varies enormously. A thorough datasheet that honestly documents consent gaps and known biases is a genuine contribution; a model card that lists only headline accuracy and a boilerplate licence is documentation theatre. The value of folding these artefacts into the formal research record — with identifiers, contributorship, and eventually review — is precisely that it creates the incentive and the scrutiny to make them good.

    What to do now

    For researchers releasing a model or dataset: write the model card or datasheet using the established Mitchell et al. and Gebru et al. templates; deposit it with the artefact under a DataCite DOI; attach a CRediT statement and ORCID iDs; and cite the dataset’s DOI from the model card where the model was trained on a citable dataset. For institutions and funders: recognise these outputs in CRIS systems and assessment processes as first-class, identifier-bearing research outputs, not as supplementary material.
