Tag: datasheets for datasets

  • AI Model Documentation: Datasheets and Model Cards

    Model cards are short, structured documents that report what an AI model does, how it was evaluated, and the conditions under which it should and should not be used. Together with datasheets for datasets, which document the data a model is trained and tested on, they form the backbone of responsible-AI documentation. Both were proposed to bring the same rigour to AI artefacts that established disciplines bring to materials and reagents, and both directly support reproducibility, accountability and the integrity of the research record.

    Model cards (Mitchell et al. 2019)

    Model cards were introduced by Mitchell and colleagues in 2019 as a framework for transparent model reporting. A model card accompanies a trained model and records, in a consistent format, the essential facts a user needs to decide whether the model is appropriate for their purpose. Crucially, model cards emphasise disaggregated evaluation: reporting performance not only in aggregate but across relevant subgroups, so that uneven performance is visible rather than hidden behind a single headline number.

    A typical model card covers model details (who built it, version, architecture), intended use and out-of-scope uses, evaluation data and metrics, performance across conditions, and ethical considerations, limitations and caveats. By stating intended and prohibited uses explicitly, a model card reduces the risk of a model being deployed in a context it was never validated for.
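
    The sections listed above can be sketched as a small data structure. This is a minimal illustration, not the published template: the class name `ModelCard`, its field names and the `missing_sections` helper are assumptions made for this example, and the values are invented.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ModelCard:
    """Illustrative container for the sections a typical model card covers."""
    model_details: dict             # who built it, version, architecture
    intended_use: str
    out_of_scope_uses: list
    evaluation_data: str
    metrics: list
    performance_by_condition: dict  # disaggregated results, keyed by condition
    limitations: list = field(default_factory=list)
    ethical_considerations: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that were left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical values, for illustration only.
card = ModelCard(
    model_details={"developer": "Example Lab", "version": "1.0",
                   "architecture": "logistic regression"},
    intended_use="Triage of support tickets written in English.",
    out_of_scope_uses=["Medical or legal decision-making"],
    evaluation_data="Held-out tickets from the same hypothetical source.",
    metrics=["accuracy"],
    performance_by_condition={},    # left empty to show the completeness check
)

print(card.missing_sections())
print(json.dumps(asdict(card), indent=2))
```

    A completeness check of this kind only confirms that a section has been filled in; it says nothing about whether the content is honest or adequate.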

    Datasheets for datasets (Gebru et al. 2021)

    Datasheets for datasets, proposed by Gebru and colleagues, apply the same documentation philosophy to data. A datasheet answers questions about a dataset’s whole life cycle: the motivation for creating it, its composition (what the instances represent, how many, whether sensitive data is present), the collection process, any preprocessing, cleaning or labelling, intended and discouraged uses, distribution terms, and arrangements for maintenance. Because so many problems in machine learning originate in the data, documenting it is often more consequential than documenting the model.

    Artefact               | Documents       | Key contents
    Model card             | A trained model | Intended use, evaluation, disaggregated performance, limitations
    Datasheet for datasets | A dataset       | Motivation, composition, collection, preprocessing, uses, maintenance

    How they support reproducibility and accountability

    Documentation turns an opaque artefact into an auditable one. A model card tells a future researcher exactly which model version and evaluation protocol produced a published result, while a datasheet records the data provenance needed to interpret or rebuild that result. This is the documentation layer that complements the engineering practices in our guide to reproducibility of machine learning research: code and seeds make a result re-runnable, while cards and datasheets make it interpretable and accountable.

    These artefacts also support the broader disclosure expectations now common in scholarly publishing. When generative AI features in a study, documenting the model and its data complements the editorial requirements covered in our explainer on generative AI and research disclosure norms and across our GenAI disclosure coverage.

    Embedding documentation in the research record

    For documentation to be useful it must be findable and citable as part of the scholarly record, not buried in a code repository. Treating model cards and datasheets as first-class research outputs supports proper credit assignment through frameworks such as CRediT and consistent description through the casrai.org research dictionary. Doing so recognises the substantial work of data curation and evaluation that these documents describe.

    Frequently asked questions

    What is a model card?

    A model card is a structured document, proposed by Mitchell et al. in 2019, that reports an AI model’s intended use, evaluation results (including across subgroups), limitations and ethical considerations, so users can judge whether it suits their purpose.

    What is a datasheet for datasets?

    A datasheet, proposed by Gebru et al., documents a dataset’s motivation, composition, collection and preprocessing, intended uses and maintenance, capturing the data provenance needed to interpret or reproduce results.

    How do model cards differ from datasheets?

    Model cards document a trained model; datasheets document the dataset behind it. Used together, they describe both the artefact and the data that shaped it.

    Why does AI documentation matter for reproducibility?

    It records which model version, evaluation protocol and data produced a result, turning an opaque artefact into an auditable one that others can interpret, scrutinise and rebuild.

  • Reproducibility for AI/ML research: model cards, seeds and compute disclosure

    Machine-learning research has a reproducibility problem, and the awkward truth is that most of it is not about anything exotic. A reported result usually fails to reproduce not because the science is fraudulent or the maths is wrong, but because of mundane omissions: a random seed that was never recorded, a library version that was never pinned, a preprocessing step that lived only in someone’s notebook, a hardware configuration nobody thought to mention. The good news is that, precisely because the causes are mundane, the fixes are tractable: they are matters of documentation and discipline rather than fundamental breakthroughs. This article sets out the practical components of reproducible AI/ML work, drawing on the definitions in the AI/ML research outputs domain of the CASRAI Dictionary and the broader principles in the reproducibility domain.

    Why ML is especially fragile

    Several features of machine learning conspire to make results fragile. Models are stochastic: random initialisation, shuffling and sampling mean that two runs of the same code can produce different numbers unless randomness is controlled. They are dependency-heavy: results can shift with a change in a framework version, a numerical library, or even a hardware driver. They are data-sensitive: a different split, a different preprocessing choice, or an undocumented filtering step can change a headline metric. And they are increasingly compute-bound: some results depend on hardware and scale that are themselves part of the experiment. None of these is a flaw to be ashamed of, but each is a source of irreproducibility unless it is documented and controlled.

    Model cards and datasheets: documenting what you built

    The first pillar is structured documentation of the model itself. A model card is a short, standardised document that accompanies a trained model and records what it is, what it was trained and evaluated on, how it performs across relevant conditions, its intended uses, and its known limitations and ethical considerations. The point of a model card is that it travels with the model, so that anyone using or building on it inherits the context they need rather than reconstructing it from a paper’s prose.

    The complementary artefact for data is the datasheet for datasets, which documents a dataset’s motivation, composition, collection process, preprocessing, recommended uses and limitations. Together, model cards and datasheets address the two halves of an ML experiment whose details most often go unrecorded — the model and the data — and they turn ‘trust me, it works’ into something a reader can interrogate. Both are concrete examples of treating documentation as a first-class research output rather than an afterthought.

    Seeds and determinism: making runs repeatable

    The second pillar is the humble random seed. Setting and recording seeds for every source of randomness — the framework, the numerical libraries, the data loaders — is the single cheapest reproducibility measure available, and one of the most frequently neglected. Recording the seed lets someone reproduce a specific run; reporting results across several seeds, with variation shown, lets readers judge whether a result is robust or an artefact of a lucky initialisation.
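
    As a minimal sketch of setting and recording a seed, the helper below seeds Python's built-in generator and, if it happens to be installed, NumPy's legacy global generator. The name `seed_everything` and the shape of the returned record are assumptions for this example. A deep-learning framework would need its own seeding call as well (for example `torch.manual_seed` in PyTorch), which is not shown here.

```python
import random


def seed_everything(seed):
    """Seed the generators this sketch knows about and return a record of what was seeded."""
    random.seed(seed)
    seeded = ["random"]
    try:
        import numpy as np  # optional: skipped if NumPy is not installed
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    return {"seed": seed, "seeded": seeded}


# Seeding twice with the same value reproduces the same draws.
record = seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
second = [random.random() for _ in range(3)]

print(record)
print(first == second)
```

    Storing the returned record next to the results is what makes the specific run recoverable later.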

    It is worth being honest about the limits here. Even with fixed seeds, full bit-for-bit determinism can be elusive, because some operations on parallel hardware are non-deterministic by default and because results can differ across hardware and library versions. The realistic goal is not always perfect determinism but documented randomness: a reader should know what was fixed, what was not, and how much the results varied as a consequence. A result reported as a mean across seeds with a measure of spread is far more credible than a single number with no indication of how stable it is.
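
    Reporting a mean across seeds with a measure of spread might look like the sketch below. The function `noisy_experiment` is a stand-in that returns a made-up metric; in real work it would be a full training and evaluation run.

```python
import random
import statistics


def noisy_experiment(seed):
    """Stand-in for a training run: returns a made-up metric that depends on the seed."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)


seeds = [0, 1, 2, 3, 4]
scores = [noisy_experiment(s) for s in seeds]

mean = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation across seeds

print(f"accuracy: {mean:.3f} +/- {spread:.3f} over {len(seeds)} seeds")
```

    The sample standard deviation is one reasonable choice of spread; a range or confidence interval would serve the same purpose, provided the report states which one was used.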

    Compute and environment disclosure

    The third pillar is disclosure of the compute and environment in which the work was done. This means recording the hardware used, the software environment (framework and library versions, ideally captured in a pinned dependency specification or a container image), and the scale of the experiment — training time, the amount of computation involved, and the resources required. This serves two purposes at once. It supports reproducibility, because a result obtained on particular hardware with particular software may not reproduce elsewhere without that context. And it supports honesty and sustainability, because the computational and environmental cost of large-scale training is itself a material fact that readers, reviewers and funders increasingly expect to see stated rather than hidden.

    Capturing the environment in a reusable form — a container, a pinned environment file, a recorded command line — is what lets a reader move from reading about a result to re-running it, which is the real test of reproducibility.
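
    One lightweight way to capture part of the software environment is sketched below using only the Python standard library. The helper name `capture_environment` and the package list are assumptions for this example. It does not record accelerator hardware or driver versions, which would need framework-specific or vendor-specific tooling.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Record interpreter, OS and the installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "command_line": sys.argv,
        "packages": versions,
    }


# The package names are examples; list whatever the experiment imports.
env = capture_environment(["numpy", "torch"])
print(json.dumps(env, indent=2))
```

    A record like this complements, rather than replaces, a pinned dependency file or container image: it documents what was actually present when the run happened.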

    Software and the FAIR4RS principles

    Underlying all of this is the recognition that the code is a research output, to be shared, versioned, identified and cited like any other. The FAIR4RS principles — FAIR for Research Software — adapt the familiar Findable, Accessible, Interoperable and Reusable framework to software, acknowledging that code has characteristics (executability, dependencies, versions) that data alone does not. Treating ML code as a citable, archived output with a persistent identifier, rather than as a transient artefact, is what makes the model card, the seeds and the compute disclosure add up to something reproducible rather than merely well-described.

    Crediting the work properly

    Reproducible ML research is rarely the work of one person, and the contributions are varied: building the model, curating the data, writing the evaluation, managing the compute. Recording who did what through structured contributorship — the roles set out in the CRediT taxonomy — makes that division of labour visible and creditable, which matters all the more in collaborative ML projects where data, code, models and evaluation are often distinct workstreams. The consistent vocabulary for describing AI/ML outputs, their documentation and their reproducibility is maintained in the CASRAI Dictionary, so that a claim of reproducibility can be expressed, recorded and checked across the systems that track research outputs.

  • Documenting datasets for machine-learning research: datasheets, data statements and Croissant

    A machine-learning model is, in a profound sense, a product of its training data. Whatever patterns, gaps, imbalances and biases live in that data are absorbed by the model and reproduced in its behaviour. And yet, for much of the field’s recent history, datasets have circulated with remarkably little documentation: a file, perhaps a brief description, and little record of where the data came from, who is represented in it, what it omits, or what it should and should not be used for. The result has been models trained on poorly understood foundations, with predictable consequences for reliability and fairness. A growing movement now treats dataset documentation as a serious, first-class research output in its own right. This article surveys that movement, drawing on the AI and ML research-outputs domain of the CASRAI Dictionary.

    Datasheets for Datasets

    The most influential proposal, borrowing an idea from electronics, is the datasheet. Just as an electronic component ships with a datasheet describing its characteristics, operating conditions and limitations, Datasheets for Datasets proposes that every dataset be accompanied by a document answering a structured set of questions about it. Those questions span the dataset’s whole life: the motivation for creating it and who funded it; its composition — what the instances are, how many there are, what they represent, and whether sensitive or personal data is involved; the collection process — how the data was gathered and whether consent was obtained; any preprocessing, cleaning or labelling; recommended and discouraged uses; and plans for distribution and maintenance. The aim is to make explicit what would otherwise remain tacit, so that anyone considering using the dataset can understand its provenance and judge its fitness for their purpose — and so that the people who created it must think carefully about these matters while they still can.

    Data Statements for NLP

    A closely related proposal arose specifically in natural-language processing, where the characteristics of the people who produced the text in a dataset profoundly shape what a model learns. Data Statements for Natural Language Processing ask dataset creators to document the relevant characteristics of their data: who the speakers and annotators are, the language varieties represented, the situations in which the language was produced, and so on. The motivation is squarely about bias and generalisation. A language model trained on text from a narrow demographic will work less well, and sometimes fail or cause harm, for people outside it — and without documentation, that limitation is invisible until it bites. Data statements make the population behind the data explicit, so that the boundaries of a model’s likely competence can be understood rather than discovered the hard way. Both datasheets and data statements share a conviction: documentation is not bureaucratic overhead but a precondition for using data responsibly.

    Croissant: machine-readable dataset metadata

    Datasheets and data statements are written largely for humans. But for datasets to be discoverable, loadable and interoperable across the many tools of the machine-learning ecosystem, their metadata also needs to be machine-readable. This is the role of Croissant, a metadata format for machine-learning datasets developed through a community effort associated with MLCommons. Croissant provides a standard, structured way to describe a dataset — its resources, structure, fields and semantics — so that tools, frameworks and repositories can understand and work with it consistently, rather than each requiring bespoke handling. By standardising the description, Croissant makes datasets easier to find, load and combine across platforms, and it can carry the kind of responsible-use and provenance information that datasheets capture into a form that systems can act on. It is, in effect, the interoperability layer for dataset documentation.
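
    To give a feel for what machine-readable dataset metadata looks like, the sketch below builds a simplified description loosely modelled on Croissant's JSON-LD layout (a dataset with file objects, record sets and fields). It is an illustration under stated assumptions, not a valid Croissant file: the `@context` block is omitted, every name and URL is a placeholder, and the property names should be checked against the published specification before use. The two helper functions are hypothetical.

```python
import json

# Simplified, Croissant-style description of a hypothetical CSV dataset.
# Assumptions: the @context block a real file carries is omitted, and every
# name, URL and value here is a placeholder.
croissant_like = {
    "@type": "sc:Dataset",
    "name": "example-reviews",
    "description": "Hypothetical product reviews with sentiment labels.",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "name": "reviews.csv",
            "contentUrl": "https://example.org/data/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "name": "text",
                    "dataType": "sc:Text",
                    "source": {"fileObject": {"@id": "reviews.csv"},
                               "extract": {"column": "text"}},
                },
                {
                    "@type": "cr:Field",
                    "name": "label",
                    "dataType": "sc:Text",
                    "source": {"fileObject": {"@id": "reviews.csv"},
                               "extract": {"column": "label"}},
                },
            ],
        }
    ],
}


def field_sources(description):
    """Map each field name to the @id of the file object it is read from."""
    sources = {}
    for record_set in description.get("recordSet", []):
        for fld in record_set.get("field", []):
            sources[fld["name"]] = fld["source"]["fileObject"]["@id"]
    return sources


def undeclared_sources(description):
    """Return fields whose source file is not listed under 'distribution'."""
    declared = {f["@id"] for f in description.get("distribution", [])}
    return [name for name, src in field_sources(description).items()
            if src not in declared]


print(json.dumps(croissant_like, indent=2))
print(undeclared_sources(croissant_like))
```

    The consistency check shows the practical benefit of a structured description: a tool can verify that every field points at a declared file without a human reading the documentation.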

    How this connects to FAIR and persistent identifiers

    This work is the machine-learning expression of principles that the wider research-data community has long advocated. The FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — map directly onto what good dataset documentation achieves: rich, machine-readable metadata (Croissant) makes data findable and interoperable, while thorough human-readable documentation (datasheets, data statements) is what genuine reusability requires, because data cannot be responsibly reused if its provenance and limitations are unknown. Persistent identifiers complete the picture: when a dataset is registered with an identifier through an infrastructure such as DataCite, it becomes citable and trackable, so that it can be referenced precisely in papers, credited to its creators, and connected to the models and results that depend on it. A documented, identified dataset is one that can take its place in the scholarly record as a real output rather than an anonymous file.

    Datasets as research outputs deserving credit

    The deeper shift here is a change in status. Creating a good dataset (collecting, cleaning, labelling and documenting it carefully) is substantial intellectual labour, and the resulting dataset is a genuine research output that others build upon, often more widely than any single paper. Treating datasets as first-class outputs means documenting them properly, identifying them persistently, and crediting the people who made them. The CRediT taxonomy, whose full set of contribution types is described in our overview of the CRediT roles, captures this work through roles such as Data curation, which covers annotating, cleaning and maintaining research data. Recognising dataset creation as creditable contribution is part of the same movement that produced datasheets: an insistence that the data underpinning machine learning, and the people who steward it, be taken seriously.

    A consistent vocabulary for dataset documentation

    For dataset documentation to be useful across repositories, frameworks and institutions, the elements it contains must mean the same thing everywhere — what a field describes, what a provenance statement records, what an intended-use restriction means. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that the metadata describing a dataset is understood identically wherever it travels. Datasheets, data statements and Croissant all rest on the same insight: that a dataset without documentation is a liability, and that documenting it well is not an afterthought but part of doing the research properly.

  • Crediting the hidden labour in AI/ML research: annotation, labelling and evaluation

    Every impressive machine-learning result rests on a foundation of human work that is almost never named. Before a model can be trained, someone has to gather, clean and organise the data; someone has to annotate and label examples, often by hand and at scale; and after training, someone has to evaluate what the model actually does, judging its outputs against criteria that only people can apply. This is real intellectual labour, demanding domain knowledge, careful judgement and sustained attention — and it is frequently invisible, performed by people whose names appear nowhere in the paper that depends on their work. As AI and machine learning become central to research, the question of how to recognise this hidden labour has become a matter of fairness and of accuracy in the scholarly record. This article examines it through the AI and ML research outputs domain of the CASRAI Dictionary.

    The work that does not make the byline

    It is worth being concrete about what this labour involves, because its invisibility partly stems from it being taken for granted. Data annotation and labelling — marking up images, transcribing audio, tagging text — is the painstaking process that gives supervised learning something to learn from, and the quality of a model is bounded by the quality of these labels. Data curation — selecting, cleaning, documenting and organising the data — shapes everything that follows and embeds countless consequential decisions. Evaluation — assessing model outputs, designing test sets, identifying failure modes — is where human expertise determines whether a system actually works. None of this is mechanical. Each requires judgement, and each materially affects the result. Yet the reward structures of research, organised around authorship and citation, have tended to treat all of it as plumbing rather than contribution.

    Why recognition matters here

    The case for recognising this work is partly about fairness to the people who do it, but it is also about the integrity of the record. When the labour behind a dataset or an evaluation is invisible, two things go wrong. First, the people responsible — often early-career researchers, students, or specialist data workers — are denied credit for substantial, skilled contributions, with real consequences for their careers. Second, the research itself becomes harder to understand and trust, because the decisions embedded in annotation and curation — which are exactly the decisions that determine bias, coverage and validity — are hidden from view. Recognising the labour and documenting the choices are two sides of the same coin: both bring into the open the human work that determines what a model is and does.

    How CRediT captures these contributions

    A structured account of who did what is the most direct route to making this labour visible, and the CRediT taxonomy already contains roles that fit it well. Data curation explicitly covers the management activities of annotating, scrubbing and maintaining research data — the very heart of the annotation and labelling work that machine learning depends on. Investigation covers conducting the research and data-collection process, which includes the hands-on work of producing labelled examples. Validation covers verifying results and assessing reproducibility, which maps onto the evaluation of model outputs. Software recognises those who build the tooling that makes annotation and evaluation possible at scale. The full set is described in our overview of the CRediT roles. The point is that the vocabulary for crediting this work largely already exists; what has often been missing is the will to apply it, and the recognition that annotation and evaluation are contributions worth naming rather than chores to be absorbed silently.

    Documenting the data and the decisions

    Recognition of people goes hand in hand with documentation of process. The movement to document datasets properly — through structured records that describe how a dataset was created, by whom, with what labelling procedures, and with what known limitations — makes the hidden labour visible as part of the dataset’s own description. Approaches such as datasheets for datasets and data statements ask creators to record the provenance of the data, the annotation process, the people involved and the judgements made. This documentation serves recognition directly: a dataset that records who annotated it and how is one in which that labour is acknowledged rather than erased. It also serves responsible AI, because the same record that credits the annotators is the record that lets others understand the dataset’s biases and boundaries. Good documentation is thus both an ethical and a scientific instrument — it names the people and exposes the decisions in a single act.

    Data work as a labour question

    There is a broader dimension the research community has had to confront. Much annotation and labelling is performed by data workers — within research teams or through external labour, often poorly paid and rarely credited — whose conditions have become a focus of responsible-AI discussion. Recognising annotation as genuine contribution is connected to recognising it as genuine work, with the dignity, fair treatment and acknowledgement that implies. The contributor metadata that records who did this work is not a clerical detail; it is a statement about whose labour the research is built on.

    A consistent vocabulary for AI contributions

    For the contributions behind AI and ML research to be recognised consistently — across institutions, publishers, dataset repositories and reporting systems — the way they are described must mean the same thing everywhere. Annotation, curation, evaluation and the roles that capture them have to travel without losing their meaning. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the hidden labour behind a model or a dataset, once made visible, is understood identically wherever it is recorded. And because this work is part of the wider research enterprise, recognising it well also serves the goals of fair research administration, explored in our research administration resources. The most advanced model is, at bottom, an artefact of human judgement applied to data; crediting the people who supply that judgement is simply an honest account of how the research was done.

  • Model cards and datasheets: documenting AI/ML research outputs

    For most of the history of the scholarly record, the unit of documentation was the paper. A piece of empirical research was described, peer-reviewed, and citable as an article; the underlying data and code were, at best, supplementary. Machine-learning research has been quietly rewriting that assumption. A trained model and the dataset it learned from are research outputs in their own right, and the community has developed its own documentation conventions for them: the model card and the datasheet for datasets. This piece sets out what they are, where they came from, and why they belong in the formal research record that CASRAI’s AI/ML research outputs domain is designed to describe.

    Model cards: a short, structured account of a model

    The model card was proposed by Margaret Mitchell and colleagues in their 2019 paper Model Cards for Model Reporting. The idea is disarmingly simple: every trained model should ship with a short, structured document that answers the questions a responsible user would need to ask before relying on it. Who built it and when? What is it intended to do, and what is it explicitly not intended to do? What data was it trained on? How was it evaluated, and on which populations or subgroups? What are its known limitations, failure modes, and ethical considerations?

    The motivating insight was that aggregate performance numbers conceal more than they reveal. A model that is 95% accurate overall can be 99% accurate for one group and 70% for another. A model card’s evaluation section is expected to report performance disaggregated across relevant factors, so that the user can see where the model works and where it does not. This is documentation in service of accountability, not marketing.
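
    The arithmetic behind that example can be checked with a short sketch. The synthetic evaluation set below is constructed (as an assumption for illustration) so that group A is 99% accurate, group B is 70% accurate, and the aggregate is 95%, which happens when group B makes up roughly 14% of the examples. The function name `disaggregated_accuracy` is hypothetical.

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Return (overall_accuracy, {group: accuracy}) for (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group


# Synthetic data: group A has 2500 examples with 2475 correct,
# group B has 400 examples with 280 correct.
records = (
    [("A", True)] * 2475 + [("A", False)] * 25
    + [("B", True)] * 280 + [("B", False)] * 120
)

overall, per_group = disaggregated_accuracy(records)
print(f"overall: {overall:.2%}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2%}")
```

    The headline figure is dominated by the larger group, which is exactly why the smaller group's weaker result disappears unless it is reported separately.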

    Model cards have since become near-ubiquitous in practice. The Hugging Face Hub, the dominant model registry, renders each model repository’s README as its model card, and the convention has spread to internal model registries across industry and academia. The format is loose enough to suit a small fine-tuned classifier or a large foundation model, but the core sections (intended use, training data, evaluation, limitations) are stable.

    Datasheets for datasets: provenance for the data

    The companion convention for data is the datasheet for datasets, proposed by Timnit Gebru and colleagues in 2018 (revised and published in Communications of the ACM in 2021). The analogy in the title is to the datasheets that accompany electronic components: a structured specification that lets an engineer decide whether a part is fit for their purpose.

    A datasheet works through a dataset’s full lifecycle in a series of question prompts. Motivation: why was the dataset created, and by whom? Composition: what does each instance represent, are there labels, are there sensitive subpopulations? Collection process: how was the data acquired, was consent obtained, were people aware they were being recorded? Preprocessing and cleaning: what was done to the raw data, and is the raw data preserved? Uses: what has the dataset been used for, and what uses should be avoided? Distribution and maintenance: how is it licensed, who maintains it, and how will errors be corrected?
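
    The question prompts above lend themselves to a simple checklist. In the sketch below the section names follow the life-cycle stages just listed, while the prompts are paraphrased from this article rather than copied from the published template; `DATASHEET_PROMPTS` and `unanswered_prompts` are names invented for the example, and the draft answers are made up.

```python
# Prompts paraphrased from the article text, not the published template.
DATASHEET_PROMPTS = {
    "motivation": ["Why was the dataset created, and by whom?"],
    "composition": [
        "What does each instance represent?",
        "Are there labels?",
        "Are there sensitive subpopulations?",
    ],
    "collection_process": [
        "How was the data acquired?",
        "Was consent obtained?",
    ],
    "preprocessing": [
        "What was done to the raw data?",
        "Is the raw data preserved?",
    ],
    "uses": [
        "What has the dataset been used for?",
        "What uses should be avoided?",
    ],
    "distribution_and_maintenance": [
        "How is it licensed?",
        "Who maintains it, and how will errors be corrected?",
    ],
}


def unanswered_prompts(answers):
    """List (section, prompt) pairs that have no non-empty answer."""
    gaps = []
    for section, prompts in DATASHEET_PROMPTS.items():
        for prompt in prompts:
            if not answers.get(section, {}).get(prompt, "").strip():
                gaps.append((section, prompt))
    return gaps


# A hypothetical, partly completed draft.
draft = {
    "motivation": {
        "Why was the dataset created, and by whom?": "To study ticket triage; Example Lab.",
    },
    "uses": {
        "What uses should be avoided?": "Any decision about individuals.",
    },
}

gaps = unanswered_prompts(draft)
print(f"{len(gaps)} prompts still unanswered")
```

    Run before a dataset is deposited, a check like this turns the datasheet from an optional extra into a release gate.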

    The point of the datasheet is to make the provenance and limitations of a dataset legible to people who did not collect it. A dataset reused without understanding its collection context is a well-documented source of downstream harm; the datasheet is the mechanism for transmitting that context with the data.

    Why these belong in the research record

    It is tempting to treat model cards and datasheets as engineering hygiene — useful, but not scholarly in the way a paper is. We think that view is mistaken, for three reasons.

    • They are how ML researchers are increasingly evaluated. A well-constructed datasheet or a rigorous disaggregated model card represents real intellectual labour: the careful articulation of provenance, intended use, and limitation. Under responsible-assessment regimes such as the narrative CV, this kind of output is exactly the contribution a researcher should be able to claim.
    • They are the documentation layer that makes a model or dataset FAIR. A trained model with a DataCite DOI but no model card is findable and accessible but not meaningfully reusable. The card supplies the metadata that the FAIR principles require for reuse.
    • They carry the accountability that the research record is supposed to preserve. When a model is later found to behave badly, the model card is the contemporaneous record of what its builders claimed and disclosed. That is precisely the function the published record has always served for empirical claims.

    How persistent identifiers apply

    For a model card or datasheet to function as a citable research output, it needs the same identifier infrastructure as any other output. The pattern that has emerged, and that CASRAI’s guidance on persistent identifiers recommends, is straightforward.

    The dataset or model receives a DataCite DOI, minted by a generalist repository (Zenodo, Figshare) or a domain-specific one. The datasheet or model card is published as part of that deposit, so that resolving the DOI reaches both the artefact and its documentation. Where source code is involved, a Software Heritage persistent identifier (SWHID) pins the exact code state. Contributors are identified by ORCID iD and institutions by ROR ID, so that the people and organisations behind the artefact are unambiguous. Where the model or dataset belongs to a larger project, a RAiD ties it to the project record. The model card’s documentation of its training data should, ideally, cite the dataset’s DOI directly, closing the provenance loop between model and data.
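
    The pattern can be pictured as two linked deposit records. The sketch below is a rough illustration only: the record layout, the key names and the `provenance_loop_closed` helper are assumptions, not any repository's schema, and every identifier is a placeholder rather than a real record.

```python
# All identifiers below are placeholders for illustration, not real records.
dataset_record = {
    "doi": "10.5555/example-dataset",
    "documentation": "datasheet.md",
    "creators": [
        {"name": "A. Researcher",
         "orcid": "0000-0000-0000-0000",
         "ror": "https://ror.org/placeholder"},
    ],
}

model_record = {
    "doi": "10.5555/example-model",
    "documentation": "model_card.md",
    "code_identifier": "swh:1:rev:" + "0" * 40,  # Software Heritage identifier (placeholder)
    "project_raid": "placeholder-raid",
    "training_data_dois": ["10.5555/EXAMPLE-DATASET"],
}


def provenance_loop_closed(model, datasets):
    """True if the model cites at least one training-data DOI and all are known datasets."""
    known = {d["doi"].lower() for d in datasets}  # DOIs are case-insensitive
    cited = model.get("training_data_dois", [])
    return bool(cited) and all(doi.lower() in known for doi in cited)


print(provenance_loop_closed(model_record, [dataset_record]))
```

    Comparing DOIs case-insensitively matters here because the same DOI can legitimately appear in different letter cases across systems.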

    How CRediT applies

    Contributorship for these outputs maps onto CRediT better than one might expect, though not perfectly. The person who designed the data-collection protocol is doing Methodology; the people who collected, cleaned, and annotated the data are doing Investigation and Data curation; the person who trained the model is doing Software and, where the training method is itself novel, Methodology; the person who built and ran the evaluation suite is doing Validation. We have written separately about the friction points in this mapping — the Software role in particular tends to absorb too much — but the basic correspondence holds, and a model or dataset deposit should carry a CRediT statement just as a paper does.
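
    The correspondence described above can be written down as a lookup table and used to generate a contributorship statement. The mapping below mirrors the paragraph; the function name `credit_statement`, the statement format and the contributor names are assumptions for this example.

```python
# Activity-to-role mapping taken from the paragraph above.
ACTIVITY_TO_CREDIT = {
    "designed the data-collection protocol": ["Methodology"],
    "collected, cleaned and annotated the data": ["Investigation", "Data curation"],
    "trained the model": ["Software"],
    "built and ran the evaluation suite": ["Validation"],
}


def credit_statement(contributions):
    """Turn {person: [activity, ...]} into 'Name: Role, Role; Name: Role'."""
    parts = []
    for person, activities in contributions.items():
        roles = []
        for activity in activities:
            for role in ACTIVITY_TO_CREDIT[activity]:
                if role not in roles:  # keep first-seen order, no duplicates
                    roles.append(role)
        parts.append(f"{person}: {', '.join(roles)}")
    return "; ".join(parts)


# Hypothetical contributors.
contributions = {
    "Ada": ["designed the data-collection protocol", "trained the model"],
    "Grace": ["collected, cleaned and annotated the data"],
    "Alan": ["built and ran the evaluation suite"],
}

statement = credit_statement(contributions)
print(statement)
```

    Where the training method is itself novel, Methodology would be added to the model trainer's roles, as the paragraph notes; the table keeps only the default case.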

    Quality varies, and that matters

    A note of realism. Because model cards and datasheets are not yet enforced by peer review in the way a methods section is, their quality varies enormously. A thorough datasheet that honestly documents consent gaps and known biases is a genuine contribution; a model card that lists only headline accuracy and a boilerplate licence is documentation theatre. The value of folding these artefacts into the formal research record — with identifiers, contributorship, and eventually review — is precisely that it creates the incentive and the scrutiny to make them good.

    What to do now

    For researchers releasing a model or dataset: write the model card or datasheet using the established Mitchell et al. and Gebru et al. templates; deposit it with the artefact under a DataCite DOI; attach a CRediT statement and ORCID iDs; and cite the dataset’s DOI from the model card where the model was trained on a citable dataset. For institutions and funders: recognise these outputs in CRIS systems and assessment processes as first-class, identifier-bearing research outputs, not as supplementary material.
