Tag: responsible AI

  • AI Model Documentation: Datasheets and Model Cards

    Model cards are short, structured documents that report what an AI model does, how it was evaluated, and the conditions under which it should and should not be used. Together with datasheets for datasets, which document the data a model is trained and tested on, they form the backbone of responsible-AI documentation. Both were proposed to bring the same rigour to AI artefacts that established disciplines bring to materials and reagents, and both directly support reproducibility, accountability and the integrity of the research record.

    Model cards (Mitchell et al. 2019)

    Model cards were introduced by Mitchell and colleagues in 2019 as a framework for transparent model reporting. A model card accompanies a trained model and records, in a consistent format, the essential facts a user needs to decide whether the model is appropriate for their purpose. Crucially, model cards emphasise disaggregated evaluation: reporting performance not only in aggregate but across relevant subgroups, so that uneven performance is visible rather than hidden behind a single headline number.
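
    Disaggregated evaluation can be illustrated with a minimal Python sketch. The subgroup names, labels and the `disaggregated_accuracy` helper are hypothetical and chosen only for illustration; a real evaluation would use the metrics and subgroups relevant to the model in question.

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Return (overall accuracy, {subgroup: accuracy}).

    Each record is a (subgroup, true_label, predicted_label) tuple.
    """
    if not records:
        raise ValueError("at least one evaluation record is required")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        totals[subgroup] += 1
        correct[subgroup] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {group: correct[group] / totals[group] for group in totals}
    return overall, per_group


# Hypothetical toy evaluation set: the model is right on every group_a
# example but on only half of the group_b examples.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0),
]
overall, per_group = disaggregated_accuracy(records)
print(f"overall: {overall:.2f}")
for group, accuracy in sorted(per_group.items()):
    print(f"{group}: {accuracy:.2f}")
```

    On this toy data the aggregate accuracy is about 0.83, while accuracy on group_b is 0.50: the kind of gap a single headline number would hide.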

    A typical model card covers model details (who built it, version, architecture), intended use and out-of-scope uses, evaluation data and metrics, performance across conditions, and ethical considerations, limitations and caveats. By stating intended and prohibited uses explicitly, a model card reduces the risk of a model being deployed in a context it was never validated for.
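
    As a rough sketch, the sections listed above could be captured in a small machine-readable record. The field names, the `ModelCard` class and the example values are assumptions made for illustration; they are not a schema defined by Mitchell et al. or by any particular library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ModelCard:
    # Model details
    name: str
    version: str
    developers: List[str]
    architecture: str
    # Intended use and out-of-scope uses
    intended_uses: List[str]
    out_of_scope_uses: List[str]
    # Evaluation data and metrics
    evaluation_data: str
    metrics: Dict[str, float]
    # Performance across conditions, e.g. {"group_b": {"accuracy": 0.5}}
    performance_by_condition: Dict[str, Dict[str, float]] = field(default_factory=dict)
    # Ethical considerations, limitations and caveats
    ethical_considerations: List[str] = field(default_factory=list)
    limitations: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that were left empty."""
        return [name for name, value in asdict(self).items() if value in ("", [], {})]


# Hypothetical model and values, for illustration only.
card = ModelCard(
    name="toy-sentiment-classifier",
    version="1.0.0",
    developers=["Example Lab"],
    architecture="logistic regression",
    intended_uses=["research on English product reviews"],
    out_of_scope_uses=["medical or legal decision-making"],
    evaluation_data="held-out split of a hypothetical review corpus",
    metrics={"accuracy": 0.83},
    performance_by_condition={"group_a": {"accuracy": 1.0}, "group_b": {"accuracy": 0.5}},
)
print(json.dumps(asdict(card), indent=2))
print("missing:", card.missing_sections())
```

    The `missing_sections` check is a design choice rather than part of the framework: it makes an incomplete card visible before the model is released.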

    Datasheets for datasets (Gebru et al.)

    Datasheets for datasets, proposed by Gebru and colleagues, apply the same documentation philosophy to data. A datasheet answers questions about a dataset’s whole life cycle: the motivation for creating it, its composition (what the instances represent, how many, whether sensitive data is present), the collection process, any preprocessing, cleaning or labelling, intended and discouraged uses, distribution terms, and arrangements for maintenance. Because so many problems in machine learning originate in the data, documenting it is often more consequential than documenting the model.
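
    The life-cycle questions above lend themselves to a simple template. The sketch below is illustrative only: the section names paraphrase the list in the previous paragraph, and `render_datasheet` and the example answers are hypothetical.

```python
# Section names paraphrase the life-cycle stages described above.
DATASHEET_SECTIONS = [
    "motivation",
    "composition",
    "collection process",
    "preprocessing, cleaning and labelling",
    "uses",
    "distribution",
    "maintenance",
]


def render_datasheet(answers):
    """Render a plain-text datasheet and list the sections still unanswered."""
    lines, unanswered = [], []
    for section in DATASHEET_SECTIONS:
        text = answers.get(section, "").strip()
        if not text:
            unanswered.append(section)
            text = "[not yet documented]"
        lines.append(f"{section.upper()}\n  {text}")
    return "\n\n".join(lines), unanswered


# Hypothetical answers for a hypothetical dataset.
answers = {
    "motivation": "Built to study sentiment in product reviews.",
    "composition": "10,000 English reviews; no known sensitive data.",
    "uses": "Intended for research; not for profiling individuals.",
}
datasheet, unanswered = render_datasheet(answers)
print(datasheet)
print("\nStill to document:", ", ".join(unanswered))
```

    Listing the unanswered sections explicitly keeps gaps in the record visible instead of letting them pass silently.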

    Artefact               | Documents       | Key contents
    Model card             | A trained model | Intended use, evaluation, disaggregated performance, limitations
    Datasheet for datasets | A dataset       | Motivation, composition, collection, preprocessing, uses, maintenance

    How they support reproducibility and accountability

    Documentation turns an opaque artefact into an auditable one. A model card tells a future researcher exactly which model version and evaluation protocol produced a published result, while a datasheet records the data provenance needed to interpret or rebuild that result. This is the documentation layer that complements the engineering practices in our guide to reproducibility of machine learning research: code and seeds make a result re-runnable, while cards and datasheets make it interpretable and accountable.
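
    One way to tie a published number to the model version, evaluation protocol and data behind it is to store them together in a single record. The sketch below assumes a small in-memory dataset and uses hypothetical names throughout (`dataset_fingerprint`, `result_record`, the version and protocol strings); it illustrates the idea and is not a prescribed format.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """SHA-256 over the rows in order, so any change to the data changes the digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def result_record(metric, value, model_version, eval_protocol, rows):
    """Bundle a reported result with the facts needed to audit it."""
    return {
        "metric": metric,
        "value": value,
        "model_version": model_version,    # as stated in the model card
        "evaluation_protocol": eval_protocol,
        "dataset_sha256": dataset_fingerprint(rows),  # as recorded in the datasheet
        "dataset_rows": len(rows),
    }


# Hypothetical evaluation data and identifiers.
rows = [
    {"text": "great product", "label": 1},
    {"text": "arrived broken", "label": 0},
]
record = result_record("accuracy", 0.83, "1.0.0", "held-out split, single run", rows)
print(json.dumps(record, indent=2))
```

    A later reader who recomputes the fingerprint and gets a different digest knows the data has changed since the result was reported.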

    These artefacts also support the broader disclosure expectations now common in scholarly publishing. When a study uses generative AI, documenting the model and its data complements the editorial requirements covered in our explainer on generative AI and research disclosure norms and in our wider GenAI disclosure coverage.

    Embedding documentation in the research record

    For documentation to be useful it must be findable and citable as part of the scholarly record, not buried in a code repository. Treating model cards and datasheets as first-class research outputs supports proper credit assignment through frameworks such as CRediT and consistent description through the casrai.org research dictionary. Doing so recognises the substantial work of data curation and evaluation that these documents describe.

    Frequently asked questions

    What is a model card?

    A model card is a structured document, proposed by Mitchell et al. in 2019, that reports an AI model’s intended use, evaluation results (including across subgroups), limitations and ethical considerations, so users can judge whether it suits their purpose.

    What is a datasheet for datasets?

    A datasheet, proposed by Gebru et al., documents a dataset’s motivation, composition, collection and preprocessing, intended uses and maintenance, capturing the data provenance needed to interpret or reproduce results.

    How do model cards differ from datasheets?

    Model cards document a trained model; datasheets document the dataset behind it. Used together, they describe both the artefact and the data that shaped it.

    Why does AI documentation matter for reproducibility?

    It records which model version, evaluation protocol and data produced a result, turning an opaque artefact into an auditable one that others can interpret, scrutinise and rebuild.

  • Crediting the hidden labour in AI/ML research: annotation, labelling and evaluation

    Every impressive machine-learning result rests on a foundation of human work that is almost never named. Before a model can be trained, someone has to gather, clean and organise the data; someone has to annotate and label examples, often by hand and at scale; and after training, someone has to evaluate what the model actually does, judging its outputs against criteria that only people can apply. This is real intellectual labour, demanding domain knowledge, careful judgement and sustained attention — and it is frequently invisible, performed by people whose names appear nowhere in the paper that depends on their work. As AI and machine learning become central to research, the question of how to recognise this hidden labour has become a matter of fairness and of accuracy in the scholarly record. This article examines it through the AI and ML research outputs domain of the CASRAI Dictionary.

    The work that does not make the byline

    It is worth being concrete about what this labour involves, because its invisibility partly stems from it being taken for granted. Data annotation and labelling — marking up images, transcribing audio, tagging text — is the painstaking process that gives supervised learning something to learn from, and the quality of a model is bounded by the quality of these labels. Data curation — selecting, cleaning, documenting and organising the data — shapes everything that follows and embeds countless consequential decisions. Evaluation — assessing model outputs, designing test sets, identifying failure modes — is where human expertise determines whether a system actually works. None of this is mechanical. Each requires judgement, and each materially affects the result. Yet the reward structures of research, organised around authorship and citation, have tended to treat all of it as plumbing rather than contribution.

    Why recognition matters here

    The case for recognising this work is partly about fairness to the people who do it, but it is also about the integrity of the record. When the labour behind a dataset or an evaluation is invisible, two things go wrong. First, the people responsible — often early-career researchers, students, or specialist data workers — are denied credit for substantial, skilled contributions, with real consequences for their careers. Second, the research itself becomes harder to understand and trust, because the decisions embedded in annotation and curation — which are exactly the decisions that determine bias, coverage and validity — are hidden from view. Recognising the labour and documenting the choices are two sides of the same coin: both bring into the open the human work that determines what a model is and does.

    How CRediT captures these contributions

    A structured account of who did what is the most direct route to making this labour visible, and the CRediT taxonomy already contains roles that fit it well. Data curation explicitly covers the management activities of annotating, scrubbing and maintaining research data — the very heart of the annotation and labelling work that machine learning depends on. Investigation covers conducting the research and data-collection process, which includes the hands-on work of producing labelled examples. Validation covers verifying results and assessing reproducibility, which maps onto the evaluation of model outputs. Software recognises those who build the tooling that makes annotation and evaluation possible at scale. The full set is described in our overview of the CRediT roles. The point is that the vocabulary for crediting this work largely already exists; what has often been missing is the will to apply it, and the recognition that annotation and evaluation are contributions worth naming rather than chores to be absorbed silently.
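
    The mapping described above can be sketched as contributor metadata. The role names are the fourteen CRediT roles; the task labels, the `TASK_TO_ROLE` mapping, the `contribution_statement` helper and the contributor names are hypothetical and reflect only the reading of the roles given in this article.

```python
# The fourteen CRediT roles.
CREDIT_ROLES = {
    "Conceptualization", "Data curation", "Formal analysis", "Funding acquisition",
    "Investigation", "Methodology", "Project administration", "Resources",
    "Software", "Supervision", "Validation", "Visualization",
    "Writing – original draft", "Writing – review & editing",
}

# Hypothetical mapping from ML data-work tasks to CRediT roles,
# following the reading of the roles given in the text above.
TASK_TO_ROLE = {
    "annotation": "Data curation",
    "labelled-example production": "Investigation",
    "output evaluation": "Validation",
    "annotation tooling": "Software",
}
assert set(TASK_TO_ROLE.values()) <= CREDIT_ROLES


def contribution_statement(contributions):
    """Build a CRediT-style statement from (person, task) pairs."""
    by_role = {}
    for person, task in contributions:
        if task not in TASK_TO_ROLE:
            raise ValueError(f"no CRediT role mapped for task: {task!r}")
        people = by_role.setdefault(TASK_TO_ROLE[task], [])
        if person not in people:  # name each person once per role
            people.append(person)
    return "; ".join(
        f"{role}: {', '.join(people)}" for role, people in sorted(by_role.items())
    )


# Hypothetical contributors.
contributions = [
    ("A. Annotator", "annotation"),
    ("E. Evaluator", "output evaluation"),
    ("A. Annotator", "labelled-example production"),
    ("T. Toolsmith", "annotation tooling"),
]
statement = contribution_statement(contributions)
print(statement)
```

    Rejecting unmapped tasks, instead of dropping them quietly, means no contribution disappears from the statement without someone noticing.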

    Documenting the data and the decisions

    Recognition of people goes hand in hand with documentation of process. The movement to document datasets properly — through structured records that describe how a dataset was created, by whom, with what labelling procedures, and with what known limitations — makes the hidden labour visible as part of the dataset’s own description. Approaches such as datasheets for datasets and data statements ask creators to record the provenance of the data, the annotation process, the people involved and the judgements made. This documentation serves recognition directly: a dataset that records who annotated it and how is one in which that labour is acknowledged rather than erased. It also serves responsible AI, because the same record that credits the annotators is the record that lets others understand the dataset’s biases and boundaries. Good documentation is thus both an ethical and a scientific instrument — it names the people and exposes the decisions in a single act.

    Data work as a labour question

    There is a broader dimension the research community has had to confront. Much annotation and labelling is performed by data workers — within research teams or through external labour, often poorly paid and rarely credited — whose conditions have become a focus of responsible-AI discussion. Recognising annotation as genuine contribution is connected to recognising it as genuine work, with the dignity, fair treatment and acknowledgement that implies. The contributor metadata that records who did this work is not a clerical detail; it is a statement about whose labour the research is built on.

    A consistent vocabulary for AI contributions

    For the contributions behind AI and ML research to be recognised consistently — across institutions, publishers, dataset repositories and reporting systems — the way they are described must mean the same thing everywhere. Annotation, curation, evaluation and the roles that capture them have to travel without losing their meaning. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the hidden labour behind a model or a dataset, once made visible, is understood identically wherever it is recorded. And because this work is part of the wider research enterprise, recognising it well also serves the goals of fair research administration, explored in our research administration resources. The most advanced model is, at bottom, an artefact of human judgement applied to data; crediting the people who supply that judgement is simply an honest account of how the research was done.