  • Datasheets for Datasets: FAIR Habits for AI Data

    Datasheets for datasets are structured documentation records — covering motivation, composition, collection process, and recommended uses — that accompany a dataset the way a technical datasheet accompanies an electronic component. Proposed for machine learning in 2018, the practice mirrors documentation habits research data managers have used for decades, and research offices are increasingly the ones best placed to recognise and credit that documentation work.

    A datasheet for a dataset is a short, standardised document that records where a dataset came from, how it was collected and labelled, what it should and should not be used for, and who is responsible for maintaining it. The idea was formalised by Timnit Gebru and colleagues in the 2018 paper “Datasheets for Datasets” (arXiv:1803.09010), later published in Communications of the ACM, Vol. 64, No. 12 (2021).

    Where did datasheets for datasets come from?

    Gebru et al.’s 2018 paper argued that machine learning datasets circulated with almost no accompanying documentation, unlike the datasheets that have long shipped with electronic components. The paper has since been cited by more than 4,700 works, according to citation counts indexed alongside the ACM Digital Library record — a scale of uptake that puts it among the most influential AI-ethics-adjacent papers of the past decade.

    The proposal did not invent documentation practice from nothing. It imported habits that research-data communities already used. The Data Documentation Initiative (DDI), a metadata standard maintained by the DDI Alliance for the social, behavioural, and economic sciences, has specified variable-level dataset documentation since the early 2000s — well before the AI field adopted the term “datasheet.”

    What does a dataset datasheet actually document?

    Gebru et al.’s original template organises documentation into seven sections: motivation, composition, collection process, preprocessing/cleaning/labelling, uses, distribution, and maintenance. Each section is a set of prompts, not a checkbox — creators answer in prose, which is what makes the format adaptable across domains.

    • Motivation: why the dataset was created, who funded it, and what problem it addresses.
    • Composition: what the instances represent, how many there are, and whether sensitive attributes or personal data are present.
    • Collection process: how and from whom the data was gathered, and what consent or licensing applied.
    • Preprocessing, cleaning, and labelling: what was done to the raw data, and whether the raw data was kept.
    • Uses: tasks the dataset is suited for and, critically, tasks it should not be used for.
    • Distribution: how the dataset is shared, and under what licence or terms of use.
    • Maintenance: who is responsible for updates, corrections, and retraction if problems surface.
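
    The seven sections named above can be sketched as a simple record with a completeness check. This is a minimal illustration, assuming Python 3.9 or later; the `Datasheet` class, its field names, and the `missing_sections` helper are hypothetical names chosen here and are not part of Gebru et al.'s template, and the example answers are invented placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Prose answers for the seven sections of the template (hypothetical model)."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""  # covers preprocessing, cleaning, and labelling
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of sections whose answer is empty or whitespace."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Invented placeholder answers, for illustration only.
sheet = Datasheet(
    motivation="Built to benchmark sentiment classifiers on product reviews.",
    composition="English-language reviews; no personal data retained.",
    uses="Suited to sentiment analysis; not suited to profiling individual authors.",
)

print(sheet.missing_sections())
# ['collection_process', 'preprocessing', 'distribution', 'maintenance']
```

    Keeping each answer as free prose mirrors the prompt-based design of the template. The check only flags sections left blank; it says nothing about the quality of the answers.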

    Adjacent frameworks document different units of the same pipeline. Model Cards for Model Reporting (Mitchell et al., Google, 2019) document a trained model’s performance across demographic subgroups rather than the training data itself. The Dataset Nutrition Label, developed by the Data Nutrition Project (originating at Harvard and MIT), condenses similar information into a scannable label modelled on food nutrition facts. The table below maps how these efforts differ.

    Framework | Origin | Unit documented | Primary audience
    Datasheets for Datasets | Gebru et al., 2018 (arXiv/ACM) | Dataset provenance and composition | Dataset creators and consumers
    Model Cards for Model Reporting | Mitchell et al., Google, 2019 | Trained model performance | Model deployers and auditors
    Dataset Nutrition Label | Data Nutrition Project, Harvard/MIT | Dataset health at a glance | Practitioners screening datasets quickly
    Datasheets for Digital Cultural Heritage | Europeana Research/EuropeanaTech, 2023 | Heritage collection reuse context | GLAM institutions and researchers

    The FAIR Data Principles — Findable, Accessible, Interoperable, Reusable, set out by Wilkinson et al. in Scientific Data (2016) — were written for research data broadly, not for AI training corpora specifically. Datasheets operationalise the “Reusable” pillar in particular: a dataset without documented provenance, licensing, and known limitations cannot be responsibly reused, regardless of how accessible its files are.

    This is a FAIR-adjacent practice rather than a formal extension of FAIR itself, and research offices should frame it that way rather than treating “datasheet” and “FAIR-compliant” as synonyms. A dataset can be technically Findable and Accessible while still shipping with a thin or absent datasheet — the two efforts solve overlapping but distinct problems.

    Dataset-level documentation also underpins dataset citation. The Force11 Joint Declaration of Data Citation Principles (2014) established that datasets should be cited as first-class research outputs, and registration agencies such as DataCite issue the DOIs that make that citation persistent. A datasheet gives the context a citation alone cannot: not just that a dataset exists and where, but what it contains and how it may legitimately be used.

    Answer-first questions on datasheets for datasets

    What are datasheets for datasets?

    Datasheets for datasets are structured documents that record a dataset’s motivation, composition, collection process, and intended uses. They were proposed by Gebru et al. in 2018 to give dataset creators and consumers a shared, standardised record — closing the gap between how thoroughly software and hardware components are documented and how poorly datasets typically are.

    What information does a dataset datasheet include?

    A complete datasheet covers seven areas: motivation, composition, collection process, preprocessing and labelling, recommended uses, distribution terms, and maintenance responsibility. Creators answer narrative prompts under each heading rather than filling in a fixed schema, which is why the format has been adapted for domains as different as machine learning corpora and digitised cultural heritage collections.

    How do datasheets differ from model cards?

    Datasheets document the dataset — its provenance, composition, and licensing. Model cards, introduced by Mitchell et al. at Google in 2019, document the trained model built from that data, including performance disaggregated across demographic groups. The two are complementary: a model card without a corresponding dataset datasheet leaves the training-data provenance question unanswered.

    What this means for research offices

    Research administration has treated dataset documentation as a data-management-plan checkbox for years; AI training-data transparency debates are now forcing the same discipline onto machine learning teams. Institutions that already run mature research-data-management functions have a genuine head start: DMP review, licensing checks, and provenance tracking are core competencies, not new ones.

    One overlooked lever is contributor recognition. The CRediT contributor role taxonomy, which CASRAI originated in 2014, is now stewarded by NISO as ANSI/NISO Z39.104-2022. CRediT’s Data Curation role exists precisely to credit the labour of managing, annotating, and maintaining research data for reuse, which is the same labour a datasheet documents. Research offices that already apply CRediT to publications have a ready-made mechanism for recognising the people who write and maintain dataset datasheets, rather than letting that work go uncredited.

    • Require a datasheet (or equivalent provenance record) as a condition of institutional data-repository deposit, alongside existing licensing checks.
    • Map datasheet authorship to CRediT’s Data Curation role in institutional repository metadata.
    • Treat AI training-data provenance requests from partners and funders as an extension of existing data-management-plan review, not a new workflow.
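
    The first two recommendations above can be sketched as a deposit check plus a role mapping. This is a minimal illustration, not any repository platform's actual API: the record layout (`license`, `datasheet`, `datasheet_authors`, `contributors`) and both function names are assumptions made for the example, and the sample record is invented. Only the role name "Data curation" comes from the CRediT taxonomy.

```python
CREDIT_DATA_CURATION = "Data curation"  # role name as listed in the CRediT taxonomy


def check_deposit(record: dict) -> list[str]:
    """Return the problems that should block a deposit (hypothetical policy)."""
    problems = []
    if not record.get("license"):
        problems.append("missing license")
    if not record.get("datasheet"):
        problems.append("missing datasheet or equivalent provenance record")
    return problems


def credit_datasheet_authors(record: dict) -> dict:
    """Return a copy of the record in which datasheet authors hold the Data curation role."""
    contributors = {
        name: set(roles) for name, roles in record.get("contributors", {}).items()
    }
    for name in record.get("datasheet_authors", []):
        contributors.setdefault(name, set()).add(CREDIT_DATA_CURATION)
    # Sort roles so the output is stable and serialisable.
    return {**record, "contributors": {n: sorted(r) for n, r in contributors.items()}}


# Invented sample record, for illustration only.
record = {
    "title": "Example survey corpus",
    "license": "CC-BY-4.0",
    "datasheet": None,
    "datasheet_authors": ["A. Curator"],
    "contributors": {"B. Researcher": ["Conceptualization"]},
}

print(check_deposit(record))
# ['missing datasheet or equivalent provenance record']
print(credit_datasheet_authors(record)["contributors"])
# {'B. Researcher': ['Conceptualization'], 'A. Curator': ['Data curation']}
```

    Running the datasheet check next to the existing licence check keeps it inside the current deposit workflow rather than creating a separate one.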

    Where the practice is heading

    Uptake outside machine learning is accelerating. The Europeana Research Community and EuropeanaTech Community published Datasheets for Digital Cultural Heritage Datasets in the Journal of Open Humanities Data in 2023 (DOI: 10.5334/johd.124), adapting the template for collections that were digitised long after their original creation. A revised Version 2 template was released in July 2025, with alignment to the DCAT-AP data-portal application profile identified as ongoing work.

    AI training-data transparency requirements are converging on the same documentation habits that research-data management has practised for two decades under the Data Documentation Initiative and, since 2016, under the FAIR principles. Research offices that recognise datasheets as an extension of existing data governance, rather than a novel AI-specific burden, will be better positioned to advise both AI developers and dataset creators as scrutiny of training-data provenance intensifies.