
    Documenting datasets for machine-learning research: datasheets, data statements and Croissant

    A machine-learning model is, in a profound sense, a product of its training data. Whatever patterns, gaps, imbalances and biases live in that data are absorbed by the model and reproduced in its behaviour. And yet, for much of the field’s recent history, datasets have circulated with remarkably little documentation: a file, perhaps a brief description, and little record of where the data came from, who is represented in it, what it omits, or what it should and should not be used for. The result has been models trained on poorly understood foundations, with predictable consequences for reliability and fairness. A growing movement now treats dataset documentation as a serious, first-class research output in its own right. This article surveys that movement, drawing on the AI and ML research-outputs domain of the CASRAI Dictionary.

    Datasheets for Datasets

    The most influential proposal, borrowing an idea from electronics, is the datasheet. Just as an electronic component ships with a datasheet describing its characteristics, operating conditions and limitations, Datasheets for Datasets proposes that every dataset be accompanied by a document answering a structured set of questions about it. Those questions span the dataset’s whole life: the motivation for creating it and who funded it; its composition (what the instances are, how many there are, what they represent, and whether sensitive or personal data is involved); the collection process (how the data was gathered and whether consent was obtained); any preprocessing, cleaning or labelling; recommended and discouraged uses; and plans for distribution and maintenance. The aim is twofold: to make explicit what would otherwise remain tacit, so that anyone considering using the dataset can understand its provenance and judge its fitness for their purpose, and to prompt the people who create it to think carefully about these matters while they can still act on them.
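
    The question groups above lend themselves to a simple structured record. The following is a minimal Python sketch, not an official schema: the section identifiers mirror the groups named in this article, while the class name, method names and example dataset are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Section identifiers mirror the question groups described in the text.
# They are illustrative names, not part of any official datasheet schema.
SECTIONS = (
    "motivation",
    "composition",
    "collection_process",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
)


@dataclass
class Datasheet:
    """Free-text answers to datasheet questions, grouped by section."""
    dataset_name: str
    answers: dict[str, dict[str, str]] = field(default_factory=dict)

    def answer(self, section: str, question: str, text: str) -> None:
        """Record the answer to one question in a known section."""
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        self.answers.setdefault(section, {})[question] = text

    def missing_sections(self) -> list[str]:
        """Sections that have no answered question yet, in canonical order."""
        return [s for s in SECTIONS if not self.answers.get(s)]


# Hypothetical dataset used purely as an example.
sheet = Datasheet("example-reviews")
sheet.answer("motivation", "Why was the dataset created?",
             "To study sentiment in product reviews.")
sheet.answer("composition", "What do the instances represent?",
             "One product review per instance.")

print(sheet.missing_sections())
```

    A completeness check of this kind only shows which sections are still empty; it says nothing about whether the answers are accurate or sufficient, which remains a matter of human judgement.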

    Data Statements for NLP

    A closely related proposal arose specifically in natural-language processing, where the characteristics of the people who produced the text in a dataset profoundly shape what a model learns. Data Statements for Natural Language Processing ask dataset creators to document the relevant characteristics of their data: who the speakers and annotators are, which language varieties are represented, and the situations in which the language was produced. The motivation is squarely about bias and generalisation. A language model trained on text from a narrow demographic will work less well, and sometimes fail or cause harm, for people outside it; without documentation, that limitation is invisible until it bites. Data statements make the population behind the data explicit, so that the boundaries of a model’s likely competence can be understood rather than discovered the hard way. Datasheets and data statements share a conviction: documentation is not bureaucratic overhead but a precondition for using data responsibly.
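
    One way to picture how a data statement exposes the boundaries of a dataset is as a small record that can be queried. The sketch below is illustrative only: the field names paraphrase the characteristics listed above, the use of language tags such as "en-GB" is an assumption, and the `documents` helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStatement:
    """Hypothetical record of who and what is behind a text dataset."""
    language_varieties: tuple[str, ...]  # assumed to be tags such as "en-GB"
    speaker_demographic: str             # who produced the text
    annotator_demographic: str           # who labelled it
    speech_situation: str                # how and where the text was produced

    def documents(self, variety: str) -> bool:
        """True if the given language variety is listed in the statement."""
        listed = {v.lower() for v in self.language_varieties}
        return variety.lower() in listed


# Invented example values, for illustration only.
statement = DataStatement(
    language_varieties=("en-GB", "en-US"),
    speaker_demographic="Adult customers writing product reviews",
    annotator_demographic="Three trained annotators, native English speakers",
    speech_situation="Asynchronous written reviews posted online",
)

# A deployment for Indian English falls outside what was documented,
# so performance there should be evaluated rather than assumed.
needs_review = not statement.documents("en-IN")
print(needs_review)
```

    The check is deliberately crude: a variety being listed does not guarantee good performance, but a variety being absent is a clear signal that the model is being used outside its documented population.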

    Croissant: machine-readable dataset metadata

    Datasheets and data statements are written largely for humans. But for datasets to be discoverable, loadable and interoperable across the many tools of the machine-learning ecosystem, their metadata also needs to be machine-readable. This is the role of Croissant, a metadata format for machine-learning datasets developed through a community effort associated with MLCommons. Croissant provides a standard, structured way to describe a dataset (its resources, structure, fields and semantics) so that tools, frameworks and repositories can understand and work with it consistently, rather than each requiring bespoke handling. By standardising the description, Croissant makes datasets easier to find, load and combine across platforms, and it can express the kind of responsible-use and provenance information that datasheets capture in a form that systems can act on. It is, in effect, the interoperability layer for dataset documentation.
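
    To give a sense of what such a description looks like, the sketch below builds a simplified, Croissant-style JSON-LD document for a single CSV file using only the Python standard library. It is loosely modelled on the format's resource, record-set and field structure; the helper name, dataset name and URL are hypothetical, the context is abbreviated, and a real Croissant file needs further required properties and should be checked against the official specification and tooling.

```python
import json


def minimal_croissant(name, description, file_name, file_url, columns):
    """Build a simplified Croissant-style JSON-LD description of one CSV file."""
    fields = [
        {
            "@type": "cr:Field",
            "name": column,
            "dataType": "sc:Text",  # assumption: every column is treated as text
            "source": {
                "fileObject": {"@id": file_name},
                "extract": {"column": column},
            },
        }
        for column in columns
    ]
    return {
        # Abbreviated context; the real format defines a much longer one.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_name,
                "name": file_name,
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and their fields map onto those files.
        "recordSet": [
            {"@type": "cr:RecordSet", "name": "records", "field": fields}
        ],
    }


metadata = minimal_croissant(
    name="example-reviews",                      # hypothetical dataset
    description="Product reviews with sentiment labels.",
    file_name="reviews.csv",
    file_url="https://example.org/reviews.csv",  # placeholder URL
    columns=["text", "label"],
)
print(json.dumps(metadata, indent=2))
```

    Because the description is plain JSON-LD, any tool that understands the vocabulary can locate the file, see which columns it holds and map them to fields without dataset-specific code.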

    How this connects to FAIR and persistent identifiers

    This work is the machine-learning expression of principles that the wider research-data community has long advocated. The FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — map directly onto what good dataset documentation achieves: rich, machine-readable metadata (Croissant) makes data findable and interoperable, while thorough human-readable documentation (datasheets, data statements) is what genuine reusability requires, because data cannot be responsibly reused if its provenance and limitations are unknown. Persistent identifiers complete the picture: when a dataset is registered with an identifier through an infrastructure such as DataCite, it becomes citable and trackable, so that it can be referenced precisely in papers, credited to its creators, and connected to the models and results that depend on it. A documented, identified dataset is one that can take its place in the scholarly record as a real output rather than an anonymous file.
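
    The link between an identifier and citability can be sketched in a few lines. The example below assumes a record with a handful of properties of the kind a DataCite registration requires (identifier, creators, title, publisher, publication year, resource type); the property names are simplified, the citation layout is one plausible style rather than a prescribed one, and the DOI and names are placeholders.

```python
# Simplified property names, assumed for illustration.
REQUIRED = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_properties(record: dict) -> list[str]:
    """Return the required properties that are absent or empty."""
    return [key for key in REQUIRED if not record.get(key)]


def format_citation(record: dict) -> str:
    """Render a human-readable citation from a complete record."""
    missing = missing_properties(record)
    if missing:
        raise ValueError(f"cannot cite, missing: {', '.join(missing)}")
    creators = "; ".join(record["creators"])
    return (
        f"{creators} ({record['publicationYear']}). "
        f"{record['titles'][0]}. {record['publisher']}. "
        f"https://doi.org/{record['identifier']}"
    )


# Placeholder record; the DOI below is not a real registration.
record = {
    "identifier": "10.5555/example-dataset",
    "creators": ["Doe, Jane", "Roe, Richard"],
    "titles": ["Example Reviews Corpus"],
    "publisher": "Example Repository",
    "publicationYear": 2024,
    "resourceType": "Dataset",
}

citation = format_citation(record)
print(citation)
```

    The point of the sketch is that a citation can only be generated once the identifying metadata is complete, which is why registration and documentation go hand in hand.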

    Datasets as research outputs deserving credit

    The deeper shift here is a change in status. Creating a good dataset (collecting, cleaning, labelling and documenting it carefully) is substantial intellectual labour, and the resulting dataset is a genuine research output that others build upon, often more widely than any single paper. Treating datasets as first-class outputs means documenting them properly, identifying them persistently, and crediting the people who made them. The CRediT taxonomy, whose full set of contribution types is described in our overview of the CRediT roles, captures this work through roles such as Data curation, which recognises the annotation, cleaning and maintenance of research data for initial use and later reuse. Recognising dataset creation as a creditable contribution is part of the same movement that produced datasheets: an insistence that the data underpinning machine learning, and the people who steward it, be taken seriously.

    A consistent vocabulary for dataset documentation

    For dataset documentation to be useful across repositories, frameworks and institutions, the elements it contains must mean the same thing everywhere — what a field describes, what a provenance statement records, what an intended-use restriction means. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that the metadata describing a dataset is understood identically wherever it travels. Datasheets, data statements and Croissant all rest on the same insight: that a dataset without documentation is a liability, and that documenting it well is not an afterthought but part of doing the research properly.