Tag: data curation

  • Data citation: giving datasets the credit they deserve

    A great deal of published science rests on data the authors collected, cleaned, and shared — and yet the dataset itself, the object on which the conclusions actually depend, is routinely mentioned in passing or not at all. A finding is only checkable if a reader can find and reuse the data behind it, and the people who produced that data deserve recognition for an intellectual contribution that is often enormous. Treating datasets as first-class, citable outputs solves both problems at once. It is a core concern of the data-infrastructure domain and connects directly to the wider taxonomy of the research-outputs domain.

    Why data citation matters

    Citing data as data does two distinct jobs, and it is worth keeping them separate. The first is credit: assembling a well-documented dataset is real scholarly work — designing the collection, curating, validating, and documenting it — and that work is rewarded only if the dataset is cited as an output in its own right, not buried in a methods paragraph. The second is reproducibility and reuse: a result can only be verified, and the data only reused, if a reader can identify and locate the exact dataset that underpinned the analysis. A vague reference to “data available on request” serves neither goal; a formal citation to a deposited, identified dataset serves both.

    The FORCE11 data citation principles

    The community reference point here is the Joint Declaration of Data Citation Principles, developed through FORCE11 and endorsed across the scholarly-communication community. The declaration establishes that data should be treated as a legitimate, citable product of research, on the same footing as any other output. Its principles can be summarised as a short set of commitments:

    • Importance. Data should be considered legitimate, citable products of research; data citations should be accorded the same importance as citations of other objects.
    • Credit and attribution. Citations should facilitate giving scholarly credit and legal attribution to all contributors to the data.
    • Evidence. Where a claim relies on data, the corresponding data should be cited.
    • Unique identification. A citation should include a persistent, machine-actionable, globally unique identifier for the data.
    • Access, persistence, and specificity. Citations should enable access to the data and its metadata, persist even beyond the lifespan of the data, and identify the precise version and subset used.
    • Interoperability and flexibility. Citation methods should be interoperable across communities while accommodating their varying practices.

    Everything below is machinery for honouring these principles in practice.

    DataCite and the dataset DOI

    The practical foundation of data citation is the DataCite DOI. DataCite is a DOI registration agency focused on research data and related outputs. A dataset deposited in a repository, whether a generalist one such as Zenodo, Figshare, or Dryad or a discipline-specific one, is typically assigned a DataCite DOI that resolves persistently to the dataset and its metadata. The DOI is what goes in a reference list, exactly as an article DOI would, which is what makes a dataset citable on equal terms with a paper.

    The DOI is more than a link. The DataCite metadata record behind it carries the structured information that makes the citation meaningful: the creators (ideally with their ORCID iDs), the title, the publisher and publication year, the version, the licence, the resource type, and related identifiers connecting the dataset to the article it supports, the software that processed it, and the grant that funded it. Versioning is treated as a first-class concern: a revised dataset can receive its own version-specific DOI, satisfying the principles’ demand for specificity so that a citation pins down exactly the data used, not merely the latest state of an evolving collection.
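    To make the shape of such a record concrete, here is a minimal sketch in Python. The class and field names are our own simplification rather than the DataCite schema itself, the DOI and ORCID iD are placeholders, and the citation layout (creators, year, title, version, publisher, resource type, identifier) is one common pattern that a given journal or repository may vary.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Creator:
    name: str                    # "Family, Given" as it should appear in the citation
    orcid: Optional[str] = None  # ORCID iD, if the creator has one


@dataclass
class DatasetRecord:
    """Simplified stand-in for the metadata behind a dataset DOI (hypothetical)."""
    doi: str
    creators: List[Creator]
    title: str
    publisher: str
    publication_year: int
    version: Optional[str] = None
    licence: Optional[str] = None
    resource_type: str = "Dataset"
    # (relation, identifier) pairs linking the dataset to articles, software, grants
    related_identifiers: List[Tuple[str, str]] = field(default_factory=list)


def format_citation(record: DatasetRecord) -> str:
    """Render a reference-list entry from the structured record."""
    authors = "; ".join(creator.name for creator in record.creators)
    parts = [f"{authors} ({record.publication_year})", record.title]
    if record.version:
        # Including the version is what pins the citation to the data actually used.
        parts.append(f"Version {record.version}")
    parts += [record.publisher, record.resource_type, f"https://doi.org/{record.doi}"]
    return ". ".join(parts)


example = DatasetRecord(
    doi="10.1234/example.5678",  # placeholder DOI, not a real dataset
    creators=[Creator("Lindqvist, B.", orcid="0000-0000-0000-0000")],  # placeholder iD
    title="Example survey responses",
    publisher="Example Repository",
    publication_year=2024,
    version="1.1",
    licence="CC-BY-4.0",
    related_identifiers=[("IsSupplementTo", "10.1234/example.article")],
)

print(format_citation(example))
```

    The point of the sketch is that the citation string is derived from the record, not typed by hand: if the version or the creators change, the reference changes with them.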

    Crediting the people: the Data curation role

    Identifying the dataset is half the task; crediting the humans who produced it is the other half, and the two are easily confused. A DataCite DOI identifies and persists the artefact; it does not, on its own, record the division of labour that produced it. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Data curation role — defined as the management activities to annotate, scrub, and maintain research data (including the software code where needed to interpret the data) for initial use and later reuse. Recording Data curation on the associated paper makes visible the often-uncredited work of turning raw observations into a documented, reusable dataset.

    The two layers complement each other precisely. The dataset DOI and its DataCite metadata say what the data is, where it lives, and which version; the CRediT role record says who curated, validated, and maintained it. Used together they ensure that both the data and the people behind it are visible — rather than the common outcome where neither is, and the dataset is reduced to an unattributed line in a methods section.

    A practical recipe

    1. Deposit the data in a trustworthy repository and obtain a DataCite DOI, rather than leaving it “available on request”.
    2. Cite the dataset in your reference list using its DOI, the way you would cite an article — not in a footnote or in prose.
    3. Pin the version. Where the data may change, cite the version-specific DOI so the citation identifies exactly what was used.
    4. Record the contributors — on both the DataCite record (with ORCID iDs) and, via CRediT’s Data curation role, on the paper the data supports.
    5. Apply a clear licence. Data that cannot be reused with confidence is data that will not be reused; the citation principles assume the reuse terms are stated.
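
    The five steps above can be sketched as a pre-submission check. This is an illustration under stated assumptions: the dictionary keys (`in_reference_list`, `data_curation_credited`, and so on) are hypothetical names chosen for this example, and the DOI pattern is a loose shape check, not a test that the DOI actually resolves.

```python
import re

# Loose DOI shape check ("10.<registrant>/<suffix>"); an assumption for illustration.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def check_data_citation(record: dict) -> list:
    """Return the recipe steps that a dataset record does not yet satisfy."""
    problems = []
    if not DOI_PATTERN.match(record.get("doi", "")):
        problems.append("step 1: no DOI; deposit the data and obtain one")
    if not record.get("in_reference_list", False):
        problems.append("step 2: dataset is not cited in the reference list")
    if not record.get("version"):
        problems.append("step 3: no version pinned")
    creators = record.get("creators", [])
    missing_orcid = any(not creator.get("orcid") for creator in creators)
    if not creators or missing_orcid or not record.get("data_curation_credited", False):
        problems.append("step 4: contributors incomplete (ORCID iDs or Data curation role)")
    if not record.get("licence"):
        problems.append("step 5: no licence stated")
    return problems


# A record that has a DOI but nothing else yet (placeholder values throughout).
draft = {"doi": "10.1234/example.5678"}

complete = {
    "doi": "10.1234/example.5678",
    "in_reference_list": True,
    "version": "1.1",
    "creators": [{"name": "Lindqvist, B.", "orcid": "0000-0000-0000-0000"}],
    "data_curation_credited": True,
    "licence": "CC-BY-4.0",
}

for problem in check_data_citation(draft):
    print(problem)
```

    A check like this cannot judge whether the repository is trustworthy or the licence appropriate; it only catches the steps that were skipped altogether.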

    Where shared vocabulary fits

    “Dataset”, “data citation”, “version”, “data curation”, and “repository” are used inconsistently across communities, which is part of why credit for data leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 data citation principles and to DataCite — is what lets a data citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain, with adjacent entries in the research-outputs domain.

  • Crediting contributions to AI/ML research: data, code, models and evaluation

    Machine-learning research distributes its intellectual labour differently from a conventional empirical study. The work that determines whether a result is any good is spread across data collection and annotation, code, model training, and evaluation — and the people who do each of those things are often different people. So how well do the 14 roles of CRediT describe who did what on an AI/ML paper? Better than one might fear, with a few well-understood friction points. This article walks through the mapping, role by role, for the benefit of anyone writing a CRediT author statement for ML work.

    Start from the lifecycle, not the role list

    The cleanest way to assign CRediT roles to ML work is to walk the lifecycle and ask, at each stage, who contributed and which role names that contribution. A typical AI/ML project moves through: framing the problem and research goals; designing the method or model architecture; assembling, cleaning, and annotating data; implementing and training; evaluating; and writing it up. Each stage has a natural CRediT home.

    Conceptualization and Methodology: the ideas and the design

    The framing of the research question — what problem the model is meant to solve, what would count as success — is Conceptualization, exactly as in any other field. The design of the method is where ML gets its own texture. A genuinely novel architecture, training objective, or learning algorithm is Methodology in the canonical sense: “development or design of methodology; creation of models.” The phrase “creation of models” sits slightly oddly here, because in ML “model” can mean either the conceptual method or the concrete trained weights; the CRediT definition means the former. Designing the experimental protocol — what gets held out, how runs are seeded, what ablations are performed — is also Methodology.

    Data curation and Investigation: the part that decides the result

    In ML, data quality usually matters more than model cleverness, and the people who do data work are frequently undercredited. CRediT offers two relevant roles. Investigation covers “performing the experiments, or data/evidence collection”: gathering the raw data and running the training experiments themselves. Data curation covers “management activities to annotate (produce metadata), scrub data and maintain research data… for initial use and later re-use”, which is an almost exact description of dataset cleaning, labelling, and deduplication, and of constructing the documented, reusable dataset.

    The practical advice is to use both roles deliberately and not to let Investigation swallow everything. The person who designed the annotation scheme and produced the dataset’s metadata is doing Data curation, and saying so makes visible a contribution that is otherwise invisible — and that, by the field’s own lights, often determines the outcome. The datasheet for the dataset is, in effect, a written artefact of that Data curation work.

    Software: central, and overloaded

    Almost all ML work involves code, so Software — “programming, software development; designing computer programs; implementation of the computer code and supporting algorithms; testing of existing code components” — is the most frequently assigned role. It is also the most overloaded. On a real project, “Software” can cover the researcher who implemented the novel method, the engineer who built the training pipeline, the person who wrote the data-loading code, and whoever maintains the evaluation harness. CRediT gives all of them the same role name.

    This is the same limitation we have documented for software papers: the Software role lacks sub-roles for implementation, testing, infrastructure, and maintenance. The current best practice is to use the degree-of-contribution qualifier (lead / equal / supporting) to differentiate, and to carry finer-grained per-component contributorship in the repository’s own metadata — a CITATION.cff file or the model card’s authorship section — rather than trying to force it all into the paper’s CRediT statement.
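
    As a sketch of what carrying that detail in the repository might look like, the Python below writes a minimal CITATION.cff. The keys used (`cff-version`, `message`, `title`, `version`, `authors`, `family-names`, `given-names`) are standard Citation File Format keys; the project details are invented, and recording each person's components as YAML comments is purely our illustrative convention, since to our knowledge the format does not define a per-component role key.

```python
# Hypothetical project details; none of these values come from a real repository.
PROJECT = {
    "title": "Example training pipeline",
    "version": "0.1.0",
    "authors": [
        {"family": "Nakamura", "given": "C.",
         "components": ["method implementation", "training pipeline"]},
        {"family": "Rossi", "given": "D.",
         "components": ["evaluation harness"]},
    ],
}


def render_citation_cff(project: dict) -> str:
    """Build the text of a minimal CITATION.cff file."""
    title = project["title"]
    version = project["version"]
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        "authors:",
    ]
    for author in project["authors"]:
        components = ", ".join(author["components"])
        family = author["family"]
        given = author["given"]
        # Component notes go in YAML comments: an illustrative convention only,
        # not a key defined by the Citation File Format.
        lines.append(f"  # components: {components}")
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    return "\n".join(lines) + "\n"


print(render_citation_cff(PROJECT))
```

    The paper's CRediT statement would still say only "Software (lead)" and "Software (supporting)"; the file in the repository is where a reader can see who built which part.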

    Validation: evaluation is its own contribution

    The single most useful point in this whole mapping is that Validation exists and should be used. Its definition — “verification… of the overall replication/reproducibility of results/experiments and other research outputs” — fits the work of building and running an evaluation suite almost perfectly. The person who designed the evaluation, guarded against test-set contamination, ran the baselines, and confirmed that the reported numbers reproduce is doing Validation, and in ML that is frequently the difference between a trustworthy result and a misleading one.

    Because evaluation is so central to ML and so often distinct from the modelling work, assigning Validation as a lead role to the person who owned evaluation is one of the highest-value things a CRediT statement for ML can do. It is also under-used, because the habit of treating evaluation as an undifferentiated part of “the experiments” persists.

    The remaining roles

    The rest map without surprises. Producing figures, training curves, and visualisations is Visualization. Providing compute — “computing resources… or other analysis tools” is explicitly in the Resources definition — is Resources; on compute-intensive projects, the contribution of whoever secured and managed the GPU allocation is real and namable. Writing the paper is Writing – original draft and Writing – review & editing. Leading the project is Supervision and Project administration; securing the grant is Funding acquisition.

    Where AI assistance fits, and where it does not

    One thing CRediT deliberately does not represent is the use of AI tools to do the work, such as an AI coding assistant that helped write the training code or a model that drafted prose. That is a disclosure matter, not a contributorship matter: AI systems are not contributors, a position on which the community has settled, and the prevailing view is that AI use should be tracked as a separate dimension rather than as a CRediT role. CASRAI has written separately on authorship and AI; the short version is that a human who used an AI tool to discharge a role still gets that role, and the AI use is disclosed elsewhere.

    A worked statement

    A. Okonkwo: Conceptualization, Methodology (lead), Writing – original draft. B. Lindqvist: Data curation (lead), Investigation. C. Nakamura: Software (lead), Methodology (supporting). D. Rossi: Validation (lead), Software (supporting). E. Mwangi: Visualization, Writing – review & editing. F. Schmidt: Resources, Supervision, Funding acquisition.

    Read off, this says: someone designed the method and wrote the paper; someone else built the dataset; someone else implemented the system; someone else owned evaluation; someone made the figures and edited; and someone provided the compute, secured the funding, and led the project. That is a far truer account of an ML project than “six authors,” and it is exactly what CRediT is for.
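
    The same statement can be held as structured data, which makes it checkable and queryable. The sketch below is one possible representation, not a prescribed format: the function names are our own, while the role names are the 14 CRediT roles and the degree values are the lead / equal / supporting qualifier discussed above.

```python
CREDIT_ROLES = {
    "Conceptualization", "Data curation", "Formal analysis", "Funding acquisition",
    "Investigation", "Methodology", "Project administration", "Resources",
    "Software", "Supervision", "Validation", "Visualization",
    "Writing – original draft", "Writing – review & editing",
}
DEGREES = {"lead", "equal", "supporting"}

# The worked statement: contributor -> list of (role, degree or None).
STATEMENT = {
    "A. Okonkwo": [("Conceptualization", None), ("Methodology", "lead"),
                   ("Writing – original draft", None)],
    "B. Lindqvist": [("Data curation", "lead"), ("Investigation", None)],
    "C. Nakamura": [("Software", "lead"), ("Methodology", "supporting")],
    "D. Rossi": [("Validation", "lead"), ("Software", "supporting")],
    "E. Mwangi": [("Visualization", None), ("Writing – review & editing", None)],
    "F. Schmidt": [("Resources", None), ("Supervision", None),
                   ("Funding acquisition", None)],
}


def validate(statement: dict) -> list:
    """Return a list of problems: unknown roles or unknown degree qualifiers."""
    errors = []
    for person, roles in statement.items():
        for role, degree in roles:
            if role not in CREDIT_ROLES:
                errors.append(f"{person}: unknown role {role!r}")
            if degree is not None and degree not in DEGREES:
                errors.append(f"{person}: unknown degree {degree!r}")
    return errors


def render(statement: dict) -> str:
    """Render the statement in the sentence form used in a paper."""
    chunks = []
    for person, roles in statement.items():
        labels = [f"{role} ({degree})" if degree else role for role, degree in roles]
        chunks.append(f"{person}: {', '.join(labels)}.")
    return " ".join(chunks)


def contributors_by_role(statement: dict) -> dict:
    """Invert the statement: role -> list of (contributor, degree)."""
    by_role = {}
    for person, roles in statement.items():
        for role, degree in roles:
            by_role.setdefault(role, []).append((person, degree))
    return by_role


print(render(STATEMENT))
print(contributors_by_role(STATEMENT)["Software"])
```

    Inverting the statement by role is the useful query: it shows at a glance that Software is shared between two people with different degrees, and that Data curation and Validation each have a named lead.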

    What to do now

    Use the full role set, not just Software and Writing. Credit Data curation and Validation explicitly — they are where ML results are won or lost. Use the degree-of-contribution qualifier to differentiate within overloaded roles, and push fine-grained software contributorship into the repository’s own metadata. Disclose AI use separately from contributorship. CASRAI’s author-statement guidance has the templates.

  • Crediting the hidden labour in AI/ML research: annotation, labelling and evaluation

    Every impressive machine-learning result rests on a foundation of human work that is almost never named. Before a model can be trained, someone has to gather, clean and organise the data; someone has to annotate and label examples, often by hand and at scale; and after training, someone has to evaluate what the model actually does, judging its outputs against criteria that only people can apply. This is real intellectual labour, demanding domain knowledge, careful judgement and sustained attention — and it is frequently invisible, performed by people whose names appear nowhere in the paper that depends on their work. As AI and machine learning become central to research, the question of how to recognise this hidden labour has become a matter of fairness and of accuracy in the scholarly record. This article examines it through the AI and ML research outputs domain of the CASRAI Dictionary.

    The work that does not make the byline

    It is worth being concrete about what this labour involves, because its invisibility partly stems from it being taken for granted. Data annotation and labelling — marking up images, transcribing audio, tagging text — is the painstaking process that gives supervised learning something to learn from, and the quality of a model is bounded by the quality of these labels. Data curation — selecting, cleaning, documenting and organising the data — shapes everything that follows and embeds countless consequential decisions. Evaluation — assessing model outputs, designing test sets, identifying failure modes — is where human expertise determines whether a system actually works. None of this is mechanical. Each requires judgement, and each materially affects the result. Yet the reward structures of research, organised around authorship and citation, have tended to treat all of it as plumbing rather than contribution.

    Why recognition matters here

    The case for recognising this work is partly about fairness to the people who do it, but it is also about the integrity of the record. When the labour behind a dataset or an evaluation is invisible, two things go wrong. First, the people responsible — often early-career researchers, students, or specialist data workers — are denied credit for substantial, skilled contributions, with real consequences for their careers. Second, the research itself becomes harder to understand and trust, because the decisions embedded in annotation and curation — which are exactly the decisions that determine bias, coverage and validity — are hidden from view. Recognising the labour and documenting the choices are two sides of the same coin: both bring into the open the human work that determines what a model is and does.

    How CRediT captures these contributions

    A structured account of who did what is the most direct route to making this labour visible, and the CRediT taxonomy already contains roles that fit it well. Data curation explicitly covers the management activities of annotating, scrubbing and maintaining research data — the very heart of the annotation and labelling work that machine learning depends on. Investigation covers conducting the research and data-collection process, which includes the hands-on work of producing labelled examples. Validation covers verifying results and assessing reproducibility, which maps onto the evaluation of model outputs. Software recognises those who build the tooling that makes annotation and evaluation possible at scale. The full set is described in our overview of the CRediT roles. The point is that the vocabulary for crediting this work largely already exists; what has often been missing is the will to apply it, and the recognition that annotation and evaluation are contributions worth naming rather than chores to be absorbed silently.

    Documenting the data and the decisions

    Recognition of people goes hand in hand with documentation of process. The movement to document datasets properly — through structured records that describe how a dataset was created, by whom, with what labelling procedures, and with what known limitations — makes the hidden labour visible as part of the dataset’s own description. Approaches such as datasheets for datasets and data statements ask creators to record the provenance of the data, the annotation process, the people involved and the judgements made. This documentation serves recognition directly: a dataset that records who annotated it and how is one in which that labour is acknowledged rather than erased. It also serves responsible AI, because the same record that credits the annotators is the record that lets others understand the dataset’s biases and boundaries. Good documentation is thus both an ethical and a scientific instrument — it names the people and exposes the decisions in a single act.

    Data work as a labour question

    There is a broader dimension the research community has had to confront. Much annotation and labelling is performed by data workers — within research teams or through external labour, often poorly paid and rarely credited — whose conditions have become a focus of responsible-AI discussion. Recognising annotation as genuine contribution is connected to recognising it as genuine work, with the dignity, fair treatment and acknowledgement that implies. The contributor metadata that records who did this work is not a clerical detail; it is a statement about whose labour the research is built on.

    A consistent vocabulary for AI contributions

    For the contributions behind AI and ML research to be recognised consistently — across institutions, publishers, dataset repositories and reporting systems — the way they are described must mean the same thing everywhere. Annotation, curation, evaluation and the roles that capture them have to travel without losing their meaning. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the hidden labour behind a model or a dataset, once made visible, is understood identically wherever it is recorded. And because this work is part of the wider research enterprise, recognising it well also serves the goals of fair research administration, explored in our research administration resources. The most advanced model is, at bottom, an artefact of human judgement applied to data; crediting the people who supply that judgement is simply an honest account of how the research was done.