Tag: DataCite

  • Citing Data Properly: The Joint Declaration of Data Citation Principles

    For decades, the data underpinning a study lived in a footnote, an appendix, or nowhere visible at all. A reader who wanted to inspect, reuse, or build on those data had little to go on. As research has become more data-intensive, that omission has grown harder to justify. The Joint Declaration of Data Citation Principles, published through FORCE11 in 2014, was a deliberate attempt to fix it by treating datasets as legitimate, citable research outputs in their own right.

    Why data citation matters

    Citing data is not merely good manners. It serves the same purposes as citing the literature: it credits the people who produced the work, it lets readers verify claims, and it builds a traceable record of how knowledge accumulates. When a dataset is cited formally, the citation can be counted, indexed, and linked, which means the often-considerable labour of collecting, cleaning, and documenting data becomes visible and rewardable. This connects directly to broader efforts in FAIR data, where the goal is for data to be findable, accessible, interoperable, and reusable.

    The eight principles

    The Declaration is built around eight principles that, taken together, describe what responsible data citation looks like:

    • Importance. Data should be considered legitimate, citable products of research, deserving the same status as publications.
    • Credit and attribution. Citations should give scholarly credit and normative and legal attribution to everyone who contributed to the data.
    • Evidence. Where a claim rests on data, the corresponding data should be cited.
    • Unique identification. Citations should include a persistent, machine-actionable, globally unique identifier.
    • Access. Citations should make it possible to reach the data themselves and their associated metadata and documentation.
    • Persistence. Identifiers and metadata should persist even beyond the lifespan of the data they describe.
    • Specificity and verifiability. Citations should allow a precise version and subset of the data to be identified.
    • Interoperability and flexibility. Citation methods should work across communities while accommodating disciplinary differences.

    These principles are intentionally technology-neutral. They do not mandate a single repository or identifier scheme; they describe outcomes that any sound practice should achieve.

    How to cite a dataset in practice

    A well-formed data citation looks much like a reference to an article, but with a few additions. At minimum it should carry the creator or creators, the year of publication, the title of the dataset, the publisher or repository, a version where one exists, and a persistent identifier. In most cases that identifier is a DataCite DOI, resolvable to a landing page that describes the dataset and points to the files. A typical reference takes the shape: Creator(s) (Year): Title. Version. Publisher. Dataset. DOI.
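
    The reference shape above can be assembled mechanically from a handful of fields. The following is a minimal sketch, not a style-guide implementation: the `DataCitation` class, the `format_citation` function, and the example DOI and repository name are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Fields of a dataset reference (hypothetical container, not a standard)."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str
    version: Optional[str] = None


def format_citation(c: DataCitation) -> str:
    """Render 'Creator(s) (Year): Title. Version. Publisher. Dataset. DOI.'"""
    parts = [f"{'; '.join(c.creators)} ({c.year}): {c.title}."]
    if c.version:  # the version is included only where one exists
        parts.append(f"Version {c.version}.")
    parts.append(f"{c.publisher}.")
    parts.append("Dataset.")
    parts.append(f"https://doi.org/{c.doi}")
    return " ".join(parts)


example = DataCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2024,
    title="Example river temperature series",
    publisher="Example Repository",
    doi="10.1234/example.5678",  # invented DOI, for illustration only
    version="2.1",
)
print(format_citation(example))
```

    A real reference list would follow the target journal's style, so the punctuation here is only one possible rendering.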

    Two details repay attention. First, versioning is not optional for datasets that change over time. Citing the specific version used means a future reader can reproduce exactly what was analysed, rather than a later, possibly different, release. Second, the identifier should appear in the reference list, not merely in the running text. Burying a dataset DOI in a sentence keeps it out of the indexing and counting systems that make citation meaningful in the first place.

    DataCite DOIs and the reference list

    DataCite was established precisely to assign DOIs to research data and to maintain the metadata that makes those DOIs useful. When a repository mints a DataCite DOI for a dataset, it registers structured metadata describing the creators, title, publication year, resource type, and related identifiers. That metadata is what allows discovery services and reference managers to handle data citations the way they handle article citations. Placing the DOI in the reference list, formatted to the relevant style, lets indexing infrastructure pick it up and attribute it correctly.
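
    To make the idea of structured metadata concrete, here is a small sketch of a record and a completeness check. The property names mirror DataCite's JSON representation as I understand it and should be checked against the current schema; the `missing_properties` helper and the example values are invented for illustration.

```python
# Property names are assumptions based on DataCite's JSON metadata;
# verify them against the published schema before relying on them.
REQUIRED_PROPERTIES = (
    "doi", "creators", "titles", "publisher", "publicationYear", "types",
)


def missing_properties(record: dict) -> list[str]:
    """Return the required properties that are absent or empty."""
    return [name for name in REQUIRED_PROPERTIES if not record.get(name)]


record = {
    "doi": "10.1234/example.5678",  # invented DOI, for illustration only
    "creators": [{"name": "Doe, Jane"}],
    "titles": [{"title": "Example river temperature series"}],
    "publisher": "Example Repository",
    "publicationYear": 2024,
    "types": {"resourceTypeGeneral": "Dataset"},
}

print(missing_properties(record))
print(missing_properties({"doi": record["doi"]}))
```

    A repository would normally run a check of this kind before minting, since a DOI without descriptive metadata is of little use to discovery services.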

    Data availability statements close the loop

    Many publishers now require a data availability statement, a short passage telling readers where the underlying data can be found and under what conditions. Done well, the statement names the repository and gives the persistent identifier, linking the prose of the article to the formal citation in the reference list. Done poorly, it says only that data are available on request, which research has repeatedly shown to be an unreliable route to access. A good availability statement and a properly formatted data citation are two halves of the same commitment: that the evidence behind a study can actually be found and reused.

    Bringing it together

    The Joint Declaration did not invent the idea that data deserve credit, but it gave the community a shared, citable reference point. The practical implications are modest and achievable: assign a persistent identifier, capture the version, put the citation in the reference list, and write a data availability statement that points to it. Standards bodies and metadata schemas, including the work catalogued in the CASRAI data dictionary and contributor frameworks such as CRediT, provide the surrounding vocabulary for describing who did what. The principles themselves are a reminder that data are not a by-product of research but, increasingly, one of its most valuable outputs.

  • What Is a DOI? The Handle System and DOI Resolution Explained

    A Digital Object Identifier (DOI) is a persistent, globally unique character string that identifies a digital object — most often a journal article, dataset, book or other research output — and reliably resolves to that object’s current location on the web. Unlike a plain URL, a DOI is designed to keep working even when the underlying web address changes, because the identifier points to a record that the owner keeps up to date rather than to a fixed server path.

    DOIs are governed by ISO 26324, the international standard that defines DOI syntax and the rules of the DOI system, and they are managed at the apex by the International DOI Foundation (IDF). This article explains what a DOI is, how it is structured, how resolution works through the Handle System, and which organisations assign DOIs in scholarly publishing.

    The structure of a DOI: prefix and suffix

    Every DOI has two parts separated by a forward slash. A prefix always begins with 10. followed by a registrant code identifying the organisation that registered the DOI (for example 10.1000). A suffix, chosen by the registrant, identifies the specific item and can be any string the registrant chooses, provided it is unique within that prefix.

    Component         Example                       Meaning
    DOI prefix        10.1000                       Directory indicator (10) plus registrant code
    DOI suffix        182                           Registrant-assigned identifier for the object
    Full DOI          10.1000/182                   The complete, opaque identifier
    Resolvable form   https://doi.org/10.1000/182   The DOI expressed as a clickable link

    The DOI itself is deliberately opaque: the characters carry no built-in meaning about the content, the publisher or the year. This opacity is a feature, not a flaw — it means a DOI never has to change because something about the object changed. The recommended way to display a DOI is as a full HTTPS link using the https://doi.org/ proxy, so that readers can simply click it.
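
    The prefix and suffix split described above is easy to express in code. The sketch below is a simplified check rather than a full ISO 26324 validator: the function names `split_doi` and `as_link` are invented, the prefix pattern is deliberately loose, and URL-encoding of unusual suffix characters is left out.

```python
import re

DOI_PROXY = "https://doi.org/"


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first forward slash."""
    doi = doi.strip()
    # Accept the resolvable form as well as the bare identifier.
    if doi.lower().startswith(DOI_PROXY):
        doi = doi[len(DOI_PROXY):]
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix:
        raise ValueError(f"not a DOI (no suffix): {doi!r}")
    # Simplified rule: directory indicator 10, then a dotted numeric registrant code.
    if not re.fullmatch(r"10\.\d+(\.\d+)*", prefix):
        raise ValueError(f"not a DOI prefix: {prefix!r}")
    return prefix, suffix


def as_link(doi: str) -> str:
    """Return the recommended display form: a full HTTPS link via the proxy."""
    prefix, suffix = split_doi(doi)
    return f"{DOI_PROXY}{prefix}/{suffix}"


print(split_doi("10.1000/182"))
print(as_link("10.1000/182"))
```

    Splitting at the first slash matters because the suffix is free-form and may itself contain slashes.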

    How DOI resolution works: the Handle System

    The technical machinery beneath every DOI is the Handle System, a distributed identifier-resolution infrastructure developed by the Corporation for National Research Initiatives. A DOI is in fact a Handle within a specific namespace, and DOI resolution is the process of looking up the identifier and returning the current data associated with it — principally the URL where the object now lives.

    When you click https://doi.org/10.1000/182, the request reaches the DOI proxy server, which queries the Handle System for that DOI’s record. The record contains the up-to-date target URL, and the resolver redirects your browser there. Because publishers update the target URL when content moves, the DOI keeps resolving even after the destination has been reorganised — this is the core of persistence.
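
    The indirection at the heart of resolution can be illustrated with a toy in-memory registry. This is not the Handle System protocol or any real resolver API: the `ToyResolver` class and the example.org URLs are invented purely to show why a citation survives a move.

```python
class ToyResolver:
    """In-memory stand-in for an identifier-to-URL registry (illustrative only)."""

    def __init__(self) -> None:
        self._records: dict[str, str] = {}

    def register(self, doi: str, url: str) -> None:
        # DOI names are case-insensitive, so normalise the key.
        self._records[doi.lower()] = url

    def update_target(self, doi: str, url: str) -> None:
        """The registrant changes the target; the identifier stays the same."""
        key = doi.lower()
        if key not in self._records:
            raise KeyError(f"unknown DOI: {doi}")
        self._records[key] = url

    def resolve(self, doi: str) -> str:
        """Look up the current target, as a proxy would before redirecting."""
        return self._records[doi.lower()]


resolver = ToyResolver()
resolver.register("10.1000/182", "https://old.example.org/articles/182")
before = resolver.resolve("10.1000/182")

# The content moves; only the record changes, not the cited identifier.
resolver.update_target("10.1000/182", "https://new.example.org/content/182")
after = resolver.resolve("10.1000/182")
print(before, "->", after)
```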

    Persistence versus ordinary URLs

    An ordinary web link breaks — the familiar “404 Not Found” — whenever a page is moved, a domain is retired or a site is restructured. This phenomenon, known as link rot, is corrosive to the scholarly record, which depends on being able to cite and re-find sources years or decades later. A DOI mitigates this by adding a layer of indirection: citations point at the stable identifier, and the identifier’s owner maintains the mapping to wherever the content actually resides. The DOI is part of a wider family of persistent identifiers (PIDs) explored across our persistent-identifiers coverage.

    Who assigns DOIs?

    The IDF does not register individual DOIs itself; instead it appoints DOI registration agencies that serve particular communities. In scholarly publishing the two largest are Crossref, which registers DOIs for journal articles, conference papers, books and other text-based scholarly content, and DataCite, which focuses on research datasets and other non-traditional outputs. Each agency collects descriptive metadata alongside the DOI and operates the services that make DOIs useful for discovery and citation. We examine that division of labour in our piece on Crossref and the DOI registration agencies.

    DOIs also coexist with other identifiers in the modern research infrastructure — ORCID for people, ROR for organisations, RAiD for projects — described in our overview of ORCID, ROR, RAiD and the DOI in 2026. For definitions of these and related terms, the CASRAI dictionary is a useful reference.

    Versioning and DOIs

    Because a DOI is permanent, an updated version of a dataset or preprint is usually given its own DOI, with a separate “concept” DOI that always points to the latest version. This pattern is explained in our article on concept and version DOIs.

    Frequently asked questions

    Is a DOI the same as a URL?

    No. A DOI is an identifier, not a location. It is usually expressed as a URL via the https://doi.org/ proxy so it can be clicked, but the identifier itself is the part after the proxy. The URL it points to can change; the DOI does not.

    What standard defines DOIs?

    DOI syntax and the rules of the DOI system are defined by ISO 26324. The system is administered by the International DOI Foundation, and resolution is provided by the Handle System.

    Can a DOI ever stop working?

    A DOI continues to resolve as long as its registration agency and the registrant maintain the record. The persistence guarantee is a social and contractual commitment as well as a technical one: it depends on publishers updating target URLs and on the agencies remaining operational.

    How do I cite a DOI correctly?

    Best practice is to present the DOI as a full HTTPS link, for example https://doi.org/10.1000/182, so that it is both human-readable and machine-actionable. Guidance for authors is collected on our for-authors page.

  • DataCite, GitHub, Zenodo: the three-cornered software-citation stack

    Software citation in 2026 mostly runs on a three-cornered stack: a code repository (typically GitHub), an archiving service that issues DOIs (typically Zenodo), and the DataCite infrastructure that registers and resolves the DOIs. The integration between the three is more polished than it was five years ago and substantially less polished than it could be. This post walks through the current state and what integrators should do.

    The pattern that works

    The community has converged on the following operational pattern. A research-software project lives in a Git repository (often on GitHub, increasingly on GitLab or other forges). At each release, the repository is archived to Zenodo, which creates a DOI for that release; a concept DOI for the project overall is also issued, resolving to the latest release. The repository carries a CITATION.cff file specifying how to cite the software, including the Zenodo DOI and the contributor list. The published paper (if any) cites the software via the Zenodo DOI. The result is an operationally clean citation pattern.

    The integration works at the technical layer. GitHub-Zenodo integration is documented and stable. CITATION.cff is supported by GitHub’s repository UI for human-readable citations and by an increasing number of tools (Zenodo, JOSS, R packages’ references) for machine processing. DataCite’s metadata supports the software-type record with CRediT-aligned contributor roles where the depositor provides them.
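
    For readers who have not seen one, a CITATION.cff file is a short YAML document. The sketch below renders a minimal one from Python without a YAML library. The keys used are those I understand the Citation File Format 1.2.0 to define; the `render_citation_cff` helper, the placeholder Zenodo DOI, the placeholder ORCID iD and the project details are all invented, and values are not escaped.

```python
def render_citation_cff(title: str, version: str, doi: str,
                        date_released: str, authors: list[dict]) -> str:
    """Render a minimal CITATION.cff; values are not escaped, so keep them simple."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'doi: "{doi}"',
        f"date-released: {date_released}",
        "authors:",
    ]
    for author in authors:
        lines.append(f'  - family-names: "{author["family"]}"')
        lines.append(f'    given-names: "{author["given"]}"')
        if author.get("orcid"):  # ORCID is optional per author
            lines.append(f'    orcid: "{author["orcid"]}"')
    return "\n".join(lines) + "\n"


cff_text = render_citation_cff(
    title="example-analysis-toolkit",      # hypothetical project
    version="1.4.0",
    doi="10.5281/zenodo.0000000",          # placeholder, not a real deposit
    date_released="2026-02-01",
    authors=[
        {"family": "Doe", "given": "Jane",
         "orcid": "https://orcid.org/0000-0000-0000-0000"},  # placeholder iD
        {"family": "Roe", "given": "Richard"},
    ],
)
print(cff_text)
```

    In practice the file is usually written by hand or with a dedicated tool and validated against the format's schema; generating it as above is only meant to show its shape.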

    What’s good

    Three things this stack does well.

    First, versioning. Software is versioned; citation should be versionable. The concept-DOI plus per-version-DOI pattern lets a paper cite either the specific version it used or the project conceptually, with the appropriate DOI. This is the right design for software citation and the community has converged on it.
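
    The concept-DOI and per-version-DOI relationship can be modelled in a few lines. This is a toy model, not Zenodo's API: the `ArchivedProject` class and every DOI below are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedProject:
    """A project with one concept DOI and one DOI per archived release."""
    concept_doi: str
    releases: dict[str, str] = field(default_factory=dict)  # tag -> version DOI

    def add_release(self, tag: str, version_doi: str) -> None:
        # Per-version records are treated as immutable: a tag is archived once.
        if tag in self.releases:
            raise ValueError(f"release {tag} already has a DOI")
        self.releases[tag] = version_doi

    def resolve(self, doi: str) -> str:
        """Return the version DOI that the given DOI currently points at."""
        if doi == self.concept_doi:
            if not self.releases:
                raise LookupError("no releases archived yet")
            return list(self.releases.values())[-1]  # latest release
        if doi in self.releases.values():
            return doi  # a version DOI always means that exact version
        raise LookupError(f"unknown DOI: {doi}")


project = ArchivedProject(concept_doi="10.5281/zenodo.1000000")  # placeholder
project.add_release("v1.0.0", "10.5281/zenodo.1000001")          # placeholder
project.add_release("v1.1.0", "10.5281/zenodo.1000002")          # placeholder

print(project.resolve("10.5281/zenodo.1000000"))
print(project.resolve("10.5281/zenodo.1000001"))
```

    A paper that needs reproducibility cites the version DOI; a paper that refers to the project in general cites the concept DOI and accepts that its target moves.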

    Second, open infrastructure. Zenodo is operated by CERN as a public infrastructure; DataCite is a community-governed organisation. The depositor’s investment in software citation does not lock them into a commercial vendor. This matters for sustainability.

    Third, integration with FAIR4RS. The FAIR4RS Principles and the software citation principles that preceded them are operationalised by this stack. A FAIR-aligned software project should have an archived release with a DOI, rich metadata and a contributor record, all of which the stack supports.

    What’s still rough

    Four issues at the seams.

    First, the GitHub dependency. The dominant code-hosting platform is a commercial service owned by a major tech company. The Zenodo integration is GitHub-specific in important ways (the auto-archival webhook, the metadata propagation from the GitHub release to Zenodo). GitLab and other forges have lighter-weight integration patterns. The community’s reliance on GitHub for the code-hosting corner of the stack creates a single-point-of-vendor risk that the FAIR-software community has been increasingly aware of. Software Heritage’s archive of public repositories provides some long-term resilience but is not a substitute for the operational integration.

    Second, metadata fidelity at deposit. The GitHub-Zenodo automatic deposit captures repository metadata but the fidelity is variable. CITATION.cff is honoured if present and well-formed; in its absence, Zenodo defaults to repository-level metadata that may not reflect the contributor structure the developers intended. Projects without CITATION.cff get poorer Zenodo records.

    Third, the CRediT-CITATION.cff alignment. CITATION.cff supports a contributors list with type-of-contribution; the type-of-contribution vocabulary has converged on a CRediT-aligned set but the alignment is not strict. Tools that translate CITATION.cff to CRediT-compliant DataCite metadata produce slightly different results. The Software Citation Working Group has been working on the formal alignment; the work is partly complete.

    Fourth, versioning of the contributor record. CITATION.cff in the repository captures current contributorship; the Zenodo deposit captures contributorship as of the deposit. A project that adds contributors after a release has a stale Zenodo record for that release until the next release. The trade-off (mutable vs immutable per-version records) is a real one; the community has accepted immutable per-version records as the better default.

    What integrators should do

    For software-paper authors and software developers, the practical advice in 2026 is: maintain a CITATION.cff in every research-software repository; archive every meaningful release to Zenodo; cite the specific Zenodo DOI in publications that use the software; cite the concept DOI in publications that reference the project conceptually. The CASRAI software-citation authors guide walks through the patterns.

    For journals publishing software papers, the recommendation is to require CITATION.cff and a Zenodo (or equivalent) deposit at submission, to verify the consistency between the CITATION.cff and the paper’s contributorship statement, and to cite the Zenodo DOI in the published paper. JOSS does all of this; other software-paper venues should follow.

    For institutions, the recommendation is to ingest software-DOI records into CRIS systems as a first-class research output, to surface them in researcher dashboards alongside publications, and to recognise software contribution in promotion and tenure assessment. The CASRAI research outputs domain tracks the institutional implementation patterns.

    For the broader infrastructure community, two priorities. First, support non-GitHub code-hosting integration with Zenodo; the single-vendor concentration is a real risk. Second, complete the CRediT-CITATION.cff alignment work; the operational ambiguity is small but real.

    What’s coming

    Two developments to watch in 2026-2027. First, the Software Heritage citation integration: Software Heritage archives the world’s public source code and assigns SWHIDs (Software Heritage Identifiers). The integration of SWHIDs as a complementary identifier alongside Zenodo DOIs is in progress; the relationship between SWHID and DOI for the same software release is in design. Second, per-version contributor records: the community has been chewing on whether per-version CRediT statements deposited to Crossref or DataCite would be useful for software. The technical viability is clear; the community-consensus and tool-support work is in motion.

    For the moment, the three-cornered stack does the job. The seams are real but workable. Software citation has moved from being a research-software-engineering aspiration to an operational practice; the further refinements are about polish, not foundation.


  • DMP IDs and connecting machine-actionable DMPs to the PID ecosystem

    For most of their history, data management plans have led a curiously isolated existence. A researcher writes one to satisfy a funder, submits it as a document, and there it usually stays — a static file, disconnected from the project it describes, the outputs it anticipates, and the systems that manage everything else. This is a waste. A plan describes the data a project will produce, the people responsible, the repositories that will hold it, and the funder behind it — all entities the wider research infrastructure already tracks with persistent identifiers. Connecting the plan to that infrastructure transforms it from an inert document into a living, linked object. This article explains how DMP IDs and machine-actionable plans do exactly that, drawing on the machine-actionable DMP domain of the CASRAI Dictionary.

    Giving the plan an identity

    The first step is to give the plan itself a persistent identifier. A DMP ID — a persistent identifier for a data management plan, issued through an infrastructure such as DataCite — makes the plan a first-class object in the scholarly record: something that can be referenced unambiguously, cited, and linked to other objects. This sounds modest but it is the keystone of everything that follows. Once a plan has a stable identifier, it can be pointed at and pointed from. A paper can cite the plan that governed its data; a dataset can link back to the plan that anticipated it; a funder can reference the plan associated with a grant. Without an identifier, the plan is just a file somewhere; with one, it becomes a node in the network of research objects, participating in the web of relationships that connects publications, data, people and grants.

    Connecting to the wider PID ecosystem

    The real power emerges when the DMP ID connects the plan to the other persistent identifiers that describe a project. The plan’s authors can be identified by their ORCID iDs; their institutions by ROR identifiers; the funding by a grant identifier and the funder’s own identifier; the anticipated outputs by DOIs once they exist. Threaded together, these links let the plan take its place in the connected research landscape:

    • Plan to people. ORCID links the plan to the researchers responsible, so it appears in their record of activity.
    • Plan to institutions. ROR connects the plan to the organisations involved.
    • Plan to funding. Grant and funder identifiers tie the plan to the money that required and supported it, helping funders see that planning commitments were made and, in time, met.
    • Plan to outputs. Links to the datasets, software and publications that result let anyone trace from the plan to what was actually produced.

    The plan stops being a one-off submission and becomes part of the same identifier graph that already connects the rest of the research enterprise.
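
    The four kinds of link listed above amount to edges in a small graph whose nodes are identifiers. The following sketch assumes nothing about any real PID graph service: the `PidGraph` class, the relation names and every identifier are invented placeholders.

```python
from collections import defaultdict


class PidGraph:
    """Tiny directed graph whose nodes are persistent identifiers."""

    def __init__(self) -> None:
        self._edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def link(self, source: str, relation: str, target: str) -> None:
        self._edges[source].append((relation, target))

    def targets(self, source: str, relation: str) -> list[str]:
        """Everything reachable from source via the given relation."""
        return [t for r, t in self._edges[source] if r == relation]


# All identifiers and relation names below are invented placeholders.
DMP = "https://doi.org/10.1234/dmp.0001"
graph = PidGraph()
graph.link(DMP, "has_contributor", "https://orcid.org/0000-0000-0000-0000")
graph.link(DMP, "has_affiliation", "https://ror.org/000000000")
graph.link(DMP, "funded_by", "grant:EXAMPLE-2026-001")
graph.link(DMP, "has_output", "https://doi.org/10.1234/dataset.0001")
graph.link(DMP, "has_output", "https://doi.org/10.1234/software.0001")

print(graph.targets(DMP, "has_output"))
```

    Tracing from a plan to what was actually produced is then a single lookup rather than a search through documents.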

    The RDA DMP Common Standard

    Connecting plans to the PID ecosystem requires that the plans themselves be machine-readable in a consistent way, and this is where the RDA DMP Common Standard comes in. Developed through the Research Data Alliance, the Common Standard is an application profile that defines a shared, structured model for expressing the content of a data management plan — its datasets, contributors, hosts, costs, distributions and the rest — in a machine-actionable form. Its purpose is interoperability: a plan expressed according to the Common Standard means the same thing to any system that understands the standard, regardless of which tool created it. This is what allows a machine-actionable DMP (maDMP) to be more than a document trapped in one application. Where a narrative plan is prose that only a human can interpret, a maDMP expressed in the Common Standard is structured data that systems can read, validate, update and exchange.
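
    A small example may help show what structured data means here. The sketch below builds a fragment of a plan, serialises it to JSON and reads it back, as a receiving system would. The field names follow my reading of the RDA DMP Common Standard and cover only a small subset, so they should be checked against the published schema; all identifiers, names and the `dataset_identifiers` helper are invented.

```python
import json

# A deliberately incomplete plan: field names are assumptions based on the
# Common Standard, and every identifier is an invented placeholder.
madmp = {
    "dmp": {
        "title": "Example project data management plan",
        "created": "2026-01-15T09:00:00Z",
        "modified": "2026-03-01T12:30:00Z",
        "dmp_id": {"identifier": "https://doi.org/10.1234/dmp.0001", "type": "doi"},
        "contact": {
            "name": "Jane Doe",
            "mbox": "jane.doe@example.org",
            "contact_id": {
                "identifier": "https://orcid.org/0000-0000-0000-0000",
                "type": "orcid",
            },
        },
        "dataset": [
            {
                "title": "Example river temperature series",
                "dataset_id": {
                    "identifier": "https://doi.org/10.1234/dataset.0001",
                    "type": "doi",
                },
            }
        ],
    }
}


def dataset_identifiers(document: str) -> list[str]:
    """Read a serialised plan and list the identifiers of its datasets."""
    plan = json.loads(document)["dmp"]
    return [d["dataset_id"]["identifier"] for d in plan.get("dataset", [])]


serialised = json.dumps(madmp, indent=2)  # what one system would send
print(dataset_identifiers(serialised))    # what another system can extract
```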

    Exchanging maDMPs between systems

    The consequence of a shared standard is that plans can flow between systems rather than being re-keyed at every step. A maDMP created in a DMP tool can be passed to a repository when data is deposited, so the repository already knows what was planned; it can be exchanged with a current research information system (CRIS) so the institution’s record of the project includes its data plan; and it can be shared with funders in a form their systems can ingest, rather than as a PDF a person must read. Information entered once can travel to wherever it is needed, kept consistent across the project’s many systems. This exchange is precisely the kind of federation of research information that reduces duplication and keeps records aligned — the principle, explored in our work on federation, that systems should connect and share rather than each maintaining its own disconnected copy.

    From a checkbox to a connected object

    Taken together, these developments change what a data management plan is. The combination of a persistent DMP ID, links to ORCID, ROR and grant identifiers, and a shared machine-actionable standard turns the plan from a compliance checkbox into a connected, citable, living object — one that participates in the research lifecycle from the moment it is written, that updates as the project progresses, and that connects to the outputs it anticipated. The plan becomes useful to the researcher, the institution and the funder alike, rather than a box ticked at the start and forgotten.

    A consistent vocabulary behind the links

    None of this works unless the elements being linked and exchanged mean the same thing across systems — what a dataset entry, a contributor role, a host or a cost denotes in a plan. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that a machine-actionable plan is understood identically wherever it travels. And because the contributions a plan records — data curation, software development and the rest — are part of the research record, they can be described in the same framework, the CRediT taxonomy and its full set of contribution roles. A DMP ID gives the plan an identity; the PID ecosystem gives it relationships; and a shared vocabulary lets those relationships mean what they should.

  • Data papers: publishing datasets as citable outputs

    Some of the most valuable products of research are datasets: a long-running environmental monitoring series, a carefully curated genomic resource, a survey assembled over years. Such a dataset can underpin dozens of later studies and outlast the project that created it. Yet the people who built it have often struggled to get formal credit, because the traditional unit of academic recognition is the journal article that interprets data, not the data themselves. The data paper exists to close that gap: a peer-reviewed article whose subject is a dataset — describing what it contains, how it was produced and how to reuse it — turning data work into a citable, reviewable output in its own right. This article explains how data papers work and why they matter, drawing on the research outputs domain of the CASRAI Dictionary.

    What a data paper is — and is not

    A data paper is not a research paper that happens to share its data, and it is not a results paper in disguise. Its purpose is descriptive: to document a dataset thoroughly enough that others can find, understand, trust and reuse it. A typical data paper covers what the data are, how and why they were collected, the methods and instruments used, the structure and format of the data, quality-control and validation procedures, and — crucially — where the data are deposited and under what licence. What a data paper generally does not do is advance a new scientific hypothesis or interpret the data to reach a novel conclusion; the contribution is the well-described, reusable resource itself. This restraint is the point: it lets the value of the data be assessed on its own terms, separately from any particular analysis.

    Data journals and where data papers appear

    Data papers are published either in dedicated data journals or in conventional journals that accept the format. Two well-established examples illustrate the model. Scientific Data publishes peer-reviewed descriptions of datasets across the sciences, pairing each with structured metadata. Earth System Science Data publishes data papers in the Earth and environmental sciences, with a strong emphasis on data quality and reusability. These venues apply genuine peer review — reviewers assess whether the data are sound, complete, properly documented and genuinely reusable — which is what gives a data paper its credibility. A peer-reviewed data paper is not merely a deposit; it is a vetted statement that the dataset meets a scholarly standard.

    The relationship between the paper and the data

    A central feature of the data paper model is the separation of the description from the data. The data paper is the human-readable, peer-reviewed article; the dataset itself lives in a repository, where it receives its own persistent identifier — typically a DataCite DOI — and is governed by an explicit licence. The data paper cites the dataset by that identifier, and the dataset record points back to the paper. This means there are two citable objects, linked but distinct: the dataset, which others cite when they reuse the data, and the data paper, which others cite when they draw on its description. Robust dataset citation through DataCite is what allows reuse of the data to be tracked and, over time, credited to the people who produced it. The infrastructure that makes datasets first-class citable objects is part of the wider picture covered in our data infrastructure domain.
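
    The two-way link between paper and dataset can be sketched as a pair of related-identifier entries. The relation type names `Describes` and `IsDescribedBy` are taken from DataCite's controlled list as I recall it and should be verified against the current schema; the helper function and both DOIs are invented, and a data paper registered with another agency would express the link in that agency's own metadata.

```python
def link_paper_and_dataset(paper_doi: str, dataset_doi: str) -> dict[str, dict]:
    """Return reciprocal related-identifier entries, keyed by the record holding them."""
    return {
        paper_doi: {
            "relatedIdentifier": dataset_doi,
            "relatedIdentifierType": "DOI",
            "relationType": "Describes",       # assumed relation type name
        },
        dataset_doi: {
            "relatedIdentifier": paper_doi,
            "relatedIdentifierType": "DOI",
            "relationType": "IsDescribedBy",   # assumed inverse relation
        },
    }


links = link_paper_and_dataset(
    paper_doi="10.1234/datapaper.0001",   # invented placeholder
    dataset_doi="10.1234/dataset.0001",   # invented placeholder
)
for holder, entry in links.items():
    print(holder, entry["relationType"], entry["relatedIdentifier"])
```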

    Why data papers matter for credit and FAIR data

    The deeper reason data papers matter is incentives. For a long time, the rational move for a researcher who built a valuable dataset was to mine it for conventional papers, because that was what counted. The data paper changes the calculus by making the dataset itself a recognised, citable, peer-reviewed output that appears on a CV and accrues citations. That recognition rewards exactly the careful, time-consuming data stewardship that the research system otherwise undervalues. Data papers also advance the FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — almost by construction: a good data paper makes a dataset findable (through publication and a DOI), documents it for accessibility and interoperability, and exists precisely to enable reuse.

    Crediting the people behind the data

    Producing a high-quality dataset is collaborative work — collection, curation, validation, documentation — and a data paper is an opportunity to credit it properly rather than burying it in an acknowledgement. The CRediT taxonomy maps naturally onto this work, with the Data curation role recognising the management, annotation and maintenance of the data, alongside Investigation for collection and Methodology for how it was produced. The complete set of roles is described in our overview of the CRediT roles. Applying structured contribution to a data paper ensures that the curator who made the dataset reusable is named for that contribution, not left invisible behind the names of those who later analyse the data.

    An output worth treating seriously

    Treating datasets as citable, reviewable outputs — with their own identifiers, their own peer review, and their own credit — recognises a simple reality: the data often outlast and out-influence any single paper drawn from them. Data papers give that reality formal standing. The consistent vocabulary that lets a dataset, a data paper and the contributions behind them be described the same way across repositories, journals and institutional systems is maintained in the CASRAI Dictionary, so that the credit a researcher earns for building a resource travels with it wherever it is reused.

  • Documenting datasets for machine-learning research: datasheets, data statements and Croissant

    A machine-learning model is, in a profound sense, a product of its training data. Whatever patterns, gaps, imbalances and biases live in that data are absorbed by the model and reproduced in its behaviour. And yet, for much of the field’s recent history, datasets have circulated with remarkably little documentation: a file, perhaps a brief description, and little record of where the data came from, who is represented in it, what it omits, or what it should and should not be used for. The result has been models trained on poorly understood foundations, with predictable consequences for reliability and fairness. A growing movement now treats dataset documentation as a serious, first-class research output in its own right. This article surveys that movement, drawing on the AI and ML research-outputs domain of the CASRAI Dictionary.

    Datasheets for Datasets

    The most influential proposal, borrowing an idea from electronics, is the datasheet. Just as an electronic component ships with a datasheet describing its characteristics, operating conditions and limitations, Datasheets for Datasets proposes that every dataset be accompanied by a document answering a structured set of questions about it. Those questions span the dataset’s whole life: the motivation for creating it and who funded it; its composition — what the instances are, how many there are, what they represent, and whether sensitive or personal data is involved; the collection process — how the data was gathered and whether consent was obtained; any preprocessing, cleaning or labelling; recommended and discouraged uses; and plans for distribution and maintenance. The aim is to make explicit what would otherwise remain tacit, so that anyone considering using the dataset can understand its provenance and judge its fitness for their purpose — and so that the people who created it must think carefully about these matters while they still can.
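
    Because a datasheet is a structured set of questions, its completeness can be checked mechanically. The sketch below uses the section groupings named in the paragraph above; the `unanswered_sections` helper and the sample answers are invented, and a real datasheet has many questions per section rather than one free-text field.

```python
# Section names follow the groupings described in the text above.
DATASHEET_SECTIONS = (
    "motivation",
    "composition",
    "collection_process",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
)


def unanswered_sections(datasheet: dict[str, str]) -> list[str]:
    """Return the sections that are missing or contain only whitespace."""
    return [s for s in DATASHEET_SECTIONS if not datasheet.get(s, "").strip()]


draft = {
    "motivation": "Built to benchmark sentiment classifiers; funded internally.",
    "composition": "50,000 product reviews, no personal data retained.",
    "collection_process": "Collected from a public forum with permission.",
    "preprocessing": "   ",  # placeholder text not yet filled in
    "uses": "Recommended for benchmarking; not for profiling individuals.",
}

print(unanswered_sections(draft))
```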

    Data Statements for NLP

    A closely related proposal arose specifically in natural-language processing, where the characteristics of the people who produced the text in a dataset profoundly shape what a model learns. Data Statements for Natural Language Processing ask dataset creators to document the relevant characteristics of their data: who the speakers and annotators are, the language varieties represented, the situations in which the language was produced, and so on. The motivation is squarely about bias and generalisation. A language model trained on text from a narrow demographic will work less well, and sometimes fail or cause harm, for people outside it — and without documentation, that limitation is invisible until it bites. Data statements make the population behind the data explicit, so that the boundaries of a model’s likely competence can be understood rather than discovered the hard way. Both datasheets and data statements share a conviction: documentation is not bureaucratic overhead but a precondition for using data responsibly.

    Croissant: machine-readable dataset metadata

    Datasheets and data statements are written largely for humans. But for datasets to be discoverable, loadable and interoperable across the many tools of the machine-learning ecosystem, their metadata also needs to be machine-readable. This is the role of Croissant, a metadata format for machine-learning datasets developed through a community effort associated with MLCommons. Croissant provides a standard, structured way to describe a dataset — its resources, structure, fields and semantics — so that tools, frameworks and repositories can understand and work with it consistently, rather than each requiring bespoke handling. By standardising the description, Croissant makes datasets easier to find, load and combine across platforms, and it can carry the kind of responsible-use and provenance information that datasheets capture into a form that systems can act on. It is, in effect, the interoperability layer for dataset documentation.
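    To give a feel for what such a description can look like, here is a heavily simplified, Croissant-style record built as a Python dictionary, with a helper that lists the fields each record set declares. The dataset name, URL and column names are invented, and the record is a sketch modelled on the general shape of the format rather than a validated Croissant file; the specification should be consulted for the required properties.

```python
import json

# Simplified, illustrative description. All names and URLs are hypothetical,
# and the "@context" is much shorter than a real Croissant file would use.
metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "name": "example_reviews",
    "description": "Hypothetical product-review dataset.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/data/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/label",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "label"},
                    },
                },
            ],
        }
    ],
}


def fields_by_record_set(meta):
    """Map each record set's @id to the @ids of the fields it declares."""
    return {
        rs["@id"]: [f["@id"] for f in rs.get("field", [])]
        for rs in meta.get("recordSet", [])
    }


declared = fields_by_record_set(metadata)
print(json.dumps(declared, indent=2))
```

    Because the structure is explicit, a tool can discover the file, its format and its columns without dataset-specific code, which is the interoperability point made above.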

    How this connects to FAIR and persistent identifiers

    This work is the machine-learning expression of principles that the wider research-data community has long advocated. The FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — map directly onto what good dataset documentation achieves: rich, machine-readable metadata (Croissant) makes data findable and interoperable, while thorough human-readable documentation (datasheets, data statements) is what genuine reusability requires, because data cannot be responsibly reused if its provenance and limitations are unknown. Persistent identifiers complete the picture: when a dataset is registered with an identifier through an infrastructure such as DataCite, it becomes citable and trackable, so that it can be referenced precisely in papers, credited to its creators, and connected to the models and results that depend on it. A documented, identified dataset is one that can take its place in the scholarly record as a real output rather than an anonymous file.

    Datasets as research outputs deserving credit

    The deeper shift here is a change in status. Creating a good dataset (collecting, cleaning, labelling and documenting it carefully) is substantial intellectual labour, and the resulting dataset is a genuine research output that others build upon, often more widely than any single paper. Treating datasets as first-class outputs means documenting them properly, identifying them persistently, and crediting the people who made them. The CRediT taxonomy, whose full set of contribution types is described in our overview of the CRediT roles, captures this work through roles such as Data curation, which recognises the annotation, cleaning and maintenance of research data for initial use and later reuse. Recognising dataset creation as a creditable contribution is part of the same movement that produced datasheets: an insistence that the data underpinning machine learning, and the people who steward it, be taken seriously.

    A consistent vocabulary for dataset documentation

    For dataset documentation to be useful across repositories, frameworks and institutions, the elements it contains must mean the same thing everywhere — what a field describes, what a provenance statement records, what an intended-use restriction means. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that the metadata describing a dataset is understood identically wherever it travels. Datasheets, data statements and Croissant all rest on the same insight: that a dataset without documentation is a liability, and that documenting it well is not an afterthought but part of doing the research properly.

  • Data citation: giving datasets the credit they deserve

    A great deal of published science rests on data the authors collected, cleaned, and shared — and yet the dataset itself, the object on which the conclusions actually depend, is routinely mentioned in passing or not at all. A finding is only checkable if a reader can find and reuse the data behind it, and the people who produced that data deserve recognition for an intellectual contribution that is often enormous. Treating datasets as first-class, citable outputs solves both problems at once. It is a core concern of the data-infrastructure domain and connects directly to the wider taxonomy of the research-outputs domain.

    Why data citation matters

    Citing data as data does two distinct jobs, and it is worth keeping them separate. The first is credit: assembling a well-documented dataset is real scholarly work — designing the collection, curating, validating, and documenting it — and that work is rewarded only if the dataset is cited as an output in its own right, not buried in a methods paragraph. The second is reproducibility and reuse: a result can only be verified, and the data only reused, if a reader can identify and locate the exact dataset that underpinned the analysis. A vague reference to “data available on request” serves neither goal; a formal citation to a deposited, identified dataset serves both.

    The FORCE11 data citation principles

    The community reference point here is the Joint Declaration of Data Citation Principles, developed through FORCE11 and endorsed across the scholarly-communication community. The declaration establishes that data should be treated as a legitimate, citable product of research, on the same footing as any other output. Its eight principles can be condensed into a short set of commitments:

    • Importance. Data should be considered legitimate, citable products of research; data citations should be accorded the same importance as citations of other objects.
    • Credit and attribution. Citations should facilitate giving scholarly credit and legal attribution to all contributors to the data.
    • Evidence. Where a claim relies on data, the corresponding data should be cited.
    • Unique identification. A citation should include a persistent, machine-actionable, globally unique identifier for the data.
    • Access, persistence, and specificity. Citations should enable access to the data and its metadata, persist even beyond the lifespan of the data, and identify the precise version and subset used.
    • Interoperability and flexibility. Citation methods should be interoperable across communities while accommodating their varying practices.

    Everything below is machinery for honouring these principles in practice.

    DataCite and the dataset DOI

    The practical foundation of data citation is the DataCite DOI. DataCite is a DOI registration agency for research data and related outputs. A dataset deposited in a repository, whether a generalist one such as Zenodo, Figshare, or Dryad or a discipline-specific one, is typically assigned a DataCite DOI that resolves persistently to the dataset and its metadata. The DOI is what goes in a reference list, exactly as an article DOI would, which is what makes a dataset citable on equal terms with a paper.

    The DOI is more than a link. The DataCite metadata record behind it carries the structured information that makes the citation meaningful: the creators (ideally with their ORCID iDs), the title, the publisher and publication year, the version, the licence, the resource type, and related identifiers connecting the dataset to the article it supports, the software that processed it, and the grant that funded it. Versioning is treated as a first-class concern: a revised dataset can receive its own version-specific DOI, satisfying the principles’ demand for specificity so that a citation pins down exactly the data used, not merely the latest state of an evolving collection.
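    The following sketch shows how the elements listed above might sit together in one record, and how a reference-list entry could be assembled from them. The property names are modelled on the DataCite metadata schema, but the record is illustrative and not validated against it; the DOIs, title and repository are invented, the ORCID iD is a placeholder, and the citation layout is just one common style.

```python
# Hypothetical DataCite-style record; every value is invented for illustration.
record = {
    "doi": "10.1234/example.v2",
    "creators": [
        {
            "name": "Carberry, Josiah",
            "nameIdentifiers": [
                {
                    "nameIdentifier": "https://orcid.org/0000-0002-1825-0097",
                    "nameIdentifierScheme": "ORCID",
                }
            ],
        }
    ],
    "titles": [{"title": "Example survey responses"}],
    "publisher": "Example Repository",
    "publicationYear": 2024,
    "version": "2.0",
    "rightsList": [{"rights": "CC BY 4.0"}],
    "types": {"resourceTypeGeneral": "Dataset"},
    # Link from the dataset to the (made-up) article it supports.
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/example.article",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        }
    ],
}


def format_citation(rec):
    """Build a reference-list entry that names the version and the DOI."""
    authors = "; ".join(c["name"] for c in rec["creators"])
    title = rec["titles"][0]["title"]
    return (
        f"{authors} ({rec['publicationYear']}). {title} "
        f"(Version {rec['version']}) [Data set]. "
        f"{rec['publisher']}. https://doi.org/{rec['doi']}"
    )


citation = format_citation(record)
print(citation)
```

    Including the version and a version-specific DOI in the formatted entry is what lets the citation point at exactly the data used.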

    Crediting the people: the Data curation role

    Identifying the dataset is half the task; crediting the humans who produced it is the other half, and the two are easily confused. A DataCite DOI identifies the artefact and keeps it resolvable; it does not, on its own, record the division of labour that produced it. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Data curation role, defined as the management activities to annotate, scrub, and maintain research data (including the software code where needed to interpret the data) for initial use and later reuse. Recording Data curation on the associated paper makes visible the often-uncredited work of turning raw observations into a documented, reusable dataset.

    The two layers complement each other precisely. The dataset DOI and its DataCite metadata say what the data is, where it lives, and which version; the CRediT role record says who curated, validated, and maintained it. Used together they ensure that both the data and the people behind it are visible — rather than the common outcome where neither is, and the dataset is reduced to an unattributed line in a methods section.

    A practical recipe

    1. Deposit the data in a trustworthy repository and obtain a DataCite DOI, rather than leaving it “available on request”.
    2. Cite the dataset in your reference list using its DOI, the way you would cite an article — not in a footnote or in prose.
    3. Pin the version. Where the data may change, cite the version-specific DOI so the citation identifies exactly what was used.
    4. Record the contributors — on both the DataCite record (with ORCID iDs) and, via CRediT’s Data curation role, on the paper the data supports.
    5. Apply a clear licence. Data that cannot be reused with confidence is data that will not be reused; the citation principles assume the reuse terms are stated.
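
    Some of these steps can be checked mechanically before a dataset is cited. The function below is a minimal sketch under stated assumptions: it expects a DataCite-style dictionary (the key names are modelled on that schema, the values are invented) and flags only the gaps visible in the record itself. Steps that happen on the paper, such as the placement of the citation and the CRediT role, are outside what it can see.

```python
def recipe_gaps(record):
    """List the recipe steps that a DataCite-style record visibly fails."""
    gaps = []
    if not record.get("doi"):
        gaps.append("step 1: no DOI")
    if not record.get("version"):
        gaps.append("step 3: no version")
    creators = record.get("creators", [])
    # Any name identifier counts here; a stricter check would test the scheme.
    if not creators or any(not c.get("nameIdentifiers") for c in creators):
        gaps.append("step 4: creator without ORCID iD")
    if not record.get("rightsList"):
        gaps.append("step 5: no licence")
    return gaps


# Hypothetical draft record with two deliberate omissions.
draft = {
    "doi": "10.1234/example.v2",
    "creators": [{"name": "Carberry, Josiah"}],
    "version": "2.0",
    "rightsList": [],
}

gaps = recipe_gaps(draft)
print(gaps)
```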

    Where shared vocabulary fits

    “Dataset”, “data citation”, “version”, “data curation”, and “repository” are used inconsistently across communities, which is part of why credit for data leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 data citation principles and to DataCite — is what lets a data citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain, with adjacent entries in the research-outputs domain.

  • Identifying instruments and samples: PIDINST and IGSN

    Over the past two decades, research has built an impressive web of persistent identifiers. Articles have DOIs, datasets have DOIs, researchers have ORCID iDs, organisations have ROR identifiers, and grants and projects are increasingly identified too. Follow any one of these and you can traverse the others — this person wrote that paper, which used this dataset, funded by that grant. But there have long been two conspicuous gaps in this graph, both at the point where research meets the physical world: the instruments that generate measurements, and the physical samples from which data are drawn. Two community efforts — PIDINST for instruments and IGSN for samples — are now closing those gaps. This article explains both and where they fit, drawing on the persistent identifiers domain of the CASRAI Dictionary.

    Why instruments and samples need identifiers

    Consider a measurement. To interpret it properly — to reproduce it, to compare it with another, to assess its reliability — you need to know what produced it: which spectrometer, which sensor, which sequencer, in what configuration and with what calibration history. And to know what the measurement is of, you need to identify the physical sample: which rock core, which water sample, which tissue specimen, collected where and when. Traditionally this provenance was described in prose, in ways that were inconsistent between papers and impossible to resolve automatically. Two papers might use the same instrument or analyse splits of the same sample without any way to know it. Persistent identifiers for instruments and samples make that provenance explicit, resolvable and connectable to the rest of the PID graph.

    PIDINST: persistent identifiers for instruments

    PIDINST is a community framework, developed under the auspices of the Research Data Alliance, for assigning persistent identifiers to research instruments and describing them with a shared metadata schema. The idea is that a significant instrument — a telescope, a mass spectrometer, a research vessel’s sensor array — receives a persistent identifier and a structured description covering attributes such as its owner, manufacturer, model, and where it is located or operated. Once an instrument has a resolvable identifier, data it produces can cite it, the instrument can be linked to the people and institutions responsible for it, and its outputs can be aggregated across studies. PIDINST is deliberately infrastructure-agnostic: it defines the metadata and the principle of persistent identification rather than mandating a single issuing body, allowing existing identifier systems to carry instrument PIDs.

    IGSN: identifiers for physical samples

    On the samples side, the IGSN (originally the International Geo Sample Number, now stewarded as a global sample identifier) provides persistent, resolvable identifiers for physical specimens. An IGSN identifies a particular sample, such as a sediment core, a mineral specimen or a biological sample, and carries metadata describing what it is, where and when it was collected, and how it relates to parent samples and sub-samples. This last point matters enormously in practice, because samples are routinely split, sub-sampled and distributed; IGSN can express the relationships between a parent sample and its derivatives, so that analyses performed on different splits can be traced back to a common origin. The IGSN system has been integrated with the DataCite infrastructure, aligning sample identifiers with the same resolution and metadata ecosystem used for datasets, which means a sample can be cited and linked just as a dataset can.
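
    The parent and sub-sample relationship can be pictured as a simple lookup from each sample to the sample it was split from. The sketch below follows those links to find a common origin. The identifiers are made up for readability and do not follow real IGSN syntax, and the dictionary stands in for whatever relationship metadata a registry actually returns.

```python
# Hypothetical sample identifiers mapped to their parent (None = original sample).
parent_of = {
    "SAMPLE-CORE001": None,
    "SAMPLE-CORE001-A": "SAMPLE-CORE001",
    "SAMPLE-CORE001-A1": "SAMPLE-CORE001-A",
    "SAMPLE-CORE001-B": "SAMPLE-CORE001",
    "SAMPLE-ROCK042": None,
}


def origin(sample, parents):
    """Follow parent links upward until reaching a sample with no parent."""
    seen = set()
    while parents[sample] is not None:
        if sample in seen:
            raise ValueError(f"cycle in parent links at {sample}")
        seen.add(sample)
        sample = parents[sample]
    return sample


def share_origin(a, b, parents):
    """True when two samples were ultimately split from the same original."""
    return origin(a, parents) == origin(b, parents)


print(origin("SAMPLE-CORE001-A1", parent_of))
print(share_origin("SAMPLE-CORE001-A1", "SAMPLE-CORE001-B", parent_of))
print(share_origin("SAMPLE-CORE001-B", "SAMPLE-ROCK042", parent_of))
```

    With links like these recorded, two analyses on different splits of the same core can be recognised as related even when the papers never mention each other.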

    A note on RRIDs

    Related to the question of identifying research resources are Research Resource Identifiers (RRIDs), which identify key resources used in biomedical research, such as antibodies, cell lines, model organisms and software tools, so that the exact resource behind a result can be unambiguously named and found. RRIDs address a different layer from PIDINST and IGSN: not the instrument that measured or the unique physical specimen, but the catalogued, often commercially available resources whose precise identity is essential to reproducibility. Together, instrument PIDs, sample identifiers and resource identifiers fill in the parts of the provenance picture that dataset and article DOIs never reached.

    Completing the provenance chain

    The power of these identifiers is realised when they are connected. Picture a fully linked record: a dataset (DOI) was produced by an instrument (PIDINST) operated by a researcher (ORCID) at an institution (ROR), measuring a sample (IGSN) collected on a particular expedition, using a reagent identified by an RRID, all under a grant (grant ID). Each link is resolvable; the whole forms a provenance chain that a machine can traverse and a human can audit. That is a qualitatively better basis for reproducibility and reuse than a methods section written in prose, because every node can be verified against an authoritative record rather than taken on trust.
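
    The linked record described above can be sketched as a small graph of identifier-to-identifier links that a program walks breadth-first. Everything in the example is hypothetical: the identifiers are placeholders and the relation names are invented for readability rather than taken from any metadata schema.

```python
from collections import deque

# Each link is (source identifier, relation, target identifier).
# All identifiers and relation names are placeholders.
links = [
    ("doi:10.1234/example.dataset", "producedBy", "instrument:example-spectrometer"),
    ("instrument:example-spectrometer", "operatedBy", "orcid:0000-0002-1825-0097"),
    ("orcid:0000-0002-1825-0097", "affiliatedWith", "ror:example-institution"),
    ("doi:10.1234/example.dataset", "measures", "sample:EXAMPLE001"),
    ("doi:10.1234/example.dataset", "usedReagent", "rrid:EXAMPLE_0001"),
    ("doi:10.1234/example.dataset", "fundedBy", "grant:EXAMPLE-2024-01"),
]


def provenance(start, link_list):
    """Return every identifier reachable from start, in breadth-first order."""
    graph = {}
    for source, _relation, target in link_list:
        graph.setdefault(source, []).append(target)
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in visited:
                visited.add(target)
                order.append(target)
                queue.append(target)
    return order


chain = provenance("doi:10.1234/example.dataset", links)
for identifier in chain:
    print(identifier)
```

    Starting from the dataset, the walk reaches the instrument, sample, reagent and grant directly, then the operator and the institution through the instrument, which is the traversal a prose methods section cannot offer.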

    Using them in practice

    For researchers, adopting these identifiers is becoming more straightforward as repositories and data-collection workflows build in support. The practical advice is to assign and cite instrument and sample identifiers at the point of data creation rather than retrofitting them later, and to record the relationships — instrument to data, parent sample to sub-sample — while they are still known. Our guidance on persistent identifiers for authors covers how to incorporate these into the research record, and the consistent definitions that let an instrument PID or sample identifier mean the same thing across systems are maintained in the CASRAI Dictionary. As with people and outputs, recognising the contributions of those who build and steward instruments and sample collections is part of a complete record, and structured contribution through the CRediT taxonomy helps make that work visible too.