For a long time, the formal scholarly record recognised one kind of output above all others: the journal article, identified by a DOI and citable in a standard way. The datasets, software, samples and other research outputs that often represented the greater investment of effort had no comparable standing. They were hard to cite, hard to find again, and easy to lose track of. DataCite exists to change that. It is the global, not-for-profit registration agency that issues persistent identifiers — data DOIs — and maintains the metadata standard that makes datasets and other non-article outputs first-class, citable, connectable objects. This article explains what DataCite does and why it matters, drawing on the data infrastructure domain of the CASRAI Dictionary.
Why data needed its own infrastructure
Citing a dataset properly is harder than citing a paper, and the difficulty is structural. A dataset may have versions; it lives in a repository rather than a journal; it has creators and contributors whose roles differ from those of authors; and its value is realised through reuse, which is precisely what is hardest to track. Without a persistent identifier and a shared way to describe it, a dataset cannot be cited consistently, cannot be found reliably after the project that made it has ended, and cannot accrue the credit that reuse should generate for its creators. DataCite addresses all of these at once by giving data outputs a resolvable DOI and a structured description, so that a dataset can be referenced as precisely and durably as any article.
Data DOIs and persistent identification
The core service is the assignment of DOIs to research outputs through DataCite’s member repositories and data centres. When a repository deposits a dataset, it registers a DataCite DOI that resolves persistently to the dataset’s landing page, independent of any changes to the repository’s internal structure over time. That persistence is what lets a dataset DOI sit safely in a reference list, a data-availability statement, or another dataset’s record for years. Crucially, DataCite DOIs are not limited to datasets: the same mechanism identifies software, samples, images, models, preprints and a wide range of other outputs, extending durable, citable identity well beyond the traditional article.
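As a rough illustration of what "resolves persistently" means in practice, the sketch below turns a DOI written in any of several common forms into a link through the public doi.org resolver. The function name `doi_to_url` and the example DOI are hypothetical and chosen only for illustration. The list of prefixes it strips is an assumption about how DOIs tend to be written, not a rule taken from DataCite.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver

# Prefixes people commonly put in front of a DOI (an assumed, non-exhaustive list).
COMMON_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Turn a DOI written in a common form into a resolver URL (hypothetical helper)."""
    cleaned = doi.strip()
    for prefix in COMMON_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # Very light sanity check: DOIs start with "10." and contain a "/".
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(cleaned, safe="/")


# The DOI below is a placeholder, not a registered identifier.
print(doi_to_url("doi:10.1234/example-dataset.v2"))
```

The point of the design is that a citation stores only the DOI. Where the dataset actually lives is looked up at resolution time, so the repository can reorganise its own URLs without breaking the reference.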
The DataCite metadata schema
An identifier is only useful if there is consistent information behind it, and this is where the DataCite Metadata Schema does its work. The schema defines a structured set of properties for describing a research output: a small mandatory core (the identifier, its creators, title, publisher, publication year and resource type) and a rich set of optional fields covering contributors and their roles, dates, related identifiers, funding, rights and descriptions. Two features of the schema are especially powerful. The first is relatedIdentifier, which lets a record express how an output relates to others: this dataset is a version of that one, supplements this article, is derived from that sample, is documented by this data paper. The second is the recording of contributors and their roles, which allows a dataset record to name not only its creators but also the specific people who curated, collected or maintained the data. Together these turn each record into a node with explicit, machine-readable links to the rest of the research world.
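To make the shape of such a record concrete, here is a minimal sketch of one as a plain Python dictionary. The field names loosely follow DataCite's JSON serialisation but are simplified. Every name, DOI and title in it is a placeholder, and the helpers `missing_required` and `relations` are hypothetical and not part of any DataCite tooling.

```python
# Hypothetical record: all names and identifiers below are placeholders.
record = {
    "doi": "10.1234/example-dataset.v2",
    "creators": [{"name": "Example, Ada"}],
    "titles": [{"title": "Example survey data"}],
    "publisher": "Example Repository",
    "publicationYear": 2024,
    "types": {"resourceTypeGeneral": "Dataset"},
    # Contributors carry a role drawn from the schema's contributorType list.
    "contributors": [
        {"name": "Example, Grace", "contributorType": "DataCurator"},
        {"name": "Example, Alan", "contributorType": "DataCollector"},
    ],
    # Each related identifier says how this output relates to another one.
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/example-dataset",
         "relatedIdentifierType": "DOI", "relationType": "IsVersionOf"},
        {"relatedIdentifier": "10.5555/example-article",
         "relatedIdentifierType": "DOI", "relationType": "IsSupplementTo"},
        {"relatedIdentifier": "10.5555/example-data-paper",
         "relatedIdentifierType": "DOI", "relationType": "IsDocumentedBy"},
    ],
}

# Simplified stand-in for the schema's mandatory core.
REQUIRED = ("doi", "creators", "titles", "publisher", "publicationYear", "types")


def missing_required(rec: dict) -> list:
    """Return the mandatory fields that are absent or empty."""
    return [field for field in REQUIRED if not rec.get(field)]


def relations(rec: dict) -> list:
    """Return (relationType, identifier) pairs for every related identifier."""
    return [
        (item["relationType"], item["relatedIdentifier"])
        for item in rec.get("relatedIdentifiers", [])
    ]


print(missing_required(record))
for relation_type, identifier in relations(record):
    print(relation_type, identifier)
```

A real deposit would be validated against the published schema rather than a hand-written field list. The sketch is only meant to show where the typed relationships and contributor roles sit in a record.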
DataCite and the PID graph
Because DataCite records carry related identifiers and references to other persistent identifiers — ORCID for people, ROR for organisations, Crossref DOIs for articles, grant identifiers for funding — they are not isolated entries but part of a connected PID graph. Follow the links and you can move from a dataset to its creators, their institutions, the grant that funded the work, and the article that analysed it. DataCite and Crossref between them register much of the scholarly output graph — broadly, the data and the literature — and their shared use of resolvable identifiers and exchangeable metadata is what lets the whole network be traversed automatically rather than reconstructed by hand. DataCite’s role in this interoperating arrangement is described in our work on DataCite and federation.
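The idea of following the links can be sketched as a breadth-first walk over a small in-memory graph. Everything here is invented for illustration: the identifiers are placeholders, the edge labels are not DataCite relation types, and a real traversal would draw its edges from registered metadata rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical PID graph: every identifier and edge label is a placeholder.
EDGES = {
    "doi:10.1234/example-dataset": [
        ("creator", "orcid:0000-0000-0000-0000"),
        ("funded_by", "grant:EX-0001"),
        ("supplements", "doi:10.5555/example-article"),
    ],
    "orcid:0000-0000-0000-0000": [
        ("affiliated_with", "ror:00example0"),
    ],
    "doi:10.5555/example-article": [
        ("author", "orcid:0000-0000-0000-0000"),
    ],
}


def reachable(start: str, edges: dict) -> dict:
    """Breadth-first walk returning each reachable PID with its hop count."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _label, target in edges.get(node, []):
            if target not in distance:  # visit each identifier once
                distance[target] = distance[node] + 1
                queue.append(target)
    return distance


for pid, hops in reachable("doi:10.1234/example-dataset", EDGES).items():
    print(hops, pid)
```

Starting from the dataset, the walk reaches the creator, the grant and the article in one hop and the creator's institution in two. This is the kind of path that would otherwise have to be reconstructed by hand.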
Supporting FAIR data and reuse
DataCite is foundational to the FAIR principles, which hold that data should be Findable, Accessible, Interoperable and Reusable. A DataCite DOI and its metadata make a dataset findable through search and resolvable through a stable link; the schema's structured, standardised fields support interoperability; and the explicit rights and relationship information supports informed reuse. Just as importantly, because datasets registered with DataCite can be cited by their DOIs, their reuse can in principle be tracked, which is the basis for crediting the people who produced them. A dataset that is cited is a dataset whose creators can be recognised, and that recognition is what careful data stewardship has historically been denied.
Crediting data work consistently
DataCite’s ability to record contributors and their roles connects directly to the recognition of data work. The CRediT taxonomy — whose full set of roles is described in our overview of the CRediT roles — provides a controlled vocabulary for contribution, with the Data curation role recognising the management, annotation and maintenance that make a dataset reusable, alongside Investigation for collection and Methodology for how it was produced. For a contribution recorded in a dataset’s DataCite metadata to be understood the same way in an institutional system or a data paper, the terms must be defined consistently across systems. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata DataCite carries — resource types, contributor roles, relationship types — means the same thing wherever a dataset DOI travels.