CASRAI Dictionary

DataCite and the data-citation infrastructure

For a long time, the formal scholarly record recognised one kind of output above all others: the journal article, identified by a DOI and citable in a standard way. The datasets, software, samples and other research outputs that often represented the greater investment of effort had no comparable standing. They were hard to cite, hard to find again, and easy to lose track of. DataCite exists to change that. It is a global, not-for-profit DOI registration agency that issues persistent identifiers (DataCite DOIs, often called data DOIs) and maintains the metadata standard that makes datasets and other non-article outputs first-class, citable, connectable objects. This article explains what DataCite does and why it matters, drawing on the data infrastructure domain of the CASRAI Dictionary.

Why data needed its own infrastructure

Citing a dataset properly is harder than citing a paper, and the difficulty is structural. A dataset may have versions; it lives in a repository rather than a journal; it has creators and contributors whose roles differ from those of authors; and its value is realised through reuse, which is precisely what is hardest to track. Without a persistent identifier and a shared way to describe it, a dataset cannot be cited consistently, cannot be found reliably after the project that made it has ended, and cannot accrue the credit that reuse should generate for its creators. DataCite addresses all of these at once by giving data outputs a resolvable DOI and a structured description, so that a dataset can be referenced as precisely and durably as any article.

Data DOIs and persistent identification

The core service is the assignment of DOIs to research outputs through DataCite’s member repositories and data centres. When a repository deposits a dataset, it registers a DataCite DOI that resolves persistently to the dataset’s landing page, independent of any changes to the repository’s internal structure over time. That persistence is what lets a dataset DOI sit safely in a reference list, a data-availability statement, or another dataset’s record for years. Crucially, DataCite DOIs are not limited to datasets: the same mechanism identifies software, samples, images, models, preprints and a wide range of other outputs, extending durable, citable identity well beyond the traditional article.

The DataCite metadata schema

An identifier is only useful if there is consistent information behind it, and this is where the DataCite Metadata Schema does its work. The schema defines a structured set of properties for describing a research output: its creators, title, publisher and publication year, the resource type, and a rich set of optional fields covering contributors and their roles, dates, related identifiers, funding, rights and descriptions. Two features of the schema are especially powerful. The first is the relatedIdentifier property, which, together with its relation type, lets a record express how an output relates to others: this dataset is a version of that one, supplements this article, is derived from that sample, is documented by this data paper. The second is the recording of contributors and their roles, which allows a dataset record to name not only its creators but also the specific people who curated, collected or maintained the data. Together these turn each record into a node with explicit, machine-readable links to the rest of the research world.
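To make the shape of such a record concrete, here is a minimal sketch written as a Python dictionary. The property names loosely follow DataCite's JSON conventions (creators, contributors, relatedIdentifiers, relationType, contributorType), but every DOI and name is invented, the describe_links helper is hypothetical, and the record is not validated against the schema.

```python
# Hypothetical record: every DOI and name below is invented for illustration.
record = {
    "doi": "10.1234/example-dataset.v2",
    "titles": [{"title": "Example survey responses"}],
    "creators": [{"name": "Doe, Jane"}],
    "publisher": "Example Repository",
    "publicationYear": 2026,
    "types": {"resourceTypeGeneral": "Dataset"},
    # Contributors carry a role, so curation and collection are named explicitly.
    "contributors": [
        {"name": "Roe, Sam", "contributorType": "DataCurator"},
        {"name": "Poe, Alex", "contributorType": "DataCollector"},
    ],
    # Each related identifier states how this output relates to another one.
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/example-dataset.v1",
         "relatedIdentifierType": "DOI", "relationType": "IsVersionOf"},
        {"relatedIdentifier": "10.5555/example-article",
         "relatedIdentifierType": "DOI", "relationType": "IsSupplementTo"},
        {"relatedIdentifier": "10.5555/example-data-paper",
         "relatedIdentifierType": "DOI", "relationType": "IsDocumentedBy"},
    ],
}


def describe_links(rec):
    """Return one 'subject relation object' string per related identifier."""
    return [
        f"{rec['doi']} {rel['relationType']} {rel['relatedIdentifier']}"
        for rel in rec["relatedIdentifiers"]
    ]


for line in describe_links(record):
    print(line)
```

Reading each related identifier as a subject, relation, object statement is what makes the record usable as a node with typed links rather than a flat description.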

DataCite and the PID graph

Because DataCite records carry related identifiers and references to other persistent identifiers — ORCID for people, ROR for organisations, Crossref DOIs for articles, grant identifiers for funding — they are not isolated entries but part of a connected PID graph. Follow the links and you can move from a dataset to its creators, their institutions, the grant that funded the work, and the article that analysed it. DataCite and Crossref between them register much of the scholarly output graph — broadly, the data and the literature — and their shared use of resolvable identifiers and exchangeable metadata is what lets the whole network be traversed automatically rather than reconstructed by hand. DataCite’s role in this interoperating arrangement is described in our work on DataCite and federation.
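The idea of following links automatically can be sketched as a breadth-first walk over a small in-memory graph. This is only an illustration: the identifiers are placeholders, the PID_GRAPH structure and the reachable function are assumptions made for this sketch, and a real traversal would query registry APIs rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical PID graph: each key links to the identifiers its record mentions.
# All identifiers are placeholders, not real DOIs, ORCID iDs or ROR IDs.
PID_GRAPH = {
    "doi:10.1234/dataset": [
        "orcid:0000-0000-0000-0001",
        "doi:10.5555/article",
        "grant:EX-001",
    ],
    "orcid:0000-0000-0000-0001": ["ror:example-university"],
    "doi:10.5555/article": ["orcid:0000-0000-0000-0001"],
    "grant:EX-001": ["ror:example-funder"],
}


def reachable(graph, start):
    """Breadth-first walk returning every identifier reachable from start."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:  # visit each identifier only once
                seen.add(neighbour)
                queue.append(neighbour)
    return order


print(reachable(PID_GRAPH, "doi:10.1234/dataset"))
```

Starting from the dataset, the walk reaches the creator, the creator's institution, the grant, the funder and the article without any of those connections being reconstructed by hand.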

Supporting FAIR data and reuse

DataCite underpins the practical application of the FAIR principles, which hold that data should be Findable, Accessible, Interoperable and Reusable. A DataCite DOI and its metadata make a dataset findable through search and resolvable through a stable link; the schema’s structured, standardised fields support interoperability; and the explicit rights and relationship information supports informed reuse. Just as importantly, because datasets registered with DataCite can be cited by their DOIs, their reuse can in principle be tracked, which is the basis for crediting the people who produced them. A dataset that is cited is a dataset whose creators can be recognised, and that recognition is what careful data stewardship has historically been denied.

Crediting data work consistently

DataCite’s ability to record contributors and their roles connects directly to the recognition of data work. The CRediT taxonomy — whose full set of roles is described in our overview of the CRediT roles — provides a controlled vocabulary for contribution, with the Data curation role recognising the management, annotation and maintenance that make a dataset reusable, alongside Investigation for collection and Methodology for how it was produced. For a contribution recorded in a dataset’s DataCite metadata to be understood the same way in an institutional system or a data paper, the terms must be defined consistently across systems. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata DataCite carries — resource types, contributor roles, relationship types — means the same thing wherever a dataset DOI travels.

June 10, 2026
Machine-actionable data management plans: the maDMP comes of age

The data management plan has a reputation problem. For most of its existence it has been a document written under deadline pressure to satisfy a funder requirement, deposited as a PDF, and then never opened again. It describes intentions that, by the end of a project, may bear little resemblance to what actually happened to the data. The machine-actionable DMP is the response to that failure mode, and after some years of standards work it has come of age. This article explains what it is and why it matters, drawing on the machine-actionable DMPs domain.

From document to data object

A data management plan (DMP) is a description of the data-management practices to be followed during and after a research project: what data will be produced, how they will be stored and documented, under what licence and access conditions they will be shared, and how long they will be kept. A machine-actionable DMP (maDMP) is the same content expressed as structured data that research systems can exchange, validate, ingest, and update automatically, rather than as prose only a human can read.

The distinction is not cosmetic. A prose DMP states that data will be deposited in a trusted repository; a maDMP carries that as a structured assertion that a repository system can read, act on, and later check against what was actually deposited. The DMP stops being a one-time document and becomes a node in the research-information graph, connected to the project, the outputs, the funder, and the people.

The standard that made it possible: the RDA Common Standard

Structured exchange requires an agreed structure, and that is the contribution of the RDA DMP Common Standard — the application profile developed by the Research Data Alliance to represent maDMP content in a common, system-neutral form. It defines the entities a DMP describes and the relationships between them, so that a DMP created in one tool means the same thing when read by another.

The standard’s design encodes a useful distinction the prose form blurs: between an anticipated dataset — a dataset the DMP says will be produced — and a realised dataset, one that has actually been produced and, typically, deposited. A maDMP can carry both, which is precisely what lets a system at closeout check whether the datasets the plan anticipated were in fact realised and deposited. Around these sit the structured fields that prose tends to leave vague: the retention period, the licence assertion, the access control policy, the storage location, and a data volume estimate for storage planning.
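The closeout check described above can be sketched in a few lines. The fragment below is deliberately simplified and only loosely modelled on a maDMP: the status field, the plain-string dataset_id and the unrealised_datasets helper are assumptions made for this sketch, not fields or functions taken from the RDA Common Standard.

```python
# Simplified, hypothetical maDMP fragment. The "status" field and the
# plain-string "dataset_id" are assumptions made for this sketch.
madmp = {
    "dmp": {
        "title": "Example project DMP",
        "dataset": [
            {"title": "Interview transcripts", "status": "realised",
             "dataset_id": "10.1234/transcripts"},
            {"title": "Sensor readings", "status": "anticipated",
             "dataset_id": None},
        ],
    }
}


def unrealised_datasets(plan):
    """List titles of datasets still only anticipated, or lacking an identifier."""
    return [
        d["title"]
        for d in plan["dmp"]["dataset"]
        if d.get("status") != "realised" or not d.get("dataset_id")
    ]


print(unrealised_datasets(madmp))
```

A dataset marked as realised but carrying no identifier is also flagged, since a deposit claim that cannot be resolved is not checkable.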

The DMP ID: giving the plan an identity

For a DMP to be referenced across systems, it needs an identity, and that is the role of the DMP ID: a persistent identifier for a specific data management plan, typically a DOI minted by DataCite through tools such as the DMPTool, the DCC’s DMPonline, or ARGOS. With a DMP ID, the plan can be cited like any other research object: a funder can refer to it, a current research information system (CRIS) can link to it, an output can point back to the plan that anticipated it, and the connections become part of the persistent-identifier graph alongside ORCID, ROR, and the grant ID. The DMP ID is what turns the DMP from a loose attachment into a first-class, addressable entity in the persistent-identifier ecosystem.

The living DMP

The deepest change the maDMP enables is conceptual: the move from the frozen DMP to the living DMP — a plan updated throughout the project lifecycle rather than fixed at award. A frozen DMP is a prediction made at the least-informed moment of a project, before any data exist. A living DMP is a record that tracks reality: as anticipated datasets become realised, as storage decisions change, as access conditions are settled, the plan is updated, and a DMP version captures each snapshot.

The frozen DMP answers the question “what did the applicant promise at award?” The living maDMP answers a far more useful question: “what is actually happening to this project’s data, right now?” Only the second is worth the effort of maintaining.

This is where maDMP exchange earns its keep. When the DMP is structured and identified, a change made in one system can propagate — from a DMP tool to a CRIS, from the CRIS to a repository — so that the plan stays current without re-keying. A scheduled DMP review event becomes a checkpoint against live data rather than a re-reading of a stale document, and a DMP completeness score can be computed automatically against the funder’s required elements.
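As an illustration of how a completeness score might be computed, the sketch below counts how many required elements a plan has filled in. The list of required elements, the flat plan dictionary and the completeness_score function are all hypothetical; a real funder would define its own required elements and a real maDMP is nested rather than flat.

```python
# Hypothetical funder requirements; real lists differ by funder.
REQUIRED_ELEMENTS = (
    "retention_period",
    "licence",
    "access_policy",
    "storage_location",
)


def completeness_score(plan, required=REQUIRED_ELEMENTS):
    """Fraction of required elements that are present and non-empty."""
    if not required:
        return 1.0  # nothing is required, so the plan is trivially complete
    filled = sum(1 for key in required if plan.get(key))
    return filled / len(required)


# Invented plan: two elements filled, one empty, one missing.
plan = {
    "retention_period": "10 years",
    "licence": "CC-BY-4.0",
    "access_policy": "",
}

print(completeness_score(plan))
```

Because the score is a simple query over structured fields, it can be recomputed at every review event instead of being assessed once by reading a document.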

Why funders and institutions want this

The maDMP is not an end in itself; it is wanted because it makes obligations checkable. A funder that requires data to be deposited in a trusted repository under an open licence can, with structured maDMPs, verify that the realised datasets meet the commitment, rather than trusting a final-report paragraph. An institution can monitor data-management compliance across its whole portfolio as a query over structured plans. And the researcher, crucially, benefits too: a living maDMP linked to the project’s outputs means the closeout data-management report is largely assembled already, not reconstructed from memory. This is the same dividend that structured grant and disclosure data pay throughout research administration.

Where shared vocabulary fits

The RDA Common Standard supplies the structure, that is, the shape of a maDMP. It does not, on its own, fix the controlled values that populate it: the list of access categories, the licence vocabulary, the dataset-status terms. Two systems can both emit valid Common Standard maDMPs and still disagree on what “restricted access” or “realised” means. That definitional gap, which sits below the structural model, is exactly what a shared, federated vocabulary fills, while pointing back to the RDA for the standard and to DataCite for the DMP ID infrastructure. Supplying that vocabulary is the role the CASRAI Dictionary is built for.

What to do now

For researchers and data stewards: treat the DMP as a living, structured object with a DMP ID, updated as anticipated datasets become realised. For funders: ask for maDMPs against the RDA Common Standard and verify realised against anticipated at closeout. For standards work: pair the structural standard with shared value vocabularies so that maDMPs from different tools genuinely interoperate.

June 7, 2026
System cards, benchmarks and evaluation suites as first-class research outputs

A previous article in this series made the case for model cards and datasheets as documented research outputs. That is the well-established end of the spectrum. A wider and faster-moving set of AI/ML artefacts is now emerging as citable scholarship: system cards, evaluation benchmarks, and the evaluation suites that run against them. These are less settled than model cards, but they are arguably more consequential for how the field measures its own progress, and they belong in the conversation about what counts as a research output. CASRAI’s AI/ML research outputs domain treats them as first-class entries.

From model cards to system cards

A model card describes a model. A system card describes a deployed system — which is usually a model plus a great deal else: input and output filters, retrieval components, tool-use scaffolding, safety mitigations, usage policies, and the human processes wrapped around the whole thing. The distinction matters because the behaviour a user encounters is a property of the system, not the model in isolation.

System cards became prominent as the major model developers began publishing them alongside frontier model releases, documenting not just capabilities and evaluations but the risk assessments, red-teaming exercises, and mitigation decisions that shaped the deployment. A system card typically records what the system can do, what failure modes were identified, what was done to mitigate them, and what residual risks remain. It is closer in spirit to a safety case than to a specification sheet.

For the research record, the significance is that a system card documents decisions, not just facts. It is the contemporaneous account of how a team reasoned about deploying a powerful artefact responsibly. That reasoning is itself a research contribution — and, when the system later behaves in unexpected ways, the system card is the record against which those decisions can be assessed.

Benchmarks as scholarly infrastructure

An evaluation benchmark is a standardised dataset paired with a scoring protocol, used to measure and compare the capability of models on a defined task. Benchmarks are not new — they have organised whole subfields of machine learning for decades — but their status as research outputs has historically been ambiguous. A benchmark is often introduced in a paper, but the artefact that matters and gets reused for years afterwards is the dataset-and-protocol, not the paper describing it.

The honest view is that benchmarks are among the highest-leverage outputs in the field. A widely adopted benchmark shapes what thousands of researchers optimise for; a flawed one steers the field toward the wrong objective. Building a good benchmark — defining the task crisply, assembling a representative dataset, designing a scoring procedure that resists gaming, and documenting all of it — is substantial and genuinely scholarly work. It deserves to be cited and credited as such, not buried as the apparatus of a single paper.

Evaluation suites and the contamination problem

An evaluation suite bundles multiple benchmarks and tasks together with the harness that runs them, so that a model release can be assessed consistently and reproducibly. The suite is the executable counterpart to the benchmark: where a benchmark is the standard, the suite is the instrument that applies it.

Treating evaluation suites as research outputs surfaces a problem that the field takes seriously: test set contamination. When benchmark data leaks into the training corpus of the models being evaluated, the resulting scores measure memorisation rather than capability, and the benchmark silently loses its validity. Documenting an evaluation suite as a versioned, identifier-bearing output — with a clear record of which data is held out and when it was published — is part of how the community defends against contamination. A benchmark whose provenance and release date are pinned by a persistent identifier can be checked against a model’s training cut-off; an undocumented one cannot.
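The date comparison at the heart of this check is simple enough to sketch. The function name and the example dates are invented, and the check is deliberately narrow: it only says whether leakage can be ruled out on dates alone, not whether contamination actually occurred.

```python
from datetime import date


def contamination_possible(benchmark_release: date, training_cutoff: date) -> bool:
    """True when the benchmark was public on or before the training cut-off,
    so leakage into the training corpus cannot be ruled out by dates alone."""
    return benchmark_release <= training_cutoff


# Invented dates for illustration.
cutoff = date(2024, 6, 1)
print(contamination_possible(date(2024, 1, 15), cutoff))  # released before cut-off
print(contamination_possible(date(2025, 3, 1), cutoff))   # released after cut-off
```

A release date after the cut-off is still not proof of a clean test set, since benchmark items may be drawn from material that was already public. The comparison is only possible at all when the release date is recorded against a stable identifier.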

Why first-class status matters

The recurring theme across these artefact types is that the thing that gets reused is not the paper. It is the system card, the benchmark dataset, the evaluation harness. When these are treated as mere supplementary material, three things go wrong:
- Credit is misallocated. The team that built a benchmark used by the whole field gets a single citation to a paper, while the artefact’s real influence is invisible to any contribution-based assessment.
- Versioning is lost. Benchmarks and suites evolve — tasks are added, errors corrected, contaminated items removed. Without identifiers and versioning, it becomes impossible to say which version of a benchmark a given result was measured against.
- Provenance breaks. The contamination defence above depends entirely on knowing exactly what a benchmark contained and when it was released. That is metadata, and metadata needs an identifier to hang on.

How persistent identifiers and CRediT apply

The infrastructure pattern mirrors that for model cards and datasets. A benchmark or evaluation suite is deposited in a repository and receives a DataCite DOI; each substantive revision receives its own version DOI, so that results can cite the exact version measured against. The executable harness is pinned by a Software Heritage ID. Contributors carry ORCID iDs, institutions carry ROR IDs, and where the work belongs to a larger programme a RAiD ties it together. CASRAI’s persistent-identifier guidance covers the deposit pattern.
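A reported result that follows this pattern might carry both identifiers, as in the sketch below. The record layout, the DOI and the score are invented, and looks_like_swhid is a hypothetical helper that checks only the core syntax of a Software Heritage ID (the prefix swh:1:, an object type, and a 40-character hexadecimal hash); it does not accept optional qualifiers and does not confirm that the object exists in the archive.

```python
import re

# Core SWHID syntax only: swh:1:<object type>:<40 hex characters>.
SWHID_CORE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}")


def looks_like_swhid(value: str) -> bool:
    """Syntactic check of a core SWHID; qualifiers are not handled."""
    return SWHID_CORE.fullmatch(value) is not None


# Hypothetical result record: every value here is a placeholder.
result = {
    "model": "example-model",
    "score": 0.5,
    "benchmark_doi": "10.1234/example-benchmark.v3",  # version-specific DOI
    "harness_swhid": "swh:1:rev:" + "0" * 40,          # pinned harness revision
}

print(looks_like_swhid(result["harness_swhid"]))
```

Recording the version DOI and the harness identifier alongside the score is what lets a later reader say exactly which benchmark version and which code produced the number.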

Contributorship maps onto CRediT with the same caveats discussed for models and datasets. Designing the benchmark task is Conceptualization and Methodology; assembling and annotating the data is Investigation and Data curation; building the harness is Software; running and verifying the evaluation is Validation. As with software papers, the Software role tends to be overloaded — a known limitation of applying CRediT to code-heavy outputs — but the statement is still more informative than an undifferentiated author list.

What to do now

For researchers: deposit benchmarks and evaluation suites as versioned outputs with DataCite DOIs and Software Heritage IDs, document them with the rigour of a datasheet, record the release date and held-out contents explicitly, and attach a CRediT statement. For funders and institutions: recognise system cards, benchmarks, and evaluation suites in CRIS platforms and in assessment as the high-leverage outputs they are, rather than as apparatus. For the field’s collective health, the benchmark you cite should be one you can name a version of.

June 5, 2026