
  • Machine-actionable data management plans: the maDMP comes of age

    The data management plan has a reputation problem. For most of its existence it has been a document written under deadline pressure to satisfy a funder requirement, deposited as a PDF, and then never opened again. It describes intentions that, by the end of a project, may bear little resemblance to what actually happened to the data. The machine-actionable DMP is the response to that failure mode, and after some years of standards work it has come of age. This article explains what it is and why it matters, drawing on CASRAI’s machine-actionable DMPs domain.

    From document to data object

    A data management plan (DMP) is a description of the data-management practices to be followed during and after a research project: what data will be produced, how they will be stored and documented, under what licence and access conditions they will be shared, and how long they will be kept. A machine-actionable DMP (maDMP) is the same content expressed as structured data that research systems can exchange, validate, ingest, and update automatically, rather than as prose only a human can read.

    The distinction is not cosmetic. A prose DMP states that data will be deposited in a trusted repository; a maDMP carries that as a structured assertion that a repository system can read, act on, and later check against what was actually deposited. The DMP stops being a one-time document and becomes a node in the research-information graph, connected to the project, the outputs, the funder, and the people.

    The standard that made it possible: the RDA Common Standard

    Structured exchange requires an agreed structure, and that is the contribution of the RDA DMP Common Standard — the application profile developed by the Research Data Alliance to represent maDMP content in a common, system-neutral form. It defines the entities a DMP describes and the relationships between them, so that a DMP created in one tool means the same thing when read by another.

    The standard’s design encodes a useful distinction the prose form blurs: between an anticipated dataset — a dataset the DMP says will be produced — and a realised dataset, one that has actually been produced and, typically, deposited. A maDMP can carry both, which is precisely what lets a system at closeout check whether the datasets the plan anticipated were in fact realised and deposited. Around these sit the structured fields that prose tends to leave vague: the retention period, the licence assertion, the access control policy, the storage location, and a data volume estimate for storage planning.
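
    A minimal sketch of that closeout check, in Python. The class and field names (`Dataset`, `Plan`, `deposit_pid`) are illustrative assumptions rather than the Common Standard's own schema, and the DOIs are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    """One dataset entry in a plan (hypothetical shape, not the RDA schema)."""
    title: str
    licence: str
    access: str                         # e.g. "open" or "restricted"
    retention_years: int
    deposit_pid: Optional[str] = None   # filled in once the dataset is deposited

    @property
    def realised(self) -> bool:
        # In this sketch a dataset counts as realised once it has a deposit PID.
        return self.deposit_pid is not None


@dataclass
class Plan:
    dmp_id: str
    datasets: list[Dataset] = field(default_factory=list)


def unrealised(plan: Plan) -> list[str]:
    """Titles of datasets the plan anticipated but that have no deposit yet."""
    return [d.title for d in plan.datasets if not d.realised]


plan = Plan(
    dmp_id="https://doi.org/10.1234/example-dmp",  # placeholder DOI
    datasets=[
        Dataset("Survey responses", "CC-BY-4.0", "open", 10,
                deposit_pid="https://doi.org/10.1234/example-data"),
        Dataset("Interview transcripts", "CC-BY-NC-4.0", "restricted", 10),
    ],
)
print(unrealised(plan))
```

    At closeout, a non-empty result is the list of promises the plan made that the repository record does not yet bear out.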

    The DMP ID: giving the plan an identity

    For a DMP to be referenced across systems, it needs an identity, and that is the role of the DMP ID — a persistent identifier for a specific data management plan, typically a DOI minted by DataCite through tools such as the DMPTool, the DCC’s DMPonline, or ARGOS. With a DMP ID, the plan can be cited like any other research object: a funder can refer to it, a CRIS can link to it, an output can point back to the plan that anticipated it, and the connections become part of the persistent-identifier graph alongside ORCID, ROR, and the grant ID. The DMP ID is what turns the DMP from a loose attachment into a first-class, addressable entity in the persistent-identifier ecosystem.

    The living DMP

    The deepest change the maDMP enables is conceptual: the move from the frozen DMP to the living DMP — a plan updated throughout the project lifecycle rather than fixed at award. A frozen DMP is a prediction made at the least-informed moment of a project, before any data exist. A living DMP is a record that tracks reality: as anticipated datasets become realised, as storage decisions change, as access conditions are settled, the plan is updated, and a DMP version captures each snapshot.
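
    One way to picture DMP versions is as an append-only history of immutable snapshots. The sketch below assumes a hypothetical `PlanSnapshot` record with a handful of fields; a real tool would carry far more, and the dates and values are invented for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class PlanSnapshot:
    """One frozen version of a plan (hypothetical fields for illustration)."""
    version: int
    modified: date
    storage_location: str
    realised_datasets: tuple[str, ...] = ()


def update(history: list[PlanSnapshot], when: date, **changes) -> list[PlanSnapshot]:
    """Return a new history with one more snapshot; earlier snapshots are untouched."""
    latest = history[-1]
    new = replace(latest, version=latest.version + 1, modified=when, **changes)
    return history + [new]


history = [PlanSnapshot(1, date(2024, 1, 15), "institutional share")]
history = update(history, date(2024, 9, 1), storage_location="project repository")
history = update(history, date(2025, 3, 10), realised_datasets=("Survey responses",))

for snap in history:
    print(snap.version, snap.modified, snap.storage_location, snap.realised_datasets)
```

    Because each snapshot is frozen, version 1 still records what was promised at award, while the latest version records what is true now.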

    The frozen DMP answers the question “what did the applicant promise at award?” The living maDMP answers a far more useful question: “what is actually happening to this project’s data, right now?” Only the second is worth the effort of maintaining.

    This is where maDMP exchange earns its keep. When the DMP is structured and identified, a change made in one system can propagate — from a DMP tool to a CRIS, from the CRIS to a repository — so that the plan stays current without re-keying. A scheduled DMP review event becomes a checkpoint against live data rather than a re-reading of a stale document, and a DMP completeness score can be computed automatically against the funder’s required elements.
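
    A completeness score of this kind can be as simple as the fraction of required elements that are filled in. The sketch below assumes a flat dictionary for the plan and a hypothetical list of funder-required keys; real funder templates and scoring rules will differ.

```python
# Hypothetical list of elements a funder requires (an assumption for this sketch).
REQUIRED = [
    "retention_period",
    "licence",
    "access_policy",
    "storage_location",
    "volume_estimate",
]


def completeness(plan: dict, required: list[str]) -> float:
    """Fraction of required elements that are present and non-empty."""
    if not required:
        return 1.0
    filled = sum(1 for key in required if plan.get(key) not in (None, "", [], {}))
    return filled / len(required)


plan = {
    "retention_period": "10 years",
    "licence": "CC-BY-4.0",
    "access_policy": "",                       # present but empty: not counted
    "storage_location": "institutional repository",
    # "volume_estimate" is missing entirely
}
print(completeness(plan, REQUIRED))
```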

    Why funders and institutions want this

    The maDMP is not an end in itself; it is wanted because it makes obligations checkable. A funder that requires data to be deposited in a trusted repository under an open licence can, with structured maDMPs, verify that the realised datasets meet the commitment, rather than trusting a final-report paragraph. An institution can monitor data-management compliance across its whole portfolio as a query over structured plans. And the researcher, crucially, benefits too: a living maDMP linked to the project’s outputs means the closeout data-management report is largely assembled already, not reconstructed from memory. This is the same dividend that structured grant and disclosure data pay throughout research administration.

    Where shared vocabulary fits

    The RDA Common Standard supplies the structure, the shape of a maDMP. It does not, on its own, fix the controlled values that populate it: the list of access categories, the licence vocabulary, the dataset-status terms. Two systems can both emit valid Common Standard maDMPs and still disagree on what “restricted access” or “realised” means. That definitional gap, which sits below the structural model, is exactly what a shared, federated vocabulary fills; such a vocabulary defers to the RDA for the standard and to DataCite for the DMP ID infrastructure. Supplying it is the role the CASRAI dictionary is built for.
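
    To make the gap concrete, here is a small sketch in which two hypothetical tools use the same local word with different meanings, and a mapping onto shared terms resolves the difference. The tool names, local terms, and the shared list are all invented for illustration; they are not taken from any published vocabulary.

```python
# Illustrative shared list of access categories (an assumption for this sketch).
SHARED_ACCESS_TERMS = {"open", "restricted", "closed"}

# Hypothetical local vocabularies for two tools, each mapped to the shared terms.
LOCAL_TO_SHARED = {
    "tool_a": {"public": "open", "restricted": "restricted", "private": "closed"},
    "tool_b": {"open": "open", "embargoed": "restricted", "restricted": "closed"},
}


def to_shared(system: str, local_term: str) -> str:
    """Translate a tool's local access term into the shared vocabulary."""
    try:
        shared = LOCAL_TO_SHARED[system][local_term.strip().lower()]
    except KeyError:
        raise ValueError(f"no mapping for {local_term!r} in {system!r}") from None
    if shared not in SHARED_ACCESS_TERMS:
        raise ValueError(f"{shared!r} is not a shared access term")
    return shared


# The same local word resolves to different shared terms in the two tools.
print(to_shared("tool_a", "restricted"))
print(to_shared("tool_b", "restricted"))
```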

    What to do now

    For researchers and data stewards: treat the DMP as a living, structured object with a DMP ID, updated as anticipated datasets become realised. For funders: ask for maDMPs against the RDA Common Standard and verify realised against anticipated at closeout. For standards work: pair the structural standard with shared value vocabularies so that maDMPs from different tools genuinely interoperate.

  • System cards, benchmarks and evaluation suites as first-class research outputs

    A previous article in this series made the case for model cards and datasheets as documented research outputs. That is the well-established end of the spectrum. A wider and faster-moving set of AI/ML artefacts is now emerging as citable scholarship: system cards, evaluation benchmarks, and the evaluation suites that run against them. These are less settled than model cards, but they are arguably more consequential for how the field measures its own progress, and they belong in the conversation about what counts as a research output. CASRAI’s AI/ML research outputs domain treats them as first-class entries.

    From model cards to system cards

    A model card describes a model. A system card describes a deployed system — which is usually a model plus a great deal else: input and output filters, retrieval components, tool-use scaffolding, safety mitigations, usage policies, and the human processes wrapped around the whole thing. The distinction matters because the behaviour a user encounters is a property of the system, not the model in isolation.

    System cards became prominent as the major model developers began publishing them alongside frontier model releases, documenting not just capabilities and evaluations but the risk assessments, red-teaming exercises, and mitigation decisions that shaped the deployment. A system card typically records what the system can do, what failure modes were identified, what was done to mitigate them, and what residual risks remain. It is closer in spirit to a safety case than to a specification sheet.

    For the research record, the significance is that a system card documents decisions, not just facts. It is the contemporaneous account of how a team reasoned about deploying a powerful artefact responsibly. That reasoning is itself a research contribution — and, when the system later behaves in unexpected ways, the system card is the record against which those decisions can be assessed.

    Benchmarks as scholarly infrastructure

    An evaluation benchmark is a standardised dataset paired with a scoring protocol, used to measure and compare the capability of models on a defined task. Benchmarks are not new — they have organised whole subfields of machine learning for decades — but their status as research outputs has historically been ambiguous. A benchmark is often introduced in a paper, but the artefact that matters and gets reused for years afterwards is the dataset-and-protocol, not the paper describing it.

    The honest view is that benchmarks are among the highest-leverage outputs in the field. A widely adopted benchmark shapes what thousands of researchers optimise for; a flawed one steers the field toward the wrong objective. Building a good benchmark — defining the task crisply, assembling a representative dataset, designing a scoring procedure that resists gaming, and documenting all of it — is substantial and genuinely scholarly work. It deserves to be cited and credited as such, not buried as the apparatus of a single paper.

    Evaluation suites and the contamination problem

    An evaluation suite bundles multiple benchmarks and tasks together with the harness that runs them, so that a model release can be assessed consistently and reproducibly. The suite is the executable counterpart to the benchmark: where a benchmark is the standard, the suite is the instrument that applies it.
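
    A toy sketch of that relationship: the benchmarks are data plus a scoring rule, and the harness is the code that applies them to a model. Everything here (the benchmark names, the items, the exact-match scorer, the lookup-table "model") is invented for illustration and is far simpler than a real suite.

```python
from typing import Callable

# Toy benchmarks: name -> list of (prompt, expected answer). Purely illustrative.
SUITE: dict[str, list[tuple[str, str]]] = {
    "arithmetic-v1": [("2+2", "4"), ("3*3", "9")],
    "capitals-v1": [("France", "Paris")],
}


def exact_match(prediction: str, expected: str) -> float:
    """Scoring protocol for this sketch: 1.0 on an exact match, else 0.0."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def run_suite(model: Callable[[str], str], suite=SUITE) -> dict[str, float]:
    """Run every benchmark in the suite and return the mean score per benchmark."""
    return {
        name: sum(exact_match(model(prompt), answer) for prompt, answer in items) / len(items)
        for name, items in suite.items()
    }


# A stand-in "model": a lookup table that gets one arithmetic item wrong.
answers = {"2+2": "4", "3*3": "6", "France": "Paris"}
scores = run_suite(lambda prompt: answers.get(prompt, ""))
print(scores)
```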

    Treating evaluation suites as research outputs surfaces a problem that the field takes seriously: test set contamination. When benchmark data leaks into the training corpus of the models being evaluated, the resulting scores measure memorisation rather than capability, and the benchmark silently loses its validity. Documenting an evaluation suite as a versioned, identifier-bearing output — with a clear record of which data is held out and when it was published — is part of how the community defends against contamination. A benchmark whose provenance and release date are pinned by a persistent identifier can be checked against a model’s training cut-off; an undocumented one cannot.
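
    The date comparison itself is trivial once both dates are on record, as the sketch below suggests. The function name and the dates are assumptions for illustration, and a date check is only a first filter: it cannot detect source material that was already public before the benchmark was assembled.

```python
from datetime import date


def released_before_cutoff(benchmark_released: date, training_cutoff: date) -> bool:
    """True if the benchmark was public on or before the training cut-off,
    meaning leakage into the training corpus cannot be ruled out by dates alone."""
    return benchmark_released <= training_cutoff


benchmark_released = date(2024, 6, 1)  # placeholder release date

# Model trained on data up to the end of 2023: the benchmark postdates it.
print(released_before_cutoff(benchmark_released, date(2023, 12, 31)))

# Model trained on data up to the end of 2024: contamination is possible.
print(released_before_cutoff(benchmark_released, date(2024, 12, 31)))
```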

    Why first-class status matters

    The recurring theme across these artefact types is that the thing that gets reused is not the paper. It is the system card, the benchmark dataset, the evaluation harness. When these are treated as mere supplementary material, three things go wrong:

    • Credit is misallocated. The team that built a benchmark used by the whole field gets a single citation to a paper, while the artefact’s real influence is invisible to any contribution-based assessment.
    • Versioning is lost. Benchmarks and suites evolve — tasks are added, errors corrected, contaminated items removed. Without identifiers and versioning, it becomes impossible to say which version of a benchmark a given result was measured against.
    • Provenance breaks. The contamination defence above depends entirely on knowing exactly what a benchmark contained and when it was released. That is metadata, and metadata needs an identifier to hang on.
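
    The versioning point can be sketched in a few lines: if every reported result carries the identifier of the exact benchmark version it was measured against, grouping by that identifier shows which scores are actually comparable. The `Result` record, the model names, and the DOIs below are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """A reported score tied to one specific benchmark version (illustrative)."""
    model: str
    benchmark_version_doi: str
    score: float


def group_by_version(results: list[Result]) -> dict[str, list[Result]]:
    """Group results so that only scores on the same benchmark version sit together."""
    groups: dict[str, list[Result]] = defaultdict(list)
    for result in results:
        groups[result.benchmark_version_doi].append(result)
    return dict(groups)


results = [
    Result("model-a", "10.1234/bench.v1", 0.71),  # placeholder DOIs
    Result("model-b", "10.1234/bench.v2", 0.69),
    Result("model-c", "10.1234/bench.v2", 0.74),
]
for doi, group in group_by_version(results).items():
    print(doi, [r.model for r in group])
```

    Here model-a's score sits alone: without the version identifier it would have been silently ranked against results measured on a different benchmark.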

    How persistent identifiers and CRediT apply

    The infrastructure pattern mirrors that for model cards and datasets. A benchmark or evaluation suite is deposited in a repository and receives a DataCite DOI; each substantive revision receives its own version DOI, so that results can cite the exact version measured against. The executable harness is pinned by a Software Heritage ID. Contributors carry ORCID iDs, institutions carry ROR IDs, and where the work belongs to a larger programme a RAiD ties it together. CASRAI’s persistent-identifier guidance covers the deposit pattern.

    Contributorship maps onto CRediT with the same caveats discussed for models and datasets. Designing the benchmark task is Conceptualization and Methodology; assembling and annotating the data is Investigation and Data curation; building the harness is Software; running and verifying the evaluation is Validation. As with software papers, the Software role tends to be overloaded — a known limitation of applying CRediT to code-heavy outputs — but the statement is still more informative than an undifferentiated author list.
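
    The mapping described above can be written down as a small lookup, sketched here. The task labels and contributor names are invented for illustration; the role names are the CRediT roles mentioned in the paragraph.

```python
# Task-to-role mapping as described in the text; task labels are illustrative.
TASK_TO_CREDIT = {
    "task design": ["Conceptualization", "Methodology"],
    "data assembly and annotation": ["Investigation", "Data curation"],
    "harness development": ["Software"],
    "evaluation runs": ["Validation"],
}


def credit_statement(contributions: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each contributor's tasks to a sorted, de-duplicated list of CRediT roles."""
    return {
        person: sorted({role for task in tasks for role in TASK_TO_CREDIT[task]})
        for person, tasks in contributions.items()
    }


statement = credit_statement({
    "Contributor A": ["task design", "harness development"],
    "Contributor B": ["data assembly and annotation", "evaluation runs"],
})
for person, roles in statement.items():
    print(f"{person}: {', '.join(roles)}")
```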

    What to do now

    For researchers: deposit benchmarks and evaluation suites as versioned outputs with DataCite DOIs and Software Heritage IDs, document them with the rigour of a datasheet, record the release date and held-out contents explicitly, and attach a CRediT statement. For funders and institutions: recognise system cards, benchmarks, and evaluation suites in CRIS systems and assessment as the high-leverage outputs they are, rather than as apparatus. For the field’s collective health, the benchmark you cite should be one you can name a version of.
