CASRAI Dictionary

Category: Guides & Explainers

Practical how-to guides, templates, checklists, and career pathways for research administrators, authors, and institutional teams.

The DMP tools landscape: comparing DMPTool, DMPonline and Argos

A standard for machine-actionable data management plans is only useful if researchers have tools that put it into practice. Over the past decade three platforms have come to dominate the data-management-planning landscape, each developed and maintained by a significant open-science organisation, and each now working towards interoperability through the same common standard. For an institution choosing how to support its researchers, or a researcher trying to understand the options, it helps to see how DMPTool, DMPonline and Argos compare: where they differ, and what they share. This article surveys that landscape through the machine-actionable DMP domain of the CASRAI Dictionary.

Why dedicated tools exist

It is reasonable to ask why data management planning needs special software at all, when a plan could be written in a word processor. The answer lies in everything a good tool does beyond capturing text. A dedicated DMP platform guides researchers through funder and institutional templates so they answer the right questions; it supplies guidance at the point of need; it allows plans to be shared, reviewed and collaboratively edited; and, increasingly, it exports plans in structured, machine-readable formats so the commitments they contain can be acted on by other systems rather than read once and filed. This last capability — producing a machine-actionable plan rather than a static document — is what distinguishes a modern DMP tool from a template in a folder.

DMPTool

The first of the three, DMPTool, is developed and operated by the California Digital Library. It emerged to help researchers, particularly in the United States, meet the data-management-planning requirements of funders, and it provides funder and institutional templates, tailored guidance and a collaborative environment for producing plans. DMPTool has been a leading voice in the move towards machine-actionable planning, contributing to the development of the standards and infrastructure that allow plans to become connected, living objects rather than text deliverables. Its institutional adoption across many universities has made it a familiar part of the research-support landscape, and its development sits within the broader work of the California Digital Library on open scholarship and research infrastructure.

DMPonline and DMP Roadmap

The second platform, DMPonline, is developed by the Digital Curation Centre, a long-standing centre of expertise in research-data curation. Like DMPTool, it offers funder and institutional templates, embedded guidance and collaborative editing, and it is widely used across the United Kingdom, Europe and beyond. DMPonline and DMPTool are closely related at a deeper level: they share a common open-source codebase known as DMP Roadmap, jointly developed by the two organisations. This shared foundation means the two services have a great deal in common under the surface even as each is tailored to its own community of funders and institutions. The collaboration behind DMP Roadmap is itself a notable feature of the landscape: rather than building competing systems from scratch, two major infrastructures pooled effort into a common platform, which has helped align their approach to machine-actionable planning.

Argos

The third platform, Argos, comes from the European open-science ecosystem and is developed in association with OpenAIRE and EUDAT. Argos was designed from the outset with machine-actionability and openness in mind, and with close integration into the wider European research-infrastructure landscape. It supports the creation of plans against templates and, in keeping with its origins, emphasises producing plans as structured, openly available outputs that connect into the broader graph of European research information. Its provenance in OpenAIRE and EUDAT positions it naturally within an ecosystem oriented towards linking outputs, projects and funding, and it reflects a vision in which the DMP is not an isolated document but a connected node in the research record.

What unites them: the RDA DMP Common Standard

For all their differences in origin and community, the three platforms are converging on a shared foundation for interoperability: the RDA DMP Common Standard, developed through the Research Data Alliance. The common standard defines a shared model and structure for expressing the information a DMP contains, so that a machine-actionable plan can be exported from one system and understood by another. This matters because plans do not live in isolation: a plan created in one tool may need to be read by a funder’s system, harvested into a repository, or connected to the persistent identifiers for the people, projects and outputs it describes. Without a common structure, every such exchange would require bespoke translation. With it, a maDMP exported from DMPTool, DMPonline or Argos can in principle flow into the wider ecosystem and be acted upon. The standard is what turns three separate tools into parts of a connected planning landscape.
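
To make the idea of a shared structure concrete, here is a minimal sketch of a machine-actionable plan as a nested record that is serialised to JSON and read back. The field names follow our reading of the RDA DMP Common Standard's JSON serialisation and cover only a subset of it; every value, and the helper name missing_fields, is invented for illustration. Check the published specification before relying on any of it.

```python
import json

# A subset of top-level fields, named as we understand the RDA DMP Common
# Standard JSON serialisation. Treat this list as an assumption to verify.
EXPECTED_DMP_FIELDS = ("title", "created", "modified", "dmp_id", "contact", "dataset")


def missing_fields(madmp: dict) -> list[str]:
    """Return the fields this sketch expects but the plan does not supply."""
    body = madmp.get("dmp", {})
    return [name for name in EXPECTED_DMP_FIELDS if name not in body]


# A hypothetical plan: identifiers, names and dates are placeholders.
example_plan = {
    "dmp": {
        "title": "Example data management plan",
        "created": "2026-01-15T09:00:00Z",
        "modified": "2026-02-01T14:30:00Z",
        "dmp_id": {"identifier": "https://doi.org/10.0000/example-dmp", "type": "doi"},
        "contact": {
            "name": "A. Researcher",
            "mbox": "a.researcher@example.org",
            "contact_id": {
                "identifier": "https://orcid.org/0000-0000-0000-0000",
                "type": "orcid",
            },
        },
        "dataset": [
            {
                "title": "Survey responses",
                "dataset_id": {"identifier": "example-dataset-1", "type": "other"},
            }
        ],
    }
}

# One tool writes the plan, another reads it: no bespoke translation needed
# as long as both sides agree on the structure.
serialised = json.dumps(example_plan, indent=2)
round_tripped = json.loads(serialised)
print(missing_fields(round_tripped))
```

The point of the sketch is the round trip: the consuming side needs to know only the agreed structure, not which platform produced the plan.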

Choosing between them

For an institution or researcher, the choice often comes down to context rather than a verdict on which platform is best. Existing institutional adoption, the funders one works with, the surrounding national infrastructure and integration with other systems all weigh on the decision. Because all three are moving towards the same common standard, the choice is less consequential than it once was: the goal is interoperable, machine-actionable planning, and each platform is a credible route to it. The decision is one of fit, not of compatibility.

A consistent vocabulary across tools

For plans to move between these platforms and the systems that consume them, the elements they contain (the data types, the licences, the identifiers, the contributor roles) must mean the same thing everywhere. That consistency is what the CASRAI Dictionary provides, complementing the structural interoperability of the RDA standard with shared meaning for the terms that flow through it. And because data management planning is part of the wider research record, the contributions it documents can be described in the same shared framework: the CRediT taxonomy and its full set of contribution roles. For readers who want more detail, our comparison resources set the platforms’ features side by side. The tools differ in origin and emphasis, but they share a destination: planning that machines as well as people can act upon.

June 10, 2026
CRediT in JATS XML: a technical primer for production teams
A contributor-roles statement is only as useful as it is machine-readable. A typesetter can render ‘A.B. wrote the original draft; C.D. supervised’ as a tidy paragraph at the foot of an article, but if that information lives only in prose then no downstream system — a research information system, an indexer, a funder’s reporting tool — can act on it. The point of CRediT, the Contributor Roles Taxonomy, is to make contributions structured, and in scholarly publishing ‘structured’ means encoded in JATS XML. This primer is for the production teams who actually do that encoding: the people for whom ‘add CRediT’ on a project plan turns into concrete decisions about elements, attributes and controlled vocabularies. The authoritative tag-level guidance is set out in the CRediT in JATS reference and the broader JATS implementation notes.

Where contributor roles live in JATS

JATS (the Journal Article Tag Suite, the NISO Z39.96 standard) models people in the <contrib-group> element. Each named individual is a <contrib>, carrying their name, affiliations and identifiers. The element that carries a contributor’s function is <role>, nested inside the relevant <contrib>. A single contributor may hold several roles, so multiple <role> elements per <contrib> are expected and entirely valid — one person might legitimately be tagged for Conceptualization, Methodology and Writing – review & editing.

The job of a production team is to make those <role> elements unambiguous. Free-text role labels are not enough, because ‘wrote the paper’ and ‘drafting’ and ‘Writing – original draft’ are the same role expressed three ways. CRediT solves this by giving each of its roles a stable definition and a canonical identifier, and JATS provides the attributes to point at them.

The JATS4R recommendation for encoding CRediT

JATS4R (JATS for Reuse) is the community group that publishes interoperability recommendations for ambiguous corners of the standard, and it has a specific recommendation for CRediT. The core of it is that a <role> element used for a CRediT contribution should declare the vocabulary it draws from and the specific term within it. In practice this means four attributes work together:
- vocab: identifies the controlled vocabulary as CRediT;
- vocab-identifier: gives the URI of the taxonomy itself, so a consuming system can resolve what vocabulary is being used;
- vocab-term: gives the exact CRediT term string;
- vocab-term-identifier: gives the canonical URI of that term, so the role resolves to one and only one CRediT definition.
The human-readable label remains the text content of the <role> element — that is what a reader sees — while the attributes carry the machine meaning. The recommendation is deliberate that the visible text and the term identifier must agree: do not tag a <role> as Data curation in its attributes while the visible text reads ‘Formal analysis’. JATS4R also advises using the official CRediT term strings verbatim rather than house variants, because verbatim strings are what validators and aggregators expect to match.
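
As an illustration of how the attributes and the visible label fit together, the sketch below builds a <contrib> carrying two CRediT roles with Python's standard-library ElementTree. The URIs and the lowercase vocab value reflect our understanding of the JATS4R recommendation and the NISO CRediT site and should be checked against both; the helper name credit_role and the contributor are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed URIs: confirm against the current JATS4R recommendation and the
# NISO CRediT site before using them in production.
CREDIT_VOCAB_URI = "https://credit.niso.org/"
CREDIT_TERMS = {
    "Writing – original draft":
        "https://credit.niso.org/contributor-roles/writing-original-draft/",
    "Supervision":
        "https://credit.niso.org/contributor-roles/supervision/",
}


def credit_role(term: str) -> ET.Element:
    """Build a <role> whose attributes and visible text name the same term."""
    role = ET.Element("role", {
        "vocab": "credit",
        "vocab-identifier": CREDIT_VOCAB_URI,
        "vocab-term": term,
        # A house variant such as "Writing (review and editing)" is not a
        # key in CREDIT_TERMS, so it fails here with a KeyError.
        "vocab-term-identifier": CREDIT_TERMS[term],
    })
    role.text = term  # the human-readable label agrees with vocab-term
    return role


# A hypothetical contributor holding two roles.
contrib = ET.Element("contrib", {"contrib-type": "author"})
name = ET.SubElement(contrib, "name")
ET.SubElement(name, "surname").text = "Author"
ET.SubElement(name, "given-names").text = "A. B."
contrib.append(credit_role("Writing – original draft"))
contrib.append(credit_role("Supervision"))

print(ET.tostring(contrib, encoding="unicode"))
```

Deriving both the attribute and the text from one term string is a simple way to make the two impossible to disagree.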

Degrees of contribution

CRediT permits, but does not require, a statement of the degree of a contribution, for example marking one contributor as having led a given role. JATS expresses this through an additional attribute on the <role> element (degree-contribution, in JATS 1.2 and later) rather than by changing the term identifier. Production teams should treat degree as optional metadata that is encoded only when the manuscript actually supplies it; inventing a lead/equal distinction where the authors stated none is a data-quality error, not an enhancement. When degree information is present, keep it consistent across the article so that a reader and a parser draw the same conclusion.

Common production pitfalls

Several mistakes recur often enough to be worth naming. The first is putting CRediT roles in the wrong place — bundling them into an unstructured author-contributions paragraph in the article body instead of, or in addition to, the structured <role> elements. The structured encoding is the one machines read; a prose paragraph is a courtesy to humans, not a substitute. The second is omitting vocab-identifier and vocab-term-identifier, which leaves the role as plain text that cannot be reliably disambiguated. The third is term drift: lightly edited labels such as ‘Writing (review and editing)’ that no longer match the canonical CRediT string and therefore fail automated checks.

A subtler issue is association: every <role> must sit inside the correct <contrib>. In articles with long author lists it is easy for a role to be attached to the wrong person during conversion, especially when contributions are supplied as a separate table that a typesetter merges by hand. Validating that each role resolves to the intended contributor is as important as validating that the term identifiers are correct.
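
A production QA step along these lines can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than a JATS4R validator: it checks that each <role> carries the four vocabulary attributes, that the visible text matches vocab-term, and that the roles tagged on each contributor match a submitted contribution matrix. Matching contributors by surname is a simplification, and the sample XML and function names are invented.

```python
import xml.etree.ElementTree as ET

REQUIRED_ATTRS = ("vocab", "vocab-identifier", "vocab-term", "vocab-term-identifier")


def attribute_problems(root: ET.Element) -> list[str]:
    """Flag roles that lack an attribute or whose text disagrees with vocab-term."""
    problems = []
    for contrib in root.iter("contrib"):
        surname = contrib.findtext("name/surname", default="(unnamed)")
        for role in contrib.findall("role"):
            label = (role.text or "").strip()
            missing = [attr for attr in REQUIRED_ATTRS if not role.get(attr)]
            if missing:
                problems.append(f"{surname}: '{label}' lacks {', '.join(missing)}")
            elif role.get("vocab-term") != label:
                problems.append(f"{surname}: text '{label}' disagrees with vocab-term")
    return problems


def association_problems(root: ET.Element, expected: dict[str, set[str]]) -> list[str]:
    """Compare the roles tagged on each contributor with the submitted matrix."""
    problems = []
    for contrib in root.iter("contrib"):
        surname = contrib.findtext("name/surname", default="(unnamed)")
        tagged = {(role.text or "").strip() for role in contrib.findall("role")}
        wanted = expected.get(surname, set())
        if tagged != wanted:
            problems.append(f"{surname}: tagged {sorted(tagged)}, matrix says {sorted(wanted)}")
    return problems


# Invented sample: Alpha's visible text contradicts the attributes, and
# Beta's role is plain text with no vocabulary attributes at all.
SAMPLE = """<contrib-group>
  <contrib contrib-type="author">
    <name><surname>Alpha</surname><given-names>A.</given-names></name>
    <role vocab="credit" vocab-identifier="https://credit.niso.org/"
          vocab-term="Data curation"
          vocab-term-identifier="https://credit.niso.org/contributor-roles/data-curation/">Formal analysis</role>
  </contrib>
  <contrib contrib-type="author">
    <name><surname>Beta</surname><given-names>B.</given-names></name>
    <role>Supervision</role>
  </contrib>
</contrib-group>"""

root = ET.fromstring(SAMPLE)
matrix = {"Alpha": {"Data curation"}, "Beta": {"Supervision"}}  # as submitted
for line in attribute_problems(root) + association_problems(root, matrix):
    print(line)
```

In a real workflow the contributor key would more likely be an ORCID iD or a contributor id than a surname, and the matrix would come from the submission system.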

Building it into the workflow

The practical recommendation is to capture CRediT as structured data as early as possible — ideally at submission, where many manuscript systems now collect a contribution matrix — and to carry that structure through conversion rather than reconstructing it from prose at the typesetting stage. Round-trip validation against the JATS4R recommendation should be part of the production QA step, alongside the schema validation a publisher already runs. Treating contributor roles as first-class structured metadata, governed by the definitions in the research information systems domain of the CASRAI Dictionary, is what allows contribution data to survive intact all the way to the version of record and beyond.
June 9, 2026
Co-first authorship and equal contribution: marking shared credit correctly
Two researchers do roughly equal amounts of the central work on a paper, but only one name can physically come first on the author line. This is now an everyday situation in team science, and the conventional response is to declare the two authors equal contributors. Yet that declaration is recorded in many different ways, some of which barely survive indexing, and the result is that genuinely shared credit is frequently lost when it matters most: when a hiring or promotion committee reads the author line. This article sets out how to mark shared credit correctly, building on the conventions for author order and the CRediT role definitions.

What “equal contribution” is claiming

In most experimental and biomedical fields, position on the author line is information, not decoration. By widespread convention the first author did the bulk of the hands-on work and led the writing; the last author is the senior supervising figure. A co-first or equal-contribution designation is a deliberate intervention against that convention: it asserts that two (occasionally more) people share the leading-author role even though the linear author line can only print them one after another. The claim is specifically about leadership of the work, and it should be reserved for cases where it is genuinely true — not used as a courtesy to soften the awkwardness of ordering.

It is worth being clear that equal contribution is field-specific. In mathematics, economics, and much of the humanities, authors are listed alphabetically and order carries no contribution signal at all, so an equal-contribution note is redundant. The designation does real work only where order is otherwise read as a ranking.

The three places shared credit gets recorded

Shared first authorship can be expressed through three distinct mechanisms, and the strongest practice uses them together rather than relying on any one.

1. The author-line note

The equal-contribution symbol is a superscript character placed against two or more names on the author line — most commonly a dagger (†) or an asterisk (*) — resolving to a footnote that reads “These authors contributed equally to this work.” This is the human-readable signal a reader sees on the page. Its weakness is that it is presentational: the symbol and its note are not reliably captured as structured metadata, so a system harvesting the author list may record the two authors in their printed order and silently drop the equality. That is precisely how co-first status disappears downstream.

2. The contribution statement, using the degree qualifier

This is where a contribution taxonomy earns its place. The CRediT taxonomy supports an optional degree-of-contribution qualifier on every role assignment: lead, equal, or supporting. It is not a percentage and it does not weigh one role against another; it simply distinguishes who led a role from who shared or supported it. To record co-first authorship honestly, mark the relevant leading roles — typically Conceptualization, Investigation, Formal analysis, and Writing – original draft — as equal for both authors:

Author A: Conceptualization (equal), Investigation (equal), Writing – original draft (equal). Author B: Conceptualization (equal), Investigation (equal), Writing – original draft (equal).

This carries far more information than a footnote. It says which parts of the work were shared, and it does so in a form that can travel into structured systems. The qualifier is widely available in publisher submission systems, though rarely required, so you usually have to choose to use it.
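
To show what "a form that can travel" might look like, the following sketch holds each role assignment as a small record with an optional degree, renders the prose statement from it, and answers which roles were shared equally. The class and function names are hypothetical, and this is one possible in-memory model rather than any publisher's schema.

```python
from dataclasses import dataclass
from typing import Optional

DEGREES = ("lead", "equal", "supporting")


@dataclass(frozen=True)
class Assignment:
    author: str
    role: str                     # a CRediT term string, used verbatim
    degree: Optional[str] = None  # None when the authors stated no degree

    def __post_init__(self) -> None:
        if self.degree is not None and self.degree not in DEGREES:
            raise ValueError(f"unknown degree: {self.degree!r}")


def render_statement(assignments: list[Assignment]) -> str:
    """Render the assignments as a prose statement, grouped by author."""
    by_author: dict[str, list[str]] = {}
    for item in assignments:
        label = item.role if item.degree is None else f"{item.role} ({item.degree})"
        by_author.setdefault(item.author, []).append(label)
    return " ".join(f"{author}: {', '.join(labels)}." for author, labels in by_author.items())


def equally_shared(assignments: list[Assignment]) -> dict[str, list[str]]:
    """Map each role to the authors marked equal on it, if two or more share it."""
    shared: dict[str, list[str]] = {}
    for item in assignments:
        if item.degree == "equal":
            shared.setdefault(item.role, []).append(item.author)
    return {role: authors for role, authors in shared.items() if len(authors) >= 2}


LEADING_ROLES = ["Conceptualization", "Investigation", "Writing – original draft"]
assignments = [Assignment(author, role, "equal")
               for author in ("Author A", "Author B")
               for role in LEADING_ROLES]

print(render_statement(assignments))
print(equally_shared(assignments))
```

The prose statement is derived from the structured records, not the other way round, which is what lets the equality survive once the footnote has been dropped.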

3. Order-neutral display where the venue allows

A growing number of venues let authors indicate that the printed order of co-first authors may be swapped on individual CVs — the “authors may list their name first” convention. Where offered, this is a sensible complement to the two mechanisms above, because it acknowledges directly that the linear order does not encode a ranking between the equal contributors.

A method for marking it correctly

1. Confirm the claim is true. Equal contribution means the leading work was genuinely shared. If one person clearly led, say so with lead and supporting rather than reaching for equal.
2. Decide the printed order on a transparent basis. Something has to come first. Agree the basis openly — alphabetical, coin-toss, or rotation across the group’s papers — and record that the order is not a ranking.
3. Add the author-line note so a human reader sees the equality at a glance.
4. Encode it in the CRediT statement with the equal qualifier on the shared roles, so the claim survives as structured data rather than as a presentational footnote.
5. Have every named author confirm their own line before submission. Shared-credit claims are exactly where unconfirmed assumptions cause later disputes.

Common mistakes

- Relying on the footnote alone. A dagger and a note are fragile. Without the structured qualifier, the equality often does not survive into the systems that later read the author list.
- Using “equal” to avoid an honest conversation. Declaring everyone equal because ordering is uncomfortable devalues the designation and misrepresents the work.
- Confusing equal contribution with author order generally. CRediT records what each person did; it does not set author order, which remains a separate decision governed by your field’s conventions.
- Forgetting the corresponding-author role. Corresponding authorship is a distinct responsibility and can sit with any author, including one of the co-first authors; settle it explicitly.

Where shared vocabulary fits

“Co-first”, “joint first”, “equal contribution”, and “shared senior author” are used loosely and recorded inconsistently across venues, which is exactly why the credit so often fails to travel. A shared, federated vocabulary that defines these designations precisely, and points back to NISO for the CRediT standard and its degree qualifier, is what lets an equal-contribution claim mean the same thing wherever it is read. Supplying that definitional layer is the role the CASRAI Dictionary is designed to play; the relevant terms sit in the CRediT extensions domain.

June 9, 2026
Keeping your ORCID record current: a maintenance guide for researchers
Registering for an ORCID iD takes about two minutes. Keeping the record behind it accurate is where most researchers fall down, and an out-of-date ORCID record quietly undermines the very thing the identifier is meant to do. The good news is that, with a few settings configured once, ORCID will keep much of your record current for you: the work is far more about permissions than about manual data entry. This guide explains how. For the background on what the identifier is and why it matters, start with the persistent-identifiers guidance for authors and the explainer on what an ORCID iD is; this article assumes you already have one and want to keep it healthy.

The two kinds of data on an ORCID record

The single most useful thing to understand about ORCID is that not all the information on a record is equal. ORCID distinguishes between data that you have typed in yourself and data that a trusted organisation has asserted about you.
- Self-asserted data is anything you add by hand — an affiliation you typed, a paper you entered manually. It is useful, but a reader cannot tell whether it is verified.
- Validated assertions are added by a trusted organisation through ORCID’s API — your university confirming an employment, a publisher confirming you authored a paper, a funder confirming you hold a grant. These carry the source of the assertion, so anyone reading the record can see that the affiliation came from the institution itself, not just from your own claim.
A record full of validated assertions is dramatically more trustworthy — and more useful to funders and hiring committees — than one you have populated entirely by hand. The goal of good ORCID maintenance is therefore to let trusted organisations do as much of the asserting as possible.
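
The distinction can be pictured as a question asked of each item: who is its source? The sketch below uses a deliberately simplified, hypothetical model of a record, not the ORCID API schema, in which every item carries the identifier or name of whoever added it. The iD shown is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class RecordItem:
    kind: str    # e.g. "employment", "work", "education"
    title: str
    source: str  # hypothetical field: the iD or organisation that added the item


def is_self_asserted(item: RecordItem, owner_id: str) -> bool:
    """An item is self-asserted when the record owner is its source."""
    return item.source == owner_id


def validated_share(items: list[RecordItem], owner_id: str) -> float:
    """Fraction of items asserted by someone other than the record owner."""
    if not items:
        return 0.0
    validated = [item for item in items if not is_self_asserted(item, owner_id)]
    return len(validated) / len(items)


OWNER = "0000-0000-0000-0000"  # placeholder iD, not a real record
record = [
    RecordItem("employment", "Lecturer, Example University", source="Example University"),
    RecordItem("work", "A paper deposited by its publisher", source="Crossref"),
    RecordItem("work", "An older paper typed in by hand", source=OWNER),
    RecordItem("education", "PhD, Example University", source=OWNER),
]

print(validated_share(record, OWNER))
```

Raising that fraction, by letting organisations assert what they can, is the practical aim of the maintenance steps that follow.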

Turn on auto-update

The highest-value setting is auto-update. When you connect your ORCID iD to Crossref and DataCite — a one-time authorisation — new works that are deposited with your ORCID iD attached are added to your record automatically. In practice this means that when you publish a paper and the publisher includes your ORCID iD in the Crossref deposit, the paper appears on your ORCID record without you doing anything, and it appears as a validated assertion sourced from the registration agency.

The condition is simple but easy to miss: the publisher has to actually collect and deposit your ORCID iD. That is why you should always supply your ORCID iD during submission, and ideally sign in with it rather than typing it, so that the iD is authenticated. An authenticated iD attached at submission is what makes the whole auto-update chain work. Connect once, supply your iD every time, and your publication list largely maintains itself.

Manage your trusted organisations and trusted individuals

Auto-update is one instance of a broader mechanism: trusted parties. ORCID lets you grant two kinds of trust:
- Trusted organisations — institutions, funders, publishers, and systems you authorise to read from or write to your record through the API. Your university’s research-information system, for example, can be a trusted organisation that adds your validated employment affiliation and pushes your institutional outputs onto your record.
- Trusted individuals — a person, such as a research administrator or an assistant, whom you authorise to manage your record on your behalf. This is useful for senior researchers who would rather delegate the upkeep.
Both are managed under the Trusted parties section of your account settings, and both are fully revocable. Granting access does not hand over your password; it grants a scoped, auditable permission that you can withdraw at any time. Reviewing this list once or twice a year — confirming the organisations you expect are there, and revoking any you no longer deal with — is the core maintenance habit.

Set your visibility deliberately

Every item on an ORCID record has a visibility setting: everyone, trusted parties only, or only me. The default for new items can be configured in your account. For the record to be useful to the systems that consume it — funders checking your outputs, journals verifying your identity, your CRIS pulling your profile — the key items generally need to be public. A common and self-defeating mistake is to register an iD, set everything to private, and then wonder why the identifier seems to do nothing. As a rule, make your name, affiliations, and outputs public, and reserve restricted visibility for things you genuinely want kept back.

A short maintenance routine

1. Connect to Crossref and DataCite auto-update once. This is the single highest-leverage action; it keeps your works current automatically.
2. Always supply your authenticated ORCID iD at submission — for papers, datasets, software, and grant applications — so that each output and award can be asserted onto your record.
3. Authorise your institution as a trusted organisation so that your employment and institutional outputs arrive as validated assertions.
4. Review your trusted parties annually and revoke any you no longer use.
5. Add the things no one else will assert — education, professional memberships, peer-review and editorial service, older works that predate ORCID — by hand, since these often have no organisation to assert them for you.
6. Check your visibility settings so that the items you want discoverable are actually public.

Why a current record pays off

Beyond convenience, an accurate ORCID record increasingly does real work on your behalf. Funders draw on it for applications and reporting; narrative-CV and biosketch tools pull from it; institutional systems reconcile your outputs against it. A record rich in validated assertions lets you make precise, checkable claims about your contribution history — including, where publishers deposit them, your CRediT roles per paper, so that “I led the analysis on these studies” becomes a verifiable statement rather than an assertion on a CV. The effort is front-loaded into a handful of one-time settings; the payoff compounds across every later application and assessment.

Where shared vocabulary fits

“Auto-update”, “trusted party”, “validated assertion”, “source”, and “self-asserted” are ORCID-specific terms that are easy to muddle, and confusion about them is exactly why so many records go stale. A shared, federated vocabulary that defines these terms precisely is what lets guidance from one institution be understood at another. Supplying that definitional layer is part of the role the CASRAI Dictionary is designed to play.

June 6, 2026
Disclosing generative AI use in research: what to declare and where
Two or three years ago, declaring the use of a generative AI tool in a manuscript was an unusual courtesy. Today it is a baseline expectation, written into the author instructions of most major publishers and the recommendations of the bodies that set publishing norms. Yet the question authors most often ask is disarmingly practical: what exactly do I have to declare, and where does the declaration go? This article sets out a clear answer, drawing on the vocabulary being developed in the generative AI use and disclosure domain.

The two settled principles

Underneath the variation between publishers, two principles have hardened into near-consensus, and they are the right place to start.

The first is that a generative AI system cannot be an author. The ICMJE recommendations, and parallel statements from COPE, Nature, Science, and the major university presses, are explicit on this point: authorship entails accountability for the work, and a tool cannot be accountable. AI use is therefore disclosed as a method or a tool, never as a contributor on the author line. This connects directly to the broader account of authorship as a matter of responsibility, not merely of having touched the text.

The second is that the human authors remain fully responsible for everything the manuscript asserts, including anything an AI system produced. A fabricated citation, a misstated statistic, or a plausible-but-wrong sentence is the authors’ error regardless of which tool generated it. Disclosure does not transfer responsibility; it makes the workflow transparent so that responsibility can be located.

What counts as disclosable use

The harder question is the threshold. Not every interaction with a computational tool is a disclosable use of generative AI, and policies generally exempt the trivial. The useful distinction is whether the tool produced novel content that materially shaped the published work.
- AI-assisted writing — where a generative system drafted, restructured, summarised, or substantively edited text whose output shaped the published wording — is disclosable. A generative AI tool is, in the working definition, a system that produces novel text, code, image, or other media from a prompt, typically using a large neural network.
- AI-assisted analysis — using a model to perform or shape a data-analysis step, including exploratory analysis or hypothesis generation — is disclosable as part of the methods.
- AI-generated code that forms part of the research, and AI-generated images in a manuscript, are disclosable, the latter often under stricter rules because of the integrity risks around figures.
By contrast, most policies define an exempt category of AI use for tools that do not produce novel content: a spell-checker, a grammar corrector, a reference manager, or basic translation of the author’s own words. Author-written text whose grammar was tidied by an AI checker is not, in this sense, AI-assisted writing. The line is not always crisp, because substantive rewriting shades into drafting, and when in doubt the safe practice is to disclose.

Where the declaration belongs

Knowing what to declare is half the problem; the other half is placement, and here practice has converged on a small set of locations.

The dominant convention is a dedicated AI use disclosure statement in the manuscript: a short declaration that names the system, says where in the workflow it was used, and indicates the extent of that use. “Which tool, where, and how much” is the durable shape of a good statement. Many journals place this in the methods section when the use was analytical, and in a distinct acknowledgements-adjacent statement when the use was in writing.

A useful test for a disclosure statement: a reader should be able to tell, from the statement alone, which parts of the work involved a generative system and what the authors did to verify its output. A generic line that an AI tool was “used to improve readability” fails this test; it names neither the tool nor the boundary of its use.

Two adjacent practices strengthen the statement. The first is recording a model selection rationale and, where relevant, the prompt engineering that produced reliable outputs — material that belongs in supplementary methods for analytical uses, because it bears on reproducibility. The second is naming the AI tool provider at the organisational level, so that the disclosure points at an identifiable system rather than a generic category.

Why structured disclosure, not just prose

A free-text paragraph at the end of a manuscript is where most disclosures live today, and it is better than nothing. But prose disclosure has the same weakness that prose contribution statements have: it does not travel as data. A structured representation — naming the tool, the workflow stage, the extent, and the verification step as discrete, machine-readable fields — lets downstream systems index, audit, and aggregate AI use across the literature. That is the difference between a sentence a human must read and a record a system can act on, and it is the gap a controlled vocabulary is meant to close. The parallel with structured contribution metadata in CRediT is exact: a settled human-readable form, waiting on consistent machine-readable plumbing.
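
As a rough illustration of what such a structured representation could look like, the sketch below models a disclosure as a record with one field for each of the four elements named above, plus the tool provider, and reports which fields a vague statement leaves blank. The field names, the tool and the provider are all hypothetical; no publisher or standards body has specified this schema.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class AIDisclosure:
    tool: str            # which system was used
    provider: str        # the organisation behind it
    workflow_stage: str  # where in the workflow it was used
    extent: str          # how much of the work it touched
    verification: str    # what the authors did to check the output


def blank_fields(disclosure: AIDisclosure) -> list[str]:
    """List the fields a disclosure leaves empty."""
    return [f.name for f in fields(disclosure)
            if not getattr(disclosure, f.name).strip()]


# Invented examples: "ExampleLM" and "Example AI Ltd" are placeholders.
complete = AIDisclosure(
    tool="ExampleLM",
    provider="Example AI Ltd",
    workflow_stage="drafting of the introduction",
    extent="first draft of two paragraphs, rewritten by the authors",
    verification="all claims and citations checked against sources by the authors",
)
vague = AIDisclosure(
    tool="", provider="", workflow_stage="",
    extent="used to improve readability", verification="",
)

print(json.dumps(asdict(complete), indent=2))
print(blank_fields(vague))
```

A record like this can be indexed and aggregated field by field, whereas the generic readability line fills only one slot and leaves a reader guessing at the rest.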

The role for shared vocabulary

Publishers’ AI policies differ in wording, in threshold, and in placement, which means a disclosure written for one journal does not necessarily mean the same thing when read by another system. What is missing is not more policy, since the principles are settled, but a shared definitional layer: agreed terms for AI-assisted writing, AI-assisted analysis, exempt category, and the rest, so that a disclosure carries the same meaning wherever it is read. Supplying that layer, federating to ICMJE and COPE for the normative content rather than inventing it, is the convening role the CASRAI Dictionary is built for. The practical guidance is collected under AI disclosure for authors.

What to do now

For authors: disclose any use that produced novel content shaping the work, name the tool and the workflow stage, and state that you verified the output. For editors: specify where the statement goes and ask for structured fields, not just a paragraph. For standards work: prioritise shared definitions of the disclosable categories and the exempt threshold, so disclosures mean the same thing across venues.

June 5, 2026
What a CRIS does: the research-information backbone explained
Most universities run a system that quietly underpins a great deal of their research administration, and most researchers could not name it. It is the Current Research Information System (CRIS) — the institutional backbone that ties together who the researchers are, what projects they run, who funds them, and what they produce. This article gives a plain-language account of what a CRIS does, why it matters, and why it depends so heavily on shared vocabulary. It draws on the research-information systems domain.

CRIS and RIM: the system and the function

Two terms travel together and are easily confused. A CRIS is the software system. Research Information Management (RIM) is the broader discipline and practice of managing research information — the function that the CRIS supports. RIM is what a research office does; the CRIS is the tool it uses to do it. Both terms appear because the same activity is described from two angles: the operational system and the professional practice. Familiar CRIS products include Pure, Symplectic Elements, Converis, Worktribe, and the open-source VIVO and DSpace-CRIS.

What a CRIS actually holds

A CRIS is, at heart, a set of connected records about a handful of entity types and the relationships between them. The core entities are people, organisational units, projects, funding, and outputs. The value is in the connections: this researcher, in this department, leads this project, funded by this award, which produced these publications and datasets. Each entity is a record; the CRIS is the graph that joins them.

The researcher profile is the entity most people encounter. It aggregates a person’s affiliations, outputs, projects, and activities into a single record — the thing that often surfaces as a public staff page. Behind it sits an organisational hierarchy: the structured representation of departments, schools, institutes, and centres, so that the system can roll outputs and funding up to any level of the institution. The quality of that hierarchy determines whether “how much did the School of Engineering publish last year?” is a one-click query or a week of manual work.
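The roll-up described above can be sketched in a few lines, assuming the hierarchy is stored as a simple child-to-parent mapping and contains no cycles. The unit names, output identifiers and function names are invented for illustration; a real CRIS will model this differently.

```python
from collections import Counter

# Hypothetical hierarchy: each unit maps to its parent unit.
PARENT_OF = {
    "Department of Civil Engineering": "School of Engineering",
    "Department of Mechanical Engineering": "School of Engineering",
    "School of Engineering": "University",
    "School of Arts": "University",
}

# Hypothetical outputs, each attached to the unit that produced it.
OUTPUTS = [
    ("output-1", "Department of Civil Engineering"),
    ("output-2", "Department of Mechanical Engineering"),
    ("output-3", "Department of Mechanical Engineering"),
    ("output-4", "School of Arts"),
]


def unit_and_ancestors(unit, parent_of):
    """Return the unit followed by every unit above it in the hierarchy."""
    chain = [unit]
    while unit in parent_of:
        unit = parent_of[unit]
        chain.append(unit)
    return chain


def roll_up(outputs, parent_of):
    """Count outputs at every level of the hierarchy."""
    counts = Counter()
    for _output_id, unit in outputs:
        counts.update(unit_and_ancestors(unit, parent_of))
    return counts


totals = roll_up(OUTPUTS, PARENT_OF)
print(totals["School of Engineering"])  # 3
```

If a department is missing from the mapping, its outputs never reach the school-level total, which is the week of manual work in miniature.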

The core job: getting data in

A CRIS is only as useful as the data in it, and the central operational challenge is keeping that data current without burying researchers in data entry. Two mechanisms do most of the work. A publication harvest automatically imports publication metadata from external sources — Crossref, Scopus, Web of Science, PubMed, ORCID — so that a researcher’s output list populates itself rather than being typed in. A funder ingest imports funding and award metadata, so that grants appear against the right people and projects.

Neither mechanism is reliable without identifiers. A publication harvest that matches on author name alone will confuse researchers who share a name, assigning one person’s work to another; matching on ORCID iD resolves the person unambiguously. A funder ingest that matches on institution name will fragment one university across a dozen spelling variants; matching on ROR ID collapses them to one. This is why the maturation of the persistent-identifier ecosystem has done more for CRIS data quality than any feature in the software itself.
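The difference between the two matching strategies can be shown with a toy example. The researcher records, the harvested record and the ORCID iDs below are placeholders invented for illustration, and the function names are assumptions rather than any product’s API.

```python
# Two hypothetical researchers who share a name (ORCID iDs are placeholders).
RESEARCHERS = [
    {"id": "person-1", "name": "J. Smith", "orcid": "0000-0000-0000-0001"},
    {"id": "person-2", "name": "J. Smith", "orcid": "0000-0000-0000-0002"},
]

# A harvested publication record naming one author.
HARVESTED = {
    "title": "An example article",
    "author_name": "J. Smith",
    "author_orcid": "0000-0000-0000-0002",
}


def match_by_name(record, researchers):
    """Return every researcher whose name equals the harvested author name."""
    return [r["id"] for r in researchers if r["name"] == record["author_name"]]


def match_by_orcid(record, researchers):
    """Return the researcher whose ORCID iD equals the harvested one, if any."""
    orcid = record.get("author_orcid")
    if not orcid:
        return []
    return [r["id"] for r in researchers if r["orcid"] == orcid]


print(match_by_name(HARVESTED, RESEARCHERS))   # two candidates: ambiguous
print(match_by_orcid(HARVESTED, RESEARCHERS))  # one candidate: resolved
```

The name match returns two candidates and leaves a human to choose; the identifier match returns exactly one.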

Disambiguation, enrichment, validation

Three less-visible activities determine whether a CRIS is trusted. Disambiguation is the process of resolving ambiguous identifications — two authors with the same name, two spellings of one organisation — to canonical entities. Enriched metadata is metadata improved with information from external sources: adding Crossref Funder Registry IDs to funding records, adding ROR IDs to affiliations, adding DOIs to outputs that arrived without them. A validation rule is a check applied during ingest to enforce data quality — rejecting a publication record with no identifier, flagging an award whose dates fall outside its project. Together these turn a heap of imported records into a research-information asset an institution can report from with confidence.
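The two validation rules mentioned above might be sketched as follows. The record layout and the field names (`doi`, `pmid`, `start`, `end`) are assumptions made for the example, not a standard ingest schema.

```python
from datetime import date


def validate_publication(pub):
    """Reject a publication record that carries no identifier."""
    if not (pub.get("doi") or pub.get("pmid")):
        return ["rejected: publication has no identifier"]
    return []


def validate_award(award, project):
    """Flag an award whose dates fall outside its project."""
    if award["start"] < project["start"] or award["end"] > project["end"]:
        return ["flagged: award dates fall outside project dates"]
    return []


project = {"start": date(2024, 1, 1), "end": date(2026, 12, 31)}
award_ok = {"start": date(2024, 3, 1), "end": date(2026, 6, 30)}
award_late = {"start": date(2024, 3, 1), "end": date(2027, 6, 30)}

print(validate_publication({"title": "Untitled"}))  # rejected
print(validate_award(award_ok, project))            # no problems
print(validate_award(award_late, project))          # flagged
```

Returning a list of messages rather than raising an error lets an ingest job collect every problem in a batch and report them together.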

What the CRIS is for

Institutions invest in a CRIS because the same research-information facts are needed, repeatedly, for many different purposes. Annual reporting, research assessment exercises, open-access compliance monitoring, public staff and project pages, internal resource allocation, and responses to funder audits all draw on the same underlying entities. Without a CRIS, each of these is a separate data-gathering exercise; with one, they are views over a single maintained graph. The CRIS is the institution’s single source of truth for research information, and its value depends directly on how trustworthy that source is.

This is also why a CRIS connects outward. It is not an island: it harvests from Crossref and ORCID, it can push validated publications to a repository, it feeds open-access compliance dashboards, and increasingly it exchanges project information using shared models. A modern CRIS is a node in an institutional and sectoral information fabric, not a closed database.

Why shared vocabulary is the precondition

Here is the catch that connects the CRIS to CASRAI’s mission. Every CRIS implementation that invents its own field names — its own way of recording an ethics status, an output type, a project phase, a funding category — creates a system that cannot exchange data cleanly with any other. The harvests work because Crossref, ORCID, and ROR provide shared identifiers and shared metadata. The internal records often do not interoperate, because each institution structured them locally. A controlled, shared vocabulary for the entities and attributes a CRIS holds is what would let research information move between institutions as cleanly as it now moves in from the identifier providers. Supplying that definitional layer is the convening role the CASRAI dictionary exists to play.

What to do now

For institutions running a CRIS: invest in identifiers first — ORCID and ROR adoption do more for data quality than any feature. Treat disambiguation, enrichment, and validation as ongoing operations, not one-off projects. For those procuring or integrating systems: use vendor-neutral, shared vocabulary to specify what you need, so the conversation is about your requirements rather than one product’s field names.

June 4, 2026
ISBN, ISSN and identifiers for books and journals
The persistent identifier most familiar to today’s researchers is the DOI, attached to journal articles, datasets and a growing range of outputs. But the idea of giving a publication a stable, standardised identifier is far older, and two of the most successful examples predate the digital scholarly record by decades. The ISBN and the ISSN — the identifiers for books and for serials respectively — are international standards that quietly organise the worlds of publishing, libraries and the book trade, and they remain essential to the scholarly record, particularly for the monographs and journals that the article-centric DOI does not directly address. Understanding what each one identifies, and how they relate to the newer identifiers, is part of being literate in the persistent identifiers domain of the CASRAI Dictionary.

The ISBN: identifying books

The International Standard Book Number (ISBN), defined by the international standard ISO 2108, identifies a specific book or book-like product. The key word is specific. An ISBN identifies a particular edition in a particular form: a hardback, a paperback and an e-book of the same title each receive their own distinct ISBN, because they are different products that a bookseller, library or reader needs to distinguish. This granularity reflects the ISBN’s origins in the book trade, where ordering, stocking and selling demand that each product be unambiguously identifiable. For scholarship, the ISBN matters above all for the monograph — the scholarly book that remains a primary form of output in the humanities and many social sciences — giving the book-length output a stable handle comparable to the article’s.

The ISSN: identifying serials

The International Standard Serial Number (ISSN), defined by ISO 3297, identifies a serial — a publication issued in a continuing sequence, such as a journal, magazine or other periodical. Here the unit of identification is different and important to grasp. An ISSN identifies the title as a whole, the ongoing publication itself, not any single issue or article within it. The journal has one ISSN that persists across all its issues and volumes; individual articles are identified by other means. The ISSN system also handles the reality that serials appear in different media: a journal published in both print and online forms is typically assigned distinct ISSNs for each, with a linking mechanism connecting them as expressions of the same title. The ISSN lets libraries, indexes, agents and discovery systems refer to a journal unambiguously across decades and across changes of publisher or format.

How they differ from DOIs

The crucial conceptual point is that these identifiers operate at different levels of granularity, and they complement rather than compete with one another:
- An ISSN identifies a journal — the whole continuing publication.
- An ISBN identifies a book — a specific edition of a specific title.
- A DOI typically identifies an individual item — a single article, chapter, dataset or other discrete output — and is designed above all to resolve to a current location online.

Seen this way, the question is not which identifier is “better” but what each is for. The ISSN and ISBN identify the container or the work at the level of the journal title or the book; the DOI identifies the individual item and provides actionable resolution: click a DOI and it takes you to the thing. A complete picture of a scholarly book chapter might involve the ISBN of the book, the ISSN of a series it belongs to, and a DOI for the chapter itself, each doing its own job.

How they connect

These systems increasingly work together rather than in isolation. A DOI registered for a journal article carries metadata that includes the journal’s ISSN, tying the item-level identifier to the title-level one. A DOI for a book chapter can reference the book’s ISBN. This linking lets discovery and citation systems assemble a coherent view: knowing that this article (DOI) appeared in that journal (ISSN), or that this chapter (DOI) belongs to that book (ISBN). The identifiers form a layered structure — title-level and item-level — and the value comes from the connections between the layers. For monographs in particular, the maturing of book-level DOIs alongside the long-established ISBN has helped scholarly books participate more fully in the citation and discovery ecosystem articles have long enjoyed.
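A small sketch can make the layering concrete: item-level records each carry the identifier of their container, and grouping on that identifier reassembles the title-level view. Every DOI, ISSN and ISBN below is a placeholder invented for illustration, and the field names are assumptions rather than any registry’s metadata schema.

```python
from collections import defaultdict

# Placeholder item records; the identifiers do not refer to real publications.
ITEMS = [
    {"doi": "10.1234/article-1", "container_type": "issn", "container_id": "1234-5679"},
    {"doi": "10.1234/article-2", "container_type": "issn", "container_id": "1234-5679"},
    {"doi": "10.1234/chapter-1", "container_type": "isbn", "container_id": "978-0-00-000000-2"},
]


def group_by_container(items):
    """Map each (identifier type, identifier) container to the DOIs it holds."""
    grouped = defaultdict(list)
    for item in items:
        key = (item["container_type"], item["container_id"])
        grouped[key].append(item["doi"])
    return dict(grouped)


containers = group_by_container(ITEMS)
print(containers[("issn", "1234-5679")])
```

The journal-level view exists only because each item recorded its container’s identifier consistently.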

Why this matters for the record

Getting these identifiers right is not pedantry; it is what makes the scholarly record navigable. Accurate ISSNs let a journal’s entire run be tracked and its articles correctly attributed to it; accurate ISBNs let scholarly books and their editions be found, ordered, preserved and cited; and the connection of both to DOIs lets the item and its container be linked. Where these identifiers are missing, wrong or inconsistently recorded, citations break, holdings are confused and outputs become harder to find — problems that fall especially hard on book-based disciplines. The wider persistent-identifier landscape, from ORCID for people to ROR for organisations, is the subject of much of this domain, and the ISBN and ISSN are among its oldest and most reliable members.
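One practical safeguard against wrongly recorded identifiers is that both standards build a check digit into the number, so many transcription errors can be caught mechanically. The sketch below validates ISBN-13 and ISSN check digits; the function names are illustrative, and it confirms only that a number is well formed, not that it has actually been assigned to a publication.

```python
def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit (weights alternate 1, 3 across 13 digits)."""
    chars = isbn.replace("-", "").replace(" ", "")
    if len(chars) != 13 or not (chars.isascii() and chars.isdigit()):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
    return total % 10 == 0


def is_valid_issn(issn):
    """Check the ISSN check digit (weights 8 down to 2, modulo 11, 'X' means 10)."""
    chars = issn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 8 or not chars.isascii():
        return False
    body, last = chars[:7], chars[7]
    if not body.isdigit() or not (last.isdigit() or last == "X"):
        return False
    total = sum(int(c) * weight for c, weight in zip(body, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return last == expected


print(is_valid_isbn13("978-0-306-40615-7"))  # True
print(is_valid_issn("0378-5955"))            # True
print(is_valid_issn("0378-5954"))            # False: last digit mistyped
```

A check like this belongs at the point of data entry or ingest, before a malformed identifier is propagated to other systems.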

A consistent vocabulary for identifiers

For these identifiers to do their work across publishers, libraries, indexes and research systems, the metadata around them must be described consistently — which identifier type is which, what it identifies, and how the levels relate. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that an ISBN, an ISSN and a DOI are each understood for what they are wherever they appear. And because every book and journal rests on the contributions of authors, editors and others, the work behind them can be described in the same shared framework — the CRediT taxonomy. Long before the digital age, the ISBN and ISSN showed the power of giving publications stable, standardised identities; they remain quietly indispensable to a scholarly record that now spans articles, books, journals and far beyond.
June 3, 2026
Finding research data: dataset discovery and data search engines
A vast amount of research data is now deposited in repositories, accompanied by persistent identifiers and described with metadata. That is a real achievement — but it raises a question that is easy to overlook. How does anyone actually find a relevant dataset? A researcher who suspects that the data they need may already exist somewhere faces a genuinely hard search problem: the data is scattered across thousands of repositories worldwide, each with its own catalogue, its own search box and its own conventions. Without good ways to discover data across all of them, a great deal of valuable, well-curated data simply goes unfound and unused — the digital equivalent of a book correctly shelved in a library no one knows the address of. This article looks at the infrastructure of dataset discovery, drawing on the data infrastructure domain of the CASRAI Dictionary.

Findability is the first FAIR principle

It is no accident that the F in FAIR — Findable, Accessible, Interoperable, Reusable — comes first. Findability is logically prior to everything else: data that cannot be found cannot be accessed, cannot be reused, and delivers none of the value its careful curation promised. Findability in the FAIR sense rests on a few concrete foundations: data should be assigned a globally unique and persistent identifier; it should be described with rich metadata; that metadata should explicitly include the identifier; and the metadata should be registered or indexed in a resource that can be searched. The order of the principles is a quiet but important statement of priorities — all the work of making data accessible and reusable is wasted if the first hurdle, being found at all, is never cleared.

Registries of repositories

Discovery operates at more than one level, and the first level is finding the right repository. Before searching for a dataset, a researcher — whether looking for existing data or deciding where to deposit their own — often needs to identify which repository is appropriate for their field and data type. This is the role of re3data, the Registry of Research Data Repositories, a comprehensive directory that catalogues data repositories across all disciplines. It lets users discover repositories by subject, country, data type and the policies they operate, describing each in a structured way. re3data answers the question “where might data like this live, and where should I put mine?” It is discovery one level up — finding the haystacks before searching for the needle — and it is an essential first step that purely dataset-level search tools do not provide.

Dataset-level discovery

The second level is finding individual datasets, and several complementary services address it:
- DataCite Commons. Because DataCite is a principal minter of persistent identifiers for research data, it sits on a large, structured graph of datasets and their connections to people, organisations, funders and related outputs. DataCite Commons exposes that graph for discovery, letting users search across datasets and follow the links between a dataset and its authors, its funding and the works that cite or relate to it.
- Google Dataset Search. A general-purpose search engine specifically for datasets, it works by harvesting structured metadata that data providers publish on their own pages, then making it searchable in one place. It brings dataset discovery into a familiar, web-scale search experience.
- Repository and aggregator catalogues. Individual repositories offer their own search, and aggregators pull metadata from many sources into combined indexes, each widening the net a little further.

Why structured metadata is the engine

What makes web-scale dataset search possible at all is structured, machine-readable metadata, and in particular the schema.org/Dataset vocabulary. schema.org is a shared vocabulary for marking up information on web pages so that machines, not just humans, can understand it, and it includes a specific type for describing datasets — their title, description, creators, licence, distribution and more. When a repository or data provider embeds schema.org/Dataset markup in the page describing a dataset, a search engine crawling the web can recognise that the page describes a dataset and extract its key facts. This is precisely how a service such as Google Dataset Search builds its index: not by being given a private feed from every repository, but by reading the standardised markup that providers publish openly. The lesson is direct and practical — describing data with shared, structured metadata is not bureaucratic box-ticking, it is the literal mechanism by which the data becomes discoverable to the wider world.
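As an illustration, the following sketch builds schema.org/Dataset markup as JSON-LD and wraps it in the script tag a landing page would embed. The property names (`name`, `description`, `identifier`, `license`, `creator`, `distribution`) are schema.org terms; the dataset details, URLs and function names are invented for the example, and real repositories typically generate this markup from their own metadata store.

```python
import json


def dataset_jsonld(title, description, creators, licence_url, identifier, download_url):
    """Build a schema.org/Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "identifier": identifier,
        "license": licence_url,
        "creator": [{"@type": "Person", "name": name} for name in creators],
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": "text/csv",
            }
        ],
    }


def as_script_tag(markup):
    """Wrap the JSON-LD in the script element used to embed it in a page."""
    return '<script type="application/ld+json">' + json.dumps(markup) + "</script>"


markup = dataset_jsonld(
    title="Example river temperature readings",           # hypothetical dataset
    description="Hourly water temperature from three monitoring sites.",
    creators=["A. Researcher"],
    licence_url="https://creativecommons.org/licenses/by/4.0/",
    identifier="https://doi.org/10.1234/example-dataset",  # placeholder DOI
    download_url="https://repository.example.org/files/readings.csv",
)
print(as_script_tag(markup))
```

A crawler that recognises the `Dataset` type can extract the title, licence and download location from this block without any repository-specific integration.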

Discovery depends on good deposit

All of this throws the responsibility back to the moment of deposit. A dataset is only as findable as its metadata is good. Rich, accurate, standards-based metadata — a clear title, a meaningful description, named creators with identifiers, an explicit licence, appropriate keywords — is what feeds every layer of the discovery system. Skimpy or inconsistent metadata leaves a dataset effectively invisible no matter how valuable its contents. This is why guidance on depositing data places such weight on description, and why the choices a researcher makes at deposit time echo through every subsequent attempt to find their work. Practical guidance on getting this right is part of our wider material on research data fundamentals.
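A repository or a researcher could run a simple completeness check before deposit. The sketch below tests for the elements listed above; the field names and the rule that every creator should carry an ORCID iD are assumptions chosen for the example, not a repository requirement.

```python
# Fields the paragraph above names as feeding discovery (names are illustrative).
REQUIRED_FIELDS = ("title", "description", "creators", "licence", "keywords")


def deposit_gaps(metadata):
    """List what is missing from a deposit's metadata before it is submitted."""
    gaps = [f"missing: {field}" for field in REQUIRED_FIELDS if not metadata.get(field)]
    for creator in metadata.get("creators") or []:
        if not creator.get("orcid"):
            gaps.append(f"creator without identifier: {creator.get('name', 'unnamed')}")
    return gaps


draft = {
    "title": "Example river temperature readings",
    "description": "",
    "creators": [{"name": "A. Researcher"}],
    "licence": "CC-BY-4.0",
}
for gap in deposit_gaps(draft):
    print(gap)
```

Here the draft would be reported as lacking a description, keywords and a creator identifier, each of which weakens the dataset’s chances of being found.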

A consistent vocabulary for findable data

For discovery to work across repositories, registries and search engines, the metadata describing datasets must mean the same thing everywhere — a creator, a licence, a resource type or a related-work link has to be interpretable consistently across every system that indexes it. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that dataset metadata is understood identically wherever it is harvested. And because creating and curating a dataset is genuine research contribution, the people behind it can be credited using the same framework as any other — the CRediT taxonomy and its full set of contribution roles, with Data curation recognising the work that makes data findable in the first place. Depositing data is necessary; describing it well, in shared terms, is what makes it discoverable — and discoverability is what lets data fulfil its purpose.
May 31, 2026
How the Software role applies to code-only outputs
A growing fraction of research output is code: software libraries that implement a method, computational notebooks that demonstrate an analysis, simulation frameworks that enable a body of work, infrastructure tooling that supports a research community. When the output is primarily code, the CRediT Software role has to carry more than its brief definition anticipates. This post is a practical guide to assigning Software in code-centric contexts.

The Software role, briefly

The CRediT Software role is defined as: Programming, software development; designing computer programs; implementation of the computer code and supporting algorithms; testing of existing code components. The definition is short and was written with software-as-tool-for-a-paper in mind, not software-as-the-paper.

For a conventional research paper where someone wrote analysis code that supported the science, Software is straightforward: the person who wrote the analysis code gets the role. For a paper whose primary scholarly contribution is the code itself — a JOSS paper, a software-methods paper, a tool announcement — Software is the dominant role and the brevity of the definition starts to bite.

What the Software role should cover in a code-only context

Our recommendation, distilled from the practice of JOSS, the Software Sustainability Institute, the Research Software Engineers community, and several years of CASRAI editorial work, is to read Software in code-only contexts as encompassing the following five sub-activities, all of which should be visible in the contributorship statement even if they share the role.

Implementation: writing the production code itself. This is the core of Software and is what people most naturally associate with the role.

Architecture and design: the higher-level decisions about how the code is structured, what its dependencies are, how its modules interact. In a code-only paper, architecture is part of the intellectual contribution and the architect should be a co-author with Software role.

Testing: writing the test suite, including unit tests, integration tests, and regression tests. A code-only paper with a credible test suite has someone who built it.

Documentation: user-facing documentation, developer-facing documentation, README, examples, tutorials. For code intended for reuse, documentation is part of the deliverable; the documentation contributor gets the Software role.

Packaging and release: the engineering work of making the code installable, citable, and citation-resolvable. CI/CD configuration, dependency management, release-tagging, DOI registration. For long-lived code with multiple releases, this is sustained work; for a one-off code release accompanying a paper, it is still non-trivial.

Each of these is meaningful contribution that the Software role captures. A code-only paper’s CRediT statement should make the distribution of these activities across contributors visible, using the lead/equal/supporting qualifier to express relative magnitude.
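One way to keep that distribution visible is to record each assignment as a small structured entry and check that all five sub-activities are covered. This is a sketch under assumptions: the sub-activity labels follow the list above, the contributor names are invented, and the layout is not an official CRediT serialisation.

```python
SUB_ACTIVITIES = (
    "implementation",
    "architecture and design",
    "testing",
    "documentation",
    "packaging and release",
)
DEGREES = ("lead", "equal", "supporting")

# Hypothetical statement for a two-person software paper.
STATEMENT = [
    {"contributor": "A. Author", "sub_activity": "implementation", "degree": "lead"},
    {"contributor": "A. Author", "sub_activity": "architecture and design", "degree": "equal"},
    {"contributor": "B. Author", "sub_activity": "architecture and design", "degree": "equal"},
    {"contributor": "B. Author", "sub_activity": "testing", "degree": "lead"},
    {"contributor": "B. Author", "sub_activity": "documentation", "degree": "supporting"},
]


def uncovered(statement):
    """Return the sub-activities that no contributor has been credited with."""
    covered = {entry["sub_activity"] for entry in statement}
    return [activity for activity in SUB_ACTIVITIES if activity not in covered]


def render(statement):
    """Render one line per contributor, listing Software sub-activities."""
    by_person = {}
    for entry in statement:
        if entry["degree"] not in DEGREES:
            raise ValueError(f"unknown degree: {entry['degree']}")
        label = f"{entry['sub_activity']}, {entry['degree']}"
        by_person.setdefault(entry["contributor"], []).append(label)
    return [f"{person}: Software ({'; '.join(labels)})" for person, labels in by_person.items()]


print(uncovered(STATEMENT))
for line in render(STATEMENT):
    print(line)
```

In this example the check reports that nobody has been credited with packaging and release, which is exactly the kind of omission a prose statement tends to hide.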

Where Software overlaps with other roles

Three overlaps deserve attention.

First, Software versus Methodology. If the code implements a novel method, the method itself is a Methodology contribution; the implementation is a Software contribution. The same person often discharges both, and the contributorship statement should assign both roles to them. The error to avoid is conflating the two: assigning Software while omitting Methodology under-represents the intellectual contribution.

Second, Software versus Validation. Writing tests is Software (per the definition); validating the code against reference implementations or independent data is Validation. The distinction is genuine: tests verify that the code does what the developer intended; validation verifies that the code does what is scientifically correct. Both belong in a code-only paper’s contributorship.

Third, Software versus Writing – original draft. The README, the developer documentation, the API reference — these are documentation, captured under Software. The paper itself, including its method description and its discussion of design choices, is captured under Writing – original draft. The boundary is the publication artefact: anything in the paper is Writing; anything in the code repository is Software.

Cross-referencing with CITATION.cff

The CITATION.cff convention, increasingly standard in scientific software repositories, complements CRediT rather than duplicating it. The Citation File Format records author and contact entries, with names, affiliations and ORCID iDs, for a specific version of the software; it has no built-in field for type of contribution, although some integrators layer CRediT-aligned vocabularies on top of it. The recommended pattern for a code-only paper is to maintain both: a CRediT statement in the paper (for the paper-level contributorship) and a CITATION.cff in the repository (for the per-version record of who stands behind the code, which CRediT cannot express).

The two should be consistent. A contributor named in the paper with Software role should appear in the CITATION.cff with at least equivalent contribution; a contributor named in the CITATION.cff but not in the paper should be acknowledged in the paper’s acknowledgements section. The CASRAI CITATION.cff entry walks through the integration patterns.
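The consistency check can be automated once both records are available as data. The sketch below assumes the CITATION.cff has already been parsed into a dictionary by a YAML parser, that people are matched on ORCID iD, and that the paper’s CRediT roles are available as a mapping; the names, iDs and function names are placeholders.

```python
def bare_orcid(value):
    """Strip any URL prefix so ORCID iDs compare equal in either form."""
    return value.rstrip("/").rsplit("/", 1)[-1]


def compare_records(paper_roles, cff):
    """Compare the paper's CRediT roles with the authors listed in CITATION.cff.

    paper_roles maps ORCID iD -> list of CRediT roles held in the paper.
    cff is the parsed CITATION.cff content.
    """
    cff_people = {bare_orcid(a["orcid"]) for a in cff.get("authors", []) if a.get("orcid")}
    paper_people = {bare_orcid(orcid): roles for orcid, roles in paper_roles.items()}
    software_people = {o for o, roles in paper_people.items() if "Software" in roles}
    return {
        "software_role_missing_from_cff": sorted(software_people - cff_people),
        "to_acknowledge_in_paper": sorted(cff_people - set(paper_people)),
    }


# Placeholder data: the ORCID iDs are invented for illustration.
PAPER_ROLES = {
    "0000-0000-0000-0001": ["Software", "Methodology"],
    "0000-0000-0000-0002": ["Software"],
}
PARSED_CFF = {
    "cff-version": "1.2.0",
    "title": "example-package",
    "authors": [
        {"family-names": "Author", "given-names": "A.", "orcid": "https://orcid.org/0000-0000-0000-0001"},
        {"family-names": "Maintainer", "given-names": "C.", "orcid": "https://orcid.org/0000-0000-0000-0003"},
    ],
}

print(compare_records(PAPER_ROLES, PARSED_CFF))
```

Matching on ORCID iD rather than on name avoids false mismatches caused by initials or name order, at the cost of skipping CFF authors who have no iD recorded.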

The maintenance question

An unresolved aspect of Software in code-only contexts is how to credit maintenance over time. A research software package may have a paper at first release, with a CRediT statement reflecting the founding contributors. Five years and several major versions later, the package has new maintainers, new contributors, and a substantially different code base. The original paper’s CRediT statement is increasingly out of date.

The current pragmatic answer is: the paper’s CRediT statement freezes at publication; the CITATION.cff in the repository tracks current contributorship; downstream citation should reference both, with the paper as the publication-of-record and the CFF as the current-contributor record. This works but is imperfect. The Software Citation Working Group has been chewing on whether per-version CRediT statements, deposited to Crossref via the related-identifier mechanism, would be a cleaner answer; the proposal is technically viable but not yet a community consensus.

What journals should do

For journals publishing software papers, the recommended editorial practices are: require CRediT with qualifiers in the paper; require a CITATION.cff in the linked repository; verify that the two are consistent; for major software packages, accept and publish supplementary contributor records that go beyond the byline.

JOSS is the maturity reference here and most other software-paper venues are moving toward similar practices. The CASRAI CRediT for software papers guide is updated quarterly with current practice.

What authors should do

For authors of code-only papers, four practical steps. First, distribute the Software role across the five sub-activities visibly, using the qualifier. Second, assign Methodology when the code implements a novel method. Third, maintain the CITATION.cff in the repository in parallel with the paper’s CRediT statement. Fourth, plan for the maintenance-credit question: who will maintain the code, how their contribution will be recognised over time, where the credit will live.

The CRediT taxonomy can support code-only outputs well, with attention. The work is in using the Software role thoughtfully, in interlocking it with Methodology and Writing where appropriate, and in maintaining the parallel record in the repository.

May 15, 2026
Horizon Europe ERC and MSCA Funding: Instruments, Eligibility and UK Association
