protocol – CASRAI Dictionary

For most of the history of the scholarly record, “research output” meant one thing: the peer-reviewed journal article, with the book a distant second. Everything else — the data, the code, the protocols, the negative results — was apparatus or supplement, uncounted and largely uncredited. That assumption has broken down, and a modern outputs taxonomy has to reflect a far wider range of things that researchers produce, each deserving its own place, its own identifier, and its own recognition in assessment. This article surveys that expanded taxonomy, drawing on the research-outputs domain.

Why the article-only model failed

The article-centric model failed for a simple reason: the article is no longer where much of the value lives. A reproducible computational study’s value is as much in its code and data as in its prose. A widely reused dataset can influence a field more than the paper that introduced it. A protocol followed by hundreds of labs is a contribution in its own right. Treating all of these as mere supplements to an article misallocates credit, loses the artefacts that actually get reused, and gives assessment a distorted picture of what a researcher contributed. The expansion of the outputs taxonomy is not taxonomic enthusiasm; it is a correction.

The expanded output types

A modern taxonomy organises a wide range of outputs. Several stand out as having reshaped the landscape.

The preprint — a manuscript posted to a public server before or during formal peer review — is now a first-class output, not a second-class draft. It establishes priority, accelerates dissemination, and carries its own DOI. The relationship between a preprint and its eventual published version is itself metadata worth recording.
The dataset — a collection of research data with a DataCite DOI — is the output whose recognition has changed most. Data citation is now expected practice, and a well-curated, documented dataset is a citable contribution that can be credited and assessed.
Research software — software produced for or as part of research, with a stable identifier such as a Software Heritage ID or a DataCite DOI — is increasingly recognised as a research output, with its own citation conventions and its own (imperfect) fit to contributorship taxonomies.
The trained model — an AI/ML model released as a research output, typically documented with a model card — is the newest major addition, reflecting the rise of machine-learning research that produces models and datasets rather than only papers.
The registered report — published in two stages, with the protocol peer-reviewed and accepted before data collection — is a structural innovation in how an output is produced, designed to guard against publication bias by committing to publish regardless of outcome.

Beyond these, the taxonomy reaches further: protocols with DOIs (as minted on platforms like protocols.io), negative-results reports, systematic reviews and their living variants, policy briefs, standards contributions, patents, clinical-trial registrations, theses, conference papers, and the practice-based outputs of the arts. The breadth is the point: research produces many kinds of thing, and a taxonomy that names only one of them is misleading by omission.

Two structural requirements: identifiers and relationships

An outputs taxonomy is only useful if its entries can be reliably identified and related. Two requirements follow.

The first is stable identifiers for every output type, not just articles. A dataset needs a DOI, software needs a SWHID (Software Heritage persistent identifier) or a DOI, a sample referenced by an output needs an IGSN, a project that produced the outputs needs a RAiD, and the people and institutions behind them need ORCID iDs and ROR IDs respectively. Without identifiers, the expanded taxonomy is just a longer list of things that cannot be cited or counted reliably. With them, every output type becomes a first-class, citable, assessable entity.
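
The identifier requirement can be made concrete with a small check. What follows is a minimal sketch, not a reference implementation: the entity-type labels, the `Entity` class, and the `lacks_identifier` helper are hypothetical names introduced for illustration, and the identifier values are placeholders. Only the pairing of entity types with identifier schemes comes from the paragraph above.

```python
from dataclasses import dataclass, field

# Identifier schemes named in the text, keyed by a hypothetical entity-type label.
# An entity needs at least one identifier from the set accepted for its type.
ACCEPTED_SCHEMES = {
    "dataset": {"doi"},
    "software": {"swhid", "doi"},
    "sample": {"igsn"},
    "project": {"raid"},
    "person": {"orcid"},
    "organisation": {"ror"},
}


@dataclass
class Entity:
    entity_type: str
    label: str
    identifiers: dict[str, str] = field(default_factory=dict)  # scheme -> value


def lacks_identifier(entity: Entity) -> bool:
    """True if the entity carries none of the schemes accepted for its type.

    Unknown entity types have no accepted schemes, so they are also reported.
    """
    accepted = ACCEPTED_SCHEMES.get(entity.entity_type, set())
    return accepted.isdisjoint(entity.identifiers)


# Placeholder records: the DOI value is illustrative, not a real identifier.
records = [
    Entity("dataset", "survey data", {"doi": "10.1234/placeholder"}),
    Entity("software", "analysis scripts"),
    Entity("person", "corresponding author"),
]

uncitable = [e.label for e in records if lacks_identifier(e)]
print(uncitable)
```

A check like this only tests that an identifier of the right scheme is present. Validating the identifier's syntax or resolving it against its registry would be a separate step.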

The second is clean parent-child and related relationships between output types. A registered report’s stage-1 protocol and stage-2 article are related; a preprint and its published version are related; a dataset and the software that processed it are related; a systematic review and the studies it synthesises are related. A taxonomy that captures these relationships lets automated systems and CRIS platforms reason over outputs — grouping a project’s preprint, dataset, and software as facets of one contribution rather than three unconnected records.
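
One way a system might reason over such relationships is to treat outputs as nodes and relationships as edges, then group everything that is connected. The sketch below assumes that approach. The output ids and relation labels are hypothetical and are not drawn from any particular vocabulary, and `group_outputs` is an illustrative helper rather than part of any CRIS platform.

```python
from collections import defaultdict

# (source, relation, target) triples. Ids and relation labels are hypothetical.
relations = [
    ("preprint-1", "is-preprint-of", "article-1"),
    ("dataset-1", "is-supplement-to", "article-1"),
    ("software-1", "processes", "dataset-1"),
    ("stage1-protocol", "is-stage-1-of", "registered-report-1"),
]


def group_outputs(relations):
    """Group output ids into connected components of the relation graph."""
    neighbours = defaultdict(set)
    for source, _relation, target in relations:
        # Direction is ignored here: any relation links two outputs together.
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk that collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(component)
    return groups


groups = group_outputs(relations)
for group in groups:
    print(sorted(group))
```

With these sample triples, the preprint, article, dataset, and software fall into one group and the two registered-report stages into another. Ignoring the relation label is a deliberate simplification. A real system would probably group only on selected relation types, since a citation link, for instance, should not merge two unrelated projects.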

Why this matters for assessment

The expanded taxonomy connects directly to responsible research assessment. Narrative-CV formats explicitly invite researchers to describe contributions beyond publications — the datasets, the software, the open-science work. But for an assessor to take a dataset or a model seriously, it has to be a recognised, identifiable output type, not an undifferentiated “other.” A modern outputs taxonomy is the precondition for assessment that values what researchers actually produce. Naming a model, a dataset, or a protocol as a first-class output is what lets it be claimed on a CV and weighed by a panel.

A caution against type proliferation

A taxonomy can fail in two directions. The old failure was too few types — everything that was not an article was invisible. The opposite failure is too many: a sprawling list of hyper-specific types that no two systems classify the same way, so that exchange becomes impossible and the taxonomy collapses under its own weight. The discipline a good taxonomy needs is to enumerate the types that genuinely behave differently — that have different identifiers, different lifecycles, different assessment treatment — and to use relationships rather than ever-finer types to capture the rest. The goal is a taxonomy that classifiers and CRIS systems can apply consistently, which means stable, well-bounded types with clean relationships, not an open-ended catalogue.

Where the dictionary fits

Several stewards already maintain output-type vocabularies — COAR Resource Types, the Crossref and DataCite output types, the categories used by national assessment exercises. The need is not another competing list but an integrative, operational reference that defines each type clearly, federates to those stewards, and makes the relationships between types explicit. Providing that — so that a “dataset” or a “registered report” means the same thing across systems — is the convening role the CASRAI dictionary is designed for.
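
Federating to existing stewards can be pictured as a crosswalk from local type names to each external vocabulary. The sketch below is illustrative only. The local type names, `CROSSWALK`, `federate`, and `unmapped` are hypothetical, and the external labels reflect my understanding of DataCite's `resourceTypeGeneral` values and Crossref's work types, so they should be checked against the current published lists before use.

```python
from typing import Optional

# Local type -> label in each external vocabulary.
# None means "no mapping recorded in this sketch", not a claim that the
# vocabulary lacks a suitable type.
CROSSWALK = {
    "dataset": {"datacite": "Dataset", "crossref": "dataset"},
    "preprint": {"datacite": "Preprint", "crossref": "posted-content"},
    "software": {"datacite": "Software", "crossref": None},
}


def federate(local_type: str, vocabulary: str) -> Optional[str]:
    """Return the external label for a local type, or None if unmapped."""
    try:
        return CROSSWALK[local_type][vocabulary]
    except KeyError:
        raise ValueError(
            f"unknown local type or vocabulary: {local_type!r}, {vocabulary!r}"
        ) from None


def unmapped(vocabulary: str) -> list[str]:
    """List local types with no recorded mapping in the given vocabulary."""
    return sorted(
        local for local, targets in CROSSWALK.items()
        if targets.get(vocabulary) is None
    )


print(federate("preprint", "crossref"))
print(unmapped("crossref"))
```

Keeping the gaps explicit, rather than forcing every local type onto the nearest external label, shows where a type does not yet mean the same thing across systems.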

What to do now

For researchers: mint identifiers for all your outputs, not only your papers, and record the relationships between them. For institutions and CRIS owners: support the full range of output types as first-class records with clean relationships, federating your type list to an established vocabulary rather than inventing one. For assessment: recognise the expanded taxonomy, so that the dataset, the model, and the protocol can be claimed and weighed alongside the article.


Beyond the article: a modern taxonomy of research outputs