Editorial · CASRAI · Research outputs (expanded)

Data papers, software papers, and the limits of CRediT

Data papers and software papers don’t map cleanly onto the 14 CRediT roles. A practical guide to the friction and where the taxonomy needs work.

By CASRAI Editorial Board

Published 17 Dec 2025 · 6 minute read

The 14 roles of CRediT were designed against the model of a conventional research article reporting empirical work: a study with a hypothesis, a method, data, analysis, and a written argument. Data papers and software papers fit this model awkwardly. A data paper describes a dataset; a software paper describes a piece of software. The intellectual contribution is the artefact itself, not the prose around it. The CRediT roles, applied to these papers, produce statements that are technically valid but substantively misleading. This post catalogues the friction and suggests where the taxonomy could be extended.

What a data paper actually is

A data paper, as the genre has developed in venues like Scientific Data, Earth System Science Data, GigaScience, and the data-paper streams of disciplinary journals, is a peer-reviewed description of a dataset: its provenance, its collection method, its quality, its access conditions, and its potential reuse. The dataset itself lives in a repository with its own DOI; the data paper provides the citable, peer-reviewed scholarly record that the dataset exists, that it was collected with rigour, and that it is fit for reuse.

The intellectual labour behind a data paper is mostly not in the paper. It is in the years of fieldwork or instrument operation that produced the data, the protocols that ensured comparability across collection events, the curation work that turned raw observations into a structured deposit, and the documentation that lets a stranger understand what the data mean. The paper is a summary record of that work.

Where CRediT falls short for data papers

Three friction points stand out. First, Investigation and Data curation bear most of the load, and neither is differentiated finely enough. A field ecologist who spent years collecting samples, a lab technician who processed them, a data manager who normalised the schema, and a metadata specialist who wrote the documentation are all plausibly Investigation or Data curation; the roles do not distinguish them. The result is that two papers with very different actual contributorship patterns can have identical-looking CRediT statements.
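The collapse described above can be made concrete with a minimal sketch. The activity labels and the activity-to-role mapping are illustrative assumptions, not part of CRediT; the only CRediT terms used are Investigation and Data curation.

```python
from collections import Counter

# Hypothetical activity labels (not CRediT terms), each mapped to the
# CRediT role it would most plausibly be reported under.
ACTIVITY_TO_CREDIT = {
    "field sampling": "Investigation",
    "sample processing": "Investigation",
    "schema normalisation": "Data curation",
    "metadata documentation": "Data curation",
}


def credit_profile(team):
    """Count how many contributors hold each CRediT role.

    `team` maps a contributor label to the list of activities they performed.
    """
    profile = Counter()
    for activities in team.values():
        # A contributor is listed once per role, however many activities feed it.
        for role in {ACTIVITY_TO_CREDIT[a] for a in activities}:
            profile[role] += 1
    return dict(profile)


team_a = {
    "field ecologist": ["field sampling"],
    "data manager": ["schema normalisation"],
}
team_b = {
    "lab technician": ["sample processing"],
    "metadata specialist": ["metadata documentation"],
}

# Different work, same role-level profile.
print(credit_profile(team_a) == credit_profile(team_b))
```

Both teams reduce to one Investigation and one Data curation contributor, so the role-level statement alone cannot tell them apart.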

Second, Resources overlaps with Investigation in a confusing way. A data paper describing a long-term ecological observatory has a Resources contribution (the observatory itself) that is distinct from the per-sample Investigation. CRediT does not currently draw a clean line between “provided the infrastructure that produced the data” and “provided the samples that went into the data.”

Third, Writing – original draft is often the smallest contribution, not the largest, and assigning it Lead can misrepresent the contribution structure. The person who wrote the paper is often a relatively junior team member, not the senior person whose intellectual contribution was the protocol and the multi-year campaign.

Software papers and the JOSS model

Software papers, exemplified by the Journal of Open Source Software (JOSS), face an analogous problem from a different direction. A JOSS paper is short — often under 1,000 words — and is paired with a peer-reviewed software repository. The intellectual contribution is the software: its design, its implementation, its tests, its documentation, its maintenance over time. The paper is a stub.

JOSS itself uses CRediT for its papers and has done so since 2020. The community has converged on a set of mappings:

Conceptualization covers software design and architectural decisions.
Software covers implementation. This is the central role for most JOSS contributors.
Validation covers testing, both unit tests and validation against reference implementations.
Methodology covers the algorithmic content, where the software implements a non-trivial method.
Writing – original draft covers the paper itself. The README, the developer documentation, and the user docs are also writing work, but they are not the JOSS paper.
Supervision covers project leadership; Project administration covers maintenance and coordination.

The friction in this mapping is that the Software role is overloaded. It conflates the initial implementation, ongoing maintenance, bug-fixing, refactoring, and tooling. A contributor who implemented the core algorithm and a contributor who maintains the CI/CD pipeline both get “Software” with no further distinction. For long-lived software with many contributors over years, the role assignment ends up giving everyone Software (lead/equal/supporting) and the differentiation lives in the GitHub commit history, not in CRediT.

The FAIR4RS angle

The FAIR4RS Principles for research software, finalised in 2022, set out what FAIR means for software: findable, accessible, interoperable, reusable. They explicitly acknowledge that software citation needs a richer model than data citation, because software has versions, dependencies, and ongoing development that data typically does not.

FAIR4RS implies, though does not directly require, a richer contributorship taxonomy for software. The Software Citation Implementation Working Group has been chewing on this for several years. Their working position is that CRediT remains the right vocabulary for software paper contributorship, but that the software repository itself should carry its own contributor metadata using a complementary scheme — typically CITATION.cff with extended fields — that captures the per-version, per-component contributorship that CRediT cannot.

The mapping problem

For data papers and software papers, the operational reality is that two parallel records exist: the paper’s CRediT statement and the dataset or software repository’s contributor metadata. They overlap but do not align cleanly. The dataset DOI and software DOI live in DataCite; the paper DOI lives in Crossref; the relations between them are declared in the metadata but not always reciprocally.
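A reciprocity gap of this kind is straightforward to detect once the relations are in hand. The sketch below is a simplification under stated assumptions: it treats both records as if they used the DataCite relationType vocabulary (Crossref has its own relation vocabulary, which is not modelled here), the input layout is invented for illustration, and the DOIs are placeholders.

```python
# Inverse pairs from the DataCite relationType vocabulary (a subset).
INVERSE = {
    "IsDescribedBy": "Describes",
    "Describes": "IsDescribedBy",
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
    "IsDocumentedBy": "Documents",
    "Documents": "IsDocumentedBy",
}


def missing_reciprocals(records):
    """Return the inverse links that should exist but are not declared.

    `records` maps a DOI to a list of (relation_type, target_doi) pairs.
    Each result is a (source_doi, relation_type, target_doi) triple.
    """
    declared = {
        (doi, rel, target)
        for doi, links in records.items()
        for rel, target in links
    }
    gaps = []
    for doi, rel, target in sorted(declared):
        inverse = INVERSE.get(rel)
        if inverse and (target, inverse, doi) not in declared:
            gaps.append((target, inverse, doi))
    return gaps


# Placeholder DOIs for illustration; not real records.
records = {
    "10.5555/example-paper": [("Describes", "10.5555/example-dataset")],
    "10.5555/example-dataset": [],  # the dataset record never points back
}

print(missing_reciprocals(records))
```

Here the paper declares that it describes the dataset, but the dataset record carries no IsDescribedBy link back, so that missing link is reported.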

The CASRAI research outputs domain tracks the mapping conventions in current use. Our recommendation, for now, is that data papers and software papers should publish a CRediT statement covering the paper’s contributorship and should additionally publish a richer contributor metadata file with the dataset or software, using CRediT roles plus the discipline-specific extensions that have emerged.

Possible extensions

Three extensions would meaningfully improve the situation. First, sub-roles within Software: an extended taxonomy with implementation, testing, documentation, maintenance, and integration as sub-roles would give a software paper a more truthful contributorship statement. This work has been drafted by the FORCE11 software citation working group but not formally proposed as a CRediT extension.
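One way to picture Software sub-roles is as an optional refinement that can always be dropped to recover a plain CRediT statement. The sketch below assumes exactly that design; the `Contribution` type, the `subrole` field, and the contributor names are hypothetical, and the sub-role names are the ones listed above rather than an adopted vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

# Sub-role names from the proposal discussed above; not CRediT terms.
SOFTWARE_SUBROLES = {
    "implementation", "testing", "documentation", "maintenance", "integration",
}


@dataclass(frozen=True)
class Contribution:
    contributor: str
    role: str                      # a CRediT role, e.g. "Software"
    subrole: Optional[str] = None  # hypothetical refinement of "Software"

    def __post_init__(self):
        # In this sketch, sub-roles are only defined under the Software role.
        if self.subrole is not None:
            if self.role != "Software" or self.subrole not in SOFTWARE_SUBROLES:
                raise ValueError(
                    f"unknown sub-role {self.subrole!r} for role {self.role!r}"
                )


def to_plain_credit(contributions):
    """Drop sub-roles and deduplicate, giving (contributor, role) pairs."""
    return sorted({(c.contributor, c.role) for c in contributions})


extended = [
    Contribution("Ada", "Software", "implementation"),
    Contribution("Ada", "Software", "testing"),
    Contribution("Ben", "Software", "maintenance"),
]

print(to_plain_credit(extended))
```

The extended list distinguishes the core implementer from the maintainer; the plain projection gives both of them "Software" and nothing more, which is the loss the sub-roles are meant to avoid.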

Second, differentiated Investigation roles for data papers: collection, processing, curation, and documentation as sub-roles of Investigation and Data curation would let a data paper describe its contributorship more faithfully. The challenge here is keeping the taxonomy usable; an over-elaborate vocabulary loses adoption.

Third, artefact-level role assignments: the current CRediT statement applies at the paper level. For a paper that describes a dataset and a software package, it might be more useful to have role assignments at the artefact level (paper, dataset, software each get their own statement) with cross-references. This would require schema work in Crossref, DataCite, and ORCID.

What to do now

For authors of data papers, the practical advice is: use CRediT for the paper; deposit a complementary contributors.json with the dataset that captures finer-grained roles; cross-reference the two in the related-identifier blocks. For authors of software papers, use CRediT for the paper and CITATION.cff for the repository, with the CFF carrying the rich per-component contributor data. The CASRAI data and software papers guide has worked examples.
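As a rough illustration of the data-paper advice, the sketch below assembles a contributors.json record in Python. The field names (`dataset_doi`, `described_by`, `credit_role`, `detail`) are assumptions made for this example, not a published schema, and the DOIs and contributor names are placeholders.

```python
import json

# Subset of CRediT role names used in this sketch.
CREDIT_ROLES = {"Investigation", "Data curation", "Resources"}


def build_contributors(dataset_doi, paper_doi, entries):
    """Assemble a contributors record for deposit alongside a dataset.

    The field names are illustrative assumptions, not a published schema.
    """
    for entry in entries:
        if entry["credit_role"] not in CREDIT_ROLES:
            raise ValueError(
                f"not a CRediT role in this sketch: {entry['credit_role']!r}"
            )
    return {
        "dataset_doi": dataset_doi,
        "described_by": paper_doi,  # pointer back to the data paper
        "contributors": entries,
    }


record = build_contributors(
    dataset_doi="10.5555/example-dataset",  # placeholder DOI
    paper_doi="10.5555/example-paper",      # placeholder DOI
    entries=[
        {"name": "Contributor A", "credit_role": "Investigation",
         "detail": "field collection"},
        {"name": "Contributor B", "credit_role": "Investigation",
         "detail": "sample processing"},
        {"name": "Contributor C", "credit_role": "Data curation",
         "detail": "schema normalisation"},
    ],
)

text = json.dumps(record, indent=2)
print(text)
```

Keeping the CRediT role as a required field and the finer-grained description as free text means the file stays readable by anything that only understands CRediT, while still recording who did what.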

For the CRediT stewardship group, the recommendation is to prioritise the data-paper and software-paper mapping problem in the v2026.3 revision discussion. The friction is real, the workarounds are working but ugly, and the taxonomy will be strengthened by a thoughtful extension.
