Research software as a first-class output: citation, identifiers and credit

A great deal of modern research runs on software the researchers wrote themselves — the analysis pipeline, the simulation, the model, the tool that made the result possible. And yet that software is routinely treated as a footnote rather than a finding: mentioned in a methods section, perhaps, but rarely cited as an output in its own right and almost never credited to the people who built it. Treating research software as a first-class output — citable, persistent, and creditable — corrects that, and it is a core concern of the research-outputs domain with a direct line to the reproducibility domain. If a result depends on code, the code is part of the evidence, and the people who wrote it did intellectual work that deserves recognition.

Why software needs to be cited as software

Citing software properly does two distinct jobs, and it is worth keeping them apart. The first is credit: building reliable research software — designing it, implementing it, testing it, documenting it — is substantial scholarly work, and it is rewarded only if the software is cited as an output rather than buried in prose. The second is reproducibility: a computational result can only be verified, and the analysis only reused, if a reader can identify and obtain the exact software that produced it, down to the version. A vague reference to “custom scripts” serves neither goal. A formal citation to a versioned, identified piece of software serves both.

The community reference point here is the FORCE11 Software Citation Principles, developed by the FORCE11 Software Citation Working Group. They establish that software should be considered a legitimate, citable product of research, that citations should give credit to all contributors, that they should identify a specific version using a unique, persistent identifier, and that they should enable access to the software itself. Everything below is machinery for honouring those principles in practice.

CITATION.cff: telling people how to cite your code

The most practical first step a software author can take is to add a CITATION.cff file to the repository. Citation File Format (CFF) is a simple, human- and machine-readable YAML format, developed through the Research Software Engineering community, that records exactly how a piece of software should be cited: its authors, title, version, release date, and any associated DOI. Placing a CITATION.cff file in the root of a repository means that anyone — and any tool — can find the canonical citation rather than guessing it.
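As a rough illustration of how little such a file needs to contain, the sketch below renders a minimal CITATION.cff from the fields listed above. The function name `render_citation_cff`, the example author, and the placeholder DOI are all hypothetical; the key names (`cff-version`, `message`, `title`, `authors`, `version`, `date-released`, `doi`) follow CFF version 1.2.0. It is a sketch only: a real project would more likely write the file by hand or use a dedicated CFF tool.

```python
def render_citation_cff(title, authors, version, date_released, doi=None):
    """Return the text of a minimal CITATION.cff file.

    `authors` is a list of (given_names, family_names) tuples.
    All values are written as double-quoted YAML strings.
    """
    def quote(value):
        # Escape backslashes and double quotes so the YAML scalar stays valid.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = [
        "cff-version: 1.2.0",
        "message: " + quote("If you use this software, please cite it as below."),
        "title: " + quote(title),
        "authors:",
    ]
    for given, family in authors:
        lines.append("  - given-names: " + quote(given))
        lines.append("    family-names: " + quote(family))
    lines.append("version: " + quote(version))
    lines.append("date-released: " + quote(date_released))
    if doi is not None:
        lines.append("doi: " + quote(doi))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical project details, for illustration only.
    print(render_citation_cff(
        title="Example Analysis Pipeline",
        authors=[("Ada", "Example")],
        version="1.0.0",
        date_released="2024-01-15",
        doi="10.5281/zenodo.0000000",  # placeholder, not a real record
    ))
```

Saving the returned text as CITATION.cff in the repository root is all that step requires; the DOI line can be added once a release has been archived.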

The format is well supported. GitHub, for instance, reads a CITATION.cff file and surfaces a ready-made citation in the repository interface, and reference managers and conversion tools can transform it into BibTeX or other formats. It turns “how should I cite this tool?” from an awkward question into a one-click answer, and it puts the authors in control of how their work is attributed.

Persistent identifiers: the archived DOI and the SWHID

A repository URL is not a citation. Repositories move, get renamed, or disappear, and a link to the latest state of a project does not pin down the version that produced a result. Two complementary identifiers solve this.

The first is an archived DOI. Depositing a release of the software in an archive such as Zenodo, which integrates directly with GitHub so that publishing a release can trigger archiving and mint a DOI automatically, produces a DataCite DOI that resolves persistently to that exact version, with structured metadata describing the authors, version, and licence. The DOI is what goes in a reference list, exactly as an article DOI would, and it satisfies the principles’ demand to cite a specific, accessible version. Archives of this kind typically also mint a concept DOI for the software as a whole alongside the version-specific DOIs, so a citation can point either at “this release” or at “the software in general” as appropriate. When a result depends on the code, the version-specific DOI is the one to cite.
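To make the difference between a bare identifier and a citable reference concrete, here is a minimal sketch that normalises a DOI into its resolvable https://doi.org form and assembles a reference-list entry around it. The function names, the reference layout, and the placeholder DOI are assumptions made for illustration; they are not a style that any archive or the FORCE11 principles prescribe.

```python
def doi_url(doi):
    """Turn a DOI written in any common form into a resolvable https://doi.org URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI begins with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi


def format_software_reference(authors, year, title, version, doi):
    """Assemble one illustrative reference-list entry for a software release."""
    return (
        f"{authors} ({year}). {title} (Version {version}) "
        f"[Computer software]. {doi_url(doi)}"
    )


if __name__ == "__main__":
    # Placeholder values; the DOI below does not point at a real record.
    print(format_software_reference(
        authors="Example, A.",
        year=2024,
        title="Example Analysis Pipeline",
        version="1.0.0",
        doi="doi:10.5281/zenodo.0000000",
    ))
```

Passing the version-specific DOI pins the reference to one release; passing the concept DOI instead would make the same entry point at the software as a whole.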

The second is the Software Heritage identifier (SWHID). Software Heritage is a non-profit initiative that systematically archives source code from public repositories at scale, with the explicit mission of preserving the world’s software as a commons. It assigns intrinsic, content-derived identifiers (SWHIDs) that can pin down not just a release but a precise commit, directory, or even a single file. Because a SWHID is computed from the content itself, anyone can recompute it from the code they retrieve and confirm that it is byte-for-byte the code that was cited. An archived DOI gives a citable, version-level reference with rich metadata; a SWHID gives a fine-grained, intrinsically verifiable anchor to the source. Used together they cover both the citation layer and the deep-reproducibility layer.
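The "computed from the content itself" property can be shown in a few lines. The sketch below covers only the simplest case, a single file, whose SWHID takes the form `swh:1:cnt:` followed by the same SHA-1 that Git assigns to a blob with those bytes. Identifiers for directories, commits, and releases are derived from structured manifests and are not handled here; the function names are illustrative.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a single file's content.

    The digest is the Git blob hash: SHA-1 over the header
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


def matches_cited_swhid(data: bytes, cited_swhid: str) -> bool:
    """Check that retrieved bytes are exactly the content that was cited."""
    return content_swhid(data) == cited_swhid


if __name__ == "__main__":
    source = b"print('hello')\n"      # stand-in for a retrieved source file
    cited = content_swhid(source)     # what a paper would have recorded
    print(cited)
    print(matches_cited_swhid(source, cited))         # identical bytes
    print(matches_cited_swhid(source + b" ", cited))  # one extra byte
```

Because the identifier depends only on the bytes, the check needs no registry lookup: any change to the file, however small, yields a different SWHID.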

Crediting the people: the CRediT Software role

Identifying the software is half the task; crediting the humans who wrote it is the other half. A DOI and a SWHID identify the artefact; they do not record who did the work. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Software role, defined as programming and software development — designing computer programs, implementing the code and supporting algorithms, and testing existing code components. Recording the Software role on the paper that the code supports makes visible the often-uncredited engineering effort behind a result, and where degree-of-contribution is recorded it distinguishes the lead developer from supporting contributors.

The layers complement each other precisely. The DOI and SWHID say what the software is, where it lives, and which version; the CITATION.cff says how to cite it; and the CRediT Software role says who built it. Used together they ensure that both the code and the people behind it are visible — rather than the common outcome where neither is.

A practical recipe

  1. Add a CITATION.cff to the repository root so there is a canonical, machine-readable citation.
  2. Archive each release — for instance via the GitHub–Zenodo integration — to mint a version-specific DOI, and cite that DOI, not the bare repository URL.
  3. Use the SWHID where byte-level reproducibility matters, pinning the exact commit the result depended on.
  4. Apply a clear open-source licence — software that cannot be reused with confidence will not be reused.
  5. Record the Software role via CRediT on the associated paper, so the developers are credited alongside the other contributors.
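The recipe above can be read as a checklist, and a checklist is easy to automate. The following sketch is one hypothetical way to do that: the `ReleaseRecord` class, its field names, and the `missing_steps` function are invented for illustration and do not correspond to any existing tool. Step 3 is treated as conditional, mirroring the recipe's "where byte-level reproducibility matters".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseRecord:
    """Hypothetical summary of one software release; all field names are illustrative."""
    has_citation_cff: bool = False
    version_doi: Optional[str] = None
    swhid: Optional[str] = None
    licence: Optional[str] = None
    software_role_contributors: List[str] = field(default_factory=list)
    needs_byte_level_reproducibility: bool = False


def missing_steps(record: ReleaseRecord) -> List[str]:
    """Return the recipe steps this release has not yet satisfied, in recipe order."""
    gaps = []
    if not record.has_citation_cff:
        gaps.append("1. add a CITATION.cff to the repository root")
    if not record.version_doi:
        gaps.append("2. archive the release and cite its version-specific DOI")
    # A SWHID is only required when exact, byte-level reproducibility matters.
    if record.needs_byte_level_reproducibility and not record.swhid:
        gaps.append("3. record the SWHID of the exact commit")
    if not record.licence:
        gaps.append("4. apply a clear open-source licence")
    if not record.software_role_contributors:
        gaps.append("5. record the CRediT Software role on the paper")
    return gaps


if __name__ == "__main__":
    # Placeholder values for a release that has only completed steps 1 and 4.
    draft = ReleaseRecord(has_citation_cff=True, licence="MIT")
    for gap in missing_steps(draft):
        print(gap)
```

A check like this could run before a paper is submitted, so that gaps in identification or credit surface while they are still cheap to fix.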

Where shared vocabulary fits

“Software”, “version”, “release”, “repository”, and “software citation” are used loosely across communities, which is part of why credit for code leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 Software Citation Principles and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain, with adjacent entries in the reproducibility domain.
