Citation File Format – CASRAI Dictionary

A great deal of modern research is, in practice, software. Analyses run on code written by the research team; results depend on the exact version of a pipeline; reproducibility hinges on someone being able to find and run that code. And yet software remains the most under-credited output in the scholarly record — cited informally in a footnote, if at all, and rarely recorded as a first-class object with its own identity. This article sets out how to change that, using the small stack of standards that now makes software properly citable. It builds on the broader taxonomy in the research-outputs domain and connects directly to the practices of the reproducibility domain, where citable software is a precondition for reproducible work.

Why software citation matters

Treating software as a citable output does two distinct jobs. The first is credit: the people who built a tool deserve recognition for an intellectual contribution that is often as substantial as the paper it enabled, and that recognition only flows if the software is cited as software, not buried in prose. The second is reproducibility: a result is only checkable if a reader can identify the exact code — the specific version — that produced it. A vague mention of “our in-house scripts” serves neither goal. A formal citation to a specific, versioned, identified software object serves both.

The community reference point here is the software citation principles articulated by the FORCE11 Software Citation Working Group, which establish that software should be a legitimate, citable product of research, cited on the same footing as any other output, with credit, persistence, accessibility, and specificity (down to the version) as core requirements. Everything below is machinery for honouring those principles.

The building blocks

CITATION.cff: telling people how to cite your code

The simplest, highest-leverage step is to add a Citation File Format file, a plain-text file named CITATION.cff, to the root of a software repository. It is a small, human- and machine-readable YAML file that states the authors, title, version, and preferred citation for the software. Its value is that it removes ambiguity: instead of a would-be citer guessing, the repository itself declares how it wants to be cited. Major code-hosting platforms recognise the file and surface a ready-made citation from it (GitHub, for example, shows a “Cite this repository” option), which sharply lowers the effort of citing software correctly.
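
As a rough illustration of how small such a file is, the sketch below renders a minimal CITATION.cff from a handful of fields. The function name render_citation_cff and all sample values are hypothetical; the keys (cff-version, message, title, version, date-released, doi, authors) follow the Citation File Format 1.2.0 schema as we understand it, and a real file should still be validated against that schema.

```python
import json


def render_citation_cff(title, version, date_released, authors, doi=None):
    """Return the text of a minimal CITATION.cff file (illustrative only).

    json.dumps is used to quote string values, because a JSON string is
    also a valid YAML double-quoted scalar.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {json.dumps(title)}",
        f"version: {json.dumps(version)}",
        f"date-released: {json.dumps(date_released)}",
    ]
    if doi:
        lines.append(f"doi: {json.dumps(doi)}")
    lines.append("authors:")
    # Each author is a (family name, given name) pair.
    for family, given in authors:
        lines.append(f"  - family-names: {json.dumps(family)}")
        lines.append(f"    given-names: {json.dumps(given)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values, not a real project.
    print(render_citation_cff("Example Tool", "1.0.0", "2024-01-15",
                              [("Doe", "Jane")]))
```

Saving the returned text as CITATION.cff in the repository root is all that is needed for the file to be discoverable.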

CodeMeta — describing software in interoperable metadata

Where CITATION.cff covers the citation, CodeMeta covers the fuller description. CodeMeta is a metadata standard, built on Schema.org and expressed as JSON-LD (conventionally in a codemeta.json file), that captures rich, structured information about a piece of software: its authors and contributors, licence, programming language, dependencies, related identifiers, funding, and more. Its purpose is interoperability: it provides a shared crosswalk so that the same software metadata can move between repositories, archives, registries, and citation systems without being re-keyed. Where CITATION.cff answers “how do I cite this?”, CodeMeta answers “how do I describe this completely and portably?”
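
A minimal sketch of what a codemeta.json might contain, built here as a Python dictionary and serialised as JSON-LD. The helper build_codemeta and every sample value are hypothetical; the property names (name, version, license, programmingLanguage, codeRepository, author) are CodeMeta and Schema.org terms, and the context URL assumes CodeMeta 3.0, so adjust it if a project targets another version.

```python
import json


def build_codemeta(name, version, license_url, language, repository, authors):
    """Return a dictionary that serialises to a minimal codemeta.json."""
    return {
        "@context": "https://w3id.org/codemeta/3.0",  # assumed CodeMeta version
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "license": license_url,
        "programmingLanguage": language,
        "codeRepository": repository,
        # Each author is a (family name, given name) pair.
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for family, given in authors
        ],
    }


if __name__ == "__main__":
    # Placeholder values, not a real project.
    record = build_codemeta(
        name="Example Tool",
        version="1.0.0",
        license_url="https://spdx.org/licenses/MIT",
        language="Python",
        repository="https://example.org/example-tool",
        authors=[("Doe", "Jane")],
    )
    print(json.dumps(record, indent=2))
```

Because the record is plain JSON-LD, the same file can be read by an archive, a registry, or a citation tool without any of them needing project-specific parsing.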

Software Heritage and the SWHID — guaranteeing the code persists

A citation is worthless if the thing it points to vanishes. Software Heritage is a non-profit initiative that systematically archives source code from public repositories into a permanent archive, so that the code remains available even if its original host disappears. Every archived object is identified by a SWHID (Software Heritage Identifier): an intrinsic, content-derived persistent identifier that pins down an exact snapshot, release, revision, directory, or file and, with a qualifier, a specific range of lines within a file. Because the SWHID is computed from the content itself, it is precise and tamper-evident in a way that a mutable repository URL can never be: it identifies exactly this code, and any change to the code yields a different identifier.
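
To make “content-derived” concrete, here is a minimal sketch that computes the SWHID of a single file's content. It assumes the version 1 scheme for content objects, in which the hash is the same SHA-1 that Git computes for a blob; the function name swhid_for_content is hypothetical, and directories, revisions, releases, and snapshots use other, more involved hashing rules that are not shown.

```python
import hashlib


def swhid_for_content(data: bytes, lines=None) -> str:
    """Return the SWHID of a file's content, optionally with a lines qualifier.

    The digest is SHA-1 over a Git-style blob header followed by the bytes.
    `lines` is an optional (first, last) pair of 1-based line numbers.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    swhid = f"swh:1:cnt:{digest}"
    if lines is not None:
        first, last = lines
        if first == last:
            swhid += f";lines={first}"
        else:
            swhid += f";lines={first}-{last}"
    return swhid


if __name__ == "__main__":
    original = b"print('hello')\n"
    edited = b"print('hello!')\n"
    print(swhid_for_content(original))
    print(swhid_for_content(edited))  # a one-character edit changes the identifier
    print(swhid_for_content(original, lines=(1, 1)))
```

Anyone holding the same bytes can recompute the identifier independently, which is what makes it verifiable without trusting a registry.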

The DOI — the citable, version-pinned reference

Finally, to make software citable alongside articles and data, mint a DOI for each released version. The common route is to connect a code repository to an archive such as Zenodo, which deposits a snapshot of each release and assigns it a DataCite DOI: typically a version-specific DOI for each release, plus a concept DOI that represents all versions and resolves to the latest. The version-specific DOI is what goes in a reference list, because it satisfies the citation principles’ demand for specificity.
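
The sketch below shows how a version-specific DOI might end up in a reference-list entry. The layout is loosely APA-like and is only one possible style; the function names and the DOI shown are placeholders, not identifiers of any real deposit.

```python
def doi_url(doi: str) -> str:
    """Turn a bare DOI into its resolver URL."""
    return "https://doi.org/" + doi.strip()


def format_software_reference(authors, year, title, version, doi):
    """Format one reference-list entry for a specific software version."""
    names = ", ".join(authors)
    return (
        f"{names} ({year}). {title} (Version {version}) "
        f"[Computer software]. {doi_url(doi)}"
    )


if __name__ == "__main__":
    # Placeholder DOI: substitute the version-specific DOI of the release used.
    print(format_software_reference(
        authors=["Doe, J."],
        year=2024,
        title="Example Tool",
        version="1.0.0",
        doi="10.5281/zenodo.0000000",
    ))
```

Passing the version-specific DOI rather than the concept DOI keeps the reference pinned to the exact release that produced the result.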

Putting it together: a practical recipe

Add a CITATION.cff to the repository root, so anyone can cite the software correctly without guessing.
Add a codemeta.json for rich, portable metadata — authors, licence, dependencies, funding — that travels between systems.
Apply a clear licence. Unlicensed code cannot be reused with confidence; software citation assumes the reuse terms are stated.
Archive releases and mint a DOI (for example via Zenodo), so each version is independently citable and pinned.
Reference the Software Heritage archive / SWHID for the strongest persistence and exact-version identification, especially in reproducibility packages.
Cite software in your own work the way you want your own to be cited — close the loop by treating other people’s tools as first-class outputs.
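
The file-based parts of the recipe above can be checked mechanically. The following sketch reports which of the recommended files are absent from a repository checkout. The function name missing_citation_files is hypothetical, and the exact file name LICENSE is an assumption: many projects use LICENSE.md, LICENSE.txt, or COPYING instead, so extend the list to match local convention.

```python
from pathlib import Path

# Files the recipe asks for; LICENSE is an assumed name (see note above).
RECOMMENDED_FILES = ("CITATION.cff", "codemeta.json", "LICENSE")


def missing_citation_files(repo_root):
    """Return the recommended files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in RECOMMENDED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_citation_files(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All recommended citation files are present.")
```

A check like this fits naturally into a release checklist or a continuous-integration step; it cannot confirm that a DOI was minted or that the code was archived, which still need to be verified by hand.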

Crediting the people, not just the artefact

Identifying the software is half the task; crediting the contributors is the other half. The CRediT taxonomy includes a dedicated Software role, covering programming, software development, design of computer programs, implementation of code and supporting algorithms, and testing of existing code components, which lets a contribution made primarily in code be recorded on the associated paper. CRediT records the human contribution; CITATION.cff, CodeMeta, the SWHID, and the DOI record and persist the artefact. Used together they ensure that both the code and the people who wrote it are visible in the record, rather than the all-too-common outcome where neither is.

Where shared vocabulary fits

“Research software”, “version”, “snapshot”, “release”, and “software citation” are used inconsistently across communities, which is part of why software credit leaks away. A shared, federated vocabulary that defines these terms precisely — pointing back to the FORCE11 software citation principles, to CodeMeta, and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain.


Software citation and CodeMeta: making code a first-class output