Tag: CITATION.cff

  • DataCite, GitHub, Zenodo: the three-cornered software-citation stack

    Software citation in 2026 mostly runs on a three-cornered stack: a code repository (typically GitHub), an archiving service that issues DOIs (typically Zenodo), and the DataCite infrastructure that registers and resolves the DOIs. The integration between the three is more polished than it was five years ago and substantially less polished than it could be. This post walks through the current state and what integrators should do.

    The pattern that works

    The community has converged on a single operational pattern. A research-software project lives in a Git repository (often on GitHub, increasingly on GitLab or other forges). At each release, the repository is archived to Zenodo, which mints a DOI for that release; Zenodo also issues a concept DOI for the project as a whole, which resolves to the latest release. The repository carries a CITATION.cff file specifying how to cite the software, including the Zenodo DOI and the contributor list. The published paper (if any) cites the software via the Zenodo DOI.

    The integration works at the technical layer. GitHub-Zenodo integration is documented and stable. CITATION.cff is supported by GitHub’s repository UI for human-readable citations and by an increasing number of tools (Zenodo, JOSS, R packages’ references) for machine processing. DataCite’s metadata supports the software-type record with CRediT-aligned contributor roles where the depositor provides them.

    What’s good

    Three things this stack does well.

    First, versioning. Software is versioned; citation should be versionable. The concept-DOI plus per-version-DOI pattern lets a paper cite either the specific version it used or the project conceptually, with the appropriate DOI. This is the right design for software citation and the community has converged on it.
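
    The choice between the two kinds of DOI can be sketched in a few lines of Python. This is a minimal illustration, not part of any archive's API: the `SoftwareRecord` class, the `doi_to_cite` function, and the DOIs themselves are hypothetical placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareRecord:
    """Hypothetical container for the two kinds of DOI an archive issues."""

    concept_doi: str               # stands for the project as a whole
    version_dois: dict[str, str]   # release tag -> version-specific DOI


def doi_to_cite(record: SoftwareRecord, used_version: str | None = None) -> str:
    """Return the version DOI when a specific release was used, else the concept DOI."""
    if used_version is None:
        return record.concept_doi
    try:
        return record.version_dois[used_version]
    except KeyError:
        raise ValueError(f"no archived DOI for version {used_version!r}") from None


# Placeholder DOIs for illustration only; they do not point at real records.
record = SoftwareRecord(
    concept_doi="10.5072/example.concept",
    version_dois={
        "1.0.0": "10.5072/example.v1.0.0",
        "1.1.0": "10.5072/example.v1.1.0",
    },
)

print(doi_to_cite(record, "1.0.0"))  # a paper that ran release 1.0.0
print(doi_to_cite(record))           # a paper that mentions the project in general
```

    Raising an error for an unarchived version, rather than silently falling back to the concept DOI, keeps a paper from claiming version-level specificity it does not have.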

    Second, open infrastructure. Zenodo is operated by CERN as a public infrastructure; DataCite is a community-governed organisation. The depositor’s investment in software citation does not lock them into a commercial vendor. This matters for sustainability.

    Third, integration with FAIR4RS. The stack operationalises both the FAIR4RS Principles and the earlier FORCE11 software citation principles. A FAIR-aligned software project should have an archived release with a DOI, rich metadata, and a contributor record, all of which the stack supports.

    What’s still rough

    Four issues at the seams.

    First, the GitHub dependency. The dominant code-hosting platform is a commercial service owned by a major tech company. The Zenodo integration is GitHub-specific in important ways (the auto-archival webhook, the metadata propagation from the GitHub release to Zenodo). GitLab and other forges have lighter-weight integration patterns. The community’s reliance on GitHub for the code-hosting corner of the stack creates a single-vendor risk of which the FAIR-software community is increasingly aware. Software Heritage’s archive of public repositories provides some long-term resilience but is not a substitute for the operational integration.

    Second, metadata fidelity at deposit. The GitHub-Zenodo automatic deposit captures repository metadata, but the fidelity is variable. CITATION.cff is honoured if present and well-formed; in its absence, Zenodo defaults to repository-level metadata that may not reflect the contributor structure the developers intended. Projects without a CITATION.cff end up with lower-quality Zenodo records.
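
    A cheap pre-release check for "present and well-formed" might look like the sketch below. It only looks for the four top-level keys that CFF 1.2.0 requires (`cff-version`, `message`, `title`, `authors`) by scanning unindented lines; it is not a YAML parser or a schema validator, and the function name is an assumption made for illustration. A dedicated tool such as cffconvert is the better choice for real validation.

```python
import re

# Top-level keys that the CFF 1.2.0 schema marks as required.
REQUIRED_KEYS = {"cff-version", "message", "title", "authors"}

# Matches an unindented "key:" at the start of a line.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z][\w-]*):")


def missing_required_keys(cff_text: str) -> set:
    """Return the required top-level keys that do not appear in the text.

    Only unindented `key:` lines are inspected, so nested keys are ignored.
    """
    present = set()
    for line in cff_text.splitlines():
        match = TOP_LEVEL_KEY.match(line)
        if match:
            present.add(match.group(1))
    return REQUIRED_KEYS - present


# Illustrative file content with the `message` key left out on purpose.
example = """\
cff-version: 1.2.0
title: Example Tool
authors:
  - family-names: Doe
    given-names: Jane
"""

print(missing_required_keys(example))
```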

    Third, the CRediT-CITATION.cff alignment. CITATION.cff supports a contributors list with type-of-contribution; the type-of-contribution vocabulary has converged on a CRediT-aligned set but the alignment is not strict. Tools that translate CITATION.cff to CRediT-compliant DataCite metadata produce slightly different results. The Software Citation Working Group has been working on the formal alignment; the work is partly complete.

    Fourth, versioning of the contributor record. CITATION.cff in the repository captures current contributorship; the Zenodo deposit captures contributorship as of the deposit. A project that adds contributors after a release has a stale Zenodo record for that release until the next release. The trade-off (mutable vs immutable per-version records) is a real one; the community has accepted immutable per-version records as the better default.

    What integrators should do

    For software-paper authors and software developers, the practical advice in 2026 is: maintain a CITATION.cff in every research-software repository; archive every meaningful release to Zenodo; cite the specific Zenodo DOI in publications that use the software; cite the concept DOI in publications that reference the project conceptually. The CASRAI software-citation authors guide walks through the patterns.

    For journals publishing software papers, the recommendation is to require CITATION.cff and a Zenodo (or equivalent) deposit at submission, to verify the consistency between the CITATION.cff and the paper’s contributorship statement, and to cite the Zenodo DOI in the published paper. JOSS does all of this; other software-paper venues should follow.
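
    The consistency check between the CITATION.cff and the paper's contributorship statement can be partly automated. The sketch below assumes both sides list ORCID iDs and compares them as sets; the function names are hypothetical, and the two `0000-0000-...` iDs are made-up placeholders (0000-0002-1825-0097 is ORCID's own fictitious example researcher).

```python
def normalise_orcid(value: str) -> str:
    """Reduce an ORCID URL or bare iD to the bare hyphenated form."""
    return value.strip().rstrip("/").rsplit("/", 1)[-1].upper()


def compare_contributors(cff_orcids, paper_orcids):
    """Report iDs that appear on only one side of the comparison."""
    cff = {normalise_orcid(o) for o in cff_orcids}
    paper = {normalise_orcid(o) for o in paper_orcids}
    return {
        "only_in_cff": sorted(cff - paper),
        "only_in_paper": sorted(paper - cff),
    }


# Illustrative inputs mixing URL and bare forms.
report = compare_contributors(
    cff_orcids=["https://orcid.org/0000-0002-1825-0097", "0000-0000-0000-0001"],
    paper_orcids=["0000-0002-1825-0097", "https://orcid.org/0000-0000-0000-0002/"],
)
print(report)
```

    A mismatch is not necessarily an error, since a paper and its software can legitimately have different contributors, so output like this is best treated as a prompt for an editor's question rather than a rejection rule.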

    For institutions, the recommendation is to ingest software-DOI records into CRIS systems as a first-class research output, to surface them in researcher dashboards alongside publications, and to recognise software contribution in promotion and tenure assessment. The CASRAI research outputs domain tracks the institutional implementation patterns.

    For the broader infrastructure community, two priorities. First, support non-GitHub code-hosting integration with Zenodo; the single-vendor concentration is a real risk. Second, complete the CRediT-CITATION.cff alignment work; the operational ambiguity is small but real.

    What’s coming

    Two developments to watch in 2026-2027. First, the Software Heritage citation integration: Software Heritage archives the world’s public source code and assigns SWHIDs (Software Heritage Identifiers). The integration of SWHIDs as a complementary identifier alongside Zenodo DOIs is in progress; the relationship between SWHID and DOI for the same software release is in design. Second, per-version contributor records: the community has been chewing on whether per-version CRediT statements deposited to Crossref or DataCite would be useful for software. The technical viability is clear; the community-consensus and tool-support work is in motion.

    For the moment, the three-cornered stack does the job. The seams are real but workable. Software citation has moved from being a research-software-engineering aspiration to an operational practice; the further refinements are about polish, not foundation.


  • Software citation and CodeMeta: making code a first-class output

    A great deal of modern research is, in practice, software. Analyses run on code written by the research team; results depend on the exact version of a pipeline; reproducibility hinges on someone being able to find and run that code. And yet software remains the most under-credited output in the scholarly record — cited informally in a footnote, if at all, and rarely recorded as a first-class object with its own identity. This article sets out how to change that, using the small stack of standards that now makes software properly citable. It builds on the broader taxonomy in the research-outputs domain and connects directly to the practices of the reproducibility domain, where citable software is a precondition for reproducible work.

    Why software citation matters

    Treating software as a citable output does two distinct jobs. The first is credit: the people who built a tool deserve recognition for an intellectual contribution that is often as substantial as the paper it enabled, and that recognition only flows if the software is cited as software, not buried in prose. The second is reproducibility: a result is only checkable if a reader can identify the exact code — the specific version — that produced it. A vague mention of “our in-house scripts” serves neither goal. A formal citation to a specific, versioned, identified software object serves both.

    The community reference point here is the software citation principles articulated by the FORCE11 Software Citation Working Group, which establish that software should be a legitimate, citable product of research, cited on the same footing as any other output, with credit, persistence, accessibility, and specificity (down to the version) as core requirements. Everything below is machinery for honouring those principles.

    The building blocks

    CITATION.cff — telling people how to cite your code

    The simplest, highest-leverage step is to add a Citation File Format file, a plain-text file named CITATION.cff, to the root of a software repository. It is a small, human- and machine-readable YAML file that states the authors, title, version, and preferred citation for the software. Its value is that it removes ambiguity: instead of a would-be citer guessing, the repository itself declares how it wants to be cited. Major code-hosting platforms recognise the file and surface a ready-made citation from it, which sharply lowers the effort of citing software correctly.
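
    To make the shape of the file concrete, here is a small Python sketch that renders a minimal CITATION.cff. The keys it writes (`cff-version`, `message`, `title`, `version`, `date-released`, `doi`, `authors` with `family-names`, `given-names`, `orcid`) are CFF 1.2.0 keys; the function name, the input dictionary layout, and all the values are illustrative assumptions.

```python
def render_citation_cff(title, version, date_released, doi, authors):
    """Render a minimal CITATION.cff as text.

    Assumes the values contain no double quotes; a real generator should
    use a YAML library to handle escaping.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'date-released: "{date_released}"',
        f'doi: "{doi}"',
        "authors:",
    ]
    for author in authors:
        lines.append(f'  - family-names: "{author["family"]}"')
        lines.append(f'    given-names: "{author["given"]}"')
        if author.get("orcid"):
            lines.append(f'    orcid: "{author["orcid"]}"')
    return "\n".join(lines) + "\n"


# Placeholder project details for illustration only.
text = render_citation_cff(
    title="Example Tool",
    version="1.0.0",
    date_released="2026-01-15",
    doi="10.5072/example.v1.0.0",
    authors=[
        {"family": "Doe", "given": "Jane", "orcid": "https://orcid.org/0000-0002-1825-0097"},
        {"family": "Roe", "given": "Richard"},
    ],
)
print(text)
```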

    CodeMeta — describing software in interoperable metadata

    Where CITATION.cff covers the citation, CodeMeta covers the fuller description. CodeMeta is a metadata standard, built on Schema.org and expressed as JSON-LD (conventionally in a codemeta.json file), that captures rich, structured information about a piece of software: its authors and contributors, licence, programming language, dependencies, related identifiers, funding, and more. Its purpose is interoperability: it provides a shared crosswalk so that the same software metadata can move between repositories, archives, registries, and citation systems without being re-keyed. Where CITATION.cff answers “how do I cite this?”, CodeMeta answers “how do I describe this completely and portably?”
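
    A codemeta.json file is ordinary JSON-LD, so it can be produced with nothing more than the standard library. The sketch below builds a small document using a handful of CodeMeta and Schema.org terms (`name`, `version`, `license`, `codeRepository`, `programmingLanguage`, `author`); the function name, the input layout, and the example values are assumptions, and a real record would usually carry many more fields.

```python
import json


def build_codemeta(name, version, license_url, repo_url, language, authors):
    """Build a minimal CodeMeta description as a plain dictionary."""
    return {
        "@context": "https://w3id.org/codemeta/3.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "license": license_url,
        "codeRepository": repo_url,
        "programmingLanguage": language,
        "author": [
            {"@type": "Person", "givenName": a["given"], "familyName": a["family"]}
            for a in authors
        ],
    }


# Placeholder project details for illustration only.
codemeta = build_codemeta(
    name="Example Tool",
    version="1.0.0",
    license_url="https://spdx.org/licenses/MIT",
    repo_url="https://example.org/jane/example-tool",
    language="Python",
    authors=[{"given": "Jane", "family": "Doe"}],
)

# This text is what would be written to codemeta.json in the repository root.
print(json.dumps(codemeta, indent=2))
```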

    Software Heritage and the SWHID — guaranteeing the code persists

    A citation is worthless if the thing it points to vanishes. Software Heritage is a non-profit initiative that systematically archives source code from public repositories into a permanent archive, ensuring the code remains available even if its original host disappears. It issues a SWHID (Software Heritage Identifier) — an intrinsic, content-derived persistent identifier that pins down an exact snapshot, revision, or even a single line of source code. Because the SWHID is computed from the content itself, it is precise and tamper-evident in a way that a mutable repository URL can never be: it identifies exactly this code, forever.
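
    "Computed from the content itself" can be shown directly. For a single file, the content SWHID uses the same hash Git uses for a blob: SHA-1 over a short header followed by the file's bytes. The sketch below covers only that content (`cnt`) case; identifiers for directories, revisions, releases, and snapshots are computed differently and are not covered here. The function name is an assumption.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content (object type `cnt`).

    The digest is the Git blob hash: SHA-1 of b"blob <length>\\0" + data.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(content_swhid(b"hello world\n"))
```

    Because the identifier depends only on the bytes, anyone holding a copy of the file can recompute it and confirm they have exactly what was cited, without trusting the host it came from.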

    The DOI — the citable, version-pinned reference

    Finally, to make software citable alongside articles and data, mint a DOI for a released version. The common route is to connect a code repository to an archive such as Zenodo, which deposits a snapshot of each release and assigns it a DataCite DOI — typically with a version-specific DOI for each release plus a concept DOI that always resolves to the latest. That DOI is what goes in a reference list, and because it is version-specific, it satisfies the citation principles’ demand for specificity.

    Putting it together: a practical recipe

    1. Add a CITATION.cff to the repository root, so anyone can cite the software correctly without guessing.
    2. Add a codemeta.json for rich, portable metadata — authors, licence, dependencies, funding — that travels between systems.
    3. Apply a clear licence. Unlicensed code cannot be reused with confidence; software citation assumes the reuse terms are stated.
    4. Archive releases and mint a DOI (for example via Zenodo), so each version is independently citable and pinned.
    5. Reference the Software Heritage archive / SWHID for the strongest persistence and exact-version identification, especially in reproducibility packages.
    6. Cite software in your own work the way you want your own to be cited — close the loop by treating other people’s tools as first-class outputs.
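
    The file-based steps of the recipe above (1 to 3) lend themselves to a quick automated check, for instance in a pre-release script. The sketch below is one possible shape: the function name and the licence filename prefixes are assumptions, and it checks only that the files exist, not that their contents are valid.

```python
import tempfile
from pathlib import Path


def citation_readiness(repo_root):
    """Report which citation-related files exist in the repository root."""
    names = {p.name for p in Path(repo_root).iterdir() if p.is_file()}
    licence_prefixes = ("LICENSE", "LICENCE", "COPYING")
    return {
        "CITATION.cff": "CITATION.cff" in names,
        "codemeta.json": "codemeta.json" in names,
        "licence file": any(n.upper().startswith(licence_prefixes) for n in names),
    }


# Demonstration on a throwaway directory that lacks a codemeta.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CITATION.cff").write_text("cff-version: 1.2.0\n")
    (Path(tmp) / "LICENSE").write_text("placeholder licence text\n")
    report = citation_readiness(tmp)

print(report)
```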

    Crediting the people, not just the artefact

    Identifying the software is half the task; crediting the contributors is the other half. The CRediT taxonomy includes a dedicated Software role (programming and software development; designing computer programs; implementation of the computer code and supporting algorithms; testing of existing code components), which lets a contribution made primarily in code be recorded on the associated paper. CRediT records the human contribution; CITATION.cff, CodeMeta, the SWHID, and the DOI record and persist the artefact. Used together they ensure that both the code and the people who wrote it are visible in the record, rather than the all-too-common outcome where neither is.

    Where shared vocabulary fits

    “Research software”, “version”, “snapshot”, “release”, and “software citation” are used inconsistently across communities, which is part of why software credit leaks away. A shared, federated vocabulary that defines these terms precisely — pointing back to the FORCE11 software citation principles, to CodeMeta, and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain.


  • Research software as a first-class output: citation, identifiers and credit

    A great deal of modern research runs on software the researchers wrote themselves — the analysis pipeline, the simulation, the model, the tool that made the result possible. And yet that software is routinely treated as a footnote rather than a finding: mentioned in a methods section, perhaps, but rarely cited as an output in its own right and almost never credited to the people who built it. Treating research software as a first-class output — citable, persistent, and creditable — corrects that, and it is a core concern of the research-outputs domain with a direct line to the reproducibility domain. If a result depends on code, the code is part of the evidence, and the people who wrote it did intellectual work that deserves recognition.

    Why software needs to be cited as software

    Citing software properly does two distinct jobs, and it is worth keeping them apart. The first is credit: building reliable research software — designing it, implementing it, testing it, documenting it — is substantial scholarly work, and it is rewarded only if the software is cited as an output rather than buried in prose. The second is reproducibility: a computational result can only be verified, and the analysis only reused, if a reader can identify and obtain the exact software that produced it, down to the version. A vague reference to “custom scripts” serves neither goal. A formal citation to a versioned, identified piece of software serves both.

    The community reference point here is the FORCE11 Software Citation Principles, developed by the FORCE11 Software Citation Working Group. They establish that software should be considered a legitimate, citable product of research, that citations should give credit to all contributors, that they should identify a specific version using a unique, persistent identifier, and that they should enable access to the software itself. Everything below is machinery for honouring those principles in practice.

    CITATION.cff: telling people how to cite your code

    The most practical first step a software author can take is to add a CITATION.cff file to the repository. Citation File Format (CFF) is a simple, human- and machine-readable YAML format, developed through the Research Software Engineering community, that records exactly how a piece of software should be cited: its authors, title, version, release date, and any associated DOI. Placing a CITATION.cff file in the root of a repository means that anyone — and any tool — can find the canonical citation rather than guessing it.

    The format is well supported. GitHub, for instance, reads a CITATION.cff file and surfaces a ready-made citation in the repository interface, and reference managers and conversion tools can transform it into BibTeX or other formats. It turns “how should I cite this tool?” from an awkward question into a one-click answer, and it puts the authors in control of how their work is attributed.
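
    The conversion step is simple enough to sketch. The function below turns already-parsed CITATION.cff data (a plain dictionary using CFF key names) into a reference-manager entry. It is an illustration of the idea, not the output format of any particular tool; the function name and the sample data are assumptions. Note that `@software` is a biblatex entry type, so plain BibTeX styles may need `@misc` instead.

```python
def cff_to_bibtex(cff: dict, key: str) -> str:
    """Render parsed CITATION.cff data as a biblatex-style @software entry."""
    authors = " and ".join(
        f'{a["family-names"]}, {a["given-names"]}' for a in cff["authors"]
    )
    fields = {
        "author": authors,
        "title": cff["title"],
        "version": cff.get("version"),
        "doi": cff.get("doi"),
        # The year is the first four characters of an ISO date such as 2026-01-15.
        "year": (cff.get("date-released") or "")[:4] or None,
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@software{{{key},\n{body}\n}}"


# Placeholder metadata for illustration only.
example_cff = {
    "title": "Example Tool",
    "version": "1.0.0",
    "doi": "10.5072/example.v1.0.0",
    "date-released": "2026-01-15",
    "authors": [
        {"family-names": "Doe", "given-names": "Jane"},
        {"family-names": "Roe", "given-names": "Richard"},
    ],
}

entry = cff_to_bibtex(example_cff, "doe_example_tool")
print(entry)
```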

    Persistent identifiers: the archived DOI and the SWHID

    A repository URL is not a citation. Repositories move, get renamed, or disappear, and a link to the latest state of a project does not pin down the version that produced a result. Two complementary identifiers solve this.

    The first is an archived DOI. Depositing a release of the software in an archive such as Zenodo — which integrates directly with GitHub so that tagging a release can mint a DOI automatically — produces a DataCite DOI that resolves persistently to that exact version, with structured metadata describing the authors, version, and licence. The DOI is what goes in a reference list, exactly as an article DOI would, and it satisfies the principles’ demand to cite a specific, accessible version. Archives of this kind typically also mint a concept DOI for the software as a whole alongside the version-specific DOIs, so a citation can point either at “this release” or at “the software in general” as appropriate.
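
    The structured metadata behind a DOI can be requested from the DOI resolver through HTTP content negotiation. The sketch below only constructs the request and does not send it, so it runs offline; the function name and the placeholder DOI are assumptions, and a DOI containing unusual characters would need URL-escaping that is omitted here.

```python
from urllib.request import Request


def doi_metadata_request(doi, media_type="application/vnd.citationstyles.csl+json"):
    """Build (but do not send) a content-negotiation request for a DOI."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})


# Placeholder DOI for illustration; it does not point at a real record.
request = doi_metadata_request("10.5072/example.v1.0.0")
print(request.full_url)
print(request.get_header("Accept"))

# Sending it would be a call to urllib.request.urlopen(request), which is
# left out so that the example has no network dependency.
```

    Asking for `application/x-bibtex` instead of CSL JSON is a common variation when the goal is a ready-made reference entry.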

    The second is the Software Heritage identifier (SWHID). Software Heritage is a non-profit initiative that systematically archives source code from public repositories at scale, with the explicit mission of preserving the world’s software as a commons. It assigns intrinsic, content-derived identifiers — SWHIDs — that can pin down not just a release but a precise commit, directory, or even a single file. Because a SWHID is computed from the content itself, it verifies that the code you retrieve is byte-for-byte the code that was cited. An archived DOI gives a citable, version-level reference with rich metadata; a SWHID gives a fine-grained, intrinsically verifiable anchor to the source. Used together they cover both the citation layer and the deep-reproducibility layer.
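
    The granularity of a SWHID is visible in its syntax: a fixed prefix, an object type, a 40-character hexadecimal hash, and optional qualifiers such as `lines` or `path`. The sketch below is a simplified parser written for illustration; the function name and returned dictionary layout are assumptions, and it does not validate qualifier names or values against the SWHID specification.

```python
import re

# Core identifier followed by zero or more ";key=value" qualifiers.
SWHID_RE = re.compile(
    r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})((?:;[a-z]+=[^;]+)*)$"
)

OBJECT_TYPES = {
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}


def parse_swhid(swhid: str) -> dict:
    """Split a SWHID into object type, hash, and qualifiers."""
    match = SWHID_RE.match(swhid)
    if match is None:
        raise ValueError(f"not a well-formed SWHID: {swhid!r}")
    kind, digest, tail = match.groups()
    qualifiers = dict(q.split("=", 1) for q in tail.split(";") if q)
    return {"type": OBJECT_TYPES[kind], "hash": digest, "qualifiers": qualifiers}


# The hash below is that of an empty file; the qualifier is illustrative.
parsed = parse_swhid("swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391;lines=1-3")
print(parsed)
```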

    Crediting the people: the CRediT Software role

    Identifying the software is half the task; crediting the humans who wrote it is the other half. A DOI and a SWHID identify the artefact; they do not record who did the work. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Software role, defined as programming and software development — designing computer programs, implementing the code and supporting algorithms, and testing existing code components. Recording the Software role on the paper that the code supports makes visible the often-uncredited engineering effort behind a result, and where degree-of-contribution is recorded it distinguishes the lead developer from supporting contributors.

    The layers complement each other precisely. The DOI and SWHID say what the software is, where it lives, and which version; the CITATION.cff says how to cite it; and the CRediT Software role says who built it. Used together they ensure that both the code and the people behind it are visible — rather than the common outcome where neither is.

    A practical recipe

    1. Add a CITATION.cff to the repository root so there is a canonical, machine-readable citation.
    2. Archive each release — for instance via the GitHub–Zenodo integration — to mint a version-specific DOI, and cite that DOI, not the bare repository URL.
    3. Use the SWHID where byte-level reproducibility matters, pinning the exact commit the result depended on.
    4. Apply a clear open-source licence — software that cannot be reused with confidence will not be reused.
    5. Record the Software role via CRediT on the associated paper, so the developers are credited alongside the other contributors.

    Where shared vocabulary fits

    “Software”, “version”, “release”, “repository”, and “software citation” are used loosely across communities, which is part of why credit for code leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 Software Citation Principles and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain, with adjacent entries in the reproducibility domain.
