Tag: Software Heritage

  • Computational reproducibility: containers, workflows and FAIR4RS

    A computational result that no one else can re-run is, strictly speaking, a claim rather than a finding. The gap between “the figures are in the paper” and “anyone with the code and data can regenerate the figures” is the gap that computational reproducibility exists to close, and over the past decade a practical toolkit has emerged to close it. This article walks through that toolkit (containers, workflow languages, source-code identifiers, and the FAIR4RS principles), drawing on the reproducibility domain.

    What computational reproducibility means, precisely

    Computational reproducibility is the property that a computational result can be regenerated from the code and data provided with it. It is a narrower and more achievable target than replication: replication asks whether a finding holds when the study is run afresh; computational reproducibility asks only whether the same inputs and the same code produce the same outputs. That sounds trivial and is notoriously not, because a computation depends on far more than the script the author remembers to share: it depends on library versions, the operating system, environment variables, random seeds, and sometimes the hardware.
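
    As a minimal sketch of the random-seed dependency, the Python below uses a hypothetical stand-in for an analysis (the names run_analysis and fingerprint are illustrative, not from any library). Seeding a local generator and hashing the output is one simple way to check that two runs agree exactly.

```python
import hashlib
import random


def run_analysis(seed: int, n: int = 1000) -> float:
    """Toy stand-in for an analysis: the mean of n seeded pseudo-random draws."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    return sum(rng.random() for _ in range(n)) / n


def fingerprint(value: float) -> str:
    """Hash the exact repr of a result so two runs can be compared exactly."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


first = fingerprint(run_analysis(seed=42))
second = fingerprint(run_analysis(seed=42))
print("runs agree:", first == second)
```

    Agreement on one machine is only the first check. In real analyses, numerical libraries and platforms can still change results, which is why the environment has to be captured as well.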

    The artefact that makes reproducibility possible is the reproducibility package (some communities say replication package): a bundle of code, data, and instructions sufficient to reproduce the results of an output. A good package is not a folder of scripts; it is a self-contained, documented environment that a stranger can execute.

    Containers: capturing the environment

    The single largest source of irreproducibility is the environment, and the most effective response is the container image — a packaged, reproducible computational environment that bundles the code together with the exact operating system libraries and dependencies it needs. The modern standard is the OCI (Open Container Initiative) image format, familiar to most through Docker. In high-performance computing, where users cannot run as root, Singularity / Apptainer images serve the same purpose under HPC constraints.

    A container is not the only way to pin an environment. A Conda environment with an exported specification, or a requirements lockfile recording exact dependency versions, achieves much of the same for interpreted-language work. The principle is constant: the environment is part of the result, and must be captured as deliberately as the code itself. Recording it in a structured compute environment record — what ran, on what, with which versions — is what lets a reviewer distinguish a genuine reproduction from an accidental match.
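
    One possible shape for such a record, sketched in Python with the standard library only. The field names are assumptions chosen for illustration, not a published schema, and the function environment_record is hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record() -> dict:
    """Collect a minimal, structured description of the running environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip malformed installs that lack a name
            packages[name] = dist.metadata.get("Version")
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


record = environment_record()
summary = {key: value for key, value in record.items() if key != "packages"}
print(json.dumps(summary, indent=2))
print(len(record["packages"]), "installed distributions recorded")
```

    Writing this record next to the results gives a reviewer something concrete to compare against their own run. It complements a container image or lockfile rather than replacing one.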

    Workflows: capturing the steps

    Capturing the environment is necessary but not sufficient; the steps matter too. A multi-stage analysis run by hand, in an order the author holds in their head, is not reproducible no matter how well the environment is pinned. This is the problem a workflow definition solves: a formal, executable specification of the computational steps and their dependencies.

    Several workflow languages are in wide use, and the CASRAI dictionary treats them as variants of a single concept rather than picking a winner. Common Workflow Language (CWL), the Broad Institute’s WDL, Nextflow, and Snakemake each express a pipeline as a declarative graph of steps, so that the whole analysis can be re-executed with one command. Expressed this way, the workflow is itself a citable research output: the structured record of how the result was produced, not merely a description of it.
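
    The idea of a graph of steps can be illustrated without any of those languages. The sketch below is plain Python, not CWL, WDL, Nextflow, or Snakemake; the step names and toy functions are hypothetical. It declares each step's dependencies and lets the standard library's graphlib work out the execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on (hypothetical names).
WORKFLOW = {
    "download": set(),
    "clean": {"download"},
    "model": {"clean"},
    "figures": {"model", "clean"},
}


def download(inputs):
    return [3, 1, None, 2]  # toy raw data with one missing value


def clean(inputs):
    return sorted(x for x in inputs["download"] if x is not None)


def model(inputs):
    values = inputs["clean"]
    return sum(values) / len(values)


def figures(inputs):
    return f"mean={inputs['model']:.1f} over {len(inputs['clean'])} points"


STEPS = {"download": download, "clean": clean, "model": model, "figures": figures}


def run(workflow, steps):
    """Execute every step after its dependencies and return order and results."""
    results = {}
    order = list(TopologicalSorter(workflow).static_order())
    for name in order:
        inputs = {dep: results[dep] for dep in workflow[name]}
        results[name] = steps[name](inputs)
    return order, results


order, results = run(WORKFLOW, STEPS)
print(order)
print(results["figures"])
```

    The point of the design is that the order lives in the declared dependencies, not in the author's memory. Real workflow engines add caching, containers per step, and parallel execution on top of the same idea.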

    Identifying the software: Software Heritage and SWHID

    Reproducibility presupposes that the code still exists and can be referred to unambiguously, and this is where source-code identifiers come in. Software Heritage is the universal archive of source code, harvesting and preserving public code repositories at scale. It defines the SWHID (Software Heritage Identifier): a persistent identifier that is content-derived and immutable. Because the identifier is computed from the content itself rather than assigned by a registry, it identifies an exact state of the code, and the same SWHID always resolves to byte-identical source.

    This intrinsic property distinguishes a SWHID from a DOI minted for a software release. A DOI (via DataCite, often through a GitHub–Zenodo deposit) gives the release a citable handle and rich metadata; a SWHID guarantees that the specific code referenced is exactly the code archived. The two are complementary, and a robust reproducibility package can carry both: a DOI for citation and a SWHID for byte-level fidelity.
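
    To make "content-derived" concrete, here is a minimal sketch for the simplest case, a single file's content. It assumes the content identifier is the Git-style blob hash (SHA-1 over a short header plus the bytes) prefixed with swh:1:cnt:. The function name swhid_for_content is illustrative; directories, revisions, releases, and snapshots need more elaborate hashing and are not covered.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute a content-type SWHID ('cnt') for the given bytes.

    Assumption: the identifier is the Git blob hash, i.e. SHA-1 over
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


print(swhid_for_content(b""))  # identifier of the empty file
print(swhid_for_content(b"print('hello')\n"))
```

    Anyone holding the bytes can recompute the identifier and compare, with no registry involved. That is the practical meaning of an intrinsic identifier, and it is the check a DOI on its own cannot offer.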

    FAIR4RS: software is not just data

    The FAIR principles — Findable, Accessible, Interoperable, Reusable — were written with data in mind, and applying them naively to software misses what is distinctive about code. FAIR4RS, the FAIR Principles for Research Software, is the RDA-developed adaptation that takes software’s particular nature seriously: software is executable, it has versions and dependencies, it is composed of and depends on other software, and it evolves. FAIR4RS reframes each principle for these realities — findability through a persistent identifier and rich metadata, accessibility of both the software and its description, interoperability through standard formats and dependencies, and reusability through clear licensing, provenance, and documentation. It is the conceptual bridge between the data-centric FAIR data principles and the practical work of making research software reproducible.

    Recognition: reviewing the artefacts

    None of this happens without incentives, and the incentive structure is slowly maturing. Artifact evaluation — peer review of the code, data, and environments behind a paper — is now a standard track at many computer-science venues, and the ACM Artifact Review and Badging programme attaches visible badges to papers whose artefacts have been checked. A reproducibility review targeting the computational results specifically is becoming a recognised contribution in its own right, the kind of work that responsible assessment frameworks aim to make visible alongside conventional outputs.

    Where shared vocabulary fits

    The reproducibility toolkit is mature, but its terms are used loosely across communities — “reproducibility package” and “replication package” name the same thing, workflow languages proliferate, and “reproducible” itself means different things to different fields. A shared, federated vocabulary that defines these terms and points back to the RDA for FAIR4RS and to Software Heritage for the SWHID is what lets a reproducibility claim in one field be understood in another. Supplying that definitional layer is the role the CASRAI dictionary exists to play.

    What to do now

    For researchers: ship a reproducibility package with a pinned environment (a container or lockfile), an executable workflow, and persistent identifiers — a DOI for citation and a SWHID for the exact code. For reviewers and venues: treat artifact evaluation and reproducibility review as first-class, badge-worthy contributions. For standards work: align software vocabulary on FAIR4RS and the persistent-identifier ecosystem rather than letting each community coin its own.


  • Software citation and CodeMeta: making code a first-class output

    A great deal of modern research is, in practice, software. Analyses run on code written by the research team; results depend on the exact version of a pipeline; reproducibility hinges on someone being able to find and run that code. And yet software remains the most under-credited output in the scholarly record — cited informally in a footnote, if at all, and rarely recorded as a first-class object with its own identity. This article sets out how to change that, using the small stack of standards that now makes software properly citable. It builds on the broader taxonomy in the research-outputs domain and connects directly to the practices of the reproducibility domain, where citable software is a precondition for reproducible work.

    Why software citation matters

    Treating software as a citable output does two distinct jobs. The first is credit: the people who built a tool deserve recognition for an intellectual contribution that is often as substantial as the paper it enabled, and that recognition only flows if the software is cited as software, not buried in prose. The second is reproducibility: a result is only checkable if a reader can identify the exact code — the specific version — that produced it. A vague mention of “our in-house scripts” serves neither goal. A formal citation to a specific, versioned, identified software object serves both.

    The community reference point here is the software citation principles articulated by the FORCE11 Software Citation Working Group, which establish that software should be a legitimate, citable product of research, cited on the same footing as any other output, with credit, persistence, accessibility, and specificity (down to the version) as core requirements. Everything below is machinery for honouring those principles.

    The building blocks

    CITATION.cff — telling people how to cite your code

    The simplest, highest-leverage step is to add a Citation File Format file — a plain-text CITATION.cff file — to the root of a software repository. It is a small, human- and machine-readable YAML file that states the authors, title, version, and preferred citation for the software. Its value is that it removes ambiguity: instead of a would-be citer guessing, the repository itself declares how it wants to be cited. Major code-hosting platforms recognise the file and surface a ready-made citation from it, which sharply lowers the effort of citing software correctly.
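
    A minimal sketch of what such a file contains, generated from Python so the fields are explicit. The project title, author, version, and date are placeholders, and render_citation_cff is a hypothetical helper. The keys (cff-version, message, title, version, date-released, authors with family-names and given-names) follow Citation File Format 1.2.0 as far as this sketch assumes; validate the output with a CFF validator before relying on it.

```python
import json


def render_citation_cff(title, version, date_released, authors, doi=None):
    """Render a minimal CITATION.cff document as YAML text.

    json.dumps supplies double-quoted, escaped strings, which are also
    valid YAML double-quoted scalars.
    """
    q = json.dumps
    lines = [
        "cff-version: 1.2.0",
        "message: " + q("If you use this software, please cite it as below."),
        "title: " + q(title),
        "version: " + q(version),
        "date-released: " + q(date_released),
    ]
    if doi:
        lines.append("doi: " + q(doi))
    lines.append("authors:")
    for given, family in authors:
        lines.append("  - family-names: " + q(family))
        lines.append("    given-names: " + q(given))
    return "\n".join(lines) + "\n"


text = render_citation_cff(
    title="Example Analysis Toolkit",  # hypothetical project
    version="1.2.0",
    date_released="2024-03-01",
    authors=[("Jane", "Doe")],  # placeholder author
)
print(text)
```

    In practice most authors write the file by hand; generating it is mainly useful when the version and release date are already tracked elsewhere and should not be typed twice.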

    CodeMeta — describing software in interoperable metadata

    Where CITATION.cff covers the citation, CodeMeta covers the fuller description. CodeMeta is a metadata standard, built on Schema.org and expressed as JSON-LD (conventionally in a codemeta.json file), that captures rich, structured information about a piece of software: its authors and contributors, licence, programming language, dependencies, related identifiers, funding, and more. Its purpose is interoperability: it provides a shared crosswalk so that the same software metadata can move between repositories, archives, registries, and citation systems without being re-keyed. Where CITATION.cff answers “how do I cite this?”, CodeMeta answers “how do I describe this completely and portably?”
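
    A small illustration of the shape of a codemeta.json document, built in Python. The software described is hypothetical and build_codemeta is an illustrative helper. The sketch assumes the CodeMeta 2.0 JSON-LD context and a handful of its Schema.org-derived terms; a real record would usually carry more fields.

```python
import json


def build_codemeta(name, version, license_url, language, authors, requirements):
    """Assemble a minimal CodeMeta description as a JSON-serialisable dict."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "license": license_url,
        "programmingLanguage": language,
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
        "softwareRequirements": list(requirements),
    }


codemeta = build_codemeta(
    name="Example Analysis Toolkit",  # hypothetical project
    version="1.2.0",
    license_url="https://spdx.org/licenses/MIT",
    language="Python",
    authors=[("Jane", "Doe")],  # placeholder author
    requirements=["numpy", "pandas"],
)
print(json.dumps(codemeta, indent=2))
```

    Because the keys come from a shared vocabulary, an archive or registry that understands CodeMeta can map them onto its own schema without the author re-entering anything.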

    Software Heritage and the SWHID — guaranteeing the code persists

    A citation is worthless if the thing it points to vanishes. Software Heritage is a non-profit initiative that systematically archives source code from public repositories into a permanent archive, ensuring the code remains available even if its original host disappears. It defines the SWHID (Software Heritage Identifier), an intrinsic, content-derived persistent identifier that pins down an exact snapshot, release, revision, directory, or file, and with a qualifier even a range of lines within a file. Because the SWHID is computed from the content itself, it is precise and tamper-evident in a way that a mutable repository URL can never be: it identifies exactly this code, forever.

    The DOI — the citable, version-pinned reference

    Finally, to make software citable alongside articles and data, mint a DOI for a released version. The common route is to connect a code repository to an archive such as Zenodo, which deposits a snapshot of each release and assigns it a DataCite DOI, typically with a version-specific DOI for each release plus a concept DOI that always resolves to the latest. The version-specific DOI is what goes in a reference list when a result depends on a particular release, and because it is pinned to that release it satisfies the citation principles’ demand for specificity.

    Putting it together: a practical recipe

    1. Add a CITATION.cff to the repository root, so anyone can cite the software correctly without guessing.
    2. Add a codemeta.json for rich, portable metadata — authors, licence, dependencies, funding — that travels between systems.
    3. Apply a clear licence. Unlicensed code cannot be reused with confidence; software citation assumes the reuse terms are stated.
    4. Archive releases and mint a DOI (for example via Zenodo), so each version is independently citable and pinned.
    5. Reference the Software Heritage archive / SWHID for the strongest persistence and exact-version identification, especially in reproducibility packages.
    6. Cite software in your own work the way you want your own to be cited — close the loop by treating other people’s tools as first-class outputs.
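
    The first three steps of the recipe can be checked mechanically. The sketch below is a hypothetical audit helper, not an existing tool: it looks in a repository root for the files the recipe names, and the list of accepted licence file names is an assumption. DOIs and SWHIDs live outside the repository and are not checked.

```python
import tempfile
from pathlib import Path

# Files the recipe asks for; the licence file names are assumed conventions.
EXPECTED_FILES = {
    "citation file": ("CITATION.cff",),
    "CodeMeta metadata": ("codemeta.json",),
    "licence": ("LICENSE", "LICENSE.md", "LICENSE.txt", "COPYING"),
}


def audit_repository(root) -> dict:
    """Report which of the expected files exist directly under root."""
    root = Path(root)
    return {
        label: any((root / name).is_file() for name in names)
        for label, names in EXPECTED_FILES.items()
    }


# Demonstration on a throwaway directory standing in for a repository.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CITATION.cff").write_text("cff-version: 1.2.0\n", encoding="utf-8")
    (Path(tmp) / "LICENSE").write_text("placeholder licence text\n", encoding="utf-8")
    report = audit_repository(tmp)

for label, present in report.items():
    print(f"{label}: {'found' if present else 'missing'}")
```

    A check like this fits naturally into continuous integration, so that a release cannot be tagged while the citation metadata is missing.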

    Crediting the people, not just the artefact

    Identifying the software is half the task; crediting the contributors is the other half. The CRediT taxonomy includes a dedicated Software role (programming, software development, design of computer programs, implementation of code and supporting algorithms, and testing of existing code components), which lets a contribution made primarily in code be recorded on the associated paper. CRediT records the human contribution; CITATION.cff, CodeMeta, the SWHID, and the DOI record and persist the artefact. Used together they ensure that both the code and the people who wrote it are visible in the record, rather than the all-too-common outcome where neither is.

    Where shared vocabulary fits

    “Research software”, “version”, “snapshot”, “release”, and “software citation” are used inconsistently across communities, which is part of why software credit leaks away. A shared, federated vocabulary that defines these terms precisely — pointing back to the FORCE11 software citation principles, to CodeMeta, and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain.


  • Research software as a first-class output: citation, identifiers and credit

    A great deal of modern research runs on software the researchers wrote themselves — the analysis pipeline, the simulation, the model, the tool that made the result possible. And yet that software is routinely treated as a footnote rather than a finding: mentioned in a methods section, perhaps, but rarely cited as an output in its own right and almost never credited to the people who built it. Treating research software as a first-class output — citable, persistent, and creditable — corrects that, and it is a core concern of the research-outputs domain with a direct line to the reproducibility domain. If a result depends on code, the code is part of the evidence, and the people who wrote it did intellectual work that deserves recognition.

    Why software needs to be cited as software

    Citing software properly does two distinct jobs, and it is worth keeping them apart. The first is credit: building reliable research software — designing it, implementing it, testing it, documenting it — is substantial scholarly work, and it is rewarded only if the software is cited as an output rather than buried in prose. The second is reproducibility: a computational result can only be verified, and the analysis only reused, if a reader can identify and obtain the exact software that produced it, down to the version. A vague reference to “custom scripts” serves neither goal. A formal citation to a versioned, identified piece of software serves both.

    The community reference point here is the FORCE11 Software Citation Principles, developed by the FORCE11 Software Citation Working Group. They establish that software should be considered a legitimate, citable product of research, that citations should give credit to all contributors, that they should identify a specific version using a unique, persistent identifier, and that they should enable access to the software itself. Everything below is machinery for honouring those principles in practice.

    CITATION.cff: telling people how to cite your code

    The most practical first step a software author can take is to add a CITATION.cff file to the repository. Citation File Format (CFF) is a simple, human- and machine-readable YAML format, developed through the Research Software Engineering community, that records exactly how a piece of software should be cited: its authors, title, version, release date, and any associated DOI. Placing a CITATION.cff file in the root of a repository means that anyone — and any tool — can find the canonical citation rather than guessing it.

    The format is well supported. GitHub, for instance, reads a CITATION.cff file and surfaces a ready-made citation in the repository interface, and reference managers and conversion tools can transform it into BibTeX or other formats. It turns “how should I cite this tool?” from an awkward question into a one-click answer, and it puts the authors in control of how their work is attributed.
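
    The conversion step is simple enough to sketch. The function below is an illustration, not the behaviour of GitHub or of any particular converter: it maps a few CFF fields, already loaded into a Python dict, onto a BibTeX-style entry. The @software entry type is an assumption (it is a biblatex type, and some styles expect @misc), and the citation key and sample data are placeholders.

```python
def cff_to_bibtex(cff: dict, key: str) -> str:
    """Turn a small subset of CFF fields into a BibTeX-style @software entry."""
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in cff["authors"]
    )
    released = cff.get("date-released", "")
    fields = {
        "author": authors,
        "title": cff["title"],
        "version": cff.get("version"),
        "year": released[:4] or None,  # CFF dates are written YYYY-MM-DD
        "doi": cff.get("doi"),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@software{{{key},\n{body}\n}}"


example = {
    "title": "Example Analysis Toolkit",  # hypothetical project
    "version": "1.2.0",
    "date-released": "2024-03-01",
    "authors": [
        {"family-names": "Doe", "given-names": "Jane"},
        {"family-names": "Roe", "given-names": "Richard"},
    ],
}
entry = cff_to_bibtex(example, key="doe2024toolkit")
print(entry)
```

    Fields that are absent, such as the DOI here, are simply left out instead of being guessed.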

    Persistent identifiers: the archived DOI and the SWHID

    A repository URL is not a citation. Repositories move, get renamed, or disappear, and a link to the latest state of a project does not pin down the version that produced a result. Two complementary identifiers solve this.

    The first is an archived DOI. Depositing a release of the software in an archive such as Zenodo — which integrates directly with GitHub so that tagging a release can mint a DOI automatically — produces a DataCite DOI that resolves persistently to that exact version, with structured metadata describing the authors, version, and licence. The DOI is what goes in a reference list, exactly as an article DOI would, and it satisfies the principles’ demand to cite a specific, accessible version. Archives of this kind typically also mint a concept DOI for the software as a whole alongside the version-specific DOIs, so a citation can point either at “this release” or at “the software in general” as appropriate.

    The second is the Software Heritage identifier (SWHID). Software Heritage is a non-profit initiative that systematically archives source code from public repositories at scale, with the explicit mission of preserving the world’s software as a commons. It assigns intrinsic, content-derived identifiers — SWHIDs — that can pin down not just a release but a precise commit, directory, or even a single file. Because a SWHID is computed from the content itself, it verifies that the code you retrieve is byte-for-byte the code that was cited. An archived DOI gives a citable, version-level reference with rich metadata; a SWHID gives a fine-grained, intrinsically verifiable anchor to the source. Used together they cover both the citation layer and the deep-reproducibility layer.
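
    When a SWHID appears in a reproducibility package, it helps to check that it is at least well formed before trying to resolve it. The parser below is a sketch under stated assumptions: the core syntax is taken to be swh:1:<type>:<40 hex digits> with types cnt, dir, rev, rel, and snp, optionally followed by semicolon-separated key=value qualifiers. The names parse_swhid and Swhid are illustrative, and the origin URL and line range in the example are made up.

```python
import re
from typing import NamedTuple

OBJECT_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}

_PATTERN = re.compile(
    r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})((?:;[^;=]+=[^;]*)*)$"
)


class Swhid(NamedTuple):
    object_type: str
    object_id: str
    qualifiers: dict


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into object type, hash, and qualifiers, or raise ValueError."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a well-formed SWHID: {text!r}")
    kind, digest, tail = match.groups()
    qualifiers = dict(item.split("=", 1) for item in tail.split(";") if item)
    return Swhid(OBJECT_TYPES[kind], digest, qualifiers)


parsed = parse_swhid(
    "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
    ";origin=https://example.org/demo.git;lines=9-15"  # hypothetical qualifiers
)
print(parsed.object_type, parsed.object_id)
print(parsed.qualifiers)
```

    A syntactic check says nothing about whether the object is actually in the archive; it only catches truncated or mistyped identifiers early.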

    Crediting the people: the CRediT Software role

    Identifying the software is half the task; crediting the humans who wrote it is the other half. A DOI and a SWHID identify the artefact; they do not record who did the work. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Software role, defined as programming and software development — designing computer programs, implementing the code and supporting algorithms, and testing existing code components. Recording the Software role on the paper that the code supports makes visible the often-uncredited engineering effort behind a result, and where degree-of-contribution is recorded it distinguishes the lead developer from supporting contributors.

    The layers complement each other precisely. The DOI and SWHID say what the software is, where it lives, and which version; the CITATION.cff says how to cite it; and the CRediT Software role says who built it. Used together they ensure that both the code and the people behind it are visible — rather than the common outcome where neither is.

    A practical recipe

    1. Add a CITATION.cff to the repository root so there is a canonical, machine-readable citation.
    2. Archive each release — for instance via the GitHub–Zenodo integration — to mint a version-specific DOI, and cite that DOI, not the bare repository URL.
    3. Use the SWHID where byte-level reproducibility matters, pinning the exact commit the result depended on.
    4. Apply a clear open-source licence — software that cannot be reused with confidence will not be reused.
    5. Record the Software role via CRediT on the associated paper, so the developers are credited alongside the other contributors.

    Where shared vocabulary fits

    “Software”, “version”, “release”, “repository”, and “software citation” are used loosely across communities, which is part of why credit for code leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 Software Citation Principles and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain, with adjacent entries in the reproducibility domain.
