v2026.1714 entries · CC-BY 4.0
CASRAI

Explainer · Plain-language

Software Heritage: Definition, Meaning & Examples

Software Heritage is an open, non-profit initiative whose mission is to collect, preserve, and share the entire body of publicly available source code. Launched in 2016 by Inria — the French national research institute for digital science — and supported by UNESCO, it functions as a universal archive for software, making code citable and permanently accessible regardless of where it was originally hosted.

What Software Heritage archives and why it matters

Software Heritage continuously harvests source code from major public hosting platforms, including GitHub, GitLab, Bitbucket, and language-specific package registries, as well as from institutional repositories and individual developer sites. Its archive contains billions of software artefacts: files, directories, commits, and full repository snapshots. The motivation is both practical and cultural. Software is increasingly central to research, but code hosted on platforms such as GitHub is vulnerable to deletion, platform closure, and link rot. Software Heritage aims to provide the same kind of long-term guarantee for source code that national libraries provide for books. This is particularly important for research reproducibility: a GitHub URL cited in a published paper may be broken within a few years, whereas a SWHID identifies an immutable artefact in the Software Heritage archive. The SWHID specification has been formalised as ISO/IEC 18670.

How SWHIDs work

A SoftWare Heritage persistent IDentifier (SWHID) is an intrinsic, cryptographically verifiable identifier computed from the content of the software artefact itself. Because the identifier is derived from content rather than assigned by a central authority, the same source code always yields the same SWHID, and any modification, even a single character, produces a completely different identifier. SWHIDs can reference five types of object: file contents (cnt), directories (dir), revisions or commits (rev), releases (rel), and full repository snapshots (snp). A basic SWHID takes the form swh:1:cnt:<hash>, where 1 is the scheme version and <hash> is a 40-character hexadecimal digest. Qualified SWHIDs append semicolon-separated qualifiers that can specify the origin URL, the repository visit, an anchor object, a path within it, and a line range within a file. This granularity means researchers can cite a specific function in a specific version of a specific library with a precision that a DOI assigned to a whole repository cannot provide.
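The content-derived property can be illustrated with a minimal Python sketch. It assumes the commonly documented rule that a content (cnt) identifier is computed the same way as a Git blob hash, that is, SHA-1 over a short header followed by the raw bytes. The function name content_swhid and the sample byte strings are illustrative only and are not part of any Software Heritage library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a basic SWHID of type 'cnt' for raw file content.

    Assumption: content identifiers follow the Git blob scheme,
    i.e. SHA-1 over b"blob <length>\\0" followed by the bytes.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


original = b"hello world\n"
modified = b"hello world!\n"  # a single added character

# Identical bytes always give the identical identifier ...
assert content_swhid(original) == content_swhid(b"hello world\n")
# ... while any change gives a different one.
assert content_swhid(original) != content_swhid(modified)

print(content_swhid(original))
print(content_swhid(modified))
```

Because the identifier depends only on the bytes, anyone holding a copy of a file can recompute it and check it against a cited SWHID without contacting the archive. Directory, revision, release, and snapshot identifiers are computed from structured manifests and are not covered by this sketch.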

Software Heritage and FAIR research software

The FAIR principles — Findable, Accessible, Interoperable, Reusable — were extended to research software through the FAIR4RS initiative, led by a working group under the Research Data Alliance (RDA), FORCE11, and ReSA (Research Software Alliance). Software Heritage is a key infrastructure component for making software FAIR: it provides findability through persistent SWHIDs, long-term accessibility through its universal archive, and reusability by preserving the full source code alongside its licensing and provenance metadata. Zenodo, the open repository operated by CERN and OpenAIRE, integrates with Software Heritage so that code deposited in Zenodo is automatically archived in Software Heritage and a SWHID is associated with the deposit. Researchers publishing software alongside a paper are increasingly expected by journals and funders to provide a persistent identifier; a SWHID or a Zenodo DOI backed by Software Heritage meets this requirement.

Citing software archived in Software Heritage

Citing software as a first-class research output is encouraged by the Software Citation Principles (published in PeerJ Computer Science in 2016, led by FORCE11) and by funders including UKRI and the European Research Council. To cite software via Software Heritage, a researcher goes to the Software Heritage archive (archive.softwareheritage.org), locates the software by origin URL or search, and obtains the SWHID for the specific version used. Where a codemeta.json or CITATION.cff file exists in the repository, Software Heritage surfaces this metadata automatically. The SWHID can then be included in a paper's reference list or data availability statement, giving readers a persistent link to the exact code used. For software hosted on GitHub, the recommended workflow is to create a tagged release on GitHub, deposit it in Zenodo (which automatically triggers Software Heritage archiving), and cite the resulting Zenodo DOI and/or SWHID in the paper.
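Before a SWHID goes into a reference list, it can be worth checking that it is well formed and turning it into a clickable link. The sketch below is one possible way to do that and makes several assumptions: the names parse_swhid and archive_url are hypothetical helpers, the origin in the example is a placeholder, and the link format assumes that the archive resolves identifiers appended to https://archive.softwareheritage.org/. It validates syntax only and does not confirm that the object exists in the archive.

```python
import re
from typing import Dict, NamedTuple

# Assumed resolver prefix; the SWHID is appended to it as-is.
ARCHIVE_BASE = "https://archive.softwareheritage.org/"

# Basic identifier: scheme, version 1, one of five object types, 40 hex digits.
CORE_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


class Swhid(NamedTuple):
    object_type: str
    object_id: str
    qualifiers: Dict[str, str]


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into object type, hash, and optional qualifiers."""
    core, *rest = text.split(";")
    match = CORE_RE.match(core)
    if match is None:
        raise ValueError(f"not a valid basic SWHID: {core!r}")
    qualifiers: Dict[str, str] = {}
    for item in rest:
        key, sep, value = item.partition("=")
        if not key or not sep:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value
    return Swhid(match.group(1), match.group(2), qualifiers)


def archive_url(text: str) -> str:
    """Return a link for a syntactically valid SWHID (no network access)."""
    parse_swhid(text)  # raises ValueError on bad input
    return ARCHIVE_BASE + text


# Placeholder origin and line range, for illustration only.
example = (
    "swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
    ";origin=https://example.org/repo.git;lines=1-1"
)
parsed = parse_swhid(example)
print(parsed.object_type, parsed.qualifiers)
print(archive_url(example))
```

Splitting on the semicolon is sufficient here because qualifier values are expected to percent-encode any semicolon they contain.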

Key facts

At a glance

  • Founded: 2016 by Inria (French national research institute), supported by UNESCO
  • Identifier: SWHID — SoftWare Heritage persistent IDentifier, standardised as ISO/IEC 18670
  • How computed: Cryptographic hash of content; same code always produces the same SWHID
  • FAIR4RS: Software Heritage is a core infrastructure supporting FAIR principles for research software
  • Zenodo integration: Code on Zenodo is automatically archived in Software Heritage
  • Coverage: Billions of software artefacts from GitHub, GitLab, package registries, and more

Common misconceptions

What people often get wrong

Often heard: A SWHID is the same as a DOI for software.

Actually: A DOI is assigned by a central registry; a SWHID is computed from the content itself. SWHIDs are intrinsic and granular — they can identify a single file or commit — whereas a DOI typically refers to a whole repository snapshot.

Often heard: Hosting code on GitHub is sufficient for long-term preservation.

Actually: GitHub code can be deleted, repositories renamed, or the platform could close. Software Heritage provides persistent archiving independent of any commercial platform.

Often heard: Software Heritage only archives popular open-source projects.

Actually: Software Heritage aims to archive all publicly available source code, including small, niche, and legacy projects that may have significant historical or scientific value.
