Tag: FORCE11

  • Citing Data Properly: The Joint Declaration of Data Citation Principles

    For decades, the data underpinning a study lived in a footnote, an appendix, or nowhere visible at all. A reader who wanted to inspect, reuse, or build on those data had little to go on. As research has become more data-intensive, that omission has grown harder to justify. The Joint Declaration of Data Citation Principles, published through FORCE11 in 2014, was a deliberate attempt to fix it by treating datasets as legitimate, citable research outputs in their own right.

    Why data citation matters

    Citing data is not merely good manners. It serves the same purposes as citing the literature: it credits the people who produced the work, it lets readers verify claims, and it builds a traceable record of how knowledge accumulates. When a dataset is cited formally, the citation can be counted, indexed, and linked, which means the often-considerable labour of collecting, cleaning, and documenting data becomes visible and rewardable. This connects directly to broader efforts in FAIR data, where the goal is for data to be findable, accessible, interoperable, and reusable.

    The eight principles

    The Declaration is built around eight principles that, taken together, describe what responsible data citation looks like:

    • Importance. Data should be considered legitimate, citable products of research, deserving the same status as publications.
    • Credit and attribution. Citations should give scholarly credit and normative and legal attribution to everyone who contributed to the data.
    • Evidence. Where a claim rests on data, the corresponding data should be cited.
    • Unique identification. Citations should include a persistent, machine-actionable, globally unique identifier.
    • Access. Citations should make it possible to reach the data themselves and their associated metadata and documentation.
    • Persistence. Identifiers and metadata should persist even beyond the lifespan of the data they describe.
    • Specificity and verifiability. Citations should allow a precise version and subset of the data to be identified.
    • Interoperability and flexibility. Citation methods should work across communities while accommodating disciplinary differences.

    These principles are intentionally technology-neutral. They do not mandate a single repository or identifier scheme; they describe outcomes that any sound practice should achieve.

    How to cite a dataset in practice

    A well-formed data citation looks much like a reference to an article, but with a few additions. At minimum it should carry the creator or creators, the year of publication, the title of the dataset, the publisher or repository, a version where one exists, and a persistent identifier. In most cases that identifier is a DataCite DOI, resolvable to a landing page that describes the dataset and points to the files. A typical reference takes the shape: Creator(s) (Year): Title. Version. Publisher. Dataset. DOI.
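    The reference shape described above can be sketched in a few lines of Python. This is an illustration only: the `DatasetReference` class, the `format_citation` function, the "Version" label, and the example values (including the DOI) are hypothetical placeholders, and a real reference list would follow the punctuation of the journal's own style.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetReference:
    """The minimum fields named in the text; the class itself is illustrative."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str
    version: Optional[str] = None


def format_citation(ref: DatasetReference) -> str:
    """Render Creator(s) (Year): Title. Version. Publisher. Dataset. DOI."""
    creators = "; ".join(ref.creators)
    parts = [f"{creators} ({ref.year}): {ref.title}."]
    if ref.version:  # the version appears only where one exists
        parts.append(f"Version {ref.version}.")
    parts.append(f"{ref.publisher}.")
    parts.append("Dataset.")
    parts.append(f"https://doi.org/{ref.doi}")
    return " ".join(parts)


example = DatasetReference(
    creators=["Doe, J.", "Roe, R."],
    year=2020,
    title="Example Survey Data",
    publisher="Example Repository",
    doi="10.1234/example.5678",  # placeholder DOI, not a real record
    version="2.1",
)
print(format_citation(example))
```

    Writing the DOI as a full https://doi.org/ link keeps the identifier resolvable wherever the reference is copied.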

    Two details repay attention. First, versioning is not optional for datasets that change over time. Citing the specific version used means a future reader can reproduce exactly what was analysed, rather than a later, possibly different, release. Second, the identifier should appear in the reference list, not merely in the running text. Burying a dataset DOI in a sentence keeps it out of the indexing and counting systems that make citation meaningful in the first place.

    DataCite DOIs and the reference list

    DataCite was established precisely to assign DOIs to research data and to maintain the metadata that makes those DOIs useful. When a repository mints a DataCite DOI for a dataset, it registers structured metadata describing the creators, title, publication year, resource type, and related identifiers. That metadata is what allows discovery services and reference managers to handle data citations the way they handle article citations. Placing the DOI in the reference list, formatted to the relevant style, lets indexing infrastructure pick it up and attribute it correctly.

    Data availability statements close the loop

    Many publishers now require a data availability statement, a short passage telling readers where the underlying data can be found and under what conditions. Done well, the statement names the repository and gives the persistent identifier, linking the prose of the article to the formal citation in the reference list. Done poorly, it says only that data are available on request, which research has repeatedly shown to be an unreliable route to access. A good availability statement and a properly formatted data citation are two halves of the same commitment: that the evidence behind a study can actually be found and reused.
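    As a rough illustration of the difference between the two kinds of statement, the check below looks for a DOI and for "on request" wording. It is a heuristic sketch, not a publisher's validation rule: the function name `review_statement`, the regular expression, and both sample statements are assumptions made for the example.

```python
import re

# Heuristic DOI pattern: "10.", a registrant code, a slash, then a suffix.
# Real DOIs can contain characters this simple pattern handles poorly.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def review_statement(text: str) -> dict:
    """Report any DOIs found and whether the statement relies on requests alone."""
    # Strip sentence punctuation that the pattern may have swallowed.
    dois = [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]
    lowered = text.lower()
    mentions_request = "on request" in lowered or "upon request" in lowered
    return {"dois": dois, "on_request_only": mentions_request and not dois}


good = ("The data are deposited in Example Repository at "
        "https://doi.org/10.1234/example.5678.")  # placeholder DOI
poor = "Data are available from the authors on request."

print(review_statement(good))
print(review_statement(poor))
```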

    Bringing it together

    The Joint Declaration did not invent the idea that data deserve credit, but it gave the community a shared, citable reference point. The practical implications are modest and achievable: assign a persistent identifier, capture the version, put the citation in the reference list, and write a data availability statement that points to it. Standards bodies and metadata schemas, including the work catalogued in the CASRAI data dictionary and contributor frameworks such as CRediT, supply the surrounding vocabulary for describing who did what. The principles themselves are a reminder that data are not a by-product of research but, increasingly, one of its most valuable outputs.

  • Data citation: giving datasets the credit they deserve

    A great deal of published science rests on data the authors collected, cleaned, and shared — and yet the dataset itself, the object on which the conclusions actually depend, is routinely mentioned in passing or not at all. A finding is only checkable if a reader can find and reuse the data behind it, and the people who produced that data deserve recognition for an intellectual contribution that is often enormous. Treating datasets as first-class, citable outputs solves both problems at once. It is a core concern of the data-infrastructure domain and connects directly to the wider taxonomy of the research-outputs domain.

    Why data citation matters

    Citing data as data does two distinct jobs, and it is worth keeping them separate. The first is credit: assembling a well-documented dataset is real scholarly work — designing the collection, curating, validating, and documenting it — and that work is rewarded only if the dataset is cited as an output in its own right, not buried in a methods paragraph. The second is reproducibility and reuse: a result can only be verified, and the data only reused, if a reader can identify and locate the exact dataset that underpinned the analysis. A vague reference to “data available on request” serves neither goal; a formal citation to a deposited, identified dataset serves both.

    The FORCE11 data citation principles

    The community reference point here is the Joint Declaration of Data Citation Principles, developed through FORCE11 and endorsed across the scholarly-communication community. The declaration establishes that data should be treated as a legitimate, citable product of research, on the same footing as any other output. Its principles can be summarised as a short set of commitments:

    • Importance. Data should be considered legitimate, citable products of research; data citations should be accorded the same importance as citations of other objects.
    • Credit and attribution. Citations should facilitate giving scholarly credit and legal attribution to all contributors to the data.
    • Evidence. Where a claim relies on data, the corresponding data should be cited.
    • Unique identification. A citation should include a persistent, machine-actionable, globally unique identifier for the data.
    • Access, persistence, and specificity. Citations should enable access to the data and its metadata; the identifier and metadata should persist even beyond the lifespan of the data; and the citation should identify the precise version and subset used.
    • Interoperability and flexibility. Citation methods should be interoperable across communities while accommodating their varying practices.

    Everything below is machinery for honouring these principles in practice.

    DataCite and the dataset DOI

    The practical foundation of data citation is the DataCite DOI. DataCite is a DOI registration agency focused on research data and related outputs. A dataset deposited in a generalist repository such as Zenodo, Figshare, or Dryad, or in many discipline-specific ones, is assigned a DataCite DOI that resolves persistently to a landing page for the dataset and its metadata. The DOI is what goes in a reference list, exactly as an article DOI would, which is what makes a dataset citable on equal terms with a paper.

    The DOI is more than a link. The DataCite metadata record behind it carries the structured information that makes the citation meaningful: the creators (ideally with their ORCID iDs), the title, the publisher and publication year, the version, the licence, the resource type, and related identifiers connecting the dataset to the article it supports, the software that processed it, and the grant that funded it. Versioning is treated as a first-class concern: a revised dataset can receive its own version-specific DOI, satisfying the principles’ demand for specificity so that a citation pins down exactly the data used, not merely the latest state of an evolving collection.
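    To make the shape of such a record concrete, here is a simplified sketch as a Python dictionary. The key names loosely follow DataCite's JSON metadata but should be treated as assumptions and checked against the current schema; every value (DOIs, ORCID iD, names) is a placeholder, the grant link is left out, and the helpers `missing_mandatory` and `related` are hypothetical.

```python
# Simplified, illustrative record; key names are assumptions modelled
# loosely on DataCite JSON, and all values are placeholders.
MANDATORY = ("doi", "creators", "titles", "publisher", "publicationYear", "types")

record = {
    "doi": "10.1234/example.5678",
    "creators": [{
        "name": "Doe, Jane",
        "nameIdentifiers": [{
            "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
            "nameIdentifierScheme": "ORCID",
        }],
    }],
    "titles": [{"title": "Example Survey Data"}],
    "publisher": "Example Repository",
    "publicationYear": 2020,
    "types": {"resourceTypeGeneral": "Dataset"},
    "version": "2.1",
    "rightsList": [{"rights": "CC BY 4.0"}],
    "relatedIdentifiers": [
        # the article this dataset supports
        {"relatedIdentifier": "10.1234/article.0001",
         "relatedIdentifierType": "DOI", "relationType": "IsSupplementTo"},
        # the software that processed it
        {"relatedIdentifier": "10.1234/software.0002",
         "relatedIdentifierType": "DOI", "relationType": "IsCompiledBy"},
        # the previous version of the same dataset
        {"relatedIdentifier": "10.1234/example.5600",
         "relatedIdentifierType": "DOI", "relationType": "IsNewVersionOf"},
    ],
}


def missing_mandatory(rec: dict) -> list:
    """List the mandatory keys that are absent or empty."""
    return [key for key in MANDATORY if not rec.get(key)]


def related(rec: dict, relation_type: str) -> list:
    """Return the identifiers linked to the record by one relation type."""
    return [item["relatedIdentifier"]
            for item in rec.get("relatedIdentifiers", [])
            if item.get("relationType") == relation_type]


print(missing_mandatory(record))
print(related(record, "IsSupplementTo"))
```

    The version-specific DOI and the IsNewVersionOf link together are what let a citation pin one release while still pointing readers to its predecessors.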

    Crediting the people: the Data curation role

    Identifying the dataset is half the task; crediting the humans who produced it is the other half, and the two are easily confused. A DataCite DOI identifies and persists the artefact; it does not, on its own, record the division of labour that produced it. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Data curation role — defined as the management activities to annotate, scrub, and maintain research data (including the software code where needed to interpret the data) for initial use and later reuse. Recording Data curation on the associated paper makes visible the often-uncredited work of turning raw observations into a documented, reusable dataset.

    The two layers complement each other precisely. The dataset DOI and its DataCite metadata say what the data is, where it lives, and which version; the CRediT role record says who curated, validated, and maintained it. Used together they ensure that both the data and the people behind it are visible — rather than the common outcome where neither is, and the dataset is reduced to an unattributed line in a methods section.

    A practical recipe

    1. Deposit the data in a trustworthy repository and obtain a DataCite DOI, rather than leaving it “available on request”.
    2. Cite the dataset in your reference list using its DOI, the way you would cite an article — not in a footnote or in prose.
    3. Pin the version. Where the data may change, cite the version-specific DOI so the citation identifies exactly what was used.
    4. Record the contributors — on both the DataCite record (with ORCID iDs) and, via CRediT’s Data curation role, on the paper the data supports.
    5. Apply a clear licence. Data that cannot be reused with confidence is data that will not be reused, and the principles' call for access that supports informed use is hard to meet when the reuse terms are unstated.

    Where shared vocabulary fits

    “Dataset”, “data citation”, “version”, “data curation”, and “repository” are used inconsistently across communities, which is part of why credit for data leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 data citation principles and to DataCite — is what lets a data citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain, with adjacent entries in the research-outputs domain.


  • Research software as a first-class output: citation, identifiers and credit

    A great deal of modern research runs on software the researchers wrote themselves — the analysis pipeline, the simulation, the model, the tool that made the result possible. And yet that software is routinely treated as a footnote rather than a finding: mentioned in a methods section, perhaps, but rarely cited as an output in its own right and almost never credited to the people who built it. Treating research software as a first-class output — citable, persistent, and creditable — corrects that, and it is a core concern of the research-outputs domain with a direct line to the reproducibility domain. If a result depends on code, the code is part of the evidence, and the people who wrote it did intellectual work that deserves recognition.

    Why software needs to be cited as software

    Citing software properly does two distinct jobs, and it is worth keeping them apart. The first is credit: building reliable research software — designing it, implementing it, testing it, documenting it — is substantial scholarly work, and it is rewarded only if the software is cited as an output rather than buried in prose. The second is reproducibility: a computational result can only be verified, and the analysis only reused, if a reader can identify and obtain the exact software that produced it, down to the version. A vague reference to “custom scripts” serves neither goal. A formal citation to a versioned, identified piece of software serves both.

    The community reference point here is the FORCE11 Software Citation Principles, developed by the FORCE11 Software Citation Working Group. They establish that software should be considered a legitimate, citable product of research, that citations should give credit to all contributors, that they should identify a specific version using a unique, persistent identifier, and that they should enable access to the software itself. Everything below is machinery for honouring those principles in practice.

    CITATION.cff: telling people how to cite your code

    The most practical first step a software author can take is to add a CITATION.cff file to the repository. Citation File Format (CFF) is a simple, human- and machine-readable YAML format, developed through the Research Software Engineering community, that records exactly how a piece of software should be cited: its authors, title, version, release date, and any associated DOI. Placing a CITATION.cff file in the root of a repository means that anyone — and any tool — can find the canonical citation rather than guessing it.

    The format is well supported. GitHub, for instance, reads a CITATION.cff file and surfaces a ready-made citation in the repository interface, and reference managers and conversion tools can transform it into BibTeX or other formats. It turns “how should I cite this tool?” from an awkward question into a one-click answer, and it puts the authors in control of how their work is attributed.
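    A minimal CITATION.cff can be written by hand, but a short Python sketch shows which fields it carries. The function `render_cff` and all example values (title, author, DOI, date) are hypothetical; the keys follow CFF version 1.2.0 as commonly documented, and a real file should be checked with a CFF validator. JSON-style quoting is used because a JSON string is also a valid double-quoted YAML scalar.

```python
import json


def render_cff(title, version, date_released, doi, authors):
    """Build the text of a minimal CITATION.cff file.

    `authors` is a list of (family_names, given_names) tuples.
    """
    quote = json.dumps  # JSON strings are valid double-quoted YAML scalars
    lines = [
        "cff-version: 1.2.0",
        f"message: {quote('If you use this software, please cite it as below.')}",
        f"title: {quote(title)}",
        f"version: {quote(version)}",
        f"date-released: {quote(date_released)}",
        f"doi: {quote(doi)}",
        "authors:",
    ]
    for family, given in authors:
        lines.append(f"  - family-names: {quote(family)}")
        lines.append(f"    given-names: {quote(given)}")
    return "\n".join(lines) + "\n"


cff_text = render_cff(
    title="Example Analysis Pipeline",
    version="1.4.0",
    date_released="2021-06-01",
    doi="10.1234/software.0002",  # placeholder DOI
    authors=[("Doe", "Jane")],
)
print(cff_text)
```

    Saving the returned text as CITATION.cff in the repository root is what makes it discoverable by the tools mentioned above.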

    Persistent identifiers: the archived DOI and the SWHID

    A repository URL is not a citation. Repositories move, get renamed, or disappear, and a link to the latest state of a project does not pin down the version that produced a result. Two complementary identifiers solve this.

    The first is an archived DOI. Depositing a release of the software in an archive such as Zenodo — which integrates directly with GitHub so that tagging a release can mint a DOI automatically — produces a DataCite DOI that resolves persistently to that exact version, with structured metadata describing the authors, version, and licence. The DOI is what goes in a reference list, exactly as an article DOI would, and it satisfies the principles’ demand to cite a specific, accessible version. Archives of this kind typically also mint a concept DOI for the software as a whole alongside the version-specific DOIs, so a citation can point either at “this release” or at “the software in general” as appropriate.

    The second is the Software Heritage identifier (SWHID). Software Heritage is a non-profit initiative that systematically archives source code from public repositories at scale, with the explicit mission of preserving the world’s software as a commons. It assigns intrinsic, content-derived identifiers — SWHIDs — that can pin down not just a release but a precise commit, directory, or even a single file. Because a SWHID is computed from the content itself, it verifies that the code you retrieve is byte-for-byte the code that was cited. An archived DOI gives a citable, version-level reference with rich metadata; a SWHID gives a fine-grained, intrinsically verifiable anchor to the source. Used together they cover both the citation layer and the deep-reproducibility layer.
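    The "computed from the content itself" property can be illustrated for the simplest case, a single file. The sketch below assumes the content-type SWHID is `swh:1:cnt:` followed by the same SHA-1 that Git computes for a blob; directories, commits, and releases use other object types not shown here, and the function names are hypothetical.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a content-type SWHID for the bytes of one file.

    Assumes the identifier is the Git blob hash: SHA-1 over the header
    "blob <length>" plus a NUL byte, followed by the raw content.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def matches_citation(retrieved: bytes, cited_swhid: str) -> bool:
    """Check that retrieved bytes are exactly the content that was cited."""
    return content_swhid(retrieved) == cited_swhid


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))
```

    Because the identifier is recomputed from the bytes in hand, a single changed character in the file yields a different SWHID, which is what makes the check in `matches_citation` meaningful.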

    Crediting the people: the CRediT Software role

    Identifying the software is half the task; crediting the humans who wrote it is the other half. A DOI and a SWHID identify the artefact; they do not record who did the work. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Software role, defined as programming and software development — designing computer programs, implementing the code and supporting algorithms, and testing existing code components. Recording the Software role on the paper that the code supports makes visible the often-uncredited engineering effort behind a result, and where degree-of-contribution is recorded it distinguishes the lead developer from supporting contributors.

    The layers complement each other precisely. The DOI and SWHID say what the software is, where it lives, and which version; the CITATION.cff says how to cite it; and the CRediT Software role says who built it. Used together they ensure that both the code and the people behind it are visible — rather than the common outcome where neither is.

    A practical recipe

    1. Add a CITATION.cff to the repository root so there is a canonical, machine-readable citation.
    2. Archive each release — for instance via the GitHub–Zenodo integration — to mint a version-specific DOI, and cite that DOI, not the bare repository URL.
    3. Use the SWHID where byte-level reproducibility matters, pinning the exact commit the result depended on.
    4. Apply a clear open-source licence — software that cannot be reused with confidence will not be reused.
    5. Record the Software role via CRediT on the associated paper, so the developers are credited alongside the other contributors.

    Where shared vocabulary fits

    “Software”, “version”, “release”, “repository”, and “software citation” are used loosely across communities, which is part of why credit for code leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 Software Citation Principles and to Software Heritage — is what lets a software citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the research-outputs domain, with adjacent entries in the reproducibility domain.
