Tag: research data

  • The OSTP Nelson Memo Deadline: Free Federal Research

    The 2022 memorandum from the US Office of Science and Technology Policy (OSTP), widely known as the Nelson memo after the then-acting director who signed it, directed federal agencies that fund research to make the resulting peer-reviewed publications and their supporting data freely available to the public without an embargo. Agencies were asked to develop and implement updated public-access plans, with the milestone for full effect set for the end of 2025. This article is a neutral description of the policy and its rollout, not compliance advice.

    What the memo directed

    The Nelson memo built on earlier US public-access policy but extended and tightened it in two notable ways. First, it removed the previously permitted twelve-month embargo, so that publications arising from federal funding should be free to read immediately on publication. Second, it explicitly brought supporting research data into scope, asking agencies to ensure that data underlying published, peer-reviewed findings are made publicly accessible.

    Crucially, the memo also widened applicability. The earlier 2013 guidance had applied only to agencies with more than $100 million in annual research and development expenditures; the Nelson memo applied across federal agencies that fund research, including smaller agencies that had not previously operated formal public-access programmes. Each agency was asked to publish its own implementation plan within a common framework.

    The end-of-2025 milestone

    The memo set a phased timeline. Agencies were expected to update their public-access policies and then bring them fully into effect no later than 31 December 2025. In practice this meant that, across federal science funders, publications and associated data tied to awards should be subject to immediate free-access expectations by that date.

    The most visible early mover was the National Institutes of Health, whose revised arrangement is described in our companion explainer on the NIH Public Access Policy. NIH’s removal of the embargo is a concrete instance of the broader direction the Nelson memo set for the whole federal research system.

    How agency rollouts took shape

    Because the memo delegated implementation to each agency, the rollout was not a single switch but a set of staggered agency plans sharing common principles. Typical features of agency public-access plans include:

    • Immediate access to the peer-reviewed publication, removing the prior embargo window.
    • Data sharing expectations for the data underlying the published findings, with appropriate handling of sensitive or restricted data.
    • Persistent identifiers and metadata to make outputs findable and to link publications, data and awards.
    • Designated repositories or repository criteria through which compliant deposits are made.

    Identifiers feature heavily in these plans because they make compliance auditable and outputs discoverable. For background, see our notes on persistent identifiers in the standards dictionary, which explain how DOIs and related identifiers support linking across the scholarly record.

    Why data was the harder part

    Making publications free to read is operationally well understood, because it builds on a decade of deposit infrastructure. Extending public access to data is more complex. Datasets vary enormously in size, format and sensitivity, and not all data can be openly shared: human-subjects data, for example, may carry privacy and consent constraints. Agency plans therefore tend to frame data sharing around the principle of being as open as possible and as closed as necessary, with documented justifications where access must be restricted.

    This is where the policy intersects with established data-stewardship principles. The expectation is generally that shared data are described with sufficient metadata to be reusable, echoing the widely cited FAIR principles (findable, accessible, interoperable, reusable) referenced in our explainer on FAIR data.

    Persistent identifiers and infrastructure

    A practical thread running through agency public-access plans is the use of persistent identifiers and structured metadata. Identifiers such as DOIs for publications and datasets, ORCID iDs for researchers, and award and organisation identifiers make it possible to link an output back to the award that funded it and the person who produced it. This linking is what turns a pile of free documents into a navigable, auditable record of what public funding produced.

    That emphasis aligns the memo with infrastructure the scholarly community already uses. Our explainers on the DOI and the ORCID iD describe two of the building blocks agencies lean on. The broader point is that immediate access is not only about removing a paywall; it is about making outputs findable, attributable and connected.

    What changed for researchers and institutions

    For researchers, the practical consequence is that the funder-driven expectation of free, immediate access now extends across more agencies and now reaches data as well as papers. Award terms, data-management planning and deposit workflows reflect those expectations. Data-management and sharing plans became a more prominent part of the application and award lifecycle, prompting researchers to think early about which data will be shared, where, and under what conditions. Institutions commonly updated library guidance, data-repository support and compliance tracking in response, and many expanded research-data services to help investigators meet the data-sharing element rather than only the publication element.

    Equity and the cost question

    One theme the memo raised explicitly is equity in publishing. Removing embargoes increases free access for readers, but the costs of publishing do not disappear — they may shift, for example toward article-processing charges in some open-access models. The memo asked agencies to consider how their public-access approaches affect different communities of researchers, including those with fewer resources, so that the move to open access does not inadvertently disadvantage smaller institutions or early-career researchers who may struggle with publication fees. This is part of why depositing the accepted manuscript in a repository — a route that does not require paying a fee — remains an important compliance pathway alongside open-access journals.

    The practical upshot is that immediate access can be achieved through more than one route, and agencies have generally been careful not to mandate a single business model. The goal is free public access to the output, with flexibility in how that access is delivered.

    The bottom line

    The Nelson memo is best understood as a framework rather than a single rule: it set the destination — immediate, free public access to federally funded publications and their underlying data — and asked each agency to chart its own route there by the end of 2025. Readers seeking authoritative detail should consult each agency’s published public-access plan and OSTP’s own guidance at whitehouse.gov/ostp.

  • Identifiers for Things, Not Just Papers: IGSN and PIDINST

    When researchers think about persistent identifiers, they usually picture DOIs on papers and datasets or ORCID iDs on people. Yet a great deal of research turns on physical things: a sediment core drilled from a lake bed, a tissue specimen in a biobank, a water sample from a particular depth on a particular day, or the spectrometer that analysed it. These physical research objects have historically been referred to by inconsistent local labels, if they were referred to at all. Two complementary efforts, the IGSN for samples and the PIDINST work for instruments, set out to give them stable, global identifiers.

    Why physical objects need PIDs

    The case for identifying physical objects mirrors the case for identifying any research output. A persistent identifier lets a sample or instrument be referred to unambiguously across publications, datasets, and laboratories. It allows the measurements derived from a sample to be linked back to the sample itself, and onward to the instrument that produced them. Without such links, reuse and verification become difficult: a reader cannot easily tell whether two studies analysed the same specimen, or whether a calibration problem on a particular instrument might affect a body of results. Persistent identification turns scattered physical objects into nodes in a connected research graph, supporting the goals of FAIR data.

    IGSN: identifiers for samples

    The IGSN began in the geosciences as the International Geo Sample Number, a way to give individual physical samples a globally unique identifier so that specimens could be tracked and cited across the literature. As the approach proved useful beyond geology, the system evolved. The IGSN is now implemented as an IGSN ID, issued through DataCite, which brought sample identification into the same DOI-based infrastructure used for datasets and other outputs. This alignment means a sample can carry a resolvable identifier, a landing page, and structured metadata describing what the sample is, where and when it was collected, and how it relates to other objects.

    The practical effect is that a physical specimen becomes a citable entity. A paper can reference the exact sample it analysed; a dataset can link each measurement to the sample it came from; and a repository can expose the provenance of its holdings. For disciplines that depend on irreplaceable physical material, from earth science to the life sciences, this is a meaningful advance in traceability.

    PIDINST: identifiers for instruments

    Where IGSN addresses samples, the PIDINST working group, convened under the Research Data Alliance, addressed the instruments themselves. The group developed a metadata schema for persistent identification of measuring instruments, so that a microscope, sensor, telescope, or analytical device can be referenced by a persistent identifier and described in a consistent way. The schema captures the kind of information that makes an instrument identifiable and useful to cite: what it is, who owns or operates it, its model and configuration, and identifiers for related entities such as the institution that hosts it.

    Identifying instruments matters because the measuring apparatus is part of the methods. When the data from an experiment can be linked to the specific instrument that produced them, it becomes possible to assess instrument-related effects, to credit the facilities that maintain expensive equipment, and to trace a result from a published figure all the way back to the device on a laboratory bench.

    Connecting the chain of provenance

    The real power of these identifiers appears when they are used together. Imagine a measurement linked to the instrument that produced it via a PIDINST identifier, the sample it was taken from via an IGSN ID, the dataset it belongs to via a DataCite DOI, and the researchers responsible via their ORCID iDs. Each link is a small piece of metadata, but together they describe an unbroken chain of provenance from a published claim back to the physical objects and people behind it. That is precisely the kind of connected, machine-actionable record that modern research infrastructure aspires to.
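
    As a rough illustration of the chain described above, the sketch below models a single measurement record that carries all four kinds of identifier and reports which links are missing. The class name, the field names and every identifier value are hypothetical placeholders; this is not a schema published by IGSN, PIDINST, DataCite or ORCID.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MeasurementProvenance:
    """One measurement plus the identifiers that tie it to its origins.

    Field names are illustrative assumptions, not a community schema.
    """
    instrument_pid: str                     # PID of the instrument (PIDINST-style)
    sample_igsn: str                        # IGSN ID of the physical sample
    dataset_doi: str                        # DataCite DOI of the containing dataset
    researcher_orcids: Tuple[str, ...] = () # ORCID iDs of responsible researchers

    def missing_links(self) -> List[str]:
        """Return the names of provenance links that are empty."""
        missing = []
        for name in ("instrument_pid", "sample_igsn", "dataset_doi"):
            if not getattr(self, name).strip():
                missing.append(name)
        if not self.researcher_orcids:
            missing.append("researcher_orcids")
        return missing

    def is_unbroken(self) -> bool:
        """True when every link in the chain is present."""
        return not self.missing_links()


# Placeholder values only; none of these identifiers is real.
record = MeasurementProvenance(
    instrument_pid="10.0000/example-instrument",
    sample_igsn="10.0000/example-sample",
    dataset_doi="10.0000/example-dataset",
    researcher_orcids=("0000-0000-0000-0000",),
)
print(record.is_unbroken())
```

    The check only confirms that each link is present. Whether an identifier actually resolves, and whether it points at the right object, would need a lookup against the relevant registry.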

    Towards a fully identified research record

    Extending persistent identification to samples and instruments fills two of the larger gaps in the research record. Articles, data, organisations, and people increasingly carry stable identifiers; physical objects and the apparatus that measures them have lagged behind. By bringing samples into the DataCite ecosystem as IGSN IDs and by giving instruments a shared metadata schema through PIDINST, the community is steadily closing those gaps. The vocabularies and crosswalks that hold such a record together are the kind of standards work catalogued in the CASRAI data dictionary, and they complement contributor frameworks such as CRediT by anchoring the human contributions to the physical things they acted upon.

  • Licensing research data: CC-BY, CC0 and when to use each

    You can deposit a dataset in a trusted repository, describe it with rich metadata, and give it a DOI — and still leave it effectively unusable, because you forgot the one line that tells a reuser what they are allowed to do with it. A dataset without a clear licence is data nobody can confidently build on: a careful researcher, unsure of the terms, will simply not reuse it. Licensing is therefore not a legal afterthought but the part of the data-infrastructure domain that determines whether a deposit delivers the “R” in FAIR at all. This guide explains the main choices — principally CC0 and CC BY — and when each fits.

    Why a licence is the reusability switch

    The FAIR principles ask that data be Findable, Accessible, Interoperable, and Reusable — and reusability rests explicitly on data being “released with a clear and accessible data usage licence”. Without a licence, default copyright and database rights leave the legal status ambiguous, and ambiguity is fatal to reuse: a would-be user cannot tell whether combining your data with theirs, redistributing it, or building a tool on it is permitted. An explicit, standard, machine-readable licence resolves that uncertainty in advance, for everyone, without anyone having to ask. That is why “attach an explicit licence” is the step that turns a findable dataset into a reusable one.

    The two main choices for data

    CC0 — the public-domain dedication

    CC0 is a Creative Commons tool by which the rights-holder waives, to the fullest extent the law allows, all copyright and related rights in the work — placing it as close to the public domain as possible. For data, CC0 means a reuser can use, combine, modify, and redistribute the data with no conditions at all, including no obligation to attribute. This is widely recommended as the default for research data, and for a specific reason: data are routinely aggregated from many sources, and attribution requirements that stack up across hundreds of datasets (“attribution stacking”) can become legally and practically unworkable. CC0 removes that friction entirely and maximises interoperability. Several major data repositories and infrastructures apply CC0 by default for exactly this reason.

    Importantly, CC0 waives legal requirements, not scholarly norms. Citing the data you use remains an academic and ethical expectation regardless of the licence — CC0 simply means that expectation is enforced by the norms of good scholarship rather than by copyright law.

    CC BY — attribution required

    CC BY permits the same broad reuse — use, adaptation, redistribution, including commercially — but on the single condition that the original creator is credited. For data, CC BY is appropriate where attribution matters enough to be a legal condition, or where a funder or institution requires it. It is the most permissive of the conditional Creative Commons licences and is the default for many open-access publications. The trade-off relative to CC0 is precisely the attribution clause: it guarantees credit, but it reintroduces the attribution-stacking problem when many datasets are combined.

    Choosing between them

    • Prefer CC0 for data intended for the widest possible aggregation and reuse, especially where the data will be merged with many other sources. It maximises interoperability and removes legal friction; rely on citation norms for credit.
    • Choose CC BY where attribution must be a legal condition, where a funder or repository mandates it, or where the dataset is a discrete, citable product whose creators need enforceable credit.
    • Be cautious with more restrictive clauses. Non-commercial (NC) and No-Derivatives (ND) terms substantially limit reuse and can render data incompatible with other open data; they are generally discouraged for research data unless a specific ethical or legal constraint demands them.

    Data are not software: a critical caveat

    Creative Commons licences are designed for content (text, images, and data), and Creative Commons itself advises against using them for software. Software has needs that CC licences do not address: patent grants, the distinction between source and compiled code, and copyleft mechanics. For code, use a recognised software licence instead: a permissive one such as MIT, BSD, or Apache 2.0, or a copyleft one such as the GPL. If your deposit bundles a dataset and the code that processes it, license each part appropriately: a CC licence (or CC0) for the data, an OSI-approved software licence for the code. Conflating the two is one of the most common licensing mistakes in research deposits.

    A practical checklist

    1. Confirm you have the right to license the data. Check funder terms, any data-sharing agreements, third-party data within your dataset, and, for personal or sensitive data, consent and governance constraints. A licence cannot grant rights you do not hold.
    2. Default to CC0 for data unless there is a positive reason to require attribution; choose CC BY where there is.
    3. License software separately with an OSI-approved licence; never put code under a Creative Commons licence.
    4. State the licence explicitly in the deposit metadata and in any data availability statement, using the standard licence identifier so it is machine-readable.
    5. Cite the data you reuse regardless of its licence — the scholarly norm holds even when the law does not require it.
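
    Steps 2 to 4 of the checklist can be sketched in code. The example below assumes that "the standard licence identifier" means an SPDX identifier such as CC0-1.0 or CC-BY-4.0, which is one common convention and not the only possible one. The function names and the component fields ("name", "kind", "licence") are hypothetical, not any repository's metadata schema.

```python
from typing import Dict, List


def choose_data_licence(attribution_required: bool) -> str:
    """Step 2: default to CC0; use CC BY when attribution must be a legal condition."""
    return "CC-BY-4.0" if attribution_required else "CC0-1.0"


def check_deposit_licences(components: List[Dict[str, str]]) -> List[str]:
    """Steps 3 and 4: return a list of problems; an empty list means none found.

    Each component is a dict with hypothetical keys "name", "kind"
    ("data" or "software") and, optionally, "licence" (an SPDX identifier).
    """
    problems = []
    for component in components:
        licence = component.get("licence")
        if not licence:
            # Step 4: the licence must be stated explicitly.
            problems.append(f"{component['name']}: no licence stated")
        elif component["kind"] == "software" and licence.startswith("CC"):
            # Step 3: code should carry a software licence, not a CC one.
            problems.append(f"{component['name']}: Creative Commons licence on code")
    return problems


deposit = [
    {"name": "survey.csv", "kind": "data", "licence": choose_data_licence(False)},
    {"name": "clean.py", "kind": "software", "licence": "MIT"},
]
print(check_deposit_licences(deposit))
```

    A check like this covers only the mechanical part of the checklist. Step 1, confirming that you hold the rights you are granting, still depends on reading the funder terms and agreements themselves.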

    How this connects to contribution and credit

    Licensing answers “what may be done with this output?”; it is a sibling of the question “who made it?”, which the CRediT taxonomy answers. A dataset’s intellectual work is recorded on the associated paper through roles such as Data curation and Investigation, while the licence governs downstream reuse of the artefact itself. Used together — a clear licence on the data and clear contribution roles on the people — they ensure both the dataset and its creators are properly accounted for.

    Where shared vocabulary fits

    “CC0”, “CC BY”, “public domain”, “attribution”, and “reuse” are interpreted differently across repositories and funders, which undermines the very interoperability that licensing is meant to enable. A shared, federated vocabulary that defines these terms precisely — pointing back to Creative Commons for the licences and to the FAIR principles for the reusability requirement — is what lets a licence chosen for one repository be understood correctly in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

  • GDPR and research data: lawful bases, consent and pseudonymisation

    An enormous amount of research depends on data about people: their health, their behaviour, their genetics, their opinions, their lives. Wherever such data identify or could identify individuals, they fall within data protection law. In the European Union that law is the General Data Protection Regulation (GDPR); in the United Kingdom it is the UK GDPR, the retained version of the regulation, together with the Data Protection Act 2018. For researchers the GDPR is sometimes experienced as a thicket of obligations. But its core ideas are coherent, and it contains specific provisions designed to enable responsible research rather than obstruct it. Understanding lawful bases, the special rules for sensitive data, the research exemptions, and the distinction between anonymisation and pseudonymisation is part of doing data-driven research properly. This article offers an orientation, drawing on the compliance and regulatory domain of the CASRAI Dictionary. It is general guidance, not legal advice.

    You need a lawful basis

    The first principle is that processing personal data is not permitted by default; it requires a lawful basis. Article 6 of the GDPR sets out the possible bases, several of which can be relevant to research. Many researchers assume the answer is always consent, but for research by public institutions a basis such as the performance of a task carried out in the public interest is often more appropriate. The choice matters because different bases carry different consequences for the rights individuals can exercise. The key point is that a researcher must be able to identify and justify the lawful basis on which they process personal data — good intentions and scientific value do not by themselves make processing lawful.

    Special category data and Article 9

    Much research data is not merely personal but sensitive — data about health, genetics, ethnicity, sexual life, religious or political beliefs, and so on. The GDPR calls these special categories and gives them extra protection under Article 9, which prohibits their processing unless a specific additional condition is met. Among those conditions are explicit consent and, importantly for research, processing necessary for scientific research purposes subject to appropriate safeguards. This means that to process sensitive data lawfully, a researcher must satisfy both a lawful basis under Article 6 and a condition under Article 9. The heightened protection reflects the heightened risk: misuse of health or genetic data can cause serious harm, and the law accordingly demands a stronger justification and stronger safeguards before such data may be used.

    The research provisions

    The GDPR explicitly recognises the value of research and contains provisions, centred on Article 89, intended to facilitate it while protecting individuals. These measures allow certain flexibilities under conditions — for example, data collected for one purpose may in some circumstances be further processed for scientific research without that being treated as incompatible with the original purpose, and certain individual rights may be adjusted where they would seriously impair research objectives. Crucially, these provisions are not a free pass. They are conditioned on appropriate safeguards for the rights and freedoms of individuals — safeguards that the regulation specifically associates with techniques such as data minimisation and, prominently, pseudonymisation. The research exemptions, in other words, come bundled with the expectation that researchers will take concrete measures to protect the people in their data.

    Anonymisation versus pseudonymisation

    One distinction does more practical work in research than almost any other, and it is frequently misunderstood: the difference between anonymisation and pseudonymisation.

    • Anonymisation means rendering data such that individuals are no longer identifiable, by anyone, taking account of all means reasonably likely to be used. Genuinely anonymous data falls outside the scope of the GDPR altogether, because it is no longer personal data. Achieving true anonymisation is harder than it sounds, because seemingly innocuous combinations of fields can re-identify people.
    • Pseudonymisation means processing data so that it can no longer be attributed to an individual without additional information — for example, replacing names with a code, while keeping the key that links code to identity separate and secure. Pseudonymised data remains personal data and remains within the GDPR’s scope, because re-identification is still possible with the key.

    The error to avoid is treating pseudonymised data as if it were anonymous and therefore outside the law. Pseudonymisation is a valuable safeguard — indeed the GDPR commends it — but it reduces risk rather than removing the data from regulation. Knowing which one you have done determines what obligations still apply.
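
    To make the mechanics concrete, here is a minimal sketch of the code-and-key approach described above: the identifying field is replaced with a random code, and the table linking codes back to identities is returned separately so that it can be stored apart from the data. The function name, the "name" field and the "participant_code" field are hypothetical. The sketch handles a single direct identifier only; other fields left in the records may still identify people in combination, and the output remains personal data.

```python
import secrets
from typing import Dict, List, Tuple


def pseudonymise(
    records: List[dict], id_field: str = "name"
) -> Tuple[List[dict], Dict[str, str]]:
    """Replace the identifying field with a random code.

    Returns (pseudonymised records, key table mapping code -> identity).
    The key table is the "additional information" and must be kept
    separate from the pseudonymised records, under its own access controls.
    """
    code_for: Dict[str, str] = {}
    used = set()
    pseudonymised = []
    for record in records:
        identity = record[id_field]
        if identity not in code_for:
            # Random codes reveal nothing about the identity they replace.
            code = "P-" + secrets.token_hex(8)
            while code in used:  # regenerate on the (unlikely) collision
                code = "P-" + secrets.token_hex(8)
            used.add(code)
            code_for[identity] = code
        stripped = {k: v for k, v in record.items() if k != id_field}
        stripped["participant_code"] = code_for[identity]
        pseudonymised.append(stripped)
    key_table = {code: identity for identity, code in code_for.items()}
    return pseudonymised, key_table


rows = [
    {"name": "A. Example", "score": 3},
    {"name": "B. Example", "score": 5},
    {"name": "A. Example", "score": 4},
]
pseudo_rows, key_table = pseudonymise(rows)
```

    Because the same person always receives the same code, records can still be linked within the dataset, which is usually the point. Destroying the key table does not by itself make the data anonymous if other fields allow re-identification.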

    Accountability and impact assessments

    The GDPR is built on accountability: it is not enough to comply, one must be able to demonstrate compliance. For research using personal data this brings practical obligations: documenting the lawful basis and Article 9 condition, being transparent with participants, applying data minimisation, and securing the data. Where processing is likely to result in a high risk to individuals, as large-scale processing of sensitive data often will, a data protection impact assessment (DPIA) is required under Article 35, identifying the risks and planning mitigations before processing begins. The DPIA is not merely a form to file; it is the moment at which a team thinks systematically about how its use of personal data could affect people and how to reduce that effect.

    A consistent vocabulary for compliance

    Data protection touches institutions, funders, ethics committees and repositories alike, and for the relevant information to be handled consistently across them, the terms involved (lawful basis, consent type, special category, pseudonymised, anonymised, retention) must mean the same thing everywhere. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the compliance metadata describing how personal data may be used is understood identically wherever it appears, supporting the broader machinery of research administration. And because stewarding personal data responsibly is a genuine contribution, that work can be described within the same framework as any other: the CRediT taxonomy and its full set of contribution roles. The GDPR is not the enemy of research; properly understood, it is the framework within which research that depends on people’s data can be done in a way that keeps faith with them.

  • Data availability statements: what to write and where to deposit

    Most journals now ask for a data availability statement, and most authors now write one. Far fewer write one that does what it is meant to do. The phrase “data are available from the authors on reasonable request” has become the default, yet study after study has found that requests against such statements frequently go unanswered — which means the statement records an intention rather than a reality. This guide covers what to write, where to put the data, and how to make a statement that is true. It builds on the foundations in the data-infrastructure domain and connects to the practices described in the reproducibility domain.

    What a data availability statement is for

    A data availability statement (sometimes a data accessibility statement) tells a reader where the data underlying a publication can be found, under what conditions, and — where access is restricted — why. Its purpose is to make the evidential basis of the work locatable and, where ethically possible, reusable. It is the public-facing expression of the principle that a published claim should be checkable against the data behind it. A good statement is specific: it names a repository, gives an identifier, and states the access conditions plainly.

    Make the data FAIR first, then describe it

    The statement is downstream of a deposit decision, so the deposit is where the real work happens. The widely adopted reference point is the FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable. FAIR is frequently misread as “open”, and the distinction matters: FAIR does not require data to be public. It requires that data be findable (with a persistent identifier and rich metadata), accessible (retrievable by a clear, possibly authenticated, protocol), interoperable (using shared formats and vocabularies), and reusable (with a clear licence and provenance). Sensitive data can be FAIR while remaining access-controlled — the metadata is open and findable even where the data themselves are not.

    Practically, making data FAIR before you write the statement means:

    • Deposit in a repository that mints a persistent identifier — typically a DataCite DOI — so the data are citable and resolvable independently of the article.
    • Describe the data with structured metadata, not just a filename, so they can be found and understood by someone who did not produce them.
    • Attach an explicit licence (for example a Creative Commons licence for open data) so reuse conditions are unambiguous.
    • Use community formats and vocabularies where they exist, so the data interoperate with other datasets in the field.
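
    The four points above can be turned into a simple pre-submission check. The sketch below looks only at whether each element is present in a deposit record, not at whether it is any good, so it is a reminder list and not a FAIR assessment. The function name and the record keys ("doi", "title", "description", "licence", "formats") are hypothetical.

```python
from typing import List


def fair_readiness_gaps(deposit: dict) -> List[str]:
    """Return which of the four preparation steps still look incomplete.

    Presence checks only: a non-empty field counts as done.
    """
    checks = {
        "persistent identifier": bool(deposit.get("doi")),
        "structured metadata": bool(deposit.get("title"))
        and bool(deposit.get("description")),
        "explicit licence": bool(deposit.get("licence")),
        "community format or vocabulary": bool(deposit.get("formats")),
    }
    return [name for name, done in checks.items() if not done]


# Placeholder record; the DOI is not real.
draft = {
    "doi": "10.0000/example-dataset",
    "title": "Example survey responses",
    "description": "",
    "licence": "CC0-1.0",
    "formats": ["CSV"],
}
print(fair_readiness_gaps(draft))
```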

    Choosing where to deposit: domain first, generalist as fallback

    Where to put the data is the decision that most shapes their long-term value. The general rule is to prefer a domain repository where a recognised one exists for your data type, and to use a generalist repository otherwise.

    Domain repositories

    A domain (or discipline-specific) repository is built around a particular kind of data and enforces the community’s metadata standards — GenBank for nucleotide sequences, the PDB for protein structures, and many others. Depositing here means your data sit alongside comparable datasets, are described to a standard your field already reads, and are discoverable by the people most likely to reuse them. Where your field expects deposit in a specific repository, that expectation is effectively mandatory and should be your first choice.

    Generalist repositories

    Where no suitable domain repository exists, a generalist repository — Zenodo, Figshare, Dryad and others — accepts data of any type, mints a DOI, and supports structured metadata and licensing. Generalists are the right home for the long tail of data that no specialised archive covers.

    A note on trust

    Whichever route you take, prefer a trusted digital repository — one assessed against a recognised standard such as CoreTrustSeal — over ad-hoc hosting. A repository’s job is long-term preservation and stable resolution; a personal website or a generic file-sharing link offers neither, and a link that has rotted makes a data availability statement worse than useless. Institutional and supplementary-file hosting can be acceptable, but the persistence commitment is what matters.

    Writing the statement

    A strong statement names the repository, gives the identifier, and states the conditions. Some patterns:

    • Open deposit: “The data supporting this study are openly available in [repository] at [DOI], under a [licence].”
    • Controlled access: “The data are available from [repository / controlled-access archive] subject to [conditions, e.g. a data access committee], because they contain [reason, e.g. identifiable personal data]. Metadata are openly available at [DOI].”
    • Genuinely no new data: “No new data were generated; the study analysed [named existing datasets] available at [identifiers].”

    Avoid the bare “available on request” formulation wherever the data could instead be deposited. Where access genuinely must be restricted — for participant confidentiality, commercial sensitivity, or Indigenous data governance — say so, give the reason, name who controls access, and still publish open metadata so the dataset is findable. An honest restricted-access statement is far stronger than a vague promise of availability.
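
    The open and controlled-access patterns above lend themselves to a small template helper. The sketch below is one possible way to fill them in, and it refuses to produce a statement when the repository, the identifier, or (for restricted data) the reason is missing. The function name and parameters are hypothetical, and the repository name and DOI in the example are placeholders.

```python
from typing import Optional


def availability_statement(
    repository: str,
    identifier: str,
    licence: Optional[str] = None,
    conditions: Optional[str] = None,
    reason: Optional[str] = None,
) -> str:
    """Fill in the open or controlled-access pattern.

    Pass `conditions` and `reason` for controlled access; otherwise pass
    `licence` for an open deposit. Raises ValueError when a required
    element is missing, so a vague statement cannot be produced.
    """
    if not repository or not identifier:
        raise ValueError("a statement needs a named repository and an identifier")
    if conditions is None:
        if not licence:
            raise ValueError("an open deposit needs an explicit licence")
        return (
            f"The data supporting this study are openly available in "
            f"{repository} at {identifier}, under a {licence} licence."
        )
    if not reason:
        raise ValueError("restricted access needs a stated reason")
    return (
        f"The data are available from {repository} subject to {conditions}, "
        f"because they contain {reason}. "
        f"Metadata are openly available at {identifier}."
    )


open_text = availability_statement(
    "Example Repository", "https://doi.org/10.0000/example", licence="CC0 1.0"
)
restricted_text = availability_statement(
    "Example Archive",
    "https://doi.org/10.0000/example",
    conditions="approval by a data access committee",
    reason="identifiable personal data",
)
```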

    Where shared vocabulary fits

    Terms like “available on request”, “restricted access”, “trusted repository”, and even “FAIR” are used inconsistently across journals and funders, which weakens the policies that depend on them. A shared, federated vocabulary that defines these precisely — pointing back to the FAIR principles and to certification schemes such as CoreTrustSeal — is what lets a statement written for one venue be understood by another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

  • Data citation: giving datasets the credit they deserve

    A great deal of published science rests on data the authors collected, cleaned, and shared — and yet the dataset itself, the object on which the conclusions actually depend, is routinely mentioned in passing or not at all. A finding is only checkable if a reader can find and reuse the data behind it, and the people who produced that data deserve recognition for an intellectual contribution that is often enormous. Treating datasets as first-class, citable outputs solves both problems at once. It is a core concern of the data-infrastructure domain and connects directly to the wider taxonomy of the research-outputs domain.

    Why data citation matters

    Citing data as data does two distinct jobs, and it is worth keeping them separate. The first is credit: assembling a well-documented dataset is real scholarly work — designing the collection, curating, validating, and documenting it — and that work is rewarded only if the dataset is cited as an output in its own right, not buried in a methods paragraph. The second is reproducibility and reuse: a result can only be verified, and the data only reused, if a reader can identify and locate the exact dataset that underpinned the analysis. A vague reference to “data available on request” serves neither goal; a formal citation to a deposited, identified dataset serves both.

    The FORCE11 data citation principles

    The community reference point here is the Joint Declaration of Data Citation Principles, developed through FORCE11 and endorsed across the scholarly-communication community. The declaration establishes that data should be treated as a legitimate, citable product of research, on the same footing as any other output. Its principles can be summarised as a short set of commitments:

    • Importance. Data should be considered legitimate, citable products of research; data citations should be accorded the same importance as citations of other objects.
    • Credit and attribution. Citations should facilitate giving scholarly credit and legal attribution to all contributors to the data.
    • Evidence. Where a claim relies on data, the corresponding data should be cited.
    • Unique identification. A citation should include a persistent, machine-actionable, globally unique identifier for the data.
    • Access, persistence, and specificity. Citations should enable access to the data and its metadata, persist even beyond the lifespan of the data, and identify the precise version and subset used.
    • Interoperability and flexibility. Citation methods should be interoperable across communities while accommodating their varying practices.

    Everything below is machinery for honouring these principles in practice.

    DataCite and the dataset DOI

    The practical foundation of data citation is the DataCite DOI. DataCite is a DOI registration agency focused on research data and related outputs. A dataset deposited in a repository, whether a generalist one such as Zenodo, Figshare, or Dryad or a discipline-specific one, is typically assigned a DataCite DOI that resolves persistently to the dataset and its metadata. The DOI is what goes in a reference list, exactly as an article DOI would, which is what makes a dataset citable on equal terms with a paper.

    The DOI is more than a link. The DataCite metadata record behind it carries the structured information that makes the citation meaningful: the creators (ideally with their ORCID iDs), the title, the publisher and publication year, the version, the licence, the resource type, and related identifiers connecting the dataset to the article it supports, the software that processed it, and the grant that funded it. Versioning is treated as a first-class concern: a revised dataset can receive its own version-specific DOI, satisfying the principles’ demand for specificity so that a citation pins down exactly the data used, not merely the latest state of an evolving collection.
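
    As a rough illustration of how those metadata fields turn into a reference-list entry, the sketch below assembles a citation string from a simplified record. The DatasetRecord class, its field names, the example values, and the DOI are hypothetical placeholders, not the DataCite schema or API; the ordering loosely follows the commonly recommended pattern of creators, year, title, version, publisher, resource type, and identifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical, simplified stand-in for a dataset metadata record."""
    creators: List[str]
    publication_year: int
    title: str
    publisher: str
    doi: str
    version: Optional[str] = None
    resource_type: str = "Dataset"


def format_citation(record: DatasetRecord) -> str:
    """Build a reference-list string from the record's core fields."""
    parts = [
        f"{'; '.join(record.creators)} ({record.publication_year}).",
        f"{record.title}.",
    ]
    # Include the version only when one is recorded, so the citation
    # pins down exactly which state of the data was used.
    if record.version:
        parts.append(f"Version {record.version}.")
    parts.append(f"{record.publisher}.")
    parts.append(f"[{record.resource_type}].")
    parts.append(f"https://doi.org/{record.doi}")
    return " ".join(parts)


example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    publication_year=2024,
    title="Example survey responses",
    publisher="Example Repository",
    doi="10.1234/example.5678",  # placeholder DOI, not a real dataset
    version="2.1",
)
print(format_citation(example))
```

    The point of the sketch is only that every element of the citation comes from a structured field, so a version-specific record yields a version-specific citation.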

    Crediting the people: the Data curation role

    Identifying the dataset is half the task; crediting the humans who produced it is the other half, and the two are easily confused. A DataCite DOI identifies and persists the artefact; it does not, on its own, record the division of labour that produced it. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Data curation role — defined as the management activities to annotate, scrub, and maintain research data (including the software code where needed to interpret the data) for initial use and later reuse. Recording Data curation on the associated paper makes visible the often-uncredited work of turning raw observations into a documented, reusable dataset.

    The two layers complement each other precisely. The dataset DOI and its DataCite metadata say what the data is, where it lives, and which version; the CRediT role record says who curated, validated, and maintained it. Used together they ensure that both the data and the people behind it are visible — rather than the common outcome where neither is, and the dataset is reduced to an unattributed line in a methods section.

    A practical recipe

    1. Deposit the data in a trustworthy repository and obtain a DataCite DOI, rather than leaving it “available on request”.
    2. Cite the dataset in your reference list using its DOI, the way you would cite an article — not in a footnote or in prose.
    3. Pin the version. Where the data may change, cite the version-specific DOI so the citation identifies exactly what was used.
    4. Record the contributors — on both the DataCite record (with ORCID iDs) and, via CRediT’s Data curation role, on the paper the data supports.
    5. Apply a clear licence. Data that cannot be reused with confidence is data that will not be reused; the citation principles assume the reuse terms are stated.
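
    The recipe can also be read as a checklist. The sketch below reports which steps a given dataset record still fails; the dictionary keys (doi, cited_in_reference_list, version, creators, orcid, data_curation_credited, licence) are hypothetical names chosen for illustration, not fields of any real repository API.

```python
from typing import Any, Dict, List


def recipe_gaps(record: Dict[str, Any]) -> List[str]:
    """Return the recipe steps that the (hypothetical) record does not yet meet."""
    gaps: List[str] = []
    if not record.get("doi"):
        gaps.append("step 1: no DOI, deposit the data in a repository")
    if not record.get("cited_in_reference_list"):
        gaps.append("step 2: dataset is not cited in the reference list")
    if not record.get("version"):
        gaps.append("step 3: no version pinned")
    creators = record.get("creators", [])
    if not creators or any(not c.get("orcid") for c in creators):
        gaps.append("step 4: creators missing or lacking ORCID iDs")
    if not record.get("data_curation_credited"):
        gaps.append("step 4: Data curation role not recorded on the paper")
    if not record.get("licence"):
        gaps.append("step 5: no licence stated")
    return gaps


draft = {
    "doi": "10.1234/example.5678",  # placeholder DOI
    "cited_in_reference_list": True,
    "version": None,
    "creators": [{"name": "Doe, J.", "orcid": "0000-0000-0000-0000"}],  # placeholder iD
    "data_curation_credited": True,
    "licence": "",
}
for gap in recipe_gaps(draft):
    print(gap)
```

    For this draft record the function flags the missing version and the missing licence, which are steps 3 and 5 of the recipe.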

    Where shared vocabulary fits

    “Dataset”, “data citation”, “version”, “data curation”, and “repository” are used inconsistently across communities, which is part of why credit for data leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 data citation principles and to DataCite — is what lets a data citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain, with adjacent entries in the research-outputs domain.

  • Federated analysis: bringing computation to the data

    The default model of data analysis is straightforward: gather the data you need into one place, then run your analysis on it. For a great deal of research this works perfectly well. But for some of the most valuable data in existence — patient health records, genomic data, sensitive social and administrative registries — gathering it into one place is precisely the problem. Such data is often legally, ethically and practically impossible to move freely: it cannot be copied across borders or handed to external researchers without breaching privacy law and the trust of the people it describes. The conventional model assumes the data can come to the analysis. When it cannot, research seems stuck. Federated analysis offers a way out by inverting the model entirely, and it represents an important development in the data infrastructure domain of the CASRAI Dictionary.

    The core idea: send the code, not the data

    The central insight of federated analysis is deceptively simple: instead of bringing the data to the computation, bring the computation to the data. The data stays where it is — in the hospital, the registry, the institution that holds it and is responsible for it — and the analysis is sent to run against it in place. What travels back is not the raw data but the results of the analysis: aggregate statistics, model parameters, summaries. Multiple sites can each run the same analysis on their own local data, and the results are combined to produce an answer that draws on all of them — without any site ever exposing or releasing its underlying records. The researcher gets the benefit of analysing data from many sources; the data never leaves the places entitled to hold it. This reversal is what makes collaboration possible across data that could never be pooled.
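
    A minimal sketch of this pattern, under simplifying assumptions: three hypothetical sites each hold a list of measurements, the same analysis function runs against each list in place, and only a sum and a count travel back to be combined into a pooled mean. The site names, data, and function names are invented for illustration; a real deployment would run the analysis on remote infrastructure rather than in one process.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical local data. In a real federation each list would live
# at a different institution and would never be gathered like this.
SITE_DATA: Dict[str, List[float]] = {
    "hospital_a": [5.0, 7.0, 9.0],
    "hospital_b": [4.0, 6.0],
    "registry_c": [10.0],
}

Summary = Tuple[float, int]


def local_summary(values: List[float]) -> Summary:
    """The analysis sent to each site: returns aggregates, not records."""
    return (sum(values), len(values))


def run_at_sites(analysis: Callable[[List[float]], Summary],
                 site_data: Dict[str, List[float]]) -> List[Summary]:
    """Stand-in for dispatching the analysis to every site."""
    return [analysis(values) for values in site_data.values()]


def pooled_mean(summaries: List[Summary]) -> float:
    """Combine per-site aggregates into one overall mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


summaries = run_at_sites(local_summary, SITE_DATA)
print(pooled_mean(summaries))
```

    The pooled mean equals the mean that would have been computed on the combined records, yet the coordinator only ever sees one sum and one count per site.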

    DataSHIELD

    A well-established framework embodying this approach is DataSHIELD. DataSHIELD enables the remote, non-disclosive analysis of sensitive data: researchers can run statistical analyses across data held at multiple sites without the individual-level data ever being seen or transferred. It is designed so that only aggregate, non-disclosive results are returned — the system is built to prevent queries that could expose information about individuals. DataSHIELD has been used particularly in health and biomedical research, where the data is among the most sensitive and the barriers to pooling are highest. It is a concrete demonstration that meaningful joint analysis across institutions is achievable without anyone surrendering control of their data.
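
    One common kind of safeguard in non-disclosive systems is a minimum group size: an aggregate is refused when it would be computed over so few records that it could reveal an individual. The sketch below illustrates that general idea only. It is not DataSHIELD's API, and the threshold value, function name, and exception name are assumptions made for the example.

```python
from typing import List

MIN_CELL_COUNT = 5  # assumed threshold, chosen for illustration only


class DisclosureError(Exception):
    """Raised when a result would be based on too few records."""


def safe_mean(values: List[float], min_count: int = MIN_CELL_COUNT) -> float:
    """Return the mean only if enough records contribute to it."""
    if len(values) < min_count:
        raise DisclosureError(
            f"refused: only {len(values)} records, minimum is {min_count}"
        )
    return sum(values) / len(values)


print(safe_mean([1.0, 2.0, 3.0, 4.0, 5.0]))

try:
    safe_mean([1.0, 2.0])
except DisclosureError as err:
    print(err)
```

    The check runs at the site holding the data, so a refused query returns an error message rather than a result that could be traced back to a few individuals.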

    The Personal Health Train

    Another influential conception is the Personal Health Train (PHT), which offers a memorable metaphor for the same principle. In this image, the data stays in “stations” — the institutions that hold it — and analyses travel between them like “trains” that visit each station, run their computation on the local data, and move on, carrying results rather than data. The Personal Health Train frames federated analysis as an infrastructure pattern: a way of organising data and analyses so that the data remains under the governance of its custodians while still being available, in a controlled way, for legitimate research. It emphasises that the data custodians retain authority — deciding which analyses may visit and run — which is essential for maintaining trust and meeting legal obligations. The metaphor has helped communicate the concept to the clinical and governance communities whose buy-in federated approaches require.
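
    The custodian's authority in this metaphor can be sketched as an approval check at each station. In the toy example below, a train is a function, each station keeps its records private and runs only trains it has approved, and the route collects results rather than data. The Station class, the train identifier, and the sample ages are hypothetical and do not reflect any particular Personal Health Train implementation.

```python
from typing import Callable, Dict, List, Optional, Set

Train = Callable[[List[int]], int]


class Station:
    """A data custodian that decides which analyses may run locally."""

    def __init__(self, name: str, records: List[int],
                 approved_trains: Set[str]) -> None:
        self.name = name
        self._records = records          # stays inside the station
        self._approved = approved_trains

    def visit(self, train_id: str, train: Train) -> Optional[int]:
        """Run the train on local records if approved, else refuse."""
        if train_id not in self._approved:
            return None
        return train(self._records)


def count_over_60(ages: List[int]) -> int:
    """Example train: counts records with age above 60."""
    return sum(1 for age in ages if age > 60)


def dispatch(train_id: str, train: Train,
             route: List[Station]) -> Dict[str, Optional[int]]:
    """Send the train along the route and collect per-station results."""
    return {s.name: s.visit(train_id, train) for s in route}


stations = [
    Station("station_a", [34, 61, 72, 58], {"count_over_60"}),
    Station("station_b", [65, 70, 45], {"count_over_60"}),
    Station("station_c", [80, 81], set()),  # has approved no trains
]
results = dispatch("count_over_60", count_over_60, stations)
print(results)
```

    Here station_c returns nothing because it never approved the analysis, which is the sense in which custodians retain authority over what runs on their data.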

    Federated learning

    A closely related idea, prominent in machine learning, is federated learning: training a model across multiple decentralised data sources without centralising the data. Each site trains on its own local data and shares only model updates, which are combined to build a model that has effectively learned from all the data without any of it being gathered together. Federated learning applies the bring-computation-to-the-data principle to the training of models specifically, and it has attracted intense interest precisely because so much of the data that would make models better is data that cannot be pooled. It is the same philosophy — keep the data local, move only what is non-disclosive — applied to a particularly data-hungry kind of computation.
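
    The combining step can be sketched as a weighted average of the parameters each site sends back, weighted by how many local examples the site trained on. This is a simplified illustration of the aggregation idea usually called federated averaging; the function name and the example numbers are assumptions, and real systems add many details (multiple rounds, secure aggregation, and so on) that are omitted here.

```python
from typing import List, Tuple

# Each update is (model parameters, number of local training examples).
Update = Tuple[List[float], int]


def federated_average(updates: List[Update]) -> List[float]:
    """Average parameters across sites, weighted by local sample count."""
    total_examples = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [
        sum(params[i] * n for params, n in updates) / total_examples
        for i in range(dim)
    ]


# Hypothetical updates from two sites; only these numbers are shared,
# never the training data behind them.
updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
]
print(federated_average(updates))
```

    The site with 300 examples pulls the combined parameters three times as hard as the site with 100, so the result sits closer to its update.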

    Data minimisation by design

    What ties these approaches together is the principle of data minimisation: the idea that you should use and move the minimum data necessary for a given purpose. Federated analysis is, in a sense, data minimisation built into the architecture. Rather than copying entire datasets around and trusting everyone downstream to handle them responsibly, it ensures that the sensitive data simply never moves, and that only the minimal, non-disclosive results are shared. This has clear advantages:

    • Privacy. Individuals’ records stay protected because they are never exposed or transferred.
    • Governance. Data custodians retain control and can meet their legal and ethical obligations to the people whose data they hold.
    • Scale. Research can draw on data from many institutions and jurisdictions that could never agree to pool their data centrally.

    Working with data that cannot be open

    Federated analysis sits within the broader challenge of doing valuable research on data that cannot be fully open. It is a powerful answer to the question of how sensitive data can be reused for the public good without being exposed: the data can be analysed and learned from while remaining as protected as it must be. This complements, rather than replaces, controlled-access arrangements and secure environments; it is another tool for reconciling the duty to protect with the desire to discover. Sound research administration increasingly has to account for these arrangements when planning sensitive-data projects.

    A consistent vocabulary for federated work

    For federated analysis to work across institutions, the descriptions of what is being analysed and shared must be consistent. Data dictionaries must align so that a variable means the same thing at every station; access conditions, governance terms and the nature of returned results must be described in compatible ways, or a federated analysis cannot reliably combine results across sites. That consistency is what the CASRAI Dictionary supports: a shared vocabulary so that the metadata describing federated data and analyses is understood identically wherever it travels. And because building, running and curating federated analyses is a genuine contribution, the work can be described in the same framework used for every other kind of research contribution: the CRediT taxonomy and its set of contribution roles. Federated analysis shows that the choice between using data and protecting it is sometimes a false one: with the right architecture, you can do both.