Tag: research data

The OSTP Nelson Memo Deadline: Free Federal Research
The 2022 memorandum from the US Office of Science and Technology Policy (OSTP), widely known as the Nelson memo after the then-acting director who signed it, directed federal agencies that fund research to make the resulting peer-reviewed publications and their supporting data freely available to the public without an embargo. Agencies were asked to develop and implement updated public-access plans, with the milestone for full effect set for the end of 2025. This article is a neutral description of the policy and its rollout, not compliance advice.

What the memo directed

The Nelson memo built on earlier US public-access policy but extended and tightened it in two notable ways. First, it removed the previously permitted twelve-month embargo, so that publications arising from federal funding should be free to read immediately on publication. Second, it explicitly brought supporting research data into scope, asking agencies to ensure that data underlying published, peer-reviewed findings are made publicly accessible.

Crucially, the memo also widened applicability. Earlier guidance had applied only to agencies with more than $100 million in annual research and development expenditures; the Nelson memo applied across federal agencies that fund research, including smaller agencies that had not previously operated formal public-access programmes. Each agency was asked to publish its own implementation plan within a common framework.

The end-of-2025 milestone

The memo set a phased timeline. Agencies were expected to update their public-access policies and then bring them fully into effect no later than 31 December 2025. In practice this meant that, across federal science funders, publications and associated data tied to awards should be subject to immediate free-access expectations by that date.

The most visible early mover was the National Institutes of Health, whose revised arrangement is described in our companion explainer on the NIH Public Access Policy. NIH’s removal of the embargo is a concrete instance of the broader direction the Nelson memo set for the whole federal research system.

How agency rollouts took shape

Because the memo delegated implementation to each agency, the rollout was not a single switch but a set of staggered agency plans sharing common principles. Typical features of agency public-access plans include:
- Immediate access to the peer-reviewed publication, removing the prior embargo window.
- Data sharing expectations for the data underlying the published findings, with appropriate handling of sensitive or restricted data.
- Persistent identifiers and metadata to make outputs findable and to link publications, data and awards.
- Designated repositories or repository criteria through which compliant deposits are made.

Identifiers feature heavily in these plans because they make compliance auditable and outputs discoverable. For background, see our notes on persistent identifiers in the standards dictionary, which explain how DOIs and related identifiers support linking across the scholarly record.
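
As a rough illustration of why identifiers make outputs linkable, the sketch below collapses several common ways of writing a DOI into one resolvable https://doi.org/ URL. The function name, the prefix list and the example DOI (10.1234/example.5678) are assumptions made for illustration; none of them comes from an agency plan.

```python
import re

# Loose shape check: "10.", a registrant code of 4-9 digits, "/", then a suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste in front of a DOI (assumed, non-exhaustive).
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Return one canonical https://doi.org/ URL for a DOI in a common form."""
    value = raw.strip()
    lowered = value.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):].strip()
            break
    if not DOI_SHAPE.match(value):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    # DOI names are case-insensitive, so lower-casing gives one form to compare on.
    return "https://doi.org/" + value.lower()


# The same made-up DOI written three ways collapses to a single link target.
variants = [
    "doi:10.1234/Example.5678",
    "https://doi.org/10.1234/example.5678",
    "10.1234/EXAMPLE.5678",
]
unique_targets = {normalise_doi(v) for v in variants}
```

Normalising before comparison is what lets a compliance check or a search index recognise that a paper, a dataset record and an award report all point at the same output.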

Why data was the harder part

Making publications free to read is operationally well understood, building on a decade of deposit infrastructure. Extending public access to data is more complex. Datasets vary enormously in size, format and sensitivity, and not all data can be openly shared: human-subjects data, for example, may carry privacy and consent constraints. Agency plans therefore tend to frame data sharing around the principle of being as open as possible and as closed as necessary, with documented justifications where access must be restricted.

This is where the policy intersects with established data-stewardship principles. The expectation is generally that shared data are described with sufficient metadata to be reusable, echoing the widely cited FAIR principles (findable, accessible, interoperable, reusable) referenced in our explainer on FAIR data.

Persistent identifiers and infrastructure

A practical thread running through agency public-access plans is the use of persistent identifiers and structured metadata. Identifiers such as DOIs for publications and datasets, ORCID iDs for researchers, and award and organisation identifiers make it possible to link an output back to the award that funded it and the person who produced it. This linking is what turns a pile of free documents into a navigable, auditable record of what public funding produced.
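
One reason identifiers support auditing is that some of them can be checked mechanically. An ORCID iD ends in a check character computed with the ISO 7064 MOD 11-2 algorithm, so a mistyped iD can be caught before it is linked to an award. The sketch below is a minimal validator; the function names are illustrative, and the iD used is the fictitious example that ORCID publishes for testing.

```python
import re

ORCID_SHAPE = re.compile(r"^[0-9]{15}[0-9X]$")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the shape and check character of an iD such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "").upper()
    if not ORCID_SHAPE.match(compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


sample_ok = is_valid_orcid("0000-0002-1825-0097")   # ORCID's documented example iD
sample_typo = is_valid_orcid("0000-0002-1825-0098")  # last character altered
```

A check like this only confirms that the iD is well formed. It does not confirm that the iD belongs to the person named on the award, which still depends on the researcher authenticating with ORCID.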

That emphasis aligns the memo with infrastructure the scholarly community already uses. Our explainers on the DOI and the ORCID iD describe two of the building blocks agencies lean on. The broader point is that immediate access is not only about removing a paywall; it is about making outputs findable, attributable and connected.

What changed for researchers and institutions

For researchers, the practical consequence is that the funder-driven expectation of free, immediate access now extends across more agencies and now reaches data as well as papers. Award terms, data-management planning and deposit workflows reflect those expectations. Data-management and sharing plans became a more prominent part of the application and award lifecycle, prompting researchers to think early about which data will be shared, where, and under what conditions. Institutions commonly updated library guidance, data-repository support and compliance tracking in response, and many expanded research-data services to help investigators meet the data-sharing element rather than only the publication element.

Equity and the cost question

One theme the memo raised explicitly is equity in publishing. Removing embargoes increases free access for readers, but the costs of publishing do not disappear — they may shift, for example toward article-processing charges in some open-access models. The memo asked agencies to consider how their public-access approaches affect different communities of researchers, including those with fewer resources, so that the move to open access does not inadvertently disadvantage smaller institutions or early-career researchers who may struggle with publication fees. This is part of why depositing the accepted manuscript in a repository — a route that does not require paying a fee — remains an important compliance pathway alongside open-access journals.

The practical upshot is that immediate access can be achieved through more than one route, and agencies have generally been careful not to mandate a single business model. The goal is free public access to the output, with flexibility in how that access is delivered.

The bottom line

The Nelson memo is best understood as a framework rather than a single rule: it set the destination — immediate, free public access to federally funded publications and their underlying data — and asked each agency to chart its own route there by the end of 2025. Readers seeking authoritative detail should consult each agency’s published public-access plan and OSTP’s own guidance at whitehouse.gov/ostp.

June 22, 2026

Identifiers for Things, Not Just Papers: IGSN and PIDINST

When researchers think about persistent identifiers, they usually picture DOIs on papers and datasets or ORCID iDs on people. Yet a great deal of research turns on physical things: a sediment core drilled from a lake bed, a tissue specimen in a biobank, a water sample from a particular depth on a particular day, or the spectrometer that analysed it. These physical research objects have historically been referred to by inconsistent local labels, if they were referred to at all. Two complementary efforts, the IGSN for samples and the PIDINST work for instruments, set out to give them stable, global identifiers.

Why physical objects need PIDs

The case for identifying physical objects mirrors the case for identifying any research output. A persistent identifier lets a sample or instrument be referred to unambiguously across publications, datasets, and laboratories. It allows the measurements derived from a sample to be linked back to the sample itself, and onward to the instrument that produced them. Without such links, reuse and verification become difficult: a reader cannot easily tell whether two studies analysed the same specimen, or whether a calibration problem on a particular instrument might affect a body of results. Persistent identification turns scattered physical objects into nodes in a connected research graph, supporting the goals of FAIR data.

IGSN: identifiers for samples

The IGSN began in the geosciences as the International Geo Sample Number, a way to give individual physical samples a globally unique identifier so that specimens could be tracked and cited across the literature. As the approach proved useful beyond geology, the system evolved. The IGSN is now implemented as an IGSN ID, issued through DataCite, which brought sample identification into the same DOI-based infrastructure used for datasets and other outputs. This alignment means a sample can carry a resolvable identifier, a landing page, and structured metadata describing what the sample is, where and when it was collected, and how it relates to other objects.

The practical effect is that a physical specimen becomes a citable entity. A paper can reference the exact sample it analysed; a dataset can link each measurement to the sample it came from; and a repository can expose the provenance of its holdings. For disciplines that depend on irreplaceable physical material, from earth science to the life sciences, this is a meaningful advance in traceability.

PIDINST: identifiers for instruments

Where IGSN addresses samples, the PIDINST working group, convened under the Research Data Alliance, addressed the instruments themselves. The group developed a metadata schema for persistent identification of measuring instruments, so that a microscope, sensor, telescope, or analytical device can be referenced by a persistent identifier and described in a consistent way. The schema captures the kind of information that makes an instrument identifiable and useful to cite: what it is, who owns or operates it, its model and configuration, and identifiers for related entities such as the institution that hosts it.

Identifying instruments matters because the measuring apparatus is part of the methods. When the data from an experiment can be linked to the specific instrument that produced them, it becomes possible to assess instrument-related effects, to credit the facilities that maintain expensive equipment, and to trace a result from a published figure all the way back to the device on a laboratory bench.

Connecting the chain of provenance

The real power of these identifiers appears when they are used together. Imagine a measurement linked to the instrument that produced it via a persistent identifier described with the PIDINST schema, the sample it was taken from via an IGSN ID, the dataset it belongs to via a DataCite DOI, and the researchers responsible via their ORCID iDs. Each link is a small piece of metadata, but together they describe an unbroken chain of provenance from a published claim back to the physical objects and people behind it. That is precisely the kind of connected, machine-actionable record that modern research infrastructure aspires to.
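
The chain described above can be pictured as a small record whose fields are all identifiers. The sketch below is only an illustration: the class, its field names and the placeholder identifier values are hypothetical and are not taken from the PIDINST, IGSN or DataCite schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    """One measurement plus the identifiers that tie it to its provenance."""
    label: str
    instrument_pid: Optional[str] = None   # PID of the instrument record
    sample_igsn: Optional[str] = None      # IGSN ID of the physical sample
    dataset_doi: Optional[str] = None      # DOI of the containing dataset
    contributor_orcids: List[str] = field(default_factory=list)


def missing_links(m: Measurement) -> List[str]:
    """List which parts of the provenance chain are not yet identified."""
    gaps = []
    if not m.instrument_pid:
        gaps.append("instrument")
    if not m.sample_igsn:
        gaps.append("sample")
    if not m.dataset_doi:
        gaps.append("dataset")
    if not m.contributor_orcids:
        gaps.append("contributors")
    return gaps


# Placeholder identifiers, invented for the example.
complete = Measurement(
    label="grain-size run 7",
    instrument_pid="https://doi.org/10.1234/instrument.example",
    sample_igsn="https://doi.org/10.1234/sample.example",
    dataset_doi="https://doi.org/10.1234/dataset.example",
    contributor_orcids=["0000-0002-1825-0097"],
)
partial = Measurement(
    label="grain-size run 8",
    instrument_pid="https://doi.org/10.1234/instrument.example",
    dataset_doi="https://doi.org/10.1234/dataset.example",
)
```

Here `missing_links(complete)` returns an empty list, while `missing_links(partial)` reports the sample and the contributors as the broken links in the chain.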

Towards a fully identified research record

Extending persistent identification to samples and instruments fills two of the larger gaps in the research record. Articles, data, organisations, and people increasingly carry stable identifiers; physical objects and the apparatus that measures them have lagged behind. By bringing samples into the DataCite ecosystem as IGSN IDs and by giving instruments a shared metadata schema through PIDINST, the community is steadily closing those gaps. The vocabularies and crosswalks that hold such a record together are the kind of standards work catalogued in the CASRAI data dictionary, and they complement contributor frameworks such as CRediT by anchoring the human contributions to the physical things they acted upon.

June 21, 2026

FAIR Principles for Research Data Explained

FAIR data refers to research data managed according to four guiding principles — Findable, Accessible, Interoperable and Reusable — designed to maximise the value of data for both humans and machines. The principles were set out by Mark Wilkinson and colleagues in a landmark 2016 paper in Scientific Data and have since been adopted widely by funders, publishers and research institutions as a benchmark for good data stewardship. FAIR describes how data should be described, shared and preserved so that it can be discovered and reused long after a project ends.

A common misconception is that FAIR means “open”. It does not. FAIR is about good management and clear conditions of use; data can be FAIR while access remains controlled, which matters for sensitive or personal data.

What each principle means

The four principles work together, and their order spells the acronym rather than prescribing a strict sequence. Each rests heavily on metadata and persistent identifiers.

Principle	Core idea	Key enablers
Findable	Data and metadata are easy to locate by humans and machines	Persistent identifiers (e.g. DOIs), rich metadata, indexing
Accessible	Once found, data can be retrieved by a clear, open protocol	Standard protocols; metadata stays available even if data are restricted
Interoperable	Data can be combined and used with other data and systems	Shared vocabularies, standard formats, controlled terminologies
Reusable	Data are richly described and licensed for reuse	Clear licences, provenance, community standards and metadata

Findable requires that data and metadata carry globally unique, persistent identifiers and are described well enough to be indexed and searched. Accessible means the data can be retrieved using a standardised, open communication protocol, with authentication where needed — and, importantly, that metadata remain accessible even when the underlying data are not. Interoperable calls for data to use shared, standard formats and vocabularies so they can be integrated with other datasets and processed by different systems. Reusable requires rich description, clear provenance and an explicit usage licence so others can confidently build on the data.

The role of persistent identifiers and metadata

Two enablers run through all four principles: persistent identifiers and metadata. A persistent identifier, such as a DOI for a dataset or an ORCID iD for a researcher, provides a stable, resolvable reference that does not break when URLs change, underpinning findability and provenance. Metadata (structured information describing what the data are, how they were produced, and under what terms they may be used) is what makes data discoverable, interpretable and reusable. Crucially, FAIR treats metadata as valuable in its own right: rich, standardised metadata can remain open and findable even when the dataset itself is access-controlled. This is precisely the kind of standardised description that shared vocabularies, such as the CASRAI dictionary, and broader data infrastructure are built to support.

FAIR versus open

FAIR and open are related but distinct. Open data is data anyone can freely access, use and redistribute. FAIR data is well-managed, well-described data with clear access conditions, which may or may not be open. A phrase widely used alongside the principles, though not part of their original text, is “as open as possible, as closed as necessary”. It captures the balance: maximise reuse while respecting legitimate constraints such as privacy, consent, commercial sensitivity or indigenous data rights. A dataset of patient records can be made FAIR (richly described, identified, governed and licensed) without being openly downloadable. Conversely, dumping a file online makes it open but not necessarily FAIR if it lacks identifiers, metadata or a licence.
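
The contrast between a governed dataset and a bare file can be made concrete with a toy completeness check. The sketch below is not an official FAIR assessment: the field names and their grouping under each principle are assumptions chosen for illustration, and the two records are invented.

```python
# Hypothetical metadata fields grouped by the principle they mainly support.
FAIR_FIELDS = {
    "Findable": ["identifier", "title", "description"],
    "Accessible": ["access_protocol", "access_conditions"],
    "Interoperable": ["format", "vocabulary"],
    "Reusable": ["licence", "provenance"],
}


def fair_gaps(record: dict) -> dict:
    """Map each principle to the fields the record leaves empty or missing."""
    gaps = {}
    for principle, fields in FAIR_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


# Access-controlled patient data that is nevertheless fully described.
controlled = {
    "identifier": "https://doi.org/10.1234/example.patients",
    "title": "Example cohort records",
    "description": "Pseudonymised clinical measurements from an example cohort.",
    "access_protocol": "https",
    "access_conditions": "controlled: application to a data access committee",
    "format": "CSV",
    "vocabulary": "community terminology named in the README",
    "licence": "data use agreement",
    "provenance": "collected 2020-2022, processing steps documented",
}

# A file placed online with no identifier, description or licence.
open_dump = {
    "title": "results_final_v2.xlsx",
    "access_protocol": "https",
    "access_conditions": "open",
}
```

Under these assumptions the controlled record has no gaps even though nobody can download it freely, while the openly posted file is missing fields under Findable, Interoperable and Reusable.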

For researchers, adopting FAIR practice means assigning identifiers, writing good metadata, using standard formats and stating licences from the outset rather than at the end of a project. Guidance on preparing and describing data is available in our resources for authors, and FAIR data underpins the reproducibility goals discussed across our research-outputs coverage.

Frequently asked questions

What does FAIR stand for?

FAIR stands for Findable, Accessible, Interoperable and Reusable. The four principles, published by Wilkinson and colleagues in 2016, describe how research data and metadata should be managed so they can be discovered, retrieved, combined and reused effectively by both humans and machines.

Does FAIR mean the same as open data?

No. Open data can be freely accessed and reused by anyone, whereas FAIR data is well-described and well-managed with clear access conditions that may be restricted. The guiding phrase is “as open as possible, as closed as necessary”, so sensitive data can still be FAIR.

Why are persistent identifiers important for FAIR data?

Persistent identifiers such as DOIs and ORCID iDs provide stable, resolvable references that do not break when web addresses change. They underpin findability and provenance, letting data, researchers and outputs be reliably located and credited over the long term.

Can data be FAIR without being publicly downloadable?

Yes. FAIR requires clear access protocols and rich metadata, not unrestricted access. Metadata can remain findable and accessible even when the underlying dataset is controlled, so sensitive datasets can be made FAIR while access stays appropriately governed.

June 18, 2026

Licensing research data: CC-BY, CC0 and when to use each
You can deposit a dataset in a trusted repository, describe it with rich metadata, and give it a DOI — and still leave it effectively unusable, because you forgot the one line that tells a reuser what they are allowed to do with it. A dataset without a clear licence is data nobody can confidently build on: a careful researcher, unsure of the terms, will simply not reuse it. Licensing is therefore not a legal afterthought but the part of the data-infrastructure domain that determines whether a deposit delivers the “R” in FAIR at all. This guide explains the main choices — principally CC0 and CC BY — and when each fits.

Why a licence is the reusability switch

The FAIR principles ask that data be Findable, Accessible, Interoperable, and Reusable — and reusability rests explicitly on data being “released with a clear and accessible data usage licence”. Without a licence, default copyright and database rights leave the legal status ambiguous, and ambiguity is fatal to reuse: a would-be user cannot tell whether combining your data with theirs, redistributing it, or building a tool on it is permitted. An explicit, standard, machine-readable licence resolves that uncertainty in advance, for everyone, without anyone having to ask. That is why “attach an explicit licence” is the step that turns a findable dataset into a reusable one.

The two main choices for data

CC0 — the public-domain dedication

CC0 is a Creative Commons tool by which the rights-holder waives, to the fullest extent the law allows, all copyright and related rights in the work — placing it as close to the public domain as possible. For data, CC0 means a reuser can use, combine, modify, and redistribute the data with no conditions at all, including no obligation to attribute. This is widely recommended as the default for research data, and for a specific reason: data are routinely aggregated from many sources, and attribution requirements that stack up across hundreds of datasets (“attribution stacking”) can become legally and practically unworkable. CC0 removes that friction entirely and maximises interoperability. Several major data repositories and infrastructures apply CC0 by default for exactly this reason.

Importantly, CC0 waives legal requirements, not scholarly norms. Citing the data you use remains an academic and ethical expectation regardless of the licence — CC0 simply means that expectation is enforced by the norms of good scholarship rather than by copyright law.

CC BY — attribution required

CC BY permits the same broad reuse — use, adaptation, redistribution, including commercially — but on the single condition that the original creator is credited. For data, CC BY is appropriate where attribution matters enough to be a legal condition, or where a funder or institution requires it. It is the most permissive of the conditional Creative Commons licences and is the default for many open-access publications. The trade-off relative to CC0 is precisely the attribution clause: it guarantees credit, but it reintroduces the attribution-stacking problem when many datasets are combined.

Choosing between them
- Prefer CC0 for data intended for the widest possible aggregation and reuse, especially where the data will be merged with many other sources. It maximises interoperability and removes legal friction; rely on citation norms for credit.
- Choose CC BY where attribution must be a legal condition, where a funder or repository mandates it, or where the dataset is a discrete, citable product whose creators need enforceable credit.
- Be cautious with more restrictive clauses. Non-commercial (NC) and No-Derivatives (ND) terms substantially limit reuse and can render data incompatible with other open data; they are generally discouraged for research data unless a specific ethical or legal constraint demands them.

Data are not software: a critical caveat

Creative Commons licences are designed for content such as text, images, and data, and Creative Commons itself advises against using them for software. Software has needs that CC licences do not address: patent grants, the distinction between source and compiled code, and copyleft mechanics. For code, use a recognised software licence instead: a permissive one such as MIT, BSD, or Apache 2.0, or a copyleft one such as the GPL. If your deposit bundles a dataset and the code that processes it, license each part appropriately: a CC licence (or CC0) for the data, an OSI-approved software licence for the code. Conflating the two is one of the most common licensing mistakes in research deposits.
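
The split between data and code licences can be sketched as a simple check over the parts of a deposit. The example below is a minimal illustration, not a legal tool. It assumes licences are recorded as SPDX identifiers (for example CC0-1.0, CC-BY-4.0, MIT), and the function names, the two allow-lists and the file names are hypothetical.

```python
# SPDX identifiers for the licences discussed above (deliberately short lists).
DATA_LICENCES = {"CC0-1.0", "CC-BY-4.0"}
SOFTWARE_LICENCES = {"MIT", "BSD-3-Clause", "Apache-2.0", "GPL-3.0-or-later"}


def suggest_data_licence(attribution_must_be_legal_condition: bool) -> str:
    """Default to CC0 for data; use CC BY only where credit must be enforceable."""
    return "CC-BY-4.0" if attribution_must_be_legal_condition else "CC0-1.0"


def licence_problems(parts):
    """Check (name, kind, spdx_id) tuples, where kind is 'data' or 'code'."""
    problems = []
    for name, kind, spdx_id in parts:
        if kind == "code" and spdx_id not in SOFTWARE_LICENCES:
            problems.append(f"{name}: code needs a software licence, not {spdx_id}")
        elif kind == "data" and spdx_id not in DATA_LICENCES:
            problems.append(f"{name}: data licence {spdx_id} is outside CC0 / CC BY")
    return problems


# A deposit that conflates the two: the analysis script carries a CC licence.
deposit = [
    ("measurements.csv", "data", suggest_data_licence(False)),
    ("analysis.py", "code", "CC-BY-4.0"),
]
issues = licence_problems(deposit)
```

With these inputs the dataset passes and the script is flagged, which mirrors the common mistake of placing a whole bundle under one Creative Commons licence.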

A practical checklist
1. Confirm you have the right to license the data. Check funder terms, any data-sharing agreements, third-party data within your dataset, and, for personal or sensitive data, consent and governance constraints. A licence cannot grant rights you do not hold.
2. Default to CC0 for data unless there is a positive reason to require attribution; choose CC BY where there is.
3. License software separately with an OSI-approved licence; never put code under a Creative Commons licence.
4. State the licence explicitly in the deposit metadata and in any data availability statement, using the standard licence identifier so it is machine-readable.
5. Cite the data you reuse regardless of its licence: the scholarly norm holds even when the law does not require it.

How this connects to contribution and credit

Licensing answers “what may be done with this output?”; it is a sibling of the question “who made it?”, which the CRediT taxonomy answers. A dataset’s intellectual work is recorded on the associated paper through roles such as Data curation and Investigation, while the licence governs downstream reuse of the artefact itself. Used together — a clear licence on the data and clear contribution roles on the people — they ensure both the dataset and its creators are properly accounted for.

Where shared vocabulary fits

“CC0”, “CC BY”, “public domain”, “attribution”, and “reuse” are interpreted differently across repositories and funders, which undermines the very interoperability that licensing is meant to enable. A shared, federated vocabulary that defines these terms precisely — pointing back to Creative Commons for the licences and to the FAIR principles for the reusability requirement — is what lets a licence chosen for one repository be understood correctly in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

June 15, 2026

GDPR and research data: lawful bases, consent and pseudonymisation
An enormous amount of research depends on data about people: their health, their behaviour, their genetics, their opinions, their lives. Wherever such data identify or could identify individuals, they fall within data protection law. In the European Union that law is the General Data Protection Regulation (GDPR); in the United Kingdom it is the UK GDPR, the retained version of the regulation, together with the Data Protection Act 2018. For researchers the GDPR is sometimes experienced as a thicket of obligations. But its core ideas are coherent, and it contains specific provisions designed to enable responsible research rather than obstruct it. Understanding lawful bases, the special rules for sensitive data, the research exemptions, and the distinction between anonymisation and pseudonymisation is part of doing data-driven research properly. This article offers an orientation, drawing on the compliance and regulatory domain of the CASRAI Dictionary. It is general guidance, not legal advice.

You need a lawful basis

The first principle is that processing personal data is not permitted by default; it requires a lawful basis. Article 6 of the GDPR sets out the possible bases, several of which can be relevant to research. Many researchers assume the answer is always consent, but for research by public institutions a basis such as the performance of a task carried out in the public interest is often more appropriate. The choice matters because different bases carry different consequences for the rights individuals can exercise. The key point is that a researcher must be able to identify and justify the lawful basis on which they process personal data — good intentions and scientific value do not by themselves make processing lawful.

Special category data and Article 9

Much research data is not merely personal but sensitive — data about health, genetics, ethnicity, sexual life, religious or political beliefs, and so on. The GDPR calls these special categories and gives them extra protection under Article 9, which prohibits their processing unless a specific additional condition is met. Among those conditions are explicit consent and, importantly for research, processing necessary for scientific research purposes subject to appropriate safeguards. This means that to process sensitive data lawfully, a researcher must satisfy both a lawful basis under Article 6 and a condition under Article 9. The heightened protection reflects the heightened risk: misuse of health or genetic data can cause serious harm, and the law accordingly demands a stronger justification and stronger safeguards before such data may be used.
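
The two-part requirement can be pictured as a documentation check: every processing activity needs a recorded Article 6 basis, and activities involving special category data also need a recorded Article 9 condition. The sketch below checks only that both are written down. It cannot judge whether the chosen basis or condition is legally sound, and the class, field names and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessingRecord:
    """A minimal, illustrative record of one research processing activity."""
    purpose: str
    article_6_basis: Optional[str]
    special_category: bool
    article_9_condition: Optional[str] = None


def documentation_gaps(record: ProcessingRecord) -> List[str]:
    """Report which required justifications have not been recorded."""
    gaps = []
    if not record.article_6_basis:
        gaps.append("no Article 6 lawful basis recorded")
    if record.special_category and not record.article_9_condition:
        gaps.append("special category data without an Article 9 condition")
    return gaps


# Health data with a lawful basis but no Article 9 condition written down.
health_study = ProcessingRecord(
    purpose="cohort study of treatment outcomes",
    article_6_basis="task carried out in the public interest",
    special_category=True,
)
# The same study once the research condition has been documented.
health_study_documented = ProcessingRecord(
    purpose="cohort study of treatment outcomes",
    article_6_basis="task carried out in the public interest",
    special_category=True,
    article_9_condition="scientific research purposes with appropriate safeguards",
)
```

The first record is flagged because a lawful basis alone is not enough for sensitive data; the second passes the documentation check.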

The research provisions

The GDPR explicitly recognises the value of research and contains provisions, centred on Article 89, intended to facilitate it while protecting individuals. These measures allow certain flexibilities under conditions — for example, data collected for one purpose may in some circumstances be further processed for scientific research without that being treated as incompatible with the original purpose, and certain individual rights may be adjusted where they would seriously impair research objectives. Crucially, these provisions are not a free pass. They are conditioned on appropriate safeguards for the rights and freedoms of individuals — safeguards that the regulation specifically associates with techniques such as data minimisation and, prominently, pseudonymisation. The research exemptions, in other words, come bundled with the expectation that researchers will take concrete measures to protect the people in their data.

Anonymisation versus pseudonymisation

One distinction does more practical work in research than almost any other, and it is frequently misunderstood: the difference between anonymisation and pseudonymisation.
- Anonymisation means rendering data such that individuals are no longer identifiable, by anyone, taking account of all means reasonably likely to be used. Genuinely anonymous data falls outside the scope of the GDPR altogether, because it is no longer personal data. Achieving true anonymisation is harder than it sounds, because seemingly innocuous combinations of fields can re-identify people.
- Pseudonymisation means processing data so that it can no longer be attributed to an individual without additional information, for example by replacing names with a code while keeping the key that links code to identity separate and secure. Pseudonymised data remains personal data and remains within the GDPR’s scope, because re-identification is still possible with the key.

The error to avoid is treating pseudonymised data as if it were anonymous and therefore outside the law. Pseudonymisation is a valuable safeguard, and indeed the GDPR commends it, but it reduces risk rather than removing the data from regulation. Knowing which one you have done determines what obligations still apply.
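
The "replace names with a code, keep the key separately" pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function and field names are hypothetical, the records are invented, and a real project would also need to consider indirect identifiers left in the remaining fields.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Swap the direct identifier for a random code.

    Returns (pseudonymised_records, key). The key maps each identity to its
    code and must be stored separately from the data, under access control.
    """
    key = {}
    used_codes = set()
    output = []
    for record in records:
        identity = record[id_field]
        if identity not in key:
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:  # regenerate on the rare collision
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            key[identity] = code
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["participant_code"] = key[identity]
        output.append(cleaned)
    return output, key


# Invented example records: one participant appears twice.
raw = [
    {"name": "Participant A", "visit": 1, "score": 12},
    {"name": "Participant B", "visit": 1, "score": 9},
    {"name": "Participant A", "visit": 2, "score": 14},
]
pseudonymised, linking_key = pseudonymise(raw)
```

Because `linking_key` still exists, the output remains personal data. Destroying the key would not by itself make the data anonymous, since the other fields might still identify someone in combination.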

Accountability and impact assessments

The GDPR is built on accountability: it is not enough to comply, one must be able to demonstrate compliance. For research using personal data this brings practical obligations — documenting the lawful basis and Article 9 condition, being transparent with participants, applying data minimisation, and securing the data. Where processing is likely to result in a high risk to individuals — as large-scale processing of sensitive data often will — a data protection impact assessment (DPIA) may be required, identifying the risks and planning mitigations before processing begins. The DPIA is not merely a form to file; it is the moment at which a team thinks systematically about how its use of personal data could affect people and how to reduce that effect.

A consistent vocabulary for compliance

Data protection touches institutions, funders, ethics committees and repositories alike, and for the relevant information to be handled consistently across them, the terms involved — lawful basis, consent type, special category, pseudonymised, anonymised, retention — must mean the same thing everywhere. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the compliance metadata describing how personal data may be used is understood identically wherever it appears, supporting the broader machinery of research administration. And because stewarding personal data responsibly is genuine contribution, that work can be described within the same framework as any other — the CRediT taxonomy and its full set of contribution roles. The GDPR is not the enemy of research; properly understood, it is the framework within which research that depends on people’s data can be done in a way that keeps faith with them.

June 13, 2026

Data availability statements: what to write and where to deposit
Most journals now ask for a data availability statement, and most authors now write one. Far fewer write one that does what it is meant to do. The phrase “data are available from the authors on reasonable request” has become the default, yet study after study has found that requests against such statements frequently go unanswered — which means the statement records an intention rather than a reality. This guide covers what to write, where to put the data, and how to make a statement that is true. It builds on the foundations in the data-infrastructure domain and connects to the practices described in the reproducibility domain.

What a data availability statement is for

A data availability statement (sometimes a data accessibility statement) tells a reader where the data underlying a publication can be found, under what conditions, and — where access is restricted — why. Its purpose is to make the evidential basis of the work locatable and, where ethically possible, reusable. It is the public-facing expression of the principle that a published claim should be checkable against the data behind it. A good statement is specific: it names a repository, gives an identifier, and states the access conditions plainly.
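
The three ingredients of a specific statement (repository, identifier, access conditions) lend themselves to a small template. The sketch below is one possible wording, not a journal's required text; the function name, the repository names and the DOIs are hypothetical placeholders.

```python
from typing import Optional


def data_availability_statement(
    repository: str,
    identifier: str,
    access: str,
    reason: Optional[str] = None,
) -> str:
    """Compose a statement naming the repository, identifier and access terms."""
    if access == "open":
        return (
            f"The data underlying this article are openly available in "
            f"{repository} at {identifier}."
        )
    if access == "restricted":
        if not reason:
            raise ValueError("a restricted statement must say why access is limited")
        return (
            f"The data underlying this article are held in {repository} "
            f"({identifier}). Access is restricted because {reason}."
        )
    raise ValueError(f"unknown access type: {access!r}")


open_statement = data_availability_statement(
    "Example Data Repository", "https://doi.org/10.1234/example.5678", "open"
)
restricted_statement = data_availability_statement(
    "Example Secure Archive",
    "https://doi.org/10.1234/example.9012",
    "restricted",
    reason="the records contain identifiable health information",
)
```

Requiring a reason for restricted access encodes the point made above: a statement that limits access without saying why leaves the reader unable to judge whether the data could ever be checked.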

Make the data FAIR first, then describe it

The statement is downstream of a deposit decision, so the deposit is where the real work happens. The widely adopted reference point is the FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable. FAIR is frequently misread as “open”, and the distinction matters: FAIR does not require data to be public. It requires that data be findable (with a persistent identifier and rich metadata), accessible (retrievable by a clear, possibly authenticated, protocol), interoperable (using shared formats and vocabularies), and reusable (with a clear licence and provenance). Sensitive data can be FAIR while remaining access-controlled — the metadata is open and findable even where the data themselves are not.

Practically, making data FAIR before you write the statement means:
- Deposit in a repository that mints a persistent identifier — typically a DataCite DOI — so the data are citable and resolvable independently of the article.
- Describe the data with structured metadata, not just a filename, so they can be found and understood by someone who did not produce them.
- Attach an explicit licence (for example a Creative Commons licence for open data) so reuse conditions are unambiguous.
- Use community formats and vocabularies where they exist, so the data interoperate with other datasets in the field.
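The checklist above can be sketched as a small pre-deposit check. This is a minimal illustration, not a repository API: the field names (`identifier`, `description`, `licence`, `file_format`) and the DOI are hypothetical placeholders chosen for this example.

```python
# Hypothetical checklist: field names are illustrative, not any repository's schema.
REQUIRED_FIELDS = {
    "identifier": "Findable: a persistent identifier such as a DOI",
    "description": "Findable: structured metadata beyond a filename",
    "licence": "Reusable: an explicit licence",
    "file_format": "Interoperable: a community or open format",
}


def missing_fair_fields(record: dict) -> list[str]:
    """Return the checklist fields that are absent or empty in the record."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(record.get(name) or "").strip()
    ]


record = {
    "identifier": "10.1234/example-doi",  # placeholder, not a real DOI
    "description": "Survey responses, 2024 wave, with codebook",
    "licence": "",  # forgotten: reuse conditions would be ambiguous
    "file_format": "text/csv",
}

for name in missing_fair_fields(record):
    print(f"Missing {name}: {REQUIRED_FIELDS[name]}")
```

A check like this only confirms that the fields are filled in; whether the metadata are rich enough to be understood by someone else remains a human judgement.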

Choosing where to deposit: domain first, generalist as fallback

Where to put the data is the decision that most shapes their long-term value. The general rule is to prefer a domain repository where a recognised one exists for your data type, and to use a generalist repository otherwise.

Domain repositories

A domain (or discipline-specific) repository is built around a particular kind of data and enforces the community’s metadata standards — GenBank for nucleotide sequences, the PDB for protein structures, and many others. Depositing here means your data sit alongside comparable datasets, are described to a standard your field already reads, and are discoverable by the people most likely to reuse them. Where your field expects deposit in a specific repository, that expectation is effectively mandatory and should be your first choice.

Generalist repositories

Where no suitable domain repository exists, a generalist repository — Zenodo, Figshare, Dryad and others — accepts data of any type, mints a DOI, and supports structured metadata and licensing. Generalists are the right home for the long tail of data that no specialised archive covers.

A note on trust

Whichever route you take, prefer a trusted digital repository, one assessed against a recognised standard such as CoreTrustSeal, over ad-hoc hosting. A repository’s job is long-term preservation and stable resolution; a personal website or a generic file-sharing link offers neither, and a link that has rotted makes a data availability statement worse than useless. Institutional and supplementary-file hosting can be acceptable, provided the host makes a comparable commitment to persistence; that commitment is what matters.

Writing the statement

A strong statement names the repository, gives the identifier, and states the conditions. Some patterns:
- Open deposit: “The data supporting this study are openly available in [repository] at [DOI], under a [licence].”
- Controlled access: “The data are available from [repository / controlled-access archive] subject to [conditions, e.g. a data access committee], because they contain [reason, e.g. identifiable personal data]. Metadata are openly available at [DOI].”
- Genuinely no new data: “No new data were generated; the study analysed [named existing datasets] available at [identifiers].”

Avoid the bare “available on request” formulation wherever the data could instead be deposited. Where access genuinely must be restricted (for participant confidentiality, commercial sensitivity, or Indigenous data governance), say so, give the reason, name who controls access, and still publish open metadata so the dataset is findable. An honest restricted-access statement is far stronger than a vague promise of availability.
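For teams that draft many statements, the three patterns can be kept as templates that refuse to render until every specific is supplied. The sketch below is one possible approach under stated assumptions: the function name, the template keys and the example DOI are all hypothetical.

```python
from string import Formatter

# Templates mirror the three patterns described above; the keys are illustrative.
TEMPLATES = {
    "open": (
        "The data supporting this study are openly available in {repository} "
        "at {doi}, under a {licence}."
    ),
    "controlled": (
        "The data are available from {repository} subject to {conditions}, "
        "because they contain {reason}. Metadata are openly available at {doi}."
    ),
    "none": (
        "No new data were generated; the study analysed {datasets} "
        "available at {identifiers}."
    ),
}


def availability_statement(kind: str, **fields: str) -> str:
    """Fill a template, failing loudly if any required specific is missing."""
    template = TEMPLATES[kind]
    needed = [name for _, name, _, _ in Formatter().parse(template) if name]
    missing = [name for name in needed if not fields.get(name)]
    if missing:
        raise ValueError(f"missing fields for {kind!r} statement: {missing}")
    return template.format(**fields)


print(
    availability_statement(
        "open",
        repository="Zenodo",
        doi="https://doi.org/10.1234/example",  # placeholder DOI
        licence="CC BY 4.0 licence",
    )
)
```

Raising an error on a missing repository or identifier is a deliberate choice here: a statement without those specifics is exactly the vague promise the guidance warns against.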

Where shared vocabulary fits

Terms like “available on request”, “restricted access”, “trusted repository”, and even “FAIR” are used inconsistently across journals and funders, which weakens the policies that depend on them. A shared, federated vocabulary that defines these precisely — pointing back to the FAIR principles and to certification schemes such as CoreTrustSeal — is what lets a statement written for one venue be understood by another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

June 11, 2026
Data citation: giving datasets the credit they deserve
A great deal of published science rests on data the authors collected, cleaned, and shared — and yet the dataset itself, the object on which the conclusions actually depend, is routinely mentioned in passing or not at all. A finding is only checkable if a reader can find and reuse the data behind it, and the people who produced that data deserve recognition for an intellectual contribution that is often enormous. Treating datasets as first-class, citable outputs solves both problems at once. It is a core concern of the data-infrastructure domain and connects directly to the wider taxonomy of the research-outputs domain.

Why data citation matters

Citing data as data does two distinct jobs, and it is worth keeping them separate. The first is credit: assembling a well-documented dataset is real scholarly work — designing the collection, curating, validating, and documenting it — and that work is rewarded only if the dataset is cited as an output in its own right, not buried in a methods paragraph. The second is reproducibility and reuse: a result can only be verified, and the data only reused, if a reader can identify and locate the exact dataset that underpinned the analysis. A vague reference to “data available on request” serves neither goal; a formal citation to a deposited, identified dataset serves both.

The FORCE11 data citation principles

The community reference point here is the Joint Declaration of Data Citation Principles, developed through FORCE11 and endorsed across the scholarly-communication community. The declaration establishes that data should be treated as a legitimate, citable product of research, on the same footing as any other output. Its principles can be summarised as a short set of commitments:
- Importance. Data should be considered legitimate, citable products of research; data citations should be accorded the same importance as citations of other objects.
- Credit and attribution. Citations should facilitate giving scholarly credit and legal attribution to all contributors to the data.
- Evidence. Where a claim relies on data, the corresponding data should be cited.
- Unique identification. A citation should include a persistent, machine-actionable, globally unique identifier for the data.
- Access, persistence, and specificity. Citations should enable access to the data and its metadata, persist even beyond the lifespan of the data, and identify the precise version and subset used.
- Interoperability and flexibility. Citation methods should be interoperable across communities while accommodating their varying practices.

Everything below is machinery for honouring these principles in practice.

DataCite and the dataset DOI

The practical foundation of data citation is the DataCite DOI. DataCite is the DOI registration agency for research data and related outputs, and a dataset deposited in a repository — a generalist repository such as Zenodo, Figshare, or Dryad, or a discipline-specific one — is assigned a DataCite DOI that resolves persistently to the dataset and its metadata. The DOI is what goes in a reference list, exactly as an article DOI would, which is what makes a dataset citable on equal terms with a paper.

The DOI is more than a link. The DataCite metadata record behind it carries the structured information that makes the citation meaningful: the creators (ideally with their ORCID iDs), the title, the publisher and publication year, the version, the licence, the resource type, and related identifiers connecting the dataset to the article it supports, the software that processed it, and the grant that funded it. Versioning is treated as a first-class concern: a revised dataset can receive its own version-specific DOI, satisfying the principles’ demand for specificity so that a citation pins down exactly the data used, not merely the latest state of an evolving collection.
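To make the role of that metadata concrete, here is a minimal sketch that assembles a reference-list entry from a handful of fields. The layout (creators, year, title, version, publisher, resource type, identifier) follows the general shape of the citation format DataCite recommends, but the dictionary keys, the function name and all the values are assumptions made for this illustration.

```python
def format_data_citation(meta: dict) -> str:
    """Build a dataset citation string from a small metadata dictionary."""
    creators = "; ".join(meta["creators"])
    parts = [f"{creators} ({meta['publication_year']})", meta["title"]]
    if meta.get("version"):
        # Including the version pins the citation to the exact data used.
        parts.append(f"Version {meta['version']}")
    parts.append(meta["publisher"])
    if meta.get("resource_type"):
        parts.append(meta["resource_type"])
    return ". ".join(parts) + f". https://doi.org/{meta['doi']}"


example = {
    "creators": ["Doe, J.", "Roe, R."],   # hypothetical people
    "publication_year": 2025,
    "title": "Example survey data",
    "version": "2.1",
    "publisher": "Example Repository",    # hypothetical repository
    "resource_type": "Dataset",
    "doi": "10.1234/example.v2",          # placeholder version-specific DOI
}

print(format_data_citation(example))
```

In practice a journal's reference style will dictate the final punctuation; the point is that every element of the citation comes straight from the metadata record behind the DOI.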

Crediting the people: the Data curation role

Identifying the dataset is half the task; crediting the humans who produced it is the other half, and the two are easily confused. A DataCite DOI identifies and persists the artefact; it does not, on its own, record the division of labour that produced it. That is the job of contributor-role metadata. The CRediT taxonomy includes a dedicated Data curation role — defined as the management activities to annotate, scrub, and maintain research data (including the software code where needed to interpret the data) for initial use and later reuse. Recording Data curation on the associated paper makes visible the often-uncredited work of turning raw observations into a documented, reusable dataset.

The two layers complement each other precisely. The dataset DOI and its DataCite metadata say what the data is, where it lives, and which version; the CRediT role record says who curated, validated, and maintained it. Used together they ensure that both the data and the people behind it are visible — rather than the common outcome where neither is, and the dataset is reduced to an unattributed line in a methods section.

A practical recipe
1. Deposit the data in a trustworthy repository and obtain a DataCite DOI, rather than leaving it “available on request”.
2. Cite the dataset in your reference list using its DOI, the way you would cite an article — not in a footnote or in prose.
3. Pin the version. Where the data may change, cite the version-specific DOI so the citation identifies exactly what was used.
4. Record the contributors — on both the DataCite record (with ORCID iDs) and, via CRediT’s Data curation role, on the paper the data supports.
5. Apply a clear licence. Data that cannot be reused with confidence is data that will not be reused; the citation principles assume the reuse terms are stated.

Where shared vocabulary fits

“Dataset”, “data citation”, “version”, “data curation”, and “repository” are used inconsistently across communities, which is part of why credit for data leaks away. A shared, federated vocabulary that defines these terms precisely — and points back to the FORCE11 data citation principles and to DataCite — is what lets a data citation written in one system be understood in another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain, with adjacent entries in the research-outputs domain.

June 11, 2026
DataCite and the data-citation infrastructure

For a long time, the formal scholarly record recognised one kind of output above all others: the journal article, identified by a DOI and citable in a standard way. The datasets, software, samples and other research outputs that often represented the greater investment of effort had no comparable standing. They were hard to cite, hard to find again, and easy to lose track of. DataCite exists to change that. It is the global, not-for-profit registration agency that issues persistent identifiers — data DOIs — and maintains the metadata standard that makes datasets and other non-article outputs first-class, citable, connectable objects. This article explains what DataCite does and why it matters, drawing on the data infrastructure domain of the CASRAI Dictionary.

Why data needed its own infrastructure

Citing a dataset properly is harder than citing a paper, and the difficulty is structural. A dataset may have versions; it lives in a repository rather than a journal; it has creators and contributors whose roles differ from those of authors; and its value is realised through reuse, which is precisely what is hardest to track. Without a persistent identifier and a shared way to describe it, a dataset cannot be cited consistently, cannot be found reliably after the project that made it has ended, and cannot accrue the credit that reuse should generate for its creators. DataCite addresses all of these at once by giving data outputs a resolvable DOI and a structured description, so that a dataset can be referenced as precisely and durably as any article.

Data DOIs and persistent identification

The core service is the assignment of DOIs to research outputs through DataCite’s member repositories and data centres. When a repository deposits a dataset, it registers a DataCite DOI that resolves persistently to the dataset’s landing page, independent of any changes to the repository’s internal structure over time. That persistence is what lets a dataset DOI sit safely in a reference list, a data-availability statement, or another dataset’s record for years. Crucially, DataCite DOIs are not limited to datasets: the same mechanism identifies software, samples, images, models, preprints and a wide range of other outputs, extending durable, citable identity well beyond the traditional article.

The DataCite metadata schema

An identifier is only useful if there is consistent information behind it, and this is where the DataCite Metadata Schema does its work. The schema defines a structured set of properties for describing a research output: its creators, title, publisher and publication year, the resource type, and a rich set of optional fields covering contributors and their roles, dates, related identifiers, funding, rights and descriptions. Two features of the schema are especially powerful. The first is relatedIdentifier, which lets a record express how an output relates to others — this dataset is a version of that one, supplements this article, is derived from that sample, is documented by this data paper. The second is the recording of contributors and their roles, which allows a dataset record to name not just abstract creators but the specific people who curated, collected or maintained the data. Together these turn each record into a node with explicit, machine-readable links to the rest of the research world.
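A record fragment makes the two features easier to picture. The sketch below uses property names as they commonly appear when DataCite metadata is serialised to JSON (`relatedIdentifiers`, `relationType`, `contributors`, `contributorType`); treat those names and values as an assumption to check against the current schema version. The people, DOIs and helper functions are hypothetical.

```python
# Illustrative record fragment; identifiers and names are placeholders.
record = {
    "creators": [{"name": "Doe, Jane"}],
    "titles": [{"title": "Example survey data"}],
    "contributors": [
        {"name": "Roe, Richard", "contributorType": "DataCurator"},
        {"name": "Poe, Pat", "contributorType": "DataCollector"},
    ],
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/example.v1",
            "relatedIdentifierType": "DOI",
            "relationType": "IsNewVersionOf",
        },
        {
            "relatedIdentifier": "10.5555/example-article",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        },
    ],
}


def related(rec: dict, relation_type: str) -> list[str]:
    """Identifiers this record points to with the given relation type."""
    return [
        r["relatedIdentifier"]
        for r in rec.get("relatedIdentifiers", [])
        if r["relationType"] == relation_type
    ]


def contributors_with_role(rec: dict, role: str) -> list[str]:
    """Names of contributors recorded with the given contributor type."""
    return [
        c["name"]
        for c in rec.get("contributors", [])
        if c["contributorType"] == role
    ]


print(related(record, "IsSupplementTo"))
print(contributors_with_role(record, "DataCurator"))
```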

DataCite and the PID graph

Because DataCite records carry related identifiers and references to other persistent identifiers — ORCID for people, ROR for organisations, Crossref DOIs for articles, grant identifiers for funding — they are not isolated entries but part of a connected PID graph. Follow the links and you can move from a dataset to its creators, their institutions, the grant that funded the work, and the article that analysed it. DataCite and Crossref between them register much of the scholarly output graph — broadly, the data and the literature — and their shared use of resolvable identifiers and exchangeable metadata is what lets the whole network be traversed automatically rather than reconstructed by hand. DataCite’s role in this interoperating arrangement is described in our work on DataCite and federation.
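Following the links can be sketched as an ordinary graph traversal. The example below is a toy model only: the edge table is hand-written with placeholder identifiers, whereas a real traversal would resolve each identifier against the relevant registry's metadata.

```python
from collections import deque

# Toy PID graph: every identifier here is a placeholder for illustration.
EDGES = {
    "doi:10.1234/example-dataset": [
        "orcid:example-creator",
        "doi:10.5555/example-article",
        "grant:example-award",
    ],
    "orcid:example-creator": ["ror:example-institution"],
    "doi:10.5555/example-article": ["orcid:example-creator"],
}


def reachable(start: str) -> list[str]:
    """Breadth-first walk returning every identifier reachable from start."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in EDGES.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


for pid in reachable("doi:10.1234/example-dataset"):
    print(pid)
```

Starting from the dataset, the walk reaches the creator, the article, the grant and, one hop further, the creator's institution, which is the kind of automatic traversal that shared identifiers make possible.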

Supporting FAIR data and reuse

DataCite is foundational to the FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable. A DataCite DOI and its metadata make a dataset findable through search and resolvable through a stable link; the schema’s structured, standardised fields support interoperability; and the explicit rights and relationship information supports informed reuse. Just as importantly, because datasets registered with DataCite can be cited by their DOIs, their reuse can in principle be tracked, which is the basis for crediting the people who produced them. A dataset that is cited is a dataset whose creators can be recognised — the recognition that careful data stewardship has historically been denied.

Crediting data work consistently

DataCite’s ability to record contributors and their roles connects directly to the recognition of data work. The CRediT taxonomy — whose full set of roles is described in our overview of the CRediT roles — provides a controlled vocabulary for contribution, with the Data curation role recognising the management, annotation and maintenance that make a dataset reusable, alongside Investigation for collection and Methodology for how it was produced. For a contribution recorded in a dataset’s DataCite metadata to be understood the same way in an institutional system or a data paper, the terms must be defined consistently across systems. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata DataCite carries — resource types, contributor roles, relationship types — means the same thing wherever a dataset DOI travels.

June 10, 2026
Federated analysis: bringing computation to the data
The default model of data analysis is straightforward: gather the data you need into one place, then run your analysis on it. For a great deal of research this works perfectly well. But for some of the most valuable data in existence — patient health records, genomic data, sensitive social and administrative registries — gathering it into one place is precisely the problem. Such data is often legally, ethically and practically impossible to move freely: it cannot be copied across borders or handed to external researchers without breaching privacy law and the trust of the people it describes. The conventional model assumes the data can come to the analysis. When it cannot, research seems stuck. Federated analysis offers a way out by inverting the model entirely, and it represents an important development in the data infrastructure domain of the CASRAI Dictionary.

The core idea: send the code, not the data

The central insight of federated analysis is deceptively simple: instead of bringing the data to the computation, bring the computation to the data. The data stays where it is — in the hospital, the registry, the institution that holds it and is responsible for it — and the analysis is sent to run against it in place. What travels back is not the raw data but the results of the analysis: aggregate statistics, model parameters, summaries. Multiple sites can each run the same analysis on their own local data, and the results are combined to produce an answer that draws on all of them — without any site ever exposing or releasing its underlying records. The researcher gets the benefit of analysing data from many sources; the data never leaves the places entitled to hold it. This reversal is what makes collaboration possible across data that could never be pooled.
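The simplest case is a pooled mean. In the sketch below each site computes only a count and a total over its own records, and the coordinator combines those aggregates. The site data and function names are invented for illustration, and the "sites" are plain Python lists standing in for systems that would in reality sit behind institutional boundaries.

```python
def local_summary(values: list[float]) -> dict:
    """Runs at the site: returns aggregates only, never the records."""
    return {"n": len(values), "total": sum(values)}


def combine_means(summaries: list[dict]) -> float:
    """Runs at the coordinator: pooled mean from per-site aggregates."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n


# Hypothetical measurements held at three separate sites.
site_data = [
    [120.0, 130.0, 125.0],
    [140.0, 110.0],
    [135.0, 128.0, 122.0, 131.0],
]

summaries = [local_summary(values) for values in site_data]
pooled_mean = combine_means(summaries)
print(f"pooled mean across {len(site_data)} sites: {pooled_mean:.2f}")
```

The combined result is identical to the mean that would have been computed on the pooled records, yet only two numbers per site ever crossed a boundary.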

DataSHIELD

A well-established framework embodying this approach is DataSHIELD. DataSHIELD enables the remote, non-disclosive analysis of sensitive data: researchers can run statistical analyses across data held at multiple sites without the individual-level data ever being seen or transferred. It is designed so that only aggregate, non-disclosive results are returned — the system is built to prevent queries that could expose information about individuals. DataSHIELD has been used particularly in health and biomedical research, where the data is among the most sensitive and the barriers to pooling are highest. It is a concrete demonstration that meaningful joint analysis across institutions is achievable without anyone surrendering control of their data.
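The idea of refusing disclosive queries can be illustrated with a site-side guard. This is not DataSHIELD's interface; it is a generic sketch, and the threshold of five records is an assumed value chosen for the example rather than a documented default.

```python
MIN_GROUP_SIZE = 5  # assumed threshold for illustration only


class DisclosureError(Exception):
    """Raised when a result would describe too few individuals."""


def safe_mean(values: list[float], min_n: int = MIN_GROUP_SIZE) -> float:
    """Return the mean only if enough records contribute to it."""
    if len(values) < min_n:
        raise DisclosureError(
            f"refusing to return a mean over {len(values)} records "
            f"(minimum is {min_n})"
        )
    return sum(values) / len(values)


print(safe_mean([4.0, 6.0, 5.0, 7.0, 3.0]))

try:
    safe_mean([4.0, 6.0])
except DisclosureError as err:
    print(f"blocked: {err}")
```

A mean over two people reveals a great deal about each of them, so the guard blocks the query at the site before any result leaves it.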

The Personal Health Train

Another influential conception is the Personal Health Train (PHT), which offers a memorable metaphor for the same principle. In this image, the data stays in “stations” — the institutions that hold it — and analyses travel between them like “trains” that visit each station, run their computation on the local data, and move on, carrying results rather than data. The Personal Health Train frames federated analysis as an infrastructure pattern: a way of organising data and analyses so that the data remains under the governance of its custodians while still being available, in a controlled way, for legitimate research. It emphasises that the data custodians retain authority — deciding which analyses may visit and run — which is essential for maintaining trust and meeting legal obligations. The metaphor has helped communicate the concept to the clinical and governance communities whose buy-in federated approaches require.
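The station-and-train pattern, including the custodian's right to refuse a visit, can be modelled in a few lines. The sketch below is a conceptual toy: the class, the train identifier and the approval mechanism are hypothetical and do not correspond to any particular Personal Health Train implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Station:
    """A data custodian: holds records and decides which trains may run."""
    name: str
    records: list[float]
    approved_trains: set[str] = field(default_factory=set)

    def run(self, train_id: str, analysis: Callable) -> Optional[tuple]:
        if train_id not in self.approved_trains:
            return None  # the custodian declines; nothing leaves the station
        return analysis(self.records)


def count_and_total(records: list[float]) -> tuple:
    """The analysis carried by the train: aggregates only."""
    return (len(records), sum(records))


def dispatch(train_id: str, analysis: Callable, stations: list[Station]) -> dict:
    """Send the train to each station and collect results where permitted."""
    results = {}
    for station in stations:
        outcome = station.run(train_id, analysis)
        if outcome is not None:
            results[station.name] = outcome
    return results


stations = [
    Station("north", [1.0, 2.0, 3.0], {"train-42"}),
    Station("south", [4.0, 5.0]),  # has not approved this train
    Station("east", [6.0, 7.0], {"train-42"}),
]

print(dispatch("train-42", count_and_total, stations))
```

The station that has not approved the train simply contributes nothing, which reflects the point that authority over each analysis stays with the custodian.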

Federated learning

A closely related idea, prominent in machine learning, is federated learning: training a model across multiple decentralised data sources without centralising the data. Each site trains on its own local data and shares only model updates, which are combined to build a model that has effectively learned from all the data without any of it being gathered together. Federated learning applies the bring-computation-to-the-data principle to the training of models specifically, and it has attracted intense interest precisely because so much of the data that would make models better is data that cannot be pooled. It is the same philosophy — keep the data local, move only what is non-disclosive — applied to a particularly data-hungry kind of computation.
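A toy version of the train-locally, share-updates loop is sketched below for a one-parameter model y = w·x. Combining the site models by a sample-weighted average is the approach commonly called federated averaging; the data, learning rate and round counts are arbitrary values chosen so that the example converges, not recommendations.

```python
def local_update(w: float, data: list[tuple], lr: float = 0.01, epochs: int = 20):
    """Runs at the site: gradient descent on local data, returns (w, n)."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w, len(data)


def federated_average(updates: list[tuple]) -> float:
    """Runs at the coordinator: average the site models, weighted by size."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(sites: list[list[tuple]], rounds: int = 10) -> float:
    w = 0.0  # shared starting model
    for _ in range(rounds):
        updates = [local_update(w, data) for data in sites]  # data stays local
        w = federated_average(updates)  # only parameters are shared
    return w


# Hypothetical site datasets, all generated from y = 3x.
SITES = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)],
]

print(f"learned w = {train(SITES):.4f}")
```

Sharing parameters rather than records reduces exposure but does not by itself guarantee that nothing about individuals can be inferred from the updates, which is why the text above speaks of moving only what is non-disclosive.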

Data minimisation by design

What ties these approaches together is the principle of data minimisation: the idea that you should use and move the minimum data necessary for a given purpose. Federated analysis is, in a sense, data minimisation built into the architecture. Rather than copying entire datasets around and trusting everyone downstream to handle them responsibly, it ensures that the sensitive data simply never moves, and that only the minimal, non-disclosive results are shared. This has clear advantages:
- Privacy. Individuals’ records stay protected because they are never exposed or transferred.
- Governance. Data custodians retain control and can meet their legal and ethical obligations to the people whose data they hold.
- Scale. Research can draw on data from many institutions and jurisdictions that could never agree to pool their data centrally.

Working with data that cannot be open

Federated analysis sits within the broader challenge of doing valuable research on data that cannot be fully open. It is a powerful answer to the question of how sensitive data can be reused for the public good without being exposed: the data can be analysed and learned from while remaining as protected as it must be. This complements, rather than replaces, controlled-access arrangements and secure environments; it is another tool for reconciling the duty to protect with the desire to discover. Sound research administration increasingly has to account for these arrangements when planning sensitive-data projects.

A consistent vocabulary for federated work

For federated analysis to work across institutions, the descriptions of what is being analysed and shared must be consistent. Data dictionaries must align so that a variable means the same thing at every station; access conditions, governance terms and the nature of returned results must be described in compatible ways, or a federated analysis cannot reliably combine results across sites. That consistency is what the CASRAI Dictionary supports: a shared vocabulary so that the metadata describing federated data and analyses is understood identically wherever it travels. And because building, running and curating federated analyses is genuine contribution, the work can be described in the same framework used for every other — the CRediT taxonomy and its set of contribution roles. Federated analysis shows that the choice between using data and protecting it is sometimes a false one: with the right architecture, you can do both.
May 26, 2026
