CASRAI Dictionary

Tag: FAIR data

Trusted repositories and the EOSC: where research data should live

Open and FAIR data has to live somewhere, and the choice of where is not a clerical detail. A dataset deposited on a personal web page, a lab server, or a service that may not exist in five years is, for the purposes of long-term reuse, lost. The question of where research data should live is the question of trusted repositories, and the European answer to coordinating them is the EOSC. This article maps the landscape, drawing on the data-infrastructure domain.

What makes a repository trustworthy

Not every place that can store a file is fit to be the home of the scholarly record. A trusted digital repository is one assessed against a recognised trust framework, demonstrating that it has the organisational and technical capability to preserve and provide access to data over the long term. Trust here is not a vibe; it is a set of demonstrable properties — a sustainability plan, preservation procedures, persistent identifiers, clear access conditions, and the organisational continuity to outlast any individual project or grant.

The most widely recognised certification of these properties is CoreTrustSeal, a community-governed assessment that a repository meets the core requirements of trustworthy data stewardship. A CoreTrustSeal certification is a concrete signal a funder or researcher can rely on: it means an independent process has checked that the repository can actually do what “long-term preservation” implies. When a funder mandate says data must go to a trusted repository, CoreTrustSeal is the most common way that word is given operational meaning.

The repository taxonomy: generalist and domain

Trusted repositories come in two broad kinds, and choosing well between them is one of the most consequential data-management decisions a researcher makes.
- A generalist repository accepts data from any discipline. Zenodo, Figshare, and Dryad are the familiar examples: they mint a DOI, accept almost any data type, and provide a reliable, citable home when no specialist option exists. They are the right default for the long tail of research data that has no natural disciplinary home.
- A domain repository is discipline-specific, built around the data types, standards, and community of a particular field. GenBank for nucleotide sequence data is the archetype; there are equivalents across crystallography, astronomy, social science, proteomics, and more. A domain repository adds what a generalist cannot: discipline-specific metadata standards, validation, and a community of expert users who will actually find and reuse the data.

The practical rule that funders increasingly articulate is: deposit in the appropriate domain repository where one exists, and fall back to a trusted generalist repository where it does not. A sequence belongs in GenBank, not in a generic store; a one-off dataset with no community home belongs in a generalist repository with a DOI rather than on a server that will be decommissioned.
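The rule above can be sketched as a small lookup. This is an illustration only: `DOMAIN_REPOSITORIES`, `GENERALIST_REPOSITORIES` and `choose_repository` are hypothetical names, the data-type label is an assumption, and the repository names are simply the examples given in this article.

```python
# Hypothetical lookup from a data type to a recognised domain repository.
# Only the GenBank pairing comes from this article; a real table would be
# maintained by a funder, an institution, or a repository registry.
DOMAIN_REPOSITORIES = {
    "nucleotide_sequence": "GenBank",
}

# Generalist repositories named in this article, used as the fallback.
GENERALIST_REPOSITORIES = ["Zenodo", "Figshare", "Dryad"]


def choose_repository(data_type: str) -> dict:
    """Prefer a domain repository; fall back to generalist options."""
    domain = DOMAIN_REPOSITORIES.get(data_type)
    if domain is not None:
        return {"kind": "domain", "candidates": [domain]}
    return {"kind": "generalist", "candidates": list(GENERALIST_REPOSITORIES)}
```

The sketch leaves the trust check (for example, certification status) to the person choosing among the returned candidates.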

The EOSC: coordinating the federation

Individual trusted repositories are necessary but not sufficient. A researcher also needs to find the right one, move data and compute between services, and trust that the pieces interoperate. In Europe, the coordinating layer for this is the European Open Science Cloud (EOSC) — a federation of research-data services rather than a single monolithic platform.

The EOSC’s model is federation: an EOSC node is a service provider connected to the federation, and an EOSC service is something offered through its catalogue — a repository, a compute resource, a data-management tool. The aspiration is that a researcher can discover trusted repositories, deposit data, and compose data with compute across institutional and national boundaries, through a coordinated catalogue rather than a patchwork of disconnected services. The EOSC is, in effect, the European attempt to make “where should this data live?” answerable through one front door onto many trustworthy providers. It is not the only such effort — the African Open Science Platform pursues a comparable continental federation — but it is the most developed.

The human layer: stewards and custodians

Infrastructure does not curate itself, and an honest account of where data should live has to name the people. A data steward is the professional responsible for data quality, governance, and ongoing curation — the role that makes the difference between data that is merely deposited and data that is genuinely reusable. A data custodian holds legal or operational responsibility for the data. Around them sit the structured agreements that govern sharing: a data sharing agreement setting the conditions under which data move between parties, an embargo period deferring public access after deposit, and access controls distinguishing open, restricted, and metadata-only data.

A trusted repository with no data steward behind the data is a safe building with empty rooms. Preservation is an organisational commitment carried out by people, not a property that storage acquires on its own.
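The access controls and embargo period described above can be modelled in a few lines. This is a minimal sketch under stated assumptions: the `Deposit` record, the `AccessLevel` values and the rule that data become public on the embargo end date are all hypothetical choices, not a description of how any particular repository works.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    OPEN = "open"
    RESTRICTED = "restricted"
    METADATA_ONLY = "metadata-only"


@dataclass
class Deposit:
    title: str
    access: AccessLevel
    embargo_until: Optional[date] = None  # None means no embargo


def is_publicly_downloadable(deposit: Deposit, today: date) -> bool:
    """True only for open data whose embargo, if any, has ended."""
    if deposit.access is not AccessLevel.OPEN:
        return False
    # Assumption: the data become public on the embargo end date itself.
    if deposit.embargo_until is not None and today < deposit.embargo_until:
        return False
    return True
```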

Why this connects to FAIR and to identifiers

Where data lives is what makes the FAIR principles operational. Findability depends on the repository minting a persistent identifier and exposing good metadata; accessibility depends on stable resolution and clear access conditions; interoperability and reusability depend on the standards a domain repository enforces. A trusted repository is, in practice, the machine that turns the FAIR aspiration into a deposited reality — which is why the choice of repository, and the trust signal of CoreTrustSeal, matters as much as the decision to share at all. The repository is also where the data’s persistent identifier enters the broader graph that links it to the project, the people, and the funding.

Where shared vocabulary fits

The terms in this domain are used loosely in funder mandates and policies — “trusted”, “appropriate”, “long-term” all mean different things to different bodies, and “generalist” versus “domain” is often left implicit. A shared, federated vocabulary that defines these precisely, pointing to CoreTrustSeal for the trust framework and to the EOSC for the federation model, is what lets a data-sharing requirement be stated unambiguously and checked. Supplying that definitional layer is the role the CASRAI dictionary is designed to play.

What to do now

For researchers: deposit in the appropriate domain repository where one exists, otherwise a CoreTrustSeal-certified generalist repository, and never a personal or project server for the long term. For institutions: invest in data stewards, not just storage. For funders and standards work: give “trusted repository” operational meaning through certification and shared vocabulary, and support the federations that make trustworthy services findable.

June 13, 2026
Data papers: publishing datasets as citable outputs

Some of the most valuable products of research are datasets: a long-running environmental monitoring series, a carefully curated genomic resource, a survey assembled over years. Such a dataset can underpin dozens of later studies and outlast the project that created it. Yet the people who built it have often struggled to get formal credit, because the traditional unit of academic recognition is the journal article that interprets data, not the data themselves. The data paper exists to close that gap: a peer-reviewed article whose subject is a dataset — describing what it contains, how it was produced and how to reuse it — turning data work into a citable, reviewable output in its own right. This article explains how data papers work and why they matter, drawing on the research outputs domain of the CASRAI Dictionary.

What a data paper is — and is not

A data paper is not a research paper that happens to share its data, and it is not a results paper in disguise. Its purpose is descriptive: to document a dataset thoroughly enough that others can find, understand, trust and reuse it. A typical data paper covers what the data are, how and why they were collected, the methods and instruments used, the structure and format of the data, quality-control and validation procedures, and — crucially — where the data are deposited and under what licence. What a data paper generally does not do is advance a new scientific hypothesis or interpret the data to reach a novel conclusion; the contribution is the well-described, reusable resource itself. This restraint is the point: it lets the value of the data be assessed on its own terms, separately from any particular analysis.

Data journals and where data papers appear

Data papers are published either in dedicated data journals or in conventional journals that accept the format. Two well-established examples illustrate the model. Scientific Data publishes peer-reviewed descriptions of datasets across the sciences, pairing each with structured metadata. Earth System Science Data publishes data papers in the Earth and environmental sciences, with a strong emphasis on data quality and reusability. These venues apply genuine peer review — reviewers assess whether the data are sound, complete, properly documented and genuinely reusable — which is what gives a data paper its credibility. A peer-reviewed data paper is not merely a deposit; it is a vetted statement that the dataset meets a scholarly standard.

The relationship between the paper and the data

A central feature of the data paper model is the separation of the description from the data. The data paper is the human-readable, peer-reviewed article; the dataset itself lives in a repository, where it receives its own persistent identifier — typically a DataCite DOI — and is governed by an explicit licence. The data paper cites the dataset by that identifier, and the dataset record points back to the paper. This means there are two citable objects, linked but distinct: the dataset, which others cite when they reuse the data, and the data paper, which others cite when they draw on its description. Robust dataset citation through DataCite is what allows reuse of the data to be tracked and, over time, credited to the people who produced it. The infrastructure that makes datasets first-class citable objects is part of the wider picture covered in our data infrastructure domain.
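The two-way link between paper and dataset can be sketched as a pair of records that point at each other. The property names below are modelled on DataCite's related-identifier metadata (`relatedIdentifier`, `relatedIdentifierType`, `relationType`), but the record layout, the function names and the DOIs used in the usage note are illustrative assumptions rather than a complete or validated metadata record.

```python
from typing import Dict, Tuple


def _related(target_doi: str, relation: str) -> Dict:
    """One related-identifier entry pointing at another DOI."""
    return {
        "relatedIdentifier": target_doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation,
    }


def link_paper_and_dataset(dataset_doi: str, paper_doi: str) -> Tuple[Dict, Dict]:
    """Build two minimal records that cite each other."""
    dataset_record = {
        "doi": dataset_doi,
        "relatedIdentifiers": [_related(paper_doi, "IsDescribedBy")],
    }
    paper_record = {
        "doi": paper_doi,
        "relatedIdentifiers": [_related(dataset_doi, "Describes")],
    }
    return dataset_record, paper_record


def points_to(record: Dict, target_doi: str, relation: str) -> bool:
    """Check that a record links to target_doi with the given relation."""
    return any(
        entry["relatedIdentifier"] == target_doi and entry["relationType"] == relation
        for entry in record["relatedIdentifiers"]
    )
```

Calling `link_paper_and_dataset("10.5555/example-dataset", "10.5555/example-data-paper")` with these placeholder DOIs yields two records for which `points_to` holds in both directions, which is the property that lets reuse of the data be traced back to its description.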

Why data papers matter for credit and FAIR data

The deeper reason data papers matter is incentives. For a long time, the rational move for a researcher who built a valuable dataset was to mine it for conventional papers, because that was what counted. The data paper changes the calculus by making the dataset itself a recognised, citable, peer-reviewed output that appears on a CV and accrues citations. That recognition rewards exactly the careful, time-consuming data stewardship that the research system otherwise undervalues. Data papers also advance the FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — almost by construction: a good data paper makes a dataset findable (through publication and a DOI), documents it for accessibility and interoperability, and exists precisely to enable reuse.

Crediting the people behind the data

Producing a high-quality dataset is collaborative work — collection, curation, validation, documentation — and a data paper is an opportunity to credit it properly rather than burying it in an acknowledgement. The CRediT taxonomy maps naturally onto this work, with the Data curation role recognising the management, annotation and maintenance of the data, alongside Investigation for collection and Methodology for how it was produced. The complete set of roles is described in our overview of the CRediT roles. Applying structured contribution to a data paper ensures that the curator who made the dataset reusable is named for that contribution, not left invisible behind the names of those who later analyse the data.

An output worth treating seriously

Treating datasets as citable, reviewable outputs — with their own identifiers, their own peer review, and their own credit — recognises a simple reality: the data often outlast and out-influence any single paper drawn from them. Data papers give that reality formal standing. The consistent vocabulary that lets a dataset, a data paper and the contributions behind them be described the same way across repositories, journals and institutional systems is maintained in the CASRAI Dictionary, so that the credit a researcher earns for building a resource travels with it wherever it is reused.

June 13, 2026
Documenting datasets for machine-learning research: datasheets, data statements and Croissant

A machine-learning model is, in a profound sense, a product of its training data. Whatever patterns, gaps, imbalances and biases live in that data are absorbed by the model and reproduced in its behaviour. And yet, for much of the field’s recent history, datasets have circulated with remarkably little documentation: a file, perhaps a brief description, and little record of where the data came from, who is represented in it, what it omits, or what it should and should not be used for. The result has been models trained on poorly understood foundations, with predictable consequences for reliability and fairness. A growing movement now treats dataset documentation as a serious, first-class research output in its own right. This article surveys that movement, drawing on the AI and ML research-outputs domain of the CASRAI Dictionary.

Datasheets for Datasets

The most influential proposal, borrowing an idea from electronics, is the datasheet. Just as an electronic component ships with a datasheet describing its characteristics, operating conditions and limitations, Datasheets for Datasets proposes that every dataset be accompanied by a document answering a structured set of questions about it. Those questions span the dataset’s whole life: the motivation for creating it and who funded it; its composition — what the instances are, how many there are, what they represent, and whether sensitive or personal data is involved; the collection process — how the data was gathered and whether consent was obtained; any preprocessing, cleaning or labelling; recommended and discouraged uses; and plans for distribution and maintenance. The aim is to make explicit what would otherwise remain tacit, so that anyone considering using the dataset can understand its provenance and judge its fitness for their purpose — and so that the people who created it must think carefully about these matters while they still can.

Data Statements for NLP

A closely related proposal arose specifically in natural-language processing, where the characteristics of the people who produced the text in a dataset profoundly shape what a model learns. Data Statements for Natural Language Processing ask dataset creators to document the relevant characteristics of their data: who the speakers and annotators are, the language varieties represented, the situations in which the language was produced, and so on. The motivation is squarely about bias and generalisation. A language model trained on text from a narrow demographic will work less well, and sometimes fail or cause harm, for people outside it — and without documentation, that limitation is invisible until it bites. Data statements make the population behind the data explicit, so that the boundaries of a model’s likely competence can be understood rather than discovered the hard way. Both datasheets and data statements share a conviction: documentation is not bureaucratic overhead but a precondition for using data responsibly.

Croissant: machine-readable dataset metadata

Datasheets and data statements are written largely for humans. But for datasets to be discoverable, loadable and interoperable across the many tools of the machine-learning ecosystem, their metadata also needs to be machine-readable. This is the role of Croissant, a metadata format for machine-learning datasets developed through a community effort associated with MLCommons. Croissant provides a standard, structured way to describe a dataset — its resources, structure, fields and semantics — so that tools, frameworks and repositories can understand and work with it consistently, rather than each requiring bespoke handling. By standardising the description, Croissant makes datasets easier to find, load and combine across platforms, and it can carry the kind of responsible-use and provenance information that datasheets capture into a form that systems can act on. It is, in effect, the interoperability layer for dataset documentation.
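To give a feel for what such a description looks like, the sketch below builds a heavily simplified, Croissant-style record as a Python dictionary and serialises it to JSON. It is an unvalidated approximation: the abbreviated `@context`, the property names (`distribution`, `recordSet`, `field`, `source`) and the `cr:` types reflect a general reading of the Croissant vocabulary and should be checked against the published specification, while the dataset name, URL and column names are placeholders.

```python
import json


def build_croissant_sketch() -> dict:
    """Return a simplified, Croissant-style description (not validated)."""
    return {
        # Abbreviated context; the real Croissant context is longer.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": "example-reviews",  # placeholder dataset
        "description": "Placeholder text-classification dataset.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        # Resources: the files the dataset is made of.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "contentUrl": "https://example.org/reviews.csv",
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and fields map onto those files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/label",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "label"},
                        },
                    },
                ],
            }
        ],
    }


def field_ids(description: dict) -> list:
    """List the field identifiers declared across all record sets."""
    return [
        field["@id"]
        for record_set in description["recordSet"]
        for field in record_set["field"]
    ]


serialised = json.dumps(build_croissant_sketch(), indent=2)
```

The point of the structure is that a tool can read `field_ids(...)` and the file references without any dataset-specific code, which is what the article means by an interoperability layer.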

How this connects to FAIR and persistent identifiers

This work is the machine-learning expression of principles that the wider research-data community has long advocated. The FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — map directly onto what good dataset documentation achieves: rich, machine-readable metadata (Croissant) makes data findable and interoperable, while thorough human-readable documentation (datasheets, data statements) is what genuine reusability requires, because data cannot be responsibly reused if its provenance and limitations are unknown. Persistent identifiers complete the picture: when a dataset is registered with an identifier through an infrastructure such as DataCite, it becomes citable and trackable, so that it can be referenced precisely in papers, credited to its creators, and connected to the models and results that depend on it. A documented, identified dataset is one that can take its place in the scholarly record as a real output rather than an anonymous file.

Datasets as research outputs deserving credit

The deeper shift here is a change in status. Creating a good dataset (collecting, cleaning, labelling and documenting it carefully) is substantial intellectual labour, and the resulting dataset is a genuine research output that others build upon, often more widely than any single paper. Treating datasets as first-class outputs means documenting them properly, identifying them persistently, and crediting the people who made them. The CRediT taxonomy, whose full set of contribution types is described in our overview of the CRediT roles, captures this work through roles such as Data curation, which recognises the management, annotation and maintenance of data. Recognising dataset creation as creditable contribution is part of the same movement that produced datasheets: an insistence that the data underpinning machine learning, and the people who steward it, be taken seriously.

A consistent vocabulary for dataset documentation

For dataset documentation to be useful across repositories, frameworks and institutions, the elements it contains must mean the same thing everywhere — what a field describes, what a provenance statement records, what an intended-use restriction means. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that the metadata describing a dataset is understood identically wherever it travels. Datasheets, data statements and Croissant all rest on the same insight: that a dataset without documentation is a liability, and that documenting it well is not an afterthought but part of doing the research properly.

June 13, 2026
Open science across the research lifecycle: from preregistration to preservation

Open science is often encountered as a set of separate practices: a journal’s open-access policy, a funder’s data-sharing requirement, a colleague’s preregistered study. Treated piecemeal, each can feel like an isolated obligation. But open science is most powerful, and most coherent, when its practices are understood as connected stages in the arc of a single project — when openness runs through the whole research lifecycle rather than appearing only at the end. Seen this way, preregistration, open data, open access and preservation are not unrelated requirements but successive expressions of one principle: that research is more trustworthy, more useful and more cumulative when it is conducted in the open. This article traces openness across the lifecycle through the research lifecycle domain of the CASRAI Dictionary.

A global framework: the UNESCO Recommendation

That open science is a connected whole rather than a collection of separate practices is reflected in the most significant international statement on the subject: the UNESCO Recommendation on Open Science, adopted by member states as a shared global framework. It treats open science not as a single act of sharing but as an integrated set of practices and values — open access to publications, open research data, open-source software, open infrastructures, open engagement with society — underpinned by transparency, equity and inclusion. Its scope is the point: it frames openness as a culture spanning the entire research process, not a box ticked at publication, and provides a common reference for understanding open science as a coherent lifecycle.

The beginning: preregistration

Openness can begin before any data are collected. Preregistration is the practice of specifying a study’s hypotheses, methods and analysis plan in advance, and recording that plan in a way that cannot be quietly changed later. Its purpose is to strengthen the integrity of research by making clear what was planned before the results were known, which guards against practices such as reshaping hypotheses to fit the data or selectively reporting only what worked. A particularly developed form is the registered report, in which a study’s plan is peer-reviewed and accepted in principle before the results exist, so that publication depends on the quality of the question and method rather than on whether the findings turn out to be striking. Preregistration makes the research process transparent from the outset and sets the foundation for everything that follows.

The middle: open and FAIR data

As a project generates data, openness shifts to how that data is managed and shared. The widely adopted FAIR principles hold that data should be Findable, Accessible, Interoperable and Reusable — properties that let data be discovered, understood and built upon by others rather than locked away or lost. Making data FAIR, and as open as is responsible, transforms it from a private by-product of one study into a lasting resource for the community. This stage connects backwards and forwards: data shared openly allows the results derived from it to be checked, and it allows the data itself to feed new research it was never collected for. Openness in the middle of the lifecycle is what gives a project value beyond its own conclusions.

The output: open access

When findings are written up, openness turns to open access — making the resulting publications freely available to read rather than locked behind paywalls. It can be achieved through different routes, including publishing in open-access venues and depositing accepted manuscripts in repositories, but the principle is constant: research that anyone can read can be verified, used and built upon by the widest possible audience. Open access is the most visible face of open science, but within the lifecycle it is one stage among several. A paper that is open but rests on hidden data and an undisclosed plan is less open than it appears; open access is most meaningful when it sits atop preregistration and open data.

The long term: preservation

The lifecycle does not end at publication, because outputs that are open today are worthless tomorrow if they vanish. Digital preservation is the work of ensuring that data, publications, software and other outputs remain accessible, intact and usable over the long term, against the threats of format obsolescence, link rot, storage failure and institutional change. There is little point making research open if it cannot be found or opened a decade later. Trusted repositories, persistent identifiers and active preservation practices are what keep the open record open over time, closing the loop so that the openness built earlier actually endures.
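One routine element of active preservation is a fixity check: a checksum is recorded at deposit and recomputed later to detect silent corruption or loss. The sketch below uses Python's standard `hashlib` on in-memory bytes; the function names and the manifest layout are hypothetical, and a real repository would typically run such checks over stored files on a schedule.

```python
import hashlib


def checksum(content: bytes) -> str:
    """SHA-256 digest of the content, as a hex string."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(files: dict) -> dict:
    """Record a checksum for every file at deposit time."""
    return {name: checksum(content) for name, content in files.items()}


def failed_fixity(manifest: dict, current_files: dict) -> list:
    """Return names of files that are missing or no longer match."""
    failures = []
    for name, expected in sorted(manifest.items()):
        content = current_files.get(name)
        if content is None or checksum(content) != expected:
            failures.append(name)
    return failures
```

A checksum only detects damage; repairing it depends on the repository holding independent copies, which is part of the organisational commitment described above.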

The lifecycle as a connected whole

The deeper point is that these stages reinforce one another. Preregistration makes the eventual open data and open publication more meaningful, because the plan they can be checked against is on record. Open data makes the open publication verifiable. Preservation makes all of it durable. Openness at one stage is weakened when another stage is missing: open access over secret data, or open data with no preservation, falls short of the whole. This is why open science is best understood as a lifecycle rather than a checklist: its value is cumulative and connected, exactly the vision the UNESCO Recommendation articulates. Our learning resources explore each practice in more depth.

A consistent vocabulary across the lifecycle

For openness to connect across stages and systems, the information describing each stage must mean the same thing everywhere — the status of a preregistration, the access conditions of data, the licence on a publication, the preservation state of an output. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the open-science attributes of a project are understood identically across the systems that record them. And because contribution runs through every stage, the work done at each can be described in the same shared framework — the CRediT taxonomy and its full set of contribution roles. Open science is not a single act but a way of working across the whole life of a project; its power lies in the connection of its parts.

June 12, 2026
Data availability statements: what to write and where to deposit

Most journals now ask for a data availability statement, and most authors now write one. Far fewer write one that does what it is meant to do. The phrase “data are available from the authors on reasonable request” has become the default, yet study after study has found that requests against such statements frequently go unanswered — which means the statement records an intention rather than a reality. This guide covers what to write, where to put the data, and how to make a statement that is true. It builds on the foundations in the data-infrastructure domain and connects to the practices described in the reproducibility domain.

What a data availability statement is for

A data availability statement (sometimes a data accessibility statement) tells a reader where the data underlying a publication can be found, under what conditions, and — where access is restricted — why. Its purpose is to make the evidential basis of the work locatable and, where ethically possible, reusable. It is the public-facing expression of the principle that a published claim should be checkable against the data behind it. A good statement is specific: it names a repository, gives an identifier, and states the access conditions plainly.

Make the data FAIR first, then describe it

The statement is downstream of a deposit decision, so the deposit is where the real work happens. The widely adopted reference point is the FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable. FAIR is frequently misread as “open”, and the distinction matters: FAIR does not require data to be public. It requires that data be findable (with a persistent identifier and rich metadata), accessible (retrievable by a clear, possibly authenticated, protocol), interoperable (using shared formats and vocabularies), and reusable (with a clear licence and provenance). Sensitive data can be FAIR while remaining access-controlled — the metadata is open and findable even where the data themselves are not.

Practically, making data FAIR before you write the statement means:
- Deposit in a repository that mints a persistent identifier — typically a DataCite DOI — so the data are citable and resolvable independently of the article.
- Describe the data with structured metadata, not just a filename, so they can be found and understood by someone who did not produce them.
- Attach an explicit licence (for example a Creative Commons licence for open data) so reuse conditions are unambiguous.
- Use community formats and vocabularies where they exist, so the data interoperate with other datasets in the field.
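The four checks above can be sketched as a pre-deposit checklist. This is a hedged illustration: the record layout, the field names (`identifier`, `metadata`, `licence`, `formats`) and the `REQUIRED_METADATA` list are assumptions made for the example, not a schema that any repository prescribes.

```python
# Assumed minimum metadata for the example; real requirements vary by repository.
REQUIRED_METADATA = ("title", "creators", "description")


def fair_deposit_gaps(record: dict) -> list:
    """Return a list of gaps to fix before writing the availability statement."""
    gaps = []
    if not record.get("identifier"):
        gaps.append("no persistent identifier (for example a DOI)")
    metadata = record.get("metadata") or {}
    missing = [name for name in REQUIRED_METADATA if not metadata.get(name)]
    if missing:
        gaps.append("metadata missing: " + ", ".join(missing))
    if not record.get("licence"):
        gaps.append("no explicit licence")
    if not record.get("formats"):
        gaps.append("no declared formats or vocabularies")
    return gaps
```

An empty list means the record passes these four checks; it does not by itself mean the data are FAIR, since the quality of the metadata still needs human judgement.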

Choosing where to deposit: domain first, generalist as fallback

Where to put the data is the decision that most shapes their long-term value. The general rule is to prefer a domain repository where a recognised one exists for your data type, and to use a generalist repository otherwise.

Domain repositories

A domain (or discipline-specific) repository is built around a particular kind of data and enforces the community’s metadata standards — GenBank for nucleotide sequences, the PDB for protein structures, and many others. Depositing here means your data sit alongside comparable datasets, are described to a standard your field already reads, and are discoverable by the people most likely to reuse them. Where your field expects deposit in a specific repository, that expectation is effectively mandatory and should be your first choice.

Generalist repositories

Where no suitable domain repository exists, a generalist repository — Zenodo, Figshare, Dryad and others — accepts data of any type, mints a DOI, and supports structured metadata and licensing. Generalists are the right home for the long tail of data that no specialised archive covers.

A note on trust

Whichever route you take, prefer a trusted digital repository — one assessed against a recognised standard such as CoreTrustSeal — over ad-hoc hosting. A repository’s job is long-term preservation and stable resolution; a personal website or a generic file-sharing link offers neither, and a link that has rotted makes a data availability statement worse than useless. Institutional and supplementary-file hosting can be acceptable, but the persistence commitment is what matters.

Writing the statement

A strong statement names the repository, gives the identifier, and states the conditions. Some patterns:
- Open deposit: “The data supporting this study are openly available in [repository] at [DOI], under a [licence].”
- Controlled access: “The data are available from [repository / controlled-access archive] subject to [conditions, e.g. a data access committee], because they contain [reason, e.g. identifiable personal data]. Metadata are openly available at [DOI].”
- Genuinely no new data: “No new data were generated; the study analysed [named existing datasets] available at [identifiers].”

Avoid the bare “available on request” formulation wherever the data could instead be deposited. Where access genuinely must be restricted — for participant confidentiality, commercial sensitivity, or Indigenous data governance — say so, give the reason, name who controls access, and still publish open metadata so the dataset is findable. An honest restricted-access statement is far stronger than a vague promise of availability.
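The three patterns above lend themselves to simple templates, so that a statement cannot be produced without a repository, an identifier and the access conditions. The function names and parameters below are hypothetical, and the wording is adapted from the patterns in this guide rather than taken from any journal's required text.

```python
def open_statement(repository: str, doi: str, licence: str) -> str:
    """Open deposit: repository, identifier and licence are all required."""
    return (
        f"The data supporting this study are openly available in {repository} "
        f"at {doi}, under the {licence} licence."
    )


def controlled_statement(archive: str, conditions: str, reason: str, metadata_doi: str) -> str:
    """Controlled access: state the conditions, the reason, and the open metadata."""
    return (
        f"The data are available from {archive} subject to {conditions}, "
        f"because they contain {reason}. "
        f"Metadata are openly available at {metadata_doi}."
    )


def no_new_data_statement(datasets: dict) -> str:
    """No new data: datasets maps each existing dataset name to its identifier."""
    listed = "; ".join(f"{name} ({identifier})" for name, identifier in datasets.items())
    return (
        "No new data were generated; the study analysed existing datasets "
        f"available at the following identifiers: {listed}."
    )
```

Because every argument is mandatory, the bare "available on request" formulation has no template here; a restricted dataset goes through `controlled_statement`, which forces the reason and the open metadata identifier to be stated.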

Where shared vocabulary fits

Terms like “available on request”, “restricted access”, “trusted repository”, and even “FAIR” are used inconsistently across journals and funders, which weakens the policies that depend on them. A shared, federated vocabulary that defines these precisely — pointing back to the FAIR principles and to certification schemes such as CoreTrustSeal — is what lets a statement written for one venue be understood by another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

June 11, 2026
Sensitive and controlled-access data: FAIR for data that cannot be fully open

The push for open research data has been one of the defining movements in scholarly practice, and rightly so: openly available data is easier to verify, reuse and build upon. But an unqualified call to make all data open runs into an immovable obstacle. A great deal of research data is sensitive — patient records, genetic information, data about vulnerable people, commercially confidential material, data whose release could cause harm — and such data cannot simply be posted on the open web without breaching the law, betraying participants’ trust, or endangering people. The challenge is not to choose between openness and protection but to honour both: to make sensitive data as accessible as it responsibly can be while keeping it as protected as it must be. This article looks at how that balance is struck, drawing on the compliance and regulatory domain of the CASRAI Dictionary.

As open as possible, as closed as necessary

The principle that has come to govern this territory is captured in a single phrase: data should be “as open as possible, as closed as necessary”. The phrase does real work. It establishes openness as the default and the goal — the burden falls on reasons to restrict, not on reasons to share. But it also acknowledges, plainly, that necessity sometimes requires closure, and that protecting people and honouring legal and ethical obligations is not a failure of openness but a condition of doing research responsibly. The aim, then, is not a binary of open versus closed but a spectrum of access arrangements, each calibrated to what a particular dataset requires. Sensitive data does not fall off the map of good data practice; it occupies a different, carefully governed part of it.

FAIR does not mean open

A common misconception is that the FAIR principles — Findable, Accessible, Interoperable, Reusable — are a synonym for “open”. They are not, and the distinction matters most for sensitive data. FAIR is about good stewardship and discoverability, not unconditional availability. Sensitive data can and should be made findable: its existence, described by rich metadata, can be advertised openly even when the data itself is restricted, so that researchers know it exists and could request it. It can be made accessible in the FAIR sense — meaning that the procedure for obtaining access is clearly defined and the conditions are transparent — even when access is granted only to approved requesters under controlled conditions. And it can be made interoperable and reusable through standardised description and clear licensing. The key move is to separate the metadata, which can be fully open, from the data, whose access is controlled. Open metadata over protected data is the architecture that lets sensitive data participate in the FAIR ecosystem without being exposed.
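The separation of open metadata from protected data can be sketched in a few lines. This is an illustration only: the class name, field names and placeholder DOI are assumptions made for the example, not part of any repository's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class ControlledDataset:
    """A dataset whose description is open but whose content is restricted."""
    identifier: str        # persistent identifier (placeholder value in the example)
    metadata: dict         # description that may be published openly
    access_procedure: str  # how a researcher applies for access
    records: list = field(default_factory=list, repr=False)  # protected content

    def public_view(self) -> dict:
        """Return only what may be advertised openly; never the records."""
        return {
            "identifier": self.identifier,
            **self.metadata,
            "access": "controlled",
            "access_procedure": self.access_procedure,
        }

    def release(self, approved: bool) -> list:
        """Hand over a copy of the records only to an approved requester."""
        if not approved:
            raise PermissionError("access requires an approved application")
        return list(self.records)


# Hypothetical example values.
cohort = ControlledDataset(
    identifier="10.1234/example-cohort",
    metadata={"title": "Example cohort study", "licence": "data-use agreement"},
    access_procedure="apply to the data-access committee",
    records=[{"participant": 1, "value": 42}],
)
print(cohort.public_view())
```

The public view advertises that the dataset exists and how to request it, while the records stay behind the approval check.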

Controlled access and data-access committees

The mechanism that delivers this is controlled access. Rather than downloading the data freely, a researcher applies for it, stating who they are and what they intend to do, and agreeing to conditions on use. The application is assessed, often by a data-access committee, a body charged with deciding whether a proposed use is legitimate, ethical, and consistent with the consent under which the data were collected. Approved access typically comes with safeguards: data-use agreements that bind the recipient, restrictions on re-identification and onward sharing, and increasingly the requirement to analyse the data within a secure environment rather than taking a copy away. These arrangements let valuable data be reused while keeping the people behind it protected and the original consent respected. The committee and the agreement are not bureaucratic obstacles for their own sake; they are the means by which trust is maintained between research and the people whose data make it possible.

Synthetic data as a bridge

One increasingly important technique deserves attention: synthetic data. Synthetic data is artificially generated to resemble a real dataset’s structure and statistical properties without containing any real individual’s information. Because it contains no real records, it can often be shared far more openly than the sensitive data it mirrors. Its value is practical: researchers can develop and test their analysis code against synthetic data, others can understand a dataset’s shape before applying for the real thing, and methods can be demonstrated without exposing anyone. Synthetic data is not a perfect substitute — conclusions must ultimately be drawn from real data, and a poorly generated synthetic set can mislead — but as a bridge between the need to share and the duty to protect, it is a genuinely useful addition to the toolkit.
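As a rough illustration of the idea, the sketch below generates synthetic rows by fitting a normal distribution to each numeric column and sampling from it. This is a deliberately naive approach chosen for brevity: it reproduces each column's mean and spread but not the relationships between columns, and it offers no formal privacy guarantee. The function name and the example values are assumptions made for the example.

```python
import random
import statistics


def synthesise(rows, n, seed=0):
    """Draw n synthetic rows from per-column normal fits of the real rows.

    Each column is treated independently, so correlations between columns
    are NOT preserved; real synthetic-data tools model much more than this.
    """
    rng = random.Random(seed)
    fits = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        fits[col] = (statistics.mean(values), statistics.stdev(values))
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in fits.items()}
        for _ in range(n)
    ]


# Hypothetical "real" data; in practice this would be the sensitive dataset.
real = [
    {"age": 34, "score": 7.1},
    {"age": 51, "score": 5.9},
    {"age": 45, "score": 6.4},
    {"age": 29, "score": 8.0},
]
synthetic = synthesise(real, n=3)
print(synthetic)
```

Analysis code written against `synthetic` has the same column names and roughly the same value ranges to work with, which is the "shape" a researcher needs before applying for the real data.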

The role of secure infrastructure

Making controlled access work at scale depends on the infrastructure that supports it: trusted repositories that hold sensitive data securely, secure analysis environments where data can be worked on without being copied out, and the identifier and metadata systems that let restricted data be described openly and cited when used. This is the territory of the data infrastructure domain, and it is what turns the principle of controlled access from an aspiration into a practical reality. Without secure places to hold the data and clear ways to describe it, the careful balance of access and protection cannot be maintained.

A consistent vocabulary for access and protection

For all of this to function across institutions, funders and repositories, the terms involved must mean the same thing everywhere. Access conditions, consent categories, licence terms and protection requirements have to be described consistently, or a dataset marked as controlled-access in one system will be misunderstood in another, with real consequences when the data are sensitive. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata describing how sensitive data may be accessed and reused is understood identically wherever it appears. And because curating and stewarding controlled-access data is a genuine, recognisable contribution, that work can be described using the same framework as any other: the CRediT taxonomy and its full set of contribution roles. Sensitive data is not a problem to be hidden but a resource to be governed; done well, governance is what lets research honour both openness and the people it serves.

June 11, 2026
DataCite and the data-citation infrastructure

For a long time, the formal scholarly record recognised one kind of output above all others: the journal article, identified by a DOI and citable in a standard way. The datasets, software, samples and other research outputs that often represented the greater investment of effort had no comparable standing. They were hard to cite, hard to find again, and easy to lose track of. DataCite exists to change that. It is the global, not-for-profit registration agency that issues persistent identifiers — data DOIs — and maintains the metadata standard that makes datasets and other non-article outputs first-class, citable, connectable objects. This article explains what DataCite does and why it matters, drawing on the data infrastructure domain of the CASRAI Dictionary.

Why data needed its own infrastructure

Citing a dataset properly is harder than citing a paper, and the difficulty is structural. A dataset may have versions; it lives in a repository rather than a journal; it has creators and contributors whose roles differ from those of authors; and its value is realised through reuse, which is precisely what is hardest to track. Without a persistent identifier and a shared way to describe it, a dataset cannot be cited consistently, cannot be found reliably after the project that made it has ended, and cannot accrue the credit that reuse should generate for its creators. DataCite addresses all of these at once by giving data outputs a resolvable DOI and a structured description, so that a dataset can be referenced as precisely and durably as any article.

Data DOIs and persistent identification

The core service is the assignment of DOIs to research outputs through DataCite’s member repositories and data centres. When a dataset is deposited, the repository registers a DataCite DOI that resolves persistently to the dataset’s landing page, independent of any changes to the repository’s internal structure over time. That persistence is what lets a dataset DOI sit safely in a reference list, a data-availability statement, or another dataset’s record for years. Crucially, DataCite DOIs are not limited to datasets: the same mechanism identifies software, samples, images, models, preprints and a wide range of other outputs, extending durable, citable identity well beyond the traditional article.
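In practice, the durable form of a DOI in a reference list is its resolver link under https://doi.org/. The helper below is a minimal sketch that turns the common written forms of a DOI into that link; the function name and the example DOI are placeholders made up for illustration, and the validity check is intentionally loose.

```python
from urllib.parse import quote

# Written forms that commonly precede the DOI itself.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org resolver URL for a DOI written in any common form."""
    doi = doi.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check: DOIs start with "10." and have a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("doi:10.1234/example.dataset.v1"))
```

The link stays the same however the repository reorganises its own pages, because the resolver, not the citing document, holds the current landing-page address.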

The DataCite metadata schema

An identifier is only useful if there is consistent information behind it, and this is where the DataCite Metadata Schema does its work. The schema defines a structured set of properties for describing a research output: its creators, title, publisher and publication year, the resource type, and a rich set of optional fields covering contributors and their roles, dates, related identifiers, funding, rights and descriptions. Two features of the schema are especially powerful. The first is relatedIdentifier, which lets a record express how an output relates to others — this dataset is a version of that one, supplements this article, is derived from that sample, is documented by this data paper. The second is the recording of contributors and their roles, which allows a dataset record to name not just abstract creators but the specific people who curated, collected or maintained the data. Together these turn each record into a node with explicit, machine-readable links to the rest of the research world.
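To make the shape of such a record concrete, here is a small sketch of one as a Python dictionary. The property names and the relation and contributor values are modelled on DataCite's JSON representation, but they should be checked against the current schema documentation before being relied on; every name, DOI and title in the record is a made-up placeholder, and the two helper functions are hypothetical.

```python
# Illustrative record; all values are placeholders.
record = {
    "doi": "10.1234/example.dataset.v2",
    "creators": [{"name": "Example, Alice"}],
    "titles": [{"title": "Example river temperature readings"}],
    "publisher": "Example Data Repository",
    "publicationYear": 2026,
    "types": {"resourceTypeGeneral": "Dataset"},
    # Contributors and their roles: who curated and who collected the data.
    "contributors": [
        {"name": "Example, Bob", "contributorType": "DataCurator"},
        {"name": "Example, Carol", "contributorType": "DataCollector"},
    ],
    # Typed links to other outputs.
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/example.dataset.v1",
         "relatedIdentifierType": "DOI", "relationType": "IsNewVersionOf"},
        {"relatedIdentifier": "10.1234/example.article",
         "relatedIdentifierType": "DOI", "relationType": "IsSupplementTo"},
    ],
}

# The core descriptive properties named in the text, plus the identifier.
REQUIRED = ("doi", "creators", "titles", "publisher", "publicationYear", "types")


def missing_required(rec):
    """Return the core properties that are absent or empty."""
    return [key for key in REQUIRED if not rec.get(key)]


def relations(rec):
    """Return (relation type, target identifier) pairs for a record."""
    return [
        (rel["relationType"], rel["relatedIdentifier"])
        for rel in rec.get("relatedIdentifiers", [])
    ]


print(missing_required(record))
print(relations(record))
```

Each entry in `relatedIdentifiers` is one outgoing link from the record, which is what makes the record a node rather than an isolated description.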

DataCite and the PID graph

Because DataCite records carry related identifiers and references to other persistent identifiers — ORCID for people, ROR for organisations, Crossref DOIs for articles, grant identifiers for funding — they are not isolated entries but part of a connected PID graph. Follow the links and you can move from a dataset to its creators, their institutions, the grant that funded the work, and the article that analysed it. DataCite and Crossref between them register much of the scholarly output graph — broadly, the data and the literature — and their shared use of resolvable identifiers and exchangeable metadata is what lets the whole network be traversed automatically rather than reconstructed by hand. DataCite’s role in this interoperating arrangement is described in our work on DataCite and federation.
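Following the links can be pictured as an ordinary graph walk. The sketch below uses a tiny hand-written graph and a breadth-first traversal; the identifiers, relation names and graph structure are all invented for illustration and do not reflect how any real service stores or exposes the PID graph.

```python
from collections import deque

# Hypothetical PID graph: each identifier maps to (relation, target) pairs.
PID_GRAPH = {
    "doi:10.1234/example.dataset": [
        ("createdBy", "orcid:EXAMPLE-PERSON"),
        ("fundedBy", "grant:EXAMPLE-001"),
        ("isSupplementTo", "doi:10.1234/example.article"),
    ],
    "orcid:EXAMPLE-PERSON": [("affiliatedWith", "ror:EXAMPLE-ORG")],
    "doi:10.1234/example.article": [("createdBy", "orcid:EXAMPLE-PERSON")],
}


def reachable(graph, start):
    """Breadth-first walk returning every identifier reachable from start."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for _relation, target in graph.get(node, []):
            if target not in seen:  # visit each identifier once
                seen.add(target)
                queue.append(target)
    return order


print(reachable(PID_GRAPH, "doi:10.1234/example.dataset"))
```

Starting from the dataset, the walk reaches the creator, the creator's organisation, the grant and the article without any of those connections being reconstructed by hand.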

Supporting FAIR data and reuse

DataCite is foundational to the FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable. A DataCite DOI and its metadata make a dataset findable through search and resolvable through a stable link; the schema’s structured, standardised fields support interoperability; and the explicit rights and relationship information supports informed reuse. Just as importantly, because datasets registered with DataCite can be cited by their DOIs, their reuse can in principle be tracked, which is the basis for crediting the people who produced them. A dataset that is cited is a dataset whose creators can be recognised — the recognition that careful data stewardship has historically been denied.

Crediting data work consistently

DataCite’s ability to record contributors and their roles connects directly to the recognition of data work. The CRediT taxonomy — whose full set of roles is described in our overview of the CRediT roles — provides a controlled vocabulary for contribution, with the Data curation role recognising the management, annotation and maintenance that make a dataset reusable, alongside Investigation for collection and Methodology for how it was produced. For a contribution recorded in a dataset’s DataCite metadata to be understood the same way in an institutional system or a data paper, the terms must be defined consistently across systems. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata DataCite carries — resource types, contributor roles, relationship types — means the same thing wherever a dataset DOI travels.

June 10, 2026
Finding research data: dataset discovery and data search engines

A vast amount of research data is now deposited in repositories, accompanied by persistent identifiers and described with metadata. That is a real achievement — but it raises a question that is easy to overlook. How does anyone actually find a relevant dataset? A researcher who suspects that the data they need may already exist somewhere faces a genuinely hard search problem: the data is scattered across thousands of repositories worldwide, each with its own catalogue, its own search box and its own conventions. Without good ways to discover data across all of them, a great deal of valuable, well-curated data simply goes unfound and unused — the digital equivalent of a book correctly shelved in a library no one knows the address of. This article looks at the infrastructure of dataset discovery, drawing on the data infrastructure domain of the CASRAI Dictionary.

Findability is the first FAIR principle

It is no accident that the F in FAIR — Findable, Accessible, Interoperable, Reusable — comes first. Findability is logically prior to everything else: data that cannot be found cannot be accessed, cannot be reused, and delivers none of the value its careful curation promised. Findability in the FAIR sense rests on a few concrete foundations: data should be assigned a globally unique and persistent identifier; it should be described with rich metadata; that metadata should explicitly include the identifier; and the metadata should be registered or indexed in a resource that can be searched. The order of the principles is a quiet but important statement of priorities — all the work of making data accessible and reusable is wasted if the first hurdle, being found at all, is never cleared.
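The four foundations can be read as a checklist. The sketch below applies them to a record held as a plain dictionary; the record layout, the field names and the stand-in tests (a DOI link for "persistent identifier", four non-empty fields for "rich metadata") are simplifying assumptions made for the example, not an official FAIR assessment.

```python
def findability_gaps(record, search_index):
    """List which of the four findability foundations a record fails.

    `record` is a hypothetical dict with an "identifier" and a "metadata" dict;
    `search_index` is any collection of identifiers known to a search service.
    """
    gaps = []
    pid = record.get("identifier", "")
    metadata = record.get("metadata", {})

    # 1. Globally unique, persistent identifier (approximated as a DOI link).
    if not pid.startswith("https://doi.org/10."):
        gaps.append("no persistent identifier")

    # 2. Rich metadata (approximated as a minimum set of non-empty fields).
    wanted = ("title", "description", "creators", "licence")
    if any(not metadata.get(name) for name in wanted):
        gaps.append("metadata not rich enough")

    # 3. The metadata explicitly includes the identifier.
    if not pid or metadata.get("identifier") != pid:
        gaps.append("metadata does not include the identifier")

    # 4. The metadata is registered or indexed in a searchable resource.
    if pid not in search_index:
        gaps.append("not indexed in a searchable resource")

    return gaps


# Hypothetical example values.
pid = "https://doi.org/10.1234/example.dataset"
good = {
    "identifier": pid,
    "metadata": {
        "identifier": pid,
        "title": "Example river temperature readings",
        "description": "Hourly readings from three example sites.",
        "creators": ["Example, Alice"],
        "licence": "CC BY 4.0",
    },
}
print(findability_gaps(good, search_index={pid}))
print(findability_gaps({"identifier": "", "metadata": {}}, search_index=set()))
```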

Registries of repositories

Discovery operates at more than one level, and the first level is finding the right repository. Before searching for a dataset, a researcher — whether looking for existing data or deciding where to deposit their own — often needs to identify which repository is appropriate for their field and data type. This is the role of re3data, the Registry of Research Data Repositories, a comprehensive directory that catalogues data repositories across all disciplines. It lets users discover repositories by subject, country, data type and the policies they operate, describing each in a structured way. re3data answers the question “where might data like this live, and where should I put mine?” It is discovery one level up — finding the haystacks before searching for the needle — and it is an essential first step that purely dataset-level search tools do not provide.

Dataset-level discovery

The second level is finding individual datasets, and several complementary services address it:
- DataCite Commons. Because DataCite is a principal minter of persistent identifiers for research data, it sits on a large, structured graph of datasets and their connections to people, organisations, funders and related outputs. DataCite Commons exposes that graph for discovery, letting users search across datasets and follow the links between a dataset and its authors, its funding and the works that cite or relate to it.
- Google Dataset Search. A search engine dedicated to datasets, it works by harvesting the structured metadata that data providers publish on their own pages and making it searchable in one place. It brings dataset discovery into a familiar, web-scale search experience.
- Repository and aggregator catalogues. Individual repositories offer their own search, and aggregators pull metadata from many sources into combined indexes, each widening the net a little further.

Why structured metadata is the engine

What makes web-scale dataset search possible at all is structured, machine-readable metadata, and in particular the schema.org/Dataset vocabulary. schema.org is a shared vocabulary for marking up information on web pages so that machines, not just humans, can understand it, and it includes a specific type for describing datasets — their title, description, creators, licence, distribution and more. When a repository or data provider embeds schema.org/Dataset markup in the page describing a dataset, a search engine crawling the web can recognise that the page describes a dataset and extract its key facts. This is precisely how a service such as Google Dataset Search builds its index: not by being given a private feed from every repository, but by reading the standardised markup that providers publish openly. The lesson is direct and practical — describing data with shared, structured metadata is not bureaucratic box-ticking, it is the literal mechanism by which the data becomes discoverable to the wider world.
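As an illustration of what such markup looks like, the sketch below builds a schema.org/Dataset description as JSON-LD and wraps it in the script element that a landing page would carry. The function name and all example values are placeholders; the property names (name, description, creator, license, identifier, distribution) come from the schema.org vocabulary, but which of them a given search service actually requires is something to confirm in that service's own guidance.

```python
import json


def dataset_jsonld(name, description, creators, licence_url, identifier, download_url):
    """Return an HTML script element holding schema.org/Dataset JSON-LD markup."""
    markup = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "license": licence_url,
        "identifier": identifier,
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": "text/csv",  # assumed format for the example
            }
        ],
    }
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical example values.
print(
    dataset_jsonld(
        name="Example river temperature readings",
        description="Hourly readings from three example sites.",
        creators=["Alice Example"],
        licence_url="https://creativecommons.org/licenses/by/4.0/",
        identifier="https://doi.org/10.1234/example.dataset",
        download_url="https://repository.example.org/files/readings.csv",
    )
)
```

A crawler that understands schema.org can read the title, creators, licence and download location from this block without any repository-specific integration.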

Discovery depends on good deposit

All of this throws the responsibility back to the moment of deposit. A dataset is only as findable as its metadata is good. Rich, accurate, standards-based metadata — a clear title, a meaningful description, named creators with identifiers, an explicit licence, appropriate keywords — is what feeds every layer of the discovery system. Skimpy or inconsistent metadata leaves a dataset effectively invisible no matter how valuable its contents. This is why guidance on depositing data places such weight on description, and why the choices a researcher makes at deposit time echo through every subsequent attempt to find their work. Practical guidance on getting this right is part of our wider material on research data fundamentals.

A consistent vocabulary for findable data

For discovery to work across repositories, registries and search engines, the metadata describing datasets must mean the same thing everywhere: a creator, a licence, a resource type or a related-work link has to be interpretable consistently across every system that indexes it. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that dataset metadata is understood identically wherever it is harvested. And because creating and curating a dataset is a genuine research contribution, the people behind it can be credited using the same framework as any other: the CRediT taxonomy and its full set of contribution roles, with Data curation recognising the work that makes data findable in the first place. Depositing data is necessary; describing it well, in shared terms, is what makes it discoverable, and discoverability is what lets data fulfil its purpose.
May 31, 2026