Tag: FAIR data

  • Trusted repositories and the EOSC: where research data should live

    Open and FAIR data has to live somewhere, and the choice of where is not a clerical detail. A dataset deposited on a personal web page, a lab server, or a service that may not exist in five years is, for the purposes of long-term reuse, lost. The question of where research data should live is the question of trusted repositories, and the European answer to coordinating them is the EOSC. This article maps the landscape, drawing on the data-infrastructure domain.

    What makes a repository trustworthy

    Not every place that can store a file is fit to be the home of the scholarly record. A trusted digital repository is one assessed against a recognised trust framework, demonstrating that it has the organisational and technical capability to preserve and provide access to data over the long term. Trust here is not a vibe; it is a set of demonstrable properties — a sustainability plan, preservation procedures, persistent identifiers, clear access conditions, and the organisational continuity to outlast any individual project or grant.

    The most widely recognised certification of these properties is CoreTrustSeal, a community-governed assessment that a repository meets the core requirements of trustworthy data stewardship. A CoreTrustSeal certification is a concrete signal a funder or researcher can rely on: it means an independent process has checked that the repository can actually do what “long-term preservation” implies. When a funder mandate says data must go to a trusted repository, CoreTrustSeal is the most common way the word “trusted” is given operational meaning.

    The repository taxonomy: generalist and domain

    Trusted repositories come in two broad kinds, and choosing well between them is one of the most consequential data-management decisions a researcher makes.

    • A generalist repository accepts data from any discipline. Zenodo, Figshare, and Dryad are the familiar examples: they mint a DOI, accept almost any data type, and provide a reliable, citable home when no specialist option exists. They are the right default for the long tail of research data that has no natural disciplinary home.
    • A domain repository is discipline-specific, built around the data types, standards, and community of a particular field. GenBank for nucleotide sequence data is the archetype; there are equivalents across crystallography, astronomy, social science, proteomics, and more. A domain repository adds what a generalist cannot: discipline-specific metadata standards, validation, and a community of expert users who will actually find and reuse the data.

    The practical rule that funders increasingly articulate is: deposit in the appropriate domain repository where one exists, and fall back to a trusted generalist repository where it does not. A sequence belongs in GenBank, not in a generic store; a one-off dataset with no community home belongs in a generalist repository with a DOI rather than on a server that will be decommissioned.
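    The deposit rule above can be read as a small decision procedure. The sketch below is illustrative only: `DOMAIN_REPOSITORIES`, `GENERALIST_REPOSITORIES` and `choose_repository` are hypothetical names, and the lookup holds just the repositories this article mentions. A real implementation would presumably draw on a maintained registry rather than a hard-coded table.

```python
from typing import Optional

# Hypothetical, deliberately tiny lookup: data type -> domain repository.
DOMAIN_REPOSITORIES = {
    "nucleotide sequence": "GenBank",
}

# Generalist repositories named in the article, in no particular order.
GENERALIST_REPOSITORIES = ["Zenodo", "Figshare", "Dryad"]


def choose_repository(data_type: str, preferred_generalist: Optional[str] = None) -> str:
    """Return a domain repository if one is known, else a generalist fallback."""
    domain = DOMAIN_REPOSITORIES.get(data_type.strip().lower())
    if domain is not None:
        return domain  # a recognised community home exists, so use it
    if preferred_generalist is not None:
        if preferred_generalist not in GENERALIST_REPOSITORIES:
            raise ValueError(f"unknown generalist repository: {preferred_generalist}")
        return preferred_generalist
    # No domain home and no stated preference: fall back to the first generalist.
    return GENERALIST_REPOSITORIES[0]
```

    The point of the ordering is that the generalist branch is only reached when the domain lookup fails, which mirrors the "domain first, generalist as fallback" rule.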

    The EOSC: coordinating the federation

    Individual trusted repositories are necessary but not sufficient. A researcher also needs to find the right one, move data and compute between services, and trust that the pieces interoperate. In Europe, the coordinating layer for this is the European Open Science Cloud (EOSC) — a federation of research-data services rather than a single monolithic platform.

    The EOSC’s model is federation: an EOSC node is a service provider connected to the federation, and an EOSC service is something offered through its catalogue — a repository, a compute resource, a data-management tool. The aspiration is that a researcher can discover trusted repositories, deposit data, and compose data with compute across institutional and national boundaries, through a coordinated catalogue rather than a patchwork of disconnected services. The EOSC is, in effect, the European attempt to make “where should this data live?” answerable through one front door onto many trustworthy providers. It is not the only such effort — the African Open Science Platform pursues a comparable continental federation — but it is the most developed.

    The human layer: stewards and custodians

    Infrastructure does not curate itself, and an honest account of where data should live has to name the people. A data steward is the professional responsible for data quality, governance, and ongoing curation — the role that makes the difference between data that is merely deposited and data that is genuinely reusable. A data custodian holds legal or operational responsibility for the data. Around them sit the structured agreements that govern sharing: a data sharing agreement setting the conditions under which data move between parties, an embargo period deferring public access after deposit, and access controls distinguishing open, restricted, and metadata-only data.

    A trusted repository with no data steward behind the data is a safe building with empty rooms. Preservation is an organisational commitment carried out by people, not a property that storage acquires on its own.

    Why this connects to FAIR and to identifiers

    Where data lives is what makes the FAIR principles operational. Findability depends on the repository minting a persistent identifier and exposing good metadata; accessibility depends on stable resolution and clear access conditions; interoperability and reusability depend on the standards a domain repository enforces. A trusted repository is, in practice, the machine that turns the FAIR aspiration into a deposited reality — which is why the choice of repository, and the trust signal of CoreTrustSeal, matters as much as the decision to share at all. The repository is also where the data’s persistent identifier enters the broader graph that links it to the project, the people, and the funding.

    Where shared vocabulary fits

    The terms in this domain are used loosely in funder mandates and policies: “trusted”, “appropriate”, “long-term” all mean different things to different bodies, and “generalist” versus “domain” is often left implicit. A shared, federated vocabulary that defines these precisely, pointing to CoreTrustSeal for the trust framework and to the EOSC for the federation model, is what lets a data-sharing requirement be stated unambiguously and checked. Supplying that definitional layer is the role the CASRAI Dictionary is designed to play.

    What to do now

    For researchers: deposit in the appropriate domain repository where one exists, otherwise a CoreTrustSeal-certified generalist repository, and never a personal or project server for the long term. For institutions: invest in data stewards, not just storage. For funders and standards bodies: give “trusted repository” operational meaning through certification and shared vocabulary, and support the federations that make trustworthy services findable.

  • Data papers: publishing datasets as citable outputs

    Some of the most valuable products of research are datasets: a long-running environmental monitoring series, a carefully curated genomic resource, a survey assembled over years. Such a dataset can underpin dozens of later studies and outlast the project that created it. Yet the people who built it have often struggled to get formal credit, because the traditional unit of academic recognition is the journal article that interprets data, not the data themselves. The data paper exists to close that gap: a peer-reviewed article whose subject is a dataset — describing what it contains, how it was produced and how to reuse it — turning data work into a citable, reviewable output in its own right. This article explains how data papers work and why they matter, drawing on the research outputs domain of the CASRAI Dictionary.

    What a data paper is — and is not

    A data paper is not a research paper that happens to share its data, and it is not a results paper in disguise. Its purpose is descriptive: to document a dataset thoroughly enough that others can find, understand, trust and reuse it. A typical data paper covers what the data are, how and why they were collected, the methods and instruments used, the structure and format of the data, quality-control and validation procedures, and — crucially — where the data are deposited and under what licence. What a data paper generally does not do is advance a new scientific hypothesis or interpret the data to reach a novel conclusion; the contribution is the well-described, reusable resource itself. This restraint is the point: it lets the value of the data be assessed on its own terms, separately from any particular analysis.

    Data journals and where data papers appear

    Data papers are published either in dedicated data journals or in conventional journals that accept the format. Two well-established examples illustrate the model. Scientific Data publishes peer-reviewed descriptions of datasets across the sciences, pairing each with structured metadata. Earth System Science Data publishes data papers in the Earth and environmental sciences, with a strong emphasis on data quality and reusability. These venues apply genuine peer review — reviewers assess whether the data are sound, complete, properly documented and genuinely reusable — which is what gives a data paper its credibility. A peer-reviewed data paper is not merely a deposit; it is a vetted statement that the dataset meets a scholarly standard.

    The relationship between the paper and the data

    A central feature of the data paper model is the separation of the description from the data. The data paper is the human-readable, peer-reviewed article; the dataset itself lives in a repository, where it receives its own persistent identifier — typically a DataCite DOI — and is governed by an explicit licence. The data paper cites the dataset by that identifier, and the dataset record points back to the paper. This means there are two citable objects, linked but distinct: the dataset, which others cite when they reuse the data, and the data paper, which others cite when they draw on its description. Robust dataset citation through DataCite is what allows reuse of the data to be tracked and, over time, credited to the people who produced it. The infrastructure that makes datasets first-class citable objects is part of the wider picture covered in our data infrastructure domain.
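    The two-object model can be made concrete with a small sketch. It assumes nothing about any real metadata schema: the `CitableObject` class, the relation names `describes` and `is_described_by`, and the placeholder identifiers are all hypothetical, chosen only to show what "linked but distinct" means in practice.

```python
from dataclasses import dataclass, field


@dataclass
class CitableObject:
    """A citable output with a persistent identifier and outgoing links."""
    identifier: str                            # e.g. a DOI; values below are placeholders
    kind: str                                  # "dataset" or "data paper"
    links: dict = field(default_factory=dict)  # relation name -> target identifier


def is_reciprocally_linked(paper: CitableObject, dataset: CitableObject) -> bool:
    """True when the paper points at the dataset and the dataset points back."""
    return (
        paper.links.get("describes") == dataset.identifier
        and dataset.links.get("is_described_by") == paper.identifier
    )


# Two distinct objects, each with its own (placeholder) identifier.
dataset = CitableObject("10.1234/example-dataset", "dataset")
paper = CitableObject("10.1234/example-paper", "data paper")

# The paper cites the dataset, and the dataset record points back to the paper.
paper.links["describes"] = dataset.identifier
dataset.links["is_described_by"] = paper.identifier
```

    A check like `is_reciprocally_linked(paper, dataset)` fails if either direction of the link is missing, which is the situation in which reuse of the data cannot be traced back to its description.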

    Why data papers matter for credit and FAIR data

    The deeper reason data papers matter is incentives. For a long time, the rational move for a researcher who built a valuable dataset was to mine it for conventional papers, because that was what counted. The data paper changes the calculus by making the dataset itself a recognised, citable, peer-reviewed output that appears on a CV and accrues citations. That recognition rewards exactly the careful, time-consuming data stewardship that the research system otherwise undervalues. Data papers also advance the FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — almost by construction: a good data paper makes a dataset findable (through publication and a DOI), documents it for accessibility and interoperability, and exists precisely to enable reuse.

    Crediting the people behind the data

    Producing a high-quality dataset is collaborative work — collection, curation, validation, documentation — and a data paper is an opportunity to credit it properly rather than burying it in an acknowledgement. The CRediT taxonomy maps naturally onto this work, with the Data curation role recognising the management, annotation and maintenance of the data, alongside Investigation for collection and Methodology for how it was produced. The complete set of roles is described in our overview of the CRediT roles. Applying structured contribution to a data paper ensures that the curator who made the dataset reusable is named for that contribution, not left invisible behind the names of those who later analyse the data.

    An output worth treating seriously

    Treating datasets as citable, reviewable outputs — with their own identifiers, their own peer review, and their own credit — recognises a simple reality: the data often outlast and out-influence any single paper drawn from them. Data papers give that reality formal standing. The consistent vocabulary that lets a dataset, a data paper and the contributions behind them be described the same way across repositories, journals and institutional systems is maintained in the CASRAI Dictionary, so that the credit a researcher earns for building a resource travels with it wherever it is reused.

  • Documenting datasets for machine-learning research: datasheets, data statements and Croissant

    A machine-learning model is, in a profound sense, a product of its training data. Whatever patterns, gaps, imbalances and biases live in that data are absorbed by the model and reproduced in its behaviour. And yet, for much of the field’s recent history, datasets have circulated with remarkably little documentation: a file, perhaps a brief description, and little record of where the data came from, who is represented in it, what it omits, or what it should and should not be used for. The result has been models trained on poorly understood foundations, with predictable consequences for reliability and fairness. A growing movement now treats dataset documentation as a serious, first-class research output in its own right. This article surveys that movement, drawing on the AI and ML research-outputs domain of the CASRAI Dictionary.

    Datasheets for Datasets

    The most influential proposal, borrowing an idea from electronics, is the datasheet. Just as an electronic component ships with a datasheet describing its characteristics, operating conditions and limitations, Datasheets for Datasets proposes that every dataset be accompanied by a document answering a structured set of questions about it. Those questions span the dataset’s whole life: the motivation for creating it and who funded it; its composition — what the instances are, how many there are, what they represent, and whether sensitive or personal data is involved; the collection process — how the data was gathered and whether consent was obtained; any preprocessing, cleaning or labelling; recommended and discouraged uses; and plans for distribution and maintenance. The aim is to make explicit what would otherwise remain tacit, so that anyone considering using the dataset can understand its provenance and judge its fitness for their purpose — and so that the people who created it must think carefully about these matters while they still can.

    Data Statements for NLP

    A closely related proposal arose specifically in natural-language processing, where the characteristics of the people who produced the text in a dataset profoundly shape what a model learns. Data Statements for Natural Language Processing ask dataset creators to document the relevant characteristics of their data: who the speakers and annotators are, the language varieties represented, the situations in which the language was produced, and so on. The motivation is squarely about bias and generalisation. A language model trained on text from a narrow demographic will work less well, and sometimes fail or cause harm, for people outside it — and without documentation, that limitation is invisible until it bites. Data statements make the population behind the data explicit, so that the boundaries of a model’s likely competence can be understood rather than discovered the hard way. Both datasheets and data statements share a conviction: documentation is not bureaucratic overhead but a precondition for using data responsibly.

    Croissant: machine-readable dataset metadata

    Datasheets and data statements are written largely for humans. But for datasets to be discoverable, loadable and interoperable across the many tools of the machine-learning ecosystem, their metadata also needs to be machine-readable. This is the role of Croissant, a metadata format for machine-learning datasets developed through a community effort associated with MLCommons. Croissant provides a standard, structured way to describe a dataset — its resources, structure, fields and semantics — so that tools, frameworks and repositories can understand and work with it consistently, rather than each requiring bespoke handling. By standardising the description, Croissant makes datasets easier to find, load and combine across platforms, and it can carry the kind of responsible-use and provenance information that datasheets capture into a form that systems can act on. It is, in effect, the interoperability layer for dataset documentation.

    How this connects to FAIR and persistent identifiers

    This work is the machine-learning expression of principles that the wider research-data community has long advocated. The FAIR principles — that data should be Findable, Accessible, Interoperable and Reusable — map directly onto what good dataset documentation achieves: rich, machine-readable metadata (Croissant) makes data findable and interoperable, while thorough human-readable documentation (datasheets, data statements) is what genuine reusability requires, because data cannot be responsibly reused if its provenance and limitations are unknown. Persistent identifiers complete the picture: when a dataset is registered with an identifier through an infrastructure such as DataCite, it becomes citable and trackable, so that it can be referenced precisely in papers, credited to its creators, and connected to the models and results that depend on it. A documented, identified dataset is one that can take its place in the scholarly record as a real output rather than an anonymous file.

    Datasets as research outputs deserving credit

    The deeper shift here is a change in status. Creating a good dataset, which means collecting, cleaning, labelling and documenting it carefully, is substantial intellectual labour, and the resulting dataset is a genuine research output that others build upon, often more widely than any single paper. Treating datasets as first-class outputs means documenting them properly, identifying them persistently, and crediting the people who made them. The CRediT taxonomy, whose full set of contribution types is described in our overview of the CRediT roles, captures this work through roles such as Data curation, which recognises the management, annotation and maintenance of data. Recognising dataset creation as creditable contribution is part of the same movement that produced datasheets: an insistence that the data underpinning machine learning, and the people who steward it, be taken seriously.

    A consistent vocabulary for dataset documentation

    For dataset documentation to be useful across repositories, frameworks and institutions, the elements it contains must mean the same thing everywhere — what a field describes, what a provenance statement records, what an intended-use restriction means. That consistency is what the CASRAI Dictionary works towards: a shared vocabulary so that the metadata describing a dataset is understood identically wherever it travels. Datasheets, data statements and Croissant all rest on the same insight: that a dataset without documentation is a liability, and that documenting it well is not an afterthought but part of doing the research properly.

  • Open science across the research lifecycle: from preregistration to preservation

    Open science is often encountered as a set of separate practices: a journal’s open-access policy, a funder’s data-sharing requirement, a colleague’s preregistered study. Treated piecemeal, each can feel like an isolated obligation. But open science is most powerful, and most coherent, when its practices are understood as connected stages in the arc of a single project — when openness runs through the whole research lifecycle rather than appearing only at the end. Seen this way, preregistration, open data, open access and preservation are not unrelated requirements but successive expressions of one principle: that research is more trustworthy, more useful and more cumulative when it is conducted in the open. This article traces openness across the lifecycle through the research lifecycle domain of the CASRAI Dictionary.

    A global framework: the UNESCO Recommendation

    That open science is a connected whole rather than a collection of separate practices is reflected in the most significant international statement on the subject: the UNESCO Recommendation on Open Science, adopted by member states as a shared global framework. It treats open science not as a single act of sharing but as an integrated set of practices and values — open access to publications, open research data, open-source software, open infrastructures, open engagement with society — underpinned by transparency, equity and inclusion. Its scope is the point: it frames openness as a culture spanning the entire research process, not a box ticked at publication, and provides a common reference for understanding open science as a coherent lifecycle.

    The beginning: preregistration

    Openness can begin before any data are collected. Preregistration is the practice of specifying a study’s hypotheses, methods and analysis plan in advance, and recording that plan in a way that cannot be quietly changed later. Its purpose is to strengthen the integrity of research by making clear what was planned before the results were known, which guards against practices such as reshaping hypotheses to fit the data or selectively reporting only what worked. A particularly developed form is the registered report, in which a study’s plan is peer-reviewed and accepted in principle before the results exist, so that publication depends on the quality of the question and method rather than on whether the findings turn out to be striking. Preregistration makes the research process transparent from the outset and sets the foundation for everything that follows.

    The middle: open and FAIR data

    As a project generates data, openness shifts to how that data is managed and shared. The widely adopted FAIR principles hold that data should be Findable, Accessible, Interoperable and Reusable — properties that let data be discovered, understood and built upon by others rather than locked away or lost. Making data FAIR, and as open as is responsible, transforms it from a private by-product of one study into a lasting resource for the community. This stage connects backwards and forwards: data shared openly allows the results derived from it to be checked, and it allows the data itself to feed new research it was never collected for. Openness in the middle of the lifecycle is what gives a project value beyond its own conclusions.

    The output: open access

    When findings are written up, openness turns to open access — making the resulting publications freely available to read rather than locked behind paywalls. It can be achieved through different routes, including publishing in open-access venues and depositing accepted manuscripts in repositories, but the principle is constant: research that anyone can read can be verified, used and built upon by the widest possible audience. Open access is the most visible face of open science, but within the lifecycle it is one stage among several. A paper that is open but rests on hidden data and an undisclosed plan is less open than it appears; open access is most meaningful when it sits atop preregistration and open data.

    The long term: preservation

    The lifecycle does not end at publication, because outputs that are open today are worthless tomorrow if they vanish. Digital preservation is the work of ensuring that data, publications, software and other outputs remain accessible, intact and usable over the long term, against the threats of format obsolescence, link rot, storage failure and institutional change. There is little point making research open if it cannot be found or opened a decade later. Trusted repositories, persistent identifiers and active preservation practices are what keep the open record open over time, closing the loop so that the openness built earlier actually endures.

    The lifecycle as a connected whole

    The deeper point is that these stages reinforce one another. Preregistration makes the eventual open data and open publication more meaningful, because the plan they can be checked against is on record. Open data makes the open publication verifiable. Preservation makes all of it durable. Conversely, openness at one stage is weakened when another stage is missing: open access over secret data, or open data with no preservation, falls short of the whole. This is why open science is best understood as a lifecycle rather than a checklist: its value is cumulative and connected, exactly the vision the UNESCO Recommendation articulates. Our learning resources explore each practice in more depth.

    A consistent vocabulary across the lifecycle

    For openness to connect across stages and systems, the information describing each stage must mean the same thing everywhere — the status of a preregistration, the access conditions of data, the licence on a publication, the preservation state of an output. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the open-science attributes of a project are understood identically across the systems that record them. And because contribution runs through every stage, the work done at each can be described in the same shared framework — the CRediT taxonomy and its full set of contribution roles. Open science is not a single act but a way of working across the whole life of a project; its power lies in the connection of its parts.

  • Data availability statements: what to write and where to deposit

    Most journals now ask for a data availability statement, and most authors now write one. Far fewer write one that does what it is meant to do. The phrase “data are available from the authors on reasonable request” has become the default, yet study after study has found that requests against such statements frequently go unanswered — which means the statement records an intention rather than a reality. This guide covers what to write, where to put the data, and how to make a statement that is true. It builds on the foundations in the data-infrastructure domain and connects to the practices described in the reproducibility domain.

    What a data availability statement is for

    A data availability statement (sometimes a data accessibility statement) tells a reader where the data underlying a publication can be found, under what conditions, and — where access is restricted — why. Its purpose is to make the evidential basis of the work locatable and, where ethically possible, reusable. It is the public-facing expression of the principle that a published claim should be checkable against the data behind it. A good statement is specific: it names a repository, gives an identifier, and states the access conditions plainly.

    Make the data FAIR first, then describe it

    The statement is downstream of a deposit decision, so the deposit is where the real work happens. The widely adopted reference point is the FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable. FAIR is frequently misread as “open”, and the distinction matters: FAIR does not require data to be public. It requires that data be findable (with a persistent identifier and rich metadata), accessible (retrievable by a clear, possibly authenticated, protocol), interoperable (using shared formats and vocabularies), and reusable (with a clear licence and provenance). Sensitive data can be FAIR while remaining access-controlled — the metadata is open and findable even where the data themselves are not.

    Practically, making data FAIR before you write the statement means:

    • Deposit in a repository that mints a persistent identifier — typically a DataCite DOI — so the data are citable and resolvable independently of the article.
    • Describe the data with structured metadata, not just a filename, so they can be found and understood by someone who did not produce them.
    • Attach an explicit licence (for example a Creative Commons licence for open data) so reuse conditions are unambiguous.
    • Use community formats and vocabularies where they exist, so the data interoperate with other datasets in the field.
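    The four steps above lend themselves to a simple pre-deposit check. The sketch below is a minimal illustration, not a FAIR assessment tool: the field names in `REQUIRED_FIELDS` and the function `missing_fair_elements` are hypothetical, and passing the check shows only that nothing on the list is blank, not that the data are actually FAIR.

```python
# Hypothetical record keys mapped to the checklist item each one stands for.
REQUIRED_FIELDS = {
    "identifier": "persistent identifier (for example a DOI)",
    "description": "structured description of the data",
    "licence": "explicit licence",
    "format": "community format or vocabulary",
}


def missing_fair_elements(record: dict) -> list:
    """Return the checklist items that are absent or blank in the record."""
    missing = []
    for key, label in REQUIRED_FIELDS.items():
        value = record.get(key)
        # Treat a missing key, None, and whitespace-only text as "not provided".
        if value is None or not str(value).strip():
            missing.append(label)
    return missing


# Example record with placeholder values; two checklist items are still missing.
draft_record = {
    "identifier": "10.1234/example-dataset",
    "licence": "CC BY 4.0",
    "description": "   ",
}
```

    Running `missing_fair_elements(draft_record)` on the draft flags the blank description and the absent format, which is the kind of gap worth closing before the availability statement is written.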

    Choosing where to deposit: domain first, generalist as fallback

    Where to put the data is the decision that most shapes their long-term value. The general rule is to prefer a domain repository where a recognised one exists for your data type, and to use a generalist repository otherwise.

    Domain repositories

    A domain (or discipline-specific) repository is built around a particular kind of data and enforces the community’s metadata standards — GenBank for nucleotide sequences, the PDB for protein structures, and many others. Depositing here means your data sit alongside comparable datasets, are described to a standard your field already reads, and are discoverable by the people most likely to reuse them. Where your field expects deposit in a specific repository, that expectation is effectively mandatory and should be your first choice.

    Generalist repositories

    Where no suitable domain repository exists, a generalist repository — Zenodo, Figshare, Dryad and others — accepts data of any type, mints a DOI, and supports structured metadata and licensing. Generalists are the right home for the long tail of data that no specialised archive covers.

    A note on trust

    Whichever route you take, prefer a trusted digital repository (one assessed against a recognised standard such as CoreTrustSeal) over ad-hoc hosting. A repository’s job is long-term preservation and stable resolution; a personal website or a generic file-sharing link offers neither, and a link that has rotted makes a data availability statement worse than useless. Institutional and supplementary-file hosting can be acceptable, provided the host makes a comparable commitment to persistence.

    Writing the statement

    A strong statement names the repository, gives the identifier, and states the conditions. Some patterns:

    • Open deposit: “The data supporting this study are openly available in [repository] at [DOI], under a [licence].”
    • Controlled access: “The data are available from [repository / controlled-access archive] subject to [conditions, e.g. a data access committee], because they contain [reason, e.g. identifiable personal data]. Metadata are openly available at [DOI].”
    • Genuinely no new data: “No new data were generated; the study analysed [named existing datasets] available at [identifiers].”
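    The three patterns above are effectively templates with required slots, and one way to make that explicit is to refuse to produce a statement when a slot is left unfilled. The sketch below assumes this reading; the `TEMPLATES` table, the slot names and the function `data_availability_statement` are hypothetical, and the wording is adapted from the patterns in this guide rather than taken from any journal's policy.

```python
# Templates adapted from the three patterns above; slot names are illustrative.
TEMPLATES = {
    "open": (
        "The data supporting this study are openly available in {repository} "
        "at {identifier}, under a {licence} licence."
    ),
    "controlled": (
        "The data are available from {repository} subject to {conditions}, "
        "because they contain {reason}. Metadata are openly available at {identifier}."
    ),
    "no_new_data": (
        "No new data were generated; the study analysed {datasets} "
        "available at {identifiers}."
    ),
}


def data_availability_statement(kind: str, **details: str) -> str:
    """Fill the chosen template, failing loudly if a required detail is missing."""
    try:
        template = TEMPLATES[kind]
    except KeyError:
        raise ValueError(f"unknown statement kind: {kind!r}") from None
    try:
        return template.format(**details)
    except KeyError as missing:
        # A missing slot means the statement would be vague, so refuse to build it.
        raise ValueError(f"missing detail for {kind!r} statement: {missing}") from None


# Example with placeholder values (the DOI is not a real identifier).
example = data_availability_statement(
    "open",
    repository="Zenodo",
    identifier="https://doi.org/10.1234/example",
    licence="CC BY 4.0",
)
```

    The design choice worth noting is that there is no template for a bare "available on request": every supported pattern requires a named repository or dataset and an identifier.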

    Avoid the bare “available on request” formulation wherever the data could instead be deposited. Where access genuinely must be restricted — for participant confidentiality, commercial sensitivity, or Indigenous data governance — say so, give the reason, name who controls access, and still publish open metadata so the dataset is findable. An honest restricted-access statement is far stronger than a vague promise of availability.

    Where shared vocabulary fits

    Terms like “available on request”, “restricted access”, “trusted repository”, and even “FAIR” are used inconsistently across journals and funders, which weakens the policies that depend on them. A shared, federated vocabulary that defines these precisely, pointing back to the FAIR principles and to certification schemes such as CoreTrustSeal, is what lets a statement written for one venue be understood by another. Supplying that definitional layer is the role the CASRAI Dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.

  • Sensitive and controlled-access data: FAIR for data that cannot be fully open

    The push for open research data has been one of the defining movements in scholarly practice, and rightly so: openly available data is easier to verify, reuse and build upon. But an unqualified call to make all data open runs into an immovable obstacle. A great deal of research data is sensitive — patient records, genetic information, data about vulnerable people, commercially confidential material, data whose release could cause harm — and such data cannot simply be posted on the open web without breaching the law, betraying participants’ trust, or endangering people. The challenge is not to choose between openness and protection but to honour both: to make sensitive data as accessible as it responsibly can be while keeping it as protected as it must be. This article looks at how that balance is struck, drawing on the compliance and regulatory domain of the CASRAI Dictionary.

    As open as possible, as closed as necessary

    The principle that has come to govern this territory is captured in a single phrase: data should be “as open as possible, as closed as necessary”. The phrase does real work. It establishes openness as the default and the goal — the burden falls on reasons to restrict, not on reasons to share. But it also acknowledges, plainly, that necessity sometimes requires closure, and that protecting people and honouring legal and ethical obligations is not a failure of openness but a condition of doing research responsibly. The aim, then, is not a binary of open versus closed but a spectrum of access arrangements, each calibrated to what a particular dataset requires. Sensitive data does not fall off the map of good data practice; it occupies a different, carefully governed part of it.

    FAIR does not mean open

    A common misconception is that the FAIR principles — Findable, Accessible, Interoperable, Reusable — are a synonym for “open”. They are not, and the distinction matters most for sensitive data. FAIR is about good stewardship and discoverability, not unconditional availability. Sensitive data can and should be made findable: its existence, described by rich metadata, can be advertised openly even when the data itself is restricted, so that researchers know it exists and could request it. It can be made accessible in the FAIR sense — meaning that the procedure for obtaining access is clearly defined and the conditions are transparent — even when access is granted only to approved requesters under controlled conditions. And it can be made interoperable and reusable through standardised description and clear licensing. The key move is to separate the metadata, which can be fully open, from the data, whose access is controlled. Open metadata over protected data is the architecture that lets sensitive data participate in the FAIR ecosystem without being exposed.
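    The split between open metadata and protected data can be pictured as a single record with two views. This is a minimal sketch under stated assumptions: the class, the field names, and the example values are all hypothetical and do not follow any particular metadata schema. The public view carries everything needed to find the dataset and learn how to request it; the location of the data is released only for open datasets or approved requesters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    identifier: str        # persistent identifier for the record
    title: str
    description: str
    access_level: str      # "open" or "controlled" (assumed vocabulary)
    access_procedure: str  # how to request access, stated openly
    licence: str
    data_location: str     # where the data actually sit; not public if controlled


def public_metadata(ds: Dataset) -> dict:
    """The openly published view: every field except the data location."""
    return {
        "identifier": ds.identifier,
        "title": ds.title,
        "description": ds.description,
        "access_level": ds.access_level,
        "access_procedure": ds.access_procedure,
        "licence": ds.licence,
    }


def data_location_for(ds: Dataset, approved: bool = False) -> Optional[str]:
    """Release the data location only for open data or approved requesters."""
    if ds.access_level == "open" or approved:
        return ds.data_location
    return None


if __name__ == "__main__":
    # Illustrative values only; none of these identifiers or paths are real.
    record = Dataset(
        identifier="doi:10.1234/example",
        title="Example cohort study",
        description="Interview transcripts from a hypothetical cohort.",
        access_level="controlled",
        access_procedure="Apply to the data-access committee.",
        licence="Custom data-use agreement",
        data_location="secure-store/example-cohort",
    )
    print(public_metadata(record))
    print(data_location_for(record))  # no location without approval
```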

    Controlled access and data-access committees

    The mechanism that delivers this is controlled access. Rather than downloading the data freely, a researcher applies for it, stating who they are and what they intend to do, and agreeing to conditions on use. The application is assessed, often by a data-access committee: a body charged with deciding whether a proposed use is legitimate, ethical, and consistent with the consent under which the data were collected. Approved access typically comes with safeguards: data-use agreements that bind the recipient, restrictions on re-identification and onward sharing, and, increasingly, a requirement to analyse the data within a secure environment rather than taking a copy away. These arrangements let valuable data be reused while keeping the people behind it protected and the original consent respected. The committee and the agreement are not bureaucratic obstacles for their own sake; they are the means by which trust is maintained between research and the people whose data make it possible.
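    The mechanical part of that assessment can be sketched as a pre-screening step. This is an illustrative simplification, not how any particular committee works: the class, the field names, and the consent categories are hypothetical, and judgements about legitimacy and ethics remain with the people on the committee. The sketch only lists the conditions an application has not yet met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    applicant: str
    affiliation: str
    purpose: str                   # assumed to use the same labels as the consent
    agreed_to_data_use_agreement: bool
    will_use_secure_environment: bool


def unmet_conditions(request: AccessRequest,
                     consented_purposes: set,
                     secure_environment_required: bool) -> list:
    """Return the reasons a request cannot yet go forward (empty if none)."""
    problems = []
    if not request.applicant.strip() or not request.affiliation.strip():
        problems.append("applicant is not identified")
    if request.purpose not in consented_purposes:
        problems.append("purpose is outside the original consent")
    if not request.agreed_to_data_use_agreement:
        problems.append("data-use agreement has not been accepted")
    if secure_environment_required and not request.will_use_secure_environment:
        problems.append("analysis must take place in the secure environment")
    return problems


if __name__ == "__main__":
    # Hypothetical request and consent categories, for illustration only.
    request = AccessRequest(
        applicant="A. Researcher",
        affiliation="Example University",
        purpose="commercial product development",
        agreed_to_data_use_agreement=True,
        will_use_secure_environment=False,
    )
    for problem in unmet_conditions(request, {"health research"}, True):
        print(problem)
```

    An empty list here means only that the application is complete enough to be considered, not that access should be granted.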

    Synthetic data as a bridge

    One increasingly important technique deserves attention: synthetic data. Synthetic data is artificially generated to resemble a real dataset’s structure and statistical properties without containing any real individual’s information. Because it contains no real records, it can often be shared far more openly than the sensitive data it mirrors. Its value is practical: researchers can develop and test their analysis code against synthetic data, others can understand a dataset’s shape before applying for the real thing, and methods can be demonstrated without exposing anyone. Synthetic data is not a perfect substitute — conclusions must ultimately be drawn from real data, and a poorly generated synthetic set can mislead — but as a bridge between the need to share and the duty to protect, it is a genuinely useful addition to the toolkit.
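    To make the idea concrete, here is a deliberately naive sketch that uses only the Python standard library. It assumes a single numeric column that is roughly normally distributed, and the function name and example values are invented for illustration. It draws new values from a normal distribution fitted to the real column's mean and standard deviation, so it preserves those two statistics only: it does not preserve correlations between columns and carries no formal privacy guarantee. Real synthetic-data generators are considerably more sophisticated.

```python
import random
import statistics


def synthesise_column(real_values, n, seed=0):
    """Draw n synthetic values from a normal fitted to the real column.

    Only the mean and standard deviation of the real data are used;
    no individual real value is copied into the output.
    """
    rng = random.Random(seed)                # seeded for reproducibility
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)    # sample standard deviation
    return [rng.gauss(mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    # Invented example values standing in for a sensitive numeric column.
    real_ages = [52, 61, 47, 70, 58, 66, 49, 73]
    synthetic_ages = synthesise_column(real_ages, n=5000)
    print(round(statistics.mean(real_ages), 1),
          round(statistics.mean(synthetic_ages), 1))
```

    A stand-in like this is enough to exercise analysis code end to end, which is the practical use described above, but no conclusion should be drawn from the synthetic values themselves.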

    The role of secure infrastructure

    Making controlled access work at scale depends on the infrastructure that supports it: trusted repositories that hold sensitive data securely, secure analysis environments where data can be worked on without being copied out, and the identifier and metadata systems that let restricted data be described openly and cited when used. This is the territory of the data-infrastructure domain, and it is what turns the principle of controlled access from an aspiration into a practical reality. Without secure places to hold the data and clear ways to describe it, the careful balance of access and protection cannot be maintained.

    A consistent vocabulary for access and protection

    For all of this to function across institutions, funders and repositories, the terms involved must mean the same thing everywhere. Access conditions, consent categories, licence terms and protection requirements have to be described consistently, or a dataset marked as controlled-access in one system will be misunderstood in another, with real consequences when the data are sensitive. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that the metadata describing how sensitive data may be accessed and reused is understood identically wherever it appears. And because curating and stewarding controlled-access data is a genuine, recognisable contribution, that work can be described using the same framework as any other: the CRediT taxonomy and its full set of contribution roles. Sensitive data is not a problem to be hidden but a resource to be governed; done well, governance is what lets research honour both openness and the people it serves.