Tag: trusted digital repository

  • Trusted repositories and the EOSC: where research data should live

    Open and FAIR data has to live somewhere, and the choice of where is not a clerical detail. A dataset deposited on a personal web page, a lab server, or a service that may not exist in five years is, for the purposes of long-term reuse, lost. The question of where research data should live is the question of trusted repositories, and the European answer to coordinating them is the EOSC. This article maps the landscape, drawing on the data-infrastructure domain.

    What makes a repository trustworthy

    Not every place that can store a file is fit to be the home of the scholarly record. A trusted digital repository is one assessed against a recognised trust framework, demonstrating that it has the organisational and technical capability to preserve and provide access to data over the long term. Trust here is not a vibe; it is a set of demonstrable properties — a sustainability plan, preservation procedures, persistent identifiers, clear access conditions, and the organisational continuity to outlast any individual project or grant.

    The most widely recognised certification of these properties is CoreTrustSeal, a community-governed assessment that a repository meets the core requirements of trustworthy data stewardship. A CoreTrustSeal certification is a concrete signal a funder or researcher can rely on: it means an independent process has checked that the repository can actually do what “long-term preservation” implies. When a funder mandate says data must go to a trusted repository, CoreTrustSeal certification is the most common way the word “trusted” is given operational meaning.

    The repository taxonomy: generalist and domain

    Trusted repositories come in two broad kinds, and choosing well between them is one of the most consequential data-management decisions a researcher makes.

    • A generalist repository accepts data from any discipline. Zenodo, Figshare, and Dryad are the familiar examples: they mint a DOI, accept almost any data type, and provide a reliable, citable home when no specialist option exists. They are the right default for the long tail of research data that has no natural disciplinary home.
    • A domain repository is discipline-specific, built around the data types, standards, and community of a particular field. GenBank for nucleotide sequence data is the archetype; there are equivalents across crystallography, astronomy, social science, proteomics, and more. A domain repository adds what a generalist cannot: discipline-specific metadata standards, validation, and a community of expert users who will actually find and reuse the data.

    The practical rule that funders increasingly articulate is: deposit in the appropriate domain repository where one exists, and fall back to a trusted generalist repository where it does not. A sequence belongs in GenBank, not in a generic store; a one-off dataset with no community home belongs in a generalist repository with a DOI rather than on a server that will be decommissioned.
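
    The rule above can be sketched as a small lookup with a fallback. This is only an illustration under stated assumptions: the function name, the `Recommendation` type, and the one-entry lookup table are hypothetical, and the table holds only the pairing named in the text (a real one would be maintained by a funder, institution, or registry).

```python
from dataclasses import dataclass

# Hypothetical lookup table holding only the pairing named in the text.
# A real table would be far larger and maintained elsewhere.
DOMAIN_REPOSITORIES = {
    "nucleotide sequence": "GenBank",
}

# Generalist repositories named in the text.
GENERALIST_REPOSITORIES = ("Zenodo", "Figshare", "Dryad")


@dataclass
class Recommendation:
    repository: str
    kind: str    # "domain" or "generalist"
    reason: str


def recommend_repository(data_type: str,
                         preferred_generalist: str = "Zenodo") -> Recommendation:
    """Domain repository where one is listed, otherwise a generalist fallback."""
    if preferred_generalist not in GENERALIST_REPOSITORIES:
        raise ValueError(f"unknown generalist repository: {preferred_generalist}")
    key = data_type.strip().lower()
    domain = DOMAIN_REPOSITORIES.get(key)
    if domain is not None:
        return Recommendation(domain, "domain",
                              f"a domain repository is listed for {key}")
    return Recommendation(preferred_generalist, "generalist",
                          f"no domain repository is listed for {key}")


if __name__ == "__main__":
    print(recommend_repository("nucleotide sequence"))
    print(recommend_repository("one-off survey dataset"))
```

    The sketch deliberately has no branch that returns a personal or project server: under the rule as stated, that is never a valid long-term answer.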

    The EOSC: coordinating the federation

    Individual trusted repositories are necessary but not sufficient. A researcher also needs to find the right one, move data and compute between services, and trust that the pieces interoperate. In Europe, the coordinating layer for this is the European Open Science Cloud (EOSC) — a federation of research-data services rather than a single monolithic platform.

    The EOSC’s model is federation: an EOSC node is a service provider connected to the federation, and an EOSC service is something offered through its catalogue — a repository, a compute resource, a data-management tool. The aspiration is that a researcher can discover trusted repositories, deposit data, and compose data with compute across institutional and national boundaries, through a coordinated catalogue rather than a patchwork of disconnected services. The EOSC is, in effect, the European attempt to make “where should this data live?” answerable through one front door onto many trustworthy providers. It is not the only such effort — the African Open Science Platform pursues a comparable continental federation — but it is the most developed.

    The human layer: stewards and custodians

    Infrastructure does not curate itself, and an honest account of where data should live has to name the people. A data steward is the professional responsible for data quality, governance, and ongoing curation: the role that makes the difference between data that is merely deposited and data that is genuinely reusable. A data custodian holds legal or operational responsibility for the data. Around them sit the structured agreements and controls that govern sharing: a data sharing agreement setting the conditions under which data moves between parties, an embargo period deferring public access after deposit, and access controls distinguishing open, restricted, and metadata-only data.
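
    As a rough illustration of how an embargo period and the three access levels interact, the sketch below models a deposit and asks whether its data files are publicly downloadable on a given day. The class and function names are hypothetical, and treating the embargo end date as the first day of public access is an assumption; real repositories define their own semantics.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    OPEN = "open"
    RESTRICTED = "restricted"
    METADATA_ONLY = "metadata-only"


@dataclass
class Deposit:
    access: AccessLevel
    # Public access is deferred until this date; None means no embargo.
    embargo_until: Optional[date] = None


def data_publicly_downloadable(deposit: Deposit, today: date) -> bool:
    """True only for open data whose embargo, if any, has ended.

    Assumption: the embargo end date itself counts as the first open day.
    """
    if deposit.access is not AccessLevel.OPEN:
        return False
    return deposit.embargo_until is None or today >= deposit.embargo_until


if __name__ == "__main__":
    embargoed = Deposit(AccessLevel.OPEN, embargo_until=date(2030, 1, 1))
    print(data_publicly_downloadable(embargoed, date(2029, 12, 31)))
    print(data_publicly_downloadable(embargoed, date(2030, 1, 1)))
```

    Restricted and metadata-only deposits return False regardless of date, because the decision to release them rests with a person or committee rather than a calendar.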

    A trusted repository with no data steward behind the data is a safe building with empty rooms. Preservation is an organisational commitment carried out by people, not a property that storage acquires on its own.

    Why this connects to FAIR and to identifiers

    Where data lives is what makes the FAIR principles operational. Findability depends on the repository minting a persistent identifier and exposing good metadata; accessibility depends on stable resolution and clear access conditions; interoperability and reusability depend on the standards a domain repository enforces. A trusted repository is, in practice, the machine that turns the FAIR aspiration into a deposited reality, which is why the choice of repository and the trust signal of CoreTrustSeal matter as much as the decision to share at all. The repository is also where the data’s persistent identifier enters the broader graph that links it to the project, the people, and the funding.

    Where shared vocabulary fits

    The terms in this domain are used loosely in funder mandates and policies: “trusted”, “appropriate”, and “long-term” all mean different things to different bodies, and “generalist” versus “domain” is often left implicit. A shared, federated vocabulary that defines these precisely, pointing to CoreTrustSeal for the trust framework and to the EOSC for the federation model, is what lets a data-sharing requirement be stated unambiguously and checked. Supplying that definitional layer is the role the CASRAI dictionary is designed to play.

    What to do now

    For researchers: deposit in the appropriate domain repository where one exists, otherwise a CoreTrustSeal-certified generalist repository, and never a personal or project server for the long term. For institutions: invest in data stewards, not just storage. For funders and standards work: give “trusted repository” operational meaning through certification and shared vocabulary, and support the federations that make trustworthy services findable.


  • Data availability statements: what to write and where to deposit

    Most journals now ask for a data availability statement, and most authors now write one. Far fewer write one that does what it is meant to do. The phrase “data are available from the authors on reasonable request” has become the default, yet study after study has found that requests against such statements frequently go unanswered — which means the statement records an intention rather than a reality. This guide covers what to write, where to put the data, and how to make a statement that is true. It builds on the foundations in the data-infrastructure domain and connects to the practices described in the reproducibility domain.

    What a data availability statement is for

    A data availability statement (sometimes a data accessibility statement) tells a reader where the data underlying a publication can be found, under what conditions, and — where access is restricted — why. Its purpose is to make the evidential basis of the work locatable and, where ethically possible, reusable. It is the public-facing expression of the principle that a published claim should be checkable against the data behind it. A good statement is specific: it names a repository, gives an identifier, and states the access conditions plainly.

    Make the data FAIR first, then describe it

    The statement is downstream of a deposit decision, so the deposit is where the real work happens. The widely adopted reference point is the FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable. FAIR is frequently misread as “open”, and the distinction matters: FAIR does not require data to be public. It requires that data be findable (with a persistent identifier and rich metadata), accessible (retrievable by a clear, possibly authenticated, protocol), interoperable (using shared formats and vocabularies), and reusable (with a clear licence and provenance). Sensitive data can be FAIR while remaining access-controlled — the metadata is open and findable even where the data themselves are not.

    Practically, making data FAIR before you write the statement means:

    • Deposit in a repository that mints a persistent identifier — typically a DataCite DOI — so the data are citable and resolvable independently of the article.
    • Describe the data with structured metadata, not just a filename, so they can be found and understood by someone who did not produce them.
    • Attach an explicit licence (for example a Creative Commons licence for open data) so reuse conditions are unambiguous.
    • Use community formats and vocabularies where they exist, so the data interoperate with other datasets in the field.
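
    The four steps above can be turned into a simple pre-submission check. The sketch below is hypothetical throughout: the `DepositRecord` fields, the function name, and the minimum metadata fields are assumptions made for illustration, not any repository's schema, and the DOI in the usage example is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed minimum metadata fields; a real repository defines its own schema.
REQUIRED_METADATA = ("title", "creators", "description")


@dataclass
class DepositRecord:
    identifier: Optional[str] = None      # e.g. a DOI minted by the repository
    metadata: Dict[str, str] = field(default_factory=dict)
    licence: Optional[str] = None         # e.g. "CC BY 4.0"
    uses_community_format: bool = False


def missing_fair_steps(record: DepositRecord) -> List[str]:
    """Return the checklist items that are not yet satisfied, in list order."""
    problems: List[str] = []
    if not record.identifier:
        problems.append("no persistent identifier")
    absent = [name for name in REQUIRED_METADATA if not record.metadata.get(name)]
    if absent:
        problems.append("metadata missing: " + ", ".join(absent))
    if not record.licence:
        problems.append("no explicit licence")
    if not record.uses_community_format:
        problems.append("no community format or vocabulary declared")
    return problems


if __name__ == "__main__":
    draft = DepositRecord(identifier="10.1234/example",  # placeholder DOI
                          metadata={"title": "Survey responses"})
    for problem in missing_fair_steps(draft):
        print(problem)
```

    An empty result means only that the four steps have been attempted; it does not certify that the data are FAIR, which still depends on the quality of the metadata and the suitability of the formats chosen.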

    Choosing where to deposit: domain first, generalist as fallback

    Where to put the data is the decision that most shapes their long-term value. The general rule is to prefer a domain repository where a recognised one exists for your data type, and to use a generalist repository otherwise.

    Domain repositories

    A domain (or discipline-specific) repository is built around a particular kind of data and enforces the community’s metadata standards: GenBank for nucleotide sequences and the PDB for protein structures are two examples among many. Depositing here means your data sit alongside comparable datasets, are described to a standard your field already reads, and are discoverable by the people most likely to reuse them. Where your field expects deposit in a specific repository, treat that expectation as effectively mandatory and make that repository your first choice.

    Generalist repositories

    Where no suitable domain repository exists, a generalist repository — Zenodo, Figshare, Dryad and others — accepts data of any type, mints a DOI, and supports structured metadata and licensing. Generalists are the right home for the long tail of data that no specialised archive covers.

    A note on trust

    Whichever route you take, prefer a trusted digital repository (one assessed against a recognised standard such as CoreTrustSeal) over ad-hoc hosting. A repository’s job is long-term preservation and stable resolution; a personal website or a generic file-sharing link offers neither, and a link that has rotted makes a data availability statement worse than useless. Institutional and supplementary-file hosting can be acceptable, but only where the host makes an explicit commitment to persistence; that commitment, not the type of host, is what matters.

    Writing the statement

    A strong statement names the repository, gives the identifier, and states the conditions. Some patterns:

    • Open deposit: “The data supporting this study are openly available in [repository] at [DOI], under a [licence].”
    • Controlled access: “The data are available from [repository / controlled-access archive] subject to [conditions, e.g. a data access committee], because they contain [reason, e.g. identifiable personal data]. Metadata are openly available at [DOI].”
    • Genuinely no new data: “No new data were generated; the study analysed [named existing datasets] available at [identifiers].”

    Avoid the bare “available on request” formulation wherever the data could instead be deposited. Where access genuinely must be restricted — for participant confidentiality, commercial sensitivity, or Indigenous data governance — say so, give the reason, name who controls access, and still publish open metadata so the dataset is findable. An honest restricted-access statement is far stronger than a vague promise of availability.
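
    The three patterns above can be filled in mechanically once the deposit exists. The sketch below is one hypothetical way to do that: the function names are invented for illustration, the DOI in the usage example is a placeholder, and the only rule enforced is the one this guide states, namely that a statement must not leave the repository, identifier, or conditions blank.

```python
def _require(**fields: str) -> None:
    """Refuse to build a statement if any field is blank."""
    blank = [name for name, value in fields.items() if not value.strip()]
    if blank:
        raise ValueError("missing: " + ", ".join(blank))


def open_statement(repository: str, doi: str, licence: str) -> str:
    """Open deposit pattern."""
    _require(repository=repository, doi=doi, licence=licence)
    return (f"The data supporting this study are openly available in "
            f"{repository} at {doi}, under a {licence}.")


def controlled_statement(archive: str, conditions: str,
                         reason: str, metadata_doi: str) -> str:
    """Controlled-access pattern, with open metadata."""
    _require(archive=archive, conditions=conditions,
             reason=reason, metadata_doi=metadata_doi)
    return (f"The data are available from {archive} subject to {conditions}, "
            f"because they contain {reason}. "
            f"Metadata are openly available at {metadata_doi}.")


def no_new_data_statement(datasets: str, identifiers: str) -> str:
    """Pattern for studies that analysed existing datasets only."""
    _require(datasets=datasets, identifiers=identifiers)
    return (f"No new data were generated; the study analysed {datasets} "
            f"available at {identifiers}.")


if __name__ == "__main__":
    # Placeholder DOI, used only to show the shape of the output.
    print(open_statement("Zenodo", "https://doi.org/10.1234/example",
                         "CC BY 4.0 licence"))
```

    Raising an error on a blank field is a design choice: a template that silently emits “available at .” would reproduce exactly the vague statement this guide argues against.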

    Where shared vocabulary fits

    Terms like “available on request”, “restricted access”, “trusted repository”, and even “FAIR” are used inconsistently across journals and funders, which weakens the policies that depend on them. A shared, federated vocabulary that defines these precisely — pointing back to the FAIR principles and to certification schemes such as CoreTrustSeal — is what lets a statement written for one venue be understood by another. Supplying that definitional layer is the role the CASRAI dictionary is designed to play; the relevant terms sit in the data-infrastructure domain.
