Tag: fair dataset

  • Research Data Repository: Generalist vs Subject

    Choose a discipline-specific repository whenever one exists for your data type, and fall back to a generalist repository such as Zenodo, Figshare or Dryad only when no subject-specific option is available. A research data repository is a system that assigns persistent identifiers, retains data over the long term, and exposes machine-readable metadata so datasets can be found, cited and reused. The right choice depends on discoverability within your field, what your funder actually mandates, and who is committed to curating the data after the grant ends.

    What is a research data repository?

    A research data repository is a curated system for depositing, preserving and exposing datasets independently of the article they support. Unlike a general-purpose cloud drive, a qualifying repository issues a persistent identifier (typically a DOI), maintains fixity checks and version history, and publishes structured metadata that search engines and indexing services can crawl.
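
    As a rough illustration of what "structured metadata" means in practice, the sketch below checks a record against the six properties that the DataCite Metadata Schema marks as mandatory. The dictionary layout, the simplified key names, the function name `missing_mandatory_fields` and the sample DOI are all assumptions made for this example, not any repository's actual API.

```python
# Simplified stand-ins for the six properties the DataCite Metadata Schema
# marks as mandatory. Key names are illustrative, not the exact JSON keys.
MANDATORY_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_mandatory_fields(record: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in `record`."""
    return [field for field in MANDATORY_FIELDS if not record.get(field)]


# Hypothetical record; 10.1234 is a placeholder DOI prefix, not a real dataset.
example_record = {
    "identifier": "10.1234/example.5678",
    "creators": ["Doe, Jane"],
    "title": "Survey responses, wave 1",
    "publisher": "Example Repository",
    "publication_year": 2024,
    # "resource_type" is deliberately left out.
}

print(missing_mandatory_fields(example_record))  # ['resource_type']
```

    A cloud-drive folder would fail every one of these checks, which is the practical difference the paragraph above describes.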

    Two broad categories exist. Generalist repositories — Zenodo, Figshare, Dryad, the Open Science Framework, Harvard Dataverse — accept any discipline and any file type. Discipline-specific repositories — the Protein Data Bank, OpenNeuro, ICPSR, the UK Data Service’s ReShare — are built around domain metadata schemas, controlled vocabularies and, often, expert curators who understand the data.

    Generalist vs discipline-specific: what’s actually different?

    The two repository types are not interchangeable, even though both can technically hold the same file. They differ in who finds the data, how deeply it is described, and how funders treat the deposit for compliance purposes.

    Factor | Generalist repository | Discipline-specific repository
    Discoverability | Indexed broadly; weaker within a subject community | High within the field via domain search portals and cross-references
    Metadata depth | Generic (title, creator, subject, DOI) | Domain-specific schemas (e.g. genomic, crystallographic, survey metadata)
    Curation | Largely automated; minimal review | Often expert-reviewed before publication
    Funder acceptance | Accepted as a fallback by nearly all funders and journals | Frequently the stated first preference where one exists
    Typical cost to depositor | Free (Zenodo, OSF) or freemium (Figshare) | Varies: free (ICPSR, OpenNeuro) to fee-charging (some subject archives)
    Best for | Interdisciplinary, mixed-format, or “no domain home” datasets | Data types the community already expects to find in one place

    The registries FAIRsharing and re3data.org (the latter operated as a DataCite service) list several thousand repositories across disciplines and are the standard starting point for checking whether a subject-specific option exists before defaulting to a generalist platform.

    Does your funder require a specific repository type?

    Funder and journal policy is usually the deciding factor, not personal preference. Most major funders now state an explicit hierarchy: use a recognised discipline repository first, and use a generalist repository — provided it is FAIR-aligned — only where none exists.

    Funder / body | Repository requirement
    Horizon Europe | Model Grant Agreement Article 17 requires deposit in a research data repository, following the principle “as open as possible, as closed as necessary”
    UKRI | Open access policy (in force since 1 April 2022) requires data underpinning a publication to be findable, accessible, interoperable and reusable, with access details stated in a data access statement
    NIH | Data Management and Sharing Policy, effective 25 January 2023, requires a data management and sharing plan and states a preference for an established public repository appropriate to the data type
    ICMJE journals | Data sharing statement required in manuscripts reporting clinical trials (submissions from 1 July 2018); trials that began enrolment on or after 1 January 2019 must also include a data sharing plan in the trial registration

    Where a policy is silent on repository type, DataCite’s Repository Finder tool cross-references FAIRsharing and re3data metadata to surface certified, FAIR-aligned repositories for a given data type — a step that is worth doing before defaulting to whichever repository a colleague used last time.

    Which option wins on long-term curation and sustainability?

    This is the trade-off least discussed in generic repository guidance, and it matters more than discoverability once a dataset is more than a few years old. Discipline-specific repositories often provide deeper curation at deposit time, but many depend on renewable grant funding, which creates a real risk of the archive itself losing support, freezing new deposits, or migrating without notice.

    Generalist repositories carry a different risk profile. Zenodo is operated by CERN with backing from OpenAIRE and the European Commission; Figshare is commercially operated by Digital Science; the Open Science Framework is run by the non-profit Center for Open Science. None of these guarantees permanence, but their institutional backing is typically more diversified than a single-grant-funded domain archive.

    • Ask whether the discipline repository has a named institutional or consortium backer, not just a project grant.
    • Check whether the repository is a CoreTrustSeal-certified trustworthy digital repository — certification signals an audited preservation commitment.
    • If the domain archive’s funding horizon is unclear, consider a dual deposit: the primary copy in the discipline repository for discoverability, and a mirrored copy with its own DOI in a generalist repository as a preservation backstop, with each record linking to the other.
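
    One way to keep a dual deposit coherent is to have each record point at the other. The sketch below builds reciprocal related-identifier entries using the `IsIdenticalTo` relation type from the DataCite Metadata Schema. The function name `link_dual_deposit`, the shape of the returned dictionary and both DOIs are hypothetical; how a given repository accepts such links will vary.

```python
def link_dual_deposit(primary_doi: str, mirror_doi: str) -> dict[str, dict]:
    """Build reciprocal related-identifier entries for two copies of one dataset."""

    def related(doi: str) -> dict:
        return {
            "relatedIdentifier": doi,
            "relatedIdentifierType": "DOI",
            "relationType": "IsIdenticalTo",
        }

    # Each record lists the *other* copy so users of either can find both.
    return {
        primary_doi: related(mirror_doi),
        mirror_doi: related(primary_doi),
    }


# Placeholder DOIs for a discipline copy and its generalist mirror.
links = link_dual_deposit("10.1234/domain.001", "10.5678/generalist.001")
print(links["10.1234/domain.001"]["relatedIdentifier"])  # 10.5678/generalist.001
```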

    How do you actually decide? A five-step framework

    Use this sequence rather than defaulting to whichever repository is fastest to sign up for:

    1. Check the funder mandate first. If your grant agreement or journal’s data sharing policy names a required or preferred repository type, that overrides personal choice.
    2. Search FAIRsharing and re3data for a certified discipline-specific option matching your data type, format and jurisdiction.
    3. Assess curation depth needed. Complex, reusable data (genomic sequences, clinical trial data, crystal structures) benefits from expert domain curation; simple supplementary files often do not need it.
    4. Weigh sustainability. Prefer CoreTrustSeal-certified or institutionally-backed repositories over unaffiliated project archives, especially for data with a multi-decade reuse horizon.
    5. Default to a generalist repository only when no suitable, FAIR-aligned discipline repository exists — and record the choice and rationale in your data management plan.
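
    The sequence above can be sketched as a small decision function. This is one possible reading, not a definitive rule set: the `Dataset` fields, the function name `choose_repository`, and the reduction of steps 3 and 4 to yes/no flags are simplifications invented for this example, and the default generalist name is only a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    mandated_repository: Optional[str] = None    # step 1: named by funder or journal
    discipline_repository: Optional[str] = None  # step 2: found via the registries
    needs_expert_curation: bool = False          # step 3
    discipline_repo_sustainable: bool = False    # step 4: certified or institutionally backed


def choose_repository(ds: Dataset, generalist: str = "Zenodo") -> tuple[str, str]:
    """Return (repository, rationale), applying the five steps in order."""
    if ds.mandated_repository:
        return ds.mandated_repository, "required or preferred by funder or journal policy"
    if ds.discipline_repository is None:
        return generalist, "no suitable discipline repository found in the registries"
    if ds.discipline_repo_sustainable:
        return ds.discipline_repository, "certified or institutionally backed discipline repository"
    if ds.needs_expert_curation:
        return (
            ds.discipline_repository,
            "expert curation needed; consider a mirrored copy in a generalist repository",
        )
    return generalist, "discipline repository sustainability unclear and deep curation not needed"


repo, why = choose_repository(
    Dataset(discipline_repository="OpenNeuro", discipline_repo_sustainable=True)
)
print(repo, "-", why)
```

    Returning the rationale alongside the repository name mirrors step 5: the reasoning is what goes into the data management plan.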

    Frequently asked questions

    What is a data repository in research?

    A data repository is a system or service where researchers deposit datasets to obtain a persistent identifier, structured metadata, and long-term hosting. It exists separately from a journal article so that data can be found, cited and reused independently of the publication it supports.

    What is an example of a data repository?

    Zenodo and Figshare are widely used generalist examples; the UK Data Service’s ReShare and the Protein Data Bank are widely used discipline-specific examples. Each assigns a DOI, retains version history, and exposes metadata for discovery by search engines and domain indexes.

    What is a research repository?

    “Research repository” is often used loosely to mean either a data repository (datasets) or an institutional repository (publications, theses). In a data management context, it specifically refers to a certified system for archiving and publishing the datasets underlying research outputs.

    What this means for your data management plan

    A data management plan should name the intended repository before data collection begins, not after submission. Reviewers at UKRI, NIH and Horizon Europe increasingly check whether the named repository matches the funder’s stated hierarchy — generalist repositories named without justification, when a recognised discipline archive exists, are a common cause of DMP revision requests.

    The practical position for most research teams is not “generalist or discipline-specific” as a permanent allegiance, but a per-dataset decision applied consistently: check the mandate, search the registries, weigh curation against sustainability, and document the reasoning. That documented reasoning — more than the repository name itself — is what demonstrates genuine engagement with FAIR data principles to funders, reviewers and future re-users.

  • FAIR Dataset Mandates Risk Becoming a Checkbox

    A FAIR dataset is one that meets the Findable, Accessible, Interoperable and Reusable principles published in Scientific Data in 2016 — but a funder mandate requiring deposit and a data management plan does not, on its own, guarantee this. Genuine FAIR compliance demands rich metadata, persistent identifiers and community-standard formats that most minimally compliant deposits skip entirely, because current incentive structures reward the act of depositing, not the work of curating.

    A FAIR dataset is a digital research object — data or its metadata — that satisfies the Findable, Accessible, Interoperable and Reusable principles first formalised by the FORCE11 community and published in Scientific Data in March 2016. The principles were designed to be applied in degrees, not as a pass/fail gate, which is precisely where funder policy and researcher practice have diverged.

    What does a FAIR dataset actually require?

    The FAIR principles set out four categories of requirement, each broken into specific sub-criteria. They are deliberately conceptual rather than prescriptive, which is a strength for cross-disciplinary adoption and a weakness for enforcement.

    • Findable — data and metadata carry a globally unique, persistent identifier and are indexed in a searchable resource.
    • Accessible — retrieval uses a standardised, open protocol, with metadata remaining accessible even when the underlying data cannot be.
    • Interoperable — data and metadata use a shared, formal language and vocabularies that follow FAIR principles themselves.
    • Reusable — data carry a clear licence, detailed provenance, and conform to domain-relevant community standards.
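
    For the Findable criterion, a first sanity check is whether a deposit's identifier even looks like a DOI. The sketch below applies a commonly used heuristic pattern and builds the `https://doi.org/` resolver URL. The pattern is an approximation rather than a full DOI grammar, the function name `doi_to_url` is invented here, the sample DOI is a placeholder, and passing the check says nothing about whether the DOI actually resolves.

```python
import re

# Heuristic for the common modern DOI shape: "10.", a 4-9 digit registrant
# code, a slash, then a non-empty suffix. Not a complete DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org resolver URL for a syntactically plausible DOI."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    return f"https://doi.org/{doi}"


print(doi_to_url("10.1234/example.5678"))  # https://doi.org/10.1234/example.5678
```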

    The Research Data Alliance’s FAIR Data Maturity Model, published in 2020, decomposes these four principles into 41 discrete indicators covering both data and metadata. That granularity matters: a dataset can satisfy some indicators and fail most others while still being described, loosely, as “FAIR.” A funder checking only for repository deposit is verifying perhaps one or two of the 41.
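
    To make that proportion concrete, the sketch below computes the share of indicators a deposit satisfies. Only the total of 41 comes from the text; the function name and the example count are illustrative, and every indicator is weighted equally here, which is a simplification because the RDA model assigns different priorities to its indicators.

```python
TOTAL_INDICATORS = 41  # indicator count in the RDA FAIR Data Maturity Model


def indicator_coverage(satisfied: int, total: int = TOTAL_INDICATORS) -> float:
    """Fraction of indicators satisfied, treating every indicator equally."""
    if not 0 <= satisfied <= total:
        raise ValueError("satisfied must be between 0 and total")
    return satisfied / total


# A deposit-only check that verifies two indicators.
print(f"{indicator_coverage(2):.1%}")  # 4.9%
```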

    Why do funder mandates default to minimal compliance?

    Funder FAIR requirements typically operationalise as two things: a submitted data management plan and a deposit in a recognised repository at the end of the project. Neither step audits metadata richness, vocabulary use, or licensing clarity. The result is a policy that can be satisfied without producing a dataset anyone outside the original team could actually reuse.

    Three structural gaps explain why:

    • Resourcing. Science Europe’s funders’ briefing on data management planning recommends budgeting roughly 5% of the total research budget for compliant curation, a figure rarely built into grant awards, which leaves curation as unfunded overhead.
    • Recognition. Data curation is not weighted in hiring, promotion or tenure decisions in most institutions, so time spent enriching metadata competes directly with time spent on publications that do count.
    • Standards gaps. Many disciplines still lack the domain-relevant community vocabularies that Interoperability and Reusability depend on, so even willing depositors have nothing FAIR-compliant to conform to.

    Horizon Europe requires that all data produced under the programme be FAIR “by default,” which is the strongest funder-level statement of intent currently in force. Yet the European Commission’s own guidance materials acknowledge that FAIRness is a spectrum, not a binary condition — an admission that sits uneasily alongside a compliance model built around a single deposit checkpoint.

    The maturity gap: from “FAIR start” to genuine reusability

    The European Commission’s Joint Research Centre published FAIR Data Guidelines in 2025 that organise the RDA’s 41 indicators into five progressive maturity levels. The framework is useful precisely because it makes visible how far “minimally compliant” sits from “genuinely reusable.”

    Maturity level | What it requires
    FAIR start | Published in a catalogue with mandatory metadata; data itself is not structured for machine reuse.
    FAIR play | Links added between datasets and related resources, with enriched provenance and cross-referencing.
    FAIR go | Data structured to community standards, with defined terminologies (not necessarily machine-readable).
    FAIR share | Machine-readable data models (JSON Schema, XML Schema, SHACL) with richly documented provenance.
    FAIRest of them all | Machine-readable model endorsed by the domain community; terms exposed via shared FAIR vocabularies.

    Most mandate-driven deposits land at “FAIR start” — indexed, licensed, discoverable, but not structured for genuine machine or cross-team reuse. The JRC guidelines are explicit that not every dataset needs the top tier, but they are equally explicit that FAIRness can degrade over time if metadata and platforms are not actively maintained. A one-off deposit satisfying a funder’s closeout requirement is not maintenance; it is a snapshot.
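
    The five levels can be read as a cumulative checklist. The sketch below encodes them as an ordered enumeration and returns the highest level whose requirement, and every requirement below it, is met. Treating the ladder as strictly cumulative is an assumption, and the one-feature-per-level names are a simplification invented for this example rather than taken from the JRC guidelines.

```python
from enum import IntEnum


class Maturity(IntEnum):
    NONE = 0
    FAIR_START = 1
    FAIR_PLAY = 2
    FAIR_GO = 3
    FAIR_SHARE = 4
    FAIREST = 5


# One illustrative yes/no feature per level, in ascending order.
REQUIREMENTS = [
    (Maturity.FAIR_START, "catalogued_with_mandatory_metadata"),
    (Maturity.FAIR_PLAY, "linked_to_related_resources"),
    (Maturity.FAIR_GO, "structured_to_community_standards"),
    (Maturity.FAIR_SHARE, "machine_readable_data_model"),
    (Maturity.FAIREST, "community_endorsed_model_and_vocabularies"),
]


def maturity_level(features: set[str]) -> Maturity:
    """Highest level reached without skipping any lower requirement."""
    level = Maturity.NONE
    for candidate, feature in REQUIREMENTS:
        if feature not in features:
            break  # a missing lower rung caps the level
        level = candidate
    return level


# Catalogued and schema-backed, but never linked to related resources.
deposit = {"catalogued_with_mandatory_metadata", "machine_readable_data_model"}
print(maturity_level(deposit).name)  # FAIR_START
```

    In this reading, a machine-readable schema does not lift a deposit above "FAIR start" if the intermediate linking and standards work was skipped.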

    Rebuilding incentives for genuine data stewardship

    Treating FAIR as a compliance checkbox is a governance failure, not a researcher failure. Three changes would shift the incentive structure toward genuine stewardship rather than deposit-and-forget behaviour.

    1. Credit the labour. CASRAI originated the CRediT contributor role taxonomy in 2014, and the standard is now stewarded by NISO as ANSI/NISO Z39.104-2022. “Data curation” is one of its fourteen roles, offering institutions an existing, citable mechanism to formally recognise stewardship work in author contribution statements — a mechanism that remains inconsistently applied in promotion and tenure review.
    2. Fund it explicitly. Grant budgets should ring-fence curation costs at the level Science Europe’s own guidance recommends, rather than treating the data management plan as an unfunded compliance document.
    3. Audit maturity, not deposit. Funders and institutions should reference maturity models such as the RDA’s 41 indicators or the JRC’s five-level scale in closeout review, rather than accepting repository deposit as sufficient evidence of FAIR compliance.

    FAIR is also not a complete governance answer on its own. The CARE Principles for Indigenous Data Governance, released by the Global Indigenous Data Alliance in 2019, extend the framework to cover collective benefit, authority to control, responsibility and ethics — dimensions that a pure findability-and-format checklist does not touch. Institutions building data policy around FAIR alone are optimising for machine reuse while leaving governance and consent questions unaddressed.

    Frequently asked questions

    What is a FAIR dataset?

    A FAIR dataset satisfies the Findable, Accessible, Interoperable and Reusable principles published in Scientific Data in 2016. It carries a persistent identifier, standardised access, shared vocabularies, and clear licensing and provenance — not merely a repository listing.

    What does FAIR stand for with data?

    FAIR stands for Findable, Accessible, Interoperable and Reusable. The acronym describes a framework for data stewardship, not a certification; the Research Data Alliance breaks it into 41 measurable indicators rather than a single pass condition.

    What does FAIR stand for in data management?

    In data management, FAIR describes the target state a data management plan should work toward: identifiers, rich metadata, open protocols and community-standard formats. It guides curation decisions throughout a project, not just the final deposit.

    Why does FAIR data matter?

    FAIR data matters because it lets both humans and machines discover, verify and reuse research outputs without contacting the original authors. Poorly curated “FAIR” deposits undermine reproducibility and waste the public investment funders intended the mandate to protect.

    Implications and outlook

    Funder FAIR mandates have succeeded in one respect: deposit rates have risen sharply since 2016. They have not, on current evidence, produced datasets that are reliably machine-actionable or cross-team reusable at scale. That gap will not close through stricter wording in policy documents; it requires funders to resource curation at realistic cost, institutions to credit it in career progression via mechanisms such as CRediT’s Data curation role, and disciplines to finish building the community standards that Interoperability depends on. Until those three conditions are met, “FAIR by default” will remain a policy aspiration rather than a description of the average deposited dataset.