Tag: research data repository

  • Research Data Repository: Generalist vs Subject

    Choose a discipline-specific repository whenever one exists for your data type, and fall back to a generalist repository such as Zenodo, Figshare or Dryad only when no subject-specific option is available. A research data repository is a system that assigns persistent identifiers, retains data over the long term, and exposes machine-readable metadata so datasets can be found, cited and reused. The right choice depends on discoverability within your field, what your funder actually mandates, and who is committed to curating the data after the grant ends.

    What is a research data repository?

    A research data repository is a curated system for depositing, preserving and exposing datasets independently of the article they support. Unlike a general-purpose cloud drive, a qualifying repository issues a persistent identifier (typically a DOI), retains fixity and version history, and publishes structured metadata that search engines and indexing services can crawl.

    Two broad categories exist. Generalist repositories — Zenodo, Figshare, Dryad, the Open Science Framework, Harvard Dataverse — accept any discipline and any file type. Discipline-specific repositories — the Protein Data Bank, OpenNeuro, ICPSR, the UK Data Service’s ReShare — are built around domain metadata schemas, controlled vocabularies and, often, expert curators who understand the data.

    Generalist vs discipline-specific: what’s actually different?

    The two repository types are not interchangeable, even though both can technically hold the same file. They differ in who finds the data, how deeply it is described, and how funders treat the deposit for compliance purposes.

    Factor | Generalist repository | Discipline-specific repository
    Discoverability | Indexed broadly; weaker within a subject community | High within the field via domain search portals and cross-references
    Metadata depth | Generic (title, creator, subject, DOI) | Domain-specific schemas (e.g. genomic, crystallographic, survey metadata)
    Curation | Largely automated; minimal review | Often expert-reviewed before publication
    Funder acceptance | Accepted as a fallback by nearly all funders and journals | Frequently the stated first preference where one exists
    Typical cost to depositor | Free (Zenodo, OSF) or freemium (Figshare) | Varies: free (ICPSR, OpenNeuro) to fee-charging (some subject archives)
    Best for | Interdisciplinary, mixed-format, or “no domain home” datasets | Data types the community already expects to find in one place

    The registries FAIRsharing and re3data.org (the latter operated as a DataCite service) between them list several thousand repositories across disciplines. They are the standard starting point for checking whether a subject-specific option exists before defaulting to a generalist platform.

    Does your funder require a specific repository type?

    Funder and journal policy is usually the deciding factor, not personal preference. Most major funders now state an explicit hierarchy: use a recognised discipline repository first, and use a generalist repository — provided it is FAIR-aligned — only where none exists.

    Funder / body | Repository requirement
    Horizon Europe | Model Grant Agreement Article 17 requires deposit in a research data repository, following the principle “as open as possible, as closed as necessary”
    UKRI | Open access policy (in force since 1 April 2022) requires data underpinning a publication to be findable, accessible, interoperable and reusable, with access details stated in a data access statement
    NIH | Data Management and Sharing Policy, effective 25 January 2023, requires a data management and sharing plan and encourages use of an established repository appropriate to the data type
    ICMJE journals | Data sharing statement required in clinical trial manuscripts submitted on or after 1 July 2018; data sharing plan required in the registration of trials that began enrolment on or after 1 January 2019

    Where a policy is silent on repository type, DataCite’s Repository Finder tool cross-references FAIRsharing and re3data metadata to surface certified, FAIR-aligned repositories for a given data type — a step that is worth doing before defaulting to whichever repository a colleague used last time.

    Which option wins on long-term curation and sustainability?

    This is the trade-off least discussed in generic repository guidance, and it matters more than discoverability once a dataset is more than a few years old. Discipline-specific repositories often provide deeper curation at deposit time, but many depend on renewable grant funding, which creates a real risk of the archive itself losing support, freezing new deposits, or migrating without notice.

    Generalist repositories carry a different risk profile. Zenodo is operated by CERN with backing from OpenAIRE and the European Commission; Figshare is commercially operated by Digital Science; the Open Science Framework is run by the non-profit Center for Open Science. None of these guarantees permanence, but their institutional backing is typically more diversified than a single-grant-funded domain archive.

    • Ask whether the discipline repository has a named institutional or consortium backer, not just a project grant.
    • Check whether the repository is a CoreTrustSeal-certified trustworthy digital repository — certification signals an audited preservation commitment.
    • If the domain archive’s funding horizon is unclear, consider a dual deposit: the primary copy in the discipline repository for discoverability, and a mirrored copy with its own DOI in a generalist repository as a preservation backstop.
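    A dual deposit is easier to follow if the two records point at each other in their metadata. The sketch below builds a DataCite-style related-identifier entry for the mirror copy. The DOI is a placeholder, and the choice of the `IsIdenticalTo` relation is an assumption about how an institution might describe a byte-identical mirror, not a requirement of any repository named above.

```python
import json


def mirror_link(primary_doi):
    """Return a DataCite-style related-identifier entry pointing at the primary copy."""
    return {
        "relatedIdentifier": primary_doi,
        "relatedIdentifierType": "DOI",
        # Assumption: the mirror is byte-identical, so IsIdenticalTo is used here.
        "relationType": "IsIdenticalTo",
    }


# Placeholder DOI for illustration only; it does not point at a real dataset.
mirror_metadata = {"relatedIdentifiers": [mirror_link("10.1234/primary-example")]}
print(json.dumps(mirror_metadata, indent=2))
```

    Whether a given generalist repository accepts this field at deposit time should be checked against its own documentation.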

    How do you actually decide? A five-step framework

    Use this sequence rather than defaulting to whichever repository is fastest to sign up for:

    1. Check the funder mandate first. If your grant agreement or journal’s data sharing policy names a required or preferred repository type, that overrides personal choice.
    2. Search FAIRsharing and re3data for a certified discipline-specific option matching your data type, format and jurisdiction.
    3. Assess curation depth needed. Complex, reusable data (genomic sequences, clinical trial data, crystal structures) benefits from expert domain curation; simple supplementary files often do not need it.
    4. Weigh sustainability. Prefer CoreTrustSeal-certified or institutionally-backed repositories over unaffiliated project archives, especially for data with a multi-decade reuse horizon.
    5. Default to a generalist repository only when no suitable, FAIR-aligned discipline repository exists — and record the choice and rationale in your data management plan.
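    The sequence above can be sketched as a small decision function. This is an illustration under stated assumptions rather than a definitive tool: the `Candidate` fields and the function name are hypothetical, the candidate list is assumed to come from a registry search done by hand, and step 3 (curation depth) is left to human judgement and not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    discipline_specific: bool
    fair_aligned: bool
    certified_or_backed: bool  # CoreTrustSeal-certified or institutionally backed


def choose_repository(mandated: Optional[str],
                      candidates: List[Candidate]) -> Tuple[str, str]:
    """Return (repository name, rationale) following the five-step order."""
    # Step 1: a funder or journal mandate overrides personal choice.
    if mandated:
        return mandated, "named by funder or journal policy"
    # Step 2: keep only FAIR-aligned options found in the registries.
    usable = [c for c in candidates if c.fair_aligned]
    domain = [c for c in usable if c.discipline_specific]
    # Step 4: prefer certified or institutionally backed discipline archives.
    sustainable = [c for c in domain if c.certified_or_backed]
    if sustainable:
        return sustainable[0].name, "discipline-specific, FAIR-aligned, sustainably backed"
    if domain:
        return domain[0].name, "discipline-specific, FAIR-aligned; sustainability needs review"
    # Step 5: generalist fallback only when no discipline option exists.
    generalist = [c for c in usable if not c.discipline_specific]
    if generalist:
        return generalist[0].name, "no suitable discipline repository; generalist fallback"
    return "", "no FAIR-aligned repository identified"


options = [
    Candidate("Generalist A", False, True, True),
    Candidate("Domain archive B", True, True, False),
]
name, rationale = choose_repository(None, options)
print(name, "-", rationale)
```

    The returned rationale string is the part worth copying into the data management plan, since the plan is expected to record why the repository was chosen.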

    Answer-first Q&A

    What is a data repository in research?

    A data repository is a system or service where researchers deposit datasets to obtain a persistent identifier, structured metadata, and long-term hosting. It exists separately from a journal article so that data can be found, cited and reused independently of the publication it supports.

    What is an example of a data repository?

    Zenodo and Figshare are widely used generalist examples; the UK Data Service’s ReShare and the Protein Data Bank are widely used discipline-specific examples. Each assigns a DOI, retains version history, and exposes metadata for discovery by search engines and domain indexes.

    What is a research repository?

    “Research repository” is often used loosely to mean either a data repository (datasets) or an institutional repository (publications, theses). In a data management context, it specifically refers to a certified system for archiving and publishing the datasets underlying research outputs.

    What this means for your data management plan

    A data management plan should name the intended repository before data collection begins, not after submission. Reviewers at UKRI, NIH and Horizon Europe increasingly check whether the named repository matches the funder’s stated hierarchy — generalist repositories named without justification, when a recognised discipline archive exists, are a common cause of DMP revision requests.

    The practical position for most research teams is not “generalist or discipline-specific” as a permanent allegiance, but a per-dataset decision applied consistently: check the mandate, search the registries, weigh curation against sustainability, and document the reasoning. That documented reasoning — more than the repository name itself — is what demonstrates genuine engagement with FAIR data principles to funders, reviewers and future re-users.

  • FAIR Principles Data Maturity: Score Against RDA

    FAIR data maturity is scored by testing each dataset against the 41 indicators of the Research Data Alliance’s FAIR Data Maturity Model, grading Findability, Accessibility, Interoperability and Reusability separately, then weighting results by each indicator’s priority tier (essential, important or useful). FAIR data management moves from an abstract commitment to a measurable score once an institution runs this test consistently across its repository.

    FAIR data is data that meets the Findable, Accessible, Interoperable and Reusable criteria first published by Wilkinson et al. in Scientific Data in 2016 — a paper now cited more than 22,000 times. This guide is a practical scoring walkthrough, not another explainer of what the four letters mean: it shows research offices how to actually audit existing datasets and repositories against the RDA model and turn the result into a remediation plan.

    What is the RDA FAIR Data Maturity Model?

    The RDA FAIR Data Maturity Model is a specification published by a Research Data Alliance working group in 2020 to standardise how organisations test FAIRness. Before it existed, dozens of institutions had built incompatible local checklists, making it impossible to compare a “FAIR score” from one repository against another.

    The model does not ship as software. It is a reference document that defines:

    • 41 indicators — testable statements mapped to the fifteen core GO FAIR sub-principles (F1–F4, A1–A2, I1–I3, R1–R1.3)
    • Three priority tiers — essential, important and useful — so institutions can triage effort rather than treat every indicator as equally urgent
    • Evaluation guidance — worked examples for testing each indicator against real metadata and data objects, rather than self-reported compliance

    Because the indicators trace directly to the GO FAIR principles, a dataset that scores well against the RDA model is, by construction, meeting the same criteria described in the original 2016 Scientific Data paper — just with a repeatable measurement attached.

    How does FAIR maturity scoring actually work?

    Scoring is done indicator by indicator, not principle by principle. Most institutions that implement the RDA model score each of the 41 indicators on a simple 0–4 scale — 0 (not implemented) through 4 (fully implemented) — then multiply by a priority weight before aggregating to a per-dataset and per-repository total.

    FAIR letter | Sub-principles tested | Typical essential-tier evidence
    Findable | F1–F4 | Persistent identifier (DOI via DataCite), indexed metadata record, machine-readable catalogue entry
    Accessible | A1–A2 | Retrieval via an open protocol (HTTPS), metadata that resolves even if the data itself is restricted
    Interoperable | I1–I3 | Structured, non-proprietary format; controlled vocabularies; qualified links to related records
    Reusable | R1–R1.3 | Machine-readable licence, documented provenance, alignment with a domain metadata standard

    A dataset that carries a DOI and open licence but lacks controlled vocabulary terms will score high on Findable and Reusable, and low on Interoperable — the point of indicator-level scoring is precisely to surface that kind of uneven profile, which a single pass/fail “is it FAIR?” verdict would hide.
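    The indicator-level arithmetic can be sketched in a few lines. The numeric tier weights below are an assumption made for illustration (the text describes priority tiers, not fixed numbers), and the example indicators are hypothetical. A weighted mean is used so that each letter’s result stays on the same 0–4 scale as the inputs.

```python
from collections import defaultdict

# Assumed weights for illustration only; an institution would set its own.
TIER_WEIGHT = {"essential": 3, "important": 2, "useful": 1}


def letter_scores(indicators):
    """Weighted mean per FAIR letter.

    indicators: iterable of (fair_letter, tier, score) with score in 0..4.
    """
    totals = defaultdict(float)
    weights = defaultdict(float)
    for letter, tier, score in indicators:
        if not 0 <= score <= 4:
            raise ValueError("score must be between 0 and 4")
        weight = TIER_WEIGHT[tier]
        totals[letter] += weight * score
        weights[letter] += weight
    return {letter: totals[letter] / weights[letter] for letter in totals}


# Hypothetical dataset: DOI and open licence present, no controlled vocabulary.
example = [
    ("F", "essential", 4),
    ("F", "important", 3),
    ("R", "essential", 4),
    ("I", "essential", 1),
    ("I", "important", 0),
]
print(letter_scores(example))
```

    With these inputs the Findable and Reusable results sit near the top of the scale while Interoperable falls well below 1, which is the uneven profile a pass/fail verdict would hide.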

    Manual vs automated assessment: which tool fits?

    Two complementary assessment routes exist. Automated tools are fast but only test what a machine can verify; manual review is slower but catches the indicators that require human judgement, such as whether a licence is genuinely clear or a vocabulary is genuinely domain-appropriate.

    Tool / method | Coverage of the 41 indicators | Output | Best suited to
    F-UJI (FAIRsFAIR project) | Machine-testable subset only; roughly 17 metrics derived from the RDA indicators | Automated percentage score per FAIR letter, run against a DOI | Bulk baseline scans across a whole repository
    FAIR-Aware (DANS) | Self-assessment questionnaire, not indicator-scored | Qualitative readiness report and recommendations | Researchers preparing a dataset before deposit
    Manual RDA specification review | All 41 indicators, including human-judgement ones | Full indicator-by-indicator score with evidence notes | Institutional audits and remediation planning

    A hybrid approach is the most defensible for an institution-wide programme: run an automated scan across every repository record for a fast baseline, then reserve manual review for the essential-tier indicators no tool can verify — licence clarity, provenance completeness and domain-standard alignment.

    A step-by-step scoring walkthrough

    The following sequence turns the RDA model from a reference document into a repeatable institutional process.

    1. Select a representative sample. Pull datasets across disciplines, repository platforms, and funder mandates — a sample skewed toward one department will misstate institutional maturity.
    2. Map each dataset’s DOI or identifier record and run an automated F-UJI scan for the machine-testable indicators before any manual work begins.
    3. Score the remaining essential-tier indicators manually, checking licence text, metadata schema, and vocabulary choice against the evidence guidance in the RDA specification.
    4. Weight and aggregate. Multiply each indicator score by its priority weight, sum within each FAIR letter, then average across the sample to produce a repository-level maturity profile.
    5. Report by weakest letter, not overall average. An institution scoring 3.6/4 on Findable but 1.2/4 on Interoperable needs a vocabulary-adoption project, not a generic “improve FAIR compliance” action item.

    Worked example — three datasets from the same institutional repository, scored on the 0–4 scale before weighting:

    Dataset | Findable | Accessible | Interoperable | Reusable
    Clinical trial dataset (restricted access) | 4 | 3 | 2 | 3
    Environmental sensor archive | 3 | 4 | 3 | 2
    Survey microdata (open) | 2 | 4 | 1 | 4

    This profile, strongest on Accessible and weakest on Interoperable when averaged across the three datasets, is an institution-specific finding that a generic FAIR explainer cannot give you. Only a scored audit surfaces it, and it points to one priority fix (adopting a shared controlled vocabulary at ingest) rather than four separate ones.
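    The repository-level profile for the three worked-example datasets can be reproduced with a short script. It takes the unweighted scores exactly as given in the worked example and averages each FAIR letter across the sample; the variable names are illustrative.

```python
scores = {
    "Clinical trial dataset (restricted access)":
        {"Findable": 4, "Accessible": 3, "Interoperable": 2, "Reusable": 3},
    "Environmental sensor archive":
        {"Findable": 3, "Accessible": 4, "Interoperable": 3, "Reusable": 2},
    "Survey microdata (open)":
        {"Findable": 2, "Accessible": 4, "Interoperable": 1, "Reusable": 4},
}
letters = ["Findable", "Accessible", "Interoperable", "Reusable"]

# Average each letter across the sample, rounded to two decimal places.
profile = {
    letter: round(sum(d[letter] for d in scores.values()) / len(scores), 2)
    for letter in letters
}
weakest = min(profile, key=profile.get)

print(profile)  # {'Findable': 3.0, 'Accessible': 3.67, 'Interoperable': 2.0, 'Reusable': 3.0}
print(weakest)  # Interoperable
```

    Reporting `weakest` alongside the full profile follows step 5 of the walkthrough: the action item is named after the lowest-scoring letter, not the overall average.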

    Common questions about FAIR data scoring

    What are FAIR principles for data?

    FAIR principles are four criteria — Findable, Accessible, Interoperable and Reusable — first published in a 2016 Scientific Data paper by Wilkinson et al. They require datasets to carry a persistent identifier, standardised retrieval protocols, shared vocabularies and machine-readable licensing, so both humans and software can locate and reuse research data reliably.

    What are the four pillars of the FAIR data principles?

    The four pillars are Findable (unique persistent identifiers and rich metadata), Accessible (standardised, open retrieval protocols), Interoperable (shared vocabularies and qualified references) and Reusable (clear licensing, provenance and community standards). The RDA FAIR Data Maturity Model breaks these four pillars into 41 individually testable indicators.

    What are the FAIR data principles of UKRI?

    UKRI does not publish a separate FAIR standard. Its research councils, including NERC’s Environmental Data Service, require grant-funded datasets to follow the same GO FAIR-published Findable, Accessible, Interoperable and Reusable principles, citing benefits including increased citation, stronger research integrity, and compliance with data management plan commitments.

    What are the FAIR principles of GDPR?

    FAIR and GDPR address different concerns and are not in conflict. FAIR governs discoverability and reuse of metadata, while GDPR governs lawful processing of personal data. A dataset containing personal information can be fully FAIR — richly described and findable — while access to the underlying records stays restricted under GDPR-compliant authorisation.

    What this means for research data offices

    A scored FAIR audit gives research offices something a qualitative checklist cannot: a repository-level baseline that can be re-measured after each remediation cycle. Institutions preparing data management plan compliance evidence for UKRI, Horizon Europe, or cOAlition S-aligned funders can cite the same indicator scores as their supporting evidence, rather than producing a fresh narrative justification each time.

    Scoring also clarifies where FAIR and openness diverge. Following the “as open as possible, as closed as necessary” principle, a dataset can score highly on all four FAIR letters while remaining access-controlled — the metadata is open and richly described even when the underlying records are not. Institutions handling Indigenous or community-originated data should additionally weigh the CARE Principles — Collective Benefit, Authority to Control, Responsibility and Ethics — published by the Global Indigenous Data Alliance, which govern who controls reuse decisions rather than how discoverable the data is.

    The practical next step after a first scoring pass is not a single “get to 100%” target — no dataset needs every useful-tier indicator satisfied — but a prioritised backlog built from essential-tier gaps, feeding directly into repository ingest workflows and metadata templates so the next deposit scores higher without a second audit.
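    A prioritised backlog of this kind can be derived mechanically from the audit results. The sketch below is one possible approach under stated assumptions: the indicator labels are hypothetical placeholders, not real RDA identifiers, and the cut-off of 3 on the 0–4 scale is an illustrative choice.

```python
def build_backlog(results, threshold=3):
    """Return essential-tier indicators scoring below threshold, worst first.

    results: list of dicts with keys "indicator", "tier" and "score" (0..4).
    threshold: assumed cut-off; anything below it counts as a gap.
    """
    gaps = [r for r in results
            if r["tier"] == "essential" and r["score"] < threshold]
    return sorted(gaps, key=lambda r: r["score"])


# Hypothetical audit output; labels are placeholders.
audit = [
    {"indicator": "licence-is-machine-readable", "tier": "essential", "score": 2},
    {"indicator": "uses-controlled-vocabulary", "tier": "essential", "score": 1},
    {"indicator": "has-persistent-identifier", "tier": "essential", "score": 4},
    {"indicator": "links-to-related-records", "tier": "useful", "score": 0},
]
for item in build_backlog(audit):
    print(item["indicator"], item["score"])
```

    Useful-tier gaps are deliberately left out of the backlog here, in line with the point that no dataset needs every useful-tier indicator satisfied.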

  • Data Sharing Policy: A Research Office Template

    A data sharing policy is the institution-wide governance document that sets expectations for how researchers plan, deposit, and share research data — distinct from a data sharing agreement, which is the specific legal contract governing one data transfer. Research offices write policies to translate funder FAIR data mandates, such as the NIH’s 2023 Data Management and Sharing Policy, into consistent local practice.

    A data sharing policy is an institutional statement of principle and requirement: it tells every researcher, department, and grant applicant what the organisation expects of them before, during, and after a funded project, regardless of discipline or funder. It is not a substitute for a project-level data management plan (DMP), and it is not the same document as a data sharing agreement — the confusion between the two is the single most common drafting mistake research offices make.

    What is an institutional data sharing policy?

    An institutional data sharing policy is a governance document, usually owned jointly by the research office, library, and IT services, that sets baseline rules for how the organisation’s researchers manage and share the data underlying their published outputs. It applies across all disciplines and funders, rather than to a single grant.

    Published examples illustrate the range: the Office for National Statistics operates a data sharing policy governing record-level personal information, while Cancer Research UK’s data sharing and management policy sets FAIR-aligned requirements as a condition of every grant it awards. Both share a common shape (purpose, scope, principles, requirements, and named responsibilities), even though the first governs a public body’s statistical data and the second governs a funder’s grant conditions.

    For a research office, the policy is the document that makes funder requirements operational at institutional scale: instead of each principal investigator interpreting a funder’s data mandate independently, the institution issues one interpretation, one set of approved repositories, and one escalation route for exceptions.

    Why research offices need a data sharing policy now

    Research offices need a written policy because funders increasingly make data sharing a condition of funding, not a recommendation, and institutions without a policy leave researchers to interpret those conditions inconsistently — which creates compliance risk at renewal, audit, and publication stages.

    The mandate landscape has hardened over the past decade:

    • NIH’s 2023 Data Management and Sharing Policy took effect on 25 January 2023 and requires a data management and sharing plan for essentially all NIH-funded research, reviewed alongside the science.
    • UKRI is a signatory to the 2016 Concordat on Open Research Data, which commits funded institutions to making research data openly available with as few restrictions as possible.
    • Horizon Europe’s Model Grant Agreement requires a FAIR-aligned data management plan for participating projects, applying the “as open as possible, as closed as necessary” principle carried over from Horizon 2020.
    • ICMJE has required a data sharing statement in clinical trial manuscripts submitted to ICMJE-following journals since 1 July 2018, and trials that began enrolling participants on or after 1 January 2019 must also include a data sharing plan in their trial registration.

    Each of these mandates is written at the funder level. The institutional policy is what converts them into a single, consistent set of expectations that a research office can actually train staff on and audit against.

    Data sharing policy vs data sharing agreement

    A data sharing policy and a data sharing agreement solve different problems: the policy is a standing, institution-wide statement of expectations, while the agreement is a one-off legal contract governing a specific transfer of specific data between specific parties. Research offices need both, but they are drafted, owned, and reviewed differently.

    Aspect | Institutional data sharing policy | Data sharing agreement
    Scope | All researchers, all funded projects, ongoing | One dataset, one recipient, one purpose
    Trigger | Institutional governance cycle | A specific request or collaboration
    Legal status | Internal policy; not itself a contract | Binding contract, often referencing UK GDPR
    Typical owner | Research office, library, IT, ethics committee | Data protection officer, legal counsel
    Reviewed by | Institution, periodically | Both parties, per transfer

    A well-written policy should explicitly state this distinction and point researchers to the correct process for each: the policy for general expectations and deposit requirements, the agreement (or a data protection impact assessment) for any transfer involving personal, sensitive, or third-party data governed by UK GDPR.

    Template structure: what to include

    A usable institutional data sharing policy needs roughly eight components, moving from purpose through to enforcement, so that researchers and reviewers can find any given requirement in under a minute.

    1. Preamble and purpose — why the institution requires data sharing and its relationship to the FAIR principles, first published in Scientific Data in 2016.
    2. Scope — which staff, students, and data (all disciplines, all funders, or funder-specific) the policy covers.
    3. Definitions — research data, metadata, persistent identifier, data management plan, repository.
    4. Policy statements — the DMP requirement, repository and persistent-identifier expectations, metadata standards, data licensing, and minimum retention period.
    5. Data availability statements — a requirement that publications state how and where the underlying data can be accessed.
    6. Roles and responsibilities — what is expected of researchers, the research office, the library, IT, and departmental leadership.
    7. Exceptions and embargoes — the process for restricting access on ethical, legal, or commercial grounds.
    8. Review and implementation — the cycle on which the policy itself is revisited against evolving funder mandates.

    Within those components, the core sections should specify the following:

    Section | What it should specify
    Data deposit | Named or criteria-based approved repositories, with a preference for those issuing DOIs via DataCite
    Persistent identifiers | ORCID for researchers; DOIs for datasets
    Contributor recognition | Use of Contributor Role Taxonomy (CRediT) statements so data curation and stewardship work is credited
    Retention | A specific minimum period (commonly ten years post-publication) rather than an open-ended commitment
    Sensitive data | A named route to ethics and data protection review before any exception is granted

    Note that CASRAI originated the CRediT contributor role taxonomy in 2014; the standard is now stewarded by NISO as ANSI/NISO Z39.104-2022, and institutional policies that reference it should cite NISO, not CASRAI, as the current maintaining body.
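    Several of these template requirements can be checked automatically when a deposit is registered. The sketch below is a minimal illustration, assuming a hypothetical deposit record with `doi`, `orcid`, `retention_years` and `credit_roles` fields. The ten-year minimum reflects the commonly used figure mentioned above, not a universal rule, and the patterns only check the shape of each identifier, not whether it resolves.

```python
import re

# Assumed local policy settings for illustration.
MIN_RETENTION_YEARS = 10

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def check_deposit(record):
    """Return a list of policy problems for one deposit record (empty if none)."""
    problems = []
    if not DOI_PATTERN.match(record.get("doi", "")):
        problems.append("dataset DOI missing or malformed")
    if not ORCID_PATTERN.match(record.get("orcid", "")):
        problems.append("researcher ORCID missing or malformed")
    if record.get("retention_years", 0) < MIN_RETENTION_YEARS:
        problems.append("retention period below policy minimum")
    if not record.get("credit_roles"):
        problems.append("no CRediT roles recorded")
    return problems


# Hypothetical record; the DOI is a placeholder.
deposit = {
    "doi": "10.1234/example-dataset",
    "orcid": "0000-0002-1825-0097",
    "retention_years": 5,
    "credit_roles": ["Data curation"],
}
print(check_deposit(deposit))
```

    A check like this only covers the mechanical parts of the policy; exceptions, embargoes and sensitive-data routes still need human review.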

    Frequently asked questions and next steps

    Is a data sharing agreement legally required?

    A data sharing agreement is not universally mandated by statute in the UK, but it is required in practice whenever personal or confidential data is transferred between organisations under UK GDPR, and it is frequently a condition set by funders or ethics committees. An institutional data sharing policy is separate and is typically a funder or institutional requirement rather than a legal one.

    What is the data sharing law in the UK?

    UK data sharing is governed primarily by the UK GDPR and the Data Protection Act 2018, which set the rules for handling personal data, alongside the common law of confidentiality. Research data policies must operate within this framework whenever datasets contain identifiable or sensitive personal information, in addition to meeting funder FAIR requirements.

    What are the six key data sharing principles?

    Widely cited data sharing principles hold that shared information should be necessary, proportionate, relevant, accurate, timely, and secure. Institutional research data policies should apply the same discipline alongside FAIR — findable, accessible, interoperable, reusable — so that openness and data protection obligations are handled together rather than in conflict.

    Once a first draft exists, research offices should route it through the same stakeholders named in the policy itself — library, IT, ethics, and legal — before it goes to institutional governance for sign-off, and set a firm review date rather than leaving the document to lapse.

    As funders continue tightening data mandates, from NIH’s 2023 policy to Horizon Europe’s FAIR requirements, institutions without a current, clearly scoped policy will increasingly find researchers improvising compliance at the point of grant application — precisely the risk a written data sharing policy is designed to remove. Research offices that keep the policy distinct from the data sharing agreement, and review it on a fixed cycle, are best placed to keep pace with the next round of funder requirements.

  • NIH Genomic Data Sharing Policy vs DMS Policy

    The NIH Genomic Data Sharing (GDS) Policy and the NIH Data Management and Sharing (DMS) Policy are two separate, still-active NIH policies with different effective dates, different scopes and different submission points — the GDS Policy (2015) governs consent and controlled access for large-scale genomic data, while the DMS Policy (2023) governs data management planning for all NIH-funded scientific data. Grantees who assume the 2023 policy absorbed the 2015 one risk missing a distinct compliance step.

    The NIH Genomic Data Sharing Policy is the funder requirement, effective since 25 January 2015 under Notice NOT-OD-14-124, that governs consent-based data use limitations, controlled-access repositories and data release timelines for large-scale human and non-human genomic data generated with NIH support.


    What Is the NIH Genomic Data Sharing (GDS) Policy?

    The GDS Policy replaced NIH’s 2007 Genome-Wide Association Studies (GWAS) data-sharing policy and extended its logic to a wider set of genomic technologies. It applies to studies that generate large-scale human or non-human genomic data, including genome-wide association studies, single nucleotide polymorphism (SNP) arrays, whole-genome and whole-exome sequence data, transcriptomic data and epigenomic data produced by array-based or high-throughput sequencing platforms.

    Two features distinguish it from a generic sharing mandate:

    • A two-tiered access model — unrestricted (open) data versus controlled-access data held in a repository such as dbGaP, the NIH database of Genotypes and Phenotypes.
    • A consent-based data use limitation system, under which informed consent documents must state what data types will be shared and whether access will be open or controlled, so that secondary users are legally and ethically bound to the participant’s original consent.

    The National Human Genome Research Institute (NHGRI) implements the policy operationally through Notices NOT-HG-15-038 and NOT-HG-20-011, and designates AnVIL alongside dbGaP as primary repositories for NHGRI-funded genomic data.

    How Does the GDS Policy Differ From the DMS Policy?

    The NIH Data Management and Sharing Policy, effective 25 January 2023 under Notice NOT-OD-21-013, is far broader in scope. It applies to essentially all NIH-funded research producing “scientific data” — any data commonly accepted in the field as sufficient to validate and replicate findings — not only genomic data. It requires a data management and sharing plan with every competing grant application, whereas the GDS Policy’s genomic-specific requirements historically attached at the Just-in-Time stage, after review but before award.

    NIH has since directed that the two policies be harmonised into a single submission: where a project is subject to both, the genomic-specific elements (consent language, data type, repository choice, controlled- versus open-access designation) are folded into one data management and sharing plan rather than filed as two separate documents. The table below sets out where the policies still diverge.

    Feature | GDS Policy (2015) | DMS Policy (2023)
    Governing notice | NOT-OD-14-124 | NOT-OD-21-013
    Effective date | 25 January 2015 | 25 January 2023
    Scope | Large-scale human and non-human genomic data | All NIH-funded scientific data, any type
    Core document | Genomic Data Sharing Plan + Institutional Certification | Data management and sharing plan
    Consent mechanism | Consent-based data use limitations, enforced via dbGaP Data Access Committees | General “justifiable limitations” language; no genomic-specific consent tiers
    Typical repository | dbGaP, AnVIL (controlled- or open-access) | Any NIH-designated or discipline-appropriate research data repository
    Budget provision | Not addressed directly | Explicitly allows data management and sharing costs in the budget

    Who Must Submit an Institutional Certification?

    An Institutional Certification is a GDS Policy-specific attestation — separate from the data management and sharing plan — that the institution has reviewed the consent language, IRB approval and data use limitations attached to the human genomic data before it is deposited in a controlled-access repository. It is not required by the DMS Policy for non-genomic data.

    Institutions must certify, among other things, that:

    • The data was collected in a manner consistent with 45 CFR 46 (the Common Rule) and applicable state and local laws.
    • Consent forms permit the specific type of data use requested (general research use versus disease-specific use).
    • Identifiers have been removed or the data otherwise meets the applicable de-identification standard.

    Because this certification is a distinct compliance artefact from the data management and sharing plan, research administrators who track only DMS Plan compliance can miss it entirely on genomic awards.

    How Does Controlled Access Work Under the GDS Policy?

    Controlled-access genomic data sits in dbGaP behind a Data Access Committee (DAC) review process. Secondary users submit a data access request describing their intended research use; the DAC checks that use against the consent-based data use limitation recorded for that dataset before granting access. This is materially different from the DMS Policy’s general expectation of “broadest appropriate sharing,” which does not itself impose a use-limitation enforcement layer — that enforcement mechanism is a GDS-specific feature.
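    The logic of that check can be illustrated with a deliberately simplified sketch. Real Data Access Committee review is a human judgement against richer data use limitations; the two categories below (general research use and disease-specific use) are the only ones modelled, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataUseLimitation:
    kind: str          # "general" or "disease-specific" (simplified)
    disease: str = ""  # only meaningful for disease-specific limitations


def request_consistent(limitation, proposed_disease):
    """Return True if a proposed use fits the recorded limitation.

    proposed_disease: disease area targeted by the request, or "" when the
    request is for general research with no single disease focus.
    """
    if limitation.kind == "general":
        return True
    if limitation.kind == "disease-specific":
        return (proposed_disease != ""
                and proposed_disease.lower() == limitation.disease.lower())
    raise ValueError("unknown limitation kind: " + limitation.kind)


asthma_only = DataUseLimitation("disease-specific", "asthma")
print(request_consistent(asthma_only, "Asthma"))    # True
print(request_consistent(asthma_only, "diabetes"))  # False
print(request_consistent(asthma_only, ""))          # False
```

    The point of the sketch is the direction of the comparison: the request is tested against the participant’s consent, never the other way round.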

    Answer-First Q&A

    Does the 2023 DMS Policy Replace the 2015 GDS Policy?

    No. The DMS Policy did not replace or repeal the GDS Policy; both remain in force. NIH’s own guidance directs grantees generating large-scale genomic data to satisfy GDS-specific requirements — informed consent language, Institutional Certification, controlled-access designation — within the single data management and sharing plan required by the DMS Policy, rather than as an independent document.

    What Counts as “Large-Scale” Genomic Data Under the GDS Policy?

    NIH does not set one fixed threshold; NHGRI and other institutes assess scale case by case, typically referencing genome-wide association studies, whole-genome or whole-exome sequencing, and array-based platforms as presumptively “large-scale.” Investigators with borderline projects should confirm applicability with their institute’s program officer before submission, since NHGRI also encourages voluntary sharing of smaller datasets.

    When Is the Institutional Certification Submitted?

    The Institutional Certification is submitted at the Just-in-Time stage — after peer review, once an application is being considered for funding — not with the initial application. This differs from the data management and sharing plan itself, which NIH requires as part of the competing application under the DMS Policy.

    Which Repository Satisfies the GDS Policy?

    NIH designates dbGaP for controlled-access human genomic data and, for NHGRI-funded work specifically, AnVIL as the primary repository accepting both controlled- and open-access data. Investigators may propose an alternative repository in the data management and sharing plan, subject to institute approval before funding.

    Implications for Research Administrators

    The practical risk is not policy conflict but a compliance gap: an office that maps its DMS Policy checklist to grant application review alone will miss the GDS Policy’s Just-in-Time Institutional Certification and its ongoing dbGaP registration obligations. Research administration offices supporting genomic PIs need two intake questions, not one: does this award generate large-scale genomic data, and, if so, has the Institutional Certification been routed separately from the data management and sharing plan?
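    The two intake questions can be folded into a simple gap check for an award-tracking spreadsheet or script. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and it covers only the two artefacts discussed here, not every obligation attached to a genomic award.

```python
def intake_gaps(dms_plan_submitted, large_scale_genomic, certification_routed):
    """Return outstanding compliance items for one award, as a list of strings."""
    gaps = []
    # The DMS plan belongs with the competing application.
    if not dms_plan_submitted:
        gaps.append("DMS plan missing from competing application")
    # The Institutional Certification is a separate, GDS-specific item.
    if large_scale_genomic and not certification_routed:
        gaps.append("Institutional Certification not routed (due at Just-in-Time)")
    return gaps


# A genomic award where only the DMS plan has been tracked.
print(intake_gaps(dms_plan_submitted=True,
                  large_scale_genomic=True,
                  certification_routed=False))
```

    An office that records only the first argument would see an empty list for this award, which is the compliance gap described above.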

    As NIH continues to harmonise guidance across institutes, expect more sub-policies — clinical trials data sharing, foreign genomic data transfer rules — to layer onto rather than replace the DMS Policy’s baseline. Treating “DMS compliance” as a single checkbox will increasingly understate what a genomics-heavy award actually requires.