Tag: FAIR Data Principles

  • F-UJI FAIR Evaluator: What It Actually Scores

    The F-UJI FAIR evaluator is an automated web service that checks whether a dataset’s metadata — not its actual data quality — satisfies a fixed set of machine-readable tests built from the FAIRsFAIR Data Object Assessment Metrics. A high F-UJI percentage means a dataset’s landing page, identifiers and schema exposed enough structured signals for a script to find and parse; it does not certify that a human researcher can actually understand, trust or reuse the data inside.

    F-UJI is one of several tools now used to operationalise the FAIR Data Principles (Findable, Accessible, Interoperable, Reusable), alongside FAIRshake, FAIR-Checker, FAIR Aware and the FAIR Data Point specification promoted by the GO FAIR Initiative. This article explains what each type of tool actually scores, where automated scoring diverges from manual FAIR maturity review, and why institutions and research data repositories should treat a high machine score as a floor, not a finish line.

    What is the F-UJI FAIR evaluator?

    F-UJI (FAIRsFAIR Research Data Object Assessment Service) is a web service and REST API that assesses a research data object against 16 core FAIR metrics. A user submits a persistent identifier — typically a DOI — and F-UJI queries external infrastructure including the DataCite API, re3data, schema.org JSON-LD embedded on the landing page, and DCAT or Dublin Core fields to determine whether each metric passes.
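
    The submission step described above can be sketched in Python. This is a minimal sketch rather than a documented client: the host, port, endpoint path, payload field names (object_identifier, test_debug, use_datacite) and the use of HTTP Basic authentication are assumptions modelled on a self-hosted deployment, and should be checked against the API description of whichever instance is used. The credentials are placeholders, and the request is only built, never sent.

```python
import base64
import json
import urllib.request

# Assumption: a self-hosted F-UJI instance; host, port and path may differ.
FUJI_ENDPOINT = "http://localhost:1071/fuji/api/v1/evaluate"


def build_fuji_request(identifier, username, password, endpoint=FUJI_ENDPOINT):
    """Build (but do not send) a POST request asking for one assessment."""
    # Assumed payload field names; verify against the instance's API description.
    payload = {
        "object_identifier": identifier,  # DOI or landing-page URL to assess
        "test_debug": True,               # request per-test debug messages
        "use_datacite": True,             # allow lookups via DataCite
    }
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder credentials; the DOI is the metric release cited in this article.
request = build_fuji_request("https://doi.org/10.5281/zenodo.15045911", "user", "secret")
```

    Passing the resulting object to urllib.request.urlopen would send it; keeping construction separate from sending makes the payload easy to inspect before any bulk audit.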

    The metrics were developed under the EU Horizon 2020 FAIRsFAIR project (2019–2022) and are now maintained and versioned by its successor, the FAIR-IMPACT project, with the metric set published as a citable release (DOI 10.5281/zenodo.15045911). F-UJI’s source code is maintained on GitHub by the PANGAEA data publisher, and the tool is offered as a free public assessment service and API.

    How F-UJI’s automated scoring actually works

    F-UJI does not read the dataset’s content. It inspects the metadata surrounding the dataset — the landing page markup, the identifier’s resolution behaviour, declared licences, and machine-readable provenance fields — and scores each of the 16 metrics as pass, partial or fail. The overall percentage is a weighted sum across the Findable, Accessible, Interoperable and Reusable metric groups.

    • Findable metrics check for a persistent identifier, whether the metadata is indexable by search engines, and whether the identifier resolves to rich metadata.
    • Accessible metrics check that metadata remains retrievable even if the data itself becomes unavailable, and that access protocols are standard.
    • Interoperable metrics check for structured vocabularies declared in a JSON-LD @context (schema.org, DCAT, PROV-O) and for qualified references to related resources.
    • Reusable metrics check for a machine-readable licence, provenance statements, and a community-recognised file format for the data’s actual distribution.
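
    The aggregation described above can be illustrated with a small Python sketch. It assumes each metric contributes points earned out of points available, so that metrics with more available points weigh more heavily; the point values below are invented for illustration and are not real F-UJI output or its actual weighting.

```python
from collections import defaultdict

# Hypothetical per-metric results: (group, points earned, points available).
results = [
    ("F", 2, 2),  # pass
    ("F", 1, 2),  # partial
    ("A", 1, 1),  # pass
    ("I", 0, 2),  # fail
    ("R", 1, 4),  # partial
]


def summarise(metric_results):
    """Return per-group percentages and an overall percentage."""
    earned = defaultdict(float)
    total = defaultdict(float)
    for group, got, available in metric_results:
        earned[group] += got
        total[group] += available
    per_group = {g: round(100 * earned[g] / total[g], 1) for g in total}
    overall = round(100 * sum(earned.values()) / sum(total.values()), 1)
    return per_group, overall


per_group, overall = summarise(results)
```

    With these illustrative values the Findable group scores 75.0 and the overall figure is 45.5, which shows how a single failed group pulls the headline percentage down regardless of how good the underlying data is.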

    A documented example from the FAIR Data Innovations Hub illustrates how mechanical this scoring is in practice: a dataset scored 67% on its first F-UJI run, with the Findable, Interoperable and Reusable metrics flagged for missing JSON-LD context, missing PROV-O provenance fields and an undeclared distribution format. After the maintainers added a single enriched schema.org/PROV-O JSON-LD block to the landing page — without changing the underlying data at all — the same dataset scored 100% on re-assessment. The data did not become more reusable in that interval; its metadata simply became more machine-legible.
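
    A block of the kind described in that example might look like the following Python sketch, which builds a schema.org record with a PROV-O provenance statement and a declared distribution format, then wraps it for embedding in a landing page. Every value is a placeholder, and whether a given block satisfies any particular F-UJI test is not guaranteed here.

```python
import json

# All values below are placeholders for illustration, not a real dataset record.
dataset_jsonld = {
    "@context": {
        "@vocab": "https://schema.org/",
        "prov": "http://www.w3.org/ns/prov#",
    },
    "@type": "Dataset",
    "@id": "https://example.org/dataset/1",
    "name": "Example survey dataset",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    # Provenance statement expressed with a PROV-O term
    "prov:wasGeneratedBy": {
        "@type": "prov:Activity",
        "name": "Example data collection",
    },
    # Declared distribution with an explicit file format
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/dataset/1/data.csv",
        "encodingFormat": "text/csv",
    },
}

jsonld_text = json.dumps(dataset_jsonld, indent=2)
# Landing pages usually carry JSON-LD inside a script element of this type.
script_tag = f'<script type="application/ld+json">\n{jsonld_text}\n</script>'
```

    Nothing in this block describes the data's quality; it only makes the existing description legible to a parser, which is exactly the distinction the example above draws.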

    F-UJI vs FAIRshake vs manual maturity frameworks

    F-UJI is not the only FAIR assessment approach in circulation, and the three main categories differ in what they actually test and who defines “FAIR” for the purpose of the test.

    | Dimension | F-UJI | FAIRshake | Manual maturity review |
    |---|---|---|---|
    | Method | Fully automated, no human input | Hybrid: automated tests plus human-scored rubrics | Fully manual, questionnaire/checklist-based |
    | Basis of criteria | Fixed FAIRsFAIR/FAIR-IMPACT metric set | Community-defined rubrics per research domain | Institution- or project-specific checklist |
    | Input required | A persistent identifier (e.g. DOI) | A URL, via web interface or browser extension | The dataset, documentation and reviewer time |
    | Output | Percentage score per metric and overall | Nine-square “FAIR insignia” visualisation | Narrative report with recommendations |
    | Scalability | High: suited to bulk repository audits | Moderate | Low: resource-intensive |
    | Contextual nuance | Low: rigid, rule-based | Moderate: rubrics can be domain-tailored | High: accounts for discipline-specific reuse |

    FAIRshake was originally developed by the Ma’ayan Laboratory at the Icahn School of Medicine at Mount Sinai under the US National Institutes of Health’s Big Data to Knowledge (BD2K) programme. Rather than one fixed metric set, it lets research communities author their own rubrics and score resources — manually, automatically, or both — against them, then renders the result as a colour-coded insignia rather than a single number.

    The GO FAIR Initiative takes a different, upstream approach: instead of scoring existing datasets after the fact, it promotes the FAIR Data Point (FDP) specification — a layered REST API (FAIR Data Point → Catalog → Dataset → Distribution) that a research data repository implements so that FAIRness is built into how metadata is served, rather than retrofitted and then measured.
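
    The four layers named above can be pictured as nested objects. The sketch below uses plain Python dataclasses purely for illustration; the class and field names are hypothetical, and a real FAIR Data Point exposes these layers as metadata over its API rather than as Python objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    title: str
    media_type: str


@dataclass
class Dataset:
    title: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    title: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class FairDataPoint:
    title: str
    catalogs: List[Catalog] = field(default_factory=list)


def list_distributions(fdp):
    """Walk FAIR Data Point -> Catalog -> Dataset -> Distribution."""
    paths = []
    for catalog in fdp.catalogs:
        for dataset in catalog.datasets:
            for dist in dataset.distributions:
                paths.append((catalog.title, dataset.title, dist.title))
    return paths


# Illustrative content only.
fdp = FairDataPoint(
    "Example FDP",
    [Catalog("Surveys", [Dataset("Wave 1", [Distribution("CSV export", "text/csv")])])],
)
paths = list_distributions(fdp)
```

    The point of the layering is that a client can start at the top-level entry and reach every downloadable file by following links, without scraping landing pages.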

    What a high FAIR score does not prove

    A 100% F-UJI score is a statement about metadata exposure, not about data quality, ethical provenance, statistical validity, or whether another researcher can actually rerun the analysis. This distinction matters because automated tools are increasingly cited in funder and repository policy discussions as if they were a proxy for genuine reusability.

    • A perfectly scored dataset can still contain undocumented preprocessing steps, missing sample metadata, or errors that no metadata check can catch.
    • F-UJI cannot verify that a licence field is legally accurate — only that a machine-readable licence field exists.
    • None of F-UJI, FAIRshake or FAIR Aware assess whether the underlying research methodology or data collection itself was sound; that remains a peer-review and domain-expert function.
    • Scores are not comparable across tools: a dataset scoring 67% on F-UJI is not equivalent to 67% “FAIR” on any absolute scale, since each tool’s metric weighting differs.

    A ScienceDirect study (Devaraju et al., 2021, cited more than 90 times) frames this precisely, describing F-UJI-based measurement as “centred on core metrics” that apply until domain- or community-specific FAIR criteria are agreed — an explicit acknowledgement that the automated baseline is deliberately generic, not a final word on reusability.

    Common questions about automated FAIR scoring

    What does F-UJI actually measure?

    F-UJI measures whether a dataset’s metadata — its identifier, landing-page markup, licence declaration and provenance fields — meets 16 machine-testable criteria drawn from the FAIRsFAIR/FAIR-IMPACT metric set. It does not inspect or validate the dataset’s actual content, methodology or scientific accuracy.

    Is a high F-UJI score the same as genuinely FAIR data?

    No. A high score confirms that metadata is machine-readable and complete according to a fixed rule set. Genuine reusability additionally depends on documentation quality, data integrity and domain-specific context that automated tools are structurally unable to evaluate.

    How does FAIRshake differ from F-UJI?

    FAIRshake combines automated tests with human-scored, community-defined rubrics, whereas F-UJI applies one fixed metric set with no human input. FAIRshake reports results as a visual “FAIR insignia” rather than F-UJI’s single percentage score.

    Do funders formally require automated FAIR scores?

    No major funder currently mandates a specific F-UJI or FAIRshake score as a compliance threshold. Funder and institutional policies (for example under Horizon Europe and UKRI) reference the FAIR Data Principles as a qualitative expectation, with automated tools used voluntarily to self-check progress.

    Implications for repositories and funders

    For research data repositories, the practical use of F-UJI is diagnostic, not evaluative: it flags specific, fixable metadata gaps — a missing JSON-LD block, an undeclared licence field, an absent provenance statement — far faster than a manual audit could. Repositories improving their F-UJI scores should treat each metric failure as a discrete engineering task, not as a proxy for a broader data-quality programme.

    For institutions and funders assessing compliance, the more defensible approach combines automated metadata scoring as a first-pass filter with a manual or community-rubric review for anything reused in decision-relevant research. Relying on one automated percentage to certify “FAIR” data risks the same error as equating a spellchecker’s clean pass with a well-argued essay: necessary, not sufficient.

    As the GO FAIR Initiative’s FAIR Data Point specification gains adoption, the balance may shift from retrospective scoring toward FAIRness built into repository infrastructure from the point of deposit — making after-the-fact tools like F-UJI a verification step rather than the primary mechanism for achieving reusable research data.

  • UK Data Service vs ICPSR: Choosing an Archive

    The UK Data Service and ICPSR are the two largest social-science data archives in the English-speaking research world, and the right choice usually depends on jurisdiction and funder mandate rather than feature parity. The UK Data Service is the ESRC-funded national repository for UK social, economic and population data, while ICPSR is a US-based, membership-funded consortium archive at the University of Michigan. Researchers outside the biomedical repository ecosystem — where PubMed-linked mandates dominate — need to weigh deposit workflow, restricted-access tiers and citation practice before picking either as a home for a dataset.

    The UK Data Service is the largest digital repository for quantitative and qualitative social science and humanities research data in the United Kingdom, formed in October 2012 when the Economic and Social Research Council (ESRC) consolidated the UK Data Archive — established at the University of Essex in 1967 — with several university partners. ICPSR, by contrast, is a membership consortium of academic and research institutions that has archived social and behavioural science data since 1962. Both are listed in re3data.org, the global Registry of Research Data Repositories, and both hold CoreTrustSeal certification for trustworthy digital repositories.

    What Are the UK Data Service and ICPSR?

    The UK Data Service is a national data repository funded through UKRI’s Economic and Social Research Council (ESRC) and led by the UK Data Archive at the University of Essex, in partnership with the University of Manchester, Jisc, EDINA and University College London. It holds more than 6,000 datasets, including UK Census data, the Labour Force Survey, the Millennium Cohort Study and cross-national surveys such as the European Social Survey.

    ICPSR — the Inter-university Consortium for Political and Social Research — is a membership-funded archive based at the University of Michigan, serving several hundred member institutions worldwide alongside non-member depositors and users. Its holdings span large-scale US and international surveys, criminal justice, education and ageing data, and it runs openICPSR as a self-publishing companion repository for rapid dissemination.

    How Do Deposit Workflows Compare?

    Both archives run a curated deposit model rather than a bare upload form: staff review documentation, check disclosure risk and enhance metadata before release. The UK Data Service’s ESRC funding also creates a contractual hook that ICPSR’s membership model does not replicate for non-US funders: grant holders are required to offer their data for archiving as a condition of the ESRC Research Data Policy.

    • UK Data Service: two routes — the main curated collection for large, complex or sensitive studies, and ReShare, a lighter self-deposit repository for smaller datasets, code and syntax files.
    • ICPSR: two routes — the standard curated deposit process, and openICPSR, a self-publishing repository for researchers who want faster turnaround with lighter-touch review.

    Depositors submitting to either service should expect a documentation checklist covering variable-level metadata, consent and ethics evidence, and a data management plan — the same categories UKRI and NSF grant terms typically require regardless of which archive receives the deposit.

    How Do Restricted-Access Tiers Differ?

    Access tiering is where the two services diverge most for researchers working with confidential or disclosive social-science data. The UK Data Service operates a published three-tier model; ICPSR uses a comparable but differently named structure built around its Virtual Data Enclave.

    | Access dimension | UK Data Service | ICPSR |
    |---|---|---|
    | Open tier | No registration; Open Government Licence data | Public-use files via free MyData account |
    | Standard tier | Safeguarded: registration plus End User Licence | Member-institution access under consortium terms |
    | Restricted tier | Controlled: SecureLab, requiring accredited-researcher training under the Five Safes Framework | Restricted-use data via secure Virtual Data Enclave or encrypted physical media, subject to a data security plan |
    | Governance standard | Accredited under the Digital Economy Act 2017 by the UK Statistics Authority (2020) | Institutional Review Board and data-use-agreement based review |

    The UK Data Service’s Five Safes Framework — safe people, projects, settings, data and outputs — was developed with HMRC DataLab and the Office for National Statistics Secure Research Services, and now underpins the SafePod Network launched in 2021 for wider geographical access to sensitive data. ICPSR’s restricted-data pathway achieves an equivalent security outcome through its enclave model but does not use the Five Safes terminology, which matters for UK researchers writing data management plans against ESRC or UKRI templates that reference it explicitly.

    How Do Citation Practices Compare?

    Both archives assign persistent identifiers and expect formal data citation, but their machinery differs. The UK Data Service works with DataCite and the British Library to issue DOIs and promotes an easy-to-use citation tool, framing its approach around the FAIR data principles — Findable, Accessible, Interoperable, Reusable — and its open-source QAMyData tool, which gives depositors a health check for numeric data before release.

    ICPSR similarly issues persistent identifiers for deposited studies and expects citation in publications that reuse its data, but its emphasis sits more on bibliography-style study citations tied to its own numbering system than on a dedicated public FAIR-compliance tool. For researchers publishing in journals that enforce data-availability statements — a growing requirement under funder open-science mandates — the practical difference is smaller than the access-tier gap: both produce a citable, resolvable record, but only the UK Data Service publishes a named QA tool for pre-citation data quality.

    Which Archive Should Researchers Outside Biomedicine Choose?

    For most projects the decision is jurisdictional rather than qualitative. A funder mandate removes the ambiguity immediately: ESRC-funded UK researchers must offer data to the UK Data Service, while NSF- or NIH-adjacent US social-science grants more commonly point toward ICPSR or openICPSR.

    • Choose the UK Data Service if your funder is UKRI/ESRC, your data concerns UK administrative, census or longitudinal panel data, or you need SecureLab/Five Safes access to controlled government microdata.
    • Choose ICPSR if your institution is a consortium member, your data is US-focused or cross-national with US partners, or you want the faster openICPSR self-publishing route.
    • Consult both catalogues before depositing internationally comparable survey data (e.g. European Social Survey, Eurobarometer) — coverage overlaps, and the UK Data Service can facilitate UK-based access to ICPSR holdings.

    Institutions building or reviewing a data management plan should treat repository choice as a research-compliance question first and a discoverability question second: a technically excellent dataset deposited in the wrong repository for its funder mandate creates avoidable rework at grant closeout.

    Answer-First Questions Researchers Ask

    What Is the UK Data Service?

    The UK Data Service is the ESRC-funded national repository for UK economic, population and social research data, led by the UK Data Archive at the University of Essex. It holds over 6,000 datasets, including census, survey and longitudinal study data, and operates under the OAIS digital-preservation reference model.

    How Do You Access Data on the UK Data Service?

    Access runs through three published tiers: Open data requiring no registration, Safeguarded data requiring registration and an End User Licence, and Controlled data requiring SecureLab accreditation under the Five Safes Framework. Most researchers start with the free data catalogue and register once they identify a specific study.

    Is the UK Data Service Free?

    Yes — the service is free to data owners depositing studies and free at the point of use for non-commercial research and teaching. Commercial users may incur administrative fees, and controlled-tier access requires accredited-researcher training rather than a monetary charge.

    Implications for Research Administrators

    Data management plans reviewed by institutional research offices, ARMA and INORMS-aligned research administrators, and funder compliance teams increasingly treat repository choice as an auditable field, not a footnote. A UK-funded study archived outside the UK Data Service without documented justification can trigger ESRC compliance queries at final reporting; a US consortium study left undeposited with ICPSR can weaken an institution’s case for renewed membership funding. Neither archive competes with domain-specific biomedical repositories governed by NISO, ICMJE or COPE norms — this comparison sits squarely in the national data repository space for social science, distinct from that ecosystem.

    As open-science mandates from UKRI, cOAlition S and equivalent US funders converge on FAIR-by-default expectations, the operational gap between the UK Data Service and ICPSR is narrowing to jurisdiction, access-tier terminology and citation tooling rather than underlying trustworthiness — both hold CoreTrustSeal certification and both sit inside the CESSDA/re3data recognised-repository landscape that funders now check by default.

  • CC0 for Data: CC-BY and Custom Licence Guide

    CC0 for data means dedicating a dataset to the public domain with no attribution requirement, while CC-BY permits free reuse conditional on credit; for structured databases, neither Creative Commons tool may be the legally correct choice. Under the FAIR Data Principles (Findable, Accessible, Interoperable, Reusable), a licence supports reusability only if it imposes minimal friction on machine and human reuse. CC0 is the tool most repositories recommend by default, CC-BY is acceptable where attribution norms are strong, and bespoke institutional terms are usually a liability, not a safeguard.

    CC0 (Creative Commons Zero) is a public domain dedication published by Creative Commons that waives copyright and related rights “to the fullest extent permitted by law”, allowing copying, modification, and commercial reuse without permission or credit.

    Why a data licence matters for FAIR reuse

    The FAIR Data Principles — Findable, Accessible, Interoperable, Reusable — treat licensing as a core reusability criterion, not an afterthought. A dataset can be technically accessible and still fail FAIR if its licence is ambiguous, restrictive, or silent on reuse conditions.

    Without an explicit licence, the default legal position in most jurisdictions is “all rights reserved”, deterring reuse even when the depositor intended openness. Data repositories such as Dryad require a clear waiver precisely to remove this ambiguity.

    • Findability is unaffected by licence choice, but reusability collapses without one.
    • Interoperability depends on whether the licence allows combination with other datasets under different terms.
    • Reusability is maximised when the licence imposes the fewest conditions consistent with the depositor’s actual requirements.

    CC0 vs CC-BY: what actually differs

    CC0 removes all conditions, including attribution; CC-BY keeps commercial and derivative reuse rights but makes crediting the source a licence condition rather than a courtesy. The practical consequences are larger for data than for text or images.

    | Aspect | CC0 | CC-BY 4.0 |
    |---|---|---|
    | Attribution required | No (legally); expected as scholarly norm | Yes, legally enforceable |
    | Commercial reuse | Permitted | Permitted |
    | Combining with other datasets | Frictionless | Can trigger “attribution stacking” |
    | Recommended by | Dryad, GBIF, most genomics/biodiversity repositories | European Commission for some research data categories |
    | Applies cleanly to non-copyrightable facts | Yes, designed for this case | Ambiguous; CC-BY presumes a copyright interest that may not exist in raw data |

    The CESSDA Data Management Expert Guide notes that CC0 prevents attribution stacking — the compounding burden of citing every upstream source when a new dataset merges dozens of others. This is the strongest technical argument for CC0 over CC-BY in aggregated or long-tail scientific data. Dryad’s data-services team has explained that CC0 was “crafted specifically to reduce any legal and technical impediments… to the reuse of data” — a rationale FAIR later formalised as a reusability requirement.

    Does attribution disappear entirely under CC0?

    No. CC0 removes the legal obligation to cite, but citation remains a scholarly and professional norm enforced through peer review, journal policy, and disciplinary ethics rather than licence terms. Most researchers continue citing CC0 datasets exactly as they would any other source, because academic integrity — not copyright law — is what drives the practice.

    Why custom institutional terms usually backfire

    Some institutions draft bespoke data-sharing agreements instead of adopting a standard licence, adding restrictions such as “non-commercial use only” or “notify us before reuse”. This creates three recurring problems.

    • Machine unreadability: standard CC and Open Data Commons licences carry machine-readable metadata that repositories, indexers, and rights-clearance tools recognise automatically; bespoke legal text does not.
    • Interoperability failure: a custom clause requiring prior notification or a specific attribution format is often legally incompatible with the standard licences used by the other datasets a researcher wants to combine it with.
    • Enforcement uncertainty: institutions rarely have the resources to monitor or enforce bespoke terms, so the restriction deters legitimate reuse without stopping the misuse it was meant to prevent.

    The University of California’s Office of Scholarly Communication has argued that CC-BY is “not always a good fit” for data, since its legal machinery was designed for copyrightable creative works rather than mixed factual content — and a custom clause layered on top compounds that mismatch rather than resolving it.

    Databases are a special case: ODC-By and ODbL

    Raw facts are generally not copyrightable, but a database’s structure can attract separate rights, including the EU’s sui generis database right. This is a genuine gap in most CC0-vs-CC-BY explainers: Creative Commons licences were not written for database rights, and the Open Knowledge Foundation’s Open Data Commons suite exists specifically to cover them.

    • ODC-By (Open Data Commons Attribution License): permits copying, distribution, and commercial use of a database with attribution — the database-rights equivalent of CC-BY.
    • ODbL (Open Database License): adds a share-alike condition, so derived databases must carry the same licence — the database-rights equivalent of CC-BY-SA.
    • CC0 can still be applied to a database to waive both copyright and any sui generis database right simultaneously, which is why several major repositories default to it rather than layering ODC-By on top.

    Joint guidance from Kehl University of Applied Sciences and IP specialists Maucher Jenkins explicitly separates content, software, and databases into three categories, rather than treating “data licensing” as one undifferentiated choice — a distinction most generic CC0-vs-CC-BY articles omit.

    A decision framework for choosing a licence

    Choosing correctly requires matching the licence to the data type and the reuse goal, not defaulting to whichever licence a template happens to include.

    1. Default to CC0 for raw observational data, measurements, or any dataset likely to be combined with others — this is the position taken by repositories including Dryad and GBIF and referenced in OpenAIRE’s data-sharing guidance.
    2. Use CC-BY where the deposited content includes substantial original creative or analytical framing (for example, a curated data paper’s narrative sections) and attribution is central to the scholarly reward system.
    3. Use ODC-By or ODbL where the artefact is genuinely a structured database and jurisdiction-specific database rights are a live concern, particularly for depositors working under EU law.
    4. Avoid bespoke terms unless a named legal, ethical, or funder requirement (such as personal or sensitive data restrictions) makes a standard open licence genuinely unsuitable — and even then, prefer a recognised restricted-access framework over ad hoc legal drafting.
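
    The four steps above can be expressed as a small decision function. This is a sketch under stated assumptions: the parameter names are hypothetical, and the order in which the checks are applied (restrictions first, then database rights, then creative framing, with CC0 as the fallback) is one reasonable reading of the list rather than a rule stated by any repository.

```python
def suggest_licence(is_structured_database=False,
                    database_rights_concern=False,
                    has_creative_framing=False,
                    named_restriction=False):
    """Map the decision framework onto a suggested licence family."""
    # Step 4: a named legal, ethical or funder restriction overrides open defaults.
    if named_restriction:
        return "recognised restricted-access framework"
    # Step 3: genuine databases where database rights are a live concern.
    if is_structured_database and database_rights_concern:
        return "ODC-By or ODbL"
    # Step 2: substantial original creative or analytical framing.
    if has_creative_framing:
        return "CC-BY"
    # Step 1: default for raw observational data and measurements.
    return "CC0"


default_choice = suggest_licence()
database_choice = suggest_licence(is_structured_database=True,
                                  database_rights_concern=True)
```

    A function like this is no substitute for legal advice, but writing the rules down makes it obvious when a project is about to reach for bespoke terms without a named reason.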

    Whichever licence is chosen, it must be declared unambiguously in the dataset’s metadata and in any accompanying data paper, since automated harvesters and data repository platforms increasingly reject or flag submissions with missing or non-standard licence fields.
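
    The kind of check a harvester might apply to a licence field can be sketched as follows. The function name, the metadata key ("license") and the three result labels are hypothetical, and the lookup table covers only the four licences discussed in this guide; it handles plain string values only.

```python
# Licence URLs (scheme and trailing slash removed) mapped to SPDX identifiers.
RECOGNISED = {
    "creativecommons.org/publicdomain/zero/1.0": "CC0-1.0",
    "creativecommons.org/licenses/by/4.0": "CC-BY-4.0",
    "opendatacommons.org/licenses/by/1-0": "ODC-By-1.0",
    "opendatacommons.org/licenses/odbl/1-0": "ODbL-1.0",
}


def classify_licence(metadata):
    """Return an SPDX identifier, 'missing', or 'non-standard'."""
    value = (metadata.get("license") or "").strip()
    if not value:
        return "missing"
    key = value.lower()
    # Normalise the scheme and trailing slash before the lookup.
    for prefix in ("https://", "http://"):
        if key.startswith(prefix):
            key = key[len(prefix):]
    key = key.rstrip("/")
    return RECOGNISED.get(key, "non-standard")


cc0_result = classify_licence(
    {"license": "https://creativecommons.org/publicdomain/zero/1.0/"}
)
bespoke_result = classify_licence({"license": "Contact us before reuse"})
```

    Free-text terms fall straight into the non-standard bucket, which is the practical cost of bespoke wording: a script cannot tell a permissive custom clause from a restrictive one.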

    Answer-first Q&A

    Is CC0 free for commercial use?

    Yes. CC0 places a work in the public domain, so there is no restriction on commercial exploitation, modification, or redistribution. Any user — including a company building a commercial product — may use CC0 data without seeking permission, paying a fee, or providing credit, though citing the source remains good scholarly practice.

    Are CC0 and public domain the same?

    Not exactly. The Public Domain Mark is an informational label applied when a work is already believed to be out of copyright, while CC0 is an active legal waiver used by a rightsholder to voluntarily place their own work in the public domain. CC0 changes legal status; the Public Domain Mark only describes an existing one.

    Do I have to cite CC0 data?

    Legally, no — CC0 imposes no attribution requirement. In practice, researchers should still cite the original dataset because academic norms, journal policies, and reproducibility standards expect source attribution regardless of what the licence legally mandates.

    Can data be copyrighted?

    Raw facts generally cannot be copyrighted, but a database’s original selection, arrangement, or structure can attract copyright or, in the EU, a separate sui generis database right. This is precisely why database-specific licences such as ODC-By and ODbL exist alongside Creative Commons tools.

    Implications for repositories and institutions

    Repositories that mandate CC0 by default see fewer downstream reuse disputes and cleaner automated harvesting, because ambiguity is removed at the point of deposit. Institutions drafting data-management plans should specify the licence at policy level rather than per-project, and funders increasingly expect this decision documented, not left as “to be determined”.

    Looking ahead

    As FAIR compliance becomes a formal funder and publisher requirement rather than a voluntary aspiration, licence choice will keep moving from an afterthought to a mandatory, auditable field in data-management plans. CASRAI originated the CRediT contributor role taxonomy in 2014, and the standard is now stewarded by NISO as ANSI/NISO Z39.104-2022 — a reminder that clear, jointly governed standards, rather than bespoke institutional terms, are what let research infrastructure scale across disciplines and borders.

  • Genomic Data Repository Guide: ENA vs GEO vs SRA

    Choosing a genomic data repository comes down to three questions: what type of data you have, whether it is identifiable human data, and what your funder or journal mandates. Raw sequencing reads generally go to the European Nucleotide Archive (ENA) or the Sequence Read Archive (SRA) — two mirrored nodes of the same international collaboration — while processed gene-expression data belongs in the Gene Expression Omnibus (GEO). A genomic data repository is a persistent, publicly accessible database that assigns stable identifiers to deposited sequence or expression datasets so they can be cited, retrieved and reused under FAIR data principles.

    ENA, GEO and SRA are the three repositories researchers encounter most often when funder or journal data-sharing policies require deposition of sequencing output. They are not interchangeable: each has a different primary data type, a different metadata standard, and a different position in the international data-sharing infrastructure. This guide compares them on deposit requirements, metadata standards and journal acceptance so research administrators and authors can make a defensible, mandate-compliant choice.

    What is a genomic data repository?

    A genomic data repository is a curated, publicly accessible database that archives DNA or RNA sequence data — raw reads, assembled genomes, or processed expression tables — and assigns each dataset a stable accession number for permanent citation. Repositories exist because journals and funders increasingly require that sequence data underlying a publication be deposited somewhere reviewers, readers and future researchers can retrieve it, rather than held privately by the authors.

    The three most consulted repositories for sequencing output are the European Nucleotide Archive (ENA), the Sequence Read Archive (SRA), and the Gene Expression Omnibus (GEO). ENA and SRA are both members of the International Nucleotide Sequence Database Collaboration (INSDC), alongside the DNA Data Bank of Japan (DDBJ); records submitted to any one of the three are mirrored across all of them, typically within 24-48 hours.

    ENA vs GEO vs SRA: how do they differ?

    The single biggest distinction is data type: ENA and SRA hold raw sequence reads (FASTQ, BAM, CRAM), while GEO holds processed functional genomics results — expression matrices, normalised counts and the experimental metadata describing them — and links out to SRA for the underlying raw reads. Geography and stewardship differ too: ENA is maintained by EMBL-EBI in the UK/Europe, while SRA and GEO are both maintained by the US National Center for Biotechnology Information (NCBI).

    | Feature | ENA | GEO | SRA |
    |---|---|---|---|
    | Steward | EMBL-EBI (Europe) | NCBI (US) | NCBI (US) |
    | Primary data type | Raw reads, assemblies, annotated sequences | Processed expression data + metadata | Raw sequencing reads |
    | INSDC member | Yes | No (links to SRA) | Yes |
    | Metadata standard | ENA checklists | MINSEQE / MIAME | INSDC submission schema |
    | Access model | Open (controlled tier via EGA for identifiable human data) | Open | Open (controlled tier via dbGaP) |

    A frequently overlooked distinction is access control. None of ENA, SRA or GEO is designed to hold identifiable human genomic or phenotypic data. That category of data belongs in a controlled-access archive — the European Genome-phenome Archive (EGA), jointly run by EMBL-EBI and the CRG, or NCBI’s database of Genotypes and Phenotypes (dbGaP) — where access is granted through a data access committee rather than opened to the public. Depositing identifiable clinical genomic data in an open repository such as ENA or SRA would breach both the repositories’ own policies and, in most jurisdictions, data protection law.

    What are the deposit requirements for each repository?

    Each repository sets its own submission checklist, but all three require a structured description of the experiment alongside the sequence files themselves.

    • ENA requires a study, sample, experiment and run object for each submission, described against one of ENA’s checklist templates (for example, the pathogen or invertebrate checklists), plus the raw read files.
    • SRA requires equivalent BioProject and BioSample records, submitted through NCBI’s submission portal, with reads in FASTQ or BAM/CRAM format.
    • GEO requires a MINSEQE-compliant description of the experimental design (samples, protocols, processed data matrix) and will route the corresponding raw reads to SRA as part of the same submission, generating a linked SRA accession automatically.
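
    The checklist above can be sketched as a simple pre-submission check. This is illustrative only: the component names are taken from the list above, while the dictionary layout and the function name `missing_components` are hypothetical and do not correspond to any repository's API.

```python
# Illustrative sketch: component names follow the bullet list above; the
# data structure and function are hypothetical, not a repository API.
REQUIRED_COMPONENTS = {
    "ENA": {"study", "sample", "experiment", "run", "read_files"},
    "SRA": {"bioproject", "biosample", "read_files"},
    "GEO": {"samples", "protocols", "processed_data_matrix", "raw_reads"},
}


def missing_components(repository, prepared):
    """Return the required components not yet prepared, sorted by name."""
    required = REQUIRED_COMPONENTS[repository.upper()]
    return sorted(required - set(prepared))


if __name__ == "__main__":
    draft = {"study", "sample", "read_files"}
    print(missing_components("ENA", draft))  # ['experiment', 'run']
```

    A check like this only confirms that each piece exists; whether the metadata inside each piece satisfies the relevant checklist is still decided by the repository at submission time.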

    Because ENA and SRA mirror each other, a dataset submitted to one is not normally resubmitted to the other — submitting twice creates duplicate, unlinked accessions rather than better coverage.

    Which metadata standards apply?

    Metadata quality, not just file deposition, is what makes a dataset FAIR — Findable, Accessible, Interoperable and Reusable, per the FAIR data principles first published by Wilkinson et al. in 2016. GEO submissions are assessed against MIAME (Minimum Information About a Microarray Experiment) for array data and MINSEQE (Minimum Information about a high-throughput Nucleotide Sequencing Experiment) for sequencing-based expression studies. ENA and SRA submissions follow INSDC’s shared sample and experiment metadata schema, supplemented by checklist-specific fields for the sample type in question.

    Consistent metadata is also what allows a dataset to be discovered through cross-repository registries such as re3data and FAIRsharing, both of which index genomic repositories alongside thousands of other subject and generalist repositories.

    Do journals and funders accept all three equally?

    Most journal data-availability policies name an INSDC-compliant repository — ENA, SRA or DDBJ — as the acceptable destination for raw sequence data, and GEO or ArrayExpress for expression data. PLOS, for example, states that authors should select field-appropriate repositories and lists ENA, SRA, GEO and DDBJ among its recommended sequencing repositories, while also pointing authors to re3data and FAIRsharing when no field-specific option exists.

    Funder policy is generally repository-agnostic within the INSDC family: the NIH Genomic Data Sharing Policy and the 2023 NIH Data Management and Sharing Policy both accept SRA for open data, and dbGaP or an equivalent controlled-access archive for human data, without mandating SRA specifically over ENA. UK and European funders operating under UKRI or Horizon Europe open-science requirements similarly accept any INSDC-affiliated repository, reflecting the FAIR data principles rather than naming a single preferred database.

    Frequently asked questions

    What is the difference between ENA, GEO and SRA?

    ENA and SRA both archive raw sequencing reads and mirror each other as INSDC members, differing mainly in which institution — EMBL-EBI or NCBI — hosts the submission. GEO instead archives processed gene-expression results and metadata, forwarding the associated raw reads to SRA automatically during submission.

    Do I need to submit data to both GEO and SRA?

    Not separately. When you submit a gene-expression study to GEO, the platform generates a linked SRA accession for the raw reads as part of the same workflow, so a single submission satisfies both repositories without duplicate uploads.

    Is ENA the same as SRA?

    No — they are separate databases run by different organisations that mirror the same underlying INSDC data. A dataset submitted to ENA in Europe becomes visible through SRA in the US within roughly one to two days, and vice versa, so researchers choose one, not both.

    Which repository do funders require for genomic data?

    Most funder policies, including NIH’s Genomic Data Sharing Policy and UKRI’s open research requirements, accept any INSDC-affiliated repository — ENA, SRA or DDBJ — for raw sequence data, plus GEO for expression data, rather than mandating one specific database.

    What this means for research administrators

    For institutions building data-management-plan templates or compliance checklists, the practical rule is to map deposition guidance to data type and access sensitivity rather than to a single named repository: raw non-identifiable reads to ENA or SRA, expression matrices to GEO, and any identifiable human genomic or clinical data to a controlled-access archive such as EGA or dbGaP. Framing repository choice this way keeps research administration guidance aligned with funder and journal policy regardless of which INSDC node an individual researcher prefers to use.
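
    As a minimal sketch, the rule described above can be written as a small routing function that a deposit checklist might embed. The function name, the data-type labels and the returned strings are hypothetical; only the mapping itself comes from the text.

```python
def suggest_repository(data_type, identifiable_human):
    """Map data type and access sensitivity to a deposit destination."""
    if identifiable_human:
        # Sensitivity is checked first: identifiable human data goes to a
        # controlled-access archive whatever its data type.
        return "controlled-access archive (EGA or dbGaP)"
    routes = {
        "raw_reads": "ENA or SRA",
        "expression_matrix": "GEO",
    }
    if data_type not in routes:
        raise ValueError(f"no routing rule for data type: {data_type!r}")
    return routes[data_type]


if __name__ == "__main__":
    print(suggest_repository("raw_reads", identifiable_human=False))
    print(suggest_repository("raw_reads", identifiable_human=True))
```

    Checking access sensitivity before data type is the design choice that matters here: it prevents a checklist from routing identifiable human reads to an open archive simply because they are raw reads.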

    As funder mandates increasingly cite FAIR data principles explicitly rather than naming individual repositories, the durable compliance strategy is to select any INSDC-affiliated repository appropriate to the data type, document the accession number in the manuscript, and reserve controlled-access archives strictly for identifiable human data. Research offices that build this decision logic into deposit checklists now will need far less rework as funder policy language continues to converge on FAIR terminology rather than named databases.

  • Trusted Research Environments Make NHS Data FAIR

    A trusted research environment (TRE) is a secure, access-controlled computing platform that lets approved researchers analyse sensitive data — such as NHS patient records — without ever copying, downloading, or exporting the underlying data. Analysts log in remotely, run their code against the data inside the environment, and only pre-checked, aggregated outputs leave the boundary. This is the mechanism that lets sensitive health datasets stay FAIR-findable and reusable while the data itself never crosses a governance line.

    A trusted research environment is a governed digital space in which pre-approved researchers query sensitive data under the Five Safes framework, with disclosure-checked outputs as the only route out. TREs are also known as secure data environments (SDEs), data safe havens, or secure research environments (SREs). The terms are functionally synonymous, though NHS England now prefers “secure data environment” in public-facing policy as more intuitive than the technical “TRE”.

    What is a trusted research environment and how does it work?

    A TRE inverts the traditional data-sharing model. Instead of sending a dataset to a researcher’s own machine, the researcher comes to the data. Code, statistical software, and disclosure-controlled outputs move; identifiable records do not.

    • No data extraction: raw records cannot be downloaded, copied, or emailed out of the environment.
    • Pre-installed analytical tooling: statistical packages and secure workspaces sit inside the perimeter, so researchers never need a local copy.
    • Output checking: a human or automated disclosure-control review screens every result before it is released, to confirm no individual can be re-identified.
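
    To make the output-checking step concrete, here is a sketch of one rule an automated check might apply: suppressing small counts in an aggregate table before release. Small-number suppression is a common statistical disclosure control technique, but the text above does not say which rules any particular TRE uses; the threshold of 10, the function name and the example table are all assumptions for illustration.

```python
from typing import Dict, Union


def suppress_small_counts(
    table: Dict[str, int], threshold: int = 10
) -> Dict[str, Union[int, str]]:
    """Replace any count below the threshold with a suppression marker."""
    marker = f"<{threshold}"
    return {
        group: (count if count >= threshold else marker)
        for group, count in table.items()
    }


if __name__ == "__main__":
    # Hypothetical aggregate output a researcher asks to export.
    requested = {"ward_a": 124, "ward_b": 7, "ward_c": 31}
    print(suppress_small_counts(requested))
    # {'ward_a': 124, 'ward_b': '<10', 'ward_c': 31}
```

    A single rule like this is not a complete disclosure review. Real output checking also has to consider things such as whether a suppressed cell can be recovered from row or column totals, which is one reason human reviewers remain part of the process.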

    Peer-reviewed literature describes a TRE as “an environment supported by trained staff and agreed processes… to access sensitive data” — a framing echoed across UK academic TRE documentation.

    What is the Five Safes framework?

    The Five Safes framework is the governance model almost every UK TRE uses to structure access decisions — from the Office for National Statistics’ Secure Research Service to NHS regional secure data environments. It originated at the ONS and is now standard across the UK’s public-sector research data infrastructure.

    Safe | Question it answers | Typical control
    Safe Projects | Is the research in the public interest? | Independent research/ethics review of the proposal
    Safe People | Can this researcher be trusted? | Accreditation, training, background checks
    Safe Settings | Is the technical environment secure? | No internet egress, monitored virtual desktops, audit logging
    Safe Data | Is the data adequately de-identified? | Pseudonymisation, aggregation, statistical disclosure control
    Safe Outputs | Could the results re-identify anyone? | Manual or automated output review before release

    ADR UK notes that each of its national partnerships, as well as the ONS, operates a dedicated TRE built on Five Safes principles — the de facto standard, not one option among several.

    How does the NHS secure data environment programme work?

    NHS England’s SDE policy requires that access to NHS health and social care data for research and planning be provided through accredited secure data environments, rather than by disseminating extracted, pseudonymised datasets to individual organisations. This followed the 2022 Goldacre Review, “Better, Broader, Safer: Using Health Data for Research and Analysis,” which recommended TREs become the default route for accessing NHS data rather than the exception.

    The result is a two-tier structure now operating across England:

    • NHS England’s national SDE, holding national datasets for approved research uses.
    • Sub-national secure data environments (SNSDEs), regional environments aligned to Integrated Care Systems, giving researchers access to more granular, regionally linked data.

    Devolved nations run equivalent infrastructure: the Scottish National Safe Haven, Wales’ SAIL Databank at Swansea University, and Northern Ireland’s Honest Broker Service each function as a jurisdictional TRE under comparable governance.

    How do TREs make sensitive data FAIR without moving it?

    The FAIR data principles — Findable, Accessible, Interoperable, Reusable, formalised by Wilkinson et al. in Scientific Data (2016) — were designed for open datasets that can be freely retrieved. Sensitive health data cannot satisfy FAIR in its literal, open-access sense; a TRE lets each principle apply to the metadata and governance layer instead of the raw record. This is the architectural insight most explainer content on this topic misses: FAIR does not require open data, it requires a documented, machine-actionable pathway to reuse — and a TRE supplies exactly that for data which must stay closed.

    • Findable: TREs publish dataset-level metadata in public catalogues — for example, the HDR UK Innovation Gateway — with persistent identifiers, so a dataset’s existence, structure, and provenance are discoverable even though the records inside are never exposed.
    • Accessible: “accessible” is redefined as a documented, auditable application and accreditation process (Safe People, Safe Projects) rather than an open download link — the process itself is transparent even where the data is not.
    • Interoperable: common data models and coding standards (such as OMOP or SNOMED CT mappings used across NHS TREs) let approved analyses run consistently across multiple environments, enabling federated analysis without pooling raw data in one place.
    • Reusable: version-controlled analytical code, output logs, and data dictionaries are retained and, increasingly, shared openly by researchers even when the underlying data cannot be — supporting reproducibility and future reuse of the method, if not the dataset.

    This mapping is the load-bearing argument of the TRE model: sensitive data sharing and open FAIR data are not opposites. The TRE is the governance boundary that lets FAIR’s discovery and reuse guarantees operate at the metadata and code layer while Five Safes controls operate at the record layer.

    How does OpenSAFELY demonstrate the model in practice?

    OpenSAFELY, built by the Bennett Institute for Applied Data Science at the University of Oxford and collaborators in response to the COVID-19 pandemic, is the most cited working example of this architecture. Rather than extracting GP records, OpenSAFELY runs analytical code inside the secure environments of the electronic health record software suppliers themselves, executing studies against the pseudonymised primary-care record for a very large proportion of England’s registered patients, without the data ever leaving NHS-contracted infrastructure.

    Its methods and code repositories are published openly, so the analytical logic is fully FAIR — reusable and auditable by anyone — even though the patient-level data it runs against never is. That split is the clearest public demonstration of “FAIR governance, closed data” in UK health research.

    Common questions about trusted research environments

    What is the difference between an SDE and a TRE?

    An SDE and a TRE describe the same underlying architecture; SDE is the term NHS England now favours as clearer for non-specialist audiences, while TRE remains standard in academic and technical documentation, including workspace-level “research TREs” built for a single project inside a broader SDE.

    Is a data safe haven the same as a trusted research environment?

    Yes — data safe haven is an earlier, still widely used UK term for the same model, applied to environments such as the Scottish National Safe Haven. All three terms describe a controlled computing space governed by comparable accreditation, de-identification, and output-checking controls, typically under a Five Safes-style framework.

    What is required to build a trusted research environment?

    Building a compliant TRE requires an on-premises or cloud-hosted secure computing platform with no unmonitored internet egress, encrypted data at rest and in transit, role-based access controls, and a formal output-checking process — King’s College London’s CREATE TRE, for example, operates under ISO 27001 certification to evidence these controls externally.

    What does “trusted research” mean in UK government usage?

    Separately from the TRE data-access model, the UK government’s “Trusted Research” guidance is a framework protecting intellectual property and research security in international collaborations, distinct from — but sometimes confused with — the data-governance meaning of “trusted research environment” discussed here.

    What this means for research administrators and funders

    For institutions handling sensitive datasets, FAIR compliance and data protection obligations are no longer competing priorities. A properly governed TRE lets a research office satisfy funder FAIR-data mandates — citing metadata, persistent identifiers, and documented reuse pathways — while meeting UK GDPR, common-law confidentiality, and NHS information-governance duties simultaneously. Research administrators evaluating data-access requests should treat “does this dataset sit behind an accredited TRE with Five Safes controls” as a first-order question, not an afterthought.

    As sub-national secure data environments mature across England’s Integrated Care Systems, and equivalent infrastructure federates across the devolved nations, the interoperability layer — common data models, shared metadata standards, cross-TRE federated analysis — is the area most likely to determine whether the FAIR promise of these environments is fully realised.

  • Datasheets for Datasets: FAIR Habits for AI Data

    Datasheets for datasets are structured documentation records — covering motivation, composition, collection process, and recommended uses — that accompany a dataset the way a technical datasheet accompanies an electronic component. Proposed for machine learning in 2018, the practice mirrors documentation habits research data managers have used for decades, and research offices are increasingly the ones best placed to recognise and credit that documentation work.

    A datasheet for a dataset is a short, standardised document that records where a dataset came from, how it was collected and labelled, what it should and should not be used for, and who is responsible for maintaining it. The idea was formalised by Timnit Gebru and colleagues in the 2018 paper “Datasheets for Datasets” (arXiv:1803.09010), later published in Communications of the ACM, Vol. 64, No. 12 (2021).

    Where did datasheets for datasets come from?

    Gebru et al.’s 2018 paper argued that machine learning datasets circulated with almost no accompanying documentation, unlike the datasheets that have long shipped with electronic components. The paper has since been cited by more than 4,700 works, according to citation counts indexed alongside the ACM Digital Library record — a scale of uptake that puts it among the most influential AI-ethics-adjacent papers of the past decade.

    The proposal did not invent documentation practice from nothing. It imported habits that research-data communities already used. The Data Documentation Initiative (DDI), a metadata standard maintained by the DDI Alliance for the social, behavioural, and economic sciences, has specified variable-level dataset documentation since the early 2000s — well before the AI field adopted the term “datasheet.”

    What does a dataset datasheet actually document?

    Gebru et al.’s original template organises documentation into seven sections: motivation, composition, collection process, preprocessing/cleaning/labelling, uses, distribution, and maintenance. Each section is a set of prompts, not a checkbox — creators answer in prose, which is what makes the format adaptable across domains.

    • Motivation: why the dataset was created, who funded it, and what problem it addresses.
    • Composition: what the instances represent, how many there are, and whether sensitive attributes or personal data are present.
    • Collection process: how and from whom the data was gathered, and what consent or licensing applied.
    • Uses: tasks the dataset is suited for, and — critically — tasks it should not be used for.
    • Maintenance: who is responsible for updates, corrections, and retraction if problems surface.
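
    One way to hold a datasheet in a repository workflow is as a record with one free-text field per section, plus a check for sections left blank. The sketch below assumes that approach; the seven field names follow the sections listed above, while the `Datasheet` class and the `unanswered_sections` helper are hypothetical and not part of Gebru et al.'s template.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Datasheet:
    """One prose answer per section of the seven-section template."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""


def unanswered_sections(sheet: Datasheet) -> List[str]:
    """List the sections whose answer is empty or whitespace only."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


if __name__ == "__main__":
    draft = Datasheet(
        motivation="Built to benchmark dialect identification.",
        uses="Suitable for classification research; not for profiling speakers.",
    )
    print(unanswered_sections(draft))
```

    Keeping the answers as prose rather than enumerated values matches the prompt-based format: the check can tell that a section was skipped, but judging whether an answer is adequate still needs a human reader.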

    Adjacent frameworks document different units of the same pipeline. Model Cards for Model Reporting (Mitchell et al., Google, 2019) document a trained model’s performance across demographic subgroups rather than the training data itself. The Dataset Nutrition Label, developed by the Data Nutrition Project (originating at Harvard and MIT), condenses similar information into a scannable label modelled on food nutrition facts. The table below maps how these efforts differ.

    Framework | Origin | Unit documented | Primary audience
    Datasheets for Datasets | Gebru et al., 2018 (arXiv/ACM) | Dataset provenance and composition | Dataset creators and consumers
    Model Cards for Model Reporting | Mitchell et al., Google, 2019 | Trained model performance | Model deployers and auditors
    Dataset Nutrition Label | Data Nutrition Project, Harvard/MIT | Dataset health at a glance | Practitioners screening datasets quickly
    Datasheets for Digital Cultural Heritage | Europeana Research/EuropeanaTech, 2023 | Heritage collection reuse context | GLAM institutions and researchers

    The FAIR Data Principles — Findable, Accessible, Interoperable, Reusable, set out by Wilkinson et al. in Scientific Data (2016) — were written for research data broadly, not for AI training corpora specifically. Datasheets operationalise the “Reusable” pillar in particular: a dataset without documented provenance, licensing, and known limitations cannot be responsibly reused, regardless of how accessible its files are.

    This is a FAIR-adjacent practice rather than a formal extension of FAIR itself, and research offices should frame it that way rather than treating “datasheet” and “FAIR-compliant” as synonyms. A dataset can be technically Findable and Accessible while still shipping with a thin or absent datasheet — the two efforts solve overlapping but distinct problems.

    Dataset-level documentation also underpins dataset citation. The Force11 Joint Declaration of Data Citation Principles (2014) established that datasets should be cited as first-class research outputs, and registration agencies such as DataCite issue the DOIs that make that citation persistent. A datasheet gives the context a citation alone cannot: not just that a dataset exists and where, but what it contains and how it may legitimately be used.

    Answer-first questions on datasheets for datasets

    What are datasheets for datasets?

    Datasheets for datasets are structured documents that record a dataset’s motivation, composition, collection process, and intended uses. They were proposed by Gebru et al. in 2018 to give dataset creators and consumers a shared, standardised record — closing the gap between how thoroughly software and hardware components are documented and how poorly datasets typically are.

    What information does a dataset datasheet include?

    A complete datasheet covers seven areas: motivation, composition, collection process, preprocessing and labelling, recommended uses, distribution terms, and maintenance responsibility. Creators answer narrative prompts under each heading rather than filling in a fixed schema, which is why the format has been adapted for domains as different as machine learning corpora and digitised cultural heritage collections.

    How do datasheets differ from model cards?

    Datasheets document the dataset — its provenance, composition, and licensing. Model cards, introduced by Mitchell et al. at Google in 2019, document the trained model built from that data, including performance disaggregated across demographic groups. The two are complementary: a model card without a corresponding dataset datasheet leaves the training-data provenance question unanswered.

    What this means for research offices

    Research administration has treated dataset documentation as a data-management-plan checkbox for years; AI training-data transparency debates are now forcing the same discipline onto machine learning teams. Institutions that already run mature research-data-management functions have a genuine head start: DMP review, licensing checks, and provenance tracking are core competencies, not new ones.

    One overlooked lever is contributor recognition. CASRAI originated the CRediT contributor role taxonomy in 2014. The standard is now stewarded by NISO as ANSI/NISO Z39.104-2022. CRediT’s Data Curation role exists precisely to credit the labour of managing, annotating, and maintaining research data for reuse — the same labour a datasheet documents. Research offices that already apply CRediT to publications have a ready-made mechanism for recognising the people who write and maintain dataset datasheets, rather than letting that work go uncredited.

    • Require a datasheet (or equivalent provenance record) as a condition of institutional data-repository deposit, alongside existing licensing checks.
    • Map datasheet authorship to CRediT’s Data Curation role in institutional repository metadata.
    • Treat AI training-data provenance requests from partners and funders as an extension of existing data-management-plan review, not a new workflow.

    Where the practice is heading

    Uptake outside machine learning is accelerating. The Europeana Research Community and EuropeanaTech Community published Datasheets for Digital Cultural Heritage Datasets in the Journal of Open Humanities Data in 2023 (DOI: 10.5334/johd.124), adapting the template for collections that were digitised long after their original creation. A revised Version 2 template was released in July 2025, with alignment to the DCAT-AP data-portal application profile identified as ongoing work.

    AI training-data transparency requirements are converging on the same documentation habits that research-data management has practised for two decades, under the Data Documentation Initiative and FAIR principles alike. Research offices that recognise datasheets as an extension of existing data governance — rather than a novel AI-specific burden — will be better positioned to advise both AI developers and dataset creators as scrutiny of training-data provenance intensifies.

  • Research Data Steward Job Description and Skills

    A research data steward is the named individual within a university, institute, or funded project who takes operational responsibility for the quality, FAIR compliance, documentation, and lifecycle management of a defined set of research datasets. The role is distinct from the data owner, who holds accountability and sign-off authority, and from the data custodian, who runs the technical storage infrastructure. It sits inside the institutional research data management (RDM) team, typically reporting through the research office or library, and exists specifically because generic corporate data-steward job descriptions do not map cleanly onto grant-funded, multi-investigator, publicly scrutinised research data.

    Corporate data stewardship (the model most job-description templates online describe) is built around commercial master data, customer records, and regulatory compliance such as GDPR. Research data stewardship is built around a different set of pressures: funder-mandated Data Management Plans (DMPs), the FAIR Guiding Principles, discipline-specific repositories, and long-term reuse by researchers who were not part of the original project. This article defines the research-specific version of the role, maps it against the data owner and data custodian, and shows exactly where it sits in an institutional RDM structure.

    What Does a Research Data Steward Do?

    A research data steward manages the day-to-day quality, description, and reuse-readiness of research datasets on behalf of a principal investigator, department, or institutional repository. The role is operational, not accountable: a data steward implements policy, while a data owner sets it.

    Core duties typically include:

    • Reviewing datasets against the FAIR Guiding Principles — Findable, Accessible, Interoperable, Reusable — before deposit in a repository.
    • Writing and maintaining metadata, codebooks, and data dictionaries so a dataset is comprehensible to someone outside the original research team.
    • Advising researchers on Data Management Plan (DMP) compliance during grant applications and at project milestones.
    • Coordinating with disciplinary or institutional repositories on deposit, embargo periods, and licence selection.
    • Liaising with the data custodian (IT/systems) on storage, backup, and access-control implementation.
    • Flagging data quality issues — missing consent documentation, inconsistent variable coding, broken file formats — before they reach publication or reuse.
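
    The last duty, catching inconsistent variable coding, is one of the few that lends itself to a quick script. The sketch below assumes the dataset is available as a list of row dictionaries and the codebook as a mapping from variable name to its allowed values; both structures, the function name and the example values are hypothetical.

```python
def find_coding_issues(rows, codebook):
    """Return (row_index, variable, value) for each value not in the codebook.

    A variable missing from a row is reported with the value None.
    """
    issues = []
    for index, row in enumerate(rows):
        for variable, allowed in codebook.items():
            value = row.get(variable)
            if value not in allowed:
                issues.append((index, variable, value))
    return issues


if __name__ == "__main__":
    codebook = {"sex": {"F", "M"}, "consent": {"yes"}}
    rows = [
        {"sex": "F", "consent": "yes"},
        {"sex": "female", "consent": "yes"},  # inconsistent coding
        {"sex": "M"},                         # consent not recorded
    ]
    print(find_coding_issues(rows, codebook))
    # [(1, 'sex', 'female'), (2, 'consent', None)]
```

    A report like this gives the steward something specific to take back to the research team before deposit, rather than a general note that the data "needs cleaning".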

    UKRI’s Concordat on Open Research Data (2016) states that institutions are expected to have “clearly assigned responsibilities for the management of research data,” which is the direct policy basis most UK universities cite when creating dedicated data steward posts inside RDM or library services.

    Research Data Steward vs Data Owner vs Data Custodian

    These three roles are frequently conflated in generic data-governance content, but in a research setting they map to distinct, complementary functions. The data owner holds accountability; the data steward holds operational responsibility; the data custodian holds technical infrastructure responsibility.

    Role | Primary focus in RDM | Typical post-holder | Accountable for
    Data owner | Accountability and sign-off | Principal Investigator or Head of Department | Decisions on access, sharing, and retention of a specific dataset
    Data steward | Operational quality and FAIR compliance | Research data steward / RDM officer, often in the library or research office | Metadata, documentation, DMP compliance, deposit readiness
    Data custodian | Technical storage and access control | Research IT / systems administrator | Backup, encryption, storage infrastructure, access provisioning

    A single dataset can pass through all three roles: the PI (owner) approves that a dataset can be shared, the data steward prepares it to FAIR standard and selects the repository and licence, and the data custodian executes the technical transfer and sets the access permissions.

    What Skills and Qualifications Does the Role Require?

    Research data stewards need a blend of technical data-management skills and subject-domain fluency that generic corporate data-steward job descriptions rarely specify. Institutions increasingly treat this as a distinct career pathway rather than an IT-adjacent generalist role.

    • Working knowledge of the FAIR principles and metadata standards (Dublin Core, DDI, discipline-specific schemas).
    • Familiarity with persistent identifier infrastructure — DOIs assigned via DataCite, and researcher identifiers via ORCID — for correctly attributing and citing datasets.
    • Understanding of funder DMP requirements, including Horizon Europe’s and cOAlition S’s expectation that funded research data be FAIR by default.
    • Basic data-cleaning and documentation skills (spreadsheet/database literacy, controlled vocabularies, version control).
    • Communication skills sufficient to negotiate data-sharing terms between researchers, ethics committees, and repository managers.

    Professional bodies including ARMA (Association of Research Managers and Administrators) and INORMS now track research data stewardship as a recognised strand within the broader research-administration career pathway, reflecting its growing separation from generic corporate data governance.

    How Does This Differ from the CRediT “Data Curation” Role?

    The ANSI/NISO Z39.104-2022 CRediT taxonomy — originated by CASRAI in 2014 and now stewarded by NISO — includes “Data Curation” as one of fourteen contributor roles credited on a published paper. This is a per-publication authorship credit, not a job title or institutional post. A research data steward, by contrast, is an ongoing operational role that may perform data-curation work across many projects and papers, only some of which will formally credit them under the CRediT taxonomy. Conflating the two is a common error in job-description drafting.

    Where Does the Role Sit in the Institutional RDM Team?

    Research data stewards typically sit within one of three institutional homes: the library/research-data-services team, the central research office, or a departmental/faculty RDM function. Reporting lines vary, but the steward almost always works across, not inside, individual research groups.

    • Library-based model: data steward reports into research data services alongside repository managers and scholarly-communications staff — common where the institution treats RDM as an extension of open-access infrastructure.
    • Research-office model: data steward sits alongside grants and ethics administrators, closer to the DMP-compliance and funder-reporting workflow.
    • Departmental model: larger science faculties sometimes embed a data steward within a department, working directly with PIs on discipline-specific formats and repositories.

    In all three models, the data steward is functionally accountable for applying institutional data-governance policy (set by data owners at PI or departmental-head level) while collaborating operationally with IT-based data custodians on infrastructure. The four core stewardship areas identified in institutional data-governance models (operational oversight; data quality; privacy, security and risk management; and policies and procedures) apply directly to this reporting structure.

    Answer-First Q&A

    What skills do you need to be a data steward?

    A data steward needs both technical and business-facing skills: metadata and data-modelling literacy, familiarity with data-quality tooling, and strong communication skills to translate governance policy into day-to-day research practice. In a research context, this also requires knowledge of FAIR principles, funder DMP requirements, and discipline-specific repository standards.

    What are the four main roles of an effective data stewardship model?

    An effective stewardship model groups responsibilities into four areas: operational oversight; data quality; privacy, security and risk management; and policies and procedures. Research data stewards typically own operational oversight and data quality directly, while collaborating with data owners and custodians on the remaining two areas.

    What makes a good data steward?

    A good data steward combines subject-domain credibility with disciplined documentation habits — able to identify data-quality problems early, communicate clearly with both researchers and technical staff, and apply governance rules consistently. In research settings, respect from the researcher community is essential, since the steward has no direct authority over the data owner.

    What is another title for a data steward?

    Common alternative titles include research data manager, data curator, RDM officer, and domain data steward. Institutions vary in naming, but the underlying responsibilities — FAIR compliance, metadata quality, and DMP support — remain consistent across these titles.

    Implications for Research Institutions

    As funders including UKRI, Horizon Europe, and cOAlition S tighten FAIR data requirements within grant conditions, institutions without a clearly defined research data steward role risk inconsistent DMP compliance and poor dataset discoverability after project closure. Writing a job description that borrows directly from generic corporate data-governance templates will under-specify the FAIR, DMP, and repository-liaison duties that make the research variant of the role effective.

    Institutions building or revising this post should draft the job description around the three-way split set out above — owner accountability, steward operations, custodian infrastructure — rather than treating “data steward” as a single undifferentiated data-governance title.

  • FAIR Principles Data Maturity: Score Against RDA

    FAIR data maturity is scored by testing each dataset against the 41 indicators of the Research Data Alliance’s FAIR Data Maturity Model, grading Findability, Accessibility, Interoperability and Reusability separately, then weighting results by each indicator’s priority tier: essential, important, or useful. FAIR data management moves from an abstract commitment to a measurable score once an institution runs this test consistently across its repository.

    FAIR data is data that meets the Findable, Accessible, Interoperable and Reusable criteria first published by Wilkinson et al. in Scientific Data in 2016 — a paper now cited more than 22,000 times. This guide is a practical scoring walkthrough, not another explainer of what the four letters mean: it shows research offices how to actually audit existing datasets and repositories against the RDA model and turn the result into a remediation plan.

    What is the RDA FAIR Data Maturity Model?

    The RDA FAIR Data Maturity Model is a specification published by a Research Data Alliance working group in 2020 to standardise how organisations test FAIRness. Before it existed, dozens of institutions had built incompatible local checklists, making it impossible to compare a “FAIR score” from one repository against another.

    The model does not ship as software. It is a reference document that defines:

    • 41 indicators — testable statements mapped to the fifteen core GO FAIR sub-principles (F1–F4, A1–A2, I1–I3, R1–R1.3)
    • Three priority tiers — essential, important and useful — so institutions can triage effort rather than treat every indicator as equally urgent
    • Evaluation guidance — worked examples for testing each indicator against real metadata and data objects, rather than self-reported compliance

    Because the indicators trace directly to the GO FAIR principles, a dataset that scores well against the RDA model is, by construction, meeting the same criteria described in the original 2016 Scientific Data paper — just with a repeatable measurement attached.

    How does FAIR maturity scoring actually work?

    Scoring is done indicator by indicator, not principle by principle. Most institutions that implement the RDA model score each of the 41 indicators on a simple 0–4 scale — 0 (not implemented) through 4 (fully implemented) — then multiply by a priority weight before aggregating to a per-dataset and per-repository total.

    FAIR letter | Sub-principles tested | Typical essential-tier evidence
    Findable | F1–F4 | Persistent identifier (DOI via DataCite), indexed metadata record, machine-readable catalogue entry
    Accessible | A1–A2 | Retrieval via an open protocol (HTTPS), metadata that resolves even if the data itself is restricted
    Interoperable | I1–I3 | Structured, non-proprietary format; controlled vocabularies; qualified links to related records
    Reusable | R1–R1.3 | Machine-readable licence, documented provenance, alignment with a domain metadata standard

    A dataset that carries a DOI and open licence but lacks controlled vocabulary terms will score high on Findable and Reusable, and low on Interoperable — the point of indicator-level scoring is precisely to surface that kind of uneven profile, which a single pass/fail “is it FAIR?” verdict would hide.
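    The weighting step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the RDA specification: the tier weights (3, 2, 1), the indicator labels and the scores are all hypothetical, and dividing by the total weight (so each letter stays on the 0–4 scale) is one possible aggregation choice among several.

```python
from collections import defaultdict

# Hypothetical tier weights; the RDA model defines the tiers, not these numbers.
TIER_WEIGHTS = {"essential": 3, "important": 2, "useful": 1}

# (FAIR letter, illustrative indicator label, priority tier, score on the 0-4 scale)
indicator_scores = [
    ("F", "persistent identifier", "essential", 4),
    ("F", "indexed metadata", "important", 3),
    ("I", "controlled vocabulary", "important", 0),
    ("I", "non-proprietary format", "essential", 2),
    ("R", "machine-readable licence", "essential", 4),
]


def weighted_profile(rows):
    """Return the weighted mean score (0-4) for each FAIR letter."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for letter, _label, tier, score in rows:
        if not 0 <= score <= 4:
            raise ValueError(f"score out of range: {score}")
        weight = TIER_WEIGHTS[tier]
        weighted_sum[letter] += weight * score
        weight_total[letter] += weight
    return {letter: weighted_sum[letter] / weight_total[letter] for letter in weighted_sum}


profile = weighted_profile(indicator_scores)
for letter, value in sorted(profile.items()):
    print(f"{letter}: {value:.2f}")
```

    With these made-up inputs the dataset comes out high on Findable and Reusable and low on Interoperable, which is the uneven profile that indicator-level scoring is meant to expose.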

    Manual vs automated assessment: which tool fits?

    Two complementary assessment routes exist. Automated tools are fast but only test what a machine can verify; manual review is slower but catches the indicators that require human judgement, such as whether a licence is genuinely clear or a vocabulary is genuinely domain-appropriate.

    Tool / method | Coverage of the 41 indicators | Output | Best suited to
    F-UJI (FAIRsFAIR project) | Machine-testable subset only, roughly 17 metrics derived from the RDA indicators | Automated percentage score per FAIR letter, run against a DOI | Bulk baseline scans across a whole repository
    FAIR-Aware (DANS) | Self-assessment questionnaire, not indicator-scored | Qualitative readiness report and recommendations | Researchers preparing a dataset before deposit
    Manual RDA specification review | All 41 indicators, including human-judgement ones | Full indicator-by-indicator score with evidence notes | Institutional audits and remediation planning

    A hybrid approach is the most defensible for an institution-wide programme: run an automated scan across every repository record for a fast baseline, then reserve manual review for the essential-tier indicators no tool can verify — licence clarity, provenance completeness and domain-standard alignment.

    A step-by-step scoring walkthrough

    The following sequence turns the RDA model from a reference document into a repeatable institutional process.

    1. Select a representative sample. Pull datasets across disciplines, repository platforms, and funder mandates — a sample skewed toward one department will misstate institutional maturity.
    2. Map each dataset’s DOI or identifier record and run an automated F-UJI scan for the machine-testable indicators before any manual work begins.
    3. Score the remaining essential-tier indicators manually, checking licence text, metadata schema, and vocabulary choice against the evidence guidance in the RDA specification.
    4. Weight and aggregate. Multiply each indicator score by its priority weight, sum within each FAIR letter, then average across the sample to produce a repository-level maturity profile.
    5. Report by weakest letter, not overall average. An institution scoring 3.6/4 on Findable but 1.2/4 on Interoperable needs a vocabulary-adoption project, not a generic “improve FAIR compliance” action item.

    Worked example — three datasets from the same institutional repository, scored on the 0–4 scale before weighting:

    Dataset | Findable | Accessible | Interoperable | Reusable
    Clinical trial dataset (restricted access) | 4 | 3 | 2 | 3
    Environmental sensor archive | 3 | 4 | 3 | 2
    Survey microdata (open) | 2 | 4 | 1 | 4

    This profile (strong on Accessible, with a mean of about 3.7, and weakest on Interoperable, with a mean of 2.0 and no dataset scoring above 3) is a genuinely institution-specific finding that a generic FAIR explainer cannot give you. Only a scored audit surfaces it, and it points to one priority fix (adopting a shared controlled vocabulary at ingest) rather than four separate ones.
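    The per-letter averages behind that reading can be reproduced directly from the worked example. The sketch below uses the unweighted scores exactly as tabulated; the variable names are illustrative, and a real audit would apply priority weights first.

```python
from statistics import mean

# Scores copied from the worked example (0-4 scale, before weighting).
datasets = {
    "Clinical trial dataset (restricted access)": {"F": 4, "A": 3, "I": 2, "R": 3},
    "Environmental sensor archive": {"F": 3, "A": 4, "I": 3, "R": 2},
    "Survey microdata (open)": {"F": 2, "A": 4, "I": 1, "R": 4},
}

letters = ["F", "A", "I", "R"]

# Average each FAIR letter across the sample to get a repository-level profile.
repository_profile = {
    letter: mean(scores[letter] for scores in datasets.values()) for letter in letters
}

# Report by weakest letter rather than by overall average.
weakest = min(repository_profile, key=repository_profile.get)
strongest = max(repository_profile, key=repository_profile.get)

for letter in letters:
    print(f"{letter}: {repository_profile[letter]:.2f}")
print(f"weakest letter: {weakest}, strongest letter: {strongest}")
```

    Findable and Reusable both average 3.0 here, so the overall mean would hide the gap that the Interoperable column makes obvious.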

    Common questions about FAIR data scoring

    What are FAIR principles for data?

    FAIR principles are four criteria — Findable, Accessible, Interoperable and Reusable — first published in a 2016 Scientific Data paper by Wilkinson et al. They require datasets to carry a persistent identifier, standardised retrieval protocols, shared vocabularies and machine-readable licensing, so both humans and software can locate and reuse research data reliably.

    What are the four pillars of the FAIR data principles?

    The four pillars are Findable (unique persistent identifiers and rich metadata), Accessible (standardised, open retrieval protocols), Interoperable (shared vocabularies and qualified references) and Reusable (clear licensing, provenance and community standards). The RDA FAIR Data Maturity Model breaks these four pillars into 41 individually testable indicators.

    What are the FAIR data principles of UKRI?

    UKRI does not publish a separate FAIR standard. Its research councils, including NERC’s Environmental Data Service, require grant-funded datasets to follow the same GO FAIR-published Findable, Accessible, Interoperable and Reusable principles, citing benefits including increased citation, stronger research integrity, and compliance with data management plan commitments.

    What are the FAIR principles of GDPR?

    FAIR and GDPR address different concerns and are not in conflict. FAIR governs how discoverable and reusable data and its metadata are, while GDPR governs lawful processing of personal data. A dataset containing personal information can be fully FAIR (richly described and findable) while access to the underlying records stays restricted under GDPR-compliant authorisation.

    What this means for research data offices

    A scored FAIR audit gives research offices something a qualitative checklist cannot: a repository-level baseline that can be re-measured after each remediation cycle. Institutions preparing data management plan compliance evidence for UKRI, Horizon Europe, or cOAlition S-aligned funders can cite the same indicator scores as their supporting evidence, rather than producing a fresh narrative justification each time.

    Scoring also clarifies where FAIR and openness diverge. Following the “as open as possible, as closed as necessary” principle, a dataset can score highly on all four FAIR letters while remaining access-controlled — the metadata is open and richly described even when the underlying records are not. Institutions handling Indigenous or community-originated data should additionally weigh the CARE Principles — Collective Benefit, Authority to Control, Responsibility and Ethics — published by the Global Indigenous Data Alliance, which govern who controls reuse decisions rather than how discoverable the data is.

    The practical next step after a first scoring pass is not a single “get to 100%” target — no dataset needs every useful-tier indicator satisfied — but a prioritised backlog built from essential-tier gaps, feeding directly into repository ingest workflows and metadata templates so the next deposit scores higher without a second audit.

  • Data Management Plans for Qualitative Research: FAIR Meets Consent and Anonymisation

    A data management plan for qualitative research must do something generic STEM-oriented DMP templates rarely address: reconcile funder mandates for FAIR (Findable, Accessible, Interoperable, Reusable) data with an ethical and legal duty to restrict access to identifiable interview, focus-group and observational data. The two obligations are not opposites — a well-built plan satisfies both by specifying tiered access, consent-driven sharing limits and documented anonymisation, rather than treating “open data” and “protected data” as a binary choice.

    A data management plan (DMP) is a written document, created before data collection begins, that specifies how a project will collect, document, store, protect, share and preserve its research data throughout the data lifecycle. For qualitative and human-subjects research, the plan must additionally specify how consent scope, anonymisation depth and legal basis under data protection law will be recorded and enforced at each stage.

    What is a data management plan for qualitative research?

    A qualitative DMP is the same core document required for any funded project — covering data types, documentation, storage, sharing and preservation — but written around data that is, by design, difficult to de-identify. Interview transcripts, field notes, focus-group recordings and open-text survey responses carry contextual detail that resists the aggregation techniques used to anonymise quantitative datasets.

    Most UK funders require a DMP at application stage. The Economic and Social Research Council has required a data management and sharing statement since its 2010 Research Data Policy, and UK Research and Innovation’s common principles on data policy apply across its research councils. The plan is normally reviewed alongside the ethics application, since data handling and consent decisions are made together.

    Why do FAIR mandates and human-subjects ethics create tension?

    The FAIR principles — Findable, Accessible, Interoperable, Reusable, set out by Wilkinson et al. in Scientific Data (2016) — were written for structured, aggregatable datasets. Applied literally to identifiable human-subjects data, “Accessible” and “Reusable” can conflict with the duty to limit who can read a participant’s own words.

    Funders resolve this with a qualifier, not an exemption: data should be “as open as possible, as closed as necessary” — the formulation used in the European Commission’s Horizon Europe research data policy and echoed by UKRI. This does not excuse qualitative researchers from FAIR compliance; it changes what “Accessible” means in practice, from public download to documented, conditional access.

    FAIR principle | Qualitative-data constraint | Practical mitigation in the DMP
    Findable | Full metadata can itself be re-identifying (project title, participant demographics) | Publish a discoverable, de-identified metadata record with a persistent identifier even when the data itself stays closed
    Accessible | Transcripts/recordings contain direct identifiers and verbatim quotes | Deposit in a repository offering tiered or restricted access, not open download
    Interoperable | Coding schemes and qualitative software formats (e.g. NVivo, ATLAS.ti) are often proprietary | Document the coding frame and export a non-proprietary format alongside the proprietary project file
    Reusable | Reuse by unknown third parties was rarely covered by original consent | Use granular, re-use-specific consent wording that anticipates archiving and secondary analysis

    Under UK GDPR and the Data Protection Act 2018, personal data genuinely and irreversibly anonymised falls outside data protection law — but the Information Commissioner’s Office is explicit that this bar is high, and that pseudonymised data (a code replacing a name, with the key retained) remains personal data. The DMP must state, precisely, which version of the data at which stage is personal data and which is anonymised.

    Consent forms are the operative control, not an afterthought. A plan built for FAIR-GDPR reconciliation should specify:

    • Granular consent options separating participation, quotation in publications, and archiving of transcripts or recordings for secondary use
    • An explicit legal basis under UK GDPR Article 6 (and Article 9 condition where special category data — health, ethnicity, political opinion — is discussed)
    • A defined right-of-withdrawal window after which removal from an archived, de-identified dataset is no longer practicable
    • Named repository and access-control arrangements disclosed to participants at consent, not decided afterwards

    The UK Data Service — the Economic and Social Research Council’s designated data archive — operates a three-tier access model qualitative DMPs can cite directly: Open data (freely downloadable), Safeguarded data (registered users agree to an end-user licence), and Controlled data (approved researchers only, via a secure environment). Mapping each output to one of these tiers, rather than a vague “available on request” line, distinguishes a compliant plan from a defensive one.
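    One way to keep a plan honest about tier mapping is to record each output against a named tier and flag anything that falls outside the three. The sketch below is illustrative only: the output names and their tier assignments are hypothetical, and the check is a drafting aid rather than anything the UK Data Service provides.

```python
from enum import Enum


class AccessTier(Enum):
    """The three access tiers named in the text."""
    OPEN = "Open"
    SAFEGUARDED = "Safeguarded"
    CONTROLLED = "Controlled"


# Hypothetical project outputs; the assignments are illustrative, not recommendations.
planned_outputs = {
    "interview guide": AccessTier.OPEN,
    "coding frame": AccessTier.OPEN,
    "de-identified transcripts": AccessTier.SAFEGUARDED,
    "audio recordings": AccessTier.CONTROLLED,
    "field notes": "available on request",  # vague entry the check should flag
}


def unmapped_outputs(outputs):
    """Return outputs whose access arrangement is not one of the three named tiers."""
    return sorted(name for name, tier in outputs.items() if not isinstance(tier, AccessTier))


print(unmapped_outputs(planned_outputs))
```

    Any name this returns is an output the DMP still describes in vague terms and needs a tier decision before submission.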

    What anonymisation techniques belong in the plan?

    Anonymisation of qualitative data is a layered process, not a single redaction pass. A robust DMP names the specific technique used at each stage:

    1. Pseudonymisation during analysis — replacing names with participant codes while a separate, access-restricted key file links code to identity
    2. De-identification for sharing — removing or generalising indirect identifiers: exact job titles, place names, dates, organisational affiliations
    3. Redaction of unavoidable identifiers — where context itself identifies a small or unique population (a single named institution, a rare occupation), replacing detail with a category description
    4. Access-tier assignment — deciding, output by output, whether the residual disclosure risk permits Safeguarded deposit or requires Controlled access only
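    The first two layers can be sketched as simple text transformations. This is a minimal illustration with invented names and places, and it assumes the identifiers are already known and listed; string matching will miss anything not in the list, so it supplements rather than replaces a manual read-through.

```python
import re


def pseudonymise(transcript, names):
    """Replace each listed participant name with a stable code; return text and key."""
    key = {name: f"P{index:02d}" for index, name in enumerate(names, start=1)}
    text = transcript
    # Replace longer names first so a full name is handled before a shorter overlap.
    for name in sorted(key, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", key[name], text)
    return text, key


def generalise(text, replacements):
    """Swap listed indirect identifiers for bracketed category descriptions."""
    for detail, category in replacements.items():
        text = text.replace(detail, f"[{category}]")
    return text


# Invented example sentence; no real person or institution is referenced.
raw = "Maria Silva said the ward at Northfield General closed in March 2021."
coded, key_file = pseudonymise(raw, ["Maria Silva"])
shared = generalise(
    coded,
    {"Northfield General": "regional hospital", "March 2021": "early 2021"},
)
print(shared)
```

    The `key_file` mapping is what makes the coded text pseudonymised rather than anonymised, so it belongs in separate, access-restricted storage and never travels with the shared version.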

    The Qualitative Data Repository at Syracuse University and the Consortium of European Social Science Data Archives (CESSDA) both publish worked examples of this layered approach for interview and ethnographic data, and are appropriate repositories to name in a DMP for social-science-led projects.

    When is “not applicable” a legitimate data availability statement?

    A data availability statement (DAS) reading “not applicable” is defensible only when it is reasoned, not default. Journals following ICMJE and COPE guidance expect a DAS for every submission, including qualitative studies; the acceptable move is not silence but a stated justification — for example, that full transcripts cannot be shared because de-identification would strip the interpretive detail the analysis depends on, while a de-identified excerpt corpus or the coding frame is deposited instead.

    Reviewers increasingly flag blanket “not applicable” statements as a data-quality signal, because most qualitative datasets have something shareable — a codebook, an interview guide, aggregated theme frequencies — even when raw transcripts cannot be released. A DMP that pre-commits to this reasoning avoids a weak DAS being drafted under publication-deadline pressure.

    Common questions on qualitative data management plans

    What should a data management plan for qualitative research include?

    It should cover data types collected (transcripts, recordings, field notes), consent scope, anonymisation method, storage and access controls, the repository and access tier for shared outputs, and a retention and deletion schedule for identifiable source files.

    How do you anonymise qualitative data to comply with GDPR?

    Apply pseudonymisation during analysis, then de-identify indirect identifiers (locations, job titles, dates) before sharing. Under UK GDPR, only data anonymised to the point that re-identification is not reasonably likely falls outside data protection law; pseudonymised data remains regulated personal data.

    Do FAIR principles require open data sharing for human-subjects research?

    No. FAIR requires data to be findable and accessible under stated conditions, not necessarily open. Funders including UKRI and the European Commission apply the “as open as possible, as closed as necessary” standard, which explicitly permits restricted or controlled access for identifiable qualitative data.

    Can a data availability statement say “not applicable” for qualitative research?

    Only with a stated reason, such as re-identification risk that de-identification cannot remove. Journals following ICMJE and COPE practice expect a justified statement — noting what, if anything, is shareable (a codebook or interview guide) — rather than a blanket refusal.

    Implications and outlook

    Institutional research offices and ethics committees increasingly review DMPs and consent forms as one package, because anonymisation and access-tier decisions in the DMP determine what the consent form must promise participants. Research administrators supporting qualitative and mixed-methods proposals should treat the FAIR-versus-consent tension as a design question resolved at the DMP stage — via tiered access, granular consent and named repositories — not a compliance problem deferred to publication.

    As funders tighten machine-actionable DMP requirements, qualitative projects that specify access tiers and anonymisation methods in structured, repository-mappable language will be better placed to meet FAIR audit expectations and data protection obligations, without defaulting to an unjustified “closed” or “not applicable” position.

    For related definitions and standards context, see CASRAI’s research data terminology dictionary and the research administration resource hub.

  • Horizon Europe Data Management Plan: Field Guide

    The Horizon Europe Data Management Plan (DMP) template is a seven-section document — data summary, FAIR data, other research outputs, resource allocation, security, ethics, and other issues — that the European Commission recommends beneficiaries use to show how project data will be made Findable, Accessible, Interoperable and Reusable. It is due as a project deliverable within the first six months of a grant and must be kept current for the project’s duration.

    A Data Management Plan is a living document describing how research data and other outputs are generated, documented, secured, and shared, both during and after a funded project. Under Horizon Europe, the plan is not a formality: it is the mechanism through which the Commission’s FAIR data mandate (set out in the open science provisions of Article 17 and Annex 5 of the Model Grant Agreement, and explained in the Horizon Europe Programme Guide) is operationalised at project level.

    What the Horizon Europe DMP template covers

    The Commission’s recommended template, published on the Funding & Tenders Portal, structures the plan around seven headline sections. Each section exists to answer a specific compliance question the Commission needs resolved before, during, and after data generation.

    The data summary section opens the document. It requires beneficiaries to state whether data will be newly generated, reused from existing sources, or both; to describe expected data types, formats, and volumes; to explain how the data relates to the project’s objectives; and to identify who outside the consortium might find the data useful. If reuse of existing data was considered and rejected, that decision must be justified in writing.

    Section by section: what each part requires

    The template’s largest section — FAIR data — is split into four sub-parts that map directly onto the FAIR acronym. This is where most of the compliance burden, and most drafting errors, sit.

    Making data findable

    Beneficiaries must state whether data and metadata will receive a persistent identifier such as a DOI, which metadata standard will be used (for example Dublin Core or a discipline-specific schema), and whether search keywords will be added so the dataset can be indexed by data catalogues and harvested by aggregators such as OpenAIRE.
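    As an illustration of what a findable record looks like in machine-readable form, the sketch below builds a small schema.org Dataset description as JSON-LD. It uses schema.org rather than Dublin Core purely as one common option; the DOI, title and keywords are placeholders, and a real deposit would follow whatever schema the chosen repository requires.

```python
import json


def dataset_jsonld(name, description, doi, keywords, licence_url):
    """Build a minimal schema.org Dataset record as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",  # persistent identifier
        "keywords": keywords,                    # search terms for catalogues
        "license": licence_url,
    }


record = dataset_jsonld(
    name="Example sensor readings",
    description="Placeholder description for illustration.",
    doi="10.1234/example",  # hypothetical DOI, not a registered identifier
    keywords=["air quality", "sensor data"],
    licence_url="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(record, indent=2))
```

    The three elements this sub-section asks about (identifier, metadata standard, keywords) each appear as an explicit field, which is what lets a catalogue or aggregator index the record without human help.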

    Making data accessible

    This sub-section requires naming the trusted repository, stating whether data will be openly accessible by default, and — where access is restricted — providing a documented justification tied to legitimate interests, GDPR, security, or intellectual property constraints. Metadata should remain openly accessible under a CC0 licence even when the underlying dataset is closed.

    Making data interoperable

    Here the plan must name the vocabularies, ontologies, or methodologies used so the data can be combined with other datasets and read by non-project tools without manual reformatting.

    Increasing data reuse

    The final FAIR sub-section covers licensing terms (CC BY or CC0 by default, following the “as open as possible, as closed as necessary” principle), data provenance documentation, and the quality-assurance processes applied before deposit.

    Beyond the data summary and FAIR data, five further sections complete the template:

    Section | Core requirement | Primarily maps to
    3. Other research outputs | Software, models, workflows, protocols and physical samples managed under the same FAIR logic as data | Findable, Reusable
    4. Allocation of resources | Costs of making data FAIR, named responsibility for data management, long-term preservation funding | Accessible
    5. Data security | Secure storage, backup, recovery provisions, and secure transfer of sensitive data | Accessible, Reusable
    6. Ethical aspects | Handling of personal or sensitive data in line with GDPR and the project’s ethics review | Accessible
    7. Other issues | Any procedure or standard not captured elsewhere (e.g. national/funder-specific rules) | All four FAIR pillars

    Beneficiaries should also confirm that deposited datasets carry the minimum metadata fields the Commission expects: author(s), a description or abstract, the deposit date, the licence, embargo terms if any, and the grant project name, acronym, and number.
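    A pre-deposit check for those minimum fields might look like the sketch below. The field names are hypothetical stand-ins, since each repository names them differently, and the embargo field is treated as optional because the text requires it only where an embargo applies.

```python
# Hypothetical field names; a real repository schema will name these differently.
REQUIRED_FIELDS = (
    "authors",
    "description",
    "deposit_date",
    "licence",
    "grant_project_name",
    "grant_acronym",
    "grant_number",
)
OPTIONAL_FIELDS = ("embargo",)  # only needed when an embargo applies


def missing_fields(record):
    """Return required fields that are absent or empty in a deposit record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Placeholder record for illustration only.
draft_record = {
    "authors": ["A. Researcher"],
    "description": "Placeholder abstract for illustration.",
    "deposit_date": "2025-03-01",
    "licence": "CC BY 4.0",
    "grant_project_name": "Example Project",
    "grant_acronym": "EXAMPLE",
    "grant_number": "",  # left blank to show the check firing
}

print(missing_fields(draft_record))
```

    Running a check like this at ingest catches an empty grant number before the record is published, when it is still cheap to fix.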

    When the DMP is due, and whether the template is mandatory

    The European Commission states plainly on the template’s own download page that “the template is recommended but not mandatory” — beneficiaries may use their own format provided it still satisfies the underlying research-data-management obligations in the Grant Agreement.

    The timeline itself is fixed regardless of format:

    • Proposal stage: a short, typically one-page DMP outline is required as part of the proposal; a full DMP is not expected yet, except where a Work Programme calls for one at signature (for example, public-emergency topics).
    • Month 6: the full initial DMP must be submitted as a formal project deliverable.
    • During the project: the DMP is a living document; for projects running longer than 12 months, at least one updated version must be submitted.
    • Project end: a final DMP records how data were actually managed, preserved, and shared.
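    For planning purposes, the checkpoints above can be projected from a project start date. The sketch below assumes that "month 6" means six calendar months after the start date and that the final DMP coincides with the project end; the actual deliverable dates are whatever the grant agreement specifies, so treat the output as a reminder schedule only.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def dmp_milestones(project_start, duration_months):
    """Sketch of DMP checkpoints; exact dates come from the grant agreement."""
    return {
        "initial_dmp_due": add_months(project_start, 6),
        "final_dmp_due": add_months(project_start, duration_months),
        # Projects longer than 12 months need at least one updated version.
        "update_required": duration_months > 12,
    }


# Hypothetical project: starts 1 January 2025 and runs for 36 months.
plan = dmp_milestones(date(2025, 1, 1), 36)
for name, value in plan.items():
    print(f"{name}: {value}")
```

    Feeding these dates into the same calendar used for ethics and finance sign-off keeps the DMP deliverable from being discovered late.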

    Trusted repositories for deposit are those holding certifications such as CoreTrustSeal, Nestor Seal DIN 31644, or ISO 16363, or domain-specific repositories widely endorsed by the relevant research community. A 2024 metadata-readiness review commissioned by the European Research Council (Lazzeri, 2024) found that only a handful of repositories — including Zenodo, DANS, and HAL — met the Commission’s “Essential” metadata-readiness level outright, with several others still retrofitting mandatory fields. This is a practical planning risk: naming a repository in Section 2.2 that later turns out metadata-incomplete forces a DMP revision mid-project.

    Common questions about the Horizon Europe DMP

    Is the Horizon Europe DMP template mandatory?

    The template itself is optional; beneficiaries can use another format. What is mandatory, under the Grant Agreement, is establishing a DMP by month 6, keeping it updated, and depositing data in a trusted repository consistent with FAIR principles.

    When is the Data Management Plan due in Horizon Europe?

    A full initial DMP is due by month 6 of the project as a formal deliverable. A brief DMP-like outline is required earlier, at proposal stage, and the plan must be updated again for projects longer than 12 months.

    What do the FAIR data principles require in a Horizon Europe DMP?

    FAIR requires data to be Findable via persistent identifiers and rich metadata, Accessible through a trusted repository with a clear access policy, Interoperable using recognised standards or vocabularies, and Reusable under a documented licence with provenance and quality information.

    Does the DMP need to be updated after submission?

    Yes. The Commission treats the DMP as a living document. Any material change — a new dataset, a changed repository, an altered access decision — must be reflected in an updated version submitted as a subsequent deliverable.

    What this means for research administrators

    For UK institutions, the compliance picture has a domestic wrinkle worth flagging: the UK re-associated to Horizon Europe from 1 January 2024, and UK-based participants funded via the UKRI Horizon Europe Guarantee are still contractually bound by the same DMP and FAIR data obligations as any other beneficiary — the Guarantee changes the funding route, not the data-management requirements. Research offices supporting UKRI-guaranteed grants should apply the Horizon Europe template rather than a UKRI-native one.

    Institutions preparing their first Horizon Europe DMP can reduce drafting time using ARGOS, OpenAIRE’s free tool built around Horizon Europe’s own template structure, which prompts for each of the seven sections with contextual guidance. Because Section 4 requires naming a responsible individual and Section 5 requires named security measures, research administration teams should treat DMP drafting as a cross-functional task involving the data steward, the ethics lead, and the finance officer who costs the resource-allocation section — not a document a single researcher completes alone. Institutions with broader research administration workflows should build DMP review into the same grant-management checkpoints used for ethics and finance sign-off, rather than treating it as a stand-alone open-science task.

    The direction of travel across EU funding is toward more structured, field-level FAIR reporting rather than narrative compliance statements — institutions that build DMP drafting into standard grant-lifecycle checkpoints now will spend less time on ad hoc revisions as reporting expectations tighten.