v2026.1714 entries · CC-BY 4.0

Editorial · CASRAI

Genomic Data Repository Guide: ENA vs GEO vs SRA

ENA, GEO or SRA? Compare deposit rules, metadata standards and funder mandates before you submit.

By MCP Service
Published 3 Jul 2026 · 7 minute read

Choosing a genomic data repository comes down to three questions: what type of data you have, whether it is identifiable human data, and what your funder or journal mandates. Raw sequencing reads generally go to the European Nucleotide Archive (ENA) or the Sequence Read Archive (SRA) — two mirrored nodes of the same international collaboration — while processed gene-expression data belongs in the Gene Expression Omnibus (GEO). A genomic data repository is a persistent, publicly accessible database that assigns stable identifiers to deposited sequence or expression datasets so they can be cited, retrieved and reused under FAIR data principles.

ENA, GEO and SRA are the three repositories researchers encounter most often when funder or journal data-sharing policies require deposition of sequencing output. They are not interchangeable: each has a different primary data type, a different metadata standard, and a different position in the international data-sharing infrastructure. This guide compares them on deposit requirements, metadata standards and journal acceptance so research administrators and authors can make a defensible, mandate-compliant choice.

What is a genomic data repository?

A genomic data repository is a curated, publicly accessible database that archives DNA or RNA sequence data — raw reads, assembled genomes, or processed expression tables — and assigns each dataset a stable accession number for permanent citation. Repositories exist because journals and funders increasingly require that sequence data underlying a publication be deposited somewhere reviewers, readers and future researchers can retrieve it, rather than held privately by the authors.

The three most consulted repositories for sequencing output are the European Nucleotide Archive (ENA), the Sequence Read Archive (SRA), and the Gene Expression Omnibus (GEO). ENA and SRA are both members of the International Nucleotide Sequence Database Collaboration (INSDC), alongside the DNA Data Bank of Japan (DDBJ); records submitted to any one of the three INSDC members are mirrored across all of them, typically within 24 to 48 hours.
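
Because mirrored records appear in all three INSDC databases, it can help to know where a record was first submitted. The sketch below assumes the widely used convention, not described in this guide, that run accessions start with SRR (SRA), ERR (ENA) or DRR (DDBJ); the function name and the accessions in the usage note are illustrative only.

```python
import re

# Assumed prefix convention for run-level accessions. The node is where the
# record was first submitted, not the only place it can be retrieved.
RUN_PREFIX_TO_NODE = {
    "SRR": "SRA (NCBI)",
    "ERR": "ENA (EMBL-EBI)",
    "DRR": "DDBJ",
}

# Three-letter prefix followed by digits, e.g. ERR1234567.
RUN_ACCESSION = re.compile(r"^([SED]RR)(\d+)$")


def submitting_node(accession: str) -> str:
    """Return the INSDC node a run accession was originally submitted to."""
    match = RUN_ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised run accession: {accession!r}")
    return RUN_PREFIX_TO_NODE[match.group(1)]
```

Under these assumptions, `submitting_node("ERR1234567")` reports ENA as the original node even if the record was retrieved through SRA. Study, sample and GEO accessions use other prefixes and are deliberately rejected here.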

ENA vs GEO vs SRA: how do they differ?

The single biggest distinction is data type: ENA and SRA hold raw sequence reads (FASTQ, BAM, CRAM), while GEO holds processed functional genomics results — expression matrices, normalised counts and the experimental metadata describing them — and links out to SRA for the underlying raw reads. Geography and stewardship differ too: ENA is maintained by EMBL-EBI in the UK/Europe, while SRA and GEO are both maintained by the US National Center for Biotechnology Information (NCBI).

| Feature | ENA | GEO | SRA |
|---|---|---|---|
| Steward | EMBL-EBI (Europe) | NCBI (US) | NCBI (US) |
| Primary data type | Raw reads, assemblies, annotated sequences | Processed expression data + metadata | Raw sequencing reads |
| INSDC member | Yes | No (links to SRA) | Yes |
| Metadata standard | ENA checklists | MINSEQE / MIAME | INSDC submission schema |
| Access model | Open (controlled tier via EGA for identifiable human data) | Open | Open (controlled tier via dbGaP) |

A frequently overlooked distinction is access control. None of ENA, SRA or GEO is designed to hold identifiable human genomic or phenotypic data. That category of data belongs in a controlled-access archive — the European Genome-phenome Archive (EGA), jointly run by EMBL-EBI and the CRG, or NCBI’s database of Genotypes and Phenotypes (dbGaP) — where access is granted through a data access committee rather than opened to the public. Depositing identifiable clinical genomic data in an open repository such as ENA or SRA would breach both the repositories’ own policies and, in most jurisdictions, data protection law.

What are the deposit requirements for each repository?

Each repository sets its own submission checklist, but all three require a structured description of the experiment alongside the sequence files themselves.

  • ENA requires a study, sample, experiment and run object for each submission, described against one of ENA’s checklist templates (for example, the pathogen or invertebrate checklists), plus the raw read files.
  • SRA requires equivalent BioProject and BioSample records, submitted through NCBI’s submission portal, with reads in FASTQ or BAM/CRAM format.
  • GEO requires a MINSEQE-compliant description of the experimental design (samples, protocols, processed data matrix) and will route the corresponding raw reads to SRA as part of the same submission, generating a linked SRA accession automatically.
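
The ENA item in the list above can be turned into a simple pre-submission check. This is a minimal sketch rather than ENA's own validation: `PlannedSubmission` and `missing_items` are hypothetical names for an in-house checklist, and only the four object types and the read files named above are checked.

```python
from dataclasses import dataclass, field

# Object types named above as required for an ENA submission.
ENA_REQUIRED_OBJECTS = ("study", "sample", "experiment", "run")


@dataclass
class PlannedSubmission:
    """Hypothetical in-house record of what has been prepared so far."""
    prepared_objects: set[str] = field(default_factory=set)
    read_files: list[str] = field(default_factory=list)


def missing_items(submission: PlannedSubmission) -> list[str]:
    """List what is still missing, in the order of the checklist above."""
    missing = [obj for obj in ENA_REQUIRED_OBJECTS
               if obj not in submission.prepared_objects]
    if not submission.read_files:
        missing.append("raw read files")
    return missing
```

For a submission with only a study and a sample prepared and no files attached, `missing_items` returns the experiment, the run and the raw read files. The same pattern could be adapted for SRA by swapping in BioProject and BioSample as the required records.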

Because ENA and SRA mirror each other, a dataset submitted to one should not be resubmitted to the other: submitting twice creates duplicate, unlinked accessions rather than better coverage.

Which metadata standards apply?

Metadata quality, not just file deposition, is what makes a dataset FAIR — Findable, Accessible, Interoperable and Reusable, per the FAIR data principles first published by Wilkinson et al. in 2016. GEO submissions are assessed against MIAME (Minimum Information About a Microarray Experiment) for array data and MINSEQE (Minimum Information about a high-throughput Nucleotide Sequencing Experiment) for sequencing-based expression studies. ENA and SRA submissions follow INSDC’s shared sample and experiment metadata schema, supplemented by checklist-specific fields for the sample type in question.

Consistent metadata is also what allows a dataset to be discovered through cross-repository registries such as re3data and FAIRsharing, both of which index genomic repositories alongside thousands of other subject and generalist repositories.

Do journals and funders accept all three equally?

Most journal data-availability policies name an INSDC-compliant repository — ENA, SRA or DDBJ — as the acceptable destination for raw sequence data, and GEO or ArrayExpress for expression data. PLOS, for example, states that authors should select field-appropriate repositories and lists ENA, SRA, GEO and DDBJ among its recommended sequencing repositories, while also pointing authors to re3data and FAIRsharing when no field-specific option exists.

Funder policy is generally repository-agnostic within the INSDC family: the NIH Genomic Data Sharing Policy and the 2023 NIH Data Management and Sharing Policy both accept SRA for non-identifiable sequence data and dbGaP or an equivalent controlled-access archive for identifiable human data, without mandating SRA specifically over ENA. UK and European funders operating under UKRI or Horizon Europe open-science requirements similarly accept any INSDC-affiliated repository, reflecting the FAIR data principles rather than naming a single preferred database.

Frequently asked questions

What is the difference between ENA, GEO and SRA?

ENA and SRA both archive raw sequencing reads and mirror each other as INSDC members, differing mainly in which institution — EMBL-EBI or NCBI — hosts the submission. GEO instead archives processed gene-expression results and metadata, forwarding the associated raw reads to SRA automatically during submission.

Do I need to submit data to both GEO and SRA?

Not separately. When you submit a gene-expression study to GEO, the platform generates a linked SRA accession for the raw reads as part of the same workflow, so a single submission satisfies both repositories without duplicate uploads.

Is ENA the same as SRA?

No — they are separate databases run by different organisations that mirror the same underlying INSDC data. A dataset submitted to ENA in Europe becomes visible through SRA in the US within roughly one to two days, and vice versa, so researchers choose one, not both.

Which repository do funders require for genomic data?

Most funder policies, including NIH’s Genomic Data Sharing Policy and UKRI’s open research requirements, accept any INSDC-affiliated repository — ENA, SRA or DDBJ — for raw sequence data, plus GEO for expression data, rather than mandating one specific database.

What this means for research administrators

For institutions building data-management-plan templates or compliance checklists, the practical rule is to map deposition guidance to data type and access sensitivity rather than to a single named repository: raw non-identifiable reads to ENA or SRA, expression matrices to GEO, and any identifiable human genomic or clinical data to a controlled-access archive such as EGA or dbGaP. Framing repository choice this way keeps research administration guidance aligned with funder and journal policy regardless of which INSDC node an individual researcher prefers to use.
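
The rule above could be encoded in a compliance checklist roughly as follows. This is a sketch under simplifying assumptions: the `DataType` enum and `recommend_repositories` function are hypothetical names, only the two data types discussed in this guide are modelled, and funder or journal mandates still need to be checked separately.

```python
from enum import Enum


class DataType(Enum):
    RAW_READS = "raw sequencing reads"
    EXPRESSION = "processed expression data"


def recommend_repositories(data_type: DataType,
                           identifiable_human: bool) -> list[str]:
    """Map data type and access sensitivity to candidate repositories."""
    # Access sensitivity is checked first: identifiable human data goes to
    # a controlled-access archive, whatever its type.
    if identifiable_human:
        return ["EGA", "dbGaP"]
    if data_type is DataType.RAW_READS:
        return ["ENA", "SRA"]
    if data_type is DataType.EXPRESSION:
        return ["GEO"]
    raise ValueError(f"unsupported data type: {data_type!r}")
```

Checking identifiability before data type mirrors the guidance above: the access question overrides the data-type question, so an expression matrix derived from identifiable patients is routed to EGA or dbGaP rather than GEO.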

As funder mandates increasingly cite FAIR data principles explicitly rather than naming individual repositories, the durable compliance strategy is to select any INSDC-affiliated repository appropriate to the data type, document the accession number in the manuscript, and reserve controlled-access archives strictly for identifiable human data. Research offices that build this decision logic into deposit checklists now will need far less rework as funder policy language continues to converge on FAIR terminology rather than named databases.
