Schema.org Dataset Markup: A Developer’s Checklist for FAIR Metadata

Schema.org Dataset markup is the JSON-LD structured-data block, built on the Dataset type, that describes a dataset’s name, creator, licence, and distribution so search engines and repositories can index and cite it correctly. Implemented well, it is also the fastest practical route to satisfying the Findable and Accessible legs of the FAIR data principles.

The markup describes a dataset’s metadata, not its contents. This article is a working checklist for developers and data curators: which properties are required, which are recommended, how they map onto FAIR and research-data infrastructure such as DataCite, re3data and GA4GH, and what to validate before publishing.

What is Schema.org Dataset markup and why does it matter?
Which Dataset properties are required vs recommended?
How does Dataset markup align with FAIR, DataCite, re3data and GA4GH?
Step-by-step developer implementation checklist
Developer FAQs

What is Schema.org Dataset markup and why does it matter?

The Dataset type sits under CreativeWork in the Schema.org vocabulary and is used to describe metadata about a dataset — its subject, creator, licence and access route — not the data values themselves. Google indexes pages carrying this markup into Google Dataset Search, and research infrastructure tools reuse the same fields for citation and provenance tracking.

Schema.org itself reports that, as of 2024, over 45 million web domains mark up pages with more than 450 billion Schema.org objects, making it the dominant structured-data vocabulary on the web. For dataset publishers specifically, correct markup is not cosmetic SEO — it is the metadata layer that machines, catalogues, and AI systems read to decide whether your dataset exists at all.

Google’s Dataset structured-data guidelines state that Google also accepts equivalent metadata expressed in the W3C Data Catalog Vocabulary (DCAT) format and has experimental support for CSVW (“CSV on the Web”). Schema.org JSON-LD is therefore the recommended default, but not the only route into discovery tooling.

Which Dataset properties are required vs recommended?

Two properties are mandatory for Google Dataset Search eligibility: name and description. Everything else is recommended but materially affects both discoverability and FAIR compliance. The table below is the working checklist.

Property	Status	Purpose
`name`	Required	Unique, descriptive title — avoid reusing the same name across distinct datasets
`description`	Required	50–5,000 characters; Google Dataset Search only indexes the first 5,000 characters of any textual property
`identifier`	Recommended	A DOI or Compact Identifier; repeat the property if more than one identifier exists
`creator` / `funder`	Recommended	Person or Organization; Google explicitly recommends ORCID for people and ROR for institutions via the `sameAs` property
`license`	Recommended	A URL identifying a specific licence version, e.g. a versioned Creative Commons URL, not the unversioned root
`distribution`	Recommended	`DataDownload` object; `contentUrl` is required within it, `encodingFormat` is recommended
`keywords`, `citation`, `isAccessibleForFree`	Recommended	Improve topical matching and reuse signalling
`hasPart` / `isPartOf`	Recommended	Declares parent–child relationships between a collection dataset and its subsets
`includedInDataCatalog`	Recommended	Links the dataset to the `DataCatalog` (repository) it belongs to

Google’s guidance is explicit on three points: names must be unique across distinct datasets, licences must point to a version-specific URL, and the citation property must reference related academic articles. The citation property should not be reused to cite the dataset itself; that is handled through name, identifier, creator and publisher together.
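The checklist above can be turned into a quick pre-publication audit. The following is a minimal sketch, not Google’s validator: the function name audit_dataset, the choice of recommended properties, and the regular-expression heuristic for spotting a version-specific licence URL are all assumptions made for illustration.

```python
import re

REQUIRED = ("name", "description")
RECOMMENDED = (
    "identifier", "creator", "license", "distribution",
    "keywords", "includedInDataCatalog",
)

def audit_dataset(record: dict) -> dict:
    """Return blocking errors and advisory warnings for a Dataset dict."""
    errors, warnings = [], []

    for prop in REQUIRED:
        if not record.get(prop):
            errors.append(f"missing required property: {prop}")

    description = record.get("description", "")
    if description and not 50 <= len(description) <= 5000:
        errors.append("description must be 50-5,000 characters")

    for prop in RECOMMENDED:
        if prop not in record:
            warnings.append(f"missing recommended property: {prop}")

    # Heuristic (assumption): a version-specific licence URL contains
    # a dotted version number such as 4.0 somewhere in the string.
    licence = record.get("license")
    if isinstance(licence, str) and not re.search(r"\d+\.\d+", licence):
        warnings.append("license URL does not look version-specific")

    # contentUrl is required inside each DataDownload object.
    distributions = record.get("distribution", [])
    if isinstance(distributions, dict):
        distributions = [distributions]
    for i, dist in enumerate(distributions):
        if not dist.get("contentUrl"):
            errors.append(f"distribution[{i}] is missing contentUrl")

    return {"errors": errors, "warnings": warnings}

# Hypothetical record with a too-short description and an unversioned licence.
draft = {
    "name": "Example survey",
    "description": "Too short.",
    "license": "https://creativecommons.org/licenses/by/",
    "distribution": {"@type": "DataDownload", "encodingFormat": "text/csv"},
}
print(audit_dataset(draft))
```

Errors map to the properties the table marks as required; warnings map to recommended ones, so a record can pass with warnings but should not be published with errors.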

How does Dataset markup align with FAIR, DataCite, re3data and GA4GH?

The FAIR Guiding Principles — Findable, Accessible, Interoperable, Reusable — were formalised by Wilkinson et al. in Scientific Data (2016) and adopted widely across funder and repository policy since. Schema.org properties map directly onto them: identifier and name serve Findable; distribution.contentUrl and isAccessibleForFree serve Accessible; @context/@type and controlled variableMeasured values serve Interoperable; and license, creator, and citation together serve Reusable.
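The mapping in the paragraph above can be written down as data and used to report gaps. This is a rough sketch under a strong simplifying assumption: it only checks that top-level properties are present, which is a proxy for FAIR alignment and not a FAIR assessment. The names FAIR_PROPERTIES and fair_coverage are hypothetical.

```python
# Property-to-principle mapping as described in the text above.
FAIR_PROPERTIES = {
    "Findable": ("identifier", "name"),
    "Accessible": ("distribution", "isAccessibleForFree"),
    "Interoperable": ("@context", "@type", "variableMeasured"),
    "Reusable": ("license", "creator", "citation"),
}

def fair_coverage(record: dict) -> dict:
    """Map each FAIR principle to the listed properties the record lacks."""
    return {
        principle: [prop for prop in props if prop not in record]
        for principle, props in FAIR_PROPERTIES.items()
    }

# Hypothetical, partially filled record.
partial = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example survey",
    "identifier": "https://doi.org/10.1234/example-dataset",
}
for principle, missing in fair_coverage(partial).items():
    print(principle, "missing:", missing)
```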

This is where most generic Schema.org tutorials stop — and where research-data publishers need to go further. Three infrastructure layers sit downstream of your JSON-LD and expect it to be internally consistent:

DataCite: if your repository mints DOIs through DataCite, the DataCite Metadata Schema fields (creators, contributors, resourceType, rights) should match your Schema.org identifier, creator and license values exactly — mismatched names or licence URLs between the two records break the provenance chain a machine is trying to follow.
re3data: the global registry of research data repositories (re3data.org, operated with DataCite) indexes repository-level metadata, including declared standards and certifications. A repository whose dataset landing pages carry consistent, complete Schema.org markup is easier to register and audit in re3data than one relying on free-text descriptions alone.
GA4GH standards: for genomic and clinical datasets, the Global Alliance for Genomics and Health publishes discovery and access standards, including the Data Repository Service (DRS) and Data Use Ontology (DUO), that operate at a finer grain than Schema.org’s generic properties. Where a dataset carries GA4GH-compliant access controls, state that in free text (Schema.org’s conditionsOfAccess property, the description, or a linked policy document), since Schema.org has no dedicated property for machine-readable consent or use-ontology terms yet.

None of these three requires abandoning Schema.org. Each requires treating it as one layer in a metadata stack, not the whole stack. Keeping identifiers consistent across all three is the single highest-leverage FAIR improvement available to a repository team.
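A consistency check between the two records might look like the sketch below. It assumes the DataCite record has already been fetched and reduced to a plain dict with doi, creators (each with a name) and rightsList (each with a rightsUri); those keys mirror DataCite’s JSON field names as commonly exposed, but the exact shape of your record is an assumption to verify. The function names are hypothetical.

```python
def normalise_doi(value: str) -> str:
    """Strip common DOI prefixes and lowercase (DOIs are case-insensitive)."""
    value = value.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value

def compare_records(schema_org: dict, datacite: dict) -> list:
    """Return the Schema.org properties that disagree with the DataCite record."""
    mismatches = []

    if normalise_doi(schema_org.get("identifier", "")) != normalise_doi(datacite.get("doi", "")):
        mismatches.append("identifier")

    creators = schema_org.get("creator", [])
    if isinstance(creators, dict):
        creators = [creators]
    # Exact string comparison: "Jane Doe" and "Doe, Jane" count as a mismatch.
    so_names = sorted(c.get("name", "") for c in creators)
    dc_names = sorted(c.get("name", "") for c in datacite.get("creators", []))
    if so_names != dc_names:
        mismatches.append("creator")

    rights_uris = {r.get("rightsUri") for r in datacite.get("rightsList", [])}
    if schema_org.get("license") not in rights_uris:
        mismatches.append("license")

    return mismatches

# Hypothetical records: same DOI and creator, but different licence URLs.
landing_page = {
    "identifier": "https://doi.org/10.1234/EXAMPLE",
    "creator": {"@type": "Person", "name": "Doe, Jane"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
doi_record = {
    "doi": "10.1234/example",
    "creators": [{"name": "Doe, Jane"}],
    "rightsList": [{"rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode"}],
}
print(compare_records(landing_page, doi_record))
```

The strict comparison is deliberate: a trailing path segment or a reordered name is exactly the kind of small difference that breaks an automated provenance match.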

Step-by-step developer implementation checklist

Use this sequence when adding or auditing Dataset markup on a landing page:

Add a single <script type="application/ld+json"> block per canonical dataset landing page — not on list/search-result pages.
Set @context to https://schema.org/ and @type to Dataset.
Write a unique name and a description between 50 and 5,000 characters, using \n for line breaks in JSON-LD.
Populate creator and funder with nested Person/Organization objects, each carrying a sameAs pointing to an ORCID iD or ROR ID.
Add identifier (DOI preferred), a version-specific license URL, and relevant keywords.
Add distribution as one or more DataDownload objects with required contentUrl and recommended encodingFormat.
If the dataset belongs to a repository, add includedInDataCatalog; if it is a subset of a larger dataset, add isPartOf (or hasPart for the reverse).
Cross-check every identifier, creator name, and licence URL against your DataCite/DOI record — they must match exactly.
Validate with Google’s Rich Results Test and fix critical errors; non-critical warnings (such as missing organisation contactType) can be resolved but do not block eligibility.
Submit or update your sitemap so Google can (re)crawl the landing page, and monitor the Dataset rich-result status report in Search Console afterwards.
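The first six steps can be sketched as a small generator that emits the script block. This is an illustrative example only: the function name dataset_script_tag, the DOI, the repository URLs and the dataset itself are placeholders, and the ORCID iD shown is the fictitious sample iD used in ORCID’s own documentation.

```python
import json

def dataset_script_tag(properties: dict) -> str:
    """Wrap Dataset properties in a JSON-LD <script> block for a landing page."""
    payload = {"@context": "https://schema.org/", "@type": "Dataset", **properties}
    body = json.dumps(payload, indent=2, ensure_ascii=False)
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'

example = {
    "name": "Example coastal temperature survey 2020-2024",
    # json.dumps writes the line break below as \n, as step three requires.
    "description": (
        "Hourly sea-surface temperature readings from a hypothetical buoy network.\n"
        "Second paragraph describing collection methods and coverage."
    ),
    "identifier": "https://doi.org/10.1234/example-dataset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["sea-surface temperature", "oceanography"],
    "creator": [{
        "@type": "Person",
        "name": "Josiah Carberry",
        "sameAs": "https://orcid.org/0000-0002-1825-0097",
    }],
    "distribution": [{
        "@type": "DataDownload",
        "contentUrl": "https://repository.example.org/files/survey.csv",
        "encodingFormat": "text/csv",
    }],
    "includedInDataCatalog": {"@type": "DataCatalog", "name": "Example Repository"},
}

print(dataset_script_tag(example))
```

Generating the block from the same record that feeds the DOI registration, instead of hand-editing it in a template, makes the cross-check in step eight much easier to keep passing.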

Developer FAQs

What is schema.org used for?

Schema.org is a collaborative vocabulary for marking up structured data on web pages, in formats including JSON-LD, RDFa, and Microdata. Search engines such as Google, Bing, and Yandex use it to understand page content and power features like rich results, and the Dataset type specifically feeds Google Dataset Search.

What is the schema of a dataset?

A Schema.org dataset’s “schema” is its set of metadata properties — required fields (name, description) plus recommended fields such as identifier, creator, license, keywords, and distribution. Together these describe what the dataset is, who made it, and how to access it, without encoding the data values themselves.

What is the difference between schema.org and open graph?

Open Graph is a Facebook-originated protocol controlling how a link’s title, image, and description render when shared on social platforms. Schema.org is a broader, search-engine-oriented vocabulary covering hundreds of entity types, including Dataset, and is read primarily by search and AI-answer systems rather than social sharing cards.

Is schema.org still relevant?

Yes. Beyond its role in classic rich results, Schema.org markup is increasingly what AI answer engines extract to ground responses about entities and datasets, since it gives an unambiguous, machine-readable statement of facts that free text does not. For dataset publishers, it remains the most reliable route into both Google Dataset Search and AI-generated citations.

Treated as a checklist rather than a one-off tag, Schema.org Dataset markup becomes a maintenance discipline: every new dataset version, licence change, or repository migration should trigger a re-check of the JSON-LD block against the current DataCite and repository record. Repositories that keep this loop tight see fewer indexing gaps in Google Dataset Search and cleaner provenance chains when datasets are cited or aggregated elsewhere. As AI systems increasingly read structured metadata directly, that discipline is becoming a citation requirement, not just an SEO one.

For the institutional context around research data stewardship and contributor attribution, see CASRAI’s overview of research administration practice, and consult the open-research terminology dictionary for definitions of related metadata and standards terms.

July 4, 2026

GA4GH Passport: Cross-Border Genomic Data Access

A GA4GH Passport is a machine-readable digital identity that carries a researcher’s access permissions, called “Visas,” so that a data repository in another country can verify authorisation without the underlying genomic data ever leaving its home jurisdiction. Maintained by the Global Alliance for Genomics and Health (GA4GH), the Passport standard works alongside two companion specifications — Beacon, for federated discovery queries, and the Data Repository Service (DRS), for standardised object retrieval — to form the technical backbone of cross-border, FAIR-compliant genomic data access.

A GA4GH Passport is formally defined as a claim (ga4gh_passport_v1) containing a bundle of digitally signed Visas that encode a user’s identity and acquired data access permissions. This article explains how Passport, Beacon and DRS interlock to let a researcher in one jurisdiction analyse controlled-access datasets held in another, without patient-level data crossing a border.

What is a GA4GH Passport, exactly?
How do Passports authorise cross-border access?
What roles do Beacon and DRS play alongside Passport?
Who is implementing these standards today?
What are the governance requirements and limitations?
Answer-first Q&A
Implications for research administrators

What is a GA4GH Passport, exactly?

The GA4GH Passport standard, developed by GA4GH’s Data Use and Researcher Identity (DURI) Work Stream, provides a data model for encoding a researcher’s identity together with their acquired data access permissions as “Visas” that cannot be tampered with. The current specification is versioned 1.2.1 and defines the Passport as a set of one or more Visas bundled into a single OpenID Connect claim.

Each Visa is a signed JSON Web Token (JWT) that states one discrete fact about the holder — for example, that they are affiliated with a named institution, or that a named Data Access Committee (DAC) has granted them access to a specific dataset. Five standard Visa Types are defined: AffiliationAndRole, AcceptedTermsAndPolicies, ResearcherStatus, ControlledAccessGrants and LinkedIdentities. Combined, these Visas let a “Passport Clearinghouse” — the access-control component at a data repository — make an automated, auditable decision on whether to grant a specific request.
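To make the structure concrete, the sketch below builds a demonstration token and reads back its payload. It deliberately does not verify the signature, so it is suitable for inspection only, never for access decisions. The issuer, subject, dataset URL and timestamps are invented placeholders; the ga4gh_visa_v1 claim with its type, asserted, value, source and by fields follows the Visa claim format as described in the Passport specification.

```python
import base64
import json

def b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment, as JWTs do."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_segment = token.split(".")[1]
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

# Hypothetical Visa claims: one discrete fact about the holder.
visa_claims = {
    "iss": "https://visa-issuer.example.org",
    "sub": "researcher-123",
    "exp": 1767225600,
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "asserted": 1735689600,
        "value": "https://repository.example.org/datasets/710",
        "source": "https://dac.example.org",
        "by": "dac",
    },
}

# Demonstration token with a dummy signature segment (not a valid signature).
demo_token = ".".join([b64url({"alg": "RS256", "typ": "JWT"}), b64url(visa_claims), "c2ln"])

visa = decode_jwt_payload(demo_token)["ga4gh_visa_v1"]
print(visa["type"], "for", visa["value"])
```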

How do Passports authorise cross-border access?

Passports operationalise a “bring the analysis to the data” model rather than moving data to the researcher. A researcher authenticates once through a GA4GH Authentication and Authorisation Infrastructure (AAI) broker, which issues a Passport-scoped access token; that token, and the Visas within it, is then presented to whichever repository the researcher queries, wherever in the world it sits.

The receiving repository’s Passport Clearinghouse independently verifies the cryptographic signature on each Visa, checks it has a trust relationship with the issuing Broker and Visa Issuer, and evaluates whether the presented Visas satisfy its own access policy — for instance, requiring both a valid ResearcherStatus Visa and a matching ControlledAccessGrants Visa. If the checks pass, the repository executes the requested computation locally and returns only results; it does not release the raw dataset across the border.
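The policy step of that decision can be sketched as follows. The example assumes signature verification has already happened elsewhere (with a proper JWT library and the issuer’s published keys) and only shows the trust and policy evaluation. The trusted-issuer set, the required Visa Types and the function name clearinghouse_allows are hypothetical policy choices, not part of the specification.

```python
# Hypothetical local policy for one repository.
TRUSTED_ISSUERS = {"https://visa-issuer.example.org"}
REQUIRED_TYPES = {"ResearcherStatus", "ControlledAccessGrants"}

def clearinghouse_allows(verified_visas: list, dataset: str) -> bool:
    """Decide access from Visa claim sets whose signatures were already verified."""
    satisfied = set()
    for claims in verified_visas:
        # Ignore any Visa from an issuer this repository does not trust.
        if claims.get("iss") not in TRUSTED_ISSUERS:
            continue
        visa = claims.get("ga4gh_visa_v1", {})
        visa_type = visa.get("type")
        # A grant only counts if it names the dataset being requested.
        if visa_type == "ControlledAccessGrants" and visa.get("value") != dataset:
            continue
        satisfied.add(visa_type)
    return REQUIRED_TYPES <= satisfied

dataset_id = "https://repository.example.org/datasets/710"
status_visa = {
    "iss": "https://visa-issuer.example.org",
    "ga4gh_visa_v1": {"type": "ResearcherStatus", "value": "https://policy.example.org/bona-fide"},
}
grant_visa = {
    "iss": "https://visa-issuer.example.org",
    "ga4gh_visa_v1": {"type": "ControlledAccessGrants", "value": dataset_id},
}
untrusted_grant = dict(grant_visa, iss="https://unknown-issuer.example.net")

print(clearinghouse_allows([status_visa, grant_visa], dataset_id))
print(clearinghouse_allows([status_visa, untrusted_grant], dataset_id))
```

The second call is refused because the only matching grant comes from an issuer outside the trusted set, which is the behaviour the trust-relationship check is meant to guarantee.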

This distinguishes Passports from older, dataset-by-dataset “material transfer” models. Instead of a bilateral agreement copying a file to a second institution, a Passport is portable: once issued, a Visa can be re-presented to any Clearinghouse that trusts its issuer, and the same digital identity works across a whole federation of otherwise-independent repositories.

What roles do Beacon and DRS play alongside Passport?

Passport establishes who is allowed to see what; it does not, on its own, define how a repository is queried or how a specific file is fetched. Two further GA4GH specifications complete the practical stack.

Standard	What it does	Where Passport fits in
GA4GH Passport	Encodes researcher identity and data access permissions as signed Visas	The authorisation layer other services rely on
GA4GH Beacon	A federated discovery API broadcasting a query (“does any dataset here contain variant X?”) across repositories, returning presence/absence or aggregate answers only	Beacon v2 supports Passport-based authentication so registered- and controlled-access Beacons answer detailed queries to authorised users only
GA4GH DRS (Data Repository Service)	A standardised API and URI scheme (`drs://`) resolving a stable object identifier to its actual file location, independent of hosting cloud or institution	Once a Passport authorises access, DRS lets tools such as the Tool Registry Service (TRS) and Workflow Execution Service (WES) fetch the specific object without knowing its storage path

In practice, a federated analysis often chains all three: a Beacon query establishes that a relevant variant exists in a remote cohort; a Passport Visa authorises the researcher for controlled access to that cohort; and a DRS-resolved object identifier lets a workflow retrieve or stream the specific file for computation in place. GA4GH’s Data Connect API extends the same pattern to structured tabular queries across a federated network, with each node executing locally and only aggregated results returned.
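The DRS step of that chain starts with turning a drs:// identifier into a fetchable endpoint. The sketch below handles hostname-based DRS URIs only, which map to an HTTPS path under /ga4gh/drs/v1/objects/ on the same host; compact-identifier-based URIs, which need a resolver lookup, are out of scope here. The hostname and object ID are placeholders, and the function name is hypothetical.

```python
from urllib.parse import quote, urlsplit

def drs_uri_to_url(drs_uri: str) -> str:
    """Translate a hostname-based drs:// URI into its HTTPS object endpoint."""
    parts = urlsplit(drs_uri)
    object_id = parts.path.lstrip("/")
    if parts.scheme != "drs" or not parts.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    # Percent-encode the ID so it stays a single path segment.
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"

print(drs_uri_to_url("drs://drs.example.org/314159"))
```

A workflow engine would then request that URL, presenting the researcher’s Passport-derived credentials, and read the returned object metadata to find where the bytes actually live.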

Who is implementing these standards today?

GA4GH lists “Driver Projects” that co-develop and pressure-test the Passport specification against live infrastructure, including the European Genome-phenome Archive (EGA), ELIXIR’s Cloud and AAI service, Australian Genomics, and the NIH Cloud Platform Interoperability (NCPI) effort. The foundational description was published as Voisin et al., “GA4GH Passport standard for digital identity and access permissions,” Cell Genomics, 2021, which grounds the current specification text.

A related mechanism, Registered Access, was defined earlier in Dyke et al. (2018, European Journal of Human Genetics) and is now expressed inside the Passport framework: an AcceptedTermsAndPolicies Visa plus a ResearcherStatus Visa, both pointing to the same policy identifier, satisfies Registered Access without a per-dataset application.

EGA and ELIXIR jointly operate Passport-compatible Broker and Clearinghouse infrastructure for European genomic and phenotypic data.
Beacon v2, a ratified GA4GH product, is deployed across the Beacon Network for federated variant discovery.
DRS underpins object resolution in workflow execution engines implementing GA4GH’s Cloud Work Stream APIs, including tools built on Cromwell and Nextflow.

What are the governance requirements and limitations?

Passport is a technical trust framework, not a legal one: a Passport Clearinghouse must ignore any Passport or Visa unless it already has an established trust relationship with the issuing Broker and Visa Issuer. Standardised encoding does not by itself create legal authority to share data — that authority still derives from each institution’s own data access agreements, ethics approvals and, where applicable, national data protection law.

The specification also sets hard limits on Visa content: Visas are designed for machine-only interpretation and explicitly exclude rich personal identity detail and audit-trail information, which must be handled by systems outside the specification’s scope. Every Visa carries an exp (expiry) and an asserted timestamp, and Clearinghouses must enforce expiry and revocation independently of any downstream access-token lifetime.
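The expiry side of that requirement can be sketched as a check that runs on every request, independent of the access token’s own lifetime. The max_asserted_age parameter is a hypothetical local policy knob, not something the specification prescribes, and revocation checking is not shown because it depends on how each deployment tracks revoked Visas.

```python
import time
from typing import Optional

def visa_is_current(
    claims: dict,
    now: Optional[float] = None,
    max_asserted_age: Optional[float] = None,
) -> bool:
    """Return True if the Visa has not expired and is not older than local policy allows."""
    now = time.time() if now is None else now

    # A Visa without exp, or past its exp, is never accepted.
    exp = claims.get("exp")
    if exp is None or now >= exp:
        return False

    # Optional local rule: reject assertions made too long ago.
    if max_asserted_age is not None:
        asserted = claims.get("ga4gh_visa_v1", {}).get("asserted")
        if asserted is None or now - asserted > max_asserted_age:
            return False

    return True

# Hypothetical Visa asserted at t=1,000 and expiring at t=5,000 (epoch seconds).
sample = {"exp": 5000, "ga4gh_visa_v1": {"type": "ResearcherStatus", "asserted": 1000}}
print(visa_is_current(sample, now=2000))
print(visa_is_current(sample, now=6000))
print(visa_is_current(sample, now=4000, max_asserted_age=2000))
```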

Answer-first Q&A

What is the difference between a GA4GH Passport and a GA4GH Visa?

A Passport is the container claim that bundles one or more Visas together for a single researcher. Each Visa is an individually signed JWT asserting one fact — such as affiliation, researcher status, or a specific controlled-access grant — that a receiving repository evaluates independently before authorising access.

Does a GA4GH Passport move genomic data across borders?

No. A Passport only carries identity and permission claims. The genomic data stays inside the repository that hosts it; an authorised analysis runs locally and only results, not raw records, return to the requesting researcher.

How does GA4GH Beacon relate to the Passport standard?

Beacon is a separate federated query standard letting a researcher ask many repositories at once whether a variant is present. Beacon v2 can require a valid GA4GH Passport for registered- or controlled-access queries, using its Visas to decide how much detail a requester may see.

Is the GA4GH Passport specification legally binding on data holders?

No. It is a technical interoperability specification, not a law or contract. Institutions still need their own data access agreements and compliance with applicable data protection law; Passport standardises how permissions already granted under those agreements are communicated and verified.

Implications for research administrators

For institutions joining federated genomic infrastructure, the Passport/Beacon/DRS stack changes what “data access” means operationally. Committees issuing ResearcherStatus or ControlledAccessGrants Visas are, in effect, configuring machine-enforced policy rather than signing one-off agreements — a shift that rewards clear, versioned internal policies mapped to the specification’s controlled vocabulary of Visa Types.

The benefit is scale: a single Passport, once trusted by a federation’s Brokers and Clearinghouses, can authorise a researcher across many independent repositories without a new bilateral agreement each time. The trade-off is that trust management — deciding which Brokers and Visa Issuers an institution accepts — becomes a standing governance responsibility. As more national infrastructures follow the pattern set by EGA, ELIXIR and Australian Genomics, Passport-compatible Clearinghouses are likely to become a default expectation for international genomic data federations, much as FAIR principles have become a baseline expectation for data management plans.

July 4, 2026