Schema.org Dataset markup is the JSON-LD structured-data block, built on the Dataset type, that describes a dataset’s name, creator, licence, and distribution so search engines and repositories can index and cite it correctly. Implemented well, it is also the fastest practical route to satisfying the Findable and Accessible legs of the FAIR data principles.
The markup describes a dataset’s metadata rather than its contents. This article is a working checklist for developers and data curators: which properties are required, which are recommended, how they map onto FAIR and research-data infrastructure such as DataCite, re3data and GA4GH, and what to validate before publishing.
- What is Schema.org Dataset markup and why does it matter?
- Which Dataset properties are required vs recommended?
- How does Dataset markup align with FAIR, DataCite, re3data and GA4GH?
- Step-by-step developer implementation checklist
- Developer FAQs
What is Schema.org Dataset markup and why does it matter?
The Dataset type sits under CreativeWork in the Schema.org vocabulary and is used to describe metadata about a dataset — its subject, creator, licence and access route — not the data values themselves. Google indexes pages carrying this markup into Google Dataset Search, and research infrastructure tools reuse the same fields for citation and provenance tracking.
Schema.org itself reports that, as of 2024, over 45 million web domains mark up pages with more than 450 billion Schema.org objects, making it the dominant structured-data vocabulary on the web. For dataset publishers specifically, correct markup is not cosmetic SEO — it is the metadata layer that machines, catalogues, and AI systems read to decide whether your dataset exists at all.
Google’s Dataset structured-data guidelines confirm it also accepts equivalent metadata expressed in the W3C Data Catalog Vocabulary (DCAT) format, and has experimental support for CSVW (“CSV on the Web”), so Schema.org JSON-LD is the recommended default but not the only route into discovery tooling.
Which Dataset properties are required vs recommended?
Two properties are mandatory for Google Dataset Search eligibility: name and description. Everything else is recommended but materially affects both discoverability and FAIR compliance. The table below is the working checklist.
| Property | Status | Purpose |
|---|---|---|
| name | Required | Unique, descriptive title; avoid reusing the same name across distinct datasets |
| description | Required | 50–5,000 characters; Google Dataset Search only indexes the first 5,000 characters of any textual property |
| identifier | Recommended | A DOI or Compact Identifier; repeat the property if more than one identifier exists |
| creator / funder | Recommended | Person or Organization; Google explicitly recommends ORCID for people and ROR for institutions via the sameAs property |
| license | Recommended | A URL identifying a specific licence version, e.g. a versioned Creative Commons URL, not the unversioned root |
| distribution | Recommended | DataDownload object; contentUrl is required within it, encodingFormat is recommended |
| keywords, citation, isAccessibleForFree | Recommended | Improve topical matching and reuse signalling |
| hasPart / isPartOf | Recommended | Declares parent–child relationships between a collection dataset and its subsets |
| includedInDataCatalog | Recommended | Links the dataset to the DataCatalog (repository) it belongs to |
Google’s own guidance is explicit on three points: names must be unique for distinct datasets, licences must point to a version-specific URL, and the citation property must reference related academic articles. The citation property should not be reused to cite the dataset itself; that is handled through name, identifier, creator and publisher together.
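To make the table concrete, here is a minimal sketch that builds a Dataset block in Python and serialises it for a landing page. Every value in it (the dataset name, DOI, ORCID iD, ROR ID, URLs) is a hypothetical placeholder, and the helper name to_script_tag is an illustration rather than part of any library.

```python
import json

# All values are hypothetical placeholders, not a real dataset or real identifiers.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example River Temperature Readings 2020-2023",
    "description": (
        "Hourly water temperature readings from a hypothetical monitoring "
        "network, provided here only to illustrate Dataset markup."
    ),
    "identifier": "https://doi.org/10.1234/example-doi",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["water temperature", "hydrology"],
    "isAccessibleForFree": True,
    "creator": {
        "@type": "Person",
        "name": "Jane Example",
        "sameAs": "https://orcid.org/0000-0000-0000-0000",
    },
    "funder": {
        "@type": "Organization",
        "name": "Example Funding Agency",
        "sameAs": "https://ror.org/00example0",
    },
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://data.example.org/river-temps.csv",
            "encodingFormat": "text/csv",
        }
    ],
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example Data Repository",
    },
}


def to_script_tag(doc: dict) -> str:
    """Serialise a JSON-LD dict into a script element for a landing page."""
    body = json.dumps(doc, indent=2, ensure_ascii=False)
    # Escape "</" so a value containing "</script>" cannot close the element early.
    # "\/" is a valid JSON escape, so the payload still parses to the same data.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_script_tag(dataset))
```

Building the block from a dictionary rather than hand-writing JSON means json.dumps takes care of quoting, and any line breaks in the description are emitted as \n escapes automatically.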
How does Dataset markup align with FAIR, DataCite, re3data and GA4GH?
The FAIR Guiding Principles — Findable, Accessible, Interoperable, Reusable — were formalised by Wilkinson et al. in Scientific Data (2016) and adopted widely across funder and repository policy since. Schema.org properties map directly onto them: identifier and name serve Findable; distribution.contentUrl and isAccessibleForFree serve Accessible; @context/@type and controlled variableMeasured values serve Interoperable; and license, creator, and citation together serve Reusable.
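The mapping in the paragraph above can be turned into a quick coverage report. The sketch below is one possible way to do that; the grouping comes from this article rather than from an official FAIR crosswalk, and the function names has_path and fair_coverage are hypothetical.

```python
# Property-to-principle grouping as described in the text above.
FAIR_MAP = {
    "Findable": ["identifier", "name"],
    "Accessible": ["distribution.contentUrl", "isAccessibleForFree"],
    "Interoperable": ["@context", "@type", "variableMeasured"],
    "Reusable": ["license", "creator", "citation"],
}


def has_path(doc, dotted):
    """Return True if a dotted property path resolves to a non-empty value."""
    current = [doc]
    for key in dotted.split("."):
        found = []
        for node in current:
            # A JSON-LD value may be a single object or a list of objects.
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    found.append(item[key])
        current = found
    # False is a meaningful value (e.g. isAccessibleForFree), so only
    # None and empty containers/strings count as absent.
    return any(value not in (None, "", [], {}) for value in current)


def fair_coverage(doc):
    """List, per FAIR principle, the mapped properties the block is missing."""
    return {
        principle: [prop for prop in props if not has_path(doc, prop)]
        for principle, props in FAIR_MAP.items()
    }


# Hypothetical, deliberately incomplete block.
example = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Hypothetical example dataset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "DataDownload", "contentUrl": "https://data.example.org/file.csv"}
    ],
}

for principle, missing in fair_coverage(example).items():
    print(principle, "is missing:", ", ".join(missing) or "nothing")
```

A report like this only shows whether a property is present. It says nothing about whether the value is correct, so it complements rather than replaces validation.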
This is where most generic Schema.org tutorials stop — and where research-data publishers need to go further. Three infrastructure layers sit downstream of your JSON-LD and expect it to be internally consistent:
- DataCite: if your repository mints DOIs through DataCite, the DataCite Metadata Schema fields (creators, contributors, resourceType, rights) should match your Schema.org identifier, creator and license values exactly; mismatched names or licence URLs between the two records break the provenance chain a machine is trying to follow.
- re3data: the global registry of research data repositories (re3data.org, operated with DataCite) indexes repository-level metadata, including declared standards and certifications. A repository whose dataset landing pages carry consistent, complete Schema.org markup is easier to register and audit in re3data than one relying on free-text descriptions alone.
- GA4GH standards: for genomic and clinical datasets, the Global Alliance for Genomics and Health publishes discovery and access standards, including the Data Repository Service (DRS) and Data Use Ontology (DUO), that operate at a finer grain than Schema.org’s generic properties. Where a dataset carries GA4GH-compliant access controls, reflect that in conditionsOfAccess-style text within description or a linked policy document, since Schema.org has no dedicated property for consent/use-ontology terms yet.
None of these three requires abandoning Schema.org; they require treating it as one layer in a metadata stack, not the whole stack. Keeping identifiers consistent across all three is the single highest-leverage FAIR improvement available to a repository team.
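As an illustration of the consistency check described for DataCite, the following sketch compares a Schema.org block with a simplified DataCite-style record. The input shape (doi, creators[].name, rightsList[].rightsUri) is an assumption about how your DataCite metadata is exported, so adjust the keys to your own record. The function names are hypothetical.

```python
def normalise_doi(value):
    """Reduce a DOI URL or 'doi:' string to a bare lowercase DOI."""
    doi = str(value).strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def as_list(value):
    """Wrap a single JSON-LD value in a list; pass lists through unchanged."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def compare_records(schema_org, datacite):
    """Return human-readable mismatches between the two records."""
    problems = []

    # DOIs are compared after normalisation, because the URL and bare forms
    # identify the same object. This tolerance is a choice made in this sketch.
    so_dois = {
        normalise_doi(i)
        for i in as_list(schema_org.get("identifier"))
        if isinstance(i, str)
    }
    dc_doi = normalise_doi(datacite.get("doi", ""))
    if not dc_doi or dc_doi not in so_dois:
        problems.append("identifier: DOI missing or different")

    # Names and licence URLs are compared as exact strings.
    so_names = {
        c["name"]
        for c in as_list(schema_org.get("creator"))
        if isinstance(c, dict) and c.get("name")
    }
    dc_names = {c["name"] for c in datacite.get("creators", []) if c.get("name")}
    if so_names != dc_names:
        problems.append("creator: names differ: " + "; ".join(sorted(so_names ^ dc_names)))

    dc_rights = {r["rightsUri"] for r in datacite.get("rightsList", []) if r.get("rightsUri")}
    licence = schema_org.get("license")
    if not isinstance(licence, str) or licence not in dc_rights:
        problems.append("license: URL not present in the DataCite rights list")

    return problems
```

A typical source of mismatch is name ordering: a record that stores "Example, Jane" will not match a JSON-LD creator named "Jane Example" under exact comparison, which is the kind of drift this check is meant to surface.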
Step-by-step developer implementation checklist
Use this sequence when adding or auditing Dataset markup on a landing page:
- Add a single <script type="application/ld+json"> block per canonical dataset landing page, not on list/search-result pages.
- Set @context to https://schema.org/ and @type to Dataset.
- Write a unique name and a description between 50 and 5,000 characters, using \n for line breaks in JSON-LD.
- Populate creator and funder with nested Person/Organization objects, each carrying a sameAs pointing to an ORCID iD or ROR ID.
- Add identifier (DOI preferred), a version-specific license URL, and relevant keywords.
- Add distribution as one or more DataDownload objects with required contentUrl and recommended encodingFormat.
- If the dataset belongs to a repository, add includedInDataCatalog; if it is a subset of a larger dataset, add isPartOf (or hasPart for the reverse).
- Cross-check every identifier, creator name, and licence URL against your DataCite/DOI record; they must match exactly.
- Validate with Google’s Rich Results Test and fix critical errors; non-critical warnings (such as missing organisation contactType) can be resolved but do not block eligibility.
- Submit or update your sitemap so Google can (re)crawl the landing page, and monitor the Dataset rich-result status report in Search Console afterwards.
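Several of the checklist items above can be checked locally before a page goes anywhere near Google’s tooling. The sketch below is one hypothetical pre-publish audit, not a substitute for the Rich Results Test. The function name audit_dataset is invented for illustration, the split between errors and warnings is a choice made here, and the licence-version check is a rough heuristic that only looks for a number such as 4.0 in the URL.

```python
import re


def as_list(value):
    """Wrap a single JSON-LD value in a list; pass lists through unchanged."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def audit_dataset(doc):
    """Return (errors, warnings) for a Dataset JSON-LD dict."""
    errors, warnings = [], []

    if doc.get("@type") != "Dataset":
        errors.append("@type must be Dataset")
    if "schema.org" not in str(doc.get("@context", "")):
        errors.append("@context should point to https://schema.org/")
    if not str(doc.get("name", "")).strip():
        errors.append("name is required")

    desc_len = len(str(doc.get("description", "")))
    if not 50 <= desc_len <= 5000:
        errors.append(f"description is {desc_len} characters; expected 50 to 5,000")

    # creator and funder should each link to an ORCID iD or ROR ID via sameAs.
    for role in ("creator", "funder"):
        for agent in as_list(doc.get(role)):
            links = as_list(agent.get("sameAs")) if isinstance(agent, dict) else []
            joined = " ".join(str(link) for link in links)
            if "orcid.org" not in joined and "ror.org" not in joined:
                warnings.append(f"{role} has no ORCID or ROR sameAs link")

    licence = str(doc.get("license", ""))
    if not licence.startswith("http"):
        warnings.append("license should be a URL")
    elif not re.search(r"\d+\.\d+", licence):
        # Heuristic only: a version number like 4.0 is expected in the path.
        warnings.append("license URL has no visible version number")

    distributions = as_list(doc.get("distribution"))
    if not distributions:
        warnings.append("no distribution declared")
    for dist in distributions:
        if not isinstance(dist, dict) or not dist.get("contentUrl"):
            errors.append("each DataDownload needs a contentUrl")
        elif not dist.get("encodingFormat"):
            warnings.append("DataDownload has no encodingFormat")

    return errors, warnings
```

Run against each landing page’s parsed JSON-LD in a build or CI step, a check like this catches length and licence slips early, leaving the Rich Results Test for the errors only Google can report.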
Developer FAQs
What is schema.org used for?
Schema.org is a collaborative vocabulary for marking up structured data on web pages, in formats including JSON-LD, RDFa, and Microdata. Search engines such as Google, Bing, and Yandex use it to understand page content and power features like rich results, and the Dataset type specifically feeds Google Dataset Search.
What is the schema of a dataset?
A Schema.org dataset’s “schema” is its set of metadata properties — required fields (name, description) plus recommended fields such as identifier, creator, license, keywords, and distribution. Together these describe what the dataset is, who made it, and how to access it, without encoding the data values themselves.
What is the difference between schema.org and open graph?
Open Graph is a Facebook-originated protocol controlling how a link’s title, image, and description render when shared on social platforms. Schema.org is a broader, search-engine-oriented vocabulary covering hundreds of entity types, including Dataset, and is read primarily by search and AI-answer systems rather than social sharing cards.
Is schema.org still relevant?
Yes. Beyond its role in classic rich results, Schema.org markup is increasingly what AI answer engines extract to ground responses about entities and datasets, since it gives an unambiguous, machine-readable statement of facts that free text does not. For dataset publishers, it remains the most reliable route into both Google Dataset Search and AI-generated citations.
Treated as a checklist rather than a one-off tag, Schema.org Dataset markup becomes a maintenance discipline: every new dataset version, licence change, or repository migration should trigger a re-check of the JSON-LD block against the current DataCite and repository record. Repositories that keep this loop tight see fewer indexing gaps in Google Dataset Search and cleaner provenance chains when datasets are cited or aggregated elsewhere. As AI systems increasingly read structured metadata directly, that discipline is becoming a citation requirement, not just an SEO one.
For the institutional context around research data stewardship and contributor attribution, see CASRAI’s overview of research administration practice, and consult the open-research terminology dictionary for definitions of related metadata and standards terms.