Google Dataset Search: Getting Datasets Indexed

Google Dataset Search is a free discovery tool, launched in beta in 2018 and out of beta since January 2020, that indexes datasets by crawling schema.org/Dataset structured data on repository landing pages rather than the raw data files themselves. A dataset only becomes findable there once its page carries valid JSON-LD markup — at minimum a name and description — that Googlebot can actually crawl and parse.

Because the service matches queries against that harvested metadata and never against the contents of the data files, structured-data implementation, not submission forms or paid placement, is what gets a repository’s holdings into Dataset Search. For research-data repository managers, that makes schema.org markup one of the most practical routes to the FAIR “Findable” principle, alongside the registry and DOI channels covered later in this article.

What is Google Dataset Search and how does it work?
Which schema.org/Dataset properties does Google actually require?
How do you get a dataset indexed, step by step?
Why isn’t my dataset showing up in Dataset Search?
How does Dataset Search fit alongside re3data and DataCite?
Common questions about Google Dataset Search

What Is Google Dataset Search and How Does It Work?

Google Dataset Search launched in beta on 5 September 2018 and moved out of beta on 23 January 2020. At that point, Google’s own figures, published in the product lead’s blog post “Discovering millions of datasets on the web”, put the index at almost 25 million datasets drawn from over 4,600 repository websites.

The tool does not host or store data. It aggregates dataset descriptions in the same way Google Scholar aggregates citations, matching a search query against the structured metadata embedded on each dataset’s canonical landing page. If a repository’s dataset pages carry no schema.org/Dataset markup, Google has almost nothing to index — the page may still rank in ordinary web search, but it will not surface as a dataset result card.
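The harvesting step described above can be approximated in a few lines. The sketch below is not Google’s crawler; it only illustrates the idea of pulling schema.org/Dataset objects out of a landing page’s HTML, using Python’s standard library. The class name, function name and sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def find_datasets(html):
    """Return every parsed JSON-LD object whose @type is (or includes) Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON gives a harvester nothing to read
        for item in doc if isinstance(doc, list) else [doc]:
            types = item.get("@type") if isinstance(item, dict) else None
            if types == "Dataset" or (isinstance(types, list) and "Dataset" in types):
                datasets.append(item)
    return datasets


# Hypothetical landing page used only for illustration.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example river gauge readings",
 "description": "Hourly water-level readings from an illustrative gauge network."}
</script>
</head><body><h1>Example river gauge readings</h1></body></html>
"""

if __name__ == "__main__":
    for dataset in find_datasets(SAMPLE_PAGE):
        print(dataset["name"])
```

A page with no such block returns an empty list from `find_datasets`, which corresponds to the “almost nothing to index” case.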

Which schema.org/Dataset Properties Does Google Actually Require?

Google’s structured-data guidelines require only two properties — name and description — but a repository page that stops there will index poorly and display a sparse, low-trust result card. A competitive listing needs several of the recommended properties below as well.

| Property | Status | What it does |
| --- | --- | --- |
| `name` | Required | Dataset title shown in the result card |
| `description` | Required (50–5,000 characters) | Summary text; supports basic Markdown |
| `creator` | Recommended | Person or organisation responsible for the dataset |
| `distribution` (`DataDownload`: `contentUrl`, `encodingFormat`) | Strongly recommended | Direct access link and file format(s) |
| `license` | Strongly recommended | Usage rights; drives Google’s licence filter |
| `identifier` | Recommended | Persistent identifier, typically a DOI |
| `sameAs` | Recommended where mirrored | Points duplicate pages to one canonical record |
| `variableMeasured`, `temporalCoverage`, `spatialCoverage` | Optional | Supports filtering and richer previews |
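As an illustration of how these properties fit together, the following sketch assembles a Dataset block in Python and wraps it in a script element. The function names, argument names and example values are hypothetical; only the schema.org property names and the 50 to 5,000 character description limit come from the table above. Treating the creator as an `Organization` is an assumption made to keep the example short.

```python
import json


def build_dataset_jsonld(name, description, creator=None, license_url=None,
                         doi=None, download_url=None, encoding_format=None):
    """Assemble a schema.org/Dataset dict, omitting properties that were not supplied."""
    if not name:
        raise ValueError("name is required")
    if not 50 <= len(description) <= 5000:
        raise ValueError("description must be between 50 and 5,000 characters")
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    if creator:
        # Assumption: the creator is an organisation; a Person type works the same way.
        doc["creator"] = {"@type": "Organization", "name": creator}
    if license_url:
        doc["license"] = license_url
    if doi:
        doc["identifier"] = doi
    if download_url:
        distribution = {"@type": "DataDownload", "contentUrl": download_url}
        if encoding_format:
            distribution["encodingFormat"] = encoding_format
        doc["distribution"] = [distribution]
    return doc


def to_script_tag(doc):
    """Serialise the dict as a JSON-LD script element ready to embed in a page."""
    # Escaping "</" keeps a stray "</script>" inside a value from closing the element.
    payload = json.dumps(doc, indent=2, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical record used only for illustration.
EXAMPLE = build_dataset_jsonld(
    name="Example river gauge readings",
    description="Hourly water-level readings from an illustrative gauge network, 2019 to 2023.",
    creator="Example University Library",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    doi="https://doi.org/10.1234/example",
    download_url="https://repo.example.org/files/gauges.csv",
    encoding_format="text/csv",
)

if __name__ == "__main__":
    print(to_script_tag(EXAMPLE))
```

Rejecting a short description at build time means the error surfaces at deposit rather than after a failed crawl.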

How Do You Get a Dataset Indexed, Step by Step?

Indexing a dataset is a markup-and-crawlability exercise, not a submission process — there is no “Add my dataset” button. The sequence is consistent across repository platforms.

1. Identify a single canonical landing page per dataset, not a search results page or a listing of many datasets.
2. Embed a schema.org/Dataset block as JSON-LD in a <script type="application/ld+json"> element, in either the page’s <head> or its <body>; Google recommends JSON-LD over Microdata or RDFa because it is easier to maintain.
3. Validate the markup with Google’s Rich Results Test before publishing at scale, to catch missing required fields or malformed JSON.
4. Confirm the page is crawlable: no robots.txt block, no noindex, no authentication wall in front of the metadata.
5. List the URL in an XML sitemap and check its crawl status with Search Console’s URL Inspection tool; re-crawling and re-indexing typically take several days to a few weeks, with no guaranteed timeline.
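The crawlability and sitemap steps lend themselves to automation. The sketch below uses Python’s standard `urllib.robotparser` to test a robots.txt body against a landing-page URL and `xml.etree.ElementTree` to emit a minimal sitemap. It covers only the robots.txt rule check, not noindex tags or login walls, and the function names, URLs and robots.txt content are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def googlebot_allowed(robots_txt, url):
    """Return True if the given robots.txt text lets Googlebot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


def build_sitemap(urls):
    """Return a minimal XML sitemap listing one <url><loc> entry per landing page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical robots.txt and landing pages used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
LANDING_PAGES = [
    "https://repo.example.org/datasets/ds1",
    "https://repo.example.org/private/ds2",
]

if __name__ == "__main__":
    crawlable = [u for u in LANDING_PAGES if googlebot_allowed(ROBOTS_TXT, u)]
    print(build_sitemap(crawlable))
```

Filtering the sitemap through the robots check keeps blocked pages out of the list submitted to Search Console, where they would only generate warnings.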

Why Isn’t My Dataset Showing Up in Dataset Search?

Most non-indexing cases trace back to one of a small set of causes, and repository managers can usually diagnose them without contacting Google directly.

Missing required fields — no name or description, or a description under 50 characters.
No working contentUrl in distribution — Google cannot confirm the dataset is actually downloadable.
Blocked crawl access — the landing page is disallowed in robots.txt, gated behind login, or marked noindex.
Unconsolidated duplicates — the same dataset described on several mirrored pages without sameAs linking back to one canonical record, which dilutes or fragments the signal.
JavaScript-only rendering — the JSON-LD block is injected client-side in a way Googlebot’s renderer does not execute reliably.
Never crawled at all — the page is not in any sitemap and has no internal links pointing to it, so it was never discovered.
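The checklist above can be turned into a simple triage routine. The following is a sketch under stated assumptions: the `PageSignals` fields are hypothetical inputs that a repository would have to gather itself (for example from its own crawl logs and templates), and the function only checks that a `contentUrl` is present, not that the link actually resolves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PageSignals:
    """Facts gathered about one landing page (all field names are hypothetical)."""
    jsonld: Optional[dict]            # Dataset markup found in server-rendered HTML, else None
    robots_allowed: bool = True
    noindex: bool = False
    requires_login: bool = False
    in_sitemap: bool = True
    internal_links: int = 1
    mirror_urls: List[str] = field(default_factory=list)


def _content_urls(markup):
    """Collect contentUrl values whether distribution is a single object or a list."""
    dist = markup.get("distribution") or []
    if isinstance(dist, dict):
        dist = [dist]
    return [d["contentUrl"] for d in dist if isinstance(d, dict) and d.get("contentUrl")]


def diagnose(page):
    """Return the likely causes of non-indexing, in the order of the checklist."""
    problems = []
    markup = page.jsonld
    if markup is None:
        problems.append("no Dataset JSON-LD in server-rendered HTML")
    else:
        if not markup.get("name"):
            problems.append("missing name")
        if len(markup.get("description") or "") < 50:
            problems.append("description missing or under 50 characters")
        if not _content_urls(markup):
            problems.append("no contentUrl in distribution")
        if page.mirror_urls and not markup.get("sameAs"):
            problems.append("mirrored pages without sameAs")
    if not page.robots_allowed or page.noindex or page.requires_login:
        problems.append("crawl access blocked")
    if not page.in_sitemap and page.internal_links == 0:
        problems.append("page not discoverable")
    return problems


if __name__ == "__main__":
    page = PageSignals(jsonld={"name": "Example", "description": "Too short"})
    for problem in diagnose(page):
        print(problem)
```

A missing block in the server-rendered HTML covers both the case where no markup exists and the case where it is injected only by client-side JavaScript, since the two look identical to a check that does not execute scripts.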

How Does Dataset Search Fit Alongside re3data and DataCite?

Google Dataset Search is one layer of data findability, not the whole of it. Repositories that treat it as their only discovery channel are leaving coverage — and complementary metadata standards — unused.

| Channel | What it indexes | Discovery mechanism | Who should list |
| --- | --- | --- | --- |
| Google Dataset Search | schema.org/Dataset metadata crawled from any public page | Automatic web crawl; no submission or fee | Every repository with public dataset landing pages |
| re3data (Registry of Research Data Repositories) | Repository-level descriptions: subject scope, policies, certification | Curated registry; operators self-register, editors review | Repository operators, once, at repository level |
| DataCite Search | DOI-registered dataset metadata from DataCite members | Indexes DataCite Metadata Schema records tied to a DOI | Repositories minting DOIs via a DataCite member or consortium |

DataCite publishes a documented crosswalk from its own Metadata Schema, the standard most DOI-minting repositories already produce, to schema.org/Dataset. In practice, a repository that generates DataCite-compliant records for DOI registration can reuse most of the same field values for its Dataset Search markup rather than authoring two metadata sets from scratch. For repositories still hand-coding schema.org tags separately from their DOI workflow, this mapping is a substantial efficiency gain, and one that general explainer coverage of Dataset Search tends to overlook.
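A minimal sketch of that reuse follows. The mapping is an illustrative subset chosen for this example, not DataCite’s published crosswalk, and the input field names (`titles`, `descriptions`, `creators`, `rightsList`, `doi`) assume the JSON shape of a DataCite metadata record. The function name and the sample record are hypothetical.

```python
def datacite_to_schema_org(attrs):
    """Map a DataCite-style attributes dict onto schema.org/Dataset properties."""
    doc = {"@context": "https://schema.org/", "@type": "Dataset"}

    titles = attrs.get("titles") or []
    if titles:
        doc["name"] = titles[0].get("title")

    descriptions = attrs.get("descriptions") or []
    if descriptions:
        doc["description"] = descriptions[0].get("description")

    creators = []
    for creator in attrs.get("creators") or []:
        # Assumption: anything not flagged as organisational is treated as a person.
        kind = "Organization" if creator.get("nameType") == "Organizational" else "Person"
        creators.append({"@type": kind, "name": creator.get("name")})
    if creators:
        doc["creator"] = creators

    if attrs.get("doi"):
        doc["identifier"] = "https://doi.org/" + attrs["doi"]

    rights = [r["rightsUri"] for r in attrs.get("rightsList") or [] if r.get("rightsUri")]
    if rights:
        doc["license"] = rights[0]
    return doc


# Hypothetical DataCite-style record used only for illustration.
SAMPLE_RECORD = {
    "doi": "10.1234/example",
    "titles": [{"title": "Example river gauge readings"}],
    "descriptions": [{"description": "Hourly water-level readings from an illustrative gauge network."}],
    "creators": [{"name": "Example University Library", "nameType": "Organizational"}],
    "rightsList": [{"rightsUri": "https://creativecommons.org/licenses/by/4.0/"}],
}

if __name__ == "__main__":
    import json
    print(json.dumps(datacite_to_schema_org(SAMPLE_RECORD), indent=2))
```

Fields with no DataCite counterpart in this subset, such as `distribution`, would still need to come from the repository’s own file records.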

Listing in re3data remains a distinct, worthwhile step: it supports the “Findable” principles of the FAIR Guiding Principles (Wilkinson et al., Scientific Data, 2016) at the repository level, which funder data-management-plan templates may check for independently of whether individual datasets appear in Google’s index. For definitions of related metadata and identifier terms, see the CASRAI Dictionary.

Common Questions About Google Dataset Search

What is Google Dataset Search used for?

Google Dataset Search is used to locate datasets hosted across thousands of independent repositories from a single keyword search. It is aimed at researchers, data journalists and institutions that need to find, cite or reuse publicly available data without knowing in advance which repository holds it.

Is Google Dataset Search free to use?

Yes. Searching is free with no account required, and indexing is also free — there is no paid submission tier or guaranteed-placement fee. The only cost is the engineering effort of implementing and maintaining correct schema.org markup on a repository’s own dataset pages.

How do you use Google Dataset Search?

Visit datasetsearch.research.google.com and enter keywords describing the data required; results can be filtered by download format, usage rights and last-updated date. Repository managers can use the same interface to check whether their own holdings appear, since there is no separate indexing dashboard.

For research-data managers, Dataset Search indexing is ongoing metadata hygiene rather than a one-off SEO task: every new deposit needs a landing page with valid JSON-LD from the point of publication, not retrofitted months later. As funders increasingly reference the FAIR Guiding Principles’ Findable criteria in their data-management-plan requirements, repositories that build schema.org/Dataset markup into their standard publishing pipeline, rather than treating it as a manual afterthought, will see their holdings surface more widely at no cost beyond the engineering effort. The property list Google documents is short enough to automate at the point of deposit, which is where the durable win lies.
