Google Dataset Search: Getting Datasets Indexed

Google Dataset Search is a free discovery tool, launched in beta in 2018 and out of beta since January 2020, that indexes datasets by crawling schema.org/Dataset structured data on repository landing pages rather than the raw data files themselves. A dataset only becomes findable there once its page carries valid JSON-LD markup — at minimum a name and description — that Googlebot can actually crawl and parse.

Because queries are matched against harvested schema.org metadata and never against the contents of the data files, structured-data implementation decides whether a dataset can be found there at all. For research-data repository managers, that makes schema.org markup, not submission forms or paid placement, the practical mechanism behind the FAIR “Findable” principle as far as Dataset Search is concerned.

Google Dataset Search launched in beta on 5 September 2018 and moved out of beta on 23 January 2020. At that point, Google’s own figures, published in the product lead’s blog post “Discovering millions of datasets on the web”, put the index at almost 25 million datasets drawn from over 4,600 repository websites.

The tool does not host or store data. It aggregates dataset descriptions in the same way Google Scholar aggregates citations, matching a search query against the structured metadata embedded on each dataset’s canonical landing page. If a repository’s dataset pages carry no schema.org/Dataset markup, Google has almost nothing to index — the page may still rank in ordinary web search, but it will not surface as a dataset result card.

Which schema.org/Dataset Properties Does Google Actually Require?

Google’s structured-data guidelines require only two properties — name and description — but a repository page that stops there will index poorly and display a sparse, low-trust result card. A competitive listing needs several of the recommended properties below as well.

| Property | Status | What it does |
|---|---|---|
| name | Required | Dataset title shown in the result card |
| description | Required (50–5,000 characters) | Summary text; supports basic Markdown |
| creator | Recommended | Person or organisation responsible for the dataset |
| distribution (DataDownload: contentUrl, encodingFormat) | Strongly recommended | Direct access link and file format(s) |
| license | Strongly recommended | Usage rights; drives Google’s licence filter |
| identifier | Recommended | Persistent identifier, typically a DOI |
| sameAs | Recommended where mirrored | Points duplicate pages to one canonical record |
| variableMeasured, temporalCoverage, spatialCoverage | Optional | Supports filtering and richer previews |
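As a rough illustration of how the properties listed above fit together, the following Python sketch assembles a schema.org/Dataset record and wraps it in a JSON-LD script element. The function names, the example.org URLs, the organisation name and the DOI are all hypothetical placeholders, and a real repository would populate these values from its own metadata store.

```python
import json


def build_dataset_jsonld(name, description, creator_name, license_url,
                         doi, content_url, encoding_format):
    """Return a schema.org/Dataset record as a plain dict (illustrative helper)."""
    # Length limits Google documents for the required description field.
    if not 50 <= len(description) <= 5000:
        raise ValueError("description should be 50 to 5,000 characters long")
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": creator_name},
        "license": license_url,
        "identifier": f"https://doi.org/{doi}",
        "distribution": [{
            "@type": "DataDownload",
            "contentUrl": content_url,
            "encodingFormat": encoding_format,
        }],
    }


def to_script_tag(record):
    """Serialise the record as a JSON-LD <script> element for the landing page."""
    payload = json.dumps(record, indent=2, ensure_ascii=False)
    # Escape "</" so a stray "</script>" inside the metadata cannot close the element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical example values; none of these identify a real dataset.
record = build_dataset_jsonld(
    name="River gauge readings 2010-2020",
    description=("Daily water-level readings from twelve river gauges, "
                 "collected between 2010 and 2020 and released as CSV."),
    creator_name="Example Hydrology Institute",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    doi="10.1234/example-gauges",
    content_url="https://repo.example.org/files/gauges.csv",
    encoding_format="text/csv",
)
print(to_script_tag(record))
```

Rejecting an out-of-range description at build time is one possible design choice: it surfaces the problem at deposit rather than after a failed crawl.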

How Do You Get a Dataset Indexed, Step by Step?

Indexing a dataset is a markup-and-crawlability exercise, not a submission process — there is no “Add my dataset” button. The sequence is consistent across repository platforms.

  1. Identify a single canonical landing page per dataset — not a search results page or a listing of many datasets.
  2. Embed a schema.org/Dataset block as JSON-LD in the page’s <head> (Google also reads JSON-LD placed in the <body>); Google recommends JSON-LD over Microdata or RDFa because it is easier to maintain.
  3. Validate the markup with Google’s Rich Results Test before publishing at scale, to catch missing required fields or malformed JSON.
  4. Confirm the page is crawlable: no robots.txt block, no noindex, no authentication wall in front of the metadata.
  5. List the URL in an XML sitemap and check its status with Search Console’s URL Inspection tool; re-crawling and re-indexing typically take several days to a few weeks, with no guaranteed timeline.
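Parts of the sequence above can be pre-checked locally before a page reaches Google's own tools. The sketch below uses only the Python standard library and assumes the landing page HTML and the site's robots.txt have already been fetched as strings. The class name, the function names and the example.org URL are hypothetical, and the check is no substitute for the Rich Results Test or URL Inspection.

```python
import json
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class LandingPageScanner(HTMLParser):
    """Collects JSON-LD blocks and robots meta directives from one HTML page."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []
        self.errors = []
        self.noindex = False
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("robots", "googlebot"):
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.jsonld_blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError as exc:
                self.errors.append(f"malformed JSON-LD: {exc.msg}")


def crawl_allowed(robots_txt, url, user_agent="Googlebot"):
    """Apply robots.txt rules (given as text) to one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def precheck(html, robots_txt, url):
    """Summarise the markup and crawlability findings for one landing page."""
    scanner = LandingPageScanner()
    scanner.feed(html)
    scanner.close()
    datasets = [b for b in scanner.jsonld_blocks
                if isinstance(b, dict) and b.get("@type") == "Dataset"]
    return {
        "has_dataset_markup": bool(datasets),
        "json_errors": scanner.errors,
        "noindex": scanner.noindex,
        "robots_allowed": crawl_allowed(robots_txt, url),
    }


# Hypothetical page and robots.txt content.
html = """<html><head>
<meta name="robots" content="index,follow">
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example dataset", "description": "A placeholder description."}
</script>
</head><body></body></html>"""
robots = "User-agent: *\nDisallow: /admin/\n"
url = "https://repo.example.org/datasets/ds1"
print(precheck(html, robots, url))
```

This covers only what can be read from static HTML. A JSON-LD block injected by client-side JavaScript would not be seen by this parser, so a page that relies on it needs checking in a rendering tool instead.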

Why Isn’t My Dataset Showing Up in Dataset Search?

Most non-indexing cases trace back to one of a small set of causes, and repository managers can usually diagnose them without contacting Google directly.

  • Missing required fields: no name or description, or a description under 50 characters.
  • No working contentUrl in distribution: the property is recommended rather than required, but without a reachable download link Google cannot confirm the dataset is actually downloadable.
  • Blocked crawl access: the landing page is disallowed in robots.txt, gated behind login, or marked noindex.
  • Unconsolidated duplicates: the same dataset is described on several mirrored pages without sameAs linking back to one canonical record, which fragments the signal.
  • JavaScript-only rendering: the JSON-LD block is injected client-side in a way Googlebot’s renderer does not execute reliably.
  • Never crawled at all: the page is not in any sitemap and has no internal links pointing to it, so it was never discovered.
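The markup-related causes in this list lend themselves to an automated check on the parsed JSON-LD record. The following Python sketch is one possible shape for such a check; the function name and the wording of the messages are hypothetical. Crawl access, duplicate pages and sitemap coverage need information from outside the record, so they are not covered here.

```python
def diagnose_dataset_markup(record):
    """Return human-readable problems found in one parsed JSON-LD record."""
    problems = []

    # @type may be a single string or a list of types.
    types = record.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "Dataset" not in types:
        problems.append("@type is not Dataset")

    if not str(record.get("name", "")).strip():
        problems.append("missing required field: name")

    description = str(record.get("description", "")).strip()
    if not description:
        problems.append("missing required field: description")
    elif len(description) < 50:
        problems.append("description is shorter than 50 characters")
    elif len(description) > 5000:
        problems.append("description is longer than 5,000 characters")

    # distribution may be a single object or a list of objects.
    distributions = record.get("distribution", [])
    if isinstance(distributions, dict):
        distributions = [distributions]
    if not any(isinstance(d, dict) and d.get("contentUrl") for d in distributions):
        problems.append("no distribution with a contentUrl")

    return problems


# Hypothetical sparse record of the kind a hand-built page might carry.
sparse = {"@type": "Dataset", "name": "Tide tables", "description": "Too short."}
for problem in diagnose_dataset_markup(sparse):
    print(problem)
```

The check only confirms that a contentUrl is present, not that the link resolves. Verifying that would require an HTTP request, which is left out of this sketch.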

How Does Dataset Search Fit Alongside re3data and DataCite?

Google Dataset Search is one layer of data findability, not the whole of it. Repositories that treat it as their only discovery channel are leaving coverage — and complementary metadata standards — unused.

| Channel | What it indexes | Discovery mechanism | Who should list |
|---|---|---|---|
| Google Dataset Search | schema.org/Dataset metadata crawled from any public page | Automatic web crawl; no submission or fee | Every repository with public dataset landing pages |
| re3data (Registry of Research Data Repositories) | Repository-level descriptions: subject scope, policies, certification | Curated registry; operators self-register, editors review | Repository operators, once, at repository level |
| DataCite Search | DOI-registered dataset metadata from DataCite members | Indexes DataCite Metadata Schema records tied to a DOI | Repositories minting DOIs via a DataCite member or consortium |

DataCite publishes a documented crosswalk from its own Metadata Schema — the standard most DOI-minting repositories already produce — to schema.org/Dataset. In practice, a repository that generates DataCite-compliant records for DOI registration can reuse most of the same field values for its Dataset Search markup rather than authoring two metadata sets from scratch. This mapping is the single largest efficiency gain available to repositories still hand-coding schema.org tags separately from their DOI workflow, and it is largely absent from general explainer coverage of Dataset Search.
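To make the reuse concrete, here is a deliberately simplified Python sketch of such a mapping. It is not DataCite's published crosswalk: it covers only a handful of fields, the input keys (titles, descriptions, creators, rightsList, publicationYear, doi) are assumed to follow the shape of DataCite's JSON records, and the function name, DOI and URLs are hypothetical. Check any real implementation against DataCite's own mapping documentation.

```python
def datacite_to_schema_org(attrs, landing_page_url):
    """Map a subset of DataCite-style attributes to a schema.org/Dataset dict."""
    titles = attrs.get("titles") or []
    descriptions = attrs.get("descriptions") or []
    # Prefer the abstract when several description types are present.
    abstract = next(
        (d for d in descriptions if d.get("descriptionType") == "Abstract"),
        descriptions[0] if descriptions else {},
    )
    rights = attrs.get("rightsList") or []
    year = attrs.get("publicationYear")

    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "url": landing_page_url,
        "name": titles[0].get("title") if titles else None,
        "description": abstract.get("description"),
        "identifier": f"https://doi.org/{attrs['doi']}" if attrs.get("doi") else None,
        "creator": [
            {
                "@type": "Organization" if c.get("nameType") == "Organizational" else "Person",
                "name": c.get("name"),
            }
            for c in attrs.get("creators") or []
        ],
        "license": rights[0].get("rightsUri") if rights else None,
        "datePublished": str(year) if year else None,
    }
    # Drop empty values so the output carries only what the source record had.
    return {k: v for k, v in record.items() if v not in (None, [], "")}


# Hypothetical DataCite-style record.
datacite_attrs = {
    "doi": "10.1234/example-gauges",
    "titles": [{"title": "River gauge readings 2010-2020"}],
    "descriptions": [{
        "description": "Daily water-level readings from twelve river gauges.",
        "descriptionType": "Abstract",
    }],
    "creators": [{"name": "Example Hydrology Institute", "nameType": "Organizational"}],
    "rightsList": [{"rightsUri": "https://creativecommons.org/licenses/by/4.0/"}],
    "publicationYear": 2021,
}
mapped = datacite_to_schema_org(datacite_attrs, "https://repo.example.org/datasets/gauges")
print(mapped)
```

The distribution property is absent from this mapping because the download link usually comes from the repository's own file store rather than from the DOI record.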

Listing in re3data remains a distinct, worthwhile step: it supports the “Findable” criteria of the FAIR Guiding Principles (Wilkinson et al., Scientific Data, 2016) at the repository level, which many funder data-management-plan templates now check for independently of whether individual datasets appear in Google’s index. For definitions of related metadata and identifier terms, see the CASRAI Dictionary.

Common Questions About Google Dataset Search

What is Google Dataset Search used for?

Google Dataset Search is used to locate datasets hosted across thousands of independent repositories from a single keyword search. It is aimed at researchers, data journalists and institutions that need to find, cite or reuse publicly available data without knowing in advance which repository holds it.

Is Google Dataset Search free to use?

Yes. Searching is free with no account required, and indexing is also free — there is no paid submission tier or guaranteed-placement fee. The only cost is the engineering effort of implementing and maintaining correct schema.org markup on a repository’s own dataset pages.

How do you use Google Dataset Search?

Visit datasetsearch.research.google.com and enter keywords describing the data required; results can be filtered by download format, usage rights and last-updated date. Repository managers can use the same interface to check whether their own holdings appear, since there is no separate indexing dashboard.

For research-data managers, Dataset Search indexing is ongoing metadata hygiene rather than a one-off SEO task: every new deposit needs a landing page with valid JSON-LD from the point of publication, not retrofitted months later. As funders increasingly reference the FAIR Findable criteria in their data-management-plan requirements, repositories that build schema.org/Dataset markup into their standard publishing pipeline, rather than treating it as a manual afterthought, keep their holdings discoverable at no cost beyond the engineering effort. The property list Google documents is short enough to automate at the point of deposit, which is where the durable win lies.
