A vast amount of research data is now deposited in repositories, accompanied by persistent identifiers and described with metadata. That is a real achievement — but it raises a question that is easy to overlook. How does anyone actually find a relevant dataset? A researcher who suspects that the data they need may already exist somewhere faces a genuinely hard search problem: the data is scattered across thousands of repositories worldwide, each with its own catalogue, its own search box and its own conventions. Without good ways to discover data across all of them, a great deal of valuable, well-curated data simply goes unfound and unused — the digital equivalent of a book correctly shelved in a library no one knows the address of. This article looks at the infrastructure of dataset discovery, drawing on the data infrastructure domain of the CASRAI Dictionary.
Findability is the first FAIR principle
It is no accident that the F in FAIR — Findable, Accessible, Interoperable, Reusable — comes first. Findability is logically prior to everything else: data that cannot be found cannot be accessed, cannot be reused, and delivers none of the value its careful curation promised. Findability in the FAIR sense rests on a few concrete foundations: data should be assigned a globally unique and persistent identifier; it should be described with rich metadata; that metadata should explicitly include the identifier; and the metadata should be registered or indexed in a resource that can be searched. The order of the principles is a quiet but important statement of priorities — all the work of making data accessible and reusable is wasted if the first hurdle, being found at all, is never cleared.
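The four foundations listed above can be read as a simple checklist. The following is a minimal sketch of that reading, not an official FAIR assessment tool: the field names, the loose DOI pattern and the `findability_checks` helper are all assumptions made for illustration.

```python
import re

# Loose shape check for a DOI ("10.", registrant code, "/", suffix).
# An assumption for illustration; other persistent identifier schemes exist.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def findability_checks(data_id, metadata, indexed_in):
    """Return one pass/fail flag per findability foundation.

    data_id    -- identifier assigned to the dataset itself
    metadata   -- dict describing the dataset (hypothetical field names)
    indexed_in -- names of searchable resources that hold the metadata record
    """
    descriptive = ("title", "description", "creators")
    return {
        "persistent_identifier": bool(DOI_PATTERN.match(data_id or "")),
        "rich_metadata": all(metadata.get(field) for field in descriptive),
        "metadata_includes_identifier": bool(data_id)
        and metadata.get("identifier") == data_id,
        "indexed_in_searchable_resource": len(indexed_in) > 0,
    }


# Hypothetical record, invented for this example.
record = {
    "identifier": "10.1234/example.5678",
    "title": "River temperature readings",
    "description": "Hourly water temperature at three stations.",
    "creators": ["A. Researcher"],
}
checks = findability_checks("10.1234/example.5678", record, ["repository catalogue"])
for name, passed in checks.items():
    print(name, "ok" if passed else "MISSING")
```

A dataset that fails any one of these checks may be perfectly well curated and still be hard to find, which is the point the ordering of the principles makes.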
Registries of repositories
Discovery operates at more than one level, and the first level is finding the right repository. Before searching for a dataset, a researcher, whether looking for existing data or deciding where to deposit their own, often needs to identify which repository is appropriate for their field and data type. This is the role of re3data, the Registry of Research Data Repositories, a comprehensive directory that catalogues data repositories across all disciplines. It describes each repository in a structured way and lets users discover repositories by subject, country, data type and the policies they apply. re3data answers the question “where might data like this live, and where should I put mine?” It is discovery one level up (finding the haystacks before searching for the needle), and it is an essential first step that purely dataset-level search tools do not provide.
Dataset-level discovery
The second level is finding individual datasets, and several complementary services address it:
- DataCite Commons. Because DataCite is a principal minter of persistent identifiers for research data, it sits on a large, structured graph of datasets and their connections to people, organisations, funders and related outputs. DataCite Commons exposes that graph for discovery, letting users search across datasets and follow the links between a dataset and its authors, its funding and the works that cite or relate to it.
- Google Dataset Search. A search engine dedicated to datasets rather than to web pages in general, it works by harvesting the structured metadata that data providers publish on their own pages and making it searchable in one place. It brings dataset discovery into a familiar, web-scale search experience.
- Repository and aggregator catalogues. Individual repositories offer their own search, and aggregators pull metadata from many sources into combined indexes, each widening the net a little further.
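To make the aggregator idea concrete, here is a minimal sketch of how metadata records from several catalogues might be merged into one combined index. It assumes each record carries a DOI. The source names, the record fields and the `build_combined_index` and `search` helpers are hypothetical and do not describe how any particular aggregator works.

```python
def normalise_doi(value):
    """Reduce common DOI spellings to a bare lowercase form."""
    doi = value.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def build_combined_index(sources):
    """Merge records from several catalogues, keyed by normalised DOI.

    sources -- dict mapping a source name to a list of record dicts
    """
    index = {}
    for source_name, records in sources.items():
        for record in records:
            key = normalise_doi(record["doi"])
            entry = index.setdefault(key, {"doi": key, "found_in": []})
            entry["found_in"].append(source_name)
            # Keep the first non-empty value seen for each descriptive field.
            for field, value in record.items():
                if field != "doi" and value and not entry.get(field):
                    entry[field] = value
    return index


def search(index, term):
    """Return the DOIs whose title or description contains the term."""
    term = term.lower()
    return sorted(
        doi
        for doi, entry in index.items()
        if term in (entry.get("title", "") + " " + entry.get("description", "")).lower()
    )


# Hypothetical records, invented for this example.
sources = {
    "repository_a": [
        {"doi": "10.1234/abc", "title": "River temperature readings", "description": ""},
    ],
    "aggregator_b": [
        {"doi": "https://doi.org/10.1234/ABC", "title": "",
         "description": "Hourly water temperature, 2010 to 2020"},
        {"doi": "doi:10.1234/xyz", "title": "Soil moisture survey",
         "description": "Field plots"},
    ],
}
index = build_combined_index(sources)
print(search(index, "temperature"))
```

Keying on the persistent identifier is what lets the two partial descriptions of the same dataset be recognised as one entry, so each source fills gaps left by the other.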
Why structured metadata is the engine
What makes web-scale dataset search possible at all is structured, machine-readable metadata, and in particular the schema.org/Dataset vocabulary. schema.org is a shared vocabulary for marking up information on web pages so that machines, not just humans, can understand it, and it includes a specific type for describing datasets — their title, description, creators, licence, distribution and more. When a repository or data provider embeds schema.org/Dataset markup in the page describing a dataset, a search engine crawling the web can recognise that the page describes a dataset and extract its key facts. This is precisely how a service such as Google Dataset Search builds its index: not by being given a private feed from every repository, but by reading the standardised markup that providers publish openly. The lesson is direct and practical — describing data with shared, structured metadata is not bureaucratic box-ticking, it is the literal mechanism by which the data becomes discoverable to the wider world.
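The mechanism described above can be sketched from both sides: a provider embedding schema.org/Dataset markup as JSON-LD in a landing page, and a crawler reading it back out. This is an illustration under stated assumptions rather than a description of how any search engine is implemented. The dataset details, the example.org URLs and the DOI are invented, and the helper names are hypothetical.

```python
import json
from html import escape
from html.parser import HTMLParser


def dataset_jsonld(name, description, creator, licence_url, download_url, doi_url):
    """Build a schema.org/Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi_url,
        "creator": {"@type": "Person", "name": creator},
        "license": licence_url,
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": download_url,
        },
    }


def embed_in_page(markup):
    """Wrap the markup in a JSON-LD script element inside a minimal page."""
    payload = json.dumps(markup, indent=2)
    return (
        "<html><head>\n"
        '<script type="application/ld+json">\n' + payload + "\n</script>\n"
        "</head><body><h1>" + escape(markup["name"]) + "</h1></body></html>"
    )


class JsonLdExtractor(HTMLParser):
    """Collect the contents of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def find_datasets(page_html):
    """Return every JSON-LD block on the page that declares itself a Dataset."""
    parser = JsonLdExtractor()
    parser.feed(page_html)
    parser.close()
    return [b for b in parser.blocks if isinstance(b, dict) and b.get("@type") == "Dataset"]


markup = dataset_jsonld(
    "River temperature readings",
    "Hourly water temperature at three stations.",
    "A. Researcher",
    "https://creativecommons.org/licenses/by/4.0/",
    "https://example.org/data/river.csv",   # hypothetical download URL
    "https://doi.org/10.1234/example",      # hypothetical DOI
)
page = embed_in_page(markup)
for dataset in find_datasets(page):
    print(dataset["name"], "|", dataset["license"])
```

The crawler side needs no agreement with the provider beyond the shared vocabulary: it looks for the script element, parses the JSON and checks the declared type.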
Discovery depends on good deposit
All of this throws the responsibility back to the moment of deposit. A dataset is only as findable as its metadata is good. Rich, accurate, standards-based metadata — a clear title, a meaningful description, named creators with identifiers, an explicit licence, appropriate keywords — is what feeds every layer of the discovery system. Skimpy or inconsistent metadata leaves a dataset effectively invisible no matter how valuable its contents. This is why guidance on depositing data places such weight on description, and why the choices a researcher makes at deposit time echo through every subsequent attempt to find their work. Practical guidance on getting this right is part of our wider material on research data fundamentals.
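A depositor, or a repository's submission form, could turn the checklist above into a quick pre-deposit review. The sketch below is one possible reading of it: the field names, the numeric thresholds and the `metadata_gaps` helper are arbitrary assumptions chosen for illustration, not a standard.

```python
def metadata_gaps(record):
    """List the ways a deposit record falls short of the checklist."""
    gaps = []
    if len(record.get("title", "").strip()) < 10:
        gaps.append("title is missing or too short to be clear")
    if len(record.get("description", "").split()) < 20:
        gaps.append("description has fewer than 20 words")
    creators = record.get("creators", [])
    if not creators:
        gaps.append("no creators named")
    elif any(not creator.get("identifier") for creator in creators):
        gaps.append("at least one creator lacks a personal identifier")
    if not record.get("licence"):
        gaps.append("no explicit licence")
    if len(record.get("keywords", [])) < 3:
        gaps.append("fewer than 3 keywords")
    return gaps


# Hypothetical records, invented for this example.
thin = {
    "title": "data",
    "description": "Results.",
    "creators": [{"name": "A. Researcher"}],
    "keywords": ["misc"],
}
rich = {
    "title": "Hourly river temperature readings, 2010 to 2020",
    "description": (
        "Hourly water temperature measured at three stations on a single river, "
        "with sensor calibration notes, station coordinates and a description "
        "of each column."
    ),
    "creators": [{"name": "A. Researcher", "identifier": "hypothetical-person-id-001"}],
    "licence": "CC-BY-4.0",
    "keywords": ["hydrology", "water temperature", "time series"],
}
for label, record in (("thin", thin), ("rich", rich)):
    print(label, metadata_gaps(record))
```

A check like this cannot judge whether a description is meaningful, only whether the raw material for discovery is present. It catches the empty fields that leave a dataset invisible.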
A consistent vocabulary for findable data
For discovery to work across repositories, registries and search engines, the metadata describing datasets must mean the same thing everywhere: a creator, a licence, a resource type or a related-work link has to be interpretable consistently across every system that indexes it. That consistency is what the CASRAI Dictionary provides: a shared vocabulary so that dataset metadata is understood identically wherever it is harvested. And because creating and curating a dataset is a genuine research contribution, the people behind it can be credited using the same framework used for any other contribution: the CRediT taxonomy and its full set of contribution roles, with Data curation recognising the work that makes data findable in the first place. Depositing data is necessary; describing it well, in shared terms, is what makes it discoverable, and discoverability is what lets data fulfil its purpose.