Tag: data documentation initiative

DDI Metadata Standard: FAIR Data Checklist for Survey Archives

The DDI metadata standard (Data Documentation Initiative) is an international, XML-based specification for documenting surveys, censuses, and other social, behavioural, and economic science microdata at both the study and variable level. It is the metadata backbone that most social science data archives use to make survey data findable, accessible, interoperable, and reusable (FAIR) — turning a raw data file plus a PDF codebook into a machine-readable, citable, cataloguable research object.

DDI is not a government mandate or a funder requirement; it is a community-maintained documentation standard. The DDI Alliance, an international collaboration established in 2003, maintains the specification and its schemas. This guide explains what the standard covers, who uses it, how it maps onto the FAIR principles, and the practical steps a repository or research team needs to adopt it.

What is the DDI metadata standard?
Who maintains DDI and which archives use it?
How does DDI support the FAIR data principles?
DDI-Codebook vs DDI-Lifecycle vs DDI-CDI
A practical checklist for adopting DDI
Answer-first Q&A
What this means for research data repositories

What is the DDI metadata standard?

The Data Documentation Initiative is a metadata standard for describing the full lifecycle of a research data collection: study design, sampling, data collection, processing, variables, and access conditions. It was built specifically for social, behavioural, and economic sciences data — surveys, censuses, panel studies, and administrative microdata — rather than as a general-purpose schema.

Records are encoded in Extensible Markup Language (XML), which makes them machine-readable and harvestable. A DDI catalogue record typically documents three layers: the study description (bibliographic citation, scope, geography, time period, methodology), the data file description (format, structure, missing-data conventions, weighting), and the variable description (question text, value labels, codes). This granularity is what separates DDI from simpler discovery schemas such as Dublin Core, which describe a resource but not its internal variable structure.
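The three layers can be sketched in a few lines of Python. This is a minimal illustration only, assuming DDI-Codebook 2.5 element names (codeBook, stdyDscr, fileDscr, dataDscr, var). The study title, file name, and variable are invented, and the resulting fragment is simplified and has not been validated against the official schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by DDI-Codebook 2.5 documents.
DDI_NS = "ddi:codebook:2_5"


def _q(tag):
    """Qualify a tag name with the DDI-Codebook namespace."""
    return f"{{{DDI_NS}}}{tag}"


def build_codebook(title, file_name, variables):
    """Build a simplified, unvalidated codeBook element with three layers."""
    ET.register_namespace("", DDI_NS)
    root = ET.Element(_q("codeBook"))

    # Layer 1: study description (citation title only, for brevity).
    stdy = ET.SubElement(root, _q("stdyDscr"))
    citation = ET.SubElement(stdy, _q("citation"))
    titl_stmt = ET.SubElement(citation, _q("titlStmt"))
    ET.SubElement(titl_stmt, _q("titl")).text = title

    # Layer 2: data file description.
    file_dscr = ET.SubElement(root, _q("fileDscr"))
    file_txt = ET.SubElement(file_dscr, _q("fileTxt"))
    ET.SubElement(file_txt, _q("fileName")).text = file_name

    # Layer 3: variable description (label, question text, value labels).
    data_dscr = ET.SubElement(root, _q("dataDscr"))
    for v in variables:
        var = ET.SubElement(data_dscr, _q("var"), {"name": v["name"]})
        ET.SubElement(var, _q("labl")).text = v["label"]
        qstn = ET.SubElement(var, _q("qstn"))
        ET.SubElement(qstn, _q("qstnLit")).text = v["question"]
        for code, label in v["categories"].items():
            catgry = ET.SubElement(var, _q("catgry"))
            ET.SubElement(catgry, _q("catValu")).text = code
            ET.SubElement(catgry, _q("labl")).text = label
    return root


# Hypothetical example data, invented for illustration.
example = build_codebook(
    title="Example Household Survey, 2024",
    file_name="household_2024.csv",
    variables=[{
        "name": "Q1",
        "label": "Employment status",
        "question": "Which of these best describes your current employment?",
        "categories": {"1": "Employed", "2": "Unemployed", "9": "Refused"},
    }],
)
xml_text = ET.tostring(example, encoding="unicode")
```

A Dublin Core record could carry the title and creator, but it has no equivalent of the var element, which is where the question wording and value labels live.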

Who maintains DDI and which archives use it?

The DDI Alliance, an international collaboration of research institutions, statistical agencies, and data archives established in 2003, develops and maintains the specification. DDI is listed as a recognised research-data metadata standard in the Research Data Alliance Metadata Standards Catalog (entry m13), which documents its scope, schemas, and adoption.

According to the UK Data Service, DDI “is used by most social science data archives in the world” to structure catalogue records, and it forms the basis of the discovery metadata behind its own collection. The Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan and the members of CESSDA, the Consortium of European Social Science Data Archives, likewise build their cataloguing infrastructure on DDI. Records are exposed via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), so aggregators can harvest and index them without direct database access.
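As a rough sketch of what harvesting looks like from the aggregator's side, the snippet below builds OAI-PMH ListRecords request URLs and reads the resumptionToken that signals a further page of results. The endpoint https://archive.example.org/oai and the metadata prefix oai_ddi25 are placeholders assumed for illustration; a real repository advertises its supported prefixes through the ListMetadataFormats verb. No network request is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumptionToken from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{{{OAI_NS}}}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hypothetical endpoint and prefix, assumed for illustration.
BASE_URL = "https://archive.example.org/oai"
first_url = list_records_url(BASE_URL, metadata_prefix="oai_ddi25")

# A trimmed, invented response showing only the paging element.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

token = next_token(SAMPLE_RESPONSE)
second_url = list_records_url(BASE_URL, resumption_token=token)
```

A harvester would repeat the request with each new token until next_token returns None.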

How does DDI support the FAIR data principles?

The FAIR Guiding Principles — findable, accessible, interoperable, reusable — were formalised for the research community in 2016. DDI operationalises each principle for survey and social science data specifically, rather than leaving them as abstract goals.

Findable: structured study-level metadata (title, creators, keywords, abstract, coverage) makes records indexable by catalogues and search engines, and DDI records are commonly assigned persistent identifiers, including DOIs registered through DataCite.
Accessible: standardised access-condition fields tell a would-be reuser exactly how to request or download the data, and harvesting via OAI-PMH gives repositories a predictable retrieval protocol.
Interoperable: a shared XML vocabulary and controlled thesauri — the European Language Social Science Thesaurus (ELSST), maintained by CESSDA, is one widely used example — let metadata move between archives and languages without semantic drift.
Reusable: variable-level documentation (question wording, value labels, derivation logic) and provenance information are what actually let a second researcher re-run or extend an analysis, which is the point FAIR exists to serve.

DDI-Codebook vs DDI-Lifecycle vs DDI-CDI: which do you need?

DDI is not a single schema. Three variants serve different documentation depths, and choosing one that does not match the study design is a common early adoption mistake.

Variant	Best for	Documents	Status
DDI-Codebook (DDI-C)	A single finished dataset	Study, file, and variable description for one deposit	Simpler, widely used legacy format
DDI-Lifecycle (DDI-L)	Longitudinal or multi-wave studies	The full research lifecycle: concept, instrument, collection, processing, archiving, reuse	Comprehensive, versioned in the 3.x series
DDI-CDI (Cross-Domain Integration)	Integrating structured data across statistical and research domains	Model-driven descriptions that link datasets, variables, and classifications across systems	Developed by the DDI Alliance to work alongside other standards, including SDMX

A single-wave survey deposited once needs only DDI-Codebook. A cohort study revisited over years — the kind of resource the UK Data Service and ICPSR both hold in volume — needs DDI-Lifecycle to capture instrument changes between waves. DDI-CDI is aimed at repositories that need to align microdata with aggregate statistics (for example, linking a survey to official statistics published under SDMX), which is an emerging rather than default requirement.
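The rule of thumb in the paragraph above can be written down as a small helper. This is a deliberately simplified sketch: the function name and its two inputs are invented for illustration, and a real choice would also weigh tooling support and the archive's own requirements.

```python
def recommend_ddi_variants(waves, links_to_aggregates=False):
    """Return the DDI variants suggested by the rule of thumb above.

    waves: number of data collection waves in the study (1 for a one-off survey).
    links_to_aggregates: True if microdata must be aligned with aggregate
        statistics, for example figures published under SDMX.
    """
    if waves < 1:
        raise ValueError("a study has at least one wave")
    # One finished wave: DDI-Codebook. Repeated waves: DDI-Lifecycle.
    variants = ["DDI-Codebook" if waves == 1 else "DDI-Lifecycle"]
    if links_to_aggregates:
        variants.append("DDI-CDI")
    return variants


single_wave = recommend_ddi_variants(1)
cohort_study = recommend_ddi_variants(5)
linked_cohort = recommend_ddi_variants(3, links_to_aggregates=True)
```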

A practical checklist for adopting DDI

Repositories and research teams introducing DDI documentation for the first time should work through these steps in order:

Identify your lifecycle stage. A one-off dataset needs DDI-Codebook; a repeated or panel study needs DDI-Lifecycle.
Model metadata before ingest, not after. Capture study description, sampling, collection dates, and variable labels/codes at deposit time using a structured deposit form, as the UK Data Service does, rather than reverse-engineering them from a finished file.
Use a DDI-aware authoring tool (for example Colectica or Nesstar-derived CESSDA tooling) instead of hand-writing XML, which is error-prone at scale.
Register a persistent identifier. Crosswalk core fields to the DataCite metadata schema so the dataset gets a citable DOI alongside its DDI record.
Adopt a controlled vocabulary such as ELSST for subject keywords to keep records interoperable across languages and archives.
Enable OAI-PMH harvesting so catalogue aggregators and search services can index the record without bespoke integration work.
Validate against peer practice — check the record structure against the RDA Metadata Standards Catalog entry and against comparable ICPSR or CESSDA holdings before publishing.
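The crosswalk step in the checklist can be illustrated with a short sketch that maps study-level DDI fields onto the six mandatory DataCite properties (identifier, creator, title, publisher, publication year, resource type). The input keys borrow DDI-Codebook element names (titl, AuthEnty, distrbtr, distDate), but the mapping itself, the dictionary layout, and the placeholder DOI are assumptions made for illustration rather than an official crosswalk or the DataCite API payload format.

```python
# The six mandatory properties of the DataCite metadata schema.
DATACITE_REQUIRED = (
    "identifier", "creators", "titles",
    "publisher", "publicationYear", "resourceType",
)


def ddi_to_datacite(study, doi):
    """Map study-level DDI fields to a DataCite-style dict (illustrative only)."""
    dist_date = study.get("distDate") or ""
    return {
        "identifier": {"identifier": doi, "identifierType": "DOI"},
        "creators": [{"name": name} for name in study.get("AuthEnty", [])],
        "titles": [{"title": study["titl"]}] if study.get("titl") else [],
        # Assumption: the distributor is treated as the publisher.
        "publisher": study.get("distrbtr"),
        # Keep only the year from an ISO-style date such as 2024-03-01.
        "publicationYear": dist_date[:4] or None,
        "resourceType": {"resourceTypeGeneral": "Dataset"},
    }


def missing_required(record):
    """List mandatory DataCite properties that are empty in the record."""
    return [key for key in DATACITE_REQUIRED if not record.get(key)]


# Hypothetical study metadata and placeholder DOI, invented for illustration.
study = {
    "titl": "Example Household Survey, 2024",
    "AuthEnty": ["Example Research Centre"],
    "distrbtr": "Example Data Archive",
    "distDate": "2024-03-01",
}
record = ddi_to_datacite(study, doi="10.5072/example-survey-2024")
gaps_complete = missing_required(record)
gaps_empty = missing_required(ddi_to_datacite({}, doi="10.5072/example-empty"))
```

Running the gap check at deposit time, rather than at DOI registration, fits the "model metadata before ingest" step: missing creators or dates are cheaper to fix while the depositor is still engaged.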

Answer-first Q&A

What is the metadata standard DDI?

DDI (Data Documentation Initiative) is an international metadata standard for documenting socioeconomic surveys, censuses, and microdata. It is maintained by the DDI Alliance, encoded in XML, and used by most social science data archives worldwide to capture study, file, and variable-level documentation in one structured record.

What is the best metadata standard for survey data?

For general resource discovery, Dublin Core (ISO 15836) is the simplest and most widely implemented option. For social science survey and microdata specifically, DDI is the domain standard, because it documents variables and methodology in a depth Dublin Core does not attempt.

How does DDI support the FAIR data principles?

DDI supports FAIR by pairing structured, machine-readable metadata with persistent identifiers for findability, standardised access fields for accessibility, a shared XML vocabulary and thesauri for interoperability, and variable-level provenance for reusability — the depth needed to re-run a secondary analysis.

What is the difference between DDI-Codebook and DDI-Lifecycle?

DDI-Codebook documents a single finished dataset. DDI-Lifecycle documents the entire research process — instrument design, fieldwork, processing, and archiving — across multiple waves, making it the correct choice for longitudinal and panel studies rather than one-off deposits.

What this means for research data repositories

Funder and journal data-sharing policies increasingly ask for FAIR-compliant deposits, but “FAIR” is a set of principles, not a file format. DDI is one of the few domain standards that translates those principles into a concrete, testable schema for survey and social science data — which is why it underpins the cataloguing infrastructure at the UK Data Service, ICPSR, and CESSDA member archives rather than being a niche archival choice.

Institutions building or upgrading a research data repository for social science holdings should treat DDI adoption (DDI-Lifecycle where holdings are longitudinal), ELSST keywording, and DataCite DOI registration as a single connected workflow rather than three separate projects. Repositories that skip variable-level documentation still get a catalogue entry, but they do not get reuse, and reuse, not deposit, is the actual measure of FAIR success. Institutional research administration and data management guidance should reference DDI explicitly wherever survey or microdata deposit is in scope.

July 4, 2026

Datasheets for Datasets: FAIR Habits for AI Data

Datasheets for datasets are structured documentation records — covering motivation, composition, collection process, and recommended uses — that accompany a dataset the way a technical datasheet accompanies an electronic component. Proposed for machine learning in 2018, the practice mirrors documentation habits research data managers have used for decades, and research offices are increasingly the ones best placed to recognise and credit that documentation work.

A datasheet for a dataset is a short, standardised document that records where a dataset came from, how it was collected and labelled, what it should and should not be used for, and who is responsible for maintaining it. The idea was formalised by Timnit Gebru and colleagues in the 2018 paper “Datasheets for Datasets” (arXiv:1803.09010), later published in Communications of the ACM, Vol. 64, No. 12 (2021).

Where did datasheets for datasets come from?
What does a dataset datasheet actually document?
How do datasheets connect to FAIR data principles?
Answer-first questions on datasheets for datasets
What this means for research offices
Where the practice is heading

Where did datasheets for datasets come from?

Gebru et al.’s 2018 paper argued that machine learning datasets circulated with almost no accompanying documentation, unlike the datasheets that have long shipped with electronic components. The paper has since been cited by more than 4,700 works, according to citation counts indexed alongside the ACM Digital Library record — a scale of uptake that puts it among the most influential AI-ethics-adjacent papers of the past decade.

The proposal did not invent documentation practice from nothing. It imported habits that research-data communities already used. The Data Documentation Initiative (DDI), a metadata standard maintained by the DDI Alliance for the social, behavioural, and economic sciences, has specified variable-level dataset documentation since the early 2000s — well before the AI field adopted the term “datasheet.”

What does a dataset datasheet actually document?

Gebru et al.’s original template organises documentation into seven sections: motivation, composition, collection process, preprocessing/cleaning/labelling, uses, distribution, and maintenance. Each section is a set of prompts, not a checkbox — creators answer in prose, which is what makes the format adaptable across domains.

Motivation: why the dataset was created, who funded it, and what problem it addresses.
Composition: what the instances represent, how many there are, and whether sensitive attributes or personal data are present.
Collection process: how and from whom the data was gathered, and what consent or licensing applied.
Preprocessing/cleaning/labelling: what was done to the raw data, and whether the raw data was retained.
Uses: tasks the dataset is suited for, and, critically, tasks it should not be used for.
Distribution: how the dataset is shared, and under what licence or terms.
Maintenance: who is responsible for updates, corrections, and retraction if problems surface.
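One way a repository might hold a datasheet in structured form is sketched below. The seven field names follow the section headings of the original template; the class, its methods, and the example answers are invented for illustration, and the original proposal is a set of prose prompts rather than a fixed schema.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Prose answers for the seven datasheet sections (empty means unanswered)."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def unanswered(self):
        """Return the names of sections that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def to_text(self):
        """Render the datasheet as plain text, flagging any gaps."""
        lines = []
        for f in fields(self):
            heading = f.name.replace("_", " ").capitalize()
            answer = getattr(self, f.name).strip() or "[not yet documented]"
            lines.append(f"{heading}: {answer}")
        return "\n".join(lines)


# Hypothetical draft, invented for illustration.
draft = Datasheet(
    motivation="Created to study regional variation in survey response rates.",
    uses="Suitable for aggregate analysis; not for identifying individuals.",
)
gaps = draft.unanswered()
rendered = draft.to_text()
```

Keeping the answers as free text preserves the narrative character of the template, while the unanswered check gives a deposit workflow something concrete to test.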

Adjacent frameworks document different units of the same pipeline. Model Cards for Model Reporting (Mitchell et al., Google, 2019) document a trained model’s performance across demographic subgroups rather than the training data itself. The Dataset Nutrition Label, developed by the Data Nutrition Project (originating at Harvard and MIT), condenses similar information into a scannable label modelled on food nutrition facts. The table below maps how these efforts differ.

Framework	Origin	Unit documented	Primary audience
Datasheets for Datasets	Gebru et al., 2018 (arXiv/ACM)	Dataset provenance and composition	Dataset creators and consumers
Model Cards for Model Reporting	Mitchell et al., Google, 2019	Trained model performance	Model deployers and auditors
Dataset Nutrition Label	Data Nutrition Project, Harvard/MIT	Dataset health at a glance	Practitioners screening datasets quickly
Datasheets for Digital Cultural Heritage	Europeana Research/EuropeanaTech, 2023	Heritage collection reuse context	GLAM institutions and researchers

How do datasheets connect to FAIR data principles?

The FAIR Data Principles — Findable, Accessible, Interoperable, Reusable, set out by Wilkinson et al. in Scientific Data (2016) — were written for research data broadly, not for AI training corpora specifically. Datasheets operationalise the “Reusable” pillar in particular: a dataset without documented provenance, licensing, and known limitations cannot be responsibly reused, regardless of how accessible its files are.

This is a FAIR-adjacent practice rather than a formal extension of FAIR itself, and research offices should frame it that way rather than treating “datasheet” and “FAIR-compliant” as synonyms. A dataset can be technically Findable and Accessible while still shipping with a thin or absent datasheet — the two efforts solve overlapping but distinct problems.

Dataset-level documentation also underpins dataset citation. The Force11 Joint Declaration of Data Citation Principles (2014) established that datasets should be cited as first-class research outputs, and registration agencies such as DataCite issue the DOIs that make that citation persistent. A datasheet gives the context a citation alone cannot: not just that a dataset exists and where, but what it contains and how it may legitimately be used.

Answer-first questions on datasheets for datasets

What are datasheets for datasets?

Datasheets for datasets are structured documents that record a dataset’s motivation, composition, collection process, and intended uses. They were proposed by Gebru et al. in 2018 to give dataset creators and consumers a shared, standardised record — closing the gap between how thoroughly software and hardware components are documented and how poorly datasets typically are.

What information does a dataset datasheet include?

A complete datasheet covers seven areas: motivation, composition, collection process, preprocessing and labelling, recommended uses, distribution terms, and maintenance responsibility. Creators answer narrative prompts under each heading rather than filling in a fixed schema, which is why the format has been adapted for domains as different as machine learning corpora and digitised cultural heritage collections.

How do datasheets differ from model cards?

Datasheets document the dataset — its provenance, composition, and licensing. Model cards, introduced by Mitchell et al. at Google in 2019, document the trained model built from that data, including performance disaggregated across demographic groups. The two are complementary: a model card without a corresponding dataset datasheet leaves the training-data provenance question unanswered.

What this means for research offices

Research administration has treated dataset documentation as a data-management-plan checkbox for years; AI training-data transparency debates are now forcing the same discipline onto machine learning teams. Institutions that already run mature research-data-management functions have a genuine head start: DMP review, licensing checks, and provenance tracking are core competencies, not new ones.

One overlooked lever is contributor recognition. The CRediT contributor role taxonomy, originated by CASRAI in 2014, is now stewarded by NISO as ANSI/NISO Z39.104-2022. Its Data Curation role exists precisely to credit the labour of managing, annotating, and maintaining research data for reuse, which is the same labour a datasheet documents. Research offices that already apply CRediT to publications have a ready-made mechanism for recognising the people who write and maintain dataset datasheets, rather than letting that work go uncredited.

Require a datasheet (or equivalent provenance record) as a condition of institutional data-repository deposit, alongside existing licensing checks.
Map datasheet authorship to CRediT’s Data Curation role in institutional repository metadata.
Treat AI training-data provenance requests from partners and funders as an extension of existing data-management-plan review, not a new workflow.
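The second recommendation, mapping datasheet authorship to CRediT's Data Curation role, could look something like the sketch below. The record layout and the key names contributors and creditRoles are hypothetical; a real repository would use whatever contributor fields its own metadata schema provides.

```python
import copy

# Name of the CRediT role that covers dataset documentation work.
CREDIT_DATA_CURATION = "Data curation"


def credit_datasheet_authors(record, datasheet_authors):
    """Return a copy of the record with datasheet authors credited for data curation."""
    updated = copy.deepcopy(record)  # leave the caller's record untouched
    contributors = updated.setdefault("contributors", [])
    by_name = {c["name"]: c for c in contributors}
    for name in datasheet_authors:
        entry = by_name.get(name)
        if entry is None:
            entry = {"name": name}
            contributors.append(entry)
            by_name[name] = entry
        roles = entry.setdefault("creditRoles", [])
        if CREDIT_DATA_CURATION not in roles:
            roles.append(CREDIT_DATA_CURATION)
    return updated


# Hypothetical deposit record, invented for illustration.
deposit = {
    "title": "Example Interview Corpus",
    "contributors": [{"name": "A. Researcher", "creditRoles": ["Software"]}],
}
credited = credit_datasheet_authors(deposit, ["A. Researcher", "B. Curator"])
```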

Where the practice is heading

Uptake outside machine learning is accelerating. The Europeana Research Community and EuropeanaTech Community published Datasheets for Digital Cultural Heritage Datasets in the Journal of Open Humanities Data in 2023 (DOI: 10.5334/johd.124), adapting the template for collections that were digitised long after their original creation. A revised Version 2 template was released in July 2025, with alignment to the DCAT-AP data-portal application profile identified as ongoing work.

AI training-data transparency requirements are converging on documentation habits that research-data management has practised for two decades under the Data Documentation Initiative and, since 2016, under the FAIR principles. Research offices that recognise datasheets as an extension of existing data governance, rather than a novel AI-specific burden, will be better positioned to advise both AI developers and dataset creators as scrutiny of training-data provenance intensifies.

July 3, 2026
