v2026.1714 entries · CC-BY 4.0
CASRAI
Dictionary term · Track C · Stable · v2026.2

Data statement (NLP)

A standardised description of an NLP dataset covering curation rationale, language variety, speaker and annotator demographics, speech situation, text characteristics, and recording quality.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A coreference dataset published with a data statement that gives the language variety as US Mainstream English and describes the annotators as white, college-educated, and roughly 30 to 50 years old.

  • Is an instance

    A speech-recognition corpus's data statement specifying broadcast news from 1995-2005.

Counter-examples

Looks similar, but isn't

  • Not an instance

    A token count and vocabulary size summary.

  • Not an instance

    A model card (describes the model, not the data).

Editorial commentary

Bender and Friedman (2018) proposed data statements specifically for natural-language datasets, to surface the linguistic and demographic constraints that bound how far a model trained on the data can be expected to generalise. For example, a model trained on US-English Wikipedia text is likely to perform differently on African-American English; a data statement makes that kind of mismatch visible.
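As a rough illustration, the elements named in the definition can be held in a structured record and checked mechanically. The sketch below is not part of Bender and Friedman's proposal or of any CASRAI encoding: the class `DataStatement`, its field names, and the helpers `missing_elements` and `variety_mismatch` are hypothetical, and the sample values are placeholders loosely based on the worked example above.

```python
from dataclasses import dataclass, fields


@dataclass
class DataStatement:
    """Hypothetical record with one field per element named in the definition."""
    curation_rationale: str
    language_variety: str
    speaker_demographics: str
    annotator_demographics: str
    speech_situation: str
    text_characteristics: str
    recording_quality: str = "n/a"  # assumption: text-only corpora have no recordings


def missing_elements(statement: DataStatement) -> list[str]:
    """Return the names of elements that were left blank."""
    return [
        f.name for f in fields(statement)
        if not getattr(statement, f.name).strip()
    ]


def variety_mismatch(statement: DataStatement, target_variety: str) -> bool:
    """Naive check: True when the target variety differs from the documented one."""
    return statement.language_variety.casefold() != target_variety.casefold()


# Placeholder values for illustration only.
coref = DataStatement(
    curation_rationale="Coreference annotation (placeholder)",
    language_variety="US Mainstream English",
    speaker_demographics="",  # deliberately blank to show the completeness check
    annotator_demographics="white, college-educated, roughly 30-50 years old",
    speech_situation="written text (placeholder)",
    text_characteristics="placeholder",
)

print(missing_elements(coref))                              # ['speaker_demographics']
print(variety_mismatch(coref, "African-American English"))  # True
```

A plain string comparison is a deliberately crude stand-in; a real check would need a controlled vocabulary of language varieties, which this entry does not define.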

References

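  • Bender, E. M., and Friedman, B. (2018). 'Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science'. Transactions of the Association for Computational Linguistics, 6, 587-604.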

Also known as

NLP data statement · Bender-Friedman data statement

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Data statement (NLP)"
      vocab-term-identifier="https://casrai.org/dictionary/term/data-statement-nlp" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data statement (NLP)",
  "identifier": "https://casrai.org/dictionary/term/data-statement-nlp",
  "description": "A standardised description of an NLP dataset covering curation rationale, language variety, speaker and annotator demographics, speech situation, text characteristics, and recording quality.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/data-statement-nlp",
  "alternateName": [
    "NLP data statement",
    "Bender-Friedman data statement"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
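A record of this shape can also be generated rather than copied by hand. The following is a minimal sketch using only Python's standard `json` module; the helper `defined_term` and the `BASE_URL` pattern are assumptions inferred from the identifiers shown above, not a published CASRAI API. It places the alternative labels under `alternateName`, since schema.org defines `sameAs` for URLs rather than plain-text names.

```python
import json

# Assumption: URL pattern inferred from the identifiers in the record above.
BASE_URL = "https://casrai.org/dictionary"


def defined_term(name, slug, description, alternate_names, domain_slug):
    """Build a schema.org DefinedTerm as a plain dict (hypothetical helper)."""
    term_url = f"{BASE_URL}/term/{slug}"
    return {
        "@context": "https://schema.org",
        "@type": "DefinedTerm",
        "name": name,
        "identifier": term_url,
        "description": description,
        "inDefinedTermSet": f"{BASE_URL}/domain/{domain_slug}/",
        "url": term_url,
        "alternateName": list(alternate_names),
        "license": "https://creativecommons.org/licenses/by/4.0/",
    }


record = defined_term(
    name="Data statement (NLP)",
    slug="data-statement-nlp",
    description=(
        "A standardised description of an NLP dataset covering curation "
        "rationale, language variety, speaker and annotator demographics, "
        "speech situation, text characteristics, and recording quality."
    ),
    alternate_names=["NLP data statement", "Bender-Friedman data statement"],
    domain_slug="ai-and-ml-research-outputs",
)

# Serialise for embedding in a <script type="application/ld+json"> element.
print(json.dumps(record, indent=2))
```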