Examples
Worked examples
- Is an instance
A coreference dataset published with a data statement that specifies US Mainstream English and describes the annotators as white, college-educated, and roughly 30 to 50 years old.
- Is an instance
A speech-recognition corpus's data statement specifying broadcast news from 1995-2005.
Counter-examples
Looks similar, but isn't
- Not an instance
A token count and vocabulary size summary.
- Not an instance
A model card (describes the model, not the data).
Editorial commentary
Bender and Friedman (2018) proposed data statements specifically for natural-language datasets, to surface the linguistic and demographic constraints that bound how far a model trained on the data can be expected to generalise. For example, a model trained on US-English Wikipedia may perform differently on African-American English; a data statement makes such a mismatch visible.
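As a rough illustration of how a data statement can make such a mismatch visible, the sketch below models a statement as a small Python record and compares its declared language variety with a deployment target. The class `DataStatement`, the helpers `missing_sections` and `variety_mismatch`, and all example values are assumptions made for this illustration; the field names simply mirror the sections listed in this entry's definition and are not a schema prescribed by Bender and Friedman.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DataStatement:
    """Hypothetical record; one field per section named in the definition."""

    curation_rationale: str
    language_variety: str
    speaker_demographics: str
    annotator_demographics: str
    speech_situation: str
    text_characteristics: str
    recording_quality: str = "n/a"  # assumption: only meaningful for audio data


def missing_sections(statement: DataStatement) -> list[str]:
    """Return the names of sections that were left blank."""
    return [
        f.name for f in fields(statement)
        if not getattr(statement, f.name).strip()
    ]


def variety_mismatch(statement: DataStatement, target_variety: str) -> bool:
    """Naive check: True when the declared variety differs from the target."""
    declared = statement.language_variety.strip().lower()
    return declared != target_variety.strip().lower()


# Example values are invented for illustration only.
coref = DataStatement(
    curation_rationale="Coreference annotation of written text",
    language_variety="US Mainstream English",
    speaker_demographics="Not recorded by the source",
    annotator_demographics="White, college-educated, roughly 30 to 50 years old",
    speech_situation="Edited, asynchronous written text",
    text_characteristics="Encyclopedic prose",
)

print(missing_sections(coref))
print(variety_mismatch(coref, "African-American English"))
```

The string comparison is deliberately simplistic; a real check would need a controlled vocabulary of language varieties, which this sketch does not attempt to define.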
References
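- Bender, E. M., and Friedman, B., 'Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science' (Transactions of the Association for Computational Linguistics, 6, 587-604, 2018).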
Also known as
NLP data statement · Bender-Friedman data statement
Machine-readable encodings
Use in your systems
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Data statement (NLP)"
      vocab-term-identifier="https://casrai.org/dictionary/term/data-statement-nlp" />

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data statement (NLP)",
  "identifier": "https://casrai.org/dictionary/term/data-statement-nlp",
  "description": "A standardised description of an NLP dataset covering curation rationale, language variety, speaker and annotator demographics, speech situation, text characteristics, and recording quality.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/data-statement-nlp",
  "sameAs": [
    "NLP data statement",
    "Bender-Friedman data statement"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
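A minimal sketch of consuming the JSON-LD record in Python, assuming the record is already available as a string (how it is fetched or stored is left open). The constant `RECORD` is a shortened copy of the record above, and the helper `known_names` is a hypothetical name introduced for this example. In this record `sameAs` carries alternative names rather than URLs, so the sketch treats its values as plain labels.

```python
from __future__ import annotations

import json

# Shortened copy of the record above; only the keys read below are kept.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Data statement (NLP)",
  "identifier": "https://casrai.org/dictionary/term/data-statement-nlp",
  "sameAs": ["NLP data statement", "Bender-Friedman data statement"]
}
"""


def known_names(record_text: str) -> set[str]:
    """Collect the preferred name and the alternative names of a term."""
    term = json.loads(record_text)
    if term.get("@type") != "DefinedTerm":
        raise ValueError("not a schema.org DefinedTerm record")
    aliases = term.get("sameAs", [])
    if isinstance(aliases, str):  # JSON-LD allows a single value or a list
        aliases = [aliases]
    return {term["name"], *aliases}


names = known_names(RECORD)
print(sorted(names))
```

A lookup like this could be used to normalise free-text labels such as "NLP data statement" to the term's identifier when tagging research outputs.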







