Datasheets for Datasets

Section 54.7

"A datasheet is what your dataset wears to the interview. The interviewer is the GDPR auditor."

Census, Datasheet-Native AI Agent
Big Picture

A datasheet is to a dataset what a model card is to a model: a structured document that captures motivation, composition, collection process, labeling, distribution, and maintenance. Gebru et al.'s 2018 "Datasheets for Datasets" predates and complements model cards. The two together cover the question every downstream user should ask: "given this model, trained on this data, what can I trust it to do?" This section walks through the canonical seven-section structure, surveys the datasheet adoption landscape on Hugging Face Datasets and Google's Know Your Data, and runs a completeness audit on a typical dataset to show what good and bad datasheets look like in practice. For LLM and agent engineers, the datasheet is the upstream provenance check: an LLM fine-tuned or RAG-grounded on a corpus without a real datasheet inherits unknown licensing, demographic, and contamination risks that show up later as evaluation blind spots and legal exposure.

Prerequisites

This section assumes the pretraining-corpus discussion from Section 10.9, the data-licensing vocabulary from Section 54.1, and the model-card pattern from Section 54.6.

54.7.1 Why Datasheets Exist

Fun Fact

The datasheet concept that Gebru et al. (2018) imported into ML was borrowed directly from the electronic components industry, where every capacitor and resistor ships with a one-page spec sheet. Texas Instruments published the first standardized component datasheet in 1952, meaning ML researchers reinvented a 70-year-old documentation discipline and gave it a CACM article.

Datasets, even more than models, were historically published with minimal documentation: a citation, a download URL, occasionally a sentence about how the data was collected. The downstream consequences were severe: ImageNet's WordNet-driven label hierarchy contained slurs and offensive labels for years before a 2019 audit (Crawford and Paglen) brought them to public attention; the original COMPAS recidivism dataset was used for risk-prediction models without documentation of how the underlying data had been collected from a demographic-skewed criminal-justice pipeline.

Gebru et al. proposed datasheets to fix this by asking dataset creators a structured set of questions, modeled on the electronic-components industry's datasheets where every capacitor and resistor ships with a one-page specification. The intent was both to prompt creators to think about issues at creation time and to give consumers the information they need to decide whether the dataset is appropriate for their purpose.

54.7.2 The Seven Canonical Sections

The Gebru et al. format (refined in the 2021 CACM article and adopted with minor variations by the major platforms) tracks the dataset's lifecycle in seven beats:

Key Insight
Aha Moment: The Sentence That Created an Entire Audit Literature

Buried in section 4 ("Preprocessing") of the C4 datasheet is the line "URLs containing words from the List of Dirty, Naughty, Obscene, and Otherwise Bad Words were excluded." Dodge et al. (2021) tested what that 402-word list actually filtered: 40.8 percent of URLs from documents that mention LGBTQ+ identities and a disproportionate fraction of African-American English vernacular. T5 and every downstream model trained on a C4 derivative inherited those filter choices. One line in one datasheet section, written for completeness in 2020, became the citation that launched a four-year audit literature (Birhane 2023, Soldaini 2024, the AllenAI Dolma datasheet). This is why section 4, which looks boring, is the section that determines what your model can never say: a data point you filter away never shows up in the loss curve, and the datasheet is the only place anyone will ever find it.

  1. Motivation. Why was this dataset created? Who funded it? What gap does it fill? Questions surface up-front because the answers influence what gets included.
  2. Composition. What does each instance represent (an image, a sentence, a row)? What is the total count, the breakdown by relevant attributes (language, demographic, source), and the relationships between instances?
  3. Collection Process. How was the data acquired, sampled, and from what time window? Were people involved in collection, and if so, were they compensated and informed?
  4. Preprocessing, Cleaning, Labeling. What preprocessing was done; what was discarded; who labeled the data and under what conditions; what was the inter-annotator agreement?
  5. Uses. What has the dataset been used for? Are there tasks it should not be used for?
  6. Distribution. Who can access it, on what license, with what restrictions, and through what mechanism?
  7. Maintenance. Who maintains it, how are corrections handled, what is the deprecation policy?

Each section contains 3-10 specific questions; the full schema runs to about 60 questions across the seven sections.

54.7.3 Real Examples: C4 and The Pile

Two widely used LLM pretraining datasets illustrate the spectrum of datasheet quality.

C4 (Colossal Clean Crawled Corpus). The C4 dataset documentation, published with the T5 paper and substantially expanded in Dodge et al. (2021), is one of the best examples in the LLM pretraining space. It covers motivation (a clean web-text corpus for T5), composition (730GB, English-language Common Crawl filtered with the "Bad Words List"), collection process (April 2019 Common Crawl snapshot, specific filtering steps documented), and known issues (the Bad Words List filters were over-aggressive in some categories and under-aggressive in others; non-Western English dialects were disproportionately filtered). Subsequent audits (Birhane et al. 2023) expanded the documentation further, demonstrating that datasheets evolve as more people use the dataset.

The Pile. The EleutherAI Pile datasheet (Gao et al. 2020) is famous for being a textbook example of the format. The 22-component breakdown (PubMed, ArXiv, GitHub, Stack Exchange, etc.) is documented per-source with license, language, time window, and known issues. The Pile has been substantially deprecated in 2024-2025 due to legal and copyright issues with several components; the maintainers' transparent communication of these issues (documented in updated datasheet revisions) is itself a good example of section 7 (Maintenance) in action.

Key Insight

The most important datasheet question is "what was excluded?" A dataset's inclusion list is usually well documented (it's what the creators are proud of). Its exclusion list (what was filtered, what was sampled out, what was never collected in the first place) is what determines the model's blind spots. The Bad Words List in C4 is famous because its exclusions disproportionately removed African-American English, and that choice shaped every model trained on C4 derivatives.
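One way to make the exclusion question concrete is to measure, per group, how much of a corpus a filter removes. The sketch below is a minimal illustration under stated assumptions: the function name `exclusion_rates`, the group labels, the toy documents, and the single-token blocklist are all invented here, and the matching rule (drop a document if any whitespace token is on the list) is a simplification, not the filter any real corpus used.

```python
from collections import defaultdict

def exclusion_rates(docs, blocklist):
    """Return {group: fraction of that group's documents removed by the blocklist}.

    docs      -- iterable of (group_label, text) pairs (labels are hypothetical)
    blocklist -- set of lowercase tokens; a document is dropped if any token matches
    """
    seen = defaultdict(int)
    dropped = defaultdict(int)
    for group, text in docs:
        seen[group] += 1
        tokens = set(text.lower().split())
        if tokens & blocklist:
            dropped[group] += 1
    return {group: dropped[group] / seen[group] for group in seen}

# Toy corpus invented for illustration only; these are not real measurements.
toy_docs = [
    ("group_a", "alpha beta gamma"),
    ("group_a", "alpha flagged delta"),
    ("group_b", "flagged epsilon"),
    ("group_b", "zeta flagged eta"),
    ("group_b", "theta iota"),
    ("group_b", "flagged kappa"),
]
rates = exclusion_rates(toy_docs, blocklist={"flagged"})
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} removed")
```

A gap between the per-group rates is the kind of finding that belongs in section 4 of a datasheet, next to the description of the filter itself.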

54.7.4 Hugging Face Dataset Cards: The De Facto Standard

The Hugging Face Datasets library standardized the format for dataset documentation on the Hub, with a YAML frontmatter schema and a Markdown body. As of 2026 the schema includes:

# Hugging Face dataset card YAML metadata
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
- es
- fr
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
- topic-classification
pretty_name: Multi-Lingual Customer Reviews v2
tags:
- customer-feedback
- reviews
- benchmark
dataset_info:
  features:
  - {name: text, dtype: string}
  - {name: label, dtype: int32}
  - {name: language, dtype: string}
  - {name: source_country, dtype: string}
  splits:
  - {name: train, num_examples: 800000}
  - {name: validation, num_examples: 100000}
  - {name: test, num_examples: 100000}
  download_size: 1234567890
  dataset_size: 2345678901
Code Fragment 54.7.1: Hugging Face dataset card YAML metadata. The structured fields enable Hub-level filtering ("show me all multilingual classification datasets with permissive licenses") and procurement-tool consumption. The free-text Markdown body is where the Gebru et al. seven-section structure actually lives.
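Because the metadata is structured, a procurement script can sanity-check it before anyone reads the Markdown body. The sketch below assumes the YAML has already been parsed into a Python dict by whatever parser you use. The function names and the list of required fields are hypothetical policy choices made for this example, not requirements imposed by the Hub.

```python
def check_card_metadata(meta, required=("pretty_name", "task_categories", "size_categories")):
    """Return a list of problems found in already-parsed dataset card metadata."""
    problems = [f"missing field: {key}" for key in required if not meta.get(key)]
    # Cards may spell the key "license" or "licenses"; accept either form.
    if not (meta.get("license") or meta.get("licenses")):
        problems.append("missing field: license")
    if not meta.get("dataset_info", {}).get("splits"):
        problems.append("no split sizes declared")
    return problems

def total_examples(meta):
    """Sum num_examples over all declared splits (0 if none are declared)."""
    splits = meta.get("dataset_info", {}).get("splits", [])
    return sum(split["num_examples"] for split in splits)

# A subset of the fields from the fragment above, as a parser would return them.
card = {
    "pretty_name": "Multi-Lingual Customer Reviews v2",
    "task_categories": ["text-classification"],
    "size_categories": ["100K<n<1M"],
    "license": ["cc-by-sa-4.0"],
    "dataset_info": {
        "splits": [
            {"name": "train", "num_examples": 800000},
            {"name": "validation", "num_examples": 100000},
            {"name": "test", "num_examples": 100000},
        ]
    },
}
incomplete = {"pretty_name": "Untitled dump"}

print(check_card_metadata(card))
print(total_examples(card))
print(check_card_metadata(incomplete))
```

Summing the splits is also a cheap cross-check against the declared `size_categories` bucket, which is filled in by hand and can drift from the actual counts.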

54.7.5 Google Know Your Data and the Tools Landscape

Two tools dominate the 2026 datasheet-tooling space.

Google Know Your Data (KYD) (open-sourced in 2022, substantially extended in 2024) takes a dataset and computes automatic summaries: class balance, attribute distributions, intersectional slices, near-duplicate detection. The output is a set of HTML reports that supplement (not replace) a human-written datasheet. KYD is best-fit for tabular and structured datasets; for raw text it computes vocabulary statistics, length distributions, and language-detection breakdowns.

Hugging Face Data Studio (released 2024) provides a similar Web UI on top of any dataset on the Hub. It computes basic statistics, displays sample-level previews, and surfaces flagged issues (high duplicate rates, severely imbalanced classes, suspicious string patterns). The combination of Data Studio + the manually-curated dataset card is the closest thing to a default datasheet workflow for new LLM corpora.
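The statistics these tools report are simple to reproduce on a small sample, which helps when checking a vendor's claims by hand. The following is a standard-library sketch, not the implementation of either tool: the function name `describe`, the row format, and the toy rows are assumptions made for illustration, and duplicates are detected by exact string match only.

```python
from collections import Counter
from statistics import median

def describe(rows):
    """rows: list of (text, label) pairs. Returns basic descriptive statistics."""
    texts = [text for text, _ in rows]
    labels = Counter(label for _, label in rows)
    n = len(rows)
    return {
        "n": n,
        # Exact-match duplicates only; near-duplicate detection needs more machinery.
        "duplicate_rate": 1 - len(set(texts)) / n,
        "class_counts": dict(labels),
        # Share of the largest class; values near 1.0 signal severe imbalance.
        "majority_share": max(labels.values()) / n,
        "median_length_tokens": median(len(text.split()) for text in texts),
    }

# Toy rows invented for illustration.
rows = [
    ("great product", "pos"),
    ("great product", "pos"),
    ("arrived late", "neg"),
    ("works as described", "pos"),
    ("would buy again", "pos"),
]
stats = describe(rows)
print(f"n={stats['n']}, duplicate_rate={stats['duplicate_rate']:.2f}, "
      f"majority_share={stats['majority_share']:.2f}")
```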

Diagram of a typical dataset auditing pipeline. A dataset enters from the left. Three parallel analysis tools run: (1) Hugging Face Data Studio extracts basic stats; (2) Google KYD extracts attribute distributions and intersectional slices; (3) PII scanner (Presidio + custom domain rules) flags identifying-information leakage. Outputs from all three feed a 'datasheet draft generator' which produces a Markdown skeleton conforming to the Gebru et al. seven-section format. A human author then edits and reviews. Final datasheet is published alongside the dataset. Side note: 'automated extraction covers ~60% of fields; the remaining 40% require human judgment.'
Figure 54.7.1a: A modern datasheet creation pipeline. Automated tools handle the descriptive statistics; human authors handle motivation, intended uses, and ethical considerations. The pipeline pattern is what makes datasheet adoption tractable at scale; the alternative (every dataset author writes a datasheet from scratch) is what gave us the 2018-2022 era of mostly-missing datasheets.
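The "datasheet draft generator" stage of this pipeline can be as small as a template function. The sketch below is one hypothetical shape for it: `draft_skeleton`, the `auto_fields` argument, and the wording of the placeholder lines are invented for this example. It emits the seven section headings, prefills whatever the automated tools supplied, and marks everything else for a human author.

```python
SECTIONS = [
    "Motivation",
    "Composition",
    "Collection Process",
    "Preprocessing, Cleaning, Labeling",
    "Uses",
    "Distribution",
    "Maintenance",
]

def draft_skeleton(dataset_name, auto_fields):
    """Build a Markdown skeleton.

    auto_fields maps a section name to a list of machine-extracted statements.
    Sections with no automated input get a TODO line for the human author.
    """
    lines = [f"# Datasheet: {dataset_name}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        lines.append(f"## {number}. {section}")
        prefilled = auto_fields.get(section, [])
        lines.extend(f"- {item} (auto-extracted, verify)" for item in prefilled)
        if not prefilled:
            lines.append("- TODO: requires human author")
        lines.append("")
    return "\n".join(lines)

skeleton = draft_skeleton(
    "example-corpus",  # hypothetical dataset name
    {"Composition": ["1,000,000 instances across 3 splits"]},
)
print(skeleton)
```

Keeping the "verify" marker on every prefilled line reflects the division of labor in the figure: automated output is a draft, and the human review step is what turns it into documentation.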

54.7.6 Completeness Audit: What Good and Bad Look Like

A pragmatic exercise: rate a dataset's datasheet against the 60-question Gebru et al. schema. We score each question 0 (missing), 1 (partial), or 2 (complete), for a maximum of 120. A datasheet scoring 90 or more is "good"; 60-89 is "usable"; below 60 is "needs work." Recent audits (Boyd 2024, Sloane et al. 2024) report the per-section pattern summarized in Figure 54.7.2.
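The rubric is mechanical enough to script. The sketch below is a minimal version, assuming one integer score per question; the function name `rate_datasheet` and the example score distribution are invented for illustration, and treating a total of exactly 90 as "good" is an assumption made here.

```python
def rate_datasheet(question_scores):
    """Map 60 per-question scores (0, 1, or 2) to a total and a quality band."""
    if len(question_scores) != 60:
        raise ValueError("expected one score for each of the 60 questions")
    if any(score not in (0, 1, 2) for score in question_scores):
        raise ValueError("each score must be 0 (missing), 1 (partial), or 2 (complete)")
    total = sum(question_scores)
    if total >= 90:
        band = "good"
    elif total >= 60:
        band = "usable"
    else:
        band = "needs work"
    return total, band

# Hypothetical audit: 20 complete answers, 25 partial, 15 missing.
scores = [2] * 20 + [1] * 25 + [0] * 15
total, band = rate_datasheet(scores)
print(f"{total}/120 -> {band}")  # prints "65/120 -> usable"
```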

The most consequential gaps are in sections 4 (Preprocessing/Labeling) and 5 (Uses). The labeling section often gives no information about labeler demographics, working conditions, or compensation, all of which influence labeling quality and the model's downstream biases. The "uses" section often lists only the dataset creator's intended use; out-of-scope uses are rarely enumerated.

Warning: Missing Datasheets Are Their Own Documentation

If a widely-used dataset has no datasheet, that absence is itself information. The 2023 audit of LAION-5B revealed that the original release lacked basic composition documentation; the dataset turned out to contain CSAM, illegal in most jurisdictions, that the creators were unable to characterize because they had not documented the collection process. Procurement reviewers should treat "no datasheet" as a strong signal to either avoid the dataset or to commission an independent audit before use.

Datasheet completeness audit: median scores per section across recent corpora
Figure 54.7.2: Empirical completeness of the seven Gebru et al. sections across 220 widely used datasets, drawn from the Boyd (FAccT 2024) and Sloane et al. (Nature Machine Intelligence 6, 245-256, 2024) audits. Motivation and composition (sections 1-2) sit well above the 60% "usable" floor; collection, preprocessing, uses, and distribution hover in the 50-70% band; maintenance (section 7) is the field-wide failure at 28%. The Preprocessing dip is where the C4 "Bad Words List" episode lived: a single line in section 4 of the datasheet revealed that 40.8% of LGBTQ+ URLs and a disproportionate fraction of African-American English vernacular were filtered out, shaping every model trained on a C4 derivative.

54.7.7 Datasheets in the EU AI Act and Procurement

The EU AI Act's Article 10 (data and data governance) and Annex IV (technical documentation) require, for high-risk AI systems, documentation of training data covering most of the Gebru et al. schema; general-purpose AI models carry parallel documentation duties under Article 53 and Annex XI. The Act does not mandate the exact format, but the substance of what must be documented (data sources, collection methodology, suitability for purpose, bias-relevant attributes, data-protection assessments) maps onto the seven sections almost directly. Vendors who already publish good datasheets are most of the way to AI Act Annex IV compliance for the data documentation requirements.

A useful objective completeness score is a weighted coverage rate over the seven canonical sections:

$$\mathrm{coverage} \;=\; \frac{\sum_{s=1}^{7} w_s \cdot \mathbb{1}[\text{section } s \text{ answered}]}{\sum_{s=1}^{7} w_s}, \qquad \sum_s w_s = 1,$$

where the weights $w_s$ encode procurement priority (the EU AI Act Annex IV checklist weights collection process and data-protection assessments more heavily than motivation, for example). A typical procurement threshold is $\mathrm{coverage} \ge 0.8$ with no critical section left unanswered.

# Quick datasheet completeness check against the Gebru et al. seven sections.
from dataclasses import dataclass

@dataclass
class SectionCheck:
    name: str
    weight: float
    present: bool
    critical: bool = False  # a missing critical section forces REJECT

def score(checks, threshold=0.80):
    """Weighted coverage rate plus ACCEPT/REJECT decision."""
    total = sum(c.weight for c in checks)
    got = sum(c.weight for c in checks if c.present)
    coverage = got / total
    missing = [c.name for c in checks if not c.present]
    # Accept only if coverage clears the threshold and no critical section is missing.
    critical_gap = any(c.critical and not c.present for c in checks)
    decision = "ACCEPT" if coverage >= threshold and not critical_gap else "REJECT"
    return coverage, missing, decision

checks = [
    SectionCheck("motivation",             0.10, True),
    SectionCheck("composition",            0.15, True),
    SectionCheck("collection_process",     0.25, False, critical=True),  # EU AI Act priority
    SectionCheck("preprocessing_labeling", 0.20, True),
    SectionCheck("uses",                   0.10, True),
    SectionCheck("distribution",           0.10, True),
    SectionCheck("maintenance",            0.10, False),  # common gap
]
coverage, missing, decision = score(checks)
print(f"coverage = {coverage:.2f}; missing = {missing}; {decision}")
# Output: coverage = 0.65; missing = ['collection_process', 'maintenance']; REJECT
Code Fragment 54.7.2a: A 20-line weighted-coverage checker. Wire this into your vendor onboarding flow and ship a one-page report to the data-protection officer for every new training dataset.
Real-World Scenario: Procuring a Medical Imaging Dataset

A healthcare-AI startup is evaluating a chest X-ray dataset for training a pneumonia-detection model. The procurement checklist drills into the datasheet for: (1) patient demographics (sex, age, geographic distribution), documented in section 2; (2) image acquisition equipment (vendor, model, year), partially documented in section 3; (3) labeling provenance (radiologist credentials, inter-rater agreement), documented in section 4; (4) out-of-scope uses (e.g., pediatric populations not represented), documented in section 5. The procurement review identifies a gap: the dataset's labeling section says "labeled by radiologists" without further detail. The startup requests and obtains a supplementary document covering labeler credentials and inter-rater reliability (IRR) before proceeding. Six months later, when the model performs worse on a deployment-site population that differs from the training distribution, the dataset's section 2 documentation is what enables a structured root-cause analysis.
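The inter-rater figure the startup asks for is typically reported as Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The sketch below is a minimal two-rater implementation; the function name `cohens_kappa` and the ten toy labels are invented for illustration and are not drawn from any real dataset.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy labels for ten scans from two hypothetical radiologists.
P, N = "pneumonia", "normal"
rater_a = [P, P, P, P, N, N, N, N, P, N]
rater_b = [P, P, P, N, N, N, N, P, P, N]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 8 of 10 scans, but because each uses both labels half the time, chance agreement is 0.5 and kappa comes out well below the raw 80 percent. That gap is why a datasheet should report a chance-corrected statistic and not raw agreement alone.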

Key Insight

A datasheet captures the data side of the model-card story: motivation, composition, collection, labeling, uses, distribution, maintenance. Gebru et al.'s 2018 seven-section format is now the common reference point for Hugging Face dataset cards and Google Know Your Data, and it maps closely onto what EU AI Act Article 10 requires, even though the Act mandates no format. The most common gaps are in collection-process documentation and maintenance policies. Production pipelines combine automated statistical extraction (Data Studio, KYD) with human-written motivation and intended-use sections. Like model cards, datasheets are necessary but not sufficient; pair them with completeness audits and third-party review for high-stakes uses.

Self-Check
Q1: Why is the "exclusion" question more diagnostic than the "inclusion" question in evaluating a datasheet?
Answer:
"What was included?" gets a marketing answer; any vendor can list the categories they did capture. "What was excluded?" forces the vendor to confront the gaps that determine downstream model behavior. A medical-imaging dataset that includes 12 hospitals but excluded pediatric scans, low-end ultrasound, and one specific scanner manufacturer is a different artifact from one that included only adult MRI from three Boston hospitals; the second card is honest, the first hides the gap. Exclusion answers also surface selection biases the inclusion list never reveals: who chose what to collect, what was discarded during cleaning, and what was filtered out at labeling. The Gebru et al. template asks both, but the exclusion side is where the deployment-relevant information lives.
Q2: You're auditing a dataset whose datasheet rates 50/120 on the Gebru et al. schema. Which sections should you prioritize requesting supplementary documentation for, and why?
Answer:
Prioritize collection process (section 3) and preprocessing (section 4) first because these dominate downstream behavior and are the two sections that vendors most often leave thin. Specifically, request: (a) the protocol used to recruit and instruct labelers, including the rejection rate; (b) the deduplication, filtering, and quality-control steps applied after raw collection; (c) any rebalancing or class-weighting decisions made before publishing. After those, request distribution and maintenance (sections 6 and 7) to understand whether the dataset is a living artifact that gets corrected when problems surface or a one-shot dump. The remaining sections (motivation, uses) are usually adequately covered even in 50/120 cards because they are research-community boilerplate.
Q3: The EU AI Act Article 10 does not mandate the Gebru et al. format. Why is it likely to be adopted in practice anyway?
Answer:
Article 10 requires data-governance documentation but does not specify a format, which leaves vendors free to invent their own templates. In practice, three forces push toward the Gebru et al. format: (a) it is the de facto standard already adopted by Hugging Face Datasets, which is the distribution layer for most open models; (b) Big Tech procurement organizations (Microsoft, Google, Salesforce) require datasheet-style documentation from data vendors, so suppliers learn one format that satisfies both EU compliance and major-customer procurement; (c) regulators benefit from a comparable format across vendors when reviewing conformity assessments. The result is convergence on the Gebru template plus minor extensions, the same pattern that turned model cards into a de facto standard before the EU AI Act mandated them in spirit.
Q4: List two facts you would want to know about labelers (section 4) before approving a dataset for medical-imaging procurement.
Answer:
First, the labelers' clinical credentials and the supervision protocol: were these board-certified radiologists or crowdsourced workers given a half-hour training? A dataset labeled by non-clinicians is unsuitable as ground truth for a diagnostic model regardless of how large or balanced it is. Second, the inter-rater agreement (Cohen's kappa or Krippendorff's alpha) and the disagreement-resolution protocol; medical-imaging label disagreement is high (kappa around 0.6 on many conditions), and a dataset where disagreements were resolved by majority vote of three radiologists is meaningfully different from one resolved by a single senior radiologist or by an automated rule. Both facts are required before a procurement-grade approval; their absence is itself a reason to reject the dataset.
What's Next

Continue to Section 54.8: System Cards and Frontier System Disclosures. Section 54.8 moves up the abstraction ladder from individual models and datasets to systems. System cards (OpenAI's GPT-4o and o1, Anthropic's Claude system cards, Google's Frontier Safety Framework disclosures) document a deployed AI system as a whole: model components, safety mitigations, evaluation results, and red-team summaries. They are the artifact regulators are increasingly looking at.

Further Reading
Gebru, T., Morgenstern, J., Vecchione, B., et al. (2018, revised 2021). Datasheets for Datasets. Communications of the ACM 64, 86-92.
Dodge, J., Sap, M., Marasovic, A., et al. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (C4). EMNLP 2021.
Gao, L., Biderman, S., Black, S., et al. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027.
Birhane, A., Han, S., Boddeti, V., et al. (2023). Into the LAIONs Den: Investigating Hate in Multimodal Datasets. NeurIPS 2023 Datasets and Benchmarks Track.
Crawford, K., Paglen, T. (2019). Excavating AI: The Politics of Images in Machine Learning Training Sets. AI & Society 36, 1105-1116.
Boyd, K. (2024). Documenting Computer Vision Datasets: A Critical Analysis of Datasheets and Model Cards. FAccT 2024.
European Parliament and Council (2024). Regulation (EU) 2024/1689 (AI Act), Article 10 and Annex IV: Data and Data Governance.