Licensing for LLM training data is a complex and evolving area with significant legal uncertainty. The following guidelines reflect best practices as of early 2026, but they are not legal advice.
Key Principles
- Open does not mean unrestricted. Datasets released under open licenses (Apache 2.0, MIT, CC-BY) allow broad use, but attribution requirements still apply. Creative Commons NonCommercial (CC BY-NC) licenses prohibit commercial use, including training a model that will be used commercially.
- Web-crawled data carries inherent risk. Common Crawl and its derivatives contain copyrighted material from across the web. Several ongoing lawsuits (New York Times v. OpenAI, Getty Images v. Stability AI) are testing the boundaries of fair use for training data. Courts in different jurisdictions may reach different conclusions.
- Synthetic data inherits conditions. Data generated by an LLM may be subject to the terms of service of the model that produced it. For example, OpenAI's terms historically restricted using GPT outputs to train competing models, though enforcement and interpretation of such terms vary.
- Personal data requires special care. Training data that contains personally identifiable information (PII) may be subject to GDPR, CCPA, or similar regulations. Models trained on such data can memorize and regurgitate personal information, creating privacy risks.
- Documentation is your best defense. Maintain a clear data provenance record: where each dataset came from, its license, any filtering applied, and dates of collection. This "data card" practice is increasingly expected by regulators and auditors.
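The provenance-tracking practice above can be sketched as a small record type. This is a minimal illustration, not a standard schema: the `DataCard` fields, the `COMMERCIAL_ALLOWLIST` contents, and the `commercially_usable` helper are all hypothetical choices for this sketch, and a real allowlist is a legal decision, not a code default.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataCard:
    """One provenance record per dataset: source, license, filtering, dates."""
    name: str
    source_url: str
    license: str                     # e.g. an SPDX identifier like "Apache-2.0"
    collected: date
    filters_applied: list[str] = field(default_factory=list)

# Licenses this hypothetical project treats as acceptable for commercial training.
# Anything else gets flagged for manual legal review.
COMMERCIAL_ALLOWLIST = {"Apache-2.0", "MIT", "CC-BY-4.0", "ODC-By-1.0"}

def commercially_usable(card: DataCard) -> bool:
    """Return True only if the recorded license is on the allowlist."""
    return card.license in COMMERCIAL_ALLOWLIST

card = DataCard(
    name="fineweb-sample",
    source_url="https://huggingface.co/datasets/HuggingFaceFW/fineweb",
    license="ODC-By-1.0",
    collected=date(2026, 1, 15),
    filters_applied=["pii-scrub", "dedup"],
)
assert commercially_usable(card)
```

Keeping the license as a structured field (rather than free text in a README) is what makes audits cheap: a single pass over all cards answers "what are we training on, and under what terms?"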
As of early 2026, courts in the US, EU, and other jurisdictions are actively ruling on cases involving AI training data. The legal status of training on copyrighted web content remains unsettled. For commercial projects, consult with legal counsel and consider using datasets with clear, permissive licenses (FineWeb under ODC-By, RedPajama tooling under Apache 2.0, or datasets you have licensed directly).
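The PII concern raised under "Personal data requires special care" is usually addressed with a redaction pass before training. The sketch below uses toy regexes to show the shape of such a filter; the patterns, labels, and `redact_pii` function are illustrative only, and production pipelines typically rely on NER-based tooling (e.g. Microsoft Presidio) rather than regexes alone.

```python
import re

# Illustrative patterns only: real PII detection needs far broader coverage
# (names, addresses, national IDs across locales) than a few regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched span with a type-tagged placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or 555-867-5309."))
# → Contact [EMAIL] or [PHONE].
```

Recording which redaction pass was applied (and when) in the dataset's provenance record ties this filtering step back into the documentation practice described above.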
Benchmarks follow a predictable pattern: introduction, adoption, saturation, and replacement. MMLU was challenging for state-of-the-art models in 2023; by 2025, frontier models scored above 90%. When a benchmark saturates, the community introduces harder successors (MMLU to MMLU-Pro, HumanEval to SWE-bench, GSM8K to GSM-Hard). Effective evaluation requires choosing benchmarks appropriate to your model's capability level and supplementing static benchmarks with live evaluation methods such as Chatbot Arena.