Appendices
Appendix J: Datasets, Benchmarks, and Leaderboards

Instruction and Alignment Datasets

After pretraining, models are fine-tuned on instruction-following and preference datasets. These datasets shape the model's behavior, tone, and safety characteristics.

Dataset Comparison
| Dataset | Size | Type | Source | License |
|---|---|---|---|---|
| OpenAssistant Conversations (OASST) | ~161K messages | Multi-turn conversation trees with human rankings | Human volunteers | Apache 2.0 |
| Alpaca | 52K instructions | Single-turn instruction/response pairs | GPT-3.5 generated (self-instruct) | CC BY-NC 4.0 |
| ShareGPT / WildChat | ~1M conversations | Real user conversations with ChatGPT | User-shared conversations | Varies; check terms |
| UltraChat | 1.5M dialogues | Multi-turn synthetic conversations | GPT-3.5/4 generated | MIT |
| UltraFeedback | 256K preference pairs | Preference data (chosen/rejected) | GPT-4 as judge | MIT |
| Nectar | 183K preference pairs | Preference data across 7 dimensions | Multiple LLM judges | CC BY-NC 4.0 |
| OpenHermes 2.5 | 1M instructions | Curated from multiple synthetic sources | Various LLMs, filtered | Varies by subset |
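The table mixes two record shapes: single-turn instruction/response pairs (Alpaca-style) and preference pairs with chosen/rejected completions (UltraFeedback-style). A minimal sketch of handling both, assuming illustrative field names (actual schemas vary; check each dataset's card):

```python
def alpaca_to_messages(record):
    """Convert a single-turn instruction/response pair (Alpaca-style)
    into the chat-messages format most fine-tuning libraries expect.
    Field names are illustrative, not a guaranteed schema."""
    user_content = record["instruction"]
    if record.get("input"):  # Alpaca records carry an optional context field
        user_content += "\n\n" + record["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]

def preference_to_pair(record):
    """Split a preference record (chosen/rejected completions for one
    prompt) into the pair of conversations a DPO-style trainer compares."""
    prompt = [{"role": "user", "content": record["prompt"]}]
    chosen = prompt + [{"role": "assistant", "content": record["chosen"]}]
    rejected = prompt + [{"role": "assistant", "content": record["rejected"]}]
    return chosen, rejected

example = {"instruction": "Summarize the text.", "input": "", "output": "A summary."}
print(alpaca_to_messages(example))
```

The point of normalizing everything into one messages format is that instruction data from different sources can then be mixed freely before tokenization.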
Data Quality over Quantity

For instruction tuning, a small set of high-quality, diverse examples consistently outperforms a large set of noisy or redundant ones. Aim for coverage across task types rather than raw volume.
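The curation step this implies can be sketched as a simple two-rule filter: drop duplicates after normalization, and cap how many examples any one task type contributes so no category dominates. The `task_type` label is an assumption here; in practice it would come from an upstream tagging or clustering pass.

```python
from collections import Counter

def curate(examples, per_type_cap=1000):
    """Keep a deduplicated, coverage-balanced subset of instruction data."""
    seen = set()
    counts = Counter()
    kept = []
    for ex in examples:
        # Normalize whitespace and case so trivial variants collide.
        key = " ".join(ex["instruction"].lower().split())
        if key in seen:
            continue  # drop exact/near-exact duplicate instructions
        if counts[ex["task_type"]] >= per_type_cap:
            continue  # enforce coverage: cap each task type's share
        seen.add(key)
        counts[ex["task_type"]] += 1
        kept.append(ex)
    return kept

data = [
    {"instruction": "Translate 'hello'", "task_type": "translation"},
    {"instruction": "translate  'hello'", "task_type": "translation"},  # duplicate
    {"instruction": "Summarize this article", "task_type": "summarization"},
]
print(len(curate(data)))  # → 2
```

Exact-match deduplication is the crudest version of this; real pipelines often add fuzzy deduplication (e.g. MinHash) and model-based quality scoring on top of the same skeleton.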