Appendices
Appendix J: Datasets, Benchmarks, and Leaderboards

Instruction and Alignment Datasets

After pretraining, models are fine-tuned on instruction-following and preference datasets. These datasets shape the model's behavior, tone, and safety characteristics.

Dataset Comparison
| Dataset | Size | Type | Source | License |
|---|---|---|---|---|
| OpenAssistant Conversations (OASST) | ~161K messages | Multi-turn conversation trees with human rankings | Human volunteers | Apache 2.0 |
| Alpaca | 52K instructions | Single-turn instruction/response pairs | GPT-3.5 generated (self-instruct) | CC BY-NC 4.0 |
| ShareGPT / WildChat | ~1M conversations | Real user conversations with ChatGPT | User-shared conversations | Varies; check terms |
| UltraChat | 1.5M dialogues | Multi-turn synthetic conversations | GPT-3.5/4 generated | MIT |
| UltraFeedback | 256K preference pairs | Preference data (chosen/rejected) | GPT-4 as judge | MIT |
| Nectar | 183K preference pairs | Preference data across 7 dimensions | Multiple LLM judges | CC BY-NC 4.0 |
| OpenHermes 2.5 | 1M instructions | Curated from multiple synthetic sources | Various LLMs, filtered | Varies by subset |
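The table mixes two record shapes: single-turn instruction/response pairs (Alpaca-style) and preference pairs with chosen/rejected completions (UltraFeedback-style). A minimal sketch of handling both, assuming illustrative field names (actual schemas vary; check each dataset's card):

```python
def alpaca_to_messages(record):
    """Convert a single-turn instruction/response pair (Alpaca-style)
    into the chat-messages format most fine-tuning libraries expect.
    Field names are illustrative, not a guaranteed schema."""
    user_content = record["instruction"]
    if record.get("input"):  # Alpaca records carry an optional context field
        user_content += "\n\n" + record["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]

def preference_to_pair(record):
    """Split a preference record (chosen/rejected completions for one
    prompt) into the pair of conversations a DPO-style trainer compares."""
    prompt = [{"role": "user", "content": record["prompt"]}]
    chosen = prompt + [{"role": "assistant", "content": record["chosen"]}]
    rejected = prompt + [{"role": "assistant", "content": record["rejected"]}]
    return chosen, rejected

example = {"instruction": "Summarize the text.", "input": "", "output": "A summary."}
print(alpaca_to_messages(example))
```

The point of normalizing everything into one messages format is that instruction data from different sources can then be mixed freely before tokenization.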
Data Quality over Quantity

For instruction tuning, a small set of high-quality, diverse examples consistently outperforms a large set of noisy or redundant ones. Aim for coverage across task types rather than raw volume.
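The curation step this implies can be sketched as a simple two-rule filter: drop duplicates after normalization, and cap how many examples any one task type contributes so no category dominates. The `task_type` label is an assumption here; in practice it would come from an upstream tagging or clustering pass.

```python
from collections import Counter

def curate(examples, per_type_cap=1000):
    """Keep a deduplicated, coverage-balanced subset of instruction data."""
    seen = set()
    counts = Counter()
    kept = []
    for ex in examples:
        # Normalize whitespace and case so trivial variants collide.
        key = " ".join(ex["instruction"].lower().split())
        if key in seen:
            continue  # drop exact/near-exact duplicate instructions
        if counts[ex["task_type"]] >= per_type_cap:
            continue  # enforce coverage: cap each task type's share
        seen.add(key)
        counts[ex["task_type"]] += 1
        kept.append(ex)
    return kept

data = [
    {"instruction": "Translate 'hello'", "task_type": "translation"},
    {"instruction": "translate  'hello'", "task_type": "translation"},  # duplicate
    {"instruction": "Summarize this article", "task_type": "summarization"},
]
print(len(curate(data)))  # → 2
```

Exact-match deduplication is the crudest version of this; real pipelines often add fuzzy deduplication (e.g. MinHash) and model-based quality scoring on top of the same skeleton.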