After pretraining, models are fine-tuned on instruction-following and preference datasets. These datasets shape the model's behavior, tone, and safety characteristics.
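Most instruction datasets boil down to role-tagged message lists that get flattened into a single training string. A minimal sketch of that record shape and flattening step (the field names and `<|role|>` delimiters are illustrative, not any specific dataset's schema):

```python
# Sketch of one supervised fine-tuning record: a list of role-tagged
# messages. Field names and role tags are illustrative assumptions.
example = {
    "messages": [
        {"role": "user", "content": "Summarize photosynthesis in one sentence."},
        {"role": "assistant", "content": "Plants turn light, water, and CO2 into sugar and oxygen."},
    ]
}

def to_training_text(record, eos="</s>"):
    """Flatten a record into one training string with role delimiters."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in record["messages"]]
    return "\n".join(parts) + eos

print(to_training_text(example))
```

Real pipelines usually apply the tokenizer's own chat template instead of a hand-rolled format, but the structure is the same.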
## Dataset Comparison
| Dataset | Size | Type | Source | License |
|---|---|---|---|---|
| OpenAssistant Conversations (OASST) | ~161K messages | Multi-turn conversation trees with human rankings | Human volunteers | Apache 2.0 |
| Alpaca | 52K instructions | Single-turn instruction/response pairs | GPT-3.5 generated (self-instruct) | CC BY-NC 4.0 |
| ShareGPT / WildChat | ~1M conversations | Real user conversations with ChatGPT | User-shared conversations | Varies; check terms |
| UltraChat | 1.5M dialogues | Multi-turn synthetic conversations | GPT-3.5/4 generated | MIT |
| UltraFeedback | 256K preference pairs | Preference data (chosen/rejected) | GPT-4 as judge | MIT |
| Nectar | 183K preference pairs | Preference data across 7 dimensions | Multiple LLM judges | CC BY-NC 4.0 |
| OpenHermes 2.5 | 1M instructions | Curated from multiple synthetic sources | Various LLMs, filtered | Varies by subset |
## Data Quality over Quantity
For instruction tuning, a small set of high-quality, diverse examples consistently outperforms a large set of noisy or redundant ones. Aim for coverage across task types rather than raw volume.
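One concrete way to act on this is near-duplicate filtering before training. A minimal sketch using token-set Jaccard overlap (the 0.8 threshold is an illustrative assumption; production pipelines typically use MinHash or embedding similarity at scale):

```python
import re

# Sketch: drop near-duplicate instructions by token-set Jaccard
# overlap. The 0.8 threshold is an illustrative assumption.
def tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a, b):
    sa, sb = tokens(a), tokens(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dedup(instructions, threshold=0.8):
    kept = []
    for inst in instructions:
        # Keep only instructions sufficiently different from all kept ones.
        if all(jaccard(inst, k) < threshold for k in kept):
            kept.append(inst)
    return kept

data = [
    "Write a haiku about autumn.",
    "Write a haiku about autumn leaves.",  # near-duplicate, dropped
    "Translate 'hello' into French.",
]
print(dedup(data))
```

This greedy pass is quadratic in the worst case, which is fine for curated sets of thousands of examples but not for million-scale corpora.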