Section 19.3: Datasets & Benchmarks

Part IV uses three categories of dataset: instruction-tuning data (for SFT), preference data (for DPO/RLHF), and reasoning-trace data (for GRPO and reasoning fine-tunes). The catalog below covers the ones that show up in Chapters 17 through 20, along with the pretraining corpora used for continued pretraining. This section covers the dataset catalog and the lightweight data versioning workflow with DVC; Section 19.4 covers the heavyweight data pipeline tooling (PySpark, Delta Lake, feature stores).

19.3.1 Instruction-tuning data

Alpaca (Stanford, 2023) is a 52K-example instruction dataset generated by prompting text-davinci-003 with a seed set of human instructions, then filtering. Its objective was to demonstrate that small models could be instruction-tuned cheaply using LLM-generated data, which mattered because it kicked off the entire open instruction-tuning ecosystem. The core concept is "self-instruct": seed a handful of tasks, prompt a strong model to generate variations, filter, then train. Pick Alpaca today only as a smoke test or historical reference; the data quality and CC BY-NC license make it a poor choice for production.
ShareGPT (community, 2023) is a collection of 90K real ChatGPT conversations scraped from the now-defunct sharegpt.com sharing site. Its objective was to provide naturally occurring multi-turn dialogue rather than single-turn instructions, which mattered because it gave Vicuna its conversational tone. The core concept is human-written prompts paired with GPT-3.5/4 responses, including the messy real-world patterns synthetic data omits. Avoid it for anything you intend to ship: the provenance is murky (OpenAI's terms forbid using its outputs to train competitors) and the data card cannot establish consent.
No Robots (Hugging Face, 2023) is a 10K-example fully human-written instruction dataset commissioned from professional annotators with no LLM involvement. Its objective is to offer a clean reference free of "as a language model, I cannot" artifacts and synthetic contamination, which matters when you want a high-quality control set. The core concept is paid, instructed annotators following an explicit codebook. Pick it as a clean SFT mix-in or as an evaluation set; it is too small to use alone.
Tulu 3 SFT (AllenAI, 2024) is the ~1M-example open instruction mix that drives the Tulu 3 family of models, combining No Robots, OpenAssistant, FLAN, OpenMathInstruct, and dozens of other sources. Its objective is to give the community a fully-open recipe matching the quality of proprietary instruction tuning, which matters when you want to reproduce a state-of-the-art instruction-tuned model. The core concept is careful per-source filtering, decontamination against eval benchmarks, and balanced mixing weights documented in the paper. Pick Tulu 3 SFT as your default open SFT corpus in 2026.
OpenHermes-2.5 (Nous Research, 2024) is the 1M-example open instruction mix curated by Teknium for the Hermes model series, drawing on synthetic GPT-4 generations and selected community datasets. Its objective is to be a "best of breed" mixed instruction corpus filtered for tone and reasoning quality, which matters when you want a slightly more "personality-rich" base than Tulu's strictly research-focused mix. Pick it when you want a model that feels less corporate; pick Tulu 3 when you want fully-documented provenance.
SmolTalk (Hugging Face, 2024-12) is a 1M-example synthetic instruction dataset used to train SmolLM2; it is the cleanest small-model SFT recipe of 2024.
Magpie family (Xu et al., 2024, arXiv:2406.08464) generates synthetic instructions by prompting an already-aligned model with only the pre-query portion of its chat template, so the model writes the user instruction itself (no seed prompts needed); it was the dominant synthetic-instruction approach in 2024-25.
Llama Nemotron Post-Training Dataset (NVIDIA, 2025) is a recently-released large synthetic SFT and preference corpus that covers reasoning, math, code, and tool use.
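
The Tulu 3 entry above lists decontamination against eval benchmarks as part of the recipe. One common way to do this is word-level n-gram overlap, sketched below. The function names, the 8-gram window, and the toy strings are illustrative assumptions, not the procedure documented in the Tulu 3 paper.

```python
import re

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_examples, eval_examples, n=8):
    """Drop training examples that share any n-gram with an eval example."""
    eval_grams = set()
    for example in eval_examples:
        eval_grams |= ngrams(example, n)
    return [ex for ex in train_examples if not (ngrams(ex, n) & eval_grams)]

# Toy data (hypothetical): the second training example leaks an eval question.
eval_set = ["What is the capital of France and when was the Eiffel Tower built?"]
train_set = [
    "Explain how gradient descent updates model weights during training runs.",
    "Q: What is the capital of France and when was the Eiffel Tower built? A: Paris, 1889.",
]

clean = decontaminate(train_set, eval_set)
print(len(clean), "of", len(train_set), "training examples kept")
```

A shorter window catches more paraphrased leaks but also removes more innocent examples, so the window size is a tuning decision rather than a fixed constant.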

19.3.2 Pretraining and continued-pretraining data

FineWeb-Edu (Hugging Face, 2024) is a ~1.3T-token subset of CommonCrawl, filtered with an educational-content classifier trained to retain Wikipedia-like and tutorial-style pages. Its objective is to provide a quality-filtered web corpus that trains small models to higher capabilities than raw CommonCrawl, which matters when you are pretraining a sub-1B model on a constrained budget. The core concept is classifier-based filtering with a public model card showing the filter prompt and threshold. Pick FineWeb-Edu as your default pretraining corpus for small models or for continued-pretraining a base model on cleaner data.
FineWeb (Hugging Face, 2024) is the 15T-token parent corpus before educational filtering, deduplicated and lightly cleaned from 96 CommonCrawl dumps. Its objective is to be the open replacement for the proprietary corpora that GPT-4 and Llama were trained on, which matters when you need scale rather than density. Pick FineWeb when you have the compute for trillion-token pretraining; for under-1T-token budgets, FineWeb-Edu is the higher-quality slice.
The Stack v2 (BigCode, 2024) is a 3.2-billion-file code corpus harvested from permissively licensed GitHub repos, used to pretrain StarCoder 2 and many open code models. Its objective is to give code-specialized pretraining a defensible, opt-out-respecting alternative to scraping GitHub directly, which matters because most other code corpora ignore licensing entirely. The core concept is a deduplication pipeline built on the Software Heritage source-code archive, with per-repo license filtering and an "Am I in The Stack?" opt-out portal. Pick it for code-specific continued pretraining; for general-purpose code-aware models, mixing FineWeb-Edu and The Stack v2 is the standard recipe.
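
Classifier-based filtering, as described for FineWeb-Edu above, comes down to scoring each document and keeping those at or above a threshold. In the sketch below, `toy_edu_score` is a hypothetical stand-in for a trained classifier, and the keyword list and threshold are made-up values chosen only to keep the example runnable.

```python
from typing import Callable, List

# Hypothetical keyword list standing in for a learned notion of "educational".
EDU_KEYWORDS = ("theorem", "definition", "example", "tutorial", "exercise")

def toy_edu_score(doc: str) -> float:
    """Hypothetical scorer: share of keywords present, scaled to a 0-5 range."""
    text = doc.lower()
    hits = sum(1 for kw in EDU_KEYWORDS if kw in text)
    return 5.0 * hits / len(EDU_KEYWORDS)

def filter_by_score(docs: List[str],
                    score_fn: Callable[[str], float],
                    threshold: float) -> List[str]:
    """Keep only documents whose score meets the threshold."""
    return [d for d in docs if score_fn(d) >= threshold]

docs = [
    "Tutorial: a worked example and an exercise on the Pythagorean theorem.",
    "Buy cheap watches now!!! Limited offer.",
]
kept = filter_by_score(docs, toy_edu_score, threshold=3.0)
print(len(kept), "of", len(docs), "documents kept")
```

Swapping in a real classifier only changes `score_fn`; the threshold then controls the trade-off between corpus size and density.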

19.3.3 Preference and reasoning data

UltraFeedback (OpenBMB, 2023; Hugging Face binarization, 2024) is a 64K-pair preference dataset where each prompt has four model responses rated by GPT-4 on instruction-following, helpfulness, honesty, and truthfulness. Its objective is to provide open DPO training data without expensive human annotation, which matters when you cannot afford the labeling budget of Anthropic-scale HH. The core concept is GPT-4-as-judge synthetic preferences, binarized to chosen/rejected for DPO. Pick it as the default open DPO dataset in 2026; for reward-modeling research, you may want the original non-binarized version with full scores.
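
Binarizing a multi-response rating set into chosen/rejected pairs can be sketched as follows. The rule used here (the highest mean score becomes chosen, a randomly drawn lower-ranked response becomes rejected) is one plausible choice and an assumption of this sketch; the field names and scores are invented for illustration and are not the actual UltraFeedback schema.

```python
import random
from statistics import mean

def binarize(prompt, responses, rng):
    """Turn several rated responses into one chosen/rejected pair."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    # Rank by mean score across all rated aspects, best first.
    ranked = sorted(responses, key=lambda r: mean(r["scores"].values()), reverse=True)
    chosen = ranked[0]
    rejected = rng.choice(ranked[1:])  # any lower-ranked response
    return {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}

# Hypothetical ratings for four responses to one prompt.
responses = [
    {"text": "A", "scores": {"helpfulness": 3, "honesty": 4}},
    {"text": "B", "scores": {"helpfulness": 5, "honesty": 5}},
    {"text": "C", "scores": {"helpfulness": 2, "honesty": 3}},
    {"text": "D", "scores": {"helpfulness": 4, "honesty": 4}},
]
pair = binarize("Explain DPO in one sentence.", responses, random.Random(0))
print(pair["chosen"], "preferred over", pair["rejected"])
```

Keeping the non-binarized scores alongside the pairs preserves the option of reward-model training on the full ratings later.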
Anthropic HH-RLHF (Anthropic, 2022) is the original 170K-pair human-annotated preference dataset, split across "helpful" and "harmless" comparisons. Mostly historical in 2026: it remains the canonical RLHF baseline citation but newer preference datasets outperform it on every reasonable metric. The 2024-25 preference frontier moved to Skywork-Reward-Preference-80K and HelpSteer3 (NVIDIA), both of which supersede UltraFeedback for serious reward-model training.
Reasoning-trace datasets are the post-2024 cluster you should know: OpenR1-Math, NuminaMath-CoT (math reasoning data derived from the 2024 AIMO-winning project, and the canonical GRPO training input), and the DeepSeek-R1-Distill traces. These long-trace datasets are the starting point for most reasoning fine-tunes in 2025-26.
Self-Rewarding Language Models (Yuan et al., 2024, arXiv:2401.10020) is a technique rather than a dataset: the model judges its own outputs to build preference pairs and then trains on them, with no external reward model. It is relevant when annotation budget is the bottleneck.
Capybara DPO (Argilla, 2024) is a 7K-pair curated DPO mix produced by Argilla using their distilabel pipeline. Its objective is to provide a small high-signal dataset suitable for the final-stage DPO that follows SFT, which matters when extra data adds noise rather than signal. The core concept is heavily-filtered multi-judge preferences with explicit critique chains. Pick it as a small high-quality DPO finisher on top of a larger preference mix.
OpenR1-Math-220k (Hugging Face open-r1, 2025) is a 220K-example reasoning-trace dataset in the DeepSeek-R1 style, with full chain-of-thought derivations for math problems. Its objective is to enable open reproductions of R1-style reasoning fine-tunes via either SFT or GRPO, which matters when you want long-trace reasoning capabilities without a frontier-lab data budget. The core concept is verified-correct long traces generated by R1 itself, then filtered with a verifier. Pick it for reasoning SFT or as the prompt source for your own GRPO rollouts.
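
Filtering traces with a verifier can be as simple as extracting each trace's final answer and comparing it with a reference. This sketch assumes answers are wrapped in `\boxed{...}` and compared as whitespace-normalized strings; both the trace format and the exact-match check are assumptions for illustration, and a production verifier would need more robust answer parsing.

```python
import re

# Matches \boxed{...} with no nested braces (a simplifying assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def final_answer(trace):
    """Return the last boxed answer in a trace, or None if there is none."""
    matches = BOXED.findall(trace)
    return matches[-1].strip() if matches else None

def verify(trace, reference):
    """True if the trace's final answer matches the reference string."""
    answer = final_answer(trace)
    if answer is None:
        return False
    return answer.replace(" ", "") == reference.replace(" ", "")

# Hypothetical traces: one correct, one wrong, one with no final answer.
traces = [
    {"trace": r"2x = 10, so x = 5. \boxed{5}", "reference": "5"},
    {"trace": r"Guessing. \boxed{7}", "reference": "5"},
    {"trace": "No final answer given.", "reference": "5"},
]
verified = [t for t in traces if verify(t["trace"], t["reference"])]
print(len(verified), "of", len(traces), "traces kept")
```

The same `verify` function can double as a binary reward when the prompts are reused for GRPO rollouts.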

19.3.4 Comparing the datasets

Table 19.3a.1: Training datasets for Part IV.

Dataset	Size	Use	License
Alpaca	52K	SFT smoke test	CC BY-NC 4.0
Tulu 3 SFT	~1M	Production SFT	ODC-By
FineWeb-Edu	~1.3T tokens	Continued pretraining	ODC-By
UltraFeedback	64K pairs	DPO	MIT
HH-RLHF	170K pairs	RLHF baseline	MIT

19.3.5 Synthetic data

Many 2025-2026 fine-tunes use synthetic data generated by a stronger model. Tools to know: distilabel (Argilla) and NeMo-Curator (NVIDIA) for generating and filtering. Synthetic data raises licensing and contamination concerns covered properly in Part IX.

Warning: License hygiene

Training data licenses propagate to fine-tuned models. Alpaca's CC BY-NC restricts commercial use; ShareGPT's provenance is unclear. Always check the data card before training a model you intend to ship.
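
One way to make this check routine is to gate training on a license allowlist. The sketch below uses the licenses listed in Table 19.3a.1; the allowlist itself is a hypothetical policy for illustration, and datasets with no recorded license (such as ShareGPT here) are treated as blocked.

```python
# Licenses as listed in Table 19.3a.1.
DATASET_LICENSES = {
    "Alpaca": "CC BY-NC 4.0",
    "Tulu 3 SFT": "ODC-By",
    "FineWeb-Edu": "ODC-By",
    "UltraFeedback": "MIT",
    "HH-RLHF": "MIT",
}

# Hypothetical allowlist for a model intended for commercial release.
COMMERCIAL_OK = {"MIT", "ODC-By", "Apache-2.0"}

def blocked_datasets(mix, licenses=DATASET_LICENSES, allowed=COMMERCIAL_OK):
    """Return datasets in the mix whose license is unknown or not allowed."""
    return [name for name in mix if licenses.get(name) not in allowed]

mix = ["Tulu 3 SFT", "Alpaca", "ShareGPT"]
blocked = blocked_datasets(mix)
if blocked:
    print("Do not ship a model trained on:", ", ".join(blocked))
```

Failing closed on unknown licenses is the design choice that matters: a dataset missing from the table should stop the run, not slip through.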

Real-World Scenario: Tulu 3 full open release (Nov 2024)

Allen AI's Tulu 3 release (Nov 2024) shipped the recipe, the data, the code, and the models together: a complete, reproducible state-of-the-art post-training pipeline. The dataset, training scripts, and evaluations are all public; nothing about the recipe is proprietary. For Part IV the lesson is concrete: in 2024, "reproducible state-of-the-art post-training" became something a single research organization could publish in one release. The R1 release from DeepSeek (Jan 2025) extended the same pattern to reasoning. Together they form the reproducibility template that 2025-26 open-recipe releases imitate.

Data Version Control (DVC)

DVC extends Git to handle large datasets and model files. It works alongside Git: code is tracked by Git, and data and models are tracked by DVC. The two stay in sync through lightweight .dvc pointer files that Git does track. Code Fragment 19.3a.1 below puts this into practice.

# Install DVC with S3 support (use dvc-gs for Google Cloud, dvc-azure for Azure)
pip install dvc dvc-s3

# Initialize DVC in an existing Git repo and commit the files it stages
dvc init
git commit -m "Initialize DVC"

# Track a large dataset
dvc add data/training_corpus.parquet
# This creates data/training_corpus.parquet.dvc (a small pointer file)
# and lists the data file itself in data/.gitignore

git add data/training_corpus.parquet.dvc data/.gitignore
git commit -m "Track training corpus with DVC"

# Configure the default remote storage (e.g., S3 bucket) and commit the config
dvc remote add -d myremote s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"

# Push data to the remote
dvc push

# Example dvc.yaml pipeline (a separate file in the repo root, not shell commands)
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw/corpus.jsonl
      - scripts/preprocess.py
    outs:
      - data/processed/train.jsonl
      - data/processed/eval.jsonl

  train:
    cmd: python scripts/train.py --config configs/lora.yaml
    deps:
      - data/processed/train.jsonl
      - scripts/train.py
      - configs/lora.yaml
    outs:
      - models/finetuned/
    metrics:
      - metrics/train_results.json:
          cache: false

Code Fragment 19.3a.1: Initializing DVC, tracking a large dataset file, pushing it to remote storage (S3), and defining a reproducible two-stage pipeline in dvc.yaml with explicit dependencies and outputs.
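
Conceptually, a pointer file records a content hash, a size, and a path, so Git can version the small pointer while the large file lives in remote storage. The sketch below mimics that idea with Python's hashlib; the dictionary layout is a simplified assumption and not the exact .dvc file format.

```python
import hashlib
import os
import tempfile

def make_pointer(path, chunk_size=1 << 20):
    """Hash a file in chunks and return a small pointer-like record."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "md5": md5.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "training_corpus.parquet")
    with open(data_path, "wb") as f:
        f.write(b"example bytes standing in for a large dataset")
    pointer = make_pointer(data_path)

    # Any change to the data changes the hash, so the pointer shows a Git diff.
    with open(data_path, "ab") as f:
        f.write(b"!")
    changed = make_pointer(data_path)

print(pointer["md5"] != changed["md5"])
```

Because the pointer is a few lines of text, every dataset revision becomes an ordinary reviewable Git diff.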

Key Insight: DVC Pipelines

DVC can also define reproducible pipelines with dvc.yaml. Each stage specifies its dependencies (input files, scripts) and outputs. Running dvc repro re-executes only the stages whose dependencies have changed. This is invaluable for ML workflows where you need to reprocess data, retrain, and re-evaluate in a consistent order. Code Fragment 19.3a.2 below puts this into practice.

# DVC: data + model pipeline versioning.
# `dvc repro` re-runs ONLY the pipeline stages whose dependencies have changed.
# Tags let you bookmark milestones; `dvc exp` runs reproducible experiments.

# Re-execute the pipeline (only stages with stale inputs run)
dvc repro

# Commit the updated lock file and metrics, then tag the milestone
git add dvc.lock metrics/train_results.json
git commit -m "Baseline pipeline run"
git tag v1.0-baseline
dvc push   # upload data + artifacts to the remote DVC storage

# Compare metrics and params at HEAD against the tagged baseline
dvc exp diff v1.0-baseline HEAD

# Restore an experiment's results into the workspace, then commit them with Git.
# "best-loss" is an example experiment name; list yours with `dvc exp show`.
dvc exp apply best-loss

Code Fragment 19.3a.2: Running dvc repro to re-execute only the pipeline stages whose dependencies have changed, tagging a baseline, and using dvc exp diff and dvc exp apply to compare and restore experiment results.
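
The selective re-execution behind dvc repro can be modeled as a hash comparison: a stage is stale if any dependency's current hash differs from the one recorded at the last run, or if an upstream stage that produces one of its dependencies is itself stale. The sketch below is a simplified model under that assumption, not DVC's actual implementation; the stage layout mirrors the dvc.yaml in Code Fragment 19.3a.1 and the hash values are placeholders.

```python
# Stages in execution order (dicts preserve insertion order in Python 3.7+).
STAGES = {
    "preprocess": {
        "deps": ["data/raw/corpus.jsonl", "scripts/preprocess.py"],
        "outs": ["data/processed/train.jsonl", "data/processed/eval.jsonl"],
    },
    "train": {
        "deps": ["data/processed/train.jsonl", "scripts/train.py", "configs/lora.yaml"],
        "outs": ["models/finetuned/"],
    },
}

def stale_stages(stages, recorded, current):
    """Return stage names that must re-run, in execution order."""
    stale, dirty_outputs = [], set()
    for name, stage in stages.items():
        changed = any(recorded.get(d) != current.get(d) for d in stage["deps"])
        upstream = any(d in dirty_outputs for d in stage["deps"])
        if changed or upstream:
            stale.append(name)
            dirty_outputs.update(stage["outs"])  # downstream stages see new inputs
    return stale

# Placeholder hashes recorded at the last successful run.
recorded = {
    "data/raw/corpus.jsonl": "aaa", "scripts/preprocess.py": "bbb",
    "data/processed/train.jsonl": "ccc", "scripts/train.py": "ddd",
    "configs/lora.yaml": "eee",
}

config_edit = dict(recorded, **{"configs/lora.yaml": "fff"})
corpus_edit = dict(recorded, **{"data/raw/corpus.jsonl": "zzz"})

print(stale_stages(STAGES, recorded, config_edit))
print(stale_stages(STAGES, recorded, corpus_edit))
```

Editing only the training config leaves preprocessing untouched, while a change to the raw corpus invalidates both stages.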

What's Next?

In the next section, Section 19.4: Data Pipeline Tooling, we move from cataloging datasets to processing them at scale with PySpark, Delta Lake, and feature stores.

Further Reading

Training Datasets

Gao, L., et al. (2020). "The Pile." arXiv:2101.00027. Reference 800GB pretraining corpus.

Penedo, G., Malartic, Q., Hesslow, D., et al. (2023). "The RefinedWeb Dataset for Falcon LLM." NeurIPS 2023. arXiv:2306.01116. Reference for the high-quality web-scale pretraining corpus methodology.

Soldaini, L., Kinney, R., Bhagia, A., et al. (2024). "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." ACL 2024. arXiv:2402.00159. Reference 3T-token open corpus.