Datasets & Benchmarks

Section 19.3

Part IV uses three categories of dataset: instruction-tuning data (for SFT), preference data (for DPO/RLHF), and reasoning-trace data (for GRPO and reasoning fine-tunes). The catalog below covers the ones that show up in Chapters 17 through 20, plus one continued-pretraining corpus. This section covers the dataset catalog and the lightweight data versioning workflow with DVC; Section 19.4 covers the heavyweight data pipeline tooling (PySpark, Delta Lake, feature stores).
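
As a rough illustration of how the three categories differ in shape, the sketch below models one record of each. The class and field names (SFTExample, PreferencePair, ReasoningTrace, prompt, chosen, rejected, and so on) are illustrative assumptions, not the schema of any particular dataset; real datasets use their own column names.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    """Instruction-tuning record: one prompt and one target response."""
    prompt: str
    response: str


@dataclass
class PreferencePair:
    """Preference record: one prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


@dataclass
class ReasoningTrace:
    """Reasoning record: intermediate steps are stored alongside the final answer."""
    prompt: str
    reasoning: str
    answer: str


# Mapping from record type to the training method named in the text above.
TRAINING_METHOD = {
    SFTExample: "SFT",
    PreferencePair: "DPO/RLHF",
    ReasoningTrace: "GRPO / reasoning fine-tune",
}


def training_method(record) -> str:
    """Look up which training method a record of this shape feeds."""
    return TRAINING_METHOD[type(record)]
```

The point of the sketch is only that each category adds a field: preference data adds a rejected response, and reasoning data adds the trace that leads to the answer.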

19.3.1 Instruction-tuning data

19.3.2 Pretraining and continued-pretraining data

19.3.3 Preference and reasoning data

19.3.4 Comparing the datasets

Table 19.3a.1: Training datasets for Part IV.
Dataset          Size             Use                      License
Alpaca           52K examples     SFT smoke test           CC BY-NC 4.0
Tulu 3 SFT       ~1M examples     Production SFT           ODC-By
FineWeb-Edu      ~1.3T tokens     Continued pretraining    ODC-By
UltraFeedback    64K pairs        DPO                      MIT
HH-RLHF          170K pairs       RLHF baseline            MIT

19.3.5 Synthetic data

Many 2025-2026 fine-tunes use synthetic data generated by a stronger model. Tools to know: distilabel (Argilla) and NeMo-Curator (NVIDIA) for generating and filtering. Synthetic data raises licensing and contamination concerns covered properly in Part IX.

Warning: License hygiene

Training data licenses propagate to fine-tuned models. Alpaca's CC BY-NC restricts commercial use; ShareGPT's provenance is unclear. Always check the data card before training a model you intend to ship.
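
A license check of this kind can be automated before a training run. The sketch below is a minimal illustration, assuming the license strings from Table 19.3a.1 and a hypothetical helper, flag_for_commercial_use, that treats any "NC" clause or any dataset with no recorded license as a blocker. It is not legal advice and does not replace reading the data card.

```python
# License strings copied from Table 19.3a.1.
DATASET_LICENSES = {
    "Alpaca": "CC BY-NC 4.0",
    "Tulu 3 SFT": "ODC-By",
    "FineWeb-Edu": "ODC-By",
    "UltraFeedback": "MIT",
    "HH-RLHF": "MIT",
}


def is_non_commercial(license_name: str) -> bool:
    """Assumption: a standalone 'NC' token marks a non-commercial license."""
    tokens = license_name.upper().replace("-", " ").split()
    return "NC" in tokens


def flag_for_commercial_use(datasets: list[str]) -> list[str]:
    """Return datasets whose license is non-commercial or not recorded at all."""
    flagged = []
    for name in datasets:
        license_name = DATASET_LICENSES.get(name)
        if license_name is None or is_non_commercial(license_name):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    mix = ["Alpaca", "Tulu 3 SFT", "ShareGPT"]
    print(flag_for_commercial_use(mix))
```

Treating an unrecorded license as a blocker is a deliberate choice here: a dataset with unclear provenance should fail the check until someone has read its data card.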

Real-World Scenario: Tulu 3 full open release (Nov 2024)

The Allen Institute for AI (Ai2) shipped the Tulu 3 recipe, data, code, and models together in November 2024: a complete, reproducible state-of-the-art post-training pipeline. The dataset, training scripts, and evaluations are all public; nothing about the recipe is proprietary. For Part IV the lesson is concrete: in 2024, "reproducible state-of-the-art post-training" became something a single research org can publish in one release. The R1 release from DeepSeek (Jan 2025) extended the same pattern to reasoning. Together they form the reproducibility template that 2025-26 open-recipe releases imitate.

Data Version Control (DVC)

DVC extends Git to handle large datasets and model files. It works alongside Git: code is tracked by Git, and data and models are tracked by DVC. The two stay in sync through lightweight .dvc pointer files that Git does track. Code Fragment 19.3a.1 below puts this into practice.

# Install DVC
pip install dvc dvc-s3  # or dvc-gs for Google Cloud, dvc-azure for Azure

# Initialize DVC in an existing Git repo
dvc init

# Track a large dataset
dvc add data/training_corpus.parquet
# This creates data/training_corpus.parquet.dvc (a small pointer file)

git add data/training_corpus.parquet.dvc data/.gitignore
git commit -m "Track training corpus with DVC"

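# Configure remote storage (e.g., an S3 bucket) and commit the DVC config
dvc remote add -d myremote s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"

# Push the tracked data to the remote
dvc push

# Example dvc.yaml pipeline (a separate YAML file, not part of the shell commands above)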
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw/corpus.jsonl
      - scripts/preprocess.py
    outs:
      - data/processed/train.jsonl
      - data/processed/eval.jsonl

  train:
    cmd: python scripts/train.py --config configs/lora.yaml
    deps:
      - data/processed/train.jsonl
      - scripts/train.py
      - configs/lora.yaml
    outs:
      - models/finetuned/
    metrics:
      - metrics/train_results.json:
          cache: false
Code Fragment 19.3a.1: Initializing DVC, tracking a large dataset file, pushing it to remote storage (S3), and defining a reproducible two-stage pipeline in dvc.yaml with explicit dependencies and outputs.
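
To make the pointer-file idea concrete, the following sketch hashes a data file and writes a small record of that hash next to it. It is a simplified illustration and not DVC's implementation: real .dvc files are YAML and DVC also moves the data into a content-addressed cache. The function names (content_hash, write_pointer, is_in_sync) and the .pointer.json suffix are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path) -> str:
    """MD5 of the file contents, read in 1 MiB chunks so large files fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer (name, hash, size) next to the data file."""
    pointer = {
        "path": data_path.name,
        "md5": content_hash(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def is_in_sync(data_path: Path, pointer_path: Path) -> bool:
    """True if the data file still matches the hash recorded in the pointer."""
    recorded = json.loads(pointer_path.read_text())
    return recorded["md5"] == content_hash(data_path)
```

The pointer stays a few hundred bytes regardless of how large the data file is, which is why it can live in Git while the data itself lives in remote storage.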
Key Insight: DVC Pipelines

DVC can also define reproducible pipelines with dvc.yaml. Each stage specifies its dependencies (input files, scripts) and outputs. Running dvc repro re-executes only the stages whose dependencies have changed. This is invaluable for ML workflows where you need to reprocess data, retrain, and re-evaluate in a consistent order. Code Fragment 19.3a.2 below puts this into practice.

# DVC: data + model pipeline versioning.
# `dvc repro` re-runs ONLY the pipeline stages whose dependencies have changed.
# Tags let you bookmark milestones; `dvc exp run` runs tracked, reproducible experiments.

# Re-execute the pipeline (only stages with stale inputs run)
dvc repro

# Tag a milestone so you can return to it later
git tag v1.0-baseline
dvc push   # upload data + artifacts to the remote DVC storage

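# Compare metrics and params between the tagged baseline and the current commit
dvc exp diff v1.0-baseline HEAD

# Apply an experiment's results to the workspace so they can be committed with Git.
# "best-loss" stands in for an experiment name; list the real ones with `dvc exp show`.
dvc exp apply best-loss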
Code Fragment 19.3a.2: Running dvc repro to re-execute only the pipeline stages whose dependencies have changed, tagging a baseline, comparing against it with dvc exp diff, and applying an experiment's results with dvc exp apply.
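
The "only stages whose dependencies have changed" rule can be sketched as a hash comparison. The code below is a minimal illustration under stated assumptions, not DVC's algorithm: DVC records hashes in dvc.lock, also tracks outputs, and re-checks downstream stages after an upstream stage rewrites its outputs, none of which is modeled here. The names snapshot and stale_stages are hypothetical.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """MD5 of a dependency file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def snapshot(stages: dict[str, list[Path]]) -> dict[str, dict[str, str]]:
    """Record the current hash of every dependency, per stage (a stand-in for a lock file)."""
    return {
        name: {str(dep): file_hash(dep) for dep in deps}
        for name, deps in stages.items()
    }


def stale_stages(
    stages: dict[str, list[Path]],
    lock: dict[str, dict[str, str]],
) -> list[str]:
    """Return the stages whose dependency hashes differ from the recorded snapshot."""
    current = snapshot(stages)
    return [name for name in stages if lock.get(name) != current[name]]
```

In a two-stage pipeline like the one in Code Fragment 19.3a.1, editing only configs/lora.yaml would mark train as stale and leave preprocess untouched, which is the behavior that makes re-running cheap.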

What's Next?

In the next section, Section 19.4: Data Pipeline Tooling, we move from cataloging datasets to processing them at scale with PySpark, Delta Lake, and feature stores.

Further Reading

Training Datasets

Gao, L., et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027. Reference 800GB pretraining corpus.
Penedo, G., Malartic, Q., Hesslow, D., et al. (2023). "The RefinedWeb Dataset for Falcon LLM." NeurIPS 2023. arXiv:2306.01116. Reference for the high-quality web-scale pretraining corpus methodology.
Soldaini, L., Kinney, R., Bhagia, A., et al. (2024). "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." ACL 2024. arXiv:2402.00159. Reference 3T-token open corpus.