Part IV's literature is split among the academic papers that introduce each algorithm, the practical blog posts that explain what works, and the open-source communities that ship the recipes.
19.6.1 Foundational papers
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022): the RLHF blueprint.
- Rafailov et al., "Direct Preference Optimization" (2023): DPO, the closed-form alternative to PPO-based RLHF.
- Hu et al., "LoRA: Low-Rank Adaptation" (2021): the foundational paper for PEFT.
- Dettmers et al., "QLoRA" (2023): 4-bit quantized LoRA.
- Shao et al. (DeepSeek), "DeepSeekMath" (2024): the paper that introduced GRPO, the algorithm behind many reasoning fine-tunes.
- DeepSeek-R1 paper (2025): the open recipe for reasoning models. It describes DeepSeek-R1-Zero, trained with large-scale reinforcement learning on rule-based, verifiable rewards and no preceding supervised fine-tuning stage; most reasoning fine-tunes since have followed this recipe.
- Wu et al., "Self-Play Preference Optimization" (2024): SPPO, a self-play approach to preference optimization.
- Dong et al., "RLHF Workflow: From Reward Modeling to Online RLHF" (2024): practitioner-oriented, end-to-end technical report on online RLHF.
- Meng et al., "SimPO" (2024): simpler-than-DPO preference optimization.
19.6.2 Tutorials and recipes
- Hugging Face DPO tutorial: the official entry point.
- Alignment Handbook: open recipes for SFT and DPO; the basis of Zephyr.
- AllenAI open-instruct: the Tulu 3 training repository, fully open.
- Hugging Face open-r1 (2025): the most-watched open replication project of 2025, aiming to reproduce DeepSeek-R1 end-to-end with public infrastructure. The single best case study for "how a community replicates a reasoning model".
- nanoGPT: still the cleanest pretraining-from-scratch reference.
- Sebastian Raschka's Ahead of AI newsletter: deep-dive analyses of training papers. Raschka's book "Build a Large Language Model (From Scratch)" (Manning, 2024) is the right companion for from-scratch learners.
- Maxime Labonne's LLM Course: the most popular open fine-tuning tutorial of 2024-25, with worked notebooks for every algorithm in this section.
19.6.3 Communities
- EleutherAI Discord: open-source pretraining and alignment research.
- Nous Research Discord: fine-tuning collective.
- axolotl Discord: the most active fine-tuning support channel.
- r/LocalLLaMA: weekly fine-tuning threads.
Tip: A working pattern
Treat the Alignment Handbook and AllenAI's open-instruct as your reference recipes. Read them before writing your own training script, and reuse 90 percent of what they provide. The 10 percent you do change should be a single, named, deliberate change, so that any difference in results can be attributed to it.
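One way to enforce the "single, named, deliberate change" rule is to derive every run from a frozen baseline and refuse anything that alters more than one setting. The sketch below is illustrative only: the keys and values in `BASELINE` are hypothetical stand-ins rather than settings copied from either reference recipe, and `derive_run` and `diff` are helper names invented for this example.

```python
from copy import deepcopy

# Hypothetical baseline: placeholder values, NOT taken from any real recipe.
BASELINE = {
    "learning_rate": 5e-7,
    "beta": 0.1,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 8,
}


def derive_run(baseline, name, **change):
    """Return (run_name, config) that differs from baseline in exactly one key."""
    if len(change) != 1:
        raise ValueError("change exactly one setting per run")
    [(key, value)] = change.items()
    if key not in baseline:
        raise KeyError(f"{key!r} is not part of the baseline recipe")
    if baseline[key] == value:
        raise ValueError(f"{key!r} already has value {value!r} in the baseline")
    config = deepcopy(baseline)  # never mutate the reference recipe
    config[key] = value
    # The run name records the one change, so results stay interpretable.
    return f"{name}__{key}={value}", config


def diff(baseline, config):
    """Map each changed key to its (baseline_value, new_value) pair."""
    return {k: (baseline[k], config[k]) for k in baseline if baseline[k] != config[k]}


run_name, config = derive_run(BASELINE, "dpo-beta-sweep", beta=0.05)
print(run_name)                # dpo-beta-sweep__beta=0.05
print(diff(BASELINE, config))  # {'beta': (0.1, 0.05)}
```

Logging the output of `diff` alongside each run's metrics keeps the comparison honest: if the diff ever contains more than one entry, the run is no longer a controlled change against the reference recipe.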
What's Next?
In the next section, Section 19.7: Hugging Face Datasets and Tokenizers, we build on the material covered here.
Further Reading
Practitioner Guides
Karpathy, A. (2024). "Let's build the GPT Tokenizer." YouTube. Reference walkthrough on training a BPE tokenizer.
Karpathy, A. (2024). "Let's Reproduce GPT-2 (124M)." YouTube. Reference end-to-end pretraining walkthrough.
Communities
EleutherAI (2024). "EleutherAI Discord and Research Forum." eleuther.ai. The largest open-source LLM research community.