Section 19.6: External Reading & Communities

Part IV's literature falls into three groups: the academic papers that introduce each algorithm, the practical blog posts that explain what works, and the open-source communities that ship the recipes.

19.6.1 Foundational papers

Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022): the RLHF blueprint.
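Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (2023): DPO, which replaces the separate reward model and PPO loop of RLHF with a single classification-style loss on preference pairs.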
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021): the paper behind the most widely used PEFT method.
Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023): LoRA adapters trained on top of a frozen, 4-bit quantized base model.
Shao et al. (DeepSeek), "DeepSeekMath" (2024): introduces GRPO, the critic-free PPO variant later used for reasoning fine-tunes.
DeepSeek-R1 paper (2025): the open recipe for reasoning models. The paper introduces DeepSeek-R1-Zero, trained with large-scale reinforcement learning on rule-based, verifiable rewards and no supervised fine-tuning stage beforehand; many reasoning fine-tunes since then follow variants of this recipe.
Wu et al., "Self-Play Preference Optimization for Language Model Alignment" (2024): SPPO, an iterative self-play approach to preference optimization.
Dong et al., "RLHF Workflow: From Reward Modeling to Online RLHF" (2024): a practitioner-oriented, end-to-end technical report on online iterative RLHF.
Meng et al., "SimPO: Simple Preference Optimization with a Reference-Free Reward" (2024): a DPO-style objective that drops the reference model.

19.6.2 Tutorials and recipes

Hugging Face DPO tutorial: the official entry point.
Alignment Handbook: open recipes for SFT and DPO; the basis of Zephyr.
AllenAI open-instruct: the Tulu 3 training repository, fully open.
Hugging Face open-r1 (2025): a widely followed open effort to reproduce the DeepSeek-R1 pipeline end-to-end on public infrastructure. A useful case study in how a community replicates a reasoning model.
nanoGPT: still the cleanest pretraining-from-scratch reference.
Sebastian Raschka's Ahead of AI newsletter: deep-dive analyses of training papers. Raschka's book "Build a Large Language Model (From Scratch)" (2024) is the right companion for from-scratch learners.
Maxime Labonne's LLM Course: one of the most popular open fine-tuning tutorials of 2024-25, with worked notebooks for many of the algorithms in this section.

19.6.3 Communities

EleutherAI Discord: open-source pretraining and alignment research.
Nous Research Discord: fine-tuning collective.
axolotl Discord: the most active fine-tuning support channel.
r/LocalLLaMA: weekly fine-tuning threads.

Tip: A working pattern

Treat the Alignment Handbook and AllenAI's open-instruct as your reference recipes. Read them before writing your own training script, and reuse about 90 percent of what they provide. The 10 percent you do change should be a single, named, deliberate change so the result is interpretable.
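
One way to enforce the single-change discipline is to diff each experiment config against the reference recipe before launching a run. The sketch below is a minimal illustration under stated assumptions: the helper names (flatten, config_diff, assert_single_change), the config keys, and the model name are all hypothetical, and none of them come from the Alignment Handbook or open-instruct.

```python
from __future__ import annotations

import copy
from typing import Any

_ABSENT = "<absent>"  # placeholder shown when a key exists on one side only


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"optim": {"lr": 1}} -> {"optim.lr": 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def config_diff(baseline: dict[str, Any], experiment: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (baseline_value, experiment_value)} for every differing setting."""
    base, exp = flatten(baseline), flatten(experiment)
    return {
        key: (base.get(key, _ABSENT), exp.get(key, _ABSENT))
        for key in sorted(base.keys() | exp.keys())
        if base.get(key, _ABSENT) != exp.get(key, _ABSENT)
    }


def assert_single_change(baseline: dict[str, Any], experiment: dict[str, Any], name: str) -> str:
    """Raise unless the experiment differs from the baseline in exactly one setting."""
    diff = config_diff(baseline, experiment)
    if len(diff) != 1:
        raise ValueError(f"run '{name}' changes {len(diff)} settings, expected exactly 1: {diff}")
    key, (old, new) = next(iter(diff.items()))
    return f"{name}: {key} {old!r} -> {new!r}"


# Hypothetical reference recipe; the keys are illustrative only.
baseline = {
    "model": "hypothetical-base-7b",
    "loss": {"type": "dpo", "beta": 0.1},
    "optim": {"lr": 5e-7, "epochs": 1},
}

# The one named, deliberate change for this run.
experiment = copy.deepcopy(baseline)
experiment["loss"]["beta"] = 0.05

print(assert_single_change(baseline, experiment, "beta-0.05"))
# prints: beta-0.05: loss.beta 0.1 -> 0.05
```

Logging the returned string alongside the run's metrics keeps the change attached to its result, which is what makes a later comparison against the reference recipe interpretable.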

What's Next?

In the next section, Section 19.7: Hugging Face Datasets and Tokenizers, we build on the material covered here.

Further Reading

Practitioner Guides

Karpathy, A. (2024). "Let's build the GPT Tokenizer." YouTube. Reference walkthrough on training a BPE tokenizer.

Karpathy, A. (2024). "Let's Reproduce GPT-2 (124M)." YouTube. Reference end-to-end pretraining walkthrough.

Communities

EleutherAI (2024). "EleutherAI Discord and Research Forum." eleuther.ai. The largest open-source LLM research community.