Part IV's training libraries split into three layers: the low-level engine (transformers + accelerate), the algorithm libraries (TRL, PEFT), and the high-level recipe layers (axolotl, lit-gpt). Pick a layer based on how much you want to write yourself.
19.2.1 The training engine
The training loop almost always wraps the Hugging Face Trainer or SFTTrainer, the Trainer subclass that ships in TRL. Accelerate handles the distributed orchestration. Every recipe in Part IV assumes the combination of transformers v5, accelerate, and bitsandbytes; install them with uv, which its maintainers benchmark at 10-100x faster than pip. Note: transformers v5 (first release candidate in late 2025) is a major break from v4, so older training scripts may need updates.
bitsandbytes provides 8-bit and 4-bit (NF4) quantization for QLoRA. FlashAttention supplies the memory-efficient attention kernels that make long-context fine-tuning feasible: FlashAttention-2 runs on consumer Ampere and Ada GPUs, while FlashAttention-3 targets Hopper-class data-center GPUs. Liger Kernel (LinkedIn, 2024) adds drop-in Triton kernels for cross-entropy, fused linear layers, and RoPE, with reported memory savings of 20-40% on Llama and Mistral fine-tunes; it was widely adopted in 2025.
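To make the quantization trade-off concrete, here is a minimal back-of-the-envelope sketch of weight-only memory at different precisions. The bytes-per-parameter figures are the nominal ones (2 for bf16, 1 for int8, 0.5 for 4-bit); the helper name `weight_memory_gb` is illustrative, and the estimate deliberately ignores activations, optimizer state, the KV cache, and quantization metadata such as per-block scales.

```python
# Nominal storage cost per parameter; real 4-bit formats add a small
# overhead for per-block scaling constants, which is ignored here.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n in [("7B", 7e9), ("70B", 70e9)]:
    sizes = {d: f"{weight_memory_gb(n, d):.1f} GB" for d in BYTES_PER_PARAM}
    print(name, sizes)
```

Under these assumptions a 70B model's weights drop from roughly 140 GB in bf16 to roughly 35 GB in 4-bit, which is why QLoRA-style fine-tuning can fit on a single 80 GB card while full-precision fine-tuning cannot.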
Two additional production training frameworks belong on the radar. torchtune (PyTorch official, 2024) is Meta's native fine-tuning library; it competes with axolotl and TRL on simplicity and works particularly well with FSDP2. nanotron (Hugging Face, 2024) is a minimal pretraining-from-scratch framework, the spiritual successor to nanoGPT at production scale. For pure-JAX teams, Levanter (Stanford CRFM) is the canonical pretraining framework. The name Megatron-LM still appears in older docs; the modern entry point is Megatron-Core (NVIDIA's 2024 modular rewrite). The lit-gpt package was renamed to litgpt on PyPI; if you see pip install lit-gpt, you have stale instructions.
19.2.2 Algorithm libraries
- TRL (Hugging Face, 2020) is the canonical preference-learning library that ships open-source implementations of SFTTrainer, DPOTrainer, GRPOTrainer, PPOTrainer, KTOTrainer, and RLOOTrainer. Its objective is to give you one-import access to the published alignment algorithms so you can swap DPO for IPO for ORPO without reimplementing the loop, which matters when the alignment literature ships a new acronym every quarter. The core concept is that each trainer subclasses Hugging Face Trainer and overrides mainly the loss computation; the data loading, distributed training, and checkpointing are inherited. Pick TRL when you want to follow what papers actually publish; many alignment papers since 2024 build on or cite the TRL implementations.
- PEFT (Hugging Face, 2023) is the parameter-efficient fine-tuning library that implements LoRA, QLoRA, DoRA, prefix tuning, IA3, prompt tuning, and the long tail of "freeze most of the model, train a small adapter" methods. Its objective is to let you fine-tune a 70B model on a single 80 GB GPU by quantizing the frozen weights to 4 bits (QLoRA) and training under 1 percent of the parameters, which matters when full fine-tuning costs more than your monthly cloud budget. The core concept is the LoraConfig object: passing it to get_peft_model wraps the target modules with low-rank delta matrices, returning a model that behaves like the original at initialization and lets you save just the adapter. Pick PEFT for almost every fine-tune; full fine-tuning is the exception, not the default, in 2026.
- OpenRLHF (community, 2024) is an alternative RLHF training framework built around Ray for distributed orchestration and DeepSpeed for memory efficiency. Its objective is to make multi-actor RLHF (separate policy, reference, reward, and critic models on different GPUs) tractable at 70B+ scale, which matters when TRL's single-process design becomes the bottleneck. The core concept is decoupled vLLM-based rollout actors that generate completions in parallel while the policy trains; think of it as a microservices architecture for RLHF. Pick it when you need to scale beyond a single node and want stronger Ray integration than TRL provides.
- verl (ByteDance, 2024) is ByteDance's RL training framework, written from the ground up for GRPO and reasoning-model training. Its objective is to support the asymmetric compute pattern of reasoning RL, where rollouts dominate cost and the policy update is cheap, which matters when training DeepSeek-R1-style models with multi-thousand-token chains of thought. The core concept is hybrid-engine training that swaps between vLLM (for inference rollouts) and FSDP (for gradient updates) on the same GPUs. Pick verl when GRPO is your algorithm and rollout efficiency is the bottleneck. TRL has also shipped GRPOTrainer since v0.14 (early 2025) and is the right pick when serial-friendly debugging matters more than throughput.
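The "under 1 percent of its parameters" claim for LoRA can be checked with simple arithmetic, and the low-rank delta itself is only a few lines. The sketch below is plain Python rather than PEFT's actual implementation: the function names are illustrative, and the model shape (32 layers, hidden size 4096, two adapted square projections per layer, rank 16, about 7e9 total parameters) is an assumption chosen only to make the numbers concrete.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha: float, r: int):
    """Compute W x + (alpha / r) * B (A x), the LoRA-adapted linear layer."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Illustrative shapes (assumptions, not a specific checkpoint).
per_adapter = lora_param_count(4096, 4096, 16)
trainable = per_adapter * 2 * 32  # two adapted projections in each of 32 layers
fraction = trainable / 7e9
print(f"trainable adapter params: {trainable:,} ({fraction:.2%} of the model)")

# With B initialised to zero (the usual LoRA init), the adapted layer
# reproduces the frozen layer exactly until training updates B.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.5]]         # r = 1, d_in = 2
B_zero = [[0.0], [0.0]]  # d_out = 2, r = 1
x = [1.0, 1.0]
assert lora_forward(W, A, B_zero, x, alpha=2.0, r=1) == matvec(W, x)
```

Because only A and B receive gradients, the optimizer state shrinks in the same proportion as the trainable parameter count, which is where most of the memory saving comes from.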
Who: A small open-source replication team working from the DeepSeek-R1 paper.
Situation: The team wanted to reproduce R1-style reasoning training on a public math dataset without rebuilding the GRPO loop from scratch.
Problem: Hand-rolled RL training code in their previous project had taken weeks to stabilize and was hard to compare with published baselines.
Dilemma: Build a custom GRPO trainer for maximum flexibility, or accept a library-imposed loop and ship faster.
Decision: They adopted TRL's GRPOTrainer and wrote only the reward function plus a config block.
The DeepSeek-R1 recipe (arXiv:2501.12948) reduces to a few lines once TRL's GRPOTrainer is doing the work; see Code Fragment 19.2.1 below.
How: They imported GRPOTrainer and GRPOConfig from TRL, defined a reward function that checks the boxed answer against ground truth, loaded a Qwen3-8B base model, and called trainer.train() on a NuminaMath subset.
Result: A working GRPO pipeline in roughly five lines of glue code. The R1 recipe (2025) became one of the most-replicated open recipes of the year; the open-r1 project (Hugging Face, 2025) reproduces it end-to-end on public hardware.
Lesson: When papers publish reference implementations through TRL, picking the library trainer rather than re-rolling the loop turns a multi-week engineering project into an afternoon's work.
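from transformers import AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer

# check_answer (compares a completion's boxed answer with ground truth) and
# numina_math_subset (a dataset with a "prompt" column) are defined elsewhere.

def reward_correctness(completions, **kwargs):
    # Return +1.0 if the completion's boxed answer matches ground truth, else -1.0.
    return [1.0 if check_answer(c) else -1.0 for c in completions]

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B-Base")
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_correctness],
    # The global batch size must be divisible by num_generations;
    # output_dir is an illustrative path.
    args=GRPOConfig(output_dir="grpo-out", num_generations=8, per_device_train_batch_size=8),
    train_dataset=numina_math_subset,
)
trainer.train()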
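The fragment above leaves the answer checker undefined. As a hedged illustration of what such a reward could look like, the sketch below extracts the last \boxed{...} span with brace matching and then shows how GRPO turns the rewards for one prompt's group of completions into group-relative advantages. Everything here is an assumption for illustration: `extract_boxed`, the two-argument `check_answer`, and the `answer` column name are hypothetical, exact string comparison is a simplification, and real implementations differ in the exact standard-deviation estimator and epsilon.

```python
import statistics


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces


def check_answer(completion: str, truth: str) -> bool:
    """Exact-match comparison; a real checker would normalise the math first."""
    return extract_boxed(completion) == truth.strip()


def reward_correctness(completions, answer, **kwargs):
    """Reward in the style TRL expects: one float per completion.

    `answer` is assumed to be a dataset column passed through as a keyword.
    """
    return [1.0 if check_answer(c, a) else -1.0 for c, a in zip(completions, answer)]


def group_advantages(rewards, eps: float = 1e-4):
    """Normalise rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


group = ["so \\boxed{\\frac{1}{2}}", "thus \\boxed{2}", "no final answer"]
rewards = reward_correctness(group, ["\\frac{1}{2}"] * 3)
print(rewards, group_advantages(rewards))
```

When every completion in a group earns the same reward, the advantages are all zero and that prompt contributes no gradient, which is one reason the number of generations per prompt matters.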
19.2.3 Recipe layers
- axolotl (Axolotl AI, 2023) is the YAML-config-driven fine-tuning frontend that wraps transformers, TRL, PEFT, and Accelerate behind a single declarative file. Its objective is to let you describe a fine-tune as "this model, this data, these hyperparameters, this adapter" without writing Python, which matters when you want your config to be reviewable in PRs and reproducible across machines. The core concept is a config schema that maps almost one-to-one onto the underlying library knobs while supplying sane defaults for common patterns (Alpaca format, ShareGPT format, DPO, KTO). Pick axolotl when you want a config you can share with collaborators; many of the open-source fine-tunes on the Hugging Face Hub were trained this way.
- litgpt (formerly lit-gpt; Lightning AI, 2023) is a from-scratch GPT reimplementation aimed at clarity and pretraining education. Its objective is to be readable end-to-end in a single sitting while still scaling to multi-node pretraining, which matters when you want to understand what every line of the training loop does rather than treat it as a black box. The core concept is hackable PyTorch code (built on Lightning Fabric rather than the full Lightning Trainer) with first-class support for FSDP and FlashAttention. Pick it as a middle ground between nanoGPT (educational only) and production frameworks (opaque); avoid it for fine-tuning, where TRL is more direct.
- Unsloth (Unsloth AI, 2023) is a fine-tuning accelerator that ships hand-written Triton kernels for the LoRA-adapter forward and backward passes. Its objective is to make single-GPU LoRA fine-tuning roughly 2x faster with about 50 percent less VRAM than vanilla TRL, which matters when you are training on a 24 GB consumer GPU and every gigabyte counts. The core concept is fused kernels for attention plus the LoRA adapter that avoid the round trip through standard PyTorch ops. Pick Unsloth for tight-budget single-GPU LoRA work or Colab fine-tunes; avoid it when you need multi-GPU support (at the time of writing, Unsloth's free tier is single-GPU and multi-GPU is paid).
- LLaMA-Factory (Hiyouga, 2023) is another YAML-driven fine-tuning layer, distinguished by a built-in Gradio web UI for configuring runs. Its objective is to lower the entry barrier for users who prefer clicking checkboxes over editing config files, which matters in classroom and bootcamp settings. The core concept is the same axolotl-style declarative config, exposed as both YAML and a web form. Pick it for teaching or quick experiments; for serious work, axolotl tends to pick up the latest TRL releases a few weeks earlier.
19.2.4 Practical defaults
For a LoRA fine-tune on a single 24 GB GPU, Unsloth is the fastest path. For multi-GPU SFT with a config you want to share, axolotl is the right answer. For RLHF or DPO research where you need to read the training loop, raw TRL is what published papers cite. The Experiment Tracking section covers W&B and MLflow wiring.
Beyond DPO and GRPO, three 2024-25 alignment algorithms passed the adoption threshold and now ship in TRL: SimPO (Simple Preference Optimization, Meng et al., 2024, arXiv:2405.14734), KTO (Kahneman-Tversky Optimization; canonical home Sec 18.2b), and ORPO (Odds Ratio Preference Optimization; canonical home Sec 18.2b). The 2024 "DPO Meets PPO" (Xu et al.) and "Smaug" (Pal et al.) papers complicate the DPO story and are worth reading before defaulting to it. RL from AI feedback, via Constitutional AI (Anthropic, 2022) and RLAIF, is the established bridge into RLHF-style training without human preference labels; direct variants score rollouts with an LLM judge instead of a separately trained reward model.
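To see what separates these preference objectives, it helps to write two of them out for a single preference pair. The sketch below is a plain-Python illustration, not TRL's implementation: inputs are summed sequence log-probabilities for the chosen (w) and rejected (l) responses, and the default values for `beta` and `gamma` are illustrative assumptions rather than recommended settings. DPO measures the policy's margin relative to a frozen reference model; SimPO drops the reference model, normalises by response length, and subtracts a target margin.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """DPO: -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -log_sigmoid(margin)


def simpo_loss(pi_w: float, pi_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: -log sigmoid(beta * (pi_w/len_w - pi_l/len_l) - gamma)."""
    margin = beta * (pi_w / len_w - pi_l / len_l) - gamma
    return -log_sigmoid(margin)


# A policy identical to the reference has zero DPO margin, so the loss is log 2.
print(dpo_loss(pi_w=-10.0, pi_l=-12.0, ref_w=-10.0, ref_l=-12.0))
# SimPO needs no reference log-probs, only response lengths in tokens.
print(simpo_loss(pi_w=-20.0, pi_l=-40.0, len_w=10, len_l=10))
```

The practical consequence is memory: DPO keeps a reference model (or its cached log-probs) around during training, while SimPO needs only the policy.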
What's Next?
In the next section, Section 19.3: Datasets & Benchmarks, we build on the material covered here.