Libraries & Frameworks

Section 19.2

Part IV's training libraries split into three layers: the low-level engine (transformers + accelerate), the algorithm libraries (TRL, PEFT), and the high-level recipe layers (axolotl, litgpt). Pick a layer based on how much of the training loop you want to write yourself.

19.2.1 The training engine

The training loop almost always wraps the Hugging Face Trainer or SFTTrainer, the Trainer subclass that TRL provides for supervised fine-tuning. Accelerate handles the distributed orchestration. Every recipe in Part IV assumes the combination of transformers v5, accelerate, and bitsandbytes; installing with uv is typically much faster than pip (the uv project cites 10-100x). Note: transformers v5 (first release candidate published in late 2025) is a major-version break from v4, so older training scripts may need small updates.
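
Because the engine is a combination of separately versioned packages, it can help to check what is installed before running a recipe. The sketch below uses only the standard library; the helper names major_version and check_engine are illustrative and are not part of transformers, accelerate, or bitsandbytes.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '5.0.0rc1'."""
    return int(version.split(".")[0])


def check_engine(packages=("transformers", "accelerate", "bitsandbytes")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    versions = check_engine()
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    tf = versions["transformers"]
    if tf is not None and major_version(tf) < 5:
        print("transformers is older than v5; recipes written for v5 may need changes")
```

Running the file prints one line per package, plus a warning when the installed transformers predates v5.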

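bitsandbytes provides 8-bit and 4-bit quantization; its 4-bit NF4 format is the one QLoRA uses. FlashAttention supplies the memory-efficient attention kernels that make long-context fine-tuning feasible: FlashAttention-2 runs on consumer Ampere and Ada GPUs, while FlashAttention-3 targets Hopper-class datacenter GPUs. Liger Kernel (LinkedIn, 2024) adds drop-in Triton kernels for cross-entropy, fused linear layers, and RoPE, with reported memory savings of 20-40% on Llama / Mistral fine-tunes depending on model and configuration; it was widely adopted in 2025.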
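
The case for 4-bit quantization is easiest to see as arithmetic. The sketch below estimates the memory taken by the weights alone at different precisions; it deliberately ignores quantization constants, activations, optimizer state, and LoRA adapter weights, so real usage will be higher. The function name weight_memory_gb and the 7B parameter count are illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare a 7-billion-parameter model at three storage precisions.
for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"7B weights in {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

At 16 bits the weights alone take about 14 GB, which leaves little headroom on a 24 GB card; at 4 bits they take about 3.5 GB.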

Two additional production training frameworks belong on the radar. torchtune (PyTorch official, 2024) is Meta's PyTorch-native fine-tuning library; it competes with axolotl and TRL on simplicity and works particularly well with FSDP2. nanotron (Hugging Face, 2024) is a minimal pretraining-from-scratch framework, a spiritual successor to nanoGPT at production scale. For JAX teams, Levanter (Stanford CRFM) is a well-established pretraining framework. The name Megatron-LM still appears in older docs; the modern entry point is Megatron-Core (NVIDIA's 2024 modular rewrite). The lit-gpt package was renamed to litgpt on PyPI; if you see pip install lit-gpt, you have stale instructions.

19.2.2 Algorithm libraries

Real-World Scenario: GRPO recipe in five lines

Who: A small open-source replication team working from the DeepSeek-R1 paper.

Situation: The team wanted to reproduce R1-style reasoning training on a public math dataset without rebuilding the GRPO loop from scratch.

Problem: Hand-rolled RL training code in their previous project had taken weeks to stabilize and was hard to compare with published baselines.

Dilemma: Build a custom GRPO trainer for maximum flexibility, or accept a library-imposed loop and ship faster.

Decision: They adopted TRL's GRPOTrainer and wrote only the reward function plus a config block.

The DeepSeek-R1 recipe (arXiv:2501.12948) reduces to a few lines once TRL's GRPOTrainer is doing the work; see Code Fragment 19.2.1 below.

How: They imported GRPOTrainer and GRPOConfig from TRL, defined a reward function that checks the boxed answer against ground truth, loaded a Qwen3-8B base model (Qwen3 has no 7B size), and called trainer.train() on a NuminaMath subset.

Result: A working GRPO pipeline in roughly five lines of glue code. The R1 recipe became one of the most widely replicated open recipes of 2025; the open-r1 project (Hugging Face, 2025) works to reproduce it end-to-end in the open.

Lesson: When papers publish reference implementations through TRL, picking the library trainer rather than re-rolling the loop turns a multi-week engineering project into an afternoon's work.

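import re

from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM

def check_answer(completion, answer):
    # Compare the last \boxed{...} in the completion with the ground-truth string.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return bool(boxed) and boxed[-1].strip() == str(answer).strip()

def reward_correctness(completions, answer, **_):
    # Return +1 if the completion's boxed answer matches ground truth, else -1.
    # Assumes plain-text completions and a dataset column named "answer",
    # which TRL forwards to reward functions as a keyword argument.
    return [1.0 if check_answer(c, a) else -1.0 for c, a in zip(completions, answer)]

# numina_math_subset is assumed to be prepared beforehand with "prompt" and "answer" columns.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B-Base")
trainer = GRPOTrainer(
    model=model, reward_funcs=[reward_correctness],
    # The effective batch size must be divisible by num_generations.
    args=GRPOConfig(num_generations=8, per_device_train_batch_size=8),
    train_dataset=numina_math_subset,
)
trainer.train()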
Code Fragment 19.2.1: The DeepSeek-R1 recipe (arXiv:2501.12948) reduces to a few lines once TRL's GRPOTrainer is doing the work.
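
It helps to see what the trainer does with those rewards. GRPO samples a group of completions per prompt (num_generations=8 in the fragment) and scores each completion relative to the rest of its group rather than against a learned value function. The sketch below is a plain-Python illustration of that normalization, not TRL's implementation: it uses the population standard deviation and a small eps to avoid division by zero, and both choices are assumptions that real implementations may handle differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight sampled completions for one prompt; the first and fourth are correct.
rewards = [1.0, -1.0, -1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0] * 8)
```

Correct completions receive positive advantages and incorrect ones negative, while a group with identical rewards yields all zeros. This is why a prompt the model always solves, or never solves, contributes nothing to the update.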

19.2.3 Recipe layers

19.2.4 Practical defaults

For a single 24 GB GPU LoRA fine-tune, Unsloth is the fastest path. For multi-GPU SFT with a config you want to share, axolotl is the right answer. For RLHF or DPO research where you need to read the training loop, raw TRL is what published papers cite. Experiment Tracking covers W&B and MLflow wiring.
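
The defaults above can be written down as a small lookup, which is handy as a checklist when starting a new project. The function name suggest_library and its task labels are illustrative; scenarios the text does not cover return "unspecified" rather than a guess.

```python
def suggest_library(task: str, num_gpus: int = 1) -> str:
    """Return the default named in the text for a scenario, or 'unspecified'."""
    task = task.lower()
    if task in {"rlhf", "dpo"}:
        return "trl"  # research where you need to read the training loop
    if task == "lora" and num_gpus == 1:
        return "unsloth"  # single-GPU LoRA fine-tune
    if task == "sft" and num_gpus > 1:
        return "axolotl"  # multi-GPU SFT with a shareable config
    return "unspecified"


print(suggest_library("lora"))
print(suggest_library("sft", num_gpus=4))
print(suggest_library("DPO"))
```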

Beyond DPO and GRPO, three 2024-25 alignment algorithms passed the adoption threshold and now ship in TRL: SimPO (Simple Preference Optimization, Meng et al., 2024, arXiv:2405.14734), KTO (Kahneman-Tversky Optimization; covered in Sec 18.2b), and ORPO (Odds Ratio Preference Optimization; covered in Sec 18.2b). Two 2024 papers complicate the DPO story and are worth reading before defaulting to it: "Is DPO Superior to PPO for LLM Alignment?" (Xu et al.) and "Smaug" (Pal et al.). Constitutional AI (Anthropic; Bai et al., 2022) and RLAIF replace human preference labels with AI feedback, and direct RLAIF variants score samples with an off-the-shelf LLM instead of a separately trained reward model, which makes them a lower-cost bridge into RLHF.
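
To make the contrast with DPO concrete, the sketch below computes the SimPO loss for a single preference pair in plain Python. SimPO uses the length-normalized log-probability of each response as its implicit reward, subtracts a target margin gamma, and needs no reference model. The inputs are summed token log-probabilities and token counts; the default beta and gamma values are placeholders for illustration, not recommended settings, and this is not TRL's implementation.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair from summed token log-probabilities."""
    # Length-normalized reward difference, scaled by beta, minus the target margin.
    margin = beta * (logp_chosen / len_chosen - logp_rejected / len_rejected) - gamma
    # -log(sigmoid(margin)) written as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Chosen response averages -1.0 nats per token, rejected averages -2.0.
loss_good = simpo_loss(-10.0, 10, -20.0, 10)
# Swapping the pair (model prefers the rejected response) gives a larger loss.
loss_bad = simpo_loss(-20.0, 10, -10.0, 10)
```

The loss falls as the model's per-token preference for the chosen response exceeds the margin, and length normalization keeps longer responses from being favored merely for accumulating more tokens.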

What's Next?

In the next section, Section 19.3: Datasets & Benchmarks, we build on the material covered here.

Further Reading

Training Libraries

Wolf, T., Debut, L., Sanh, V., et al. (2020). "Transformers: State-of-the-Art Natural Language Processing." EMNLP 2020. arXiv:1910.03771. The original Hugging Face Transformers paper.
Hugging Face (2024). "Transformers Documentation." huggingface.co/docs/transformers. Authoritative reference for the de-facto LLM library.
Hugging Face (2024). "Accelerate." huggingface.co/docs/accelerate. Reference for the thin distributed-training wrapper.
Hugging Face (2024). "TRL: Transformer Reinforcement Learning." huggingface.co/docs/trl. Reference library for RLHF, DPO, GRPO training.

PEFT and LoRA

Hu, E., Shen, Y., Wallis, P., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. arXiv:2106.09685. The reference LoRA paper.
Hugging Face (2024). "PEFT Documentation." huggingface.co/docs/peft. Reference parameter-efficient fine-tuning library.