Section 19.9: Hugging Face PEFT and TRL

Hugging Face PEFT and TRL Deep Dive

Big Picture

Full fine-tuning of a large language model updates every parameter, requiring enormous GPU memory and risking catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods train only a small number of additional or modified parameters while keeping the base model frozen, reducing memory requirements by 10x or more. The peft library implements these methods, and the trl (Transformer Reinforcement Learning) library builds on top of it to provide supervised fine-tuning (SFT), reward modeling, and alignment algorithms like DPO and PPO. Together, these libraries make it practical to customize billion-parameter models on a single consumer GPU. For the conceptual foundations of alignment and RLHF, see Chapter 18: Alignment, RLHF, DPO & Preference Tuning.

1. LoRA: Low-Rank Adaptation

See Chapter 17 (PEFT) for the LoRA derivation. The core idea: instead of updating a full pretrained weight $W_0 \in \mathbb{R}^{d \times d}$, learn a low-rank update $\Delta W = B A$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, and freeze $W_0$. The effective forward pass is

h = W_0 x + \frac{\alpha}{r} B A x,

where the scaling factor $\alpha / r$ keeps the adapter magnitude roughly comparable across choices of $r$. Only $A$ and $B$ are trained, so with a typical rank of 8 to 64 the trainable parameters usually amount to around 1% of the original model or less.
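To make the forward pass concrete, the sketch below implements it with plain Python lists for a toy layer. The sizes, values, and helper names (`matvec`, `lora_forward`) are illustrative assumptions and are not part of peft. With $B$ initialized to zero, as is conventional for LoRA, the adapted layer starts out identical to the frozen one.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """Compute h = W0 x + (alpha / r) * B (A x) without forming B A."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))   # A maps d -> r, then B maps r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes (hypothetical): d = 3, r = 2, alpha = 4
W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0],
      [3.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.0],
     [0.0, 1.0, 1.0]]                      # r x d
B_zero = [[0.0, 0.0] for _ in range(3)]    # d x r, zero-initialized
x = [1.0, 2.0, 3.0]

# At initialization the adapter contributes nothing
h_init = lora_forward(W0, A, B_zero, x, alpha=4, r=2)

# Pretend training has moved B away from zero
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h_adapted = lora_forward(W0, A, B_trained, x, alpha=4, r=2)
```

Computing $A x$ first and then applying $B$ keeps the cost proportional to $r \cdot d$ rather than $d^2$, which is why the adapter path is cheap even for large layers.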

Real-World Scenario: LoRA Parameter Count for Mistral-7B

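Mistral-7B has hidden size $d = 4096$, 32 layers, and an MLP width of 14,336. Each layer has four attention projections (Q, K, V, O) and three MLP projections (gate, up, down). Because the model uses grouped-query attention with 8 key/value heads, the K and V projections map 4096 to 1024 rather than 4096 to 4096. A LoRA adapter on a layer with input width $d_{\text{in}}$ and output width $d_{\text{out}}$ adds $r\,(d_{\text{in}} + d_{\text{out}})$ parameters. With rank $r = 16$ on all seven projections, one layer contributes $16 \cdot [2 \cdot (4096 + 4096) + 2 \cdot (4096 + 1024) + 3 \cdot (4096 + 14{,}336)] = 1{,}310{,}720$ parameters, so the full model has $32 \cdot 1{,}310{,}720 \approx 41.9 \text{ M}$ trainable adapter parameters. Against the roughly 7.25 B parameters of the base model, that is about 0.58%, which is the ratio print_trainable_parameters() reports. Restricting target_modules to the four attention projections gives $32 \cdot 16 \cdot 26{,}624 \approx 13.6 \text{ M}$ parameters, so the roughly 3x difference between the "attention only" and "attention plus MLP" choices comes straight from this counting.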
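The same counting can be scripted. The sketch below assumes the projection shapes from the public Mistral-7B configuration, in which the key/value projections and the MLP layers are not square; the dictionary and function names are hypothetical, and the shapes should be checked against the model's config.json before relying on the totals.

```python
def lora_params(d_in, d_out, r):
    """Adapter parameters for one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Projection shapes (d_in, d_out), assumed from the public Mistral-7B config:
# hidden 4096, 8 KV heads of size 128 (k/v output 1024), MLP width 14336.
ATTENTION = {"q_proj": (4096, 4096), "k_proj": (4096, 1024),
             "v_proj": (4096, 1024), "o_proj": (4096, 4096)}
MLP = {"gate_proj": (4096, 14336), "up_proj": (4096, 14336),
       "down_proj": (14336, 4096)}

def count_adapter_params(shapes, r, num_layers=32):
    """Total LoRA parameters when every listed projection is adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return num_layers * per_layer

attention_only = count_adapter_params(ATTENTION, r=16)
all_linear = count_adapter_params({**ATTENTION, **MLP}, r=16)
print(f"attention only:  {attention_only:,}")
print(f"attention + MLP: {all_linear:,}")
```

Changing `r` scales both totals linearly, which makes it easy to see what a given rank costs before loading the model.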

The following example applies LoRA to a causal language model and inspects the resulting parameter count.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model_name = "mistralai/Mistral-7B-v0.3"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank of the low-rank matrices
    lora_alpha=32,                 # Alpha; the update is scaled by lora_alpha / r = 2
    lora_dropout=0.05,             # Dropout on LoRA layers
    target_modules=[               # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",                   # Don't train bias terms
)

# Wrap the model with LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints roughly: trainable params: 41.9M || all params: 7.29B || trainable%: 0.58

Code Fragment 19.9.1: Applying LoRA to Mistral-7B with peft. The LoraConfig targets all attention and MLP projections with rank 16, producing about 42M trainable parameters (roughly 0.58% of the 7.3B total). The print_trainable_parameters() call confirms the adapter footprint.

Note: LoRA Rank and Target Module Guidelines

Higher rank (r) gives the adapter more capacity but increases memory and training time. For most fine-tuning tasks, r=8 to r=32 works well. The target_modules parameter controls which linear layers receive LoRA adapters. Applying LoRA to all attention projections (q, k, v, o) is standard; adding MLP layers (gate, up, down) can improve results for complex tasks at the cost of more trainable parameters. The lora_alpha parameter scales the adapter contribution: a common heuristic is to set lora_alpha = 2 * r.

2. QLoRA: Quantized LoRA for Consumer GPUs

See Chapter 17 (PEFT) for the QLoRA derivation. The forward pass keeps the LoRA shape but dequantizes the frozen base weight on the fly:

h = \mathrm{dequant_{NF4}}(W_0^{\mathrm{NF4}})\, x + \frac{\alpha}{r} B A x,

where $W_0$ lives in 4-bit NormalFloat (NF4) storage with double quantization (the per-block scale factors are themselves quantized to 8-bit), and the trainable adapter matrices $A, B$ remain in bfloat16 for stable gradient computation. The practical gotcha with bitsandbytes is that the 4-bit format is storage only: each matrix multiply dequantizes the weights to the compute dtype (bnb_4bit_compute_dtype), so the arithmetic still runs in bfloat16. Storing the base weights in 4 bits drops a 7B model from ~14 GB to ~5 GB of weight memory and makes it possible to fine-tune a 70B model on a single 48 GB GPU.
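The benefit of double quantization can be estimated with a few lines of arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per scale factor, 256 scale factors per second-level constant); the defaults in a given bitsandbytes release are an assumption to verify, and the function name is hypothetical.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_const_bits=8, dq_block_size=256):
    """Average storage bits per weight for blockwise quantization."""
    if not double_quant:
        # One full-precision scale factor per block of weights
        overhead = const_bits / block_size
    else:
        # Scale factors stored in 8 bits, plus one 32-bit second-level
        # constant shared by dq_block_size scale factors
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    return weight_bits + overhead

plain = bits_per_param()
dq = bits_per_param(double_quant=True)

n = 7_000_000_000  # nominal 7B parameters (assumption)
saved_gb = n * (plain - dq) / 8 / 1e9
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {dq:.3f} bits/param")
print(f"saving on a 7B model:        {saved_gb:.2f} GB")
```

Under these assumptions the overhead of the scale factors falls from 0.5 to about 0.127 bits per weight, a saving of roughly a third of a gigabyte on a 7B model.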

Real-World Scenario: QLoRA Memory Budget on a 24 GB RTX 4090

A 13B model in bfloat16 weighs roughly $13 \times 10^9 \times 2 \text{ bytes} = 26 \text{ GB}$, which alone overflows a 24 GB consumer GPU. NF4 storage with double quantization drops the weight footprint to about $13 \times 10^9 \times 0.5 \text{ bytes} \approx 6.5 \text{ GB}$. Adding ~50 M LoRA adapter parameters in bfloat16 (~0.1 GB), AdamW optimizer state for the adapters (~0.2 GB), KV cache for a 4K context (~1 GB), and activation memory under gradient checkpointing (~4 GB) brings the total to about 12 GB and leaves about 12 GB of headroom on the 24 GB card. That headroom buys batch size, longer context, or both. Without quantization, the bfloat16 weights alone rule out any 24 GB card, so even LoRA fine-tuning of a 13B model calls for a 40 GB or 80 GB data-center GPU; QLoRA brings it down to a single workstation card.
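The budget above is simple enough to keep in a small script, which makes it easy to re-run for a different model size or card. The figures below are the rough estimates from this scenario, not measurements, and the names are hypothetical.

```python
GB = 1e9

def weight_gb(num_params, bytes_per_param):
    """Weight memory in GB for a given storage precision."""
    return num_params * bytes_per_param / GB

params = 13e9
gpu_gb = 24.0

# Rough per-component estimates taken from the scenario text
budget = {
    "base weights (NF4, ~0.5 byte/param)": weight_gb(params, 0.5),
    "LoRA adapters (bf16)": 0.1,
    "AdamW state for adapters": 0.2,
    "KV cache (4K context)": 1.0,
    "activations (gradient checkpointing)": 4.0,
}

used = sum(budget.values())
headroom = gpu_gb - used

for name, gb in budget.items():
    print(f"{name:40s} {gb:5.1f} GB")
print(f"{'total':40s} {used:5.1f} GB")
print(f"{'headroom on a 24 GB card':40s} {headroom:5.1f} GB")
```

Replacing the 0.5 bytes per parameter with 2 shows immediately why the bfloat16 variant does not fit: the weights alone come to 26 GB.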

The following example sets up QLoRA training.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16
    bnb_4bit_use_double_quant=True,        # Quantize the quantization constants
)

# Load the model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training (enables gradient checkpointing, casts norms)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of the quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ~0.2% of parameters are trainable; total GPU memory ~5 GB

Code Fragment 19.9.2: QLoRA setup combining NF4 quantization with LoRA adapters. The BitsAndBytesConfig compresses base weights to 4-bit, prepare_model_for_kbit_training() enables gradient checkpointing and norm casting, and LoRA adapters train in bfloat16. Total GPU memory drops from ~14 GB to ~5 GB.

Table 19.9.1: Fine-Tuning Methods: Memory and Precision Comparison (as of 2026).

Method	Base Model Precision	Adapter Precision	7B Model Memory	Trainable Params
Full fine-tuning (bf16)	bfloat16	N/A (all params)	~60 GB or more (weights, gradients, AdamW state)	100%
LoRA (bf16)	bfloat16	bfloat16	~16 GB	0.1% to 1%
QLoRA (NF4 + bf16)	4-bit NF4	bfloat16	~5 GB	0.1% to 1%

3. Supervised Fine-Tuning with TRL's SFTTrainer

The trl library's SFTTrainer extends the Hugging Face Trainer with features specifically designed for instruction tuning and chat model fine-tuning. It handles chat template formatting, packing multiple short examples into a single sequence for efficiency, and integrates with PEFT out of the box.

The following example fine-tunes a model on instruction-following data using SFTTrainer with LoRA.

from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from datasets import load_dataset

# Load model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Load instruction dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

# LoRA config
peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# SFT configuration
sft_config = SFTConfig(
    output_dir="./sft-mistral",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=25,
    save_strategy="steps",
    save_steps=200,
    max_seq_length=2048,     # Renamed to max_length in newer TRL releases
    packing=True,            # Pack multiple examples into one sequence
)

# Create trainer and train
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()
trainer.save_model("./sft-mistral/final")

Code Fragment 19.9.3: Supervised fine-tuning with SFTTrainer from trl. Key settings include packing=True (concatenates short examples to fill sequences), cosine learning rate schedule with warmup, and integrated PEFT via the peft_config parameter. The trainer handles chat template formatting automatically.

Tip: Sequence Packing

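When packing=True, the SFTTrainer concatenates multiple training examples into a single sequence (separated by EOS tokens) up to max_seq_length. This eliminates the wasted computation from padding short sequences and can speed up training by 2x to 5x on datasets with short examples. Note that packing by itself does not restrict the loss to response tokens: by default the loss covers every token in the packed sequence, and masking the instruction portion requires a separate, version-dependent setting (check the SFTConfig documentation for your TRL release).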
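The idea behind packing can be shown with a few lines of plain Python. This is a simplified illustration of concatenate-and-chunk packing, not TRL's actual implementation, which may use a different strategy; the function name and token ids are hypothetical.

```python
def pack_examples(tokenized_examples, eos_id, max_seq_length):
    """Concatenate examples (each followed by EOS) and cut into fixed-size blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    # Drop the trailing remainder so every block is exactly max_seq_length long
    n_blocks = len(stream) // max_seq_length
    return [stream[i * max_seq_length:(i + 1) * max_seq_length]
            for i in range(n_blocks)]

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13], [14]]  # hypothetical token ids
blocks = pack_examples(examples, eos_id=0, max_seq_length=4)
for block in blocks:
    print(block)
```

Every block is completely filled, so no position is spent on padding. The second block also shows the cost of this naive scheme: an example can be split across a block boundary.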

4. Direct Preference Optimization (DPO)

See Chapter 18 (Alignment, RLHF, DPO) for the Bradley-Terry framing and DPO derivation. The trl library provides a DPOTrainer that handles the paired data format and reference model management.

The following example trains a DPO model starting from an SFT checkpoint.

from trl import DPOTrainer, DPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from datasets import load_dataset

# Load the SFT model as starting point
model_name = "./sft-mistral/final"  # Or a Hub model
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load preference dataset (needs "chosen" and "rejected" columns; this
# TRL-maintained copy of UltraFeedback provides them in conversational format)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5000]")

# LoRA for DPO
peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo-mistral",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    beta=0.1,                  # KL penalty coefficient
    loss_type="sigmoid",       # Standard DPO loss
    bf16=True,
    logging_steps=25,
    save_strategy="steps",
    save_steps=200,
    max_length=1024,
    max_prompt_length=512,
)

# The DPOTrainer manages the reference model automatically
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()
trainer.save_model("./dpo-mistral/final")

Code Fragment 19.9.4: DPO alignment using DPOTrainer. The beta=0.1 controls how strongly the model is penalized for diverging from the reference policy. The trainer automatically manages the reference model internally, and the preference dataset must contain "chosen" and "rejected" response columns.
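The role of beta is easiest to see in the loss itself. The following is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, written with plain floats; the function name and the log-probability values are made up for illustration, and the trainer's own implementation operates on batched tensors.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # softplus(-logits) equals -log(sigmoid(logits)), in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss starts at log(2)
loss_at_start = dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen response and lowers the rejected one
loss_improved = dpo_sigmoid_loss(-10.0, -18.0, -12.0, -15.0)

print(f"{loss_at_start:.4f} -> {loss_improved:.4f}")
```

A larger beta makes the same change in log-ratios move the loss further, so the policy needs to drift less from the reference to satisfy each preference pair.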

5. Reward Modeling and PPO

See Chapter 18 (Alignment, RLHF, DPO) for the reward-model-then-PPO pipeline rationale. The trl library provides both RewardTrainer for the reward model and PPOTrainer for the policy optimization step.

The following example sketches the full RLHF pipeline: reward model training followed by PPO.

from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# Step 1: Train a reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Preference data: RewardTrainer expects "chosen" and "rejected" columns.
# Check pref_data.column_names and rename columns if this dataset uses other names.
pref_data = load_dataset("argilla/ultrafeedback-binarized-preferences",
                         split="train[:2000]")

reward_config = RewardConfig(
    output_dir="./reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    bf16=True,
    max_length=512,
)

reward_trainer = RewardTrainer(
    model=reward_model,
    args=reward_config,
    train_dataset=pref_data,
    processing_class=tokenizer,
)
reward_trainer.train()
reward_trainer.save_model("./reward-model")  # Saved here so the PPO step can load it

Code Fragment 19.9.5: Training a reward model with RewardTrainer. The model learns to assign higher scalar scores to "chosen" responses than "rejected" ones from the preference dataset. This reward model is used in the subsequent PPO step to provide the optimization signal.
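The objective behind "higher score for chosen than rejected" is the pairwise Bradley-Terry loss. The sketch below writes it out for a single pair with plain floats; the function names and scores are hypothetical, and RewardTrainer's own loss may include extra terms such as a margin depending on configuration.

```python
import math

def preference_probability(score_chosen, score_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected) for one preference pair."""
    margin = score_chosen - score_rejected
    # Numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Correct ranking gives a small loss, reversed ranking a large one
loss_correct = pairwise_reward_loss(2.0, 0.5)
loss_reversed = pairwise_reward_loss(0.5, 2.0)
print(f"correct order:  {loss_correct:.4f}")
print(f"reversed order: {loss_reversed:.4f}")
```

Only the difference between the two scores matters, which is why a reward model's outputs are meaningful relative to each other rather than on an absolute scale.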

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
import torch

# Step 2: PPO alignment
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-mistral/final")
tokenizer = AutoTokenizer.from_pretrained("./sft-mistral/final")
tokenizer.pad_token = tokenizer.eos_token

# Load the trained reward model for scoring
from transformers import pipeline as hf_pipeline
reward_pipe = hf_pipeline("text-classification", model="./reward-model", device=0)

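# NOTE: this fragment uses the step()-based PPOTrainer API of TRL 0.11 and earlier.
ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=2,              # Must equal the number of samples passed to step()
    mini_batch_size=1,
    gradient_accumulation_steps=2,
    ppo_epochs=4,
    kl_penalty="kl",           # KL divergence penalty type
    init_kl_coef=0.2,          # Initial KL coefficient
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
)
device = ppo_trainer.accelerator.device

# PPO training loop (simplified): collect one batch, then take a single PPO step
prompts = ["Explain quantum computing.", "Write a haiku about spring."]
queries, responses, rewards = [], [], []
for prompt_text in prompts:
    inputs = tokenizer(prompt_text, return_tensors="pt").input_ids.to(device)
    output = model.generate(inputs, max_new_tokens=128, do_sample=True)
    full_text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Score prompt + response with the reward model (raw logit, no sigmoid)
    reward_output = reward_pipe(full_text, truncation=True, function_to_apply="none")

    queries.append(inputs[0])
    responses.append(output[0, inputs.shape[1]:])  # Keep only the generated tokens
    rewards.append(torch.tensor(reward_output[0]["score"]))

# step() expects batch_size queries, responses, and scalar rewards
ppo_trainer.step(queries, responses, rewards)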

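Code Fragment 19.9.6: PPO alignment loop: generate a response, score it with the reward model pipeline, and take a PPO step. The init_kl_coef=0.2 prevents the policy from drifting too far from the SFT model. Each step() call updates the model toward responses that score higher on the reward signal. This fragment uses the step()-based PPOTrainer interface of TRL 0.11 and earlier; later releases restructured the PPO API, so check the documentation for your installed version.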
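The KL coefficient enters through the reward that PPO actually optimizes. A common formulation, sketched below with plain floats, penalizes each generated token by the log-probability gap between the policy and the reference model and adds the reward-model score on the final token. The function name and values are hypothetical, and the exact estimator inside PPOTrainer may differ by version.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards: -kl_coef * (log pi - log pi_ref), plus the score on the last token."""
    rewards = [-kl_coef * (lp - ref_lp)
               for lp, ref_lp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# Log-probabilities of three generated tokens (made-up values)
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.0, -1.5, -1.0]

shaped = kl_shaped_rewards(policy_lp, ref_lp, score=1.0)
print(shaped)
```

Tokens that the policy makes more likely than the reference does are penalized, so a high reward-model score has to outweigh the accumulated drift before an update is worthwhile.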

Warning: DPO vs. PPO: Practical Considerations

DPO is simpler to implement and more stable to train because it avoids the reward model and the complex PPO optimization loop. It requires only a preference dataset. PPO, on the other hand, can optimize for any reward signal (including non-differentiable metrics like code execution success) and can explore beyond the training distribution. In practice, DPO has become the preferred method for most alignment tasks due to its simplicity and competitive performance. Use PPO when you need to optimize for a reward signal that cannot be captured in pairwise preferences.

6. Saving, Loading, and Merging Adapters

PEFT adapters are saved separately from the base model, keeping adapter files small (typically 10 to 100 MB). You can load multiple adapters, switch between them, or merge an adapter permanently into the base model weights for deployment.

The following example demonstrates the adapter lifecycle: saving, loading, and merging.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Save adapter (only the LoRA weights, not the full model).
# `model` is the PEFT-wrapped model from the earlier fragments.
model.save_pretrained("./my-adapter")
# This creates: adapter_config.json + adapter_model.safetensors (~50 MB)

# Load adapter onto a fresh base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="auto",
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(base_model, "./my-adapter")

# Merge adapter into base model (for faster inference, no PEFT dependency)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

# Push to the Hub
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
merged_model.push_to_hub("my-org/mistral-7b-finetuned")
tokenizer.push_to_hub("my-org/mistral-7b-finetuned")

Code Fragment 19.9.7: Adapter lifecycle: save_pretrained() stores only the LoRA weights (~50 MB), PeftModel.from_pretrained() loads them onto a fresh base model, and merge_and_unload() folds the adapter into the base weights permanently for deployment without the PEFT dependency.
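Merging works because the adapter is a linear update: folding $\frac{\alpha}{r} B A$ into $W_0$ produces a single dense matrix that gives the same outputs as the separate adapter path. The toy check below uses plain Python lists; the helper names and values are hypothetical and only illustrate the arithmetic that merge_and_unload() relies on.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def merge_lora(W0, A, B, alpha, r):
    """Return W0 + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]

# Toy layer (hypothetical): d = 2, r = 1, alpha = 2
W0 = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]       # d x r
A = [[0.5, -1.0]]        # r x d
x = [1.0, 1.0]

merged = merge_lora(W0, A, B, alpha=2, r=1)
out_merged = matvec(merged, x)

# Separate adapter path: W0 x + (alpha / r) * B (A x)
out_adapter = [b + 2.0 * u
               for b, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]
print(out_merged, out_adapter)
```

After merging, inference needs one matrix multiply per layer instead of two paths, which is where the latency benefit comes from.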

What's Next?

In the next section, Section 19.10: Linking Experiment Runs to Git Commits, we build on the material covered here.

Further Reading

PEFT and RLHF

Hu, E., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. arXiv:2106.09685. The reference LoRA paper.

Hugging Face (2024). "PEFT Documentation." huggingface.co/docs/peft. Reference parameter-efficient fine-tuning library.

Hugging Face (2024). "TRL: Transformer Reinforcement Learning." huggingface.co/docs/trl. Reference RLHF/DPO library used in this section.

Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). "Direct Preference Optimization." NeurIPS 2023. arXiv:2305.18290. The DPO paper underlying TRL's DPO trainer.