Full fine-tuning of a large language model updates every parameter, requiring enormous GPU memory and risking catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods train only a small number of additional or modified parameters while keeping the base model frozen, reducing memory requirements by 10x or more. The peft library implements these methods, and the trl (Transformer Reinforcement Learning) library builds on top of it to provide supervised fine-tuning (SFT), reward modeling, and alignment algorithms like DPO and PPO. Together, these libraries make it practical to customize billion-parameter models on a single consumer GPU. For the conceptual foundations of alignment and RLHF, see Chapter 10: Alignment and RLHF.
1. LoRA: Low-Rank Adaptation
LoRA (Low-Rank Adaptation) is the most widely used PEFT method. Instead of updating a full weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA freezes $W$ and learns two small matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ where $r \ll \min(d, k)$. The effective weight becomes $W + BA$. A rank of 8 to 64 is typical, meaning only a fraction of a percent of the original parameters are trainable.
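To make the mechanics concrete, here is a minimal LoRA-adapted linear layer sketched in plain PyTorch (illustrative only; the peft library's implementation differs in detail). The base weight is frozen, $A$ gets a small random initialization, and $B$ starts at zero so training begins from the unmodified base model:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer (not the peft implementation)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # x W^T  +  scaling * x A^T B^T   ==   x (W + scaling * B A)^T
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), r=8, alpha=16)
x = torch.randn(2, 64)
# Because B is zero-initialized, the adapter initially contributes nothing:
assert torch.allclose(layer(x), layer.base(x))
```

The `scaling = alpha / r` factor is the same `lora_alpha` mechanism the peft configuration below exposes: it decouples the adapter's contribution magnitude from the choice of rank.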
The following example applies LoRA to a causal language model and inspects the resulting parameter count.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
# Load base model
model_name = "mistralai/Mistral-7B-v0.3"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                        # Rank of the low-rank matrices
    lora_alpha=32,               # Scaling factor (alpha / r)
    lora_dropout=0.05,           # Dropout on LoRA layers
    target_modules=[             # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",                 # Don't train bias terms
)
# Wrap the model with LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; well under 1% of weights are trainable
The LoraConfig targets all attention and MLP projections with rank 16, leaving only a few tens of millions of trainable parameters, well under 1% of the roughly 7.3B total. The print_trainable_parameters() call confirms the adapter footprint.

Higher rank (r) gives the adapter more capacity but increases memory and training time. For most fine-tuning tasks, r=8 to r=32 works well. The target_modules parameter controls which linear layers receive LoRA adapters. Applying LoRA to all attention projections (q, k, v, o) is standard; adding the MLP layers (gate, up, down) can improve results on complex tasks at the cost of more trainable parameters. The lora_alpha parameter scales the adapter contribution; a common heuristic is to set lora_alpha = 2 * r.
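The rank trade-off is easy to quantify: a LoRA adapter on a single $d \times k$ weight matrix adds $r(d + k)$ trainable parameters. A quick back-of-the-envelope check (hypothetical helper, not part of peft):

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight matrix."""
    return r * (d + k)

# A 4096 x 4096 attention projection at rank 16:
print(lora_param_count(4096, 4096, 16))   # 131072 -- vs. 16,777,216 frozen weights
print(lora_param_count(4096, 4096, 32))   # 262144 -- doubling r doubles the cost
```

The cost grows linearly in r, which is why sweeping r in {8, 16, 32} is a cheap and common hyperparameter search.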
2. QLoRA: Quantized LoRA for Consumer GPUs
QLoRA combines 4-bit quantization of the base model with LoRA adapters trained in higher precision. The base model weights are stored in 4-bit NormalFloat (NF4) format, reducing the memory footprint of a 7B model from ~14 GB to ~4 GB. The LoRA matrices are kept in bfloat16 for stable gradient computation. This makes it possible to fine-tune a 7B model on a single 8 GB GPU, or a 70B model on a single 48 GB GPU.
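The memory arithmetic behind these figures is simple: weight memory is parameter count times bits per parameter. A rough estimate (hypothetical helper; it ignores activations, optimizer state, and quantization-constant overhead):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone."""
    return n_params * bits_per_param / 8 / 1e9

n = 7.25e9  # roughly 7B parameters (assumed count)
print(round(weight_memory_gb(n, 16), 1))  # 14.5 -- bf16 base model, in GB
print(round(weight_memory_gb(n, 4), 1))   # 3.6  -- NF4-quantized base model, in GB
```

The remaining headroom up to the ~5 GB observed in practice goes to the bf16 adapter weights, their optimizer state, and activations.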
The following example sets up QLoRA training.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # Compute in bf16
    bnb_4bit_use_double_quant=True,          # Quantize the quantization constants
)
# Load the model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare for k-bit training (enables gradient checkpointing, casts norms)
model = prepare_model_for_kbit_training(model)
# Apply LoRA on top of the quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ~0.2% of parameters are trainable; total GPU memory ~5 GB
BitsAndBytesConfig compresses the base weights to 4-bit, prepare_model_for_kbit_training() enables gradient checkpointing and norm casting, and the LoRA adapters train in bfloat16. Total GPU memory drops from ~14 GB to ~5 GB.

| Method | Base Model Precision | Adapter Precision | 7B Model Memory | Trainable Params |
|---|---|---|---|---|
| Full fine-tuning (bf16) | bfloat16 | N/A (all params) | ~28 GB (with optimizer) | 100% |
| LoRA (bf16) | bfloat16 | bfloat16 | ~16 GB | 0.1% to 1% |
| QLoRA (NF4 + bf16) | 4-bit NF4 | bfloat16 | ~5 GB | 0.1% to 1% |
3. Supervised Fine-Tuning with TRL's SFTTrainer
The trl library's SFTTrainer extends the HuggingFace Trainer with features specifically designed for instruction tuning and chat model fine-tuning. It handles chat template formatting, packing multiple short examples into a single sequence for efficiency, and integrates with PEFT out of the box.
The following example fine-tunes a model on instruction-following data using SFTTrainer with LoRA.
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from datasets import load_dataset
# Load model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
# Load instruction dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")
# LoRA config
peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# SFT configuration
sft_config = SFTConfig(
    output_dir="./sft-mistral",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=25,
    save_strategy="steps",
    save_steps=200,
    max_seq_length=2048,
    packing=True,                # Pack multiple examples into one sequence
)
# Create trainer and train
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("./sft-mistral/final")
This run uses SFTTrainer from trl. Key settings include packing=True (concatenates short examples to fill sequences), a cosine learning rate schedule with warmup, and integrated PEFT via the peft_config parameter. The trainer applies the tokenizer's chat template automatically.

When packing=True, the SFTTrainer concatenates multiple training examples into a single sequence (separated by EOS tokens) up to max_seq_length. This eliminates the wasted computation from padding short sequences and can speed up training by 2x to 5x on datasets with short examples. Note that by default the loss is computed over every token in the packed sequence; restricting the loss to response tokens requires a completion-only data collator, which cannot be combined with packing.
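The idea behind packing can be sketched in a few lines (a simplified illustration; SFTTrainer's internal packing logic differs in detail):

```python
def pack_examples(tokenized, eos_id, max_seq_length):
    """Concatenate tokenized examples, separated by EOS, then cut
    fixed-length blocks. A trailing partial block is dropped."""
    stream = []
    for ids in tokenized:
        stream.extend(ids + [eos_id])
    return [stream[i:i + max_seq_length]
            for i in range(0, len(stream) - max_seq_length + 1, max_seq_length)]

examples = [[5, 6], [7, 8, 9], [10]]
print(pack_examples(examples, eos_id=2, max_seq_length=4))
# [[5, 6, 2, 7], [8, 9, 2, 10]]
```

Every block is exactly max_seq_length tokens, so no compute is spent on padding; the trade-off is that examples can be split across block boundaries.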
4. Direct Preference Optimization (DPO)
DPO is an alignment algorithm that trains a model to prefer "chosen" responses over "rejected" responses, without needing a separate reward model. It directly optimizes the policy using a contrastive loss derived from the Bradley-Terry preference model. The trl library provides a DPOTrainer that handles the paired data format and reference model management.
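The sigmoid DPO loss itself is compact enough to write out. Given the summed log-probability of each response under the policy and under the frozen reference model, a per-pair sketch (hypothetical helper, following the standard DPO formulation):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where the margin is how much more the
    policy prefers the chosen response than the reference model does."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already leans toward the chosen response relative to the reference:
print(round(dpo_loss(-40.0, -55.0, -42.0, -50.0), 4))  # 0.4032
```

The loss shrinks as the policy's preference margin over the reference grows, and beta scales how sharply deviations from the reference are penalized, which is exactly the role of beta in the DPOConfig below.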
The following example trains a DPO model starting from an SFT checkpoint.
from trl import DPOTrainer, DPOConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from datasets import load_dataset
# Load the SFT model as starting point
model_name = "./sft-mistral/final" # Or a Hub model
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Load preference dataset (must have "chosen" and "rejected" columns)
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train[:5000]")
# LoRA for DPO
peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo-mistral",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    beta=0.1,                    # KL penalty coefficient
    loss_type="sigmoid",         # Standard DPO loss
    bf16=True,
    logging_steps=25,
    save_strategy="steps",
    save_steps=200,
    max_length=1024,
    max_prompt_length=512,
)
# The DPOTrainer manages the reference model automatically
trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("./dpo-mistral/final")
This run uses DPOTrainer. The beta=0.1 setting controls how strongly the model is penalized for diverging from the reference policy. The trainer manages the reference model internally, and the preference dataset must contain "chosen" and "rejected" response columns.

5. Reward Modeling and PPO
For the classic RLHF pipeline, you first train a reward model that scores responses, then use Proximal Policy Optimization (PPO) to optimize the language model against that reward signal. The trl library provides both RewardTrainer for the reward model and PPOTrainer for the policy optimization step.
The following example sketches the full RLHF pipeline: reward model training followed by PPO.
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
# Step 1: Train a reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Preference data: each row has "chosen" and "rejected" text
pref_data = load_dataset("argilla/ultrafeedback-binarized-preferences",
                         split="train[:2000]")
reward_config = RewardConfig(
    output_dir="./reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    bf16=True,
    max_length=512,
)
reward_trainer = RewardTrainer(
    model=reward_model,
    args=reward_config,
    train_dataset=pref_data,
    processing_class=tokenizer,
)
reward_trainer.train()
The RewardTrainer teaches the model to assign higher scalar scores to "chosen" responses than to "rejected" ones from the preference dataset. This reward model supplies the optimization signal in the PPO step that follows.

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
import torch
# Step 2: PPO alignment (uses the legacy trl PPOTrainer API, pre-0.12)
model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-mistral/final")
tokenizer = AutoTokenizer.from_pretrained("./sft-mistral/final")
tokenizer.pad_token = tokenizer.eos_token
# Load the trained reward model for scoring
from transformers import pipeline as hf_pipeline
reward_pipe = hf_pipeline("text-classification", model="./reward-model", device=0)
ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=2,                # Must equal the number of samples passed to step()
    mini_batch_size=1,
    gradient_accumulation_steps=2,
    ppo_epochs=4,
    kl_penalty="kl",             # KL divergence penalty type
    init_kl_coef=0.2,            # Initial KL coefficient
)
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
)
# PPO training loop (simplified: one batch of two prompts)
prompts = ["Explain quantum computing.", "Write a haiku about spring."]
queries, responses, rewards = [], [], []
for prompt_text in prompts:
    query = tokenizer(prompt_text, return_tensors="pt").input_ids[0]
    # return_prompt=False yields only the newly generated tokens
    response = ppo_trainer.generate(query, return_prompt=False,
                                    max_new_tokens=128, do_sample=True)
    response_text = tokenizer.decode(response[0], skip_special_tokens=True)
    # Score the response with the reward model
    reward_output = reward_pipe(response_text, truncation=True)
    queries.append(query)
    responses.append(response[0])
    rewards.append(torch.tensor(reward_output[0]["score"]))
# One PPO optimization step over the whole batch
ppo_trainer.step(queries, responses, rewards)
The init_kl_coef=0.2 setting prevents the policy from drifting too far from the SFT model. Each step() call updates the model toward responses that score higher on the reward signal.

DPO is simpler to implement and more stable to train because it avoids the reward model and the complex PPO optimization loop; it requires only a preference dataset. PPO, on the other hand, can optimize for any reward signal (including non-differentiable metrics like code execution success) and can explore beyond the training distribution. In practice, DPO has become the preferred method for most alignment tasks due to its simplicity and competitive performance. Use PPO when you need to optimize for a reward signal that cannot be captured in pairwise preferences.
6. Saving, Loading, and Merging Adapters
PEFT adapters are saved separately from the base model, keeping adapter files small (typically 10 to 100 MB). You can load multiple adapters, switch between them, or merge an adapter permanently into the base model weights for deployment.
The following example demonstrates the adapter lifecycle: saving, loading, and merging.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Save adapter (only the LoRA weights, not the full model)
model.save_pretrained("./my-adapter")
# This creates: adapter_config.json + adapter_model.safetensors (~50 MB)
# Load adapter onto a fresh base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="auto",
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(base_model, "./my-adapter")
# Merge adapter into base model (for faster inference, no PEFT dependency)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
# Push to the Hub
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
merged_model.push_to_hub("my-org/mistral-7b-finetuned")
tokenizer.push_to_hub("my-org/mistral-7b-finetuned")
save_pretrained() stores only the LoRA weights (~50 MB), PeftModel.from_pretrained() loads them onto a fresh base model, and merge_and_unload() folds the adapter into the base weights permanently for deployment without the PEFT dependency.
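What merging does is purely arithmetic: the scaled product of the adapter matrices is added into the base weight, after which the separate adapter path disappears. A small sketch with random matrices verifying the equivalence (illustrative; merge_and_unload() performs this per adapted layer):

```python
import torch

d_out, d_in, r, alpha = 32, 32, 4, 8
W = torch.randn(d_out, d_in)        # frozen base weight
A = torch.randn(r, d_in) * 0.1      # LoRA down-projection
B = torch.randn(d_out, r) * 0.1     # LoRA up-projection
scaling = alpha / r

x = torch.randn(5, d_in)
adapter_out = x @ W.T + scaling * (x @ A.T @ B.T)   # base + adapter path
merged_out = x @ (W + scaling * B @ A).T            # single merged weight

assert torch.allclose(adapter_out, merged_out, atol=1e-5)
```

Because the merged model is a plain dense weight matrix, inference runs at full base-model speed with no extra matmuls and no peft dependency.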