Hugging Face offers two complementary approaches to model training. The Trainer API provides a high-level, declarative interface that handles the training loop, evaluation, logging, and checkpointing automatically. For practitioners who need full control over the training loop, Accelerate provides a thin abstraction layer that makes any PyTorch training script work seamlessly across CPUs, single GPUs, multi-GPU setups, and TPUs with minimal code changes. This section covers both approaches, along with key techniques like gradient accumulation, mixed precision, and custom callbacks.
1. TrainingArguments: Configuring the Training Run
Every Trainer-based training run begins with a TrainingArguments object. This dataclass consolidates all training hyperparameters, output paths, logging settings, and hardware options into a single configuration. Understanding its parameters is essential for effective fine-tuning. For the broader context of fine-tuning strategies, see Chapter 9: Fine-Tuning.
The following example sets up a typical configuration for fine-tuning a classification model.
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    # Gradient accumulation: effective batch size = 16 * 4 = 64
    gradient_accumulation_steps=4,
    # Mixed precision for faster training and lower memory
    bf16=True,  # Use bfloat16 (preferred on Ampere+ GPUs)
    # Evaluation and logging
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # Keep only the 3 most recent checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    # Reproducibility
    seed=42,
    # Hub integration
    push_to_hub=False,
    report_to="tensorboard",
)
TrainingArguments configuration covering learning rate scheduling (cosine with warmup), gradient accumulation for effective batch size scaling, mixed-precision training with bf16, and checkpoint management with best-model selection by accuracy.

| Parameter | Purpose | Typical Values |
|---|---|---|
| per_device_train_batch_size | Batch size per GPU | 8, 16, 32 |
| gradient_accumulation_steps | Simulate larger batches | 2, 4, 8 |
| learning_rate | Peak learning rate | 1e-5 to 5e-5 for fine-tuning |
| warmup_ratio | Fraction of steps for LR warmup | 0.05 to 0.1 |
| bf16 / fp16 | Mixed-precision training | bf16=True on Ampere+ GPUs |
| eval_strategy | When to run evaluation | "steps" or "epoch" |
| save_total_limit | Max checkpoints on disk | 2 to 5 |
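To make these parameters concrete, the arithmetic below (a plain-Python sketch; the dataset size is SST-2's train split, and the single-GPU count is an assumption for illustration) shows how the effective batch size and warmup step count fall out of a configuration like the one above.

```python
# How TrainingArguments values combine (illustrative numbers)
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU run

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 64

# warmup_ratio is converted into an absolute warmup step count
num_examples = 67_349  # SST-2 train split size
num_train_epochs = 3
steps_per_epoch = -(-num_examples // effective_batch_size)  # ceiling division
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = int(0.1 * total_steps)  # warmup_ratio = 0.1
print(total_steps, warmup_steps)  # 3159 315
```

Scaling to more GPUs multiplies the effective batch size again, which is why learning rates often need retuning when the hardware changes.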
2. The Trainer API: End-to-End Fine-Tuning
The Trainer class accepts a model, training arguments, datasets, a tokenizer, and optional metrics, then manages the entire training lifecycle. It handles device placement, gradient computation, optimizer creation, checkpointing, and evaluation automatically.
The following example fine-tunes a DistilBERT model on the SST-2 sentiment classification task from start to finish.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
import numpy as np

# 1. Load and preprocess data
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_fn(examples):
    return tokenizer(examples["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")

# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

# 3. Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = (preds == labels).mean()
    return {"accuracy": accuracy}

# 4. Configure training
training_args = TrainingArguments(
    output_dir="./sst2-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    bf16=True,
)

# 5. Create Trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

# 6. Evaluate on the validation set
results = trainer.evaluate()
print(f"Validation accuracy: {results['eval_accuracy']:.4f}")
Trainer workflow: load model and tokenizer, preprocess data with .map(), define a compute_metrics function, instantiate the Trainer, train, and evaluate. The Trainer handles gradient accumulation, mixed precision, checkpointing, and logging automatically.

The Trainer uses a DataCollator to batch examples together. By default it uses DataCollatorWithPadding, which dynamically pads sequences to the longest length in each batch rather than to a fixed maximum length, saving computation. For masked language modeling, use DataCollatorForLanguageModeling; for sequence-to-sequence models such as T5, use DataCollatorForSeq2Seq (the core library ships no dedicated span-corruption collator; the official T5 pretraining example defines its own). You can pass a custom collator via the data_collator parameter.
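To see why dynamic padding saves computation, the sketch below reimplements the core idea in plain Python rather than calling DataCollatorWithPadding itself; pad_batch is a hypothetical helper and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three tokenized sentences of different lengths (made-up token ids)
batch = [[101, 2023, 102], [101, 2023, 2003, 2307, 102], [101, 102]]
padded = pad_batch(batch)
print([len(seq) for seq in padded])  # [5, 5, 5] -- padded to 5, not to max_length=128
```

Padding to 5 tokens instead of a fixed 128 means the model processes far fewer pad positions per batch; the real collator also builds the matching attention masks.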
3. Custom Callbacks for Training Control
The Trainer supports a callback system that lets you inject custom logic at various points during training: after each step, after each epoch, at evaluation time, and more. Callbacks can be used for custom logging, early stopping, dynamic hyperparameter schedules, or integration with external services.
The following example implements a custom callback that logs the learning rate and stops training early if the loss plateaus.
from transformers import TrainerCallback, EarlyStoppingCallback

class LRLoggingCallback(TrainerCallback):
    """Log the current learning rate at each logging step."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "learning_rate" in logs:
            print(f"Step {state.global_step}: "
                  f"lr={logs['learning_rate']:.2e}, "
                  f"loss={logs.get('loss', 'N/A')}")

# Built-in early stopping callback
early_stop = EarlyStoppingCallback(
    early_stopping_patience=3,  # Stop after 3 evals without improvement
    early_stopping_threshold=0.001,  # Minimum improvement to count
)

# Add callbacks to the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[LRLoggingCallback(), early_stop],
)
trainer.train()
Callbacks receive TrainerState and TrainerControl objects, enabling custom logging, early stopping, or dynamic hyperparameter adjustment without modifying the training loop.

4. Custom Training Loops with Accelerate
While the Trainer is convenient, some projects require full control over the training loop. The Accelerate library bridges this gap: you write a standard PyTorch training loop, and Accelerate handles device placement, distributed communication, mixed precision, and gradient accumulation transparently. The same script runs on a single CPU, a single GPU, multiple GPUs, or a TPU without modification.
The following example shows a complete training loop using Accelerate.
from accelerate import Accelerator
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    get_linear_schedule_with_warmup,
)
from datasets import load_dataset

# 1. Initialize Accelerator
accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
)

# 2. Prepare data
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized = dataset.map(
    lambda x: tokenizer(x["sentence"], truncation=True, max_length=128),
    batched=True,
)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Pad each batch dynamically; without a collator, stacking variable-length
# sequences into a single tensor would fail
collator = DataCollatorWithPadding(tokenizer)
train_loader = DataLoader(
    tokenized["train"], batch_size=16, shuffle=True, collate_fn=collator
)
eval_loader = DataLoader(tokenized["validation"], batch_size=32, collate_fn=collator)

# 3. Prepare model and optimizer
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_steps = len(train_loader) * 3  # 3 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=num_steps // 10, num_training_steps=num_steps
)

# 4. Let Accelerate prepare everything for the target hardware
model, optimizer, train_loader, eval_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, eval_loader, scheduler
)

# 5. Training loop
model.train()
for epoch in range(3):
    total_loss = 0
    for step, batch in enumerate(train_loader):
        with accelerator.accumulate(model):
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"],  # the collator renames "label" -> "labels"
            )
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    accelerator.print(f"Epoch {epoch + 1}: avg loss = {avg_loss:.4f}")
A complete training loop with Accelerate. The key call is accelerator.prepare(), which wraps the model, optimizer, scheduler, and dataloaders for distributed training and mixed precision. The rest is a standard PyTorch loop with accelerator.backward() replacing loss.backward().

You must call accelerator.prepare() on the model, optimizer, dataloaders, and scheduler before the training loop. This method wraps each object to handle distributed communication, mixed-precision scaling, and device placement. Forgetting to prepare any one of them will cause silent correctness bugs or runtime errors in multi-GPU settings.
5. Distributed Training and Multi-GPU Strategies
Both Trainer and Accelerate support distributed training out of the box. The most common strategy is Distributed Data Parallel (DDP), where each GPU processes a different batch and gradients are synchronized across GPUs at each step. For models too large to fit on a single GPU, Fully Sharded Data Parallel (FSDP) or DeepSpeed ZeRO partitions the model weights, gradients, and optimizer states across GPUs.
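The heart of DDP is a gradient all-reduce: each GPU computes gradients on its own batch, the gradients are averaged across workers, and every replica applies the same optimizer update. The following single-process toy simulation sketches just that averaging step; no torch.distributed is involved and the values are illustrative.

```python
# Per-GPU gradients for one scalar parameter (toy values, 4 simulated workers)
grads_per_gpu = [0.5, 1.5, 0.25, 1.75]

# All-reduce with mean: every worker ends up holding the same averaged gradient,
# so all model replicas stay bit-identical after each optimizer step
avg_grad = sum(grads_per_gpu) / len(grads_per_gpu)
synced = [avg_grad] * len(grads_per_gpu)
print(synced)  # [1.0, 1.0, 1.0, 1.0]
```

In real DDP this averaging runs as an NCCL all-reduce overlapped with the backward pass, which is why per-step communication cost, not model size, is usually the limiting factor.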
Launching distributed training requires a launcher command. The accelerate CLI provides a configuration wizard and launch command.
# Step 1: Configure your distributed setup (interactive wizard)
# $ accelerate config
# Asks: number of GPUs, mixed precision, DeepSpeed, FSDP, etc.
# Saves config to ~/.cache/huggingface/accelerate/default_config.yaml
# Step 2: Launch your training script on all GPUs
# $ accelerate launch --num_processes 4 train.py
# Or, with the Trainer API, use torchrun directly:
# $ torchrun --nproc_per_node 4 train_with_trainer.py
Distributed launch with the accelerate CLI. The accelerate config wizard saves hardware settings (GPU count, mixed precision, FSDP/DeepSpeed) to a YAML file, and accelerate launch spawns processes accordingly.

For very large models, DeepSpeed integration provides advanced memory optimization. The Trainer API integrates with DeepSpeed through a simple JSON configuration file.
from transformers import TrainingArguments

# Enable DeepSpeed ZeRO Stage 2 via Trainer
training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",  # Path to DeepSpeed config
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,
    num_train_epochs=3,
)

# ds_config.json example:
# {
#   "zero_optimization": {
#     "stage": 2,
#     "offload_optimizer": { "device": "cpu" },
#     "allgather_bucket_size": 5e8,
#     "reduce_bucket_size": 5e8
#   },
#   "bf16": { "enabled": true },
#   "train_micro_batch_size_per_gpu": 4,
#   "gradient_accumulation_steps": 8
# }
DeepSpeed is enabled through the deepspeed parameter in TrainingArguments. The JSON config file controls optimizer and gradient sharding across GPUs, allowing models that exceed single-GPU memory to train without code changes.

| Strategy | Model Size | What Is Distributed | When to Use |
|---|---|---|---|
| DDP | Fits on 1 GPU | Data only | Most fine-tuning jobs |
| FSDP | Too large for 1 GPU | Weights, gradients, optimizer | 7B+ parameter models |
| DeepSpeed ZeRO-2 | Too large for 1 GPU | Gradients, optimizer | Memory-constrained multi-GPU |
| DeepSpeed ZeRO-3 | Very large | Weights, gradients, optimizer | 70B+ parameter models |
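The table's trade-offs follow from what each strategy shards. The back-of-envelope sketch below uses the classic mixed-precision Adam accounting from the ZeRO paper (roughly 16 bytes per parameter: 2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and optimizer states); per_gpu_gb is a hypothetical helper, and the estimates ignore activations, so treat them as rough orders of magnitude.

```python
def per_gpu_gb(num_params, num_gpus, strategy):
    """Rough per-GPU memory (GB) for mixed-precision Adam training.

    Bytes per parameter: 2 (bf16 weights) + 2 (gradients)
    + 12 (fp32 master weights, momentum, variance).
    """
    weights, grads, optim = 2, 2, 12
    if strategy == "ddp":        # everything replicated on each GPU
        per_param = weights + grads + optim
    elif strategy == "zero2":    # gradients + optimizer states sharded
        per_param = weights + (grads + optim) / num_gpus
    elif strategy == "zero3":    # weights, gradients, optimizer all sharded
        per_param = (weights + grads + optim) / num_gpus
    else:
        raise ValueError(strategy)
    return num_params * per_param / 1e9

# A 7B-parameter model across 8 GPUs (activations not counted)
print(per_gpu_gb(7e9, 8, "ddp"))    # 112.0 -- far beyond a single 80 GB GPU
print(per_gpu_gb(7e9, 8, "zero2"))  # 26.25
print(per_gpu_gb(7e9, 8, "zero3"))  # 14.0
```

The arithmetic makes the table's thresholds intuitive: plain DDP replicates all 16 bytes/parameter per GPU, while ZeRO-3 divides nearly everything by the GPU count.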
6. Evaluation During Training
Monitoring model performance during training is critical for detecting overfitting, selecting the best checkpoint, and deciding when to stop. The Trainer supports evaluation at configurable intervals through the eval_strategy parameter and integrates with the evaluate library for computing standard metrics.
The following example adds multiple metrics to a Trainer-based run using the evaluate library.
import evaluate
import numpy as np

# Load multiple metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(
            predictions=preds, references=labels
        )["accuracy"],
        "f1": f1_metric.compute(
            predictions=preds, references=labels, average="weighted"
        )["f1"],
        "precision": precision_metric.compute(
            predictions=preds, references=labels, average="weighted"
        )["precision"],
        "recall": recall_metric.compute(
            predictions=preds, references=labels, average="weighted"
        )["recall"],
    }

# The Trainer will call compute_metrics at each evaluation step
# and log all four metrics to your chosen reporter (TensorBoard, W&B, etc.)
Multi-metric evaluation with the evaluate library. The compute_metrics function receives model predictions and ground-truth labels, computes accuracy, F1, precision, and recall, and returns them as a dictionary that the Trainer logs at each evaluation interval.

If training is interrupted, the Trainer can resume from the last checkpoint automatically. Call trainer.train(resume_from_checkpoint=True) and it will detect the latest checkpoint in output_dir, restore the model weights, optimizer state, scheduler state, and random-number-generator states, and continue training from exactly where it left off. This is especially valuable for long-running jobs on preemptible cloud instances.
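Resuming works because the Trainer writes checkpoints to output_dir as checkpoint-&lt;global_step&gt; subdirectories, so finding the latest one means picking the highest step. The sketch below is a simplified stand-in for the library's own get_last_checkpoint helper (in transformers.trainer_utils), not its exact implementation.

```python
import os
import re

def last_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subdirectory with the highest step."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None  # nothing to resume from
    _, latest = max(candidates)  # tuples compare by step number first
    return os.path.join(output_dir, latest)

# e.g. with checkpoint-500/, checkpoint-1000/, checkpoint-1500/ in output_dir,
# last_checkpoint(output_dir) ends with "checkpoint-1500"
```

Comparing the numeric step rather than the directory name matters: a lexicographic sort would rank "checkpoint-999" above "checkpoint-1500".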