Hugging Face offers two complementary approaches to model training. The Trainer API provides a high-level, declarative interface that handles the training loop, evaluation, logging, and checkpointing automatically. For practitioners who need full control over the training loop, Accelerate provides a thin abstraction layer that makes any PyTorch training script work seamlessly across CPUs, single GPUs, multi-GPU setups, and TPUs with minimal code changes. This section covers both approaches, along with key techniques like gradient accumulation, mixed precision, and custom callbacks.
1. TrainingArguments: Configuring the Training Run
Every Trainer-based training run begins with a TrainingArguments object. This dataclass consolidates all training hyperparameters, output paths, logging settings, and hardware options into a single configuration. Understanding its parameters is essential for effective fine-tuning. For the broader context of fine-tuning strategies, see Chapter 9: Fine-Tuning.
The following example sets up a typical configuration for fine-tuning a classification model.
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    # Gradient accumulation: effective batch size = 16 * 4 = 64
    gradient_accumulation_steps=4,
    # Mixed precision for faster training and lower memory
    bf16=True,  # Use bfloat16 (preferred on Ampere+ GPUs)
    # Evaluation and logging
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # Keep only the 3 most recent checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    # Reproducibility
    seed=42,
    # Hub integration
    push_to_hub=False,
    report_to="tensorboard",
)
TrainingArguments configuration covering learning rate scheduling (cosine with warmup), gradient accumulation for effective batch size scaling, mixed-precision training with bf16, and checkpoint management with best-model selection by accuracy.

| Parameter | Purpose | Typical Values |
|---|---|---|
| per_device_train_batch_size | Batch size per GPU | 8, 16, 32 |
| gradient_accumulation_steps | Simulate larger batches | 2, 4, 8 |
| learning_rate | Peak learning rate | 1e-5 to 5e-5 for fine-tuning |
| warmup_ratio | Fraction of steps for LR warmup | 0.05 to 0.1 |
| bf16 / fp16 | Mixed-precision training | bf16=True on Ampere+ GPUs |
| eval_strategy | When to run evaluation | "steps" or "epoch" |
| save_total_limit | Max checkpoints on disk | 2 to 5 |
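To make these parameters concrete, the arithmetic below (a plain-Python sketch; the dataset size is SST-2's train split, and the single-GPU count is an assumption for illustration) shows how the effective batch size and warmup step count fall out of a configuration like the one above.

```python
# How TrainingArguments values combine (illustrative numbers)
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU run

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 64

# warmup_ratio is converted into an absolute warmup step count
num_examples = 67_349  # SST-2 train split size
num_train_epochs = 3
steps_per_epoch = -(-num_examples // effective_batch_size)  # ceiling division
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = int(0.1 * total_steps)  # warmup_ratio = 0.1
print(total_steps, warmup_steps)  # 3159 315
```

Scaling to more GPUs multiplies the effective batch size again, which is why learning rates often need retuning when the hardware changes.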
2. The Trainer API: End-to-End Fine-Tuning
The Trainer class accepts a model, training arguments, datasets, a tokenizer, and optional metrics, then manages the entire training lifecycle. It handles device placement, gradient computation, optimizer creation, checkpointing, and evaluation automatically.
The following example fine-tunes a DistilBERT model on the SST-2 sentiment classification task from start to finish.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
import numpy as np

# 1. Load and preprocess data
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_fn(examples):
    return tokenizer(examples["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")

# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

# 3. Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = (preds == labels).mean()
    return {"accuracy": accuracy}

# 4. Configure training
training_args = TrainingArguments(
    output_dir="./sst2-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    bf16=True,
)

# 5. Create Trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

# 6. Evaluate on the validation set
results = trainer.evaluate()
print(f"Validation accuracy: {results['eval_accuracy']:.4f}")
Trainer workflow: load model and tokenizer, preprocess data with .map(), define a compute_metrics function, instantiate the Trainer, train, and evaluate. The Trainer handles gradient accumulation, mixed precision, checkpointing, and logging automatically.

The Trainer uses a DataCollator to batch examples together. By default it uses DataCollatorWithPadding, which dynamically pads sequences to the longest length in each batch rather than to a fixed maximum length, saving computation. For masked language modeling, use DataCollatorForLanguageModeling; for sequence-to-sequence models such as T5, use DataCollatorForSeq2Seq (the core library ships no dedicated span-corruption collator; the official T5 pretraining example defines its own). You can pass a custom collator via the data_collator parameter.
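To see why dynamic padding saves computation, the sketch below reimplements the core idea in plain Python rather than calling DataCollatorWithPadding itself; pad_batch is a hypothetical helper and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three tokenized sentences of different lengths (made-up token ids)
batch = [[101, 2023, 102], [101, 2023, 2003, 2307, 102], [101, 102]]
padded = pad_batch(batch)
print([len(seq) for seq in padded])  # [5, 5, 5] -- padded to 5, not to max_length=128
```

Padding to 5 tokens instead of a fixed 128 means the model processes far fewer pad positions per batch; the real collator also builds the matching attention masks.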
3. Custom Callbacks for Training Control
The Trainer supports a callback system that lets you inject custom logic at various points during training: after each step, after each epoch, at evaluation time, and more. Callbacks can be used for custom logging, early stopping, dynamic hyperparameter schedules, or integration with external services.
The following example implements a custom callback that logs the learning rate and stops training early if the loss plateaus.
from transformers import TrainerCallback, EarlyStoppingCallback

class LRLoggingCallback(TrainerCallback):
    """Log the current learning rate at each logging step."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "learning_rate" in logs:
            print(f"Step {state.global_step}: "
                  f"lr={logs['learning_rate']:.2e}, "
                  f"loss={logs.get('loss', 'N/A')}")

# Built-in early stopping callback
early_stop = EarlyStoppingCallback(
    early_stopping_patience=3,  # Stop after 3 evals without improvement
    early_stopping_threshold=0.001,  # Minimum improvement to count
)

# Add callbacks to the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[LRLoggingCallback(), early_stop],
)
trainer.train()
Callbacks receive TrainerState and TrainerControl objects, enabling custom logging, early stopping, or dynamic hyperparameter adjustment without modifying the training loop.

4. Custom Training Loops with Accelerate
While the Trainer is convenient, some projects require full control over the training loop. The Accelerate library bridges this gap: you write a standard PyTorch training loop, and Accelerate handles device placement, distributed communication, mixed precision, and gradient accumulation transparently. The same script runs on a single CPU, a single GPU, multiple GPUs, or a TPU without modification.
The following example shows a complete training loop using Accelerate.
from accelerate import Accelerator
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    get_linear_schedule_with_warmup,
)
from datasets import load_dataset

# 1. Initialize Accelerator
accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
)

# 2. Prepare data
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized = dataset.map(
    lambda x: tokenizer(x["sentence"], truncation=True, max_length=128),
    batched=True,
)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Pad each batch dynamically; without a collator, stacking variable-length
# sequences into a single tensor would fail
collator = DataCollatorWithPadding(tokenizer)
train_loader = DataLoader(
    tokenized["train"], batch_size=16, shuffle=True, collate_fn=collator
)
eval_loader = DataLoader(tokenized["validation"], batch_size=32, collate_fn=collator)

# 3. Prepare model and optimizer
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_steps = len(train_loader) * 3  # 3 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=num_steps // 10, num_training_steps=num_steps
)

# 4. Let Accelerate prepare everything for the target hardware
model, optimizer, train_loader, eval_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, eval_loader, scheduler
)

# 5. Training loop
model.train()
for epoch in range(3):
    total_loss = 0
    for step, batch in enumerate(train_loader):
        with accelerator.accumulate(model):
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"],  # the collator renames "label" -> "labels"
            )
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    accelerator.print(f"Epoch {epoch + 1}: avg loss = {avg_loss:.4f}")
A complete training loop with Accelerate. The key call is accelerator.prepare(), which wraps the model, optimizer, scheduler, and dataloaders for distributed training and mixed precision. The rest is a standard PyTorch loop with accelerator.backward() replacing loss.backward().

You must call accelerator.prepare() on the model, optimizer, dataloaders, and scheduler before the training loop. This method wraps each object to handle distributed communication, mixed-precision scaling, and device placement. Forgetting to prepare any one of them will cause silent correctness bugs or runtime errors in multi-GPU settings.
5. Distributed Training and Multi-GPU Strategies
Both Trainer and Accelerate support distributed training out of the box. The most common strategy is Distributed Data Parallel (DDP), where each GPU processes a different batch and gradients are synchronized across GPUs at each step. For models too large to fit on a single GPU, Fully Sharded Data Parallel (FSDP) or DeepSpeed ZeRO partitions the model weights, gradients, and optimizer states across GPUs.
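The heart of DDP is a gradient all-reduce: each GPU computes gradients on its own batch, the gradients are averaged across workers, and every replica applies the same optimizer update. The following single-process toy simulation sketches just that averaging step; no torch.distributed is involved and the values are illustrative.

```python
# Per-GPU gradients for one scalar parameter (toy values, 4 simulated workers)
grads_per_gpu = [0.5, 1.5, 0.25, 1.75]

# All-reduce with mean: every worker ends up holding the same averaged gradient,
# so all model replicas stay bit-identical after each optimizer step
avg_grad = sum(grads_per_gpu) / len(grads_per_gpu)
synced = [avg_grad] * len(grads_per_gpu)
print(synced)  # [1.0, 1.0, 1.0, 1.0]
```

In real DDP this averaging runs as an NCCL all-reduce overlapped with the backward pass, which is why per-step communication cost, not model size, is usually the limiting factor.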
Launching distributed training requires a launcher command. The accelerate CLI provides a configuration wizard and launch command.
# Step 1: Configure your distributed setup (interactive wizard)
# $ accelerate config
# Asks: number of GPUs, mixed precision, DeepSpeed, FSDP, etc.
# Saves config to ~/.cache/huggingface/accelerate/default_config.yaml
# Step 2: Launch your training script on all GPUs
# $ accelerate launch --num_processes 4 train.py
# Or, with the Trainer API, use torchrun directly:
# $ torchrun --nproc_per_node 4 train_with_trainer.py
Distributed launch with the accelerate CLI. The accelerate config wizard saves hardware settings (GPU count, mixed precision, FSDP/DeepSpeed) to a YAML file, and accelerate launch spawns processes accordingly.

For very large models, DeepSpeed integration provides advanced memory optimization. The Trainer API integrates with DeepSpeed through a simple JSON configuration file.
from transformers import TrainingArguments

# Enable DeepSpeed ZeRO Stage 2 via Trainer
training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",  # Path to DeepSpeed config
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,
    num_train_epochs=3,
)

# ds_config.json example:
# {
#   "zero_optimization": {
#     "stage": 2,
#     "offload_optimizer": { "device": "cpu" },
#     "allgather_bucket_size": 5e8,
#     "reduce_bucket_size": 5e8
#   },
#   "bf16": { "enabled": true },
#   "train_micro_batch_size_per_gpu": 4,
#   "gradient_accumulation_steps": 8
# }
DeepSpeed is enabled through the deepspeed parameter in TrainingArguments. The JSON config file controls optimizer and gradient sharding across GPUs, allowing models that exceed single-GPU memory to train without code changes.

| Strategy | Model Size | What Is Distributed | When to Use |
|---|---|---|---|
| DDP | Fits on 1 GPU | Data only | Most fine-tuning jobs |
| FSDP | Too large for 1 GPU | Weights, gradients, optimizer | 7B+ parameter models |
| DeepSpeed ZeRO-2 | Too large for 1 GPU | Gradients, optimizer | Memory-constrained multi-GPU |
| DeepSpeed ZeRO-3 | Very large | Weights, gradients, optimizer | 70B+ parameter models |
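The table's trade-offs follow from what each strategy shards. The back-of-envelope sketch below uses the classic mixed-precision Adam accounting from the ZeRO paper (roughly 16 bytes per parameter: 2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and optimizer states); per_gpu_gb is a hypothetical helper, and the estimates ignore activations, so treat them as rough orders of magnitude.

```python
def per_gpu_gb(num_params, num_gpus, strategy):
    """Rough per-GPU memory (GB) for mixed-precision Adam training.

    Bytes per parameter: 2 (bf16 weights) + 2 (gradients)
    + 12 (fp32 master weights, momentum, variance).
    """
    weights, grads, optim = 2, 2, 12
    if strategy == "ddp":        # everything replicated on each GPU
        per_param = weights + grads + optim
    elif strategy == "zero2":    # gradients + optimizer states sharded
        per_param = weights + (grads + optim) / num_gpus
    elif strategy == "zero3":    # weights, gradients, optimizer all sharded
        per_param = (weights + grads + optim) / num_gpus
    else:
        raise ValueError(strategy)
    return num_params * per_param / 1e9

# A 7B-parameter model across 8 GPUs (activations not counted)
print(per_gpu_gb(7e9, 8, "ddp"))    # 112.0 -- far beyond a single 80 GB GPU
print(per_gpu_gb(7e9, 8, "zero2"))  # 26.25
print(per_gpu_gb(7e9, 8, "zero3"))  # 14.0
```

The arithmetic makes the table's thresholds intuitive: plain DDP replicates all 16 bytes/parameter per GPU, while ZeRO-3 divides nearly everything by the GPU count.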
6. Evaluation During Training
Monitoring model performance during training is critical for detecting overfitting, selecting the best checkpoint, and deciding when to stop. The Trainer supports evaluation at configurable intervals through the eval_strategy parameter and integrates with the evaluate library for computing standard metrics.
The following example adds multiple metrics to a Trainer-based run using the evaluate library.
import evaluate
import numpy as np

# Load multiple metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(
            predictions=preds, references=labels
        )["accuracy"],
        "f1": f1_metric.compute(
            predictions=preds, references=labels, average="weighted"
        )["f1"],
        "precision": precision_metric.compute(
            predictions=preds, references=labels, average="weighted"
        )["precision"],
        "recall": recall_metric.compute(
            predictions=preds, references=labels, average="weighted"
        )["recall"],
    }

# The Trainer will call compute_metrics at each evaluation step
# and log all four metrics to your chosen reporter (TensorBoard, W&B, etc.)
Multi-metric evaluation with the evaluate library. The compute_metrics function receives model predictions and ground-truth labels, computes accuracy, F1, precision, and recall, and returns them as a dictionary that the Trainer logs at each evaluation interval.

If training is interrupted, the Trainer can resume from the last checkpoint automatically. Call trainer.train(resume_from_checkpoint=True) and it will detect the latest checkpoint in output_dir, restore the model weights, optimizer state, scheduler state, and random-number-generator states, and continue training from exactly where it left off. This is especially valuable for long-running jobs on preemptible cloud instances.
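Resuming works because the Trainer writes checkpoints to output_dir as checkpoint-&lt;global_step&gt; subdirectories, so finding the latest one means picking the highest step. The sketch below is a simplified stand-in for the library's own get_last_checkpoint helper (in transformers.trainer_utils), not its exact implementation.

```python
import os
import re

def last_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subdirectory with the highest step."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None  # nothing to resume from
    _, latest = max(candidates)  # tuples compare by step number first
    return os.path.join(output_dir, latest)

# e.g. with checkpoint-500/, checkpoint-1000/, checkpoint-1500/ in output_dir,
# last_checkpoint(output_dir) ends with "checkpoint-1500"
```

Comparing the numeric step rather than the directory name matters: a lexicographic sort would rank "checkpoint-999" above "checkpoint-1500".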