Part 4: Training and Adapting
Chapter 14: Fine-Tuning Fundamentals

Fine-Tuning for Classification & Sequence Tasks

"My classifier achieves 99% accuracy. Unfortunately, 99% of my data belongs to one class. Time to learn about focal loss."

Finetune, Class-Imbalanced AI Agent
Big Picture

Classification is the most common fine-tuning task in production NLP. Sentiment analysis, spam detection, intent classification, named entity recognition, and content moderation all reduce to some form of classification. Training data for these classifiers can be generated efficiently using LLM-assisted labeling. This approach is common in hybrid ML and LLM systems where a small, fast classifier handles high-volume tasks. The approach is different from generative fine-tuning (SFT): instead of training the model to generate text, you add a classification head on top of the pre-trained model and train it to predict discrete labels. Hugging Face's AutoModel classes make this straightforward, but there are important decisions around architecture, loss functions, and class imbalance that determine whether your classifier works well in practice. The ML classification fundamentals from Section 00.1 provide the evaluation metrics and baseline approaches to compare against.

Prerequisites

Before starting, make sure you are familiar with fine-tuning basics as covered in Section 14.1: When and Why to Fine-Tune.

1. Classification Head Architecture

When fine-tuning a transformer for classification, you keep the pre-trained encoder body and add a small classification head on top. The head is typically a linear layer (or a small MLP) that maps the model's hidden representation to class logits. The entire model (encoder plus head) is trained end-to-end, but the head learns from scratch while the encoder benefits from pre-trained knowledge. Figure 14.6.1 illustrates this architecture.
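To make the head concrete, here is a minimal pure-Python sketch of what the linear head computes (dimensions shrunk from 768 to 4 for readability; the `classification_head` helper and all weight values are illustrative, not library code):

```python
import math

def classification_head(hidden, weights, bias):
    """Affine map from hidden state to class logits, plus softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]

# Toy example: 4-dim hidden state (BERT-base uses 768), 3 sentiment classes
hidden = [0.5, -1.2, 0.3, 0.8]
W = [[0.4, -0.1, 0.2, 0.9],    # one row of weights per class
     [0.1, 0.3, -0.2, 0.0],
     [-0.5, 0.2, 0.6, 0.4]]
b = [0.0, 0.1, -0.1]

logits, probs = classification_head(hidden, W, b)
labels = ["negative", "neutral", "positive"]
prediction = labels[probs.index(max(probs))]
print(prediction, [round(p, 3) for p in probs])
```

In the real model, `hidden` is the [CLS] representation and `W`, `b` are the randomly initialized head parameters that fine-tuning learns.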

Tip

For classification tasks with severe class imbalance (fewer than 5% of examples in the minority class), use focal loss instead of standard cross-entropy. Focal loss down-weights the contribution of easy, well-classified examples and focuses training on the hard cases. In practice, switching from cross-entropy to focal loss often improves minority-class F1 by 10 to 20 percentage points with no other changes to the training setup.
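The mechanism behind that tip is easy to see on scalars. For a predicted probability pt of the true class, cross-entropy is -log(pt) and focal loss scales it by (1 - pt)^gamma, following the Lin et al. formulation with gamma = 2 (the helper names below are illustrative):

```python
import math

def cross_entropy(pt):
    """Cross-entropy for the true class with predicted probability pt."""
    return -math.log(pt)

def focal_loss(pt, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - pt)^gamma."""
    return (1 - pt) ** gamma * cross_entropy(pt)

easy = 0.95   # confident, correct majority-class example
hard = 0.30   # uncertain minority-class example

print(f"easy: CE={cross_entropy(easy):.4f}  FL={focal_loss(easy):.4f}")
print(f"hard: CE={cross_entropy(hard):.4f}  FL={focal_loss(hard):.4f}")
# The easy example's loss is scaled by (0.05)^2 = 0.0025, while the hard
# example keeps (0.7)^2 = 0.49 of its cross-entropy: the training signal
# shifts toward the hard cases.
```

With a sea of easy majority-class examples, this rescaling is what stops the gradient from being dominated by cases the model already gets right.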

[Figure: the input "[CLS] This movie was absolutely fantastic [SEP]" enters a pre-trained transformer encoder (12 layers, 768 hidden dim, BERT-base); the [CLS] hidden state (768d) passes through Dropout (0.1) and a Linear layer (768 -> num_labels) to produce logits [2.1, -0.8, 0.3] -> "Positive"]
Figure 14.6.1: The [CLS] token representation is passed through a dropout layer and linear classification head

2. Single-Label Classification

Single-label classification is the simplest case: each input belongs to exactly one class. Examples include sentiment analysis (positive/negative/neutral), intent classification (booking/cancellation/inquiry), and content moderation (safe/unsafe). Hugging Face provides AutoModelForSequenceClassification that handles the architecture automatically.

Key Insight

Mental Model: The Attachment Head. Think of adding a classification head as attaching a specialized tool to a Swiss Army knife. The base model (the knife body) contains general-purpose language understanding in its hidden layers. The classification head (the attached tool) is a small linear layer that reads the model's internal representation and maps it to task-specific labels. During fine-tuning, you train both the attachment and (optionally) adjust the knife body itself, so the two work together seamlessly. Code Fragment 14.6.1 shows this approach in practice.

Code Fragment 14.6.1 configures the model, tokenizer, metrics, and Trainer for sentiment fine-tuning.

# Fine-tune BERT for single-label sentiment classification
# Configure the Hugging Face Trainer with learning rate schedule,
# batch size, and evaluation strategy
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Load model with classification head
model_name = "bert-base-uncased"
num_labels = 2  # SST-2 is binary: negative, positive

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="single_label_classification",
    # Map label indices to human-readable names
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

# Load and tokenize dataset
dataset = load_dataset("sst2")  # Stanford Sentiment Treebank (binary labels)

def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized = dataset.map(tokenize_function, batched=True)

# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1_macro": f1_score(labels, predictions, average="macro"),
        "f1_weighted": f1_score(labels, predictions, average="weighted"),
    }

# Training configuration
training_args = TrainingArguments(
    output_dir="./checkpoints/sentiment-bert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()
Code Fragment 14.6.1: Fine-tune BERT for sentiment classification with the Hugging Face Trainer

4. Token Classification (NER)

Named entity recognition and other token-level tasks assign a label to every token rather than a single label per input. Hugging Face provides AutoModelForTokenClassification for this case. Code Fragment 14.6.2 fine-tunes it on CoNLL-2003 with BIO labels.

# Fine-tune a pre-trained model for token classification (NER)
# The token classification head is initialized randomly on top of the base model
from transformers import AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset

# NER label scheme (BIO format)
label_list = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}

# Load model for token classification
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load CoNLL-2003 NER dataset
dataset = load_dataset("conll2003")

def tokenize_and_align_labels(examples):
    """Tokenize and align NER labels with subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # Input is already split into words
        padding="max_length",
        max_length=128,
    )

    labels = []
    for i, label_ids in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_row = []
        previous_word_id = None

        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP], [PAD])
                label_row.append(-100)
            elif word_id != previous_word_id:
                # First subword token of a word: use the word's label
                label_row.append(label_ids[word_id])
            else:
                # Subsequent subword tokens: convert B- to I- so a split
                # word never starts a new entity
                original_label = label_ids[word_id]
                label_name = label_list[original_label]
                if label_name.startswith("B-"):
                    i_label = label_name.replace("B-", "I-")
                    label_row.append(label2id.get(i_label, original_label))
                else:
                    label_row.append(original_label)
            previous_word_id = word_id

        labels.append(label_row)

    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
Code Fragment 14.6.2: Fine-tune BERT for token classification (NER) with subword label alignment
Warning

Subword tokenization breaks word boundaries. A critical challenge in token classification is that the tokenizer may split a single word into multiple subword tokens. The word "Mountain" might become ["Mount", "##ain"]. You must carefully align the original word-level labels with the subword tokens. The standard approach is to assign the label to the first subword and use -100 (ignore) or the corresponding I- tag for continuation subwords.
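The alignment rule can be isolated in a few lines. Given the word_ids sequence that Hugging Face fast tokenizers return (None for special tokens, otherwise the index of the word a subword came from), here is a sketch of the simpler "-100 for continuations" variant (`align_labels` is a hypothetical helper, not a library function):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Assign each subword its word's label; mask specials and continuations."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:                 # [CLS], [SEP], [PAD]
            aligned.append(ignore_index)
        elif wid != previous:           # first subword of a word
            aligned.append(word_labels[wid])
        else:                           # continuation subword: ignore in loss
            aligned.append(ignore_index)
        previous = wid
    return aligned

# words: ["Mountain", "View", "is", "nice"] with labels B-LOC, I-LOC, O, O
word_labels = [5, 6, 0, 0]              # indices into the BIO label list
word_ids = [None, 0, 0, 1, 2, 3, None]  # [CLS] Mount ##ain View is nice [SEP]
print(align_labels(word_ids, word_labels))
# → [-100, 5, -100, 6, 0, 0, -100]
```

Only the first subword of "Mountain" carries the B-LOC label; the "##ain" continuation is masked out of the loss.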

5. Sequence-Pair Tasks

Some classification tasks require comparing two input texts. Natural language inference (NLI) classifies the relationship between a premise and a hypothesis as entailment, contradiction, or neutral. Semantic textual similarity (STS) scores the similarity between two sentences. Question answering classification determines whether a passage contains the answer to a question. All of these are handled by the same AutoModelForSequenceClassification: simply pass both texts to the tokenizer. Code Fragment 14.6.3 shows this approach in practice.

# Sequence pair classification (NLI example)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,
    id2label={0: "entailment", 1: "neutral", 2: "contradiction"},
    label2id={"entailment": 0, "neutral": 1, "contradiction": 2},
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a sentence pair
def tokenize_nli(examples):
    """Tokenize premise-hypothesis pairs."""
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )

# The tokenizer automatically adds [SEP] between the two inputs:
# [CLS] premise tokens [SEP] hypothesis tokens [SEP]
sample = tokenizer(
    "A man is playing guitar on stage.",
    "A musician is performing live.",
    return_tensors="pt",
)
print(f"Input: {tokenizer.decode(sample['input_ids'][0])}")
print(f"Token type IDs: {sample['token_type_ids'][0][:20].tolist()}")
# token_type_ids: 0 for premise tokens, 1 for hypothesis tokens
Code Fragment 14.6.3: Sequence pair classification (NLI example)

6. Handling Class Imbalance

Real-world classification datasets are almost always imbalanced. Fraud detection might have 0.1% positive examples; medical diagnosis datasets often have rare conditions representing less than 1% of cases. Without mitigation, the model will learn to predict the majority class and ignore rare but important classes.

Comparison of class imbalance strategies:

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Weighted loss | Assign higher loss weight to minority classes | Moderate imbalance (5:1 to 20:1) |
| Oversampling | Duplicate minority class examples | Small datasets where more data helps |
| Undersampling | Remove majority class examples | Very large datasets with extreme imbalance |
| Focal loss | Down-weight easy examples, focus on hard ones | Extreme imbalance (100:1+) |
| Synthetic data | Generate additional minority examples with LLMs | When real minority data is scarce |
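Oversampling from the table can be sketched without any ML libraries: give each example a sampling weight inversely proportional to its class frequency, then draw with replacement, which is essentially what PyTorch's WeightedRandomSampler does (toy data; the variable names are illustrative):

```python
import random
from collections import Counter

random.seed(0)

# 9:1 imbalanced toy dataset of (example, label) pairs
dataset = [("ex", 0)] * 900 + [("ex", 1)] * 100
counts = Counter(label for _, label in dataset)

# Inverse-frequency sampling weight per example
weights = [1.0 / counts[label] for _, label in dataset]

# Draw one "epoch" of samples with replacement
epoch = random.choices(dataset, weights=weights, k=len(dataset))
resampled = Counter(label for _, label in epoch)
print(resampled)  # roughly balanced: about 500 of each class in expectation
```

The minority class now appears in roughly half the batches, at the cost of repeating its examples many times per epoch.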

Code Fragment 14.6.4 demonstrates class-weighted loss with a custom Trainer and several weight-computation strategies.

# Implement class-weighted loss for imbalanced dataset fine-tuning
# Minority classes receive higher loss weights to counteract skew
import torch
import torch.nn as nn
import numpy as np
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Custom Trainer with class-weighted loss for imbalanced data."""

    def __init__(self, class_weights=None, **kwargs):
        super().__init__(**kwargs)
        if class_weights is not None:
            self.class_weights = torch.tensor(class_weights, dtype=torch.float32)
        else:
            self.class_weights = None

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        if self.class_weights is not None:
            weight = self.class_weights.to(logits.device)
            loss_fn = nn.CrossEntropyLoss(weight=weight)
        else:
            loss_fn = nn.CrossEntropyLoss()

        loss = loss_fn(
            logits.view(-1, self.model.config.num_labels), labels.view(-1)
        )
        return (loss, outputs) if return_outputs else loss

# Calculate class weights from the data distribution
def compute_class_weights(labels: list, strategy: str = "inverse") -> list:
    """Compute class weights for imbalanced datasets."""
    unique, counts = np.unique(labels, return_counts=True)
    total = len(labels)

    if strategy == "inverse":
        # Weight inversely proportional to frequency
        weights = total / (len(unique) * counts)
    elif strategy == "sqrt_inverse":
        # Softer version: square root of inverse frequency
        weights = np.sqrt(total / (len(unique) * counts))
    elif strategy == "effective":
        # Effective number of samples (Class-Balanced Loss)
        beta = 0.9999
        effective_num = 1.0 - np.power(beta, counts)
        weights = (1.0 - beta) / effective_num
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    # Normalize so the mean weight is 1
    weights = weights / weights.mean()
    return weights.tolist()

# Example: imbalanced dataset
labels = [0] * 9000 + [1] * 800 + [2] * 200  # 90% / 8% / 2% distribution
weights = compute_class_weights(labels, strategy="sqrt_inverse")
print(f"Class weights: {[f'{w:.2f}' for w in weights]}")
print(f"Class 0 (90%): {weights[0]:.2f}x")
print(f"Class 2 (2%): {weights[2]:.2f}x")
Class weights: ['0.27', '0.91', '1.82']
Class 0 (90%): 0.27x
Class 2 (2%): 1.82x
Code Fragment 14.6.4: Implement class-weighted loss for imbalanced dataset fine-tuning
Key Insight

Use F1 macro (not accuracy) for imbalanced datasets. Accuracy is misleading when classes are imbalanced: a model that always predicts the majority class achieves 90% accuracy on a 90/10 split. F1 macro averages the F1 score across all classes equally, giving equal weight to minority classes. Always track per-class precision and recall to understand where the model is failing.
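The gap is easy to demonstrate by hand: on a 90/10 split, a majority-class predictor scores 90% accuracy but under 50% macro F1. A pure-Python sketch (`f1_per_class` is an illustrative helper computing the same per-class F1 as sklearn's f1_score):

```python
def f1_per_class(y_true, y_pred, cls):
    """Precision/recall/F1 for one class, treating cls as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 90 + [1] * 10          # 90/10 class split
y_pred = [0] * 100                    # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (f1_per_class(y_true, y_pred, 0) + f1_per_class(y_true, y_pred, 1)) / 2

print(f"accuracy = {accuracy:.2f}")    # 0.90 -- looks fine
print(f"macro F1 = {macro_f1:.2f}")    # 0.47 -- exposes the failure
```

The minority class contributes an F1 of exactly zero, and macro averaging makes that failure visible in the headline metric.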

Fun Fact

Adding a classification head to a pretrained transformer is like putting a sorting hat on a very well-read student. The model already understands the text; the head just needs to learn which bucket each understanding belongs in.

Self-Check
Q1: What is the key difference between the loss function used for single-label and multi-label classification?
Answer
Single-label classification uses cross-entropy loss with softmax, which normalizes logits into a probability distribution that sums to 1 (exactly one class must be selected). Multi-label classification uses binary cross-entropy loss with sigmoid, which independently maps each class logit to a probability between 0 and 1 (any combination of classes can be active). Using softmax for multi-label tasks forces the model to trade off between classes and prevents it from assigning high probability to multiple classes simultaneously.
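The contrast is easy to verify numerically: softmax forces classes to compete for a fixed probability budget, while per-class sigmoids score each class independently (a pure-Python sketch with made-up logits):

```python
import math

def softmax(logits):
    """Normalize logits into a distribution that sums to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Map one logit to an independent probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

logits = [3.0, 2.8, -4.0]   # the first two classes are both strongly indicated

soft = softmax(logits)
sig = [sigmoid(z) for z in logits]

print([round(p, 3) for p in soft])  # probabilities forced to compete
print([round(p, 3) for p in sig])   # each class judged independently
# Softmax splits the mass between the first two classes (~0.55 / ~0.45),
# while sigmoid assigns both of them > 0.9 -- which is exactly what a
# multi-label task needs.
```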
Q2: In NER with BIO tagging, why do we need to align labels with subword tokens?
Answer
Transformers use subword tokenization, which can split a single word into multiple tokens. For example, "Washington" might become ["Wash", "##ington"]. NER labels are defined at the word level, so we need a strategy to assign labels to each subword. The standard approach assigns the original B- label to the first subword and either the corresponding I- label or -100 (ignore in loss) to continuation subwords. Without this alignment, the model would receive incorrect training signals.
Q3: A fraud detection dataset has 99.9% legitimate transactions and 0.1% fraudulent ones. Which class imbalance strategy would you recommend?
Answer
For such extreme imbalance (1000:1), use a combination of strategies: (1) Focal loss to down-weight the overwhelming easy majority examples and focus learning on the hard minority examples. (2) Oversampling of the fraud class, potentially combined with synthetic data generation using an LLM to create additional realistic fraud examples. (3) Always evaluate using precision, recall, and F1 for the fraud class specifically, not overall accuracy. Additionally, consider whether the task can be reformulated as anomaly detection rather than classification.
Q4: How does AutoModelForSequenceClassification handle sentence-pair tasks like NLI?
Answer
The tokenizer handles sentence pairs by concatenating them with a [SEP] token: [CLS] premise [SEP] hypothesis [SEP]. It also generates token_type_ids that mark which tokens belong to the first sentence (0) and which belong to the second (1). The model's attention mechanism can then attend across both sentences, and the [CLS] token representation captures the relationship between them. The same AutoModelForSequenceClassification class works for both single-text and pair tasks.
Q5: Why should you use F1 macro instead of accuracy when evaluating on imbalanced datasets?
Answer
Accuracy is misleading on imbalanced datasets because a trivial model that always predicts the majority class achieves high accuracy. For example, on a 95%/5% split, always predicting the majority class yields 95% accuracy while completely failing on the minority class. F1 macro computes the F1 score for each class independently and then averages them, giving equal weight to all classes regardless of their frequency. This ensures that poor performance on minority classes is reflected in the overall metric.

Why this matters: Classification fine-tuning is one of the most cost-effective uses of LLMs in production. A fine-tuned classifier model (even a small one like DeBERTa with 300M parameters) typically outperforms few-shot prompting of much larger models while being 100x cheaper to serve. This is the core tradeoff explored in the hybrid ML/LLM architectures of Chapter 11: use specialized fine-tuned models for well-defined classification tasks, and reserve large LLMs for tasks requiring open-ended generation.

Real-World Scenario: Fine-Tuning a Multi-Label Classifier for Regulatory Compliance Tagging

Who: A compliance engineering team at a financial services company that needed to automatically tag internal documents with applicable regulatory frameworks (SOX, GDPR, PCI-DSS, HIPAA, CCPA, Basel III, and 15 others).

Situation: Each document could be tagged with multiple regulations (average of 2.3 tags per document). The team had 8,000 documents labeled by compliance officers, split across 21 regulatory categories with highly imbalanced distribution (GDPR appeared in 40% of documents, Basel III in only 3%).

Problem: An initial BERT-base classifier achieved 82% micro-F1 but only 54% macro-F1, performing poorly on rare categories. The compliance team required at least 75% macro-F1 because missing a regulatory tag could result in audit failures.

Dilemma: They could collect more training data for rare categories (expensive, slow), use data augmentation (limited effectiveness for specialized regulatory text), or address the imbalance through loss function modifications and training strategy changes.

Decision: They switched from standard binary cross-entropy to focal loss (reducing the contribution of easy, well-represented categories), added class-weighted sampling, and expanded the training set for the 5 rarest categories by 200 examples each using GPT-4 to generate synthetic regulatory documents reviewed by a compliance officer.

How: They used AutoModelForSequenceClassification with problem_type="multi_label_classification", replaced the default loss with focal loss (gamma=2.0), and implemented a custom sampler that oversampled rare categories by 3x. The synthetic data generation cost $180 in API fees plus 8 hours of compliance officer review time.

Result: Macro-F1 improved from 54% to 79%, exceeding the 75% threshold. Micro-F1 remained stable at 83%. The biggest gains came from rare categories: Basel III F1 rose from 31% to 72%, and HIPAA rose from 45% to 78%. Total compute cost for training was $15 (single GPU, 2 hours).

Lesson: For multi-label classification with imbalanced categories, focal loss combined with targeted data augmentation for rare classes is more effective than collecting proportionally more data across all categories.

Research Frontier

The integration of LLMs with traditional classification heads is yielding hybrid architectures that combine the reasoning capability of language models with the calibrated confidence of discriminative classifiers. Research on prompt-based fine-tuning for classification reformulates labeling tasks as natural language generation, allowing a single model to handle diverse classification schemas without task-specific heads.

An open problem is reducing the latency gap between fine-tuned LLM classifiers and lightweight models (like distilled BERT variants) while preserving the LLM's superior handling of ambiguous or novel inputs.

Exercises

Exercise 14.6.1: Classification head architecture Conceptual

Explain how a classification head is added to a pre-trained transformer. What layers are typically frozen versus trainable?

Answer Sketch

A classification head is a linear layer (or small MLP) that maps the transformer's hidden state for the [CLS] token (or the last token) to class logits. During fine-tuning, the classification head is always trainable. The pre-trained transformer layers can be fully frozen (only head trains), fully trainable (all layers fine-tune), or partially frozen (freeze lower layers, train upper layers + head). More trainable layers = better adaptation but more risk of overfitting.

Exercise 14.6.2: Class imbalance handling Coding

Write code that handles class imbalance in a text classification fine-tuning task using three approaches: class weights in the loss function, oversampling the minority class, and focal loss.

Answer Sketch

Class weights: weights = 1.0 / class_counts; loss_fn = CrossEntropyLoss(weight=torch.tensor(weights)). Oversampling: sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset)). Focal loss: def focal_loss(logits, targets, gamma=2): ce = F.cross_entropy(logits, targets, reduction='none'); pt = torch.exp(-ce); return ((1-pt)**gamma * ce).mean(). Focal loss down-weights easy examples and focuses learning on hard ones.

Exercise 14.6.3: Token classification (NER) Conceptual

Compare sequence-level classification (one label per input) with token-level classification (one label per token, as in NER). How does the architecture differ?

Answer Sketch

Sequence classification: extract one representation (e.g., [CLS] token), pass through a linear layer to predict one label. Token classification: extract representations for every token, pass each through the same linear layer to predict a per-token label. NER uses BIO tagging (B-PER, I-PER, O) so the number of output classes is 2*entity_types + 1. The loss is computed over all non-padding tokens. Subword tokenization requires aligning token labels with word boundaries.

Exercise 14.6.4: Multi-task fine-tuning Coding

Design a multi-task fine-tuning setup where a single model is trained for both sentiment classification and topic classification simultaneously. How do you structure the model and the training loop?

Answer Sketch

Use a shared transformer backbone with two separate classification heads: self.sentiment_head = nn.Linear(hidden, 3) and self.topic_head = nn.Linear(hidden, 10). Each batch contains examples labeled for either sentiment or topic. In the forward pass, route through the appropriate head based on the task label. Loss: loss = sentiment_loss + topic_loss (or alternate batches). Multi-task learning improves generalization by sharing representations across related tasks.

Exercise 14.6.5: Evaluation beyond accuracy Analysis

A fine-tuned sentiment classifier achieves 94% accuracy on the test set. Why might accuracy be misleading, and what additional metrics should you report?

Answer Sketch

Accuracy is misleading when classes are imbalanced. If 90% of examples are 'neutral', predicting neutral always gives 90% accuracy. Report: (1) per-class precision, recall, and F1, (2) macro-averaged F1 (treats all classes equally), (3) confusion matrix to identify systematic errors (e.g., always confusing 'negative' with 'neutral'), (4) calibration (does 80% confidence mean 80% correct?). For production, also measure latency and throughput.

What Comes Next

In the next section, Section 14.7: Adapting Models for Long Text, we explore techniques for adapting models to handle long text, extending context windows for document-level processing. Classification fine-tuning is a core technique behind the embedding model training covered in Section 19.2.

References and Further Reading
Model Architectures

Devlin, J. et al. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

The original BERT paper that established the fine-tuning paradigm for classification tasks: pre-train a bidirectional transformer, then add a classification head and fine-tune on labeled data. BERT remains the conceptual foundation for all encoder-based classification discussed in this section. Essential background reading.

Paper

He, P. et al. (2021). DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. ICLR 2021.

Introduces DeBERTa, which improves on BERT through disentangled attention and an enhanced mask decoder. DeBERTa consistently tops classification benchmarks and is the recommended base model for the classification tasks in this section. Recommended for understanding why DeBERTa outperforms BERT on most tasks.

Paper

Lin, T.-Y. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.

Introduces focal loss, which down-weights well-classified examples to focus training on hard, misclassified ones. Although originally designed for object detection, focal loss is the recommended loss function for imbalanced classification discussed in this section. Essential for teams dealing with rare categories.

Paper
Practical Guides and Libraries

Hugging Face. (2024). Text Classification with Transformers.

A step-by-step tutorial for fine-tuning transformer models on text classification tasks using AutoModelForSequenceClassification. This is the primary practical reference for the single-label and multi-label classification workflows in this section. Keep it handy while implementing the code examples.

Tutorial

Hugging Face. (2024). Token Classification with Transformers.

Covers fine-tuning for NER and other token-level classification tasks using AutoModelForTokenClassification. This tutorial complements the sequence classification guide and is the reference for the NER and POS tagging workflows discussed in this section.

Tutorial

Wolf, T. et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020.

The original Hugging Face Transformers library paper, describing the unified API design that makes switching between models trivial. Understanding the AutoModel architecture is key to the model selection flexibility discussed in this section. Recommended background for anyone new to the Transformers ecosystem.

Paper