"A single well-labeled example is like a seed. Augmentation is the rain and sunlight that grow it into a garden of training data."
Synth, Green-Thumbed AI Agent
Data augmentation transforms a small, high-quality dataset into a larger, more diverse training corpus without collecting new examples from scratch. While the earlier sections of this chapter focus on generating entirely new synthetic data, augmentation works with existing examples, applying transformations that preserve semantic meaning while varying surface form. This is especially valuable for low-resource languages, domain-specific tasks, and scenarios where labeled data is expensive to obtain.
Prerequisites
This section builds on the synthetic data generation pipelines from Section 15.2 and the quality assurance techniques from Section 15.3. Familiarity with prompt engineering is assumed. Fine-tuning data preparation is covered in the next chapter.
15.7.1 Why Augment? The Diversity Problem
The earliest text augmentation paper that hit it big, Wei and Zou's EDA (2019), was just four operations: synonym replacement, random insertion, random swap, random deletion. The paper was rejected from its first venue for being too simple, then accepted at EMNLP after the authors showed it boosted small-data classifiers by several points. It became the most-cited NLP augmentation paper of the decade, proving that in machine learning the dumbest baseline that actually works will always outlive the cleverest one that almost does.
Fine-tuning datasets often suffer from a lack of linguistic diversity: the same intent is expressed in only a few ways, the same entities appear repeatedly, and the phrasing follows narrow patterns. Models trained on such data overfit to surface features rather than learning the underlying task. Data augmentation addresses this by creating variations of existing examples that force the model to generalize.
The core principle is label-preserving transformation: each augmented example must retain the same label, intent, or correct answer as the original. A paraphrase of a positive sentiment review must still be positive. A reformulated question must still have the same answer. Violating this principle introduces label noise that degrades model quality.
LLM-based paraphrasing can silently flip labels. A sentiment like "The food was not bad" might be paraphrased as "The food was bad" by a model that drops the negation. Similarly, back-translation can lose critical qualifiers: "almost always works" might return as "always works." Always spot-check a random sample of augmented examples against their original labels, especially for tasks involving negation, hedging, or numerical precision. Automated label verification using a separate classifier can catch these errors at scale.
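The spot-check described above can be partly automated with a cheap lexical guard. The sketch below flags a pair for manual review whenever the number of negation or hedging cues changes between the original and the augmented text. The cue lists and the function names `cue_counts` and `needs_review` are illustrative assumptions, not a standard API, and the check is deliberately conservative: a valid paraphrase such as "not bad" to "decent" will also be flagged, so treat a flag as a request for review rather than an automatic rejection.

```python
import re

# Assumed, non-exhaustive cue lists; extend them for your domain.
NEGATIONS = {"not", "no", "never", "none", "nobody", "nothing", "neither", "nor", "without"}
HEDGES = {"almost", "nearly", "mostly", "usually", "sometimes", "rarely", "often", "might", "may"}


def cue_counts(text):
    """Count negation and hedge cues in a lowercased, tokenized text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    negs = sum(1 for t in tokens if t in NEGATIONS or t.endswith("n't"))
    hedges = sum(1 for t in tokens if t in HEDGES)
    return negs, hedges


def needs_review(original, augmented):
    """Flag a pair for manual review when the cue counts differ."""
    return cue_counts(original) != cue_counts(augmented)


print(needs_review("The food was not bad", "The food was bad"))      # True
print(needs_review("It almost always works", "It always works"))     # True
print(needs_review("The food was not bad", "The meal was not bad"))  # False
```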
The easiest augmentation win is back-translation: translate your English examples to French (or any high-resource language), then translate back. The round-trip produces natural paraphrases that preserve meaning while varying vocabulary and sentence structure. For many classification tasks, this simple technique alone can improve accuracy by 2 to 5 percentage points when your training set has fewer than 1,000 examples.
Data augmentation (this section) starts from existing examples and produces variations. Synthetic data generation (Sections 15.1 through 15.3) creates entirely new examples from specifications or seed prompts. In practice, most data pipelines combine both: generate a seed corpus, then augment it for diversity.
15.7.2 Classical Text Augmentation Techniques
Before LLMs made augmentation trivially accessible, NLP practitioners developed several effective techniques. These remain useful as fast, controllable baselines.
Easy Data Augmentation (EDA)
Wei and Zou (2019) proposed four simple operations that improve text classification with minimal effort:
- Synonym replacement: Replace n random non-stop-words with WordNet synonyms
- Random insertion: Insert a synonym of a random word at a random position
- Random swap: Swap the positions of two random words
- Random deletion: Remove each word with probability p
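```python
# Easy Data Augmentation (EDA): four simple text transformations.
# Synonym replacement uses WordNet to swap words with equivalents.
import random

import nltk
from nltk.corpus import stopwords, wordnet

# One-time downloads of the WordNet and stop-word corpora.
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))


def get_synonyms(word):
    """Collect WordNet synonyms of a word, excluding the word itself."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                synonyms.add(name)
    return sorted(synonyms)


def synonym_replace(sentence, n=2):
    """Replace up to n non-stop-words with WordNet synonyms."""
    words = sentence.split()
    new_words = words.copy()
    candidates = list({w for w in words if w.lower() not in STOP_WORDS})
    random.shuffle(candidates)
    replaced = 0
    for word in candidates:
        if replaced >= n:
            break
        synonyms = get_synonyms(word)
        if synonyms:
            synonym = random.choice(synonyms)
            new_words = [synonym if w == word else w for w in new_words]
            replaced += 1
    return " ".join(new_words)


# Example
original = "The customer service was excellent and very responsive"
augmented = synonym_replace(original, n=2)
# Possible output: "The client service was splendid and very responsive"
```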
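The example above covers only synonym replacement. As a minimal sketch, two of the remaining EDA operations (random swap and random deletion) need nothing beyond the standard library. The function names and the optional `rng` parameter are choices made for this illustration, not part of the original EDA code release.

```python
import random


def random_swap(sentence, n=1, rng=random):
    """Swap the positions of two randomly chosen words, n times."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def random_deletion(sentence, p=0.1, rng=random):
    """Drop each word independently with probability p."""
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty string: fall back to one random word.
    return " ".join(kept) if kept else rng.choice(words)


sentence = "The customer service was excellent and very responsive"
print(random_swap(sentence, n=1))
print(random_deletion(sentence, p=0.2))
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when debugging label noise.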
Back-Translation
Back-translation generates paraphrases by translating text into an intermediate language and then translating it back. The round-trip introduces natural lexical and syntactic variation because different languages encode meaning differently. For example, English to German and back often restructures clauses and substitutes near-synonyms.
```python
# Back-translation augmentation: English -> pivot language -> English.
# The round-trip introduces natural lexical and syntactic variation.
from functools import lru_cache

from transformers import MarianMTModel, MarianTokenizer


@lru_cache(maxsize=None)
def load_mt(src, tgt):
    """Load a Marian translation model once and reuse it across calls."""
    name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)


def translate(text, src, tgt):
    """Translate a single string from src to tgt."""
    tok, model = load_mt(src, tgt)
    encoded = tok(text, return_tensors="pt", padding=True, truncation=True)
    output_ids = model.generate(**encoded)
    return tok.decode(output_ids[0], skip_special_tokens=True)


def back_translate(text, src="en", pivot="de"):
    """Augment text via back-translation through a pivot language."""
    return translate(translate(text, src, pivot), pivot, src)


original = "What is the refund policy for damaged items?"
augmented = back_translate(original, pivot="fr")
# Possible output: "What is the reimbursement policy for defective products?"
```
Using multiple pivot languages produces diverse variations. A single source sentence translated through French, German, Russian, and Chinese yields four distinct paraphrases. The quality depends on the translation model; modern models like NLLB-200 cover 200 languages and produce higher-quality round-trips than older models.
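A minimal sketch of the multi-pivot idea follows. It assumes the caller supplies a `translate(text, src, tgt)` function wrapping whatever MT model or API is available; that signature and the name `multi_pivot_paraphrases` are assumptions of this example. Round-trips that return the source unchanged, or that duplicate an earlier pivot's result, are dropped, so the number of paraphrases can be smaller than the number of pivots.

```python
def multi_pivot_paraphrases(text, pivots, translate):
    """Return distinct round-trip paraphrases of text, at most one per pivot.

    translate(text, src, tgt) is a caller-supplied function (an assumption of
    this sketch) that wraps the available machine translation system.
    """
    seen = {text.strip().lower()}
    results = []
    for pivot in pivots:
        pivot_text = translate(text, "en", pivot)
        back = translate(pivot_text, pivot, "en").strip()
        key = back.lower()
        if key not in seen:  # skip copies of the source or of earlier results
            seen.add(key)
            results.append((pivot, back))
    return results
```

Keeping the pivot code alongside each paraphrase makes it easy to audit which language pairs tend to produce drift for a given dataset.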
Contextual Word Replacement with Masked LMs
Masked language models like BERT can generate contextually appropriate substitutions. Mask a word, let the model predict alternatives, and replace with a high-probability candidate. Unlike WordNet synonym replacement, this produces substitutions that are contextually coherent.
```python
# Contextual augmentation: use BERT's masked LM to predict
# context-appropriate word substitutions (better than random synonyms).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")


def contextual_augment(text, word_to_replace):
    """Replace a word using BERT's contextual predictions."""
    if word_to_replace not in text:
        return text  # nothing to mask
    masked = text.replace(word_to_replace, "[MASK]", 1)
    predictions = fill_mask(masked)
    # Pick the top prediction that differs from the original.
    # Predictions fit the context but are not guaranteed to be synonyms,
    # so verify that the label still holds.
    for pred in predictions:
        candidate = pred["token_str"].strip()
        if candidate != word_to_replace.lower():
            return text.replace(word_to_replace, candidate, 1)
    return text


original = "The application crashed when processing large files"
augmented = contextual_augment(original, "crashed")
# Possible output: "The application failed when processing large files"
```
15.7.3 LLM-Powered Augmentation
Large language models have transformed data augmentation from a word-level operation into a semantic-level one. Instead of swapping individual words, LLMs can rephrase entire sentences, shift register and formality, inject domain-specific terminology, and generate multi-turn variations of single-turn examples.
Wang et al. (Self-Instruct, 2023, arXiv:2212.10560) started with 175 hand-written seed tasks, then prompted GPT-3 to generate new instructions until they had roughly 52,000. Fine-tuning vanilla GPT-3 (175B) on this synthetic corpus improved it by 33 percentage points (absolute) on Super-NaturalInstructions, putting it on par with InstructGPT-001, and left it only about 5 points behind InstructGPT-001 in human evaluation on 252 expert-written instructions for novel tasks. The startling part is the ratio: roughly 297x amplification from 175 seeds to a model competitive with one trained on private user data and human annotations. Word-level augmentation (synonym swaps, masked-word replacement) cannot do this because the operation is local, and even back-translation only rephrases the sentence it is given. LLM-level augmentation can because each generated example draws on the distributional knowledge of the teacher model. This is why LLM-generated data became the default recipe for Alpaca, Vicuna, WizardLM, and many later instruction-tuned models.
Paraphrase Generation
The simplest LLM augmentation prompts the model to paraphrase an example while preserving its meaning and label:
```python
# LLM-powered paraphrasing: generate multiple style-varied rewordings
# via GPT-4o-mini while preserving meaning and intent.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_paraphrase(text, n_variants=3, style_hints=None):
    """Generate paraphrased variants using an LLM."""
    style_instruction = ""
    if style_hints:
        style_instruction = f"\nVary the style: {', '.join(style_hints)}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": (
                f"Generate exactly {n_variants} paraphrases of the input text. "
                "Each paraphrase must preserve the exact same meaning and intent. "
                "Vary vocabulary, sentence structure, and length."
                f"{style_instruction}"
                "\nReturn one paraphrase per line, numbered."
            )
        }, {
            "role": "user",
            "content": text
        }],
        temperature=0.9
    )
    lines = response.choices[0].message.content.strip().split("\n")
    # Strip list numbering such as "1." or "2)" without eating digits in the text.
    return [re.sub(r"^\s*\d+[.)]\s*", "", line).strip() for line in lines if line.strip()]


original = "How do I reset my password?"
variants = llm_paraphrase(
    original,
    n_variants=5,
    style_hints=["formal", "casual", "verbose", "terse", "non-native speaker"]
)
# Example output (actual text varies between runs):
# "I need assistance with resetting my account password."
# "hey how do i change my password again?"
# "Could you please walk me through the complete process of resetting
#  the password associated with my user account?"
# "Password reset steps?"
# "I am wanting to know how I can make new password for my account."
```
Controlled Attribute Variation
Beyond simple paraphrasing, LLMs can modify specific attributes of training examples while preserving others. This is especially powerful for intent classification and slot-filling tasks:
- Entity swapping: Replace entity values (names, dates, locations, product names) while keeping the sentence structure and intent identical
- Register shifting: Convert between formal, casual, technical, and simplified registers
- Length variation: Expand terse queries into detailed requests, or compress verbose instructions into short commands
- Perspective shifting: Rewrite from first person to third person, or from question form to imperative
- Error injection: Introduce realistic typos, grammatical errors, or colloquialisms to make the training data robust to noisy real-world input
```python
# Entity-aware augmentation: swap named entities (people, places)
# with realistic alternatives while preserving structure and intent.
# Reuses `client` from the paraphrasing example above.
def augment_with_entity_swap(example, entity_type="person_name"):
    """Swap entities while preserving intent and structure."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": (
                f"Replace all {entity_type} entities in the text with "
                "different, realistic alternatives. Keep everything else "
                "exactly the same, including intent, structure, and tone."
            )
        }, {
            "role": "user",
            "content": example
        }],
        temperature=0.7
    )
    return response.choices[0].message.content.strip()


# Intent: book_flight
original = "Book a flight from New York to London for Sarah Johnson on March 15th"
# "named" makes the prompt read "Replace all named entities ..."
swapped = augment_with_entity_swap(original, "named")
# Possible output: "Book a flight from Chicago to Tokyo for Michael Chen on June 22nd"
```
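Error injection from the list above does not require an LLM. The following is a minimal sketch of character-level noise (adjacent swaps, dropped letters, doubled letters); the function name `inject_typos`, the three operations, and the default rate are illustrative choices rather than an established recipe. Because the edits are purely surface-level, the original label carries over unchanged for tasks like intent classification.

```python
import random


def inject_typos(text, rate=0.05, rng=random):
    """Introduce keyboard-style noise: swap, drop, or double letters."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["swap", "drop", "double"])
            if op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                out.extend([chars[i + 1], c])  # transpose with the next letter
                i += 2
                continue
            if op == "drop":
                i += 1
                continue
            out.extend([c, c])  # "double", or a swap with no valid neighbor
        else:
            out.append(c)
        i += 1
    return "".join(out)


print(inject_typos("How do I reset my password?", rate=0.15))
```

Whitespace and punctuation are never touched, so word boundaries survive and any token-level annotations need only character-offset adjustments.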
Multi-Turn Augmentation
For conversational AI training data, augmentation must operate at the dialogue level rather than the sentence level. An LLM can take a single-turn question and expand it into a multi-turn conversation where the user provides context incrementally, or take a clean dialogue and inject realistic complications (corrections, clarifications, topic shifts).
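One way to structure the single-turn to multi-turn expansion is to separate prompt construction from output validation, so that malformed dialogues are rejected before they reach the training set. The sketch below assumes the LLM is asked to emit plain `User:` and `Assistant:` lines; the prompt wording, the function names, and that output format are all assumptions of this example, and the actual LLM call is left out.

```python
def build_multiturn_messages(question, n_turns=3):
    """Build chat messages asking an LLM to expand one question into a dialogue."""
    system = (
        f"Rewrite the user's single question as a conversation with {n_turns} "
        "user/assistant exchanges in which the user reveals the same request "
        "incrementally. Do not change the underlying intent. Format every "
        "line as 'User: ...' or 'Assistant: ...'."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": question}]


def parse_dialogue(raw):
    """Parse 'User:' / 'Assistant:' lines into a list of chat turns."""
    turns = []
    for line in raw.splitlines():
        speaker, sep, content = line.partition(":")
        role = speaker.strip().lower()
        if sep and role in {"user", "assistant"} and content.strip():
            turns.append({"role": role, "content": content.strip()})
    return turns


def is_valid_dialogue(turns, n_turns):
    """Check that roles alternate, starting with the user, for n_turns exchanges."""
    expected = ["user", "assistant"] * n_turns
    return [t["role"] for t in turns] == expected
```

A structural check like `is_valid_dialogue` only verifies the shape of the conversation; coherence and intent preservation still need a semantic filter or a judge model.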
15.7.4 Domain-Specific Augmentation Strategies
The general augmentation techniques covered so far apply to any text dataset. Several domains have additional constraints (limited training data for the target language, mathematically correct output, legally privileged text) that justify specialized recipes. This section looks at low-resource languages first, then summarizes the augmentation patterns and pitfalls for common task types.
Low-Resource Language Augmentation
For languages with limited training data, augmentation is often the only practical path to acceptable model quality. Key strategies include:
- Cross-lingual transfer: Translate high-resource language data (English) into the target language using MT models, then filter with native speaker quality checks
- Code-switching injection: For multilingual settings, create examples that mix languages naturally, reflecting real user behavior
- Script variation: For languages with multiple scripts (Serbian Cyrillic/Latin, Chinese simplified/traditional), generate variants in each script
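Script variation is the most mechanical of these strategies. As a minimal sketch, Serbian Cyrillic maps to Serbian Latin letter by letter, with three letters expanding to digraphs. The function name is an illustrative choice, and the sketch handles only the Cyrillic-to-Latin direction: the reverse is ambiguous (for example, "nj" can be one letter or two) and needs a dictionary-aware tool. All-caps words are also not special-cased, so an uppercase digraph letter becomes title case ("Lj") rather than "LJ".

```python
# Serbian Cyrillic -> Latin, all 30 letters of the alphabet.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def serbian_cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic text to Latin script, keeping other characters."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                # not a Serbian Cyrillic letter
        elif ch.isupper():
            out.append(lat.capitalize())  # "Љ" -> "Lj"
        else:
            out.append(lat)
    return "".join(out)


print(serbian_cyrillic_to_latin("Добар дан, Београд!"))  # Dobar dan, Beograd!
```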
Task-Specific Augmentation Patterns
| Task | Augmentation Strategy | Key Consideration |
|---|---|---|
| Text Classification | Paraphrasing, back-translation, synonym replacement | Must preserve class label exactly |
| NER / Slot Filling | Entity swapping, context variation with fixed entities | Span annotations must be updated to match new text |
| Question Answering | Question rephrasing, answer-preserving context paraphrasing | Answer span positions shift; re-extract after augmentation |
| Instruction Following | Register variation, Evol-Instruct complexity scaling | Output must be regenerated for each augmented input |
| Dialogue / Chat | Multi-turn expansion, user persona variation, error injection | Conversational coherence must be maintained across turns |
| Summarization | Source document paraphrasing (not summary paraphrasing) | Summary must be re-generated for augmented sources |
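The NER row's key consideration is easy to get wrong in practice: when a replacement entity has a different length, every annotation after it must shift. The sketch below shows the bookkeeping for character-offset spans. It assumes spans are `(start, end, label)` tuples with an exclusive end and that spans do not overlap; the function name `swap_entity` is hypothetical.

```python
def swap_entity(text, spans, target, replacement):
    """Replace one annotated entity and shift all later span offsets.

    spans: list of (start, end, label) tuples, end exclusive, non-overlapping.
    target: the (start, end, label) span whose text is replaced.
    """
    start, end, _ = target
    delta = len(replacement) - (end - start)
    new_text = text[:start] + replacement + text[end:]
    new_spans = []
    for s, e, lab in spans:
        if (s, e, lab) == target:
            new_spans.append((s, s + len(replacement), lab))
        elif s >= end:
            new_spans.append((s + delta, e + delta, lab))  # after the edit: shift
        else:
            new_spans.append((s, e, lab))                  # before the edit: unchanged
    return new_text, new_spans


text = "Book a flight from New York to London"
origin = (text.index("New York"), text.index("New York") + len("New York"), "FROM")
dest = (text.index("London"), text.index("London") + len("London"), "TO")

new_text, new_spans = swap_entity(text, [origin, dest], origin, "Chicago")
print(new_text)                                  # Book a flight from Chicago to London
print([new_text[s:e] for s, e, _ in new_spans])  # ['Chicago', 'London']
```

The same idea applies to extractive question answering: after paraphrasing the context, re-locate the answer string and rewrite its offsets rather than reusing the old ones.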
15.7.5 Quality Control for Augmented Data
Augmented data carries risks that must be managed through quality filtering (building on the techniques from Section 15.3):
- Semantic drift: The augmented text subtly changes meaning, introducing label noise. Mitigate by computing embedding similarity between the original and augmented text and discarding pairs that fall below a threshold (e.g., cosine similarity < 0.85; the right cutoff depends on the embedding model and the task).
- Degenerate outputs: LLMs sometimes produce near-copies instead of genuine paraphrases. Detect using n-gram overlap metrics (BLEU or chrF); filter examples with BLEU > 0.9 against the original.
- Distribution shift: Over-augmentation can make the training distribution unrepresentative of real-world input. Monitor augmentation ratios: a common guideline is no more than 3:1 augmented-to-original for classification tasks.
- Bias amplification: If the original data contains biases, augmentation can amplify them. Entity swapping with diverse demographic names helps, but does not fully solve the problem.
```python
# Quality filter: keep augmented examples within a semantic similarity
# band (0.75-0.98) to ensure diversity without label drift.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def filter_augmented(original, augmented_list, min_sim=0.75, max_sim=0.98):
    """Filter augmented examples by semantic similarity to the original.

    Too similar (>max_sim): likely a near-copy, adds no diversity.
    Too dissimilar (<min_sim): likely semantic drift, may have wrong label.
    """
    orig_emb = model.encode(original, convert_to_tensor=True)
    aug_embs = model.encode(augmented_list, convert_to_tensor=True)
    scores = util.cos_sim(orig_emb, aug_embs)[0]
    filtered = []
    for text, score in zip(augmented_list, scores):
        if min_sim <= score.item() <= max_sim:
            filtered.append((text, score.item()))
    return sorted(filtered, key=lambda x: x[1], reverse=True)


# Usage
original = "How do I reset my password?"
augmented = [
    "What are the steps to change my password?",         # good paraphrase
    "How do I reset my password?",                       # near-copy
    "Tell me about your company's history",              # semantic drift
    "I need help resetting the password on my account",  # good paraphrase
]
filtered = filter_augmented(original, augmented)
# Intended result: only the two good paraphrases survive. Exact scores depend
# on the embedding model, so tune min_sim and max_sim on your own data.
```
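Embedding similarity is one signal; the degenerate-output check described in the list above relies on surface overlap instead. The sketch below uses Jaccard similarity over character trigrams as a dependency-free stand-in for BLEU or chrF. It is a different metric from either, so the 0.9 threshold here is an assumption that should be recalibrated rather than copied from the BLEU guideline.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a case- and space-normalized string."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_near_copy(original, candidate, threshold=0.9):
    """Flag candidates whose surface form barely differs from the original."""
    return ngram_jaccard(original, candidate) >= threshold


original = "How do I reset my password?"
print(is_near_copy(original, "how do I  reset my password?"))               # True
print(is_near_copy(original, "What are the steps to change my password?"))  # False
```

Surface overlap is far cheaper than an embedding pass, so it works well as a first-stage filter before the similarity window.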
15.7.6 Augmentation at Scale: Pipeline Design
Production augmentation pipelines typically follow a three-stage pattern:
- Seed selection: Choose which examples to augment. Prioritize underrepresented classes, edge cases, and high-value examples identified through active learning (Section 15.4).
- Multi-strategy augmentation: Apply several augmentation methods to each seed. For example: 2 back-translations (different pivot languages), 3 LLM paraphrases (different style hints), and 1 entity swap. This produces 6 candidates per seed.
- Quality filtering and deduplication: Filter by semantic similarity, remove near-duplicates across the augmented corpus using MinHash (Section 15.3), and balance the final class distribution.
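The three stages can be sketched as a single function with pluggable parts. This is a simplified illustration under stated assumptions: `augmenters` and `keep` are caller-supplied callables, every original is used as a seed (visiting the rarest classes first), and deduplication is exact-match on lowercased text as a stand-in for MinHash. The name `run_pipeline` and its signature are hypothetical.

```python
import random
from collections import defaultdict


def run_pipeline(examples, augmenters, keep, target_per_class, rng=random):
    """Augment, filter, deduplicate, and balance a labeled dataset.

    examples: list of (text, label) pairs.
    augmenters: callables mapping a text to a list of candidate variants.
    keep: callable (original, candidate) -> bool, e.g. a similarity filter.
    """
    originals = defaultdict(list)
    for text, label in examples:
        originals[label].append(text)

    seen = {text.lower() for text, _ in examples}
    augmented = defaultdict(list)
    # Stage 1: visit the rarest classes first so they win ties when two
    # classes would produce the same candidate.
    for label in sorted(originals, key=lambda lab: len(originals[lab])):
        for seed in originals[label]:
            # Stage 2: apply every strategy to every seed.
            for augment in augmenters:
                for cand in augment(seed):
                    key = cand.lower()
                    # Stage 3a: quality filter plus exact-match deduplication.
                    if key not in seen and keep(seed, cand):
                        seen.add(key)
                        augmented[label].append(cand)

    # Stage 3b: keep all originals, then top up with sampled augmentations.
    dataset = []
    for label, texts in originals.items():
        need = max(0, target_per_class - len(texts))
        extra = rng.sample(augmented[label], min(need, len(augmented[label])))
        dataset += [(t, label) for t in texts + extra]
    return dataset
```

Keeping every original example and sampling only from the augmented pool ensures that real data is never displaced by synthetic variants during balancing.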
A customer service chatbot has 50 examples per intent across 30 intents (1,500 total). The target is 200 examples per intent. The pipeline generates 3 LLM paraphrases and 2 back-translations per example (5 candidates each, 7,500 total), filters by similarity window, deduplicates, and samples to 200 per class. The result is a balanced 6,000-example dataset, four times the size of the original at a 3:1 augmented-to-original ratio. In a setup like this, gains in intent classification F1 (the harmonic mean of precision and recall, the standard single-number quality metric for classifiers) on the order of 8 to 12 points over training on the original 1,500 examples alone are plausible, but the actual gain should always be measured on held-out real data.
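The arithmetic behind this example is worth checking explicitly, because it also shows how much slack the filtering stage has. The short calculation below uses only the numbers stated above; the variable names are illustrative.

```python
intents, per_intent, target = 30, 50, 200
candidates_per_example = 3 + 2  # LLM paraphrases + back-translations

originals = intents * per_intent                 # 1,500 real examples
candidates = originals * candidates_per_example  # 7,500 augmented candidates
final_size = intents * target                    # 6,000 examples after balancing
augmented_kept = final_size - originals          # 4,500 augmented examples kept

ratio = augmented_kept / originals                 # 3.0, i.e. 3:1 augmented-to-original
max_reject_rate = 1 - augmented_kept / candidates  # 0.4

print(originals, candidates, final_size, augmented_kept)
print(f"ratio {ratio:.0f}:1, filters may reject up to {max_reject_rate:.0%} of candidates")
```

In other words, the similarity filter and deduplication together can discard up to 40 percent of the candidates before the pipeline falls short of 200 examples per intent.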
- Label-preserving transformations are the foundation of all augmentation: every augmented example must retain the same label, intent, or correct answer as the original.
- Augmentation multiplies your data budget. Even simple techniques like synonym replacement and back-translation can significantly improve model performance on small datasets.
- Classical techniques (EDA, back-translation, BERT-based replacement) remain useful as fast, cheap baselines that complement LLM-powered methods.
- LLM-powered augmentation excels at semantic-level variation: register shifts, multi-turn expansion, and entity-aware paraphrasing, producing higher-quality outputs than rule-based methods.
- Quality filtering is essential: semantic similarity windows (0.75 to 0.98) reject both near-copies and drifted examples; combined with deduplication, this prevents label noise and distribution shift.
- Over-augmentation causes distribution shift. Keep augmentation ratios reasonable (typically 3:1 or less) and always evaluate on held-out real data to catch quality degradation.
- Production pipelines combine multiple augmentation strategies per seed example, then filter and balance the final corpus so that the augmented-to-original ratio stays around 3:1.
- Combine classical and LLM methods. Classical augmentation is fast and cheap for simple variation; LLM augmentation excels at semantic-level diversity. Using both together provides the best cost-quality tradeoff.
Exercises
You have 5,000 high-quality labeled examples and your model overfits at 2 epochs. (a) State two distinct reasons data augmentation can help here, beyond simply increasing dataset size. (b) Why is naive augmentation (random word deletion) often a wash for LLMs? (c) When is the right call to not augment and instead invest in collecting more real data?
Answer Sketch
(a) (i) Augmentation flattens the loss surface around training examples, improving robustness to slight input perturbations the same way image augmentations work for vision models; (ii) it covers surface-form variations (paraphrases, register shifts) that the small original set misses, addressing a generalization gap not visible at training time. (b) Random word deletion produces ungrammatical text the model already handles well; the augmentations live in a region of input space where the LLM is robust, so the model gains little signal. Useful augmentations target the failure modes of the current model. (c) Don't augment when (1) your eval shows the failure mode is conceptual (model lacks domain knowledge), not surface-form; (2) augmentation introduces label noise faster than diversity; (3) collecting 1000 more real examples is cheaper than maintaining the augmentation pipeline.
You augment 1,000 seed examples 10x using LLM paraphrasing at temperature 0.7. Predict: (a) what happens to embedding-based diversity (mean pairwise distance) of the augmented set; (b) what happens to downstream task accuracy after fine-tuning; (c) what changes if you raise temperature to 1.2?
Answer Sketch
(a) Diversity rises modestly: paraphrases cluster near the original examples in embedding space, so mean pairwise distance grows perhaps 10-30%. The 10x set is much less diverse than 10x the seed examples drawn from the original distribution. (b) Accuracy typically improves by 1-5 points; the gain comes mostly from form-invariance, not new conceptual coverage. Returns diminish quickly past 3-5x augmentation. (c) Higher temperature increases diversity but also raises label-flip risk: the paraphraser may change the meaning enough to invalidate the original label. Track per-example label fidelity (verify with a second LLM judge) and discard high-temperature samples that fail the check; you trade quantity for quality.
Sketch a 12-line augmentation function that takes a (prompt, label) example and returns 5 augmented variants validated by a judge model. Each variant must (a) preserve the label, (b) differ from the original by at least 30% of tokens, (c) pass a back-translation sanity check.
Answer Sketch
```python
# Sketch only: `llm`, `token_overlap`, and `semantic_sim` are assumed helpers.
def augment(prompt, label, n=5):
    variants = llm.generate(f"Paraphrase, preserve meaning: {prompt}", n=n*3, temp=0.8)
    kept = []
    for v in variants:
        if token_overlap(v, prompt) > 0.7: continue  # too similar
        back = llm.generate(f"Translate to French then back to English: {v}", temp=0)
        if semantic_sim(back, prompt) < 0.85: continue  # meaning drifted
        judge = llm.classify(f"Does this still match label '{label}'? {v}")
        if judge == "yes": kept.append((v, label))
        if len(kept) == n: break
    return kept
```
The 3-stage filter (overlap, back-translation, judge) catches different failure types: the overlap check catches lazy paraphrases, back-translation catches semantic drift, and the judge catches label flips. Cost is roughly 5-8 LLM calls per accepted example once rejected variants are counted, which is the unavoidable price of high-quality augmentation.
You augment a customer-service dataset using GPT-4 paraphrases. Six months later, your evaluation shows the model performs poorly on dialect English (AAVE, Indian English) even though the original data covered those registers. Trace the failure mechanism and propose two fixes.
Answer Sketch
Mechanism: GPT-4 paraphrases default to standard American English. Each augmentation pass shifts the dataset's register distribution toward this default, washing out dialect-specific phrasings even when seed examples contained them. After heavy augmentation, dialectal examples can shrink to a small fraction of the training set, and the fine-tuned model loses register sensitivity. Fixes: (1) use a dialect-aware paraphraser or explicitly prompt the LLM to "preserve register and dialect markers"; (2) tag every example with its dialect/register label and weight the loss by inverse frequency so under-represented registers carry more gradient signal. The general principle: any augmentation pipeline imposes its own distribution on the data; you must measure and counteract it.
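The inverse-frequency weighting in fix (2) can be computed in a few lines. This is a minimal sketch: the function name is hypothetical, and the normalization (average weight of 1 across all examples, so the overall loss scale is unchanged) is one reasonable convention among several.

```python
from collections import Counter


def inverse_frequency_weights(tags):
    """Map each tag to a loss weight proportional to 1 / frequency.

    Weights are normalized so that the average weight over all examples is 1.
    """
    counts = Counter(tags)
    n, k = len(tags), len(counts)
    return {tag: n / (k * c) for tag, c in counts.items()}


# 90 standard-register examples and 10 dialect examples after augmentation
tags = ["standard"] * 90 + ["dialect"] * 10
weights = inverse_frequency_weights(tags)
print(weights["dialect"])             # 5.0
print(round(weights["standard"], 3))  # 0.556
```

Logging the tag counts before and after each augmentation pass gives an early warning that a register is being washed out, well before it shows up in evaluation.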
Data augmentation for LLMs is advancing in several directions. Self-augmentation loops use a model's own outputs as augmentation candidates, filtered by a separate verifier model. Curriculum-aware augmentation generates harder examples as training progresses, matching augmentation difficulty to model capability. Multi-modal augmentation pairs text with generated images, audio, or structured data to build richer training signals. Research also explores augmentation-aware training objectives that weight augmented examples differently from original data, preventing the model from over-fitting to synthetic patterns.
What Comes Next
With techniques for both generating synthetic data from scratch (Sections 15.1 through 15.6) and augmenting existing datasets (this section), you have a complete toolkit for building high-quality training corpora. The next step is putting this data to work: Chapter 16: Fine-Tuning Fundamentals shows how to use these datasets to adapt pretrained models to your specific tasks, covering data formatting, training loops, and evaluation strategies.