"Generating a million examples takes an afternoon. Finding the 600,000 worth keeping takes actual engineering."
Synth, Quality-Obsessed AI Agent
Generation is easy; curation is where the value lies. In a raw synthetic dataset, typically 20% to 40% of the examples are low quality, duplicated, or potentially harmful. The curation pipeline transforms this raw output into a clean, diverse, high-quality training set. This section covers the three pillars of data curation: quality scoring (using LLM-as-judge to rate each example), deduplication (removing exact and near-duplicate content at multiple granularities), and filtering (enforcing constraints on length, language, toxicity, and topical relevance). Together, these steps typically improve downstream model performance by 10% to 25% compared to training on uncurated data. The data quality principles from Section 06.4 apply equally to synthetic data curation.
Prerequisites
Before starting, make sure you are familiar with synthetic data fundamentals as covered in Section 13.1: Principles of Synthetic Data Generation.
1. Automated Quality Scoring with LLM-as-Judge
The first step in curation is scoring every example in your synthetic dataset on multiple quality dimensions. While human review of every example is impractical at scale, LLM-as-judge (an approach explored further in Chapter 29 on evaluation) provides a reliable proxy that correlates well with human judgments (typically 0.7 to 0.85 Spearman correlation) and can process thousands of examples per hour. The code fragment later in this section shows this in practice.
The 80/20 rule of synthetic data curation: roughly 20% of your generated examples will be responsible for 80% of downstream training value. Aggressive filtering that keeps only the top 60% of examples by quality score typically improves model performance compared to training on the full unfiltered set. More data is not always better; cleaner data almost always is.
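As a hedged illustration of that policy (assuming each example carries a numeric `quality_score` field, as produced later in this section), "keep the top 60%" reduces to a quantile cutoff; `keep_top_fraction` is a hypothetical helper, not a library function:

```python
import numpy as np

def keep_top_fraction(examples: list[dict], fraction: float = 0.6) -> list[dict]:
    """Keep the top `fraction` of examples ranked by quality_score.

    Assumes each example dict has a numeric "quality_score" field.
    """
    scores = np.array([ex["quality_score"] for ex in examples])
    # The cutoff is the (1 - fraction) quantile of the score distribution
    cutoff = np.quantile(scores, 1.0 - fraction)
    return [ex for ex in examples if ex["quality_score"] >= cutoff]

# Toy data: ten examples with scores 0.0 through 0.9
data = [{"quality_score": s / 10} for s in range(10)]
kept = keep_top_fraction(data, fraction=0.6)
print(len(kept))  # roughly the top 60% survive
```

Ranking by score rather than using a fixed threshold adapts automatically as the score distribution shifts between generation runs.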
If you use GPT-4o to generate synthetic data and then use GPT-4o to score that data for quality, the judge will systematically overrate the generator's output. This happens because both share the same distributional biases: the judge finds the generator's phrasing, structure, and vocabulary "natural" because they come from the same model. Use a different model family for judging (e.g., generate with GPT-4o, judge with Claude, or vice versa), or validate a sample of judge scores against human ratings to calibrate for this bias.
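Calibrating the judge against human ratings is a one-liner with `scipy.stats.spearmanr`; the sketch below uses hypothetical calibration data, and `judge_agreement` is an illustrative helper:

```python
from scipy.stats import spearmanr

def judge_agreement(judge_scores: list[float], human_scores: list[float]) -> float:
    """Spearman rank correlation between LLM-judge and human ratings.

    A low correlation on a calibration sample suggests the judge is
    unreliable, possibly biased toward its own model family's output.
    """
    rho, _p_value = spearmanr(judge_scores, human_scores)
    return rho

# Hypothetical calibration sample: eight examples rated by both
judge = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.95, 0.3]
human = [0.85, 0.5, 0.6, 0.3, 0.9, 0.4, 0.9, 0.2]
print(f"Spearman rho = {judge_agreement(judge, human):.2f}")
```

A few hundred human-rated examples are usually enough to estimate this correlation; if it falls below your target, switch judge model families before scaling up.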
1.1 Multi-Dimensional Scoring Rubric
| Dimension | Scale | What It Measures | Common Failure Modes |
|---|---|---|---|
| Instruction Clarity | 1-5 | Is the instruction unambiguous and well-formed? | Vague tasks, missing context, contradictory requirements |
| Response Quality | 1-5 | Is the response accurate, complete, and well-organized? | Hallucinations, incomplete answers, poor structure |
| Instruction-Response Alignment | 1-5 | Does the response actually answer the instruction? | Topic drift, answering a different question |
| Complexity | 1-5 | How challenging is the instruction? | Trivially simple tasks, overly repetitive patterns |
| Safety | Pass/Fail | Does the content violate safety policies? | Harmful advice, PII leakage, biased content |
The code fragment below implements this rubric, combining an LLM judge with a weighted composite score.
# Load synthetic data and apply multi-stage quality filtering
# Combine rule-based checks with LLM-based quality scoring
import json
from openai import OpenAI
from dataclasses import dataclass
client = OpenAI()
@dataclass
class QualityScore:
instruction_clarity: int
response_quality: int
alignment: int
complexity: int
safety_pass: bool
reasoning: str
@property
def composite(self) -> float:
"""Weighted composite score (excluding safety, which is binary)."""
if not self.safety_pass:
return 0.0
return (
0.20 * self.instruction_clarity +
0.35 * self.response_quality +
0.25 * self.alignment +
0.20 * self.complexity
) / 5.0 # Normalize to 0-1
def score_example(instruction: str, response: str,
model: str = "gpt-4o") -> QualityScore:
"""Score a single instruction-response pair on multiple dimensions."""
prompt = f"""Evaluate this instruction-response pair on the following
dimensions. Think through each dimension carefully before scoring.
INSTRUCTION: {instruction}
RESPONSE: {response}
Score each dimension:
- instruction_clarity (1-5): Is the instruction clear and unambiguous?
- response_quality (1-5): Is the response accurate, complete, well-organized?
- alignment (1-5): Does the response directly address the instruction?
- complexity (1-5): How challenging is the task? (1=trivial, 5=expert-level)
- safety_pass (true/false): Is the content free of harmful/biased material?
Provide your analysis, then scores as JSON:
{{
"reasoning": "your analysis of each dimension",
"instruction_clarity": <1-5>,
"response_quality": <1-5>,
"alignment": <1-5>,
"complexity": <1-5>,
"safety_pass": <true/false>
}}"""
# Send chat completion request to the API
result = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.1,
response_format={"type": "json_object"}
)
data = json.loads(result.choices[0].message.content)
return QualityScore(**data)
def batch_score_dataset(
dataset: list[dict],
min_composite: float = 0.6,
model: str = "gpt-4o"
) -> tuple[list[dict], list[dict]]:
"""Score and partition a dataset into accepted and rejected examples."""
accepted, rejected = [], []
for example in dataset:
score = score_example(
example["instruction"], example["response"], model
)
example["quality_score"] = score.composite
example["quality_details"] = {
"clarity": score.instruction_clarity,
"quality": score.response_quality,
"alignment": score.alignment,
"complexity": score.complexity,
"safety": score.safety_pass,
}
if score.composite >= min_composite and score.safety_pass:
accepted.append(example)
else:
rejected.append(example)
return accepted, rejected
# Example usage
sample_data = [
{"instruction": "Explain how B-tree indexing works in databases.",
"response": "B-tree indexes organize data in a balanced tree structure "
"where each node can have multiple children. Leaf nodes contain pointers "
"to the actual data rows. Lookups are O(log n) because the tree stays "
"balanced through splits and merges during insertions and deletions."},
{"instruction": "Do something.",
"response": "Sure, I did something."},
]
accepted, rejected = batch_score_dataset(sample_data)
print(f"Accepted: {len(accepted)}, Rejected: {len(rejected)}")
for ex in accepted:
print(f" Score: {ex['quality_score']:.3f} | {ex['instruction'][:50]}...")
Beyond per-example quality, you also need to guard against data contamination: synthetic examples that accidentally overlap with benchmark test sets. An n-gram overlap check against your evaluation sets catches this leakage before it inflates benchmark scores.
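A minimal sketch of such a contamination check follows; the window size and overlap threshold are illustrative choices, and both helpers are hypothetical, not a library API:

```python
def ngram_set(text: str, n: int = 10) -> set[str]:
    """All word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(
    synthetic: list[dict],
    test_set: list[dict],
    key: str = "instruction",
    n: int = 10,
    min_shared: int = 3,
) -> list[int]:
    """Indices of synthetic examples sharing >= min_shared n-grams
    with any test-set example."""
    test_ngrams: set[str] = set()
    for ex in test_set:
        test_ngrams |= ngram_set(ex[key], n)
    flagged = []
    for i, ex in enumerate(synthetic):
        shared = ngram_set(ex[key], n) & test_ngrams
        if len(shared) >= min_shared:
            flagged.append(i)
    return flagged

# Tiny demo with a short window so the overlap is visible
synth = [
    {"instruction": "Explain how B-tree indexing works in relational databases."},
    {"instruction": "Summarize the plot of a famous novel."},
]
test_examples = [{"instruction": "Explain how B-tree indexing works"}]
flagged = flag_contaminated(synth, test_examples, n=3, min_shared=1)
print(flagged)  # [0]
```

In practice, longer n-grams (around 10 words) and a small shared-count threshold keep false positives low while still catching verbatim leakage; paraphrased leakage requires the embedding-based check described in the exercises.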
2. Multi-Granularity Deduplication
With contamination handled, the second curation pillar removes redundancy. The functions below implement three tiers of deduplication in ascending cost order: exact hash matching, MinHash near-duplicate detection, and semantic embedding similarity.
# Three-tier deduplication: exact hashing, MinHash LSH, semantic embeddings
# Each tier catches duplicates the previous, cheaper tier missed
import hashlib
from collections import defaultdict
def exact_dedup(examples: list[dict], key: str = "instruction") -> list[dict]:
"""Remove exact duplicates based on normalized text hash."""
seen = set()
unique = []
for ex in examples:
# Normalize: lowercase, strip whitespace, collapse spaces
normalized = " ".join(ex[key].lower().split())
text_hash = hashlib.sha256(normalized.encode()).hexdigest()
if text_hash not in seen:
seen.add(text_hash)
unique.append(ex)
removed = len(examples) - len(unique)
print(f"Exact dedup: {len(examples)} -> {len(unique)} "
f"({removed} removed, {removed/len(examples)*100:.1f}%)")
return unique
def minhash_dedup(
examples: list[dict],
key: str = "instruction",
num_perm: int = 128,
threshold: float = 0.7
) -> list[dict]:
"""Remove near-duplicates using MinHash with n-gram shingling."""
from datasketch import MinHash, MinHashLSH
lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
minhashes = []
# Create MinHash for each example
for i, ex in enumerate(examples):
mh = MinHash(num_perm=num_perm)
# 3-gram character shingles
text = ex[key].lower()
for j in range(len(text) - 2):
shingle = text[j:j+3]
mh.update(shingle.encode("utf-8"))
minhashes.append(mh)
        lsh.insert(f"doc_{i}", mh)  # keys are unique, so inserts never collide
# Find clusters of similar documents
keep_indices = set()
processed = set()
for i in range(len(examples)):
if i in processed:
continue
similar = lsh.query(minhashes[i])
cluster_indices = [int(s.split("_")[1]) for s in similar]
# Keep the first (or best quality) in each cluster
best = min(cluster_indices) # Keep first occurrence
keep_indices.add(best)
processed.update(cluster_indices)
unique = [examples[i] for i in sorted(keep_indices)]
removed = len(examples) - len(unique)
print(f"MinHash dedup: {len(examples)} -> {len(unique)} "
f"({removed} removed, {removed/len(examples)*100:.1f}%)")
return unique
def semantic_dedup(
examples: list[dict],
key: str = "instruction",
threshold: float = 0.92,
model: str = "text-embedding-3-small"
) -> list[dict]:
"""Remove semantic duplicates using embedding similarity."""
import numpy as np
# Get embeddings
texts = [ex[key] for ex in examples]
response = client.embeddings.create(model=model, input=texts)
embeddings = np.array([e.embedding for e in response.data])
# Normalize for cosine similarity
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
# Find pairs above threshold
similarity_matrix = normalized @ normalized.T
keep = set(range(len(examples)))
for i in range(len(examples)):
if i not in keep:
continue
for j in range(i + 1, len(examples)):
if j not in keep:
continue
if similarity_matrix[i][j] > threshold:
keep.discard(j) # Remove the later duplicate
unique = [examples[i] for i in sorted(keep)]
removed = len(examples) - len(unique)
print(f"Semantic dedup: {len(examples)} -> {len(unique)} "
f"({removed} removed, {removed/len(examples)*100:.1f}%)")
return unique
Think of data curation as pouring sand through three progressively finer sieves. The first sieve (exact hashing) catches obvious rocks quickly and cheaply. The second sieve (MinHash near-duplicate detection) catches pebbles that the first sieve missed, at moderate cost. The third sieve (semantic similarity) catches the finest grit, but is the most expensive to run. Always run them in order from coarsest to finest, because each sieve reduces the volume for the next stage. Skipping the cheap sieves and running only the expensive one wastes compute on duplicates that could have been caught for pennies.
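The ordering above can be captured in a tiny driver; this sketch uses a toy tier in place of the exact/MinHash/semantic functions defined earlier (in a real pipeline you would pass those three, cheapest first), and `dedup_cascade` is an illustrative helper:

```python
from typing import Callable

Dedup = Callable[[list[dict]], list[dict]]

def dedup_cascade(examples: list[dict], tiers: list[Dedup]) -> list[dict]:
    """Apply deduplication tiers in the order given (cheapest first).

    Each tier only sees the survivors of the previous tier, so the most
    expensive pass runs on the smallest possible input.
    """
    for tier in tiers:
        examples = tier(examples)
    return examples

# Toy tier standing in for exact_dedup: keep one example per instruction
def drop_exact(examples: list[dict]) -> list[dict]:
    return list({ex["instruction"]: ex for ex in examples}.values())

data = [{"instruction": "a"}, {"instruction": "a"}, {"instruction": "b"}]
cleaned = dedup_cascade(data, tiers=[drop_exact])
print(len(cleaned))  # 2
```

Keeping the tiers as interchangeable functions also makes it easy to benchmark how much each sieve removes on a sample before committing to a full run.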
2.4 Paraphrase Generation as Data Augmentation
While deduplication removes unwanted redundancy, paraphrase generation introduces controlled redundancy as a deliberate augmentation strategy. The idea is simple: for each high-quality example in your dataset, generate semantically equivalent variations with different surface forms. This teaches models that meaning is invariant under rephrasing, improving robustness to input variation and reducing overfitting to specific phrasings. Paraphrase augmentation is especially valuable for small datasets where every original example must be leveraged to its fullest.
LLMs make paraphrase generation straightforward, but quality control is essential. A good paraphrase preserves the core meaning while changing vocabulary, sentence structure, or rhetorical style. A bad paraphrase either drifts semantically (introducing new information or losing key details) or changes so little that it functions as a near-duplicate rather than a genuine augmentation. The solution is to measure semantic similarity between the original and each paraphrase, accepting only those that fall within a "goldilocks zone": similar enough to preserve meaning (cosine similarity above 0.85) but different enough to provide genuine diversity (cosine similarity below 0.97).
# Paraphrase generation with diversity control and quality filtering
# Uses semantic similarity to enforce the goldilocks zone
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np
client = OpenAI()
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
def generate_paraphrases(
text: str,
n: int = 5,
    style_variants: list[str] | None = None,
sim_min: float = 0.85,
sim_max: float = 0.97,
) -> list[dict]:
"""Generate diverse paraphrases with semantic similarity filtering."""
if style_variants is None:
style_variants = [
"more formal and technical",
"simpler and more conversational",
"more concise, removing unnecessary words",
"restructured with different sentence order",
"using different vocabulary and phrasing",
]
prompt = f"""Generate {n} paraphrases of the following text.
Each paraphrase must preserve the EXACT same meaning but use
different wording, structure, or style.
Style guidance for each variant:
{chr(10).join(f'{i+1}. {s}' for i, s in enumerate(style_variants[:n]))}
Original: {text}
Return only the paraphrases, one per line, numbered 1 through {n}."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.9,
)
paraphrases = [
line.split(". ", 1)[1] if ". " in line else line
for line in response.choices[0].message.content.strip().split("\n")
if line.strip()
][:n]
# Quality control: filter by semantic similarity
    orig_emb = embed_model.encode([text], normalize_embeddings=True)
    para_embs = embed_model.encode(paraphrases, normalize_embeddings=True)
    similarities = np.dot(para_embs, orig_emb.T).flatten()  # true cosine similarity
results = []
for para, sim in zip(paraphrases, similarities):
results.append({
"paraphrase": para,
"similarity": float(sim),
"accepted": sim_min <= sim <= sim_max,
})
return results
# Example usage
original = "The model achieved 94.2% accuracy on the test set."
for r in generate_paraphrases(original, n=3):
status = "ACCEPT" if r["accepted"] else "REJECT"
print(f" [{status}] (sim={r['similarity']:.3f}) {r['paraphrase']}")
Paraphrase augmentation connects naturally to two other areas of this textbook. For contrastive learning and embedding training (covered in Section 19.1), paraphrase pairs serve as positive examples: texts that should have similar embeddings. Generating diverse paraphrases at scale provides the large positive-pair datasets that contrastive learning methods like SimCSE require. For instruction tuning, paraphrasing the instruction portion of training examples teaches models to follow the same instruction regardless of how it is phrased, improving robustness to prompt variation in deployment.
Paraphrase vs. deduplication tension: There is a productive tension between paraphrase generation (adding controlled variation) and deduplication (removing unwanted variation). The key distinction is intent. Paraphrases generated deliberately from known-good examples increase effective training diversity. Near-duplicates that arise accidentally from overlapping generation prompts waste training compute on redundant signal. Run deduplication first to clean the base dataset, then apply paraphrase augmentation to expand the cleaned dataset with intentional variation.
3. Multi-Dimensional Filtering
After deduplication, the next stage applies content-based filters that check each example against multiple quality dimensions. The following pipeline demonstrates a composable filtering architecture.
# Build a composable filter pipeline for synthetic data quality control
# Chain length, quality score, and repetition filters in sequence
from dataclasses import dataclass, field
from typing import Callable
@dataclass
class FilterResult:
passed: bool
reason: str = ""
@dataclass
class FilterPipeline:
"""Configurable pipeline of data quality filters."""
filters: list[tuple[str, Callable]] = field(default_factory=list)
def add_filter(self, name: str, fn: Callable):
self.filters.append((name, fn))
return self # Allow chaining
def run(self, examples: list[dict]) -> tuple[list[dict], dict]:
"""Run all filters, returning accepted examples and statistics."""
stats = {name: 0 for name, _ in self.filters}
accepted = []
for ex in examples:
passed_all = True
for name, fn in self.filters:
result = fn(ex)
if not result.passed:
stats[name] += 1
passed_all = False
break # Fail fast on first filter failure
if passed_all:
accepted.append(ex)
total = len(examples)
print(f"Filtering: {total} -> {len(accepted)} accepted")
for name, count in stats.items():
print(f" {name}: {count} removed ({count/total*100:.1f}%)")
return accepted, stats
# Define individual filters
def length_filter(ex: dict, min_tokens: int = 20,
max_tokens: int = 2048) -> FilterResult:
"""Filter by response length in approximate tokens."""
token_count = len(ex.get("response", "").split()) * 1.3 # rough estimate
if token_count < min_tokens:
return FilterResult(False, f"Too short: ~{token_count:.0f} tokens")
if token_count > max_tokens:
return FilterResult(False, f"Too long: ~{token_count:.0f} tokens")
return FilterResult(True)
def quality_score_filter(ex: dict,
min_score: float = 0.6) -> FilterResult:
"""Filter by pre-computed quality score."""
score = ex.get("quality_score", 0)
if score < min_score:
return FilterResult(False, f"Low quality: {score:.3f}")
return FilterResult(True)
def repetition_filter(ex: dict,
max_repeat_ratio: float = 0.3) -> FilterResult:
"""Filter responses with excessive repeated phrases."""
response = ex.get("response", "")
words = response.lower().split()
if len(words) < 10:
return FilterResult(True)
# Check for repeated 4-grams
ngrams = [" ".join(words[i:i+4]) for i in range(len(words) - 3)]
from collections import Counter
counts = Counter(ngrams)
max_count = max(counts.values()) if counts else 0
repeat_ratio = max_count / len(ngrams) if ngrams else 0
if repeat_ratio > max_repeat_ratio:
return FilterResult(False, f"Repetitive: {repeat_ratio:.2f} ratio")
return FilterResult(True)
# Build and run the pipeline
pipeline = FilterPipeline()
pipeline.add_filter("length", length_filter)
pipeline.add_filter("quality", quality_score_filter)
pipeline.add_filter("repetition", repetition_filter)
# Example run (would normally be on thousands of examples)
sample = [
{"instruction": "Explain Docker", "response": "Docker is...",
"quality_score": 0.3}, # Too short + low quality
{"instruction": "Explain K8s", "response": "Kubernetes is an " * 100,
"quality_score": 0.7}, # Repetitive
{"instruction": "Explain REST APIs",
"response": "REST APIs use HTTP methods to expose resources. "
"GET retrieves data, POST creates new resources, PUT updates "
"existing ones, and DELETE removes them. RESTful design follows "
"principles like statelessness and uniform interfaces.",
"quality_score": 0.85}, # Good
]
accepted, stats = pipeline.run(sample)
4. Argilla for Human-in-the-Loop Curation
Argilla is an open-source data curation platform designed specifically for NLP and LLM data workflows. It provides a web UI for reviewing, annotating, and correcting synthetic data, combined with a Python SDK for programmatic workflows. Argilla bridges the gap between automated quality scoring and human judgment by presenting borderline examples to human reviewers. The code fragment below shows this approach in practice.
# Push synthetic data to Argilla for human review and annotation
# Borderline examples are flagged for manual inspection
import argilla as rg
# Connect to the Argilla server (Argilla 2.x client API)
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")
# Create a dataset for reviewing synthetic data quality
settings = rg.Settings(
guidelines="Review synthetic instruction-response pairs for quality. "
"Score each dimension 1-5 and flag any safety concerns.",
fields=[
rg.TextField(name="instruction", title="Instruction"),
rg.TextField(name="response", title="Response"),
],
questions=[
rg.RatingQuestion(
name="instruction_clarity",
title="Instruction Clarity",
description="Is the instruction clear and unambiguous?",
values=[1, 2, 3, 4, 5]
),
rg.RatingQuestion(
name="response_quality",
title="Response Quality",
description="Is the response accurate, complete, and well-organized?",
values=[1, 2, 3, 4, 5]
),
rg.LabelQuestion(
name="safety",
title="Safety Check",
labels=["safe", "unsafe", "borderline"]
),
rg.TextQuestion(
name="comments",
title="Comments",
description="Any notes about this example?",
required=False
),
],
metadata=[
rg.FloatMetadataProperty(name="llm_quality_score",
title="LLM Quality Score"),
rg.TermsMetadataProperty(name="source",
title="Generation Source"),
],
)
dataset = rg.Dataset(name="synthetic_data_review", settings=settings)
dataset.create()
# Upload synthetic examples for human review
records = [
rg.Record(
fields={
"instruction": "Explain the CAP theorem in distributed systems.",
"response": "The CAP theorem states that a distributed system "
"can provide at most two of three guarantees: Consistency, "
"Availability, and Partition tolerance..."
},
metadata={
"llm_quality_score": 0.82,
"source": "self-instruct",
},
),
]
dataset.records.log(records)
print(f"Uploaded {len(records)} records for review")
A practical workflow routes only borderline examples to human review: those with LLM quality scores between 0.5 and 0.7. Examples above 0.7 are auto-accepted, and examples below 0.5 are auto-rejected. This focuses expensive human attention where it adds the most value and can reduce review volume by 60% to 80% while maintaining high data quality.
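A sketch of that routing policy follows, using the thresholds described above; `route_for_review` is an illustrative helper, not part of the Argilla SDK:

```python
def route_for_review(
    examples: list[dict],
    auto_accept: float = 0.7,
    auto_reject: float = 0.5,
) -> dict[str, list[dict]]:
    """Partition scored examples into accept / human-review / reject buckets.

    Scores >= auto_accept are auto-accepted, scores < auto_reject are
    auto-rejected, and everything in between goes to human review.
    """
    buckets: dict[str, list[dict]] = {"accepted": [], "review": [], "rejected": []}
    for ex in examples:
        score = ex.get("quality_score", 0.0)
        if score >= auto_accept:
            buckets["accepted"].append(ex)
        elif score < auto_reject:
            buckets["rejected"].append(ex)
        else:
            buckets["review"].append(ex)
    return buckets

scored = [{"quality_score": s} for s in (0.9, 0.65, 0.55, 0.3)]
buckets = route_for_review(scored)
print({name: len(items) for name, items in buckets.items()})
```

Only the `review` bucket would then be uploaded to Argilla, keeping human attention focused on the genuinely ambiguous middle of the score distribution.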
Synthetic data generation is especially critical in robotics, where real-world data collection is slow, expensive, and constrained by physical hardware. NVIDIA's Isaac Sim combined with Cosmos can generate 780,000 robot training trajectories in just 11 hours using LLM-guided scenario specification. NVIDIA's Eureka system uses LLMs to write and iteratively refine reward functions for robot RL, outperforming human expert rewards on the majority of tested tasks. For full coverage of LLM-driven synthetic data in robotics, see Section 28.7.
Deduplication is the unsung hero of synthetic data pipelines. Without it, you end up with a dataset where 30% of the examples are minor paraphrases of each other, and your model learns to be confidently repetitive.
5. Distilabel for Production Pipelines
Distilabel is an open-source framework by Argilla (now part of Hugging Face) for building scalable synthetic data generation and curation pipelines. It provides pre-built components for common generation patterns (Self-Instruct, Evol-Instruct, UltraFeedback-style scoring) and handles batching, rate limiting, and error recovery automatically. The code fragment below shows this approach in practice.
# Build a Distilabel pipeline for automated generation and scoring
# TextGeneration creates candidates; UltraFeedback scores them
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.llms import OpenAILLM
# Seed instructions feed the first step of the pipeline
seed_instructions = [
    {"instruction": "Explain how garbage collection works in Python."},
    {"instruction": "What are the tradeoffs between SQL and NoSQL databases?"},
    {"instruction": "Describe the observer pattern with a practical example."},
]
# Steps are defined inside the pipeline context and wired with >>
with Pipeline(name="synthetic-data-pipeline") as pipeline:
    # Step 0: Load seed instructions into the pipeline
    load = LoadDataFromDicts(name="load_seeds", data=seed_instructions)
    # Step 1: Generate candidate responses to each instruction
    generate = TextGeneration(
        name="generate_response",
        llm=OpenAILLM(model="gpt-4o-mini"),
        system_prompt="You are a knowledgeable assistant. Provide detailed, "
        "accurate responses to technical questions.",
        num_generations=3,  # Generate 3 candidates per instruction
    )
    # Step 2: Score the generated responses using UltraFeedback criteria
    score = UltraFeedback(
        name="quality_scoring",
        llm=OpenAILLM(model="gpt-4o"),
        aspect="overall-rating",  # Score on overall quality 1-5
    )
    load >> generate >> score
# In production, run with:
# distiset = pipeline.run()
# distiset.push_to_hub("my-org/synthetic-dataset")  # Push to HF Hub
print("Pipeline configured with generate + score steps")
print(f"Seed instructions: {len(seed_instructions)}")
print(f"Expected outputs: {len(seed_instructions) * 3} candidates, scored")
Why this matters: Data quality is the single biggest lever for fine-tuning success. Teams that invest in rigorous quality assurance for their synthetic data consistently achieve better results with fewer examples than teams that prioritize volume. A well-curated dataset of 5,000 high-quality examples regularly outperforms 50,000 unfiltered examples. The quality assurance patterns here connect directly to the data preparation checklist in Section 14.2, where formatting and quality are prerequisites for effective fine-tuning.
The best results typically come from blending synthetic data with a smaller set of real, human-verified examples (for example, 80% synthetic and 20% real). The real data anchors quality while synthetic data provides scale.
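As a hedged sketch of that blending policy (the sizing rule here is one illustrative choice, not a canonical recipe, and `blend_datasets` is a hypothetical helper):

```python
import random

def blend_datasets(
    synthetic: list[dict],
    real: list[dict],
    synthetic_fraction: float = 0.8,
    seed: int = 0,
) -> list[dict]:
    """Blend synthetic and real examples at a target ratio, then shuffle.

    Uses the full real set and samples enough synthetic examples to hit
    the target synthetic fraction of the combined dataset.
    """
    rng = random.Random(seed)
    n_synth = int(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic))
    blended = real + rng.sample(synthetic, n_synth)
    rng.shuffle(blended)
    return blended

real = [{"source": "real"}] * 20
synthetic = [{"source": "synthetic", "i": i} for i in range(200)]
mixed = blend_datasets(synthetic, real)
print(len(mixed))  # 20 real + 80 synthetic = 100
```

Anchoring the size on the real set reflects the usual constraint: human-verified data is the scarce resource, and synthetic data scales to fill the rest of the ratio.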
- Quality scoring with LLM-as-judge evaluates each example on instruction clarity, response quality, alignment, complexity, and safety. Calibrate against human judgments (target Spearman correlation above 0.65) before scaling.
- Deduplication operates at three levels: exact hash matching (cheapest, catches 5% to 15%), MinHash near-duplicate detection (catches 10% to 25%), and semantic embedding similarity (catches 15% to 35%). Always run them in order from cheapest to most expensive.
- Multi-dimensional filtering removes examples that fail length, language, toxicity, PII, topic, or repetition checks. Each filter targets a specific failure mode that would degrade training quality.
- Argilla provides human-in-the-loop curation with a web UI for reviewing borderline examples. Routing only borderline cases (quality scores 0.5 to 0.7) to human review reduces volume by 60% to 80%.
- Distilabel automates the full pipeline from generation through scoring, with pre-built components, rate limiting, and Hugging Face Hub integration.
- Curation typically improves downstream performance by 10% to 25% compared to training on uncurated synthetic data.
Who: An ML team at a conversational AI company preparing a 100,000-example instruction-tuning dataset assembled from four synthetic generation pipelines.
Situation: Each pipeline had produced 25,000 examples independently using different prompting strategies but overlapping seed data. The team suspected significant duplication across pipelines.
Problem: Training on duplicate or near-duplicate data wastes compute, amplifies biases present in repeated examples, and can cause the model to memorize specific phrasings rather than learning generalizable patterns.
Dilemma: They could skip deduplication and accept the waste (fastest), run only exact-match hashing (catches obvious copies), or implement a three-tier deduplication pipeline (hash, MinHash, semantic) that would take a day to set up but provide the deepest cleaning.
Decision: They implemented the three-tier pipeline: exact hash matching first (cheapest, catches copy-paste duplicates), MinHash with 128 permutations for near-duplicate detection (catches paraphrases with minor edits), and finally embedding-based semantic deduplication using cosine similarity above 0.95 (catches functionally identical examples with different wording).
How: They processed the pipeline in order from cheapest to most expensive. Exact hashing ran in 2 minutes, MinHash in 15 minutes, and embedding-based dedup in 3 hours (including embedding computation). Each tier operated only on the survivors from the previous tier.
Result: Exact matching removed 8,200 duplicates (8.2%), MinHash caught an additional 12,400 near-duplicates (13.5% of remaining), and semantic dedup removed 7,800 more (9.5% of remaining). The final dataset was 71,600 examples. The model trained on the deduplicated set achieved 4% higher benchmark scores than one trained on the full 100,000, while using 28% less training compute.
Lesson: Multi-tier deduplication (hash, MinHash, semantic) applied in ascending cost order is the most efficient approach; removing duplicates not only saves compute but genuinely improves model quality by increasing effective diversity.
Automated data quality scoring is moving beyond simple heuristics toward learned quality predictors that can estimate the training value of each synthetic example before it enters the dataset. Research on data mixing laws (extending scaling laws to data composition) seeks to predict optimal ratios of synthetic to real data for a given task and model size.
An open problem is detecting subtle distributional biases in curated synthetic datasets that only manifest after fine-tuning, such as sycophantic or overly cautious response patterns.
Exercises
Name four dimensions of quality that should be assessed for synthetic training data. For each dimension, give an example of a quality failure.
Answer Sketch
1. Correctness: a synthetic math problem has the wrong answer. 2. Diversity: all synthetic customer messages use the same phrasing pattern. 3. Relevance: generated examples are off-topic or do not match the target task distribution. 4. Consistency: the same entity is described with contradictory attributes across examples. Each failure degrades model training by introducing noise, bias, or gaps in coverage.
Implement a deduplication function for synthetic data that removes both exact duplicates and near-duplicates (cosine similarity > 0.95). Use embeddings for the near-duplicate detection.
Answer Sketch
Step 1: Remove exact duplicates with a set. Step 2: Embed all remaining examples. Step 3: For each pair, compute cosine similarity. Step 4: Build a graph where edges connect pairs with similarity > 0.95. Step 5: For each connected component, keep only one representative (e.g., the longest or highest-quality example). Use an efficient approach like FAISS for pairwise similarity at scale.
Explain how to use an LLM as a judge to score synthetic data quality. What scoring rubric would you provide, and what are the limitations of this approach?
Answer Sketch
Send each synthetic example to a judge LLM with a rubric: 'Rate 1 to 5 on: (a) Realism: could this plausibly appear in real data? (b) Correctness: is the content factually accurate? (c) Completeness: does it cover all required elements? (d) Difficulty: is it appropriately challenging?' Limitations: the judge may share the same biases as the generator (both are LLMs), may be inconsistent across examples, and cannot verify domain-specific factual accuracy without external knowledge.
Write a function that checks whether synthetic training data is contaminated with examples from a known test set. Use n-gram overlap and embedding similarity as complementary detection methods.
Answer Sketch
For n-gram overlap: extract all 10-grams from each synthetic example and each test example. Flag any synthetic example sharing 3+ 10-grams with a test example. For embedding similarity: embed both sets, compute pairwise cosine similarity, flag any pair with similarity > 0.92. The n-gram method catches verbatim copying; the embedding method catches paraphrased copies. Report flagged examples for manual review.
A team generates 50,000 synthetic examples and finds that aggressive quality filtering (keeping only top 20%) produces a dataset of 10,000. Compare the expected training outcomes of using all 50,000 vs. the filtered 10,000.
Answer Sketch
The filtered 10,000 will likely produce a better model despite being smaller. Low-quality examples introduce noise that the model must learn to ignore, slowing convergence and potentially teaching wrong patterns. Research consistently shows that a small, high-quality dataset outperforms a large, noisy one. The exception is if filtering is too aggressive and removes valid edge cases, reducing coverage. Monitor both quality metrics and distributional coverage.
What Comes Next
The next section, Section 13.5: LLM-Assisted Labeling & Active Learning, examines how models can accelerate human annotation workflows. Quality filtering is also essential when preparing data for parameter-efficient fine-tuning (Section 15.1) and knowledge distillation (Section 16.1).
Lee, K. et al. (2022). Deduplicating Training Data Makes Language Models Better. ACL 2022.
Provides rigorous evidence that deduplication improves both training efficiency and model quality, with experiments showing reduced memorization and better generalization. This paper directly motivates the multi-tier deduplication pipeline presented in this section. Required reading for anyone curating training datasets.
Abbas, A. et al. (2023). SemDeDup: Data-Efficient Learning at Web-Scale through Semantic Deduplication.
Introduces semantic deduplication using embedding similarity, which catches near-duplicates that hash-based methods miss. SemDeDup is the key technique in the third tier of the deduplication pipeline discussed here. Recommended for teams dealing with paraphrased or semantically redundant synthetic examples.
Broder, A. (1997). On the Resemblance and Containment of Documents.
The original MinHash paper that established the theoretical foundation for efficient near-duplicate detection at scale. MinHash remains the workhorse of the second deduplication tier discussed in this section. Essential background for understanding the algorithmic principles behind scalable deduplication.
Argilla. (2024). Argilla: Open-Source Data Curation for LLMs.
Documentation for Argilla, the open-source platform for data curation that integrates annotation, quality scoring, and dataset management. Argilla is one of the primary tools referenced in the practical curation workflow of this section. Ideal for teams wanting a production-ready curation interface.
Hugging Face. (2024). Distilabel: AI Feedback Framework for Building Datasets.
Distilabel provides a pipeline framework for generating and curating datasets using LLM-as-Judge scoring, preference pair generation, and quality filtering. It directly implements many of the automated quality scoring patterns discussed here. Practical for teams wanting to automate their curation pipeline with minimal custom code.
Penedo, G. et al. (2023). The RefinedWeb Dataset for Falcon LLM. NeurIPS 2023 Datasets Track.
Documents the curation pipeline behind RefinedWeb, demonstrating that aggressive filtering and deduplication of web data can match curated datasets in quality. The paper's filtering heuristics and quality metrics are directly applicable to synthetic data curation. A benchmark study for production-grade data pipelines.
