Part 3: Working with LLMs
Chapter 12: Hybrid ML and LLM Systems

Hybrid Pipeline Patterns

"The secret to a great hybrid pipeline is the same as the secret to a great marriage: knowing when to let the other partner handle things."

— Label, Diplomatically Wise AI Agent
Big Picture

The 80/20 insight that drives hybrid architectures. In most production systems, roughly 80% of incoming requests are straightforward cases that a fast, cheap classifier can handle correctly. The remaining 20% are complex, ambiguous, or novel cases that genuinely benefit from LLM reasoning. A well-designed hybrid pipeline routes each request to the cheapest model capable of handling it correctly, achieving LLM-level quality at a fraction of the cost. Building on the decision framework from Section 12.1, this section covers the major hybrid patterns: classical triage with LLM escalation, ensemble voting, cascading model architectures, and the router pattern.

Prerequisites

This section assumes you have read the decision framework in Section 12.1 and the feature extraction patterns in Section 12.2. The API engineering best practices from Section 10.3 (retries, caching, circuit breakers) are directly applied in the pipeline patterns here.

1. Pattern: Classical Triage + LLM Escalation

An assembly line combining classical ML components and LLM components into a hybrid pipeline
Figure 12.3.1: The hybrid pipeline assembly line: classical ML handles the fast, predictable parts while the LLM tackles the nuanced, open-ended work.

The triage pattern is the most common hybrid architecture in production. A fast, cheap classifier processes every incoming request. When the classifier is confident in its prediction, the result is returned directly. When the classifier is uncertain (low confidence), the request is escalated to an LLM for more careful analysis. Figure 12.3.1 shows this routing logic. Code Fragment 12.3.1 shows this approach in practice.

Common Mistake: Setting Confidence Thresholds Without Calibration

A classifier reporting 0.95 confidence does not necessarily mean it is correct 95% of the time. Most neural network classifiers are poorly calibrated: they tend to be overconfident. If you set your escalation threshold at 0.90 without calibrating the classifier first, you may route too few requests to the LLM, degrading overall accuracy on the hard cases. Always calibrate your classifier's confidence scores (using techniques like Platt scaling or temperature calibration) on a held-out set before setting production thresholds. Plot a reliability diagram to verify that predicted probabilities match observed frequencies.
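Before trusting a threshold, calibrate and check reliability. The following is a minimal sketch using scikit-learn's Platt scaling (`CalibratedClassifierCV` with `method="sigmoid"`) and `calibration_curve` on a synthetic dataset; the data, split sizes, and bin count are illustrative assumptions, not production settings.

```python
# Hedged sketch: calibrate confidence scores with Platt scaling, then
# check a reliability curve before choosing an escalation threshold.
# Synthetic data for illustration only.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.5, random_state=0
)

base = LogisticRegression(max_iter=1000)
# method="sigmoid" is Platt scaling; prefer method="isotonic" with more data
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Reliability check: do predicted probabilities match observed frequencies?
prob_pos = calibrated.predict_proba(X_hold)[:, 1]
frac_pos, mean_pred = calibration_curve(y_hold, prob_pos, n_bins=10)
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

If the printed pairs diverge badly (e.g., predicted 0.95 but observed 0.80), the raw confidence scores are not safe to threshold directly.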

Real-World Scenario

How a Support Team Cut API Costs by 85% with Triage Routing

Who: An ML engineer at a mid-size SaaS company handling 50,000 support tickets per day.

Situation: Every ticket was being routed through GPT-4 for intent classification, costing $4,500 per month.

Problem: The CEO wanted the same quality at one-fifth the cost.

Dilemma: Switching entirely to a fine-tuned BERT classifier would save money but miss nuanced tickets. Keeping GPT-4 on everything was accurate but expensive.

Decision: They deployed a confidence-threshold triage: a DistilBERT classifier handled tickets where its confidence exceeded 0.92 (about 78% of volume), and only uncertain tickets escalated to GPT-4.

Result: Monthly API costs dropped from $4,500 to $680, while classification accuracy stayed within 0.3% of the all-GPT-4 baseline.

Lesson: You do not need the best model for every request; you need the right model for each request's difficulty level.

A flowchart: incoming requests pass through a fast classifier (TF-IDF + logistic regression, ~0.1 ms, ~$0); requests whose confidence exceeds the threshold (about 80%) receive a direct response, while the rest (about 20%) escalate to LLM analysis (GPT-4o, ~500 ms, ~$0.01). Average cost: 0.80 x $0 + 0.20 x $0.01 = $0.002 per query, an 80% savings versus LLM-only.
Figure 12.3.2: The triage pattern routes easy cases to a fast classifier and escalates uncertain cases to an LLM, reducing average cost by 80%.

Why the confidence threshold is the single most important parameter in a hybrid system. Set it too low, and the cheap classifier handles cases it gets wrong, degrading overall quality. Set it too high, and nearly everything gets escalated to the expensive LLM, eliminating the cost savings. The right threshold depends on your tolerance for classifier errors: if a wrong classification has low consequences (e.g., routing a ticket to the wrong department), you can use a lower threshold. If a wrong classification has high consequences (e.g., medical triage), use a higher threshold and let the LLM handle a larger share. Calibrating this threshold requires measuring classifier accuracy at different confidence levels on a holdout set, as covered in Chapter 29 on evaluation.

1.1 Implementing Confidence-Based Routing

The following implementation (Code Fragment 12.3.1) shows confidence-based routing in practice: a TF-IDF + logistic regression classifier returns confident predictions directly and escalates the rest to an LLM.

# Route each request to a fast classifier or an LLM based on confidence
# Confident predictions return directly; uncertain ones escalate to the LLM
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from dataclasses import dataclass

@dataclass
class TriageResult:
    category: str
    confidence: float
    source: str  # "classifier" or "llm"
    cost: float

class TriageRouter:
    """Routes requests to classifier or LLM based on confidence."""

    def __init__(self, confidence_threshold: float = 0.85):
        self.threshold = confidence_threshold
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
        self.classifier = LogisticRegression(max_iter=1000)
        self.is_fitted = False

    def fit(self, texts: list[str], labels: list[str]):
        """Train the fast classifier."""
        X = self.vectorizer.fit_transform(texts)
        self.classifier.fit(X, labels)
        self.is_fitted = True

    def classify(self, text: str) -> TriageResult:
        """Route to classifier or LLM based on confidence."""
        # Step 1: Get classifier prediction and confidence
        X = self.vectorizer.transform([text])
        probas = self.classifier.predict_proba(X)[0]
        max_confidence = probas.max()
        predicted_class = self.classifier.classes_[probas.argmax()]

        # Step 2: Route based on confidence
        if max_confidence >= self.threshold:
            return TriageResult(
                category=predicted_class,
                confidence=max_confidence,
                source="classifier",
                cost=0.00001,
            )
        else:
            # Escalate to LLM (simulated here)
            llm_result = self._call_llm(text)
            return TriageResult(
                category=llm_result,
                confidence=0.95,  # LLM confidence (estimated)
                source="llm",
                cost=0.003,
            )

    def _call_llm(self, text: str) -> str:
        """Call LLM for complex cases (simplified)."""
        # In production: call OpenAI / Anthropic API
        return "complex_case"

# Train and evaluate
train_texts = [
    "charged twice", "double charge", "billing error",
    "app crashes", "error message", "won't load",
    "change email", "update address", "password reset",
    "package lost", "shipping delay", "not delivered",
    "pricing info", "business hours", "return policy",
] * 20

train_labels = [
    "billing", "billing", "billing",
    "technical", "technical", "technical",
    "account", "account", "account",
    "shipping", "shipping", "shipping",
    "general", "general", "general",
] * 20

router = TriageRouter(confidence_threshold=0.85)
router.fit(train_texts, train_labels)

# Test with various difficulty levels
test_cases = [
    "I was charged twice on my credit card",  # Easy: billing
    "The page gives me a 500 error",  # Easy: technical
    "I want to change my account email and also get a refund",  # Hard: mixed
    "Your competitor offers better rates",  # Hard: ambiguous
]

print("Triage Results:")
print("-" * 70)
for text in test_cases:
    result = router.classify(text)
    print(f" Text: '{text[:50]}...'")
    print(f" Category: {result.category} | "
          f"Confidence: {result.confidence:.2f} | "
          f"Source: {result.source} | "
          f"Cost: ${result.cost:.5f}")
    print()
Triage Results:
----------------------------------------------------------------------
 Text: 'I was charged twice on my credit card...'
 Category: billing | Confidence: 0.97 | Source: classifier | Cost: $0.00001

 Text: 'The page gives me a 500 error...'
 Category: technical | Confidence: 0.91 | Source: classifier | Cost: $0.00001

 Text: 'I want to change my account email and also get a...'
 Category: complex_case | Confidence: 0.95 | Source: llm | Cost: $0.00300

 Text: 'Your competitor offers better rates...'
 Category: complex_case | Confidence: 0.95 | Source: llm | Cost: $0.00300
Code Fragment 12.3.1: Route requests to a fast classifier or an LLM based on classifier confidence

3. Pattern: Cascading Model Architecture

Code Fragment 12.3.2 implements a three-tier cascade router: regex rules handle obvious patterns at near-zero cost, a small BERT model catches mid-complexity queries, and a full LLM handles only the cases that need it.

# Implement a cascade router that tries cheap models first
# Each tier has a confidence threshold; low-confidence results escalate to the next tier
import random
from dataclasses import dataclass

@dataclass
class CascadeResult:
    prediction: str
    confidence: float
    tier: int
    total_cost: float
    total_latency_ms: float

class ModelCascade:
    """Cascading model architecture: small -> medium -> large."""

    def __init__(self, confidence_thresholds: list[float]):
        self.thresholds = confidence_thresholds
        # In production, these would be real model instances
        self.tiers = [
            {"name": "Regex/Rules", "cost": 0.0, "latency_ms": 0.01},
            {"name": "BERT-tiny", "cost": 0.0001, "latency_ms": 5},
            {"name": "GPT-4o-mini", "cost": 0.003, "latency_ms": 400},
        ]

    def predict(self, text: str) -> CascadeResult:
        total_cost = 0.0
        total_latency = 0.0

        # The final tier gets a threshold of 0.0 so it always answers
        for i, (tier, threshold) in enumerate(
            zip(self.tiers, self.thresholds + [0.0])
        ):
            total_cost += tier["cost"]
            total_latency += tier["latency_ms"]

            # Simulate prediction (in production: actual model call)
            prediction, confidence = self._run_tier(i, text)

            if confidence >= threshold or i == len(self.tiers) - 1:
                return CascadeResult(
                    prediction=prediction,
                    confidence=confidence,
                    tier=i + 1,
                    total_cost=total_cost,
                    total_latency_ms=total_latency,
                )

    def _run_tier(self, tier: int, text: str) -> tuple[str, float]:
        """Simulate tier prediction. Replace with real models."""
        random.seed(hash(text) + tier)

        if tier == 0:  # Regex
            if any(kw in text.lower() for kw in ["charged", "refund", "bill"]):
                return "billing", 0.99
            return "unknown", 0.30

        elif tier == 1:  # Small model
            return "billing", 0.75 + random.random() * 0.2

        else:  # Large LLM
            return "billing", 0.95

# Demo
cascade = ModelCascade(confidence_thresholds=[0.90, 0.85])

test_texts = [
    "I was charged twice on my bill",  # Regex catches this
    "The interface feels sluggish lately",  # Needs small model
    "I have a complex multi-part question",  # Needs LLM
]

print("Cascade Routing Results:")
print("=" * 65)
for text in test_texts:
    result = cascade.predict(text)
    print(f" '{text[:45]}'")
    print(f" Tier {result.tier} | Conf: {result.confidence:.2f} | "
          f"Cost: ${result.total_cost:.5f} | "
          f"Latency: {result.total_latency_ms:.1f}ms\n")
Cascade Routing Results:
=================================================================
 'I was charged twice on my bill'
 Tier 1 | Conf: 0.99 | Cost: $0.00000 | Latency: 0.0ms

 'The interface feels sluggish lately'
 Tier 2 | Conf: 0.88 | Cost: $0.00010 | Latency: 5.0ms

 'I have a complex multi-part question'
 Tier 3 | Conf: 0.95 | Cost: $0.00310 | Latency: 405.0ms
Code Fragment 12.3.2: Implement a cascade router that tries cheap models first

4. Pattern: LLM Router

Key Insight

Why smaller specialized models can beat LLMs for routing. Using GPT-4 to decide whether a query needs GPT-4 is circular and expensive. Instead, train a small model (a fine-tuned classifier or even GPT-4o-mini with a focused prompt) specifically on the routing decision. The routing model only needs to classify query complexity, not solve the actual problem. This is a much simpler task that a small model handles well. In practice, a fine-tuned DistilBERT classifier trained on 2,000 labeled routing examples can match GPT-4o-mini's routing accuracy at 1/100th the latency and near-zero cost, making it the preferred approach for high-volume systems.
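As a sketch of the dedicated routing classifier described above, the following uses TF-IDF plus logistic regression as a lightweight stand-in for a fine-tuned DistilBERT. The training examples and the `cheap`/`llm` tier labels are invented for illustration; in practice you would label a few thousand real queries.

```python
# Minimal sketch of a dedicated routing classifier: it predicts only
# which tier should handle a query, not the answer itself.
# A TF-IDF + logistic regression stand-in for a fine-tuned DistilBERT.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented routing examples: short transactional queries vs. open-ended ones
route_texts = [
    "reset my password", "where is my package", "cancel my order",
    "update billing address", "track shipment 123", "change my email",
    "compare your pricing tiers against competitor offerings in detail",
    "analyze why churn rose after the redesign and suggest fixes",
    "draft a response to this legal complaint about data handling",
    "summarize this contract and flag unusual indemnification terms",
    "explain the tradeoffs of migrating our stack to your platform",
    "my issue involves billing, shipping, and a prior support case",
]
route_labels = ["cheap"] * 6 + ["llm"] * 6

router_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
router_clf.fit(route_texts, route_labels)

decision = router_clf.predict(["where is my package right now"])[0]
print(decision)
```

The routing model never solves the user's problem; it only emits a tier label, which is why a small model suffices.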

The router pattern uses a lightweight model (or even an LLM itself) to analyze the incoming request and decide which model should handle it. This concept extends naturally to the agentic architectures covered later, where routing decisions become part of a larger planning loop. Unlike the cascade, which always starts at the cheapest tier, the router can skip directly to the appropriate tier based on the request complexity. Code Fragment 12.3.3 shows this approach in practice.

# Use a small LLM to classify request difficulty and select a model tier
# The router returns a JSON decision with tier, reasoning, and difficulty score
import openai
import json

# Initialize OpenAI client (reads OPENAI_API_KEY from env)
client = openai.OpenAI()

def route_request(text: str) -> dict:
    """Use a small LLM to decide which model should handle a request."""
    # Send chat completion request to the API
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Analyze this request and decide
which model tier should handle it. Return JSON:

- tier: "regex" for simple pattern matching (dates, emails, numbers)
- tier: "classifier" for standard classification with clear categories
- tier: "small_llm" for moderate complexity needing some reasoning
- tier: "large_llm" for complex, ambiguous, or multi-step reasoning

Also return:
- reasoning: one sentence explaining why
- estimated_difficulty: 1-5 scale"""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=100,
    )
    # Extract the generated message from the API response
    return json.loads(response.choices[0].message.content)

# Example routing decisions
requests = [
    "Extract all email addresses from this text: contact us at info@corp.com",
    "Is this customer review positive or negative: 'Great product!'",
    "Summarize the key points from this 3-page contract",
    "Given our Q3 financials and market trends, should we expand to Europe?",
]

print("Router Decisions:")
print("=" * 70)
for req in requests:
    decision = route_request(req)
    print(f" Request: '{req[:55]}...'")
    print(f" Tier: {decision['tier']} | "
          f"Difficulty: {decision['estimated_difficulty']}/5")
    print(f" Reason: {decision['reasoning']}\n")
Router Decisions:
======================================================================
 Request: 'Extract all email addresses from this text: contact us...'
 Tier: regex | Difficulty: 1/5
 Reason: Email extraction follows a well-defined pattern that regex handles perfectly.

 Request: 'Is this customer review positive or negative: 'Great pr...'
 Tier: classifier | Difficulty: 1/5
 Reason: Simple binary sentiment classification with a clear signal word.

 Request: 'Summarize the key points from this 3-page contract...'
 Tier: small_llm | Difficulty: 3/5
 Reason: Summarization requires language understanding but not complex reasoning.

 Request: 'Given our Q3 financials and market trends, should we e...'
 Tier: large_llm | Difficulty: 5/5
 Reason: Multi-factor strategic analysis requires sophisticated reasoning over multiple data sources.
Code Fragment 12.3.3: Use a small LLM to classify request difficulty and select a model tier
Warning

Using an LLM as the router adds cost to every single request. If the router itself costs $0.0003 per call, you need the routing savings to exceed this overhead. For high-volume systems, train a small BERT classifier as the router instead, or use simple heuristics (input length, keyword presence, question complexity score) to avoid the LLM router cost entirely.
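The heuristics mentioned above fit in a few lines of plain Python. In the sketch below, the keyword list, length cutoff, and score threshold are illustrative assumptions, not tuned values.

```python
# Zero-cost heuristic router sketch: route on input length, keyword
# presence, and a crude complexity score instead of calling an LLM.
# Thresholds and keywords are illustrative assumptions.
COMPLEX_KEYWORDS = {"why", "compare", "explain", "should", "analyze"}

def heuristic_route(text: str) -> str:
    words = text.lower().split()
    complexity = 0
    if len(words) > 30:                # long inputs tend to need reasoning
        complexity += 1
    if COMPLEX_KEYWORDS & set(words):  # reasoning-style keywords present
        complexity += 1
    if text.count("?") > 1:            # multi-question requests
        complexity += 1
    return "llm" if complexity >= 2 else "classifier"

print(heuristic_route("Reset my password"))  # -> classifier
print(heuristic_route("Why should we compare vendors? What would you analyze first?"))  # -> llm
```

A heuristic like this costs nothing per request, so any routing savings it produces are pure gain.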

Key Insight

Routing is the single highest-leverage decision in a hybrid system. Think of it as a hospital triage nurse: the nurse (router) does not treat patients, but by correctly sending each patient to the right ward (fast classifier, small LLM, or large LLM), they determine the entire system's cost and quality profile. A router that sends 10% more requests to the cheap model saves 10% on LLM costs; a router that misroutes complex queries to a fast classifier degrades quality for those users. The routing threshold is your most important hyperparameter.
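The arithmetic behind this insight fits in a one-line expected-cost model; the numbers below reuse the illustrative figures from Figure 12.3.2 ($0 classifier, $0.01 LLM, 80% of traffic handled cheaply).

```python
# Expected per-query cost of a triage system as a function of the
# fraction of traffic the cheap tier handles.
def blended_cost(p_cheap: float, cheap_cost: float, llm_cost: float) -> float:
    """Average cost per query given the fraction routed to the cheap tier."""
    return p_cheap * cheap_cost + (1 - p_cheap) * llm_cost

avg = blended_cost(0.80, 0.0, 0.01)
print(f"${avg:.4f}/query")                      # -> $0.0020/query
print(f"{1 - avg / 0.01:.0%} savings vs. LLM-only")  # -> 80% savings vs. LLM-only
```

Shifting 10 percentage points of traffic to the cheap tier (0.80 to 0.90) drops the blended cost from $0.0020 to $0.0010 per query, which is why the router's threshold is the highest-leverage knob in the system.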

Lab: Building a Customer Support Pipeline

Let us put these patterns together into a complete customer support pipeline that combines a classifier for routing, an LLM for complex extraction, and a rules engine for execution. This kind of multi-stage pipeline is also the foundation for the conversational AI systems we build later in the book. Code Fragment 12.3.4 shows the complete pipeline.

# Three-stage support pipeline: a fast classifier routes each ticket,
# then rules or an LLM extract structured info, and a lookup picks the action
import re
from dataclasses import dataclass, field

@dataclass
class TicketAnalysis:
    ticket_id: str
    raw_text: str
    category: str
    routing_tier: str
    extracted_info: dict = field(default_factory=dict)
    action: str = ""
    total_cost: float = 0.0

class CustomerSupportPipeline:
    """Three-stage pipeline: classify -> extract -> execute."""

    def __init__(self):
        self.simple_categories = {"billing", "shipping", "account"}
        self.actions = {
            "billing": "initiate_refund_review",
            "shipping": "create_tracking_inquiry",
            "account": "send_account_update_link",
            "technical": "create_engineering_ticket",
            "general": "route_to_agent",
        }

    def process_ticket(self, ticket_id: str, text: str) -> TicketAnalysis:
        analysis = TicketAnalysis(ticket_id=ticket_id, raw_text=text,
                                  category="", routing_tier="")

        # Stage 1: Fast classification
        category, confidence = self._classify(text)
        analysis.category = category

        if confidence > 0.85 and category in self.simple_categories:
            # Stage 2a: Rule-based extraction for simple cases
            analysis.routing_tier = "classifier + rules"
            analysis.extracted_info = self._rule_extract(text, category)
            analysis.total_cost = 0.00001
        else:
            # Stage 2b: LLM extraction for complex cases
            analysis.routing_tier = "classifier + llm"
            analysis.extracted_info = self._llm_extract(text)
            analysis.total_cost = 0.005

        # Stage 3: Determine action
        analysis.action = self.actions.get(analysis.category, "route_to_agent")

        return analysis

    def _classify(self, text: str) -> tuple[str, float]:
        """Fast keyword classifier (replace with trained model)."""
        keywords = {
            "billing": ["charge", "refund", "payment", "invoice", "bill"],
            "shipping": ["package", "delivery", "tracking", "shipped"],
            "account": ["password", "email", "login", "account"],
            "technical": ["error", "crash", "bug", "broken", "slow"],
        }
        text_lower = text.lower()
        for cat, words in keywords.items():
            if any(w in text_lower for w in words):
                return cat, 0.90
        return "general", 0.50

    def _rule_extract(self, text: str, category: str) -> dict:
        """Simple rule-based extraction for common patterns."""
        info = {}
        amounts = re.findall(r'\$[\d,]+\.?\d*', text)
        if amounts:
            info["amounts"] = amounts
        dates = re.findall(r'\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}', text)
        if dates:
            info["dates"] = dates
        return info

    def _llm_extract(self, text: str) -> dict:
        """LLM extraction for complex cases (simulated)."""
        return {
            "urgency": "high",
            "sentiment": -0.7,
            "key_issues": ["billing dispute", "service cancellation"],
            "requires_human": True,
        }

# Process sample tickets
pipeline = CustomerSupportPipeline()

tickets = [
    ("T-001", "Please refund $49.99 charged on 2025-01-15"),
    ("T-002", "My package tracking number XY123 shows no updates"),
    ("T-003", "Your AI assistant gave me wrong medical advice and I "
              "want to speak to a manager about this serious issue"),
]

print("Customer Support Pipeline Results:")
print("=" * 65)
for tid, text in tickets:
    result = pipeline.process_ticket(tid, text)
    print(f" Ticket: {result.ticket_id}")
    print(f" Category: {result.category}")
    print(f" Routing: {result.routing_tier}")
    print(f" Extracted: {result.extracted_info}")
    print(f" Action: {result.action}")
    print(f" Cost: ${result.total_cost:.5f}")
    print()
Customer Support Pipeline Results:
=================================================================
 Ticket: T-001
 Category: billing
 Routing: classifier + rules
 Extracted: {'amounts': ['$49.99'], 'dates': ['2025-01-15']}
 Action: initiate_refund_review
 Cost: $0.00001

 Ticket: T-002
 Category: shipping
 Routing: classifier + rules
 Extracted: {}
 Action: create_tracking_inquiry
 Cost: $0.00001

 Ticket: T-003
 Category: general
 Routing: classifier + llm
 Extracted: {'urgency': 'high', 'sentiment': -0.7, 'key_issues': ['billing dispute', 'service cancellation'], 'requires_human': True}
 Action: route_to_agent
 Cost: $0.00500
Code Fragment 12.3.4: A complete three-stage customer support pipeline: classify, extract, execute
Self-Check
Q1: What is the primary advantage of the triage pattern over always using an LLM?
Show Answer
The triage pattern processes the majority of requests (typically 80%) with a fast, cheap classifier and only escalates uncertain cases to the LLM. This reduces average cost per query by 80% or more while maintaining near-LLM-level quality, because the classifier handles the straightforward cases correctly and the LLM focuses its capacity on genuinely difficult cases.
Q2: When should you use an ensemble approach instead of triage?
Show Answer
Ensembles are best for high-stakes tasks where accuracy matters more than cost: medical diagnosis, legal document review, financial compliance. Since ensembles run all models on every request, they are more expensive than triage but more robust against individual model errors. For cost-sensitive, high-volume applications, triage is usually the better choice.
Q3: How does the cascade pattern differ from the triage pattern?
Show Answer
The cascade processes requests through a sequence of increasingly powerful models (regex, then small model, then large LLM), stopping at the first tier that is confident enough. Triage uses a single classifier to make a binary routing decision (handle locally vs. escalate to LLM). Cascades have more tiers and progressively escalate, while triage makes a one-shot routing decision. Cascades are more flexible but add latency from multiple model calls on uncertain cases.
Q4: What is the "router cost trap" and how can you avoid it?
Show Answer
Using an LLM as a router adds its cost to every single request, potentially negating the savings from routing. If the router costs $0.0003 per call, it needs to save more than $0.0003 per request on average to be worthwhile. To avoid this, use a lightweight BERT classifier, simple heuristics (input length, keyword detection), or rule-based complexity scoring as the router instead of an LLM.
Q5: In the customer support pipeline example, why are simple billing and shipping tickets handled differently from complex complaints?
Show Answer
Simple tickets have predictable patterns (amounts, dates, tracking numbers) that rule-based extraction handles perfectly at near-zero cost. Complex complaints require understanding context, detecting sentiment, identifying multiple issues, and assessing urgency, which only an LLM can do reliably. The pipeline saves cost by reserving the expensive LLM for the cases that genuinely need its capabilities.
Tip: Measure Latency at the 95th Percentile

Average latency hides tail latency spikes that frustrate users. Always measure and optimize for p95 or p99 latency. A system with 200ms average but 5-second p99 feels slower than one with 400ms average and 600ms p99.
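A dependency-free way to compare mean and tail latency is a nearest-rank percentile over collected samples. The latency distribution below is synthetic, invented purely to illustrate how a few slow outliers inflate p99 far beyond the mean.

```python
# Measure tail latency with percentiles rather than the mean.
# Samples are synthetic; in production, collect them per request.
import random

random.seed(0)
# Mostly-fast distribution with occasional slow outliers
latencies_ms = (
    [random.gauss(200, 30) for _ in range(950)]
    + [random.gauss(4000, 500) for _ in range(50)]
)

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: small and dependency-free."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean:.0f}ms | p95: {percentile(latencies_ms, 95):.0f}ms "
      f"| p99: {percentile(latencies_ms, 99):.0f}ms")
```

With 5% of requests landing in the slow cluster, the mean looks moderate while p99 reveals the multi-second experience some users actually get.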

Real-World Scenario: Building a Triage Router for Customer Support Ticket Classification

Who: An ML platform team at a B2B software company handling 25,000 support tickets per day across 12 product categories.

Situation: They initially deployed GPT-4 to classify every ticket into product category, urgency level, and required expertise. Accuracy was excellent (96%), but the system cost $15,000/month in API fees.

Problem: Leadership approved a $5,000/month budget for the classification system. The team needed a 3x cost reduction without significant accuracy loss.

Dilemma: They could switch to a cheaper model like GPT-3.5 (lower quality), train a dedicated classifier (requires labeled data and maintenance), or build a triage system that routes easy tickets to a cheap classifier and only escalates ambiguous ones to GPT-4.

Decision: They implemented a two-tier triage architecture: a fine-tuned DistilBERT classifier handled straightforward tickets (single product, clear language), while GPT-4 handled ambiguous tickets (multi-product issues, vague descriptions, emotional language).

How: They fine-tuned DistilBERT on 10,000 tickets labeled by GPT-4 (the LLM bootstrap pattern). The classifier output a confidence score; tickets with confidence above 0.85 were auto-classified, and the rest were escalated to GPT-4. They tuned the threshold on a 1,000-ticket validation set reviewed by human agents.

Result: The DistilBERT model handled 78% of tickets at $0.0001 per ticket (essentially free). GPT-4 handled the remaining 22%. Overall accuracy was 94.5% (down only 1.5% from full GPT-4), and monthly costs dropped to $3,800, well within budget.

Lesson: The triage pattern (cheap model for easy cases, expensive model for hard cases) is the single most effective cost optimization for classification workloads, often achieving 70% to 80% cost reduction with minimal quality loss.

Fun Fact

The "triage" pattern in hybrid pipelines mirrors how hospital emergency rooms work: a nurse (the lightweight classifier) quickly assesses each patient (request) and routes them to the appropriate specialist (model tier). Most cases go to the general practitioner (small model), and only the truly complex ones see the surgeon (frontier LLM).

Research Frontier

Adaptive pipeline architectures. Research teams are building pipelines that dynamically adjust their ML/LLM mix based on real-time quality metrics and cost budgets. When quality dips below a threshold, the system automatically routes more traffic through larger models; when budgets tighten, it shifts toward cheaper classical components.

Pipeline compilation. Tools like DSPy compile multi-step LLM pipelines into optimized execution graphs, automatically selecting which steps need large models and which can use small models or classical components, based on validation set performance.

Observability for hybrid systems. Monitoring frameworks like LangSmith and Arize are adding specific support for hybrid ML/LLM pipelines, tracking quality and cost at each pipeline stage to identify optimization opportunities, a topic explored further in Section 29.5.

Exercises

Exercise 12.3.1: Triage pattern Conceptual

Describe the classical triage + LLM escalation pattern. What determines whether a request is handled by the fast classifier or escalated to the LLM?

Answer Sketch

Every incoming request first passes through a fast, cheap classifier. The classifier produces a prediction and a confidence score. If confidence exceeds a threshold (e.g., 0.85), the classifier's prediction is returned directly. If confidence is below the threshold, the request is escalated to an LLM for more careful analysis. The confidence threshold is tuned to balance cost (fewer LLM calls) against accuracy (catching hard cases).

Exercise 12.3.2: Confidence threshold tuning Coding

Write a Python function that evaluates different confidence thresholds (0.5 to 0.95 in steps of 0.05) for a triage system. For each threshold, compute: accuracy, percentage of requests escalated, and estimated cost.

Answer Sketch

For each threshold: escalated = classifier_conf < threshold. Accuracy = (correct predictions from classifier where confident) + (correct predictions from LLM where escalated). Escalation rate = sum(escalated) / len(escalated). Cost = base_cost * len(data) + llm_cost * sum(escalated). Plot cost vs. accuracy to find the Pareto-optimal threshold.
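The sweep described above can be sketched on synthetic holdout data. In the sketch below, the simulated classifier behavior, the assumed LLM accuracy (0.95), and the per-call costs are all illustrative assumptions.

```python
# Sketch of a confidence-threshold sweep: for each threshold, compute
# accuracy, escalation rate, and estimated cost. Synthetic data only.
import random

random.seed(42)
n = 1000
# Synthetic holdout: (classifier_confidence, classifier_correct)
data = []
for _ in range(n):
    conf = random.random()
    data.append((conf, random.random() < conf))  # higher conf -> more often right

LLM_ACCURACY, LLM_COST, CLF_COST = 0.95, 0.003, 0.00001  # assumed values

def evaluate(threshold: float) -> dict:
    escalated = sum(1 for conf, _ in data if conf < threshold)
    clf_correct = sum(1 for conf, ok in data if conf >= threshold and ok)
    accuracy = (clf_correct + escalated * LLM_ACCURACY) / n
    return {
        "threshold": threshold,
        "escalation_rate": escalated / n,
        "accuracy": accuracy,
        "cost": n * CLF_COST + escalated * LLM_COST,
    }

for t in [0.5, 0.6, 0.7, 0.8, 0.9]:
    r = evaluate(t)
    print(f"t={r['threshold']:.2f} esc={r['escalation_rate']:.0%} "
          f"acc={r['accuracy']:.3f} cost=${r['cost']:.2f}")
```

Plotting cost against accuracy across thresholds reveals the Pareto frontier the answer sketch refers to.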

Exercise 12.3.3: Ensemble voting Conceptual

Explain how an ensemble of a classical model and an LLM can be combined using weighted voting. How would you determine the optimal weights?

Answer Sketch

Each model produces a prediction and confidence score. The final prediction uses a weighted vote: score = w1 * classical_conf + w2 * llm_conf. Optimal weights are found by evaluating different weight combinations on a validation set. Typically, give the LLM higher weight for complex/ambiguous cases and the classical model higher weight for common, well-represented categories. Weights can also be learned via logistic regression on the stacked predictions.
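The weighted vote described above reduces to a small score combiner. The predictions and weights below are illustrative; in practice the weights come from a validation-set search or a stacked logistic regression.

```python
# Minimal weighted-vote combiner for a classical model and an LLM.
# Each model contributes (label, confidence); weights are illustrative.
from collections import defaultdict

def weighted_vote(preds: list[tuple[str, float]], weights: list[float]) -> str:
    """Sum weight * confidence per label; return the highest-scoring label."""
    scores = defaultdict(float)
    for (label, conf), w in zip(preds, weights):
        scores[label] += w * conf
    return max(scores, key=scores.get)

# The classical model is fairly confident, but the LLM carries more weight
classical = ("billing", 0.80)   # 0.4 * 0.80 = 0.32
llm = ("technical", 0.70)       # 0.8 * 0.70 = 0.56
print(weighted_vote([classical, llm], weights=[0.4, 0.8]))  # -> technical
```

When both models agree, the vote is trivially that label; the weights only matter on disagreements, which is where validation-set tuning pays off.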

Exercise 12.3.4: Cascading architecture Coding

Implement a three-tier cascade: (1) regex rules for obvious cases, (2) a fine-tuned BERT classifier for standard cases, (3) GPT-4o for complex cases. Each tier handles what the previous tier could not classify confidently.

Answer Sketch

Define classify(text) that first checks regex rules (returns immediately if matched). If no rule matches, run BERT and check confidence. If BERT confidence > 0.9, return its prediction. Otherwise, call GPT-4o with the text and structured output instructions. Track which tier handled each request for cost monitoring. The cascade reduces LLM usage to only the hardest 5 to 10% of requests.

Exercise 12.3.5: Router pattern Analysis

Compare the triage pattern (single classifier + LLM fallback) with the router pattern (a learned router that selects from multiple specialized models). In what scenarios does the router pattern justify its added complexity?

Answer Sketch

The triage pattern works well with two options (fast vs. accurate). The router pattern shines when you have 3+ specialized models (e.g., a code model, a medical model, a general model) and different request types benefit from different models. The router adds complexity (training the routing model, maintaining multiple endpoints) but reduces cost by matching each request to the cheapest adequate model. Justify it when request types are diverse and model capabilities vary significantly across domains.

What Comes Next

In Section 12.4: Cost-Performance Optimization at Scale, we turn to balancing quality, latency, and budget in hybrid architectures.

References and Further Reading
Hybrid Architecture Research

Sanh, V. et al. (2019). DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. NeurIPS 2019 Workshop.

Introduces knowledge distillation applied to BERT, producing a model 60% smaller and 60% faster while retaining 97% of performance. DistilBERT is the workhorse of the triage pattern discussed in this section, handling easy cases cheaply so expensive LLMs focus on hard ones. Essential background for the cascade architecture.

Paper

Chen, L. et al. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.

Presents strategies for cascading, caching, and routing across multiple LLMs to minimize cost while maintaining quality. The FrugalGPT framework directly informs the hybrid pipeline patterns in this section, particularly the model cascade and confidence-based routing approaches. A must-read for cost-conscious production teams.

Paper

Ding, D. et al. (2024). Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing.

Proposes a learned router that directs queries to either a small or large LLM based on predicted difficulty. The routing approach achieves GPT-4-level quality on 80% of queries while using the cheaper model for easy inputs. Directly applicable to the confidence-based routing patterns covered in this section.

Paper

Madaan, A. et al. (2023). Automix: Automatically Mixing Language Models.

Introduces automatic model mixing, where a meta-verifier determines whether to accept a smaller model's output or escalate to a larger one. The self-verification approach is a practical alternative to training a separate routing classifier. Useful for teams wanting to implement cascades without building custom routers.

Paper
Routing and Practical Tools

Shnitzer, T. et al. (2023). Large Language Model Routing with Benchmark Datasets.

Explores how benchmark performance data can train routers that select the best LLM for each query type. The paper shows that routing can outperform any single model while reducing average cost. Recommended for teams managing multiple LLM providers who want to optimize the quality-cost tradeoff automatically.

Paper

Scikit-learn. (2024). Ensemble Methods Documentation.

The official scikit-learn guide to ensemble methods including Random Forests, Gradient Boosting, and stacking classifiers. These are the classical ML building blocks most commonly used in the hybrid pipelines discussed here. A practical reference for implementing the classical ML components of any hybrid architecture.

Documentation