"Theory is knowing the five role patterns. Practice is discovering that your classifier needs a researcher behind it, your researcher needs a verifier beside it, and your verifier needs a human above it."
Deploy, Pattern Cataloging AI Agent
Frameworks only prove their worth when applied to messy, real-world problems. The AI Role Canvas and Feasibility Scorecard from the previous two sections are powerful planning tools, but reading about them in isolation can leave you wondering how they play out in practice. This section walks through four detailed case studies, each covering the initial hypothesis, the role assignment, what went wrong, how the team iterated, and the final architecture that shipped. The goal is not to memorize these specific products, but to internalize the pattern of deliberate role selection, honest feasibility assessment, and iterative refinement.
Prerequisites
This section builds directly on the AI Role Canvas (Section 36.2) and the Feasibility Scorecard (Section 36.3). It also draws on concepts from RAG pipelines (Chapter 20), AI agents (Chapter 22), and evaluation (Chapter 29).
1. Case Study: Customer Support Email Drafter
Case studies teach pattern recognition, not recipes. The specific industries and numbers in these examples will differ from your situation. What transfers is the decision-making process: how each team assessed feasibility, chose an autonomy level, iterated on failures, and converged on an architecture that balanced ambition with reliability. Read for the process, not the product.
1.1 The Scenario
A mid-sized SaaS company receives roughly 2,000 support tickets per day. Their support agents spend an average of eight minutes per reply, with much of that time copying and adapting boilerplate responses. The product team hypothesizes that an LLM can draft replies for agents to review and send, cutting average handling time by 40%.
1.2 Role Assignment
Using the AI Role Canvas from Section 36.2, the team assigns the model the Drafter role. The human agent serves as editor and approver: every draft must be reviewed before the customer sees it. The model pulls context from a retrieval-augmented generation (RAG) pipeline grounded on the company's internal knowledge base, following the architecture patterns described in Chapter 20.
1.3 Feasibility Assessment
The Feasibility Scorecard yields a favorable result. The domain is well-studied (customer support is one of the most common LLM applications). The acceptable error rate is relatively high because a human reviews every draft. Latency is generous (agents can wait 3 to 5 seconds for a draft). Cost per ticket is low at roughly $0.02 using a mid-tier model. The main risk is data privacy: customer PII flows through the model, so the team configures a PII-scrubbing preprocessor.
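A PII-scrubbing preprocessor of the kind mentioned above can start as a simple regex pass over the ticket text before it reaches the model. The patterns and placeholder tokens below are illustrative assumptions, not the team's actual implementation; a production scrubber would use a dedicated PII-detection library and cover many more entity types:

```python
import re

# Illustrative patterns only. Real deployments need broader coverage
# (names, addresses, account numbers) and a proper PII library.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders (rather than blanket redaction) let the model still produce a coherent draft, since it can see that an email address or phone number was present without seeing its value.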
1.4 What Went Wrong
The first version had two problems. First, agents reported that 30% of drafts referenced product features that had been deprecated months ago. The RAG pipeline was retrieving outdated knowledge base articles that no one had archived. Second, agents began rubber-stamping drafts without reading them carefully, a phenomenon known as automation complacency. Two incorrect refund amounts reached customers before anyone noticed.
1.5 Iteration and Final Architecture
The team addressed the stale-content problem by adding a metadata freshness filter to their retrieval step: articles older than 90 days without an update were excluded from retrieval results. To combat automation complacency, they introduced a mandatory confirmation step where the agent must select the relevant knowledge base article from a short list before the draft sends. This small friction point forced agents to engage with the content rather than blindly approving.
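The freshness filter can be sketched as a post-retrieval step over the article metadata. The `Article` structure and field names are assumptions; the 90-day cutoff mirrors the description above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Article:
    title: str
    updated_at: datetime  # last-edited timestamp from the KB's metadata

def filter_stale(articles: list[Article], max_age_days: int = 90) -> list[Article]:
    """Drop retrieved articles that have not been updated recently."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [a for a in articles if a.updated_at >= cutoff]
```

Running this after retrieval (rather than excluding stale articles from the index) keeps the fix cheap to deploy and easy to tune; the cutoff is just a parameter.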
The Drafter role is only as safe as its human checkpoint. If the review step degenerates into rubber-stamping, you have an unmonitored autopilot disguised as a copilot. Design friction into the approval workflow: require agents to confirm a specific detail, select a source, or edit at least one field before sending. The evaluation framework from Chapter 29 can track approval-without-edit rates as an early warning signal.
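Tracking the approval-without-edit rate needs nothing more than a counter over recent review events. This sketch assumes a minimal event record; a real system would window the events by time and segment them per agent:

```python
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    draft_id: str
    was_edited: bool  # did the agent change anything before sending?

def approval_without_edit_rate(events: list[ReviewEvent]) -> float:
    """Fraction of drafts sent verbatim; a rising value signals rubber-stamping."""
    if not events:
        return 0.0
    unedited = sum(1 for e in events if not e.was_edited)
    return unedited / len(events)
```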
2. Case Study: Legal Research Assistant
2.1 The Scenario
A legal technology startup wants to build a research assistant that helps attorneys find relevant case law and statutes for their briefs. The initial vision is ambitious: the model should autonomously research a legal question, synthesize relevant precedents, and produce a draft memo with full citations.
2.2 Initial Role Assignment (Autopilot)
The founding team, excited by impressive demos, assigns the model a combined Researcher + Drafter role with full autonomy. The hypothesis is that attorneys will paste in a legal question and receive a complete research memo within minutes. No human review step is included in the initial design.
2.3 Feasibility Assessment (Should Have Been a Warning)
The Feasibility Scorecard should have raised red flags. The domain has near-zero tolerance for hallucination: a fabricated case citation in a legal brief can result in sanctions from the court, as several high-profile incidents have demonstrated. The acceptable error rate is effectively 0% for citation accuracy. The data sources (case law databases) require licensed API access with strict usage terms. Despite these signals, the team proceeded with the autopilot design.
2.4 What Went Wrong
During beta testing with a dozen law firms, the results were sobering. The model hallucinated case citations in approximately 12% of memos. Some of these hallucinated cases were plausible enough that junior associates did not catch them. One fabricated citation made it into a draft brief before a senior partner flagged it during review. The beta firms lost confidence in the product and two withdrew from the pilot.
Who: The engineering lead at a legal-tech company whose AI research assistant had been hallucinating case citations.
Situation: The original system used a single model call to retrieve, verify, and summarize case law simultaneously. Citation accuracy sat at 88%, and two beta law firms withdrew from the pilot after a fabricated case nearly appeared in a court brief.
Problem: A single model call was performing three distinct jobs (retrieval, verification, drafting), making it impossible to isolate which step introduced errors or to apply targeted guardrails.
Decision: The team decomposed the pipeline into three discrete roles following the principle from Section 36.2: (1) a Researcher that retrieves candidate cases from a verified legal database using structured queries, (2) a Verifier that cross-checks every citation against the source database and flags unconfirmed ones, and (3) a Drafter that synthesizes only verified citations into a memo for attorney review.
Result: Citation accuracy jumped from 88% to 99.6%. The pipeline is slower (45 to 90 seconds instead of 15), but attorneys trust the output because every citation links to a verifiable source, and unverified citations display a warning badge.
Lesson: Splitting a complex AI task into single-responsibility roles lets you evaluate, debug, and guardrail each step independently.
2.5 Iteration and Final Architecture
The team completely restructured the product. They pivoted from autopilot to copilot, introducing mandatory human review at two stages. The Researcher role was constrained to retrieving and ranking results from a verified case law API rather than generating citations from the model's parametric memory. A separate Verifier model checks every citation against the source database. The Drafter model then synthesizes only verified content into a memo that the attorney reviews and edits.
The result is slower (the full pipeline takes 45 to 90 seconds instead of the original 15 seconds), but citation accuracy jumped from 88% to 99.6%. Attorneys trust the output because every citation links back to a verifiable source, and the product explicitly marks any citation it could not verify with a warning badge.
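The Researcher, Verifier, Drafter decomposition can be sketched as follows. The model interfaces (`retrieve`, `confirm`, `synthesize`) and the `CitationStatus` marking are assumptions about the shape of such a system, not the startup's actual code:

```python
from dataclasses import dataclass
from enum import Enum

class CitationStatus(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"  # rendered with a warning badge in the UI

@dataclass
class Citation:
    case_name: str
    status: CitationStatus

def research_memo(question: str, researcher, verifier, drafter):
    """Researcher -> Verifier -> Drafter pipeline, with attorney review downstream.

    Only citations the Verifier confirms against the source database reach
    the Drafter; everything else is surfaced with a warning badge.
    """
    candidates = researcher.retrieve(question)  # structured DB queries, not parametric memory
    citations = [
        Citation(
            c.case_name,
            CitationStatus.VERIFIED if verifier.confirm(c) else CitationStatus.UNVERIFIED,
        )
        for c in candidates
    ]
    verified = [c for c in citations if c.status is CitationStatus.VERIFIED]
    memo = drafter.synthesize(question, verified)  # attorney reviews and edits this
    return memo, citations
```

Because each role has a single responsibility, each stage can be evaluated independently: retrieval recall for the Researcher, false-confirmation rate for the Verifier, and synthesis quality for the Drafter.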
3. Case Study: Content Moderation Classifier
3.1 The Scenario
A social media platform processes 500,000 user-generated posts per hour. Their existing rule-based moderation system catches obvious violations but misses nuanced cases: sarcasm, coded language, and context-dependent toxicity. The team wants an LLM-based classifier to handle these edge cases.
3.2 Role Assignment
The model is assigned the Classifier role. However, the volume and latency requirements (each post must be classified within 200 milliseconds) make it impractical to run every post through a frontier model. The team designs a cascading architecture: a small, fast model handles clear-cut cases, and only ambiguous cases (those with confidence scores between 0.4 and 0.7) are escalated to a frontier model for a second opinion.
This cascading pattern aligns with the inference optimization strategies covered in Chapter 9. The small model runs at 50 milliseconds per classification; the frontier model at 2 seconds. Since only about 8% of posts land in the ambiguous zone, the median latency stays well within the 200-millisecond budget, with only the escalated minority waiting on the slower model's verdict.
3.3 Feasibility Assessment
The Feasibility Scorecard highlights two competing risks. False negatives (missing harmful content) damage user safety and invite regulatory scrutiny. False positives (incorrectly removing benign content) frustrate users and generate appeals. The team sets asymmetric thresholds: the system errs toward flagging borderline content for human review rather than auto-removing it, reserving auto-removal only for clear violations with confidence above 0.95.
3.4 What Went Wrong
The initial deployment revealed a distribution shift problem. The small model had been trained on English-language data but the platform served a global audience. Posts in languages with less training data were being classified with artificially high confidence, leading to over-removal of legitimate content in several languages. The team also discovered that the cascading threshold (0.4 to 0.7) was too narrow: many genuinely ambiguous posts received confidence scores of 0.75 from the small model and bypassed the frontier model entirely.
3.5 Iteration and Final Architecture
The team made three changes. First, they added language detection as a preprocessing step and routed non-English posts directly to the frontier model until they could train or fine-tune language-specific classifiers. Second, they widened the escalation band to 0.3 to 0.85, which increased the percentage of posts hitting the frontier model to about 15% but significantly improved accuracy. Third, they implemented a feedback loop: every human moderator decision on an appealed post was fed back into the training pipeline for the small model, creating a continuous improvement cycle.
# Code Fragment 36.4.1: Cascading classification architecture
from dataclasses import dataclass
from enum import Enum


class ModerationDecision(Enum):
    ALLOW = "allow"
    ESCALATE_TO_FRONTIER = "escalate_to_frontier"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    REMOVE = "remove"


@dataclass
class ClassificationResult:
    label: str
    confidence: float
    model_tier: str


def classify_with_cascade(
    post_text: str,
    post_language: str,
    fast_model,
    frontier_model,
    escalation_low: float = 0.3,
    escalation_high: float = 0.85,
    auto_remove_threshold: float = 0.95,
) -> ModerationDecision:
    """Route posts through a two-tier classification cascade.

    Posts in under-resourced languages skip the fast model entirely.
    Ambiguous scores are escalated to the frontier model, and
    borderline harmful labels go to a human moderator.
    """
    supported_languages = {"en", "es", "fr", "de", "pt", "ja", "ko", "zh"}

    # Step 1: Language-based routing
    if post_language not in supported_languages:
        result = frontier_model.classify(post_text)
    else:
        # Step 2: Fast model first pass
        result = fast_model.classify(post_text)
        # Step 3: Escalate ambiguous scores to the frontier model
        if escalation_low <= result.confidence <= escalation_high:
            result = frontier_model.classify(post_text)

    # Step 4: Decision based on final label and confidence
    if result.label == "safe":
        return ModerationDecision.ALLOW
    if result.confidence >= auto_remove_threshold:
        return ModerationDecision.REMOVE
    if result.confidence >= escalation_low:
        # Asymmetric policy (Section 3.3): borderline harmful content
        # goes to a human rather than being auto-removed or allowed.
        return ModerationDecision.ESCALATE_TO_HUMAN
    return ModerationDecision.ALLOW
The content moderation team discovered that their frontier model was exceptionally good at detecting sarcasm in English but interpreted deadpan humor in Finnish as genuine threats. The lesson: cultural context is not a hyperparameter you can tune. When your classifier operates across cultures, invest in per-language evaluation sets (see Chapter 29 on building representative test suites).
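The third change from Section 3.5, feeding human moderator decisions back into the training pipeline, can be sketched as an append-only buffer of appealed decisions that periodically becomes a fine-tuning batch for the small model. The record fields and batch-size trigger are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class AppealDecision:
    post_text: str
    model_label: str       # what the classifier said
    moderator_label: str   # what the human decided on appeal

@dataclass
class FeedbackLoop:
    """Collect appealed decisions until enough exist to retrain."""
    retrain_batch_size: int = 500
    pending: list[AppealDecision] = field(default_factory=list)

    def record(self, decision: AppealDecision):
        """Log a decision; return a training batch when the buffer fills."""
        self.pending.append(decision)
        if len(self.pending) >= self.retrain_batch_size:
            batch, self.pending = self.pending, []
            return batch  # hand off to the fine-tuning pipeline
        return None
```

Appealed posts are exactly the cases the model got wrong (or nearly wrong), so this buffer concentrates the hardest examples, which is what makes the continuous improvement cycle effective.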
4. Case Study: Internal Knowledge Q&A Bot
4.1 The Scenario
A 5,000-person enterprise wants to reduce the load on its IT and HR helpdesks. Employees submit roughly 800 questions per day, and 60% of those are answered by existing documentation (benefits policies, VPN setup guides, expense report procedures). The team proposes an internal Q&A bot that can answer common questions instantly and route complex ones to the appropriate human team.
4.2 Role Assignment
The team assigns two roles. A Router model first determines whether the question falls into IT, HR, Facilities, or "other" categories. Based on the category, the question is passed to a Researcher model that retrieves and synthesizes answers from category-specific document collections using the RAG pipeline patterns from Chapter 20. For questions the Researcher cannot answer with high confidence, the system escalates to the appropriate human team via the existing ticketing system.
4.3 Feasibility Assessment
The Feasibility Scorecard is mostly green. The domain is constrained (internal company knowledge only). The acceptable error rate is moderate because employees can always fall back to submitting a ticket. Latency budget is 5 seconds. Cost is manageable at roughly $0.01 per query. The primary risk is data sensitivity: the knowledge base contains internal policies, salary bands, and organizational charts. The team restricts the model to a private deployment with no data leaving the company's cloud environment.
4.4 What Went Wrong
Three issues emerged during the pilot. First, the Router misclassified 18% of questions, sending HR benefits questions to the IT knowledge base and vice versa. Second, the Researcher produced confident-sounding answers even when the retrieved documents were only tangentially relevant, a failure mode where the model synthesized plausible but incorrect answers from loosely related content. Third, employees began asking the bot questions it was never designed to handle: career advice, interpersonal conflicts, and complaints about management. The bot attempted to answer these, sometimes with amusing and sometimes with problematic results.
4.5 Iteration and Final Architecture
The team made four targeted changes based on the failure data. First, they improved the Router by fine-tuning it on 2,000 labeled routing examples collected during the pilot, reducing misclassification from 18% to 4%. Second, they added a relevance score threshold to the Researcher: if the top retrieved document scored below 0.7 on semantic similarity, the bot responds with "I could not find a confident answer" and offers to create a ticket. Third, they added an explicit scope filter using a lightweight Classifier that detects out-of-scope questions and responds with a friendly redirect. Fourth, they implemented the agent pattern from Chapter 22, where the Router, scope Classifier, and Researcher operate as a coordinated pipeline with shared context.
# Code Fragment 36.4.2: Multi-role Q&A pipeline architecture
from dataclasses import dataclass


@dataclass
class QueryResult:
    answer: str
    sources: list[str]
    confidence: float
    routed_to: str


@dataclass
class ScopeCheck:
    in_scope: bool
    redirect_message: str | None


def handle_employee_query(
    question: str,
    scope_classifier,
    router_model,
    researcher_model,
    knowledge_bases: dict[str, object],
    relevance_threshold: float = 0.7,
    confidence_threshold: float = 0.6,
) -> QueryResult:
    """Process an employee question through the multi-role pipeline.

    Steps:
    1. Scope classifier filters out-of-scope questions.
    2. Router assigns the question to a category.
    3. Researcher retrieves and synthesizes from the right KB.
    4. Low-confidence answers are escalated to humans.
    """
    # Step 1: Scope check
    scope = scope_classifier.check(question)
    if not scope.in_scope:
        return QueryResult(
            answer=scope.redirect_message or (
                "This question is outside my expertise. "
                "I can help with IT, HR, and Facilities questions. "
                "Would you like me to create a ticket instead?"
            ),
            sources=[],
            confidence=0.0,
            routed_to="out_of_scope",
        )

    # Step 2: Route to category
    category = router_model.classify(question)

    # Step 3: Retrieve and synthesize
    kb = knowledge_bases.get(category.label)
    if kb is None:
        return _escalate_to_human(question, category.label)
    docs = kb.retrieve(question, top_k=5)
    top_relevance = max((d.score for d in docs), default=0.0)
    if top_relevance < relevance_threshold:
        return _escalate_to_human(question, category.label)

    # Step 4: Generate answer from relevant documents
    answer = researcher_model.synthesize(question, docs)
    if answer.confidence < confidence_threshold:
        return _escalate_to_human(question, category.label)

    return QueryResult(
        answer=answer.text,
        sources=[d.title for d in docs[:3]],
        confidence=answer.confidence,
        routed_to=category.label,
    )


def _escalate_to_human(question: str, category: str) -> QueryResult:
    """Create a support ticket when the bot cannot answer confidently."""
    return QueryResult(
        answer=(
            f"I was not able to find a confident answer for your question. "
            f"I have created a ticket with the {category} team, and someone "
            f"will follow up with you shortly."
        ),
        sources=[],
        confidence=0.0,
        routed_to=f"{category}_human_escalation",
    )
5. Lessons Learned Across All Four Cases
Stepping back from the individual case studies, several patterns emerge consistently:
- Every team overestimated initial model performance. Demo-quality results did not survive contact with production data. The support email drafter hit stale content. The legal assistant hallucinated citations. The content moderator failed on non-English posts. The Q&A bot could not handle out-of-scope questions. In each case, the gap between demo and production was the same gap described in Section 36.1.
- Decomposition beats monolithic design. The legal assistant improved dramatically when a single "do everything" model call was split into three discrete roles (Researcher, Verifier, Drafter). The Q&A bot stabilized when routing and answering were separated. Single-role model calls are easier to evaluate, debug, and improve independently.
- Confidence thresholds require calibration, not guessing. The content moderator's initial escalation band was too narrow. The Q&A bot's relevance threshold was set arbitrarily. In both cases, the teams needed real production data to calibrate these thresholds properly. Build your thresholds as configurable parameters from day one.
- Human checkpoints need design, not just existence. The support email drafter had a human review step, but agents still rubber-stamped drafts. A checkpoint that does not force engagement is functionally equivalent to no checkpoint at all. Design the checkpoint to require a specific action (selecting a source, confirming a value, editing a field).
- Feedback loops close the quality gap. The content moderator improved continuously by feeding human decisions back into training. The Q&A bot's router improved through fine-tuning on labeled examples from the pilot. Every case study that included a feedback loop outperformed those that treated the model as a static component.
The iteration pattern is the product. None of these four products shipped their first architecture. Every team went through at least one significant redesign driven by production failures. The AI Role Canvas and Feasibility Scorecard did not prevent these failures, but they made the failures diagnosable. When the legal assistant hallucinated citations, the team could trace the problem to a specific canvas decision (no Verifier role, no citation grounding). When the Q&A bot misrouted questions, the team could trace it to an under-specified Router with no fine-tuning data. Structured planning tools do not eliminate iteration; they make iteration productive.
- The Drafter role is the safest starting point for most applications. Customer support, legal memos, and Q&A answers all benefited from having a human review step. Even when the end goal is full autonomy, starting with a drafter workflow provides the production data you need to earn that autonomy.
- High-stakes domains demand multi-role pipelines. The legal research assistant only became viable when a single model call was decomposed into Researcher, Verifier, and Drafter stages, each with independent evaluation criteria.
- Cascading architectures balance cost and quality. The content moderator used a fast small model for clear cases and a frontier model for ambiguous ones, keeping average latency and cost low while achieving high accuracy on hard cases.
- Scope boundaries prevent embarrassing failures. The Q&A bot's biggest early problem was not wrong answers but answers to questions it should never have attempted. Explicit scope classifiers are cheap to build and prevent entire categories of failure.
- Production data always surprises you. Stale knowledge base articles, under-resourced languages, out-of-scope questions, and rubber-stamped approvals were not predicted during planning. Build your system to collect, measure, and learn from these surprises.
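The lesson about calibrated, configurable thresholds can be made concrete with a small config object loaded at startup rather than constants scattered through the code. The field names echo the moderation and Q&A case studies, but the JSON shape and loader are assumptions:

```python
from dataclasses import dataclass
import json

@dataclass
class ThresholdConfig:
    """Thresholds as data, so production calibration needs no code change."""
    escalation_low: float = 0.3
    escalation_high: float = 0.85
    auto_remove: float = 0.95
    relevance: float = 0.7

    @classmethod
    def from_json(cls, text: str) -> "ThresholdConfig":
        # Unspecified fields keep their defaults; calibrated values override.
        return cls(**json.loads(text))

# Calibrated values ship in a config file, tuned from production data.
config = ThresholdConfig.from_json('{"escalation_high": 0.9, "relevance": 0.65}')
```

Widening the moderator's escalation band from (0.4, 0.7) to (0.3, 0.85) then becomes a config deploy rather than a code release, which is what makes threshold calibration cheap enough to do repeatedly.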
What Comes Next
This is the final section of Chapter 36. You now have the complete toolkit for moving from an idea to a validated product hypothesis: a product mindset (Section 36.1), role assignment via the AI Role Canvas (Section 36.2), feasibility scoring (Section 36.3), and the case-study patterns from this section. In Chapter 37: Building and Steering AI Products, you will learn how to take your validated hypothesis and turn it into a working prototype, covering prompt iteration, evaluation harnesses, deployment pipelines, and the feedback loops that keep your product improving after launch.
Bibliography
Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G., Arawjo, I. (2024). "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv:2404.12272
Lewis, P., Perez, E., Piktus, A., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems, 33. arXiv:2005.11401
Parasuraman, R. and Manzey, D. H. (2010). "Complacency and Bias in Human Use of Automation." Human Factors, 52(3), 381-410. doi:10.1177/0018720810376055
Zaharia, M., Khattab, O., Chen, L., et al. (2024). "The Shift from Models to Compound AI Systems." Berkeley AI Research (BAIR) Blog.
