Part 11: From Idea to AI Product
Chapter 36 · Section 36.4

Case Studies: Role Assignment in Practice

"Theory is knowing the five role patterns. Practice is discovering that your classifier needs a researcher behind it, your researcher needs a verifier beside it, and your verifier needs a human above it."

Deploy, Pattern Cataloging AI Agent
Big Picture

Frameworks only prove their worth when applied to messy, real-world problems. The AI Role Canvas and Feasibility Scorecard from the previous two sections are powerful planning tools, but reading about them in isolation can leave you wondering how they play out in practice. This section walks through four detailed case studies, each covering the initial hypothesis, the role assignment, what went wrong, how the team iterated, and the final architecture that shipped. The goal is not to memorize these specific products, but to internalize the pattern of deliberate role selection, honest feasibility assessment, and iterative refinement.

Prerequisites

This section builds directly on the AI Role Canvas (Section 36.2) and the Feasibility Scorecard (Section 36.3). It also draws on concepts from RAG pipelines (Chapter 20), AI agents (Chapter 22), and evaluation (Chapter 29).

Four open case files on a desk showing different industry scenarios: hospital triage, e-commerce ranking, legal review, and customer support routing, with a magnifying glass hovering above.
Figure 36.4.1: Four real-world case studies show how role assignment, feasibility scoring, and iterative refinement play out across healthcare, e-commerce, legal, and customer support domains.

1. Case Study: Customer Support Email Drafter

Key Insight

Case studies teach pattern recognition, not recipes. The specific industries and numbers in these examples will differ from your situation. What transfers is the decision-making process: how each team assessed feasibility, chose an autonomy level, iterated on failures, and converged on an architecture that balanced ambition with reliability. Read for the process, not the product.

1.1 The Scenario

A mid-sized SaaS company receives roughly 2,000 support tickets per day. Their support agents spend an average of eight minutes per reply, with much of that time copying and adapting boilerplate responses. The product team hypothesizes that an LLM can draft replies for agents to review and send, cutting average handling time by 40%.

1.2 Role Assignment

Using the AI Role Canvas from Section 36.2, the team assigns the model the Drafter role. The human agent serves as editor and approver: every draft must be reviewed before the customer sees it. The model pulls context from a retrieval-augmented generation (RAG) pipeline grounded on the company's internal knowledge base, following the architecture patterns described in Chapter 20.

1.3 Feasibility Assessment

The Feasibility Scorecard yields a favorable result. The domain is well-studied (customer support is one of the most common LLM applications). The acceptable error rate is relatively high because a human reviews every draft. Latency is generous (agents can wait 3 to 5 seconds for a draft). Cost per ticket is low at roughly $0.02 using a mid-tier model. The main risk is data privacy: customer PII flows through the model, so the team configures a PII-scrubbing preprocessor.
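A PII-scrubbing preprocessor can start as a regex pass over the ticket text before anything reaches the model. The sketch below is illustrative only: the pattern set and placeholder tokens are assumptions, and a production scrubber would use a dedicated PII-detection library covering many more entity types.

```python
import re

# Illustrative patterns; real deployments need broader coverage
# (names, addresses, account numbers) via a dedicated PII library.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scrub_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders (rather than blanket redaction) let the drafting model still produce coherent replies, e.g. "I have updated the account associated with [EMAIL]."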

1.4 What Went Wrong

The first version had two problems. First, agents reported that 30% of drafts referenced product features that had been deprecated months ago. The RAG pipeline was retrieving outdated knowledge base articles that no one had archived. Second, agents began rubber-stamping drafts without reading them carefully, a phenomenon known as automation complacency. Two incorrect refund amounts reached customers before anyone noticed.

1.5 Iteration and Final Architecture

The team addressed the stale-content problem by adding a metadata freshness filter to their retrieval step: articles older than 90 days without an update were excluded from retrieval results. To combat automation complacency, they introduced a mandatory confirmation step where the agent must select the relevant knowledge base article from a short list before the draft sends. This small friction point forced agents to engage with the content rather than blindly approving.

Key Insight

The Drafter role is only as safe as its human checkpoint. If the review step degenerates into rubber-stamping, you have an unmonitored autopilot disguised as a copilot. Design friction into the approval workflow: require agents to confirm a specific detail, select a source, or edit at least one field before sending. The evaluation framework from Chapter 29 can track approval-without-edit rates as an early warning signal.
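Tracking the approval-without-edit rate needs nothing more than the agent event log. A minimal sketch under an assumed event schema (the `DraftEvent` fields are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class DraftEvent:
    agent_id: str
    edited: bool    # did the agent change anything before sending?
    approved: bool  # was the draft sent to the customer?


def rubber_stamp_rate(events: list[DraftEvent]) -> float:
    """Fraction of approved drafts sent without a single edit.

    A rising value is an early warning sign of automation complacency.
    """
    approved = [e for e in events if e.approved]
    if not approved:
        return 0.0
    return sum(1 for e in approved if not e.edited) / len(approved)
```

Alerting when this metric drifts upward per agent or per team turns the qualitative insight above into a monitorable signal.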

2. Case Study: Legal Research Assistant

2.1 The Scenario

A legal technology startup wants to build a research assistant that helps attorneys find relevant case law and statutes for their briefs. The initial vision is ambitious: the model should autonomously research a legal question, synthesize relevant precedents, and produce a draft memo with full citations.

2.2 Initial Role Assignment (Autopilot)

The founding team, excited by impressive demos, assigns the model a combined Researcher + Drafter role with full autonomy. The hypothesis is that attorneys will paste in a legal question and receive a complete research memo within minutes. No human review step is included in the initial design.

2.3 Feasibility Assessment (Should Have Been a Warning)

The Feasibility Scorecard should have raised red flags. The domain has near-zero tolerance for hallucination: a fabricated case citation in a legal brief can result in sanctions from the court, as several high-profile incidents have demonstrated. The acceptable error rate is effectively 0% for citation accuracy. The data sources (case law databases) require licensed API access with strict usage terms. Despite these signals, the team proceeded with the autopilot design.

2.4 What Went Wrong

During beta testing with a dozen law firms, the results were sobering. The model hallucinated case citations in approximately 12% of memos. Some of these hallucinated cases were plausible enough that junior associates did not catch them. One fabricated citation made it into a draft brief before a senior partner flagged it during review. The beta firms lost confidence in the product and two withdrew from the pilot.

Real-World Scenario: Decomposing a Legal Research Pipeline

Who: The engineering lead at a legal-tech company whose AI research assistant had been hallucinating case citations.

Situation: The original system used a single model call to retrieve, verify, and summarize case law simultaneously. Citation accuracy sat at 88%, and two beta law firms withdrew from the pilot after a fabricated case nearly appeared in a court brief.

Problem: A single model call was performing three distinct jobs (retrieval, verification, drafting), making it impossible to isolate which step introduced errors or to apply targeted guardrails.

Decision: The team decomposed the pipeline into three discrete roles following the principle from Section 36.2: (1) a Researcher that retrieves candidate cases from a verified legal database using structured queries, (2) a Verifier that cross-checks every citation against the source database and flags unconfirmed ones, and (3) a Drafter that synthesizes only verified citations into a memo for attorney review.

Result: Citation accuracy jumped from 88% to 99.6%. The pipeline is slower (45 to 90 seconds instead of 15), but attorneys trust the output because every citation links to a verifiable source, and unverified citations display a warning badge.

Lesson: Splitting a complex AI task into single-responsibility roles lets you evaluate, debug, and guardrail each step independently.

2.5 Iteration and Final Architecture

The team completely restructured the product. They pivoted from autopilot to copilot, introducing mandatory human review at two stages. The Researcher role was constrained to retrieving and ranking results from a verified case law API rather than generating citations from the model's parametric memory. A separate Verifier model checks every citation against the source database. The Drafter model then synthesizes only verified content into a memo that the attorney reviews and edits.

The result is slower (the full pipeline takes 45 to 90 seconds instead of the original 15 seconds), but citation accuracy jumped from 88% to 99.6%. Attorneys trust the output because every citation links back to a verifiable source, and the product explicitly marks any citation it could not verify with a warning badge.
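The restructured pipeline can be sketched as three single-responsibility steps, with unverified citations flagged rather than silently dropped. The interfaces below are illustrative stand-ins for the licensed case law API and the three models, not the startup's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    case_name: str
    reporter_id: str
    verified: bool = False


@dataclass
class Memo:
    body: str
    citations: list[Citation] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def research_memo(question: str, researcher, verifier_db, drafter) -> Memo:
    """Researcher retrieves candidates, every citation is cross-checked
    against the source database, and the Drafter sees only verified ones."""
    candidates: list[Citation] = researcher.retrieve(question)

    verified, warnings = [], []
    for cite in candidates:
        if verifier_db.exists(cite.reporter_id):  # cross-check, never trust
            cite.verified = True
            verified.append(cite)
        else:
            warnings.append(f"Unverified citation: {cite.case_name}")

    body = drafter.synthesize(question, verified)
    return Memo(body=body, citations=verified, warnings=warnings)
```

Because each step has one job, the team can evaluate retrieval recall, verification accuracy, and draft quality as separate metrics instead of one opaque end-to-end score.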

3. Case Study: Content Moderation Classifier

3.1 The Scenario

A social media platform processes 500,000 user-generated posts per hour. Their existing rule-based moderation system catches obvious violations but misses nuanced cases: sarcasm, coded language, and context-dependent toxicity. The team wants an LLM-based classifier to handle these edge cases.

3.2 Role Assignment

The model is assigned the Classifier role. However, the volume and latency requirements (each post must be classified within 200 milliseconds) make it impractical to run every post through a frontier model. The team designs a cascading architecture: a small, fast model handles clear-cut cases, and only ambiguous cases (those with confidence scores between 0.4 and 0.7) are escalated to a frontier model for a second opinion.

This cascading pattern aligns with the inference optimization strategies covered in Chapter 9. The small model runs at 50 milliseconds per classification; the frontier model at 2 seconds. Since only about 8% of posts land in the ambiguous zone, the median latency stays at the fast model's 50 milliseconds, comfortably within the 200-millisecond budget, while the escalated minority is handled on a slower review path.

3.3 Feasibility Assessment

The Feasibility Scorecard highlights two competing risks. False negatives (missing harmful content) damage user safety and invite regulatory scrutiny. False positives (incorrectly removing benign content) frustrate users and generate appeals. The team sets asymmetric thresholds: the system errs toward flagging borderline content for human review rather than auto-removing it, reserving auto-removal only for clear violations with confidence above 0.95.

3.4 What Went Wrong

The initial deployment revealed a distribution shift problem. The small model had been trained on English-language data but the platform served a global audience. Posts in languages with less training data were being classified with artificially high confidence, leading to over-removal of legitimate content in several languages. The team also discovered that the cascading threshold (0.4 to 0.7) was too narrow: many genuinely ambiguous posts received confidence scores of 0.75 from the small model and bypassed the frontier model entirely.

3.5 Iteration and Final Architecture

The team made three changes. First, they added language detection as a preprocessing step and routed non-English posts directly to the frontier model until they could train or fine-tune language-specific classifiers. Second, they widened the escalation band to 0.3 to 0.85, which increased the percentage of posts hitting the frontier model to about 15% but significantly improved accuracy. Third, they implemented a feedback loop: every human moderator decision on an appealed post was fed back into the training pipeline for the small model, creating a continuous improvement cycle.

# Code Fragment 36.4.1: Cascading classification architecture
from dataclasses import dataclass
from enum import Enum


class ModerationDecision(Enum):
    ALLOW = "allow"
    ESCALATE_TO_FRONTIER = "escalate_to_frontier"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    REMOVE = "remove"


@dataclass
class ClassificationResult:
    label: str
    confidence: float
    model_tier: str


def classify_with_cascade(
    post_text: str,
    post_language: str,
    fast_model,
    frontier_model,
    escalation_low: float = 0.3,
    escalation_high: float = 0.85,
    auto_remove_threshold: float = 0.95,
) -> ModerationDecision:
    """Route posts through a two-tier classification cascade.

    Posts in under-resourced languages skip the fast model entirely.
    Ambiguous scores are escalated to the frontier model, and flagged
    posts below the auto-remove bar go to a human moderator.
    """
    supported_languages = {"en", "es", "fr", "de", "pt", "ja", "ko", "zh"}

    # Step 1: Language-based routing
    if post_language not in supported_languages:
        result = frontier_model.classify(post_text)
    else:
        # Step 2: Fast model first pass
        result = fast_model.classify(post_text)

        # Step 3: Escalate ambiguous scores to the frontier model
        if escalation_low <= result.confidence <= escalation_high:
            result = frontier_model.classify(post_text)

    # Step 4: Decision based on final label and confidence
    if result.label == "safe":
        return ModerationDecision.ALLOW
    elif result.confidence >= auto_remove_threshold:
        return ModerationDecision.REMOVE
    else:
        # Asymmetric policy: flagged but below the auto-remove bar,
        # err toward human review rather than allowing borderline content.
        return ModerationDecision.ESCALATE_TO_HUMAN
Code Fragment 36.4.1: A two-tier cascading classifier. The fast model handles clear cases while ambiguous posts and under-resourced languages are escalated to a frontier model. Posts above the auto-remove threshold are removed immediately; borderline cases go to human moderators.
Fun Note

The content moderation team discovered that their frontier model was exceptionally good at detecting sarcasm in English but interpreted deadpan humor in Finnish as genuine threats. The lesson: cultural context is not a hyperparameter you can tune. When your classifier operates across cultures, invest in per-language evaluation sets (see Chapter 29 on building representative test suites).
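Per-language evaluation is mostly bookkeeping: group labeled examples by language and report accuracy separately, so one dominant language cannot mask failures in another. A minimal sketch (the example schema is an assumption):

```python
from collections import defaultdict


def per_language_accuracy(examples: list[dict], predict) -> dict[str, float]:
    """Accuracy broken out by language.

    Each example is {"text": ..., "language": ..., "label": ...};
    predict(text) returns a predicted label.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex["language"]] += 1
        if predict(ex["text"]) == ex["label"]:
            hits[ex["language"]] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

A per-language breakdown would have surfaced the high-confidence misfires on under-resourced languages before launch, rather than through user appeals.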

4. Case Study: Internal Knowledge Q&A Bot

4.1 The Scenario

A 5,000-person enterprise wants to reduce the load on its IT and HR helpdesks. Employees submit roughly 800 questions per day, and 60% of those are answered by existing documentation (benefits policies, VPN setup guides, expense report procedures). The team proposes an internal Q&A bot that can answer common questions instantly and route complex ones to the appropriate human team.

4.2 Role Assignment

The team assigns two roles. A Router model first determines whether the question falls into IT, HR, Facilities, or "other" categories. Based on the category, the question is passed to a Researcher model that retrieves and synthesizes answers from category-specific document collections using the RAG pipeline patterns from Chapter 20. For questions the Researcher cannot answer with high confidence, the system escalates to the appropriate human team via the existing ticketing system.

4.3 Feasibility Assessment

The Feasibility Scorecard is mostly green. The domain is constrained (internal company knowledge only). The acceptable error rate is moderate because employees can always fall back to submitting a ticket. Latency budget is 5 seconds. Cost is manageable at roughly $0.01 per query. The primary risk is data sensitivity: the knowledge base contains internal policies, salary bands, and organizational charts. The team restricts the model to a private deployment with no data leaving the company's cloud environment.

4.4 What Went Wrong

Three issues emerged during the pilot. First, the Router misclassified 18% of questions, sending HR benefits questions to the IT knowledge base and vice versa. Second, the Researcher produced confident-sounding answers even when the retrieved documents were only tangentially relevant, a failure mode where the model synthesized plausible but incorrect answers from loosely related content. Third, employees began asking the bot questions it was never designed to handle: career advice, interpersonal conflicts, and complaints about management. The bot attempted to answer these, sometimes with amusing and sometimes with problematic results.

4.5 Iteration and Final Architecture

The team made four targeted changes based on the failure data. First, they improved the Router by fine-tuning it on 2,000 labeled routing examples collected during the pilot, reducing misclassification from 18% to 4%. Second, they added a relevance score threshold to the Researcher: if the top retrieved document scored below 0.7 on semantic similarity, the bot responds with "I could not find a confident answer" and offers to create a ticket. Third, they added an explicit scope filter using a lightweight Classifier that detects out-of-scope questions and responds with a friendly redirect. Fourth, they implemented the agent pattern from Chapter 22, where the Router, scope Classifier, and Researcher operate as a coordinated pipeline with shared context.

# Code Fragment 36.4.2: Multi-role Q&A pipeline architecture
from dataclasses import dataclass


@dataclass
class QueryResult:
    answer: str
    sources: list[str]
    confidence: float
    routed_to: str


@dataclass
class ScopeCheck:
    in_scope: bool
    redirect_message: str | None


def handle_employee_query(
    question: str,
    scope_classifier,
    router_model,
    researcher_model,
    knowledge_bases: dict[str, object],
    relevance_threshold: float = 0.7,
    confidence_threshold: float = 0.6,
) -> QueryResult:
    """Process an employee question through the multi-role pipeline.

    Steps:
    1. Scope classifier filters out-of-scope questions.
    2. Router assigns the question to a category.
    3. Researcher retrieves and synthesizes from the right KB.
    4. Low-confidence answers are escalated to humans.
    """
    # Step 1: Scope check
    scope = scope_classifier.check(question)
    if not scope.in_scope:
        return QueryResult(
            answer=scope.redirect_message or (
                "This question is outside my expertise. "
                "I can help with IT, HR, and Facilities questions. "
                "Would you like me to create a ticket instead?"
            ),
            sources=[],
            confidence=0.0,
            routed_to="out_of_scope",
        )

    # Step 2: Route to category
    category = router_model.classify(question)

    # Step 3: Retrieve and synthesize
    kb = knowledge_bases.get(category.label)
    if kb is None:
        return _escalate_to_human(question, category.label)

    docs = kb.retrieve(question, top_k=5)
    top_relevance = max((d.score for d in docs), default=0.0)

    if top_relevance < relevance_threshold:
        return _escalate_to_human(question, category.label)

    # Step 4: Generate answer from relevant documents
    answer = researcher_model.synthesize(question, docs)

    if answer.confidence < confidence_threshold:
        return _escalate_to_human(question, category.label)

    return QueryResult(
        answer=answer.text,
        sources=[d.title for d in docs[:3]],
        confidence=answer.confidence,
        routed_to=category.label,
    )


def _escalate_to_human(question: str, category: str) -> QueryResult:
    """Create a support ticket when the bot cannot answer confidently."""
    return QueryResult(
        answer=(
            f"I was not able to find a confident answer for your question. "
            f"I have created a ticket with the {category} team, and someone "
            f"will follow up with you shortly."
        ),
        sources=[],
        confidence=0.0,
        routed_to=f"{category}_human_escalation",
    )
Code Fragment 36.4.2: The Q&A bot pipeline coordinates three model roles: a scope classifier that filters out-of-scope questions, a router that selects the right knowledge base, and a researcher that retrieves and synthesizes answers. Low-confidence results at any stage trigger escalation to a human team.

5. Lessons Learned Across All Four Cases

Stepping back from the individual case studies, several patterns emerge consistently:

  1. Every team overestimated initial model performance. Demo-quality results did not survive contact with production data. The support email drafter hit stale content. The legal assistant hallucinated citations. The content moderator failed on non-English posts. The Q&A bot could not handle out-of-scope questions. In each case, the gap between demo and production was the same gap described in Section 36.1.
  2. Decomposition beats monolithic design. The legal assistant improved dramatically when a single "do everything" model call was split into three discrete roles (Researcher, Verifier, Drafter). The Q&A bot stabilized when routing and answering were separated. Single-role model calls are easier to evaluate, debug, and improve independently.
  3. Confidence thresholds require calibration, not guessing. The content moderator's initial escalation band was too narrow. The Q&A bot's relevance threshold was set arbitrarily. In both cases, the teams needed real production data to calibrate these thresholds properly. Build your thresholds as configurable parameters from day one.
  4. Human checkpoints need design, not just existence. The support email drafter had a human review step, but agents still rubber-stamped drafts. A checkpoint that does not force engagement is functionally equivalent to no checkpoint at all. Design the checkpoint to require a specific action (selecting a source, confirming a value, editing a field).
  5. Feedback loops close the quality gap. The content moderator improved continuously by feeding human decisions back into training. The Q&A bot's router improved through fine-tuning on labeled examples from the pilot. Every case study that included a feedback loop outperformed those that treated the model as a static component.
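The third lesson above is straightforward to operationalize: keep every threshold in one typed config object that can be overridden without a code change. A minimal sketch (the `ThresholdConfig` name, fields, and `MOD_` environment-variable prefix are illustrative, though the default values match the moderation case study):

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ThresholdConfig:
    """All tunable thresholds in one place, adjustable without a deploy."""
    escalation_low: float = 0.3
    escalation_high: float = 0.85
    auto_remove: float = 0.95
    relevance: float = 0.7

    @classmethod
    def from_env(cls, prefix: str = "MOD_") -> "ThresholdConfig":
        # e.g. setting MOD_ESCALATION_HIGH=0.8 overrides the default
        overrides = {
            f.name: float(os.environ[prefix + f.name.upper()])
            for f in fields(cls)
            if prefix + f.name.upper() in os.environ
        }
        return cls(**overrides)
```

Centralizing thresholds this way makes calibration an operational task (adjust, observe, repeat) rather than a code change, which is exactly what the moderation and Q&A teams needed once production data arrived.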
Key Insight

The iteration pattern is the product. None of these four products shipped their first architecture. Every team went through at least one significant redesign driven by production failures. The AI Role Canvas and Feasibility Scorecard did not prevent these failures, but they made the failures diagnosable. When the legal assistant hallucinated citations, the team could trace the problem to a specific canvas decision (no Verifier role, no citation grounding). When the Q&A bot misrouted questions, the team could trace it to an under-specified Router with no fine-tuning data. Structured planning tools do not eliminate iteration; they make iteration productive.


What Comes Next

This is the final section of Chapter 36. You now have the complete toolkit for moving from an idea to a validated product hypothesis: a product mindset (Section 36.1), role assignment via the AI Role Canvas (Section 36.2), feasibility scoring (Section 36.3), and the case-study patterns from this section. In Chapter 37: Building and Steering AI Products, you will learn how to take your validated hypothesis and turn it into a working prototype, covering prompt iteration, evaluation harnesses, deployment pipelines, and the feedback loops that keep your product improving after launch.

Self-Check
Q1: The legal research assistant initially used full autonomy. What specific failure caused the team to pivot to a copilot architecture, and how did the revised pipeline prevent that failure from recurring?
Answer:
The model hallucinated case citations in approximately 12% of memos. Some fabricated citations were plausible enough to pass junior associate review. The revised pipeline prevents this by: (1) constraining the Researcher role to retrieve from a verified case law database rather than generating citations from parametric memory, (2) adding a dedicated Verifier role that cross-checks every citation against the source database, and (3) requiring attorney review of the final output. Citations that cannot be verified are explicitly flagged with a warning badge.
Q2: Explain why the content moderation team needed a cascading architecture rather than using a single frontier model for all classifications.
Answer:
The platform processes 500,000 posts per hour with a 200-millisecond latency budget per post. A frontier model takes approximately 2 seconds per classification, making it impractical for all traffic. The cascading architecture routes clear-cut cases through a fast model (50 ms) and only escalates ambiguous cases (about 15% of traffic after iteration) to the frontier model. This keeps latency for the vast majority of posts well within budget while maintaining high accuracy on the hardest cases. The key tradeoff is engineering complexity for cost and latency efficiency.
Q3: Across all four case studies, what is the common pattern in how teams discovered and fixed their initial design mistakes?
Answer:
Every team discovered failures through real production data, not through pre-launch testing. The common pattern is: (1) launch with a hypothesis-driven architecture, (2) collect data on actual failure modes that were not predicted during planning, (3) trace failures back to specific role assignment or feasibility decisions using the AI Role Canvas, and (4) make targeted architectural changes (adding roles, adjusting thresholds, adding scope filters, or introducing feedback loops) rather than starting over. Structured planning tools like the Canvas and Scorecard did not prevent failures but made them diagnosable and actionable.

Bibliography

AI Product Design

Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G., Arawjo, I. (2024). "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv:2404.12272

Examines how LLM-based evaluation (the Verifier role) can be aligned with human judgment, directly relevant to the legal research assistant's citation verification pipeline and the content moderator's cascading architecture.
Evaluation
Retrieval-Augmented Generation

Lewis, P., Perez, E., Piktus, A., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems, 33. arXiv:2005.11401

The foundational RAG paper, establishing the retrieve-then-generate paradigm used by the customer support drafter and internal Q&A bot case studies. Understanding RAG's strengths and failure modes is essential for any Researcher-role deployment.
RAG
Human-AI Collaboration

Parasuraman, R. and Manzey, D. H. (2010). "Complacency and Bias in Human Use of Automation." Human Factors, 52(3), 381-410. doi:10.1177/0018720810376055

Seminal review of automation complacency and bias, the phenomenon observed in the customer support case study where agents rubber-stamped AI drafts. Provides the theoretical foundation for why human checkpoints require deliberate friction design.
Automation Complacency
Content Moderation

Zaharia, M., Khattab, O., Chen, L., et al. (2024). "The Shift from Models to Compound AI Systems." Berkeley AI Research Blog. BAIR Blog

Argues that state-of-the-art results increasingly come from compound systems rather than monolithic models. Directly supports the lesson that decomposing AI products into discrete role pipelines (as in the legal assistant and Q&A bot) outperforms single-model approaches.
Compound AI Systems