Part 11: From Idea to AI Product
Chapter 37 · Section 37.2

The Founder's Prototype Loop

"I scaffolded an entire backend in forty minutes. Then I spent four days figuring out why it kept apologizing to the database."

Deploy, Backend Debugging AI Agent
Big Picture

Speed without direction is just expensive chaos. AI coding assistants can scaffold a prototype in hours, but the prototype is worthless unless it proves (or disproves) a specific hypothesis about your product. This section introduces a disciplined loop for AI-era prototyping: build a vertical slice end-to-end, measure it with real evaluation, and steer based on evidence. You will also get a concrete Prototype Playbook that maps five prototype stages to the techniques covered throughout this book.

Prerequisites

This section builds on LLM API patterns from Chapter 10, prompt engineering from Chapter 11, and the concept of observe-steer development loops introduced in Section 37.1. Familiarity with evaluation (Chapter 29), observability (Chapter 30), and production engineering (Chapter 31) will strengthen the cross-references throughout.

Figure 37.2.1: The founder's prototype loop builds evidence at every iteration: a vertical-slice demo that exercises the full stack, measured against explicit quality gates. (Illustration: a rocket launch pad built from stacked building blocks, a founder at the base, and circular arrows labeled Build, Measure, Steer.)

1. Vertical-Slice Prototyping

Most first-time builders make the same mistake: they build every feature at a shallow level before finishing any single feature deeply. The result is a wide, brittle surface that cannot survive contact with real users. Vertical-slice prototyping inverts this pattern. You pick one user flow and build it completely, from the user interface down through the LLM call, retrieval layer, guardrails, and response rendering.

Tip: The "One User, One Task" Rule

Your first vertical slice should solve exactly one task for exactly one type of user. Not "summarize documents" but "summarize a 10-page lease agreement so a first-time renter can understand the key financial obligations in under 60 seconds." The specificity forces you to define what "good enough" looks like before you write a single line of code, and it makes your evaluation criteria concrete and testable.
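That task definition can be written down as data before any product code exists, so the quality bar is explicit from the start. The sketch below is one possible shape; the field names and limits are illustrative assumptions, not a prescribed format.

```python
# A hypothetical "one user, one task" definition, written before any
# product code. Every field doubles as an evaluation criterion later.
task_definition = {
    "user": "first-time renter with no legal background",
    "task": "summarize a 10-page lease agreement",
    "success_criteria": [
        "key financial obligations are listed explicitly",
        "readable in under 60 seconds",
    ],
    "hard_limits": {
        "max_summary_words": 200,  # proxy for the 60-second reading budget
        "max_latency_ms": 5000,
    },
}


def summary_within_limits(summary: str, definition: dict) -> bool:
    """Check the one hard limit we can verify mechanically: length."""
    limit = definition["hard_limits"]["max_summary_words"]
    return len(summary.split()) <= limit
```

Even this trivial check turns "good enough" from an opinion into a testable property of the output.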

Why one flow? Because a single complete flow exposes every integration point, every latency bottleneck, and every failure mode that matters. A shallow prototype that calls an LLM and displays raw text proves nothing about your product. A vertical slice that retrieves context from a knowledge base, constructs a prompt with guardrails, calls the model, validates the output, and renders it with citations proves that your architecture actually works.
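To make the shape of such a slice concrete, here is a minimal sketch of the whole flow as one pipeline function. The retrieval and model callables are injected stubs; `SliceResult`, the prompt wording, and the `[source]` citation format are assumptions for illustration, not a fixed API.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    answer: str
    citations: list[str]
    valid: bool  # did the guardrail (citation check) pass?


def answer_query(query: str, retrieve, call_model) -> SliceResult:
    """One vertical slice: retrieval -> grounded prompt -> model -> validation.

    `retrieve` and `call_model` are injected so the slice can be exercised
    end-to-end with stubs before the real services exist.
    """
    chunks = retrieve(query)  # retrieval layer: list of {"text", "source"}
    context = "\n".join(c["text"] for c in chunks)
    prompt = (  # grounded prompt construction
        "Answer using ONLY the context below. Cite sources as [source].\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    answer = call_model(prompt)  # LLM call
    # Guardrail: keep only citations that actually appear in the answer
    citations = [c["source"] for c in chunks if f"[{c['source']}]" in answer]
    return SliceResult(answer, citations, valid=bool(citations))
```

Swapping the stubs for a real vector store and model client later changes nothing about the slice's structure, which is exactly the point.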

Real-World Scenario: Vertical Slice in Practice

Who: An HR technology team at a 2,000-person company building an internal benefits assistant.

Situation: The original plan called for an assistant that could "answer any HR question," covering benefits, payroll, leave policies, and onboarding. The engineering estimate for the full scope was three months.

Problem: After two weeks of broad scaffolding, the team had a chat interface that could handle none of these topics end-to-end. Retrieval, grounding, and validation were all half-built.

Decision: The team scrapped the horizontal approach and picked one vertical slice: "an employee asks about dental coverage for dependents." The slice included (1) a chat interface with a single input box, (2) a RAG pipeline retrieving the relevant section from the benefits handbook, (3) a prompt template grounding the answer in retrieved context using techniques from Chapter 11, (4) an output validator checking that the response cites at least one source, and (5) a simple feedback button.

Result: The vertical slice was working end-to-end within four days. It immediately revealed a chunking bug in the RAG pipeline that would have affected every topic, catching it early instead of three months in.

Lesson: One complete user flow exercising every layer of the stack is worth more than ten shallow features that skip retrieval, evaluation, or error handling.

2. The Build-Measure-Steer Loop

The classic Lean Startup mantra is "Build, Measure, Learn." For AI products, we sharpen the third step. "Learn" is too passive. When your AI product underperforms, you do not simply learn about it; you steer the system by adjusting prompts, swapping models, tightening guardrails, or restructuring retrieval. The founder's prototype loop is therefore Build, Measure, Steer.

Phase 1: Build

Use AI coding assistants (Cursor, Claude Code, GitHub Copilot) to scaffold your vertical slice quickly. The goal is not production-quality code; it is a functional prototype that exercises the full stack. Section 3 below covers the verification discipline required when working with generated code.

Phase 2: Measure

Connect your prototype to an evaluation harness from the start. Even five well-chosen test cases are better than zero. Instrument your prototype with the observability patterns from Chapter 30: log every prompt, every retrieved chunk, every model response, and every user reaction. You cannot steer what you cannot see.

Phase 3: Steer

This is where AI products diverge from traditional software iteration. Steering means making targeted adjustments based on evaluation evidence:

  1. Prompt adjustments: rewrite instructions, add grounding requirements, or tighten output format constraints.
  2. Model swaps: move to a smaller, faster model for simple tasks, or a stronger model for tasks that fail on reasoning.
  3. Guardrail changes: add output validators, citation checks, or filters for content the system must never emit.
  4. Retrieval restructuring: fix chunking, adjust ranking, or expand the knowledge base when answers lack grounding.

Key Insight

The Build-Measure-Steer loop is not sequential; it is concurrent. In practice, you are building a new feature while measuring the last one and steering based on results from two iterations ago. The discipline is in keeping each loop small (hours, not weeks) and ensuring every steering decision is backed by evaluation data, not gut feeling.

3. AI Coding Assistants: Trust but Verify

AI coding assistants accelerate prototyping dramatically, but they introduce a new failure mode: plausible but incorrect code. A generated function may have correct syntax, reasonable variable names, and sensible-looking logic while containing subtle bugs that only surface under edge cases. The verification discipline below applies whether you use Cursor, Claude Code, GitHub Copilot, or any other assistant.

  1. Treat generated code as a draft. Read every line before integrating it. If you cannot explain what a function does, you do not understand your own prototype.
  2. Run it immediately. Do not accumulate generated code. Generate a function, run it, verify the output, then move on. Small feedback loops catch errors before they compound.
  3. Check the boundaries. Generated code often handles the happy path well and ignores edge cases. Test with empty inputs, oversized inputs, malformed data, and concurrent access.
  4. Verify dependencies. AI assistants sometimes import libraries that do not exist, reference deprecated APIs, or use incorrect function signatures. Confirm every import and every API call against current documentation.
  5. Own the architecture. Use AI assistants for implementation, not for architectural decisions. You decide the system structure; the assistant fills in the code within that structure.
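Step 3 in particular can be mechanized. The sketch below assumes a hypothetical generated function, `chunk_text`, and probes exactly the inputs assistants most often ignore: empty input, oversized input, and a malformed argument.

```python
def chunk_text(text: str, size: int = 100) -> list[str]:
    """Stand-in for an AI-generated function under review."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def boundary_probe(fn) -> dict[str, bool]:
    """Run the edge cases that generated code most often ignores."""
    checks = {}
    checks["empty_input"] = fn("") == []                             # empty
    checks["oversized_input"] = len(fn("x" * 1_000_000)) == 10_000   # huge
    try:
        fn("abc", size=0)                                            # malformed
        checks["rejects_bad_size"] = False
    except ValueError:
        checks["rejects_bad_size"] = True
    return checks
```

Running a probe like this immediately after generation, rather than days later, is what keeps verification cheap.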
Fun Fact

A 2024 GitClear analysis of code churn rates found that projects using AI coding assistants had a 39% higher rate of code that was reverted or substantially rewritten within two weeks of being committed. The code was written faster, but it was also corrected faster. The net productivity gain depends entirely on how disciplined the verification step is.

4. The Prototype Playbook

The Prototype Playbook maps five stages of increasing sophistication to the techniques covered throughout this book. Each stage builds on the previous one. You do not advance to the next stage until the current stage passes its evaluation criteria.

The Prototype Playbook: Five Stages

Stage 1: Single-Prompt. One API call with a hand-crafted prompt; no retrieval, no tools. Chapters: 10, 11. Gate: the model produces relevant responses for 5+ representative inputs.

Stage 2: RAG Prototype. Add retrieval to ground responses in your domain data. Chapters: 19, 20. Gate: responses cite retrieved sources; hallucination rate below threshold.

Stage 3: Agent Prototype. Add tool use and multi-step reasoning for complex workflows. Chapters: 22, 23. Gate: the agent completes the target workflow in 80%+ of test runs.

Stage 4: Eval-Gated. Wrap the prototype in an automated evaluation harness; no deploy without passing. Chapter: 29. Gate: all quality metrics above defined thresholds; regression suite green.

Stage 5: Production-Ready. Add observability, rate limiting, cost controls, and graceful degradation. Chapter: 31. Gate: latency p95 within budget; cost per request within ceiling; monitoring active.
Note: You May Not Need Every Stage

Not every product requires agents or RAG. If a single well-crafted prompt solves the problem, stop at Stage 1. The playbook is a ladder, not a mandate. Complexity is a cost; add it only when simpler stages fail to meet your quality bar.

5. A Minimal Prototype Evaluation Harness

The following Python script implements a lightweight evaluation harness that you can wire into any prototype from Stage 1 onward. It loads test cases, runs each one through your prototype function, scores the output on keyword relevance, safety constraints, and latency, then produces a summary report.

# Minimal prototype evaluation harness
# Connects to any prototype via a callable that accepts a query string
import json
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class EvalCase:
    """A single evaluation test case."""
    query: str
    expected_keywords: list[str] = field(default_factory=list)
    max_latency_ms: float = 5000.0
    must_not_contain: list[str] = field(default_factory=list)


@dataclass
class EvalResult:
    """Result of running one evaluation case."""
    query: str
    response: str
    latency_ms: float
    keyword_score: float  # fraction of expected keywords found
    safety_pass: bool     # True if no forbidden strings appear
    latency_pass: bool    # True if within latency budget
    overall_pass: bool = False

    def __post_init__(self):
        self.overall_pass = (
            self.keyword_score >= 0.5
            and self.safety_pass
            and self.latency_pass
        )


def load_cases(path: str) -> list[EvalCase]:
    """Load evaluation cases from a JSON file.

    Expected format:
    [
        {
            "query": "What is the dental coverage for dependents?",
            "expected_keywords": ["dental", "dependents", "coverage"],
            "max_latency_ms": 3000,
            "must_not_contain": ["I don't know"]
        }
    ]
    """
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return [EvalCase(**case) for case in raw]


def run_eval(
    prototype_fn: Callable[[str], str],
    cases: list[EvalCase],
) -> list[EvalResult]:
    """Run all evaluation cases and return scored results."""
    results: list[EvalResult] = []

    for case in cases:
        # Time the prototype call
        start = time.perf_counter()
        response = prototype_fn(case.query)
        elapsed_ms = (time.perf_counter() - start) * 1000

        # Score keyword presence
        response_lower = response.lower()
        hits = sum(
            1 for kw in case.expected_keywords
            if kw.lower() in response_lower
        )
        keyword_score = (
            hits / len(case.expected_keywords)
            if case.expected_keywords else 1.0
        )

        # Check safety: none of the forbidden strings should appear
        safety_pass = all(
            forbidden.lower() not in response_lower
            for forbidden in case.must_not_contain
        )

        results.append(EvalResult(
            query=case.query,
            response=response[:200],  # truncate for readability
            latency_ms=round(elapsed_ms, 1),
            keyword_score=round(keyword_score, 2),
            safety_pass=safety_pass,
            latency_pass=elapsed_ms <= case.max_latency_ms,
        ))

    return results


def print_report(results: list[EvalResult]) -> None:
    """Print a summary evaluation report to stdout."""
    total = len(results)
    passed = sum(1 for r in results if r.overall_pass)
    print(f"\n{'=' * 60}")
    print(" Prototype Evaluation Report")
    print(f" {passed}/{total} cases passed "
          f"({100 * passed / total:.0f}%)")
    print(f"{'=' * 60}")

    for i, r in enumerate(results, 1):
        status = "PASS" if r.overall_pass else "FAIL"
        print(f"\n [{status}] Case {i}: {r.query[:50]}")
        print(f"   Keyword score : {r.keyword_score}")
        print(f"   Safety pass   : {r.safety_pass}")
        print(f"   Latency       : {r.latency_ms} ms "
              f"(limit: {'ok' if r.latency_pass else 'EXCEEDED'})")

    print(f"\n{'=' * 60}\n")


# Example usage with a stub prototype
if __name__ == "__main__":
    # Replace this stub with your actual prototype function
    def my_prototype(query: str) -> str:
        """Stub that simulates a prototype response."""
        return (
            "Based on the benefits handbook, dental coverage "
            "for dependents includes preventive and basic "
            "restorative services under the standard plan."
        )

    # Define inline test cases for quick testing
    cases = [
        EvalCase(
            query="What dental coverage do dependents get?",
            expected_keywords=["dental", "dependents", "coverage"],
            max_latency_ms=3000,
            must_not_contain=["I don't know"],
        ),
        EvalCase(
            query="Is vision included in the basic plan?",
            expected_keywords=["vision", "basic", "plan"],
            max_latency_ms=3000,
            must_not_contain=["I cannot help"],
        ),
    ]

    results = run_eval(my_prototype, cases)
    print_report(results)
Code Fragment 37.2.1: A minimal evaluation harness that scores prototype responses on keyword relevance, safety constraints, and latency. Wire your prototype function into run_eval and expand the test cases as your prototype matures. For production-grade evaluation, see the comprehensive patterns in Chapter 29.
Key Insight

Evaluation is not a phase; it is a habit. The harness above takes less than an hour to set up, yet most prototype builders skip it entirely. The result is that they steer blind, making prompt changes without knowing whether quality improved or degraded. Even a crude harness with five test cases gives you signal. Start measuring from Day 1, and expand your test suite as you discover new failure modes.

6. Putting It All Together: A Day in the Loop

Here is what one iteration of the Build-Measure-Steer loop looks like in practice for a founder prototyping an HR benefits assistant:

  1. Morning (Build). Use Claude Code to scaffold a FastAPI endpoint that accepts a question, retrieves relevant chunks from a benefits PDF using a simple vector store, constructs a grounded prompt, and returns the model's response. Total scaffolding time: 45 minutes. Review every generated file, fix two incorrect import paths, and add error handling the assistant omitted.
  2. Midday (Measure). Write 10 evaluation cases covering common benefits questions, edge cases (questions about policies that do not exist), and adversarial inputs (requests for salary data the bot should not disclose). Run the harness from Code Fragment 37.2.1. Result: 7/10 pass. Three failures: one hallucinated a policy that does not exist, one exceeded the latency budget, one leaked a system prompt fragment.
  3. Afternoon (Steer). For the hallucination: add an instruction to the prompt requiring every claim to cite a retrieved chunk, and add a post-processing validator that checks for citations. For the latency issue: switch from a large model to a smaller one for this simple Q&A task, following the model-selection guidance from Chapter 10. For the prompt leak: add an output filter that strips any text matching the system prompt pattern. Re-run the harness. Result: 9/10 pass. The remaining failure needs better retrieval; note it for tomorrow's iteration.
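Two of the afternoon fixes, the citation validator and the prompt-leak filter, can each be a few lines. The citation pattern and the system-prompt marker below are assumptions for illustration; adapt both to your own prompt and citation format.

```python
import re

CITATION_PATTERN = re.compile(r"\[[\w\-]+\]")  # e.g. [handbook-p12]
SYSTEM_PROMPT_MARKER = "You are an HR benefits assistant"  # hypothetical


def has_citation(response: str) -> bool:
    """Post-processing validator: every answer must cite >= 1 source."""
    return bool(CITATION_PATTERN.search(response))


def strip_prompt_leak(response: str) -> str:
    """Output filter: drop any line that echoes the system prompt."""
    return "\n".join(
        line for line in response.splitlines()
        if SYSTEM_PROMPT_MARKER not in line
    )
```

Both checks then become evaluation criteria in their own right, so tomorrow's iteration cannot silently regress them.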
Real-World Scenario: Knowing When to Cross from Prototype to Production

Who: A founding engineer at a B2B startup building an AI-powered invoice processing tool.

Situation: After three weeks of Build-Measure-Steer iterations, the prototype correctly extracted line items from 93% of test invoices. The CEO wanted to onboard paying customers immediately.

Problem: The engineer checked three graduation signals and found one missing: evaluation stability was strong (60 test cases, stable pass rate across four iterations) and cost predictability was solid ($0.04 per invoice, forecastable), but the failure mode inventory was incomplete. The team had not documented how the system behaved on multi-currency invoices or scanned PDFs with poor OCR quality.

Decision: The engineer pushed back on immediate launch, ran a one-week spike to catalog failure modes for multi-currency and low-quality scans, and added fallback routing to human review for those cases.

Result: The tool launched one week later with documented failure modes and graceful degradation. The first enterprise customer processed 4,000 invoices in month one with only 12 escalations to human review. Crossing to production meant adding authentication, rate limiting, error recovery, monitoring dashboards, and the full production engineering stack from Chapter 31.

Lesson: A prototype is ready for production when evaluation is stable, costs are predictable, and known failure modes are documented with mitigations, not when the demo looks impressive.
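The three graduation signals from this scenario can be encoded as an explicit check, so that "ready for production" is a function call rather than a feeling. The thresholds below are illustrative, not universal.

```python
from dataclasses import dataclass


@dataclass
class GraduationSignals:
    eval_pass_rates: list[float]   # pass rate per recent iteration
    cost_per_request: float        # dollars, averaged over test runs
    documented_failure_modes: int  # entries in the failure-mode inventory
    known_input_classes: int       # input classes identified so far


def ready_for_production(s: GraduationSignals,
                         min_pass: float = 0.9,
                         max_cost: float = 0.10,
                         max_drift: float = 0.05) -> bool:
    """All three signals must hold: stable evals, predictable cost,
    and a failure-mode inventory covering every known input class."""
    stable = (min(s.eval_pass_rates) >= min_pass
              and max(s.eval_pass_rates) - min(s.eval_pass_rates) <= max_drift)
    affordable = s.cost_per_request <= max_cost
    inventoried = s.documented_failure_modes >= s.known_input_classes
    return stable and affordable and inventoried
```

In the invoice scenario above, the first two signals passed but the inventory check failed, which is exactly what justified the one-week spike.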

Fun Fact

The term "vertical slice" originated in game development during the 1990s, where studios would build one complete level of a game before roughing out all levels. The logic was the same: a single polished level reveals rendering bugs, physics issues, and gameplay problems that a dozen grey-boxed levels never would. The concept migrated to agile software development in the 2000s and now fits AI prototyping perfectly.

Key Takeaways

  1. Build one vertical slice end-to-end before widening scope; a single complete flow exposes the integration points and failure modes that shallow prototypes hide.
  2. Run the Build-Measure-Steer loop in small iterations, and back every steering decision with evaluation data rather than gut feeling.
  3. Treat AI-generated code as a draft: read it, run it immediately, test its boundaries, verify its dependencies, and keep architectural decisions for yourself.
  4. Advance through the Prototype Playbook stages only when the current stage passes its gate criteria, and stop at the simplest stage that meets your quality bar.
  5. Wire in an evaluation harness from Day 1; even five test cases give you signal.

What Comes Next

You now have a disciplined loop for turning an idea into a working prototype. But prototypes run on enthusiasm; products run on money. Section 37.3: Documentation as Control Surface shows how to use documentation as a control mechanism for AI-assisted development, turning intent capture into a first-class engineering practice.

Self-Check
Q1: Why does this section recommend "Steer" rather than "Learn" as the third phase of the prototype loop?
Answer:
"Learn" implies passive observation. In AI products, the third phase requires active intervention: adjusting prompts, swapping models, adding guardrails, or restructuring retrieval pipelines. "Steer" captures the fact that you are making concrete, measurable changes to the system based on evaluation evidence, then re-measuring to confirm the effect.
Q2: What is vertical-slice prototyping, and why is it preferable to building many features at a shallow level?
Answer:
Vertical-slice prototyping means building one complete user flow end-to-end, from UI through retrieval, prompting, model call, output validation, and rendering. It is preferable because a single complete flow exposes every integration point, latency bottleneck, and failure mode in the architecture. Shallow prototypes that skip layers (such as omitting retrieval or evaluation) hide problems that only surface during production integration.
Q3: Name three verification steps you should apply to AI-generated code before integrating it into your prototype.
Answer:
(1) Read every line and confirm you understand the logic. (2) Run the code immediately and verify its output rather than accumulating unverified generated code. (3) Test boundary conditions: empty inputs, oversized inputs, and malformed data. Additional steps include verifying that all imports and API calls reference real, current libraries and reserving architectural decisions for yourself rather than delegating them to the assistant.

Bibliography

Lean Product Development

Ries, E. (2011). The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.

The original formulation of Build-Measure-Learn. This section's Build-Measure-Steer loop adapts Ries's framework for the specific affordances and failure modes of AI products, where "steering" (prompt tuning, model swapping, guardrail addition) replaces generic "learning."
AI Coding Assistants

Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." arXiv:2302.06590

A controlled experiment showing that developers using GitHub Copilot completed tasks 55% faster on average. The study also notes that speed gains are largest for boilerplate and repetitive code, reinforcing the "scaffolding, not architecture" principle discussed in this section.

GitClear (2024). "Coding on Copilot: 2024 Data Suggests Downward Pressure on Code Quality." GitClear Research Report.

Analysis of 153 million lines of code showing increased churn rates in AI-assisted projects. Provides empirical backing for the "trust but verify" discipline emphasized in this section.
Evaluation and Production Readiness

Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." Proceedings of IEEE Big Data 2017. doi:10.1109/BigData.2017.8258038

Google's rubric for assessing ML system maturity across data, model, infrastructure, and monitoring dimensions. The Prototype Playbook's stage-gate approach draws on this rubric's principle that production readiness is measurable and incremental.

Shankar, S., Garcia, R., Hellerstein, J.M., et al. (2024). "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv:2404.12272

Examines the challenges of evaluating LLM products in practice, including the gap between offline evaluation metrics and real-world product quality. Directly relevant to the evaluation harness and stage-gate criteria discussed in the Prototype Playbook.