"Thirty-eight chapters of theory become real the moment you ship something a user can break."
Compass, Theory Graduating AI Agent
This is the final section of the book. Here you tie together every framework from Part XI into one end-to-end exercise: from hypothesis through evaluation to launch readiness. The capstone lab is designed to be completed in four to six hours, and the assessment rubric gives you (or your instructor) clear criteria for evaluating the result.
Prerequisites
This capstone draws on every section in Part XI. You should have completed (or at least read) the AI Role Canvas (Section 36.2), the Intent + Evidence Bundle (Section 37.1), the Prototype Loop (Section 37.2), and the Launch Readiness Checklist (Section 38.1). Familiarity with prompt engineering (Chapter 11) and evaluation (Chapter 29) is essential for the hands-on lab.
Lab: Building an AI Product Prototype End-to-End
This capstone exercise ties together every framework from Part XI. You will build a complete AI product prototype, from hypothesis to launch-readiness assessment, using AI copilots at each stage. The exercise is designed to be completed in four to six hours.
This capstone is intentionally structured to mirror how you would build a real AI product, compressed into a single working session. If you get stuck at any phase, that is valuable signal: it tells you which earlier chapters you should revisit. Keep a running log of where you struggle. After the lab, review that log against the book's table of contents to build your personal study plan.
Phase 1: Define the Hypothesis (45 minutes)
- Choose a product idea. If you need inspiration, ask an LLM: "Suggest five AI product ideas for [your domain] that solve a real pain point."
- Fill out the AI Role Canvas from Section 36.2: define the AI's role (copilot, classifier, drafter, etc.), the human's role, the fallback behaviour, and the success metric.
- Run the stress-test function from Code Fragment 38.5.1 against your hypothesis. Revise your canvas based on the critique.
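Code Fragment 38.5.1 is not reproduced here, so the sketch below shows the general shape such a stress-test helper might take: it validates the role canvas fields from Section 36.2 and turns them into a critique prompt you would send to an LLM. The field names, prompt wording, and example canvas are illustrative assumptions, not the book's actual fragment.

```python
# Hypothetical sketch of a hypothesis stress-test helper. The canvas field
# names and the critique prompt are assumptions for illustration; pass the
# returned string to whatever LLM client you are using.

def build_stress_test_prompt(canvas: dict) -> str:
    """Turn an AI Role Canvas into a critique prompt for an LLM."""
    required = ["ai_role", "human_role", "fallback", "success_metric"]
    missing = [f for f in required if not canvas.get(f)]
    if missing:
        raise ValueError(f"Canvas incomplete, missing: {missing}")
    return (
        "Act as a sceptical product reviewer. Critique this AI product "
        "hypothesis: identify unstated assumptions, likely failure modes, "
        "and whether the success metric is falsifiable.\n"
        f"AI role: {canvas['ai_role']}\n"
        f"Human role: {canvas['human_role']}\n"
        f"Fallback behaviour: {canvas['fallback']}\n"
        f"Success metric: {canvas['success_metric']}"
    )

# Example canvas (hypothetical product idea)
canvas = {
    "ai_role": "drafter for support replies",
    "human_role": "agent reviews and sends",
    "fallback": "escalate to a human on low confidence",
    "success_metric": "first-response time under 2 minutes",
}
prompt = build_stress_test_prompt(canvas)
```

Raising on an incomplete canvas is deliberate: it forces you to finish Phase 1 before asking the model to critique a half-formed hypothesis.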
Phase 2: Create the Intent + Evidence Bundle (60 minutes)
- Write a one-paragraph intent statement following the template from Section 37.1.
- Use Code Fragment 38.5.2 to generate acceptance criteria for your core feature.
- Generate a lightweight threat model by feeding the acceptance criteria back into the LLM.
- Compile these into your Evidence Bundle: hypothesis, role canvas, acceptance criteria, threat model, and at least three evaluation cases you will test during prototyping.
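The bundle above can be kept as a single structured object so that gaps are caught before you start building. The sketch below is a minimal container, assuming the artifact names listed in the Phase 2 checklist; the class and method names are illustrative, not from Section 37.1.

```python
# Minimal sketch of an Evidence Bundle container. Field names mirror the
# Phase 2 checklist; the >= 3 eval-case rule comes from the last step above.
from dataclasses import dataclass, field

@dataclass
class EvidenceBundle:
    hypothesis: str
    role_canvas: dict
    acceptance_criteria: list
    threat_model: list
    eval_cases: list = field(default_factory=list)

    def gaps(self) -> list:
        """Return a list of missing or underfilled artifacts."""
        problems = []
        if not self.hypothesis.strip():
            problems.append("hypothesis is empty")
        if not self.acceptance_criteria:
            problems.append("no acceptance criteria")
        if not self.threat_model:
            problems.append("no threat model")
        if len(self.eval_cases) < 3:
            problems.append("fewer than 3 eval cases")
        return problems
```

Run `gaps()` at the end of Phase 2; an empty list means the bundle is ready to carry into prototyping.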
Phase 3: Build the Vertical Slice (120 minutes)
- Implement a vertical slice using the Prototype Loop from Section 37.2. Focus on a single user flow, not the full product.
- Use an AI coding assistant for implementation. Write the system prompt, then run it through the meta-prompting critique (Code Fragment 38.5.3) before using it in your prototype.
- Generate synthetic test data using an LLM: at least 20 realistic inputs covering normal cases, edge cases, and adversarial inputs.
Phase 4: Evaluate and Assess Launch Readiness (60 minutes)
- Run your prototype against the synthetic test data. Record input, expected output, actual output, and a pass/fail score for each case.
- Use Code Fragment 38.5.4 to cluster failures and identify the top root causes.
- Apply the Launch Readiness Checklist from Section 38.1 to your prototype. Score each dimension (quality, safety, cost, latency, monitoring).
- Write a one-page launch decision memo: ship, iterate, or pivot, with evidence supporting your recommendation.
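Code Fragment 38.5.4 is not shown here, so as a hedged stand-in, the sketch below covers the Phase 4 bookkeeping: clustering failed cases by a root-cause tag and rolling the five checklist dimensions into a coarse ship/iterate/pivot signal. The result schema, tag values, and thresholds are illustrative assumptions, not the book's fragment.

```python
# Illustrative Phase 4 helpers. The 'root_cause' tag and the 1-4 dimension
# scores are assumptions; the five dimensions follow the checklist above.
from collections import Counter

def top_root_causes(results: list, n: int = 3) -> list:
    """Cluster failed cases by their 'root_cause' tag; return the top n."""
    failures = [r["root_cause"] for r in results if not r["passed"]]
    return Counter(failures).most_common(n)

def readiness(scores: dict) -> str:
    """Map per-dimension 1-4 scores to a coarse launch signal."""
    dims = ["quality", "safety", "cost", "latency", "monitoring"]
    values = [scores[d] for d in dims]
    if min(values) >= 3:
        return "ship"
    if min(values) >= 2:
        return "iterate"
    return "pivot"
```

Note that `readiness` gates on the weakest dimension, not the average: one failing dimension (say, safety) should block a launch even if everything else is strong.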
Assessment Rubric
Use the following rubric to self-assess your capstone work or to evaluate peer submissions. Each dimension is scored on a four-point scale.
| Dimension | Exemplary (4) | Proficient (3) | Developing (2) | Beginning (1) |
|---|---|---|---|---|
| Hypothesis Clarity | Specific, falsifiable, with clear success metric | Clear hypothesis; metric present but vague | Hypothesis stated but not falsifiable | No clear hypothesis |
| Role Canvas | All fields complete; fallback and escalation paths defined | Most fields complete; minor gaps in fallback design | Partial canvas; AI role unclear | Canvas missing or not used |
| Evidence Bundle | Intent, criteria, threat model, and eval cases all present and linked | Most artifacts present; some disconnected from hypothesis | Only acceptance criteria present | No structured evidence |
| Prototype Quality | Vertical slice works end-to-end; prompts refined via meta-prompting | Prototype works; prompts used but not critiqued | Prototype partially functional | No working prototype |
| Evaluation Rigour | 20+ test cases; failures clustered; root causes addressed | 10+ test cases with scores; some analysis | Fewer than 10 test cases; no clustering | No evaluation performed |
| Launch Decision | Evidence-based memo with clear recommendation and next steps | Recommendation present; evidence partially cited | Opinion without evidence | No decision documented |
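For self-assessment, the six dimension scores can be totalled mechanically. The helper below is a hypothetical convenience, and the band cut-offs are illustrative assumptions rather than part of the rubric.

```python
# Hypothetical rubric scorer: six dimensions, each scored 1-4 (max 24).
# The band thresholds are illustrative assumptions, not from the book.
RUBRIC_DIMENSIONS = [
    "hypothesis_clarity", "role_canvas", "evidence_bundle",
    "prototype_quality", "evaluation_rigour", "launch_decision",
]

def score_capstone(scores: dict) -> tuple:
    """Sum the six dimension scores and attach an illustrative band."""
    missing = [d for d in RUBRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    total = sum(scores[d] for d in RUBRIC_DIMENSIONS)
    if total >= 21:
        band = "exemplary"
    elif total >= 15:
        band = "proficient"
    else:
        band = "developing"
    return total, band
```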
The capstone deliverables (role canvas, evidence bundle, prototype, eval results, and launch memo) form a complete portfolio piece that demonstrates product thinking, not just coding ability. Whether you are interviewing for a product role, an ML engineering position, or founding a startup, this package shows you can move from idea to evidence-based decision.
- AI copilots add value at every product stage, not just coding. Idea framing, requirements, prompt design, and evaluation analysis all benefit from LLM assistance.
- Meta-prompting creates a feedback loop at the prompt design level. Using one LLM call to critique another's instructions catches ambiguities and jailbreak surfaces before they reach users.
- Evaluation triage scales with LLM assistance. Clustering hundreds of failures by root cause and suggesting fixes turns raw eval data into actionable next steps.
- The capstone lab ties the full part together. Hypothesis (Section 36.2), evidence (Section 37.1), prototype (Section 37.2), launch readiness (Section 38.1), and copilot techniques form a repeatable product development workflow.
- The highest-leverage copilot investment is the earliest stage. A stress-tested hypothesis prevents wasted engineering effort far more effectively than a faster code editor.
What Comes Next: The Road Ahead
You have reached the final section of Building Conversational AI with LLMs and Agents. Over 38 chapters, you have journeyed from the mathematical foundations of neural networks to the practical realities of shipping AI products. You have learned how transformers attend to context (Part I), how large language models are trained and scaled (Part II), how to steer them with prompts and APIs (Part III), how to ground them with retrieval (Part IV), how to fine-tune them for specific tasks (Part V), how to grant them agency with tools and planning (Part VI), how to orchestrate multi-agent systems (Part VII), how to evaluate and monitor them in production (Part VIII), how to deploy them safely and ethically (Part IX), how to reason about their strategic implications (Part X), and finally, how to turn all of that knowledge into a real product (Part XI).
The field is moving fast. Models will get cheaper, faster, and more capable. New modalities, new reasoning techniques, and new regulatory frameworks will emerge. But the core discipline you have built throughout this book will endure: define the problem clearly, choose the right level of AI autonomy, prototype with tight feedback loops, evaluate rigorously, ship incrementally, and keep learning from real-world evidence.
For continued reference, the Appendices provide quick-reference material on mathematical foundations, API cheat sheets, evaluation templates, and deployment checklists. The product-builder pathway you have followed in this chapter is designed to be reusable: return to the AI Role Canvas, the Intent + Evidence Bundle, and the Launch Readiness Checklist every time you start a new project.
Go build something that matters. The tools are ready. So are you.
