Capstone Project: Building an End-to-End System with LLMs

"Reading thirty-five chapters about LLM systems teaches you a vocabulary. Shipping one teaches you the craft."
Sage, Newly-Graduated AI Agent

Big Picture

The capstone is the integrating exercise for everything in the book: a single production-grade LLM application built end-to-end. You will make architectural decisions that balance competing concerns (quality vs. latency, accuracy vs. cost, flexibility vs. reliability) and produce a working system, a published model and dataset, a technical report, and a 15-minute presentation. This is where you cross from reading about LLM systems to actually shipping one.

Project Overview

The capstone project is the culminating experience of this book. You will design, build, and present a complete LLM-powered system that demonstrates mastery across the full stack: data preparation, model training and adaptation, retrieval-augmented generation, agent orchestration, production deployment, evaluation, and business strategy.

Unlike individual module labs that focus on a single technique, the capstone requires architectural decisions that balance competing concerns: model quality versus latency, accuracy versus cost, flexibility versus reliability. These tradeoffs are what distinguish a classroom exercise from a production system.

You will work on this project over approximately 4 to 6 weeks; the suggested timeline below assumes six. The project culminates in a GitHub repository with working code, a model and dataset published on Hugging Face Hub, a written technical report, an interpretability analysis, a live demo, and a 15-minute presentation.

Key Takeaways: What Makes a Strong Capstone

Integration over novelty: The goal is not to invent a new architecture but to demonstrate that you can combine multiple techniques into a coherent, working system.
Production mindset: Include evaluation suites, monitoring hooks, safety guardrails, and deployment configuration. A demo that works on a laptop is not enough.
Business grounding: Frame the project around a real use case with measurable success criteria and an ROI estimate, not just a technical exercise.
Honest evaluation: Report what does not work as thoroughly as what does. Identifying limitations demonstrates deeper understanding than cherry-picked results.
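The "measurable success criteria", "ROI estimate", and "report what does not work" points above can be made concrete with a small sketch. The metric names, thresholds, and dollar figures here are hypothetical placeholders, not requirements from this book; substitute the criteria for your own use case. The sketch checks measured values against targets, lists misses and unmeasured criteria alongside the successes, and computes a simple ROI as net benefit divided by cost.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    target: float
    higher_is_better: bool = True

    def passed(self, value: float) -> bool:
        # Accuracy-style metrics must reach the target; cost/latency must stay under it.
        if self.higher_is_better:
            return value >= self.target
        return value <= self.target


# Hypothetical success criteria (illustrative names and thresholds only).
CRITERIA = [
    Criterion("answer_accuracy", 0.85),
    Criterion("p95_latency_seconds", 2.0, higher_is_better=False),
    Criterion("cost_per_query_usd", 0.01, higher_is_better=False),
]


def evaluate(measured: dict) -> dict:
    """Sort every criterion into met, missed, or not measured."""
    report = {"met": [], "missed": [], "not_measured": []}
    for criterion in CRITERIA:
        if criterion.name not in measured:
            report["not_measured"].append(criterion.name)
        elif criterion.passed(measured[criterion.name]):
            report["met"].append(criterion.name)
        else:
            report["missed"].append(criterion.name)
    return report


def roi(annual_benefit: float, annual_cost: float) -> float:
    """Simple ROI: net benefit divided by cost."""
    return (annual_benefit - annual_cost) / annual_cost


if __name__ == "__main__":
    # Hypothetical measurements; cost per query was deliberately left unmeasured.
    print(evaluate({"answer_accuracy": 0.88, "p95_latency_seconds": 3.1}))
    print(roi(annual_benefit=150_000, annual_cost=100_000))
```

Keeping "missed" and "not_measured" as first-class entries in the report makes gaps visible instead of letting them silently drop out of the results table.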

Learning Objectives

Design an end-to-end LLM system architecture that balances quality, cost, latency, and safety
Prepare and publish a synthetic or curated dataset suitable for fine-tuning
Fine-tune or adapt a language model using techniques from Part IV
Build a RAG pipeline with vector search, reranking, and citation generation
Implement an agent with tool use, planning, and multi-step reasoning
Deploy the system with appropriate security, monitoring, and observability instrumentation
Design and execute a rigorous evaluation suite with both automated and human evaluation
Produce a technical report with architecture diagrams, evaluation results, and honest limitation analysis
Present the project in a clear, concise 15-minute format suitable for technical and business audiences
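To make the RAG objective above concrete, here is a minimal sketch of the three stages it names: vector search, reranking, and citation generation. Everything in it is an illustrative assumption. The corpus is a toy, the bag-of-words `embed` function stands in for a real embedding model, the token-overlap `rerank` stands in for a real reranker, and the prompt wording is hypothetical. A capstone system would replace each piece with production components.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real system would index its own documents.
DOCS = {
    "doc1": "LoRA adapts a frozen model by training small low-rank matrices.",
    "doc2": "Reranking reorders retrieved passages with a more precise scorer.",
    "doc3": "Vector search finds passages whose embeddings are close to the query.",
}


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (stand-in for a real embedding model)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2) -> list:
    """Stage 1: return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)
    return ranked[:k]


def rerank(query: str, doc_ids: list) -> list:
    """Stage 2: reorder candidates by the fraction of query tokens they contain."""
    q_tokens = set(tokenize(query))

    def overlap(doc_id: str) -> float:
        return len(q_tokens & set(tokenize(DOCS[doc_id]))) / max(len(q_tokens), 1)

    return sorted(doc_ids, key=overlap, reverse=True)


def build_prompt(query: str, doc_ids: list) -> str:
    """Stage 3: number the sources so the model can cite them as [n]."""
    sources = "\n".join(
        f"[{i}] ({doc_id}) {DOCS[doc_id]}" for i, doc_id in enumerate(doc_ids, 1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n"
        f"{sources}\nQuestion: {query}"
    )


if __name__ == "__main__":
    question = "How does vector search find passages?"
    print(build_prompt(question, rerank(question, retrieve(question))))
```

Numbering the sources in the prompt is one simple way to get citations: the model's `[n]` markers can be mapped back to document ids after generation and checked against the retrieved set.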

Capstone Pages

C.1 Requirements & Deliverables: Detailed technical requirements (synthetic dataset, fine-tuned model, RAG system, agent with tools, deep research, production deployment, security, evaluation suite, hybrid architecture, ROI analysis, risk governance) and deliverable specifications (GitHub repo, Hugging Face Hub artifacts, technical report, demo, presentation).

Suggested Timeline (6 weeks)

Week	Focus
Week 1	Design. Select use case, define requirements, design architecture, identify datasets.
Week 2	Data + Model. Prepare synthetic dataset, begin fine-tuning or adapter training.
Week 3	RAG + Agent. Build RAG pipeline, implement agent with tools, integrate components.
Week 4	Deploy + Evaluate. Deploy to cloud, set up monitoring, run evaluation suite.
Week 5	Refine. Address evaluation findings, add safety guardrails, optimize performance.
Week 6	Report + Present. Write technical report, prepare presentation, publish artifacts.

Deliverable Summary

GitHub Repository with clean code, README, and deployment instructions.
Hugging Face Hub artifacts: fine-tuned model and curated dataset.
Technical Report (8 to 12 pages) with architecture, evaluation, and limitations.
Interpretability Analysis documenting attention patterns, token attributions, or probing results.
Live Demo (deployed or screencast) showing the system in action.
Presentation (15 minutes) covering motivation, architecture, results, and lessons learned.

What Comes Next

Read Capstone C.1: Requirements & Deliverables for the per-component technical bar and the deliverable specifications. After that, the work is yours.