Section 37.1: The Observe-Steer Development Loop

"You don't spec the whole building before picking up the hammer anymore. You describe the room, watch it appear, fix what's wrong, and describe the next room."
Deploy, Relentlessly Iterative AI Agent

Big Picture

AI coding assistants have changed the physics of software development. A new workflow has emerged where developers describe intent in natural language and let an AI generate code, then review, test, and steer iteratively. Andrej Karpathy coined the term "vibe coding" for this practice. Used carelessly, it produces fragile, untested code. Used responsibly, with evaluation gates, documented intent, and structured evidence, it becomes a powerful acceleration technique. This section defines the practice, introduces the observe-steer loop as its professional backbone, and delivers the Intent + Evidence Bundle (IEB): a version-controlled folder that keeps AI-assisted development auditable, reproducible, and safe.

Prerequisites

This section builds on prompt engineering fundamentals from Chapter 11 and the evaluation framework from Chapter 29. Familiarity with version control (Git) is assumed.

A developer at a glowing monitor surrounded by a circular loop of colorful arrows: observe, steer, generate, document. — **Figure 37.1.1**: The observe-steer loop replaces the traditional spec-build-test cycle. You describe intent, observe the AI's output, and steer iteratively toward the goal.

1. What Is Vibe Coding?

"Vibe coding" refers to a style of software development where the programmer describes what they want in natural language (or even by pointing at a screenshot) and an AI assistant generates the code. The programmer's role shifts from writing syntax to reviewing, steering, and constraining output. Think of it as pair programming where one partner is an LLM with encyclopedic knowledge of libraries and patterns but no understanding of your business context or quality requirements.

Real-World Scenario: Application Example

When Vibe Coding Saved a Hackathon (and Nearly Sank the Product)

Who: A two-person startup team at a 48-hour AI hackathon.

Situation: They needed a document Q&A prototype with a Streamlit frontend, RAG pipeline, and vector database integration.

Problem: Neither founder had built a RAG pipeline before, and the hackathon clock was ticking.

Dilemma: They could study the documentation for two days and build carefully, or let Claude Code scaffold the entire stack in hours and spend the remaining time evaluating quality.

Decision: They chose vibe coding with an explicit constraint: every generated function must have a test, and they would review all LLM integration points manually.

How: Using Claude Code, they scaffolded the Streamlit app, FAISS vector store, and OpenAI embedding pipeline in four hours. They spent the remaining 44 hours on evaluation, prompt tuning, and fixing three critical bugs the AI had introduced (including a silent chunking error that split sentences mid-word).

Result: They won second place. But when they later tried to extend the prototype into a product, they discovered that 30% of the generated code contained assumptions that were wrong for their production environment.

Lesson: Vibe coding compresses the time to a working prototype, but the time you save on writing code must be reinvested in reviewing, testing, and understanding what was generated.

The term first gained traction in early 2025 when Andrej Karpathy described his workflow of "just vibing" with an AI, accepting whatever code it produced, and iterating on the results. While his description was deliberately provocative, the underlying shift is real: AI coding assistants (GitHub Copilot, Cursor, Claude Code, Windsurf, and others) have made it possible to generate substantial codebases from natural language descriptions in minutes rather than days.

The critical distinction lies in how you handle the generated output. Two approaches mark opposite ends of a spectrum:

Raw Vibe Coding vs. Responsible Vibe Coding

Dimension	Raw Vibe Coding	Responsible Vibe Coding
Intent capture	Verbal, ephemeral	Written in `intent.md`, version-controlled
Review discipline	"If it runs, ship it"	Review every generated function; flag uncertainty
Testing	Manual spot checks	AI generates tests, human verifies coverage
Constraints	None explicit	Documented non-negotiables (security, cost, latency)
Reproducibility	Low; depends on chat history	High; prompts versioned, eval suite pinned
Suitable for	Throwaway prototypes, personal scripts	Production code, team projects, regulated domains

Key Insight

Responsible vibe coding is not anti-speed; it is anti-regret. The overhead of writing an intent document, creating a golden test set, and versioning your prompts adds perhaps 20 minutes to a session. The cost of debugging a production incident caused by untested AI-generated code, or discovering that your entire codebase depends on a prompt pattern that breaks after a model update, is measured in days or weeks. Speed without evidence is not velocity; it is gambling.

2. The Observe-Steer Loop

Traditional software development follows a plan-build-test cycle: write a specification, implement it, then verify. AI-assisted development replaces the heavy upfront specification with a tighter feedback loop that we call the observe-steer loop:

Describe. State your intent in natural language. Be specific about constraints and non-negotiables, but do not over-specify implementation details.
Generate. Let the AI produce code, configuration, or content.
Observe. Run the output. Read it carefully. Check it against your evaluation criteria. Use the observability tools from Chapter 30 to inspect traces, latency, and costs.
Evaluate. Does the output meet your quality gates? Run your evaluation suite (Chapter 29). Not just "does it look right?" but "does it pass the rubric?"
Steer. Based on what you observed, refine your prompt, add constraints, request changes, or accept the result. Feed the observation back into the next iteration.

This loop runs in minutes, not days. A single coding session might execute 10 to 50 iterations. The key difference from ad-hoc prompting is that each iteration is grounded in observable evidence (test results, traces, metrics) rather than subjective impressions.
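
The loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: `generate` stands in for whatever AI assistant you call and `evaluate` for your evaluation suite. The `Iteration` record, `observe_steer`, and the toy generator and evaluator are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Iteration:
    """Evidence recorded for one pass through the loop."""
    prompt: str
    output: str
    failures: List[str]


def observe_steer(
    intent: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], List[str]],
    max_iterations: int = 10,
) -> List[Iteration]:
    """Run describe -> generate -> observe -> evaluate -> steer until the gates pass."""
    history: List[Iteration] = []
    prompt = intent  # Describe: start from the stated intent
    for _ in range(max_iterations):
        output = generate(prompt)    # Generate
        failures = evaluate(output)  # Observe and evaluate: names of failed checks
        history.append(Iteration(prompt, output, failures))
        if not failures:
            break  # Quality gates pass: accept the result
        # Steer: feed the observed failures back as explicit constraints
        feedback = "\n".join(f"- {f}" for f in failures)
        prompt = f"{intent}\nFix these failures:\n{feedback}"
    return history


# Toy stand-ins (hypothetical) so the sketch runs without an AI service.
def toy_generate(prompt: str) -> str:
    if "Fix these failures" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_evaluate(output: str) -> List[str]:
    namespace: dict = {}
    exec(output, namespace)  # acceptable here only because the toy output is trusted
    return [] if namespace["add"](2, 3) == 5 else ["add(2, 3) should return 5"]


history = observe_steer("Write add(a, b) that returns the sum.", toy_generate, toy_evaluate)
print(len(history), history[-1].failures)
```

Keeping every `Iteration` in `history` is the point of the design: each steering decision can be traced back to the specific failure that motivated it.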

Two contrasting development workflows. Top: a long waterfall with huge steps and a bored developer waiting months at each stage with cobwebs forming. Bottom: a tight positive circular loop where a happy developer rapidly cycles through describe, generate, observe, evaluate, and steer steps, with evidence clipboards and test results flying around. — **Figure 37.1.2**: Traditional waterfall development (top) versus the observe-steer loop (bottom). The tight feedback cycle compresses months of speculation into minutes of evidence-based iteration.

Real-World Scenario: Observe-Steer in Practice

Who: A developer at an e-commerce company building a customer-support ticket classifier.

Situation: The first prompt produced a classifier that handled 8 of 10 test categories correctly, and the team was tempted to ship it.

Problem: The evaluation suite revealed the classifier confused "billing inquiry" with "refund request," two categories that require very different routing and handling in production.

Decision: Instead of debugging blindly, the developer steered the prompt by adding two few-shot examples for each confused category and specifying the distinguishing criteria explicitly.

Result: The next iteration scored 10/10. The developer committed the updated prompt version and the eval results. Total elapsed time: 12 minutes.

Lesson: Without an evaluation suite, the 8/10 version would have shipped and the confusion would have surfaced only through misdirected customer tickets in production.
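
The kind of evidence that exposes a confusion like this can be produced by counting the (expected, predicted) pairs that disagree. The sketch below assumes the classifier's results are already available as label pairs. The function name `confusion_pairs` and the sample labels are illustrative and are not taken from the scenario's actual data.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def confusion_pairs(results: Iterable[Pair]) -> List[Tuple[Pair, int]]:
    """Return (expected, predicted) pairs that disagree, most frequent first."""
    counts = Counter((exp, pred) for exp, pred in results if exp != pred)
    return counts.most_common()


# Illustrative (expected, predicted) labels; not real data.
results = [
    ("billing_inquiry", "refund_request"),
    ("billing_inquiry", "billing_inquiry"),
    ("refund_request", "billing_inquiry"),
    ("billing_inquiry", "refund_request"),
    ("shipping_delay", "shipping_delay"),
]

for (expected, predicted), n in confusion_pairs(results):
    print(f"{expected} -> {predicted}: {n}")
```

A ranked list of confused pairs tells you which categories need distinguishing criteria or few-shot examples, which is more actionable than a single accuracy number.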

3. Documentation as a Control Surface

In an observe-steer workflow, documentation is not an afterthought that you write once the code is "done." It is the primary control surface for the AI assistant. The documents you maintain directly shape what the AI generates, what constraints it respects, and how you verify its work.

Four types of documentation serve as active control surfaces:

Intent documents. What the system must do, what it must never do, and who approved those decisions. These function like a requirements specification but are written in plain language that both humans and AI assistants can consume. When you paste your intent document into a coding session, the AI inherits your constraints.
Prompt templates. Versioned prompts with placeholders, stored alongside the code they support. As discussed in Chapter 11, prompts are a form of source code and deserve the same version control discipline.
Evaluation artifacts. Golden test sets, scoring rubrics, and regression baselines. These define "good enough" in machine-readable form, enabling automated quality gates.
Risk and cost records. Threat models, token budgets, and routing strategies. These constrain architectural decisions and prevent cost surprises.
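
As one illustration of a prompt template with placeholders, the sketch below uses `string.Template` from the Python standard library. The template text, the placeholder names, and `render_prompt` are assumptions made for this example. In a real project the template would live in a versioned file next to the code it supports.

```python
from string import Template
from typing import List

# Hypothetical template; in practice this would be loaded from a versioned file.
SYSTEM_PROMPT_V001 = Template(
    "You are a support assistant for $product.\n"
    "Constraints:\n$constraints\n"
    "Keep responses under $max_words words."
)


def render_prompt(product: str, constraints: List[str], max_words: int = 300) -> str:
    """Fill the template; substitute() raises KeyError if a placeholder is left unfilled."""
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return SYSTEM_PROMPT_V001.substitute(
        product=product, constraints=bullet_list, max_words=max_words
    )


print(render_prompt("Acme Support Bot", ["Never fabricate order numbers."]))
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: a missing placeholder fails loudly instead of silently shipping a prompt with a literal `$name` in it.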

Note: AI Assistants Read Your Docs

Modern AI coding tools (Claude Code, Cursor, Windsurf) can read project files as context. A well-structured intent.md at the root of your project can therefore act as a persistent system prompt for every coding session, provided the tool is told to load it (for example, by referencing it from the tool's project instructions file). This means documentation has a dual audience: human teammates and AI assistants. Write accordingly: be explicit, use concrete examples, and state constraints as testable assertions rather than vague aspirations.
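
To make "constraints as testable assertions" concrete, here is a minimal sketch in which each constraint becomes a small check function. The two constraints, the `ORD-` order-number format, and all function names are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List, Set

ORDER_ID = re.compile(r"\bORD-\d{6}\b")  # hypothetical order-number format


def under_word_limit(response: str, limit: int = 300) -> bool:
    """Constraint: responses stay under `limit` words."""
    return len(response.split()) < limit


def no_fabricated_orders(response: str, known_orders: Set[str]) -> bool:
    """Constraint: every order number mentioned must exist in our records."""
    return set(ORDER_ID.findall(response)) <= known_orders


def check_constraints(response: str, known_orders: Set[str]) -> List[str]:
    """Return the names of violated constraints (an empty list means all pass)."""
    checks: Dict[str, Callable[[], bool]] = {
        "under_word_limit": lambda: under_word_limit(response),
        "no_fabricated_orders": lambda: no_fabricated_orders(response, known_orders),
    }
    return [name for name, check in checks.items() if not check()]


print(check_constraints("Your order ORD-123456 has shipped.", {"ORD-123456"}))
print(check_constraints("Your order ORD-999999 has shipped.", {"ORD-123456"}))
```

A vague aspiration such as "be accurate about orders" cannot fail a test; a check like `no_fabricated_orders` can, which is what lets it serve as a quality gate.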

4. Lock-in Dynamics and Switching Costs

AI-assisted development changes the economics of switching between tools, providers, and architectures. Some switching costs decrease; others increase.

Costs that decrease:

Refactoring cost. AI assistants can refactor entire codebases between frameworks, languages, or API providers in hours rather than weeks. Migrating from one LLM provider's SDK to another is now a task you can describe in a prompt, not a multi-sprint project.
Boilerplate rewriting. Switching from REST to GraphQL, or from one database to another, becomes cheaper when an AI can generate the adapter layer.

Costs that increase (data gravity):

Evaluation data. Your golden test sets, scoring rubrics, and regression baselines are tightly coupled to your current model and prompt design. Switching models means re-validating your entire evaluation suite.
Fine-tuning investments. If you have fine-tuned a model on proprietary data, that investment is locked to one provider.
Prompt libraries. Hundreds of carefully tuned prompts represent significant intellectual capital. While prompts are portable in theory, they often exploit model-specific behaviors that do not transfer cleanly.
Trace and feedback data. Months of production traces, user feedback, and quality metrics are your most valuable asset for improvement. This data has gravity; it pulls your decisions toward the system that generated it.
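
One way to limit this coupling is to run the same golden cases through a thin provider-agnostic interface, so that re-validation after a model switch is a single function call. The sketch below is an assumption-laden illustration: `Completer`, `FakeProviderA`, `FakeProviderB`, and `pass_rate` are hypothetical names, and the fake providers return canned strings instead of calling any real SDK.

```python
from typing import Dict, List, Protocol


class Completer(Protocol):
    """Minimal provider-agnostic interface (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


class FakeProviderA:
    """Stand-in for one provider; returns a canned answer."""

    def complete(self, prompt: str) -> str:
        return "Refunds are available within 30 days."


class FakeProviderB:
    """Stand-in for a second provider with different behavior."""

    def complete(self, prompt: str) -> str:
        return "Please contact support."


def pass_rate(model: Completer, golden: List[Dict]) -> float:
    """Fraction of golden cases whose expected substrings all appear in the output."""
    passed = 0
    for case in golden:
        output = model.complete(case["input"]).lower()
        if all(term.lower() in output for term in case["expected_contains"]):
            passed += 1
    return passed / len(golden)


golden = [{"input": "What is your refund policy?", "expected_contains": ["refund"]}]
for name, model in [("A", FakeProviderA()), ("B", FakeProviderB())]:
    print(name, pass_rate(model, golden))
```

The golden set itself stays unchanged across providers; only the pass rate moves, which makes the cost of a switch visible before you commit to it.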

Fun Fact

The term "data gravity" was coined by Dave McCrory in 2010 to describe how data attracts applications, services, and more data to its location. In the AI product context, your evaluation data and production traces act as a gravitational center: the more evidence you accumulate about how your system performs, the harder it becomes to justify starting over with a different provider, even if the new provider offers better base capabilities. This is why the Intent + Evidence Bundle, introduced below, is designed to be provider-agnostic wherever possible.

5. The Intent + Evidence Bundle (IEB)

The deliverable for this section is the Intent + Evidence Bundle (IEB): a version-controlled folder that lives alongside your code and captures everything needed to understand, reproduce, and audit your AI product's behavior. The IEB is the professional answer to the question "How do I keep vibe coding under control?"

The IEB contains five components:

Intent + Evidence Bundle Structure

File / Folder	Purpose	Updated When
`intent.md`	Non-negotiable requirements, approved scope, stakeholder sign-offs	Scope changes, new constraints, design reviews
`eval/`	Golden test set (`golden.jsonl`), regression baselines, scoring rubrics	New failure modes discovered, model changes, prompt updates
`prompts/`	Versioned prompt templates with changelogs	Every prompt modification (treat like source code)
`risk.md`	Threat model: adversarial inputs, data leakage, bias vectors, failure scenarios	New risk identified, architecture changes, incident post-mortems
`cost.md`	Token budget, model routing strategy, cost-per-interaction targets	Pricing changes, traffic growth, model swaps

The following script initializes an IEB folder structure with template files.

# Initialize an Intent + Evidence Bundle (IEB) folder structure
# Run once at project start to scaffold the IEB directory
from pathlib import Path
from datetime import date


def init_ieb(project_root: str, project_name: str = "My AI Project") -> Path:
    """Create the IEB folder structure with starter templates.

    Args:
        project_root: Path to the project's root directory.
        project_name: Human-readable project name for template headers.

    Returns:
        Path to the created ieb/ directory.
    """
    ieb_root = Path(project_root) / "ieb"
    today = date.today().isoformat()

    # Define folder structure (parents=True also creates ieb/ itself)
    folders = ["eval", "prompts"]
    for folder in folders:
        (ieb_root / folder).mkdir(parents=True, exist_ok=True)

    # --- intent.md ---
    intent_content = f"""# Intent Document: {project_name}
**Created:** {today}
**Owner:** [Your Name]
**Status:** Draft

## What This System Must Do
- [ ] [Describe the core capability in one sentence]

## What This System Must Never Do
- [ ] [List safety and compliance non-negotiables]

## Approved Scope
- [ ] [Define boundaries: what is in scope, what is out]

## Stakeholder Approvals
| Name | Role | Date | Notes |
|------|------|------|-------|
| | | | |
"""
    (ieb_root / "intent.md").write_text(intent_content, encoding="utf-8")

    # --- risk.md ---
    risk_content = f"""# Risk Register: {project_name}
**Last reviewed:** {today}

## Threat Model
| Threat | Likelihood | Impact | Mitigation |
|--------|-----------|--------|------------|
| Adversarial prompt injection | Medium | High | Input validation, guardrails |
| Data leakage in responses | Low | Critical | Output filtering, PII detection |
| Model hallucination | High | Medium | Eval suite, confidence thresholds |

## Open Risks
- [ ] [List unmitigated risks here]
"""
    (ieb_root / "risk.md").write_text(risk_content, encoding="utf-8")

    # --- cost.md ---
    cost_content = f"""# Cost and Routing Strategy: {project_name}
**Last reviewed:** {today}

## Token Budget
| Scenario | Input tokens | Output tokens | Model | Cost/call |
|----------|-------------|---------------|-------|-----------|
| Typical query | ~500 | ~200 | gpt-4o-mini | ~$0.0003 |
| Complex query | ~2,000 | ~800 | gpt-4o | ~$0.01 |

## Routing Strategy
- Route simple queries to the smaller, cheaper model.
- Escalate to the larger model when confidence is below threshold.
- See Chapter 10 for API routing patterns.

## Monthly Budget Target
- Target: $[amount]/month at [volume] queries/day
"""
    (ieb_root / "cost.md").write_text(cost_content, encoding="utf-8")

    # --- eval/golden.jsonl (starter with one example) ---
    golden_example = (
        '{"input": "What is your refund policy?", '
        '"expected_intent": "refund_inquiry", '
        '"expected_contains": ["refund", "policy"], '
        '"notes": "Baseline test case"}\n'
    )
    (ieb_root / "eval" / "golden.jsonl").write_text(
        golden_example, encoding="utf-8"
    )

    # --- prompts/v001_system_prompt.txt ---
    prompt_template = f"""# System Prompt v001
# Created: {today}
# Changelog: Initial version

You are a helpful assistant for {project_name}.

## Instructions
- Answer user questions accurately and concisely.
- If you are unsure, say so explicitly.
- Never fabricate information.

## Constraints
- Do not discuss competitors.
- Do not reveal internal system details.
- Keep responses under 300 words unless the user requests more detail.
"""
    (ieb_root / "prompts" / "v001_system_prompt.txt").write_text(
        prompt_template, encoding="utf-8"
    )

    # Summarize what was created and what to fill in next
    print(f"IEB initialized at: {ieb_root}")
    print("  intent.md - Define your non-negotiables")
    print("  risk.md   - Document threats and mitigations")
    print("  cost.md   - Set token budgets and routing rules")
    print("  eval/     - Add golden test cases to golden.jsonl")
    print("  prompts/  - Version your prompt templates here")
    return ieb_root


# Example usage
if __name__ == "__main__":
    init_ieb(".", project_name="Customer Support Bot")

Code Fragment 37.1.1: A Python script that initializes the Intent + Evidence Bundle folder structure. Running this at the start of a project creates template files for intent, risk, cost, evaluation data, and prompt versioning. Each template includes starter content and placeholders to guide the team.
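
Once the scaffold exists, the golden set can be loaded and sanity-checked before every evaluation run. The following is a minimal sketch: the `load_golden` helper is hypothetical, and the required keys are an assumption that simply mirrors the starter example written by the script above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed schema, mirroring the starter case in golden.jsonl.
REQUIRED_KEYS = {"input", "expected_intent", "expected_contains"}


def load_golden(path: Path) -> List[Dict]:
    """Parse a JSONL golden set, failing loudly on malformed or incomplete cases."""
    cases = []
    lines = path.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)  # raises json.JSONDecodeError on bad JSON
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"{path}:{line_no} missing keys: {sorted(missing)}")
        cases.append(case)
    return cases


# Demonstration with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "golden.jsonl"
    golden_path.write_text(
        '{"input": "What is your refund policy?", '
        '"expected_intent": "refund_inquiry", '
        '"expected_contains": ["refund", "policy"]}\n',
        encoding="utf-8",
    )
    cases = load_golden(golden_path)
    print(len(cases), cases[0]["expected_intent"])
```

Validating the file on load keeps a malformed test case from silently shrinking the evaluation suite.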

Real-World Scenario: Using the IEB in a Real Session

Who: A solo developer at a small SaaS company building a support chatbot with Claude Code.

Situation: Before writing any code, the developer ran init_ieb(".", project_name="Acme Support Bot") and filled in intent.md with three non-negotiables: (1) never fabricate order numbers, (2) respond in under 3 seconds at p95, (3) cost under $0.02 per turn. They seeded eval/golden.jsonl with 20 test cases.

Problem: During the third iteration, the eval suite caught a new failure mode: the bot confused "cancel order" with "cancel subscription," routing users to the wrong workflow.

Decision: The developer added the failure case to the golden set, steered the prompt to distinguish the two intents, re-evaluated, and committed all artifacts (code, prompt, eval results) to Git.

Result: Three months later, a new teammate joined the project and reconstructed the full decision history from the IEB, including why "cancel order" and "cancel subscription" had separate handling, without a single onboarding meeting.

Lesson: Externalizing intent, constraints, and evaluation data into version-controlled files turns ephemeral coding sessions into a durable, auditable project history.
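
Non-negotiables such as "under 3 seconds at p95" and "under $0.02 per turn" are easy to turn into automated gates. The sketch below uses the nearest-rank method for the percentile. The latency samples, token counts, and per-million-token prices are illustrative assumptions, not measurements from the scenario or a quoted price sheet.

```python
import math
from typing import List


def p95(values: List[float]) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def cost_per_turn(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one turn given per-million-token prices."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Illustrative measurements and prices; substitute your own traces and price sheet.
latencies_s = [0.8, 1.1, 1.3, 0.9, 2.4, 1.0, 1.2, 2.9, 1.4, 1.1]
turn_cost = cost_per_turn(2_000, 800, usd_per_m_input=2.50, usd_per_m_output=10.00)

print(f"p95 latency: {p95(latencies_s):.1f}s (gate: < 3s)")
print(f"cost per turn: ${turn_cost:.4f} (gate: < $0.02)")
assert p95(latencies_s) < 3.0 and turn_cost < 0.02
```

With these sample numbers the cost works out to (2,000 × 2.50 + 800 × 10.00) / 1,000,000 = $0.013 per turn, which passes the gate with little headroom.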

6. Putting It Together: The Responsible Vibe Coding Workflow

The following workflow combines the observe-steer loop with the IEB into a repeatable development process:

1. Initialize. Run init_ieb(). Fill in intent.md with non-negotiables. Seed eval/golden.jsonl with at least 10 test cases.
2. Describe. Open your AI coding assistant. Paste intent.md as context. Describe the feature you want to build.
3. Generate. Let the assistant produce code. Accept the output into your working directory.
4. Observe. Run the code. Examine the output. Check traces using your observability stack (Chapter 30).
5. Evaluate. Run the evaluation suite (Chapter 29) against your golden set. Record pass/fail rates.
6. Steer. If evaluation fails, refine the prompt or constraints. If a new failure mode appears, add it to the golden set. If a risk becomes apparent, update risk.md.
7. Commit. Commit all changes: code, prompts, eval results, and any IEB updates. Use meaningful commit messages that reference the steer that motivated the change.
8. Repeat. Return to step 2 for the next feature or refinement.
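
Recording pass/fail rates is easier to sustain when each evaluation run writes a small summary file that is committed alongside the code. The sketch below assumes per-case results are already available as dictionaries with `input` and `passed` keys. The `record_eval_run` helper, the `eval/results/` location, and the file naming are hypothetical.

```python
import json
import tempfile
from datetime import date
from pathlib import Path
from typing import Dict, List


def record_eval_run(results: List[Dict], prompt_version: str, out_dir: Path,
                    min_pass_rate: float = 1.0) -> Dict:
    """Write a summary of one evaluation run and report whether the gate passed."""
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results)
    summary = {
        "date": date.today().isoformat(),
        "prompt_version": prompt_version,
        "total": len(results),
        "passed": passed,
        "pass_rate": pass_rate,
        "failed_inputs": [r["input"] for r in results if not r["passed"]],
        "gate_passed": pass_rate >= min_pass_rate,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"run_{prompt_version}.json"
    out_file.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary


# Illustrative results; a real run would come from your evaluation suite.
results = [
    {"input": "What is your refund policy?", "passed": True},
    {"input": "Cancel my subscription", "passed": False},
]
with tempfile.TemporaryDirectory() as tmp:
    summary = record_eval_run(results, "v002", Path(tmp) / "eval" / "results")
    print(summary["pass_rate"], summary["gate_passed"])
```

Listing the failed inputs in the summary gives the commit message something specific to reference, and gives the next session a ready-made list of cases to steer toward.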

Warning: The "It Works on My Machine" Trap

Vibe coding sessions are inherently local and ephemeral. The AI assistant's context window includes your conversation history, which is not captured in Git. If you close the session and start a new one, the assistant loses all that context. The IEB solves this by externalizing the critical information (intent, constraints, eval data, prompts) into files that persist across sessions. Without the IEB, your project's institutional knowledge lives only in chat logs that are difficult to search, share, or reproduce.
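
Because the critical information lives in files, a fresh session can be primed from them in one step. The following sketch concatenates the IEB files into a single context string. The `build_session_context` helper and the header format are assumptions, and the file names follow the IEB layout described in this section.

```python
import tempfile
from pathlib import Path


def build_session_context(ieb_root: Path) -> str:
    """Concatenate the IEB files a fresh session needs, each under a labeled header."""
    sections = []
    for name in ("intent.md", "risk.md", "cost.md"):
        path = ieb_root / name
        if path.exists():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"=== {name} ===\n{body}")
    prompts_dir = ieb_root / "prompts"
    if prompts_dir.is_dir():
        prompts = sorted(prompts_dir.glob("v*_system_prompt.txt"))
        if prompts:
            latest = prompts[-1]  # zero-padded version numbers sort lexicographically
            body = latest.read_text(encoding="utf-8").strip()
            sections.append(f"=== prompts/{latest.name} ===\n{body}")
    return "\n\n".join(sections)


# Demonstration against a throwaway IEB folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ieb"
    (root / "prompts").mkdir(parents=True)
    (root / "intent.md").write_text(
        "# Intent\n- Never fabricate order numbers.\n", encoding="utf-8"
    )
    (root / "prompts" / "v001_system_prompt.txt").write_text("old prompt", encoding="utf-8")
    (root / "prompts" / "v002_system_prompt.txt").write_text("new prompt", encoding="utf-8")
    context = build_session_context(root)
    print(context)
```

The resulting string can be pasted into a new session or written to whatever instructions file your tool loads, so the assistant starts from the same constraints every time.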

Key Takeaways

Vibe coding is real and useful, but it requires discipline. The raw form (generate and ship without review) is suitable only for throwaway prototypes. Responsible vibe coding adds intent documentation, evaluation gates, and version-controlled prompts.
The observe-steer loop replaces heavy upfront specification. Instead of writing a 50-page requirements document, you describe intent, generate, observe the result, evaluate against a rubric, and steer. Each cycle takes minutes.
Documentation is a control surface, not an afterthought. Your intent documents, prompt templates, and evaluation artifacts directly shape what AI assistants generate. Write them for a dual audience: human teammates and AI tools.
Data gravity creates lock-in even when code is portable. AI-assisted refactoring reduces code-level switching costs, but your evaluation data, fine-tuning investments, and production traces create gravitational pull toward your current stack.
The Intent + Evidence Bundle (IEB) keeps AI-assisted development auditable. Five components (intent, eval, prompts, risk, cost) ensure that every decision is documented, every quality gate is explicit, and every teammate (human or AI) can reconstruct the project's rationale.

What Comes Next

With the IEB in place and the observe-steer workflow established, Section 37.2: The Founder's Prototype Loop puts these ideas into practice with a complete vertical-slice prototype. You will build a working mini-product that combines structured output, prompt guardrails, a tiny evaluation harness, and basic tracing, all in a single coding session guided by the Prototype Playbook.

Self-Check

Q1: What are the five components of an Intent + Evidence Bundle, and what purpose does each serve?

(1) intent.md: captures non-negotiable requirements and stakeholder approvals. (2) eval/: holds golden test sets and regression baselines for automated quality gates. (3) prompts/: stores versioned prompt templates with changelogs, treating prompts as source code. (4) risk.md: documents the threat model including adversarial inputs, data leakage vectors, and bias risks. (5) cost.md: records token budgets, model routing strategies, and cost-per-interaction targets.

Q2: Explain the difference between "raw" vibe coding and "responsible" vibe coding. When is each appropriate?

Raw vibe coding generates code from natural language descriptions and ships it with minimal review; intent is verbal and ephemeral, testing is manual spot checks, and reproducibility is low. It is appropriate only for throwaway prototypes and personal scripts. Responsible vibe coding adds written intent documents, version-controlled prompts, automated evaluation suites, and documented constraints. It maintains the speed advantage of AI-assisted generation while ensuring that output is auditable, testable, and reproducible. Responsible vibe coding is appropriate for production code, team projects, and any regulated domain.

Q3: Why does AI-assisted refactoring reduce some switching costs while "data gravity" increases others? Give one example of each.

AI-assisted refactoring reduces code-level switching costs because an AI can rewrite an entire codebase from one SDK to another in hours (for example, migrating from OpenAI's SDK to Anthropic's SDK). Data gravity increases switching costs because accumulated evaluation data, production traces, and fine-tuning investments are tightly coupled to the current model and provider. For example, a golden test set of 500 cases tuned for GPT-4o's behavior may need substantial re-validation if you switch to Claude, because the models have different strengths and failure patterns. The evaluation data pulls your decisions toward the system that generated it.

Bibliography

AI-Assisted Development

Karpathy, A. (2025). "Vibe Coding." Personal blog post. karpathy.ai

The original articulation of "vibe coding" as a development style where the programmer describes intent and lets AI generate code. Provides the cultural context for the practice discussed in this section.

Peng, S., Kalliamvakou, E., Cihon, P., et al. (2023). "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." arXiv:2302.06590

A randomized controlled trial showing that developers using GitHub Copilot completed tasks 55% faster than a control group. Provides empirical evidence for the productivity gains that motivate vibe coding practices.

Documentation and Process

McCrory, D. (2010). "Data Gravity: In the Clouds." Dave McCrory's Blog. datagravity.org

Introduces the "data gravity" metaphor: as data accumulates in one location, it attracts applications and services. Applied in this section to explain why evaluation data and production traces create lock-in even when code is easily portable.
