Part 11: From Idea to AI Product
Chapter 37 · Section 37.3

Documentation as Control Surface

"When AI writes the code, documentation changes its job. It stops being a record of what happened and becomes instructions for what happens next."

Deploy Deploy, Fastidiously Documented AI Agent
Big Picture

Traditional documentation explains what was built. In AI-assisted development, documentation controls what will be built next. When you hand an AI coding assistant a well-structured intent document, a set of JSON constraints, and a test fixture, you are not just recording decisions for future humans. You are programming the AI's behavior for the current session. This section explores three documentation imperatives that arise from this shift: capturing human intent, documenting AI decisions, and establishing trust records. It then dives deep into the Intent + Evidence Bundle (IEB) introduced in Section 37.1, showing how each component functions as a machine-readable control surface. The section closes with a Python validation script and a governance layer for tracking accountability across AI-generated and human-reviewed code.

Prerequisites

This section builds on the Intent + Evidence Bundle introduced in Section 37.1, the prompt engineering fundamentals from Chapter 11, and the evaluation framework from Chapter 29. Familiarity with JSON Schema, Pydantic, and version control (Git) is helpful.

A large open book functioning as a control panel with knobs, switches, and indicator lights instead of text, with a developer and an AI robot on opposite sides.
Figure 37.3.1: When AI writes the code, documentation shifts from explanation to control: intent capture, decision tracking, and machine-readable specifications that guide future AI actions.

1. The Documentation Shift

For decades, software documentation served a single purpose: explain what the system does so that future maintainers can understand and modify it. The audience was always human. The timing was always retrospective: you wrote documentation after the code existed.

Key Insight

Documentation has gained a second audience: the AI. When you write a clear intent document, you are simultaneously explaining the system to a future human maintainer and programming the behavior of an AI coding assistant. A well-written CLAUDE.md or AGENTS.md file does not just record decisions; it actively shapes what the AI generates next. This dual-audience principle changes how you think about every sentence you write in a project specification.

AI-assisted development breaks both of those assumptions. The audience now includes AI coding assistants that read your project files as context before generating code. The timing shifts from retrospective to prospective: you write documentation before the code exists, and the documentation directly shapes what the AI produces. In this new model, documentation is not a mirror reflecting past work. It is a steering wheel pointed at future work.

This shift creates three new imperatives that traditional documentation practices do not address.

2. Three Documentation Imperatives

2.1 Capturing Human Intent

The first imperative is to capture what the system must do, what it must not do, and why. This is the role of intent.md from the Intent + Evidence Bundle (Section 37.1). Intent documents differ from traditional requirements specifications in several ways:

Key Insight

Negative requirements are more valuable than positive ones in AI-assisted development. An AI coding assistant can often infer what you want from a brief description and a few examples. What it cannot infer are your organization's safety boundaries, compliance constraints, and business rules about what must never happen. A single line in intent.md stating "Never call the payments API in test mode with real customer data" can prevent a costly incident that no amount of positive requirements would catch. Spend more time on your "must not" list than on your "must do" list.

2.2 Documenting AI Decisions

The second imperative is to maintain an audit trail for AI-generated artifacts. When a human developer writes code, the decision process lives in their head, in code review comments, and in commit messages. When an AI generates code, the decision process lives in a prompt that may never be saved. Without deliberate documentation, the reasoning behind AI-generated code evaporates the moment the chat session closes.

An effective AI decision record captures four things:

  1. The prompt. What instruction was given to the AI? Store versioned prompt templates in the prompts/ folder of your IEB, as discussed in Chapter 11.
  2. The acceptance criteria. What was the human reviewer looking for when deciding to accept or reject the output?
  3. The disposition. Was the output accepted as-is, modified, or rejected? If modified, what changed and why?
  4. The confidence level. How confident is the team that this artifact is correct? "High confidence, backed by 50 passing test cases" is very different from "seems reasonable, no tests yet."
A developer sitting at a control panel shaped like an open book, where the pages have become levers, dials, and switches representing different document types: a rulebook, a test checklist, a prompt template, a risk register, and a cost meter. Arrows flow from the controls to a friendly AI robot whose posture shifts based on which lever is pulled.
Figure 37.3.2: Documentation as a control surface. Each document type (intent rules, test fixtures, prompt templates, risk registers, cost budgets) functions as a lever that directly shapes the AI assistant's behavior.
Real-World Scenario: AI Decision Log in Practice

Who: A senior engineer on a healthtech team building a medical triage chatbot for a regional hospital network.

Situation: The team maintained a file called decisions.jsonl in their IEB, recording every design decision: the prompt used, the model version, what code was generated, whether the output was accepted or modified, and the reviewer's name.

Problem: Three months after launch, a clinical auditor asked why the system routes chest pain symptoms to the emergency category rather than the urgent-care category. Without a decision trail, the team would have had no defensible answer.

Decision: The engineer queried the decision log and traced the routing rule to a specific prompt version, a specific model (Claude 3.5 Sonnet), a specific reviewer (Dr. Chen), and the 12 test cases that validated the routing logic.

Result: The audit was resolved in 20 minutes with a complete evidence chain. The auditor noted this as a best practice for the hospital's other AI systems.

Lesson: A structured decision log transforms "the AI did it and someone approved it, probably" into a traceable, auditable evidence chain.

2.3 Establishing Trust Records

The third imperative is to track the verification status of every component in the system. Not all code is created equal. In an AI-assisted codebase, each module falls somewhere on a trust spectrum:

Trust Levels for AI-Assisted Code
Trust Level Criteria Appropriate Use
Verified Human-reviewed, passing tests, eval suite coverage Production, safety-critical paths
Reviewed Human-reviewed, limited test coverage Non-critical production paths
Generated AI-generated, not yet human-reviewed Development and staging only
Experimental AI-generated, known limitations, flagged for rework Prototyping, throwaway exploration

Trust records can be as simple as a comment header in each file or as sophisticated as a metadata file that CI/CD pipelines consult before allowing deployment. The key requirement is that the trust level is explicit, machine-readable, and enforced. A deployment pipeline that blocks any module marked "Generated" from reaching production is worth more than a hundred code review reminders in Slack.

3. Machine-Readable Documentation

Prose documentation is necessary for human understanding, but AI tools work better with structured data. The most effective documentation strategy uses both: prose for context and rationale, structured formats for constraints that tools can consume and enforce automatically.

Three forms of machine-readable documentation are especially valuable:

{
 "project": "Acme Support Bot",
 "version": "1.2.0",
 "constraints": {
 "max_latency_ms": 3000,
 "max_tokens_per_turn": 500,
 "max_cost_per_turn_usd": 0.02,
 "forbidden_actions": [
 "call_payments_api_in_test_mode_with_real_data",
 "expose_internal_api_keys",
 "fabricate_order_numbers"
 ],
 "required_response_fields": [
 "answer",
 "confidence_score",
 "sources"
 ]
 },
 "models": {
 "primary": "claude-sonnet-4-20250514",
 "fallback": "claude-haiku-4-20250414",
 "routing_threshold": 0.85
 }
}
Code Fragment 37.3.1: A machine-readable constraints file that both humans and AI tools can consume. AI coding assistants read this to respect token budgets and forbidden actions. CI/CD pipelines validate deployments against it. The structured format eliminates ambiguity that prose documentation often introduces.
Fun Fact

The idea of "executable documentation" predates AI coding assistants by decades. Literate programming, invented by Donald Knuth in 1984, interleaved prose explanations with executable code in a single document. The difference today is directionality. Knuth's literate programs documented code that already existed. Modern intent documents and constraint files document code that does not exist yet, and the AI assistant uses them as instructions to create it. Knuth wrote documentation that described code. We write documentation that generates code.

4. The Intent + Evidence Bundle: Deep Dive

Section 37.1 introduced the Intent + Evidence Bundle (IEB) as a five-component folder structure. Here we examine each component in detail, explaining not just what it contains but how it functions as a control surface for AI-assisted development.

4.1 intent.md: The Non-Negotiables

The intent document is the single most important file in your project when working with AI assistants. It serves three roles simultaneously:

  1. System prompt for coding sessions. Paste it into your AI assistant at the start of every session. The assistant inherits your constraints, priorities, and forbidden behaviors.
  2. Onboarding document for new teammates. A new team member reads intent.md and understands what the system must do, what it must never do, and who approved those decisions.
  3. Audit artifact. Stakeholder sign-offs, scope approvals, and change history provide a paper trail for compliance and governance reviews.

Effective intent documents include these sections:

4.2 eval/: The Golden Set and Regression Tests

The evaluation folder contains the ground truth for your system's behavior. Its primary artifact is golden.jsonl, a set of input/output pairs that define correct behavior. As discussed in Chapter 29, a golden set serves as both a regression baseline and a specification. When you add a new test case after discovering a failure mode, you are simultaneously fixing a bug and updating the specification.

The eval folder should also contain:

4.3 prompts/: Versioned Templates

Every prompt template is source code and deserves the same version control discipline as Python or JavaScript files. The prompts/ folder stores templates with explicit version numbers and changelogs. When a model update breaks a prompt, you can diff the working version against the broken one and pinpoint what changed. Without versioning, prompt debugging becomes guesswork.

4.4 risk.md: Threat Model and Mitigations

The risk register documents known threats (prompt injection, data leakage, hallucination, bias) along with their likelihood, impact, and mitigations. It is updated after every incident, architecture change, or security review. AI assistants that read risk.md can generate code with appropriate defensive measures already in place.

4.5 cost.md: Token Budgets and Routing Strategy

The cost document records token budget assumptions, model routing strategies, and cost-per-interaction targets. It prevents the common failure mode where a prototype uses an expensive model for every request and the team discovers only at scale that the monthly bill is ten times the budget. Documenting costs early forces routing decisions early, which is when they are cheapest to implement.

5. Validating the IEB: A Compliance Script

An IEB is only useful if it is complete and current. The following Python script validates an IEB folder for structural completeness, checks that required files exist and contain meaningful content, and generates a compliance report.

"""Validate an Intent + Evidence Bundle (IEB) for completeness.

Checks that all required files exist, contain meaningful content,
and generates a compliance report with pass/fail status for each
component. Designed to run in CI/CD pipelines as a quality gate.
"""
import json
import sys
from dataclasses import dataclass, field
from pathlib import Path
from datetime import datetime


@dataclass
class CheckResult:
 """Result of a single validation check."""
 name: str
 passed: bool
 message: str
 severity: str = "error" # "error" or "warning"


@dataclass
class ComplianceReport:
 """Aggregated compliance report for an IEB."""
 ieb_path: str
 timestamp: str
 checks: list[CheckResult] = field(default_factory=list)

 @property
 def passed(self) -> bool:
 return all(c.passed for c in self.checks if c.severity == "error")

 @property
 def score(self) -> str:
 total = len(self.checks)
 passing = sum(1 for c in self.checks if c.passed)
 return f"{passing}/{total}"

 def to_dict(self) -> dict:
 return {
 "ieb_path": self.ieb_path,
 "timestamp": self.timestamp,
 "overall_pass": self.passed,
 "score": self.score,
 "checks": [
 {
 "name": c.name,
 "passed": c.passed,
 "severity": c.severity,
 "message": c.message,
 }
 for c in self.checks
 ],
 }


def _check_file_exists(ieb: Path, filename: str) -> CheckResult:
 """Check that a required file exists."""
 target = ieb / filename
 if target.exists():
 return CheckResult(
 name=f"file_exists:{filename}",
 passed=True,
 message=f"{filename} found.",
 )
 return CheckResult(
 name=f"file_exists:{filename}",
 passed=False,
 message=f"{filename} is missing. Create it to complete the IEB.",
 )


def _check_file_nonempty(ieb: Path, filename: str,
 min_lines: int = 3) -> CheckResult:
 """Check that a file has meaningful content (not just a header)."""
 target = ieb / filename
 if not target.exists():
 return CheckResult(
 name=f"content_check:{filename}",
 passed=False,
 message=f"{filename} does not exist; cannot check content.",
 )
 lines = target.read_text(encoding="utf-8").strip().splitlines()
 if len(lines) < min_lines:
 return CheckResult(
 name=f"content_check:{filename}",
 passed=False,
 message=(
 f"{filename} has only {len(lines)} lines. "
 f"Expected at least {min_lines} lines of content."
 ),
 severity="warning",
 )
 return CheckResult(
 name=f"content_check:{filename}",
 passed=True,
 message=f"{filename} contains {len(lines)} lines of content.",
 )


def _check_golden_set(ieb: Path,
 min_cases: int = 10) -> CheckResult:
 """Check that the golden test set has enough cases."""
 golden = ieb / "eval" / "golden.jsonl"
 if not golden.exists():
 return CheckResult(
 name="golden_set_size",
 passed=False,
 message="eval/golden.jsonl not found.",
 )
 lines = [
 ln for ln in golden.read_text(encoding="utf-8").splitlines()
 if ln.strip()
 ]
 # Validate each line is parseable JSON
 parse_errors = 0
 for i, line in enumerate(lines, 1):
 try:
 json.loads(line)
 except json.JSONDecodeError:
 parse_errors += 1

 if parse_errors > 0:
 return CheckResult(
 name="golden_set_size",
 passed=False,
 message=f"{parse_errors} lines in golden.jsonl are not valid JSON.",
 )
 if len(lines) < min_cases:
 return CheckResult(
 name="golden_set_size",
 passed=False,
 message=(
 f"golden.jsonl has {len(lines)} cases. "
 f"Minimum recommended: {min_cases}."
 ),
 severity="warning",
 )
 return CheckResult(
 name="golden_set_size",
 passed=True,
 message=f"golden.jsonl contains {len(lines)} test cases.",
 )


def _check_prompt_versions(ieb: Path) -> CheckResult:
 """Check that the prompts folder has at least one versioned template."""
 prompts_dir = ieb / "prompts"
 if not prompts_dir.exists():
 return CheckResult(
 name="prompt_versions",
 passed=False,
 message="prompts/ directory not found.",
 )
 templates = list(prompts_dir.glob("*.txt")) + list(prompts_dir.glob("*.md"))
 if not templates:
 return CheckResult(
 name="prompt_versions",
 passed=False,
 message="No prompt templates found in prompts/ directory.",
 )
 return CheckResult(
 name="prompt_versions",
 passed=True,
 message=f"Found {len(templates)} prompt template(s) in prompts/.",
 )


def validate_ieb(ieb_path: str,
 min_golden_cases: int = 10) -> ComplianceReport:
 """Run all validation checks and return a compliance report.

 Args:
 ieb_path: Path to the IEB directory.
 min_golden_cases: Minimum number of golden test cases required.

 Returns:
 A ComplianceReport with individual check results.
 """
 ieb = Path(ieb_path)
 report = ComplianceReport(
 ieb_path=str(ieb.resolve()),
 timestamp=datetime.now().isoformat(),
 )

 if not ieb.exists():
 report.checks.append(CheckResult(
 name="ieb_directory",
 passed=False,
 message=f"IEB directory not found: {ieb_path}",
 ))
 return report

 # Required files
 for filename in ["intent.md", "risk.md", "cost.md"]:
 report.checks.append(_check_file_exists(ieb, filename))
 report.checks.append(_check_file_nonempty(ieb, filename))

 # Required folders
 for folder in ["eval", "prompts"]:
 folder_path = ieb / folder
 report.checks.append(CheckResult(
 name=f"folder_exists:{folder}",
 passed=folder_path.is_dir(),
 message=(
 f"{folder}/ directory found."
 if folder_path.is_dir()
 else f"{folder}/ directory is missing."
 ),
 ))

 # Golden set
 report.checks.append(_check_golden_set(ieb, min_golden_cases))

 # Prompt templates
 report.checks.append(_check_prompt_versions(ieb))

 return report


def print_report(report: ComplianceReport) -> None:
 """Print a human-readable compliance report to stdout."""
 status = "PASS" if report.passed else "FAIL"
 print(f"\n{'='*60}")
 print(f"IEB Compliance Report: {status}")
 print(f"Path: {report.ieb_path}")
 print(f"Timestamp: {report.timestamp}")
 print(f"Score: {report.score}")
 print(f"{'='*60}\n")

 for check in report.checks:
 icon = "[PASS]" if check.passed else "[FAIL]"
 severity = f" ({check.severity})" if not check.passed else ""
 print(f" {icon} {check.name}{severity}")
 print(f" {check.message}\n")

 if report.passed:
 print("Result: All required checks passed.\n")
 else:
 print("Result: One or more checks failed. Review the report above.\n")


if __name__ == "__main__":
 target = sys.argv[1] if len(sys.argv) > 1 else "./ieb"
 report = validate_ieb(target)
 print_report(report)

 # Optionally write JSON report
 json_path = Path(target) / "compliance_report.json"
 json_path.write_text(
 json.dumps(report.to_dict(), indent=2),
 encoding="utf-8",
 )
 print(f"JSON report written to: {json_path}")

 sys.exit(0 if report.passed else 1)
Code Fragment 37.3.2: An IEB validation script that checks structural completeness, verifies golden test set quality, and produces both human-readable and JSON compliance reports. Integrate this into your CI/CD pipeline to enforce IEB hygiene on every commit. The exit code (0 for pass, 1 for fail) enables use as a quality gate.

6. The Governance Layer

Documentation as a control surface reaches its full potential when combined with an explicit governance layer that tracks accountability. In a codebase where humans and AI assistants both contribute code, every artifact needs a clear answer to three questions: Who created it? Who reviewed it? What is its trust level?

A practical governance approach uses file-level metadata headers:

# trust: verified
# author: ai/claude-sonnet-4
# reviewer: j.martinez
# review_date: 2025-11-15
# eval_coverage: 47/50 golden cases passing
# decision_ref: decisions.jsonl#line-142

def route_support_ticket(ticket: dict) -> str:
 """Route a support ticket to the appropriate team.

 Routing logic validated against 47 of 50 golden test cases.
 Three remaining cases involve edge conditions documented in
 eval/edge_cases.md (lines 23-41).
 """
 category = ticket.get("category", "general")
 urgency = ticket.get("urgency", "normal")

 if urgency == "critical" and category in ("billing", "security"):
 return "escalation_team"
 if category == "technical":
 return "engineering_support"
 return "general_support"
Code Fragment 37.3.3: A file-level governance header that tracks authorship, review status, evaluation coverage, and a reference back to the decision log. CI/CD pipelines can parse these headers to enforce deployment rules (for example, blocking any file with trust: generated from production).

The governance layer connects documentation to deployment policy. A CI/CD pipeline can scan for trust-level headers and enforce rules such as:

This approach scales accountability. Instead of relying on informal norms ("make sure you review AI-generated code"), the governance layer encodes review requirements into the deployment pipeline itself. The documentation becomes enforceable policy, not just good advice.

Note: Governance and Observability

The governance layer pairs naturally with the observability stack from Chapter 30. Trust-level metadata can be emitted as trace attributes, allowing you to filter production traces by trust level and spot patterns. If "generated" code that slipped through causes higher error rates, your observability dashboard will show it.

Key Takeaways

What Comes Next

With documentation functioning as a control surface and governance tracking accountability, Section 37.4: AI Coding Assistants: Trust but Verify examines the practical workflow of using AI coding tools (GitHub Copilot, Cursor, Claude Code) effectively and safely. You will learn patterns for reviewing AI-generated code, common failure modes to watch for, and techniques for steering assistants toward higher-quality output.

Self-Check
Q1: Why are negative requirements ("must not do") more important than positive requirements ("must do") when working with AI coding assistants?
Show Answer
AI coding assistants can often infer desired behavior from brief descriptions and examples, but they cannot infer organizational safety boundaries, compliance constraints, or business rules about forbidden behaviors. Negative requirements prevent costly incidents that no amount of positive specification would catch. For example, an AI might optimize for speed by caching personally identifiable information unless explicitly told never to do so.
Q2: What four elements should an AI decision record capture, and why is each important?
Show Answer
(1) The prompt: records what instruction was given to the AI, enabling reproduction. (2) The acceptance criteria: documents what the reviewer was looking for, making the review process transparent. (3) The disposition: records whether output was accepted, modified, or rejected, and why. (4) The confidence level: distinguishes between well-tested artifacts and those that "seem reasonable" but lack verification. Together, these four elements create an audit trail that allows any future stakeholder to reconstruct why the code exists in its current form.
Q3: How does a file-level governance header (trust level, author, reviewer, eval coverage) integrate with CI/CD pipelines to enforce code quality?
Show Answer
CI/CD pipelines can parse trust-level headers and enforce deployment rules automatically. For example: files marked "generated" are blocked from production deployment; files marked "reviewed" require at least one passing test suite; files marked "verified" require both human review and evaluation coverage above a threshold (such as 90%). Files without any trust header default to "experimental" and are blocked. This transforms documentation from advisory guidance into enforceable policy that scales across the entire codebase.

Bibliography

Documentation and Software Practice

Knuth, D. E. (1984). "Literate Programming." The Computer Journal, 27(2), 97-111. doi:10.1093/comjnl/27.2.97

The foundational paper on literate programming, which interleaves prose explanations with executable code. Provides historical context for the idea that documentation and code should be tightly coupled, a principle that AI-assisted development has transformed from retrospective to prospective.
Documentation

Aghajani, E., Nagy, C., Vega-Marquez, O. L., et al. (2019). "Software Documentation Issues Unveiled." Proceedings of the 41st International Conference on Software Engineering (ICSE). doi:10.1109/ICSE.2019.00122

A large-scale study of documentation problems in software projects. Identifies outdated documentation and missing rationale as top issues, both of which the IEB structure and governance layer address directly.
Software Engineering
AI-Assisted Development

Perry, N., Srivastava, M., Kumar, D., et al. (2023). "Do Users Write More Insecure Code with AI Assistants?" Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS). doi:10.1145/3576915.3623157

A controlled study finding that participants with access to AI coding assistants produced significantly less secure code than those without. Underscores the need for trust records, governance layers, and explicit security constraints in AI-assisted workflows.
Security

Peng, S., Kalliamvakou, E., Cihon, P., et al. (2023). "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." arXiv:2302.06590

Empirical evidence that AI coding assistants accelerate task completion by 55%. Combined with the security findings above, this highlights the core tension that documentation as a control surface resolves: speed gains are real, but they require structured safeguards.
Productivity