Part IX: Safety & Strategy
Chapter 32: Safety, Ethics, and Regulation

LLM Risk Governance & Audit

Governance without engineering is policy theater. Engineering without governance is an audit waiting to happen.

Big Picture

Enterprise AI governance requires structured frameworks that map every LLM deployment to a risk classification, assign ownership, and maintain auditable records. Established frameworks like SR 11-7 (banking model risk), NIST AI RMF, and ISO 42001 provide the scaffolding. Building on the regulatory landscape from Section 32.4 and the enterprise application patterns from Section 28.1, this section covers how to build a practical AI governance program that satisfies regulators while remaining lightweight enough for engineering teams to follow.

Prerequisites

Before starting, make sure you are familiar with the regulatory landscape from Section 32.4, the observability practices from Section 29.1 that underpin audit processes, and the interpretability techniques from Section 18.1 that support model explainability requirements.

1. Governance Frameworks Comparison

Framework   | Origin              | Scope               | Key Contribution
SR 11-7     | US Federal Reserve  | Banking / Financial | Three lines of defense, independent validation
NIST AI RMF | US NIST             | Cross-sector        | Govern, Map, Measure, Manage lifecycle
ISO 42001   | ISO                 | International       | AI management system certification
EU AI Act   | European Parliament | EU market           | Risk-based obligations, conformity assessment

The NIST AI Risk Management Framework organizes governance into four core functions. As shown in Figure 32.5.1, Govern provides the overarching structure while Map, Measure, and Manage form a continuous cycle.

Figure 32.5.1: The NIST AI RMF defines four core functions; Govern is the overarching function while Map, Measure, and Manage form a continuous cycle.
Key Insight

Mental Model: The Flight Safety System. LLM risk governance resembles the aviation safety framework. Govern sets the flight rules and certifications. Map identifies the routes and weather conditions (risks). Measure tracks altitude, speed, and fuel (metrics and monitoring). Manage handles turbulence and course corrections (mitigation and response). Just as airlines maintain exhaustive flight logs and conduct regular safety audits, AI governance requires a model inventory, risk classifications, and audit trails. The analogy breaks down in one important way: aviation safety has decades of standardized practice, while AI governance frameworks are still maturing rapidly.

Fun Fact

Content provenance standards like C2PA (Coalition for Content Provenance and Authenticity) embed cryptographic signatures into AI-generated images, audio, and video, creating a cryptographically verifiable trail of origin. Adobe, Microsoft, and major camera manufacturers have adopted C2PA, making it the leading candidate for combating deepfakes at scale.

Governance becomes especially important in organizations with multiple LLM deployments. Without a centralized inventory, teams often deploy models with overlapping capabilities but inconsistent safety standards. One team's customer-facing chatbot might have rigorous guardrails and monitoring from Section 31.3, while another team's internal assistant operates with no safety controls. The model inventory described below provides the visibility needed to enforce consistent governance across the organization.

Tip

Start your model inventory today, even if it is just a spreadsheet. Record the model name, provider, version, deployment date, owning team, and risk tier for every LLM in production. Teams that wait until they have 20 or more deployments before creating an inventory find that half their models are undocumented, some are running deprecated versions, and nobody knows who owns the one handling customer data. A simple inventory prevents governance surprises.
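The spreadsheet version of this tip can start life as a few lines of Python. The sketch below writes a CSV with the fields the tip recommends; the file name and the example row are illustrative, not prescribed.

```python
# A minimal CSV model inventory -- the "spreadsheet" from the tip above.
# Field names follow the tip; the example row is illustrative.
import csv

FIELDS = ["model_name", "provider", "version",
          "deployment_date", "owning_team", "risk_tier"]

rows = [
    {"model_name": "support-bot", "provider": "OpenAI",
     "version": "gpt-4o-2024-08-06", "deployment_date": "2025-01-15",
     "owning_team": "ML Platform", "risk_tier": "medium"},
]

with open("model_inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

A file like this is trivially greppable and diffable in version control, which is often all the audit trail a small team needs at first.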

2. Model Inventory and Risk Classification

A model inventory tracks every LLM deployment in the organization, its risk tier, ownership, and review status. Code Fragment 32.5.1 below shows how to implement a risk-classified model inventory with automated review flagging.


# A risk-classified model inventory entry with automated review flagging
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

@dataclass
class ModelInventoryEntry:
    """Enterprise model inventory record for governance tracking."""
    model_id: str
    model_name: str
    use_case: str
    owner: str
    risk_tier: RiskTier
    deployment_date: str
    last_validation: str
    next_review: str
    data_sources: list[str]
    regulations: list[str]

    def needs_review(self) -> bool:
        """Flag entries whose scheduled review date has passed."""
        return datetime.fromisoformat(self.next_review) <= datetime.now()

    def to_dict(self) -> dict:
        return {
            "model_id": self.model_id,
            "model_name": self.model_name,
            "use_case": self.use_case,
            "owner": self.owner,
            "risk_tier": self.risk_tier.value,
            "overdue": self.needs_review(),
        }

entry = ModelInventoryEntry(
    model_id="LLM-CS-001", model_name="Customer Support Bot v2",
    use_case="Customer service automation", owner="ML Platform Team",
    risk_tier=RiskTier.MEDIUM, deployment_date="2025-01-15",
    last_validation="2025-01-10", next_review="2025-07-10",
    data_sources=["support_tickets", "knowledge_base"],
    regulations=["GDPR", "EU AI Act (limited risk)"],
)
print(entry.to_dict())
Code Fragment 32.5.1: A model inventory entry with risk-based classification. Each record captures tier, owner, data sources, applicable regulations, and review dates; needs_review flags entries whose scheduled review date has passed.
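At scale, the same needs_review check drives a periodic sweep of the whole inventory. The sketch below uses plain dicts with the same field names so it stands alone; the two entries and the alerting format are illustrative assumptions.

```python
# Sweep a model inventory for entries whose scheduled review has lapsed.
# Plain dicts stand in for full inventory records; rows are illustrative.
from datetime import datetime

inventory = [
    {"model_id": "LLM-CS-001", "owner": "ML Platform Team",
     "next_review": "2025-07-10"},
    {"model_id": "LLM-MKT-002", "owner": "Marketing",
     "next_review": "2099-01-01"},
]

def overdue(entries, now=None):
    """Return entries whose next_review date is not in the future."""
    now = now or datetime.now()
    return [e for e in entries
            if datetime.fromisoformat(e["next_review"]) <= now]

for e in overdue(inventory):
    print(f"{e['model_id']} (owner: {e['owner']}) is overdue for review")
```

In practice this sweep would run on a schedule and alert the owning team rather than print, but the filtering logic is the whole mechanism.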

3. Audit Trail Implementation

An immutable audit trail records every LLM interaction with hash-chaining so that any tampering with historical records is detectable. Code Fragment 32.5.2 below implements this pattern.


# An immutable, hash-chained audit log for LLM interactions
import hashlib
import json
from datetime import datetime, timezone

class AuditTrail:
    """Immutable audit log for LLM interactions."""

    def __init__(self):
        self.entries = []

    def _hash(self, entry: dict) -> str:
        return hashlib.sha256(json.dumps(entry).encode()).hexdigest()[:16]

    def log(self, request_id: str, model: str, input_text: str,
            output_text: str, user_id: str, metadata: dict | None = None):
        entry = {
            "request_id": request_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "user_id": user_id,
            # Store content hashes, not raw text, to protect privacy
            "input_hash": hashlib.sha256(input_text.encode()).hexdigest()[:16],
            "output_hash": hashlib.sha256(output_text.encode()).hexdigest()[:16],
            "metadata": metadata or {},
        }
        # Chain each entry to its predecessor for tamper detection
        if self.entries:
            entry["prev_hash"] = self._hash(self.entries[-1])
        self.entries.append(entry)
        return entry

    def verify_chain(self) -> bool:
        """Recompute each predecessor's hash; a mismatch means tampering."""
        for i in range(1, len(self.entries)):
            if self.entries[i].get("prev_hash") != self._hash(self.entries[i - 1]):
                return False
        return True
Code Fragment 32.5.2: An immutable audit trail with hash chaining. Each entry stores a hash of its predecessor, so verify_chain detects any after-the-fact modification to the log.

The hash-chained audit trail creates an immutable record of every LLM interaction. Figure 32.5.2 illustrates how each entry links to the previous one, making any tampering immediately detectable.

[Figure: three audit entries, each recording request ID, timestamp, model, user ID, input/output hashes, and the prev_hash of its predecessor. Verification recomputes the SHA-256 hash of each entry and compares it to the prev_hash stored in its successor; if Entry #2 is modified, the link stored in Entry #3 no longer matches and verify_chain() returns False.]
Figure 32.5.2: A hash-chained audit trail links each LLM interaction record to its predecessor, making any modification detectable through chain verification.
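The verification logic in Figure 32.5.2 reduces to a few lines of hashlib. The two entries below are simplified stand-ins for full audit records, kept minimal so the chaining mechanics are visible.

```python
# Chain verification in miniature: each entry stores a hash of its predecessor.
import hashlib
import json

def entry_hash(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry).encode()).hexdigest()[:16]

e1 = {"request_id": "req_001", "model": "gpt-4o"}
e2 = {"request_id": "req_002", "model": "gpt-4o", "prev_hash": entry_hash(e1)}

print(e2["prev_hash"] == entry_hash(e1))   # chain intact
e1["model"] = "tampered"                   # modify the earlier entry...
print(e2["prev_hash"] == entry_hash(e1))   # ...and the stored link breaks
```

This is the same check verify_chain performs across the full log: recompute, compare, and flag the first mismatch.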

In financial services, SR 11-7 provides a well-tested governance model. Figure 32.5.3 shows how its three lines of defense separate model development, independent validation, and audit oversight.

Figure 32.5.3: SR 11-7's three lines of defense separate model development, independent validation, and audit oversight.
Warning

Many organizations track traditional ML models but forget to inventory their LLM deployments. Every use of an LLM API, whether it is a direct OpenAI call, a LangChain chain, or an embedded copilot feature, should be registered in the enterprise model inventory with a risk classification and assigned owner.

Note

ISO 42001 is the first international standard for AI management systems. It provides a certifiable framework for organizations to demonstrate responsible AI practices, similar to how ISO 27001 certifies information security management. Certification may become a market differentiator as AI regulation increases.

Key Insight

Audit trails for LLM systems should use hash chaining (similar to blockchain) to ensure tamper resistance. Each log entry includes a hash of the previous entry, creating an immutable chain. If any entry is modified after the fact, the chain verification fails, alerting auditors to potential tampering.

Self-Check

1. What are the four core functions of the NIST AI RMF?

Answer
Govern (establish policies, roles, and accountability), Map (identify context, stakeholders, and risks), Measure (assess risks through testing and metrics), and Manage (mitigate risks and respond to incidents). Govern is the overarching function, while Map, Measure, and Manage form a continuous operational cycle.

2. What is SR 11-7 and why does it matter for LLM deployments in banking?

Answer
SR 11-7 is the US Federal Reserve's guidance on model risk management for banking institutions. It requires a three-lines-of-defense approach: model developers own the first line, independent validation teams provide the second line of challenge, and internal audit provides the third line of oversight. Any LLM used in banking decisions (credit, fraud, compliance) must comply with SR 11-7.

3. Why should audit trail entries use hash chaining?

Answer
Hash chaining creates a tamper-evident log where each entry includes a cryptographic hash of the previous entry. If any entry is modified after the fact, the hash chain breaks, and the verify_chain function returns False. This provides auditors with assurance that the log has not been altered, which is essential for regulatory compliance and incident investigation.

4. What should an enterprise model inventory capture for each LLM deployment?

Answer
At minimum: model ID, name, use case description, risk tier, owner/responsible team, deployment date, last validation date, next review date, data sources, applicable regulations, performance metrics, and known limitations. The inventory should also track dependencies (API providers, frameworks) and trigger alerts when reviews are overdue.

5. How does ISO 42001 differ from the NIST AI RMF?

Answer
ISO 42001 is a certifiable management system standard (similar to ISO 27001 for security), while NIST AI RMF is a voluntary framework. ISO 42001 specifies requirements for establishing, implementing, and continuously improving an AI management system, with formal audit and certification processes. NIST AI RMF provides guidance and best practices without a certification mechanism.
Real-World Scenario: Implementing an Enterprise LLM Model Inventory

Who: A chief data officer and a risk management team at a mid-size bank

Situation: Regulators asked the bank to provide a complete inventory of all AI models in production, including any LLM usage. The CDO discovered that 14 different teams were using LLM APIs across customer service, compliance, and marketing, with no central tracking.

Problem: Without an inventory, the bank could not demonstrate SR 11-7 compliance (model risk management). Some LLM deployments had no assigned owner, no documented risk tier, and no review schedule.

Dilemma: Requiring every team to stop and complete full documentation would halt ongoing projects. Ignoring the gap risked regulatory sanctions.

Decision: They mandated a lightweight registration form (10 fields) for every existing LLM deployment within two weeks, followed by full documentation within 90 days for high-risk models only.

How: The inventory captured model ID, use case, owner, risk tier, data sources, and applicable regulations. An automated alerting system flagged overdue reviews. The second and third lines of defense (validation and audit teams) were assigned to review all high-risk entries.

Result: Within two weeks, all 14 LLM deployments were registered. Three were reclassified from "low risk" to "medium risk" based on their actual data access patterns. The bank passed its regulatory review with commendation for the governance framework.

Lesson: Start model inventories with a lightweight, mandatory registration process. Perfection is the enemy of visibility; a simple inventory today is more valuable than a comprehensive one six months from now.

Tip: Document Your AI System's Limitations

Create a user-facing limitations page that honestly describes what your system cannot do, known failure modes, and when users should not rely on its output. This builds trust and reduces liability when edge cases inevitably occur.

Key Takeaways
  • The NIST AI RMF provides a four-function framework (Govern, Map, Measure, Manage) applicable across industries.
  • SR 11-7 requires three lines of defense for model risk in banking: development, independent validation, and audit.
  • Every LLM deployment should be registered in an enterprise model inventory with risk classification and assigned ownership.
  • Audit trails should use hash chaining for tamper resistance, logging request hashes (not raw content) to protect privacy. The observability tools from Chapter 29 provide the tracing infrastructure for building these audit trails.
  • ISO 42001 provides a certifiable AI management system standard that may become a market differentiator.
  • Risk classification should consider data sensitivity, decision impact, user population, and regulatory applicability.
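One way to operationalize that last takeaway is a simple scoring heuristic over the four factors. The 1-5 scales, equal weights, and tier thresholds below are illustrative assumptions, not a standard; real programs calibrate these against their regulatory context.

```python
# Illustrative risk-tier heuristic: rate four factors (1-5 each) and map the
# total to a tier. Weights and thresholds are assumptions, not a standard.
def classify_risk(data_sensitivity: int, decision_impact: int,
                  user_population: int, regulatory_exposure: int) -> str:
    score = (data_sensitivity + decision_impact
             + user_population + regulatory_exposure)
    if score >= 16:
        return "critical"
    if score >= 12:
        return "high"
    if score >= 8:
        return "medium"
    return "low"

print(classify_risk(5, 5, 4, 5))  # e.g., credit decisions on customer data
print(classify_risk(1, 1, 2, 1))  # e.g., an internal document summarizer
```

Even a crude heuristic like this beats ad hoc tier assignment, because it forces every deployment through the same four questions.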
Research Frontier

Open Questions:

  • What should a comprehensive LLM risk register look like, and how should it differ from traditional software risk management? LLMs introduce novel risk categories (hallucination, prompt injection, emergent behavior) that existing frameworks do not cover.
  • How can organizations audit LLM systems when the models are black boxes served via third-party APIs?

Recent Developments (2024-2025):

  • The NIST AI Risk Management Framework (AI RMF) gained broader adoption in 2024-2025, with practical implementation guides specifically addressing foundation model risks and governance structures.

Explore Further: Create a risk register for an LLM application you have access to. Identify at least 10 risks, categorize them (technical, ethical, legal, operational), and propose a mitigation strategy for each.

Exercises

Exercise 32.5.1: Governance Frameworks Conceptual

Compare the NIST AI RMF's four functions (Govern, Map, Measure, Manage) with the three lines of defense model from SR 11-7. What does each framework emphasize that the other does not?

Answer Sketch

NIST AI RMF emphasizes a lifecycle approach: Govern (establish policies), Map (identify risks), Measure (assess risks quantitatively), Manage (mitigate and monitor). SR 11-7 emphasizes organizational accountability: 1st line (model developers/users), 2nd line (risk management), 3rd line (internal audit). NIST focuses on what to do; SR 11-7 focuses on who does it. NIST is broader and technology-agnostic; SR 11-7 is specific to regulated industries. Best practice: use NIST for the process and SR 11-7 for the organizational structure.

Exercise 32.5.2: Risk Register Coding

Design a risk register template for an LLM application. Include columns for: risk ID, description, likelihood (1-5), impact (1-5), risk score, mitigation strategy, owner, and review date. Populate it with 5 example risks for a healthcare chatbot.

Answer Sketch

Example risks: (1) Hallucinated medical advice (likelihood: 4, impact: 5, score: 20, mitigation: RAG grounding + disclaimer). (2) PII exposure in responses (3, 5, 15, mitigation: PII filtering). (3) Unauthorized diagnosis (3, 5, 15, mitigation: output classifier + human escalation). (4) Provider model degradation (2, 4, 8, mitigation: canary testing). (5) Regulatory non-compliance (2, 4, 8, mitigation: compliance checklist + audit). Sort by risk score descending. Review quarterly or after any system change.
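A minimal implementation of the register described in the answer sketch might look like the following; the five risks and their scores mirror the sketch, and the column set follows the exercise prompt.

```python
# Risk register sketch for a healthcare chatbot (risks from the answer sketch).
# risk score = likelihood x impact; sorted descending for triage.
risks = [
    {"id": "R1", "description": "Hallucinated medical advice",
     "likelihood": 4, "impact": 5, "mitigation": "RAG grounding + disclaimer"},
    {"id": "R2", "description": "PII exposure in responses",
     "likelihood": 3, "impact": 5, "mitigation": "PII filtering"},
    {"id": "R3", "description": "Unauthorized diagnosis",
     "likelihood": 3, "impact": 5, "mitigation": "Classifier + human escalation"},
    {"id": "R4", "description": "Provider model degradation",
     "likelihood": 2, "impact": 4, "mitigation": "Canary testing"},
    {"id": "R5", "description": "Regulatory non-compliance",
     "likelihood": 2, "impact": 4, "mitigation": "Compliance checklist + audit"},
]

for r in risks:
    r["score"] = r["likelihood"] * r["impact"]

for r in sorted(risks, key=lambda r: r["score"], reverse=True):
    print(f'{r["id"]} score={r["score"]:>2} {r["description"]}')
```

A fuller version would add owner and review-date columns per the prompt; the scoring and triage ordering are the core of the exercise.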

Exercise 32.5.3: Model Inventory Analysis

Explain why enterprises need an AI model inventory (registry of all deployed models). What metadata should the inventory capture for each LLM deployment? How does this support audit requirements?

Answer Sketch

The inventory provides visibility into all AI usage across the organization. Metadata per deployment: model name and version, provider, use case description, risk classification, data sources, evaluation results, deployment date, owner, compliance status, incident history. Audit support: auditors can quickly identify all high-risk deployments, verify that each has proper documentation and testing, check that reviews are current, and trace any incident to the responsible team. Without an inventory, shadow AI deployments create unmanaged risk.

Exercise 32.5.4: Audit Trail Design Conceptual

Describe what information an LLM audit trail should capture for every interaction. Balance the need for comprehensive logging with privacy and storage constraints.

Answer Sketch

Capture: (1) Timestamp and request ID. (2) User identifier (hashed for privacy). (3) Input prompt (with PII redacted). (4) Model name and version. (5) Generation parameters. (6) Output response (with PII redacted). (7) Any guardrail triggers. (8) Latency and cost. Privacy balance: redact PII before logging, use role-based access to audit logs, implement retention policies (e.g., 90 days for full logs, 1 year for aggregated metrics), and encrypt logs at rest. Storage optimization: compress older logs, move to cold storage after the retention window.

Exercise 32.5.5: Governance Program Design Discussion

Design a lightweight AI governance program for a 200-person startup that uses LLMs in 3 products. Include: organizational roles, review processes, documentation requirements, and incident handling. How does this differ from governance at a large bank?

Answer Sketch

Startup: designate a part-time AI ethics lead, create a simple risk classification (high/medium/low) with lightweight review requirements, require model cards for all deployments, maintain a shared risk register, and establish an incident response channel. Review high-risk deployments quarterly. Large bank: dedicated AI governance team, formal three-lines-of-defense structure, mandatory model validation by independent teams, detailed documentation per SR 11-7, quarterly board reporting, and regulatory examination readiness. The key difference is formality and staffing: startups need pragmatic governance that does not slow shipping, while banks face regulatory mandates that require extensive documentation.

What Comes Next

In the next section, Section 32.6: LLM Licensing, IP & Privacy, we address licensing, intellectual property, and privacy considerations for LLM-generated content and training data.

Further Reading & References
Core References

NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0).

The US national standard for AI risk management, organized around four core functions: Govern, Map, Measure, and Manage. Provides a flexible, non-prescriptive framework adaptable to any organization size. Essential starting point for teams building an AI governance program.

Framework Standard

Board of Governors, Federal Reserve. (2011). SR 11-7: Guidance on Model Risk Management.

The banking industry's foundational model risk management guidance, requiring independent validation and three lines of defense. Its principles have been widely adopted beyond financial services for AI governance. Required reading for regulated industries deploying LLMs.

Regulatory Guidance

ISO. (2023). ISO/IEC 42001:2023 Artificial Intelligence Management System.

International standard for establishing, implementing, and maintaining an AI management system, modeled on ISO 27001 for information security. Provides a certifiable framework for demonstrating AI governance maturity. Relevant for enterprises seeking formal AI governance certification.

Framework Standard

Raji, I. D. et al. (2020). Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. FAT* 2020.

Proposes a practical internal audit framework for AI systems inspired by financial auditing practices. Covers scoping, testing, documentation, and remediation stages. Directly applicable to building internal audit processes for LLM deployments.

Audit Framework

Schuett, J. (2023). Risk Management in the Artificial Intelligence Act. European Journal of Risk Regulation.

Legal analysis of how the EU AI Act's risk management requirements translate into practical compliance obligations. Bridges the gap between legal text and engineering implementation. Useful for teams interpreting the AI Act's technical requirements.

Legal Analysis

Brundage, M. et al. (2020). Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims.

Multi-stakeholder report proposing mechanisms for verifying safety and fairness claims about AI systems, including third-party audits and bug bounties. Outlines a vision for accountable AI development practices. Recommended for organizations designing their external accountability structures.

Governance Report