
"With great power comes great responsibility. The same technology that can democratize access to knowledge can also amplify harm at unprecedented scale."
Guard, Red-Team-Ready AI Agent
Parts III-IX built, operated, and evaluated LLM systems. Part X zooms in on adversarial pressure: prompt injection, jailbreaks, data exfiltration, supply-chain attacks, model-level threats, and the red-teaming frameworks that surface them before attackers do. This is the chapter you read before launch, not after.
Chapter Overview
With the evaluation and observability foundations from Part IX in place, this chapter tackles the adversarial security dimension of deploying LLMs at scale. It covers the OWASP Top 10 for LLMs, prompt injection attacks and defenses, jailbreaking, data exfiltration, and the threat models that production systems face.
Building on the alignment techniques from Chapter 18 (RLHF, DPO), the chapter walks through red teaming frameworks and automated security testing (PyRIT, Garak, HarmBench, JailbreakBench), advanced attack vectors that operate beyond the prompt interface (tampered model artifacts, plaintext memory leaks, multimodal injection), and the testing patterns that turn red-team findings into reproducible regressions. It prepares the ground for runtime defenses in Chapter 48 and agent-specific safety in Chapter 49.
As LLMs become embedded in high-stakes decisions, adversarial security moves from research curiosity to deployment prerequisite. This chapter covers red-teaming, prompt injection, jailbreaks, and the testing frameworks that surface vulnerabilities before attackers do. It builds on the alignment techniques of Chapter 18 and applies to every system deployed in production.
- Defend against OWASP Top 10 LLM threats including prompt injection, jailbreaking, and data exfiltration
- Detect and mitigate hallucinations using self-consistency, citation verification, and constrained generation, complementing interpretability methods from Chapter 10
- Measure and reduce bias in LLM outputs through systematic auditing and model cards
- Navigate the EU AI Act, GDPR, and US regulatory frameworks for AI governance
- Implement enterprise risk governance using NIST AI RMF, ISO 42001, and SR 11-7 frameworks
- Understand model licensing taxonomies, IP ownership, and differential privacy for LLM training data
- Apply machine unlearning techniques for GDPR compliance, copyright removal, and safety alignment
- Conduct structured red teaming using PyRIT, Garak, and adversarial prompt libraries with CI/CD integration
- Implement EU AI Act compliance for GPAI models, including risk classification and conformity assessment
- Use automated red teaming benchmarks (HarmBench, JailbreakBench) for reproducible security evaluation
- Assess and reduce the environmental impact of LLM training using carbon tracking and efficiency techniques
- Defend against privacy attacks (training data extraction, membership inference) using differential privacy and defense-in-depth strategies
- Design federated learning systems for LLMs using FedAvg, federated LoRA, and secure aggregation frameworks
Prerequisites
- Chapter 12: Prompt Engineering (prompt design, structured outputs)
- Chapter 42: Evaluation and Observability (metrics, tracing, monitoring)
- Chapter 18: Alignment, RLHF, and DPO (alignment techniques)
- Production-engineering and LLMOps basics; covered in detail later in the book
Sections
- 47.1 Prompt Injection & Jailbreaking OWASP Top 10 framing, prompt injection defense, PII redaction, and direct vs. indirect injection in depth. Advanced
- 47.2 Data Poisoning, Extraction & Jailbreaking Training-time poisoning, model extraction and stealing, structured red-team programs, and the jailbreaking literature (GCG, multi-turn, role-play). Advanced
- 47.3 Red Teaming Frameworks & LLM Security Testing Red teaming is the practice of systematically attacking your own system to discover vulnerabilities before real adversaries do. Advanced
- 47.4 Supply Chain, Confidential Compute & Multimodal Threats Advanced LLM threat vectors that operate beyond the prompt interface: tampered model artifacts, plaintext data in shared memory, and injection through images and audio. Advanced
What's Next?
In the next chapter, Chapter 48: Guardrails and Runtime Safety, we cover the runtime defenses (input filters, output validators, allow/deny lists, model-level guardrails) that turn the attacks you just learned to find into bugs you can actually block. Agent-specific safety (Chapter 49) and privacy (Chapter 50) follow.