LLM-Specific Attack Surface

Section 71.3

"OWASP Top 10 for LLM applications: prompt injection, data leakage, supply chain. The list is short because the field is young, not because the threats are."

HalluxHallux, LLM-OWASP-Reader AI Agent
Big Picture

LLMs are not just defender tools; they are also targets. The attack surface unique to LLM systems is now catalogued in two community-maintained references that every defender should treat as canonical: the OWASP Top 10 for LLM Applications (a vulnerability ranking aimed at developers) and MITRE ATLAS (an adversarial-tactics knowledge base modelled on MITRE ATT&CK but for AI systems). Both references converge on a similar threat model: prompt injection is the dominant risk, followed by training-data poisoning, membership-inference and extraction, and model theft. This section walks through each, the mechanism that produces it, and the production defense.

OWASP Top 10 for LLM + MITRE ATLAS: four attack classes with numeric thresholds
Figure 71.3.1: The four LLM-specific attack classes from OWASP Top 10 and MITRE ATLAS, with the concrete thresholds and the defences that have stabilised in 2026. Prompt injection sits at #1 because it is structurally unfixable in current architectures (Greshake 2023 indirect-injection paper); the 0.001%-of-corpus poisoning threshold (Carlini et al. 2024) is what motivates aggressive provenance tracking.

Prerequisites

This section assumes the LLM-attack-vector vocabulary from Section 48.1 and the red-team use cases from Section 71.2.

Prompt Injection

Fun Fact

The term "prompt injection" was coined by Simon Willison in a September 2022 blog post analyzing a GPT-3 demo that was tricked into translating French to English via a prompt embedded in the input text. Willison reportedly wrote the post in a single afternoon and pushed publish without realizing he was naming an entire vulnerability class; the OWASP LLM Top 10 ranks prompt injection at #1 to this day.

Warning: Prompt Injection

The #1 LLM-specific risk. An attacker controls part of the input (via uploaded document, retrieved web content, scraped HTML, tool output) and embeds instructions that the model treats as part of the prompt. Mitigations: input/output isolation between trusted and untrusted content, prompt-injection classifiers, restricted tool permissions in agentic systems. (See Section 49.1 for the agent-specific framing.)

The mechanism is the structural feature of all current LLM architectures: there is no clean separation in the input between "trusted system prompt" and "untrusted user-or-retrieved content." Every defense is an approximation. The defenses that work in production combine three layers, none of which is sufficient on its own:

  1. Input classification labels each chunk as "trusted" or "untrusted" and the trusted system prompt instructs the model to treat untrusted content as data, not as instructions.
  2. Output filtering catches the most common injection-success signatures ("ignore previous instructions" patterns in the output, sudden shifts in tone, requests to expose system prompts).
  3. Scoped tool permissions in agentic systems: the agent cannot perform privileged actions just because the user-controlled input asked it to.

The combination is defense-in-depth; no single layer is sufficient.

Training-Data Poisoning

Warning: Training-Data Poisoning

An attacker contributes adversarial training data (web content, public datasets, even careful prompt-response logs reuploaded as synthetic data) that causes the model to develop a backdoor. Mitigations: provenance tracking on training data, anomaly detection on training loss, post-train probing for known backdoor triggers.

The attack vector is publication. An attacker who controls a website included in the major crawls (Common Crawl, OpenWebText) can place adversarial content that, when ingested into pretraining, biases the model in a specific direction. The 2024 to 2025 research literature on poisoning has shown that quite small fractions of poisoned data (well under 1 percent in some experimental setups) can produce reliable backdoor triggers. The defenses are imperfect: provenance tracking on training data and curation of the corpus are the standard practices; anomaly detection on training loss can catch some classes of attack; post-training probing against known triggers is a final check. None of these is a complete defense, but the combination raises the attacker's cost meaningfully.

Membership-Inference and Extraction

Warning: Membership-Inference and Extraction

Attackers can determine whether specific data was in a model's training set, and in some cases extract it verbatim. Concerns for any model trained on private data. Mitigations: differential privacy in training, output-filtering, rate-limiting at inference. (See Section 50.1.)

The attack: an adversary with API access submits carefully-crafted prompts and infers from the model's output whether specific data was in the training set, or in stronger forms extracts the data verbatim. The attack is most consequential for models fine-tuned on private data (customer chat logs, internal documents, medical records); a model that memorized specific records can be coaxed to reproduce them. Three defenses stack:

Model Extraction and Stealing

Warning: Model Extraction / Stealing

Attackers can train a competitor model on the outputs of a target model. Commercial mitigations: rate-limiting, watermarking, output-pattern detection, ToS enforcement. The cryptographic mitigations remain underdeveloped.

The economic logic is straightforward: if a frontier model costs hundreds of millions of dollars to train and an adversary can produce a near-clone by training a smaller model on its outputs, the IP protection is fragile. Major model providers have moved to active enforcement on three fronts:

The OpenAI / DeepSeek dispute in early 2025 was the most-publicized example; commercial enforcement against suspected model extraction is now an active legal posture for frontier providers.

Postmortem
The LLM-Amplified False-Positive Storm (representative)

A composite of several real 2024-2025 incidents reported anonymously at incident-response conferences. A mid-market SaaS company deployed a SOC-triage LLM that read every alert, enriched it with prior-incident context, and posted a recommendation to Slack with an "auto-execute in 5 minutes unless someone objects" flag for low-severity triage. One afternoon a misconfigured authentication provider began emitting tens of thousands of "anomalous login from unusual location" alerts because of a routing change. The LLM, lacking awareness of the upstream config change, classified them as a credible credential-stuffing campaign, recommended automated account-lockouts, and (because no analyst was watching Slack closely during a deploy window) the auto-execute fired. Roughly 6,000 customer accounts were locked out before a senior engineer noticed and pulled the plug. Total downtime for affected customers: about 90 minutes. Lessons that were widely propagated afterwards: (1) "auto-execute after timeout" is a footgun for SOC-LLMs and was retired across the industry; (2) the LLM's recommendation should always carry a confidence interval that auto-execution can refuse to act on; (3) anomaly-detection systems that feed the LLM need a kill-switch that propagates to the LLM as a "do not auto-act" signal. The pattern of LLM-amplified false positives is now a standard topic in MITRE ATLAS guidance.

Real-World Scenario
The Greshake Indirect Prompt-Injection Demonstration

Who. Kai Greshake and collaborators, security researchers based in Saarland and at NVIDIA, who published the foundational indirect-prompt-injection paper in February 2023 (arXiv:2302.12173). Situation. Pre-Greshake, the prompt-injection threat model focused on the direct case: a user types instructions intended to bypass system-prompt guardrails. The Greshake demonstration showed the more insidious indirect case: an attacker plants instructions in third-party content (a web page, an email, a PDF) that the LLM later retrieves or processes, and the LLM treats those instructions as user-issued. Problem. Every LLM application that processes external content (RAG systems, browser-augmented chat, email-summarizing agents, code-review tools that read pull-request descriptions) becomes vulnerable to instruction injection via the external surface. The Greshake demonstration showed concrete exfiltration: an attacker controls a web page, the user asks Bing Chat or similar to summarize that page, and the attacker's hidden instructions exfiltrate the user's prior conversation by injecting URL parameters into a follow-up suggestion. Decision. The OWASP Top 10 for LLM Applications (versions 1.0 and 1.1) ranked prompt injection as risk #1 explicitly because the Greshake demonstration showed how broad the attack surface is. MITRE ATLAS incorporated the technique as ML.T0051 ("LLM Prompt Injection"). How. The defense, codified in Section 71.4, is structural: input classification labels content by trust level, the system prompt instructs the model to treat untrusted content as data rather than as policy, output filtering catches injection-success signatures, and tool permissions are scoped tightly so a successful injection cannot directly perform privileged actions. Result. By 2026 every major LLM-application framework (LangChain, Semantic Kernel, LlamaIndex) ships with prompt-injection-resistant patterns as defaults. Lesson. The structural feature of LLM architecture (no clean separation between trusted system prompt and untrusted content) is unfixable in the current paradigm; the defense must be defense-in-depth with no single load-bearing layer.

Numeric Example
Poisoning thresholds and the cost of a compromised pretraining corpus

The 2024-2025 research on training-data poisoning produced concrete numbers. Carlini et al. (2024) "Poisoning Web-Scale Training Datasets is Practical" showed that an attacker who controls roughly 0.001 percent of a web-scale crawl (i.e., a few thousand documents in a multi-billion-document corpus) can plant reliable backdoor triggers in the resulting model. Subsequent work has shown that the threshold can be even lower for specific kinds of attack: targeted-instruction-following backdoors triggered by uncommon phrases require closer to 0.0001 percent of the corpus.

Defender cost. A full re-pretraining of a 70B model from scratch costs $1.5-3M in 2026 (Section 61.3 NumericExample) and takes 2-6 weeks of GPU time. Provenance tracking on a multi-trillion-token corpus adds roughly 5-10 percent overhead to data preparation; n-gram-overlap-based contamination checking adds 2-5 percent. Post-training probing for known backdoor triggers adds roughly $50-100K in compute per probe campaign. The combined defense raises the attacker's cost meaningfully but does not produce a hard guarantee; the only structural guarantee comes from controlling the pretraining corpus end-to-end, which is feasible only for the most frontier labs.

Membership-inference attacks on fine-tuned models typically require 1,000-10,000 API queries per target record under standard adversarial conditions; differential-privacy training with epsilon = 8 (a common operational choice) reduces attack success rates by roughly 30-50 percent at a 1-3 percent absolute accuracy cost. The trade-off is real but not prohibitive; the operational decision is dominated by sensitivity-of-data considerations, not by accuracy considerations.

See Also
Key Takeaways

What Comes Next

Section 71.4 covers the trust-boundary architecture for security-sensitive LLM systems, the pattern that has consolidated around input classification, output filtering, tool sandboxing, and audit logging.

Self-Check
1. Why is prompt injection ranked #1 on the OWASP Top 10 for LLM Applications, and what is the structural feature of LLMs that makes the vulnerability unfixable in the current paradigm?
Show Answer
Prompt injection is #1 because the structural feature of all current LLM architectures is that there is no clean separation in the input between "trusted system prompt" and "untrusted user-or-retrieved content"; the model treats everything in its context window as a single sequence and cannot reliably distinguish policy from data. Every defense (input classification, output filtering, tool permission scoping) is an approximation of a structural separation that does not exist in the model itself. The defense must be defense-in-depth at the application layer because the model layer cannot enforce it. The Greshake demonstration of indirect injection via third-party content showed how broad the attack surface is in production systems.
2. The 2024-2025 research showed that poisoning roughly 0.001 percent of a web-scale pretraining corpus can plant reliable backdoors. What defenses raise the attacker's cost, and what is the structural limitation?
Show Answer
The defenses raise attacker cost without producing a hard guarantee: provenance tracking on training data, anomaly detection on training loss, post-training probing against known backdoor triggers, and aggressive curation of the corpus. The structural limitation is that none of these defenses is a complete solution against an adaptive attacker who can iterate on poisoning strategies until they find one that evades the deployed defenses. The only structural guarantee comes from end-to-end control of the pretraining corpus, which is feasible only for the most frontier labs that can afford to curate their own datasets rather than rely on Common Crawl-style web scrapes.
3. The 2024 OpenAI / DeepSeek dispute centered on model extraction. What is the attack, what makes it commercially consequential, and what mitigations exist?
Show Answer
Model extraction is the training of a competitor model on the outputs of a target model: an adversary with API access submits a high volume of queries, collects the outputs, and uses them as supervised fine-tuning data for a smaller open-weight model. Commercially consequential because a frontier model costs hundreds of millions of dollars to train and a near-clone can be produced at a fraction of the cost. Mitigations: API rate limits that detect extraction-friendly query patterns, output watermarking that identifies trained-on-target-outputs competitors, ToS provisions that prohibit training derived models on API outputs, and active legal enforcement against suspected extraction. The cryptographic mitigations (e.g., watermark fragility) remain underdeveloped; the dominant defense in 2026 is the commercial-legal combination.

What's Next?

In the next section, Section 71.4: Trust Boundaries for LLM Systems, we build on the material covered here.

Further Reading

LLM Attack Surface

OWASP (2024). "Top 10 for LLM Applications." owasp.org/www-project-top-10-for-large-language-model-applications. The reference taxonomy for LLM application risks.
Greshake, K., Abdelnabi, S., Mishra, S., et al. (2023). "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." AISec 2023. arXiv:2302.12173. The canonical paper on indirect prompt injection; defines the architectural threat model.

Adversarial Robustness

Zou, A., Wang, Z., Carlini, N., et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv:2307.15043. GCG attack; the canonical reference for transferable jailbreaks.
Perez, F., & Ribeiro, I. (2022). "Ignore Previous Prompt: Attack Techniques For Language Models." arXiv:2211.09527. Early empirical catalog of prompt-injection attack patterns.