"The code compiles, the tests pass, and the vulnerability is already in production."
Agent X, Security-Haunted AI Agent
AI-generated code is not inherently trustworthy. Studies consistently show that LLM-generated code contains security vulnerabilities, functional bugs, and API usage errors at rates comparable to (and in some categories exceeding) human-written code. As 85% of developers adopt AI coding assistants, the volume of AI-generated code in production is growing rapidly. Understanding the specific failure modes of AI-generated code, and building automated pipelines to catch them, is now a core software engineering competency. This section examines what goes wrong, why it goes wrong, and how to detect and prevent the most common quality and security issues.
Prerequisites
This section builds on code generation agents from Section 25.1, SWE-bench evaluation from Section 25.6, and agentic coding workflows from Section 25.7. Familiarity with software testing concepts, static analysis, and basic security terminology (CWE, OWASP) is helpful.
1. Security Vulnerabilities in LLM-Generated Code
Pearce et al. (2022) conducted the first systematic study of security vulnerabilities in code generated by GitHub Copilot. Across 1,689 generated programs in C, Python, and JavaScript, approximately 40% contained security vulnerabilities. The vulnerabilities were not random; they followed predictable patterns that map directly to the CWE (Common Weakness Enumeration) and OWASP Top 10 classifications.
The most common vulnerability categories in LLM-generated code include:
CWE-79: Cross-Site Scripting (XSS). LLMs frequently generate web code that renders user input without sanitization. When asked to build a simple web page that displays user comments, the model often inserts user content directly into HTML templates. The model "knows" about XSS (it can explain the vulnerability if asked) but does not consistently apply defensive coding practices in generated output.
CWE-89: SQL Injection. String formatting for SQL queries remains a persistent pattern in LLM-generated code. Models generate f"SELECT * FROM users WHERE name = '{user_input}'" instead of parameterized queries. This occurs even when the model is given a codebase that consistently uses parameterized queries, because the training data contains more examples of string-formatted SQL than safe alternatives.
CWE-798: Hardcoded Credentials. LLMs sometimes generate code with placeholder credentials that look like real secrets (such as password = "admin123" or api_key = "sk-..."). While these are usually placeholders, they can end up in committed code if developers do not review carefully. More subtly, generated test files may contain test credentials that resemble production secrets.
CWE-22: Path Traversal. When generating file-handling code, LLMs often fail to validate or sanitize file paths, allowing directory traversal attacks. A function that serves files based on user-provided filenames may not check for ../ sequences.
# Common security vulnerabilities in LLM-generated code
# (cursor is assumed to be an open DB-API cursor)

# BAD: SQL injection via string formatting (CWE-89)
def get_user_bad(username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"
    cursor.execute(query)  # Vulnerable to: ' OR '1'='1

# GOOD: Parameterized query
def get_user_good(username: str):
    query = "SELECT * FROM users WHERE name = %s"
    cursor.execute(query, (username,))

# BAD: XSS via unescaped output (CWE-79)
from flask import Flask, request

app = Flask(__name__)

@app.route("/comment")
def show_comment_bad():
    comment = request.args.get("text", "")
    return f"<h1>Comment: {comment}</h1>"  # Vulnerable to script injection

# GOOD: Escaped output
from markupsafe import escape

@app.route("/comment")
def show_comment_good():
    comment = request.args.get("text", "")
    return f"<h1>Comment: {escape(comment)}</h1>"

# BAD: Path traversal (CWE-22)
def serve_file_bad(filename: str):
    with open(f"/uploads/{filename}", "rb") as f:  # Allows ../../../etc/passwd
        return f.read()

# GOOD: Path validation
import os

def serve_file_good(filename: str):
    safe_path = os.path.realpath(os.path.join("/uploads", filename))
    if not safe_path.startswith("/uploads/"):
        raise ValueError("Invalid file path")
    with open(safe_path, "rb") as f:
        return f.read()
2. Correctness and Functional Bugs
Beyond security, LLM-generated code contains functional bugs that produce incorrect results without causing errors or crashes. These bugs are particularly dangerous because they pass tests (if tests are insufficiently rigorous) and produce output that looks correct at first glance.
Common categories of functional bugs include: off-by-one errors in loop boundaries and array indexing; incorrect edge case handling for empty inputs, null values, and boundary conditions; race conditions in concurrent code where the model omits necessary synchronization; incorrect operator precedence in complex expressions; and silent data truncation where type conversions lose information without raising errors.
A particularly subtle class of functional bugs involves semantically correct but logically wrong implementations. The generated code may correctly implement an algorithm that does not solve the stated problem. For example, asked to implement "find the k closest points to the origin," an LLM might correctly implement a sorting-based approach but use Manhattan distance instead of Euclidean distance, or sort in the wrong order. The code is internally consistent and well-structured but produces wrong answers.
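The distance-metric mix-up described above is easy to reproduce. The sketch below is a hypothetical illustration (function names and test points are invented for this example): both implementations are internally consistent and well-structured, but only one matches the specification.

```python
import math

def k_closest_bad(points: list[tuple[float, float]], k: int) -> list[tuple[float, float]]:
    """Plausible but wrong: sorts by Manhattan distance, not Euclidean."""
    return sorted(points, key=lambda p: abs(p[0]) + abs(p[1]))[:k]

def k_closest_good(points: list[tuple[float, float]], k: int) -> list[tuple[float, float]]:
    """Correct: sorts by Euclidean distance to the origin."""
    return sorted(points, key=lambda p: math.hypot(p[0], p[1]))[:k]

# A point where the two metrics disagree:
# (2, 2) has Euclidean distance ~2.83 but Manhattan distance 4;
# (0, 3) has Euclidean distance 3 but Manhattan distance 3.
points = [(2.0, 2.0), (0.0, 3.0)]
print(k_closest_bad(points, 1))   # [(0.0, 3.0)] -- wrong metric picks this
print(k_closest_good(points, 1))  # [(2.0, 2.0)] -- actually closer
```

A casual test with axis-aligned points (where the metrics agree) would pass both versions, which is exactly why property-based or adversarial test inputs matter for this bug class.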
3. Code Hallucination: Plausible but Wrong API Usage
Jesse et al. (2023) studied "code hallucination," where LLMs generate code that references APIs, functions, or parameters that do not exist. The generated code looks syntactically correct and follows reasonable naming conventions, but calls functions or passes arguments that the library does not support.
Code hallucination manifests in several ways. Non-existent methods: the model calls a function that sounds like it should exist (e.g., pandas.DataFrame.to_sqlite()) but does not. Deprecated patterns: the model uses API patterns from older library versions that have been removed. Invented parameters: the model passes keyword arguments that are not accepted by the function (e.g., json.dumps(data, sort=True) instead of sort_keys=True). Wrong library attribution: the model calls a function from the wrong library (e.g., using a numpy function that only exists in scipy).
The root cause is that the model's training data contains code from many library versions and sometimes from unofficial sources (blog posts, tutorials, Stack Overflow answers) that may be outdated or incorrect. The model generates the most statistically likely code completion, which may not correspond to any actual API.
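Invented keyword arguments are one of the cheapest hallucinations to catch mechanically, because the interpreter can introspect the real signature. A minimal sketch (the helper name `kwargs_are_valid` is our own, not a standard API):

```python
import inspect
import math

def kwargs_are_valid(func, kwargs: dict) -> bool:
    """Return True if every keyword argument is accepted by func."""
    try:
        sig = inspect.signature(func)
    except (TypeError, ValueError):
        return True  # Some builtins expose no introspectable signature
    params = list(sig.parameters.values())
    # If func accepts **kwargs, any keyword name is technically valid
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return True
    names = {p.name for p in params
             if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                           inspect.Parameter.KEYWORD_ONLY)}
    return set(kwargs) <= names

# math.isclose accepts rel_tol/abs_tol, not a "tolerance" parameter
print(kwargs_are_valid(math.isclose, {"tolerance": 1e-6}))  # False: invented
print(kwargs_are_valid(math.isclose, {"rel_tol": 1e-6}))    # True
```

This only validates argument names, not semantics, and functions that swallow `**kwargs` (such as `json.dumps`) will accept invented keywords silently; a type checker run over the generated code provides a stronger version of the same check.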
Researchers have found that LLMs will confidently generate code using packages that have never existed. Security researchers exploited this by creating packages with names that LLMs frequently hallucinate, then uploaded those packages (containing tracking code) to PyPI. Within weeks, the packages had thousands of downloads from developers who trusted their AI coding assistant's suggestions without checking. This attack vector is now known as "package hallucination" or "AI package confusion."
4. Data Contamination and Benchmark Gaming
LLMs trained on code from public repositories may have memorized solutions to benchmark problems. When a model achieves high scores on HumanEval or MBPP, it is unclear how much of that performance reflects genuine code generation ability versus memorization of training examples that overlap with the benchmark.
SWE-bench (covered in Section 25.6) was designed partly to address this concern by using real GitHub issues. However, even SWE-bench faces contamination risks: the repository histories used for evaluation may appear in training data for newer models. SWE-bench Verified and SWE-bench Live attempt to mitigate this through human verification and continuously updated evaluation sets.
For practitioners, the implication is that benchmark scores should be treated as upper bounds on expected real-world performance. A model that scores 70% on HumanEval may perform significantly worse on your specific codebase with its unique patterns, internal libraries, and domain conventions. Always evaluate models on your own representative tasks before committing to production deployment.
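Evaluating on your own tasks does not require benchmark infrastructure; a harness that runs candidate implementations against your own input/expected pairs and reports a pass rate is enough to start. A minimal sketch with invented example tasks (the deliberate integer-division bug stands in for a model-generated mistake):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EvalTask:
    name: str
    func: Callable                       # candidate (e.g. model-generated) code
    cases: list[tuple[tuple, Any]]       # (args, expected) pairs from your domain

def pass_rate(tasks: list[EvalTask]) -> float:
    """Fraction of tasks whose implementation passes all of its cases."""
    passed = 0
    for task in tasks:
        try:
            if all(task.func(*args) == expected for args, expected in task.cases):
                passed += 1
        except Exception:
            pass  # crashes count as failures
    return passed / len(tasks) if tasks else 0.0

tasks = [
    EvalTask("median", lambda xs: sorted(xs)[len(xs) // 2],
             [(([1, 3, 2],), 2)]),                   # passes
    EvalTask("mean", lambda xs: sum(xs) // len(xs),  # integer-division bug
             [(([1, 2],), 1.5)]),                    # fails: returns 1
]
print(pass_rate(tasks))  # 0.5
```

Keeping twenty or thirty such tasks drawn from your real codebase gives a far better signal for model selection than a public leaderboard number.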
5. Automated Testing and Verification of Generated Code
Given the systematic quality issues in AI-generated code, automated verification is essential. An effective verification pipeline combines multiple complementary techniques: type checking catches type errors and some API hallucinations; unit tests catch functional bugs; static analysis catches security vulnerabilities and code smells; and property-based testing catches edge cases that example-based tests miss.
The key insight for AI-generated code verification is that the test suite should be written (or at least reviewed) by humans, not by the same model that generated the code. When the same LLM generates both the implementation and the tests, the tests tend to validate what the code does rather than what it should do. Bugs in the implementation are mirrored by corresponding gaps in the test suite. This is why the test-driven prompting workflow (described in Section 25.7) is so effective: human-written tests serve as an independent specification that the generated code must satisfy.
# Property-based testing for AI-generated code
from hypothesis import given, strategies as st, settings

# Suppose an LLM generated this sorting function
def ai_generated_sort(items: list[int]) -> list[int]:
    """Sort a list of integers in ascending order."""
    return sorted(items)  # Looks correct, but let's verify properties

# Property-based tests verify invariants, not specific examples
@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_preserves_length(items):
    """Sorted output must have the same length as input."""
    result = ai_generated_sort(items)
    assert len(result) == len(items)

@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_is_ordered(items):
    """Each element must be less than or equal to the next."""
    result = ai_generated_sort(items)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_preserves_elements(items):
    """Sorting must not add, remove, or modify elements."""
    result = ai_generated_sort(items)
    assert sorted(result) == sorted(items)  # same multiset

@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_is_idempotent(items):
    """Sorting an already-sorted list should produce the same result."""
    result1 = ai_generated_sort(items)
    result2 = ai_generated_sort(result1)
    assert result1 == result2
6. Static Analysis Integration: CodeQL, Semgrep, and Bandit
Static analysis tools scan source code for known vulnerability patterns without executing the code. Integrating these tools into the AI coding workflow creates a safety net that catches vulnerabilities before they reach production.
CodeQL (GitHub) is a semantic code analysis engine that treats code as data and queries it using a SQL-like language. CodeQL ships with thousands of pre-built queries for security vulnerabilities across multiple languages. It excels at finding complex vulnerabilities that involve data flow across multiple functions, such as taint tracking from user input to SQL queries.
Semgrep is a lightweight, pattern-based static analysis tool that is fast enough to run on every save or commit. Its rules are written in a YAML format that is easy to customize for project-specific patterns. Semgrep is particularly good at detecting insecure coding patterns (like the SQL injection and XSS examples above) and enforcing coding standards.
Bandit is a Python-specific security linter that checks for common security issues: use of eval(), insecure random number generation, hardcoded passwords, and unsafe YAML loading. It is lightweight and fast, making it suitable for pre-commit hooks.
# Pipeline for scanning AI-generated code with multiple tools
import subprocess
import json
from dataclasses import dataclass

@dataclass
class ScanResult:
    tool: str
    severity: str
    message: str
    file: str
    line: int

def run_bandit(target_path: str) -> list[ScanResult]:
    """Run Bandit security linter on Python code."""
    result = subprocess.run(
        ["bandit", "-r", target_path, "-f", "json", "-ll"],
        capture_output=True, text=True
    )
    findings = []
    if result.stdout:
        data = json.loads(result.stdout)
        for issue in data.get("results", []):
            findings.append(ScanResult(
                tool="bandit",
                severity=issue["issue_severity"],
                message=issue["issue_text"],
                file=issue["filename"],
                line=issue["line_number"],
            ))
    return findings

def run_semgrep(target_path: str, config: str = "auto") -> list[ScanResult]:
    """Run Semgrep with auto-detected rules."""
    result = subprocess.run(
        ["semgrep", "--config", config, target_path, "--json"],
        capture_output=True, text=True
    )
    findings = []
    if result.stdout:
        data = json.loads(result.stdout)
        for match in data.get("results", []):
            findings.append(ScanResult(
                tool="semgrep",
                severity=match.get("extra", {}).get("severity", "WARNING"),
                message=match.get("extra", {}).get("message", match["check_id"]),
                file=match["path"],
                line=match["start"]["line"],
            ))
    return findings

def run_mypy(target_path: str) -> list[ScanResult]:
    """Run mypy type checker for type safety."""
    result = subprocess.run(
        ["mypy", target_path, "--no-error-summary", "--output", "json"],
        capture_output=True, text=True
    )
    findings = []
    for line in result.stdout.strip().split("\n"):
        if line.strip():
            try:
                entry = json.loads(line)
                findings.append(ScanResult(
                    tool="mypy",
                    severity=entry.get("severity", "error"),
                    message=entry.get("message", ""),
                    file=entry.get("file", ""),
                    line=entry.get("line", 0),
                ))
            except json.JSONDecodeError:
                pass
    return findings

def scan_ai_generated_code(target_path: str) -> dict:
    """Run all scanners and aggregate results."""
    all_findings = []
    all_findings.extend(run_bandit(target_path))
    all_findings.extend(run_semgrep(target_path))
    all_findings.extend(run_mypy(target_path))
    # Categorize by severity (Bandit: HIGH/MEDIUM/LOW, Semgrep: ERROR/WARNING/INFO,
    # mypy: error/note)
    critical = [f for f in all_findings if f.severity in ("HIGH", "ERROR", "error", "CRITICAL")]
    warnings = [f for f in all_findings if f.severity in ("MEDIUM", "WARNING", "warning")]
    info = [f for f in all_findings if f.severity in ("LOW", "INFO", "note")]
    return {
        "total_findings": len(all_findings),
        "critical": len(critical),
        "warnings": len(warnings),
        "info": len(info),
        "details": all_findings,
        "pass": len(critical) == 0,
    }

# Example usage
results = scan_ai_generated_code("./generated_code/")
print(f"Scan complete: {results['total_findings']} findings")
print(f"Critical: {results['critical']}, Warnings: {results['warnings']}")
print(f"Gate: {'PASS' if results['pass'] else 'FAIL'}")
7. Human Review Patterns and Checklists for AI Code
Automated tools catch known vulnerability patterns, but human review remains essential for catching logical errors, design issues, and context-specific problems that tools cannot detect. When reviewing AI-generated code, developers should apply a specialized checklist that accounts for the specific failure modes of LLM code generation.
Security review checklist: Verify that all user inputs are sanitized before use in SQL queries, HTML rendering, file operations, and system commands. Check that authentication and authorization checks are present on all endpoints (LLMs sometimes add auth to the "happy path" but omit it on error handlers). Confirm that secrets are loaded from environment variables, not hardcoded. Verify that CORS, CSP, and other security headers are properly configured. Check that cryptographic operations use current, strong algorithms (LLMs sometimes suggest MD5 or SHA-1 for password hashing).
Correctness review checklist: Trace through the code with edge case inputs (empty collections, null values, maximum integers, Unicode strings). Verify that error handling covers all failure modes, not just the most common ones. Check that concurrent code uses appropriate synchronization. Confirm that the algorithm matches the specification (not just "an algorithm that sounds right"). Verify that third-party library calls use the correct API for the installed version.
Maintainability review checklist: Ensure that the code follows project conventions (naming, structure, patterns). Check that generated code does not duplicate existing functionality in the codebase. Verify that dependencies added by the AI are actually needed and are maintained/secure. Confirm that generated tests actually test the desired behavior (not just that the code runs without errors).
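Several checklist items above can be partially mechanized so that human attention goes to the judgment calls. A hypothetical sketch of a pre-review triage pass; the regex patterns are illustrative and deliberately crude (a real setup would use dedicated tools like Bandit or a secrets scanner rather than ad-hoc regexes):

```python
import re

# Illustrative patterns only -- tune for your codebase
CHECKS = [
    ("hardcoded credential",
     re.compile(r'(password|api_key|secret)\s*=\s*["\'][^"\']+["\']', re.I)),
    ("weak hash for passwords",
     re.compile(r'\bhashlib\.(md5|sha1)\b')),
    ("shell injection risk",
     re.compile(r'subprocess\.\w+\([^)]*shell\s*=\s*True')),
]

def review_flags(source: str) -> list[tuple[int, str]]:
    """Return (line_number, issue) pairs a human reviewer should inspect."""
    flags = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for issue, pattern in CHECKS:
            if pattern.search(line):
                flags.append((lineno, issue))
    return flags

snippet = 'import hashlib\npassword = "admin123"\ndigest = hashlib.md5(b"pw")\n'
for lineno, issue in review_flags(snippet):
    print(f"line {lineno}: {issue}")
# line 2: hardcoded credential
# line 3: weak hash for passwords
```

The point is not that regexes catch everything (they do not), but that every mechanical check removed from the human checklist preserves reviewer attention for logic and design.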
8. Trust Calibration: When to Trust Generated Code
Not all AI-generated code requires the same level of scrutiny. Trust calibration means matching review intensity to risk level. This requires assessing two dimensions: the complexity of the generated code and the consequences of a defect.
| Code Category | Trust Level | Review Approach | Examples |
|---|---|---|---|
| Boilerplate / CRUD | High | Quick scan, automated checks | REST endpoints, data models, config files |
| Standard algorithms | Medium-High | Verify edge cases, run tests | Sorting, searching, data transformations |
| Business logic | Medium | Detailed review, property-based tests | Pricing rules, eligibility checks, workflows |
| Security-sensitive | Low | Expert review, SAST tools, pen testing | Auth, crypto, input validation, session mgmt |
| Infrastructure / DevOps | Low | Expert review, dry-run, staged rollout | IAM policies, network rules, CI/CD configs |
| Concurrent / distributed | Very Low | Expert review, stress testing, formal analysis | Lock management, distributed transactions |
The 80/20 rule applies to AI code review. Approximately 80% of AI-generated code falls into the "boilerplate" and "standard algorithms" categories where trust is reasonably high and review can be lightweight. The remaining 20% (security-sensitive, infrastructure, and concurrent code) requires careful expert review. An efficient AI-assisted development workflow routes each piece of generated code to the appropriate review level rather than applying uniform scrutiny everywhere. This is where AI-generated code metadata (which model, what prompt, what context) becomes valuable for risk assessment.
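Routing generated code to a review tier can itself be automated as a first approximation. The sketch below uses path keywords only, which is the crudest possible signal; the tier names, keywords, and paths are all hypothetical, and a real system might also consume the generation metadata (model, prompt, context) mentioned above:

```python
from enum import Enum

class ReviewLevel(Enum):
    LIGHTWEIGHT = "automated checks + quick scan"
    STANDARD = "detailed review + property-based tests"
    EXPERT = "expert review + SAST + staged rollout"

# Illustrative keyword routing, ordered highest-risk first
ROUTES = [
    (("auth", "crypto", "session", "iam"), ReviewLevel.EXPERT),
    (("locks", "infra", "deploy", "concurren"), ReviewLevel.EXPERT),
    (("pricing", "billing", "eligibility"), ReviewLevel.STANDARD),
]

def review_level(file_path: str) -> ReviewLevel:
    """Map a generated file to a review tier by path keywords."""
    lowered = file_path.lower()
    for keywords, level in ROUTES:
        if any(kw in lowered for kw in keywords):
            return level
    return ReviewLevel.LIGHTWEIGHT  # boilerplate / CRUD default

print(review_level("src/auth/session_manager.py").name)  # EXPERT
print(review_level("src/api/crud_endpoints.py").name)    # LIGHTWEIGHT
```

Substring matching will misfire on some paths, so this should only set the *floor* for review intensity; a reviewer can always escalate a file the router classified as low risk.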
The biggest risk with AI-generated code is not the code itself but the review fatigue it induces. When developers review hundreds of lines of AI-generated code per day, attention naturally wanes, and subtle issues slip through. Counteract this by: (1) keeping generated code chunks small (ask for one function at a time, not entire files); (2) using automated tools to handle the mechanical checks so human attention is preserved for logic review; and (3) establishing a culture where questioning AI-generated code is normal, not a sign of distrust toward the developer who used the tool.
- LLM-generated code has systematic security issues, with approximately 40% of Copilot-generated programs containing vulnerabilities in early studies.
- Common vulnerability patterns (SQL injection, XSS, path traversal, hardcoded credentials) are predictable and detectable with static analysis.
- Code hallucination produces plausible calls to non-existent APIs, deprecated methods, and invented parameters.
- Benchmark contamination means evaluation scores may overstate real-world performance; always evaluate on your own tasks.
- Human-written tests should verify AI-generated implementations, not the reverse; test-driven prompting is the most reliable workflow.
- Multi-tool scanning (CodeQL + Semgrep + Bandit + mypy) creates a layered defense against different vulnerability categories.
- Trust calibration matches review intensity to code complexity and risk level, focusing human attention where it matters most.
- Review fatigue is the biggest practical risk; keep generated code chunks small and automate mechanical checks.
Self-verifying code generation is an active research area where the code generation model also generates proofs or formal specifications that can be mechanically checked. Systems like AlphaProof (DeepMind) and Lean-based verifiers can prove that generated code satisfies formal properties.
While currently limited to well-specified mathematical and algorithmic tasks, this approach promises a future where the correctness of generated code can be guaranteed rather than merely tested. In the nearer term, research on "LLM-aware static analysis" is developing tools specifically designed for the error patterns of AI-generated code, going beyond generic vulnerability databases to target the specific hallucination and security patterns documented in this section.
What Comes Next
This section covered the quality and security challenges of AI-generated code. For the broader safety, reliability, and deployment considerations for agentic systems, continue to Chapter 26: Agent Safety, Production & Operations.
Bibliography
Pearce, H., Ahmad, B., Tan, B., et al. (2022). "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions." arXiv:2108.09293
He, J. & Vechev, M. (2023). "Large Language Models for Code: Security Hardening and Adversarial Testing." arXiv:2302.05319
Jesse, K., Devanbu, P. T., & Ahmed, T. (2023). "Large Language Models and Simple, Stupid Bugs." arXiv:2303.11455
Vu, N. M., Bui, N. D. Q., & Nadi, S. (2024). "Hallucinated Code: LLMs Generate Non-Existent API Calls." arXiv:2401.09983
GitHub Security Lab. "CodeQL Documentation." codeql.github.com
Semgrep. "Semgrep Documentation." semgrep.dev/docs
