"The code compiles, the tests pass, and the vulnerability is already in production."
Agent X, Security-Haunted AI Agent
AI-generated code is not inherently trustworthy. Studies consistently show that LLM-generated code contains security vulnerabilities, functional bugs, and API usage errors at rates comparable to (and in some categories exceeding) human-written code. As 85% of developers adopt AI coding assistants, the volume of AI-generated code in production is growing rapidly. Understanding the specific failure modes of AI-generated code, and building automated pipelines to catch them, is now a core software engineering competency. This section examines what goes wrong, why it goes wrong, and how to detect and prevent the most common quality and security issues.
Prerequisites
This section builds on code generation agents from Section 25.1, SWE-bench evaluation from Section 25.6, and agentic coding workflows from Section 25.7. Familiarity with software testing concepts, static analysis, and basic security terminology (CWE, OWASP) is helpful.
1. Security Vulnerabilities in LLM-Generated Code
Pearce et al. (2022) conducted the first systematic study of security vulnerabilities in code generated by GitHub Copilot. Across 1,689 generated programs in C, Python, and JavaScript, approximately 40% contained security vulnerabilities. The vulnerabilities were not random; they followed predictable patterns that map directly to the CWE (Common Weakness Enumeration) and OWASP Top 10 classifications.
The most common vulnerability categories in LLM-generated code include:
CWE-79: Cross-Site Scripting (XSS). LLMs frequently generate web code that renders user input without sanitization. When asked to build a simple web page that displays user comments, the model often inserts user content directly into HTML templates. The model "knows" about XSS (it can explain the vulnerability if asked) but does not consistently apply defensive coding practices in generated output.
CWE-89: SQL Injection. String formatting for SQL queries remains a persistent pattern in LLM-generated code. Models generate f"SELECT * FROM users WHERE name = '{user_input}'" instead of parameterized queries. This occurs even when the model is given a codebase that consistently uses parameterized queries, because the training data contains more examples of string-formatted SQL than safe alternatives.
CWE-798: Hardcoded Credentials. LLMs sometimes generate code with placeholder credentials that look like real secrets (such as password = "admin123" or api_key = "sk-..."). While these are usually placeholders, they can end up in committed code if developers do not review carefully. More subtly, generated test files may contain test credentials that resemble production secrets.
CWE-22: Path Traversal. When generating file-handling code, LLMs often fail to validate or sanitize file paths, allowing directory traversal attacks. A function that serves files based on user-provided filenames may not check for ../ sequences.
# Common security vulnerabilities in LLM-generated code
# (cursor is assumed to be an open DB-API cursor)

# BAD: SQL injection via string formatting (CWE-89)
def get_user_bad(username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"
    cursor.execute(query)  # Vulnerable to: ' OR '1'='1

# GOOD: Parameterized query
def get_user_good(username: str):
    query = "SELECT * FROM users WHERE name = %s"
    cursor.execute(query, (username,))

# BAD: XSS via unescaped output (CWE-79)
from flask import Flask, request

app = Flask(__name__)

@app.route("/comment")
def show_comment_bad():
    comment = request.args.get("text", "")
    return f"<h1>Comment: {comment}</h1>"  # Vulnerable to script injection

# GOOD: Escaped output
from markupsafe import escape

@app.route("/comment")
def show_comment_good():
    comment = request.args.get("text", "")
    return f"<h1>Comment: {escape(comment)}</h1>"

# BAD: Path traversal (CWE-22)
def serve_file_bad(filename: str):
    with open(f"/uploads/{filename}", "rb") as f:  # Allows ../../../etc/passwd
        return f.read()

# GOOD: Path validation
import os

def serve_file_good(filename: str):
    safe_path = os.path.realpath(os.path.join("/uploads", filename))
    if not safe_path.startswith("/uploads/"):
        raise ValueError("Invalid file path")
    with open(safe_path, "rb") as f:
        return f.read()
2. Correctness and Functional Bugs
Beyond security, LLM-generated code contains functional bugs that produce incorrect results without causing errors or crashes. These bugs are particularly dangerous because they pass tests (if tests are insufficiently rigorous) and produce output that looks correct at first glance.
Common categories of functional bugs include: off-by-one errors in loop boundaries and array indexing; incorrect edge case handling for empty inputs, null values, and boundary conditions; race conditions in concurrent code where the model omits necessary synchronization; incorrect operator precedence in complex expressions; and silent data truncation where type conversions lose information without raising errors.
A particularly subtle class of functional bugs involves semantically correct but logically wrong implementations. The generated code may correctly implement an algorithm that does not solve the stated problem. For example, asked to implement "find the k closest points to the origin," an LLM might correctly implement a sorting-based approach but use Manhattan distance instead of Euclidean distance, or sort in the wrong order. The code is internally consistent and well-structured but produces wrong answers.
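The distance-metric mix-up described above is easy to reproduce. The sketch below is a hypothetical illustration (function names and test points are invented for this example): both implementations are internally consistent and well-structured, but only one matches the specification.

```python
import math

def k_closest_bad(points: list[tuple[float, float]], k: int) -> list[tuple[float, float]]:
    """Plausible but wrong: sorts by Manhattan distance, not Euclidean."""
    return sorted(points, key=lambda p: abs(p[0]) + abs(p[1]))[:k]

def k_closest_good(points: list[tuple[float, float]], k: int) -> list[tuple[float, float]]:
    """Correct: sorts by Euclidean distance to the origin."""
    return sorted(points, key=lambda p: math.hypot(p[0], p[1]))[:k]

# A point where the two metrics disagree:
# (2, 2) has Euclidean distance ~2.83 but Manhattan distance 4;
# (0, 3) has Euclidean distance 3 but Manhattan distance 3.
points = [(2.0, 2.0), (0.0, 3.0)]
print(k_closest_bad(points, 1))   # [(0.0, 3.0)] -- wrong metric picks this
print(k_closest_good(points, 1))  # [(2.0, 2.0)] -- actually closer
```

A casual test with axis-aligned points (where the metrics agree) would pass both versions, which is exactly why property-based or adversarial test inputs matter for this bug class.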
3. Code Hallucination: Plausible but Wrong API Usage
Jesse et al. (2023) studied "code hallucination," where LLMs generate code that references APIs, functions, or parameters that do not exist. The generated code looks syntactically correct and follows reasonable naming conventions, but calls functions or passes arguments that the library does not support.
Code hallucination manifests in several ways. Non-existent methods: the model calls a function that sounds like it should exist (e.g., pandas.DataFrame.to_sqlite()) but does not. Deprecated patterns: the model uses API patterns from older library versions that have been removed. Invented parameters: the model passes keyword arguments that are not accepted by the function (e.g., json.dumps(data, sort=True) instead of sort_keys=True). Wrong library attribution: the model calls a function from the wrong library (e.g., using a numpy function that only exists in scipy).
The root cause is that the model's training data contains code from many library versions and sometimes from unofficial sources (blog posts, tutorials, Stack Overflow answers) that may be outdated or incorrect. The model generates the most statistically likely code completion, which may not correspond to any actual API.
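Invented keyword arguments are one of the cheapest hallucinations to catch mechanically, because the interpreter can introspect the real signature. A minimal sketch (the helper name `kwargs_are_valid` is our own, not a standard API):

```python
import inspect
import math

def kwargs_are_valid(func, kwargs: dict) -> bool:
    """Return True if every keyword argument is accepted by func."""
    try:
        sig = inspect.signature(func)
    except (TypeError, ValueError):
        return True  # Some builtins expose no introspectable signature
    params = list(sig.parameters.values())
    # If func accepts **kwargs, any keyword name is technically valid
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return True
    names = {p.name for p in params
             if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                           inspect.Parameter.KEYWORD_ONLY)}
    return set(kwargs) <= names

# math.isclose accepts rel_tol/abs_tol, not a "tolerance" parameter
print(kwargs_are_valid(math.isclose, {"tolerance": 1e-6}))  # False: invented
print(kwargs_are_valid(math.isclose, {"rel_tol": 1e-6}))    # True
```

This only validates argument names, not semantics, and functions that swallow `**kwargs` (such as `json.dumps`) will accept invented keywords silently; a type checker run over the generated code provides a stronger version of the same check.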
Researchers have found that LLMs will confidently generate code using packages that have never existed. Security researchers exploited this by creating packages with names that LLMs frequently hallucinate, then uploaded those packages (containing tracking code) to PyPI. Within weeks, the packages had thousands of downloads from developers who trusted their AI coding assistant's suggestions without checking. This attack vector is now known as "package hallucination" or "AI package confusion."
4. Data Contamination and Benchmark Gaming
LLMs trained on code from public repositories may have memorized solutions to benchmark problems. When a model achieves high scores on HumanEval or MBPP, it is unclear how much of that performance reflects genuine code generation ability versus memorization of training examples that overlap with the benchmark.
SWE-bench (covered in Section 25.6) was designed partly to address this concern by using real GitHub issues. However, even SWE-bench faces contamination risks: the repository histories used for evaluation may appear in training data for newer models. SWE-bench Verified and SWE-bench Live attempt to mitigate this through human verification and continuously updated evaluation sets.
For practitioners, the implication is that benchmark scores should be treated as upper bounds on expected real-world performance. A model that scores 70% on HumanEval may perform significantly worse on your specific codebase with its unique patterns, internal libraries, and domain conventions. Always evaluate models on your own representative tasks before committing to production deployment.
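Evaluating on your own tasks does not require benchmark infrastructure; a harness that runs candidate implementations against your own input/expected pairs and reports a pass rate is enough to start. A minimal sketch with invented example tasks (the deliberate integer-division bug stands in for a model-generated mistake):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EvalTask:
    name: str
    func: Callable                       # candidate (e.g. model-generated) code
    cases: list[tuple[tuple, Any]]       # (args, expected) pairs from your domain

def pass_rate(tasks: list[EvalTask]) -> float:
    """Fraction of tasks whose implementation passes all of its cases."""
    passed = 0
    for task in tasks:
        try:
            if all(task.func(*args) == expected for args, expected in task.cases):
                passed += 1
        except Exception:
            pass  # crashes count as failures
    return passed / len(tasks) if tasks else 0.0

tasks = [
    EvalTask("median", lambda xs: sorted(xs)[len(xs) // 2],
             [(([1, 3, 2],), 2)]),                   # passes
    EvalTask("mean", lambda xs: sum(xs) // len(xs),  # integer-division bug
             [(([1, 2],), 1.5)]),                    # fails: returns 1
]
print(pass_rate(tasks))  # 0.5
```

Keeping twenty or thirty such tasks drawn from your real codebase gives a far better signal for model selection than a public leaderboard number.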
5. Automated Testing and Verification of Generated Code
Given the systematic quality issues in AI-generated code, automated verification is essential. An effective verification pipeline combines multiple complementary techniques: type checking catches type errors and some API hallucinations; unit tests catch functional bugs; static analysis catches security vulnerabilities and code smells; and property-based testing catches edge cases that example-based tests miss.
The key insight for AI-generated code verification is that the test suite should be written (or at least reviewed) by humans, not by the same model that generated the code. When the same LLM generates both the implementation and the tests, the tests tend to validate what the code does rather than what it should do. Bugs in the implementation are mirrored by corresponding gaps in the test suite. This is why the test-driven prompting workflow (described in Section 25.7) is so effective: human-written tests serve as an independent specification that the generated code must satisfy.
# Property-based testing for AI-generated code
from hypothesis import given, strategies as st, settings

# Suppose an LLM generated this sorting function
def ai_generated_sort(items: list[int]) -> list[int]:
    """Sort a list of integers in ascending order."""
    return sorted(items)  # Looks correct, but let's verify properties

# Property-based tests verify invariants, not specific examples
@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_preserves_length(items):
    """Sorted output must have the same length as input."""
    result = ai_generated_sort(items)
    assert len(result) == len(items)

@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_is_ordered(items):
    """Each element must be less than or equal to the next."""
    result = ai_generated_sort(items)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_preserves_elements(items):
    """Sorting must not add, remove, or modify elements."""
    result = ai_generated_sort(items)
    assert sorted(result) == sorted(items)  # same multiset

@given(st.lists(st.integers(min_value=-10000, max_value=10000)))
@settings(max_examples=500)
def test_sort_is_idempotent(items):
    """Sorting an already-sorted list should produce the same result."""
    result1 = ai_generated_sort(items)
    result2 = ai_generated_sort(result1)
    assert result1 == result2
6. Static Analysis Integration: CodeQL, Semgrep, and Bandit
Static analysis tools scan source code for known vulnerability patterns without executing the code. Integrating these tools into the AI coding workflow creates a safety net that catches vulnerabilities before they reach production.
CodeQL (GitHub) is a semantic code analysis engine that treats code as data and queries it using a SQL-like language. CodeQL ships with thousands of pre-built queries for security vulnerabilities across multiple languages. It excels at finding complex vulnerabilities that involve data flow across multiple functions, such as taint tracking from user input to SQL queries.
Semgrep is a lightweight, pattern-based static analysis tool that is fast enough to run on every save or commit. Its rules are written in a YAML format that is easy to customize for project-specific patterns. Semgrep is particularly good at detecting insecure coding patterns (like the SQL injection and XSS examples above) and enforcing coding standards.
Bandit is a Python-specific security linter that checks for common security issues: use of eval(), insecure random number generation, hardcoded passwords, and unsafe YAML loading. It is lightweight and fast, making it suitable for pre-commit hooks.
# Pipeline for scanning AI-generated code with multiple tools
import subprocess
import json
from dataclasses import dataclass

@dataclass
class ScanResult:
    tool: str
    severity: str
    message: str
    file: str
    line: int

def run_bandit(target_path: str) -> list[ScanResult]:
    """Run Bandit security linter on Python code."""
    result = subprocess.run(
        ["bandit", "-r", target_path, "-f", "json", "-ll"],
        capture_output=True, text=True
    )
    findings = []
    if result.stdout:
        data = json.loads(result.stdout)
        for issue in data.get("results", []):
            findings.append(ScanResult(
                tool="bandit",
                severity=issue["issue_severity"],
                message=issue["issue_text"],
                file=issue["filename"],
                line=issue["line_number"],
            ))
    return findings

def run_semgrep(target_path: str, config: str = "auto") -> list[ScanResult]:
    """Run Semgrep with auto-detected rules."""
    result = subprocess.run(
        ["semgrep", "--config", config, target_path, "--json"],
        capture_output=True, text=True
    )
    findings = []
    if result.stdout:
        data = json.loads(result.stdout)
        for match in data.get("results", []):
            findings.append(ScanResult(
                tool="semgrep",
                severity=match.get("extra", {}).get("severity", "WARNING"),
                message=match.get("extra", {}).get("message", match["check_id"]),
                file=match["path"],
                line=match["start"]["line"],
            ))
    return findings

def run_mypy(target_path: str) -> list[ScanResult]:
    """Run mypy type checker for type safety."""
    result = subprocess.run(
        ["mypy", target_path, "--no-error-summary", "--output", "json"],
        capture_output=True, text=True
    )
    findings = []
    for line in result.stdout.strip().split("\n"):
        if line.strip():
            try:
                entry = json.loads(line)
                findings.append(ScanResult(
                    tool="mypy",
                    severity=entry.get("severity", "error"),
                    message=entry.get("message", ""),
                    file=entry.get("file", ""),
                    line=entry.get("line", 0),
                ))
            except json.JSONDecodeError:
                pass
    return findings

def scan_ai_generated_code(target_path: str) -> dict:
    """Run all scanners and aggregate results."""
    all_findings = []
    all_findings.extend(run_bandit(target_path))
    all_findings.extend(run_semgrep(target_path))
    all_findings.extend(run_mypy(target_path))
    # Categorize by severity (Bandit: HIGH/MEDIUM/LOW, Semgrep: ERROR/WARNING/INFO,
    # mypy: error/note)
    critical = [f for f in all_findings if f.severity in ("HIGH", "ERROR", "error", "CRITICAL")]
    warnings = [f for f in all_findings if f.severity in ("MEDIUM", "WARNING", "warning")]
    info = [f for f in all_findings if f.severity in ("LOW", "INFO", "note")]
    return {
        "total_findings": len(all_findings),
        "critical": len(critical),
        "warnings": len(warnings),
        "info": len(info),
        "details": all_findings,
        "pass": len(critical) == 0,
    }

# Example usage
results = scan_ai_generated_code("./generated_code/")
print(f"Scan complete: {results['total_findings']} findings")
print(f"Critical: {results['critical']}, Warnings: {results['warnings']}")
print(f"Gate: {'PASS' if results['pass'] else 'FAIL'}")
7. Human Review Patterns and Checklists for AI Code
Automated tools catch known vulnerability patterns, but human review remains essential for catching logical errors, design issues, and context-specific problems that tools cannot detect. When reviewing AI-generated code, developers should apply a specialized checklist that accounts for the specific failure modes of LLM code generation.
Security review checklist: Verify that all user inputs are sanitized before use in SQL queries, HTML rendering, file operations, and system commands. Check that authentication and authorization checks are present on all endpoints (LLMs sometimes add auth to the "happy path" but omit it on error handlers). Confirm that secrets are loaded from environment variables, not hardcoded. Verify that CORS, CSP, and other security headers are properly configured. Check that cryptographic operations use current, strong algorithms (LLMs sometimes suggest MD5 or SHA-1 for password hashing).
Correctness review checklist: Trace through the code with edge case inputs (empty collections, null values, maximum integers, Unicode strings). Verify that error handling covers all failure modes, not just the most common ones. Check that concurrent code uses appropriate synchronization. Confirm that the algorithm matches the specification (not just "an algorithm that sounds right"). Verify that third-party library calls use the correct API for the installed version.
Maintainability review checklist: Ensure that the code follows project conventions (naming, structure, patterns). Check that generated code does not duplicate existing functionality in the codebase. Verify that dependencies added by the AI are actually needed and are maintained/secure. Confirm that generated tests actually test the desired behavior (not just that the code runs without errors).
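Several checklist items above can be partially mechanized so that human attention goes to the judgment calls. A hypothetical sketch of a pre-review triage pass; the regex patterns are illustrative and deliberately crude (a real setup would use dedicated tools like Bandit or a secrets scanner rather than ad-hoc regexes):

```python
import re

# Illustrative patterns only -- tune for your codebase
CHECKS = [
    ("hardcoded credential",
     re.compile(r'(password|api_key|secret)\s*=\s*["\'][^"\']+["\']', re.I)),
    ("weak hash for passwords",
     re.compile(r'\bhashlib\.(md5|sha1)\b')),
    ("shell injection risk",
     re.compile(r'subprocess\.\w+\([^)]*shell\s*=\s*True')),
]

def review_flags(source: str) -> list[tuple[int, str]]:
    """Return (line_number, issue) pairs a human reviewer should inspect."""
    flags = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for issue, pattern in CHECKS:
            if pattern.search(line):
                flags.append((lineno, issue))
    return flags

snippet = 'import hashlib\npassword = "admin123"\ndigest = hashlib.md5(b"pw")\n'
for lineno, issue in review_flags(snippet):
    print(f"line {lineno}: {issue}")
# line 2: hardcoded credential
# line 3: weak hash for passwords
```

The point is not that regexes catch everything (they do not), but that every mechanical check removed from the human checklist preserves reviewer attention for logic and design.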
8. Trust Calibration: When to Trust Generated Code
Not all AI-generated code requires the same level of scrutiny. Trust calibration means matching review intensity to risk level. This requires assessing two dimensions: the complexity of the generated code and the consequences of a defect.
| Code Category | Trust Level | Review Approach | Examples |
|---|---|---|---|
| Boilerplate / CRUD | High | Quick scan, automated checks | REST endpoints, data models, config files |
| Standard algorithms | Medium-High | Verify edge cases, run tests | Sorting, searching, data transformations |
| Business logic | Medium | Detailed review, property-based tests | Pricing rules, eligibility checks, workflows |
| Security-sensitive | Low | Expert review, SAST tools, pen testing | Auth, crypto, input validation, session mgmt |
| Infrastructure / DevOps | Low | Expert review, dry-run, staged rollout | IAM policies, network rules, CI/CD configs |
| Concurrent / distributed | Very Low | Expert review, stress testing, formal analysis | Lock management, distributed transactions |
The 80/20 rule applies to AI code review. Approximately 80% of AI-generated code falls into the "boilerplate" and "standard algorithms" categories where trust is reasonably high and review can be lightweight. The remaining 20% (security-sensitive, infrastructure, and concurrent code) requires careful expert review. An efficient AI-assisted development workflow routes each piece of generated code to the appropriate review level rather than applying uniform scrutiny everywhere. This is where AI-generated code metadata (which model, what prompt, what context) becomes valuable for risk assessment.
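Routing generated code to a review tier can itself be automated as a first approximation. The sketch below uses path keywords only, which is the crudest possible signal; the tier names, keywords, and paths are all hypothetical, and a real system might also consume the generation metadata (model, prompt, context) mentioned above:

```python
from enum import Enum

class ReviewLevel(Enum):
    LIGHTWEIGHT = "automated checks + quick scan"
    STANDARD = "detailed review + property-based tests"
    EXPERT = "expert review + SAST + staged rollout"

# Illustrative keyword routing, ordered highest-risk first
ROUTES = [
    (("auth", "crypto", "session", "iam"), ReviewLevel.EXPERT),
    (("locks", "infra", "deploy", "concurren"), ReviewLevel.EXPERT),
    (("pricing", "billing", "eligibility"), ReviewLevel.STANDARD),
]

def review_level(file_path: str) -> ReviewLevel:
    """Map a generated file to a review tier by path keywords."""
    lowered = file_path.lower()
    for keywords, level in ROUTES:
        if any(kw in lowered for kw in keywords):
            return level
    return ReviewLevel.LIGHTWEIGHT  # boilerplate / CRUD default

print(review_level("src/auth/session_manager.py").name)  # EXPERT
print(review_level("src/api/crud_endpoints.py").name)    # LIGHTWEIGHT
```

Substring matching will misfire on some paths, so this should only set the *floor* for review intensity; a reviewer can always escalate a file the router classified as low risk.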
The biggest risk with AI-generated code is not the code itself but the review fatigue it induces. When developers review hundreds of lines of AI-generated code per day, attention naturally wanes, and subtle issues slip through. Counteract this by: (1) keeping generated code chunks small (ask for one function at a time, not entire files); (2) using automated tools to handle the mechanical checks so human attention is preserved for logic review; and (3) establishing a culture where questioning AI-generated code is normal, not a sign of distrust toward the developer who used the tool.
- LLM-generated code has systematic security issues, with approximately 40% of Copilot-generated programs containing vulnerabilities in early studies.
- Common vulnerability patterns (SQL injection, XSS, path traversal, hardcoded credentials) are predictable and detectable with static analysis.
- Code hallucination produces plausible calls to non-existent APIs, deprecated methods, and invented parameters.
- Benchmark contamination means evaluation scores may overstate real-world performance; always evaluate on your own tasks.
- Human-written tests should verify AI-generated implementations, not the reverse; test-driven prompting is the most reliable workflow.
- Multi-tool scanning (CodeQL + Semgrep + Bandit + mypy) creates a layered defense against different vulnerability categories.
- Trust calibration matches review intensity to code complexity and risk level, focusing human attention where it matters most.
- Review fatigue is the biggest practical risk; keep generated code chunks small and automate mechanical checks.
Self-verifying code generation is an active research area where the code generation model also generates proofs or formal specifications that can be mechanically checked. Systems like AlphaProof (DeepMind) and Lean-based verifiers can prove that generated code satisfies formal properties.
While currently limited to well-specified mathematical and algorithmic tasks, this approach promises a future where the correctness of generated code can be guaranteed rather than merely tested. In the nearer term, research on "LLM-aware static analysis" is developing tools specifically designed for the error patterns of AI-generated code, going beyond generic vulnerability databases to target the specific hallucination and security patterns documented in this section.
What Comes Next
This section covered the quality and security challenges of AI-generated code. For the broader safety, reliability, and deployment considerations for agentic systems, continue to Chapter 26: Agent Safety, Production & Operations.
Bibliography
Pearce, H., Ahmad, B., Tan, B., et al. (2022). "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions." arXiv:2108.09293
He, J. & Vechev, M. (2023). "Large Language Models for Code: Security Hardening and Adversarial Testing." arXiv:2302.05319
Jesse, K., Devanbu, P. T., & Ahmed, T. (2023). "Large Language Models and Simple, Stupid Bugs." arXiv:2303.11455
Vu, N. M., Bui, N. D. Q., & Nadi, S. (2024). "Hallucinated Code: LLMs Generate Non-Existent API Calls." arXiv:2401.09983
GitHub Security Lab. "CodeQL Documentation." codeql.github.com
Semgrep. "Semgrep Documentation." semgrep.dev/docs
