"Mata v. Avianca taught lawyers to grep their LLM citations. The bar associations are still catching up."
Hallux, Citation-Verifier AI Agent
Legal is the industry where LLM failures have already produced a body of caselaw of their own. The Mata v. Avianca sanctions order is taught in every U.S. legal-ethics CLE, the Park v. Kim (2d Cir. 2024) follow-up extended it, and by mid-2026 there are dozens of similar incidents on the public record across U.S. state and federal courts and in Canada, Australia, and the U.K. The pattern is consistent: an attorney trusts an LLM-produced citation, the citation does not exist, opposing counsel notices, the court sanctions, and the failure becomes a teaching example. The failure modes below are the specific ways legal-LLM deployments break in practice. Each comes with a mitigation that has been validated in production at major firms; none is theoretical.
Prerequisites
This section assumes the legal LLM use cases from Section 67.1, the hallucination vocabulary from Section 47.1, and the LLM-evaluation framing from Section 42.1.
Hallucinated Precedent: The Canonical Failure
The Mata v. Avianca brief that became the canonical hallucinated-precedent case was plausible enough on its face that when attorney Steven Schwartz asked ChatGPT "are these cases real?", the model reassured him that they were. The transcript of that exchange is now part of Judge Castel's June 22, 2023 sanctions order, which is required reading in roughly 80% of U.S. legal-ethics CLE courses on AI.
The Mata v. Avianca filing (2023, S.D.N.Y.) is now part of the law-school curriculum. Attorneys filed a brief containing six fabricated cases produced by ChatGPT, and the judge sanctioned counsel. By 2026 there were dozens of similar incidents across jurisdictions. Every legal LLM deployment must have an automatic citation-verification step; treat this as a hard requirement, not a recommendation.
In a routine personal-injury action against Avianca Airlines, plaintiff's counsel submitted a brief opposing dismissal that cited six judicial decisions, complete with quotations and parallel reporter citations. None of the cases existed. Counsel had asked ChatGPT to find supporting authority, accepted the model's output uncritically, and even confirmed (with the model itself) that the cases were "real." Judge P. Kevin Castel imposed $5,000 in joint sanctions on the two attorneys and their firm under Rule 11, and the order is now the canonical teaching example in U.S. legal-ethics CLEs. The full docket and the June 22, 2023 sanctions order are on CourtListener; read the order, not the summary, if you are designing a verification layer.
The Mata pattern has not abated. In Park v. Kim (2024), the Second Circuit referred an attorney to its Grievance Panel for citing a nonexistent case generated by ChatGPT, and the Colorado Supreme Court Office of Attorney Regulation Counsel reported in 2025 that hallucinated-citation incidents accounted for the single largest category of new generative-AI-related grievances. Several state bars (Florida, California, New York, Texas) have issued specific advisory opinions on the duty to verify LLM-produced authority; the language is converging on "the attorney remains responsible for the accuracy of every citation in a filing, regardless of source."
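The citation-verification step described above can be sketched in a few lines. This is a minimal illustration, not a production design: the regular expression covers only a handful of federal reporter formats, and `exists` is a hypothetical callable standing in for a query against whatever authoritative database the firm uses. The citation `123 F.3d 456` and its case name are placeholders invented for the example.

```python
import re
from typing import Callable

# Simplified pattern for a few U.S. reporter formats ("550 U.S. 544",
# "123 F.3d 456", "12 F. Supp. 2d 34"). Real citation grammars are far
# richer; this pattern is an assumption made for illustration.
CITATION_RE = re.compile(
    r"\b\d{1,4}\s+"
    r"(?:U\.S\.|S\. ?Ct\.|F\. ?Supp\.(?: ?[23]d)?|F\.(?:2d|3d|4th)?)"
    r"\s+\d{1,5}\b"
)


def extract_citations(text: str) -> list[str]:
    """Return reporter citations in order of first appearance, deduplicated."""
    found: list[str] = []
    for match in CITATION_RE.finditer(text):
        cite = " ".join(match.group(0).split())  # collapse internal whitespace
        if cite not in found:
            found.append(cite)
    return found


def unverified_citations(text: str, exists: Callable[[str], bool]) -> list[str]:
    """Return every citation that the lookup cannot confirm.

    `exists` is a hypothetical stand-in for an authoritative-database query.
    """
    return [cite for cite in extract_citations(text) if not exists(cite)]


# Toy stand-in for the database: one known citation.
known = {"550 U.S. 544"}
draft = (
    "See Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007); "
    "Doe v. Example Air, 123 F.3d 456 (2d Cir. 1999)."
)
flagged = unverified_citations(draft, known.__contains__)
if flagged:
    print("Block filing; unverified citations:", flagged)
```

Matching the reporter citation alone is not sufficient, because a fabricated case name can be attached to a volume and page that really exist. A fuller check would presumably also compare the party names and any quoted language against the retrieved opinion.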
Privilege Leakage
LLMs trained on undifferentiated firm documents can surface attorney-client-privileged material in inappropriate contexts. The fix: data-room isolation, per-matter access controls on the retrieval index, and a refusal layer trained to recognize privileged-looking content. See Section 53.4 for the broader treatment of privacy attacks.
A second failure mode, less reported but more common at large firms, is privilege leakage. A typical scenario: a firm builds a knowledge-management retrieval index over the past decade of internal memos, then a new associate in an unrelated matter asks a question whose top retrieved chunks happen to be privileged communications from a closed engagement. The LLM does not know that the boundary has been crossed; the associate, working on a different client, did not request that material. The fix at the architectural level is per-matter access control on the retrieval index, with the model running as the requesting user (not as a privileged system identity). The fix at the policy level is a periodic audit of retrieval logs to catch cross-matter leakage that the access controls missed.
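Both fixes can be sketched together: an access filter applied before ranking, and a log audit that looks for anything the filter missed. The names here (`Chunk`, `retrieve_for_user`, `audit_cross_matter`, the matter IDs) are hypothetical, and the similarity scores are assumed to come from whatever retriever the firm already runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    matter_id: str
    text: str
    score: float  # similarity score, assumed to be precomputed by the retriever


def retrieve_for_user(candidates: list[Chunk], user_matters: set[str], k: int = 3) -> list[Chunk]:
    """Filter by the requesting user's matters first, then rank.

    Filtering before ranking means a privileged chunk from another matter
    never reaches the model, however high it scores.
    """
    allowed = [c for c in candidates if c.matter_id in user_matters]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:k]


def audit_cross_matter(log: list[dict], acl: dict[str, set[str]]) -> list[dict]:
    """Return log entries where a served chunk lies outside the user's matters."""
    return [e for e in log if e["matter_id"] not in acl.get(e["user"], set())]


candidates = [
    Chunk("c1", "M-closed", "Privileged memo from a closed engagement", 0.97),
    Chunk("c2", "M-2", "Research note for the current matter", 0.81),
]
acl = {"associate": {"M-2"}}

served = retrieve_for_user(candidates, acl["associate"])
print([c.chunk_id for c in served])

# The audit works from retrieval logs, so it also catches leaks from code
# paths that bypassed the filter.
log = [
    {"user": "associate", "chunk_id": "c2", "matter_id": "M-2"},
    {"user": "associate", "chunk_id": "c1", "matter_id": "M-closed"},
]
print(audit_cross_matter(log, acl))
```

The design choice worth noting is that the filter takes the requesting user's matter set as input, not a system-wide identity, which mirrors the requirement that the model run as the requesting user.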
Jurisdictional Bias
Models trained predominantly on U.S. and English-law sources give confidently wrong answers about civil-law jurisdictions, Indigenous law, and emerging-market regulations. The fix: explicit jurisdiction tagging in the retrieval index plus an audit step that flags any answer that crosses jurisdictions.
Frontier models trained on a Western, English-dominant corpus systematically over-represent U.S. federal law and U.K. common law and under-represent everything else. The failure mode appears most strikingly in cross-border transactions: ask a generalist legal LLM about French commercial-code requirements for a corporate restructuring and the model often answers with the equivalent U.S. doctrine rephrased in the language of French law. The mitigation that works in production is to constrain retrieval by jurisdiction at the index level (every chunk is tagged with the issuing jurisdiction and the model is prompted to answer only from retrieved chunks for the requested jurisdiction). Cross-jurisdictional questions become explicit, not implicit, and the model can flag when the corpus does not contain the requested jurisdiction.
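A minimal sketch of jurisdiction-constrained retrieval follows. The tagging scheme (`"FR"`, `"US-FED"`), the `TaggedChunk` type, and the prompt wording are all assumptions made for illustration; the point is only that the filter runs before the prompt is built and that an empty result is reported explicitly, not silently backfilled from another jurisdiction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedChunk:
    jurisdiction: str  # issuing jurisdiction; the code scheme is an assumption
    text: str


def select_context(chunks: list[TaggedChunk], requested: str) -> Optional[list[TaggedChunk]]:
    """Keep only chunks tagged with the requested jurisdiction.

    Returns None when the corpus has nothing for that jurisdiction, so the
    caller can flag the gap.
    """
    matching = [c for c in chunks if c.jurisdiction == requested]
    return matching or None


def build_prompt(question: str, chunks: list[TaggedChunk], requested: str) -> str:
    """Build a prompt restricted to one jurisdiction, or a no-sources notice."""
    context = select_context(chunks, requested)
    if context is None:
        return f"NO_SOURCES: the index holds no material tagged {requested}."
    sources = "\n".join(f"[{c.jurisdiction}] {c.text}" for c in context)
    return (
        f"Answer only from the {requested} sources below. "
        "If they do not answer the question, say so.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )


index = [
    TaggedChunk("US-FED", "Placeholder text about U.S. restructuring doctrine."),
    TaggedChunk("FR", "Placeholder text about French commercial-code requirements."),
]
print(build_prompt("What approvals does the restructuring need?", index, "FR"))
print(build_prompt("What approvals does the restructuring need?", index, "BR"))
```

A cross-jurisdictional question would then be handled as two explicit calls, one per jurisdiction, so the comparison is visible in the workflow and not left to the model.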
Confidentiality and Cloud-LLM Use
The duty of confidentiality (ABA Model Rule 1.6 and its state analogs) constrains where client information can flow. Several state bars have issued opinions clarifying that submitting client-identifying information to a consumer-grade LLM (free ChatGPT, default Gemini) without provider-side data-handling protections violates the duty. The fix is the same as elsewhere in this book: enterprise tiers with contractual terms comparable to a healthcare Business Associate Agreement (no retention, no training on submitted content), tenant isolation, and audit logs that satisfy the firm's records-retention policy. Major LLM providers (OpenAI Enterprise, Anthropic enterprise, Azure OpenAI, Amazon Bedrock) now offer enterprise tiers with these protections; using anything weaker for client matters is a compliance defect, not a cost optimization.
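One way to make these requirements enforceable in code is a gate that checks a provider's contractual terms before any client data is routed to it. The sketch below is hypothetical throughout: `ProviderTerms`, its fields, and the two example endpoints are invented for illustration, and the boolean values would in practice come from the firm's review of the actual contract, not from the provider's marketing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderTerms:
    name: str
    no_retention: bool      # provider does not retain submitted content
    no_training: bool       # provider does not train on submitted content
    tenant_isolation: bool  # firm data is isolated from other customers
    audit_logging: bool     # logs satisfy the firm's records-retention policy


def missing_protections(terms: ProviderTerms) -> list[str]:
    """List the required protections this endpoint lacks."""
    checks = {
        "no_retention": terms.no_retention,
        "no_training": terms.no_training,
        "tenant_isolation": terms.tenant_isolation,
        "audit_logging": terms.audit_logging,
    }
    return [label for label, ok in checks.items() if not ok]


def may_receive_client_data(terms: ProviderTerms) -> bool:
    """Allow client data only when every protection is in place."""
    return not missing_protections(terms)


# Hypothetical endpoints, not real product configurations.
consumer = ProviderTerms("consumer-chat", False, False, False, False)
enterprise = ProviderTerms("enterprise-tier", True, True, True, True)

for endpoint in (consumer, enterprise):
    print(endpoint.name, may_receive_client_data(endpoint), missing_protections(endpoint))
```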
Confidently Wrong Tone
A subtler failure mode that does not appear in headlines but appears in every firm's evaluation logs: LLMs produce confidently worded summaries that omit a critical caveat or limitation. The summary is not wrong, exactly; it is incomplete in a way a careful associate would have flagged. The fix is process, not architecture: senior-partner spot-checks on a sample of LLM-augmented memos, fed back into the prompt and evaluation set. Several major firms now maintain an internal "evals review board" that meets quarterly to update the prompts and the verification rules based on what they have seen.
The following example is a composite of three reported 2024-2025 incidents (one in a federal pleading, two in CLE-presented hypotheticals). An attorney asked an internal LLM tool to summarize a recent Supreme Court holding for a client memo. The tool produced a one-paragraph summary that quoted the majority opinion accurately but omitted that the holding applied only to a specific procedural posture not present in the client's matter. The client memo went out, the client acted on it, and opposing counsel raised the issue at the next hearing. The damage was reputational, not sanctionable, but the firm's response established a now-standard pattern: every LLM-generated case summary includes (1) the procedural posture, (2) the holding's scope as stated by the court, and (3) a "what this does not hold" line. The fix is in the prompt template, not the model; the prompt template was previously a paragraph, and is now a six-element structured form.
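The structured form can be enforced mechanically as well as through the prompt. The sketch below models only the three elements named above; the text does not list the other three elements of the six-element form, so they are left out here. The field names and the template wording are assumptions.

```python
# The three elements named in the text; the remaining elements of the
# six-element form are not specified there and are not guessed at here.
REQUIRED_FIELDS = ("procedural_posture", "holding_scope", "does_not_hold")

PROMPT_TEMPLATE = (
    "Summarize the decision below as three labeled fields:\n"
    "procedural_posture: the posture in which the case reached the court\n"
    "holding_scope: the scope of the holding as stated by the court\n"
    "does_not_hold: what the decision does not hold\n"
    "Decision:\n{opinion_text}"
)


def missing_fields(summary: dict[str, str]) -> list[str]:
    """Return required fields that are absent or blank in a parsed summary."""
    return [f for f in REQUIRED_FIELDS if not summary.get(f, "").strip()]


# A summary like the one in the incident: the holding is stated, but neither
# the procedural posture nor a usable limits line is supplied.
summary = {
    "holding_scope": "Placeholder statement of the holding.",
    "does_not_hold": "   ",
}
gaps = missing_fields(summary)
if gaps:
    print("Return to drafter; missing:", gaps)
```

A check like this only confirms that each field is present, not that its content is right, which is why the partner spot-checks described above remain part of the process.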
What Comes Next
Section 67.3 covers the bar-association and regulatory framework that now governs LLM use in legal practice. The failure modes above produced specific rule changes; understanding the rules is the next step toward a deployment that is not just technically sound but also defensible under the bar's competence and supervision duties.
Key Takeaways
- Hallucinated precedent is the canonical failure: Mata v. Avianca and Park v. Kim made automated citation verification against an authoritative database a hard requirement, not a recommendation, for any legal-LLM brief that leaves the firm.
- Privilege leakage rides on retrieval: a firm-wide knowledge-management index surfaces attorney-client material across matter boundaries unless per-matter access controls and requesting-user identity are enforced on the index itself.
- Jurisdictional bias is structural: Western, English-dominant corpora cause models to paraphrase U.S. doctrine into the language of civil-law systems, so production retrieval indexes must tag every chunk with its issuing jurisdiction and refuse cross-jurisdictional answers without explicit prompting.
- Confidentiality forbids consumer-grade tools: ABA Model Rule 1.6 and state analogs make submitting client information to free ChatGPT or default Gemini a compliance defect, with enterprise no-retention, no-training SKUs the operational minimum.
- Confidently wrong tone is the silent failure: the holding-summary that omits a procedural-posture caveat survives review because it sounds right, and the fix is a structured prompt template plus quarterly evals-review-board oversight, not a stronger model.