"Contract review by LLM: cheaper than an associate, more reliable than a paralegal, less reliable than the LLM thinks it is."
Guard, Legal-Pragmatist AI Agent
Legal practice is the prototypical enterprise LLM market: the work is text-on-text, the per-hour billable rate is high, and the cost of a missed clause or a missed citation is itself measurable in further billable hours. By mid-2026 the vertical has consolidated around a recognizable shortlist of production deployments: Casetext CoCounsel (acquired by Thomson Reuters in 2023 and merged into the Thomson Reuters AI Assistant stack), Harvey at PwC and across the AmLaw 50 (A&O Shearman, Latham), and Lexis+ AI on the incumbent-publisher side. Regulators have moved in parallel: the EU AI Act classified several legal-tech use cases as high-risk under Annex III, while ABA Formal Opinion 512 (2024) set the U.S. baseline for an attorney's competence and supervision duties when deploying generative AI. Five categories of work have demonstrated reliable LLM augmentation in production: contract review, e-discovery, citation generation, regulatory research, and document summarization. The pattern that unites them is the same: the LLM does the volume-heavy first pass, a licensed attorney does the verification, and an automated check sits between them when stakes are high. The takeaway for this chapter: a legally defensible deployment is built around verification infrastructure, not around model choice.
Prerequisites
This section builds on the RAG architecture from Chapter 32 (retrieval pipelines and grounding) and the agentic-coding pattern from Section 29.4 for verification loops. Familiarity with the regulation framework in Chapter 47 is useful when reading the citation-verification subsection.
Five categories of legal work have demonstrated reliable LLM augmentation in production by mid-2026.
Contract Review: Assistive, Not Autonomous
Harvey's founders chose the name as a joke reference to the television series "Suits", in which the lead attorney, Harvey Specter, is famous for never losing a case. The pitch deck reportedly had a Suits screenshot on slide 1, which the founders kept in the deck through the Series A even after their counsel suggested removing it. The Series E in 2025 valued the company at $5 billion, and the joke is now a $5 billion joke.
LLMs reliably flag standard-clause deviations (limitation of liability, indemnification, governing law, change-of-control) in commercial contracts. The pattern starts from a base playbook of "what we expect to see"; the LLM compares an incoming document to that playbook and surfaces redlines for human review. Quality is high enough that Big Law associates routinely use this as a first pass; quality is not high enough to skip the human review. Vendors include Harvey, Hebbia, Robin AI, and Spellbook, plus increasingly capable in-house deployments using fine-tuned open-weight models on private corpora. See Chapter 32 for the RAG patterns these tools build on.
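The playbook comparison can be sketched without any model in the loop. The snippet below is a minimal illustration, assuming clause extraction has already happened upstream (by an LLM or otherwise) and produced a mapping from clause type to clause text. `PlaybookRule`, `Deviation`, and `flag_deviations` are hypothetical names, and plain phrase matching stands in for the semantic comparison a production tool would perform.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybookRule:
    """One 'what we expect to see' entry in the firm's playbook (hypothetical shape)."""
    clause_type: str
    required_phrases: tuple[str, ...]
    forbidden_phrases: tuple[str, ...] = ()


@dataclass(frozen=True)
class Deviation:
    clause_type: str
    reason: str


def flag_deviations(clauses: dict[str, str], playbook: list[PlaybookRule]) -> list[Deviation]:
    """Compare extracted clauses to the playbook and return items for human review."""
    deviations: list[Deviation] = []
    for rule in playbook:
        text = clauses.get(rule.clause_type)
        if text is None:
            # A clause the playbook expects is absent from the incoming document.
            deviations.append(Deviation(rule.clause_type, "clause missing"))
            continue
        lowered = text.lower()
        for phrase in rule.required_phrases:
            if phrase.lower() not in lowered:
                deviations.append(
                    Deviation(rule.clause_type, f"expected phrase not found: {phrase!r}")
                )
        for phrase in rule.forbidden_phrases:
            if phrase.lower() in lowered:
                deviations.append(
                    Deviation(rule.clause_type, f"non-standard phrase present: {phrase!r}")
                )
    return deviations
```

The function only produces a review list; it never accepts or rejects a contract on its own, which mirrors the assistive posture described above.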
Harvey deserves a closer look because it has become the canonical reference for what a 2026-era legal LLM product looks like. Founded in 2022 by a former DeepMind research scientist and a former O'Melveny & Myers associate, Harvey raised over $300 million across 2023 to 2025 and signed enterprise deals with PwC, A&O Shearman (formerly Allen & Overy), and most of the AmLaw 50. The product is structured as a tenant-isolated workspace, with the firm's own matter documents indexed in a private retrieval store; the LLM (a frontier model accessed via the Azure OpenAI Service or via Anthropic's enterprise API) never sees another firm's data. The differentiation Harvey leans on is not raw model quality but workflow integration: drafting templates, redline-comparison UIs, citation-checking against Westlaw, and audit-log defaults that satisfy the bar's competence and supervision rules.
E-Discovery and Document Triage
For discovery in litigation, vendor-reported throughput gains for LLM-assisted first-pass relevance review typically fall in the 3-10x range over manual associate review, with accuracy "comparable on routine matters" in the limited published evaluations (most numbers come from vendor case studies; independent published evaluations remain scarce). Accuracy on novel or factually unusual matters tracks lower and is the reason recall-validation protocols are mandatory. The pattern: classify each document into "responsive / privileged / not responsive," surface the top-K most likely privileged or responsive documents for human review, audit-log every classification for defensibility. Critical: courts have increasingly accepted technology-assisted review (TAR) protocols where the LLM is properly disclosed and validated. The doctrinal basis traces to Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012), the early TAR-approval case; the LLM-era extensions of that doctrine are being litigated through the mid-2020s.
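The triage, audit-log, and recall-validation steps can be sketched as follows. This is an illustration under stated assumptions rather than a court-approved protocol: the classifier itself is not shown, `Classification`, `audit_record`, `review_queue`, and `estimated_recall` are hypothetical names, and the recall estimate uses the elusion-sample approach (sampling the discard pile), which is one common validation method among several.

```python
import hashlib
import heapq
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Classification:
    doc_id: str
    label: str    # "responsive", "privileged", or "not_responsive"
    score: float  # classifier confidence in `label`, between 0 and 1


def audit_record(c: Classification, text: str, model_version: str) -> str:
    """Serialize one classification as a JSON line for the defensibility log."""
    return json.dumps(
        {
            "doc_id": c.doc_id,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "label": c.label,
            "score": c.score,
            "model_version": model_version,
            "logged_at": datetime.now(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )


def review_queue(classifications: list, k: int) -> list:
    """Return the k highest-scoring responsive or privileged documents for human review."""
    flagged = [c for c in classifications if c.label in ("responsive", "privileged")]
    return heapq.nlargest(k, flagged, key=lambda c: c.score)


def estimated_recall(found_responsive: int, discard_pile_size: int,
                     sample_size: int, responsive_in_sample: int) -> float:
    """Estimate recall from a random sample of the discard pile (elusion test)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    elusion_rate = responsive_in_sample / sample_size
    estimated_missed = elusion_rate * discard_pile_size
    total = found_responsive + estimated_missed
    return 1.0 if total == 0 else found_responsive / total
```

Hashing the document text into each log entry ties the recorded label to the exact content that was classified, which helps if the classification is later challenged.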
Citation Generation, With Verification
LLMs draft Bluebook citations from raw case captions or article metadata with high accuracy. The failure mode is the one everyone warns about: the LLM invents cases that do not exist when asked to find supporting precedent. The fix is now standard practice: every citation gets verified against an authoritative source (Westlaw, Lexis, CourtListener API) before it leaves the firm. Tools that do not do this verification step are professional malpractice waiting to happen.
The citation-verification step is the load-bearing engineering decision in a legal LLM product. Everything else (model choice, prompt design, UI) can be debated and tuned. The verification step is what separates a tool that ships from a tool that puts an attorney in front of a disciplinary committee. Build it as a hard requirement that fails closed: every cited case, statute, or regulation must resolve to a record in an authoritative database (Westlaw, Lexis, CourtListener, the Federal Register API, the relevant state's primary-law repository). If verification fails, the citation is stripped from the output and flagged for human research, not silently allowed through.
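A fail-closed gate of this kind can be sketched in a few lines. The example assumes citations have already been extracted from the draft as strings, and it takes the database lookup as an injected `resolver` callable; no real Westlaw, Lexis, or CourtListener client is shown, and `Citation`, `verify_fail_closed`, and `strip_unverified` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    raw: str  # citation text exactly as it appears in the draft


# Assumed interface: returns a record identifier when the citation resolves,
# or None when the authoritative source has no matching record.
Resolver = Callable[[Citation], Optional[str]]

PLACEHOLDER = "[citation removed pending verification]"


def verify_fail_closed(citations: Iterable[Citation],
                       resolver: Resolver) -> Tuple[List[Citation], List[Citation]]:
    """Split citations into (verified, flagged); any lookup error counts as unverified."""
    verified: List[Citation] = []
    flagged: List[Citation] = []
    for citation in citations:
        try:
            record_id = resolver(citation)
        except Exception:
            # Fail closed: a timeout or API error must not let a citation through.
            record_id = None
        (verified if record_id else flagged).append(citation)
    return verified, flagged


def strip_unverified(draft: str, flagged: Iterable[Citation]) -> str:
    """Replace every unverified citation in the draft with a visible placeholder."""
    for citation in flagged:
        draft = draft.replace(citation.raw, PLACEHOLDER)
    return draft
```

Treating lookup errors the same as a missing record is the design choice that makes the gate fail closed: an outage at the citation source degrades the draft visibly instead of letting unchecked citations pass.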
Regulatory and Compliance Research
LLMs over a RAG index of regulatory text (CFR, statutes, agency guidance) answer compliance questions with citations to the source paragraph. Banking, healthcare, securities, and energy have all deployed these internally. The key engineering decision: RAG with the regulatory corpus as the only retrieval source, with the LLM explicitly instructed to refuse if the retrieved chunks do not contain the answer. (See Section 35.3 on grounding strategies.)
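The refuse-if-not-grounded behavior can be enforced in code as well as in the prompt. The sketch below is a toy illustration: keyword overlap stands in for a real retriever, the corpus entries are hypothetical, `llm` is any callable that maps a prompt string to an answer string, and the names `retrieve`, `build_prompt`, and `answer` are assumptions for this example.

```python
import re
from typing import Callable, Dict, List, Tuple

REFUSAL = "The indexed regulatory text does not contain an answer to this question."


def _terms(text: str) -> set:
    """Lowercased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def retrieve(question: str, corpus: Dict[str, str],
             min_overlap: int = 2, top_k: int = 3) -> List[Tuple[str, str]]:
    """Return up to top_k (chunk_id, text) pairs sharing enough terms with the question."""
    q = _terms(question)
    scored = [(len(q & _terms(text)), cid, text) for cid, text in corpus.items()]
    hits = [s for s in scored if s[0] >= min_overlap]
    hits.sort(key=lambda s: (-s[0], s[1]))
    return [(cid, text) for _, cid, text in hits[:top_k]]


def build_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    sources = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return (
        "Answer using ONLY the sources below and cite the source id for each claim.\n"
        f"If the sources do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def answer(question: str, corpus: Dict[str, str], llm: Callable[[str], str]) -> str:
    """Refuse before calling the model when retrieval finds nothing relevant."""
    chunks = retrieve(question, corpus)
    if not chunks:
        return REFUSAL
    return llm(build_prompt(question, chunks))
```

Short-circuiting on empty retrieval means the model is never asked a question it could only answer from its own parametric memory; the prompt-level refusal instruction then covers the case where chunks were retrieved but do not actually contain the answer.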
The compliance-research use case has been one of the few where firms have successfully built an internal product that competes with the commercial vendors. The reason is corpus specificity: a regional bank's regulatory exposure (CFPB rules, state banking law, internal policy memos, examination findings) is narrow enough that a small, well-curated retrieval index plus a frontier model outperforms a generalist legal LLM on the bank's actual questions. The investment that pays off is the corpus curation, not the model.
Legal Document Summarization
LLMs summarize depositions, expert reports, regulatory filings, and case law. This is a mature use case whose main risk is omission rather than fabrication, so the pattern is "summary + key-quote highlights + source page links" rather than free-form prose. Litigation-support teams report meaningful review-time reductions on routine matters (vendor case studies cluster in the 50-80 percent range, with the higher end reserved for high-volume, structurally similar depositions like personal-injury or product-liability series; complex matters with novel fact patterns see considerably smaller gains). The pattern shifts attorney attention from "where is the relevant testimony?" to "is the summary accurate and complete?"
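The "key-quote highlights + source page links" part of the pattern lends itself to a mechanical check: every highlighted quote should appear verbatim on the page it links to. The sketch below assumes the source has already been split into per-page text and that the summarizer returns quotes with page numbers; `KeyQuote` and `unanchored_quotes` are hypothetical names. It catches misquoted or mislinked highlights only; omissions still require attorney review.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KeyQuote:
    text: str  # quote as it appears in the summary
    page: int  # page number the summary links to


def _normalize(s: str) -> str:
    """Collapse runs of whitespace and lowercase so line breaks do not cause mismatches."""
    return re.sub(r"\s+", " ", s).strip().lower()


def unanchored_quotes(quotes: List[KeyQuote], pages: Dict[int, str]) -> List[KeyQuote]:
    """Return quotes that are empty or do not appear on their cited page."""
    missing: List[KeyQuote] = []
    for quote in quotes:
        needle = _normalize(quote.text)
        page_text = _normalize(pages.get(quote.page, ""))
        if not needle or needle not in page_text:
            missing.append(quote)
    return missing
```

Any quote returned by `unanchored_quotes` would be routed to the reviewing attorney before the summary is released, in the same way as an unverified citation.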
Allen & Overy (now A&O Shearman after its 2024 merger with Shearman & Sterling) was Harvey's first announced major-firm customer in February 2023. The deployment rolled out to roughly 3,500 lawyers across more than 40 offices. The adoption figures, reported in the firm's launch announcement and in subsequent industry interviews, are instructive: roughly half of the firm's lawyers used Harvey in their day-to-day work within the first six months, and the highest-frequency use cases were research-summary drafting, due-diligence document triage, and first-draft client memos. The firm did not report the tool replacing associates; it reported associates handling more matters per week. The pattern has held across most major Big Law deployments: Harvey (or Hebbia, or Spellbook, or a fine-tuned in-house equivalent) does not change billable-hour structures, but it does shift where those hours are spent, from undifferentiated reading toward higher-leverage analytical work.
The market that has consolidated around assistive contract review and litigation drafting is mapped in the table below. All five vendors operate verified-RAG architectures of the kind described in Section 67.4; their differentiation is corpus focus and deployment posture rather than core retrieval approach.
| Vendor | Focus | Deployment | Pricing tier |
|---|---|---|---|
| Harvey | Assistive contract review and litigation drafting | Cloud (multi-tenant with tenant isolation) | Enterprise |
| Hebbia | Search and structured extraction over large document sets | Cloud | Enterprise |
| Casetext CoCounsel | Legal research and memo drafting (Thomson Reuters) | Cloud | Mid-market and enterprise |
| Spellbook | Transactional drafting and Word-integrated redlining | Cloud (Word add-in) | Mid-market |
| Robin AI | Transactional contract review and negotiation support | Cloud | Enterprise |
None of the five use cases above is safe to deploy without the verification or human-review step described alongside it. Legal practice operates under a duty of competence (ABA Model Rule 1.1, Comment 8) and a duty of supervision over non-attorney assistants (Rule 5.3) that both extend to LLM-augmented work. Skipping the human check is not a productivity optimization; it is a bar-discipline event waiting to happen. Section 67.2 catalogs the most common failure modes in detail, and Section 67.4 specifies the verified-RAG architecture that is now the de-facto standard for compliant deployment.
What Comes Next
Section 67.2 turns to the failure modes specific to legal LLMs, starting with the hallucinated-precedent problem that produced the Mata v. Avianca sanctions order. The use cases above all work; the question of how they fail and how to defend against those failures is what defines a deployable legal-LLM stack.
For advanced RAG patterns used in legal retrieval, see Section 35.3. For RAG fundamentals these legal pipelines build on, see Chapter 32. For legal-specific evaluation and deployment patterns, see Section 67.4.