Part 3: Working with LLMs
Chapter 12: Hybrid ML and LLM Systems

Structured Information Extraction

"Unstructured text is just structured data that has not been properly interrogated yet. The trick is knowing which questions to ask, and which tool to ask them with."

Label Label, Data-Interrogating AI Agent
Big Picture

Information extraction (IE) turns unstructured text into structured data. For decades, IE relied on rule-based patterns, statistical models (CRFs, BiLSTMs), and curated ontologies. LLMs have transformed this landscape by enabling zero-shot extraction with natural language instructions. However, LLMs introduce new challenges: inconsistent output formats, hallucinated entities, and high per-token costs. The hybrid approach combines the speed and precision of classical NLP for well-defined entity types with the flexibility of LLMs for complex, open-ended extraction tasks. Libraries like Instructor, BAML, and Pydantic provide the structured output guarantees that production systems require. As we covered in Section 10.2, structured output enforcement is essential for reliable extraction pipelines.

The emerging frontier is compound AI systems: multi-component architectures where retrieval, classification, generation, and verification components work together as a coordinated pipeline. Frameworks like DSPy (covered in Section 11.3) are evolving to support production deployment of these compound systems. The broader trend is "AI engineering" as a distinct discipline, combining ML engineering, prompt engineering, and systems design. Part 4 covers training and fine-tuning, which is the next lever you can pull when prompt engineering and hybrid architectures reach their limits.

Prerequisites

This section assumes familiarity with the LLM-as-feature-extractor patterns from Section 12.2 and the hybrid pipeline architectures from Section 12.3. The structured output techniques from Section 10.2 are directly applied for entity and relation extraction.

1. The Information Extraction Landscape

Information extraction encompasses several related tasks that transform free text into structured records. Named Entity Recognition (NER), which builds on the text representation foundations from Chapter 01, identifies and classifies spans of text into categories such as persons, organizations, locations, and dates. Relation extraction identifies semantic connections between entities (e.g., "Alice works at Acme Corp"). Event extraction captures structured representations of what happened, when, where, and to whom. Each task can be approached with classical NLP tools, LLM prompting, or a combination of both.

Why hybrid information extraction is the production standard. Pure classical IE (spaCy, CRF models) is fast and precise but rigid: it can only extract entity types it was trained on. Pure LLM-based IE is flexible but expensive, slow, and prone to hallucinating entities that do not exist in the source text. The hybrid approach uses classical NLP for well-defined, high-volume entity types (dates, names, addresses) and reserves the LLM for novel or complex extraction tasks (sentiment-bearing phrases, implicit relationships, domain-specific entities). This mirrors the general hybrid philosophy from Section 12.3: use the cheapest tool that can do the job correctly, and escalate to the expensive tool only when needed.

Tip

Always run spaCy's NER first and use its output as context for the LLM call. Passing pre-extracted entities to the LLM (e.g., "spaCy found these entities: [Alice, Acme Corp, 2024-01-15]. Verify these and extract any additional entities the rules missed.") reduces hallucination rates significantly because the model can confirm or correct known entities rather than inventing them from scratch.
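The tip above can be sketched as a small prompt builder. The `build_verification_prompt` helper and its exact wording are illustrative, not a fixed API; the point is simply that pre-extracted entities become part of the instruction.

```python
# Sketch: seed the LLM prompt with spaCy's pre-extracted entities so the model
# verifies known spans instead of inventing entities from scratch.
# The helper name and prompt wording are illustrative assumptions.

def build_verification_prompt(text: str, spacy_entities: list[dict]) -> str:
    """Build an extraction prompt that asks the LLM to verify known entities."""
    entity_list = ", ".join(
        f"{e['text']} ({e['label']})" for e in spacy_entities
    )
    return (
        "spaCy found these entities: [" + entity_list + "]. "
        "Verify these against the text, correct any boundary or label errors, "
        "and extract any additional entities the rules missed.\n\n"
        f"Text:\n{text}"
    )

prompt = build_verification_prompt(
    "Alice joined Acme Corp on 2024-01-15.",
    [
        {"text": "Alice", "label": "PERSON"},
        {"text": "Acme Corp", "label": "ORG"},
        {"text": "2024-01-15", "label": "DATE"},
    ],
)
print(prompt)
```

The resulting string would be sent as the user (or system) message of the extraction call.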

1.1 Classical IE vs. LLM-Based IE

| Dimension | Classical IE (spaCy, CRF) | LLM-Based IE |
| --- | --- | --- |
| Setup cost | High: labeled data, training pipelines | Low: prompt engineering, few examples |
| Entity types | Fixed at training time | Flexible, defined in the prompt |
| Latency | Sub-millisecond per document | 100 ms to 2 s per document |
| Cost per doc | Negligible (CPU inference) | $0.001 to $0.05 per document |
| Accuracy (common entities) | 95%+ F1 on trained types | 85-92% F1 zero-shot |
| Accuracy (novel types) | 0% (needs retraining) | 75-90% F1 zero-shot |
| Output format | Deterministic, typed spans | Requires structured output enforcement |
| Hallucination risk | None (span-based) | Moderate (can invent entities) |
| Context window | Unlimited (streaming) | Limited by model context length |

Figure 12.5.1 compares these two pipeline architectures side by side.

Figure 12.5.1: Classical NER pipelines offer deterministic, sub-millisecond inference on trained entity types, while LLM pipelines provide flexible schema extraction at higher latency and cost.

2. Classical IE with spaCy

Fun Fact

Named entity recognition was one of the first NLP tasks to reach "good enough" accuracy in the 1990s, and spaCy's efficient pipelines can process thousands of documents per second on modern hardware. Meanwhile, achieving the same task with an LLM API call takes roughly 1 second per document and costs money. Sometimes the 30-year-old approach is still the right one.

spaCy remains the gold standard for production NER when you need speed and reliability on well-defined entity types. Its tokenization pipeline handles the text preprocessing that makes span-based entity recognition possible. Its transformer-based models achieve state-of-the-art accuracy on standard benchmarks, and its pipeline architecture makes it easy to add custom entity types through training or rule-based matching.

Code Fragment 12.5.1 runs spaCy's transformer-based NER pipeline on a financial news snippet, grouping extracted entities by type.

# Use spaCy for classical named entity recognition
# Extracts persons, organizations, dates, and locations at minimal cost
import spacy
from spacy import displacy
from collections import defaultdict

# Load a pre-trained transformer model
nlp = spacy.load("en_core_web_trf")

text = """
Apple Inc. announced today that CEO Tim Cook will present the company's
quarterly earnings at their headquarters in Cupertino, California on
January 30, 2025. Revenue is expected to exceed $120 billion, driven
by strong iPhone 16 sales across Europe and Asia.
"""

doc = nlp(text)

# Extract entities with their labels and positions
entities = []
for ent in doc.ents:
 entities.append({
 "text": ent.text,
 "label": ent.label_,
 "start": ent.start_char,
 "end": ent.end_char,
 })

# Group by entity type
by_type = defaultdict(list)
for e in entities:
 by_type[e["label"]].append(e["text"])

print("Extracted Entities:")
print("=" * 50)
for label, values in sorted(by_type.items()):
 print(f" {label:12s}: {', '.join(values)}")

print(f"\nTotal: {len(entities)} entities across {len(by_type)} types")
Extracted Entities:
==================================================
  CARDINAL    : 120 billion, 16
  DATE        : today, January 30, 2025
  GPE         : Cupertino, California, Europe, Asia
  MONEY       : $120 billion
  ORG         : Apple Inc.
  PERSON      : Tim Cook
  PRODUCT     : iPhone 16

Total: 11 entities across 7 types
Code Fragment 12.5.1: Use spaCy for classical named entity recognition

Code Fragment 12.5.2 demonstrates the complementary LLM-based approach using the BAML framework for structured event extraction, where b.ExtractEvents() handles prompt construction, LLM invocation, and Pydantic validation in one call.

# BAML definition file: extract_events.baml
# This compiles to a type-safe Python client
#
# class EventType(str, Enum):
#     ACQUISITION = "acquisition"
#     PARTNERSHIP = "partnership"
#     PRODUCT_LAUNCH = "product_launch"
#     EARNINGS = "earnings"
#     LEGAL = "legal"
#
# class ExtractedEvent(BaseModel):
#     event_type: EventType
#     description: str
#     participants: list[str]
#     date: Optional[str]
#     monetary_value: Optional[str]
#
# Usage with the compiled BAML client:
# Usage with the compiled BAML client:

from baml_client import b
from baml_client.types import ExtractedEvent

article = """
Microsoft announced on March 15, 2025, that it has completed its
$2.1 billion acquisition of cybersecurity startup CyberShield AI.
The deal, first reported in January, brings 450 employees and
several enterprise security products into Microsoft's Azure division.
CEO Satya Nadella called the acquisition transformative for the
company's cloud security strategy.
"""

# BAML handles prompt construction, LLM call, and type validation
events: list[ExtractedEvent] = b.ExtractEvents(article)

for event in events:
    print(f"Type: {event.event_type}")
    print(f"Description: {event.description}")
    print(f"Participants: {', '.join(event.participants)}")
    print(f"Date: {event.date}")
    print(f"Value: {event.monetary_value}")
Type: acquisition
Description: Microsoft completed acquisition of CyberShield AI
Participants: Microsoft, CyberShield AI, Satya Nadella
Date: March 15, 2025
Value: $2.1 billion
Code Fragment 12.5.2: BAML definition file: extract_events.baml
Warning

LLMs can hallucinate entities that do not appear in the source text. Always implement a grounding check that verifies extracted entities against the original document. A simple substring match catches most hallucinations. For more robust grounding, use fuzzy matching or semantic similarity to handle paraphrases and abbreviations.
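A minimal grounding check along these lines can be built from the standard library; `difflib.SequenceMatcher` here is a stand-in for a real fuzzy matcher (e.g., rapidfuzz), and the `is_grounded` helper and 0.85 threshold are illustrative assumptions.

```python
# Sketch: verify extracted entities against the source document.
# Exact substring match first, then a fuzzy sliding-window fallback.
from difflib import SequenceMatcher

def is_grounded(entity: str, source: str, threshold: float = 0.85) -> bool:
    """Return True if the entity appears in the source, exactly or fuzzily."""
    ent = entity.lower()
    src = source.lower()
    if ent in src:
        return True
    # Fuzzy fallback: compare the entity to every window of the same length
    window = len(ent)
    for i in range(max(1, len(src) - window + 1)):
        if SequenceMatcher(None, ent, src[i:i + window]).ratio() >= threshold:
            return True
    return False

source = "Apple Inc. announced that Tim Cook will present in Cupertino."
for candidate in ["Tim Cook", "Cupertino", "Elon Musk"]:
    print(candidate, "->", is_grounded(candidate, source))
```

Entities failing the check can be dropped or routed to a second LLM call for re-verification rather than passed downstream.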

3. Open Information Extraction

Traditional relation extraction requires a predefined schema of relation types (e.g., works_at, located_in, acquired_by). Open Information Extraction (Open IE) removes this constraint entirely, extracting arbitrary (subject, relation, object) triples from text without any schema. This makes Open IE invaluable for exploratory analysis, knowledge graph bootstrapping, and domains where the set of possible relations is unknown or too large to enumerate in advance.

3.1 Classical Open IE Systems

Stanford OpenIE (Angeli et al., 2015) pioneered the modern approach to schema-free extraction. It decomposes complex sentences into short, self-contained clauses using natural logic, then extracts triples from each clause. For example, the sentence "Tim Cook, who became CEO in 2011, leads Apple from Cupertino" yields three triples: (Tim Cook; became CEO; in 2011), (Tim Cook; leads; Apple), and (Apple; located in; Cupertino). The system processes text at high speed with no training data required, but it struggles with implicit relations and complex syntactic constructions.

REBEL (Cabot and Navigli, 2021) takes a different approach by framing relation extraction as a sequence-to-sequence problem. A fine-tuned BART model generates linearized triples directly from input text, enabling it to capture both explicit and implicit relations. REBEL supports over 200 relation types from Wikidata and can be fine-tuned on custom relation schemas. It bridges the gap between closed and open IE by learning from structured knowledge bases while retaining the flexibility to extract novel relation patterns.

Key Insight

Open IE is the foundation of automated knowledge graph construction. Each extracted triple becomes a candidate edge in a knowledge graph: subjects and objects map to nodes, and relations map to labeled edges. When combined with entity linking (covered in Section 19.3), Open IE triples can be grounded to canonical entities, enabling structured querying over unstructured text corpora.

3.2 LLM-Based Open IE with Structured Output

LLMs bring two significant advantages to Open IE: they understand context deeply enough to extract implicit relations, and they can normalize relation phrases into consistent, canonical forms. A classical Open IE system might extract both "is headquartered in" and "has its main office in" as separate relation types; an LLM can recognize these as the same headquartered_in relation and normalize accordingly.

# LLM-based Open Information Extraction with structured output
# Extracts (subject, relation, object) triples without a predefined schema
from pydantic import BaseModel, Field
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())

class Triple(BaseModel):
    subject: str = Field(description="The entity performing or described by the relation")
    relation: str = Field(description="Normalized relation in snake_case (e.g., acquired_by)")
    object: str = Field(description="The target entity or value of the relation")
    confidence: float = Field(ge=0.0, le=1.0, description="Extraction confidence")
    source_span: str = Field(description="The text span supporting this triple")

class OpenIEResult(BaseModel):
    triples: list[Triple]

text = """
Anthropic, founded by Dario and Daniela Amodei in 2021, raised $2 billion
from Google in late 2023. The San Francisco-based company developed Claude,
which competes with OpenAI's ChatGPT. Anthropic employs over 1,000
researchers and engineers focused on AI safety.
"""

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=OpenIEResult,
    messages=[
        {"role": "system", "content": (
            "Extract all factual (subject, relation, object) triples from the text. "
            "Normalize relation names to snake_case. Include the source text span "
            "that supports each triple. Only extract relations explicitly stated or "
            "directly implied by the text."
        )},
        {"role": "user", "content": text},
    ],
    max_retries=2,
)

print(f"Extracted {len(result.triples)} triples:\n")
for t in result.triples:
    print(f"  ({t.subject}; {t.relation}; {t.object})")
    print(f"    confidence={t.confidence:.2f} span=\"{t.source_span}\"")
Extracted 7 triples:

  (Anthropic; founded_by; Dario Amodei)
    confidence=0.98 span="founded by Dario and Daniela Amodei in 2021"
  (Anthropic; founded_by; Daniela Amodei)
    confidence=0.98 span="founded by Dario and Daniela Amodei in 2021"
  (Anthropic; founded_in; 2021)
    confidence=0.99 span="founded by Dario and Daniela Amodei in 2021"
  (Anthropic; received_funding_from; Google)
    confidence=0.97 span="raised $2 billion from Google in late 2023"
  (Anthropic; headquartered_in; San Francisco)
    confidence=0.95 span="The San Francisco-based company"
  (Anthropic; developed; Claude)
    confidence=0.99 span="developed Claude"
  (Claude; competes_with; ChatGPT)
    confidence=0.96 span="competes with OpenAI's ChatGPT"
Code Fragment 12.5.3: LLM-based Open Information Extraction with structured output
Dimension Comparison

| Dimension | Stanford OpenIE | REBEL | LLM-Based Open IE |
| --- | --- | --- | --- |
| Schema required | None | Optional (Wikidata types) | None |
| Relation normalization | None (raw phrases) | Partial (learned types) | Strong (LLM normalizes) |
| Implicit relations | Weak | Moderate | Strong |
| Latency per document | <50 ms | 50-200 ms | 500 ms-2 s |
| Cost per document | Free | Free (local GPU) | $0.005-0.03 |
| Hallucination risk | None | Low | Moderate |
| Best for | High-volume, explicit facts | Knowledge graph population | Complex, implicit relations |
Semantic Role Labeling: The Hidden Foundation of Structured Extraction

Every structured extraction task implicitly answers the question: "Who did what to whom, when, where, and why?" This is exactly the question that Semantic Role Labeling (SRL) was designed to answer. SRL identifies the predicate (the action or event) in a sentence and assigns semantic roles to its arguments: Agent (who performed the action), Patient (who was affected), Instrument (what tool was used), Location (where it happened), and Temporal (when it happened).

Classical SRL resources include PropBank, which defines verb-specific role sets (Arg0, Arg1, Arg2, etc.) grounded in syntactic frames, and FrameNet, which organizes predicates into semantic frames with named roles (Buyer, Seller, Goods, Price for a commercial transaction). Tools like AllenNLP's SRL model provide pre-trained PropBank-based labeling that runs in milliseconds per sentence.

When an LLM extracts structured events with typed arguments (buyer, seller, price, date), it is implicitly performing SRL without the formal linguistic machinery. Understanding this connection matters for two reasons. First, classical SRL tools can serve as fast, cheap pre-processors that identify predicate-argument structures before an LLM performs the more expensive reasoning over those structures. Second, SRL annotations from PropBank and FrameNet provide excellent few-shot examples for teaching LLMs to extract domain-specific semantic roles, because the role structure transfers across domains even when the specific labels change. The event extraction subsection below demonstrates how these semantic roles appear in practice under the labels of event arguments.
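The second point can be sketched concretely: PropBank-style role annotations render naturally as few-shot examples in a prompt. The example sentence, role labels, and `format_srl_examples` helper below are illustrative assumptions, not a fixed format.

```python
# Sketch: render PropBank-style semantic role annotations as a few-shot
# block for an LLM extraction prompt. Labels and examples are illustrative.

SRL_FEW_SHOT = [
    {
        "sentence": "Google acquired Fitbit for $2.1 billion in January 2021.",
        "predicate": "acquired",
        "roles": {
            "Arg0 (Agent)": "Google",
            "Arg1 (Patient)": "Fitbit",
            "Arg3 (Price)": "$2.1 billion",
            "ArgM-TMP (Temporal)": "in January 2021",
        },
    },
]

def format_srl_examples(examples: list[dict]) -> str:
    """Render predicate-argument annotations as prompt-ready text."""
    lines = []
    for ex in examples:
        lines.append(f"Sentence: {ex['sentence']}")
        lines.append(f"Predicate: {ex['predicate']}")
        for role, filler in ex["roles"].items():
            lines.append(f"  {role}: {filler}")
    return "\n".join(lines)

print(format_srl_examples(SRL_FEW_SHOT))
```

Because the role structure (agent, patient, price, time) transfers across domains, the same scaffold works for clinical, legal, or financial role sets with only the labels swapped.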

3.3 Event Extraction

Event extraction goes beyond entity and relation extraction to identify structured representations of what happened. Each event consists of a trigger (the word or phrase indicating the event), an event type, and a set of arguments filling specific roles. For example, in the sentence "Google acquired Fitbit for $2.1 billion in January 2021," the trigger is "acquired," the event type is ACQUISITION, and the arguments include the buyer (Google), the target (Fitbit), the price ($2.1 billion), and the date (January 2021).

Event Ontologies and Benchmarks

The ACE (Automatic Content Extraction) program defined 33 event types across eight categories: Life, Movement, Transaction, Business, Conflict, Contact, Personnel, and Justice. The ERE (Entities, Relations, Events) annotation standard extended ACE with lighter guidelines and broader coverage. These ontologies provide the foundation for supervised event extraction models, but they cover only a fraction of real-world event types. Domain-specific applications (financial events, clinical events, cybersecurity incidents) typically require custom event schemas.

LLM-Based Event Extraction

LLMs excel at event extraction because they can handle arbitrary event schemas defined in the prompt, reason about implicit arguments (the buyer in a passive construction like "Fitbit was acquired"), and resolve temporal expressions to structured dates. The following code fragment demonstrates extracting structured events from a news article, including timeline construction from multiple event mentions.

# LLM-based event extraction with timeline construction
# Extracts structured events and builds a chronological timeline
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())

class EventType(str, Enum):
    ACQUISITION = "acquisition"
    FUNDING = "funding_round"
    PRODUCT_LAUNCH = "product_launch"
    PARTNERSHIP = "partnership"
    LEADERSHIP_CHANGE = "leadership_change"
    LEGAL_ACTION = "legal_action"
    IPO = "ipo"
    LAYOFF = "layoff"
    EARNINGS = "earnings_report"
    OTHER = "other"

class EventArgument(BaseModel):
    role: str = Field(description="Semantic role: buyer, seller, amount, product, etc.")
    value: str = Field(description="The entity or value filling this role")
    entity_type: Optional[str] = Field(
        default=None, description="Entity type if applicable: ORG, PERSON, MONEY, DATE"
    )

class ExtractedEvent(BaseModel):
    trigger: str = Field(description="The word or phrase indicating the event")
    event_type: EventType
    arguments: list[EventArgument]
    date: Optional[str] = Field(default=None, description="ISO date if identifiable")
    sentence: str = Field(description="The source sentence containing the event")

class EventExtractionResult(BaseModel):
    events: list[ExtractedEvent]
    timeline_summary: str = Field(
        description="One-paragraph chronological summary of all events"
    )

article = """
In a dramatic week for the tech industry, Nvidia announced record quarterly
revenue of $22.1 billion on February 21, 2024, driven by surging AI chip
demand. Two days later, the EU filed an antitrust lawsuit against Apple over
its App Store policies, seeking $38 billion in penalties. Meanwhile, Microsoft
laid off 1,900 employees from its gaming division following the Activision
Blizzard acquisition. On February 26, Stripe announced a partnership with
Amazon to process payments for third-party sellers, a deal expected to
generate $500 million in annual transaction volume.
"""

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=EventExtractionResult,
    messages=[
        {"role": "system", "content": (
            "Extract all business events from the article. For each event, identify "
            "the trigger word, classify the event type, extract all arguments with "
            "their semantic roles, and resolve dates to ISO format (YYYY-MM-DD) where "
            "possible. Then write a chronological timeline summary."
        )},
        {"role": "user", "content": article},
    ],
    max_retries=2,
)

for i, event in enumerate(result.events, 1):
    print(f"Event {i}: {event.event_type.value}")
    print(f"  Trigger: \"{event.trigger}\"")
    print(f"  Date: {event.date}")
    for arg in event.arguments:
        print(f"  {arg.role:12s}: {arg.value} ({arg.entity_type or 'N/A'})")
    print()

print("Timeline:")
print(f"  {result.timeline_summary}")
Event 1: earnings_report
  Trigger: "announced record quarterly revenue"
  Date: 2024-02-21
  company     : Nvidia (ORG)
  revenue     : $22.1 billion (MONEY)
  driver      : surging AI chip demand (N/A)

Event 2: legal_action
  Trigger: "filed an antitrust lawsuit"
  Date: 2024-02-23
  plaintiff   : EU (ORG)
  defendant   : Apple (ORG)
  subject     : App Store policies (N/A)
  penalty     : $38 billion (MONEY)

Event 3: layoff
  Trigger: "laid off"
  Date: 2024-02-23
  company     : Microsoft (ORG)
  count       : 1,900 employees (N/A)
  division    : gaming division (N/A)
  context     : Activision Blizzard acquisition (N/A)

Event 4: partnership
  Trigger: "announced a partnership"
  Date: 2024-02-26
  partner_1   : Stripe (ORG)
  partner_2   : Amazon (ORG)
  purpose     : process payments for third-party sellers (N/A)
  value       : $500 million annual transaction volume (MONEY)

Timeline:
  On February 21, 2024, Nvidia reported record quarterly revenue of $22.1B. Two days later, on February 23, the EU filed an antitrust suit against Apple and Microsoft laid off 1,900 gaming employees. On February 26, Stripe and Amazon announced a payments partnership.
Code Fragment 12.5.4: LLM-based event extraction with timeline construction
Key Insight

Event extraction is the bridge between information extraction and temporal reasoning. By extracting events with resolved dates and participant roles, you can construct timelines, detect causal chains (Event A triggered Event B), and answer temporal questions ("What happened between the acquisition and the layoffs?"). This capability is critical for RAG systems (Section 20.4) that need to reason about sequences of events rather than isolated facts.

3.4 Temporal Information Extraction

Temporal information extraction builds on event extraction by focusing specifically on the time dimension: when did events happen, in what order, and how do they relate to each other chronologically? While event extraction captures individual occurrences with their participants and dates, temporal IE constructs coherent timelines that reveal causal sequences, durations, and overlapping events across a document or document collection.

Classical temporal IE relies on specialized tools and annotation standards. TimeML is the ISO standard markup language for temporal expressions, events, and temporal relations in text. Tools like SUTime (Stanford) and HeidelTime normalize temporal expressions ("last Tuesday," "Q3 2024," "two weeks after the merger") into ISO 8601 dates. These rule-based normalizers are remarkably accurate for well-formed temporal expressions and run at negligible computational cost, making them ideal candidates for the classical side of a hybrid pipeline.
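To make the normalization step concrete, here is a toy stand-in for what SUTime and HeidelTime do, handling only explicit "Month D, YYYY" expressions with the standard library. The `normalize_dates` helper is an illustrative sketch; real normalizers cover relative expressions, ranges, and dozens of other formats.

```python
# Sketch: normalize explicit "Month D, YYYY" expressions to ISO 8601.
# A minimal stand-in for SUTime/HeidelTime, stdlib only.
import re

MONTHS = {m: i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def normalize_dates(text: str) -> str:
    """Replace 'Month D, YYYY' expressions with ISO dates."""
    month_alt = "|".join(MONTHS)
    pattern = re.compile(rf"\b({month_alt})\s+(\d{{1,2}}),\s+(\d{{4}})")

    def repl(m: re.Match) -> str:
        return f"{int(m.group(3)):04d}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}"

    return pattern.sub(repl, text)

print(normalize_dates(
    "Acme Corp filed suit on March 15, 2024; Beta Inc responded on April 2, 2024."
))
```

Running this before the LLM call means the model sees unambiguous ISO dates and only has to reason about relative references like "two weeks later."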

LLMs complement these tools by handling the harder aspects of temporal reasoning: resolving ambiguous references ("shortly after," "during the crisis"), inferring temporal order from discourse structure when explicit dates are absent, and constructing narrative timelines that synthesize events scattered across long documents. The following code fragment demonstrates an LLM-based timeline extraction pipeline that combines classical temporal normalization with LLM reasoning.

# Temporal information extraction: build a timeline from a document
# Combines classical date normalization with LLM temporal reasoning
from openai import OpenAI
import json

client = OpenAI()

def extract_timeline(document: str, domain: str = "general") -> dict:
    """Extract a chronological timeline of events from a document."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a temporal information
extraction system for {domain} documents. Extract all events and
temporal expressions, then construct a timeline.

For each event, provide:
- "date": ISO 8601 date or date range (use "unknown" if not stated)
- "event": concise description of what happened
- "participants": list of entities involved
- "temporal_relation": relationship to the previous event
  (e.g., "after", "before", "simultaneous", "during", "unrelated")
- "confidence": float 0.0 to 1.0 for the temporal placement

Return JSON with "events" (list sorted chronologically) and
"timeline_summary" (a one-paragraph narrative of the timeline)."""},
            {"role": "user", "content": document}
        ],
        temperature=0.1,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Example: legal case timeline extraction
legal_doc = """On March 15, 2024, Acme Corp filed a patent infringement
suit against Beta Inc in the Eastern District of Texas. Beta Inc
responded with a motion to dismiss on April 2. Two weeks later, the
court denied the motion and set a discovery deadline for August 30.
During the discovery period, Beta Inc produced 50,000 documents. The
parties attempted mediation in early September but failed to reach a
settlement. Trial is scheduled for January 2025."""

timeline = extract_timeline(legal_doc, domain="legal")
for event in timeline["events"]:
 print(f" {event['date']:>12} {event['event']}")
print(f"\nSummary: {timeline['timeline_summary']}")
  2024-03-15  Acme Corp filed patent infringement suit against Beta Inc
  2024-04-02  Beta Inc filed motion to dismiss
  2024-04-16  Court denied motion to dismiss, set discovery deadline
  2024-08-30  Discovery deadline; Beta Inc produced 50,000 documents
  2024-09-01  Mediation attempted, no settlement reached
  2025-01-01  Trial scheduled

Summary: The patent dispute began on March 15, 2024 when Acme Corp sued Beta Inc. After a failed motion to dismiss in April, the court set an August discovery deadline. Following document production and unsuccessful September mediation, the case proceeds to trial in January 2025.
Code Fragment 12.5.5: Temporal information extraction: build a timeline from a document

Temporal IE is critical in several domains. Legal document analysis requires constructing case timelines from filings, depositions, and correspondence to establish sequences of events for litigation. Medical record timelines track patient history (symptoms, diagnoses, treatments, outcomes) across clinical notes that span months or years, enabling clinicians to see the full trajectory of care. News event tracking constructs evolving timelines of developing stories, linking related events across sources and detecting when reported timelines conflict. These applications connect directly to the RAG systems in Chapter 20, where temporal reasoning enables queries like "What happened between the filing and the trial?"

Note

Hybrid temporal extraction in practice: The most robust timeline systems use HeidelTime or SUTime to normalize explicit temporal expressions (fast, deterministic, nearly perfect accuracy on well-formed dates), then pass the document with normalized dates to an LLM for resolving relative references and inferring temporal order from context. This hybrid approach avoids wasting LLM tokens on date parsing while leveraging the LLM's reasoning ability for the genuinely hard parts of temporal IE.

3.5 From Events to Knowledge Graphs

The triples from Open IE and the structured events from event extraction feed naturally into knowledge graphs. Each entity becomes a node, each relation becomes an edge, and each event becomes a hyper-edge connecting multiple entities through their roles. Entity linking (covered in Section 19.3) grounds these entities to canonical identifiers, enabling cross-document queries. For example, a knowledge graph built from financial news might link "MSFT," "Microsoft Corp.," and "the Redmond tech giant" to the same canonical entity, allowing a query like "show all acquisitions by Microsoft in 2024" to aggregate results across hundreds of articles.
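This bootstrapping step can be sketched with a plain dictionary in place of a real graph store (Neo4j, RDF triple stores); the `query` helper and the triple data are illustrative.

```python
# Sketch: index extracted (subject, relation, object) triples as a simple
# adjacency map, then query it. A stand-in for a real graph database.
from collections import defaultdict

triples = [
    ("Anthropic", "founded_by", "Dario Amodei"),
    ("Anthropic", "founded_by", "Daniela Amodei"),
    ("Anthropic", "developed", "Claude"),
    ("Claude", "competes_with", "ChatGPT"),
]

# node -> [(relation, neighbor), ...]
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def query(graph: dict, subject: str, relation: str) -> list[str]:
    """Return all objects connected to `subject` via `relation`."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]

print(query(graph, "Anthropic", "founded_by"))
# ['Dario Amodei', 'Daniela Amodei']
```

With entity linking applied first, "MSFT" and "Microsoft Corp." would collapse into one node, so the same one-line query aggregates facts across documents.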

4. Hybrid IE Architectures

Why this matters for production pipelines. In a production document processing system, you might need to extract entities from 10,000 documents per day. Running every document through an LLM at $0.02 per document costs $200/day ($6,000/month). But if spaCy handles 70% of documents correctly (those with standard entities and clean text), you only need the LLM for the remaining 30%, dropping the cost to $60/day. More importantly, the classical pipeline returns results in milliseconds, keeping your median latency low. The structured output techniques from Section 10.2 (Pydantic models, Instructor) ensure that the LLM's extraction results are parsed reliably, while spaCy's span-based output is deterministic by design.
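The routing arithmetic above can be parameterized to see how the daily LLM bill falls as classical coverage rises. The `daily_llm_cost` helper is an illustrative sketch using the figures from the text ($0.02/doc, 10,000 docs/day).

```python
# Sketch of the cost model: only documents the classical layer cannot
# handle incur LLM charges. Figures match the example in the text.

def daily_llm_cost(docs_per_day: int, cost_per_doc: float,
                   classical_coverage: float) -> float:
    """LLM spend for the fraction of documents routed past classical NER."""
    return docs_per_day * (1 - classical_coverage) * cost_per_doc

for coverage in (0.0, 0.5, 0.7, 0.9):
    cost = daily_llm_cost(10_000, 0.02, coverage)
    print(f"coverage={coverage:.0%} -> ${cost:,.0f}/day")
```

At 0% coverage this reproduces the $200/day worst case; at 70% coverage it drops to $60/day, matching the calculation above.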

The most effective production IE systems combine classical and LLM-based extraction in a layered architecture. Classical models handle the high-volume, well-defined entity types (persons, organizations, dates, locations) at near-zero cost, while LLMs are called selectively for complex, domain-specific extraction tasks that require reasoning or world knowledge. Figure 12.5.2 illustrates this layered hybrid approach.

Figure 12.5.2: A hybrid IE architecture routes documents through classical NER first, then selectively invokes LLM extraction only for complex documents requiring domain-specific entity types, relations, or event detection.

4.1 Building the Hybrid Pipeline

The following implementation shows this two-layer approach in practice.

# Combine spaCy NER with LLM extraction in a two-layer pipeline
# The LLM handles domain-specific entities that spaCy cannot recognize
import spacy
from pydantic import BaseModel, Field
from typing import Optional
from dataclasses import dataclass

# Assume 'client' is an Instructor-patched OpenAI client
# client = instructor.from_openai(OpenAI())

nlp = spacy.load("en_core_web_trf")

class DomainEntity(BaseModel):
    text: str
    entity_type: str
    source: str = Field(description="'classical' or 'llm'")
    confidence: float

class RelationTriple(BaseModel):
    subject: str
    predicate: str
    object: str

class HybridExtractionResult(BaseModel):
    entities: list[DomainEntity]
    relations: list[RelationTriple]

# Mapping from spaCy labels to our unified schema
SPACY_LABEL_MAP = {
    "PERSON": "person", "ORG": "organization",
    "GPE": "location", "LOC": "location",
    "DATE": "date", "MONEY": "money",
    "PRODUCT": "product",
}

# Domain-specific types that require LLM extraction
DOMAIN_TYPES = {"medical_condition", "legal_clause", "financial_instrument"}

def needs_llm_extraction(text: str, classical_entities: list) -> bool:
 """Decide whether to invoke the LLM for deeper extraction."""
 # Heuristic: call LLM if the document contains domain keywords
 # that classical NER cannot handle
 domain_keywords = [
 "diagnosis", "plaintiff", "defendant", "derivative",
 "ct scan", "mri", "statute", "breach of contract",
 ]
 text_lower = text.lower()
 return any(kw in text_lower for kw in domain_keywords)

def extract_classical(text: str) -> list[DomainEntity]:
 """Fast, cheap extraction using spaCy."""
 doc = nlp(text)
 entities = []
 for ent in doc.ents:
 if ent.label_ in SPACY_LABEL_MAP:
 entities.append(DomainEntity(
 text=ent.text,
 entity_type=SPACY_LABEL_MAP[ent.label_],
 source="classical",
 confidence=0.95,
 ))
 return entities

def extract_with_llm(text: str, existing: list[DomainEntity]) -> HybridExtractionResult:
 """LLM extraction for domain-specific types and relations."""
 existing_summary = ", ".join(f"{e.text} ({e.entity_type})" for e in existing)

 return client.chat.completions.create(
 model="gpt-4o",
 response_model=HybridExtractionResult,
 messages=[
 {"role": "system", "content": (
 "Extract domain-specific entities and relations from the text. "
 f"These entities were already found by NER: {existing_summary}. "
 "Focus on entities and relations NOT already captured. "
 "Mark all entities with source='llm'."
 )},
 {"role": "user", "content": text},
 ],
 max_retries=2,
 )

def hybrid_extract(text: str) -> HybridExtractionResult:
 """Two-layer hybrid extraction pipeline."""
 # Layer 1: Classical NER (always runs, near-zero cost)
 classical = extract_classical(text)

 # Layer 2: LLM extraction (conditional, only when needed)
 if needs_llm_extraction(text, classical):
 llm_result = extract_with_llm(text, classical)
 # Merge: classical entities + LLM entities + LLM relations
 all_entities = classical + llm_result.entities
 return HybridExtractionResult(
 entities=all_entities,
 relations=llm_result.relations,
 )

 # Simple case: return classical entities only
 return HybridExtractionResult(entities=classical, relations=[])

# Example usage
text = """
Dr. Sarah Chen at Massachusetts General Hospital diagnosed the patient
with Stage II non-small cell lung cancer based on the CT scan results
from January 15, 2025. Treatment with pembrolizumab was initiated.
"""

result = hybrid_extract(text)
print(f"Entities ({len(result.entities)}):")
for e in result.entities:
 print(f" [{e.source:9s}] {e.entity_type:20s}: {e.text}")
print(f"\nRelations ({len(result.relations)}):")
for r in result.relations:
 print(f" {r.subject} -> {r.predicate} -> {r.object}")
Entities (7):
  [classical] person              : Dr. Sarah Chen
  [classical] organization        : Massachusetts General Hospital
  [classical] date                : January 15, 2025
  [llm      ] medical_condition   : Stage II non-small cell lung cancer
  [llm      ] medical_procedure   : CT scan
  [llm      ] medication          : pembrolizumab
  [llm      ] medical_procedure   : treatment initiation

Relations (3):
  Dr. Sarah Chen -> diagnosed -> Stage II non-small cell lung cancer
  Dr. Sarah Chen -> works_at -> Massachusetts General Hospital
  pembrolizumab -> treats -> Stage II non-small cell lung cancer
Code Fragment 12.5.6: Combine spaCy NER with LLM extraction in a two-layer pipeline
Key Insight

The hybrid architecture delivers large cost savings because the complexity router filters out 60-80% of documents at the classical layer. Only documents that contain domain-specific signals (medical terms, legal language, financial instruments) trigger the more expensive LLM call. For a pipeline processing 100K documents/day, this means the LLM handles only 20-40K documents, reducing API costs by 60-80% compared to an LLM-only approach.
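These savings can be sanity-checked with a back-of-the-envelope model. The sketch below assumes an illustrative $0.01 per LLM-processed document and treats the classical layer as zero marginal cost; `pipeline_cost` is a hypothetical helper, not part of any library.

```python
# Back-of-the-envelope cost model for the layered pipeline.
# The per-document price is an illustrative assumption, not a quoted rate.
def pipeline_cost(docs_per_day: int, llm_fraction: float,
                  llm_cost_per_doc: float = 0.01) -> float:
    """Daily LLM API cost when only a fraction of documents reach the LLM layer.

    The classical layer is treated as zero marginal cost.
    """
    return docs_per_day * llm_fraction * llm_cost_per_doc

llm_only = pipeline_cost(100_000, 1.0)   # every document hits the LLM
hybrid = pipeline_cost(100_000, 0.3)     # router sends only 30% to the LLM
savings = 1 - hybrid / llm_only

print(f"LLM-only: ${llm_only:,.0f}/day, hybrid: ${hybrid:,.0f}/day, savings: {savings:.0%}")
```

At a 30% routing rate, the LLM bill shrinks by the same 70% that the router filters out, which is why the routing heuristic's precision matters more than any per-call optimization.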

5. Production Deployment Patterns

Deploying IE systems to production requires attention to grounding, deduplication, and graceful degradation. These patterns ensure that extraction results are reliable even when individual components fail.

5.1 Grounding Verification

Every entity extracted by an LLM should be verified against the source text. This prevents hallucinated entities from entering your structured data store.
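A minimal grounding check can be sketched with the standard library: exact substring matching first, then a fuzzy fallback using `difflib.SequenceMatcher`. The 0.85 similarity threshold and the sliding-window strategy are illustrative choices; production systems often add token-level alignment or semantic similarity.

```python
# Grounding check: verify an extracted entity actually appears in the source.
# Exact substring match first, then a fuzzy fallback with difflib.
# The 0.85 similarity threshold is an illustrative choice.
from difflib import SequenceMatcher

def is_grounded(entity_text: str, source: str, threshold: float = 0.85) -> bool:
    """Return True if entity_text can be traced back to the source document."""
    ent = entity_text.lower()
    src = source.lower()
    if ent in src:
        return True
    # Fuzzy fallback: slide a window of the entity's length over the source
    n = len(ent)
    for i in range(max(1, len(src) - n + 1)):
        if SequenceMatcher(None, ent, src[i:i + n]).ratio() >= threshold:
            return True
    return False

source = "Treatment with pembrolizumab was initiated on January 15, 2025."
print(is_grounded("pembrolizumab", source))   # exact span -> True
print(is_grounded("nivolumab", source))       # hallucinated drug -> False
```

Entities that fail this check should either be dropped or stored with a reduced confidence score, per the warning below.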

5.2 Graceful Degradation

When the LLM is unavailable or returns invalid output after retries, the system should fall back to classical extraction rather than failing entirely. This means your pipeline always returns at least the entities that spaCy can identify, even during LLM outages. Log all fallback events so you can measure how often they occur and what extraction quality looks like without the LLM layer.
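One way to structure this fallback is shown below. The extraction functions are injected as callables so the sketch stays self-contained; the dict-based result shape and function names are illustrative stand-ins for the Pydantic pipeline above.

```python
# Graceful degradation sketch: fall back to classical extraction when the
# LLM layer fails. The injected callables stand in for the pipeline
# functions defined earlier; names and result shape are illustrative.
import logging

logger = logging.getLogger("ie.pipeline")

def extract_with_fallback(text, extract_classical, extract_with_llm):
    """Always return at least classical entities, even during LLM outages."""
    classical = extract_classical(text)
    try:
        llm_result = extract_with_llm(text, classical)
    except Exception as exc:  # timeout, rate limit, exhausted retries, ...
        # Log the fallback so degradation frequency can be measured
        logger.warning("LLM extraction failed, degrading to classical: %s", exc)
        return {"entities": classical, "relations": [], "degraded": True}
    return {
        "entities": classical + llm_result["entities"],
        "relations": llm_result["relations"],
        "degraded": False,
    }

def failing_llm(text, existing):
    raise TimeoutError("simulated LLM outage")

# Simulated outage: the LLM layer raises, the pipeline still returns entities
result = extract_with_fallback(
    "some document",
    extract_classical=lambda text: [{"text": "Acme Corp", "type": "organization"}],
    extract_with_llm=failing_llm,
)
print(result["degraded"], len(result["entities"]))  # True 1
```

The `degraded` flag doubles as the log signal: aggregating it over time tells you how often the system runs without its LLM layer.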

Warning

Never store LLM-extracted entities at the same confidence level as classical entities unless they pass grounding verification. Downstream consumers of your structured data need to distinguish between high-confidence, span-grounded entities and lower-confidence, LLM-inferred entities. Include the source and confidence fields in every entity record.

6. End-to-End Example: Financial Event Extraction

To illustrate a complete production pipeline, consider extracting structured financial events from news articles. This requires recognizing standard entities (companies, dates, monetary values) and domain-specific events (acquisitions, IPOs, earnings reports) with their associated attributes. Figure 12.5.3 traces the four stages of this pipeline.

[Figure content: spaCy NER (ORG, MONEY, DATE, PERSON) → LLM Extraction (event type, relations, roles) → Validation (Pydantic schema, grounding check) → Deduplication (entity resolution, cross-doc linking) → Structured Events DB. Sample output record: event_type: "acquisition"; acquirer: "Microsoft" (ORG, confidence: 0.99); target: "CyberShield AI" (ORG, confidence: 0.97); value: "$2.1 billion" (MONEY, confidence: 0.98); date: "March 15, 2025" (DATE, confidence: 0.99)]
Figure 12.5.3: A four-stage financial event extraction pipeline that combines classical NER, LLM-based event typing, schema validation, and cross-document entity resolution.
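To make the target representation concrete, the sketch below defines a minimal event schema mirroring the sample record in Figure 12.5.3. It uses stdlib dataclasses for brevity; a production pipeline would define these as Pydantic models with field validators, as in the earlier code fragments.

```python
# Minimal financial-event schema sketch (dataclasses for illustration;
# a production pipeline would use Pydantic models with validators).
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityMention:
    text: str
    entity_type: str      # e.g. ORG, MONEY, DATE
    confidence: float

@dataclass
class FinancialEvent:
    event_type: str       # e.g. "acquisition", "ipo", "earnings_report"
    acquirer: Optional[EntityMention] = None
    target: Optional[EntityMention] = None
    value: Optional[EntityMention] = None
    date: Optional[EntityMention] = None

event = FinancialEvent(
    event_type="acquisition",
    acquirer=EntityMention("Microsoft", "ORG", 0.99),
    target=EntityMention("CyberShield AI", "ORG", 0.97),
    value=EntityMention("$2.1 billion", "MONEY", 0.98),
)
print(event.event_type, event.acquirer.text)  # acquisition Microsoft
```

Keeping per-field confidence on every mention lets the validation stage apply different thresholds to different roles (a wrong acquirer is costlier than a missing date).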
Note

Cross-document entity resolution (deduplication) is critical for IE systems that process streams of news articles. The same company may appear as "Microsoft," "Microsoft Corp.," "MSFT," or "the Redmond-based tech giant." Use a combination of string normalization, alias dictionaries, and embedding similarity to link these mentions to a canonical entity ID.
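A minimal sketch of the first two tiers, normalization and an alias dictionary, is shown below. The alias entries and corporate-suffix list are illustrative; a real system would add the embedding-similarity tier on top.

```python
# Cross-document entity resolution sketch: string normalization plus an
# alias dictionary. A production system would add embedding similarity
# as a third tier; the aliases and suffix list here are illustrative.
import re

ALIASES = {
    "msft": "microsoft",
    "the redmond-based tech giant": "microsoft",
}
CORP_SUFFIXES = re.compile(r"\b(corp|corporation|inc|ltd|llc|co)\.?$")

def canonical_id(mention: str) -> str:
    """Map a surface mention to a canonical entity key."""
    norm = mention.lower().strip()
    norm = CORP_SUFFIXES.sub("", norm).strip(" ,.")
    return ALIASES.get(norm, norm)

mentions = ["Microsoft", "Microsoft Corp.", "MSFT", "the Redmond-based tech giant"]
print({canonical_id(m) for m in mentions})  # {'microsoft'}
```

All four surface forms collapse to a single canonical key, so downstream counts and graph edges accumulate on one node instead of four.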

7. Coreference Resolution

Consider the following passage: "Dr. Sarah Chen published a groundbreaking paper on protein folding. She presented her findings at NeurIPS, where the Stanford researcher received a standing ovation." A human reader immediately understands that "She," "her," and "the Stanford researcher" all refer to Dr. Sarah Chen. Coreference resolution is the NLP task of identifying these mention clusters: groups of expressions that refer to the same real-world entity.

Coreference resolution is a prerequisite for high-quality information extraction, document summarization, and question answering over long documents. Without it, an IE pipeline might extract "She diagnosed the patient" without knowing who "She" refers to, producing a relation triple with an unresolved pronoun instead of a named entity. For knowledge graph construction, unresolved coreferences lead to disconnected nodes that should be merged.

Key Insight

Coreference resolution is the glue that holds document-level IE together. NER identifies what entities appear in a document; coreference resolution identifies which mentions refer to the same entity. Without coreference, an extraction pipeline processing a 10-page contract might find 200 entity mentions but have no way to determine that 30 of them all refer to the same party. This linking step is essential before building knowledge graphs or feeding extracted data into RAG systems (Chapter 20).

7.1 Classical Approaches

Coreference resolution has a rich history in NLP, progressing through several paradigm shifts over three decades.

Approach Comparison
| Approach | Key Work | Strengths | Limitations |
| --- | --- | --- | --- |
| Mention-pair | Clark and Manning (2016) | Simple architecture, easy to train | Quadratic complexity, inconsistent clusters |
| Entity-based | Wiseman et al. (2016) | Globally consistent clusters | Complex cluster representations |
| End-to-end neural | Lee et al. (2017, 2018) | Joint mention detection and linking, state-of-the-art accuracy | High memory usage for long documents |
| LLM-based | Zero-shot prompting | No training data, handles complex cases | Context window limits, cost, latency |

7.2 LLM-Based Coreference Resolution

LLMs can perform coreference resolution zero-shot by leveraging their deep understanding of language, world knowledge, and discourse structure. This is particularly valuable for domains where labeled coreference data is scarce (legal documents, medical records, technical specifications) or where the mentions involve complex reasoning ("the acquisition target" referring to a company mentioned three paragraphs earlier). The trade-off is cost and context window limitations: coreference inherently requires processing entire documents, which can be expensive for long texts.

# LLM-based coreference resolution with structured output
# Returns mention clusters linking all expressions that refer to the same entity
from pydantic import BaseModel, Field
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())

class Mention(BaseModel):
    text: str = Field(description="The mention text as it appears in the document")
    sentence_index: int = Field(description="Zero-based index of the sentence")
    mention_type: str = Field(description="One of: proper_name, pronoun, nominal, or definite_description")

class CoreferenceCluster(BaseModel):
    entity_name: str = Field(description="Canonical name for this entity")
    entity_type: str = Field(description="Entity type: PERSON, ORG, LOCATION, EVENT, OTHER")
    mentions: list[Mention]

class CoreferenceResult(BaseModel):
    clusters: list[CoreferenceCluster]
    resolved_text: str = Field(
        description="The original text with pronouns replaced by entity names in brackets"
    )

document = """
Dr. Sarah Chen joined Anthropic in 2023 after leaving Google Brain. She
quickly became the lead of the safety team, where the former DeepMind
intern brought fresh perspectives on interpretability. Her team published
three papers at NeurIPS that year. The company credited Chen's leadership
for the rapid progress.
"""

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=CoreferenceResult,
    messages=[
        {"role": "system", "content": (
            "Perform coreference resolution on the document. Identify all mentions "
            "(proper names, pronouns, nominal descriptions) that refer to the same "
            "entity and group them into clusters. For each cluster, provide a canonical "
            "entity name and type. Also produce a version of the text where pronouns "
            "are replaced with the entity name in square brackets."
        )},
        {"role": "user", "content": document},
    ],
    max_retries=2,
)

for cluster in result.clusters:
    mentions_str = ", ".join(f'"{m.text}" ({m.mention_type})' for m in cluster.mentions)
    print(f"{cluster.entity_name} [{cluster.entity_type}]:")
    print(f"  Mentions: {mentions_str}")
    print()

print("Resolved text:")
print(f"  {result.resolved_text}")
Dr. Sarah Chen [PERSON]:
  Mentions: "Dr. Sarah Chen" (proper_name), "She" (pronoun), "the former DeepMind intern" (definite_description), "Her" (pronoun), "Chen's" (proper_name)

Anthropic [ORG]:
  Mentions: "Anthropic" (proper_name), "The company" (definite_description)

the safety team [ORG]:
  Mentions: "the safety team" (nominal), "Her team" (nominal)

Resolved text:
  Dr. Sarah Chen joined Anthropic in 2023 after leaving Google Brain. [Dr. Sarah Chen] quickly became the lead of the safety team, where [Dr. Sarah Chen] brought fresh perspectives on interpretability. [Dr. Sarah Chen]'s team published three papers at NeurIPS that year. [Anthropic] credited [Dr. Sarah Chen]'s leadership for the rapid progress.
Code Fragment 12.5.7: LLM-based coreference resolution with structured output
Real-World Scenario: Coreference-Aware IE for Legal Contract Analysis

Who: A legal tech company building a contract analysis platform that extracts obligations, deadlines, and parties from multi-page agreements.

Situation: Contracts frequently use pronouns ("the Party," "it," "said Company") and definite descriptions ("the Licensor," "the aforementioned entity") to refer to named parties introduced at the beginning of the document.

Problem: Their sentence-level NER pipeline correctly identified obligations ("shall deliver the software by March 2025") but could not determine which party held the obligation when the sentence used a pronoun or definite description rather than a proper name.

Decision: They added a coreference resolution pass before relation extraction, using an LLM for the first page (where parties are introduced with complex legal language) and a lighter neural model for the remainder of the document (where coreference patterns are more formulaic).

Result: Obligation extraction accuracy improved from 71% to 89% F1 because the system could now correctly attribute obligations to specific parties. Processing cost increased by only $0.02 per contract because the LLM handled only the first page while the neural model processed the remaining pages at near-zero cost.

Lesson: Coreference resolution is not optional for document-level IE. Without it, relation extraction produces ungrounded results that cannot be used for downstream reasoning.

7.3 Applications of Coreference Resolution

8. Integrated Document Understanding Pipeline

The IE techniques covered in this section (NER, Open IE, event extraction, and coreference resolution) are most powerful when combined into an integrated pipeline. Each component addresses a different dimension of document understanding: NER identifies what entities are present, Open IE captures how entities relate to each other, event extraction reveals what happened, and coreference resolution determines which mentions refer to the same thing. Together, they transform unstructured text into a rich, queryable knowledge representation.

A practical integration follows a four-stage architecture.

  1. Stage 1: Coreference resolution. Process the full document to identify mention clusters. Replace pronouns and definite descriptions with canonical entity names, producing a "resolved" version of the text where every sentence is self-contained.
  2. Stage 2: NER and entity linking. Run classical NER (spaCy) on the resolved text to extract typed entities. Link entities to canonical identifiers using embedding similarity against a reference knowledge base (Section 19.3).
  3. Stage 3: Relation and event extraction. Apply Open IE and event extraction (LLM-based) to extract triples and structured events. Because the text is already coreference-resolved, extracted relations contain canonical entity names rather than ambiguous pronouns.
  4. Stage 4: Knowledge graph assembly. Merge entities, relations, and events into a unified graph structure. Deduplicate entities across the document, assign confidence scores, and validate against grounding constraints.
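The four stages above can be wired together as a simple function pipeline. The stub implementations below are illustrative placeholders; in practice each stage would be one of the classical or LLM components from this section.

```python
# Four-stage orchestration sketch. Each stage is a pluggable callable so
# classical or LLM implementations can be swapped in; the lambdas below
# are illustrative stand-ins, not real extraction components.
def run_pipeline(text, resolve_coref, extract_entities, extract_relations, assemble_graph):
    resolved = resolve_coref(text)                     # Stage 1: self-contained sentences
    entities = extract_entities(resolved)              # Stage 2: typed, linked entities
    relations = extract_relations(resolved, entities)  # Stage 3: triples and events
    return assemble_graph(entities, relations)         # Stage 4: deduplicated graph

graph = run_pipeline(
    "Dr. Chen diagnosed the patient. She prescribed pembrolizumab.",
    resolve_coref=lambda t: t.replace("She", "Dr. Chen"),
    extract_entities=lambda t: ["Dr. Chen", "pembrolizumab"],
    extract_relations=lambda t, es: [("Dr. Chen", "prescribed", "pembrolizumab")],
    assemble_graph=lambda es, rs: {"nodes": sorted(set(es)), "edges": rs},
)
print(graph["edges"][0])  # ('Dr. Chen', 'prescribed', 'pembrolizumab')
```

Because Stage 1 rewrites "She" before Stage 3 runs, the extracted triple names the entity directly rather than carrying an unresolved pronoun.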
Key Insight

The order matters: coreference resolution must precede relation and event extraction. If you extract relations first, you get triples like (She; diagnosed; the patient) with unresolved pronouns. By resolving coreferences first, the same sentence becomes (Dr. Sarah Chen; diagnosed; John Miller), producing a triple that is immediately useful for knowledge graph construction and downstream querying. This preprocessing step is especially critical for RAG pipelines (Section 20.3) that chunk documents into passages; without coreference resolution, chunked passages lose their referential context.

Self-Check
Q1: What is the primary advantage of classical NER (spaCy/CRF) over LLM-based extraction for well-defined entity types?
Show Answer
Classical NER offers sub-millisecond latency, near-zero marginal cost, deterministic output, and 95%+ F1 accuracy on entity types it was trained on. It produces span-based extractions grounded directly in the source text, eliminating hallucination risk. These properties make it the preferred choice for high-volume extraction of standard entity types like persons, organizations, dates, and locations.
Q2: How does Instructor handle LLM responses that fail Pydantic validation?
Show Answer
Instructor implements an automatic retry loop controlled by the max_retries parameter. When the LLM returns JSON that fails Pydantic validation (missing required fields, wrong types, or values outside specified ranges), Instructor sends the validation error message back to the LLM and asks it to produce a corrected response. This approach resolves the vast majority of parsing failures without manual intervention. If all retries are exhausted, Instructor raises a validation exception that the calling code can handle.
Q3: Why is grounding verification essential for LLM-extracted entities?
Show Answer
LLMs can hallucinate entities that do not appear in the source text. Unlike classical NER, which extracts contiguous text spans that are by definition present in the document, LLMs generate text that may include inferred or fabricated entities. Grounding verification checks that each extracted entity text can be traced back to the source document through exact substring matching, fuzzy matching, or semantic similarity. Without grounding checks, hallucinated entities can corrupt downstream structured data stores and analytics.
Q4: How does the complexity router in a hybrid IE pipeline reduce costs?
Show Answer
The complexity router examines each document after classical NER and determines whether LLM extraction is needed. Documents that contain only standard entity types (persons, organizations, dates) are resolved entirely by the classical layer at near-zero cost. Only documents containing domain-specific signals (medical terms, legal language, complex financial events) are routed to the LLM layer. In practice, 60-80% of documents can be handled by the classical layer alone, reducing LLM API costs by a corresponding amount compared to an LLM-only pipeline.
Q5: Why must coreference resolution precede relation extraction in a document understanding pipeline?
Show Answer
Without coreference resolution, relation extraction operates on raw text that contains unresolved pronouns and definite descriptions. This produces triples like (She; diagnosed; the patient) where "She" and "the patient" are ambiguous. By resolving coreferences first, the same sentence becomes (Dr. Sarah Chen; diagnosed; John Miller), yielding a triple with canonical entity names that is immediately useful for knowledge graph construction and downstream querying. Coreference resolution is especially critical for documents that are chunked into passages for RAG pipelines, since individual chunks lose their referential context.
Q6: How does LLM-based Open IE differ from Stanford OpenIE in handling implicit relations?
Show Answer
Stanford OpenIE extracts triples by decomposing sentences into clauses using natural logic, which means it can only capture relations that are syntactically explicit in the text. LLM-based Open IE can identify implicit relations that require world knowledge or pragmatic inference. For example, given "The San Francisco-based company," an LLM can extract (company; headquartered_in; San Francisco) even though "headquartered_in" never appears in the text. LLMs also normalize relation phrases into consistent canonical forms, whereas Stanford OpenIE preserves the raw surface forms, leading to duplicated relations expressed differently.
Q7: What are the four stages of an integrated document understanding pipeline, and why does order matter?
Show Answer
The four stages are: (1) coreference resolution to link mentions and replace pronouns with canonical names, (2) NER and entity linking to extract typed entities and ground them to canonical identifiers, (3) relation and event extraction to capture how entities relate and what events occurred, and (4) knowledge graph assembly to merge everything into a unified, deduplicated structure. Order matters because each stage depends on the output of the previous one. Coreference resolution must come first so that relations are extracted between canonical entities rather than ambiguous pronouns. NER must precede relation extraction so that entity types are available for schema validation. Events and relations must be extracted before graph assembly so that all edges are available for deduplication and merging.
Q8: What distinguishes BAML from Instructor as an approach to structured LLM output?
Show Answer
Instructor works by patching an existing LLM client (OpenAI, Anthropic) to accept Pydantic models as response schemas, handling JSON schema injection and response parsing at runtime. BAML takes a fundamentally different approach: it defines LLM functions in a dedicated schema language that compiles to type-safe client code. This means type errors are caught at compile time rather than runtime, prompt logic is separated from application code, and the schema definitions serve as documentation. BAML is better suited for large teams that need strict type safety across multiple services, while Instructor is more lightweight and integrates naturally into existing Python codebases.
Note

Most LLM textbooks teach you how to use LLMs. This chapter taught you when NOT to use them, and how to combine them with traditional ML for production efficiency. The triage routing, cascade, and Pareto frontier analysis patterns covered here are rarely found in textbooks but are standard practice in cost-conscious production systems. The consistent pattern across all five sections: start cheap and simple, escalate to expensive and powerful only when needed.

Real-World Scenario: Extracting Structured Clinical Data from Physician Notes

Who: A health informatics team at a hospital network digitizing 10 years of handwritten and dictated physician notes (approximately 2 million documents).

Situation: They needed to extract structured fields (diagnoses, medications, dosages, allergies, procedures) from free-text clinical notes to populate a searchable electronic health record system.

Problem: Traditional NER models trained on general biomedical text achieved only 68% F1 on their notes because the language was highly abbreviated, used non-standard shorthand, and varied significantly across 200 physicians.

Dilemma: They could manually annotate thousands of notes and train a custom NER model (accurate but 6 months of work), use an LLM for every document (accurate but prohibitively expensive at $0.15 per note), or combine an LLM extraction pass on a sample with a trained model for the bulk.

Decision: They used GPT-4 with a Pydantic schema to extract structured data from 5,000 representative notes (validated by clinical staff), then used these extractions as training data for a fine-tuned BioBERT NER model that processed the remaining 1.995 million documents.

How: The extraction schema defined nested Pydantic models for each entity type with field-level validators (medication names checked against an RxNorm lookup, ICD codes validated against a code list). The BioBERT model was trained on the LLM-generated labels, with 500 notes manually corrected by clinical coders as a gold standard test set.

Result: The fine-tuned BioBERT achieved 87% F1 (up from 68% with the off-the-shelf model) and processed the full corpus in 3 days on a single GPU at a total cost of $1,200. The LLM extraction of the 5,000 training notes cost $750. Total project cost was under $5,000, compared to an estimated $300,000 for the LLM-only approach.

Lesson: LLM-powered extraction combined with schema validation is an excellent way to generate training data for specialized extraction models; the LLM provides breadth and the schema provides consistency.

Fun Fact

Named entity recognition (NER), one of the oldest NLP tasks, has been dramatically simplified by LLMs. What once required weeks of annotation and custom CRF models can now be bootstrapped with a single prompt, then distilled into a fast, specialized model that runs at a fraction of the cost.

9. Keyword and Keyphrase Extraction

Keyword extraction identifies the most important terms or phrases in a document, enabling applications such as document tagging, search engine optimization, content recommendation, and automatic summarization. Like entity extraction, keyword extraction spans a spectrum from classical statistical methods to modern embedding-based and LLM-driven approaches. The right choice depends on your latency requirements, the diversity of your document types, and whether you need semantically meaningful keyphrases or simple term frequency signals.

9.1 Classical Methods
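Classical methods score candidate terms with corpus statistics: TF-IDF weights a term by its frequency in the document, discounted by how many documents in the corpus contain it, while RAKE and YAKE add phrase-level and positional heuristics. A minimal TF-IDF scorer can be sketched with the standard library; the toy corpus, stopword list, and smoothed IDF formula are illustrative, and production code would use scikit-learn's TfidfVectorizer or a dedicated keyphrase library.

```python
# Minimal TF-IDF keyword scorer using only the stdlib; real pipelines would
# use scikit-learn's TfidfVectorizer or a keyphrase library such as YAKE.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "are", "is", "for", "with", "into", "from", "to", "of"}

def tokenize(text: str) -> list[str]:
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_keywords(doc: str, corpus: list[str], top_n: int = 3) -> list[str]:
    """Rank terms in doc by term frequency times smoothed inverse document frequency."""
    doc_counts = Counter(tokenize(doc))
    n_docs = len(corpus)
    def idf(term: str) -> float:
        df = sum(term in tokenize(d) for d in corpus)
        return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
    scores = {t: c * idf(t) for t, c in doc_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

corpus = [
    "the retriever searches the vector database and the vector database returns passages",
    "the model cites sources from the knowledge base",
    "passages are injected into the context window",
]
print(tfidf_keywords(corpus[0], corpus))
```

Terms that repeat within the document but are rare across the corpus ("vector", "database") outrank terms shared across documents ("passages"), which is the core intuition behind all frequency-based extractors.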

9.2 Embedding-Based Extraction with KeyBERT

KeyBERT (Grootendorst, 2020) represents a modern approach that bridges classical and neural methods. It embeds the full document and each candidate phrase using a sentence transformer model, then selects phrases whose embeddings are most similar to the document embedding. This captures semantic relevance rather than just frequency, producing keyphrases that genuinely represent the document's topics.

# KeyBERT: embedding-based keyword extraction
# Uses sentence embeddings to find semantically representative phrases
from keybert import KeyBERT

# Initialize with a sentence transformer model
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

document = """
Retrieval-Augmented Generation (RAG) combines the strengths of
neural retrieval with large language model generation. The retriever
searches a vector database for relevant passages, which are then
injected into the LLM's context window as grounding evidence.
This approach reduces hallucination and enables the model to cite
specific sources, making it suitable for enterprise knowledge bases,
customer support systems, and research assistants.
"""

# Extract keyphrases (1 to 3 words, diversified with MMR)
keywords = kw_model.extract_keywords(
    document,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,   # Maximal Marginal Relevance for diversity
    diversity=0.5,  # Balance relevance and diversity
    top_n=8,
)

print("KeyBERT Keyphrases:")
for phrase, score in keywords:
    print(f"  {score:.3f}  {phrase}")
KeyBERT Keyphrases:
  0.612  retrieval augmented generation
  0.571  vector database
  0.534  neural retrieval
  0.498  language model generation
  0.467  enterprise knowledge bases
  0.445  hallucination
  0.421  relevant passages
  0.398  customer support systems
Code Fragment 12.5.8: KeyBERT: embedding-based keyword extraction

9.3 LLM-Based Extraction

For documents where context and domain knowledge matter more than statistical patterns, an LLM can extract keyphrases that reflect higher-level understanding. The LLM can also categorize keyphrases by theme, distinguish between primary and secondary topics, and generate keyphrases that do not appear verbatim in the text (abstractive keyphrases).

# LLM-based keyword extraction with thematic grouping
from openai import OpenAI
import json

client = OpenAI()

def extract_keywords_llm(text: str, max_keywords: int = 10) -> dict:
    """Extract and categorize keywords using an LLM."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": f"""Extract the {max_keywords} most important
keyphrases from the text. Return JSON with:
- "keyphrases": list of objects with "phrase", "importance" (1-5), "category"
- "main_topic": one-sentence summary of the document's main topic

Focus on domain-specific terms, not generic words."""
        }, {
            "role": "user",
            "content": text,
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
Code Fragment 12.5.9: LLM-based keyword extraction that returns categorized, importance-ranked keyphrases alongside a topic summary. Useful when you need semantic understanding beyond surface-level term frequency.

9.4 When to Use Each Approach

| Method | Latency | Cost | Best For |
| --- | --- | --- | --- |
| TF-IDF / RAKE / YAKE | <10 ms | Free | High-volume indexing, search pipelines, homogeneous corpora |
| KeyBERT | 50-200 ms | Free (local GPU) | Semantic keyphrases, topic diversity, multi-domain content |
| LLM-based | 500 ms-2 s | $0.001-0.01/doc | Abstractive keyphrases, categorization, domain-specific extraction |

The hybrid pattern from Section 12.3 applies naturally: use TF-IDF or KeyBERT for the bulk of your documents, and invoke the LLM selectively for documents that require nuanced, domain-aware keyphrase extraction.

10. LLM-Powered AutoML

AutoML (Automated Machine Learning) has traditionally focused on automating hyperparameter tuning, architecture search, and feature selection using algorithms like Bayesian optimization and evolutionary search. LLMs introduce a new dimension: they can reason about data semantics, suggest features based on domain knowledge, and even orchestrate entire ML pipelines through natural language. This section covers three emerging patterns where LLMs enhance the AutoML process, moving beyond the pure cost-optimization patterns from Section 12.4 into territory where LLMs actively improve the ML pipeline itself.

10.1 CAAFE: LLM-Generated Features

Context-Aware Automated Feature Engineering (CAAFE, Hollmann et al., 2023) uses an LLM to generate Python code that creates new features for tabular ML datasets. The LLM receives the dataset schema (column names, types, and sample values) along with a description of the prediction task, then proposes feature engineering transformations based on its understanding of the domain. For example, given a dataset with "purchase_date" and "birth_date" columns for a churn prediction task, the LLM might suggest computing "customer_age_at_purchase" or "days_since_last_purchase," features that a purely algorithmic approach would not consider without explicit programming.

# LLM-based feature engineering for tabular ML
# Inspired by the CAAFE approach (Hollmann et al., 2023)
from openai import OpenAI
import pandas as pd

client = OpenAI()

def suggest_features(df: pd.DataFrame, target_col: str, task_desc: str) -> str:
    """Use an LLM to suggest feature engineering code for a dataset."""
    # Build a schema description from the DataFrame
    schema = []
    for col in df.columns:
        dtype = str(df[col].dtype)
        sample = df[col].dropna().head(3).tolist()
        schema.append(f"  {col} ({dtype}): e.g., {sample}")

    schema_str = "\n".join(schema)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are a senior data scientist. Given a dataset schema
and prediction task, suggest Python code (using pandas) that creates
new features likely to improve model performance.

Rules:
1. Return ONLY executable Python code, no commentary
2. Assume the DataFrame is named 'df'
3. Create 3 to 5 new features
4. Each feature should have a clear semantic rationale
5. Handle missing values gracefully"""
        }, {
            "role": "user",
            "content": f"""Dataset schema:
{schema_str}

Target column: {target_col}
Task: {task_desc}

Generate feature engineering code."""
        }],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Example usage
# features_code = suggest_features(
#     df=customer_df,
#     target_col="churned",
#     task_desc="Predict whether a customer will cancel their subscription"
# )
# exec(features_code)  # Creates new columns in df
Code Fragment 12.5.10: LLM-based feature engineering that reads a dataset schema and generates pandas code for new features. The LLM applies domain reasoning (e.g., computing customer tenure, purchase frequency ratios) that purely statistical feature selectors would miss.

10.2 LLMs as ML Pipeline Orchestrators

Beyond feature engineering, LLMs are emerging as orchestrators of entire ML workflows. AutoML-Agent (Trirat et al., 2024) uses an LLM to decompose a machine learning task into subtasks (data loading, preprocessing, feature engineering, model selection, hyperparameter tuning, evaluation), generate code for each step, execute it, interpret the results, and iterate. MLE-bench (Chan et al., 2024) benchmarks this capability by evaluating LLM agents on real Kaggle competitions, measuring their ability to perform end-to-end ML engineering.
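The decompose-generate-execute-iterate loop can be sketched as a skeleton. Everything here is an illustrative stand-in: `ask_llm` and `execute` are placeholders for a real model call and a sandboxed code runner, and the subtask list and stopping score are arbitrary choices, not part of AutoML-Agent itself.

```python
# Skeleton of an LLM-orchestrated ML workflow in the AutoML-Agent style:
# decompose the task, generate and execute code per subtask, interpret
# results, iterate. ask_llm/execute are placeholders, not real components.
SUBTASKS = ["load_data", "preprocess", "engineer_features",
            "select_model", "tune_hyperparameters", "evaluate"]

def run_automl_loop(task_description, ask_llm, execute, max_rounds=2):
    history = []
    for _ in range(max_rounds):
        for subtask in SUBTASKS:
            code = ask_llm(task_description, subtask, history)  # generate code
            outcome = execute(code)                             # run it (sandboxed)
            history.append((subtask, outcome))                  # feed back results
        if history[-1][1].get("score", 0) >= 0.9:               # good enough? stop
            break
    return history

# Toy stand-ins: the "LLM" emits a labeled snippet, the executor fakes a score
history = run_automl_loop(
    "predict churn",
    ask_llm=lambda task, sub, hist: f"# code for {sub}",
    execute=lambda code: {"score": 0.95} if "evaluate" in code else {"ok": True},
)
print(len(history))  # 6: one full pass, then early stop
```

The important structural feature is that `history` flows back into every `ask_llm` call, letting the model condition new code on earlier outcomes, which is what distinguishes agentic AutoML from one-shot code generation.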

Key Insight

LLM-powered AutoML is not about replacing data scientists; it is about accelerating the exploratory phase. An LLM can propose 20 feature engineering ideas in seconds, most of which a data scientist would eventually consider but might take hours to implement and test. The data scientist's role shifts from writing boilerplate code to evaluating, validating, and curating the LLM's suggestions. This mirrors the broader human-AI collaboration pattern seen across Chapter 28's application domains: LLMs amplify human expertise rather than substituting for it.

Key Takeaways

Hybrid extraction pairs classical NLP (fast, cheap, and precise on standard entity types) with LLMs (flexible, zero-shot, and stronger on open-ended, domain-specific extraction).

Structured output libraries such as Instructor, BAML, and Pydantic turn free-form LLM responses into validated, typed data, which production extraction pipelines require.

LLMs also work on the ML side of the pipeline, proposing features and orchestrating AutoML workflows, while the data scientist's role shifts to evaluating and curating their output.

Research Frontier

Universal information extraction. Models like USM and InstructUIE are trained to extract arbitrary entity types, relations, and events from a single unified architecture, reducing the need for task-specific fine-tuning. These models accept extraction schemas as part of the prompt, enabling zero-shot extraction of novel entity types.

Document-level extraction with long context. As context windows grow beyond 100K tokens, extraction pipelines can process entire documents in a single pass rather than chunking, improving cross-sentence coreference and relation extraction accuracy significantly.

Extraction with provenance. Systems that return not just extracted entities but also source spans, confidence scores, and reasoning traces are enabling human-in-the-loop verification workflows, critical for high-stakes domains like legal and medical information extraction.
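Even without a provenance-aware model, part of this idea can be approximated after the fact: locate each extracted entity string in the source text and flag anything that cannot be grounded, a cheap first-pass check against hallucinated entities. A minimal stdlib sketch (exact-substring matching is an assumption; production systems would use fuzzier alignment):

```python
def ground_entities(text: str, entities: list[str]) -> list[dict]:
    """Attach (start, end) source spans to extracted entities;
    entities not found verbatim in the text are flagged as ungrounded."""
    grounded = []
    for ent in entities:
        idx = text.lower().find(ent.lower())
        if idx == -1:
            grounded.append({"entity": ent, "span": None, "grounded": False})
        else:
            grounded.append(
                {"entity": ent, "span": (idx, idx + len(ent)), "grounded": True}
            )
    return grounded
```

Ungrounded entities can then be routed to a human reviewer rather than silently written to the database.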

Exercises

Exercise 12.5.1: Classical vs. LLM IE Conceptual

Compare classical information extraction (spaCy, CRFs) with LLM-based extraction on three dimensions: setup cost, entity type flexibility, and per-document cost.

Answer Sketch

Setup cost: classical requires labeled training data and pipeline engineering; LLM requires only prompt engineering. Entity type flexibility: classical is fixed at training time; LLM can extract new entity types by changing the prompt. Per-document cost: classical is near-zero (CPU inference); LLM costs $0.001 to $0.05 per document. Classical wins on cost and latency; LLM wins on flexibility and setup speed.

Exercise 12.5.2: Pydantic extraction Coding

Write a Pydantic model and an Instructor-based extraction call that extracts a list of mentioned companies, their roles (customer, supplier, partner), and any monetary amounts from a business news article.

Answer Sketch

Define: class CompanyMention(BaseModel): name: str; role: Literal['customer','supplier','partner']; amounts: list[str] = [] and class ArticleExtraction(BaseModel): companies: list[CompanyMention]. Call: client.chat.completions.create(model='gpt-4o', response_model=ArticleExtraction, messages=[{'role':'user','content': article_text}]).
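Written out as runnable code, the sketch looks like the following. The schema definitions need only Pydantic; the Instructor call is shown commented out because it assumes an installed instructor package and a live API key (per the structured output patterns of Section 10.2):

```python
from typing import Literal
from pydantic import BaseModel

class CompanyMention(BaseModel):
    name: str
    role: Literal["customer", "supplier", "partner"]
    amounts: list[str] = []  # monetary amounts mentioned alongside the company

class ArticleExtraction(BaseModel):
    companies: list[CompanyMention]

# With instructor (pip install instructor):
# import instructor
# from openai import OpenAI
# client = instructor.from_openai(OpenAI())
# result = client.chat.completions.create(
#     model="gpt-4o",
#     response_model=ArticleExtraction,
#     messages=[{"role": "user", "content": article_text}],
# )
# result is a validated ArticleExtraction instance
```

The Literal type is what gives the schema its teeth: a response assigning any role outside the three allowed values fails validation and triggers Instructor's retry logic.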

Exercise 12.5.3: Hybrid NER pipeline Conceptual

Design a hybrid NER pipeline where spaCy handles standard entity types (person, org, location) and an LLM handles domain-specific entities (drug names, gene symbols). How do you merge the results?

Answer Sketch

Run spaCy first for standard entities (fast, free). For domain-specific types, send the text to the LLM with instructions to extract only drug names and gene symbols. Merge by checking for overlapping spans: if spaCy and the LLM both identify the same span, prefer the more specific label (LLM's domain label). If spans partially overlap, use the LLM's boundaries for domain entities and spaCy's for standard entities.
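The merge step can be sketched as a pure function over (start, end, label) character spans, with the tie-breaking rule above baked in: on overlap, the LLM's domain-specific span wins. A minimal sketch (the tuple representation is an assumption for illustration):

```python
def merge_spans(spacy_spans, llm_spans):
    """Merge two entity span lists; each span is (start, end, label).
    On any overlap, the LLM's domain-specific span takes priority."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    merged = list(llm_spans)  # domain entities are kept unconditionally
    for span in spacy_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)
```

A refinement would keep partially overlapping spaCy spans with trimmed boundaries, but dropping them entirely is the simpler and usually safer default.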

Exercise 12.5.4: Relation extraction Coding

Write a prompt and Pydantic schema for extracting relationships between entities in a sentence. The schema should capture subject, predicate, object triples with confidence scores.

Answer Sketch

Schema: class Relation(BaseModel): subject: str; predicate: str; object: str; confidence: float = Field(ge=0, le=1) and class RelationExtraction(BaseModel): relations: list[Relation]. Prompt: 'Extract all factual relationships from the text as subject-predicate-object triples. Rate your confidence 0 to 1 for each. Only include relationships explicitly stated.' Use Instructor to enforce the schema.
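The same sketch as a runnable schema, using Pydantic's Field constraints to keep confidence in range; the prompt wording below mirrors the sketch and would be passed as the system or user message in an Instructor call:

```python
from pydantic import BaseModel, Field

class Relation(BaseModel):
    subject: str
    predicate: str
    object: str
    confidence: float = Field(ge=0, le=1)  # rejected outside [0, 1]

class RelationExtraction(BaseModel):
    relations: list[Relation]

PROMPT = (
    "Extract all factual relationships from the text as "
    "subject-predicate-object triples. Rate your confidence 0 to 1 "
    "for each. Only include relationships explicitly stated."
)
```

With response_model=RelationExtraction, a response containing a confidence of 1.5 fails validation and is retried rather than silently accepted.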

Exercise 12.5.5: Extraction evaluation Analysis

You run both spaCy NER and LLM-based NER on 100 test documents. spaCy achieves 94% F1 on standard entities but 0% on domain entities. The LLM achieves 87% F1 on standard and 78% on domain entities. Which approach (or combination) should you deploy, and why?

Answer Sketch

Deploy the hybrid: use spaCy for standard entities (94% F1, near-zero cost) and the LLM for domain entities only (78% F1, higher cost but no alternative). Running the LLM on standard entities wastes money for lower accuracy. The hybrid achieves 94% on standard and 78% on domain at minimal cost, compared to using the LLM for everything (87%/78% at much higher cost).

What Comes Next

In the next chapter, Chapter 13: Synthetic Data Generation & LLM Simulation, we begin Part 4 by exploring synthetic data generation, a powerful technique for creating training data using LLMs themselves.

References and Further Reading
Information Extraction Research

Lee, J. et al. (2020). BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics, 36(4), 1234-1240.

Demonstrates that domain-specific pre-training dramatically improves NER and relation extraction in biomedical text. BioBERT is a prime example of the specialized extraction models that form the classical ML side of hybrid extraction pipelines. Essential reading for teams building extraction systems in specialized domains.

Paper

Li, J. et al. (2023). Evaluating ChatGPT's Information Extraction Capabilities.

Systematically benchmarks ChatGPT on NER, relation extraction, and event extraction against traditional fine-tuned models. The results reveal where LLMs excel (zero-shot flexibility, complex relations) and where they fall short (consistency, rare entity types). Directly informs the hybrid approach advocated in this section.

Paper

Wei, X. et al. (2023). Zero-Shot Information Extraction via Chatting with ChatGPT.

Explores conversational approaches to information extraction where the LLM is guided through multi-turn dialogue to extract entities and relations. The chat-based extraction paradigm is particularly relevant to the LLM-powered extraction patterns discussed here. Useful for teams prototyping extraction without labeled training data.

Paper
Coreference Resolution

Lee, K. et al. (2017). End-to-End Neural Coreference Resolution. EMNLP 2017.

The foundational paper for modern neural coreference resolution. Introduces a model that jointly learns mention detection and coreference linking without any hand-crafted features or a separate mention detector. This architecture, and its subsequent refinements, remains the basis for most production coreference systems. Essential reading for understanding how mention clustering works at scale.

Paper

Joshi, M. et al. (2020). SpanBERT: Improving Pre-training by Representing and Predicting Spans. TACL, 8, 64-77.

SpanBERT's span-level pre-training objective produces representations that significantly improve coreference resolution accuracy compared to standard BERT. This paper demonstrates why pre-training objectives matter for downstream structured prediction tasks, and SpanBERT remains the default encoder for state-of-the-art coreference models.

Paper
Open IE and Event Extraction

Angeli, G. et al. (2015). Leveraging Linguistic Structure For Open Domain Information Extraction. ACL 2015.

Introduces Stanford OpenIE, which decomposes complex sentences into short clauses using natural logic and extracts (subject, relation, object) triples from each clause. The approach is fast, unsupervised, and remains widely used for bootstrapping knowledge graphs from unstructured text. A foundational reference for understanding schema-free extraction.

Paper

Huguet Cabot, P.-L. and Navigli, R. (2021). REBEL: Relation Extraction By End-to-end Language Generation. EMNLP 2021.

Frames relation extraction as sequence-to-sequence generation, training a BART model to produce linearized relation triples. REBEL supports over 200 Wikidata relation types and bridges the gap between closed IE (predefined schemas) and open IE (arbitrary relations). Highly practical for teams building knowledge graph pipelines that need both coverage and consistency.

Paper

Gao, J. et al. (2023). Benchmarking Large Language Models for Event Extraction. ACL 2023.

Systematically evaluates GPT-3.5 and GPT-4 on ACE and ERE event extraction benchmarks, comparing zero-shot and few-shot LLM performance against supervised models. The results show that LLMs excel at event detection but struggle with argument extraction for rare event types, informing the hybrid approach where LLMs handle event typing while specialized models handle argument filling.

Paper
Tools and Frameworks

Boundary ML. (2024). BAML: Build AI Applications with Structured Extraction.

Documentation for BAML, a framework that combines LLM calls with type-safe schema validation for structured extraction. BAML automates the schema enforcement and retry logic that hybrid extraction pipelines need. Highly practical for teams building production extraction systems with guaranteed output formats.

Framework

spaCy. (2024). spaCy Industrial-Strength NLP.

The leading open-source library for production NLP, offering fast NER, dependency parsing, and custom pipeline components. spaCy represents the classical NLP side of the hybrid extraction stack and is the recommended starting point for the rule-based and trained extraction components discussed in this section.

Library

Hugging Face. (2024). Token Classification with Transformers.

A step-by-step tutorial for fine-tuning transformer models on NER and other token classification tasks. This covers the practical implementation of training specialized extraction models that complement LLM-based approaches. Recommended for engineers implementing the fine-tuned NER component of a hybrid pipeline.

Tutorial