Section 34.2: Classical and Open Information Extraction

"spaCy does a thousand documents per second for free; the LLM does one per second for money. The trick of production IE is knowing which document deserves which."
Lexica, Stubbornly Classical NLP Agent

Big Picture: Four sub-topics, one pipeline

This section weaves together four IE sub-topics that ultimately feed a single document-understanding pipeline: classical NER with spaCy (fast, fixed entity types); Open IE with Stanford OpenIE, REBEL, and LLM-based variants (schema-free subject-relation-object triples); event extraction (triggers with typed argument roles); and temporal IE (TimeML, SUTime/HeidelTime, LLM timeline reasoning). The triples and events these tools produce then assemble into knowledge graphs, where entities become nodes and relations become edges. The recurring theme: classical tools handle the high-volume, well-defined parts at near-zero marginal cost; LLMs handle implicit relations, normalization, and novel schemas.

Prerequisites

This section assumes the information-extraction landscape from Section 34.1, the basic token-classification vocabulary from Section 1.3, and familiarity with the spaCy and Hugging Face pipeline interfaces from Section 12.1.

Library Shortcut: The thinnest IE stack

For a production-ready IE pipeline, the canonical thinnest stack is four libraries: spaCy (en_core_web_trf) for fast classical NER and tokenization, Pyserini for retrieval-flavored IE over a document corpus, Instructor (Pydantic-validated LLM output) for typed LLM extraction of relations and events, and REBEL (fine-tuned BART) for high-throughput relation extraction with Wikidata-style relation types. This four-piece stack covers entity recognition, retrieval, structured LLM output, and relation extraction without pulling in heavier frameworks; everything composes through plain Python.


# Two-step IE pipeline: spaCy for fast NER, Instructor for typed LLM extraction.
import spacy, instructor
from pydantic import BaseModel
from openai import OpenAI

nlp = spacy.load("en_core_web_trf")
client = instructor.from_openai(OpenAI())

class Event(BaseModel):
    actor: str
    action: str
    target: str

text = "Alice met Bob at Acme Corp in Berlin on 2024-03-01."
ents = [(e.text, e.label_) for e in nlp(text).ents]
event = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Event,
    messages=[{"role": "user", "content": f"Extract one event from: {text}"}],
)

Code Fragment 34.2.1: Two-step IE pipeline: spaCy for fast NER, Instructor for typed LLM extraction.

34.2.1 Classical IE with spaCy

Fun Fact

Named entity recognition was one of the first NLP tasks to reach "good enough" accuracy in the 1990s, and spaCy's modern transformer-based models (en_core_web_trf) can process on the order of a thousand short documents per second on a GPU. Meanwhile, performing the same task with an LLM API call takes roughly 1 second per document and costs money. Sometimes the 30-year-old approach is still the right one.

Numeric Example: spaCy vs LLM throughput, side by side

spaCy's en_core_web_trf model processes roughly a thousand documents per second on a modern GPU (with batching) or hundreds per second on CPU. An LLM API call for the same NER task runs at roughly 1 document per second (limited by network round-trip plus 200ms-2s of inference latency). That is a ~1,000× throughput gap. To process 1 million documents: spaCy finishes in ~1,000 seconds (about 17 minutes); the LLM takes ~11.6 days of wall time (or less with parallel API calls, at the same total API cost). Treat these as order-of-magnitude figures for short documents; measured throughput depends on document length, batch size, and hardware. This gap is why classical NER remains the default first pass even when an LLM is also available.
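The arithmetic behind these figures is easy to reproduce. The sketch below simply divides document count by rate; the rates are the round numbers quoted above rather than measurements, and the 50-worker case is an illustrative assumption about how many API calls run in parallel.

```python
# Back-of-the-envelope wall-clock estimate. The rates are the round numbers
# quoted in the text, not benchmark results.
SECONDS_PER_DAY = 86_400


def wall_time_seconds(num_docs: int, docs_per_second: float, workers: int = 1) -> float:
    """Seconds needed to process num_docs at a fixed per-worker rate."""
    return num_docs / (docs_per_second * workers)


num_docs = 1_000_000
spacy_seconds = wall_time_seconds(num_docs, docs_per_second=1_000)
llm_seconds = wall_time_seconds(num_docs, docs_per_second=1)
# Assumption for illustration: 50 concurrent API workers.
llm_parallel_seconds = wall_time_seconds(num_docs, docs_per_second=1, workers=50)

print(f"spaCy:                    {spacy_seconds / 60:.1f} minutes")
print(f"LLM, sequential:          {llm_seconds / SECONDS_PER_DAY:.1f} days")
print(f"LLM, 50 parallel workers: {llm_parallel_seconds / 3600:.1f} hours")
```

Parallelism shortens the LLM's wall time but not its bill: the number of API calls stays at one million.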

Logarithmic bar chart comparing NER throughput. spaCy en_core_web_trf at about a thousand docs per second, REBEL fine-tuned at 100 docs per second, and LLM API at 1 doc per second. The visual shows a 1000x gap between spaCy and LLM endpoints on a log scale. — **Figure 34.2.1a**: NER throughput on a logarithmic axis. The three-order-of-magnitude gap between spaCy and an LLM is exactly why classical tools handle the well-defined high-volume work and LLMs handle the implicit-relation tail. The middle path, fine-tuned encoder-decoder models like REBEL, trades schema breadth for two orders of magnitude in speed over the LLM.

spaCy remains the gold standard for production NER when you need speed and reliability on well-defined entity types. Its tokenization pipeline handles the text preprocessing that makes span-based entity recognition possible. Its transformer-based models achieve state-of-the-art accuracy on standard benchmarks, and its pipeline architecture makes it easy to add custom entity types through training or rule-based matching. Code Fragment 34.2.3 shows this approach in practice.

Library Shortcut

spaCy's en_core_web_trf transformer pipeline

For the classical column of Table 34.1.1, the production default is spaCy 3.7+ with its transformer-based pipeline. en_core_web_trf wraps a RoBERTa-base encoder behind the standard spaCy Doc.ents API: you get state-of-the-art F1 on OntoNotes entities without writing a training loop. Pair it with spacy[transformers,cuda12x] on a GPU and, with batching over short documents, throughput stays in the roughly thousand-docs-per-second range quoted above.


# Shell setup (run once, outside Python):
#   pip install "spacy[transformers,cuda12x]"
#   python -m spacy download en_core_web_trf
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Alice met Bob at Acme Corp in Berlin on 2024-03-01.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Code Fragment 34.2.2: Four lines turn a paragraph into typed entity spans; replace en_core_web_trf with en_core_web_lg for CPU-only deployments that prioritize latency over accuracy.

Code Fragment 34.2.3 runs spaCy's transformer-based NER pipeline on a financial news snippet, grouping extracted entities by type.

# Use spaCy for classical named entity recognition
# Extracts persons, organizations, dates, and locations at minimal cost
import spacy
from collections import defaultdict

# Load a pre-trained transformer model
nlp = spacy.load("en_core_web_trf")

text = """
Apple Inc. announced today that CEO Tim Cook will present the company's
quarterly earnings at their headquarters in Cupertino, California on
January 30, 2025. Revenue is expected to exceed $120 billion, driven
by strong iPhone 16 sales across Europe and Asia.
"""
doc = nlp(text)

# Extract entities with their labels and character offsets
entities = []
for ent in doc.ents:
    entities.append({
        "text": ent.text,
        "label": ent.label_,
        "start": ent.start_char,
        "end": ent.end_char,
    })

# Group by entity type (PERSON, ORG, DATE, etc.) for readable output.
by_type = defaultdict(list)
for entity in entities:
    by_type[entity["label"]].append(entity["text"])

# Print one line per entity type, then a summary count.
print("Extracted Entities:")
print("=" * 50)
for label, values in sorted(by_type.items()):
    print(f" {label:12s}: {', '.join(values)}")
print(f"\nTotal: {len(entities)} entities across {len(by_type)} types")

Output:
Extracted Entities:
==================================================
 CARDINAL    : 120 billion, 16
 DATE        : today, January 30, 2025
 GPE         : Cupertino, California, Europe, Asia
 MONEY       : $120 billion
 ORG         : Apple Inc.
 PERSON      : Tim Cook
 PRODUCT     : iPhone 16

Total: 11 entities across 7 types

Code Fragment 34.2.3: Use spaCy for classical named entity recognition

Code Fragment 34.2.4 demonstrates the complementary LLM-based approach using the BAML framework for structured event extraction, where b.ExtractEvents() handles prompt construction, LLM invocation, and Pydantic validation in one call.

# Schema is declared in extract_events.baml (not Python).
# BAML compiles the .baml file into a typed Python client where
# ExtractedEvent has fields: event_type (enum), description (str),
# participants (list[str]), date and monetary_value (optional str).
from baml_client import b
from baml_client.types import ExtractedEvent

def summarize_event(event: ExtractedEvent) -> None:
    print(f"Type:         {event.event_type}")
    print(f"Description:  {event.description}")
    print(f"Participants: {', '.join(event.participants)}")
    print(f"Date:         {event.date}")
    print(f"Value:        {event.monetary_value}")

article = """
Microsoft announced on March 15, 2025, that it has completed its
$2.1 billion acquisition of cybersecurity startup CyberShield AI.
The deal, first reported in January, brings 450 employees and
several enterprise security products into Microsoft's Azure division.
CEO Satya Nadella called the acquisition transformative for the
company's cloud security strategy.
"""
# BAML handles prompt construction, LLM call, and type validation
events: list[ExtractedEvent] = b.ExtractEvents(article)
for event in events:
    summarize_event(event)

Output:
Type:         acquisition
Description:  Microsoft completed acquisition of CyberShield AI
Participants: Microsoft, CyberShield AI, Satya Nadella
Date:         March 15, 2025
Value:        $2.1 billion

Code Fragment 34.2.4: Calling the BAML-generated ExtractEvents function from Python; the schema itself is declared in extract_events.baml.

Warning

LLMs can hallucinate entities that do not appear in the source text. Always implement a grounding check that verifies extracted entities against the original document. A simple substring match catches most hallucinations. For more robust grounding, use fuzzy matching or semantic similarity to handle paraphrases and abbreviations.
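A minimal sketch of such a grounding check follows, using only the standard library. The function name is_grounded, the 0.85 threshold, and the sample entity list are illustrative assumptions, not part of any library. The fuzzy fallback compares the entity against every source window with the same word count, so it tolerates small spelling differences but not paraphrases or abbreviations; those still need semantic similarity.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def is_grounded(entity: str, source: str, fuzzy_threshold: float = 0.85) -> bool:
    """Return True if the entity occurs in the source, exactly or approximately."""
    entity_norm, source_norm = _normalize(entity), _normalize(source)
    if not entity_norm:
        return False
    # Step 1: exact substring match catches most grounded entities.
    if entity_norm in source_norm:
        return True
    # Step 2: fuzzy match against each window of the same number of words.
    words = source_norm.split()
    width = len(entity_norm.split())
    for i in range(len(words) - width + 1):
        window = " ".join(words[i:i + width])
        if SequenceMatcher(None, entity_norm, window).ratio() >= fuzzy_threshold:
            return True
    return False


source = (
    "Microsoft announced on March 15, 2025, that it has completed its "
    "$2.1 billion acquisition of cybersecurity startup CyberShield AI. "
    "CEO Satya Nadella called the acquisition transformative."
)
# Hypothetical LLM output: one misspelling and one hallucinated entity.
extracted = ["Microsoft", "CyberShield AI", "Satya Nadela", "Google Cloud"]
grounded = [e for e in extracted if is_grounded(e, source)]
print(grounded)
```

Here the misspelled "Satya Nadela" survives through the fuzzy step, while "Google Cloud", which has no counterpart in the source, is dropped.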

34.2.2 Open Information Extraction

Key Insight

Events become hyper-edges; entities catch them like flypaper

Paper airplanes labelled trigger + arguments fly toward a graph database. Each lands on the right node (Person, Place, Date, Event) and sticks like flypaper. — **Figure 34.2.3b**: Event extraction is a hyper-edge insertion problem. The trigger predicate is the edge type and each argument role attaches one entity node.

A cozy sepia detective's pinboard with a magnifying glass and red string connecting labeled photo cards reading AGENT butler, PATIENT lord, INSTRUMENT candlestick, LOCATION library, and TEMPORAL 9pm, beneath a chalkboard that reads Semantic Role Labeling. — **Figure 34.2.4a**: Semantic Role Labeling tags who did what to whom, where, and when. Once a sentence has been SRL-tagged, the rest of the IE pipeline is a database insert away.

Traditional relation extraction requires a predefined schema of relation types (e.g., works_at, located_in, acquired_by). Open Information Extraction (Open IE) removes this constraint entirely, extracting arbitrary (subject, relation, object) triples from text without any schema. This makes Open IE invaluable for exploratory analysis, knowledge graph bootstrapping, and domains where the set of possible relations is unknown or too large to enumerate in advance.

34.2.2.1 Classical Open IE Systems

Stanford OpenIE (Angeli et al., 2015) pioneered the modern approach to schema-free extraction. It decomposes complex sentences into short, self-contained clauses using natural logic, then extracts triples from each clause. For example, the sentence "Tim Cook, who became CEO in 2011, leads Apple from Cupertino" yields triples such as (Tim Cook; became CEO; in 2011) and (Tim Cook; leads; Apple). Because every triple is read off a clause, a fact the sentence only implies, such as (Apple; located in; Cupertino), is generally not recovered. The system processes text at high speed with no training data required, but it struggles with implicit relations and complex syntactic constructions.

REBEL (Huguet Cabot and Navigli, 2021) takes a different approach by framing relation extraction as a sequence-to-sequence problem. A fine-tuned BART model generates linearized triples directly from input text, enabling it to capture both explicit and implicit relations. REBEL supports over 200 relation types from Wikidata and can be fine-tuned on custom relation schemas. It bridges the gap between closed and open IE by learning from structured knowledge bases while retaining the flexibility to extract novel relation patterns.
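To make "linearized triples" concrete, the sketch below parses a decoded REBEL-style string back into (head, relation, tail) tuples. It assumes the marker layout described in the public REBEL model card, `<triplet> head <subj> tail <obj> relation`, where a head can be followed by several `<subj> ... <obj> ...` pairs; treat that layout as an assumption to verify against the checkpoint you use. The sample string is hand-written for illustration and is not real model output.

```python
def parse_rebel_output(decoded: str) -> list[tuple[str, str, str]]:
    """Parse '<triplet> head <subj> tail <obj> relation ...' into triples."""
    triples: list[tuple[str, str, str]] = []
    parts = {"head": [], "tail": [], "relation": []}
    current = None  # which slot the next ordinary token belongs to

    def flush() -> None:
        # Emit a triple only when all three slots have content.
        if all(parts.values()):
            triples.append((
                " ".join(parts["head"]),
                " ".join(parts["relation"]),
                " ".join(parts["tail"]),
            ))

    # Drop sequence-level special tokens before splitting on whitespace.
    for special in ("<s>", "</s>", "<pad>"):
        decoded = decoded.replace(special, " ")

    for token in decoded.split():
        if token == "<triplet>":      # a new head entity starts
            flush()
            parts["head"], parts["tail"], parts["relation"] = [], [], []
            current = "head"
        elif token == "<subj>":       # a new tail for the same head
            flush()
            parts["tail"], parts["relation"] = [], []
            current = "tail"
        elif token == "<obj>":        # the relation label follows
            current = "relation"
        elif current is not None:
            parts[current].append(token)
    flush()
    return triples


# Hand-written example string in the assumed format.
decoded = (
    "<s><triplet> Tim Cook <subj> Apple <obj> employer "
    "<subj> CEO <obj> position held "
    "<triplet> Apple <subj> Cupertino <obj> headquarters location</s>"
)
triples = parse_rebel_output(decoded)
for head, relation, tail in triples:
    print(f"({head}; {relation}; {tail})")
```

Note that the relation labels come from the model's learned Wikidata vocabulary, which is why REBEL output needs less normalization than raw Open IE phrases.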

Key Insight

Open IE is the foundation of automated knowledge graph construction. Each extracted triple becomes a candidate edge in a knowledge graph: subjects and objects map to nodes, and relations map to labeled edges. When combined with entity linking (covered in Section 31.5), Open IE triples can be grounded to canonical entities, enabling structured querying over unstructured text corpora.

34.2.2.2 LLM-Based Open IE with Structured Output

LLMs bring two significant advantages to Open IE: they understand context deeply enough to extract implicit relations, and they can normalize relation phrases into consistent, canonical forms. A classical Open IE system might extract both "is headquartered in" and "has its main office in" as separate relation types; an LLM can recognize these as the same headquartered_in relation and normalize accordingly.

# LLM-based Open Information Extraction with structured output
# Extracts (subject, relation, object) triples without a predefined schema
from pydantic import BaseModel, Field
from openai import OpenAI
import instructor
client = instructor.from_openai(OpenAI())
class Triple(BaseModel):
    subject: str = Field(description="The entity performing or described by the relation")
    relation: str = Field(description="Normalized relation in snake_case (e.g., acquired_by)")
    object: str = Field(description="The target entity or value of the relation")
    confidence: float = Field(ge=0.0, le=1.0, description="Extraction confidence")
    source_span: str = Field(description="The text span supporting this triple")
class OpenIEResult(BaseModel):
    triples: list[Triple]
def extract_open_ie(text: str) -> OpenIEResult:
    """Run instructor-validated Open IE over a short passage."""
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=OpenIEResult,
        messages=[
            {"role": "system", "content": (
                "Extract all factual (subject, relation, object) triples from the text. "
                "Normalize relation names to snake_case. Include the source text span "
                "that supports each triple. Only extract relations explicitly stated or "
                "directly implied by the text."
            )},
            {"role": "user", "content": text},
        ],
        max_retries=2,
    )

text = """
Anthropic, founded by Dario and Daniela Amodei in 2021, raised $2 billion
from Google in late 2023. The San Francisco-based company developed Claude,
which competes with OpenAI's ChatGPT. Anthropic employs over 1,000
researchers and engineers focused on AI safety.
"""
result = extract_open_ie(text)
print(f"Extracted {len(result.triples)} triples:\n")
for t in result.triples:
    print(f"  ({t.subject}; {t.relation}; {t.object})")
    print(f"     confidence={t.confidence:.2f} span=\"{t.source_span}\"")

Output:
Extracted 7 triples:

  (Anthropic; founded_by; Dario Amodei)
     confidence=0.98 span="founded by Dario and Daniela Amodei in 2021"
  (Anthropic; founded_by; Daniela Amodei)
     confidence=0.98 span="founded by Dario and Daniela Amodei in 2021"
  (Anthropic; founded_in; 2021)
     confidence=0.99 span="founded by Dario and Daniela Amodei in 2021"
  (Anthropic; received_funding_from; Google)
     confidence=0.97 span="raised $2 billion from Google in late 2023"
  (Anthropic; headquartered_in; San Francisco)
     confidence=0.95 span="The San Francisco-based company"
  (Anthropic; developed; Claude)
     confidence=0.99 span="developed Claude"
  (Claude; competes_with; ChatGPT)
     confidence=0.96 span="competes with OpenAI's ChatGPT"

Code Fragment 34.2.5: LLM-based Open Information Extraction with structured output

Table 34.2.2a: Comparison of Open IE approaches by dimension (as of 2026).

Dimension	Stanford OpenIE	REBEL	LLM-Based Open IE
Schema required	None	Optional (Wikidata types)	None
Relation normalization	None (raw phrases)	Partial (learned types)	Strong (LLM normalizes)
Implicit relations	Weak	Moderate	Strong
Latency per document	<50ms	50-200ms	500ms-2s
Cost per document	Free	Free (local GPU)	$0.005-0.03
Hallucination risk	None	Low	Moderate
Best for	High-volume, explicit facts	Knowledge graph population	Complex, implicit relations

Key Insight

Semantic Role Labeling: The Hidden Foundation of Structured Extraction

Every structured extraction task implicitly answers the question: "Who did what to whom, when, where, and why?" This is exactly the question that Semantic Role Labeling (SRL) was designed to answer. SRL identifies the predicate (the action or event) in a sentence and assigns semantic roles to its arguments: Agent (who performed the action), Patient (who was affected), Instrument (what tool was used), Location (where it happened), and Temporal (when it happened).

Classical SRL resources include PropBank, which defines verb-specific role sets (Arg0, Arg1, Arg2, etc.) grounded in syntactic frames, and FrameNet, which organizes predicates into semantic frames with named roles (Buyer, Seller, Goods, Price for a commercial transaction). Tools like AllenNLP's SRL model provide pretrained PropBank-based labeling that runs in milliseconds per sentence.

When an LLM extracts structured events with typed arguments (buyer, seller, price, date), it is implicitly performing SRL without the formal linguistic machinery. Understanding this connection matters for two reasons. First, classical SRL tools can serve as fast, cheap pre-processors that identify predicate-argument structures before an LLM performs the more expensive reasoning over those structures. Second, SRL annotations from PropBank and FrameNet provide excellent few-shot examples for teaching LLMs to extract domain-specific semantic roles, because the role structure transfers across domains even when the specific labels change. The event extraction subsection below demonstrates how these semantic roles appear in practice under the labels of event arguments.
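The pre-processor idea can be sketched as a simple relabeling step: take a PropBank-style frame and rename its numbered arguments to domain roles. The frame dictionary below is hand-written to stand in for SRL tagger output, and ROLE_MAP is a hypothetical mapping chosen for illustration; neither reproduces an official PropBank frame file or a specific tool's output format.

```python
# Hypothetical mapping from PropBank-style argument labels to domain roles
# for one predicate lemma. Extend per predicate as needed.
ROLE_MAP = {
    "acquire": {
        "ARG0": "buyer",
        "ARG1": "target",
        "ARG3": "price",
        "ARGM-TMP": "date",
    },
}


def frame_to_event(frame: dict) -> dict:
    """Rename a frame's argument labels to domain roles; keep unknown labels as-is."""
    role_map = ROLE_MAP.get(frame["lemma"], {})
    arguments = {
        role_map.get(label, label): span
        for label, span in frame["args"].items()
    }
    return {"trigger": frame["predicate"], "arguments": arguments}


# Hand-written frame standing in for the output of an SRL tagger.
frame = {
    "predicate": "acquired",
    "lemma": "acquire",
    "args": {
        "ARG0": "Google",
        "ARG1": "Fitbit",
        "ARG3": "for $2.1 billion",
        "ARGM-TMP": "in January 2021",
    },
}
event = frame_to_event(frame)
print(event)
```

An LLM downstream then only has to clean up the spans (for example, turning "in January 2021" into a date) instead of finding the predicate-argument structure from scratch.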

34.2.2.3 Event Extraction

Event extraction goes beyond entity and relation extraction to identify structured representations of what happened. Each event consists of a trigger (the word or phrase indicating the event), an event type, and a set of arguments filling specific roles. For example, in the sentence "Google acquired Fitbit for $2.1 billion in January 2021," the trigger is "acquired," the event type is ACQUISITION, and the arguments include the buyer (Google), the target (Fitbit), the price ($2.1 billion), and the date (January 2021).

Event Ontologies and Benchmarks

The ACE (Automatic Content Extraction) program defined 33 event subtypes grouped into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel, and Justice. The ERE (Entities, Relations, Events) annotation standard extended ACE with lighter guidelines and broader coverage. These ontologies provide the foundation for supervised event extraction models, but they cover only a fraction of real-world event types. Domain-specific applications (financial events, clinical events, cybersecurity incidents) typically require custom event schemas.

LLM-Based Event Extraction

LLMs excel at event extraction because they can handle arbitrary event schemas defined in the prompt, reason about implicit arguments (the buyer in a passive construction like "Fitbit was acquired"), and resolve temporal expressions to structured dates. The following code fragment demonstrates extracting structured events from a news article, including timeline construction from multiple event mentions.

# LLM-based event extraction with timeline construction
# Extracts structured events and builds a chronological timeline
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum
from openai import OpenAI
import instructor
client = instructor.from_openai(OpenAI())
class EventType(str, Enum):
    ACQUISITION = "acquisition"
    FUNDING = "funding_round"
    PRODUCT_LAUNCH = "product_launch"
    PARTNERSHIP = "partnership"
    LEADERSHIP_CHANGE = "leadership_change"
    LEGAL_ACTION = "legal_action"
    IPO = "ipo"
    LAYOFF = "layoff"
    EARNINGS = "earnings_report"
    OTHER = "other"
class EventArgument(BaseModel):
    role: str = Field(description="Semantic role: buyer, seller, amount, product, etc.")
    value: str = Field(description="The entity or value filling this role")
    entity_type: Optional[str] = Field(
        default=None, description="Entity type if applicable: ORG, PERSON, MONEY, DATE"
    )
class ExtractedEvent(BaseModel):
    trigger: str = Field(description="The word or phrase indicating the event")
    event_type: EventType
    arguments: list[EventArgument]
    date: Optional[str] = Field(default=None, description="ISO date if identifiable")
    sentence: str = Field(description="The source sentence containing the event")
class EventExtractionResult(BaseModel):
    events: list[ExtractedEvent]
    timeline_summary: str = Field(
        description="One-paragraph chronological summary of all events"
    )

def extract_events(article: str) -> EventExtractionResult:
    """One LLM call returns typed events plus a timeline string."""
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=EventExtractionResult,
        messages=[
            {"role": "system", "content": (
                "Extract all business events from the article. For each event, identify "
                "the trigger word, classify the event type, extract all arguments with "
                "their semantic roles, and resolve dates to ISO format (YYYY-MM-DD) where "
                "possible. Then write a chronological timeline summary."
            )},
            {"role": "user", "content": article},
        ],
        max_retries=2,
    )

article = """
In a dramatic week for the tech industry, Nvidia announced record quarterly
revenue of $22.1 billion on February 21, 2024, driven by surging AI chip
demand. Two days later, the EU filed an antitrust lawsuit against Apple over
its App Store policies, seeking $38 billion in penalties. Meanwhile, Microsoft
laid off 1,900 employees from its gaming division following the Activision
Blizzard acquisition. On February 26, Stripe announced a partnership with
Amazon to process payments for third-party sellers, a deal expected to
generate $500 million in annual transaction volume.
"""
result = extract_events(article)
for i, event in enumerate(result.events, 1):
    print(f"Event {i}: {event.event_type.value}")
    print(f"  Trigger: \"{event.trigger}\"")
    print(f"  Date: {event.date}")
    for arg in event.arguments:
        print(f"  {arg.role:12s}: {arg.value} ({arg.entity_type or 'N/A'})")
print()
print("Timeline:")
print(f"  {result.timeline_summary}")

Output:
Event 1: earnings_report
  Trigger: "announced record quarterly revenue"
  Date: 2024-02-21
  company     : Nvidia (ORG)
  revenue     : $22.1 billion (MONEY)
  driver      : surging AI chip demand (N/A)
Event 2: legal_action
  Trigger: "filed an antitrust lawsuit"
  Date: 2024-02-23
  plaintiff   : EU (ORG)
  defendant   : Apple (ORG)
  subject     : App Store policies (N/A)
  penalty     : $38 billion (MONEY)
Event 3: layoff
  Trigger: "laid off"
  Date: 2024-02-23
  company     : Microsoft (ORG)
  count       : 1,900 employees (N/A)
  division    : gaming division (N/A)
  context     : Activision Blizzard acquisition (N/A)
Event 4: partnership
  Trigger: "announced a partnership"
  Date: 2024-02-26
  partner_1   : Stripe (ORG)
  partner_2   : Amazon (ORG)
  purpose     : process payments for third-party sellers (N/A)
  value       : $500 million annual transaction volume (MONEY)

Timeline:
  On February 21, 2024, Nvidia reported record quarterly revenue of $22.1B. Two days later, on February 23, the EU filed an antitrust suit against Apple and Microsoft laid off 1,900 gaming employees. On February 26, Stripe and Amazon announced a payments partnership.

Code Fragment 34.2.6: LLM-based event extraction with timeline construction

Key Insight

Event extraction is the bridge between information extraction and temporal reasoning. By extracting events with resolved dates and participant roles, you can construct timelines, detect causal chains (Event A triggered Event B), and answer temporal questions ("What happened between the acquisition and the layoffs?"). This capability is critical for RAG systems (Section 32.3) that need to reason about sequences of events rather than isolated facts.
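Once events carry resolved dates, a "what happened between X and Y" question reduces to a sort and a filter. The sketch below assumes events have already been extracted and dated; the TimedEvent class and events_between function are illustrative names, and the sample data mirrors the event types and dates from the article example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimedEvent:
    when: date
    event_type: str
    description: str


def events_between(events: list[TimedEvent], start_type: str, end_type: str) -> list[TimedEvent]:
    """Return events strictly between the first start_type and first end_type event."""
    ordered = sorted(events, key=lambda e: e.when)  # stable chronological sort
    start = next((e for e in ordered if e.event_type == start_type), None)
    end = next((e for e in ordered if e.event_type == end_type), None)
    if start is None or end is None:
        raise ValueError("both anchor event types must be present")
    return [e for e in ordered if start.when < e.when < end.when]


# Deliberately unordered input, as extraction output often is.
events = [
    TimedEvent(date(2024, 2, 26), "partnership", "Stripe and Amazon announce partnership"),
    TimedEvent(date(2024, 2, 23), "legal_action", "EU files antitrust suit against Apple"),
    TimedEvent(date(2024, 2, 21), "earnings_report", "Nvidia reports record revenue"),
    TimedEvent(date(2024, 2, 23), "layoff", "Microsoft lays off gaming staff"),
]
in_between = events_between(events, "earnings_report", "partnership")
for e in in_between:
    print(e.when.isoformat(), e.event_type)
```

Detecting causal chains needs more than ordering, but chronological order is the precondition: an event cannot have triggered something that precedes it.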

34.2.2.4 Temporal Information Extraction

Temporal information extraction builds on event extraction by focusing specifically on the time dimension: when did events happen, in what order, and how do they relate to each other chronologically? While event extraction captures individual occurrences with their participants and dates, temporal IE constructs coherent timelines that reveal causal sequences, durations, and overlapping events across a document or document collection.

Classical temporal IE relies on specialized tools and annotation standards. TimeML, standardized as ISO-TimeML (ISO 24617-1), is the markup language for temporal expressions, events, and temporal relations in text. Tools like SUTime (Stanford) and HeidelTime normalize temporal expressions ("last Tuesday," "Q3 2024") into ISO 8601-style values, given a reference date such as the document's publication date. Event-anchored expressions such as "two weeks after the merger" additionally require knowing when the anchoring event occurred. These rule-based normalizers are remarkably accurate for well-formed temporal expressions and run at negligible computational cost, making them ideal candidates for the classical side of a hybrid pipeline.
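To show what rule-based normalization involves, here is a toy normalizer covering three patterns: an explicit "Month D, YYYY" date, a quarter, and "last <weekday>" relative to a reference date. It is a sketch of the idea only. It does not use SUTime or HeidelTime, it covers a tiny fraction of what they handle, and its output conventions (an ISO 8601 interval for quarters, "last Tuesday" read as the most recent Tuesday strictly before the reference date) are assumptions made for the example.

```python
import re
from datetime import date, timedelta
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def normalize(expression: str, reference: date) -> Optional[str]:
    """Return an ISO 8601 date or interval, or None if no rule applies."""
    text = expression.strip().lower()

    # Rule 1: explicit date such as "March 15, 2024".
    m = re.fullmatch(r"([a-z]+) (\d{1,2}), (\d{4})", text)
    if m and m.group(1) in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2))).isoformat()

    # Rule 2: quarter such as "Q3 2024", returned as a start/end interval.
    m = re.fullmatch(r"q([1-4]) (\d{4})", text)
    if m:
        quarter, year = int(m.group(1)), int(m.group(2))
        first_month = 3 * (quarter - 1) + 1
        start = date(year, first_month, 1)
        next_start = date(year + 1, 1, 1) if quarter == 4 else date(year, first_month + 3, 1)
        end = next_start - timedelta(days=1)
        return f"{start.isoformat()}/{end.isoformat()}"

    # Rule 3: "last <weekday>", resolved against the reference date.
    m = re.fullmatch(r"last ([a-z]+)", text)
    if m and m.group(1) in WEEKDAYS:
        days_back = (reference.weekday() - WEEKDAYS.index(m.group(1))) % 7 or 7
        return (reference - timedelta(days=days_back)).isoformat()

    return None  # e.g. event-anchored expressions need more than rules


reference_date = date(2024, 3, 15)  # assumed document date (a Friday)
for expr in ["March 15, 2024", "Q3 2024", "last Tuesday", "two weeks after the merger"]:
    print(f"{expr!r:32} -> {normalize(expr, reference_date)}")
```

The last expression returns None, which is the point: it cannot be resolved without knowing the date of the merger, and that is where event extraction or an LLM has to step in.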

LLMs complement these tools by handling the harder aspects of temporal reasoning: resolving ambiguous references ("shortly after," "during the crisis"), inferring temporal order from discourse structure when explicit dates are absent, and constructing narrative timelines that synthesize events scattered across long documents. The following code fragment demonstrates the LLM side of such a pipeline: a single prompt asks the model to normalize dates, label temporal relations, and write a narrative timeline.

# Temporal information extraction: build a timeline from a document
# LLM-only version: the model both normalizes dates and orders events
from openai import OpenAI
import json

client = OpenAI()

def extract_timeline(document: str, domain: str = "general") -> dict:
    """Extract a chronological timeline of events from a document."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a temporal information
            extraction system for {domain} documents. Extract all events and
            temporal expressions, then construct a timeline.
            For each event, provide:
            - "date": ISO 8601 date or date range (use "unknown" if not stated)
            - "event": concise description of what happened
            - "participants": list of entities involved
            - "temporal_relation": relationship to the previous event
            (e.g., "after", "before", "simultaneous", "during", "unrelated")
            - "confidence": float 0.0 to 1.0 for the temporal placement
            Return JSON with "events" (list sorted chronologically) and
            "timeline_summary" (a one-paragraph narrative of the timeline)."""},
            {"role": "user", "content": document},
        ],
        temperature=0.1,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: legal case timeline extraction
legal_doc = """On March 15, 2024, Acme Corp filed a patent infringement
suit against Beta Inc in the Eastern District of Texas. Beta Inc
responded with a motion to dismiss on April 2. Two weeks later, the
court denied the motion and set a discovery deadline for August 30.
During the discovery period, Beta Inc produced 50,000 documents. The
parties attempted mediation in early September but failed to reach a
settlement. Trial is scheduled for January 2025."""

timeline = extract_timeline(legal_doc, domain="legal")
for event in timeline["events"]:
    print(f" {event['date']:>12} {event['event']}")
# Print the narrative summary once, after all events.
print(f"\nSummary: {timeline['timeline_summary']}")

Output:
   2024-03-15 Acme Corp filed patent infringement suit against Beta Inc
   2024-04-02 Beta Inc filed motion to dismiss
   2024-04-16 Court denied motion to dismiss, set discovery deadline
   2024-08-30 Discovery deadline; Beta Inc produced 50,000 documents
   2024-09-01 Mediation attempted, no settlement reached
   2025-01-01 Trial scheduled

Summary: The patent dispute began on March 15, 2024 when Acme Corp sued Beta Inc. After a failed motion to dismiss in April, the court set an August discovery deadline. Following document production and unsuccessful September mediation, the case proceeds to trial in January 2025.

Code Fragment 34.2.7: Temporal information extraction: build a timeline from a document

Temporal IE is critical in several domains. Legal document analysis requires constructing case timelines from filings, depositions, and correspondence to establish sequences of events for litigation. Medical record timelines track patient history (symptoms, diagnoses, treatments, outcomes) across clinical notes that span months or years, enabling clinicians to see the full trajectory of care. News event tracking constructs evolving timelines of developing stories, linking related events across sources and detecting when reported timelines conflict. These applications connect directly to the RAG systems in Chapter 32, where temporal reasoning enables queries like "What happened between the filing and the trial?"

Note

Hybrid temporal extraction in practice: The most robust timeline systems use HeidelTime or SUTime to normalize explicit temporal expressions (fast, deterministic, nearly perfect accuracy on well-formed dates), then pass the document with normalized dates to an LLM for resolving relative references and inferring temporal order from context. This hybrid approach avoids wasting LLM tokens on date parsing while leveraging the LLM's reasoning ability for the genuinely hard parts of temporal IE.
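The hand-off between the two halves can be sketched as an annotation pass: a deterministic step tags every explicit date with its ISO value, and the annotated text becomes the LLM's input. In the sketch below a single regular expression stands in for HeidelTime or SUTime, the `[DATE=...]` tag is a made-up convention rather than TimeML markup, and build_messages only assembles the chat messages without calling any API.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}

# Matches explicit dates like "March 15, 2024"; a stand-in for a real normalizer.
DATE_PATTERN = re.compile(r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b")


def annotate_dates(document: str) -> str:
    """Append an inline [DATE=YYYY-MM-DD] tag after each explicit date."""
    def tag(match: re.Match) -> str:
        month, day, year = match.group(1), int(match.group(2)), int(match.group(3))
        iso = date(year, MONTH_NUMBER[month], day).isoformat()
        return f"{match.group(0)} [DATE={iso}]"
    return DATE_PATTERN.sub(tag, document)


def build_messages(document: str) -> list[dict]:
    """Assemble chat messages; the LLM sees pre-normalized dates."""
    return [
        {"role": "system", "content": (
            "Explicit dates are already normalized in [DATE=...] tags. "
            "Resolve the remaining relative expressions against them and "
            "return the events in chronological order as JSON."
        )},
        {"role": "user", "content": annotate_dates(document)},
    ]


doc = ("On March 15, 2024, Acme Corp filed suit. "
       "Two weeks later, the court denied the motion.")
annotated = annotate_dates(doc)
print(annotated)
```

The relative phrase "Two weeks later" is left untouched on purpose: it is the part the LLM is asked to resolve, now with an unambiguous anchor date next to it.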

34.2.2.5 From Events to Knowledge Graphs

The triples from Open IE and the structured events from event extraction feed naturally into knowledge graphs. Each entity becomes a node, each relation becomes an edge, and each event becomes a hyper-edge connecting multiple entities through their roles. Entity linking (covered in Section 31.5) grounds these entities to canonical identifiers, enabling cross-document queries. For example, a knowledge graph built from financial news might link "MSFT," "Microsoft Corp.," and "the Redmond tech giant" to the same canonical entity, allowing a query like "show all acquisitions by Microsoft in 2024" to aggregate results across hundreds of articles.
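A minimal in-memory sketch of this structure follows, using only the standard library. The ALIASES table is a hypothetical stand-in for a real entity linker, the KnowledgeGraph class is illustrative rather than any library's API, and the 2024 acquisition targets ("ExampleSoft", "SampleWorks") are made-up placeholder names, not real deals.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical alias table standing in for an entity linker.
ALIASES = {
    "msft": "Microsoft",
    "microsoft corp.": "Microsoft",
    "the redmond tech giant": "Microsoft",
}


def canonical(name: str) -> str:
    """Map a surface form to its canonical entity name if an alias is known."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


class KnowledgeGraph:
    def __init__(self) -> None:
        self.edges = defaultdict(set)   # node -> {(relation, node)}
        self.events: list[dict] = []    # hyper-edges: one record links many nodes

    def add_triple(self, subject: str, relation: str, obj: str) -> None:
        self.edges[canonical(subject)].add((relation, canonical(obj)))

    def add_event(self, event_type: str, roles: dict, when: Optional[str] = None) -> None:
        linked = {role: canonical(value) for role, value in roles.items()}
        self.events.append({"type": event_type, "date": when, "roles": linked})

    def find_events(self, event_type: str, role: str, entity: str, year: str) -> list[dict]:
        """Events of one type where `entity` fills `role` in the given year."""
        target = canonical(entity)
        return [
            e for e in self.events
            if e["type"] == event_type
            and e["roles"].get(role) == target
            and (e["date"] or "").startswith(year)
        ]


kg = KnowledgeGraph()
kg.add_triple("Microsoft Corp.", "headquartered_in", "Redmond")
# Placeholder events: three surface forms of the same buyer.
kg.add_event("acquisition", {"buyer": "MSFT", "target": "ExampleSoft"}, "2024-05-01")
kg.add_event("acquisition", {"buyer": "the Redmond tech giant", "target": "SampleWorks"}, "2024-11-12")
kg.add_event("acquisition", {"buyer": "Microsoft Corp.", "target": "CyberShield AI"}, "2025-03-15")

hits = kg.find_events("acquisition", role="buyer", entity="Microsoft", year="2024")
targets = [e["roles"]["target"] for e in hits]
print(targets)
```

Because aliases are resolved at insert time, the query never has to know how any individual article spelled the company's name.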

Exercise 34.2.1: spaCy NER baseline on a news corpus Coding

Take 100 Reuters news articles. Run NER with spacy.load("en_core_web_lg") and extract all PERSON, ORG, and GPE entities. Then hand-annotate a sample of 20 articles, compute precision and recall against your annotations, and report F1 per entity type (expect values around 0.9). Identify two systematic error patterns (e.g., titles like "Dr. X" merged with X, or company-as-person miscategorization).

Answer Sketch

Expected: PERSON F1 around 0.90, ORG F1 around 0.85, GPE F1 around 0.92 on typical news. Common errors include: (1) honorifics ("Dr. Smith" tagged as a single PERSON span that includes "Dr."); (2) ambiguous names such as "Amazon" tagged as ORG even when the text refers to the geographic region; (3) multi-word organizations split into two ORG spans. This baseline shows why a hybrid pipeline routes ambiguous cases to an LLM rather than abandoning spaCy.
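For the scoring step, one simple convention is exact-match evaluation: a predicted entity counts as correct only if its document, character offsets, and label all match a gold annotation. The sketch below assumes entities are represented as (doc_id, start, end, label) tuples; the sample annotations are invented to exercise the functions and say nothing about spaCy's actual accuracy.

```python
def precision_recall_f1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of entity tuples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def per_type_scores(gold: set, predicted: set) -> dict:
    """Score each entity label separately; the label is the last tuple element."""
    labels = {entity[-1] for entity in gold | predicted}
    return {
        label: precision_recall_f1(
            {e for e in gold if e[-1] == label},
            {e for e in predicted if e[-1] == label},
        )
        for label in sorted(labels)
    }


# Invented annotations: (doc_id, start_char, end_char, label).
gold = {
    (1, 0, 10, "ORG"),
    (1, 36, 44, "PERSON"),
    (1, 100, 109, "GPE"),
    (2, 5, 14, "PERSON"),
}
predicted = {
    (1, 0, 10, "ORG"),
    (1, 32, 44, "PERSON"),   # span starts early: honorific included
    (1, 100, 109, "GPE"),
    (2, 5, 14, "PERSON"),
    (2, 30, 36, "ORG"),      # spurious extra ORG
}
scores = per_type_scores(gold, predicted)
for label, (p, r, f1) in scores.items():
    print(f"{label:7s} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Exact matching is strict: the honorific case counts as both a false positive and a false negative, which is one reason boundary errors depress F1 more than their severity suggests.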

Exercise 34.2.2: Open IE triple extraction Coding

Run Stanford OpenIE (via the openie Python wrapper) on the sentence "After acquiring Whole Foods in 2017, Amazon launched a private-label grocery line." Compare the extracted triples against what GPT-4 produces for the same sentence. Identify which system finds the temporal qualifier "in 2017" and explain why.

Answer Sketch

Stanford OpenIE typically produces (Amazon, acquired, Whole Foods) and (Amazon, launched, private-label grocery line), missing the temporal qualifier. GPT-4 typically returns quadruples with the date attached. Stanford OpenIE shortens clauses to produce compact triples, which tends to drop modifiers such as temporal phrases; an LLM is not bound to a fixed triple format and can attach the date as a fourth field. This is the canonical case for the hybrid pipeline that follows in Section 34.3.

What's Next?

In the next section, Section 34.3: Hybrid IE Architectures with LLMs, we build on the material covered here.

Further Reading

Classical Methods

Banko, M., et al. (2007). "Open Information Extraction from the Web." IJCAI 2007. ijcai.org/Proceedings/07/Papers/429.pdf. The original Open IE paper.

Mausam (2016). "Open Information Extraction Systems and Downstream Applications." IJCAI 2016. ijcai.org/Proceedings/16/Papers/653.pdf. Survey of post-2007 Open IE systems including OpenIE 5.

Modern Methods

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). "Neural Architectures for Named Entity Recognition." NAACL 2016. arXiv:1603.01360. The BiLSTM-CRF NER architecture that dominated 2016-2019; the baseline against which LLM IE is measured.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. arXiv:1810.04805. BERT defined the modern transformer-based NER baseline.

Tools

spaCy (2024). "spaCy v3 Industrial-Strength NLP." spacy.io. The reference production NLP library; defines the production-NER baseline.

Stanza (Stanford NLP Group, 2024). "Stanza Python NLP Library." stanfordnlp.github.io/stanza. Reference research-grade NLP toolkit with strong NER models.