Section I.3: Information Extraction

Structured Data Extraction (JSON Output)

Extract specific fields from unstructured text and return them as JSON. Ideal for processing invoices, resumes, product listings, or any document with recurring structure.
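
Whatever prompt is used, the reply arrives as a plain string that still has to be parsed and checked before other code can rely on it. The sketch below shows one way to do that with only the Python standard library, assuming the five fields requested by the user template in this section. The names `EXPECTED_FIELDS` and `parse_extraction`, and the hand-written sample reply, are illustrative assumptions and not part of any API.

```python
import json

# Field names and types mirror the user template in this section. None is
# accepted for every field because the system message asks for null when a
# value is not present in the text.
EXPECTED_FIELDS = {
    "company_name": str,
    "contact_email": str,
    "annual_revenue": str,
    "employee_count": int,
    "founding_year": int,
}


def parse_extraction(raw_reply):
    """Parse a model reply and check it against EXPECTED_FIELDS.

    Raises ValueError if the reply is not a JSON object, if its keys differ
    from the expected set, or if a non-null value has the wrong type.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected field set: {sorted(data)}")
    for name, expected_type in EXPECTED_FIELDS.items():
        value = data[name]
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{name} should be {expected_type.__name__} or null")
    return data


# Hypothetical reply written by hand for illustration (not real model output).
sample_reply = (
    '{"company_name": "Acme Corp", "contact_email": null, '
    '"annual_revenue": null, "employee_count": 150, "founding_year": 2019}'
)
record = parse_extraction(sample_reply)
print(record["company_name"], record["employee_count"])
```

A check like this catches malformed or incomplete replies early. It cannot tell whether an extracted value is actually correct, only that it has the expected shape.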

System Message

// JSON Data Extraction system template
// Replace {{placeholders}} with your actual values before sending
You are a data extraction assistant. Extract the requested fields from the given text and return them as a JSON object. If a field is not found in the text, set its value to null. Do not invent or infer information that is not explicitly stated.

Return ONLY the JSON object with no additional text.

Code Fragment I.3.1: Instructs the model to extract structured fields from unstructured text and return them as valid JSON, enabling direct programmatic use.

User Message

// JSON Data Extraction user template
// Replace {{placeholders}} with your actual values before sending
Extract the following fields from the text below:
- company_name (string)
- contact_email (string)
- annual_revenue (string, include currency)
- employee_count (integer)
- founding_year (integer)

Text:
"""
{{input_text}}
"""

Code Fragment I.3.2: Supplies the raw text from which the model must extract named fields, keeping the extraction task clearly separated from instructions.

Tip

When using OpenAI or Anthropic APIs, enable JSON mode or structured output to guarantee valid JSON. For open-source models, add "You MUST respond with valid JSON only" and consider using constrained decoding (e.g., via Outlines or Guidance).

Outlines in Practice

Use Outlines to guarantee valid JSON output from an open-source model via constrained decoding.

# pip install outlines  (this backend also needs transformers and torch)
# Note: this uses the Outlines 0.x interface (outlines.models / outlines.generate).
# Outlines 1.0 reorganized the API, so pin an earlier version or adapt the calls.
import outlines
from pydantic import BaseModel

# Target schema: decoding is constrained so the output always parses into this model
class Company(BaseModel):
    company_name: str
    employee_count: int
    founding_year: int

# Load a Hugging Face model through the transformers backend
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Build a generator whose output is restricted to JSON matching Company
generator = outlines.generate.json(model, Company)
result = generator("Extract: Acme Corp, founded 2019, 150 employees.")
print(result)  # Company(company_name='Acme Corp', ...)

Named Entity Extraction

Identify and classify entities (people, organizations, locations, dates, etc.) in text.

System Message

// Named Entity Extraction system template
// Replace {{placeholders}} with your actual values before sending
You are a named entity recognition system. For the given text, identify all named entities and classify them. Return a JSON array of objects, each with "text", "type", and "start_index" fields.

Entity types: PERSON, ORGANIZATION, LOCATION, DATE, MONEY, PRODUCT.

Return ONLY the JSON array.

Code Fragment I.3.3: Configures the model as an NER engine, requesting entity spans with their types in a structured output format.

User Message

// Named Entity Extraction user template
// Replace {{placeholders}} with your actual values before sending
Extract entities from:

"""
{{input_text}}
"""

Code Fragment I.3.4: Passes the input text for entity extraction, using delimiters to prevent the model from treating instructions as extractable content.
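
The entity format requested in this section includes a `start_index` character offset, and offsets reported by a language model can be unreliable. The sketch below is one possible post-processing step, not a prescribed method: it parses the returned array, drops entries with an unknown type or a span that does not occur in the source text, and recomputes any offset that does not match. The helper name `check_entities` and the sample reply are illustrative assumptions.

```python
import json

# Entity types listed in the NER system template.
ENTITY_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "DATE", "MONEY", "PRODUCT"}


def check_entities(source_text, raw_reply):
    """Return the entities from raw_reply that can be located in source_text.

    Entries with a wrong start_index get the offset of the first occurrence
    of their text; entries whose text is absent are dropped.
    """
    entities = json.loads(raw_reply)
    if not isinstance(entities, list):
        raise ValueError("reply must be a JSON array")
    checked = []
    for ent in entities:
        if not isinstance(ent, dict):
            continue
        text = ent.get("text")
        etype = ent.get("type")
        if not isinstance(text, str) or not text:
            continue
        if not isinstance(etype, str) or etype not in ENTITY_TYPES:
            continue
        start = ent.get("start_index")
        offset_ok = (
            isinstance(start, int)
            and not isinstance(start, bool)
            and start >= 0
            and source_text.startswith(text, start)
        )
        if not offset_ok:
            # Fall back to the first occurrence; -1 means the span is absent.
            start = source_text.find(text)
        if start == -1:
            continue
        checked.append({"text": text, "type": etype, "start_index": start})
    return checked


source = "Tim Cook announced the iPhone in Cupertino."
# Hypothetical reply written by hand: one correct entity, one with a wrong
# offset, and one whose text does not appear in the source at all.
reply = json.dumps([
    {"text": "Tim Cook", "type": "PERSON", "start_index": 0},
    {"text": "iPhone", "type": "PRODUCT", "start_index": 20},
    {"text": "Paris", "type": "LOCATION", "start_index": 5},
])
entities = check_entities(source, reply)
for ent in entities:
    print(ent)
```

Because the fallback uses the first occurrence of the span, a repeated mention may be mapped to the wrong position. Texts with duplicate entities would need a more careful alignment step.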