Building Conversational AI with LLMs and Agents
Appendix O: LlamaIndex: Data Indexing and Query Engines

Data Connectors and Document Loading

Big Picture

Before you can query your data with an LLM, you need to ingest it. LlamaIndex provides a rich ecosystem of data connectors that load documents from files, APIs, databases, and SaaS platforms into a unified Document abstraction. This section covers the core loading primitives, the document and node data model, metadata extraction, and the LlamaHub connector ecosystem.

O.1.1 The Ingestion Pipeline at a Glance

Every RAG (Retrieval-Augmented Generation) application begins with a simple question: how do I get my data into a format the LLM can work with? LlamaIndex answers this with a three-stage ingestion pipeline. First, connectors load raw content into Document objects. Second, transformations (chunking, metadata extraction, embedding) convert documents into TextNode objects. Third, nodes are inserted into an index for retrieval. This section focuses on the first two stages.

Figure O.1.1: The LlamaIndex ingestion pipeline. Data sources are read by connectors, producing Document objects that are transformed into TextNodes and stored in an index.
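The three stages can be sketched in plain Python. The classes and functions below are illustrative stand-ins for LlamaIndex's actual implementation, reduced to the bare data flow:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class TextNode:
    text: str
    metadata: dict = field(default_factory=dict)

def load(raw_sources: list[str]) -> list[Document]:
    # Stage 1: a connector wraps each raw source in a Document
    return [Document(text=s, metadata={"source_index": i})
            for i, s in enumerate(raw_sources)]

def transform(docs: list[Document], chunk_size: int = 20) -> list[TextNode]:
    # Stage 2: split each Document into fixed-size TextNode chunks,
    # carrying the parent document's metadata along
    nodes = []
    for doc in docs:
        for start in range(0, len(doc.text), chunk_size):
            nodes.append(TextNode(text=doc.text[start:start + chunk_size],
                                  metadata=dict(doc.metadata)))
    return nodes

def index(nodes: list[TextNode]) -> dict[int, TextNode]:
    # Stage 3: insert nodes into a retrievable store keyed by node id
    return {i: node for i, node in enumerate(nodes)}

store = index(transform(load(["alpha " * 10, "beta " * 5])))
print(len(store))  # 5 nodes: 3 chunks from the first source, 2 from the second
```

Real pipelines replace each stage with richer machinery (parsers, extractors, embeddings, vector stores), but the Document-to-node-to-index shape stays the same.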

O.1.2 SimpleDirectoryReader

The most common starting point is SimpleDirectoryReader, which recursively loads files from a local directory. It auto-detects file types (PDF, DOCX, TXT, CSV, Markdown, images, and more) and delegates to the appropriate parser. For most prototypes and small-to-medium corpora, this single class is all you need.

from llama_index.core import SimpleDirectoryReader

# Load all supported files from a directory
documents = SimpleDirectoryReader(
    input_dir="./data/company_docs",
    recursive=True,               # traverse subdirectories
    required_exts=[".pdf", ".md", ".txt"],  # optional filter
).load_data()

print(f"Loaded {len(documents)} documents")
print(f"First doc preview: {documents[0].text[:200]}")
print(f"Metadata: {documents[0].metadata}")
Loaded 42 documents
First doc preview: Introduction to Machine Learning. Machine learning is a subset of...
Metadata: {'file_name': 'ml_intro.pdf', 'page': 0}

Each loaded file becomes one or more Document objects. The reader automatically populates metadata fields such as file_name, file_path, file_type, and creation_date. You can also supply your own metadata via the file_metadata callback parameter.

from llama_index.core import SimpleDirectoryReader

def custom_metadata(file_path: str) -> dict:
    """Attach department labels based on folder structure."""
    if "engineering" in file_path:
        return {"department": "engineering", "access_level": "internal"}
    elif "legal" in file_path:
        return {"department": "legal", "access_level": "restricted"}
    return {"department": "general", "access_level": "public"}

documents = SimpleDirectoryReader(
    input_dir="./data",
    file_metadata=custom_metadata,
).load_data()

# Verify custom metadata was attached
for doc in documents[:3]:
    print(doc.metadata["department"], doc.metadata["file_name"])
Tip

SimpleDirectoryReader also accepts an input_files parameter if you want to load a specific list of files rather than an entire directory. This is handy when you need fine-grained control over which documents enter your pipeline.

O.1.3 LlamaParse: Structured Document Parsing

PDFs with tables, charts, and complex layouts are notoriously difficult to parse. LlamaParse is LlamaIndex's cloud-based document parser that uses vision models to extract structured content from PDFs, PowerPoint files, and scanned documents. It returns clean Markdown with tables preserved, making it ideal for financial reports, academic papers, and technical manuals.

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# Configure LlamaParse as the PDF parser
parser = LlamaParse(
    api_key="llx-...",            # or set LLAMA_CLOUD_API_KEY env var
    result_type="markdown",        # "markdown" or "text"
    num_workers=4,                 # parallel parsing
    verbose=True,
)

# Use it as a file extractor within SimpleDirectoryReader
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_dir="./data/financial_reports",
    file_extractor=file_extractor,
).load_data()

# Tables and structure are preserved as Markdown
print(documents[0].text[:500])
Warning

LlamaParse is a cloud service that sends your documents to LlamaIndex's API for processing. For sensitive or regulated data, verify that this complies with your organization's data governance policies. You receive a free tier of 1,000 pages per day; beyond that, usage is metered.

O.1.4 LlamaHub Connectors

Real-world data lives in many places beyond the local filesystem. The LlamaHub registry provides hundreds of community-built connectors (called "readers") for SaaS platforms, databases, and APIs. Each connector implements the same load_data() interface, so switching data sources requires changing only the reader class.

The following example demonstrates loading data from a Notion workspace and a PostgreSQL database.

from llama_index.readers.notion import NotionPageReader
from llama_index.readers.database import DatabaseReader

# --- Notion ---
notion_reader = NotionPageReader(integration_token="secret_...")
notion_docs = notion_reader.load_data(
    page_ids=["abc123", "def456"]
)

# --- PostgreSQL ---
db_reader = DatabaseReader(
    uri="postgresql://user:pass@localhost:5432/mydb"
)
db_docs = db_reader.load_data(
    query="SELECT title, content, created_at FROM articles WHERE published = true"
)

# Combine sources into a single corpus
all_documents = notion_docs + db_docs
print(f"Total documents: {len(all_documents)}")

Other popular LlamaHub connectors include readers for Slack, Google Drive, Confluence, GitHub repositories, arXiv papers, Wikipedia, and S3 buckets. You can install them individually (e.g., pip install llama-index-readers-notion) or browse the full registry at llamahub.ai.
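The value of the shared load_data() contract is that ingestion code never needs to know which source it is reading from. The plain-Python sketch below illustrates this; the two in-memory readers are hypothetical stand-ins for real connectors such as NotionPageReader or DatabaseReader:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

class Reader(Protocol):
    """The common contract every connector satisfies."""
    def load_data(self) -> list[Document]: ...

class InMemoryWikiReader:
    """Hypothetical stand-in for a wiki-style connector."""
    def __init__(self, pages: dict[str, str]):
        self.pages = pages
    def load_data(self) -> list[Document]:
        return [Document(text=body, metadata={"source": "wiki", "title": title})
                for title, body in self.pages.items()]

class InMemoryDbReader:
    """Hypothetical stand-in for a database connector."""
    def __init__(self, rows: list[tuple[str, str]]):
        self.rows = rows
    def load_data(self) -> list[Document]:
        return [Document(text=content, metadata={"source": "db", "title": title})
                for title, content in self.rows]

def build_corpus(readers: list[Reader]) -> list[Document]:
    # Because every reader exposes the same load_data() method,
    # adding or swapping sources does not change this loop.
    return [doc for r in readers for doc in r.load_data()]

corpus = build_corpus([
    InMemoryWikiReader({"RAG": "Retrieval-augmented generation combines..."}),
    InMemoryDbReader([("Intro", "Machine learning is a subset of AI.")]),
])
print(len(corpus))  # 2
```

Tagging each document with a "source" metadata key at load time, as above, is also what makes source-level filtering and citations possible later.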

O.1.5 The Document and TextNode Data Model

Understanding the internal data model is essential for customizing your pipeline. LlamaIndex represents all ingested content as a hierarchy of schema objects. At the top level, a Document holds the full content of a loaded source. During indexing, documents are split into smaller TextNode objects (chunks). Both classes inherit from BaseNode and share a common interface.

from llama_index.core.schema import Document, NodeRelationship, TextNode

# Create a Document manually
doc = Document(
    text="LlamaIndex is a data framework for LLM applications.",
    metadata={"source": "manual", "category": "overview"},
    doc_id="doc-001",
)

# Create TextNodes (chunks) manually
node1 = TextNode(
    text="LlamaIndex is a data framework",
    metadata={"source": "manual", "chunk_index": 0},
)
node2 = TextNode(
    text="for LLM applications.",
    metadata={"source": "manual", "chunk_index": 1},
)

# Establish parent-child and sequential relationships
# (relationship keys are NodeRelationship enum members, not strings)
node1.relationships = {
    NodeRelationship.PARENT: doc.as_related_node_info(),
    NodeRelationship.NEXT: node2.as_related_node_info(),
}
node2.relationships = {
    NodeRelationship.PARENT: doc.as_related_node_info(),
    NodeRelationship.PREVIOUS: node1.as_related_node_info(),
}

print(f"Document ID: {doc.doc_id}")
print(f"Node 1 ID:   {node1.node_id}")
print(f"Node 1 parent: {node1.relationships[NodeRelationship.PARENT].node_id}")
Note

Every node has a unique node_id (auto-generated UUID) and can carry arbitrary metadata as a dictionary. Metadata is propagated to the vector store during indexing, enabling metadata filtering at query time (covered in Section O.4).
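What metadata filtering means in practice can be previewed with a toy filter in plain Python. Plain dicts stand in for nodes here; LlamaIndex's actual filter classes are covered in Section O.4:

```python
# Toy nodes: each carries a node_id, text, and a metadata dict,
# mirroring what a vector store holds after indexing
nodes = [
    {"node_id": "n1", "text": "Contract terms...", "metadata": {"department": "legal"}},
    {"node_id": "n2", "text": "API design...",     "metadata": {"department": "engineering"}},
    {"node_id": "n3", "text": "NDA template...",   "metadata": {"department": "legal"}},
]

def filter_nodes(nodes: list[dict], **required) -> list[dict]:
    """Keep only nodes whose metadata matches every required key/value."""
    return [n for n in nodes
            if all(n["metadata"].get(k) == v for k, v in required.items())]

legal_only = filter_nodes(nodes, department="legal")
print([n["node_id"] for n in legal_only])  # ['n1', 'n3']
```

At query time a vector store applies exactly this kind of predicate before (or alongside) similarity search, which is why attaching metadata during ingestion pays off later.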

O.1.6 Metadata Extraction and Transformations

Raw documents often lack the metadata that makes retrieval precise. LlamaIndex provides metadata extractors that use an LLM to automatically generate titles, summaries, keywords, and question-answer pairs for each node. These extractors plug into the IngestionPipeline, which chains together transformations in sequence.

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    TitleExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core import SimpleDirectoryReader

# Load raw documents
documents = SimpleDirectoryReader("./data").load_data()

# Build a transformation pipeline
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=3),          # infer title from first N nodes
        SummaryExtractor(summaries=["self"]),  # one-sentence summary
        QuestionsAnsweredExtractor(questions=3),  # 3 questions per node
    ]
)

# Run the pipeline
nodes = pipeline.run(documents=documents)

# Inspect enriched metadata
sample = nodes[0]
print("Title:", sample.metadata.get("document_title"))
print("Summary:", sample.metadata.get("section_summary"))
print("Questions:", sample.metadata.get("questions_this_excerpt_can_answer"))

The IngestionPipeline also supports caching via a docstore parameter. When you re-run the pipeline on an updated corpus, only new or changed documents are processed, saving both time and LLM API costs.
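The skip-unchanged behavior boils down to content hashing. The DocstoreCache class below is a hypothetical, simplified stand-in for the real docstore-backed deduplication, shown only to illustrate the idea:

```python
import hashlib

def doc_hash(text: str) -> str:
    """Fingerprint a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class DocstoreCache:
    """Toy cache: report only documents that are new or changed."""
    def __init__(self):
        self.seen: dict[str, str] = {}  # doc_id -> content hash

    def docs_to_process(self, docs: dict[str, str]) -> list[str]:
        changed = []
        for doc_id, text in docs.items():
            h = doc_hash(text)
            if self.seen.get(doc_id) != h:  # new doc or modified content
                changed.append(doc_id)
                self.seen[doc_id] = h
        return changed

cache = DocstoreCache()
first_run = cache.docs_to_process({"a": "version 1", "b": "version 1"})
second_run = cache.docs_to_process({"a": "version 2", "b": "version 1"})
print(first_run)   # ['a', 'b']  -- everything is new on the first run
print(second_run)  # ['a']       -- only 'a' changed, so 'b' is skipped
```

On a large corpus this is the difference between re-embedding everything and paying only for the delta, which is why wiring up the docstore is worth doing early.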

Tip

For production systems, always attach at least a file_name and source_url to each document's metadata. This enables your application to show citations and source links in responses, which is critical for user trust and auditability.

Choosing a Chunking Strategy

The SentenceSplitter used above is the default node parser. It splits text at sentence boundaries and respects a configurable chunk_size (in tokens) and chunk_overlap. LlamaIndex also provides alternative splitters for code (CodeSplitter), Markdown (MarkdownNodeParser), and semantic boundaries (SemanticSplitterNodeParser, which uses embedding similarity to find natural break points). The choice of splitter has a significant impact on retrieval quality; we revisit this topic in Section O.4 when discussing advanced retrieval strategies.
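To make the chunk_size/chunk_overlap interaction concrete, here is a toy sentence-aware splitter in plain Python. It measures chunk_size in characters and overlap in whole sentences, whereas the real SentenceSplitter counts tokens and has more elaborate fallback logic:

```python
import re

def sentence_split(text: str, chunk_size: int = 100, chunk_overlap: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size
    characters, repeating the last chunk_overlap sentences at each break."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        if current and sum(len(s) for s in current) + len(sent) > chunk_size:
            chunks.append(" ".join(current))
            # carry the trailing sentences forward as overlap
            current = current[-chunk_overlap:] if chunk_overlap else []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Machine learning is a subset of AI. Neural networks are one approach. "
        "Transformers dominate NLP today. Chunking affects retrieval quality.")
chunks = sentence_split(text, chunk_size=80)
print(len(chunks))  # 3 chunks, each sharing one sentence with its neighbor
```

Overlap trades storage for context: each boundary sentence appears in two chunks, so a retrieved chunk is less likely to start or end mid-thought.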

Exercise O.1

Build a multi-source ingestion pipeline. Load documents from at least two different sources (e.g., a local PDF directory and a Wikipedia reader). Attach custom metadata that identifies the source type. Run the documents through an IngestionPipeline with a SentenceSplitter and one metadata extractor. Verify that the resulting nodes contain both your custom metadata and the LLM-generated metadata.