Building Conversational AI with LLMs and Agents
Appendix O: LlamaIndex: Data Indexing and Query Engines

Query Engines and Response Synthesis

Big Picture

A query engine is the interface between a user's question and the LLM-powered answer. It orchestrates retrieval, post-processing, and response synthesis into a single .query() call. This section covers the query engine API, response synthesis modes (refine, compact, tree_summarize, and others), streaming, metadata filters, node postprocessors, and the citation query engine that automatically attributes sources.

O.3.1 The Query Engine API

Every LlamaIndex index exposes a .as_query_engine() method that returns a ready-to-use query engine. Under the hood, this composes a retriever (which fetches relevant nodes from the index) with a response synthesizer (which feeds those nodes to the LLM and produces the final answer). You can customize both components independently or use the convenience method with keyword arguments.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Quick setup: configure via keyword arguments
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
    streaming=False,
)

response = query_engine.query("What are the main benefits of microservices?")
print(response)

# Access source nodes for transparency
for node in response.source_nodes:
    print(f"  Score: {node.score:.3f} | {node.text[:80]}...")

The response.source_nodes attribute provides access to every retrieved chunk along with its relevance score. This is invaluable for debugging retrieval quality and for building UIs that show the user which sources informed the answer.

O.3.2 Response Synthesis Modes

The response_mode parameter controls how the LLM combines retrieved nodes into a final response. LlamaIndex offers several strategies, each with different trade-offs between quality, latency, and token cost.

Refine

The refine mode processes nodes one at a time. It generates an initial answer from the first node, then iteratively refines that answer by incorporating each subsequent node. This produces thorough, well-synthesized responses but requires one LLM call per node.
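The iterative shape of this strategy can be sketched in plain Python. This is a conceptual simplification, not LlamaIndex internals: fake_llm stands in for a real LLM call, and the prompt wording is illustrative.

```python
call_count = 0

def fake_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; counts calls so the cost is visible.
    global call_count
    call_count += 1
    return f"answer-v{call_count}"

def refine(question: str, nodes: list[str]) -> str:
    # The first node seeds an initial answer...
    answer = fake_llm(f"Question: {question}\nContext: {nodes[0]}")
    # ...and each subsequent node triggers another call that refines it.
    for node in nodes[1:]:
        answer = fake_llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {node}\n"
            "Refine the existing answer using the new context."
        )
    return answer

result = refine("What is X?", ["chunk A", "chunk B", "chunk C"])
print(result, "| LLM calls:", call_count)  # 3 nodes -> 3 LLM calls
```

The one-call-per-node structure is what makes refine thorough but expensive: cost scales linearly with similarity_top_k.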

Compact

The compact mode (the default) stuffs as many nodes as possible into a single LLM prompt, up to the context window limit. If all nodes fit, it requires only one LLM call. If they do not fit, it falls back to the refine strategy for the overflow. This is the best balance of quality and efficiency for most applications.
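The packing step can be sketched as a greedy batching loop. This is a simplification under assumptions: real implementations count tokens with the model's tokenizer (word count is used here), and the actual logic lives inside LlamaIndex's response synthesizer.

```python
def pack_nodes(nodes: list[str], budget_tokens: int) -> list[list[str]]:
    # Greedily fill each batch up to the budget; each batch = one LLM call.
    batches, current, used = [], [], 0
    for node in nodes:
        size = len(node.split())  # crude stand-in for a token count
        if current and used + size > budget_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(node)
        used += size
    if current:
        batches.append(current)
    return batches

nodes = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(pack_nodes(nodes, budget_tokens=5))
# -> [['alpha beta gamma', 'delta epsilon'], ['zeta eta theta iota']]
```

When everything fits in the first batch, the cost collapses to a single LLM call; only the overflow batches incur the refine-style extra calls.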

Tree Summarize

The tree_summarize mode builds a bottom-up tree of summaries. It groups nodes into batches, summarizes each batch, then summarizes the summaries until a single response remains. This is ideal for producing concise answers from very large retrieved sets.
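The bottom-up recursion can be sketched as follows. Again a conceptual model rather than LlamaIndex internals: summarize stands in for an LLM summarization call, and batch_size is an illustrative parameter.

```python
def summarize(texts: list[str]) -> str:
    # Placeholder for an LLM call that condenses several texts into one.
    return "summary(" + " + ".join(texts) + ")"

def tree_summarize(nodes: list[str], batch_size: int = 2) -> str:
    level = nodes
    # Summarize batches, then batches of summaries, until one remains.
    while len(level) > 1:
        level = [
            summarize(level[i : i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
    return level[0]

print(tree_summarize(["n1", "n2", "n3", "n4"]))
# -> summary(summary(n1 + n2) + summary(n3 + n4))
```

For N nodes and batch size b, the tree needs roughly N/b + N/b² + ... LLM calls, but each call sees a bounded amount of text, which is why this mode handles very large retrieved sets gracefully.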

Simple and Accumulate

The simple_summarize mode truncates all nodes to fit in a single prompt (discarding overflow). The accumulate mode generates a separate response for each node and concatenates them. These are niche strategies for specific workflows.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Compare response modes on the same query
for mode in ["compact", "refine", "tree_summarize", "accumulate"]:
    engine = index.as_query_engine(
        response_mode=mode,
        similarity_top_k=5,
    )
    response = engine.query("What are the security best practices?")
    print(f"\n--- {mode.upper()} ---")
    print(str(response)[:300])
Tip

For chat-style applications where latency matters, start with compact. For analytical reports where thoroughness matters more than speed, use refine or tree_summarize. The accumulate mode is useful when you want to preserve each source's contribution as a distinct paragraph.

O.3.3 Streaming Responses

For interactive applications, waiting for the full response before displaying anything creates a poor user experience. LlamaIndex supports streaming, where tokens are yielded as they are generated by the LLM. Streaming works with the main response modes and with any LLM provider whose API supports token streaming.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Enable streaming
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Explain the CAP theorem.")

# Print tokens as they arrive
for text_chunk in streaming_response.response_gen:
    print(text_chunk, end="", flush=True)
print()  # newline after stream completes

The streaming response object also provides a .get_response() method that blocks until the full response is available, which is useful when you need to log the complete answer after streaming it to the user.
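The stream-then-log pattern itself is worth internalizing, independent of any framework. A minimal sketch, with a plain generator standing in for response_gen and the token strings chosen for illustration:

```python
def token_stream():
    # Stand-in for streaming_response.response_gen.
    yield from ["The ", "CAP ", "theorem ", "states..."]

chunks = []
for chunk in token_stream():
    print(chunk, end="", flush=True)  # display each token immediately
    chunks.append(chunk)              # retain it for post-hoc logging
print()

full_response = "".join(chunks)
# full_response now holds the complete answer for logging or evaluation
```

Accumulating while streaming avoids a second LLM call (or a blocking wait) just to capture the answer for your logs.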

O.3.4 Node Postprocessors

Between retrieval and synthesis, node postprocessors transform, filter, or reorder the retrieved nodes. LlamaIndex ships with several built-in postprocessors, and you can create custom ones by implementing the BaseNodePostprocessor interface.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor,
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Chain postprocessors: filter low-score nodes, then require keywords
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),
        KeywordNodePostprocessor(
            required_keywords=["security", "authentication"],
            exclude_keywords=["deprecated"],
        ),
    ],
)

response = query_engine.query("How do I set up API authentication?")
print(f"Nodes after filtering: {len(response.source_nodes)}")
print(response)

The SimilarityPostprocessor drops nodes below a score threshold, which is useful for preventing the LLM from being distracted by marginally relevant chunks. The KeywordNodePostprocessor enforces that certain terms must (or must not) appear in the retrieved text. Other built-in postprocessors include LongContextReorder (which places the most relevant nodes at the start and end of the context, following the "lost in the middle" research finding) and MetadataReplacementPostProcessor.
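One way to implement the "lost in the middle" reordering is to alternate the top-ranked nodes between the front and the back of the list, so the weakest nodes land in the middle of the context. A sketch of that idea (not LongContextReorder's exact code):

```python
def long_context_reorder(scored_nodes: list[tuple[str, float]]) -> list[tuple[str, float]]:
    # Rank by score, then deal nodes alternately to the front and back.
    ranked = sorted(scored_nodes, key=lambda n: n[1], reverse=True)
    front, back = [], []
    for i, node in enumerate(ranked):
        (front if i % 2 == 0 else back).append(node)
    return front + back[::-1]

nodes = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
print(long_context_reorder(nodes))
# -> [('a', 0.9), ('c', 0.7), ('d', 0.6), ('b', 0.8)]
```

The two strongest nodes ("a" and "b") end up at the edges of the context, where LLMs attend most reliably.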

O.3.5 Metadata Filters at Query Time

If your nodes carry metadata (such as department, date, or document type), you can apply metadata filters at query time to restrict retrieval to a specific subset. This is equivalent to a SQL WHERE clause applied before the similarity search.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Filter to only engineering documents from 2024 or later
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="department",
            value="engineering",
            operator=FilterOperator.EQ,
        ),
        MetadataFilter(
            key="year",
            value=2024,
            operator=FilterOperator.GTE,
        ),
    ]
)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)
response = query_engine.query("What were the main architecture decisions?")
print(response)
Note

Metadata filters are pushed down to the vector store when using external databases like Pinecone or Qdrant, meaning filtering happens before the similarity search. This is far more efficient than filtering after retrieval. The default in-memory store also supports metadata filtering, but without the same performance optimizations.
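The filter-then-rank semantics can be made concrete with a toy in-memory sketch. The records, fields, and precomputed scores below are invented for illustration; a real vector store derives scores from embeddings and applies the predicate inside the database.

```python
records = [
    {"text": "service mesh design", "department": "engineering", "year": 2024, "score": 0.91},
    {"text": "holiday calendar",    "department": "hr",          "year": 2024, "score": 0.88},
    {"text": "legacy monolith doc", "department": "engineering", "year": 2019, "score": 0.85},
]

def filtered_search(records: list[dict], top_k: int = 5) -> list[dict]:
    # Apply the metadata predicate FIRST (the "pushdown")...
    eligible = [
        r for r in records
        if r["department"] == "engineering" and r["year"] >= 2024
    ]
    # ...then rank only the survivors by similarity score.
    return sorted(eligible, key=lambda r: r["score"], reverse=True)[:top_k]

print([r["text"] for r in filtered_search(records)])
# -> ['service mesh design']
```

Filtering first means ineligible documents never compete for the top_k slots, whereas filtering after retrieval can silently return fewer than top_k usable nodes.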

O.3.6 Citation Query Engine

In many applications, users need to verify the LLM's claims against original sources. The CitationQueryEngine automatically inserts numbered citations into the response text and provides a mapping from each citation number to its source node. This makes it straightforward to build UIs with clickable source references.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import CitationQueryEngine

documents = SimpleDirectoryReader("./data/policies").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a citation-aware query engine
citation_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=5,
    citation_chunk_size=512,
)

response = citation_engine.query("What is the company's remote work policy?")

# The response text contains inline citations like [1], [2], etc.
print("Answer:", response.response)
print("\nSources:")
for i, node in enumerate(response.source_nodes, 1):
    print(f"  [{i}] {node.node.metadata.get('file_name', 'unknown')}")
    print(f"      {node.node.text[:100]}...")
Tip

Always expose source citations in user-facing RAG applications. Citations build trust, enable fact-checking, and help users navigate to the full source material. The CitationQueryEngine handles citation injection automatically, but you can also build custom citation logic using node postprocessors and prompt templates.
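The core of custom citation logic is simply numbering the chunks in the prompt and instructing the model to cite by number. A minimal sketch of that prompt-construction step; the wording and numbering scheme are illustrative, not the CitationQueryEngine's exact internals:

```python
def build_cited_context(chunks: list[str]) -> str:
    # Prefix each chunk with its citation marker [1], [2], ...
    numbered = [f"[{i}] {text}" for i, text in enumerate(chunks, 1)]
    context = "\n".join(numbered)
    # Instruct the model to cite only from the numbered sources.
    return (
        "Answer using ONLY the sources below. "
        "Cite sources inline as [1], [2], etc.\n\n" + context
    )

chunks = [
    "Remote work is allowed 3 days/week.",
    "Equipment is reimbursed.",
]
print(build_cited_context(chunks))
```

The same index-to-chunk mapping used to build the prompt then resolves each [n] in the model's answer back to its source node for display.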

Building a Custom Query Engine

For advanced use cases, you can compose a query engine manually from a retriever and a response synthesizer. This gives you full control over every stage of the pipeline.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Build components separately
retriever = VectorIndexRetriever(index=index, similarity_top_k=8)
synthesizer = get_response_synthesizer(response_mode="tree_summarize")
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.65)

# Compose into a custom query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[postprocessor],
)

response = query_engine.query("Summarize the onboarding process.")
print(response)
The onboarding process consists of three phases: orientation (days 1-3), training (weeks 1-2), and mentorship (weeks 3-4). New hires complete compliance modules, shadow senior team members, and receive a 30-day performance checkpoint. The process concludes with a manager review...
Exercise O.3

Build a citation-powered Q&A engine. Load a corpus of at least 10 documents with distinct filenames. Build a CitationQueryEngine and ask five questions that span multiple source documents. Verify that each response contains inline citations and that the cited sources are correct. Then add a SimilarityPostprocessor with a cutoff of 0.75 and observe how it affects the number and quality of citations.