Simple top-k retrieval works well for straightforward questions, but real-world queries are often complex: they may span multiple topics, require information from different data sources, or benefit from combining keyword and semantic search. This section covers LlamaIndex's advanced retrieval patterns, including RouterQueryEngine for intelligent routing, SubQuestionQueryEngine for decomposing complex queries, QueryFusionRetriever for multi-strategy retrieval, hybrid search, reranking, and auto-retrieval with metadata filters.
O.4.1 RouterQueryEngine
When your application has multiple indexes (for example, one for product documentation and
another for customer support tickets), you need a way to route each query to the right index.
The RouterQueryEngine uses an LLM to examine the query and select the most
appropriate query engine from a set of candidates. Each candidate is described with a natural
language summary so the router can make an informed decision.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Build two specialized indexes
product_docs = SimpleDirectoryReader("./data/products").load_data()
support_docs = SimpleDirectoryReader("./data/support").load_data()
product_index = VectorStoreIndex.from_documents(product_docs)
support_index = VectorStoreIndex.from_documents(support_docs)

# Wrap each index as a tool with a description
product_tool = QueryEngineTool.from_defaults(
    query_engine=product_index.as_query_engine(),
    description="Useful for questions about product features, specifications, and pricing.",
)
support_tool = QueryEngineTool.from_defaults(
    query_engine=support_index.as_query_engine(),
    description="Useful for questions about troubleshooting, bug reports, and support procedures.",
)

# Create the router
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[product_tool, support_tool],
)

# The router automatically picks the right index
response = router_engine.query("How do I reset my password?")
print(response)
print(f"Routed to: {response.metadata.get('selector_result', 'unknown')}")
The LLMSingleSelector picks exactly one query engine per query. For queries that
span multiple domains, use LLMMultiSelector instead, which can route to multiple
engines and merge the results.
Write descriptive, non-overlapping tool descriptions. The router's accuracy depends heavily on how clearly each tool's scope is defined. Ambiguous or overlapping descriptions lead to misrouted queries. Test your descriptions with a diverse set of queries before deploying.
O.4.2 SubQuestionQueryEngine
Complex questions often contain multiple implicit sub-questions. For example, "Compare the
security features of Product A and Product B" requires retrieving information about both
products separately and then synthesizing a comparison. The SubQuestionQueryEngine
uses an LLM to decompose the original query into sub-questions, routes each sub-question to
the appropriate query engine, and combines the individual answers into a final response.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Build separate indexes for each product
product_a_docs = SimpleDirectoryReader("./data/product_a").load_data()
product_b_docs = SimpleDirectoryReader("./data/product_b").load_data()
index_a = VectorStoreIndex.from_documents(product_a_docs)
index_b = VectorStoreIndex.from_documents(product_b_docs)

# Define tools with descriptive metadata
query_engine_tools = [
    QueryEngineTool(
        query_engine=index_a.as_query_engine(),
        metadata=ToolMetadata(
            name="product_a",
            description="Documentation for Product A, including features and security specs.",
        ),
    ),
    QueryEngineTool(
        query_engine=index_b.as_query_engine(),
        metadata=ToolMetadata(
            name="product_b",
            description="Documentation for Product B, including features and security specs.",
        ),
    ),
]

# Create the sub-question engine
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)

response = sub_question_engine.query(
    "Compare the authentication mechanisms of Product A and Product B."
)
print(response)
Behind the scenes, the engine might decompose this into: "What authentication mechanisms does Product A support?" and "What authentication mechanisms does Product B support?" Each sub-question is answered independently, and the results are synthesized into a coherent comparison. You can inspect the sub-questions and their individual answers via the response's source nodes (each sub-question/answer pair is attached as a source node) or by running the engine with verbose output.
O.4.3 QueryFusionRetriever
A single query phrasing may miss relevant documents that use different terminology. The
QueryFusionRetriever addresses this by generating multiple reformulations of the
original query, running each through the retriever, and fusing the results using reciprocal
rank fusion (RRF). This technique typically improves recall over single-query retrieval.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import QueryFusionRetriever

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a fusion retriever
retriever = QueryFusionRetriever(
    retrievers=[index.as_retriever(similarity_top_k=5)],
    num_queries=4,             # 4 queries total: the original plus 3 generated variations
    similarity_top_k=5,        # final top-k after fusion
    mode="reciprocal_rerank",  # fuse result lists with reciprocal rank fusion
    use_async=True,            # run variations in parallel
)

# Retrieve nodes (not a full query engine, just retrieval)
nodes = retriever.retrieve("What are the best practices for API rate limiting?")
for node in nodes:
    print(f"Score: {node.score:.3f} | {node.text[:80]}...")
Query fusion is inspired by the RAG-Fusion paper (Raudaschl, 2023). The LLM generates paraphrases like "API throttling guidelines," "rate limit configuration best practices," and "managing API request quotas." Each paraphrase may retrieve different relevant chunks that a single query would miss. Reciprocal rank fusion then combines the results, giving higher scores to nodes that appear across multiple query variations.
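To make the fusion step concrete, here is a minimal pure-Python sketch of reciprocal rank fusion over two ranked result lists. The document IDs and the smoothing constant k=60 (a conventional choice from the RRF literature) are illustrative, not part of the LlamaIndex API:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs: each doc scores the sum of
    1 / (k + rank) over every list it appears in (rank is 1-based)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Two query variations retrieved overlapping but different results
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],  # results for variation 1
    ["doc_b", "doc_d", "doc_a"],  # results for variation 2
])
print(fused[0][0])  # doc_b: it ranks near the top of both lists
```

Note how doc_b outranks doc_a even though doc_a is first in one list: appearing high in *both* rankings beats a single top placement, which is exactly why fusion rewards documents that multiple query variations agree on.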
O.4.4 Hybrid Search
Hybrid search combines dense vector retrieval (semantic similarity) with sparse retrieval
(BM25 keyword matching) to get the best of both worlds. Semantic search excels at understanding
intent and synonyms, while BM25 excels at exact term matching. LlamaIndex supports hybrid search
by combining its BM25Retriever with a vector retriever.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever  # pip install llama-index-retrievers-bm25

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
nodes = index.docstore.docs.values()

# Dense retriever (vector similarity)
vector_retriever = index.as_retriever(similarity_top_k=5)

# Sparse retriever (BM25 keyword matching)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=list(nodes),
    similarity_top_k=5,
)

# Combine with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    num_queries=1,             # no query expansion, just fusion
    mode="reciprocal_rerank",  # fuse the dense and sparse result lists with RRF
    similarity_top_k=5,
)

results = hybrid_retriever.retrieve("OAuth2 PKCE flow implementation")
for r in results:
    print(f"Score: {r.score:.3f} | {r.text[:80]}...")
The hybrid approach is especially valuable for technical documentation, codebases, and legal texts where specific terms (function names, section numbers, legal citations) must be matched exactly while the surrounding context benefits from semantic understanding.
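For intuition about what the sparse side contributes, here is a toy, self-contained sketch of the BM25 scoring formula. The parameter values k1 = 1.5 and b = 0.75 are conventional defaults, and the two-document corpus is purely illustrative:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms using BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)  # term frequency in this document
        # Term-frequency saturation (k1) and length normalization (b)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

corpus = [
    "oauth2 pkce flow for public clients".split(),
    "semantic search with embeddings".split(),
]
# The exact token "pkce" appears only in the first document
scores = [bm25_score(["pkce"], doc, corpus) for doc in corpus]
print(scores)  # first document scores > 0, second scores 0
```

This is why BM25 complements dense retrieval: a rare exact token like "pkce" produces a decisive score difference, whereas an embedding model may place both sentences in a vaguely similar "technical" region of vector space.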
O.4.5 Reranking
Initial retrieval (whether vector, keyword, or hybrid) is optimized for speed, not precision. A reranker is a more expensive but more accurate model that re-scores the retrieved nodes based on their relevance to the query. LlamaIndex integrates with cross-encoder reranking models from Cohere, sentence-transformers, and other providers.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.cohere_rerank import CohereRerank  # pip install llama-index-postprocessor-cohere-rerank

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Add a Cohere reranker as a postprocessor
reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=3,  # keep top 3 after reranking
)

query_engine = index.as_query_engine(
    similarity_top_k=10,             # retrieve 10 candidates
    node_postprocessors=[reranker],  # rerank down to top 3
)

response = query_engine.query("How does the billing system handle refunds?")
print(response)

# Reranked source nodes
for node in response.source_nodes:
    print(f"  Rerank score: {node.score:.3f} | {node.text[:60]}...")
A common pattern is to over-retrieve (fetch 10 to 20 candidates) and then rerank down to 3 to 5. This gives the reranker a large pool of candidates to work with while keeping the final context window small. The latency overhead of reranking is typically 50 to 200 ms, which is a worthwhile trade-off for significantly improved answer quality.
O.4.6 Auto-Retrieval with Metadata Filters
In Section O.3.5, you saw how to apply metadata filters manually. Auto-retrieval
takes this further by having the LLM infer the appropriate filters from the user's query. For
example, if a user asks "What were the engineering team's decisions in Q3 2024?", the auto-retriever
generates filters such as department == "engineering", year == 2024, and
quarter == "Q3".
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
from llama_index.core.retrievers import VectorIndexAutoRetriever

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Describe the metadata schema so the LLM knows what filters are available
vector_store_info = VectorStoreInfo(
    content_info="Corporate documentation including policies, reports, and memos.",
    metadata_info=[
        MetadataInfo(
            name="department",
            type="str",
            description="The department that authored the document (engineering, legal, hr, finance).",
        ),
        MetadataInfo(
            name="year",
            type="int",
            description="The year the document was published.",
        ),
        MetadataInfo(
            name="quarter",
            type="str",
            description="The fiscal quarter (Q1, Q2, Q3, Q4).",
        ),
    ],
)

auto_retriever = VectorIndexAutoRetriever(
    index=index,
    vector_store_info=vector_store_info,
    similarity_top_k=5,
)

# The LLM infers filters from the natural language query
nodes = auto_retriever.retrieve(
    "What were the engineering team's architecture decisions in Q3 2024?"
)
for node in nodes:
    print(f"Dept: {node.metadata.get('department')} | {node.text[:80]}...")
Auto-retrieval works best when your metadata schema is well-defined and your metadata values
are consistent. Provide clear, specific descriptions in MetadataInfo so the LLM
can accurately infer filter values. Test with queries that should and should not trigger filters
to ensure the LLM does not hallucinate filter values that do not exist in your data.
Build a hybrid retrieval pipeline with reranking. Load a corpus of technical
documentation and construct both a vector retriever and a BM25 retriever. Combine them using
QueryFusionRetriever, then add a reranking postprocessor (using either Cohere or
a local cross-encoder from sentence-transformers). Compare the top-5 results from the plain
vector retriever, the hybrid retriever, and the hybrid+reranker pipeline on the same set of
queries. Measure how often the reranker promotes a truly relevant document into the top-3.