Building Conversational AI with LLMs and Agents
Appendix O: LlamaIndex: Data Indexing and Query Engines

Advanced Retrieval: Routing, Sub-Questions, and Fusion

Big Picture

Simple top-k retrieval works well for straightforward questions, but real-world queries are often complex: they may span multiple topics, require information from different data sources, or benefit from combining keyword and semantic search. This section covers LlamaIndex's advanced retrieval patterns, including RouterQueryEngine for intelligent routing, SubQuestionQueryEngine for decomposing complex queries, QueryFusionRetriever for multi-strategy retrieval, hybrid search, reranking, and auto-retrieval with metadata filters.

O.4.1 RouterQueryEngine

When your application has multiple indexes (for example, one for product documentation and another for customer support tickets), you need a way to route each query to the right index. The RouterQueryEngine uses an LLM to examine the query and select the most appropriate query engine from a set of candidates. Each candidate is described with a natural language summary so the router can make an informed decision.

from llama_index.core import VectorStoreIndex, SummaryIndex, SimpleDirectoryReader
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Build two specialized indexes
product_docs = SimpleDirectoryReader("./data/products").load_data()
support_docs = SimpleDirectoryReader("./data/support").load_data()

product_index = VectorStoreIndex.from_documents(product_docs)
support_index = VectorStoreIndex.from_documents(support_docs)

# Wrap each index as a tool with a description
product_tool = QueryEngineTool.from_defaults(
    query_engine=product_index.as_query_engine(),
    description="Useful for questions about product features, specifications, and pricing.",
)
support_tool = QueryEngineTool.from_defaults(
    query_engine=support_index.as_query_engine(),
    description="Useful for questions about troubleshooting, bug reports, and support procedures.",
)

# Create the router
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[product_tool, support_tool],
)

# The router automatically picks the right index
response = router_engine.query("How do I reset my password?")
print(response)
print(f"Routed to: {response.metadata.get('selector_result', 'unknown')}")

The LLMSingleSelector picks exactly one query engine per query. For queries that span multiple domains, use LLMMultiSelector instead, which can route to multiple engines and merge the results.

Tip

Write descriptive, non-overlapping tool descriptions. The router's accuracy depends heavily on how clearly each tool's scope is defined. Ambiguous or overlapping descriptions lead to misrouted queries. Test your descriptions with a diverse set of queries before deploying.

O.4.2 SubQuestionQueryEngine

Complex questions often contain multiple implicit sub-questions. For example, "Compare the security features of Product A and Product B" requires retrieving information about both products separately and then synthesizing a comparison. The SubQuestionQueryEngine uses an LLM to decompose the original query into sub-questions, routes each sub-question to the appropriate query engine, and combines the individual answers into a final response.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Build separate indexes for each product
product_a_docs = SimpleDirectoryReader("./data/product_a").load_data()
product_b_docs = SimpleDirectoryReader("./data/product_b").load_data()

index_a = VectorStoreIndex.from_documents(product_a_docs)
index_b = VectorStoreIndex.from_documents(product_b_docs)

# Define tools with descriptive metadata
query_engine_tools = [
    QueryEngineTool(
        query_engine=index_a.as_query_engine(),
        metadata=ToolMetadata(
            name="product_a",
            description="Documentation for Product A, including features and security specs.",
        ),
    ),
    QueryEngineTool(
        query_engine=index_b.as_query_engine(),
        metadata=ToolMetadata(
            name="product_b",
            description="Documentation for Product B, including features and security specs.",
        ),
    ),
]

# Create the sub-question engine
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)

response = sub_question_engine.query(
    "Compare the authentication mechanisms of Product A and Product B."
)
print(response)

Behind the scenes, the engine might decompose this into: "What authentication mechanisms does Product A support?" and "What authentication mechanisms does Product B support?" Each sub-question is answered independently, and the results are synthesized into a coherent comparison. You can inspect the generated sub-questions and their intermediate answers through the response's source nodes, or pass verbose=True to the engine to print them as they are produced.
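The decompose-route-synthesize loop is easy to sketch in plain Python. Everything below is a schematic stand-in, not the engine's actual implementation: decompose and synthesize are hypothetical placeholders for LLM calls, and the TOOLS dict stands in for the per-product query engines.

```python
# Schematic of the SubQuestionQueryEngine control flow.
# decompose(), TOOLS, and synthesize() are hypothetical stand-ins for LLM calls.

def decompose(query: str) -> list[dict]:
    """Stand-in for the LLM that splits a query into routed sub-questions."""
    return [
        {"tool": "product_a", "question": "What authentication mechanisms does Product A support?"},
        {"tool": "product_b", "question": "What authentication mechanisms does Product B support?"},
    ]

# Stand-ins for the per-tool query engines.
TOOLS = {
    "product_a": lambda q: "Product A supports OAuth2 and SAML.",
    "product_b": lambda q: "Product B supports OAuth2 and API keys.",
}

def synthesize(query: str, qa_pairs: list[tuple[str, str]]) -> str:
    """Stand-in for the LLM that merges sub-answers into one response."""
    return " ".join(answer for _, answer in qa_pairs)

def sub_question_query(query: str) -> str:
    qa_pairs = []
    for sub in decompose(query):
        answer = TOOLS[sub["tool"]](sub["question"])  # route to the named tool
        qa_pairs.append((sub["question"], answer))
    return synthesize(query, qa_pairs)

print(sub_question_query("Compare the authentication mechanisms of Product A and Product B."))
```

The real engine performs the same three steps, but each placeholder is an LLM call: decomposition and synthesis are prompted generations, and each tool call is a full query-engine invocation.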

O.4.3 QueryFusionRetriever

A single query phrasing may miss relevant documents that use different terminology. The QueryFusionRetriever addresses this by generating multiple reformulations of the original query, running each through the retriever, and fusing the results using reciprocal rank fusion (RRF). This typically improves recall over single-query retrieval, at the cost of extra LLM and retrieval calls.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import QueryFusionRetriever

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a fusion retriever
retriever = QueryFusionRetriever(
    retrievers=[index.as_retriever(similarity_top_k=5)],
    num_queries=4,          # generate 4 query variations
    similarity_top_k=5,     # final top-k after fusion
    use_async=True,         # run variations in parallel
)

# Retrieve nodes (not a full query engine, just retrieval)
nodes = retriever.retrieve("What are the best practices for API rate limiting?")

for node in nodes:
    print(f"Score: {node.score:.3f} | {node.text[:80]}...")
Note

Query fusion is inspired by the RAG-Fusion paper (Raudaschl, 2023). The LLM generates paraphrases like "API throttling guidelines," "rate limit configuration best practices," and "managing API request quotas." Each paraphrase may retrieve different relevant chunks that a single query would miss. Reciprocal rank fusion then combines the results, giving higher scores to nodes that appear across multiple query variations.
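Reciprocal rank fusion itself is just a few lines of arithmetic: each document scores the sum of 1/(k + rank) across all ranked lists it appears in. A minimal sketch, with invented document IDs standing in for retrieved nodes and k = 60 as the conventional constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Three query variations returned overlapping result lists:
rankings = [
    ["doc_rate_limits", "doc_auth", "doc_quotas"],
    ["doc_quotas", "doc_rate_limits", "doc_billing"],
    ["doc_rate_limits", "doc_quotas", "doc_auth"],
]
for doc_id, score in reciprocal_rank_fusion(rankings):
    print(f"{score:.4f}  {doc_id}")
```

Note how doc_rate_limits and doc_quotas, which appear near the top of all three lists, outscore doc_auth and doc_billing, which appear in only some. The constant k dampens the influence of any single list's top rank.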

O.4.4 Hybrid Search

Hybrid search combines dense vector retrieval (semantic similarity) with sparse retrieval (BM25 keyword matching) to get the best of both worlds. Semantic search excels at understanding intent and synonyms, while BM25 excels at exact term matching. LlamaIndex supports hybrid search through its BM25Retriever combined with a vector retriever.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.retrievers.bm25 import BM25Retriever  # pip install llama-index-retrievers-bm25
from llama_index.core.retrievers import QueryFusionRetriever

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
nodes = index.docstore.docs.values()  # works with the default in-memory docstore

# Dense retriever (vector similarity)
vector_retriever = index.as_retriever(similarity_top_k=5)

# Sparse retriever (BM25 keyword matching)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=list(nodes),
    similarity_top_k=5,
)

# Combine with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    num_queries=1,            # no query expansion, just fusion
    similarity_top_k=5,
)

results = hybrid_retriever.retrieve("OAuth2 PKCE flow implementation")
for r in results:
    print(f"Score: {r.score:.3f} | {r.text[:80]}...")

The hybrid approach is especially valuable for technical documentation, codebases, and legal texts where specific terms (function names, section numbers, legal citations) must be matched exactly while the surrounding context benefits from semantic understanding.

O.4.5 Reranking

Initial retrieval (whether vector, keyword, or hybrid) is optimized for speed, not precision. A reranker is a more expensive but more accurate model that re-scores the retrieved nodes based on their relevance to the query. LlamaIndex integrates with cross-encoder reranking models from Cohere, sentence-transformers, and other providers.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.cohere_rerank import CohereRerank  # pip install llama-index-postprocessor-cohere-rerank

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Add a Cohere reranker as a postprocessor
reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=3,       # keep top 3 after reranking
)

query_engine = index.as_query_engine(
    similarity_top_k=10,           # retrieve 10 candidates
    node_postprocessors=[reranker],  # rerank to top 3
)

response = query_engine.query("How does the billing system handle refunds?")
print(response)

# Reranked source nodes
for node in response.source_nodes:
    print(f"  Rerank score: {node.score:.3f} | {node.text[:60]}...")
Tip

A common pattern is to over-retrieve (fetch 10 to 20 candidates) and then rerank down to 3 to 5. This gives the reranker a large pool of candidates to work with while keeping the final context window small. The latency overhead of reranking is typically 50 to 200 ms, which is a worthwhile trade-off for significantly improved answer quality.
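The over-retrieve-then-rerank pattern is independent of any particular reranker. A minimal sketch, with invented documents and stand-in scoring functions: the cheap word-overlap retriever and the phrase-aware "reranker" below are illustrative toys, not real models (in practice a cross-encoder scores each query-passage pair):

```python
def cheap_retrieve(query: str, corpus: list[str], top_k: int) -> list[str]:
    """Stand-in for fast first-stage retrieval: rank by shared-word count."""
    qw = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(qw & set(d.lower().split())), reverse=True)[:top_k]

def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Stand-in for a cross-encoder: a slower, more precise relevance score.
    Here faked by additionally rewarding an exact phrase match."""
    qw = set(query.lower().split())
    def score(doc: str) -> float:
        base = len(qw & set(doc.lower().split()))
        return base + (5.0 if query.lower() in doc.lower() else 0.0)
    return sorted(candidates, key=score, reverse=True)[:top_n]

corpus = [
    "Refunds are processed by the billing system within 5 business days.",
    "Our system supports monthly billing and annual plans.",
    "Shipping refunds require a support ticket.",
]
candidates = cheap_retrieve("billing system", corpus, top_k=3)  # over-retrieve
final = rerank("billing system", candidates, top_n=1)           # rerank down
print(final[0])
```

The first two documents tie under bag-of-words overlap; only the second-stage scorer, which looks at the query as a phrase, separates them. A real cross-encoder makes the same kind of distinction by reading query and passage jointly.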

O.4.6 Auto-Retrieval with Metadata Filters

In Section O.3.5, you saw how to apply metadata filters manually. Auto-retrieval takes this further by having the LLM infer the appropriate filters from the user's query. For example, if a user asks "What were the engineering team's decisions in Q3 2024?", the auto-retriever automatically generates filters like department == "engineering" and date >= "2024-07-01".

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
from llama_index.core.retrievers import VectorIndexAutoRetriever

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Describe the metadata schema so the LLM knows what filters are available
vector_store_info = VectorStoreInfo(
    content_info="Corporate documentation including policies, reports, and memos.",
    metadata_info=[
        MetadataInfo(
            name="department",
            type="str",
            description="The department that authored the document (engineering, legal, hr, finance).",
        ),
        MetadataInfo(
            name="year",
            type="int",
            description="The year the document was published.",
        ),
        MetadataInfo(
            name="quarter",
            type="str",
            description="The fiscal quarter (Q1, Q2, Q3, Q4).",
        ),
    ],
)

auto_retriever = VectorIndexAutoRetriever(
    index=index,
    vector_store_info=vector_store_info,
    similarity_top_k=5,
)

# The LLM infers filters from the natural language query
nodes = auto_retriever.retrieve(
    "What were the engineering team's architecture decisions in Q3 2024?"
)
for node in nodes:
    print(f"Dept: {node.metadata.get('department')} | {node.text[:80]}...")
Tip

Auto-retrieval works best when your metadata schema is well-defined and your metadata values are consistent. Provide clear, specific descriptions in MetadataInfo so the LLM can accurately infer filter values. Test with queries that should and should not trigger filters to ensure the LLM does not hallucinate filter values that do not exist in your data.
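What the auto-retriever ultimately produces is a structured filter applied alongside the vector search. A plain-Python sketch of that filtering step, using invented node metadata (in LlamaIndex the inferred filters become MetadataFilters passed to the vector store, not a dict):

```python
# Hypothetical filters the LLM might infer from
# "engineering team's architecture decisions in Q3 2024":
inferred_filters = {"department": "engineering", "year": 2024, "quarter": "Q3"}

# Invented metadata standing in for indexed document nodes.
nodes = [
    {"text": "ADR-12: adopt event sourcing", "department": "engineering", "year": 2024, "quarter": "Q3"},
    {"text": "Q3 hiring report",             "department": "hr",          "year": 2024, "quarter": "Q3"},
    {"text": "ADR-07: service mesh rollout", "department": "engineering", "year": 2023, "quarter": "Q4"},
]

def apply_filters(nodes: list[dict], filters: dict) -> list[dict]:
    """Keep only nodes whose metadata matches every inferred filter (exact equality)."""
    return [n for n in nodes if all(n.get(k) == v for k, v in filters.items())]

for n in apply_filters(nodes, inferred_filters):
    print(n["text"])
```

This is also why consistent metadata values matter: exact-match filtering silently drops documents tagged "Engineering" or "q3" if the LLM infers "engineering" and "Q3".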

Exercise O.4

Build a hybrid retrieval pipeline with reranking. Load a corpus of technical documentation and construct both a vector retriever and a BM25 retriever. Combine them using QueryFusionRetriever, then add a reranking postprocessor (using either Cohere or a local cross-encoder from sentence-transformers). Compare the top-5 results from the plain vector retriever, the hybrid retriever, and the hybrid+reranker pipeline on the same set of queries. Measure how often the reranker promotes a truly relevant document into the top-3.