Why memorize a million rows when you can simply ask the database a well-formed question?
RAG, SQL-Fluent AI Agent
Most enterprise knowledge lives in databases, not documents. Sales figures, inventory counts, customer records, and financial data are stored in relational databases and spreadsheets. Text-to-SQL enables users to query these structured data sources using natural language, bridging the gap between business questions and database queries. Combined with traditional document-based RAG (as covered in Section 20.1), this creates hybrid retrieval systems that can answer questions requiring both factual context and precise numerical data.
Prerequisites
Text-to-SQL and structured data RAG extend the retrieval paradigm from Section 20.1 to tabular and relational data sources. You should be comfortable with basic SQL concepts and understand how LLMs generate structured output, a topic introduced in Section 11.2. The evaluation challenges discussed here connect to the broader LLM evaluation framework in Section 29.1.
1. LLMs and Tabular Data
Tabular data presents unique challenges for LLMs. Unlike free-form text, tables have strict row-column structure, typed values, and relational constraints (foreign keys, uniqueness). LLMs can process small tables directly in their context window, but larger datasets require the model to generate executable queries (SQL, pandas) that retrieve the relevant information.
1.1 Direct Table Reasoning
For small tables (fewer than 50 rows), LLMs can reason over tabular data presented directly in the prompt. The key is formatting: presenting tables as Markdown or structured text with clear column headers helps the model understand the schema. However, this approach fails for large datasets because of context window limits and because LLM arithmetic on many rows is unreliable. Code Fragment 20.5.1 below puts this into practice.
# Code Fragment 20.5.1: direct table reasoning via the chat API
import pandas as pd
from openai import OpenAI

client = OpenAI()

def ask_about_table(df: pd.DataFrame, question: str) -> str:
    """Ask a question about a small DataFrame directly."""
    # Convert to Markdown so the model sees clear column headers
    table_md = df.to_markdown(index=False)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a data analyst. Answer questions about "
                       "the provided table. Show your reasoning."
        }, {
            "role": "user",
            "content": f"Table:\n{table_md}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content
The Spider benchmark for text-to-SQL contains 10,181 questions across 200 databases. State-of-the-art LLMs now score above 85% on it, which sounds impressive until you realize that the remaining 15% includes queries like nested correlated subqueries with HAVING clauses that most human SQL developers would also get wrong on the first try.
2. Text-to-SQL: Architecture
Text-to-SQL systems convert natural language questions into SQL queries that can be executed against a database. The core challenge is mapping the user's informal language to the precise syntax of SQL while correctly identifying which tables, columns, and joins are needed. Modern LLM-based text-to-SQL pipelines achieve over 85% accuracy on standard benchmarks by combining schema linking, few-shot examples, and error correction. Figure 20.5.1 traces the complete text-to-SQL pipeline.
The single highest-leverage improvement for text-to-SQL accuracy is including 3 to 5 diverse, annotated examples in the prompt (few-shot learning). Column descriptions and sample rows help, but well-chosen examples teach the model your schema's naming conventions, join patterns, and edge cases more effectively than any amount of schema documentation.
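To make this concrete, here is what a small annotated example bank might look like. The schema (orders, categories, customers) and column names are invented for illustration; the point is the diversity of the examples, which here cover a date filter, a join with aggregation, and an anti-join edge case.

```python
# Illustrative few-shot example bank. Table and column names are
# hypothetical -- replace them with your own schema's conventions.
EXAMPLE_BANK = [
    {"question": "How many orders were placed in March 2024?",
     "sql": "SELECT COUNT(*) FROM orders "
            "WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31'"},
    {"question": "Total revenue by product category",
     "sql": "SELECT c.category_name, SUM(o.total_amount) AS revenue "
            "FROM orders o JOIN categories c ON o.category_id = c.id "
            "GROUP BY c.category_name"},
    {"question": "Which customers have never placed an order?",
     "sql": "SELECT c.name FROM customers c "
            "LEFT JOIN orders o ON o.customer_id = c.id "
            "WHERE o.id IS NULL"},
]
```

Each entry pairs a natural-language question with a verified query, exactly the shape the `examples` parameter of the generation function in Code Fragment 20.5.3 consumes.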
2.1 Schema Linking
Schema linking is the process of identifying which database tables and columns are relevant to the user's question. This is often the hardest step because users rarely use exact column names. A user asking about "revenue" might mean the total_amount column in the orders table, or the annual_revenue column in the companies table. Code Fragment 20.5.2 below puts this into practice.
# Code Fragment 20.5.2: schema extraction for the LLM prompt
# (get_all_tables, get_columns, and get_sample_values are assumed
# helpers that wrap the database's introspection API)
def get_schema_context(db_connection, relevant_tables=None):
    """Extract schema information for the LLM prompt."""
    schema_parts = []
    # Get all tables, or filter to the relevant ones
    tables = relevant_tables or get_all_tables(db_connection)
    for table in tables:
        columns = get_columns(db_connection, table)
        col_info = []
        for col in columns:
            col_str = f"  {col['name']} {col['type']}"
            if col.get("primary_key"):
                col_str += " PRIMARY KEY"
            if col.get("foreign_key"):
                col_str += f" REFERENCES {col['foreign_key']}"
            col_info.append(col_str)
        # Include sample values for disambiguation
        samples = get_sample_values(db_connection, table, limit=3)
        schema_parts.append(
            f"CREATE TABLE {table} (\n"
            + ",\n".join(col_info)
            + f"\n);\n-- Sample rows: {samples}"
        )
    return "\n\n".join(schema_parts)
2.2 SQL Generation with Few-Shot Examples
Once the schema context is prepared, the next step is generating executable SQL from the user's natural language question. Code Fragment 20.5.3 below demonstrates how to combine schema information with optional few-shot examples in a single LLM call.
# Code Fragment 20.5.3: SQL generation with optional few-shot examples
def text_to_sql(question, schema_context, examples=None):
    """Generate SQL from a natural language question."""
    few_shot = ""
    if examples:
        few_shot = "\n\nExamples:\n" + "\n".join(
            f"Q: {ex['question']}\nSQL: {ex['sql']}"
            for ex in examples
        )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are a SQL expert. Generate a SQL query to
answer the user's question based on this schema:
{schema_context}
{few_shot}
Rules:
- Use only tables and columns from the schema
- Return ONLY the SQL query, no explanation
- Use standard SQL compatible with PostgreSQL
- Add LIMIT 100 to prevent unbounded results
- Handle NULLs appropriately"""
        }, {
            "role": "user",
            "content": question
        }],
        temperature=0.0
    )
    sql = response.choices[0].message.content.strip()
    # Remove Markdown code fences if present
    sql = sql.replace("```sql", "").replace("```", "").strip()
    return sql
The choice of few-shot examples dramatically affects text-to-SQL accuracy. The most effective strategy is to embed the user's question and retrieve the most similar examples from a curated example bank using vector similarity. This "dynamic few-shot" approach ensures that the examples shown to the LLM are maximally relevant to the current question, improving accuracy by 10 to 20% compared to fixed examples on complex queries.
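The selection step can be sketched as follows. The `embed` function here is a toy bag-of-words stand-in so the example is self-contained; a real system would call an embedding model (such as the text-embedding-3-small model used elsewhere in this chapter) and store the example bank's vectors in a vector index.

```python
# Sketch of dynamic few-shot selection: embed the user's question and
# retrieve the most similar examples from the bank by cosine similarity.
# The bag-of-words `embed` below is a placeholder for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in: token counts. Swap for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(question: str, example_bank: list, k: int = 3) -> list:
    """Return the k bank examples most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(q_vec, embed(ex["question"])),
                    reverse=True)
    return ranked[:k]

bank = [
    {"question": "total revenue by region",
     "sql": "SELECT region, SUM(amount) FROM orders GROUP BY region"},
    {"question": "count customers per country",
     "sql": "SELECT country, COUNT(*) FROM customers GROUP BY country"},
    {"question": "top products by revenue",
     "sql": "SELECT product, SUM(amount) FROM orders "
            "GROUP BY product ORDER BY 2 DESC LIMIT 5"},
]
examples = select_examples("What is the revenue by region last year?", bank, k=2)
```

The selected examples then flow directly into the `examples` parameter of `text_to_sql` from Code Fragment 20.5.3.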
3. Error Correction and Self-Healing Queries
Even the best LLMs generate incorrect SQL queries. Common errors include referencing nonexistent columns, incorrect join conditions, missing GROUP BY clauses, and type mismatches. A robust text-to-SQL pipeline includes an error correction loop that catches execution failures, sends the error message back to the LLM along with the original query, and asks for a corrected version. Code Fragment 20.5.4 below puts this into practice.
# Code Fragment 20.5.4: execution with automatic error correction
import sqlite3

def execute_with_retry(question, schema_context, db, max_retries=3):
    """Execute SQL with automatic error correction.

    `db` may be a path to a SQLite file or an already-open
    sqlite3.Connection (e.g. an in-memory database)."""
    conn = db if isinstance(db, sqlite3.Connection) else sqlite3.connect(db)
    error_history = []
    for attempt in range(max_retries):
        # Generate SQL, including errors from prior attempts
        if error_history:
            error_context = "\n".join(
                f"Attempt {e['attempt']}: {e['sql']}\nError: {e['error']}"
                for e in error_history
            )
            question_with_context = (
                f"{question}\n\nPrevious failed attempts:\n"
                f"{error_context}\n\nFix the errors."
            )
        else:
            question_with_context = question
        sql = text_to_sql(question_with_context, schema_context)
        try:
            cursor = conn.execute(sql)
            columns = [desc[0] for desc in cursor.description]
            rows = cursor.fetchall()
            return {
                "sql": sql,
                "columns": columns,
                "rows": rows,
                "attempts": attempt + 1,
            }
        except Exception as e:
            error_history.append({
                "attempt": attempt + 1,
                "sql": sql,
                "error": str(e),
            })
    return {"error": "Failed after max retries",
            "history": error_history}
Allowing an LLM to generate and execute SQL queries introduces serious security risks. Always enforce these safeguards: (1) use a read-only database connection (no INSERT, UPDATE, DELETE, or DROP), (2) set query timeouts to prevent long-running queries, (3) limit result set sizes, (4) validate generated SQL against an allowlist of permitted operations before execution, and (5) run queries against a replica or snapshot, never the production database directly.
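Safeguard (4) can be sketched as a pre-execution check. This is an illustrative keyword-based filter, not a complete defense: a production validator should parse the query with a real SQL parser (for example, the open-source sqlglot library) rather than match keywords, and should run alongside the read-only connection and timeout safeguards, not instead of them.

```python
# Minimal pre-execution guard: allow a single SELECT (or CTE) statement
# and reject data-modifying or schema-modifying keywords.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|attach|pragma)\b",
    re.IGNORECASE,
)

def validate_sql(sql: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a generated SQL string."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if not stripped.lower().startswith(("select", "with")):
        return False, "only SELECT queries are permitted"
    if FORBIDDEN.search(stripped):
        return False, "statement contains a forbidden keyword"
    return True, "ok"
```

Calling `validate_sql` just before `conn.execute` in Code Fragment 20.5.4 turns a rejected query into another entry in the error history, so the retry loop can ask the LLM for a compliant rewrite.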
4. Text-to-SQL Benchmarks
Standard benchmarks measure text-to-SQL system performance across databases of varying complexity. The two most widely used benchmarks are Spider and BIRD, which test different aspects of the text-to-SQL challenge.
| Benchmark | Databases | Questions | Key Challenge | SOTA Accuracy |
|---|---|---|---|---|
| Spider | 200 databases | 10,181 | Cross-database generalization | ~87% (execution) |
| BIRD | 95 databases | 12,751 | Real-world complexity, external knowledge | ~72% (execution) |
| WikiSQL | 26,521 tables | 80,654 | Single-table, simpler queries | ~93% (execution) |
| SParC | 200 databases | 4,298 | Multi-turn conversational SQL | ~70% (execution) |
5. CSV and Spreadsheet Processing
Not all structured data lives in databases. CSV files, Excel spreadsheets, and Google Sheets are ubiquitous in business environments. LLMs can process these by either loading them into an in-memory database (SQLite) for SQL queries or generating pandas code for direct manipulation. Code Fragment 20.5.5 below puts this into practice.
# Code Fragment 20.5.5: load a CSV into SQLite for text-to-SQL querying
import pandas as pd
import sqlite3

def csv_to_queryable(csv_path, table_name="data"):
    """Load a CSV into an in-memory SQLite database."""
    df = pd.read_csv(csv_path)
    # Create in-memory database
    conn = sqlite3.connect(":memory:")
    df.to_sql(table_name, conn, index=False)
    # Generate schema description
    schema = f"Table: {table_name}\nColumns:\n"
    for col in df.columns:
        dtype = str(df[col].dtype)
        sample = df[col].dropna().head(3).tolist()
        schema += f"  {col} ({dtype}), e.g., {sample}\n"
    return conn, schema

# Usage (note: execute_with_retry must accept the open connection
# rather than a file path, since the database is in-memory)
conn, schema = csv_to_queryable("sales_data.csv", "sales")
result = execute_with_retry(
    "What are the top 5 products by total revenue this quarter?",
    schema, conn
)
6. Hybrid Structured/Unstructured Retrieval
Many real-world questions require combining structured data (from databases) with unstructured context (from documents). For example, "Why did Q3 revenue decline for our enterprise segment?" requires both the numerical revenue data from a database and contextual explanations from quarterly reports, meeting notes, or market analyses. Figure 20.5.2 shows how hybrid retrieval merges database results with document context.
The most common pattern for hybrid retrieval is to route the question to both a text-to-SQL pipeline and a document RAG pipeline in parallel, then combine both result sets in the LLM prompt. The structured data provides the "what" (exact numbers, trends, comparisons), while the unstructured data provides the "why" (explanations, context, causal factors). This combination produces answers that are both precise and insightful.
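The parallel routing pattern can be sketched as below. The two pipeline functions are stand-ins with invented return values; in practice `sql_pipeline` would wrap the text-to-SQL fragments from earlier in this section and `doc_pipeline` would query a vector store.

```python
# Sketch of parallel hybrid retrieval: run the structured and
# unstructured pipelines concurrently, then merge both result sets
# into a single prompt context. Pipeline bodies are placeholders.
from concurrent.futures import ThreadPoolExecutor

def sql_pipeline(question: str) -> dict:
    # Stand-in: would call text_to_sql + execute_with_retry
    return {"rows": [("enterprise", "Q3", 1.2e6)]}

def doc_pipeline(question: str) -> list[str]:
    # Stand-in: would retrieve passages from a vector store
    return ["Q3 enterprise churn rose after the pricing change."]

def hybrid_answer_context(question: str) -> str:
    with ThreadPoolExecutor() as pool:
        sql_future = pool.submit(sql_pipeline, question)
        doc_future = pool.submit(doc_pipeline, question)
        sql_result = sql_future.result()
        passages = doc_future.result()
    # Structured data supplies the "what"; documents supply the "why"
    return (f"Question: {question}\n"
            f"Database results: {sql_result['rows']}\n"
            f"Document context:\n- " + "\n- ".join(passages))
```

The merged context string is then passed to the LLM for synthesis into a single answer.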
7. Multi-Table Joins and Complex Queries
The hardest text-to-SQL queries involve multiple tables with complex join conditions, nested subqueries, window functions, and conditional aggregations. These queries require the LLM to understand the database's relational structure: which tables connect to which, through what foreign keys, and with what cardinality.
Three techniques significantly improve accuracy on complex multi-table queries: (1) Schema descriptions: annotating each column with a human-readable description of what it contains and how it relates to business concepts. (2) Join hints: explicitly listing common join patterns (e.g., "orders.customer_id joins to customers.id"). (3) Chain-of-thought SQL: asking the LLM to first write out its reasoning about which tables and joins are needed before generating the SQL, which reduces errors on queries requiring 3+ table joins.
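Technique (3) can be sketched as a two-stage call. To keep the sketch self-contained and testable, the LLM call and the SQL generator are injected as parameters; in practice they would be a chat-completions wrapper and the `text_to_sql` function from Code Fragment 20.5.3. The prompt wording is illustrative.

```python
# Chain-of-thought SQL sketch: a planning pass over tables and joins,
# then SQL generation conditioned on that written-out plan.
def build_plan_prompt(question: str, schema_context: str,
                      join_hints: str = "") -> str:
    """First-stage prompt: reason about tables and joins, no SQL yet."""
    return (f"Schema:\n{schema_context}\n\n"
            f"Known join patterns:\n{join_hints}\n\n"
            f"Question: {question}\n\n"
            "List the tables, columns, and joins needed to answer this. "
            "Do not write SQL yet.")

def text_to_sql_cot(question, schema_context, llm_call, sql_generator,
                    join_hints=""):
    """Two-stage generation. `llm_call` maps a prompt to text;
    `sql_generator` is a text-to-SQL function like Code Fragment 20.5.3."""
    plan = llm_call(build_plan_prompt(question, schema_context, join_hints))
    # Second stage sees both the question and the reasoning about joins
    return sql_generator(
        f"{question}\n\nAnalysis of required tables and joins:\n{plan}",
        schema_context)
```

Separating the planning pass from generation gives the model room to reason about the relational structure before committing to syntax, which is where multi-join queries usually go wrong.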
- Text-to-SQL unlocks database-backed RAG: By converting natural language to SQL, LLMs can query relational databases for precise numerical answers that vector search cannot provide.
- Schema linking is the critical bottleneck: Correctly mapping user terms to database columns determines success; enhance it with column descriptions, sample values, and dynamic few-shot examples.
- Error correction loops are essential: Automatic retry with error feedback catches most SQL generation mistakes, improving reliability from roughly 70% to over 85% on complex queries.
- Security is non-negotiable: Read-only connections, query timeouts, result limits, SQL validation, and database replicas are mandatory safeguards for any production text-to-SQL system.
- Hybrid retrieval combines precision and context: The most powerful answers come from combining structured data (exact numbers) with unstructured context (explanations), retrieved in parallel and synthesized by the LLM.
Who: A data analytics team lead at a retail chain with 400 stores
Situation: Regional managers needed daily access to sales metrics (revenue by category, inventory levels, promotional performance) but had to submit requests to the analytics team because the BI dashboard was too rigid for ad-hoc questions.
Problem: The analytics team processed 60+ ad-hoc data requests per week, each taking 15 to 30 minutes. Managers often waited 2 days for answers, which made the data stale by the time they received it.
Dilemma: Giving managers direct SQL access was a security risk and required training they lacked. A fully natural-language interface risked generating incorrect queries that could produce misleading business decisions.
Decision: They built a text-to-SQL system with guardrails: the LLM generated read-only SELECT queries, a schema description layer exposed only pre-approved tables and columns, and every generated query was validated against a set of SQL safety rules before execution.
How: Table and column descriptions (with business-friendly names and sample values) were injected into the prompt. Ambiguous queries triggered a clarification step. Results were returned as formatted tables with a natural language summary explaining the numbers.
Result: Managers self-served 78% of their data questions. The analytics team's ad-hoc request volume dropped from 60 to 13 per week. Query accuracy (validated against analyst-written queries) was 91% on a test set of 200 common questions.
Lesson: Text-to-SQL works best when you constrain the schema surface area, enforce read-only access, and add a clarification step for ambiguous queries rather than guessing the user's intent.
8. Automated RAG Pipeline Optimization
Building a RAG pipeline involves dozens of configuration decisions: which embedding model to use, what chunk size and overlap to set, how many documents to retrieve, whether to apply reranking, what prompt template to use for generation, and how to handle multi-hop queries. Traditionally, teams tune these parameters manually through trial and error. Automated RAG optimization treats the entire pipeline as a configurable system and uses systematic search to find the best configuration for a given dataset and task. This approach is gaining traction as RAG matures from a research prototype into a production pattern.
8.1 RAG as a Configurable Pipeline
The key insight behind AutoRAG is that every RAG pipeline is a composition of interchangeable components, each with its own hyperparameters. Just as AutoML tunes model architecture and learning rate, AutoRAG tunes retrieval and generation parameters:
| Pipeline Stage | Configurable Parameters | Example Options |
|---|---|---|
| Chunking | Strategy, chunk size, overlap | Recursive (512 tokens, 10% overlap) vs. semantic chunking |
| Embedding | Model, dimensionality | OpenAI text-embedding-3-small vs. Cohere embed-v3 vs. BGE-M3 |
| Retrieval | Top-k, similarity metric, hybrid weights | Top-5 cosine vs. top-10 with BM25 hybrid (0.7/0.3 blend) |
| Reranking | Model, threshold | No reranking vs. Cohere rerank-v3 vs. cross-encoder |
| Generation | Model, temperature, prompt template | GPT-4o with structured prompt vs. Claude with concise template |
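The search over this configuration space can be as simple as a grid sweep. In the sketch below, `evaluate_config` is a stand-in scorer with made-up numbers; a real harness would build the pipeline for each configuration and score it with the RAG evaluation metrics discussed in this chapter over a ground-truth question set.

```python
# Sketch of an AutoRAG-style grid search over pipeline configurations.
# The scoring function is a placeholder for a real evaluation harness.
from itertools import product

SEARCH_SPACE = {
    "chunk_size": [256, 512, 1024],
    "top_k": [3, 5, 10],
    "rerank": [False, True],
}

def evaluate_config(config: dict) -> float:
    # Stand-in scores (invented): pretend small chunks + reranking win
    score = 0.6
    score += 0.1 if config["rerank"] else 0.0
    score += {256: 0.08, 512: 0.05, 1024: 0.0}[config["chunk_size"]]
    score += {3: 0.0, 5: 0.02, 10: 0.01}[config["top_k"]]
    return score

def grid_search(space: dict) -> tuple[dict, float]:
    """Exhaustively evaluate every configuration; return the best."""
    keys = list(space)
    best_config, best_score = None, float("-inf")
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate_config(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

This space has 3 x 3 x 2 = 18 configurations; real AutoRAG runs explore far larger spaces, where random or Bayesian search replaces exhaustive enumeration.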
8.2 RAGAs: Evaluation Metrics for RAG
Meaningful optimization requires meaningful metrics. The RAGAs framework (Shahul et al., 2023) provides a suite of evaluation metrics specifically designed for RAG systems. These metrics evaluate both the retrieval and generation stages independently, enabling targeted optimization. The key metrics are:
- Context Precision: What fraction of retrieved passages are actually relevant to the question? High precision means the retriever is not wasting context window space on irrelevant documents.
- Context Recall: What fraction of the information needed to answer the question is present in the retrieved passages? High recall means the retriever is finding all the relevant evidence.
- Faithfulness: Is every claim in the generated answer supported by the retrieved context? This directly measures hallucination risk.
- Answer Relevancy: Does the generated answer actually address the question asked? A faithful answer can still be irrelevant if the model focuses on the wrong part of the context.
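The first two metrics reduce to simple ratios once per-passage relevance judgments are available; in RAGAs those judgments come from an LLM judge, but the arithmetic is the same. A minimal hand-rolled version:

```python
# Context precision and recall, assuming relevance judgments are given.
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved passages that are relevant."""
    if not retrieved:
        return 0.0
    return sum(p in relevant for p in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant evidence that was retrieved."""
    if not relevant:
        return 1.0
    return sum(p in retrieved for p in relevant) / len(relevant)
```

Keeping the two metrics separate is what enables targeted optimization: low precision points to retrieving too many passages (or too-large chunks), while low recall points to a weak embedding model or too small a top-k.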
8.3 Manual vs. Automated RAG Tuning
| Dimension | Manual Tuning | Automated (AutoRAG) |
|---|---|---|
| Configuration space explored | 3 to 5 configurations | 50 to 200+ configurations |
| Time investment | Days to weeks | Hours (automated evaluation) |
| Reproducibility | Low (undocumented decisions) | High (logged experiments, metric history) |
| Discovery of non-obvious configs | Unlikely | Common (e.g., small chunks + reranker beats large chunks) |
| Upfront cost | Low (human time) | Moderate (compute for eval runs) |
| Best for | Prototypes, simple pipelines | Production systems, complex multi-stage pipelines |
The most important lesson from AutoRAG research is that optimal configurations are often counterintuitive. Teams frequently discover that smaller chunk sizes with aggressive reranking outperform larger chunks without reranking, or that a cheaper embedding model paired with a cross-encoder reranker beats a premium embedding model used alone. Without systematic evaluation across many configurations, these non-obvious tradeoffs remain hidden. The chunking evaluation framework from Section 19.4 and the LLM-as-judge evaluation pattern from Section 29.2 provide the building blocks for automated RAG evaluation.
Scenario: A team managing a customer support knowledge base with 50,000 articles needed to optimize their RAG pipeline. Manual tuning over two weeks had achieved 72% answer accuracy (judged by support agents).
Approach: They used an automated evaluation harness with 200 ground-truth question-answer pairs and RAGAs metrics (faithfulness, context precision, answer relevancy). The harness tested 96 configurations: 4 chunk sizes (256, 512, 768, 1024 tokens), 3 embedding models, 2 retrieval strategies (dense-only vs. hybrid), 2 reranking options (none vs. cross-encoder), and 2 generation models.
Result: The winning configuration (384-token chunks with Cohere embed-v3, hybrid retrieval, cross-encoder reranking, and GPT-4o-mini for generation) achieved 86% answer accuracy, a 14-point improvement. Notably, this configuration used a cheaper generation model than the manually tuned version, reducing per-query cost by 40% while improving quality.
- LLM-native SQL generation is improving through specialized fine-tuning on SQL benchmarks (BIRD, Spider 2.0), with models like SQLCoder and DIN-SQL achieving near-human accuracy on complex queries.
- Semantic layers (e.g., Cube, dbt Metrics) provide a business-friendly abstraction over raw database schemas, reducing hallucinated column names and join errors.
- Multi-turn SQL conversations allow users to iteratively refine queries through natural language, with the system maintaining query history and context.
- Research into privacy-preserving text-to-SQL is developing methods to generate accurate queries without exposing raw data to the LLM.
- AutoRAG frameworks (AutoRAG, Ragas, ARES) are systematizing RAG pipeline optimization by combining automated evaluation with configuration search, paralleling the evolution of hyperparameter tuning in traditional ML.
Exercises
These exercises cover text-to-SQL, structured data RAG, and AutoRAG optimization.
Why is it useful to have an LLM generate SQL queries rather than building a custom search interface? What risks does this introduce?
Show Answer
Text-to-SQL lets non-technical users query databases using natural language, eliminating the need to learn SQL. Risks: SQL injection if queries are executed without sandboxing, incorrect queries that return wrong data with high confidence, and queries that are too expensive (full table scans, cross-joins).
Explain what schema linking is and why it is the most critical step in a text-to-SQL pipeline. What happens when schema linking fails?
Show Answer
Schema linking maps natural language terms to table and column names. Without it, the LLM may reference non-existent columns or join the wrong tables. Failure produces syntactically valid but semantically incorrect SQL.
Describe how error correction works in a text-to-SQL system. Why is it better to have the LLM fix its own SQL than to simply retry?
Show Answer
The LLM receives the SQL error message (e.g., "column 'user_name' not found") and the original query, allowing it to correct the specific mistake. Blind retries would generate the same error. Self-correction works because the error message provides targeted feedback.
A user asks "How did our top customers perform last quarter?" This requires both a database query (customer metrics) and document retrieval (qualitative analysis). How would you design a system that handles both?
Show Answer
Use a query router that classifies the intent: (a) structured/quantitative questions go to text-to-SQL, (b) qualitative/descriptive questions go to vector retrieval, (c) hybrid questions invoke both and combine results in the final prompt.
Explain the concept of automated RAG pipeline optimization. What components are tuned, and what metrics guide the search?
Show Answer
AutoRAG treats the RAG pipeline configuration as a hyperparameter optimization problem. Components tuned: chunk size, overlap, embedding model, retrieval top-K, reranker, prompt template. Metrics: answer correctness (LLM-as-judge), faithfulness (grounded in sources), retrieval recall, latency, and cost.
Create a SQLite database with 3 tables. Write a prompt that provides the schema to an LLM and ask it to generate SQL for 5 natural language questions. Execute the SQL and check correctness.
Extend your text-to-SQL pipeline with error handling: if the generated SQL fails, send the error message back to the LLM and ask it to fix the query. Allow up to 3 retries.
Build a pipeline that takes a CSV file, infers the schema, converts it to a SQLite database, and answers natural language questions about the data using text-to-SQL.
Implement a simple AutoRAG optimization: for a fixed test set of 20 questions, sweep over chunk sizes (128, 256, 512), top-K values (3, 5, 10), and with/without reranking. Report the best configuration based on answer accuracy.
What Comes Next
In the next section, Section 20.6: RAG Frameworks & Orchestration, we survey RAG frameworks and orchestration tools, the practical libraries that simplify building production RAG systems.
An early evaluation of LLM performance on text-to-SQL tasks, establishing baselines for the field. Highlights both strengths and failure modes. Good starting point for understanding LLM SQL capabilities.
The most widely-used benchmark for text-to-SQL evaluation with complex, cross-domain queries. Continues to be the standard evaluation dataset. Essential context for anyone working in this space.
Introduces a decomposition approach that breaks complex SQL generation into manageable sub-tasks. Achieves state-of-the-art results through self-correction. Valuable for teams implementing production text-to-SQL.
Li, J. et al. (2023). "Can LLM Already Serve as A Database Interface?" NeurIPS 2023.
A large-scale benchmark testing LLMs as database interfaces in realistic settings. Evaluates practical deployment scenarios beyond academic benchmarks. Recommended for practitioners assessing production readiness.
Vanna.ai: Open-source Text-to-SQL.
An open-source framework for training custom text-to-SQL models on your own database schema. Supports multiple LLM backends and database types. Ideal for teams wanting a turnkey text-to-SQL solution.
A family of open-source models fine-tuned specifically for text-to-SQL generation. Competitive with larger models at a fraction of the cost. Recommended for teams needing a self-hosted SQL generation model.
