Why memorize a million rows when you can simply ask the database a well-formed question?
RAG, SQL-Fluent AI Agent
Most enterprise knowledge lives in databases, not documents. Sales figures, inventory counts, customer records, and financial data are stored in relational databases and spreadsheets. Text-to-SQL enables users to query these structured data sources using natural language, bridging the gap between business questions and database queries. Combined with traditional document-based RAG (as covered in Section 20.1), this creates hybrid retrieval systems that can answer questions requiring both factual context and precise numerical data.
Prerequisites
Text-to-SQL and structured data RAG extend the retrieval paradigm from Section 20.1 to tabular and relational data sources. You should be comfortable with basic SQL concepts and understand how LLMs generate structured output, a topic introduced in Section 11.2. The evaluation challenges discussed here connect to the broader LLM evaluation framework in Section 29.1.
1. LLMs and Tabular Data
Tabular data presents unique challenges for LLMs. Unlike free-form text, tables have strict row-column structure, typed values, and relational constraints (foreign keys, uniqueness). LLMs can process small tables directly in their context window, but larger datasets require the model to generate executable queries (SQL, pandas) that retrieve the relevant information.
1.1 Direct Table Reasoning
For small tables (fewer than 50 rows), LLMs can reason over tabular data presented directly in the prompt. The key is formatting: presenting tables as Markdown or structured text with clear column headers helps the model understand the schema. However, this approach fails for large datasets because of context window limits and because LLM arithmetic on many rows is unreliable. Code Fragment 20.5.1 below puts this into practice.
# Code Fragment 20.5.1: direct table reasoning via the chat API
import pandas as pd
from openai import OpenAI

client = OpenAI()

def ask_about_table(df: pd.DataFrame, question: str) -> str:
    """Ask a question about a small DataFrame directly."""
    # Convert to Markdown so the model sees clear column headers
    table_md = df.to_markdown(index=False)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a data analyst. Answer questions about "
                       "the provided table. Show your reasoning."
        }, {
            "role": "user",
            "content": f"Table:\n{table_md}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content
The Spider benchmark for text-to-SQL contains 10,181 questions across 200 databases. State-of-the-art LLMs now score above 85% on it, which sounds impressive until you realize that the remaining 15% includes queries like nested correlated subqueries with HAVING clauses that most human SQL developers would also get wrong on the first try.
2. Text-to-SQL: Architecture
Text-to-SQL systems convert natural language questions into SQL queries that can be executed against a database. The core challenge is mapping the user's informal language to the precise syntax of SQL while correctly identifying which tables, columns, and joins are needed. Modern LLM-based text-to-SQL pipelines achieve over 85% accuracy on standard benchmarks by combining schema linking, few-shot examples, and error correction. Figure 20.5.1 traces the complete text-to-SQL pipeline.
The single highest-leverage improvement for text-to-SQL accuracy is including 3 to 5 diverse, annotated examples in the prompt (few-shot learning). Column descriptions and sample rows help, but well-chosen examples teach the model your schema's naming conventions, join patterns, and edge cases more effectively than any amount of schema documentation.
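To make this concrete, here is what a small annotated example bank might look like. The schema (orders, categories, customers) and column names are invented for illustration; the point is the diversity of the examples, which here cover a date filter, a join with aggregation, and an anti-join edge case.

```python
# Illustrative few-shot example bank. Table and column names are
# hypothetical -- replace them with your own schema's conventions.
EXAMPLE_BANK = [
    {"question": "How many orders were placed in March 2024?",
     "sql": "SELECT COUNT(*) FROM orders "
            "WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31'"},
    {"question": "Total revenue by product category",
     "sql": "SELECT c.category_name, SUM(o.total_amount) AS revenue "
            "FROM orders o JOIN categories c ON o.category_id = c.id "
            "GROUP BY c.category_name"},
    {"question": "Which customers have never placed an order?",
     "sql": "SELECT c.name FROM customers c "
            "LEFT JOIN orders o ON o.customer_id = c.id "
            "WHERE o.id IS NULL"},
]
```

Each entry pairs a natural-language question with a verified query, exactly the shape the `examples` parameter of the generation function in Code Fragment 20.5.3 consumes.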
2.1 Schema Linking
Schema linking is the process of identifying which database tables and columns are relevant to the user's question. This is often the hardest step because users rarely use exact column names. A user asking about "revenue" might mean the total_amount column in the orders table, or the annual_revenue column in the companies table. Code Fragment 20.5.2 below puts this into practice.
# Code Fragment 20.5.2: schema extraction for the LLM prompt
# (get_all_tables, get_columns, and get_sample_values are assumed
# helpers that wrap the database's introspection API)
def get_schema_context(db_connection, relevant_tables=None):
    """Extract schema information for the LLM prompt."""
    schema_parts = []
    # Get all tables, or filter to the relevant ones
    tables = relevant_tables or get_all_tables(db_connection)
    for table in tables:
        columns = get_columns(db_connection, table)
        col_info = []
        for col in columns:
            col_str = f"  {col['name']} {col['type']}"
            if col.get("primary_key"):
                col_str += " PRIMARY KEY"
            if col.get("foreign_key"):
                col_str += f" REFERENCES {col['foreign_key']}"
            col_info.append(col_str)
        # Include sample values for disambiguation
        samples = get_sample_values(db_connection, table, limit=3)
        schema_parts.append(
            f"CREATE TABLE {table} (\n"
            + ",\n".join(col_info)
            + f"\n);\n-- Sample rows: {samples}"
        )
    return "\n\n".join(schema_parts)
2.2 SQL Generation with Few-Shot Examples
Once the schema context is prepared, the next step is generating executable SQL from the user's natural language question. Code Fragment 20.5.3 below demonstrates how to combine schema information with optional few-shot examples in a single LLM call.
# Code Fragment 20.5.3: SQL generation with optional few-shot examples
def text_to_sql(question, schema_context, examples=None):
    """Generate SQL from a natural language question."""
    few_shot = ""
    if examples:
        few_shot = "\n\nExamples:\n" + "\n".join(
            f"Q: {ex['question']}\nSQL: {ex['sql']}"
            for ex in examples
        )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are a SQL expert. Generate a SQL query to
answer the user's question based on this schema:
{schema_context}
{few_shot}
Rules:
- Use only tables and columns from the schema
- Return ONLY the SQL query, no explanation
- Use standard SQL compatible with PostgreSQL
- Add LIMIT 100 to prevent unbounded results
- Handle NULLs appropriately"""
        }, {
            "role": "user",
            "content": question
        }],
        temperature=0.0
    )
    sql = response.choices[0].message.content.strip()
    # Remove Markdown code fences if present
    sql = sql.replace("```sql", "").replace("```", "").strip()
    return sql
The choice of few-shot examples dramatically affects text-to-SQL accuracy. The most effective strategy is to embed the user's question and retrieve the most similar examples from a curated example bank using vector similarity. This "dynamic few-shot" approach ensures that the examples shown to the LLM are maximally relevant to the current question, improving accuracy by 10 to 20% compared to fixed examples on complex queries.
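The selection step can be sketched as follows. The `embed` function here is a toy bag-of-words stand-in so the example is self-contained; a real system would call an embedding model (such as the text-embedding-3-small model used elsewhere in this chapter) and store the example bank's vectors in a vector index.

```python
# Sketch of dynamic few-shot selection: embed the user's question and
# retrieve the most similar examples from the bank by cosine similarity.
# The bag-of-words `embed` below is a placeholder for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in: token counts. Swap for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(question: str, example_bank: list, k: int = 3) -> list:
    """Return the k bank examples most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(q_vec, embed(ex["question"])),
                    reverse=True)
    return ranked[:k]

bank = [
    {"question": "total revenue by region",
     "sql": "SELECT region, SUM(amount) FROM orders GROUP BY region"},
    {"question": "count customers per country",
     "sql": "SELECT country, COUNT(*) FROM customers GROUP BY country"},
    {"question": "top products by revenue",
     "sql": "SELECT product, SUM(amount) FROM orders "
            "GROUP BY product ORDER BY 2 DESC LIMIT 5"},
]
examples = select_examples("What is the revenue by region last year?", bank, k=2)
```

The selected examples then flow directly into the `examples` parameter of `text_to_sql` from Code Fragment 20.5.3.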
3. Error Correction and Self-Healing Queries
Even the best LLMs generate incorrect SQL queries. Common errors include referencing nonexistent columns, incorrect join conditions, missing GROUP BY clauses, and type mismatches. A robust text-to-SQL pipeline includes an error correction loop that catches execution failures, sends the error message back to the LLM along with the original query, and asks for a corrected version. Code Fragment 20.5.4 below puts this into practice.
# Code Fragment 20.5.4: execution with automatic error correction
import sqlite3

def execute_with_retry(question, schema_context, db, max_retries=3):
    """Execute SQL with automatic error correction.

    `db` may be a path to a SQLite file or an already-open
    sqlite3.Connection (e.g. an in-memory database)."""
    conn = db if isinstance(db, sqlite3.Connection) else sqlite3.connect(db)
    error_history = []
    for attempt in range(max_retries):
        # Generate SQL, including errors from prior attempts
        if error_history:
            error_context = "\n".join(
                f"Attempt {e['attempt']}: {e['sql']}\nError: {e['error']}"
                for e in error_history
            )
            question_with_context = (
                f"{question}\n\nPrevious failed attempts:\n"
                f"{error_context}\n\nFix the errors."
            )
        else:
            question_with_context = question
        sql = text_to_sql(question_with_context, schema_context)
        try:
            cursor = conn.execute(sql)
            columns = [desc[0] for desc in cursor.description]
            rows = cursor.fetchall()
            return {
                "sql": sql,
                "columns": columns,
                "rows": rows,
                "attempts": attempt + 1,
            }
        except Exception as e:
            error_history.append({
                "attempt": attempt + 1,
                "sql": sql,
                "error": str(e),
            })
    return {"error": "Failed after max retries",
            "history": error_history}
Allowing an LLM to generate and execute SQL queries introduces serious security risks. Always enforce these safeguards: (1) use a read-only database connection (no INSERT, UPDATE, DELETE, or DROP), (2) set query timeouts to prevent long-running queries, (3) limit result set sizes, (4) validate generated SQL against an allowlist of permitted operations before execution, and (5) run queries against a replica or snapshot, never the production database directly.
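Safeguard (4) can be sketched as a pre-execution check. This is an illustrative keyword-based filter, not a complete defense: a production validator should parse the query with a real SQL parser (for example, the open-source sqlglot library) rather than match keywords, and should run alongside the read-only connection and timeout safeguards, not instead of them.

```python
# Minimal pre-execution guard: allow a single SELECT (or CTE) statement
# and reject data-modifying or schema-modifying keywords.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|attach|pragma)\b",
    re.IGNORECASE,
)

def validate_sql(sql: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a generated SQL string."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if not stripped.lower().startswith(("select", "with")):
        return False, "only SELECT queries are permitted"
    if FORBIDDEN.search(stripped):
        return False, "statement contains a forbidden keyword"
    return True, "ok"
```

Calling `validate_sql` just before `conn.execute` in Code Fragment 20.5.4 turns a rejected query into another entry in the error history, so the retry loop can ask the LLM for a compliant rewrite.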
4. Text-to-SQL Benchmarks
Standard benchmarks measure text-to-SQL system performance across databases of varying complexity. The two most widely used benchmarks are Spider and BIRD, which test different aspects of the text-to-SQL challenge.
| Benchmark | Databases | Questions | Key Challenge | SOTA Accuracy |
|---|---|---|---|---|
| Spider | 200 databases | 10,181 | Cross-database generalization | ~87% (execution) |
| BIRD | 95 databases | 12,751 | Real-world complexity, external knowledge | ~72% (execution) |
| WikiSQL | 26,521 tables | 80,654 | Single-table, simpler queries | ~93% (execution) |
| SParC | 200 databases | 4,298 | Multi-turn conversational SQL | ~70% (execution) |
5. CSV and Spreadsheet Processing
Not all structured data lives in databases. CSV files, Excel spreadsheets, and Google Sheets are ubiquitous in business environments. LLMs can process these by either loading them into an in-memory database (SQLite) for SQL queries or generating pandas code for direct manipulation. Code Fragment 20.5.5 below puts this into practice.
# Code Fragment 20.5.5: load a CSV into SQLite for text-to-SQL querying
import pandas as pd
import sqlite3

def csv_to_queryable(csv_path, table_name="data"):
    """Load a CSV into an in-memory SQLite database."""
    df = pd.read_csv(csv_path)
    # Create in-memory database
    conn = sqlite3.connect(":memory:")
    df.to_sql(table_name, conn, index=False)
    # Generate schema description
    schema = f"Table: {table_name}\nColumns:\n"
    for col in df.columns:
        dtype = str(df[col].dtype)
        sample = df[col].dropna().head(3).tolist()
        schema += f"  {col} ({dtype}), e.g., {sample}\n"
    return conn, schema

# Usage (note: execute_with_retry must accept the open connection
# rather than a file path, since the database is in-memory)
conn, schema = csv_to_queryable("sales_data.csv", "sales")
result = execute_with_retry(
    "What are the top 5 products by total revenue this quarter?",
    schema, conn
)
6. Hybrid Structured/Unstructured Retrieval
Many real-world questions require combining structured data (from databases) with unstructured context (from documents). For example, "Why did Q3 revenue decline for our enterprise segment?" requires both the numerical revenue data from a database and contextual explanations from quarterly reports, meeting notes, or market analyses. Figure 20.5.2 shows how hybrid retrieval merges database results with document context.
The most common pattern for hybrid retrieval is to route the question to both a text-to-SQL pipeline and a document RAG pipeline in parallel, then combine both result sets in the LLM prompt. The structured data provides the "what" (exact numbers, trends, comparisons), while the unstructured data provides the "why" (explanations, context, causal factors). This combination produces answers that are both precise and insightful.
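The parallel routing pattern can be sketched as below. The two pipeline functions are stand-ins with invented return values; in practice `sql_pipeline` would wrap the text-to-SQL fragments from earlier in this section and `doc_pipeline` would query a vector store.

```python
# Sketch of parallel hybrid retrieval: run the structured and
# unstructured pipelines concurrently, then merge both result sets
# into a single prompt context. Pipeline bodies are placeholders.
from concurrent.futures import ThreadPoolExecutor

def sql_pipeline(question: str) -> dict:
    # Stand-in: would call text_to_sql + execute_with_retry
    return {"rows": [("enterprise", "Q3", 1.2e6)]}

def doc_pipeline(question: str) -> list[str]:
    # Stand-in: would retrieve passages from a vector store
    return ["Q3 enterprise churn rose after the pricing change."]

def hybrid_answer_context(question: str) -> str:
    with ThreadPoolExecutor() as pool:
        sql_future = pool.submit(sql_pipeline, question)
        doc_future = pool.submit(doc_pipeline, question)
        sql_result = sql_future.result()
        passages = doc_future.result()
    # Structured data supplies the "what"; documents supply the "why"
    return (f"Question: {question}\n"
            f"Database results: {sql_result['rows']}\n"
            f"Document context:\n- " + "\n- ".join(passages))
```

The merged context string is then passed to the LLM for synthesis into a single answer.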
7. Multi-Table Joins and Complex Queries
The hardest text-to-SQL queries involve multiple tables with complex join conditions, nested subqueries, window functions, and conditional aggregations. These queries require the LLM to understand the database's relational structure: which tables connect to which, through what foreign keys, and with what cardinality.
Three techniques significantly improve accuracy on complex multi-table queries: (1) Schema descriptions: annotating each column with a human-readable description of what it contains and how it relates to business concepts. (2) Join hints: explicitly listing common join patterns (e.g., "orders.customer_id joins to customers.id"). (3) Chain-of-thought SQL: asking the LLM to first write out its reasoning about which tables and joins are needed before generating the SQL, which reduces errors on queries requiring 3+ table joins.
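Technique (3) can be sketched as a two-stage call. To keep the sketch self-contained and testable, the LLM call and the SQL generator are injected as parameters; in practice they would be a chat-completions wrapper and the `text_to_sql` function from Code Fragment 20.5.3. The prompt wording is illustrative.

```python
# Chain-of-thought SQL sketch: a planning pass over tables and joins,
# then SQL generation conditioned on that written-out plan.
def build_plan_prompt(question: str, schema_context: str,
                      join_hints: str = "") -> str:
    """First-stage prompt: reason about tables and joins, no SQL yet."""
    return (f"Schema:\n{schema_context}\n\n"
            f"Known join patterns:\n{join_hints}\n\n"
            f"Question: {question}\n\n"
            "List the tables, columns, and joins needed to answer this. "
            "Do not write SQL yet.")

def text_to_sql_cot(question, schema_context, llm_call, sql_generator,
                    join_hints=""):
    """Two-stage generation. `llm_call` maps a prompt to text;
    `sql_generator` is a text-to-SQL function like Code Fragment 20.5.3."""
    plan = llm_call(build_plan_prompt(question, schema_context, join_hints))
    # Second stage sees both the question and the reasoning about joins
    return sql_generator(
        f"{question}\n\nAnalysis of required tables and joins:\n{plan}",
        schema_context)
```

Separating the planning pass from generation gives the model room to reason about the relational structure before committing to syntax, which is where multi-join queries usually go wrong.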
- Text-to-SQL unlocks database-backed RAG: By converting natural language to SQL, LLMs can query relational databases for precise numerical answers that vector search cannot provide.
- Schema linking is the critical bottleneck: Correctly mapping user terms to database columns determines success; enhance it with column descriptions, sample values, and dynamic few-shot examples.
- Error correction loops are essential: Automatic retry with error feedback catches most SQL generation mistakes, improving reliability from roughly 70% to over 85% on complex queries.
- Security is non-negotiable: Read-only connections, query timeouts, result limits, SQL validation, and database replicas are mandatory safeguards for any production text-to-SQL system.
- Hybrid retrieval combines precision and context: The most powerful answers come from combining structured data (exact numbers) with unstructured context (explanations), retrieved in parallel and synthesized by the LLM.
Who: A data analytics team lead at a retail chain with 400 stores
Situation: Regional managers needed daily access to sales metrics (revenue by category, inventory levels, promotional performance) but had to submit requests to the analytics team because the BI dashboard was too rigid for ad-hoc questions.
Problem: The analytics team processed 60+ ad-hoc data requests per week, each taking 15 to 30 minutes. Managers often waited 2 days for answers, which made the data stale by the time they received it.
Dilemma: Giving managers direct SQL access was a security risk and required training they lacked. A fully natural-language interface risked generating incorrect queries that could produce misleading business decisions.
Decision: They built a text-to-SQL system with guardrails: the LLM generated read-only SELECT queries, a schema description layer exposed only pre-approved tables and columns, and every generated query was validated against a set of SQL safety rules before execution.
How: Table and column descriptions (with business-friendly names and sample values) were injected into the prompt. Ambiguous queries triggered a clarification step. Results were returned as formatted tables with a natural language summary explaining the numbers.
Result: Managers self-served 78% of their data questions. The analytics team's ad-hoc request volume dropped from 60 to 13 per week. Query accuracy (validated against analyst-written queries) was 91% on a test set of 200 common questions.
Lesson: Text-to-SQL works best when you constrain the schema surface area, enforce read-only access, and add a clarification step for ambiguous queries rather than guessing the user's intent.
8. Automated RAG Pipeline Optimization
Building a RAG pipeline involves dozens of configuration decisions: which embedding model to use, what chunk size and overlap to set, how many documents to retrieve, whether to apply reranking, what prompt template to use for generation, and how to handle multi-hop queries. Traditionally, teams tune these parameters manually through trial and error. Automated RAG optimization treats the entire pipeline as a configurable system and uses systematic search to find the best configuration for a given dataset and task. This approach is gaining traction as RAG matures from a research prototype into a production pattern.
8.1 RAG as a Configurable Pipeline
The key insight behind AutoRAG is that every RAG pipeline is a composition of interchangeable components, each with its own hyperparameters. Just as AutoML tunes model architecture and learning rate, AutoRAG tunes retrieval and generation parameters:
| Pipeline Stage | Configurable Parameters | Example Options |
|---|---|---|
| Chunking | Strategy, chunk size, overlap | Recursive (512 tokens, 10% overlap) vs. semantic chunking |
| Embedding | Model, dimensionality | OpenAI text-embedding-3-small vs. Cohere embed-v3 vs. BGE-M3 |
| Retrieval | Top-k, similarity metric, hybrid weights | Top-5 cosine vs. top-10 with BM25 hybrid (0.7/0.3 blend) |
| Reranking | Model, threshold | No reranking vs. Cohere rerank-v3 vs. cross-encoder |
| Generation | Model, temperature, prompt template | GPT-4o with structured prompt vs. Claude with concise template |
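The search over this configuration space can be as simple as a grid sweep. In the sketch below, `evaluate_config` is a stand-in scorer with made-up numbers; a real harness would build the pipeline for each configuration and score it with the RAG evaluation metrics discussed in this chapter over a ground-truth question set.

```python
# Sketch of an AutoRAG-style grid search over pipeline configurations.
# The scoring function is a placeholder for a real evaluation harness.
from itertools import product

SEARCH_SPACE = {
    "chunk_size": [256, 512, 1024],
    "top_k": [3, 5, 10],
    "rerank": [False, True],
}

def evaluate_config(config: dict) -> float:
    # Stand-in scores (invented): pretend small chunks + reranking win
    score = 0.6
    score += 0.1 if config["rerank"] else 0.0
    score += {256: 0.08, 512: 0.05, 1024: 0.0}[config["chunk_size"]]
    score += {3: 0.0, 5: 0.02, 10: 0.01}[config["top_k"]]
    return score

def grid_search(space: dict) -> tuple[dict, float]:
    """Exhaustively evaluate every configuration; return the best."""
    keys = list(space)
    best_config, best_score = None, float("-inf")
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate_config(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

This space has 3 x 3 x 2 = 18 configurations; real AutoRAG runs explore far larger spaces, where random or Bayesian search replaces exhaustive enumeration.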
8.2 RAGAs: Evaluation Metrics for RAG
Meaningful optimization requires meaningful metrics. The RAGAs framework (Shahul et al., 2023) provides a suite of evaluation metrics specifically designed for RAG systems. These metrics evaluate both the retrieval and generation stages independently, enabling targeted optimization. The key metrics are:
- Context Precision: What fraction of retrieved passages are actually relevant to the question? High precision means the retriever is not wasting context window space on irrelevant documents.
- Context Recall: What fraction of the information needed to answer the question is present in the retrieved passages? High recall means the retriever is finding all the relevant evidence.
- Faithfulness: Is every claim in the generated answer supported by the retrieved context? This directly measures hallucination risk.
- Answer Relevancy: Does the generated answer actually address the question asked? A faithful answer can still be irrelevant if the model focuses on the wrong part of the context.
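The first two metrics reduce to simple ratios once per-passage relevance judgments are available; in RAGAs those judgments come from an LLM judge, but the arithmetic is the same. A minimal hand-rolled version:

```python
# Context precision and recall, assuming relevance judgments are given.
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved passages that are relevant."""
    if not retrieved:
        return 0.0
    return sum(p in relevant for p in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant evidence that was retrieved."""
    if not relevant:
        return 1.0
    return sum(p in retrieved for p in relevant) / len(relevant)
```

Keeping the two metrics separate is what enables targeted optimization: low precision points to retrieving too many passages (or too-large chunks), while low recall points to a weak embedding model or too small a top-k.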
8.3 Manual vs. Automated RAG Tuning
| Dimension | Manual Tuning | Automated (AutoRAG) |
|---|---|---|
| Configuration space explored | 3 to 5 configurations | 50 to 200+ configurations |
| Time investment | Days to weeks | Hours (automated evaluation) |
| Reproducibility | Low (undocumented decisions) | High (logged experiments, metric history) |
| Discovery of non-obvious configs | Unlikely | Common (e.g., small chunks + reranker beats large chunks) |
| Upfront cost | Low (human time) | Moderate (compute for eval runs) |
| Best for | Prototypes, simple pipelines | Production systems, complex multi-stage pipelines |
The most important lesson from AutoRAG research is that optimal configurations are often counterintuitive. Teams frequently discover that smaller chunk sizes with aggressive reranking outperform larger chunks without reranking, or that a cheaper embedding model paired with a cross-encoder reranker beats a premium embedding model used alone. Without systematic evaluation across many configurations, these non-obvious tradeoffs remain hidden. The chunking evaluation framework from Section 19.4 and the LLM-as-judge evaluation pattern from Section 29.2 provide the building blocks for automated RAG evaluation.
Scenario: A team managing a customer support knowledge base with 50,000 articles needed to optimize their RAG pipeline. Manual tuning over two weeks had achieved 72% answer accuracy (judged by support agents).
Approach: They used an automated evaluation harness with 200 ground-truth question-answer pairs and RAGAs metrics (faithfulness, context precision, answer relevancy). The harness tested 96 configurations: 4 chunk sizes (256, 512, 768, 1024 tokens), 3 embedding models, 2 retrieval strategies (dense-only vs. hybrid), 2 reranking options (none vs. cross-encoder), and 2 generation models.
Result: The winning configuration (384-token chunks with Cohere embed-v3, hybrid retrieval, cross-encoder reranking, and GPT-4o-mini for generation) achieved 86% answer accuracy, a 14-point improvement. Notably, this configuration used a cheaper generation model than the manually tuned version, reducing per-query cost by 40% while improving quality.
- LLM-native SQL generation is improving through specialized fine-tuning on SQL benchmarks (BIRD, Spider 2.0), with models like SQLCoder and DIN-SQL achieving near-human accuracy on complex queries.
- Semantic layers (e.g., Cube, dbt Metrics) provide a business-friendly abstraction over raw database schemas, reducing hallucinated column names and join errors.
- Multi-turn SQL conversations allow users to iteratively refine queries through natural language, with the system maintaining query history and context.
- Research into privacy-preserving text-to-SQL is developing methods to generate accurate queries without exposing raw data to the LLM.
- AutoRAG frameworks (AutoRAG, Ragas, ARES) are systematizing RAG pipeline optimization by combining automated evaluation with configuration search, paralleling the evolution of hyperparameter tuning in traditional ML.
Exercises
These exercises cover text-to-SQL, structured data RAG, and AutoRAG optimization.
Why is it useful to have an LLM generate SQL queries rather than building a custom search interface? What risks does this introduce?
Show Answer
Text-to-SQL lets non-technical users query databases using natural language, eliminating the need to learn SQL. Risks: SQL injection if queries are executed without sandboxing, incorrect queries that return wrong data with high confidence, and queries that are too expensive (full table scans, cross-joins).
Explain what schema linking is and why it is the most critical step in a text-to-SQL pipeline. What happens when schema linking fails?
Show Answer
Schema linking maps natural language terms to table and column names. Without it, the LLM may reference non-existent columns or join the wrong tables. Failure produces syntactically valid but semantically incorrect SQL.
Describe how error correction works in a text-to-SQL system. Why is it better to have the LLM fix its own SQL than to simply retry?
Show Answer
The LLM receives the SQL error message (e.g., "column 'user_name' not found") and the original query, allowing it to correct the specific mistake. Blind retries would generate the same error. Self-correction works because the error message provides targeted feedback.
A user asks "How did our top customers perform last quarter?" This requires both a database query (customer metrics) and document retrieval (qualitative analysis). How would you design a system that handles both?
Show Answer
Use a query router that classifies the intent: (a) structured/quantitative questions go to text-to-SQL, (b) qualitative/descriptive questions go to vector retrieval, (c) hybrid questions invoke both and combine results in the final prompt.
Explain the concept of automated RAG pipeline optimization. What components are tuned, and what metrics guide the search?
Show Answer
AutoRAG treats the RAG pipeline configuration as a hyperparameter optimization problem. Components tuned: chunk size, overlap, embedding model, retrieval top-K, reranker, prompt template. Metrics: answer correctness (LLM-as-judge), faithfulness (grounded in sources), retrieval recall, latency, and cost.
Create a SQLite database with 3 tables. Write a prompt that provides the schema to an LLM and ask it to generate SQL for 5 natural language questions. Execute the SQL and check correctness.
Extend your text-to-SQL pipeline with error handling: if the generated SQL fails, send the error message back to the LLM and ask it to fix the query. Allow up to 3 retries.
Build a pipeline that takes a CSV file, infers the schema, converts it to a SQLite database, and answers natural language questions about the data using text-to-SQL.
Implement a simple AutoRAG optimization: for a fixed test set of 20 questions, sweep over chunk sizes (128, 256, 512), top-K values (3, 5, 10), and with/without reranking. Report the best configuration based on answer accuracy.
What Comes Next
In the next section, Section 20.6: RAG Frameworks & Orchestration, we survey RAG frameworks and orchestration tools, the practical libraries that simplify building production RAG systems.
An early evaluation of LLM performance on text-to-SQL tasks, establishing baselines for the field. Highlights both strengths and failure modes. Good starting point for understanding LLM SQL capabilities.
The most widely-used benchmark for text-to-SQL evaluation with complex, cross-domain queries. Continues to be the standard evaluation dataset. Essential context for anyone working in this space.
Introduces a decomposition approach that breaks complex SQL generation into manageable sub-tasks. Achieves state-of-the-art results through self-correction. Valuable for teams implementing production text-to-SQL.
Li, J. et al. (2023). "Can LLM Already Serve as A Database Interface?" NeurIPS 2023.
A large-scale benchmark testing LLMs as database interfaces in realistic settings. Evaluates practical deployment scenarios beyond academic benchmarks. Recommended for practitioners assessing production readiness.
Vanna.ai: Open-source Text-to-SQL.
An open-source framework for training custom text-to-SQL models on your own database schema. Supports multiple LLM backends and database types. Ideal for teams wanting a turnkey text-to-SQL solution.
A family of open-source models fine-tuned specifically for text-to-SQL generation. Competitive with larger models at a fraction of the cost. Recommended for teams needing a self-hosted SQL generation model.
