Reading List: Multi-Table QA & Text-to-SQL over Data Lakes

Papers on question answering and SQL generation when the relevant tables must first be found in a data lake. Classical text-to-SQL assumes a single, well-curated database. This list covers the harder problem: the schema is enormous or unknown, key/foreign-key constraints are missing, and the system must retrieve, link, plan, and compose across many tables before any SQL is generated. The list moves from classical text-to-SQL foundations and benchmarks, through schema linking and agentic methods, to open-domain table QA, multi-table retrieval, and the new benchmarks designed for data-lake settings.

Topics

  1. Foundations & Single-DB Text-to-SQL
  2. Schema Linking at Scale
  3. Agentic & Multi-Step Methods
  4. Open-Domain Table QA
  5. Multi-Table Retrieval for QA
  6. Text-to-SQL over Data Lakes
  7. Benchmarks & Evaluation
01 — Foundations

Single-Database Text-to-SQL

The classical formulation: a user's natural-language question is translated to SQL against a single, fixed database whose schema is known in advance. These papers establish the problem definitions, the canonical benchmarks, and the LLM-era execution-accuracy ceiling that data-lake methods will have to clear. Papers marked as Background are included to give a comprehensive view of the field and are not suitable for class presentation.

EMNLP 2018 · Background
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev
A canonical text-to-SQL benchmark: 10,181 questions and 5,693 SQL queries over 200 databases spanning 138 domains. Designed to test cross-domain generalization: train and test databases are disjoint. The de facto evaluation set for nearly all text-to-SQL work; execution accuracy on it rose from ~54% (2020) to over 90% (2023), which gives the field a clear record of progress.
NeurIPS 2023 · Background
Can LLM Already Serve as A Database Interface? A BIG Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD)
J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C. C. Chang, F. Huang, R. Cheng, Y. Li
Pushes the text-to-SQL bar significantly higher than Spider: 12,751 question/SQL pairs over 95 databases (33 GB total), with dirty values, large schemas, and external knowledge required to answer many questions. Introduces "valid efficiency score" alongside execution accuracy. By 2025, leaderboard execution accuracy reached ~76% — substantially below Spider, exposing gaps between toy and realistic settings.
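
Execution accuracy, the headline metric for both Spider and BIRD, can be illustrated with a small sketch. This is a simplified stand-in for the official evaluation scripts, not a reproduction of them: it assumes a SQLite database, compares results as unordered sets of rows, and counts a query that fails to execute as wrong. The table, data, and query pairs are hypothetical.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return set(predicted_rows) == set(gold_rows)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose results match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)

# Toy database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 31), ("Bo", 45), ("Cy", 27)],
)

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    # Runs, but returns the wrong rows.
    ("SELECT name FROM singer",
     "SELECT name FROM singer WHERE age < 30"),
    # Misspelled column: fails to execute.
    ("SELECT nme FROM singer",
     "SELECT name FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
```

Comparing results rather than SQL strings is what lets syntactically different but equivalent queries count as correct; it also means a query can pass by coincidence on a small database.
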
ACM CSUR 2025
A Survey on Employing Large Language Models for Text-to-SQL Tasks
L. Shi, Z. Tang, N. Zhang, X. Zhang, Z. Yang
A comprehensive survey covering LLM prompt engineering, fine-tuning, retrieval-augmented, and agentic approaches. Catalogues benchmarks (Spider, BIRD, Spider 2.0, BIRD-Critic) and tracks the progression from single-pass SQL generation to multi-step pipelines. Distinguishes itself from earlier surveys with a systematic taxonomy and "Key Takeaways" boxes at the end of each section.
02 — Schema Linking

Schema Linking at Scale

When a database has hundreds or thousands of tables, the LLM cannot see the whole schema in its context. Schema linking — picking the right tables and columns to expose to the SQL generator — becomes the central bottleneck. Get it wrong and the SQL is doomed; get it right and SQL generation often becomes easy. These papers focus on making this step both accurate and tractable.
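
As a rough illustration of what schema linking does, the sketch below ranks tables by token overlap between the question and each table's name and column names, and keeps the top few. This is a deliberately naive lexical baseline, not the method of any paper in this section; the function names, the toy schema, and the `top_k` cutoff are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so order_date -> {order, date}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def link_schema(question, schema, top_k=2):
    """Rank tables by token overlap with the question and keep the top_k.

    schema maps table name -> list of column names.
    Tables with zero overlap are never returned.
    """
    question_tokens = tokenize(question)
    scored = []
    for table, columns in schema.items():
        schema_tokens = tokenize(table) | tokenize(" ".join(columns))
        scored.append((len(question_tokens & schema_tokens), table))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [table for score, table in scored[:top_k] if score > 0]

# Hypothetical three-table schema.
schema = {
    "customers": ["customer_id", "name", "city"],
    "orders": ["order_id", "customer_id", "order_date", "total"],
    "warehouses": ["warehouse_id", "region", "capacity"],
}
linked = link_schema("What is the total of each order by customer name?", schema)
# linked == ["orders", "customers"]
```

A lexical matcher like this breaks as soon as the question uses a synonym or the column names are cryptic, which is part of why the papers below rely on LLMs, value retrieval, and learned retrievers instead.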

ICML 2025
CHESS: Contextual Harnessing for Efficient SQL Synthesis
S. Talaei, M. Pourreza, Y.-C. Chang, A. Mirhoseini, A. Saberi
A multi-agent framework with four specialized agents: the Information Retriever extracts relevant data, the Schema Selector prunes large schemas, the Candidate Generator produces and iteratively refines SQL candidates, and the Unit Tester validates queries through LLM-based natural-language unit tests. Uses locality-sensitive hashing to retrieve database values from millions of rows. One of the most-cited baselines for schema-linking-heavy pipelines and a frequent comparison point for newer methods.
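
The idea of using locality-sensitive hashing to look up database values can be sketched in pure Python with MinHash signatures and banding. This is a generic textbook construction under assumed parameters (character trigrams, 16 bands of 2 rows, MD5-based hash functions), not CHESS's implementation; the class and function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def shingles(value, n=3):
    """Character n-grams of a lowercased value."""
    text = value.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(value, num_hashes):
    """MinHash signature: for each seeded hash, keep the minimum over shingles."""
    grams = shingles(value)
    return tuple(
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(num_hashes)
    )

class ValueIndex:
    """Banded MinHash index: values sharing any band land in the same bucket."""

    def __init__(self, bands=16, rows=2):
        self.bands, self.rows = bands, rows
        self.buckets = defaultdict(set)

    def _keys(self, value):
        signature = minhash(value, self.bands * self.rows)
        for b in range(self.bands):
            yield (b, signature[b * self.rows:(b + 1) * self.rows])

    def add(self, value):
        for key in self._keys(value):
            self.buckets[key].add(value)

    def lookup(self, query):
        """Collect bucket collisions, then rank them by exact Jaccard similarity."""
        candidates = set()
        for key in self._keys(query):
            candidates |= self.buckets.get(key, set())
        query_grams = shingles(query)

        def jaccard(value):
            grams = shingles(value)
            return len(query_grams & grams) / len(query_grams | grams)

        return sorted(candidates, key=lambda v: (-jaccard(v), v))

index = ValueIndex()
for value in ["Los Angeles", "Los Alamos", "San Francisco", "Las Vegas"]:
    index.add(value)
matches = index.lookup("Los Angeles")
```

An exact match always collides in every band and ranks first. A near match, such as a misspelled city name, collides only with some probability that grows with its trigram overlap, so the band and row counts trade recall against the number of candidates that need exact comparison.
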
arXiv 2024 · Background
RSL-SQL: Robust Schema Linking in Text-to-SQL Generation
Z. Cao, Y. Zheng, Z. Fan, X. Zhang, W. Chen
Combines bidirectional schema linking (forward and backward pruning), contextual augmentation, a binary mode-selection strategy, and multi-turn self-correction. Achieves 94% strict recall on schema linking while reducing input columns by 83% — concretely showing how much of the SQL-generation problem is really a retrieval problem.
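
The two numbers quoted above can be made concrete with a short sketch. It assumes that strict recall means the share of questions for which every gold column survives linking, and that column reduction is pooled over questions; the paper's exact definitions may differ, and the column names and schema sizes below are made up.

```python
def strict_recall(gold, predicted):
    """Share of questions whose linked columns include every gold column."""
    hits = sum(1 for g, p in zip(gold, predicted) if set(g) <= set(p))
    return hits / len(gold)

def column_reduction(full_schema_sizes, predicted):
    """Fraction of columns removed from the prompt, pooled over questions."""
    kept = sum(len(set(p)) for p in predicted)
    return 1 - kept / sum(full_schema_sizes)

# Two hypothetical questions, each over a 10-column schema.
gold = [
    {"orders.total", "orders.customer_id"},
    {"customers.city"},
]
predicted = [
    {"orders.total", "orders.customer_id", "orders.order_date"},  # superset: a hit
    {"customers.name"},                                           # misses the gold column
]
recall = strict_recall(gold, predicted)               # 0.5
reduction = column_reduction([10, 10], predicted)     # 4 of 20 columns kept
```

The metric is strict in the sense that one missing column makes the whole question a miss, which matches the intuition that SQL generation cannot recover a column it was never shown.
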
arXiv 2025 · Background
RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL
Amazon Science (authors per PDF)
Targets the truly massive-schema setting (thousands of tables). Prunes the schema to less than 3% of its original size on BIRD and less than 1% on FIBEN through retrieval-augmented filtering. Highlights that schema linking is a different problem at enterprise scale than on academic benchmarks.
arXiv 2025 · Background
KaSLA: Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation
(see arXiv listing for authors)
Introduces a "restricted missing indicator" metric arguing that recall and precision alone don't capture schema-linking quality. Treats schema linking as a knapsack problem balancing inclusion against a tolerance budget for redundancy. A 1.6B model with KaSLA matches DeepSeek-V3 + SOTA schema-linking on Spider and BIRD.
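
To show what "schema linking as a knapsack problem" looks like in its simplest form, here is a textbook 0/1 knapsack over columns: each column has a relevance score and a cost (for example, prompt tokens), and the selection maximizes total relevance within a budget. This is only the generic dynamic program; KaSLA's actual objective, its relevance estimation, and its redundancy tolerance are not reproduced here, and all scores and costs below are invented.

```python
def select_columns(columns, budget):
    """0/1 knapsack: maximize total relevance subject to total cost <= budget.

    columns is a list of (name, relevance, cost) with integer relevance and cost.
    Returns (best total relevance, list of chosen column names).
    """
    # best[c] = (total relevance, chosen names) using at most c units of budget
    best = [(0, []) for _ in range(budget + 1)]
    for name, relevance, cost in columns:
        # Iterate the budget downward so each column is used at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost][0] + relevance
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - cost][1] + [name])
    return best[budget]

# Hypothetical columns: (name, relevance in points, cost in budget units).
columns = [
    ("orders.total", 9, 3),
    ("orders.order_date", 4, 3),
    ("customers.name", 8, 2),
    ("customers.fax", 1, 2),
]
score, chosen = select_columns(columns, budget=5)
# score == 17, chosen == ["orders.total", "customers.name"]
```

The budget is what distinguishes this framing from plain top-k ranking: a cheap, moderately relevant column can displace an expensive one, so the selection accounts for redundancy cost as well as relevance.
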
03 — Agents

Agentic & Multi-Step Methods

Rather than producing SQL in one shot, agentic methods let an LLM plan, explore the schema interactively, execute partial queries, observe results, and refine. This style scales more gracefully to massive schemas where putting everything into context is impossible.

arXiv 2025 · Background
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q.-W. Zhang, D. Yin, X. Sun, Z. Li
Three specialized agents — a Selector for schema simplification, a Decomposer for sub-question generation, and a Refiner for self-correction — collaborating on text-to-SQL. Demonstrates that the right division of labor between LLM-driven components can substantially improve over single-pass generation.
arXiv 2025 · Background
AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale
(see arXiv listing for authors)
Reformulates schema linking as an iterative, agent-driven exploration: the agent never sees the full schema, but dynamically expands the linked subset through actions. Achieves 97.4% strict recall on BIRD-Dev and 91.2% on Spider 2.0-Lite at less than half the token cost of leading approaches. The strongest current case for agentic schema discovery.
04 — Open Domain

Open-Domain Table QA

A complementary tradition coming from the NLP side: answer the question directly by retrieving relevant tables (and sometimes text passages) and reading them with a language model, often without generating SQL at all. The retrieval step is the key challenge — the schema is not given, so the system must find the right tables among millions.

EMNLP 2021 · Background
Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering (DUREPA)
A. Li, Y. Zhou, T. Liu, W. Yin, B. Xiang
Among the first to combine text-to-SQL with open-domain QA: a generative model takes both textual and tabular evidence and chooses to emit either a direct answer or a SQL query, executing the latter on the relevant database. Demonstrates that interpretable SQL generation helps especially for questions requiring complex reasoning.
arXiv 2022 · Background
End-to-end Table Question Answering via Retrieval-Augmented Generation
F. Pan, M. Canim, M. Glass, A. Gliozzo, J. Hendler
An early RAG-style architecture specifically for open-domain table QA: retrieve candidate tables, then condition a generator on them. Sets the template that nearly every subsequent LLM-based system follows in some form.
Datenbank-Spektrum 2025 · Background
Towards Complex Table Question Answering Over Tabular Data Lakes
J.-M. Bodensohn, C. Binnig
A systematic analysis of LLMs paired with table retrievers on private tabular data lakes. Finds that answer generation fails primarily because retrieval doesn't provide the required tabular context — making table retrieval, not SQL synthesis, the binding constraint for open table QA at scale.
05 — Multi-Table

Multi-Table Retrieval for QA

A single retrieved table is rarely enough. Answering most realistic questions requires combining evidence across multiple tables — and the retrieved set must be jointly coherent: the tables must be joinable or unionable in a way that supports the question. This is where data-lake table discovery and QA meet.

arXiv 2024
Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval
P. B. Chen, Y. Zhang, D. Roth
Argues that table retrieval is far from solved: existing methods retrieve individually relevant tables but fail when the answer requires joining them, and they assume key/foreign-key constraints are pre-given — which is exactly what's missing in data-lake settings. Proposes join-aware retrieval that ranks sets of tables by their joint relevance and joinability.
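
The difference between ranking tables individually and ranking them jointly can be seen in a toy example. The sketch below scores every pair of tables by summed relevance plus a joinability term, where joinability is approximated by the best Jaccard overlap between any two columns' values. It is a brute-force illustration under assumed scores, not the paper's algorithm; the tables, relevance values, and `weight` parameter are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two value collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def joinability(t1, t2):
    """Best value overlap between any column of t1 and any column of t2."""
    return max(jaccard(v1, v2) for v1 in t1.values() for v2 in t2.values())

def best_pair(relevance, tables, weight=1.0):
    """Pick the table pair maximizing summed relevance plus weighted joinability."""
    def score(pair):
        a, b = pair
        return relevance[a] + relevance[b] + weight * joinability(tables[a], tables[b])
    return max(combinations(sorted(tables), 2), key=score)

# Hypothetical lake: each table maps column name -> sample values.
tables = {
    "orders": {"customer_id": [1, 2, 3], "total": [10, 20, 30]},
    "customers": {"id": [1, 2, 3], "name": ["a", "b", "c"]},
    "order_archive": {"customer_ref": [7, 8, 9], "total": [11, 21, 31]},
}
# Per-table relevance to the question, as a single-table retriever might score it.
relevance = {"orders": 0.9, "order_archive": 0.8, "customers": 0.6}

pair = best_pair(relevance, tables)
# Ranking tables independently would pick orders + order_archive (0.9 + 0.8),
# but those share no values; the joint score prefers customers + orders.
```

Enumerating pairs is quadratic and enumerating larger sets is exponential, which is why scalable formulations of joint retrieval are a research question in their own right.
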
arXiv 2025 · Background
Exploring Multi-Table Retrieval Through Iterative Search
A. Boutaleb, B. Amann, R. Angarita, H. Naacke
Frames multi-table retrieval as iterative search. Argues that exact methods like Mixed-Integer Programming guarantee coherence but don't scale, while greedy heuristics that optimize coverage often miss joinable sets. The iterative-search formulation offers better scalability, interpretability, and flexibility than either extreme.
arXiv 2026
Decomposition-Driven Multi-Table Retrieval and Reasoning for Numerical Question Answering
(see arXiv listing for authors)
Targets numerical multi-table QA over tens of thousands of tables with incomplete metadata. Decomposes the question, retrieves tables for each sub-question, and stitches together cross-table reasoning. Explicitly positioned against both text-to-SQL (which assumes a complete schema) and single-table QA (which can't combine evidence).
06 — Data Lakes

Text-to-SQL Over Data Lakes

The most ambitious version of the problem: generate SQL — possibly spanning multiple databases — against a data lake of heterogeneous tables, without a curated unified schema. Requires retrieval, schema linking, table-relationship discovery, and SQL planning to all work together.

CoRR 2024
Text2SQL is Not Enough: Unifying AI and Databases with TAG
A. Biswal, L. Patel, S. Jha, et al.
Argues that pure text-to-SQL is insufficient for realistic data-analytics tasks: many questions require reasoning, semantic operations, and tool use beyond SQL. Proposes TAG (Table-Augmented Generation) as a unifying framework — a useful conceptual reset before diving into the engineering of newer data-lake systems.
PVLDB 2026
TACO: A Benchmark for Open-Domain Text-to-SQL with Real-World Data Lakes
G. Fan, et al.
The first benchmark explicitly designed for text-to-SQL over data lakes. Evaluates three capabilities: resolving ambiguity and redundancy in NL questions, retrieving relevant tables from large heterogeneous lakes, and generating SQL pipelines that may span multiple databases. Built on 1,500 NL questions from a Beijing smart-city data service that integrates data from 31 government departments. Questions and queries are significantly longer and more complex than those in Spider or BIRD.
arXiv 2025 · Background
Text-to-SQL for Enterprise Data Analytics
(see arXiv listing for authors)
A practitioner-oriented analysis of why Spider/BIRD progress doesn't translate to enterprise settings: large schemas, semantic ambiguity, dirty data, and operational constraints. Useful framing for understanding the gap between benchmark numbers and real-world deployment.
07 — Benchmarks

Benchmarks & Evaluation

A field-wide reckoning with the benchmarks themselves is underway: classical Spider has become saturated, BIRD is approaching saturation, and several new benchmarks have been introduced to expose what current methods still can't do, especially at data-lake scale.

arXiv 2024 · Background
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
F. Lei, et al.
632 real-world workflow problems sourced from BigQuery, Snowflake, and PostgreSQL deployments. Targets dialect-specific SQL, schema scale, and external-knowledge integration. Top systems score ~35%, a dramatic drop from Spider's 90%+, demonstrating how much of recent progress was overfitted to the benchmark.
arXiv 2024 · Background
BIRD-Critic: SQL Issue Diagnosis & Repair
J. Li, et al.
Moves beyond SQL generation to SQL diagnosis and repair, built on authentic user-reported database issues. Tests whether systems can identify what's wrong with a SQL query, not just write one — a capability central to deploying these systems in real settings.
EMNLP Findings 2025 · Background
Evaluating NL2SQL via SQL2NL
(see venue for author list)
A skeptical look at NL2SQL evaluation. Examines how systems respond to paraphrased NL queries and finds substantial accuracy degradation on rewordings of the same question — a robustness gap most benchmarks don't measure.