01 — Foundations
The Foundational Formulation
The paper that formally defined the problem and set the agenda: a probabilistic framework where two tables are unionable if their columns are drawn from the same domains, ranked by attribute-level statistical signals on values and embeddings.
PVLDB 2018 · Required
Table Union Search on Open Data
F. Nargesian, E. Zhu, K. Q. Pu, R. J. Miller
Defines table union search as finding the top-k tables whose columns are unionable with a query table's columns. Introduces three attribute-level unionability statistics — set, semantic (entity-based), and natural-language (embedding-based) — combined into an ensemble score. Releases the original TUS benchmark, the de facto evaluation set for nearly all follow-on work.
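To make the step from column scores to a table ranking concrete, here is a minimal sketch. It is not the paper's scoring: plain Jaccard overlap stands in for the set-unionability statistic, and the table score is the best one-to-one column alignment averaged over the query's columns. The function names, the brute-force alignment, and the toy tables are assumptions for illustration only.

```python
from itertools import permutations


def jaccard(a, b):
    """Overlap of two columns' distinct values (a stand-in for a statistical test)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def table_unionability(query, candidate):
    """Best one-to-one column alignment, averaged over the query's columns."""
    q, c = list(query.values()), list(candidate.values())
    if not q or not c:
        return 0.0
    # Map every column of the narrower table to a distinct column of the wider one.
    small, large = (q, c) if len(q) <= len(c) else (c, q)
    best = max(
        sum(jaccard(col, large[j]) for col, j in zip(small, perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / len(q)


def top_k(query, lake, k):
    """Rank every table in the lake and keep the k highest-scoring ones."""
    scored = [(name, table_unionability(query, table)) for name, table in lake.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


query = {"city": ["Toronto", "Boston", "Paris"], "country": ["Canada", "USA", "France"]}
lake = {
    "cities_eu": {"town": ["Paris", "Berlin", "Rome"], "nation": ["France", "Germany", "Italy"]},
    "parts": {"sku": ["A1", "B2"], "price": ["9.99", "4.50"]},
}
print(top_k(query, lake, k=1))
```

The exhaustive alignment is only workable for narrow tables; a real system would use a bipartite matching algorithm and an index rather than scanning the whole lake.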
02 — Column Semantics
Column-Semantics Methods
A first generation of follow-ups: better column-level semantic signals — ensembles of similarity measures, knowledge-base lookups, and richer column embeddings — but still column-by-column matching aggregated to a table score.
ICDE 2020
D³L: Dataset Discovery in Data Lakes
A. Bogatu, A. A. A. Fernandes, N. W. Paton, N. Konstantinou
Combines five column-pairwise similarity measures (attribute names, attribute values, formats, word embeddings, and numeric domain distributions) into a multi-criteria distance function, supporting both unionable and joinable discovery on the same indexing substrate. A common comparison baseline.
03 — Representation Learning
Deep Representation Learning
Methods that learn column or table embeddings end-to-end with pre-trained language models and self-supervised contrastive objectives. Encoders capture context across the whole column (and sometimes the table), and search reduces to nearest-neighbor lookup over the embeddings.
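As a rough illustration of what a contrastive objective rewards, the sketch below computes an InfoNCE-style loss over two "views" of the same columns. It is a generic formulation, not the training code of any paper listed here; the function names, the temperature value, and the toy two-dimensional embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should be closest to positives[i].

    Every other entry of `positives` acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two columns, each embedded from two augmented views (toy numbers).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
view_b_swapped = [[0.1, 0.9], [0.9, 0.1]]

print(info_nce(view_a, view_b_aligned) < info_nce(view_a, view_b_swapped))
```

The loss is low when each column's two views land close together and far from other columns, which is what lets a plain nearest-neighbor lookup work at query time.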
PVLDB 2023 · Required
Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning (Starmie)
G. Fan, J. Wang, Y. Li, D. Zhang, R. J. Miller
Pre-trains a column encoder with multi-column contrastive learning so that columns within the same table inform each other's embedding. Cosine similarity between column embeddings serves as the column unionability score, and a filter-and-verify framework with HNSW indexing keeps search sub-linear. Reports a gain of 6.8 in MAP over prior baselines and has become a standard reference point for 2024–25 work.
arXiv 2023
Pylon: Semantic Table Union Search in Data Lakes
T. Cong, F. Nargesian, H. V. Jagadish
Frames union search as an unsupervised representation-learning problem with a contrastive objective that is aware of the downstream search/index structure. Tightly couples training and retrieval so that learned embeddings are good not just intrinsically but for the specific top-k nearest-neighbor query.
04 — Relationships
Relationship-Aware Search
A different cut at the problem: unionability shouldn't just depend on individual columns drawing from the same domain. It should also require that the relationships between columns are consistent across the two tables. This pushes table union search away from column-by-column matching toward a structural view.
SIGMOD 2023 · Required
SANTOS: Relationship-Based Semantic Table Union Search
A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gatterbauer, R. J. Miller, M. Riedewald
Two tables are unionable when their columns share both domain semantics and binary relationships. SANTOS draws relationships from an existing KB (YAGO) and, where KB coverage is thin, from a KB synthesized from the data lake itself. Releases relabeled and new benchmarks; outperforms the original TUS pipeline by a wide margin.
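A toy example may help show why relationships matter beyond column domains. The sketch assumes each table has already been annotated with (subject type, predicate, object type) triples for its column pairs; the annotation step, the triple format, and the scoring function are illustrative assumptions and not SANTOS's actual scoring.

```python
def relationship_score(query_rels, candidate_rels):
    """Fraction of the query's column-pair relationships also found in the candidate."""
    if not query_rels:
        return 0.0
    return len(query_rels & candidate_rels) / len(query_rels)


# Hypothetical annotations: every table has a Person column and a City column.
query_rels = {("Person", "bornIn", "City"), ("City", "locatedIn", "Country")}
birthplaces = {("Person", "bornIn", "City")}
residences = {("Person", "livesIn", "City")}

print(relationship_score(query_rels, birthplaces))  # shares one of two relationships
print(relationship_score(query_rels, residences))   # same domains, different relationship
```

A purely column-level method would see Person and City columns in both candidates and score them alike; the relationship view separates a table of birthplaces from a table of residences.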
05 — Table-Centric & LLM-Based
Table-Centric & LLM-Based Approaches
A 2024–25 turn in the literature: rather than scoring column pairs and aggregating to a table score, learn holistic table embeddings or invoke an LLM directly. Promises both better ranking quality (no aggregation artifacts) and faster retrieval (one vector per table).
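With one vector per table, retrieval reduces to a single similarity ranking. The following is a minimal brute-force sketch under that assumption; the embeddings are made-up numbers, the function names are hypothetical, and a production system would replace the linear scan with an approximate nearest-neighbor index.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_tables(query_vec, table_vecs, k):
    """Return the k (score, table_name) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, vec), name) for name, vec in table_vecs.items())
    return heapq.nlargest(k, scored)


# Toy table embeddings; in practice these come from a learned table encoder.
table_vecs = {
    "census_2020": [1.0, 0.0],
    "parts_catalog": [0.0, 1.0],
    "census_2010": [1.0, 1.0],
}
for score, name in top_k_tables([1.0, 0.1], table_vecs, k=2):
    print(name, round(score, 3))
```

No column alignment is computed here at all, which is the source of both the speed advantage and the need for a later verification step in table-centric designs.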
arXiv 2026
Efficient and Effective Table-Centric Table Union Search in Data Lakes (TACTUS)
Y. Sun, Z. Ding, H. Wang, R. Cheng
Argues that column-centric methods miss holistic table semantics, which hurts ranking quality. Generates a single table embedding via attentive table encoding with positive-pair construction and two-pronged negative sampling (avoiding latent positives, mining hard negatives). Retrieval uses an adaptive candidate-pool method that exploits the table-score distribution; column matching only happens in a final verification step.
arXiv 2025
EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes
T. Otto
An LLM-based framework that operates directly on table content without per-lake fine-tuning, enabling search across multiple data lakes. Comes with TUSBench, a benchmarking environment integrated in the same framework, oriented toward reproducibility and joint evaluation of accuracy and runtime.
06 — Novelty & Diversity
Novelty & Diversity in Union Results
Once retrieval works well, the next question becomes: are the returned tables actually adding new information? A small but growing line of work treats unionability as necessary but not sufficient, and ranks by what new rows a union would contribute.
EDBT 2026, pp. 42–55
Diverse Unionable Tuple Search: Novelty-Driven Discovery in Data Lakes
A. Khatiwada, R. Shraga, R. J. Miller
Reframes the unit of discovery from tables to tuples: given a query table, return a set of unionable tuples from across the lake that is both relevant and maximally diverse — so the union actually adds new information rather than duplicating what the analyst already has. Formulates the problem as joint optimization over unionability and novelty/diversity, with retrieval and ranking algorithms designed for that combined objective.
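One simple way to trade relevance against redundancy is a greedy, MMR-style selection, sketched below. This is a swapped-in textbook heuristic, not the algorithm from the paper; the relevance scores, the field-match similarity, and the `trade_off` parameter are assumptions chosen for illustration.

```python
def field_similarity(t1, t2):
    """Fraction of positions where two equal-length tuples hold the same value."""
    return sum(a == b for a, b in zip(t1, t2)) / len(t1)


def greedy_diverse(candidates, relevance, k, trade_off=0.5):
    """Pick k tuples, each time maximizing relevance minus similarity to earlier picks."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(candidate):
            redundancy = max(
                (field_similarity(candidate, s) for s in selected), default=0.0
            )
            return trade_off * relevance[candidate] - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


paris_a = ("Paris", "France", "2.1M")
paris_b = ("Paris", "France", "2.2M")
lima = ("Lima", "Peru", "9.7M")
relevance = {paris_a: 0.9, paris_b: 0.85, lima: 0.6}

print(greedy_diverse([paris_a, paris_b, lima], relevance, k=2))
```

A pure relevance ranking would return both Paris tuples; the redundancy penalty pushes the second pick toward the less similar Lima tuple.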
ICDE 2026
Novel Table Search
B. Kassaie, R. J. Miller
Asks a complementary question at the table level: which tables in the lake offer content that is genuinely novel with respect to a query table? Existing unionability search rewards tables that look similar to the query — exactly the wrong objective if the goal is to enrich. Defines novelty for table search and proposes retrieval methods that surface tables most likely to extend rather than echo the query.
07 — Benchmarks
Benchmarks & Evaluation
Comparing methods is famously hard in this space: benchmark choices materially affect rankings, and several recent papers argue that the field has been overfitting to a small handful of datasets whose negative examples may be "too easy".
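Since most comparisons in this literature are reported as ranking metrics such as MAP@k, a small reference implementation can help when reading results tables. This is a sketch of one common definition; papers differ on the normalizer, and the choice of min(k, number of relevant tables) below is an assumption, as are the function names and toy labels.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average of precision values at each rank (up to k) where a relevant table appears."""
    hits, precision_sum = 0, 0.0
    for rank, table in enumerate(ranked[:k], start=1):
        if table in relevant:
            hits += 1
            precision_sum += hits / rank
    normalizer = min(k, len(relevant))
    return precision_sum / normalizer if normalizer else 0.0


def mean_average_precision(runs, k):
    """Mean of AP@k over (ranked_list, relevant_set) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


runs = [
    (["t1", "t9", "t3", "t4"], {"t1", "t3", "t7"}),  # hits at ranks 1 and 3
    (["t5", "t6"], {"t6"}),                          # hit at rank 2
]
print(mean_average_precision(runs, k=4))
```

Because the ground-truth set enters both the hit test and the normalizer, relabeling a benchmark or changing its negatives can shift this number without any change to the method being evaluated.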
PVLDB Workshop 2024
ALT-GEN: Benchmarking Table Union Search using Large Language Models
K. Pal, A. Khatiwada, R. Shraga, R. J. Miller
Uses LLMs to generate evaluation benchmarks for table union search — addressing the chronic scarcity of large, well-labeled benchmarks. Complements rather than replaces SANTOS and TUS benchmarks; useful both as a benchmark-generation method and as a study of LLMs' notion of unionability.
SANTOS Benchmark 2023
SANTOS Benchmark for Table Union Search (Dataset)
A. Khatiwada et al.
Two new tabular benchmarks (small and large real data lakes) plus a relabeled version of the original TUS benchmark that incorporates inter-column relationships. The standard set for evaluating relationship-aware methods.
PVLDB 2024
LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes
Y. Deng, C. Chai, L. Cao, Q. Jin, Y. Hai, J. Wang, J. Fan, Y. Yuan, G. Li
A unified, large-scale benchmark covering both joinable and unionable table search. 16M+ tables, 10K+ labeled queries, with reference implementations for major baselines — the natural starting point for project evaluations and the most reliable way to compare modern methods head-to-head.
08 — Applications
Applications
Once discovered, tables can be integrated or used for table verification (an approach called table reclamation).
PVLDB 2022
Integrating Data Lake Tables
A. Khatiwada, R. Shraga, W. Gatterbauer, R. J. Miller
Goes beyond search: once unionable tables are retrieved, how do you actually combine them? Studies alignment of columns and resolution of heterogeneity across union candidates — a useful framing for "what comes after the search".
ICDE 2024
Gen-T: Table Reclamation in Data Lakes
G. Fan, R. Shraga, R. J. Miller
Introduces a new problem that sits downstream of table discovery: given a source table, find and integrate a set of data lake tables that reproduce it as closely as possible. Defines an error-aware instance similarity measure and supports a richer query language than SPJ, including unions, outer joins, subsumption, and complementation. A useful complement to search-focused papers when thinking about what data lakes need to do after a candidate set is retrieved.
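Subsumption and complementation are less familiar than joins and unions, so a small sketch of the two operators on tuples with nulls may help. It follows one common formulation from the data-integration literature, with `None` standing for null; it is not Gen-T's implementation, and the function names and sample rows are assumptions.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if it agrees with every non-null value of t2 and has more non-nulls."""
    agrees = all(b is None or a == b for a, b in zip(t1, t2))
    return agrees and sum(v is not None for v in t1) > sum(v is not None for v in t2)


def remove_subsumed(rows):
    """Drop every row that some other row subsumes."""
    return [r for r in rows if not any(subsumes(other, r) for other in rows)]


def complement(t1, t2):
    """Merge two rows that share a value, never conflict, and fill each other's nulls.

    Returns the merged tuple, or None when the rows do not complement each other.
    """
    shared = [(a, b) for a, b in zip(t1, t2) if a is not None and b is not None]
    if not shared or any(a != b for a, b in shared):
        return None
    t2_fills_t1 = any(a is None and b is not None for a, b in zip(t1, t2))
    t1_fills_t2 = any(b is None and a is not None for a, b in zip(t1, t2))
    if not (t2_fills_t1 and t1_fills_t2):
        return None
    return tuple(a if a is not None else b for a, b in zip(t1, t2))


rows = [("Ann", "Paris", None), ("Ann", None, None)]
print(remove_subsumed(rows))
print(complement(("Ann", "Paris", None), ("Ann", None, 30)))
```

Applied after an outer union, these two operators collapse partial rows about the same entity into fewer, more complete rows, which is the kind of integration step that reproducing a source table from lake tables relies on.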