Reading List — Table Union Search

Topics

Foundations
Column-Semantics Methods
Representation Learning
Relationship-Aware Search
Table-Centric & LLM-Based
Novelty & Diversity
Benchmarks & Evaluation
Applications

01 — Foundations

The Foundational Formulation

The paper that formally defined the problem and set the agenda: a probabilistic framework where two tables are unionable if their columns are drawn from the same domains, ranked by attribute-level statistical signals on values and embeddings.

PVLDB 2018 · Required

Table Union Search on Open Data

F. Nargesian, E. Zhu, K. Q. Pu, R. J. Miller

Defines table union search as finding the top-k tables whose columns are unionable with a query table's columns. Introduces three attribute-level unionability statistics: set-based, semantic (ontology-based), and natural-language (embedding-based), which are combined into an ensemble score. Releases the original TUS benchmark, the de facto evaluation set for nearly all follow-on work.

PDF ACM DL
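The column-then-table pattern described in this entry can be sketched as follows. This is a minimal illustration, not the paper's method: plain Jaccard overlap stands in for the statistical set-unionability measure, the one-to-one column alignment is greedy, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Value-overlap score between two columns, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def table_unionability(query, candidate):
    """Greedy one-to-one column alignment; mean matched score over query columns.

    Tables are dicts mapping column name -> list of values.
    Assumption: greedy alignment, not an optimal matching.
    """
    pairs = sorted(
        ((jaccard(qv, cv), qn, cn)
         for qn, qv in query.items()
         for cn, cv in candidate.items()),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, qn, cn in pairs:
        if qn in used_q or cn in used_c:
            continue  # each column may be aligned at most once
        used_q.add(qn)
        used_c.add(cn)
        total += score
    return total / len(query) if query else 0.0


def top_k(query, lake, k):
    """Rank lake tables (dict: name -> table) by unionability with the query."""
    ranked = sorted(
        lake,
        key=lambda name: table_unionability(query, lake[name]),
        reverse=True,
    )
    return ranked[:k]
```

Swapping `jaccard` for a semantic or embedding-based column score leaves the aggregation and ranking steps unchanged, which is the sense in which the three statistics in the entry are interchangeable column-level signals.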

02 — Column Semantics

Column-Semantics Methods

A first generation of follow-ups: better column-level semantic signals — ensembles of similarity measures, knowledge-base lookups, and richer column embeddings — but still column-by-column matching aggregated to a table score.

ICDE 2020

D³L: Dataset Discovery in Data Lakes

A. Bogatu, A. A. A. Fernandes, N. W. Paton, N. Konstantinou

Combines five column-pairwise similarity measures (attribute names, value overlap, formats, word embeddings, and numeric domain distributions) into a multi-criteria distance function, and supports both unionable and joinable discovery on the same indexing substrate. A common comparison baseline.

IEEE arXiv
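To make the idea of a multi-criteria distance concrete, here is a small sketch that combines three kinds of column evidence into one number. It assumes a weighted Euclidean combination and covers only name, value, and format evidence; the helper names, the q-gram size, and the equal default weights are illustrative choices, not details taken from the paper.

```python
import math
import re


def qgrams(text, q=3):
    """Lower-cased character q-grams (the whole string if it is no longer than q)."""
    text = text.lower()
    if len(text) <= q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_distance(a, b):
    """1 - Jaccard similarity; two empty sets count as maximally distant."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def format_pattern(value):
    """Collapse a value to a coarse shape, e.g. 'AB-123' -> 'A-9'."""
    shape = re.sub(r"[A-Za-z]+", "A", str(value))
    return re.sub(r"[0-9]+", "9", shape)


def column_distance(name_a, values_a, name_b, values_b, weights=(1.0, 1.0, 1.0)):
    """Weighted Euclidean combination of name, value, and format distances, in [0, 1]."""
    distances = (
        jaccard_distance(qgrams(name_a), qgrams(name_b)),
        jaccard_distance(values_a, values_b),
        jaccard_distance(map(format_pattern, values_a), map(format_pattern, values_b)),
    )
    weighted = sum(w * d * d for w, d in zip(weights, distances))
    return math.sqrt(weighted / sum(weights))
```

Two postal-code columns with disjoint values still end up closer than unrelated columns here, because the format evidence agrees even when the value evidence does not.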

03 — Representation Learning

Deep Representation Learning

Methods that learn column or table embeddings end-to-end with pre-trained language models and self-supervised contrastive objectives. Encoders capture context across the whole column (and sometimes the table), and search reduces to nearest-neighbor lookup over the embeddings.
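A self-supervised contrastive objective of the kind mentioned above can be illustrated with a generic InfoNCE loss. This is a sketch under assumptions: the vectors are hand-written stand-ins for encoder outputs, the two "views" of a column would in practice come from data augmentation, and the temperature value is arbitrary. It is not the exact loss of any paper in this section.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] should be closest to positives[i]; the other positives
    in the batch act as negatives for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is near zero when each anchor is most similar to its own positive and grows when a different column's view is closer, which is what pushes unionable columns together in the embedding space.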

PVLDB 2023 · Required

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning (Starmie)

G. Fan, J. Wang, Y. Li, D. Zhang, R. J. Miller

Pre-trains a column encoder with multi-column contrastive learning so that columns within the same table inform each other's embedding. Cosine similarity becomes the column unionability score, and a filter-and-verify framework with HNSW indexing gives sub-linear search. Reports a 6.8-point MAP gain over prior baselines and has become a standard reference point in 2024–25 work.

PDF ACM DL
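The step from column-level cosine scores to a table-level score can be sketched as a one-to-one assignment problem. The sketch below assumes the table score is the best total similarity over column assignments and solves it by brute force, so it only suits a handful of columns; the embeddings are hand-written placeholders rather than encoder outputs.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_score(query_cols, cand_cols):
    """Maximum total cosine similarity over one-to-one column assignments.

    Every column of the narrower table is matched to a distinct column
    of the wider one. Brute force: factorial in the number of columns.
    """
    small, large = sorted((query_cols, cand_cols), key=len)
    return max(
        sum(cosine(small[i], large[j]) for i, j in enumerate(assignment))
        for assignment in permutations(range(len(large)), len(small))
    )
```

A real system would replace the permutation loop with a polynomial-time assignment solver and would only run this verification on candidates that survive an index-based filter.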

arXiv 2023

Pylon: Semantic Table Union Search in Data Lakes

T. Cong, F. Nargesian, H. V. Jagadish

Frames union search as an unsupervised representation-learning problem with a contrastive objective that is aware of the downstream search/index structure. Tightly couples training and retrieval so that learned embeddings are good not just intrinsically but for the specific top-k nearest-neighbor query.

arXiv PDF

04 — Relationships

Relationship-Aware Search

A different cut at the problem: unionability shouldn't just depend on individual columns drawing from the same domain. It should also require that the relationships between columns are consistent across the two tables. This pushes table union search away from column-by-column matching toward a structural view.

SIGMOD 2023 · Required

SANTOS: Relationship-Based Semantic Table Union Search

A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gatterbauer, R. J. Miller, M. Riedewald

Two tables are unionable when their columns share both domain semantics and binary relationships. SANTOS retrieves relationships from an existing KB (YAGO) and synthesizes a KB from the data lake itself when existing KB coverage is thin. Releases relabeled and new benchmarks; outperforms the original TUS pipeline by a wide margin.

ACM DL arXiv Code
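The difference between column-level and relationship-level matching can be shown with a toy scoring function. This sketch assumes column types and relationship labels have already been assigned (for example from a knowledge base); the triple representation and the function names are hypothetical simplifications, not the SANTOS data model.

```python
def relationship_signature(column_types, relations):
    """Turn (column, label, column) edges into type-level triples.

    column_types maps column name -> semantic type;
    relations is a list of (column_a, relation_label, column_b).
    """
    return {
        (column_types[a], label, column_types[b])
        for a, label, b in relations
    }


def relationship_score(query_sig, candidate_sig):
    """Fraction of the query's relationships that the candidate also expresses."""
    if not query_sig:
        return 0.0
    return len(query_sig & candidate_sig) / len(query_sig)


# A query table about parks and the cities they are located in.
query = relationship_signature(
    {"Park": "park", "City": "city"},
    [("Park", "locatedIn", "City")],
)
# Same relationship under different column names.
parks = relationship_signature(
    {"Name": "park", "Town": "city"},
    [("Name", "locatedIn", "Town")],
)
# Shares a city column, but relates it to people rather than parks.
births = relationship_signature(
    {"Person": "person", "Town": "city"},
    [("Person", "bornIn", "Town")],
)
```

A purely column-based score would give the `births` table credit for its city column, while `relationship_score(query, births)` is zero because no relationship carries over.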

05 — Table-Centric & LLM-Based

Table-Centric & LLM-Based Approaches

A 2024–25 turn in the literature: rather than scoring column pairs and aggregating to a table score, learn holistic table embeddings or invoke an LLM directly. Promises both better ranking quality (no aggregation artifacts) and faster retrieval (one vector per table).

arXiv 2026

Efficient and Effective Table-Centric Table Union Search in Data Lakes (TACTUS)

Y. Sun, Z. Ding, H. Wang, R. Cheng

Argues that column-centric methods miss holistic table semantics, which hurts ranking quality. Generates a single table embedding via attentive table encoding with positive-pair construction and two-pronged negative sampling (avoiding latent positives, mining hard negatives). Retrieval uses an adaptive candidate-pool method that exploits the table-score distribution; column matching only happens in a final verification step.

arXiv PDF
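The coarse-then-verify retrieval shape described in this entry can be sketched in a few lines. The pool rule below (keep at least k tables, plus any others scoring within a fixed margin of the k-th best) is an invented stand-in for the paper's adaptive method, and `verify` is a placeholder for whatever column-level scorer is used in the final step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pool(query_vec, table_vecs, k, margin=0.1):
    """Coarse stage: rank tables by one-vector similarity (k >= 1).

    Keeps the top k, plus any further tables whose score is within
    `margin` of the k-th best. The margin rule is an assumption.
    """
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in table_vecs.items()),
        reverse=True,
    )
    if len(scored) <= k:
        return [name for _, name in scored]
    cutoff = scored[k - 1][0] - margin
    return [name for score, name in scored if score >= cutoff]


def search(query_vec, table_vecs, verify, k, margin=0.1):
    """Fine stage: re-rank only the pooled tables with a costlier `verify` scorer."""
    pool = candidate_pool(query_vec, table_vecs, k, margin)
    return sorted(pool, key=verify, reverse=True)[:k]
```

The design point is that the expensive scorer runs on the pool only, so its cost depends on how tightly the table-level scores cluster near the top rather than on the size of the lake.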

arXiv 2025

EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes

T. Otto

An LLM-based framework that operates directly on table content without per-lake fine-tuning, enabling search across multiple data lakes. Comes with TUSBench, a benchmarking environment integrated in the same framework, oriented toward reproducibility and joint evaluation of accuracy and runtime.

arXiv PDF

06 — Novelty & Diversity

Novelty & Diversity in Union Results

Once retrieval works well, the next question becomes: are the returned tables actually adding new information? A small but growing line of work treats unionability as necessary but not sufficient, and ranks by what new rows a union would contribute.

EDBT 2026 · pp. 42–55

Diverse Unionable Tuple Search: Novelty-Driven Discovery in Data Lakes

A. Khatiwada, R. Shraga, R. J. Miller

Reframes the unit of discovery from tables to tuples: given a query table, return a set of unionable tuples from across the lake that is both relevant and maximally diverse — so the union actually adds new information rather than duplicating what the analyst already has. Formulates the problem as joint optimization over unionability and novelty/diversity, with retrieval and ranking algorithms designed for that combined objective.

DBLP NSF PAR
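A joint relevance-and-diversity objective of the kind described in this entry is often approximated greedily. The sketch below uses a standard maximal-marginal-relevance (MMR) style rule, which is a swapped-in textbook technique and not the paper's algorithm; the `trade_off` parameter and the positional tuple distance are illustrative assumptions.

```python
def select_diverse(candidates, relevance, distance, k, trade_off=0.5):
    """Greedy MMR-style selection of up to k items.

    Each step picks the candidate maximising
    trade_off * relevance + (1 - trade_off) * distance to the nearest chosen item.
    """
    chosen, remaining = [], list(candidates)
    while remaining and len(chosen) < k:
        def gain(item):
            # With nothing chosen yet, every item counts as fully novel.
            novelty = min((distance(item, c) for c in chosen), default=1.0)
            return trade_off * relevance(item) + (1 - trade_off) * novelty
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def tuple_distance(a, b):
    """Fraction of positions where two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

With `trade_off=1.0` the selection degenerates to plain relevance ranking, which makes near-duplicate tuples likely; lowering it trades some relevance for tuples that differ from what has already been picked.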

ICDE 2026

Novel Table Search

B. Kassaie, R. J. Miller

Asks a complementary question at the table level: which tables in the lake offer content that is genuinely novel with respect to a query table? Existing unionability search rewards tables that look similar to the query — exactly the wrong objective if the goal is to enrich. Defines novelty for table search and proposes retrieval methods that surface tables most likely to extend rather than echo the query.

Author page (Kassaie)

07 — Benchmarks

Benchmarks & Evaluation

Comparing methods is hard in this space: benchmark choices materially affect rankings. Several recent papers argue that the field has been overfitting to a handful of datasets and that the negative examples in these benchmarks may be "too easy".

PVLDB Workshop 2024

ALT-GEN: Benchmarking Table Union Search using Large Language Models

K. Pal, A. Khatiwada, R. Shraga, R. J. Miller

Uses LLMs to generate evaluation benchmarks for table union search — addressing the chronic scarcity of large, well-labeled benchmarks. Complements rather than replaces SANTOS and TUS benchmarks; useful both as a benchmark-generation method and as a study of LLMs' notion of unionability.

NSF PAR

SANTOS Benchmark 2023

SANTOS Benchmark for Table Union Search (Dataset)

A. Khatiwada et al.

Two new tabular benchmarks (small and large real data lakes) plus a relabeled version of the original TUS benchmark that incorporates inter-column relationships. The standard set for evaluating relationship-aware methods.

Zenodo

PVLDB 2024

LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

Y. Deng, C. Chai, L. Cao, Q. Jin, Y. Hai, J. Wang, J. Fan, Y. Yuan, G. Li

A unified, large-scale benchmark covering both joinable and unionable table search. It offers 16M+ tables and 10K+ labeled queries, with reference implementations for major baselines, which makes it a natural starting point for project evaluations and a consistent basis for comparing modern methods head-to-head.

PDF ACM DL
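Head-to-head comparisons on labeled queries usually come down to a few ranking metrics. The sketch below shows one common way to compute precision, recall, and mean average precision at k. Definitions of average precision differ between papers (for example in how the sum is normalized), so this should be read as one plausible variant and checked against each benchmark's own evaluation script.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are labeled unionable."""
    hits = sum(1 for table in ranked[:k] if table in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Share of the labeled unionable tables that appear in the top k."""
    hits = sum(1 for table in ranked[:k] if table in relevant)
    return hits / len(relevant) if relevant else 0.0


def average_precision_at_k(ranked, relevant, k):
    """Mean of precision@i over the ranks i <= k that hold a relevant table."""
    hits, total = 0, 0.0
    for i, table in enumerate(ranked[:k], start=1):
        if table in relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def mean_average_precision(results, ground_truth, k):
    """results: query id -> ranked list; ground_truth: query id -> set of relevant tables."""
    scores = [
        average_precision_at_k(results[q], ground_truth[q], k)
        for q in ground_truth
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Reporting these alongside runtime, and on more than one benchmark, is a simple guard against the overfitting concern raised at the top of this section.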

08 — Applications

Applications

Once discovered, tables can be integrated or used for table verification, an approach called table reclamation.

PVLDB 2022

Integrating Data Lake Tables

A. Khatiwada, R. Shraga, W. Gatterbauer, R. J. Miller

Goes beyond search: once unionable tables are retrieved, how do you actually combine them? Studies alignment of columns and resolution of heterogeneity across union candidates — a useful framing for "what comes after the search".

PDF ACM DL

ICDE 2024

Gen-T: Table Reclamation in Data Lakes

G. Fan, R. Shraga, R. J. Miller

Introduces a new problem that sits downstream of table discovery: given a source table, find and integrate a set of data lake tables that reproduce it as closely as possible. Defines an error-aware instance similarity measure and supports a richer query language than SPJ — including unions, outerjoins, subsumption, and complementation. A useful complement to search-focused papers when thinking about what data lakes need to do after a candidate set is retrieved.

PDF IEEE DL
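Two of the operators named in this entry, subsumption and complementation, are easy to illustrate on tuples with nulls. The sketch below follows the definitions as they are commonly stated in the data integration literature, with `None` standing for null. It does not reproduce Gen-T's search procedure or its error-aware similarity measure, and the function names are hypothetical.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if it agrees with every non-null value of t2
    and is non-null in strictly more positions."""
    agrees = all(b is None or a == b for a, b in zip(t1, t2))
    return agrees and sum(v is not None for v in t1) > sum(v is not None for v in t2)


def remove_subsumed(rows):
    """Drop every tuple that some other tuple subsumes."""
    return [r for r in rows if not any(subsumes(other, r) for other in rows)]


def complement(t1, t2):
    """Merge two tuples that agree on at least one shared non-null value,
    never conflict, and each fill a null of the other.

    Returns the merged tuple, or None if the pair does not qualify.
    """
    shared = [(a, b) for a, b in zip(t1, t2) if a is not None and b is not None]
    if not shared or any(a != b for a, b in shared):
        return None
    fills_t2 = any(a is not None and b is None for a, b in zip(t1, t2))
    fills_t1 = any(a is None and b is not None for a, b in zip(t1, t2))
    if not (fills_t1 and fills_t2):
        return None
    return tuple(a if a is not None else b for a, b in zip(t1, t2))
```

Applied after an outer union of candidate tables, these two steps remove redundant partial tuples and stitch together partial tuples that describe the same entity, which is the kind of work that goes beyond select-project-join queries.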