Joinable Table Search in Data Lakes

A curated set of papers on the problem of joinable table search — discovering, given a query column, tables in a data lake whose columns can be meaningfully joined with it. The list moves from foundational set-overlap techniques to recent embedding-based, n-ary, benchmark, and application work.

Topics

  1. Foundations & Set-Overlap Search
  2. Semantic & Embedding-Based Search
  3. N-ary Joins & Composite Keys
  4. Transformation-Based Joins
  5. Context-Aware Joinability
  6. Benchmarks & Evaluation
  7. Applications
01 — Foundations

Set-Overlap Approaches

The earliest formulations of joinable table search frame the problem as set overlap on column values: two columns are considered joinable if their value sets have a large intersection. These papers establish the canonical problem definitions and the scalable indexes used to solve them.

SIGMOD 2019 (Required)
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
E. Zhu, D. Deng, F. Nargesian, R. J. Miller
Defines the joinable table search problem as a top-k overlap set similarity query. Introduces a cost-aware algorithm that mixes prefix-filtering with position-list scans to scale to lakes with millions of columns. The reference exact baseline for nearly all follow-on work.
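To make the top-k overlap formulation concrete, here is a minimal sketch that answers the query by scanning every posting list of an inverted index. It is an exhaustive baseline under simple assumptions, not JOSIE's cost-aware algorithm; the function names and the toy lake are hypothetical.

```python
from collections import defaultdict
import heapq


def build_inverted_index(columns):
    """Map each distinct value to the set of column ids that contain it."""
    index = defaultdict(set)
    for col_id, values in columns.items():
        for value in set(values):
            index[value].add(col_id)
    return index


def top_k_overlap(query_values, index, k):
    """Return the k columns with the largest value overlap with the query."""
    counts = defaultdict(int)
    for value in set(query_values):
        for col_id in index.get(value, ()):
            counts[col_id] += 1
    # Largest overlap first; ties are broken by column id for determinism.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy lake: column id -> cell values.
lake = {
    "cities.name": ["Toronto", "Boston", "Paris", "Lima"],
    "offices.city": ["Boston", "Paris", "Berlin"],
    "fruits.name": ["Apple", "Pear"],
}
index = build_inverted_index(lake)
result = top_k_overlap(["Paris", "Boston", "Lima", "Oslo"], index, k=2)
print(result)
```

Reading every posting list is exactly the cost that a prefix-filtering strategy tries to avoid, so this sketch is best treated as a correctness reference on small inputs.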
PVLDB 2016
LSH Ensemble: Internet-Scale Domain Search
E. Zhu, F. Nargesian, K. Q. Pu, R. J. Miller
Tackles the variable-cardinality problem: when column sizes range from hundreds to millions, a single LSH index degrades badly. The ensemble partitions columns by size and tunes each partition's LSH parameters to approximate the containment measure efficiently.
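The size sensitivity can be seen from the algebra relating containment to Jaccard similarity. The sketch below computes the Jaccard value implied by a containment threshold for given set sizes and splits columns into equal-count size partitions. Equal-count partitioning and all function names are assumptions made for illustration; no MinHash or LSH index is built here.

```python
def containment(query, column):
    """Fraction of distinct query values that also appear in the column."""
    q, x = set(query), set(column)
    return len(q & x) / len(q) if q else 0.0


def jaccard(query, column):
    """Intersection over union of the two value sets."""
    q, x = set(query), set(column)
    union = q | x
    return len(q & x) / len(union) if union else 0.0


def jaccard_threshold(t, query_size, column_size):
    """Jaccard value implied by containment t for the given set sizes."""
    overlap = t * query_size
    return overlap / (query_size + column_size - overlap)


def partition_by_size(column_sizes, num_partitions):
    """Split column ids into equal-count partitions ordered by set size."""
    ordered = sorted(column_sizes, key=column_sizes.get)
    step = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + step] for i in range(0, len(ordered), step)]


# The same containment of 0.5 implies very different Jaccard values.
same_size = jaccard_threshold(0.5, query_size=100, column_size=100)
much_larger = jaccard_threshold(0.5, query_size=100, column_size=10_000)
partitions = partition_by_size({"a": 5, "b": 500, "c": 50, "d": 5000}, 2)
print(same_size, much_larger, partitions)
```

Because the implied Jaccard value shrinks as the column grows, one global Jaccard threshold is either too loose for small columns or too strict for large ones. Plugging a partition's largest column size into `jaccard_threshold` gives a conservative per-partition threshold.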
02 — Semantic Methods

Embedding-Based & Semantic Search

Equi-join overlap misses joinable columns that differ in surface form — "Apple" vs. "Apple Inc.", "U.S.A." vs. "United States". This line of work embeds values or whole columns into vector spaces so that semantically related columns are close, enabling fuzzy and cross-language joins.

ICDE 2021 (Required)
Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach (PEXESO)
Y. Dong, K. Takeoka, C. Xiao, M. Oyamada
Embeds textual cell values as high-dimensional vectors using pre-trained word embeddings and defines joinability via similarity predicates on those vectors. Introduces a pivot-based block-and-verify index and a partitioning scheme for out-of-core data lakes.
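One way to read "joinability via similarity predicates" is sketched below: a query column counts as joinable with a candidate when a large enough fraction of its value vectors has at least one candidate vector within a distance threshold. The brute-force scan, the Euclidean metric, the parameter names `tau` and `min_fraction`, and the toy 2-D vectors are all assumptions for illustration; the pivot-based index is not reproduced.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match_fraction(query_vecs, column_vecs, tau):
    """Fraction of query vectors with a column vector within distance tau."""
    if not query_vecs:
        return 0.0
    matched = sum(
        1 for q in query_vecs
        if any(euclidean(q, x) <= tau for x in column_vecs)
    )
    return matched / len(query_vecs)


def is_joinable(query_vecs, column_vecs, tau, min_fraction):
    """Apply the column-level predicate on top of the value-level one."""
    return match_fraction(query_vecs, column_vecs, tau) >= min_fraction


# Hypothetical toy embeddings standing in for word vectors of cell values.
query = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
candidate = [(0.1, 0.0), (1.0, 1.1), (9.0, 9.0)]
fraction = match_fraction(query, candidate, tau=0.2)
print(fraction, is_joinable(query, candidate, tau=0.2, min_fraction=0.6))
```

The nested loop is quadratic in the number of values per column pair, which is the cost a block-and-verify index is meant to cut down.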
PVLDB 2023 (Required)
DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Y. Dong, C. Xiao, K. Takeoka, M. Oyamada, W.-C. Tan
Fine-tunes a pre-trained language model (PLM) to embed entire columns (rather than individual cells) into a fixed-length vector, unifying equi- and semantic joins under one retrieval framework. Couples the encoder with an HNSW index for approximate nearest-neighbor search, so retrieval cost grows roughly logarithmically with repository size rather than linearly.
CIDR 2023
WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses
T. Cong, J. Gale, J. Frankle, C. Jia, Ç. Demiralp
Uses pre-trained WebTable (FastText) embeddings rather than task-specific fine-tuning, indexed with HNSW. Aimed at cloud warehouse settings; a useful counterpoint to DeepJoin as a baseline that uses embeddings without any task-specific training.
PVLDB 2025 (Required)
OmniMatch: Joinability Discovery in Data Products
C. Koutras, J. Zhang, X. Qin, C. Lei, V. N. Ioannidis, C. Faloutsos, G. Karypis, A. Katsifodimos
Targets joinability discovery in data products — curated collections of tabular datasets — rather than open data lakes. Combines multiple column-pair similarity signals through a self-supervised Graph Neural Network (RGCN), so similarity propagates across the column-relatedness graph and indirect join relationships emerge. Uses automated negative-example generation to sharpen precision, and reports up to 14% F1/AUC gains over DeepJoin and other baselines without per-metric thresholds.
PVLDB 2024
HyperJoin: LLM-augmented Hypergraph Link Prediction for Joinable Table Discovery
Y. Chen et al.
Models a table corpus as a hypergraph in which columns are nodes and hyperedges encode joinability — including LLM-augmented inter-table hyperedges. Joinable table discovery is recast as link prediction over the hypergraph, with a maximum-spanning-tree reranker for coherence. Reports +21% Precision@15 / +17% Recall@15 over the best prior baselines.
03 — Composite Keys

N-ary Joins & Multi-Column Keys

Real tables often join on combinations of columns — (first_name, last_name) or (city, state) — not single keys. Methods built for unary joins miss these or degrade badly when applied combinatorially.

PVLDB 2022
MATE: Multi-Attribute Table Extraction
M. Esmailoghli, J.-A. Quiané-Ruiz, Z. Abedjan
Introduces a hash-based index supporting n-ary join discovery via space-efficient "super keys" that combine multiple columns. Demonstrates significant gains over applying unary methods combinatorially.
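The super-key idea can be sketched as a Bloom-filter-style row signature: hash each cell to a few bits, OR the bits together per row, and use a bitwise subset test to discard rows that cannot contain a given key combination. This is a simplified illustration, not the paper's index. SHA-256 stands in for the paper's own hash function, the 64-bit width and all function names are assumptions, and the final verification ignores column positions.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width


def value_signature(value, bits_per_value=2):
    """Set a few bits chosen by a hash of the value (Bloom-filter style)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    signature = 0
    for i in range(bits_per_value):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature


def super_key(values):
    """OR together the signatures of all given values."""
    key = 0
    for value in values:
        key |= value_signature(value)
    return key


def may_contain(row_key, query_values):
    """False: the row surely lacks a query value. True: needs verification."""
    query_key = super_key(query_values)
    return (row_key & query_key) == query_key


def build_index(table):
    """Precompute one super key per row."""
    return [(super_key(row), row) for row in table]


def find_rows(index, query_values):
    """Filter by super key, then verify exactly to drop false positives."""
    wanted = set(query_values)
    return [row for key, row in index
            if may_contain(key, query_values) and wanted <= set(row)]


table = [
    ("Ada", "Lovelace", "London"),
    ("Alan", "Turing", "London"),
    ("Ada", "Turing", "Paris"),
]
index = build_index(table)
hits = find_rows(index, ("Ada", "Lovelace"))
print(hits)
```

The filter can produce false positives but never false negatives, because every bit set by a query value is also set in the super key of any row that contains that value.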
NAACL Findings 2025
PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes
X. Hu, C. Lei, X. Qin, A. Katsifodimos, C. Faloutsos, H. Rangwala
Tackles the gap between single-key embedding methods (e.g., DeepJoin) and hash-based n-ary discovery (e.g., MATE): combining keys before embedding loses semantics, while embedding each key independently breaks correlations across keys. PolyJoin learns joint representations over sets of key columns so that semantic multi-key joinability — e.g., (first_name, last_name) or (city, state) — can be retrieved with approximate nearest-neighbor search.
04 — Transformations

Transformation-Based Joins

A complementary direction: rather than tolerating mismatches with fuzzy similarity, learn a transformation that makes columns equi-joinable (e.g., "2024-01-05" → "Jan 5, 2024"; "GOOGL" → "Alphabet Inc."). Note that this line of work considers transformation discovery between two given tables, not joinable table discovery or search.

PVLDB 2017
Auto-Join: Joining Tables by Leveraging Transformations
E. Zhu, Y. He, S. Chaudhuri
Identifies promising joinable row pairs as input/output examples and uses program-synthesis-style search to find a transformation that aligns the columns. Scales to tens of thousands of rows at interactive speed.
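The by-example search can be illustrated with a toy enumerator: given input/output string pairs, try every program from a small operator space and return the first one consistent with all pairs. The operator space here (split, reorder tokens, join, optional case change) and all names are hypothetical and far smaller than a real synthesizer's. The example pairs are supplied by hand rather than discovered automatically.

```python
from itertools import permutations, product

SPLIT_DELIMS = [", ", " ", "-", "/"]
JOIN_DELIMS = [" ", ", ", "-", "/", ""]
CASES = {"keep": str, "lower": str.lower, "upper": str.upper}


def apply_program(program, text):
    """Run one (split, token order, join, case) program on a string."""
    split_delim, order, join_delim, case = program
    tokens = text.split(split_delim)
    if max(order) >= len(tokens):
        return None  # the program refers to a token that does not exist
    return CASES[case](join_delim.join(tokens[i] for i in order))


def candidate_programs(max_tokens=3):
    """Enumerate every program in the toy operator space."""
    orders = [p for n in range(1, max_tokens + 1)
              for p in permutations(range(max_tokens), n)]
    return product(SPLIT_DELIMS, orders, JOIN_DELIMS, CASES)


def synthesize(examples):
    """Return the first program consistent with all (input, output) pairs."""
    for program in candidate_programs():
        if all(apply_program(program, src) == dst for src, dst in examples):
            return program
    return None


examples = [("Lovelace, Ada", "Ada Lovelace"), ("Turing, Alan", "Alan Turing")]
program = synthesize(examples)
print(program, apply_program(program, "Hopper, Grace"))
```

Once a consistent program is found, applying it to every value in the source column yields a derived column that can be equi-joined with the target.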
SIGMOD 2025
Qualitative Join Discovery in Data Lakes Using Examples
M. M. Mohammad, E. K. Rezig
A by-example formulation of join discovery: users provide a small set of (row, expected match) examples and the system retrieves and ranks candidate joinable tables consistent with them. Complements both value-overlap and embedding approaches when the user's notion of "joinable" is implicit and best conveyed through samples.
05 — Context

Context-Aware Joinability

In enterprise lakes, two columns can look identical but mean different things, and ID columns may share values across unrelated tables. Recent work argues that column similarity alone is insufficient — surrounding context (other columns, table title, metadata) matters.

arXiv 2025
Evaluating Joinable Column Discovery Approaches for Context-Aware Search (TOPJoin)
IBM Research et al.
Defines context-aware column joinability and proposes a multi-criteria scoring approach (TOPJoin) that combines value, header, and contextual signals. Includes an experimental comparison against JOSIE, DeepJoin, and WarpGate.
06 — Benchmarks

Benchmarks & Systems

Without shared benchmarks, comparing methods has been unreliable. These efforts collect labeled corpora and evaluation protocols for join (and union) search.

ICDE 2025, pp. 737-750
BLEND: A Unified Data Discovery System
M. Esmailoghli, C. Schnell, R. J. Miller, Z. Abedjan
A system that combines n-ary join search with union search and explores the optimization of discovery operators.
PVLDB 2024
LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes
Y. Deng, C. Chai, L. Cao, Q. Jin, Y. Hai, J. Wang, J. Fan, Y. Yuan, G. Li
A large-scale, labeled benchmark covering both joinable and unionable table search across multiple domains. Includes baseline comparisons and standardized metrics — the natural starting point for project evaluations.
07 — Applications

Joinable Search for ML & Augmentation

Joinable table search is rarely the end goal. These papers study what happens when retrieved joins flow into a downstream task — typically machine learning model training — and what that reveals about retrieval quality.

VLDB 2024
Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
R. Cappuzzo, A. Cvetkov-Iliev, G. Varoquaux, P. Papotti
An end-to-end study of table-augmentation pipelines: retrieve joinable tables, merge them, train a model. Introduces YADL (Yet Another Data Lake) for benchmarking and reports several surprising findings — e.g., Jaccard containment is a strong retrieval criterion, and tree-based learners are more robust to noisy merges than deep models.
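The merge step of such a pipeline can be sketched as a left join that keeps every base row and aggregates one-to-many matches. Mean aggregation, the dictionary-based row format, and the function name are assumptions chosen for brevity; retrieval and model training are left out.

```python
from collections import defaultdict
from statistics import mean


def left_join_mean(base_rows, key, candidate_rows, cand_key, feature):
    """Left-join one numeric feature onto the base table, averaging duplicates."""
    grouped = defaultdict(list)
    for row in candidate_rows:
        grouped[row[cand_key]].append(row[feature])
    merged = []
    for row in base_rows:
        values = grouped.get(row[key])
        # Unmatched base rows are kept, with a missing value for the feature.
        merged.append({**row, feature: mean(values) if values else None})
    return merged


# Hypothetical base table and retrieved candidate table.
base = [{"city": "Paris", "price": 10}, {"city": "Oslo", "price": 7}]
candidate = [
    {"name": "Paris", "temp": 12.0},
    {"name": "Paris", "temp": 14.0},
    {"name": "Lima", "temp": 19.0},
]
augmented = left_join_mean(base, "city", candidate, "name", "temp")
print(augmented)
```

Keeping unmatched rows with a missing value preserves the size of the training set, so the downstream learner has to tolerate missing features. A feature that shares its name with a base column would overwrite that column in this sketch.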