Joinable Search Reading List

01 — Foundations

Set-Overlap Approaches

The earliest formulations of joinable table search frame the problem as set-overlap on column values: two columns are considered joinable if their value sets have large intersection. These papers establish the canonical problem definitions and the scalable indexes used to solve them.
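
To make the set-overlap framing concrete, here is a minimal sketch of the three measures that recur throughout this list: raw overlap, containment, and Jaccard similarity. The function names and the toy columns are illustrative assumptions, not taken from any of the papers below.

```python
def overlap(query, candidate):
    """Number of distinct values shared by the two columns."""
    return len(set(query) & set(candidate))


def containment(query, candidate):
    """Fraction of the query's distinct values that also occur in the candidate."""
    q = set(query)
    if not q:
        return 0.0
    return len(q & set(candidate)) / len(q)


def jaccard(query, candidate):
    """Intersection size divided by union size of the two value sets."""
    q, c = set(query), set(candidate)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


# Toy columns (hypothetical data).
cities = ["Boston", "Chicago", "Denver", "Seattle"]
offices = ["Boston", "Denver", "Seattle", "Austin", "Miami", "Portland"]

shared = overlap(cities, offices)        # 3 shared values
contained = containment(cities, offices)  # 3 of the 4 query values are covered
similarity = jaccard(cities, offices)     # 3 shared out of 7 distinct values
```

Containment is asymmetric (it is normalized by the query column only), which is why it is usually preferred over Jaccard when the candidate column is much larger than the query.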

SIGMOD 2019 · Required

JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes

E. Zhu, D. Deng, F. Nargesian, R. J. Miller

Defines the joinable table search problem as a top-k overlap set similarity query. Introduces a cost-based algorithm that adaptively chooses between reading inverted-index posting lists and reading candidate sets, pruning with prefix and position filters, to scale to lakes with millions of columns. The reference exact baseline for nearly all follow-on work.

ACM DL PDF
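
As a rough illustration of the top-k overlap query that JOSIE answers, the sketch below uses a plain inverted index and counts hits per column. This is the naive baseline, not JOSIE's cost-based algorithm; the column identifiers and data are hypothetical.

```python
from collections import Counter, defaultdict
import heapq


def build_inverted_index(columns):
    """Map each distinct value to the set of column ids that contain it."""
    index = defaultdict(set)
    for col_id, values in columns.items():
        for value in set(values):
            index[value].add(col_id)
    return index


def top_k_overlap(query, index, k):
    """Return up to k (column id, overlap) pairs with the largest overlap."""
    counts = Counter()
    for value in set(query):
        # .get avoids creating empty posting lists for unseen values.
        for col_id in index.get(value, ()):
            counts[col_id] += 1
    # Largest overlap first; ties broken by column id for determinism.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical data lake: column id -> cell values.
lake = {
    "airports.city": ["Boston", "Denver", "Seattle", "Austin"],
    "sales.region": ["East", "West"],
    "stores.city": ["Boston", "Chicago", "Miami"],
}
query = ["Boston", "Chicago", "Denver", "Seattle"]

results = top_k_overlap(query, build_inverted_index(lake), k=2)
```

This version reads every posting list of every query value in full, which is the cost that an exact method has to reduce in order to scale.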

PVLDB 2016

LSH Ensemble: Internet-Scale Domain Search for Tables with Containment Joins

E. Zhu, F. Nargesian, K. Q. Pu, R. J. Miller

Tackles the variable-cardinality problem: when column sizes range from hundreds to millions, single LSH indexes degrade badly. The ensemble partitions columns by size and tunes each partition's LSH parameters to approximate the containment measure efficiently.

PDF ACM DL
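
The reason partitioning by size helps can be seen from the relationship between containment and Jaccard similarity. For a query of size q and a column of size x with containment t, the Jaccard similarity is t / (x/q + 1 - t). A minimal sketch, assuming equal-count partitions and using each partition's largest column size as a conservative bound (function names are hypothetical):

```python
def jaccard_threshold(containment_threshold, query_size, upper_size):
    """Jaccard threshold implied by a containment threshold.

    Uses s = t / (x/q + 1 - t) with x replaced by the partition's upper
    size bound, which yields the lowest (most permissive) threshold for
    any column in that partition.
    """
    t, q, u = containment_threshold, query_size, upper_size
    return t / (u / q + 1 - t)


def partition_by_size(sizes, num_partitions):
    """Sort columns by size and split them into roughly equal-count groups."""
    ordered = sorted(sizes.items(), key=lambda kv: (kv[1], kv[0]))
    n = len(ordered)
    parts = []
    for i in range(num_partitions):
        chunk = ordered[i * n // num_partitions:(i + 1) * n // num_partitions]
        if chunk:
            parts.append(chunk)
    return parts


# Hypothetical column sizes spanning several orders of magnitude.
sizes = {"a": 100, "b": 200, "c": 5000, "d": 80000, "e": 1000000, "f": 300}
partitions = partition_by_size(sizes, num_partitions=3)
upper_sizes = [max(size for _, size in part) for part in partitions]

# Per-partition Jaccard thresholds for a query of 100 values, containment >= 0.5.
thresholds = [jaccard_threshold(0.5, 100, u) for u in upper_sizes]
```

With a single index, the threshold would have to be set for the largest column in the whole lake, so small columns would be searched with a far looser threshold than they need. Per-partition thresholds tighten this.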

02 — Semantic Methods

Embedding-Based & Semantic Search

Equi-join overlap misses joinable columns that differ in surface form — "Apple" vs. "Apple Inc.", "U.S.A." vs. "United States". This line of work embeds values or whole columns into vector spaces so that semantically related columns are close, enabling fuzzy and cross-language joins.

ICDE 2021 · Required

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach (PEXESO)

Y. Dong, K. Takeoka, C. Xiao, M. Oyamada

Embeds textual cell values as high-dimensional vectors using pre-trained word embeddings and defines joinability via similarity predicates on those vectors. Introduces a pivot-based block-and-verify index and a partitioning scheme for out-of-core data lakes.

arXiv PDF
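
A similarity-predicate definition of joinability along these lines can be sketched with a brute-force check: a query cell "matches" if some target cell lies within distance tau, and the column is joinable if enough query cells match. This is an exhaustive stand-in for the pivot-based index; the parameter names and the 2-D toy vectors are assumptions for illustration.

```python
import math


def matched_fraction(query_vecs, target_vecs, tau):
    """Fraction of query vectors with at least one target vector within tau."""
    if not query_vecs:
        return 0.0
    matched = sum(
        1
        for q in query_vecs
        if any(math.dist(q, t) <= tau for t in target_vecs)
    )
    return matched / len(query_vecs)


def is_joinable(query_vecs, target_vecs, tau, min_fraction):
    """Joinable if the matched fraction reaches the required minimum."""
    return matched_fraction(query_vecs, target_vecs, tau) >= min_fraction


# Toy 2-D "embeddings" (real word embeddings have hundreds of dimensions).
query_vecs = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
target_vecs = [(0.1, 0.0), (1.0, 0.9), (9.0, 9.0)]

fraction = matched_fraction(query_vecs, target_vecs, tau=0.2)
```

The brute-force check costs one distance computation per (query cell, target cell) pair, which is the quadratic work that a block-and-verify index is designed to avoid.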

PVLDB 2023 · Required

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, M. Oyamada

Fine-tunes a pre-trained language model (PLM) to embed entire columns (rather than individual cells) into fixed-length vectors, unifying equi- and semantic joins under one retrieval framework. Couples the encoder with an HNSW index for approximate nearest-neighbor search, so retrieval time grows roughly logarithmically with repository size.

arXiv PDF ACM DL
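
Once every column is a single vector, retrieval reduces to nearest-neighbor search. The sketch below does this exhaustively with cosine similarity, which is the exact result an approximate index such as HNSW tries to reproduce at lower cost. The column vectors are made up for illustration; in practice they would come from the fine-tuned encoder.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_columns(query_vec, column_vecs, k):
    """Exhaustively rank all columns by cosine similarity to the query."""
    scored = ((cosine(query_vec, vec), col_id) for col_id, vec in column_vecs.items())
    return [col_id for _, col_id in heapq.nlargest(k, scored)]


# Hypothetical 3-D column embeddings.
column_vecs = {
    "companies.name": (0.9, 0.1, 0.0),
    "firms.legal_name": (0.8, 0.2, 0.1),
    "weather.station": (0.0, 0.1, 0.9),
}
nearest = top_k_columns((1.0, 0.0, 0.0), column_vecs, k=2)
```

The exhaustive scan is linear in the number of columns, so it is only a correctness reference for small repositories.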

CIDR 2023

WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

T. Cong, J. Gale, J. Frantz, H. V. Jagadish, Ç. Demiralp

Uses pre-trained WebTable (FastText) embeddings rather than task-specific fine-tuning, indexed with HNSW. Aimed at cloud warehouse settings; a useful counterpoint to DeepJoin as a baseline that uses embeddings without any task-specific training.

PDF

PVLDB 2025 · Required

OmniMatch: Joinability Discovery in Data Products

C. Koutras, J. Zhang, X. Qin, C. Lei, V. N. Ioannidis, C. Faloutsos, G. Karypis, A. Katsifodimos

Targets joinability discovery in data products — curated collections of tabular datasets — rather than open data lakes. Combines multiple column-pair similarity signals through a self-supervised Graph Neural Network (RGCN), so similarity propagates across the column-relatedness graph and indirect join relationships emerge. Uses automated negative-example generation to sharpen precision, and reports up to 14% F1/AUC gains over DeepJoin and other baselines without per-metric thresholds.

PDF DOI arXiv (extended)

PVLDB 2024

HyperJoin: LLM-augmented Hypergraph Link Prediction for Joinable Table Discovery

Y. Chen et al.

Models a table corpus as a hypergraph in which columns are nodes and hyperedges encode joinability — including LLM-augmented inter-table hyperedges. Joinable table discovery is recast as link prediction over the hypergraph, with a maximum-spanning-tree reranker for coherence. Reports +21% Precision@15 / +17% Recall@15 over the best prior baselines.

arXiv

03 — Composite Keys

N-ary Joins & Multi-Column Keys

Real tables often join on combinations of columns — (first_name, last_name) or (city, state) — not single keys. Methods built for unary joins miss these or degrade badly when applied combinatorially.

PVLDB 2022

MATE: Multi-Attribute Table Extraction

M. Esmailoghli, J.-A. Quiané-Ruiz, Z. Abedjan

Introduces a hash-based index supporting n-ary join discovery via space-efficient "super keys" that combine multiple columns. Demonstrates significant gains over applying unary methods combinatorially.

PDF ACM DL
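
The super-key idea can be sketched as a per-row bit signature: each cell sets a few bits, the row's super key is the OR of its cell signatures, and a row can only contain a query key combination if it has every bit the query sets. The hash below is a simple Bloom-filter-style stand-in, not MATE's actual hash function, and all names are hypothetical.

```python
import hashlib

BITS = 64  # width of the signature (an arbitrary choice for this sketch)


def cell_signature(value):
    """Set two bits chosen by a stable hash of the cell value."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return (1 << (digest[0] % BITS)) | (1 << (digest[1] % BITS))


def super_key(values):
    """OR together the signatures of all values in a row (or query tuple)."""
    key = 0
    for value in values:
        key |= cell_signature(value)
    return key


def find_matching_rows(table, row_keys, query_values):
    """Filter rows by super key, then verify the survivors exactly."""
    query_mask = super_key(query_values)
    matches = []
    for i, row in enumerate(table):
        # Filter: a row missing any query bit cannot contain all query values.
        if (row_keys[i] & query_mask) != query_mask:
            continue
        # Verify: bit collisions can let false positives through the filter.
        if set(query_values) <= set(row):
            matches.append(i)
    return matches


table = [
    ("Ada", "Lovelace", "London"),
    ("Alan", "Turing", "London"),
    ("Ada", "Turing", "Paris"),
]
row_keys = [super_key(row) for row in table]  # precomputed at index time

hits = find_matching_rows(table, row_keys, ("Ada", "Lovelace"))
```

The filter never produces false negatives, because a row that contains all query values necessarily has all of their bits set. That property is what lets one index serve any combination of key columns.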

NAACL Findings 2025

PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes

X. Hu, C. Lei, X. Qin, A. Katsifodimos, C. Faloutsos, H. Rangwala

Tackles the gap between single-key embedding methods (e.g., DeepJoin) and hash-based n-ary discovery (e.g., MATE): combining keys before embedding loses semantics, while embedding each key independently breaks correlations across keys. PolyJoin learns joint representations over sets of key columns so that semantic multi-key joinability — e.g., (first_name, last_name) or (city, state) — can be retrieved with approximate nearest-neighbor search.

DOI ACL Anthology

04 — Transformations

Transformation-Based Joins

A complementary direction: rather than tolerating mismatches with fuzzy similarity, learn a transformation that makes columns equi-joinable (e.g., "2024-01-05" → "Jan 5, 2024"; "GOOGL" → "Alphabet Inc."). Note that this work considers transformation discovery between two tables, not joinable table discovery or search.

PVLDB 2017

Auto-Join: Joining Tables by Leveraging Transformations

E. Zhu, Y. He, S. Chaudhuri

Identifies promising joinable row pairs as input/output examples and uses program-synthesis-style search to find a transformation that aligns the columns. Scales to tens of thousands of rows at interactive speed.

PDF
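
The by-example flavor of this search can be illustrated with a toy enumerator: generate candidate string programs from a small operator set and keep the first one consistent with every input/output pair. The operator set here is a tiny hypothetical one chosen for the sketch and is far smaller than what a real synthesizer explores.

```python
from itertools import product


def candidate_programs():
    """Yield (description, function) pairs from a small fixed operator set."""
    yield "identity", lambda s: s
    yield "lower", str.lower
    yield "upper", str.upper
    for delim, idx in product([" ", ",", "-", "@", "/"], [0, 1, -1]):
        # Default arguments bind the current delim and idx to this function.
        def split_part(s, delim=delim, idx=idx):
            parts = s.split(delim)
            if -len(parts) <= idx < len(parts):
                return parts[idx].strip()
            return None  # index out of range: program does not apply
        yield f"split({delim!r})[{idx}]", split_part


def synthesize(examples):
    """Return the first (description, function) consistent with all examples."""
    for name, fn in candidate_programs():
        if all(fn(src) == dst for src, dst in examples):
            return name, fn
    return None


# Example row pairs: an email column should join with a username column.
examples = [("ada@example.org", "ada"), ("alan@example.org", "alan")]
program_name, program = synthesize(examples)
transformed = program("grace@example.org")
```

Once a consistent program is found, applying it to the whole source column turns the fuzzy join into an ordinary equi-join on the transformed values.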

SIGMOD 2025

Qualitative Join Discovery in Data Lakes Using Examples

M. M. Mohammad, E. K. Rezig

A by-example formulation of join discovery: users provide a small set of (row, expected match) examples and the system retrieves and ranks candidate joinable tables consistent with them. Complements both value-overlap and embedding approaches when the user's notion of "joinable" is implicit and best conveyed through samples.

ACM DL

05 — Context

Context-Aware Joinability

In enterprise lakes, two columns can look identical but mean different things, and ID columns may share values across unrelated tables. Recent work argues that column similarity alone is insufficient — surrounding context (other columns, table title, metadata) matters.

arXiv 2025

Evaluating Joinable Column Discovery Approaches for Context-Aware Search (TOPJoin)

IBM Research et al.

Defines context-aware column joinability and proposes a multi-criteria scoring approach (TOPJoin) that combines value, header, and contextual signals. Includes an experimental comparison against JOSIE, DeepJoin, and WarpGate.

arXiv

06 — Benchmarks

Benchmarks & Systems

Without shared benchmarks, comparing methods has been unreliable. These efforts collect labeled corpora and evaluation protocols for join (and union) search.

ICDE 2025, pp. 737-750

BLEND: A Unified Data Discovery System

M. Esmailoghli, C. Schnell, R. J. Miller, Z. Abedjan

A system that combines n-ary join search with union search and explores the optimization of discovery operators.

PDF

PVLDB 2024

LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

Y. Deng, C. Chai, L. Cao, Q. Jin, Y. Hai, J. Wang, J. Fan, Y. Yuan, G. Li

A large-scale, labeled benchmark covering both joinable and unionable table search across multiple domains. Includes baseline comparisons and standardized metrics — the natural starting point for project evaluations.

PDF ACM DL
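
When comparing methods on a labeled benchmark, ranked results are typically scored with Precision@k and Recall@k (the form in which HyperJoin's gains are quoted above). A minimal sketch of both, assuming a ranked list of table ids and a ground-truth set of joinable tables; the exact protocol of any particular benchmark may differ.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k retrieved items that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Hypothetical system output and ground truth for one query table.
ranked = ["t3", "t7", "t1", "t9", "t4"]
relevant = {"t1", "t3", "t5"}

p_at_3 = precision_at_k(ranked, relevant, 3)  # 2 hits in the top 3
r_at_3 = recall_at_k(ranked, relevant, 3)     # 2 of 3 relevant tables found
p_at_5 = precision_at_k(ranked, relevant, 5)  # still 2 hits, now out of 5
```

Reporting both matters: precision penalizes noisy candidates that a user must inspect, while recall penalizes joinable tables that never surface.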

07 — Applications

Joinable Search for ML & Augmentation

Joinable table search is rarely the end goal. These papers study what happens when retrieved joins flow into a downstream task — typically machine learning model training — and what that reveals about retrieval quality.

VLDB 2024

Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

R. Cappuzzo, A. Cvetkov-Iliev, G. Varoquaux, P. Papotti

An end-to-end study of table-augmentation pipelines: retrieve joinable tables, merge them, train a model. Introduces YADL (Yet Another Data Lake) for benchmarking and reports several surprising findings — e.g., Jaccard containment is a strong retrieval criterion, and tree-based learners are more robust to noisy merges than deep models.

PDF
