01 — Topics
This Week
Subject: Overview of joinable table discovery. I will introduce the
problem and some classic approaches based on value overlap.
Specifically, I will give an overview of LSH Ensemble (PVLDB 2016) and
JOSIE (SIGMOD 2019), both of which handle single-attribute equi-joins.
I will then cover MATE (PVLDB 2022), which handles multi-attribute
equi-joins. Note that the reading list below covers semantic joins,
applications, and benchmarking. Any paper on the list that I have
not covered is a possible paper for you to present in class.
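To make the value-overlap formulation concrete, here is a minimal sketch of top-k overlap search over columns using a plain inverted index. It is a naive baseline for illustration only, not the algorithm of any of the papers above; the column names and the helper functions `build_inverted_index` and `top_k_overlap` are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(columns):
    """Map each distinct value to the set of column ids that contain it."""
    index = defaultdict(set)
    for col_id, values in columns.items():
        for v in set(values):
            index[v].add(col_id)
    return index

def top_k_overlap(query_values, index, k):
    """Return the k columns sharing the most distinct values with the query."""
    counts = defaultdict(int)
    for v in set(query_values):
        for col_id in index.get(v, ()):
            counts[col_id] += 1
    # Highest overlap first; ties broken by column id for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical toy data lake: column id -> values.
columns = {
    "cities.name": ["Boston", "Toronto", "Berlin", "Paris"],
    "offices.city": ["Boston", "Berlin", "Tokyo"],
    "teams.home": ["Toronto", "Boston", "Chicago", "Berlin"],
    "colors.name": ["red", "green"],
}
index = build_inverted_index(columns)
result = top_k_overlap(["Boston", "Berlin", "Toronto"], index, k=2)
print(result)
```

This version reads every posting list of every query value, which is the cost that the indexing and pruning techniques in the readings aim to avoid.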
Reviews
For this class, please prepare and upload to Dropbox two 1-page (ish)
PDF files: one review of JOSIE and one review of MATE. I'd
like you to include at least 3 strong points about each paper and
3 weak points. For each point, please
include specific justification/evidence from the paper itself. I
would like you to write these yourself and not use AI. Please
upload the reviews by Friday May 22nd, 7pm EDT.
The Dropbox link will be posted on Piazza.
Please use the file naming format 2-JOSIE-.pdf or
2-MATE0.pdf, respectively.
02 — Readings
Recommended
PVLDB2016
LSH Ensemble: Internet-Scale Domain Search for Tables with Containment Joins
E. Zhu, F. Nargesian, K. Q. Pu, R. J. Miller
Tackles the variable-cardinality problem: when column sizes range from hundreds to millions, a single LSH index degrades badly. The ensemble partitions columns by size and tunes each partition's LSH parameters to approximate the containment measure efficiently.
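As a rough illustration of why column size matters here, the sketch below compares Jaccard similarity with containment, taken as |Q ∩ X| / |Q| for a query column Q and candidate column X. It also shows the algebraic conversion from a containment threshold to the equivalent Jaccard threshold for a given candidate size. The function names and toy data are assumptions for this example, and no LSH index is built.

```python
def jaccard(q, x):
    """|Q ∩ X| / |Q ∪ X|."""
    q, x = set(q), set(x)
    union = q | x
    return len(q & x) / len(union) if union else 0.0

def containment(q, x):
    """|Q ∩ X| / |Q|: the fraction of the query covered by the candidate."""
    q, x = set(q), set(x)
    return len(q & x) / len(q) if q else 0.0

def jaccard_threshold(t, q_size, x_size):
    """Jaccard value that corresponds to containment t for the given sizes.

    Follows from |Q ∪ X| = |Q| + |X| - |Q ∩ X| with |Q ∩ X| = t * |Q|.
    """
    return t / (x_size / q_size + 1 - t)

# A 100-value query fully contained in a small and in a large column.
query = {f"city{i}" for i in range(100)}
small = query | {f"s{i}" for i in range(100)}    # 200 distinct values
large = query | {f"l{i}" for i in range(99900)}  # 100000 distinct values

print(containment(query, small), jaccard(query, small))
print(containment(query, large), jaccard(query, large))
```

Both candidates contain the query completely, yet their Jaccard scores differ by orders of magnitude. A single Jaccard threshold therefore cannot serve all column sizes, which is the motivation for treating size ranges separately.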
Required (submit reviews of these)
SIGMOD2019
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
E. Zhu, D. Deng, F. Nargesian, R. J. Miller
Defines the joinable table search problem as a top-k overlap set similarity query. Introduces a cost-aware algorithm that mixes prefix-filtering with position-list scans to scale to lakes with millions of columns. The reference exact baseline for nearly all follow-on work.
PVLDB2022
MATE: Multi-Attribute Table Extraction
M. Esmailoghli, J.-A. Quiané-Ruiz, Z. Abedjan
Introduces a hash-based index supporting n-ary join discovery via space-efficient "super keys" that combine multiple columns. Demonstrates significant gains over applying unary methods combinatorially.
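The super-key idea can be sketched as a Bloom-filter-style row signature: each row stores the bitwise OR of small hash masks of its cell values, so a multi-attribute query can discard rows whose signature lacks any required bit. This is a simplified illustration under my own assumptions (SHA-256-derived masks, a 64-bit width, two bits per value), not the hash function or index layout from the MATE paper; `value_mask`, `super_key`, and `may_contain` are hypothetical helpers.

```python
import hashlib

BITS = 64  # assumed signature width for this sketch

def value_mask(value, n_bits=BITS, n_hashes=2):
    """Set a few bit positions derived from a hash of the value."""
    mask = 0
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % n_bits)
    return mask

def super_key(row):
    """OR together the masks of every cell value in the row."""
    key = 0
    for v in row:
        key |= value_mask(v)
    return key

def may_contain(row_key, query_values):
    """False means the row cannot contain all query values; True means maybe."""
    q = 0
    for v in query_values:
        q |= value_mask(v)
    return (row_key & q) == q

# Hypothetical rows and a two-attribute query key.
rows = [
    ("Alice", "Smith", "Boston", 1990),
    ("Bob", "Jones", "Berlin", 1985),
    ("Alice", "Jones", "Paris", 1979),
]
keys = [super_key(r) for r in rows]
query = ("Alice", "Jones")

# Cheap bitwise filter first, exact verification afterwards.
candidates = [r for r, k in zip(rows, keys) if may_contain(k, query)]
matches = [r for r in candidates if all(v in r for v in query)]
print(matches)
```

The filter can produce false positives but never false negatives, so the exact check only runs on the surviving candidates rather than on every row.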
The complete Joinable Table Search Reading List contains additional
papers that you may choose to present.
03 — Slides
Lecture Materials
To be added