Reading List — Surveys & Tutorials

Topics

Data Lake Foundations & Position Papers
Table Discovery Surveys & Tutorials
Data Lake Systems & Concepts
Model Lakes

01 — Foundations

Data Lake Foundations & Position Papers

The papers that frame data lakes as a research agenda within data management: not just storage, but a collection of problems (discovery, integration, cleaning, versioning, metadata) that require rethinking classical database research.

PVLDB Tutorial | 2019 | 12(12): 1986–1989 | Required

Data Lake Management: Challenges and Opportunities

F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, P. C. Arocena

The defining position paper / tutorial that established data lake management as a research field for the database community. Surveys how data lakes introduce new problems (dataset discovery, navigation) and change the requirements for classical problems (data extraction, cleaning, integration, versioning, metadata management). Presented at VLDB 2019; slides and bibliography available online. The natural first reading for the course.

PDF | ACM DL | Slides & resources

02 — Discovery

Table Discovery Surveys & Tutorials

Three complementary overviews of the central problem in data lakes — finding the right tables when you don't know what's there. Read together they triangulate the landscape: a unified problem framing, a recent state-of-the-art tutorial, and a deep dive into the algorithmic substrate (indexes and operations).

SIGMOD Tutorial | 2023 | Required

Table Discovery in Data Lakes: State-of-the-Art and Future Directions

G. Fan, J. Wang, Y. Li, R. J. Miller

A comprehensive tutorial on the most recent table discovery techniques developed by the data management community. Covers table understanding tasks (domain discovery, annotation, representation learning), query-driven discovery and exploration, and how these techniques support downstream data science. Discusses future directions including hybrid structured-knowledge + dense-representation approaches and efficiency via modern indexing.

ACM DL | Tutorial site

PVLDB Tutorial | 2025 | 18(12)

Data Discovery in Data Lakes: Operations, Indexes, Systems

Z. Abedjan, M. Esmailoghli, S. Galhotra

A more recent tutorial with an algorithmic focus: how the index structures and discovery operations underlying joinability and unionability search have evolved, how they can be categorized, how they extend to distributed scenarios, and how they combine inside holistic systems. It complements Fan et al.'s broader-scope tutorial; read both for full coverage of the field.

PDF

ACM CSUR | 2023

Dataset Discovery and Exploration: A Survey

N. Paton, J. Chen, Z. Wu

An ACM Computing Surveys treatment that unifies dataset search, navigation, annotation, and schema inference under one framework. Defines a notation for tabular collections and characterizes approaches along consistent dimensions, making head-to-head comparisons across the literature possible. Useful when you need a more taxonomy-driven view than a tutorial provides.

ACM DL

03 — Systems

Data Lake Systems & Concepts

For the historical and architectural context: how the term "data lake" emerged, what systems have been built around the concept, and how data lakes differ from data warehouses and other repository designs.

Survey | 2021

Data Lake Concept and Systems: A Survey

R. Hai, C. Quix, M. Jarke

Reviews the development, definitions, and architectures of data lake systems. Classifies existing data lakes by the functions they provide — ingestion, organization, governance, query — making it a useful technical reference when designing, implementing, or comparing data-lake-style systems. Less research-oriented than the Nargesian et al. tutorial; more useful as background on the systems landscape.

ResearchGate

04 — Model Lakes

Model Lakes

Model lakes are a much newer area than data lakes, and the survey literature is still thin. The two papers below — one a research-agenda paper, one an enterprise-governance framing — currently serve double duty as both the foundational statements and the field's broad overviews. Both appear here and in the dedicated Model Lakes reading list.

EDBT | 2025 | Required

Model Lakes

K. Pal, D. Bau, R. J. Miller

Defines a model lake as a repository for heterogeneous ML models — by analogy to data lakes — and lays out a research agenda with four core tasks: attribution, versioning, search, and benchmarking. Draws explicit lessons from data lake research to motivate the model lake research program. Currently the closest thing to a "state of the field" survey for model lakes.

DOI | arXiv (March 2024)

WISE | 2025 | pp. 133–144

Model Lake: A New Alternative for Machine Learning Models Management and Governance

M. Garouani, F. Ravat, N. Valles-Parlangeau

A later, complementary view of model lakes from an enterprise perspective, focused on governance, audit, and lifecycle management. Describes the architectural foundations, key components, and operational benefits of treating datasets, code, and models as one integrated lake. A useful counterpoint to Pal et al.'s research framing.

DOI | Springer | arXiv (March 2025)