Reading List: Surveys & Tutorials

A starting-point reading list of surveys, tutorials, and vision papers that orient newcomers to data lake and model lake research. These are the right entry points before diving into any of the topic-specific lists: they establish the problem definitions, the major research threads, and the open questions that organize everything else.

Topics

  1. Data Lake Foundations & Position Papers
  2. Table Discovery Surveys & Tutorials
  3. Data Lake Systems & Concepts
  4. Model Lakes
01 — Foundations

Data Lake Foundations & Position Papers

The papers that frame data lakes as a research agenda within data management: not just storage, but a collection of problems — discovery, integration, cleaning, versioning, metadata — that classical database research must be rethought to address.

PVLDB Tutorial · 2019 · 12(12): 1986–1989 · Required
Data Lake Management: Challenges and Opportunities
F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, P. C. Arocena
The defining position paper / tutorial that established data lake management as a research field for the database community. Surveys how data lakes introduce new problems (dataset discovery, navigation) and change the requirements for classical problems (data extraction, cleaning, integration, versioning, metadata management). Presented at VLDB 2019; slides and bibliography available online. The natural first reading for the course.
02 — Discovery

Table Discovery Surveys & Tutorials

Three complementary overviews of the central problem in data lakes — finding the right tables when you don't know what's there. Read together they triangulate the landscape: a unified problem framing, a recent state-of-the-art tutorial, and a deep dive into the algorithmic substrate (indexes and operations).

SIGMOD Tutorial · 2023 · Required
Table Discovery in Data Lakes: State-of-the-Art and Future Directions
G. Fan, J. Wang, Y. Li, R. J. Miller
A comprehensive tutorial on the most recent table discovery techniques developed by the data management community. Covers table understanding tasks (domain discovery, annotation, representation learning), query-driven discovery and exploration, and how these techniques support downstream data science. Discusses future directions including hybrid structured-knowledge + dense-representation approaches and efficiency via modern indexing.
PVLDB Tutorial · 2025 · 18(12)
Data Discovery in Data Lakes: Operations, Indexes, Systems
Z. Abedjan, M. Esmailoghli, S. Galhotra
A more recent tutorial with an algorithmic focus: how the index structures and discovery operations underlying joinability and unionability search have evolved, how they can be categorized, how they extend to distributed scenarios, and how they combine inside holistic systems. It complements the broader-scope tutorial by Fan et al.; read both for full coverage of the field.
ACM CSUR · 2023
Dataset Discovery and Exploration: A Survey
N. Paton, J. Chen, Z. Wu
An ACM Computing Surveys treatment that unifies dataset search, navigation, annotation, and schema inference under one framework. Defines a notation for tabular collections and characterizes approaches along consistent dimensions, making head-to-head comparisons across the literature possible. Useful when you need a more taxonomy-driven view than a tutorial provides.
03 — Systems

Data Lake Systems & Concepts

For the historical and architectural context: how the term "data lake" emerged, what systems have been built around the concept, and how data lakes differ from data warehouses and other repository designs.

Survey · 2021
Data Lake Concept and Systems: A Survey
R. Hai, C. Quix, M. Jarke
Reviews the development, definitions, and architectures of data lake systems. Classifies existing data lakes by the functions they provide — ingestion, organization, governance, query — making it a useful technical reference when designing, implementing, or comparing data-lake-style systems. Less research-oriented than the Nargesian et al. tutorial; more useful as background on the systems landscape.
04 — Model Lakes

Model Lakes

Model lakes are a much newer area than data lakes, and the survey literature is still thin. The two papers below — one a research-agenda paper, one an enterprise-governance framing — currently serve double duty as both the foundational statements and the field's broad overviews. Both appear here and in the dedicated Model Lakes reading list.

EDBT · 2025 · Required
Model Lakes
K. Pal, D. Bau, R. J. Miller
Defines a model lake as a repository for heterogeneous ML models — by analogy to data lakes — and lays out a research agenda with four core tasks: attribution, versioning, search, and benchmarking. Draws explicit lessons from data lake research to motivate the model lake research program. Currently the closest thing to a "state of the field" survey for model lakes.
WISE · 2025 · pp. 133–144
Model Lake: A New Alternative for Machine Learning Models Management and Governance
M. Garouani, F. Ravat, N. Valles-Parlangeau
A later, complementary view of model lakes with an enterprise orientation, focused on governance, audit, and lifecycle management. Provides architectural foundations, key components, and operational benefits of treating datasets, code, and models as one integrated lake. A useful counterpoint to Pal et al.'s research framing.