Last modified: Sun May 17 12:29:04 EDT 2026

Reading Lists — CS 848

Reading Lists

Curated papers by topic

Course readings organized by research theme. Each list covers foundational papers, recent work, benchmarks, and surveys for one of the central problems in data lake and model lake management. Papers marked as required are seminar readings; the rest are recommended. New to the field? Start with the Surveys & Tutorials list.

Six reading lists are currently available; more will be added as the term progresses. Each list is organized into thematic sections with technical notes alongside paper links.

00 — Start Here

Surveys & Tutorials

Position papers, surveys, and tutorials that orient newcomers to data lake and model lake research. These are the right entry points before diving into any of the topic-specific lists: they establish problem definitions, major research threads, and open questions.

  • Data Lake Foundations & Position Papers
  • Table Discovery Surveys & Tutorials
  • Data Lake Systems & Concepts
  • Model Lakes
Open list →
01 — Discovery

Joinable Table Search

Given a query column, find tables in a data lake whose columns can be meaningfully joined with it. Covers set-overlap baselines, embedding-based methods, n-ary joins, transformation-based joins, context-aware approaches, and benchmarks.

  • Foundations & Set-Overlap Search
  • Semantic & Embedding-Based Search
  • N-ary Joins & Composite Keys
  • Transformation-Based Joins
  • Context-Aware Joinability
  • Benchmarks & Applications
Open list →
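
As a rough illustration of the set-overlap baseline mentioned in the Joinable Table Search description, the sketch below ranks candidate columns by containment, that is, the fraction of the query column's distinct values that also appear in the candidate. The toy data lake and the helper names `containment` and `rank_joinable` are hypothetical and not taken from any paper on the list. It uses a brute-force scan purely to show the scoring; it is not how an indexed search system would be built.

```python
def containment(query_values, candidate_values):
    """Fraction of distinct query values that also occur in the candidate column."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def rank_joinable(query_values, lake, k=2):
    """Return the top-k (table, column, score) triples by containment.

    `lake` maps (table_name, column_name) -> list of cell values.
    Brute-force scan over every column; illustrative only.
    """
    scored = [
        (table, column, containment(query_values, values))
        for (table, column), values in lake.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# Hypothetical toy data lake, invented for this example.
lake = {
    ("cities", "name"): ["Toronto", "Waterloo", "Ottawa", "Montreal"],
    ("airports", "city"): ["Toronto", "Vancouver"],
    ("products", "sku"): ["A1", "B2"],
}
query = ["Toronto", "Waterloo", "Ottawa"]
top = rank_joinable(query, lake)
print(top)
```

Containment is asymmetric by design: a large candidate column is not penalized for holding extra values, which is one reason it is often preferred over Jaccard similarity when column sizes differ widely.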
02 — Discovery

Table Union Search

Given a query table, find tables in a data lake whose rows can be appended to extend it. Covers the foundational formulation, column-semantics methods, deep representation learning, relationship-aware approaches, table-centric and LLM-based methods, and benchmarks.

  • Foundations
  • Column-Semantics Methods
  • Representation Learning
  • Relationship-Aware Search
  • Table-Centric & LLM-Based
  • Novelty, Diversity & Benchmarks
Open list →
03 — Querying

Multi-Table QA & Text-to-SQL

Question answering and SQL generation when the relevant tables must first be found in a data lake. Covers classical single-database text-to-SQL, schema linking at scale, agentic methods, open-domain table QA, multi-table retrieval, and recent data-lake benchmarks.

  • Foundations & Single-DB Text-to-SQL
  • Schema Linking at Scale
  • Agentic & Multi-Step Methods
  • Open-Domain Table QA
  • Multi-Table Retrieval for QA
  • Text-to-SQL over Data Lakes
Open list →
04 — Versioning

Table Version Management

Storing, exploring, and explaining changes across versions of tabular datasets. Covers the foundational storage/recreation tradeoff, version management systems, semantic change explanation, change exploration in the wild, and the theoretical underpinnings.

  • Foundations & Storage Tradeoffs
  • Version Management Systems
  • Semantic Versioning & Change Explanation
  • Change Exploration & Search
  • Theoretical Foundations
  • Background & Antecedents
Open list →
05 — Model Lakes

Model Lake Management

Storing, discovering, versioning, and reasoning about large collections of trained ML models, much as data lakes do for tabular data. Covers the model-lake vision, model management infrastructure, provenance, documentation, model search, and empirical studies of public hubs.

  • Foundations & Model Lake Vision
  • Model Management Infrastructure
  • Provenance & Versioning
  • Model Documentation
  • Model Search & Discovery
  • Empirical Studies of Model Hubs
Open list →

How these lists relate

Join search asks "what extends this table with new columns?"; union search asks "what is more of this table, as new rows?"; multi-table QA asks "which tables, joined or unioned how, will answer this question?"; version management asks "how did this table get here?"; and model lakes generalize all of this from tabular data to trained ML models. All five share techniques (embeddings, search, provenance, governance), so reading them side by side is a fast way to get oriented in the field.
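
To make the join/union distinction concrete, here is a minimal sketch on toy tables represented as lists of dictionaries. The table contents and the helper names `join_on` and `union_rows` are invented for illustration. The point is only that a join adds columns to existing rows, while a union adds rows that share the same columns.

```python
def join_on(left, right, key):
    """Inner equi-join: extend each left row with columns from matching right rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def union_rows(top, bottom):
    """Union: append rows that share the same columns, dropping exact duplicates."""
    seen, out = set(), []
    for row in top + bottom:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            out.append(row)
    return out


# Hypothetical toy tables.
cities = [
    {"city": "Toronto", "province": "ON"},
    {"city": "Montreal", "province": "QC"},
]
airports = [{"city": "Toronto", "airport_code": "YYZ"}]
more_cities = [
    {"city": "Vancouver", "province": "BC"},
    {"city": "Toronto", "province": "ON"},
]

joined = join_on(cities, airports, "city")   # same rows, one more column
unioned = union_rows(cities, more_cities)    # same columns, more rows
print(joined)
print(unioned)
```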