Reading List — Table Version Management

Topics

Foundations in DBMS & Storage Tradeoffs
Version Management Systems
Semantic Versioning & Change Explanation
Change Exploration & Search
Theoretical Foundations
Background & Antecedents

01 — Foundations in DBMS

The Storage / Recreation Tradeoff

The foundational formulation, set in the DBMS context rather than data lakes: managing many versions of the same dataset means trading storage cost against recreation cost. Store every version in full and recreating any version is fast, but storage explodes; store only deltas and storage is small, but recreating a version means replaying a chain of deltas. The first paper below frames the problem, and most subsequent work refines it.
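The tradeoff can be made concrete with a small sketch. This is a toy model, not the formulation from any paper listed here: the version names, the sizes, and the assumption that recreation cost equals the total size of everything read along the delta chain are all hypothetical choices made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy data (arbitrary units, e.g. MB): the size of each version
# stored in full, and the size of the delta between consecutive versions.
FULL_SIZE = {"v1": 100, "v2": 105, "v3": 110, "v4": 120}
DELTA_SIZE = {("v1", "v2"): 30, ("v2", "v3"): 25, ("v3", "v4"): 40}

# A storage plan maps each version to the parent it is stored as a delta
# from, or to None when the version is stored in full.
Plan = Dict[str, Optional[str]]


def stored_size(version: str, plan: Plan) -> int:
    """Size of what is physically kept for this version under the plan."""
    parent = plan[version]
    if parent is None:
        return FULL_SIZE[version]
    return DELTA_SIZE[(parent, version)]


def storage_cost(plan: Plan) -> int:
    """Total storage: the sum of everything physically kept."""
    return sum(stored_size(v, plan) for v in plan)


def recreation_cost(version: str, plan: Plan) -> int:
    """Assumed cost to rebuild a version: everything read while walking the
    delta chain back to the nearest fully stored ancestor."""
    cost = 0
    current: Optional[str] = version
    while current is not None:
        cost += stored_size(current, plan)
        current = plan[current]
    return cost


def summarize(plan: Plan) -> Tuple[int, int]:
    """Return (total storage, worst-case recreation cost) for a plan."""
    return storage_cost(plan), max(recreation_cost(v, plan) for v in plan)


materialize_all: Plan = {v: None for v in FULL_SIZE}
delta_chain: Plan = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v3"}
hybrid: Plan = {"v1": None, "v2": "v1", "v3": None, "v4": "v3"}

for name, plan in [("materialize all", materialize_all),
                   ("delta chain", delta_chain),
                   ("hybrid", hybrid)]:
    print(name, summarize(plan))
```

Under these made-up numbers, storing everything in full gives the highest storage and the lowest worst-case recreation cost, the pure delta chain gives the reverse, and the hybrid plan sits in between. Choosing which versions to store in full is the optimization problem the papers in this section study.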

PVLDB 2015, 8(12): 1346–1357 (Required)

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, A. Parameswaran

Formulates six versioning problems trading off storage and recreation cost in different ways. Proves most are intractable and proposes heuristics drawing from delay-constrained scheduling and minimum spanning tree literature, including the Local Move Greedy (LMG) heuristic that has become the standard baseline. Prototype built as the foundation for DataHub. The reference framing for the entire area.

PDF arXiv

CIDR 2015

DataHub: Collaborative Data Science & Dataset Version Management at Scale

A. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, A. Parameswaran

A platform for collaborative data science with first-class support for branching and merging dataset versions, a version graph DAG, and a query language (VQL) for traversing both data and version history. Goes beyond linear temporal-database histories (e.g., Oracle Flashback) by treating derived data products as first-class citizens.

PDF

PVLDB 2017, 10(10)

OrpheusDB: Bolt-On Versioning for Relational Databases

S. Huang, L. Xu, J. Liu, A. J. Elmore, A. Parameswaran

A "bolt-on" approach that adds versioning to an existing relational DBMS, inheriting its query and analytics capabilities for free. Develops multiple data models for representing versioned data and a lightweight partitioning scheme (LyreSplit) that optimizes query latency. A pragmatic contrast to DataHub's clean-slate design.

ACM DL
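To give a feel for what "bolt-on" versioning over an ordinary relational engine can look like, here is a minimal sketch using SQLite from the Python standard library. It is an illustrative assumption, not OrpheusDB's actual schema or API: the table names (`records`, `membership`), the `commit` and `checkout` helpers, and the sample rows are all hypothetical. The idea shown is that records are immutable and shared across versions, and a separate table lists which records belong to each version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Immutable records, shared by every version that contains them.
    CREATE TABLE records (rid INTEGER PRIMARY KEY, name TEXT, score INTEGER);
    -- Which records make up which version.
    CREATE TABLE membership (vid INTEGER, rid INTEGER, PRIMARY KEY (vid, rid));
""")


def commit(conn, vid, parent_vid, new_rows, dropped_rids=()):
    """Create version `vid` from `parent_vid`: reuse the parent's records
    except the dropped ones, and add `new_rows` as fresh records."""
    dropped = set(dropped_rids)
    kept = []
    if parent_vid is not None:
        rows = conn.execute(
            "SELECT rid FROM membership WHERE vid = ?", (parent_vid,)
        ).fetchall()
        kept = [rid for (rid,) in rows if rid not in dropped]
    for name, score in new_rows:
        cur = conn.execute(
            "INSERT INTO records (name, score) VALUES (?, ?)", (name, score)
        )
        kept.append(cur.lastrowid)
    conn.executemany(
        "INSERT INTO membership VALUES (?, ?)", [(vid, rid) for rid in kept]
    )


def checkout(conn, vid):
    """Materialize one version with a plain SQL join."""
    return conn.execute(
        "SELECT r.name, r.score FROM records r "
        "JOIN membership m ON r.rid = m.rid "
        "WHERE m.vid = ? ORDER BY r.rid",
        (vid,),
    ).fetchall()


# Version 1 holds two records.
commit(conn, 1, None, [("ada", 90), ("bob", 70)])
# Version 2 changes bob's score: drop the old record (rid 2), add a new one.
commit(conn, 2, 1, [("bob", 75)], dropped_rids=[2])

print(checkout(conn, 1))
print(checkout(conn, 2))
```

Because every version is reachable through ordinary SQL, the host database's query and analytics features apply to versioned data without modification. The cost of the join grows with the size of the membership table, which is one reason a partitioning scheme matters for checkout latency.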

03 — Semantics

Semantic Versioning & Change Explanation

Storage- and graph-level views of versioning treat changes as opaque diffs. A complementary line of work asks a different question: what does a change mean? Can we explain the transformation between two versions of a table in interpretable terms an analyst could verify?

PVLDB 2023, 16(6): 1587–1600 (Required)

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V

R. Shraga, R. J. Miller

Shifts the problem from "what bytes differ between two versions?" to "what semantic transformation explains the change?" Generates verifiable, human-readable explanations covering both syntactic changes (column rename, type cast) and semantic ones (value transformations, derivations). Explanations let analysts decide whether to accept changes, propagate them, or roll back — a capability missing from delta-based versioning systems.

ACM DL PDF

PVLDB 2026

SAVeD: Semantically Aware Version Discovery

A. Erenk, R. Shraga

Given a table, can we find all other versions of it in a data lake?

PDF

04 — Exploration

Change Exploration & Version Search

Once you have a history of versions, the natural follow-on is interactive exploration: which parts of the data changed most, what kinds of changes occurred, when, and why. These papers move from storage and explanation toward analytic queries over change itself.

PVLDB 2018, 12(2): 85–98 (Required)

Exploring Change — A New Dimension of Data Analytics

T. Bleifuß, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava

Proposes that data change itself deserves to be a first-class analytic dimension. Introduces the change-cube data model that records insertions, deletions, updates, and schema modifications over time. Demonstrates analytic queries that surface controversy, vandalism, data quality issues, and hidden processes — questions impossible to answer with snapshot-only views.

PDF ACM DL
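A minimal sketch of treating change as an analytic dimension, assuming each change is recorded as a (time, entity, property, value) tuple. The `Change` type, its field names, the helper functions, and the sample data are hypothetical and chosen for illustration; this is not the paper's implementation.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional, Tuple


class Change(NamedTuple):
    time: int
    entity: str
    prop: str
    value: Optional[str]  # None marks a deletion of the property


# Hypothetical change log for two entities.
CHANGES: List[Change] = [
    Change(1, "film_1", "rating", "7.1"),
    Change(2, "film_1", "director", "A"),
    Change(3, "film_1", "rating", "7.3"),
    Change(4, "film_2", "rating", "6.0"),
    Change(5, "film_1", "director", "B"),
    Change(6, "film_1", "director", "A"),
    Change(7, "film_1", "director", None),
]


def change_counts(changes: List[Change]) -> Counter:
    """Count changes per (entity, property): a query over change itself."""
    return Counter((c.entity, c.prop) for c in changes)


def snapshot(changes: List[Change], at_time: int) -> Dict[Tuple[str, str], str]:
    """Replay changes up to `at_time` to rebuild the state at that moment."""
    state: Dict[Tuple[str, str], str] = {}
    for c in sorted(changes, key=lambda c: c.time):
        if c.time > at_time:
            break
        key = (c.entity, c.prop)
        if c.value is None:
            state.pop(key, None)
        else:
            state[key] = c.value
    return state


print(change_counts(CHANGES).most_common(1))
print(snapshot(CHANGES, at_time=4))
```

The first query finds the most frequently edited field, which a single snapshot cannot reveal. The second shows that ordinary snapshots remain derivable from the change log, so recording changes loses nothing relative to storing states.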

CIDR 2019

DBChEx: Interactive Exploration of Data and Schema Change

T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava

The system that turns the change-cube into an interactive demo: a web front-end over an underlying change-cube database, demonstrated on the full histories of IMDB and Wikipedia infoboxes. Shows what exploratory analytics over change history can feel like in practice.

PDF

SEAData @ VLDB 2021

The Secret Life of Wikipedia Tables

T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava

Extracts, matches, and analyzes the entire history of 3.5M tables on English Wikipedia — 53.8M total table versions. Shows that web tables have rich life cycles: they're created, change shape, move, grow, shrink, and disappear. An empirical foundation for any work on table version search at web scale.

Project page

05 — Theory

Theoretical Foundations

A recent line of work returns to Bhattacherjee et al.'s formulation and asks the algorithmic questions seriously: how well can heuristics like LMG perform in the worst case, and when can we get good approximation guarantees?

arXiv 2024

To Store or Not to Store: A Graph Theoretical Approach for Dataset Versioning

A. Guo, J. Li, P. Sukprasert, S. Khuller, A. Deshpande, K. Mukherjee

Establishes that the LMG heuristic introduced by Bhattacherjee et al. can perform arbitrarily badly in the worst case, and shows hardness of o(n)-approximation on general graphs even with relaxed storage. Develops polynomial-time approximation schemes for tree-like graphs — motivated by the observation that real-world version graphs arising from typical edit operations have low treewidth.

arXiv

06 — Background

Background & Antecedents

Predecessors and adjacent areas: archiving of scientific data, provenance-enabled lifecycle management, and temporal databases. Useful for placing modern dataset versioning in its broader context.

TODS 2004, 29(1): 2–42

Archiving Scientific Data

P. Buneman, S. Khanna, K. Tajima, W.-C. Tan

An early principled study of archiving versioned scientific data, with formal treatment of hierarchical data, keys, and time-stamped storage. Pre-dates the data-lake era but establishes much of the conceptual vocabulary later versioning systems inherit.

ACM DL

IEEE Data Eng. Bull. 2018

ProvDB: Provenance-Enabled Lifecycle Management of Collaborative Data Analysis Workflows

H. Miao, A. Deshpande

Provenance and versioning are closely linked: a version is what you get when you track derivations through time. ProvDB tracks the full graph of datasets, transformations, and analyses across users and over time. Also referenced in the Model Lakes reading list — useful for seeing how the same techniques apply to both tabular data and trained models.

PDF