Reading List: Table Version Management & Version Search

Papers on managing, storing, exploring, and explaining changes across versions of tabular datasets. Collaborative data science routinely produces thousands of versions of the same table; the questions become how to store them efficiently, how to find a particular version, and how to characterize what changed between versions. The list moves from the foundational storage/recreation tradeoff through version-management systems, semantic change explanation, and change exploration in the wild, and closes with the theoretical underpinnings and historical background.

Topics

  1. Foundations in DBMS & Storage Tradeoffs
  2. Version Management Systems
  3. Semantic Versioning & Change Explanation
  4. Change Exploration & Search
  5. Theoretical Foundations
  6. Background & Antecedents
01 — Foundations in DBMS

The Storage / Recreation Tradeoff

The foundational formulation for DBMS (not lakes): managing many versions of the same dataset means trading storage cost against recreation cost. Store every version explicitly and queries are fast but storage explodes; store only deltas and storage is small but recreation is slow. This paper frames the problem and most subsequent work refines it.
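The tradeoff can be made concrete with a toy calculation. The sketch below assumes a linear chain of four versions with made-up sizes, and models recreation cost simply as the amount of data read to rebuild a version; the names `full_size`, `delta_size`, and `costs` are illustrative and not taken from the paper.

```python
# Toy version chain v1 -> v2 -> v3 -> v4. All sizes are made-up units.
chain = ["v1", "v2", "v3", "v4"]
full_size = {"v1": 100, "v2": 110, "v3": 120, "v4": 130}
delta_size = {("v1", "v2"): 30, ("v2", "v3"): 30, ("v3", "v4"): 30}


def costs(materialized):
    """Return (total storage, worst-case recreation cost) for a storage plan.

    Versions in `materialized` are stored in full; every other version is
    stored as a delta from its predecessor. Recreating a version means
    reading the nearest materialized ancestor plus all deltas after it.
    """
    if chain[0] not in materialized:
        raise ValueError("the first version must be stored in full")
    storage = 0
    recreation = 0
    worst = 0
    for i, v in enumerate(chain):
        if v in materialized:
            storage += full_size[v]
            recreation = full_size[v]      # read the full copy directly
        else:
            d = delta_size[(chain[i - 1], v)]
            storage += d
            recreation += d                # apply one more delta
        worst = max(worst, recreation)
    return storage, worst


print(costs({"v1", "v2", "v3", "v4"}))  # everything materialized
print(costs({"v1"}))                    # deltas only
print(costs({"v1", "v3"}))              # a middle ground
```

With these numbers, materializing everything costs 460 units of storage with a worst-case recreation of 130, storing only deltas costs 190 with a worst case of 190, and materializing `v1` and `v3` lands in between at 280 and 150.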

PVLDB 2015 · 8(12): 1346–1357 · Required
Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, A. Parameswaran
Formulates six versioning problems trading off storage and recreation cost in different ways. Proves most are intractable and proposes heuristics drawing from delay-constrained scheduling and minimum spanning tree literature, including the LMG heuristic that has become the standard baseline. Prototype built as the foundation for DataHub. The reference framing for the entire area.
CIDR 2015
DataHub: Collaborative Data Science & Dataset Version Management at Scale
A. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, A. Parameswaran
A platform for collaborative data science with first-class support for branching and merging dataset versions, a version graph DAG, and a query language (VQL) for traversing both data and version history. Goes beyond linear temporal-database histories (e.g., Oracle Flashback) by treating derived data products as first-class citizens.
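A version graph with branches and merges can be sketched as a mapping from each version to its parents. This is a minimal illustration, not DataHub's implementation or its VQL syntax; the version names and the helpers `ancestors` and `common_ancestors` are hypothetical.

```python
# Hypothetical version graph: each version maps to its parent versions.
# A merge has two parents, so the history is a DAG rather than a chain.
parents = {
    "v1": [],
    "v2": ["v1"],        # continues the main line
    "v3": ["v1"],        # branch created from v1
    "v4": ["v2", "v3"],  # merge of v2 and v3
}


def ancestors(version):
    """Return every version reachable by following parent links."""
    seen = set()
    stack = list(parents[version])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen


def common_ancestors(a, b):
    """Versions that both a and b derive from (or are)."""
    return (ancestors(a) | {a}) & (ancestors(b) | {b})


print(sorted(ancestors("v4")))
print(common_ancestors("v2", "v3"))
```

A linear temporal history is the special case in which every version has at most one parent and at most one child.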
PVLDB 2017 · 10(10)
OrpheusDB: Bolt-On Versioning for Relational Databases
S. Huang, L. Xu, J. Liu, A. J. Elmore, A. Parameswaran
A "bolt-on" approach that adds versioning to an existing relational DBMS, inheriting its query and analytics capabilities for free. Develops multiple data models for representing versioned data and a lightweight partitioning scheme (LyreSplit) that optimizes query latency. A pragmatic contrast to DataHub's clean-slate design.
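The bolt-on idea can be illustrated with ordinary tables in an unmodified database. The sketch below uses SQLite with one table of immutable records and one table mapping versions to the records they contain; this layout and the names `records`, `membership`, and `checkout` are assumptions for illustration and are not claimed to match any of the data models the paper evaluates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (rid INTEGER PRIMARY KEY, name TEXT, score INTEGER);
    CREATE TABLE membership (vid INTEGER, rid INTEGER, PRIMARY KEY (vid, rid));
""")

# Records are never updated in place; an edit adds a new record (rid 3).
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "ada", 90), (2, "bob", 70), (3, "bob", 75)],
)
# Version 1 holds records {1, 2}; version 2 swaps record 2 for record 3.
conn.executemany(
    "INSERT INTO membership VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 3)],
)


def checkout(vid):
    """Materialize one version with a plain SQL join."""
    rows = conn.execute(
        "SELECT r.name, r.score FROM records r "
        "JOIN membership m ON m.rid = r.rid "
        "WHERE m.vid = ? ORDER BY r.rid",
        (vid,),
    )
    return rows.fetchall()


print(checkout(1))
print(checkout(2))
```

The unchanged record is stored once and shared by both versions, and because a checkout is just a query, the host database's SQL and analytics remain available over any version.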
03 — Semantics

Semantic Versioning & Change Explanation

Storage- and graph-level views of versioning treat changes as opaque diffs. A complementary line of work asks a different question: what does a change mean? Can we explain the transformation between two versions of a table in interpretable terms an analyst could verify?

PVLDB 2023 · 16(6): 1587–1600 · Required
Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V
R. Shraga, R. J. Miller
Shifts the problem from "what bytes differ between two versions?" to "what semantic transformation explains the change?" Generates verifiable, human-readable explanations covering both syntactic changes (column rename, type cast) and semantic ones (value transformations, derivations). Explanations let analysts decide whether to accept changes, propagate them, or roll back — a capability missing from delta-based versioning systems.
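To give a feel for the difference between a byte-level diff and an explanation, the following toy sketch labels new columns as renames or simple derivations. It is not the Explain-Da-V algorithm; it assumes tables are dicts from column name to a list of values in the same row order, and it only recognizes exact renames and constant-factor scalings. The helper names are hypothetical.

```python
def constant_factor(src, dst):
    """Return k if dst[i] == k * src[i] for every row, else None."""
    if len(src) != len(dst) or not src:
        return None
    if not all(isinstance(x, (int, float)) for x in src + dst):
        return None
    if src[0] == 0:
        return None
    k = dst[0] / src[0]
    return k if all(d == k * s for s, d in zip(src, dst)) else None


def explain_changes(old, new):
    """Describe how `new` differs from `old` in human-readable terms."""
    explanations = []
    dropped = {c: v for c, v in old.items() if c not in new}
    for col, values in new.items():
        if col in old:
            if old[col] != values:
                explanations.append(f"column '{col}' has updated values")
            continue
        # Same values under a different name: treat as a rename.
        source = next((c for c, v in dropped.items() if v == values), None)
        if source is not None:
            explanations.append(f"column '{source}' renamed to '{col}'")
            del dropped[source]
            continue
        # Constant multiple of an existing numeric column: a derivation.
        for c, v in old.items():
            k = constant_factor(v, values)
            if k is not None:
                explanations.append(f"column '{col}' derived as {k:g} * '{c}'")
                break
        else:
            explanations.append(f"column '{col}' added (unexplained)")
    for c in dropped:
        explanations.append(f"column '{c}' removed")
    return explanations


old = {"name": ["a", "b", "c"], "price_usd": [10, 20, 30]}
new = {
    "product": ["a", "b", "c"],
    "price_usd": [10, 20, 30],
    "price_cents": [1000, 2000, 3000],
}
for line in explain_changes(old, new):
    print(line)
```

Each emitted statement can be checked against the data by re-applying it, which is the sense in which an explanation is verifiable where an opaque delta is not.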
PVLDB 2026
SAVeD: Semantically Aware Version Discovery
A. Erenk, R. Shraga
Given a table, can we find all other versions of it in a data lake?
04 — Exploration

Change Exploration & Version Search

Once you have a history of versions, the natural follow-on is interactive exploration: which parts of the data changed most, what kinds of changes occurred, when, and why. These papers move from storage and explanation toward analytic queries over change itself.

PVLDB 2018 · 12(2): 85–98 · Required
Exploring Change — A New Dimension of Data Analytics
T. Bleifuß, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava
Proposes that data change itself deserves to be a first-class analytic dimension. Introduces the change-cube data model that records insertions, deletions, updates, and schema modifications over time. Demonstrates analytic queries that surface controversy, vandalism, data quality issues, and hidden processes — questions impossible to answer with snapshot-only views.
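A rough sketch of the idea, assuming each change is recorded as a (time, entity, property, value) tuple, which simplifies the paper's model. The data is invented, and `change_counts` and `flip_flops` are hypothetical helpers showing the kind of query that a snapshot-only view cannot answer.

```python
from collections import Counter, namedtuple

# Assumed record shape for one change; a simplification for illustration.
Change = namedtuple("Change", ["time", "entity", "prop", "value"])

# Invented change history.
changes = [
    Change(1, "city_1", "population", "3.4M"),
    Change(2, "city_1", "mayor", "A"),
    Change(3, "city_1", "population", "3.5M"),
    Change(4, "city_1", "population", "3.4M"),  # reverts the previous edit
    Change(5, "city_2", "population", "2.1M"),
    Change(6, "city_1", "population", "3.5M"),  # reverts again
]


def change_counts(changes):
    """Count changes per (entity, property): which fields are most volatile?"""
    return Counter((c.entity, c.prop) for c in changes)


def flip_flops(changes):
    """Count updates that restore the value held before the previous update.

    Repeated back-and-forth edits are a rough signal of controversy or
    vandalism followed by a revert.
    """
    history = {}
    flips = Counter()
    for c in sorted(changes, key=lambda c: c.time):
        values = history.setdefault((c.entity, c.prop), [])
        if len(values) >= 2 and values[-2] == c.value:
            flips[(c.entity, c.prop)] += 1
        values.append(c.value)
    return flips


print(change_counts(changes).most_common(1))
print(flip_flops(changes))
```

Only the final value of each field survives in a snapshot, so neither the volatility count nor the revert pattern is recoverable without the change history.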
CIDR 2019
DBChEx: Interactive Exploration of Data and Schema Change
T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava
The system that turns the change-cube into an interactive demo: a web front-end over an underlying change-cube database, demonstrated on the full histories of IMDB and Wikipedia infoboxes. Shows what exploratory analytics over change history can feel like in practice.
SEAData @ VLDB 2021
The Secret Life of Wikipedia Tables
T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava
Extracts, matches, and analyzes the entire history of 3.5M tables on English Wikipedia — 53.8M total table versions. Shows that web tables have rich life cycles: they're created, change shape, move, grow, shrink, and disappear. An empirical foundation for any work on table version search at web scale.
05 — Theory

Theoretical Foundations

A recent line of work returns to Bhattacherjee et al.'s formulation and asks the algorithmic questions seriously: how well can heuristics like LMG perform in the worst case, and when can we get good approximation guarantees?

arXiv 2024
To Store or Not to Store: A Graph Theoretical Approach for Dataset Versioning
A. Guo, J. Li, P. Sukprasert, S. Khuller, A. Deshpande, K. Mukherjee
Establishes that the LMG heuristic introduced by Bhattacherjee et al. can perform arbitrarily badly in the worst case, and shows hardness of o(n)-approximation on general graphs even with relaxed storage. Develops polynomial-time approximation schemes for tree-like graphs — motivated by the observation that real-world version graphs arising from typical edit operations have low treewidth.
06 — Background

Background & Antecedents

Predecessors and adjacent areas: archiving of scientific data, provenance-enabled lifecycle management, and temporal databases. Useful for placing modern dataset versioning in its broader context.

TODS 2004 · 29(1): 2–42
Archiving Scientific Data
P. Buneman, S. Khanna, K. Tajima, W.-C. Tan
An early principled study of archiving versioned scientific data, with formal treatment of hierarchical data, keys, and time-stamped storage. Pre-dates the data-lake era but establishes much of the conceptual vocabulary later versioning systems inherit.
IEEE Data Eng. Bull. 2018
ProvDB: Provenance-Enabled Lifecycle Management of Collaborative Data Analysis Workflows
H. Miao, A. Deshpande
Provenance and versioning are closely linked: a version is what you get when you track derivations through time. ProvDB tracks the full graph of datasets, transformations, and analyses across users and over time. Also referenced in the Model Lakes reading list — useful for seeing how the same techniques apply to both tabular data and trained models.