04 — Exploration
Change Exploration & Version Search
Once you have a history of versions, the natural follow-on is interactive exploration: which parts of the data changed most, what kinds of changes occurred, when, and why. These papers move from storage and explanation toward analytic queries over change itself.
PVLDB 2018, 12(2): 85–98 (Required)
Exploring Change — A New Dimension of Data Analytics
T. Bleifuß, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava
Proposes that data change itself deserves to be a first-class analytic dimension. Introduces the change-cube data model that records insertions, deletions, updates, and schema modifications over time. Demonstrates analytic queries that surface controversy, vandalism, data quality issues, and hidden processes — questions impossible to answer with snapshot-only views.
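As a rough illustration of the idea, the sketch below treats each change as a (time, entity, property, value) record and counts changes along one dimension. The field names, the `changes_per` helper, and the sample records are hypothetical and chosen only for illustration; they are not the paper's schema or operators.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Change(NamedTuple):
    time: int              # e.g. a revision number or timestamp
    entity: str            # identifier of the changed entity
    prop: str              # property (attribute) that changed
    value: Optional[str]   # new value; None marks a deletion in this sketch


def changes_per(changes, dim):
    """Count changes grouped by one dimension: 'time', 'entity', or 'prop'."""
    return Counter(getattr(c, dim) for c in changes)


# Hypothetical change records for two entities.
changes = [
    Change(1, "e1", "population", "100"),
    Change(2, "e1", "population", "110"),
    Change(3, "e1", "mayor", "A"),
    Change(4, "e2", "population", "50"),
    Change(5, "e1", "population", None),
]

by_entity = changes_per(changes, "entity")
by_prop = changes_per(changes, "prop")
most_changed_entity = by_entity.most_common(1)[0][0]
```

Grouping by a different dimension (entity, property, or time) is what lets the same records answer "what changed most" and "when" without reconstructing any snapshot.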
CIDR 2019
DBChEx: Interactive Exploration of Data and Schema Change
T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava
The system that turns the change-cube into an interactive demo: a web front-end over an underlying change-cube database, demonstrated on the full histories of IMDB and Wikipedia infoboxes. Shows what exploratory analytics over change history can feel like in practice.
SEAData @ VLDB 2021
The Secret Life of Wikipedia Tables
T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava
Extracts, matches, and analyzes the entire history of 3.5M tables on English Wikipedia — 53.8M total table versions. Shows that web tables have rich life cycles: they're created, change shape, move, grow, shrink, and disappear. An empirical foundation for any work on table version search at web scale.
05 — Theory
Theoretical Foundations
A recent line of work returns to Bhattacherjee et al.'s formulation and asks the algorithmic questions seriously: how well can heuristics like LMG perform in the worst case, and when can we get good approximation guarantees?
arXiv 2024
To Store or Not to Store: A Graph Theoretical Approach for Dataset Versioning
A. Guo, J. Li, P. Sukprasert, S. Khuller, A. Deshpande, K. Mukherjee
Establishes that the LMG heuristic introduced by Bhattacherjee et al. can perform arbitrarily badly in the worst case, and shows that an o(n)-approximation is hard to obtain on general graphs even when the storage constraint is relaxed. Develops polynomial-time approximation schemes for tree-like graphs, motivated by the observation that real-world version graphs arising from typical edit operations have low treewidth.
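To make the underlying trade-off concrete, here is a minimal sketch that evaluates a given storage plan on a small version graph: each version is either stored in full or as a delta from another version, and the plan is scored by total storage and total retrieval cost. This is not LMG or any algorithm from the paper. The function name, the cost numbers, and the assumption that reading a fully stored version costs 0 are illustrative simplifications.

```python
from typing import Dict, Optional, Tuple


def plan_costs(full_size: Dict[str, int],
               delta: Dict[Tuple[str, str], Tuple[int, int]],
               parent: Dict[str, Optional[str]]) -> Tuple[int, int]:
    """Return (total storage, total retrieval cost) of a storage plan.

    parent[v] is None if v is stored in full, otherwise the version that
    v is stored as a delta from. delta[(u, v)] holds the
    (storage cost, retrieval cost) of the delta from u to v.
    Assumption: a fully stored version has retrieval cost 0, and the
    parent pointers contain no cycles.
    """
    def retrieval(v: str) -> int:
        p = parent[v]
        if p is None:
            return 0
        # Rebuild the parent first, then apply the delta.
        return retrieval(p) + delta[(p, v)][1]

    storage = sum(full_size[v] if p is None else delta[(p, v)][0]
                  for v, p in parent.items())
    return storage, sum(retrieval(v) for v in parent)


# Hypothetical three-version chain v1 -> v2 -> v3.
full_size = {"v1": 100, "v2": 105, "v3": 110}
delta = {("v1", "v2"): (10, 10), ("v2", "v3"): (10, 10)}

store_all = plan_costs(full_size, delta, {"v1": None, "v2": None, "v3": None})
store_root = plan_costs(full_size, delta, {"v1": None, "v2": "v1", "v3": "v2"})
store_ends = plan_costs(full_size, delta, {"v1": None, "v2": "v1", "v3": None})
```

Storing every version in full minimizes retrieval cost at maximal storage, while a single delta chain does the opposite. Choosing a plan between these extremes under a budget is the optimization problem whose hardness the paper studies.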
06 — Background
Background & Antecedents
Predecessors and adjacent areas: archiving of scientific data, provenance-enabled lifecycle management, and temporal databases. Useful for placing modern dataset versioning in its broader context.
TODS 2004, 29(1): 2–42
Archiving Scientific Data
P. Buneman, S. Khanna, K. Tajima, W.-C. Tan
An early principled study of archiving versioned scientific data, with formal treatment of hierarchical data, keys, and time-stamped storage. Pre-dates the data-lake era but establishes much of the conceptual vocabulary later versioning systems inherit.
IEEE Data Eng. Bull. 2018
ProvDB: Provenance-Enabled Lifecycle Management of Collaborative Data Analysis Workflows
H. Miao, A. Deshpande
Provenance and versioning are closely linked: a version is what you get when you track derivations through time. ProvDB tracks the full graph of datasets, transformations, and analyses across users and over time. Also referenced in the Model Lakes reading list — useful for seeing how the same techniques apply to both tabular data and trained models.