Reading List: Model Lakes

Papers on model lake management — the emerging problem of storing, discovering, versioning, and reasoning about large collections of trained machine learning models, much as data lakes do for tabular data. The list moves from the seed paper that defines model lakes as a data management problem, through foundational model management infrastructure (ModelDB, MLflow), provenance and lineage, model documentation, model search and discovery, and recent work mapping the structure of public model hubs.

Topics

  1. Foundations
  2. Model Management Infrastructure
  3. Provenance & Versioning
  4. Model Documentation
  5. Model Search & Discovery
  6. Empirical Studies of Model Hubs
01 — Foundations

The Model Lake Vision

The seed paper that defines model lakes as a research agenda within data management: a unified repository for heterogeneous ML models, with formal tasks for model attribution, versioning, search, and benchmarking.

EDBT · 2025 · Required
Model Lakes
K. Pal, D. Bau, R. J. Miller
Formally defines a model lake as a repository for heterogeneous ML models — by analogy to data lakes — and lays out the research agenda. Identifies four core model-lake tasks (attribution, versioning, search, benchmarking) and argues that today's "20th-century keyword search over manually specified names or metadata" is inadequate for repositories with millions of fine-tuned variants of foundation models. Draws explicit lessons from data lake research to motivate the model lake research program.
CBI Workshop · 2025
Model Lake: A New Alternative for Machine Learning Models Management and Governance
M. Garouani, F. Ravat, N. Valles-Parlangeau
A complementary vision focused on enterprise model lifecycle management, audit, and governance. Extends the model-lake concept to integrate datasets, code, and models in a single ecosystem with operational benefits for versioning, discovery, and reuse. Useful as a counterpoint to Pal et al.'s research-oriented framing.
02 — Infrastructure

Model Management Infrastructure

Pre–model-lake foundations: systems that introduced storage, tracking, and querying of trained ML models. These papers anchor what "managing a collection of models" can mean in practice, and frame the gap that model lake research now aims to close.

HILDA · 2016 · Background
ModelDB: A System for Machine Learning Model Management
M. Vartak, H. Subramanyam, W.-E. Lee, S. Viswanathan, S. Husnoo, S. Madden, M. Zaharia
An early end-to-end model management system supporting experiment tracking, versioning, comparison, and reproducibility through a model store with native clients for scikit-learn and Spark MLlib. Establishes the basic primitives (log models with parameters, pipelines, and metadata; query and compare via a frontend) that later model management systems build on.
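
The primitives named in this entry (log a model with its parameters and metrics, then query and compare) can be illustrated with a minimal in-memory sketch. This is not ModelDB's client API: `ModelStore`, `log_model`, `query`, and `best_by` are hypothetical names, and the model names and metric values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """One logged model: a name, its hyperparameters, and its metrics."""
    name: str
    params: dict
    metrics: dict
    version: int = 1


@dataclass
class ModelStore:
    """Hypothetical in-memory store; not the ModelDB client API."""
    records: list = field(default_factory=list)

    def log_model(self, name, params, metrics):
        # The version number counts how many records already share this name.
        version = 1 + sum(r.name == name for r in self.records)
        record = ModelRecord(name, dict(params), dict(metrics), version)
        self.records.append(record)
        return record

    def query(self, **param_filters):
        # Return records whose params match every given key/value pair.
        return [
            r for r in self.records
            if all(r.params.get(k) == v for k, v in param_filters.items())
        ]

    def best_by(self, metric):
        # Compare all records that report the metric; higher is better.
        scored = [r for r in self.records if metric in r.metrics]
        return max(scored, key=lambda r: r.metrics[metric]) if scored else None


store = ModelStore()
store.log_model("churn-lr", {"C": 1.0}, {"auc": 0.81})
store.log_model("churn-lr", {"C": 0.1}, {"auc": 0.84})
store.log_model("churn-rf", {"n_estimators": 200}, {"auc": 0.83})
best = store.best_by("auc")
```

Here `best` is the second `churn-lr` version. A real system would persist records and capture pipelines as well, which this sketch leaves out.
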
IEEE Data Eng. Bull. · 2018 · Background
On Challenges in Machine Learning Model Management
S. Schelter, F. Biessmann, T. Januschowski, D. Salinas, S. Seufert, G. Szarvas
A position paper surveying the conceptual, engineering, and data-management challenges of managing ML models in production. Catalogues the open problems — model versioning, prediction validation, retraining decisions, pipeline lineage — that the field has been working on since. Required background for understanding why model lakes are hard.
IEEE Data Eng. Bull. · 2018 · Background
Accelerating the Machine Learning Lifecycle with MLflow
M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, C. Zumar
Introduces MLflow, one of the most widely used open-source ML lifecycle platforms, organized around tracking (logging runs, parameters, metrics), projects (reproducible packaging), and models (a standard format for downstream tools). The model registry (a centralized model store with lineage, versioning, and stage transitions) was added to MLflow after this paper. A de facto reference architecture for production model management.
03 — Provenance

Provenance & Versioning

A model is only meaningful when you can trace it back to its inputs: the dataset, the code, the hyperparameters, the prior model it was fine-tuned from. These papers study how to capture and query that lineage at the scale of full ML pipelines.

IEEE Data Eng. Bull. · 2018 · Background
ProvDB: Provenance-Enabled Lifecycle Management of Collaborative Data Analysis Workflows
H. Miao, A. Deshpande
A provenance-centric system for collaborative data-science workflows. Tracks the full graph of datasets, transformations, models, and analyses across users and over time, supporting queries like "which models depend on this dataset version" — a capability that becomes essential at model-lake scale.
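
A query like "which models depend on this dataset version" amounts to a reachability question over a provenance graph. The sketch below is a plain breadth-first traversal, not ProvDB's query interface. The edge list, the `kind:name` naming scheme, and the `downstream_models` helper are assumptions made for illustration.

```python
from collections import defaultdict, deque

# Hypothetical provenance edges: (source artifact, derived artifact).
EDGES = [
    ("dataset:v1", "features:v1"),
    ("dataset:v2", "features:v2"),
    ("features:v1", "model:baseline"),
    ("features:v2", "model:tuned"),
    ("model:tuned", "model:distilled"),
]


def downstream_models(edges, start):
    """Return every model artifact reachable from `start` in the graph."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen, queue, models = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                # Only artifacts labeled as models are reported.
                if nxt.startswith("model:"):
                    models.append(nxt)
    return sorted(models)


affected = downstream_models(EDGES, "dataset:v2")
```

The traversal follows derivation edges transitively, so a model distilled from a model trained on `dataset:v2` is reported alongside the directly trained one.
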
Inf. Systems · 2025 · Background
Capturing End-to-End Provenance for Machine Learning Pipelines
(see venue for author list)
A systematic study of how modern ML platforms such as MLflow capture provenance across the full pipeline lifecycle, including registered models, version creation, deletions, and stage transitions. Useful for understanding both what current systems track and what they miss for model-lake-scale governance.
04 — Documentation

Model Documentation

How do you describe a model so that someone — or some other system — can decide whether to use it? Documentation is metadata, and metadata is the substrate on which model search, attribution, and governance all depend. This line of work asks what should go on a model's "label."

FAccT · 2019 · Background
Model Cards for Model Reporting
M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, T. Gebru
Proposes "model cards" — short structured documents accompanying released models — covering intended use, performance across demographic groups, training data, and ethical considerations. Now the standard model-documentation framework on Hugging Face and most major model hubs, and the lever by which much of model-lake metadata becomes available at all.
arXiv · 2024 · Background
What's Documented in AI? Systematic Analysis of 32K AI Model Cards
W. Liang, et al.
A large-scale empirical analysis of 32,000 model cards on Hugging Face. Shows that even where cards exist, they're frequently incomplete, inconsistent, or missing critical sections — revealing the gap between Mitchell et al.'s ideal and current practice, and motivating automated model-card generation.
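
To make "incomplete" concrete, the sketch below checks a card against the four sections listed in the Model Cards entry above (intended use, performance, training data, ethical considerations). It is not the methodology of either paper. The field names, the `missing_sections` and `completeness` helpers, and the sample card are assumptions for illustration.

```python
# Section keys are hypothetical; they mirror the sections named in the
# Model Cards entry, not any hub's actual metadata schema.
REQUIRED_SECTIONS = (
    "intended_use",
    "performance",
    "training_data",
    "ethical_considerations",
)


def missing_sections(card):
    """List required sections that are absent or blank in a card dict."""
    return [
        s for s in REQUIRED_SECTIONS
        if not str(card.get(s) or "").strip()
    ]


def completeness(card):
    """Fraction of required sections that contain some text."""
    return 1 - len(missing_sections(card)) / len(REQUIRED_SECTIONS)


card = {
    "intended_use": "Sentiment classification of English product reviews.",
    "performance": "",
    "training_data": "Public review corpus (details omitted).",
}
gaps = missing_sections(card)
score = completeness(card)
```

A presence check like this says nothing about whether a filled-in section is accurate or useful. It only flags the most basic kind of gap.
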
ACL Findings · 2024
Automatic Generation of Model and Data Cards: A Step Towards Responsible AI
J. Liu, W. Li, Z. Wei, et al.
Uses LLMs to automatically generate model and data cards from training-time metadata and code artifacts. A concrete step toward the model-lake vision of metadata at scale: when the lake has millions of models, hand-curated cards are no longer feasible.
05 — Search

Model Search & Discovery

The core retrieval problem in a model lake: given a task description, a query model, or a dataset, find the right model(s) among potentially millions. This is the most active line of recent work and the area where data-lake techniques (embeddings, RAG, agentic search) are crossing over.
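
As a baseline for the retrieval problem described above, the sketch below ranks models by bag-of-words cosine similarity between a task description and each model's card text. It is a deliberately simple stand-in, not a method from any paper in this section. The model ids, card snippets, and the `search` helper are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, cards, k=2):
    """Rank model ids by similarity between the query and card text."""
    q = Counter(tokens(query))
    scored = [
        (cosine(q, Counter(tokens(text))), model_id)
        for model_id, text in cards.items()
    ]
    # Highest score first; ties broken by model id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [model_id for score, model_id in scored[:k] if score > 0]


# Hypothetical model ids and card snippets, for illustration only.
CARDS = {
    "org/sentiment-en": "English sentiment classification for product reviews",
    "org/ner-de": "German named entity recognition for news text",
    "org/summarizer": "Abstractive summarization for English news articles",
}
hits = search("classify sentiment of reviews", CARDS)
```

Exact token matching already shows the weakness of keyword search: the query word "classify" earns no credit against "classification" in the card. Embedding-based and agentic approaches aim to close that kind of gap.
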

CVPR · 2025
Charting and Navigating Hugging Face's Model Atlas
E. Horwitz, et al.
Builds a structured "atlas" of the Hugging Face model ecosystem by recovering the implicit fine-tuning DAG that connects foundation models to their descendants. Demonstrates concrete applications: predicting model attributes, recovering fine-tuning structure even when undocumented, and surfacing the evolutionary structure of public model hubs. A model-lake-style approach applied to today's largest public model repository.
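
The idea of recovering undocumented fine-tuning structure can be illustrated with a toy heuristic: assume a fine-tuned model's weights stay close to its parent's, and guess each model's parent as the nearest model uploaded before it. This is not the paper's algorithm. The two-dimensional "weights", the upload order, the `root_threshold` parameter, and the `recover_parents` helper are all assumptions for illustration.

```python
import math

# Hypothetical models: (name, upload order, tiny stand-in weight vector).
MODELS = [
    ("base-a", 0, (0.0, 0.0)),
    ("base-b", 1, (10.0, 10.0)),
    ("a-chat", 2, (0.5, 0.2)),
    ("b-code", 3, (10.3, 9.6)),
    ("a-chat-dpo", 4, (0.9, 0.3)),
]


def recover_parents(models, root_threshold=5.0):
    """Guess each model's parent as its nearest earlier model.

    A model at least `root_threshold` away from every earlier model is
    treated as a root (parent None).
    """
    parents = {}
    ordered = sorted(models, key=lambda m: m[1])
    for i, (name, _, weights) in enumerate(ordered):
        best_name, best_dist = None, root_threshold
        for prev_name, _, prev_weights in ordered[:i]:
            d = math.dist(weights, prev_weights)
            if d < best_dist:
                best_name, best_dist = prev_name, d
        parents[name] = best_name
    return parents


parents = recover_parents(MODELS)
```

On this toy data the heuristic recovers two trees, one rooted at `base-a` (with `a-chat` and then `a-chat-dpo` beneath it) and one rooted at `base-b`. Real weight spaces are far higher-dimensional and the nearest-neighbor assumption will not always hold.
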
arXiv · 2025
HuggingR4: A Progressive Reasoning Framework for Discovering Optimal Model Companions
(see arXiv listing for authors)
Recasts model selection as an iterative reasoning process — Reasoning, Retrieval, Refinement, Reflection — rather than one-shot retrieval. Targets the 2M+ model setting of Hugging Face explicitly, where putting full descriptions into prompts is infeasible. Mirrors the agentic shift now happening in text-to-SQL schema linking.
arXiv · 2025
ML-Asset Management: Curation, Discovery, and Utilization
(see arXiv listing for authors)
A broader framing covering not just models but the full bundle of curated ML assets — models, datasets, prompts, agents. Useful for situating model search within the wider problem of asset discovery in the LLM era.
06 — Empirical Studies

Empirical Studies of Model Hubs

Hugging Face is, in effect, the world's largest model lake. Empirical studies of how it's used, named, and reused are the most direct evidence we have for what a working model lake actually looks like — and where it breaks.

ICSE · 2023 · Background
An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry
W. Jiang, N. Synovic, M. Hyatt, T. R. Schorlemmer, R. Sethi, Y.-H. Lu, G. K. Thiruvathukal, J. C. Davis
A foundational empirical analysis of how the Hugging Face model hub is actually used: how models are named, packaged, reused, and forked. Documents the patterns and pathologies (inconsistent naming, missing documentation, broken dependencies) that any production model lake must contend with.
FSE · 2024 · Background
"I see models being a whole other thing": Naming Practices of Pre-Trained Models on Hugging Face
W. Jiang, et al.
A qualitative and empirical analysis of model naming practices on Hugging Face. Shows that names are inconsistent, sometimes misleading, and frequently inadequate as a basis for selection — a direct argument for the richer search and retrieval primitives that model lake research targets.