Reading List — Model Lakes

Topics

Foundations
Model Management Infrastructure
Provenance & Versioning
Model Documentation
Model Search & Discovery
Empirical Studies of Model Hubs

01 — Foundations

The Model Lake Vision

The seed paper that defines model lakes as a research agenda within data management: a unified repository for heterogeneous ML models, with formal tasks for model attribution, versioning, search, and benchmarking.

EDBT 2025 Required

Model Lakes

K. Pal, D. Bau, R. J. Miller

Formally defines a model lake as a repository for heterogeneous ML models — by analogy to data lakes — and lays out the research agenda. Identifies four core model-lake tasks (attribution, versioning, search, benchmarking) and argues that today's "20th-century keyword search over manually specified names or metadata" is inadequate for repositories with millions of fine-tuned variants of foundation models. Draws explicit lessons from data lake research to motivate the model lake research program.

DOI arXiv March 2024

CBI Workshop 2025

Model Lake: A New Alternative for Machine Learning Models Management and Governance

M. Garouani, F. Ravat, N. Valles-Parlangeau

A complementary vision focused on enterprise model lifecycle management, audit, and governance. Extends the model-lake concept to integrate datasets, code, and models in a single ecosystem with operational benefits for versioning, discovery, and reuse. Useful as a counterpoint to Pal et al.'s research-oriented framing.

Springer arXiv March 2025

02 — Infrastructure

Model Management Infrastructure

Pre–model-lake foundations: systems that introduced storage, tracking, and querying of trained ML models. These papers anchor what "managing a collection of models" can mean in practice, and frame the gap that model lake research now aims to close.

HILDA 2016 Background

ModelDB: A System for Machine Learning Model Management

M. Vartak, H. Subramanyam, W.-E. Lee, S. Viswanathan, S. Husnoo, S. Madden, M. Zaharia

An early end-to-end model management system supporting experiment tracking, versioning, comparison, and reproducibility through a model store with native clients for scikit-learn and Spark MLlib. Establishes the basic primitives (log models with parameters, pipelines, and metadata; query and compare via a frontend) that later model management systems build on.
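The log, query, and compare primitives named above can be sketched in a few lines of Python. This is a hypothetical in-memory store for illustration only, not ModelDB's client API: `ModelStore`, `log_model`, `query`, and `compare` are invented names, and the parameter and metric values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ModelRecord:
    """One logged model: hyperparameters, metrics, and free-form metadata."""
    model_id: int
    name: str
    params: dict[str, Any]
    metrics: dict[str, float]
    metadata: dict[str, Any] = field(default_factory=dict)


class ModelStore:
    """Hypothetical in-memory model store (illustrative, not ModelDB's API)."""

    def __init__(self) -> None:
        self._records: list[ModelRecord] = []

    def log_model(self, name, params, metrics, **metadata) -> ModelRecord:
        # Ids are assigned sequentially, starting at 1.
        record = ModelRecord(len(self._records) + 1, name,
                             dict(params), dict(metrics), metadata)
        self._records.append(record)
        return record

    def query(self, predicate: Callable[[ModelRecord], bool]) -> list[ModelRecord]:
        # Return every logged model for which the predicate holds.
        return [r for r in self._records if predicate(r)]

    def compare(self, id_a: int, id_b: int) -> dict[str, tuple[Any, Any]]:
        # Report only the hyperparameters whose values differ.
        a, b = self._records[id_a - 1], self._records[id_b - 1]
        keys = sorted(set(a.params) | set(b.params))
        return {k: (a.params.get(k), b.params.get(k))
                for k in keys if a.params.get(k) != b.params.get(k)}


store = ModelStore()
store.log_model("logreg", {"C": 1.0, "penalty": "l2"}, {"accuracy": 0.81}, dataset="churn-v1")
store.log_model("logreg", {"C": 0.1, "penalty": "l2"}, {"accuracy": 0.84}, dataset="churn-v1")

best = store.query(lambda r: r.metrics["accuracy"] > 0.82)
diff = store.compare(1, 2)
```

Here `best` holds only the second model, and `diff` shows that the two runs differ only in `C`. A real system would persist records and version the artifacts themselves; the sketch keeps everything in memory to show the shape of the interface.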

ACM DL

IEEE Data Eng. Bull. 2018 Background

On Challenges in Machine Learning Model Management

S. Schelter, F. Biessmann, T. Januschowski, D. Salinas, S. Seufert, G. Szarvas

A position paper surveying the conceptual, engineering, and data-management challenges of managing ML models in production. Catalogues the open problems — model versioning, prediction validation, retraining decisions, pipeline lineage — that the field has been working on since. Required background for understanding why model lakes are hard.

PDF

IEEE Data Eng. Bull. 2018 Background

Accelerating the Machine Learning Lifecycle with MLflow

M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, C. Zumar

Introduces MLflow, one of the most widely used open-source ML lifecycle platforms, organized around three components: tracking (logging runs, parameters, metrics), projects (reproducible packaging), and models (a standard format for downstream tools). The model registry (a centralized model store with lineage, versioning, and stage transitions) was added to MLflow after this paper. A common reference architecture for production model management.

PDF

03 — Provenance

Provenance & Versioning

A model is only meaningful when you can trace it back to its inputs: the dataset, the code, the hyperparameters, the prior model it was fine-tuned from. These papers study how to capture and query that lineage at the scale of full ML pipelines.

IEEE Data Eng. Bull. 2018 Background

ProvDB: Provenance-Enabled Lifecycle Management of Collaborative Data Analysis Workflows

H. Miao, A. Deshpande

A provenance-centric system for collaborative data-science workflows. Tracks the full graph of datasets, transformations, models, and analyses across users and over time, supporting queries like "which models depend on this dataset version" — a capability that becomes essential at model-lake scale.
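A query such as "which models depend on this dataset version" amounts to a reachability search over the provenance graph. The sketch below is a minimal illustration under stated assumptions, not ProvDB's data model or query language: the `kind:name@version` node naming, the edge list, and `downstream_models` are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical provenance edges: (input artifact, derived artifact).
EDGES = [
    ("dataset:sales@v1", "transform:clean@v1"),
    ("transform:clean@v1", "model:forecast@v1"),
    ("model:forecast@v1", "model:forecast-finetuned@v1"),
    ("dataset:sales@v2", "model:forecast@v2"),
]


def downstream_models(edges, source):
    """Breadth-first search along derivation edges; return reachable model nodes."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # Keep only model artifacts, in a stable order.
    return sorted(n for n in seen if n.startswith("model:"))


affected = downstream_models(EDGES, "dataset:sales@v1")
```

In this toy graph, `affected` contains both `model:forecast@v1` and its fine-tuned descendant, while `model:forecast@v2` is untouched because it derives from a different dataset version.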

PDF

Inf. Systems 2025 Background

Capturing End-to-End Provenance for Machine Learning Pipelines

(see venue for author list)

A systematic study of how modern ML platforms such as MLflow capture provenance across the full pipeline lifecycle: registered models, version creation, deletions, and stage transitions. Useful for understanding both what current systems track and what they miss for model-lake-scale governance.

ScienceDirect

04 — Documentation

Model Documentation

How do you describe a model so that someone — or some other system — can decide whether to use it? Documentation is metadata, and metadata is the substrate on which model search, attribution, and governance all depend. This line of work asks what should go on a model's "label."

FAccT 2019 Background

Model Cards for Model Reporting

M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, T. Gebru

Proposes "model cards" — short structured documents accompanying released models — covering intended use, performance across demographic groups, training data, and ethical considerations. Now the standard model-documentation framework on Hugging Face and most major model hubs, and the lever by which much of model-lake metadata becomes available at all.
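Because a model card is structured metadata, it can be treated as a typed record and checked mechanically. The sketch below is a simplified assumption, not the template from the paper or any hub's schema: `ModelCard` covers only the four areas listed above, `missing_sections` is a hypothetical helper, and the example values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ModelCard:
    """Simplified card covering only the four areas named above."""
    model_name: str
    intended_use: Optional[str] = None
    performance_by_group: Optional[dict] = None
    training_data: Optional[str] = None
    ethical_considerations: Optional[str] = None


def missing_sections(card: ModelCard) -> list:
    """Return the names of sections that are absent or empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


card = ModelCard(
    model_name="toy-sentiment-classifier",
    intended_use="Sentiment analysis of English product reviews.",
    performance_by_group={"group_a": 0.91, "group_b": 0.86},  # made-up scores
)
gaps = missing_sections(card)
```

For this card, `gaps` lists `training_data` and `ethical_considerations`, the kind of signal a search or governance layer could use to rank or flag models.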

ACM DL arXiv

arXiv 2024 Background

What's Documented in AI? Systematic Analysis of 32K AI Model Cards

W. Liang, et al.

A large-scale empirical analysis of roughly 32,000 model cards on Hugging Face. Shows that even where cards exist, they are frequently incomplete, inconsistent, or missing critical sections. This exposes the gap between Mitchell et al.'s ideal and current practice, and motivates automated model-card generation.

arXiv

NAACL 2024

Automatic Generation of Model and Data Cards: A Step Towards Responsible AI

J. Liu, W. Li, Z. Jin, M. Diab

Uses LLMs to automatically generate model and data cards from training-time metadata and code artifacts. A concrete step toward the model-lake vision of metadata at scale: when the lake has millions of models, hand-curated cards are no longer feasible.

arXiv

05 — Search

Model Search & Discovery

The core retrieval problem in a model lake: given a task description, a query model, or a dataset, find the right model(s) among potentially millions. This is the most active line of recent work and the area where data-lake techniques (embeddings, RAG, agentic search) are crossing over.
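As a toy illustration of the retrieval problem, the sketch below ranks model descriptions against a task description by cosine similarity. It deliberately swaps in bag-of-words vectors for the learned embeddings used in real systems, and the model names, descriptions, and the `search` helper are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical model descriptions keyed by invented model names.
CARDS = {
    "toy/sentiment-en": "english sentiment classification of product reviews",
    "toy/ner-legal": "named entity recognition for legal contracts",
    "toy/summarize-news": "abstractive summarization of english news articles",
}


def tokenize(text):
    """Lowercase bag-of-words vector (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, cards, k=2):
    """Return up to k model names with non-zero similarity, best first."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(desc)), name) for name, desc in cards.items()]
    return [name for score, name in sorted(scored, reverse=True)[:k] if score > 0]


hits = search("classify sentiment of english reviews", CARDS)
```

The query ranks `toy/sentiment-en` first and `toy/summarize-news` second, the latter only because of shared words such as "english". Weak lexical matches like that one are part of why this line of work moves beyond keyword overlap.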

CVPR 2025

Charting and Navigating Hugging Face's Model Atlas

E. Horwitz, et al.

Builds a structured "atlas" of the Hugging Face model ecosystem by recovering the implicit fine-tuning DAG that connects foundation models to their descendants. Demonstrates concrete applications: predicting model attributes, recovering fine-tuning structure even when undocumented, and surfacing the evolutionary structure of public model hubs. A model-lake-style approach applied to today's largest public model repository.
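Once fine-tuning edges are known, questions about the ecosystem's structure reduce to simple graph traversals. The sketch below assumes the edges are already given, which skips the hard part the paper addresses (recovering undocumented edges). It also simplifies the DAG to a tree with one parent per model, and every model name in it is hypothetical.

```python
from collections import Counter

# Hypothetical child -> parent fine-tuning edges; None marks a root model.
PARENT = {
    "base-llm": None,
    "base-llm-chat": "base-llm",
    "base-llm-chat-med": "base-llm-chat",
    "base-llm-code": "base-llm",
    "vision-base": None,
    "vision-base-xray": "vision-base",
}


def root_of(model, parent):
    """Follow parent links until reaching a model with no recorded parent."""
    while parent[model] is not None:
        model = parent[model]
    return model


def family_sizes(parent):
    """Count how many models (root included) trace back to each root."""
    return Counter(root_of(m, parent) for m in parent)


sizes = family_sizes(PARENT)
```

In this toy graph, four models trace back to `base-llm` and two to `vision-base`. Merged models, which have several parents, would need a true DAG representation rather than a single parent pointer.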

arXiv

arXiv 2025

HuggingR4: A Progressive Reasoning Framework for Discovering Optimal Model Companions

(see arXiv listing for authors)

Recasts model selection as an iterative reasoning process — Reasoning, Retrieval, Refinement, Reflection — rather than one-shot retrieval. Targets the 2M+ model setting of Hugging Face explicitly, where putting full descriptions into prompts is infeasible. Mirrors the agentic shift now happening in text-to-SQL schema linking.

arXiv

arXiv 2025

ML-Asset Management: Curation, Discovery, and Utilization

(see arXiv listing for authors)

A broader framing covering not just models but the full bundle of curated ML assets — models, datasets, prompts, agents. Useful for situating model search within the wider problem of asset discovery in the LLM era.

arXiv

06 — Empirical Studies

Empirical Studies of Model Hubs

Hugging Face is, in effect, the world's largest model lake. Empirical studies of how it's used, named, and reused are the most direct evidence we have for what a working model lake actually looks like — and where it breaks.

ICSE 2023 Background

An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry

W. Jiang, N. Synovic, M. Hyatt, T. R. Schorlemmer, R. Sethi, Y.-H. Lu, G. K. Thiruvathukal, J. C. Davis

A foundational empirical analysis of how the Hugging Face model hub is actually used: how models are named, packaged, reused, and forked. Documents the patterns and pathologies (inconsistent naming, missing documentation, broken dependencies) that any production model lake must contend with.

arXiv

FSE 2024 Background

"I see models being a whole other thing": Naming Practices of Pre-Trained Models on Hugging Face

W. Jiang, et al.

A qualitative and empirical analysis of model naming practices on Hugging Face. Shows that names are inconsistent, sometimes misleading, and frequently inadequate as a basis for selection — a direct argument for the richer search and retrieval primitives that model lake research targets.

arXiv