📦 Why Version ML Models
Machine learning has a reproducibility problem. A model trained last Tuesday may produce different results today — even from identical source code — because of silent changes in data, library updates, or hardware differences. Versioning is the discipline of recording every input that determined a model's behaviour so you can reproduce, compare, audit, and roll back with confidence.
What Changes Between Runs
A model is the product of far more than its weights file. Six distinct dimensions can differ between any two training runs:
- Training data — rows added, labels corrected, splits reshuffled
- Feature engineering code — normalisation logic, missing-value treatment
- Hyperparameters — learning rate, depth, regularisation strength
- Library versions — PyTorch 2.1 vs 2.3 can yield different numerical results
- Random seeds — weight initialisation, data shuffle, dropout masks
- Hardware / CUDA version — GPU non-determinism in floating-point operations
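Most of these dimensions can be recorded mechanically at the start of every run. A standard-library sketch — the field names are illustrative, and the tracking tools covered later automate much of this:

```python
import hashlib
import json
import platform
import sys

# Tiny stand-in dataset so the sketch is self-contained
with open("train.csv", "w") as f:
    f.write("x,y\n1,0\n2,1\n")

def run_fingerprint(data_path: str, seed: int) -> dict:
    """Capture the inputs that can silently differ between training runs."""
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "data_sha256": data_hash,          # training data content
        "python": sys.version.split()[0],  # interpreter version
        "platform": platform.platform(),   # OS / hardware hint
        "seed": seed,                      # random seed used everywhere
        # library versions can be added via importlib.metadata.version(...)
    }

fp = run_fingerprint("train.csv", seed=42)
print(json.dumps(fp, indent=2))
```

Two runs with identical fingerprints should be comparable; any difference in results then points at the remaining non-determinism (hardware, unseeded ops).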
A Model Is Not Just Weights
When you save model.pkl, you have captured the trained estimator — but not the context that produced it. A production-ready model version must include:
- The preprocessing pipeline (scaler, encoder, imputer)
- The exact training dataset fingerprint (hash or DVC pointer)
- The git commit SHA of the training code
- The conda/pip environment specification
- Performance metrics on a held-out evaluation set
- The hyperparameter configuration used
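One lightweight way to keep this context together is a manifest file saved next to the weights. A sketch — the field names and values below are examples, not a standard:

```python
import json

# Illustrative manifest: everything a production model version should pin
# besides the weights file itself. All values are placeholders.
manifest = {
    "weights_file": "model.pkl",
    "preprocessing_pipeline": "pipeline.pkl",         # scaler / encoder / imputer
    "dataset_pointer": "data/train.csv.dvc",          # DVC pointer or content hash
    "git_commit": "a3f9c12",                          # training-code commit SHA
    "environment_spec": "environment.yml",            # conda/pip specification
    "metrics": {"roc_auc": 0.962, "f1_macro": 0.91},  # held-out evaluation
    "hyperparameters": {"n_estimators": 200, "max_depth": 8},
}

# Ship the manifest alongside the weights so the context travels with them
with open("model_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```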
Why Versioning Matters in Practice
Beyond academic reproducibility, versioning solves real operational problems that arise as soon as models reach production:
- Rollback — new version degrades KPIs; instantly revert to the last known-good version
- Compliance & auditing — regulators may ask which model made a credit decision in Q2 2024
- Debugging — compare two model versions to isolate what changed and why performance shifted
- Collaboration — multiple researchers experiment in parallel without overwriting each other's results
- Cost attribution — understand which experiment consumed how much compute
The Reproducibility Crisis in ML
A 2019 survey of NeurIPS and ICML papers found that fewer than 15% of published results could be independently reproduced. In industry settings, teams regularly cannot reproduce their own models three months later. Without disciplined versioning, every "improvement" is built on a foundation you cannot verify.
📊 Experiment Tracking
Experiment tracking answers the question: "Of all the runs I've done, which one should I promote to production, and why?" A tracking system logs every configurable aspect of a run and provides a UI or API for comparing runs at scale.
What to Log in Every Run
Parameters
Inputs you controlled before training began. These are scalar or small config values — not the data itself.
- Learning rate, batch size, number of epochs
- Model architecture (layers, hidden units, dropout rate)
- Regularisation coefficients (L1/L2 lambda)
- Feature selection decisions, preprocessing flags
- Random seed values (global and per-library)
Metrics
Outputs you measure during or after training. Good tracking records metrics at each step so you can plot learning curves, not just final values.
- Training loss per batch/epoch
- Validation loss and primary task metric (AUC, F1, RMSE)
- Evaluation on held-out test set after training ends
- Business proxy metrics (precision@K, revenue proxy)
- Training time, GPU memory usage, carbon footprint
Artifacts & Environment
The heavyweight or non-scalar outputs that must be stored by reference rather than value.
- Model weights (pickle, ONNX, SavedModel)
- Preprocessing pipeline object
- Confusion matrix, PR curve plots
- Git commit SHA of training script
- pip freeze / conda env YAML
- DVC pointer to training dataset version
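Stripped of its UI, the core comparison query a tracker answers is a sort over logged records. A toy sketch with made-up run data:

```python
# Each run's logged params and metrics, shaped as a tracking backend
# would return them (run IDs and values are invented for illustration).
runs = [
    {"run_id": "r1", "params": {"lr": 0.1},   "metrics": {"val_auc": 0.93}},
    {"run_id": "r2", "params": {"lr": 0.01},  "metrics": {"val_auc": 0.96}},
    {"run_id": "r3", "params": {"lr": 0.001}, "metrics": {"val_auc": 0.95}},
]

# "Which run should I promote?" is a max over the validation metric
best = max(runs, key=lambda r: r["metrics"]["val_auc"])
print(best["run_id"], best["metrics"]["val_auc"])  # → r2 0.96
```

Real trackers add the crucial extras: the metric history per step, the artifacts, and the environment metadata needed to actually reproduce `r2`.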
Tool Comparison
| Tool | Hosting | Key Features | Free Tier |
|---|---|---|---|
| MLflow | Self-hosted (open source); Databricks managed | Runs, experiments, model registry, serving, autolog, plugin ecosystem | Fully open source; Databricks Community free |
| Weights & Biases | SaaS (wandb.ai); on-prem available | Real-time charts, hyperparameter sweeps, artifact versioning, collaboration, reports | Free for individuals; 100 GB storage |
| Neptune.ai | SaaS; self-hosted Enterprise | Metadata store, comparison tables, integrations with 25+ frameworks | Free tier: 200 hours compute monitoring |
| Comet ML | SaaS; on-prem | Code diff between runs, data panels, model production monitoring | Free for open-source and academic |
| ClearML | Self-hosted or SaaS | Full MLOps platform: experiment tracking, pipelines, data management, orchestration | Open source community edition |
Weights & Biases Sweeps
W&B Sweeps automate hyperparameter search. You define a sweep config (search space + strategy: grid, random, or Bayesian), and W&B agents distributed across machines pull the next config to try. All runs are automatically grouped, and W&B surfaces the best configuration with importance analysis showing which hyperparameters mattered most.
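A sweep is declared in a small YAML config. A sketch — the program name, parameter names, and ranges below are illustrative:

```yaml
# sweep.yaml — illustrative W&B sweep configuration (Bayesian search)
program: train.py
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [32, 64, 128]
  dropout:
    values: [0.2, 0.3, 0.5]
```

Register it with `wandb sweep sweep.yaml`, then start `wandb agent <sweep-id>` on any number of machines; each agent pulls its next configuration from the server.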
🔬 MLflow Deep Dive
MLflow is the most widely deployed open-source experiment tracking system. It consists of four components: Tracking (logging API + UI), Projects (reproducible code packaging), Models (generic model format), and Model Registry (lifecycle management covered in the next page). The Tracking component is where most practitioners spend 90% of their time.
Core Tracking Concepts
- Experiment — a named group of runs, typically one per project or task
- Run — a single execution of your training script with specific params
- Run ID — a UUID that uniquely identifies a run for reproducibility
- mlruns/ — local directory where tracking data is stored by default
- Tracking Server — optional remote server backed by a DB + artifact store
Autolog — Zero-Code Instrumentation
MLflow's autolog intercepts framework callbacks to log params, metrics, and models with a single line. Supported frameworks include:
- scikit-learn — params, CV results, feature importance
- PyTorch Lightning — epoch metrics, model checkpoints
- TensorFlow/Keras — epoch metrics, model summary
- XGBoost / LightGBM — tree counts, feature importance
- Transformers (HuggingFace) — trainer metrics, tokenizer
MLflow UI Overview
Launch with mlflow ui (default port 5000). The UI provides:
- Experiment list with run counts and last-modified timestamps
- Run comparison table — sort by any metric
- Parallel coordinates chart for hyperparameter exploration
- Metric history charts (per-step curves)
- Artifact browser — view plots, model files, configs inline
- Model Registry tab for lifecycle management
Full MLflow Experiment Logging Example
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.datasets import load_breast_cancer

# ── Configuration ──────────────────────────────────────────────────
EXPERIMENT_NAME = "breast-cancer-classification"
RANDOM_SEED = 42

params = {
    "n_estimators": 200,
    "max_depth": 8,
    "min_samples_split": 5,
    "max_features": "sqrt",
    "random_state": RANDOM_SEED,
}

# ── Data ───────────────────────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

# ── MLflow Setup ───────────────────────────────────────────────────
# Point at a remote tracking server instead of local ./mlruns:
# mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run(run_name="rf-baseline-v1") as run:
    # Log every hyperparameter before training
    mlflow.log_params(params)

    # Log environment metadata as tags (searchable, not plotted)
    mlflow.set_tags({
        "author": "akumar",
        "dataset_version": "v2024-11",
        "git_commit": "a3f9c12",  # subprocess.check_output(["git", "rev-parse", "HEAD"])
        "framework": "scikit-learn",
    })

    # ── Training ───────────────────────────────────────────────────
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)

    # ── Evaluation ─────────────────────────────────────────────────
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    }
    mlflow.log_metrics(metrics)

    # Log feature importances as an artifact (CSV)
    fi = pd.DataFrame({
        "feature": load_breast_cancer().feature_names,
        "importance": clf.feature_importances_,
    }).sort_values("importance", ascending=False)
    fi.to_csv("/tmp/feature_importance.csv", index=False)
    mlflow.log_artifact("/tmp/feature_importance.csv", artifact_path="reports")

    # Log the model — also records an input/output schema for serving
    signature = mlflow.models.infer_signature(X_train, clf.predict(X_train))
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        signature=signature,
        registered_model_name="breast-cancer-rf",  # auto-registers in Registry
    )

    print(f"Run ID: {run.info.run_id}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")

# ── Querying runs programmatically ─────────────────────────────────
runs = mlflow.search_runs(
    experiment_names=[EXPERIMENT_NAME],
    filter_string="metrics.roc_auc > 0.95",
    order_by=["metrics.roc_auc DESC"],
)
print(runs[["run_id", "params.n_estimators", "metrics.roc_auc"]].head())
Production Tracking Server Setup
For team use, run MLflow with a PostgreSQL backend store and an S3 artifact store: mlflow server --backend-store-uri postgresql://user:pass@host/mlflow --default-artifact-root s3://my-bucket/mlflow. This separates metadata (fast DB queries) from large artifacts (object storage), which is the right architecture for teams running thousands of experiments.
🗂️ Git for ML — DVC
Git was designed for text files. A 50 GB training dataset or a 7 GB model checkpoint will overwhelm any git repository. DVC (Data Version Control) extends git's paradigm to large files and ML pipelines: you store small pointer files in git while DVC manages the actual bytes in remote storage (S3, GCS, Azure Blob, SSH, etc.). The result is that checkout, branch, and merge semantics work the same way for data as they do for code.
DVC Core Concepts
- .dvc files — tiny pointer files committed to git; contain an MD5 hash and path
- Remote storage — configured per-repo; stores actual file bytes by content hash
- dvc add — tracks a large file, writes the .dvc pointer, gitignores the original
- dvc push/pull — sync files between local cache and remote storage
- dvc repro — re-run only the pipeline stages that are out of date
- dvc diff — compare datasets between git commits
DVC Pipelines
DVC pipelines (dvc.yaml) define a DAG of stages: each stage has deps (inputs), outs (outputs), and a command to run. DVC tracks file hashes to determine if a stage needs to be re-run, similar to Makefiles but with data awareness.
- Stages: prepare → featurize → train → evaluate
- Params tracked from a YAML config file
- Metrics and plots are first-class outputs
- dvc dag renders the pipeline graph
- Integrates with CI/CD via GitHub Actions
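The tracked params live in a plain params.yaml committed to git. A fragment with illustrative values — changing a tracked value marks the dependent stage stale on the next dvc repro:

```yaml
# params.yaml — values read by the train stage of dvc.yaml
train:
  lr: 0.01
  max_depth: 8
```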
Tagging Model Releases in Git
Complement DVC with git tags to mark production model releases:
- Tag the commit whose .dvc file points to the production weights: git tag -a v1.2.0-model -m "AUC 0.962, trained on 2024-11 data"
- Use annotated tags (not lightweight) for release notes
- Push tags to origin: git push origin --tags
- Rollback: git checkout v1.1.0-model && dvc pull
DVC Workflow: Dataset + Model Versioning
# ── Initial DVC setup ──────────────────────────────────────────────
git init my-ml-project && cd my-ml-project
dvc init # creates .dvc/ config directory
git add .dvc && git commit -m "init dvc"
# Configure remote storage (S3 example)
dvc remote add -d myremote s3://my-ml-bucket/dvc-store
dvc remote modify myremote region us-east-1
git add .dvc/config && git commit -m "add s3 remote"
# ── Tracking a dataset ─────────────────────────────────────────────
# Download / prepare your raw data
mkdir data && cp /mnt/nas/dataset_v1.csv data/
dvc add data/dataset_v1.csv # creates data/dataset_v1.csv.dvc
git add data/dataset_v1.csv.dvc data/.gitignore
git commit -m "track dataset v1 with DVC"
dvc push # uploads bytes to S3
# ── Defining a pipeline (dvc.yaml) ────────────────────────────────
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare.py --input data/dataset_v1.csv --output data/prepared/
    deps:
      - src/prepare.py
      - data/dataset_v1.csv
    outs:
      - data/prepared/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/
    params:
      - params.yaml:
          - train.lr
          - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/eval.json:
          cache: false
EOF
# ── Running the pipeline ───────────────────────────────────────────
dvc repro # runs only stale stages
git add dvc.lock dvc.yaml metrics/eval.json
git commit -m "train: AUC 0.951, lr=0.01"
dvc push # push model artifact to S3
# ── Updating the dataset (v2) ──────────────────────────────────────
cp /mnt/nas/dataset_v2.csv data/dataset_v1.csv # in-place update
dvc add data/dataset_v1.csv # updates the .dvc file hash
git add data/dataset_v1.csv.dvc
git commit -m "track dataset v2 (Nov 2024 batch)"
dvc repro # automatically re-trains with new data
# ── Rollback to v1 model ───────────────────────────────────────────
git checkout HEAD~2 -- data/dataset_v1.csv.dvc dvc.lock  # pipeline outputs are tracked in dvc.lock, not .dvc files
dvc pull # fetches the v1 bytes from S3
# Your working directory now has the v1 model and dataset
Git-LFS vs DVC
Git LFS (Large File Storage) is simpler to set up but treats large files as opaque blobs — it doesn't understand ML pipelines, can't diff datasets semantically, and has no concept of stages or experiments. DVC is the right choice for ML projects because it understands the relationship between data, code, params, and model output as a first-class concept.
✅ Reproducibility Best Practices
Reproducibility is not binary — it exists on a spectrum from "roughly similar results" to "bit-for-bit identical outputs." For most ML teams, the practical goal is statistically reproducible: re-running a training script on the same data and code produces results within the expected random variation of the training process. The following practices close the gap between "it worked once" and "it always works."
Pin All Dependency Versions
Unpinned ("floating") dependencies are the silent killer of reproducibility: a library author pushes a bug fix that changes numerical behaviour, and your model silently changes with it.
- Generate pip freeze > requirements.txt after every successful experiment
- Pin exact versions: torch==2.3.1, not torch>=2.0
- Use conda env export > environment.yml for CUDA/cuDNN pinning
- Consider Poetry or uv for deterministic lock files
- Log the resolved environment in every MLflow run as an artifact
Set Random Seeds Everywhere
Many sources of randomness must be seeded independently — setting just Python's random seed is not enough:
- random.seed(42) — Python stdlib
- np.random.seed(42) — NumPy operations
- torch.manual_seed(42) — PyTorch CPU ops
- torch.cuda.manual_seed_all(42) — GPU ops
- torch.backends.cudnn.deterministic = True
- Set the PYTHONHASHSEED=42 environment variable
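The seeds above can be set in one helper called at the top of every training script. A sketch, with the NumPy and PyTorch calls guarded so it also runs where those libraries are absent:

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Seed every common source of randomness in one place."""
    # Note: only affects subprocesses; set PYTHONHASHSEED before launch
    # if this process's own hash randomisation must be fixed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)                          # Python stdlib
    try:
        import numpy as np
        np.random.seed(seed)                   # NumPy
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)                # PyTorch CPU ops
        torch.cuda.manual_seed_all(seed)       # all GPUs
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass

seed_everything(42)
first_draw = random.random()
seed_everything(42)
assert random.random() == first_draw  # re-seeding reproduces the draw
```

Log the seed value itself with every run; a seed you cannot recover is as bad as no seed.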
Docker for Environment Capture
A Docker image is the ultimate reproducibility artefact — it captures the OS, system libraries, CUDA toolkit, Python version, and all packages at the byte level.
- Build a training image tagged with the git commit SHA
- Push image to a registry (ECR, GCR) so it's retrievable years later
- Store the image digest in MLflow run tags
- Use --platform linux/amd64 for cross-platform reproducibility
- Multi-stage builds: separate train image from lean serving image
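A minimal sketch of a training image — the base image, paths, and tag scheme are illustrative:

```dockerfile
# Build tagged with the training-code commit, e.g.:
#   docker build -t train:$(git rev-parse --short HEAD) .
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt  # pinned versions only
COPY src/ src/
ENTRYPOINT ["python", "src/train.py"]
```

Pushing this image to a registry and recording its digest in the run's tags means the exact byte-level environment is retrievable long after the build host is gone.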
Log Hardware Specs
GPU architecture and driver version affect floating-point results at the bit level. Log hardware context as run metadata:
- nvidia-smi --query-gpu=name,driver_version --format=csv for GPU name and driver
- CUDA version: torch.version.cuda
- cuDNN version: torch.backends.cudnn.version()
- CPU model and core count for non-GPU training
- Available RAM — affects shuffle buffer sizes and caching
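These facts can be gathered in code and attached as run metadata. A standard-library sketch, with the torch lookups guarded since the library may be absent:

```python
import os
import platform

def hardware_context() -> dict:
    """Collect the hardware facts worth logging with every run."""
    ctx = {
        "cpu": platform.processor() or platform.machine(),
        "cpu_count": os.cpu_count(),
        "os": platform.platform(),
    }
    try:
        import torch
        ctx["cuda"] = torch.version.cuda                 # None on CPU builds
        ctx["cudnn"] = torch.backends.cudnn.version()
        if torch.cuda.is_available():
            ctx["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return ctx

print(hardware_context())
```

The resulting dict can be passed to, for example, mlflow.set_tags() so the hardware context is searchable alongside the run.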
A Result You Cannot Reproduce Is a Result You Cannot Trust
If you cannot re-run an experiment and get results within the expected variance of the stochastic training process, that result has no scientific or engineering value. Before deploying any model to production, a second engineer should be able to check out the tagged commit, run dvc pull && python train.py, and arrive at a model within 0.5% of the reported metric. If they cannot, the model is not production-ready regardless of its headline accuracy.
Reproducibility Checklist
| Category | Check | Tool | Priority |
|---|---|---|---|
| Data | Dataset version tracked with content hash | DVC | Critical |
| Data | Train/val/test splits deterministic (seeded) | scikit-learn, DVC | Critical |
| Code | Git commit SHA logged with every run | MLflow tags, W&B | Critical |
| Params | All hyperparameters logged (not hardcoded) | MLflow, config YAML | Critical |
| Environment | Exact pip/conda dependency versions pinned | pip freeze, Poetry | Critical |
| Randomness | All random seeds set and logged | Manual + MLflow | High |
| Environment | Docker image built and pushed with git SHA tag | Docker, ECR/GCR | High |
| Hardware | GPU/CUDA version logged as run metadata | MLflow tags | Medium |
| Artifacts | Preprocessing pipeline stored alongside weights | MLflow artifacts | Critical |