⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025

📦 Why Version ML Models

Machine learning has a reproducibility problem. A model trained last Tuesday may produce different results today — even from identical source code — because of silent changes in data, library updates, or hardware differences. Versioning is the discipline of recording every input that determined a model's behaviour so you can reproduce, compare, audit, and roll back with confidence.

What Changes Between Runs

A model is the product of far more than its weights file. Six distinct dimensions can differ between any two training runs:

  • Training data — rows added, labels corrected, splits reshuffled
  • Feature engineering code — normalisation logic, missing-value treatment
  • Hyperparameters — learning rate, depth, regularisation strength
  • Library versions — PyTorch 2.1 vs 2.3 can yield different numerical results
  • Random seeds — weight initialisation, data shuffle, dropout masks
  • Hardware / CUDA version — GPU non-determinism in floating-point operations

A Model Is Not Just Weights

When you save model.pkl, you have captured the trained estimator — but none of the context that produced it. A production-ready model version must include:

  • The preprocessing pipeline (scaler, encoder, imputer)
  • The exact training dataset fingerprint (hash or DVC pointer)
  • The git commit SHA of the training code
  • The conda/pip environment specification
  • Performance metrics on a held-out evaluation set
  • The hyperparameter configuration used
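One way to make this checklist enforceable is a small manifest saved next to the weights. A minimal sketch — the field names and file paths are illustrative, not a standard format:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ModelVersion:
    """Everything needed to reproduce and audit one trained model."""
    weights_path: str          # e.g. "models/model.pkl"
    preprocessor_path: str     # serialised scaler/encoder/imputer pipeline
    dataset_hash: str          # content hash or DVC pointer of training data
    git_commit: str            # SHA of the training code
    environment_file: str      # pip freeze / conda env export output
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def save(self, path: str) -> None:
        """Write the manifest as JSON alongside the model artifacts."""
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)
```

A deployment script can then refuse to ship any weights file that arrives without a complete manifest.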

Why Versioning Matters in Practice

Beyond academic reproducibility, versioning solves real operational problems that arise as soon as models reach production:

  • Rollback — new version degrades KPIs; instantly revert to the last known-good version
  • Compliance & auditing — regulators may ask which model made a credit decision in Q2 2024
  • Debugging — compare two model versions to isolate what changed and why performance shifted
  • Collaboration — multiple researchers experiment in parallel without overwriting each other's results
  • Cost attribution — understand which experiment consumed how much compute

The Reproducibility Crisis in ML

A 2019 survey of NeurIPS and ICML papers found that fewer than 15% of published results could be independently reproduced. In industry settings, teams regularly cannot reproduce their own models three months later. Without disciplined versioning, every "improvement" is built on a foundation you cannot verify.

📊 Experiment Tracking

Experiment tracking answers the question: "Of all the runs I've done, which one should I promote to production, and why?" A tracking system logs every configurable aspect of a run and provides a UI or API for comparing runs at scale.

What to Log in Every Run

Parameters

Inputs you controlled before training began. These are scalar or small config values — not the data itself.

  • Learning rate, batch size, number of epochs
  • Model architecture (layers, hidden units, dropout rate)
  • Regularisation coefficients (L1/L2 lambda)
  • Feature selection decisions, preprocessing flags
  • Random seed values (global and per-library)

Metrics

Outputs you measure during or after training. Good tracking records metrics at each step so you can plot learning curves, not just final values.

  • Training loss per batch/epoch
  • Validation loss and primary task metric (AUC, F1, RMSE)
  • Evaluation on held-out test set after training ends
  • Business proxy metrics (precision@K, revenue proxy)
  • Training time, GPU memory usage, carbon footprint

Artifacts & Environment

The heavyweight or non-scalar outputs that must be stored by reference rather than value.

  • Model weights (pickle, ONNX, SavedModel)
  • Preprocessing pipeline object
  • Confusion matrix, PR curve plots
  • Git commit SHA of training script
  • pip freeze / conda env YAML
  • DVC pointer to training dataset version

Tool Comparison

Tool | Hosting | Key Features | Free Tier
MLflow | Self-hosted (open source); Databricks managed | Runs, experiments, model registry, serving, autolog, plugin ecosystem | Fully open source; Databricks Community free
Weights & Biases | SaaS (wandb.ai); on-prem available | Real-time charts, hyperparameter sweeps, artifact versioning, collaboration, reports | Free for individuals; 100 GB storage
Neptune.ai | SaaS; self-hosted Enterprise | Metadata store, comparison tables, integrations with 25+ frameworks | Free tier: 200 hours compute monitoring
Comet ML | SaaS; on-prem | Code diff between runs, data panels, model production monitoring | Free for open-source and academic
ClearML | Self-hosted or SaaS | Full MLOps platform: experiment tracking, pipelines, data management, orchestration | Open source community edition

Weights & Biases Sweeps

W&B Sweeps automate hyperparameter search. You define a sweep config (search space + strategy: grid, random, or Bayesian), and W&B agents distributed across machines pull the next config to try. All runs are automatically grouped, and W&B surfaces the best configuration with importance analysis showing which hyperparameters mattered most.
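A sweep is driven by a declarative config. A minimal sketch in Python — the project name, metric name, and search space are illustrative, and the launch calls are shown commented because they require a logged-in W&B account:

```python
# Hypothetical W&B sweep configuration: Bayesian search over two
# hyperparameters, maximising a validation AUC metric that the
# training function is expected to log.
sweep_config = {
    "method": "bayes",                     # grid | random | bayes
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-4,
            "max": 1e-1,
        },
        "max_depth": {"values": [4, 6, 8, 12]},
    },
}

# With a real account, the sweep would be launched roughly like this:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="my-project")
# wandb.agent(sweep_id, function=train, count=20)  # train() logs val_auc
```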

🔬 MLflow Deep Dive

MLflow is the most widely deployed open-source experiment tracking system. It consists of four components: Tracking (logging API + UI), Projects (reproducible code packaging), Models (generic model format), and Model Registry (lifecycle management covered in the next page). The Tracking component is where most practitioners spend 90% of their time.

Core Tracking Concepts

  • Experiment — a named group of runs, typically one per project or task
  • Run — a single execution of your training script with specific params
  • Run ID — a UUID that uniquely identifies a run for reproducibility
  • mlruns/ — local directory where tracking data is stored by default
  • Tracking Server — optional remote server backed by a DB + artifact store

Autolog — Zero-Code Instrumentation

MLflow's autolog intercepts framework callbacks to log params, metrics, and models with a single line. Supported frameworks include:

  • scikit-learn — params, CV results, feature importance
  • PyTorch Lightning — epoch metrics, model checkpoints
  • TensorFlow/Keras — epoch metrics, model summary
  • XGBoost / LightGBM — tree counts, feature importance
  • Transformers (HuggingFace) — trainer metrics, tokenizer

MLflow UI Overview

Launch with mlflow ui (default port 5000). The UI provides:

  • Experiment list with run counts and last-modified timestamps
  • Run comparison table — sort by any metric
  • Parallel coordinates chart for hyperparameter exploration
  • Metric history charts (per-step curves)
  • Artifact browser — view plots, model files, configs inline
  • Model Registry tab for lifecycle management

Full MLflow Experiment Logging Example

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.datasets import load_breast_cancer
import numpy as np

# ── Configuration ──────────────────────────────────────────────────
EXPERIMENT_NAME = "breast-cancer-classification"
RANDOM_SEED = 42

params = {
    "n_estimators": 200,
    "max_depth": 8,
    "min_samples_split": 5,
    "max_features": "sqrt",
    "random_state": RANDOM_SEED,
}

# ── Data ───────────────────────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

# ── MLflow Setup ───────────────────────────────────────────────────
# Point at a remote tracking server instead of local ./mlruns:
# mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run(run_name="rf-baseline-v1") as run:
    # Log every hyperparameter before training
    mlflow.log_params(params)

    # Log environment metadata as tags (searchable, not plotted)
    mlflow.set_tags({
        "author": "akumar",
        "dataset_version": "v2024-11",
        "git_commit": "a3f9c12",   # subprocess.check_output(["git","rev-parse","HEAD"])
        "framework": "scikit-learn",
    })

    # ── Training ───────────────────────────────────────────────────
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)

    # ── Evaluation ─────────────────────────────────────────────────
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]

    metrics = {
        "accuracy":  accuracy_score(y_test, y_pred),
        "roc_auc":   roc_auc_score(y_test, y_prob),
        "f1_macro":  f1_score(y_test, y_pred, average="macro"),
    }
    mlflow.log_metrics(metrics)

    # Log feature importances as an artifact (CSV)
    import pandas as pd
    fi = pd.DataFrame({
        "feature": load_breast_cancer().feature_names,
        "importance": clf.feature_importances_,
    }).sort_values("importance", ascending=False)
    fi.to_csv("/tmp/feature_importance.csv", index=False)
    mlflow.log_artifact("/tmp/feature_importance.csv", artifact_path="reports")

    # Log the model — also registers input/output schema for serving
    signature = mlflow.models.infer_signature(X_train, clf.predict(X_train))
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        signature=signature,
        registered_model_name="breast-cancer-rf",   # auto-registers in Registry
    )

    print(f"Run ID: {run.info.run_id}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")

# ── Querying runs programmatically ─────────────────────────────────
runs = mlflow.search_runs(
    experiment_names=[EXPERIMENT_NAME],
    filter_string="metrics.roc_auc > 0.95",
    order_by=["metrics.roc_auc DESC"],
)
print(runs[["run_id", "params.n_estimators", "metrics.roc_auc"]].head())

Production Tracking Server Setup

For team use, run MLflow with a PostgreSQL backend store and an S3 artifact store: mlflow server --backend-store-uri postgresql://user:pass@host/mlflow --default-artifact-root s3://my-bucket/mlflow. This separates metadata (fast DB queries) from large artifacts (object storage), which is the right architecture for teams running thousands of experiments.

🗂️ Git for ML — DVC

Git was designed for text files. A 50 GB training dataset or a 7 GB model checkpoint will overwhelm any git repository. DVC (Data Version Control) extends git's paradigm to large files and ML pipelines: you store small pointer files in git while DVC manages the actual bytes in remote storage (S3, GCS, Azure Blob, SSH, etc.). The result is that checkout, branch, and merge semantics work the same way for data as they do for code.

DVC Core Concepts

  • .dvc files — tiny pointer files committed to git; contain an MD5 hash and path
  • Remote storage — configured per-repo; stores actual file bytes by content hash
  • dvc add — tracks a large file, writes the .dvc pointer, gitignores the original
  • dvc push/pull — sync files between local cache and remote storage
  • dvc repro — re-run only the pipeline stages that are out of date
  • dvc diff — compare datasets between git commits
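The pointer-file idea rests on content hashing. The sketch below shows the principle only — DVC's real cache layout and hashing details differ:

```python
import hashlib

def content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, read in chunks so large files never have to
    fit in memory. This mirrors the idea behind a .dvc pointer: the hash,
    not the bytes, is what gets committed to git."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

Because the hash is derived purely from content, two commits pointing at byte-identical datasets dedupe to a single object in remote storage.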

DVC Pipelines

DVC pipelines (dvc.yaml) define a DAG of stages: each stage has deps (inputs), outs (outputs), and a command to run. DVC tracks file hashes to determine if a stage needs to be re-run, similar to Makefiles but with data awareness.

  • Stages: prepare → featurize → train → evaluate
  • Params tracked from a YAML config file
  • Metrics and plots are first-class outputs
  • dvc dag renders the pipeline graph
  • Integrates with CI/CD via GitHub Actions

Tagging Model Releases in Git

Complement DVC with git tags to mark production model releases:

  • git tag -a v1.2.0-model -m "AUC 0.962, trained on 2024-11 data"
  • Tag the commit whose .dvc file points to the production weights
  • Use annotated tags (not lightweight) for release notes
  • Push tags to origin: git push origin --tags
  • Rollback = git checkout v1.1.0-model && dvc pull

DVC Workflow: Dataset + Model Versioning

# ── Initial DVC setup ──────────────────────────────────────────────
git init my-ml-project && cd my-ml-project
dvc init                          # creates .dvc/ config directory
git add .dvc && git commit -m "init dvc"

# Configure remote storage (S3 example)
dvc remote add -d myremote s3://my-ml-bucket/dvc-store
dvc remote modify myremote region us-east-1
git add .dvc/config && git commit -m "add s3 remote"

# ── Tracking a dataset ─────────────────────────────────────────────
# Download / prepare your raw data
mkdir data && cp /mnt/nas/dataset_v1.csv data/

dvc add data/dataset_v1.csv       # creates data/dataset_v1.csv.dvc
git add data/dataset_v1.csv.dvc data/.gitignore
git commit -m "track dataset v1 with DVC"
dvc push                          # uploads bytes to S3

# ── Defining a pipeline (dvc.yaml) ────────────────────────────────
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare.py --input data/dataset_v1.csv --output data/prepared/
    deps:
      - src/prepare.py
      - data/dataset_v1.csv
    outs:
      - data/prepared/

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/
    params:
      - params.yaml:
        - train.lr
        - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/eval.json:
          cache: false
EOF

# ── Running the pipeline ───────────────────────────────────────────
dvc repro                         # runs only stale stages
git add dvc.lock dvc.yaml metrics/eval.json
git commit -m "train: AUC 0.951, lr=0.01"
dvc push                          # push model artifact to S3

# ── Updating the dataset (v2) ──────────────────────────────────────
cp /mnt/nas/dataset_v2.csv data/dataset_v1.csv  # in-place update
dvc add data/dataset_v1.csv       # updates the .dvc file hash
git add data/dataset_v1.csv.dvc
git commit -m "track dataset v2 (Nov 2024 batch)"
dvc repro                         # automatically re-trains with new data

# ── Rollback to v1 model ───────────────────────────────────────────
# The dataset is tracked by its .dvc pointer; the model is a pipeline
# output, so its hash is recorded in dvc.lock. Check out both.
git checkout HEAD~1 -- data/dataset_v1.csv.dvc dvc.lock
dvc pull                          # fetches the v1 bytes from S3
# Your working directory now has the v1 model and dataset
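The train stage above reads two values from params.yaml, which is never shown. A minimal file consistent with those params deps would look like this (the max_depth value is illustrative; lr matches the commit message):

```yaml
train:
  lr: 0.01
  max_depth: 8
```

Editing either value and re-running dvc repro invalidates only the train stage, leaving the prepare stage cached.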

Git-LFS vs DVC

Git LFS (Large File Storage) is simpler to set up but treats large files as opaque blobs — it doesn't understand ML pipelines, can't diff datasets semantically, and has no concept of stages or experiments. DVC is the right choice for ML projects because it understands the relationship between data, code, params, and model output as a first-class concept.

✅ Reproducibility Best Practices

Reproducibility is not binary — it exists on a spectrum from "roughly similar results" to "bit-for-bit identical outputs." For most ML teams, the practical goal is statistically reproducible: re-running a training script on the same data and code produces results within the expected random variation of the training process. The following practices close the gap between "it worked once" and "it always works."

Pin All Dependency Versions

Floating (unpinned) dependencies are the silent killer of reproducibility. A library author pushes a bug fix that changes numerical behaviour, and your model silently changes with it.

  • Generate pip freeze > requirements.txt after every successful experiment
  • Pin exact versions: torch==2.3.1 not torch>=2.0
  • Use conda env export > environment.yml for CUDA/cuDNN pinning
  • Consider Poetry or uv for deterministic lock files
  • Log the resolved environment in every MLflow run as an artifact
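Capturing the resolved environment can be a short post-training step. A sketch — the output path is illustrative, and the artifact-logging call is shown as a comment:

```python
import subprocess
import sys

def snapshot_environment(path: str = "environment.txt") -> str:
    """Write the exact installed package versions to a file so it can be
    attached to the run (e.g. mlflow.log_artifact(path))."""
    frozen = subprocess.check_output(
        [sys.executable, "-m", "pip", "freeze"], text=True
    )
    with open(path, "w") as f:
        f.write(frozen)
    return frozen
```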

Set Random Seeds Everywhere

Many sources of randomness must be seeded independently — setting just Python's random seed is not enough:

  • random.seed(42) — Python stdlib
  • np.random.seed(42) — NumPy operations
  • torch.manual_seed(42) — PyTorch CPU ops
  • torch.cuda.manual_seed_all(42) — GPU ops
  • torch.backends.cudnn.deterministic = True
  • Set the PYTHONHASHSEED=42 environment variable (it is read only at interpreter startup, so set it before launching Python)
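The list above is easy to get wrong piecemeal, so many teams wrap it in one helper called at the top of every training script. A minimal sketch — the NumPy and PyTorch sections are skipped when those libraries are absent:

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Seed every RNG the process uses for a best-effort reproducible run."""
    # Note: hash randomisation itself is only fixed if PYTHONHASHSEED is
    # set before the interpreter starts; this line just records the intent.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Log the seed itself as a run parameter, so that the run remains reproducible even if the default changes.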

Docker for Environment Capture

A Docker image is the ultimate reproducibility artefact — it captures the OS, system libraries, CUDA toolkit, Python version, and all packages at the byte level.

  • Build a training image tagged with the git commit SHA
  • Push image to a registry (ECR, GCR) so it's retrievable years later
  • Store the image digest in MLflow run tags
  • Use --platform linux/amd64 for cross-platform reproducibility
  • Multi-stage builds: separate train image from lean serving image
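A training Dockerfile following these rules might look like this sketch — the base image, paths, and registry naming are all illustrative:

```dockerfile
# Build tagged with the git SHA so the image is traceable years later:
#   docker build -t registry.example.com/train:$(git rev-parse --short HEAD) .
FROM python:3.11-slim

WORKDIR /app

# Copy pinned dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ src/
COPY params.yaml .

ENTRYPOINT ["python", "src/train.py"]
```

For GPU training the base image would instead be a CUDA image matching the pinned framework build, which is exactly why the CUDA toolkit belongs inside the image rather than on the host.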

Log Hardware Specs

GPU architecture and driver version affect floating-point results at the bit level. Log hardware context as run metadata:

  • nvidia-smi --query-gpu=name,driver_version --format=csv
  • CUDA version: torch.version.cuda
  • cuDNN version: torch.backends.cudnn.version()
  • CPU model and core count for non-GPU training
  • Available RAM — affects shuffle buffer sizes and caching
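Collecting that context can be automated into a dictionary of run tags. A sketch — the GPU fields are filled only when PyTorch is installed:

```python
import platform

def hardware_context() -> dict:
    """Collect hardware/runtime metadata to attach as run tags
    (e.g. mlflow.set_tags(hardware_context()))."""
    ctx = {
        "python_version": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        import torch
        ctx["cuda_version"] = torch.version.cuda
        ctx["cudnn_version"] = torch.backends.cudnn.version()
        if torch.cuda.is_available():
            ctx["gpu_name"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return ctx
```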

A Result You Cannot Reproduce Is a Result You Cannot Trust

If you cannot re-run an experiment and get results within the expected variance of the stochastic training process, that result has no scientific or engineering value. Before deploying any model to production, a second engineer should be able to check out the tagged commit, run dvc pull && python train.py, and arrive at a model within 0.5% of the reported metric. If they cannot, the model is not production-ready regardless of its headline accuracy.
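That acceptance check can be expressed as one function. A minimal sketch — the 0.5% tolerance follows the text above, and the function name is illustrative:

```python
def within_reproduction_tolerance(reported: float,
                                  reproduced: float,
                                  rel_tol: float = 0.005) -> bool:
    """Return True if an independently re-run metric lands within
    rel_tol (relative) of the originally reported metric."""
    return abs(reproduced - reported) <= rel_tol * abs(reported)
```

A CI job can run this after `dvc pull && python train.py` and fail the pipeline when the gap exceeds the tolerance.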

Reproducibility Checklist

Category | Check | Tool | Priority
Data | Dataset version tracked with content hash | DVC | Critical
Data | Train/val/test splits deterministic (seeded) | scikit-learn, DVC | Critical
Code | Git commit SHA logged with every run | MLflow tags, W&B | Critical
Params | All hyperparameters logged (not hardcoded) | MLflow, config YAML | Critical
Environment | Exact pip/conda dependency versions pinned | pip freeze, Poetry | Critical
Randomness | All random seeds set and logged | Manual + MLflow | High
Environment | Docker image built and pushed with git SHA tag | Docker, ECR/GCR | High
Hardware | GPU/CUDA version logged as run metadata | MLflow tags | Medium
Artifacts | Preprocessing pipeline stored alongside weights | MLflow artifacts | Critical