⏱ 9 min read 📊 Intermediate 🗓 Updated Jan 2025
📦 Why Containers for ML

Dependency Hell in ML

ML projects accumulate a uniquely toxic dependency stack: Python version, CUDA version, cuDNN version, PyTorch/TensorFlow, plus dozens of transitive dependencies — all of which must match precisely for a model to run correctly.

  • CUDA 11.8 and CUDA 12.x have different PyTorch builds — wrong build = silent failures
  • OpenCV, scipy, numpy all have breaking API changes across versions
  • A model trained with PyTorch 2.0 may not load with PyTorch 1.13 (no forward compat)
  • Different team members inadvertently use different library versions with no enforcement
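
One cheap defense against this drift is a fail-fast check at container startup that compares installed package versions against the pins the model was trained with. The sketch below uses only the stdlib; the `PINNED` dict is a hypothetical placeholder you would generate from the training environment (e.g. from a pip freeze at train time):

```python
"""Fail fast at startup if the runtime environment has drifted from
the training environment's pinned versions (illustrative sketch)."""
from importlib.metadata import PackageNotFoundError, version

# Hypothetical pins -- generate these from the training environment.
PINNED = {
    # "torch": "2.1.0",
    # "numpy": "1.26.4",
}

def check_environment(pins: dict) -> list:
    """Return human-readable mismatch descriptions; empty list means OK."""
    problems = []
    for pkg, wanted in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg}: not installed (want {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{pkg}: have {installed}, want {wanted}")
    return problems

if __name__ == "__main__":
    issues = check_environment(PINNED)
    if issues:
        raise SystemExit("Environment drift detected:\n" + "\n".join(issues))
```

Run inside the container entrypoint, this turns a silent version mismatch into an immediate, legible crash instead of a subtly wrong model.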

"Works on My Machine"

Containers solve the fundamental reproducibility problem by packaging the model, its serving code, and every dependency into a single immutable artifact.

  • The exact same image runs identically on a developer's laptop, CI/CD runner, and cloud GPU instance
  • No "but I have Python 3.10 and you're deploying to Python 3.9" incidents
  • Image is built once; every deployment uses the identical binary artifact
  • Container registry stores all historical versions — any old model can be re-deployed instantly

Docker Fundamentals Recap

  • Image: read-only template built from a Dockerfile; layers are cached independently
  • Container: running instance of an image; ephemeral and stateless by design
  • Layers: each Dockerfile instruction adds a layer; unchanged layers are reused from cache
  • Registry: store and distribute images (Docker Hub, ECR, GCR, GHCR)
  • Tag: human-readable image version label; always also use the immutable digest (sha256:...) for production

Portability Benefits

  • Laptop → cloud: develop on local GPU, deploy to AWS/GCP/Azure without environment changes
  • Cloud → edge: the same container image can target a Jetson Nano or an AWS Inferentia chip (with appropriate base images)
  • Immutable deployments: deploy by image tag — no in-place updates that can partially fail
  • Isolation: multiple model versions run side-by-side on the same host without conflict
🐳 Dockerizing ML Models

Base Image Selection

The base image determines your starting dependency footprint. Choose carefully — the wrong base means either a much larger image or missing GPU support.

  • python:3.11-slim: smallest CPU-only Python base; add only what you need
  • nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04: for GPU inference; runtime variant avoids dev tools
  • pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime: pre-installed PyTorch + CUDA; large but convenient
  • Always prefer -runtime over -devel for serving (devel adds compilers, headers, and dev libraries, often >3GB of overhead)

Multi-Stage Builds

Multi-stage builds dramatically reduce image size by separating the build environment (compiling C extensions, building wheels) from the runtime environment.

  • Stage 1 (builder): install gcc, build-essential, compile all pip packages into wheels
  • Stage 2 (runtime): python-slim base, copy only compiled wheel files from builder
  • Typical savings: 1.5–3GB reduction for models with native C/C++ dependencies
  • Final image contains no compilers, test frameworks, or build artifacts

.dockerignore Best Practices

Every file sent in the build context increases build time and cache invalidation risk.

# .dockerignore
**/__pycache__/
**/*.pyc
**/.pytest_cache/
.git/
.github/
tests/
docs/
notebooks/
*.ipynb
experiments/
.env
secrets/
# Keep model weights out of image
# (mount or download at runtime)
weights/*.bin
weights/*.pt

GPU Support

Containers can access host GPUs when using the NVIDIA Container Toolkit. The GPU driver is NOT inside the container — only the CUDA runtime libraries are.

  • Install NVIDIA Container Toolkit on the host (after adding NVIDIA's apt repository): apt-get install nvidia-container-toolkit
  • Run with GPU access: docker run --gpus all ... or --gpus '"device=0,1"'
  • Host driver must be >= minimum version required by the container's CUDA runtime
  • In Kubernetes: nvidia.com/gpu: 1 resource request (requires NVIDIA device plugin DaemonSet)
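
A quick sanity check from inside the container is to probe for nvidia-smi, which the NVIDIA runtime injects when --gpus is passed. A best-effort sketch:

```python
"""Best-effort check that a GPU is visible inside the container.
nvidia-smi is injected by the NVIDIA Container Toolkit at run time,
so its absence usually means the container started without --gpus."""
import shutil
import subprocess

def gpu_visible() -> bool:
    """True if nvidia-smi exists and can enumerate at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # prints one line per visible GPU
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout
```

Calling this at startup (and logging the result) catches the classic "container runs but silently falls back to CPU" failure before it reaches production traffic.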

Production Dockerfile for FastAPI Model Server

# ── Stage 1: Build wheels ─────────────────────────────────────────
FROM python:3.11-slim AS builder

# Install build dependencies for native extensions
RUN apt-get update && apt-get install -y \
    gcc g++ make libffi-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build

COPY requirements.txt .

# Build all packages as wheels; do not install yet
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt


# ── Stage 2: Runtime image ─────────────────────────────────────────
FROM python:3.11-slim AS runtime

# Security: create non-root user for running the server
RUN groupadd --gid 1000 appuser \
    && useradd --uid 1000 --gid appuser --shell /bin/bash appuser

WORKDIR /app

# Install runtime system dependencies only
RUN apt-get update && apt-get install -y libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy pre-built wheels from builder and install (no compilation needed)
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl \
    && rm -rf /wheels

# Copy application code
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser configs/ ./configs/

# Copy model artifacts (or download at startup β€” see callout below)
COPY --chown=appuser:appuser models/production/ ./models/

# Switch to non-root user
USER appuser

# Expose the serving port
EXPOSE 8080

# Health check — Kubernetes readiness/liveness probes will call this
# (stdlib urllib, so the check works even if requests is not installed;
# urlopen raises on non-2xx, failing the check with a non-zero exit)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"

# Start FastAPI server with Uvicorn
ENTRYPOINT ["uvicorn", "src.server:app", \
    "--host", "0.0.0.0", \
    "--port", "8080", \
    "--workers", "2", \
    "--timeout-keep-alive", "30"]

Model Weights in Image vs Downloaded at Runtime

Baking large model weights (e.g., 7B parameter LLM at 14GB) into a Docker image creates impractically large images and slow registries. Preferred pattern: keep the serving code in the image; download weights from S3/GCS/Azure Blob at container startup using a startup script, or mount them from a persistent volume. This keeps images small and allows weight updates without rebuilding the image.
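
A minimal startup fetcher might look like the sketch below (pure stdlib, assuming the weights URL is a pre-signed or otherwise directly reachable link; a real setup would use boto3/gcsfs plus checksum verification):

```python
"""Download model weights at container startup, skipping the fetch
when a previous start already populated the local cache path."""
import os
import urllib.request

def ensure_weights(url: str, dest: str) -> str:
    """Fetch weights from url into dest unless a non-empty file exists."""
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return dest                  # cached from a previous start
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    tmp = dest + ".part"             # rename only after a full download
    urllib.request.urlretrieve(url, tmp)
    os.replace(tmp, dest)
    return dest
```

In Kubernetes the same logic often lives in an init container, so the serving container starts only once the weights are already on disk.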

☸️ Kubernetes for ML Serving

GPU Resource Management

Kubernetes requires explicit resource requests and limits for GPUs. GPUs are not overcommittable — only one pod gets each GPU unless using time-slicing or MIG.

  • Resource request: nvidia.com/gpu: 1 — Kubernetes schedules pod only on nodes with available GPU
  • GPU requests must equal GPU limits (no burst; GPU quota is strict)
  • NVIDIA Time-Slicing: share one GPU across multiple pods (reduced memory isolation; useful for small models)
  • MIG (Multi-Instance GPU): A100/H100 only; hardware partitioning with full memory isolation
  • Use node selectors or node affinity to target GPU node pools

Autoscaling ML Workloads

Standard HPA based on CPU utilization doesn't work well for GPU-bound inference. Use custom metrics instead.

  • GPU utilization HPA: scale based on nvidia_gpu_utilization via DCGM Prometheus exporter
  • Request queue depth: scale based on pending inference requests; KEDA (Kubernetes Event-Driven Autoscaling) handles this natively
  • P99 latency: scale when P99 exceeds SLA threshold (requires custom metrics adapter)
  • Set min replicas to 1 (avoid cold start) and max replicas based on GPU budget

Persistent Volumes for Model Weights

Large model weights that are too big to bake into images can be stored on persistent volumes and mounted into serving pods.

  • Use ReadOnlyMany access mode — multiple pods can simultaneously mount the same weights volume
  • NFS or cloud-native shared filesystems (EFS, Filestore, Azure Files) support ReadOnlyMany
  • Pre-populate the volume using an init container that downloads from object storage
  • Version the volume alongside the model (separate PVC per model version)

Liveness vs Readiness Probes for ML

ML models have a warmup period — the first N requests are slow as the model loads into GPU memory and JIT-compiles. Probes must account for this.

  • Readiness probe: signals when pod is ready to receive traffic; set initialDelaySeconds large enough for model warmup (30–120s)
  • Liveness probe: signals when pod should be restarted; more lenient thresholds than readiness
  • Startup probe: replaces liveness during startup; allows very long initialization (300s+) without triggering restart
  • Use a /health/ready endpoint that returns 503 until first successful inference pass
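
The 503-until-warm behavior in the last bullet can be expressed framework-agnostically; the class below is a sketch you would wire into a FastAPI route (all names here are illustrative, not a standard API):

```python
"""Readiness gate: /health/ready should report 503 until the model
has completed one successful warmup inference pass."""

class ReadinessGate:
    def __init__(self) -> None:
        self._warm = False

    def mark_warm(self) -> None:
        """Call once, after the first successful inference pass."""
        self._warm = True

    def ready_status(self) -> int:
        """HTTP status code for the readiness endpoint."""
        return 200 if self._warm else 503

# Example wiring (FastAPI-style pseudocode):
#   gate = ReadinessGate()
#   @app.get("/health/ready")
#   def ready():
#       return Response(status_code=gate.ready_status())
```

Kubernetes then keeps the pod out of the Service until the first real inference has succeeded, so no user request ever pays the JIT/warmup cost.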

Kubernetes Deployment with GPU Request and Readiness Probe

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
  labels:
    app: ml-model-server
    version: "1.5.0"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model-server
  template:
    metadata:
      labels:
        app: ml-model-server
        version: "1.5.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      # Target GPU node pool
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100

      # GPU driver tolerance
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

      containers:
        - name: model-server
          image: registry.example.com/ml-model:1.5.0@sha256:a1b2c3...
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http

          # ── Resource requests and limits ─────────────────────────
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"  # must equal request for GPU

          # ── Environment variables ─────────────────────────────────
          env:
            - name: MODEL_VERSION
              value: "1.5.0"
            - name: NUM_WORKERS
              value: "2"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          envFrom:
            - secretRef:
                name: model-server-secrets  # MLflow URI, S3 credentials etc.

          # ── Probe configuration ───────────────────────────────────
          # Startup probe: allow up to 5 minutes for model warmup
          startupProbe:
            httpGet:
              path: /health/ready
              port: 8080
            failureThreshold: 30   # 30 × 10s = 5 minutes max startup
            periodSeconds: 10

          # Readiness probe: pod removed from Service if not ready
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0  # startupProbe handles delay
            periodSeconds: 10
            failureThreshold: 3
            successThreshold: 1

          # Liveness probe: restart if completely hung
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 30
            failureThreshold: 5  # 5 failures = 150s before restart

          # ── Volume mounts ─────────────────────────────────────────
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
            - name: tmp-dir
              mountPath: /tmp

      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights-pvc-v1-5  # version-pinned PVC
        - name: tmp-dir
          emptyDir: {}

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # Scale based on GPU utilization (requires DCGM exporter)
    - type: External
      external:
        metric:
          name: nvidia_gpu_utilization_percentage
          selector:
            matchLabels:
              deployment: ml-model-server
        target:
          type: AverageValue
          averageValue: "70"  # target 70% GPU utilization per pod
πŸ› οΈ ML-Specific Serving Frameworks

KServe (formerly KFServing)

Kubernetes-native ML inference platform built on top of Knative and Istio. Defines a standard InferenceService CRD that abstracts away the serving infrastructure.

  • Supports sklearn, PyTorch, TensorFlow, XGBoost, ONNX, Triton via a single API spec
  • Canary rollouts, A/B testing, and shadow deployments are first-class features
  • Serverless inference: scales to zero when idle (with cold start tradeoff)
  • Request batching, explainability (Seldon Alibi integration), and outlier detection built-in
Open Source · Kubernetes Native

BentoML

Framework for packaging model, serving logic, preprocessing, and dependencies into a single deployable "Bento" artifact. Designed to bridge the gap between data scientists and DevOps.

  • Define service logic in pure Python with @bentoml.service decorator
  • bentoml build produces a Bento that includes model + code + dependencies + Docker setup
  • bentoml containerize generates an optimized Docker image automatically
  • Adaptive batching, async runners, and multi-model pipelines are built-in
Open Source · Data Scientist Friendly

Ray Serve

Built on the Ray distributed computing framework. Designed for scalable Python-native ML inference with complex pipeline graphs (preprocessing → model → postprocessing).

  • Scale each pipeline component independently (preprocessing CPU, inference GPU)
  • Model multiplexing: serve multiple model versions on one deployment
  • Integrates with LangChain, vLLM, and HuggingFace for LLM serving
  • Strong Python ecosystem: no YAML DSL required; pure Python configuration
Open Source · LLM Ready

NVIDIA Triton Inference Server

Production-grade multi-framework inference server from NVIDIA. Optimized for maximum GPU throughput with dynamic batching, model scheduling, and concurrent model execution.

  • Supports TensorRT, ONNX Runtime, PyTorch TorchScript, TensorFlow SavedModel, Python backend
  • Dynamic batching: accumulates requests over a configurable window before GPU dispatch
  • Model ensemble: chain multiple models in a DAG with shared GPU compute
  • Model repository: directory-based model management; hot-reload without restart
Open Source · NVIDIA Optimized

ML Serving Framework Comparison

  • KServe — abstraction: high (InferenceService CRD abstracts infrastructure); GPU: yes (via Triton or custom runtime); multi-model: yes (model ensembles, pipelines); strengths: Kubernetes-native, standardized API, canary/shadow built-in, scales to zero
  • Seldon Core — abstraction: high (SeldonDeployment CRD, inference graph); GPU: yes; multi-model: yes (multi-stage inference graphs); strengths: explainability (Alibi), outlier detection, enterprise support, broad framework support
  • BentoML — abstraction: medium (Python-first, auto-generates Docker + YAML); GPU: yes; multi-model: yes (runner-based composition); strengths: data-scientist-friendly, excellent DX, cloud-agnostic deployment targets
  • Ray Serve — abstraction: low (Python code defines deployment graph); GPU: yes; multi-model: yes (per-component scaling); strengths: flexible pipeline graphs, great for LLMs, integrates with full Ray ecosystem
  • Triton — abstraction: low (model repository + config.pbtxt); GPU: yes, highly optimized; multi-model: yes (concurrent model execution); strengths: highest GPU throughput, dynamic batching, TensorRT integration, NVIDIA-optimized
⚡ Serving Optimisation

Dynamic Request Batching

GPUs are most efficient when processing large batches. Dynamic batching accumulates multiple incoming requests over a short window and submits them together as a single GPU operation.

  • Batch size 1 → GPU utilization ~10–20%; batch 32 → 60–80% for transformer models
  • Configure max batch size and max batch delay (latency budget) based on SLA requirements
  • Triton: dynamic_batching { max_queue_delay_microseconds: 2000 } in model config
  • Trade-off: higher batch size = better throughput but higher P99 latency for individual requests
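
A toy version of the mechanism, written synchronously for clarity (production batchers are asynchronous and also flush on a background timer):

```python
"""Toy dynamic batcher: collect requests until the batch is full or
the oldest queued request has exhausted its queueing delay budget."""
import time
from collections import deque

class DynamicBatcher:
    def __init__(self, max_batch: int = 32, max_delay_s: float = 0.002):
        self.max_batch = max_batch
        self.max_delay_s = max_delay_s   # latency budget spent in the queue
        self._queue = deque()
        self._oldest = None              # arrival time of oldest request

    def submit(self, request):
        """Queue a request; return a batch to dispatch, or None."""
        if not self._queue:
            self._oldest = time.monotonic()
        self._queue.append(request)
        full = len(self._queue) >= self.max_batch
        stale = time.monotonic() - self._oldest >= self.max_delay_s
        if full or stale:
            batch = list(self._queue)    # this batch goes to the GPU
            self._queue.clear()
            self._oldest = None
            return batch
        return None
```

The two constructor parameters map directly onto the knobs serving frameworks expose: max batch size bounds GPU memory, and the delay budget bounds the P99 latency penalty.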

ONNX Runtime

ONNX Runtime (ORT) is an accelerated inference engine that can run models from any framework (PyTorch, TensorFlow, sklearn) after exporting to the ONNX format.

  • Typically 1.5–3× faster than native PyTorch for inference on CPU
  • Supports CUDA, TensorRT, OpenVINO execution providers for GPU/edge acceleration
  • Export: torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)
  • Validate outputs match original model numerically before deploying ORT version
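
The numeric validation in the last bullet can be as simple as an element-wise tolerance check over flattened outputs; in practice np.allclose or torch.allclose does this job, but a dependency-free sketch shows the idea:

```python
"""Parity check before cutting over to an exported model: compare
flattened float outputs from the original and the ONNX Runtime run."""

def outputs_match(a, b, rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """True if two flat float sequences agree within tolerances."""
    if len(a) != len(b):
        return False
    # Same combined-tolerance rule numpy.allclose uses.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))
```

Run it over a held-out batch of real inputs, not just the dummy export input, since operator differences can be input-dependent.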

TensorRT

NVIDIA's inference optimizer. Takes a trained model and produces a highly optimized engine for a specific GPU architecture using layer fusion, kernel auto-tuning, and precision reduction.

  • Typical speedup: 2–5× vs native PyTorch on NVIDIA GPUs
  • FP16: near-zero accuracy loss; 2× memory reduction; supported on all modern NVIDIA GPUs
  • INT8: with calibration dataset; further speedup; ~1–2% accuracy drop for most models
  • TensorRT engine is not portable — must rebuild for each target GPU architecture

Horizontal vs Vertical Scaling

Two fundamentally different approaches to handling increased inference load. The right choice depends on your workload characteristics.

  • Horizontal (add replicas): add more pods/nodes; each handles a fraction of traffic; best for high-concurrency, stateless serving
  • Vertical (larger GPU): upgrade to a GPU with more memory/compute; better for models that max out a single GPU's batch capacity
  • Most production systems use horizontal scaling as the primary lever and vertical for the base unit size decision
  • Horizontal scaling requires stateless serving — no in-memory state between requests
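
For horizontal sizing, the arithmetic is plain capacity planning. The sketch below assumes per_replica_qps comes from load-testing a single pod at its chosen batch settings:

```python
"""Back-of-envelope horizontal scaling: how many replicas to serve a
target request rate, given measured single-replica throughput."""
import math

def replicas_needed(target_qps: float, per_replica_qps: float,
                    headroom: float = 0.2) -> int:
    """Replica count with spare capacity (0.2 = keep 20% headroom)."""
    if per_replica_qps <= 0:
        raise ValueError("per_replica_qps must be positive")
    return math.ceil(target_qps * (1 + headroom) / per_replica_qps)

# e.g. 100 req/s target at 30 req/s per pod with 20% headroom -> 4 pods
```

The headroom term is what absorbs traffic spikes between HPA scaling decisions; set it from your observed burstiness, not a fixed rule of thumb.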

Serving Optimisation Techniques

  • Dynamic request batching — impact: 2–8× throughput improvement, slight P99 latency increase; complexity: low (configure in serving framework); when: any GPU serving with multiple concurrent requests, especially transformer models
  • ONNX Runtime — impact: 1.5–3× speedup on CPU, moderate GPU gains; complexity: low–medium (export + validate outputs); when: CPU inference, cross-framework portability, batch processing workloads
  • TensorRT FP16 — impact: 2–4× speedup, 2× memory reduction; complexity: medium (build engine per GPU arch); when: NVIDIA GPU serving, latency-critical applications, A100/V100/T4 deployments
  • INT8 quantization — impact: 3–5× speedup, 4× memory reduction; complexity: high (requires calibration dataset, validate accuracy); when: edge deployment, cost-sensitive cloud serving, models tolerant to slight accuracy loss
  • Model compilation (torch.compile) — impact: 1.5–2× on modern PyTorch, no framework change; complexity: low (one-line addition, warm-up required); when: PyTorch 2.0+ models, significant gain for transformer architectures
  • Continuous batching (for LLMs) — impact: 3–10× throughput for generative models; complexity: high (requires specialized LLM serving infra); when: LLM text generation with vLLM, TGI, or Triton with the TensorRT-LLM backend
  • Horizontal pod scaling — impact: linear throughput scaling with replicas; complexity: low (Kubernetes HPA + stateless design); when: high-concurrency serving, cost-effective throughput scaling, the standard default approach

Latency vs Throughput Trade-off

Every serving optimization involves choosing between latency and throughput. Dynamic batching increases throughput by accumulating requests β€” but the wait time in the queue adds to individual request latency. TensorRT FP16 reduces both latency and memory, making it almost always a win. INT8 further reduces memory and compute but may require careful calibration. Always profile P50, P95, and P99 latency under realistic concurrency levels β€” P99 is what users experience at the worst moment, and it is typically the binding constraint for SLA agreements.