Dependency Hell in ML
ML projects accumulate a uniquely toxic dependency stack: Python version, CUDA version, cuDNN version, PyTorch/TensorFlow, plus dozens of transitive dependencies, all of which must match precisely for a model to run correctly.
- CUDA 11.8 and CUDA 12.x require different PyTorch builds; the wrong build causes silent failures
- OpenCV, scipy, numpy all have breaking API changes across versions
- A model trained with PyTorch 2.0 may not load with PyTorch 1.13 (no forward compat)
- Different team members inadvertently use different library versions with no enforcement
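A cheap defense is to fail fast at process start when the environment drifts from the agreed pins. A minimal sketch, assuming versions are read from a lockfile; the `check_pins` helper and `PINNED` mapping are illustrative names, not a standard API:

```python
# Sketch of a startup guard that fails fast when installed package versions
# drift from the team's pins. check_pins and PINNED are illustrative names.
from importlib.metadata import PackageNotFoundError, version

def check_pins(pins):
    """Return {package: problem} for every pin that is not satisfied."""
    problems = {}
    for pkg, wanted in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems[pkg] = "not installed"
            continue
        if wanted is not None and installed != wanted:
            problems[pkg] = f"installed {installed}, pinned {wanted}"
    return problems

if __name__ == "__main__":
    # In a real project, generate the pins from the lockfile,
    # e.g. PINNED = {"torch": "2.1.0", "numpy": "1.26.4"}
    PINNED = {"definitely-not-installed-pkg": "1.0"}
    print(check_pins(PINNED))  # {'definitely-not-installed-pkg': 'not installed'}
```

Calling this at server startup turns "different team members use different versions" into an immediate, loud crash instead of subtly wrong predictions.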
"Works on My Machine"
Containers solve the fundamental reproducibility problem by packaging the model, its serving code, and every dependency into a single immutable artifact.
- The exact same image runs identically on a developer's laptop, CI/CD runner, and cloud GPU instance
- No "but I have Python 3.10 and you're deploying to Python 3.9" incidents
- Image is built once; every deployment uses the identical binary artifact
- The container registry stores all historical versions, so any old model can be redeployed instantly
Docker Fundamentals Recap
- Image: read-only template built from a Dockerfile; layers are cached independently
- Container: running instance of an image; ephemeral and stateless by design
- Layers: each Dockerfile instruction adds a layer; unchanged layers are reused from cache
- Registry: store and distribute images (Docker Hub, ECR, GCR, GHCR)
- Tag: human-readable image version label; always also pin the immutable digest (`sha256:...`) for production
Portability Benefits
- Laptop → cloud: develop on a local GPU, deploy to AWS/GCP/Azure without environment changes
- Cloud → edge: the same container image can target a Jetson Nano or an AWS Inferentia chip (with appropriate base images)
- Immutable deployments: deploy by image tag; no in-place updates that can partially fail
- Isolation: multiple model versions run side-by-side on the same host without conflict
Base Image Selection
The base image determines your starting dependency footprint. Choose carefully: the wrong base means a much larger image or missing GPU support.
- python:3.11-slim: smallest CPU-only Python base; add only what you need
- nvidia/cuda:12.1-cudnn8-runtime-ubuntu22.04: for GPU inference; runtime variant avoids dev tools
- pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime: pre-installed PyTorch + CUDA; large but convenient
- Always prefer `-runtime` over `-devel` for serving (devel includes compilers, test suites, and >3 GB of overhead)
Multi-Stage Builds
Multi-stage builds dramatically reduce image size by separating the build environment (compiling C extensions, building wheels) from the runtime environment.
- Stage 1 (builder): install gcc, build-essential, compile all pip packages into wheels
- Stage 2 (runtime): python-slim base, copy only compiled wheel files from builder
- Typical savings: 1.5–3 GB reduction for models with native C/C++ dependencies
- Final image contains no compilers, test frameworks, or build artifacts
.dockerignore Best Practices
Every file sent in the build context increases build time and cache invalidation risk.
# .dockerignore
**/__pycache__/
**/*.pyc
**/.pytest_cache/
.git/
.github/
tests/
docs/
notebooks/
*.ipynb
experiments/
.env
secrets/
# Keep model weights out of image
# (mount or download at runtime)
weights/*.bin
weights/*.pt
GPU Support
Containers can access host GPUs when using the NVIDIA Container Toolkit. The GPU driver is NOT inside the container; only the CUDA runtime libraries are.
- Install the NVIDIA Container Toolkit on the host: `apt install nvidia-container-toolkit`
- Run with GPU access: `docker run --gpus all ...` or `--gpus '"device=0,1"'`
- The host driver must be >= the minimum version required by the container's CUDA runtime
- In Kubernetes: request `nvidia.com/gpu: 1` as a resource (requires the NVIDIA device plugin DaemonSet)
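After `docker run --gpus all`, it is worth verifying that the GPUs are actually visible inside the container. A small sketch that shells out to `nvidia-smi` and parses its CSV output; `parse_gpu_csv` and `visible_gpus` are illustrative helper names:

```python
# Sanity check for GPU visibility inside a container: run nvidia-smi in
# CSV mode and parse the result. Helper names here are illustrative.
import subprocess

def parse_gpu_csv(text):
    """Parse `--query-gpu=name,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory = (field.strip() for field in line.split(","))
        gpus.append({"name": name, "memory": memory})
    return gpus

def visible_gpus():
    """Return the GPUs the container can see; raises if nvidia-smi fails."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

An empty result (or a `CalledProcessError`) at startup is a much clearer failure than PyTorch silently falling back to CPU.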
Production Dockerfile for FastAPI Model Server
# ── Stage 1: Build wheels ─────────────────────────────────────────
FROM python:3.11-slim AS builder
# Install build dependencies for native extensions
RUN apt-get update && apt-get install -y \
gcc g++ make libffi-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
# Build all packages as wheels; do not install yet
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
# ── Stage 2: Runtime image ────────────────────────────────────────
FROM python:3.11-slim AS runtime
# Security: create non-root user for running the server
RUN groupadd --gid 1000 appuser \
&& useradd --uid 1000 --gid appuser --shell /bin/bash appuser
WORKDIR /app
# Install runtime system dependencies only
RUN apt-get update && apt-get install -y libgomp1 \
&& rm -rf /var/lib/apt/lists/*
# Copy pre-built wheels from builder and install (no compilation needed)
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl \
&& rm -rf /wheels
# Copy application code
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser configs/ ./configs/
# Copy model artifacts (or download at startup β see callout below)
COPY --chown=appuser:appuser models/production/ ./models/
# Switch to non-root user
USER appuser
# Expose the serving port
EXPOSE 8080
# Health check; Kubernetes readiness/liveness probes call this endpoint.
# Uses stdlib urllib so the check works even if `requests` is not installed.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"
# Start FastAPI server with Uvicorn
ENTRYPOINT ["uvicorn", "src.server:app", \
"--host", "0.0.0.0", \
"--port", "8080", \
"--workers", "2", \
"--timeout-keep-alive", "30"]
Model Weights in Image vs Downloaded at Runtime
Baking large model weights (e.g., a 7B-parameter LLM at 14 GB) into a Docker image creates impractically large images and slow registry pushes and pulls. Preferred pattern: keep the serving code in the image; download weights from S3/GCS/Azure Blob at container startup using a startup script, or mount them from a persistent volume. This keeps images small and allows weight updates without rebuilding the image.
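The download-at-startup pattern reduces to an idempotent cache check. In this sketch the fetch step is injected so it can be backed by boto3, gcsfs, or plain HTTP; the `ensure_weights` name and the `.part` temp-file convention are illustrative:

```python
# Sketch of download-at-startup: check the local cache, fetch only on a
# miss, and publish atomically so readers never see a partial file.
from pathlib import Path

def ensure_weights(weights_dir, filename, fetch):
    """Return path to weights, calling fetch(dest_path) only on cache miss."""
    dest = Path(weights_dir) / filename
    if dest.exists():
        return dest  # already present (baked in, mounted volume, or cached)
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    fetch(tmp)        # e.g. s3.download_file(bucket, key, str(tmp))
    tmp.rename(dest)  # atomic rename: publish only a complete file
    return dest
```

Because the check is idempotent, the same entrypoint works whether weights come from object storage, a pre-populated volume, or the image itself.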
GPU Resource Management
Kubernetes requires explicit resource requests and limits for GPUs. GPUs are not overcommittable; only one pod gets each GPU unless you use time-slicing or MIG.
- Resource request: `nvidia.com/gpu: 1` means Kubernetes schedules the pod only on nodes with an available GPU
- GPU requests must equal GPU limits (no burst; GPU quota is strict)
- NVIDIA Time-Slicing: share one GPU across multiple pods (reduced memory isolation; useful for small models)
- MIG (Multi-Instance GPU): A100/H100 only; hardware partitioning with full memory isolation
- Use node selectors or node affinity to target GPU node pools
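The request-equals-limit rule can be enforced before deploy with a simple check over the pod spec. `validate_gpu_resources` below is a hypothetical helper operating on a dict shaped like a Kubernetes container spec, not a real admission controller:

```python
# Illustrative pre-deploy check: every container that touches
# nvidia.com/gpu must set request == limit (GPUs cannot burst).
GPU = "nvidia.com/gpu"

def validate_gpu_resources(pod_spec):
    """Return a list of human-readable violations (empty = spec is fine)."""
    errors = []
    for container in pod_spec.get("containers", []):
        res = container.get("resources", {})
        request = res.get("requests", {}).get(GPU)
        limit = res.get("limits", {}).get(GPU)
        if request != limit:
            errors.append(
                f"{container['name']}: gpu request {request!r} != limit {limit!r}"
            )
    return errors
```

CPU-only containers pass automatically (both values are `None`), so the check can run over every manifest in CI.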
Autoscaling ML Workloads
Standard HPA based on CPU utilization doesn't work well for GPU-bound inference. Use custom metrics instead.
- GPU utilization HPA: scale based on `nvidia_gpu_utilization` via the DCGM Prometheus exporter
- Request queue depth: scale based on pending inference requests; KEDA (Kubernetes Event-Driven Autoscaling) handles this natively
- P99 latency: scale when P99 exceeds SLA threshold (requires custom metrics adapter)
- Set min replicas to 1 (avoid cold start) and max replicas based on GPU budget
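All of these HPA variants reduce to the same scaling rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A worked sketch for a per-pod queue-depth metric:

```python
# The HPA scaling rule in one function: ceil(current * metric / target),
# clamped to the replica bounds. Works for any average-per-pod metric.
import math

def desired_replicas(current, metric_value, target, min_replicas, max_replicas):
    raw = math.ceil(current * metric_value / target)
    return max(min_replicas, min(max_replicas, raw))

# 2 replicas, an average of 75 queued requests per pod, target 25 per pod:
# ceil(2 * 75 / 25) = 6, so the HPA scales out to 6 replicas.
```

The clamp is why the max-replicas bound should reflect the GPU budget: the formula alone will happily ask for more GPUs than the cluster has.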
Persistent Volumes for Model Weights
Large model weights that are too big to bake into images can be stored on persistent volumes and mounted into serving pods.
- Use the `ReadOnlyMany` access mode so multiple pods can simultaneously mount the same weights volume
- NFS or cloud-native shared filesystems (EFS, Filestore, Azure Files) support ReadOnlyMany
- Pre-populate the volume using an init container that downloads from object storage
- Version the volume alongside the model (separate PVC per model version)
Liveness vs Readiness Probes for ML
ML models have a warmup period: the first N requests are slow as the model loads into GPU memory and JIT-compiles. Probes must account for this.
- Readiness probe: signals when the pod is ready to receive traffic; set `initialDelaySeconds` large enough for model warmup (30–120 s)
- Liveness probe: signals when the pod should be restarted; use more lenient thresholds than readiness
- Startup probe: replaces liveness during startup; allows very long initialization (300s+) without triggering restart
- Use a `/health/ready` endpoint that returns 503 until the first successful inference pass
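The readiness gate itself is tiny and framework-agnostic. A sketch (the `Readiness` class is illustrative; wire its `status()` into your `/health/ready` handler):

```python
# Minimal readiness gate: /health/ready should return 503 until the model
# has completed its first successful (dummy) inference pass.
class Readiness:
    def __init__(self):
        self._warmed_up = False

    def mark_warm(self):
        """Call after the first successful inference warms the model."""
        self._warmed_up = True

    def status(self):
        """(http_status, body) pair for the readiness endpoint."""
        if self._warmed_up:
            return 200, {"status": "ready"}
        return 503, {"status": "warming up"}
```

On startup, run one dummy inference (loading weights into GPU memory and triggering JIT compilation), then call `mark_warm()`; the startup probe keeps retrying until the 503 flips to 200.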
Kubernetes Deployment with GPU Request and Readiness Probe
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
  labels:
    app: ml-model-server
    version: "1.5.0"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model-server
  template:
    metadata:
      labels:
        app: ml-model-server
        version: "1.5.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      # Target GPU node pool
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      # Tolerate the GPU node taint
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: model-server
          image: registry.example.com/ml-model:1.5.0@sha256:a1b2c3...
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http
          # ── Resource requests and limits ─────────────────────────
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"  # must equal request for GPU
          # ── Environment variables ────────────────────────────────
          env:
            - name: MODEL_VERSION
              value: "1.5.0"
            - name: NUM_WORKERS
              value: "2"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          envFrom:
            - secretRef:
                name: model-server-secrets  # MLflow URI, S3 credentials etc.
          # ── Probe configuration ──────────────────────────────────
          # Startup probe: allow up to 5 minutes for model warmup
          startupProbe:
            httpGet:
              path: /health/ready
              port: 8080
            failureThreshold: 30  # 30 × 10s = 5 minutes max startup
            periodSeconds: 10
          # Readiness probe: pod removed from Service if not ready
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0  # startupProbe handles delay
            periodSeconds: 10
            failureThreshold: 3
            successThreshold: 1
          # Liveness probe: restart if completely hung
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 30
            failureThreshold: 5  # 5 failures = 150s before restart
          # ── Volume mounts ────────────────────────────────────────
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
            - name: tmp-dir
              mountPath: /tmp
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights-pvc-v1-5  # version-pinned PVC
        - name: tmp-dir
          emptyDir: {}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # Scale based on GPU utilization (requires DCGM exporter)
    - type: External
      external:
        metric:
          name: nvidia_gpu_utilization_percentage
          selector:
            matchLabels:
              deployment: ml-model-server
        target:
          type: AverageValue
          averageValue: "70"  # target 70% GPU utilization per pod
KServe (formerly KFServing)
Kubernetes-native ML inference platform built on top of Knative and Istio. Defines a standard InferenceService CRD that abstracts away the serving infrastructure.
- Supports sklearn, PyTorch, TensorFlow, XGBoost, ONNX, Triton via a single API spec
- Canary rollouts, A/B testing, and shadow deployments are first-class features
- Serverless inference: scales to zero when idle (with cold start tradeoff)
- Request batching, explainability (Seldon Alibi integration), and outlier detection built-in
BentoML
Framework for packaging model, serving logic, preprocessing, and dependencies into a single deployable "Bento" artifact. Designed to bridge the gap between data scientists and DevOps.
- Define service logic in pure Python with the `@bentoml.service` decorator
- `bentoml build` produces a Bento that includes model + code + dependencies + Docker setup
- `bentoml containerize` generates an optimized Docker image automatically
- Adaptive batching, async runners, and multi-model pipelines are built-in
Ray Serve
Built on the Ray distributed computing framework. Designed for scalable Python-native ML inference with complex pipeline graphs (preprocessing → model → postprocessing).
- Scale each pipeline component independently (preprocessing CPU, inference GPU)
- Model multiplexing: serve multiple model versions on one deployment
- Integrates with LangChain, vLLM, and HuggingFace for LLM serving
- Strong Python ecosystem: no YAML DSL required; pure Python configuration
NVIDIA Triton Inference Server
Production-grade multi-framework inference server from NVIDIA. Optimized for maximum GPU throughput with dynamic batching, model scheduling, and concurrent model execution.
- Supports TensorRT, ONNX Runtime, PyTorch TorchScript, TensorFlow SavedModel, Python backend
- Dynamic batching: accumulates requests over a configurable window before GPU dispatch
- Model ensemble: chain multiple models in a DAG with shared GPU compute
- Model repository: directory-based model management; hot-reload without restart
ML Serving Framework Comparison
| Tool | Abstraction Level | GPU Support | Multi-Model | Strengths |
|---|---|---|---|---|
| KServe | High – InferenceService CRD abstracts infrastructure | Yes (via Triton or custom runtime) | Yes – model ensembles, pipelines | Kubernetes-native; standardized API; canary/shadow built-in; scales to zero |
| Seldon Core | High – SeldonDeployment CRD; inference graph | Yes | Yes – multi-stage inference graphs | Explainability (Alibi); outlier detection; enterprise support; broad framework support |
| BentoML | Medium – Python-first; auto-generates Docker + YAML | Yes | Yes – runner-based composition | Data-scientist-friendly; excellent DX; cloud-agnostic deployment targets |
| Ray Serve | Low – Python code defines deployment graph | Yes | Yes – per-component scaling | Flexible pipeline graphs; great for LLMs; integrates with full Ray ecosystem |
| Triton | Low – model repository + config.pbtxt | Yes – highly optimized | Yes – concurrent model execution | Highest GPU throughput; dynamic batching; TensorRT integration; NVIDIA-optimized |
Dynamic Request Batching
GPUs are most efficient when processing large batches. Dynamic batching accumulates multiple incoming requests over a short window and submits them together as a single GPU operation.
- Batch size 1 → GPU utilization ~10–20%; batch size 32 → 60–80% for transformer models
- Configure max batch size and max batch delay (latency budget) based on SLA requirements
- Triton: `dynamic_batching { max_queue_delay_microseconds: 2000 }` in the model config
- Trade-off: higher batch size = better throughput but higher P99 latency for individual requests
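The batching window logic can be modeled in a few lines: dispatch when the batch is full or when the oldest queued request has exhausted the delay budget. A deterministic toy with no real queue or threads; `plan_batches` is an illustrative name:

```python
# Toy model of a dynamic batching window. Given request arrival times
# (in ms), group them into batches of at most max_batch, dispatching
# early once the oldest queued request has waited max_delay_ms.
def plan_batches(arrival_ms, max_batch, max_delay_ms):
    batches, current, opened_at = [], [], None
    for t in arrival_ms:
        if current and t - opened_at > max_delay_ms:
            batches.append(current)       # delay budget exhausted: dispatch
            current, opened_at = [], None
        if not current:
            opened_at = t                 # window opens with first request
        current.append(t)
        if len(current) == max_batch:
            batches.append(current)       # batch full: dispatch immediately
            current, opened_at = [], None
    if current:
        batches.append(current)
    return batches
```

Under heavy load batches fill before the deadline (throughput wins); under light load the deadline fires with tiny batches, which is exactly why `max_queue_delay` is the knob that trades P99 latency for GPU efficiency.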
ONNX Runtime
ONNX Runtime (ORT) is an accelerated inference engine that can run models from any framework (PyTorch, TensorFlow, sklearn) after exporting to the ONNX format.
- Typically 1.5–3× faster than native PyTorch for inference on CPU
- Supports CUDA, TensorRT, OpenVINO execution providers for GPU/edge acceleration
- Export: `torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)`
- Validate that outputs match the original model numerically before deploying the ORT version
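The numerical validation step boils down to an allclose-style comparison, |a − b| ≤ atol + rtol·|b|, applied to flattened outputs from both runtimes. A pure-Python sketch so it works on plain float lists; `outputs_match` is an illustrative name:

```python
# Parity check between original and exported model outputs, in the
# spirit of numpy.allclose: |a - b| <= atol + rtol * |b| elementwise.
def outputs_match(a, b, rtol=1e-3, atol=1e-5):
    """Compare two flat sequences of floats within relative/absolute tolerance."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))
```

In practice `a` would be the PyTorch model's output and `b` the ONNX Runtime session's output on the same input; run it over a sample of real inputs, not just the export dummy.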
TensorRT
NVIDIA's inference optimizer. Takes a trained model and produces a highly optimized engine for a specific GPU architecture using layer fusion, kernel auto-tuning, and precision reduction.
- Typical speedup: 2–5× vs native PyTorch on NVIDIA GPUs
- FP16: near-zero accuracy loss; 2× memory reduction; supported on all modern NVIDIA GPUs
- INT8: with calibration dataset; further speedup; ~1–2% accuracy drop for most models
- A TensorRT engine is not portable; it must be rebuilt for each target GPU architecture
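What INT8 precision reduction does, in miniature: symmetric per-tensor quantization maps floats onto [-127, 127] with a scale derived from the calibration maximum, and dequantization recovers them with a bounded round-trip error. A toy sketch; real calibration uses activation statistics over a dataset, not a single tensor:

```python
# Toy symmetric INT8 quantization: scale from the calibration max,
# round to integers in [-127, 127], dequantize to see round-trip error.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

vals = [0.0, 0.5, -1.0, 0.25]
q, s = quantize_int8(vals)
restored = dequantize(q, s)
# each element's round-trip error is at most scale / 2
```

The per-element error bound of scale/2 is why calibration matters: one outlier activation inflates the scale and degrades every other value's precision.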
Horizontal vs Vertical Scaling
Two fundamentally different approaches to handling increased inference load. The right choice depends on your workload characteristics.
- Horizontal (add replicas): add more pods/nodes; each handles a fraction of traffic; best for high-concurrency, stateless serving
- Vertical (larger GPU): upgrade to a GPU with more memory/compute; better for models that max out a single GPU's batch capacity
- Most production systems use horizontal scaling as the primary lever and vertical for the base unit size decision
- Horizontal scaling requires stateless serving β no in-memory state between requests
Serving Optimization Techniques
| Technique | Latency Improvement | Complexity | When to Use |
|---|---|---|---|
| Dynamic request batching | 2–8× throughput improvement; slight P99 latency increase | Low – configure in serving framework | Any GPU serving with multiple concurrent requests; especially transformer models |
| ONNX Runtime | 1.5–3× speedup on CPU; moderate GPU gains | Low-Medium – export + validate outputs | CPU inference; cross-framework portability; batch processing workloads |
| TensorRT FP16 | 2–4× speedup; 2× memory reduction | Medium – build engine per GPU arch | NVIDIA GPU serving; latency-critical applications; A100/V100/T4 deployments |
| INT8 quantization | 3–5× speedup; 4× memory reduction | High – requires calibration dataset; validate accuracy | Edge deployment; cost-sensitive cloud serving; models tolerant to slight accuracy loss |
| Model compilation (torch.compile) | 1.5–2× on modern PyTorch; no framework change | Low – one-line addition; warm-up required | PyTorch 2.0+ models; significant gain for transformer architectures |
| Continuous batching (for LLMs) | 3–10× throughput for generative models | High – requires specialized LLM serving infra | LLM text generation; vLLM, TGI, or Triton with TensorRT-LLM backend |
| Horizontal pod scaling | Linear throughput scaling with replicas | Low – Kubernetes HPA + stateless design | High-concurrency serving; cost-effective throughput scaling; standard default approach |
Latency vs Throughput Trade-off
Every serving optimization involves choosing between latency and throughput. Dynamic batching increases throughput by accumulating requests, but the wait time in the queue adds to individual request latency. TensorRT FP16 reduces both latency and memory, making it almost always a win. INT8 further reduces memory and compute but may require careful calibration. Always profile P50, P95, and P99 latency under realistic concurrency levels; P99 is what users experience at the worst moment, and it is typically the binding constraint for SLA agreements.
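Percentile profiling needs no special tooling to prototype. A nearest-rank sketch; production systems would use histogram metrics (e.g., Prometheus) rather than raw latency lists:

```python
# P50/P95/P99 from a list of request latencies using the nearest-rank
# method: the smallest value with at least p% of samples at or below it.
import math

def percentile(latencies, p):
    ordered = sorted(latencies)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies_ms = [12, 15, 11, 14, 13, 18, 95, 16, 13, 14]  # one slow outlier
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
# the tail percentiles are dominated by the outlier even though the
# median looks perfectly healthy
```

This is the failure mode averages hide: mean latency here is about 22 ms while the P99 is 95 ms, and it is the 95 ms that users notice and SLAs bind on.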