Dependency Hell in ML
ML projects accumulate a uniquely toxic dependency stack: Python version, CUDA version, cuDNN version, PyTorch/TensorFlow, plus dozens of transitive dependencies, all of which must match precisely for a model to run correctly.
- CUDA 11.8 and CUDA 12.x require different PyTorch builds; the wrong build causes silent failures
- OpenCV, scipy, numpy all have breaking API changes across versions
- A model trained with PyTorch 2.0 may not load with PyTorch 1.13 (no forward compat)
- Different team members inadvertently use different library versions with no enforcement
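A cheap defense is to fail fast at process start when the environment drifts from the agreed pins. A minimal sketch, assuming versions are read from a lockfile; the `check_pins` helper and `PINNED` mapping are illustrative names, not a standard API:

```python
# Sketch of a startup guard that fails fast when installed package versions
# drift from the team's pins. check_pins and PINNED are illustrative names.
from importlib.metadata import PackageNotFoundError, version

def check_pins(pins):
    """Return {package: problem} for every pin that is not satisfied."""
    problems = {}
    for pkg, wanted in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems[pkg] = "not installed"
            continue
        if wanted is not None and installed != wanted:
            problems[pkg] = f"installed {installed}, pinned {wanted}"
    return problems

if __name__ == "__main__":
    # In a real project, generate the pins from the lockfile,
    # e.g. PINNED = {"torch": "2.1.0", "numpy": "1.26.4"}
    PINNED = {"definitely-not-installed-pkg": "1.0"}
    print(check_pins(PINNED))  # {'definitely-not-installed-pkg': 'not installed'}
```

Calling this at server startup turns "different team members use different versions" into an immediate, loud crash instead of subtly wrong predictions.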
"Works on My Machine"
Containers solve the fundamental reproducibility problem by packaging the model, its serving code, and every dependency into a single immutable artifact.
- The exact same image runs identically on a developer's laptop, CI/CD runner, and cloud GPU instance
- No "but I have Python 3.10 and you're deploying to Python 3.9" incidents
- Image is built once; every deployment uses the identical binary artifact
- The container registry stores all historical versions, so any old model can be redeployed instantly
Docker Fundamentals Recap
- Image: read-only template built from a Dockerfile; layers are cached independently
- Container: running instance of an image; ephemeral and stateless by design
- Layers: each Dockerfile instruction adds a layer; unchanged layers are reused from cache
- Registry: store and distribute images (Docker Hub, ECR, GCR, GHCR)
- Tag: human-readable image version label; always also pin the immutable digest (`sha256:...`) for production
Portability Benefits
- Laptop → cloud: develop on a local GPU, deploy to AWS/GCP/Azure without environment changes
- Cloud → edge: the same container image can target a Jetson Nano or an AWS Inferentia chip (with appropriate base images)
- Immutable deployments: deploy by image tag; no in-place updates that can partially fail
- Isolation: multiple model versions run side-by-side on the same host without conflict
Base Image Selection
The base image determines your starting dependency footprint. Choose carefully: the wrong base means a much larger image or missing GPU support.
- python:3.11-slim: smallest CPU-only Python base; add only what you need
- nvidia/cuda:12.1-cudnn8-runtime-ubuntu22.04: for GPU inference; runtime variant avoids dev tools
- pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime: pre-installed PyTorch + CUDA; large but convenient
- Always prefer `-runtime` over `-devel` for serving (devel includes compilers, test suites, and >3 GB of overhead)
Multi-Stage Builds
Multi-stage builds dramatically reduce image size by separating the build environment (compiling C extensions, building wheels) from the runtime environment.
- Stage 1 (builder): install gcc, build-essential, compile all pip packages into wheels
- Stage 2 (runtime): python-slim base, copy only compiled wheel files from builder
- Typical savings: 1.5–3 GB reduction for models with native C/C++ dependencies
- Final image contains no compilers, test frameworks, or build artifacts
.dockerignore Best Practices
Every file sent in the build context increases build time and cache invalidation risk.
# .dockerignore
**/__pycache__/
**/*.pyc
**/.pytest_cache/
.git/
.github/
tests/
docs/
notebooks/
*.ipynb
experiments/
.env
secrets/
# Keep model weights out of image
# (mount or download at runtime)
weights/*.bin
weights/*.pt
GPU Support
Containers can access host GPUs when using the NVIDIA Container Toolkit. The GPU driver is NOT inside the container; only the CUDA runtime libraries are.
- Install the NVIDIA Container Toolkit on the host: `apt install nvidia-container-toolkit`
- Run with GPU access: `docker run --gpus all ...` or `--gpus '"device=0,1"'`
- The host driver must be >= the minimum version required by the container's CUDA runtime
- In Kubernetes: request `nvidia.com/gpu: 1` as a resource (requires the NVIDIA device plugin DaemonSet)
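After `docker run --gpus all`, it is worth verifying that the GPUs are actually visible inside the container. A small sketch that shells out to `nvidia-smi` and parses its CSV output; `parse_gpu_csv` and `visible_gpus` are illustrative helper names:

```python
# Sanity check for GPU visibility inside a container: run nvidia-smi in
# CSV mode and parse the result. Helper names here are illustrative.
import subprocess

def parse_gpu_csv(text):
    """Parse `--query-gpu=name,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory = (field.strip() for field in line.split(","))
        gpus.append({"name": name, "memory": memory})
    return gpus

def visible_gpus():
    """Return the GPUs the container can see; raises if nvidia-smi fails."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

An empty result (or a `CalledProcessError`) at startup is a much clearer failure than PyTorch silently falling back to CPU.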
Production Dockerfile for FastAPI Model Server
# ── Stage 1: Build wheels ─────────────────────────────────────────
FROM python:3.11-slim AS builder
# Install build dependencies for native extensions
RUN apt-get update && apt-get install -y \
gcc g++ make libffi-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
# Build all packages as wheels; do not install yet
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
# ── Stage 2: Runtime image ────────────────────────────────────────
FROM python:3.11-slim AS runtime
# Security: create non-root user for running the server
RUN groupadd --gid 1000 appuser \
&& useradd --uid 1000 --gid appuser --shell /bin/bash appuser
WORKDIR /app
# Install runtime system dependencies only
RUN apt-get update && apt-get install -y libgomp1 \
&& rm -rf /var/lib/apt/lists/*
# Copy pre-built wheels from builder and install (no compilation needed)
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels /wheels/*.whl \
&& rm -rf /wheels
# Copy application code
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser configs/ ./configs/
# Copy model artifacts (or download at startup β see callout below)
COPY --chown=appuser:appuser models/production/ ./models/
# Switch to non-root user
USER appuser
# Expose the serving port
EXPOSE 8080
# Health check; Kubernetes readiness/liveness probes call this endpoint.
# Uses stdlib urllib so the check works even if `requests` is not installed.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"
# Start FastAPI server with Uvicorn
ENTRYPOINT ["uvicorn", "src.server:app", \
"--host", "0.0.0.0", \
"--port", "8080", \
"--workers", "2", \
"--timeout-keep-alive", "30"]
Model Weights in Image vs Downloaded at Runtime
Baking large model weights (e.g., a 7B-parameter LLM at 14 GB) into a Docker image creates impractically large images and slow registry pushes and pulls. Preferred pattern: keep the serving code in the image; download weights from S3/GCS/Azure Blob at container startup using a startup script, or mount them from a persistent volume. This keeps images small and allows weight updates without rebuilding the image.
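The download-at-startup pattern reduces to an idempotent cache check. In this sketch the fetch step is injected so it can be backed by boto3, gcsfs, or plain HTTP; the `ensure_weights` name and the `.part` temp-file convention are illustrative:

```python
# Sketch of download-at-startup: check the local cache, fetch only on a
# miss, and publish atomically so readers never see a partial file.
from pathlib import Path

def ensure_weights(weights_dir, filename, fetch):
    """Return path to weights, calling fetch(dest_path) only on cache miss."""
    dest = Path(weights_dir) / filename
    if dest.exists():
        return dest  # already present (baked in, mounted volume, or cached)
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    fetch(tmp)        # e.g. s3.download_file(bucket, key, str(tmp))
    tmp.rename(dest)  # atomic rename: publish only a complete file
    return dest
```

Because the check is idempotent, the same entrypoint works whether weights come from object storage, a pre-populated volume, or the image itself.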
GPU Resource Management
Kubernetes requires explicit resource requests and limits for GPUs. GPUs are not overcommittable; only one pod gets each GPU unless you use time-slicing or MIG.
- Resource request: `nvidia.com/gpu: 1` means Kubernetes schedules the pod only on nodes with an available GPU
- GPU requests must equal GPU limits (no burst; GPU quota is strict)
- NVIDIA Time-Slicing: share one GPU across multiple pods (reduced memory isolation; useful for small models)
- MIG (Multi-Instance GPU): A100/H100 only; hardware partitioning with full memory isolation
- Use node selectors or node affinity to target GPU node pools
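The request-equals-limit rule can be enforced before deploy with a simple check over the pod spec. `validate_gpu_resources` below is a hypothetical helper operating on a dict shaped like a Kubernetes container spec, not a real admission controller:

```python
# Illustrative pre-deploy check: every container that touches
# nvidia.com/gpu must set request == limit (GPUs cannot burst).
GPU = "nvidia.com/gpu"

def validate_gpu_resources(pod_spec):
    """Return a list of human-readable violations (empty = spec is fine)."""
    errors = []
    for container in pod_spec.get("containers", []):
        res = container.get("resources", {})
        request = res.get("requests", {}).get(GPU)
        limit = res.get("limits", {}).get(GPU)
        if request != limit:
            errors.append(
                f"{container['name']}: gpu request {request!r} != limit {limit!r}"
            )
    return errors
```

CPU-only containers pass automatically (both values are `None`), so the check can run over every manifest in CI.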
Autoscaling ML Workloads
Standard HPA based on CPU utilization doesn't work well for GPU-bound inference. Use custom metrics instead.
- GPU utilization HPA: scale based on `nvidia_gpu_utilization` via the DCGM Prometheus exporter
- Request queue depth: scale based on pending inference requests; KEDA (Kubernetes Event-Driven Autoscaling) handles this natively
- P99 latency: scale when P99 exceeds SLA threshold (requires custom metrics adapter)
- Set min replicas to 1 (avoid cold start) and max replicas based on GPU budget
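All of these HPA variants reduce to the same scaling rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A worked sketch for a per-pod queue-depth metric:

```python
# The HPA scaling rule in one function: ceil(current * metric / target),
# clamped to the replica bounds. Works for any average-per-pod metric.
import math

def desired_replicas(current, metric_value, target, min_replicas, max_replicas):
    raw = math.ceil(current * metric_value / target)
    return max(min_replicas, min(max_replicas, raw))

# 2 replicas, an average of 75 queued requests per pod, target 25 per pod:
# ceil(2 * 75 / 25) = 6, so the HPA scales out to 6 replicas.
```

The clamp is why the max-replicas bound should reflect the GPU budget: the formula alone will happily ask for more GPUs than the cluster has.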
Persistent Volumes for Model Weights
Large model weights that are too big to bake into images can be stored on persistent volumes and mounted into serving pods.
- Use the `ReadOnlyMany` access mode so multiple pods can simultaneously mount the same weights volume
- NFS or cloud-native shared filesystems (EFS, Filestore, Azure Files) support ReadOnlyMany
- Pre-populate the volume using an init container that downloads from object storage
- Version the volume alongside the model (separate PVC per model version)
Liveness vs Readiness Probes for ML
ML models have a warmup period: the first N requests are slow as the model loads into GPU memory and JIT-compiles. Probes must account for this.
- Readiness probe: signals when the pod is ready to receive traffic; set `initialDelaySeconds` large enough for model warmup (30–120 s)
- Liveness probe: signals when the pod should be restarted; use more lenient thresholds than readiness
- Startup probe: replaces liveness during startup; allows very long initialization (300s+) without triggering restart
- Use a `/health/ready` endpoint that returns 503 until the first successful inference pass
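The readiness gate itself is tiny and framework-agnostic. A sketch (the `Readiness` class is illustrative; wire its `status()` into your `/health/ready` handler):

```python
# Minimal readiness gate: /health/ready should return 503 until the model
# has completed its first successful (dummy) inference pass.
class Readiness:
    def __init__(self):
        self._warmed_up = False

    def mark_warm(self):
        """Call after the first successful inference warms the model."""
        self._warmed_up = True

    def status(self):
        """(http_status, body) pair for the readiness endpoint."""
        if self._warmed_up:
            return 200, {"status": "ready"}
        return 503, {"status": "warming up"}
```

On startup, run one dummy inference (loading weights into GPU memory and triggering JIT compilation), then call `mark_warm()`; the startup probe keeps retrying until the 503 flips to 200.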
Kubernetes Deployment with GPU Request and Readiness Probe
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
  labels:
    app: ml-model-server
    version: "1.5.0"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model-server
  template:
    metadata:
      labels:
        app: ml-model-server
        version: "1.5.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      # Target GPU node pool
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      # Tolerate the GPU node taint
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: model-server
          image: registry.example.com/ml-model:1.5.0@sha256:a1b2c3...
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http
          # ── Resource requests and limits ─────────────────────────
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"  # must equal request for GPU
          # ── Environment variables ────────────────────────────────
          env:
            - name: MODEL_VERSION
              value: "1.5.0"
            - name: NUM_WORKERS
              value: "2"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          envFrom:
            - secretRef:
                name: model-server-secrets  # MLflow URI, S3 credentials etc.
          # ── Probe configuration ──────────────────────────────────
          # Startup probe: allow up to 5 minutes for model warmup
          startupProbe:
            httpGet:
              path: /health/ready
              port: 8080
            failureThreshold: 30  # 30 × 10s = 5 minutes max startup
            periodSeconds: 10
          # Readiness probe: pod removed from Service if not ready
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0  # startupProbe handles delay
            periodSeconds: 10
            failureThreshold: 3
            successThreshold: 1
          # Liveness probe: restart if completely hung
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 30
            failureThreshold: 5  # 5 failures = 150s before restart
          # ── Volume mounts ────────────────────────────────────────
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
            - name: tmp-dir
              mountPath: /tmp
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights-pvc-v1-5  # version-pinned PVC
        - name: tmp-dir
          emptyDir: {}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # Scale based on GPU utilization (requires DCGM exporter)
    - type: External
      external:
        metric:
          name: nvidia_gpu_utilization_percentage
          selector:
            matchLabels:
              deployment: ml-model-server
        target:
          type: AverageValue
          averageValue: "70"  # target 70% GPU utilization per pod
KServe (formerly KFServing)
Kubernetes-native ML inference platform built on top of Knative and Istio. Defines a standard InferenceService CRD that abstracts away the serving infrastructure.
- Supports sklearn, PyTorch, TensorFlow, XGBoost, ONNX, Triton via a single API spec
- Canary rollouts, A/B testing, and shadow deployments are first-class features
- Serverless inference: scales to zero when idle (with cold start tradeoff)
- Request batching, explainability (Seldon Alibi integration), and outlier detection built-in
BentoML
Framework for packaging model, serving logic, preprocessing, and dependencies into a single deployable "Bento" artifact. Designed to bridge the gap between data scientists and DevOps.
- Define service logic in pure Python with the `@bentoml.service` decorator
- `bentoml build` produces a Bento that includes model + code + dependencies + Docker setup
- `bentoml containerize` generates an optimized Docker image automatically
- Adaptive batching, async runners, and multi-model pipelines are built-in
Ray Serve
Built on the Ray distributed computing framework. Designed for scalable Python-native ML inference with complex pipeline graphs (preprocessing → model → postprocessing).
- Scale each pipeline component independently (preprocessing CPU, inference GPU)
- Model multiplexing: serve multiple model versions on one deployment
- Integrates with LangChain, vLLM, and HuggingFace for LLM serving
- Strong Python ecosystem: no YAML DSL required; pure Python configuration
NVIDIA Triton Inference Server
Production-grade multi-framework inference server from NVIDIA. Optimized for maximum GPU throughput with dynamic batching, model scheduling, and concurrent model execution.
- Supports TensorRT, ONNX Runtime, PyTorch TorchScript, TensorFlow SavedModel, Python backend
- Dynamic batching: accumulates requests over a configurable window before GPU dispatch
- Model ensemble: chain multiple models in a DAG with shared GPU compute
- Model repository: directory-based model management; hot-reload without restart
ML Serving Framework Comparison
| Tool | Abstraction Level | GPU Support | Multi-Model | Strengths |
|---|---|---|---|---|
| KServe | High – InferenceService CRD abstracts infrastructure | Yes (via Triton or custom runtime) | Yes – model ensembles, pipelines | Kubernetes-native; standardized API; canary/shadow built-in; scales to zero |
| Seldon Core | High – SeldonDeployment CRD; inference graph | Yes | Yes – multi-stage inference graphs | Explainability (Alibi); outlier detection; enterprise support; broad framework support |
| BentoML | Medium – Python-first; auto-generates Docker + YAML | Yes | Yes – runner-based composition | Data-scientist-friendly; excellent DX; cloud-agnostic deployment targets |
| Ray Serve | Low – Python code defines deployment graph | Yes | Yes – per-component scaling | Flexible pipeline graphs; great for LLMs; integrates with full Ray ecosystem |
| Triton | Low – model repository + config.pbtxt | Yes – highly optimized | Yes – concurrent model execution | Highest GPU throughput; dynamic batching; TensorRT integration; NVIDIA-optimized |
Dynamic Request Batching
GPUs are most efficient when processing large batches. Dynamic batching accumulates multiple incoming requests over a short window and submits them together as a single GPU operation.
- Batch size 1 → GPU utilization ~10–20%; batch size 32 → 60–80% for transformer models
- Configure max batch size and max batch delay (latency budget) based on SLA requirements
- Triton: `dynamic_batching { max_queue_delay_microseconds: 2000 }` in the model config
- Trade-off: higher batch size = better throughput but higher P99 latency for individual requests
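The batching window logic can be modeled in a few lines: dispatch when the batch is full or when the oldest queued request has exhausted the delay budget. A deterministic toy with no real queue or threads; `plan_batches` is an illustrative name:

```python
# Toy model of a dynamic batching window. Given request arrival times
# (in ms), group them into batches of at most max_batch, dispatching
# early once the oldest queued request has waited max_delay_ms.
def plan_batches(arrival_ms, max_batch, max_delay_ms):
    batches, current, opened_at = [], [], None
    for t in arrival_ms:
        if current and t - opened_at > max_delay_ms:
            batches.append(current)       # delay budget exhausted: dispatch
            current, opened_at = [], None
        if not current:
            opened_at = t                 # window opens with first request
        current.append(t)
        if len(current) == max_batch:
            batches.append(current)       # batch full: dispatch immediately
            current, opened_at = [], None
    if current:
        batches.append(current)
    return batches
```

Under heavy load batches fill before the deadline (throughput wins); under light load the deadline fires with tiny batches, which is exactly why `max_queue_delay` is the knob that trades P99 latency for GPU efficiency.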
ONNX Runtime
ONNX Runtime (ORT) is an accelerated inference engine that can run models from any framework (PyTorch, TensorFlow, sklearn) after exporting to the ONNX format.
- Typically 1.5–3× faster than native PyTorch for inference on CPU
- Supports CUDA, TensorRT, OpenVINO execution providers for GPU/edge acceleration
- Export: `torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)`
- Validate that outputs match the original model numerically before deploying the ORT version
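The numerical validation step boils down to an allclose-style comparison, |a − b| ≤ atol + rtol·|b|, applied to flattened outputs from both runtimes. A pure-Python sketch so it works on plain float lists; `outputs_match` is an illustrative name:

```python
# Parity check between original and exported model outputs, in the
# spirit of numpy.allclose: |a - b| <= atol + rtol * |b| elementwise.
def outputs_match(a, b, rtol=1e-3, atol=1e-5):
    """Compare two flat sequences of floats within relative/absolute tolerance."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))
```

In practice `a` would be the PyTorch model's output and `b` the ONNX Runtime session's output on the same input; run it over a sample of real inputs, not just the export dummy.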
TensorRT
NVIDIA's inference optimizer. Takes a trained model and produces a highly optimized engine for a specific GPU architecture using layer fusion, kernel auto-tuning, and precision reduction.
- Typical speedup: 2–5× vs native PyTorch on NVIDIA GPUs
- FP16: near-zero accuracy loss; 2× memory reduction; supported on all modern NVIDIA GPUs
- INT8: with calibration dataset; further speedup; ~1–2% accuracy drop for most models
- A TensorRT engine is not portable; it must be rebuilt for each target GPU architecture
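What INT8 precision reduction does, in miniature: symmetric per-tensor quantization maps floats onto [-127, 127] with a scale derived from the calibration maximum, and dequantization recovers them with a bounded round-trip error. A toy sketch; real calibration uses activation statistics over a dataset, not a single tensor:

```python
# Toy symmetric INT8 quantization: scale from the calibration max,
# round to integers in [-127, 127], dequantize to see round-trip error.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

vals = [0.0, 0.5, -1.0, 0.25]
q, s = quantize_int8(vals)
restored = dequantize(q, s)
# each element's round-trip error is at most scale / 2
```

The per-element error bound of scale/2 is why calibration matters: one outlier activation inflates the scale and degrades every other value's precision.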
Horizontal vs Vertical Scaling
Two fundamentally different approaches to handling increased inference load. The right choice depends on your workload characteristics.
- Horizontal (add replicas): add more pods/nodes; each handles a fraction of traffic; best for high-concurrency, stateless serving
- Vertical (larger GPU): upgrade to a GPU with more memory/compute; better for models that max out a single GPU's batch capacity
- Most production systems use horizontal scaling as the primary lever and vertical for the base unit size decision
- Horizontal scaling requires stateless serving β no in-memory state between requests
Serving Optimization Techniques
| Technique | Latency Improvement | Complexity | When to Use |
|---|---|---|---|
| Dynamic request batching | 2–8× throughput improvement; slight P99 latency increase | Low – configure in serving framework | Any GPU serving with multiple concurrent requests; especially transformer models |
| ONNX Runtime | 1.5–3× speedup on CPU; moderate GPU gains | Low-Medium – export + validate outputs | CPU inference; cross-framework portability; batch processing workloads |
| TensorRT FP16 | 2–4× speedup; 2× memory reduction | Medium – build engine per GPU arch | NVIDIA GPU serving; latency-critical applications; A100/V100/T4 deployments |
| INT8 quantization | 3–5× speedup; 4× memory reduction | High – requires calibration dataset; validate accuracy | Edge deployment; cost-sensitive cloud serving; models tolerant to slight accuracy loss |
| Model compilation (torch.compile) | 1.5–2× on modern PyTorch; no framework change | Low – one-line addition; warm-up required | PyTorch 2.0+ models; significant gain for transformer architectures |
| Continuous batching (for LLMs) | 3–10× throughput for generative models | High – requires specialized LLM serving infra | LLM text generation; vLLM, TGI, or Triton with TensorRT-LLM backend |
| Horizontal pod scaling | Linear throughput scaling with replicas | Low – Kubernetes HPA + stateless design | High-concurrency serving; cost-effective throughput scaling; standard default approach |
Latency vs Throughput Trade-off
Every serving optimization involves choosing between latency and throughput. Dynamic batching increases throughput by accumulating requests, but the wait time in the queue adds to individual request latency. TensorRT FP16 reduces both latency and memory, making it almost always a win. INT8 further reduces memory and compute but may require careful calibration. Always profile P50, P95, and P99 latency under realistic concurrency levels; P99 is what users experience at the worst moment, and it is typically the binding constraint for SLA agreements.
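Percentile profiling needs no special tooling to prototype. A nearest-rank sketch; production systems would use histogram metrics (e.g., Prometheus) rather than raw latency lists:

```python
# P50/P95/P99 from a list of request latencies using the nearest-rank
# method: the smallest value with at least p% of samples at or below it.
import math

def percentile(latencies, p):
    ordered = sorted(latencies)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies_ms = [12, 15, 11, 14, 13, 18, 95, 16, 13, 14]  # one slow outlier
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
# the tail percentiles are dominated by the outlier even though the
# median looks perfectly healthy
```

This is the failure mode averages hide: mean latency here is about 22 ms while the P99 is 95 ms, and it is the 95 ms that users notice and SLAs bind on.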