📦 Container Security Fundamentals
Containers are not virtual machines. VMs virtualize hardware with a hypervisor; containers share the host OS kernel, separated only by Linux namespaces and cgroups. A container escape vulnerability means an attacker with code execution inside a container can potentially compromise the host system and all other containers running on it.
Linux Isolation Primitives
- Namespaces — provide isolation for: PID (process IDs), NET (networking), MNT (filesystem mounts), UTS (hostname), IPC (interprocess communication), USER (user/group IDs). Each container gets its own namespace instances.
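- cgroups (control groups) — limit and account for resource usage: CPU, memory, block I/O, and process counts (PIDs). Prevent a container from consuming all host resources. cgroups do not cap network bandwidth themselves; that requires traffic shaping at the network layer.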
- Seccomp — system call filtering. Docker's default seccomp profile blocks ~44 dangerous syscalls. Custom profiles can restrict further.
- AppArmor / SELinux — mandatory access control profiles that restrict what the container process can access on the host filesystem.
Container Escape Techniques
- Privileged containers — `--privileged` gives the container nearly all host capabilities, disables seccomp and AppArmor. Equivalent to root on the host. Never use in production.
- Host mount exploitation — mounting the host filesystem (`-v /:/host`) allows reading and writing to any host file, including /etc/shadow and /etc/cron.d.
- Docker socket mount — mounting `/var/run/docker.sock` inside a container gives full control of the Docker daemon and all containers. Commonly abused in CI/CD pipelines.
- Kernel exploit — if the host kernel is vulnerable, a container process can exploit it to escape isolation. Keeping host kernels patched is critical.
- Capability abuse — specific capabilities (CAP_SYS_ADMIN, CAP_SYS_PTRACE) enable container escapes. Drop all unnecessary capabilities.
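The misconfigurations above can be checked mechanically. The following is a minimal sketch, assuming its input is the `HostConfig` object from `docker inspect` output (the `Privileged`, `Binds`, and `CapAdd` fields); the function name and the finding strings are illustrative, and the check covers only the cases listed in this section.

```python
# Capabilities this section names as escape enablers.
RISKY_CAPS = {"SYS_ADMIN", "SYS_PTRACE"}


def find_escape_risks(host_config):
    """Return findings for one container's HostConfig dict."""
    findings = []
    if host_config.get("Privileged"):
        findings.append("privileged mode enabled")
    # Binds entries look like "source:destination[:mode]"; may be null.
    for bind in host_config.get("Binds") or []:
        source = bind.split(":", 1)[0]
        if source == "/":
            findings.append("host root filesystem mounted")
        elif source.endswith("docker.sock"):
            findings.append("Docker socket mounted")
    # CapAdd may be null and may or may not carry the CAP_ prefix.
    for cap in host_config.get("CapAdd") or []:
        name = cap.upper()
        if name.startswith("CAP_"):
            name = name[4:]
        if name in RISKY_CAPS or name == "ALL":
            findings.append(f"dangerous capability added: {name}")
    return findings


example = {
    "Privileged": False,
    "Binds": ["/:/host", "/var/run/docker.sock:/var/run/docker.sock"],
    "CapAdd": ["CAP_SYS_ADMIN"],
}
for finding in find_escape_risks(example):
    print(finding)
```

An empty result does not mean the container is safe; kernel exploits in particular are invisible to a configuration check.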
Threat Model by Phase
- Build time — vulnerable base image OS packages, hardcoded secrets in Dockerfile or image layers, malicious base images from untrusted registries, build tools included in final image.
- Deploy time — insecure Kubernetes YAML (privileged, host mounts, no resource limits), pulling unverified images without signature checking, deploying to wrong environment.
- Runtime — container escape via vulnerability or misconfiguration, lateral movement to other containers, cryptomining, data exfiltration, process injection.
🏗 Secure Image Building
Minimal Base Images
- Distroless (Google) — contains only the application runtime and its direct dependencies. No shell, no package manager, no debugging tools. Dramatically reduces attack surface and CVE count.
- Alpine Linux — minimal Linux distribution (~5MB). Includes a shell (for debugging) but far fewer packages than Ubuntu/Debian. Uses musl libc rather than glibc — verify compatibility with your app.
- scratch — empty image. For statically compiled Go or Rust binaries with zero dependencies. The smallest possible attack surface: the image contains nothing except your binary, so the only code an attacker can target is your own.
- Base image choice can reduce total CVE count by 80–90% compared to using ubuntu:latest.
Build Best Practices
- Multi-stage builds — compile in a builder stage (with compiler, build tools, dev dependencies), copy only the binary/artifact to a minimal final image. Build tools never reach production.
- Non-root user — create a dedicated user in the Dockerfile (`RUN useradd -r appuser`, `USER appuser`). Container escape from a non-root process is harder and causes less damage.
- Read-only root filesystem — `--read-only` flag or `readOnlyRootFilesystem: true` in Kubernetes. An attacker cannot write malware to the container's root filesystem.
- No secrets in images — never use ENV for secrets in production Dockerfiles. Use runtime injection (Kubernetes secrets, Vault Agent, AWS SSM). A secret written in one image layer stays recoverable from that layer even if a later layer deletes it.
- Image signing — Docker Content Trust (Notary v1) or Sigstore/cosign (keyless signing) to verify images were built by your CI/CD pipeline.
```dockerfile
# Secure Dockerfile example: multi-stage, non-root, distroless

# Stage 1: Builder (has build tools, not in final image)
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Build statically linked binary (no external library deps)
RUN CGO_ENABLED=0 GOOS=linux go build -a -ldflags '-extldflags "-static"' -o server .

# Stage 2: Final image — distroless, no shell, no package manager
FROM gcr.io/distroless/static-debian12:nonroot
# nonroot tag sets USER to 65532 (nonroot) automatically
# No RUN, no SHELL, no apt-get — just copy and run
COPY --from=builder /app/server /server
# Expose port (documentation only — does not publish)
EXPOSE 8080
ENTRYPOINT ["/server"]

# Build, push, and sign with cosign (the image must be in the registry before it can be signed)
# docker build -t myregistry.io/myapp:v1.2.3 .
# docker push myregistry.io/myapp:v1.2.3
# cosign sign --key cosign.key myregistry.io/myapp:v1.2.3
```
SBOM Generation in CI
Generate an SBOM for every container image build using Syft or Trivy. Attach the SBOM to the image using ORAS or Cosign's attestation feature. This enables downstream vulnerability scanning against the SBOM without needing to re-scan the full image, and satisfies supply chain security requirements (EO 14028, SLSA).
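To illustrate why an SBOM allows checking without re-scanning the image, here is a minimal sketch that matches SBOM components against a list of known-bad versions. It assumes a CycloneDX JSON document (top-level `components` entries with `name`, `version`, and `purl`), which Syft can emit. The `advisories` mapping and the sample package versions are made up for illustration; a real pipeline would hand the SBOM to a scanner such as Grype or Trivy rather than match versions by hand.

```python
def affected_components(sbom, advisories):
    """Return identifiers of SBOM components whose exact version is listed as vulnerable.

    advisories: hypothetical mapping of package name -> set of vulnerable versions.
    """
    hits = []
    for component in sbom.get("components") or []:
        name = component.get("name")
        version = component.get("version")
        if version in advisories.get(name, ()):
            # Prefer the package URL when present; it identifies the ecosystem too.
            hits.append(component.get("purl") or f"{name}@{version}")
    return hits


# Illustrative SBOM fragment; versions are invented for the example.
EXAMPLE_SBOM = {
    "bomFormat": "CycloneDX",
    "components": [
        {"type": "library", "name": "openssl", "version": "3.0.1",
         "purl": "pkg:apk/alpine/openssl@3.0.1"},
        {"type": "library", "name": "zlib", "version": "1.3.1"},
    ],
}

print(affected_components(EXAMPLE_SBOM, {"openssl": {"3.0.1"}}))
```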
🔍 Image Scanning & Vulnerability Management
Container image scanners check each layer of an image against CVE databases, identifying vulnerable OS packages and application libraries. Scanning in CI/CD prevents deploying known-vulnerable images to production.
| Scanner | Open Source | Speed | Languages | Integration |
|---|---|---|---|---|
| Trivy | Yes (Apache 2.0) | Fast | OS, Java, Python, Node, Go, Ruby, .NET, Rust | CLI, Docker plugin, GitHub Actions, GitLab CI, Kubernetes operator |
| Grype | Yes (Apache 2.0) | Fast | OS, Java, Python, Node, Go, Ruby, .NET | CLI, GitHub Actions, integrates with Syft SBOM |
| Snyk Container | Free tier | Fast | OS, Node, Java, Python, Ruby | CLI, Docker Desktop, GitHub, CI/CD, IDE |
| Clair | Yes (Apache 2.0) | Moderate | OS packages primarily | API-based; integrates with Harbor, Quay registries |
| AWS ECR Scanning | No (AWS service) | Fast | OS, major language ecosystems | Native AWS; enhanced scanning powered by Inspector/Snyk |
CI Pipeline Integration
- Scan on every image build — block CI if critical (CVSS 9.0+) CVEs are found with a fix available.
- Distinguish between fixable and unfixable vulnerabilities. A CVE with no patch available shouldn't necessarily block a release indefinitely — set an exception process with time limits.
- Registry-level scanning (ECR enhanced scanning, Harbor with Trivy) rescans images periodically — detects newly disclosed CVEs in images already in the registry.
- Set up base image update automation: when a new distroless or Alpine release patches CVEs, automatically rebuild and redeploy images derived from it.
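The "block only on critical CVEs that have a fix" rule can be sketched as a small gate over a scanner report. This example assumes the JSON layout written by `trivy image --format json` (a `Results` list whose entries carry `Vulnerabilities` with `VulnerabilityID`, `PkgName`, `FixedVersion`, and `Severity`). Trivy's `CRITICAL` label is used here as a stand-in for CVSS 9.0+; the function name and the sample identifiers are illustrative.

```python
def blocking_vulnerabilities(report, severities=("CRITICAL",)):
    """Return (id, package, fixed_version) for findings that should fail the build."""
    blocking = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" is absent or null when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            has_fix = bool(vuln.get("FixedVersion"))
            if vuln.get("Severity") in severities and has_fix:
                blocking.append(
                    (vuln["VulnerabilityID"], vuln["PkgName"], vuln["FixedVersion"])
                )
    return blocking


# Made-up report fragment; the IDs are placeholders, not real CVEs.
EXAMPLE_REPORT = {
    "Results": [
        {"Target": "myapp (alpine)", "Vulnerabilities": [
            {"VulnerabilityID": "EXAMPLE-0001", "PkgName": "libfoo",
             "FixedVersion": "1.2.4", "Severity": "CRITICAL"},
            {"VulnerabilityID": "EXAMPLE-0002", "PkgName": "libbar",
             "FixedVersion": "", "Severity": "CRITICAL"},
            {"VulnerabilityID": "EXAMPLE-0003", "PkgName": "libbaz",
             "FixedVersion": "2.0.0", "Severity": "HIGH"},
        ]},
        {"Target": "go.sum"},
    ]
}

found = blocking_vulnerabilities(EXAMPLE_REPORT)
print("block build" if found else "pass", found)
```

For the simple case, Trivy's own `--severity`, `--ignore-unfixed`, and `--exit-code` flags give the same behavior without custom code; a script like this becomes useful once time-limited exceptions have to be applied.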
CVE Database Sources
- NVD (NIST National Vulnerability Database) — authoritative US government CVE database. Base CVSS scores.
- Red Hat Security Advisories — RHEL/CentOS/Fedora-specific severity ratings, which often reflect real-world exploitability more accurately than NVD base scores.
- Debian Security Tracker — Debian package-specific CVE status and fixes.
- GitHub Advisory Database — language ecosystem CVEs (npm, PyPI, Maven, Go, RubyGems).
- Trivy aggregates all the above — no need to query each source separately.
⚙ Kubernetes Security
RBAC — Least Privilege
- Kubernetes RBAC controls who can do what to which resources. Principle: grant minimum permissions needed.
- Service accounts — every pod runs under a service account. Pods that do not name one share the namespace's default service account, so any permission bound to it is granted to all of them. Create dedicated service accounts per workload.
- Avoid `cluster-admin` bindings unless absolutely necessary. Prefer namespace-scoped roles over cluster-wide roles.
- Audit RBAC regularly with tools like `rbac-tool`, `kubectl-who-can`, or Polaris.
- Disable automatic service account token mounting if the pod doesn't need Kubernetes API access: `automountServiceAccountToken: false`.
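As a small illustration of an RBAC audit, the sketch below lists every subject bound to `cluster-admin`. It assumes input shaped like the output of `kubectl get clusterrolebindings -o json` (an `items` list whose entries have `metadata.name`, `roleRef.name`, and optional `subjects`); the function name and the sample binding names are hypothetical.

```python
def cluster_admin_subjects(bindings):
    """Return (binding, kind, namespace, name) for each subject bound to cluster-admin."""
    found = []
    for binding in bindings.get("items") or []:
        if (binding.get("roleRef") or {}).get("name") != "cluster-admin":
            continue
        # A binding can exist with no subjects at all.
        for subject in binding.get("subjects") or []:
            found.append((
                binding["metadata"]["name"],
                subject["kind"],
                subject.get("namespace", ""),  # only ServiceAccounts are namespaced
                subject["name"],
            ))
    return found


EXAMPLE_BINDINGS = {
    "items": [
        {"metadata": {"name": "ci-deployer-admin"},
         "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
         "subjects": [{"kind": "ServiceAccount", "namespace": "ci", "name": "deployer"}]},
        {"metadata": {"name": "viewers"},
         "roleRef": {"kind": "ClusterRole", "name": "view"},
         "subjects": [{"kind": "Group", "name": "developers"}]},
    ]
}

for row in cluster_admin_subjects(EXAMPLE_BINDINGS):
    print(row)
```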
Pod Security Standards
- Pod Security Standards (PSS), enforced by the built-in Pod Security Admission controller, replace PodSecurityPolicy, which was deprecated in Kubernetes 1.21 and removed in 1.25.
- Privileged — unrestricted. For trusted system workloads only.
- Baseline — prevents known privilege escalation. Blocks privileged containers, host network/PID/IPC access, dangerous capabilities.
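- Restricted — heavily restricted, following current hardening best practices. Requires running as non-root, `allowPrivilegeEscalation: false`, dropping all capabilities (only NET_BIND_SERVICE may be added back), and a seccomp profile of RuntimeDefault or Localhost. A read-only root filesystem is not part of the standard and has to be set separately. Use for most workloads.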
- Enforce at namespace level with the label `pod-security.kubernetes.io/enforce: restricted`.
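To make the Restricted level concrete, the following sketch checks a pod spec for a subset of its container-level requirements: non-root, no privilege escalation, all capabilities dropped, and a seccomp profile. It is not a reimplementation of Pod Security Admission (volume types, host namespaces, and other controls are ignored), and the function name is illustrative.

```python
def restricted_violations(pod_spec):
    """Return messages for containers that miss a subset of the Restricted checks."""
    violations = []
    pod_sc = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        # runAsNonRoot and seccompProfile may be inherited from the pod level.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in ((sc.get("capabilities") or {}).get("drop") or []):
            violations.append(f"{name}: capabilities.drop must include ALL")
        profile = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if profile.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return violations


COMPLIANT_SPEC = {
    "securityContext": {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "server",
        "image": "myregistry.io/myapp:v1.2.3",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}

print(restricted_violations(COMPLIANT_SPEC))
print(restricted_violations({"containers": [{"name": "legacy", "image": "legacy:1"}]}))
```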
Network Policies
- By default, all pods can communicate with all other pods in a Kubernetes cluster — no network isolation.
- Default deny — apply a NetworkPolicy that denies all ingress and egress by default, then selectively allow required communication paths.
- Use Calico, Cilium, or Weave for network policy enforcement (requires a compatible CNI plugin — the default kubenet does not enforce policies).
- Limit egress to required external endpoints — prevents exfiltration and C2 communication from compromised pods.
- Cilium with eBPF provides L7-aware policy (HTTP method and path filtering) beyond standard L4 network policies.
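The default-deny pattern is two small manifests: one policy that selects every pod and allows nothing, and narrow policies that open specific paths. The sketch below builds both as dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The namespace, labels, policy names, and port are assumptions chosen for illustration.

```python
import json


def default_deny_policy(namespace):
    """Select all pods in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches every pod
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so nothing is allowed
        },
    }


def allow_ingress_policy(namespace, to_app, from_app, port):
    """Allow TCP traffic on one port from pods labelled app=from_app to app=to_app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{to_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


deny = default_deny_policy("payments")
allow = allow_ingress_policy("payments", to_app="api", from_app="frontend", port=8080)
print(json.dumps(deny, indent=2))
print(json.dumps(allow, indent=2))
```

Denying all egress also blocks DNS lookups, so a real rollout normally needs an additional egress rule for the cluster DNS service.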
Secrets Management
- Kubernetes Secrets are base64-encoded by default — not encrypted. Enable encryption at rest for Secrets using KMS (etcd encryption).
- Avoid environment variables for secrets — they appear in `kubectl describe`, process listings, and crash dumps. Use volume mounts instead.
- External Secrets Operator — sync secrets from Vault, AWS SSM Parameter Store, GCP Secret Manager into Kubernetes Secrets. Secrets never live in git.
- Sealed Secrets (Bitnami) — encrypt secrets with a cluster-specific key; safe to commit to git. Decrypted only by the controller inside the cluster.
- Rotate secrets regularly. Use dynamic secrets (Vault database plugin) for database credentials — short-lived credentials issued per request.
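The point that base64 is an encoding rather than encryption is easy to demonstrate: anyone who can read a Secret object can recover its values with no key. The sketch below assumes a Secret shaped like `kubectl get secret <name> -o json` output, where values under `data` are base64 strings; the secret name and value are invented.

```python
import base64


def decode_secret_data(secret):
    """Decode every value under a Secret's `data` field. No key is needed."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in (secret.get("data") or {}).items()
    }


# Build an example Secret the same way the API server stores it.
EXAMPLE_SECRET = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "db-credentials"},
    "data": {"password": base64.b64encode(b"s3cr3t-value").decode("ascii")},
}

print(EXAMPLE_SECRET["data"]["password"])   # looks opaque, but is only encoded
print(decode_secret_data(EXAMPLE_SECRET))   # plaintext recovered without any key
```

This is why read access to Secrets in RBAC and encryption at rest in etcd both matter: the encoding itself protects nothing.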
Admission Controllers for Policy Enforcement
OPA/Gatekeeper and Kyverno are admission webhooks that evaluate every Kubernetes resource create/update against custom policies before persistence. Use them to enforce: required labels, image registry allowlisting, non-root requirement, no latest tag, resource limits required. Kyverno policies are Kubernetes-native (YAML), while OPA uses Rego language. Run the CIS Kubernetes Benchmark (via kube-bench) to validate your cluster configuration against 100+ security controls.
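Gatekeeper and Kyverno express policies declaratively, but the underlying exchange is simple: the API server sends an AdmissionReview request and expects an AdmissionReview response with `allowed` set. The sketch below shows that decision logic for two of the policies named above (registry allowlisting and no `latest` tag). It assumes the reviewed object is a Pod, uses `myregistry.io/` as a placeholder registry, and omits the HTTPS server a real webhook needs.

```python
ALLOWED_REGISTRY = "myregistry.io/"  # placeholder registry prefix


def image_problems(image):
    """Return policy violations for one image reference."""
    problems = []
    if not image.startswith(ALLOWED_REGISTRY):
        problems.append(f"{image}: registry not allowed")
    last_segment = image.rsplit("/", 1)[-1]
    if "@" in last_segment:
        pass  # pinned by digest, which is stricter than any tag
    elif ":" not in last_segment or last_segment.endswith(":latest"):
        problems.append(f"{image}: must use an explicit non-latest tag")
    return problems


def review(admission_review):
    """Build the AdmissionReview response for a Pod create/update request."""
    request = admission_review["request"]
    spec = request["object"]["spec"]
    containers = (spec.get("initContainers") or []) + (spec.get("containers") or [])
    problems = [p for c in containers for p in image_problems(c["image"])]

    response = {"uid": request["uid"], "allowed": not problems}
    if problems:
        response["status"] = {"message": "; ".join(problems)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


def make_request(uid, image):
    """Helper for the example: wrap one image in a minimal AdmissionReview request."""
    return {"request": {"uid": uid, "object": {"spec": {"containers": [{"image": image}]}}}}


print(review(make_request("1", "myregistry.io/myapp:v1.2.3"))["response"])
print(review(make_request("2", "docker.io/library/nginx:latest"))["response"])
```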
👀 Runtime Security & Supply Chain
Runtime Threat Detection
- Falco (CNCF) — open-source runtime security tool. Monitors syscalls and Kubernetes audit logs against rules. Detects: shell spawned in container, sensitive file read, privilege escalation, network connections to unusual destinations.
- Tetragon (Cilium/CNCF) — eBPF-based security observability and enforcement. Operates at kernel level with near-zero overhead. Can enforce policies to block specific syscalls or process executions in real time.
- Container drift detection — detect when new executables are written and run inside a container post-start. A well-built container only runs executables that shipped in its image, so a binary that appears after start is a strong signal of compromise.
- Runtime tools alert on anomalies; ensure alerts flow to your SIEM/SOC for response.
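Drift detection reduces to comparing what runs against what the image shipped. The sketch below assumes a baseline mapping of executable path to SHA-256 digest taken from the image at build time, and a stream of exec events carrying `path` and `sha256`; both shapes are hypothetical and do not match the event format of Falco, Tetragon, or any other tool.

```python
def detect_drift(baseline, exec_events):
    """Return (path, reason) for executions that do not match the image baseline."""
    drift = []
    for event in exec_events:
        expected = baseline.get(event["path"])
        if expected is None:
            drift.append((event["path"], "not in image"))
        elif expected != event["sha256"]:
            drift.append((event["path"], "modified since build"))
    return drift


# Digests are shortened placeholders, not real hashes.
BASELINE = {"/server": "aa11", "/usr/bin/healthcheck": "bb22"}
EVENTS = [
    {"path": "/server", "sha256": "aa11"},               # expected
    {"path": "/tmp/xmrig", "sha256": "cc33"},            # dropped after start
    {"path": "/usr/bin/healthcheck", "sha256": "dd44"},  # replaced in place
]

for path, reason in detect_drift(BASELINE, EVENTS):
    print(f"ALERT {path}: {reason}")
```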
Container Supply Chain Security
- Sigstore/cosign — keyless image signing using OIDC identity (GitHub Actions OIDC, Google Workload Identity). No long-lived private keys to manage or leak.
- Rekor transparency log — immutable, append-only log of all signatures. Enables auditing of what was signed and when. Part of the Sigstore project.
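- SLSA (Supply-chain Levels for Software Artifacts) — framework for achieving supply chain integrity. In SLSA v1.0, Build Level 3 requires a hardened, isolated build platform that generates signed, non-forgeable provenance — proving what source code produced what artifact. (Hermetic builds were a Level 4 requirement in the earlier v0.1 draft.)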
- Admission webhook image verification — Kyverno or Connaisseur can verify cosign signatures before a pod is admitted. Prevents running unsigned or tampered images.
Admission & Policy Automation
- Gate deployments at the admission webhook level — policies run before resources are written to etcd, not after.
- Enforce image provenance: only allow images from your approved registry (`myregistry.io/*`), signed by your CI/CD pipeline's identity.
- Require SBOM attestation alongside the image: Kyverno can verify that a valid SBOM attestation exists before allowing deployment.
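- Scan images in-cluster as workloads are deployed (Trivy Operator) — a final defense even if CI scanning was bypassed. The operator reports findings; blocking the deployment still requires an admission policy that acts on them.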
An Image Scanned at Build Time May Be Vulnerable 30 Days Later
CVEs are disclosed continuously. An image that passed scanning when it was built may have five new critical CVEs a month later. Implement continuous registry scanning that rescans all images in production periodically (daily for critical workloads). When new CVEs are detected in deployed images, trigger an automated rebuild and redeploy from the patched base image. Treat container image vulnerability management like OS patch management — it requires ongoing process, not a one-time scan.
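The scheduling side of continuous scanning can be sketched in a few lines: track when each deployed image was last scanned and select the ones whose scan is older than the allowed interval. The record format, image names, and one-day default are assumptions for illustration; in practice the registry or scanner provides the timestamps.

```python
from datetime import datetime, timedelta, timezone


def images_due_for_rescan(last_scanned, now, max_age=timedelta(days=1)):
    """Return image references whose last scan is older than max_age, sorted by name."""
    return sorted(
        image for image, scanned_at in last_scanned.items()
        if now - scanned_at > max_age
    )


NOW = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
LAST_SCANNED = {
    "myregistry.io/myapp:v1.2.3": NOW - timedelta(days=30),  # scanned only at build time
    "myregistry.io/worker:v0.9.0": NOW - timedelta(hours=6),
}

print(images_due_for_rescan(LAST_SCANNED, NOW))
```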