Docker Best Practices for Machine Learning Workloads

Why ML Docker Images Go Wrong

Machine learning Docker images are notorious for being enormous (10GB+ is common), slow to build, and inconsistent between development and production. The root causes are almost always the same: installing everything including development dependencies in the production image, not caching pip installs effectively, and bundling large model weights directly into the image.

These best practices fix each of those problems and result in images that are smaller, faster to build, and consistently reproducible.

Multi-Stage Builds

Use multi-stage builds to separate your build environment from your runtime environment. The build stage can include compilers, dev headers, and test dependencies. The runtime stage contains only what is needed to run inference.

# Stage 1: Builder
FROM python:3.11-slim AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Runtime
FROM python:3.11-slim AS runtime

WORKDIR /app

# Copy only the installed packages
COPY --from=builder /root/.local /root/.local

# Copy application code
COPY ./src ./src

ENV PATH=/root/.local/bin:$PATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

CMD ["python", "-m", "src.server"]

This pattern typically reduces final image size by 40-60% compared to single-stage builds.

Layer Caching for Dependencies

Always copy your requirements file and install dependencies before copying application code. Docker caches each layer and reuses it until the instruction or the files it copies change. If requirements.txt is unchanged, the pip install layer is reused from cache instead of being re-run on subsequent builds.

# WRONG — invalidates cache on any code change
COPY . .
RUN pip install -r requirements.txt

# CORRECT — cache dependencies separately from code
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
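
To see why the ordering matters, here is a toy model of layer caching in Python. It is an illustration only, not Docker's actual implementation: it assumes each layer's cache key is a hash of the parent layer's key, the instruction text, and (for COPY) the contents of the copied files. The function names and file contents are hypothetical.

```python
import hashlib


def layer_keys(instructions, files):
    """Return one cache key per instruction.

    Toy model (an assumption, not Docker's real algorithm): a layer's key
    hashes the parent key, the instruction text, and, for COPY, the contents
    of the files it copies. A changed key means the layer must be rebuilt.
    """
    keys = []
    parent = ""
    for instruction, copied in instructions:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        for name in sorted(copied):
            h.update(name.encode())
            h.update(files[name].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(instructions, before, after):
    """List the instructions whose cache key differs between two builds."""
    old = layer_keys(instructions, before)
    new = layer_keys(instructions, after)
    return [ins for (ins, _), a, b in zip(instructions, old, new) if a != b]


ALL_FILES = ["requirements.txt", "src/server.py"]

# "COPY . ." first: every file feeds the first layer's key.
wrong_order = [
    ("COPY . .", ALL_FILES),
    ("RUN pip install -r requirements.txt", []),
]

# Requirements first: the pip layer depends only on requirements.txt.
correct_order = [
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install --no-cache-dir -r requirements.txt", []),
    ("COPY . .", ALL_FILES),
]

# Two builds that differ only in application code.
before = {"requirements.txt": "numpy==1.26.4\n", "src/server.py": "print('v1')\n"}
after = {"requirements.txt": "numpy==1.26.4\n", "src/server.py": "print('v2')\n"}

print(rebuilt_layers(wrong_order, before, after))
print(rebuilt_layers(correct_order, before, after))
```

In this model, a one-line code change rebuilds both layers in the first ordering, including the pip install, while in the second ordering only the final COPY is rebuilt.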

GPU Support with NVIDIA Container Runtime

To use GPUs in Docker, install the NVIDIA Container Toolkit on the host and use a CUDA base image. Avoid the full nvidia/cuda:xx-devel image in production: it adds the CUDA compiler toolchain and headers, which most inference services do not need, and it is several GB larger than the runtime-only variant.

# Use runtime variant, not devel
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

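# Install only what you need. Ubuntu 22.04's python3 and python3-pip
# packages both target Python 3.10, so pip installs for the same interpreter.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        python3-pip \
    && rm -rf /var/lib/apt/lists/*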

# Run with GPU access
# docker run --gpus all my-ml-image

# docker-compose.yml
services:
  inference:
    image: my-inference-service
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
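
A container started without GPU access still runs, just without a GPU, so it can help to check at startup. The following is a minimal sketch, assuming the NVIDIA runtime makes the nvidia-smi binary available inside the container when a GPU is attached. The function name visible_gpus is hypothetical.

```python
import shutil
import subprocess


def visible_gpus():
    """Return the GPU lines reported by `nvidia-smi -L`, or [] if none.

    Assumption: when the container is started with GPU access, the NVIDIA
    runtime makes the nvidia-smi binary available on PATH.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Each attached GPU is listed on its own line starting with "GPU <index>:".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print(f"{len(gpus)} GPU(s) visible")
    else:
        print("No GPU visible; was the container started with --gpus all?")
```

Whether to abort or fall back to CPU when the list is empty is a deployment decision; the sketch only reports what it finds.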

Model Weight Management

Never bake model weights into Docker images. FP16 stores each parameter in 2 bytes, so a 7B-parameter model is roughly 7 billion × 2 bytes ≈ 14GB, which makes images impractical to push, pull, and store in any registry. Instead:

Download models from S3 or HuggingFace Hub at container startup using an init script
Mount a volume with the model weights for local development
Use a persistent volume in Kubernetes for production — the model is downloaded once and cached

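#!/bin/sh
# init.sh: download the model if not already cached, then start the server
set -eu

MODEL_DIR="/models/mistral-7b"
# Marker written only after a complete sync, so an interrupted download is retried
if [ ! -f "$MODEL_DIR/.complete" ]; then
    echo "Downloading model..."
    mkdir -p "$MODEL_DIR"
    aws s3 sync s3://my-models/mistral-7b/ "$MODEL_DIR"
    touch "$MODEL_DIR/.complete"
fi
exec python -m src.server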
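
One weakness of a plain directory-existence check is that an interrupted download leaves a partial directory that looks cached. The sketch below shows one way to guard against that from Python with a completion marker file. The ensure_model helper, the .complete marker name, and the download callable are all hypothetical; in practice the callable would wrap something like aws s3 sync or a Hub client.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(model_dir: Path, download: Callable[[Path], None]) -> bool:
    """Download into model_dir unless a previous download completed.

    Returns True if a download ran, False if the cached copy was reused.
    The marker file is written only after `download` returns, so a crash
    mid-download causes a retry on the next start.
    """
    marker = model_dir / ".complete"  # hypothetical marker name
    if marker.exists():
        return False
    model_dir.mkdir(parents=True, exist_ok=True)
    download(model_dir)
    marker.touch()
    return True


def fake_download(target: Path) -> None:
    """Stand-in for a real download; writes a small placeholder file."""
    (target / "weights.bin").write_bytes(b"\x00" * 16)


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "mistral-7b"
    first = ensure_model(model_dir, fake_download)   # downloads
    second = ensure_model(model_dir, fake_download)  # reuses the cache
    print(first, second)
```

Passing the download step in as a callable keeps the caching logic independent of where the weights come from, which also makes it easy to test without network access.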

Environment and Secret Management

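Never bake API keys or secrets into Docker images. Inject them as environment variables at runtime, from Kubernetes Secrets or your CI/CD system. Also keep local secret files such as .env out of the build context with .dockerignore, so a broad COPY . . cannot pull them into a layer.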

# .dockerignore — critical for keeping images clean
.git
.env
.env.*
__pycache__
*.pyc
*.pyo
.pytest_cache
tests/
notebooks/
*.ipynb
data/
models/
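
On the application side, reading secrets from the environment at startup and failing fast when one is missing makes a misconfigured deployment obvious. A minimal sketch follows; the variable names MODEL_API_KEY and DB_PASSWORD are placeholders, not names from any real service, and load_secrets is a hypothetical helper.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required environment variable is unset or empty."""


def load_secrets(names, environ=None):
    """Return {name: value} for each required variable, or raise.

    `environ` defaults to os.environ; passing a dict makes this testable.
    """
    env = os.environ if environ is None else environ
    missing = [n for n in names if not env.get(n)]
    if missing:
        # Report names only; never log secret values.
        raise MissingSecretError("missing required secrets: " + ", ".join(missing))
    return {n: env[n] for n in names}


REQUIRED = ["MODEL_API_KEY", "DB_PASSWORD"]  # placeholder names

# Example values for illustration; real values come from the runtime environment.
example_env = {"MODEL_API_KEY": "example-key", "DB_PASSWORD": "example-pw"}
secrets = load_secrets(REQUIRED, example_env)
print(sorted(secrets))
```

In a real service the call would be load_secrets(REQUIRED) with no second argument, so the values come from whatever the orchestrator injected.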

Build Optimization Checklist

Use --no-cache-dir in all pip install commands to prevent pip's internal cache from bloating the layer
Combine RUN commands to reduce layer count: RUN cmd1 && cmd2 && cmd3
Use .dockerignore aggressively — exclude everything that is not needed in the build context
Pin all dependency versions in requirements.txt for reproducible builds
Use docker buildx with BuildKit to build independent stages in parallel and get better caching
Scan images with Trivy or Snyk before pushing to production registries
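
The pinning item is easy to enforce mechanically. Below is a rough sketch of a CI check that flags requirement lines without an exact == pin. It is a simple line-based heuristic with a hypothetical function name, not a full requirements parser: it skips comments, blank lines, and option lines such as -r other.txt, and it would misread URL requirements that contain a # fragment.

```python
def unpinned_requirements(text):
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline comments
        if not line or line.startswith("-"):
            continue  # blank line or pip option such as -r / --index-url
        requirement = line.split(";", 1)[0].strip()  # drop environment markers
        if "==" not in requirement:
            unpinned.append(requirement)
    return unpinned


# Sample file contents for illustration only.
sample = """\
# inference dependencies
numpy==1.26.4
torch>=2.0
fastapi
uvicorn==0.30.1  # ASGI server
"""

print(unpinned_requirements(sample))
```

For the sample above, the check reports torch>=2.0 and fastapi as unpinned. A CI job could fail the build whenever the returned list is non-empty.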

An ML Docker image that takes 45 minutes to build and is 15GB is a productivity tax that compounds across every team member and every CI run. Applying these practices consistently gets most ML images under 3GB and build times under 5 minutes.
