Why ML Docker Images Go Wrong
Machine learning Docker images are notorious for being enormous (10GB+ is common), slow to build, and inconsistent between development and production. The root causes are almost always the same: installing everything including development dependencies in the production image, not caching pip installs effectively, and bundling large model weights directly into the image.
These best practices fix each of those problems and result in images that are smaller, faster to build, and consistently reproducible.
Multi-Stage Builds
Use multi-stage builds to separate your build environment from your runtime environment. The build stage can include compilers, dev headers, and test dependencies. The runtime stage contains only what is needed to run inference.
# Stage 1: Builder
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Stage 2: Runtime
FROM python:3.11-slim AS runtime
WORKDIR /app
# Copy only the installed packages
COPY --from=builder /root/.local /root/.local
# Copy application code
COPY ./src ./src
ENV PATH=/root/.local/bin:$PATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
CMD ["python", "-m", "src.server"]
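This pattern typically reduces final image size by 40-60% compared to single-stage builds. The savings come from the compilers, dev headers, and test dependencies that stay behind in the build stage.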
Layer Caching for Dependencies
Always copy your requirements file and install dependencies before copying application code. Docker reuses a cached layer as long as the instruction and the files it copies are unchanged, so if requirements.txt has not changed, the pip install layer is reused instead of rebuilt on subsequent builds.
# WRONG — invalidates cache on any code change
COPY . .
RUN pip install -r requirements.txt
# CORRECT — cache dependencies separately from code
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
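To see why the ordering matters, here is a minimal sketch of the caching rule in Python. It is a simplified model, not Docker's actual implementation: each layer's key is assumed to depend on its parent layer, the instruction text, and (for COPY) the contents of the copied files. The function names and sample file contents are hypothetical.

```python
import hashlib


def layer_keys(instructions, files):
    """Return one cache key per instruction in a simplified model of a layer cache.

    instructions: list of (kind, arg) tuples, e.g. ("COPY", "requirements.txt").
    files: dict mapping path -> file contents in the build context.
    """
    keys = []
    parent = "base-image"
    for kind, arg in instructions:
        h = hashlib.sha256()
        h.update(parent.encode())            # a layer depends on everything before it
        h.update(f"{kind} {arg}".encode())   # ...and on the instruction itself
        if kind == "COPY":
            # COPY layers also depend on the contents of the files they copy.
            paths = sorted(files) if arg == "." else [arg]
            for path in paths:
                h.update(path.encode())
                h.update(files[path].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def reused_layers(instructions, before, after):
    """Count the leading layers whose keys are identical across two builds."""
    count = 0
    for old, new in zip(layer_keys(instructions, before), layer_keys(instructions, after)):
        if old != new:
            break
        count += 1
    return count


wrong = [("COPY", "."), ("RUN", "pip install -r requirements.txt")]
correct = [
    ("COPY", "requirements.txt"),
    ("RUN", "pip install --no-cache-dir -r requirements.txt"),
    ("COPY", "."),
]

# Two build contexts that differ only in application code.
v1 = {"requirements.txt": "example-package==1.0.0\n", "src/server.py": "print('v1')\n"}
v2 = {"requirements.txt": "example-package==1.0.0\n", "src/server.py": "print('v2')\n"}

print(reused_layers(wrong, v1, v2))    # 0
print(reused_layers(correct, v1, v2))  # 2
```

In the WRONG ordering a code-only change leaves no reusable layers, so pip install runs again. In the CORRECT ordering the requirements copy and the install are both reused, and only the final COPY is rebuilt.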
GPU Support with NVIDIA Container Runtime
To use GPUs in Docker, install the NVIDIA Container Toolkit on the host and use a CUDA base image. Never use the full nvidia/cuda:xx-devel image in production — it is several GB larger than the runtime-only variant.
# Use runtime variant, not devel
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
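# Install only what you need. Ubuntu 22.04's python3-pip targets the default
# python3 (3.10), so install that interpreter so pip and Python match.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*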
# Run with GPU access
# docker run --gpus all my-ml-image
# docker-compose.yml
services:
  inference:
    image: my-inference-service
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
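A quick way to confirm that the container actually sees a GPU is to check at startup, before loading a model. The sketch below is one possible approach, assuming the NVIDIA runtime exposes the nvidia-smi binary inside the container (it usually does, but this depends on the configured driver capabilities). The function name list_gpus is hypothetical.

```python
import shutil
import subprocess


def list_gpus(run=subprocess.run, which=shutil.which):
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none are visible.

    `run` and `which` are injectable so the function can be exercised without a GPU.
    """
    if which("nvidia-smi") is None:
        # Binary not present: the container was likely started without GPU access.
        return []
    result = run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    print(f"{len(gpus)} GPU(s) visible")
```

Calling this early in the server's startup path makes a missing --gpus flag or a missing Compose device reservation fail loudly instead of silently falling back to CPU.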
Model Weight Management
Never bake model weights into Docker images. A 7B parameter model in FP16 is ~14GB — that makes images impractical to push, pull, and store in any registry. Instead:
- Download models from S3 or HuggingFace Hub at container startup using an init script
- Mount a volume with the model weights for local development
- Use a persistent volume in Kubernetes for production — the model is downloaded once and cached
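The ~14GB figure follows from simple arithmetic: parameter count times bytes per parameter. A small sketch, using decimal gigabytes and counting raw weights only (no optimizer state, tokenizer files, or container overhead); the helper name is hypothetical:

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weights_size_gb(num_params, dtype="fp16"):
    """Approximate size of raw model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(weights_size_gb(7e9, "fp16"))  # 14.0
print(weights_size_gb(7e9, "fp32"))  # 28.0
```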
#!/bin/sh
# init.sh: download model if not already cached
set -eu  # stop on errors so the server never starts after a failed download
MODEL_DIR="/models/mistral-7b"
if [ ! -d "$MODEL_DIR" ]; then
    echo "Downloading model..."
    aws s3 sync s3://my-models/mistral-7b/ "$MODEL_DIR"
fi
exec python -m src.server
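One caveat with a plain directory check is that an interrupted download can leave the directory behind, so the next start skips the download and serves incomplete weights. A possible refinement, sketched here in Python, is to write a marker file only after the download finishes. The function name, the .complete marker name, and the download callable are all assumptions for illustration; the callable stands in for whatever S3 or HuggingFace Hub client you use.

```python
import tempfile
from pathlib import Path


def ensure_model(model_dir, download):
    """Download weights into model_dir unless a previous download completed.

    `download` is a caller-supplied function that takes the target directory.
    Returns True if a download was performed, False if the cache was reused.
    """
    model_dir = Path(model_dir)
    marker = model_dir / ".complete"  # hypothetical marker file name
    if marker.exists():
        return False
    model_dir.mkdir(parents=True, exist_ok=True)
    download(model_dir)
    marker.touch()  # reached only if download() returned without raising
    return True


# Demonstration with a fake download that just records its target directory.
with tempfile.TemporaryDirectory() as tmp:
    calls = []
    target = Path(tmp) / "mistral-7b"
    print(ensure_model(target, calls.append))  # True
    print(ensure_model(target, calls.append))  # False
```

If download raises partway through, the marker is never written, so the next container start retries instead of trusting a partial directory.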
Environment and Secret Management
Never bake API keys or secrets into Docker images. Use environment variables at runtime, injected by Kubernetes secrets or your CI/CD system.
# .dockerignore — critical for keeping images clean
.git
.env
.env.*
__pycache__
*.pyc
*.pyo
.pytest_cache
tests/
notebooks/
*.ipynb
data/
models/
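On the application side, reading secrets from the environment at startup and failing fast when one is missing makes misconfiguration obvious. A minimal sketch; the helper name require_env and the variable name MODEL_API_KEY are hypothetical:

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of a required environment variable, or raise with a clear message."""
    value = environ.get(name)
    if not value:
        # Report only the variable name; never echo secret values into logs.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demonstration with an explicit mapping instead of the real environment.
print(require_env("MODEL_API_KEY", {"MODEL_API_KEY": "example-value"}))  # example-value
```

In a real service the second argument would be omitted, so the value comes from whatever Kubernetes or the CI/CD system injected at runtime.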
Build Optimization Checklist
- Use --no-cache-dir in all pip install commands to prevent pip's internal cache from bloating the layer
- Combine RUN commands to reduce layer count: RUN cmd1 && cmd2 && cmd3
- Use .dockerignore aggressively: exclude everything that is not needed in the build context
- Pin all dependency versions in requirements.txt for reproducible builds
- Use docker buildx with BuildKit for parallel builds of independent stages and better caching
- Scan images with Trivy or Snyk before pushing to production registries
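The pinning item is easy to enforce in CI. The sketch below flags requirement lines that lack an exact == pin. It is deliberately simplistic and assumes a plain requirements.txt: it does not handle URL or VCS requirements, environment markers, or hash-checking mode. The function name is hypothetical.

```python
def unpinned_requirements(text):
    """Return requirement lines that lack an exact `==` version pin."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):
            # Skip blank lines and pip options such as -r or --index-url.
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


sample = "example-package==1.0.0\nother-package>=2.0\nthird-package  # latest\n"
print(unpinned_requirements(sample))  # ['other-package>=2.0', 'third-package']
```

A CI step could read requirements.txt, call this function, and fail the build when the returned list is not empty.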
An ML Docker image that takes 45 minutes to build and is 15GB is a productivity tax that compounds across every team member and every CI run. Applying these practices consistently gets most ML images under 3GB and build times under 5 minutes.