Docker image-bloat rebuild playbook — find the dominant dep before multi-stage

For any Docker image >3 GB, the correct FIRST move is docker exec <container> sh -c 'du -sh /path/to/site-packages/*' — bloat is almost always concentrated in one accidental dep, not spread across the layer set. Multi-stage is the vehicle; the lever is usually a single dep pin or a single extra removed.
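
A minimal triage sketch, assuming a stock Docker CLI — the container name and venv path are placeholders:

```bash
# Spot the offenders: list image sizes and eyeball anything over ~3 GB
docker images --format '{{.Repository}}:{{.Tag}}  {{.Size}}'

# Profile the suspect. sh -c makes the glob expand inside the container;
# sort/head run on the host and surface the dominant package first.
docker exec <container> sh -c 'du -sh /path/to/site-packages/*' \
  | sort -rh | head -15
```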

Three patterns from 20-Apr-2026 Wave 1 rebuild (16.21 GB reclaimed across three images)

| Root cause | Platform | Before | After | Fix lever |
|---|---|---|---|---|
| camel-ai pulled the torch-CUDA default | MiroFish | 5.80 GB | 1.41 GB (76% ↓) | Pre-install torch from https://download.pytorch.org/whl/cpu BEFORE uv pip install -r pyproject.toml — camel-ai sees torch present and skips the CUDA resolve |
| --extra providers pulled sentence-transformers → torch → CUDA | Graphiti MCP | 5.65 GB | 0.41 GB (93% ↓) | Drop --extra providers; install only the provider actually used (google-genai). sentence-transformers is unused because embeddings go through the Gemini API, not a local model |
| Build toolchain retained in runtime + dev-only node_modules survived the Vite build | ORACLE Hermes | 4.89 GB | 2.10 GB (57% ↓) | Multi-stage: builder keeps build-essential/python3-dev/libffi-dev/gcc; runtime drops them. Strip web/node_modules post-build |
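
Row 1's fix lever as a minimal Dockerfile fragment — a sketch assuming a uv-managed venv is already on PATH; nothing here is lifted from the actual MiroFish build:

```dockerfile
# Pre-install torch from the CPU wheel index BEFORE the main resolve.
# camel-ai then finds torch already satisfied and never pulls the CUDA build.
RUN uv pip install --index-url https://download.pytorch.org/whl/cpu torch

# Main resolve: torch is present, so the CUDA variant is skipped
RUN uv pip install -r pyproject.toml
```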

The general pattern

  1. Profile first: docker exec <name> sh -c 'du -sh /opt/<app>/.venv/lib/python*/site-packages/*' | sort -rh | head -15 (sh -c makes the glob expand inside the container, not on the host)
  2. Identify the outlier: typically ONE package is 50%+ of site-packages — torch+CUDA (GPU stack on CPU VPS), sentence-transformers (unused embedder), playwright-with-browsers (if only API used)
  3. Check the dep chain: importlib.metadata.requires("<bloat-package>") lists what the bloat package itself depends on; to find what pulls it IN, scan every installed distribution's requirements for its name — see the sketch after this list
  4. Pin or drop: either install a CPU/slim variant via --index-url before the main resolve, OR remove the offending extra, OR cherry-pick the one sub-package you actually use
  5. Multi-stage after: only now does splitting builder/runtime matter; it’s the 10-30% polish on top of the 70-90% win from the dep fix
  6. Backup anchor: cp Dockerfile Dockerfile.pre-slim.<YYYYMMDD> before edit — 7-day rollback window per Clause B of the dead-weight rule
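
Step 3's lookup only goes one direction — requires() lists a package's own deps. A sketch of the reverse scan, run inside the container ("torch" stands in for whatever the bloat package is):

```bash
# Print every installed distribution whose declared requirements
# name the bloat package — i.e. the thing(s) pulling it in.
docker exec <container> python -c '
import importlib.metadata as md
import re
suspect = "torch"
for dist in md.distributions():
    for req in dist.requires or []:
        m = re.match(r"[A-Za-z0-9._-]+", req)
        if m and m.group(0).lower() == suspect:
            print(dist.metadata["Name"], "->", req)
'
```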

When this does NOT apply

  • Image is already small (<1.5 GB) — gains are diminishing; multi-stage still helps if the toolchain is in the runtime image
  • All deps are legitimately used at runtime — then the bloat isn’t “dead weight”, it’s the actual cost of the capability (playwright chromium + ffmpeg + node in oracle-hermes is the right shape at 2 GB)
  • Legacy pip-editable install — .egg-link files reference source paths, so site-packages AND source must move together across stages (oracle-hermes pattern)
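
The editable-install caveat as a multi-stage fragment — paths are illustrative, not the real oracle-hermes layout:

```dockerfile
FROM python:3.12-slim AS runtime
# pip install -e leaves .egg-link / __editable__*.pth files in site-packages
# that point at the source tree, so the venv AND the source must travel
# together — copying site-packages alone breaks imports at runtime.
COPY --from=builder /opt/app/.venv /opt/app/.venv
COPY --from=builder /opt/app/src   /opt/app/src
```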

Smoke-test protocol before live swap

  1. Tag slim build with -test suffix: docker build -t <image>:slim-test .
  2. Run on alt port with real env vars: docker run -d --name <name>-slim-smoke -p 127.0.0.1:<alt>:<port> <image>:slim-test
  3. Health probe: curl http://127.0.0.1:<alt>/health — if 200, image starts
  4. If the runtime log reaches initialize_server() or equivalent startup point without ImportError, Python deps are complete. Env-var failures at runtime (wrong DB host, missing auth) are NOT image issues and can be ignored for the image-level smoke
  5. ONLY AFTER smoke passes: retag to :latest, then docker compose up -d --force-recreate
  6. Run docker image prune -f after the compose recreate — reclaims the dangling orphan of the old large image (the whole protocol is condensed into one script below)
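
The whole protocol as one hedged script — the image name, ports, and the /health path are placeholders; adapt before use:

```bash
#!/usr/bin/env bash
set -euo pipefail

IMAGE=myapp              # placeholder image name
SMOKE=myapp-slim-smoke   # placeholder smoke-container name
ALT=18080                # host-side smoke port, placeholder
PORT=8080                # container port, placeholder

# 1-2. Build under a -test tag and run on the alternate port
docker build -t "$IMAGE:slim-test" .
docker run -d --name "$SMOKE" -p "127.0.0.1:$ALT:$PORT" "$IMAGE:slim-test"

# 3. Health probe — curl -f fails the script on a non-2xx response
sleep 5
curl -fsS "http://127.0.0.1:$ALT/health"

# 4. ImportError in the log means the image is incomplete; env-var
# failures (wrong DB host, missing auth) are NOT image issues
if docker logs "$SMOKE" 2>&1 | grep -qi importerror; then
  echo "missing Python dep — do not promote" >&2
  exit 1
fi
docker rm -f "$SMOKE"

# 5-6. Only after smoke passes: promote, recreate, prune the dangling orphan
docker tag "$IMAGE:slim-test" "$IMAGE:latest"
docker compose up -d --force-recreate
docker image prune -f
```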