Docker image-bloat rebuild playbook — find the dominant dep before multi-stage
For any Docker image >3 GB, the correct FIRST move is `docker exec <container> du -sh /path/to/site-packages/*`: bloat is almost always concentrated in one accidental dep, not spread across the layer set. Multi-stage is the vehicle; the lever is usually one dep pin or one extra removal.
Three patterns from 20-Apr-2026 Wave 1 rebuild (16.21 GB reclaimed across three images)
| Root cause | Platform | Before | After | Fix lever |
|---|---|---|---|---|
| camel-ai pulled torch-CUDA default | MiroFish | 5.80 GB | 1.41 GB (76% ↓) | Pre-install torch from https://download.pytorch.org/whl/cpu BEFORE uv pip install -r pyproject.toml — camel-ai sees torch present and skips CUDA resolve |
| --extra providers pulled sentence-transformers → torch → CUDA | Graphiti MCP | 5.65 GB | 0.41 GB (93% ↓) | Drop --extra providers, install only the actually-used provider (google-genai). Sentence-transformers unused because embeddings go through Gemini API, not local model |
| Build toolchain retained in runtime + dev-only node_modules survived Vite build | ORACLE Hermes | 4.89 GB | 2.10 GB (57% ↓) | Multi-stage: builder keeps build-essential/python3-dev/libffi-dev/gcc; runtime drops them. Strip web/node_modules post-build |
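Both dependency levers reduce to a couple of install commands in the builder stage. A minimal sketch, assuming a uv-managed environment; exact flags and file layout of the real Dockerfiles may differ:

```bash
# MiroFish pattern: satisfy torch from the CPU wheel index BEFORE the project
# resolve, so camel-ai finds torch already installed and never pulls CUDA wheels.
uv pip install --index-url https://download.pytorch.org/whl/cpu torch
uv pip install -r pyproject.toml

# Graphiti MCP pattern: skip the providers extra and add only the provider
# actually used (embeddings go through the Gemini API, not a local model).
#   before: uv pip install -r pyproject.toml --extra providers
uv pip install -r pyproject.toml google-genai
```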
The general pattern
- Profile first: `docker exec <name> du -sh /opt/<app>/.venv/lib/python*/site-packages/* | sort -rh | head -15`
- Identify the outlier: typically ONE package is 50%+ of site-packages: torch+CUDA (GPU stack on a CPU VPS), sentence-transformers (unused embedder), playwright-with-browsers (if only the API is used)
- Check the dep chain: `importlib.metadata.requires("<suspect-parent>")` confirms whether that package declares the bloat dep (reverse-lookup sketch after this list)
- Pin or drop: either install a CPU/slim variant via `--index-url` before the main resolve, OR remove the offending extra, OR cherry-pick the one sub-package you actually use
- Multi-stage after: only now does splitting builder/runtime matter; it’s the 10-30% polish on top of the 70-90% win from the dep fix
- Backup anchor: `cp Dockerfile Dockerfile.pre-slim.<YYYYMMDD>` before editing (7-day rollback window per Clause B of the dead-weight rule)
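Note that `requires()` only lists a package's forward deps; to find what pulls the bloat package IN, you can walk every installed distribution. A sketch, assuming `python` is on PATH inside the container; the container name and target package are placeholders:

```bash
docker exec -i <name> python - <<'PY'
import importlib.metadata as md
import re

target = "torch"  # the bloated package you are tracing (placeholder)
for dist in md.distributions():
    for req in dist.requires or []:
        # requirement strings look like "torch>=2.1" or "torch ; extra == 'gpu'"
        name = re.split(r"[\s;<>=!~\[(]", req, maxsplit=1)[0].lower()
        if name == target:
            print(f"{dist.metadata['Name']} -> {req}")
PY
```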
When this does NOT apply
- Image is already small (<1.5 GB) — gains are diminishing, multi-stage still helps if toolchain is in runtime
- All deps are legitimately used at runtime — then the bloat isn’t “dead weight”, it’s the actual cost of the capability (playwright chromium + ffmpeg + node in oracle-hermes is the right shape at 2 GB)
- Legacy pip-editable install: `.egg-link` files reference source paths, so site-packages AND source must move together across stages (oracle-hermes pattern); a quick check is sketched after this list
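For that editable-install caveat, a post-build sanity check: each `.egg-link` names its source directory on the first line, and that directory must exist in the runtime stage. A sketch; the container name and venv path are placeholders:

```bash
docker exec <name> sh -c '
  for link in /opt/app/.venv/lib/python*/site-packages/*.egg-link; do
    [ -e "$link" ] || continue                 # glob matched nothing
    src=$(head -n 1 "$link")                   # first line = source path
    if [ -d "$src" ]; then
      echo "OK       $link -> $src"
    else
      echo "MISSING  $link -> $src (source tree not copied into this stage)"
    fi
  done
'
```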
Smoke-test protocol before live swap
- Tag the slim build with a `-test` suffix: `docker build -t <image>:slim-test .` (the full protocol is scripted in the sketch after this list)
- Run on an alt port with real env vars: `docker run -d --name <name>-slim-smoke -p 127.0.0.1:<alt>:<port> <image>:slim-test`
- Health probe: `curl http://127.0.0.1:<alt>/health`; a 200 means the image starts
- If the runtime log reaches `initialize_server()` or an equivalent startup point without ImportError, the Python deps are complete. Env-var failures at runtime (wrong DB host, missing auth) are NOT image issues and can be ignored for the image-level smoke
- ONLY AFTER smoke passes: retag to `:latest`, then `docker compose up -d --force-recreate`
- Run `docker image prune -f` after the compose recreate; it reclaims the dangling orphan of the old large image
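The whole protocol fits in a throwaway script. A sketch; image name, ports, env file, and the /health path are placeholders to adjust per service:

```bash
#!/usr/bin/env bash
# Smoke-test a slimmed image before swapping it live (placeholder values).
set -euo pipefail

IMAGE="myapp"          # placeholder image name
ALT_PORT=18080         # host-side alternate port for the smoke run
APP_PORT=8080          # port the app listens on inside the container

docker build -t "${IMAGE}:slim-test" .

docker rm -f "${IMAGE}-slim-smoke" 2>/dev/null || true
docker run -d --name "${IMAGE}-slim-smoke" \
  --env-file .env \
  -p "127.0.0.1:${ALT_PORT}:${APP_PORT}" \
  "${IMAGE}:slim-test"

sleep 5   # crude startup wait; a retry loop is kinder to slow apps

if curl -fsS "http://127.0.0.1:${ALT_PORT}/health" >/dev/null; then
  echo "smoke OK: container starts and /health answers"
  echo "next: docker tag ${IMAGE}:slim-test ${IMAGE}:latest"
  echo "      docker compose up -d --force-recreate && docker image prune -f"
else
  echo "smoke FAILED: check docker logs ${IMAGE}-slim-smoke for ImportError" >&2
fi

docker rm -f "${IMAGE}-slim-smoke" >/dev/null
```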