acme-retry-queue-cpu-memory-accumulator-edge-proxy
Pattern: ACME Retry Queue Is a CPU & Memory Accumulator (Long-Lived Edge Proxies)
Observed symptom: Traefik edge proxy consuming 100% CPU for 3-4 hours. The root cause was NOT the commonly hypothesized Docker service-discovery loop; it was an accumulating ACME renewal retry queue for 13 expired orphan certs whose domains returned NXDOMAIN. Each renewal attempt held failed-challenge state in memory; the queue peaked at ~100 concurrent retries holding 129 MB of TLS handshake contexts.
Diagnostic signals (in priority order):
- Memory delta is definitive: Traefik RSS dropped from 151 MB to 22 MB (85% reclaim) after the acme.json orphan purge. Normal baseline for an 80-router Traefik is 20-30 MB. Anything above 100 MB indicates queued retry state.
- Log error rate is insufficient alone — 27 ERR/hr can’t saturate an 8-core host on log I/O. The real CPU burn is TLS handshake + cert-signing crypto in concurrent retry contexts.
- Timestamp concordance — cert expiry times aligned exactly with the user-observed spike window.
Fix pattern (reusable across ACME-issuing proxies — Traefik, certbot, cert-manager, Caddy):
- Purge orphan certs from the store (those with no matching live router/server-block/Ingress).
- Remove stale resolver/account blocks from prior config versions.
- Install a nightly orphan-detect cron job that alerts on findings (don’t auto-purge: a destructive op needs manual review).
- Wire a write-time hook/lint that catches compose-label classes that generate future orphans: missing traefik.docker.network label on multi-network containers; unknown cert resolver names not defined in proxy config.
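The orphan-detect step above can be sketched as follows. This is a minimal illustration, not the script used in the incident. It assumes the acme.json layout commonly seen with Traefik v2 (a top-level object keyed by resolver name, each with a "Certificates" list whose entries carry "domain": {"main": ..., "sans": [...]}); verify that against your own file. The function names and the live_hosts input are hypothetical, and wildcard certs are compared by exact string only.

```python
def cert_domains(acme_store: dict) -> set:
    """Collect every domain (main + SANs) across all resolvers in the store.

    Assumed layout (check against your own acme.json):
    {"<resolver>": {"Certificates": [{"domain": {"main": "...", "sans": [...]}}]}}
    """
    domains = set()
    for resolver in acme_store.values():
        if not isinstance(resolver, dict):
            continue
        # "Certificates" may be missing or null for a resolver that never issued.
        for cert in resolver.get("Certificates") or []:
            dom = cert.get("domain") or {}
            if dom.get("main"):
                domains.add(dom["main"])
            domains.update(dom.get("sans") or [])
    return domains


def find_orphans(acme_store: dict, live_hosts) -> list:
    """Return stored cert domains with no matching live router host.

    live_hosts is a hypothetical input: the hostnames currently routed,
    gathered however your setup exposes them. Report only, never purge.
    """
    return sorted(cert_domains(acme_store) - set(live_hosts))
```

In a nightly job, load the store with json.load, pass in the current router hostnames, and alert when the returned list is non-empty; the purge itself stays a manual step.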
Meta-pattern (applies beyond ACME): Every class-of-bug identified in a production RCA should spawn BOTH (a) an audit script for fleet bulk-scan AND (b) a matching PostToolUse/pre-commit hook for per-edit advisory. Audit catches existing instances, hook prevents new ones from shipping. Pattern generalizes to: dependency-pin drift, IaC configuration lint, schema drift, secret leakage, deprecated API usage.
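As one possible shape for the per-edit hook half of this pattern, here is a sketch of a lint for the two compose-label classes named in the fix list. It assumes the compose file has already been parsed into a dict by whatever YAML loader you use, and that known_resolvers comes from your proxy config; lint_service and normalize_labels are illustrative names, not part of any tool.

```python
import re

# Matches router-level resolver labels, e.g. traefik.http.routers.app.tls.certresolver
CERTRESOLVER_KEY = re.compile(r"^traefik\.http\.routers\.[^.]+\.tls\.certresolver$")


def normalize_labels(labels) -> dict:
    """Accept both compose label forms: a mapping or a list of "key=value" strings."""
    if isinstance(labels, dict):
        return {str(k).lower(): str(v) for k, v in labels.items()}
    out = {}
    for item in labels or []:
        key, _, value = str(item).partition("=")
        out[key.strip().lower()] = value.strip()
    return out


def lint_service(name: str, service: dict, known_resolvers) -> list:
    """Return advisory findings for one compose service (empty list = clean)."""
    labels = normalize_labels(service.get("labels"))
    networks = service.get("networks") or []  # list or mapping; len() works for both
    findings = []

    uses_traefik = any(key.startswith("traefik.") for key in labels)
    if uses_traefik and len(networks) > 1 and "traefik.docker.network" not in labels:
        findings.append(
            f"{name}: multi-network container without traefik.docker.network label"
        )

    for key, value in labels.items():
        if CERTRESOLVER_KEY.match(key) and value not in known_resolvers:
            findings.append(f"{name}: unknown cert resolver '{value}' in {key}")
    return findings
```

The same function can back both halves: call it from a pre-commit hook on the edited file for advisory output, and loop it over every compose file in the fleet for the bulk audit.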
Self-maintenance discipline: every accumulation point (ACME cert store, certbot cert lineage, queue depth, log rotation) requires a retention + detect cron. Orphan accumulation is guaranteed over time; only an active purge policy prevents expiry-triggered storms.