acme-retry-queue-cpu-memory-accumulator-edge-proxy

Pattern: ACME Retry Queue Is a CPU & Memory Accumulator (Long-Lived Edge Proxies)

Observed symptom: Traefik edge proxy consuming 100% CPU across 3-4 hours. Root cause NOT the commonly-hypothesized Docker service-discovery loop — it was an accumulating ACME renewal retry queue on 13 expired orphan certs with NXDOMAIN. Each renewal attempt held failed-challenge state in memory; queue peaked at ~100 concurrent retries holding 129 MB of TLS handshake contexts.

Diagnostic signals (in priority order):

  1. Memory delta is definitive — Traefik RSS 151 MB → 22 MB (85% reclaim) after acme.json orphan purge. Normal baseline for a 80-router Traefik is 20-30 MB. Anything above 100 MB = queued retry state.
  2. Log error rate is insufficient alone — 27 ERR/hr can’t saturate an 8-core host on log I/O. The real CPU burn is TLS handshake + cert-signing crypto in concurrent retry contexts.
  3. Timestamp concordance — cert expiry times aligned exactly with the user-observed spike window.

Fix pattern (reusable across ACME-issuing proxies — Traefik, certbot, cert-manager, Caddy):

  1. Purge orphan certs from the store (those with no matching live router/server-block/Ingress).
  2. Remove stale resolver/account blocks from prior config versions.
  3. Install nightly orphan-detect cron that alerts on finding (don’t auto-purge — destructive op needs manual review).
  4. Wire a write-time hook/lint that catches compose-label classes that generate future orphans: missing traefik.docker.network label on multi-network containers; unknown cert resolver names not defined in proxy config.

Meta-pattern (applies beyond ACME): Every class-of-bug identified in a production RCA should spawn BOTH (a) an audit script for fleet bulk-scan AND (b) a matching PostToolUse/pre-commit hook for per-edit advisory. Audit catches existing instances, hook prevents new ones from shipping. Pattern generalizes to: dependency-pin drift, IaC configuration lint, schema drift, secret leakage, deprecated API usage.

Self-maintenance discipline: every accumulation point (ACME cert store, certbot cert lineage, queue depth, log rotation) requires a retention + detect cron. Orphan accumulation is guaranteed over time — only active purge policy prevents expiry-triggered storms.