Purge 22 orphan certs from Traefik acme.json, drop mytlschallenge re…
Decision
Purge 22 orphan certs from Traefik acme.json, drop mytlschallenge resolver block, fix huggingface-mcp cert resolver (mytlschallenge→letsencrypt), add traefik.docker.network label to 4 compose files (oracle-hermes, oracle-mirofish, openspace-mcp, runwal-bkc), connect oracle-mirofish to mcp-global-network, install nightly orphan-detect cron at 04:00 UTC, build validate-traefik-labels.sh fleet validator.
Rationale
The CPU 100% RCA showed that the actual primary cause was a 129 MB ACME retry queue holding ~100 concurrent failed-challenge contexts for 13 v2-mcp certs that expired Apr 20-23 (evidence: Traefik RSS dropped 151→22 MB after the purge; timestamp alignment with the user's 3-4 hr spike window). The user's surface hypothesis (Docker tight-loop, rate-limit, phantom resolver) was partially right but missed the dominant accumulator. The fix spans 4 layers: (a) immediate: purge the orphans to drain the retry queue; (b) config: correct the labels so Traefik can actually route; (c) prevention: the nightly detect cron satisfies Tier 1 Directive #7 self-maintenance; (d) fleet-level: the lint validator prevents recurrence of the same class of bug at onboarding. Side-discovery: swapping acme.json requires a second Traefik restart to re-index the cert cache (this transiently broke ST MCP TLS until I restarted twice).
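As an added illustration (not the page's own orphan-detect cron or validate-traefik-labels.sh), the two checks this decision relies on, in Python over constructed acme.json and label data: certs whose domain no router's Host() rule still serves, or whose resolver block is gone, and containers missing traefik.docker.network or pointing a router at an unconfigured certresolver. The container names, hostnames and the live resolver set are assumptions.

```python
import json
import re
# Constructed acme.json (Traefik v2 layout: resolver -> Certificates[] with domain.main/sans); key/cert bodies omitted.
ACME_JSON = json.dumps({
    "letsencrypt": {"Certificates": [
        {"domain": {"main": "hermes.example.test", "sans": []}},
        {"domain": {"main": "v2-mcp.example.test", "sans": ["v2-old.example.test"]}}]},
    "mytlschallenge": {"Certificates": [
        {"domain": {"main": "hf-mcp.example.test", "sans": []}}]},
})
# Constructed container labels as `docker inspect` reports them (hypothetical names/hosts).
LABELS = {
    "oracle-hermes": {
        "traefik.http.routers.hermes.rule": "Host(`hermes.example.test`)",
        "traefik.http.routers.hermes.tls.certresolver": "letsencrypt",
        "traefik.docker.network": "mcp-global-network"},
    "huggingface-mcp": {
        "traefik.http.routers.hf.rule": "Host(`hf-mcp.example.test`)",
        "traefik.http.routers.hf.tls.certresolver": "mytlschallenge"},
}
HOST_RE = re.compile(r"Host\(`([^`]+)`\)")

def routed_hosts(labels):
    """Hosts some container still routes via a Host() rule label."""
    return {h for lbl in labels.values() for key, val in lbl.items()
            if key.startswith("traefik.http.routers.") and key.endswith(".rule")
            for h in HOST_RE.findall(val)}

def orphan_certs(acme_text, labels, live_resolvers):
    """(resolver, main domain) of certs no router serves or whose resolver is gone."""
    live = routed_hosts(labels)
    found = []
    for resolver, data in json.loads(acme_text).items():
        for cert in data.get("Certificates") or []:
            names = [cert["domain"]["main"], *(cert["domain"].get("sans") or [])]
            if resolver not in live_resolvers or not any(n in live for n in names):
                found.append((resolver, cert["domain"]["main"]))
    return found

def label_problems(labels, live_resolvers):
    """Fleet lint: missing traefik.docker.network or certresolver not configured."""
    problems = []
    for name, lbl in labels.items():
        if "traefik.docker.network" not in lbl:
            problems.append((name, "missing traefik.docker.network"))
        for key, val in lbl.items():
            if key.endswith(".tls.certresolver") and val not in live_resolvers:
                problems.append((name, f"unknown certresolver {val}"))
    return problems
```

Feeding it the real acme.json and the output of `docker inspect` would be the cron's job; each reported pair is a candidate for purge or a label fix, not an automatic deletion.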
Alternatives Rejected
Outcome
Pending
Related
- docker
- comprehensive-vps-audit-sweep-post-cpu-spike-rca-disabled-tr
- oracle
- 2026-04-04-oracle-001-self-architecture-analysis
- traefik
- applied-executive-authority-on-technical-dead-weight-archive
- reconciled-from-st-fallback-journalmd-23-apr-2026-1230-utc-p
- acme-retry-queue-cpu-memory-accumulator-edge-proxy