Purge 22 orphan certs from Traefik acme.json, drop mytlschallenge re…

Decision

Purge 22 orphan certs from Traefik acme.json, drop mytlschallenge resolver block, fix huggingface-mcp cert resolver (mytlschallenge→letsencrypt), add traefik.docker.network label to 4 compose files (oracle-hermes, oracle-mirofish, openspace-mcp, runwal-bkc), connect oracle-mirofish to mcp-global-network, install nightly orphan-detect cron at 04:00 UTC, build validate-traefik-labels.sh fleet validator.

Rationale

CPU 100% RCA showed actual primary cause was a 129 MB ACME retry queue holding ~100 concurrent failed-challenge contexts for 13 v2-mcp certs that expired Apr 20-23 (evidence: Traefik RSS 151→22 MB after purge; timestamp alignment with user’s 3-4hr spike window). User’s surface hypothesis (Docker tight-loop, rate-limit, phantom resolver) was partially right but missed the dominant accumulator. Fix spans 4 layers: (a) immediate — purge orphans to drain the retry queue, (b) config — correct labels so Traefik can actually route, (c) prevention — nightly detect cron satisfies Tier 1 Directive #7 self-maintenance, (d) fleet-level — lint validator prevents recurrence of the same class of bug at onboarding. Side-discovery: acme.json swap requires second Traefik restart to re-index cert cache (broke ST MCP TLS transiently until I restarted twice).

Alternatives Rejected

Outcome

Pending