Comprehensive VPS audit sweep (post CPU-spike RCA): disabled Traefik r…

Decision

Comprehensive VPS audit sweep (post CPU-spike RCA): disabled Traefik routing for 2 NXDOMAIN vestigial domains (oracle.arjtech.in, mirofish-oracle.arjtech.in) via traefik.enable=false with documented re-enable conditions; fixed stale Prometheus monitoring target (gumlet-mcp-v2 → gumlet-mcp); added traefik.docker.network label to 2 un-deployed compose files (second-brain vault-mcp, snowflake-mcp prometheus/grafana) then subsequently retired snowflake prometheus+grafana Traefik labels as superseded by main /opt/prometheus stack. Fleet validator now reports 0 findings across all compose files under /opt and /root.

Rationale

User escalation after initial CPU fix: “scan whole VPS, resolve all, zero ambiguity, zero redundancy”. 6-dimension audit (container health, disk, resource outliers, log error hotspots, cert chain health, validator scan) surfaced 5 remaining issues beyond the initial CPU RCA — all resolved in one pass. Fix philosophy applied feedback_nothing_is_dead_weight clause B (retire superseded) for the snowflake prometheus+grafana services since /opt/prometheus already scrapes snowflake-mcp. Applied clause A (fix wiring, don’t remove) for vault-mcp duplicate compose file (kept future-redeployable). Applied zero-ambiguity to NXDOMAIN domains — stopped the ACME retry loop by disabling the public route rather than leaving it broken.

Alternatives Rejected

Outcome

Pending