Ship 7-fix process-pile-up defense package: (1) systemd StartLimitInte…

Decision

Ship 7-fix process-pile-up defense package:

(1) systemd StartLimitInterval/Burst on mc-heartbeat + brainstorm
(2) daemon exponential-backoff on resolve_agent_ids transient 404
(3) hourly systemd-restart-watchdog cron
(4) MC compose hardening (memory 768→1.5G + deeper /api/agents healthcheck)
(5) docker-health-check periodic timer
(6) daemon sys.exit(2) on config drift for watchdog visibility
(7) watchdog absolute-count threshold for slow-leak detection
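Items (2) and (6) can be sketched roughly as follows. This is a minimal illustration, not the deployed daemon code: `TransientNotFound`, `backoff_delays`, `resolve_with_backoff`, `missing_agents` and `exit_on_config_drift` are hypothetical names, the delay values are placeholders, and "config drift" is assumed here to mean that a configured agent name no longer resolves to an id.

```python
import sys
import time
from typing import Callable, Dict, Iterable, List

EXIT_CONFIG_DRIFT = 2  # exit code named in item (6)


class TransientNotFound(Exception):
    """Hypothetical: raised by the resolver when /api/agents answers 404."""


def backoff_delays(base: float = 5.0, cap: float = 300.0, attempts: int = 6) -> List[float]:
    """Exponential delays: base, 2*base, 4*base, ... each capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def resolve_with_backoff(
    resolve: Callable[[List[str]], Dict[str, int]],
    names: List[str],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Call `resolve` once per delay; sleep after each transient failure.

    Raises RuntimeError once the delays are exhausted, so the process fails
    loudly instead of retrying in a tight loop.
    """
    last_error = None
    for delay in delays:
        try:
            return resolve(names)
        except TransientNotFound as exc:
            last_error = exc
            sleep(delay)
    raise RuntimeError("agent id resolution still failing after backoff") from last_error


def missing_agents(configured: List[str], resolved: Dict[str, int]) -> List[str]:
    """Configured agent names that did not resolve to an id (assumed drift signal)."""
    return sorted(set(configured) - set(resolved))


def exit_on_config_drift(configured: List[str], resolved: Dict[str, int]) -> None:
    """Exit with code 2 so the watchdog sees a failed unit, not an idle one."""
    missing = missing_agents(configured, resolved)
    if missing:
        print(f"config drift: unresolved agents {missing}", file=sys.stderr)
        sys.exit(EXIT_CONFIG_DRIFT)
```

Passing `sleep` as a parameter keeps the retry schedule testable without real waiting. Capping the delay bounds the worst-case wait per attempt while still spacing retries far wider than an immediate restart loop would.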

Rationale

30-Apr-2026 incident: mc-heartbeat restart-looped 791 times in 12h42m (about one restart per minute) against MC /api/agents returning 404 (route degradation induced by a Next.js uncaughtException). The loop pushed plist-sz to 13.5k and load-avg to 1138 while user CPU stayed at 3.7%. The Hostinger panel showed “100% CPU” because it reads load-avg, not user CPU. Reboot was the only escape.

A premortem on the fix package surfaced two additional blind spots: the daemon's active-but-idle silent path, and the watchdog's delta-only check, which misses slow leaks. Both were patched in the same session.

All 7 items verified live: mc-heartbeat shows NRestarts=0 since the patch and is monitoring 5 agents; the MC container is Up and healthy, with the deeper healthcheck passing inside the container; the restart-watchdog DRY_RUN passes; docker-health-check.timer is active and scheduled for 06:45 UTC. Backups are preserved for every modified file.

Three secondary risks (Telegram SPOF, Docker restart loop without a rate limit, cron daemon dying) are acknowledged as systemic. They are out of scope for this incident and tracked for the next hardening pass.

0.85
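The delta-only blind spot can be illustrated with a small sketch. It assumes the watchdog compares a unit's NRestarts counter between runs; `Thresholds`, `read_nrestarts`, `evaluate` and the limit values are hypothetical and not taken from the deployed script.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    max_delta: int = 5   # hypothetical: restarts allowed between two watchdog runs
    max_total: int = 50  # hypothetical: absolute restart count that always alerts


def read_nrestarts(unit: str) -> int:
    """Read the unit's restart counter via `systemctl show`."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts", "--value"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def evaluate(previous: int, current: int, limits: Thresholds) -> List[str]:
    """Return the checks that tripped: "delta" (burst) and/or "total" (slow leak)."""
    # If the counter went backwards it was reset, so count from zero.
    delta = current - previous if current >= previous else current
    tripped = []
    if delta >= limits.max_delta:
        tripped.append("delta")
    if current >= limits.max_total:
        tripped.append("total")
    return tripped
```

A unit that restarts twice per hourly run never trips the delta check with these placeholder limits, but it still crosses the absolute threshold eventually, which is the slow-leak case item (7) targets.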

Alternatives Rejected

Outcome

Pending