Ship 7-fix process-pile-up defense package: (1) systemd StartLimitInte…
Decision
Ship a 7-fix process-pile-up defense package:
(1) systemd StartLimitInterval/Burst on mc-heartbeat and brainstorm
(2) daemon exponential backoff when resolve_agent_ids hits a transient 404
(3) hourly systemd-restart-watchdog cron
(4) MC compose hardening (memory limit 768M→1.5G, plus a deeper healthcheck against /api/agents)
(5) periodic docker-health-check timer
(6) daemon sys.exit(2) on config drift, so the watchdog can see the failure
(7) watchdog absolute-count threshold for slow-leak detection
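Fixes (2) and (6) can be sketched together in Python. This is a minimal illustration under stated assumptions, not the daemon's actual code: `TransientLookupError`, `resolve_with_backoff`, `startup_exit_code`, the delay values, and the definition of "config drift" as "a configured agent name has no resolved id" are all hypothetical. Only the name `resolve_agent_ids` and the exit code 2 come from the decision itself.

```python
import sys
import time

EXIT_CONFIG_DRIFT = 2  # exit code named in fix (6)


class TransientLookupError(Exception):
    """Hypothetical: raised when the agent lookup gets a transient 404."""


def resolve_with_backoff(resolve, attempts=6, base=2.0, cap=300.0, sleep=time.sleep):
    """Call resolve() until it succeeds, sleeping base, 2*base, 4*base, ...
    (capped at `cap`) between attempts. Re-raises after the last attempt."""
    for attempt in range(attempts):
        try:
            return resolve()
        except TransientLookupError:
            if attempt == attempts - 1:
                raise  # give up; the caller decides how to exit
            sleep(min(cap, base * 2 ** attempt))


def startup_exit_code(resolve, configured_names, sleep=time.sleep):
    """Return the exit code the daemon should use at startup (0 = keep running)."""
    try:
        agent_ids = resolve_with_backoff(resolve, sleep=sleep)
    except TransientLookupError:
        return 1  # still failing after backoff; leave restart limits to systemd
    # Assumed meaning of "config drift": a configured agent has no resolved id.
    missing = sorted(set(configured_names) - set(agent_ids))
    if missing:
        print(f"config drift: no agent id for {missing}", file=sys.stderr)
        return EXIT_CONFIG_DRIFT
    return 0
```

In a real entry point, the result would be passed to `sys.exit()` when it is non-zero. Exiting instead of idling means the unit fails visibly rather than staying active while doing nothing, and the backoff keeps a degraded endpoint from being hit in a tight loop.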
Rationale
30-Apr-2026 incident: mc-heartbeat restart-looped 791 times in 12h42m against a 404 from MC /api/agents (route degradation caused by a Next.js uncaughtException). The loop pushed plist-sz (task-list size) to 13.5k and load-avg to 1138 while user CPU stayed at 3.7%. The Hostinger panel showed “100% CPU” because it reads load-avg, not user CPU. A reboot was the only escape.

A premortem on the fix package surfaced two additional blind spots: a silent path where the daemon stays active but idle, and a watchdog that checked only restart deltas and so could not see slow leaks. Both were patched in the same session.

All 7 items are verified live: mc-heartbeat shows NRestarts=0 since the patch and is monitoring 5 agents; the MC container is Up and healthy, with the deeper healthcheck passing inside the container; the restart-watchdog DRY_RUN passes; docker-health-check.timer is active and scheduled for 06:45 UTC. Backups are preserved for every modified file.

Three secondary risks (Telegram as a single point of failure, a Docker restart loop without a rate limit, and the cron daemon dying) are acknowledged as systemic. They are out of scope for this incident and tracked for the next hardening pass.
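The slow-leak blind spot can be illustrated with a short Python sketch. It assumes the watchdog compares the unit's `NRestarts` counter between runs; the threshold values, the `Thresholds` and `evaluate` names, and the counter-reset handling are hypothetical, not taken from the deployed watchdog.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_delta: int = 5   # assumed: restarts tolerated between two watchdog runs
    max_total: int = 20  # assumed: absolute NRestarts ceiling


def read_nrestarts(unit):
    """Read the unit's restart counter via systemctl."""
    out = subprocess.run(
        ["systemctl", "show", "--property=NRestarts", "--value", unit],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def evaluate(previous, current, limits=Thresholds()):
    """Return a list of alert reasons; an empty list means the unit looks fine."""
    reasons = []
    # If the counter went down, assume it was reset and count from zero.
    delta = current - previous if current >= previous else current
    if delta > limits.max_delta:
        reasons.append(f"delta {delta} > {limits.max_delta} since last run")
    if current > limits.max_total:
        reasons.append(f"total {current} > {limits.max_total}")
    return reasons
```

With these assumed values, a unit that restarts 3 times per hour never trips the delta check, but the absolute check fires once the counter passes 20: `evaluate(21, 24)` returns one reason (the total), whereas a delta-only watchdog would stay silent.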
Alternatives Rejected
Outcome
Pending
Related
- final-closure-round-2-30-apr-process-pile-up-incident-harden
- docker
- confirmed-closure-7-fix-process-pile-up-defense-package-is-c
- snowflake-mcp-v2203-upgrade-quality-audit-full-pass-bible-v1
- 2026-04-04-oracle-001-self-architecture-analysis
- process-pile-up-triage-hostinger-cpu-panel-reads-load-avg
- restart-loop-watchdog-with-delta-only-thresholds-blind-to-sl