VPS alert plane retuned to multi-axis AND-conjunction (26-Apr-2026)

Decision

Replaced the single-axis alert gates in vps-health-watchdog.sh and claude-session-runaway-guard.sh with multi-axis AND-conjunctions that match the actual pre-hang incident shape. Steal alert: steal≥25 AND load1m≥6 AND user_cpu≥20. Load alert: load1m≥8 AND user_cpu≥20. Session runaway: age≥6h AND cpu_hrs≥1 (the independent RSS≥4GB path is retained and the concurrent-session-cluster check is unchanged). The Top-CPU panel switched from ps --sort=-%cpu to the second snapshot of top -b -n 2 -d 1. ps reports cputime divided by etime, so for the just-started measurement command itself the near-zero etime produced nonsense readings such as 600%; top's second snapshot reports a real interval delta.
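A minimal sketch of the gate shapes above, not the code in the two scripts. The helper names (ge, steal_alert, load_alert, session_alert, second_snapshot) are hypothetical; only the thresholds come from this record. second_snapshot assumes the procps-style batch header, where each top iteration starts with a line beginning "top - ".

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical; thresholds come from the Decision text.

# ge A B: succeed when A >= B. awk is used so fractional load averages compare correctly.
ge() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 >= b + 0) }'; }

# steal_alert STEAL_PCT LOAD1M USER_CPU_PCT: all three axes must breach.
steal_alert() { ge "$1" 25 && ge "$2" 6 && ge "$3" 20; }

# load_alert LOAD1M USER_CPU_PCT: load alone is not enough without user CPU.
load_alert() { ge "$1" 8 && ge "$2" 20; }

# session_alert AGE_HOURS CPU_HOURS RSS_GB
# The RSS path is independent: it fires regardless of age or CPU time.
session_alert() { { ge "$1" 6 && ge "$2" 1; } || ge "$3" 4; }

# second_snapshot: keep only the second iteration of `top -b -n 2 -d 1`,
# assuming each iteration starts with a "top - " header line.
second_snapshot() { awk '/^top - / { n++ } n >= 2'; }

# Usage with the 22-Apr incident shape recorded in this note:
steal_alert 91 359 47 && echo "steal path fires"
session_alert 24 7 7.1 && echo "session path fires"
# top -b -n 2 -d 1 | second_snapshot
```

The first top iteration reports CPU usage since process start, which is why only the second one is kept.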

Rationale

AJ mandate: Telegram receives only “about to break” signals, never inbox spam. Validation: all 44 of the historical noise alerts from the last 24h would have been silent under the new gates, while the 22-Apr pre-hang incident shape (steal=91, load=359, user=47, session 24h-wall/7h-CPU/7.1GB-RSS) still fires on multiple paths. The premortem identified two residual risks, both accepted: (1) A disk-I/O saturation hang, where load climbs from D-state procs but user_cpu stays low, would be silenced; mitigated by the independent MEM path, the Hostinger panel, and the Prometheus/Loki/Grafana surfaces from architecture.md (Telegram is now the LOUD-ALARM channel, not early-warning). (2) Strict AND fires mid-break, not pre-break; same mitigation via the alternate observability surfaces. Both risks are acceptable per the explicit user policy that early warning lives in dashboards, not Telegram.
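The validation can be pictured as a replay of recorded samples through the new gates. The sketch below assumes a hypothetical "steal,load1m,user_cpu" line format and a hypothetical replay_gates helper; it is not the tooling used for the 44/44 check. Only the first sample row is from this record (the 22-Apr incident shape); the other two are made-up illustrations of a noisy-neighbor spike and of residual risk (1).

```sh
#!/bin/sh
# replay_gates: read "steal,load1m,user_cpu" lines on stdin (hypothetical
# format) and print which alert paths would fire under the new gates.
replay_gates() {
  awk -F, '{
    steal = ($1 >= 25 && $2 >= 6 && $3 >= 20)   # steal gate: all three axes
    load  = ($2 >= 8 && $3 >= 20)               # load gate: load AND user CPU
    verdict = "silent"
    if (steal && load) verdict = "steal+load"
    else if (steal)    verdict = "steal"
    else if (load)     verdict = "load"
    print $0 " -> " verdict
  }'
}

# Row 1: 22-Apr incident shape. Rows 2-3: illustrative values only.
replay_gates <<'EOF'
91,359,47
70,0.8,3
10,9.5,4
EOF
# Prints:
# 91,359,47 -> steal+load
# 70,0.8,3 -> silent
# 10,9.5,4 -> silent
```

The third row shows the accepted trade-off: load of 9.5 with user_cpu at 4 stays silent on Telegram and has to be caught on the dashboards.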

Alternatives Rejected

REJECTED: (a) Hysteresis (require N consecutive cron firings breaching threshold): adds 2-4 min latency before a real alert, and is unnecessary given that the 3-second sar window already smooths spikes. (b) A per-cron-firing rate limit tighter than the 15min cooldown: would suppress legitimate cascading alerts during a real incident. (c) Tightening thresholds without AND-conjunction (e.g. STEAL_PCT_ALERT=60): still fires on pure noisy-neighbor events with high steal and idle CPU. (d) Disabling the steal alert entirely: loses the 22-Apr pre-hang signal. (e) Documenting the threshold change in /root/aj-workspace/CLAUDE.md or MEMORY.md: rejected per Law 1 Zero Redundancy and the existing 28KB MEMORY.md bloat warning; the rationale lives at point of use in the script headers.

Outcome

Pending