context-engineering-operational-thresholds
Context Engineering Operational Thresholds
Evidence-backed numbers from Factory Research (36,611 production messages), RULER benchmark, and BrowseComp analysis. Integrated into Claude Code PreCompact/PostCompact hooks, MEMORY.md Tool Intelligence directives, and protocols.md Evaluation Rigor Protocol on 09-Apr-2026.
Context Capacity
- Effective context = 60-70% of advertised window. 200K model degrades at 120-140K tokens.
- Lost-in-middle: 10-40% accuracy drop when info sits in context middle vs beginning/end.
- Compaction trigger: 70-80% utilization, not 90%+. Above 85%, summarizing model itself degrades.
Tool & Agent Budgets
- Tool count ceiling: 10-20 per agent context. Overlap-induced selection errors compound beyond this.
- Sub-agent return budget: 1,000-2,000 tokens max regardless of exploration breadth.
- Tool output offload threshold: ~2,000 tokens → auto-offload to file.
- Supervisor hard cap: 3-5 workers per supervisor tier.
Compression Quality
- Artifact trail integrity: 2.2-2.5/5.0 across ALL compression methods — weakest dimension universally. Fix: separate verbatim artifact index.
- Observation masking = zero overhead, matches LLM summarization quality. Observations = 83.9% of agent tokens.
- Tokens-per-task is the correct optimization target, not tokens-per-request.
Agent Performance Budget (BrowseComp)
- Token budget explains 80% of performance variance, tool count ~10%, model choice ~5%.
- Rule: increase budget before swapping models.
Evaluation
- Justification-before-score: +15-25% reliability.
- Pairwise double-pass with position swap eliminates position bias.
- Minor eval prompt phrasing changes cause 10-20% score swings.
Related
- 2026-04-04-oracle-001-self-architecture-analysis
- docker
- clawteam-openclaw-multi-agent-swarm-evaluation
claude-code-to-nova-20260404-052908(archived)- enterprise-capability-expansion-5-pillars-from-digital-employee-analysis
- context-engineering-upgrade-3-tier-integration-from-agent-sk
- compaction-artifact-preservation-universally-scores-22-out-o
- observations-are-83-percent-of-tokens-mask-stale-outputs-not
- 1m-context-era-token-budget-recalibration-5-10x-from-200k