Context Engineering Operational Thresholds

Evidence-backed numbers from Factory Research (36,611 production messages), RULER benchmark, and BrowseComp analysis. Integrated into Claude Code PreCompact/PostCompact hooks, MEMORY.md Tool Intelligence directives, and protocols.md Evaluation Rigor Protocol on 09-Apr-2026.

Context Capacity

  • Effective context = 60-70% of the advertised window. A model advertised at 200K tokens starts to degrade at roughly 120-140K tokens.
  • Lost-in-the-middle: accuracy drops 10-40% when information sits in the middle of the context rather than at the beginning or end.
  • Compaction trigger: 70-80% utilization, not 90%+. Above 85%, the summarizing model itself degrades.
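
The capacity numbers above reduce to simple arithmetic. The following is a minimal sketch, not the hook implementation: the function names and status labels are hypothetical, and the 70% figure is taken as the lower edge of the 70-80% trigger band.

```python
# Percentages come from the thresholds listed above; names are illustrative only.
EFFECTIVE_PCT = (60, 70)       # effective context as % of the advertised window
COMPACT_AT_PCT = 70            # lower edge of the 70-80% compaction trigger band
SUMMARIZER_DEGRADES_PCT = 85   # above this, the summarizing model itself degrades


def effective_window(advertised_tokens: int) -> tuple[int, int]:
    """Return the (low, high) token range where degradation is expected to begin."""
    low_pct, high_pct = EFFECTIVE_PCT
    return advertised_tokens * low_pct // 100, advertised_tokens * high_pct // 100


def compaction_status(used_tokens: int, advertised_tokens: int) -> str:
    """Classify current utilization against the compaction thresholds."""
    utilization_pct = used_tokens * 100 / advertised_tokens
    if utilization_pct > SUMMARIZER_DEGRADES_PCT:
        return "late"      # compaction quality is already at risk
    if utilization_pct >= COMPACT_AT_PCT:
        return "compact"   # trigger compaction now
    return "ok"
```

For a 200K window, effective_window(200_000) gives (120000, 140000), matching the range quoted above.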

Tool & Agent Budgets

  • Tool count ceiling: 10-20 per agent context. Overlap-induced selection errors compound beyond this.
  • Sub-agent return budget: 1,000-2,000 tokens max regardless of exploration breadth.
  • Tool output offload threshold: ~2,000 tokens → auto-offload to file.
  • Supervisor hard cap: 3-5 workers per supervisor tier.
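
The offload threshold can be sketched as a small wrapper around tool output. This is an illustration under stated assumptions: the token estimate is a crude 4-characters-per-token heuristic (swap in a real tokenizer), and maybe_offload, its parameters, and the pointer format are hypothetical.

```python
from pathlib import Path

OFFLOAD_THRESHOLD_TOKENS = 2_000  # from the ~2,000-token threshold above


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return len(text) // 4


def maybe_offload(output: str, directory: Path, name: str, preview_chars: int = 400) -> str:
    """Return the output unchanged if small; otherwise write it to a file
    and return a short pointer plus a preview to keep in context."""
    tokens = estimate_tokens(output)
    if tokens <= OFFLOAD_THRESHOLD_TOKENS:
        return output
    path = directory / f"{name}.txt"
    path.write_text(output, encoding="utf-8")
    pointer = f"[output offloaded to {path} (~{tokens} tokens)]"
    return pointer + "\n" + output[:preview_chars]
```

Only the pointer and preview enter the agent context; the full output stays on disk and can be re-read on demand.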

Compression Quality

  • Artifact trail integrity: 2.2-2.5/5.0 across ALL compression methods — weakest dimension universally. Fix: separate verbatim artifact index.
  • Observation masking adds zero overhead and matches LLM summarization quality. Observations account for 83.9% of agent tokens.
  • Tokens-per-task is the correct optimization target, not tokens-per-request.
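
Observation masking needs no model call, which is where the zero overhead comes from. A minimal sketch follows, assuming a hypothetical message schema (dicts with "role" and "content", where role "tool" marks an observation) and a hypothetical keep_last parameter; the source does not specify either.

```python
MASK = "[observation omitted]"


def mask_observations(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace the content of all but the most recent `keep_last` tool
    observations with a placeholder. Non-observation messages are untouched."""
    observation_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if keep_last > 0:
        to_mask = set(observation_idx[:-keep_last])
    else:
        to_mask = set(observation_idx)
    return [
        {**m, "content": MASK} if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```

The input list is not mutated, so the unmasked history remains available for a separate verbatim artifact index.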

Agent Performance Budget (BrowseComp)

  • Token budget explains 80% of performance variance, tool count ~10%, model choice ~5%.
  • Rule: increase budget before swapping models.

Evaluation

  • Justification-before-score: having the judge write its justification before assigning a score improves reliability by 15-25%.
  • Pairwise double-pass with position swap eliminates position bias.
  • Minor eval prompt phrasing changes cause 10-20% score swings.
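
The pairwise double-pass can be sketched as follows. The judge is a caller-supplied function, hypothetical here, that sees two responses in order and returns "first" or "second"; treating disagreement between the two passes as a tie is one reasonable policy, not necessarily the one used in the protocol.

```python
from typing import Callable

# Hypothetical judge signature: (response_in_first_slot, response_in_second_slot)
# -> "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_double_pass(judge: Judge, a: str, b: str) -> str:
    """Compare a and b twice with positions swapped.
    Returns "a" or "b" only if both passes agree, otherwise "tie"."""
    pass_one = judge(a, b)  # a shown first
    pass_two = judge(b, a)  # b shown first
    a_wins_one = pass_one == "first"
    a_wins_two = pass_two == "second"
    if a_wins_one and a_wins_two:
        return "a"
    if not a_wins_one and not a_wins_two:
        return "b"
    return "tie"  # verdict flipped with position, so it reflects position bias
```

A judge that always prefers the first slot yields "tie" on every pair, so its bias never turns into a win.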