Your tokens are gone before turn one. Here is exactly where they go – and the provider-native 2026 fixes that claw back 67-98% of your context window.
Why does my AI agent context window fill up so fast?
Your AI agent context window fills up so fast because most of it is consumed before your conversation even begins, and what is left compounds with every tool call. The two culprits are tool-definition bloat at startup (MCP servers loading their entire tool catalogs into context) and intermediate tool-result accumulation during execution (every file read, API response, and transcript stacking up). Neither is fixed by a bigger window or a vague ‘just summarize’ instruction.
Here is the part the generic ‘context rot’ explainers never quantify. With a typical enterprise stack of MCP servers connected, tool definitions alone can eat roughly two-thirds of a 200K window before the first user message. One measured enterprise setup hit 134,000 tokens – 67% of a 200K window – consumed by definitions before the agent could start working. A single GitHub MCP server can run ~55,000 tokens at initialization; Jira adds ~17,000 more.
Then execution begins. Tool results enter the window and never leave on their own. A single two-hour meeting transcript can add over 50,000 tokens. Across a long agentic loop, re-fetchable tool output routinely becomes 90%+ of your live context. So the honest answer to ‘why does my AI agent context window fill up so fast’ is: you are paying a tax at the door and a second tax on every step. This guide maps each cause to a specific, provider-native 2026 fix with published numbers.

Where do the tokens actually go? (The cause -> fix table)
Tokens go to three places: tool definitions (startup), intermediate tool results (per-step), and accumulated dialogue plus reasoning (over time) – and each has a different, measured fix. The mistake is applying one technique to all three. Summarization does nothing for definition bloat; clearing tool results does nothing for a runaway reasoning trace.
The table below is the core of this article – the cause-to-fix map no incumbent explainer ships. Every number in the ‘measured impact’ column comes from a published 2026 source, cited at the end.
Read it as a decision tree. If your problem is at turn zero, you have a definition problem (fix it with tool search or code execution). If your window climbs steadily through a loop, you have a tool-result problem (fix it with clearing or programmatic tool calling). If long dialogues degrade, you have an accumulation problem (fix it with compaction plus a memory tool).
Print your token count after MCP servers connect but before the first user message. If tool definitions already exceed ~30% of your window, no amount of summarization will save you – that is a definition-loading problem, and the fix is code execution or tool search, not compaction.
| Cause (where tokens go) | Why it happens | The 2026 fix | Measured impact |
|---|---|---|---|
| Tool-definition bloat (startup) | MCP servers load full tool catalogs into context before turn one; ~550-1,400 tokens per tool | Code execution with MCP / Tool Search Tool – load only the tools needed, on demand | 150,000 -> 2,000 tokens (98.7% reduction); tool-selection accuracy 49% -> 74% |
| Intermediate tool results (per-step) | Every file read, API call, and transcript result stays in context and compounds across steps | clear_tool_uses_20250919 (server-side tool-result clearing) – replaces old results with placeholders | ~128,740 -> ~43,060 tokens (67% reduction) in Anthropic’s demo |
| Fan-out tool results (loops) | Agent loops over many records/endpoints; each result lands in the window | Programmatic tool calling – model orchestrates tools in executed code; only the final result returns | 110,473 -> 15,919 tokens (85.6% reduction) on an expense-management workflow |
| Dialogue + reasoning accumulation | Long multi-turn agents accumulate chat history and thinking blocks | Compaction API (compact_20260112) – server-side summary when nearing threshold | Peak context ~335K -> ~169K tokens (50% reduction) in Anthropic’s research-agent eval |
| Cross-session amnesia + overflow | Everything needed must live in the active window | Memory tool (memory_20250818) + context editing together | +39% performance combined; 84% token reduction over a 100-turn web-search eval |
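The decision tree above can be sketched in a few lines of Python. This is only an illustration: `ContextSample`, `diagnose`, and the rule that compares tool-result tokens against dialogue tokens are assumptions made for this example, not part of any provider API. The 30% startup limit is the rule of thumb from the tip above the table.

```python
from dataclasses import dataclass


@dataclass
class ContextSample:
    """Token measurements from one agent run (hypothetical shape)."""
    window: int              # model context window, e.g. 200_000
    startup_tokens: int      # count after MCP servers connect, before the first user message
    tool_result_tokens: int  # tokens currently held by tool results
    dialogue_tokens: int     # tokens held by chat history and reasoning


def diagnose(sample: ContextSample, startup_limit: float = 0.30) -> str:
    """Name the dominant bottleneck and the matching family of fixes."""
    if sample.startup_tokens / sample.window > startup_limit:
        return "definition bloat: use tool search or code execution"
    if sample.tool_result_tokens >= sample.dialogue_tokens:
        return "tool-result accumulation: use clearing or programmatic tool calling"
    return "dialogue accumulation: use compaction plus a memory tool"


# The measured enterprise setup: 134,000 of 200,000 tokens gone at turn zero.
print(diagnose(ContextSample(200_000, 134_000, 0, 0)))
```

In practice the three token counts would come from your own instrumentation; the point is that the diagnosis happens before you pick a technique.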
How to reduce AI agent context window usage at startup
To reduce AI agent context window usage at startup, stop loading every tool definition up front – load tools on demand via code execution or a tool-search step. This is the single highest-leverage fix because the cost is paid at turn zero, every single run, whether or not the tool is ever used.
Anthropic’s code-execution-with-MCP pattern is the cleanest version: instead of injecting all tool schemas into context, the model writes code that imports only the tool definitions it actually needs from a filesystem. In their documented example this cut token usage from 150,000 to 2,000 – a 98.7% reduction – because the agent never sees the schemas it does not call.
A lighter-weight alternative is a Tool Search Tool: rather than parsing every definition, the agent searches a tool index and pulls in only matching schemas. The search tool itself costs ~500 tokens, and Anthropic reports tool-selection accuracy improving from 49% to 74% versus dumping all definitions into context. Either way, the principle is identical: definitions are a just-in-time resource, not a startup constant.
If you cannot adopt either yet, the cheap interim move is brutal pruning – disconnect MCP servers you are not using this session. The 93-tool GitHub server and the 58-tool enterprise bundle are where the ~55K-token charges hide.
# Tool Search Tool: index tools, load only matches into context.
# Instead of injecting ~58 tool schemas (~55K tokens) up front,
# the model searches and pulls in only what it needs (~500 tokens).
# Check the beta flag and tool type string against the current API docs before use.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    betas=["advanced-tool-use-2026-01-12"],
    tools=[
        {"type": "tool_search_tool_20251119", "name": "tool_search"},
        # Your full catalog is registered but deferred -
        # schemas are only materialized into context on a search hit.
        # Assumption: each catalog entry is marked for deferred loading
        # (for example with a defer_loading flag); otherwise it loads up front.
        *your_mcp_tool_catalog,  # placeholder list, e.g. 58 tools across GitHub, Jira, Drive
    ],
    messages=[{"role": "user", "content": "Open a PR that fixes the failing test."}],
)
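To make the "just-in-time definitions" idea concrete, here is a minimal local sketch of what a tool index does. It is not how Anthropic's Tool Search Tool is implemented: the `CATALOG` contents, the `search_tools` helper, and the word-overlap scoring are all assumptions chosen to keep the example small.

```python
from typing import Dict, List

# Hypothetical catalog: tool name -> schema (only the description is shown).
CATALOG: Dict[str, dict] = {
    "github_create_pull_request": {"description": "Open a pull request on a GitHub repository"},
    "github_list_issues": {"description": "List issues in a GitHub repository"},
    "jira_create_ticket": {"description": "Create a ticket in a Jira project"},
    "drive_search_files": {"description": "Search files stored in Google Drive"},
}


def search_tools(query: str, catalog: Dict[str, dict], limit: int = 3) -> List[str]:
    """Rank tools by how many query words appear in their name or description."""
    words = {w.lower() for w in query.split()}
    scored = []
    for name, schema in catalog.items():
        haystack = (name.replace("_", " ") + " " + schema["description"]).lower().split()
        score = len(words & set(haystack))
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Only the matching schema would be loaded into context, not all four.
print(search_tools("pull request", CATALOG))
```

A production index would likely use regex or BM25 matching rather than raw word overlap, but the budget effect is the same: unmatched schemas never enter the window.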
How do I stop intermediate tool results from compounding?
Stop intermediate tool results from compounding with server-side tool-result clearing – it replaces old, already-processed results with placeholders while keeping the tool-call record intact. This is the fix for the second tax: the per-step accumulation that turns a healthy agent into one that overflows at step 40.
On the Claude Developer Platform this is the clear_tool_uses_20250919 strategy. Once your context crosses a configurable threshold, the API clears the oldest tool results in chronological order, leaving a short placeholder so the model knows the result existed but no longer carries its full payload. In Anthropic’s own demo, clearing two of three results dropped context from ~128,740 to ~43,060 tokens – a 67% reduction – with no inference cost because it runs server-side.
The reason this works so well: in heavy agentic workflows, tool results were measured at ~96% of baseline context. They are also overwhelmingly re-fetchable – a file you read at step 3 can be re-read at step 30 if needed. Clearing them is nearly free in terms of capability and enormous in terms of headroom. Set a keep value (default 3) so your most recent results stay live.
This is sometimes called tool result compaction for AI agents, but it is distinct from full conversation compaction. Clearing targets re-fetchable result blocks; it does not touch your dialogue or reasoning.
# Server-side tool-result clearing: replace stale results with
# placeholders once context crosses the trigger, keeping recent ones live.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    betas=["context-management-2025-06-27"],
    tools=your_tools,        # placeholder: your tool definitions
    messages=conversation,   # placeholder: the running message history
    context_management={
        "edits": [{
            "type": "clear_tool_uses_20250919",
            # Start clearing once the prompt exceeds 100K input tokens.
            "trigger": {"type": "input_tokens", "value": 100_000},
            # Keep the 3 most recent tool uses untouched.
            "keep": {"type": "tool_uses", "value": 3},
            # Only clear if at least 10K tokens are freed.
            "clear_at_least": {"type": "input_tokens", "value": 10_000},
        }]
    },
)
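If it helps to see the mechanism without an API call, the following is a client-side simulation of what clearing does to a message history. It is a sketch under stated assumptions: the flat `role`/`content` message shape, the `PLACEHOLDER` text, and `clear_old_tool_results` are invented for illustration and do not mirror the real request format or the server's placeholder wording.

```python
from typing import List

PLACEHOLDER = "[tool result cleared]"  # hypothetical text; the API chooses its own


def clear_old_tool_results(messages: List[dict], keep: int = 3) -> List[dict]:
    """Return a copy where all but the `keep` newest tool results are placeholders."""
    positions = [i for i, m in enumerate(messages) if m["role"] == "tool_result"]
    stale = set(positions[:-keep]) if keep > 0 else set(positions)
    cleared = []
    for i, message in enumerate(messages):
        if i in stale:
            # Keep the record that a result existed, drop its payload.
            cleared.append({**message, "content": PLACEHOLDER})
        else:
            cleared.append(message)
    return cleared


history = [
    {"role": "user", "content": "Audit the repo."},
    {"role": "tool_result", "content": "A" * 5000},
    {"role": "tool_result", "content": "B" * 5000},
    {"role": "tool_result", "content": "C" * 5000},
    {"role": "tool_result", "content": "D" * 5000},
]
trimmed = clear_old_tool_results(history, keep=3)
print(trimmed[1]["content"])
```

The oldest result is reduced to a short marker while the three newest stay intact, which is the same trade the server-side strategy makes at a much larger scale.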
What is programmatic tool calling and why does it cut context most?
Programmatic tool calling lets the model orchestrate tools by writing and executing code, so intermediate outputs are consumed in-code and only the final result returns to the context window. For any fan-out step – loop over 50 endpoints, filter a 10,000-row spreadsheet, reconcile 20 records – this is the largest single context win available in 2026.
The mechanism is the difference. In normal tool calling, every tool result round-trips through the model and lands in context. In programmatic tool calling (PTC), the model writes code in an execution container that calls your tools, filters and aggregates the data there, and returns only the answer. The intermediate results never touch the window. On a documented expense-management workflow, PTC cut tokens from 110,473 to 15,919 – an 85.6% reduction – at similar execution time.
This directly attacks the worst case: a 50-step workflow where each call adds ~20K tokens of intermediate output, which would otherwise compound toward roughly a million tokens (50 × 20K). With PTC, those 50 results are processed inside the code sandbox; the model sees one clean summary, not fifty raw dumps.
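The 50-step figure is simple arithmetic, and it is worth running once to see how early a standard window gives out. The sketch below assumes strictly linear growth, a zero baseline, and no clearing; `context_after` and `first_overflow_step` are hypothetical helper names.

```python
def context_after(step: int, tokens_per_call: int = 20_000, baseline: int = 0) -> int:
    """Tokens in the window after `step` tool calls if no result is ever cleared."""
    return baseline + step * tokens_per_call


def first_overflow_step(window: int, tokens_per_call: int = 20_000, baseline: int = 0) -> int:
    """First step at which accumulated context exceeds the window."""
    step = 0
    while context_after(step, tokens_per_call, baseline) <= window:
        step += 1
    return step


print(context_after(50))                 # 1000000
print(first_overflow_step(200_000))      # 11
print(first_overflow_step(1_000_000))    # 51
```

Under these assumptions a 200K window overflows at step 11, and even a 1M window only survives to step 50, which is why a larger window postpones the failure rather than preventing it.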
The trade-off: PTC needs a code-execution environment and is best for parallel or large-result operations. For small, sequential, single-result tool calls, normal tool use plus clearing is simpler. Use PTC where the data is big and the loop is wide.
“In normal tool use you pay for every intermediate result. In programmatic tool calling you pay only for the answer – that is why a fan-out step drops 85% of its tokens.”
On Anthropic’s published expense-workflow benchmark
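The shape of a fan-out step under programmatic tool calling looks roughly like the code below. It is a local stand-in rather than the real execution container: `fetch_expenses` is a hypothetical tool that fabricates records, and JSON character counts are used only as a rough proxy for tokens.

```python
import json
from typing import Dict, List


def fetch_expenses(employee_id: int) -> List[Dict]:
    """Hypothetical tool: returns a bulky list of expense records for one employee."""
    return [
        {"employee": employee_id, "item": f"receipt-{n}",
         "amount": 10.0 * employee_id + n, "notes": "x" * 200}
        for n in range(25)
    ]


def over_budget(employee_ids: List[int], limit: float) -> Dict[int, float]:
    """Fan out over employees inside code and return only the totals above `limit`."""
    totals = {}
    for employee_id in employee_ids:
        records = fetch_expenses(employee_id)  # raw result never leaves this scope
        total = sum(r["amount"] for r in records)
        if total > limit:
            totals[employee_id] = total
    return totals


ids = list(range(1, 21))
raw_size = sum(len(json.dumps(fetch_expenses(i))) for i in ids)  # what normal tool use would carry
summary = over_budget(ids, limit=5_000.0)
summary_size = len(json.dumps(summary))                           # what PTC returns to the model
print(summary)
print(raw_size, summary_size)
```

Twenty raw result sets are reduced to a two-entry dictionary before anything reaches the model, which is the whole trick: the filtering and aggregation happen where the data is, not in the window.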
How does the compaction API fix context overflow in 2026?
The compaction API fixes context overflow by automatically distilling older conversation into a high-fidelity summary server-side when you approach a configurable threshold – no client code required. Where clearing targets re-fetchable tool results, compaction handles the part that is not re-fetchable: accumulated dialogue and reasoning.
Compaction is the compact_20260112 strategy, launched in beta in January 2026 for Claude Opus 4.6 and Sonnet 4.6. You set a token threshold (minimum 50K, default 150K); when the conversation nears it, the API summarizes the older turns and continues. In Anthropic’s research-agent eval, a run that peaked at ~335,279 tokens with no management – enough to fail on a 200K window at turn three – completed at a ~169,164-token peak with compaction, a roughly 50% reduction, with all three high-level-fact probes preserved.
Two operational notes for 2026. First, Anthropic recommends server-side compaction over client-side SDK compaction: better token accounting, less integration complexity, and it is eligible for Zero Data Retention. Second, availability is broad – the same context-management capabilities are live across the Claude Developer Platform, AWS Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, so you are not locked to one cloud.
One caveat compaction explainers skip: summaries preserve high-level facts and lose obscure specifics (in Anthropic’s probes, 3/3 high-level facts retained, 0/3 obscure specifics). If your agent needs exact values to survive, pair compaction with a memory tool so the specifics get written to durable storage before they are summarized away.
A compaction summary keeps the gist and drops the details. If your workflow depends on exact figures, IDs, or quotes surviving a long-running session, write them to a memory tool file (memory_20250818) before the threshold triggers. Do not rely on the summary to remember a number.
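One way to act on that advice is to pin exact values to durable storage shortly before the compaction threshold is reached. The sketch below uses a plain JSON file as a stand-in for memory storage; it does not call the memory tool, and `pin_facts`, the `pinned_facts.json` filename, and the 80% safety margin are assumptions for illustration. The 150K default mirrors the compaction default mentioned above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def pin_facts(memory_dir: Path, facts: Dict[str, str], used_tokens: int,
              compaction_threshold: int = 150_000, safety_margin: float = 0.8) -> bool:
    """Write exact values to disk once usage nears the compaction threshold.

    Returns True when the facts were written, False when there is still headroom.
    """
    if used_tokens < compaction_threshold * safety_margin:
        return False
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / "pinned_facts.json"
    existing = json.loads(path.read_text()) if path.exists() else {}
    existing.update(facts)  # newer values overwrite older ones with the same key
    path.write_text(json.dumps(existing, indent=2))
    return True


memory_dir = Path(tempfile.mkdtemp()) / "memories"
facts = {"invoice_id": "INV-0042", "quoted_price": "18,450.00 USD"}
print(pin_facts(memory_dir, facts, used_tokens=90_000))   # False: still below 120K
print(pin_facts(memory_dir, facts, used_tokens=125_000))  # True: written before compaction
```

The margin exists so the write happens while the exact values are still in context; once the summary replaces them, there is nothing precise left to save.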
Which fix should I use – and can I stack them?
Curate context, don’t enlarge it
Use them together: tool search or code execution for startup definition bloat, tool-result clearing or programmatic tool calling for per-step results, and compaction plus a memory tool for long-running accumulation. They are not competing options – they target different bottlenecks, and Anthropic’s eval shows the combined effect beating any single technique.
The headline combined number: context editing alone delivered a +29% improvement on agentic tasks, and pairing it with a memory tool reached +39% over baseline, with an 84% reduction in token consumption over a 100-turn web-search evaluation that would otherwise have failed from context exhaustion. That is the difference between an agent that dies at step 40 and one that runs for hundreds of turns.
Sequence your adoption by where your pain is. Most teams should fix startup bloat first (it is paid every run), then add clearing (cheapest per-step win), then PTC on fan-out steps, then compaction and memory for genuinely long sessions.
Builder’s take
I run two agent platforms – Cyntr (orchestration) and Loomfeed – and context exhaustion is the single most common reason a long-running agent silently degrades. The mistake almost everyone makes is treating the window as a storage problem to solve with a bigger window. It is a curation problem.
- Audit before turn one. Print your token count after connecting MCP servers but before the first user message. If tool definitions are already past 30-40% of the window, you have a startup-bloat problem that no summarization will fix – you need fewer tools loaded, not smaller history.
- Tool results are the silent killer, not chat history. In agentic loops, re-fetchable tool output (measured at ~96% of context) is what pushes you over the limit. Clear it server-side with clear_tool_uses_20250919 before you reach for compaction.
- Reach for programmatic tool calling on any fan-out step. The moment your agent loops over 20+ records or files, route it through executed code so intermediate results never touch the window. The published lift (85%+ token reduction) is the biggest single win available.
- Layer the fixes – they target different bottlenecks. Tool clearing kills result bloat, compaction handles dialogue and reasoning accumulation, a memory tool persists across resets. Anthropic’s own eval shows context editing alone at +29% and +39% when combined with memory.
- Instrument it like any other resource. Put a context-budget gauge on your dashboard next to latency and cost. An agent that is at 80% window utilization is an outage waiting to happen.
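A context-budget gauge can start as a two-function helper feeding whatever dashboard you already have. This is a minimal sketch: `window_utilization` and `budget_status` are hypothetical names, the 80% alert level comes from the rule of thumb above, and the 60% warning level is an assumption added for illustration.

```python
def window_utilization(used_tokens: int, window: int) -> float:
    """Fraction of the context window currently in use."""
    if window <= 0:
        raise ValueError("window must be positive")
    return used_tokens / window


def budget_status(used_tokens: int, window: int,
                  warn_at: float = 0.6, alert_at: float = 0.8) -> str:
    """Map utilization to a dashboard state: ok, warn, or alert."""
    ratio = window_utilization(used_tokens, window)
    if ratio >= alert_at:
        return "alert"
    if ratio >= warn_at:
        return "warn"
    return "ok"


# Figures from the tool-result clearing demo, against a 200K window.
print(budget_status(43_060, 200_000))   # ok
print(budget_status(128_740, 200_000))  # warn
print(budget_status(170_000, 200_000))  # alert
```

Emit the status on every turn rather than at the end of a run, so the trend is visible before the agent starts degrading.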
Frequently asked questions
Why does my AI agent context window fill up so fast?
Because connected MCP servers load their entire tool catalogs into context at startup. Each tool costs roughly 550-1,400 tokens, and a typical enterprise stack can consume two-thirds of a 200K window – one measured setup hit 134,000 tokens (67%) before the first message. Fix it by loading tools on demand via code execution or a tool-search step rather than injecting all definitions up front.
How do I stop intermediate tool results from compounding?
Use server-side tool-result clearing (clear_tool_uses_20250919), which replaces old, already-processed results with placeholders once you cross a token threshold while keeping the most recent few live. In Anthropic’s demo this cut context from ~128,740 to ~43,060 tokens (67%) at no inference cost. For wide fan-out loops, use programmatic tool calling so intermediate results stay in code and never enter the window.
What is context rot?
Context rot is the degradation in agent focus and accuracy as irrelevant or stale content accumulates in the window – more context with diminishing returns. The 2026 fix is active curation, not a bigger window: clear re-fetchable tool results, compact long dialogue into summaries, and offload durable facts to a memory tool. Anthropic’s eval shows context editing improving agentic performance by 29% alone and 39% combined with a memory tool.
What is tool result compaction for AI agents?
Tool result compaction (more precisely, tool-result clearing) is a server-side technique that removes old tool_result blocks once context grows past a configured trigger, swapping each for a placeholder so the model knows it existed. It is distinct from full conversation compaction, which summarizes dialogue and reasoning. Clearing targets the re-fetchable tool output that is often 96% of an agentic context’s bulk.
Will a bigger context window fix the problem?
No – a bigger window only delays the same failure. The causes are structural: definitions paid at turn zero and results compounding per step. A 50-step workflow at ~20K tokens per call trends toward roughly a million tokens regardless of window size. Curation techniques – on-demand tools, result clearing, programmatic tool calling, and compaction – prevent the overflow; a larger window just postpones it.
Are these context-management features available outside the Claude Developer Platform?
Yes. The context-management capabilities – compaction, tool-result clearing, and the memory tool – are available across the Claude Developer Platform, AWS Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Compaction launched in beta in January 2026 for Claude Opus 4.6 and Sonnet 4.6, runs server-side, and is eligible for Zero Data Retention.
Primary sources
- The Two Context Bloat Problems Every AI Agent Builder Must Understand — Agenteer
- Managing context on the Claude Developer Platform — Anthropic
- Context editing – Claude API Docs — Anthropic
- Context engineering: memory, compaction, and tool clearing (Cookbook) — Anthropic
- Code execution with MCP: building more efficient AI agents — Anthropic
- Programmatic tool calling – Claude API Docs — Anthropic
- Compaction – Claude API Docs — Anthropic
- Claude Opus 4.6 Introduces Adaptive Reasoning and Context Compaction — InfoQ
- MCP’s Context Bloat Crisis — AgentMarketCap
Last updated: June 6, 2026.