By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
  • Home
  • Products
  • Agents
  • Capital
  • Commerce
Reading: Why Is My AI Agent So Slow? Latency Causes and Fixes
Sign In
  • Join US
Font ResizerAa
  • Home
  • Products
  • Agents
Search
  • Home
  • Products
  • Agents
  • Capital
  • Commerce
Have an existing account? Sign In
Follow US
> Blog > Observability > Why Is My AI Agent So Slow? Latency Causes and Fixes
Developer staring at a latency waterfall trace showing six stacked LLM calls in an AI agent turn
Observability

Why Is My AI Agent So Slow? Latency Causes and Fixes

Surya Koritala
Last updated: June 6, 2026 6:31 pm
By Surya Koritala
28 Min Read
Share
SHARE

Your agent isn’t one slow LLM call – it’s 6 to 12 calls that compound. Here’s a find-the-bottleneck decision tree and the concrete fixes for each cause.

Contents
  • Why is my AI agent so slow? The one-line answer
  • First, measure per-step TTFT (don’t guess the bottleneck)
  • Bottleneck 1 – Prefill: long prompts make TTFT explode
  • Bottleneck 2 – Decode and reasoning model agent latency
        • Pros
        • Cons
  • Bottleneck 3 – Tool wait: multi-tool agent slow response
  • Bottleneck 4 – Orchestration: when the agent loop takes 10 to 30 seconds
  • The find-the-bottleneck decision tree (and how to reduce agent time to first token)
  • Builder’s take
  • Frequently asked questions
    • Why is my AI agent so slow even when the model benchmarks fast?
    • How do I reduce agent time to first token?
    • Why does my agent get slower the longer the conversation runs?
    • Are reasoning models the reason my agent feels frozen?
    • Why is my agent loop taking 10 to 30 seconds?
    • How do I speed up a LangGraph agent specifically?
  • Primary sources

Why is my AI agent so slow? The one-line answer

Your AI agent is so slow because a single user turn is not one LLM call – it is typically 6 to 12 calls chained together, and latency compounds across all of them. A modest +300ms per call becomes 2 to 4 extra seconds end-to-end, and that is before you add tool round-trips and reasoning-model thinking time.

This is the thing most articles on agent latency miss. They define time to first token (TTFT) as a single metric, or they benchmark raw model speed, or they write an enterprise think-piece about customer experience. None of that helps the builder staring at a 15-second spinner. The reason why your AI agent is so slow is structural: an agent is a loop, and every iteration pays the full latency tax again.

Concretely, agent runtimes issue 6-12 LLM calls per user turn for planning, tool selection, tool-result interpretation, and final synthesis. If each call has a TTFT that creeps up by 300ms – because your context grew, or you switched to a bigger model – you do not pay 300ms once. You pay it 6 to 12 times. That is the 2-4 second swing, and on a 20-step research agent a 3-second TTFT per call is a full minute of waiting.

So the diagnostic question is never ‘is my model slow?’ It is ‘which of my four latency layers is slow, and on which step?’ Those four layers – prefill, decode, tool wait, and orchestration – are the entire game. The rest of this guide is a decision tree for finding which one is hurting you and the specific fix for each.

Developer staring at a latency waterfall trace showing six stacked LLM calls in an AI agent turn
Image.

6-12 LLM calls per turn means latency multiplies, not adds once. +300ms/call = +2-4s end-to-end. Always reason about per-turn latency as (per-call latency x number of calls), not as a single response time.

First, measure per-step TTFT (don’t guess the bottleneck)

Before changing anything, capture a waterfall trace of one slow turn and record TTFT and total duration for every individual LLM call and tool call. The bottleneck is almost never where intuition points, and optimizing the wrong layer is the most common mistake.

TTFT is the time from when the model endpoint receives your request to when it emits the first token. It covers tokenization, KV-cache prefill, queue time, and routing. In an agent, you want this number per step, not aggregated – because one step (usually a long-context synthesis or a reasoning call) is typically responsible for most of the wall-clock time.

If you use LangSmith, the waterfall view does this directly. Otherwise, wrap each call and log four numbers: input token count, TTFT, total duration, and output token count. The input token count is the single most diagnostic field, because TTFT scales primarily with prompt length during prefill. A step with 95k input tokens and a 4-second TTFT is a prefill problem; a step with 800 input tokens and a 6-second TTFT is a decode or reasoning problem. Same symptom, opposite fix.

Here is a minimal, runnable instrumentation wrapper you can drop around any call. It separates TTFT from total time so you can classify the bottleneck.

import time
from anthropic import Anthropic

client = Anthropic()

def timed_stream(messages, system, tools=None, model="claude-sonnet-4-6"):
    start = time.perf_counter()
    ttft = None
    out_tokens = 0
    text = []
    with client.messages.stream(
        model=model,
        max_tokens=1024,
        system=system,
        tools=tools or [],
        messages=messages,
    ) as stream:
        for event in stream:
            if event.type == "content_block_delta" and ttft is None:
                ttft = time.perf_counter() - start  # first visible token
            if event.type == "content_block_delta" and getattr(event.delta, "text", None):
                out_tokens += 1
                text.append(event.delta.text)
        final = stream.get_final_message()

    total = time.perf_counter() - start
    in_tokens = final.usage.input_tokens
    cache_read = getattr(final.usage, "cache_read_input_tokens", 0)
    print(f"in={in_tokens} cache_read={cache_read} ttft={ttft:.2f}s "
          f"total={total:.2f}s out_tok~{out_tokens}")
    # Classify: high in_tokens + high ttft  -> prefill
    #           low in_tokens  + high ttft  -> decode / reasoning
    return "".join(text), final
What you measureLikely layerTell-tale signalJump to fix
High input tokens + high TTFTPrefillTTFT grows as context accumulates each turnPrompt caching / trim context
Low input tokens + high TTFTDecode or reasoningSmall prompt, still slow to first tokenSmaller/faster model or router
Gap between LLM calls is largeTool waitTime spent outside the model, in awaitsParallel tool calls / cache results
Many calls per turnOrchestration6-12+ calls, critic/Reflexion loopsDrop a loop / fewer calls
Classify the bottleneck from your per-step numbers

Bottleneck 1 – Prefill: long prompts make TTFT explode

If your slow steps have large input token counts, prefill is your bottleneck. TTFT scales with prompt length because the model must process every input token before emitting the first output token – and in agents, context balloons as tool results, history, and retrieved files accumulate every turn.

This is why why your AI agent is so slow gets worse the longer a conversation runs. Turn one sends 3k tokens; turn ten is re-prefilling 90k tokens of accumulated context on every single LLM call in the loop. A ~20k-token context processed from scratch already costs several seconds of TTFT; multiply by the number of calls per turn and you have your spinner.

There are two fixes, and you want both. First, prompt caching. Put your stable, high-volume prefix – system prompt, tool definitions, long reference docs – above a cache breakpoint, and the volatile tail (the current user message, latest tool result) below it. On a cache hit, the model only prefills the new tokens, so generation starts almost immediately. Independent evaluations across providers report TTFT improvements of 13-31% and cost cuts of 41-80%; for long prompts specifically, Anthropic documents latency reductions up to about 85%, and cache reads bill at roughly 0.1x the normal input price.

Second, trim context. Most agents carry dead weight: full tool-call JSON from ten turns ago, redundant retrieved chunks, verbose system scaffolding. The response time is proportional to input length, so summarizing or evicting stale turns is a direct TTFT win. The two work together – cache the part you must keep, delete the part you don’t.

Cache hits require a byte-identical prefix. A single mutable field high in the prompt – a timestamp, a per-request ID, or reordered tool defs – invalidates the cache and you re-prefill from scratch. Keep volatile content strictly below the breakpoint. Naive ‘cache the whole context’ can paradoxically raise latency.

# Cache the stable prefix; leave volatile content uncached below the breakpoint.
system = [
    {"type": "text", "text": LONG_SYSTEM_PROMPT},
    {
        "type": "text",
        "text": TOOL_REFERENCE_DOCS,        # big, stable, reused every turn
        "cache_control": {"type": "ephemeral"},  # <-- breakpoint: everything
    },                                          #     up to here is cached
]
# messages[] (current user turn + latest tool result) stay BELOW the breakpoint
# so only the new tokens are prefilled on a cache hit.

Bottleneck 2 – Decode and reasoning model agent latency

If a slow step has a small prompt but still takes seconds to first token, the model itself is the bottleneck – either it’s oversized for the task, or it’s a reasoning model spending invisible ‘thinking time’ before any user-visible token appears.

Reasoning model agent latency is a distinct and underappreciated cause. The o-series and similar models generate internal chain-of-thought tokens before producing visible output, which can push TTFT into the 2-10+ second range on routine calls and far higher on hard ones – reports cite tens of seconds when a model gets stuck self-correcting. From the user’s seat this looks like a frozen agent, because the thinking tokens are invisible. Roughly 20% of organizations report struggling with reasoning-model response times in customer-facing apps.

The fix is selective model use. You almost never need your most powerful reasoning model for every step in the loop. Use a small, fast router or worker model (Gemini Flash, a Haiku-class model, or a hosted fast endpoint) for cheap decisions – tool selection, classification, formatting – and reserve the big reasoning model for the single step that genuinely needs deep reasoning. This is a latency strategy as much as a cost one.

If you must use a reasoning model, cap and tune the thinking budget. Setting max_tokens too low is its own trap: the reasoning tokens consume the entire budget and you get a length-finish with zero visible output. Set an explicit thinking budget sized to the task, and stream a ‘thinking…’ indicator so perceived latency drops even when actual latency doesn’t.

Pros
  • Cuts decode time and TTFT on the majority of calls in the loop
  • Frees the big reasoning model for the one step that needs it
  • Lower cost compounds with the latency win across 6-12 calls
Cons
  • Accuracy tradeoff on borderline steps – validate with evals before shipping
  • Adds a routing decision (one extra fast call) you must keep cheap
  • Two models means two prompt formats and two failure modes to maintain

Bottleneck 3 – Tool wait: multi-tool agent slow response

If your trace shows large gaps between LLM calls – time spent outside the model – your bottleneck is tool wait. Every external tool call adds network round-trip, serialization, and queuing delay, and those stack across the chain. A typical support turn can accumulate ~900ms of pure tool latency before the user hears anything.

A multi-tool agent slow response usually comes from four tool-calling anti-patterns. Excessive calls: unclear instructions or over-granular tools make the model issue five or six requests where one would do. Sequential execution: independent calls awaited one after another, doubling latency for no reason. Redundant calls: the same tool invoked twice with identical parameters because the model forgot it already has the answer. And missing caching: hitting an external API repeatedly for static data.

The fixes map one-to-one. Parallelize independent tool calls – if fetching user data and inventory don’t depend on each other, run them concurrently and you cut that segment of the turn roughly in half. Cache tool results for static or slow-changing data so repeat lookups within a turn are free. Batch related operations into one tool that returns everything the model needs. And tighten tool definitions so the model stops over-calling.

Here is the parallel pattern – the single biggest tool-wait win for most agents.

import asyncio

# SLOW: sequential awaits - latency = sum of both round-trips
# user = await get_user(uid)
# inv  = await get_inventory(sku)

# FAST: independent tool calls run concurrently - latency = max(...)
async def gather_context(uid, sku):
    user, inv = await asyncio.gather(
        get_user(uid),
        get_inventory(sku),
    )
    return user, inv

# Most agent SDKs expose a parallel/batch mode for tool calls the model
# requested in the same step - enable it instead of looping awaits.
Rule of thumb: any two tool calls the model requests in the same step with no data dependency should run in parallel. If your framework awaits them in a for-loop, that loop is your latency.

Bottleneck 4 – Orchestration: when the agent loop takes 10 to 30 seconds

If no single step is slow but the turn still takes 10-30 seconds, your bottleneck is orchestration – you simply have too many LLM calls. An Orchestrator-Worker flow with a Reflexion (self-critique) loop routinely takes 10-30 seconds, which is often unacceptable for user-facing support.

This is the agent loop 10 to 30 seconds too slow problem, and it is the hardest for people to see because every individual call looks fine. The cost is in the count. A planner call, a worker call per subtask, a critic call to evaluate, a revision call, then synthesis – each is a full TTFT, and a Reflexion loop adds a round trip every iteration. With 6-12 calls per turn, even a snappy 1.5s per call is 9-18 seconds.

The fix is to cut calls, not speed them up. Drop the Reflexion or critic loop unless an eval proves it materially improves output – it is a 2-4 second tax per turn and frequently changes nothing for simple queries. Collapse multi-agent hand-offs into a single structured call where you can. And move from an open-ended ReAct loop to an explicit graph (LangGraph or similar) where you specify exactly how steps communicate, which routinely produces significantly fewer LLM calls than letting an agent free-run.

To speed up LangGraph agent latency specifically: use conditional edges so cheap queries skip the heavy nodes entirely; run independent nodes (guardrails, parallel retrieval, multi-doc extraction) concurrently; and gate expensive nodes like critics behind a router that only invokes them when the task warrants it. The graph is your lever – it lets you make ‘fewer calls’ a structural property instead of a hope.

Where a slow agent turn spends its seconds
Each layer is fixed differently: caching for prefill, routing for decode, parallelism for tool wait, fewer calls for orchestration. Values are illustrative composites, not a single measured run.

The find-the-bottleneck decision tree (and how to reduce agent time to first token)

Run this decision tree top to bottom on your slowest measured step: large input tokens means prefill (cache and trim); small input but slow first token means decode/reasoning (route to a smaller model); large gaps between calls means tool wait (parallelize and cache); too many calls means orchestration (drop a loop). That order is also the order of highest expected payoff for most agents.

To reduce agent time to first token in practice, work from the cheapest, safest fix to the most invasive. Prompt caching is first because it requires no behavior change and pays back across every call in the loop. Context trimming is second. Parallelizing independent tool calls is third. Routing non-reasoning steps to a smaller model is fourth. Dropping a Reflexion or critic loop is last, because it can affect output quality and needs an eval to justify.

Two perception fixes apply at every layer and cost almost nothing: stream tokens so the user sees motion immediately, and surface intermediate progress – thinking indicators, retrieval results, planning steps. For voice and support, the bar is roughly one to two seconds end-to-end, and sub-second for routing; for a background research agent, users tolerate far more if they can see progress. Match the fix to the channel.

Re-measure after each change. The waterfall that found the bottleneck is also how you confirm the fix moved the right number – and how you catch the next bottleneck, which often only becomes visible once you remove the first.

“Latency in an agent is not one slow call you can fix in isolation – it is a tax you pay 6 to 12 times per turn. Find which layer, on which step, then stop guessing.”

Surya Koritala, founder of Cyntr and Loomfeed
Symptom in your traceBottleneckPrimary fixExpected effect
Input tokens grow each turn, TTFT climbsPrefillPrompt caching + trim contextUp to ~85% TTFT cut on long prompts (cache hit)
Small prompt, slow first tokenDecode/reasoningRouter to smaller model; cap thinking budgetRemoves seconds of invisible thinking time per call
Idle gaps between LLM callsTool waitParallel tool calls + result cachingCuts independent tool segments toward max(), not sum()
6-12+ calls, critic/Reflexion presentOrchestrationDrop loop; explicit graph; conditional edgesRemoves a 2-4s/turn tax; fewer calls overall
Bottleneck-to-fix cheat sheet

Builder’s take

I’ve shipped enough agent latency regressions on Cyntr to have a reflex now: before I touch a single parameter, I look at the per-step waterfall. The mistake almost everyone makes – including me, early on – is optimizing the wrong layer. You swap to a faster model to fix what was actually a prefill problem, and the agent stays slow.

  • Measure per-step TTFT before you change anything. The bottleneck is almost never where your gut says it is – on Cyntr it was usually a 90k-token context getting re-prefilled every turn, not the model.
  • Prompt caching is the single highest-leverage fix for long-context agents. Stable system prompt and tool defs above the breakpoint, volatile stuff below – we cut perceived latency hard without changing the model.
  • Most ‘multi-agent’ slowness is just too many LLM calls. A Reflexion or critic loop that adds one round trip per turn is a 2-4 second tax. Earn it or cut it.
  • Parallelize independent tool calls. If two lookups don’t depend on each other and you’re awaiting them in sequence, you’re paying double for nothing.
  • Use a small router model for the cheap decisions and save the big reasoning model for the one step that actually needs it. Routing is a latency strategy, not just a cost one.

Frequently asked questions

Why is my AI agent so slow even when the model benchmarks fast?

Because a turn is 6-12 LLM calls, not one. Raw model benchmarks measure a single call; your agent pays that latency repeatedly, plus tool round-trips and any reasoning thinking time. A +300ms per-call regression becomes 2-4 seconds end-to-end. Measure per-step TTFT in a waterfall trace rather than trusting an aggregate or a benchmark.

How do I reduce agent time to first token?

In order of payoff: enable prompt caching on the stable prefix (system prompt, tool defs), trim accumulated context, parallelize independent tool calls, route non-reasoning steps to a smaller/faster model, and only then consider dropping a critic loop. Caching alone can cut TTFT up to ~85% on long prompts on a cache hit, with no behavior change.

Why does my agent get slower the longer the conversation runs?

Prefill. TTFT scales with prompt length, and an agent’s context accumulates tool results, history, and retrieved files every turn. By turn ten you may be re-prefilling 90k+ tokens on every call in the loop. Cache the stable portion above a breakpoint and evict or summarize stale turns to stop the growth.

Are reasoning models the reason my agent feels frozen?

Often, yes. Reasoning models generate invisible chain-of-thought tokens before any visible output, pushing TTFT into the 2-10+ second range and sometimes far higher. Use a smaller model for cheap steps, reserve the reasoning model for the one step that needs it, cap the thinking budget, and stream a thinking indicator so the user sees progress.

Why is my agent loop taking 10 to 30 seconds?

Too many LLM calls. An Orchestrator-Worker flow with a Reflexion self-critique loop routinely lands in the 10-30 second range because each call is a full TTFT and the loop adds round trips. Drop the critic loop unless an eval proves it helps, collapse hand-offs, and use an explicit graph to enforce fewer calls.

How do I speed up a LangGraph agent specifically?

Use the LangSmith waterfall to find the slow node, then: add conditional edges so cheap queries skip heavy nodes, run independent nodes (guardrails, parallel retrieval) concurrently, gate expensive nodes like critics behind a router, and reduce total LLM calls by specifying exactly how nodes communicate. Pair this with prompt caching on stable prefixes.

Primary sources

  • AI Agent Latency 101: How do I speed up my agent? — LangChain
  • What Is Agentic AI Latency? How It Impacts Enterprise CX — Parloa
  • What Is Time to First Token (TTFT)? — Future AGI
  • Time to First Token (TTFT) — IBM
  • Prompt caching documentation — Anthropic
  • Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks — arXiv
  • Why Bad Tool Calling Makes LLMs Slow and Expensive — CodeAnt AI
  • Thinking Tokens Trap: How Reasoning Models Burn max_tokens — TokenMix

Last updated: June 6, 2026. Related: Observability.

AI Agent FinOps 2026: Track, Cap, and Cut Token Spend
Stop AI Agent Calling Tools With Wrong Arguments: A Fix Guide
Best AI SOC Agents 2026: Cost-Per-Investigation, Ranked
GDPval Benchmark 2026: Scores, Cost and Win Rates Decoded
Durable Execution for AI Agents: Temporal vs Restate vs DBOS
TAGGED:AI agent latencyLangGraphobservabilityprompt cachingreasoning modelstool callsTTFT
Share This Article
Facebook Email Copy Link Print
Leave a Comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

More Popular from Alatirok

Dashboard visualizing token consumption per agentic coding task across frontier AI models
Observability

Tokens Per Agentic Coding Task: The 2026 Variance Data

By Surya Koritala
21 Min Read
What Is Cognition Devin? The Enterprise Guide for

What Is Cognition Devin? The Enterprise Guide for 2026

By Surya Koritala
Diagram of an AI agent holding a USDC wallet with spending-limit guardrails enforced before an onchain transfer
Commerce

What Is Circle Agent Stack? USDC Wallets for AI Agents

By Surya Koritala
24 Min Read
Identity & Provenance

AI Agent Identity: Entra Agent ID vs Okta vs SailPoint

AI agent identity governance, Entra vs Okta vs SailPoint: a 2026 buyer matrix on what each…

By Surya Koritala
Observability

Why Does My AI Agent Context Window Fill Up So Fast?

Why does my AI agent context window fill up so fast? Tool definitions eat two-thirds of…

By Surya Koritala
Agent Infrastructure

Migrate OpenAI Agent Builder to Agents SDK Before Nov 30

A hands-on tutorial to migrate OpenAI Agent Builder to Agents SDK before the Nov 30, 2026…

By Surya Koritala
Agent Infrastructure

Best Voice AI Agent Framework 2026: Vapi vs LiveKit vs Pipecat

The best voice AI agent framework 2026 depends on your call volume. Our neutral ranking covers…

By Surya Koritala

Purpose-Built Legal AI vs General LLM: 2026 Verdict

Purpose-built legal AI vs general LLM, settled with real 2026 benchmark data: where ChatGPT and Claude…

By Surya Koritala

what’s actually being built in AI agents, who’s building it, and why it matters. Independent. Opinionated.

Categories

  • Home
  • Products
  • Agents
  • Capital
  • Commerce

Quick Links

  • Home
  • Products
  • Agents

© Alatirok by Loomfeed. All Rights Reserved.

Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?