The measured degradation curve, the 128-tool hard limit, and a concrete rubric for when to split agents or switch to retrieval-based tool selection.
How many tools can an AI agent have before it breaks?
An AI agent can technically hold up to 128 tools in OpenAI’s function-calling API, but tool-selection accuracy starts collapsing far earlier, often in the 10-to-20 tool range, so the practical answer is closer to a dozen well-designed tools per context, not the hard limit. The ceiling is an API constraint; the real limit is the model’s ability to pick the right tool from a crowded list on every single turn.
If you landed here because your agent keeps calling the wrong function, you are not imagining it and you are not alone. The failure is measurable, repeatable, and well-documented across both practitioner experiments and academic benchmarks. The good news: it is also fixable without swapping models.
This article assembles the full degradation curve in one place — the hard limit, the measured accuracy drops, the counter-evidence that retrieval can scale to thousands of tools, and a concrete rubric for what to do at each threshold. Most of the top results on this question cite a single anecdote or a vendor’s vague ‘keep it small.’ The numbers below are specific, current to 2026, and sourced.

What is the max tools OpenAI function calling limit?
OpenAI imposes a hard limit of 128 tools per agent in its function-calling API — but hitting that number is a configuration question, not a quality target, because accuracy degrades long before you reach it. Treat 128 as the wall you should never get near, not a budget to spend.
There are three independent costs to stuffing the tool list, and all of them grow with count. First, every tool definition is serialized into the prompt on every interaction, so a fat toolset inflates token cost and latency on each turn. Second, more candidate tools means more ways to be wrong: the model has to disambiguate among an ever-larger set of similar-looking options. Third, long tool schemas crowd out the context window you actually need for the task.
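The first of those costs is easy to approximate by serializing a tool list and estimating its size. The sketch below assumes roughly four characters per token, which is a crude heuristic rather than a real tokenizer, and `make_tool` builds a hypothetical definition purely for illustration.

```python
import json


def estimate_tool_overhead(tools, chars_per_token=4.0):
    """Rough per-turn prompt overhead of a tool list, in approximate tokens.

    chars_per_token is an assumed heuristic, not a real tokenizer.
    """
    serialized = json.dumps(tools, separators=(",", ":"))
    return len(serialized) / chars_per_token


def make_tool(i):
    # Hypothetical tool definition, used only to show growth with count.
    return {
        "name": f"tool_{i}",
        "description": "Example tool that does one narrowly scoped thing.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }


for count in (5, 15, 50, 128):
    tools = [make_tool(i) for i in range(count)]
    approx = round(estimate_tool_overhead(tools))
    print(f"{count} tools -> ~{approx} tokens added to every turn")
```

Real schemas are usually much longer than this toy one, so treat the output as a lower bound on the trend rather than a measurement.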
So the question ‘how many tools can an AI agent have’ has two answers. The technical answer is 128 on OpenAI’s API. The engineering answer — the one that keeps your agent reliable — is ‘as few as you can get away with.’
128 is the OpenAI hard cap. The reliability cliff is roughly an order of magnitude lower. If your agent has more than ~15 active tools in one context, assume it is already mis-selecting until you have measured otherwise.
Tool selection accuracy in AI agents: the measured degradation curve
- 43% -> 2%: GPT-4o calendar accuracy, from 4 tools (1 domain) to 51 tools (7 domains)
- 58% -> 26%: GPT-4o customer support accuracy, from 9 tools to 51 tools
- 128: OpenAI hard tool cap, per agent, function-calling API
In a controlled ReAct-agent experiment, GPT-4o’s calendar-scheduling accuracy fell from 43% with 4 tools (one domain) to just 2% with 51 tools across seven domains, and customer-support accuracy dropped from 58% with 9 tools to 26% with 51 tools — the clearest public demonstration that tool count alone wrecks tool selection accuracy. These are not edge cases; they are the same model, same tasks, only more tools in the menu.
The pattern is consistent: as you add domains and tools, the agent’s ability to pick the correct function falls off a cliff, and tasks that require chaining multiple calls degrade even faster. A weaker open model in the same study (Llama-3.3-70B) failed in nearly all the high-count cases. The chart below plots the two GPT-4o curves against the retrieval-based ceiling we discuss later.
The takeaway from the curve is blunt: between roughly 4 and 51 tools, raw ‘dump everything in the prompt’ accuracy can lose roughly 30 to 40 absolute points (41 on calendar, 32 on customer support). The safe zone, where direct tool-passing stays reliable, sits under about 10 to 20 tools.
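To make the size of the drop concrete, the arithmetic can be written out directly. This is a minimal sketch that uses only the four accuracy figures quoted above; the dictionary layout and function names are illustrative.

```python
# (tools_before, accuracy_before_pct, tools_after, accuracy_after_pct)
results = {
    "calendar": (4, 43, 51, 2),
    "customer_support": (9, 58, 51, 26),
}


def absolute_drop(acc_before, acc_after):
    """Accuracy lost, in percentage points."""
    return acc_before - acc_after


def relative_drop(acc_before, acc_after):
    """Share of the original accuracy that was lost (0.0 to 1.0)."""
    return (acc_before - acc_after) / acc_before


for task, (n0, a0, n1, a1) in results.items():
    print(
        f"{task}: {n0} -> {n1} tools, "
        f"-{absolute_drop(a0, a1)} points, "
        f"{relative_drop(a0, a1):.0%} of original accuracy lost"
    )
```

The relative figure is worth tracking alongside the absolute one: a task that starts at 43% and ends at 2% has lost almost all of its working behaviour, not just 41 points.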
Why does my agent pick the wrong tool? Count vs. complexity
Your agent picks the wrong tool for two compounding reasons: there are simply too many candidates to disambiguate (count), and the candidates look too similar or have overly complex parameter schemas (complexity) — and the benchmark evidence shows count alone is a driver even when every tool is simple. Fixing one without the other rarely solves the problem.
The cleanest isolation of the count effect comes from the Nexus Function Calling Leaderboard. Its VirusTotal category contains 12 APIs and its OTX category contains 9 APIs. Both are simple, single-call, and isolated, yet models reliably score higher on the smaller OTX set. With difficulty held roughly constant, the larger menu measurably hurts. Count, by itself, costs you accuracy.
Complexity is the second axis. Anthropic’s engineering guidance is explicit that one of the most common failure modes is a bloated tool set with ambiguous decision points: if a human engineer cannot say which tool to use in a given situation, the agent will not do better either. Overly elaborate parameter schemas cause selection errors even with small toolsets, because the model burns its reasoning budget figuring out arguments instead of intent. Two near-duplicate tools are a worse problem than one extra tool.
“If a human engineer can’t definitively say which tool should be used in a given situation, an AI agent can’t be expected to do better.”
Anthropic, Writing effective tools for AI agents
How many tools is too many? A tools-per-agent threshold table
As a working rule: under ~10 tools is the safe zone for direct tool-passing, 10-20 is a caution zone where you should measure selection accuracy, 20-50 means you should split into sub-agents, and above ~50-100 you should switch to retrieval-based tool selection rather than fighting the prompt. The thresholds below turn the degradation curve into decisions.
These are heuristics, not laws — a tightly-scoped, well-named set of 25 tools in one domain can outperform 8 overlapping ones across three. But they map directly to where the measured accuracy falls, and they give you a default to deviate from deliberately rather than by accident. Anthropic notes that successful coding agents typically expose fewer than ten tools, which is a useful sanity check on the low end.
| Tools in one context | Zone | What’s happening | What to do |
|---|---|---|---|
| 1-10 | Safe | Direct tool-passing is reliable; this is where strong coding agents live | Keep it here. Consolidate related actions into single tools with an action parameter. |
| 10-20 | Caution | Selection accuracy starts slipping, especially across domains | Measure tool-selection accuracy on a held-out set. Merge near-duplicates. Tighten names and descriptions. |
| 20-50 | Degrading | Cross-domain crowding; ReAct accuracy drops sharply (e.g. 43%->2%) | Split into sub-agents by domain with a router/orchestrator. Each child stays in the safe zone. |
| 50-128 | Red | Near the OpenAI hard cap; raw selection is unreliable | Switch to retrieval-based tool selection (RAG over a tool index). Do not pass all tools in the prompt. |
| 128+ | Blocked / scale-out | Exceeds OpenAI’s 128-tool API limit entirely | Mandatory: tool retrieval / RAG-Tool Fusion plus dynamic activation. Treat tools as a searchable catalog. |
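The table can be encoded as a small lookup so a deploy script can warn when a toolset crosses a threshold. This is a sketch of the heuristics above, not a rule: the zone names follow the table, and assigning boundary values such as 10 or 20 to the lower zone is an assumption, since the table's ranges overlap at the edges.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    name: str
    action: str


# Inclusive upper bounds mirroring the table; these are defaults, not laws.
_ZONES = [
    (10, Zone("safe", "pass tools directly; consolidate related actions")),
    (20, Zone("caution", "measure selection accuracy; merge near-duplicates")),
    (50, Zone("degrading", "split into domain sub-agents behind a router")),
    (128, Zone("red", "switch to retrieval-based tool selection")),
]
_OVER_CAP = Zone("blocked", "tool retrieval plus dynamic activation is mandatory")


def classify_tool_count(n: int) -> Zone:
    """Map the number of tools in one context to a zone and a default action."""
    if n < 1:
        raise ValueError("tool count must be at least 1")
    for upper, zone in _ZONES:
        if n <= upper:
            return zone
    return _OVER_CAP


for n in (8, 15, 30, 100, 200):
    zone = classify_tool_count(n)
    print(f"{n} tools: {zone.name} -> {zone.action}")
```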
When to split into sub-agents vs. switch to RAG tool retrieval for agents
Split into sub-agents when your tools cluster into a handful of clean domains and each cluster fits under ~10-20 tools; switch to RAG tool retrieval when you have hundreds-to-thousands of tools, churning catalogs, or no clean domain boundaries to split along. They solve the same problem — shrinking the choices the model sees on any given turn — at different scales.
Sub-agents are the cheaper, more debuggable option and should be your first move past the caution zone. A router agent classifies the request and hands off to a specialist child that only ever sees its own small toolset. Each child stays in the safe zone, selection accuracy recovers, and you get clean logs per domain. The cost is orchestration latency and the routing decision itself, which is now your new single point of failure to monitor.
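The routing pattern can be sketched in a few lines. In this illustration a keyword overlap stands in for the router model's classification step, and the domains, tool names, and keywords are all hypothetical; a real router would typically be a model call.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class SubAgent:
    domain: str
    tools: List[str]
    keywords: Set[str] = field(default_factory=set)


# Hypothetical domains and tool names, for illustration only.
SUB_AGENTS = [
    SubAgent("calendar", ["create_event", "list_events", "cancel_event"],
             {"meeting", "schedule", "calendar"}),
    SubAgent("billing", ["get_invoice", "issue_refund"],
             {"invoice", "refund", "charge"}),
    SubAgent("crm", ["find_contact", "update_contact"],
             {"contact", "customer", "lead"}),
]


def route(query: str, agents: List[SubAgent] = SUB_AGENTS) -> Optional[SubAgent]:
    """Pick the sub-agent whose keywords overlap the query the most.

    Returns None when nothing matches, so the caller can ask for
    clarification instead of guessing a domain.
    """
    words = set(query.lower().split())
    best = max(agents, key=lambda a: len(a.keywords & words))
    return best if best.keywords & words else None


def tools_for(query: str) -> List[str]:
    """Only the chosen sub-agent's tools are ever shown to the model."""
    agent = route(query)
    return agent.tools if agent else []


print(tools_for("schedule a meeting tomorrow"))
```

Returning `None` on no match is a deliberate choice here: because the router is the new single point of failure, an explicit "no domain" outcome is easier to log and monitor than a silent default.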
RAG-based tool retrieval is the move when sub-agents stop scaling — when you have an MCP aggregator, an API marketplace, or an internal catalog with hundreds of endpoints. Instead of putting tools in the prompt, you embed each tool’s description into a vector index and retrieve only the top-k candidates relevant to the current query. The agent then chooses among, say, five candidates instead of five hundred.
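A minimal sketch of the top-k step follows. To stay self-contained it uses word counts and cosine similarity as a toy stand-in for a real embedding model and vector store, which is an assumption made for illustration; the catalog entries are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolIndex:
    def __init__(self, descriptions: dict):
        # Embed each tool description once, when the index is built.
        self._vectors = {name: embed(desc) for name, desc in descriptions.items()}

    def top_k(self, query: str, k: int = 5) -> list:
        """Return the k tool names most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self._vectors,
            key=lambda name: cosine(q, self._vectors[name]),
            reverse=True,
        )
        return ranked[:k]


# Hypothetical catalog entries.
index = ToolIndex({
    "create_event": "Create a calendar event or meeting at a given time",
    "issue_refund": "Refund a customer charge or invoice payment",
    "find_contact": "Look up a customer contact record by name or email",
    "search_docs": "Search internal documentation pages by keyword",
})

candidates = index.top_k("refund the last invoice for this customer", k=2)
print(candidates)  # only these names would be passed to the model
```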
Concrete migration path: from a fat agent to sub-agents
1) Group your tools by domain (calendar, CRM, billing, search).
2) Build one router agent whose only job is to classify the request into a domain and call the matching sub-agent.
3) Give each sub-agent only its domain’s tools.
4) Add a held-out eval set of (query -> correct tool) pairs and log selection accuracy per sub-agent on every deploy.
5) If any sub-agent’s toolset creeps past ~15, split it again. This keeps every model call inside the safe zone where accuracy is highest.
Concrete migration path: from sub-agents to RAG tool retrieval
1) Write a rich description and example invocations for each tool (Anthropic’s ‘tool knowledge base’ idea).
2) Embed those descriptions into a vector store.
3) On each request, retrieve the top-k (start with k=5) most relevant tools.
4) Pass only those k tools to the model.
5) Measure Recall@k against gold tools; if the right tool isn’t in the top-k, the agent can’t pick it, so retrieval quality becomes your new accuracy ceiling.
6) Add query rewriting/expansion before retrieval; the Toolshed work attributes large accuracy gains to pre-retrieval query transformation.
Counter-evidence: how to reduce agent tool count with RAG-Tool Fusion
You don’t always have to reduce the tool count if you stop putting every tool in the prompt: Toolshed’s RAG-Tool Fusion maintains high tool-selection accuracy up to 4,000-10,000 tools by retrieving a small relevant subset per query, reporting 46%, 56%, and 47% absolute improvements (Recall@5) on the ToolE single-tool, ToolE multi-tool, and Seal-Tools benchmarks. This is the flat line on the chart above — and it is the reason the answer to ‘how many tools can an AI agent have’ is ‘effectively unlimited, if you architect for retrieval.’
The mechanism is a tool knowledge base: each tool’s documentation is enriched and stored in a vector database, then advanced RAG techniques (query planning and transformation pre-retrieval, dense retrieval intra-retrieval, re-ranking post-retrieval) surface only the handful of tools relevant to the current request. The model never sees the full catalog. Independent surveys of tool-selection accuracy report that dense retrieval with context enrichment holds accuracy up to 4,000-10,000 tools, with gains of 40-60 absolute points over naive BM25 baselines.
The catch, and it is a real one, is that retrieval introduces its own failure mode. If the correct tool is not in the retrieved top-k, the agent literally cannot choose it, no matter how good the model is. So your tool-selection problem becomes a retrieval-quality problem, which you must evaluate with Recall@k. For most teams with fewer than a few dozen tools, sub-agents are simpler, and a tool index is one more system to maintain and monitor.
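Recall@k itself is simple to compute once you have gold labels. The sketch below assumes each eval row pairs a ranked retriever output with the set of tools that should have been retrieved; the rows shown are hypothetical.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold tools that appear in the top-k retrieved list."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold set must not be empty")
    top = set(retrieved[:k])
    return len(top & gold_set) / len(gold_set)


def mean_recall_at_k(examples, k=5):
    """Average Recall@k over (retrieved_list, gold_tools) pairs."""
    scores = [recall_at_k(retrieved, gold, k) for retrieved, gold in examples]
    return sum(scores) / len(scores)


# Hypothetical eval rows: (ranked retriever output, gold tools for the query)
examples = [
    (["issue_refund", "get_invoice", "find_contact"], {"issue_refund"}),
    (["find_contact", "create_event", "issue_refund"], {"create_event", "list_events"}),
    (["search_docs", "find_contact", "get_invoice"], {"issue_refund"}),
]

print(mean_recall_at_k(examples, k=2))
```

Any query that scores below 1.0 here is one where the agent cannot complete the task however well the model reasons, which is why this number acts as a ceiling on end-to-end selection accuracy.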
The reframe: above ~50 tools, the question is no longer ‘how do I reduce my agent’s tool count?’ but ‘how do I retrieve the right 5 tools per turn?’ Retrieval-based tool selection keeps accuracy high.
Parameter-design fixes that improve tool selection without cutting tools
Before you split agents or build a retrieval index, fix the tool definitions themselves — clear names, unambiguous descriptions, simplified parameter schemas, and consolidated actions can recover most of the accuracy lost to a moderately bloated toolset. Complexity is often the cheaper bug to fix than count.
Anthropic’s guidance is concrete here: consolidate related operations into fewer, more capable tools rather than one tool per action. Instead of create_pr, review_pr, and merge_pr, expose a single tool with an action parameter. This shrinks the menu and removes ambiguous decision points in one move. Pair that with descriptions written for disambiguation — each tool’s description should make clear when not to use it, not just when to use it.
On parameters: keep schemas as flat and minimal as the task allows. Deep nesting and large optional-argument lists cause the model to mis-fill arguments even when it picks the right tool, which shows up in your logs as ‘right tool, wrong call.’ If two tools share most of their parameters and differ only slightly, that is a strong signal to merge them behind one action enum.
```python
# BEFORE: three near-duplicate tools the model must disambiguate
tools = [
    {"name": "create_pr", "description": "Create a pull request", "parameters": {...}},
    {"name": "review_pr", "description": "Review a pull request", "parameters": {...}},
    {"name": "merge_pr", "description": "Merge a pull request", "parameters": {...}},
]

# AFTER: one consolidated tool, with fewer choices and clearer intent (Anthropic pattern)
tools = [
    {
        "name": "manage_pull_request",
        "description": (
            "Create, review, or merge a pull request. "
            "Use this for ANY pull-request operation. "
            "Do NOT use for issues or branches."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ["create", "review", "merge"]},
                "repo": {"type": "string"},
                "number": {"type": "integer", "description": "PR number; omit for create"},
            },
            "required": ["action", "repo"],
        },
    }
]


# Then measure it: log chosen vs. gold tool on a held-out set every deploy
def tool_selection_accuracy(eval_set, agent):
    hits = sum(agent.pick_tool(q) == gold for q, gold in eval_set)
    return hits / len(eval_set)  # treat a >10pt drop as a regression
```
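The merge signal described earlier, two tools that share most of their parameters, can be checked mechanically. The sketch below compares parameter names with Jaccard overlap; the 0.6 threshold is an arbitrary starting point (an assumption), and the parameter lists are hypothetical.

```python
from itertools import combinations


def param_names(tool):
    """Top-level parameter names declared in a tool's JSON schema."""
    return set(tool.get("parameters", {}).get("properties", {}))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(tools, threshold=0.6):
    """Return (name_a, name_b, overlap) for pairs sharing most parameters.

    threshold is an assumed default; tune it against your own catalog.
    """
    pairs = []
    for a, b in combinations(tools, 2):
        score = jaccard(param_names(a), param_names(b))
        if score >= threshold:
            pairs.append((a["name"], b["name"], round(score, 2)))
    return pairs


def schema(*props):
    # Helper that builds a flat string-only schema for the example.
    return {"type": "object", "properties": {p: {"type": "string"} for p in props}}


# Hypothetical parameter lists for illustration.
catalog = [
    {"name": "review_pr", "parameters": schema("repo", "number", "comment")},
    {"name": "merge_pr", "parameters": schema("repo", "number")},
    {"name": "search_docs", "parameters": schema("query")},
]

print(merge_candidates(catalog))
```

Parameter overlap only flags candidates; whether two tools should actually sit behind one action enum still depends on whether their descriptions answer the same kind of request.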
The bottom line: how many tools should an AI agent have?
Keep direct toolsets under ~10-15; split or retrieve beyond that
An AI agent should have as few tools as the job allows: stay under ~10 for a single agent, split into domain sub-agents at ~20-50, and move to retrieval-based tool selection above ~50, with 128 as OpenAI’s absolute hard cap. The 128 number is a wall, not a target; the measured accuracy data (43%->2% on calendar, 58%->26% on support) is what should actually govern your architecture.
If your agent is mis-selecting today, work the cheap fixes first: merge near-duplicate tools, simplify parameter schemas, and tighten descriptions for disambiguation. If accuracy is still slipping past the caution zone, split into sub-agents. Only when you genuinely have hundreds-to-thousands of tools should you stand up a RAG-Tool Fusion index — and when you do, your job becomes guaranteeing the right tool lands in the retrieved top-k.
Whatever you do, instrument it. Log chosen-vs-gold tool on a held-out set every deploy and treat a 10-point drop after adding tools as a regression. That single metric turns ‘how many tools can an AI agent have’ from a guessing game into a number you can actually defend.
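One way to turn that rule into a deploy gate is to compare the current eval run against a stored baseline. This is a sketch under assumptions: each run is a list of (chosen tool, gold tool) pairs, the function names are illustrative, and the 10-point threshold is the one suggested above.

```python
def selection_accuracy(pairs):
    """pairs: list of (chosen_tool, gold_tool) from a held-out eval run."""
    if not pairs:
        raise ValueError("eval run is empty")
    return sum(chosen == gold for chosen, gold in pairs) / len(pairs)


def is_regression(baseline_pairs, current_pairs, max_drop_points=10.0):
    """True when accuracy fell by at least max_drop_points percentage points."""
    drop = (selection_accuracy(baseline_pairs) - selection_accuracy(current_pairs)) * 100
    # Round to avoid floating-point noise right at the threshold.
    return round(drop, 6) >= max_drop_points


# Hypothetical runs: 9/10 correct before adding tools, 7/10 after.
baseline = [("a", "a")] * 9 + [("a", "b")]
current = [("a", "a")] * 7 + [("a", "b")] * 3

if is_regression(baseline, current):
    print("tool-selection accuracy regressed; block the deploy")
```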
Builder’s take
I run the orchestration engine behind Cyntr and the agent layer inside Loomfeed, and the single biggest reliability lever I have found is not a better model — it is a shorter tool list. Here is how I think about it after shipping agents that route across dozens of capabilities.
- The 128-tool API ceiling is a red herring. Real degradation starts in the teens. If your agent has more than ~15 tools in one context, assume it is already mis-selecting and measure it.
- Treat ‘too many tools’ and ‘too vague tools’ as the same bug. Half the wrong-tool calls I have debugged were two near-duplicate tools the model could not disambiguate, not a count problem at all. Merge first, count second.
- Sub-agents are cheaper than RAG-tool retrieval for most teams. If you have 30 tools across 4 clean domains, a router that hands off to four small specialist agents will beat a vector-retrieval tool index you now have to maintain and evaluate.
- Only reach for RAG-Tool Fusion when you genuinely have hundreds-to-thousands of tools — a marketplace, an MCP aggregator, an internal API catalog. Below that it is over-engineering.
- Instrument tool selection like a metric, not a vibe. We log the chosen tool versus the gold tool on a held-out set every deploy. A 10-point drop after adding three tools is the cheapest regression test in the stack.
Frequently asked questions
How many tools can an AI agent have?
Technically up to 128 on OpenAI’s function-calling API, but tool-selection accuracy degrades well before that. In practice, keep a single agent under about 10-15 well-designed tools; past 20-50 split into sub-agents, and above ~50 switch to retrieval-based tool selection that surfaces only a relevant subset per turn.
What is the max tools OpenAI function calling limit?
OpenAI imposes a hard limit of 128 tools per agent in its function-calling API. This is a configuration ceiling, not a quality target: reliability typically starts dropping in the 10-to-20 tool range, long before you reach 128.
Why does my agent pick the wrong tool?
Two reasons usually compound: too many candidate tools to disambiguate (count) and tools that look too similar or have overly complex parameter schemas (complexity). The Nexus leaderboard shows count alone hurts even when tools are simple: models score higher on the 9-API OTX set than the 12-API VirusTotal set. Merge near-duplicate tools and simplify schemas first, then reduce count.
How many tools is too many?
There is no universal number, but under ~10 tools is the safe zone where direct tool-passing stays reliable, and Anthropic notes successful coding agents typically expose fewer than ten. Treat 10-20 as a caution zone where you should measure selection accuracy, and split or switch to retrieval beyond that.
How do I reduce my agent’s tool count?
Consolidate related operations into single tools with an action parameter (e.g. one manage_pull_request tool instead of create/review/merge), remove redundant tools, and split capabilities across domain sub-agents behind a router so each agent only sees its own small toolset. If you genuinely need hundreds of tools, use RAG-based tool retrieval instead of cutting the catalog.
Can an agent scale to thousands of tools?
Yes. Toolshed’s RAG-Tool Fusion maintains high tool-selection accuracy up to 4,000-10,000 tools by embedding tool descriptions into a vector index and retrieving only the top-k relevant tools per query, reporting up to 46-56% absolute Recall@5 gains on standard benchmarks. The trade-off is that retrieval quality becomes your new accuracy ceiling: if the right tool isn’t in the retrieved subset, the agent can’t choose it.
Primary sources
- How many tools/functions can an AI Agent have? (128-tool limit) — Allen Chan, Medium
- How Tool Complexity Impacts AI Agents Selection Accuracy (ReAct degradation figures) — Allen Chan, Medium
- Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion — arXiv (Lumer et al.)
- Tool Selection Accuracy in AI Agents (retrieval scaling to 4,000-10,000 tools) — EmergentMind
- Beyond the Leaderboard: Unpacking Function Calling Evaluation (Nexus VirusTotal vs OTX) — Databricks
- Writing effective tools for AI agents — Anthropic
Last updated: June 2, 2026.