Agentic AI Benchmarks: A Different Model Wins Each

GAIA, tau2-bench, WebArena and BFCL V4 each crown a different winner in 2026 – here is why no single model tops the board, and what the matrix actually tells buyers.

Contents

Which model wins the agentic AI benchmarks in 2026?

No single model wins the agentic AI benchmarks in 2026 – a different model tops nearly every board. Claude Opus 4.6 leads tau2-bench customer service, Claude Sonnet 4.5 leads GAIA, the open OpAgent leads WebArena, and Zhipu’s GLM-4.5 leads BFCL V4 function calling. Four headline benchmarks, four different winners, and not one of them is the obvious frontier flagship across the board.

This is the most important thing a buyer can internalize this year. The reflex to ask ‘what is the best agent model?’ assumes the leaderboards collapse into one ranking. They do not. Each benchmark stresses a genuinely different capability – long-horizon web-and-file reasoning, multi-turn dual-control dialogue, live website navigation, and structured tool-call accuracy – and the model that is strongest at one is frequently middling at another.

The data below is drawn from the Awesome Agents agentic benchmarks board (updated April 8, 2026), Princeton’s HAL GAIA leaderboard, Artificial Analysis, and the Berkeley Function Calling Leaderboard. Every number in this article is sourced; the chart in the next section is the single visual that makes the split obvious at a glance.

A 2026 agentic AI benchmark scoreboard showing different models leading GAIA, tau2-bench, WebArena and BFCL V4 — Image.

The 2026 agentic AI benchmarks scoreboard, one matrix

Read the matrix row by row and the highlighted winner jumps to a different column on almost every line. That visual shift is the entire thesis of this article: the agentic AI benchmarks do not rank models, they rank models-on-a-task. Claude dominates the two customer-service rows and GAIA, but loses live web navigation to an open Qwen3-VL agent and loses raw function calling to GLM-4.5.

tau2-bench, built by Sierra Research, runs multi-turn customer-service episodes in a dual-control environment where both the agent and a simulated user hold tools. Claude Opus 4.6 posts 99.3% on the telecom domain and 91.9% on retail. GPT-5.2 Thinking is right behind on telecom at 98.7% but falls to 82.0% on retail – a nearly 10-point retail gap that no telecom score would have warned you about. The open-source LongCat-Flash-Thinking-2601 ties the telecom leader at 0.993.

GAIA – Princeton’s 466 web-and-file reasoning tasks – is led by Claude Sonnet 4.5 at 74.6% on the HAL scaffold, against a GPT-5 Medium configuration at 62.8%. WebArena’s live-website tasks are won by OpAgent (Qwen3-VL-32B plus reinforcement learning) at 71.6%, edging a GPT-5 agent at 71.2% and a Claude-Code-based agent at 68.0%. BFCL V4 puts GLM-4.5 first at 70.9%, Claude Opus 4.1 at 70.4%, and a GPT-5 entry well back at 59.2%.

Notice that GPT-5 variants are competitive everywhere and first nowhere on this particular board. That is not a knock on the model – it is a demonstration that ‘competitive everywhere’ and ‘benchmark winner’ are different properties, and that picking a model off a single bolded cell is how teams end up surprised in production.

A different model wins every agentic benchmark — Row leaders: Claude Sonnet 4.5 on GAIA (74.6%), Claude Opus 4.6 on tau2 retail (91.9%) and telecom (99.3%), OpAgent on WebArena (71.6%), GLM-4.5 on BFCL V4 (70.9%). Zero values mean the model is not ranked on that benchmark.

Each benchmark uses its own evaluation harness and scaffold. A score is only comparable within its own column – never read across rows to declare an overall winner.

tau2-bench: why retail and telecom must be split

99.3%

tau2-bench telecom leader

Claude Opus 4.6; LongCat-Flash ties at 0.993

91.9%

tau2-bench retail leader

Claude Opus 4.6, the most discriminating split

9.9 pts

GPT-5.2 telecom-to-retail drop

98.7% telecom vs 82.0% retail

Treat tau2-bench retail and telecom as two separate benchmarks, because the same model can be near-perfect on one and mediocre on the other. Sierra Research’s tau2-bench models customer service as a decentralized partially-observable process where the agent and a simulated user each operate tools, so an agent has to both reason and coordinate a human counterpart through a fix.

The telecom domain is close to saturated at the top: Claude Opus 4.6 at 99.3%, Claude Opus 4.5 at 98.2%, GPT-5.2 Thinking at 98.7%, and LongCat-Flash-Thinking-2601 tying the leader at 0.993. When four models cluster inside a single point, telecom has stopped discriminating between them and you should weight other evidence.

Retail tells a different story and is where the real separation lives. Claude Opus 4.6 leads at 91.9%, Claude Opus 4.5 follows at 88.9%, and GPT-5.2 Thinking drops to 82.0%. The retail tasks involve more branching policy logic and order-state manipulation, and that is exactly where a model that aced telecom can stumble. If your deployment is a retail or e-commerce support agent, the telecom leaderboard is actively misleading – benchmark the retail split.

“When four models cluster inside a single point on telecom, the benchmark has stopped telling you which one to ship.”
On benchmark saturation

GAIA and the scaffold delta: the harness is half the score

On GAIA, the agent scaffold adds roughly 30 absolute points – meaning your orchestration layer can matter as much as your model choice. The same underlying GPT-5 family scores 44.8% bare as GPT-5 Mini on the public GAIA board, while a HAL-scaffolded run reaches 74.6% with Claude Sonnet 4.5. That is a +29.8-point swing attributable to the harness around the model, not the model alone.

Princeton’s HAL is a generalist agent scaffold built specifically to push GAIA’s 466 web-and-file reasoning tasks. It supplies planning, tool routing, retries, and reflection on top of the base model. The lesson for builders is uncomfortable but liberating: a worse model inside a better scaffold can beat a better model running bare. Kili Technology’s 2026 benchmark review found the same effect elsewhere, noting one Claude Opus configuration scoring 64.9% in one framework and 57.6% in another – a 7-point gap from the orchestration layer alone.

This is why GAIA scores carry a methodology asterisk that buyers ignore at their peril. A vendor citing ‘74.6% on GAIA’ is citing a model-plus-scaffold result. If you drop their model into your own thinner harness, you should expect to land far closer to the bare number. Always ask what scaffold produced a quoted agent score.

GAIA scaffold delta: bare vs HAL-scaffolded — The HAL scaffold lifts results by +29.8 absolute points over a bare GPT-5 Mini run on the same 466-task benchmark.

Any quoted agent benchmark score is a model-plus-scaffold number. Drop the same model into a thinner harness and expect to regress toward the bare baseline.

WebArena and BFCL V4: where open weights take the crown

Two of the four headline agentic AI benchmarks are led by open-weight systems, not closed frontier flagships. On WebArena’s live-website navigation tasks, OpAgent – built on Qwen3-VL-32B with a reinforcement-learning phase on real sites – leads at 71.6%, edging a GPT-5 agent (ColorBrowserAgent) at 71.2% and a Claude-Code-based agent at 68.0%. OpAgent’s published pipeline uses a Planner-Grounder-Reflector-Summarizer loop, again underscoring that the agent architecture, not just the model, is doing the work.

On the Berkeley Function Calling Leaderboard V4 – the standard for structured tool-call accuracy – GLM-4.5 from Zhipu AI leads at 70.9%, with Claude Opus 4.1 at 70.4% and Claude Sonnet 4 at 70.3% close behind, and a GPT-5 entry back at 59.2%. (Note that BFCL has since logged newer entries like Claude Opus 4.5 and GLM-4.6 at higher overall marks; the 70.9% figure is the V4 snapshot from the Awesome Agents board.)

For anyone weighing build-versus-buy on a self-hosted stack, this is the headline. The two axes most central to production agents – reliably driving a browser and reliably emitting valid tool calls – are currently led by models you can run yourself under permissive licenses. That reframes the cost conversation entirely, because the per-token economics of an open model winning your specific benchmark can dwarf a one-point closed-model edge elsewhere.

Benchmark	What it measures	Leader	Runner-up	Notable GPT-5
GAIA	Web + file reasoning (466 tasks)	Claude Sonnet 4.5 – 74.6%	Claude Sonnet 4.5 High – 70.9%	GPT-5 Medium – 62.8%
tau2 retail	Multi-turn support, retail	Claude Opus 4.6 – 91.9%	Claude Opus 4.5 – 88.9%	GPT-5.2 Thinking – 82.0%
tau2 telecom	Multi-turn support, telecom	Claude Opus 4.6 – 99.3%	GPT-5.2 Thinking – 98.7%	GPT-5.2 Thinking – 98.7%
WebArena	Live website navigation	OpAgent (Qwen3-VL) – 71.6%	ColorBrowserAgent (GPT-5) – 71.2%	ColorBrowserAgent – 71.2%
BFCL V4	Function / tool-call accuracy	GLM-4.5 – 70.9%	Claude Opus 4.1 – 70.4%	GPT-5 – 59.2%

The 2026 agentic benchmark board – leader, runner-up, and a frontier challenger per benchmark

Open weights now lead WebArena (OpAgent) and BFCL V4 (GLM-4.5) – the two axes most central to production agents – which materially changes self-hosted build-vs-buy economics.

Why the agentic AI benchmarks should never be one ranking

The agentic AI benchmarks measure different capabilities and should never be collapsed into a single ranking, because doing so averages away the exact signal you need. GAIA tests long-horizon reasoning over documents and the web. tau2-bench tests multi-turn coordination with a tool-wielding human. WebArena tests live DOM navigation. BFCL V4 tests structured tool-call correctness. A weighted average of those four is a number with no operational meaning.

The benchmarks also disagree on the failure modes they expose. tau2-bench specifically shows that agent behavior degrades sharply moving from single-control to dual-control settings – social coordination is its own capability axis that GAIA and BFCL barely touch. And the scaffold dependence is everywhere: the same model swings 7 points across frameworks on one task and nearly 30 points bare-versus-scaffolded on GAIA.

Then there is the production gap. Kili Technology’s review cites a 37% gap between lab benchmark scores and real-world deployment, with one example of consistency dropping from 60% on a single run to 25% across eight consecutive runs. A headline 99.3% can coexist with a reliability problem you only see when you measure N-run consistency – which no public board currently reports as the primary metric. The right reading of this board is not ‘who won’ but ‘which benchmark matches my workload, and what did the bare-model baseline look like.’

Pros

Fast, free directional signal on which models are even in contention for your task type
Per-benchmark splits (like tau2 retail vs telecom) surface workload-specific strengths a single score would hide
Open-weight leaders on WebArena and BFCL V4 flag credible self-hosted options before you spend on closed APIs
Scaffold deltas quantify how much your orchestration layer can buy you

Cons

Scores are model-plus-scaffold; quoted numbers rarely transfer to your own thinner harness
Boards saturate within months, so a current leader can be stale by your procurement date
Single-run success rates hide N-run reliability, the metric production actually needs
A 37% lab-to-production gap means no benchmark substitutes for an internal eval on your data

Match the benchmark to your workload, read the bare-model baseline, then re-run your own eval. The leaderboard narrows the field; it does not make the decision.

How to actually choose a model from this board

Four benchmarks, four winners – the board is a map, not a ranking

In 2026 the agentic AI benchmarks crown Claude Opus 4.6 on tau2-bench, Claude Sonnet 4.5 on GAIA, OpAgent on WebArena, and GLM-4.5 on BFCL V4. No model sweeps, two of four leaders are open-weight, and a ~30-point GAIA scaffold delta proves the harness is half the score. The actionable move is to match the benchmark to your workload, check the bare-model baseline, and re-run your own N-run eval before committing – because the right model is genuinely a different one for each job.

Choose the benchmark that mirrors your workload first, then read that column – never the other way around. If you are building a retail support agent, tau2-bench retail is your row and Claude Opus 4.6 is your front-runner; the telecom score is noise for you. If you are building an autonomous web operator, WebArena is your row and an open Qwen3-VL agent deserves a serious look before you default to a closed API.

Second, find the bare-model baseline before you believe any quoted agent score. The GAIA gap between 44.8% bare and 74.6% scaffolded is the clearest warning on this entire board: the number a vendor shows you includes their harness. Estimate what you will actually get inside your own orchestration layer, which is usually thinner than a research scaffold like HAL.

Third, budget for your own evaluation harness as a product line item, not an afterthought. Across Cyntr and Loomfeed the harness has bought more reliable agent behavior than any single model swap this year, which is exactly what the GAIA scaffold delta predicts. Public benchmarks tell you who is in contention. Your own N-run eval on your own data tells you who to ship – and on this board, that will be a different model for a different job.

Builder’s take

I run evaluation harnesses for Cyntr’s orchestration engine and Loomfeed’s AI layer every week, and the single most expensive mistake I see teams make is shopping for ‘the best agent model’ off one number. There is no such number in 2026.

Pick the benchmark that matches your workload before you pick the model. A retail-support agent should be tuned against tau2-bench retail, not a coding leaderboard – the leaders are different models.
The scaffold is a first-class product decision. GAIA’s +29.8-point jump from bare GPT-5 Mini to a HAL-scaffolded run means your orchestration layer can out-weigh your model choice. Cyntr’s gains this year came mostly from the harness, not from swapping models.
Open weights are now genuinely competitive on the agentic axes that matter to me. OpAgent leads WebArena and GLM-4.5 leads BFCL V4, which changes the build-vs-buy math for any self-hosted deployment.
Re-run your own evals on every model release. These boards saturate in months and a 99.3% telecom score can hide a sub-N-run reliability problem you only catch in production.

Frequently asked questions

Which model is best at agentic AI benchmarks in 2026?

There is no single best model. As of the April 2026 boards, Claude Opus 4.6 leads tau2-bench customer service (99.3% telecom, 91.9% retail), Claude Sonnet 4.5 leads GAIA at 74.6%, the open OpAgent leads WebArena at 71.6%, and GLM-4.5 leads BFCL V4 function calling at 70.9%. Each benchmark crowns a different winner because each measures a different capability.

Why shouldn’t agentic benchmarks be combined into one ranking?

Because they measure fundamentally different things – GAIA tests web-and-file reasoning, tau2-bench tests multi-turn coordination with a tool-wielding user, WebArena tests live website navigation, and BFCL V4 tests structured tool-call accuracy. Averaging them produces a number with no operational meaning, and it hides the workload-specific strengths you actually need to see when choosing a model.

What is the GAIA scaffold delta and why does it matter?

The scaffold delta is the score difference between a bare model and the same model inside an agent scaffold. On GAIA, a bare GPT-5 Mini scores 44.8% while a HAL-scaffolded Claude Sonnet 4.5 run reaches 74.6% – a +29.8-point swing from the harness alone. It matters because any quoted agent benchmark is a model-plus-scaffold result that rarely transfers to a thinner in-house harness.

What is the difference between tau2-bench retail and telecom?

Both are Sierra Research customer-service domains, but telecom is near-saturated at the top (four models within one point of 99.3%) while retail is far more discriminating. Claude Opus 4.6 leads retail at 91.9%, but GPT-5.2 Thinking drops from 98.7% on telecom to 82.0% on retail. If you build a retail support agent, the telecom score will mislead you – benchmark the retail split.

Are open-weight models competitive on agentic benchmarks?

Yes, two of the four headline boards are led by open systems. OpAgent, built on Qwen3-VL-32B with reinforcement learning, leads WebArena at 71.6%, and GLM-4.5 from Zhipu AI leads BFCL V4 at 70.9%. Because those two axes – browser navigation and tool-call accuracy – are central to production agents, open weights meaningfully change self-hosted build-versus-buy economics.

Can I trust a 99.3% benchmark score for production?

Not on its own. Public boards report single-run success rates, but Kili Technology documents a 37% gap between lab scores and real-world deployment, with one case where consistency fell from 60% on a single run to 25% across eight runs. A high headline number can coexist with an N-run reliability problem you only catch by re-running your own evaluation on your own data.

Primary sources

Agentic AI Benchmarks Leaderboard – GAIA, WebArena, BFCL, tau2-bench — Awesome Agents
HAL: GAIA Leaderboard — Princeton HAL
Berkeley Function Calling Leaderboard (BFCL) V4 — UC Berkeley
tau2-bench Telecom Benchmark Leaderboard — Artificial Analysis
OpAgent: Operator Agent for Web Navigation — arXiv
AI Benchmarks 2026: Top Evaluations and Their Limits — Kili Technology

Last updated: June 1, 2026. Related: Observability.

Agentic AI Benchmarks: A Different Model Wins Each

Which model wins the agentic AI benchmarks in 2026?

The 2026 agentic AI benchmarks scoreboard, one matrix

tau2-bench: why retail and telecom must be split

GAIA and the scaffold delta: the harness is half the score

WebArena and BFCL V4: where open weights take the crown

Why the agentic AI benchmarks should never be one ranking

Pros

Cons

How to actually choose a model from this board

Four benchmarks, four winners – the board is a map, not a ranking

Builder’s take

Frequently asked questions

Which model is best at agentic AI benchmarks in 2026?

Why shouldn’t agentic benchmarks be combined into one ranking?

What is the GAIA scaffold delta and why does it matter?

What is the difference between tau2-bench retail and telecom?

Are open-weight models competitive on agentic benchmarks?

Can I trust a 99.3% benchmark score for production?

Primary sources

Leave a Reply Cancel reply

More Popular from Alatirok

Tokens Per Agentic Coding Task: The 2026 Variance Data

What Is Cognition Devin? The Enterprise Guide for 2026

What Is Circle Agent Stack? USDC Wallets for AI Agents

AI Agent Identity: Entra Agent ID vs Okta vs SailPoint

Why Does My AI Agent Context Window Fill Up So Fast?

Migrate OpenAI Agent Builder to Agents SDK Before Nov 30

Best Voice AI Agent Framework 2026: Vapi vs LiveKit vs Pipecat

Purpose-Built Legal AI vs General LLM: 2026 Verdict

Categories

Quick Links