Agent Harness Benchmark Score: The Point-Swing Data (2026)

We assembled the one normalized table nobody owns: hold each model fixed, and watch the agent harness benchmark score swing as far as a full model generation does.

What does the agent harness benchmark score data actually show?

Changing only the agent harness, with the model held completely fixed, moves benchmark scores by 7 to 34 percentage points: a swing that frequently equals or exceeds an entire model-generation upgrade. As of 2026, nobody disputes that finding, but almost nobody quantifies it side by side. This article does.

Every incumbent post asserts “the harness matters” and then offers exactly one anecdote. The problem is that one anecdote is not a budget input. If you are a practitioner deciding whether to spend three weeks migrating to a newer model or three days rebuilding your tool layer, you need the point-swing per benchmark, per model, lined up against what a model jump buys you. The two best 2026 data sources for this are HAL (the Holistic Agent Leaderboard out of Princeton, published at ICLR 2026) and Harness-Bench, and neither one had been normalized into a single decision table — until now.

Here is the short version before we get into the numbers. HAL ran 21,730 agent rollouts across 9 models and 9 benchmarks for roughly $40,000 of compute, and its three-dimensional analysis (models x scaffolds x benchmarks) is the largest controlled look we have at the agent scaffold effect on benchmark scores. Harness-Bench did the complementary experiment: same task set, same model-backend pool, swap only the harness, and the aggregate score moved 23.8 points. The conclusion both papers reach is the same: agent capability should be reported at the model-harness level, not attributed to the base model alone.

[Figure: paired-bar chart contrasting agent harness point swings against model-generation deltas]

An agent harness (or scaffold) is everything wrapping the model: the system prompt, the tool definitions, the context-management and middleware hooks, the retry/verify loop, and the reasoning-effort budget. The benchmark score you read is a property of the model embedded in that harness — not the model alone.
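One way to make that definition concrete is to treat the harness as an explicit configuration object that travels with every score. The sketch below is illustrative only: the `HarnessConfig` and `ScoredRun` names, their fields, and every field value except the model, benchmark, and score are assumptions that mirror the components listed above, not an API or configuration taken from HAL or Harness-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of everything wrapped around the model."""
    name: str
    system_prompt: str
    tools: tuple[str, ...]
    context_strategy: str   # assumed labels such as "truncate" or "summarize"
    verify_loop: bool
    reasoning_effort: str   # e.g. "high" or "xhigh"


@dataclass(frozen=True)
class ScoredRun:
    """A benchmark score tied to the model-harness pair that produced it."""
    model: str
    harness: HarnessConfig
    benchmark: str
    score_pct: float

    def label(self) -> str:
        # Always name the model AND the harness next to the number.
        return (f"{self.model} + {self.harness.name} "
                f"on {self.benchmark}: {self.score_pct:.1f}%")


# Placeholder harness details; only the name and the score come from the article.
harness = HarnessConfig(
    name="HAL Generalist",
    system_prompt="(placeholder)",
    tools=("placeholder_tool_a", "placeholder_tool_b"),
    context_strategy="summarize",
    verify_loop=True,
    reasoning_effort="high",
)
run = ScoredRun("Claude Opus 4", harness, "GAIA", 64.9)
print(run.label())
```

Freezing both dataclasses is a deliberate choice in this sketch: a stored score should not silently outlive a change to the harness that produced it.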

Does the harness matter more than the model? The normalized swing table

Yes — holding the model fixed and changing only the harness can move a benchmark score more than upgrading to the next model generation does. The table below is the one comparison the thin Medium and vendor posts never assemble: each row holds a model constant and shows the harness-driven swing, then sets it against the best documented model-generation delta of 2026 so you can see they live on the same scale.

Read the table as “same model, different harness” except for the final row, which is the control: a genuine model upgrade (Claude Opus 4.5 to 4.7) lifted SWE-bench Verified by 6.8 points. Notice that nearly every harness swing in the table rivals or beats that 6.8-point model jump — and the Sonnet 4.5 row dwarfs it in the wrong direction.

“A bad harness can hide a frontier model’s capability entirely — Sonnet 4.5 fell 34 points across scaffolds on the same benchmark, the same model.”
Holistic Agent Leaderboard (HAL), Princeton, ICLR 2026

Model (fixed)	Benchmark	Harness change	Score swing	Magnitude vs a model-gen jump
Claude Opus 4	GAIA	HF Open Deep Research -> HAL Generalist	57.6% -> 64.9% (+7.3)	~1.1x a model jump
Claude Opus 4.5	GAIA	Claude Code -> SEAL	~ +9.5 pts	~1.4x a model jump
Claude Sonnet 4.5	(one HAL benchmark)	best -> worst scaffold	68% -> 34% (-34)	~5x a model jump, downward
GPT-5.2-Codex (LangChain)	Terminal-Bench 2.0	baseline -> engineered harness	52.8% -> 66.5% (+13.7)	~2x a model jump
— (Harness-Bench, mixed pool)	aggregate suite	OpenClaw -> NanoBot	52.4% -> 76.2% (+23.8)	~3.5x a model jump
Claude Opus 4.5 -> 4.7 (CONTROL)	SWE-bench Verified	model upgrade, harness fixed	80.8% -> 87.6% (+6.8)	1x (the reference)

Harness-driven point swings (model held fixed) vs the 2026 model-generation delta (control row).
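The last column of the table is plain division, and it is worth re-deriving rather than taking on trust. The snippet below is a small check that uses only the figures in the table; the variable and function names are ours.

```python
MODEL_GEN_DELTA = 6.8  # Opus 4.5 -> 4.7 on SWE-bench Verified (control row)

# (row label, harness-driven swing in points), copied from the table
HARNESS_SWINGS = [
    ("Opus 4 / GAIA", 64.9 - 57.6),
    ("Opus 4.5 / GAIA", 9.5),
    ("Sonnet 4.5 / one HAL benchmark", 34.0 - 68.0),
    ("GPT-5.2-Codex / Terminal-Bench 2.0", 66.5 - 52.8),
    ("Harness-Bench aggregate", 76.2 - 52.4),
]


def ratio_to_model_jump(swing_pts: float) -> float:
    """Size of a swing, ignoring direction, in units of the model-gen delta."""
    return round(abs(swing_pts) / MODEL_GEN_DELTA, 1)


ratios = {label: ratio_to_model_jump(swing) for label, swing in HARNESS_SWINGS}
for label, ratio in ratios.items():
    print(f"{label}: {ratio}x a model jump")
```

The ratios come out at roughly 1.1x, 1.4x, 5.0x, 2.0x, and 3.5x, so every harness row is at least as large as the control, and the Sonnet 4.5 row is the largest in absolute terms.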

How big is the agent scaffold effect on benchmark scores? (the chart)

21,730 agent rollouts in HAL (9 models x 9 benchmarks, ~$40k compute)

34 pts: Sonnet 4.5 cross-scaffold gap (68% -> 34% on one HAL benchmark)

+6.8 pts: Opus 4.5 -> 4.7 on SWE-bench, the model-generation control delta

The agent scaffold effect ranges from about +7 points to a brutal -34 points per benchmark, and the documented upside swings run from roughly 1.1x to 3.5x what a single model generation delivers. The chart plots each fixed-model harness swing against the lone model-generation control bar so the comparison is visual, not rhetorical.

The takeaway from the paired-bar view: if your agent harness benchmark score is mediocre, the data says you are more likely sitting on a scaffold problem than a model problem. The model-generation bar (+6.8) is the shortest positive bar on the chart. That is the headline practitioners keep missing because the marketing cycle is organized around model launches, not harness releases.

Harness swing vs model-gen delta, same benchmark — Positive bars are harness wins; the Sonnet 4.5 bar shows how far a bad scaffold can sink a frontier model. The control bar (+6.8) is a full model generation.

HAL holistic agent leaderboard data: what the 21,730 rollouts revealed

The HAL holistic agent leaderboard data is the strongest controlled evidence that the harness is a first-class variable: across 21,730 rollouts, varying only the scaffold produced swings as large as the gaps between models. HAL’s entire premise is that prior leaderboards conflated model and harness, so it built a standardized harness layer (the open-source hal-harness on GitHub) that orchestrates parallel evaluation across hundreds of VMs and runs identical scaffolds against every model.

Two HAL findings matter most for this article. First, the GAIA result: Claude Opus 4 scores 64.9% under HAL’s Generalist scaffold but only 57.6% under HF Open Deep Research on the same GAIA tasks — a 7.3-point swing from the harness alone, no model change. Hold a newer model (Opus 4.5) fixed and vary only the harness, and the GAIA spread widens to roughly 9.5 points between SEAL and Claude Code. Second, the downside: Claude Sonnet 4.5 showed a 34-point cross-scaffold gap (68% collapsing to 34%) on one benchmark — proof that scaffolding can erase frontier capability.

HAL’s other headline finding is a warning to anyone who maxes settings by reflex: higher reasoning effort reduced accuracy in the majority of runs while inflating token cost. The harness decision and the reasoning-budget decision are entangled — which is exactly the trap LangChain hit on Terminal-Bench, covered below.

These are controlled, single-variable swaps: same model weights, same benchmark tasks, only the scaffold differs. That is what makes the swings attributable to the harness rather than to noise or model differences. Treat any ‘harness matters’ claim that isn’t a fixed-model comparison as marketing, not measurement.
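A fixed-model comparison is easy to enforce mechanically. The sketch below is a hypothetical helper, not part of hal-harness: it refuses to report a swing unless the two runs share a model and a benchmark and differ in scaffold. The `Run` type and `harness_swing` function are names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    benchmark: str
    scaffold: str
    score_pct: float


def harness_swing(a: Run, b: Run) -> float:
    """Return b minus a in points, only for a fixed-model, fixed-benchmark pair."""
    if a.model != b.model or a.benchmark != b.benchmark:
        raise ValueError("not a controlled comparison: model or benchmark differs")
    if a.scaffold == b.scaffold:
        raise ValueError("scaffold is identical, so nothing was varied")
    return round(b.score_pct - a.score_pct, 1)


# The GAIA pair reported by HAL for Claude Opus 4.
baseline = Run("Claude Opus 4", "GAIA", "HF Open Deep Research", 57.6)
improved = Run("Claude Opus 4", "GAIA", "HAL Generalist", 64.9)
swing = harness_swing(baseline, improved)
print(f"harness-only swing: {swing:+.1f} points")
```

Raising an error instead of returning a number is the point of the design: a comparison that mixes a model change with a scaffold change should never produce a figure that looks like a harness effect.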

Claude Code vs generic scaffold benchmark: the largest documented jumps

Claude Code versus a generic scaffold is where the agent harness benchmark score gap gets dramatic — purpose-built coding harnesses repeatedly outscore bare or generic scaffolds by double digits on the same model. In the 2026 field reports, swapping a model out of a generic CORE-Agent-style scaffold into a coding-optimized harness like Claude Code lifted task success from the 40s into the 70s+ on the same weights, before any further tuning. The exact figure depends on the benchmark and harness pair, but the direction is unanimous across HAL, Harness-Bench, and vendor write-ups.

The mechanism is concrete, not magical. A coding harness adds a test-execution loop so the agent gets a hill-climbing signal, injects repository structure upfront instead of making the model re-discover it every turn, and constrains tool calls to the ones that actually matter. LangChain’s own blog spells this out: they kept GPT-5.2-Codex fixed and moved their Deep Agents coding harness from outside the top 30 to rank 5 on Terminal-Bench 2.0, lifting the score from 52.8% to 66.5% by changing only system prompts, tools, and middleware hooks.
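The test-execution loop is the easiest of those mechanisms to sketch. What follows is a minimal illustration, assuming a `propose` callable that stands in for the model call and a `run_tests` callable that returns a pass flag plus test output; both are hypothetical, and this is not how Claude Code or Deep Agents is implemented.

```python
from typing import Callable, Optional


def solve_with_verify_loop(
    propose: Callable[[str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry a task, feeding failing test output back into the next prompt."""
    feedback = ""
    for _ in range(max_attempts):
        prompt = task if not feedback else f"{task}\n\nPrevious test output:\n{feedback}"
        candidate = propose(prompt)
        passed, feedback = run_tests(candidate)
        if passed:
            return candidate
    return None  # budget exhausted without a passing candidate


# Stand-ins so the loop can run without a model or a real test suite.
attempts = []


def fake_propose(prompt: str) -> str:
    attempts.append(prompt)
    # The stub only "fixes" its answer once it has seen test output.
    return "a + b" if "Previous test output" in prompt else "a - b"


def fake_run_tests(candidate: str) -> tuple[bool, str]:
    ok = candidate == "a + b"
    return ok, "" if ok else "add(2, 3) returned -1, expected 5"


result = solve_with_verify_loop(fake_propose, fake_run_tests, "Implement add(a, b).")
print(result, "after", len(attempts), "attempts")
```

Without the feedback line in the prompt, the second attempt would have no more information than the first, which is the difference between a retry and a hill-climbing signal.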

This is the Claude Code vs generic scaffold benchmark lesson in one line: the model supplies raw reasoning, but the harness decides how much of that reasoning survives contact with a real task. Generic scaffolds leak capability; specialized ones conserve it.

Pros

Harness changes ship in days, not weeks — no model migration, no re-quals
Swings often exceed a model generation (+13.7 LangChain, +23.8 Harness-Bench vs +6.8 model-gen)
Improvements are portable: a better scaffold lifts every model you run through it
You control it end to end — verify loops, context, tools, and reasoning budget are all yours

Cons

Harness gains are benchmark-specific; a coding harness won’t help a web-navigation task
Easy to over-fit the scaffold to one benchmark and regress elsewhere
Reasoning-budget interactions are non-obvious (xhigh can lose to high via timeouts)
A genuinely smarter model still raises the ceiling the harness can reach

Same model, different harness score: why the gap exists

The same model produces a different harness score because the benchmark measures what the harness lets the model observe, modify, recover from, and verify — not just what the model can infer. Two teams running identical weights on an identical benchmark can land 20-plus points apart purely on scaffold design, which is why Harness-Bench reports a 23.8-point aggregate gap between its best harness (NanoBot, 76.2%) and worst (OpenClaw, 52.4%) under the same model-backend pool.

Break the gap into its components and it stops looking mysterious. Harness-Bench decomposed the 23.8-point aggregate swing into completion score (NanoBot 81.6% vs OpenClaw 60.0%, a 21.6-point gap), process quality (93.8% vs 79.5%, 14.3 points), and tool-use appropriateness (91.7% vs 70.9%, 20.8 points). The worse harness wasn’t running a worse model — it was wasting tool calls, recovering poorly from errors, and completing fewer tasks end to end.
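The component gaps can be re-derived and ranked in a few lines. The snippet uses only the NanoBot and OpenClaw figures quoted above; how Harness-Bench weights the components into the aggregate score is not stated here, so the sketch does not try to reproduce it.

```python
# (NanoBot %, OpenClaw %) per component, as quoted from Harness-Bench
COMPONENTS = {
    "completion": (81.6, 60.0),
    "process quality": (93.8, 79.5),
    "tool-use appropriateness": (91.7, 70.9),
}

gaps = {
    name: round(best - worst, 1)
    for name, (best, worst) in COMPONENTS.items()
}
largest = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap} points")
print("largest gap:", largest)
```

Completion and tool-use appropriateness account for the two widest gaps, at 21.6 and 20.8 points, with process quality trailing at 14.3.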

If you only remember one operational takeaway from the same-model-different-harness data, make it this: instrument your harness like production code. Log tool-call appropriateness and recovery rate, not just the final score. The aggregate benchmark number hides exactly the levers you can move fastest.
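A minimal version of that instrumentation might look like the following. It assumes each tool call has already been labeled as appropriate or not (by a rubric or reviewer you supply) and that the harness records whether an errored call was later recovered; the `ToolCall` fields and metric names are our own, not Harness-Bench's definitions.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    appropriate: bool        # label from your own rubric (assumed to exist)
    errored: bool
    recovered: bool = False  # only meaningful when errored is True


def harness_metrics(calls: list[ToolCall]) -> dict[str, float]:
    """Summarize tool-call appropriateness and error recovery for one run."""
    total = len(calls)
    errors = [c for c in calls if c.errored]
    return {
        "tool_appropriateness": sum(c.appropriate for c in calls) / total if total else 0.0,
        # With no errors there is nothing to recover from, so report 1.0.
        "recovery_rate": sum(c.recovered for c in errors) / len(errors) if errors else 1.0,
    }


# Made-up trace for illustration only.
trace = [
    ToolCall(appropriate=True, errored=False),
    ToolCall(appropriate=True, errored=True, recovered=True),
    ToolCall(appropriate=False, errored=True, recovered=False),
    ToolCall(appropriate=True, errored=False),
]
metrics = harness_metrics(trace)
print(metrics)
```

Logged per run next to the final score, these two numbers show whether a regression came from wasted tool calls or from poor error recovery.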

A benchmark score is a model-harness configuration result. Report it as ‘Opus 4.7 + Claude Code on SWE-bench Verified,’ never as ‘Opus 4.7 scores 87.6%.’ The bare number isn’t reproducible.

Harness-Bench results 2026 and how to act on them

Fix the harness before you upgrade the model

Across HAL’s 21,730 rollouts and Harness-Bench’s controlled suite, the agent harness benchmark score swings 7-34 points on a fixed model — routinely outpacing the +6.8-point gain of a full model generation. With frontier SWE-bench scores bunched inside ~2 points, the scaffold is now the dominant, fastest, and cheapest lever. Benchmark model-plus-harness as one unit, cap reasoning effort, add a verify loop, and only chase a new model after the harness plateaus.

The Harness-Bench results in 2026 close the loop HAL opened: across realistic multi-step workflows, the configurable harness alone explains a ~24-point band of performance, which should reorder how you spend engineering budget. Put HAL and Harness-Bench together and the practitioner playbook is clear. Before you buy a model upgrade, exhaust your harness first.

Concretely: (1) Pin a known-good baseline harness (Claude Code for coding, NanoBot or a HAL Generalist-style scaffold for general agents) and measure your current scaffold against it on your own tasks. (2) Cap reasoning effort per task type rather than maxing it; HAL found higher reasoning effort reduced accuracy in most runs, and LangChain found high beat xhigh once timeouts entered. (3) Add a verify/test loop so the agent has a signal to hill-climb. (4) Only after the harness plateaus should a model-generation jump (the +6.8-point class of move) be on the table.
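Step 4 depends on knowing when the harness has plateaued, which none of the sources define. One possible rule, offered as an assumption rather than anything taken from HAL or Harness-Bench, is to call a plateau when the last few harness iterations fail to beat the earlier best by some minimum margin. The window size, the threshold, and the score histories below are all made up.

```python
def harness_plateaued(
    scores: list[float],
    window: int = 3,
    min_gain_pts: float = 1.0,
) -> bool:
    """True when the last `window` iterations beat the earlier best by < min_gain_pts."""
    if len(scores) <= window:
        return False  # not enough history to call a plateau
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < min_gain_pts


# Illustrative score histories (percent), one entry per harness iteration.
still_climbing = [50.0, 55.0, 60.0, 65.0]
flattened = [50.0, 60.0, 60.4, 60.2, 60.5]

print(harness_plateaued(still_climbing))
print(harness_plateaued(flattened))
```

Set `min_gain_pts` above your benchmark's run-to-run noise; otherwise random variation can mask a real plateau.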

For context on where the model ceiling actually is in mid-2026: SWE-bench Verified leaders now sit near 88-89% (Claude Opus 4.8 at 88.6%, GPT-5.5 at 88.7%), and Claude Opus 4.7 holds 87.6%. The frontier models are bunched within ~2 points of each other on this benchmark. When the models are that close, the harness is no longer the tiebreaker — it is the whole game. If you want to go deeper on which benchmarks even predict production value, read our companion pieces on agentic AI benchmarks for 2026, why SWE-bench doesn’t predict engineering value, and the effective context-length scoreboard.

Builder’s take

I run evals on Cyntr and Loomfeed constantly, and the single most expensive mistake I see teams make is attributing a score to a model when they’re really measuring their scaffold. Here is what the 2026 data actually changed in how I budget engineering time.

Stop benchmarking models. Benchmark model-plus-harness configurations, and report them as one unit — a bare model number is not falsifiable in production.
Before you pay for a model upgrade, A/B your harness against a known-good one (Claude Code, NanoBot). A harness swap is usually cheaper than a model migration and the HAL/Harness-Bench data says it often moves the score more.
Reasoning effort is not free. HAL found higher reasoning effort hurt accuracy in most runs and burned tokens. I cap it per task type rather than maxing it globally; maxing it is exactly what sank LangChain’s xhigh config.
The scary number is the downside swing. Sonnet 4.5 dropping from 68% to 34% on one benchmark across scaffolds means a bad harness can hide a frontier model’s capability entirely. Instrument the harness, not just the model.
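The per-task-type cap on reasoning effort can be a small lookup in front of whatever client you use. In this sketch the effort ladder, the task types, and the cap values are placeholders chosen for illustration; only the "high" and "xhigh" labels appear in the sources discussed above.

```python
# Ordered from cheapest to most expensive; "low" and "medium" are assumed labels.
EFFORT_LEVELS = ["low", "medium", "high", "xhigh"]

# Hypothetical caps; tune these against your own evals.
EFFORT_CAPS = {"coding": "high", "web_navigation": "medium"}
DEFAULT_CAP = "medium"


def capped_effort(task_type: str, requested: str) -> str:
    """Clamp the requested reasoning effort to the cap for this task type."""
    cap = EFFORT_CAPS.get(task_type, DEFAULT_CAP)
    if EFFORT_LEVELS.index(requested) > EFFORT_LEVELS.index(cap):
        return cap
    return requested


print(capped_effort("coding", "xhigh"))
print(capped_effort("coding", "low"))
print(capped_effort("data_entry", "xhigh"))
```

Clamping instead of overriding means a caller can still ask for less effort than the cap, which keeps cheap tasks cheap.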

Frequently asked questions

Does the agent harness matter more than the model?

Often, yes. In 2026 controlled data (HAL and Harness-Bench), changing only the harness while holding the model fixed swings benchmark scores 7-34 points — frequently more than the +6.8-point gain of a full model-generation upgrade (Claude Opus 4.5 to 4.7 on SWE-bench Verified). The harness matters most when frontier models are bunched within a couple of points, as they are now.

What is an agent harness benchmark score?

It is the benchmark result of a model running inside a specific scaffold — the system prompt, tools, context management, verify/retry loop, and reasoning budget. Because the harness determines what the model can observe, modify, recover from, and verify, the score is a property of the model-harness configuration, not the model alone. Always report both.

How much can the scaffold change a benchmark score?

HAL found Claude Opus 4 swings 7.3 points on GAIA (57.6% to 64.9%) just from the harness, and Claude Sonnet 4.5 fell 34 points (68% to 34%) across scaffolds on one benchmark. Harness-Bench reported a 23.8-point aggregate gap (52.4% to 76.2%) between its worst and best harness on the same model pool.

What is the HAL holistic agent leaderboard?

HAL (Holistic Agent Leaderboard) is a Princeton project published at ICLR 2026 (arXiv 2510.11977). It ran 21,730 agent rollouts across 9 models and 9 benchmarks for about $40,000, using a standardized open-source harness so model, scaffold, and benchmark can be analyzed as separate variables. It is the largest controlled look at the agent scaffold effect to date.

Why did LangChain jump more than 25 spots on Terminal-Bench without changing the model?

LangChain kept GPT-5.2-Codex fixed and rebuilt only the harness — system prompts, tools, and middleware hooks — moving their Deep Agents coding agent from outside the top 30 to rank 5 on Terminal-Bench 2.0, lifting the score from 52.8% to 66.5% (+13.7 points). They also found a high reasoning budget beat xhigh, which lost time to timeouts.

Should I improve my harness or upgrade my model first?

Improve the harness first. It ships in days instead of weeks, the gains often exceed a model generation, and a better scaffold lifts every model you run through it. Pin a known-good baseline (Claude Code for coding, NanoBot/HAL Generalist for general agents), cap reasoning effort per task, add a verify loop, and only pursue a model jump once the harness plateaus.

Primary sources

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation (arXiv 2510.11977) — arXiv / Princeton
HAL: Holistic Agent Leaderboard — Princeton
Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows (arXiv 2605.27922) — arXiv
Improving Deep Agents with harness engineering — LangChain
Agent Benchmark Scores Are Measuring the Harness, Not the Model — Focused Labs
SWE-Bench 2026: Claude Opus 4.7 Wins 87.6% — TokenMix
Claude Opus 4.8 Benchmarks Explained — Vellum
princeton-pli/hal-harness — GitHub

Last updated: June 6, 2026. Related: Observability.

