Harvey Legal Agent Benchmark — what the all-pass scoring actually means

Surya Koritala

The Harvey Legal Agent Benchmark landed on May 6, 2026 with an unusually strict premise: legal agents should be judged on whether they complete real work end to end, not whether they collect partial credit on narrow tasks. Harvey says LAB — short for Legal Agent Benchmark — includes 1,200+ tasks across 24 practice areas and 75,000+ expert-written rubric criteria, with code and data released on GitHub alongside the company’s announcement.

  • LAB release date: May 6, 2026 (announced by Harvey)
  • Tasks: 1,200+ (across legal workflows)
  • Practice areas: 24 (coverage claimed by Harvey)
  • Rubric criteria: 75,000+ (expert-written grading points)
  • Grading method: all-pass (no partial credit)

Harvey introduced LAB, its Legal Agent Benchmark, as an open-source benchmark aimed at measuring extended legal workflows rather than isolated reasoning questions. In the company’s framing, the problem with many existing evaluations is not that they are useless; it is that they often test a slice of legal cognition instead of the full chain of work a legal team would actually ship. That makes the Harvey Legal Agent Benchmark notable beyond legal tech: it is trying to move agent evaluation closer to production conditions.

The release details are concrete. Harvey says LAB includes 1,200+ tasks across 24 legal practice areas, backed by 75,000+ expert-written rubric criteria. Task instructions average about 50 words, and the benchmark is designed to assess end-to-end legal work rather than single-clause extraction or short-form Q&A. One example in Harvey’s launch post is an M&A deal review task with 57 criteria across 9 legal issues.

That scope is what separates LAB from older legal benchmarks such as LegalBench, CUAD, LEXam, and BigLaw Bench, which have each been useful for narrower slices of legal AI evaluation. Harvey’s argument is that legal agents now need testing closer to how firms and in-house teams actually use them: multi-step, domain-specific, and unforgiving when an omission changes the output.

Image: Harvey blog post introducing the Legal Agent Benchmark (source page; used under fair use).

Harvey released LAB on May 6, 2026 as an open-source benchmark for long-horizon legal agent work, with code and dataset on GitHub.

“The goal of LAB is to provide a clear picture of how agents can be deployed to support legal work”

Harvey researchers, introducing LAB
https://github.com/harveyai/harvey-labs
Harvey’s GitHub repository for LAB

The dataset shape is the real story

Benchmarks are often discussed as if the scoring method is the whole product. In LAB, the dataset may matter just as much. Harvey says the benchmark spans 24 practice areas and uses more than 75,000 rubric criteria written by experts. That is a large hand-curated investment, and it is the kind of asset that is difficult to reproduce quickly. Anyone can announce a benchmark; far fewer can assemble a legal dataset with this many task-specific grading points.

That matters because legal work is highly contextual. A benchmark built around one-shot questions can miss the operational reality of legal review, where the output has to account for multiple issues at once, often under a client or transaction-specific frame. Harvey’s example of an M&A review with 57 criteria across 9 issues shows what the company is trying to capture: not whether a model spots a problem, but whether it catches the full set of issues that make the work product usable.
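To make the "criteria grouped under issues" shape concrete, here is a minimal Python sketch. It is not LAB's actual schema: the issue names, criterion wording, and the `issue_coverage` helper are all hypothetical, chosen only to show how a grader's verdicts could be rolled up per issue so that an incomplete issue stands out.

```python
from collections import defaultdict

# (issue, criterion, satisfied) verdicts for one task.
# Every name below is invented for illustration, not taken from LAB.
verdicts = [
    ("change-of-control", "flags consent requirement", True),
    ("change-of-control", "identifies affected contracts", True),
    ("indemnification", "states cap amount", True),
    ("indemnification", "notes survival period", False),
    ("ip-assignment", "confirms chain of title", True),
]


def issue_coverage(rows):
    """Map each issue to (criteria satisfied, criteria total)."""
    counts = defaultdict(lambda: [0, 0])
    for issue, _criterion, satisfied in rows:
        counts[issue][0] += int(satisfied)
        counts[issue][1] += 1
    return {issue: (met, total) for issue, (met, total) in counts.items()}


coverage = issue_coverage(verdicts)
# Issues where at least one criterion was missed.
incomplete = [issue for issue, (met, total) in coverage.items() if met < total]

print(coverage)
print(incomplete)  # ['indemnification']
```

In this toy case the agent spotted all three issues, yet one of them is only half covered, which is the gap between spotting a problem and producing usable work product.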

The Harvey Legal Agent Benchmark also creates a strategic asset for Harvey itself. If the dataset becomes a standard reference point for legal-agent evaluation, Harvey gains influence over what the market treats as progress. That does not invalidate the benchmark. It does mean competitors and model providers will have to decide whether to adopt Harvey’s framing, contribute to it, or build rival standards with different assumptions.

Benchmark | Focus | What it tends to measure
LAB | Long-horizon legal work | End-to-end legal task completion with rubric-based grading
LegalBench | Legal reasoning tasks | Narrower legal sub-tasks and question formats
CUAD | Contract analysis | Clause and obligation extraction from contracts
SWE-Bench Verified | Software engineering | Issue resolution against real code repositories
OSWorld-Verified / BrowseComp / FinanceAgent | General agentic work | Task completion in browsing, operating systems, or finance contexts

LAB is positioned as a benchmark for end-to-end legal workflows rather than single-task legal QA or extraction.

Why all-pass scoring changes the interpretation of results

The key takeaway: expect low-looking scores

LAB’s all-pass design is likely to push headline scores down, because a single missed criterion zeros out the whole task. That makes the benchmark a closer reflection of legal production standards than pass-rate metrics that award partial credit.

The most consequential part of the launch is the grading rule. Harvey says LAB uses an all-pass method with no partial credit: a system either satisfies every criterion for a task or receives a zero for that task. For legal work, that is a much harsher standard than the pass-rate conventions many readers know from benchmarks in coding, reasoning, or question answering. It is also much closer to how legal outputs are judged in practice. A review that misses a required issue is not partly production-ready; it is incomplete.
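The grading rule is simple enough to sketch in a few lines of Python. This is an illustration, not Harvey's grader: the `TaskResult` shape, the function names, and the task IDs are hypothetical, and averaging per-task results into a benchmark-level score is an assumption, since the announcement as described here does not spell out how task results are aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Grader verdicts for one task: one boolean per rubric criterion (hypothetical shape)."""
    task_id: str
    criteria_passed: tuple[bool, ...]


def all_pass_score(result: TaskResult) -> int:
    """Return 1 only if every criterion was satisfied, else 0 (no partial credit)."""
    return int(all(result.criteria_passed))


def partial_credit_score(result: TaskResult) -> float:
    """Return the fraction of criteria satisfied, shown for comparison only."""
    return sum(result.criteria_passed) / len(result.criteria_passed)


def benchmark_score(results: list[TaskResult]) -> float:
    """Share of tasks that cleared every criterion (assumed aggregation)."""
    return sum(all_pass_score(r) for r in results) / len(results)


# A 57-criterion task (the size of Harvey's M&A example) with a single miss.
near_miss = TaskResult("ma-review", (True,) * 56 + (False,))
clean = TaskResult("nda-review", (True,) * 12)

print(all_pass_score(near_miss))                  # 0
print(round(partial_credit_score(near_miss), 3))  # 0.982
print(benchmark_score([near_miss, clean]))        # 0.5
```

The same output that would earn about 98 percent under partial credit scores zero under all-pass, which is the whole difference in interpretation.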

That is why the Harvey Legal Agent Benchmark could produce scores that look surprisingly low even if the underlying systems are improving. A model might identify most of the right issues in a contract, summarize them cleanly, and still fail the task because it omitted one criterion that matters to the rubric. In a partial-credit benchmark, that effort might look strong. Under all-pass grading, it looks like failure. The scoring is not saying the model did nothing useful. It is saying the model did not clear the bar for dependable legal work.
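A rough back-of-the-envelope calculation shows why scores can look low. The sketch below assumes, purely for illustration, that an agent satisfies each criterion independently at the same rate; real criteria are almost certainly correlated, so treat the numbers as intuition rather than a prediction. The function name is hypothetical, and 57 is the criterion count from Harvey's M&A example.

```python
def all_pass_probability(per_criterion_rate: float, n_criteria: int) -> float:
    """Chance of clearing every criterion if each is met independently at the same rate."""
    if not 0.0 <= per_criterion_rate <= 1.0:
        raise ValueError("per_criterion_rate must be between 0 and 1")
    return per_criterion_rate ** n_criteria


N_CRITERIA = 57  # size of the M&A deal review example

for rate in (0.90, 0.95, 0.99):
    p = all_pass_probability(rate, N_CRITERIA)
    print(f"{rate:.0%} per criterion -> {p:.1%} all-pass")
```

Under that simplifying assumption, an agent that gets 95 percent of individual criteria right would clear a 57-criterion task only about 5 percent of the time, and even 99 percent per-criterion accuracy yields a pass rate of roughly 56 percent.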

For the same reason, LAB should not be read as a leaderboard sport on day one. If legal teams care about whether an agent can be trusted to complete a workflow without silent misses, all-pass scoring is a more realistic proxy than averaged token-level overlap or issue-spotting recall alone. The benchmark is effectively asking a binary question: would you ship this output into a legal workflow without a human having to rescue it?

An all-pass score is not a measure of partial usefulness. It is a measure of whether the agent completed the full legal task to the benchmark’s standard.

No partial credit means production realism

No public leaderboard at launch is a strategic choice

Harvey did not launch LAB with a public leaderboard. The company said it plans to develop baseline results with research partners first and then publish them. That is unusual enough to deserve attention. In benchmark culture, the leaderboard is often the launch. Harvey chose to publish the benchmark before publishing the race.

There is a practical reason for that decision. Without baseline context, early scores on a hard all-pass benchmark are easy to misread. A vendor could post a result that sounds impressive in isolation while avoiding the harder question of what the score means relative to the task design. Holding back the leaderboard reduces the incentive for immediate benchmark gaming and gives Harvey time to establish a baseline interpretation with partners before the marketing cycle starts.

It also suggests Harvey knows the first round of results may be sobering. The company does not say that outright, and it has not published model scores. Still, the structure of the Harvey Legal Agent Benchmark points toward a benchmark where early performance could look much lower than readers are used to seeing in coding evals or broad reasoning tests. That is not a flaw in the benchmark. It may be the point.

Pros
  • Reduces immediate PR-driven score cherry-picking
  • Lets Harvey and partners establish baseline interpretation first
  • Keeps attention on benchmark design rather than one-off vendor claims
Cons
  • Delays external comparison across models
  • Leaves room for speculation about current capability levels
  • Concentrates early narrative control with Harvey and its partners

Harvey says it will work with research partners on baseline results before publishing a leaderboard, limiting early leaderboard-gaming.

Compared with SWE-Bench and other agent evals, LAB may expose a wider capability gap

One useful way to frame LAB is against the benchmarks that have shaped agent discourse elsewhere. In software engineering, SWE-Bench Verified has become a reference point for whether coding agents can resolve real repository issues. In broader agentic evaluation, readers now track names like OSWorld-Verified, BrowseComp, Terminal-Bench 2.0, GDPval, and FinanceAgent. Those benchmarks differ in setup, but they share a common role: they translate abstract model capability into task completion under constraints.

LAB is trying to do that for legal work, but with a stricter notion of completion. That is why comparisons to coding benchmarks need care. A score on SWE-Bench Verified is not directly comparable to a score on LAB because the domains, grading rules, and failure costs differ. Still, if the first published LAB baselines come in far below the headline numbers people associate with top coding agents, the broader message will be hard to miss: legal-agent capability may be materially further behind code-agent capability than the market narrative suggests.

That would fit the nature of the work. Legal tasks often require issue spotting, procedural judgment, domain-specific drafting norms, and a lower tolerance for omissions. The Harvey Legal Agent Benchmark is built to surface those weaknesses rather than smooth them over. If that leads to lower-looking pass rates, the benchmark may end up doing something healthy for the market: separating demos that feel competent from systems that can actually clear a professional standard.

For competitors such as Spellbook, Robin AI, and EvenUp, LAB creates a decision point. If they adopt the benchmark, they validate Harvey’s framing of what legal-agent progress should look like. If they ignore it, buyers may ask why. If they build alternatives, the market could split into competing standards, with each benchmark emphasizing different slices of legal work. That is a familiar pattern in AI infrastructure: the first benchmark to gain mindshare often shapes product roadmaps as much as it measures them.

For buyers, the immediate lesson is simpler. Ask what a vendor means by success. A legal AI system that performs well on extraction, summarization, or clause classification may still struggle on a long-horizon task with all-pass grading. The Harvey Legal Agent Benchmark puts pressure on vendors to show not just that their systems are helpful, but that they can complete a workflow without hidden misses that force a human reviewer to reconstruct the work.

That is why LAB matters even before the first leaderboard appears. It gives the legal AI market a stricter vocabulary for discussing reliability. If Harvey and its research partners can publish credible baselines and the broader ecosystem starts testing against them, LAB could become less a marketing artifact than a forcing function. The benchmark would then do what the best evals do: make it harder to confuse partial competence with deployable performance.

Frequently asked questions

What is LAB in Harvey’s announcement?

LAB stands for Legal Agent Benchmark. Harvey introduced it on May 6, 2026 as an open-source benchmark for evaluating extended legal workflows, with details in the company’s launch post and code on GitHub.

What does all-pass scoring mean in the Harvey Legal Agent Benchmark?

Harvey says LAB uses an all-pass grading method, meaning a system must satisfy every rubric criterion for a task or receive a zero. Harvey describes that approach in its announcement, and it is one reason LAB may produce lower-looking scores than benchmarks that allow partial credit.

Why didn’t Harvey publish a leaderboard at launch?

Harvey said it plans to develop baseline results with research partners first and then publish them. That decision is described in the company announcement and discussed in LawNext’s analysis.

How is LAB different from LegalBench or CUAD?

Harvey positions LAB as a benchmark for long-horizon, end-to-end legal work rather than narrower legal tasks such as question answering or clause extraction. Harvey makes that distinction in its launch post, while LawNext highlights the benchmark’s broader workflow orientation.


Last updated: May 23, 2026.
