Side-by-side comparison of MMMU-Pro image reasoning and Video-MME video understanding leaderboards for 2026 frontier AI models

Multimodal AI Benchmarks 2026: The Still vs Video Split

Surya Koritala
Last updated: June 1, 2026 1:47 am

GPT-5.4 Pro and Claude own image and chart reasoning. Kimi K2.5, Gemini and Qwen own video. In 2026, no single model wins both.

Contents
  • Who actually wins multimodal AI benchmarks 2026?
  • MMMU-Pro: the image and chart reasoning leaders
  • Video-MME: the video understanding leaders are different
  • Why one model cannot win both leaderboards
    • Pros
    • Cons
  • Video-MME-v2: where the whole frontier still breaks
  • How to pick a multimodal model from these boards
    • In 2026, multimodal is two markets, not one
  • Builder’s take
  • Frequently asked questions
    • What are the top multimodal AI benchmarks in 2026?
    • Which AI model leads MMMU-Pro in 2026?
    • Which AI model is best at video understanding in 2026?
    • Why don’t the same models win image and video benchmarks?
    • What is MMMU-Pro and how is it harder than MMMU?
    • How well do AI models do on Video-MME-v2?
  • Primary sources

Who actually wins multimodal AI benchmarks 2026?

No single model wins the multimodal AI benchmarks in 2026. GPT-5.4 Pro leads image and chart reasoning on MMMU-Pro at 94%, while Kimi K2.5 leads video understanding on Video-MME at 0.874, and the two leaderboards share almost no overlap at the top.

For two years, “multimodal” was treated as one capability you either had or did not. In 2026 that framing collapsed. The benchmarks that matter now measure two genuinely different skills: parsing a static, information-dense frame (a chart, a circuit diagram, a CT scan) versus tracking meaning across thousands of frames over time. They turn out to be won by different labs, with different architectures and different training priorities.

The practical takeaway lands before the details: if you are buying a model to read documents and diagrams, you look at one leaderboard. If you are buying a model to understand hour-long video, you look at a completely different one. Conflating them is the most expensive mistake in multimodal procurement this year.

This piece breaks down both boards from primary sources — BenchLM.ai’s MMMU-Pro ranking dated May 28, 2026, and llm-stats.com’s Video-MME ranking — explains why the split exists, and shows where the frontier still falls apart entirely (the new Video-MME-v2).


MMMU-Pro: the image and chart reasoning leaders

On MMMU-Pro, OpenAI’s GPT-5.4 Pro leads at 94%, Anthropic’s Claude Mythos Preview is a close second at 92.7%, and Google’s Gemini 3.1 Pro is third at 83.9% — a roughly 9-point gap between the top two and the rest of the field of 27 models.

MMMU-Pro is the hard successor to MMMU. Where the original benchmark asked college-level questions paired with charts, diagrams, maps and chemical structures, MMMU-Pro deliberately makes them harder in three ways: it filters out questions a text-only model could guess, it adds more wrong-answer options to defeat lucky guessing, and — critically — it introduces a vision-only mode where the question text itself is embedded inside the image. The model has to see and read at the same time, the way a human does with a textbook page.

That third change is brutal. When MMMU-Pro launched, models lost between 16.8 and 26.9 percentage points moving from MMMU to MMMU-Pro. The vision-only setting alone cost GPT-4o several points and knocked open-source models down by double digits. So a 94% on MMMU-Pro in 2026 is not a saturated, meaningless score: it represents genuine frontier difficulty, and the top two models now score above the human-expert band of roughly 88.6%.

The shape of this leaderboard matters as much as the numbers. There is a clear two-model breakaway (GPT-5.4 Pro and Claude Mythos), then a dense Google-and-everyone-else cluster from 84% down to the high 70s, where Gemini 3.5 Flash (83.6%), GPT-5.5 (81.2%), Gemini 3 Pro (81%) and Moonshot’s Kimi K2.6 (79.4%) all sit within a few points of each other.

College-level questions fused with charts, diagrams, maps and chemical structures — including a vision-only mode where the question text is baked into the image, forcing the model to read and reason inside a single frame.

Rank  Model                  Provider     MMMU-Pro score
1     GPT-5.4 Pro            OpenAI       94%
2     Claude Mythos Preview  Anthropic    92.7%
3     Gemini 3.1 Pro         Google       83.9%
4     Gemini 3.5 Flash       Google       83.6%
5     GPT-5.5                OpenAI       81.2%
6     Gemini 3 Pro           Google       81%
7     Kimi K2.6              Moonshot AI  79.4%
MMMU-Pro image and chart reasoning leaderboard — top models, BenchLM.ai, May 28, 2026.

Video-MME: the video understanding leaders are different

On Video-MME, Moonshot AI’s Kimi K2.5 leads at 0.874, Google’s Gemini 2.5 Pro is second at 0.848, and Alibaba’s Qwen3.6 Plus is third at 0.842 — none of which top the MMMU-Pro image leaderboard. The still-image champions, GPT-5.4 Pro and Claude, do not appear on the Video-MME podium at all.

Video-MME measures something MMMU-Pro cannot: comprehension across time. The benchmark spans 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs, deliberately ranging from 11-second clips to full hour-long videos across six domains — knowledge, film and television, sports, artistic performance, life-record and multilingual content. It folds in video frames, subtitles and audio, so a strong score reflects fused temporal-plus-audio understanding, not single-frame pattern matching.

This is where the headline split becomes undeniable. The models that dominate static visual reasoning are absent from the top of video, and an open-weight model from Moonshot leads the closed labs. Temporal reasoning (causality, action order, event transitions) is a distinct muscle, and it is being trained hardest by a different set of teams. Alibaba’s Qwen family in particular stacks the mid-board, with multiple Qwen3-VL variants scoring 0.745 and below filling out the open-source ranks.

If you only ever read one chart in this article, read the next one. It puts both leaderboards on a single 0-100 axis so the still-vs-video leadership swap is impossible to miss.

Still vs video: who leads multimodal in 2026
The image/chart leaders (left bars) and the video leaders (right bars) are entirely different models. A zero means the model is not a top-ranked entry on that board.
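
The shared 0-100 axis described above can be reproduced from the podium scores quoted in this article. The following is a minimal sketch, not the code behind the original chart: the function name `to_shared_axis` is made up for illustration, and the 0.0 entries only mean "not listed in the top ranks quoted here", not a measured score.

```python
# Scores as quoted in this article. MMMU-Pro is already a percentage;
# Video-MME is reported on a 0-1 scale and must be multiplied by 100.
MMMU_PRO = {"GPT-5.4 Pro": 94.0, "Claude Mythos Preview": 92.7, "Gemini 3.1 Pro": 83.9}
VIDEO_MME = {"Kimi K2.5": 0.874, "Gemini 2.5 Pro": 0.848, "Qwen3.6 Plus": 0.842}


def to_shared_axis(mmmu_pro, video_mme):
    """Return {model: (image_score, video_score)} on a common 0-100 axis.

    A model that is missing from a board gets 0.0 for that board. This is a
    placeholder for "not in the quoted top ranks", not a real result.
    """
    models = list(mmmu_pro) + [m for m in video_mme if m not in mmmu_pro]
    return {
        m: (
            round(mmmu_pro.get(m, 0.0), 1),
            round(video_mme.get(m, 0.0) * 100, 1),
        )
        for m in models
    }


if __name__ == "__main__":
    for model, (image, video) in to_shared_axis(MMMU_PRO, VIDEO_MME).items():
        print(f"{model:<24} image={image:5.1f}  video={video:5.1f}")
```

Putting both boards on one scale makes the swap visible, but the two numbers still come from different tests, so a 94 on one axis and an 87.4 on the other are not directly comparable measures of quality.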

Why one model cannot win both leaderboards

Image reasoning and video understanding diverge because they stress different parts of the stack: MMMU-Pro rewards dense single-frame perception and reading, while Video-MME rewards temporal modeling, frame budgeting and audio fusion. Optimizing hard for one trades against the other.

A static chart question is, computationally, a fixed problem: one high-resolution image, all the information present at once, and the challenge is reading it correctly and reasoning over it. The winning move is high-fidelity visual tokenization and strong symbolic reasoning — exactly what OpenAI and Anthropic have poured effort into for document and diagram workloads.

Video is a different beast. An hour of footage is tens of thousands of frames, far more data than any context window can hold uncompressed. So video models must aggressively sample and compress frames, then reason about what happened between the frames they kept — causality, ordering, scene transitions — while keeping audio and subtitles synchronized to the visuals. A one-second sync drift can break the association entirely. The skill that wins here is temporal aggregation and efficient frame budgeting, not pixel-perfect single-frame reading.
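
To make the frame-budgeting problem concrete, here is a minimal sketch of uniform frame sampling under a token budget. Everything numeric in it is an assumption chosen for illustration (30 fps footage, a 128,000-token budget reserved for vision, 256 tokens per frame), and `sample_frame_indices` is a hypothetical helper. Production video models may use adaptive or learned sampling rather than this simple even spacing.

```python
def sample_frame_indices(duration_s, fps, token_budget, tokens_per_frame):
    """Pick evenly spaced frame indices that fit inside a token budget.

    All four arguments are caller-supplied assumptions; no particular
    model's real limits are encoded here.
    """
    total_frames = int(duration_s * fps)
    max_frames = token_budget // tokens_per_frame
    if total_frames <= 0 or max_frames <= 0:
        return []
    keep = min(total_frames, max_frames)
    # Split the video into `keep` equal segments and take each midpoint.
    step = total_frames / keep
    return [int(step * i + step / 2) for i in range(keep)]


if __name__ == "__main__":
    # One hour at 30 fps is 108,000 frames; the assumed budget keeps 500.
    indices = sample_frame_indices(3600, 30, 128_000, 256)
    gap_seconds = (indices[1] - indices[0]) / 30
    print(len(indices), "frames kept, one every", gap_seconds, "seconds")
```

Under these assumed numbers the model sees one frame roughly every 7.2 seconds, so anything that happens between kept frames has to be inferred rather than observed. That is the gap temporal reasoning has to bridge.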

That is why an open-weight model like Kimi K2.5 can top video while sitting mid-pack on still images, and why the MMMU-Pro leaders are nowhere on the video podium. The labs are making genuine, divergent bets. There is no free lunch where one architecture maxes both — at least not in the 2026 generation of models on the board today.

Pros
  • Match the benchmark to your workload — MMMU-Pro for documents/charts, Video-MME for clips
  • Treat a 90%+ MMMU-Pro score as real difficulty, not saturation
  • Test open-weight leaders (Kimi, Qwen) explicitly for video before defaulting to a closed lab
  • Re-run evals quarterly; the board shifts month to month
Cons
  • Don’t assume the image leader is the video leader; on the 2026 boards cited here it is not
  • Don’t trust a single aggregate ‘multimodal’ number from a vendor slide
  • Don’t ship long-video autonomy on current models (see Video-MME-v2)
  • Don’t compare scores across benchmarks with different scales without normalizing

Video-MME-v2: where the whole frontier still breaks

  • 3,300 human-hours to build Video-MME-v2
  • 90.7: human score (Video-MME-v2, non-linear)
  • 49.4: best model (Gemini-3-Pro, Video-MME-v2)
  • 41.3 pt: human–model gap on Video-MME-v2

On Video-MME-v2, released April 2026, the best commercial model (Gemini-3-Pro) scores just 49.4 against a human-expert 90.7 — a 41.3-point gap that exposes how far real video reasoning still lags, even for the models leading the original Video-MME.

The original Video-MME is increasingly saturated at the top, so the benchmark’s authors built a harder successor. Video-MME-v2 took roughly 3,300 human-hours to construct — 12 annotators and 50 independent reviewers — across 800 videos and 3,200 questions, each with eight answer options instead of four. Crucially, it replaces simple per-question accuracy with a group-based, non-linear scoring strategy that rewards consistency and reasoning coherence across related questions, punishing models that get individual frames right but cannot hold a coherent understanding of the whole video.
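
Two of those design choices can be illustrated with a few lines of arithmetic. The sketch below is not the Video-MME-v2 scoring formula, which is defined in the benchmark's paper and not reproduced here. The quadratic rule in `grouped_score` is a made-up stand-in, used only to show why scoring by group punishes scattered correct answers more than plain accuracy does.

```python
def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing on one question."""
    return 1.0 / num_options


def per_question_accuracy(groups):
    """Plain accuracy: every question counts the same, groups are ignored."""
    answers = [a for group in groups for a in group]
    return sum(answers) / len(answers)


def grouped_score(groups, exponent=2.0):
    """HYPOTHETICAL non-linear group score, for illustration only.

    Each group scores (fraction correct) ** exponent, then groups are
    averaged. With exponent > 1, partial credit inside a group shrinks.
    """
    return sum((sum(g) / len(g)) ** exponent for g in groups) / len(groups)


if __name__ == "__main__":
    print(chance_accuracy(4), chance_accuracy(8))  # four vs eight options

    # Two models, each with 4 of 8 answers correct across two groups.
    consistent = [[True, True, True, True], [False, False, False, False]]
    scattered = [[True, True, False, False], [True, False, True, False]]
    for name, groups in (("consistent", consistent), ("scattered", scattered)):
        print(name, per_question_accuracy(groups), grouped_score(groups))
```

Doubling the options from four to eight halves the random-guess baseline from 25% to 12.5%. In the toy comparison both models have the same 50% plain accuracy, but the one that fully understands one video and fails the other keeps its score, while the one that is half right everywhere is cut in half.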

The results are humbling. Where humans reach a 90.7 non-linear score (94.9% raw accuracy), the best commercial model lands at 49.4 (66.1% accuracy), and the best open-source model — a Qwen3.5 variant — manages only 39.1. For builders, this is the most important number in the entire 2026 multimodal landscape: it says the impressive 0.87 on the original Video-MME does not translate into trustworthy comprehension of complex, long-form video.

So read the two boards in sequence. MMMU-Pro shows a near-solved skill where the top models touch human level. Video-MME shows a contested skill with a clear but different set of leaders. Video-MME-v2 shows an unsolved skill where the entire frontier is closer to a coin flip than to a human. That progression — solved, contested, unsolved — is the real map of multimodal AI in 2026.

“MMMU-Pro shows a near-solved skill. Video-MME shows a contested one. Video-MME-v2 shows an unsolved one. That is the real map of multimodal AI in 2026.”

Alatirok analysis

How to pick a multimodal model from these boards

In 2026, multimodal is two markets, not one

Image and chart reasoning is a near-solved, two-horse race led by GPT-5.4 Pro (94%) and Claude Mythos (92.7%) on MMMU-Pro. Video understanding is a separate contest won by Kimi K2.5 (0.874), Gemini and Qwen on Video-MME. And long-form video reasoning, measured by Video-MME-v2, remains unsolved — the best model scores 49.4 against a human 90.7. Buy by modality, verify on your own data, and re-test quarterly.

Pick by modality, not by brand: route document, chart and diagram work to GPT-5.4 Pro or Claude Mythos (92-94% MMMU-Pro), and route short-to-medium video work to Kimi K2.5, Gemini 2.5 Pro or Qwen3.6 Plus (0.84-0.87 Video-MME). For long-form autonomous video, keep a human in the loop until Video-MME-v2 scores climb.
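
That routing rule is simple enough to write down. The sketch below is one hypothetical way to encode it, using the shortlists named in this article. The `Route` type, the `pick_route` function and the 60-minute cutoff for "long-form" are all assumptions for illustration; the cutoff is based only on the hour-long upper range that Video-MME covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    shortlist: tuple    # candidate models to evaluate, not a final pick
    benchmark: str      # the leaderboard the shortlist comes from
    human_review: bool  # keep a person in the loop?


# Assumption: treat anything longer than Video-MME's hour-long clips as
# long-form video that current models should not handle unattended.
LONG_VIDEO_MINUTES = 60


def pick_route(modality, video_minutes=0.0):
    """Map a workload to a model shortlist, by modality rather than brand."""
    if modality in {"document", "chart", "diagram"}:
        return Route(("GPT-5.4 Pro", "Claude Mythos Preview"), "MMMU-Pro", False)
    if modality == "video":
        return Route(
            ("Kimi K2.5", "Gemini 2.5 Pro", "Qwen3.6 Plus"),
            "Video-MME",
            video_minutes > LONG_VIDEO_MINUTES,
        )
    raise ValueError(f"no route defined for modality: {modality!r}")


if __name__ == "__main__":
    print(pick_route("chart"))
    print(pick_route("video", video_minutes=12))
    print(pick_route("video", video_minutes=95))
```

The function returns a shortlist rather than a single model on purpose: the leaderboards narrow the field, and your own evaluation makes the final call.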

Start with your actual data. If your pipeline ingests PDFs, financial charts, schematics or scientific figures, MMMU-Pro is your benchmark and the two-model breakaway at the top is your shortlist. The 9-point gap between the leaders and the field is large enough to matter in production accuracy, so the closed-lab leaders earn their place here.

If your pipeline ingests video — security clips, lectures, meetings, product demos — ignore the MMMU-Pro ranking entirely and shortlist from Video-MME. The strong showing of open-weight models (Kimi, the Qwen family) is a gift: you can often self-host or fine-tune the video leaders, which the closed image leaders do not allow. Just cap your ambition at clip lengths and complexity the benchmark actually covers.

Whatever you choose, treat these scores as a starting filter, not a verdict. The MMMU-Pro board moved measurably between March and late May 2026, and vendor-reported numbers vary by evaluation methodology. The only authoritative leaderboard is the one you build from your own representative samples — then re-run it every quarter as the frontier keeps reshuffling.
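
A private leaderboard does not need much machinery. The following is a minimal sketch under stated assumptions: each model is wrapped in a plain Python callable that takes a sample reference (a file path, a clip ID) and returns an answer string, and answers are graded by case-insensitive exact match. The names `build_leaderboard` and `Sample` are illustrative, and no real vendor API is called here.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (reference to your input, expected answer)


def build_leaderboard(
    models: Dict[str, Callable[[str], str]],
    samples: List[Sample],
) -> List[Tuple[str, float]]:
    """Score every model on the same samples and rank by accuracy.

    Grading is exact match after trimming and lowercasing, which suits
    multiple-choice or short-answer checks. Free-form answers would need
    a different grader.
    """
    if not samples:
        raise ValueError("need at least one sample")
    board = []
    for name, predict in models.items():
        correct = sum(
            predict(ref).strip().lower() == expected.strip().lower()
            for ref, expected in samples
        )
        board.append((name, correct / len(samples)))
    return sorted(board, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Stub "models" standing in for real API wrappers you would write.
    answers = {"clip_01": "B", "clip_02": "D", "chart_07": "A"}
    models = {
        "always_a": lambda ref: "A",
        "lookup": lambda ref: answers[ref],
    }
    for name, accuracy in build_leaderboard(models, list(answers.items())):
        print(f"{name:<10} {accuracy:.2f}")
```

Keeping the sample set fixed and versioned is what makes the quarterly re-run meaningful: when the ranking changes, it is because the models changed and not the test.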

Builder’s take

As someone who routes real traffic through these models inside Cyntr and Loomfeed, the still-vs-video split is not trivia — it is a procurement signal. Here is how I read the 2026 board.

  • Stop asking ‘which model is best at multimodal.’ That question has no answer in 2026. Ask ‘best at charts’ or ‘best at an hour of video’ — they have different winners.
  • For document, chart and diagram pipelines, GPT-5.4 Pro and Claude Mythos are the safe defaults at 92-94% on MMMU-Pro. I send Cyntr’s chart-extraction work there without a second thought.
  • For anything frame-heavy — surveillance clips, lecture recordings, long product demos — I test Kimi K2.5, Gemini and Qwen first, because the still-image leaders are simply not on the Video-MME podium.
  • Watch Video-MME-v2. When the best model scores 49.4 against a human 90.7, that is not a leaderboard — it is a warning label. Do not ship autonomous long-video agents on today’s models.
  • Benchmark scores rot fast. The May 28 board already looks different from March. Re-test against your own clips quarterly; vendor slides are marketing, your eval is truth.

Frequently asked questions

What are the top multimodal AI benchmarks in 2026?

The two most-cited multimodal AI benchmarks in 2026 are MMMU-Pro, which tests image, chart and diagram reasoning, and Video-MME, which tests video understanding across clips from 11 seconds to an hour. A harder video successor, Video-MME-v2, launched in April 2026.

Which AI model leads MMMU-Pro in 2026?

GPT-5.4 Pro leads MMMU-Pro at 94% as of BenchLM.ai’s May 28, 2026 ranking, with Claude Mythos Preview second at 92.7% and Gemini 3.1 Pro third at 83.9%, across a field of 27 models.

Which AI model is best at video understanding in 2026?

On the Video-MME leaderboard, Kimi K2.5 from Moonshot AI leads at 0.874, followed by Gemini 2.5 Pro at 0.848 and Alibaba’s Qwen3.6 Plus at 0.842. Notably, the image-reasoning leaders do not top the video board.

Why don’t the same models win image and video benchmarks?

Image reasoning rewards dense single-frame perception and reading, while video understanding rewards temporal modeling, frame sampling and audio synchronization. These are different skills, so labs make divergent architectural bets and no single 2026 model leads both leaderboards.

What is MMMU-Pro and how is it harder than MMMU?

MMMU-Pro is the robust successor to MMMU. It filters out text-only-answerable questions, adds more answer options to defeat guessing, and embeds question text inside images so models must see and read at once. Models lost 16.8 to 26.9 points moving from MMMU to MMMU-Pro.

How well do AI models do on Video-MME-v2?

Poorly. On Video-MME-v2, released April 2026 after roughly 3,300 human-hours of annotation, the best commercial model (Gemini-3-Pro) scores 49.4 against a human-expert 90.7 — a 41.3-point gap that shows complex long-form video reasoning remains unsolved.

Primary sources

  • MMMU-Pro Benchmark 2026: 27 LLM scores — BenchLM.ai
  • Video-MME Benchmark Leaderboard — llm-stats.com
  • Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding — arXiv
  • MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark — arXiv
  • Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis — MME-Benchmarks (GitHub)
