
AI Inference Speed 2026: Fastest Providers Compared

Surya Koritala
Last updated: May 31, 2026 11:59 pm

Cerebras, Groq, SambaNova, Together, Fireworks and Baseten benchmarked on a common open model — with the real tokens-per-second numbers and the caveats that change them.

Contents
  • What is the fastest AI inference provider in 2026?
  • AI inference speed compared: output tokens per second by provider
  • Latency vs throughput: why time to first token matters
  • Custom silicon vs GPUs: where the speed actually comes from
  • How to choose a fast inference provider
  • The caveat that breaks every inference benchmark
    • Cerebras owns raw speed; the right pick still depends on your workload
  • Builder’s take
  • Frequently asked questions
    • What is the fastest AI inference provider in 2026?
    • How is AI inference speed measured?
    • Why are custom-silicon providers faster than GPU providers?
    • Is the fastest inference provider also the cheapest?
    • Why do inference speed benchmarks vary so much?
    • Should I pick an inference provider based on Artificial Analysis rankings?
  • Primary sources

What is the fastest AI inference provider in 2026?

On the most-compared common open model — OpenAI’s gpt-oss-120b — Cerebras is the fastest AI inference provider in 2026, clocking 1,705.8 output tokens per second on Artificial Analysis, more than 2.4x the next-fastest API. Behind it, SambaNova (692.5 t/s) and Fireworks (683.4 t/s) lead a tight chasing pack, with Baseten close at roughly 650 t/s as the fastest NVIDIA-GPU-based option.

That headline hides a more interesting story. The same Artificial Analysis page reports a 39.2x gap between the fastest and slowest providers serving the exact same weights. Two endpoints can run byte-identical model files and deliver wildly different experiences — because AI inference speed is a property of the hardware and serving stack, not the model.

This comparison uses gpt-oss-120b deliberately: it is open-weight, widely hosted, and benchmarked by a neutral third party across more than 20 providers, which makes it the cleanest apples-to-apples test bed available in 2026. Every number below is a single-stream measurement and will shift with model, context length and load — a caveat we return to at the end.

Figure: Comparison of fastest AI inference providers by output tokens per second in 2026.

Output tokens per second (t/s) measures how fast a model streams new text in a single request. Roughly, 1 token is about 0.75 of an English word, so 700 t/s is around 525 words per second — far faster than any human reads. Above ~1,000 t/s, the bottleneck stops being the model and starts being your network and application code.
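The token-to-word arithmetic above is easy to reproduce. The sketch below is a minimal illustration: it assumes the rough 0.75 words-per-token ratio quoted here (real ratios vary by tokenizer and language), and the function names are made up for this example.

```python
WORDS_PER_TOKEN = 0.75  # rough English average quoted above; an assumption, not a constant


def words_per_second(tokens_per_second: float) -> float:
    """Convert an output speed in tokens/s to approximate English words/s."""
    return tokens_per_second * WORDS_PER_TOKEN


def seconds_to_stream(word_count: int, tokens_per_second: float) -> float:
    """Approximate streaming time for a reply of word_count words (ignores TTFT)."""
    tokens = word_count / WORDS_PER_TOKEN
    return tokens / tokens_per_second


print(words_per_second(700))                    # the 700 t/s example from the text
print(round(seconds_to_stream(1500, 700), 2))   # a 1,500-word reply at 700 t/s
```

At these speeds a long reply streams in a few seconds, which is why the wait before the first token becomes the more noticeable number.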

AI inference speed compared: output tokens per second by provider

Across the six providers most associated with fast inference, output speed on gpt-oss-120b ranges from Cerebras at 1,705.8 tokens per second down to GPU-based providers in the 300–700 t/s band. The table below lists the Artificial Analysis figures for the fastest endpoint of each named provider.

Three of these — Cerebras, Groq and SambaNova — run custom silicon (wafer-scale, LPU and RDU respectively) built specifically to stream tokens fast. The other three — Together, Fireworks and Baseten — run optimized stacks on NVIDIA GPUs, and have closed much of the gap through speculative decoding and aggressive kernel work. Baseten, for instance, reports it made its gpt-oss serving 60% faster over ten weeks using EAGLE-3 speculative decoding to reach the ~650 t/s tier.
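To see why speculative decoding can move a GPU stack this much, here is a back-of-envelope sketch. It uses the standard textbook estimate for draft-and-verify decoding and assumes each drafted token is accepted independently with the same probability. It is not Baseten's or EAGLE-3's actual implementation, and the acceptance rates, draft length and cost ratio are hypothetical.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability accept_prob; the target always contributes one token.
    """
    if not 0.0 <= accept_prob <= 1.0 or draft_len < 0:
        raise ValueError("need 0 <= accept_prob <= 1 and draft_len >= 0")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost_ratio is the cost of one draft step relative to one target step
    (a hypothetical parameter for this sketch).
    """
    tokens = expected_tokens_per_pass(accept_prob, draft_len)
    return tokens / (draft_len * draft_cost_ratio + 1.0)


# Hypothetical acceptance rates with a 4-token draft and a cheap draft model.
for a in (0.6, 0.8, 0.9):
    print(a, round(expected_tokens_per_pass(a, 4), 2), round(estimated_speedup(a, 4, 0.05), 2))
```

The takeaway is that the gain depends heavily on how often the draft is accepted, which is one reason speed on GPU stacks depends on per-model optimization work.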

A note on Groq: it does not appear among the top gpt-oss-120b endpoints in the public Artificial Analysis ranking, but it remains a speed leader on other open models. On Llama 3.3 70B it posts 306.8 t/s, narrowly ahead of SambaNova’s 295.1 t/s. This is exactly why the ranking is model-dependent.

AI inference speed: tokens/sec on gpt-oss-120b
Cerebras leads gpt-oss-120b by a wide margin; the GPU-based field (Fireworks, Baseten) now rivals second-place custom silicon. Figures are single-stream and vary by context length and load.
| Provider | Output speed (t/s) | Time to first token | Hardware class |
|---|---|---|---|
| Cerebras | 1,705.8 | 1.66s | Wafer-Scale Engine (custom) |
| SambaNova | 692.5 | sub-1s class | RDU (custom) |
| Fireworks | 683.4 | mid-single-digit s | NVIDIA GPU |
| Baseten | ~650 | mid-single-digit s | NVIDIA GPU (EAGLE-3) |
| Together | not top-5 speed | 4.23s | NVIDIA GPU |
| Groq (Llama 3.3 70B) | 306.8 on that model | 0.91s | LPU (custom) |
Fast-inference providers on gpt-oss-120b — speed, latency and architecture (Artificial Analysis, 2026)

Latency vs throughput: why time to first token matters

Peak tokens per second tells you how fast a model finishes; time to first token (TTFT) tells you how fast it starts — and for chat and agents, the second number often dominates the felt experience. On gpt-oss-120b, Cerebras posts both the highest throughput (1,705.8 t/s) and the lowest latency (1.66s TTFT), a rare double.

Below the leader, the trade-offs sharpen. Together.ai is not in the top tier for raw output speed on this model, yet its 4.23s TTFT still places it among the lower-latency providers in the wider field, even though it trails Cerebras’ 1.66s. That is useful context for workloads where the first token’s arrival is what users notice. A provider that streams at 680 t/s but takes 4+ seconds to produce its first token will feel slower in a quick back-and-forth than one that starts in under two.

AI inference speed is really two numbers, so the practical rule splits with them: for long-form generation (reports, code files, batch summarization) optimize for throughput, because most of the wall-clock time is spent streaming. For short, interactive turns — the bulk of agent tool-calls and chat replies — optimize for TTFT, because the response is over before peak throughput ever matters.
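The split between the two numbers can be made concrete with a simple model: response time is roughly TTFT plus output tokens divided by output speed. The sketch below assumes that linear model and compares two illustrative endpoints (650 t/s with a 1.6s TTFT versus 700 t/s with a 4.3s TTFT). These are made-up profiles, not measurements of any named provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    ttft_s: float        # time to first token, seconds
    tokens_per_s: float  # single-stream output speed

    def response_time(self, output_tokens: int) -> float:
        """Simple linear model: wait for the first token, then stream the rest."""
        return self.ttft_s + output_tokens / self.tokens_per_s


def break_even_tokens(quick_start: Endpoint, quick_stream: Endpoint) -> Optional[float]:
    """Output length at which quick_stream catches up with quick_start.

    Returns None if quick_stream never catches up (it is not faster at streaming)
    or is never behind (it also starts sooner).
    """
    if quick_stream.tokens_per_s <= quick_start.tokens_per_s:
        return None
    if quick_stream.ttft_s <= quick_start.ttft_s:
        return None
    ttft_gap = quick_stream.ttft_s - quick_start.ttft_s
    per_token_gain = 1 / quick_start.tokens_per_s - 1 / quick_stream.tokens_per_s
    return ttft_gap / per_token_gain


a = Endpoint("A (hypothetical)", ttft_s=1.6, tokens_per_s=650)
b = Endpoint("B (hypothetical)", ttft_s=4.3, tokens_per_s=700)

for n in (200, 2_000, 20_000):
    print(n, round(a.response_time(n), 2), round(b.response_time(n), 2))
print("break-even output tokens:", round(break_even_tokens(a, b)))
```

With these illustrative numbers the higher-throughput endpoint only pulls ahead on very long generations; for short interactive turns the lower TTFT wins every time.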

“Two endpoints can run byte-identical model weights and deliver a 39x difference in speed. Inference speed is a property of the stack, not the model.”

Artificial Analysis gpt-oss-120b benchmark, 2026

Custom silicon vs GPUs: where the speed actually comes from

  • 1,705.8: Cerebras t/s on gpt-oss-120b. Fastest single endpoint benchmarked by Artificial Analysis.
  • 39.2x: spread from fastest to slowest provider. Same model, same weights, different stacks.
  • ~650: Baseten t/s on NVIDIA GPUs. Up ~60% in 10 weeks via EAGLE-3 speculative decoding.
  • 1.66s: Cerebras time to first token. Lowest TTFT in the gpt-oss-120b field.

The fastest AI inference speed in 2026 comes from purpose-built chips that keep the entire model in fast on-chip memory, eliminating the memory-bandwidth bottleneck that throttles GPU inference. Cerebras’ Wafer-Scale Engine is a single dinner-plate-sized chip; Groq’s LPU and SambaNova’s RDU take different routes to the same goal: stop shuttling weights back and forth from external memory on every token.
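A rough way to picture the memory-bandwidth bottleneck: in single-stream decoding, every new token requires reading the active weights once, so output speed cannot exceed memory bandwidth divided by bytes read per token. The sketch below assumes that simplified bound (it ignores KV-cache traffic, compute time and batching), and every number in it is a made-up round figure, not a spec for any chip or model named in this article.

```python
def decode_tokens_per_s_bound(
    active_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical model: 10B parameters touched per token, 1 byte each.
ACTIVE_PARAMS = 10e9
BYTES_PER_PARAM = 1.0

# Hypothetical memory systems (illustrative round numbers only).
external_memory_bw = 3e12   # bytes/s, off-chip memory class
on_chip_memory_bw = 20e12   # bytes/s, on-chip memory class

print(decode_tokens_per_s_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, external_memory_bw))
print(decode_tokens_per_s_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, on_chip_memory_bw))
```

Under this bound, raising effective bandwidth raises the ceiling in direct proportion. Techniques like speculative decoding attack the same ceiling from the other side, by emitting more than one token per weight read.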

In Cerebras’ own head-to-head, its WSE ran gpt-oss-120b at over 3,000 t/s, roughly 5x an eight-GPU NVIDIA GB200 setup (which Baseten benchmarked at around 650 t/s), at a 50% higher token price ($0.75 vs $0.50 per million in that test). That is a vendor-run result, but it points the same way as the independent Artificial Analysis figures: the architectural advantage is real.

But custom silicon has a coverage cost. These vendors host a curated menu of popular open models. The moment you need a specific fine-tune, a quantization they don’t offer, or a brand-new release on day one, you are back on GPU providers — Fireworks, Together and Baseten — whose entire value proposition is flexibility plus increasingly competitive speed. The GPU field’s gains via speculative decoding mean the speed penalty for that flexibility is now far smaller than it was in 2024.

How to choose a fast inference provider

Choose your inference provider by matching its strengths to your workload shape, not by chasing the top of a single leaderboard. The right answer is different for an interactive agent, a batch pipeline, and a cost-sensitive product feature — and it changes again if you need a model the speed leaders don’t host.

Start with three questions: Is my model on a custom-silicon menu? Do I need lowest latency or highest throughput? And what is my real cost-per-task once AI inference speed is priced in? On gpt-oss-120b, blended prices on Artificial Analysis ran from about $0.05 per million tokens (DeepInfra) up to the $0.75 range for the fastest custom-silicon endpoints — so the fastest option is rarely the cheapest, and the cheapest is rarely the fastest.
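The third question, cost per task, is simple arithmetic once you know your token counts. The sketch below assumes a single blended price applied to input and output tokens alike, and uses the two price points quoted above ($0.05 and $0.75 per million). The task size of 3,000 input and 500 output tokens is a hypothetical example.

```python
def cost_per_task(input_tokens: int, output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of one task at a blended per-million-token price (assumed to apply to all tokens)."""
    return (input_tokens + output_tokens) * price_per_million_usd / 1_000_000


# Hypothetical task: 3,000 prompt tokens in, 500 tokens out.
INPUT_TOKENS, OUTPUT_TOKENS = 3_000, 500

for label, price in (("budget endpoint", 0.05), ("fastest endpoint", 0.75)):
    per_task = cost_per_task(INPUT_TOKENS, OUTPUT_TOKENS, price)
    print(f"{label}: ${per_task:.6f} per task, ${per_task * 1_000_000:,.0f} per million tasks")
```

Whether the premium is worth paying depends on what the saved seconds are worth in your product, which is why cost per task belongs next to speed in any shortlist.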

The pros and cons below summarize where each architecture class makes sense. Use them to shortlist, then run your own prompts: published benchmarks are single-stream snapshots, and your concurrency, context length and prompt mix will move the numbers.

Cerebras

5 out of 5
The outright speed king on supported models, leading both throughput and latency.
Best for: Real-time agents and latency-critical apps on popular open models

What works

  • 1,705.8 t/s on gpt-oss-120b — 2.4x the next provider
  • Lowest TTFT in the field at 1.66s
  • Beats 8-GPU NVIDIA GB200 by ~5x in vendor tests

Watch out for

  • Limited model menu
  • Premium token pricing (~$0.75/M in head-to-head tests)

SambaNova

5 out of 5
Custom-RDU speed with strong low-latency characteristics across open models.
Best for: Teams wanting custom-silicon speed beyond a single vendor’s menu

What works

  • 692.5 t/s on gpt-oss-120b — fastest after Cerebras
  • Competitive on Llama 3.3 70B at 295.1 t/s
  • Strong time-to-first-token profile

Watch out for

  • Well behind Cerebras on peak throughput
  • Smaller model catalog than GPU providers

Fireworks

5 out of 5
The fastest of the flexible GPU providers, nearly matching second-place custom silicon.
Best for: Production apps needing speed plus a broad, current model catalog

What works

  • 683.4 t/s on gpt-oss-120b
  • Runs a wide range of open models and fine-tunes
  • Mature production tooling

Watch out for

  • Higher TTFT than custom silicon
  • Throughput trails Cerebras by ~2.5x

Baseten

5 out of 5
Fastest NVIDIA-GPU option, closing on custom silicon via speculative decoding.
Best for: Teams wanting GPU flexibility with near-custom-silicon speed

What works

  • ~650 t/s on gpt-oss-120b — fastest NVIDIA-based API
  • Improved ~60% in 10 weeks with EAGLE-3
  • Dedicated-deployment control

Watch out for

  • Still behind the custom-silicon leaders
  • Speed depends on per-model optimization work

Together

5 out of 5
Broadest model selection with respectable latency, mid-pack on raw throughput.
Best for: Maximum model coverage and same-day access to new releases

What works

  • Among lower-latency providers on gpt-oss-120b
  • Enormous open-model catalog
  • Strong for batch and experimentation

Watch out for

  • Not in the top tier for raw output speed on this model
  • 4.23s TTFT lags the leaders

Groq

5 out of 5
LPU speed that leads on some models even when it trails on others.
Best for: Low-latency serving of the specific models Groq optimizes

What works

  • 306.8 t/s on Llama 3.3 70B — fastest on that model
  • Strong 0.91s TTFT on Llama 3.3 70B
  • Custom LPU architecture

Watch out for

  • Not top-ranked on gpt-oss-120b
  • Model-specific performance varies
Pros
  • Custom silicon (Cerebras/Groq/SambaNova): unmatched single-stream speed and low latency on supported models
  • Cerebras leads both throughput and TTFT on gpt-oss-120b — no trade-off between the two
  • GPU providers (Fireworks/Together/Baseten): host almost any open model, including same-day releases and custom fine-tunes
  • GPU stacks have closed much of the speed gap with speculative decoding (Baseten ~650 t/s)
  • GPU providers often win on price-per-token for non-latency-critical batch work
Cons
  • Custom-silicon vendors host a limited model menu — your exact fine-tune may not be available
  • Fastest endpoints carry premium pricing (up to ~$0.75/M tokens vs $0.05 budget GPU options)
  • GPU providers can trail on TTFT (Together ~4.23s on gpt-oss-120b) despite solid throughput
  • Every published number is single-stream — real concurrency and long context will lower it
  • Rankings reshuffle by model: Groq leads on Llama 3.3 70B but not on gpt-oss-120b

The caveat that breaks every inference benchmark

Cerebras owns raw speed; the right pick still depends on your workload

On the common gpt-oss-120b benchmark, Cerebras is the clear AI inference speed leader at 1,705.8 t/s with the lowest latency, and SambaNova and Fireworks lead the chase. But custom silicon only wins where it hosts your model. For broad coverage and same-day access, GPU providers like Fireworks, Together and Baseten — now within striking distance on speed — are the pragmatic default. Shortlist from the leaderboard, then prove it on your own prompts.

Every tokens-per-second figure in this article is a single-stream measurement on one model at one moment — change the model, the context length, the concurrency, or the time of day, and the ranking can reorder. This is the single most important thing to internalize before you pick a provider on the strength of a leaderboard.

Three forces move the numbers. Model: Groq tops Llama 3.3 70B but not gpt-oss-120b. Context length: long prompts add prefill time and drag effective throughput down. Load: a single-user benchmark says nothing about how an endpoint behaves at 100 concurrent requests, when batching and queueing change the math entirely. Artificial Analysis publishes its figures precisely because they are standardized — but standardized is not the same as identical to your production traffic.

The discipline is simple: treat published AI inference speed as a claim to verify, never a number to trust. Use public benchmarks like Artificial Analysis to build a shortlist of three or four candidates. Then replay your own prompts, at your own concurrency, measuring your own cost-per-completed-task. The provider that wins the leaderboard and the provider that wins your bill are frequently not the same name.
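Replaying your own prompts mostly comes down to timing a token stream. The sketch below is one minimal way to do that: it measures TTFT and output speed after the first token for any iterable of tokens. `measure_stream` and `fake_stream` are hypothetical names for this example. In practice you would pass in your provider client's streaming iterator, and providers or benchmark sites may define these metrics slightly differently.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time one streamed response: TTFT plus output speed after the first token."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()        # arrival time of this token
        if first is None:
            first = last      # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # Output speed excludes the wait for the first token.
    speed = (count - 1) / decode_s if decode_s > 0 else float("nan")
    return {
        "ttft_s": first - start,
        "output_tokens": count,
        "tokens_per_s": speed,
        "total_s": last - start,
    }


def fake_stream(n_tokens: int, ttft_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


result = measure_stream(fake_stream(n_tokens=20, ttft_s=0.05, gap_s=0.005))
print({k: round(v, 3) for k, v in result.items()})
```

Run it over your real prompt set, several times and at your expected concurrency, and compare distributions rather than a single best run.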

Published tokens/sec numbers are single-stream snapshots. Before committing, benchmark your real prompt distribution at production concurrency and measure cost-per-task, not just peak speed. The leaderboard winner can lose once your context lengths and load are applied.

Builder’s take

As founder of Cyntr and Loomfeed, I run agent pipelines where latency compounds across dozens of model calls. Here is how I actually read the inference-speed leaderboard:

  • Headline tokens/sec is a single-stream number on a single model. The instant you change the model, the context length, or the time of day, the ranking shuffles. Treat any benchmark as a snapshot, not a contract.
  • For interactive chat, time-to-first-token matters more than peak throughput. A provider at 650 t/s with a 1.6s TTFT feels snappier than one at 700 t/s with a 4.3s TTFT.
  • Custom-silicon speed (Cerebras, Groq, SambaNova) is real, but it only exists for the handful of models those vendors host. If you need a specific fine-tune, GPU providers like Fireworks, Together and Baseten are where you live.
  • In multi-step agent loops, per-call latency is multiplied by your number of hops. Shaving 400ms per call across a 20-call plan saves 8 seconds: the difference between a 2-second and a 10-second agent.
  • Speed is necessary but not sufficient — I benchmark cost-per-task and output quality on my own prompts before I trust any leaderboard position.
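The hop arithmetic in the list above can be checked in a few lines. This sketch assumes the calls run strictly one after another with identical latency, which real agent plans rarely do exactly. The 0.5s and 0.1s per-call figures are simply the values implied by the 10-second and 2-second example.

```python
def agent_wall_clock_s(calls: int, per_call_latency_s: float) -> float:
    """Total wall-clock time for a plan of sequential model calls."""
    return calls * per_call_latency_s


CALLS = 20
slow = agent_wall_clock_s(CALLS, 0.5)   # 500 ms per call
fast = agent_wall_clock_s(CALLS, 0.1)   # 100 ms per call, i.e. 400 ms shaved

print(round(slow, 1), round(fast, 1), round(slow - fast, 1))
```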

Frequently asked questions

What is the fastest AI inference provider in 2026?

On gpt-oss-120b, the most widely benchmarked common open model, Cerebras is the fastest at 1,705.8 output tokens per second according to Artificial Analysis — more than 2.4x the next-fastest provider. SambaNova (692.5 t/s) and Fireworks (683.4 t/s) follow. Rankings shift by model: Groq leads on Llama 3.3 70B.

How is AI inference speed measured?

It is measured in output tokens per second (t/s) — how fast a model streams generated text in a single request — alongside time to first token (TTFT), which is how long until the first token appears. Throughput matters for long generations; TTFT matters for interactive chat and agents. One token is roughly 0.75 of an English word.

Why are custom-silicon providers faster than GPU providers?

Chips like Cerebras’ Wafer-Scale Engine, Groq’s LPU and SambaNova’s RDU keep the entire model in fast on-chip memory, removing the memory-bandwidth bottleneck that throttles GPU inference. In Cerebras’ own tests it ran gpt-oss-120b about 5x faster than an eight-GPU NVIDIA GB200 setup, though GPU providers have narrowed the gap with speculative decoding.

Is the fastest inference provider also the cheapest?

Usually not. On gpt-oss-120b, blended prices on Artificial Analysis ranged from about $0.05 per million tokens (DeepInfra) up to roughly $0.75 for the fastest custom-silicon endpoints. The fastest option carries a premium, and the cheapest option is rarely the fastest — so you should weigh cost-per-task against speed.

Why do inference speed benchmarks vary so much?

Published figures are single-stream snapshots on one model. Output speed changes with the model (Groq leads Llama 3.3 70B but not gpt-oss-120b), context length (long prompts slow effective throughput), and concurrency or load (a one-user test says nothing about 100 simultaneous requests). Artificial Analysis reports a 39.2x spread across providers on the same model.

Should I pick an inference provider based on Artificial Analysis rankings?

Use them to build a shortlist, not to make the final call. Artificial Analysis provides standardized, neutral benchmarks ideal for narrowing to three or four candidates. Then replay your own prompts at your real concurrency and measure cost-per-completed-task, because the leaderboard winner and the provider that minimizes your bill are often different.

Primary sources

  • gpt-oss-120b: API Provider Performance Benchmarking & Price Analysis — Artificial Analysis
  • Llama 3.3 70B: API Provider Performance Benchmarking — Artificial Analysis
  • OpenAI GPT-OSS 120B Benchmarked: NVIDIA Blackwell vs Cerebras — Cerebras
  • How we made the fastest GPT-OSS on NVIDIA GPUs 60% faster — Baseten
  • Llama3.1 Model Quality Evaluation: Cerebras, Groq, SambaNova, Together, Fireworks — Cerebras
