AI Inference Speed 2026: Fastest Providers Compared

Cerebras, Groq, SambaNova, Together, Fireworks and Baseten benchmarked on a common open model — with the real tokens-per-second numbers and the caveats that change them.

What is the fastest AI inference provider in 2026?

On the most-compared common open model — OpenAI’s gpt-oss-120b — Cerebras is the fastest AI inference provider in 2026, clocking 1,705.8 output tokens per second on Artificial Analysis, more than 2.4x the next-fastest API. Behind it, SambaNova (692.5 t/s) and Fireworks (683.4 t/s) lead a tight chasing pack, with Baseten close at roughly 650 t/s as the fastest NVIDIA-GPU-based option.

That headline hides a more interesting story. The same Artificial Analysis page reports a 39.2x gap between the fastest and slowest providers serving the exact same weights. Two endpoints can run byte-identical model files and deliver wildly different experiences — because AI inference speed is a property of the hardware and serving stack, not the model.

This comparison uses gpt-oss-120b deliberately: it is open-weight, widely hosted, and benchmarked by a neutral third party across more than 20 providers, which makes it the cleanest apples-to-apples test bed available in 2026. Every number below is a single-stream measurement and will shift with model, context length and load — a caveat we return to at the end.

Figure: Comparison of fastest AI inference providers by output tokens per second in 2026.

Output tokens per second (t/s) measures how fast a model streams new text in a single request. Roughly, 1 token is about 0.75 of an English word, so 700 t/s is around 525 words per second — far faster than any human reads. Above ~1,000 t/s, the bottleneck stops being the model and starts being your network and application code.
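The conversion above is easy to reproduce. The sketch below assumes the same rule of thumb of 0.75 words per token, which is an approximation that varies by tokenizer and language; the function name is illustrative only.

```python
# Rule of thumb from the text: 1 token is roughly 0.75 of an English word.
# This is an approximation, not a property of any specific tokenizer.
WORDS_PER_TOKEN = 0.75


def words_per_second(tokens_per_second: float,
                     words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert an output speed in tokens/s to approximate English words/s."""
    return tokens_per_second * words_per_token


if __name__ == "__main__":
    for tps in (700.0, 1705.8):
        print(f"{tps:,.1f} t/s is about {words_per_second(tps):,.0f} words/s")
```

At 700 t/s this gives the 525 words per second quoted above; the Cerebras figure of 1,705.8 t/s works out to roughly 1,280 words per second under the same assumption.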

AI inference speed compared: output tokens per second by provider

Across the six providers most associated with fast inference, output speed on gpt-oss-120b ranges from Cerebras at 1,705.8 tokens per second down to GPU-based providers in the 300–700 t/s band. The chart below plots the verified Artificial Analysis figures for the fastest endpoint of each named provider.

Three of these — Cerebras, Groq and SambaNova — run custom silicon (wafer-scale, LPU and RDU respectively) built specifically to stream tokens fast. The other three — Together, Fireworks and Baseten — run optimized stacks on NVIDIA GPUs, and have closed much of the gap through speculative decoding and aggressive kernel work. Baseten, for instance, reports it made its gpt-oss serving 60% faster over ten weeks using EAGLE-3 speculative decoding to reach the ~650 t/s tier.

A note on Groq: it does not top the public Artificial Analysis ranking on gpt-oss-120b, but it remains a speed leader on other open models. On Llama 3.3 70B it posts 306.8 t/s, narrowly ahead of SambaNova’s 295.1 t/s. This is exactly why the ranking is model-dependent.

AI inference speed: tokens/sec on gpt-oss-120b — Cerebras leads gpt-oss-120b by a wide margin; the GPU-based field (Fireworks, Baseten) now rivals second-place custom silicon. Figures are single-stream and vary by context length and load.

Provider	Output speed (t/s)	Time to first token	Hardware class
Cerebras	1,705.8	1.66s	Wafer-Scale Engine (custom)
SambaNova	692.5	sub-1s class	RDU (custom)
Fireworks	683.4	mid-single-digit s	NVIDIA GPU
Baseten	~650	mid-single-digit s	NVIDIA GPU (EAGLE-3)
Together	not in top 5	4.23s	NVIDIA GPU
Groq (Llama 3.3 70B)	306.8 on that model	0.91s	LPU (custom)

Fast-inference providers on gpt-oss-120b — speed, latency and architecture (Artificial Analysis, 2026)
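To check the "2.4x" and "~2.5x" multiples quoted in this article, the ratios can be recomputed from the table. This is a small sketch using only the gpt-oss-120b figures above; Baseten's value is the approximate ~650 t/s, so its ratio is approximate too.

```python
# Output speeds (tokens/s) on gpt-oss-120b as quoted in the table above.
# Baseten's figure is approximate ("~650") in the source.
SPEEDS = {
    "Cerebras": 1705.8,
    "SambaNova": 692.5,
    "Fireworks": 683.4,
    "Baseten": 650.0,
}


def speedup_over(leader: str, speeds: dict[str, float]) -> dict[str, float]:
    """Return how many times faster `leader` is than each other provider."""
    top = speeds[leader]
    return {name: top / tps for name, tps in speeds.items() if name != leader}


if __name__ == "__main__":
    for name, ratio in speedup_over("Cerebras", SPEEDS).items():
        print(f"Cerebras vs {name}: {ratio:.2f}x")
```

The ratios come out at about 2.46x over SambaNova, 2.50x over Fireworks and 2.62x over Baseten, which matches the rounded multiples used in the text.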

Latency vs throughput: why time to first token matters

Peak tokens per second tells you how fast a model finishes; time to first token (TTFT) tells you how fast it starts — and for chat and agents, the second number often dominates the felt experience. On gpt-oss-120b, Cerebras posts both the highest throughput (1,705.8 t/s) and the lowest latency (1.66s TTFT), a rare double.

Below the leader, the trade-offs sharpen. Together is not in the top tier for raw output speed on this model, and although it lands among the lower-latency providers in the wider field, its 4.23s TTFT is still more than twice Cerebras’ 1.66s. That matters for workloads where the first token’s arrival is what users notice: a provider that streams at 680 t/s but takes 4+ seconds to produce its first token will feel slower in a quick back-and-forth than one that starts in under two.

AI inference speed is really two numbers, so the practical rule splits with them: for long-form generation (reports, code files, batch summarization) optimize for throughput, because most of the wall-clock time is spent streaming. For short, interactive turns — the bulk of agent tool-calls and chat replies — optimize for TTFT, because the response is over before peak throughput ever matters.
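One way to make this rule concrete is a first-order model: response time is roughly TTFT plus output tokens divided by output speed. The sketch below assumes a constant streaming rate after the first token and ignores network overhead. The two endpoint profiles are hypothetical, not measurements of any named provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # steady-state output speed


def response_time(ep: Endpoint, output_tokens: int) -> float:
    """Estimated wall-clock seconds: wait for the first token, then stream the rest."""
    return ep.ttft_s + output_tokens / ep.tokens_per_s


def break_even_tokens(a: Endpoint, b: Endpoint) -> float:
    """Output length at which both endpoints take the same total time.

    Only meaningful when the two endpoints have different output speeds.
    """
    return (b.ttft_s - a.ttft_s) / (1 / a.tokens_per_s - 1 / b.tokens_per_s)


# Hypothetical profiles for illustration only.
QUICK_START = Endpoint("quick-start", ttft_s=1.6, tokens_per_s=650.0)
FAST_STREAM = Endpoint("fast-stream", ttft_s=4.3, tokens_per_s=700.0)

if __name__ == "__main__":
    for n in (200, 2_000, 30_000):
        a = response_time(QUICK_START, n)
        b = response_time(FAST_STREAM, n)
        print(f"{n:>6} tokens: quick-start {a:6.2f}s, fast-stream {b:6.2f}s")
    print(f"break-even at {break_even_tokens(QUICK_START, FAST_STREAM):,.0f} tokens")
```

Under these assumptions the faster-streaming endpoint only wins on responses longer than about 24,570 tokens. For a typical 200-token chat reply, the lower-TTFT endpoint finishes in under 2 seconds while the other is still waiting on its first token.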

“Two endpoints can run byte-identical model weights and deliver a 39x difference in speed. Inference speed is a property of the stack, not the model.”
Artificial Analysis gpt-oss-120b benchmark, 2026

Custom silicon vs GPUs: where the speed actually comes from

1,705.8 t/s: Cerebras on gpt-oss-120b. Fastest single endpoint benchmarked by Artificial Analysis.
39.2x: spread from fastest to slowest provider. Same model, same weights, different stacks.
~650 t/s: Baseten on NVIDIA GPUs. Up ~60% in 10 weeks via EAGLE-3 speculative decoding.
1.66s: Cerebras time to first token. Lowest TTFT in the gpt-oss-120b field.

The fastest AI inference speed in 2026 comes from purpose-built chips that keep the entire model in fast on-chip memory, eliminating the memory-bandwidth bottleneck that throttles GPU inference. Cerebras’ Wafer-Scale Engine is a single dinner-plate-sized chip; Groq’s LPU and SambaNova’s RDU take different routes to the same goal: stop shuttling weights back and forth from external memory on every token.

In Cerebras’ own head-to-head, its WSE ran gpt-oss-120b at over 3,000 t/s, roughly 5x the ~650 t/s that Baseten benchmarked on an eight-GPU NVIDIA GB200 setup, at a modestly higher token price ($0.75 vs $0.50 per million in that test). The exact multiple comes from a vendor-run test, but the architectural advantage is real and shows up in the independent Artificial Analysis figures as well.

But custom silicon has a coverage cost. These vendors host a curated menu of popular open models. The moment you need a specific fine-tune, a quantization they don’t offer, or a brand-new release on day one, you are back on GPU providers — Fireworks, Together and Baseten — whose entire value proposition is flexibility plus increasingly competitive speed. The GPU field’s gains via speculative decoding mean the speed penalty for that flexibility is now far smaller than it was in 2024.

How to choose a fast inference provider

Choose your inference provider by matching its strengths to your workload shape, not by chasing the top of a single leaderboard. The right answer is different for an interactive agent, a batch pipeline, and a cost-sensitive product feature — and it changes again if you need a model the speed leaders don’t host.

Start with three questions: Is my model on a custom-silicon menu? Do I need lowest latency or highest throughput? And what is my real cost-per-task once AI inference speed is priced in? On gpt-oss-120b, blended prices on Artificial Analysis ran from about $0.05 per million tokens (DeepInfra) up to the $0.75 range for the fastest custom-silicon endpoints — so the fastest option is rarely the cheapest, and the cheapest is rarely the fastest.
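The third question is mostly arithmetic. The sketch below assumes a single blended price per million tokens, as in the Artificial Analysis figures quoted above. The 8,000-token task size and the 10,000 tasks per day are hypothetical workload numbers chosen for illustration.

```python
def cost_per_task(total_tokens: int, price_per_million_usd: float) -> float:
    """Dollar cost of one task, given its total tokens and a blended price per 1M tokens."""
    return total_tokens / 1_000_000 * price_per_million_usd


def monthly_cost(tasks_per_day: int, tokens_per_task: int,
                 price_per_million_usd: float, days: int = 30) -> float:
    """Dollar cost of running the same task shape every day for `days` days."""
    return tasks_per_day * days * cost_per_task(tokens_per_task, price_per_million_usd)


if __name__ == "__main__":
    TOKENS_PER_TASK = 8_000   # hypothetical: prompt plus completion for one task
    TASKS_PER_DAY = 10_000    # hypothetical workload
    for label, price in (("budget GPU endpoint", 0.05), ("fastest custom silicon", 0.75)):
        per_task = cost_per_task(TOKENS_PER_TASK, price)
        per_month = monthly_cost(TASKS_PER_DAY, TOKENS_PER_TASK, price)
        print(f"{label}: ${per_task:.4f} per task, ${per_month:,.0f} per month")
```

With these inputs the two price points work out to about $120 and $1,800 per month. Whether the 15x difference is worth paying depends on how much the extra speed is worth to the task.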

The pros and cons below summarize where each architecture class makes sense. Use them to shortlist, then run your own prompts: published benchmarks are single-stream snapshots, and your concurrency, context length and prompt mix will move the numbers.

Cerebras

5 out of 5

The outright speed king on supported models, leading both throughput and latency.
Best for: Real-time agents and latency-critical apps on popular open models

What works

1,705.8 t/s on gpt-oss-120b — 2.4x the next provider
Lowest TTFT in the field at 1.66s
Beats 8-GPU NVIDIA GB200 by ~5x in vendor tests

Watch out for

Limited model menu
Premium token pricing (~$0.75/M in head-to-head tests)

SambaNova

5 out of 5

Custom-RDU speed with strong low-latency characteristics across open models.
Best for: Teams wanting custom-silicon speed beyond a single vendor’s menu

What works

692.5 t/s on gpt-oss-120b — fastest after Cerebras
Competitive on Llama 3.3 70B at 295.1 t/s
Strong time-to-first-token profile

Watch out for

Well behind Cerebras on peak throughput
Smaller model catalog than GPU providers

Fireworks

5 out of 5

The fastest of the flexible GPU providers, nearly matching second-place custom silicon.
Best for: Production apps needing speed plus a broad, current model catalog

What works

683.4 t/s on gpt-oss-120b
Runs a wide range of open models and fine-tunes
Mature production tooling

Watch out for

Higher TTFT than custom silicon
Throughput trails Cerebras by ~2.5x

Baseten

5 out of 5

Fastest NVIDIA-GPU option, closing on custom silicon via speculative decoding.
Best for: Teams wanting GPU flexibility with near-custom-silicon speed

What works

~650 t/s on gpt-oss-120b — fastest NVIDIA-based API
Improved ~60% in 10 weeks with EAGLE-3
Dedicated-deployment control

Watch out for

Still behind the custom-silicon leaders
Speed depends on per-model optimization work

Together

5 out of 5

Broadest model selection with respectable latency, mid-pack on raw throughput.
Best for: Maximum model coverage and same-day access to new releases

What works

Among lower-latency providers on gpt-oss-120b
Enormous open-model catalog
Strong for batch and experimentation

Watch out for

Not in the top tier for raw output speed on this model
4.23s TTFT lags the leaders

Groq

5 out of 5

LPU speed that leads on some models even when it trails on others.
Best for: Low-latency serving of the specific models Groq optimizes

What works

306.8 t/s on Llama 3.3 70B — fastest on that model
Strong 0.91s TTFT on Llama 3.3 70B
Custom LPU architecture

Watch out for

Not top-ranked on gpt-oss-120b
Model-specific performance varies

Pros

Custom silicon (Cerebras/Groq/SambaNova): unmatched single-stream speed and low latency on supported models
Cerebras leads both throughput and TTFT on gpt-oss-120b — no trade-off between the two
GPU providers (Fireworks/Together/Baseten): host almost any open model, including same-day releases and custom fine-tunes
GPU stacks have closed much of the speed gap with speculative decoding (Baseten ~650 t/s)
GPU providers often win on price-per-token for non-latency-critical batch work

Cons

Custom-silicon vendors host a limited model menu — your exact fine-tune may not be available
Fastest endpoints carry premium pricing (up to ~$0.75/M tokens vs $0.05 budget GPU options)
GPU providers can trail on TTFT (Together ~4.23s on gpt-oss-120b) despite solid throughput
Every published number is single-stream — real concurrency and long context will lower it
Rankings reshuffle by model: Groq leads on Llama 3.3 70B but not on gpt-oss-120b

The caveat that breaks every inference benchmark

Cerebras owns raw speed; the right pick still depends on your workload

On the common gpt-oss-120b benchmark, Cerebras is the clear AI inference speed leader at 1,705.8 t/s with the lowest latency, and SambaNova and Fireworks lead the chase. But custom silicon only wins where it hosts your model. For broad coverage and same-day access, GPU providers like Fireworks, Together and Baseten — now within striking distance on speed — are the pragmatic default. Shortlist from the leaderboard, then prove it on your own prompts.

Every tokens-per-second figure in this article is a single-stream measurement on one model at one moment — change the model, the context length, the concurrency, or the time of day, and the ranking can reorder. This is the single most important thing to internalize before you pick a provider on the strength of a leaderboard.

Three forces move the numbers. Model: Groq tops Llama 3.3 70B but not gpt-oss-120b. Context length: long prompts add prefill time and drag effective throughput down. Load: a single-user benchmark says nothing about how an endpoint behaves at 100 concurrent requests, when batching and queueing change the math entirely. Artificial Analysis publishes its figures precisely because they are standardized — but standardized is not the same as identical to your production traffic.

The discipline is simple: treat published AI inference speed as a claim to verify, never a number to trust. Use public benchmarks like Artificial Analysis to build a shortlist of three or four candidates. Then replay your own prompts, at your own concurrency, measuring your own cost-per-completed-task. The provider that wins the leaderboard and the provider that wins your bill are frequently not the same name.
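A replay harness does not need to be elaborate. The sketch below measures TTFT and output speed from any iterable that yields streamed chunks. It assumes one chunk is about one token, which real APIs do not guarantee, and it excludes the first token from the speed calculation, which is one common convention rather than a claim about how Artificial Analysis computes its figures. `fake_stream` is a hypothetical stand-in for a provider's streaming client.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and output tokens/s."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival of the first token defines TTFT
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    stream_time = last - first
    # Speed is measured after the first token so TTFT does not dilute it.
    tps = (count - 1) / stream_time if stream_time > 0 else float("nan")
    return {"tokens": count, "ttft_s": first - start,
            "tokens_per_s": tps, "total_s": last - start}


def fake_stream(n_tokens: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response; swap in a real client."""
    time.sleep(first_delay_s)
    for i in range(n_tokens):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


if __name__ == "__main__":
    result = measure_stream(fake_stream(50, first_delay_s=0.2, gap_s=0.002))
    print(result)
```

To turn this into a real test, run it over your own prompt set at your production concurrency and record the distribution of results rather than a single run.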

Published tokens/sec numbers are single-stream snapshots. Before committing, benchmark your real prompt distribution at production concurrency and measure cost-per-task, not just peak speed. The leaderboard winner can lose once your context lengths and load are applied.

Builder’s take

As founder of Cyntr and Loomfeed, I run agent pipelines where latency compounds across dozens of model calls. Here is how I actually read the inference-speed leaderboard:

Headline tokens/sec is a single-stream number on a single model. The instant you change the model, the context length, or the time of day, the ranking shuffles. Treat any benchmark as a snapshot, not a contract.
For interactive chat, time-to-first-token matters more than peak throughput. A provider at 650 t/s with a 1.6s TTFT feels snappier than one at 700 t/s with a 4.3s TTFT.
Custom-silicon speed (Cerebras, Groq, SambaNova) is real, but it only exists for the handful of models those vendors host. If you need a specific fine-tune, GPU providers like Fireworks, Together and Baseten are where you live.
In multi-step agent loops, raw speed is multiplied by your number of hops. Shaving 400ms per call across a 20-call plan is the difference between a 2-second and a 10-second agent.
Speed is necessary but not sufficient — I benchmark cost-per-task and output quality on my own prompts before I trust any leaderboard position.
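The agent-loop arithmetic above can be written out directly. This sketch assumes the calls run strictly one after another with no overlap; the 500 ms and 100 ms per-call latencies are hypothetical values chosen to reproduce the 10-second and 2-second totals.

```python
def plan_latency(calls: int, per_call_s: float) -> float:
    """Total latency of a plan whose model calls run strictly in sequence."""
    return calls * per_call_s


if __name__ == "__main__":
    CALLS = 20
    slow = plan_latency(CALLS, 0.5)  # hypothetical 500 ms per call
    fast = plan_latency(CALLS, 0.1)  # hypothetical 100 ms per call, i.e. 400 ms saved per call
    print(f"slow plan: {slow:.1f}s, fast plan: {fast:.1f}s, saved: {slow - fast:.1f}s")
```

Plans that can run some calls in parallel will see a smaller total, so this is an upper bound on how much per-call latency compounds.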

Frequently asked questions

What is the fastest AI inference provider in 2026?

On gpt-oss-120b, the most widely benchmarked common open model, Cerebras is the fastest at 1,705.8 output tokens per second according to Artificial Analysis — more than 2.4x the next-fastest provider. SambaNova (692.5 t/s) and Fireworks (683.4 t/s) follow. Rankings shift by model: Groq leads on Llama 3.3 70B.

How is AI inference speed measured?

It is measured in output tokens per second (t/s) — how fast a model streams generated text in a single request — alongside time to first token (TTFT), which is how long until the first token appears. Throughput matters for long generations; TTFT matters for interactive chat and agents. One token is roughly 0.75 of an English word.

Why are custom-silicon providers faster than GPU providers?

Chips like Cerebras’ Wafer-Scale Engine, Groq’s LPU and SambaNova’s RDU keep the entire model in fast on-chip memory, removing the memory-bandwidth bottleneck that throttles GPU inference. In Cerebras’ own tests it ran gpt-oss-120b about 5x faster than an eight-GPU NVIDIA GB200 setup, though GPU providers have narrowed the gap with speculative decoding.

Is the fastest inference provider also the cheapest?

Usually not. On gpt-oss-120b, blended prices on Artificial Analysis ranged from about $0.05 per million tokens (DeepInfra) up to roughly $0.75 for the fastest custom-silicon endpoints. The fastest option carries a premium, and the cheapest option is rarely the fastest — so you should weigh cost-per-task against speed.

Why do inference speed benchmarks vary so much?

Published figures are single-stream snapshots on one model. Output speed changes with the model (Groq leads Llama 3.3 70B but not gpt-oss-120b), context length (long prompts slow effective throughput), and concurrency or load (a one-user test says nothing about 100 simultaneous requests). Artificial Analysis reports a 39.2x spread across providers on the same model.

Should I pick an inference provider based on Artificial Analysis rankings?

Use them to build a shortlist, not to make the final call. Artificial Analysis provides standardized, neutral benchmarks ideal for narrowing to three or four candidates. Then replay your own prompts at your real concurrency and measure cost-per-completed-task, because the leaderboard winner and the provider that minimizes your bill are often different.

Primary sources

gpt-oss-120b: API Provider Performance Benchmarking & Price Analysis — Artificial Analysis
Llama 3.3 70B: API Provider Performance Benchmarking — Artificial Analysis
OpenAI GPT-OSS 120B Benchmarked: NVIDIA Blackwell vs Cerebras — Cerebras
How we made the fastest GPT-OSS on NVIDIA GPUs 60% faster — Baseten
Llama3.1 Model Quality Evaluation: Cerebras, Groq, SambaNova, Together, Fireworks — Cerebras

Last updated: May 31, 2026.
