AI inference economics 2026: the token spread

AI inference economics 2026 comes down to one headline number: the same open-weight model, Llama 4 70B, spans roughly 6.5x in output-token price across major inference providers, while latency and throughput diverge just as sharply. Using Digital Applied’s April 2026 measurements, provider pricing pages, and vendor disclosures, this data breakdown shows what the per-million-token spread actually buys you in production.

The headline chart: one model, a 6.5x output-price spread

6.46x: price spread on Llama 4 70B output tokens, from $0.65/M to $4.20/M in Digital Applied’s April 2026 measurements

$0.65/M: lowest listed output price (Together AI batch)

$4.20/M: highest listed output price (Cerebras)

The market has a real spread

For open-weight inference, token pricing is no longer clustered tightly enough to ignore provider choice. The cheapest option is tied to slower service modes, while the fastest options sit at the top of the price curve.

The core finding in AI inference economics 2026 is simple: open-weight inference has become a market with real price dispersion, not a single clearing price. In Digital Applied’s Q2 2026 matrix, Llama 4 70B output pricing ranges from $0.65 per million output tokens on Together AI batch inference to $4.20 per million output tokens on Cerebras. That is a spread of about 6.46x for the same model family and the same basic unit of output, with Anthropic and OpenAI excluded because the comparison is limited to open-weight inference.

The practical takeaway is that a low per-million-token number does not mean a provider is universally cheaper for your workload. The cheapest figure in the set, Together AI’s batch tier, comes with a stated 60-minute latency profile in the Digital Applied comparison. At the other end, specialty hardware providers charge materially more per token but can deliver much higher decode throughput. Readers looking only at token price will miss the operating constraint that actually matters in production: whether your application is bottlenecked by cost, responsiveness, or concurrency.

Figure: pricing matrix for AI inference providers in 2026. Image from the source page, used under fair use.

Digital Applied says its Q2 2026 matrix uses public pricing pages, Artificial Analysis benchmarks, and direct April 2026 testing.

“The same open-weight model spans roughly 6.5x in output-token price across major providers.”
Digital Applied Q2 2026 pricing matrix

Provider	Llama 4 70B output pricing	Commercial mode	Notes
Together AI	$0.65/M	Batch	60-minute latency in Digital Applied matrix
Together AI	$0.95/M	Reserved capacity	Commitment-based option in Digital Applied matrix
Fireworks AI	$1.20/M	Serverless	Open-weight API pricing comparison
OctoAI	$1.50/M	Serverless	Digital Applied comparison
Anyscale Endpoints	$2.10/M	Enterprise	Digital Applied comparison
Replicate	$2.55/M	On-demand	Digital Applied comparison
Groq LPU	$3.20/M	Specialty hardware	Digital Applied comparison
Cerebras	$4.20/M	Wafer-scale	Digital Applied comparison

Llama 4 70B output-token pricing from Digital Applied’s Q2 2026 matrix, based on April 2026 measurements.
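The headline spread can be reproduced directly from the matrix. The sketch below is only an illustration of the arithmetic: the dictionary keys and the `price_spread` helper are names chosen here, not anything published by Digital Applied.

```python
# Llama 4 70B output prices in $ per million tokens, copied from the matrix above.
PRICES = {
    "Together AI (batch)": 0.65,
    "Together AI (reserved)": 0.95,
    "Fireworks AI": 1.20,
    "OctoAI": 1.50,
    "Anyscale Endpoints": 2.10,
    "Replicate": 2.55,
    "Groq LPU": 3.20,
    "Cerebras": 4.20,
}


def price_spread(prices):
    """Return (cheapest name, priciest name, highest price / lowest price)."""
    low = min(prices, key=prices.get)
    high = max(prices, key=prices.get)
    return low, high, prices[high] / prices[low]


low, high, ratio = price_spread(PRICES)
print(f"{low} -> {high}: {ratio:.2f}x")  # Together AI (batch) -> Cerebras: 6.46x
```

Note that the two ends of the ratio are different service modes (batch versus wafer-scale), so the multiple describes the menu, not a like-for-like markup.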

Pricing matrix breakdown: what the low end and high end really represent

The cheapest number in this dataset is not a generic serverless endpoint. It is Together AI batch pricing at $0.65/M for Llama 4 70B output, paired in the Digital Applied matrix with 60-minute latency. Together’s reserved-capacity figure, $0.95/M, sits closer to the mainstream serverless market while still undercutting Fireworks AI at $1.20/M and OctoAI at $1.50/M. That is the first lesson of AI inference economics 2026: the lowest token price usually reflects a different service contract, not a free lunch.

The middle of the market is where many production teams will likely compare offers. Fireworks AI’s public pricing page shows broad tiering by model size, with sub-4B models at $0.10/M, 4B-16B at $0.20/M, and models above 16B at $0.90/M. Fireworks also publishes split pricing for DeepSeek V3 at $0.56/M input and $1.68/M output, which is a useful reminder that blended request cost depends on prompt-to-completion ratio, not output price alone. Together AI’s own pricing page shows a wider catalog range, from $0.10/M on smaller models up to $9.00/M on the largest listed models, plus fine-tuning prices from $0.48/M for LoRA on models up to 16B to $3.20/M for full fine-tuning on 70B-100B models.
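To make the prompt-to-completion point concrete, here is a minimal sketch using the Fireworks DeepSeek V3 split prices quoted above. The two request shapes (9,000/1,000 and 1,000/9,000 tokens) are hypothetical, and `blended_price_per_million` is a helper name invented for this example.

```python
def blended_price_per_million(input_price, output_price, input_tokens, output_tokens):
    """Average $ per million tokens for a request with the given token mix."""
    total_tokens = input_tokens + output_tokens
    weighted = input_tokens * input_price + output_tokens * output_price
    return weighted / total_tokens


# Fireworks AI's published DeepSeek V3 split pricing, in $ per million tokens.
IN_PRICE, OUT_PRICE = 0.56, 1.68

# Hypothetical request shapes: (input tokens, output tokens).
prompt_heavy = blended_price_per_million(IN_PRICE, OUT_PRICE, 9_000, 1_000)
completion_heavy = blended_price_per_million(IN_PRICE, OUT_PRICE, 1_000, 9_000)

print(round(prompt_heavy, 3))      # 0.672
print(round(completion_heavy, 3))  # 1.568
```

Under these assumed shapes the same price list yields a blended rate that differs by more than 2x, which is why an output-only comparison can mislead for retrieval-heavy or summarization workloads.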

At the high end, Groq and Cerebras are not simply expensive versions of commodity inference. They are selling a different performance envelope. Digital Applied places Groq at $3.20/M and Cerebras at $4.20/M for Llama 4 70B output. Those prices look rich against Together or Fireworks, but they sit alongside much higher decode throughput. The pricing matrix only makes sense when read next to the speed data.

Some providers quote input and output separately, some emphasize serverless tiers, and some sell reserved or batch capacity. Comparing only one headline number can hide the real bill.

Provider/data point	Published price	Unit	Source context
Together AI catalog range	$0.10/M to $9.00/M	Tokens	Public pricing page across model sizes
Together AI fine-tuning	$0.48/M to $3.20/M	Training tokens	LoRA through full fine-tune tiers
Fireworks AI sub-4B	$0.10/M	Tokens	Public pricing tier
Fireworks AI 4B-16B	$0.20/M	Tokens	Public pricing tier
Fireworks AI >16B	$0.90/M	Tokens	Public pricing tier
Fireworks AI DeepSeek V3 input	$0.56/M	Input tokens	Published split pricing
Fireworks AI DeepSeek V3 output	$1.68/M	Output tokens	Published split pricing
Modal Labs effective H100 rate	~$3.95/hr	GPU hour	CloudZero analysis of sustained-load economics

Selected public pricing details that frame the Llama 4 70B comparison.

Throughput chart: why tokens per second can outweigh token price

750 tps: Groq LPU decode throughput (Llama 4 70B output decode)

620 tps: Cerebras decode throughput (Llama 4 70B output decode)

100-150 tps: commodity H100 endpoint range (Llama 4 70B output decode)

Speed changes the economics

Throughput is the hidden variable in most pricing comparisons. Once a workload is concurrency-bound, higher token prices can still produce lower operational cost per served request.

Digital Applied reports 750 tokens per second for Groq LPU and 620 tokens per second for Cerebras on Llama 4 70B output decode, versus roughly 100-150 tokens per second for commodity H100 endpoints. That puts Groq at about 5x to 7.5x and Cerebras at about 4.1x to 6.2x the throughput of the commodity baseline, depending on where in the H100 range a given endpoint lands. This is the second major point in AI inference economics 2026: a provider that looks expensive per million tokens can still be economically rational if your application is queue-bound.

For interactive products, throughput is not just a benchmark vanity metric. It affects time-to-first-completion under load, how many concurrent sessions a deployment can absorb, and whether you need to overprovision replicas to survive spikes. A team paying $1.20/M on a slower serverless endpoint may still spend more per successful user interaction than a team paying $3.20/M on Groq if the slower stack forces retries, larger buffers, or lower concurrency ceilings.

It helps to treat this as a triangle: price, latency, and throughput do not all optimize at once. Together batch is cheap and slow. Cerebras is fast and expensive. Commodity serverless sits in the middle but often with more latency variance. The spread is not market inefficiency so much as a menu of different operating assumptions.

If your bottleneck is tokens per second rather than raw token spend, the fastest provider can be cheaper at the request level even when it is pricier per million tokens.

“Groq and Cerebras charge more per token, but they can deliver roughly 5x to 7.5x the throughput of commodity H100 endpoints.”
Digital Applied Q2 2026 pricing matrix

Platform type	Reported throughput	Relative to 100 tps baseline	Relative to 150 tps baseline
Groq LPU	750 tps	7.5x	5.0x
Cerebras	620 tps	6.2x	4.1x
Commodity H100 endpoints	100-150 tps	1.0x-1.5x	0.67x-1.0x

Llama 4 70B output decode throughput from Digital Applied’s April 2026 comparison.
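One way to read the throughput figures is as time spent decoding a single response. The sketch below assumes a hypothetical 600-token response and a steady decode rate, and it ignores time to first token, queueing, and network overhead, none of which are broken out in the figures above.

```python
# Reported Llama 4 70B decode throughput, in tokens per second.
THROUGHPUT_TPS = {
    "Groq LPU": 750,
    "Cerebras": 620,
    "Commodity H100 (high end)": 150,
    "Commodity H100 (low end)": 100,
}


def decode_seconds(output_tokens, tokens_per_second):
    """Seconds needed to stream output_tokens at a steady decode rate."""
    return output_tokens / tokens_per_second


RESPONSE_TOKENS = 600  # hypothetical response length

for name, tps in THROUGHPUT_TPS.items():
    print(f"{name}: {decode_seconds(RESPONSE_TOKENS, tps):.2f} s")
```

With these inputs the decode phase takes under a second on the specialty hardware and roughly four to six seconds on the commodity range, which is the gap an interactive product would actually feel.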

Billing model breakdown: per-token, per-second, and reserved capacity

Most of the market in this comparison sells inference on a per-token basis. That model is easy for finance teams to reason about because cost scales with usage and there is no explicit idle penalty. It also makes cross-provider shopping easier, which is one reason open-weight inference is commoditizing. In AI inference economics 2026, per-token pricing is the dominant billing language, but it is not the only one that matters.

Modal Labs is a useful counterexample because it sells containerized GPU time rather than a simple token meter. CloudZero’s analysis cites an effective H100 rate of about $3.95 per hour under sustained load. That structure can work well for teams that can keep accelerators busy with batched or bursty jobs, especially when they control scheduling tightly. It is less attractive when traffic is steady but not dense enough to maintain high utilization, because the idle tax becomes your problem rather than the provider’s.
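The idle-tax argument can be turned into a break-even calculation. This is a rough sketch under stated assumptions: the $3.95/hr figure is the CloudZero rate cited above, throughput is the aggregate tokens per second across all batched requests on the billed hardware (a hypothetical input, not a measured one), and a deployment that needs several GPUs would multiply the hourly rate accordingly.

```python
def gpu_cost_per_million(rate_per_hour, tokens_per_second, utilization):
    """Effective $ per million tokens when paying for GPU time by the hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return rate_per_hour / tokens_per_hour * 1_000_000


def break_even_tps(rate_per_hour, price_per_million, utilization=1.0):
    """Aggregate tokens/s at which hourly GPU billing matches a per-token price."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return rate_per_hour / (price_per_million * 3600 * utilization) * 1_000_000


H100_RATE = 3.95  # $/hr, effective sustained-load rate cited by CloudZero

# Compare against Fireworks' $1.20/M Llama 4 70B output price from the matrix.
print(round(break_even_tps(H100_RATE, 1.20)))       # 914
print(round(break_even_tps(H100_RATE, 1.20, 0.5)))  # 1829
```

Halving utilization doubles the throughput you need while busy to match the per-token price, which is the idle tax expressed as a number.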

Reserved capacity sits between those poles. Digital Applied lists Together AI reserved capacity for Llama 4 70B output at $0.95/M, above its batch price but below many serverless alternatives. The trade-off is commitment. You get a discount relative to generic on-demand service, but you also accept lock-in and planning risk. If your demand profile is predictable, that can be rational. If it is not, the apparent savings may disappear into unused capacity or migration friction.
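A simple way to test whether a commitment pays off is to ask how much of it you must actually use. The sketch below assumes, hypothetically, that a reserved commitment is billed in full whether or not the tokens are consumed. The source does not describe Together AI's contract terms, so treat this as a model of commitment risk in general rather than of that specific product.

```python
def effective_reserved_price(reserved_price, used_fraction):
    """$ per million tokens actually consumed, if unused commitment is still billed."""
    if not 0 < used_fraction <= 1:
        raise ValueError("used_fraction must be in (0, 1]")
    return reserved_price / used_fraction


def break_even_usage(reserved_price, on_demand_price):
    """Fraction of the commitment you must use to match the on-demand price."""
    return reserved_price / on_demand_price


RESERVED = 0.95   # Together AI reserved capacity, $/M output tokens
ON_DEMAND = 1.20  # Fireworks AI serverless, $/M output tokens

print(f"{break_even_usage(RESERVED, ON_DEMAND):.1%}")        # 79.2%
print(round(effective_reserved_price(RESERVED, 0.60), 2))    # 1.58
```

Under that assumption, using only 60% of a commitment pushes the effective price above the serverless alternative it was meant to undercut.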

Pros

Per-token pricing is easy to forecast
Per-second pricing can reward high utilization
Reserved capacity can lower unit cost for stable demand

Cons

Per-token pricing can hide latency and throughput constraints
Per-second pricing punishes idle time
Reserved capacity can create commitment risk

def output_cost(tokens, price_per_million):
    """Return the dollar cost of `tokens` output tokens at a $/M-token price."""
    return (tokens / 1_000_000) * price_per_million

# Example: 250k output tokens on Fireworks at $1.20/M
print(round(output_cost(250_000, 1.20), 4))  # 0.3

Billing model	Example	Best fit	Main trade-off
Per-token	Fireworks, Together, Replicate	Predictable variable cost	Can become throughput-bound at peak
Per-second / GPU time	Modal Labs	Batched or tightly managed workloads	Idle utilization risk
Reserved capacity	Together AI reserved	Stable demand with planning discipline	Commitment and lock-in

Commercial models shape the real economics as much as headline token price.

Specialty hardware math: when a higher token price is still the cheaper choice

Specialty hardware is a throughput bet

Groq and Cerebras are not competing to be the cheapest token meter. They are selling higher service rate, which matters most when concurrency and latency targets dominate the cost model.

The cleanest way to understand the Groq and Cerebras premium is to compare throughput uplift against token-price uplift. Relative to Together batch at $0.65/M, Groq at $3.20/M is about 4.9x more expensive per output token. Cerebras at $4.20/M is about 6.5x more expensive. On throughput, though, Groq’s 750 tps is roughly 5x to 7.5x faster than the 100-150 tps commodity H100 range, while Cerebras at 620 tps is roughly 4.1x to 6.2x faster. That means the premium is not obviously irrational if your bottleneck is service rate.

The shorthand that Cerebras offers roughly 6x throughput at about 3x cost versus commodity H100 endpoints captures the directional point, though the exact ratio depends on which commodity price you use for comparison. Against Fireworks at $1.20/M, Cerebras is 3.5x pricier on output tokens. Against OctoAI at $1.50/M, it is 2.8x. If the faster platform lets you serve several times more concurrent traffic or hit a latency SLO that avoids overprovisioning, the request-level economics can flip in its favor.

NVIDIA is making the same broader argument from the hardware side. In its Blackwell inference analysis, the company says newer systems can reduce cost per token for open-source models by improving throughput and efficiency. Vendor analyses should always be read critically, but the directional claim matches the market data here: hardware architecture is now a first-order variable in inference economics, not a backend implementation detail.

Do not compare specialty hardware only to the cheapest batch tier. Compare it to the slowest provider that still meets your latency target.

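def relative_multiple(a, b):
    """Return how many times larger a is than b, rounded to two decimals."""
    return round(a / b, 2)

# Llama 4 70B output-price multiples, using $/M figures from the matrix
print({
    'groq_vs_together_batch_price': relative_multiple(3.20, 0.65),
    'cerebras_vs_fireworks_price': relative_multiple(4.20, 1.20),
    'cerebras_vs_octo_price': relative_multiple(4.20, 1.50),
})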

“If your bottleneck is tokens per second, specialty hardware can be cheaper per served request despite a higher per-million-token price.”
Digital Applied data and NVIDIA Blackwell inference analysis

Comparison	Token price multiple	Throughput multiple	Interpretation
Groq vs Together batch	4.9x	Not directly comparable on latency mode	Higher price buys a very different performance profile
Cerebras vs Fireworks	3.5x	About 4.1x-6.2x vs commodity H100 baseline	Can be favorable if queueing is the bottleneck
Cerebras vs OctoAI	2.8x	About 4.1x-6.2x vs commodity H100 baseline	Premium narrows against mid-market serverless pricing

Illustrative ratios using Digital Applied’s Llama 4 70B pricing and throughput figures.
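The table's two multiples can be combined into a single ratio: price uplift divided by throughput uplift. The sketch below assumes that a mid-market endpoint such as Fireworks decodes somewhere in the 100-150 tps commodity H100 range, which the comparison above implies but does not measure directly, and `premium_per_throughput` is a name made up for this example.

```python
def premium_per_throughput(price, base_price, tps, base_tps):
    """Price multiple divided by throughput multiple.

    Below 1.0, the price premium is smaller than the throughput gain.
    """
    return (price / base_price) / (tps / base_tps)


CEREBRAS_PRICE, CEREBRAS_TPS = 4.20, 620  # $/M output tokens, tokens/s
FIREWORKS_PRICE = 1.20                    # $/M output tokens

# Evaluate against both ends of the commodity H100 throughput range.
for base_tps in (100, 150):
    ratio = premium_per_throughput(
        CEREBRAS_PRICE, FIREWORKS_PRICE, CEREBRAS_TPS, base_tps
    )
    print(f"baseline {base_tps} tps: {ratio:.2f}")
```

This prints 0.56 and 0.85 for the two baselines. A ratio below 1.0 only translates into savings when the workload is genuinely throughput-bound; for cost-bound workloads the raw token price still dominates.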

What the data means for production inference decisions

7 providers: the Q2 2026 serverless inference market has consolidated around seven providers

5x-7x: latency spread cited as headline market context

What the spread actually tells you

The per-million-token spread is not just a pricing story. It is a map of service modes: batch, serverless, reserved, and specialty hardware each optimize a different part of the production stack.

The market context behind AI inference economics 2026 is that open-weight inference is consolidating into a smaller set of recognizable providers while pricing moats keep shrinking. The Q2 2026 serverless market has consolidated around seven providers, and the Llama 4 70B comparison shows why competition is intensifying: the same model already fits inside a relatively narrow absolute band of $0.65/M to $4.20/M, even though the relative spread is large. For builders, that means model quality is no longer the only scarce asset. Routing, workload shaping, and hardware fit matter more every quarter.

There is also a strategic implication for proprietary-model pricing. This open-weight band is reported to sit far below GPT-4o-class proprietary pricing, with many builders getting comparable quality at a fraction of the cost. This article does not extend that comparison numerically because the Digital Applied matrix excludes Anthropic and OpenAI, but the direction is clear from the open-weight side alone: token economics are compressing fast enough that infrastructure choices, not just model choices, are becoming the main source of margin.

For operators, the recommendation is workload-specific. Choose Together batch when latency is irrelevant and cost minimization dominates. Choose Fireworks or another mid-market serverless option when you need straightforward per-token billing without the highest specialty-hardware premium. Choose Together reserved capacity when demand is stable enough to justify commitment. Choose Groq or Cerebras when tokens per second, concurrency, or tight response-time targets are the real constraint. The per-million-token spread tells you less about who is cheapest in the abstract than about which provider is optimized for your workload shape.

Open-weight inference is commoditizing on price, but not on performance. The winning provider depends on whether your application is cost-bound, latency-bound, or throughput-bound.

Workload shape	Likely best fit	Why
Offline batch generation	Together AI batch	Lowest listed token price, latency is acceptable
General serverless production	Fireworks AI / similar mid-market endpoint	Balanced token pricing with standard API model
Predictable steady demand	Together AI reserved capacity	Lower unit cost with commitment
High-concurrency interactive apps	Groq or Cerebras	Higher throughput can reduce queueing and overprovisioning
Custom scheduled GPU jobs	Modal Labs	Per-second model can work when utilization is tightly managed

Provider fit depends more on workload shape than on the cheapest headline token price.
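The workload table above can be read as a small decision procedure. The sketch below encodes it as one; the `Workload` fields and the order in which the checks run are assumptions made for illustration, and a real selection would also weigh measured latency, quotas, and contract terms.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical flags describing a workload's dominant constraint."""
    self_managed_gpu_jobs: bool = False  # team schedules its own GPU jobs
    latency_tolerant: bool = False       # offline or batch generation
    throughput_bound: bool = False       # high-concurrency interactive traffic
    steady_demand: bool = False          # predictable volume, can commit


def suggest_provider(w: Workload) -> str:
    """Map a workload shape to the 'likely best fit' column of the table."""
    if w.self_managed_gpu_jobs:
        return "Modal Labs"
    if w.latency_tolerant:
        return "Together AI batch"
    if w.throughput_bound:
        return "Groq or Cerebras"
    if w.steady_demand:
        return "Together AI reserved capacity"
    return "Fireworks AI or a similar mid-market endpoint"


print(suggest_provider(Workload(throughput_bound=True)))  # Groq or Cerebras
print(suggest_provider(Workload()))
```

The default branch is the mid-market serverless option because it is the fallback when no single constraint dominates.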

Frequently asked questions

What does per-million-token pricing actually measure?

It measures the variable cost of model usage, usually split into input and output tokens or quoted separately for one side. For examples, see Together AI’s pricing page and Fireworks AI’s pricing page, both of which publish token-based pricing structures.

Why is the same Llama 4 70B model so much cheaper on some providers than others?

Because providers are not selling the same service envelope. Digital Applied’s Q2 2026 matrix shows Together AI batch at $0.65/M with a 60-minute latency profile, while Groq and Cerebras charge more but deliver much higher throughput. The matrix and methodology are published at Digital Applied.

When does specialty hardware make sense despite higher token prices?

It makes sense when your application is bottlenecked by tokens per second, concurrency, or latency SLOs rather than raw token spend. Digital Applied reports 750 tps for Groq and 620 tps for Cerebras versus 100-150 tps for commodity H100 endpoints, and NVIDIA argues newer Blackwell systems can reduce cost per token through higher efficiency in its analysis at NVIDIA’s blog.

Is per-token billing always better than per-second GPU billing?

No. Per-token billing is easier to forecast and avoids explicit idle waste, but per-second GPU billing can work well when you keep accelerators highly utilized. CloudZero’s analysis of Modal-style economics cites an effective sustained-load H100 rate of about $3.95/hour.

Primary sources

Digital Applied — AI inference providers pricing matrix Q2 2026 — Digital Applied
Together AI pricing — Together AI
Fireworks AI pricing — Fireworks AI
CloudZero — Together AI pricing analysis — CloudZero
Featherless — LLM API pricing comparison 2026 — Featherless
NVIDIA — Blackwell inference cost-per-token analysis — NVIDIA

Last updated: May 22, 2026.
