AI inference economics 2026: the token spread

Surya Koritala

AI inference economics in 2026 comes down to one headline number: the same open-weight model, Llama 4 70B, spans roughly 6.5x in output-token price across major inference providers, while latency and throughput diverge just as sharply. Using Digital Applied’s April 2026 measurements, provider pricing pages, and vendor disclosures, this data breakdown shows what the per-million-token spread actually buys you in production.

The headline chart: one model, a 6.5x output-price spread

  • 6.46x: price spread on Llama 4 70B output tokens, from $0.65/M to $4.20/M in Digital Applied’s April 2026 measurements
  • $0.65/M: lowest listed output price (Together AI batch)
  • $4.20/M: highest listed output price (Cerebras)

The market has a real spread

For open-weight inference, token pricing is no longer clustered tightly enough to ignore provider choice. The cheapest option is tied to slower service modes, while the fastest options sit at the top of the price curve.

The core finding in AI inference economics 2026 is simple: open-weight inference has become a market with real price dispersion, not a single clearing price. In Digital Applied’s Q2 2026 matrix, Llama 4 70B output pricing ranges from $0.65 per million output tokens on Together AI batch inference to $4.20 per million output tokens on Cerebras. That is a spread of about 6.46x for the same model family and the same basic unit of output, with Anthropic and OpenAI excluded because the comparison is limited to open-weight inference.

The practical takeaway is that a low per-million-token number does not mean a provider is universally cheaper for your workload. The cheapest figure in the set, Together AI’s batch tier, comes with a stated 60-minute latency profile in the Digital Applied comparison. At the other end, specialty hardware providers charge materially more per token but can deliver much higher decode throughput. Readers looking only at token price will miss the operating constraint that actually matters in production: whether your application is bottlenecked by cost, responsiveness, or concurrency.

[Figure: Pricing matrix for AI inference providers in 2026. Image from the source page.]

Digital Applied says its Q2 2026 matrix uses public pricing pages, Artificial Analysis benchmarks, and direct April 2026 testing.

“The same open-weight model spans roughly 6.5x in output-token price across major providers.”

Digital Applied Q2 2026 pricing matrix

| Provider | Llama 4 70B output pricing | Commercial mode | Notes |
| --- | --- | --- | --- |
| Together AI | $0.65/M | Batch | 60-minute latency in Digital Applied matrix |
| Together AI | $0.95/M | Reserved capacity | Commitment-based option in Digital Applied matrix |
| Fireworks AI | $1.20/M | Serverless | Open-weight API pricing comparison |
| OctoAI | $1.50/M | Serverless | Digital Applied comparison |
| Anyscale Endpoints | $2.10/M | Enterprise | Digital Applied comparison |
| Replicate | $2.55/M | On-demand | Digital Applied comparison |
| Groq LPU | $3.20/M | Specialty hardware | Digital Applied comparison |
| Cerebras | $4.20/M | Wafer-scale | Digital Applied comparison |

Llama 4 70B output-token pricing from Digital Applied’s Q2 2026 matrix, based on April 2026 measurements.
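The headline spread can be reproduced directly from the matrix. The following is a minimal sketch that hard-codes the eight rows above; the dictionary keys are illustrative labels chosen for this example, not provider API identifiers.

```python
# Llama 4 70B output prices in USD per million tokens, copied from the
# Digital Applied matrix. Keys are illustrative labels only.
OUTPUT_PRICE_PER_M = {
    "together_batch": 0.65,
    "together_reserved": 0.95,
    "fireworks_serverless": 1.20,
    "octoai_serverless": 1.50,
    "anyscale_enterprise": 2.10,
    "replicate_on_demand": 2.55,
    "groq_lpu": 3.20,
    "cerebras": 4.20,
}


def price_spread(prices):
    """Return (cheapest_label, priciest_label, max/min price ratio)."""
    cheapest = min(prices, key=prices.get)
    priciest = max(prices, key=prices.get)
    return cheapest, priciest, prices[priciest] / prices[cheapest]


cheapest, priciest, ratio = price_spread(OUTPUT_PRICE_PER_M)
print(f"{cheapest} -> {priciest}: {ratio:.2f}x")  # together_batch -> cerebras: 6.46x
```

Note that the ratio compares a batch tier against wafer-scale hardware, so it measures the width of the menu rather than a like-for-like price gap.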

Pricing matrix breakdown: what the low end and high end really represent

The cheapest number in this dataset is not a generic serverless endpoint. It is Together AI batch pricing at $0.65/M for Llama 4 70B output, paired in the Digital Applied matrix with 60-minute latency. Together’s reserved-capacity figure, $0.95/M, sits closer to the mainstream serverless market while still undercutting Fireworks AI at $1.20/M and OctoAI at $1.50/M. That is the first lesson of AI inference economics 2026: the lowest token price usually reflects a different service contract, not a free lunch.

The middle of the market is where many production teams will likely compare offers. Fireworks AI’s public pricing page shows broad tiering by model size, with sub-4B models at $0.10/M, 4B-16B at $0.20/M, and models above 16B at $0.90/M. Fireworks also publishes split pricing for DeepSeek V3 at $0.56/M input and $1.68/M output, which is a useful reminder that blended request cost depends on prompt-to-completion ratio, not output price alone. Together AI’s own pricing page shows a wider catalog range, from $0.10/M on smaller models up to $9.00/M on the largest listed models, plus fine-tuning prices from $0.48/M for LoRA on models up to 16B to $3.20/M for full fine-tuning on 70B-100B models.
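To see why the prompt-to-completion ratio matters, here is a rough sketch using the Fireworks AI DeepSeek V3 split pricing quoted above. The two request shapes (9,000/1,000 and 1,000/9,000 tokens) are hypothetical and chosen only to hold total tokens constant.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD of one request under split input/output pricing."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Fireworks AI's published DeepSeek V3 split pricing (USD per million tokens).
IN_PRICE, OUT_PRICE = 0.56, 1.68

# Two hypothetical request shapes with the same 10,000 total tokens.
prompt_heavy = request_cost(9_000, 1_000, IN_PRICE, OUT_PRICE)
completion_heavy = request_cost(1_000, 9_000, IN_PRICE, OUT_PRICE)

print(f"prompt-heavy:     ${prompt_heavy:.5f}")
print(f"completion-heavy: ${completion_heavy:.5f}")
print(f"ratio:            {completion_heavy / prompt_heavy:.2f}x")
```

With identical token counts, the completion-heavy request costs more than twice as much, which is why an output-only price comparison can mislead for retrieval-style workloads that are mostly prompt.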

At the high end, Groq and Cerebras are not simply expensive versions of commodity inference. They are selling a different performance envelope. Digital Applied places Groq at $3.20/M and Cerebras at $4.20/M for Llama 4 70B output. Those prices look rich against Together or Fireworks, but they sit alongside much higher decode throughput. The pricing matrix only makes sense when read next to the speed data.

Some providers quote input and output separately, some emphasize serverless tiers, and some sell reserved or batch capacity. Comparing only one headline number can hide the real bill.

| Provider / data point | Published price | Unit | Source context |
| --- | --- | --- | --- |
| Together AI catalog range | $0.10/M to $9.00/M | Tokens | Public pricing page across model sizes |
| Together AI fine-tuning | $0.48/M to $3.20/M | Training tokens | LoRA through full fine-tune tiers |
| Fireworks AI sub-4B | $0.10/M | Tokens | Public pricing tier |
| Fireworks AI 4B-16B | $0.20/M | Tokens | Public pricing tier |
| Fireworks AI >16B | $0.90/M | Tokens | Public pricing tier |
| Fireworks AI DeepSeek V3 input | $0.56/M | Input tokens | Published split pricing |
| Fireworks AI DeepSeek V3 output | $1.68/M | Output tokens | Published split pricing |
| Modal Labs effective H100 rate | ~$3.95/hr | GPU hour | CloudZero analysis of sustained-load economics |

Selected public pricing details that frame the Llama 4 70B comparison.

Throughput chart: why tokens per second can outweigh token price

  • 750 tps: Groq LPU decode throughput (Llama 4 70B output decode)
  • 620 tps: Cerebras decode throughput (Llama 4 70B output decode)
  • 100-150 tps: commodity H100 endpoint range (Llama 4 70B output decode)

Speed changes the economics

Throughput is the hidden variable in most pricing comparisons. Once a workload is concurrency-bound, higher token prices can still produce lower operational cost per served request.

Digital Applied reports 750 tokens per second for Groq LPU and 620 tokens per second for Cerebras on Llama 4 70B output decode, versus roughly 100-150 tokens per second for commodity H100 endpoints. That puts Groq at about 5x to 7.5x the commodity baseline and Cerebras at about 4.1x to 6.2x, depending on where in the H100 range a given endpoint lands. This is the second major point in AI inference economics 2026: a provider that looks expensive per million tokens can still be economically rational if your application is queue-bound.

For interactive products, throughput is not just a benchmark vanity metric. It affects time-to-first-completion under load, how many concurrent sessions a deployment can absorb, and whether you need to overprovision replicas to survive spikes. A team paying $1.20/M on a slower serverless endpoint may still spend more per successful user interaction than a team paying $3.20/M on Groq if the slower stack forces retries, larger buffers, or lower concurrency ceilings.
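The trade-off in the paragraph above can be sketched with two small functions. Several things here are assumptions rather than source data: the 500-token completion length, the reading of the reported tps figures as per-stream decode speed, and the pairing of the commodity H100 range with a $1.20/M mid-market price. Queueing and time to first token are ignored.

```python
def decode_seconds(output_tokens, tokens_per_second):
    """Time to stream a completion, ignoring queueing and time to first token."""
    return output_tokens / tokens_per_second


def output_cost_usd(output_tokens, price_per_million):
    """Output-token cost in USD for a single response."""
    return output_tokens / 1_000_000 * price_per_million


OUTPUT_TOKENS = 500  # hypothetical completion length

# (label, decode tps, output price in USD per million tokens)
scenarios = [
    ("groq_lpu", 750, 3.20),
    ("cerebras", 620, 4.20),
    ("commodity_h100_fast", 150, 1.20),  # price pairing is illustrative
    ("commodity_h100_slow", 100, 1.20),  # price pairing is illustrative
]

for label, tps, price in scenarios:
    secs = decode_seconds(OUTPUT_TOKENS, tps)
    cost = output_cost_usd(OUTPUT_TOKENS, price)
    print(f"{label:20s} {secs:5.2f} s  ${cost:.5f}")
```

Under these assumptions the Groq premium over the $1.20/M endpoint is about a tenth of a cent per response, while decode time falls from 5 seconds at 100 tps to well under 1 second. Whether that is worth paying depends on what a slow response costs you in retries, abandoned sessions, or extra replicas.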

It helps to read this as a triangle: price, latency, and throughput cannot all be optimized at once. Together batch is cheap and slow. Cerebras is fast and expensive. Commodity serverless sits in the middle but often with more latency variance. The spread is not market inefficiency so much as a menu of different operating assumptions.

If your bottleneck is tokens per second rather than raw token spend, the fastest provider can be cheaper at the request level even when it is pricier per million tokens.

“Groq and Cerebras charge more per token, but they can deliver roughly 5x to 7.5x the throughput of commodity H100 endpoints.”

Digital Applied Q2 2026 pricing matrix

| Platform type | Reported throughput | Relative to 100 tps baseline | Relative to 150 tps baseline |
| --- | --- | --- | --- |
| Groq LPU | 750 tps | 7.5x | 5.0x |
| Cerebras | 620 tps | 6.2x | 4.1x |
| Commodity H100 endpoints | 100-150 tps | 1.0x-1.5x | 0.67x-1.0x |

Llama 4 70B output decode throughput from Digital Applied’s April 2026 comparison.

Billing model breakdown: per-token, per-second, and reserved capacity

Most of the market in this comparison sells inference on a per-token basis. That model is easy for finance teams to reason about because cost scales with usage and there is no explicit idle penalty. It also makes cross-provider shopping easier, which is one reason open-weight inference is commoditizing. In AI inference economics 2026, per-token pricing is the dominant billing language, but it is not the only one that matters.

Modal Labs is a useful counterexample because it sells containerized GPU time rather than a simple token meter. CloudZero’s analysis cites an effective H100 rate of about $3.95 per hour under sustained load. That structure can work well for teams that can keep accelerators busy with batched or bursty jobs, especially when they control scheduling tightly. It is less attractive when traffic is steady but not dense enough to maintain high utilization, because the idle tax becomes your problem rather than the provider’s.
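To make the idle tax concrete, the sketch below converts a GPU-hour rate into an effective per-million-token price. Only the $3.95/hour figure comes from the CloudZero analysis cited above. The aggregate throughput of 1,000 tokens per second is a hypothetical placeholder, not a measured value, and real batched throughput varies widely by model and serving stack.

```python
def gpu_cost_per_million_tokens(hourly_rate_usd, tokens_per_second, utilization):
    """Effective USD per million tokens when paying for GPU time.

    tokens_per_second is aggregate throughput while busy; utilization is the
    fraction of paid time spent actually serving tokens (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1_000_000


H100_RATE = 3.95               # USD/hour, CloudZero's effective sustained-load rate
ASSUMED_AGGREGATE_TPS = 1_000  # hypothetical batched throughput, not measured

for u in (1.0, 0.5, 0.1):
    cost = gpu_cost_per_million_tokens(H100_RATE, ASSUMED_AGGREGATE_TPS, u)
    print(f"utilization {u:>4.0%}: ${cost:.2f}/M")
```

The effective price scales inversely with utilization, so halving utilization doubles the cost per token. That is the mechanism by which steady but sparse traffic erodes the appeal of per-second billing.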

Reserved capacity sits between those poles. Digital Applied lists Together AI reserved capacity for Llama 4 70B output at $0.95/M, above its batch price but below many serverless alternatives. The trade-off is commitment. You get a discount relative to generic on-demand service, but you also accept lock-in and planning risk. If your demand profile is predictable, that can be rational. If it is not, the apparent savings may disappear into unused capacity or migration friction.
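A quick way to test whether a commitment pays off is to compute the break-even utilization. This sketch assumes a simplified contract in which the reserved volume is billed in full whether or not it is used; actual reserved-capacity terms may differ and are not described in the source. The prices are the Together AI reserved and Fireworks AI serverless figures from the matrix.

```python
def breakeven_utilization(reserved_price_per_m, on_demand_price_per_m):
    """Fraction of committed tokens you must actually use for reserved to win.

    Assumes the commitment is billed in full whether or not it is used.
    """
    return reserved_price_per_m / on_demand_price_per_m


def effective_reserved_price(reserved_price_per_m, utilization):
    """USD per million tokens actually consumed under a fully billed commitment."""
    return reserved_price_per_m / utilization


RESERVED = 0.95    # Together AI reserved, USD per million output tokens
SERVERLESS = 1.20  # Fireworks AI serverless, USD per million output tokens

print(f"break-even utilization: {breakeven_utilization(RESERVED, SERVERLESS):.1%}")
print(f"effective price at 60% use: ${effective_reserved_price(RESERVED, 0.60):.2f}/M")
```

Under that assumption, using less than roughly four fifths of the committed volume makes the reserved tier more expensive per consumed token than the $1.20/M serverless alternative.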

Pros
  • Per-token pricing is easy to forecast
  • Per-second pricing can reward high utilization
  • Reserved capacity can lower unit cost for stable demand
Cons
  • Per-token pricing can hide latency and throughput constraints
  • Per-second pricing punishes idle time
  • Reserved capacity can create commitment risk

def output_cost(tokens, price_per_million):
    """Return the USD cost of `tokens` output tokens at a per-million-token price."""
    return (tokens / 1_000_000) * price_per_million

# Example: 250k output tokens on Fireworks at $1.20/M
print(round(output_cost(250_000, 1.20), 4))  # 0.3

| Billing model | Example | Best fit | Main trade-off |
| --- | --- | --- | --- |
| Per-token | Fireworks, Together, Replicate | Predictable variable cost | Can become throughput-bound at peak |
| Per-second / GPU time | Modal Labs | Batched or tightly managed workloads | Idle utilization risk |
| Reserved capacity | Together AI reserved | Stable demand with planning discipline | Commitment and lock-in |

Commercial models shape the real economics as much as headline token price.

Specialty hardware math: when a higher token price is still the cheaper choice

Specialty hardware is a throughput bet

Groq and Cerebras are not competing to be the cheapest token meter. They are selling higher service rate, which matters most when concurrency and latency targets dominate the cost model.

The cleanest way to understand the Groq and Cerebras premium is to compare throughput uplift against token-price uplift. Relative to Together batch at $0.65/M, Groq at $3.20/M is about 4.9x more expensive per output token. Cerebras at $4.20/M is about 6.5x more expensive. On throughput, though, Groq’s 750 tps is roughly 5x to 7.5x faster than the 100-150 tps commodity H100 range, while Cerebras at 620 tps is roughly 4.1x to 6.2x faster. That means the premium is not obviously irrational if your bottleneck is service rate.

The shorthand that Cerebras offers roughly 6x the throughput at about 3x the cost of commodity H100 endpoints captures the directional point, though the exact ratio depends on which commodity price you compare against. Against Fireworks at $1.20/M, Cerebras is 3.5x pricier on output tokens. Against OctoAI at $1.50/M, it is 2.8x. If the faster platform lets you serve several times more concurrent traffic or hit a latency SLO that avoids overprovisioning, the request-level economics can flip in its favor.

NVIDIA is making the same broader argument from the hardware side. In its Blackwell inference analysis, the company says newer systems can reduce cost per token for open-source models by improving throughput and efficiency. Vendor analyses should always be read critically, but the directional claim matches the market data here: hardware architecture is now a first-order variable in inference economics, not a backend implementation detail.

Do not compare specialty hardware only to the cheapest batch tier. Compare it to the slowest provider that still meets your latency target.

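def relative_multiple(a, b):
    """Return a / b rounded to two decimals, e.g. a price multiple."""
    return round(a / b, 2)

# Output-price multiples from the Digital Applied matrix (USD per million tokens)
print({
    'groq_vs_together_batch_price': relative_multiple(3.20, 0.65),  # 4.92
    'cerebras_vs_fireworks_price': relative_multiple(4.20, 1.20),   # 3.5
    'cerebras_vs_octo_price': relative_multiple(4.20, 1.50),        # 2.8
})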

“If your bottleneck is tokens per second, specialty hardware can be cheaper per served request despite a higher per-million-token price.”

Digital Applied data and NVIDIA Blackwell inference analysis

| Comparison | Token price multiple | Throughput multiple | Interpretation |
| --- | --- | --- | --- |
| Groq vs Together batch | 4.9x | Not directly comparable on latency mode | Higher price buys a very different performance profile |
| Cerebras vs Fireworks | 3.5x | About 4.1x-6.2x vs commodity H100 baseline | Can be favorable if queueing is the bottleneck |
| Cerebras vs OctoAI | 2.8x | About 4.1x-6.2x vs commodity H100 baseline | Premium narrows against mid-market serverless pricing |

Illustrative ratios using Digital Applied’s Llama 4 70B pricing and throughput figures.
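One way to read the table is to divide the throughput multiple by the price multiple. The sketch below does that for Cerebras at both ends of the commodity H100 range. Pairing the 100-150 tps range with the Fireworks and OctoAI prices is an assumption made for illustration; the source does not report per-provider throughput for those endpoints.

```python
def uplift_ratio(fast_tps, slow_tps, fast_price, slow_price):
    """Throughput multiple divided by price multiple.

    Above 1.0, the faster platform adds more throughput than it adds price.
    """
    return (fast_tps / slow_tps) / (fast_price / slow_price)


CEREBRAS_TPS, CEREBRAS_PRICE = 620, 4.20

# Commodity H100 range paired with two mid-market prices from the matrix.
# The pairing is an illustrative assumption, not measured data.
for label, price in (("fireworks", 1.20), ("octoai", 1.50)):
    for baseline_tps in (100, 150):
        r = uplift_ratio(CEREBRAS_TPS, baseline_tps, CEREBRAS_PRICE, price)
        print(f"cerebras vs {label} @ {baseline_tps} tps: {r:.2f}")
```

A ratio above 1.0 does not by itself mean the faster provider is cheaper per request. It only says the throughput gain outpaces the price gain, which matters when service rate, not token spend, is the binding constraint.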

What the data means for production inference decisions

  • 7 providers: the core market set in the editor’s framing, which holds that the Q2 2026 serverless inference market has consolidated around seven providers
  • 5x-7x: latency spread cited in the editorial brief as headline market context for this piece

What the spread actually tells you

The per-million-token spread is not just a pricing story. It is a map of service modes: batch, serverless, reserved, and specialty hardware each optimize a different part of the production stack.

The market context behind AI inference economics 2026 is that open-weight inference is consolidating into a smaller set of recognizable providers while any pricing moat keeps shrinking. The editor’s framing notes that the Q2 2026 serverless market has consolidated around seven providers, and the Llama 4 70B comparison shows why competition is intensifying: the same model already fits inside a relatively tight absolute band of $0.65/M to $4.20/M, even though the relative spread is large. For builders, that means model quality is no longer the only scarce asset. Routing, workload shaping, and hardware fit matter more every quarter.

There is also a strategic implication for proprietary-model pricing. The editorial brief notes that this open-weight band is already far below GPT-4o-class proprietary pricing, and that many builders are getting comparable quality at a fraction of the cost. This article does not extend that comparison numerically because the supplied matrix excludes Anthropic and OpenAI, but the direction is clear from the open-weight side alone: token economics are compressing fast enough that infrastructure choices, not just model choices, are becoming the main source of margin.

For operators, the recommendation is workload-specific. Choose Together batch when latency is irrelevant and cost minimization dominates. Choose Fireworks or another mid-market serverless option when you need straightforward per-token billing without the highest specialty-hardware premium. Choose Together reserved capacity when demand is stable enough to justify commitment. Choose Groq or Cerebras when tokens per second, concurrency, or tight response-time targets are the real constraint. The per-million-token spread tells you less about who is cheapest in the abstract than about which provider is optimized for your workload shape.

Open-weight inference is commoditizing on price, but not on performance. The winning provider depends on whether your application is cost-bound, latency-bound, or throughput-bound.

| Workload shape | Likely best fit | Why |
| --- | --- | --- |
| Offline batch generation | Together AI batch | Lowest listed token price, latency is acceptable |
| General serverless production | Fireworks AI / similar mid-market endpoint | Balanced token pricing with standard API model |
| Predictable steady demand | Together AI reserved capacity | Lower unit cost with commitment |
| High-concurrency interactive apps | Groq or Cerebras | Higher throughput can reduce queueing and overprovisioning |
| Custom scheduled GPU jobs | Modal Labs | Per-second model can work when utilization is tightly managed |

Provider fit depends more on workload shape than on the cheapest headline token price.
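The table can be expressed as a small routing rule. This is a sketch only: the `Workload` fields and the order in which the checks run are hypothetical choices made for this example, and a real decision would also weigh measured latency, quotas, and contract terms.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    latency_sensitive: bool           # does a user wait on the response?
    throughput_bound: bool            # are tokens/sec or concurrency the limit?
    predictable_demand: bool          # stable enough to commit capacity?
    self_scheduled_gpu: bool = False  # team manages its own GPU utilization


def suggest_provider_class(w: Workload) -> str:
    """Map a workload shape to a provider class from the table above."""
    if w.self_scheduled_gpu:
        return "per-second GPU time (e.g. Modal Labs)"
    if not w.latency_sensitive:
        return "batch tier (e.g. Together AI batch)"
    if w.throughput_bound:
        return "specialty hardware (e.g. Groq or Cerebras)"
    if w.predictable_demand:
        return "reserved capacity (e.g. Together AI reserved)"
    return "mid-market serverless (e.g. Fireworks AI)"


# A high-concurrency interactive app with spiky demand.
print(suggest_provider_class(Workload(True, True, False)))
```

The check order encodes a priority: throughput constraints are tested before demand predictability, so a workload that is both throughput-bound and predictable is routed to specialty hardware. Reordering the checks is a reasonable alternative if commitment discounts matter more to you.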

Frequently asked questions

What does per-million-token pricing actually measure?

It measures the variable cost of model usage per million tokens processed. Providers usually quote separate input and output rates, though some comparisons cite only one side. For examples, see Together AI’s pricing page and Fireworks AI’s pricing page, both of which publish token-based pricing structures.

Why is the same Llama 4 70B model so much cheaper on some providers than others?

Because providers are not selling the same service envelope. Digital Applied’s Q2 2026 matrix shows Together AI batch at $0.65/M with a 60-minute latency profile, while Groq and Cerebras charge more but deliver much higher throughput. The matrix and methodology are published at Digital Applied.

When does specialty hardware make sense despite higher token prices?

It makes sense when your application is bottlenecked by tokens per second, concurrency, or latency SLOs rather than raw token spend. Digital Applied reports 750 tps for Groq and 620 tps for Cerebras versus 100-150 tps for commodity H100 endpoints, and NVIDIA’s Blackwell inference analysis argues that newer systems can reduce cost per token through higher throughput and efficiency.

Is per-token billing always better than per-second GPU billing?

No. Per-token billing is easier to forecast and avoids explicit idle waste, but per-second GPU billing can work well when you keep accelerators highly utilized. CloudZero’s analysis of Modal-style economics cites an effective sustained-load H100 rate of about $3.95/hour.

Last updated: May 22, 2026.