AI inference economics 2026 comes down to one headline number: the same open-weight model, Llama 4 70B, spans roughly 6.5x in output-token price across major inference providers, while latency and throughput diverge just as sharply. Using Digital Applied’s April 2026 measurements, provider pricing pages, and vendor disclosures, this data breakdown shows what the per-million-token spread actually buys you in production.
- The headline chart: one model, a 6.5x output-price spread
- Pricing matrix breakdown: what the low end and high end really represent
- Throughput chart: why tokens per second can outweigh token price
- Billing model breakdown: per-token, per-second, and reserved capacity
- Specialty hardware math: when a higher token price is still the cheaper choice
- What the data means for production inference decisions
- Frequently asked questions
- What does per-million-token pricing actually measure?
- Why is the same Llama 4 70B model so much cheaper on some providers than others?
- When does specialty hardware make sense despite higher token prices?
- Is per-token billing always better than per-second GPU billing?
- Primary sources
The headline chart: one model, a 6.5x output-price spread
- 6.46x: price spread on Llama 4 70B output tokens ($0.65/M to $4.20/M in Digital Applied’s April 2026 measurements)
- $0.65/M: lowest listed output price (Together AI batch)
- $4.20/M: highest listed output price (Cerebras)
The market has a real spread
The core finding in AI inference economics 2026 is simple: open-weight inference has become a market with real price dispersion, not a single clearing price. In Digital Applied’s Q2 2026 matrix, Llama 4 70B output pricing ranges from $0.65 per million output tokens on Together AI batch inference to $4.20 per million output tokens on Cerebras. That is a spread of about 6.46x for the same model family and the same basic unit of output, with Anthropic and OpenAI excluded because the comparison is limited to open-weight inference.
The practical takeaway is that a low per-million-token number does not mean a provider is universally cheaper for your workload. The cheapest figure in the set, Together AI’s batch tier, comes with a stated 60-minute latency profile in the Digital Applied comparison. At the other end, specialty hardware providers charge materially more per token but can deliver much higher decode throughput. Readers looking only at token price will miss the operating constraint that actually matters in production: whether your application is bottlenecked by cost, responsiveness, or concurrency.
Digital Applied says its Q2 2026 matrix uses public pricing pages, Artificial Analysis benchmarks, and direct April 2026 testing.
“The same open-weight model spans roughly 6.5x in output-token price across major providers.”
Digital Applied Q2 2026 pricing matrix
| Provider | Llama 4 70B output pricing | Commercial mode | Notes |
|---|---|---|---|
| Together AI | $0.65/M | Batch | 60-minute latency in Digital Applied matrix |
| Together AI | $0.95/M | Reserved capacity | Commitment-based option in Digital Applied matrix |
| Fireworks AI | $1.20/M | Serverless | Open-weight API pricing comparison |
| OctoAI | $1.50/M | Serverless | Digital Applied comparison |
| Anyscale Endpoints | $2.10/M | Enterprise | Digital Applied comparison |
| Replicate | $2.55/M | On-demand | Digital Applied comparison |
| Groq LPU | $3.20/M | Specialty hardware | Digital Applied comparison |
| Cerebras | $4.20/M | Wafer-scale | Digital Applied comparison |
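The spread can be reproduced directly from the matrix above. The following is a minimal sketch: the dictionary keys and the `price_spread` helper are illustrative names, and the prices are the Digital Applied figures from the table.

```python
# Llama 4 70B output prices in USD per million tokens, copied from the
# Digital Applied Q2 2026 matrix above. Keys are illustrative labels.
OUTPUT_PRICE_PER_M = {
    "together_batch": 0.65,
    "together_reserved": 0.95,
    "fireworks": 1.20,
    "octoai": 1.50,
    "anyscale": 2.10,
    "replicate": 2.55,
    "groq": 3.20,
    "cerebras": 4.20,
}


def price_spread(prices):
    """Return (cheapest_key, priciest_key, ratio of highest to lowest price)."""
    cheapest = min(prices, key=prices.get)
    priciest = max(prices, key=prices.get)
    return cheapest, priciest, prices[priciest] / prices[cheapest]


cheapest, priciest, spread = price_spread(OUTPUT_PRICE_PER_M)
print(f"{cheapest} -> {priciest}: {spread:.2f}x")
```

Swapping in a different subset of rows (for example, serverless-only) shows how much of the headline spread comes from the batch and specialty-hardware endpoints at either extreme.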
Pricing matrix breakdown: what the low end and high end really represent
The cheapest number in this dataset is not a generic serverless endpoint. It is Together AI batch pricing at $0.65/M for Llama 4 70B output, paired in the Digital Applied matrix with 60-minute latency. Together’s reserved-capacity figure, $0.95/M, sits closer to the mainstream serverless market while still undercutting Fireworks AI at $1.20/M and OctoAI at $1.50/M. That is the first lesson of AI inference economics 2026: the lowest token price usually reflects a different service contract, not a free lunch.
The middle of the market is where many production teams will likely compare offers. Fireworks AI’s public pricing page shows broad tiering by model size, with sub-4B models at $0.10/M, 4B-16B at $0.20/M, and models above 16B at $0.90/M. Fireworks also publishes split pricing for DeepSeek V3 at $0.56/M input and $1.68/M output, which is a useful reminder that blended request cost depends on prompt-to-completion ratio, not output price alone. Together AI’s own pricing page shows a wider catalog range, from $0.10/M on smaller models up to $9.00/M on the largest listed models, plus fine-tuning prices from $0.48/M for LoRA on models up to 16B to $3.20/M for full fine-tuning on 70B-100B models.
At the high end, Groq and Cerebras are not simply expensive versions of commodity inference. They are selling a different performance envelope. Digital Applied places Groq at $3.20/M and Cerebras at $4.20/M for Llama 4 70B output. Those prices look rich against Together or Fireworks, but they sit alongside much higher decode throughput. The pricing matrix only makes sense when read next to the speed data.
Some providers quote input and output separately, some emphasize serverless tiers, and some sell reserved or batch capacity. Comparing only one headline number can hide the real bill.
| Provider/data point | Published price | Unit | Source context |
|---|---|---|---|
| Together AI catalog range | $0.10/M to $9.00/M | Tokens | Public pricing page across model sizes |
| Together AI fine-tuning | $0.48/M to $3.20/M | Training tokens | LoRA through full fine-tune tiers |
| Fireworks AI sub-4B | $0.10/M | Tokens | Public pricing tier |
| Fireworks AI 4B-16B | $0.20/M | Tokens | Public pricing tier |
| Fireworks AI >16B | $0.90/M | Tokens | Public pricing tier |
| Fireworks AI DeepSeek V3 input | $0.56/M | Input tokens | Published split pricing |
| Fireworks AI DeepSeek V3 output | $1.68/M | Output tokens | Published split pricing |
| Modal Labs effective H100 rate | ~$3.95/hr | GPU hour | CloudZero analysis of sustained-load economics |
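To make the prompt-to-completion point concrete, here is a rough sketch of blended request cost under split pricing. The prices are Fireworks AI's published DeepSeek V3 rates from the table; the two request shapes and the function name are hypothetical, chosen only to contrast a prompt-heavy and a completion-heavy workload.

```python
def blended_cost_per_request(input_tokens, output_tokens,
                             input_price_per_m, output_price_per_m):
    """Cost in USD of one request with separately priced input and output."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Fireworks AI's published DeepSeek V3 split pricing (USD per million tokens).
IN_PRICE, OUT_PRICE = 0.56, 1.68

# Hypothetical request shapes, not taken from the source data.
prompt_heavy = blended_cost_per_request(8_000, 500, IN_PRICE, OUT_PRICE)
completion_heavy = blended_cost_per_request(500, 2_000, IN_PRICE, OUT_PRICE)

print(f"prompt-heavy: ${prompt_heavy:.5f}")
print(f"completion-heavy: ${completion_heavy:.5f}")
```

In the prompt-heavy case the input side makes up most of the bill even though input tokens are priced lower, which is why an output-only comparison can mislead for retrieval-style workloads.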
Throughput chart: why tokens per second can outweigh token price
- 750 tps: Groq LPU decode throughput (Llama 4 70B output decode)
- 620 tps: Cerebras decode throughput (Llama 4 70B output decode)
- 100-150 tps: commodity H100 endpoint range (Llama 4 70B output decode)
Speed changes the economics
Digital Applied reports 750 tokens per second for Groq LPU and 620 tokens per second for Cerebras on Llama 4 70B output decode, versus roughly 100-150 tokens per second for commodity H100 endpoints. That puts Groq at about 5x to 7.5x the commodity baseline and Cerebras at about 4.1x to 6.2x, depending on where in the H100 range a given endpoint lands. This is the second major point in AI inference economics 2026: a provider that looks expensive per million tokens can still be economically rational if your application is queue-bound.
For interactive products, throughput is not just a benchmark vanity metric. It affects time-to-first-completion under load, how many concurrent sessions a deployment can absorb, and whether you need to overprovision replicas to survive spikes. A team paying $1.20/M on a slower serverless endpoint may still spend more per successful user interaction than a team paying $3.20/M on Groq if the slower stack forces retries, larger buffers, or lower concurrency ceilings.
Price, latency, and throughput form a triangle: they do not all optimize at once. Together batch is cheap and slow. Cerebras is fast and expensive. Commodity serverless sits in the middle but often with more latency variance. The spread is not market inefficiency so much as a menu of different operating assumptions.
If your bottleneck is tokens per second rather than raw token spend, the fastest provider can be cheaper at the request level even when it is pricier per million tokens.
“Groq and Cerebras charge more per token, but they can deliver roughly 5x to 7.5x the throughput of commodity H100 endpoints.”
Digital Applied Q2 2026 pricing matrix
| Platform type | Reported throughput | Relative to 100 tps baseline | Relative to 150 tps baseline |
|---|---|---|---|
| Groq LPU | 750 tps | 7.5x | 5.0x |
| Cerebras | 620 tps | 6.2x | 4.1x |
| Commodity H100 endpoints | 100-150 tps | 1.0x-1.5x | 0.67x-1.0x |
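One way to read the throughput table is as time per completion. The sketch below converts the reported decode rates into streaming time for a single response. The 1,000-token completion length is a hypothetical figure, and the model deliberately ignores queueing and time to first token, which the source data does not report.

```python
def decode_seconds(output_tokens, tokens_per_second):
    """Time to stream a completion at a steady decode rate.

    Ignores queueing and time to first token.
    """
    return output_tokens / tokens_per_second


# Decode rates reported by Digital Applied for Llama 4 70B output.
DECODE_TPS = {"groq": 750, "cerebras": 620, "h100_low": 100, "h100_high": 150}

COMPLETION_TOKENS = 1_000  # hypothetical completion length

for name, tps in DECODE_TPS.items():
    print(f"{name}: {decode_seconds(COMPLETION_TOKENS, tps):.2f} s")
```

Under these assumptions the gap is between responses that stream in under two seconds and responses that take several seconds, which is the difference an interactive product is likely to notice.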
Billing model breakdown: per-token, per-second, and reserved capacity
Most of the market in this comparison sells inference on a per-token basis. That model is easy for finance teams to reason about because cost scales with usage and there is no explicit idle penalty. It also makes cross-provider shopping easier, which is one reason open-weight inference is commoditizing. In AI inference economics 2026, per-token pricing is the dominant billing language, but it is not the only one that matters.
Modal Labs is a useful counterexample because it sells containerized GPU time rather than a simple token meter. CloudZero’s analysis cites an effective H100 rate of about $3.95 per hour under sustained load. That structure can work well for teams that can keep accelerators busy with batched or bursty jobs, especially when they control scheduling tightly. It is less attractive when traffic is steady but not dense enough to maintain high utilization, because the idle tax becomes your problem rather than the provider’s.
Reserved capacity sits between those poles. Digital Applied lists Together AI reserved capacity for Llama 4 70B output at $0.95/M, above its batch price but below many serverless alternatives. The trade-off is commitment. You get a discount relative to generic on-demand service, but you also accept lock-in and planning risk. If your demand profile is predictable, that can be rational. If it is not, the apparent savings may disappear into unused capacity or migration friction.
Pros
- Per-token pricing is easy to forecast
- Per-second pricing can reward high utilization
- Reserved capacity can lower unit cost for stable demand
Cons
- Per-token pricing can hide latency and throughput constraints
- Per-second pricing punishes idle time
- Reserved capacity can create commitment risk
```python
def output_cost(tokens, price_per_million):
    """Return the USD cost of `tokens` output tokens at a per-million price."""
    return (tokens / 1_000_000) * price_per_million

# Example: 250k output tokens on Fireworks at $1.20/M
print(round(output_cost(250_000, 1.20), 4))  # 0.3
```
| Billing model | Example | Best fit | Main trade-off |
|---|---|---|---|
| Per-token | Fireworks, Together, Replicate | Predictable variable cost | Can become throughput-bound at peak |
| Per-second / GPU time | Modal Labs | Batched or tightly managed workloads | Idle utilization risk |
| Reserved capacity | Together AI reserved | Stable demand with planning discipline | Commitment and lock-in |
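The idle-time trade-off in GPU-time billing can be sketched by converting an hourly rate into an effective per-million-token price. The $3.95/hour figure is the CloudZero-cited rate. The aggregate throughput of 1,000 tokens per second is a hypothetical placeholder, not a measured value, and a 70B model may span several GPUs, which would scale the hourly rate accordingly.

```python
def effective_price_per_million(gpu_hourly_usd, tokens_per_second, utilization):
    """USD per million output tokens under GPU-time billing.

    utilization is the fraction of billed time spent generating tokens.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


H100_HOURLY = 3.95     # effective rate cited from CloudZero's analysis
AGGREGATE_TPS = 1_000  # hypothetical batched throughput, not from the source

for utilization in (1.0, 0.5, 0.1):
    price = effective_price_per_million(H100_HOURLY, AGGREGATE_TPS, utilization)
    print(f"utilization {utilization:.0%}: ${price:.2f}/M")
```

The effective price scales inversely with utilization, so halving the busy fraction doubles the per-token cost. That is the idle tax expressed in the same unit as the per-token providers.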
Specialty hardware math: when a higher token price is still the cheaper choice
Specialty hardware is a throughput bet
The cleanest way to understand the Groq and Cerebras premium is to compare throughput uplift against token-price uplift. Relative to Together batch at $0.65/M, Groq at $3.20/M costs about 4.9x as much per output token, and Cerebras at $4.20/M about 6.5x as much. On throughput, though, Groq’s 750 tps is roughly 5x to 7.5x the 100-150 tps commodity H100 range, while Cerebras at 620 tps is roughly 4.1x to 6.2x. The two baselines differ (batch pricing versus commodity H100 throughput), so the multiples are not a like-for-like pair, but they show that the premium is not obviously irrational if your bottleneck is service rate.
The shorthand that Cerebras offers roughly 6x throughput at about 3x cost versus commodity H100 endpoints captures the directional point, even though the exact ratio depends on which commodity price you use for comparison. Against Fireworks at $1.20/M, Cerebras costs 3.5x as much per output token. Against OctoAI at $1.50/M, it costs 2.8x as much. If the faster platform lets you serve several times more concurrent traffic or hit a latency SLO that avoids overprovisioning, the request-level economics can flip in its favor.
NVIDIA is making the same broader argument from the hardware side. In its Blackwell inference analysis, the company says newer systems can reduce cost per token for open-source models by improving throughput and efficiency. Vendor analyses should always be read critically, but the directional claim matches the market data here: hardware architecture is now a first-order variable in inference economics, not a backend implementation detail.
Do not compare specialty hardware only to the cheapest batch tier. Compare it to the slowest provider that still meets your latency target.
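```python
def relative_multiple(a, b):
    """Return how many times larger a is than b, rounded to two decimals."""
    return round(a / b, 2)

# Output-price multiples from the Digital Applied matrix (USD per million tokens)
print({
    'groq_vs_together_batch_price': relative_multiple(3.20, 0.65),
    'cerebras_vs_fireworks_price': relative_multiple(4.20, 1.20),
    'cerebras_vs_octo_price': relative_multiple(4.20, 1.50),
})
```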
“If your bottleneck is tokens per second, specialty hardware can be cheaper per served request despite a higher per-million-token price.”
Digital Applied data and NVIDIA Blackwell inference analysis
| Comparison | Token price multiple | Throughput multiple | Interpretation |
|---|---|---|---|
| Groq vs Together batch | 4.9x | Not directly comparable on latency mode | Higher price buys a very different performance profile |
| Cerebras vs Fireworks | 3.5x | About 4.1x-6.2x vs commodity H100 baseline | Can be favorable if queueing is the bottleneck |
| Cerebras vs OctoAI | 2.8x | About 4.1x-6.2x vs commodity H100 baseline | Premium narrows against mid-market serverless pricing |
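The table's two multiples can be combined into a single rough indicator: throughput uplift divided by price uplift. This sketch assumes the comparison endpoint (Fireworks here) decodes somewhere in the 100-150 tps commodity range. The source does not state that, so treat the result as illustrative only. The `uplift_ratio` name is hypothetical.

```python
def uplift_ratio(price_multiple, throughput_multiple):
    """Throughput uplift per unit of price uplift.

    A value above 1.0 means the speed gain outpaces the price premium.
    """
    return throughput_multiple / price_multiple


CEREBRAS_TPS = 620
H100_RANGE_TPS = (150, 100)   # commodity range reported by Digital Applied
price_multiple = 4.20 / 1.20  # Cerebras vs Fireworks output price

for baseline in H100_RANGE_TPS:
    ratio = uplift_ratio(price_multiple, CEREBRAS_TPS / baseline)
    print(f"baseline {baseline} tps: {ratio:.2f}")
```

A ratio above 1.0 does not by itself make the faster provider cheaper. It only indicates that throughput grows faster than price, which matters when the workload is actually throughput-bound.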
What the data means for production inference decisions
- 7 providers: the Q2 2026 serverless inference market has consolidated around seven providers
- 5x-7x: latency spread cited as headline market context
What the spread actually tells you
The market context behind AI inference economics 2026 is that open-weight inference is consolidating into a smaller set of recognizable providers while the economic moat keeps shrinking. The Q2 2026 serverless market has consolidated around seven providers, and the Llama 4 70B comparison shows why competition is intensifying: the same model already fits inside a relatively tight absolute band of $0.65/M to $4.20/M, even though the relative spread is large. For builders, that means model quality is no longer the only scarce asset. Routing, workload shaping, and hardware fit matter more every quarter.
There is also a strategic implication for proprietary-model pricing. This open-weight band is reported to sit far below GPT-4o-class proprietary pricing, with many builders reportedly getting comparable quality at a fraction of the cost. This article does not extend that comparison numerically because the Digital Applied matrix excludes Anthropic and OpenAI, but the direction is clear from the open-weight side alone: token economics are compressing fast enough that infrastructure choices, not just model choices, are becoming the main source of margin.
For operators, the recommendation is workload-specific. Choose Together batch when latency is irrelevant and cost minimization dominates. Choose Fireworks or another mid-market serverless option when you need straightforward per-token billing without the highest specialty-hardware premium. Choose Together reserved capacity when demand is stable enough to justify commitment. Choose Groq or Cerebras when tokens per second, concurrency, or tight response-time targets are the real constraint. The per-million-token spread tells you less about who is cheapest in the abstract than about which provider is optimized for your workload shape.
Open-weight inference is commoditizing on price, but not on performance. The winning provider depends on whether your application is cost-bound, latency-bound, or throughput-bound.
| Workload shape | Likely best fit | Why |
|---|---|---|
| Offline batch generation | Together AI batch | Lowest listed token price, latency is acceptable |
| General serverless production | Fireworks AI / similar mid-market endpoint | Balanced token pricing with standard API model |
| Predictable steady demand | Together AI reserved capacity | Lower unit cost with commitment |
| High-concurrency interactive apps | Groq or Cerebras | Higher throughput can reduce queueing and overprovisioning |
| Custom scheduled GPU jobs | Modal Labs | Per-second model can work when utilization is tightly managed |
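The workload table above can be read as a simple decision procedure. The following is a sketch only: the function name, the boolean traits, and especially the order in which they are checked are assumptions of this example, not rules from the source data.

```python
def suggest_provider(latency_tolerant, throughput_bound,
                     steady_demand, manages_own_gpu_schedule):
    """Map coarse workload traits to the fit suggested in the table above.

    The order of checks is an assumption of this sketch.
    """
    if latency_tolerant:
        return "Together AI batch"
    if throughput_bound:
        return "Groq or Cerebras"
    if manages_own_gpu_schedule:
        return "Modal Labs"
    if steady_demand:
        return "Together AI reserved capacity"
    return "Fireworks AI or similar mid-market endpoint"


print(suggest_provider(latency_tolerant=False, throughput_bound=True,
                       steady_demand=False, manages_own_gpu_schedule=False))
```

Real workloads often mix traits (for example, an interactive product with an offline batch pipeline), in which case routing different traffic classes to different providers may fit better than a single choice.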
Frequently asked questions
What does per-million-token pricing actually measure?
It measures the variable cost of model usage: the price charged per million tokens processed. Some providers quote input and output tokens separately, while others publish a single rate. For examples, see Together AI’s pricing page and Fireworks AI’s pricing page, both of which publish token-based pricing structures.
Why is the same Llama 4 70B model so much cheaper on some providers than others?
Because providers are not selling the same service envelope. Digital Applied’s Q2 2026 matrix shows Together AI batch at $0.65/M with a 60-minute latency profile, while Groq and Cerebras charge more but deliver much higher throughput. The matrix and methodology are published at Digital Applied.
When does specialty hardware make sense despite higher token prices?
It makes sense when your application is bottlenecked by tokens per second, concurrency, or latency SLOs rather than raw token spend. Digital Applied reports 750 tps for Groq and 620 tps for Cerebras versus 100-150 tps for commodity H100 endpoints, and NVIDIA argues newer Blackwell systems can reduce cost per token through higher efficiency in its analysis at NVIDIA’s blog.
Is per-token billing always better than per-second GPU billing?
No. Per-token billing is easier to forecast and avoids explicit idle waste, but per-second GPU billing can work well when you keep accelerators highly utilized. CloudZero’s analysis of Modal-style economics cites an effective sustained-load H100 rate of about $3.95/hour.
Primary sources
- Digital Applied: AI inference providers pricing matrix, Q2 2026
- Together AI: pricing page
- Fireworks AI: pricing page
- CloudZero: Together AI pricing analysis
- Featherless: LLM API pricing comparison 2026
- NVIDIA: Blackwell inference cost-per-token analysis
Last updated: May 22, 2026.