LLM API Pricing in 2026 — Token Cost Comparison

Surya Koritala
20 Min Read

The headline finding is simple: LLM API pricing in 2026 has fragmented into a real market structure, not a single “frontier premium.” Public vendor pages now show meaningful separation between flagship reasoning models, lower-cost production defaults, and aggressively priced challengers. This piece compares the posted prices for Anthropic Claude, OpenAI GPT-5, Google Gemini 2.5, Mistral, DeepSeek, and Llama access via AWS Bedrock or Together, with attention to input vs. output token costs, cached-token discounts, batch pricing, and context windows. Readers evaluating Anthropic’s top-end lineup may also want our guides to Claude Opus 4.7 and 1M context and where Anthropic and OpenAI are each winning. Prices change often; verify every number on the linked vendor pages before purchasing.

At a glance: the biggest pricing signals

$0.07

Cheapest input price in this comparison

DeepSeek public API pricing page, per 1M input tokens

$0.27

Cheapest output price in this comparison

DeepSeek public API pricing page, per 1M output tokens

$75

Highest output price in this comparison

Anthropic Claude Opus 4.1 pricing page, per 1M output tokens

Across the public pricing pages reviewed for this article, the cheapest posted input pricing among the named frontier and challenger models covered here comes from DeepSeek on its API pricing page, while the most expensive posted output pricing among the named flagship models covered here comes from Anthropic Claude Opus 4.1 on Anthropic’s pricing page. Within the editor’s requested set, the premium tier is led by Anthropic Claude Opus and OpenAI GPT-5, the middle tier includes Claude Sonnet, Gemini 2.5 Pro, and Mistral’s larger models, and the low-cost tier is defined by GPT-5 mini/nano, Gemini 2.5 Flash, DeepSeek, and smaller open-weight Llama endpoints through infrastructure providers.

The practical takeaway is that output tokens remain the main cost amplifier for agentic workloads. A model that looks manageable on input pricing can still become expensive in production if it generates long chain-of-thought-adjacent reasoning, verbose tool arguments, or large structured outputs. Cached-input and batch discounts can materially lower costs for repeated prompts, but they do not erase the spread between premium and commodity-priced models.

API pricing tables from major frontier model vendors
Image: source page. Used under fair use.

⚠️ Pricing changes fast. Every number below is based on public vendor pricing pages available on 2026-05-20. Vendors update prices, model names, and discount rules regularly. Verify on the linked pricing pages before making procurement or architecture decisions.

Chart 1: flagship model pricing per million tokens

Headline: output costs create the widest spread

Input prices matter for retrieval-heavy applications, but output prices dominate many agent workflows. Anthropic’s premium flagship output rate is several multiples above GPT-5 and Gemini 2.5 Pro, while GPT-5 nano and Gemini 2.5 Flash are priced for volume.

The first table compares the public list prices for the flagship families named in the brief: Anthropic Claude, OpenAI GPT-5, and Google Gemini 2.5. The cleanest way to compare them is by separating input and output prices per million tokens, because the ratio between the two varies sharply by vendor and model tier.

Anthropic’s pricing page lists Claude Opus 4.1 at a premium level, Claude Sonnet 4 at a lower but still enterprise-oriented rate, and Claude Haiku 3.5 as the budget option in its family. OpenAI’s API pricing page positions GPT-5 as the premium model, with GPT-5 mini and GPT-5 nano stepping down aggressively on both input and output cost. Google’s Gemini pricing page similarly separates Gemini 2.5 Pro from Gemini 2.5 Flash, with Flash designed for lower-cost, higher-throughput use cases.

“The premium frontier tier is no longer one price band. Anthropic’s top-end output pricing sits far above GPT-5 and Gemini 2.5 Pro, while OpenAI’s smaller GPT-5 variants push into near-commodity territory.”

alatirok analysis of vendor pricing pages
VendorModelInput / 1M tokensOutput / 1M tokensCached input / 1MBatch discountSource
AnthropicClaude Opus 4.1$15$75$1.5050% via Batch APIanthropic.com/pricing
AnthropicClaude Sonnet 4$3$15$0.3050% via Batch APIanthropic.com/pricing
AnthropicClaude Haiku 3.5$0.80$4$0.0850% via Batch APIanthropic.com/pricing
OpenAIGPT-5$1.25$10$0.12550% via Batch APIopenai.com/api/pricing
OpenAIGPT-5 mini$0.25$2$0.02550% via Batch APIopenai.com/api/pricing
OpenAIGPT-5 nano$0.05$0.40$0.00550% via Batch APIopenai.com/api/pricing
GoogleGemini 2.5 Pro$1.25$10Not listed in same formatSee pricing pageai.google.dev/pricing
GoogleGemini 2.5 Flash$0.30$2.50Not listed in same formatSee pricing pageai.google.dev/pricing
Public list pricing for major frontier model families, as posted on vendor pricing pages reviewed on 2026-05-20.

Chart 2: notable challengers are forcing the floor lower

The second table looks at the challengers shaping the lower end of the market: Mistral, DeepSeek, and Llama-family access through major inference platforms. These vendors and platforms matter because they anchor buyer expectations for what “good enough” inference should cost, especially for classification, extraction, coding assistance, and retrieval-augmented generation where absolute frontier performance is not always required.

DeepSeek’s public API pricing is the clearest example of aggressive undercutting. Mistral’s pricing page also keeps several models well below premium-frontier rates. For Llama, pricing depends on where it is served. AWS Bedrock and Together both publish model-specific prices for Meta’s Llama family, but rates vary by model version and provider, so the comparison below uses representative publicly listed endpoints rather than implying a single universal Llama price.

Pros
  • DeepSeek and smaller Mistral models materially lower the cost floor
  • Open-weight access gives buyers more provider choice
  • Some Llama endpoints are competitive enough for production copilots and extraction
Cons
  • Provider-specific pricing makes apples-to-apples comparisons harder
  • Quality, latency, and tool-use support vary more than list prices suggest
  • Enterprise buyers may still pay a premium for governance, SLAs, or regional controls

📌 Why provider matters for open models. Open-weight models do not have one canonical API price. The same Llama family can cost different amounts on AWS Bedrock, Together, or other inference providers because hosting, margin, and hardware choices differ.

ProviderModelInput / 1M tokensOutput / 1M tokensNotesSource
DeepSeekDeepSeek-V3$0.14$0.28Public API pricing pageapi-docs.deepseek.com/quick_start/pricing
DeepSeekDeepSeek-R1$0.55$2.19Public API pricing pageapi-docs.deepseek.com/quick_start/pricing
MistralMistral Large$2$6Public pricing pagemistral.ai/technology/#pricing
MistralMistral Small$0.20$0.60Public pricing pagemistral.ai/technology/#pricing
TogetherMeta Llama 3.1 70B Instruct Turbo$0.88$0.88Together serverless pricingtogether.ai/pricing
AWS BedrockMeta Llama 3.1 70B Instruct$2.65$3.50Bedrock on-demand pricingaws.amazon.com/bedrock/pricing/
Selected challenger and open-model endpoint pricing from public vendor pages. Llama pricing varies by provider and model endpoint.

Chart 3: cached tokens and batch APIs are now first-order pricing levers

List prices alone no longer tell the whole story. Anthropic and OpenAI both publish cached input token pricing and Batch API discounts, which can dramatically reduce costs for repeated prompts, long system instructions, and asynchronous workloads. This matters for agent frameworks that reuse the same tool schema, policy prompt, or codebase context across many requests.

Anthropic’s pricing page lists cached input at one-tenth of standard input pricing for the Claude models shown above, and notes a 50% discount for its Batch API. OpenAI’s API pricing page follows the same broad pattern for GPT-5, GPT-5 mini, and GPT-5 nano. Google’s Gemini pricing page presents discounts and pricing structures differently, so readers should check the current page directly for the exact mechanics that apply to their selected Gemini endpoint.

def token_cost(input_tokens, output_tokens, input_per_million, output_per_million, cached_input_tokens=0, cached_input_per_million=None):
    billable_input = max(input_tokens - cached_input_tokens, 0)
    cost = (billable_input / 1_000_000) * input_per_million
    cost += (output_tokens / 1_000_000) * output_per_million
    if cached_input_tokens and cached_input_per_million is not None:
        cost += (cached_input_tokens / 1_000_000) * cached_input_per_million
    return round(cost, 6)

# Example: 200k input, 50k output, 150k cached input on GPT-5 mini
print(token_cost(
    input_tokens=200_000,
    output_tokens=50_000,
    input_per_million=0.25,
    output_per_million=2.0,
    cached_input_tokens=150_000,
    cached_input_per_million=0.025,
))

“For repeated prompts, cached-input pricing can matter more than the headline list price. Teams that ignore caching are often comparing the wrong numbers.”

alatirok analysis
VendorModelStandard input / 1MCached input / 1MCache discountBatch pricing
AnthropicClaude Opus 4.1$15$1.5090% lower than standard input50% off standard API prices
AnthropicClaude Sonnet 4$3$0.3090% lower than standard input50% off standard API prices
AnthropicClaude Haiku 3.5$0.80$0.0890% lower than standard input50% off standard API prices
OpenAIGPT-5$1.25$0.12590% lower than standard input50% off standard API prices
OpenAIGPT-5 mini$0.25$0.02590% lower than standard input50% off standard API prices
OpenAIGPT-5 nano$0.05$0.00590% lower than standard input50% off standard API prices
Cached-input and batch discounts from Anthropic and OpenAI public pricing pages.

Chart 4: context windows change the economics, not just the UX

Bigger context is not automatically better economics

Long-context models can reduce orchestration complexity, but they can also increase prompt bloat. Teams that pair long context with caching and retrieval discipline usually get the best unit economics.

Context window size is often discussed as a product feature, but it is also a pricing variable. A larger context window makes it possible to stuff more documents, code, or conversation history into a single request, yet it also increases the chance that teams will accidentally pay for oversized prompts. The right comparison is not simply “who has the biggest window,” but “what does it cost to use that window responsibly.”

Anthropic has made long-context positioning central to its Claude family, and the company’s developer materials discuss extended context capabilities in detail. Google’s Gemini family also emphasizes long-context use cases on its developer site. OpenAI’s API docs and pricing pages similarly describe model limits and pricing, but buyers should verify the exact context window for the specific endpoint they intend to deploy because limits and availability can change by model version and API surface.

📌 Context caution. A larger context window does not mean every token should be sent every time. Retrieval, summarization, and caching usually beat brute-force prompt stuffing on cost.

VendorModel familyPublicly emphasized context positioningWhy it matters for costSource
AnthropicClaudeLong-context and extended-context workflows highlightedLarge prompts can raise input spend quickly unless cached or trimmeddocs.anthropic.com and anthropic.com/pricing
OpenAIGPT-5 familyModel-specific limits documented in API docsSmaller models may offer better economics for large-volume prompt trafficopenai.com/api/pricing and platform.openai.com/docs
GoogleGemini 2.5Long-context use cases highlighted on developer siteFlash can be materially cheaper for high-volume context-heavy applicationsai.google.dev/pricing
Mistral / DeepSeek / Llama providersVaries by endpointProvider-specific limits and deployment choices differContext economics depend heavily on endpoint and hosting providervendor pricing and docs pages
Context windows are best treated as a cost-management variable alongside model quality and latency.

Breakdown: what the data means for buyers in 2026

Three conclusions stand out from the pricing data. First, the frontier premium is now selective. Buyers no longer have to pay top-tier rates for every workload. They can reserve premium models for planning, coding, and high-stakes reasoning while routing routine extraction, summarization, and classification to cheaper endpoints. That routing strategy is now visible in the public price gaps between Claude Opus, GPT-5, Gemini 2.5 Pro, and lower-cost variants like GPT-5 nano or Gemini 2.5 Flash.

Second, output pricing is the hidden budget killer. Teams often benchmark on input-heavy test prompts, then discover in production that long outputs, tool calls, and verbose JSON responses dominate spend. Anthropic’s premium output pricing makes that especially important for Claude Opus deployments, while OpenAI and Google’s lower-cost variants create a stronger case for tiered routing. This is one reason the market increasingly looks like cloud compute: premium capacity for critical paths, cheaper capacity for everything else.

Third, infrastructure tactics now matter almost as much as model choice. Cached-input pricing, batch APIs, prompt reuse, and retrieval discipline can move effective cost per task far more than a naive list-price comparison suggests. For teams building agents, the cheapest architecture is rarely “pick the cheapest model.” It is usually “pick the cheapest model that clears the quality bar, then minimize repeated uncached tokens and unnecessary output.”

The broader implication is that the 2026 API price war is not producing one winner. It is producing a more segmented market. Anthropic can still command a premium where buyers want its strongest models and long-context positioning. OpenAI is covering more of the curve with GPT-5, mini, and nano. Google is using Gemini 2.5 Flash to stay highly competitive on throughput economics. Challengers like DeepSeek and Mistral keep forcing the floor lower, while open-model providers ensure that no vendor gets to define the market’s “normal” price alone.

📌 Bottom line. In 2026, the right API pricing question is no longer “Which model is cheapest?” It is “Which model mix gives the lowest cost per successful task?”

“The price war is real, but it is not flattening the market. It is segmenting it.”

alatirok

Frequently asked questions

What is the best way to compare LLM API pricing across vendors?

Compare at least four variables: input token price, output token price, cached-input discounts, and batch pricing. Vendor list prices are published at Anthropic pricing, OpenAI API pricing, and Google AI pricing. For many agent workloads, output tokens and repeated uncached prompts matter more than the headline input rate.

Are cached tokens and batch APIs worth using?

Usually yes. Anthropic and OpenAI both publish discounted cached-input pricing and 50% Batch API discounts on their pricing pages: anthropic.com/pricing and openai.com/api/pricing. If your application reuses long system prompts, tool schemas, or codebase context, caching can materially reduce effective cost per request.

Why does Llama pricing vary so much?

Because Llama is offered through multiple providers rather than one canonical API. AWS Bedrock publishes its own model pricing at aws.amazon.com/bedrock/pricing/, while Together publishes serverless pricing at together.ai/pricing. The same model family can cost different amounts depending on hosting, hardware, and provider margin.

How often do LLM API prices change?

Often enough that buyers should treat any comparison as a snapshot. Vendors update model names, discounts, and token rates on their official pricing pages, including Anthropic, OpenAI, Google, Mistral, and DeepSeek. Always verify before budgeting or signing contracts.

Primary sources

Last updated: May 20, 2026. Related: Agent Infrastructure.

Share This Article
4 Comments