A complete, sourced pricing reference — frontier models, value models, and open-weight inference hosts — all in one place. Per 1M tokens, input and output, as of May 2026.
· llmdeal.me
The cheapest and most expensive production API tokens today differ by more than three orders of magnitude. DeepInfra's Qwen3-235B-A22B runs at $0.071/$0.10 per million tokens; Anthropic's Claude Mythos Preview — invite-only, critical infrastructure only — is priced at $25/$125 per million (third-party sourced; not on the official pricing page). Most developers will live somewhere between those extremes, but knowing the full range matters when you are deciding which model to default to, which to cache, and which to batch.
The other defining trend is the open-weight wave. DeepSeek V4 Flash, Kimi K2.6, MiniMax M2.7, GLM-5.1, and Qwen3-235B-A22B are all open-weight and available on multiple inference hosts simultaneously. That creates genuine competition on hosting price in a way that proprietary models cannot. When the weights are public, every inference provider can undercut the originator.
Closed, proprietary models from the major western labs. Prices are for the standard synchronous API unless noted. All figures are per 1M tokens.
| Model | Provider | Input $/1M | Output $/1M | Context |
|---|---|---|---|---|
| Claude Opus 4.7 / 4.6 / 4.5 | Anthropic | $5.00 | $25.00 | 1M |
| Claude Sonnet 4.6 / 4.5 | Anthropic | $3.00 | $15.00 | 1M |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K |
| GPT-5.5 | OpenAI | $5.00 | $30.00 | 1M |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1M (272K short) |
| GPT-5.4 mini | OpenAI | $0.75 | $4.50 | 400K |
| GPT-5.4 nano | OpenAI | $0.20 | $1.25 | 400K |
| o3 (reasoning) | OpenAI | $2.00 | $8.00 | 200K |
| o4-mini (reasoning) | OpenAI | $0.55 | $2.20 | 200K |
| Gemini 3.1 Pro | Google | $2.00 / $4.00* | $12.00 / $18.00* | 2M |
| Gemini 3.1 Flash-Lite | Google | $0.25 | $1.50 | — |
| Gemini 2.5 Pro | Google | $1.25 / $2.50* | $10.00 / $15.00* | 1M |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | 1M |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | 1M |
| Grok 4.1 Fast / 4 Fast | xAI | $0.20 | $0.50 | 2M |
| Grok 4.3 | xAI | $1.25 | $2.50 | 1M |
| Grok 4 | xAI | $3.00 | $15.00 | 256K |
| Cohere Command A | Cohere | $2.50 | $10.00 | 256K |
| Cohere Command R7B | Cohere | $0.037 | $0.150 | 128K |
* Google Pro models: lower price applies ≤200K tokens; higher price applies above 200K tokens.
o-series models bill internal reasoning tokens at output rates — actual cost can be 3–10× the headline price for complex tasks.
Grok 4.3 pricing from third-party aggregators only (x.ai API returned HTTP 403 during research).
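The per-request arithmetic behind any row in this table is the same; a minimal sketch in Python, using the Gemini 3.1 Pro tier split from the footnote above. Whether the higher tier reprices the whole request or only the tokens past the threshold is a billing detail worth confirming; this sketch assumes whole-request repricing:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request at flat per-1M-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

def gemini_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Gemini 3.1 Pro: $2/$12 at <=200K input tokens, $4/$18 above."""
    if input_tokens <= 200_000:
        return request_cost(input_tokens, output_tokens, 2.00, 12.00)
    return request_cost(input_tokens, output_tokens, 4.00, 18.00)

# 100K input / 2K output stays in the lower tier
print(round(gemini_pro_cost(100_000, 2_000), 4))  # 0.224
```

Crossing the 200K threshold doubles the input rate and raises the output rate by half, so long-context requests deserve their own budget line.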
Models from Chinese labs and European providers, mostly open-weight (weights publicly available). Prices are the provider's own hosted API unless noted. All figures per 1M tokens, standard tier, cache-miss input.
| Model | Provider | Input $/1M | Output $/1M | Notes |
|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.28 | MIT; cache hit $0.0028 |
| DeepSeek V4 Pro | DeepSeek | $0.435 | $0.87 | 75% promo until 2026-05-31; full price $1.74/$3.48 |
| Qwen-Turbo | Alibaba Cloud | $0.05 | $0.20 | Cheapest production Qwen |
| Qwen3-235B-A22B | Alibaba Cloud | $0.70 | $2.80 | Open-weight Apache 2.0; 128K ctx |
| Qwen3-Max | Alibaba Cloud | $1.20–$2.40 | $6.00 | Tiered by ctx length |
| Mistral Nemo | Mistral AI | $0.02 | $0.03 | Cheapest in lineup |
| Mistral Small 3.2 24B | Mistral AI | $0.075 | $0.20 | Best value in Mistral range |
| Mistral Large 3 2512 | Mistral AI | $0.50 | $1.50 | No prompt caching |
| Codestral 2508 | Mistral AI | $0.30 | $0.90 | Code-specialist; 256K ctx |
| GLM-4.5-Flash / GLM-4.7-Flash | Zhipu / Z.ai | Free | Free | No quota cap |
| GLM-5 | Zhipu / Z.ai | $1.00 | $3.20 | 203K ctx |
| GLM-5.1 | Zhipu / Z.ai | $1.40 | $4.40 | Open-sourced 2026-04-08 |
| Kimi K2.6 | Moonshot AI | $0.73–$0.75 | $3.49–$3.50 | MIT+; 262K ctx; cache hit $0.15 |
| MiniMax M2.7 / M2.5 | MiniMax | $0.28–$0.30 | $1.20 | Open-weight; 1M ctx (M2.5) |
Mistral does not offer prompt-caching discounts on any models as of May 2026, unlike Anthropic (10% of base on cache hits) and OpenAI (75–90% on cache hits).
DeepSeek V4 Flash cache hit: $0.0028/M — a 98% discount vs standard input. DeepSeek V4 Pro's promo expires 2026-05-31; full price will be $1.74/$3.48 thereafter.
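Cache pricing folds into a single blended input rate. A sketch using the V4 Flash numbers above ($0.14 cache-miss, $0.0028 cache-hit); the 80% hit rate is a hypothetical workload, not a DeepSeek figure:

```python
def effective_input_rate(miss_per_m: float, hit_per_m: float, hit_rate: float) -> float:
    """Blended $/1M input rate given a cache hit rate in [0, 1]."""
    return hit_rate * hit_per_m + (1 - hit_rate) * miss_per_m

# DeepSeek V4 Flash: $0.14 miss, $0.0028 hit, assumed 80% of input tokens cached
rate = effective_input_rate(0.14, 0.0028, 0.80)
print(round(rate, 5))  # 0.03024
```

At that hypothetical hit rate the effective input price is about $0.03/1M, well under most hosts' cheapest open-weight rates.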
These hosts run open-weight models on their own GPU infrastructure. You get the same weights, different SLAs, speeds, and prices. All figures per 1M tokens.
| Provider | Model | Input $/1M | Output $/1M | Speed (tok/s) / Notes |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | $0.59 | $0.79 | ~325 |
| Groq | Llama 4 Scout (17Bx16E) | $0.11 | $0.34 | — |
| Groq | Qwen3 32B | $0.29 | $0.59 | — |
| Groq | Llama 3.1 8B Instant | $0.05 | $0.08 | 651 |
| Cerebras | Llama 3.3 70B | $0.60 | $0.60 | 1,800+ |
| Cerebras | Llama 3.1 8B | $0.10 | $0.10 | 2,360 |
| Together AI | Llama 3.3 70B | $0.88 | $0.88 | ~150 |
| Together AI | DeepSeek R1 | $3.00 | $7.00 | — |
| Fireworks AI | GPT OSS 20B | $0.07 | $0.30 | — |
| Fireworks AI | Kimi K2.6 | $0.95 | $4.00 | $0.16 cached |
| DeepInfra | Llama 3.3 70B Turbo | $0.10 | $0.32 | ~19 (FP8) |
| DeepInfra | Qwen3-235B-A22B | $0.071 | $0.10 | — |
| DeepInfra | DeepSeek R1-0528 | $0.50 | $2.15 | $0.35 cached |
| Novita AI | Llama 3.3 70B | $0.135 | $0.40 | — |
Cerebras uses wafer-scale silicon, not GPU clusters — hence the extreme throughput. Groq uses LPU silicon. Both offer free tiers: Cerebras ~1M tokens/day on Llama 3.3 70B; Groq free tier at 30 req/min. Lepton AI was acquired by NVIDIA (May 2025) and has been sunset.
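Those throughput figures translate directly into wall-clock decode time; a rough sketch that ignores time-to-first-token and network latency:

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate decode time; excludes prefill and network overhead."""
    return output_tokens / tokens_per_second

# 2,000 output tokens: Cerebras Llama 3.3 70B (~1,800 tok/s) vs Together (~150 tok/s)
print(round(generation_seconds(2_000, 1_800), 2))  # 1.11
print(round(generation_seconds(2_000, 150), 2))    # 13.33
```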
Anthropic cut Opus pricing 67%. Claude Opus 4.5, 4.6, and 4.7 are all priced at $5/$25 per million tokens — down from $15/$75 for the earlier Opus 4.1. The effective saving is real, but read the small print: Opus 4.7 ships with a new tokenizer that may use up to 35% more tokens for the same text. Same per-token price, potentially higher per-request cost.
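The tokenizer caveat is easy to quantify. A sketch comparing per-request cost of Opus 4.1 ($15/$75) against Opus 4.7 ($5/$25) under the worst-case 35% token inflation mentioned above; the 50K/5K token counts are a hypothetical request:

```python
def opus_request_cost(in_tok: int, out_tok: int, in_rate: float,
                      out_rate: float, inflation: float = 1.0) -> float:
    """Cost when the tokenizer emits `inflation`x tokens for the same text."""
    return (in_tok * inflation * in_rate + out_tok * inflation * out_rate) / 1e6

old = opus_request_cost(50_000, 5_000, 15.00, 75.00)       # Opus 4.1
new = opus_request_cost(50_000, 5_000, 5.00, 25.00, 1.35)  # Opus 4.7, worst case
print(round(old, 3), round(new, 3))  # 1.125 0.506
```

Even at the full 35% inflation, the new pricing is still roughly 55% cheaper per request; the caveat shrinks the saving, it does not erase it.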
OpenAI's o-series got dramatically cheaper. o3 at $2/$8 replaced o1 at $15/$60 — roughly an 87% input cut. o4-mini at $0.55/$2.20 is cheaper still. The hidden cost remains: reasoning tokens on o-series are billed at output rates, and for complex tasks the model generates a lot of them. Headline price and actual bill can diverge 3–10×.
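Reasoning-token billing can be modeled the same way; in this sketch the 20K reasoning-token count is a hypothetical complex request, not an OpenAI figure:

```python
def o_series_cost(in_tok: int, visible_out_tok: int, reasoning_tok: int,
                  in_rate: float, out_rate: float) -> float:
    """o-series billing: hidden reasoning tokens charged at the output rate."""
    return (in_tok * in_rate + (visible_out_tok + reasoning_tok) * out_rate) / 1e6

# o3 at $2/$8: 10K in, 1K visible out, 20K hidden reasoning tokens (hypothetical)
headline = o_series_cost(10_000, 1_000, 0, 2.00, 8.00)
actual = o_series_cost(10_000, 1_000, 20_000, 2.00, 8.00)
print(round(headline, 3), round(actual, 3), round(actual / headline, 1))  # 0.028 0.188 6.7
```

A 6.7× gap between headline and actual cost sits squarely inside the 3–10× divergence described above.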
Google Flash-tier is genuinely cheap. Gemini 2.5 Flash-Lite at $0.10/$0.40 matches the cheapest open-weight inference on major hosts. The tradeoff is Google's April 2026 free-tier cuts — no migration period, no advance warning — and mandatory monthly spending caps ($250/mo Tier 1, $2,000/mo Tier 2) that suspend your API automatically when hit.
xAI entered with surprisingly competitive fast-tier pricing. Grok 4.1 Fast at $0.20/$0.50 with a 2M context window undercuts Claude Sonnet, GPT-5 mini, and Gemini Flash. The maturity gap is real — shorter deployment track record, and one third-party source notes a knowledge cutoff of November 2024 without search tools — but the price-per-context-token on fast-tier Grok is hard to beat among proprietary models.
DeepSeek's cache pricing is extreme. V4 Flash cache hits: $0.0028/M — 98% below the standard input rate. If your workload has a large, reusable system prompt, the effective cost can fall well below any frontier provider's headline rate. The V4 Pro 75% promo expires 2026-05-31; budget accordingly.
Open-weight models are credibly competitive. MiniMax M2.7 scores 56.22% on SWE-Pro, matching GPT-5.3-Codex. Kimi K2.6 reportedly ties GPT-5.5 on SWE-Bench Pro coding at ~80% lower cost, at $0.73–$0.75/$3.49–$3.50. These are MoE architectures (many parameters, few active per token), which is why they can be hosted cheaply.
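The "A22B" suffix in Qwen3-235B-A22B encodes that MoE intuition directly: 22B of 235B parameters are active per token. A quick sketch of the active fraction:

```python
total_params_b = 235    # Qwen3-235B-A22B: total parameters (billions)
active_params_b = 22    # parameters active per forward pass
fraction = active_params_b / total_params_b
print(f"{fraction:.1%}")  # 9.4%
```

Under 10% of the weights do work on any given token, which is the core reason hosts can serve these models at sub-dollar rates.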
Headline price vs effective price. Several mechanisms drive a wedge between the two: reasoning-token billing, tokenizer changes, prompt-cache discounts, tiered context pricing, and managed-cloud markups.
The managed cloud premium. AWS Bedrock prices Claude at parity with Anthropic direct, but prices Llama 70B at a significant markup vs specialist inference hosts. Azure OpenAI prices tokens identically to OpenAI direct, but production deployments add support plans ($100–$1,000+/month) and infrastructure overhead, typically 20–40% above listed rates. Both are worth it for specific compliance requirements; neither saves money on raw token cost.
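A flat support fee matters most at low volume. A sketch of the overhead a hypothetical $500/month support plan adds at different monthly token spends (the fee and spend levels are illustrative, not Azure list prices):

```python
def monthly_overhead_pct(support_fee: float, token_spend: float) -> float:
    """Flat support fee expressed as a percentage of raw token spend."""
    return 100 * support_fee / token_spend

for spend in (500, 2_500, 25_000):
    print(spend, round(monthly_overhead_pct(500, spend), 1))
# 500 -> 100.0%, 2,500 -> 20.0%, 25,000 -> 2.0%
```

The fixed fee is crushing for small workloads and noise for large ones, which is why the 20–40% overhead figure applies mainly to mid-sized deployments.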
A multi-key router — something like llmdeal.me — lets you hit the cheapest backend for each request class without managing a dozen API keys and billing dashboards yourself.
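The idea behind such a router can be sketched in a few lines. Provider names and input rates below are pulled from the tables above; the routing policy (cheapest input rate per request class) is a deliberate simplification, since a real router would also weigh latency, context limits, output rates, and reliability:

```python
# $/1M input-token rates per request class, from the tables above
PRICING = {
    "frontier": {"claude-opus-4.7": 5.00, "gpt-5.5": 5.00, "gemini-3.1-pro": 2.00},
    "fast": {"grok-4.1-fast": 0.20, "gemini-2.5-flash-lite": 0.10,
             "gpt-5.4-nano": 0.20},
    "open": {"deepinfra/qwen3-235b": 0.071, "groq/llama-3.3-70b": 0.59},
}

def cheapest(request_class: str) -> str:
    """Pick the lowest input-rate backend for a request class."""
    models = PRICING[request_class]
    return min(models, key=models.get)

print(cheapest("fast"))  # gemini-2.5-flash-lite
print(cheapest("open"))  # deepinfra/qwen3-235b
```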
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.