A complete, sourced pricing reference — frontier models, value models, and open-weight inference hosts — all in one place. Per 1M tokens, input and output, as of May 2026.
· llmdeal.me
The cheapest and most expensive production API tokens today differ by more than three orders of magnitude. DeepInfra's Qwen3-235B-A22B runs at $0.071/$0.10 per million tokens; Anthropic's Claude Mythos Preview — invite-only, critical infrastructure only — is priced at $25/$125 per million (third-party sourced; not on the official pricing page). Most developers will live somewhere between those extremes, but knowing the full range matters when you are deciding which model to default to, which to cache, and which to batch.
The other defining trend is the open-weight wave. DeepSeek V4 Flash, Kimi K2.6, MiniMax M2.7, GLM-5.1, and Qwen3-235B-A22B are all open-weight and available on multiple inference hosts simultaneously. That creates genuine competition on hosting price in a way that proprietary models cannot. When the weights are public, every inference provider can undercut the originator.
Closed, proprietary models from the major western labs. Prices are for the standard synchronous API unless noted. All figures are per 1M tokens.
| Model | Provider | Input $/1M | Output $/1M | Context |
|---|---|---|---|---|
| Claude Opus 4.7 / 4.6 / 4.5 | Anthropic | $5.00 | $25.00 | 1M |
| Claude Sonnet 4.6 / 4.5 | Anthropic | $3.00 | $15.00 | 1M |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K |
| GPT-5.5 | OpenAI | $5.00 | $30.00 | 1M |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1M (272K short) |
| GPT-5.4 mini | OpenAI | $0.75 | $4.50 | 400K |
| GPT-5.4 nano | OpenAI | $0.20 | $1.25 | 400K |
| o3 (reasoning) | OpenAI | $2.00 | $8.00 | 200K |
| o4-mini (reasoning) | OpenAI | $0.55 | $2.20 | 200K |
| Gemini 3.1 Pro | Google | $2.00 / $4.00* | $12.00 / $18.00* | 2M |
| Gemini 3.1 Flash-Lite | Google | $0.25 | $1.50 | — |
| Gemini 2.5 Pro | Google | $1.25 / $2.50* | $10.00 / $15.00* | 1M |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | 1M |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | 1M |
| Grok 4.1 Fast / 4 Fast | xAI | $0.20 | $0.50 | 2M |
| Grok 4.3 | xAI | $1.25 | $2.50 | 1M |
| Grok 4 | xAI | $3.00 | $15.00 | 256K |
| Cohere Command A | Cohere | $2.50 | $10.00 | 256K |
| Cohere Command R7B | Cohere | $0.037 | $0.150 | 128K |
* Google Pro models: lower price applies ≤200K tokens; higher price applies above 200K tokens.
o-series models bill internal reasoning tokens at output rates — actual cost can be 3–10× the headline price for complex tasks.
Grok 4.3 pricing from third-party aggregators only (x.ai API returned HTTP 403 during research).
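The per-request arithmetic behind any row in this table is the same; a minimal sketch in Python, using the Gemini 3.1 Pro tier split from the footnote above. Whether the higher tier reprices the whole request or only the tokens past the threshold is a billing detail worth confirming; this sketch assumes whole-request repricing:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request at flat per-1M-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

def gemini_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Gemini 3.1 Pro: $2/$12 at <=200K input tokens, $4/$18 above."""
    if input_tokens <= 200_000:
        return request_cost(input_tokens, output_tokens, 2.00, 12.00)
    return request_cost(input_tokens, output_tokens, 4.00, 18.00)

# 100K input / 2K output stays in the lower tier
print(round(gemini_pro_cost(100_000, 2_000), 4))  # 0.224
```

Crossing the 200K threshold doubles the input rate and raises the output rate by half, so long-context requests deserve their own budget line.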
Models from Chinese labs and European providers, mostly open-weight (weights publicly available). Prices are the provider's own hosted API unless noted. All figures per 1M tokens, standard tier, cache-miss input.
| Model | Provider | Input $/1M | Output $/1M | Notes |
|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.28 | MIT; cache hit $0.0028 |
| DeepSeek V4 Pro | DeepSeek | $0.435 | $0.87 | 75% promo until 2026-05-31; full price $1.74/$3.48 |
| Qwen-Turbo | Alibaba Cloud | $0.05 | $0.20 | Cheapest production Qwen |
| Qwen3-235B-A22B | Alibaba Cloud | $0.70 | $2.80 | Open-weight Apache 2.0; 128K ctx |
| Qwen3-Max | Alibaba Cloud | $1.20–$2.40 | $6.00 | Tiered by ctx length |
| Mistral Nemo | Mistral AI | $0.02 | $0.03 | Cheapest in lineup |
| Mistral Small 3.2 24B | Mistral AI | $0.075 | $0.20 | Best value in Mistral range |
| Mistral Large 3 2512 | Mistral AI | $0.50 | $1.50 | No prompt caching |
| Codestral 2508 | Mistral AI | $0.30 | $0.90 | Code-specialist; 256K ctx |
| GLM-4.5-Flash / GLM-4.7-Flash | Zhipu / Z.ai | Free | Free | No quota cap |
| GLM-5 | Zhipu / Z.ai | $1.00 | $3.20 | 203K ctx |
| GLM-5.1 | Zhipu / Z.ai | $1.40 | $4.40 | Open-sourced 2026-04-08 |
| Kimi K2.6 | Moonshot AI | $0.73–$0.75 | $3.49–$3.50 | MIT+; 262K ctx; cache hit $0.15 |
| MiniMax M2.7 / M2.5 | MiniMax | $0.28–$0.30 | $1.20 | Open-weight; 1M ctx (M2.5) |
Mistral does not offer prompt-caching discounts on any models as of May 2026, unlike Anthropic (10% of base on cache hits) and OpenAI (75–90% on cache hits).
DeepSeek V4 Flash cache hit: $0.0028/M — a 98% discount vs standard input. DeepSeek V4 Pro's promo expires 2026-05-31; full price will be $1.74/$3.48 thereafter.
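Cache pricing folds into a single blended input rate. A sketch using the V4 Flash numbers above ($0.14 cache-miss, $0.0028 cache-hit); the 80% hit rate is a hypothetical workload, not a DeepSeek figure:

```python
def effective_input_rate(miss_per_m: float, hit_per_m: float, hit_rate: float) -> float:
    """Blended $/1M input rate given a cache hit rate in [0, 1]."""
    return hit_rate * hit_per_m + (1 - hit_rate) * miss_per_m

# DeepSeek V4 Flash: $0.14 miss, $0.0028 hit, assumed 80% of input tokens cached
rate = effective_input_rate(0.14, 0.0028, 0.80)
print(round(rate, 5))  # 0.03024
```

At that hypothetical hit rate the effective input price is about $0.03/1M, well under most hosts' cheapest open-weight rates.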
These hosts run open-weight models on their own GPU infrastructure. You get the same weights, different SLAs, speeds, and prices. All figures per 1M tokens.
| Provider | Model | Input $/1M | Output $/1M | Speed (tok/s) / Notes |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | $0.59 | $0.79 | ~325 |
| Groq | Llama 4 Scout (17Bx16E) | $0.11 | $0.34 | — |
| Groq | Qwen3 32B | $0.29 | $0.59 | — |
| Groq | Llama 3.1 8B Instant | $0.05 | $0.08 | 651 |
| Cerebras | Llama 3.3 70B | $0.60 | $0.60 | 1,800+ |
| Cerebras | Llama 3.1 8B | $0.10 | $0.10 | 2,360 |
| Together AI | Llama 3.3 70B | $0.88 | $0.88 | ~150 |
| Together AI | DeepSeek R1 | $3.00 | $7.00 | — |
| Fireworks AI | GPT OSS 20B | $0.07 | $0.30 | — |
| Fireworks AI | Kimi K2.6 | $0.95 | $4.00 | $0.16 cached |
| DeepInfra | Llama 3.3 70B Turbo | $0.10 | $0.32 | ~19 (FP8) |
| DeepInfra | Qwen3-235B-A22B | $0.071 | $0.10 | — |
| DeepInfra | DeepSeek R1-0528 | $0.50 | $2.15 | $0.35 cached |
| Novita AI | Llama 3.3 70B | $0.135 | $0.40 | — |
Cerebras uses wafer-scale silicon, not GPU clusters — hence the extreme throughput. Groq uses LPU silicon. Both offer free tiers: Cerebras ~1M tokens/day on Llama 3.3 70B; Groq free tier at 30 req/min. Lepton AI was acquired by NVIDIA (May 2025) and has been sunset.
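Those throughput figures translate directly into wall-clock decode time; a rough sketch that ignores time-to-first-token and network latency:

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate decode time; excludes prefill and network overhead."""
    return output_tokens / tokens_per_second

# 2,000 output tokens: Cerebras Llama 3.3 70B (~1,800 tok/s) vs Together (~150 tok/s)
print(round(generation_seconds(2_000, 1_800), 2))  # 1.11
print(round(generation_seconds(2_000, 150), 2))    # 13.33
```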
Anthropic cut Opus pricing 67%. Claude Opus 4.5, 4.6, and 4.7 are all priced at $5/$25 per million tokens — down from $15/$75 for the earlier Opus 4.1. The effective saving is real, but read the small print: Opus 4.7 ships with a new tokenizer that may use up to 35% more tokens for the same text. Same per-token price, potentially higher per-request cost.
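The tokenizer caveat is easy to quantify. A sketch comparing per-request cost of Opus 4.1 ($15/$75) against Opus 4.7 ($5/$25) under the worst-case 35% token inflation mentioned above; the 50K/5K token counts are a hypothetical request:

```python
def opus_request_cost(in_tok: int, out_tok: int, in_rate: float,
                      out_rate: float, inflation: float = 1.0) -> float:
    """Cost when the tokenizer emits `inflation`x tokens for the same text."""
    return (in_tok * inflation * in_rate + out_tok * inflation * out_rate) / 1e6

old = opus_request_cost(50_000, 5_000, 15.00, 75.00)       # Opus 4.1
new = opus_request_cost(50_000, 5_000, 5.00, 25.00, 1.35)  # Opus 4.7, worst case
print(round(old, 3), round(new, 3))  # 1.125 0.506
```

Even at the full 35% inflation, the new pricing is still roughly 55% cheaper per request; the caveat shrinks the saving, it does not erase it.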
OpenAI's o-series got dramatically cheaper. o3 at $2/$8 replaced o1 at $15/$60 — roughly an 87% input cut. o4-mini at $0.55/$2.20 is cheaper still. The hidden cost remains: reasoning tokens on o-series are billed at output rates, and for complex tasks the model generates a lot of them. Headline price and actual bill can diverge 3–10×.
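Reasoning-token billing can be modeled the same way; in this sketch the 20K reasoning-token count is a hypothetical complex request, not an OpenAI figure:

```python
def o_series_cost(in_tok: int, visible_out_tok: int, reasoning_tok: int,
                  in_rate: float, out_rate: float) -> float:
    """o-series billing: hidden reasoning tokens charged at the output rate."""
    return (in_tok * in_rate + (visible_out_tok + reasoning_tok) * out_rate) / 1e6

# o3 at $2/$8: 10K in, 1K visible out, 20K hidden reasoning tokens (hypothetical)
headline = o_series_cost(10_000, 1_000, 0, 2.00, 8.00)
actual = o_series_cost(10_000, 1_000, 20_000, 2.00, 8.00)
print(round(headline, 3), round(actual, 3), round(actual / headline, 1))  # 0.028 0.188 6.7
```

A 6.7× gap between headline and actual cost sits squarely inside the 3–10× divergence described above.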
Google Flash-tier is genuinely cheap. Gemini 2.5 Flash-Lite at $0.10/$0.40 matches the cheapest open-weight inference on major hosts. The tradeoff is Google's April 2026 free-tier cuts — no migration period, no advance warning — and mandatory monthly spending caps ($250/mo Tier 1, $2,000/mo Tier 2) that suspend your API automatically when hit.
xAI entered with surprisingly competitive fast-tier pricing. Grok 4.1 Fast at $0.20/$0.50 with a 2M context window undercuts Claude Sonnet, GPT-5 mini, and Gemini Flash. The maturity gap is real — shorter deployment track record, and one third-party source notes a knowledge cutoff of November 2024 without search tools — but the price-per-context-token on fast-tier Grok is hard to beat among proprietary models.
DeepSeek's cache pricing is extreme. V4 Flash cache hits: $0.0028/M — 98% below the standard input rate. If your workload has a large, reusable system prompt, the effective cost can fall well below any frontier provider's headline rate. The V4 Pro 75% promo expires 2026-05-31; budget accordingly.
Open-weight models are credibly competitive. MiniMax M2.7 scores 56.22% on SWE-Pro, matching GPT-5.3-Codex. Kimi K2.6 reportedly ties GPT-5.5 on SWE-Bench Pro coding at ~80% lower cost, at $0.73–$0.75/$3.49–$3.50. These are MoE architectures (many parameters, few active per token), which is why they can be hosted cheaply.
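The "A22B" suffix in Qwen3-235B-A22B encodes that MoE intuition directly: 22B of 235B parameters are active per token. A quick sketch of the active fraction:

```python
total_params_b = 235    # Qwen3-235B-A22B: total parameters (billions)
active_params_b = 22    # parameters active per forward pass
fraction = active_params_b / total_params_b
print(f"{fraction:.1%}")  # 9.4%
```

Under 10% of the weights do work on any given token, which is the core reason hosts can serve these models at sub-dollar rates.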
Headline price vs effective price. Several mechanisms drive a wedge between the two: reasoning-token billing, tokenizer changes, prompt-cache discounts, tiered context pricing, and managed-cloud markups.
The managed cloud premium. AWS Bedrock prices Claude at parity with Anthropic direct, but prices Llama 70B at a significant markup vs specialist inference hosts. Azure OpenAI prices tokens identically to OpenAI direct, but production deployments add support plans ($100–$1,000+/month) and infrastructure overhead, typically 20–40% above listed rates. Both are worth it for specific compliance requirements; neither saves money on raw token cost.
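A flat support fee matters most at low volume. A sketch of the overhead a hypothetical $500/month support plan adds at different monthly token spends (the fee and spend levels are illustrative, not Azure list prices):

```python
def monthly_overhead_pct(support_fee: float, token_spend: float) -> float:
    """Flat support fee expressed as a percentage of raw token spend."""
    return 100 * support_fee / token_spend

for spend in (500, 2_500, 25_000):
    print(spend, round(monthly_overhead_pct(500, spend), 1))
# 500 -> 100.0%, 2,500 -> 20.0%, 25,000 -> 2.0%
```

The fixed fee is crushing for small workloads and noise for large ones, which is why the 20–40% overhead figure applies mainly to mid-sized deployments.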
A multi-key router — something like llmdeal.me — lets you hit the cheapest backend for each request class without managing a dozen API keys and billing dashboards yourself.
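The idea behind such a router can be sketched in a few lines. Provider names and input rates below are pulled from the tables above; the routing policy (cheapest input rate per request class) is a deliberate simplification, since a real router would also weigh latency, context limits, output rates, and reliability:

```python
# $/1M input-token rates per request class, from the tables above
PRICING = {
    "frontier": {"claude-opus-4.7": 5.00, "gpt-5.5": 5.00, "gemini-3.1-pro": 2.00},
    "fast": {"grok-4.1-fast": 0.20, "gemini-2.5-flash-lite": 0.10,
             "gpt-5.4-nano": 0.20},
    "open": {"deepinfra/qwen3-235b": 0.071, "groq/llama-3.3-70b": 0.59},
}

def cheapest(request_class: str) -> str:
    """Pick the lowest input-rate backend for a request class."""
    models = PRICING[request_class]
    return min(models, key=models.get)

print(cheapest("fast"))  # gemini-2.5-flash-lite
print(cheapest("open"))  # deepinfra/qwen3-235b
```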
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.