Open-weight LLMs — DeepSeek V4, Qwen3, Kimi K2.6, GLM-5, MiniMax M2.7 — hit near-frontier benchmark scores in early 2026. The price gap is no longer remotely proportional to the quality gap.
For most of 2024 and into 2025, the advice was simple: use frontier APIs (Claude Opus, GPT-4o) for anything that mattered, use cheap models for throwaway tasks. The quality gap was real. You felt it in multi-file context, in long chains of reasoning, in anything requiring sustained precision.
That advice is now outdated for a large share of work.
Between January and May 2026, at least five open-weight model families posted benchmark scores that were previously exclusive to $15–$25/M-output frontier models, and they're doing it at $0.28–$3.50/M output. This post lays out the specific numbers.
Claude Opus 4.7 via Anthropic's API costs $3.00/M input and $25.00/M output. That's the reference point for "frontier quality." Here's what the open-weight tier looks like alongside it:
| Model | Input ($ / 1M tokens) | Output ($ / 1M tokens) | License | Released |
|---|---|---|---|---|
| Claude Opus 4.7 | $3.00 | $25.00 | Proprietary | — |
| DeepSeek V4 Flash | $0.14 | $0.28 | MIT | 2026-04-24 |
| MiniMax M2.7 | $0.28–$0.30 | $1.20 | Open-weight | 2026-04-12 |
| Qwen3-235B-A22B | $0.70 | $2.80 | Apache 2.0 | 2026 |
| Kimi K2.6 | $0.73–$0.75 | $3.49–$3.50 | MIT+ | 2026-04-20 |
| GLM-5 | $1.00 | $3.20 | Proprietary API | 2026 |
| GLM-5.1 | $1.40 | $4.40 | Open-weight | 2026-04-08 |
DeepSeek V4 Flash output is 89× cheaper per token than Claude Opus 4.7. MiniMax's own marketing calls M2.7 "only 8% of the price of Claude Sonnet." Even GLM-5.1, the most expensive open-weight model in this table, comes in below Claude Sonnet's $15.00/M output rate.
And these aren't small models. DeepSeek V4 Flash is a 284B-parameter MoE with 13B active parameters per token and a 1M-token context window. Kimi K2.6 is a 1-trillion-parameter model. Qwen3-235B-A22B runs 235B total parameters with 22B active. They're big models running cheap because MoE inference compute tracks the active parameter count per token, not the total.
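If you want to sanity-check the gap against your own volume, the arithmetic is trivial. A quick Python sketch, using the output prices from the table above; the 200M-token monthly volume is an invented illustration, not a typical figure:

```python
# Output-token cost comparison using the per-million prices from the table above.
# The monthly volume is a made-up illustration; substitute your own usage.

OUTPUT_PRICE_PER_M = {        # $ per 1M output tokens
    "Claude Opus 4.7":   25.00,
    "GLM-5.1":            4.40,
    "Kimi K2.6":          3.50,
    "Qwen3-235B-A22B":    2.80,
    "MiniMax M2.7":       1.20,
    "DeepSeek V4 Flash":  0.28,
}

monthly_output_tokens_m = 200                       # hypothetical: 200M output tokens/month
baseline = OUTPUT_PRICE_PER_M["Claude Opus 4.7"]

for model, price in OUTPUT_PRICE_PER_M.items():
    monthly_cost = price * monthly_output_tokens_m
    note = "(baseline)" if model == "Claude Opus 4.7" else f"({baseline / price:.0f}x cheaper output)"
    print(f"{model:18s}  ${monthly_cost:>9,.2f}/mo  {note}")
```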
Pricing is only interesting if quality held. Let's look at what the benchmarks say — specifically for coding, where the numbers are the most concrete.
DeepSeek V4 Flash vs Claude Opus on SWE-bench Verified: DeepSeek V4 Flash scores 80.6% on SWE-bench Verified. Claude Opus 4.7 sits at approximately 80.8%. That's not an approximation of parity — it's a dead heat, at $0.28/M vs $25.00/M output. DeepSeek V4 Pro scores 83.7% by one estimate (unverified across independent evaluations), but the V4 Flash figure alone is enough to make the point.
Kimi K2.6 ties GPT-5.5 on SWE-Bench Pro: Moonshot's Kimi K2.6 claims 58.6% on SWE-Bench Pro, matching GPT-5.5 on that benchmark, and ranked 4th on the Artificial Analysis Intelligence Index v4.0. Moonshot notes this comes at approximately 80% lower cost. At $0.75/M input and $3.50/M output, that's a significant claim if the benchmark holds under independent replication.
MiniMax M2.7 at 56.22% on SWE-Bench Pro: MiniMax open-sourced M2.7 on April 12, 2026, citing a 56.22% score on SWE-Bench Pro and 57.0% on Terminal Bench 2. The OpenHands team called M2.5 (the predecessor) "the first open model that has exceeded Claude Sonnet on recent tests." M2.7 followed one month later with further improvements, at $0.30/M input and $1.20/M output.
Qwen3 in the local tier: Qwen3.6-27B (dense) scores 77.2% on SWE-bench Verified running on 24GB of VRAM — a consumer GPU. Qwen3.6-35B-A3B hits 73.4% on the same benchmark at the same hardware tier, with only 3B active parameters per forward pass. Both are Apache 2.0 licensed. You can self-host them.
GLM-5.1 at 1/10th the cost of Claude Opus: Zhipu open-sourced GLM-5.1 on April 8, 2026. One comparison puts GLM-5.1 at roughly 94% of Claude Opus's coding ability at approximately 1/10th the cost via API. The model is also cited for extended agentic durability — "the only open-source model capable of eight hours of continuous work," per Zhipu's own announcement.
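One way to fold quality back into the price comparison is cost per resolved task rather than cost per token: divide the cost of an attempt by the pass rate. Here's a sketch of that arithmetic; the 4M-output-tokens-per-attempt figure is a placeholder assumption, not a measured trace, and the pass rates are the SWE-bench Verified numbers quoted above.

```python
# Cost per resolved task = (output tokens per attempt * price) / pass rate.
# tokens_per_attempt_m is a placeholder assumption; use your own agent traces.

tokens_per_attempt_m = 4.0        # assumed 4M output tokens per SWE-bench attempt

models = {
    # name: (output $/M tokens, SWE-bench Verified pass rate quoted above)
    "Claude Opus 4.7":   (25.00, 0.808),
    "DeepSeek V4 Flash": ( 0.28, 0.806),
}

for name, (price, pass_rate) in models.items():
    cost_per_attempt = tokens_per_attempt_m * price
    cost_per_resolved = cost_per_attempt / pass_rate
    print(f"{name:18s}  ${cost_per_attempt:6.2f}/attempt   ${cost_per_resolved:6.2f}/resolved")
```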
Prompt caching amplifies the cost advantage for repeated-context workloads — which is basically every agentic coding task.
DeepSeek V4 Flash cache-hit pricing is $0.0028/M input — that's a 98% discount off the already-cheap $0.14/M cache-miss rate. Kimi K2.6 cache hits drop to $0.15/M input (80% off $0.75). GLM-4.5 offers cached input at $0.11/M vs $0.60/M standard. These are aggressive numbers for workloads with stable system prompts — which agentic coding frames almost always have.
Mistral, by contrast, does not offer prompt-caching discounts on any of its models as of May 2026. And Anthropic/OpenAI's 50–90% cache discounts, while real, are discounts off much higher base rates.
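To see what caching does to a real input bill, blend the two rates by your cache hit ratio. The sketch below uses the cache prices quoted above; the 85% hit rate is an assumption for illustration, so measure your own before trusting the output.

```python
# Effective input price = hit_rate * cache-hit price + (1 - hit_rate) * standard price.
# The 85% hit rate is an illustrative assumption, not a measured figure.

def effective_input_price(standard: float, cached: float, hit_rate: float = 0.85) -> float:
    return hit_rate * cached + (1.0 - hit_rate) * standard

# (standard $/M input, cache-hit $/M input), numbers quoted above
providers = {
    "DeepSeek V4 Flash": (0.14, 0.0028),
    "Kimi K2.6":         (0.75, 0.15),
    "GLM-4.5":           (0.60, 0.11),
}

for name, (standard, cached) in providers.items():
    eff = effective_input_price(standard, cached)
    print(f"{name:18s}  ${eff:.4f}/M effective input at 85% hits  (vs ${standard:.2f}/M uncached)")
```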
The benchmark convergence doesn't mean these models are interchangeable with Claude Opus on every task. There's a practical distribution here that teams have actually settled into.
One observed pattern, from dev.to: "The open-weight tier (Qwen 3 Coder, Kimi K2.6, DeepSeek weights) is now good enough that lots of teams run 60 to 80 percent of their agent traffic locally and only escalate the hard 20 percent to a frontier API." Another take: "Claude is maybe 20-25% better. DeepSeek gets you 80% of the way there. And that 80% is more than enough for most of what people actually do with AI."
The remaining gap is real but narrow — and concentrated in specific failure modes: very long multi-file context windows, complex multi-step reasoning chains, instruction-following under unusual edge cases. For standard agentic tasks — code generation, refactoring, test writing, documentation, SQL, API integration — the open-weight tier is functionally equivalent for most teams doing most work.
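In practice that split is implemented as a thin routing layer in front of the agent. The sketch below is one deliberately simplified way to express it; the model IDs come from this post, but the escalation heuristics and the threshold are invented placeholders, not anyone's production policy.

```python
# Minimal sketch of the "open-weight tier first, escalate the hard cases" pattern.
# The heuristics and thresholds here are illustrative placeholders only.

from dataclasses import dataclass

CHEAP_MODEL = "deepseek-v4-flash"   # open-weight tier (placeholder model ID)
FRONTIER_MODEL = "claude-opus-4.7"  # escalation target (placeholder model ID)

@dataclass
class Task:
    prompt: str
    files_in_context: int       # how many files the agent must hold at once
    long_reasoning_chain: bool  # multi-step planning beyond a few tool calls

def pick_model(task: Task) -> str:
    # Escalate exactly the failure modes listed above: wide multi-file context
    # and long reasoning chains. Everything else stays on the cheap tier.
    if task.files_in_context > 8 or task.long_reasoning_chain:
        return FRONTIER_MODEL
    return CHEAP_MODEL

# A single-file refactor stays cheap; a cross-cutting migration escalates.
print(pick_model(Task("rename helper, update call sites", files_in_context=2, long_reasoning_chain=False)))
print(pick_model(Task("migrate the auth layer across services", files_in_context=14, long_reasoning_chain=True)))
```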
Chinese open-weight models have gone from roughly 1% of global AI market share in January 2025 to about 15% by January 2026 (Qwen: 0.5% → 9%, DeepSeek: 0.5% → 6%). That's a market share shift, not a benchmark number — it reflects actual production usage.
These models are available via API today at the prices in the table above. You don't need to self-host to access them — though Qwen3-235B-A22B, the Kimi K2.5/K2.6 weights, and the full MiniMax M2.x series are all on Hugging Face under permissive licenses if you want to.
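These APIs are typically OpenAI-compatible, so switching is often just a base URL and model ID change rather than a new SDK. A minimal sketch assuming such an endpoint; the base URL and model ID below are placeholders, so take the real values from your provider's docs.

```python
# Calling an open-weight model through an OpenAI-compatible endpoint.
# The base_url and model ID are placeholders; substitute your provider's values.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder model ID
    messages=[
        {"role": "system", "content": "You are a code review assistant."},
        {"role": "user", "content": "Review this diff for off-by-one errors:\n..."},
    ],
)

print(response.choices[0].message.content)
```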
For teams burning real money on Claude Opus for bulk tasks — code review pipelines, test generation, documentation passes — the arithmetic is no longer abstract. At 80.6% vs 80.8% SWE-bench Verified, the question of whether to pay $25.00/M or $0.28/M in output tokens for a given workload is an engineering decision, not a hand-wave about quality.
llmdeal.me routes requests across this tier automatically, so you don't have to hardcode provider-specific SDK calls for each model.
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.