Open-weight LLMs — DeepSeek V4, Qwen3, Kimi K2.6, GLM-5, MiniMax M2.7 — hit near-frontier benchmark scores in early 2026. The price gap is no longer remotely proportional to the quality gap.
For most of 2024 and into 2025, the advice was simple: use frontier APIs (Claude Opus, GPT-4o) for anything that mattered, use cheap models for throwaway tasks. The quality gap was real. You felt it in multi-file context, in long chains of reasoning, in anything requiring sustained precision.
That advice is now outdated for a large share of work.
Between January and May 2026, at least five open-weight model families posted benchmark scores that were previously exclusive to $15–$25/M-output frontier models, and they're doing it at $0.28–$3.50/M output. This post lays out the specific numbers.
Claude Opus 4.7 via Anthropic's API costs $3.00/M input and $25.00/M output. That's the reference point for "frontier quality." Here's what the open-weight tier looks like alongside it:
| Model | Input ($ / 1M tokens) | Output ($ / 1M tokens) | License | Released |
|---|---|---|---|---|
| Claude Opus 4.7 | $3.00 | $25.00 | Proprietary | — |
| DeepSeek V4 Flash | $0.14 | $0.28 | MIT | 2026-04-24 |
| MiniMax M2.7 | $0.28–$0.30 | $1.20 | Open-weight | 2026-04-12 |
| Qwen3-235B-A22B | $0.70 | $2.80 | Apache 2.0 | 2026 |
| Kimi K2.6 | $0.73–$0.75 | $3.49–$3.50 | MIT+ | 2026-04-20 |
| GLM-5 | $1.00 | $3.20 | Proprietary API | 2026 |
| GLM-5.1 | $1.40 | $4.40 | Open-weight | 2026-04-08 |
DeepSeek V4 Flash output is 89× cheaper per token than Claude Opus 4.7. MiniMax's own marketing calls M2.7 "only 8% of the price of Claude Sonnet." Even GLM-5.1, the most expensive open-weight model in this table, comes in below Claude Sonnet's $15.00/M output rate.
And these aren't small models. DeepSeek V4 Flash is a 284B-parameter MoE with 13B active parameters per token and a 1M-token context window. Kimi K2.6 is a 1-trillion-parameter model. Qwen3-235B-A22B runs 235B total parameters with 22B active. They're big models running cheap because MoE inference compute tracks the active parameter count per token, not the total.
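If you want to sanity-check the gap against your own volume, the arithmetic is trivial. A quick Python sketch, using the output prices from the table above; the 200M-token monthly volume is an invented illustration, not a typical figure:

```python
# Output-token cost comparison using the per-million prices from the table above.
# The monthly volume is a made-up illustration; substitute your own usage.

OUTPUT_PRICE_PER_M = {        # $ per 1M output tokens
    "Claude Opus 4.7":   25.00,
    "GLM-5.1":            4.40,
    "Kimi K2.6":          3.50,
    "Qwen3-235B-A22B":    2.80,
    "MiniMax M2.7":       1.20,
    "DeepSeek V4 Flash":  0.28,
}

monthly_output_tokens_m = 200                       # hypothetical: 200M output tokens/month
baseline = OUTPUT_PRICE_PER_M["Claude Opus 4.7"]

for model, price in OUTPUT_PRICE_PER_M.items():
    monthly_cost = price * monthly_output_tokens_m
    note = "(baseline)" if model == "Claude Opus 4.7" else f"({baseline / price:.0f}x cheaper output)"
    print(f"{model:18s}  ${monthly_cost:>9,.2f}/mo  {note}")
```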
Pricing is only interesting if quality held. Let's look at what the benchmarks say — specifically for coding, where the numbers are the most concrete.
DeepSeek V4 Flash vs Claude Opus on SWE-bench Verified: DeepSeek V4 Flash scores 80.6% on SWE-bench Verified. Claude Opus 4.7 sits at approximately 80.8%. That's not an approximation of parity — it's a dead heat, at $0.28/M vs $25.00/M output. DeepSeek V4 Pro scores 83.7% by one estimate (unverified across independent evaluations), but the V4 Flash figure alone is enough to make the point.
Kimi K2.6 ties GPT-5.5 on SWE-Bench Pro: Moonshot's Kimi K2.6 claims 58.6% on SWE-Bench Pro, matching GPT-5.5 on that benchmark, and ranked 4th on the Artificial Analysis Intelligence Index v4.0. Moonshot notes this comes at approximately 80% lower cost. At $0.75/M input and $3.50/M output, that's a significant claim if the benchmark holds under independent replication.
MiniMax M2.7 at 56.22% on SWE-Bench Pro: MiniMax open-sourced M2.7 on April 12, 2026, citing a 56.22% score on SWE-Bench Pro and 57.0% on Terminal Bench 2. The OpenHands team called M2.5 (the predecessor) "the first open model that has exceeded Claude Sonnet on recent tests." M2.7 followed one month later with further improvements, at $0.30/M input and $1.20/M output.
Qwen3 in the local tier: Qwen3.6-27B (dense) scores 77.2% on SWE-bench Verified running on 24GB of VRAM — a consumer GPU. Qwen3.6-35B-A3B hits 73.4% on the same benchmark at the same hardware tier, with only 3B active parameters per forward pass. Both are Apache 2.0 licensed. You can self-host them.
GLM-5.1 at 1/10th the cost of Claude Opus: Zhipu open-sourced GLM-5.1 on April 8, 2026. One comparison puts GLM-5.1 at roughly 94% of Claude Opus's coding ability at approximately 1/10th the cost via API. The model is also cited for extended agentic durability — "the only open-source model capable of eight hours of continuous work," per Zhipu's own announcement.
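One way to fold quality back into the price comparison is cost per resolved task rather than cost per token: divide the cost of an attempt by the pass rate. Here's a sketch of that arithmetic; the 4M-output-tokens-per-attempt figure is a placeholder assumption, not a measured trace, and the pass rates are the SWE-bench Verified numbers quoted above.

```python
# Cost per resolved task = (output tokens per attempt * price) / pass rate.
# tokens_per_attempt_m is a placeholder assumption; use your own agent traces.

tokens_per_attempt_m = 4.0        # assumed 4M output tokens per SWE-bench attempt

models = {
    # name: (output $/M tokens, SWE-bench Verified pass rate quoted above)
    "Claude Opus 4.7":   (25.00, 0.808),
    "DeepSeek V4 Flash": ( 0.28, 0.806),
}

for name, (price, pass_rate) in models.items():
    cost_per_attempt = tokens_per_attempt_m * price
    cost_per_resolved = cost_per_attempt / pass_rate
    print(f"{name:18s}  ${cost_per_attempt:6.2f}/attempt   ${cost_per_resolved:6.2f}/resolved")
```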
Prompt caching amplifies the cost advantage for repeated-context workloads — which is basically every agentic coding task.
DeepSeek V4 Flash cache-hit pricing is $0.0028/M input — that's a 98% discount off the already-cheap $0.14/M cache-miss rate. Kimi K2.6 cache hits drop to $0.15/M input (80% off $0.75). GLM-4.5 offers cached input at $0.11/M vs $0.60/M standard. These are aggressive numbers for workloads with stable system prompts — which agentic coding frames almost always have.
Mistral, by contrast, does not offer prompt-caching discounts on any of its models as of May 2026. And Anthropic/OpenAI's 50–90% cache discounts, while real, are discounts off much higher base rates.
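To see what caching does to a real input bill, blend the two rates by your cache hit ratio. The sketch below uses the cache prices quoted above; the 85% hit rate is an assumption for illustration, so measure your own before trusting the output.

```python
# Effective input price = hit_rate * cache-hit price + (1 - hit_rate) * standard price.
# The 85% hit rate is an illustrative assumption, not a measured figure.

def effective_input_price(standard: float, cached: float, hit_rate: float = 0.85) -> float:
    return hit_rate * cached + (1.0 - hit_rate) * standard

# (standard $/M input, cache-hit $/M input), numbers quoted above
providers = {
    "DeepSeek V4 Flash": (0.14, 0.0028),
    "Kimi K2.6":         (0.75, 0.15),
    "GLM-4.5":           (0.60, 0.11),
}

for name, (standard, cached) in providers.items():
    eff = effective_input_price(standard, cached)
    print(f"{name:18s}  ${eff:.4f}/M effective input at 85% hits  (vs ${standard:.2f}/M uncached)")
```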
The benchmark convergence doesn't mean these models are interchangeable with Claude Opus on every task. There's a practical distribution here that teams have actually settled into.
One observed pattern, from dev.to: "The open-weight tier (Qwen 3 Coder, Kimi K2.6, DeepSeek weights) is now good enough that lots of teams run 60 to 80 percent of their agent traffic locally and only escalate the hard 20 percent to a frontier API." Another take: "Claude is maybe 20-25% better. DeepSeek gets you 80% of the way there. And that 80% is more than enough for most of what people actually do with AI."
The remaining gap is real but narrow — and concentrated in specific failure modes: very long multi-file context windows, complex multi-step reasoning chains, instruction-following under unusual edge cases. For standard agentic tasks — code generation, refactoring, test writing, documentation, SQL, API integration — the open-weight tier is functionally equivalent for most teams doing most work.
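In practice that split is implemented as a thin routing layer in front of the agent. The sketch below is one deliberately simplified way to express it; the model IDs come from this post, but the escalation heuristics and the threshold are invented placeholders, not anyone's production policy.

```python
# Minimal sketch of the "open-weight tier first, escalate the hard cases" pattern.
# The heuristics and thresholds here are illustrative placeholders only.

from dataclasses import dataclass

CHEAP_MODEL = "deepseek-v4-flash"   # open-weight tier (placeholder model ID)
FRONTIER_MODEL = "claude-opus-4.7"  # escalation target (placeholder model ID)

@dataclass
class Task:
    prompt: str
    files_in_context: int       # how many files the agent must hold at once
    long_reasoning_chain: bool  # multi-step planning beyond a few tool calls

def pick_model(task: Task) -> str:
    # Escalate exactly the failure modes listed above: wide multi-file context
    # and long reasoning chains. Everything else stays on the cheap tier.
    if task.files_in_context > 8 or task.long_reasoning_chain:
        return FRONTIER_MODEL
    return CHEAP_MODEL

# A single-file refactor stays cheap; a cross-cutting migration escalates.
print(pick_model(Task("rename helper, update call sites", files_in_context=2, long_reasoning_chain=False)))
print(pick_model(Task("migrate the auth layer across services", files_in_context=14, long_reasoning_chain=True)))
```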
Chinese open-weight models have gone from roughly 1% of global AI market share in January 2025 to about 15% by January 2026 (Qwen: 0.5% → 9%, DeepSeek: 0.5% → 6%). That's a market share shift, not a benchmark number — it reflects actual production usage.
These models are available via API today at the prices in the table above. You don't need to self-host to access them — though Qwen3-235B-A22B, the Kimi K2.5/K2.6 weights, and the full MiniMax M2.x series are all on Hugging Face under permissive licenses if you want to.
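These APIs are typically OpenAI-compatible, so switching is often just a base URL and model ID change rather than a new SDK. A minimal sketch assuming such an endpoint; the base URL and model ID below are placeholders, so take the real values from your provider's docs.

```python
# Calling an open-weight model through an OpenAI-compatible endpoint.
# The base_url and model ID are placeholders; substitute your provider's values.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder model ID
    messages=[
        {"role": "system", "content": "You are a code review assistant."},
        {"role": "user", "content": "Review this diff for off-by-one errors:\n..."},
    ],
)

print(response.choices[0].message.content)
```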
For teams burning real money on Claude Opus for bulk tasks — code review pipelines, test generation, documentation passes — the arithmetic is no longer abstract. At 80.6% vs 80.8% SWE-bench Verified, the question of whether to pay $25.00/M or $0.28/M in output tokens for a given workload is an engineering decision, not a hand-wave about quality.
llmdeal.me routes requests across this tier automatically, so you don't have to hardcode provider-specific SDK calls for each model.
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.