The research is real, the arithmetic is ironclad — but "up to 85%" depends heavily on what you're running. Here's what the numbers actually say.
Before discussing routers, consider the spread they exploit.
GPT-4o costs roughly 27× more per token than GPT-4o-mini. Claude Opus 4.6 is 15× the per-token price of Haiku 4.5 on both input and output: $15/$75 per million tokens versus $1/$5. And across the frontier as a whole, equivalent capability has gotten roughly 75× cheaper since GPT-4's 2023 launch ($30/M input then, versus $0.40/M for GPT-4.1 mini in 2026).
Those gaps are the entire argument for routing. If you send every query to the expensive model and some fraction of them could have been handled correctly by the cheap one, you are paying a large and entirely unnecessary premium. The arithmetic is not subtle: even a naive 50/50 split across a 27× price differential cuts the total bill to roughly 52% of always-frontier, because the diverted half costs almost nothing by comparison.
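A back-of-the-envelope in Python makes the blend explicit; the 27× figure is the GPT-4o/GPT-4o-mini gap above, and only the ratio matters:

```python
# Total spend of a routed workload, relative to sending everything
# to the frontier model. Only the price ratio matters, so units cancel.
PRICE_RATIO = 27.0  # frontier cost per token / cheap cost per token

def relative_cost(frontier_fraction: float) -> float:
    blended = frontier_fraction * PRICE_RATIO + (1 - frontier_fraction)
    return blended / PRICE_RATIO

print(f"{relative_cost(0.5):.1%} of the all-frontier bill")  # ~51.9%
```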
The question routing research actually tries to answer is harder: how big is that fraction, and how accurately can you classify queries in real time without regressing quality?
UC Berkeley / Anyscale / Canva — peer-reviewed, not a vendor blog.
The RouteLLM paper, accepted at ICLR 2025, is the source most people cite for "up to 85% cost reduction." The result is real, but it comes with a benchmark label attached: MT-Bench, with LLM-judge augmentation. On MT-Bench, their best router sends only 14% of queries to GPT-4 while retaining 95% of its quality score — yielding roughly 85% cost savings versus always-frontier.
Run the same routers on other benchmarks and the picture shifts:
| Benchmark | Cost reduction (vs always-GPT-4 unless noted) | Quality retained |
|---|---|---|
| MT-Bench (w/ augmentation) | 85% | 95% of GPT-4 |
| MT-Bench (no augmentation) | ~48% vs random routing | 95% of GPT-4 |
| MMLU | 45% | 95% of GPT-4 |
| GSM8K (math reasoning) | 35% | 95% of GPT-4 |
The headline figure is neither dishonest nor representative. MT-Bench has a significant proportion of tasks where a smaller model genuinely suffices; math-heavy workloads are harder to divert. Citing "85%" without the benchmark qualifier is the problem, not the research itself.
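Inverting the cost blend shows why the benchmark label matters so much: under a fixed price ratio, each savings figure pins down how often the router is allowed to call the frontier model. A sketch reusing the 27× ratio from above (RouteLLM's actual model pair has a different ratio, so treat these numbers as illustrative):

```python
R = 27.0  # assumed frontier/cheap price ratio

def frontier_call_rate(savings: float) -> float:
    # Solve savings = 1 - (f*R + (1 - f)) / R for the frontier fraction f.
    return (R * (1 - savings) - 1) / (R - 1)

for s in (0.85, 0.45, 0.35):
    print(f"{s:.0%} savings implies ~{frontier_call_rate(s):.0%} frontier calls")
# 85% -> ~12%, 45% -> ~53%, 35% -> ~64% of queries still need the big model
```

A math-heavy workload that genuinely needs the frontier model for two thirds of its queries cannot produce an 85% number, no matter how good the router is.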
Independent evaluation, not a benchmark curated by a routing vendor.
RouterArena evaluated multiple production and open-source routers against a broader set of tasks. Their headline finding for best-in-class routers (vLLM-SR and CARROT): ~35% cost reduction with under 2% accuracy loss. That is a more defensible number for a general-purpose workload than 85%.
The paper also lands two findings that rarely make it into the secondary coverage: no single router is universally dominant across workloads, and commercial does not automatically mean better-calibrated. NotDiamond, one of the better-known commercial routers, ranked 12th, cost-unfavorable because it over-routes to expensive models.
Accepted at TMLR. Real savings, real constraint.
FrugalGPT matched GPT-4 accuracy with up to 98% cost reduction — on the HEADLINES financial classification dataset. The 50–98% range in the paper reflects how wildly results vary by dataset. The hard constraint: FrugalGPT requires labeled training examples from the same distribution as your production queries. If your queries are domain-specific and you have labeled examples, it can be very effective. If you don't, you cannot reproduce this number. This caveat is omitted from nearly every secondary citation.
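For context on the mechanism: the paper's main lever is an LLM cascade, where each answer from a cheaper model is scored by a trained quality model and escalated if the score falls below a threshold. A minimal sketch of that control flow; the scorer is exactly the component that needs the labeled, in-distribution training data:

```python
from collections.abc import Callable

Model = Callable[[str], str]          # prompt -> answer
Scorer = Callable[[str, str], float]  # (prompt, answer) -> quality score

def cascade(query: str, models: list[Model],
            scorer: Scorer, thresholds: list[float]) -> str:
    """models are ordered cheapest first; thresholds holds one entry per
    non-final model. The scorer is the trained component, i.e. the part
    that requires labeled examples from your query distribution."""
    for model, threshold in zip(models[:-1], thresholds):
        answer = model(query)
        if scorer(query, answer) >= threshold:
            return answer        # cheap answer judged good enough, stop here
    return models[-1](query)     # escalate to the most capable model
```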
The foundational assumption behind most routing pitches — and it has no solid primary source.
You will see this claim constantly: "40–60% of your queries don't need a frontier model." Multiple practitioner blogs repeat it as established fact. The best sourcing found for it is an informal audit of ~500 queries described on TianPan.co (Oct 2025) — not a peer-reviewed study, not a large-scale production analysis. Morphllm.com (March 2026) offers a more specific and more credible variant: "60–80% of coding-agent requests are routine," attributed to coding-agent workloads specifically.
The RouteLLM data implies something in this ballpark for MT-Bench tasks (14–26% strong-model calls needed). But generalizing from one benchmark to your production traffic is the core uncertainty. The fraction of queries that are "simple enough for a cheap model" depends entirely on what your application does. Treat 40–60% as a planning heuristic, not a measured baseline.
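If you want a measured baseline instead of the heuristic, the audit is cheap to run: sample production queries, answer each with both models, and let a judge mark where the small model suffices. A sketch; `call_model` and `judge_equivalent` are hypothetical stubs standing in for your provider SDK and an LLM-as-judge prompt:

```python
import random

def call_model(model: str, prompt: str) -> str:
    """Hypothetical stub: replace with your provider SDK call."""
    raise NotImplementedError

def judge_equivalent(query: str, cheap: str, frontier: str) -> bool:
    """Hypothetical stub: an LLM-as-judge check that the cheap answer
    is as good as the frontier answer for this query."""
    raise NotImplementedError

def divertible_fraction(queries: list[str], n: int = 200) -> float:
    """Estimate what share of sampled traffic the cheap model handles."""
    sample = random.sample(queries, min(n, len(queries)))
    hits = sum(
        judge_equivalent(q, call_model("small-model", q),
                         call_model("frontier-model", q))
        for q in sample
    )
    return hits / len(sample)
```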
Per-prompt routing alone versus stacked strategies.
The research and practitioner literature converge on a rough ladder:
| Strategy | Savings vs always-frontier |
|---|---|
| Manual model switching (quarterly reviews) | ~25% of available savings; ~5-month lag |
| Per-prompt routing alone | ~40–70% at steady state |
| Routing + context compression | 78–84% |
| Routing + compression + caching + batching | 84–91% |
The 85%+ ceiling is achievable — but only when multiple strategies stack, on workloads that happen to have a high fraction of simple queries, with a well-calibrated router. Semantic caching alone can move the needle significantly: one documented production case went from $47,000/month to $12,700/month (73% reduction). AWS research across 63,796 real chatbot queries found 86% cost reduction at optimal caching thresholds. These are complementary to routing, not substitutes.
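Note that stacked strategies compound on the remaining bill rather than adding up. A sketch with illustrative per-stage rates (assumptions, not measurements):

```python
from functools import reduce

def stacked_savings(stage_rates: list[float]) -> float:
    # Each stage saves a fraction of whatever spend the earlier stages left.
    remaining = reduce(lambda r, s: r * (1 - s), stage_rates, 1.0)
    return 1 - remaining

print(f"{stacked_savings([0.55]):.1%}")              # routing alone, ~55%
print(f"{stacked_savings([0.55, 0.35]):.1%}")        # + compression, ~71%
print(f"{stacked_savings([0.55, 0.35, 0.40]):.1%}")  # + caching, ~82%
```

In practice the stages are not independent; cache hits skew toward exactly the simple queries a router would have diverted anyway, so treat the multiplication as an optimistic heuristic rather than a forecast.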
Routing adds complexity; complexity has costs.
A routing layer introduces latency, engineering overhead, and a new failure mode: silent quality regression. If your router miscategorizes a hard query as easy and sends it to a small model, it may return a plausible-looking wrong answer. Unlike a 500 error, this can be invisible in your monitoring until a user notices. RouterBench (Martian, 2024) found that basic routers often performed no better than a random baseline on unfamiliar workloads. RouterArena confirms: no single router is universally dominant.
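The standard mitigation is to make the router abstain toward the frontier model: divert only when the classifier is confident the query is easy. A minimal sketch, assuming a hypothetical `easy_probability` classifier (training that classifier well is the entire hard part):

```python
def easy_probability(query: str) -> float:
    """Hypothetical stub: a trained classifier's P(cheap model suffices)."""
    raise NotImplementedError

def route(query: str, threshold: float = 0.8) -> str:
    # Bias errors toward overspending rather than silent quality loss:
    # anything the classifier is unsure about goes to the frontier model.
    if easy_probability(query) >= threshold:
        return "small-model"
    return "frontier-model"
```

Raising the threshold trades savings for safety; the cost-quality curves in the routing papers are this trade-off swept across thresholds.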
The LogRocket practitioner guidance is blunt: building complex routing infrastructure at $500/month in LLM spend is almost certainly not worth it. The savings need to be large relative to the engineering cost of implementing and maintaining a calibrated router. At scale the math easily clears; at low volume it usually doesn't.
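The break-even is worth writing down before you commit. A sketch with assumed, illustrative figures for build cost and upkeep:

```python
def payback_months(monthly_spend: float, savings_rate: float,
                   build_cost: float, monthly_upkeep: float) -> float:
    net = monthly_spend * savings_rate - monthly_upkeep
    return float("inf") if net <= 0 else build_cost / net

print(payback_months(500, 0.40, 15_000, 300))     # inf: never pays back
print(payback_months(50_000, 0.40, 15_000, 300))  # ~0.76 months
```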
One practical note: as frontier prices continue dropping — Gartner projects inference costs for a 1-trillion-parameter model will fall over 90% between 2025 and 2030 — the routing arbitrage window narrows. The gap between cheap and expensive is large enough today to act on. In three years, it may be much smaller.
Routing savings are real and the arithmetic is sound: the price gaps between frontier and small models are large enough that even imperfect classification pays off. The honest range for per-prompt routing alone, across general-purpose workloads, is 35–70%, based on RouterArena's ~35% at under 2% accuracy loss and RouteLLM's 35–85% range across benchmarks. The 85% ceiling is not fiction, but it requires MT-Bench-like workloads, LLM-judge augmentation, and a well-calibrated router, not a default configuration on arbitrary production traffic.
The "40–60% of queries are simple" assumption that underlies most routing pitches is a practitioner heuristic with weak empirical backing. Audit your own traffic before building around it. Your fraction may be higher or lower, and routing performance scales with how accurately you can classify your specific workload.
llmdeal.me routes requests across models automatically — the same approach described here, without the infrastructure cost of building your own.
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.