The research is real, the arithmetic is ironclad — but "up to 85%" depends heavily on what you're running. Here's what the numbers actually say.
Before discussing routers, consider the spread they exploit.
GPT-4o costs roughly 27× more per token than GPT-4o-mini. Claude Opus 4.6 is 15× the per-token price of Haiku 4.5 on both input and output: $15/$75 per million tokens versus $1/$5. And across the frontier as a whole, equivalent capability has gotten roughly 75× cheaper since GPT-4's 2023 launch ($30/M input then, versus $0.40/M for GPT-4.1 mini in 2026).
Those gaps are the entire argument for routing. If you send every query to the expensive model and some fraction of them could have been handled correctly by the cheap one, you are paying a large and entirely unnecessary premium. The arithmetic is not subtle: even a naive 50/50 split across a 27× price differential cuts the total bill to roughly 52% of always-frontier, because the diverted half costs almost nothing by comparison.
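A back-of-the-envelope in Python makes the blend explicit; the 27× figure is the GPT-4o/GPT-4o-mini gap above, and only the ratio matters:

```python
# Total spend of a routed workload, relative to sending everything
# to the frontier model. Only the price ratio matters, so units cancel.
PRICE_RATIO = 27.0  # frontier cost per token / cheap cost per token

def relative_cost(frontier_fraction: float) -> float:
    blended = frontier_fraction * PRICE_RATIO + (1 - frontier_fraction)
    return blended / PRICE_RATIO

print(f"{relative_cost(0.5):.1%} of the all-frontier bill")  # ~51.9%
```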
The question routing research actually tries to answer is harder: how big is that fraction, and how accurately can you classify queries in real time without regressing quality?
UC Berkeley / Anyscale / Canva — peer-reviewed, not a vendor blog.
The RouteLLM paper, accepted at ICLR 2025, is the source most people cite for "up to 85% cost reduction." The result is real, but it comes with a benchmark label attached: MT-Bench, with LLM-judge augmentation. On MT-Bench, their best router sends only 14% of queries to GPT-4 while retaining 95% of its quality score — yielding roughly 85% cost savings versus always-frontier.
Run the same routers on other benchmarks and the picture shifts:
| Benchmark | Cost reduction (vs always-GPT-4 unless noted) | Quality retained |
|---|---|---|
| MT-Bench (w/ augmentation) | 85% | 95% of GPT-4 |
| MT-Bench (no augmentation) | ~48% vs random routing | 95% of GPT-4 |
| MMLU | 45% | 95% of GPT-4 |
| GSM8K (math reasoning) | 35% | 95% of GPT-4 |
The headline figure is neither dishonest nor representative. MT-Bench has a significant proportion of tasks where a smaller model genuinely suffices; math-heavy workloads are harder to divert. Citing "85%" without the benchmark qualifier is the problem, not the research itself.
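Inverting the cost blend shows why the benchmark label matters so much: under a fixed price ratio, each savings figure pins down how often the router is allowed to call the frontier model. A sketch reusing the 27× ratio from above (RouteLLM's actual model pair has a different ratio, so treat these numbers as illustrative):

```python
R = 27.0  # assumed frontier/cheap price ratio

def frontier_call_rate(savings: float) -> float:
    # Solve savings = 1 - (f*R + (1 - f)) / R for the frontier fraction f.
    return (R * (1 - savings) - 1) / (R - 1)

for s in (0.85, 0.45, 0.35):
    print(f"{s:.0%} savings implies ~{frontier_call_rate(s):.0%} frontier calls")
# 85% -> ~12%, 45% -> ~53%, 35% -> ~64% of queries still need the big model
```

A math-heavy workload that genuinely needs the frontier model for two thirds of its queries cannot produce an 85% number, no matter how good the router is.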
Independent evaluation, not a benchmark curated by a routing vendor.
RouterArena evaluated multiple production and open-source routers against a broader set of tasks. Their headline finding for best-in-class routers (vLLM-SR and CARROT): ~35% cost reduction with under 2% accuracy loss. That is a more defensible number for a general-purpose workload than 85%.
The paper also lands two findings that rarely make it into the secondary coverage: no single router is universally dominant across workloads, and commercial does not automatically mean better-calibrated. NotDiamond, one of the better-known commercial routers, ranked 12th, cost-unfavorable because it over-routes to expensive models.
Accepted at TMLR. Real savings, real constraint.
FrugalGPT matched GPT-4 accuracy with up to 98% cost reduction — on the HEADLINES financial classification dataset. The 50–98% range in the paper reflects how wildly results vary by dataset. The hard constraint: FrugalGPT requires labeled training examples from the same distribution as your production queries. If your queries are domain-specific and you have labeled examples, it can be very effective. If you don't, you cannot reproduce this number. This caveat is omitted from nearly every secondary citation.
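For context on the mechanism: the paper's main lever is an LLM cascade, where each answer from a cheaper model is scored by a trained quality model and escalated if the score falls below a threshold. A minimal sketch of that control flow; the scorer is exactly the component that needs the labeled, in-distribution training data:

```python
from collections.abc import Callable

Model = Callable[[str], str]          # prompt -> answer
Scorer = Callable[[str, str], float]  # (prompt, answer) -> quality score

def cascade(query: str, models: list[Model],
            scorer: Scorer, thresholds: list[float]) -> str:
    """models are ordered cheapest first; thresholds holds one entry per
    non-final model. The scorer is the trained component, i.e. the part
    that requires labeled examples from your query distribution."""
    for model, threshold in zip(models[:-1], thresholds):
        answer = model(query)
        if scorer(query, answer) >= threshold:
            return answer        # cheap answer judged good enough, stop here
    return models[-1](query)     # escalate to the most capable model
```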
The foundational assumption behind most routing pitches — and it has no solid primary source.
You will see this claim constantly: "40–60% of your queries don't need a frontier model." Multiple practitioner blogs repeat it as established fact. The best sourcing found for it is an informal audit of ~500 queries described on TianPan.co (Oct 2025) — not a peer-reviewed study, not a large-scale production analysis. Morphllm.com (March 2026) offers a more specific and more credible variant: "60–80% of coding-agent requests are routine," attributed to coding-agent workloads specifically.
The RouteLLM data implies something in this ballpark for MT-Bench tasks (14–26% strong-model calls needed). But generalizing from one benchmark to your production traffic is the core uncertainty. The fraction of queries that are "simple enough for a cheap model" depends entirely on what your application does. Treat 40–60% as a planning heuristic, not a measured baseline.
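If you want a measured baseline instead of the heuristic, the audit is cheap to run: sample production queries, answer each with both models, and let a judge mark where the small model suffices. A sketch; `call_model` and `judge_equivalent` are hypothetical stubs standing in for your provider SDK and an LLM-as-judge prompt:

```python
import random

def call_model(model: str, prompt: str) -> str:
    """Hypothetical stub: replace with your provider SDK call."""
    raise NotImplementedError

def judge_equivalent(query: str, cheap: str, frontier: str) -> bool:
    """Hypothetical stub: an LLM-as-judge check that the cheap answer
    is as good as the frontier answer for this query."""
    raise NotImplementedError

def divertible_fraction(queries: list[str], n: int = 200) -> float:
    """Estimate what share of sampled traffic the cheap model handles."""
    sample = random.sample(queries, min(n, len(queries)))
    hits = sum(
        judge_equivalent(q, call_model("small-model", q),
                         call_model("frontier-model", q))
        for q in sample
    )
    return hits / len(sample)
```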
Per-prompt routing alone versus stacked strategies.
The research and practitioner literature converge on a rough ladder:
| Strategy | Savings vs always-frontier |
|---|---|
| Manual model switching (quarterly reviews) | ~25% of available savings; ~5-month lag |
| Per-prompt routing alone | ~40–70% at steady state |
| Routing + context compression | 78–84% |
| Routing + compression + caching + batching | 84–91% |
The 85%+ ceiling is achievable — but only when multiple strategies stack, on workloads that happen to have a high fraction of simple queries, with a well-calibrated router. Semantic caching alone can move the needle significantly: one documented production case went from $47,000/month to $12,700/month (73% reduction). AWS research across 63,796 real chatbot queries found 86% cost reduction at optimal caching thresholds. These are complementary to routing, not substitutes.
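Note that stacked strategies compound on the remaining bill rather than adding up. A sketch with illustrative per-stage rates (assumptions, not measurements):

```python
from functools import reduce

def stacked_savings(stage_rates: list[float]) -> float:
    # Each stage saves a fraction of whatever spend the earlier stages left.
    remaining = reduce(lambda r, s: r * (1 - s), stage_rates, 1.0)
    return 1 - remaining

print(f"{stacked_savings([0.55]):.1%}")              # routing alone, ~55%
print(f"{stacked_savings([0.55, 0.35]):.1%}")        # + compression, ~71%
print(f"{stacked_savings([0.55, 0.35, 0.40]):.1%}")  # + caching, ~82%
```

In practice the stages are not independent; cache hits skew toward exactly the simple queries a router would have diverted anyway, so treat the multiplication as an optimistic heuristic rather than a forecast.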
Routing adds complexity; complexity has costs.
A routing layer introduces latency, engineering overhead, and a new failure mode: silent quality regression. If your router miscategorizes a hard query as easy and sends it to a small model, it may return a plausible-looking wrong answer. Unlike a 500 error, this can be invisible in your monitoring until a user notices. RouterBench (Martian, 2024) found that basic routers often performed no better than a random baseline on unfamiliar workloads. RouterArena confirms: no single router is universally dominant.
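The standard mitigation is to make the router abstain toward the frontier model: divert only when the classifier is confident the query is easy. A minimal sketch, assuming a hypothetical `easy_probability` classifier (training that classifier well is the entire hard part):

```python
def easy_probability(query: str) -> float:
    """Hypothetical stub: a trained classifier's P(cheap model suffices)."""
    raise NotImplementedError

def route(query: str, threshold: float = 0.8) -> str:
    # Bias errors toward overspending rather than silent quality loss:
    # anything the classifier is unsure about goes to the frontier model.
    if easy_probability(query) >= threshold:
        return "small-model"
    return "frontier-model"
```

Raising the threshold trades savings for safety; the cost-quality curves in the routing papers are this trade-off swept across thresholds.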
The LogRocket practitioner guidance is blunt: building complex routing infrastructure at $500/month in LLM spend is almost certainly not worth it. The savings need to be large relative to the engineering cost of implementing and maintaining a calibrated router. At scale the math easily clears; at low volume it usually doesn't.
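The break-even is worth writing down before you commit. A sketch with assumed, illustrative figures for build cost and upkeep:

```python
def payback_months(monthly_spend: float, savings_rate: float,
                   build_cost: float, monthly_upkeep: float) -> float:
    net = monthly_spend * savings_rate - monthly_upkeep
    return float("inf") if net <= 0 else build_cost / net

print(payback_months(500, 0.40, 15_000, 300))     # inf: never pays back
print(payback_months(50_000, 0.40, 15_000, 300))  # ~0.76 months
```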
One practical note: as frontier prices continue dropping — Gartner projects inference costs for a 1-trillion-parameter model will fall over 90% between 2025 and 2030 — the routing arbitrage window narrows. The gap between cheap and expensive is large enough today to act on. In three years, it may be much smaller.
Routing savings are real and the arithmetic is sound: the price gaps between frontier and small models are large enough that even imperfect classification pays off. The honest range for per-prompt routing alone, across general-purpose workloads, is 35–70%, based on RouterArena's ~35% at under 2% accuracy loss and RouteLLM's 35–85% range across benchmarks. The 85% ceiling is not fiction, but it requires MT-Bench-like workloads, LLM-judge augmentation, and a well-calibrated router, not a default configuration on arbitrary production traffic.
The "40–60% of queries are simple" assumption that underlies most routing pitches is a practitioner heuristic with weak empirical backing. Audit your own traffic before building around it. Your fraction may be higher or lower, and routing performance scales with how accurately you can classify your specific workload.
llmdeal.me routes requests across models automatically — the same approach described here, without the infrastructure cost of building your own.
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.