The per-token sticker is the smallest number on your bill. Here is what actually drives the cost up — and how to read the charges that providers bury in footnotes.
Providers publish a headline number — dollars per million tokens — and that number is technically accurate. The problem is that it describes only one axis of a billing surface that has at least six. Before you sign off on a model choice or a cloud deployment, you need to understand each axis separately.
What follows is a breakdown of the six most common hidden multipliers, with exact figures sourced from provider documentation and verified aggregators as of May 2026.
Claude Opus 4.7, released April 16, 2026, ships with a new tokenizer. Anthropic's own pricing documentation states it plainly:
"This new tokenizer may use up to 35% more tokens for the same fixed text."
The per-token price for Opus 4.7 is $5.00 per million input tokens and $25.00 per million output tokens — identical to Opus 4.5 and 4.6. No sticker change. But if the tokenizer produces 35% more tokens for the same prompt, your effective cost per request rises by up to 35% on both input and output. For input-heavy workloads — long system prompts, large context windows, document processing — this lands closer to a 30–35% effective price increase overnight, with no announcement under the "pricing" heading.
The mechanism matters: tokenizers differ in how they handle whitespace, punctuation, non-ASCII characters, and code. If your codebase or prompts are English prose, the impact may be on the lower end. If they include structured data, multilingual text, or dense markup, you should benchmark the new tokenizer against your actual payload before assuming the headline rate still applies.
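If you want that number before committing, Anthropic exposes a token-counting endpoint that makes the comparison cheap to run. A minimal sketch, assuming placeholder model IDs (substitute the exact IDs from your provider's model list):

```python
# Compare token counts for the same payload across two model versions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_tokens(model: str, prompt: str) -> int:
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.input_tokens

# Benchmark against a payload representative of your real traffic.
payload = open("representative_prompt.txt").read()

old = count_tokens("claude-opus-4-5", payload)  # placeholder model ID
new = count_tokens("claude-opus-4-7", payload)  # placeholder model ID
print(f"old: {old}  new: {new}  inflation: {new / old - 1:.1%}")
```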
OpenAI's o-series reasoning models and similar thinking-enabled models bill their internal chain-of-thought tokens at output token rates. Those tokens are never shown to you — they are the model working through the problem — but they are charged as if they were output text.
For a concrete reference point: o4-mini is priced at $0.55 input / $2.20 output per million tokens. If the model internally generates 10,000 reasoning tokens to answer a 500-token question and produce a 200-token answer, you pay for 500 input tokens plus 10,200 output tokens: about $0.0224 of the roughly $0.0227 total sits on the output line. The stated input rate is largely irrelevant for reasoning-intensive tasks.
MetaCTO's analysis (May 2026) puts the practical range bluntly: "The o-series models also bill internal reasoning tokens at output rates, which can multiply costs by 3–10x depending on task complexity." The headline price for o3 is $2.00/$8.00 per million tokens — but on a complex multi-step reasoning task, your effective rate can approach or exceed the o3-pro list price of $20.00/$80.00.
This is not unique to OpenAI. Any model with an extended thinking or reasoning trace bills those tokens. Verify whether your provider surfaces a separate `reasoning_tokens` field in the `usage` object of the API response — if it does, that count is added to your output bill.
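Here is a rough sketch of that check against OpenAI's Chat Completions API, with the o4-mini rates quoted above hard-coded; swap in your provider's current price sheet:

```python
# Pull the reasoning-token line item out of a completion and price it.
from openai import OpenAI

INPUT_RATE = 0.55 / 1_000_000   # $/token, o4-mini input rate quoted above
OUTPUT_RATE = 2.20 / 1_000_000  # $/token; reasoning tokens bill at this rate

client = OpenAI()
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Plan a 12-step data migration."}],
)

usage = resp.usage
reasoning = usage.completion_tokens_details.reasoning_tokens  # hidden chain of thought
visible = usage.completion_tokens - reasoning  # completion_tokens includes reasoning

cost = usage.prompt_tokens * INPUT_RATE + usage.completion_tokens * OUTPUT_RATE
print(f"reasoning: {reasoning}, visible output: {visible}, cost: ${cost:.4f}")
```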
Stateless APIs do not store conversation history. Your client code sends the full message array with every request. That means every prior user and assistant turn is re-tokenized and billed as input tokens on every subsequent call.
MetaCTO describes the compounding effect directly: "As the conversation grows longer, the number of prompt_tokens increases with every turn. You are effectively paying for all previous messages over and over again."
A five-turn conversation where each turn adds 1,000 tokens does not cost 5,000 input tokens total — it costs roughly 15,000 (1K + 2K + 3K + 4K + 5K). At turn 20 with a 500-token average per turn, you are re-billing ~9,500 tokens of context on that turn alone. At Opus 4.7 rates ($5.00/MTok input), a 200-turn customer support conversation with a 300-token average message length re-bills roughly 6 million tokens of context, about $30 per conversation, before counting any actual response tokens.
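The growth is quadratic in conversation length, which a few lines of arithmetic make plain. A sketch using the figures from the example above:

```python
# Total input tokens re-billed across an N-turn conversation where every
# message averages `avg` tokens and the full history is resent each turn.
def rebilled_context_tokens(turns: int, avg: int) -> int:
    return avg * sum(n - 1 for n in range(1, turns + 1))  # turn n resends n-1 messages

tokens = rebilled_context_tokens(200, 300)
print(f"{tokens:,} tokens, ${tokens * 5.00 / 1_000_000:.2f} at $5.00/MTok")
# 5,970,000 tokens, $29.85 at $5.00/MTok
```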
Prompt caching cuts this, but only partially. Anthropic's cache-read rate is $0.50/MTok for Opus 4.7 (10% of the base $5.00 input rate), and only for cached segments that remain static across turns. The dynamic portion of the conversation — any new turn — is still billed at full input rates.
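Re-running the 200-turn example with the history served from cache shows the shape of the saving. This sketch assumes every prior turn is a cache hit and ignores cache-write surcharges, so treat it as a lower bound:

```python
AVG, TURNS = 300, 200
BASE = 5.00 / 1_000_000        # $/token, full Opus 4.7 input rate
CACHE_READ = 0.50 / 1_000_000  # $/token, cache-read rate quoted above

history = sum(n - 1 for n in range(1, TURNS + 1)) * AVG  # re-sent prior messages
uncached = history * BASE
cached = history * CACHE_READ + TURNS * AVG * BASE  # new turns still bill at full rate

print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
# about $29.85 uncached vs. roughly $3.3 cached -- a big cut, but far from free
```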
Perplexity Sonar's API has a dual billing structure that is documented but easy to miss when comparing models by per-token rate alone.
Every Sonar API call charges two separate line items: a token charge and a per-request search fee.
| Model | Input | Output | Per-request fee |
|---|---|---|---|
| Sonar | $1.00/MTok | $1.00/MTok | $5–$12 per 1K requests |
| Sonar Pro | $3.00/MTok | $15.00/MTok | $6–$14 per 1K requests |
| Sonar Reasoning | $1.00/MTok | $5.00/MTok | $5–$12 per 1K requests |
| Sonar Reasoning Pro | $2.00/MTok | $8.00/MTok | $6–$14 per 1K requests |
For short queries — a user asking a quick factual question — the per-request fee often dominates. At Sonar's minimum of $5 per 1,000 requests, 10,000 daily queries cost $50/day in request fees alone, before a single token is counted. The exact request fee varies by "search context size," but Perplexity has not published the token thresholds that separate the tiers; the fee ranges themselves ($5–$14 per 1,000 requests, as in the table above) are from Perplexity's official pricing docs, accessed 2026-05-16.
For high-volume applications where you are treating Sonar like a cheap search API, model the request fee separately from token costs before committing. At scale, the two lines can be roughly equal in magnitude.
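A back-of-envelope sketch of that split, using the Sonar rates from the table above; the traffic volume and per-request token averages are illustrative assumptions:

```python
daily_requests = 10_000
avg_in, avg_out = 150, 400  # assumed per-request token averages

token_cost = daily_requests * (avg_in * 1.00 + avg_out * 1.00) / 1_000_000
request_fees = daily_requests / 1_000 * 5.00  # Sonar's minimum fee tier

print(f"tokens: ${token_cost:.2f}/day, request fees: ${request_fees:.2f}/day")
# tokens: $5.50/day, request fees: $50.00/day -- the fee line dominates
```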
For frontier models from Anthropic and OpenAI, AWS Bedrock and Azure OpenAI price at parity with the providers' direct API — no markup on those specific models. The hidden cost comes from two other places.
Open-source models on Bedrock carry significant markups over specialist hosts. According to a tokenmix.ai analysis updated April 2026, Llama 3.3 70B costs $2.65 per million tokens on Bedrock versus $0.88 per million on Together AI — a 201% premium. Command R from Cohere costs $0.50/$1.50 per million tokens on Bedrock, versus $0.15/$0.60 direct from Cohere — again a meaningful uplift. If your workload runs on open-source models, you are paying a managed-hosting fee whether you see it itemized or not.
Azure's overhead lives below the token price line. The per-token rates for GPT-5 on Azure AI Foundry are identical to OpenAI direct ($1.25 input / $10.00 output per million tokens). But production Azure deployments also carry support plans ($100–$1,000+/month, mandatory for production SLAs), hosting for fine-tuned models that bills even while idle ($1.70–$3.00/hour), monitoring infrastructure (~$35–$50/month), and data egress beyond the first free 100 GB ($0.087/GB). CloudZero's May 2026 analysis puts the real-world Azure OpenAI cost 20–40% above listed token rates once those line items are included.
If you are deploying on Azure for compliance or procurement consolidation, that premium may be worth it. If you are deploying there because it seems like the obvious choice, benchmark the total bill against direct API access first.
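A sketch of that benchmark using the Azure line items above; the monthly token volume and egress are illustrative assumptions, and fine-tuned-model hosting is left out:

```python
input_mtok, output_mtok = 200, 25  # assumed monthly volume, millions of tokens
token_bill = input_mtok * 1.25 + output_mtok * 10.00  # GPT-5 rates quoted above

support = 100                       # entry production support plan, $/month
monitoring = 40                     # midpoint of the ~$35-$50 range
egress = max(0, 300 - 100) * 0.087  # 300 GB/month assumed, first 100 GB free

total = token_bill + support + monitoring + egress
print(f"tokens: ${token_bill:.0f}, total: ${total:.0f} (+{total / token_bill - 1:.0%})")
# tokens: $500, total: $657 (+31%) -- inside CloudZero's 20-40% band
```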
Both Anthropic and OpenAI advance your API account through rate-limit tiers based on cumulative spend. The Anthropic tier thresholds are $5, $40, $200, and $400 to reach Tiers 1 through 4. OpenAI's are $5, $50, $100, $250, and $1,000 for Tiers 1 through 5.
The part that tends to surprise people: OpenAI's tiers also require calendar-time minimums. Tier 3 requires $100 spent AND 7 days on account. Tier 5 requires $1,000 AND 30 days. As Respan.ai noted in their rate-limit guide: "spending $10,000 on day one still requires waiting before reaching Tier 5 status." The practical implication is that if you are launching a high-traffic product and expecting to hit production-grade rate limits immediately, you cannot buy your way there — you have to wait out the clock regardless of spend.
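A small sketch of the ramp. The spend thresholds and the Tier 3 and Tier 5 day minimums are the figures above; the 7- and 14-day minimums for Tiers 2 and 4 are from OpenAI's rate-limit docs and worth re-verifying:

```python
import math

# (tier, cumulative spend $, minimum days on account)
TIERS = [(1, 5, 0), (2, 50, 7), (3, 100, 7), (4, 250, 14), (5, 1_000, 30)]

def earliest_day(spend_threshold: float, min_days: int, daily_spend: float) -> int:
    return max(math.ceil(spend_threshold / daily_spend), min_days)

for tier, spend, min_days in TIERS:
    print(f"Tier {tier}: day {earliest_day(spend, min_days, daily_spend=500.0)}")
# At $500/day, spend clears every threshold by day 2 -- the calendar rules.
```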
Anthropic's tiers are spend-only (no calendar requirement listed in their public docs as of May 2026), but monthly spend caps apply at each tier — Tier 1 caps at $100/month, Tier 2 at $500/month, and Tier 3 at $1,000/month. Exceeding those caps requires moving up a tier, which requires more cumulative spend. It is a forced ramp whether you plan for it or not.
When you get your first bill that is 2–4x what you expected from the per-token rate, it is usually a combination of two or three of the above compounding at once.
The fix is not to avoid expensive models — it is to instrument your actual token usage per call. Log `usage.input_tokens`, `usage.output_tokens`, and any `usage.cache_read_input_tokens` or reasoning-token fields your provider surfaces. A week of logging with realistic traffic will tell you more than any pricing calculator.
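A minimal sketch of that instrumentation for the Anthropic SDK; the field names follow Anthropic's usage object, and the print is a stand-in for your telemetry sink:

```python
import json
import time

import anthropic

client = anthropic.Anthropic()

def logged_call(**kwargs):
    """Wrap messages.create and emit one usage record per call."""
    resp = client.messages.create(**kwargs)
    u = resp.usage
    print(json.dumps({  # replace with your telemetry sink
        "ts": time.time(),
        "model": kwargs.get("model"),
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "cache_read_input_tokens": getattr(u, "cache_read_input_tokens", None),
    }))
    return resp
```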
llmdeal.me routes requests across providers and exposes per-request token breakdowns in one place, which makes this kind of comparison less painful if you are evaluating across multiple models.
Rates checked against providers' own pricing pages, May 2026. Article published 2026-05-16.