llmdeal API reference.
Drop-in OpenAI replacement. Same request shape, same response shape, same client libraries.
Bearer auth with lld_ keys, JSON in, JSON or SSE out. 16+ models behind one base URL;
smart-route picks the cheapest capable lane, or pin any model yourself.
Quick start
One Bearer header. One POST. Standard OpenAI body. Sign up at /buy.html; the dashboard mints your lld_ key.
Base URL
Either host serves the gateway; pick whichever your tooling prefers:
Authentication
Every /v1/* request requires:
Authorization: Bearer lld_<your-32-hex-token>
Keys are 36 chars total (lld_ prefix + 32 lowercase hex). Missing or malformed key returns 401 invalid_api_key.
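A client can catch a malformed key before it costs a round trip. The sketch below checks only the documented shape (lld_ plus 32 lowercase hex characters); the function names are hypothetical, and a key that passes can still be revoked on the server.

```python
import re

# Format stated above: "lld_" prefix followed by 32 lowercase hex characters.
_KEY_RE = re.compile(r"lld_[0-9a-f]{32}")


def looks_like_llmdeal_key(key: str) -> bool:
    """Return True if `key` matches the documented 36-character shape."""
    return _KEY_RE.fullmatch(key) is not None


def auth_header(key: str) -> dict:
    """Build the Authorization header, rejecting obviously malformed keys early."""
    if not looks_like_llmdeal_key(key):
        raise ValueError("expected lld_ followed by 32 lowercase hex characters")
    return {"Authorization": f"Bearer {key}"}
```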
First call
curl https://api.llmdeal.me/v1/chat/completions \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route-fast",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
The response is the standard OpenAI chat-completion shape; see /v1/chat/completions below.
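For environments without an SDK, the same call can be prepared with the Python standard library alone. This is a minimal sketch: build_chat_request and API_BASE are illustrative names, and the function only constructs the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.llmdeal.me/v1"


def build_chat_request(api_key: str, model: str, user_text: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST from the curl example."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends it; the reply text then sits at choices[0].message.content in the decoded JSON.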
Models & rates
Set "model" to any alias below. Rates are USD per 1M tokens, billed against your
balance at exactly the rates shown. byo-* aliases route to a customer-owned upstream
key and are zero-rated on llmdeal's side.
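To make the per-1M pricing concrete, the sketch below turns token counts into a USD charge. The three rates are copied from the tables in this section. The helper name is hypothetical, and the sketch assumes the rate of the alias you requested is the one applied.

```python
# (input, output) USD per 1M tokens, copied from the rate tables in this section.
RATES_PER_1M = {
    "smart-route-fast": (0.15, 0.45),
    "smart-route": (0.30, 0.90),
    "llama-3.3-70b": (0.20, 0.60),
}


def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one call: tokens times the per-1M rate, summed over input and output."""
    input_rate, output_rate = RATES_PER_1M[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000


# 412 prompt tokens and 87 completion tokens on smart-route:
# (412 * 0.30 + 87 * 0.90) / 1,000,000 = 0.0002019 USD
cost = request_cost_usd("smart-route", 412, 87)
```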
Smart-route aliases
Load-balancer aliases that pick the cheapest healthy upstream in their lane:
| Alias | Input $/1M | Output $/1M | Lane | Tier |
|---|---|---|---|---|
| smart-route-fast | 0.15 | 0.45 | llama-3.1-8b · gemini-2.5-flash | all |
| smart-route | 0.30 | 0.90 | llama-3.3-70b group (default) | all |
| smart-route-large | 0.35 | 1.05 | nemotron-120b · qwen3-235b · gpt-oss-120b | pro+ |
| smart-route-coder | 0.40 | 1.20 | qwen3-coder-480b · codestral | pro+ |
Pinned open-weight models
| Alias | Input $/1M | Output $/1M | Notes | Tier |
|---|---|---|---|---|
| llama-3.1-8b-instant | 0.10 | 0.30 | Groq · 131k ctx · fast workhorse | all |
| llama-3.3-70b | 0.20 | 0.60 | 128k ctx · general chat | all |
| gemini-2.5-flash | 0.15 | 0.45 | Google · 1M ctx · multimodal | all |
| codestral | 0.30 | 0.90 | Mistral · 256k ctx · code-specialist | pro+ |
| qwen3-235b-cerebras | 0.40 | 1.20 | Cerebras · ~2000 tok/s · 131k ctx | pro+ |
| gpt-oss-120b | 0.30 | 0.90 | 128k ctx · reasoning | pro+ |
NVIDIA-routed models
| Alias | Input $/1M | Output $/1M | Notes | Tier |
|---|---|---|---|---|
| nvidia/llama-3.3-70b | 0.20 | 0.60 | NVIDIA NIM · 128k ctx | all |
| nvidia/qwen3-coder-480b | 0.40 | 1.20 | NVIDIA NIM · 131k ctx · code | pro+ |
| nvidia/nemotron-super-120b | 0.30 | 0.90 | NVIDIA NIM · 128k ctx · reasoning | pro+ |
| nvidia/llama-4-maverick | 0.40 | 1.20 | NVIDIA NIM | pro+ |
Consortium-routed models
Routed to revenue-share partner GPUs; priced strictly under the closest catalog sibling so you have an incentive to opt in.
| Alias | Input $/1M | Output $/1M | Notes | Tier |
|---|---|---|---|---|
| consortium-llama-70b | 0.15 | 0.45 | Partner GPU · llama-3.3-70b | all |
| consortium-nemotron-120b | 0.22 | 0.66 | Partner GPU · nemotron-super-120b | pro+ |
| consortium-qwen-coder | 0.30 | 0.90 | Partner GPU · qwen3-coder-480b | pro+ |
| consortium-qwen-235b | 0.30 | 0.90 | Partner GPU · qwen3-235b | pro+ |
Frontier escalation (Pro+ add-on)
Reserved aliases for hand-off to a frontier vendor when smart-route flags low confidence. Billed against Frontier Credits, not your base balance; rates are passed through.
| Alias | Vendor | Notes |
|---|---|---|
| claude-sonnet-4-6-frontier | Anthropic Sonnet 4.6 | pass-through retail rate |
| claude-opus-4-7-frontier | Anthropic Opus 4.7 | pass-through retail rate |
| gpt-5.5-frontier | OpenAI GPT-5.5 | pass-through retail rate |
Embedding models
| Alias | Input $/1M | Vector dim | Notes |
|---|---|---|---|
| nv-embed-qa | 0.02 | 1024 | NVIDIA NIM · QA-tuned |
| bge-m3 | 0.01 | 1024 | multilingual · cheapest tier |
| mistral-embed | 0.10 | 1024 | Mistral |
BYO ("bring your own key") models
Aliases prefixed byo- route to a customer-owned upstream key registered in your
dashboard; llmdeal bills nothing for them (0.00 / 0.00 per 1M). Use these when you
already pay an upstream vendor and want them to flow through the same gateway, key, audit log,
and budget controls as everything else.
Examples: byo-openai-gpt-4o, byo-anthropic-sonnet, byo-groq-llama-70b. The exact alias list is whatever you wire on the dashboard.
Call GET /v1/cost-preview/rates (see below) to fetch the exact
table the gateway is using right now, including which aliases your tier can actually call.
In the rate table, unknown aliases fall back to the conservative default rate of 0.40 / 1.20.
POST /v1/chat/completions
Standard OpenAI chat-completion body. JSON only, Content-Type: application/json required. Body cap: 256 KB.
Non-streaming
curl https://api.llmdeal.me/v1/chat/completions \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route",
    "messages": [
      {"role": "system", "content": "You are a terse assistant."},
      {"role": "user", "content": "Summarise this PR in one sentence."}
    ],
    "temperature": 0.2,
    "max_tokens": 512
  }'
Response
{
  "id": "chatcmpl-9f2c...",
  "object": "chat.completion",
  "created": 1747200000,
  "model": "llama-3.3-70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Refactored the loop into a single-pass reduce."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 412,
    "completion_tokens": 87,
    "total_tokens": 499
  }
}
Streaming
Set "stream": true to receive Server-Sent Events. The format is wire-identical to OpenAI's streaming response; OpenAI SDKs parse it natively.
curl -N https://api.llmdeal.me/v1/chat/completions \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream a haiku."}]
  }'
# Server-Sent Events:
# data: {"id":"chatcmpl-9f2c","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant","content":"Refactored"}}]}
# data: {"id":"chatcmpl-9f2c","object":"chat.completion.chunk","choices":[{"delta":{"content":" the loop"}}]}
# data: {"id":"chatcmpl-9f2c","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":5,"total_tokens":17}}
# data: [DONE]
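If you consume the stream without an SDK, the parsing step is small. The sketch below reads data: lines, stops at the [DONE] sentinel, and joins the delta.content fragments. The function names are illustrative, and the sample lines are trimmed copies of the events shown above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from `data:` lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the `delta.content` fragments of a streamed completion."""
    parts = []
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


sample = [
    'data: {"choices":[{"delta":{"role":"assistant","content":"Refactored"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":" the loop"}}]}',
    "",
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample))  # Refactored the loop
```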
Supported parameters
- model · required · any alias from Models & rates
- messages · required · array of { role, content }; role is system, user, assistant, or tool
- temperature, top_p, max_tokens · standard OpenAI semantics
- stream · SSE when true
- tools, tool_choice · OpenAI-compatible wire format; upstream support varies per model
- stop, presence_penalty, frequency_penalty, seed, response_format · passed through to upstream
Server header is llmdeal-gateway/1.0; per-request trace IDs surface as
x-llmdeal-request-id.
POST /v1/embeddings
OpenAI-compatible embeddings passthrough. Body cap: 256 KB; batch large RAG inputs client-side.
curl https://api.llmdeal.me/v1/embeddings \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-m3",
    "input": [
      "The quick brown fox jumps over the lazy dog.",
      "Embed me too."
    ]
  }'
Response
{
  "object": "list",
  "model": "bge-m3",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034] },
    { "object": "embedding", "index": 1, "embedding": [0.057, 0.018] }
  ],
  "usage": { "prompt_tokens": 24, "total_tokens": 24 }
}
Only prompt_tokens is billed (output is the vector itself, not completion tokens). See the embedding-model table in Models & rates.
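Client-side batching for the 256 KB body cap could look like the sketch below. It groups inputs greedily so each request body stays under the cap. The helper name is hypothetical, and reading 256 KB as 256 * 1024 bytes is an assumption. Sizes are measured on Python's default JSON encoding, which tends to overestimate what a compact encoder would send.

```python
import json
from typing import Iterator, List

BODY_CAP_BYTES = 256 * 1024  # assumption: "256 KB" means 256 * 1024 bytes


def embedding_batches(texts: List[str], model: str,
                      cap: int = BODY_CAP_BYTES) -> Iterator[dict]:
    """Yield request bodies whose JSON encoding stays within `cap` bytes."""

    def size(batch: List[str]) -> int:
        return len(json.dumps({"model": model, "input": batch}).encode("utf-8"))

    batch: List[str] = []
    for text in texts:
        if size([text]) > cap:
            raise ValueError("a single input exceeds the body cap; split it first")
        if batch and size(batch + [text]) > cap:
            yield {"model": model, "input": batch}  # flush the full batch
            batch = []
        batch.append(text)
    if batch:
        yield {"model": model, "input": batch}
```

Each yielded dict can be posted to /v1/embeddings as-is. The index values in each response restart at 0 per request, so keep track of the batch offsets when stitching vectors back together.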
POST /v1/cost-preview
Preflight the USD cost of a chat completion without calling the upstream model. Use this to surface cost estimates in your UI, gate expensive prompts, or budget-check before a real call.
Request
curl https://api.llmdeal.me/v1/cost-preview \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route",
    "messages": [
      {"role": "user", "content": "Explain Raft consensus in 200 words."}
    ],
    "max_tokens": 300
  }'
Response
{
  "ok": true,
  "model": "smart-route",
  "input_tokens": 11,
  "est_output_tokens": { "low": 60, "expected": 180, "high": 300 },
  "est_cost_usd": {
    "low": 0.0000573,
    "expected": 0.0001653,
    "high": 0.0002733
  },
  "breakdown": { "input_usd": 0.0000033, "output_usd": 0.000162 },
  "rate_per_1m": { "input": 0.30, "output": 0.90 },
  "currency": "usd",
  "estimator": "tiktoken-cl100k-base"
}
Notes
- Body shape is the same as /v1/chat/completions minus inference-only fields. Only model, messages, and optional max_tokens are read.
- When max_tokens is provided, the high band equals max_tokens; when omitted, it's capped at 4096 or 2x the input, whichever is smaller.
- byo-* models return est_cost_usd: 0 with a note field explaining the BYO routing.
- Unknown models return 404 model_not_found even though the rate-table would silently default; this matches the chat-completions tier-gate.
- 16 KB request body cap.
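The arithmetic behind the sample response can be reproduced locally. The sketch below implements the documented rule for the high band and the per-1M cost formula. The function names and the rounding to 7 decimals are assumptions, and the low and expected token counts are taken from the sample because their heuristic is not documented here.

```python
from typing import Optional


def high_band(input_tokens: int, max_tokens: Optional[int] = None) -> int:
    """Upper bound on output tokens, per the notes above."""
    if max_tokens is not None:
        return max_tokens
    return min(4096, 2 * input_tokens)


def est_cost_usd(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """USD estimate from token counts and per-1M rates (7-decimal rounding assumed)."""
    usd = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return round(usd, 7)


# Sample request: 11 input tokens on smart-route (0.30 in / 0.90 out), max_tokens=300.
bands = {"low": 60, "expected": 180, "high": high_band(11, max_tokens=300)}
estimates = {name: est_cost_usd(11, tokens, 0.30, 0.90) for name, tokens in bands.items()}
```

With these inputs, estimates matches the est_cost_usd object in the sample response: 0.0000573, 0.0001653, and 0.0002733.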
GET /v1/cost-preview/rates
Returns the full per-1M-token rate table the gateway is using right now, the chat-model catalog, and the subset of aliases your tier can call. Cached 60 s.
curl https://api.llmdeal.me/v1/cost-preview/rates \
  -H "Authorization: Bearer $LLMDEAL_KEY"
{
  "ok": true,
  "currency": "usd",
  "default_rate": { "input": 0.4, "output": 1.2 },
  "rates": [
    { "model": "bge-m3", "input_per_1m": 0.01, "output_per_1m": 0.00 },
    { "model": "codestral", "input_per_1m": 0.30, "output_per_1m": 0.90 },
    { "model": "consortium-llama-70b", "input_per_1m": 0.15, "output_per_1m": 0.45 },
    { "model": "llama-3.3-70b", "input_per_1m": 0.20, "output_per_1m": 0.60 },
    { "model": "smart-route-fast", "input_per_1m": 0.15, "output_per_1m": 0.45 }
  ],
  "catalog": [
    { "id": "llama-70b", "display_name": "Llama 3.3 70B", "tiers": ["free-trial","starter","pro","founder","charter"], "context_window": 128000 }
  ],
  "accessible_to_caller": ["llama-70b","llama-8b","gemini-flash","smart-route","smart-route-fast"],
  "caller_tier": "starter",
  "note": "Unknown models fall back to the default rate (conservative)."
}
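A client that caches this payload can resolve rates locally. The sketch below looks up one model and falls back to default_rate for anything not listed, mirroring the note field in the response. The helper name is hypothetical, and the payload is a trimmed copy of the sample.

```python
from typing import Tuple


def rate_for(rates_payload: dict, model: str) -> Tuple[float, float]:
    """Return (input, output) USD per 1M tokens; unknown models get default_rate."""
    for row in rates_payload.get("rates", []):
        if row["model"] == model:
            return (row["input_per_1m"], row["output_per_1m"])
    default = rates_payload["default_rate"]
    return (default["input"], default["output"])


payload = {
    "default_rate": {"input": 0.4, "output": 1.2},
    "rates": [
        {"model": "bge-m3", "input_per_1m": 0.01, "output_per_1m": 0.00},
        {"model": "smart-route-fast", "input_per_1m": 0.15, "output_per_1m": 0.45},
    ],
}
```

Since the endpoint is cached for 60 s on the server, refreshing a local copy more often than that is unlikely to return anything new.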
GET /v1/models
OpenAI-shape model list, filtered to the aliases your tier can call. Use this for tool integrations that probe model availability at runtime.
curl https://api.llmdeal.me/v1/models \
  -H "Authorization: Bearer $LLMDEAL_KEY"
{
  "object": "list",
  "data": [
    {
      "id": "llama-70b",
      "object": "model",
      "created": 1747353600,
      "owned_by": "llmdeal",
      "context_window": 128000,
      "tags": ["chat", "fast"]
    },
    {
      "id": "smart-route-fast",
      "object": "model",
      "created": 1747353600,
      "owned_by": "llmdeal",
      "context_window": 131072,
      "tags": ["auto", "fast"]
    }
  ]
}
The owned_by field is always llmdeal; upstream provider names are intentionally not surfaced.
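A tool that probes availability at runtime might walk a preference list against this response. A minimal sketch, with a hypothetical helper name and a payload reduced to the id fields from the sample:

```python
from typing import List, Optional


def first_available(models_payload: dict, preferred: List[str]) -> Optional[str]:
    """Return the first id in `preferred` that appears in a /v1/models response."""
    available = {entry["id"] for entry in models_payload.get("data", [])}
    for model_id in preferred:
        if model_id in available:
            return model_id
    return None


models_payload = {
    "object": "list",
    "data": [{"id": "llama-70b"}, {"id": "smart-route-fast"}],
}
choice = first_available(models_payload, ["smart-route-coder", "smart-route-fast"])
```

Here choice is "smart-route-fast", because smart-route-coder is absent from the filtered list.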
Rate limits
Per-account gateway limits
| Account | Sustained | Burst |
|---|---|---|
| Standard | 120 req / minute | 25 req / 10 s |
| Demo / free-trial | 1200 req / minute | 250 req / 10 s |
Demo accounts get 10x headroom so client-library smoke tests don't trip the limiter during onboarding. The token bucket is per lld_ key, not per IP.
Per-tier upstream limits
On top of the gateway limits, every lld_ key carries a per-tier upstream TPM/RPM cap set
inside LiteLLM. Hitting the upstream cap surfaces as 429 rate_limited the same as the
gateway cap; the response carries a Retry-After header in seconds. The exact tier
caps are visible on your dashboard.
On a 429, wait the number of seconds given in Retry-After before retrying. The OpenAI Python and Node SDKs do
this automatically when you set max_retries >= 1.
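For clients that do not use an SDK, the wait time after a 429 can be computed as in the sketch below. It honors Retry-After when the header holds a number of seconds. The fallback to exponential backoff with jitter is an assumption of this example, not documented gateway behaviour, and the function name is hypothetical.

```python
import random
from typing import Mapping, Optional


def retry_delay_seconds(headers: Mapping[str, str], attempt: int,
                        base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    retry_after: Optional[str] = None
    for name, value in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            retry_after = value
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds; fall through to backoff
    # Fallback (assumption): exponential backoff with full jitter, capped at `cap`.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would sleep for the returned value, retry the request, and give up after a fixed number of attempts.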
Errors
All errors return the OpenAI envelope: { "error": { "code", "message", "type" } }.
| HTTP | code | Meaning | What to do |
|---|---|---|---|
| 401 | invalid_api_key | Auth header missing, malformed, or key revoked. | Check the Bearer lld_<32hex> format; re-mint from dashboard if revoked. |
| 404 | model_not_available | The alias isn't in the catalog at all. | Compare against GET /v1/models or GET /v1/cost-preview/rates. |
| 403 | model_not_in_tier | Alias exists, but your tier can't call it (e.g. Starter requesting qwen-coder). | Upgrade tier, or pick a model in accessible_to_caller from /v1/cost-preview/rates. |
| 503 | no_consortium_machine_available | You called a consortium-* alias and no partner GPU is currently healthy. | Retry with the catalog-routed sibling (e.g. llama-3.3-70b instead of consortium-llama-70b), or with smart-route. |
| 503 | upstream_circuit_open | llmdeal opened a circuit breaker on this upstream after repeated failures. | Retry with smart-route (it picks a different upstream), or wait ~60 s. |
| 429 | rate_limited | Gateway or upstream rate-limit hit. | Back off per the Retry-After header. See Rate limits. |
| 402 | quota_exhausted | Your balance or budget cap is below the estimated cost of this request. | Top up at /buy.html, or raise your per-key cap on the dashboard. |
{
  "error": {
    "code": "model_not_in_tier",
    "message": "Your tier (starter) cannot call 'qwen-coder'. Upgrade to pro+ or use smart-route.",
    "type": "invalid_request_error"
  }
}
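Because every error carries a code, a client can branch on it rather than on the HTTP status alone. The sketch below maps the codes from the table to a coarse next step. The action labels and the function name are invented for this example, and unknown or unparsable bodies are left for manual inspection.

```python
import json
from typing import Tuple

# Coarse reactions paraphrased from the "What to do" column; labels are illustrative.
_ACTIONS = {
    "invalid_api_key": "fix_credentials",
    "model_not_available": "fix_model_name",
    "model_not_in_tier": "pick_accessible_model",
    "no_consortium_machine_available": "retry_with_fallback_model",
    "upstream_circuit_open": "retry_with_fallback_model",
    "rate_limited": "retry_after_backoff",
    "quota_exhausted": "top_up_balance",
}


def classify_error(body: str) -> Tuple[str, str]:
    """Return (code, suggested_action) for an error response body."""
    try:
        code = json.loads(body)["error"]["code"]
    except (ValueError, KeyError, TypeError):
        return ("unknown", "inspect_manually")
    if not isinstance(code, str):
        return ("unknown", "inspect_manually")
    return (code, _ACTIONS.get(code, "inspect_manually"))


sample_body = '{"error": {"code": "model_not_in_tier", "message": "...", "type": "invalid_request_error"}}'
result = classify_error(sample_body)
```

For the sample body, result is ("model_not_in_tier", "pick_accessible_model").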
SDK compatibility
llmdeal speaks OpenAI's wire format on /v1/chat/completions, /v1/embeddings,
and /v1/models. Any client library that accepts a base_url override
works unchanged; just point it at https://api.llmdeal.me/v1 and pass your lld_ key.
openai-python
from openai import OpenAI

client = OpenAI(
    api_key="lld_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.llmdeal.me/v1",
)

resp = client.chat.completions.create(
    model="smart-route",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
openai-node
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.LLMDEAL_KEY,
  baseURL: 'https://api.llmdeal.me/v1',
});

const resp = await client.chat.completions.create({
  model: 'smart-route',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(resp.choices[0].message.content);
LangChain (Python)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="smart-route",
    api_key="lld_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.llmdeal.me/v1",
)
print(llm.invoke("Hello!").content)
Instructor
import instructor
from openai import OpenAI
from pydantic import BaseModel


# Schema the model's answer is parsed into.
class Answer(BaseModel):
    summary: str


client = instructor.from_openai(OpenAI(
    api_key="lld_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.llmdeal.me/v1",
))

answer = client.chat.completions.create(
    model="smart-route",
    response_model=Answer,
    messages=[{"role": "user", "content": "What is Raft consensus?"}],
)
print(answer.summary)
Each wrapper library above builds on the openai SDK and is not formally tested against every release.
Wire-level behaviour (auth, request shape, response shape, SSE) is identical to OpenAI; any
divergence is a bug to file.
Other tools that work the same way (set base_url + api_key): Cursor, PearAI, Continue.dev, Cline, Roo Code, Zed, Aider, opencode, OpenHands, Open Interpreter, OpenWebUI, LibreChat, Tabby. Step-by-step configs at /setup.html.