API Reference · v1 · OpenAI-compatible

llmdeal API reference.

Drop-in OpenAI replacement. Same request shape, same response shape, same client libraries. Bearer auth with lld_ keys, JSON in, JSON or SSE out. 16+ models behind one base URL; let smart-route pick the cheapest capable lane, or pin any model yourself.

Quick start

One Bearer header. One POST. Standard OpenAI body. Sign up at /buy.html; the dashboard mints your lld_ key.

Base URL

Either host serves the gateway; pick whichever your tooling prefers:

https://api.llmdeal.me/v1
https://llmdeal.me/v1

Authentication

Every /v1/* request requires:

http
Authorization: Bearer lld_<your-32-hex-token>

Keys are 36 characters total (the lld_ prefix plus 32 lowercase hex characters). A missing or malformed key returns 401 invalid_api_key.
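
A client can catch a malformed key before it costs a round trip. The sketch below checks only the documented shape (lld_ plus 32 lowercase hex); it cannot tell whether a key is revoked. The helper names looks_like_llmdeal_key and auth_header are illustrative, not part of any llmdeal SDK.

```python
import re

# Documented shape: "lld_" prefix followed by 32 lowercase hex characters.
_KEY_RE = re.compile(r"lld_[0-9a-f]{32}")


def looks_like_llmdeal_key(key: str) -> bool:
    """Return True when `key` matches the documented 36-character shape."""
    return _KEY_RE.fullmatch(key) is not None


def auth_header(key: str) -> dict:
    """Build the Authorization header, refusing obviously malformed keys."""
    if not looks_like_llmdeal_key(key):
        raise ValueError("expected lld_ followed by 32 lowercase hex characters")
    return {"Authorization": f"Bearer {key}"}
```

A key that passes this check can still be rejected with 401 invalid_api_key if it has been revoked.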

First call

curl · chat completion
curl https://api.llmdeal.me/v1/chat/completions \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route-fast",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

The response is the standard OpenAI chat-completion shape; see /v1/chat/completions below.

Models & rates

Set "model" to any alias below. Rates are USD per 1M tokens, billed against your balance at exactly the rates shown. byo-* aliases route to a customer-owned upstream key and are zero-rated on llmdeal's side.

Smart-route aliases

Load-balancer aliases that pick the cheapest healthy upstream in their lane:

| Alias | Input $/1M | Output $/1M | Lane | Tier |
| --- | --- | --- | --- | --- |
| smart-route-fast | 0.15 | 0.45 | llama-3.1-8b · gemini-2.5-flash | all |
| smart-route | 0.30 | 0.90 | llama-3.3-70b group (default) | all |
| smart-route-large | 0.35 | 1.05 | nemotron-120b · qwen3-235b · gpt-oss-120b | pro+ |
| smart-route-coder | 0.40 | 1.20 | qwen3-coder-480b · codestral | pro+ |

Pinned open-weight models

| Alias | Input $/1M | Output $/1M | Notes | Tier |
| --- | --- | --- | --- | --- |
| llama-3.1-8b-instant | 0.10 | 0.30 | Groq · 131k ctx · fast workhorse | all |
| llama-3.3-70b | 0.20 | 0.60 | 128k ctx · general chat | all |
| gemini-2.5-flash | 0.15 | 0.45 | Google · 1M ctx · multimodal | all |
| codestral | 0.30 | 0.90 | Mistral · 256k ctx · code-specialist | pro+ |
| qwen3-235b-cerebras | 0.40 | 1.20 | Cerebras · ~2000 tok/s · 131k ctx | pro+ |
| gpt-oss-120b | 0.30 | 0.90 | 128k ctx · reasoning | pro+ |

NVIDIA-routed models

| Alias | Input $/1M | Output $/1M | Notes | Tier |
| --- | --- | --- | --- | --- |
| nvidia/llama-3.3-70b | 0.20 | 0.60 | NVIDIA NIM · 128k ctx | all |
| nvidia/qwen3-coder-480b | 0.40 | 1.20 | NVIDIA NIM · 131k ctx · code | pro+ |
| nvidia/nemotron-super-120b | 0.30 | 0.90 | NVIDIA NIM · 128k ctx · reasoning | pro+ |
| nvidia/llama-4-maverick | 0.40 | 1.20 | NVIDIA NIM | pro+ |

Consortium-routed models

Routed to revenue-share partner GPUs; priced strictly under the closest catalog sibling so you have an incentive to opt in.

| Alias | Input $/1M | Output $/1M | Notes | Tier |
| --- | --- | --- | --- | --- |
| consortium-llama-70b | 0.15 | 0.45 | Partner GPU · llama-3.3-70b | all |
| consortium-nemotron-120b | 0.22 | 0.66 | Partner GPU · nemotron-super-120b | pro+ |
| consortium-qwen-coder | 0.30 | 0.90 | Partner GPU · qwen3-coder-480b | pro+ |
| consortium-qwen-235b | 0.30 | 0.90 | Partner GPU · qwen3-235b | pro+ |

Frontier escalation (Pro+ add-on)

Reserved aliases for hand-off to a frontier vendor when smart-route flags low confidence. Billed against Frontier Credits, not your base balance; rates are passed through.

| Alias | Vendor | Notes |
| --- | --- | --- |
| claude-sonnet-4-6-frontier | Anthropic Sonnet 4.6 | pass-through retail rate |
| claude-opus-4-7-frontier | Anthropic Opus 4.7 | pass-through retail rate |
| gpt-5.5-frontier | OpenAI GPT-5.5 | pass-through retail rate |

Embedding models

| Alias | Input $/1M | Vector dim | Notes |
| --- | --- | --- | --- |
| nv-embed-qa | 0.02 | 1024 | NVIDIA NIM · QA-tuned |
| bge-m3 | 0.01 | 1024 | multilingual · cheapest tier |
| mistral-embed | 0.10 | 1024 | Mistral |

BYO ("bring your own key") models

Aliases prefixed byo- route to a customer-owned upstream key registered in your dashboard; llmdeal bills nothing for them (0.00 / 0.00 per 1M). Use these when you already pay an upstream vendor and want them to flow through the same gateway, key, audit log, and budget controls as everything else.

Examples: byo-openai-gpt-4o, byo-anthropic-sonnet, byo-groq-llama-70b. The exact alias list is whatever you wire on the dashboard.

Live rate table. Call GET /v1/cost-preview/rates (see below) to fetch the exact table the gateway is using right now, including which aliases your tier can actually call. Unknown aliases fall back to the conservative default rate of 0.40 / 1.20.
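
The per-1M rates translate to a per-request cost with one multiplication per side. The sketch below is a client-side estimate under stated assumptions: RATES copies a handful of rows from the tables above (fetch the live table for real use), and cost_usd is a hypothetical helper, not a gateway API. It mirrors the documented default rate for unknown aliases and the zero rating of byo-* aliases.

```python
# A few rows copied from the tables above: (input, output) USD per 1M tokens.
RATES = {
    "smart-route-fast": (0.15, 0.45),
    "smart-route": (0.30, 0.90),
    "llama-3.3-70b": (0.20, 0.60),
    "bge-m3": (0.01, 0.00),  # embeddings: only input tokens are billed
}
DEFAULT_RATE = (0.40, 1.20)  # documented fallback for unknown aliases


def cost_usd(model: str, prompt_tokens: int, completion_tokens: int = 0) -> float:
    """Estimate the USD cost of one request from its token counts."""
    if model.startswith("byo-"):
        return 0.0  # BYO aliases are zero-rated on llmdeal's side
    rate_in, rate_out = RATES.get(model, DEFAULT_RATE)
    return (prompt_tokens * rate_in + completion_tokens * rate_out) / 1_000_000
```

For example, cost_usd("smart-route", 1_000_000, 1_000_000) comes to the table's 0.30 + 0.90 = 1.20 USD.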

POST /v1/chat/completions

Standard OpenAI chat-completion body. JSON only, Content-Type: application/json required. Body cap: 256 KB.

POST /v1/chat/completions

Non-streaming

curl
curl https://api.llmdeal.me/v1/chat/completions \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route",
    "messages": [
      {"role": "system", "content": "You are a terse assistant."},
      {"role": "user",   "content": "Summarise this PR in one sentence."}
    ],
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response

json · response
{
  "id": "chatcmpl-9f2c...",
  "object": "chat.completion",
  "created": 1747200000,
  "model": "llama-3.3-70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Refactored the loop into a single-pass reduce."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens":     412,
    "completion_tokens": 87,
    "total_tokens":      499
  }
}

Streaming

Set "stream": true to receive Server-Sent Events. The format is wire-identical to OpenAI's streaming response; OpenAI SDKs parse it natively.

curl · streaming
curl -N https://api.llmdeal.me/v1/chat/completions \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream a haiku."}]
  }'

# Server-Sent Events:
# data: {"id":"chatcmpl-9f2c","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant","content":"Refactored"}}]}
# data: {"id":"chatcmpl-9f2c","object":"chat.completion.chunk","choices":[{"delta":{"content":" the loop"}}]}
# data: {"id":"chatcmpl-9f2c","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":5,"total_tokens":17}}
# data: [DONE]
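
If you are not using an OpenAI SDK, the stream can be reassembled by hand. This is a minimal sketch that takes already-decoded text lines (from whatever HTTP client you use) and joins the delta.content fragments until the [DONE] sentinel; collect_stream_text is an illustrative name, and the sketch ignores tool-call deltas and multi-line SSE events.

```python
import json
from typing import Iterable, List


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE lines until [DONE]."""
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

Fed the three data lines from the sample above, it returns "Refactored the loop".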

Supported parameters

  • model · required · any alias from Models & rates
  • messages · required · array of { role, content }; role is system, user, assistant, or tool
  • temperature, top_p, max_tokens · standard OpenAI semantics
  • stream · SSE when true
  • tools, tool_choice · OpenAI-compatible wire format; upstream support varies per model
  • stop, presence_penalty, frequency_penalty, seed, response_format · passed through to upstream

Concealment. Upstream provider names never appear in response headers or body. The Server header is llmdeal-gateway/1.0; per-request trace IDs surface as x-llmdeal-request-id.

POST /v1/embeddings

OpenAI-compatible embeddings passthrough. Body cap: 256 KB; batch large RAG inputs client-side.

POST /v1/embeddings

curl
curl https://api.llmdeal.me/v1/embeddings \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-m3",
    "input": [
      "The quick brown fox jumps over the lazy dog.",
      "Embed me too."
    ]
  }'

Response

json · response
{
  "object": "list",
  "model": "bge-m3",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034] },
    { "object": "embedding", "index": 1, "embedding": [0.057,  0.018] }
  ],
  "usage": { "prompt_tokens": 24, "total_tokens": 24 }
}

Only prompt_tokens is billed (output is the vector itself, not completion tokens). See the embedding-model table in Models & rates.
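
Client-side batching for the 256 KB body cap can be sketched as a greedy packer. Assumptions: KB means 1024 bytes, the body is serialized with json.dumps exactly as measured here, and batch_inputs is a hypothetical helper name. Re-measuring the whole batch on every step is quadratic but keeps the sketch short.

```python
import json
from typing import List

BODY_CAP_BYTES = 256 * 1024  # documented cap; assumes KB = 1024 bytes


def batch_inputs(texts: List[str], model: str, cap: int = BODY_CAP_BYTES) -> List[List[str]]:
    """Greedily group texts so each JSON request body stays within `cap` bytes."""

    def body_size(batch: List[str]) -> int:
        # Size of the exact body this batch would be sent as.
        return len(json.dumps({"model": model, "input": batch}).encode("utf-8"))

    batches: List[List[str]] = []
    current: List[str] = []
    for text in texts:
        if body_size([text]) > cap:
            raise ValueError("a single input exceeds the body cap; split it first")
        if current and body_size(current + [text]) > cap:
            batches.append(current)  # adding this text would overflow: close the batch
            current = []
        current.append(text)
    if current:
        batches.append(current)
    return batches
```

Each returned batch becomes the "input" array of one POST /v1/embeddings call; input order is preserved across batches.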

POST /v1/cost-preview

Preflight the USD cost of a chat completion without calling the upstream model. Use this to surface cost estimates in your UI, gate expensive prompts, or budget-check before a real call.

POST /v1/cost-preview

Request

curl
curl https://api.llmdeal.me/v1/cost-preview \
  -H "Authorization: Bearer $LLMDEAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smart-route",
    "messages": [
      {"role": "user", "content": "Explain Raft consensus in 200 words."}
    ],
    "max_tokens": 300
  }'

Response

json · response
{
  "ok": true,
  "model": "smart-route",
  "input_tokens": 11,
  "est_output_tokens": { "low": 60, "expected": 180, "high": 300 },
  "est_cost_usd": {
    "low":      0.0000573,
    "expected": 0.0001653,
    "high":     0.0002733
  },
  "breakdown":   { "input_usd": 0.0000033, "output_usd": 0.000162 },
  "rate_per_1m": { "input": 0.30, "output": 0.90 },
  "currency": "usd",
  "estimator": "tiktoken-cl100k-base"
}

Notes

  • Body shape is the same as /v1/chat/completions minus inference-only fields. Only model, messages, and optional max_tokens are read.
  • When max_tokens is provided, the high band equals max_tokens; when omitted, the high band is the smaller of 4096 and 2x the input token count.
  • byo-* models return est_cost_usd: 0 with a note field explaining the BYO routing.
  • Unknown models return 404 model_not_found rather than silently using the default rate from the rate table; this matches the tier gate on /v1/chat/completions.
  • 16 KB request body cap.
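
The numbers in the sample response can be reproduced by hand. The sketch below is not the gateway's code: high_band follows the rule in the notes above, and band_costs takes the low and expected token bands as inputs because how the gateway derives them is not documented here. Both function names are illustrative.

```python
from typing import Dict, Optional


def high_band(input_tokens: int, max_tokens: Optional[int] = None) -> int:
    """Upper bound on output tokens, per the rule in the notes above."""
    if max_tokens is not None:
        return max_tokens
    return min(4096, 2 * input_tokens)


def band_costs(input_tokens: int, bands: Dict[str, int],
               rate_in: float, rate_out: float) -> Dict[str, float]:
    """USD cost per band: fixed input cost plus that band's output cost."""
    input_usd = input_tokens * rate_in / 1_000_000
    return {
        name: round(input_usd + out_tokens * rate_out / 1_000_000, 7)
        for name, out_tokens in bands.items()
    }
```

With the sample's 11 input tokens, bands of 60 / 180 / 300 and the smart-route rates of 0.30 / 0.90, band_costs gives 0.0000573, 0.0001653 and 0.0002733 USD, matching est_cost_usd above.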

GET /v1/cost-preview/rates

Returns the full per-1M-token rate table the gateway is using right now, the chat-model catalog, and the subset of aliases your tier can call. Cached 60 s.

GET /v1/cost-preview/rates

curl
curl https://api.llmdeal.me/v1/cost-preview/rates \
  -H "Authorization: Bearer $LLMDEAL_KEY"

json · response (truncated)
{
  "ok": true,
  "currency": "usd",
  "default_rate": { "input": 0.4, "output": 1.2 },
  "rates": [
    { "model": "bge-m3",                 "input_per_1m": 0.01, "output_per_1m": 0.00 },
    { "model": "codestral",              "input_per_1m": 0.30, "output_per_1m": 0.90 },
    { "model": "consortium-llama-70b",   "input_per_1m": 0.15, "output_per_1m": 0.45 },
    { "model": "llama-3.3-70b",          "input_per_1m": 0.20, "output_per_1m": 0.60 },
    { "model": "smart-route-fast",       "input_per_1m": 0.15, "output_per_1m": 0.45 }
  ],
  "catalog": [
    { "id": "llama-70b", "display_name": "Llama 3.3 70B", "tiers": ["free-trial","starter","pro","founder","charter"], "context_window": 128000 }
  ],
  "accessible_to_caller": ["llama-70b","llama-8b","gemini-flash","smart-route","smart-route-fast"],
  "caller_tier": "starter",
  "note": "Unknown models fall back to the default rate (conservative)."
}

GET /v1/models

OpenAI-shape model list, filtered to the aliases your tier can call. Use this for tool integrations that probe model availability at runtime.

GET /v1/models

curl
curl https://api.llmdeal.me/v1/models \
  -H "Authorization: Bearer $LLMDEAL_KEY"

json · response (truncated)
{
  "object": "list",
  "data": [
    {
      "id": "llama-70b",
      "object": "model",
      "created": 1747353600,
      "owned_by": "llmdeal",
      "context_window": 128000,
      "tags": ["chat", "fast"]
    },
    {
      "id": "smart-route-fast",
      "object": "model",
      "created": 1747353600,
      "owned_by": "llmdeal",
      "context_window": 131072,
      "tags": ["auto", "fast"]
    }
  ]
}

The owned_by field is always llmdeal; upstream provider names are intentionally not surfaced.
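
A runtime availability probe reduces to a set lookup over the returned ids. In this sketch pick_available is a hypothetical helper that takes the parsed /v1/models JSON and an ordered preference list, and returns the first id your tier can call.

```python
from typing import List, Optional


def pick_available(models_payload: dict, preferred: List[str]) -> Optional[str]:
    """Return the first preferred model id present in a /v1/models payload."""
    available = {entry["id"] for entry in models_payload.get("data", [])}
    for model_id in preferred:
        if model_id in available:
            return model_id
    return None  # nothing in the preference list is callable on this tier
```

Because the list is already filtered by tier, a None result means every preferred model is either unknown or outside your tier.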

Rate limits

Per-account gateway limits

| Account | Sustained | Burst |
| --- | --- | --- |
| Standard | 120 req / minute | 25 req / 10 s |
| Demo / free-trial | 1200 req / minute | 250 req / 10 s |

Demo accounts get 10x headroom so client-library smoke tests don't trip the limiter during onboarding. The token bucket is per lld_ key, not per IP.

Per-tier upstream limits

On top of the gateway limits, every lld_ key carries a per-tier upstream TPM/RPM cap set inside LiteLLM. Hitting the upstream cap surfaces as 429 rate_limited the same as the gateway cap; the response carries a Retry-After header in seconds. The exact tier caps are visible on your dashboard.

429 backoff. Respect Retry-After. The OpenAI Python and Node SDKs retry 429s automatically and honor Retry-After when max_retries (maxRetries in Node) is at least 1; both default to 2.
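
For clients without a retrying SDK, the backoff can be sketched as a small loop. Assumptions: send is a hypothetical callable you supply that performs one request and returns (status, headers, body), and the 1-second default applies only when Retry-After is missing or unparsable.

```python
import time
from typing import Any, Callable, Mapping, Tuple


def retry_delay(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds; fall back to `default` if absent or invalid."""
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return max(0.0, float(lowered.get("retry-after")))
    except (TypeError, ValueError):
        return default


def call_with_backoff(
    send: Callable[[], Tuple[int, Mapping[str, str], Any]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call `send`, sleeping per Retry-After on 429, up to `max_retries` retries."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_retries:
            return status, body
        sleep(retry_delay(headers))
    raise ValueError("max_retries must be >= 0")
```

The sleep parameter is injectable so the loop can be exercised in tests without real waiting.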

Errors

All errors return the OpenAI envelope: { "error": { "code", "message", "type" } }.

| HTTP | code | Meaning | What to do |
| --- | --- | --- | --- |
| 401 | invalid_api_key | Auth header missing, malformed, or key revoked. | Check the Bearer lld_<32hex> format; re-mint from dashboard if revoked. |
| 404 | model_not_available | The alias isn't in the catalog at all. | Compare against GET /v1/models or GET /v1/cost-preview/rates. |
| 403 | model_not_in_tier | Alias exists, but your tier can't call it (e.g. Starter requesting qwen-coder). | Upgrade tier, or pick a model in accessible_to_caller from /v1/cost-preview/rates. |
| 503 | no_consortium_machine_available | You called a consortium-* alias and no partner GPU is currently healthy. | Retry with the catalog-routed sibling (e.g. llama-3.3-70b instead of consortium-llama-70b), or with smart-route. |
| 503 | upstream_circuit_open | llmdeal opened a circuit breaker on this upstream after repeated failures. | Retry with smart-route (it picks a different upstream), or wait ~60 s. |
| 429 | rate_limited | Gateway or upstream rate-limit hit. | Back off per the Retry-After header. See Rate limits. |
| 402 | quota_exhausted | Your balance or budget cap is below the estimated cost of this request. | Top up at /buy.html, or raise your per-key cap on the dashboard. |

json · error shape
{
  "error": {
    "code":    "model_not_in_tier",
    "message": "Your tier (starter) cannot call 'qwen-coder'. Upgrade to pro+ or use smart-route.",
    "type":    "invalid_request_error"
  }
}
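
Because every error shares one envelope, client handling can branch on error.code alone. The sketch below maps each documented code to a next step; the action labels are made-up names for illustration, paraphrasing the "What to do" column, and next_step is a hypothetical helper.

```python
import json
from typing import Tuple

# Illustrative action labels, one per documented error code.
ACTIONS = {
    "invalid_api_key": "fix_credentials",
    "model_not_available": "choose_listed_model",
    "model_not_in_tier": "choose_accessible_model",
    "no_consortium_machine_available": "retry_with_fallback_model",
    "upstream_circuit_open": "retry_with_fallback_model",
    "rate_limited": "wait_retry_after",
    "quota_exhausted": "top_up_balance",
}


def next_step(body: str) -> Tuple[str, str]:
    """Return (code, action) for an error body in the OpenAI envelope."""
    error = json.loads(body).get("error", {})
    code = error.get("code", "unknown")
    # Codes not in the table are surfaced rather than retried blindly.
    return code, ACTIONS.get(code, "surface_to_user")
```

Applied to the sample error above, it returns ("model_not_in_tier", "choose_accessible_model").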

SDK compatibility

llmdeal speaks OpenAI's wire format on /v1/chat/completions, /v1/embeddings, and /v1/models. Any client library that accepts a base_url override works unchanged; just point it at https://api.llmdeal.me/v1 and pass your lld_ key.

openai-python

python
from openai import OpenAI

client = OpenAI(
    api_key="lld_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.llmdeal.me/v1",
)

resp = client.chat.completions.create(
    model="smart-route",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

openai-node

typescript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey:  process.env.LLMDEAL_KEY,
  baseURL: 'https://api.llmdeal.me/v1',
});

const resp = await client.chat.completions.create({
  model:    'smart-route',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(resp.choices[0].message.content);

LangChain (Python)

python · compatible (OpenAI-spec)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="smart-route",
    api_key="lld_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.llmdeal.me/v1",
)
print(llm.invoke("Hello!").content)

Instructor

python · compatible (OpenAI-spec)
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Answer(BaseModel):
    summary: str

client = instructor.from_openai(OpenAI(
    api_key="lld_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.llmdeal.me/v1",
))

answer = client.chat.completions.create(
    model="smart-route",
    response_model=Answer,
    messages=[{"role": "user", "content": "What is Raft consensus?"}],
)
print(answer.summary)

Note. LangChain and Instructor compatibility is based on OpenAI-spec compliance of the underlying openai SDK and is not formally tested against every release. Wire-level behaviour (auth, request shape, response shape, SSE) is identical to OpenAI; any divergence is a bug to file.

Other tools that work the same way (set base_url + api_key): Cursor, PearAI, Continue.dev, Cline, Roo Code, Zed, Aider, opencode, OpenHands, Open Interpreter, OpenWebUI, LibreChat, Tabby. Step-by-step configs at /setup.html.
