LLM Routing¶
LLM routing is platform infrastructure: model choice, response caching, usage tracking, and cost guardrails. Workloads call the router rather than vendors directly.
flowchart LR
W[Workloads] --> R[LLM router]
R --> C{Cache hit?}
C -->|yes| OUT[Return cached]
C -->|no| P[Provider]
P --> R
R --> AUD[Async usage pipeline]
Model tiers¶
Typical tiers include local (fast / on-device), premium (higher capability), and code-oriented models. Task type and priority steer the default tier; overrides can be expressed per task where the PRD allows.
Backends¶
Multiple provider adapters exist (cloud APIs, local inference stacks). Usage metadata (tokens, cost) is recorded when the adapter exposes it.
Caching¶
Identical requests can be served from a shared cache with a long TTL so repeat prompts don’t re-hit providers. Cached entries preserve usage fields where applicable so dashboards stay consistent.
Usage audit¶
Usage is written asynchronously so the hot path stays within latency targets (PRD §23, §25). Durable cost and token records feed billing and alerts.
Cost controls¶
| Theme | Idea |
|---|---|
| Quotas | Per-agent or per-day limits |
| Budget | Hard caps with approval to exceed |
| Cache | Reduce duplicate spend |
| Degrade | Temporarily restrict premium models on spike |
| Concurrency cap | Limit simultaneous provider calls |
Observability¶
Cost and token metrics support dashboards and the LLM cost spike runbook. Exact metric names: PRD §17.