LLM Routing

LLM routing is platform infrastructure: model choice, response caching, usage tracking, and cost guardrails. Workloads call the router rather than vendors directly.

```mermaid
flowchart LR
  W[Workloads] --> R[LLM router]
  R --> C{Cache hit?}
  C -->|yes| OUT[Return cached]
  C -->|no| P[Provider]
  P --> R
  R --> AUD[Async usage pipeline]
```
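The flow above can be sketched as a small router facade. This is a minimal sketch, not the actual implementation; the class and parameter names (`Router`, `cache`, `provider`, `audit_queue`) are hypothetical:

```python
import hashlib
import json

def _cache_key(request: dict) -> str:
    """Stable key over the canonical JSON form of the request."""
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

class Router:
    def __init__(self, cache, provider, audit_queue):
        self.cache = cache          # dict-like store (could be Redis-like in practice)
        self.provider = provider    # callable: request dict -> response dict
        self.audit = audit_queue    # async usage pipeline stand-in (list here)

    def complete(self, request: dict) -> dict:
        key = _cache_key(request)
        cached = self.cache.get(key)
        if cached is not None:
            return cached           # cache hit: provider is never called
        response = self.provider(request)
        self.cache[key] = response  # TTL handling omitted in this sketch
        self.audit.append(response.get("usage", {}))
        return response
```

A repeat of an identical request returns the cached response and records usage only once.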

Model tiers

Typical tiers include local (fast / on-device), premium (higher capability), and code-oriented models. Task type and priority steer the default tier; overrides can be expressed per task where the PRD allows.

Backends

Multiple provider adapters exist (cloud APIs, local inference stacks). Usage metadata (tokens, cost) is recorded when the adapter exposes it.
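One way to express the adapter surface, with usage metadata optional since not every backend reports it (all type names here are assumptions, not the real interface):

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float

@dataclass
class Completion:
    text: str
    usage: Optional[Usage]  # None when the backend exposes no usage metadata

class ProviderAdapter(Protocol):
    """Uniform surface over cloud APIs and local inference stacks."""
    def complete(self, prompt: str, model: str) -> Completion: ...
```

The router records tokens and cost only when `usage` is present.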

Caching

Identical requests can be served from a shared cache with a long TTL so repeat prompts don’t re-hit providers. Cached entries preserve usage fields where applicable so dashboards stay consistent.
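A minimal in-process sketch of such a cache, assuming a hashed canonical request as the key and a long default TTL (names and the 24-hour figure are illustrative):

```python
import hashlib
import json
import time

class ResponseCache:
    """Shared cache keyed on the canonical request; long TTL for repeat prompts."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, dict]] = {}

    @staticmethod
    def key(request: dict) -> str:
        canonical = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def get(self, request: dict):
        entry = self._store.get(self.key(request))
        if entry is None:
            return None
        expires_at, response = entry
        if time.monotonic() > expires_at:
            return None  # expired
        return response  # includes usage fields, so dashboards stay consistent

    def put(self, request: dict, response: dict) -> None:
        self._store[self.key(request)] = (time.monotonic() + self.ttl, response)
```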

Usage audit

Usage is written asynchronously so the hot path stays within latency targets (PRD §23, §25). Durable cost and token records feed billing and alerts.
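The async write can be sketched as a queue drained by a background worker, so `record` never blocks the request path (a single-process sketch; the real pipeline and its sink are defined by the PRD):

```python
import queue
import threading

class UsageAuditor:
    """Writes usage records off the hot path via a background worker thread."""

    def __init__(self, sink):
        self.sink = sink                      # callable: appends to a durable store
        self._q: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, usage: dict) -> None:
        self._q.put(usage)                    # cheap enqueue; caller returns immediately

    def _drain(self) -> None:
        while True:
            usage = self._q.get()
            if usage is None:                 # shutdown sentinel
                break
            self.sink(usage)

    def close(self) -> None:
        self._q.put(None)
        self._worker.join()
```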

Cost controls

| Theme | Idea |
| --- | --- |
| Quotas | Per-agent or per-day limits |
| Budget | Hard caps with approval to exceed |
| Cache | Reduce duplicate spend |
| Degrade | Temporarily restrict premium models on spike |
| Concurrency cap | Limit simultaneous provider calls |
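A combined guardrail check might look like this sketch; the function name, ordering, and thresholds are assumptions, not values from the PRD:

```python
def allow_request(tier: str, spent_usd: float, budget_usd: float,
                  in_flight: int, max_concurrent: int,
                  spike: bool = False) -> tuple[bool, str]:
    """Apply the cost controls in order; returns (allowed, reason)."""
    if in_flight >= max_concurrent:
        return False, "concurrency cap"
    if spent_usd >= budget_usd:
        return False, "budget hard cap (needs approval)"
    if spike and tier == "premium":
        return False, "degraded: premium restricted during spike"
    return True, "ok"
```

Ordering matters: the cheap concurrency check runs first, and the degrade rule only touches premium tiers so local traffic keeps flowing during a spike.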

Observability

Cost and token metrics feed dashboards and the LLM cost spike runbook. Exact metric names are defined in PRD §17.
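As a sketch of what gets emitted per completion, a minimal counter registry (the metric names below are placeholders; the authoritative names are in PRD §17):

```python
from collections import Counter

class Metrics:
    """Minimal in-memory counter registry; names here are placeholders."""

    def __init__(self):
        self.counters = Counter()

    def incr(self, name: str, value: float = 1) -> None:
        self.counters[name] += value

def record_completion(metrics: Metrics, usage: dict, cached: bool) -> None:
    metrics.incr("llm.requests")
    metrics.incr("llm.cache_hits" if cached else "llm.cache_misses")
    metrics.incr("llm.tokens", usage.get("tokens", 0))
    metrics.incr("llm.cost_usd", usage.get("cost_usd", 0.0))
```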