Scheduler¶

The scheduler connects the task graph in the database to workers via stream-based queues. It runs on a periodic loop, finds ready tasks, and dispatches them so workers can claim work fairly under load.

Latency targets (PRD §25)¶

Metric	Target
Ready detection + dispatch	Median ≤50ms
Including coordination	P95 ≤500ms
Worker heartbeat	~10s
Worker considered lost	~30s without heartbeat

flowchart TB
  SCH[Scheduler]
  S0[Shard 0 queue]
  S1[Shard 1 queue]
  SN[Shard N queue]
  W[Workers]
  SCH --> S0
  SCH --> S1
  SCH --> SN
  S0 --> W
  S1 --> W
  SN --> W

Behaviour (summary)¶

Find tasks that are ready (dependencies satisfied).
Atomically mark them as eligible for dispatch and publish work to the appropriate shard queue.
Workers consume from their consumer groups, run the task, then complete or fail via the task API.
On worker loss, in-flight work is re-queued so another worker can pick it up; after max retries, tasks go to a dead-letter path.

Sharding¶

Work is partitioned by shard (e.g. by agent or graph) so many schedulers and workers can scale without one global bottleneck. Same-graph work tends to stay on one shard — simple, but very large graphs can create hot shards; mitigations are an operational/design topic in the PRD.

Tradeoffs¶

Polling interval balances scheduling latency vs database load; the PRD states the SLA targets the design aims for.

See Runbook: Worker Lost for operator-facing recovery (high level only on this wiki).