SLAs & Acceptance Criteria
Production SLAs
| SLA | Target | Measurement |
| --- | --- | --- |
| Control plane API availability | 99.9% | Uptime checks + gateway health |
| API read response time (p99) | ≤ 10ms | Histogram on cached read paths |
| Task scheduling latency (median) | ≤ 50ms | Ready detection → dispatch |
| Task scheduling latency (p95) | ≤ 500ms | End-to-end scheduling path |
| Task execution correctness | ≥ 99% pass rate | Task success / (success + failure) |
| Worker failure detection | ≤ 30s | Heartbeat stream gap |
| Event durability | ≤ 1s | Async path to durable audit log |
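The measurements listed for these SLAs can be sketched as a few small helpers. This is a minimal illustration, not the Astra implementation: the function names, the nearest-rank percentile over raw samples (a production system would more likely read these from histogram buckets), the 30-day availability window, and the choice to treat an empty window as passing are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def pass_rate(success: int, failure: int) -> float:
    """Task execution correctness: success / (success + failure)."""
    total = success + failure
    if total == 0:
        return 1.0  # assumption: an empty window does not violate the SLA
    return success / total


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile (q in 0..100) over raw latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window."""
    return (1 - availability_pct / 100) * window_days * 24 * 60


def failed_workers(last_heartbeat: Dict[str, float], now: float,
                   max_gap_s: float = 30.0) -> List[str]:
    """Workers whose heartbeat gap exceeds the detection threshold."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > max_gap_s)
```

Under these assumptions, a 99.9% availability target over a 30-day window leaves a downtime budget of about 43.2 minutes, and a worker whose last heartbeat is 35 seconds old would be flagged as failed.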
The 10ms p99 read SLA is the hardest constraint in the system, and it is the reason the cache architecture exists. Any code path that reads from Postgres synchronously on a hot API endpoint is a bug: treat it as a correctness issue, not merely a performance issue.
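One way to picture this rule is a read handler that only ever touches memory, with database loads confined to a refresh step that runs off the request path. The sketch below is hypothetical: `ReadCache`, `handle_read`, the snapshot-swap refresh, and the choice to answer a cache miss with a 404 instead of a fallback query are illustrative assumptions, not the actual Astra cache design.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class ReadCache:
    """In-memory snapshot served on hot read paths (hypothetical sketch)."""

    def __init__(self) -> None:
        self._snapshot: Dict[str, Any] = {}

    def refresh(self, load_all: Callable[[], Dict[str, Any]]) -> None:
        """Runs off the request path; load_all is where a Postgres query may live."""
        self._snapshot = dict(load_all())  # swap in a fresh copy

    def get(self, key: str) -> Optional[Any]:
        """Pure memory lookup: never falls through to the database."""
        return self._snapshot.get(key)


def handle_read(cache: ReadCache, key: str) -> Tuple[int, Optional[Any]]:
    """Hot-path read handler: returns (status, body) using the cache only."""
    value = cache.get(key)
    if value is None:
        # Assumption: a miss is reported to the caller, not resolved
        # with a synchronous database read.
        return 404, None
    return 200, value
```

The point of the shape is that `handle_read` has no code path capable of issuing a query, so a synchronous Postgres read cannot be introduced on the hot endpoint by accident.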
MVP functional acceptance
| Criterion | Phase delivered |
| --- | --- |
| Spawn and run a persistent agent | Phase 1 |
| Planner produces task DAGs from a goal | Phase 4 |
| Scheduler detects ready tasks and dispatches to workers | Phase 1 |
| Worker executes tasks and returns results persisted in Postgres | Phase 1/2 |
| Task state transitions emit events to events table | Phase 1 |
| Observability traces visible for each task execution | Phase 5 |
| Tool runtime can run sandboxed command and return artifact | Phase 2 |
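Two of these criteria, ready-task detection and event emission on state transitions, can be illustrated with a small in-memory sketch. The state names, the `deps` mapping, and the list standing in for the events table are assumptions made for the example; the real scheduler and schema may differ.

```python
from typing import Dict, List, Set, Tuple

PENDING, RUNNING, SUCCEEDED = "pending", "running", "succeeded"


def ready_tasks(deps: Dict[str, Set[str]], state: Dict[str, str]) -> List[str]:
    """Pending tasks whose upstream dependencies have all succeeded."""
    return sorted(
        task
        for task, upstream in deps.items()
        if state[task] == PENDING and all(state[u] == SUCCEEDED for u in upstream)
    )


def transition(state: Dict[str, str], events: List[Tuple[str, str, str]],
               task: str, new_state: str) -> None:
    """Change a task's state and append an event (stand-in for the events table)."""
    events.append((task, state[task], new_state))
    state[task] = new_state
```

For a three-task DAG where `parse` depends on `fetch` and `report` depends on both, only `fetch` is ready at the start; `parse` becomes ready once `fetch` transitions to `succeeded`, and each transition leaves one event behind.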
Scale targets
| Target | Value |
| --- | --- |
| Concurrent agents | Millions |
| Tasks per day | 100M+ |
| Maximum latency for any single API call | 10ms |
| Worker failure detection | ≤ 30s |
These are design targets. Load-testing procedures live in the Astra repo.
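To give the tasks-per-day target a sense of scale, it can be converted into an average dispatch rate and a rough concurrency estimate. The sketch below uses Little's law (concurrency = arrival rate × mean service time); the mean task duration and peak factor are hypothetical inputs chosen for illustration, not measured Astra figures.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_tasks_per_second(tasks_per_day: float) -> float:
    """Average dispatch rate implied by a daily task volume."""
    return tasks_per_day / SECONDS_PER_DAY


def required_concurrency(tasks_per_day: float, mean_task_seconds: float,
                         peak_factor: float = 1.0) -> int:
    """Little's law estimate of tasks in flight at peak.

    mean_task_seconds and peak_factor are illustrative assumptions.
    """
    rate = average_tasks_per_second(tasks_per_day)
    return math.ceil(rate * mean_task_seconds * peak_factor)
```

At 100M tasks per day the average rate is roughly 1,157 tasks per second. If tasks took 2 seconds on average and peak load were 3 times the mean (both assumed values), about 6,945 tasks would be in flight at peak.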
Tags: astra, reference, sla, performance