SLAs & Acceptance Criteria
Production SLAs
| SLA | Target | Measurement |
|---|---|---|
| Control plane API availability | 99.9% | Uptime checks + gateway health |
| API read response time (p99) | ≤ 10ms | Histogram on cached read paths |
| Task scheduling latency (median) | ≤ 50ms | Ready detection → dispatch |
| Task scheduling latency (p95) | ≤ 500ms | End-to-end scheduling path |
| Task execution correctness | ≥ 99% pass rate | Task success / (success + failure) |
| Worker failure detection | ≤ 30s | Heartbeat stream gap |
| Event durability | ≤ 1s | Async path to durable audit log |
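The Measurement column can be made concrete with a few small helpers. The following is a minimal sketch, not the platform's actual monitoring code: the function names, the nearest-rank percentile method, and the 30-day availability window are all assumptions made for illustration.

```python
import math
from typing import Sequence


def pass_rate(success: int, failure: int) -> float:
    """Task success / (success + failure), as in the Measurement column."""
    total = success + failure
    if total == 0:
        # Assumption: a window with no finished tasks is not a violation.
        return 1.0
    return success / total


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of raw latency samples (p in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def worker_failed(last_heartbeat_s: float, now_s: float,
                  max_gap_s: float = 30.0) -> bool:
    """True when the heartbeat gap exceeds the 30s detection target."""
    return now_s - last_heartbeat_s > max_gap_s


def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window."""
    return (1.0 - availability) * window_days * 24 * 60


if __name__ == "__main__":
    latencies_ms = [float(i) for i in range(1, 101)]
    print("read p99 within SLA:", percentile(latencies_ms, 99) <= 10.0)
    print("pass rate within SLA:", pass_rate(990, 10) >= 0.99)
    print("downtime budget (min):", round(downtime_budget_minutes(0.999), 1))
```

Under the assumed 30-day window, the 99.9% availability target leaves roughly 43.2 minutes of downtime per window.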
The 10ms read SLA is the hardest constraint in the system, and it is the reason the cache architecture exists. Any code path that reads from Postgres synchronously on a hot API endpoint is a bug: treat it as a correctness issue, not merely a performance issue.
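One way to picture the rule is a read path that only ever consults the cache and hands misses to a background refresher. The sketch below is an illustration under stated assumptions: `HotReadPath`, `warm`, and `refresh_queue` are hypothetical names, and a plain dict stands in for whatever cache the platform actually uses.

```python
import queue
from typing import Dict, Optional


class HotReadPath:
    """Serves reads from an in-memory cache and never queries the database inline."""

    def __init__(self) -> None:
        # Hypothetical stand-in for the real cache layer.
        self._cache: Dict[str, dict] = {}
        # Keys that missed; a background worker (not shown) would load them
        # from Postgres and call warm() with the result.
        self.refresh_queue = queue.Queue()

    def warm(self, key: str, value: dict) -> None:
        """Populate the cache, e.g. from a background loader or a write path."""
        self._cache[key] = value

    def get(self, key: str) -> Optional[dict]:
        """Return the cached value, or None on a miss.

        A miss schedules an asynchronous refresh instead of blocking
        the request on a synchronous database read.
        """
        value = self._cache.get(key)
        if value is None:
            self.refresh_queue.put(key)
        return value


if __name__ == "__main__":
    path = HotReadPath()
    print(path.get("agent:1"))            # miss: returns None, queues a refresh
    path.warm("agent:1", {"state": "running"})
    print(path.get("agent:1"))            # hit: served from the cache
```

How a miss should surface to the caller (an empty result, a retry hint, or an error) is a design decision this sketch leaves open.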
MVP Milestone Map
| Phase | Capability | Status |
|---|---|---|
| Phase 0 | Prep — repo scaffolding, infra, migrations | COMPLETE |
| Phase 1 | Kernel MVP — actors, state, messaging, task graph, scheduler | COMPLETE |
| Phase 2 | Workers & Tool Runtime — execution, Docker sandbox, worker manager | COMPLETE |
| Phase 3 | Memory & LLM Routing — pgvector, LLM router, Memcached caching | COMPLETE |
| Phase 4 | Orchestration, Eval, Security — planner, goal-service, identity, access-control, approvals | COMPLETE |
| Phase 5 | Scale & Production Hardening — load tests, Grafana, alerts, runbooks, cost tracking | COMPLETE |
| Phase 6 | SDK & Applications — AgentContext, MemoryClient, ToolClient, examples | COMPLETE |
| Phase 7 | Security Compliance — gRPC/HTTP TLS, Vault integration | COMPLETE |
| Phase 8 | Platform Dashboard — embedded UI, snapshot API, auto-refresh | COMPLETE |
| Phase 9 | Agent Profile & Context — system_prompt, agent_documents, context propagation | COMPLETE |
| Phase 10 | Chat Agents — WebSocket streaming, sessions, tool invocation | COMPLETE |
| P0-P2 | Platform Stability — agent restore, dead-letter, circuit breakers, idempotency, sharding | COMPLETE |
| Phase 11 | Multi-tenancy — orgs, teams, RBAC, visibility, data isolation | IN PROGRESS |
| Phase 12 | Slack integration — adapter, proactive posting, platform secrets | PARTIAL |
MVP functional acceptance
| Criterion | Phase delivered |
|---|---|
| Spawn and run a persistent agent | Phase 1 |
| Planner produces task DAGs from a goal | Phase 4 |
| Scheduler detects ready tasks and dispatches to workers | Phase 1 |
| Worker executes tasks and returns results persisted in Postgres | Phase 1/2 |
| Task state transitions emit events to events table | Phase 1 |
| Observability traces visible for each task execution | Phase 5 |
| Tool runtime can run sandboxed command and return artifact | Phase 2 |
Scale targets
| Target | Value |
|---|---|
| Concurrent agents | Millions |
| Tasks per day | 100M+ |
| Maximum latency for any single API call | 10ms |
| Worker failure detection | ≤ 30s |
These are design targets. Load-testing procedures live in the Astra repo.
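For a sense of scale, the daily task target can be converted into an average rate. The sketch below is back-of-the-envelope arithmetic only: the helper names are hypothetical, the 2-second mean task duration is an assumed figure rather than a measured one, and real traffic would peak well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_tasks_per_second(tasks_per_day: int) -> float:
    """Average dispatch rate implied by a daily task volume."""
    return tasks_per_day / SECONDS_PER_DAY


def in_flight_tasks(rate_per_second: float, mean_duration_s: float) -> float:
    """Little's law: average tasks in flight = arrival rate x mean duration."""
    return rate_per_second * mean_duration_s


if __name__ == "__main__":
    rate = average_tasks_per_second(100_000_000)
    print(f"average rate: {rate:.1f} tasks/s")
    # Assumed mean task duration of 2 seconds, for illustration only.
    print(f"average in flight: {in_flight_tasks(rate, 2.0):.0f} tasks")
```

At 100M tasks per day the average works out to about 1,157 tasks per second, which is the sustained rate the scheduler would need to hold while staying within the scheduling latency SLAs.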