Operations
Runbooks, oncall procedures, and incident response for Astra. The authoritative runbooks also live in docs/runbooks/ in the Astra repo — these pages are the wiki summary. For step-by-step shell commands during an incident, use the repo runbooks.
Severity levels
Level
Meaning
Example
SEV1
Platform down or data loss risk
Postgres primary unreachable for all traffic
SEV2
Major degradation
Single region Redis loss, elevated task failures
SEV3
Minor / isolated
One worker pool stuck, non-prod only
Communication
Post in the designated incident channel ; nominate incident commander for SEV1/2.
Update status page (if configured) when user-visible.
Stakeholders: product owner for sustained SEV1; security for auth or data exposure.
Runbooks
Runbook
Trigger
Worker Lost
Heartbeat lost >30s
High Error Rate
>5% task failures over 5min
Postgres Outage
DB connection errors platform-wide
Redis Failure
Redis connection errors or data loss
LLM Cost Spike
Cost >2x daily average
Shard Scaling
Shard imbalance or scaling event — update TASK_SHARD_COUNT, restart scheduler + workers
TLS Rotation
Certificate expiration — generate new certs, update K8s secrets, roll services
Vault Setup
Initial or reconfigured — set ASTRA_VAULT_ADDR/TOKEN/PATH, load secrets to KV-v2
Failure modes (P0–P2)
Failure
Detection
Recovery
Task final failure (max retries)
FailTask with retries >= maxRetries
Status → dead_letter; optional publish to astra:dead_letter for alerting/repair
Agent-service restart
Process restart
Agent restore loads active agents from DB, spawns into kernel automatically
Downstream overload
Circuit breaker opens
Gateway returns 503 + Retry-After; clients back off
Duplicate POST /goals
Same Idempotency-Key within TTL
Goal-service returns cached 201 with same goal_id
Scheduler shard imbalance
Monitoring
Rebalance via TASK_SHARD_COUNT; see shard-scaling runbook
Actor mailbox full
Kernel returns ResourceExhausted
Client backs off; retry-after in gRPC trailer
Oncall rotations
Rotation
Scope
Kernel SRE
Scheduler, actors, tasks, messaging, state manager
Agent Platform
Workers, tools, memory, LLM routing
Full escalation matrix: docs/runbooks/ in the Astra repo.
Incident lifecycle
Detect → Triage → Contain → Remediate → Postmortem → Remediation Review
Upgrade plan
Kernel upgrades must be backward-compatible on message contracts and schemas
Rolling upgrades with canary: 5% traffic for 30 minutes, then full rollout
DB migrations: schema changes must be backward-compatible with the current running binary (add columns before removing them)
Blue/green deployment where possible for zero-downtime schema migrations
astra
operations
runbooks