Skip to content

Operations

Runbooks, oncall procedures, and incident response for Astra. The authoritative runbooks also live in docs/runbooks/ in the Astra repo — these pages are the wiki summary. For step-by-step shell commands during an incident, use the repo runbooks.

Severity levels

Level Meaning Example
SEV1 Platform down or data loss risk Postgres primary unreachable for all traffic
SEV2 Major degradation Single region Redis loss, elevated task failures
SEV3 Minor / isolated One worker pool stuck, non-prod only

Communication

  • Post in the designated incident channel; nominate incident commander for SEV1/2.
  • Update status page (if configured) when user-visible.
  • Stakeholders: product owner for sustained SEV1; security for auth or data exposure.

Runbooks

Runbook Trigger
Worker Lost Heartbeat lost >30s
High Error Rate >5% task failures over 5min
Postgres Outage DB connection errors platform-wide
Redis Failure Redis connection errors or data loss
LLM Cost Spike Cost >2x daily average
Shard Scaling Shard imbalance or scaling event — update TASK_SHARD_COUNT, restart scheduler + workers
TLS Rotation Certificate expiration — generate new certs, update K8s secrets, roll services
Vault Setup Initial or reconfigured — set ASTRA_VAULT_ADDR/TOKEN/PATH, load secrets to KV-v2

Failure modes (P0–P2)

Failure Detection Recovery
Task final failure (max retries) FailTask with retries >= maxRetries Status → dead_letter; optional publish to astra:dead_letter for alerting/repair
Agent-service restart Process restart Agent restore loads active agents from DB, spawns into kernel automatically
Downstream overload Circuit breaker opens Gateway returns 503 + Retry-After; clients back off
Duplicate POST /goals Same Idempotency-Key within TTL Goal-service returns cached 201 with same goal_id
Scheduler shard imbalance Monitoring Rebalance via TASK_SHARD_COUNT; see shard-scaling runbook
Actor mailbox full Kernel returns ResourceExhausted Client backs off; retry-after in gRPC trailer

Oncall rotations

Rotation Scope
Kernel SRE Scheduler, actors, tasks, messaging, state manager
Agent Platform Workers, tools, memory, LLM routing

Full escalation matrix: docs/runbooks/ in the Astra repo.

Incident lifecycle

Detect → Triage → Contain → Remediate → Postmortem → Remediation Review

Upgrade plan

  • Kernel upgrades must be backward-compatible on message contracts and schemas
  • Rolling upgrades with canary: 5% traffic for 30 minutes, then full rollout
  • DB migrations: schema changes must be backward-compatible with the current running binary (add columns before removing them)
  • Blue/green deployment where possible for zero-downtime schema migrations