Operations

Runbooks, oncall procedures, and incident response for Astra. The authoritative runbooks live in docs/runbooks/ in the Astra repo; these pages are a wiki summary of them. For step-by-step shell commands during an incident, use the repo runbooks.

Severity levels

Level | Meaning | Example
SEV1 | Platform down or data loss risk | Postgres primary unreachable for all traffic
SEV2 | Major degradation | Single region Redis loss, elevated task failures
SEV3 | Minor / isolated | One worker pool stuck, non-prod only

Communication

  • Post in the designated incident channel; nominate an incident commander for SEV1/2.
  • Update the status page (if configured) when the impact is user-visible.
  • Stakeholders: notify the product owner for a sustained SEV1, and security for any auth or data exposure incident.

Runbooks

Runbook | Trigger
Worker Lost | Heartbeat lost >30s
High Error Rate | >5% task failures over 5min
Postgres Outage | DB connection errors platform-wide
Redis Failure | Redis connection errors or data loss
LLM Cost Spike | Cost >2x daily average
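
The three numeric triggers above can be read as simple threshold checks. The sketch below is illustrative only: the Snapshot fields and the triggered_runbooks function are hypothetical names, not part of Astra, and it assumes the cost trigger compares today's LLM spend against a trailing daily average. The Postgres and Redis triggers are left out because they are not expressed as numeric thresholds here.

```python
from dataclasses import dataclass

# Thresholds copied from the trigger table; every name below is hypothetical.
HEARTBEAT_TIMEOUT_S = 30      # "Heartbeat lost >30s"
ERROR_RATE_THRESHOLD = 0.05   # ">5% task failures over 5min"
COST_SPIKE_FACTOR = 2.0       # "Cost >2x daily average"


@dataclass
class Snapshot:
    seconds_since_heartbeat: float
    failed_tasks_5min: int
    total_tasks_5min: int
    llm_cost_today: float       # assumption: cost is compared per day
    llm_cost_daily_avg: float


def triggered_runbooks(s: Snapshot) -> list[str]:
    """Return the runbooks whose numeric trigger is met by this snapshot."""
    hits = []
    if s.seconds_since_heartbeat > HEARTBEAT_TIMEOUT_S:
        hits.append("Worker Lost")
    # Guard against division by zero when no tasks ran in the window.
    if s.total_tasks_5min > 0:
        error_rate = s.failed_tasks_5min / s.total_tasks_5min
        if error_rate > ERROR_RATE_THRESHOLD:
            hits.append("High Error Rate")
    if s.llm_cost_daily_avg > 0 and s.llm_cost_today > COST_SPIKE_FACTOR * s.llm_cost_daily_avg:
        hits.append("LLM Cost Spike")
    return hits
```

All comparisons are strict, matching the ">" in the table, so a failure rate of exactly 5% does not fire the High Error Rate runbook.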

Oncall rotations

Rotation | Scope
Kernel SRE | Scheduler, actors, tasks, messaging, state manager
Agent Platform | Workers, tools, memory, LLM routing

Full escalation matrix: docs/runbooks/ in the Astra repo.

Incident lifecycle

Detect → Triage → Contain → Remediate → Postmortem → Remediation Review
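
If incident tooling needs to track these phases, one way to model them is as an ordered enum. This is a minimal sketch, not Astra code: Phase, next_phase, and can_transition are hypothetical names, and the rule that an incident only moves forward one phase at a time is an assumption, since the lifecycle above does not say whether phases may be skipped or revisited.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECT = "Detect"
    TRIAGE = "Triage"
    CONTAIN = "Contain"
    REMEDIATE = "Remediate"
    POSTMORTEM = "Postmortem"
    REMEDIATION_REVIEW = "Remediation Review"


# Enum members iterate in definition order, which matches the lifecycle.
_ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    i = _ORDER.index(current)
    return _ORDER[i + 1] if i + 1 < len(_ORDER) else None


def can_transition(current: Phase, target: Phase) -> bool:
    """Assumption: an incident advances exactly one phase at a time."""
    return next_phase(current) is target
```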

Upgrade plan

  • Kernel upgrades must be backward-compatible on message contracts and schemas
  • Rolling upgrades with canary: 5% traffic for 30 minutes, then full rollout
  • DB migrations: schema changes must be backward-compatible with the current running binary (add columns before removing them)
  • Blue/green deployment where possible for zero-downtime schema migrations
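
The canary rule above (5% of traffic for 30 minutes, then full rollout) can be sketched as a small decision function. The promotion criterion is an assumption: this page does not say what makes a canary healthy, so the sketch borrows the 5% task failure threshold from the High Error Rate runbook as a default. The function name canary_decision and its parameters are hypothetical.

```python
CANARY_TRAFFIC_FRACTION = 0.05   # share of traffic sent to the canary
CANARY_SOAK_MINUTES = 30         # how long the canary must run before rollout


def canary_decision(minutes_elapsed: float,
                    canary_error_rate: float,
                    max_error_rate: float = 0.05) -> str:
    """Return "rollback", "hold", or "promote" for a running canary.

    max_error_rate is an assumed health criterion, not a documented one.
    """
    # An unhealthy canary is rolled back regardless of how long it has run.
    if canary_error_rate > max_error_rate:
        return "rollback"
    # A healthy canary is promoted only after the full soak period.
    if minutes_elapsed >= CANARY_SOAK_MINUTES:
        return "promote"
    return "hold"
```

For example, canary_decision(10, 0.01) holds, canary_decision(30, 0.01) promotes, and canary_decision(10, 0.20) rolls back. Checking health before elapsed time means a failing canary never gets promoted just because the timer ran out.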