Runbook — Postgres outage¶

Trigger: Widespread database connectivity errors; control plane and workers cannot persist state.

Symptoms¶

Severe: the system cannot commit task or agent state until the database is reachable or a replica is promoted.

Confirm primary vs replica status with your DBA / cloud console.
Fail over to a healthy instance per your provider runbook.
Update connection config for all dependent services and roll restarts in a controlled order (internal playbook).
Verify writes and reads; watch event backlog if any async consumers lagged.
Post-incident: replication lag, data window, and whether event log replay is needed — documented in private runbooks.

Note

No SQL, kubectl, or psql recipes on this public wiki.