Runbook: Database Readonly Incident

Fly Managed Postgres silently switches a cluster to readonly when its volume passes ~90% disk usage. The failure masquerades as a connection problem, which sends you debugging the wrong layer.

Symptoms

  • Atlas / lib/pq during deploy or migration: sql/migrate: write revision: driver: bad connection
  • App writes failing while reads work
  • Postgres error code 25006 (read_only_sql_transaction) — but drivers often surface it as “bad connection”
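
The symptom patterns above can be turned into a quick triage filter for deploy or application logs. The following is a minimal sketch, not an official tool: classify_db_error is a hypothetical helper name, and the match strings are taken from the symptoms listed here plus the standard Postgres wording "read-only transaction".

```shell
# Hypothetical helper: classify a single error line from a deploy or app log.
# Prints "readonly-suspect" when the line matches a symptom from this runbook,
# "other" for anything else. A match is a hint to check disk, not a diagnosis.
classify_db_error() {
  case "$1" in
    *"bad connection"*|*25006*|*read_only_sql_transaction*|*"read-only transaction"*)
      echo "readonly-suspect" ;;
    *)
      echo "other" ;;
  esac
}
```

Example: classify_db_error "sql/migrate: write revision: driver: bad connection" prints readonly-suspect, which is the cue to run the disk checks below before anything else.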

Diagnosis — check this FIRST

flyctl checks list -a <db-app>
flyctl mpg status <cluster-id>

Look for failing disk/health checks before touching connection strings, PgBouncer, or credentials. If a write fails with “bad connection” while reads still work, treat it as readonly mode until proven otherwise.
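
If the health checks are inconclusive, the database itself can be asked whether new sessions are read-only. This is a sketch under assumptions: DATABASE_URL is a placeholder for a connection string that reaches the cluster (for example through a local proxy), check_readonly is a hypothetical helper name, and it is assumed that readonly mode shows up in the standard transaction_read_only setting.

```shell
# Hypothetical helper: report whether a fresh session is read-only.
# Assumes DATABASE_URL is set and psql is installed locally.
check_readonly() {
  # -A unaligned output, -t tuples only, -c run a single command
  state=$(psql "$DATABASE_URL" -Atc "SHOW transaction_read_only;") || return 2
  if [ "$state" = "on" ]; then
    echo "read-only"
  else
    echo "read-write"
  fi
}
```

A result of read-only, together with failing disk checks, supports the readonly diagnosis. A return code of 2 means psql could not connect at all, which points back at a genuine connection problem.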

Resolution

  1. Extend the volume (Fly dashboard or flyctl) — readonly lifts automatically once usage drops below the threshold.
  2. Re-run the failed migration/deploy.
  3. Afterwards: investigate what consumed the disk (table bloat, WAL accumulation, log growth) so it does not recur.
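
For step 3, a few read-only catalog queries give a first picture of where the space went. This is a rough sketch, assuming DATABASE_URL is a placeholder connection string and disk_report is a hypothetical helper name. pg_ls_waldir() normally requires elevated privileges (pg_monitor or superuser), so the WAL query may be refused on a managed cluster.

```shell
# Hypothetical helper: print database size, the ten largest tables, and WAL size.
# All statements are reads, so they also work while the cluster is read-only.
disk_report() {
  psql "$DATABASE_URL" <<'SQL'
-- Total size of the current database
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

-- Ten largest user tables, including indexes and TOAST data
SELECT schemaname, relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

-- WAL directory size (may fail without pg_monitor privileges)
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
SQL
}
```

Table sizes far larger than the live data suggests point at bloat, while a large wal_size points at WAL accumulation. Log growth is not visible from SQL and has to be checked on the platform side.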

Verification

  • flyctl checks list -a <db-app> shows every check passing.
  • A write succeeds (re-run the deploy’s release_command or a trivial UPDATE via flyctl mpg proxy).