Runbook: Database Readonly Incident
Fly Managed Postgres silently switches a cluster to readonly when its volume passes ~90% disk usage. The failure masquerades as a connection problem, which sends you debugging the wrong layer.
Symptoms
- Atlas / lib/pq during deploy or migration:
  sql/migrate: write revision: driver: bad connection
- App writes failing while reads work
- Postgres error code 25006 (read_only_sql_transaction), but drivers often surface it as “bad connection”
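The symptoms above lend themselves to a quick log triage. The following is a minimal sketch, assuming the deploy or migration output is available as text. The function name looks_like_readonly and the pattern list are illustrative and not part of flyctl or Atlas. The "read-only transaction" pattern matches the message Postgres emits alongside code 25006.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of flyctl or Atlas): scan log text on stdin
# for the readonly symptoms listed above. Exits 0 if any line matches.
# Note: the bare "25006" pattern can also match unrelated numbers, so treat
# a hit as a prompt to check disk health, not as proof.
looks_like_readonly() {
  grep -Eiq 'driver: bad connection|25006|read_only_sql_transaction|read-only transaction'
}
```

Example use: looks_like_readonly < deploy.log && echo "check disk health before debugging connections" (deploy.log is a placeholder file name).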
Diagnosis — check this FIRST
flyctl checks list -a <db-app>
flyctl mpg status <cluster-id>
Look for failing disk/health checks before touching connection strings, PgBouncer, or credentials. A “bad connection” message during a write while reads still work is readonly mode until proven otherwise.
Resolution
- Extend the volume (Fly dashboard or flyctl). Readonly lifts automatically once usage drops below the threshold.
- Re-run the failed migration/deploy.
- Afterwards: investigate what consumed the disk (table bloat, WAL accumulation, log growth) so it does not recur.
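For the follow-up investigation, a rough starting point is to ask Postgres itself where the space went. This sketch assumes psql is installed and that the first argument is a connection URL that reaches the cluster (for example through a local flyctl mpg proxy session). The function name disk_report is hypothetical, and pg_ls_waldir() may be unavailable if the role lacks superuser or pg_monitor privileges. Server log growth is not visible through these queries.

```shell
#!/usr/bin/env bash
# Sketch: print database size, the largest tables, and WAL directory size.
# PSQL can be overridden (for example with a stub) for testing.
disk_report() {
  local url="$1" psql="${PSQL:-psql}"
  # Total size of the current database.
  "$psql" -At -c "SELECT pg_size_pretty(pg_database_size(current_database()));" "$url" || return 1
  # Ten largest tables, including their indexes and TOAST data;
  # unexpectedly large entries are the first place to look for bloat.
  "$psql" -At -c "SELECT c.oid::regclass, pg_size_pretty(pg_total_relation_size(c.oid))
    FROM pg_class c WHERE c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC LIMIT 10;" "$url" || return 1
  # WAL directory size; needs superuser or pg_monitor membership.
  "$psql" -At -c "SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();" "$url"
}
```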
Verification
- flyctl checks list -a <db-app> shows all green.
- A write succeeds (re-run the deploy’s release_command or a trivial UPDATE via flyctl mpg proxy).
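If no harmless UPDATE is at hand, the write check can be approximated with a statement that leaves no trace. This is a sketch under assumptions: psql is installed, the first argument is a connection URL reaching the cluster through the proxy, and the table name readonly_probe is hypothetical and not already in use. In a read-only transaction Postgres rejects CREATE TABLE with code 25006; otherwise the ROLLBACK discards it.

```shell
#!/usr/bin/env bash
# Sketch: attempt a write inside a transaction that is always rolled back.
# PSQL can be overridden (for example with a stub) for testing.
write_probe() {
  local url="$1" psql="${PSQL:-psql}"
  if "$psql" -v ON_ERROR_STOP=1 -q \
       -c "BEGIN; CREATE TABLE readonly_probe (); ROLLBACK;" "$url" >/dev/null; then
    echo "writes OK"
  else
    # psql's stderr shows the actual error; look for SQLSTATE 25006 there.
    echo "write failed: cluster may still be readonly"
    return 1
  fi
}
```

A failure here can also mean a missing CREATE privilege or an unreachable proxy, so read the psql error before concluding the cluster is still readonly.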
Related
- Deployment Rollback — if a deploy half-applied during the incident
- Migration Recovery — if the migration revision table is now inconsistent