Runbook: Deployment Rollback

A bad deploy reached development, staging, or production and must be reverted. There are two paths: always try Path A first; Path B is the break-glass procedure for database damage.

Canonical tooling: renewa-one/scripts/rollback-deploy.sh (reworked in PR #1918, spec docs/superpowers/specs/2026-06-09-mpg-rollback-rework-design.md). Both paths live-tested 2026-06-10.

Path A — Image rollback (first resort, non-destructive)

Use when: the bad deploy is code-level (crash, broken feature, bad config) and the database is intact. Migrations are kept additive/backward-compatible (expand/contract), so the previous image runs safely against the current schema.

./renewa-one/scripts/rollback-deploy.sh <env> --dry-run   # verify resolved digest first
./renewa-one/scripts/rollback-deploy.sh <env>

What it does:

  1. Resolves the previous image digest from the deployment-record GitHub issues.
  2. Re-promotes it via gh workflow run promote-image.yml — the rollback stays on the CI deploy path (no local flyctl deploy).
  3. Leaves the database untouched.
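
Step 1 could be sketched roughly as follows. This is an illustration only, not the script's actual logic: previous_digest is a hypothetical helper, and it assumes the deployment-record text arrives newest first with each record containing the deployed sha256 digest.

```shell
# previous_digest: hypothetical helper, NOT taken from rollback-deploy.sh.
# stdin:  deployment-record text, newest record first (assumption).
# stdout: the second distinct sha256 digest, i.e. the image that was
#         live before the current one.
previous_digest() {
  # Extract every digest, drop repeats while keeping order, take the second.
  _prev=$(grep -oE 'sha256:[0-9a-f]{64}' | awk '!seen[$0]++' | sed -n '2p')
  if [ -z "$_prev" ]; then
    echo "no previous digest found" >&2
    return 1
  fi
  printf '%s\n' "$_prev"
}
```

Deduplicating before picking the second entry matters when the same image was deployed twice in a row (for example a retry); otherwise the "previous" digest would equal the current one.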

Notes:

  • flyctl releases rollback does not exist — do not attempt it.
  • An image reference joins repository and digest with @ (repo@sha256:…), never with :.
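
A minimal sketch of the @ join from the note above. image_ref is a hypothetical helper, not part of rollback-deploy.sh; it refuses anything that is not a full sha256 digest, so a tag cannot be passed by mistake.

```shell
# image_ref: hypothetical helper, NOT taken from rollback-deploy.sh.
# $1 = repository, $2 = digest
# Prints repo@sha256:<64 hex chars>, or fails if $2 is not a sha256 digest.
image_ref() {
  if ! printf '%s\n' "$2" | grep -Eq '^sha256:[0-9a-f]{64}$'; then
    echo "not a sha256 digest: $2" >&2
    return 1
  fi
  # Digests are joined with "@"; ":" is reserved for tags.
  printf '%s@%s\n' "$1" "$2"
}
```

Calling it with a tag such as v1.2.3 in place of the digest returns a non-zero status and prints nothing on stdout.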

Path B — Database break-glass (rare; script prepares, human flips)

Use when: a destructive migration or data corruption requires restoring the pre-deploy backup. Backup IDs are recorded in the deployment-record issue (created by snapshot-before-deploy.sh on every mutating deploy).

Critical mental model: flyctl mpg restore always forks a NEW cluster (new ID, new hostname). It can never restore in place. The CLUSTER_ID argument names the source cluster whose backup is read, not the restore target.

The script:

  1. Restores the pre-deploy backup into a new cluster and waits for readiness (~5 min).
  2. Reads the current DB URLs from the running app via flyctl ssh console -C 'printenv DATABASE_URL' — no Infisical login needed mid-incident.
  3. Verifies connectivity as BOTH app-user and migration-user through a flyctl mpg proxy tunnel (*.flympg.net hostnames resolve only on Fly’s private network — direct psql from a laptop can never reach them).
  4. Prints the manual flip checklist and stops. No destructive step is automated.
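
Because the *.flympg.net hostname is unreachable from a laptop, a connectivity check through the tunnel has to point the same credentials at the local end of the proxy. The sketch below shows one way to rewrite a URL for that purpose. proxied_url is a hypothetical helper, and the local port is an assumption: use whatever port the flyctl mpg proxy session actually listens on.

```shell
# proxied_url: hypothetical helper, NOT taken from rollback-deploy.sh.
# $1 = original postgres URL, $2 = local port of the proxy tunnel (assumption)
# Replaces host[:port] with 127.0.0.1:<port>; user, password, database
# name, and query string are left unchanged.
proxied_url() {
  printf '%s\n' "$1" | sed -E "s#@[^/@]+/#@127.0.0.1:$2/#"
}
```

With a tunnel open in another terminal, something like psql "$(proxied_url "$DATABASE_URL" 16380)" -c 'select 1' could then be run once per user; 16380 is only a placeholder port.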

Manual flip (human executes, in order):

  1. Update DATABASE_URL + DATABASE_URL_MIGRATION in Infisical (project renewa-one, matching env scope) — never via flyctl secrets set (the Infisical sync would overwrite it).
  2. Redeploy, or wait for Auto Redeploy.
  3. Update the env→cluster-id map in renewa-one/scripts/lib/mpg-clusters.sh (single source of truth).
  4. Only after the new cluster is verified serving: detach + destroy the old cluster.

Facts that save time:

  • pgBackRest physical restore preserves roles and password hashes — existing app-user/migration-user credentials work on the forked cluster.
  • 🚨 NEVER run flyctl mpg destroy on production before the replacement is verified: a destroyed cluster cannot be recovered.

Verification

  • flyctl status -a <app> shows the expected image.
  • GET https://<app>/api/health/ready returns 200.
  • Deployment-record issue updated with what was rolled back and why.
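
The readiness check can be polled instead of retried by hand. A minimal sketch, assuming curl is available; wait_ready and its default attempt count and delay are hypothetical, not values from the rollback tooling.

```shell
# wait_ready: hypothetical helper, NOT taken from rollback-deploy.sh.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 5)
# Returns 0 as soon as the URL answers HTTP 200, 1 if attempts run out.
wait_ready() {
  url="$1"; attempts="${2:-30}"; delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Usage: wait_ready "https://<app>/api/health/ready", substituting the real hostname for <app>.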