Runbook: Migration Recovery

Use this runbook when Atlas migration state has diverged from what you expect, typically during promotion to staging/production or after env-gated data migrations.

Gotcha 1: atlas migrate set <V> is INCLUSIVE

atlas migrate set <version> marks all versions up to and including <version> as applied, without running them. Any real migrations below <version> that have not yet been applied are silently skipped (this bit production during the PR #1789 promote).

Correct sequence when marking a version (<PREV> is the version immediately below <V>):

atlas migrate apply --to-version <PREV>   # actually apply everything below first
atlas migrate set <V>                     # then mark only the gated one
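
Finding <PREV> by eye is error-prone in a long migration directory. The following is a minimal sketch of a helper that looks it up. The function name prev_version is hypothetical, and it assumes migration files are named <version>_<name>.sql with fixed-width versions, so that a plain sort matches apply order.

```shell
# Hypothetical helper (not part of the repo): print the version that sorts
# immediately before $1 in a migration directory.
# Assumes files are named <version>_<name>.sql with fixed-width versions.
prev_version() {
  target="$1"
  dir="${2:-atlas/migrations}"
  for f in "$dir"/*.sql; do
    [ -e "$f" ] || continue          # directory has no .sql files
    name="${f##*/}"                  # strip the directory part
    printf '%s\n' "${name%%_*}"      # keep the text before the first underscore
  done | LC_ALL=C sort -u | awk -v t="$target" '
    # Appending "" forces string comparison rather than numeric comparison.
    ($0 "") == (t "") { if (prev != "") print prev; found = 1; exit }
    { prev = $0 }
    END { if (!found) exit 1 }       # target version is not in the directory
  '
}
```

An empty result with a zero exit status means <V> is the first migration, so there is nothing to apply before marking it. A non-zero exit status means <V> was not found in the directory.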

Gotcha 2: env-gated migrations (renewa:atlas:skip-on)

Destructive/data migrations can be gated per environment with an in-file directive (PR #1797):

-- renewa:atlas:skip-on staging,production

atlas-migrate.sh reads the directive and, when APP_ENV matches one of the named environments, uses atlas migrate set to mark the migration as applied without running it. When debugging “migration ran locally but not in staging”, check for this directive before suspecting the pipeline.
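
To check by hand whether a given file is gated for an environment, something like the sketch below may help. It is not the actual logic in atlas-migrate.sh. The function name is_skipped_on is hypothetical, and it assumes the directive sits on its own line, exactly in the form shown above.

```shell
# Hypothetical check (not the real atlas-migrate.sh): succeed if the
# migration file $1 carries a skip-on directive that names environment $2.
is_skipped_on() {
  file="$1"
  env_name="$2"
  [ -n "$env_name" ] || return 1
  # Strip the directive prefix, split the list on commas, trim whitespace,
  # then look for an exact whole-line match on the environment name.
  sed -n 's/^-- renewa:atlas:skip-on[[:space:]]\{1,\}//p' "$file" |
    tr ',' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -Fxq -- "$env_name"
}
```

For example, is_skipped_on path/to/migration.sql "$APP_ENV" && echo "gated" reports whether the file would be skipped in the current environment. The match is exact, so "prod" does not match "production".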

Gotcha 3: atlas.sum integrity

atlas.sum is an append-only Merkle chain: never re-hash it from scratch and never hand-edit it. If your migration’s timestamp sorts before one already on main:

  1. Rename your migration to a current timestamp (regenerate via make db-generate, don’t hand-edit).
  2. Reset atlas.sum to main’s version.
  3. Run atlas migrate hash to append only your entry.
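
Whether the steps above are needed at all can be checked before touching atlas.sum. The sketch below is one way to do that. The function name needs_new_timestamp is hypothetical, and it assumes you have a checkout of main's migration directory available (for instance via a separate git worktree) and that files are named <version>_<name>.sql with fixed-width versions.

```shell
# Hypothetical check: succeed if version $1 sorts strictly before the newest
# version found in $2, a checkout of main's migration directory.
needs_new_timestamp() {
  mine="$1"
  main_dir="$2"
  latest="$(
    for f in "$main_dir"/*.sql; do
      [ -e "$f" ] || continue
      name="${f##*/}"
      printf '%s\n' "${name%%_*}"    # version = text before the first underscore
    done | LC_ALL=C sort | tail -n 1
  )"
  [ -n "$latest" ] || return 1       # main has no migrations to compare against
  # Compare as strings: mine is older if it differs and sorts first.
  [ "$mine" != "$latest" ] &&
    [ "$(printf '%s\n%s\n' "$mine" "$latest" | LC_ALL=C sort | head -n 1)" = "$mine" ]
}
```

If the function succeeds, the migration needs to be regenerated with a current timestamp as in step 1. If it fails, the timestamp already sorts after everything on main.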

Never fabricate migration artifacts (SQL, journal entries, timestamps, sums) by hand; always generate them via the drizzle-kit / atlas CLIs. Round timestamps in history are the fingerprint of past fabrication.

Gotcha 4: N-1 compatibility gate

Renames, drops, and SET NOT NULL changes are gated by scripts/check-migration-n1.sh (PR #1938). A rename must ship a compat view plus a renewa:n-1-shim: <old> drop-with #<issue> annotation. The gate fails any PR in which a live shim’s tracking issue has been closed; either reopen the issue or drop the shim in the same PR.
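
When the gate fails, it can help to list which shims are live and which issues they point at. The sketch below is not how scripts/check-migration-n1.sh works internally. The function name shim_issues is hypothetical, and it assumes each annotation sits on a single line in the form shown above.

```shell
# Hypothetical helper (not the real gate script): print "<old> <issue>" for
# every renewa:n-1-shim annotation found in the given files.
shim_issues() {
  sed -n 's/.*renewa:n-1-shim:[[:space:]]*\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}drop-with[[:space:]]\{1,\}#\([0-9]\{1,\}\).*/\1 \2/p' "$@"
}
```

Each issue number in the output can then be checked by hand, for example with gh issue view <issue> --json state if the GitHub CLI is available.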

Diagnosis quick path

atlas migrate status --url "$DATABASE_URL_MIGRATION"   # what does the cluster think is applied?
atlas migrate validate                                  # is the local dir consistent with atlas.sum?

Compare the output against atlas/migrations/ on main. If status reports a dirty or partially applied version, fix the underlying cause first (often Database Readonly Incident), then resolve the revision before re-applying.