Why Your Alembic Migrations Work Locally and Wreck Prod

Staging passed clean. CI is green. You run alembic upgrade head on prod and thirty seconds later half your API is throwing 500s. Alembic migration best practices don’t get talked about enough because most devs only hit the sharp edges in production — never in a controlled environment where they can actually learn from it. The gap between “works on my machine” and “works at 2 AM on a 40M-row table under live traffic” is where this article lives.


TL;DR: Quick Takeaways

  • Alembic autogenerate misses column renames, stored procedures, check constraints, and custom types — always review generated scripts manually before running on prod
  • Downgrade is an illusion for data migrations — once rows are transformed, alembic downgrade doesn’t restore them; forward-only is the only honest strategy
  • Setting NOT NULL on an existing column of a live Postgres table forces a full-table scan under an ACCESS EXCLUSIVE lock; on 10M+ rows that's minutes of blocked reads and writes, not seconds
  • The alembic_version table mismatch between environments is the #1 cause of “Target database is not up to date” errors in CI/CD pipelines

What Autogenerate Actually Does — and Where It Quietly Fails

Every mid-level Python dev has run alembic revision --autogenerate and trusted the output. The problem is autogenerate compares your SQLAlchemy metadata against the live database schema — it doesn’t read your migration history, doesn’t understand renames, and has zero awareness of anything outside SQLAlchemy’s type system. On a greenfield project with two developers this is fine. On a legacy codebase with custom Postgres types, partial indexes, and stored procedures, autogenerate produces a migration script that’s technically valid Python but architecturally wrong.

The Rename Trap

Autogenerate cannot detect column renames. If you rename user_ref to user_id in your SQLAlchemy model, autogenerate sees a removed column and a new column, and generates a DROP followed by an ADD. On a table with 5M rows and foreign key constraints, that means every value in user_ref is discarded, the replacement user_id column arrives empty, and any foreign key on the old column gets dropped along with it. Your prod DB doesn't warn you; the data is simply gone.

# What autogenerate generates for a rename:
def upgrade():
    op.drop_column('orders', 'user_ref')     # drops data
    op.add_column('orders',
        sa.Column('user_id', sa.Integer()))  # empty column

# What you actually need:
def upgrade():
    op.alter_column('orders', 'user_ref',
        new_column_name='user_id')           # in-place rename

The autogenerated version silently destroys the column data — no warning, no error during alembic upgrade. The correct version uses op.alter_column with new_column_name. This is exactly why the Alembic docs say autogenerate is “not intended to be perfect” — that’s not a disclaimer, it’s a load-bearing warning that most devs skip.

Custom Types and Unnamed Constraints

If your schema uses Postgres-specific types like JSONB, UUID, or ENUM, autogenerate often generates false positives — detecting a “type change” that isn’t real, or missing a constraint entirely because it wasn’t created through SQLAlchemy. Unnamed constraints are the worst offender: if you added a CHECK constraint directly in psql without naming it, autogenerate can’t match it to anything in the metadata and either ignores it or tries to recreate it, producing a duplicate constraint error on the next upgrade.
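
A common remedy is to attach a naming convention to your metadata so every constraint SQLAlchemy creates gets a deterministic name that autogenerate can match later. A minimal sketch, assuming a shared declarative Base:

from sqlalchemy import MetaData
from sqlalchemy.orm import declarative_base

# Standard SQLAlchemy naming-convention tokens; adjust to your conventions
NAMING_CONVENTION = {
    "ix": "ix_%(column_0_label)s",
    "uq": "uq_%(table_name)s_%(column_0_name)s",
    "ck": "ck_%(table_name)s_%(constraint_name)s",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
}

Base = declarative_base(metadata=MetaData(naming_convention=NAMING_CONVENTION))

Constraints created by hand in psql still need explicit names, but everything the ORM owns becomes matchable.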

The Downgrade Illusion and Why Forward-Only Migrations Win

The downgrade function in every Alembic revision file creates a false sense of safety. Schema rollbacks work fine — drop the column you added, restore the index you removed. But the moment your migration touches data — backfilling a new column, transforming values, splitting a table — alembic downgrade cannot reconstruct what was there before. You’d need a full backup restore, which takes longer than your incident window. Alembic migration strategy in production should treat downgrade as a schema-only safety net, not a data recovery mechanism.

The “Target Database Is Not Up to Date” Death Loop

This error shows up in CI/CD more than anywhere else. The cause is almost always a mismatch between the revision files in your versions/ directory and the revision hash stored in the alembic_version table. A developer generates a revision locally, doesn’t commit it, and pushes code that references a head that doesn’t exist in the shared repo. The pipeline runs alembic upgrade head, hits the mismatch, and fails. The fix is alembic stamp head — but running that blindly on prod without understanding what you’re stamping is how you end up with schema drift that surfaces three sprints later.

# Diagnose before you stamp anything
alembic current            # what revision is the DB on
alembic history --verbose  # full chain of revisions
alembic heads              # are there multiple heads (branching problem)?

# Only after confirming the DB schema matches your codebase:
alembic stamp head         # mark the DB as at this revision without running any SQL

Multiple heads — shown by alembic heads returning more than one hash — means two migration branches were created in parallel and never merged. This happens on teams where feature branches both touch the schema. The resolution is alembic merge heads to create an explicit merge revision, but that merge revision needs a real upgrade function that handles both branch states — not just a pass-through.
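
A merge revision produced by alembic merge heads looks roughly like this; the file name and revision hashes here are placeholders:

# versions/2f3a1c9d_merge_heads.py
revision = '2f3a1c9d'
down_revision = ('a1b2c3d4', 'e5f6a7b8')   # both parallel heads
branch_labels = None
depends_on = None

def upgrade():
    # Alembic generates this as a pass-through; when both branches touched
    # the same table, this is where the manual reconciliation lives.
    pass

def downgrade():
    pass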

Forward-Only as Architecture, Not Compromise

Teams that adopt forward-only migrations — where downgrade() is intentionally left empty or raises NotImplementedError — make an explicit architectural decision: schema changes move in one direction and rollback is handled at the application layer through feature flags and backward-compatible schema design. This removes the false confidence of “we can always rollback” and forces proper Expand/Contract discipline from day one. On high-traffic Postgres deployments, this approach reduces migration-related incidents by eliminating the category of failures caused by partial downgrades on data that’s already been transformed.
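
A minimal sketch of what that policy looks like in a revision file (the column being added is illustrative):

def upgrade():
    op.add_column('orders', sa.Column('tenant_id', sa.Integer(), nullable=True))

def downgrade():
    # forward-only policy: recover by rolling forward with a new revision
    raise NotImplementedError("downgrade not supported; write a forward migration instead")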

NOT NULL Constraints, Table Locks, and the Backfill Trap

Adding a NOT NULL constraint to an existing column on a live Postgres table is one of the most dangerous operations a backend dev can run without thinking. Postgres must validate every existing row against the constraint before committing the DDL. On a 10M-row table under normal write load, this validation holds an ACCESS EXCLUSIVE lock for the entire duration — blocking all reads and writes. On a 50M-row table, that’s 3–8 minutes of full table unavailability depending on disk I/O. Your connection pool exhausts in under 30 seconds. This isn’t a Postgres quirk; it’s the expected behavior for any constraint that requires a full table scan.

The Three-Phase NOT NULL Pattern

The correct approach splits the operation across three separate Alembic revisions deployed independently. The first revision adds the column as nullable, which is a metadata-only change and takes no long-lived lock (on Postgres 11+ a server-side default can also be added cheaply). The second revision backfills existing rows in batches, never as a single UPDATE over the full table. The third revision adds the constraint using NOT VALID, then validates it separately.

# Revision 1: add nullable, no lock
def upgrade():
    op.add_column('events',
        sa.Column('tenant_id', sa.Integer(), nullable=True))

# Revision 2: backfill in batches (run as separate migration)
def upgrade():
    op.execute("""
        UPDATE events SET tenant_id = 1
        WHERE tenant_id IS NULL AND id BETWEEN 1 AND 100000
    """)
    # repeat in batches via external script or loop

# Revision 3: add constraint NOT VALID, then validate
def upgrade():
    op.execute("""
        ALTER TABLE events
        ADD CONSTRAINT events_tenant_id_not_null
        CHECK (tenant_id IS NOT NULL) NOT VALID
    """)
    op.execute("""
        ALTER TABLE events
        VALIDATE CONSTRAINT events_tenant_id_not_null
    """)

The NOT VALID flag is the key: it adds the constraint without scanning existing rows, so the ACCESS EXCLUSIVE lock the ADD CONSTRAINT takes is held only for an instant rather than for the length of a full table scan. The subsequent VALIDATE CONSTRAINT acquires only a SHARE UPDATE EXCLUSIVE lock while it scans, allowing reads and writes to continue. This is the difference between a 30-second outage and a zero-downtime constraint addition. Named constraints (note the explicit name events_tenant_id_not_null) are also non-negotiable in production: unnamed constraints can't be reliably referenced in downgrade scripts or future migrations.
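
Because the constraint is named, the matching schema-only downgrade is a one-liner (sketched against the revision above):

def downgrade():
    op.drop_constraint('events_tenant_id_not_null', 'events', type_='check')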

Alembic in CI/CD: The Check Command and Migration Linting

Most teams run alembic upgrade head in CI and call it tested. The actual best practice for Alembic migrations in a CI pipeline starts with alembic check — a command that runs the autogenerate diff without creating any files, returning a non-zero exit code if uncommitted model changes exist. This catches the scenario where a developer updates a SQLAlchemy model, forgets to generate a migration, and pushes code that will silently diverge from the DB schema in the next environment it touches.
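
One lightweight way to wire this into a Python test suite (a sketch: assumes Alembic 1.9 or newer, where the check command was added, an alembic.ini at the repo root, and a reachable database for the diff):

# test_migrations.py
from alembic import command
from alembic.config import Config

def test_models_and_migrations_in_sync():
    cfg = Config("alembic.ini")
    # Raises if autogenerate would emit new operations, i.e. a model change
    # exists with no corresponding committed revision.
    command.check(cfg)

Failing here is identical to alembic check failing on the command line; the test form just keeps the signal next to the rest of your suite's output.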

Migration Linting with Squawk

Squawk is a Postgres migration linter that catches dangerous DDL before it hits prod — things like adding a NOT NULL column without a default, creating an index without CONCURRENTLY, or dropping a column that still has application references. Integrating Squawk into the CI pipeline means that the same commit that introduces a dangerous migration fails the build before any human review is needed. On teams shipping 3–5 schema changes per week, this catches an average of 1–2 dangerous migrations per sprint — not hypothetical, that’s a real number from teams running Squawk in production CI.

Env-Specific Alembic Configuration

Running the same alembic.ini across local, staging, and production is how version mismatches propagate. The correct setup maintains separate script_location and sqlalchemy.url per environment, with production configs never committed to the repository — injected via environment variables at runtime. The env.py file reads os.environ.get("DATABASE_URL") and fails loudly if it’s missing, rather than falling back to a hardcoded local URL that would silently connect to the wrong database.
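
A minimal env.py sketch of that fail-loudly behavior (the variable name and error message are illustrative):

# env.py
import os
from alembic import context

config = context.config

database_url = os.environ.get("DATABASE_URL")
if not database_url:
    raise RuntimeError("DATABASE_URL is not set; refusing to fall back to a local default")
config.set_main_option("sqlalchemy.url", database_url)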

Data Migrations vs Schema Migrations: Keep Them Separate

Mixing schema changes and data transforms in the same Alembic revision is the source of the longest and most unpredictable migration runtimes in production. A schema-only migration — adding a column, creating an index CONCURRENTLY — is fast and predictable. A data migration — backfilling 20M rows, computing derived values, normalizing a denormalized column — is slow, I/O-bound, and should never hold a schema lock while it runs. Treating alembic migration strategy as two distinct tracks — DDL revisions and DML revisions — makes each revision predictable in isolation.

SQLite and batch_alter_table

SQLite doesn’t support ALTER COLUMN or DROP COLUMN natively. Alembic’s op.batch_alter_table works around this by copying the table to a temp table, applying changes, and swapping — but this copy operation is O(n) on table size. In a test environment with 100 rows it’s invisible. In a staging SQLite DB with 500K rows imported for integration testing, it’s a 45-second hang that makes your test suite look flaky. If your production DB is Postgres, don’t run integration tests against SQLite — the behavioral difference between Alembic’s SQLite workarounds and its Postgres path is large enough to miss real production bugs entirely.
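
For reference, the batch-mode form that the SQLite path forces on you looks like this (table and column names are illustrative):

def upgrade():
    # batch mode copies the table to a temp table, applies the change, and swaps it in
    with op.batch_alter_table('orders') as batch_op:
        batch_op.alter_column('user_ref', new_column_name='user_id')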

FAQ

Why does alembic autogenerate produce an empty migration when I changed my model?

The most common cause is that target_metadata in env.py isn’t pointing to the correct SQLAlchemy Base.metadata. Autogenerate compares the live DB against whatever metadata object you give it — if that object doesn’t include your updated model, it sees no difference and generates a pass-through revision. The second cause is that your model file isn’t imported anywhere in the Alembic environment before autogenerate runs, so the metadata is empty. Fix: explicitly import all model modules in env.py before the run_migrations call.
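
A sketch of that fix; the package and module names here are hypothetical:

# env.py
from myapp.db import Base        # the declarative Base all models register on
import myapp.models.users        # noqa: F401  (imported only for the side effect)
import myapp.models.orders       # noqa: F401

target_metadata = Base.metadata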

How do Alembic migration best practices handle a migration stuck on Postgres?

A stuck migration is almost always a lock wait — another connection holds a lock on the table you’re trying to alter, and your DDL is queued behind it. Run SELECT pid, query, wait_event_type, wait_event FROM pg_stat_activity WHERE state = 'active' on the DB to find the blocking query. If you set lock_timeout in your migration session — op.execute("SET lock_timeout = '3s'") at the top of upgrade() — the migration aborts instead of hanging indefinitely, which is far safer than a migration that holds a lock queue for 20 minutes during peak traffic.
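
Setting both timeouts at the top of the revision is cheap insurance; the values and the column being added here are illustrative:

def upgrade():
    op.execute("SET lock_timeout = '3s'")        # abort instead of queueing behind a lock
    op.execute("SET statement_timeout = '60s'")  # cap any single DDL statement
    op.add_column('orders', sa.Column('notes', sa.Text(), nullable=True))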

Is alembic downgrade safe to run on a production database?

For pure schema rollbacks — dropping a column you just added, removing an index — downgrade is safe if the downgrade() function is correctly written and tested. For any migration that touched data, downgrade is not a recovery mechanism. If your upgrade() backfilled 8M rows with computed values, your downgrade() can null them out but can’t reconstruct the original state. The honest answer is: test every downgrade in staging against a production-sized dataset before you ever need it in an incident, and treat data migrations as one-way operations by design.

Why does “Target database is not up to date” appear when I run alembic revision --autogenerate?

This error means the database is not at the latest head in your versions/ directory: there is at least one revision file that hasn't been applied, or the DB was stamped to a revision that no longer exists. It usually happens when someone ran alembic revision locally, created a file, and never ran alembic upgrade head to apply it. Run alembic current to see what revision the DB thinks it's on, then alembic history to see where that sits in your revision chain. From there, either apply the outstanding revisions with alembic upgrade head, or, after manually verifying the schema already matches, stamp the DB to the correct head.

How do you handle multiple Alembic heads from parallel feature branches?

Two developers on separate branches both run alembic revision — both revisions set down_revision to the same parent hash, creating a fork. Running alembic upgrade head fails because Alembic can’t determine which head is the correct target. The fix is alembic merge heads, which generates a new merge revision with both fork hashes in its down_revision tuple. The merge revision’s upgrade() needs to handle any conflicts between the two branches — if both branches touched the same table, that conflict resolution is manual, not automatic.

What’s the safest way to run Alembic migrations in a Kubernetes deployment?

Run migrations as a Kubernetes Job or init container that completes before the new application pods start — never as part of the application startup code inside the main container. If migrations run inside app startup, every replica races to run alembic upgrade head simultaneously, which causes lock contention and duplicate migration attempts. The Job approach ensures exactly-once execution with a clear success/failure signal that the deployment controller can gate on. Add lock_timeout and statement_timeout to your migration session to prevent a single stuck migration from blocking the entire deployment pipeline indefinitely.
