Guard against schema drift (readiness 503 + loud startup log) #239

Merged
justin merged 1 commit from schema-drift-guard into main 2026-06-09 21:56:09 -04:00

Defense-in-depth follow-up to today's outage.

What happened: a backend image shipped ahead of an un-applied migration. The Tree model selected ai_member_provider/ai_recommender_provider, which the DB didn't have yet, so every trees query 500'd with UndefinedColumnError and the UI showed no trees. The root cause (Watchtower deploys not running migrations; the running backend had no RUN_MIGRATIONS=1) is already fixed on prod by recreating backend/worker from the current compose. This PR makes the symptom impossible to miss next time.

Change

  • app/core/schema_version.py: compares the DB's stamped Alembic head against the head(s) baked into the image's migration scripts. A DB with no alembic_version table (e.g. a create_all test DB) is treated as current, so the check stays quiet outside real deployments. Uses to_regclass so a missing table never errors / poisons the transaction.
  • GET /health/ready → 503 with an explicit drift: db=… expected=… message when the schema is behind, instead of reporting ready and serving 500s.
  • Startup lifespan logs CRITICAL on drift (advisory — never blocks startup).
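
For reviewers who want the comparison in concrete terms, here is a minimal sketch of the decision logic. It is not the code in app/core/schema_version.py. `SchemaStatus`, `evaluate_schema`, and the two SQL constants are hypothetical names, and the revision ids in the message are only one way to fill in the `db=… expected=…` format. The sketch also assumes that any mismatch counts as drift; telling "behind" apart from "ahead" would need a walk of the revision graph.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical query strings. PostgreSQL's to_regclass() returns NULL for a
# missing relation instead of raising, so the caller's transaction is not aborted.
TABLE_EXISTS_SQL = "SELECT to_regclass('alembic_version')"
STAMPED_REVISIONS_SQL = "SELECT version_num FROM alembic_version"


@dataclass(frozen=True)
class SchemaStatus:
    current: bool
    message: str


def evaluate_schema(
    db_revisions: Optional[FrozenSet[str]],
    expected_heads: FrozenSet[str],
) -> SchemaStatus:
    """Compare revisions stamped in the DB with the heads baked into the image.

    db_revisions is None when the alembic_version table does not exist.
    """
    if db_revisions is None:
        # e.g. a create_all test DB: stay quiet outside real deployments.
        return SchemaStatus(True, "no alembic_version table; treated as current")
    if db_revisions == expected_heads:
        return SchemaStatus(True, "current")
    # Assumption: any mismatch is reported as drift.
    db = ",".join(sorted(db_revisions)) or "<empty>"
    expected = ",".join(sorted(expected_heads))
    return SchemaStatus(False, f"drift: db={db} expected={expected}")
```

Called with `None`, a matching set, and a mismatched set, this covers the three cases the tests describe: no table is quiet, the correct head is current, and a wrong revision is drift.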

Safety

  • Liveness (/health) is untouched, so a drifted container is not killed into a crash-loop — it's loudly degraded and self-heals the moment migrations apply.
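
The split between liveness and readiness can be sketched without any framework, as below. This is an illustration under assumptions, not the handlers in this PR. `liveness`, `readiness`, `warn_on_startup`, the logger name, and the response bodies are all hypothetical; only the 200/503 behaviour and the CRITICAL log level come from the description above.

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger("schema_drift_sketch")  # hypothetical logger name


def liveness() -> Tuple[int, dict]:
    """Liveness ignores schema state, so drift never triggers a restart."""
    return 200, {"status": "ok"}


def readiness(drift: Optional[str]) -> Tuple[int, dict]:
    """Readiness returns 503 while a drift message is present."""
    if drift is not None:
        return 503, {"status": "not ready", "detail": drift}
    return 200, {"status": "ready"}


def warn_on_startup(drift: Optional[str]) -> None:
    """Advisory only: log CRITICAL and return, so startup continues."""
    if drift is not None:
        logger.critical("schema drift detected: %s", drift)
```

Because `readiness` is recomputed on every probe, it flips back to 200 as soon as the drift message clears, which is the self-healing behaviour described above.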

Tests

  • New tests/test_schema_drift.py: head discovery + drift detection (no-table → quiet, wrong rev → drift, correct head → current). Full suite 91 passing.

🤖 Generated with Claude Code

justin added 1 commit 2026-06-09 21:55:39 -04:00
Defense-in-depth for the deploy pipeline. Today a backend image shipped ahead
of an un-applied migration; the Tree model selected columns the DB didn't have
yet, so every trees query 500'd with an opaque UndefinedColumnError and the UI
showed no trees. The root cause (deploys not running migrations) is fixed
separately; this makes the *symptom* impossible to miss.

- app/core/schema_version.py: compare the DB's stamped alembic head to the
  head(s) baked into the image's migration scripts. A DB with no alembic_version
  table (e.g. a create_all test DB) is treated as current, so this stays quiet
  outside real deployments. Uses to_regclass so a missing table never poisons
  the caller's transaction.
- /health/ready: returns 503 with an explicit "drift: db=… expected=…" message
  when the schema is behind, instead of reporting ready and serving 500s.
- Startup lifespan: logs CRITICAL on drift (advisory — never blocks startup).

Liveness (/health) is untouched, so a drifted container isn't killed into a
crash-loop — it's loudly degraded and self-heals once migrations apply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Paul <justin@jpaul.me>
justin merged commit 6fbad3106d into main 2026-06-09 21:56:09 -04:00
justin deleted branch schema-drift-guard 2026-06-09 21:56:09 -04:00