Guard against schema drift (readiness 503 + loud startup log) #239
Defense-in-depth follow-up to today's outage.
What happened: a backend image shipped ahead of an un-applied migration. The `Tree` model selected `ai_member_provider`/`ai_recommender_provider`, the DB didn't have them yet, so every trees query 500'd with `UndefinedColumnError` and the UI showed no trees. The root cause (Watchtower deploys not running migrations; the running backend had no `RUN_MIGRATIONS=1`) is already fixed on prod by recreating backend/worker from the current compose. This PR makes the symptom impossible to miss next time.

Change
- `app/core/schema_version.py`: compares the DB's stamped Alembic head against the head(s) baked into the image's migration scripts. A DB with no `alembic_version` table (e.g. a `create_all` test DB) is treated as current, so the check stays quiet outside real deployments. Uses `to_regclass` so a missing table never errors or poisons the transaction.
- `GET /health/ready` → 503 with an explicit `drift: db=… expected=…` message when the schema is behind, instead of reporting ready and serving 500s.

Safety
Liveness (`/health`) is untouched, so a drifted container is not killed into a crash-loop: it's loudly degraded and self-heals the moment migrations apply.

Tests
`tests/test_schema_drift.py`: head discovery + drift detection (no-table → quiet, wrong rev → drift, correct head → current). Full suite: 91 passing.

🤖 Generated with Claude Code
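
For reviewers, here is a rough sketch of the comparison at the heart of the drift check and how its result could map onto the readiness response. It is an illustration under stated assumptions, not the code in `app/core/schema_version.py`: the names `SchemaStatus`, `check_schema`, and `readiness` are hypothetical, the revision IDs are made up, and the way revisions are joined into the `drift:` message is a guess at the format. In a real deployment the expected heads would presumably come from Alembic's `ScriptDirectory.get_heads()`, and the DB revisions from `alembic_version` after a `to_regclass('alembic_version')` existence probe; the sketch takes both as plain inputs so it runs without a database.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class SchemaStatus:
    """Result of comparing the DB's stamped revision(s) with the image's heads."""
    current: bool
    detail: str


def check_schema(
    db_revs: Optional[FrozenSet[str]],
    expected_heads: FrozenSet[str],
) -> SchemaStatus:
    """Pure comparison step (hypothetical helper).

    db_revs is None when the alembic_version table does not exist,
    e.g. a test DB built with create_all. That case is treated as
    current so the check stays quiet outside real deployments.
    """
    if db_revs is None:
        return SchemaStatus(True, "no alembic_version table; treated as current")
    if db_revs == expected_heads:
        return SchemaStatus(True, "schema current")
    # Assumption: any mismatch counts as drift, and revisions are
    # rendered as sorted, comma-joined IDs.
    db = ",".join(sorted(db_revs)) or "none"
    expected = ",".join(sorted(expected_heads))
    return SchemaStatus(False, f"drift: db={db} expected={expected}")


def readiness(status: SchemaStatus) -> Tuple[int, Dict[str, str]]:
    """Map the check onto an HTTP status and body for the readiness probe."""
    if status.current:
        return 200, {"status": "ready"}
    return 503, {"status": "not ready", "detail": status.detail}


if __name__ == "__main__":
    heads = frozenset({"abc123"})  # made-up revision ID
    for revs in (None, frozenset({"old111"}), heads):
        print(readiness(check_schema(revs, heads)))
```

Keeping the comparison free of I/O means the three cases the tests name (no table, wrong revision, correct head) can be exercised without a live database, while the liveness endpoint stays entirely out of the picture.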