---
description: "Write a disaster recovery plan for a service or system — covering RPO/RTO targets, failure scenario runbooks, backup and restore procedures, DR testing cadence, and communication templates. Use when asked to write a DR plan, document failover procedures, create recovery runbooks, define RTO/RPO targets, or prepare for a disaster recovery game day. Produces a full DR document with per-scenario recovery runbooks, backup validation procedures, testing schedule, and communication templates."
globs:
alwaysApply: false
---

# Disaster Recovery Plan Skill

Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.

## Required Inputs

Ask for these if not already provided:
- **Service name** and what it does (business function and technical role)
- **Criticality tier** — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
- **Current infrastructure setup** — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
- **RPO/RTO requirements** — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long the service can be down)
- **Backup strategy** — what is backed up, how often, where backups are stored, retention policy
- **On-call contacts** — names and contact details for the responder chain

## Output Format

---

# Disaster Recovery Plan: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**Criticality tier:** [Tier 1 / Tier 2 / Tier 3] | **Last tested:** [Date]
**Next DR test:** [Date] | **Document owner:** [Name]
**Last updated:** [Date] | **Review cycle:** Quarterly

> **Emergency? Skip to Section 3 — Failure Scenario Runbooks.** Find the scenario that matches your situation and follow the steps exactly.

---

## 1. Recovery Targets

| Target | Value | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction — database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven — backup interval must be ≤ RPO] |

**What these mean in practice:**
- If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
- The service must be operational again within [Y minutes/hours] of declaring a DR event.
- If either target cannot be met, escalate to [Engineering Manager] immediately.

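
The rule in the table above, that the backup interval must not exceed the RPO, can be checked mechanically before the plan is signed off. The following is a minimal sketch, assuming both values are expressed in whole minutes; the function name `check_backup_interval` is hypothetical and not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that the backup interval does not exceed the RPO.
# Both arguments are whole minutes. Returns non-zero when the rule is violated.
check_backup_interval() {
  local backup_interval_min="$1"
  local rpo_min="$2"
  if [ "$backup_interval_min" -le "$rpo_min" ]; then
    echo "OK: backup interval ${backup_interval_min} min <= RPO ${rpo_min} min"
  else
    echo "VIOLATION: backup interval ${backup_interval_min} min > RPO ${rpo_min} min"
    return 1
  fi
}

# Example: hourly backups against a 30-minute RPO fail the check.
check_backup_interval 60 30 || echo "Escalate: RPO cannot be met with current backups"
```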
---

## 2. Failure Scenario Inventory

| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|---|---|---|---|---|---|
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 — no data loss] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |

---

## 3. Failure Scenario Runbooks

### 3.1 Single Availability Zone Failure

**Trigger:** One AZ becomes unreachable — pods/instances in that zone stop responding.
**Detection:** PagerDuty alert `[AlertName]` fires, or cloud provider status page shows AZ degradation.
**Expected RTO:** [15 minutes] | **Expected RPO:** Zero (no data loss if multi-AZ replication is working)

**Step 1 — Confirm the failure**
```bash
# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running

# Check which nodes are affected (header plus every node whose STATUS is not exactly Ready;
# a plain "grep -v Ready" would also hide NotReady nodes)
kubectl get nodes -o wide | awk 'NR==1 || $2 != "Ready"'

# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
```

**Step 2 — Assess whether auto-recovery has occurred**
```bash
# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch

# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]

# Verify load balancer health checks are passing
[cloud provider CLI command to check target group health]
```

**Step 3 — Force rescheduling if auto-recovery stalled**
```bash
# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]

# Drain the node — moves all pods to healthy nodes
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data

# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]
```

**Step 4 — Verify service health**
```bash
# Smoke test key endpoints
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]

# Check error rate in monitoring
[dashboard link or query]
```

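
The Step 4 smoke test is a single request, and freshly rescheduled pods may need a little time before they pass health checks. One possible way to poll instead is sketched below. The function name `wait_for_healthy`, its default attempt count, and its default delay are assumptions for illustration; only standard `curl` options are used.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a health URL until it returns HTTP 200 or attempts run out.
# Usage: wait_for_healthy <url> [max_attempts] [sleep_seconds]
wait_for_healthy() {
  local url="$1"
  local max_attempts="${2:-10}"
  local sleep_seconds="${3:-5}"
  local attempt=1
  local code=""
  while [ "$attempt" -le "$max_attempts" ]; do
    # --max-time keeps a hung endpoint from stalling the loop;
    # "|| true" keeps a connection failure from aborting scripts run with set -e
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "healthy after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  echo "still unhealthy after ${max_attempts} attempts (last status: ${code})"
  return 1
}

# Example (placeholder URL):
# wait_for_healthy "https://[service-url]/health" 12 5
```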
**Recovery confirmed when:** All pods are Running, health check returns 200, error rate is at baseline.

---

### 3.2 Full Region Failure

**Trigger:** The primary region is entirely unavailable.
**Detection:** All service health checks failing, cloud provider status page confirms region-wide event.
**Expected RTO:** [60 minutes] | **Expected RPO:** [5 minutes — based on cross-region replication lag]

**Step 1 — Confirm regional failure (5 minutes)**
```bash
# Confirm the primary region is unreachable (-c 3 stops after three probes instead of running forever)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"

# Check replication lag on standby region database
[command to check replica lag — e.g. for RDS: query the ReplicaLag CloudWatch metric for [dr-replica-identifier] in [dr-region]]
```

**Step 2 — Declare DR event and notify (2 minutes)**

Post to `#incidents`:
```
🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes
```

Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.

**Step 3 — Promote DR database (10 minutes)**
```bash
# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Record the new database endpoint
aws rds describe-db-instances \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region] \
  --query 'DBInstances[0].Endpoint.Address'
```

**Step 4 — Deploy service in DR region (20 minutes)**
```bash
# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
  DATABASE_URL=[new-dr-database-url] \
  -n [namespace] \
  --context [dr-region-context]

# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
  -n [namespace] \
  --context [dr-region-context]

# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]
```

**Step 5 — Cut over DNS / load balancer (5 minutes)**
```bash
# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-failover-dns.json

# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
```

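
Step 5 references a `dr-failover-dns.json` change batch without showing its contents. The sketch below generates one, assuming the service name is a plain CNAME record. The function name, record name, target hostname, and TTL are hypothetical placeholders, and an alias record pointing at a load balancer would need an `AliasTarget` block in place of `TTL` and `ResourceRecords`. Prepare and review the file ahead of time so nobody has to author JSON during an incident.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a Route 53 change batch that repoints a CNAME at the DR load balancer.
# Usage: write_failover_batch <output-file> <record-name> <dr-target-hostname> [ttl-seconds]
write_failover_batch() {
  local out_file="$1"
  local record_name="$2"
  local dr_target="$3"
  local ttl="${4:-60}"
  cat > "$out_file" <<EOF
{
  "Comment": "DR failover: repoint ${record_name} to DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [
          { "Value": "${dr_target}" }
        ]
      }
    }
  ]
}
EOF
}

# Example (placeholder values):
# write_failover_batch dr-failover-dns.json api.example.com dr-lb.example.net 60
```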
**Step 6 — Verify end-to-end**
```bash
# Full smoke test against DR endpoint
curl -s https://[service-url]/health
[run automated smoke test suite if available]
```

**Recovery confirmed when:** DNS resolves to DR region, smoke tests pass, error rate is at baseline.

**Post-failover actions (not urgent — after service is stable):**
- Do not fail back to primary until root cause is confirmed resolved
- Document data loss window (check replication lag at time of failure)
- Begin post-incident review — see [incident-postmortem skill]

---

### 3.3 Database Corruption or Data Loss

**Trigger:** Data in the database is corrupted, deleted, or otherwise incorrect due to a software bug, operator error, or hardware fault.
**Detection:** Application errors referencing missing/invalid data, monitoring alerts on query error rate, user reports.
**Expected RTO:** [90 minutes] | **Expected RPO:** [Backup interval — e.g. 1 hour]

**Step 1 — Stop the bleeding immediately**
```bash
# Put the service into maintenance mode to prevent further writes to corrupted data
[command to enable maintenance mode — e.g. kubectl set env deployment/[name] MAINTENANCE_MODE=true]

# Or: scale down the service to zero to prevent writes
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
```

**Step 2 — Assess scope of corruption**
```bash
# Identify which tables/records are affected
[SQL query to check data integrity — e.g.]
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [table] WHERE [integrity check condition]"

# Determine when corruption started (cross-reference with deploy times and error logs)
[log query to find earliest error — e.g. in Datadog:]
# service:[service-name] status:error "[corruption error message]" | sort by timestamp asc
```

**Step 3 — Identify the correct restore point**
```bash
# List available backups
[command to list backups — e.g. for RDS:]
aws rds describe-db-snapshots \
  --db-instance-identifier [db-identifier] \
  --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
  --output table

# Choose the most recent backup BEFORE corruption started
# Record the chosen snapshot ID: [snapshot-id]
```

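
Picking "the most recent backup BEFORE corruption started" by eye from a table is error-prone under pressure. The sketch below does the selection in a pipeline. It assumes input lines of the form `timestamp snapshot-id`, such as the Step 3 query could produce with `--output text`, and it assumes every timestamp is ISO 8601 in UTC so that plain string ordering matches time ordering. The function name `latest_snapshot_before` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "timestamp<whitespace>snapshot-id" lines on stdin and print the ID
# of the newest snapshot created strictly before the cutoff timestamp.
# Assumes ISO 8601 UTC timestamps, which sort correctly as plain strings.
# Exits non-zero when no snapshot predates the cutoff.
latest_snapshot_before() {
  local cutoff="$1"
  LC_ALL=C sort | LC_ALL=C awk -v cutoff="$cutoff" '
    # Appending "" forces a string comparison rather than a numeric one
    (($1 "") < (cutoff "")) { id = $2 }
    END { if (id == "") exit 1; print id }
  '
}

# Example (placeholder cutoff = earliest known corruption time):
# aws rds describe-db-snapshots ... --output text | latest_snapshot_before "2024-05-01T07:30:00"
```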
**Step 4 — Restore from backup**
```bash
# Restore to a NEW database instance (never overwrite production directly)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier [service-name]-restored-[date] \
  --db-snapshot-identifier [snapshot-id] \
  --region [region]

# Wait for restore to complete
aws rds wait db-instance-available \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region]

# Get the restored instance endpoint
aws rds describe-db-instances \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region] \
  --query 'DBInstances[0].Endpoint.Address'
```

**Step 5 — Validate restored data**
```bash
# Connect to restored database and verify integrity (-h sets the host; a bare first argument is read as the database name)
psql -h [restored-db-endpoint] -U [user] -d [database] -c "[data integrity query]"

# Confirm record counts match expectations
psql -h [restored-db-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
```

**Step 6 — Point service at restored database**
```bash
kubectl set env deployment/[service-name] \
  DATABASE_URL=postgres://[user]:[pass]@[restored-endpoint]/[db] \
  -n [namespace]

kubectl scale deployment/[service-name] --replicas=[N] -n [namespace]
```

**Recovery confirmed when:** Service is running against restored database, data integrity checks pass, error rate is at baseline.

---

### 3.4 Critical Dependency Outage

**Trigger:** A service that [service name] depends on is unavailable or degraded.
**Detection:** Increased error rate or latency on endpoints that call [dependency], alerts from dependency owner.
**Expected RTO:** Depends on dependency — [30 minutes for mitigation, resolution depends on dependency owner]

**Dependency map:**

| Dependency | Criticality | Degraded behaviour | Mitigation |
|---|---|---|---|
| [Database] | Critical — all writes fail | Full outage | Activate DR database (Section 3.2, Step 3) |
| [Cache — Redis] | High — latency increases | Performance degradation | Bypass cache, serve from DB |
| [Auth service] | Critical — auth fails | All authenticated endpoints fail | Return cached tokens (if implemented) |
| [Message queue] | Medium — async processing delays | Writes succeed, async jobs queue | Queue backlog — see on-call runbook |
| [External API — name] | Low — feature X unavailable | Graceful degradation | Feature flag to disable feature X |

**Mitigation steps:**
```bash
# Enable circuit breaker / fallback for [dependency] if implemented
kubectl set env deployment/[service-name] [DEPENDENCY]_CIRCUIT_BREAKER=open -n [namespace]

# Enable feature flag to disable [dependency-backed feature]
[feature flag CLI command or dashboard link]

# Check if dependency has a status page
# [Dependency status URL]
```

**Escalation:** Contact [dependency] on-call via [PagerDuty / Slack `#[channel]`]. Share your service's error rate and the time dependency errors started.

---

### 3.5 Security Breach or Ransomware

**Trigger:** Evidence of unauthorized access, data exfiltration, or encryption of service data.
**Detection:** Security tooling alert, unusual access patterns, user reports of data exposure.
**Expected RTO:** [4+ hours — prioritise containment over speed] | **Expected RPO:** [Last verified clean backup]

**Step 1 — Isolate immediately**
```bash
# Take the service offline — do not attempt to recover while breach is active
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Revoke all API keys and service account credentials immediately
[command to rotate secrets — e.g. via Vault or cloud provider]

# Block all external access at network level
[firewall/security group command to deny all inbound traffic]
```

**Step 2 — Notify security team immediately**
Page [Security lead] via PagerDuty. Do NOT attempt to remediate without security team involvement.

Post to `#security-incidents` (private channel, not `#incidents`):
```
🔴 SECURITY INCIDENT — [Service Name]
Time detected: [Time]
Evidence: [One sentence — what was observed]
Actions taken: Service isolated, credentials revoked
Awaiting: Security team guidance
```

**Step 3 — Preserve evidence**
```bash
# Export current logs before any remediation
[log export command — preserve evidence for forensics]

# Snapshot the current state of all infrastructure
[snapshot/image command]
```

**Steps 4+ — Follow security team guidance.** Do not restore from backup until security team confirms the attack vector is closed.

---

### 3.6 Accidental Bulk Data Deletion

**Trigger:** An operator, script, or application bug has deleted records in bulk.
**Detection:** Sudden drop in record counts, user reports of missing data, application errors.
**Expected RTO:** [60 minutes] | **Expected RPO:** [Backup interval]

```bash
# Step 1 — Stop further writes immediately
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Step 2 — Determine what was deleted and when
# Tables with the highest dead-tuple counts are the likely targets of the bulk delete
# (pg_stat_user_tables names the table column relname, not tablename)
psql $DATABASE_URL -c "
SELECT schemaname, relname,
n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
"

# Step 3 — Check whether the delete is still uncommitted (PostgreSQL)
# A plain SELECT cannot see rows removed by a committed DELETE. Rows that are still
# visible with a non-zero xmax are locked, or were deleted by a transaction that is
# still open or was rolled back.
psql $DATABASE_URL -c "
SELECT xmax, * FROM [table]
WHERE xmax != 0
LIMIT 100;
"

# Step 4 — If the delete has committed, restore from backup
# Dead tuples stay on disk only until VACUUM runs and need low-level tooling to read
# Follow Section 3.3 (Database Corruption runbook) from Step 3 onward
```

---

## 4. Backup and Restore Procedures

### Backup Configuration

| Data store | Backup type | Frequency | Retention | Location |
|---|---|---|---|---|
| [Primary database] | Automated snapshots | Every [N] hours | [N] days | [S3 bucket / cloud storage path] |
| [Primary database] | Transaction log backups | Continuous | [N] days | [Location] |
| [Secondary store — e.g. Redis] | RDB dump | Daily | [N] days | [Location] |
| [Blob/object storage] | Cross-region replication | Continuous | [N] days | [DR region bucket] |
| [Config / secrets] | Terraform state + Vault backup | On change | Indefinite | [Location] |

### Backup Validation (Run Weekly)

```bash
# Test restore of latest database backup to a throwaway instance
# Fix the instance name once so restore, wait, and delete all target the same instance
TEST_ID=[service-name]-backup-test-$(date +%Y%m%d)
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier [db-id] \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEST_ID" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# Wait for restore, then run integrity checks
aws rds wait db-instance-available --db-instance-identifier "$TEST_ID"
psql -h [test-instance-endpoint] -U [user] -d [database] -c "[integrity check query]"

# Confirm row counts match recent production values (allow ≤ RPO difference)
psql -h [test-instance-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"

# Destroy the test instance
aws rds delete-db-instance \
  --db-instance-identifier "$TEST_ID" \
  --skip-final-snapshot
```

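
The row-count comparison in the validation script can be turned into a pass/fail check so that the weekly job alerts instead of relying on someone reading the output. The sketch below expresses the allowed difference as a percentage of the production baseline. That percentage is an assumption standing in for "rows written during one RPO window", and the function name `row_count_within_tolerance` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare a restored row count against a production baseline.
# Succeeds when the absolute difference is at most <tolerance_pct> percent of the baseline.
# Usage: row_count_within_tolerance <baseline_count> <restored_count> <tolerance_pct>
row_count_within_tolerance() {
  local baseline="$1"
  local restored="$2"
  local tolerance_pct="$3"
  local diff=$((baseline - restored))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  # Integer form of diff / baseline <= tolerance_pct / 100
  [ $((diff * 100)) -le $((baseline * tolerance_pct)) ]
}

# Example: 990 restored rows against a baseline of 1000 passes at 2 percent.
row_count_within_tolerance 1000 990 2 && echo "row count within tolerance"
```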
---

## 5. DR Testing Cadence

Regular testing is mandatory. An untested DR plan is not a DR plan.

| Test type | Frequency | Who runs it | Pass criteria |
|---|---|---|---|
| Backup restore validation | Weekly (automated) | On-call rotation | Restore completes, integrity checks pass |
| Zone failover drill | Monthly | Engineering team | RTO target met, zero data loss |
| Region failover drill | Quarterly | Engineering + SRE | RTO/RPO targets met |
| Full DR game day | Annually | Engineering + stakeholders | All scenarios exercised, gaps documented |
| Chaos engineering (infra failures) | Weekly (automated) | Chaos engineering tooling | Service degrades gracefully, recovers automatically |

### Game Day Procedure

1. **Pre-game day (1 week before):** Notify all stakeholders, freeze production changes for the day, prepare DR environment.
2. **Scope definition:** Choose 2–3 scenarios from Section 2. Document expected outcomes before the test.
3. **Execute:** One person acts as incident commander, others execute the runbook steps, and a separate observer times each step.
4. **Measure:** Record actual RTO and RPO against targets for each scenario.
5. **Debrief (same day):** Document gaps, runbook inaccuracies, and automation opportunities.
6. **Action items:** File tickets for every gap found. Priority: P1 items must be fixed before next game day.

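
The "Measure" step is easier to do consistently if the observer records two timestamps and lets a script compute the result. The sketch below assumes Unix epoch seconds captured with `date +%s` when the DR event is declared and when recovery is confirmed; the function name `rto_result` is hypothetical. Partial minutes are rounded up so that a recovery taking 15 minutes and 1 second does not count as meeting a 15-minute target.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a measured recovery met its RTO target.
# Usage: rto_result <declared_epoch_s> <recovered_epoch_s> <target_minutes>
rto_result() {
  local declared="$1"
  local recovered="$2"
  local target_min="$3"
  # Round partial minutes up
  local actual_min=$(( (recovered - declared + 59) / 60 ))
  local verdict="MET"
  if [ "$actual_min" -gt "$target_min" ]; then
    verdict="MISSED"
  fi
  echo "RTO target: ${target_min} min, actual: ${actual_min} min, ${verdict}"
}

# Example:
# start=$(date +%s)
# ...run the runbook...
# rto_result "$start" "$(date +%s)" 15
```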
---

## 6. Communication Plan

### Internal Communication During DR Event

**Incident commander responsibilities:**
- Declare the DR event and open the incident channel
- Post updates every 15 minutes minimum
- Make the call to fail over (do not let the team decide by committee)
- Notify business stakeholders of expected recovery time

**Notify these people at DR event start:**

| Role | Name | Contact | When to notify |
|---|---|---|---|
| Engineering manager | [Name] | [Slack / Phone] | Immediately |
| CTO / VP Engineering | [Name] | [Phone] | Tier 1 services: immediately |
| Customer success lead | [Name] | [Slack] | If customer-facing impact |
| Security lead | [Name] | [Slack / PagerDuty] | If breach suspected |
| Legal / compliance | [Name] | [Email / Phone] | If data loss involves PII |

### Communication Templates

**DR event declared:**
```
🔴 DR EVENT — [Service Name]
Time: [HH:MM UTC]
Scenario: [Zone failure / Region failure / Data loss / etc.]
Impact: [Who is affected and how]
RTO target: [X minutes]
Incident commander: [Name]
War room: [Slack channel / call link]
Next update: [Time + 15 min]
```

**Status update (every 15 minutes):**
```
🔴 DR UPDATE — [Service Name] — [HH:MM UTC]
Status: [Investigating / Executing recovery / Verifying]
Progress: [One sentence on current step]
Blockers: [Any — or "None"]
Updated RTO estimate: [Time]
Next update: [Time + 15 min]
```

**Recovery confirmed:**
```
✅ DR RESOLVED — [Service Name] — [HH:MM UTC]
Total downtime: [X minutes]
Data loss: [None / X minutes of transactions]
RTO target: [X min] — Actual: [Y min] — [MET / MISSED]
RPO target: [X min] — Actual: [Y min] — [MET / MISSED]
Root cause: [One sentence]
Post-incident review: [Scheduled for / Link when created]
```

---

## 7. DR Readiness Checklist

Run this checklist quarterly and before any major infrastructure change:

**Backups:**
- [ ] Automated backups are running and alerts fire if they fail
- [ ] Most recent backup restore was tested within the last 7 days
- [ ] Backup frequency meets the RPO, and backup retention meets compliance requirements
- [ ] Backups are stored in a separate region / account from primary

**Failover infrastructure:**
- [ ] DR region / environment exists and is provisioned (not just documented)
- [ ] DNS failover procedure is documented with exact commands
- [ ] DR database replica is current (replication lag is within RPO)
- [ ] Service can be deployed in DR region with a single command or automated pipeline

**Runbooks:**
- [ ] All runbooks in Section 3 have been tested within the last quarter
- [ ] Runbook commands have been verified against current infrastructure (no stale references)
- [ ] Contact list is current (no departed employees)

**Access:**
- [ ] On-call engineers have access to DR region console / CLI
- [ ] Service account credentials for DR region are provisioned and tested
- [ ] Break-glass accounts exist for emergency access if SSO is unavailable

**Monitoring:**
- [ ] Monitoring exists in DR region (not just primary)
- [ ] Alerts fire correctly when DR environment has issues

---

## Quality Checks

- [ ] RPO and RTO targets are specific numbers, not ranges, and are agreed with the business
- [ ] Every command in every runbook has been run by a human in the last quarter, not copied untested from documentation
- [ ] DR database exists in the DR region and replication lag is monitored
- [ ] Backup restore has been tested end-to-end within the last 7 days
- [ ] The game day schedule is on the team calendar — not just documented here
- [ ] Contact list contains current phone numbers, not just Slack handles (Slack may be down during a DR event)
- [ ] Security breach runbook (3.5) explicitly names the security team contact and does not attempt self-remediation
- [ ] RTO/RPO targets are visible in the monitoring dashboard so actual vs. target is measurable in real time

## Anti-Patterns

- [ ] Do not write runbook commands without testing them — an untested command in a runbook is actively dangerous during a real disaster when cognitive load is highest
- [ ] Do not set RTO/RPO targets without business sign-off — technical teams often set aspirational targets that do not reflect actual business cost tolerance for downtime
- [ ] Do not include only the "happy path" of each failover scenario — runbooks must explicitly cover what to do when the recovery step itself fails
- [ ] Do not list Slack handles as the only escalation contact — Slack may be unavailable during a region-wide failure; phone numbers are mandatory
- [ ] Do not schedule DR game days without pre-committing to fix the gaps found — a game day that produces action items no one owns is theater, not preparedness