pm-claude-skills/exports/cursor/pm-engineering/disaster-recovery-plan/disaster-recovery-plan.mdc

---
description: "Write a disaster recovery plan for a service or system — covering RPO/RTO targets, failure scenario runbooks, backup and restore procedures, DR testing cadence, and communication templates. Use when asked to write a DR plan, document failover procedures, create recovery runbooks, define RTO/RPO targets, or prepare for a disaster recovery game day. Produces a full DR document with per-scenario recovery runbooks, backup validation procedures, testing schedule, and communication templates."
globs:
alwaysApply: false
---

# Disaster Recovery Plan Skill

Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.

## Required Inputs

Ask for these if not already provided:
- **Service name** and what it does (business function and technical role)
- **Criticality tier** — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
- **Current infrastructure setup** — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
- **RPO/RTO requirements** — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long can it be down)
- **Backup strategy** — what is backed up, how often, where backups are stored, retention policy
- **On-call contacts** — names and contact details for the responder chain

## Output Format

---

# Disaster Recovery Plan: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**Criticality tier:** [Tier 1 / Tier 2 / Tier 3] | **Last tested:** [Date]
**Next DR test:** [Date] | **Document owner:** [Name]
**Last updated:** [Date] | **Review cycle:** Quarterly

> **Emergency? Skip to Section 3 — Failure Scenario Runbooks.** Find the scenario that matches your situation and follow the steps exactly.

---

## 1. Recovery Targets

| Target | Value | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction — database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven — backup interval must be ≤ RPO] |

**What these mean in practice:**
- If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
- The service must be operational again within [Y minutes/hours] of declaring a DR event.
- If either target cannot be met, escalate to [Engineering Manager] immediately.

---

## 2. Failure Scenario Inventory

| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|---|---|---|---|---|---|
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 — no data loss] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |

---

## 3. Failure Scenario Runbooks

### 3.1 Single Availability Zone Failure

**Trigger:** One AZ becomes unreachable — pods/instances in that zone stop responding.
**Detection:** PagerDuty alert `[AlertName]` fires, or cloud provider status page shows AZ degradation.
**Expected RTO:** [15 minutes] | **Expected RPO:** Zero (no data loss if multi-AZ replication is working)

**Step 1 — Confirm the failure**
```bash
# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running

# Check which nodes are affected (exact match on STATUS, so NotReady nodes stay in the output)
kubectl get nodes -o wide | awk 'NR == 1 || $2 != "Ready"'

# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
```

**Step 2 — Assess whether auto-recovery has occurred**
```bash
# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch

# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]

# Verify load balancer health checks are passing
[cloud provider CLI command to check target group health]
```

**Step 3 — Force rescheduling if auto-recovery stalled**
```bash
# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]

# Drain the node — moves all pods to healthy nodes (--timeout stops drain from waiting forever on an unreachable node)
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data --timeout=120s

# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]
```

**Step 4 — Verify service health**
```bash
# Smoke test key endpoints
curl -s -o /dev/null -w "%{http_code}\n" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}\n" https://[service-url]/[critical-endpoint]

# Check error rate in monitoring
[dashboard link or query]
```

**Recovery confirmed when:** All pods are Running, health check returns 200, error rate is at baseline.
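
The confirmation criteria above are easier to apply under pressure if the health probe is retried automatically and timed. The following is a minimal sketch, not part of any existing tooling: `wait_until_healthy` is a hypothetical helper that re-runs whatever probe command it is given until the probe succeeds or a timeout expires.

```bash
# Hypothetical helper: re-run a probe command until it succeeds or the
# timeout (seconds) expires. Prints the elapsed seconds on success.
# Arguments: timeout seconds, retry interval seconds, then the probe command.
wait_until_healthy() {
  timeout_s=$1
  interval_s=$2
  shift 2
  start=$(date +%s)
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Example (not run here): poll the health endpoint for up to 15 minutes
# wait_until_healthy 900 10 curl -fsS --max-time 5 "https://[service-url]/health"
```

The elapsed time it prints can be recorded as the measured recovery time for this scenario.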

---

### 3.2 Full Region Failure

**Trigger:** The primary region is entirely unavailable.
**Detection:** All service health checks failing, cloud provider status page confirms region-wide event.
**Expected RTO:** [60 minutes] | **Expected RPO:** [5 minutes — based on cross-region replication lag]

**Step 1 — Confirm regional failure (5 minutes)**
```bash
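# Confirm the primary region is unreachable (-c bounds the probe; ICMP may be blocked, so a failed ping alone is not proof)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"

# Check replication lag on standby region database
# e.g. for RDS, ReplicaLag is a CloudWatch metric reported in seconds:
aws cloudwatch get-metric-statistics --region [dr-region] \
  --namespace AWS/RDS --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=[dr-replica-identifier] \
  --start-time [start, ISO 8601] --end-time [end, ISO 8601] \
  --period 60 --statistics Maximum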
```

**Step 2 — Declare DR event and notify (2 minutes)**

Post to `#incidents`:
```
🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes
```

Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.

**Step 3 — Promote DR database (10 minutes)**
```bash
# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Record the new database endpoint
aws rds describe-db-instances \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region] \
  --query 'DBInstances[0].Endpoint.Address'
```

**Step 4 — Deploy service in DR region (20 minutes)**
```bash
# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
  DATABASE_URL=[new-dr-database-url] \
  -n [namespace] \
  --context [dr-region-context]

# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
  -n [namespace] \
  --context [dr-region-context]

# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]
```

**Step 5 — Cut over DNS / load balancer (5 minutes)**
```bash
# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-failover-dns.json

# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
```
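
Step 5 references `dr-failover-dns.json` without showing its contents. The sketch below prints one plausible change batch, assuming the service record is a plain CNAME pointing at the DR load balancer; an alias record would use an `AliasTarget` block instead. `print_failover_batch` is a hypothetical helper and the hostnames in the example are placeholders.

```bash
# Hypothetical helper: print a Route 53 change batch that repoints a CNAME.
# Arguments: record name, DR load balancer hostname, TTL in seconds.
print_failover_batch() {
  cat <<EOF
{
  "Comment": "DR failover: repoint $1 to the DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": $3,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}

# Example (placeholder hostnames):
# print_failover_batch api.example.com dr-lb.example.net 60 > dr-failover-dns.json
```

Generating and reviewing this file before an incident, rather than during one, keeps the cutover step to a single command.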

**Step 6 — Verify end-to-end**
```bash
# Full smoke test against DR endpoint
curl -fsS https://[service-url]/health
[run automated smoke test suite if available]
```

**Recovery confirmed when:** DNS resolves to DR region, smoke tests pass, error rate is at baseline.

**Post-failover actions (not urgent — after service is stable):**
- Do not fail back to primary until root cause is confirmed resolved
- Document data loss window (check replication lag at time of failure)
- Begin post-incident review — see [incident-postmortem skill]

---

### 3.3 Database Corruption or Data Loss

**Trigger:** Data in the database is corrupted, deleted, or otherwise incorrect due to a software bug, operator error, or hardware fault.
**Detection:** Application errors referencing missing/invalid data, monitoring alerts on query error rate, user reports.
**Expected RTO:** [90 minutes] | **Expected RPO:** [Backup interval — e.g. 1 hour]

**Step 1 — Stop the bleeding immediately**
```bash
# Put the service into maintenance mode to prevent further writes to corrupted data
[command to enable maintenance mode — e.g. kubectl set env deployment/[name] MAINTENANCE_MODE=true]

# Or: scale down the service to zero to prevent writes
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
```

**Step 2 — Assess scope of corruption**
```bash
# Identify which tables/records are affected
[SQL query to check data integrity — e.g.]
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [table] WHERE [integrity check condition]"

# Determine when corruption started (cross-reference with deploy times and error logs)
[log query to find earliest error — e.g. in Datadog:]
# service:[service-name] status:error "[corruption error message]" | sort by timestamp asc
```

**Step 3 — Identify the correct restore point**
```bash
# List available backups
[command to list backups — e.g. for RDS:]
aws rds describe-db-snapshots \
  --db-instance-identifier [db-identifier] \
  --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
  --output table

# Choose the most recent backup BEFORE corruption started
# Record the chosen snapshot ID: [snapshot-id]
```
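
Picking "the most recent backup before corruption started" by eye is error-prone when the list is long. A minimal sketch of that selection follows. `pick_snapshot_before` is a hypothetical helper, and it assumes every timestamp (including the cutoff) uses the same ISO 8601 format and time zone, so that string order equals time order.

```bash
# Hypothetical helper: read "<create-time> <snapshot-id>" lines on stdin and
# print the ID of the newest snapshot created strictly before the cutoff.
# Exits non-zero if no snapshot predates the cutoff.
pick_snapshot_before() {
  awk -v cutoff="$1" '
    $1 < cutoff && $1 > best { best = $1; id = $2 }
    END { if (id == "") exit 1; print id }
  '
}

# Example (not run here): feed it the snapshot listing as text
# aws rds describe-db-snapshots --db-instance-identifier [db-identifier] \
#   --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
#   --output text | pick_snapshot_before "[corruption start time]"
```

Pass the cutoff in exactly the timestamp format the listing prints, and still confirm the chosen snapshot ID with a second person before restoring.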

**Step 4 — Restore from backup**
```bash
# Restore to a NEW database instance (never overwrite production directly)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier [service-name]-restored-[date] \
  --db-snapshot-identifier [snapshot-id] \
  --region [region]

# Wait for restore to complete
aws rds wait db-instance-available \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region]

# Get the restored instance endpoint
aws rds describe-db-instances \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region] \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text
```

**Step 5 — Validate restored data**
```bash
# Connect to restored database and verify integrity (-h sets the host; a bare argument would be read as a database name)
psql -h [restored-db-endpoint] -U [user] -d [database] -c "[data integrity query]"

# Confirm record counts match expectations
psql -h [restored-db-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
```

**Step 6 — Point service at restored database**
```bash
kubectl set env deployment/[service-name] \
  DATABASE_URL=postgres://[user]:[pass]@[restored-endpoint]/[db] \
  -n [namespace]

kubectl scale deployment/[service-name] --replicas=[N] -n [namespace]
```

**Recovery confirmed when:** Service is running against restored database, data integrity checks pass, error rate is at baseline.

---

### 3.4 Critical Dependency Outage

**Trigger:** A service that [service name] depends on is unavailable or degraded.
**Detection:** Increased error rate or latency on endpoints that call [dependency], alerts from dependency owner.
**Expected RTO:** Depends on dependency — [30 minutes for mitigation, resolution depends on dependency owner]

**Dependency map:**

| Dependency | Criticality | Degraded behaviour | Mitigation |
|---|---|---|---|
| [Database] | Critical — all writes fail | Full outage | Activate DR database (Section 3.2, Step 3) |
| [Cache — Redis] | High — latency increases | Performance degradation | Bypass cache, serve from DB |
| [Auth service] | Critical — auth fails | All authenticated endpoints fail | Return cached tokens (if implemented) |
| [Message queue] | Medium — async processing delays | Writes succeed, async jobs queue | Queue backlog — see on-call runbook |
| [External API — name] | Low — feature X unavailable | Graceful degradation | Feature flag to disable feature X |

**Mitigation steps:**
```bash
# Enable circuit breaker / fallback for [dependency] if implemented
kubectl set env deployment/[service-name] [DEPENDENCY]_CIRCUIT_BREAKER=open -n [namespace]

# Enable feature flag to disable [dependency-backed feature]
[feature flag CLI command or dashboard link]

# Check if dependency has a status page
# [Dependency status URL]
```

**Escalation:** Contact [dependency] on-call via [PagerDuty / Slack `#[channel]`]. Share your service's error rate and the time dependency errors started.

---

### 3.5 Security Breach or Ransomware

**Trigger:** Evidence of unauthorized access, data exfiltration, or encryption of service data.
**Detection:** Security tooling alert, unusual access patterns, user reports of data exposure.
**Expected RTO:** [4+ hours — prioritise containment over speed] | **Expected RPO:** [Last verified clean backup]

**Step 1 — Isolate immediately**
```bash
# Take the service offline — do not attempt to recover while breach is active
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Revoke all API keys and service account credentials immediately
[command to rotate secrets — e.g. via Vault or cloud provider]

# Block all external access at network level
[firewall/security group command to deny all inbound traffic]
```

**Step 2 — Notify security team immediately**
Page [Security lead] via PagerDuty. Do NOT attempt to remediate without security team involvement.

Post to `#security-incidents` (private channel, not `#incidents`):
```
🔴 SECURITY INCIDENT — [Service Name]
Time detected: [Time]
Evidence: [One sentence — what was observed]
Actions taken: Service isolated, credentials revoked
Awaiting: Security team guidance
```

**Step 3 — Preserve evidence**
```bash
# Export current logs before any remediation
[log export command — preserve evidence for forensics]

# Snapshot the current state of all infrastructure
[snapshot/image command]
```

**Steps 4+ — Follow security team guidance.** Do not restore from backup until security team confirms the attack vector is closed.

---

### 3.6 Accidental Bulk Data Deletion

**Trigger:** An operator, script, or application bug has deleted records in bulk.
**Detection:** Sudden drop in record counts, user reports of missing data, application errors.
**Expected RTO:** [60 minutes] | **Expected RPO:** [Backup interval]

```bash
# Step 1 — Stop further writes immediately
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Step 2 — Determine what was deleted and when
psql $DATABASE_URL -c "
  SELECT schemaname, relname,
         n_tup_del, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC LIMIT 10;
"

# Step 3 — Check whether the delete actually committed (PostgreSQL)
# Committed deletes are invisible to ordinary queries, even before VACUUM runs.
# Rows that are still visible with a non-zero xmax were locked or deleted by a
# transaction that has not committed (or was rolled back), so they are not lost.
psql $DATABASE_URL -c "
  SELECT xmin, xmax, * FROM [table]
  WHERE xmax != 0
  LIMIT 100;
"

# Step 4 — If the delete committed, restore from backup
# Follow Section 3.3 (Database Corruption runbook) from Step 3 onward
```

---

## 4. Backup and Restore Procedures

### Backup Configuration

| Data store | Backup type | Frequency | Retention | Location |
|---|---|---|---|---|
| [Primary database] | Automated snapshots | Every [N] hours | [N] days | [S3 bucket / cloud storage path] |
| [Primary database] | Transaction log backups | Continuous | [N] days | [Location] |
| [Secondary store — e.g. Redis] | RDB dump | Daily | [N] days | [Location] |
| [Blob/object storage] | Cross-region replication | Continuous | [N] days | [DR region bucket] |
| [Config / secrets] | Terraform state + Vault backup | On change | Indefinite | [Location] |

### Backup Validation (Run Weekly)

```bash
# Name the throwaway instance once so every step targets the same instance
TEST_ID="[service-name]-backup-test-$(date +%Y%m%d)"

# Test restore of latest database backup to a throwaway instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEST_ID" \
  --db-snapshot-identifier $(aws rds describe-db-snapshots \
    --db-instance-identifier [db-id] \
    --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
    --output text)

# Wait for the restore to complete before running integrity checks
aws rds wait db-instance-available --db-instance-identifier "$TEST_ID"

# Run integrity checks (-h sets the host; a bare argument would be read as a database name)
psql -h [test-instance-endpoint] -U [user] -d [database] -c "[integrity check query]"

# Confirm row counts match recent production values (allow ≤ RPO difference)
psql -h [test-instance-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"

# Destroy the test instance
aws rds delete-db-instance \
  --db-instance-identifier "$TEST_ID" \
  --skip-final-snapshot
```
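
The row-count comparison in the validation above can be turned into a pass/fail check. This is a sketch under a stated assumption: the table is append-only, so the restored copy may lag production by the rows written inside the RPO window but should never hold more rows. `row_count_ok` is a hypothetical helper, and the tolerance is something the team has to estimate from its own write rate.

```bash
# Hypothetical helper: succeeds if the restored row count trails production
# by no more than max_missing rows (assumes an append-only table).
# Arguments: production count, restored count, maximum tolerated difference.
row_count_ok() {
  missing=$(( $1 - $2 ))
  [ "$missing" -ge 0 ] && [ "$missing" -le "$3" ]
}

# Example (not run here): fail the weekly job if more than 500 rows are missing
# row_count_ok "$prod_count" "$restored_count" 500 || echo "Backup validation FAILED"
```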

---

## 5. DR Testing Cadence

Regular testing is mandatory. An untested DR plan is not a DR plan.

| Test type | Frequency | Who runs it | Pass criteria |
|---|---|---|---|
| Backup restore validation | Weekly (automated) | On-call rotation | Restore completes, integrity checks pass |
| Zone failover drill | Monthly | Engineering team | RTO target met, zero data loss |
| Region failover drill | Quarterly | Engineering + SRE | RTO/RPO targets met |
| Full DR game day | Annually | Engineering + stakeholders | All scenarios exercised, gaps documented |
| Chaos engineering (infra failures) | Weekly (automated) | Chaos engineering tooling | Service degrades gracefully, recovers automatically |

### Game Day Procedure

1. **Pre-game day (1 week before):** Notify all stakeholders, freeze production changes for the day, prepare DR environment.
2. **Scope definition:** Choose 2–3 scenarios from Section 2. Document expected outcomes before the test.
3. **Execute:** One person acts as incident commander, others execute the runbook steps, and a dedicated observer times each step.
4. **Measure:** Record actual RTO and RPO against targets for each scenario.
5. **Debrief (same day):** Document gaps, runbook inaccuracies, and automation opportunities.
6. **Action items:** File tickets for every gap found. Priority: P1 items must be fixed before next game day.
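
The Measure step is simpler when the observer records timestamps and derives the verdict from them. A minimal sketch, assuming times are captured as epoch seconds (for example with `date +%s`) and targets are whole minutes; `elapsed_minutes` and `verdict` are hypothetical helper names.

```bash
# Hypothetical helper: whole minutes between two epoch-second timestamps,
# rounded up so a partial minute counts against the target.
# Arguments: start epoch seconds, end epoch seconds.
elapsed_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Hypothetical helper: compare an actual value with its target (both minutes).
# Arguments: actual, target. Prints MET or MISSED.
verdict() {
  if [ "$1" -le "$2" ]; then echo MET; else echo MISSED; fi
}

# Example (not run here): declared and recovered are epoch seconds noted by the observer
# actual=$(elapsed_minutes "$declared" "$recovered")
# echo "RTO actual: ${actual} min, target 15 min: $(verdict "$actual" 15)"
```

Rounding up is a deliberately conservative choice: a recovery that takes 15 minutes and one second is reported as 16 minutes and misses a 15-minute target.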

---

## 6. Communication Plan

### Internal Communication During DR Event

**Incident commander responsibilities:**
- Declare the DR event and open the incident channel
- Post updates every 15 minutes minimum
- Make the call to fail over (do not let the team decide by committee)
- Notify business stakeholders of expected recovery time

**Notify these people at DR event start:**

| Role | Name | Contact | When to notify |
|---|---|---|---|
| Engineering manager | [Name] | [Slack / Phone] | Immediately |
| CTO / VP Engineering | [Name] | [Phone] | Tier 1 services: immediately |
| Customer success lead | [Name] | [Slack] | If customer-facing impact |
| Security lead | [Name] | [Slack / PagerDuty] | If breach suspected |
| Legal / compliance | [Name] | [Email / Phone] | If data loss involves PII |

### Communication Templates

**DR event declared:**
```
🔴 DR EVENT — [Service Name]
Time: [HH:MM UTC]
Scenario: [Zone failure / Region failure / Data loss / etc.]
Impact: [Who is affected and how]
RTO target: [X minutes]
Incident commander: [Name]
War room: [Slack channel / call link]
Next update: [Time + 15 min]
```

**Status update (every 15 minutes):**
```
🔴 DR UPDATE — [Service Name] — [HH:MM UTC]
Status: [Investigating / Executing recovery / Verifying]
Progress: [One sentence on current step]
Blockers: [Any — or "None"]
Updated RTO estimate: [Time]
Next update: [Time + 15 min]
```

**Recovery confirmed:**
```
✅ DR RESOLVED — [Service Name] — [HH:MM UTC]
Total downtime: [X minutes]
Data loss: [None / X minutes of transactions]
RTO target: [X min] — Actual: [Y min] — [MET / MISSED]
RPO target: [X min] — Actual: [Y min] — [MET / MISSED]
Root cause: [One sentence]
Post-incident review: [Scheduled for / Link when created]
```

---

## 7. DR Readiness Checklist

Run this checklist quarterly and before any major infrastructure change:

**Backups:**
- [ ] Automated backups are running and alerts fire if they fail
- [ ] Most recent backup restore was tested within the last 7 days
- [ ] Backup retention meets RPO and compliance requirements
- [ ] Backups are stored in a separate region / account from primary

**Failover infrastructure:**
- [ ] DR region / environment exists and is provisioned (not just documented)
- [ ] DNS failover procedure is documented with exact commands
- [ ] DR database replica is current (replication lag is within RPO)
- [ ] Service can be deployed in DR region with a single command or automated pipeline

**Runbooks:**
- [ ] All runbooks in Section 3 have been tested within the last quarter
- [ ] Runbook commands have been verified against current infrastructure (no stale references)
- [ ] Contact list is current (no departed employees)

**Access:**
- [ ] On-call engineers have access to DR region console / CLI
- [ ] Service account credentials for DR region are provisioned and tested
- [ ] Break-glass accounts exist for emergency access if SSO is unavailable

**Monitoring:**
- [ ] Monitoring exists in DR region (not just primary)
- [ ] Alerts fire correctly when DR environment has issues

---

## Quality Checks

- [ ] RPO and RTO targets are specific numbers, not ranges, and are agreed with the business
- [ ] Every command in every runbook has been run by a human in the last quarter — not copied from documentation untested
- [ ] DR database exists in the DR region and replication lag is monitored
- [ ] Backup restore has been tested end-to-end within the last 7 days
- [ ] The game day schedule is on the team calendar — not just documented here
- [ ] Contact list contains current phone numbers, not just Slack handles (Slack may be down during a DR event)
- [ ] Security breach runbook (3.5) explicitly names the security team contact and does not attempt self-remediation
- [ ] All thresholds (RTO/RPO) are visible in the monitoring dashboard so actual vs. target is measurable in real time

## Anti-Patterns

- [ ] Do not write runbook commands without testing them — an untested command in a runbook is actively dangerous during a real disaster when cognitive load is highest
- [ ] Do not set RTO/RPO targets without business sign-off — technical teams often set aspirational targets that do not reflect actual business cost tolerance for downtime
- [ ] Do not include only the "happy path" of each failover scenario — runbooks must explicitly cover what to do when the recovery step itself fails
- [ ] Do not list Slack handles as the only escalation contact — Slack may be unavailable during a region-wide failure; phone numbers are mandatory
- [ ] Do not schedule DR game days without pre-committing to fix the gaps found — a game day that produces action items no one owns is theater, not preparedness