49137bd1b6
Applies reviewer-feedback-driven improvements across 7 skills: - code-review-checklist: add Section 1 header, optional diff input, precise review time estimate, stronger quality checks - debugging-log-analyser: improve Context input, add Frequency input, add Section 1 Error Classification header, stronger quality checks - changelog-generator: add Previous Version Behaviour + Scope inputs, clarify Formatting Rules are skill-internal, stronger quality checks - pr-description-writer: add Target Branch + Linked Issue inputs, fix Screenshots omission instruction, stronger quality checks - test-strategy-doc: split Existing Coverage from Tech Stack, add Deployment Cadence input, fix Performance Tests conditional, stronger quality checks - runbook-writer: add Monitoring Tools + Key Environment Details inputs, fix Last Updated placeholder, stronger quality checks - incident-postmortem: add Responders + Customer Communications inputs Both skills/ and plugins/pm-engineering/skills/ copies updated. https://claude.ai/code/session_01C3HwChrccJd145vJ6Z7ajF
148 lines
5.2 KiB
Markdown
148 lines
5.2 KiB
Markdown
---
|
||
name: runbook-writer
|
||
description: "Write an operational runbook for a service, incident type, or deployment procedure. Use when asked to write a runbook, create an ops guide, document an operational procedure, or prepare an incident response playbook. Produces a runbook with overview, prerequisites, step-by-step procedures, rollback steps, troubleshooting table, and escalation paths."
|
||
---
|
||
|
||
# Runbook Writer Skill
|
||
|
||
Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.
|
||
|
||
## Required Inputs
|
||
|
||
Ask for these if not provided:
|
||
- **What the runbook is for** (e.g. deploying the payment service, responding to a database failover, rotating API keys)
|
||
- **Runbook type** (Deployment / Incident Response / Maintenance / Disaster Recovery)
|
||
- **System/service name and what it does** (brief description)
|
||
- **Audience** (new on-call engineers / experienced SREs / DevOps team)
|
||
- **Tech stack** (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
|
||
- **Monitoring tools** (e.g. Grafana, Datadog, CloudWatch, Splunk — used to name specific dashboards and alert links in the steps)
|
||
- **Key environment details** (e.g. Kubernetes cluster name, AWS account/region, relevant namespaces or resource names — paste what's relevant for exact commands)
|
||
|
||
## Output Format
|
||
|
||
---
|
||
**Runbook:** [Runbook Title]
|
||
**Service:** [Service Name]
|
||
**Type:** [Deployment / Incident Response / Maintenance / DR]
|
||
**Last Updated:** [Insert today's date in YYYY-MM-DD format]
|
||
**Owner:** [Team or person]
|
||
**Severity:** [P1 / P2 / P3 — if incident-type]
|
||
|
||
---
|
||
|
||
### Overview
|
||
**What this runbook covers:**
|
||
[1–2 sentences on the scenario this runbook handles]
|
||
|
||
**When to use this runbook:**
|
||
- [Specific trigger condition 1 — e.g. PagerDuty alert: `high-error-rate-payment-service`]
|
||
- [Specific trigger condition 2 — e.g. Deploy needed after PR merged to `main`]
|
||
|
||
**Estimated time to complete:** [X minutes / X–Y minutes depending on outcome]
|
||
|
||
**Impact if not completed correctly:** [e.g. Payment processing degraded / Data loss risk / Users locked out]
|
||
|
||
---
|
||
|
||
### Prerequisites
|
||
|
||
**Access required:**
|
||
- [ ] [System/tool access — e.g. AWS Console: `production-account`]
|
||
- [ ] [Credential — e.g. `vault read secret/payment-service`]
|
||
- [ ] [VPN / bastion access if needed]
|
||
|
||
**Tools required:**
|
||
- [ ] [Tool name and version — e.g. `kubectl` v1.28+]
|
||
- [ ] [CLI or dashboard name]
|
||
|
||
**Before you start:**
|
||
- [ ] [Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
|
||
- [ ] [Prerequisite action — e.g. Announce in `#ops-live` that you're starting]
|
||
|
||
---
|
||
|
||
### Procedure
|
||
|
||
Number every step. Use exact commands. Do not paraphrase tool names or flags.
|
||
|
||
**Step 1: [Action name]**
|
||
[What you're doing and why — one sentence]
|
||
```bash
|
||
# Exact command
|
||
[command here]
|
||
```
|
||
**Expected output:** `[what should appear if this worked]`
|
||
**If this fails:** [Exact error message to look for] → [What to do, or see Troubleshooting]
|
||
|
||
**Step 2: [Action name]**
|
||
[Same structure as Step 1]
|
||
|
||
**Step 3: Verify**
|
||
Always include a verification step after the main procedure:
|
||
```bash
|
||
[verification command]
|
||
```
|
||
**Expected state:** [What a healthy system looks like after this runbook completes]
|
||
|
||
---
|
||
|
||
### Rollback
|
||
|
||
How to undo this procedure if something went wrong:
|
||
|
||
**Step R1: [Rollback action]**
|
||
```bash
|
||
[rollback command]
|
||
```
|
||
**Verify rollback:** `[command to confirm rollback succeeded]`
|
||
|
||
---
|
||
|
||
### Troubleshooting
|
||
|
||
| Symptom | Likely Cause | Resolution |
|
||
|---|---|---|
|
||
| [Error message or observable symptom] | [Why this happens] | [Exact fix or next step] |
|
||
| [Another symptom] | [Cause] | [Resolution] |
|
||
|
||
---
|
||
|
||
### Escalation
|
||
|
||
If this runbook does not resolve the issue:
|
||
|
||
| Condition | Who to Contact | How |
|
||
|---|---|---|
|
||
| [e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: `db-oncall`] |
|
||
| [e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: `vendor-escalation`] |
|
||
|
||
**Always update the incident timeline in [tool] before escalating.**
|
||
|
||
---
|
||
|
||
### Post-Procedure Checklist
|
||
|
||
After completing the runbook:
|
||
- [ ] Announce completion in `#ops-live` with outcome
|
||
- [ ] Update the incident ticket / deploy log
|
||
- [ ] Verify alerts have resolved in monitoring dashboard
|
||
- [ ] If this revealed a gap in this runbook — update it now (link to edit process)
|
||
|
||
---
|
||
|
||
## Quality Checks
|
||
- [ ] Every step has an exact command (no "run the deploy script")
|
||
- [ ] Expected output is specified for each step so engineer knows if it worked
|
||
- [ ] Failure path is explicit for each step (not "if it fails, investigate")
|
||
- [ ] Rollback procedure is complete and independently testable
|
||
- [ ] Escalation table has no cells containing only "[Team name]" — every row must either have a real contact or be explicitly flagged as [FILL IN: on-call rotation link]
|
||
- [ ] Rollback section contains at least one concrete command (not left as "[rollback command]" placeholder)
|
||
- [ ] Runbook can be followed by someone who has never touched this system
|
||
|
||
## Usage Examples
|
||
- "Write a runbook for [service] deployment"
|
||
- "Create an incident response runbook for [alert type]"
|
||
- "I need a runbook for [procedure]"
|
||
- "Document the operational procedure for [X]"
|
||
- "Write an ops playbook for [scenario]"
|