Files
pm-claude-skills/skills/runbook-writer/SKILL.md
T
mohitagw15856 49137bd1b6 Content quality improvements to 7 engineering skills (partial batch)
Applies reviewer-feedback-driven improvements across 7 skills:
- code-review-checklist: add Section 1 header, optional diff input, precise
  review time estimate, stronger quality checks
- debugging-log-analyser: improve Context input, add Frequency input,
  add Section 1 Error Classification header, stronger quality checks
- changelog-generator: add Previous Version Behaviour + Scope inputs,
  clarify Formatting Rules are skill-internal, stronger quality checks
- pr-description-writer: add Target Branch + Linked Issue inputs, fix
  Screenshots omission instruction, stronger quality checks
- test-strategy-doc: split Existing Coverage from Tech Stack, add
  Deployment Cadence input, fix Performance Tests conditional,
  stronger quality checks
- runbook-writer: add Monitoring Tools + Key Environment Details inputs,
  fix Last Updated placeholder, stronger quality checks
- incident-postmortem: add Responders + Customer Communications inputs

Both skills/ and plugins/pm-engineering/skills/ copies updated.

https://claude.ai/code/session_01C3HwChrccJd145vJ6Z7ajF
2026-05-20 12:06:26 +00:00

148 lines
5.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
name: runbook-writer
description: "Write an operational runbook for a service, incident type, or deployment procedure. Use when asked to write a runbook, create an ops guide, document an operational procedure, or prepare an incident response playbook. Produces a runbook with overview, prerequisites, step-by-step procedures, rollback steps, troubleshooting table, and escalation paths."
---
# Runbook Writer Skill
Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.
## Required Inputs
Ask for these if not provided:
- **What the runbook is for** (e.g. deploying the payment service, responding to a database failover, rotating API keys)
- **Runbook type** (Deployment / Incident Response / Maintenance / Disaster Recovery)
- **System/service name and what it does** (brief description)
- **Audience** (new on-call engineers / experienced SREs / DevOps team)
- **Tech stack** (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
- **Monitoring tools** (e.g. Grafana, Datadog, CloudWatch, Splunk — used to name specific dashboards and alert links in the steps)
- **Key environment details** (e.g. Kubernetes cluster name, AWS account/region, relevant namespaces or resource names — paste what's relevant for exact commands)
## Output Format
---
**Runbook:** [Runbook Title]
**Service:** [Service Name]
**Type:** [Deployment / Incident Response / Maintenance / DR]
**Last Updated:** [Insert today's date in YYYY-MM-DD format]
**Owner:** [Team or person]
**Severity:** [P1 / P2 / P3 — if incident-type]
---
### Overview
**What this runbook covers:**
[12 sentences on the scenario this runbook handles]
**When to use this runbook:**
- [Specific trigger condition 1 — e.g. PagerDuty alert: `high-error-rate-payment-service`]
- [Specific trigger condition 2 — e.g. Deploy needed after PR merged to `main`]
**Estimated time to complete:** [X minutes / XY minutes depending on outcome]
**Impact if not completed correctly:** [e.g. Payment processing degraded / Data loss risk / Users locked out]
---
### Prerequisites
**Access required:**
- [ ] [System/tool access — e.g. AWS Console: `production-account`]
- [ ] [Credential — e.g. `vault read secret/payment-service`]
- [ ] [VPN / bastion access if needed]
**Tools required:**
- [ ] [Tool name and version — e.g. `kubectl` v1.28+]
- [ ] [CLI or dashboard name]
**Before you start:**
- [ ] [Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
- [ ] [Prerequisite action — e.g. Announce in `#ops-live` that you're starting]
---
### Procedure
Number every step. Use exact commands. Do not paraphrase tool names or flags.
**Step 1: [Action name]**
[What you're doing and why — one sentence]
```bash
# Exact command
[command here]
```
**Expected output:** `[what should appear if this worked]`
**If this fails:** [Exact error message to look for] → [What to do, or see Troubleshooting]
**Step 2: [Action name]**
[Same structure as Step 1]
**Step 3: Verify**
Always include a verification step after the main procedure:
```bash
[verification command]
```
**Expected state:** [What a healthy system looks like after this runbook completes]
---
### Rollback
How to undo this procedure if something went wrong:
**Step R1: [Rollback action]**
```bash
[rollback command]
```
**Verify rollback:** `[command to confirm rollback succeeded]`
---
### Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| [Error message or observable symptom] | [Why this happens] | [Exact fix or next step] |
| [Another symptom] | [Cause] | [Resolution] |
---
### Escalation
If this runbook does not resolve the issue:
| Condition | Who to Contact | How |
|---|---|---|
| [e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: `db-oncall`] |
| [e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: `vendor-escalation`] |
**Always update the incident timeline in [tool] before escalating.**
---
### Post-Procedure Checklist
After completing the runbook:
- [ ] Announce completion in `#ops-live` with outcome
- [ ] Update the incident ticket / deploy log
- [ ] Verify alerts have resolved in monitoring dashboard
- [ ] If this revealed a gap in this runbook — update it now (link to edit process)
---
## Quality Checks
- [ ] Every step has an exact command (no "run the deploy script")
- [ ] Expected output is specified for each step so engineer knows if it worked
- [ ] Failure path is explicit for each step (not "if it fails, investigate")
- [ ] Rollback procedure is complete and independently testable
- [ ] Escalation table has no cells containing only "[Team name]" — every row must either have a real contact or be explicitly flagged as [FILL IN: on-call rotation link]
- [ ] Rollback section contains at least one concrete command (not left as "[rollback command]" placeholder)
- [ ] Runbook can be followed by someone who has never touched this system
## Usage Examples
- "Write a runbook for [service] deployment"
- "Create an incident response runbook for [alert type]"
- "I need a runbook for [procedure]"
- "Document the operational procedure for [X]"
- "Write an ops playbook for [scenario]"