New skills added to pm-engineering bundle (now 10 skills total): - debugging-log-analyser: stack trace → structured root cause diagnosis + fix - pr-description-writer: diff/commits → reviewer-ready PR description - system-design-interview: full system design with capacity, components, trade-offs - changelog-generator: git log → polished Keep a Changelog entry - test-strategy-doc: spec/PRD → complete test strategy with P0/P1 test cases - runbook-writer: operational runbooks with exact commands, rollback, and escalation README updates: - 5 shields.io badges (stars, skill count, version, install, license) - "See It in Action" demo section - pm-engineering added to Quick Install list - Star Milestone Tracker (100/250/500/1000 stars roadmap) - Engineering table extended from 4 to 10 skills (41–50) - Article 14 link resolved from remote merge Config updates: - marketplace.json: v6.0.0 → v7.0.0, "106 skills" - pm-engineering plugin.json: v1.0.0 → v2.0.0 New file: SKILL_REQUEST.md — community skill voting board Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4.7 KiB
name, description
| name | description |
|---|---|
| runbook-writer | Write an operational runbook for a service, incident type, or deployment procedure. Use when asked to write a runbook, create an ops guide, document an operational procedure, or prepare an incident response playbook. Produces a runbook with overview, prerequisites, step-by-step procedures, rollback steps, troubleshooting table, and escalation paths. Optimised for Opus 4.7 and newer models. |
Runbook Writer Skill
Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.
Required Inputs
Ask for these if not provided:
- What the runbook is for (e.g. deploying the payment service, responding to a database failover, rotating API keys)
- Runbook type (Deployment / Incident Response / Maintenance / Disaster Recovery)
- System/service name and what it does (brief description)
- Audience (new on-call engineers / experienced SREs / DevOps team)
- Tech stack (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
Output Structure
Runbook: [Runbook Title] Service: [Service Name] Type: [Deployment / Incident Response / Maintenance / DR] Last Updated: [Date] Owner: [Team or person] Severity: [P1 / P2 / P3 — if incident-type]
Overview
What this runbook covers: [1–2 sentences on the scenario this runbook handles]
When to use this runbook:
- [Specific trigger condition 1 — e.g. PagerDuty alert:
high-error-rate-payment-service] - [Specific trigger condition 2 — e.g. Deploy needed after PR merged to
main]
Estimated time to complete: [X minutes / X–Y minutes depending on outcome]
Impact if not completed correctly: [e.g. Payment processing degraded / Data loss risk / Users locked out]
Prerequisites
Access required:
- [System/tool access — e.g. AWS Console:
production-account] - [Credential — e.g.
vault read secret/payment-service] - [VPN / bastion access if needed]
Tools required:
- [Tool name and version — e.g.
kubectlv1.28+] - [CLI or dashboard name]
Before you start:
- [Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
- [Prerequisite action — e.g. Announce in
#ops-livethat you're starting]
Procedure
Number every step. Use exact commands. Do not paraphrase tool names or flags.
Step 1: [Action name] [What you're doing and why — one sentence]
# Exact command
[command here]
Expected output: [what should appear if this worked]
If this fails: [Exact error message to look for] → [What to do, or see Troubleshooting]
Step 2: [Action name] [Same structure as Step 1]
Step 3: Verify Always include a verification step after the main procedure:
[verification command]
Expected state: [What a healthy system looks like after this runbook completes]
Rollback
How to undo this procedure if something went wrong:
Step R1: [Rollback action]
[rollback command]
Verify rollback: [command to confirm rollback succeeded]
Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| [Error message or observable symptom] | [Why this happens] | [Exact fix or next step] |
| [Another symptom] | [Cause] | [Resolution] |
Escalation
If this runbook does not resolve the issue:
| Condition | Who to Contact | How |
|---|---|---|
| [e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: db-oncall] |
| [e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: vendor-escalation] |
Always update the incident timeline in [tool] before escalating.
Post-Procedure Checklist
After completing the runbook:
- Announce completion in
#ops-livewith outcome - Update the incident ticket / deploy log
- Verify alerts have resolved in monitoring dashboard
- If this revealed a gap in this runbook — update it now (link to edit process)
Quality Checks
- Every step has an exact command (no "run the deploy script")
- Expected output is specified for each step so engineer knows if it worked
- Failure path is explicit for each step (not "if it fails, investigate")
- Rollback procedure is complete and independently testable
- Escalation paths name specific contacts, not just team names
- Runbook can be followed by someone who has never touched this system
Example Trigger Phrases
- "Write a runbook for [service] deployment"
- "Create an incident response runbook for [alert type]"
- "I need a runbook for [procedure]"
- "Document the operational procedure for [X]"
- "Write an ops playbook for [scenario]"