05b6d799f0
Three more learnings from alirezarezvani/claude-skills, applied: 1. SkillCheck validator (scripts/skillcheck.mjs) — validates every SKILL.md against the authoring standard (frontmatter, name/folder match, trigger + produces clauses, required headings) plus tier referential integrity. Errors fail CI; --strict fails on warnings too. New skillcheck.yml workflow and a SkillCheck status badge in the README. Current: 0 errors / 14 advisory warnings across 172 skills. 2. Cursor export platform — build-exports.mjs now generates exports/cursor/<bundle>/<skill>/<skill>.mdc rule files. The PLATFORMS registry now supports per-skill filenames (file as a function). 3. Per-agent installers — scripts/install.sh unifies install for claude/hermes/codex/openclaw/cursor (--link, --target, --dry-run, --list). Curl-able one-liners codex-install.sh, openclaw-install.sh, and cursor-install.sh clone the library and install in a single command. README documents the one-line installs and Cursor exports; CHANGELOG and the authoring standard updated. Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>
157 lines
6.3 KiB
Plaintext
157 lines
6.3 KiB
Plaintext
---
|
||
description: "Write a structured incident postmortem or post-incident review. Use when asked to write a postmortem, incident report, P1/P2 review, outage report, or RCA (root cause analysis). Generates a blameless postmortem with timeline, root cause, contributing factors, impact summary, and action items."
|
||
globs:
|
||
alwaysApply: false
|
||
---
|
||
|
||
# Incident Postmortem Skill
|
||
|
||
This skill produces a complete, blameless incident postmortem document following industry-standard format. Output enforces blameless framing throughout — system gaps over individual failures — and drives toward specific, closeable action items rather than vague process commitments.
|
||
|
||
## Required Inputs
|
||
|
||
Ask the user for these if not provided:
|
||
- **Incident title / ID**
|
||
- **Severity** (P1 / P2 / P3 or SEV1 / SEV2 / SEV3)
|
||
- **Date and duration** of the incident
|
||
- **What happened** (rough notes are fine — the skill will structure them)
|
||
- **Services or systems affected**
|
||
- **Customer impact** (how many users, what was degraded)
|
||
- **How it was detected**
|
||
- **How it was resolved**
|
||
- **Initial thoughts on root cause**
|
||
- **Action items already identified** (optional)
|
||
- **Responders** (who was on-call or responded — names or roles; used for the timeline, not for blame)
|
||
- **Customer or external communications sent** (optional — any status page updates, emails, or support messages with timestamps)
|
||
|
||
## Output Format
|
||
|
||
---
|
||
|
||
# Incident Postmortem: [Incident Title]
|
||
|
||
**Incident ID:** [ID]
|
||
**Severity:** [P1/P2/P3]
|
||
**Date:** [Date]
|
||
**Duration:** [Start time → Resolution time — total duration]
|
||
**Status:** [Resolved / Monitoring / Ongoing]
|
||
**Author:** [Leave blank for user to fill]
|
||
**Last updated:** [Date]
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
[3–5 sentences. Describe what happened, who was affected, and what was done to resolve it. Written for a non-technical stakeholder. No jargon. No blame.]
|
||
|
||
---
|
||
|
||
## Impact
|
||
|
||
| Dimension | Details |
|
||
|---|---|
|
||
| **Users affected** | [Number or percentage] |
|
||
| **Services degraded** | [List affected services] |
|
||
| **Business impact** | [Revenue, SLA breach, support tickets, etc. if known] |
|
||
| **Duration** | [Total time from first detection to full resolution] |
|
||
|
||
---
|
||
|
||
## Timeline
|
||
|
||
List events in chronological order. Each entry: `[HH:MM UTC] — [What happened. Who did what. What changed.]`
|
||
|
||
Rules for timeline entries:
|
||
- Use passive or system-focused language — avoid "X made a mistake"
|
||
- Include: first symptom, detection, escalation, hypothesis tested, fix applied, confirmation of resolution
|
||
- Note time between key events (e.g. "22 minutes between detection and escalation")
|
||
|
||
---
|
||
|
||
## Root Cause
|
||
|
||
**Primary root cause:** [One clear sentence. Technical but plain. "A misconfigured deployment config caused..."]
|
||
|
||
**Contributing factors:**
|
||
- [Factor 1 — e.g. lack of canary deployment meant change hit 100% of traffic immediately]
|
||
- [Factor 2 — e.g. alert threshold was set too high to catch the initial degradation]
|
||
- [Factor 3 — add as many as are relevant]
|
||
|
||
**Why did our existing safeguards not prevent this?**
|
||
[Honest paragraph explaining why monitoring, tests, or processes didn't catch this earlier. This is where blameless analysis matters most — focus on system gaps, not individual failures.]
|
||
|
||
---
|
||
|
||
## Detection
|
||
|
||
- **How was it first detected?** [Customer report / automated alert / internal monitoring / manual observation]
|
||
- **Time from incident start to detection:** [X minutes]
|
||
- **Should we have detected this faster?** [Yes / No — and why]
|
||
|
||
---
|
||
|
||
## Resolution
|
||
|
||
**What fixed it?** [Clear description of the actual fix — one paragraph]
|
||
**Why did this work?** [Brief technical explanation]
|
||
**Was there a temporary mitigation before full resolution?** [Yes/No — describe if yes]
|
||
|
||
---
|
||
|
||
## Action Items
|
||
|
||
| # | Action | Owner | Due Date | Priority |
|
||
|---|---|---|---|---|
|
||
| 1 | [Specific, testable action] | [Team or person] | [Date] | P1/P2/P3 |
|
||
|
||
Rules for action items:
|
||
- Each action must be specific enough to close as "done" or "not done" — no vague items like "improve monitoring"
|
||
- Distinguish between: **Prevent recurrence** (fix the root cause), **Improve detection** (catch it faster next time), **Improve response** (resolve it faster next time)
|
||
- Assign a real owner — not "team" or "TBD" if avoidable
|
||
- Flag P1 actions as items that block the incident from being marked fully closed
|
||
|
||
---
|
||
|
||
## What Went Well
|
||
|
||
[3–5 honest observations about the response. Include: fast collaboration, good runbooks used, effective escalation, clear communication. This section builds team confidence and reinforces good habits.]
|
||
|
||
---
|
||
|
||
## Lessons Learned
|
||
|
||
[3–5 key insights from this incident that are worth sharing beyond this team. Write these as transferable lessons — e.g. "Our runbook for database failover didn't account for read-replica lag. All runbooks involving database failover should be reviewed."]
|
||
|
||
---
|
||
|
||
## Communication Log
|
||
|
||
[Optional — list external communications sent: status page updates, customer emails, support responses. Include timestamps.]
|
||
|
||
---
|
||
|
||
## Quality Checks
|
||
|
||
- [ ] Timeline has no blame-focused language
|
||
- [ ] Root cause is specific (not "human error")
|
||
- [ ] Root cause answers "why did this happen?" not just "what happened?" — it names a system or process gap, not a symptom
|
||
- [ ] Contributing factors explain the systemic gaps
|
||
- [ ] Every action item has an owner and due date
|
||
- [ ] "What went well" section is genuine, not token
|
||
- [ ] No action item contains vague language like "improve monitoring", "increase resilience", or "better testing" — each must name a specific change
|
||
- [ ] Executive summary is readable by non-technical leadership
|
||
|
||
## Anti-Patterns
|
||
|
||
- [ ] Do not assign blame to individuals — postmortems must focus on system and process failures
|
||
- [ ] Do not write action items with vague language like "improve monitoring" — each must name a specific, ownable change
|
||
- [ ] Do not skip the contributing factors — root cause alone misses the systemic issues that enable incidents
|
||
- [ ] Do not omit the detection timeline — how long it took to detect matters as much as how long it took to resolve
|
||
- [ ] Do not treat the postmortem as closed until all action items have named owners and due dates
|
||
|
||
## Usage Examples
|
||
- "Write a postmortem for the [incident name] outage"
|
||
- "Help me write a P1 incident report"
|
||
- "Generate an RCA document for [service] going down on [date]"
|
||
- "Draft a blameless postmortem from these notes: [paste notes]"
|