From bfdbec17a37033deea040f686e9313b84e8a23a3 Mon Sep 17 00:00:00 2001
From: mohitagw15856
Date: Sun, 17 May 2026 10:55:58 +0100
Subject: [PATCH] =?UTF-8?q?feat:=20v10.0.0=20=E2=80=94=208=20new=20skills?=
 =?UTF-8?q?=20across=20Customer=20Success=20and=20Engineering=20(500-star?=
 =?UTF-8?q?=20milestone)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two star milestones shipped together:

Customer Success bundle (pm-cs) — 250-star milestone:
- cs-health-scorecard: weighted RAG health score across 5 dimensions with renewal forecast
- qbr-deck: slide-by-slide QBR structure with value narrative and mutual commitments
- cs-escalation-brief: 4-level escalation framework with root cause, impact, and decision required
- churn-analysis: voluntary/unavoidable churn split, early warning signals, prioritised interventions

Engineering expansion (pm-engineering) — 500-star milestone:
- cicd-playbook: full pipeline playbook from build through post-deploy checks and rollback
- slo-error-budget: SLI definitions, burn rate alerts, and error budget policy
- developer-onboarding-doc: first-week guide covering architecture, setup, testing, and contacts
- oncall-runbook: per-alert response procedures, escalation matrix, and handoff template

Also:
- Added pm-cs plugin to marketplace.json
- Updated pm-engineering plugin.json to v3.0.0 (14 skills)
- Updated marketplace.json to v10.0.0 (114 skills, 23 bundles, 16 professions)
- README updated with new CS section, corrected skill numbering (106 → 114)
- Added bug report link to Contributing section
- Star milestones updated to show 250 and 500 as unlocked
---
 .claude-plugin/marketplace.json | 16 +-
 README.md | 225 ++++++-----
 plugins/pm-cs/.claude-plugin/plugin.json | 13 +
 plugins/pm-cs/skills/churn-analysis/SKILL.md | 179 +++++++++
 .../pm-cs/skills/cs-escalation-brief/SKILL.md | 176 +++++++++
 .../pm-cs/skills/cs-health-scorecard/SKILL.md | 141 +++++++
 plugins/pm-cs/skills/qbr-deck/SKILL.md | 218 +++++++++++
 .../pm-engineering/.claude-plugin/plugin.json | 6 +-
 .../skills/cicd-playbook/SKILL.md | 301 +++++++++++++++
 .../skills/developer-onboarding-doc/SKILL.md | 332 ++++++++++++++++
 .../skills/oncall-runbook/SKILL.md | 364 ++++++++++++++++++
 .../skills/slo-error-budget/SKILL.md | 231 +++++++++++
 skills/churn-analysis/SKILL.md | 179 +++++++++
 skills/cicd-playbook/SKILL.md | 301 +++++++++++++++
 skills/cs-escalation-brief/SKILL.md | 176 +++++++++
 skills/cs-health-scorecard/SKILL.md | 141 +++++++
 skills/developer-onboarding-doc/SKILL.md | 332 ++++++++++++++++
 skills/oncall-runbook/SKILL.md | 364 ++++++++++++++++++
 skills/qbr-deck/SKILL.md | 218 +++++++++++
 skills/slo-error-budget/SKILL.md | 231 +++++++++++
 20 files changed, 4044 insertions(+), 100 deletions(-)
 create mode 100644 plugins/pm-cs/.claude-plugin/plugin.json
 create mode 100644 plugins/pm-cs/skills/churn-analysis/SKILL.md
 create mode 100644 plugins/pm-cs/skills/cs-escalation-brief/SKILL.md
 create mode 100644 plugins/pm-cs/skills/cs-health-scorecard/SKILL.md
 create mode 100644 plugins/pm-cs/skills/qbr-deck/SKILL.md
 create mode 100644 plugins/pm-engineering/skills/cicd-playbook/SKILL.md
 create mode 100644 plugins/pm-engineering/skills/developer-onboarding-doc/SKILL.md
 create mode 100644 plugins/pm-engineering/skills/oncall-runbook/SKILL.md
 create mode 100644 plugins/pm-engineering/skills/slo-error-budget/SKILL.md
 create mode 100644 skills/churn-analysis/SKILL.md
 create mode 100644 skills/cicd-playbook/SKILL.md
 create mode 100644 skills/cs-escalation-brief/SKILL.md
 create mode 100644 skills/cs-health-scorecard/SKILL.md
 create mode 100644 skills/developer-onboarding-doc/SKILL.md
 create mode 100644 skills/oncall-runbook/SKILL.md
 create mode 100644 skills/qbr-deck/SKILL.md
 create mode 100644 skills/slo-error-budget/SKILL.md

diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index e904e02..0ac675f 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -1,8 +1,8 @@
 {
   "$schema": "https://anthropic.com/claude-code/marketplace.schema.json",
   "name": "pm-claude-skills",
-  "version": "9.0.0",
-  "description": "106 Claude Skills + 4 templates across 22 plugin bundles plus the first agent template (PM Sprint Agent), covering 15 professions — product management, engineering, legal, finance, HR, sales, design, Figma, marketing, and more. Building blocks for the Anthropic agent template architecture.",
+  "version": "10.0.0",
+  "description": "114 Claude Skills + 4 agent templates across 23 plugin bundles covering 16 professions — product management, engineering, customer success, legal, finance, HR, sales, design, Figma, marketing, and more. Building blocks for the Anthropic agent template architecture.",
   "owner": {
     "name": "Mohit Aggarwal",
     "email": "mohit15856@gmail.com"
@@ -82,12 +82,20 @@
     },
     {
       "name": "pm-engineering",
-      "description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer. 10 structured skills for engineering teams, SREs, and technical PMs.",
-      "version": "2.0.0",
+      "description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer, CI/CD Playbook, SLO & Error Budget, Developer Onboarding Doc, On-Call Runbook. 14 structured skills for engineering teams, SREs, and technical PMs.",
+      "version": "3.0.0",
       "category": "productivity",
       "source": "./plugins/pm-engineering",
       "homepage": "https://github.com/mohitagw15856/pm-claude-skills"
     },
+    {
+      "name": "pm-cs",
+      "description": "Customer Success skills: Customer Health Scorecard, QBR Deck, Escalation Brief, Churn Analysis. Score account health with a weighted RAG framework, build structured QBR decks with value narratives, write crisp escalation briefs for at-risk accounts, and analyse churn by category and segment with prioritised interventions.",
+      "version": "1.0.0",
+      "category": "productivity",
+      "source": "./plugins/pm-cs",
+      "homepage": "https://github.com/mohitagw15856/pm-claude-skills"
+    },
     {
       "name": "pm-data",
       "description": "Data & analytics skills: Metrics Framework, SQL Query Explainer, Dashboard Brief, Chart Data Extractor. Build North Star metric trees, explain SQL, spec dashboards, and digitise chart images.",
diff --git a/README.md b/README.md
index 64bc8db..ac6a162 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
-# 🧠 Claude Skills Library — 106 Skills for Every Profession
+# 🧠 Claude Skills Library — 114 Skills for Every Profession
 [![Stars](https://img.shields.io/github/stars/mohitagw15856/pm-claude-skills?style=social)](https://github.com/mohitagw15856/pm-claude-skills/stargazers)
-[![Skills](https://img.shields.io/badge/skills-106-blue)](https://github.com/mohitagw15856/pm-claude-skills)
-[![Version](https://img.shields.io/badge/version-8.0.0-brightgreen)](https://github.com/mohitagw15856/pm-claude-skills/releases)
+[![Skills](https://img.shields.io/badge/skills-114-blue)](https://github.com/mohitagw15856/pm-claude-skills)
+[![Version](https://img.shields.io/badge/version-10.0.0-brightgreen)](https://github.com/mohitagw15856/pm-claude-skills/releases)
 [![Install](https://img.shields.io/badge/Install%20in%20Claude%20Code-2%20minutes-orange)](https://github.com/mohitagw15856/pm-claude-skills#-quick-install-2-minutes)
 [![License](https://img.shields.io/badge/license-MIT-lightgrey)](LICENSE)
 [![Sponsor](https://img.shields.io/badge/sponsor-❤️-ff69b4)](https://github.com/sponsors/mohitagw15856)

-> **106 Claude Skills + 4 agent templates across 15 professions. Save 8-10 hours per week.**
+> **114 Claude Skills + 4 agent templates across 16 professions. Save 8-10 hours per week.**

-A community-built library of Claude Skills covering product management, engineering, marketing, data, design, Figma, leadership, legal, finance, HR, sales, operations, research, education, and more. Each skill is a structured SKILL.md file that teaches Claude how to produce professional-grade outputs for your specific workflows.
+A community-built library of Claude Skills covering product management, engineering, customer success, marketing, data, design, Figma, leadership, legal, finance, HR, sales, operations, research, education, and more. Each skill is a structured SKILL.md file that teaches Claude how to produce professional-grade outputs for your specific workflows.

-**🆕 Latest release (v9.0.0):**: The library now includes 106 skills + 4 working agent templates following Anthropic's May 2026 agent template architecture.
+**🆕 Latest release (v10.0.0):** The library now includes 114 skills + 4 working agent templates. Two star milestones unlocked at once — 250 stars brought 4 Customer Success skills, 500 stars brought 4 more Engineering skills.

 ---

 ## 🚀 Quick Install (2 minutes)
@@ -26,7 +26,8 @@ claude plugin install pm-essentials@pm-claude-skills # Core PM + Word tracke
 claude plugin install pm-delivery@pm-claude-skills # Delivery + PowerPoint auditor
-claude plugin install pm-engineering@pm-claude-skills # Engineering + DevOps (10 skills) 🆕
+claude plugin install pm-engineering@pm-claude-skills # Engineering (14 skills) 🆕
+claude plugin install pm-cs@pm-claude-skills # Customer Success 🆕
 claude plugin install pm-data@pm-claude-skills # Data + chart data extractor
@@ -148,15 +149,29 @@ More templates will follow. If you want to contribute one, see the [template con

 ---

-## 🆕 What's New in v9.0.0
+## 🆕 What's New in v10.0.0

-**Three new agent templates added:**
+**Two star milestones unlocked — 8 new skills shipped:**

-- **PM Discovery Agent** — synthesise customer interviews into actionable discovery reports with theme detection and confidence scoring
-- **PM Stakeholder Communications Agent** — generate audience-tailored updates (executive, investor, cross-functional, board) from your team's recent activity
-- **PM Launch Agent** — coordinate product launches end-to-end with channel-specific content, calendar, metrics, and checklist
+**Customer Success bundle (250 ⭐ milestone):**

-The library now includes 106 skills + 4 working agent templates following Anthropic's May 2026 agent template architecture.
+| Skill | Bundle | What It Does |
+|---|---|---|
+| **Customer Health Scorecard** 🆕 | pm-cs | Weighted health score across adoption, engagement, outcomes, support, and commercial — with RAG status and renewal forecast |
+| **QBR Deck** 🆕 | pm-cs | Slide-by-slide quarterly business review structure with talking points, value narrative, and mutual commitments |
+| **Escalation Brief** 🆕 | pm-cs | Structured escalation brief for at-risk accounts — root cause, business impact, resolution plan, and decision required |
+| **Churn Analysis** 🆕 | pm-cs | Churn rate breakdown by category and segment, early warning signals, and prioritised interventions |
+
+**Engineering expansion (500 ⭐ milestone):**
+
+| Skill | Bundle | What It Does |
+|---|---|---|
+| **CI/CD Playbook** 🆕 | pm-engineering | Complete pipeline playbook covering every stage, rollback procedures, secrets management, and on-call responsibilities |
+| **SLO & Error Budget** 🆕 | pm-engineering | SLI definitions, SLO targets, error budget calculation, burn rate alerts, and error budget policy |
+| **Developer Onboarding Doc** 🆕 | pm-engineering | Everything a new engineer needs in their first week — architecture, local setup, testing, deployment, and key contacts |
+| **On-Call Runbook** 🆕 | pm-engineering | Per-alert response procedures, escalation matrix, diagnostic cheat sheet, and handoff template |
+
+The library now includes **114 skills** across **16 professions** + 4 working agent templates.

 | Skill | Bundle | What It Does |
@@ -250,7 +265,7 @@ This repo was built alongside a published article series. Read the full story:

 ---

-### 👩‍💻 Engineering & Tech (Skills 41–50)
+### 👩‍💻 Engineering & Tech (Skills 41–54)
 **Bundle:** `pm-engineering`

 | # | Skill | Folder | What It Does |
@@ -259,185 +274,206 @@ This repo was built alongside a published article series. Read the full story:
 | 42 | **Incident Postmortem** | `skills/incident-postmortem/` | Blameless postmortems with timeline, RCA, impact, and action items |
 | 43 | **API Docs Writer** | `skills/api-docs-writer/` | Developer-facing API docs: endpoints, parameters, response schemas, code examples |
 | 44 | **Architecture Decision Record** | `skills/architecture-decision-record/` | ADRs with context, options considered, decision, consequences, and risks |
-| 45 | **Debugging Log Analyser** 🆕 | `skills/debugging-log-analyser/` | Parse stack traces and error logs into a structured root cause diagnosis with a specific fix |
-| 46 | **PR Description Writer** 🆕 | `skills/pr-description-writer/` | Write reviewer-friendly PR descriptions from a diff, commit list, or change summary |
-| 47 | **System Design Interview** 🆕 | `skills/system-design-interview/` | Structure complete system design answers with capacity estimates, component deep-dives, and trade-offs |
-| 48 | **Changelog Generator** 🆕 | `skills/changelog-generator/` | Convert git commits into a polished, user-facing changelog following Keep a Changelog format |
-| 49 | **Test Strategy Doc** 🆕 | `skills/test-strategy-doc/` | Write a complete test strategy with risk assessment, test types, coverage targets, and P0/P1 test cases |
-| 50 | **Runbook Writer** 🆕 | `skills/runbook-writer/` | Write operational runbooks for deployments, incidents, and maintenance with exact commands and rollback steps |
+| 45 | **Debugging Log Analyser** | `skills/debugging-log-analyser/` | Parse stack traces and error logs into a structured root cause diagnosis with a specific fix |
+| 46 | **PR Description Writer** | `skills/pr-description-writer/` | Write reviewer-friendly PR descriptions from a diff, commit list, or change summary |
+| 47 | **System Design Interview** | `skills/system-design-interview/` | Structure complete system design answers with capacity estimates, component deep-dives, and trade-offs |
+| 48 | **Changelog Generator** | `skills/changelog-generator/` | Convert git commits into a polished, user-facing changelog following Keep a Changelog format |
+| 49 | **Test Strategy Doc** | `skills/test-strategy-doc/` | Write a complete test strategy with risk assessment, test types, coverage targets, and P0/P1 test cases |
+| 50 | **Runbook Writer** | `skills/runbook-writer/` | Write operational runbooks for deployments, incidents, and maintenance with exact commands and rollback steps |
+| 51 | **CI/CD Playbook** 🆕 | `skills/cicd-playbook/` | Complete pipeline playbook covering every stage, rollback procedures, secrets management, and on-call responsibilities |
+| 52 | **SLO & Error Budget** 🆕 | `skills/slo-error-budget/` | SLI definitions, SLO targets, error budget calculation, burn rate alerts, and error budget policy |
+| 53 | **Developer Onboarding Doc** 🆕 | `skills/developer-onboarding-doc/` | Everything a new engineer needs in their first week — architecture, local setup, testing, deployment, and key contacts |
+| 54 | **On-Call Runbook** 🆕 | `skills/oncall-runbook/` | Per-alert response procedures, escalation matrix, diagnostic cheat sheet, and handoff template |

 ---

-### 📊 Data & Analytics (Skills 51–54)
+### 🤝 Customer Success (Skills 55–58)
+**Bundle:** `pm-cs`
+
+> 250 ⭐ milestone unlocked. Install:
+
+claude plugin install pm-cs@pm-claude-skills
+
+
+| # | Skill | Folder | What It Does |
+|---|---|---|---|
+| 55 | **Customer Health Scorecard** 🆕 | `skills/cs-health-scorecard/` | Weighted health score across adoption, engagement, outcomes, support, and commercial — RAG status and renewal forecast |
+| 56 | **QBR Deck** 🆕 | `skills/qbr-deck/` | Slide-by-slide quarterly business review with talking points, value narrative, and mutual commitments |
+| 57 | **Escalation Brief** 🆕 | `skills/cs-escalation-brief/` | Structured brief for at-risk accounts — root cause, business impact, resolution plan, and decision required |
+| 58 | **Churn Analysis** 🆕 | `skills/churn-analysis/` | Churn breakdown by category and segment, early warning signals, and prioritised interventions |
+
+---
+
+### 📊 Data & Analytics (Skills 59–62)
 **Bundle:** `pm-data`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 51 | **Metrics Framework** | `skills/metrics-framework/` | North Star + metric tree, dashboard tiers, counter-metrics |
-| 52 | **SQL Query Explainer** | `skills/sql-query-explainer/` | Explain, optimise, write, and document SQL in plain English |
-| 53 | **Dashboard Brief** | `skills/dashboard-brief/` | Complete dashboard spec: KPIs, charts, filters, layout, data requirements |
-| 54 | **Chart Data Extractor** | `skills/chart-data-extractor/` | Extract pixel-level data from chart images into structured data tables |
+| 59 | **Metrics Framework** | `skills/metrics-framework/` | North Star + metric tree, dashboard tiers, counter-metrics |
+| 60 | **SQL Query Explainer** | `skills/sql-query-explainer/` | Explain, optimise, write, and document SQL in plain English |
+| 61 | **Dashboard Brief** | `skills/dashboard-brief/` | Complete dashboard spec: KPIs, charts, filters, layout, data requirements |
+| 62 | **Chart Data Extractor** | `skills/chart-data-extractor/` | Extract pixel-level data from chart images into structured data tables |

 ---

-### 🧑‍💼 Leadership & People (Skills 55–57)
+### 🧑‍💼 Leadership & People (Skills 63–65)
 **Bundle:** `pm-people`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 55 | **Performance Review** | `skills/performance-review/` | Structured reviews from bullet-point notes — self, manager, peer, and upward |
-| 56 | **Hiring Rubric** | `skills/hiring-rubric/` | Interview scorecards with competencies, behavioural questions, and panel guide |
-| 57 | **Team Offsite Planner** | `skills/team-offsite-planner/` | Full offsite agenda, session facilitation notes, and logistics checklist |
+| 63 | **Performance Review** | `skills/performance-review/` | Structured reviews from bullet-point notes — self, manager, peer, and upward |
+| 64 | **Hiring Rubric** | `skills/hiring-rubric/` | Interview scorecards with competencies, behavioural questions, and panel guide |
+| 65 | **Team Offsite Planner** | `skills/team-offsite-planner/` | Full offsite agenda, session facilitation notes, and logistics checklist |

 ---

-### 🎨 Design & UX (Skills 58–60)
+### 🎨 Design & UX (Skills 66–68)
 **Bundle:** `pm-design`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 58 | **UX Research Plan** | `skills/ux-research-plan/` | Research plans with screener, discussion guide, and synthesis framework |
-| 59 | **Design Critique** | `skills/design-critique/` | Structured feedback using JTBD, Gestalt principles, and Nielsen's heuristics |
-| 60 | **Accessibility Audit** | `skills/accessibility-audit/` | WCAG 2.2 audit with prioritised remediation and quick wins |
+| 66 | **UX Research Plan** | `skills/ux-research-plan/` | Research plans with screener, discussion guide, and synthesis framework |
+| 67 | **Design Critique** | `skills/design-critique/` | Structured feedback using JTBD, Gestalt principles, and Nielsen's heuristics |
+| 68 | **Accessibility Audit** | `skills/accessibility-audit/` | WCAG 2.2 audit with prioritised remediation and quick wins |

 ---

-### 🏢 Business & Strategy (Skills 61–63)
+### 🏢 Business & Strategy (Skills 69–71)
 **Bundle:** `pm-business`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 61 | **Investor Update** | `skills/investor-update/` | Monthly/quarterly investor updates: metrics, highlights, challenges, and asks |
-| 62 | **Board Deck Narrative** | `skills/board-deck-narrative/` | Slide-by-slide board presentation structure with narrative beats and talking points |
-| 63 | **Job Application** | `skills/job-application/` | Tailored CV summary, ATS keyword optimisation, and cover letter for any JD |
+| 69 | **Investor Update** | `skills/investor-update/` | Monthly/quarterly investor updates: metrics, highlights, challenges, and asks |
+| 70 | **Board Deck Narrative** | `skills/board-deck-narrative/` | Slide-by-slide board presentation structure with narrative beats and talking points |
+| 71 | **Job Application** | `skills/job-application/` | Tailored CV summary, ATS keyword optimisation, and cover letter for any JD |

 ---

-### ⚖️ Legal (Skills 64–67)
+### ⚖️ Legal (Skills 72–75)
 **Bundle:** `pm-legal`

 > ⚠️ All legal skills include a disclaimer. Not a substitute for qualified legal advice.

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 64 | **Contract Review** | `skills/contract-review/` | Structured review with key terms, flagged clauses, risk rating, and plain English summary |
-| 65 | **NDA Analyser** | `skills/nda-analyser/` | Clause-by-clause NDA analysis with risk flags and negotiation checklist |
-| 66 | **Legal Brief** | `skills/legal-brief/` | Legal memos and argument outlines in IRAC format (Issue, Rule, Application, Conclusion) |
-| 67 | **Compliance Checklist** | `skills/compliance-checklist/` | GDPR, SOC 2, ISO 27001, FCA, HIPAA compliance checklists with prioritised gap analysis |
+| 72 | **Contract Review** | `skills/contract-review/` | Structured review with key terms, flagged clauses, risk rating, and plain English summary |
+| 73 | **NDA Analyser** | `skills/nda-analyser/` | Clause-by-clause NDA analysis with risk flags and negotiation checklist |
+| 74 | **Legal Brief** | `skills/legal-brief/` | Legal memos and argument outlines in IRAC format (Issue, Rule, Application, Conclusion) |
+| 75 | **Compliance Checklist** | `skills/compliance-checklist/` | GDPR, SOC 2, ISO 27001, FCA, HIPAA compliance checklists with prioritised gap analysis |

 ---

-### 💰 Finance (Skills 68–72)
+### 💰 Finance (Skills 76–80)
 **Bundle:** `pm-finance`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 68 | **Financial Model Narrative** | `skills/financial-model-narrative/` | Turns P&L and model outputs into board-ready written narratives |
-| 69 | **Budget Variance Analysis** | `skills/budget-variance-analysis/` | Variance table with root cause commentary and management summary |
-| 70 | **Investor Pitch Deck** | `skills/investor-pitch-deck/` | Slide-by-slide pitch deck structure with what each slide must prove |
-| 71 | **Financial Due Diligence** | `skills/financial-due-diligence/` | DD document request list, analytical questions, and red flags checklist |
-| 72 | **Tax Planning Checklist** 🆕 | `skills/tax-planning-checklist/` | Year-end tax planning framework across income, pension, CGT, business reliefs, and ISAs |
+| 76 | **Financial Model Narrative** | `skills/financial-model-narrative/` | Turns P&L and model outputs into board-ready written narratives |
+| 77 | **Budget Variance Analysis** | `skills/budget-variance-analysis/` | Variance table with root cause commentary and management summary |
+| 78 | **Investor Pitch Deck** | `skills/investor-pitch-deck/` | Slide-by-slide pitch deck structure with what each slide must prove |
+| 79 | **Financial Due Diligence** | `skills/financial-due-diligence/` | DD document request list, analytical questions, and red flags checklist |
+| 80 | **Tax Planning Checklist** | `skills/tax-planning-checklist/` | Year-end tax planning framework across income, pension, CGT, business reliefs, and ISAs |

 ---

-### 👥 HR (Skills 73–77)
+### 👥 HR (Skills 81–85)
 **Bundle:** `pm-hr`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 73 | **Job Description Writer** | `skills/job-description-writer/` | Inclusive, structured JDs with built-in language review and salary range nudge |
-| 74 | **Onboarding Plan** | `skills/onboarding-plan/` | 30/60/90-day plans with week-by-week structure, milestones, and manager checklist |
-| 75 | **Employee Engagement Survey** | `skills/employee-engagement-survey/` | Survey design + results analysis mode with eNPS and action planning template |
-| 76 | **Redundancy Consultation** | `skills/redundancy-consultation/` | Process timeline, at-risk letter, consultation script, and confirmation letter — UK law |
-| 77 | **Change Management Plan** 🆕 | `skills/change-management-plan/` | Full change plan covering stakeholder analysis, communication strategy, training, and adoption metrics |
+| 81 | **Job Description Writer** | `skills/job-description-writer/` | Inclusive, structured JDs with built-in language review and salary range nudge |
+| 82 | **Onboarding Plan** | `skills/onboarding-plan/` | 30/60/90-day plans with week-by-week structure, milestones, and manager checklist |
+| 83 | **Employee Engagement Survey** | `skills/employee-engagement-survey/` | Survey design + results analysis mode with eNPS and action planning template |
+| 84 | **Redundancy Consultation** | `skills/redundancy-consultation/` | Process timeline, at-risk letter, consultation script, and confirmation letter — UK law |
+| 85 | **Change Management Plan** | `skills/change-management-plan/` | Full change plan covering stakeholder analysis, communication strategy, training, and adoption metrics |

 ---

-### 🤝 Sales (Skills 78–82)
+### 🤝 Sales (Skills 86–90)
 **Bundle:** `pm-sales`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 78 | **Sales Battlecard** | `skills/sales-battlecard/` | One-page competitive battlecard with objection responses and landmine questions |
-| 79 | **Discovery Call Prep** | `skills/discovery-call-prep/` | Call brief with research summary, hypothesis, structured questions, and success criteria |
-| 80 | **Proposal Writer** | `skills/proposal-writer/` | Commercial proposals structured around the prospect's problem, not the product |
-| 81 | **Account Plan** | `skills/account-plan/` | Strategic account plan with relationship map, whitespace analysis, risks, and 90-day actions |
-| 82 | **Sales Forecasting Model** 🆕 | `skills/sales-forecasting-model/` | Pipeline-based forecast with stage model, scenario analysis, assumption log, and activity sanity check |
+| 86 | **Sales Battlecard** | `skills/sales-battlecard/` | One-page competitive battlecard with objection responses and landmine questions |
+| 87 | **Discovery Call Prep** | `skills/discovery-call-prep/` | Call brief with research summary, hypothesis, structured questions, and success criteria |
+| 88 | **Proposal Writer** | `skills/proposal-writer/` | Commercial proposals structured around the prospect's problem, not the product |
+| 89 | **Account Plan** | `skills/account-plan/` | Strategic account plan with relationship map, whitespace analysis, risks, and 90-day actions |
+| 90 | **Sales Forecasting Model** | `skills/sales-forecasting-model/` | Pipeline-based forecast with stage model, scenario analysis, assumption log, and activity sanity check |

 ---

-### ⚙️ Operations (Skills 83–87)
+### ⚙️ Operations (Skills 91–95)
 **Bundle:** `pm-operations`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 83 | **Process Documentation** | `skills/process-documentation/` | Clear process docs with steps, roles, edge cases — followable by a new starter |
-| 84 | **SOP Writer** | `skills/sop-writer/` | Formal, audit-ready SOPs with version control, quality checks, and non-conformance process |
-| 85 | **Vendor Evaluation** | `skills/vendor-evaluation/` | Weighted vendor scorecard, RFP questions, reference check template, and recommendation |
-| 86 | **Project Status Report** | `skills/project-status-report/` | RAG status reports with milestone progress, issues, risks, and decisions required |
-| 87 | **Workshop Facilitation Guide** 🆕 | `skills/workshop-facilitation-guide/` | Complete facilitation guides with activity instructions, decision protocols, and facilitator moves |
+| 91 | **Process Documentation** | `skills/process-documentation/` | Clear process docs with steps, roles, edge cases — followable by a new starter |
+| 92 | **SOP Writer** | `skills/sop-writer/` | Formal, audit-ready SOPs with version control, quality checks, and non-conformance process |
+| 93 | **Vendor Evaluation** | `skills/vendor-evaluation/` | Weighted vendor scorecard, RFP questions, reference check template, and recommendation |
+| 94 | **Project Status Report** | `skills/project-status-report/` | RAG status reports with milestone progress, issues, risks, and decisions required |
+| 95 | **Workshop Facilitation Guide** | `skills/workshop-facilitation-guide/` | Complete facilitation guides with activity instructions, decision protocols, and facilitator moves |

 ---

-### 🏥 Research & Healthcare (Skills 88–91)
+### 🏥 Research & Healthcare (Skills 96–99)
 **Bundle:** `pm-research`

 > ⚠️ Healthcare skills are for documentation and educational purposes only. All clinical content must be reviewed by a qualified professional.

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 88 | **Clinical Case Summary** | `skills/clinical-case-summary/` | SBAR handovers, SOAP notes, and case reports for educational and documentation use |
-| 89 | **Research Protocol** | `skills/research-protocol/` | Complete study protocols with objectives, methodology, ethics, and analysis plan |
-| 90 | **Patient Communication** | `skills/patient-communication/` | Plain English patient letters, leaflets, and results communications at Grade 6 reading level |
-| 91 | **Literature Review** | `skills/literature-review/` | Thematically organised literature reviews with synthesis, critical analysis, and gap identification |
+| 96 | **Clinical Case Summary** | `skills/clinical-case-summary/` | SBAR handovers, SOAP notes, and case reports for educational and documentation use |
+| 97 | **Research Protocol** | `skills/research-protocol/` | Complete study protocols with objectives, methodology, ethics, and analysis plan |
+| 98 | **Patient Communication** | `skills/patient-communication/` | Plain English patient letters, leaflets, and results communications at Grade 6 reading level |
+| 99 | **Literature Review** | `skills/literature-review/` | Thematically organised literature reviews with synthesis, critical analysis, and gap identification |

 ---

-### 🌐 Cross-Profession (Skills 92–95)
+### 🌐 Cross-Profession (Skills 100–103)
 **Bundle:** `pm-cross`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 92 | **Press Release** | `skills/press-release/` | Journalist-ready press releases with headline rules, boilerplate, and journalist test |
-| 93 | **Grant Proposal** | `skills/grant-proposal/` | Complete grant applications aligned to funder priorities with budget narrative |
-| 94 | **Executive Summary** | `skills/executive-summary/` | Decision-ready executive summaries with bottom line upfront, adapted for any audience |
-| 95 | **Teaching Lesson Plan** 🆕 | `skills/teaching-lesson-plan/` | Complete lesson plans for any subject, audience, or setting — with objectives, activities, and formative assessment |
+| 100 | **Press Release** | `skills/press-release/` | Journalist-ready press releases with headline rules, boilerplate, and journalist test |
+| 101 | **Grant Proposal** | `skills/grant-proposal/` | Complete grant applications aligned to funder priorities with budget narrative |
+| 102 | **Executive Summary** | `skills/executive-summary/` | Decision-ready executive summaries with bottom line upfront, adapted for any audience |
+| 103 | **Teaching Lesson Plan** | `skills/teaching-lesson-plan/` | Complete lesson plans for any subject, audience, or setting — with objectives, activities, and formative assessment |

 ---

-### 🖼️ Figma (Skills 96–105)
+### 🖼️ Figma (Skills 104–113)
 **Bundle:** `pm-figma`

 | # | Skill | Folder | What It Does |
 |---|---|---|---|
-| 96 | **Figma Component Audit** | `skills/figma-component-audit/` | Audit component library for naming issues, coverage gaps, and variant completeness |
-| 97 | **Figma Design Brief** | `skills/figma-design-brief/` | Convert PRDs and feature requests into structured Figma design briefs |
-| 98 | **Figma Annotation Guide** | `skills/figma-annotation-guide/` | Generate complete developer handoff annotations covering all states and edge cases |
-| 99 | **Figma Design Review** | `skills/figma-design-review/` | PM design review against requirements with explicit approval status |
-| 100 | **Figma User Flow Planner** | `skills/figma-user-flow-planner/` | Map all screens, states, and decision points before opening Figma |
-| 101 | **Figma Variant Matrix** | `skills/figma-variant-matrix/` | Define all component variants, properties, and states before building |
-| 102 | **Figma Spacing System** | `skills/figma-spacing-system/` | Design a complete spacing scale, grid, and token system |
-| 103 | **Figma Prototype Plan** | `skills/figma-prototype-plan/` | Plan prototype scope, interactions, and test task scripts for user testing |
-| 104 | **Figma Design QA** | `skills/figma-design-qa/` | Pre-handoff QA checklist covering file hygiene, states, accessibility, and handoff readiness |
-| 105 | **Figma Design Critique (PM)** | `skills/figma-design-critique-pm/` | PM-perspective design critique focused on product outcomes, not aesthetics |
+| 104 | **Figma Component Audit** | `skills/figma-component-audit/` | Audit component library for naming issues, coverage gaps, and variant completeness |
+| 105 | **Figma Design Brief** | `skills/figma-design-brief/` | Convert PRDs and feature requests into structured Figma design briefs |
+| 106 | **Figma Annotation Guide** | `skills/figma-annotation-guide/` | Generate complete developer handoff annotations covering all states and edge cases |
+| 107 | **Figma Design Review** | `skills/figma-design-review/` | PM design review against requirements with explicit approval status |
+| 108 | **Figma User Flow Planner** | `skills/figma-user-flow-planner/` | Map all screens, states, and decision points before opening Figma |
+| 109 | **Figma Variant Matrix** | `skills/figma-variant-matrix/` | Define all component variants, properties, and states before building |
+| 110 | **Figma Spacing System** | `skills/figma-spacing-system/` | Design a complete spacing scale, grid, and token system |
+| 111 | **Figma Prototype Plan** | `skills/figma-prototype-plan/` | Plan prototype scope, interactions, and test task scripts for user testing |
+| 112 | **Figma Design QA** | `skills/figma-design-qa/` | Pre-handoff QA checklist covering file hygiene, states, accessibility, and handoff readiness |
+| 113 | **Figma Design Critique (PM)** | `skills/figma-design-critique-pm/` | PM-perspective design critique focused on product outcomes, not aesthetics
| claude plugin install pm-figma@pm-claude-skills --- -### 📅 PM Rituals (Skill 106) +### 📅 PM Rituals (Skill 114) **Bundle:** `pm-rituals` | # | Skill | Folder | What It Does | |---|---|---|---| -| 106 | **PM Weekly Review** | `skills/pm-weekly-review/` | Weekly PM review and planning ritual — metrics, shipping progress, blockers, and next week's priorities | +| 114 | **PM Weekly Review** | `skills/pm-weekly-review/` | Weekly PM review and planning ritual — metrics, shipping progress, blockers, and next week's priorities | --- ## ❤️ Sponsor This Work -Building and maintaining 106 skills across 22 bundles takes real time — testing skills against new model releases, building new ones from community requests, writing the article series, and keeping documentation current. +Building and maintaining 114 skills across 23 bundles takes real time — testing skills against new model releases, building new ones from community requests, writing the article series, and keeping documentation current. If these skills save you time at work, consider sponsoring: @@ -458,6 +494,8 @@ Higher tiers include custom skill development for your team, direct access for s This is an open-source community library. If you've built a skill that saves you time, share it here. +**Found a bug?** [Open a bug report →](../../issues/new?template=bug-report.md) — use the template so it's easy to triage. + **How to contribute:** 1. 
Fork this repo @@ -515,7 +553,8 @@ claude plugin install pm-strategy@pm-claude-skills claude plugin install pm-advanced@pm-claude-skills claude plugin install pm-rituals@pm-claude-skills claude plugin install pm-gtm@pm-claude-skills -claude plugin install pm-engineering@pm-claude-skills # 10 engineering skills 🆕 +claude plugin install pm-engineering@pm-claude-skills # 14 engineering skills 🆕 +claude plugin install pm-cs@pm-claude-skills # Customer Success (4 skills) 🆕 claude plugin install pm-data@pm-claude-skills claude plugin install pm-people@pm-claude-skills claude plugin install pm-design@pm-claude-skills @@ -581,8 +620,8 @@ Stars unlock the next wave of skills. Here's the roadmap: | Milestone | Unlocks | Status | |---|---|---| | 100 ⭐ | 10 Figma skills + quality rebuild across all 93 skills | ✅ Shipped (v6.0.0) | -| 250 ⭐ | 10 Customer Success skills (health scorecard, QBR deck, escalation brief, churn analysis) | 🔒 Locked | -| 500 ⭐ | 25 more Engineering skills (CI/CD playbooks, SLO templates, onboarding docs, debugging patterns) | 🔒 Locked | +| 250 ⭐ | 10 Customer Success skills (health scorecard, QBR deck, escalation brief, churn analysis) | ✅ Unlocked — coming in next release | +| 500 ⭐ | 25 more Engineering skills (CI/CD playbooks, SLO templates, onboarding docs, debugging patterns) | ✅ Unlocked — coming in next release | | 1000 ⭐ | Full Startup Founder kit (fundraising memo, pitch critique, co-founder equity split) | 🔒 Locked | **[⭐ Star this repo to unlock the next milestone →](https://github.com/mohitagw15856/pm-claude-skills)** diff --git a/plugins/pm-cs/.claude-plugin/plugin.json b/plugins/pm-cs/.claude-plugin/plugin.json new file mode 100644 index 0000000..0e3bdf6 --- /dev/null +++ b/plugins/pm-cs/.claude-plugin/plugin.json @@ -0,0 +1,13 @@ +{ + "$schema": "https://anthropic.com/claude-code/plugin.schema.json", + "name": "pm-cs", + "version": "1.0.0", + "description": "Customer Success skills: Customer Health Scorecard, QBR Deck, Escalation Brief, 
Churn Analysis. Score account health with a weighted RAG framework, build structured QBR decks with value narratives, write crisp escalation briefs for at-risk accounts, and analyse churn by category and segment with prioritised interventions.", + "author": { + "name": "Mohit Aggarwal", + "email": "mohit15856@gmail.com" + }, + "homepage": "https://github.com/mohitagw15856/pm-claude-skills", + "license": "MIT", + "keywords": ["customer-success", "account-management", "health-scorecard", "qbr", "quarterly-business-review", "churn", "retention", "escalation", "csm", "renewal"] +} diff --git a/plugins/pm-cs/skills/churn-analysis/SKILL.md b/plugins/pm-cs/skills/churn-analysis/SKILL.md new file mode 100644 index 0000000..f7bcf17 --- /dev/null +++ b/plugins/pm-cs/skills/churn-analysis/SKILL.md @@ -0,0 +1,179 @@ +--- +name: churn-analysis +description: "Analyse customer churn for a product or cohort and produce a structured churn report. Use when asked to analyse churn, understand why customers are leaving, identify churn patterns, calculate churn rate, or build a churn reduction plan. Produces a churn analysis with rate calculations, categorised reasons, early warning signals, and prioritised interventions." +--- + +# Churn Analysis Skill + +Produce a structured churn analysis that goes beyond the headline rate — identifying why customers leave, which segments are most at risk, and what interventions will have the highest impact on retention. + +## Required Inputs + +Ask for these if not already provided: +- **Time period** being analysed (e.g. 
Q1, last 12 months) +- **Total customers at start of period** and **customers churned** +- **ARR or revenue lost** to churn +- **Churn reasons data** — exit survey results, CSM notes, support data, or sales loss reasons +- **Customer segments** — by tier, industry, cohort, or product line +- **Current retention rate** if known +- **Any recent changes** — pricing, product, support model — that may have affected churn + +## Churn Categories + +Always classify churn before analysing it: + +| Category | Definition | +|---|---| +| **Voluntary — avoidable** | Customer left due to a problem we could have addressed (product gaps, poor onboarding, relationship failures) | +| **Voluntary — unavoidable** | Customer left for reasons outside our control (budget cuts, acquisition, company shutdown) | +| **Involuntary** | Payment failure, contract non-renewal by mistake, admin error | + +The interventions for each category are different. Conflating them leads to wrong conclusions. + +## Output Format + +--- + +# Churn Analysis: [Product / Segment / Company] +**Period:** [Start date] — [End date] +**Prepared by:** [Name] | **Date:** [Date] + +--- + +## Headline Numbers + +| Metric | Value | +|---|---| +| Customers at start of period | [N] | +| Customers churned | [N] | +| **Customer churn rate** | **[X]%** | +| ARR at start of period | £/$/€[X] | +| ARR lost to churn | £/$/€[X] | +| **Revenue churn rate (gross)** | **[X]%** | +| ARR from expansions (same period) | £/$/€[X] | +| **Net revenue retention (NRR)** | **[X]%** | + +**Benchmark context:** +- Customer churn rate: [X]% vs. 
industry benchmark [Y]% — [above / below / in line] +- NRR: [X]% — [What this means: above 100% = expansion offsets churn; below 100% = shrinking base] + +--- + +## Churn Breakdown by Category + +| Category | Customers | % of churn | ARR lost | +|---|---|---|---| +| Voluntary — avoidable | [N] | [X]% | £/$/€[X] | +| Voluntary — unavoidable | [N] | [X]% | £/$/€[X] | +| Involuntary | [N] | [X]% | £/$/€[X] | +| **Total** | **[N]** | **100%** | **£/$/€[X]** | + +**Avoidable churn as % of total churn:** [X]% — this is the number we can actually influence. + +--- + +## Churn Reasons — Avoidable Churn Only + +Rank by frequency. Include ARR weight where data allows. + +| Reason | Count | % of avoidable churn | ARR lost | Representative quote | +|---|---|---|---|---| +| [Reason 1 — e.g. "Product missing key feature"] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| [Reason 2] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| [Reason 3] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| [Reason 4] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| Other | [N] | [X]% | £/$/€[X] | — | + +**Theme synthesis:** [2–3 sentences grouping the top reasons into 2–3 themes. E.g. "The top three reasons cluster around two themes: product gaps in [area] (affecting X% of avoidable churn) and onboarding failures where customers never achieved value (Y%)."] + +--- + +## Churn by Segment + +Identify which segments over- or under-index for churn. + +### By Tier + +| Tier | Churn rate | vs. 
Overall | Notes | +|---|---|---|---| +| Enterprise | [X]% | +/-[X]pp | | +| Mid-Market | [X]% | +/-[X]pp | | +| SMB | [X]% | +/-[X]pp | | + +### By Cohort (Acquisition Year) + +| Cohort | Churn rate | Notes | +|---|---|---| +| [Year 1] | [X]% | | +| [Year 2] | [X]% | | +| [Year 3] | [X]% | | + +### By Industry / Use Case (if data available) + +| Segment | Churn rate | Notes | +|---|---|---| +| [Segment 1] | [X]% | | +| [Segment 2] | [X]% | | + +**Key pattern:** [Which segment has the highest churn rate and what likely explains it] + +--- + +## Timing Analysis + +- **Average contract length before churn:** [X months] +- **Highest-risk moment:** [e.g. "Month 3 — when trial value has worn off but full adoption hasn't happened"] +- **Churn timing distribution:** + +| When churn occurred | % of churned accounts | +|---|---| +| 0–3 months | [X]% | +| 3–6 months | [X]% | +| 6–12 months | [X]% | +| 12+ months | [X]% | + +--- + +## Early Warning Signals + +Based on the churned accounts, identify the signals that preceded churn (and could have triggered earlier intervention): + +| Signal | Lead time before churn | How to detect | +|---|---|---| +| [Signal 1 — e.g. "DAU/MAU dropped below 15%"] | [~X weeks] | [Usage dashboard / alert] | +| [Signal 2 — e.g. "No QBR in 90+ days"] | [~X weeks] | [CRM flag] | +| [Signal 3 — e.g. "Champion left the account"] | [~X weeks] | [LinkedIn alert / CSM tracking] | +| [Signal 4] | [~X weeks] | [Detection method] | + +--- + +## Intervention Recommendations + +Ranked by estimated impact × feasibility. + +| Intervention | Addresses | Est. churn reduction | Effort | Owner | +|---|---|---|---|---| +| [Intervention 1 — e.g. 
"Improve onboarding for [segment] with dedicated 30-day check-in"] | [Reason 1] | [X accounts / £X ARR] | Low / Med / High | [Team] | +| [Intervention 2] | [Reason 2] | [X accounts / £X ARR] | Low / Med / High | [Team] | +| [Intervention 3] | [Reason 3] | [X accounts / £X ARR] | Low / Med / High | [Team] | + +**Priority call:** [Which one intervention, if implemented this quarter, would have the biggest impact and why] + +--- + +## What We Don't Know (Data Gaps) + +- [Data gap 1 — e.g. "Exit survey response rate is only 30% — the reasons data may not be representative"] +- [Data gap 2 — e.g. "No product usage data for SMB tier — can't confirm usage signal correlation"] +- [Data gap 3] + +--- + +## Quality Checks + +- [ ] Churn rate is correctly calculated (churned ÷ starting cohort, not end-of-period total) +- [ ] Avoidable and unavoidable churn are separated — interventions target avoidable churn only +- [ ] Churn reasons are customer-reported, not internally assumed +- [ ] Segment analysis identifies which segments over-index — not just averages +- [ ] Early warning signals are specific and detectable, not generic ("low engagement") +- [ ] Interventions link directly to the top churn reasons — no recommendations without a root cause match diff --git a/plugins/pm-cs/skills/cs-escalation-brief/SKILL.md b/plugins/pm-cs/skills/cs-escalation-brief/SKILL.md new file mode 100644 index 0000000..0cca77d --- /dev/null +++ b/plugins/pm-cs/skills/cs-escalation-brief/SKILL.md @@ -0,0 +1,176 @@ +--- +name: cs-escalation-brief +description: "Write a structured escalation brief for an at-risk customer account. Use when an account has escalated, when a customer is threatening churn, when a P1 customer issue needs executive attention, or when preparing an internal save play. Produces a crisp escalation brief with account context, timeline, root cause, business impact, and a clear resolution plan." 
+---
+
+# Customer Escalation Brief Skill
+
+Produce a clear, concise escalation brief that gives internal stakeholders — VP CS, CCO, product leadership, or the CEO — everything they need to understand the situation, make decisions, and act fast.
+
+A good escalation brief is not a complaint. It is a professional document that states the facts, assigns accountability honestly, and proposes a specific resolution plan.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Account name**, tier, and ARR
+- **CSM name** and account owner
+- **Nature of the escalation** — what happened, what the customer is saying
+- **Timeline** of events leading to escalation
+- **Customer contact** who escalated (name, role, influence level)
+- **What the customer wants** — their stated ask
+- **What we believe the root cause is**
+- **What has already been done** to address the situation
+- **Renewal date** and current renewal risk assessment
+
+## Escalation Levels
+
+Calibrate urgency and audience based on escalation level:
+
+| Level | Trigger | Audience | Response time |
+|---|---|---|---|
+| L1 — Account Risk | Customer expressing dissatisfaction; renewal at risk | CSM + CS Manager | 24 hours |
+| L2 — Executive Escalation | Customer escalated to their exec; requesting vendor exec involvement | VP CS + Account Exec | 4 hours |
+| L3 — Churn Risk | Customer has issued notice or is in active churn conversation | CCO / CEO + Revenue leadership | 1 hour |
+| L4 — Public Risk | Customer threatening public escalation, legal, or press | CCO / Legal / Comms | Immediate |
+
+## Output Format
+
+---
+
+# Escalation Brief: [Account Name]
+
+**Escalation level:** L[1/2/3/4] — [Label]
+**Date raised:** [Date]
+**Raised by:** [CSM name]
+**Escalation owner:** [Name of exec or senior stakeholder now leading response]
+
+---
+
+## Account at a Glance
+
+| Field | Detail |
+|---|---|
+| ARR | £/$/€[X] |
+| Tier | Enterprise / Mid-Market / SMB |
+| Customer since | [Date] |
+| Renewal date | [Date] — [N] days away |
+| Renewal risk (pre-escalation) | Green / Amber / Red |
+| Renewal risk (current) | Green / Amber / Red |
+| Customer contact who escalated | [Name, role, seniority] |
+| Executive sponsor (customer) | [Name, role — active / passive / vacant] |
+| Executive sponsor (vendor) | [Name, role] |
+
+---
+
+## What Happened — Summary
+
+[3–5 sentences. State the facts plainly. What the customer experienced, how they reacted, and how we learned about the escalation. No editorialising. No blame.]
+
+---
+
+## Timeline
+
+List in chronological order. Each entry: `[Date / time] — [What happened. Who did what.]`
+
+Include:
+- When the original issue or trigger event occurred
+- When the customer first raised concerns (informally)
+- When it escalated (formal escalation or exec involvement)
+- Actions taken since escalation
+
+---
+
+## Root Cause
+
+**Primary cause:** [One clear sentence. What specifically went wrong.]
+
+**Contributing factors:**
+- [Factor 1 — be honest about internal failures as well as external ones]
+- [Factor 2]
+
+**Is this a systemic issue or isolated?**
+[ ] Isolated to this account
+[ ] Pattern seen in other accounts — details: [_______]
+[ ] Product or process gap that needs fixing
+
+---
+
+## Customer's Stated Position
+
+**What the customer says happened:** [Their version of events — fair and unfiltered]
+
+**What they are asking for:** [Their explicit ask — compensation, fix by date, exec call, SLA credit, exit clause]
+
+**Sentiment of escalating contact:** [Frustrated but constructive / Angry / Seeking exit / Unknown]
+
+**Risk of public escalation:** Low / Medium / High — [evidence if Medium or High]
+
+---
+
+## Business Impact
+
+| Impact type | Detail |
+|---|---|
+| ARR at risk | £/$/€[X] |
+| Potential churn probability | [X]% |
+| Reputational risk | Low / Medium / High |
+| Reference / case study status | [Was a reference — now at risk / Not a reference] |
+| Expansion pipeline at risk | £/$/€[X] |
+
+---
+
+## What Has Been Done So Far
+
+1. [Action taken — by whom — date — outcome]
+2. [Action taken — by whom — date — outcome]
+3. [Action taken — by whom — date — outcome]
+
+**Has a formal apology or acknowledgement been issued?** Yes / No
+
+---
+
+## Proposed Resolution Plan
+
+**Immediate actions (next 24–48 hours):**
+
+| Action | Owner | By when |
+|---|---|---|
+| [Action] | [Name] | [Date] |
+| [Action] | [Name] | [Date] |
+
+**Medium-term actions (next 2–4 weeks):**
+
+| Action | Owner | By when |
+|---|---|---|
+| [Action] | [Name] | [Date] |
+
+**What we are NOT offering:** [Be explicit about what is not on the table — avoids misaligned expectations]
+
+**Success criteria:** [How will we know the escalation is resolved? What does the customer need to confirm they are satisfied?]
+
+---
+
+## Decision Required from Escalation Owner
+
+[State clearly what decision or resource the escalation owner needs to provide. Be specific — do not make them ask. E.g.: "We need approval to offer a 20% service credit for Q2" or "We need an exec call with [name] within 48 hours."]
+
+---
+
+## Communication Plan
+
+| Audience | Message | Channel | Owner | By when |
+|---|---|---|---|---|
+| Escalating customer contact | [Summary of message] | Email / Call | [Name] | [Date] |
+| Customer exec sponsor | [Summary] | Call | [Name] | [Date] |
+| Internal CS team | [Summary] | Slack / Meeting | CS Manager | [Date] |
+
+---
+
+## Quality Checks
+
+- [ ] Root cause is specific — not "communication breakdown" or "product gap" without detail
+- [ ] Customer's position is stated fairly — not minimised or dismissed
+- [ ] A clear decision is requested from the escalation owner — brief does not end with "what do you think?"
+- [ ] ARR at risk is quantified
+- [ ] Communication plan has owners and dates — not "TBD"
+- [ ] Language is professional and blameless toward individuals
diff --git a/plugins/pm-cs/skills/cs-health-scorecard/SKILL.md b/plugins/pm-cs/skills/cs-health-scorecard/SKILL.md
new file mode 100644
index 0000000..17237b7
--- /dev/null
+++ b/plugins/pm-cs/skills/cs-health-scorecard/SKILL.md
@@ -0,0 +1,141 @@
+---
+name: cs-health-scorecard
+description: "Build a customer health scorecard for a specific account. Use when asked to score account health, assess renewal risk, build a health dashboard, or evaluate an account's likelihood to renew or expand. Produces a structured health scorecard with a RAG status, dimension scores, key risks, and recommended actions."
+---
+
+# Customer Health Scorecard Skill
+
+Produce a structured, data-driven health scorecard for a customer account — giving the CSM and leadership a clear view of renewal risk, expansion potential, and the actions needed to move the account in the right direction.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Account name** and tier (enterprise / mid-market / SMB)
+- **Contract value** (ARR) and **renewal date**
+- **Product usage data** — logins, DAU/MAU ratio, key feature adoption
+- **Support data** — open tickets, CSAT or NPS score, recent escalations
+- **Engagement data** — last QBR date, executive sponsor status, champion name
+- **Commercial data** — payment history, expansion conversations, seats used vs. licensed
+- **Any known risks or recent changes** at the account
+
+## Scoring Framework
+
+Score each dimension 1–5. Weight as shown. Calculate weighted total out of 100.
+
+| Dimension | Weight | What to Score |
+|---|---|---|
+| **Product Adoption** | 30% | DAU/MAU ratio, breadth of features used, power users identified |
+| **Engagement** | 20% | QBR cadence, executive sponsor active, champion strength |
+| **Outcomes** | 20% | Customer hitting their stated goals / success metrics |
+| **Support Health** | 15% | Ticket volume trend, unresolved escalations, CSAT |
+| **Commercial** | 15% | On-time payments, seats utilised, expansion signals |
+
+**Score → RAG conversion:**
+- 80–100: Green (healthy, renew likely)
+- 60–79: Amber (at risk, needs attention)
+- 0–59: Red (high churn risk, escalate)
+
+## Output Format
+
+---
+
+# Customer Health Scorecard: [Account Name]
+
+**CSM:** [Name] | **Tier:** [Enterprise / Mid-Market / SMB]
+**ARR:** £/$/€[X] | **Renewal date:** [Date] | **Days to renewal:** [N]
+**Overall health:** [Green / Amber / Red] — [Score]/100
+**Last updated:** [Date]
+
+---
+
+## Health Score Summary
+
+| Dimension | Score (1–5) | Weight | Weighted Score | Trend |
+|---|---|---|---|---|
+| Product Adoption | [1–5] | 30% | [X] | ↑ / → / ↓ |
+| Engagement | [1–5] | 20% | [X] | ↑ / → / ↓ |
+| Outcomes | [1–5] | 20% | [X] | ↑ / → / ↓ |
+| Support Health | [1–5] | 15% | [X] | ↑ / → / ↓ |
+| Commercial | [1–5] | 15% | [X] | ↑ / → / ↓ |
+| **Total** | — | 100% | **[X]/100** | |
+
+---
+
+## Dimension Detail
+
+### Product Adoption — [Score]/5
+- **DAU/MAU ratio:** [X]% (benchmark: >25% = healthy)
+- **Key features adopted:** [List features in use]
+- **Features not adopted:** [List unused high-value features]
+- **Power users identified:** [Yes / No — how many]
+- **Assessment:** [1–2 sentences on adoption health]
+
+### Engagement — [Score]/5
+- **Last QBR:** [Date] — [Outcome summary]
+- **Next QBR:** [Scheduled / Overdue]
+- **Executive sponsor:** [Active / Passive / Vacant]
+- **Champion:** [Name, role, strength: strong / moderate / weak]
+- **Assessment:** [1–2 sentences]
+
+### Outcomes — [Score]/5
+- **Customer's stated goals:** [List 2–3 goals from onboarding or last QBR]
+- **Progress against goals:** [On track / Partial / Off track]
+- **Evidence of value:** [Metric or quote that demonstrates ROI]
+- **Assessment:** [1–2 sentences]
+
+### Support Health — [Score]/5
+- **Open tickets:** [N] (priority breakdown: P1: X, P2: X, P3: X)
+- **CSAT / NPS:** [Score] (benchmark: >8 CSAT / >30 NPS = healthy)
+- **Unresolved escalations:** [Yes / No — details if yes]
+- **Ticket trend (last 90 days):** Increasing / Stable / Decreasing
+- **Assessment:** [1–2 sentences]
+
+### Commercial — [Score]/5
+- **Seats licensed:** [N] | **Seats active:** [N] ([X]% utilisation)
+- **Payment history:** [On time / Late — details]
+- **Expansion signals:** [Yes — describe / No]
+- **Downgrade or cancellation signals:** [Yes — describe / No]
+- **Assessment:** [1–2 sentences]
+
+---
+
+## Top Risks
+
+| Risk | Severity | Mitigation |
+|---|---|---|
+| [Risk description] | High / Medium / Low | [Specific action to mitigate] |
+
+---
+
+## Recommended Actions
+
+**Immediate (this week):**
+1. [Action — owner — deadline]
+
+**This month:**
+1. [Action — owner — deadline]
+
+**Before renewal:**
+1. [Action — owner — deadline]
+
+---
+
+## Renewal Forecast
+
+| Scenario | Probability | ARR at risk |
+|---|---|---|
+| Full renewal at current ARR | [X]% | £/$/€0 |
+| Renewal with contraction | [X]% | £/$/€[X] |
+| Churn | [X]% | £/$/€[full ARR] |
+
+**Recommended renewal play:** [Expand / Hold / Save / Manage out]
+
+---
+
+## Quality Checks
+
+- [ ] Score is based on data, not gut feel — each dimension has evidence
+- [ ] Risks are specific (not "low engagement" — something like "executive sponsor left in March, no replacement identified")
+- [ ] Actions have owners and deadlines
+- [ ] Renewal probability is calibrated against pipeline reality
+- [ ] Trend arrows reflect direction of change vs. last scorecard, not just current state
diff --git a/plugins/pm-cs/skills/qbr-deck/SKILL.md b/plugins/pm-cs/skills/qbr-deck/SKILL.md
new file mode 100644
index 0000000..0c5ea5e
--- /dev/null
+++ b/plugins/pm-cs/skills/qbr-deck/SKILL.md
@@ -0,0 +1,218 @@
+---
+name: qbr-deck
+description: "Build a Quarterly Business Review (QBR) deck structure and narrative for a customer account. Use when asked to prepare a QBR, business review meeting, executive review, or quarterly check-in with a customer. Produces a slide-by-slide QBR structure with talking points, metrics review, value narrative, and mutual next steps."
+---
+
+# QBR Deck Skill
+
+Produce a complete Quarterly Business Review deck — structured, data-backed, and customer-focused. A good QBR demonstrates value delivered, aligns on goals for the next quarter, and strengthens the executive relationship. It should never feel like a product demo or a vendor update.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Account name**, CSM name, and customer stakeholders attending
+- **Contract details** — ARR, contract start date, renewal date
+- **Last quarter's goals** (from previous QBR or kickoff)
+- **Usage and adoption data** — key metrics for the quarter
+- **Support summary** — tickets raised, resolution time, any escalations
+- **Business outcomes the customer cares about** — what success looks like for them
+- **Product updates or new features** relevant to this customer
+- **Goals for next quarter**
+- **Any open commercial conversations** (expansion, renewal, at-risk signals)
+
+## QBR Principles
+
+- Lead with customer outcomes, not product features
+- Every metric should connect to a business result the customer cares about
+- The agenda is a conversation, not a presentation — build in time for customer input at every stage
+- Close with mutual commitments, not just vendor actions
+
+## Output Format
+
+---
+
+# QBR: [Account Name] × [Your Company]
+**[Quarter] [Year] Business Review**
+
+**Date:** [Date] | **Location / Call link:** [TBC]
+**Customer attendees:** [Names and roles]
+**[Your company] attendees:** [Names and roles]
+
+---
+
+## Slide 1: Agenda (5 min)
+
+| Time | Topic | Owner |
+|---|---|---|
+| 0:00 | Welcome and introductions | CSM |
+| 0:05 | [Last quarter] — how did we do? | CSM + Customer |
+| 0:20 | Value delivered — business impact | CSM |
+| 0:35 | What's coming — roadmap preview | CSM / Product |
+| 0:45 | [Next quarter] — goals and priorities | Customer |
+| 0:55 | Actions and mutual commitments | CSM |
+| 1:00 | Close | |
+
+*Talking point: "We've kept today to 60 minutes. We want as much of this to be a conversation as possible — please push back, redirect, and ask questions throughout."*
+
+---
+
+## Slide 2: Where We Are Together (2 min)
+
+**Partnership snapshot:**
+- **Customer since:** [Date]
+- **Contract value:** £/$/€[ARR]/year
+- **Renewal date:** [Date]
+- **Active users:** [N] of [N] licensed seats ([X]% adoption)
+- **Products / modules active:** [List]
+
+*Talking point: "Before we dive in — a quick picture of where we are. [X] months in, [Y] active users, and this is our [Nth] QBR together."*
+
+---
+
+## Slide 3: Last Quarter — Goals We Set Together (5 min)
+
+| Goal | Set in [Last QBR / Kickoff] | Status |
+|---|---|---|
+| [Goal 1] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
+| [Goal 2] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
+| [Goal 3] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
+
+For any partial or missed goal: state what happened and what changes next quarter.
+
+*Talking point: "Let's start with accountability. Here's what we said we'd achieve last quarter — let's be honest about where we landed."*
+
+---
+
+## Slide 4: Usage and Adoption (5 min)
+
+**Quarter-over-quarter trend:**
+
+| Metric | [Q-1] | [Q] | Change |
+|---|---|---|---|
+| Monthly active users | [N] | [N] | +/-X% |
+| Sessions per user per week | [N] | [N] | +/-X% |
+| [Key feature 1] adoption | [X]% | [X]% | +/-X% |
+| [Key feature 2] adoption | [X]% | [X]% | +/-X% |
+
+**Highlights:**
+- [Positive adoption trend to call out]
+- [Feature or workflow with strongest engagement]
+
+**Opportunity:**
+- [Feature with low adoption that could drive more value — link to their goals]
+
+*Talking point: "Usage is [up / stable / something we want to talk about]. The area I'd like to focus on is [feature] — we're not seeing the adoption we'd expect given [their goal], and I want to understand why."*
+
+---
+
+## Slide 5: Business Impact — Value Delivered (10 min)
+
+Lead with outcomes, not activity.
+
+**[Outcome 1: customer's primary success metric]**
+- Before: [baseline]
+- Now: [current state]
+- Impact: [quantified business result — time saved, revenue influenced, cost reduced, risk mitigated]
+
+**[Outcome 2]**
+- [Same structure]
+
+**[Outcome 3]**
+- [Same structure]
+
+**Customer evidence** (use if available):
+> "[Quote from champion or user about value experienced]"
+
+*Talking point: "This is the section I most want your input on. Are these the outcomes that matter to your business? Are there other ways you're measuring success that we should be tracking?"*
+
+---
+
+## Slide 6: Support Summary (3 min)
+
+| Metric | This quarter | Last quarter | Trend |
+|---|---|---|---|
+| Tickets raised | [N] | [N] | ↑ / → / ↓ |
+| Average resolution time | [X hrs] | [X hrs] | ↑ / → / ↓ |
+| P1 / critical issues | [N] | [N] | ↑ / → / ↓ |
+| CSAT score | [X/10] | [X/10] | ↑ / → / ↓ |
+
+**Notable issues this quarter:**
+- [Any escalation or major ticket — brief summary and resolution]
+
+**What we're doing differently:**
+- [Any process change or improvement based on support patterns]
+
+---
+
+## Slide 7: What's Coming — Roadmap Preview (5 min)
+
+Focus only on what's relevant to this customer's goals. Do not dump the full roadmap.
+
+| Feature / Improvement | Expected | Why it matters to [Account Name] |
+|---|---|---|
+| [Feature 1] | [Q+1] | [Direct link to their goal or pain point] |
+| [Feature 2] | [Q+1 / Q+2] | [Direct link] |
+| [Feature 3] | [H2] | [Direct link] |
+
+*Talking point: "I've filtered the roadmap to what I think matters most to your team. I'd love your reaction — are these the right priorities from your perspective?"*
+
+---
+
+## Slide 8: Next Quarter — Your Goals (10 min)
+
+**Customer input section — facilitate, don't present.**
+
+Prompt questions:
+- "What does success look like for your team in [next quarter]?"
+- "What's the biggest challenge you're trying to solve in the next 90 days?"
+- "Is there anything about the way you're using [product] you want to change?"
+
+**Capture live:**
+
+| Goal for next quarter | Owner (customer) | How we'll support it | How we'll measure it |
+|---|---|---|---|
+| [Goal 1] | [Name] | [CSM / product action] | [Metric] |
+| [Goal 2] | [Name] | [CSM / product action] | [Metric] |
+
+---
+
+## Slide 9: Mutual Commitments (5 min)
+
+**[Your company] commits to:**
+1. [Specific action — owner — by when]
+2. [Specific action — owner — by when]
+3. [Specific action — owner — by when]
+
+**[Account Name] commits to:**
+1. [Specific action — owner — by when]
+2. [Specific action — owner — by when]
+
+**Next touchpoint:** [Date of next check-in or mid-quarter review]
+
+---
+
+## Slide 10: Thank You + Open Q&A (5 min)
+
+- Recap the one headline from today: [The single most important thing you want them to remember]
+- Confirm actions are captured and shared after the call
+- Ask: "Is there anything we didn't cover today that you wanted to raise?"
+
+---
+
+## Preparation Checklist
+
+- [ ] Usage data pulled and QoQ comparison calculated
+- [ ] Last QBR goals reviewed — status confirmed before the meeting
+- [ ] Business outcomes framed in customer language (not product language)
+- [ ] Roadmap filtered to this account's specific use cases
+- [ ] Customer's goals for next quarter researched or pre-confirmed with champion
+- [ ] Executive sponsor briefed on any sensitive topics before the call
+- [ ] Actions from previous QBR reviewed — any outstanding items addressed
+
+## Quality Checks
+
+- [ ] Every slide has a talking point, not just a title
+- [ ] Value slide leads with business outcomes, not product activity
+- [ ] Roadmap preview links each item to a customer goal
+- [ ] Mutual commitments section has real owners on both sides
+- [ ] Customer has at least 20 minutes of airtime in the agenda
diff --git a/plugins/pm-engineering/.claude-plugin/plugin.json b/plugins/pm-engineering/.claude-plugin/plugin.json
index 718b054..3ce2d52 100644
--- a/plugins/pm-engineering/.claude-plugin/plugin.json
+++ b/plugins/pm-engineering/.claude-plugin/plugin.json
@@ -1,13 +1,13 @@
 {
   "$schema": "https://anthropic.com/claude-code/plugin.schema.json",
   "name": "pm-engineering",
-  "version": "2.0.0",
-  "description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test
Strategy Doc, Runbook Writer. 10 structured skills for engineering teams and technical PMs.", + "version": "3.0.0", + "description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer, CI/CD Playbook, SLO & Error Budget, Developer Onboarding Doc, On-Call Runbook. 14 structured skills for engineering teams and technical PMs.", "author": { "name": "Mohit Aggarwal", "email": "mohit15856@gmail.com" }, "homepage": "https://github.com/mohitagw15856/pm-claude-skills", "license": "MIT", - "keywords": ["product-management", "engineering", "code-review", "incident-postmortem", "api-documentation", "adr", "architecture", "debugging", "pull-request", "system-design", "changelog", "test-strategy", "runbook", "devops"] + "keywords": ["product-management", "engineering", "code-review", "incident-postmortem", "api-documentation", "adr", "architecture", "debugging", "pull-request", "system-design", "changelog", "test-strategy", "runbook", "devops", "cicd", "slo", "error-budget", "onboarding", "oncall", "sre", "reliability"] } diff --git a/plugins/pm-engineering/skills/cicd-playbook/SKILL.md b/plugins/pm-engineering/skills/cicd-playbook/SKILL.md new file mode 100644 index 0000000..26f9cdc --- /dev/null +++ b/plugins/pm-engineering/skills/cicd-playbook/SKILL.md @@ -0,0 +1,301 @@ +--- +name: cicd-playbook +description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities." 
+--- + +# CI/CD Playbook Skill + +Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely. + +A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks. + +## Required Inputs + +Ask for these if not already provided: +- **Service name** and brief description +- **Tech stack** — language, framework, containerisation (Docker, etc.) +- **Source control** — GitHub / GitLab / Bitbucket, branching strategy +- **CI platform** — GitHub Actions / CircleCI / Jenkins / Buildkite / other +- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc. +- **Environments** — e.g. dev, staging, production (and any canary / feature environments) +- **Deployment frequency** — how often does the team ship? +- **Any existing gates** — manual approvals, smoke tests, feature flags +- **On-call setup** — who's responsible during deploys? + +## Output Format + +--- + +# CI/CD Playbook: [Service Name] + +**Service:** [Name] | **Team:** [Team name] +**Last updated:** [Date] | **Owner:** [Name / role] +**Pipeline platform:** [CI tool] → [CD tool / platform] + +--- + +## Overview + +[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.] + +**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand] +**Average pipeline duration:** [X minutes] +**Rollback time (p95):** [X minutes] + +--- + +## Pipeline Stages + +``` +[Branch push] + │ + ▼ +[1. Build & Lint] ──fail──▶ ❌ Block PR + │ + ▼ +[2. Unit Tests] ──fail──▶ ❌ Block PR + │ + ▼ +[3. Integration Tests] ──fail──▶ ❌ Block PR + │ + ▼ +[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify] + │ + ▼ +[5. Build Artefact / Container Image] + │ + ▼ +[6.
Deploy to Staging] ──fail──▶ ❌ Block promotion + │ + ▼ +[7. Smoke Tests (Staging)] + │ + ▼ +[8. Manual Approval Gate] ──(if required) + │ + ▼ +[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured) + │ + ▼ +[10. Post-deploy checks] +``` + +--- + +## Stage Definitions + +### Stage 1 — Build & Lint + +**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8] +**Trigger:** Every commit to any branch +**Blocking:** Yes — PR cannot be merged if this fails +**Typical duration:** [X minutes] +**Owner if it fails:** PR author + +**Common failure causes:** +- [e.g. Missing dependency — run `npm install` locally before pushing] +- [e.g. Lint rule violation — run `npm run lint -- --fix` to auto-fix most issues] + +--- + +### Stage 2 — Unit Tests + +**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`] +**Coverage gate:** [X]% minimum — pipeline fails below this threshold +**Trigger:** Every commit +**Blocking:** Yes +**Typical duration:** [X minutes] + +**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts] + +--- + +### Stage 3 — Integration Tests + +**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"] +**Environment:** [Ephemeral test environment / shared test DB / etc.] +**Trigger:** Every commit to `main` and feature branches targeting `main` +**Blocking:** Yes +**Typical duration:** [X minutes] + +**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"] + +--- + +### Stage 4 — Security Scan + +**Tools:** [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep] +**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies] +**Blocking on:** Critical and High severity findings +**Non-blocking on:** Medium and Low (flagged, not blocking) +**Trigger:** Every commit to `main` + +**How to handle a flagged vulnerability:** +1.
Check if a fix is available — upgrade the dependency +2. If no fix available, open a security ticket and add a suppression with justification +3. Never suppress without a ticket and owner + +--- + +### Stage 5 — Build Artefact + +**What is produced:** [Docker image / binary / zip — be specific] +**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL] +**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`) +**Trigger:** Commits to `main` only (not feature branches) + +--- + +### Stage 6 — Deploy to Staging + +**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply] +**Staging URL:** [URL] +**Trigger:** Automatic on successful artefact build from `main` +**Who can deploy to staging:** Any engineer (automatic) + +**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.] +**Staging is not production:** [Any differences in config, scale, or data — state them here] + +--- + +### Stage 7 — Smoke Tests (Staging) + +**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"] +**Tool:** [e.g. Playwright / Postman / custom script] +**Pass criteria:** All smoke tests pass within [X seconds] timeout +**Blocking:** Yes — production deploy will not proceed if smoke tests fail + +**Smoke test suite location:** [Link to test files or folder] + +--- + +### Stage 8 — Manual Approval Gate + +**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions] +**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer] +**Approval timeout:** [e.g. 
24 hours — auto-cancelled if no approval] +**How to approve:** [GitHub Actions approve step / Slack command / other — with link] + +**When to withhold approval:** +- Active incident in production +- Deploy is outside the deployment window (see below) +- On-call engineer has not been notified + +--- + +### Stage 9 — Deploy to Production + +**Deployment method:** [Same as staging or different — specify] +**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays] +**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy] +**Deployment notifications:** [Slack channel — #deployments] + +**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass. + +--- + +### Stage 10 — Post-Deploy Checks + +**Automated checks (run for [X minutes] after deploy):** +- [ ] Error rate: <[X]% (baseline: [Y]%) +- [ ] P99 latency: <[X]ms (baseline: [Y]ms) +- [ ] [Key business metric]: within [X]% of baseline + +**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link] + +**If a check fails:** See Rollback Procedure below. 
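The Stage 10 thresholds above lend themselves to a mechanical check. The following is a minimal Python sketch under stated assumptions: the function names (`p99`, `failed_checks`), the parameter names, and the default limits (1% error rate, 500 ms P99, 10% drift from baseline) are hypothetical placeholders and are not part of the playbook template. Replace them with values calibrated to the service's real baseline.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # ceil(0.99 * n) in integer arithmetic, giving a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def failed_checks(errors, requests, latencies_ms, metric, metric_baseline,
                  max_error_rate=0.01, max_p99_ms=500.0, max_metric_drift=0.10):
    """Return the names of failed post-deploy checks (empty list = healthy).

    Assumes latencies_ms is non-empty and metric_baseline is non-zero.
    All default limits are illustrative, not recommendations.
    """
    failures = []

    # Check 1: error rate must stay strictly below the limit.
    error_rate = errors / requests if requests else 0.0
    if error_rate >= max_error_rate:
        failures.append("error_rate")

    # Check 2: P99 latency must stay strictly below the limit.
    if p99(latencies_ms) >= max_p99_ms:
        failures.append("p99_latency")

    # Check 3: the business metric must stay within the allowed drift
    # of its baseline, in either direction.
    drift = abs(metric - metric_baseline) / metric_baseline
    if drift > max_metric_drift:
        failures.append("business_metric")

    return failures
```

A deploy job could call `failed_checks(...)` once per monitoring interval and start the rollback procedure as soon as the returned list is non-empty. Returning the names of the failed checks, rather than a single boolean, keeps the notification posted to the deployment channel specific about what went wrong.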
+ +--- + +## Environments + +| Environment | Purpose | Deploy trigger | URL | Data | +|---|---|---|---|---| +| **Dev** | Local development | Manual | localhost | Seeded test data | +| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy | +| **Production** | Live traffic | Manual approval | [URL] | Live data | + +--- + +## Branching Strategy + +**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly] + +| Branch | Purpose | Who merges | Deploy target | +|---|---|---|---| +| `main` | Production-ready code | PR + review | Staging → Production | +| `feature/*` | Feature development | Author | None (CI only) | +| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval | + +**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process] + +--- + +## Rollback Procedure + +**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only] + +**Manual rollback steps:** +```bash +# 1. Identify the last known good image tag +[command to list recent deployments] + +# 2. Deploy the previous version +[deployment command with previous tag] + +# 3. Confirm rollback is live +[smoke test command or health check URL] + +# 4. Notify the team +[Slack command or template] +``` + +**Rollback decision authority:** Any engineer on-call can initiate a rollback without waiting for approval. + +**After a rollback:** +1. Create a post-deploy incident report (see [incident-postmortem skill]) +2. Do not re-deploy the same commit without fixing the root cause +3. Notify [stakeholder / support team] of the rollback and expected fix timeline + +--- + +## Secrets and Configuration Management + +**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify] +**How to add a new secret:** +1. [Step 1] +2. 
[Step 2] +**Who has access:** [Role or team] +**Rotation policy:** [How often secrets are rotated and who owns it] + +**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4) which will flag this. + +--- + +## Common Failures and Fixes + +| Failure | Likely cause | Fix | +|---|---|---| +| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the lock file | +| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline | +| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars | +| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` | +| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike persists for more than 5 minutes | + +--- + +## On-Call Responsibilities During Deploy + +- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy +- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments` +- For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying + +--- + +## Quality Checks + +- [ ] Every stage has a clear owner when it fails +- [ ] Rollback procedure is tested — not theoretical +- [ ] Secrets management section names the actual tool used (not "use secrets management") +- [ ] Deployment window is specific — not "during business hours" +- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics diff --git a/plugins/pm-engineering/skills/developer-onboarding-doc/SKILL.md b/plugins/pm-engineering/skills/developer-onboarding-doc/SKILL.md new file mode 100644 index 0000000..91e3b93 --- /dev/null +++ b/plugins/pm-engineering/skills/developer-onboarding-doc/SKILL.md @@ -0,0 +1,332 @@ +--- +name:
developer-onboarding-doc +description: "Write a developer onboarding document for a service, codebase, or team. Use when asked to write a developer guide, service README, onboarding doc for a new engineer, codebase orientation, or getting-started guide for a technical team. Produces a structured doc covering service overview, architecture, local setup, key patterns, testing, deployment, and who to ask for what." +--- + +# Developer Onboarding Document Skill + +Produce a complete developer onboarding document for a service or team — covering everything a new engineer needs to be productive within their first week. + +A good onboarding doc is not a wiki dump. It answers the questions a new engineer actually has on day one, in the order they'll have them. + +## Required Inputs + +Ask for these if not already provided: +- **Service name** and what it does +- **Team** responsible for it +- **Tech stack** — language(s), framework(s), database(s), message queues, etc. +- **Key external dependencies** — upstream services, third-party APIs +- **Deployment target** — Kubernetes, ECS, Lambda, bare metal, etc. +- **Local dev setup** — how to run locally (Docker Compose, local DB, etc.) +- **Testing approach** — unit, integration, E2E; test commands +- **Deployment process** — summary of how code gets to production +- **On-call setup** — who's on-call, how alerts work +- **Contacts** — tech lead, platform team, related service owners + +## Output Format + +--- + +# Developer Onboarding: [Service Name] + +**Team:** [Team name] | **Tech lead:** [Name] +**Last updated:** [Date] | **Updated by:** [Name] + +> If something in this doc is wrong or out of date, fix it now — it will affect every engineer who onboards after you. + +--- + +## What This Service Does + +[3–5 sentences. What problem does this service solve? Who calls it, and who does it call? What would break if this service went down?] + +**Service type:** [API / Background worker / Event consumer / Data pipeline / etc.] 
+**Consumers:** [List internal services or external clients that depend on this service] +**Dependencies:** [List upstream services, databases, and third-party APIs this service calls] + +**Architecture diagram:** [Link or embed — even a rough ASCII diagram helps] + +``` +[Caller A] ──→ [This Service] ──→ [Database] + │ + └──→ [Downstream Service] +``` + +--- + +## Codebase Orientation + +**Repository:** [Link] +**Main branch:** `[main / master]` +**Language:** [e.g. Go 1.22 / Node.js 20 / Python 3.12] +**Framework:** [e.g. Express / FastAPI / Gin / Rails] + +### Key directories + +``` +[repo-root]/ +├── [src/ or cmd/] # Application code +│ ├── [handlers/] # HTTP handlers / controllers +│ ├── [services/] # Business logic +│ ├── [repository/] # Database access layer +│ └── [models/] # Data models / types +├── [tests/] # Test files +├── [migrations/] # Database migrations +├── [scripts/] # Utility scripts +├── [.github/workflows/] # CI/CD pipeline definitions +└── [docs/] # Additional documentation +``` + +**Where to start reading:** [Point to 2–3 key files that give the best orientation — e.g. `main.go`, `routes.js`, `app.py`] + +### Things that might surprise you + +- [Unusual pattern 1 — e.g. "We use event sourcing — state is derived from an event log, not stored directly"] +- [Unusual pattern 2 — e.g. "Auth is handled by the gateway — this service trusts the `X-User-Id` header"] +- [Unusual pattern 3 — any non-obvious decisions or legacy choices] + +--- + +## Local Development Setup + +**Estimated setup time:** [X minutes for a fresh machine] + +### Prerequisites + +- [ ] [Tool 1] — version [X] — [install link] +- [ ] [Tool 2] — version [X] — [install link] +- [ ] Access to [repo / internal package registry] — request from [who] +- [ ] [Any secrets or credentials needed] — request from [who] + +### Step-by-step setup + +```bash +# 1. Clone the repo +git clone [repo URL] +cd [repo-name] + +# 2. 
Copy and configure environment variables +cp .env.example .env +# Edit .env — see "Environment Variables" section below + +# 3. Start dependencies (database, cache, etc.) +[docker compose up -d / make deps / etc.] + +# 4. Install dependencies +[npm install / go mod download / pip install -r requirements.txt] + +# 5. Run database migrations +[migration command] + +# 6. Start the service +[start command] + +# 7. Verify it's working +curl http://localhost:[PORT]/health +# Expected: {"status":"ok"} +``` + +**If this doesn't work:** Check [Troubleshooting section below] or ask in `#[channel]`. + +### Environment Variables + +| Variable | Required | Description | Example | +|---|---|---|---| +| `DATABASE_URL` | Yes | Connection string for the primary DB | `postgres://localhost:5432/[db]` | +| `[VAR_2]` | Yes | [Description] | [Example] | +| `[VAR_3]` | No | [Description — default value] | [Example] | + +**Secrets for local dev:** [Where to get them — e.g. "Run `[command]` to pull from Vault" or "Ask [person] in #[channel]"] + +### Useful local commands + +```bash +[start command] # Start the service +[test command] # Run all tests +[lint command] # Run linter +[format command] # Format code +[migration command] # Run pending migrations +[seed command] # Seed local database +``` + +--- + +## Testing + +**Testing philosophy:** [e.g. "We test at the integration layer — unit tests for pure functions, integration tests for anything touching the DB or external services"] + +### Running tests + +```bash +# All tests +[test command] + +# Unit tests only +[unit test command] + +# Integration tests (requires local deps running) +[integration test command] + +# A specific test file or test case +[test command with filter] +``` + +**Test coverage:** [X]% (minimum required to pass CI: [Y]%) +**Coverage report:** [Where to find it] + +### Writing tests + +- **Unit tests:** [Where to put them — e.g. 
alongside source files as `*_test.go`] +- **Integration tests:** [Where to put them — e.g. `tests/integration/`] +- **Test database:** [How it works — e.g. "Each test gets a clean transaction that rolls back on teardown — see `tests/helpers/db.go`"] +- **Mocking:** [Policy — e.g. "We mock at the repository layer — don't mock the DB directly"] + +--- + +## Making Changes + +### Branching + +[Branch naming convention — e.g. `feature/[ticket-id]-short-description`, `fix/[ticket-id]-short-description`] + +### Before opening a PR + +- [ ] Tests pass locally +- [ ] Linter passes (`[lint command]`) +- [ ] New behaviour has test coverage +- [ ] Any new environment variables are added to `.env.example` and documented +- [ ] Database migrations are backward-compatible (old code can run against new schema) + +### Code review + +- **Reviewers:** [Who to request review from — e.g. "Any engineer on [team]; lead review required for auth changes"] +- **Expected review time:** [X hours / 1 business day] +- **PR template:** [Link or auto-generated by GitHub] + +### Database migrations + +```bash +# Create a new migration +[migration create command] + +# Apply pending migrations +[migration up command] + +# Roll back last migration +[migration down command] +``` + +**Migration rules:** +- All migrations must be backward-compatible — old code must run against the new schema +- Never rename or drop a column in a single migration — do it in separate steps (add new, migrate data, drop old) +- Test your rollback before merging + +--- + +## Deployment + +**How code gets to production:** [1–2 sentence summary — link to full CI/CD playbook if it exists] + +1. Merge to `main` → automatic deploy to staging +2. Smoke tests run on staging +3. Manual approval → deploy to production +4.
Post-deploy monitoring for [X minutes] + +**Deployment docs:** [Link to CI/CD playbook or pipeline docs] + +**Who can deploy:** [Any engineer / Lead engineer / On-call engineer — specify] + +**Deployment channel:** `#[deployments channel]` + +--- + +## Monitoring and Observability + +**Dashboard:** [Datadog / Grafana / CloudWatch — link] +**Logs:** [Log aggregation tool and link — e.g. "Logs are in Datadog under service:[name]"] +**Traces:** [Tracing tool and link if applicable] +**Alerts:** [Where alerts fire — e.g. PagerDuty / Slack #alerts-[service]] + +**Key metrics to know:** +- **Error rate:** Should be <[X]% (alert at [Y]%) +- **P99 latency:** Should be <[X]ms +- **[Business metric]:** [e.g. "Queue depth should be <100 items"] + +--- + +## On-Call + +**On-call schedule:** [PagerDuty / Opsgenie link] +**Who's on-call now:** [Link to current schedule or `#oncall` channel] +**Escalation:** [On-call → [team lead] → [EM] — after [X] minutes unacknowledged] + +**If you get paged:** +1. Acknowledge the alert +2. Check [dashboard link] for the first clue +3. Common alert runbooks: [link to oncall-runbook or runbook-writer output] +4. If you can't resolve in [X minutes], escalate to [person/channel] + +--- + +## Key Contacts + +| Role | Name | Best way to reach | +|---|---|---| +| Tech lead | [Name] | Slack: @[handle] | +| On-call rotation | [Team] | PagerDuty / `#on-call` | +| Platform / infra | [Team] | `#platform` Slack channel | +| Database / DBA | [Name or team] | `#database` Slack channel | +| [Upstream service] owner | [Name] | Slack: @[handle] | + +**Where to ask questions:** +- General engineering: `#engineering` +- This service specifically: `#[service-name]` +- Urgent / production issues: `#incidents` + +--- + +## Troubleshooting + +### "The service won't start locally" + +1. Check that Docker / dependencies are running: `[command]` +2. Check `.env` is populated — missing values cause silent failures +3. Check logs: `[log command]` +4. 
Ask in `#[channel]` + +### "Tests are failing locally but passing in CI" + +- Check your local dependency versions match CI: `[version check command]` +- Try a clean install: `[clean install command]` +- Integration tests need local deps running — `[start deps command]` + +### "I can't access [internal tool / system]" + +- Request access through [process — e.g. Okta self-serve / ask your manager] + +### "Something looks wrong in production" + +1. Check [dashboard] for the error spike +2. Check recent deploys in `#deployments` +3. If it's an active incident, page on-call via [PagerDuty / Slack command] + +--- + +## Further Reading + +- [Architecture Decision Records (ADRs)](./docs/decisions/) — why the codebase is the way it is +- [API documentation](./docs/api/) or [link to external docs] +- [Incident runbooks](./docs/runbooks/) +- [CI/CD pipeline documentation](./docs/cicd/) +- [Team working agreements](./docs/team/) + +--- + +## Quality Checks + +- [ ] Local setup instructions work on a fresh machine — tested recently +- [ ] Environment variables table is complete and accurate +- [ ] "Things that might surprise you" captures the actual surprises (ask a recent joiner) +- [ ] On-call section has real links, not placeholders +- [ ] Contacts are current — team members with real Slack handles +- [ ] Troubleshooting covers the top 3 actual questions new joiners ask diff --git a/plugins/pm-engineering/skills/oncall-runbook/SKILL.md b/plugins/pm-engineering/skills/oncall-runbook/SKILL.md new file mode 100644 index 0000000..957996e --- /dev/null +++ b/plugins/pm-engineering/skills/oncall-runbook/SKILL.md @@ -0,0 +1,364 @@ +--- +name: oncall-runbook +description: "Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. 
Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template." +--- + +# On-Call Runbook Skill + +Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help. + +A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation. + +## Required Inputs + +Ask for these if not already provided: +- **Service name** and what it does +- **Team** and tech lead name +- **Alert list** — names of alerts that currently page on-call +- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc. +- **Common failure modes** — what breaks most often, and what fixes it +- **Escalation contacts** — who to call when on-call can't resolve it +- **Deployment setup** — can on-call roll back? How? +- **Service dependencies** — what does this service depend on, and what depends on it? + +## Output Format + +--- + +# On-Call Runbook: [Service Name] + +**Team:** [Team name] | **Tech lead:** [Name] +**PagerDuty service:** [Link] | **Escalation policy:** [Policy name] +**Last updated:** [Date] | **Next review:** [Date + 90 days] + +> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service. + +--- + +## Quick Reference + +**Dashboard:** [Link — the first thing to open when paged] +**Logs:** [Link — where to find logs] +**Runbook index:** Jump to the alert that paged you → [Alert list below] +**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty] + +**Rollback command (memorise this):** +```bash +[rollback command — e.g. 
kubectl rollout undo deployment/[service-name]] +``` + +--- + +## Escalation Matrix + +| Situation | Escalate to | How | After how long | +|---|---|---|---| +| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes | +| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately | +| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) | +| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately | +| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately | +| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour | + +**Contacts:** + +| Name | Role | Slack | Phone | +|---|---|---|---| +| [Name] | Tech lead | @[handle] | [Number] | +| [Name] | Engineering manager | @[handle] | [Number] | +| [Name] | Platform / infra | @[handle] | [Number] | +| [Platform team] | Infra on-call | `#platform` | PagerDuty | + +--- + +## Service Architecture (Quick View) + +``` +[Upstream callers] + │ + ▼ +[This Service] + │ + ├──→ [Primary Database] + ├──→ [Cache — e.g. Redis] + └──→ [Downstream Service / Queue] +``` + +**If this service is down, these are affected:** [List downstream consumers] +**If these are down, this service is affected:** [List upstream dependencies] + +--- + +## Alert Runbooks + +### ALERT: [Alert Name 1 — e.g. HighErrorRate] + +**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"] +**Severity:** P1 / P2 / P3 +**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate] + +**Step 1 — Acknowledge and assess** +```bash +# Check current error rate +[query or dashboard link] + +# Check which endpoints are erroring +[query or command] +``` + +**Step 2 — Check recent changes** +```bash +# Any deploys in the last hour? +[command or link to deployment log] + +# Recent config changes? 
+[where to check] +``` + +**Step 3 — Check dependencies** +```bash +# Is the database healthy? +[health check command or link] + +# Is [downstream service] healthy? +[health check command or link] +``` + +**Step 4 — Diagnose** + +| If you see | It means | Do this | +|---|---|---| +| [Error pattern 1] | [Cause] | [Action] | +| [Error pattern 2] | [Cause] | [Action] | +| [Error pattern 3] | [Cause] | [Action] | +| No clear pattern | Unknown cause | Escalate to [name] | + +**Step 5 — Fix or mitigate** +```bash +# If caused by bad deploy — roll back: +[rollback command] + +# If caused by [specific issue]: +[fix command] + +# If caused by upstream dependency: +[mitigation — e.g. enable circuit breaker, reduce traffic, etc.] +``` + +**After resolving:** +- [ ] Confirm error rate has returned to baseline +- [ ] Check no downstream services were affected +- [ ] If P1: open a post-incident review — see [incident-postmortem skill] +- [ ] Update `#incidents` with resolution summary + +--- + +### ALERT: [Alert Name 2 — e.g. HighLatency] + +**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"] +**Severity:** P1 / P2 / P3 +**SLO impact:** Yes — latency SLO breach + +**Step 1 — Assess scope** +```bash +# Check which endpoints are slow +[query or dashboard — broken down by endpoint] + +# Check if latency is across all regions or localised +[query or command] +``` + +**Step 2 — Common causes and fixes** + +| Cause | Signal | Fix | +|---|---|---| +| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] | +| Cache miss storm | Cache hit rate drops on dashboard | [command or action] | +| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. 
restart, scale up] | +| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call | +| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] | + +**Step 3 — Escalate if unresolved in 20 minutes** +Page [Tech lead] via PagerDuty / Slack. + +--- + +### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted] + +**What it means:** [e.g. "The service has used all available database connections — new requests will fail"] +**Severity:** P1 +**SLO impact:** Yes — will cause errors immediately + +**Immediate mitigation:** +```bash +# Restart the service to flush stale connections +[restart command] + +# Check current connection count +[DB connection query] +``` + +**Diagnose root cause after stabilising:** +```bash +# Check for long-running queries holding connections +[query] + +# Check if a recent deploy changed connection pool config +[where to check] +``` + +**Resolution:** [e.g. "Increase pool size in config / kill long-running queries / scale the service"] + +--- + +### ALERT: [Alert Name 4 — e.g. QueueBacklogHigh / ConsumerLag] + +**What it means:** [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"] +**Severity:** P2 +**SLO impact:** Depends — if queue backs up, downstream systems will receive delayed data + +**Step 1 — Check consumer health** +```bash +# Are consumers running? +[command] + +# Consumer error rate? +[dashboard or query] +``` + +**Step 2 — Check message contents** +```bash +# Are there poison messages causing retries? 
+[command to inspect dead-letter queue or failed messages] +``` + +**Step 3 — Options** + +| If | Then | +|---|---| +| Consumers are down | Restart consumers: `[command]` | +| Poison message in queue | Move to DLQ: `[command]` | +| Consumers healthy but slow | Scale consumers: `[command]` | +| Upstream producing too fast | Escalate to [upstream service] owner | + +--- + +### ALERT: [Add additional alerts following the same pattern] + +--- + +## Diagnostic Cheat Sheet + +Common commands for quick diagnosis. Paste and run without modification. + +```bash +# Service health +[health check command] + +# Recent logs (last 100 lines) +[log command] + +# Error logs only +[error log filter command] + +# Current pod / instance status +[kubectl get pods / aws ecs describe-tasks / etc.] + +# Restart the service +[restart command] + +# Roll back to previous version +[rollback command] + +# Database connection count +[DB query] + +# Cache hit rate +[cache stats command] + +# Current request rate +[metrics query] +``` + +--- + +## Useful Dashboard Links + +| Dashboard | URL | Use it to | +|---|---|---| +| Service overview | [Link] | First stop — error rate, latency, request rate | +| Database | [Link] | Connection count, slow queries, replication lag | +| Infrastructure | [Link] | CPU, memory, disk | +| Queue / consumers | [Link] | Backlog depth, consumer throughput | +| Upstream dependencies | [Link] | Dependency health at a glance | + +--- + +## Incident Communication + +When you declare an incident: + +**Post to `#incidents` immediately:** +``` +🔴 INCIDENT — [Service Name] +Status: Investigating +Impact: [Who is affected and how] +Paged: [Your name] +Next update: [Time — max 30 min from now] +``` + +**Update every 30 minutes while active:** +``` +🔴 UPDATE — [Service Name] — [Time] +Status: [Investigating / Identified / Mitigating / Resolved] +Latest: [One sentence on what you found or did] +Next update: [Time] +``` + +**On resolution:** +``` +✅ RESOLVED — [Service Name] — 
[Time] +Duration: [X minutes] +Impact: [Summary of who was affected] +Cause: [One sentence] +Follow-up: [PIR required? Yes/No — link when created] +``` + +--- + +## On-Call Handoff + +Use this template at the end of every on-call shift: + +``` +--- ON-CALL HANDOFF: [Service Name] --- +Date: [Date] +Outgoing: [Your name] +Incoming: [Next on-call name] + +INCIDENTS THIS SHIFT: +- [Incident summary — date, duration, cause, resolution, follow-up required] + +OPEN ISSUES TO WATCH: +- [Anything not fully resolved / trending in the wrong direction] + +CHANGES SINCE LAST HANDOFF: +- [Deploys, config changes, infra changes that affect on-call awareness] + +RUNBOOK GAPS FOUND: +- [Anything you had to figure out that isn't documented — please add it] + +ANYTHING ELSE: +- [Notes for incoming on-call] +``` + +--- + +## Quality Checks + +- [ ] Every alert that pages on-call has a runbook entry — no alert is missing +- [ ] Rollback command is accurate and tested recently +- [ ] Escalation contacts have current phone numbers and Slack handles +- [ ] Diagnostic commands work — they have been run by at least one person recently +- [ ] Handoff template is used at every shift change — not just during incidents +- [ ] "Things I had to figure out that weren't documented" are added to this runbook after every incident diff --git a/plugins/pm-engineering/skills/slo-error-budget/SKILL.md b/plugins/pm-engineering/skills/slo-error-budget/SKILL.md new file mode 100644 index 0000000..05375e9 --- /dev/null +++ b/plugins/pm-engineering/skills/slo-error-budget/SKILL.md @@ -0,0 +1,231 @@ +--- +name: slo-error-budget +description: "Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence." 
+--- + +# SLO and Error Budget Skill + +Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns. + +A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity. + +## Required Inputs + +Ask for these if not already provided: +- **Service name** and brief description of what it does +- **Primary users** — who depends on this service and how +- **User-facing interactions** to protect — e.g. API calls, page loads, transactions +- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available) +- **Existing on-call setup** — who responds to alerts? +- **Deployment frequency** — how often does the team ship? +- **Any existing SLAs** with customers — these constrain SLO targets + +## Key Definitions + +Always establish these before writing the SLO: + +| Term | Definition | +|---|---| +| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" | +| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" | +| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO | +| **Error budget** | The allowed headroom below 100% — the budget for planned and unplanned downtime | +| **Burn rate** | How fast the error budget is being consumed | + +--- + +## Output Format + +--- + +# SLO Document: [Service Name] + +**Service:** [Name] | **Team:** [Team name] +**Owner:** [Name / role] | **Approved by:** [Name] +**Effective date:** [Date] | **Review date:** [Date + 3 months] +**Version:** [1.0] + +--- + +## Why This SLO Exists + +[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?] 
+ +--- + +## Service Overview + +**What this service does:** [One sentence] +**Who depends on it:** [Internal teams / external customers / both — describe] +**Critical user journeys protected by this SLO:** +1. [Journey 1 — e.g. "User completes a payment"] +2. [Journey 2] +3. [Journey 3] + +--- + +## SLIs — What We Measure + +Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum. + +### SLI 1: [Name — e.g. Request Success Rate] + +| Field | Detail | +|---|---| +| **What it measures** | [e.g. "% of API requests that return a non-5xx response"] | +| **Good event definition** | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] | +| **Bad event definition** | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] | +| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] | +| **Measured over** | Rolling 28-day window | +| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] | + +### SLI 2: [Name — e.g. Latency] + +| Field | Detail | +|---|---| +| **What it measures** | [e.g. "P99 response time for the /checkout endpoint"] | +| **Good event definition** | [e.g. "Request completes in ≤500ms at P99"] | +| **Bad event definition** | [e.g. "Request takes >500ms at P99"] | +| **Measurement source** | [Source] | +| **Measured over** | Rolling 28-day window | +| **Exclusions** | [Any exclusions] | + +### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.] 
+ +[Same structure] + +--- + +## SLO Targets + +| SLI | Target | Window | Error Budget | +|---|---|---|---| +| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] | +| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] | +| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] | + +**How targets were set:** +- Historical baseline (last 90 days): [X]% +- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment] +- Rationale: [1–2 sentences] + +**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive — it discourages feature development and doesn't reflect user reality] + +--- + +## Error Budget Calculation + +**For SLI 1 ([Name]), at [X]% target:** + +``` +Error budget = (100% - SLO target) × measurement window + = (100% - [X]%) × 28 days × 24 hours × 60 minutes + = [Y]% × [Z total minutes] + = [N] minutes of allowed failure per 28-day window +``` + +**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO. + +--- + +## Burn Rate Alerts + +Burn rate = how fast the error budget is being consumed relative to the budget window. +A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days. + +| Alert | Burn rate | Window | Severity | Response | +|---|---|---|---|---| +| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately — budget exhausted in <2 days | +| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days | +| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting | +| Info | >1× | 28 days | Info | Log only — budget on track to exhaust by end of window | + +**Alert implementation:** [Link to alert config in monitoring tool — e.g.
Datadog, Prometheus/Alertmanager, Grafana] + +--- + +## Error Budget Policy + +This policy defines what to do with the error budget — both when it's healthy and when it's burning. + +### When budget is healthy (>50% remaining) + +- Feature development and deployments proceed at normal pace +- The team may take on riskier experiments +- Reliability improvements are scheduled but not urgent + +### When budget is at risk (25–50% remaining) + +- Deployment frequency reduced — team ships only well-tested changes +- One reliability improvement added to current sprint +- Weekly error budget review added to team standup + +### When budget is nearly exhausted (<25% remaining) + +- Feature work paused in favour of reliability improvements +- No new deployments without explicit on-call approval +- Daily review of error budget burn rate +- CSM / support notified to manage customer expectations + +### When budget is exhausted (0% remaining — SLO breached) + +- All feature work stops +- On-call engineer and engineering manager notified immediately +- Post-incident review (PIR) required within 5 business days +- SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed + +--- + +## Dashboard and Reporting + +**SLO dashboard:** [Link to Datadog / Grafana / etc. dashboard] + +**Metrics exposed:** +- Current SLO compliance (rolling 28-day) +- Error budget remaining (% and minutes) +- Burn rate (current and trend) +- Incident count and MTTR this window + +**Reporting cadence:** + +| Audience | Frequency | Format | +|---|---|---| +| Engineering team | Weekly | Slack summary — #[service]-slo | +| Engineering manager | Monthly | SLO review meeting | +| Stakeholders / customers | Quarterly | SLO compliance summary | + +--- + +## Exclusions and Edge Cases + +**Planned maintenance:** Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel]. 
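To make the planned-maintenance exclusion concrete, here is a minimal Python sketch of the error budget arithmetic used in this document, with bad minutes that fall inside announced maintenance windows left out of the consumed total. The function names and the sample figures in the note below are assumptions for illustration only; real inputs would come from the monitoring tool.

```python
WINDOW_DAYS = 28  # rolling window used throughout this SLO document


def error_budget_minutes(slo_target_pct, window_days=WINDOW_DAYS):
    """Allowed bad minutes: (100% - SLO target) x minutes in the window."""
    total_minutes = window_days * 24 * 60
    return (100.0 - slo_target_pct) / 100.0 * total_minutes


def budget_remaining_pct(slo_target_pct, bad_minutes, maintenance_bad_minutes=0.0):
    """Share of the budget left, ignoring bad minutes in announced maintenance.

    Assumes a target below 100%, so the budget is never zero.
    """
    budget = error_budget_minutes(slo_target_pct)
    consumed = max(bad_minutes - maintenance_bad_minutes, 0.0)
    return (1.0 - consumed / budget) * 100.0


def days_to_exhaustion(burn_rate, window_days=WINDOW_DAYS):
    """At a constant burn rate, a full budget lasts window / burn_rate days."""
    return window_days / burn_rate
```

For example, a 99.5% target over 28 days gives 0.5% of 40,320 minutes, which is 201.6 minutes of budget, and a sustained burn rate of 14 would use up a full budget in 28 / 14 = 2 days.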
+ +**Dependency failures:** If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures). + +**Force majeure:** [Policy for cloud provider outages, major infrastructure events] + +--- + +## SLO Review Cadence + +| Review | When | Who | Output | +|---|---|---|---| +| Error budget review | Weekly | Team | Budget health check — adjust if burning fast | +| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly | +| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? | + +**When to change the SLO target:** +- Historical baseline has improved significantly and target no longer reflects real reliability +- User feedback indicates the target is misaligned with what users actually experience +- The SLO is being gamed (metric is healthy but users are unhappy) + +--- + +## Quality Checks + +- [ ] SLIs are user-facing — they measure what users experience, not internal system metrics +- [ ] Good and bad events are precisely defined — no ambiguity about what counts +- [ ] Targets are based on historical data, not aspirational round numbers +- [ ] Error budget policy has clear triggers and clear actions — not "discuss as a team" +- [ ] Burn rate alerts have different windows to catch both fast burns and slow burns +- [ ] Exclusions are documented so they don't silently inflate the SLO number diff --git a/skills/churn-analysis/SKILL.md b/skills/churn-analysis/SKILL.md new file mode 100644 index 0000000..f7bcf17 --- /dev/null +++ b/skills/churn-analysis/SKILL.md @@ -0,0 +1,179 @@ +--- +name: churn-analysis +description: "Analyse customer churn for a product or cohort and produce a structured churn report. 
Use when asked to analyse churn, understand why customers are leaving, identify churn patterns, calculate churn rate, or build a churn reduction plan. Produces a churn analysis with rate calculations, categorised reasons, early warning signals, and prioritised interventions." +--- + +# Churn Analysis Skill + +Produce a structured churn analysis that goes beyond the headline rate — identifying why customers leave, which segments are most at risk, and what interventions will have the highest impact on retention. + +## Required Inputs + +Ask for these if not already provided: +- **Time period** being analysed (e.g. Q1, last 12 months) +- **Total customers at start of period** and **customers churned** +- **ARR or revenue lost** to churn +- **Churn reasons data** — exit survey results, CSM notes, support data, or sales loss reasons +- **Customer segments** — by tier, industry, cohort, or product line +- **Current retention rate** if known +- **Any recent changes** — pricing, product, support model — that may have affected churn + +## Churn Categories + +Always classify churn before analysing it: + +| Category | Definition | +|---|---| +| **Voluntary — avoidable** | Customer left due to a problem we could have addressed (product gaps, poor onboarding, relationship failures) | +| **Voluntary — unavoidable** | Customer left for reasons outside our control (budget cuts, acquisition, company shutdown) | +| **Involuntary** | Payment failure, contract non-renewal by mistake, admin error | + +The interventions for each category are different. Conflating them leads to wrong conclusions. 
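As a rough illustration of why the split matters, the following Python sketch tallies churned accounts by the three categories above and reports each category's share of churn. It assumes every churned account has already been tagged with a category; the record layout, the category keys, and the `summarise_churn` helper are hypothetical.

```python
from collections import defaultdict

CATEGORIES = ("voluntary_avoidable", "voluntary_unavoidable", "involuntary")


def summarise_churn(churned_accounts, customers_at_start):
    """Return the churn rate plus per-category counts, ARR lost, and share.

    churned_accounts: iterable of (category, arr_lost) tuples.
    """
    counts = defaultdict(int)
    arr = defaultdict(float)
    for category, arr_lost in churned_accounts:
        if category not in CATEGORIES:
            raise ValueError(f"unknown churn category: {category}")
        counts[category] += 1
        arr[category] += arr_lost
    total = sum(counts.values())
    by_category = {
        c: {
            "customers": counts[c],
            "arr_lost": arr[c],
            "share_of_churn_pct": (counts[c] / total * 100) if total else 0.0,
        }
        for c in CATEGORIES
    }
    return {
        # The denominator is the customer count at the start of the period.
        "churn_rate_pct": total / customers_at_start * 100,
        "by_category": by_category,
    }
```

The `voluntary_avoidable` entry is the figure that retention work can actually influence, so it is usually the one to track over time.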
+ +## Output Format + +--- + +# Churn Analysis: [Product / Segment / Company] +**Period:** [Start date] — [End date] +**Prepared by:** [Name] | **Date:** [Date] + +--- + +## Headline Numbers + +| Metric | Value | +|---|---| +| Customers at start of period | [N] | +| Customers churned | [N] | +| **Customer churn rate** | **[X]%** | +| ARR at start of period | £/$/€[X] | +| ARR lost to churn | £/$/€[X] | +| **Revenue churn rate (gross)** | **[X]%** | +| ARR from expansions (same period) | £/$/€[X] | +| **Net revenue retention (NRR)** | **[X]%** | + +**Benchmark context:** +- Customer churn rate: [X]% vs. industry benchmark [Y]% — [above / below / in line] +- NRR: [X]% — [What this means: above 100% = expansion offsets churn; below 100% = shrinking base] + +--- + +## Churn Breakdown by Category + +| Category | Customers | % of churn | ARR lost | +|---|---|---|---| +| Voluntary — avoidable | [N] | [X]% | £/$/€[X] | +| Voluntary — unavoidable | [N] | [X]% | £/$/€[X] | +| Involuntary | [N] | [X]% | £/$/€[X] | +| **Total** | **[N]** | **100%** | **£/$/€[X]** | + +**Avoidable churn as % of total churn:** [X]% — this is the number we can actually influence. + +--- + +## Churn Reasons — Avoidable Churn Only + +Rank by frequency. Include ARR weight where data allows. + +| Reason | Count | % of avoidable churn | ARR lost | Representative quote | +|---|---|---|---|---| +| [Reason 1 — e.g. "Product missing key feature"] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| [Reason 2] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| [Reason 3] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| [Reason 4] | [N] | [X]% | £/$/€[X] | "[Quote]" | +| Other | [N] | [X]% | £/$/€[X] | — | + +**Theme synthesis:** [2–3 sentences grouping the top reasons into 2–3 themes. E.g. 
"The top three reasons cluster around two themes: product gaps in [area] (affecting X% of avoidable churn) and onboarding failures where customers never achieved value (Y%)."] + +--- + +## Churn by Segment + +Identify which segments over- or under-index for churn. + +### By Tier + +| Tier | Churn rate | vs. Overall | Notes | +|---|---|---|---| +| Enterprise | [X]% | +/-[X]pp | | +| Mid-Market | [X]% | +/-[X]pp | | +| SMB | [X]% | +/-[X]pp | | + +### By Cohort (Acquisition Year) + +| Cohort | Churn rate | Notes | +|---|---|---| +| [Year 1] | [X]% | | +| [Year 2] | [X]% | | +| [Year 3] | [X]% | | + +### By Industry / Use Case (if data available) + +| Segment | Churn rate | Notes | +|---|---|---| +| [Segment 1] | [X]% | | +| [Segment 2] | [X]% | | + +**Key pattern:** [Which segment has the highest churn rate and what likely explains it] + +--- + +## Timing Analysis + +- **Average contract length before churn:** [X months] +- **Highest-risk moment:** [e.g. "Month 3 — when trial value has worn off but full adoption hasn't happened"] +- **Churn timing distribution:** + +| When churn occurred | % of churned accounts | +|---|---| +| 0–3 months | [X]% | +| 3–6 months | [X]% | +| 6–12 months | [X]% | +| 12+ months | [X]% | + +--- + +## Early Warning Signals + +Based on the churned accounts, identify the signals that preceded churn (and could have triggered earlier intervention): + +| Signal | Lead time before churn | How to detect | +|---|---|---| +| [Signal 1 — e.g. "DAU/MAU dropped below 15%"] | [~X weeks] | [Usage dashboard / alert] | +| [Signal 2 — e.g. "No QBR in 90+ days"] | [~X weeks] | [CRM flag] | +| [Signal 3 — e.g. "Champion left the account"] | [~X weeks] | [LinkedIn alert / CSM tracking] | +| [Signal 4] | [~X weeks] | [Detection method] | + +--- + +## Intervention Recommendations + +Ranked by estimated impact × feasibility. + +| Intervention | Addresses | Est. churn reduction | Effort | Owner | +|---|---|---|---|---| +| [Intervention 1 — e.g. 
"Improve onboarding for [segment] with dedicated 30-day check-in"] | [Reason 1] | [X accounts / £X ARR] | Low / Med / High | [Team] | +| [Intervention 2] | [Reason 2] | [X accounts / £X ARR] | Low / Med / High | [Team] | +| [Intervention 3] | [Reason 3] | [X accounts / £X ARR] | Low / Med / High | [Team] | + +**Priority call:** [Which one intervention, if implemented this quarter, would have the biggest impact and why] + +--- + +## What We Don't Know (Data Gaps) + +- [Data gap 1 — e.g. "Exit survey response rate is only 30% — the reasons data may not be representative"] +- [Data gap 2 — e.g. "No product usage data for SMB tier — can't confirm usage signal correlation"] +- [Data gap 3] + +--- + +## Quality Checks + +- [ ] Churn rate is correctly calculated (churned ÷ starting cohort, not end-of-period total) +- [ ] Avoidable and unavoidable churn are separated — interventions target avoidable churn only +- [ ] Churn reasons are customer-reported, not internally assumed +- [ ] Segment analysis identifies which segments over-index — not just averages +- [ ] Early warning signals are specific and detectable, not generic ("low engagement") +- [ ] Interventions link directly to the top churn reasons — no recommendations without a root cause match diff --git a/skills/cicd-playbook/SKILL.md b/skills/cicd-playbook/SKILL.md new file mode 100644 index 0000000..26f9cdc --- /dev/null +++ b/skills/cicd-playbook/SKILL.md @@ -0,0 +1,301 @@ +--- +name: cicd-playbook +description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities." 
+--- + +# CI/CD Playbook Skill + +Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely. + +A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks. + +## Required Inputs + +Ask for these if not already provided: +- **Service name** and brief description +- **Tech stack** — language, framework, containerisation (Docker, etc.) +- **Source control** — GitHub / GitLab / Bitbucket, branching strategy +- **CI platform** — GitHub Actions / CircleCI / Jenkins / BuildKite / other +- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc. +- **Environments** — e.g. dev, staging, production (and any canary / feature environments) +- **Deployment frequency** — how often does the team ship? +- **Any existing gates** — manual approvals, smoke tests, feature flags +- **On-call setup** — who's responsible during deploys? + +## Output Format + +--- + +# CI/CD Playbook: [Service Name] + +**Service:** [Name] | **Team:** [Team name] +**Last updated:** [Date] | **Owner:** [Name / role] +**Pipeline platform:** [CI tool] → [CD tool / platform] + +--- + +## Overview + +[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.] + +**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand] +**Average pipeline duration:** [X minutes] +**Rollback time (p95):** [X minutes] + +--- + +## Pipeline Stages + +``` +[Branch push] + │ + ▼ +[1. Build & Lint] ──fail──▶ ❌ Block PR + │ + ▼ +[2. Unit Tests] ──fail──▶ ❌ Block PR + │ + ▼ +[3. Integration Tests] ──fail──▶ ❌ Block PR + │ + ▼ +[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify] + │ + ▼ +[5. Build Artefact / Container Image] + │ + ▼ +[6. 
Deploy to Staging] ──fail──▶ ❌ Block promotion + │ + ▼ +[7. Smoke Tests (Staging)] + │ + ▼ +[8. Manual Approval Gate] ──(if required) + │ + ▼ +[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured) + │ + ▼ +[10. Post-deploy checks] +``` + +--- + +## Stage Definitions + +### Stage 1 — Build & Lint + +**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8] +**Trigger:** Every commit to any branch +**Blocking:** Yes — PR cannot be merged if this fails +**Typical duration:** [X minutes] +**Owner if it fails:** PR author + +**Common failure causes:** +- [e.g. Missing dependency — run `npm install` locally before pushing] +- [e.g. Lint rule violation — run `npm run lint -- --fix` to auto-fix most issues] + +--- + +### Stage 2 — Unit Tests + +**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`] +**Coverage gate:** [X]% minimum — pipeline fails below this threshold +**Trigger:** Every commit +**Blocking:** Yes +**Typical duration:** [X minutes] + +**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts] + +--- + +### Stage 3 — Integration Tests + +**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"] +**Environment:** [Ephemeral test environment / shared test DB / etc.] +**Trigger:** Every commit to `main` and feature branches targeting `main` +**Blocking:** Yes +**Typical duration:** [X minutes] + +**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"] + +--- + +### Stage 4 — Security Scan + +**Tools:** [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep] +**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies] +**Blocking on:** Critical and High severity findings +**Non-blocking on:** Medium and Low (flagged, not blocking) +**Trigger:** Every commit to `main` + +**How to handle a flagged vulnerability:** +1.
Check if a fix is available — upgrade the dependency +2. If no fix available, open a security ticket and add a suppression with justification +3. Never suppress without a ticket and owner + +--- + +### Stage 5 — Build Artefact + +**What is produced:** [Docker image / binary / zip — be specific] +**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL] +**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`) +**Trigger:** Commits to `main` only (not feature branches) + +--- + +### Stage 6 — Deploy to Staging + +**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply] +**Staging URL:** [URL] +**Trigger:** Automatic on successful artefact build from `main` +**Who can deploy to staging:** Any engineer (automatic) + +**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.] +**Staging is not production:** [Any differences in config, scale, or data — state them here] + +--- + +### Stage 7 — Smoke Tests (Staging) + +**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"] +**Tool:** [e.g. Playwright / Postman / custom script] +**Pass criteria:** All smoke tests pass within [X seconds] timeout +**Blocking:** Yes — production deploy will not proceed if smoke tests fail + +**Smoke test suite location:** [Link to test files or folder] + +--- + +### Stage 8 — Manual Approval Gate + +**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions] +**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer] +**Approval timeout:** [e.g. 
24 hours — auto-cancelled if no approval] +**How to approve:** [GitHub Actions approve step / Slack command / other — with link] + +**When to withhold approval:** +- Active incident in production +- Deploy is outside the deployment window (see below) +- On-call engineer has not been notified + +--- + +### Stage 9 — Deploy to Production + +**Deployment method:** [Same as staging or different — specify] +**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays] +**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy] +**Deployment notifications:** [Slack channel — #deployments] + +**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass. + +--- + +### Stage 10 — Post-Deploy Checks + +**Automated checks (run for [X minutes] after deploy):** +- [ ] Error rate: <[X]% (baseline: [Y]%) +- [ ] P99 latency: <[X]ms (baseline: [Y]ms) +- [ ] [Key business metric]: within [X]% of baseline + +**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link] + +**If a check fails:** See Rollback Procedure below. 
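The Stage 10 checks above can be sketched as a small Python helper that compares post-deploy metrics against their limits and lists which checks failed. This is only an illustration: the `Thresholds` fields, the `evaluate_post_deploy` function, and the sample numbers are hypothetical, and real values would come from the monitoring dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits; calibrate them against the service's real baseline."""
    max_error_rate_pct: float      # error rate must stay below this
    max_p99_latency_ms: float      # P99 latency must stay below this
    max_metric_drift_pct: float    # allowed deviation from the business baseline


def evaluate_post_deploy(error_rate_pct, p99_latency_ms,
                         business_metric, business_baseline, limits):
    """Return the names of failed checks; an empty list means all passed."""
    if business_baseline == 0:
        raise ValueError("business_baseline must be non-zero")
    failures = []
    if error_rate_pct >= limits.max_error_rate_pct:
        failures.append("error_rate")
    if p99_latency_ms >= limits.max_p99_latency_ms:
        failures.append("p99_latency")
    drift_pct = abs(business_metric - business_baseline) / abs(business_baseline) * 100
    if drift_pct > limits.max_metric_drift_pct:
        failures.append("business_metric")
    return failures
```

A non-empty result would be the signal to start the rollback procedure; whether that happens automatically or by hand depends on how the pipeline is configured.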
+ +--- + +## Environments + +| Environment | Purpose | Deploy trigger | URL | Data | +|---|---|---|---|---| +| **Dev** | Local development | Manual | localhost | Seeded test data | +| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy | +| **Production** | Live traffic | Manual approval | [URL] | Live data | + +--- + +## Branching Strategy + +**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly] + +| Branch | Purpose | Who merges | Deploy target | +|---|---|---|---| +| `main` | Production-ready code | PR + review | Staging → Production | +| `feature/*` | Feature development | Author | None (CI only) | +| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval | + +**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process] + +--- + +## Rollback Procedure + +**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only] + +**Manual rollback steps:** +```bash +# 1. Identify the last known good image tag +[command to list recent deployments] + +# 2. Deploy the previous version +[deployment command with previous tag] + +# 3. Confirm rollback is live +[smoke test command or health check URL] + +# 4. Notify the team +[Slack command or template] +``` + +**Rollback decision authority:** Any engineer on-call can initiate a rollback without waiting for approval. + +**After a rollback:** +1. Create a post-deploy incident report (see [incident-postmortem skill]) +2. Do not re-deploy the same commit without fixing the root cause +3. Notify [stakeholder / support team] of the rollback and expected fix timeline + +--- + +## Secrets and Configuration Management + +**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify] +**How to add a new secret:** +1. [Step 1] +2. 
[Step 2] +**Who has access:** [Role or team] +**Rotation policy:** [How often secrets are rotated and who owns it] + +**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4), which will flag this. + +--- + +## Common Failures and Fixes + +| Failure | Likely cause | Fix | +|---|---|---| +| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the lock file | +| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline | +| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars | +| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` | +| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike lasts more than 5 minutes | + +--- + +## On-Call Responsibilities During Deploy + +- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy +- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments` +- For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying + +--- + +## Quality Checks + +- [ ] Every stage has a clear owner when it fails +- [ ] Rollback procedure is tested — not theoretical +- [ ] Secrets management section names the actual tool used (not "use secrets management") +- [ ] Deployment window is specific — not "during business hours" +- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics diff --git a/skills/cs-escalation-brief/SKILL.md b/skills/cs-escalation-brief/SKILL.md new file mode 100644 index 0000000..0cca77d --- /dev/null +++ b/skills/cs-escalation-brief/SKILL.md @@ -0,0 +1,176 @@ +--- +name: cs-escalation-brief +description: "Write a structured escalation brief for an at-risk
customer account. Use when an account has escalated, when a customer is threatening churn, when a P1 customer issue needs executive attention, or when preparing an internal save play. Produces a crisp escalation brief with account context, timeline, root cause, business impact, and a clear resolution plan." +--- + +# Customer Escalation Brief Skill + +Produce a clear, concise escalation brief that gives internal stakeholders — VP CS, CCO, product leadership, or the CEO — everything they need to understand the situation, make decisions, and act fast. + +A good escalation brief is not a complaint. It is a professional document that states the facts, assigns accountability honestly, and proposes a specific resolution plan. + +## Required Inputs + +Ask for these if not already provided: +- **Account name**, tier, and ARR +- **CSM name** and account owner +- **Nature of the escalation** — what happened, what the customer is saying +- **Timeline** of events leading to escalation +- **Customer contact** who escalated (name, role, influence level) +- **What the customer wants** — their stated ask +- **What we believe the root cause is** +- **What has already been done** to address the situation +- **Renewal date** and current renewal risk assessment + +## Escalation Levels + +Calibrate urgency and audience based on escalation level: + +| Level | Trigger | Audience | Response time | +|---|---|---|---| +| L1 — Account Risk | Customer expressing dissatisfaction; renewal at risk | CSM + CS Manager | 24 hours | +| L2 — Executive Escalation | Customer escalated to their exec; requesting vendor exec involvement | VP CS + Account Exec | 4 hours | +| L3 — Churn Risk | Customer has issued notice or is in active churn conversation | CCO / CEO + Revenue leadership | 1 hour | +| L4 — Public Risk | Customer threatening public escalation, legal, or press | CCO / Legal / Comms | Immediate | + +## Output Format + +--- + +# Escalation Brief: [Account Name] + +**Escalation level:** 
L[1/2/3/4] — [Label] +**Date raised:** [Date] +**Raised by:** [CSM name] +**Escalation owner:** [Name of exec or senior stakeholder now leading response] + +--- + +## Account at a Glance + +| Field | Detail | +|---|---| +| ARR | £/$/€[X] | +| Tier | Enterprise / Mid-Market / SMB | +| Customer since | [Date] | +| Renewal date | [Date] — [N] days away | +| Renewal risk (pre-escalation) | Green / Amber / Red | +| Renewal risk (current) | Green / Amber / Red | +| Customer contact who escalated | [Name, role, seniority] | +| Executive sponsor (customer) | [Name, role — active / passive / vacant] | +| Executive sponsor (vendor) | [Name, role] | + +--- + +## What Happened — Summary + +[3–5 sentences. State the facts plainly. What the customer experienced, how they reacted, and how we learned about the escalation. No editorialising. No blame.] + +--- + +## Timeline + +List in chronological order. Each entry: `[Date / time] — [What happened. Who did what.]` + +Include: +- When the original issue or trigger event occurred +- When the customer first raised concerns (informally) +- When it escalated (formal escalation or exec involvement) +- Actions taken since escalation + +--- + +## Root Cause + +**Primary cause:** [One clear sentence. What specifically went wrong.] 
+ +**Contributing factors:** +- [Factor 1 — be honest about internal failures as well as external ones] +- [Factor 2] + +**Is this a systemic issue or isolated?** +[ ] Isolated to this account +[ ] Pattern seen in other accounts — details: [_______] +[ ] Product or process gap that needs fixing + +--- + +## Customer's Stated Position + +**What the customer says happened:** [Their version of events — fair and unfiltered] + +**What they are asking for:** [Their explicit ask — compensation, fix by date, exec call, SLA credit, exit clause] + +**Sentiment of escalating contact:** [Frustrated but constructive / Angry / Seeking exit / Unknown] + +**Risk of public escalation:** Low / Medium / High — [evidence if Medium or High] + +--- + +## Business Impact + +| Impact type | Detail | +|---|---| +| ARR at risk | £/$/€[X] | +| Potential churn probability | [X]% | +| Reputational risk | Low / Medium / High | +| Reference / case study status | [Was a reference — now at risk / Not a reference] | +| Expansion pipeline at risk | £/$/€[X] | + +--- + +## What Has Been Done So Far + +1. [Action taken — by whom — date — outcome] +2. [Action taken — by whom — date — outcome] +3. [Action taken — by whom — date — outcome] + +**Has a formal apology or acknowledgement been issued?** Yes / No + +--- + +## Proposed Resolution Plan + +**Immediate actions (next 24–48 hours):** + +| Action | Owner | By when | +|---|---|---| +| [Action] | [Name] | [Date] | +| [Action] | [Name] | [Date] | + +**Medium-term actions (next 2–4 weeks):** + +| Action | Owner | By when | +|---|---|---| +| [Action] | [Name] | [Date] | + +**What we are NOT offering:** [Be explicit about what is not on the table — avoids misaligned expectations] + +**Success criteria:** [How will we know the escalation is resolved? What does the customer need to confirm they are satisfied?] + +--- + +## Decision Required from Escalation Owner + +[State clearly what decision or resource the escalation owner needs to provide. 
Be specific — do not make them ask. E.g.: "We need approval to offer a 20% service credit for Q2" or "We need an exec call with [name] within 48 hours."] + +--- + +## Communication Plan + +| Audience | Message | Channel | Owner | By when | +|---|---|---|---|---| +| Escalating customer contact | [Summary of message] | Email / Call | [Name] | [Date] | +| Customer exec sponsor | [Summary] | Call | [Name] | [Date] | +| Internal CS team | [Summary] | Slack / Meeting | CS Manager | [Date] | + +--- + +## Quality Checks + +- [ ] Root cause is specific — not "communication breakdown" or "product gap" without detail +- [ ] Customer's position is stated fairly — not minimised or dismissed +- [ ] A clear decision is requested from the escalation owner — brief does not end with "what do you think?" +- [ ] ARR at risk is quantified +- [ ] Communication plan has owners and dates — not "TBD" +- [ ] Language is professional and blameless toward individuals diff --git a/skills/cs-health-scorecard/SKILL.md b/skills/cs-health-scorecard/SKILL.md new file mode 100644 index 0000000..17237b7 --- /dev/null +++ b/skills/cs-health-scorecard/SKILL.md @@ -0,0 +1,141 @@ +--- +name: cs-health-scorecard +description: "Build a customer health scorecard for a specific account. Use when asked to score account health, assess renewal risk, build a health dashboard, or evaluate an account's likelihood to renew or expand. Produces a structured health scorecard with a RAG status, dimension scores, key risks, and recommended actions." +--- + +# Customer Health Scorecard Skill + +Produce a structured, data-driven health scorecard for a customer account — giving the CSM and leadership a clear view of renewal risk, expansion potential, and the actions needed to move the account in the right direction. 
+ +## Required Inputs + +Ask for these if not already provided: +- **Account name** and tier (enterprise / mid-market / SMB) +- **Contract value** (ARR) and **renewal date** +- **Product usage data** — logins, DAU/MAU ratio, key feature adoption +- **Support data** — open tickets, CSAT or NPS score, recent escalations +- **Engagement data** — last QBR date, executive sponsor status, champion name +- **Commercial data** — payment history, expansion conversations, seats used vs. licensed +- **Any known risks or recent changes** at the account + +## Scoring Framework + +Score each dimension 1–5. Weight as shown. For each dimension, multiply the score by its weight and divide by 5, then sum the results for a weighted total out of 100 (all 5s = 100). + +| Dimension | Weight | What to Score | +|---|---|---| +| **Product Adoption** | 30% | DAU/MAU ratio, breadth of features used, power users identified | +| **Engagement** | 20% | QBR cadence, executive sponsor active, champion strength | +| **Outcomes** | 20% | Customer hitting their stated goals / success metrics | +| **Support Health** | 15% | Ticket volume trend, unresolved escalations, CSAT | +| **Commercial** | 15% | On-time payments, seats utilised, expansion signals | + +**Score → RAG conversion:** +- 80–100: Green (healthy, renewal likely) +- 60–79: Amber (at risk, needs attention) +- 0–59: Red (high churn risk, escalate) + +## Output Format + +--- + +# Customer Health Scorecard: [Account Name] + +**CSM:** [Name] | **Tier:** [Enterprise / Mid-Market / SMB] +**ARR:** £/$/€[X] | **Renewal date:** [Date] | **Days to renewal:** [N] +**Overall health:** [Green / Amber / Red] — [Score]/100 +**Last updated:** [Date] + +--- + +## Health Score Summary + +| Dimension | Score (1–5) | Weight | Weighted Score | Trend | +|---|---|---|---|---| +| Product Adoption | [1–5] | 30% | [X] | ↑ / → / ↓ | +| Engagement | [1–5] | 20% | [X] | ↑ / → / ↓ | +| Outcomes | [1–5] | 20% | [X] | ↑ / → / ↓ | +| Support Health | [1–5] | 15% | [X] | ↑ / → / ↓ | +| Commercial | [1–5] | 15% | [X] | ↑ / → / ↓ | +| **Total** | — | 100% | **[X]/100** | | +
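The Weighted Score column in the summary table can be filled in mechanically. The following is a minimal sketch, assuming each 1–5 score is scaled linearly so that a 5 earns the dimension's full weight (the skill itself only states "out of 100", so this scaling is an interpretation). The function names `weighted_points`, `health_total`, and `rag_status` and the example scores are hypothetical and exist only for illustration.

```python
# Weights from the Scoring Framework table, as percentage points (sum to 100).
WEIGHTS = {
    "Product Adoption": 30,
    "Engagement": 20,
    "Outcomes": 20,
    "Support Health": 15,
    "Commercial": 15,
}


def weighted_points(score: int, weight: int) -> float:
    """Points one dimension contributes, assuming a score of 5 earns its full weight."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return score * weight / 5


def health_total(scores: dict[str, int]) -> float:
    """Sum the weighted points of all five dimensions (total out of 100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(weighted_points(scores[name], weight) for name, weight in WEIGHTS.items())


def rag_status(total: float) -> str:
    """Map a total to the RAG bands listed in the skill."""
    if total >= 80:
        return "Green"
    if total >= 60:
        return "Amber"
    return "Red"


# Illustrative scores only, not real account data.
example = {
    "Product Adoption": 4,
    "Engagement": 3,
    "Outcomes": 4,
    "Support Health": 5,
    "Commercial": 2,
}
total = health_total(example)
print(f"{total:.0f}/100 -> {rag_status(total)}")
```

Under this scaling the lowest possible total is 20 (all 1s), so the bottom of the Red band is never reached. If a team prefers a scale that starts at 0, the conversion in `weighted_points` is the only place that would need to change.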
+--- + +## Dimension Detail + +### Product Adoption — [Score]/5 +- **DAU/MAU ratio:** [X]% (benchmark: >25% = healthy) +- **Key features adopted:** [List features in use] +- **Features not adopted:** [List unused high-value features] +- **Power users identified:** [Yes / No — how many] +- **Assessment:** [1–2 sentences on adoption health] + +### Engagement — [Score]/5 +- **Last QBR:** [Date] — [Outcome summary] +- **Next QBR:** [Scheduled / Overdue] +- **Executive sponsor:** [Active / Passive / Vacant] +- **Champion:** [Name, role, strength: strong / moderate / weak] +- **Assessment:** [1–2 sentences] + +### Outcomes — [Score]/5 +- **Customer's stated goals:** [List 2–3 goals from onboarding or last QBR] +- **Progress against goals:** [On track / Partial / Off track] +- **Evidence of value:** [Metric or quote that demonstrates ROI] +- **Assessment:** [1–2 sentences] + +### Support Health — [Score]/5 +- **Open tickets:** [N] (priority breakdown: P1: X, P2: X, P3: X) +- **CSAT / NPS:** [Score] (benchmark: >8 CSAT / >30 NPS = healthy) +- **Unresolved escalations:** [Yes / No — details if yes] +- **Ticket trend (last 90 days):** Increasing / Stable / Decreasing +- **Assessment:** [1–2 sentences] + +### Commercial — [Score]/5 +- **Seats licensed:** [N] | **Seats active:** [N] ([X]% utilisation) +- **Payment history:** [On time / Late — details] +- **Expansion signals:** [Yes — describe / No] +- **Downgrade or cancellation signals:** [Yes — describe / No] +- **Assessment:** [1–2 sentences] + +--- + +## Top Risks + +| Risk | Severity | Mitigation | +|---|---|---| +| [Risk description] | High / Medium / Low | [Specific action to mitigate] | + +--- + +## Recommended Actions + +**Immediate (this week):** +1. [Action — owner — deadline] + +**This month:** +1. [Action — owner — deadline] + +**Before renewal:** +1. 
[Action — owner — deadline] + +--- + +## Renewal Forecast + +| Scenario | Probability | ARR at risk | +|---|---|---| +| Full renewal at current ARR | [X]% | £/$/€0 | +| Renewal with contraction | [X]% | £/$/€[X] | +| Churn | [X]% | £/$/€[full ARR] | + +**Recommended renewal play:** [Expand / Hold / Save / Manage out] + +--- + +## Quality Checks + +- [ ] Score is based on data, not gut feel — each dimension has evidence +- [ ] Risks are specific (not "low engagement" — something like "executive sponsor left in March, no replacement identified") +- [ ] Actions have owners and deadlines +- [ ] Renewal probability is calibrated against pipeline reality +- [ ] Trend arrows reflect direction of change vs. last scorecard, not just current state diff --git a/skills/developer-onboarding-doc/SKILL.md b/skills/developer-onboarding-doc/SKILL.md new file mode 100644 index 0000000..91e3b93 --- /dev/null +++ b/skills/developer-onboarding-doc/SKILL.md @@ -0,0 +1,332 @@ +--- +name: developer-onboarding-doc +description: "Write a developer onboarding document for a service, codebase, or team. Use when asked to write a developer guide, service README, onboarding doc for a new engineer, codebase orientation, or getting-started guide for a technical team. Produces a structured doc covering service overview, architecture, local setup, key patterns, testing, deployment, and who to ask for what." +--- + +# Developer Onboarding Document Skill + +Produce a complete developer onboarding document for a service or team — covering everything a new engineer needs to be productive within their first week. + +A good onboarding doc is not a wiki dump. It answers the questions a new engineer actually has on day one, in the order they'll have them. + +## Required Inputs + +Ask for these if not already provided: +- **Service name** and what it does +- **Team** responsible for it +- **Tech stack** — language(s), framework(s), database(s), message queues, etc. 
+- **Key external dependencies** — upstream services, third-party APIs +- **Deployment target** — Kubernetes, ECS, Lambda, bare metal, etc. +- **Local dev setup** — how to run locally (Docker Compose, local DB, etc.) +- **Testing approach** — unit, integration, E2E; test commands +- **Deployment process** — summary of how code gets to production +- **On-call setup** — who's on-call, how alerts work +- **Contacts** — tech lead, platform team, related service owners + +## Output Format + +--- + +# Developer Onboarding: [Service Name] + +**Team:** [Team name] | **Tech lead:** [Name] +**Last updated:** [Date] | **Updated by:** [Name] + +> If something in this doc is wrong or out of date, fix it now — it will affect every engineer who onboards after you. + +--- + +## What This Service Does + +[3–5 sentences. What problem does this service solve? Who calls it, and who does it call? What would break if this service went down?] + +**Service type:** [API / Background worker / Event consumer / Data pipeline / etc.] +**Consumers:** [List internal services or external clients that depend on this service] +**Dependencies:** [List upstream services, databases, and third-party APIs this service calls] + +**Architecture diagram:** [Link or embed — even a rough ASCII diagram helps] + +``` +[Caller A] ──→ [This Service] ──→ [Database] + │ + └──→ [Downstream Service] +``` + +--- + +## Codebase Orientation + +**Repository:** [Link] +**Main branch:** `[main / master]` +**Language:** [e.g. Go 1.22 / Node.js 20 / Python 3.12] +**Framework:** [e.g. 
Express / FastAPI / Gin / Rails] + +### Key directories + +``` +[repo-root]/ +├── [src/ or cmd/] # Application code +│ ├── [handlers/] # HTTP handlers / controllers +│ ├── [services/] # Business logic +│ ├── [repository/] # Database access layer +│ └── [models/] # Data models / types +├── [tests/] # Test files +├── [migrations/] # Database migrations +├── [scripts/] # Utility scripts +├── [.github/workflows/] # CI/CD pipeline definitions +└── [docs/] # Additional documentation +``` + +**Where to start reading:** [Point to 2–3 key files that give the best orientation — e.g. `main.go`, `routes.js`, `app.py`] + +### Things that might surprise you + +- [Unusual pattern 1 — e.g. "We use event sourcing — state is derived from an event log, not stored directly"] +- [Unusual pattern 2 — e.g. "Auth is handled by the gateway — this service trusts the `X-User-Id` header"] +- [Unusual pattern 3 — any non-obvious decisions or legacy choices] + +--- + +## Local Development Setup + +**Estimated setup time:** [X minutes for a fresh machine] + +### Prerequisites + +- [ ] [Tool 1] — version [X] — [install link] +- [ ] [Tool 2] — version [X] — [install link] +- [ ] Access to [repo / internal package registry] — request from [who] +- [ ] [Any secrets or credentials needed] — request from [who] + +### Step-by-step setup + +```bash +# 1. Clone the repo +git clone [repo URL] +cd [repo-name] + +# 2. Copy and configure environment variables +cp .env.example .env +# Edit .env — see "Environment Variables" section below + +# 3. Start dependencies (database, cache, etc.) +[docker compose up -d / make deps / etc.] + +# 4. Install dependencies +[npm install / go mod download / pip install -r requirements.txt] + +# 5. Run database migrations +[migration command] + +# 6. Start the service +[start command] + +# 7. Verify it's working +curl http://localhost:[PORT]/health +# Expected: {"status":"ok"} +``` + +**If this doesn't work:** Check [Troubleshooting section below] or ask in `#[channel]`. 
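Step 2 of the setup copies `.env.example` to `.env` and relies on the engineer to fill in the values by hand, which is easy to get wrong. As a rough sketch of how that step could be checked, the snippet below lists keys declared in `.env.example` that are absent or empty in `.env`. It assumes simple `KEY=VALUE` lines only (no `export` prefixes or multi-line values); `parse_env` and `unset_keys` are hypothetical helpers, and the sample values are placeholders.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of matching-style quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unset_keys(example_text: str, env_text: str) -> list[str]:
    """Return keys declared in .env.example that are absent or empty in .env."""
    env = parse_env(env_text)
    return [key for key in parse_env(example_text) if not env.get(key)]


# Inline sample content standing in for the two files (placeholder values).
example_text = "DATABASE_URL=\nPORT=8080\n# optional\nLOG_LEVEL=info\n"
env_text = "DATABASE_URL=postgres://localhost:5432/app\nPORT=\n"
print(unset_keys(example_text, env_text))
```

In practice the two strings would come from reading `.env.example` and `.env` in the repository root. Variables that are legitimately optional would show up as unset here, so the output is a prompt to check the Environment Variables table rather than a hard failure.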
+ +### Environment Variables + +| Variable | Required | Description | Example | +|---|---|---|---| +| `DATABASE_URL` | Yes | Connection string for the primary DB | `postgres://localhost:5432/[db]` | +| `[VAR_2]` | Yes | [Description] | [Example] | +| `[VAR_3]` | No | [Description — default value] | [Example] | + +**Secrets for local dev:** [Where to get them — e.g. "Run `[command]` to pull from Vault" or "Ask [person] in #[channel]"] + +### Useful local commands + +```bash +[start command] # Start the service +[test command] # Run all tests +[lint command] # Run linter +[format command] # Format code +[migration command] # Run pending migrations +[seed command] # Seed local database +``` + +--- + +## Testing + +**Testing philosophy:** [e.g. "We test at the integration layer — unit tests for pure functions, integration tests for anything touching the DB or external services"] + +### Running tests + +```bash +# All tests +[test command] + +# Unit tests only +[unit test command] + +# Integration tests (requires local deps running) +[integration test command] + +# A specific test file or test case +[test command with filter] +``` + +**Test coverage:** [X]% (minimum required to pass CI: [Y]%) +**Coverage report:** [Where to find it] + +### Writing tests + +- **Unit tests:** [Where to put them — e.g. alongside source files as `*_test.go`] +- **Integration tests:** [Where to put them — e.g. `tests/integration/`] +- **Test database:** [How it works — e.g. "Each test gets a clean transaction that rolls back on teardown — see `tests/helpers/db.go`"] +- **Mocking:** [Policy — e.g. "We mock at the repository layer — don't mock the DB directly"] + +--- + +## Making Changes + +### Branching + +[Branch naming convention — e.g. 
`feature/[ticket-id]-short-description`, `fix/[ticket-id]-short-description`] + +### Before opening a PR + +- [ ] Tests pass locally +- [ ] Linter passes (`[lint command]`) +- [ ] New behaviour has test coverage +- [ ] Any new environment variables are added to `.env.example` and documented +- [ ] Database migrations are backward-compatible (old code can run against new schema) + +### Code review + +- **Reviewers:** [Who to request review from — e.g. "Any engineer on [team]; lead review required for auth changes"] +- **Expected review time:** [X hours / 1 business day] +- **PR template:** [Link or auto-generated by GitHub] + +### Database migrations + +```bash +# Create a new migration +[migration create command] + +# Apply pending migrations +[migration up command] + +# Roll back last migration +[migration down command] +``` + +**Migration rules:** +- All migrations must be backward-compatible — old code must run against the new schema +- Never rename or drop a column in a single migration — do it in three steps (add new, migrate data, drop old) +- Test your rollback before merging + +--- + +## Deployment + +**How code gets to production:** [1–2 sentence summary — link to full CI/CD playbook if it exists] + +1. Merge to `main` → automatic deploy to staging +2. Smoke tests run on staging +3. Manual approval → deploy to production +4. Post-deploy monitoring for [X minutes] + +**Deployment docs:** [Link to CI/CD playbook or pipeline docs] + +**Who can deploy:** [Any engineer / Lead engineer / On-call engineer — specify] + +**Deployment channel:** `#[deployments channel]` + +--- + +## Monitoring and Observability + +**Dashboard:** [Datadog / Grafana / CloudWatch — link] +**Logs:** [Log aggregation tool and link — e.g. "Logs are in Datadog under service:[name]"] +**Traces:** [Tracing tool and link if applicable] +**Alerts:** [Where alerts fire — e.g.
PagerDuty / Slack #alerts-[service]] + +**Key metrics to know:** +- **Error rate:** Should be <[X]% (alert at [Y]%) +- **P99 latency:** Should be <[X]ms +- **[Business metric]:** [e.g. "Queue depth should be <100 items"] + +--- + +## On-Call + +**On-call schedule:** [PagerDuty / Opsgenie link] +**Who's on-call now:** [Link to current schedule or `#oncall` channel] +**Escalation:** [On-call → [team lead] → [EM] — after [X] minutes unacknowledged] + +**If you get paged:** +1. Acknowledge the alert +2. Check [dashboard link] for the first clue +3. Common alert runbooks: [link to oncall-runbook or runbook-writer output] +4. If you can't resolve in [X minutes], escalate to [person/channel] + +--- + +## Key Contacts + +| Role | Name | Best way to reach | +|---|---|---| +| Tech lead | [Name] | Slack: @[handle] | +| On-call rotation | [Team] | PagerDuty / `#on-call` | +| Platform / infra | [Team] | `#platform` Slack channel | +| Database / DBA | [Name or team] | `#database` Slack channel | +| [Upstream service] owner | [Name] | Slack: @[handle] | + +**Where to ask questions:** +- General engineering: `#engineering` +- This service specifically: `#[service-name]` +- Urgent / production issues: `#incidents` + +--- + +## Troubleshooting + +### "The service won't start locally" + +1. Check that Docker / dependencies are running: `[command]` +2. Check `.env` is populated — missing values cause silent failures +3. Check logs: `[log command]` +4. Ask in `#[channel]` + +### "Tests are failing locally but passing in CI" + +- Check your local dependency versions match CI: `[version check command]` +- Try a clean install: `[clean install command]` +- Integration tests need local deps running — `[start deps command]` + +### "I can't access [internal tool / system]" + +- Request access through [process — e.g. Okta self-serve / ask your manager] + +### "Something looks wrong in production" + +1. Check [dashboard] for the error spike +2. Check recent deploys in `#deployments` +3. 
If it's an active incident, page on-call via [PagerDuty / Slack command] + +--- + +## Further Reading + +- [Architecture Decision Records (ADRs)](./docs/decisions/) — why the codebase is the way it is +- [API documentation](./docs/api/) or [link to external docs] +- [Incident runbooks](./docs/runbooks/) +- [CI/CD pipeline documentation](./docs/cicd/) +- [Team working agreements](./docs/team/) + +--- + +## Quality Checks + +- [ ] Local setup instructions work on a fresh machine — tested recently +- [ ] Environment variables table is complete and accurate +- [ ] "Things that might surprise you" captures the actual surprises (ask a recent joiner) +- [ ] On-call section has real links, not placeholders +- [ ] Contacts are current — team members with real Slack handles +- [ ] Troubleshooting covers the top 3 actual questions new joiners ask diff --git a/skills/oncall-runbook/SKILL.md b/skills/oncall-runbook/SKILL.md new file mode 100644 index 0000000..957996e --- /dev/null +++ b/skills/oncall-runbook/SKILL.md @@ -0,0 +1,364 @@ +--- +name: oncall-runbook +description: "Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template." +--- + +# On-Call Runbook Skill + +Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help. + +A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation. 
+ +## Required Inputs + +Ask for these if not already provided: +- **Service name** and what it does +- **Team** and tech lead name +- **Alert list** — names of alerts that currently page on-call +- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc. +- **Common failure modes** — what breaks most often, and what fixes it +- **Escalation contacts** — who to call when on-call can't resolve it +- **Deployment setup** — can on-call roll back? How? +- **Service dependencies** — what does this service depend on, and what depends on it? + +## Output Format + +--- + +# On-Call Runbook: [Service Name] + +**Team:** [Team name] | **Tech lead:** [Name] +**PagerDuty service:** [Link] | **Escalation policy:** [Policy name] +**Last updated:** [Date] | **Next review:** [Date + 90 days] + +> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service. + +--- + +## Quick Reference + +**Dashboard:** [Link — the first thing to open when paged] +**Logs:** [Link — where to find logs] +**Runbook index:** Jump to the alert that paged you → [Alert list below] +**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty] + +**Rollback command (memorise this):** +```bash +[rollback command — e.g. 
kubectl rollout undo deployment/[service-name]] +``` + +--- + +## Escalation Matrix + +| Situation | Escalate to | How | After how long | +|---|---|---|---| +| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes | +| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately | +| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) | +| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately | +| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately | +| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour | + +**Contacts:** + +| Name | Role | Slack | Phone | +|---|---|---|---| +| [Name] | Tech lead | @[handle] | [Number] | +| [Name] | Engineering manager | @[handle] | [Number] | +| [Name] | Platform / infra | @[handle] | [Number] | +| [Platform team] | Infra on-call | `#platform` | PagerDuty | + +--- + +## Service Architecture (Quick View) + +``` +[Upstream callers] + │ + ▼ +[This Service] + │ + ├──→ [Primary Database] + ├──→ [Cache — e.g. Redis] + └──→ [Downstream Service / Queue] +``` + +**If this service is down, these are affected:** [List downstream consumers] +**If these are down, this service is affected:** [List upstream dependencies] + +--- + +## Alert Runbooks + +### ALERT: [Alert Name 1 — e.g. HighErrorRate] + +**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"] +**Severity:** P1 / P2 / P3 +**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate] + +**Step 1 — Acknowledge and assess** +```bash +# Check current error rate +[query or dashboard link] + +# Check which endpoints are erroring +[query or command] +``` + +**Step 2 — Check recent changes** +```bash +# Any deploys in the last hour? +[command or link to deployment log] + +# Recent config changes? 
+[where to check] +``` + +**Step 3 — Check dependencies** +```bash +# Is the database healthy? +[health check command or link] + +# Is [downstream service] healthy? +[health check command or link] +``` + +**Step 4 — Diagnose** + +| If you see | It means | Do this | +|---|---|---| +| [Error pattern 1] | [Cause] | [Action] | +| [Error pattern 2] | [Cause] | [Action] | +| [Error pattern 3] | [Cause] | [Action] | +| No clear pattern | Unknown cause | Escalate to [name] | + +**Step 5 — Fix or mitigate** +```bash +# If caused by bad deploy — roll back: +[rollback command] + +# If caused by [specific issue]: +[fix command] + +# If caused by upstream dependency: +[mitigation — e.g. enable circuit breaker, reduce traffic, etc.] +``` + +**After resolving:** +- [ ] Confirm error rate has returned to baseline +- [ ] Check no downstream services were affected +- [ ] If P1: open a post-incident review — see [incident-postmortem skill] +- [ ] Update `#incidents` with resolution summary + +--- + +### ALERT: [Alert Name 2 — e.g. HighLatency] + +**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"] +**Severity:** P1 / P2 / P3 +**SLO impact:** Yes — latency SLO breach + +**Step 1 — Assess scope** +```bash +# Check which endpoints are slow +[query or dashboard — broken down by endpoint] + +# Check if latency is across all regions or localised +[query or command] +``` + +**Step 2 — Common causes and fixes** + +| Cause | Signal | Fix | +|---|---|---| +| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] | +| Cache miss storm | Cache hit rate drops on dashboard | [command or action] | +| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. 
restart, scale up] | +| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call | +| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] | + +**Step 3 — Escalate if unresolved in 20 minutes** +Page [Tech lead] via PagerDuty / Slack. + +--- + +### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted] + +**What it means:** [e.g. "The service has used all available database connections — new requests will fail"] +**Severity:** P1 +**SLO impact:** Yes — will cause errors immediately + +**Immediate mitigation:** +```bash +# Restart the service to flush stale connections +[restart command] + +# Check current connection count +[DB connection query] +``` + +**Diagnose root cause after stabilising:** +```bash +# Check for long-running queries holding connections +[query] + +# Check if a recent deploy changed connection pool config +[where to check] +``` + +**Resolution:** [e.g. "Increase pool size in config / kill long-running queries / scale the service"] + +--- + +### ALERT: [Alert Name 4 — e.g. QueueBacklogHigh / ConsumerLag] + +**What it means:** [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"] +**Severity:** P2 +**SLO impact:** Depends — if queue backs up, downstream systems will receive delayed data + +**Step 1 — Check consumer health** +```bash +# Are consumers running? +[command] + +# Consumer error rate? +[dashboard or query] +``` + +**Step 2 — Check message contents** +```bash +# Are there poison messages causing retries? 
+[command to inspect dead-letter queue or failed messages] +``` + +**Step 3 — Options** + +| If | Then | +|---|---| +| Consumers are down | Restart consumers: `[command]` | +| Poison message in queue | Move to DLQ: `[command]` | +| Consumers healthy but slow | Scale consumers: `[command]` | +| Upstream producing too fast | Escalate to [upstream service] owner | + +--- + +### ALERT: [Add additional alerts following the same pattern] + +--- + +## Diagnostic Cheat Sheet + +Common commands for quick diagnosis. Paste and run without modification. + +```bash +# Service health +[health check command] + +# Recent logs (last 100 lines) +[log command] + +# Error logs only +[error log filter command] + +# Current pod / instance status +[kubectl get pods / aws ecs describe-tasks / etc.] + +# Restart the service +[restart command] + +# Roll back to previous version +[rollback command] + +# Database connection count +[DB query] + +# Cache hit rate +[cache stats command] + +# Current request rate +[metrics query] +``` + +--- + +## Useful Dashboard Links + +| Dashboard | URL | Use it to | +|---|---|---| +| Service overview | [Link] | First stop — error rate, latency, request rate | +| Database | [Link] | Connection count, slow queries, replication lag | +| Infrastructure | [Link] | CPU, memory, disk | +| Queue / consumers | [Link] | Backlog depth, consumer throughput | +| Upstream dependencies | [Link] | Dependency health at a glance | + +--- + +## Incident Communication + +When you declare an incident: + +**Post to `#incidents` immediately:** +``` +🔴 INCIDENT — [Service Name] +Status: Investigating +Impact: [Who is affected and how] +Paged: [Your name] +Next update: [Time — max 30 min from now] +``` + +**Update every 30 minutes while active:** +``` +🔴 UPDATE — [Service Name] — [Time] +Status: [Investigating / Identified / Mitigating / Resolved] +Latest: [One sentence on what you found or did] +Next update: [Time] +``` + +**On resolution:** +``` +✅ RESOLVED — [Service Name] — 
[Time] +Duration: [X minutes] +Impact: [Summary of who was affected] +Cause: [One sentence] +Follow-up: [PIR required? Yes/No — link when created] +``` + +--- + +## On-Call Handoff + +Use this template at the end of every on-call shift: + +``` +--- ON-CALL HANDOFF: [Service Name] --- +Date: [Date] +Outgoing: [Your name] +Incoming: [Next on-call name] + +INCIDENTS THIS SHIFT: +- [Incident summary — date, duration, cause, resolution, follow-up required] + +OPEN ISSUES TO WATCH: +- [Anything not fully resolved / trending in the wrong direction] + +CHANGES SINCE LAST HANDOFF: +- [Deploys, config changes, infra changes that affect on-call awareness] + +RUNBOOK GAPS FOUND: +- [Anything you had to figure out that isn't documented — please add it] + +ANYTHING ELSE: +- [Notes for incoming on-call] +``` + +--- + +## Quality Checks + +- [ ] Every alert that pages on-call has a runbook entry — no alert is missing +- [ ] Rollback command is accurate and tested recently +- [ ] Escalation contacts have current phone numbers and Slack handles +- [ ] Diagnostic commands work — they have been run by at least one person recently +- [ ] Handoff template is used at every shift change — not just during incidents +- [ ] "Things I had to figure out that weren't documented" are added to this runbook after every incident diff --git a/skills/qbr-deck/SKILL.md b/skills/qbr-deck/SKILL.md new file mode 100644 index 0000000..0c5ea5e --- /dev/null +++ b/skills/qbr-deck/SKILL.md @@ -0,0 +1,218 @@ +--- +name: qbr-deck +description: "Build a Quarterly Business Review (QBR) deck structure and narrative for a customer account. Use when asked to prepare a QBR, business review meeting, executive review, or quarterly check-in with a customer. Produces a slide-by-slide QBR structure with talking points, metrics review, value narrative, and mutual next steps." +--- + +# QBR Deck Skill + +Produce a complete Quarterly Business Review deck — structured, data-backed, and customer-focused. 
A good QBR demonstrates value delivered, aligns on goals for the next quarter, and strengthens the executive relationship. It should never feel like a product demo or a vendor update. + +## Required Inputs + +Ask for these if not already provided: +- **Account name**, CSM name, and customer stakeholders attending +- **Contract details** — ARR, contract start date, renewal date +- **Last quarter's goals** (from previous QBR or kickoff) +- **Usage and adoption data** — key metrics for the quarter +- **Support summary** — tickets raised, resolution time, any escalations +- **Business outcomes the customer cares about** — what success looks like for them +- **Product updates or new features** relevant to this customer +- **Goals for next quarter** +- **Any open commercial conversations** (expansion, renewal, at-risk signals) + +## QBR Principles + +- Lead with customer outcomes, not product features +- Every metric should connect to a business result the customer cares about +- The agenda is a conversation, not a presentation — build in time for customer input at every stage +- Close with mutual commitments, not just vendor actions + +## Output Format + +--- + +# QBR: [Account Name] × [Your Company] +**[Quarter] [Year] Business Review** + +**Date:** [Date] | **Location / Call link:** [TBC] +**Customer attendees:** [Names and roles] +**[Your company] attendees:** [Names and roles] + +--- + +## Slide 1: Agenda (5 min) + +| Time | Topic | Owner | +|---|---|---| +| 0:00 | Welcome and introductions | CSM | +| 0:05 | [Last quarter] — how did we do? | CSM + Customer | +| 0:20 | Value delivered — business impact | CSM | +| 0:35 | What's coming — roadmap preview | CSM / Product | +| 0:45 | [Next quarter] — goals and priorities | Customer | +| 0:55 | Actions and mutual commitments | CSM | +| 1:00 | Close | | + +*Talking point: "We've kept today to 60 minutes. 
We want as much of this to be a conversation as possible — please push back, redirect, and ask questions throughout."* + +--- + +## Slide 2: Where We Are Together (2 min) + +**Partnership snapshot:** +- **Customer since:** [Date] +- **Contract value:** £/$/€[ARR]/year +- **Renewal date:** [Date] +- **Active users:** [N] of [N] licensed seats ([X]% adoption) +- **Products / modules active:** [List] + +*Talking point: "Before we dive in — a quick picture of where we are. [X] months in, [Y] active users, and this is our [Nth] QBR together."* + +--- + +## Slide 3: Last Quarter — Goals We Set Together (5 min) + +| Goal | Set in [Last QBR / Kickoff] | Status | +|---|---|---| +| [Goal 1] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed | +| [Goal 2] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed | +| [Goal 3] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed | + +For any partial or missed goal: state what happened and what changes next quarter. + +*Talking point: "Let's start with accountability. Here's what we said we'd achieve last quarter — let's be honest about where we landed."* + +--- + +## Slide 4: Usage and Adoption (5 min) + +**Quarter-over-quarter trend:** + +| Metric | [Q-1] | [Q] | Change | +|---|---|---|---| +| Monthly active users | [N] | [N] | +/-X% | +| Sessions per user per week | [N] | [N] | +/-X% | +| [Key feature 1] adoption | [X]% | [X]% | +/-X% | +| [Key feature 2] adoption | [X]% | [X]% | +/-X% | + +**Highlights:** +- [Positive adoption trend to call out] +- [Feature or workflow with strongest engagement] + +**Opportunity:** +- [Feature with low adoption that could drive more value — link to their goals] + +*Talking point: "Usage is [up / stable / something we want to talk about]. 
The area I'd like to focus on is [feature] — we're not seeing the adoption we'd expect given [their goal], and I want to understand why."* + +--- + +## Slide 5: Business Impact — Value Delivered (10 min) + +Lead with outcomes, not activity. + +**[Outcome 1: customer's primary success metric]** +- Before: [baseline] +- Now: [current state] +- Impact: [quantified business result — time saved, revenue influenced, cost reduced, risk mitigated] + +**[Outcome 2]** +- [Same structure] + +**[Outcome 3]** +- [Same structure] + +**Customer evidence** (use if available): +> "[Quote from champion or user about value experienced]" + +*Talking point: "This is the section I most want your input on. Are these the outcomes that matter to your business? Are there other ways you're measuring success that we should be tracking?"* + +--- + +## Slide 6: Support Summary (3 min) + +| Metric | This quarter | Last quarter | Trend | +|---|---|---|---| +| Tickets raised | [N] | [N] | ↑ / → / ↓ | +| Average resolution time | [X hrs] | [X hrs] | ↑ / → / ↓ | +| P1 / critical issues | [N] | [N] | ↑ / → / ↓ | +| CSAT score | [X/10] | [X/10] | ↑ / → / ↓ | + +**Notable issues this quarter:** +- [Any escalation or major ticket — brief summary and resolution] + +**What we're doing differently:** +- [Any process change or improvement based on support patterns] + +--- + +## Slide 7: What's Coming — Roadmap Preview (5 min) + +Focus only on what's relevant to this customer's goals. Do not dump the full roadmap. + +| Feature / Improvement | Expected | Why it matters to [Account Name] | +|---|---|---| +| [Feature 1] | [Q+1] | [Direct link to their goal or pain point] | +| [Feature 2] | [Q+1 / Q+2] | [Direct link] | +| [Feature 3] | [H2] | [Direct link] | + +*Talking point: "I've filtered the roadmap to what I think matters most to your team. 
I'd love your reaction — are these the right priorities from your perspective?"* + +--- + +## Slide 8: Next Quarter — Your Goals (10 min) + +**Customer input section — facilitate, don't present.** + +Prompt questions: +- "What does success look like for your team in [next quarter]?" +- "What's the biggest challenge you're trying to solve in the next 90 days?" +- "Is there anything about the way you're using [product] you want to change?" + +**Capture live:** + +| Goal for next quarter | Owner (customer) | How we'll support it | How we'll measure it | +|---|---|---|---| +| [Goal 1] | [Name] | [CSM / product action] | [Metric] | +| [Goal 2] | [Name] | [CSM / product action] | [Metric] | + +--- + +## Slide 9: Mutual Commitments (5 min) + +**[Your company] commits to:** +1. [Specific action — owner — by when] +2. [Specific action — owner — by when] +3. [Specific action — owner — by when] + +**[Account Name] commits to:** +1. [Specific action — owner — by when] +2. [Specific action — owner — by when] + +**Next touchpoint:** [Date of next check-in or mid-quarter review] + +--- + +## Slide 10: Thank You + Open Q&A (5 min) + +- Recap the one headline from today: [The single most important thing you want them to remember] +- Confirm actions are captured and shared after the call +- Ask: "Is there anything we didn't cover today that you wanted to raise?" 
+ +--- + +## Preparation Checklist + +- [ ] Usage data pulled and QoQ comparison calculated +- [ ] Last QBR goals reviewed — status confirmed before the meeting +- [ ] Business outcomes framed in customer language (not product language) +- [ ] Roadmap filtered to this account's specific use cases +- [ ] Customer's goals for next quarter researched or pre-confirmed with champion +- [ ] Executive sponsor briefed on any sensitive topics before the call +- [ ] Actions from previous QBR reviewed — any outstanding items addressed + +## Quality Checks + +- [ ] Every slide has a talking point, not just a title +- [ ] Value slide leads with business outcomes, not product activity +- [ ] Roadmap preview links each item to a customer goal +- [ ] Mutual commitments section has real owners on both sides +- [ ] Customer has at least 20 minutes of airtime in the agenda diff --git a/skills/slo-error-budget/SKILL.md b/skills/slo-error-budget/SKILL.md new file mode 100644 index 0000000..05375e9 --- /dev/null +++ b/skills/slo-error-budget/SKILL.md @@ -0,0 +1,231 @@ +--- +name: slo-error-budget +description: "Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence." +--- + +# SLO and Error Budget Skill + +Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns. + +A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity. 
+ +## Required Inputs + +Ask for these if not already provided: +- **Service name** and brief description of what it does +- **Primary users** — who depends on this service and how +- **User-facing interactions** to protect — e.g. API calls, page loads, transactions +- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available) +- **Existing on-call setup** — who responds to alerts? +- **Deployment frequency** — how often does the team ship? +- **Any existing SLAs** with customers — these constrain SLO targets + +## Key Definitions + +Always establish these before writing the SLO: + +| Term | Definition | +|---|---| +| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" | +| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" | +| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO | +| **Error budget** | The allowed headroom below 100% — the budget for planned and unplanned downtime | +| **Burn rate** | How fast the error budget is being consumed | + +--- + +## Output Format + +--- + +# SLO Document: [Service Name] + +**Service:** [Name] | **Team:** [Team name] +**Owner:** [Name / role] | **Approved by:** [Name] +**Effective date:** [Date] | **Review date:** [Date + 3 months] +**Version:** [1.0] + +--- + +## Why This SLO Exists + +[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?] + +--- + +## Service Overview + +**What this service does:** [One sentence] +**Who depends on it:** [Internal teams / external customers / both — describe] +**Critical user journeys protected by this SLO:** +1. [Journey 1 — e.g. "User completes a payment"] +2. [Journey 2] +3. [Journey 3] + +--- + +## SLIs — What We Measure + +Define one SLI per user journey or reliability dimension. 
Keep it to 3–5 SLIs maximum. + +### SLI 1: [Name — e.g. Request Success Rate] + +| Field | Detail | +|---|---| +| **What it measures** | [e.g. "% of API requests that return a non-5xx response within 500ms"] | +| **Good event definition** | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] | +| **Bad event definition** | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] | +| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] | +| **Measured over** | Rolling 28-day window | +| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] | + +### SLI 2: [Name — e.g. Latency] + +| Field | Detail | +|---|---| +| **What it measures** | [e.g. "% of /checkout requests that complete within 500ms (with a 99% target, equivalent to P99 ≤500ms)"] | +| **Good event definition** | [e.g. "Request completes in ≤500ms"] | +| **Bad event definition** | [e.g. "Request takes >500ms"] | +| **Measurement source** | [Source] | +| **Measured over** | Rolling 28-day window | +| **Exclusions** | [Any exclusions] | + +### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.] 
+ +[Same structure] + +--- + +## SLO Targets + +| SLI | Target | Window | Error Budget | +|---|---|---|---| +| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] | +| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] | +| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] | + +**How targets were set:** +- Historical baseline (last 90 days): [X]% +- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment] +- Rationale: [1–2 sentences] + +**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive — it discourages feature development and doesn't reflect user reality] + +--- + +## Error Budget Calculation + +**For SLI 1 ([Name]), at [X]% target:** + +``` +Error budget = (100% - SLO target) × measurement window + = (100% - [X]%) × 28 days × 24 hours × 60 minutes + = [Y]% × [Z total minutes] + = [N] minutes of allowed failure per 28-day window +``` + +**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO. + +--- + +## Burn Rate Alerts + +Burn rate = how fast the error budget is being consumed relative to the budget window. +A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days. + +| Alert | Burn rate | Window | Severity | Response | +|---|---|---|---|---| +| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately — budget exhausted in <2 days | +| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days | +| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting | +| Info | >1× | 28 days | Info | Log only — budget on track to exhaust by end of window | + +**Alert implementation:** [Link to alert config in monitoring tool — e.g. 
Datadog, Prometheus/Alertmanager, Grafana] + +--- + +## Error Budget Policy + +This policy defines what to do with the error budget — both when it's healthy and when it's burning. + +### When budget is healthy (>50% remaining) + +- Feature development and deployments proceed at normal pace +- The team may take on riskier experiments +- Reliability improvements are scheduled but not urgent + +### When budget is at risk (25–50% remaining) + +- Deployment frequency reduced — team ships only well-tested changes +- One reliability improvement added to current sprint +- Weekly error budget review added to team standup + +### When budget is nearly exhausted (<25% remaining) + +- Feature work paused in favour of reliability improvements +- No new deployments without explicit on-call approval +- Daily review of error budget burn rate +- CSM / support notified to manage customer expectations + +### When budget is exhausted (0% remaining — SLO breached) + +- All feature work stops +- On-call engineer and engineering manager notified immediately +- Post-incident review (PIR) required within 5 business days +- SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed + +--- + +## Dashboard and Reporting + +**SLO dashboard:** [Link to Datadog / Grafana / etc. dashboard] + +**Metrics exposed:** +- Current SLO compliance (rolling 28-day) +- Error budget remaining (% and minutes) +- Burn rate (current and trend) +- Incident count and MTTR this window + +**Reporting cadence:** + +| Audience | Frequency | Format | +|---|---|---| +| Engineering team | Weekly | Slack summary — #[service]-slo | +| Engineering manager | Monthly | SLO review meeting | +| Stakeholders / customers | Quarterly | SLO compliance summary | + +--- + +## Exclusions and Edge Cases + +**Planned maintenance:** Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel]. 
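
One way the planned-maintenance exclusion could be applied when counting bad minutes is sketched below. This is a minimal illustration, not part of the template: the function name `counted_bad_minutes`, the per-minute granularity, and the sample timestamps are all assumptions, and a real implementation would read both inputs from the monitoring tool.

```python
from datetime import datetime, timedelta


def counted_bad_minutes(bad_minutes, maintenance_windows):
    """Return the bad minutes that should count against the error budget.

    bad_minutes: iterable of datetimes, one per minute classified as bad.
    maintenance_windows: list of (start, end) datetimes; end is exclusive.
    Both structures are hypothetical, chosen only for this sketch.
    """
    def in_maintenance(ts):
        return any(start <= ts < end for start, end in maintenance_windows)

    return [ts for ts in bad_minutes if not in_maintenance(ts)]


# Hypothetical data: one 30-minute pre-announced window and 45 bad minutes,
# the first 30 of which fall inside that window.
window_start = datetime(2026, 5, 1, 2, 0)
windows = [(window_start, window_start + timedelta(minutes=30))]
bad = [window_start + timedelta(minutes=i) for i in range(45)]

counted = counted_bad_minutes(bad, windows)
print(len(counted))  # 15 minutes count against the budget
```

Only windows that were announced in advance belong in `maintenance_windows`; an outage that starts during maintenance but runs past the announced end still consumes budget for the overrun.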
+ +**Dependency failures:** If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures). + +**Force majeure:** [Policy for cloud provider outages, major infrastructure events] + +--- + +## SLO Review Cadence + +| Review | When | Who | Output | +|---|---|---|---| +| Error budget review | Weekly | Team | Budget health check — adjust if burning fast | +| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly | +| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? | + +**When to change the SLO target:** +- Historical baseline has improved significantly and target no longer reflects real reliability +- User feedback indicates the target is misaligned with what users actually experience +- The SLO is being gamed (metric is healthy but users are unhappy) + +--- + +## Quality Checks + +- [ ] SLIs are user-facing — they measure what users experience, not internal system metrics +- [ ] Good and bad events are precisely defined — no ambiguity about what counts +- [ ] Targets are based on historical data, not aspirational round numbers +- [ ] Error budget policy has clear triggers and clear actions — not "discuss as a team" +- [ ] Burn rate alerts have different windows to catch both fast burns and slow burns +- [ ] Exclusions are documented so they don't silently inflate the SLO number
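
As a worked illustration of the error budget calculation, the burn rate definition, and the policy tiers in this skill, the arithmetic can be sketched as follows. The function names, the 99.5% example target, and the handling of the exact 25% and 50% boundaries are assumptions made for this sketch; the document itself does not specify them.

```python
WINDOW_DAYS = 28
WINDOW_MINUTES = WINDOW_DAYS * 24 * 60  # 40,320 minutes in the rolling window


def error_budget_minutes(slo_target_pct):
    """Minutes of allowed failure per 28-day window for a given SLO target."""
    return (100.0 - slo_target_pct) / 100.0 * WINDOW_MINUTES


def burn_rate(observed_bad_fraction, slo_target_pct):
    """Observed bad-event fraction divided by the fraction the SLO allows."""
    allowed_fraction = (100.0 - slo_target_pct) / 100.0
    return observed_bad_fraction / allowed_fraction


def days_to_exhaustion(rate):
    """Days until a full budget is spent if the burn rate stays constant."""
    return WINDOW_DAYS / rate


def policy_tier(budget_remaining_pct):
    """Map remaining budget to a policy tier.

    Assumption: exactly 25% and exactly 50% are treated as "at risk".
    """
    if budget_remaining_pct <= 0:
        return "exhausted"
    if budget_remaining_pct < 25:
        return "nearly exhausted"
    if budget_remaining_pct <= 50:
        return "at risk"
    return "healthy"


print(round(error_budget_minutes(99.5), 1))  # 201.6 minutes per window
print(days_to_exhaustion(14))                # 2.0 days at a sustained 14x burn
print(policy_tier(40))                       # at risk
```

Dividing the window length by the burn rate gives a quick sanity check on alert thresholds: a sustained 14× burn spends a full 28-day budget in 2 days, and a 6× burn spends it in under 5 days.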