feat: v10.0.0 — 8 new skills across Customer Success and Engineering (500-star milestone)

Two star milestones shipped together:

Customer Success bundle (pm-cs) — 250-star milestone:
- cs-health-scorecard: weighted RAG health score across 5 dimensions with renewal forecast
- qbr-deck: slide-by-slide QBR structure with value narrative and mutual commitments
- cs-escalation-brief: 4-level escalation framework with root cause, impact, and decision required
- churn-analysis: voluntary/unavoidable churn split, early warning signals, prioritised interventions

Engineering expansion (pm-engineering) — 500-star milestone:
- cicd-playbook: full pipeline playbook from build through post-deploy checks and rollback
- slo-error-budget: SLI definitions, burn rate alerts, and error budget policy
- developer-onboarding-doc: first-week guide covering architecture, setup, testing, and contacts
- oncall-runbook: per-alert response procedures, escalation matrix, and handoff template

Also:
- Added pm-cs plugin to marketplace.json
- Updated pm-engineering plugin.json to v3.0.0 (14 skills)
- Updated marketplace.json to v10.0.0 (114 skills, 23 bundles, 16 professions)
- README updated with new CS section, corrected skill numbering (106 → 114)
- Added bug report link to Contributing section
- Star milestones updated to show 250 and 500 as unlocked
2026-05-17 10:55:58 +01:00
parent 48fd4dd6ad
commit bfdbec17a3
20 changed files with 4044 additions and 100 deletions
@@ -0,0 +1,179 @@
+---
+name: churn-analysis
+description: "Analyse customer churn for a product or cohort and produce a structured churn report. Use when asked to analyse churn, understand why customers are leaving, identify churn patterns, calculate churn rate, or build a churn reduction plan. Produces a churn analysis with rate calculations, categorised reasons, early warning signals, and prioritised interventions."
+---
+
+# Churn Analysis Skill
+
+Produce a structured churn analysis that goes beyond the headline rate — identifying why customers leave, which segments are most at risk, and what interventions will have the highest impact on retention.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Time period** being analysed (e.g. Q1, last 12 months)
+- **Total customers at start of period** and **customers churned**
+- **ARR or revenue lost** to churn
+- **Churn reasons data** — exit survey results, CSM notes, support data, or sales loss reasons
+- **Customer segments** — by tier, industry, cohort, or product line
+- **Current retention rate** if known
+- **Any recent changes** — pricing, product, support model — that may have affected churn
+
+## Churn Categories
+
+Always classify churn before analysing it:
+
+| Category | Definition |
+|---|---|
+| **Voluntary — avoidable** | Customer left due to a problem we could have addressed (product gaps, poor onboarding, relationship failures) |
+| **Voluntary — unavoidable** | Customer left for reasons outside our control (budget cuts, acquisition, company shutdown) |
+| **Involuntary** | Payment failure, contract non-renewal by mistake, admin error |
+
+The interventions for each category are different. Conflating them leads to wrong conclusions.
+
+## Output Format
+
+---
+
+# Churn Analysis: [Product / Segment / Company]
+**Period:** [Start date] — [End date]
+**Prepared by:** [Name] | **Date:** [Date]
+
+---
+
+## Headline Numbers
+
+| Metric | Value |
+|---|---|
+| Customers at start of period | [N] |
+| Customers churned | [N] |
+| **Customer churn rate** | **[X]%** |
+| ARR at start of period | £/$/€[X] |
+| ARR lost to churn | £/$/€[X] |
+| **Revenue churn rate (gross)** | **[X]%** |
+| ARR from expansions (same period) | £/$/€[X] |
+| **Net revenue retention (NRR)** | **[X]%** |
+
+**Benchmark context:**
+- Customer churn rate: [X]% vs. industry benchmark [Y]% — [above / below / in line]
+- NRR: [X]% — [What this means: above 100% = expansion offsets churn; below 100% = shrinking base]
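
The three bolded rates in the table follow directly from the raw figures above them. The sketch below shows that arithmetic in Python. It assumes that "ARR lost to churn" already includes any downgrade (contraction) ARR, because the table has no separate row for it. The `PeriodFigures` name and the sample numbers are illustrative, not taken from the skill.

```python
from dataclasses import dataclass


@dataclass
class PeriodFigures:
    """Raw inputs for one analysis period (hypothetical container)."""
    customers_at_start: int
    customers_churned: int
    arr_at_start: float
    arr_lost_to_churn: float
    arr_from_expansions: float


def customer_churn_rate(p: PeriodFigures) -> float:
    # Churned customers divided by the starting cohort, as a percentage.
    return 100 * p.customers_churned / p.customers_at_start


def gross_revenue_churn_rate(p: PeriodFigures) -> float:
    # ARR lost divided by starting ARR; expansions are ignored here.
    return 100 * p.arr_lost_to_churn / p.arr_at_start


def net_revenue_retention(p: PeriodFigures) -> float:
    # Assumption: arr_lost_to_churn already contains downgrade ARR.
    retained = p.arr_at_start - p.arr_lost_to_churn + p.arr_from_expansions
    return 100 * retained / p.arr_at_start


# Illustrative numbers only.
sample = PeriodFigures(
    customers_at_start=200,
    customers_churned=10,
    arr_at_start=1_000_000,
    arr_lost_to_churn=80_000,
    arr_from_expansions=120_000,
)
```

With the sample figures, customer churn is 5%, gross revenue churn is 8%, and NRR is 104%, so expansion more than offsets the ARR lost in the period.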
+
+---
+
+## Churn Breakdown by Category
+
+| Category | Customers | % of churn | ARR lost |
+|---|---|---|---|
+| Voluntary — avoidable | [N] | [X]% | £/$/€[X] |
+| Voluntary — unavoidable | [N] | [X]% | £/$/€[X] |
+| Involuntary | [N] | [X]% | £/$/€[X] |
+| **Total** | **[N]** | **100%** | **£/$/€[X]** |
+
+**Avoidable churn as % of total churn:** [X]% — this is the number we can actually influence.
+
+---
+
+## Churn Reasons — Avoidable Churn Only
+
+Rank by frequency. Include ARR weight where data allows.
+
+| Reason | Count | % of avoidable churn | ARR lost | Representative quote |
+|---|---|---|---|---|
+| [Reason 1 — e.g. "Product missing key feature"] | [N] | [X]% | £/$/€[X] | "[Quote]" |
+| [Reason 2] | [N] | [X]% | £/$/€[X] | "[Quote]" |
+| [Reason 3] | [N] | [X]% | £/$/€[X] | "[Quote]" |
+| [Reason 4] | [N] | [X]% | £/$/€[X] | "[Quote]" |
+| Other | [N] | [X]% | £/$/€[X] | — |
+
+**Theme synthesis:** [2–3 sentences grouping the top reasons into 2–3 themes. E.g. "The top three reasons cluster around two themes: product gaps in [area] (affecting X% of avoidable churn) and onboarding failures where customers never achieved value (Y%)."]
+
+---
+
+## Churn by Segment
+
+Identify which segments over- or under-index for churn.
+
+### By Tier
+
+| Tier | Churn rate | vs. Overall | Notes |
+|---|---|---|---|
+| Enterprise | [X]% | +/-[X]pp | |
+| Mid-Market | [X]% | +/-[X]pp | |
+| SMB | [X]% | +/-[X]pp | |
+
+### By Cohort (Acquisition Year)
+
+| Cohort | Churn rate | Notes |
+|---|---|---|
+| [Year 1] | [X]% | |
+| [Year 2] | [X]% | |
+| [Year 3] | [X]% | |
+
+### By Industry / Use Case (if data available)
+
+| Segment | Churn rate | Notes |
+|---|---|---|
+| [Segment 1] | [X]% | |
+| [Segment 2] | [X]% | |
+
+**Key pattern:** [Which segment has the highest churn rate and what likely explains it]
+
+---
+
+## Timing Analysis
+
+- **Average contract length before churn:** [X months]
+- **Highest-risk moment:** [e.g. "Month 3 — when trial value has worn off but full adoption hasn't happened"]
+- **Churn timing distribution:**
+
+| When churn occurred | % of churned accounts |
+|---|---|
+| 0–3 months | [X]% |
+| 3–6 months | [X]% |
+| 6–12 months | [X]% |
+| 12+ months | [X]% |
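
The bucket labels in the table share their boundaries (3 and 6 months each appear twice), so whoever fills it in has to pick a convention. The Python sketch below assumes half-open intervals, where an account that churns at exactly 3 months lands in "3–6 months". The function names and that boundary rule are assumptions, not something the skill specifies.

```python
from collections import Counter

# Labels copied from the table; the interval boundaries are an assumption.
BUCKETS = ["0–3 months", "3–6 months", "6–12 months", "12+ months"]


def tenure_bucket(months: float) -> str:
    """Map months-before-churn to a table row using half-open intervals."""
    if months < 0:
        raise ValueError("tenure cannot be negative")
    if months < 3:
        return BUCKETS[0]
    if months < 6:
        return BUCKETS[1]
    if months < 12:
        return BUCKETS[2]
    return BUCKETS[3]


def timing_distribution(tenures_in_months) -> dict:
    """Return the percentage of churned accounts in each bucket."""
    counts = Counter(tenure_bucket(m) for m in tenures_in_months)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no churned accounts supplied")
    return {label: 100 * counts[label] / total for label in BUCKETS}
```

Calling `timing_distribution` with one tenure per churned account yields the four percentages for the table, and they sum to 100 by construction.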
+
+---
+
+## Early Warning Signals
+
+Based on the churned accounts, identify the signals that preceded churn (and could have triggered earlier intervention):
+
+| Signal | Lead time before churn | How to detect |
+|---|---|---|
+| [Signal 1 — e.g. "DAU/MAU dropped below 15%"] | [~X weeks] | [Usage dashboard / alert] |
+| [Signal 2 — e.g. "No QBR in 90+ days"] | [~X weeks] | [CRM flag] |
+| [Signal 3 — e.g. "Champion left the account"] | [~X weeks] | [LinkedIn alert / CSM tracking] |
+| [Signal 4] | [~X weeks] | [Detection method] |
+
+---
+
+## Intervention Recommendations
+
+Ranked by estimated impact × feasibility.
+
+| Intervention | Addresses | Est. churn reduction | Effort | Owner |
+|---|---|---|---|---|
+| [Intervention 1 — e.g. "Improve onboarding for [segment] with dedicated 30-day check-in"] | [Reason 1] | [X accounts / £X ARR] | Low / Med / High | [Team] |
+| [Intervention 2] | [Reason 2] | [X accounts / £X ARR] | Low / Med / High | [Team] |
+| [Intervention 3] | [Reason 3] | [X accounts / £X ARR] | Low / Med / High | [Team] |
+
+**Priority call:** [Which one intervention, if implemented this quarter, would have the biggest impact and why]
+
+---
+
+## What We Don't Know (Data Gaps)
+
+- [Data gap 1 — e.g. "Exit survey response rate is only 30% — the reasons data may not be representative"]
+- [Data gap 2 — e.g. "No product usage data for SMB tier — can't confirm usage signal correlation"]
+- [Data gap 3]
+
+---
+
+## Quality Checks
+
+- [ ] Churn rate is correctly calculated (churned ÷ starting cohort, not end-of-period total)
+- [ ] Avoidable and unavoidable churn are separated — interventions target avoidable churn only
+- [ ] Churn reasons are customer-reported, not internally assumed
+- [ ] Segment analysis identifies which segments over-index — not just averages
+- [ ] Early warning signals are specific and detectable, not generic ("low engagement")
+- [ ] Interventions link directly to the top churn reasons — no recommendations without a root cause match
@@ -0,0 +1,301 @@
+---
+name: cicd-playbook
+description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities."
+---
+
+# CI/CD Playbook Skill
+
+Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.
+
+A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Service name** and brief description
+- **Tech stack** — language, framework, containerisation (Docker, etc.)
+- **Source control** — GitHub / GitLab / Bitbucket, branching strategy
+- **CI platform** — GitHub Actions / CircleCI / Jenkins / BuildKite / other
+- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
+- **Environments** — e.g. dev, staging, production (and any canary / feature environments)
+- **Deployment frequency** — how often does the team ship?
+- **Any existing gates** — manual approvals, smoke tests, feature flags
+- **On-call setup** — who's responsible during deploys?
+
+## Output Format
+
+---
+
+# CI/CD Playbook: [Service Name]
+
+**Service:** [Name] | **Team:** [Team name]
+**Last updated:** [Date] | **Owner:** [Name / role]
+**Pipeline platform:** [CI tool] → [CD tool / platform]
+
+---
+
+## Overview
+
+[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]
+
+**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand]
+**Average pipeline duration:** [X minutes]
+**Rollback time (p95):** [X minutes]
+
+---
+
+## Pipeline Stages
+
+```
+[Branch push]
+    │
+    ▼
+[1. Build & Lint] ──fail──▶ ❌ Block PR
+    │
+    ▼
+[2. Unit Tests] ──fail──▶ ❌ Block PR
+    │
+    ▼
+[3. Integration Tests] ──fail──▶ ❌ Block PR
+    │
+    ▼
+[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
+    │
+    ▼
+[5. Build Artefact / Container Image]
+    │
+    ▼
+[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
+    │
+    ▼
+[7. Smoke Tests (Staging)]
+    │
+    ▼
+[8. Manual Approval Gate] ──(if required)
+    │
+    ▼
+[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
+    │
+    ▼
+[10. Post-deploy checks]
+```
+
+---
+
+## Stage Definitions
+
+### Stage 1 — Build & Lint
+
+**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
+**Trigger:** Every commit to any branch
+**Blocking:** Yes — PR cannot be merged if this fails
+**Typical duration:** [X minutes]
+**Owner if it fails:** PR author
+
+**Common failure causes:**
+- [e.g. Missing dependency — run `npm install` locally before pushing]
+- [e.g. Lint rule violation — run `npm run lint -- --fix` to auto-fix most issues]
+
+---
+
+### Stage 2 — Unit Tests
+
+**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`]
+**Coverage gate:** [X]% minimum — pipeline fails below this threshold
+**Trigger:** Every commit
+**Blocking:** Yes
+**Typical duration:** [X minutes]
+
+**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
+
+---
+
+### Stage 3 — Integration Tests
+
+**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
+**Environment:** [Ephemeral test environment / shared test DB / etc.]
+**Trigger:** Every commit to `main` and feature branches targeting `main`
+**Blocking:** Yes
+**Typical duration:** [X minutes]
+
+**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"]
+
+---
+
+### Stage 4 — Security Scan
+
+**Tools:** [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
+**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies]
+**Blocking on:** Critical and High severity findings
+**Non-blocking on:** Medium and Low (flagged, not blocking)
+**Trigger:** Every commit to `main`
+
+**How to handle a flagged vulnerability:**
+1. Check if a fix is available — upgrade the dependency
+2. If no fix available, open a security ticket and add a suppression with justification
+3. Never suppress without a ticket and owner
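
The blocking policy and the suppression rule above can be expressed as a small gate. The Python sketch below is not tied to any particular scanner: the `Finding` fields are hypothetical, and a real tool's report would need to be mapped onto them first.

```python
from dataclasses import dataclass
from typing import Optional

# Severities that block the pipeline, per the policy above.
BLOCKING_SEVERITIES = {"critical", "high"}


@dataclass
class Finding:
    """One scanner finding (hypothetical shape, not a real tool's schema)."""
    identifier: str
    severity: str
    suppression_ticket: Optional[str] = None
    suppression_owner: Optional[str] = None

    @property
    def suppressed(self) -> bool:
        # A suppression only counts when it has both a ticket and an owner.
        return bool(self.suppression_ticket and self.suppression_owner)


def blocking_findings(findings):
    """Return findings that should fail the pipeline."""
    return [
        f for f in findings
        if f.severity.lower() in BLOCKING_SEVERITIES and not f.suppressed
    ]
```

A pipeline step would exit non-zero when `blocking_findings` returns anything. A High finding that has a ticket but no owner still blocks, which enforces the third rule.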
+
+---
+
+### Stage 5 — Build Artefact
+
+**What is produced:** [Docker image / binary / zip — be specific]
+**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL]
+**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`)
+**Trigger:** Commits to `main` only (not feature branches)
+
+---
+
+### Stage 6 — Deploy to Staging
+
+**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
+**Staging URL:** [URL]
+**Trigger:** Automatic on successful artefact build from `main`
+**Who can deploy to staging:** Any engineer (automatic)
+
+**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
+**Staging is not production:** [Any differences in config, scale, or data — state them here]
+
+---
+
+### Stage 7 — Smoke Tests (Staging)
+
+**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
+**Tool:** [e.g. Playwright / Postman / custom script]
+**Pass criteria:** All smoke tests pass within [X seconds] timeout
+**Blocking:** Yes — production deploy will not proceed if smoke tests fail
+
+**Smoke test suite location:** [Link to test files or folder]
+
+---
+
+### Stage 8 — Manual Approval Gate
+
+**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
+**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer]
+**Approval timeout:** [e.g. 24 hours — auto-cancelled if no approval]
+**How to approve:** [GitHub Actions approve step / Slack command / other — with link]
+
+**When to withhold approval:**
+- Active incident in production
+- Deploy is outside the deployment window (see below)
+- On-call engineer has not been notified
+
+---
+
+### Stage 9 — Deploy to Production
+
+**Deployment method:** [Same as staging or different — specify]
+**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
+**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
+**Deployment notifications:** [Slack channel — #deployments]
+
+**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass.
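
A deployment window is easier to enforce when it is checked by code, not memory. The Python sketch below uses the example window from this stage (Monday to Thursday, 09:00 to 16:00 UTC). The constant and function names are assumptions, the closing time is treated as exclusive, and bank holidays are not handled.

```python
from datetime import datetime, time, timezone

# Example window from the stage definition; replace with the team's own.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 through Thursday=3
WINDOW_OPENS = time(9, 0)
WINDOW_CLOSES = time(16, 0)      # treated as exclusive (an assumption)


def in_deployment_window(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside the window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    return (
        utc_moment.weekday() in ALLOWED_WEEKDAYS
        and WINDOW_OPENS <= utc_moment.time() < WINDOW_CLOSES
    )
```

An approval step could call `in_deployment_window(datetime.now(timezone.utc))` and refuse to proceed when it returns `False`, unless the deploy is a hotfix.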
+
+---
+
+### Stage 10 — Post-Deploy Checks
+
+**Automated checks (run for [X minutes] after deploy):**
+- [ ] Error rate: <[X]% (baseline: [Y]%)
+- [ ] P99 latency: <[X]ms (baseline: [Y]ms)
+- [ ] [Key business metric]: within [X]% of baseline
+
+**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link]
+
+**If a check fails:** See Rollback Procedure below.
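
The checklist above mixes two kinds of check: metrics that must stay below a limit (error rate, P99 latency) and metrics that must stay within a percentage of baseline. The Python sketch below evaluates both kinds. The class names and sample values are hypothetical, and fetching metrics from a monitoring system is out of scope.

```python
from dataclasses import dataclass


@dataclass
class UpperBoundCheck:
    """Passes while the observed value stays strictly below the limit."""
    name: str
    observed: float
    limit: float

    def passed(self) -> bool:
        return self.observed < self.limit


@dataclass
class BaselineCheck:
    """Passes while the observed value is within tolerance_pct of baseline."""
    name: str
    observed: float
    baseline: float
    tolerance_pct: float

    def passed(self) -> bool:
        allowed = abs(self.baseline) * self.tolerance_pct / 100
        return abs(self.observed - self.baseline) <= allowed


def failing_checks(checks) -> list:
    """Return the names of checks that did not pass."""
    return [c.name for c in checks if not c.passed()]


# Illustrative values only.
sample_checks = [
    UpperBoundCheck("error rate (%)", observed=0.4, limit=1.0),
    UpperBoundCheck("p99 latency (ms)", observed=850, limit=800),
    BaselineCheck("orders per minute", observed=95, baseline=100, tolerance_pct=10),
]
```

A non-empty result from `failing_checks` is the signal to start the rollback procedure. In the sample, only the latency check fails.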
+
+---
+
+## Environments
+
+| Environment | Purpose | Deploy trigger | URL | Data |
+|---|---|---|---|---|
+| **Dev** | Local development | Manual | localhost | Seeded test data |
+| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
+| **Production** | Live traffic | Manual approval | [URL] | Live data |
+
+---
+
+## Branching Strategy
+
+**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly]
+
+| Branch | Purpose | Who merges | Deploy target |
+|---|---|---|---|
+| `main` | Production-ready code | PR + review | Staging → Production |
+| `feature/*` | Feature development | Author | None (CI only) |
+| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval |
+
+**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]
+
+---
+
+## Rollback Procedure
+
+**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]
+
+**Manual rollback steps:**
+```bash
+# 1. Identify the last known good image tag
+[command to list recent deployments]
+
+# 2. Deploy the previous version
+[deployment command with previous tag]
+
+# 3. Confirm rollback is live
+[smoke test command or health check URL]
+
+# 4. Notify the team
+[Slack command or template]
+```
+
+**Rollback decision authority:** Any engineer on-call can initiate a rollback without waiting for approval.
+
+**After a rollback:**
+1. Create a post-deploy incident report (see [incident-postmortem skill])
+2. Do not re-deploy the same commit without fixing the root cause
+3. Notify [stakeholder / support team] of the rollback and expected fix timeline
+
+---
+
+## Secrets and Configuration Management
+
+**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify]
+**How to add a new secret:**
+1. [Step 1]
+2. [Step 2]
+**Who has access:** [Role or team]
+**Rotation policy:** [How often secrets are rotated and who owns it]
+
+**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4) which will flag this.
+
+---
+
+## Common Failures and Fixes
+
+| Failure | Likely cause | Fix |
+|---|---|---|
+| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the updated lock file |
+| Integration tests timeout | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
+| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
+| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` |
+| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike persists for more than 5 minutes |
+
+---
+
+## On-Call Responsibilities During Deploy
+
+- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
+- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments`
+- For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying
+
+---
+
+## Quality Checks
+
+- [ ] Every stage has a clear owner when it fails
+- [ ] Rollback procedure is tested — not theoretical
+- [ ] Secrets management section names the actual tool used (not "use secrets management")
+- [ ] Deployment window is specific — not "during business hours"
+- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics
@@ -0,0 +1,176 @@
+---
+name: cs-escalation-brief
+description: "Write a structured escalation brief for an at-risk customer account. Use when an account has escalated, when a customer is threatening churn, when a P1 customer issue needs executive attention, or when preparing an internal save play. Produces a crisp escalation brief with account context, timeline, root cause, business impact, and a clear resolution plan."
+---
+
+# Customer Escalation Brief Skill
+
+Produce a clear, concise escalation brief that gives internal stakeholders — VP CS, CCO, product leadership, or the CEO — everything they need to understand the situation, make decisions, and act fast.
+
+A good escalation brief is not a complaint. It is a professional document that states the facts, assigns accountability honestly, and proposes a specific resolution plan.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Account name**, tier, and ARR
+- **CSM name** and account owner
+- **Nature of the escalation** — what happened, what the customer is saying
+- **Timeline** of events leading to escalation
+- **Customer contact** who escalated (name, role, influence level)
+- **What the customer wants** — their stated ask
+- **What we believe the root cause is**
+- **What has already been done** to address the situation
+- **Renewal date** and current renewal risk assessment
+
+## Escalation Levels
+
+Calibrate urgency and audience based on escalation level:
+
+| Level | Trigger | Audience | Response time |
+|---|---|---|---|
+| L1 — Account Risk | Customer expressing dissatisfaction; renewal at risk | CSM + CS Manager | 24 hours |
+| L2 — Executive Escalation | Customer escalated to their exec; requesting vendor exec involvement | VP CS + Account Exec | 4 hours |
+| L3 — Churn Risk | Customer has issued notice or is in active churn conversation | CCO / CEO + Revenue leadership | 1 hour |
+| L4 — Public Risk | Customer threatening public escalation, legal, or press | CCO / Legal / Comms | Immediate |
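
The level table can also be held as data, so that a response deadline is computed instead of remembered. The Python sketch below copies the response times from the table and treats "Immediate" as a zero delay. The dictionary and function names are assumptions.

```python
from datetime import datetime, timedelta

# Response times copied from the table; "Immediate" is modelled as zero delay.
RESPONSE_TIMES = {
    "L1": timedelta(hours=24),
    "L2": timedelta(hours=4),
    "L3": timedelta(hours=1),
    "L4": timedelta(0),
}


def response_deadline(level: str, raised_at: datetime) -> datetime:
    """Return the latest time by which the audience should have responded."""
    if level not in RESPONSE_TIMES:
        raise ValueError(f"unknown escalation level: {level}")
    return raised_at + RESPONSE_TIMES[level]
```

For an L2 escalation raised at 09:00, `response_deadline` returns 13:00 on the same day.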
+
+## Output Format
+
+---
+
+# Escalation Brief: [Account Name]
+
+**Escalation level:** L[1/2/3/4] — [Label]
+**Date raised:** [Date]
+**Raised by:** [CSM name]
+**Escalation owner:** [Name of exec or senior stakeholder now leading response]
+
+---
+
+## Account at a Glance
+
+| Field | Detail |
+|---|---|
+| ARR | £/$/€[X] |
+| Tier | Enterprise / Mid-Market / SMB |
+| Customer since | [Date] |
+| Renewal date | [Date] — [N] days away |
+| Renewal risk (pre-escalation) | Green / Amber / Red |
+| Renewal risk (current) | Green / Amber / Red |
+| Customer contact who escalated | [Name, role, seniority] |
+| Executive sponsor (customer) | [Name, role — active / passive / vacant] |
+| Executive sponsor (vendor) | [Name, role] |
+
+---
+
+## What Happened — Summary
+
+[3–5 sentences. State the facts plainly. What the customer experienced, how they reacted, and how we learned about the escalation. No editorialising. No blame.]
+
+---
+
+## Timeline
+
+List in chronological order. Each entry: `[Date / time] — [What happened. Who did what.]`
+
+Include:
+- When the original issue or trigger event occurred
+- When the customer first raised concerns (informally)
+- When it escalated (formal escalation or exec involvement)
+- Actions taken since escalation
+
+---
+
+## Root Cause
+
+**Primary cause:** [One clear sentence. What specifically went wrong.]
+
+**Contributing factors:**
+- [Factor 1 — be honest about internal failures as well as external ones]
+- [Factor 2]
+
+**Is this a systemic issue or isolated?**
+- [ ] Isolated to this account
+- [ ] Pattern seen in other accounts — details: [_______]
+- [ ] Product or process gap that needs fixing
+
+---
+
+## Customer's Stated Position
+
+**What the customer says happened:** [Their version of events — fair and unfiltered]
+
+**What they are asking for:** [Their explicit ask — compensation, fix by date, exec call, SLA credit, exit clause]
+
+**Sentiment of escalating contact:** [Frustrated but constructive / Angry / Seeking exit / Unknown]
+
+**Risk of public escalation:** Low / Medium / High — [evidence if Medium or High]
+
+---
+
+## Business Impact
+
+| Impact type | Detail |
+|---|---|
+| ARR at risk | £/$/€[X] |
+| Potential churn probability | [X]% |
+| Reputational risk | Low / Medium / High |
+| Reference / case study status | [Was a reference — now at risk / Not a reference] |
+| Expansion pipeline at risk | £/$/€[X] |
+
+---
+
+## What Has Been Done So Far
+
+1. [Action taken — by whom — date — outcome]
+2. [Action taken — by whom — date — outcome]
+3. [Action taken — by whom — date — outcome]
+
+**Has a formal apology or acknowledgement been issued?** Yes / No
+
+---
+
+## Proposed Resolution Plan
+
+**Immediate actions (next 24–48 hours):**
+
+| Action | Owner | By when |
+|---|---|---|
+| [Action] | [Name] | [Date] |
+| [Action] | [Name] | [Date] |
+
+**Medium-term actions (next 2–4 weeks):**
+
+| Action | Owner | By when |
+|---|---|---|
+| [Action] | [Name] | [Date] |
+
+**What we are NOT offering:** [Be explicit about what is not on the table — avoids misaligned expectations]
+
+**Success criteria:** [How will we know the escalation is resolved? What does the customer need to confirm they are satisfied?]
+
+---
+
+## Decision Required from Escalation Owner
+
+[State clearly what decision or resource the escalation owner needs to provide. Be specific — do not make them ask. E.g.: "We need approval to offer a 20% service credit for Q2" or "We need an exec call with [name] within 48 hours."]
+
+---
+
+## Communication Plan
+
+| Audience | Message | Channel | Owner | By when |
+|---|---|---|---|---|
+| Escalating customer contact | [Summary of message] | Email / Call | [Name] | [Date] |
+| Customer exec sponsor | [Summary] | Call | [Name] | [Date] |
+| Internal CS team | [Summary] | Slack / Meeting | CS Manager | [Date] |
+
+---
+
+## Quality Checks
+
+- [ ] Root cause is specific — not "communication breakdown" or "product gap" without detail
+- [ ] Customer's position is stated fairly — not minimised or dismissed
+- [ ] A clear decision is requested from the escalation owner — brief does not end with "what do you think?"
+- [ ] ARR at risk is quantified
+- [ ] Communication plan has owners and dates — not "TBD"
+- [ ] Language is professional and blameless toward individuals
@@ -0,0 +1,141 @@
+---
+name: cs-health-scorecard
+description: "Build a customer health scorecard for a specific account. Use when asked to score account health, assess renewal risk, build a health dashboard, or evaluate an account's likelihood to renew or expand. Produces a structured health scorecard with a RAG status, dimension scores, key risks, and recommended actions."
+---
+
+# Customer Health Scorecard Skill
+
+Produce a structured, data-driven health scorecard for a customer account — giving the CSM and leadership a clear view of renewal risk, expansion potential, and the actions needed to move the account in the right direction.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Account name** and tier (enterprise / mid-market / SMB)
+- **Contract value** (ARR) and **renewal date**
+- **Product usage data** — logins, DAU/MAU ratio, key feature adoption
+- **Support data** — open tickets, CSAT or NPS score, recent escalations
+- **Engagement data** — last QBR date, executive sponsor status, champion name
+- **Commercial data** — payment history, expansion conversations, seats used vs. licensed
+- **Any known risks or recent changes** at the account
+
+## Scoring Framework
+
+Score each dimension 1–5. Weight as shown. Calculate weighted total out of 100.
+
+| Dimension | Weight | What to Score |
+|---|---|---|
+| **Product Adoption** | 30% | DAU/MAU ratio, breadth of features used, power users identified |
+| **Engagement** | 20% | QBR cadence, executive sponsor active, champion strength |
+| **Outcomes** | 20% | Customer hitting their stated goals / success metrics |
+| **Support Health** | 15% | Ticket volume trend, unresolved escalations, CSAT |
+| **Commercial** | 15% | On-time payments, seats utilised, expansion signals |
+
+**Score → RAG conversion:**
+- 80–100: Green (healthy, renew likely)
+- 60–79: Amber (at risk, needs attention)
+- 0–59: Red (high churn risk, escalate)
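
The framework does not spell out how 1–5 scores become a total out of 100. One reading is to scale each score to a fraction of 5 and multiply by the dimension's weight. The Python sketch below implements that reading, which is an assumption, not the skill's stated formula. Under it, the lowest possible total is 20, because every dimension scores at least 1.

```python
# Weights copied from the scoring table, as percentages.
WEIGHTS_PCT = {
    "Product Adoption": 30,
    "Engagement": 20,
    "Outcomes": 20,
    "Support Health": 15,
    "Commercial": 15,
}


def weighted_total(scores: dict) -> float:
    """Convert 1-5 dimension scores into a total out of 100.

    Assumed formula: sum over dimensions of weight_pct * score / 5.
    """
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("scores must cover exactly the five dimensions")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score must be between 1 and 5")
    return sum(WEIGHTS_PCT[name] * score / 5 for name, score in scores.items())


def rag_status(total: float) -> str:
    """Apply the Green / Amber / Red bands from the conversion list."""
    if total >= 80:
        return "Green"
    if total >= 60:
        return "Amber"
    return "Red"


# Illustrative account: strong commercially, weak on support.
sample_scores = {
    "Product Adoption": 4,
    "Engagement": 3,
    "Outcomes": 4,
    "Support Health": 2,
    "Commercial": 5,
}
```

The sample account totals 73 (24 + 12 + 16 + 6 + 15), which falls in the Amber band.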
+
+## Output Format
+
+---
+
+# Customer Health Scorecard: [Account Name]
+
+**CSM:** [Name] | **Tier:** [Enterprise / Mid-Market / SMB]
+**ARR:** £/$/€[X] | **Renewal date:** [Date] | **Days to renewal:** [N]
+**Overall health:** [Green / Amber / Red] — [Score]/100
+**Last updated:** [Date]
+
+---
+
+## Health Score Summary
+
+| Dimension | Score (1–5) | Weight | Weighted Score | Trend |
+|---|---|---|---|---|
+| Product Adoption | [1–5] | 30% | [X] | ↑ / → / ↓ |
+| Engagement | [1–5] | 20% | [X] | ↑ / → / ↓ |
+| Outcomes | [1–5] | 20% | [X] | ↑ / → / ↓ |
+| Support Health | [1–5] | 15% | [X] | ↑ / → / ↓ |
+| Commercial | [1–5] | 15% | [X] | ↑ / → / ↓ |
+| **Total** | — | 100% | **[X]/100** | |
+
+---
+
+## Dimension Detail
+
+### Product Adoption — [Score]/5
+- **DAU/MAU ratio:** [X]% (benchmark: >25% = healthy)
+- **Key features adopted:** [List features in use]
+- **Features not adopted:** [List unused high-value features]
+- **Power users identified:** [Yes / No — how many]
+- **Assessment:** [1–2 sentences on adoption health]
+
+### Engagement — [Score]/5
+- **Last QBR:** [Date] — [Outcome summary]
+- **Next QBR:** [Scheduled / Overdue]
+- **Executive sponsor:** [Active / Passive / Vacant]
+- **Champion:** [Name, role, strength: strong / moderate / weak]
+- **Assessment:** [1–2 sentences]
+
+### Outcomes — [Score]/5
+- **Customer's stated goals:** [List 2–3 goals from onboarding or last QBR]
+- **Progress against goals:** [On track / Partial / Off track]
+- **Evidence of value:** [Metric or quote that demonstrates ROI]
+- **Assessment:** [1–2 sentences]
+
+### Support Health — [Score]/5
+- **Open tickets:** [N] (priority breakdown: P1: X, P2: X, P3: X)
+- **CSAT / NPS:** [Score] (benchmark: >8 CSAT / >30 NPS = healthy)
+- **Unresolved escalations:** [Yes / No — details if yes]
+- **Ticket trend (last 90 days):** Increasing / Stable / Decreasing
+- **Assessment:** [1–2 sentences]
+
+### Commercial — [Score]/5
+- **Seats licensed:** [N] | **Seats active:** [N] ([X]% utilisation)
+- **Payment history:** [On time / Late — details]
+- **Expansion signals:** [Yes — describe / No]
+- **Downgrade or cancellation signals:** [Yes — describe / No]
+- **Assessment:** [1–2 sentences]
+
+---
+
+## Top Risks
+
+| Risk | Severity | Mitigation |
+|---|---|---|
+| [Risk description] | High / Medium / Low | [Specific action to mitigate] |
+
+---
+
+## Recommended Actions
+
+**Immediate (this week):**
+1. [Action — owner — deadline]
+
+**This month:**
+1. [Action — owner — deadline]
+
+**Before renewal:**
+1. [Action — owner — deadline]
+
+---
+
+## Renewal Forecast
+
+| Scenario | Probability | ARR at risk |
+|---|---|---|
+| Full renewal at current ARR | [X]% | £/$/€0 |
+| Renewal with contraction | [X]% | £/$/€[X] |
+| Churn | [X]% | £/$/€[full ARR] |
+
+**Recommended renewal play:** [Expand / Hold / Save / Manage out]
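
Two quick checks help keep the forecast table honest: the scenario probabilities should sum to 100%, and they can be combined into a single probability-weighted ARR at risk. The Python sketch below does both. The weighted figure is a derived number that the table does not ask for, and the function name and sample values are illustrative.

```python
def expected_arr_at_risk(scenarios) -> float:
    """Probability-weighted ARR at risk across forecast scenarios.

    scenarios: iterable of (label, probability_pct, arr_at_risk) tuples.
    """
    rows = list(scenarios)
    total_pct = sum(pct for _, pct, _ in rows)
    if abs(total_pct - 100) > 1e-9:
        raise ValueError(f"probabilities sum to {total_pct}%, expected 100%")
    return sum(pct * arr / 100 for _, pct, arr in rows)


# Illustrative forecast for an account with 100,000 ARR.
sample_forecast = [
    ("Full renewal at current ARR", 60, 0),
    ("Renewal with contraction", 30, 20_000),
    ("Churn", 10, 100_000),
]
```

For the sample forecast the weighted ARR at risk is 16,000: 6,000 from the contraction scenario plus 10,000 from the churn scenario.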
+
+---
+
+## Quality Checks
+
+- [ ] Score is based on data, not gut feel — each dimension has evidence
+- [ ] Risks are specific (not "low engagement" — something like "executive sponsor left in March, no replacement identified")
+- [ ] Actions have owners and deadlines
+- [ ] Renewal probability is calibrated against pipeline reality
+- [ ] Trend arrows reflect direction of change vs. last scorecard, not just current state
@@ -0,0 +1,332 @@
+---
+name: developer-onboarding-doc
+description: "Write a developer onboarding document for a service, codebase, or team. Use when asked to write a developer guide, service README, onboarding doc for a new engineer, codebase orientation, or getting-started guide for a technical team. Produces a structured doc covering service overview, architecture, local setup, key patterns, testing, deployment, and who to ask for what."
+---
+
+# Developer Onboarding Document Skill
+
+Produce a complete developer onboarding document for a service or team — covering everything a new engineer needs to be productive within their first week.
+
+A good onboarding doc is not a wiki dump. It answers the questions a new engineer actually has on day one, in the order they'll have them.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Service name** and what it does
+- **Team** responsible for it
+- **Tech stack** — language(s), framework(s), database(s), message queues, etc.
+- **Key external dependencies** — upstream services, third-party APIs
+- **Deployment target** — Kubernetes, ECS, Lambda, bare metal, etc.
+- **Local dev setup** — how to run locally (Docker Compose, local DB, etc.)
+- **Testing approach** — unit, integration, E2E; test commands
+- **Deployment process** — summary of how code gets to production
+- **On-call setup** — who's on-call, how alerts work
+- **Contacts** — tech lead, platform team, related service owners
+
+## Output Format
+
+---
+
+# Developer Onboarding: [Service Name]
+
+**Team:** [Team name] | **Tech lead:** [Name]
+**Last updated:** [Date] | **Updated by:** [Name]
+
+> If something in this doc is wrong or out of date, fix it now — it will affect every engineer who onboards after you.
+
+---
+
+## What This Service Does
+
+[3–5 sentences. What problem does this service solve? Who calls it, and who does it call? What would break if this service went down?]
+
+**Service type:** [API / Background worker / Event consumer / Data pipeline / etc.]
+**Consumers:** [List internal services or external clients that depend on this service]
+**Dependencies:** [List upstream services, databases, and third-party APIs this service calls]
+
+**Architecture diagram:** [Link or embed — even a rough ASCII diagram helps]
+
+```
+[Caller A] ──→ [This Service] ──→ [Database]
+                      │
+                      └──→ [Downstream Service]
+```
+
+---
+
+## Codebase Orientation
+
+**Repository:** [Link]
+**Main branch:** `[main / master]`
+**Language:** [e.g. Go 1.22 / Node.js 20 / Python 3.12]
+**Framework:** [e.g. Express / FastAPI / Gin / Rails]
+
+### Key directories
+
+```
+[repo-root]/
+├── [src/ or cmd/]          # Application code
+│   ├── [handlers/]         # HTTP handlers / controllers
+│   ├── [services/]         # Business logic
+│   ├── [repository/]       # Database access layer
+│   └── [models/]           # Data models / types
+├── [tests/]                # Test files
+├── [migrations/]           # Database migrations
+├── [scripts/]              # Utility scripts
+├── [.github/workflows/]    # CI/CD pipeline definitions
+└── [docs/]                 # Additional documentation
+```
+
+**Where to start reading:** [Point to 2–3 key files that give the best orientation — e.g. `main.go`, `routes.js`, `app.py`]
+
+### Things that might surprise you
+
+- [Unusual pattern 1 — e.g. "We use event sourcing — state is derived from an event log, not stored directly"]
+- [Unusual pattern 2 — e.g. "Auth is handled by the gateway — this service trusts the `X-User-Id` header"]
+- [Unusual pattern 3 — any non-obvious decisions or legacy choices]
+
+---
+
+## Local Development Setup
+
+**Estimated setup time:** [X minutes for a fresh machine]
+
+### Prerequisites
+
+- [ ] [Tool 1] — version [X] — [install link]
+- [ ] [Tool 2] — version [X] — [install link]
+- [ ] Access to [repo / internal package registry] — request from [who]
+- [ ] [Any secrets or credentials needed] — request from [who]
+
+### Step-by-step setup
+
+```bash
+# 1. Clone the repo
+git clone [repo URL]
+cd [repo-name]
+
+# 2. Copy and configure environment variables
+cp .env.example .env
+# Edit .env — see "Environment Variables" section below
+
+# 3. Start dependencies (database, cache, etc.)
+[docker compose up -d / make deps / etc.]
+
+# 4. Install dependencies
+[npm install / go mod download / pip install -r requirements.txt]
+
+# 5. Run database migrations
+[migration command]
+
+# 6. Start the service
+[start command]
+
+# 7. Verify it's working
+curl http://localhost:[PORT]/health
+# Expected: {"status":"ok"}
+```
+
+**If this doesn't work:** Check [Troubleshooting section below] or ask in `#[channel]`.
+
+### Environment Variables
+
+| Variable | Required | Description | Example |
+|---|---|---|---|
+| `DATABASE_URL` | Yes | Connection string for the primary DB | `postgres://localhost:5432/[db]` |
+| `[VAR_2]` | Yes | [Description] | [Example] |
+| `[VAR_3]` | No | [Description — default value] | [Example] |
+
+**Secrets for local dev:** [Where to get them — e.g. "Run `[command]` to pull from Vault" or "Ask [person] in #[channel]"]
+
+### Useful local commands
+
+```bash
+[start command]           # Start the service
+[test command]            # Run all tests
+[lint command]            # Run linter
+[format command]          # Format code
+[migration command]       # Run pending migrations
+[seed command]            # Seed local database
+```
+
+---
+
+## Testing
+
+**Testing philosophy:** [e.g. "We test at the integration layer — unit tests for pure functions, integration tests for anything touching the DB or external services"]
+
+### Running tests
+
+```bash
+# All tests
+[test command]
+
+# Unit tests only
+[unit test command]
+
+# Integration tests (requires local deps running)
+[integration test command]
+
+# A specific test file or test case
+[test command with filter]
+```
+
+**Test coverage:** [X]% (minimum required to pass CI: [Y]%)
+**Coverage report:** [Where to find it]
+
+### Writing tests
+
+- **Unit tests:** [Where to put them — e.g. alongside source files as `*_test.go`]
+- **Integration tests:** [Where to put them — e.g. `tests/integration/`]
+- **Test database:** [How it works — e.g. "Each test gets a clean transaction that rolls back on teardown — see `tests/helpers/db.go`"]
+- **Mocking:** [Policy — e.g. "We mock at the repository layer — don't mock the DB directly"]
+
+---
+
+## Making Changes
+
+### Branching
+
+[Branch naming convention — e.g. `feature/[ticket-id]-short-description`, `fix/[ticket-id]-short-description`]
+
+### Before opening a PR
+
+- [ ] Tests pass locally
+- [ ] Linter passes (`[lint command]`)
+- [ ] New behaviour has test coverage
+- [ ] Any new environment variables are added to `.env.example` and documented
+- [ ] Database migrations are backward-compatible (old code can run against new schema)
+
+### Code review
+
+- **Reviewers:** [Who to request review from — e.g. "Any engineer on [team]; lead review required for auth changes"]
+- **Expected review time:** [X hours / 1 business day]
+- **PR template:** [Link or auto-generated by GitHub]
+
+### Database migrations
+
+```bash
+# Create a new migration
+[migration create command]
+
+# Apply pending migrations
+[migration up command]
+
+# Roll back last migration
+[migration down command]
+```
+
+**Migration rules:**
+- All migrations must be backward-compatible — old code must run against the new schema
+- Never rename or drop a column in a single migration: do it in stages (add the new column, migrate the data, then drop the old column in a later migration)
+- Test your rollback before merging
+
+---
+
+## Deployment
+
+**How code gets to production:** [1–2 sentence summary — link to full CI/CD playbook if it exists]
+
+1. Merge to `main` → automatic deploy to staging
+2. Smoke tests run on staging
+3. Manual approval → deploy to production
+4. Post-deploy monitoring for [X minutes]
+
+**Deployment docs:** [Link to CI/CD playbook or pipeline docs]
+
+**Who can deploy:** [Any engineer / Lead engineer / On-call engineer — specify]
+
+**Deployment channel:** `#[deployments channel]`
+
+---
+
+## Monitoring and Observability
+
+**Dashboard:** [Datadog / Grafana / CloudWatch — link]
+**Logs:** [Log aggregation tool and link — e.g. "Logs are in Datadog under service:[name]"]
+**Traces:** [Tracing tool and link if applicable]
+**Alerts:** [Where alerts fire — e.g. PagerDuty / Slack #alerts-[service]]
+
+**Key metrics to know:**
+- **Error rate:** Should be <[X]% (alert at [Y]%)
+- **P99 latency:** Should be <[X]ms
+- **[Business metric]:** [e.g. "Queue depth should be <100 items"]
+
+---
+
+## On-Call
+
+**On-call schedule:** [PagerDuty / Opsgenie link]
+**Who's on-call now:** [Link to current schedule or `#oncall` channel]
+**Escalation:** [On-call → [team lead] → [EM] — after [X] minutes unacknowledged]
+
+**If you get paged:**
+1. Acknowledge the alert
+2. Check [dashboard link] for the first clue
+3. Common alert runbooks: [link to oncall-runbook or runbook-writer output]
+4. If you can't resolve in [X minutes], escalate to [person/channel]
+
+---
+
+## Key Contacts
+
+| Role | Name | Best way to reach |
+|---|---|---|
+| Tech lead | [Name] | Slack: @[handle] |
+| On-call rotation | [Team] | PagerDuty / `#on-call` |
+| Platform / infra | [Team] | `#platform` Slack channel |
+| Database / DBA | [Name or team] | `#database` Slack channel |
+| [Upstream service] owner | [Name] | Slack: @[handle] |
+
+**Where to ask questions:**
+- General engineering: `#engineering`
+- This service specifically: `#[service-name]`
+- Urgent / production issues: `#incidents`
+
+---
+
+## Troubleshooting
+
+### "The service won't start locally"
+
+1. Check that Docker / dependencies are running: `[command]`
+2. Check `.env` is populated — missing values cause silent failures
+3. Check logs: `[log command]`
+4. Ask in `#[channel]`
+
+### "Tests are failing locally but passing in CI"
+
+- Check your local dependency versions match CI: `[version check command]`
+- Try a clean install: `[clean install command]`
+- Integration tests need local deps running — `[start deps command]`
+
+### "I can't access [internal tool / system]"
+
+- Request access through [process — e.g. Okta self-serve / ask your manager]
+
+### "Something looks wrong in production"
+
+1. Check [dashboard] for the error spike
+2. Check recent deploys in `#deployments`
+3. If it's an active incident, page on-call via [PagerDuty / Slack command]
+
+---
+
+## Further Reading
+
+- [Architecture Decision Records (ADRs)](./docs/decisions/) — why the codebase is the way it is
+- [API documentation](./docs/api/) or [link to external docs]
+- [Incident runbooks](./docs/runbooks/)
+- [CI/CD pipeline documentation](./docs/cicd/)
+- [Team working agreements](./docs/team/)
+
+---
+
+## Quality Checks
+
+- [ ] Local setup instructions work on a fresh machine — tested recently
+- [ ] Environment variables table is complete and accurate
+- [ ] "Things that might surprise you" captures the actual surprises (ask a recent joiner)
+- [ ] On-call section has real links, not placeholders
+- [ ] Contacts are current — team members with real Slack handles
+- [ ] Troubleshooting covers the top 3 actual questions new joiners ask
@@ -0,0 +1,364 @@
+---
+name: oncall-runbook
+description: "Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template."
+---
+
+# On-Call Runbook Skill
+
+Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.
+
+A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Service name** and what it does
+- **Team** and tech lead name
+- **Alert list** — names of alerts that currently page on-call
+- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc.
+- **Common failure modes** — what breaks most often, and what fixes it
+- **Escalation contacts** — who to call when on-call can't resolve it
+- **Deployment setup** — can on-call roll back? How?
+- **Service dependencies** — what does this service depend on, and what depends on it?
+
+## Output Format
+
+---
+
+# On-Call Runbook: [Service Name]
+
+**Team:** [Team name] | **Tech lead:** [Name]
+**PagerDuty service:** [Link] | **Escalation policy:** [Policy name]
+**Last updated:** [Date] | **Next review:** [Date + 90 days]
+
+> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.
+
+---
+
+## Quick Reference
+
+**Dashboard:** [Link — the first thing to open when paged]
+**Logs:** [Link — where to find logs]
+**Runbook index:** Jump to the alert that paged you → [Alert list below]
+**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty]
+
+**Rollback command (memorise this):**
+```bash
+[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
+```
+
+---
+
+## Escalation Matrix
+
+| Situation | Escalate to | How | After how long |
+|---|---|---|---|
+| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
+| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately |
+| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) |
+| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
+| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
+| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |
+
+**Contacts:**
+
+| Name | Role | Slack | Phone |
+|---|---|---|---|
+| [Name] | Tech lead | @[handle] | [Number] |
+| [Name] | Engineering manager | @[handle] | [Number] |
+| [Name] | Platform / infra | @[handle] | [Number] |
+| [Platform team] | Infra on-call | `#platform` | PagerDuty |
+
+---
+
+## Service Architecture (Quick View)
+
+```
+[Upstream callers]
+        │
+        ▼
+[This Service]
+        │
+        ├──→ [Primary Database]
+        ├──→ [Cache — e.g. Redis]
+        └──→ [Downstream Service / Queue]
+```
+
+**If this service is down, these are affected:** [List the callers and consumers that depend on this service]
+**If these are down, this service is affected:** [List the databases, caches, queues, and services this service calls]
+
+---
+
+## Alert Runbooks
+
+### ALERT: [Alert Name 1 — e.g. HighErrorRate]
+
+**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
+**Severity:** P1 / P2 / P3
+**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]
+
+**Step 1 — Acknowledge and assess**
+```bash
+# Check current error rate
+[query or dashboard link]
+
+# Check which endpoints are erroring
+[query or command]
+```
+
+**Step 2 — Check recent changes**
+```bash
+# Any deploys in the last hour?
+[command or link to deployment log]
+
+# Recent config changes?
+[where to check]
+```
+
+**Step 3 — Check dependencies**
+```bash
+# Is the database healthy?
+[health check command or link]
+
+# Is [downstream service] healthy?
+[health check command or link]
+```
+
+**Step 4 — Diagnose**
+
+| If you see | It means | Do this |
+|---|---|---|
+| [Error pattern 1] | [Cause] | [Action] |
+| [Error pattern 2] | [Cause] | [Action] |
+| [Error pattern 3] | [Cause] | [Action] |
+| No clear pattern | Unknown cause | Escalate to [name] |
+
+**Step 5 — Fix or mitigate**
+```bash
+# If caused by bad deploy — roll back:
+[rollback command]
+
+# If caused by [specific issue]:
+[fix command]
+
+# If caused by upstream dependency:
+[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
+```
+
+**After resolving:**
+- [ ] Confirm error rate has returned to baseline
+- [ ] Check no downstream services were affected
+- [ ] If P1: open a post-incident review — see [incident-postmortem skill]
+- [ ] Update `#incidents` with resolution summary
+
+---
+
+### ALERT: [Alert Name 2 — e.g. HighLatency]
+
+**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
+**Severity:** P1 / P2 / P3
+**SLO impact:** Yes — latency SLO breach
+
+**Step 1 — Assess scope**
+```bash
+# Check which endpoints are slow
+[query or dashboard — broken down by endpoint]
+
+# Check if latency is across all regions or localised
+[query or command]
+```
+
+**Step 2 — Common causes and fixes**
+
+| Cause | Signal | Fix |
+|---|---|---|
+| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] |
+| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
+| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
+| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
+| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] |
+
+**Step 3 — Escalate if unresolved in 20 minutes**
+Page [Tech lead] via PagerDuty / Slack.
+
+---
+
+### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted]
+
+**What it means:** [e.g. "The service has used all available database connections — new requests will fail"]
+**Severity:** P1
+**SLO impact:** Yes — will cause errors immediately
+
+**Immediate mitigation:**
+```bash
+# Restart the service to flush stale connections
+[restart command]
+
+# Check current connection count
+[DB connection query]
+```
+
+**Diagnose root cause after stabilising:**
+```bash
+# Check for long-running queries holding connections
+[query]
+
+# Check if a recent deploy changed connection pool config
+[where to check]
+```
+
+**Resolution:** [e.g. "Increase pool size in config / kill long-running queries / scale the service"]
+
+---
+
+### ALERT: [Alert Name 4 — e.g. QueueBacklogHigh / ConsumerLag]
+
+**What it means:** [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"]
+**Severity:** P2
+**SLO impact:** Depends — if queue backs up, downstream systems will receive delayed data
+
+**Step 1 — Check consumer health**
+```bash
+# Are consumers running?
+[command]
+
+# Consumer error rate?
+[dashboard or query]
+```
+
+**Step 2 — Check message contents**
+```bash
+# Are there poison messages causing retries?
+[command to inspect dead-letter queue or failed messages]
+```
+
+**Step 3 — Options**
+
+| If | Then |
+|---|---|
+| Consumers are down | Restart consumers: `[command]` |
+| Poison message in queue | Move to DLQ: `[command]` |
+| Consumers healthy but slow | Scale consumers: `[command]` |
+| Upstream producing too fast | Escalate to [upstream service] owner |
+
+---
+
+### ALERT: [Add additional alerts following the same pattern]
+
+---
+
+## Diagnostic Cheat Sheet
+
+Common commands for quick diagnosis. Fill in real values so that each command can be pasted and run without modification.
+
+```bash
+# Service health
+[health check command]
+
+# Recent logs (last 100 lines)
+[log command]
+
+# Error logs only
+[error log filter command]
+
+# Current pod / instance status
+[kubectl get pods / aws ecs describe-tasks / etc.]
+
+# Restart the service
+[restart command]
+
+# Roll back to previous version
+[rollback command]
+
+# Database connection count
+[DB query]
+
+# Cache hit rate
+[cache stats command]
+
+# Current request rate
+[metrics query]
+```
+
+---
+
+## Useful Dashboard Links
+
+| Dashboard | URL | Use it to |
+|---|---|---|
+| Service overview | [Link] | First stop — error rate, latency, request rate |
+| Database | [Link] | Connection count, slow queries, replication lag |
+| Infrastructure | [Link] | CPU, memory, disk |
+| Queue / consumers | [Link] | Backlog depth, consumer throughput |
+| Upstream dependencies | [Link] | Dependency health at a glance |
+
+---
+
+## Incident Communication
+
+When you declare an incident:
+
+**Post to `#incidents` immediately:**
+```
+🔴 INCIDENT — [Service Name]
+Status: Investigating
+Impact: [Who is affected and how]
+Paged: [Your name]
+Next update: [Time — max 30 min from now]
+```
+
+**Update every 30 minutes while active:**
+```
+🔴 UPDATE — [Service Name] — [Time]
+Status: [Investigating / Identified / Mitigating / Resolved]
+Latest: [One sentence on what you found or did]
+Next update: [Time]
+```
+
+**On resolution:**
+```
+✅ RESOLVED — [Service Name] — [Time]
+Duration: [X minutes]
+Impact: [Summary of who was affected]
+Cause: [One sentence]
+Follow-up: [PIR required? Yes/No — link when created]
+```
+
+---
+
+## On-Call Handoff
+
+Use this template at the end of every on-call shift:
+
+```
+--- ON-CALL HANDOFF: [Service Name] ---
+Date: [Date]
+Outgoing: [Your name]
+Incoming: [Next on-call name]
+
+INCIDENTS THIS SHIFT:
+- [Incident summary — date, duration, cause, resolution, follow-up required]
+
+OPEN ISSUES TO WATCH:
+- [Anything not fully resolved / trending in the wrong direction]
+
+CHANGES SINCE LAST HANDOFF:
+- [Deploys, config changes, infra changes that affect on-call awareness]
+
+RUNBOOK GAPS FOUND:
+- [Anything you had to figure out that isn't documented — please add it]
+
+ANYTHING ELSE:
+- [Notes for incoming on-call]
+```
+
+---
+
+## Quality Checks
+
+- [ ] Every alert that pages on-call has a runbook entry — no alert is missing
+- [ ] Rollback command is accurate and tested recently
+- [ ] Escalation contacts have current phone numbers and Slack handles
+- [ ] Diagnostic commands work — they have been run by at least one person recently
+- [ ] Handoff template is used at every shift change — not just during incidents
+- [ ] "Things I had to figure out that weren't documented" are added to this runbook after every incident
@@ -0,0 +1,218 @@
+---
+name: qbr-deck
+description: "Build a Quarterly Business Review (QBR) deck structure and narrative for a customer account. Use when asked to prepare a QBR, business review meeting, executive review, or quarterly check-in with a customer. Produces a slide-by-slide QBR structure with talking points, metrics review, value narrative, and mutual next steps."
+---
+
+# QBR Deck Skill
+
+Produce a complete Quarterly Business Review deck — structured, data-backed, and customer-focused. A good QBR demonstrates value delivered, aligns on goals for the next quarter, and strengthens the executive relationship. It should never feel like a product demo or a vendor update.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Account name**, CSM name, and customer stakeholders attending
+- **Contract details** — ARR, contract start date, renewal date
+- **Last quarter's goals** (from previous QBR or kickoff)
+- **Usage and adoption data** — key metrics for the quarter
+- **Support summary** — tickets raised, resolution time, any escalations
+- **Business outcomes the customer cares about** — what success looks like for them
+- **Product updates or new features** relevant to this customer
+- **Goals for next quarter**
+- **Any open commercial conversations** (expansion, renewal, at-risk signals)
+
+## QBR Principles
+
+- Lead with customer outcomes, not product features
+- Every metric should connect to a business result the customer cares about
+- The agenda is a conversation, not a presentation — build in time for customer input at every stage
+- Close with mutual commitments, not just vendor actions
+
+## Output Format
+
+---
+
+# QBR: [Account Name] × [Your Company]
+**[Quarter] [Year] Business Review**
+
+**Date:** [Date] | **Location / Call link:** [TBC]
+**Customer attendees:** [Names and roles]
+**[Your company] attendees:** [Names and roles]
+
+---
+
+## Slide 1: Agenda (5 min)
+
+| Time | Topic | Owner |
+|---|---|---|
+| 0:00 | Welcome and introductions | CSM |
+| 0:05 | [Last quarter] — how did we do? | CSM + Customer |
+| 0:20 | Value delivered — business impact | CSM |
+| 0:35 | What's coming — roadmap preview | CSM / Product |
+| 0:45 | [Next quarter] — goals and priorities | Customer |
+| 0:55 | Actions and mutual commitments | CSM |
+| 1:00 | Close | |
+
+*Talking point: "We've kept today to 60 minutes. We want as much of this to be a conversation as possible — please push back, redirect, and ask questions throughout."*
+
+---
+
+## Slide 2: Where We Are Together (2 min)
+
+**Partnership snapshot:**
+- **Customer since:** [Date]
+- **Contract value:** £/$/€[ARR]/year
+- **Renewal date:** [Date]
+- **Active users:** [N] of [N] licensed seats ([X]% adoption)
+- **Products / modules active:** [List]
+
+*Talking point: "Before we dive in — a quick picture of where we are. [X] months in, [Y] active users, and this is our [Nth] QBR together."*
+
+---
+
+## Slide 3: Last Quarter — Goals We Set Together (5 min)
+
+| Goal | Set in [Last QBR / Kickoff] | Status |
+|---|---|---|
+| [Goal 1] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
+| [Goal 2] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
+| [Goal 3] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
+
+For any partial or missed goal: state what happened and what changes next quarter.
+
+*Talking point: "Let's start with accountability. Here's what we said we'd achieve last quarter — let's be honest about where we landed."*
+
+---
+
+## Slide 4: Usage and Adoption (5 min)
+
+**Quarter-over-quarter trend:**
+
+| Metric | [Q-1] | [Q] | Change |
+|---|---|---|---|
+| Monthly active users | [N] | [N] | +/-X% |
+| Sessions per user per week | [N] | [N] | +/-X% |
+| [Key feature 1] adoption | [X]% | [X]% | +/-X% |
+| [Key feature 2] adoption | [X]% | [X]% | +/-X% |
+
+**Highlights:**
+- [Positive adoption trend to call out]
+- [Feature or workflow with strongest engagement]
+
+**Opportunity:**
+- [Feature with low adoption that could drive more value — link to their goals]
+
+*Talking point: "Usage is [up / stable / something we want to talk about]. The area I'd like to focus on is [feature] — we're not seeing the adoption we'd expect given [their goal], and I want to understand why."*
+
+---
+
+## Slide 5: Business Impact — Value Delivered (10 min)
+
+Lead with outcomes, not activity.
+
+**[Outcome 1: customer's primary success metric]**
+- Before: [baseline]
+- Now: [current state]
+- Impact: [quantified business result — time saved, revenue influenced, cost reduced, risk mitigated]
+
+**[Outcome 2]**
+- [Same structure]
+
+**[Outcome 3]**
+- [Same structure]
+
+**Customer evidence** (use if available):
+> "[Quote from champion or user about value experienced]"
+
+*Talking point: "This is the section I most want your input on. Are these the outcomes that matter to your business? Are there other ways you're measuring success that we should be tracking?"*
+
+---
+
+## Slide 6: Support Summary (3 min)
+
+| Metric | This quarter | Last quarter | Trend |
+|---|---|---|---|
+| Tickets raised | [N] | [N] | ↑ / → / ↓ |
+| Average resolution time | [X hrs] | [X hrs] | ↑ / → / ↓ |
+| P1 / critical issues | [N] | [N] | ↑ / → / ↓ |
+| CSAT score | [X/10] | [X/10] | ↑ / → / ↓ |
+
+**Notable issues this quarter:**
+- [Any escalation or major ticket — brief summary and resolution]
+
+**What we're doing differently:**
+- [Any process change or improvement based on support patterns]
+
+---
+
+## Slide 7: What's Coming — Roadmap Preview (5 min)
+
+Focus only on what's relevant to this customer's goals. Do not dump the full roadmap.
+
+| Feature / Improvement | Expected | Why it matters to [Account Name] |
+|---|---|---|
+| [Feature 1] | [Q+1] | [Direct link to their goal or pain point] |
+| [Feature 2] | [Q+1 / Q+2] | [Direct link] |
+| [Feature 3] | [H2] | [Direct link] |
+
+*Talking point: "I've filtered the roadmap to what I think matters most to your team. I'd love your reaction — are these the right priorities from your perspective?"*
+
+---
+
+## Slide 8: Next Quarter — Your Goals (10 min)
+
+**Customer input section — facilitate, don't present.**
+
+Prompt questions:
+- "What does success look like for your team in [next quarter]?"
+- "What's the biggest challenge you're trying to solve in the next 90 days?"
+- "Is there anything about the way you're using [product] you want to change?"
+
+**Capture live:**
+
+| Goal for next quarter | Owner (customer) | How we'll support it | How we'll measure it |
+|---|---|---|---|
+| [Goal 1] | [Name] | [CSM / product action] | [Metric] |
+| [Goal 2] | [Name] | [CSM / product action] | [Metric] |
+
+---
+
+## Slide 9: Mutual Commitments (5 min)
+
+**[Your company] commits to:**
+1. [Specific action — owner — by when]
+2. [Specific action — owner — by when]
+3. [Specific action — owner — by when]
+
+**[Account Name] commits to:**
+1. [Specific action — owner — by when]
+2. [Specific action — owner — by when]
+
+**Next touchpoint:** [Date of next check-in or mid-quarter review]
+
+---
+
+## Slide 10: Thank You + Open Q&A (5 min)
+
+- Recap the one headline from today: [The single most important thing you want them to remember]
+- Confirm actions are captured and shared after the call
+- Ask: "Is there anything we didn't cover today that you wanted to raise?"
+
+---
+
+## Preparation Checklist
+
+- [ ] Usage data pulled and QoQ comparison calculated
+- [ ] Last QBR goals reviewed — status confirmed before the meeting
+- [ ] Business outcomes framed in customer language (not product language)
+- [ ] Roadmap filtered to this account's specific use cases
+- [ ] Customer's goals for next quarter researched or pre-confirmed with champion
+- [ ] Executive sponsor briefed on any sensitive topics before the call
+- [ ] Actions from previous QBR reviewed — any outstanding items addressed
+
+## Quality Checks
+
+- [ ] Every slide has a talking point, not just a title
+- [ ] Value slide leads with business outcomes, not product activity
+- [ ] Roadmap preview links each item to a customer goal
+- [ ] Mutual commitments section has real owners on both sides
+- [ ] Customer has at least 20 minutes of airtime in the agenda
@@ -0,0 +1,231 @@
+---
+name: slo-error-budget
+description: "Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence."
+---
+
+# SLO and Error Budget Skill
+
+Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
+
+A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
+
+## Required Inputs
+
+Ask for these if not already provided:
+- **Service name** and brief description of what it does
+- **Primary users** — who depends on this service and how
+- **User-facing interactions** to protect — e.g. API calls, page loads, transactions
+- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available)
+- **Existing on-call setup** — who responds to alerts?
+- **Deployment frequency** — how often does the team ship?
+- **Any existing SLAs** with customers — these constrain SLO targets
+
+## Key Definitions
+
+Always establish these before writing the SLO:
+
+| Term | Definition |
+|---|---|
+| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
+| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
+| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
+| **Error budget** | The allowed unreliability: 100% minus the SLO target, spent on planned and unplanned failures |
+| **Burn rate** | How fast the error budget is being consumed |
+
+---
+
+## Output Format
+
+---
+
+# SLO Document: [Service Name]
+
+**Service:** [Name] | **Team:** [Team name]
+**Owner:** [Name / role] | **Approved by:** [Name]
+**Effective date:** [Date] | **Review date:** [Date + 3 months]
+**Version:** [1.0]
+
+---
+
+## Why This SLO Exists
+
+[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
+
+---
+
+## Service Overview
+
+**What this service does:** [One sentence]
+**Who depends on it:** [Internal teams / external customers / both — describe]
+**Critical user journeys protected by this SLO:**
+1. [Journey 1 — e.g. "User completes a payment"]
+2. [Journey 2]
+3. [Journey 3]
+
+---
+
+## SLIs — What We Measure
+
+Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
+
+### SLI 1: [Name — e.g. Request Success Rate]
+
+| Field | Detail |
+|---|---|
+| **What it measures** | [e.g. "% of API requests that return a non-5xx response"] |
+| **Good event definition** | [e.g. "HTTP response with a non-5xx status (2xx, 3xx, or 4xx)"] |
+| **Bad event definition** | [e.g. "HTTP response with status 5xx"] |
+| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
+| **Measured over** | Rolling 28-day window |
+| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
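+
+A minimal sketch of how the good and bad event definitions in this table could be turned into an SLI value, assuming each request is available as a record with a path, status code, and latency. The `Request` type, the `EXCLUDED_PATHS` set, and the `/healthz` path are hypothetical names used for illustration; they are not part of any monitoring tool's API.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Request:
+    path: str
+    status: int
+    latency_ms: float
+
+
+# Hypothetical exclusion list: health checks do not count towards the SLI.
+EXCLUDED_PATHS = {"/healthz"}
+LATENCY_THRESHOLD_MS = 500
+
+
+def is_good(req: Request) -> bool:
+    """Good event: a non-5xx response that completed within the threshold."""
+    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS
+
+
+def success_rate_sli(requests: list[Request]) -> float:
+    """Return the percentage of eligible requests that were good events."""
+    eligible = [r for r in requests if r.path not in EXCLUDED_PATHS]
+    if not eligible:
+        return 100.0  # no eligible traffic, so nothing failed
+    good = sum(1 for r in eligible if is_good(r))
+    return 100.0 * good / len(eligible)
+
+
+sample = [
+    Request("/checkout", 200, 120),
+    Request("/checkout", 404, 80),   # client error still counts as good
+    Request("/checkout", 503, 95),   # bad: server error
+    Request("/checkout", 200, 900),  # bad: too slow
+    Request("/healthz", 500, 5),     # excluded, so it is not counted at all
+    Request("/cart", 201, 300),
+]
+print(f"{success_rate_sli(sample):.1f}%")  # 60.0%
+```
+
+Keeping the exclusion list next to the event definition makes it visible how much traffic is being left out of the number.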
+
+### SLI 2: [Name — e.g. Latency]
+
+| Field | Detail |
+|---|---|
+| **What it measures** | [e.g. "% of /checkout requests that complete in ≤500ms (a 99% target corresponds to P99 ≤500ms)"] |
+| **Good event definition** | [e.g. "Request completes in ≤500ms"] |
+| **Bad event definition** | [e.g. "Request takes >500ms"] |
+| **Measurement source** | [Source] |
+| **Measured over** | Rolling 28-day window |
+| **Exclusions** | [Any exclusions] |
+
+### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]
+
+[Same structure]
+
+---
+
+## SLO Targets
+
+| SLI | Target | Window | Error Budget |
+|---|---|---|---|
+| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
+| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
+| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
+
+**How targets were set:**
+- Historical baseline (last 90 days): [X]%
+- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
+- Rationale: [1–2 sentences]
+
+**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]
+
+---
+
+## Error Budget Calculation
+
+**For SLI 1 ([Name]), at [X]% target:**
+
+```
+Error budget = (100% - SLO target) × measurement window
+             = (100% - [X]%) × 28 days × 24 hours × 60 minutes
+             = [100 - X]% × 40,320 minutes
+             = [N] minutes of allowed failure per 28-day window
+```
+
+**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
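+
+As a worked instance of the formula above, the following sketch computes the budget in minutes for a few example targets. The function name `error_budget_minutes` and the example targets are illustrative assumptions, not values taken from any particular service.
+
+```python
+def error_budget_minutes(slo_target_pct: float, window_days: int = 28) -> float:
+    """Minutes of allowed failure in the window for a given SLO target."""
+    if not 0 < slo_target_pct <= 100:
+        raise ValueError("SLO target must be in (0, 100]")
+    window_minutes = window_days * 24 * 60  # 40,320 minutes for 28 days
+    return (100.0 - slo_target_pct) / 100.0 * window_minutes
+
+
+for target in (99.0, 99.5, 99.9):
+    print(f"{target}% -> {error_budget_minutes(target):.1f} minutes per 28 days")
+```
+
+At a 99.5% target this gives 201.6 minutes per 28-day window; at 99.9% it drops to roughly 40 minutes, which shows how quickly each extra "nine" shrinks the budget.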
+
+---
+
+## Burn Rate Alerts
+
+Burn rate = how fast the error budget is being consumed relative to the budget window.
+A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.
+
+| Alert | Burn rate | Window | Severity | Response |
+|---|---|---|---|---|
+| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately; at this rate the 28-day budget is exhausted in <2 days |
+| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days |
+| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting |
+| Info | >1× | 28 days | Info | Log only; budget on track to exhaust before the end of the window |
+
+**Alert implementation:** [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
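+
+The relationship between burn rate, time to exhaustion, and budget consumed can be sketched as follows. This is an illustration under the 28-day window used in this document; the function names are hypothetical and the snippet is not an alert configuration for any specific monitoring tool.
+
+```python
+WINDOW_HOURS = 28 * 24  # 672 hours in the 28-day SLO window
+
+
+def burn_rate(observed_error_pct: float, slo_target_pct: float) -> float:
+    """Observed error rate divided by the error rate the SLO allows."""
+    allowed_error_pct = 100.0 - slo_target_pct
+    if allowed_error_pct <= 0:
+        raise ValueError("a 100% target leaves no error budget to burn")
+    return observed_error_pct / allowed_error_pct
+
+
+def hours_to_exhaustion(rate: float) -> float:
+    """Hours until a full budget is gone if this burn rate is sustained."""
+    return WINDOW_HOURS / rate
+
+
+def budget_consumed_pct(rate: float, alert_window_hours: float) -> float:
+    """Share of the 28-day budget used if the rate holds for the alert window."""
+    return 100.0 * rate * alert_window_hours / WINDOW_HOURS
+
+
+# (burn rate threshold, alert window in hours) taken from the table above
+for rate, window_h in [(14, 1), (6, 6), (3, 72), (1, 672)]:
+    days = hours_to_exhaustion(rate) / 24
+    used = budget_consumed_pct(rate, window_h)
+    print(f"{rate}x over {window_h}h: exhausts in {days:.1f} days, uses {used:.1f}% of budget")
+```
+
+For example, a 7% error rate against a 99.5% target is a burn rate of 14. Sustained, that exhausts a full 28-day budget in 2 days and uses about 2.1% of it in a single hour.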
+
+---
+
+## Error Budget Policy
+
+This policy defines what to do with the error budget — both when it's healthy and when it's burning.
+
+### When budget is healthy (>50% remaining)
+
+- Feature development and deployments proceed at normal pace
+- The team may take on riskier experiments
+- Reliability improvements are scheduled but not urgent
+
+### When budget is at risk (25–50% remaining)
+
+- Deployment frequency reduced — team ships only well-tested changes
+- One reliability improvement added to current sprint
+- Weekly error budget review added to team standup
+
+### When budget is nearly exhausted (<25% remaining)
+
+- Feature work paused in favour of reliability improvements
+- No new deployments without explicit on-call approval
+- Daily review of error budget burn rate
+- CSM / support notified to manage customer expectations
+
+### When budget is exhausted (0% remaining — SLO breached)
+
+- All feature work stops
+- On-call engineer and engineering manager notified immediately
+- Post-incident review (PIR) required within 5 business days
+- SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed
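+
+The four tiers above can be expressed as a small lookup so that dashboards or deploy gates apply the same thresholds. This is a sketch only: the function names and the tier labels are assumptions made for illustration, and the thresholds simply mirror the headings in this policy.
+
+```python
+def remaining_budget_pct(budget_minutes: float, consumed_minutes: float) -> float:
+    """Percentage of the window's error budget that is still unspent."""
+    return 100.0 * (budget_minutes - consumed_minutes) / budget_minutes
+
+
+def budget_policy_tier(remaining_pct: float) -> str:
+    """Map remaining error budget (% of the window's budget) to a policy tier."""
+    if remaining_pct <= 0:
+        return "exhausted"         # SLO breached: all feature work stops
+    if remaining_pct < 25:
+        return "nearly exhausted"  # feature work paused, deploys need approval
+    if remaining_pct <= 50:
+        return "at risk"           # reduced deployment frequency
+    return "healthy"               # normal pace
+
+
+# Example: a 201.6-minute budget with 160 minutes already consumed.
+remaining = remaining_budget_pct(201.6, 160.0)
+print(f"{remaining:.1f}% remaining -> {budget_policy_tier(remaining)}")
+```
+
+Encoding the boundaries once (25% and 50% here) avoids the team debating which tier applies during an incident.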
+
+---
+
+## Dashboard and Reporting
+
+**SLO dashboard:** [Link to Datadog / Grafana / etc. dashboard]
+
+**Metrics exposed:**
+- Current SLO compliance (rolling 28-day)
+- Error budget remaining (% and minutes)
+- Burn rate (current and trend)
+- Incident count and MTTR this window
+
+**Reporting cadence:**
+
+| Audience | Frequency | Format |
+|---|---|---|
+| Engineering team | Weekly | Slack summary — #[service]-slo |
+| Engineering manager | Monthly | SLO review meeting |
+| Stakeholders / customers | Quarterly | SLO compliance summary |
+
+---
+
+## Exclusions and Edge Cases
+
+**Planned maintenance:** Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
+
+**Dependency failures:** If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
+
+**Force majeure:** [Policy for cloud provider outages, major infrastructure events]
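+
+One way to apply the planned-maintenance rule when counting downtime is to subtract any overlap with announced windows before charging an outage to the budget. The sketch below assumes outages and maintenance windows are available as start/end timestamps and that maintenance windows do not overlap each other; the function name and the sample dates are hypothetical.
+
+```python
+from datetime import datetime
+
+
+def minutes_outside_windows(
+    start: datetime,
+    end: datetime,
+    windows: list[tuple[datetime, datetime]],
+) -> float:
+    """Minutes of the outage [start, end) not covered by maintenance windows.
+
+    Assumes the maintenance windows do not overlap each other.
+    """
+    total_seconds = (end - start).total_seconds()
+    for w_start, w_end in windows:
+        overlap_start = max(start, w_start)
+        overlap_end = min(end, w_end)
+        if overlap_end > overlap_start:
+            total_seconds -= (overlap_end - overlap_start).total_seconds()
+    return total_seconds / 60
+
+
+maintenance = [(datetime(2026, 5, 10, 2, 0), datetime(2026, 5, 10, 3, 0))]
+outage_start = datetime(2026, 5, 10, 2, 30)
+outage_end = datetime(2026, 5, 10, 3, 20)
+
+# 50-minute outage, 30 minutes of which fall inside the announced window.
+print(minutes_outside_windows(outage_start, outage_end, maintenance))  # 20.0
+```
+
+Only the 20 minutes after the window closes would count against the error budget in this example; dependency failures are deliberately not excluded, in line with the policy above.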
+
+---
+
+## SLO Review Cadence
+
+| Review | When | Who | Output |
+|---|---|---|---|
+| Error budget review | Weekly | Team | Budget health check: apply the matching error budget policy tier if the budget is burning fast |
+| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
+| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? |
+
+**When to change the SLO target:**
+- Historical baseline has improved significantly and target no longer reflects real reliability
+- User feedback indicates the target is misaligned with what users actually experience
+- The SLO is being gamed (metric is healthy but users are unhappy)
+
+---
+
+## Quality Checks
+
+- [ ] SLIs are user-facing — they measure what users experience, not internal system metrics
+- [ ] Good and bad events are precisely defined — no ambiguity about what counts
+- [ ] Targets are based on historical data, not aspirational round numbers
+- [ ] Error budget policy has clear triggers and clear actions — not "discuss as a team"
+- [ ] Burn rate alerts have different windows to catch both fast burns and slow burns
+- [ ] Exclusions are documented so they don't silently inflate the SLO number