Compare commits
18 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | bf65c16222 | | |
| | beecb1cb31 | | |
| | 8caa9c29b9 | | |
| | af29d30631 | | |
| | bfdbec17a3 | | |
| | 48fd4dd6ad | | |
| | ad92de9637 | | |
| | 450dbde74d | | |
| | af23bcc170 | | |
| | 59c4510055 | | |
| | 9274b3d378 | | |
| | a0ed6e52a5 | | |
| | 84eefcabd6 | | |
| | 7df025ffaa | | |
| | e5377ca61a | | |
| | bd38a36468 | | |
| | c1d47fa1ae | | |
| | 48be8596d9 | | |
@@ -1,8 +1,8 @@
{
"$schema": "https://anthropic.com/claude-code/marketplace.schema.json",
"name": "pm-claude-skills",
"version": "7.0.0",
"description": "106 Claude Skills across 15 professions — product management, engineering, legal, finance, HR, sales, design, Figma, operations, research, and more. Includes 6 new engineering skills: debugging, PR descriptions, system design, changelogs, test strategy, and runbooks.",
"version": "11.0.0",
"description": "PM stands for Professional, not just Product Management. 135 Claude Skills + 4 agent templates across 23 bundles covering 16 professions — engineering, customer success, legal, finance, HR, sales, design, Figma, marketing, and more. Built by a PM, used by everyone. Building blocks for the Anthropic agent template architecture.",
"owner": {
"name": "Mohit Aggarwal",
"email": "mohit15856@gmail.com"
@@ -82,12 +82,20 @@
},
{
"name": "pm-engineering",
"description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer. 10 structured skills for engineering teams, SREs, and technical PMs.",
"version": "2.0.0",
"description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer, CI/CD Playbook, SLO & Error Budget, Developer Onboarding Doc, On-Call Runbook, Security Threat Model, Performance Budget, Database Schema Design, Database Migration Plan, Technical Debt Register, RFC Writer, Capacity Planning, Load Testing Plan, Disaster Recovery Plan, Feature Flag Guide, Dependency Audit, Service Catalog Entry, Monitoring Setup Guide, Local Dev Setup, API Versioning Strategy, Infra-as-Code Review, Engineering Weekly Report, Tech Radar, Sprint Velocity Analysis, Microservices Decomposition, Engineering Hiring Rubric. 35 structured skills for engineering teams, SREs, and technical PMs.",
"version": "4.0.0",
"category": "productivity",
"source": "./plugins/pm-engineering",
"homepage": "https://github.com/mohitagw15856/pm-claude-skills"
},
{
"name": "pm-cs",
"description": "Customer Success skills: Customer Health Scorecard, QBR Deck, Escalation Brief, Churn Analysis. Score account health with a weighted RAG framework, build structured QBR decks with value narratives, write crisp escalation briefs for at-risk accounts, and analyse churn by category and segment with prioritised interventions.",
"version": "1.0.0",
"category": "productivity",
"source": "./plugins/pm-cs",
"homepage": "https://github.com/mohitagw15856/pm-claude-skills"
},
{
"name": "pm-data",
"description": "Data & analytics skills: Metrics Framework, SQL Query Explainer, Dashboard Brief, Chart Data Extractor. Build North Star metric trees, explain SQL, spec dashboards, and digitise chart images.",
@@ -1,17 +1,18 @@
# 🧠 Claude Skills Library — 106 Skills for Every Profession
# 🧠 PM Claude Skills — 135 Skills for Every Profession
> **Save 8–10 hours per week across 15 professions. Install in 2 minutes. Now with 106 skills including 6 new engineering skills.**
> **PM stands for Professional, not just Product Management.**
> 135 Claude Skills + 4 agent templates across 16 professions. Built by a PM, used by everyone.
A community-built library of Claude Skills covering product management, engineering, marketing, data, design, Figma, leadership, legal, finance, HR, sales, operations, research, education, and more. Each skill is a structured SKILL.md file that teaches Claude how to produce professional-grade outputs for your specific workflows.
**🆕 Latest release (v7.0.0):** 6 new engineering skills added — Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, and Runbook Writer. The `pm-engineering` bundle now has 10 skills.
A community-built library of Claude Skills for professionals across every field — product management, engineering, customer success, marketing, design, legal, finance, HR, sales, operations, research, and more. Each skill is a structured SKILL.md file that teaches Claude how to produce professional-grade outputs for your specific workflows.
**🆕 Latest release (v11.0.0):** The full 500-star milestone is now complete — 21 remaining engineering skills shipped. pm-engineering is now the largest bundle in the library with 35 skills. 135 skills across 16 professions.
---
## 🚀 Quick Install (2 minutes)
@@ -20,20 +21,32 @@ In Claude Code, run:
/plugin marketplace add mohitagw15856/pm-claude-skills
Or install by profession:
claude plugin install pm-essentials@pm-claude-skills # Core PM + Word tracked changes
claude plugin install pm-delivery@pm-claude-skills # Delivery + PowerPoint auditor
claude plugin install pm-engineering@pm-claude-skills # Engineering + DevOps (10 skills) 🆕
claude plugin install pm-engineering@pm-claude-skills # Engineering (35 skills) 🆕
claude plugin install pm-cs@pm-claude-skills # Customer Success 🆕
claude plugin install pm-data@pm-claude-skills # Data + chart data extractor
claude plugin install pm-legal@pm-claude-skills # Legal
claude plugin install pm-finance@pm-claude-skills # Finance
claude plugin install pm-hr@pm-claude-skills # HR
claude plugin install pm-sales@pm-claude-skills # Sales
claude plugin install pm-operations@pm-claude-skills # Operations
claude plugin install pm-research@pm-claude-skills # Research & Healthcare
claude plugin install pm-cross@pm-claude-skills # Cross-profession
claude plugin install pm-figma@pm-claude-skills # Figma
@@ -43,7 +56,6 @@ git clone https://github.com/mohitagw15856/pm-claude-skills.git ~/pm-claude-skil
mkdir -p ~/.claude/skills
ln -s ~/pm-claude-skills/skills/* ~/.claude/skills/
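Re-running the `ln -s` glob after pulling updates will typically report errors for links that already exist. The following is a minimal Python sketch of an idempotent variant, assuming each skill is a directory directly under `skills/`. The function name `link_skills` is hypothetical and not part of this repository.

```python
from pathlib import Path


def link_skills(repo_skills: Path, target: Path) -> list[str]:
    """Symlink every skill folder in repo_skills into target; return the new link names."""
    target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    created = []
    for skill_dir in sorted(repo_skills.iterdir()):
        if not skill_dir.is_dir():
            continue  # only skill folders are linked; stray files are ignored
        link = target / skill_dir.name
        if link.exists() or link.is_symlink():
            continue  # already installed (or a dangling link): leave it untouched
        link.symlink_to(skill_dir.resolve(), target_is_directory=True)
        created.append(link.name)
    return created
```

Calling `link_skills(Path.home() / "pm-claude-skills" / "skills", Path.home() / ".claude" / "skills")` mirrors the two shell commands above, and a second call returns an empty list instead of failing.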
---
## 🎬 See It in Action
@@ -58,9 +70,111 @@ ln -s ~/pm-claude-skills/skills/* ~/.claude/skills/
---
## 🆕 What's New in v7.0.0 — Engineering Skills Expansion
## 🤖 Building Blocks for Agent Templates
On May 5, 2026, Anthropic [released their first agent templates](https://www.anthropic.com/news/finance-agents) — pre-packaged Claude agents that combine **skills, connectors, and subagents** into ready-to-run workflows for financial services.
This library is the largest open-source collection of professional skills available — covering 15 professions beyond financial services. **The 106 skills here are the building blocks for agent templates outside of finance.**
### What is an agent template?
An agent template packages three things into one runnable workflow:
| Component | What it is | Example from this library |
|---|---|---|
| **Skills** | Markdown files that teach Claude how to produce structured professional outputs | `sprint-planning`, `contract-review`, `investor-update` |
| **Connectors** | Governed access to your team's data sources | Linear, Jira, Slack, Google Drive, Notion |
| **Subagents** | Focused Claude models for sub-tasks within the larger workflow | Capacity analyst, risk scorer, comparables selector |
A skill alone gives Claude a structured output format. An agent template gives Claude a complete workflow — pulling data, running specialised analysis, producing the output, and routing it where it needs to go.
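To make the three-part structure concrete, here is an illustrative sketch that models a template as plain data. The `AgentTemplate` class and its `missing_parts` helper are hypothetical names invented for this example; they are not Anthropic's template format or code from this repository.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTemplate:
    """Illustrative model only: one workflow made of skills, connectors, and subagents."""
    name: str
    skills: list[str] = field(default_factory=list)      # SKILL.md folders that shape the output
    connectors: list[str] = field(default_factory=list)  # data sources Claude may reach
    subagents: list[str] = field(default_factory=list)   # focused helpers for sub-tasks

    def missing_parts(self) -> list[str]:
        """Name the component types that are still empty."""
        parts = {"skills": self.skills, "connectors": self.connectors, "subagents": self.subagents}
        return [kind for kind, items in parts.items() if not items]


# Values copied from the PM Sprint Agent row in this README.
sprint_agent = AgentTemplate(
    name="PM Sprint Agent",
    skills=["sprint-planning", "sprint-brief", "retro", "project-status-report"],
    connectors=["Linear or Jira", "Slack"],
    subagents=["Capacity analyst", "risk scorer"],
)
```

A template with only skills filled in would report `connectors` and `subagents` as missing, which is the gap between a skill and a full workflow described above.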
### How to use this library to build your own agent template
Pick a recurring workflow on your team. Identify which existing skills cover the structured outputs that workflow needs. Add the connectors that let Claude reach the data. Add subagents for the analytical sub-tasks. That's the template.
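The "identify which existing skills cover the workflow" step can be checked mechanically. This is a rough sketch, assuming the `skills/<name>/SKILL.md` layout implied by the Folder column in the skill tables; the `missing_skills` function is a hypothetical helper, not something shipped with the library.

```python
from pathlib import Path


def missing_skills(required: list[str], skills_root: Path) -> list[str]:
    """Return the required skill names that have no SKILL.md under skills_root."""
    return [
        name
        for name in required
        if not (skills_root / name / "SKILL.md").is_file()  # assumed layout: skills/<name>/SKILL.md
    ]
```

An empty result means every structured output the workflow needs is already covered; anything returned is a skill you would have to write before wiring up connectors and subagents.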
Examples of agent templates this library supports:
| Template | Skills used | Connectors needed | Subagents |
|---|---|---|---|
| **PM Sprint Agent** | sprint-planning, sprint-brief, retro, project-status-report | Linear or Jira, Slack | Capacity analyst, risk scorer |
| **Legal Contract Review Agent** | contract-review, nda-analyser, compliance-checklist | Google Drive or SharePoint | Clause-by-clause risk scorer |
| **PM Discovery Agent** | discovery-interview-guide, user-interview-synthesis, assumption-mapper | Granola or Otter, Notion | Theme synthesiser |
| **Sales Pursuit Agent** | sales-battlecard, discovery-call-prep, proposal-writer, account-plan | Salesforce or HubSpot, Gong | Competitive intel analyst |
| **HR Onboarding Agent** | onboarding-plan, job-description-writer, change-management-plan | Workday or BambooHR, Slack | First-week scheduler |
| **Finance Board Pack Agent** | investor-update, board-deck-narrative, financial-model-narrative | NetSuite or Xero, Google Drive | KPI variance analyst |
| **Marketing Launch Agent** | go-to-market, content-calendar, email-campaign, media-pitch | HubSpot, Notion | Channel strategist |
### Available agent templates
The pm-claude-skills library now includes four working agent templates, each built from existing skills in this library combined with subagents and connectors. All four follow the architecture Anthropic introduced for [financial services agent templates](https://www.anthropic.com/news/finance-agents) on May 5, 2026.
| Template | What it does | Skills used | Connectors | Time saved |
|---|---|---|---|---|
| **[PM Sprint Agent](./templates/pm-sprint-agent/)** | End-to-end sprint planning — pulls backlog, calculates capacity, drafts plan, scores risks | sprint-planning, sprint-brief | Linear, Jira | 90 min → 90 sec |
| **[PM Discovery Agent](./templates/pm-discovery-agent/)** | Customer discovery synthesis — reads interview notes, finds themes, scores assumption confidence | user-interview-synthesis, job-story-mapper | Notion, Google Drive | 1 day → 5 min |
| **[PM Stakeholder Comms Agent](./templates/pm-stakeholder-comms-agent/)** | Audience-tailored stakeholder updates — exec, investor, cross-functional, or board | executive-update, investor-update, stakeholder-update, board-deck-narrative | Linear, Jira, Google Drive | 90 min → 1 min |
| **[PM Launch Agent](./templates/pm-launch-agent/)** | End-to-end launch coordination — content for every channel, calendar, metrics, checklist | go-to-market, content-calendar, media-pitch, email-campaign, launch-checklist | Notion (optional) | 4-6 hours → 3 min |
Each template includes:
- Working orchestration script
- Two or more focused subagents
- Connector configurations with documented setup
- Working examples (input + output)
- Smoke test for verifying installations
### How to install a template
All templates are part of the main library — installing the marketplace gives you all four.
/plugin marketplace add mohitagw15856/pm-claude-skills
Then navigate to the template you want and follow its README:
cd templates/pm-sprint-agent # or pm-discovery-agent, etc.
cat README.md # full setup instructions
### Building your own template
If you want to build a template for a workflow not covered above — Legal Contract Review, Sales Pursuit, Finance Board Pack, HR Onboarding, Marketing Campaign — see the [template contribution guide](./templates/CONTRIBUTING.md).
The pattern is consistent: pick a multi-step workflow, identify which existing skills cover the structured outputs, add connectors for data access, and define subagents for specialised analysis. The four templates above are reference implementations.
The PM Sprint Agent combines four skills, two connectors, and two subagents into a single workflow that handles end-to-end sprint planning.
Documentation, working orchestration script, and example outputs are included in the template folder.
More templates will follow. If you want to contribute one, see the [template contribution guide](./templates/CONTRIBUTING.md).
---
## 🆕 What's New in v10.0.0
**Two star milestones unlocked — 8 new skills shipped:**
**Customer Success bundle (250 ⭐ milestone):**
| Skill | Bundle | What It Does |
|---|---|---|
| **Customer Health Scorecard** 🆕 | pm-cs | Weighted health score across adoption, engagement, outcomes, support, and commercial — with RAG status and renewal forecast |
| **QBR Deck** 🆕 | pm-cs | Slide-by-slide quarterly business review structure with talking points, value narrative, and mutual commitments |
| **Escalation Brief** 🆕 | pm-cs | Structured escalation brief for at-risk accounts — root cause, business impact, resolution plan, and decision required |
| **Churn Analysis** 🆕 | pm-cs | Churn rate breakdown by category and segment, early warning signals, and prioritised interventions |
**Engineering expansion (500 ⭐ milestone):**
| Skill | Bundle | What It Does |
|---|---|---|
| **CI/CD Playbook** 🆕 | pm-engineering | Complete pipeline playbook covering every stage, rollback procedures, secrets management, and on-call responsibilities |
| **SLO & Error Budget** 🆕 | pm-engineering | SLI definitions, SLO targets, error budget calculation, burn rate alerts, and error budget policy |
| **Developer Onboarding Doc** 🆕 | pm-engineering | Everything a new engineer needs in their first week — architecture, local setup, testing, deployment, and key contacts |
| **On-Call Runbook** 🆕 | pm-engineering | Per-alert response procedures, escalation matrix, diagnostic cheat sheet, and handoff template |
The library now includes **114 skills** across **16 professions** + 4 working agent templates.
**6 new engineering skills added to `pm-engineering`:**
| Skill | Bundle | What It Does |
|---|---|---|
@@ -113,10 +227,12 @@ This repo was built alongside a published article series. Read the full story:
| Part 12 | I Built the Same Skills Library for ChatGPT — Here's What's Different | [Read →](https://medium.com/product-powerhouse/i-built-the-same-skills-library-for-chatgpt-heres-what-s-different-a9305f9c20b9) |
| Part 13 | I Re-Tested My 90 Claude Skills on Opus 4.7 — Here's What Got Better | [Read →](https://medium.com/all-about-claude/i-re-tested-my-90-claude-skills-on-opus-4-7-heres-what-actually-got-better-dd4b9369329e) |
| Part 14 | I Rebuilt All 93 Skills and Added 7 More: What 100 Skills Taught Me About What Makes a Great Skill | [Read →](https://medium.com/product-powerhouse/a-pull-request-made-me-rebuild-all-93-of-my-claude-skills-then-i-added-7-more-16d5fe3e7f85) |
| Part 15 | I’m a Product Manager. I Just Shipped 6 Engineering Skills to My Open-Source Claude Library. | [Read →](https://medium.com/product-powerhouse/im-a-product-manager-i-just-shipped-6-engineering-skills-to-my-open-source-claude-library-8745aaa2ecf9) |
| Part 16 | Anthropic Just Released 10 Agent Templates. Here’s the First One I Built Using My 106 Skills. | [Read →](https://medium.com/product-powerhouse/anthropic-just-released-10-agent-templates-heres-the-first-one-i-built-using-my-106-skills-a6708f9bd3ea) |
---
## 🗂️ All 106 Skills
## 🗂️ All 135 Skills
### 🛠️ Product Management (Skills 1–34)
**Bundles:** `pm-essentials` · `pm-discovery` · `pm-planning` · `pm-delivery` · `pm-analytics` · `pm-strategy` · `pm-advanced` · `pm-rituals`
@@ -151,7 +267,7 @@ This repo was built alongside a published article series. Read the full story:
---
### 👩‍💻 Engineering & Tech (Skills 41–50)
### 👩‍💻 Engineering & Tech (Skills 41–75)
**Bundle:** `pm-engineering`
| # | Skill | Folder | What It Does |
@@ -160,178 +276,249 @@ This repo was built alongside a published article series. Read the full story:
| 42 | **Incident Postmortem** | `skills/incident-postmortem/` | Blameless postmortems with timeline, RCA, impact, and action items |
| 43 | **API Docs Writer** | `skills/api-docs-writer/` | Developer-facing API docs: endpoints, parameters, response schemas, code examples |
| 44 | **Architecture Decision Record** | `skills/architecture-decision-record/` | ADRs with context, options considered, decision, consequences, and risks |
| 45 | **Debugging Log Analyser** 🆕 | `skills/debugging-log-analyser/` | Parse stack traces and error logs into a structured root cause diagnosis with a specific fix |
| 46 | **PR Description Writer** 🆕 | `skills/pr-description-writer/` | Write reviewer-friendly PR descriptions from a diff, commit list, or change summary |
| 47 | **System Design Interview** 🆕 | `skills/system-design-interview/` | Structure complete system design answers with capacity estimates, component deep-dives, and trade-offs |
| 48 | **Changelog Generator** 🆕 | `skills/changelog-generator/` | Convert git commits into a polished, user-facing changelog following Keep a Changelog format |
| 49 | **Test Strategy Doc** 🆕 | `skills/test-strategy-doc/` | Write a complete test strategy with risk assessment, test types, coverage targets, and P0/P1 test cases |
| 50 | **Runbook Writer** 🆕 | `skills/runbook-writer/` | Write operational runbooks for deployments, incidents, and maintenance with exact commands and rollback steps |
| 45 | **Debugging Log Analyser** | `skills/debugging-log-analyser/` | Parse stack traces and error logs into a structured root cause diagnosis with a specific fix |
| 46 | **PR Description Writer** | `skills/pr-description-writer/` | Write reviewer-friendly PR descriptions from a diff, commit list, or change summary |
| 47 | **System Design Interview** | `skills/system-design-interview/` | Structure complete system design answers with capacity estimates, component deep-dives, and trade-offs |
| 48 | **Changelog Generator** | `skills/changelog-generator/` | Convert git commits into a polished, user-facing changelog following Keep a Changelog format |
| 49 | **Test Strategy Doc** | `skills/test-strategy-doc/` | Write a complete test strategy with risk assessment, test types, coverage targets, and P0/P1 test cases |
| 50 | **Runbook Writer** | `skills/runbook-writer/` | Write operational runbooks for deployments, incidents, and maintenance with exact commands and rollback steps |
| 51 | **CI/CD Playbook** | `skills/cicd-playbook/` | Complete pipeline playbook covering every stage, rollback procedures, secrets management, and on-call responsibilities |
| 52 | **SLO & Error Budget** | `skills/slo-error-budget/` | SLI definitions, SLO targets, error budget calculation, burn rate alerts, and error budget policy |
| 53 | **Developer Onboarding Doc** | `skills/developer-onboarding-doc/` | Everything a new engineer needs in their first week — architecture, local setup, testing, deployment, and key contacts |
| 54 | **On-Call Runbook** | `skills/oncall-runbook/` | Per-alert response procedures, escalation matrix, diagnostic cheat sheet, and handoff template |
| 55 | **Security Threat Model** 🆕 | `skills/security-threat-model/` | STRIDE-based threat model with asset register, trust boundaries, per-component threat enumeration, risk scores, and mitigations |
| 56 | **Performance Budget** 🆕 | `skills/performance-budget/` | Performance budgets for Core Web Vitals and backend latency SLOs with CI enforcement and breach response policy |
| 57 | **Database Schema Design** 🆕 | `skills/database-schema-design/` | Database schema documentation with ER diagram, DDL definitions, index strategy, and access pattern analysis |
| 58 | **Database Migration Plan** 🆕 | `skills/database-migration-plan/` | Safe zero-downtime migration plan using expand-contract pattern with per-step rollback and data validation queries |
| 59 | **Technical Debt Register** 🆕 | `skills/technical-debt-register/` | Debt inventory with business impact scoring, effort estimates, priority matrix, and quarterly resolution roadmap |
| 60 | **RFC Writer** 🆕 | `skills/rfc-writer/` | Engineering Request for Comments covering problem, proposed solution, alternatives-with-rejection-reasons, and rollout plan |
| 61 | **Capacity Planning** 🆕 | `skills/capacity-planning/` | Traffic forecasts, resource requirements per tier, scaling strategy, cost projections, and infrastructure action roadmap |
| 62 | **Load Testing Plan** 🆕 | `skills/load-testing-plan/` | Load test plan with scenario definitions (baseline/stress/spike/soak), k6/Locust skeleton, thresholds, and CI gates |
| 63 | **Disaster Recovery Plan** 🆕 | `skills/disaster-recovery-plan/` | DR plan with RPO/RTO targets, per-scenario runbooks, backup procedures, game day testing, and communication templates |
| 64 | **Feature Flag Guide** 🆕 | `skills/feature-flag-guide/` | Feature flag lifecycle playbook — taxonomy, rollout strategy, monitoring requirements, cleanup policy, and governance |
| 65 | **Dependency Audit** 🆕 | `skills/dependency-audit/` | Dependency audit for CVE vulnerabilities, license compliance, outdated packages, and 30-day remediation plan |
| 66 | **Service Catalog Entry** 🆕 | `skills/service-catalog-entry/` | Microservice catalog entry with ownership, SLAs, API contract, data classification, and operational runbook links |
| 67 | **Monitoring Setup Guide** 🆕 | `skills/monitoring-setup-guide/` | Four golden signals applied to a service, alert rules spec, structured log schema, tracing setup, and dashboard layout |
| 68 | **Local Dev Setup** 🆕 | `skills/local-dev-setup/` | Local development setup guide — prerequisites, env vars, dependencies, test commands, and 5 common failure fixes |
| 69 | **API Versioning Strategy** 🆕 | `skills/api-versioning-strategy/` | API versioning scheme, lifecycle policy, breaking change classification table, deprecation process, and migration guide template |
| 70 | **Infra-as-Code Review** 🆕 | `skills/infra-as-code-review/` | IaC review for Terraform/CloudFormation/Pulumi — security, naming, state, cost, and drift risk with severity-classified findings |
| 71 | **Engineering Weekly Report** 🆕 | `skills/engineering-weekly-report/` | Weekly engineering status in a consistent format — shipped/in-progress/blocked, metrics, decisions, risks, and next week |
| 72 | **Tech Radar** 🆕 | `skills/tech-radar/` | ThoughtWorks-format technology radar with Adopt/Trial/Assess/Hold quadrants, per-blip rationale, and maintenance process |
| 73 | **Sprint Velocity Analysis** 🆕 | `skills/sprint-velocity-analysis/` | Velocity trend analysis, completion rate patterns, blocker frequency, improvement recommendations, and capacity forecast |
| 74 | **Microservices Decomposition** 🆕 | `skills/microservices-decomposition/` | Domain-driven service boundary design with bounded context map, communication patterns, data ownership, and strangler fig migration plan |
| 75 | **Engineering Hiring Rubric** 🆕 | `skills/engineering-hiring-rubric/` | Technical interview rubric with level expectations, coding scorecard, system design guide, behavioural question bank, and debrief template |
---
### 📊 Data & Analytics (Skills 45–48)
### 🤝 Customer Success (Skills 76–79)
**Bundle:** `pm-cs`
> 250 ⭐ milestone unlocked. Install:
claude plugin install pm-cs@pm-claude-skills
| # | Skill | Folder | What It Does |
|---|---|---|---|
| 76 | **Customer Health Scorecard** | `skills/cs-health-scorecard/` | Weighted health score across adoption, engagement, outcomes, support, and commercial — RAG status and renewal forecast |
| 77 | **QBR Deck** | `skills/qbr-deck/` | Slide-by-slide quarterly business review with talking points, value narrative, and mutual commitments |
| 78 | **Escalation Brief** | `skills/cs-escalation-brief/` | Structured brief for at-risk accounts — root cause, business impact, resolution plan, and decision required |
| 79 | **Churn Analysis** | `skills/churn-analysis/` | Churn breakdown by category and segment, early warning signals, and prioritised interventions |
---
### 📊 Data & Analytics (Skills 80–83)
**Bundle:** `pm-data`
| # | Skill | Folder | What It Does |
|---|---|---|---|
| 45 | **Metrics Framework** | `skills/metrics-framework/` | North Star + metric tree, dashboard tiers, counter-metrics |
| 46 | **SQL Query Explainer** | `skills/sql-query-explainer/` | Explain, optimise, write, and document SQL in plain English |
| 47 | **Dashboard Brief** | `skills/dashboard-brief/` | Complete dashboard spec: KPIs, charts, filters, layout, data requirements |
| 48 | **Chart Data Extractor** | `skills/chart-data-extractor/` | Extract pixel-level data from chart images into structured data tables |
| 80 | **Metrics Framework** | `skills/metrics-framework/` | North Star + metric tree, dashboard tiers, counter-metrics |
| 81 | **SQL Query Explainer** | `skills/sql-query-explainer/` | Explain, optimise, write, and document SQL in plain English |
| 82 | **Dashboard Brief** | `skills/dashboard-brief/` | Complete dashboard spec: KPIs, charts, filters, layout, data requirements |
| 83 | **Chart Data Extractor** | `skills/chart-data-extractor/` | Extract pixel-level data from chart images into structured data tables |
---
### 🧑‍💼 Leadership & People (Skills 49–51)
### 🧑‍💼 Leadership & People (Skills 84–86)
**Bundle:** `pm-people`
| # | Skill | Folder | What It Does |
|---|---|---|---|
| 49 | **Performance Review** | `skills/performance-review/` | Structured reviews from bullet-point notes — self, manager, peer, and upward |
| 50 | **Hiring Rubric** | `skills/hiring-rubric/` | Interview scorecards with competencies, behavioural questions, and panel guide |
| 51 | **Team Offsite Planner** | `skills/team-offsite-planner/` | Full offsite agenda, session facilitation notes, and logistics checklist |
| 84 | **Performance Review** | `skills/performance-review/` | Structured reviews from bullet-point notes — self, manager, peer, and upward |
| 85 | **Hiring Rubric** | `skills/hiring-rubric/` | Interview scorecards with competencies, behavioural questions, and panel guide |
| 86 | **Team Offsite Planner** | `skills/team-offsite-planner/` | Full offsite agenda, session facilitation notes, and logistics checklist |
---
### 🎨 Design & UX (Skills 52–54)
### 🎨 Design & UX (Skills 87–89)
**Bundle:** `pm-design`
| # | Skill | Folder | What It Does |
|---|---|---|---|
| 52 | **UX Research Plan** | `skills/ux-research-plan/` | Research plans with screener, discussion guide, and synthesis framework |
| 53 | **Design Critique** | `skills/design-critique/` | Structured feedback using JTBD, Gestalt principles, and Nielsen's heuristics |
| 54 | **Accessibility Audit** | `skills/accessibility-audit/` | WCAG 2.2 audit with prioritised remediation and quick wins |
| 87 | **UX Research Plan** | `skills/ux-research-plan/` | Research plans with screener, discussion guide, and synthesis framework |
| 88 | **Design Critique** | `skills/design-critique/` | Structured feedback using JTBD, Gestalt principles, and Nielsen's heuristics |
| 89 | **Accessibility Audit** | `skills/accessibility-audit/` | WCAG 2.2 audit with prioritised remediation and quick wins |
---
### 🏢 Business & Strategy (Skills 55–57)
### 🏢 Business & Strategy (Skills 90–92)
**Bundle:** `pm-business`
| # | Skill | Folder | What It Does |
|---|---|---|---|
| 55 | **Investor Update** | `skills/investor-update/` | Monthly/quarterly investor updates: metrics, highlights, challenges, and asks |
| 56 | **Board Deck Narrative** | `skills/board-deck-narrative/` | Slide-by-slide board presentation structure with narrative beats and talking points |
| 57 | **Job Application** | `skills/job-application/` | Tailored CV summary, ATS keyword optimisation, and cover letter for any JD |
| 90 | **Investor Update** | `skills/investor-update/` | Monthly/quarterly investor updates: metrics, highlights, challenges, and asks |
| 91 | **Board Deck Narrative** | `skills/board-deck-narrative/` | Slide-by-slide board presentation structure with narrative beats and talking points |
| 92 | **Job Application** | `skills/job-application/` | Tailored CV summary, ATS keyword optimisation, and cover letter for any JD |
---
### ⚖️ Legal (Skills 58–61)
### ⚖️ Legal (Skills 93–96)
**Bundle:** `pm-legal`
> ⚠️ All legal skills include a disclaimer. Not a substitute for qualified legal advice.
| # | Skill | Folder | What It Does |
|---|---|---|---|
| 58 | **Contract Review** | `skills/contract-review/` | Structured review with key terms, flagged clauses, risk rating, and plain English summary |
| 59 | **NDA Analyser** | `skills/nda-analyser/` | Clause-by-clause NDA analysis with risk flags and negotiation checklist |
| 60 | **Legal Brief** | `skills/legal-brief/` | Legal memos and argument outlines in IRAC format (Issue, Rule, Application, Conclusion) |
| 61 | **Compliance Checklist** | `skills/compliance-checklist/` | GDPR, SOC 2, ISO 27001, FCA, HIPAA compliance checklists with prioritised gap analysis |
| 93 | **Contract Review** | `skills/contract-review/` | Structured review with key terms, flagged clauses, risk rating, and plain English summary |
| 94 | **NDA Analyser** | `skills/nda-analyser/` | Clause-by-clause NDA analysis with risk flags and negotiation checklist |
| 95 | **Legal Brief** | `skills/legal-brief/` | Legal memos and argument outlines in IRAC format (Issue, Rule, Application, Conclusion) |
|
||||
| 96 | **Compliance Checklist** | `skills/compliance-checklist/` | GDPR, SOC 2, ISO 27001, FCA, HIPAA compliance checklists with prioritised gap analysis |
|
||||
|
||||
---
|
||||
|
||||
### 💰 Finance (Skills 62–66)
|
||||
### 💰 Finance (Skills 97–101)
|
||||
**Bundle:** `pm-finance`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 62 | **Financial Model Narrative** | `skills/financial-model-narrative/` | Turns P&L and model outputs into board-ready written narratives |
|
||||
| 63 | **Budget Variance Analysis** | `skills/budget-variance-analysis/` | Variance table with root cause commentary and management summary |
|
||||
| 64 | **Investor Pitch Deck** | `skills/investor-pitch-deck/` | Slide-by-slide pitch deck structure with what each slide must prove |
|
||||
| 65 | **Financial Due Diligence** | `skills/financial-due-diligence/` | DD document request list, analytical questions, and red flags checklist |
|
||||
| 66 | **Tax Planning Checklist** 🆕 | `skills/tax-planning-checklist/` | Year-end tax planning framework across income, pension, CGT, business reliefs, and ISAs |
|
||||
| 97 | **Financial Model Narrative** | `skills/financial-model-narrative/` | Turns P&L and model outputs into board-ready written narratives |
|
||||
| 98 | **Budget Variance Analysis** | `skills/budget-variance-analysis/` | Variance table with root cause commentary and management summary |
|
||||
| 99 | **Investor Pitch Deck** | `skills/investor-pitch-deck/` | Slide-by-slide pitch deck structure with what each slide must prove |
|
||||
| 100 | **Financial Due Diligence** | `skills/financial-due-diligence/` | DD document request list, analytical questions, and red flags checklist |
|
||||
| 101 | **Tax Planning Checklist** | `skills/tax-planning-checklist/` | Year-end tax planning framework across income, pension, CGT, business reliefs, and ISAs |
|
||||
|
||||
---
|
||||
|
||||
### 👥 HR (Skills 67–71)
|
||||
### 👥 HR (Skills 102–106)
|
||||
**Bundle:** `pm-hr`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 67 | **Job Description Writer** | `skills/job-description-writer/` | Inclusive, structured JDs with built-in language review and salary range nudge |
|
||||
| 68 | **Onboarding Plan** | `skills/onboarding-plan/` | 30/60/90-day plans with week-by-week structure, milestones, and manager checklist |
|
||||
| 69 | **Employee Engagement Survey** | `skills/employee-engagement-survey/` | Survey design + results analysis mode with eNPS and action planning template |
|
||||
| 70 | **Redundancy Consultation** | `skills/redundancy-consultation/` | Process timeline, at-risk letter, consultation script, and confirmation letter — UK law |
|
||||
| 71 | **Change Management Plan** 🆕 | `skills/change-management-plan/` | Full change plan covering stakeholder analysis, communication strategy, training, and adoption metrics |
|
||||
| 102 | **Job Description Writer** | `skills/job-description-writer/` | Inclusive, structured JDs with built-in language review and salary range nudge |
|
||||
| 103 | **Onboarding Plan** | `skills/onboarding-plan/` | 30/60/90-day plans with week-by-week structure, milestones, and manager checklist |
|
||||
| 104 | **Employee Engagement Survey** | `skills/employee-engagement-survey/` | Survey design + results analysis mode with eNPS and action planning template |
|
||||
| 105 | **Redundancy Consultation** | `skills/redundancy-consultation/` | Process timeline, at-risk letter, consultation script, and confirmation letter — UK law |
|
||||
| 106 | **Change Management Plan** | `skills/change-management-plan/` | Full change plan covering stakeholder analysis, communication strategy, training, and adoption metrics |
|
||||
|
||||
---
|
||||
|
||||
### 🤝 Sales (Skills 72–76)
|
||||
### 🤝 Sales (Skills 107–111)
|
||||
**Bundle:** `pm-sales`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 72 | **Sales Battlecard** | `skills/sales-battlecard/` | One-page competitive battlecard with objection responses and landmine questions |
|
||||
| 73 | **Discovery Call Prep** | `skills/discovery-call-prep/` | Call brief with research summary, hypothesis, structured questions, and success criteria |
|
||||
| 74 | **Proposal Writer** | `skills/proposal-writer/` | Commercial proposals structured around the prospect's problem, not the product |
|
||||
| 75 | **Account Plan** | `skills/account-plan/` | Strategic account plan with relationship map, whitespace analysis, risks, and 90-day actions |
|
||||
| 76 | **Sales Forecasting Model** 🆕 | `skills/sales-forecasting-model/` | Pipeline-based forecast with stage model, scenario analysis, assumption log, and activity sanity check |
|
||||
| 107 | **Sales Battlecard** | `skills/sales-battlecard/` | One-page competitive battlecard with objection responses and landmine questions |
|
||||
| 108 | **Discovery Call Prep** | `skills/discovery-call-prep/` | Call brief with research summary, hypothesis, structured questions, and success criteria |
|
||||
| 109 | **Proposal Writer** | `skills/proposal-writer/` | Commercial proposals structured around the prospect's problem, not the product |
|
||||
| 110 | **Account Plan** | `skills/account-plan/` | Strategic account plan with relationship map, whitespace analysis, risks, and 90-day actions |
|
||||
| 111 | **Sales Forecasting Model** | `skills/sales-forecasting-model/` | Pipeline-based forecast with stage model, scenario analysis, assumption log, and activity sanity check |
|
||||
|
||||
---
|
||||
|
||||
### ⚙️ Operations (Skills 77–81)
|
||||
### ⚙️ Operations (Skills 112–116)
|
||||
**Bundle:** `pm-operations`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 77 | **Process Documentation** | `skills/process-documentation/` | Clear process docs with steps, roles, edge cases — followable by a new starter |
|
||||
| 78 | **SOP Writer** | `skills/sop-writer/` | Formal, audit-ready SOPs with version control, quality checks, and non-conformance process |
|
||||
| 79 | **Vendor Evaluation** | `skills/vendor-evaluation/` | Weighted vendor scorecard, RFP questions, reference check template, and recommendation |
|
||||
| 80 | **Project Status Report** | `skills/project-status-report/` | RAG status reports with milestone progress, issues, risks, and decisions required |
|
||||
| 81 | **Workshop Facilitation Guide** 🆕 | `skills/workshop-facilitation-guide/` | Complete facilitation guides with activity instructions, decision protocols, and facilitator moves |
|
||||
| 112 | **Process Documentation** | `skills/process-documentation/` | Clear process docs with steps, roles, edge cases — followable by a new starter |
|
||||
| 113 | **SOP Writer** | `skills/sop-writer/` | Formal, audit-ready SOPs with version control, quality checks, and non-conformance process |
|
||||
| 114 | **Vendor Evaluation** | `skills/vendor-evaluation/` | Weighted vendor scorecard, RFP questions, reference check template, and recommendation |
|
||||
| 115 | **Project Status Report** | `skills/project-status-report/` | RAG status reports with milestone progress, issues, risks, and decisions required |
|
||||
| 116 | **Workshop Facilitation Guide** | `skills/workshop-facilitation-guide/` | Complete facilitation guides with activity instructions, decision protocols, and facilitator moves |
|
||||
|
||||
---
|
||||
|
||||
### 🏥 Research & Healthcare (Skills 82–85)
|
||||
### 🏥 Research & Healthcare (Skills 117–120)
|
||||
**Bundle:** `pm-research`
|
||||
|
||||
> ⚠️ Healthcare skills are for documentation and educational purposes only. All clinical content must be reviewed by a qualified professional.
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 82 | **Clinical Case Summary** | `skills/clinical-case-summary/` | SBAR handovers, SOAP notes, and case reports for educational and documentation use |
|
||||
| 83 | **Research Protocol** | `skills/research-protocol/` | Complete study protocols with objectives, methodology, ethics, and analysis plan |
|
||||
| 84 | **Patient Communication** | `skills/patient-communication/` | Plain English patient letters, leaflets, and results communications at Grade 6 reading level |
|
||||
| 85 | **Literature Review** | `skills/literature-review/` | Thematically organised literature reviews with synthesis, critical analysis, and gap identification |
|
||||
| 117 | **Clinical Case Summary** | `skills/clinical-case-summary/` | SBAR handovers, SOAP notes, and case reports for educational and documentation use |
|
||||
| 118 | **Research Protocol** | `skills/research-protocol/` | Complete study protocols with objectives, methodology, ethics, and analysis plan |
|
||||
| 119 | **Patient Communication** | `skills/patient-communication/` | Plain English patient letters, leaflets, and results communications at Grade 6 reading level |
|
||||
| 120 | **Literature Review** | `skills/literature-review/` | Thematically organised literature reviews with synthesis, critical analysis, and gap identification |
|
||||
|
||||
---
|
||||
|
||||
### 🌐 Cross-Profession (Skills 86–89)
|
||||
### 🌐 Cross-Profession (Skills 121–124)
|
||||
**Bundle:** `pm-cross`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 86 | **Press Release** | `skills/press-release/` | Journalist-ready press releases with headline rules, boilerplate, and journalist test |
|
||||
| 87 | **Grant Proposal** | `skills/grant-proposal/` | Complete grant applications aligned to funder priorities with budget narrative |
|
||||
| 88 | **Executive Summary** | `skills/executive-summary/` | Decision-ready executive summaries with bottom line upfront, adapted for any audience |
|
||||
| 89 | **Teaching Lesson Plan** 🆕 | `skills/teaching-lesson-plan/` | Complete lesson plans for any subject, audience, or setting — with objectives, activities, and formative assessment |
|
||||
| 121 | **Press Release** | `skills/press-release/` | Journalist-ready press releases with headline rules, boilerplate, and journalist test |
|
||||
| 122 | **Grant Proposal** | `skills/grant-proposal/` | Complete grant applications aligned to funder priorities with budget narrative |
|
||||
| 123 | **Executive Summary** | `skills/executive-summary/` | Decision-ready executive summaries with bottom line upfront, adapted for any audience |
|
||||
| 124 | **Teaching Lesson Plan** | `skills/teaching-lesson-plan/` | Complete lesson plans for any subject, audience, or setting — with objectives, activities, and formative assessment |
|
||||
|
||||
---
|
||||
|
||||
### 🖼️ Figma (Skills 90–100 — reaching the milestone)
|
||||
### 🖼️ Figma (Skills 125–134)
|
||||
**Bundle:** `pm-figma`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 90 | **Figma Component Audit** | `skills/figma-component-audit/` | Audit component library for naming issues, coverage gaps, and variant completeness |
|
||||
| 91 | **Figma Design Brief** | `skills/figma-design-brief/` | Convert PRDs and feature requests into structured Figma design briefs |
|
||||
| 92 | **Figma Annotation Guide** | `skills/figma-annotation-guide/` | Generate complete developer handoff annotations covering all states and edge cases |
|
||||
| 93 | **Figma Design Review** | `skills/figma-design-review/` | PM design review against requirements with explicit approval status |
|
||||
| 94 | **Figma User Flow Planner** | `skills/figma-user-flow-planner/` | Map all screens, states, and decision points before opening Figma |
|
||||
| 95 | **Figma Variant Matrix** | `skills/figma-variant-matrix/` | Define all component variants, properties, and states before building |
|
||||
| 96 | **Figma Spacing System** | `skills/figma-spacing-system/` | Design a complete spacing scale, grid, and token system |
|
||||
| 97 | **Figma Prototype Plan** | `skills/figma-prototype-plan/` | Plan prototype scope, interactions, and test task scripts for user testing |
|
||||
| 98 | **Figma Design QA** | `skills/figma-design-qa/` | Pre-handoff QA checklist covering file hygiene, states, accessibility, and handoff readiness |
|
||||
| 99 | **Figma Design Critique (PM)** | `skills/figma-design-critique-pm/` | PM-perspective design critique focused on product outcomes, not aesthetics |
|
||||
| 100 | **PM Weekly Review** | `skills/pm-weekly-review/` | Weekly PM review and planning ritual — metrics, shipping progress, blockers, and next week's priorities |
|
||||
| 125 | **Figma Component Audit** | `skills/figma-component-audit/` | Audit component library for naming issues, coverage gaps, and variant completeness |
|
||||
| 126 | **Figma Design Brief** | `skills/figma-design-brief/` | Convert PRDs and feature requests into structured Figma design briefs |
|
||||
| 127 | **Figma Annotation Guide** | `skills/figma-annotation-guide/` | Generate complete developer handoff annotations covering all states and edge cases |
|
||||
| 128 | **Figma Design Review** | `skills/figma-design-review/` | PM design review against requirements with explicit approval status |
|
||||
| 129 | **Figma User Flow Planner** | `skills/figma-user-flow-planner/` | Map all screens, states, and decision points before opening Figma |
|
||||
| 130 | **Figma Variant Matrix** | `skills/figma-variant-matrix/` | Define all component variants, properties, and states before building |
|
||||
| 131 | **Figma Spacing System** | `skills/figma-spacing-system/` | Design a complete spacing scale, grid, and token system |
|
||||
| 132 | **Figma Prototype Plan** | `skills/figma-prototype-plan/` | Plan prototype scope, interactions, and test task scripts for user testing |
|
||||
| 133 | **Figma Design QA** | `skills/figma-design-qa/` | Pre-handoff QA checklist covering file hygiene, states, accessibility, and handoff readiness |
|
||||
| 134 | **Figma Design Critique (PM)** | `skills/figma-design-critique-pm/` | PM-perspective design critique focused on product outcomes, not aesthetics |
|
||||
|
||||
claude plugin install pm-figma@pm-claude-skills
|
||||
|
||||
|
||||
---
|
||||
|
||||
### 📅 PM Rituals (Skill 135)
|
||||
**Bundle:** `pm-rituals`
|
||||
|
||||
| # | Skill | Folder | What It Does |
|
||||
|---|---|---|---|
|
||||
| 135 | **PM Weekly Review** | `skills/pm-weekly-review/` | Weekly PM review and planning ritual — metrics, shipping progress, blockers, and next week's priorities |
|
||||
|
||||
---
|
||||
|
||||
## ❤️ Sponsor This Work

Building and maintaining 135 skills across 23 bundles takes real time — testing skills against new model releases, building new ones from community requests, writing the article series, and keeping documentation current.

If these skills save you time at work, consider sponsoring:

**[💖 Become a Sponsor →](https://github.com/sponsors/mohitagw15856)**

Sponsorships range from $5/month (coffee tier) to $500/month (sustaining sponsor with logo placement). Every sponsor directly funds:

- New skills based on community votes in [SKILL_REQUEST.md](SKILL_REQUEST.md)
- Updates to existing skills when new Claude models ship
- Continued free, ad-free Medium articles documenting what works
- Quality improvements across the library

Higher tiers include custom skill development for your team, direct access for support, and logo placement in this README. See the [sponsor page](https://github.com/sponsors/mohitagw15856) for full tier details.

---

## 🤝 Contributing — Add Your Skill

This is an open-source community library. If you've built a skill that saves you time, share it here.

**Found a bug?** [Open a bug report →](../../issues/new?template=bug-report.md) — use the template so it's easy to triage.

**How to contribute:**

1. Fork this repo
@@ -381,26 +568,49 @@ Install the whole library or just the bundles you need:
|
||||
|
||||
# Install by profession
|
||||
claude plugin install pm-essentials@pm-claude-skills
|
||||
|
||||
claude plugin install pm-discovery@pm-claude-skills
|
||||
|
||||
claude plugin install pm-planning@pm-claude-skills
|
||||
|
||||
claude plugin install pm-delivery@pm-claude-skills
|
||||
|
||||
claude plugin install pm-analytics@pm-claude-skills
|
||||
|
||||
claude plugin install pm-strategy@pm-claude-skills
|
||||
|
||||
claude plugin install pm-advanced@pm-claude-skills
|
||||
|
||||
claude plugin install pm-rituals@pm-claude-skills
|
||||
|
||||
claude plugin install pm-gtm@pm-claude-skills
|
||||
claude plugin install pm-engineering@pm-claude-skills # 10 engineering skills 🆕
|
||||
|
||||
claude plugin install pm-engineering@pm-claude-skills # Engineering (35 skills)
|
||||
|
||||
claude plugin install pm-cs@pm-claude-skills # Customer Success (4 skills) 🆕
|
||||
|
||||
claude plugin install pm-data@pm-claude-skills
|
||||
|
||||
claude plugin install pm-people@pm-claude-skills
|
||||
|
||||
claude plugin install pm-design@pm-claude-skills
|
||||
|
||||
claude plugin install pm-business@pm-claude-skills
|
||||
|
||||
claude plugin install pm-legal@pm-claude-skills
|
||||
|
||||
claude plugin install pm-finance@pm-claude-skills
|
||||
|
||||
claude plugin install pm-hr@pm-claude-skills
|
||||
|
||||
claude plugin install pm-sales@pm-claude-skills
|
||||
|
||||
claude plugin install pm-operations@pm-claude-skills
|
||||
|
||||
claude plugin install pm-research@pm-claude-skills
|
||||
|
||||
claude plugin install pm-cross@pm-claude-skills
|
||||
|
||||
claude plugin install pm-figma@pm-claude-skills
|
||||
|
||||
---
|
||||
@@ -417,7 +627,7 @@ Read the full breakdown: [Part 12 — I Built the Same Skills Library for ChatGP
|
||||
|
||||
## 🛠️ Custom Skills for Your Team
|
||||
|
||||
The 100 skills in this library are built for general professional workflows. But the most powerful version of Claude Skills is one built specifically for *your* team — your templates, your terminology, your processes, your quality standards.
|
||||
The 114 skills in this library are built for general professional workflows. But the most powerful version of Claude Skills is one built specifically for *your* team — your templates, your terminology, your processes, your quality standards.
|
||||
|
||||
**What custom skills look like in practice:**
|
||||
|
||||
@@ -454,8 +664,8 @@ Stars unlock the next wave of skills. Here's the roadmap:
|
||||
| Milestone | Unlocks | Status |
|
||||
|---|---|---|
|
||||
| 100 ⭐ | 10 Figma skills + quality rebuild across all 93 skills | ✅ Shipped (v6.0.0) |
|
||||
| 250 ⭐ | 10 Customer Success skills (health scorecard, QBR deck, escalation brief, churn analysis) | 🔒 Locked |
|
||||
| 500 ⭐ | 25 more Engineering skills (CI/CD playbooks, SLO templates, onboarding docs, debugging patterns) | 🔒 Locked |
|
||||
| 250 ⭐ | 10 Customer Success skills (health scorecard, QBR deck, escalation brief, churn analysis) | ✅ Unlocked — coming in next release |
|
||||
| 500 ⭐ | 25 Engineering skills (CI/CD playbooks, SLO templates, onboarding docs, debugging patterns, threat models, capacity planning, DR plans, and more) | ✅ Shipped — pm-engineering now 35 skills (v11.0.0) |
|
||||
| 1000 ⭐ | Full Startup Founder kit (fundraising memo, pitch critique, co-founder equity split) | 🔒 Locked |
|
||||
|
||||
**[⭐ Star this repo to unlock the next milestone →](https://github.com/mohitagw15856/pm-claude-skills)**
|
||||
@@ -464,4 +674,4 @@ Want a specific skill built? [Vote or request in SKILL_REQUEST.md](SKILL_REQUEST
|
||||
|
||||
---
|
||||
|
||||
*Built and maintained by [Mohit Aggarwal](https://medium.com/@mohit15856) | [Product Notes publication](https://medium.com/product-powerhouse)*
|
||||
*Built and maintained by [Mohit Aggarwal](https://medium.com/@mohit15856) | [Product Notes publication](https://medium.com/product-powerhouse) | [💖 Sponsor my work](https://github.com/sponsors/mohitagw15856)*
|
||||
|
||||
+1
-1
@@ -6,7 +6,7 @@ Have an idea for a skill? Add it here or upvote existing requests by leaving a
|
||||
|
||||
## How to Request a Skill
|
||||
|
||||
1. [Open an issue](../../issues/new) with the label `skill-request`
|
||||
1. [Open an issue](https://github.com/mohitagw15856/pm-claude-skills/issues/new) with the label `skill-request`
|
||||
2. Include:
|
||||
- **Skill name** (what you'd call it)
|
||||
- **Profession** (who uses this)
|
||||
|
||||
@@ -0,0 +1,13 @@
{
  "$schema": "https://anthropic.com/claude-code/plugin.schema.json",
  "name": "pm-cs",
  "version": "1.0.0",
  "description": "Customer Success skills: Customer Health Scorecard, QBR Deck, Escalation Brief, Churn Analysis. Score account health with a weighted RAG framework, build structured QBR decks with value narratives, write crisp escalation briefs for at-risk accounts, and analyse churn by category and segment with prioritised interventions.",
  "author": {
    "name": "Mohit Aggarwal",
    "email": "mohit15856@gmail.com"
  },
  "homepage": "https://github.com/mohitagw15856/pm-claude-skills",
  "license": "MIT",
  "keywords": ["customer-success", "account-management", "health-scorecard", "qbr", "quarterly-business-review", "churn", "retention", "escalation", "csm", "renewal"]
}
@@ -0,0 +1,179 @@
---
name: churn-analysis
description: "Analyse customer churn for a product or cohort and produce a structured churn report. Use when asked to analyse churn, understand why customers are leaving, identify churn patterns, calculate churn rate, or build a churn reduction plan. Produces a churn analysis with rate calculations, categorised reasons, early warning signals, and prioritised interventions."
---

# Churn Analysis Skill

Produce a structured churn analysis that goes beyond the headline rate — identifying why customers leave, which segments are most at risk, and what interventions will have the highest impact on retention.

## Required Inputs

Ask for these if not already provided:
- **Time period** being analysed (e.g. Q1, last 12 months)
- **Total customers at start of period** and **customers churned**
- **ARR or revenue lost** to churn
- **Churn reasons data** — exit survey results, CSM notes, support data, or sales loss reasons
- **Customer segments** — by tier, industry, cohort, or product line
- **Current retention rate** if known
- **Any recent changes** — pricing, product, support model — that may have affected churn

## Churn Categories

Always classify churn before analysing it:

| Category | Definition |
|---|---|
| **Voluntary — avoidable** | Customer left due to a problem we could have addressed (product gaps, poor onboarding, relationship failures) |
| **Voluntary — unavoidable** | Customer left for reasons outside our control (budget cuts, acquisition, company shutdown) |
| **Involuntary** | Payment failure, contract non-renewal by mistake, admin error |

The interventions for each category are different. Conflating them leads to wrong conclusions.

## Output Format

---

# Churn Analysis: [Product / Segment / Company]
**Period:** [Start date] — [End date]
**Prepared by:** [Name] | **Date:** [Date]

---

## Headline Numbers

| Metric | Value |
|---|---|
| Customers at start of period | [N] |
| Customers churned | [N] |
| **Customer churn rate** | **[X]%** |
| ARR at start of period | £/$/€[X] |
| ARR lost to churn | £/$/€[X] |
| **Revenue churn rate (gross)** | **[X]%** |
| ARR from expansions (same period) | £/$/€[X] |
| **Net revenue retention (NRR)** | **[X]%** |

**Benchmark context:**
- Customer churn rate: [X]% vs. industry benchmark [Y]% — [above / below / in line]
- NRR: [X]% — [What this means: above 100% = expansion offsets churn; below 100% = shrinking base]
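
The three bold rates in the Headline Numbers table can be derived from the other rows. The following is a minimal sketch, not part of the skill itself: the `PeriodInputs` class and `headline_metrics` function are hypothetical names, and the NRR formula assumes that only churn and expansion move ARR, because contraction (downgrades) does not appear in the table.

```python
from dataclasses import dataclass


@dataclass
class PeriodInputs:
    """Hypothetical container for the headline inputs (illustrative names)."""
    customers_at_start: int
    customers_churned: int
    arr_at_start: float
    arr_lost_to_churn: float
    arr_from_expansions: float


def headline_metrics(p: PeriodInputs) -> dict[str, float]:
    """Return customer churn, gross revenue churn, and NRR as percentages."""
    if p.customers_at_start <= 0 or p.arr_at_start <= 0:
        raise ValueError("starting customers and starting ARR must be positive")

    # Customer churn uses the starting cohort as the denominator,
    # not the end-of-period customer count.
    customer_churn = p.customers_churned / p.customers_at_start * 100

    # Gross revenue churn ignores expansion entirely.
    gross_revenue_churn = p.arr_lost_to_churn / p.arr_at_start * 100

    # Assumption: NRR = (starting ARR - churned ARR + expansion ARR) / starting ARR.
    # Downgrades are not tracked in the table, so they are left out here.
    retained = p.arr_at_start - p.arr_lost_to_churn + p.arr_from_expansions
    nrr = retained / p.arr_at_start * 100

    return {
        "customer_churn_pct": round(customer_churn, 2),
        "gross_revenue_churn_pct": round(gross_revenue_churn, 2),
        "nrr_pct": round(nrr, 2),
    }


example = headline_metrics(PeriodInputs(200, 10, 1_000_000, 50_000, 80_000))
print(example)
```

With these made-up inputs, expansion outweighs churn, so NRR lands above 100% even though 5% of customers left.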

---

## Churn Breakdown by Category

| Category | Customers | % of churn | ARR lost |
|---|---|---|---|
| Voluntary — avoidable | [N] | [X]% | £/$/€[X] |
| Voluntary — unavoidable | [N] | [X]% | £/$/€[X] |
| Involuntary | [N] | [X]% | £/$/€[X] |
| **Total** | **[N]** | **100%** | **£/$/€[X]** |

**Avoidable churn as % of total churn:** [X]% — this is the number we can actually influence.
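
The "% of churn" column and the avoidable share are simple proportions of the churned total. A small sketch of that arithmetic follows; the function name and dictionary keys are illustrative assumptions, and the counts are invented.

```python
def category_breakdown(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of total churned customers, in percent."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no churned customers to break down")
    return {name: round(n / total * 100, 1) for name, n in counts.items()}


# Invented counts for one period: 20 churned customers in total.
shares = category_breakdown({"avoidable": 12, "unavoidable": 5, "involuntary": 3})
print(shares["avoidable"])  # share of churn that interventions can target
```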

---

## Churn Reasons — Avoidable Churn Only

Rank by frequency. Include ARR weight where data allows.

| Reason | Count | % of avoidable churn | ARR lost | Representative quote |
|---|---|---|---|---|
| [Reason 1 — e.g. "Product missing key feature"] | [N] | [X]% | £/$/€[X] | "[Quote]" |
| [Reason 2] | [N] | [X]% | £/$/€[X] | "[Quote]" |
| [Reason 3] | [N] | [X]% | £/$/€[X] | "[Quote]" |
| [Reason 4] | [N] | [X]% | £/$/€[X] | "[Quote]" |
| Other | [N] | [X]% | £/$/€[X] | — |

**Theme synthesis:** [2–3 sentences grouping the top reasons into 2–3 themes. E.g. "The top three reasons cluster around two themes: product gaps in [area] (affecting X% of avoidable churn) and onboarding failures where customers never achieved value (Y%)."]

---

## Churn by Segment

Identify which segments over- or under-index for churn.

### By Tier

| Tier | Churn rate | vs. Overall | Notes |
|---|---|---|---|
| Enterprise | [X]% | +/-[X]pp | |
| Mid-Market | [X]% | +/-[X]pp | |
| SMB | [X]% | +/-[X]pp | |

### By Cohort (Acquisition Year)

| Cohort | Churn rate | Notes |
|---|---|---|
| [Year 1] | [X]% | |
| [Year 2] | [X]% | |
| [Year 3] | [X]% | |

### By Industry / Use Case (if data available)

| Segment | Churn rate | Notes |
|---|---|---|
| [Segment 1] | [X]% | |
| [Segment 2] | [X]% | |

**Key pattern:** [Which segment has the highest churn rate and what likely explains it]
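
The "vs. Overall" column in the tier table is a difference in percentage points between each segment's churn rate and the overall rate. The sketch below shows one way to compute it. It assumes the overall rate is pooled across all segments (total churned over total starting customers) rather than an average of segment rates. The function name and the figures are illustrative.

```python
def segment_deltas(segments: dict[str, tuple[int, int]]) -> dict[str, dict[str, float]]:
    """Map segment name -> churn rate (%) and gap to the overall rate (pp).

    `segments` maps a name to (customers_at_start, customers_churned).
    """
    total_start = sum(start for start, _ in segments.values())
    total_churned = sum(churned for _, churned in segments.values())
    if total_start == 0:
        raise ValueError("no starting customers")

    # Pooled overall rate: weights each segment by its size.
    overall = total_churned / total_start * 100

    result = {}
    for name, (start, churned) in segments.items():
        rate = churned / start * 100
        result[name] = {
            "churn_rate_pct": round(rate, 1),
            "vs_overall_pp": round(rate - overall, 1),
        }
    return result


tiers = segment_deltas({
    "Enterprise": (50, 2),
    "Mid-Market": (150, 9),
    "SMB": (300, 39),
})
for tier, stats in tiers.items():
    print(tier, stats)
```

A positive `vs_overall_pp` marks a segment that over-indexes for churn. In this invented data only SMB does.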

---

## Timing Analysis

- **Average contract length before churn:** [X months]
- **Highest-risk moment:** [e.g. "Month 3 — when trial value has worn off but full adoption hasn't happened"]
- **Churn timing distribution:**

| When churn occurred | % of churned accounts |
|---|---|
| 0–3 months | [X]% |
| 3–6 months | [X]% |
| 6–12 months | [X]% |
| 12+ months | [X]% |
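
The timing distribution can be built by bucketing each churned account's tenure. The table does not say which bucket a boundary value such as exactly 3 months falls into, so this sketch assumes half-open ranges: 0 up to but not including 3, then 3 to 6, 6 to 12, and 12 or more. The function name and sample tenures are invented for illustration.

```python
def timing_distribution(months_to_churn: list[float]) -> dict[str, float]:
    """Return the share of churned accounts (%) in each tenure bucket."""
    if not months_to_churn:
        raise ValueError("no churned accounts supplied")

    counts = {"0-3 months": 0, "3-6 months": 0, "6-12 months": 0, "12+ months": 0}
    for months in months_to_churn:
        if months < 0:
            raise ValueError("tenure cannot be negative")
        # Assumption: boundaries belong to the later bucket (3.0 -> "3-6 months").
        if months < 3:
            counts["0-3 months"] += 1
        elif months < 6:
            counts["3-6 months"] += 1
        elif months < 12:
            counts["6-12 months"] += 1
        else:
            counts["12+ months"] += 1

    total = len(months_to_churn)
    return {label: round(n / total * 100, 1) for label, n in counts.items()}


print(timing_distribution([1, 2, 2.5, 4, 5, 7, 9, 11, 14, 20]))
```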

---

## Early Warning Signals

Based on the churned accounts, identify the signals that preceded churn (and could have triggered earlier intervention):

| Signal | Lead time before churn | How to detect |
|---|---|---|
| [Signal 1 — e.g. "DAU/MAU dropped below 15%"] | [~X weeks] | [Usage dashboard / alert] |
| [Signal 2 — e.g. "No QBR in 90+ days"] | [~X weeks] | [CRM flag] |
| [Signal 3 — e.g. "Champion left the account"] | [~X weeks] | [LinkedIn alert / CSM tracking] |
| [Signal 4] | [~X weeks] | [Detection method] |

---

## Intervention Recommendations

Ranked by estimated impact × feasibility.

| Intervention | Addresses | Est. churn reduction | Effort | Owner |
|---|---|---|---|---|
| [Intervention 1 — e.g. "Improve onboarding for [segment] with dedicated 30-day check-in"] | [Reason 1] | [X accounts / £X ARR] | Low / Med / High | [Team] |
| [Intervention 2] | [Reason 2] | [X accounts / £X ARR] | Low / Med / High | [Team] |
| [Intervention 3] | [Reason 3] | [X accounts / £X ARR] | Low / Med / High | [Team] |

**Priority call:** [Which one intervention, if implemented this quarter, would have the biggest impact and why]
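
The skill asks for a ranking by estimated impact × feasibility but does not define a scale for either. One possible reading is sketched here: impact is the estimated ARR retained, and feasibility is an inverse of the Effort column (Low effort scores 3, Med 2, High 1). Both choices are assumptions for illustration, as are the class name, function name, and figures.

```python
from dataclasses import dataclass

# Assumption: lower effort means higher feasibility. These weights are
# illustrative and are not defined by the skill.
FEASIBILITY = {"Low": 3, "Med": 2, "High": 1}


@dataclass
class Intervention:
    name: str
    est_arr_saved: float  # estimated ARR retained if the intervention works
    effort: str           # "Low", "Med", or "High", matching the Effort column

    @property
    def score(self) -> float:
        """Impact multiplied by feasibility."""
        return self.est_arr_saved * FEASIBILITY[self.effort]


def rank(interventions: list[Intervention]) -> list[Intervention]:
    """Sort interventions from highest to lowest score."""
    return sorted(interventions, key=lambda i: i.score, reverse=True)


ranked = rank([
    Intervention("Build missing feature", 300_000, "High"),
    Intervention("30-day onboarding check-in", 120_000, "Low"),
    Intervention("Exec sponsor programme", 90_000, "Med"),
])
for item in ranked:
    print(item.name, item.score)
```

Under this weighting a cheap intervention with moderate impact can outrank an expensive one with larger impact, which is the kind of trade-off the priority call is meant to surface.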

---

## What We Don't Know (Data Gaps)

- [Data gap 1 — e.g. "Exit survey response rate is only 30% — the reasons data may not be representative"]
- [Data gap 2 — e.g. "No product usage data for SMB tier — can't confirm usage signal correlation"]
- [Data gap 3]

---

## Quality Checks

- [ ] Churn rate is correctly calculated (churned ÷ starting cohort, not end-of-period total)
- [ ] Avoidable and unavoidable churn are separated — interventions target avoidable churn only
- [ ] Churn reasons are customer-reported, not internally assumed
- [ ] Segment analysis identifies which segments over-index — not just averages
- [ ] Early warning signals are specific and detectable, not generic ("low engagement")
- [ ] Interventions link directly to the top churn reasons — no recommendations without a root cause match
@@ -0,0 +1,176 @@
---
name: cs-escalation-brief
description: "Write a structured escalation brief for an at-risk customer account. Use when an account has escalated, when a customer is threatening churn, when a P1 customer issue needs executive attention, or when preparing an internal save play. Produces a crisp escalation brief with account context, timeline, root cause, business impact, and a clear resolution plan."
---

# Customer Escalation Brief Skill

Produce a clear, concise escalation brief that gives internal stakeholders — VP CS, CCO, product leadership, or the CEO — everything they need to understand the situation, make decisions, and act fast.

A good escalation brief is not a complaint. It is a professional document that states the facts, assigns accountability honestly, and proposes a specific resolution plan.

## Required Inputs

Ask for these if not already provided:
- **Account name**, tier, and ARR
- **CSM name** and account owner
- **Nature of the escalation** — what happened, what the customer is saying
- **Timeline** of events leading to escalation
- **Customer contact** who escalated (name, role, influence level)
- **What the customer wants** — their stated ask
- **What we believe the root cause is**
- **What has already been done** to address the situation
- **Renewal date** and current renewal risk assessment

## Escalation Levels

Calibrate urgency and audience based on escalation level:

| Level | Trigger | Audience | Response time |
|---|---|---|---|
| L1 — Account Risk | Customer expressing dissatisfaction; renewal at risk | CSM + CS Manager | 24 hours |
| L2 — Executive Escalation | Customer escalated to their exec; requesting vendor exec involvement | VP CS + Account Exec | 4 hours |
| L3 — Churn Risk | Customer has issued notice or is in active churn conversation | CCO / CEO + Revenue leadership | 1 hour |
| L4 — Public Risk | Customer threatening public escalation, legal, or press | CCO / Legal / Comms | Immediate |
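
The escalation table works as a lookup from level to audience and response window. The sketch below encodes it that way and derives a respond-by time from the moment the escalation was raised. The `EscalationLevel` class, the `LEVELS` mapping, and `respond_by` are hypothetical names, and treating "Immediate" as a zero-length window is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EscalationLevel:
    label: str
    audience: str
    response_window: timedelta


# Values copied from the Escalation Levels table.
# Assumption: "Immediate" is modelled as a zero-length window.
LEVELS = {
    1: EscalationLevel("Account Risk", "CSM + CS Manager", timedelta(hours=24)),
    2: EscalationLevel("Executive Escalation", "VP CS + Account Exec", timedelta(hours=4)),
    3: EscalationLevel("Churn Risk", "CCO / CEO + Revenue leadership", timedelta(hours=1)),
    4: EscalationLevel("Public Risk", "CCO / Legal / Comms", timedelta(0)),
}


def respond_by(level: int, raised_at: datetime) -> datetime:
    """Return the latest acceptable first-response time for an escalation."""
    if level not in LEVELS:
        raise ValueError(f"unknown escalation level: {level}")
    return raised_at + LEVELS[level].response_window


raised = datetime(2025, 1, 6, 9, 0)
print(LEVELS[2].audience, respond_by(2, raised))
```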

## Output Format

---

# Escalation Brief: [Account Name]

**Escalation level:** L[1/2/3/4] — [Label]
**Date raised:** [Date]
**Raised by:** [CSM name]
**Escalation owner:** [Name of exec or senior stakeholder now leading response]

---

## Account at a Glance

| Field | Detail |
|---|---|
| ARR | £/$/€[X] |
| Tier | Enterprise / Mid-Market / SMB |
| Customer since | [Date] |
| Renewal date | [Date] — [N] days away |
| Renewal risk (pre-escalation) | Green / Amber / Red |
| Renewal risk (current) | Green / Amber / Red |
| Customer contact who escalated | [Name, role, seniority] |
| Executive sponsor (customer) | [Name, role — active / passive / vacant] |
| Executive sponsor (vendor) | [Name, role] |

---

## What Happened — Summary

[3–5 sentences. State the facts plainly. What the customer experienced, how they reacted, and how we learned about the escalation. No editorialising. No blame.]

---

## Timeline

List in chronological order. Each entry: `[Date / time] — [What happened. Who did what.]`

Include:
- When the original issue or trigger event occurred
- When the customer first raised concerns (informally)
- When it escalated (formal escalation or exec involvement)
- Actions taken since escalation

---

## Root Cause

**Primary cause:** [One clear sentence. What specifically went wrong.]

**Contributing factors:**
- [Factor 1 — be honest about internal failures as well as external ones]
- [Factor 2]

**Is this a systemic issue or isolated?**
[ ] Isolated to this account
[ ] Pattern seen in other accounts — details: [_______]
[ ] Product or process gap that needs fixing

---

## Customer's Stated Position

**What the customer says happened:** [Their version of events — fair and unfiltered]

**What they are asking for:** [Their explicit ask — compensation, fix by date, exec call, SLA credit, exit clause]

**Sentiment of escalating contact:** [Frustrated but constructive / Angry / Seeking exit / Unknown]

**Risk of public escalation:** Low / Medium / High — [evidence if Medium or High]

---

## Business Impact

| Impact type | Detail |
|---|---|
| ARR at risk | £/$/€[X] |
| Potential churn probability | [X]% |
|
||||
| Reputational risk | Low / Medium / High |
|
||||
| Reference / case study status | [Was a reference — now at risk / Not a reference] |
|
||||
| Expansion pipeline at risk | £/$/€[X] |
|
||||
|
||||
---
|
||||
|
||||
## What Has Been Done So Far
|
||||
|
||||
1. [Action taken — by whom — date — outcome]
|
||||
2. [Action taken — by whom — date — outcome]
|
||||
3. [Action taken — by whom — date — outcome]
|
||||
|
||||
**Has a formal apology or acknowledgement been issued?** Yes / No
|
||||
|
||||
---
|
||||
|
||||
## Proposed Resolution Plan
|
||||
|
||||
**Immediate actions (next 24–48 hours):**
|
||||
|
||||
| Action | Owner | By when |
|
||||
|---|---|---|
|
||||
| [Action] | [Name] | [Date] |
|
||||
| [Action] | [Name] | [Date] |
|
||||
|
||||
**Medium-term actions (next 2–4 weeks):**
|
||||
|
||||
| Action | Owner | By when |
|
||||
|---|---|---|
|
||||
| [Action] | [Name] | [Date] |
|
||||
|
||||
**What we are NOT offering:** [Be explicit about what is not on the table — avoids misaligned expectations]
|
||||
|
||||
**Success criteria:** [How will we know the escalation is resolved? What does the customer need to confirm they are satisfied?]
|
||||
|
||||
---
|
||||
|
||||
## Decision Required from Escalation Owner
|
||||
|
||||
[State clearly what decision or resource the escalation owner needs to provide. Be specific — do not make them ask. E.g.: "We need approval to offer a 20% service credit for Q2" or "We need an exec call with [name] within 48 hours."]
|
||||
|
||||
---
|
||||
|
||||
## Communication Plan
|
||||
|
||||
| Audience | Message | Channel | Owner | By when |
|
||||
|---|---|---|---|---|
|
||||
| Escalating customer contact | [Summary of message] | Email / Call | [Name] | [Date] |
|
||||
| Customer exec sponsor | [Summary] | Call | [Name] | [Date] |
|
||||
| Internal CS team | [Summary] | Slack / Meeting | CS Manager | [Date] |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Root cause is specific — not "communication breakdown" or "product gap" without detail
|
||||
- [ ] Customer's position is stated fairly — not minimised or dismissed
|
||||
- [ ] A clear decision is requested from the escalation owner — brief does not end with "what do you think?"
|
||||
- [ ] ARR at risk is quantified
|
||||
- [ ] Communication plan has owners and dates — not "TBD"
|
||||
- [ ] Language is professional and blameless toward individuals
|
||||
@@ -0,0 +1,141 @@
|
||||
---
|
||||
name: cs-health-scorecard
|
||||
description: "Build a customer health scorecard for a specific account. Use when asked to score account health, assess renewal risk, build a health dashboard, or evaluate an account's likelihood to renew or expand. Produces a structured health scorecard with a RAG status, dimension scores, key risks, and recommended actions."
|
||||
---
|
||||
|
||||
# Customer Health Scorecard Skill
|
||||
|
||||
Produce a structured, data-driven health scorecard for a customer account — giving the CSM and leadership a clear view of renewal risk, expansion potential, and the actions needed to move the account in the right direction.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Account name** and tier (enterprise / mid-market / SMB)
|
||||
- **Contract value** (ARR) and **renewal date**
|
||||
- **Product usage data** — logins, DAU/MAU ratio, key feature adoption
|
||||
- **Support data** — open tickets, CSAT or NPS score, recent escalations
|
||||
- **Engagement data** — last QBR date, executive sponsor status, champion name
|
||||
- **Commercial data** — payment history, expansion conversations, seats used vs. licensed
|
||||
- **Any known risks or recent changes** at the account
|
||||
|
||||
## Scoring Framework
|
||||
|
||||
Score each dimension 1–5 and weight as shown. To calculate the weighted total out of 100, multiply each score by its weight, divide by 5, and sum the results.
|
||||
|
||||
| Dimension | Weight | What to Score |
|
||||
|---|---|---|
|
||||
| **Product Adoption** | 30% | DAU/MAU ratio, breadth of features used, power users identified |
|
||||
| **Engagement** | 20% | QBR cadence, executive sponsor active, champion strength |
|
||||
| **Outcomes** | 20% | Customer hitting their stated goals / success metrics |
|
||||
| **Support Health** | 15% | Ticket volume trend, unresolved escalations, CSAT |
|
||||
| **Commercial** | 15% | On-time payments, seats utilised, expansion signals |
|
||||
|
||||
**Score → RAG conversion:**
|
||||
- 80–100: Green (healthy, renew likely)
|
||||
- 60–79: Amber (at risk, needs attention)
|
||||
- 0–59: Red (high churn risk, escalate)
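The scoring arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming each dimension contributes `score × weight ÷ 5` to the total; the function names `weighted_total` and `rag_status` are hypothetical and not part of the skill itself.

```python
# Weights from the Scoring Framework table (percent of the total score).
WEIGHTS = {
    "Product Adoption": 30,
    "Engagement": 20,
    "Outcomes": 20,
    "Support Health": 15,
    "Commercial": 15,
}


def weighted_total(scores: dict[str, int]) -> float:
    """Convert 1-5 dimension scores into a total out of 100.

    Assumption: each dimension contributes score * weight / 5.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five dimensions")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score must be between 1 and 5")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / 5


def rag_status(total: float) -> str:
    """Map a total out of 100 onto the Green / Amber / Red bands."""
    if total >= 80:
        return "Green"
    if total >= 60:
        return "Amber"
    return "Red"
```

For example, scores of 4, 3, 3, 4 and 2 (in table order) give a total of 66, which falls in the Amber band. Under this assumption the lowest possible total is 20 (all dimensions scored 1), so the Red band effectively covers 20–59.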
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Customer Health Scorecard: [Account Name]
|
||||
|
||||
**CSM:** [Name] | **Tier:** [Enterprise / Mid-Market / SMB]
|
||||
**ARR:** £/$/€[X] | **Renewal date:** [Date] | **Days to renewal:** [N]
|
||||
**Overall health:** [Green / Amber / Red] — [Score]/100
|
||||
**Last updated:** [Date]
|
||||
|
||||
---
|
||||
|
||||
## Health Score Summary
|
||||
|
||||
| Dimension | Score (1–5) | Weight | Weighted Score | Trend |
|
||||
|---|---|---|---|---|
|
||||
| Product Adoption | [1–5] | 30% | [X] | ↑ / → / ↓ |
|
||||
| Engagement | [1–5] | 20% | [X] | ↑ / → / ↓ |
|
||||
| Outcomes | [1–5] | 20% | [X] | ↑ / → / ↓ |
|
||||
| Support Health | [1–5] | 15% | [X] | ↑ / → / ↓ |
|
||||
| Commercial | [1–5] | 15% | [X] | ↑ / → / ↓ |
|
||||
| **Total** | — | 100% | **[X]/100** | |
|
||||
|
||||
---
|
||||
|
||||
## Dimension Detail
|
||||
|
||||
### Product Adoption — [Score]/5
|
||||
- **DAU/MAU ratio:** [X]% (benchmark: >25% = healthy)
|
||||
- **Key features adopted:** [List features in use]
|
||||
- **Features not adopted:** [List unused high-value features]
|
||||
- **Power users identified:** [Yes / No — how many]
|
||||
- **Assessment:** [1–2 sentences on adoption health]
|
||||
|
||||
### Engagement — [Score]/5
|
||||
- **Last QBR:** [Date] — [Outcome summary]
|
||||
- **Next QBR:** [Scheduled / Overdue]
|
||||
- **Executive sponsor:** [Active / Passive / Vacant]
|
||||
- **Champion:** [Name, role, strength: strong / moderate / weak]
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
### Outcomes — [Score]/5
|
||||
- **Customer's stated goals:** [List 2–3 goals from onboarding or last QBR]
|
||||
- **Progress against goals:** [On track / Partial / Off track]
|
||||
- **Evidence of value:** [Metric or quote that demonstrates ROI]
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
### Support Health — [Score]/5
|
||||
- **Open tickets:** [N] (priority breakdown: P1: X, P2: X, P3: X)
|
||||
- **CSAT / NPS:** [Score] (benchmark: CSAT >8/10 or NPS >30 = healthy)
|
||||
- **Unresolved escalations:** [Yes / No — details if yes]
|
||||
- **Ticket trend (last 90 days):** Increasing / Stable / Decreasing
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
### Commercial — [Score]/5
|
||||
- **Seats licensed:** [N] | **Seats active:** [N] ([X]% utilisation)
|
||||
- **Payment history:** [On time / Late — details]
|
||||
- **Expansion signals:** [Yes — describe / No]
|
||||
- **Downgrade or cancellation signals:** [Yes — describe / No]
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
---
|
||||
|
||||
## Top Risks
|
||||
|
||||
| Risk | Severity | Mitigation |
|
||||
|---|---|---|
|
||||
| [Risk description] | High / Medium / Low | [Specific action to mitigate] |
|
||||
|
||||
---
|
||||
|
||||
## Recommended Actions
|
||||
|
||||
**Immediate (this week):**
|
||||
1. [Action — owner — deadline]
|
||||
|
||||
**This month:**
|
||||
1. [Action — owner — deadline]
|
||||
|
||||
**Before renewal:**
|
||||
1. [Action — owner — deadline]
|
||||
|
||||
---
|
||||
|
||||
## Renewal Forecast
|
||||
|
||||
| Scenario | Probability | ARR at risk |
|
||||
|---|---|---|
|
||||
| Full renewal at current ARR | [X]% | £/$/€0 |
|
||||
| Renewal with contraction | [X]% | £/$/€[X] |
|
||||
| Churn | [X]% | £/$/€[full ARR] |
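As a sanity check on the forecast table, the three scenarios are mutually exclusive, so their probabilities should sum to 100%. The sketch below, with the hypothetical helper `expected_arr_at_risk`, shows one way to validate that and to roll the rows up into a single probability-weighted figure. The roll-up itself is an illustrative assumption rather than something the template requires.

```python
import math


def expected_arr_at_risk(
    arr: float,
    full_renewal_pct: float,
    contraction_pct: float,
    churn_pct: float,
    contraction_amount: float,
) -> float:
    """Return the probability-weighted ARR at risk across the three scenarios."""
    total_pct = full_renewal_pct + contraction_pct + churn_pct
    if not math.isclose(total_pct, 100.0):
        raise ValueError(
            f"scenario probabilities sum to {total_pct}%, expected 100%"
        )
    # Full renewal puts nothing at risk, contraction risks the downsell
    # amount, and churn risks the full ARR.
    return (contraction_pct * contraction_amount + churn_pct * arr) / 100
```

With an ARR of 100,000, a 70/20/10 split, and a possible contraction of 20,000, the weighted figure is 14,000.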
|
||||
|
||||
**Recommended renewal play:** [Expand / Hold / Save / Manage out]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Score is based on data, not gut feel — each dimension has evidence
|
||||
- [ ] Risks are specific (not "low engagement" — something like "executive sponsor left in March, no replacement identified")
|
||||
- [ ] Actions have owners and deadlines
|
||||
- [ ] Renewal probability is calibrated against pipeline reality
|
||||
- [ ] Trend arrows reflect direction of change vs. last scorecard, not just current state
|
||||
@@ -0,0 +1,218 @@
|
||||
---
|
||||
name: qbr-deck
|
||||
description: "Build a Quarterly Business Review (QBR) deck structure and narrative for a customer account. Use when asked to prepare a QBR, business review meeting, executive review, or quarterly check-in with a customer. Produces a slide-by-slide QBR structure with talking points, metrics review, value narrative, and mutual next steps."
|
||||
---
|
||||
|
||||
# QBR Deck Skill
|
||||
|
||||
Produce a complete Quarterly Business Review deck — structured, data-backed, and customer-focused. A good QBR demonstrates value delivered, aligns on goals for the next quarter, and strengthens the executive relationship. It should never feel like a product demo or a vendor update.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Account name**, CSM name, and customer stakeholders attending
|
||||
- **Contract details** — ARR, contract start date, renewal date
|
||||
- **Last quarter's goals** (from previous QBR or kickoff)
|
||||
- **Usage and adoption data** — key metrics for the quarter
|
||||
- **Support summary** — tickets raised, resolution time, any escalations
|
||||
- **Business outcomes the customer cares about** — what success looks like for them
|
||||
- **Product updates or new features** relevant to this customer
|
||||
- **Goals for next quarter**
|
||||
- **Any open commercial conversations** (expansion, renewal, at-risk signals)
|
||||
|
||||
## QBR Principles
|
||||
|
||||
- Lead with customer outcomes, not product features
|
||||
- Every metric should connect to a business result the customer cares about
|
||||
- The agenda is a conversation, not a presentation — build in time for customer input at every stage
|
||||
- Close with mutual commitments, not just vendor actions
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# QBR: [Account Name] × [Your Company]
|
||||
**[Quarter] [Year] Business Review**
|
||||
|
||||
**Date:** [Date] | **Location / Call link:** [TBC]
|
||||
**Customer attendees:** [Names and roles]
|
||||
**[Your company] attendees:** [Names and roles]
|
||||
|
||||
---
|
||||
|
||||
## Slide 1: Agenda (5 min)
|
||||
|
||||
| Time | Topic | Owner |
|
||||
|---|---|---|
|
||||
| 0:00 | Welcome and introductions | CSM |
|
||||
| 0:05 | [Last quarter] — how did we do? | CSM + Customer |
|
||||
| 0:20 | Value delivered — business impact | CSM |
|
||||
| 0:35 | What's coming — roadmap preview | CSM / Product |
|
||||
| 0:45 | [Next quarter] — goals and priorities | Customer |
|
||||
| 0:55 | Actions and mutual commitments | CSM |
|
||||
| 1:00 | Close | |
|
||||
|
||||
*Talking point: "We've kept today to 60 minutes. We want as much of this to be a conversation as possible — please push back, redirect, and ask questions throughout."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 2: Where We Are Together (2 min)
|
||||
|
||||
**Partnership snapshot:**
|
||||
- **Customer since:** [Date]
|
||||
- **Contract value:** £/$/€[ARR]/year
|
||||
- **Renewal date:** [Date]
|
||||
- **Active users:** [N] of [N] licensed seats ([X]% adoption)
|
||||
- **Products / modules active:** [List]
|
||||
|
||||
*Talking point: "Before we dive in — a quick picture of where we are. [X] months in, [Y] active users, and this is our [Nth] QBR together."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 3: Last Quarter — Goals We Set Together (5 min)
|
||||
|
||||
| Goal | Set in [Last QBR / Kickoff] | Status |
|
||||
|---|---|---|
|
||||
| [Goal 1] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
|
||||
| [Goal 2] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
|
||||
| [Goal 3] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
|
||||
|
||||
For any partial or missed goal: state what happened and what changes next quarter.
|
||||
|
||||
*Talking point: "Let's start with accountability. Here's what we said we'd achieve last quarter — let's be honest about where we landed."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 4: Usage and Adoption (5 min)
|
||||
|
||||
**Quarter-over-quarter trend:**
|
||||
|
||||
| Metric | [Q-1] | [Q] | Change |
|
||||
|---|---|---|---|
|
||||
| Monthly active users | [N] | [N] | +/-X% |
|
||||
| Sessions per user per week | [N] | [N] | +/-X% |
|
||||
| [Key feature 1] adoption | [X]% | [X]% | +/-X% |
|
||||
| [Key feature 2] adoption | [X]% | [X]% | +/-X% |
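The Change column is a simple percentage change between the two quarters. A minimal Python sketch follows; the helper name `qoq_change` and the "n/a" fallback for a zero baseline are assumptions made for illustration.

```python
def qoq_change(previous: float, current: float) -> str:
    """Format the quarter-over-quarter change as a signed percentage."""
    if previous == 0:
        # A percentage change from zero is undefined, so flag it instead.
        return "n/a"
    change = (current - previous) / previous * 100
    return f"{change:+.1f}%"
```

For instance, monthly active users moving from 200 to 230 would be reported as `+15.0%`, and a drop from 50 to 40 as `-20.0%`.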
|
||||
|
||||
**Highlights:**
|
||||
- [Positive adoption trend to call out]
|
||||
- [Feature or workflow with strongest engagement]
|
||||
|
||||
**Opportunity:**
|
||||
- [Feature with low adoption that could drive more value — link to their goals]
|
||||
|
||||
*Talking point: "Usage is [up / stable / something we want to talk about]. The area I'd like to focus on is [feature] — we're not seeing the adoption we'd expect given [their goal], and I want to understand why."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 5: Business Impact — Value Delivered (10 min)
|
||||
|
||||
Lead with outcomes, not activity.
|
||||
|
||||
**[Outcome 1: customer's primary success metric]**
|
||||
- Before: [baseline]
|
||||
- Now: [current state]
|
||||
- Impact: [quantified business result — time saved, revenue influenced, cost reduced, risk mitigated]
|
||||
|
||||
**[Outcome 2]**
|
||||
- [Same structure]
|
||||
|
||||
**[Outcome 3]**
|
||||
- [Same structure]
|
||||
|
||||
**Customer evidence** (use if available):
|
||||
> "[Quote from champion or user about value experienced]"
|
||||
|
||||
*Talking point: "This is the section I most want your input on. Are these the outcomes that matter to your business? Are there other ways you're measuring success that we should be tracking?"*
|
||||
|
||||
---
|
||||
|
||||
## Slide 6: Support Summary (3 min)
|
||||
|
||||
| Metric | This quarter | Last quarter | Trend |
|
||||
|---|---|---|---|
|
||||
| Tickets raised | [N] | [N] | ↑ / → / ↓ |
|
||||
| Average resolution time | [X hrs] | [X hrs] | ↑ / → / ↓ |
|
||||
| P1 / critical issues | [N] | [N] | ↑ / → / ↓ |
|
||||
| CSAT score | [X/10] | [X/10] | ↑ / → / ↓ |
|
||||
|
||||
**Notable issues this quarter:**
|
||||
- [Any escalation or major ticket — brief summary and resolution]
|
||||
|
||||
**What we're doing differently:**
|
||||
- [Any process change or improvement based on support patterns]
|
||||
|
||||
---
|
||||
|
||||
## Slide 7: What's Coming — Roadmap Preview (5 min)
|
||||
|
||||
Focus only on what's relevant to this customer's goals. Do not dump the full roadmap.
|
||||
|
||||
| Feature / Improvement | Expected | Why it matters to [Account Name] |
|
||||
|---|---|---|
|
||||
| [Feature 1] | [Q+1] | [Direct link to their goal or pain point] |
|
||||
| [Feature 2] | [Q+1 / Q+2] | [Direct link] |
|
||||
| [Feature 3] | [H2] | [Direct link] |
|
||||
|
||||
*Talking point: "I've filtered the roadmap to what I think matters most to your team. I'd love your reaction — are these the right priorities from your perspective?"*
|
||||
|
||||
---
|
||||
|
||||
## Slide 8: Next Quarter — Your Goals (10 min)
|
||||
|
||||
**Customer input section — facilitate, don't present.**
|
||||
|
||||
Prompt questions:
|
||||
- "What does success look like for your team in [next quarter]?"
|
||||
- "What's the biggest challenge you're trying to solve in the next 90 days?"
|
||||
- "Is there anything about the way you're using [product] you want to change?"
|
||||
|
||||
**Capture live:**
|
||||
|
||||
| Goal for next quarter | Owner (customer) | How we'll support it | How we'll measure it |
|
||||
|---|---|---|---|
|
||||
| [Goal 1] | [Name] | [CSM / product action] | [Metric] |
|
||||
| [Goal 2] | [Name] | [CSM / product action] | [Metric] |
|
||||
|
||||
---
|
||||
|
||||
## Slide 9: Mutual Commitments (5 min)
|
||||
|
||||
**[Your company] commits to:**
|
||||
1. [Specific action — owner — by when]
|
||||
2. [Specific action — owner — by when]
|
||||
3. [Specific action — owner — by when]
|
||||
|
||||
**[Account Name] commits to:**
|
||||
1. [Specific action — owner — by when]
|
||||
2. [Specific action — owner — by when]
|
||||
|
||||
**Next touchpoint:** [Date of next check-in or mid-quarter review]
|
||||
|
||||
---
|
||||
|
||||
## Slide 10: Thank You + Open Q&A (5 min)
|
||||
|
||||
- Recap the one headline from today: [The single most important thing you want them to remember]
|
||||
- Confirm actions are captured and shared after the call
|
||||
- Ask: "Is there anything we didn't cover today that you wanted to raise?"
|
||||
|
||||
---
|
||||
|
||||
## Preparation Checklist
|
||||
|
||||
- [ ] Usage data pulled and QoQ comparison calculated
|
||||
- [ ] Last QBR goals reviewed — status confirmed before the meeting
|
||||
- [ ] Business outcomes framed in customer language (not product language)
|
||||
- [ ] Roadmap filtered to this account's specific use cases
|
||||
- [ ] Customer's goals for next quarter researched or pre-confirmed with champion
|
||||
- [ ] Executive sponsor briefed on any sensitive topics before the call
|
||||
- [ ] Actions from previous QBR reviewed — any outstanding items addressed
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every slide has a talking point, not just a title
|
||||
- [ ] Value slide leads with business outcomes, not product activity
|
||||
- [ ] Roadmap preview links each item to a customer goal
|
||||
- [ ] Mutual commitments section has real owners on both sides
|
||||
- [ ] Customer has at least 20 minutes of airtime in the agenda
|
||||
@@ -1,13 +1,13 @@
|
||||
{
|
||||
"$schema": "https://anthropic.com/claude-code/plugin.schema.json",
|
||||
"name": "pm-engineering",
|
||||
"version": "2.0.0",
|
||||
"description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer. 10 structured skills for engineering teams and technical PMs.",
|
||||
"version": "4.0.0",
|
||||
"description": "Engineering & tech skills: Code Review Checklist, Incident Postmortem, API Docs Writer, Architecture Decision Record, Debugging Log Analyser, PR Description Writer, System Design Interview, Changelog Generator, Test Strategy Doc, Runbook Writer, CI/CD Playbook, SLO & Error Budget, Developer Onboarding Doc, On-Call Runbook, Security Threat Model, Performance Budget, Database Schema Design, Database Migration Plan, Technical Debt Register, RFC Writer, Capacity Planning, Load Testing Plan, Disaster Recovery Plan, Feature Flag Guide, Dependency Audit, Service Catalog Entry, Monitoring Setup Guide, Local Dev Setup, API Versioning Strategy, Infra-as-Code Review, Engineering Weekly Report, Tech Radar, Sprint Velocity Analysis, Microservices Decomposition, Engineering Hiring Rubric. 35 structured skills for engineering teams, SREs, and technical PMs.",
|
||||
"author": {
|
||||
"name": "Mohit Aggarwal",
|
||||
"email": "mohit15856@gmail.com"
|
||||
},
|
||||
"homepage": "https://github.com/mohitagw15856/pm-claude-skills",
|
||||
"license": "MIT",
|
||||
"keywords": ["product-management", "engineering", "code-review", "incident-postmortem", "api-documentation", "adr", "architecture", "debugging", "pull-request", "system-design", "changelog", "test-strategy", "runbook", "devops"]
|
||||
"keywords": ["product-management", "engineering", "code-review", "incident-postmortem", "api-documentation", "adr", "architecture", "debugging", "pull-request", "system-design", "changelog", "test-strategy", "runbook", "devops", "cicd", "slo", "error-budget", "onboarding", "oncall", "sre", "reliability", "security", "threat-model", "performance", "database", "migration", "technical-debt", "rfc", "capacity-planning", "load-testing", "disaster-recovery", "feature-flags", "dependency-audit", "service-catalog", "monitoring", "observability", "tech-radar", "microservices", "hiring", "velocity"]
|
||||
}
|
||||
|
||||
@@ -0,0 +1,312 @@
|
||||
---
|
||||
name: api-versioning-strategy
|
||||
description: "Write an API versioning strategy document for a service or API platform. Use when asked to define versioning policy, plan API deprecation, classify breaking changes, or document version lifecycle. Produces a complete versioning strategy with breaking-change classification table, deprecation timeline, migration guide template, and client communication template."
|
||||
---
|
||||
|
||||
# API Versioning Strategy
|
||||
|
||||
Produce a complete API versioning strategy document that gives a service team durable, consistent rules for evolving their API without breaking consumers. This document covers the versioning scheme selection (with rationale), lifecycle policy from introduction through sunset, a precise breaking-change classification, and all the communication artifacts a team needs when deprecating a version. Engineers should be able to hand this document to a new team member or external consumer and have them understand exactly what to expect.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **API type** — REST, GraphQL, or gRPC (each has different versioning mechanics)
|
||||
- **Current versioning approach** — URL path (`/v1/`), request header, query parameter, or none; if none, document starts fresh
|
||||
- **Number of existing versions and active consumer count** — needed to size the lifecycle policy and migration scope
|
||||
- **Deprecation timeline constraints** — any hard deadlines (contract SLAs, compliance windows, annual release cycles)
|
||||
- **Consumer type** — internal teams only, external partners, public API, or mix (affects communication channel choices)
|
||||
|
||||
If any input is missing, ask before producing the document. For GraphQL, note that the versioning approach differs substantially (schema evolution over versioning) and tailor the scheme section accordingly.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# API Versioning Strategy: [Service Name]
|
||||
|
||||
**Owner:** [Team Name]
|
||||
**API Type:** [REST / GraphQL / gRPC]
|
||||
**Document Version:** 1.0
|
||||
**Last Reviewed:** [Date]
|
||||
**Next Review:** [Date + 6 months]
|
||||
|
||||
---
|
||||
|
||||
## 1. Versioning Scheme
|
||||
|
||||
### Selected Approach: [URL Path / Request Header / Query Parameter]
|
||||
|
||||
| Scheme | Example | Pros | Cons | Verdict |
|
||||
|--------|---------|------|------|---------|
|
||||
| URL Path | `/v2/orders` | Visible in logs and bookmarks; trivial to route | Violates strict REST resource identity; clutters URL space | **Recommended for public-facing REST APIs** |
|
||||
| `Accept` Header | `Accept: application/vnd.[service].v2+json` | Keeps URLs clean; proper content negotiation | Harder to test in browser; less visible in logs | Recommended for internal APIs with controlled clients |
|
||||
| Query Parameter | `/orders?version=2` | Easy to retrofit without URL restructuring | Often missed in client code; cache-key complications | Acceptable only for read-heavy APIs already in production |
|
||||
| GraphQL Schema Evolution | Field deprecation + `@deprecated` directive | No versioning needed for additive changes | Requires disciplined schema design | **Recommended for GraphQL APIs** |
|
||||
|
||||
**Rationale for [chosen scheme]:** [One paragraph explaining why this scheme fits the API type, consumer type, and operational context provided. Reference the specific inputs — e.g., "Because this API has external partners who integrate via generated clients, URL path versioning provides the most predictable routing behavior and eliminates header negotiation complexity."]
|
||||
|
||||
### Version Format
|
||||
|
||||
```
|
||||
[Base URL]/v{MAJOR}/{resource}
|
||||
|
||||
Examples:
|
||||
https://api.[company].com/v1/orders
|
||||
https://api.[company].com/v2/orders/{id}/items
|
||||
|
||||
Version identifier: integer only (v1, v2, v3)
|
||||
No minor versions in the URL — minor/patch changes are non-breaking and deployed continuously.
|
||||
```
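To illustrate the integer-only rule, the sketch below shows how a router or test helper might extract the major version from a request path. It assumes URL path versioning; the function name `parse_versioned_path` is hypothetical.

```python
import re

# Matches "/v<positive integer>/<rest of path>"; minor versions such as
# "/v1.2/" are rejected because the version identifier is integer only.
VERSION_PATH = re.compile(r"^/v(?P<major>[1-9]\d*)(?P<resource>/.*)$")


def parse_versioned_path(path: str) -> tuple[int, str]:
    """Split a request path into (major version, resource path)."""
    match = VERSION_PATH.match(path)
    if match is None:
        raise ValueError(f"path is not versioned as /v{{MAJOR}}/...: {path!r}")
    return int(match["major"]), match["resource"]
```

Calling `parse_versioned_path("/v2/orders/42/items")` returns `(2, "/orders/42/items")`, while `/v1.2/orders` and unversioned paths raise `ValueError`.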
|
||||
|
||||
---
|
||||
|
||||
## 2. Version Lifecycle Policy
|
||||
|
||||
### Lifecycle Stages
|
||||
|
||||
```
|
||||
STABLE ──────────────────────────────────────────────────►
|
||||
│
|
||||
├─ STABLE Active development, full SLA, new consumers allowed
|
||||
│
|
||||
├─ DEPRECATED Announced, timeline posted, migration docs live.
|
||||
│ New consumers blocked. Existing consumers receive warnings.
|
||||
│
|
||||
├─ SUNSET Requests return HTTP 410 Gone + migration pointer.
|
||||
│ 30-day window before routing is removed.
|
||||
│
|
||||
└─ RETIRED Routing removed, docs archived, no traffic accepted.
|
||||
```
|
||||
|
||||
| Stage | Duration | SLA Applies | New Consumers Allowed | Required Action |
|
||||
|-------|----------|-------------|----------------------|-----------------|
|
||||
| Stable | Until superseded | Yes — full | Yes | None |
|
||||
| Deprecated | [12 months / adjust per constraint] | Yes — degraded acceptable | No | Migrate before sunset date |
|
||||
| Sunset | 30-day window | Best-effort only | No | Migrate immediately |
|
||||
| Retired | Permanent | None | No | — |
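The stage a version is in on a given day follows from two dates: when deprecation was announced and when sunset begins. The sketch below is one possible encoding, assuming the 30-day sunset window from the table; the function and parameter names are illustrative.

```python
from datetime import date, timedelta
from typing import Optional

# Length of the Sunset stage before routing is removed.
SUNSET_WINDOW = timedelta(days=30)


def lifecycle_stage(
    today: date,
    deprecated_on: Optional[date],
    sunset_on: Optional[date],
) -> str:
    """Return the lifecycle stage of an API version on a given day."""
    if deprecated_on is None or today < deprecated_on:
        return "STABLE"
    if sunset_on is None or today < sunset_on:
        return "DEPRECATED"
    if today < sunset_on + SUNSET_WINDOW:
        return "SUNSET"
    return "RETIRED"
```

A gateway or documentation build could call this once per version to decide which banner, headers, or status code applies.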
|
||||
|
||||
**Minimum Stable Period:** A version must remain Stable for at least [6 / 12] months before deprecation can be announced.
|
||||
|
||||
**Maximum Simultaneous Versions:** No more than [2] versions in Stable or Deprecated status at any time. Releasing v3 requires committing to a sunset date for v1 in the same announcement.
|
||||
|
||||
---
|
||||
|
||||
## 3. Breaking vs. Non-Breaking Change Classification
|
||||
|
||||
Apply this table before every API change. If a change is marked Breaking, it requires a new major version. When uncertain, default to Breaking.
|
||||
|
||||
| Change Type | Specific Example | Classification | Rationale |
|
||||
|-------------|-----------------|----------------|-----------|
|
||||
| Remove a response field | Delete `order.legacy_id` from response | **Breaking** | Clients reading this field get a null or missing value and may fail |
|
||||
| Rename a field | `user_name` → `username` | **Breaking** | Clients referencing the old name receive null |
|
||||
| Change field type | `"amount": "10.00"` → `"amount": 10.00` | **Breaking** | Type mismatch at deserialization |
|
||||
| Make optional field required | `email` required in POST body | **Breaking** | Existing callers omitting it receive 400 |
|
||||
| Remove an endpoint | `DELETE /v1/widgets/{id}` removed | **Breaking** | Existing callers receive 404 |
|
||||
| Change HTTP method | `GET /search` → `POST /search` | **Breaking** | Bookmarked or cached GET calls fail |
|
||||
| Change authentication scheme | API key → OAuth2 | **Breaking** | All clients must re-authenticate |
|
||||
| Restructure error response shape | Error JSON schema changed | **Breaking** | Error-handling code misparses responses |
|
||||
| Expand enum values (response) | New `status: "on_hold"` value returned | **Breaking** | Clients that switch on known values without a default branch mishandle the new value |
|
||||
| Change pagination defaults | `page_size` default 20 → 50 | **Breaking** | Response length changes unexpectedly |
|
||||
| Tighten input validation | Max length 100 → 50 | **Breaking** | Previously valid inputs now rejected |
|
||||
| Add new optional field to response | Add `order.tax_breakdown` | Non-Breaking | Well-behaved clients ignore unknown fields |
|
||||
| Add new optional request parameter | Add `?include_archived=true` | Non-Breaking | Ignored by existing clients |
|
||||
| Add a new endpoint | `GET /v1/orders/{id}/audit` | Non-Breaking | No existing client references it |
|
||||
| Relax input validation | Min length 10 → 5 | Non-Breaking | Existing valid inputs remain valid |
|
||||
| Performance or latency improvement | Response time reduced | Non-Breaking | — |
|
||||
| Add new enum value (request-only) | Accept new `type: "express"` | Non-Breaking | Existing values still accepted |
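Part of this table can be checked mechanically. The sketch below compares two response shapes, each given as a flat mapping of field name to type name, and applies the removal, type-change, and addition rows. It is a simplified illustration with hypothetical function names; it does not cover nested objects, enums, validation, or pagination rules.

```python
def classify_response_change(
    old: dict[str, str], new: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Return (field, change, classification) for each difference found."""
    findings = []
    for field, old_type in old.items():
        if field not in new:
            findings.append((field, "removed", "Breaking"))
        elif new[field] != old_type:
            findings.append((field, "type changed", "Breaking"))
    for field in new:
        if field not in old:
            findings.append((field, "added", "Non-Breaking"))
    return findings


def requires_new_major(findings: list[tuple[str, str, str]]) -> bool:
    """A single Breaking finding is enough to require a new major version."""
    return any(classification == "Breaking" for _, _, classification in findings)
```

Note that a rename surfaces here as one removal plus one addition, which still yields a Breaking result, consistent with the table.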
|
||||
|
||||
---
|
||||
|
||||
## 4. Deprecation Process
|
||||
|
||||
### Step-by-Step Deprecation Checklist
|
||||
|
||||
- [ ] **T-0 (Decision day):** Engineering lead approves deprecation. New version confirmed Stable. Sunset date set.
|
||||
- [ ] **T-0:** Update API docs — add deprecation banner to all v[N] endpoint pages.
|
||||
- [ ] **T-0:** Add `Deprecation` and `Sunset` response headers to all v[N] responses (see format below).
|
||||
- [ ] **T-0:** Block new consumer onboarding for v[N] in API gateway and developer portal.
|
||||
- [ ] **T-0:** Send initial deprecation notice to all registered consumers (see Section 5 template).
|
||||
- [ ] **T-0:** Open tracking issue in engineering backlog linking all known consumers to their migration status.
|
||||
- [ ] **Sunset minus 30 days:** Send 30-day warning to all consumers still sending v[N] traffic.
|
||||
- [ ] **Sunset minus 7 days:** Send final warning. If consumer traffic > 100 req/day, escalate directly to their engineering lead.
|
||||
- [ ] **Sunset date:** Switch v[N] routing to return `HTTP 410 Gone` with body pointing to migration guide.
|
||||
- [ ] **Sunset plus 30 days:** Remove routing rules. Archive documentation. Close tracking issue.
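The date-relative steps in this checklist are all offsets from the sunset date, so they can be derived once and reused in notices and tracking issues. A minimal sketch, with hypothetical function and key names:

```python
from datetime import date, timedelta


def deprecation_schedule(sunset: date) -> dict[str, date]:
    """Derive the checklist milestones that are relative to the sunset date."""
    return {
        "thirty_day_warning": sunset - timedelta(days=30),
        "final_warning": sunset - timedelta(days=7),
        "sunset": sunset,
        "routing_removed": sunset + timedelta(days=30),
    }


def needs_direct_escalation(requests_per_day: int) -> bool:
    """Final-warning rule: escalate when traffic exceeds 100 requests/day."""
    return requests_per_day > 100
```

For a sunset date of 1 January 2027, this yields a 30-day warning on 2 December 2026, a final warning on 25 December 2026, and routing removal on 31 January 2027.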
|
||||
|
||||
### Deprecation Response Headers
|
||||
|
||||
```http
|
||||
HTTP/1.1 200 OK
|
||||
Deprecation: true
|
||||
Sunset: Fri, 01 Jan 2027 00:00:00 GMT
|
||||
Link: <https://docs.[company].com/api/migration/v1-to-v2>; rel="successor-version"
|
||||
```
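Hand-written HTTP dates are easy to get wrong (the weekday must match the date), so it may be safer to generate these headers. The sketch below uses Python's standard `email.utils.format_datetime`; the function name `deprecation_headers` and the example URL are assumptions, and the `Deprecation: true` value simply mirrors the example above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(sunset_at: datetime, migration_url: str) -> dict[str, str]:
    """Build the response headers added to every deprecated-version response."""
    if sunset_at.tzinfo is None:
        raise ValueError("sunset_at must be timezone-aware")
    return {
        "Deprecation": "true",
        # usegmt=True emits the HTTP-date form ending in "GMT".
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
        "Link": f'<{migration_url}>; rel="successor-version"',
    }


headers = deprecation_headers(
    datetime(2027, 1, 1, tzinfo=timezone.utc),
    "https://docs.example.com/api/migration/v1-to-v2",  # hypothetical URL
)
```

A gateway plugin or middleware could merge this dictionary into each v[N] response once the version enters the Deprecated stage.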
|
||||
|
||||
### Sunset Response Body
|
||||
|
||||
```http
|
||||
HTTP/1.1 410 Gone
|
||||
Content-Type: application/json
|
||||
|
||||
```
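The body of the sunset response is not shown above. As a purely illustrative sketch, a handler could return a small JSON document that names the retired version and points to the migration guide. The field names `error`, `message`, and `migration_guide` are assumptions, not a documented schema.

```python
import json


def sunset_response(version: int, migration_url: str) -> tuple[int, dict[str, str], str]:
    """Return (status, headers, body) for a request to a sunset API version."""
    body = {
        # All field names here are hypothetical; adapt them to the
        # service's existing error response shape.
        "error": "api_version_sunset",
        "message": f"API v{version} has been sunset. Migrate to v{version + 1}.",
        "migration_guide": migration_url,
    }
    return 410, {"Content-Type": "application/json"}, json.dumps(body)
```

Keeping the body in the same shape as the service's other error responses means existing client error handling can surface the migration pointer without extra work.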
|
||||
|
||||
---
|
||||
|
||||
## 5. Client Communication Templates
|
||||
|
||||
### Initial Deprecation Notice
|
||||
|
||||
```
|
||||
Subject: [Action Required] [Service Name] API v[N] Deprecation — Sunset [Date]
|
||||
|
||||
Hi [Team / Partner Name],
|
||||
|
||||
We are deprecating [Service Name] API v[N], effective [Sunset Date].
|
||||
|
||||
What this means for you:
|
||||
- v[N] continues to work normally until [Sunset Date]
|
||||
- After [Sunset Date], all v[N] requests return HTTP 410 Gone
|
||||
- v[N+1] is available today and fully stable
|
||||
|
||||
Your current usage: approximately [X] requests/day as of [Date].
|
||||
Estimated migration effort: [Small: < 1 day | Medium: 1–3 days | Large: 3–10 days]
|
||||
|
||||
Migration resources:
|
||||
Migration guide: [URL]
|
||||
Changelog: [URL]
|
||||
Office hours: [Date/Time/Link]
|
||||
Support: [Slack channel or email]
|
||||
|
||||
Key dates:
|
||||
[Date] Deprecation announced (today)
|
||||
[Date] New consumer onboarding blocked for v[N]
|
||||
[Date] 30-day warning sent to remaining consumers
|
||||
[Sunset Date] v[N] returns 410 Gone
|
||||
|
||||
Reply to this message or contact us at [channel] with questions.
|
||||
|
||||
[Your Name], [Team Name]
|
||||
```
|
||||
|
||||
### 30-Day Warning
|
||||
|
||||
```
|
||||
Subject: [30 Days Remaining] [Service Name] API v[N] sunsets [Date]
|
||||
|
||||
Hi [Team / Partner Name],
|
||||
|
||||
[Service Name] API v[N] sunsets in 30 days on [Date].
|
||||
|
||||
Your current v[N] traffic: [X] requests/day — migration is not yet complete.
|
||||
|
||||
If you have a technical blocker requiring an extension, contact us before
|
||||
[Date minus 14 days]. Extensions require a documented blocker and a committed
|
||||
migration completion date.
|
||||
|
||||
Migration guide: [URL] | Support: [channel]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Migration Guide Template
|
||||
|
||||
Publish one migration guide per version transition at `docs.[company].com/api/migration/v[N]-to-v[N+1]`.
|
||||
|
||||
```markdown
|
||||
# Migration Guide: v[N] → v[N+1]
|
||||
|
||||
**Estimated effort:** [Small: < 1 day | Medium: 1–3 days | Large: 3–10 days]
|
||||
**Breaking changes in this guide:** [count]
|
||||
|
||||
## Quick Start
|
||||
|
||||
Update your base URL:
|
||||
Before: https://api.[company].com/v[N]/
|
||||
After: https://api.[company].com/v[N+1]/
|
||||
|
||||
## Breaking Changes
|
||||
|
||||
### 1. [Field Rename: user_name → username]
|
||||
|
||||
**Affected endpoints:** `GET /users/{id}`, `POST /users`
|
||||
|
||||
Before (v[N]):
|
||||
{ "user_name": "alice" }
|
||||
|
||||
After (v[N+1]):
|
||||
{ "username": "alice" }
|
||||
|
||||
Migration: Replace all references to `user_name` with `username` in request
|
||||
builders and response parsers.
|
||||
|
||||
### 2. [Next breaking change — repeat structure]
|
||||
|
||||
## New Capabilities in v[N+1]
|
||||
|
||||
| Feature | Description | Docs |
|
||||
|---------|-------------|------|
|
||||
| [Feature name] | [Brief description] | [Link] |
|
||||
|
||||
## SDK Upgrade Reference
|
||||
|
||||
| Language | Package | v[N+1] Version | Install Command |
|
||||
|----------|---------|----------------|-----------------|
|
||||
| Python | `[company]-sdk` | `2.0.0` | `pip install [company]-sdk==2.0.0` |
|
||||
| Node.js | `@[company]/sdk` | `2.0.0` | `npm install @[company]/sdk@2.0.0` |
|
||||
| Go | `github.com/[company]/sdk-go/v2` | `v2.0.0` | `go get github.com/[company]/sdk-go/v2@v2.0.0` |
|
||||
| Java | `com.[company]:sdk` | `2.0.0` | Update pom.xml / build.gradle |
|
||||
|
||||
## Migration Validation Checklist
|
||||
|
||||
- [ ] Base URL updated to v[N+1]
|
||||
- [ ] All renamed fields updated in request serializers
|
||||
- [ ] All renamed fields updated in response deserializers
|
||||
- [ ] Error-handling code updated for new error shape
|
||||
- [ ] Integration tests passing against v[N+1] in staging
|
||||
- [ ] Load test completed against v[N+1] — latency within acceptable range
|
||||
- [ ] Rollback plan documented if issues arise post-cutover
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Version-Specific Documentation
|
||||
|
||||
- Maintain separate documentation pages for each Stable and Deprecated version.
|
||||
- Deprecated version docs carry a persistent banner: "This version is deprecated. Sunset date: [Date]. [Migrate to v[N+1]]."
|
||||
- OpenAPI specs, Protobuf definitions, or GraphQL schemas are tagged and archived per version in the repository under `/api/v[N]/`.
|
||||
- A root-level CHANGELOG.md records every breaking and non-breaking change by version — not buried in commit history.
|
||||
|
||||
---
|
||||
|
||||
## 8. SDK Versioning Alignment
|
||||
|
||||
| API Version | SDK Major Version | SDK GA Date | SDK EOL Date |
|-------------|------------------|-------------|--------------|
| v[1] | 1.x | [Date] | [API Sunset + 90 days] |
| v[2] | 2.x | [Date] | Active |
|
||||
|
||||
- SDK major versions align 1:1 with API major versions.
|
||||
- SDK minor versions track non-breaking API additions.
|
||||
- SDK EOL dates trail API sunset dates by 90 days to give consumers extra runway.
|
||||
- SDKs emit a runtime deprecation warning log line when the underlying API version is Deprecated.
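
The last two rules above can be sketched in a few lines: the SDK EOL date is the API sunset date plus 90 days, and a deprecated API version produces a warning log line at runtime. The logger name, the `API_STATUS` table, and the function names are hypothetical. A real SDK would take the lifecycle status from its own configuration or from the `Deprecation` response header.

```python
import logging
from datetime import date, timedelta
from typing import Optional

logger = logging.getLogger("example_sdk")  # hypothetical logger name

# Hypothetical lifecycle table; a real SDK would not hard-code this.
API_STATUS = {"v1": "deprecated", "v2": "stable"}


def sdk_eol_date(api_sunset: date) -> date:
    """SDK end-of-life trails the API sunset by 90 days."""
    return api_sunset + timedelta(days=90)


def deprecation_notice(api_version: str, api_sunset: date) -> Optional[str]:
    """Log and return a warning when the target API version is deprecated."""
    if API_STATUS.get(api_version) != "deprecated":
        return None
    message = (
        f"API {api_version} is deprecated and sunsets on {api_sunset.isoformat()}; "
        f"this SDK line reaches end of life on {sdk_eol_date(api_sunset).isoformat()}."
    )
    logger.warning(message)
    return message


notice = deprecation_notice("v1", date(2027, 1, 1))
```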
|
||||
|
||||
---
|
||||
|
||||
*Strategy authored by [Team Name] — questions to [Slack channel or email]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Versioning scheme recommendation includes explicit rationale tied to the API type and consumer type provided — not a generic recommendation
|
||||
- [ ] Breaking-change table covers at minimum: field removal, field rename, type change, making optional field required, endpoint removal, enum expansion, and default value change
|
||||
- [ ] Deprecation timeline durations are filled in with concrete values, not left as abstract placeholders
|
||||
- [ ] All three communication artifacts are present: initial deprecation notice, 30-day warning, and migration guide template
|
||||
- [ ] Deprecation response headers (`Deprecation`, `Sunset`, `Link`) use a valid HTTP-date for `Sunset` (weekday matching the calendar date) and a real URL structure
|
||||
- [ ] SDK versioning alignment table is present and ties SDK major versions explicitly to API major versions
|
||||
- [ ] Maximum simultaneous supported versions is stated with a concrete number
|
||||
@@ -0,0 +1,358 @@
|
||||
---
|
||||
name: capacity-planning
|
||||
description: "Produce a capacity planning document for a service covering traffic forecasts, resource requirements, and scaling strategy. Use when asked to plan infrastructure capacity, forecast resource needs, model traffic growth, define scaling strategy, or produce a capacity review for a service. Produces a structured capacity plan covering current baseline metrics, growth projections, resource requirements per tier, scaling strategy, cost projections, capacity triggers, and an infrastructure action roadmap."
|
||||
---
|
||||
|
||||
# Capacity Planning Skill
|
||||
|
||||
Produce a complete capacity planning document for a service. Capacity planning is not about predicting the future exactly — it is about understanding current headroom, modelling growth, and ensuring the team takes infrastructure action before a constraint becomes an incident.
|
||||
|
||||
A good capacity plan answers: what is running out first, how long before it runs out, what does it cost to fix it, and who decides when to act.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and description** — what the service does and who depends on it
|
||||
- **Current traffic and usage metrics** — requests per second (or per day), active users, data volume — whatever units are most natural for this service
|
||||
- **Current resource utilisation** — CPU %, memory %, disk usage, connection pool utilisation, DB query throughput
|
||||
- **Growth rate or projections** — historical growth rate, or known upcoming events (product launch, sales cycle, seasonal peak)
|
||||
- **Tech stack and infrastructure** — cloud provider, compute type (VMs, containers, serverless), database, caching layer, CDN
|
||||
- **Cost constraints** — current infrastructure spend, acceptable cost ceiling, or target cost per unit of traffic
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Capacity Plan: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Last updated:** [Date]
|
||||
**Planning horizon:** [12 months — [Month Year] to [Month Year]]
|
||||
**Review cadence:** [Quarterly]
|
||||
|
||||
---
|
||||
|
||||
## 1. Executive Summary
|
||||
|
||||
[3–5 sentences covering: current state, the most critical capacity constraint, the timeline before it becomes a risk, the recommended action, and the cost implication. Written for an engineering manager or VP who needs the key facts without reading the full document.]
|
||||
|
||||
**Critical finding:** [e.g. "The database connection pool will reach 90% utilisation within 6 weeks at current growth. Without action, this will cause request queueing and latency spikes under normal traffic."]
|
||||
|
||||
**Recommended immediate action:** [e.g. "Increase connection pool limit and add a read replica within the next 2 weeks."]
|
||||
|
||||
**Estimated cost impact:** [e.g. "Recommended changes add ~$[X]/month to infrastructure spend."]
|
||||
|
||||
---
|
||||
|
||||
## 2. Current Baseline
|
||||
|
||||
*All metrics are 30-day averages unless noted. Date captured: [Date]*
|
||||
|
||||
### Traffic
|
||||
|
||||
| Metric | Value | Peak (7-day) | Notes |
|
||||
|---|---|---|---|
|
||||
| Requests per second (avg) | [X req/s] | [X req/s] | [Peak time / day of week] |
|
||||
| Requests per day | [X M/day] | [X M/day] | — |
|
||||
| Active users (DAU/MAU) | [X] / [X] | — | — |
|
||||
| [Service-specific metric — e.g. jobs processed/hour] | [X] | [X] | — |
|
||||
| [Service-specific metric — e.g. GB ingested/day] | [X GB] | [X GB] | — |
|
||||
|
||||
### Compute
|
||||
|
||||
| Resource | Current utilisation | Instance type | Count | Notes |
|
||||
|---|---|---|---|---|
|
||||
| CPU (avg) | [X%] | [e.g. c5.2xlarge] | [X] | Peak: [X%] |
|
||||
| Memory (avg) | [X%] | — | — | Peak: [X%] |
|
||||
| Network egress | [X Mbps] | — | — | — |
|
||||
| Container / pod count | [X] | [e.g. 2 vCPU / 4 GB] | — | Auto-scaling range: [X–Y] |
|
||||
|
||||
### Database
|
||||
|
||||
| Resource | Current utilisation | Spec | Notes |
|
||||
|---|---|---|---|
|
||||
| CPU | [X%] | [e.g. db.r5.2xlarge] | Peak: [X%] |
|
||||
| Memory | [X%] | [X GB RAM] | — |
|
||||
| Storage used | [X GB] of [Y GB] ([Z%]) | [X GB provisioned] | Growth: [~X GB/month] |
|
||||
| IOPS (avg) | [X] of [Y provisioned] | [Y IOPS] | Peak: [X IOPS] |
|
||||
| Connection pool | [X] of [Y max] ([Z%]) | Max connections: [Y] | [ORM pool size: X] |
|
||||
| Query P99 latency | [X ms] | — | [Slowest query: X] |
|
||||
| Read/write ratio | [X%] reads / [Y%] writes | — | — |
|
||||
|
||||
### Cache
|
||||
|
||||
| Resource | Current utilisation | Spec | Notes |
|
||||
|---|---|---|---|
|
||||
| Memory used | [X GB] of [Y GB] ([Z%]) | [e.g. cache.r6g.large] | Eviction rate: [X%] |
|
||||
| Hit rate | [X%] | — | Miss rate: [Y%] |
|
||||
| Connections | [X] | Max: [Y] | — |
|
||||
|
||||
### Storage / Object Store
|
||||
|
||||
| Resource | Current usage | Growth rate | Notes |
|
||||
|---|---|---|---|
|
||||
| [S3 / GCS / Blob] | [X GB / TB] | [~X GB/month] | [Lifecycle policies in place? Y/N] |
|
||||
| Disk (if applicable) | [X GB] of [Y GB] | [~X GB/month] | [RAID / EBS type] |
|
||||
|
||||
### Cost Baseline
|
||||
|
||||
| Component | Current monthly cost | % of total |
|
||||
|---|---|---|
|
||||
| Compute (app servers) | $[X] | [X%] |
|
||||
| Database | $[X] | [X%] |
|
||||
| Cache | $[X] | [X%] |
|
||||
| Storage | $[X] | [X%] |
|
||||
| CDN / bandwidth | $[X] | [X%] |
|
||||
| Other ([describe]) | $[X] | [X%] |
|
||||
| **Total** | **$[X]** | 100% |
|
||||
|
||||
**Unit economics:** $[X] per [1,000 requests / 1,000 users / GB processed]
|
||||
|
||||
---
|
||||
|
||||
## 3. Growth Projections
|
||||
|
||||
### Assumptions
|
||||
|
||||
| Assumption | Value | Source | Confidence |
|
||||
|---|---|---|---|
|
||||
| Monthly traffic growth rate | [X%] | [Historical trend / product forecast] | [High / Medium / Low] |
|
||||
| Seasonal peak factor | [+X% in [month(s)]] | [Last year's data / expected launch] | [High / Medium] |
|
||||
| Upcoming events | [e.g. Marketing campaign — [Month], expected +[X]% traffic spike] | [Marketing plan] | [Medium] |
|
||||
| User growth | [X new users/month] | [Sales pipeline / growth model] | [Medium] |
|
||||
| Data growth | [X GB/month] | [Current trend] | [High] |
|
||||
|
||||
### Traffic Forecast
|
||||
|
||||
| Timeframe | Req/s (avg) | Req/s (peak) | DAU | Data volume (cumulative) |
|
||||
|---|---|---|---|---|
|
||||
| **Now** (baseline) | [X] | [X] | [X] | [X GB/TB] |
|
||||
| **+3 months** | [X] | [X] | [X] | [X GB/TB] |
|
||||
| **+6 months** | [X] | [X] | [X] | [X GB/TB] |
|
||||
| **+12 months** | [X] | [X] | [X] | [X GB/TB] |
|
||||
|
||||
*Growth formula: [Baseline] × (1 + [monthly rate])^[months] + seasonal adjustment*
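
The growth formula can be evaluated directly to fill the forecast rows. The sketch below assumes a constant monthly rate and treats the seasonal adjustment as an additive term, as the formula states. The baseline of 400 req/s and the 5% rate are made-up inputs.

```python
def forecast(baseline: float, monthly_rate: float, months: int,
             seasonal_adjustment: float = 0.0) -> float:
    """Baseline x (1 + monthly rate)^months + seasonal adjustment."""
    return baseline * (1 + monthly_rate) ** months + seasonal_adjustment


baseline_rps = 400.0   # hypothetical current average req/s
monthly_rate = 0.05    # hypothetical 5% month-over-month growth

for months in (0, 3, 6, 12):
    projected = forecast(baseline_rps, monthly_rate, months)
    print(f"+{months:>2} months: {projected:7.1f} req/s")
```

Because growth compounds, a 5% monthly rate gives roughly 80% growth over 12 months rather than 60%, which is why the forecast table should be computed rather than estimated linearly.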
|
||||
|
||||
### Capacity Headroom Analysis
|
||||
|
||||
**When does each resource run out at current utilisation and projected growth?**
|
||||
|
||||
| Resource | Current utilisation | Safe ceiling | Headroom remaining | Months to ceiling |
|
||||
|---|---|---|---|---|
|
||||
| App CPU | [X%] | 70% | [X%] | [X months] |
|
||||
| App memory | [X%] | 80% | [X%] | [X months] |
|
||||
| DB CPU | [X%] | 70% | [X%] | [X months] |
|
||||
| DB storage | [X GB] of [Y GB] | 80% = [Z GB] | [X GB] | [X months] |
|
||||
| DB IOPS | [X] of [Y] | 80% = [Z] | [X IOPS] | [X months] |
|
||||
| DB connections | [X] of [Y] | 80% = [Z] | [X] | [X months] |
|
||||
| Cache memory | [X GB] of [Y GB] | 75% = [Z GB] | [X GB] | [X months] |
|
||||
| Storage (object) | [X TB] | No hard limit — cost trigger | — | [Cost trigger: $X/month] |
|
||||
|
||||
**Red flags** (resources hitting ceiling within 3 months):
|
||||
- [Resource]: [current]% → ceiling in [X weeks] — **Action required**
|
||||
- [Resource]: [current]% → ceiling in [X weeks] — **Action required**
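
The "Months to ceiling" column can be derived by inverting the compound-growth formula. The sketch below assumes utilisation grows at the same compound monthly rate as traffic, which is a simplification, since real resources rarely scale perfectly linearly with load. The resource names and numbers are illustrative.

```python
import math


def months_to_ceiling(current: float, ceiling: float, monthly_rate: float) -> float:
    """Months until `current` reaches `ceiling` under compound monthly growth."""
    if current >= ceiling:
        return 0.0
    if monthly_rate <= 0:
        return math.inf  # no growth: the ceiling is never reached
    return math.log(ceiling / current) / math.log(1 + monthly_rate)


def red_flags(resources: dict, monthly_rate: float, horizon_months: float = 3.0) -> list[str]:
    """Names of resources projected to hit their ceiling within the horizon."""
    return [
        name
        for name, (current, ceiling) in resources.items()
        if months_to_ceiling(current, ceiling, monthly_rate) < horizon_months
    ]


# (current utilisation %, safe ceiling %): hypothetical values
resources = {"App CPU": (52.0, 70.0), "DB connections": (74.0, 80.0)}
print(red_flags(resources, monthly_rate=0.05))
```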
|
||||
|
||||
---
|
||||
|
||||
## 4. Resource Requirements
|
||||
|
||||
### Compute Requirements
|
||||
|
||||
| Timeframe | Required instances | Recommended instance type | Auto-scaling range | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Now | [X] | [type] | [min: X, max: Y] | Current configuration |
|
||||
| +3 months | [X] | [type] | [min: X, max: Y] | [Any instance type change needed?] |
|
||||
| +6 months | [X] | [type or upgrade] | [min: X, max: Y] | [Consider [larger type / horizontal scale]] |
|
||||
| +12 months | [X] | [type or upgrade] | [min: X, max: Y] | [State of horizontal vs vertical decision] |
|
||||
|
||||
**Memory headroom target:** Maintain ≥30% available memory at average load; ≥20% at peak.
|
||||
**CPU headroom target:** Maintain ≥30% available CPU at average load; ≥15% at peak.
|
||||
|
||||
### Database Requirements
|
||||
|
||||
| Timeframe | Instance type | Storage | IOPS | Read replica | Notes |
|
||||
|---|---|---|---|---|---|
|
||||
| Now | [type] | [X GB] | [X] | [Y/N] | Current |
|
||||
| +3 months | [type] | [X GB] | [X] | [Y/N] | [Upgrade storage / IOPS] |
|
||||
| +6 months | [type or upgrade] | [X GB] | [X] | **Yes** | [Read replica recommended by this point] |
|
||||
| +12 months | [type] | [X GB] | [X] | [X replicas] | [Consider sharding / partitioning at this scale] |
|
||||
|
||||
**Storage growth management:**
|
||||
- Current growth: [~X GB/month]
|
||||
- Storage auto-scaling: [Enabled / Not enabled — enable by [date]]
|
||||
- Archiving policy: [Records older than X months moved to [cold storage / archive tier]]
|
||||
|
||||
### Cache Requirements
|
||||
|
||||
| Timeframe | Node type | Nodes | Memory | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Now | [type] | [X] | [X GB] | Current |
|
||||
| +6 months | [type] | [X] | [X GB] | [Scale out or upgrade] |
|
||||
| +12 months | [type] | [X] | [X GB] | [Cluster mode if >Y GB required] |
|
||||
|
||||
---
|
||||
|
||||
## 5. Scaling Strategy
|
||||
|
||||
### Compute — Horizontal Scaling
|
||||
|
||||
**Decision: [Horizontal / Vertical / Both]**
|
||||
|
||||
[State the scaling strategy and the reasoning. E.g. "The application is stateless and CPU-bound; horizontal scaling is preferred. Vertical scaling is a short-term fallback only."]
|
||||
|
||||
**Auto-scaling configuration:**
|
||||
|
||||
```
Scale-out trigger: CPU > [X%] for [Y minutes] OR memory > [X%] for [Y minutes]
Scale-in trigger: CPU < [X%] for [Y minutes] AND memory < [X%] for [Y minutes]
Min instances: [X] (ensures HA across [X] AZs)
Max instances: [Y] (cost ceiling)
Cooldown period: [X seconds]
Warmup time: [X seconds] (time for new instance to be healthy)
```
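
The asymmetry in the configuration above (OR for scale-out, AND for scale-in) is easy to get wrong, so a small sketch of the decision logic may help. It assumes the CPU and memory figures are already averaged over the trigger window, changes the count one instance at a time, and leaves out cooldown and warmup handling. The class, the function, and all threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_cpu: float
    scale_out_mem: float
    scale_in_cpu: float
    scale_in_mem: float
    min_instances: int
    max_instances: int


def desired_instances(policy: ScalingPolicy, current: int, cpu: float, mem: float) -> int:
    """Return the next instance count for window-averaged CPU and memory (%)."""
    # Scale out if EITHER resource is hot.
    if cpu > policy.scale_out_cpu or mem > policy.scale_out_mem:
        return min(current + 1, policy.max_instances)
    # Scale in only if BOTH resources are cold.
    if cpu < policy.scale_in_cpu and mem < policy.scale_in_mem:
        return max(current - 1, policy.min_instances)
    return current


policy = ScalingPolicy(70, 80, 30, 40, min_instances=3, max_instances=12)
print(desired_instances(policy, current=4, cpu=75, mem=50))
```

Requiring both metrics to be low before scaling in avoids removing capacity from a service that is idle on CPU but still memory-bound.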
|
||||
|
||||
**Limits of horizontal scaling:**
|
||||
- [e.g. Database connection pool is the current bottleneck — adding more app instances without increasing DB connections will not help]
|
||||
- [e.g. Session affinity required for WebSocket connections — limits pure stateless scaling]
|
||||
|
||||
### Database — Read Scaling
|
||||
|
||||
**Strategy:** [Read replica / Connection pooling via PgBouncer / Query caching / None needed yet]
|
||||
|
||||
**When to add a read replica:**
|
||||
- DB CPU sustained >60% for >30 minutes, OR
|
||||
- Read query P95 latency >50ms, OR
|
||||
- Connection pool utilisation >70%
|
||||
|
||||
**Connection pooling:**
|
||||
- Pooler: [PgBouncer / RDS Proxy / application-level / not configured]
|
||||
- Pool size: [X connections per app instance × Y instances = Z total]
|
||||
- Max DB connections: [configured to Z + 20% headroom]
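
The pool-size arithmetic above is worth checking against the top of the auto-scaling range, not just today's instance count. The sketch below assumes a fixed per-instance pool and the 20% headroom stated above. Sizing against the maximum instance count is a conservative assumption, and the numbers are examples.

```python
import math


def required_db_connections(pool_per_instance: int, instances: int,
                            headroom_percent: int = 20) -> tuple[int, int]:
    """Return (total app connections, max DB connections including headroom)."""
    total = pool_per_instance * instances
    max_connections = total + math.ceil(total * headroom_percent / 100)
    return total, max_connections


# Hypothetical: 10 connections per instance, auto-scaling max of 12 instances.
total, max_connections = required_db_connections(10, 12)
print(f"app pools: {total}, configure DB max connections >= {max_connections}")
```

If the result exceeds what the database instance class supports, that is the signal to introduce a pooler before raising the auto-scaling maximum.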
|
||||
|
||||
### Caching Strategy
|
||||
|
||||
**Cache policy:** [Cache-aside / Write-through / Write-behind]
|
||||
**TTL strategy:**
|
||||
|
||||
| Data type | TTL | Invalidation method |
|
||||
|---|---|---|
|
||||
| [e.g. User profile] | [5 minutes] | [Explicit invalidation on update] |
|
||||
| [e.g. Product catalog] | [1 hour] | [TTL expiry — eventual consistency acceptable] |
|
||||
| [e.g. Session data] | [24 hours] | [Explicit invalidation on logout] |
|
||||
|
||||
**Cache miss handling:** [Describe what happens on a cache miss — does it fall through gracefully or cause a thundering herd risk?]
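
One common way to limit thundering-herd risk on a miss is to let a single caller per key load the value while the others wait for it. The sketch below is a minimal in-process version of that idea for a cache-aside lookup. It has no TTL or eviction, and it is not a drop-in replacement for a distributed cache. The class and loader names are hypothetical.

```python
import threading
import time


class SingleFlightCache:
    """Cache-aside lookup where only one caller per key runs the loader."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._values:          # fast path: cache hit
            return self._values[key]
        with self._guard:                # one lock object per key
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the value while we waited.
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]


load_calls = 0


def slow_loader(key):
    """Stand-in for a database read; counts how often it is invoked."""
    global load_calls
    load_calls += 1
    time.sleep(0.05)
    return f"value-for-{key}"


cache = SingleFlightCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("user:42",)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(f"loader invoked {load_calls} time(s) for 8 concurrent misses")
```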
|
||||
|
||||
---
|
||||
|
||||
## 6. Cost Projections
|
||||
|
||||
### Infrastructure Cost Forecast
|
||||
|
||||
| Component | Now (monthly) | +3 months | +6 months | +12 months |
|
||||
|---|---|---|---|---|
|
||||
| Compute | $[X] | $[X] | $[X] | $[X] |
|
||||
| Database | $[X] | $[X] | $[X] | $[X] |
|
||||
| Cache | $[X] | $[X] | $[X] | $[X] |
|
||||
| Storage | $[X] | $[X] | $[X] | $[X] |
|
||||
| CDN / bandwidth | $[X] | $[X] | $[X] | $[X] |
|
||||
| **Total** | **$[X]** | **$[X]** | **$[X]** | **$[X]** |
|
||||
| MoM growth % | — | [X%] | [X%] | [X%] |
|
||||
|
||||
**Unit economics trend:**
|
||||
|
||||
| Timeframe | Cost per 1k requests | Cost per user/month | Notes |
|
||||
|---|---|---|---|
|
||||
| Now | $[X] | $[X] | Baseline |
|
||||
| +6 months | $[X] | $[X] | [Improving / worsening — why] |
|
||||
| +12 months | $[X] | $[X] | [Target: $X per 1k requests] |
|
||||
|
||||
**Cost optimisation opportunities:**
|
||||
|
||||
| Opportunity | Estimated saving | Effort | Timeline |
|
||||
|---|---|---|---|
|
||||
| [e.g. Reserved instances for baseline compute] | $[X/month] | Low | Immediate |
|
||||
| [e.g. S3 lifecycle policy — move objects >90 days to Glacier] | $[X/month] | Low | This sprint |
|
||||
| [e.g. Right-size [instance] — current is overprovisioned] | $[X/month] | Low | This sprint |
|
||||
| [e.g. Optimise top-5 slow queries — reduce DB compute need] | $[X/month] | Medium | Next quarter |
|
||||
|
||||
---
|
||||
|
||||
## 7. Capacity Triggers and Actions
|
||||
|
||||
Define the thresholds that require explicit action — not retrospective fixes after an incident.
|
||||
|
||||
| Resource | Watch (amber) | Act (red — schedule work) | Emergency (incident risk) |
|---|---|---|---|
| App CPU (sustained avg) | >60% | >70% | >85% |
| App memory | >70% | >80% | >90% |
| DB CPU | >55% | >65% | >80% |
| DB storage | >65% | >75% | >85% |
| DB connections | >60% | >70% | >85% |
| Cache memory / eviction | Hit rate <90% | Hit rate <85% | Hit rate <75% |
| Error rate | >0.5% | >1% | >2% |
| P99 latency | >2× baseline | >3× baseline | >5× baseline |
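
A threshold table like this is straightforward to encode so that dashboards and alerts share one definition. The sketch below covers only the percentage-based rows, using the example values from the table. Cache hit rate, error rate, and latency are left out because their comparisons differ. The dictionary keys and function name are hypothetical.

```python
# (watch, act, emergency) thresholds in percent, taken from the table above.
THRESHOLDS = {
    "app_cpu": (60, 70, 85),
    "app_memory": (70, 80, 90),
    "db_cpu": (55, 65, 80),
    "db_storage": (65, 75, 85),
    "db_connections": (60, 70, 85),
}


def classify(resource: str, utilisation_pct: float) -> str:
    """Map a utilisation reading to ok / watch / act / emergency."""
    watch, act, emergency = THRESHOLDS[resource]
    if utilisation_pct > emergency:
        return "emergency"
    if utilisation_pct > act:
        return "act"
    if utilisation_pct > watch:
        return "watch"
    return "ok"


for resource, reading in [("app_cpu", 58), ("db_cpu", 66), ("app_memory", 91)]:
    print(f"{resource}: {reading}% -> {classify(resource, reading)}")
```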
|
||||
|
||||
**When a Watch threshold is crossed:**
|
||||
- Engineer who observes it creates a ticket with capacity label
|
||||
- Ticket reviewed in next sprint planning
|
||||
|
||||
**When an Act threshold is crossed:**
|
||||
- On-call engineer creates a ticket marked P2
|
||||
- Tech lead reviews within 24 hours
|
||||
- Action plan documented and scheduled within 1 sprint
|
||||
|
||||
**When an Emergency threshold is crossed:**
|
||||
- Treat as a potential incident — page on-call
|
||||
- Emergency scaling actions taken immediately (see runbook)
|
||||
- Root cause investigation starts within 2 hours
|
||||
|
||||
**Emergency scaling runbook:** [Link to oncall-runbook for capacity incidents]
|
||||
|
||||
---
|
||||
|
||||
## 8. Infrastructure Action Roadmap
|
||||
|
||||
### Immediate Actions (next 2 weeks)
|
||||
|
||||
| Action | Owner | Effort | Justification |
|
||||
|---|---|---|---|
|
||||
| [e.g. Increase DB connection pool limit to X] | [Name] | [2 hours] | [DB connections at X% — hitting ceiling in X weeks] |
|
||||
| [e.g. Enable storage auto-scaling on RDS] | [Name] | [30 min] | [Storage at X% — prevents emergency at X months] |
|
||||
| [e.g. Add S3 lifecycle policy for [bucket]] | [Name] | [1 hour] | [Storage growing at $X/month unnecessarily] |
|
||||
|
||||
### This Quarter (within 3 months)
|
||||
|
||||
| Action | Owner | Effort | Justification |
|
||||
|---|---|---|---|
|
||||
| [e.g. Add read replica to production DB] | [Name] | [1 day] | [DB CPU projected to hit 65% in 2 months] |
|
||||
| [e.g. Increase max auto-scaling limit from X to Y] | [Name] | [2 hours] | [Current max is too close to expected peak] |
|
||||
| [e.g. Configure PgBouncer for connection pooling] | [Name] | [3 days] | [Reduce per-connection overhead; headroom for growth] |
|
||||
|
||||
### Next Quarter (3–6 months)
|
||||
|
||||
| Action | Owner | Effort | Justification |
|
||||
|---|---|---|---|
|
||||
| [e.g. Upgrade DB instance class — [current] → [next]] | [Name] | [2 hours — blue/green] | [DB CPU projected to hit 70% by Q[X]] |
|
||||
| [e.g. Implement caching for [high-read endpoint]] | [Name] | [1 week] | [Reduce DB read load by estimated [X%]] |
|
||||
| [e.g. Evaluate horizontal DB sharding] | [Name] | [2 weeks (spike)] | [At 12-month projections, single DB hits limits] |
|
||||
|
||||
### Horizon (6–12 months)
|
||||
|
||||
| Action | Description | Trigger condition |
|
||||
|---|---|---|
|
||||
| [e.g. Multi-region deployment] | [Active-passive setup in eu-west-2] | [DAU exceeds X or SLA requires 99.99%] |
|
||||
| [e.g. Database sharding or migration to distributed DB] | [Evaluate CockroachDB / Vitess] | [Single-node DB projected to hit ceiling] |
|
||||
| [e.g. CDN expansion] | [Add PoPs in [region]] | [Latency SLO breached for [geography]] |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every resource has a quantified current utilisation and a projected months-to-ceiling — no hand-waving
|
||||
- [ ] The most critical constraint is called out in the executive summary with a specific timeline
|
||||
- [ ] Growth projections state their assumptions and confidence level — not presented as certainties
|
||||
- [ ] Capacity triggers define amber/red thresholds and name who acts at each level
|
||||
- [ ] Cost projections include unit economics, not just absolute totals
|
||||
- [ ] The infrastructure roadmap has named owners and effort estimates — not just a wish list
|
||||
- [ ] Auto-scaling configuration includes both scale-out AND scale-in triggers, and a min/max range
|
||||
- [ ] Actions are ordered by urgency — immediate items are genuinely immediate, not backlog filler
|
||||
@@ -0,0 +1,301 @@
|
||||
---
|
||||
name: cicd-playbook
|
||||
description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities."
|
||||
---
|
||||
|
||||
# CI/CD Playbook Skill
|
||||
|
||||
Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.
|
||||
|
||||
A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and brief description
|
||||
- **Tech stack** — language, framework, containerisation (Docker, etc.)
|
||||
- **Source control** — GitHub / GitLab / Bitbucket, branching strategy
|
||||
- **CI platform** — GitHub Actions / CircleCI / Jenkins / BuildKite / other
|
||||
- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
|
||||
- **Environments** — e.g. dev, staging, production (and any canary / feature environments)
|
||||
- **Deployment frequency** — how often does the team ship?
|
||||
- **Any existing gates** — manual approvals, smoke tests, feature flags
|
||||
- **On-call setup** — who's responsible during deploys?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# CI/CD Playbook: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Last updated:** [Date] | **Owner:** [Name / role]
|
||||
**Pipeline platform:** [CI tool] → [CD tool / platform]
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]
|
||||
|
||||
**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand]
|
||||
**Average pipeline duration:** [X minutes]
|
||||
**Rollback time (p95):** [X minutes]
|
||||
|
||||
---
|
||||
|
||||
## Pipeline Stages
|
||||
|
||||
```
[Branch push]
      │
      ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
      │
      ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
      │
      ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
      │
      ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
      │
      ▼
[5. Build Artefact / Container Image]
      │
      ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
      │
      ▼
[7. Smoke Tests (Staging)]
      │
      ▼
[8. Manual Approval Gate] ──(if required)
      │
      ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
      │
      ▼
[10. Post-deploy checks]
```
|
||||
|
||||
---
|
||||
|
||||
## Stage Definitions
|
||||
|
||||
### Stage 1 — Build & Lint
|
||||
|
||||
**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
|
||||
**Trigger:** Every commit to any branch
|
||||
**Blocking:** Yes — PR cannot be merged if this fails
|
||||
**Typical duration:** [X minutes]
|
||||
**Owner if it fails:** PR author
|
||||
|
||||
**Common failure causes:**
|
||||
- [e.g. Missing dependency — run `npm install` locally before pushing]
|
||||
- [e.g. Lint rule violation — run `npm run lint -- --fix` to auto-fix most issues]
|
||||
|
||||
---
|
||||
|
||||
### Stage 2 — Unit Tests
|
||||
|
||||
**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`]
|
||||
**Coverage gate:** [X]% minimum — pipeline fails below this threshold
|
||||
**Trigger:** Every commit
|
||||
**Blocking:** Yes
|
||||
**Typical duration:** [X minutes]
|
||||
|
||||
**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
|
||||
|
||||
---
|
||||
|
||||
### Stage 3 — Integration Tests
|
||||
|
||||
**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
|
||||
**Environment:** [Ephemeral test environment / shared test DB / etc.]
|
||||
**Trigger:** Every commit to `main` and every pull request targeting `main`
|
||||
**Blocking:** Yes
|
||||
**Typical duration:** [X minutes]
|
||||
|
||||
**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"]
|
||||
|
||||
---
|
||||
|
||||
### Stage 4 — Security Scan
|
||||
|
||||
**Tools:** [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
|
||||
**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies]
|
||||
**Blocking on:** Critical and High severity findings
|
||||
**Non-blocking on:** Medium and Low (flagged, not blocking)
|
||||
**Trigger:** Every commit to `main`
|
||||
|
||||
**How to handle a flagged vulnerability:**
|
||||
1. Check if a fix is available — upgrade the dependency
|
||||
2. If no fix available, open a security ticket and add a suppression with justification
|
||||
3. Never suppress without a ticket and owner
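
The blocking rules and the suppression policy above can be expressed as a small gate function. The sketch below assumes findings arrive as dictionaries with an `id` and a `severity`, and that a suppression only counts when it names both a ticket and an owner. The data shapes and identifiers are hypothetical and not the output format of any specific scanner.

```python
BLOCKING_SEVERITIES = {"critical", "high"}


def security_gate(findings: list, suppressions: dict = None) -> dict:
    """Decide whether the pipeline may proceed given scanner findings."""
    suppressions = suppressions or {}
    blocking, flagged = [], []
    for finding in findings:
        suppression = suppressions.get(finding["id"], {})
        # A suppression is honoured only with both a ticket and an owner.
        if suppression.get("ticket") and suppression.get("owner"):
            continue
        if finding["severity"].lower() in BLOCKING_SEVERITIES:
            blocking.append(finding["id"])
        else:
            flagged.append(finding["id"])
    return {"passed": not blocking, "blocking": blocking, "flagged": flagged}


findings = [
    {"id": "VULN-001", "severity": "High"},
    {"id": "VULN-002", "severity": "Medium"},
    {"id": "VULN-003", "severity": "Critical"},
]
suppressions = {
    "VULN-001": {"ticket": "SEC-123", "owner": "alice"},  # valid suppression
    "VULN-003": {"ticket": "SEC-124"},                    # no owner: ignored
}
print(security_gate(findings, suppressions))
```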
|
||||
|
||||
---
|
||||
|
||||
### Stage 5 — Build Artefact
|
||||
|
||||
**What is produced:** [Docker image / binary / zip — be specific]
|
||||
**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL]
|
||||
**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`)
|
||||
**Trigger:** Commits to `main` only (not feature branches)
|
||||
|
||||
---
|
||||
|
||||
### Stage 6 — Deploy to Staging
|
||||
|
||||
**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
|
||||
**Staging URL:** [URL]
|
||||
**Trigger:** Automatic on successful artefact build from `main`
|
||||
**Who can deploy to staging:** Any engineer (automatic)
|
||||
|
||||
**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
|
||||
**Staging is not production:** [Any differences in config, scale, or data — state them here]
|
||||
|
||||
---
|
||||
|
||||
### Stage 7 — Smoke Tests (Staging)
|
||||
|
||||
**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
|
||||
**Tool:** [e.g. Playwright / Postman / custom script]
|
||||
**Pass criteria:** All smoke tests pass within [X seconds] timeout
|
||||
**Blocking:** Yes — production deploy will not proceed if smoke tests fail
|
||||
|
||||
**Smoke test suite location:** [Link to test files or folder]
|
||||
|
||||
---
|
||||
|
||||
### Stage 8 — Manual Approval Gate
|
||||
|
||||
**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
|
||||
**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer]
|
||||
**Approval timeout:** [e.g. 24 hours — auto-cancelled if no approval]
|
||||
**How to approve:** [GitHub Actions approve step / Slack command / other — with link]
|
||||
|
||||
**When to withhold approval:**
|
||||
- Active incident in production
|
||||
- Deploy is outside the deployment window (see below)
|
||||
- On-call engineer has not been notified
|
||||
|
||||
---
|
||||
|
||||
### Stage 9 — Deploy to Production
|
||||
|
||||
**Deployment method:** [Same as staging or different — specify]
|
||||
**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
|
||||
**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
|
||||
**Deployment notifications:** [Slack channel — #deployments]
|
||||
|
||||
**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass.
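
A deployment window like the example above can be enforced by a small pre-deploy check. The sketch below assumes the example window (Monday to Thursday, 09:00 to 16:00 UTC); the constant and function names are hypothetical, and bank-holiday handling is deliberately left out because it needs a calendar source this template does not specify.

```python
from datetime import datetime, time, timezone

# Assumed window, taken from the example above: Monday-Thursday, 09:00-16:00 UTC.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
WINDOW_START = time(9, 0)
WINDOW_END = time(16, 0)


def in_deployment_window(moment: datetime) -> bool:
    """Return True if `moment` falls inside the deployment window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() in ALLOWED_WEEKDAYS and WINDOW_START <= utc.time() < WINDOW_END


print(in_deployment_window(datetime.now(timezone.utc)))
```

Requiring a timezone-aware value avoids the common mistake of comparing a local wall-clock time against a UTC window.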
|
||||
|
||||
---
|
||||
|
||||
### Stage 10 — Post-Deploy Checks
|
||||
|
||||
**Automated checks (run for [X minutes] after deploy):**
|
||||
- [ ] Error rate: <[X]% (baseline: [Y]%)
|
||||
- [ ] P99 latency: <[X]ms (baseline: [Y]ms)
|
||||
- [ ] [Key business metric]: within [X]% of baseline
|
||||
|
||||
**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link]
|
||||
|
||||
**If a check fails:** See Rollback Procedure below.
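
The three automated checks can be expressed as one evaluation function. This is a sketch under stated assumptions: the `Thresholds` fields, the function name, and the sample numbers are hypothetical, and in practice the inputs would come from the monitoring dashboard named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate_pct: float        # e.g. 1.0 means 1%
    max_p99_latency_ms: float
    max_metric_deviation_pct: float  # allowed drift of the business metric from baseline


def failed_checks(error_rate_pct, p99_latency_ms, metric_value, metric_baseline, limits):
    """Return the names of the checks that failed (empty list means healthy)."""
    if metric_baseline <= 0:
        raise ValueError("metric_baseline must be positive")
    failures = []
    # The checklist passes on strictly-less-than, so equality counts as a failure.
    if error_rate_pct >= limits.max_error_rate_pct:
        failures.append("error_rate")
    if p99_latency_ms >= limits.max_p99_latency_ms:
        failures.append("p99_latency")
    deviation_pct = abs(metric_value - metric_baseline) / metric_baseline * 100
    if deviation_pct > limits.max_metric_deviation_pct:
        failures.append("business_metric")
    return failures


limits = Thresholds(max_error_rate_pct=1.0, max_p99_latency_ms=500.0, max_metric_deviation_pct=10.0)
print(failed_checks(2.5, 320.0, 80.0, 100.0, limits))
```

Returning the list of failed checks, rather than a single boolean, gives the deploying engineer the reason for a rollback without a second lookup.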
|
||||
|
||||
---
|
||||
|
||||
## Environments
|
||||
|
||||
| Environment | Purpose | Deploy trigger | URL | Data |
|---|---|---|---|---|
| **Dev** | Local development | Manual | localhost | Seeded test data |
| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
| **Production** | Live traffic | Manual approval | [URL] | Live data |
|
||||
|
||||
---
|
||||
|
||||
## Branching Strategy
|
||||
|
||||
**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly]
|
||||
|
||||
| Branch | Purpose | Who merges | Deploy target |
|---|---|---|---|
| `main` | Production-ready code | PR + review | Staging → Production |
| `feature/*` | Feature development | Author | None (CI only) |
| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval |
|
||||
|
||||
**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]
|
||||
|
||||
---
|
||||
|
||||
## Rollback Procedure
|
||||
|
||||
**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]
|
||||
|
||||
**Manual rollback steps:**
|
||||
```bash
|
||||
# 1. Identify the last known good image tag
|
||||
[command to list recent deployments]
|
||||
|
||||
# 2. Deploy the previous version
|
||||
[deployment command with previous tag]
|
||||
|
||||
# 3. Confirm rollback is live
|
||||
[smoke test command or health check URL]
|
||||
|
||||
# 4. Notify the team
|
||||
[Slack command or template]
|
||||
```
|
||||
|
||||
**Rollback decision authority:** Any engineer on-call can initiate a rollback without waiting for approval.
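
Step 1 of the manual rollback ("identify the last known good image tag") can be sketched as a lookup over deployment history. The record shape (`tag`, `healthy`) and the function name are hypothetical; a real pipeline would read this history from its deployment tool.

```python
from __future__ import annotations

from typing import Optional


def last_known_good(deployments: list[dict], current_tag: str) -> Optional[str]:
    """Return the most recent healthy tag deployed before `current_tag`.

    `deployments` is ordered oldest to newest; each entry has a "tag" and a
    boolean "healthy" flag (a hypothetical record shape).
    """
    positions = [i for i, d in enumerate(deployments) if d["tag"] == current_tag]
    if not positions:
        raise ValueError(f"unknown tag: {current_tag}")
    current_index = positions[-1]  # use the latest deployment of this tag
    for entry in reversed(deployments[:current_index]):
        if entry["healthy"]:
            return entry["tag"]
    return None


history = [
    {"tag": "svc:a1", "healthy": True},
    {"tag": "svc:b2", "healthy": False},
    {"tag": "svc:c3", "healthy": True},
    {"tag": "svc:d4", "healthy": False},
]
print(last_known_good(history, "svc:d4"))
```

Skipping unhealthy entries matters: rolling back to the immediately previous deployment is wrong if that deployment was itself rolled back.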
|
||||
|
||||
**After a rollback:**
|
||||
1. Create a post-deploy incident report (see [incident-postmortem skill])
|
||||
2. Do not re-deploy the same commit without fixing the root cause
|
||||
3. Notify [stakeholder / support team] of the rollback and expected fix timeline
|
||||
|
||||
---
|
||||
|
||||
## Secrets and Configuration Management
|
||||
|
||||
**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify]
|
||||
**How to add a new secret:**
|
||||
1. [Step 1]
|
||||
2. [Step 2]
|
||||
**Who has access:** [Role or team]
|
||||
**Rotation policy:** [How often secrets are rotated and who owns it]
|
||||
|
||||
**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4) which will flag this.
|
||||
|
||||
---
|
||||
|
||||
## Common Failures and Fixes
|
||||
|
||||
| Failure | Likely cause | Fix |
|---|---|---|
| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the updated lock file |
| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` |
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike persists for more than 5 minutes |
|
||||
|
||||
---
|
||||
|
||||
## On-Call Responsibilities During Deploy
|
||||
|
||||
- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
|
||||
- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments`
|
||||
- For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every stage has a clear owner when it fails
|
||||
- [ ] Rollback procedure is tested — not theoretical
|
||||
- [ ] Secrets management section names the actual tool used (not "use secrets management")
|
||||
- [ ] Deployment window is specific — not "during business hours"
|
||||
- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics
|
||||
@@ -0,0 +1,454 @@
|
||||
---
|
||||
name: database-migration-plan
|
||||
description: "Write a safe, zero-downtime database migration plan for a schema change. Use when asked to plan a database migration, design a zero-downtime schema change, document an expand/contract migration, produce a rollback procedure for a database change, or coordinate a database schema update with a deployment. Produces a structured migration plan covering migration objectives, backward compatibility analysis, expand/contract phase breakdown, exact SQL, rollback steps per phase, data validation queries, and a deployment runbook."
|
||||
---
|
||||
|
||||
# Database Migration Plan Skill
|
||||
|
||||
Produce a complete, safe database migration plan for a schema change. A migration plan is not just the SQL — it is a coordinated sequence of steps that ensures the application stays available, data stays consistent, and every step can be rolled back independently.
|
||||
|
||||
The expand/contract pattern is the default approach: expand the schema to support both old and new states, migrate the application, then contract to remove the old state. Never combine schema changes and data backfills in a single migration that runs during deployment.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Current schema state** — the DDL or description of the table(s) as they are now
|
||||
- **Target schema state** — the DDL or description of what the table(s) should look like after migration
|
||||
- **Migration reason** — why this change is being made (new feature, performance fix, normalization, compliance)
|
||||
- **Database engine** — PostgreSQL, MySQL, SQLite, CockroachDB, etc.
|
||||
- **Estimated data volume** — approximate number of rows in affected tables
|
||||
- **Deployment constraints** — is any downtime allowed? What is the expected traffic level during migration? Are there multiple app instances running?
|
||||
- **Rollback window** — how long after deploy can the team roll back before the migration becomes irreversible?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Database Migration Plan: [Migration Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Reviewed by:** [Name / DBA]
|
||||
**Date:** [Date] | **Target deploy date:** [Date]
|
||||
**Database engine:** [PostgreSQL X.X / MySQL X.X]
|
||||
**Ticket:** [JIRA-XXX]
|
||||
|
||||
---
|
||||
|
||||
## 1. Migration Overview
|
||||
|
||||
**What is changing:**
|
||||
[1–2 sentences: the specific schema change — e.g. "Adding a non-nullable `organisation_id` column to the `users` table and backfilling it from the `accounts` table."]
|
||||
|
||||
**Why:**
|
||||
[1–2 sentences: the business or technical reason driving the change.]
|
||||
|
||||
**Migration type:** [Additive only / Additive + backfill / Column rename / Column type change / Table restructure / Index change]
|
||||
|
||||
**Zero-downtime:** [Yes — using expand/contract / No — requires maintenance window — state duration]
|
||||
|
||||
**Estimated migration duration:**
|
||||
- Expand phase: [~X minutes]
|
||||
- Data backfill: [~X minutes/hours — based on X rows at Y rows/second]
|
||||
- Contract phase: [~X minutes after app version deployed]
|
||||
|
||||
---
|
||||
|
||||
## 2. Backward Compatibility Analysis
|
||||
|
||||
Before writing a single line of SQL, assess whether each change is backward compatible with the currently deployed application code.
|
||||
|
||||
| Change | Backward compatible? | Risk | Notes |
|---|---|---|---|
| [e.g. Add nullable column `org_id`] | Yes | Low | Old app ignores new column |
| [e.g. Backfill `org_id`] | Yes | Medium | Old app unaffected; new app reads backfilled values |
| [e.g. Add NOT NULL constraint to `org_id`] | **No** | High | Old app that inserts without `org_id` will fail |
| [e.g. Drop old column `account_id`] | **No** | High | Old app that reads `account_id` will fail |
| [e.g. Add index on `org_id`] | Yes | Low | Additive; no breaking change |
| [e.g. Rename column] | **No** | High | Never rename in one step; use expand/contract |
|
||||
|
||||
**Summary:** [e.g. "This migration requires the expand/contract pattern across 3 deployment phases because steps 3 and 4 are not backward compatible."]
|
||||
|
||||
---
|
||||
|
||||
## 3. Expand/Contract Phases
|
||||
|
||||
### Phase Overview
|
||||
|
||||
```
|
||||
Phase 1 — EXPAND
|
||||
Deploy migration: add new column (nullable), create new indexes
|
||||
Old app: continues to work (ignores new column)
|
||||
New app: not yet deployed
|
||||
Duration: [~X min] | Rollback: trivial — drop new column
|
||||
|
||||
│
|
||||
▼
|
||||
|
||||
Phase 2 — BACKFILL + DUAL-WRITE
|
||||
Deploy app update: writes to both old and new columns
|
||||
Run backfill: populate new column for existing rows
|
||||
Validate: confirm 100% of rows have non-null new column
|
||||
Duration: [~X hours depending on data volume]
|
||||
Rollback: deploy previous app version; new column is still nullable
|
||||
|
||||
│
|
||||
▼
|
||||
|
||||
Phase 3 — ENFORCE + SWITCH
|
||||
Deploy migration: add NOT NULL constraint (old column stays in place until Phase 4)
Deploy app update: reads only from new column
Duration: [~X min] | Rollback: drop the NOT NULL constraint, then redeploy the dual-write app version
|
||||
|
||||
│
|
||||
▼
|
||||
|
||||
Phase 4 — CONTRACT (optional cleanup)
|
||||
Deploy migration: drop deprecated columns, rename if needed
|
||||
Final state matches target schema
|
||||
Rollback: not recommended — contract changes are destructive
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 1 — Expand Schema
|
||||
|
||||
**Goal:** Add the new column and structures without breaking the existing application.
|
||||
**Deploy order:** Run migration first, then (optionally) deploy app.
|
||||
**Application state:** Old app running; no app changes required yet.
|
||||
|
||||
```sql
-- Migration: 001_add_org_id_to_users.sql

-- Step 1 (transactional): add a nullable column. Safe: the old app ignores it.
BEGIN;

ALTER TABLE users
  ADD COLUMN org_id UUID NULL
  REFERENCES organisations(id) ON DELETE RESTRICT;

COMMIT;

-- Step 2 (outside any transaction): build the index NOW, not in Phase 3.
-- Building an index on a large table during Phase 3 is risky.
-- CREATE INDEX CONCURRENTLY does not block reads or writes, but PostgreSQL
-- rejects it inside a transaction block, so it runs after the COMMIT above.
CREATE INDEX CONCURRENTLY users_org_id_idx ON users (org_id);

-- Note: if the concurrent build fails it leaves an INVALID index behind;
-- drop that index and re-run this step.
```
|
||||
|
||||
**Validation after Phase 1:**
|
||||
```sql
|
||||
-- Confirm column exists and is nullable
|
||||
SELECT column_name, data_type, is_nullable
|
||||
FROM information_schema.columns
|
||||
WHERE table_name = 'users' AND column_name = 'org_id';
|
||||
-- Expected: is_nullable = 'YES'
|
||||
|
||||
-- Confirm index exists
|
||||
SELECT indexname, indexdef
|
||||
FROM pg_indexes
|
||||
WHERE tablename = 'users' AND indexname = 'users_org_id_idx';
|
||||
```
|
||||
|
||||
**Rollback (Phase 1 only):**
|
||||
```sql
-- DROP INDEX CONCURRENTLY cannot run inside a transaction block,
-- so run these as two separate statements (no BEGIN/COMMIT wrapper).
DROP INDEX CONCURRENTLY IF EXISTS users_org_id_idx;
ALTER TABLE users DROP COLUMN IF EXISTS org_id;
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — Backfill Existing Data
|
||||
|
||||
**Goal:** Populate the new column for all existing rows before enforcing NOT NULL.
|
||||
**When to run:** After Phase 1 is live and stable. Can be run as a background job or a one-time script.
|
||||
**Application state:** Deploy app version that dual-writes to both old and new columns.
|
||||
|
||||
**App code change required:**
|
||||
```
|
||||
// All INSERT and UPDATE operations must now set BOTH old_column and new_column
|
||||
// until Phase 3 is complete. This ensures new rows are populated during the backfill window.
|
||||
```
|
||||
|
||||
**Backfill script — batch processing:**
|
||||
```sql
-- Run in batches to avoid long-held locks. Adjust batch size based on table size and DB load.
-- Target: no single batch takes more than 5 seconds.
-- PostgreSQL has no UPDATE ... LIMIT, so each batch is selected by primary key in a subquery
-- (assumes users.id is the primary key).
-- COMMIT inside DO requires PostgreSQL 11+ and must not run inside an outer transaction block.

DO $$
DECLARE
  batch_size INT := 1000;
  affected   INT;
BEGIN
  LOOP
    UPDATE users u
    SET org_id = a.organisation_id
    FROM accounts a
    WHERE u.account_id = a.id
      AND u.id IN (
        SELECT u2.id
        FROM users u2
        JOIN accounts a2 ON a2.id = u2.account_id
        WHERE u2.org_id IS NULL
          AND a2.organisation_id IS NOT NULL  -- skip rows that can never be filled
        LIMIT batch_size
      );

    GET DIAGNOSTICS affected = ROW_COUNT;
    EXIT WHEN affected = 0;

    -- Commit each batch so row locks are released before the next one starts
    COMMIT;

    -- Pause between batches to avoid saturating I/O
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```
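
The same batching idea can also be driven from a client script, with one short transaction per batch. The sketch below is illustrative only: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in so it runs anywhere, integer ids replace UUIDs, and the function name `backfill_org_id` is hypothetical. Against PostgreSQL the loop structure would be the same, using a PostgreSQL driver and the PostgreSQL `UPDATE ... FROM` syntax.

```python
import sqlite3
import time


def backfill_org_id(conn, batch_size=1000, pause_seconds=0.0):
    """Copy accounts.organisation_id into users.org_id in small batches.

    Each batch is committed on its own so locks are held only briefly.
    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
            SET org_id = (SELECT a.organisation_id FROM accounts a
                          WHERE a.id = users.account_id)
            WHERE id IN (
                SELECT u.id
                FROM users u
                JOIN accounts a ON a.id = u.account_id
                WHERE u.org_id IS NULL
                  AND a.organisation_id IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # end this batch's transaction before starting the next
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(pause_seconds)  # throttle to protect live traffic


# Tiny in-memory dataset standing in for the real tables
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, organisation_id INTEGER);
    CREATE TABLE users (id INTEGER PRIMARY KEY, account_id INTEGER, org_id INTEGER);
    INSERT INTO accounts VALUES (1, 10), (2, 20);
    INSERT INTO users (id, account_id) VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 1);
    """
)
updated = backfill_org_id(conn, batch_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM users WHERE org_id IS NULL").fetchone()[0]
print(updated, remaining)
```

A client-driven loop is easier to pause, resume, and monitor than a single server-side block, which is useful when the backfill runs for hours.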
|
||||
|
||||
**Monitoring during backfill:**
|
||||
```sql
|
||||
-- Check progress — run periodically during backfill
|
||||
SELECT
|
||||
COUNT(*) FILTER (WHERE org_id IS NOT NULL) AS backfilled,
|
||||
COUNT(*) FILTER (WHERE org_id IS NULL) AS remaining,
|
||||
COUNT(*) AS total,
|
||||
ROUND(
|
||||
100.0 * COUNT(*) FILTER (WHERE org_id IS NOT NULL) / COUNT(*), 2
|
||||
) AS pct_complete
|
||||
FROM users;
|
||||
```
|
||||
|
||||
**Backfill completion validation:**
|
||||
```sql
|
||||
-- Must return 0 before proceeding to Phase 3
|
||||
SELECT COUNT(*) AS unbackfilled_rows
|
||||
FROM users
|
||||
WHERE org_id IS NULL;
|
||||
|
||||
-- Confirm no new rows written without org_id (dual-write working)
|
||||
SELECT COUNT(*) AS recent_missing
|
||||
FROM users
|
||||
WHERE org_id IS NULL
|
||||
AND created_at > now() - INTERVAL '1 hour';
|
||||
```
|
||||
|
||||
**Rollback (Phase 2 — app only):**
|
||||
- Deploy previous app version (single-write to old column)
|
||||
- `org_id` column remains nullable; no data is lost
|
||||
- Backfilled values remain; harmless
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Enforce Constraints
|
||||
|
||||
**Goal:** Add NOT NULL constraint and remove dependency on the old column.
|
||||
**Prerequisites:** Phase 2 backfill must be 100% complete (zero rows with `org_id IS NULL`).
|
||||
**Deploy order:** Run migration, then deploy app version that reads only from `org_id`.
|
||||
|
||||
**PostgreSQL — use NOT VALID + VALIDATE for large tables:**
|
||||
```sql
|
||||
-- Step 1: Add constraint as NOT VALID (no full table scan — instant)
|
||||
ALTER TABLE users
|
||||
ADD CONSTRAINT users_org_id_not_null
|
||||
CHECK (org_id IS NOT NULL) NOT VALID;
|
||||
|
||||
-- Step 2: VALIDATE CONSTRAINT (takes a SHARE UPDATE EXCLUSIVE lock — allows reads and writes)
|
||||
-- Run this separately, as it can take minutes on large tables
|
||||
ALTER TABLE users
|
||||
VALIDATE CONSTRAINT users_org_id_not_null;
|
||||
|
||||
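-- Step 3: Once validated, convert to actual NOT NULL
-- (PostgreSQL 12+ uses the validated check constraint to skip the table scan, so this is instant;
-- earlier versions scan the whole table while holding an ACCESS EXCLUSIVE lock)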
|
||||
ALTER TABLE users
|
||||
ALTER COLUMN org_id SET NOT NULL;
|
||||
|
||||
-- Step 4: Drop the now-redundant check constraint
|
||||
ALTER TABLE users
|
||||
DROP CONSTRAINT users_org_id_not_null;
|
||||
```
|
||||
|
||||
**Validation after Phase 3:**
|
||||
```sql
|
||||
-- Confirm NOT NULL is enforced
|
||||
SELECT column_name, is_nullable
|
||||
FROM information_schema.columns
|
||||
WHERE table_name = 'users' AND column_name = 'org_id';
|
||||
-- Expected: is_nullable = 'NO'
|
||||
|
||||
-- Test that insert without org_id fails (run in a transaction and roll back)
|
||||
BEGIN;
|
||||
INSERT INTO users (email) VALUES ('test@example.com');
|
||||
-- Expected: ERROR stating that a null value in column "org_id" violates the not-null constraint
-- (exact wording varies by PostgreSQL version)
|
||||
ROLLBACK;
|
||||
```
|
||||
|
||||
**Rollback (Phase 3):**
|
||||
```sql
|
||||
-- Drop the NOT NULL constraint (restores nullable state)
|
||||
ALTER TABLE users ALTER COLUMN org_id DROP NOT NULL;
|
||||
-- Then deploy previous app version (dual-write)
|
||||
-- Note: Once app code reading the new column is live, rolling back the constraint
|
||||
-- without rolling back the app will cause issues — plan this carefully.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Contract (Remove Old Column)
|
||||
|
||||
**Goal:** Remove the old column once the app no longer references it.
|
||||
**Prerequisites:** Phase 3 fully deployed and stable for at least [X days/hours rollback window].
|
||||
**Warning:** This phase is destructive — the old column's data is permanently deleted.
|
||||
|
||||
```sql
|
||||
BEGIN;
|
||||
|
||||
-- Drop the old column. PostgreSQL drops indexes that involve the column along with it.
ALTER TABLE users DROP COLUMN account_id;

-- Idempotent cleanup: a no-op if the index was already removed with the column
DROP INDEX IF EXISTS users_account_id_idx;
|
||||
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
**Pre-drop validation:**
|
||||
```sql
|
||||
-- Confirm no application queries still reference the old column
|
||||
-- (Check this in code review and via a search of the codebase before running)
|
||||
-- grep -r "account_id" app/
|
||||
|
||||
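-- Record how many rows still hold a value in the old column; this data is lost on drop
SELECT COUNT(*) FROM users WHERE account_id IS NOT NULL;
-- A non-zero count is expected while dual-write is active. It does not block the drop
-- once org_id is the canonical column and no application code reads account_id.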
|
||||
```
|
||||
|
||||
**Rollback:** Not straightforward — dropped column data cannot be recovered. Only proceed to Phase 4 after the rollback window has passed and the change is confirmed stable.
|
||||
|
||||
---
|
||||
|
||||
## 4. Data Validation Plan
|
||||
|
||||
Run these queries before and after the full migration to confirm data integrity.
|
||||
|
||||
**Pre-migration baseline:**
|
||||
```sql
|
||||
-- Record these values before any migration step
|
||||
SELECT COUNT(*) AS total_users FROM users;
|
||||
SELECT COUNT(*) AS total_orgs FROM organisations;
|
||||
SELECT MIN(created_at), MAX(created_at) FROM users;
|
||||
|
||||
-- Check for any anomalies in the source data before backfill
|
||||
SELECT COUNT(*) AS users_without_account
|
||||
FROM users WHERE account_id IS NULL;
|
||||
```
|
||||
|
||||
**Post-backfill integrity check:**
|
||||
```sql
|
||||
-- All users have an org that exists
|
||||
SELECT COUNT(*) AS orphaned_org_refs
|
||||
FROM users u
|
||||
WHERE u.org_id IS NOT NULL
|
||||
AND NOT EXISTS (
|
||||
SELECT 1 FROM organisations o WHERE o.id = u.org_id
|
||||
);
|
||||
-- Expected: 0
|
||||
|
||||
-- org_id matches expected value from source column
|
||||
SELECT COUNT(*) AS mismatched_backfill
|
||||
FROM users u
|
||||
JOIN accounts a ON u.account_id = a.id
|
||||
WHERE u.org_id IS DISTINCT FROM a.organisation_id;
-- Expected: 0 (IS DISTINCT FROM also counts rows where one side is NULL, which != would skip)
|
||||
|
||||
-- Row count unchanged (no rows created or deleted by migration)
|
||||
SELECT COUNT(*) AS total_users_after FROM users;
|
||||
-- Must match pre-migration baseline
|
||||
```
|
||||
|
||||
**Post-contract final check:**
|
||||
```sql
|
||||
-- Old column is gone
|
||||
SELECT COUNT(*) FROM information_schema.columns
|
||||
WHERE table_name = 'users' AND column_name = 'account_id';
|
||||
-- Expected: 0
|
||||
|
||||
-- New column is NOT NULL
|
||||
SELECT is_nullable FROM information_schema.columns
|
||||
WHERE table_name = 'users' AND column_name = 'org_id';
|
||||
-- Expected: NO
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Performance Impact Assessment
|
||||
|
||||
| Step | Lock type | Lock duration | Traffic impact |
|---|---|---|---|
| Add nullable column | ACCESS EXCLUSIVE | Milliseconds | Negligible |
| CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | Minutes (proportional to table size) | Reads and writes continue |
| Batch backfill | Row-level locks only | <5s per batch | Low if batches are small |
| ADD CONSTRAINT NOT VALID | ACCESS EXCLUSIVE | Milliseconds | Negligible |
| VALIDATE CONSTRAINT | SHARE UPDATE EXCLUSIVE | Minutes | Reads and writes continue |
| ALTER COLUMN SET NOT NULL | ACCESS EXCLUSIVE | Milliseconds (if check constraint validated) | Negligible |
| DROP COLUMN | ACCESS EXCLUSIVE | Milliseconds | Negligible |
|
||||
|
||||
**Expected load increase during backfill:**
|
||||
- DB CPU: [estimated % increase during batch writes]
|
||||
- DB I/O: [estimated increase]
|
||||
- Monitoring threshold to pause backfill: [e.g. DB CPU > 80% for >2 minutes]
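
A pause rule such as "DB CPU > 80% for more than 2 minutes" can be made precise as a check over recent samples. This sketch assumes one CPU reading every 30 seconds, so four consecutive readings cover two minutes; the function name and defaults are hypothetical.

```python
def should_pause_backfill(cpu_samples, threshold_pct=80.0, sustained_samples=4):
    """Return True if the last `sustained_samples` readings all exceed the threshold.

    With one sample every 30 seconds (an assumption), 4 samples cover 2 minutes.
    """
    if sustained_samples <= 0:
        raise ValueError("sustained_samples must be positive")
    recent = cpu_samples[-sustained_samples:]
    return len(recent) == sustained_samples and all(s > threshold_pct for s in recent)


print(should_pause_backfill([50, 85, 90, 88, 91]))
```

Requiring every recent sample to exceed the threshold avoids pausing on a single short spike.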
|
||||
|
||||
**Backfill rate estimate:**
|
||||
- Table size: [X million rows]
|
||||
- Batch size: [1000 rows]
|
||||
- Pause between batches: [100ms]
|
||||
- Estimated total duration: [X hours at Y rows/second]
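
The duration estimate follows from the batch count multiplied by the cost of one batch (execution time plus pause). A minimal sketch, assuming a measured per-batch execution time from a staging run; the function name and the sample figures are hypothetical.

```python
import math


def estimate_backfill_seconds(total_rows, batch_size, batch_exec_seconds, pause_seconds):
    """Estimate wall-clock seconds for a batched backfill.

    Each batch costs its execution time plus the pause that follows it.
    """
    if total_rows < 0 or batch_size <= 0:
        raise ValueError("total_rows must be >= 0 and batch_size > 0")
    batches = math.ceil(total_rows / batch_size)
    return batches * (batch_exec_seconds + pause_seconds)


# 5 million rows, 1000-row batches, 0.4 s per batch (assumed), 100 ms pause
seconds = estimate_backfill_seconds(5_000_000, 1000, 0.4, 0.1)
print(f"{seconds / 60:.1f} minutes")
```

The per-batch execution time dominates the result, so measure it on production-scale data rather than guessing.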
|
||||
|
||||
---
|
||||
|
||||
## 6. Deployment Runbook
|
||||
|
||||
Follow this checklist on the day of migration. Mark each step as done before proceeding.
|
||||
|
||||
**Pre-migration (day before):**
|
||||
- [ ] DBA / tech lead has reviewed the migration plan
|
||||
- [ ] Performance impact assessed; monitoring dashboards ready
|
||||
- [ ] Backfill script tested on a staging DB with production-scale data
|
||||
- [ ] Rollback procedure tested on staging
|
||||
- [ ] On-call engineer briefed; Slack channel [#db-migrations] set up for coordination
|
||||
- [ ] Maintenance window scheduled (if required)
|
||||
|
||||
**Phase 1 — Expand (T+0):**
|
||||
- [ ] Take a manual DB snapshot / verify automated backup is recent
|
||||
- [ ] Run `001_add_org_id_to_users.sql` on production
|
||||
- [ ] Run Phase 1 validation queries — confirm pass
|
||||
- [ ] Deploy app version with dual-write (must be live before the Phase 2 backfill starts)
|
||||
- [ ] Monitor error rate for [10 minutes]
|
||||
|
||||
**Phase 2 — Backfill (T+[X hours]):**
|
||||
- [ ] Confirm Phase 1 has been stable for [X hours]
|
||||
- [ ] Start backfill script in a screen/tmux session
|
||||
- [ ] Monitor progress via backfill progress query every [5 minutes]
|
||||
- [ ] Monitor DB CPU and I/O — pause if thresholds exceeded
|
||||
- [ ] Run completion validation — confirm 0 unbackfilled rows
|
||||
- [ ] Run integrity checks — confirm 0 orphaned refs, 0 mismatches
|
||||
|
||||
**Phase 3 — Enforce (T+[X days]):**
|
||||
- [ ] Confirm backfill 100% complete and stable for [X hours]
|
||||
- [ ] Add NOT VALID constraint
|
||||
- [ ] Run VALIDATE CONSTRAINT (monitor duration and lock waits)
|
||||
- [ ] Alter column to NOT NULL
|
||||
- [ ] Run Phase 3 validation queries
|
||||
- [ ] Deploy app version reading only from new column
|
||||
- [ ] Monitor error rate for [30 minutes]
|
||||
|
||||
**Phase 4 — Contract (T+[X days after rollback window]):**
|
||||
- [ ] Confirm rollback window has passed — no incidents, no rollback needed
|
||||
- [ ] Search codebase for references to old column — confirm zero
|
||||
- [ ] Run DROP COLUMN migration
|
||||
- [ ] Run final integrity checks
|
||||
- [ ] Close migration ticket; update schema documentation
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every migration phase has an independent rollback procedure — no phase assumes the next one has run
|
||||
- [ ] Batch backfill script includes a pause between batches to avoid saturating I/O
|
||||
- [ ] NOT NULL constraints use the NOT VALID + VALIDATE pattern on tables with >100k rows
|
||||
- [ ] The app dual-write period is explicitly defined — old column writes are not dropped until Phase 3 is deployed
|
||||
- [ ] Data validation queries include a row count check to confirm no data loss
|
||||
- [ ] Lock types are identified for every DDL statement — no "should be fine" assumptions
|
||||
- [ ] The deployment runbook names who runs each step, not just what to run
|
||||
- [ ] Phase 4 (contract) is explicitly gated on the rollback window passing — not run on the same day as Phase 3
|
||||
@@ -0,0 +1,356 @@
|
||||
---
|
||||
name: database-schema-design
|
||||
description: "Document or design a database schema with entity relationships, table definitions, constraints, indexes, and access patterns. Use when asked to design a database, document an existing schema, model entities and relationships, define table structures, plan an index strategy, or produce a data model for review. Produces a structured schema document covering an ER diagram, table DDL definitions, index strategy, access pattern analysis, normalization decisions, and migration notes."
|
||||
---
|
||||
|
||||
# Database Schema Design Skill
|
||||
|
||||
Produce a complete database schema design document for a given domain. A schema document is not just a list of tables — it is a record of decisions: what was modelled, how entities relate, which queries the schema is optimised for, and what trade-offs were made.
|
||||
|
||||
A good schema design document lets an engineer understand the data model, query it correctly, extend it safely, and write migrations without breaking things.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Domain description** — what the system does; what business objects are being modelled
|
||||
- **Entities and relationships** — the main things in the domain and how they relate (e.g. "a User has many Orders; an Order has many OrderItems; an OrderItem references a Product")
|
||||
- **Expected query patterns** — the most important read and write queries (e.g. "fetch all orders for a user, sorted by date"; "look up a product by SKU")
|
||||
- **Database engine** — PostgreSQL, MySQL, SQLite, CockroachDB, etc. — this affects DDL syntax and available types
|
||||
- **Expected data volume** — approximate row counts, growth rate, and any partitioning needs
|
||||
- **Constraints** — any existing conventions, naming standards, or migration constraints to respect
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Database Schema Design: [Domain / Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Reviewed by:** [Name]
|
||||
**Date:** [Date] | **Database engine:** [PostgreSQL X.X / MySQL X.X / etc.]
|
||||
**Status:** [Draft / Reviewed / Approved]
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview
|
||||
|
||||
[2–3 sentences describing the domain being modelled, the scope of this schema, and any key design philosophy (e.g. "this schema prioritises read performance for the customer-facing API over write simplicity", or "designed for eventual migration to multi-tenancy")]
|
||||
|
||||
**In scope:**
|
||||
- [Entity or subsystem]
|
||||
- [Entity or subsystem]
|
||||
|
||||
**Out of scope:**
|
||||
- [e.g. Analytics / reporting tables — separate schema]
|
||||
- [e.g. Audit log tables — covered in separate design doc]
|
||||
|
||||
---
|
||||
|
||||
## 2. Entity Relationship Diagram
|
||||
|
||||
```
┌───────────────────┐         ┌───────────────────────┐
│ users             │         │ organisations         │
├───────────────────┤         ├───────────────────────┤
│ id (PK)           │    ┌───▶│ id (PK)               │
│ org_id (FK) ──────┼────┘    │ name                  │
│ email             │         │ plan                  │
│ display_name      │         │ created_at            │
│ created_at        │         └───────────────────────┘
│ updated_at        │
└─────────┬─────────┘
          │ 1
          │
          │ N
┌─────────▼─────────┐         ┌───────────────────────┐
│ [table_a]         │         │ [table_b]             │
├───────────────────┤         ├───────────────────────┤
│ id (PK)           │◀─1─┐    │ id (PK)               │
│ user_id (FK)      │    └─N──┤ [table_a]_id (FK)     │
│ [field]           │         │ [field]               │
│ [field]           │         │ [field]               │
│ created_at        │         │ created_at            │
└───────────────────┘         └───────────────────────┘
```
|
||||
|
||||
**Relationship summary:**
|
||||
|
||||
| Entity A | Relationship | Entity B | Notes |
|---|---|---|---|
| organisations | has many | users | An org can have many users |
| users | has many | [table_a] | Soft-deleted on user deletion |
| [table_a] | has many | [table_b] | Cascade delete |
| [table_b] | belongs to | [table_a] | Non-nullable FK |
| [table_c] | many-to-many (via [join_table]) | [table_d] | Join table with metadata |
|
||||
|
||||
---
|
||||
|
||||
## 3. Table Definitions
|
||||
|
||||
### `organisations`
|
||||
|
||||
[1 sentence describing what this table stores and its role in the domain.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE organisations (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
name VARCHAR(255) NOT NULL,
|
||||
slug VARCHAR(100) NOT NULL UNIQUE,
|
||||
plan VARCHAR(50) NOT NULL DEFAULT 'free'
|
||||
CHECK (plan IN ('free', 'pro', 'enterprise')),
|
||||
settings JSONB NOT NULL DEFAULT '{}',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||||
);
|
||||
```
|
||||
|
||||
| Column | Type | Nullable | Default | Notes |
|---|---|---|---|---|
| id | UUID | No | gen_random_uuid() | Surrogate PK; UUID preferred over serial for distributed use. `gen_random_uuid()` is built in from PostgreSQL 13; earlier versions need the `pgcrypto` extension |
| name | VARCHAR(255) | No | none | Display name; not unique |
| slug | VARCHAR(100) | No | none | URL-safe identifier; unique across all orgs |
| plan | VARCHAR(50) | No | 'free' | Constrained to known values via CHECK |
| settings | JSONB | No | {} | Flexible config; avoid for queryable fields |
| created_at | TIMESTAMPTZ | No | now() | Always use TIMESTAMPTZ, not TIMESTAMP |
| updated_at | TIMESTAMPTZ | No | now() | Updated via trigger (see below) |
|
||||
|
||||
---
|
||||
|
||||
### `users`
|
||||
|
||||
[1 sentence describing what this table stores.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE users (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
org_id UUID NOT NULL REFERENCES organisations(id)
|
||||
ON DELETE RESTRICT,
|
||||
email VARCHAR(254) NOT NULL,
|
||||
display_name VARCHAR(255) NOT NULL DEFAULT '',
|
||||
role VARCHAR(50) NOT NULL DEFAULT 'member'
|
||||
CHECK (role IN ('owner', 'admin', 'member', 'viewer')),
|
||||
email_verified BOOLEAN NOT NULL DEFAULT false,
|
||||
deleted_at TIMESTAMPTZ NULL,
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
|
||||
CONSTRAINT users_email_org_unique UNIQUE (email, org_id)
|
||||
);
|
||||
```
|
||||
|
||||
| Column | Type | Nullable | Default | Notes |
|---|---|---|---|---|
| id | UUID | No | gen_random_uuid() | Primary key |
| org_id | UUID | No | none | FK to organisations; RESTRICT prevents orphaning |
| email | VARCHAR(254) | No | none | RFC 5321 max length; unique per org (not globally) |
| role | VARCHAR(50) | No | 'member' | Application-level RBAC |
| deleted_at | TIMESTAMPTZ | Yes | NULL | Soft delete; NULL = active |
|
||||
|
||||
**Soft delete policy:** Rows with `deleted_at IS NOT NULL` are considered deleted. All application queries MUST filter `WHERE deleted_at IS NULL` unless explicitly fetching deleted records. Use a view or ORM scope to enforce this.
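
One way to enforce the soft-delete filter is a view that readers query instead of the base table. The sketch below is illustrative: it uses Python's built-in `sqlite3` with an in-memory database so it runs anywhere, a simplified `users` table, and a hypothetical view name `active_users`. The `CREATE VIEW` statement carries over to PostgreSQL with the real column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,
        deleted_at TEXT NULL          -- NULL means the row is active
    );
    -- Readers query the view, so the soft-delete filter cannot be forgotten.
    CREATE VIEW active_users AS
        SELECT id, email FROM users WHERE deleted_at IS NULL;

    INSERT INTO users (id, email, deleted_at) VALUES
        (1, 'ana@example.com', NULL),
        (2, 'ben@example.com', '2024-01-15T09:30:00Z'),
        (3, 'cy@example.com',  NULL);
    """
)

active = conn.execute("SELECT email FROM active_users ORDER BY id").fetchall()
print(active)
```

Code that genuinely needs deleted rows (audit, restore) queries `users` directly, which makes those exceptions easy to find in review.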
|
||||
|
||||
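One way to enforce the soft-delete filter described above is a view that application code reads instead of the base table. The sketch below uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL. The view name `active_users`, the helper names, and the trimmed-down column set are assumptions for illustration, not part of the schema above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a trimmed-down users table plus a view that hides soft-deleted rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            id         INTEGER PRIMARY KEY,
            email      TEXT NOT NULL,
            deleted_at TEXT NULL          -- NULL = active
        );
        -- Hypothetical view: application code selects from this, not from users.
        CREATE VIEW active_users AS
            SELECT id, email FROM users WHERE deleted_at IS NULL;
        """
    )
    conn.executemany(
        "INSERT INTO users (id, email, deleted_at) VALUES (?, ?, ?)",
        [
            (1, "a@example.com", None),
            (2, "b@example.com", "2024-01-01T00:00:00Z"),  # soft-deleted
            (3, "c@example.com", None),
        ],
    )
    return conn


def active_emails(conn: sqlite3.Connection) -> list[str]:
    """Return emails of active users only, ordered by id."""
    rows = conn.execute("SELECT email FROM active_users ORDER BY id").fetchall()
    return [email for (email,) in rows]
```

Queries that genuinely need deleted rows (audits, restores) would read the base table explicitly, which keeps that exception visible in code review.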
---
|
||||
|
||||
### `[table_a]`
|
||||
|
||||
[Description of what this table models.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE [table_a] (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
|
||||
[field_1] VARCHAR(255) NOT NULL,
|
||||
[field_2] TEXT NULL,
|
||||
[field_3] INTEGER NOT NULL DEFAULT 0 CHECK ([field_3] >= 0),
|
||||
status VARCHAR(50) NOT NULL DEFAULT 'pending'
|
||||
CHECK (status IN ('pending', 'active', 'archived')),
|
||||
metadata JSONB NOT NULL DEFAULT '{}',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||||
);
|
||||
```
|
||||
|
||||
| Column | Type | Nullable | Notes |
|
||||
|---|---|---|---|
|
||||
| user_id | UUID | No | CASCADE delete — when user is deleted, their [table_a] rows are too |
|
||||
| [field_1] | VARCHAR(255) | No | [Reason for length constraint] |
|
||||
| status | VARCHAR(50) | No | State machine: pending → active → archived (no other transitions) |
|
||||
| metadata | JSONB | No | [What is stored here and why it's not a typed column] |
|
||||
|
||||
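The `CHECK` constraint on `status` only restricts which values are stored; it does not stop a row moving from `archived` back to `pending`. A minimal sketch of how the application layer might enforce the transition rule in the table above follows. The names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical.

```python
# Allowed transitions for the status column: pending -> active -> archived.
ALLOWED_TRANSITIONS = {
    "pending": {"active"},
    "active": {"archived"},
    "archived": set(),  # terminal state: no way out
}


def can_transition(current: str, new: str) -> bool:
    """Return True only if current -> new is an allowed status change."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    return new in ALLOWED_TRANSITIONS[current]
```

The same rule could instead live in a database trigger; keeping it in one place, wherever that is, matters more than the choice of layer.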
---
|
||||
|
||||
### `[join_table]` *(Many-to-many)*
|
||||
|
||||
[Description of the relationship this table represents.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE [join_table] (
|
||||
[table_c]_id UUID NOT NULL REFERENCES [table_c](id) ON DELETE CASCADE,
|
||||
[table_d]_id UUID NOT NULL REFERENCES [table_d](id) ON DELETE CASCADE,
|
||||
granted_by UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
|
||||
granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
|
||||
PRIMARY KEY ([table_c]_id, [table_d]_id)
|
||||
);
|
||||
```
|
||||
|
||||
**Why a composite PK:** The combination of `[table_c]_id + [table_d]_id` is the natural key — each association is unique and the primary key doubles as the uniqueness constraint without needing a separate index.
|
||||
|
||||
---
|
||||
|
||||
## 4. Index Strategy
|
||||
|
||||
For each table, define which indexes are created and why. Include the query they are designed to serve.
|
||||
|
||||
| Table | Index name | Columns | Type | Query served | Notes |
|
||||
|---|---|---|---|---|---|
|
||||
| users | `users_org_id_idx` | `(org_id)` | B-tree | `SELECT * FROM users WHERE org_id = $1` | FK lookup; required for join performance |
|
||||
| users | `users_email_lower_idx` | `(lower(email))` | B-tree (functional) | `WHERE lower(email) = lower($1)` | Case-insensitive email lookup |
|
||||
| users | `users_active_by_org_idx` | `(org_id, created_at DESC)` | B-tree | `WHERE org_id = $1 AND deleted_at IS NULL ORDER BY created_at DESC` | Partial index candidate (see below) |
|
||||
| [table_a] | `[table_a]_user_id_status_idx` | `(user_id, status)` | B-tree | `WHERE user_id = $1 AND status = 'active'` | Compound — order matters |
|
||||
| [table_a] | `[table_a]_metadata_gin_idx` | `metadata` | GIN | `WHERE metadata @> '{"key": "value"}'` | Only add if JSONB queried frequently |
|
||||
|
||||
**Partial indexes (PostgreSQL):**
|
||||
|
||||
```sql
|
||||
-- Index only active (non-deleted) users: the index stays smaller when many rows are soft-deleted
|
||||
CREATE INDEX users_active_email_idx
|
||||
ON users (email, org_id)
|
||||
WHERE deleted_at IS NULL;
|
||||
|
||||
-- Index only pending items — avoids indexing the majority of rows
|
||||
CREATE INDEX [table_a]_pending_idx
|
||||
ON [table_a] (user_id, created_at)
|
||||
WHERE status = 'pending';
|
||||
```
|
||||
|
||||
**Index design principles applied:**
|
||||
- FKs that appear in JOIN conditions always have an index
|
||||
- Compound indexes put equality-filtered columns first, then range or sort columns; among equality columns, lead with the one that other queries filter on alone
|
||||
- Functional indexes for case-insensitive lookups
|
||||
- GIN indexes only where JSONB containment queries are frequent
|
||||
- Partial indexes for status-filtered queries on large tables
|
||||
|
||||
---
|
||||
|
||||
## 5. Access Pattern Analysis
|
||||
|
||||
Document the primary queries this schema is designed to serve. For each, show the query, the indexes used, and any caveats.
|
||||
|
||||
### AP-1: Fetch all active users for an organisation (paginated)
|
||||
|
||||
**Frequency:** Very high — called on every dashboard load
|
||||
**Query:**
|
||||
```sql
|
||||
SELECT id, email, display_name, role, created_at
|
||||
FROM users
|
||||
WHERE org_id = $1
|
||||
AND deleted_at IS NULL
|
||||
ORDER BY created_at DESC
|
||||
LIMIT 50 OFFSET $2;
|
||||
```
|
||||
**Index used:** `users_active_by_org_idx` (org_id, created_at DESC)
|
||||
**Notes:** Use keyset pagination (`WHERE created_at < $cursor`) at scale; OFFSET degrades past ~10k rows.
|
||||
|
||||
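The keyset pagination mentioned in the notes can be sketched as follows. This is an illustration under assumptions: it uses Python's built-in `sqlite3` in place of PostgreSQL, integer ids instead of UUIDs, and adds `id` as a tie-breaker so rows sharing a `created_at` value are neither skipped nor repeated. The names `fetch_page` and `demo_connection` are hypothetical.

```python
import sqlite3

PAGE_SQL = """
SELECT id, email, created_at
FROM users
WHERE org_id = :org_id
  AND deleted_at IS NULL
  AND (:cursor_ts IS NULL
       OR created_at < :cursor_ts
       OR (created_at = :cursor_ts AND id < :cursor_id))
ORDER BY created_at DESC, id DESC
LIMIT :page_size
"""


def fetch_page(conn, org_id, page_size, cursor=None):
    """Return (rows, next_cursor); cursor is (created_at, id) of the last row seen."""
    cursor_ts, cursor_id = cursor if cursor else (None, None)
    rows = conn.execute(
        PAGE_SQL,
        {"org_id": org_id, "cursor_ts": cursor_ts,
         "cursor_id": cursor_id, "page_size": page_size},
    ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = (rows[-1][2], rows[-1][0]) if len(rows) == page_size else None
    return rows, next_cursor


def demo_connection():
    """Small in-memory dataset: three active users in org 10, plus noise."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, org_id INTEGER, "
        "email TEXT, created_at TEXT, deleted_at TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", [
        (1, 10, "a@example.com", "2024-01-01", None),
        (2, 10, "b@example.com", "2024-01-02", None),
        (3, 10, "c@example.com", "2024-01-02", None),          # same timestamp as id 2
        (4, 10, "d@example.com", "2024-01-03", "2024-02-01"),  # soft-deleted
        (5, 11, "e@example.com", "2024-01-04", None),          # different org
    ])
    return conn
```

Unlike `OFFSET`, each page seeks directly to the cursor position, so the cost of a page does not grow with how deep it is. To serve this ordering from the index alone, the index would likely need `id` appended to its column list.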
---
|
||||
|
||||
### AP-2: Look up a user by email (case-insensitive)
|
||||
|
||||
**Frequency:** High — every authentication attempt
|
||||
**Query:**
|
||||
```sql
|
||||
SELECT id, org_id, role, email_verified
|
||||
FROM users
|
||||
WHERE lower(email) = lower($1)
|
||||
AND deleted_at IS NULL;
|
||||
```
|
||||
**Index used:** `users_email_lower_idx`
|
||||
**Notes:** Returns multiple rows if same email exists across orgs. Application resolves by org context.
|
||||
|
||||
---
|
||||
|
||||
### AP-3: Fetch [table_a] items for a user by status
|
||||
|
||||
**Frequency:** High
|
||||
**Query:**
|
||||
```sql
|
||||
SELECT *
|
||||
FROM [table_a]
|
||||
WHERE user_id = $1
|
||||
AND status = $2
|
||||
ORDER BY created_at DESC
|
||||
LIMIT 25;
|
||||
```
|
||||
**Index used:** `[table_a]_user_id_status_idx`
|
||||
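**Notes:** Compound index covers both filter columns. `user_id` leads so the same index also serves queries that filter on `user_id` alone. The `ORDER BY created_at DESC` still requires a sort step unless `created_at` is added as a third index column.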
|
||||
|
||||
---
|
||||
|
||||
### AP-4: [Add further access patterns as needed]
|
||||
|
||||
---
|
||||
|
||||
## 6. Normalization Decisions
|
||||
|
||||
Document deliberate choices to normalize or denormalize, with reasoning.
|
||||
|
||||
| Decision | Approach | Reasoning |
|
||||
|---|---|---|
|
||||
| [e.g. Organisation name on users table?] | **Not denormalized** — always join to organisations | Avoid stale copies; org name changes are infrequent and joining is cheap |
|
||||
| [e.g. Status history] | **Not in this table** — separate `[table_a]_status_history` if needed | Current status is all that's needed for 99% of queries; history is auditing, not application data |
|
||||
| [e.g. JSONB `settings` column on organisations] | **Denormalized into JSONB** | Settings are read together; never queried by field; schema changes don't require migrations |
|
||||
| [e.g. Computed aggregate counts] | **Not stored** — computed at query time | Counts are small; maintaining a counter column requires careful locking; use `SELECT COUNT(*)` with the index |
|
||||
|
||||
---
|
||||
|
||||
## 7. Triggers and Automation
|
||||
|
||||
```sql
|
||||
-- Automatically update updated_at on any row modification
|
||||
CREATE OR REPLACE FUNCTION set_updated_at()
|
||||
RETURNS TRIGGER AS $$
|
||||
BEGIN
|
||||
NEW.updated_at = now();
|
||||
RETURN NEW;
|
||||
END;
|
||||
$$ LANGUAGE plpgsql;
|
||||
|
||||
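-- Apply to every table with an updated_at column
CREATE TRIGGER organisations_updated_at
BEFORE UPDATE ON organisations
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
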
|
||||
CREATE TRIGGER users_updated_at
|
||||
BEFORE UPDATE ON users
|
||||
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
|
||||
|
||||
CREATE TRIGGER [table_a]_updated_at
|
||||
BEFORE UPDATE ON [table_a]
|
||||
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Migration Notes
|
||||
|
||||
If this schema is being introduced to an existing system, note the migration approach.
|
||||
|
||||
| Step | Description | Backward compatible | Risk |
|
||||
|---|---|---|---|
|
||||
| 1 | Create `organisations` table | Yes — additive | Low |
|
||||
| 2 | Create `users` table | Yes — additive | Low |
|
||||
| 3 | Backfill `org_id` on existing users | **Requires dual-write period** | Medium |
|
||||
| 4 | Add NOT NULL constraint on `org_id` | Requires backfill to be 100% complete | Medium |
|
||||
| 5 | Remove deprecated columns | Requires app code updated first | Low once app deployed |
|
||||
|
||||
**Backfill strategy:** [Describe how to handle existing data — batch size, rate limiting, validation queries]
|
||||
|
||||
**Rollback:** Each migration step should be independently reversible. See [database-migration-plan skill] for the full rollback procedure template.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every table has a primary key and a creation timestamp (`created_at`, or an equivalent such as `granted_at` on join tables), so ordering never relies on implicit row insertion order
|
||||
- [ ] Every foreign key has a corresponding index — no missing FK indexes that would cause full table scans on joins
|
||||
- [ ] All timestamp columns use TIMESTAMPTZ, not TIMESTAMP, so timezone handling is explicit
|
||||
- [ ] Soft-delete tables document the convention and where the filter is enforced (ORM scope, view, or query standard)
|
||||
- [ ] Every access pattern in the design has a supporting index or an explicit note that a full table scan is acceptable
|
||||
- [ ] JSONB columns are justified — not used as a substitute for proper schema design on queryable fields
|
||||
- [ ] Normalization decisions are documented with reasoning, not just stated
|
||||
- [ ] Migration notes address existing data if this is a schema change, not a greenfield schema
|
||||
@@ -0,0 +1,332 @@
|
||||
---
|
||||
name: dependency-audit
|
||||
description: "Conduct a dependency audit for a project — checking for security vulnerabilities, license compliance issues, outdated packages, and transitive dependency risk. Use when asked to audit dependencies, review package security, check license compliance, assess dependency health, or produce a vulnerability report. Produces a vulnerability findings table, license compliance matrix, update priority matrix, dependency health score, and 30-day remediation plan."
|
||||
---
|
||||
|
||||
# Dependency Audit Skill
|
||||
|
||||
Produce a complete dependency audit report for a project — covering security vulnerabilities (with CVE references), license compliance against policy, outdated packages prioritised by risk, transitive dependency risk analysis, and a concrete remediation plan with timeline. A good dependency audit gives the team a clear, prioritised action list — not a raw dump of audit output that no one acts on.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Project language and ecosystem** — npm, pip/PyPI, Maven/Gradle, Go modules, Cargo, RubyGems, NuGet, or mixed
|
||||
- **Dependency list or package manifest** — paste the contents of `package.json`, `requirements.txt`, `go.mod`, `pom.xml`, etc., or provide the audit tool output
|
||||
- **License policy** — which licenses are allowed, which are restricted (e.g. "GPL is prohibited", "MIT/Apache/BSD only", or "no policy yet — recommend one")
|
||||
- **Current security tooling** — Dependabot, Snyk, OWASP Dependency-Check, npm audit, pip-audit, or none
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Dependency Audit Report: [Project Name]
|
||||
|
||||
**Ecosystem:** [npm / pip / Maven / Go / etc.]
|
||||
**Audit date:** [Date]
|
||||
**Auditor:** [Name]
|
||||
**Total direct dependencies:** [N]
|
||||
**Total transitive dependencies:** [N]
|
||||
**Audit tool(s) used:** [npm audit / pip-audit / Snyk / OWASP Dependency-Check / etc.]
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
| Category | Finding | Risk level |
|
||||
|---|---|---|
|
||||
| Critical vulnerabilities | [N] CVEs requiring immediate action | [Critical / High / Low] |
|
||||
| High vulnerabilities | [N] CVEs: fix within 14 days (Week 2 of the remediation plan) | [High / Medium] |
|
||||
| License violations | [N] packages with non-compliant licenses | [High / Low] |
|
||||
| Severely outdated packages | [N] packages 2+ major versions behind | [Medium] |
|
||||
| Packages with no active maintenance | [N] packages — no commits in 12+ months | [Medium] |
|
||||
| **Overall dependency health score** | **[Score]/100** | **[Red / Amber / Green]** |
|
||||
|
||||
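**Scoring methodology:** The score starts at 100, split across the six categories detailed in Section 7. Deductions: Critical CVEs −20 each, High CVEs −10 each, license violations −15 each, abandoned packages −5 each, and −2 per major version behind; automated scanning is worth 5 points, all-or-nothing. Deductions are capped at each category's maximum points, so the score cannot fall below 0. Score ≥80 = Green, 60–79 = Amber, <60 = Red.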
|
||||
|
||||
**Immediate actions required:**
|
||||
1. [Most critical action, e.g. "Upgrade lodash from 4.17.11 to 4.17.21 to fix CVE-2019-10744 (Critical: prototype pollution)"]
|
||||
2. [Second action]
|
||||
3. [Third action]
|
||||
|
||||
---
|
||||
|
||||
## 1. Security Vulnerability Findings
|
||||
|
||||
### Critical and High Severity (Critical: act within 24–72 hours; High: fix within 14 days)
|
||||
|
||||
| Package | Installed version | Fix version | CVE | Severity | CVSS score | Description | Exploitability |
|
||||
|---|---|---|---|---|---|---|---|
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Critical | [9.x] | [e.g. Prototype pollution via `merge` function — remote code execution possible] | [Known exploit / PoC available / No known exploit] |
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | High | [7.x] | [e.g. Path traversal in file serving utility] | [PoC available] |
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | High | [7.x] | [e.g. Regular expression denial of service (ReDoS)] | [No known exploit] |
|
||||
|
||||
### Medium Severity (Fix within 30 days)
|
||||
|
||||
| Package | Installed version | Fix version | CVE | Severity | CVSS score | Description |
|
||||
|---|---|---|---|---|---|---|
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Medium | [5.x] | [Description] |
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Medium | [4.x] | [Description] |
|
||||
|
||||
### Low Severity (Fix within 90 days or accept risk)
|
||||
|
||||
| Package | Installed version | Fix version | CVE | Severity | Description |
|
||||
|---|---|---|---|---|---|
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Low | [Description] |
|
||||
|
||||
### Vulnerabilities With No Fix Available
|
||||
|
||||
| Package | CVE | Severity | Recommended mitigation |
|
||||
|---|---|---|---|
|
||||
| [package-name] | [CVE-YYYY-NNNNN] | [High] | [e.g. "Remove this package — alternative: [replacement]"] |
|
||||
| [package-name] | [CVE-YYYY-NNNNN] | [Medium] | [e.g. "Vendor has a fix in progress — track issue [URL]. Mitigate by [X]"] |
|
||||
|
||||
---
|
||||
|
||||
## 2. License Compliance Matrix
|
||||
|
||||
### License Policy Reference
|
||||
|
||||
| License | Category | Policy | Notes |
|
||||
|---|---|---|---|
|
||||
| MIT | Permissive | Allowed | Attribution required in distributed products |
|
||||
| Apache 2.0 | Permissive | Allowed | Attribution + NOTICE file required |
|
||||
| BSD 2-Clause / 3-Clause | Permissive | Allowed | Attribution required |
|
||||
| ISC | Permissive | Allowed | |
|
||||
| MPL 2.0 | Weak copyleft | Allowed with review | Source disclosure required for modified MPL files only |
|
||||
| LGPL v2 / v3 | Weak copyleft | Allowed with review | Dynamic linking permitted; static linking may require disclosure |
|
||||
| GPL v2 / v3 | Strong copyleft | **Restricted** | May require open-sourcing the entire codebase — legal review required |
|
||||
| AGPL v3 | Strong copyleft | **Restricted** | Network use triggers copyleft — especially risky for SaaS |
|
||||
| SSPL | Source available | **Prohibited** | Not OSI-approved — treat as proprietary |
|
||||
| Proprietary / Commercial | Commercial | **Requires contract** | Verify license covers current use case and scale |
|
||||
| Unknown / Unlicensed | — | **Prohibited** | No license = all rights reserved — cannot use legally |
|
||||
|
||||
### Findings: Packages With Compliance Issues
|
||||
|
||||
| Package | License | Issue | Recommendation | Risk if unaddressed |
|
||||
|---|---|---|---|---|
|
||||
| [package-name] | GPL v3 | Copyleft — may require open-sourcing this project | Replace with [alternative] or get legal sign-off | Legal / IP risk |
|
||||
| [package-name] | AGPL v3 | Network copyleft — SaaS use triggers disclosure | Replace with [alternative] | Legal / IP risk |
|
||||
| [package-name] | Proprietary | License may not cover current usage tier | Verify license scope with vendor | Contract breach |
|
||||
| [package-name] | Unknown | No license declared in package metadata | Contact maintainer or replace | Cannot use legally |
|
||||
|
||||
### All Licenses in Use (Full Inventory)
|
||||
|
||||
| License | Package count | Compliance status |
|
||||
|---|---|---|
|
||||
| MIT | [N] | Compliant |
|
||||
| Apache 2.0 | [N] | Compliant |
|
||||
| BSD-3-Clause | [N] | Compliant |
|
||||
| ISC | [N] | Compliant |
|
||||
| MPL 2.0 | [N] | Review required |
|
||||
| GPL v3 | [N] | **Non-compliant** |
|
||||
| Unknown | [N] | **Non-compliant** |
|
||||
|
||||
---
|
||||
|
||||
## 3. Outdated Package Analysis
|
||||
|
||||
### Severely Outdated (2+ major versions behind — high upgrade effort)
|
||||
|
||||
| Package | Installed | Latest stable | Versions behind | Last updated | Breaking changes summary |
|
||||
|---|---|---|---|---|---|
|
||||
| [package-name] | [1.x.x] | [3.x.x] | 2 major | [Date] | [e.g. "API redesign in v2; async support added in v3"] |
|
||||
| [package-name] | [0.x.x] | [2.x.x] | 2 major | [Date] | [Summary] |
|
||||
|
||||
### Moderately Outdated (1 major version behind)
|
||||
|
||||
| Package | Installed | Latest stable | Versions behind | Security fix in newer version? |
|
||||
|---|---|---|---|---|
|
||||
| [package-name] | [2.x.x] | [3.x.x] | 1 major | [Yes — CVE-YYYY-NNNNN / No] |
|
||||
| [package-name] | [4.x.x] | [5.x.x] | 1 major | [No] |
|
||||
|
||||
### Minor/Patch Updates Available (Low risk to update)
|
||||
|
||||
| Package | Installed | Latest | Contains security fix? |
|
||||
|---|---|---|---|
|
||||
| [package-name] | [2.3.1] | [2.3.9] | [Yes / No] |
|
||||
| [package-name] | [1.0.0] | [1.2.1] | [No] |
|
||||
|
||||
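The three buckets above can be assigned mechanically from version strings. The sketch below assumes plain `MAJOR.MINOR.PATCH` versions with no `v` prefix or pre-release tags; the function names are hypothetical, and real ecosystems may need a proper version parser.

```python
def major_versions_behind(installed: str, latest: str) -> int:
    """Difference in the leading (major) version component; never negative."""
    installed_major = int(installed.split(".")[0])
    latest_major = int(latest.split(".")[0])
    return max(latest_major - installed_major, 0)


def classify(installed: str, latest: str) -> str:
    """Map an installed/latest pair onto the report's outdatedness buckets."""
    behind = major_versions_behind(installed, latest)
    if behind >= 2:
        return "severely outdated"
    if behind == 1:
        return "moderately outdated"
    if installed != latest:
        return "minor/patch update available"
    return "up to date"
```

For example, `classify("1.4.0", "3.0.1")` lands in the severely outdated bucket, while `classify("2.3.1", "2.3.9")` is a low-risk patch update.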
---
|
||||
|
||||
## 4. Dependency Graph Risk Analysis
|
||||
|
||||
### Transitive Dependency Risk
|
||||
|
||||
Transitive (indirect) dependencies carry risk because they are not explicitly managed. These are the highest-risk transitive dependencies in this project:
|
||||
|
||||
| Vulnerable transitive dep | Pulled in by | Installed version | Fix available | Action |
|
||||
|---|---|---|---|---|
|
||||
| [transitive-package] | [direct-parent] | [X.Y.Z] | [Yes — upgrade [parent] to [version]] | Upgrade direct dependency [parent] |
|
||||
| [transitive-package] | [direct-parent] | [X.Y.Z] | [No] | Remove [parent] or use [alternative] |
|
||||
|
||||
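Filling in the "Pulled in by" column means finding which direct dependencies lead to a given transitive package. A minimal sketch, assuming the dependency graph is already available as a mapping from package name to its dependencies (the graph shape and the function name `pulled_in_by` are assumptions, not output of any particular audit tool):

```python
def pulled_in_by(graph: dict[str, list[str]], root: str, target: str) -> list[str]:
    """Return the direct dependencies of root whose dependency tree contains target."""

    def reaches(start: str) -> bool:
        # Iterative depth-first search; `seen` guards against dependency cycles.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, []))
        return False

    return sorted(d for d in graph.get(root, []) if d != target and reaches(d))


# Hypothetical project: app depends on web and cli; log is only reached indirectly.
EXAMPLE_GRAPH = {
    "app": ["web", "cli"],
    "web": ["http", "log"],
    "cli": ["log"],
    "http": ["log"],
    "log": [],
}
```

Here `pulled_in_by(EXAMPLE_GRAPH, "app", "log")` names both `cli` and `web`, so upgrading only one of them would not remove the vulnerable copy.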
### Dependency Concentration Risk
|
||||
|
||||
These packages are depended on by many other packages in the project — a vulnerability or deprecation would have cascading effects:
|
||||
|
||||
| Package | Depended on by (N packages) | Actively maintained? | Risk level |
|
||||
|---|---|---|---|
|
||||
| [package-name] | [N] | [Yes / No — last commit: date] | [High / Medium] |
|
||||
| [package-name] | [N] | [Yes] | [Medium] |
|
||||
|
||||
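The "Depended on by" count in the table above can be derived from the same kind of dependency graph. The sketch below is an illustration under the assumption that the graph is a mapping from each package to its direct dependencies; `dependents_count` and the example graph are hypothetical.

```python
from collections import Counter


def dependents_count(graph: dict[str, list[str]]) -> Counter:
    """Count, for each package, how many packages in the graph depend on it directly."""
    counts: Counter = Counter()
    for deps in graph.values():
        for dep in set(deps):  # ignore duplicate entries for the same dependency
            counts[dep] += 1
    return counts


# Hypothetical graph: log is a shared dependency of three packages.
EXAMPLE_GRAPH = {
    "app": ["web", "cli"],
    "web": ["http", "log"],
    "cli": ["log"],
    "http": ["log"],
    "log": [],
}
```

Sorting the result with `most_common()` puts the highest-concentration packages at the top, which is the order the table is most useful in.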
### Abandoned / Unmaintained Packages
|
||||
|
||||
| Package | Last release | Last commit | Weekly downloads | Recommended alternative |
|
||||
|---|---|---|---|---|
|
||||
| [package-name] | [Date] | [Date] | [N] | [alternative-package] |
|
||||
| [package-name] | [Date] | [Date] | [N] | [Maintained fork: URL] |
|
||||
|
||||
---
|
||||
|
||||
## 5. Remediation Plan
|
||||
|
||||
### 30-Day Plan
|
||||
|
||||
**Week 1 — Critical vulnerabilities (Days 1–7)**
|
||||
|
||||
| Action | Owner | Package | Effort | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Upgrade [package] [old] → [new] | [Name] | [package-name] | [30 min] | [No API changes / check breaking changes guide: URL] |
|
||||
| Replace [package] with [alternative] | [Name] | [package-name] | [2 hours] | [No fix available — must replace] |
|
||||
| Patch override for [transitive-dep] | [Name] | [transitive-dep] | [15 min] | [Add resolutions/overrides entry in manifest] |
|
||||
|
||||
```bash
|
||||
# Commands for Week 1 upgrades:
|
||||
|
||||
# npm
|
||||
npm install [package]@[target-version]
|
||||
npm audit fix --force # use with caution — may introduce breaking changes
|
||||
|
||||
# pip
|
||||
pip install --upgrade [package]==[target-version]
|
||||
pip-audit --fix # if using pip-audit
|
||||
|
||||
# Go
|
||||
go get [module]@[version]
|
||||
go mod tidy
|
||||
|
||||
# Maven
|
||||
# Update pom.xml version property, then:
|
||||
mvn versions:use-latest-releases -DallowMajorUpdates=false
|
||||
mvn dependency:resolve
|
||||
```
|
||||
|
||||
**Week 2 — High vulnerabilities and license violations (Days 8–14)**
|
||||
|
||||
| Action | Owner | Package | Effort | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Upgrade [package] | [Name] | [package-name] | [1 hour] | |
|
||||
| Replace GPL-licensed [package] | [Name] | [package-name] | [4 hours] | [Alternative: [package]] |
|
||||
| Legal review for [package] license | Legal team | [package-name] | [Legal team SLA] | [Submit via [process]] |
|
||||
|
||||
**Week 3 — Medium vulnerabilities and abandoned packages (Days 15–21)**
|
||||
|
||||
| Action | Owner | Package | Effort | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Upgrade [package] | [Name] | [package-name] | [30 min] | |
|
||||
| Replace abandoned [package] | [Name] | [package-name] | [2 hours] | [Maintained fork or alternative: [URL]] |
|
||||
|
||||
**Week 4 — Process improvements (Days 22–30)**
|
||||
|
||||
| Action | Owner | Effort | Notes |
|
||||
|---|---|---|---|
|
||||
| Enable Dependabot / Renovate for automated PRs | [Name] | [2 hours] | [Config in Section 6] |
|
||||
| Add `npm audit` / `pip-audit` to CI — fail on Critical/High | [Name] | [1 hour] | [Config in Section 6] |
|
||||
| Document license policy in CONTRIBUTING.md | [Name] | [1 hour] | [Based on policy in Section 2] |
|
||||
| Schedule next quarterly audit | [Name] | [15 min] | [Add to team calendar] |
|
||||
|
||||
---
|
||||
|
||||
## 6. Policy Recommendations
|
||||
|
||||
### Automated Vulnerability Scanning in CI
|
||||
|
||||
Add the following to your CI pipeline to catch vulnerabilities before they merge:
|
||||
|
||||
```yaml
# GitHub Actions job: place under `jobs:` in a workflow file, or adapt for your CI platform
dependency-audit:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4

    # npm
    - name: npm audit
      run: npm audit --audit-level=high
      # Fails the build on High or Critical vulnerabilities

    # pip
    - name: pip-audit
      run: |
        pip install pip-audit
        # pip-audit exits non-zero on any known vulnerability, regardless of severity
        pip-audit --requirement requirements.txt

    # Go
    - name: govulncheck
      run: |
        go install golang.org/x/vuln/cmd/govulncheck@latest
        govulncheck ./...
```
|
||||
|
||||
### Dependabot / Renovate Configuration
|
||||
|
||||
```yaml
# .github/dependabot.yml: automated dependency update PRs
version: 2
updates:
  - package-ecosystem: "[npm / pip / gomod / maven]"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 10
    labels:
      - "dependencies"
      - "automated"
    ignore:
      # Ignore major version bumps; review these manually
      - dependency-name: "*"
        update-types: ["version-update:semver-major"]
```
|
||||
|
||||
### License Scanning
|
||||
|
||||
```bash
|
||||
# npm — license checker
|
||||
# --onlyAllow already rejects any license not on the list (GPL, AGPL, and LGPL included)
npx license-checker --onlyAllow 'MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC'
|
||||
|
||||
# Python — pip-licenses
|
||||
pip install pip-licenses
|
||||
pip-licenses --allow-only="MIT;Apache Software License;BSD License;ISC License" \
|
||||
--fail-on="GNU General Public License"
|
||||
|
||||
# Go — go-licenses
|
||||
go install github.com/google/go-licenses@latest
|
||||
go-licenses check ./... --allowed_licenses=MIT,Apache-2.0,BSD-2-Clause,BSD-3-Clause
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Dependency Health Score Detail
|
||||
|
||||
| Category | Max points | Score | Notes |
|
||||
|---|---|---|---|
|
||||
| No critical vulnerabilities | 30 | [N]/30 | −20 per critical CVE |
|
||||
| No high vulnerabilities | 20 | [N]/20 | −10 per high CVE |
|
||||
| License compliance | 20 | [N]/20 | −15 per violation |
|
||||
| No abandoned packages | 15 | [N]/15 | −5 per abandoned package |
|
||||
| Up-to-date major versions | 10 | [N]/10 | −2 per major version behind |
|
||||
| Automated scanning enabled | 5 | [N]/5 | All-or-nothing |
|
||||
| **Total** | **100** | **[Score]/100** | **[Red / Amber / Green]** |
|
||||
|
||||
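The score in the table above can be computed directly from finding counts. The sketch below assumes each category floors at zero once its maximum points are used up (the table implies this but does not state it) and takes the Green/Amber/Red thresholds from the Executive Summary. The names `CATEGORIES` and `health_score` are hypothetical.

```python
# (max points, deduction per finding) for each category in the table above
CATEGORIES = {
    "critical_cves": (30, 20),
    "high_cves": (20, 10),
    "license_violations": (20, 15),
    "abandoned_packages": (15, 5),
    "major_versions_behind": (10, 2),
}
SCANNING_POINTS = 5  # all-or-nothing


def health_score(findings: dict[str, int], scanning_enabled: bool) -> tuple[int, str]:
    """Return (score out of 100, rating) from counts of findings per category."""
    score = SCANNING_POINTS if scanning_enabled else 0
    for name, (max_points, per_finding) in CATEGORIES.items():
        # Assumption: a category cannot contribute less than zero points.
        score += max(max_points - per_finding * findings.get(name, 0), 0)
    if score >= 80:
        rating = "Green"
    elif score >= 60:
        rating = "Amber"
    else:
        rating = "Red"
    return score, rating
```

As a worked case: one critical CVE, two high CVEs, one abandoned package, and three major versions behind, with no automated scanning, gives 10 + 0 + 20 + 10 + 4 + 0 = 44, which rates Red.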
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every Critical and High CVE has a named owner and a resolution date in the 30-day plan
|
||||
- [ ] License findings have been reviewed by legal or a named engineer with authority to accept the risk
|
||||
- [ ] Transitive dependency vulnerabilities are included — not just direct dependencies
|
||||
- [ ] Abandoned packages have a concrete replacement recommendation, not just "consider replacing"
|
||||
- [ ] A CI pipeline change is included, so this audit is the last time these findings have to be caught manually
|
||||
- [ ] The dependency health score is calculated from actual findings, not estimated
|
||||
- [ ] Remediation plan actions are specific commands or steps, not "upgrade package X" without version targets
|
||||
@@ -0,0 +1,332 @@
|
||||
---
|
||||
name: developer-onboarding-doc
|
||||
description: "Write a developer onboarding document for a service, codebase, or team. Use when asked to write a developer guide, service README, onboarding doc for a new engineer, codebase orientation, or getting-started guide for a technical team. Produces a structured doc covering service overview, architecture, local setup, key patterns, testing, deployment, and who to ask for what."
|
||||
---
|
||||
|
||||
# Developer Onboarding Document Skill
|
||||
|
||||
Produce a complete developer onboarding document for a service or team — covering everything a new engineer needs to be productive within their first week.
|
||||
|
||||
A good onboarding doc is not a wiki dump. It answers the questions a new engineer actually has on day one, in the order they'll have them.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does
|
||||
- **Team** responsible for it
|
||||
- **Tech stack** — language(s), framework(s), database(s), message queues, etc.
|
||||
- **Key external dependencies** — upstream services, third-party APIs
|
||||
- **Deployment target** — Kubernetes, ECS, Lambda, bare metal, etc.
|
||||
- **Local dev setup** — how to run locally (Docker Compose, local DB, etc.)
|
||||
- **Testing approach** — unit, integration, E2E; test commands
|
||||
- **Deployment process** — summary of how code gets to production
|
||||
- **On-call setup** — who's on-call, how alerts work
|
||||
- **Contacts** — tech lead, platform team, related service owners
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Developer Onboarding: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**Last updated:** [Date] | **Updated by:** [Name]
|
||||
|
||||
> If something in this doc is wrong or out of date, fix it now — it will affect every engineer who onboards after you.
|
||||
|
||||
---
|
||||
|
||||
## What This Service Does
|
||||
|
||||
[3–5 sentences. What problem does this service solve? Who calls it, and who does it call? What would break if this service went down?]
|
||||
|
||||
**Service type:** [API / Background worker / Event consumer / Data pipeline / etc.]
|
||||
**Consumers:** [List internal services or external clients that depend on this service]
|
||||
**Dependencies:** [List upstream services, databases, and third-party APIs this service calls]
|
||||
|
||||
**Architecture diagram:** [Link or embed — even a rough ASCII diagram helps]
|
||||
|
||||
```
|
||||
[Caller A] ──→ [This Service] ──→ [Database]
                      │
                      └──→ [Downstream Service]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Codebase Orientation
|
||||
|
||||
**Repository:** [Link]
|
||||
**Main branch:** `[main / master]`
|
||||
**Language:** [e.g. Go 1.22 / Node.js 20 / Python 3.12]
|
||||
**Framework:** [e.g. Express / FastAPI / Gin / Rails]
|
||||
|
||||
### Key directories
|
||||
|
||||
```
|
||||
[repo-root]/
├── [src/ or cmd/]          # Application code
│   ├── [handlers/]         # HTTP handlers / controllers
│   ├── [services/]         # Business logic
│   ├── [repository/]       # Database access layer
│   └── [models/]           # Data models / types
├── [tests/]                # Test files
├── [migrations/]           # Database migrations
├── [scripts/]              # Utility scripts
├── [.github/workflows/]    # CI/CD pipeline definitions
└── [docs/]                 # Additional documentation
|
||||
```
|
||||
|
||||
**Where to start reading:** [Point to 2–3 key files that give the best orientation — e.g. `main.go`, `routes.js`, `app.py`]
|
||||
|
||||
### Things that might surprise you
|
||||
|
||||
- [Unusual pattern 1 — e.g. "We use event sourcing — state is derived from an event log, not stored directly"]
|
||||
- [Unusual pattern 2 — e.g. "Auth is handled by the gateway — this service trusts the `X-User-Id` header"]
|
||||
- [Unusual pattern 3 — any non-obvious decisions or legacy choices]
|
||||
|
||||
---
|
||||
|
||||
## Local Development Setup
|
||||
|
||||
**Estimated setup time:** [X minutes for a fresh machine]
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- [ ] [Tool 1] — version [X] — [install link]
|
||||
- [ ] [Tool 2] — version [X] — [install link]
|
||||
- [ ] Access to [repo / internal package registry] — request from [who]
|
||||
- [ ] [Any secrets or credentials needed] — request from [who]
|
||||
|
||||
### Step-by-step setup
|
||||
|
||||
```bash
|
||||
# 1. Clone the repo
|
||||
git clone [repo URL]
|
||||
cd [repo-name]
|
||||
|
||||
# 2. Copy and configure environment variables
|
||||
cp .env.example .env
|
||||
# Edit .env — see "Environment Variables" section below
|
||||
|
||||
# 3. Start dependencies (database, cache, etc.)
|
||||
[docker compose up -d / make deps / etc.]
|
||||
|
||||
# 4. Install dependencies
|
||||
[npm install / go mod download / pip install -r requirements.txt]
|
||||
|
||||
# 5. Run database migrations
|
||||
[migration command]
|
||||
|
||||
# 6. Start the service
|
||||
[start command]
|
||||
|
||||
# 7. Verify it's working
|
||||
curl http://localhost:[PORT]/health
|
||||
# Expected: {"status":"ok"}
|
||||
```
|
||||
|
||||
**If this doesn't work:** Check [Troubleshooting section below] or ask in `#[channel]`.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
| Variable | Required | Description | Example |
|
||||
|---|---|---|---|
|
||||
| `DATABASE_URL` | Yes | Connection string for the primary DB | `postgres://localhost:5432/[db]` |
|
||||
| `[VAR_2]` | Yes | [Description] | [Example] |
|
||||
| `[VAR_3]` | No | [Description — default value] | [Example] |
|
||||
|
||||
**Secrets for local dev:** [Where to get them — e.g. "Run `[command]` to pull from Vault" or "Ask [person] in #[channel]"]
|
||||
|
||||
### Useful local commands
|
||||
|
||||
```bash
|
||||
[start command] # Start the service
|
||||
[test command] # Run all tests
|
||||
[lint command] # Run linter
|
||||
[format command] # Format code
|
||||
[migration command] # Run pending migrations
|
||||
[seed command] # Seed local database
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
|
||||
**Testing philosophy:** [e.g. "We test at the integration layer — unit tests for pure functions, integration tests for anything touching the DB or external services"]
|
||||
|
||||
### Running tests
|
||||
|
||||
```bash
|
||||
# All tests
|
||||
[test command]
|
||||
|
||||
# Unit tests only
|
||||
[unit test command]
|
||||
|
||||
# Integration tests (requires local deps running)
|
||||
[integration test command]
|
||||
|
||||
# A specific test file or test case
|
||||
[test command with filter]
|
||||
```
|
||||
|
||||
**Test coverage:** [X]% (minimum required to pass CI: [Y]%)
|
||||
**Coverage report:** [Where to find it]
|
||||
|
||||
### Writing tests
|
||||
|
||||
- **Unit tests:** [Where to put them — e.g. alongside source files as `*_test.go`]
|
||||
- **Integration tests:** [Where to put them — e.g. `tests/integration/`]
|
||||
- **Test database:** [How it works — e.g. "Each test gets a clean transaction that rolls back on teardown — see `tests/helpers/db.go`"]
|
||||
- **Mocking:** [Policy — e.g. "We mock at the repository layer — don't mock the DB directly"]
|
||||
|
||||
---
|
||||
|
||||
## Making Changes
|
||||
|
||||
### Branching
|
||||
|
||||
[Branch naming convention — e.g. `feature/[ticket-id]-short-description`, `fix/[ticket-id]-short-description`]
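A naming convention is easier to keep when it is checked mechanically, for example in a pre-push hook. The sketch below assumes the example convention above with ticket IDs shaped like `ABC-123` and a kebab-case summary; both the pattern and the function name `valid_branch_name` are assumptions to adapt.

```bash
# Sketch: validate a branch name against an assumed convention.
# Assumed shape: feature/ or fix/, a ticket ID such as ABC-123, then a kebab-case summary.
valid_branch_name() {
  printf '%s\n' "$1" |
    grep -Eq '^(feature|fix)/[A-Z]+-[0-9]+-[a-z0-9]+(-[a-z0-9]+)*$'
}
```

A hook could call it as `valid_branch_name "$(git rev-parse --abbrev-ref HEAD)" || echo "branch name does not match convention"`.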
|
||||
|
||||
### Before opening a PR
|
||||
|
||||
- [ ] Tests pass locally
|
||||
- [ ] Linter passes (`[lint command]`)
|
||||
- [ ] New behaviour has test coverage
|
||||
- [ ] Any new environment variables are added to `.env.example` and documented
|
||||
- [ ] Database migrations are backward-compatible (old code can run against new schema)
|
||||
|
||||
### Code review
|
||||
|
||||
- **Reviewers:** [Who to request review from — e.g. "Any engineer on [team]; lead review required for auth changes"]
|
||||
- **Expected review time:** [X hours / 1 business day]
|
||||
- **PR template:** [Link or auto-generated by GitHub]
|
||||
|
||||
### Database migrations
|
||||
|
||||
```bash
|
||||
# Create a new migration
|
||||
[migration create command]
|
||||
|
||||
# Apply pending migrations
|
||||
[migration up command]
|
||||
|
||||
# Roll back last migration
|
||||
[migration down command]
|
||||
```
|
||||
|
||||
**Migration rules:**
|
||||
- All migrations must be backward-compatible — old code must run against the new schema
|
||||
- Never rename or drop a column in a single migration: do it in stages across separate deploys (add new, migrate data, drop old)
|
||||
- Test your rollback before merging
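The backward-compatibility rules above can be partially enforced with a simple text check on migration files. This is a heuristic sketch only: it assumes migrations are plain SQL files, the pattern will miss some destructive statements and flag some safe ones, and `check_migration_safe` is a hypothetical name.

```bash
# Sketch: flag statements that usually break old code running against the new schema.
# Heuristic only; assumes plain SQL migration files.
check_migration_safe() {
  # Print line-numbered matches for column/table drops and any rename
  if grep -Ein 'drop[[:space:]]+(column|table)|rename[[:space:]]+' "$1"; then
    return 1
  fi
  return 0
}
```

A non-zero exit status means the file deserves a closer look, not that the migration is necessarily wrong; the final "drop old" stage of a staged change will trip it by design.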
|
||||
|
||||
---
|
||||
|
||||
## Deployment
|
||||
|
||||
**How code gets to production:** [1–2 sentence summary — link to full CI/CD playbook if it exists]
|
||||
|
||||
1. Merge to `main` → automatic deploy to staging
|
||||
2. Smoke tests run on staging
|
||||
3. Manual approval → deploy to production
|
||||
4. Post-deploy monitoring for [X minutes]
|
||||
|
||||
**Deployment docs:** [Link to CI/CD playbook or pipeline docs]
|
||||
|
||||
**Who can deploy:** [Any engineer / Lead engineer / On-call engineer — specify]
|
||||
|
||||
**Deployment channel:** `#[deployments channel]`
|
||||
|
||||
---
|
||||
|
||||
## Monitoring and Observability
|
||||
|
||||
**Dashboard:** [Datadog / Grafana / CloudWatch — link]
|
||||
**Logs:** [Log aggregation tool and link — e.g. "Logs are in Datadog under service:[name]"]
|
||||
**Traces:** [Tracing tool and link if applicable]
|
||||
**Alerts:** [Where alerts fire — e.g. PagerDuty / Slack #alerts-[service]]
|
||||
|
||||
**Key metrics to know:**
|
||||
- **Error rate:** Should be <[X]% (alert at [Y]%)
|
||||
- **P99 latency:** Should be <[X]ms
|
||||
- **[Business metric]:** [e.g. "Queue depth should be <100 items"]
|
||||
|
||||
---
|
||||
|
||||
## On-Call
|
||||
|
||||
**On-call schedule:** [PagerDuty / Opsgenie link]
|
||||
**Who's on-call now:** [Link to current schedule or `#oncall` channel]
|
||||
**Escalation:** [On-call → [team lead] → [EM] — after [X] minutes unacknowledged]
|
||||
|
||||
**If you get paged:**
|
||||
1. Acknowledge the alert
|
||||
2. Check [dashboard link] for the first clue
|
||||
3. Common alert runbooks: [link to oncall-runbook or runbook-writer output]
|
||||
4. If you can't resolve in [X minutes], escalate to [person/channel]
|
||||
|
||||
---
|
||||
|
||||
## Key Contacts
|
||||
|
||||
| Role | Name | Best way to reach |
|
||||
|---|---|---|
|
||||
| Tech lead | [Name] | Slack: @[handle] |
|
||||
| On-call rotation | [Team] | PagerDuty / `#oncall` |
|
||||
| Platform / infra | [Team] | `#platform` Slack channel |
|
||||
| Database / DBA | [Name or team] | `#database` Slack channel |
|
||||
| [Upstream service] owner | [Name] | Slack: @[handle] |
|
||||
|
||||
**Where to ask questions:**
|
||||
- General engineering: `#engineering`
|
||||
- This service specifically: `#[service-name]`
|
||||
- Urgent / production issues: `#incidents`
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "The service won't start locally"
|
||||
|
||||
1. Check that Docker / dependencies are running: `[command]`
|
||||
2. Check `.env` is populated — missing values cause silent failures
|
||||
3. Check logs: `[log command]`
|
||||
4. Ask in `#[channel]`
|
||||
|
||||
### "Tests are failing locally but passing in CI"
|
||||
|
||||
- Check your local dependency versions match CI: `[version check command]`
|
||||
- Try a clean install: `[clean install command]`
|
||||
- Integration tests need local deps running — `[start deps command]`
|
||||
|
||||
### "I can't access [internal tool / system]"
|
||||
|
||||
- Request access through [process — e.g. Okta self-serve / ask your manager]
|
||||
|
||||
### "Something looks wrong in production"
|
||||
|
||||
1. Check [dashboard] for the error spike
|
||||
2. Check recent deploys in `#deployments`
|
||||
3. If it's an active incident, page on-call via [PagerDuty / Slack command]
|
||||
|
||||
---
|
||||
|
||||
## Further Reading
|
||||
|
||||
- [Architecture Decision Records (ADRs)](./docs/decisions/) — why the codebase is the way it is
|
||||
- [API documentation](./docs/api/) or [link to external docs]
|
||||
- [Incident runbooks](./docs/runbooks/)
|
||||
- [CI/CD pipeline documentation](./docs/cicd/)
|
||||
- [Team working agreements](./docs/team/)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Local setup instructions work on a fresh machine — tested recently
|
||||
- [ ] Environment variables table is complete and accurate
|
||||
- [ ] "Things that might surprise you" captures the actual surprises (ask a recent joiner)
|
||||
- [ ] On-call section has real links, not placeholders
|
||||
- [ ] Contacts are current — team members with real Slack handles
|
||||
- [ ] Troubleshooting covers the top 3 actual questions new joiners ask
|
||||
@@ -0,0 +1,560 @@
|
||||
---
|
||||
name: disaster-recovery-plan
|
||||
description: "Write a disaster recovery plan for a service or system — covering RPO/RTO targets, failure scenario runbooks, backup and restore procedures, DR testing cadence, and communication templates. Use when asked to write a DR plan, document failover procedures, create recovery runbooks, define RTO/RPO targets, or prepare for a disaster recovery game day. Produces a full DR document with per-scenario recovery runbooks, backup validation procedures, testing schedule, and communication templates."
|
||||
---
|
||||
|
||||
# Disaster Recovery Plan Skill
|
||||
|
||||
Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does (business function and technical role)
|
||||
- **Criticality tier** — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
|
||||
- **Current infrastructure setup** — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
|
||||
- **RPO/RTO requirements** — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long can it be down)
|
||||
- **Backup strategy** — what is backed up, how often, where backups are stored, retention policy
|
||||
- **On-call contacts** — names and contact details for the responder chain
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Disaster Recovery Plan: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**Criticality tier:** [Tier 1 / Tier 2 / Tier 3] | **Last tested:** [Date]
|
||||
**Next DR test:** [Date] | **Document owner:** [Name]
|
||||
**Last updated:** [Date] | **Review cycle:** Quarterly
|
||||
|
||||
> **Emergency? Skip to Section 3 — Failure Scenario Runbooks.** Find the scenario that matches your situation and follow the steps exactly.
|
||||
|
||||
---
|
||||
|
||||
## 1. Recovery Targets
|
||||
|
||||
| Target | Value | Rationale |
|
||||
|---|---|---|
|
||||
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction — database replication is synchronous"] |
|
||||
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
|
||||
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
|
||||
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
|
||||
| Backup frequency | [Every X hours] | [RPO-driven — backup interval must be ≤ RPO] |
|
||||
|
||||
**What these mean in practice:**
|
||||
- If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
|
||||
- The service must be operational again within [Y minutes/hours] of declaring a DR event.
|
||||
- If either target cannot be met, escalate to [Engineering Manager] immediately.
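The constraint that the backup interval must not exceed the RPO is simple arithmetic, so it can be asserted in a script or CI job. The sketch below assumes a snapshot-only restore, where the worst-case data loss equals the snapshot interval; continuous log backups or replication would lower that figure. `check_rpo` and its minute-based arguments are illustrative.

```bash
# Sketch: compare worst-case data loss (snapshot interval) with the RPO, both in minutes.
# Assumes recovery from snapshots only; check_rpo is a hypothetical helper.
check_rpo() {
  interval_min="$1"; rpo_min="$2"
  if [ "$interval_min" -le "$rpo_min" ]; then
    echo "OK: worst-case loss ${interval_min} min is within RPO ${rpo_min} min"
  else
    echo "VIOLATION: worst-case loss ${interval_min} min exceeds RPO ${rpo_min} min"
    return 1
  fi
}
```

For instance, `check_rpo 360 60` fails because six-hourly snapshots cannot satisfy a one-hour RPO on their own.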
|
||||
|
||||
---
|
||||
|
||||
## 2. Failure Scenario Inventory
|
||||
|
||||
| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|
||||
|---|---|---|---|---|---|
|
||||
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 — no data loss] | Section 3.1 |
|
||||
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
|
||||
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
|
||||
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
|
||||
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
|
||||
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |
|
||||
|
||||
---
|
||||
|
||||
## 3. Failure Scenario Runbooks
|
||||
|
||||
### 3.1 Single Availability Zone Failure
|
||||
|
||||
**Trigger:** One AZ becomes unreachable — pods/instances in that zone stop responding.
|
||||
**Detection:** PagerDuty alert `[AlertName]` fires, or cloud provider status page shows AZ degradation.
|
||||
**Expected RTO:** [15 minutes] | **Expected RPO:** Zero (no data loss if multi-AZ replication is working)
|
||||
|
||||
**Step 1 — Confirm the failure**
|
||||
```bash
|
||||
# Check pod/instance health across zones
|
||||
kubectl get pods -o wide -n [namespace] | grep -v Running
|
||||
|
||||
# Check which nodes are affected
|
||||
# -w matches Ready only as a whole word, so NotReady nodes stay in the output
kubectl get nodes -o wide | grep -vw Ready
|
||||
|
||||
# Verify cloud provider AZ status
|
||||
# AWS: https://health.aws.amazon.com/health/status
|
||||
# GCP: https://status.cloud.google.com
|
||||
```
|
||||
|
||||
**Step 2 — Assess whether auto-recovery has occurred**
|
||||
```bash
|
||||
# If using auto-scaling, check if replacement instances launched
|
||||
kubectl get pods -n [namespace] --watch
|
||||
|
||||
# Check deployment replica count
|
||||
kubectl get deployment [service-name] -n [namespace]
|
||||
|
||||
# Verify load balancer health checks are passing
|
||||
[cloud provider CLI command to check target group health]
|
||||
```
|
||||
|
||||
**Step 3 — Force rescheduling if auto-recovery stalled**
|
||||
```bash
|
||||
# Cordon the affected node so no new pods schedule on it
|
||||
kubectl cordon [node-name]
|
||||
|
||||
# Drain the node — moves all pods to healthy nodes
|
||||
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data
|
||||
|
||||
# Verify pods have rescheduled successfully
|
||||
kubectl get pods -o wide -n [namespace]
|
||||
```
|
||||
|
||||
**Step 4 — Verify service health**
|
||||
```bash
|
||||
# Smoke test key endpoints
|
||||
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
|
||||
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]
|
||||
|
||||
# Check error rate in monitoring
|
||||
[dashboard link or query]
|
||||
```
|
||||
|
||||
**Recovery confirmed when:** All pods are Running, health check returns 200, error rate is at baseline.
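The "all pods are Running" criterion can be checked by filtering `kubectl` output rather than reading it by eye. This sketch assumes the default column layout of `kubectl get pods --no-headers` for a single namespace, where the third column is STATUS; `unhealthy_pods` is a hypothetical helper that reads that output on stdin.

```bash
# Sketch: list pods whose STATUS is neither Running nor Completed.
# Assumes the default single-namespace column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  # Completed pods are finished jobs, not failures, so they are not reported
  awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}
```

It could be used as `kubectl get pods -n [namespace] --no-headers | unhealthy_pods`; an empty result satisfies the first recovery criterion. Note that it does not inspect the READY column, so a Running pod that fails its readiness probe still needs the health check from Step 4.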
|
||||
|
||||
---
|
||||
|
||||
### 3.2 Full Region Failure
|
||||
|
||||
**Trigger:** The primary region is entirely unavailable.
|
||||
**Detection:** All service health checks failing, cloud provider status page confirms region-wide event.
|
||||
**Expected RTO:** [60 minutes] | **Expected RPO:** [5 minutes — based on cross-region replication lag]
|
||||
|
||||
**Step 1 — Confirm regional failure (5 minutes)**
|
||||
```bash
|
||||
# Confirm the primary region is unreachable
|
||||
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"
|
||||
|
||||
# Check replication lag on standby region database
|
||||
[command to check replica lag, e.g. for RDS: read the CloudWatch ReplicaLag metric for [dr-replica-identifier] in [dr-region]]
|
||||
```
|
||||
|
||||
**Step 2 — Declare DR event and notify (2 minutes)**
|
||||
|
||||
Post to `#incidents`:
|
||||
```
|
||||
🔴 DR EVENT — [Service Name] — Region Failure
|
||||
Primary region: [region] — UNREACHABLE
|
||||
Activating failover to: [dr-region]
|
||||
Incident commander: [Name]
|
||||
Next update: 15 minutes
|
||||
```
|
||||
|
||||
Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.
|
||||
|
||||
**Step 3 — Promote DR database (10 minutes)**
|
||||
```bash
|
||||
# AWS RDS — promote read replica to primary
|
||||
aws rds promote-read-replica \
|
||||
--db-instance-identifier [dr-replica-identifier] \
|
||||
--region [dr-region]
|
||||
|
||||
# Wait for promotion to complete
|
||||
aws rds wait db-instance-available \
|
||||
--db-instance-identifier [dr-replica-identifier] \
|
||||
--region [dr-region]
|
||||
|
||||
# Record the new database endpoint
|
||||
aws rds describe-db-instances \
|
||||
--db-instance-identifier [dr-replica-identifier] \
|
||||
--region [dr-region] \
|
||||
--query 'DBInstances[0].Endpoint.Address'
|
||||
```
|
||||
|
||||
**Step 4 — Deploy service in DR region (20 minutes)**
|
||||
```bash
|
||||
# Update service configuration to point at DR database
|
||||
kubectl set env deployment/[service-name] \
|
||||
DATABASE_URL=[new-dr-database-url] \
|
||||
-n [namespace] \
|
||||
--context [dr-region-context]
|
||||
|
||||
# Scale up the DR deployment
|
||||
kubectl scale deployment/[service-name] --replicas=[N] \
|
||||
-n [namespace] \
|
||||
--context [dr-region-context]
|
||||
|
||||
# Verify all pods are running
|
||||
kubectl get pods -n [namespace] --context [dr-region-context]
|
||||
```
|
||||
|
||||
**Step 5 — Cut over DNS / load balancer (5 minutes)**
|
||||
```bash
|
||||
# Update DNS to point to DR region load balancer
|
||||
# AWS Route 53:
|
||||
aws route53 change-resource-record-sets \
|
||||
--hosted-zone-id [zone-id] \
|
||||
--change-batch file://dr-failover-dns.json
|
||||
|
||||
# Verify DNS propagation (may take up to [TTL] seconds)
|
||||
dig [service-domain] @8.8.8.8
|
||||
```
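Step 5 references a `dr-failover-dns.json` change batch without showing its contents. The sketch below generates one, assuming a plain CNAME record with a 60-second TTL; many setups use alias records or Route 53 failover routing instead, so treat the record type, TTL, and the function name `write_failover_batch` as assumptions.

```bash
# Sketch: write a Route 53 change batch that repoints a CNAME at the DR load balancer.
# Record type, TTL, and file name are assumptions; adapt to your DNS setup.
write_failover_batch() {
  record_name="$1"; dr_target="$2"; out_file="${3:-dr-failover-dns.json}"
  cat > "$out_file" <<EOF
{
  "Comment": "DR failover: point ${record_name} at the DR load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "${dr_target}" }]
      }
    }
  ]
}
EOF
}
```

Generating and reviewing this file ahead of time, rather than during the incident, keeps the cutover step to a single command.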
|
||||
|
||||
**Step 6 — Verify end-to-end**
|
||||
```bash
|
||||
# Full smoke test against DR endpoint
|
||||
curl -s https://[service-url]/health
|
||||
[run automated smoke test suite if available]
|
||||
```
|
||||
|
||||
**Recovery confirmed when:** DNS resolves to DR region, smoke tests pass, error rate is at baseline.
|
||||
|
||||
**Post-failover actions (not urgent — after service is stable):**
|
||||
- Do not fail back to primary until root cause is confirmed resolved
|
||||
- Document data loss window (check replication lag at time of failure)
|
||||
- Begin post-incident review — see [incident-postmortem skill]
|
||||
|
||||
---
|
||||
|
||||
### 3.3 Database Corruption or Data Loss
|
||||
|
||||
**Trigger:** Data in the database is corrupted, deleted, or otherwise incorrect due to a software bug, operator error, or hardware fault.
|
||||
**Detection:** Application errors referencing missing/invalid data, monitoring alerts on query error rate, user reports.
|
||||
**Expected RTO:** [90 minutes] | **Expected RPO:** [Backup interval — e.g. 1 hour]
|
||||
|
||||
**Step 1 — Stop the bleeding immediately**
|
||||
```bash
|
||||
# Put the service into maintenance mode to prevent further writes to corrupted data
|
||||
[command to enable maintenance mode — e.g. kubectl set env deployment/[name] MAINTENANCE_MODE=true]
|
||||
|
||||
# Or: scale down the service to zero to prevent writes
|
||||
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
|
||||
```
|
||||
|
||||
**Step 2 — Assess scope of corruption**
|
||||
```bash
|
||||
# Identify which tables/records are affected
|
||||
[SQL query to check data integrity — e.g.]
|
||||
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [table] WHERE [integrity check condition]"
|
||||
|
||||
# Determine when corruption started (cross-reference with deploy times and error logs)
|
||||
[log query to find earliest error — e.g. in Datadog:]
|
||||
# service:[service-name] status:error "[corruption error message]" | sort by timestamp asc
|
||||
```
|
||||
|
||||
**Step 3 — Identify the correct restore point**
|
||||
```bash
|
||||
# List available backups
|
||||
[command to list backups — e.g. for RDS:]
|
||||
aws rds describe-db-snapshots \
|
||||
--db-instance-identifier [db-identifier] \
|
||||
--query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
|
||||
--output table
|
||||
|
||||
# Choose the most recent backup BEFORE corruption started
|
||||
# Record the chosen snapshot ID: [snapshot-id]
|
||||
```
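Choosing "the most recent backup before corruption started" is easy to get wrong under pressure, so the selection can be scripted. The sketch below assumes each input line holds a timestamp and a snapshot ID separated by whitespace, with all timestamps in the same ISO 8601 format and time zone so that string order equals time order. `latest_snapshot_before` is a hypothetical helper.

```bash
# Sketch: pick the newest snapshot created strictly before a cutoff time.
# Input on stdin: "<timestamp> <snapshot-id>" per line, same ISO 8601 format throughout.
latest_snapshot_before() {
  cutoff="$1"
  awk -v cutoff="$cutoff" '
    # String comparison is valid because identically formatted ISO 8601 sorts by time
    $1 < cutoff && $1 > best { best = $1; id = $2 }
    END { if (id != "") print id }
  '
}
```

The listing command above could feed it by switching to `--output text` and piping the result into `latest_snapshot_before "[corruption-start-time]"`. An empty result means no snapshot predates the corruption.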
|
||||
|
||||
**Step 4 — Restore from backup**
|
||||
```bash
|
||||
# Restore to a NEW database instance (never overwrite production directly)
|
||||
aws rds restore-db-instance-from-db-snapshot \
|
||||
--db-instance-identifier [service-name]-restored-[date] \
|
||||
--db-snapshot-identifier [snapshot-id] \
|
||||
--region [region]
|
||||
|
||||
# Wait for restore to complete
|
||||
aws rds wait db-instance-available \
|
||||
--db-instance-identifier [service-name]-restored-[date]
|
||||
|
||||
# Get the restored instance endpoint
|
||||
aws rds describe-db-instances \
|
||||
--db-instance-identifier [service-name]-restored-[date] \
|
||||
--query 'DBInstances[0].Endpoint.Address'
|
||||
```
|
||||
|
||||
**Step 5 — Validate restored data**
|
||||
```bash
|
||||
# Connect to restored database and verify integrity
|
||||
psql -h [restored-db-endpoint] -U [user] -d [database] -c "[data integrity query]"
|
||||
|
||||
# Confirm record counts match expectations
|
||||
psql -h [restored-db-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
|
||||
```
|
||||
|
||||
**Step 6 — Point service at restored database**
|
||||
```bash
|
||||
kubectl set env deployment/[service-name] \
|
||||
DATABASE_URL=postgres://[user]:[pass]@[restored-endpoint]/[db] \
|
||||
-n [namespace]
|
||||
|
||||
kubectl scale deployment/[service-name] --replicas=[N] -n [namespace]
|
||||
```
|
||||
|
||||
**Recovery confirmed when:** Service is running against restored database, data integrity checks pass, error rate is at baseline.
|
||||
|
||||
---
|
||||
|
||||
### 3.4 Critical Dependency Outage
|
||||
|
||||
**Trigger:** A service that [service name] depends on is unavailable or degraded.
|
||||
**Detection:** Increased error rate or latency on endpoints that call [dependency], alerts from dependency owner.
|
||||
**Expected RTO:** Depends on dependency — [30 minutes for mitigation, resolution depends on dependency owner]
|
||||
|
||||
**Dependency map:**
|
||||
|
||||
| Dependency | Criticality | Degraded behaviour | Mitigation |
|
||||
|---|---|---|---|
|
||||
| [Database] | Critical — all writes fail | Full outage | Activate DR database (Section 3.2, Step 3) |
|
||||
| [Cache — Redis] | High — latency increases | Performance degradation | Bypass cache, serve from DB |
|
||||
| [Auth service] | Critical — auth fails | All authenticated endpoints fail | Return cached tokens (if implemented) |
|
||||
| [Message queue] | Medium — async processing delays | Writes succeed, async jobs queue | Queue backlog — see on-call runbook |
|
||||
| [External API — name] | Low — feature X unavailable | Graceful degradation | Feature flag to disable feature X |
|
||||
|
||||
**Mitigation steps:**
|
||||
```bash
|
||||
# Enable circuit breaker / fallback for [dependency] if implemented
|
||||
kubectl set env deployment/[service-name] [DEPENDENCY]_CIRCUIT_BREAKER=open -n [namespace]
|
||||
|
||||
# Enable feature flag to disable [dependency-backed feature]
|
||||
[feature flag CLI command or dashboard link]
|
||||
|
||||
# Check if dependency has a status page
|
||||
# [Dependency status URL]
|
||||
```
|
||||
|
||||
**Escalation:** Contact [dependency] on-call via [PagerDuty / Slack `#[channel]`]. Share your service's error rate and the time dependency errors started.
|
||||
|
||||
---
|
||||
|
||||
### 3.5 Security Breach or Ransomware
|
||||
|
||||
**Trigger:** Evidence of unauthorized access, data exfiltration, or encryption of service data.
|
||||
**Detection:** Security tooling alert, unusual access patterns, user reports of data exposure.
|
||||
**Expected RTO:** [4+ hours — prioritise containment over speed] | **Expected RPO:** [Last verified clean backup]
|
||||
|
||||
**Step 1 — Isolate immediately**
|
||||
```bash
|
||||
# Take the service offline — do not attempt to recover while breach is active
|
||||
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
|
||||
|
||||
# Revoke all API keys and service account credentials immediately
|
||||
[command to rotate secrets — e.g. via Vault or cloud provider]
|
||||
|
||||
# Block all external access at network level
|
||||
[firewall/security group command to deny all inbound traffic]
|
||||
```
|
||||
|
||||
**Step 2 — Notify security team immediately**
|
||||
Page [Security lead] via PagerDuty. Do NOT attempt to remediate without security team involvement.
|
||||
|
||||
Post to `#security-incidents` (private channel, not `#incidents`):
|
||||
```
|
||||
🔴 SECURITY INCIDENT — [Service Name]
|
||||
Time detected: [Time]
|
||||
Evidence: [One sentence — what was observed]
|
||||
Actions taken: Service isolated, credentials revoked
|
||||
Awaiting: Security team guidance
|
||||
```
|
||||
|
||||
**Step 3 — Preserve evidence**
|
||||
```bash
|
||||
# Export current logs before any remediation
|
||||
[log export command — preserve evidence for forensics]
|
||||
|
||||
# Snapshot the current state of all infrastructure
|
||||
[snapshot/image command]
|
||||
```
|
||||
|
||||
**Steps 4+ — Follow security team guidance.** Do not restore from backup until security team confirms the attack vector is closed.
|
||||
|
||||
---
|
||||
|
||||
### 3.6 Accidental Bulk Data Deletion
|
||||
|
||||
**Trigger:** An operator, script, or application bug has deleted records in bulk.
|
||||
**Detection:** Sudden drop in record counts, user reports of missing data, application errors.
|
||||
**Expected RTO:** [60 minutes] | **Expected RPO:** [Backup interval]
|
||||
|
||||
```bash
|
||||
# Step 1 — Stop further writes immediately
|
||||
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
|
||||
|
||||
# Step 2 — Determine what was deleted and when
|
||||
psql $DATABASE_URL -c "
|
||||
SELECT schemaname, tablename,
|
||||
n_dead_tup, last_autovacuum
|
||||
FROM pg_stat_user_tables
|
||||
ORDER BY n_dead_tup DESC LIMIT 10;
|
||||
"
|
||||
|
||||
# Step 3 — Check if deletion is recoverable via MVCC (PostgreSQL)
|
||||
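# Caution: once the DELETE has committed, a plain SELECT cannot see the rows, even before VACUUM runs; this query only finds rows whose delete is still uncommitted or was rolled back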
|
||||
psql $DATABASE_URL -c "
|
||||
SELECT * FROM [table]
|
||||
WHERE xmax != 0  -- rows locked, or deleted by an open or rolled-back transaction
|
||||
LIMIT 100;
|
||||
"
|
||||
|
||||
# Step 4 — If not recoverable via MVCC, restore from backup
|
||||
# Follow Section 3.3 (Database Corruption runbook) from Step 3 onward
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Backup and Restore Procedures
|
||||
|
||||
### Backup Configuration
|
||||
|
||||
| Data store | Backup type | Frequency | Retention | Location |
|
||||
|---|---|---|---|---|
|
||||
| [Primary database] | Automated snapshots | Every [N] hours | [N] days | [S3 bucket / cloud storage path] |
|
||||
| [Primary database] | Transaction log backups | Continuous | [N] days | [Location] |
|
||||
| [Secondary store — e.g. Redis] | RDB dump | Daily | [N] days | [Location] |
|
||||
| [Blob/object storage] | Cross-region replication | Continuous | [N] days | [DR region bucket] |
|
||||
| [Config / secrets] | Terraform state + Vault backup | On change | Indefinite | [Location] |
|
||||
|
||||
### Backup Validation (Run Weekly)
|
||||
|
||||
```bash
|
||||
# Test restore of latest database backup to a throwaway instance
|
||||
aws rds restore-db-instance-from-db-snapshot \
|
||||
--db-instance-identifier [service-name]-backup-test-$(date +%Y%m%d) \
|
||||
--db-snapshot-identifier $(aws rds describe-db-snapshots \
|
||||
--db-instance-identifier [db-id] \
|
||||
--query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
|
||||
--output text)
|
||||
|
||||
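# Wait for the restore to finish, then run integrity checks
aws rds wait db-instance-available \
--db-instance-identifier [service-name]-backup-test-$(date +%Y%m%d)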
|
||||
psql -h [test-instance-endpoint] -U [user] -d [database] -c "[integrity check query]"
|
||||
|
||||
# Confirm row counts match recent production values (allow ≤ RPO difference)
|
||||
psql -h [test-instance-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
|
||||
|
||||
# Destroy the test instance
|
||||
aws rds delete-db-instance \
|
||||
--db-instance-identifier [service-name]-backup-test-$(date +%Y%m%d) \
|
||||
--skip-final-snapshot
|
||||
```
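The row-count comparison in the validation script leaves "match" undefined. One way to make it measurable is a percentage tolerance, sketched below with integer arithmetic. The tolerance value and the function name `row_count_within` are assumptions; the right tolerance depends on write volume over the RPO window.

```bash
# Sketch: succeed when the restored row count is within tolerance_pct of the expected count.
# row_count_within and the percentage tolerance are illustrative assumptions.
row_count_within() {
  expected="$1"; actual="$2"; tolerance_pct="$3"
  diff=$((expected - actual))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  # Integer form of: diff / expected <= tolerance_pct / 100
  [ $((diff * 100)) -le $((expected * tolerance_pct)) ]
}
```

For example, `row_count_within 1000 990 2` succeeds (1% difference), while `row_count_within 1000 900 2` fails (10% difference).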
|
||||
|
||||
---
|
||||
|
||||
## 5. DR Testing Cadence
|
||||
|
||||
Regular testing is mandatory. An untested DR plan is not a DR plan.
|
||||
|
||||
| Test type | Frequency | Who runs it | Pass criteria |
|
||||
|---|---|---|---|
|
||||
| Backup restore validation | Weekly (automated) | On-call rotation | Restore completes, integrity checks pass |
|
||||
| Zone failover drill | Monthly | Engineering team | RTO target met, zero data loss |
|
||||
| Region failover drill | Quarterly | Engineering + SRE | RTO/RPO targets met |
|
||||
| Full DR game day | Annually | Engineering + stakeholders | All scenarios exercised, gaps documented |
|
||||
| Chaos engineering (infra failures) | Weekly (automated) | Chaos engineering tooling | Service degrades gracefully, recovers automatically |
|
||||
|
||||
### Game Day Procedure
|
||||
|
||||
1. **Pre-game day (1 week before):** Notify all stakeholders, freeze production changes for the day, prepare DR environment.
|
||||
2. **Scope definition:** Choose 2–3 scenarios from Section 2. Document expected outcomes before the test.
|
||||
3. **Execute:** One person acts as incident commander, others execute the runbook steps, and a designated observer times each step.
|
||||
4. **Measure:** Record actual RTO and RPO against targets for each scenario.
|
||||
5. **Debrief (same day):** Document gaps, runbook inaccuracies, and automation opportunities.
|
||||
6. **Action items:** File tickets for every gap found. Priority: P1 items must be fixed before next game day.
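The "Measure" step is less error-prone when the actual RTO is computed from recorded timestamps rather than estimated afterwards. This sketch takes Unix epoch seconds for the moment the DR event was declared and the moment recovery was confirmed; `rto_verdict` and its argument order are illustrative.

```bash
# Sketch: compute actual recovery time in minutes and compare it with the RTO target.
# Arguments: declared epoch seconds, recovered epoch seconds, target in minutes.
rto_verdict() {
  declared_epoch="$1"; recovered_epoch="$2"; target_min="$3"
  # Round partial minutes up so a 15m01s recovery does not pass a 15-minute target
  actual_min=$(( (recovered_epoch - declared_epoch + 59) / 60 ))
  if [ "$actual_min" -le "$target_min" ]; then
    echo "$actual_min MET"
  else
    echo "$actual_min MISSED"
  fi
}
```

During a drill, the observer might record `start=$(date +%s)` at declaration and later run `rto_verdict "$start" "$(date +%s)" 15`, then copy the result into the recovery message.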
|
||||
|
||||
---
|
||||
|
||||
## 6. Communication Plan
|
||||
|
||||
### Internal Communication During DR Event
|
||||
|
||||
**Incident commander responsibilities:**
|
||||
- Declare the DR event and open the incident channel
|
||||
- Post updates every 15 minutes minimum
|
||||
- Make the call to fail over (do not let the team decide by committee)
|
||||
- Notify business stakeholders of expected recovery time
|
||||
|
||||
**Notify these people at DR event start:**
|
||||
|
||||
| Role | Name | Contact | When to notify |
|
||||
|---|---|---|---|
|
||||
| Engineering manager | [Name] | [Slack / Phone] | Immediately |
|
||||
| CTO / VP Engineering | [Name] | [Phone] | Tier 1 services: immediately |
|
||||
| Customer success lead | [Name] | [Slack] | If customer-facing impact |
|
||||
| Security lead | [Name] | [Slack / PagerDuty] | If breach suspected |
|
||||
| Legal / compliance | [Name] | [Email / Phone] | If data loss involves PII |
|
||||
|
||||
### Communication Templates
|
||||
|
||||
**DR event declared:**
|
||||
```
|
||||
🔴 DR EVENT — [Service Name]
|
||||
Time: [HH:MM UTC]
|
||||
Scenario: [Zone failure / Region failure / Data loss / etc.]
|
||||
Impact: [Who is affected and how]
|
||||
RTO target: [X minutes]
|
||||
Incident commander: [Name]
|
||||
War room: [Slack channel / call link]
|
||||
Next update: [Time + 15 min]
|
||||
```
|
||||
|
||||
**Status update (every 15 minutes):**
|
||||
```
|
||||
🔴 DR UPDATE — [Service Name] — [HH:MM UTC]
|
||||
Status: [Investigating / Executing recovery / Verifying]
|
||||
Progress: [One sentence on current step]
|
||||
Blockers: [Any — or "None"]
|
||||
Updated RTO estimate: [Time]
|
||||
Next update: [Time + 15 min]
|
||||
```
|
||||
|
||||
**Recovery confirmed:**
|
||||
```
|
||||
✅ DR RESOLVED — [Service Name] — [HH:MM UTC]
|
||||
Total downtime: [X minutes]
|
||||
Data loss: [None / X minutes of transactions]
|
||||
RTO target: [X min] — Actual: [Y min] — [MET / MISSED]
|
||||
RPO target: [X min] — Actual: [Y min] — [MET / MISSED]
|
||||
Root cause: [One sentence]
|
||||
Post-incident review: [Scheduled for / Link when created]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. DR Readiness Checklist
|
||||
|
||||
Run this checklist quarterly and before any major infrastructure change:
|
||||
|
||||
**Backups:**
|
||||
- [ ] Automated backups are running and alerts fire if they fail
|
||||
- [ ] Most recent backup restore was tested within the last 7 days
|
||||
- [ ] Backup retention meets RPO and compliance requirements
|
||||
- [ ] Backups are stored in a separate region / account from primary
|
||||
|
||||
**Failover infrastructure:**
|
||||
- [ ] DR region / environment exists and is provisioned (not just documented)
|
||||
- [ ] DNS failover procedure is documented with exact commands
|
||||
- [ ] DR database replica is current (replication lag is within RPO)
|
||||
- [ ] Service can be deployed in DR region with a single command or automated pipeline
|
||||
|
||||
**Runbooks:**
|
||||
- [ ] All runbooks in Section 3 have been tested within the last quarter
|
||||
- [ ] Runbook commands have been verified against current infrastructure (no stale references)
|
||||
- [ ] Contact list is current (no departed employees)
|
||||
|
||||
**Access:**
|
||||
- [ ] On-call engineers have access to DR region console / CLI
|
||||
- [ ] Service account credentials for DR region are provisioned and tested
|
||||
- [ ] Break-glass accounts exist for emergency access if SSO is unavailable
|
||||
|
||||
**Monitoring:**
|
||||
- [ ] Monitoring exists in DR region (not just primary)
|
||||
- [ ] Alerts fire correctly when DR environment has issues
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] RPO and RTO targets are specific numbers, not ranges, and are agreed with the business
|
||||
- [ ] Every command in every runbook has been run by a human in the last quarter — not copied from documentation untested
|
||||
- [ ] DR database exists in the DR region and replication lag is monitored
|
||||
- [ ] Backup restore has been tested end-to-end within the last 7 days
|
||||
- [ ] The game day schedule is on the team calendar — not just documented here
|
||||
- [ ] Contact list contains current phone numbers, not just Slack handles (Slack may be down during a DR event)
|
||||
- [ ] Security breach runbook (3.5) explicitly names the security team contact and does not attempt self-remediation
|
||||
- [ ] All thresholds (RTO/RPO) are visible in the monitoring dashboard so actual vs. target is measurable in real time
|
||||
@@ -0,0 +1,338 @@
|
||||
---
|
||||
name: engineering-hiring-rubric
|
||||
description: "Build an engineering hiring rubric and technical interview scorecard for evaluating software engineers at a specific level. Use when asked to create an interview rubric, design a hiring process, build a technical scorecard, or standardize engineer evaluation. Produces a full interview scorecard, behavioral question bank, technical question set with evaluation criteria, system design rubric, and debrief agenda."
|
||||
---
|
||||
|
||||
# Engineering Hiring Rubric
|
||||
|
||||
Produce a complete hiring rubric and interview scorecard for evaluating software engineers at a specific role and level. The rubric must be specific enough that two interviewers who have never compared notes will score the same candidate within one level of each other. That requires: explicit behavioral anchors (what does "Strong Hire" look like vs. "Hire" for each competency), calibrated technical questions with written evaluation criteria, and a structured debrief format that surfaces signal rather than recency bias. Include calibration notes to help interviewers recognize and counter common evaluation biases.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Role** — backend, frontend, fullstack, SRE/platform, data, ML, or mobile engineer
|
||||
- **Level** — junior (L3/IC2), mid (L4/IC3), senior (L5/IC4), or staff (L6/IC5); clarify the company's level naming if different
|
||||
- **Team context** — what the team builds, team size, and what problems this hire will work on in the first year
|
||||
- **Tech stack** — primary languages and frameworks for the technical questions; list the stack explicitly
|
||||
- **Interview format** — which rounds are used (phone screen, coding, system design, behavioral, take-home); if not specified, produce a recommended format
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Engineering Hiring Rubric: [Role] — [Level]
|
||||
|
||||
**Role:** [e.g., Senior Backend Engineer]
|
||||
**Level equivalent:** [e.g., L5 / IC4 / Senior]
|
||||
**Team:** [Team name and one-sentence description of what they build]
|
||||
**Tech stack:** [Languages and frameworks]
|
||||
**Interview loop:** [List the rounds in order]
|
||||
|
||||
---
|
||||
|
||||
## 1. Role Definition and Level Expectations
|
||||
|
||||
### What This Role Does
|
||||
|
||||
[2–3 sentences describing the scope of work: what systems they'll own, what problems they'll solve, and who they'll work with. Make this specific to the team context provided.]
|
||||
|
||||
### Level Bar
|
||||
|
||||
Define the minimum bar for a Hire recommendation at this level. This is not the ideal candidate description — it is the floor.
|
||||
|
||||
| Dimension | [Level] Floor | One Level Below (No Hire) | One Level Above (Stretch) |
|
||||
|-----------|--------------|---------------------------|---------------------------|
|
||||
| Technical scope | [e.g., "Owns a service or major feature area end-to-end with minimal guidance"] | [e.g., "Completes well-defined tasks; needs guidance on scope and approach"] | [e.g., "Leads cross-team technical initiatives; sets technical direction"] |
|
||||
| Problem solving | [e.g., "Breaks ambiguous problems into concrete sub-problems independently"] | [e.g., "Solves defined problems well; struggles with ambiguity"] | [e.g., "Identifies problems others miss; structures organization-level technical challenges"] |
|
||||
| Code quality | [e.g., "Writes production-ready code; anticipates edge cases; reviewable without significant rework"] | [e.g., "Writes working code that requires significant review feedback"] | [e.g., "Sets code quality standards; designs reusable abstractions adopted by others"] |
|
||||
| Communication | [e.g., "Communicates technical decisions clearly to peers and stakeholders"] | [e.g., "Communicates well with direct team; struggles with cross-team or stakeholder comms"] | [e.g., "Drives technical consensus across teams; writes documents others reference"] |
|
||||
| Ownership | [e.g., "Sees work to production; monitors after deploy; follows up on issues proactively"] | [e.g., "Delivers assigned work; escalates issues but doesn't drive them to resolution"] | [e.g., "Owns outcomes across teams; improves team processes and systems beyond their own work"] |
|
||||
|
||||
---
|
||||
|
||||
## 2. Interview Loop Structure
|
||||
|
||||
| Round | Format | Duration | Interviewer | Competencies Assessed |
|
||||
|-------|--------|----------|-------------|----------------------|
|
||||
| Phone screen | Video call, technical questions | 45 min | [Hiring manager or senior engineer] | Problem solving, communication, basic technical depth |
|
||||
| Coding interview 1 | Live coding — [platform] | 60 min | [Engineer] | Coding, data structures, code quality |
|
||||
| Coding interview 2 | Live coding — [platform] | 60 min | [Engineer] | Algorithms, debugging, code quality |
|
||||
| System design | Whiteboard / shared doc | 60 min | [Senior/Staff engineer] | System design, scalability, technical communication |
|
||||
| Behavioral | Structured interview | 45 min | [Hiring manager] | Ownership, collaboration, growth mindset |
|
||||
| [Optional] Take-home | Asynchronous project | [X hours] | [Reviewer] | Code quality, thoroughness, real-world problem solving |
|
||||
|
||||
**Interview coverage matrix:** Each competency dimension must be assessed by at least 2 independent interviewers; primary and secondary signals both count toward this minimum.
|
||||
|
||||
| Competency | Phone Screen | Coding 1 | Coding 2 | System Design | Behavioral |
|
||||
|-----------|-------------|---------|---------|--------------|-----------|
|
||||
| Coding | ○ | ● | ● | ○ | |
|
||||
| System design | ○ | | | ● | |
|
||||
| Problem solving | ● | ● | ● | ● | |
|
||||
| Code quality | | ● | ● | | |
|
||||
| Communication | ● | ● | ● | ● | ● |
|
||||
| Ownership | ○ | | | ○ | ● |
|
||||
| Debugging | | ● | ● | | |
|
||||
|
||||
● = Primary signal ○ = Secondary signal
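
The two-interviewer requirement can be checked mechanically once rounds are assigned to people. The sketch below is a minimal Python example, not part of the rubric itself: the round-to-interviewer assignment, the names in it, and the `coverage_gaps` helper are hypothetical, and it assumes primary and secondary signals both count toward coverage.

```python
from typing import Dict, List

# Rounds in which each competency is assessed, copied from the matrix above.
# Primary and secondary signals are treated alike in this sketch (an assumption).
COVERAGE: Dict[str, List[str]] = {
    "Coding": ["Phone Screen", "Coding 1", "Coding 2", "System Design"],
    "System design": ["Phone Screen", "System Design"],
    "Problem solving": ["Phone Screen", "Coding 1", "Coding 2", "System Design"],
    "Code quality": ["Coding 1", "Coding 2"],
    "Communication": ["Phone Screen", "Coding 1", "Coding 2", "System Design", "Behavioral"],
    "Ownership": ["Phone Screen", "System Design", "Behavioral"],
    "Debugging": ["Coding 1", "Coding 2"],
}


def coverage_gaps(
    coverage: Dict[str, List[str]],
    interviewer_by_round: Dict[str, str],
    min_interviewers: int = 2,
) -> Dict[str, List[str]]:
    """Return competencies seen by fewer than min_interviewers distinct people."""
    gaps = {}
    for competency, rounds in coverage.items():
        interviewers = {interviewer_by_round[r] for r in rounds}
        if len(interviewers) < min_interviewers:
            gaps[competency] = sorted(interviewers)
    return gaps


# Hypothetical assignment: the same engineer runs both coding rounds.
assignment = {
    "Phone Screen": "hiring_manager",
    "Coding 1": "engineer_a",
    "Coding 2": "engineer_a",
    "System Design": "staff_engineer",
    "Behavioral": "hiring_manager",
}
print(coverage_gaps(COVERAGE, assignment))
```

With this assignment the check flags Code quality and Debugging, because both are only assessed in the two coding rounds. Assigning a second engineer to one of those rounds clears the gap.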
|
||||
|
||||
---
|
||||
|
||||
## 3. Coding Interview Guide
|
||||
|
||||
### Question Selection
|
||||
|
||||
Choose 1–2 problems per coding round. Problems should be solvable in 30–40 minutes with the remaining time for discussion and follow-ups. Prefer problems with multiple solution tiers so you can see how far candidates take their thinking.
|
||||
|
||||
### Problem Template
|
||||
|
||||
**Problem: [Title]**
|
||||
|
||||
*Prompt (read to candidate):*
|
||||
> [Problem statement — be specific. Include constraints (input size, value ranges). Avoid ambiguity that tests problem-reading rather than problem-solving.]
|
||||
|
||||
*Example:*
|
||||
> Given a list of integers representing stock prices at each minute of a trading day, return the maximum profit you could achieve by making exactly one buy and one sell. You may not sell before you buy.
|
||||
|
||||
**Clarifying questions a strong candidate will ask:**
|
||||
- [e.g., "Can the list be empty?" / "Are all values positive?" / "Can profit be negative — i.e., should we return 0 if no profit is possible?"]
|
||||
|
||||
**Solution tiers:**
|
||||
|
||||
| Tier | Approach | Time Complexity | Space Complexity | Signals |
|
||||
|------|----------|-----------------|-----------------|---------|
|
||||
| Baseline | [Brute force — O(n²) nested loop] | O(n²) | O(1) | Can solve the problem; understands correctness |
|
||||
| Expected | [Single pass, tracking min price seen so far] | O(n) | O(1) | Strong problem solver; explains tradeoff |
|
||||
| Strong | [Generalizes to k transactions, or extends to cooldown variant without prompting] | O(n·k) for k transactions; O(n) for cooldown | O(k) for k transactions; O(1) for cooldown | Staff-level generalization thinking |
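
For interviewer reference, the Baseline and Expected tiers of the stock-price example could look like the following in Python. This is a sketch rather than a model answer. It assumes the function returns 0 when no profitable trade exists, which is one of the clarifying questions above, and the function names are illustrative. The third function sketches the "at most k trades" follow-up using a standard dynamic-programming formulation, which runs in O(n·k) time and O(k) space.

```python
from typing import List, Sequence


def max_profit_brute_force(prices: Sequence[int]) -> int:
    """Baseline tier: try every buy/sell pair. O(n^2) time, O(1) space."""
    best = 0
    for buy in range(len(prices)):
        for sell in range(buy + 1, len(prices)):
            best = max(best, prices[sell] - prices[buy])
    return best


def max_profit_single_pass(prices: Sequence[int]) -> int:
    """Expected tier: track the lowest price seen so far. O(n) time, O(1) space."""
    best = 0
    lowest = float("inf")
    for price in prices:
        lowest = min(lowest, price)
        best = max(best, price - lowest)
    return best


def max_profit_k_transactions(prices: Sequence[int], k: int) -> int:
    """Follow-up: at most k buy/sell pairs. O(n*k) time, O(k) space."""
    if k <= 0 or len(prices) < 2:
        return 0
    # cost[j]: lowest effective buy price for trade j+1, net of earlier profit.
    # profit[j]: best profit achievable using at most j+1 trades so far.
    cost: List[float] = [float("inf")] * k
    profit: List[int] = [0] * k
    for price in prices:
        for j in range(k):
            earlier = profit[j - 1] if j > 0 else 0
            cost[j] = min(cost[j], price - earlier)
            profit[j] = max(profit[j], price - cost[j])
    return profit[-1]
```

A candidate who reaches the single-pass version should be able to explain why tracking only the running minimum is sufficient: the best sale at any minute pairs with the cheapest purchase before it.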
|
||||
|
||||
**Follow-up questions:**
|
||||
- [e.g., "What if you could make at most k trades?"]
|
||||
- [e.g., "How would you test this function? Write me 3 test cases."]
|
||||
- [e.g., "Walk me through your code as if you're explaining it in a code review."]
|
||||
|
||||
**Evaluation rubric for this problem:**
|
||||
|
||||
| Signal | Strong Hire | Hire | No Hire |
|
||||
|--------|------------|------|---------|
|
||||
| Problem comprehension | Asks 1–2 clarifying questions immediately; identifies edge cases before coding | Understands the problem after 1 prompt; misses 1–2 edge cases | Misunderstands the problem or requires repeated clarification |
|
||||
| Solution quality | O(n) solution; clean code; handles all edge cases | O(n) with hints; code is readable but has minor issues | O(n²) with hints, or correct solution with significant issues |
|
||||
| Code quality | Well-named variables; logical structure; would pass code review | Functional but verbose or inconsistently named | Hard to follow; would require significant review feedback |
|
||||
| Communication | Narrates thinking throughout; explains complexity; self-corrects | Explains solution when asked; answers follow-ups well | Silent during coding; unable to explain their approach |
|
||||
| Follow-ups | Extends solution confidently; identifies further improvements | Handles follow-ups with moderate prompting | Unable to extend or explain tradeoffs |
|
||||
|
||||
---
|
||||
|
||||
## 4. System Design Interview Guide
|
||||
|
||||
### [Level]-Appropriate Design Scope
|
||||
|
||||
At [Level], expect the candidate to:
|
||||
- [e.g., Senior: "Design a complete system with capacity estimates, component breakdown, and discussion of failure modes"]
|
||||
- [e.g., Mid: "Design the core components of a system; may need prompting on scalability and failure handling"]
|
||||
- [e.g., Junior: "Design a simple client-server system; focus on clarity of thinking over complete distributed systems knowledge"]
|
||||
|
||||
### Sample Design Question
|
||||
|
||||
**Question:** "Design [a URL shortener / a rate limiter / a notification service / a ride-matching system — choose one relevant to the team's domain]."
|
||||
|
||||
**Evaluation dimensions:**
|
||||
|
||||
| Dimension | What to assess | Strong Hire | Hire | No Hire |
|
||||
|-----------|---------------|------------|------|---------|
|
||||
| Requirements clarification | Does the candidate ask before designing? | Asks scope, scale, SLA, and key use cases before drawing anything | Asks some questions; may miss scale or SLA | Starts designing immediately without clarifying |
|
||||
| High-level design | Can they describe the major components? | Clear component breakdown with justified choices; covers data flow | Reasonable breakdown; may overcomplicate or oversimplify | Missing key components or cannot explain data flow |
|
||||
| Data model | Can they design a schema or data structure for the system? | Models the core entities with normalization/denormalization tradeoffs discussed | Reasonable schema; may miss indexing or partitioning needs | Cannot model the data or produces clearly wrong schema |
|
||||
| Scalability | Can they identify and address bottlenecks? | Identifies bottlenecks proactively; proposes horizontal scaling, caching, or sharding as appropriate | Discusses scaling when prompted; reasonable solutions | Cannot identify bottlenecks or proposes solutions that don't match the scale |
|
||||
| Failure handling | Do they think about what happens when things break? | Proactively discusses failure modes: single points of failure, retry logic, idempotency | Discusses failure when prompted; identifies some failure modes | Does not think about failure; assumes happy path |
|
||||
| Communication | Is the design explained clearly? | Explains the design clearly enough to lead a real design review with a team of engineers | Clear enough to follow; some gaps in explanation | Difficult to follow; interviewer cannot understand the design |
|
||||
|
||||
### Design Probing Questions
|
||||
|
||||
Use these to probe depth after the candidate presents their design:
|
||||
- "Walk me through what happens when a write request comes in at peak load — 10,000 requests per second."
|
||||
- "Your primary database just failed. What happens to the system?"
|
||||
- "You estimated X QPS. How would your design change if it needed to handle 100× that?"
|
||||
- "Where is the first place this system would fall over under load?"
|
||||
- "How would you monitor this in production? What would your on-call runbook look like?"
|
||||
|
||||
---
|
||||
|
||||
## 5. Behavioral Interview Question Bank
|
||||
|
||||
Map every question to a competency. Ask 4–6 questions per behavioral round using STAR format (Situation, Task, Action, Result). Do not ask leading questions.
|
||||
|
||||
### Competency: Ownership and Delivery
|
||||
|
||||
1. "Tell me about a time you owned something end-to-end — from design through production monitoring. What did you do when something went wrong after launch?"
|
||||
- *Strong signal:* Describes proactive monitoring setup, a specific incident they caught themselves, and what they changed
|
||||
- *Weak signal:* Describes writing the code and handing off; no discussion of production behavior
|
||||
|
||||
2. "Describe a project that was significantly delayed or failed. What was your role, and what did you take responsibility for?"
|
||||
- *Strong signal:* Direct ownership of their contribution to the failure; specific changes to how they work
|
||||
- *Weak signal:* Attributes all delay to external factors; no reflection on their own actions
|
||||
|
||||
### Competency: Technical Judgment
|
||||
|
||||
3. "Tell me about a significant technical decision you made. What options did you consider, and how did you decide?"
|
||||
- *Strong signal:* Named alternatives with clear tradeoffs; explains who they consulted; reflects on whether they'd decide the same way today
|
||||
- *Weak signal:* "I knew X was the right answer" without describing the decision process
|
||||
|
||||
4. "Describe a time you had to push back on a technical direction — either from management or from peers. What happened?"
|
||||
- *Strong signal:* Evidence-based disagreement; constructive communication; willing to commit once decision was made even if they lost the argument
|
||||
- *Weak signal:* Either never pushed back or pushed back emotionally without evidence
|
||||
|
||||
### Competency: Collaboration and Communication
|
||||
|
||||
5. "Tell me about a time you had to explain a complex technical concept to a non-technical stakeholder. How did you approach it?"
|
||||
- *Strong signal:* Used analogy or simplified model; confirmed understanding; adapted to the audience
|
||||
- *Weak signal:* "I explained it technically and told them to trust me"
|
||||
|
||||
6. "Describe a situation where you and a peer strongly disagreed on an approach. How did it resolve?"
|
||||
- *Strong signal:* Sought a third opinion or data; focused on the right outcome, not being right; maintained relationship
|
||||
- *Weak signal:* Escalated immediately or capitulated without engaging
|
||||
|
||||
### Competency: Growth and Learning
|
||||
|
||||
7. "What is a significant technical mistake you made in the last two years? What did you learn from it?"
|
||||
- *Strong signal:* Specific mistake, clear causal analysis, concrete behavioral change afterward
|
||||
- *Weak signal:* Cannot name a specific mistake; describes a minor issue to avoid vulnerability
|
||||
|
||||
8. "How do you stay current in [relevant technical area]? Give me a specific example of something you learned recently and applied."
|
||||
- *Strong signal:* Named sources, applied learning in a specific project with a concrete outcome
|
||||
- *Weak signal:* "I read blogs" with no specifics; no applied example
|
||||
|
||||
---
|
||||
|
||||
## 6. Full Interview Scorecard
|
||||
|
||||
Complete one scorecard per interview round. Collect all scorecards before the debrief.
|
||||
|
||||
```
|
||||
INTERVIEW SCORECARD
|
||||
===================
|
||||
Candidate: ______________________
|
||||
Interviewer: ______________________
|
||||
Round: ______________________
|
||||
Date: ______________________
|
||||
Interview format: ______________________
|
||||
|
||||
COMPETENCY RATINGS
|
||||
Rate each dimension independently. Do not average.
|
||||
Scale: 1 = Strong No Hire | 2 = No Hire | 3 = Hire | 4 = Strong Hire
|
||||
|
||||
1 2 3 4 Notes
|
||||
Coding / Technical skill [ ] [ ] [ ] [ ] ___________________________
|
||||
Problem solving [ ] [ ] [ ] [ ] ___________________________
|
||||
System design [ ] [ ] [ ] [ ] ___________________________
|
||||
Code quality [ ] [ ] [ ] [ ] ___________________________
|
||||
Debugging [ ] [ ] [ ] [ ] ___________________________
|
||||
Communication [ ] [ ] [ ] [ ] ___________________________
|
||||
Ownership [ ] [ ] [ ] [ ] ___________________________
|
||||
Collaboration [ ] [ ] [ ] [ ] ___________________________
|
||||
|
||||
SPECIFIC EVIDENCE
|
||||
What did the candidate do or say that drove your rating?
|
||||
(Required — write observable behaviors, not impressions)
|
||||
|
||||
Strongest signal (positive):
|
||||
___________________________________________________________________________
|
||||
|
||||
Strongest concern or gap:
|
||||
___________________________________________________________________________
|
||||
|
||||
OVERALL RECOMMENDATION
|
||||
[ ] Strong Hire [ ] Hire [ ] No Hire [ ] Strong No Hire
|
||||
|
||||
OVERALL RECOMMENDATION RATIONALE
|
||||
(Required: 3–5 sentences. State your recommendation, the evidence
|
||||
that supports it, and the specific gap or risk if not a Strong Hire)
|
||||
___________________________________________________________________________
|
||||
___________________________________________________________________________
|
||||
___________________________________________________________________________
|
||||
|
||||
Level signal: This candidate demonstrated [ L_ / L_ ] level behaviors.
|
||||
|
||||
NEED MORE CONTEXT BEFORE DEBRIEF? (Request it from the facilitator, not from other interviewers)
|
||||
[ ] No — I have a clear independent signal
|
||||
[ ] Yes — I need context on [specific area] to complete my assessment
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Hiring Recommendation Framework
|
||||
|
||||
| Recommendation | Meaning | When to use |
|
||||
|---------------|---------|-------------|
|
||||
| **Strong Hire** | Confident the candidate will exceed the level bar and be a high performer on the team | Evidence across 3+ competencies at above-bar level; no significant concerns |
|
||||
| **Hire** | Confident the candidate meets the level bar; will perform well | Meets bar on all must-have competencies; may have 1 area to develop |
|
||||
| **No Hire** | Does not meet the level bar | Below bar on 1+ must-have competency, or gap too large to close quickly |
|
||||
| **Strong No Hire** | Clear mismatch — well below the bar, or a specific disqualifying signal | Significant gaps across multiple competencies, or a values/behavior concern |
|
||||
|
||||
**Must-have competencies for [Role] at [Level]:** [List 3–4 competencies where a No Hire score on any one of them means the overall recommendation must be No Hire, regardless of performance elsewhere. Example: "Coding and System Design are must-have competencies for a Senior Backend Engineer. Strong performance on Behavioral dimensions cannot compensate for a No Hire on Coding."]
|
||||
|
||||
**Debrief rule:** A Strong Hire can override one No Hire only if: (a) the No Hire is not on a must-have competency, and (b) the Strong Hire interviewer can articulate why the concern is not disqualifying. A Strong No Hire cannot be overridden; escalate to the hiring manager.
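
The blocking-competency rule and the debrief rule above can be written down as a small decision function, which helps a facilitator apply them the same way for every candidate. The following Python sketch is one possible reading of those rules, not a definitive policy: the `Scorecard` class, the `debrief_starting_point` function, and the returned strings are all hypothetical, and condition (b) remains a human judgment that code cannot make.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Rating scale from the scorecard in Section 6.
STRONG_NO_HIRE, NO_HIRE, HIRE, STRONG_HIRE = 1, 2, 3, 4


@dataclass
class Scorecard:
    interviewer: str
    overall: int  # overall recommendation, 1..4
    ratings: Dict[str, int] = field(default_factory=dict)  # competency -> 1..4


def debrief_starting_point(cards: List[Scorecard], blocking: Set[str]) -> str:
    """Suggest where the debrief starts; the panel still makes the decision."""
    # A Strong No Hire cannot be overridden, so it goes to the hiring manager.
    if any(card.overall == STRONG_NO_HIRE for card in cards):
        return "escalate to hiring manager"
    # A No Hire (or lower) on any blocking competency forces No Hire.
    for card in cards:
        for competency, rating in card.ratings.items():
            if competency in blocking and rating <= NO_HIRE:
                return "no hire"
    no_hires = [card for card in cards if card.overall == NO_HIRE]
    if not no_hires:
        return "hire"
    # One No Hire may be overridden, but only with a Strong Hire advocate.
    has_strong_hire = any(card.overall == STRONG_HIRE for card in cards)
    if len(no_hires) == 1 and has_strong_hire:
        return "override possible if the Strong Hire interviewer justifies it"
    return "no hire"
```

The function deliberately returns a starting point rather than a verdict, since the framework leaves the final call to the panel and, without consensus, to the hiring manager.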
|
||||
|
||||
---
|
||||
|
||||
## 8. Debrief Agenda
|
||||
|
||||
Do not share recommendations verbally before the debrief. Every interviewer submits a written scorecard first.
|
||||
|
||||
```
|
||||
DEBRIEF AGENDA — [Candidate Name]
|
||||
Duration: 45 minutes
|
||||
Facilitator: [Hiring Manager]
|
||||
|
||||
0:00 – 0:05 SCORECARD REVIEW
|
||||
Each interviewer states their overall recommendation only (no rationale yet).
|
||||
Facilitator notes alignment and disagreements on whiteboard/doc.
|
||||
|
||||
0:05 – 0:15 EVIDENCE ROUND
|
||||
Go around the table. Each interviewer shares:
|
||||
- Their strongest positive signal (observable behavior, not impression)
|
||||
- Their biggest concern (observable behavior, not impression)
|
||||
No discussion yet — just evidence gathering.
|
||||
|
||||
0:15 – 0:30 DISCUSS DISAGREEMENTS
|
||||
Address only the competency dimensions where interviewers disagree.
|
||||
Anchor discussion on: "What did you observe?" not "What do you think?"
|
||||
If interviewers assessed different competencies, disagreement may reflect
|
||||
insufficient signal — note this.
|
||||
|
||||
0:30 – 0:40 DECISION
|
||||
Reach a decision on overall recommendation.
|
||||
If consensus: state the recommendation and rationale.
|
||||
If not consensus: hiring manager makes the call and states why.
|
||||
|
||||
0:40 – 0:45 PROCESS NOTES
|
||||
- Were any questions unclear or hard to compare across candidates?
|
||||
- Any bias signals observed during the debrief? (see Section 9)
|
||||
- Feedback to improve the process for next time.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Calibration and Bias Reduction Notes
|
||||
|
||||
Brief every interviewer on these before they conduct their first interview for this role.
|
||||
|
||||
| Bias | How it manifests | Counter-measure |
|
||||
|------|-----------------|-----------------|
|
||||
| Halo effect | Strong performance in round 1 colors ratings in round 2 | Submit scorecard before reading others; rate each competency independently |
|
||||
| Similarity bias | "I liked them" correlates with "they think like me" | Require observable evidence for every rating; check: "Is this a signal about their ability or their similarity to me?" |
|
||||
| Recency bias | Final impression dominates overall rating | Take notes during the interview; write evidence immediately after; debrief uses written evidence, not memory |
|
||||
| Expectation anchoring | First interviewer's opinion anchors all others | No verbal discussion between interviewers before debrief; written scorecards submitted before debrief starts |
|
||||
| Culture fit as cover | "Not a culture fit" without specific behavioral evidence | "Culture fit" is not a valid dimension on this scorecard; use Collaboration and Communication with evidence |
|
||||
| Credential bias | A prestigious degree or previous employer inflates the rating | Do not list educational background in pre-interview briefing documents; focus on demonstrated behaviors |
|
||||
| Confidence ≠ Competence | Articulate candidates rated higher regardless of correctness | Grade the answer quality, not the delivery style; use written rubrics per question |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Level bar table defines a concrete floor for the level — not aspirational traits — with a comparison to one level below and above
|
||||
- [ ] Every behavioral question includes explicit strong-signal and weak-signal descriptions, not just the question text
|
||||
- [ ] Coding problem(s) include solution tiers with time and space complexity, plus a per-question rubric with behavioral anchors
|
||||
- [ ] System design rubric evaluates at minimum: requirements clarification, high-level design, data model, scalability, and failure handling
|
||||
- [ ] Scorecard uses observable behavior fields ("What did the candidate do or say") — not impression fields
|
||||
- [ ] Must-have competencies are explicitly named for the role and level
|
||||
- [ ] Debrief agenda enforces written scorecard submission before verbal discussion to prevent anchoring
|
||||
@@ -0,0 +1,164 @@
|
||||
---
|
||||
name: engineering-weekly-report
|
||||
description: "Write a weekly engineering status report for a team, service, or initiative. Use when asked to write a team update, weekly engineering report, sprint status email, or standing team communication to stakeholders. Produces a concise, scannable weekly report covering shipping progress, metrics, decisions, blockers, and next-week priorities."
|
||||
---
|
||||
|
||||
# Engineering Weekly Report
|
||||
|
||||
Produce a weekly engineering status report that a team can send to stakeholders, their engineering manager, and the team itself. The format is fixed week-over-week so readers know exactly where to look — shipping progress at the top, decisions in the middle, risks and next steps at the bottom. The report must be readable in under 2 minutes. Avoid prose walls: use bullet points, status tags, and short tables. If metrics are not provided, leave the metrics section with [data needed] markers rather than fabricating numbers.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Team name and report period** — team name plus week number or date range (e.g., "Platform Team, Week 21, May 12–16")
|
||||
- **Work items shipped this week** — what was completed and released or merged
|
||||
- **Work items in progress** — what is actively being worked on, with rough percent-complete if known
|
||||
- **Blocked items** — what is blocked, who owns the block, and what is needed to unblock
|
||||
- **Key decisions made** — any architecture, process, or priority decisions made this week
|
||||
- **Decisions needed next week** — any decisions that need to be made soon and who needs to make them
|
||||
- **Risks and escalations** — anything that threatens next week's commitments or needs leadership visibility
|
||||
- **Next week's top priorities** — the 3–5 things the team plans to accomplish next week
|
||||
|
||||
Optional but useful:
|
||||
- **Key metrics** — reliability (error rate, p99 latency), velocity (story points completed), or other health indicators
|
||||
- **Team health notes** — PTO, new joins, attrition, morale signals worth noting
|
||||
- **Sprint or iteration number** — if the team runs sprints
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Engineering Weekly Report — [Team Name]
|
||||
**Week:** [Week Number] | [Date Range, e.g., May 12–16, 2025]
|
||||
**Author:** [Name or Team Lead]
|
||||
**Distribution:** [e.g., Eng leadership, Product, Team]
|
||||
|
||||
---
|
||||
|
||||
## Shipping Progress
|
||||
|
||||
### Shipped This Week
|
||||
|
||||
| Item | Description | Impact |
|
||||
|------|-------------|--------|
|
||||
| [Feature / Fix / Infra change] | [One-line description] | [Who benefits / what it unblocks] |
|
||||
| [Feature / Fix / Infra change] | [One-line description] | [Who benefits / what it unblocks] |
|
||||
| [Feature / Fix / Infra change] | [One-line description] | [Who benefits / what it unblocks] |
|
||||
|
||||
### In Progress
|
||||
|
||||
| Item | Owner | Status | Target Ship |
|
||||
|------|-------|--------|-------------|
|
||||
| [Work item] | [Name] | [~40% / On Track / At Risk] | [Date or Sprint] |
|
||||
| [Work item] | [Name] | [~70% / On Track / At Risk] | [Date or Sprint] |
|
||||
| [Work item] | [Name] | [~20% / On Track / At Risk] | [Date or Sprint] |
|
||||
|
||||
### Blocked
|
||||
|
||||
| Item | Blocked Since | Blocker Description | Owner | Needed To Unblock |
|
||||
|------|--------------|--------------------|----|-------------------|
|
||||
| [Work item] | [Date] | [What is blocking progress] | [Name] | [Specific ask — decision, resource, dependency] |
|
||||
|
||||
If no items are blocked: *No active blockers.*
|
||||
|
||||
---
|
||||
|
||||
## Key Metrics
|
||||
|
||||
*Metrics reported as of [Date]. Prior-week values appear in the Last Week column.*
|
||||
|
||||
| Metric | This Week | Last Week | Trend | Target |
|
||||
|--------|-----------|-----------|-------|--------|
|
||||
| Error rate (5xx) | [X%] | [X%] | [↑ / ↓ / →] | < [threshold] |
|
||||
| p99 latency | [Xms] | [Xms] | [↑ / ↓ / →] | < [threshold] |
|
||||
| Deployment frequency | [X deploys] | [X deploys] | [↑ / ↓ / →] | [target] |
|
||||
| Story points completed | [X] | [X] | [↑ / ↓ / →] | [sprint target] |
|
||||
| On-call page volume | [X pages] | [X pages] | [↑ / ↓ / →] | < [threshold] |
|
||||
|
||||
**Metrics notes:** [Any context that makes the numbers meaningful — e.g., "Error rate spike on Tuesday tied to downstream dependency outage, resolved by EOD."]
|
||||
|
||||
If metrics are not provided: replace table rows with `[data needed — provide metric values for this section]`.
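
The Trend column and the `[data needed]` fallback can be filled in consistently with a small helper. This is a minimal Python sketch under stated assumptions: `trend_arrow` and its `tolerance` parameter are hypothetical names, and the arrow only shows direction of change. Whether ↑ is good or bad depends on the metric (rising error rate is bad, rising deployment frequency is usually good).

```python
from typing import Optional


def trend_arrow(
    this_week: Optional[float],
    last_week: Optional[float],
    tolerance: float = 0.0,
) -> str:
    """Return the Trend cell for one metric row, or a placeholder if data is missing."""
    if this_week is None or last_week is None:
        # Never invent a value: surface the gap instead.
        return "[data needed]"
    delta = this_week - last_week
    if abs(delta) <= tolerance:
        return "→"
    return "↑" if delta > 0 else "↓"


# Example rows: (metric, this week, last week).
rows = [("Error rate (5xx) %", 0.4, 0.9), ("Deploys", 12, 9), ("p99 latency ms", None, 310)]
for name, current, previous in rows:
    print(f"| {name} | {trend_arrow(current, previous)} |")
```

Setting a non-zero `tolerance` avoids reporting noise as a trend, for example treating a 2 ms latency change as flat.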
|
||||
|
||||
---
|
||||
|
||||
## Decisions
|
||||
|
||||
### Made This Week
|
||||
|
||||
| Decision | Rationale | Owner | Stakeholders Informed |
|
||||
|----------|-----------|-------|----------------------|
|
||||
| [Decision description] | [Why — 1 sentence] | [Name] | [Yes / No — who] |
|
||||
| [Decision description] | [Why — 1 sentence] | [Name] | [Yes / No — who] |
|
||||
|
||||
If no decisions were made: *No major decisions this week.*
|
||||
|
||||
### Needed Next Week
|
||||
|
||||
| Decision | Context | Deadline | Decision Owner |
|
||||
|----------|---------|----------|----------------|
|
||||
| [What needs to be decided] | [Why it matters, what happens if delayed] | [Date] | [Name or role] |
|
||||
|
||||
If no decisions are pending: *No decisions pending.*
|
||||
|
||||
---
|
||||
|
||||
## Risks and Escalations
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation | Escalate To |
|
||||
|------|-----------|--------|-----------|-------------|
|
||||
| [Risk description] | [High/Med/Low] | [High/Med/Low] | [What we're doing about it] | [Name/role if escalation needed] |
|
||||
|
||||
**Escalations this week:** [Any item that needs immediate leadership attention — call it out explicitly here, do not bury it in a table row. If none: "None."]
|
||||
|
||||
---
|
||||
|
||||
## Team Health
|
||||
|
||||
| Item | Status |
|
||||
|------|--------|
|
||||
| Team capacity this week | [X of Y people at full capacity] |
|
||||
| PTO / out of office | [Names and dates, or "None"] |
|
||||
| New joins / departures | [Name, role, and date, or "None"] |
|
||||
| On-call this week | [Name] |
|
||||
| On-call next week | [Name] |
|
||||
|
||||
**Team notes:** [Any morale, workload, or team dynamic signals worth surfacing — keep this factual and constructive. If nothing to note: omit this line.]
|
||||
|
||||
---
|
||||
|
||||
## Next Week's Priorities
|
||||
|
||||
*The [3–5] things this team will ship or meaningfully advance next week.*
|
||||
|
||||
1. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
2. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
3. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
4. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
5. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
|
||||
**Capacity risk:** [If the team is at reduced capacity next week (PTO, incidents, etc.), note it here so stakeholders calibrate expectations.]
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Sprint Scorecard (if applicable)
|
||||
|
||||
| Sprint | Committed | Completed | Completion Rate | Carried Over |
|
||||
|--------|-----------|-----------|----------------|--------------|
|
||||
| Sprint [N-1] | [X pts] | [X pts] | [X%] | [X pts] |
|
||||
| Sprint [N] (current) | [X pts] | [X pts — partial] | [X% at midpoint] | TBD |
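
The Completion Rate and Carried Over columns follow directly from the committed and completed points. The Python sketch below shows the arithmetic. It assumes carried-over points are simply committed minus completed, which ignores scope added or removed mid-sprint, and the function name is illustrative.

```python
from typing import Dict


def sprint_summary(committed_pts: int, completed_pts: int) -> Dict[str, int]:
    """Compute completion rate (whole percent) and carried-over points for one sprint."""
    if committed_pts <= 0:
        raise ValueError("committed points must be positive")
    if completed_pts < 0:
        raise ValueError("completed points cannot be negative")
    rate = round(100 * completed_pts / committed_pts)
    # Over-delivery does not produce negative carry-over.
    carried = max(committed_pts - completed_pts, 0)
    return {"completion_rate_pct": rate, "carried_over_pts": carried}


# Example: 40 points committed, 34 completed.
print(sprint_summary(40, 34))
```

For the example row, 34 of 40 points gives an 85% completion rate with 6 points carried over.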
|
||||
|
||||
---
|
||||
|
||||
*Questions or corrections: [Slack channel or email] | Next report: [Date]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every blocked item names a specific owner and states what is concretely needed to unblock it — not just "waiting on X"
|
||||
- [ ] Decisions-needed table includes a deadline and a named decision owner, not a vague "TBD"
|
||||
- [ ] Metrics table is either populated with real numbers or explicitly marked `[data needed]` — no fabricated metrics
|
||||
- [ ] Next week's priorities are written as outcomes ("ship X", "complete Y migration") not as activities ("work on X")
|
||||
- [ ] Escalations that need leadership attention are called out explicitly in the Risks section — not just buried in a table row
|
||||
- [ ] The entire report is readable in under 2 minutes — if it is longer than one printed page, trim it
|
||||
- [ ] Report period (week number and date range) is clearly stated in the header
|
||||
@@ -0,0 +1,369 @@
|
||||
---
|
||||
name: feature-flag-guide
|
||||
description: "Write a feature flag management guide and lifecycle playbook for a service or team — covering flag taxonomy, creation checklist, rollout strategy, monitoring requirements, cleanup policy, and governance. Use when asked to document feature flag practices, create a flag rollout plan, write a feature flag policy, or guide a team on flag lifecycle management. Produces a flag lifecycle playbook, taxonomy reference, per-flag creation template, rollout decision tree, and cleanup checklist."
|
||||
---
|
||||
|
||||
# Feature Flag Guide Skill
|
||||
|
||||
Produce a complete feature flag management guide for a service or team — covering how flags are named and categorised, how to create and roll out a flag safely, what to monitor during rollout, when and how to clean up flags, and who is responsible for each stage. Feature flags without discipline become permanent technical debt. This guide gives the team a repeatable process so flags are created intentionally, rolled out safely, and removed when done.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service or team name** — scope of the guide
|
||||
- **Feature flag platform** — LaunchDarkly, Split, Unleash, Flagsmith, Flipt, or a custom/in-house solution
|
||||
- **Flag being documented** (if writing a per-flag guide) or "general guide" (if writing team-wide policy)
|
||||
- **Rollout constraints** — any compliance, data privacy, or contractual constraints on who can see a feature (e.g. HIPAA, EU-only, enterprise customers only)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Feature Flag Management Guide: [Service / Team Name]
|
||||
|
||||
**Team:** [Team name] | **Platform:** [LaunchDarkly / Split / Unleash / Custom]
|
||||
**Document owner:** [Name] | **Last updated:** [Date]
|
||||
**Review cycle:** Quarterly, and whenever the flag platform changes
|
||||
|
||||
---
|
||||
|
||||
## 1. Flag Taxonomy
|
||||
|
||||
Every flag belongs to exactly one category. The category determines default behaviour, who can enable it in production, and when it must be cleaned up.
|
||||
|
||||
| Type | Purpose | Default state | Production gate | Max lifetime |
|
||||
|---|---|---|---|---|
|
||||
| **Release flag** | Controls rollout of a new feature — decouples deploy from release | Off | Tech lead approval | 90 days from feature launch |
|
||||
| **Experiment flag** | A/B or multivariate test — measures impact of a change | Off (control group) | Product + tech lead | Duration of experiment + 30 days |
|
||||
| **Ops flag** | Operational control — circuit breaker, kill switch, throttle | On (normal behaviour) | On-call engineer can toggle | Indefinite (review annually) |
|
||||
| **Permission flag** | Gates access by user segment, tier, or region | Off (restricted) | Product + Account owner | Indefinite (review annually) |
|
||||
|
||||
**When in doubt:** If the flag is temporary (tied to a specific feature launch), it is a Release flag. If it will exist forever as a control knob, it is an Ops flag.
|
||||
|
||||
---
|
||||
|
||||
## 2. Flag Naming Convention
|
||||
|
||||
All flags must follow this naming scheme:
|
||||
|
||||
```
|
||||
[type]-[service]-[feature-description]
|
||||
```
|
||||
|
||||
| Segment | Values | Example |
|
||||
|---|---|---|
|
||||
| type | `release`, `exp`, `ops`, `perm` | `release` |
|
||||
| service | Short service identifier, lowercase, hyphenated | `payments` |
|
||||
| feature-description | Kebab-case description, max 5 words | `new-checkout-flow` |
|
||||
|
||||
**Full examples:**
|
||||
- `release-payments-new-checkout-flow` — release flag for a new checkout feature in the payments service
|
||||
- `exp-search-personalized-ranking` — experiment on personalized search ranking
|
||||
- `ops-api-rate-limit-override` — operational flag to override API rate limits
|
||||
- `perm-dashboard-beta-users-only` — permission flag gating dashboard for beta users
|
||||
|
||||
**Do not:**
|
||||
- Use ticket numbers in flag names (`release-JIRA-1234` → not searchable or self-describing)
|
||||
- Use dates in flag names (`release-dark-mode-jan-2024` → flags outlive their dates)
|
||||
- Use vague names (`release-new-thing` → not useful when you have 50 flags)
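
The naming rules above lend themselves to an automated check, for example in a pre-merge script. The following is a minimal sketch, not part of any flag platform: `validate_flag_name` is a hypothetical helper, it assumes the service segment is a single word (as in all the examples above), and it treats any numeric segment of three or more digits as a likely ticket number or date. Vague names such as `release-new-thing` cannot be caught mechanically and still need human review.

```python
import re

ALLOWED_TYPES = {"release", "exp", "ops", "perm"}
KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")
MAX_DESCRIPTION_WORDS = 5


def validate_flag_name(name: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    if not KEBAB_CASE.fullmatch(name):
        return ["name must be lowercase kebab-case"]
    parts = name.split("-")
    if len(parts) < 3:
        return ["expected [type]-[service]-[feature-description]"]
    # Assumption: the service identifier is a single word (parts[1]).
    flag_type, description = parts[0], parts[2:]
    problems = []
    if flag_type not in ALLOWED_TYPES:
        problems.append(f"unknown type '{flag_type}'")
    if len(description) > MAX_DESCRIPTION_WORDS:
        problems.append("feature description exceeds 5 words")
    if any(word.isdigit() and len(word) >= 3 for word in description):
        problems.append("numeric segment looks like a ticket number or date")
    return problems
```

Calling `validate_flag_name("release-payments-new-checkout-flow")` returns an empty list, while a name containing a year or an uppercase ticket key returns one or more problems.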
|
||||
|
||||
---
|
||||
|
||||
## 3. Flag Creation Checklist
|
||||
|
||||
Complete every item before creating a flag in the production environment.
|
||||
|
||||
**Before creating the flag:**
|
||||
- [ ] Flag type determined from taxonomy (Section 1)
|
||||
- [ ] Flag name follows naming convention (Section 2)
|
||||
- [ ] Flag owner assigned — one named engineer responsible for cleanup
|
||||
- [ ] Cleanup date set in the flag description field (for Release and Experiment flags)
|
||||
- [ ] Rollout strategy defined — see Section 4
|
||||
- [ ] Monitoring plan defined — see Section 5
|
||||
- [ ] Code review approved with flag guard in place
|
||||
|
||||
**Flag description field (required):**
|
||||
```
|
||||
Type: [Release / Experiment / Ops / Permission]
|
||||
Owner: [Name]
|
||||
Linked ticket: [JIRA-XXXX or GitHub issue URL]
|
||||
Purpose: [One sentence — what this flag controls]
|
||||
Cleanup by: [Date — required for Release and Experiment flags; "Annual review" for Ops/Permission]
|
||||
Rollout plan: [Link to this document or inline summary]
|
||||
```
|
||||
|
||||
**Code requirements:**
|
||||
```python
|
||||
# Good — behaviour is clear when flag is off, and cleanup is obvious
|
||||
if flag_client.is_enabled("release-[service]-[feature]", user_context):
|
||||
return new_feature_handler(request)
|
||||
else:
|
||||
return existing_handler(request)
|
||||
|
||||
# Bad — nested flags, ternaries, and implicit defaults make cleanup error-prone
|
||||
result = new_handler() if (f1 and not f2) or f3 else old_handler()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Rollout Strategy
|
||||
|
||||
### Decision Tree
|
||||
|
||||
Use this decision tree to pick the right rollout strategy for a Release or Experiment flag:
|
||||
|
||||
```
|
||||
Is the change reversible without a deploy?
|
||||
├── No → Use an Ops flag with manual enable, not a percentage rollout
|
||||
└── Yes → Continue
|
||||
|
||||
Is there a user-level identifier available (user ID, session ID)?
|
||||
├── No → Use server-side percentage (stateless, but inconsistent per user)
|
||||
└── Yes → Use user-based percentage (consistent experience per user) ← preferred
|
||||
|
||||
Is the change risky (touches payments, auth, or data writes)?
|
||||
├── Yes → Start at 1% → 5% → 25% → 50% → 100%, with 24-hour holds
|
||||
└── No → Start at 10% → 50% → 100%, with 4-hour holds
|
||||
|
||||
Does the change affect specific customer tiers or geographies?
|
||||
├── Yes → Use segment-based targeting, not percentage rollout
|
||||
└── No → Use percentage rollout
|
||||
```
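
The "consistent experience per user" property of user-based percentage rollout usually comes from deterministic bucketing: the same user always lands in the same bucket, so raising the percentage only adds users and never removes them. Commercial platforms implement their own bucketing; the sketch below is an illustration for custom or in-house solutions, and `rollout_bucket`, `in_rollout`, and the choice of SHA-256 are assumptions rather than any vendor's algorithm.

```python
import hashlib


def rollout_bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-9999."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10_000


def in_rollout(flag_key: str, user_id: str, percentage: float) -> bool:
    """True if the user falls inside the first `percentage` percent of buckets."""
    return rollout_bucket(flag_key, user_id) < percentage * 100
```

Including the flag key in the hash means a user who is in the first 1% for one flag is not automatically in the first 1% for every other flag.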
|
||||
|
||||
### Rollout Stages
|
||||
|
||||
| Stage | Percentage | Hold duration | Pass criteria before advancing |
|
||||
|---|---|---|---|
|
||||
| Canary | 1% | 24 hours | Error rate within SLO, no P1 incidents |
|
||||
| Early rollout | 5–10% | 24 hours | Error rate and latency match control group |
|
||||
| Partial rollout | 25–50% | 24–48 hours | Business metrics not degraded vs. control |
|
||||
| Majority | 75% | 24 hours | Final check — no regressions |
|
||||
| Full rollout | 100% | 48 hours | Stable — schedule cleanup |
|
||||
|
||||
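**Do not skip stages for Release flags in production.** Low-risk changes may follow the shorter 10% → 50% → 100% path from the decision tree, but every stage on the chosen path must meet its pass criteria before advancing. Speed of rollout is not worth a production incident.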
|
||||
|
||||
### Segment-Based Targeting
|
||||
|
||||
Use segment targeting when the rollout must be restricted:
|
||||
|
||||
```yaml
|
||||
# LaunchDarkly segment example — adapt for your platform
|
||||
targeting_rules:
|
||||
- clause:
|
||||
attribute: "subscription_tier"
|
||||
operator: "in"
|
||||
values: ["enterprise", "team"]
|
||||
serve: "on"
|
||||
- clause:
|
||||
attribute: "country"
|
||||
operator: "in"
|
||||
values: ["US", "CA", "GB"]
|
||||
serve: "on"
|
||||
default: "off"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Monitoring Requirements
|
||||
|
||||
Every flag that is not at 0% or 100% rollout requires active monitoring. Do not roll out a flag and walk away.
|
||||
|
||||
### Required Metrics Per Flag
|
||||
|
||||
| Metric | What to compare | Alert threshold |
|
||||
|---|---|---|
|
||||
| Error rate | Flag-on cohort vs. flag-off cohort | >2× baseline error rate in flag-on group |
|
||||
| p99 latency | Flag-on vs. flag-off | >20% higher latency in flag-on group |
|
||||
| [Primary business metric] | Flag-on vs. flag-off | >5% degradation in flag-on group |
|
||||
| [Conversion / completion rate] | Flag-on vs. flag-off | >2% drop in flag-on group |
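
The alert thresholds in the table can be expressed as a small comparison between the two cohorts. This is a sketch under stated assumptions: `CohortMetrics` and `breached_thresholds` are hypothetical names, rates are expressed as fractions, and the 2% conversion drop is interpreted as a relative drop against the flag-off cohort (the table does not say whether it is relative or absolute).

```python
from dataclasses import dataclass


@dataclass
class CohortMetrics:
    error_rate: float       # fraction of failed requests, e.g. 0.01
    p99_latency_ms: float   # 99th percentile latency in milliseconds
    conversion_rate: float  # fraction of sessions that convert, e.g. 0.30


def breached_thresholds(flag_on: CohortMetrics, flag_off: CohortMetrics) -> list[str]:
    """Return the names of metrics where the flag-on cohort breaches a threshold."""
    breaches = []
    if flag_on.error_rate > 2 * flag_off.error_rate:
        breaches.append("error_rate")        # more than 2x baseline
    if flag_on.p99_latency_ms > 1.2 * flag_off.p99_latency_ms:
        breaches.append("p99_latency")       # more than 20% slower
    if flag_on.conversion_rate < 0.98 * flag_off.conversion_rate:
        breaches.append("conversion_rate")   # assumed: more than 2% relative drop
    return breaches
```

An empty result means the rollout may continue holding at its current stage; any entry should trigger the alert configured for the flag owner.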
|
||||
|
||||
**Setting up split metric monitoring in [LaunchDarkly / Split / Datadog]:**
|
||||
```
|
||||
1. Navigate to the flag → Metrics tab
|
||||
2. Add metric: [primary business metric]
|
||||
3. Add metric: error_rate (service-level)
|
||||
4. Add metric: p99_latency (endpoint-level)
|
||||
5. Set alert: notify [flag owner] in Slack #[team-channel] if metric degrades by [threshold]
|
||||
6. Set experiment duration: [N days] if this is an Experiment flag
|
||||
```
|
||||
|
||||
### Guardrail Metrics
|
||||
|
||||
Guardrails override the primary metric. If any of the conditions below occurs, roll back immediately, regardless of what the primary metric shows, and investigate afterwards.
|
||||
|
||||
- Error rate exceeds SLO threshold ([X]%)
|
||||
- p99 latency exceeds SLO threshold ([Y] ms)
|
||||
- [Service-specific guardrail — e.g. payment failure rate, auth failure rate]
|
||||
|
||||
**Immediate rollback command if guardrail is breached:**
|
||||
```bash
|
||||
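# CLI names and options below are illustrative; confirm them against your platform's current CLI or API reference
# [LaunchDarkly CLI]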
|
||||
ld-cli flag update [project-key] [flag-key] --default-variation off
|
||||
|
||||
# [Split CLI]
|
||||
split-cli update-treatment [flag-name] --treatment "off" --percentage 100
|
||||
|
||||
# [Unleash CLI / API]
|
||||
curl -X POST https://[unleash-host]/api/admin/features/[flag-name]/disable \
|
||||
-H "Authorization: [admin-token]"
|
||||
|
||||
# [Custom — adapt to your implementation]
|
||||
[command or dashboard step]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Per-Flag Creation Template
|
||||
|
||||
Copy this template into your flag's description field and the linked ticket when creating a new flag:
|
||||
|
||||
```markdown
|
||||
## Flag: [flag-name]
|
||||
|
||||
**Type:** [Release / Experiment / Ops / Permission]
|
||||
**Owner:** [Name] ([Slack handle])
|
||||
**Created:** [Date]
|
||||
**Cleanup by:** [Date]
|
||||
**Linked ticket:** [URL]
|
||||
|
||||
### Purpose
|
||||
[One paragraph: what this flag controls, why it exists, what "on" and "off" mean]
|
||||
|
||||
### Rollout Plan
|
||||
| Stage | Target | Date | Approved by |
|
||||
|---|---|---|---|
|
||||
| Canary | 1% | [Date] | [Name] |
|
||||
| Early | 10% | [Date] | [Name] |
|
||||
| Partial | 50% | [Date] | [Name] |
|
||||
| Full | 100% | [Date] | [Name] |
|
||||
|
||||
### Monitoring
|
||||
- Primary metric: [metric name and dashboard link]
|
||||
- Guardrail metrics: error rate < [X]%, p99 < [Y] ms
|
||||
- Alert channel: #[team-channel]
|
||||
|
||||
### Rollback Procedure
|
||||
[Exact steps to turn the flag off in an emergency — should take < 2 minutes]
|
||||
|
||||
### Cleanup Checklist
|
||||
- [ ] Flag at 100% for 48+ hours with no incidents
|
||||
- [ ] Code path for flag-off branch removed from codebase
|
||||
- [ ] Flag deleted from [platform]
|
||||
- [ ] Ticket closed
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Emergency Kill-Switch Procedure
|
||||
|
||||
When a flag needs to be disabled immediately due to a production incident:
|
||||
|
||||
**Time target: flag disabled within 2 minutes of decision.**
|
||||
|
||||
```
|
||||
1. Go to [platform URL] — bookmark this: [URL]
|
||||
2. Search for the flag by name: [flag-name]
|
||||
3. Set to 0% / "off" for ALL users
|
||||
4. Verify the service error rate drops within 60 seconds
|
||||
5. Post to #incidents:
|
||||
"🟡 Feature flag [flag-name] disabled — rolling back [feature description].
|
||||
Owner: [name]. Error rate before: [X]%. Monitoring for recovery."
|
||||
6. Page the flag owner if not already aware
|
||||
```
|
||||
|
||||
**For ops flags (kill switches that must turn OFF normally-on behaviour):**
|
||||
```bash
|
||||
# These flags are "on" by default and turned "off" to disable a feature
|
||||
# Confirm the flag polarity before toggling — "off" may mean "disabled" or "enabled" depending on naming
|
||||
# Flag [flag-name]: OFF = [feature behaviour when off]
|
||||
[kill switch command for your platform]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Stale Flag Policy and Cleanup
|
||||
|
||||
Stale flags are flags that have outlived their purpose: they sit at a fixed rollout with no further changes planned, or they are past their cleanup date. Stale flags are technical debt.
|
||||
|
||||
### Stale Flag Definition
|
||||
|
||||
A flag is stale if ANY of the following are true:
|
||||
- It is a Release flag past its cleanup date
|
||||
- It has been at 100% (or 0%) rollout for more than 30 days
|
||||
- Its linked ticket is closed and code cleanup has not happened
|
||||
- Its owner has left the team
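
The four conditions above translate directly into a check that a weekly job could run over the flag inventory. The sketch below is illustrative only: `FlagRecord` and `stale_reasons` are hypothetical names, and the fields are assumed to be populated from your flag platform and ticket tracker.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class FlagRecord:
    flag_type: str               # "release", "exp", "ops", or "perm"
    cleanup_by: Optional[date]   # None for flags on annual review
    rollout_percentage: int      # 0-100
    percentage_since: date       # date the rollout percentage last changed
    ticket_closed: bool
    code_removed: bool
    owner_on_team: bool


def stale_reasons(flag: FlagRecord, today: date) -> list[str]:
    """Return every stale condition the flag meets; an empty list means not stale."""
    reasons = []
    if flag.flag_type == "release" and flag.cleanup_by and today > flag.cleanup_by:
        reasons.append("release flag past cleanup date")
    at_extreme = flag.rollout_percentage in (0, 100)
    if at_extreme and today - flag.percentage_since > timedelta(days=30):
        reasons.append("at 0% or 100% for more than 30 days")
    if flag.ticket_closed and not flag.code_removed:
        reasons.append("ticket closed but code cleanup not done")
    if not flag.owner_on_team:
        reasons.append("owner has left the team")
    return reasons
```

Returning the reasons rather than a boolean makes the weekly notification more useful, because the owner sees why the flag was listed.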
|
||||
|
||||
### Cleanup Checklist
|
||||
|
||||
```
|
||||
[ ] Flag is at 100% rollout and has been stable for 48+ hours
|
||||
[ ] Monitoring shows no issues for the flag-on cohort
|
||||
[ ] Code changes:
|
||||
[ ] Remove the flag check from application code
|
||||
[ ] Remove the "off" code path entirely — do not leave dead code
|
||||
[ ] Remove any flag-related tests that test the off behaviour
|
||||
[ ] Update any documentation that references the flag
|
||||
[ ] PR merged and deployed to production
|
||||
[ ] Flag deleted from [platform] (do not just disable — delete)
|
||||
[ ] Cleanup ticket closed
|
||||
[ ] Flag owner confirms cleanup in Slack: "Flag [name] has been cleaned up — [commit link]"
|
||||
```
|
||||
|
||||
**Automated stale flag detection:**
|
||||
```bash
|
||||
# Run weekly. The query below approximates staleness by flag age (created more than 30 days ago); cross-check rollout state and cleanup dates before acting
|
||||
# [Platform-specific query — adapt:]
|
||||
|
||||
# LaunchDarkly API
|
||||
curl -s "https://app.launchdarkly.com/api/v2/flags/[project-key]" \
|
||||
-H "Authorization: [api-key]" | \
|
||||
jq '.items[] | select(.creationDate < (now - 2592000) * 1000) | {key: .key, created: .creationDate}'
|
||||
|
||||
# Notify #engineering-housekeeping with list of stale flags
|
||||
```
|
||||
|
||||
### Stale Flag Escalation
|
||||
|
||||
| Age past cleanup date | Action |
|
||||
|---|---|
|
||||
| 0–14 days | Slack reminder to flag owner |
|
||||
| 15–30 days | Slack reminder to flag owner + tech lead |
|
||||
| 31–59 days | Tech lead assigns cleanup, creates ticket with P2 priority |
|
||||
| 60+ days | Engineering manager reviews — flag may be force-deleted |
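
If the weekly detection job also sends the reminders, the escalation ladder can be encoded as a simple lookup. This is a minimal sketch: `escalation_action` is a hypothetical helper, and assigning day 14 and day 30 to the lower tier is an assumption about where the tier boundaries fall.

```python
def escalation_action(days_past_cleanup: int) -> str:
    """Return the escalation step for a flag overdue by the given number of days."""
    if days_past_cleanup < 0:
        return "none"  # cleanup date not reached yet
    if days_past_cleanup <= 14:
        return "Slack reminder to flag owner"
    if days_past_cleanup <= 30:
        return "Slack reminder to flag owner + tech lead"
    if days_past_cleanup < 60:
        return "Tech lead assigns cleanup, creates ticket with P2 priority"
    return "Engineering manager reviews; flag may be force-deleted"
```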
|
||||
|
||||
---
|
||||
|
||||
## 9. Governance
|
||||
|
||||
### Who Can Do What
|
||||
|
||||
| Action | Who | Approval required |
|
||||
|---|---|---|
|
||||
| Create a flag (any environment) | Any engineer | None — but must complete creation checklist |
|
||||
| Enable a flag in development | Any engineer | None |
|
||||
| Enable a flag in staging | Any engineer | None |
|
||||
| Enable a flag in production (0–10%) | Flag owner | Tech lead awareness |
|
||||
| Advance rollout in production (beyond 10%) | Flag owner | Tech lead sign-off per stage |
|
||||
| Toggle an Ops flag in production | On-call engineer | None (these are break-glass controls) |
|
||||
| Delete a flag | Flag owner | Tech lead confirmation that code cleanup is done |
|
||||
| Create a Permission flag | Flag owner | Product manager approval |
|
||||
|
||||
### Audit Logging
|
||||
|
||||
All flag changes in production must be traceable. Ensure the following are configured in [platform]:
|
||||
|
||||
- **Change log:** Every production flag change logs: who changed it, what they changed, and when.
|
||||
- **Slack notifications:** Production flag changes post to `#[team]-flag-changes` automatically.
|
||||
- **Quarterly review:** Every quarter, the tech lead reviews the full flag inventory, confirms owners are current, and removes flags with no owner.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every flag has an owner named in its description — no orphan flags
|
||||
- [ ] Release and Experiment flags have a cleanup date set — not open-ended
|
||||
- [ ] Monitoring is configured for every flag currently between 1% and 99% rollout
|
||||
- [ ] The emergency kill-switch procedure has been tested — on-call engineers have bookmarked the platform URL and know the steps
|
||||
- [ ] Stale flag detection runs automatically and results are reviewed weekly
|
||||
- [ ] Code review checklist includes: "Does this PR introduce a flag? If yes, is the creation checklist complete?"
|
||||
- [ ] At least one person other than the flag owner knows how to disable any given flag in an emergency
|
||||
@@ -0,0 +1,292 @@
|
||||
---
|
||||
name: infra-as-code-review
|
||||
description: "Write an infrastructure-as-code review checklist and conduct a structured review of Terraform, CloudFormation, Pulumi, or Ansible code. Use when asked to review IaC code, audit infrastructure configurations, check cloud security posture, or produce a reusable IaC review checklist. Produces a structured review report with severity-categorized findings, remediation guidance, and a reusable checklist."
|
||||
---
|
||||
|
||||
# Infrastructure-as-Code Review
|
||||
|
||||
Produce a structured infrastructure-as-code review that applies security, reliability, and operational quality standards to a specific body of IaC code. The output serves two purposes: an actionable review report for the code at hand (with findings by severity and specific remediation steps), and a reusable checklist the team can apply to every future IaC change. If the user provides actual code, analyze it and populate the findings table with real issues. If no code is provided, produce the checklist and a template findings report.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **IaC tool** — Terraform, CloudFormation, Pulumi, Ansible, or CDK
|
||||
- **Cloud provider** — AWS, GCP, Azure, or multi-cloud
|
||||
- **What the code provisions** — a brief description (e.g., "VPC, EKS cluster, and RDS instance for the payments service")
|
||||
- **Security policies or naming standards in use** — any existing org standards to check against; if none, use sensible defaults
|
||||
- **The IaC code itself** — paste or describe it; if not provided, produce the checklist template only and note findings require code
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# IaC Review Report: [What Is Being Provisioned]
|
||||
|
||||
**Reviewer:** [Name / Claude]
|
||||
**IaC Tool:** [Terraform / CloudFormation / Pulumi / Ansible / CDK]
|
||||
**Cloud Provider:** [AWS / GCP / Azure]
|
||||
**Code Location:** [Repo path or PR link]
|
||||
**Review Date:** [Date]
|
||||
**Overall Risk:** [Critical / High / Medium / Low]
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
| Severity | Finding Count | Resolved in This Review | Carry-Over Risk |
|
||||
|----------|---------------|------------------------|-----------------|
|
||||
| Critical | [n] | [n] | [Yes/No — explain] |
|
||||
| High | [n] | [n] | [Yes/No — explain] |
|
||||
| Medium | [n] | [n] | [Yes/No — explain] |
|
||||
| Low | [n] | [n] | [Yes/No — explain] |
|
||||
| **Total** | **[n]** | **[n]** | |
|
||||
|
||||
**Recommendation:** [Approve / Approve with Required Changes / Block — one sentence rationale]
|
||||
|
||||
---
|
||||
|
||||
## Findings
|
||||
|
||||
### Critical Findings
|
||||
|
||||
#### CRIT-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | Critical |
|
||||
| **Category** | [IAM / Secrets / Encryption / Network / State / Naming / Cost] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:42]` |
|
||||
| **Risk** | [What can go wrong — be specific about the attack vector or failure mode] |
|
||||
|
||||
**Current code:**
|
||||
```hcl
|
||||
# [paste the problematic snippet]
|
||||
resource "aws_s3_bucket" "data" {
|
||||
bucket = "my-bucket"
|
||||
acl = "public-read" # PROBLEM: public read access
|
||||
}
|
||||
```
|
||||
|
||||
**Remediation:**
|
||||
```hcl
|
||||
resource "aws_s3_bucket" "data" {
|
||||
bucket = "my-bucket"
|
||||
}
|
||||
|
||||
resource "aws_s3_bucket_public_access_block" "data" {
|
||||
bucket = aws_s3_bucket.data.id
|
||||
block_public_acls = true
|
||||
block_public_policy = true
|
||||
ignore_public_acls = true
|
||||
restrict_public_buckets = true
|
||||
}
|
||||
```
|
||||
|
||||
**Why this matters:** [One sentence linking the specific risk to business impact — data exposure, compliance violation, etc.]
|
||||
|
||||
---
|
||||
|
||||
#### CRIT-02: [Next Critical Finding — repeat structure]
|
||||
|
||||
---
|
||||
|
||||
### High Findings
|
||||
|
||||
#### HIGH-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | High |
|
||||
| **Category** | [Category] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:line]` |
|
||||
| **Risk** | [Specific risk description] |
|
||||
|
||||
**Current code:**
|
||||
```hcl
|
||||
# [problematic snippet]
|
||||
```
|
||||
|
||||
**Remediation:**
|
||||
```hcl
|
||||
# [fixed snippet]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Medium Findings
|
||||
|
||||
#### MED-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | Medium |
|
||||
| **Category** | [Category] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:line]` |
|
||||
| **Risk** | [Specific risk description] |
|
||||
|
||||
**Remediation:** [Prose or code snippet — choose whichever is clearer for this finding]
|
||||
|
||||
---
|
||||
|
||||
### Low Findings
|
||||
|
||||
#### LOW-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | Low |
|
||||
| **Category** | [Category] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:line]` |
|
||||
| **Suggestion** | [What to improve and why] |
|
||||
|
||||
---
|
||||
|
||||
## Reusable IaC Review Checklist
|
||||
|
||||
Use this checklist on every IaC pull request. Check every item; mark N/A only when the item genuinely does not apply to the resources being provisioned.
|
||||
|
||||
### 1. IAM and Access Control
|
||||
|
||||
- [ ] No wildcard actions (`"*"`) in IAM policies — policies follow least-privilege
|
||||
- [ ] No wildcard resource (`"*"`) in IAM policies unless explicitly justified with a comment
|
||||
- [ ] IAM roles use condition keys to restrict scope (e.g., `aws:RequestedRegion`, `sts:ExternalId`)
|
||||
- [ ] No IAM access keys or credentials hardcoded or in plaintext variables
|
||||
- [ ] EC2 / compute instances use instance profiles, not hardcoded credentials
|
||||
- [ ] S3 bucket policies do not allow public access unless the bucket is explicitly a public asset bucket
|
||||
- [ ] Cross-account trust policies name specific account IDs, not `"*"`
|
||||
- [ ] Service accounts (GCP) / managed identities (Azure) follow naming conventions and have documented purpose
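
The two wildcard items at the top of this list are easy to pre-screen before a human review. The sketch below assumes the IAM policy has already been parsed from JSON into a Python dictionary; `wildcard_findings` is a hypothetical helper, and it only catches the bare `"*"` named in the checklist, not service-level wildcards such as `s3:*`, which still need reviewer judgment.

```python
def _as_list(value):
    """IAM policy fields may be a single string or a list; normalise to a list."""
    return [value] if isinstance(value, (str, dict)) else list(value)


def wildcard_findings(policy: dict) -> list[str]:
    """Flag Allow statements that use a bare "*" for Action or Resource."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        if "*" in _as_list(statement.get("Action", [])):
            findings.append(f"Statement {index}: wildcard action")
        if "*" in _as_list(statement.get("Resource", [])):
            findings.append(f"Statement {index}: wildcard resource")
    return findings
```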
|
||||
|
||||
### 2. Secrets Management
|
||||
|
||||
- [ ] No secrets, passwords, tokens, or API keys in plaintext in any `.tf`, `.yaml`, or `.json` file
|
||||
- [ ] No secrets in variable default values
|
||||
- [ ] Secrets sourced from Secrets Manager / Parameter Store / Vault — not from environment variables passed at plan time
|
||||
- [ ] `sensitive = true` is set on all output values and variables that contain secrets (Terraform)
|
||||
- [ ] State backend is encrypted — no unencrypted state files contain sensitive data
|
||||
- [ ] `.gitignore` or equivalent excludes `*.tfvars`, `terraform.tfstate`, and any file that may contain resolved secrets
|
||||
|
||||
### 3. Encryption at Rest
|
||||
|
||||
- [ ] Storage resources (S3, EBS, RDS, DynamoDB, GCS, Azure Blob) have encryption at rest enabled
|
||||
- [ ] Customer-managed keys (CMK/KMS) are used where required by policy — not solely AWS/GCP/Azure managed keys
|
||||
- [ ] KMS key rotation is enabled for all CMKs
|
||||
- [ ] Database snapshots have encryption enabled
|
||||
- [ ] Encryption is not disabled via `encrypted = false` or equivalent
|
||||
|
||||
### 4. Encryption in Transit
|
||||
|
||||
- [ ] Load balancers terminate TLS — HTTP-only listeners redirect to HTTPS or are absent
|
||||
- [ ] Minimum TLS version is 1.2; TLS 1.0 and 1.1 are explicitly disabled
|
||||
- [ ] RDS / database connections require SSL (for example `rds.force_ssl = 1` on RDS PostgreSQL, `require_secure_transport = ON` on MySQL, or the equivalent parameter for your engine)
|
||||
- [ ] Internal service-to-service calls use TLS where the network is not fully private
|
||||
- [ ] S3 bucket policies include a `Deny` on non-TLS requests (`aws:SecureTransport: false`)
|
||||
|
||||
### 5. Network and Public Access
|
||||
|
||||
- [ ] Security groups / firewall rules do not permit `0.0.0.0/0` ingress except on ports 80/443 for public-facing services
|
||||
- [ ] SSH (port 22) and RDP (port 3389) are not open to `0.0.0.0/0`
|
||||
- [ ] Databases are in private subnets — not directly internet-routable
|
||||
- [ ] `publicly_accessible = false` on RDS instances unless explicitly required and documented
|
||||
- [ ] VPC has flow logs enabled
|
||||
- [ ] Network ACLs and security groups are layered (defense in depth)
|
||||
- [ ] S3 bucket public access block is enabled at the account and bucket level
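
The first two ingress items in this list can be screened automatically once the rules have been extracted from the plan. The following is a sketch under assumptions: the rule dictionaries with `cidr`, `from_port`, and `to_port` keys are a hypothetical simplified shape rather than any provider's schema, and the check does not know whether the service is meant to be public-facing, so a port 80/443 rule on an internal service still needs manual review.

```python
PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}
ALLOWED_PUBLIC_PORTS = {80, 443}


def open_ingress_findings(rules: list[dict]) -> list[str]:
    """Report ingress rules that expose ports other than 80/443 to the internet."""
    findings = []
    for rule in rules:
        if rule["cidr"] not in PUBLIC_CIDRS:
            continue
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if not all(port in ALLOWED_PUBLIC_PORTS for port in ports):
            findings.append(
                f"{rule['cidr']} open on ports {rule['from_port']}-{rule['to_port']}"
            )
    return findings
```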
|
||||
|
||||
### 6. Logging, Monitoring, and Audit
|
||||
|
||||
- [ ] CloudTrail / Cloud Audit Logs / Azure Monitor is enabled across all regions
|
||||
- [ ] S3 access logging is enabled on buckets containing sensitive or regulated data
|
||||
- [ ] RDS enhanced monitoring or equivalent is enabled
|
||||
- [ ] CloudWatch alarms or equivalent are defined for critical metrics (CPU, disk, error rate)
|
||||
- [ ] Log retention periods are defined: logs are neither retained indefinitely nor deleted in under 7 days
|
||||
|
||||
### 7. Naming and Tagging Standards
|
||||
|
||||
- [ ] All resources follow the team's naming convention: `[env]-[team]-[resource-type]-[identifier]`
|
||||
- [ ] Required tags are present on all taggable resources:
|
||||
- [ ] `Environment` (e.g., prod / staging / dev)
|
||||
- [ ] `Team` or `Owner`
|
||||
- [ ] `Service` or `Application`
|
||||
- [ ] `CostCenter` (if required by finance policy)
|
||||
- [ ] `ManagedBy: terraform` (or equivalent IaC tool tag)
|
||||
- [ ] No resources with default names (e.g., `default-vpc`, `launch-wizard-1`)
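
The required-tag list above can be checked per resource with a few lines of code. This is an illustrative sketch: `missing_tags` and `REQUIRED_TAG_GROUPS` are hypothetical names, the tag map is assumed to be a plain dictionary taken from the resource definition, and `CostCenter` is only enforced when the caller opts in, mirroring the "if required by finance policy" condition.

```python
REQUIRED_TAG_GROUPS = [
    ("Environment",),
    ("Team", "Owner"),            # either key satisfies the requirement
    ("Service", "Application"),   # either key satisfies the requirement
    ("ManagedBy",),
]


def missing_tags(tags: dict, require_cost_center: bool = False) -> list[str]:
    """Return the required tags (or tag alternatives) absent from a resource."""
    groups = list(REQUIRED_TAG_GROUPS)
    if require_cost_center:
        groups.append(("CostCenter",))
    return [
        " or ".join(group)
        for group in groups
        if not any(key in tags for key in group)
    ]
```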
|
||||
|
||||
### 8. State Management and Backend
|
||||
|
||||
- [ ] Remote state backend is configured — no local state in repository
|
||||
- [ ] State backend uses locking (DynamoDB for S3 backend, etc.)
|
||||
- [ ] State backend bucket/storage has versioning enabled
|
||||
- [ ] State backend bucket/storage has access logging enabled
|
||||
- [ ] Workspaces or separate state files are used per environment — no shared state between prod and non-prod
|
||||
- [ ] `terraform.tfstate` and `*.tfstate.backup` are in `.gitignore`
|
||||
|
||||
### 9. Module and Resource Structure
|
||||
|
||||
- [ ] Modules are versioned with explicit version pins — no floating `source = "git::...?ref=main"`
|
||||
- [ ] Provider versions are pinned in `required_providers` — no unconstrained `>= x.y`
|
||||
- [ ] Terraform version is pinned in `required_version`
|
||||
- [ ] Modules have a clear single responsibility — not one module that provisions everything
|
||||
- [ ] No copy-paste duplication — repeated patterns use modules or loops (`for_each`, `count`)
|
||||
- [ ] Outputs expose only what downstream consumers need — no unnecessary output sprawl
|
||||
|
||||
### 10. Environment Parity
|
||||
|
||||
- [ ] Prod and non-prod environments use the same module code, parameterized by environment variable
|
||||
- [ ] Instance sizes and replica counts differ by environment via variables — not by separate code branches
|
||||
- [ ] Non-prod does not have security controls disabled "to save money" (encryption off, logging off)
|
||||
|
||||
### 11. Cost Impact
|
||||
|
||||
- [ ] Large instance types (e.g., `r5.16xlarge`) or storage allocations are justified in a comment
|
||||
- [ ] Data transfer costs are considered for cross-region or cross-AZ architectures
|
||||
- [ ] Reserved instance or committed use discount eligibility is noted for long-lived resources
|
||||
- [ ] Auto-scaling is configured for variable workloads — no fixed oversized fleets for spiky traffic
|
||||
- [ ] Lifecycle policies are set on S3 buckets storing time-bounded data (logs, backups)
|
||||
|
||||
### 12. Drift Risk
|
||||
|
||||
- [ ] Resources that are commonly changed in the console are not managed by IaC unless the import process is documented
|
||||
- [ ] `lifecycle { prevent_destroy = true }` is set on stateful resources in production (databases, state buckets)
|
||||
- [ ] `ignore_changes` is used sparingly and each instance is documented with a rationale comment
|
||||
- [ ] A plan is run against the live environment as part of the PR process — no unreviewed drift
|
||||
|
||||
---
|
||||
|
||||
## Findings Summary Table
|
||||
|
||||
| ID | Title | Severity | Category | File | Status |
|
||||
|----|-------|----------|----------|------|--------|
|
||||
| CRIT-01 | [Title] | Critical | [Category] | [file:line] | Open |
|
||||
| HIGH-01 | [Title] | High | [Category] | [file:line] | Open |
|
||||
| MED-01 | [Title] | Medium | [Category] | [file:line] | Open |
|
||||
| LOW-01 | [Title] | Low | [Category] | [file:line] | Open |
|
||||
|
||||
---
|
||||
|
||||
## Required Actions Before Merge
|
||||
|
||||
List only Critical and High findings that must be resolved before this code is merged:
|
||||
|
||||
1. **CRIT-01 [Title]** — [One-line remediation instruction]
|
||||
2. **HIGH-01 [Title]** — [One-line remediation instruction]
|
||||
|
||||
Medium and Low findings should be tracked as follow-up issues with a committed resolution date.
|
||||
|
||||
---
|
||||
|
||||
*Review conducted by [Reviewer] on [Date] — checklist version [1.0]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every finding includes severity, category, specific resource name, and file and line number; Critical and High findings also include current code and fixed code
|
||||
- [ ] Checklist covers all 12 categories: IAM, Secrets, Encryption at Rest, Encryption in Transit, Network, Logging, Naming/Tagging, State, Module Structure, Environment Parity, Cost, and Drift
|
||||
- [ ] Executive summary table is filled with real counts — not all zeros or all placeholders
|
||||
- [ ] "Required Actions Before Merge" section lists only Critical and High items
|
||||
- [ ] Code snippets in findings show both the problematic code AND the corrected version
|
||||
- [ ] Overall risk rating is justified by the highest-severity open finding
|
||||
- [ ] Checklist items are binary (checkable) — not narrative observations
|
||||
@@ -0,0 +1,432 @@
|
||||
---
|
||||
name: load-testing-plan
|
||||
description: "Write a load and performance testing plan for a service. Use when asked to create a performance test plan, write load testing documentation, define stress or soak test scenarios, or set performance regression gates for CI. Produces a complete test plan document with scenario definitions, k6/Locust script skeleton, threshold table, result interpretation guide, and CI integration steps."
|
||||
---
|
||||
|
||||
# Load Testing Plan Skill
|
||||
|
||||
Produce a complete load and performance testing plan for a service — covering test objectives, scenario definitions, tooling configuration, success thresholds, and CI integration. A good load testing plan eliminates ambiguity about what "performance is acceptable" means, so engineers can run tests and get a pass/fail answer without having to interpret raw numbers themselves.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and key endpoints** — which endpoints are under test (path, method, typical request/response shape)
|
||||
- **Current traffic baseline** — current requests/sec, p50/p99 latency, error rate under normal load
|
||||
- **Peak traffic expectations** — expected peak RPS (e.g. 10× baseline for flash sales, or seasonality peak)
|
||||
- **SLO targets** — latency SLOs (p99 < X ms), error rate SLO (< Y%), availability target
|
||||
- **Preferred testing tool** — k6, Locust, JMeter, Gatling, or no preference
|
||||
- **Test environment availability** — dedicated load test environment, staging, or production (with traffic shaping)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Load Testing Plan: [Service Name]
|
||||
|
||||
**Author:** [Name] | **Team:** [Team name]
|
||||
**Date:** [Date] | **Review cycle:** Before each major release and quarterly
|
||||
**Testing tool:** [k6 / Locust / JMeter / Gatling]
|
||||
**Test environment:** [Environment name and URL]
|
||||
|
||||
---
|
||||
|
||||
## 1. Objectives and Scope
|
||||
|
||||
**What we are testing:** [Service name] handles [describe function — e.g. "user authentication requests from the mobile and web clients"]. This plan validates that the service meets its SLOs under expected and elevated traffic conditions.
|
||||
|
||||
**In scope:**
|
||||
- [Endpoint 1: METHOD /path — description]
|
||||
- [Endpoint 2: METHOD /path — description]
|
||||
- [Endpoint 3: METHOD /path — description]
|
||||
|
||||
**Out of scope:**
|
||||
- [Any endpoints explicitly excluded and why — e.g. "admin APIs — low traffic, excluded from load test"]
|
||||
- [Third-party integrations that cannot be load-tested — mock them instead]
|
||||
|
||||
---
|
||||
|
||||
## 2. Performance Targets (Success Criteria)
|
||||
|
||||
Every scenario has explicit pass/fail thresholds. A test run FAILS if any threshold is breached.
|
||||
|
||||
| Metric | Baseline scenario | Stress scenario | Spike scenario | Soak scenario |
|
||||
|---|---|---|---|---|
|
||||
| p50 latency | < [X] ms | < [X × 1.5] ms | < [X × 2] ms | < [X] ms |
|
||||
| p95 latency | < [Y] ms | < [Y × 1.5] ms | < [Y × 2] ms | < [Y] ms |
|
||||
| p99 latency | < [Z] ms | < [Z × 2] ms | < [Z × 3] ms | < [Z] ms |
|
||||
| Error rate | < [0.1]% | < [1]% | < [2]% | < [0.1]% |
|
||||
| Throughput | ≥ [N] RPS | ≥ [N × 3] RPS | N/A | ≥ [N] RPS |
|
||||
| Failed requests | 0 (5xx) | < [threshold] | < [threshold] | 0 (5xx) |
|
||||
|
||||
**SLO reference:** These thresholds are derived from the service SLOs — p99 < [Z ms], error rate < [0.1]%, availability [99.9]%.
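
The per-scenario latency limits in the table can be derived mechanically from the baseline targets. A minimal sketch, assuming the multipliers shown in the table; the function names and the example values (50/200/400 ms) are illustrative and not part of the plan:

```python
# Multipliers copied from the targets table above.
SCENARIO_MULTIPLIERS = {
    "baseline": {"p50": 1.0, "p95": 1.0, "p99": 1.0},
    "stress": {"p50": 1.5, "p95": 1.5, "p99": 2.0},
    "spike": {"p50": 2.0, "p95": 2.0, "p99": 3.0},
    "soak": {"p50": 1.0, "p95": 1.0, "p99": 1.0},
}


def latency_thresholds(baseline_ms):
    """Return {scenario: {percentile: limit_ms}} from the baseline targets."""
    return {
        scenario: {p: baseline_ms[p] * factor for p, factor in factors.items()}
        for scenario, factors in SCENARIO_MULTIPLIERS.items()
    }


def breached(measured_ms, limits_ms):
    """List the percentiles that fail; the table uses '<', so equality fails."""
    return [p for p, limit in limits_ms.items() if measured_ms[p] >= limit]


if __name__ == "__main__":
    # Hypothetical baseline targets in milliseconds.
    limits = latency_thresholds({"p50": 50, "p95": 200, "p99": 400})
    print(limits["stress"])
    print(breached({"p50": 60, "p95": 250, "p99": 900}, limits["stress"]))
```

Keeping the multipliers in one place means a change to the SLO only requires updating the baseline values.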
|
||||
|
||||
---
|
||||
|
||||
## 3. Traffic Model
|
||||
|
||||
**Baseline traffic (current production):**
|
||||
- Average RPS: [N] req/sec
|
||||
- Peak RPS (observed): [N] req/sec
|
||||
- Request distribution by endpoint:
|
||||
- [Endpoint 1]: [X]% of traffic
|
||||
- [Endpoint 2]: [Y]% of traffic
|
||||
- [Endpoint 3]: [Z]% of traffic
|
||||
|
||||
**Simulated user behaviour:**
|
||||
- Think time between requests: [X–Y] seconds (randomised)
|
||||
- Session duration: [N] minutes average
|
||||
- Authenticated vs anonymous ratio: [X]%/[Y]%
|
||||
- Geographic distribution: [Region 1 X]%, [Region 2 Y]%
|
||||
|
||||
---
|
||||
|
||||
## 4. Test Scenarios
|
||||
|
||||
### Scenario 1: Baseline (Steady-State)
|
||||
|
||||
**Purpose:** Confirm the service performs acceptably under normal production load.
|
||||
**Duration:** 10 minutes
|
||||
**Load profile:** Ramp to [N] RPS over 2 minutes, hold for 8 minutes.
|
||||
**Concurrency:** [N] virtual users
|
||||
|
||||
**Pass criteria:** All thresholds in the Baseline column of the targets table above.
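
To turn an RPS target into a virtual-user count, Little's Law gives a first estimate: concurrency is roughly throughput multiplied by the time each user spends per request cycle. A rough sketch, assuming each virtual user sends one request and then waits for the think time; `required_vus` and the example numbers are hypothetical:

```python
import math


def required_vus(target_rps, avg_response_s, avg_think_s):
    """Estimate virtual users needed to sustain target_rps (Little's Law)."""
    cycle_s = avg_response_s + avg_think_s  # time one user spends per request
    return math.ceil(target_rps * cycle_s)


if __name__ == "__main__":
    # Example: 200 RPS, 250 ms average response time, 2 s average think time.
    print(required_vus(200, 0.25, 2.0))
```

Treat the result as a starting point and adjust after a short calibration run, because response time itself changes under load.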
|
||||
|
||||
---
|
||||
|
||||
### Scenario 2: Stress Test
|
||||
|
||||
**Purpose:** Find the breaking point — how much load can the service handle before SLOs are breached?
|
||||
**Duration:** 20–30 minutes
|
||||
**Load profile:** Ramp from [N] RPS (baseline) to [N × 5] RPS in 5-minute steps. Hold each step for 5 minutes. Stop at first SLO breach.
|
||||
**Concurrency:** Scales with RPS target
|
||||
|
||||
**What to record:**
|
||||
- RPS at which p99 latency first exceeds SLO
|
||||
- RPS at which error rate first exceeds SLO
|
||||
- Whether the service recovers when load drops back to baseline
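
The stepped profile and the first-breach lookup can be sketched as follows. This assumes the load rises in equal increments of the baseline (1×, 2×, up to 5×), which the plan does not state explicitly; the function names and the result format are illustrative:

```python
def stress_steps(baseline_rps, max_multiplier=5, hold_minutes=5):
    """Build the step schedule: one step per multiple of the baseline."""
    return [
        {"start_minute": i * hold_minutes, "target_rps": baseline_rps * (i + 1)}
        for i in range(max_multiplier)
    ]


def first_breach_rps(step_results, p99_slo_ms):
    """Return the target RPS of the first step whose p99 exceeds the SLO."""
    for step in step_results:
        if step["p99_ms"] > p99_slo_ms:
            return step["target_rps"]
    return None  # no breach observed within the tested range


if __name__ == "__main__":
    for step in stress_steps(100):
        print(step)
    # Hypothetical per-step measurements from a finished run.
    results = [
        {"target_rps": 100, "p99_ms": 180},
        {"target_rps": 200, "p99_ms": 240},
        {"target_rps": 300, "p99_ms": 520},
    ]
    print(first_breach_rps(results, p99_slo_ms=400))
```

With five steps held for five minutes each, the schedule takes 25 minutes, which fits the 20–30 minute duration above.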
|
||||
|
||||
---
|
||||
|
||||
### Scenario 3: Spike Test
|
||||
|
||||
**Purpose:** Simulate a sudden traffic surge (flash sale, viral event, bot attack).
|
||||
**Duration:** 15 minutes
|
||||
**Load profile:** Hold at [N] RPS (baseline) for 3 minutes, spike to [N × 10] RPS instantly, hold for 5 minutes, drop back to baseline for 7 minutes.
|
||||
|
||||
**What to record:**
|
||||
- Latency during spike and recovery
|
||||
- Whether the service sheds load gracefully (rate limiting, queue depth)
|
||||
- Time to recover to baseline latency after spike ends
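
Recovery time can be computed from the latency time series rather than read off a graph. A minimal sketch, assuming p99 samples exported as `(seconds_since_start, p99_ms)` pairs and treating recovery as several consecutive samples at or below the baseline limit; the names and the `stable_points` default are assumptions:

```python
def recovery_seconds(samples, spike_end_s, baseline_p99_ms, stable_points=3):
    """Seconds from spike end until p99 stays at or below the baseline limit.

    Returns None if latency never stabilises within the recorded samples.
    """
    after = [(t, v) for t, v in samples if t >= spike_end_s]
    for i in range(len(after) - stable_points + 1):
        window = after[i:i + stable_points]
        if all(v <= baseline_p99_ms for _, v in window):
            return window[0][0] - spike_end_s
    return None


if __name__ == "__main__":
    # Hypothetical samples; the spike ends at minute 8 (480 s) in this profile.
    samples = [(480, 900), (490, 700), (500, 350), (510, 180), (520, 170), (530, 175)]
    print(recovery_seconds(samples, spike_end_s=480, baseline_p99_ms=200))
```

Requiring several consecutive good samples avoids declaring recovery on a single lucky data point.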
|
||||
|
||||
---
|
||||
|
||||
### Scenario 4: Soak / Endurance Test
|
||||
|
||||
**Purpose:** Detect memory leaks, connection pool exhaustion, and slow degradation over time.
|
||||
**Duration:** 4–8 hours (run overnight)
|
||||
**Load profile:** Steady [N × 1.5] RPS (50% above baseline) for entire duration.
|
||||
|
||||
**What to watch:**
|
||||
- Memory usage trend over time (should not grow unboundedly)
|
||||
- Error rate trend (should be flat, not creeping up)
|
||||
- GC pause frequency (JVM/Go services)
|
||||
- Database connection pool utilisation
|
||||
- p99 latency trend (should not creep up over hours)
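
A trend that "creeps up" can be quantified with a least-squares slope over the soak samples. A small sketch, assuming samples are exported as `(hours_elapsed, value)` pairs; what slope counts as a leak is a judgment call that this code does not decide:

```python
def slope_per_hour(samples):
    """Least-squares slope of value over time for (hours_elapsed, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov / var_x


if __name__ == "__main__":
    # Hypothetical memory readings in MB, taken hourly during a soak test.
    memory_mb = [(0, 500), (1, 520), (2, 540), (3, 560)]
    print(f"memory growth: {slope_per_hour(memory_mb):.1f} MB/hour")
```

The same function works for p99 latency or error rate; a slope near zero over the full run suggests the metric is flat.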
|
||||
|
||||
---
|
||||
|
||||
## 5. Test Environment Requirements
|
||||
|
||||
### Infrastructure
|
||||
|
||||
| Component | Requirement | Notes |
|
||||
|---|---|---|
|
||||
| Service under test | Isolated from production | [N] replicas, matching prod resource limits |
|
||||
| Database | Separate instance with production-scale data | Seed commands under Data Seeding below |
|
||||
| Cache (Redis/Memcached) | Empty at test start | Ensures cold-start conditions are tested |
|
||||
| Load generator | Separate from service under test | [N] vCPUs, [N] GB RAM minimum |
|
||||
| Network | Low-latency path to service | Do not run generator on same host |
|
||||
|
||||
### Data Seeding
|
||||
|
||||
Before every test run, ensure the environment has:
|
||||
```bash
|
||||
# Seed test users (needed for authenticated endpoint tests)
|
||||
[seed command or script path — e.g. python scripts/seed_load_test_users.py --count 10000]
|
||||
|
||||
# Seed test data for read endpoints
|
||||
[seed command — e.g. ./scripts/seed_products.sh --count 50000]
|
||||
|
||||
# Verify seed completed
|
||||
[verification command — e.g. psql $DB_URL -c "SELECT COUNT(*) FROM users WHERE load_test=true"]
|
||||
```
|
||||
|
||||
**Test data rules:**
|
||||
- Never use real production user data in load tests
|
||||
- Tag all test-generated records with `load_test=true` for easy cleanup
|
||||
- Run cleanup after each test: `[cleanup command]`
|
||||
|
||||
---
|
||||
|
||||
## 6. Tooling Setup
|
||||
|
||||
### k6 Script Skeleton
|
||||
|
||||
```javascript
|
||||
import http from 'k6/http';
|
||||
import { check, sleep } from 'k6';
|
||||
import { Rate, Trend } from 'k6/metrics';
|
||||
|
||||
// Custom metrics
|
||||
const errorRate = new Rate('error_rate');
|
||||
const endpointLatency = new Trend('endpoint_latency', true);
|
||||
|
||||
// Test configuration — override per scenario
|
||||
export const options = {
|
||||
scenarios: {
|
||||
baseline: {
|
||||
executor: 'ramping-vus',
|
||||
startVUs: 0,
|
||||
stages: [
|
||||
{ duration: '2m', target: [BASELINE_VUS] },
|
||||
{ duration: '8m', target: [BASELINE_VUS] },
|
||||
{ duration: '1m', target: 0 },
|
||||
],
|
||||
},
|
||||
},
|
||||
thresholds: {
|
||||
http_req_duration: [
|
||||
'p(95)<[Y_MS]',
|
||||
'p(99)<[Z_MS]',
|
||||
],
|
||||
error_rate: ['rate<0.001'], // 0.1%, matching the Baseline column in Section 2
|
||||
http_req_failed: ['rate<0.001'], // 0.1%, matching the Baseline column in Section 2
|
||||
},
|
||||
};
|
||||
|
||||
// Auth helper: setup() runs once per test run, and the token it returns is shared by all VUs
|
||||
export function setup() {
|
||||
const loginRes = http.post(`${__ENV.BASE_URL}/auth/login`, JSON.stringify({
|
||||
username: `load_test_user_${Math.floor(Math.random() * 10000)}@example.com`,
|
||||
password: '[LOAD_TEST_PASSWORD]',
|
||||
}), { headers: { 'Content-Type': 'application/json' } });
|
||||
|
||||
check(loginRes, { 'login ok': (r) => r.status === 200 });
|
||||
return { token: loginRes.json('access_token') };
|
||||
}
|
||||
|
||||
export default function (data) {
|
||||
const headers = {
|
||||
Authorization: `Bearer ${data.token}`,
|
||||
'Content-Type': 'application/json',
|
||||
};
|
||||
|
||||
// Endpoint 1: [Description]
|
||||
const res1 = http.get(`${__ENV.BASE_URL}/[endpoint-1]`, { headers });
|
||||
check(res1, {
|
||||
'[endpoint-1] status 200': (r) => r.status === 200,
|
||||
'[endpoint-1] latency < [X]ms': (r) => r.timings.duration < [X],
|
||||
});
|
||||
errorRate.add(res1.status >= 400);
|
||||
endpointLatency.add(res1.timings.duration, { endpoint: '[endpoint-1]' });
|
||||
|
||||
sleep(Math.random() * ([THINK_TIME_MAX] - [THINK_TIME_MIN]) + [THINK_TIME_MIN]);
|
||||
|
||||
// Endpoint 2: [Description]
|
||||
const res2 = http.post(`${__ENV.BASE_URL}/[endpoint-2]`,
|
||||
JSON.stringify({ [key]: '[value]' }),
|
||||
{ headers }
|
||||
);
|
||||
check(res2, {
|
||||
'[endpoint-2] status 201': (r) => r.status === 201,
|
||||
});
|
||||
errorRate.add(res2.status >= 400);
|
||||
}
|
||||
```
|
||||
|
||||
### Locust Script Skeleton (alternative)
|
||||
|
||||
```python
|
||||
from locust import HttpUser, task, between
|
||||
import random
|
||||
|
||||
class [ServiceName]User(HttpUser):
|
||||
wait_time = between([THINK_TIME_MIN], [THINK_TIME_MAX])
|
||||
token = None
|
||||
|
||||
def on_start(self):
|
||||
"""Called once per simulated user — authenticate."""
|
||||
user_id = random.randint(1, 10000)
|
||||
response = self.client.post("/auth/login", json={
|
||||
"username": f"load_test_user_{user_id}@example.com",
|
||||
"password": "[LOAD_TEST_PASSWORD]",
|
||||
})
|
||||
self.token = response.json()["access_token"]
|
||||
self.headers = {"Authorization": f"Bearer {self.token}"}
|
||||
|
||||
@task([WEIGHT_1]) # Weight = relative frequency
|
||||
def [endpoint_1_task](self):
|
||||
"""[Endpoint 1 description]"""
|
||||
with self.client.get(
|
||||
"/[endpoint-1]",
|
||||
headers=self.headers,
|
||||
catch_response=True
|
||||
) as response:
|
||||
if response.elapsed.total_seconds() > [LATENCY_THRESHOLD]:
|
||||
response.failure(f"Too slow: {response.elapsed.total_seconds()}s")
|
||||
|
||||
@task([WEIGHT_2])
|
||||
def [endpoint_2_task](self):
|
||||
"""[Endpoint 2 description]"""
|
||||
self.client.post(
|
||||
"/[endpoint-2]",
|
||||
json={"[key]": "[value]"},
|
||||
headers=self.headers,
|
||||
)
|
||||
```
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
|
||||
# k6 — run baseline scenario
|
||||
k6 run --env BASE_URL=https://[test-env-url] scripts/load_test.js
|
||||
|
||||
# k6 — run stress scenario with output to InfluxDB
|
||||
k6 run --out influxdb=http://[influxdb-host]:8086/k6 \
|
||||
--env SCENARIO=stress \
|
||||
scripts/load_test.js
|
||||
|
||||
# Locust — headless run
|
||||
locust -f locustfile.py \
|
||||
--headless \
|
||||
--users [N] \
|
||||
--spawn-rate [N] \
|
||||
--run-time 10m \
|
||||
--host https://[test-env-url] \
|
||||
--csv=results/[run-id]
|
||||
|
||||
# Locust — web UI (interactive)
|
||||
locust -f locustfile.py --host https://[test-env-url]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Metrics to Capture
|
||||
|
||||
Capture all of the following during every test run. Missing any of these makes result comparison unreliable.
|
||||
|
||||
| Metric | Source | Why it matters |
|
||||
|---|---|---|
|
||||
| p50, p95, p99, p999 latency per endpoint | Load tool | SLO validation |
|
||||
| Error rate (4xx, 5xx) per endpoint | Load tool | SLO validation |
|
||||
| Requests/sec (throughput) | Load tool | Capacity baseline |
|
||||
| CPU utilisation (%) | Infra monitoring | Saturation signal |
|
||||
| Memory utilisation (%) | Infra monitoring | Leak detection |
|
||||
| GC pause time / frequency | JVM/Go metrics | Latency spike root cause |
|
||||
| DB connection pool: active/idle/waiting | DB metrics | Pool exhaustion detection |
|
||||
| DB query latency (p99) | DB metrics | Downstream bottleneck |
|
||||
| Cache hit rate | Cache metrics | Miss storm detection |
|
||||
| Pod/instance count (if autoscaling) | Infra | Scaling behaviour |
|
||||
| Network in/out bytes | Infra | Bandwidth saturation |
|
||||
|
||||
---
|
||||
|
||||
## 8. Result Analysis Framework
|
||||
|
||||
After each test run, work through this analysis in order:
|
||||
|
||||
**Step 1 — Pass/fail check**
|
||||
Compare all captured metrics against the thresholds in Section 2. Record pass/fail per scenario.
|
||||
|
||||
**Step 2 — Latency distribution**
|
||||
Plot the full latency histogram, not just percentiles. A bimodal distribution (two humps) indicates two distinct code paths — investigate the slow hump.
|
||||
|
||||
**Step 3 — Error correlation**
|
||||
If errors occurred, correlate them with:
|
||||
- Time of occurrence (was it during ramp-up, steady state, or spike?)
|
||||
- Specific endpoint (is it one endpoint or all?)
|
||||
- Infrastructure events (CPU spike, OOM, DB connection exhaustion?)
|
||||
|
||||
**Step 4 — Saturation analysis**
|
||||
Graph CPU, memory, and connection pool over time. If any resource reached 80%+ of capacity, it is a candidate bottleneck — even if SLOs passed this run.
|
||||
|
||||
**Step 5 — Compare to baseline run**
|
||||
Every run should be compared to the previous run. A 10% regression in p99 latency warrants investigation even if it is still within SLO.
|
||||
|
||||
**Regression classification:**
|
||||
|
||||
| Change | Classification | Action |
|
||||
|---|---|---|
|
||||
| p99 within 5% of previous run | Green — no regression | No action |
|
||||
| p99 5–15% worse than previous | Yellow — watch | Investigate before next release |
|
||||
| p99 >15% worse than previous | Red — regression | Block release, file ticket |
|
||||
| Error rate increased vs previous | Red — regression | Block release |
|
||||
| SLO threshold breached | Critical | Block release, page on-call |
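
The classification table translates directly into a small function that a CI job could call after each run. A sketch, assuming a change of exactly 5% counts as green and exactly 15% as yellow (the table leaves the boundaries open); `classify_run` and its parameters are illustrative names:

```python
def classify_run(prev_p99_ms, curr_p99_ms, prev_error_rate, curr_error_rate,
                 slo_breached=False):
    """Map a run-over-run comparison to the classification in the table above."""
    if slo_breached:
        return "critical"
    if curr_error_rate > prev_error_rate:
        return "red"
    change_pct = (curr_p99_ms - prev_p99_ms) / prev_p99_ms * 100
    if change_pct > 15:
        return "red"
    if change_pct > 5:
        return "yellow"
    return "green"


if __name__ == "__main__":
    # Hypothetical results: p99 rose from 400 ms to 440 ms, error rate unchanged.
    print(classify_run(400, 440, 0.001, 0.001))
```

Checking the SLO breach first means the most severe classification always wins when several rows apply.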
|
||||
|
||||
---
|
||||
|
||||
## 9. CI Integration
|
||||
|
||||
Add load tests as a gated step in the release pipeline. Run the baseline scenario on every release candidate; run all scenarios weekly.
|
||||
|
||||
```yaml
|
||||
# Example: GitHub Actions step (adapt for your CI platform)
|
||||
load-test:
|
||||
runs-on: ubuntu-latest
|
||||
needs: [deploy-staging]
|
||||
if: github.ref == 'refs/heads/main'
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
|
||||
- name: Install k6
|
||||
run: |
|
||||
curl -s https://dl.k6.io/key.gpg | sudo apt-key add -
|
||||
echo "deb https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
|
||||
sudo apt-get update && sudo apt-get install -y k6
|
||||
|
||||
- name: Seed test data
|
||||
run: [seed command]
|
||||
|
||||
- name: Run baseline load test
|
||||
run: |
|
||||
k6 run \
|
||||
--env BASE_URL=${{ secrets.LOAD_TEST_ENV_URL }} \
|
||||
--out json=results.json \
|
||||
scripts/load_test.js
|
||||
env:
|
||||
LOAD_TEST_ENV_URL: ${{ secrets.LOAD_TEST_ENV_URL }}
|
||||
|
||||
- name: Check thresholds
|
||||
run: |
|
||||
# k6 exits non-zero if any threshold fails, so the previous step has already failed the job; this step only runs when every threshold passed
|
||||
echo "k6 threshold check complete"
|
||||
|
||||
- name: Upload results
|
||||
uses: actions/upload-artifact@v4
|
||||
if: always()
|
||||
with:
|
||||
name: load-test-results-${{ github.run_id }}
|
||||
path: results.json
|
||||
|
||||
- name: Cleanup test data
|
||||
if: always()
|
||||
run: [cleanup command]
|
||||
```
|
||||
|
||||
**CI gates summary:**
|
||||
- Baseline scenario runs on every release to staging
|
||||
- Full scenario suite (stress, spike, soak) runs weekly on a schedule
|
||||
- Any threshold failure blocks promotion to production
|
||||
- Results are archived for trend analysis
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every in-scope endpoint listed in Section 1 is covered by at least one test scenario
|
||||
- [ ] Thresholds are derived from actual SLO targets, not guesses
|
||||
- [ ] Test data seeding is scripted and reproducible — tests do not rely on pre-existing environment state
|
||||
- [ ] The load generator runs on separate infrastructure from the service under test
|
||||
- [ ] CI integration blocks promotion on threshold failure — not just records results
|
||||
- [ ] Soak test has been run at least once to establish a memory and connection pool baseline
|
||||
- [ ] Results comparison to previous run is part of the analysis — not just absolute pass/fail
|
||||
@@ -0,0 +1,484 @@
|
||||
---
|
||||
name: local-dev-setup
|
||||
description: "Write a local development environment setup guide for a service or project — covering prerequisites, repository setup, environment variables, local service dependencies, database seeding, running the service, running tests, common gotchas, IDE recommendations, and first-contribution checklist. Use when asked to write a dev setup guide, create onboarding documentation for engineers, document local environment setup, or write a getting-started guide for a codebase. Produces a complete setup guide that a new engineer can follow from zero to running tests in under 30 minutes, with a troubleshooting section for the most common setup failures."
|
||||
---
|
||||
|
||||
# Local Dev Setup Skill
|
||||
|
||||
Produce a complete local development environment setup guide for a service or project — walking a new engineer from zero (a clean laptop) to a working local environment with passing tests in under 30 minutes. A good setup guide reduces onboarding time, prevents the "it works on my machine" problem, and lets engineers make their first contribution with confidence. Write every step as a concrete command or action — not a description of what needs to happen.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does
|
||||
- **Tech stack** — language, framework, database, cache, message queue, and any external services
|
||||
- **Dependencies** — databases, caches, message queues, and external services (mocked or real)
|
||||
- **Test framework** — how tests are run and what the test suite covers
|
||||
- **CI/CD platform** — GitHub Actions, CircleCI, Jenkins, etc. (for context on what "passing CI" means locally)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Local Development Setup: [Service Name]
|
||||
|
||||
**Tech stack:** [Language + version] | [Framework] | [Database] | [Cache]
|
||||
**Estimated setup time:** [20–30 minutes] on a clean machine
|
||||
**Last verified:** [Date] on [macOS Ventura 13.x / Ubuntu 22.04]
|
||||
**Questions?** Ask in [Slack: #[team-channel]] or ping [@tech-lead-handle]
|
||||
|
||||
> **First contribution?** Complete setup first (this doc), then read [CONTRIBUTING.md] for code standards and PR process.
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Install these tools before starting. The versions listed are the minimum required — newer patch versions are fine, newer major versions may have compatibility issues.
|
||||
|
||||
### Required Tools
|
||||
|
||||
| Tool | Required version | Install |
|
||||
|---|---|---|
|
||||
| [Git] | 2.x+ | Pre-installed on most systems; or `brew install git` |
|
||||
| [Language runtime — e.g. Go] | [1.22+] | [https://go.dev/dl/ or `brew install go`] |
|
||||
| [Docker] | 24.x+ | [https://docs.docker.com/get-docker/] |
|
||||
| [Docker Compose] | 2.x+ | Included with Docker Desktop; or `brew install docker-compose` |
|
||||
| [Make] | Any | Pre-installed on macOS/Linux |
|
||||
| [Tool — e.g. Node.js] | [20.x+] | [`brew install node` or https://nodejs.org] |
|
||||
| [Tool — e.g. psql client] | [15+] | `brew install postgresql@15` (client only) |
|
||||
|
||||
### Optional but Recommended
|
||||
|
||||
| Tool | Purpose | Install |
|
||||
|---|---|---|
|
||||
| [direnv] | Auto-load `.envrc` environment variables | `brew install direnv` + [setup instructions](https://direnv.net) |
|
||||
| [jq] | Pretty-print JSON in terminal | `brew install jq` |
|
||||
| [k9s] | Kubernetes cluster UI (if using K8s locally) | `brew install k9s` |
|
||||
| [mkcert] | Local HTTPS certificates | `brew install mkcert` |
|
||||
|
||||
### Required Accounts and Access
|
||||
|
||||
Before starting, make sure you have:
|
||||
- [ ] GitHub access to [org/repo] — request via [access request process / Slack: #it-help]
|
||||
- [ ] [AWS / GCP / Azure] account with [dev environment] access — request via [process]
|
||||
- [ ] [Internal tool — e.g. 1Password] for retrieving development secrets — request via [process]
|
||||
- [ ] [VPN access] if required to reach internal services — request via [process]
|
||||
|
||||
---
|
||||
|
||||
## 1. Repository Setup
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone git@github.com:[org]/[repo-name].git
|
||||
cd [repo-name]
|
||||
|
||||
# Install git hooks (required — enforces commit message format and runs pre-commit checks)
|
||||
make install-hooks
|
||||
# Or manually:
|
||||
# cp scripts/hooks/pre-commit .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit
|
||||
|
||||
# Verify your git setup
|
||||
git config user.name # should be your name
|
||||
git config user.email # should be your work email
|
||||
```
|
||||
|
||||
**If you see a permission denied error on clone:** Your SSH key is not added to GitHub. Follow [GitHub's SSH key guide](https://docs.github.com/en/authentication/connecting-to-github-with-ssh) or use HTTPS with a personal access token instead.
|
||||
|
||||
---
|
||||
|
||||
## 2. Environment Variables
|
||||
|
||||
The service requires environment variables for configuration. **Never commit actual secrets to the repository.**
|
||||
|
||||
### Step 1 — Copy the example file
|
||||
|
||||
```bash
|
||||
cp .env.example .env.local
|
||||
```
|
||||
|
||||
### Step 2 — Fill in the values
|
||||
|
||||
Open `.env.local` in your editor. Below is a description of every variable and where to get its value:
|
||||
|
||||
| Variable | Description | Where to get it | Example (not real) |
|
||||
|---|---|---|---|
|
||||
| `APP_ENV` | Environment name | Set to `development` | `development` |
|
||||
| `APP_PORT` | Port the service listens on | Set to `8080` for local | `8080` |
|
||||
| `DATABASE_URL` | PostgreSQL connection string | Use value from Docker Compose (Section 3) | `postgres://app:password@localhost:5432/[service]_dev` |
|
||||
| `REDIS_URL` | Redis connection string | Use value from Docker Compose | `redis://localhost:6379` |
|
||||
| `SECRET_KEY` | Application secret key | Generate with: `openssl rand -hex 32` | `[random 64-char hex]` |
|
||||
| `[EXTERNAL_SERVICE]_API_KEY` | API key for [External Service] | Retrieve from [1Password vault: "Dev API Keys"] or ask [name] | — |
|
||||
| `[EXTERNAL_SERVICE]_BASE_URL` | Base URL for [External Service] | Use sandbox URL: `https://sandbox.[external-service].com` | `https://sandbox.stripe.com` |
|
||||
| `LOG_LEVEL` | Logging verbosity | Set to `debug` for local development | `debug` |
|
||||
| `[FEATURE_FLAG_SDK_KEY]` | Feature flag platform SDK key | Retrieve from [LaunchDarkly/Split dev project] | — |
|
||||
|
||||
**Using direnv (recommended):** Create a `.envrc` file containing the single line `dotenv .env.local`, then run `direnv allow`. Variables will load automatically when you `cd` into the project.
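
A quick way to catch a missing variable before starting the service is to compare the keys in `.env.local` against `.env.example`. A minimal sketch, assuming both files use plain `KEY=value` lines; the helper names are hypothetical and the script is not part of the repository:

```python
from pathlib import Path


def env_keys(text):
    """Collect variable names from dotenv-style text, skipping comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.add(key)
    return keys


def missing_keys(example_text, local_text):
    """Return variables defined in the example file but absent locally."""
    return sorted(env_keys(example_text) - env_keys(local_text))


if __name__ == "__main__":
    example, local = Path(".env.example"), Path(".env.local")
    if example.exists() and local.exists():
        for key in missing_keys(example.read_text(), local.read_text()):
            print(f"missing in .env.local: {key}")
```

This only checks that each key is present; it does not validate the values themselves.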
|
||||
|
||||
---
|
||||
|
||||
## 3. Local Service Dependencies
|
||||
|
||||
All infrastructure dependencies run in Docker Compose. You do not need to install PostgreSQL, Redis, or Kafka locally.
|
||||
|
||||
```bash
|
||||
# Start all dependencies (PostgreSQL, Redis, and any other services)
|
||||
docker compose up -d
|
||||
|
||||
# Verify all containers are healthy
|
||||
docker compose ps
|
||||
# Expected output: all services show "healthy" status
|
||||
|
||||
# View logs if something is not healthy
|
||||
docker compose logs [service-name]
|
||||
```
|
||||
|
||||
### What Docker Compose Starts
|
||||
|
||||
| Service | Port | Purpose | Health check |
|
||||
|---|---|---|---|
|
||||
| PostgreSQL [version] | `5432` | Primary database | `pg_isready -U app` |
|
||||
| Redis [version] | `6379` | Cache and session store | `redis-cli ping` |
|
||||
| [Kafka + Zookeeper] | `9092` / `2181` | Message queue | `kafka-topics.sh --list` |
|
||||
| [Mock server — e.g. WireMock] | `8089` | Mocks for external APIs in tests | `curl localhost:8089/__admin` |
|
||||
| [LocalStack] | `4566` | AWS service emulation (S3, SQS, etc.) | `aws --endpoint-url=http://localhost:4566 s3 ls` |
|
||||
|
||||
**If a container exits immediately:** See Troubleshooting section — common causes are port conflicts and Docker memory limits.
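
When `docker compose ps` is inconclusive, a plain TCP probe confirms whether each dependency is at least accepting connections on its port. A small sketch using the ports from the table above; `port_open` is an illustrative helper, and an open port is a weaker signal than a passing container health check:

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the table above; adjust to match docker-compose.yml.
    for name, port in [("PostgreSQL", 5432), ("Redis", 6379)]:
        status = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name} on port {port}: {status}")
```

If a port is not reachable, check the container logs as described in the Troubleshooting section.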
|
||||
|
||||
### Stopping Dependencies
|
||||
|
||||
```bash
|
||||
# Stop containers (preserves data volumes)
|
||||
docker compose stop
|
||||
|
||||
# Stop and remove containers and their volumes (clears data; use when you want a fresh start)
|
||||
docker compose down -v
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Install Dependencies and Build
|
||||
|
||||
```bash
|
||||
# Install language dependencies
|
||||
# Go:
|
||||
go mod download
|
||||
|
||||
# Node.js:
|
||||
npm install # or: yarn install / pnpm install
|
||||
|
||||
# Python:
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
||||
pip install -r requirements-dev.txt
|
||||
|
||||
# Verify build compiles cleanly
|
||||
make build
|
||||
# Expected: no errors; binary or compiled output in [./bin/ or ./dist/]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Database Setup and Seeding
|
||||
|
||||
```bash
|
||||
# Run database migrations (creates tables and schema)
|
||||
make db-migrate
|
||||
# Or directly:
|
||||
# [Migration command — e.g. "go run ./cmd/migrate up" or "alembic upgrade head" or "npm run db:migrate"]
|
||||
|
||||
# Verify migrations applied
|
||||
# psql $DATABASE_URL -c "\dt" # should list all tables
|
||||
|
||||
# Seed the database with development data
|
||||
make db-seed
|
||||
# Or directly:
|
||||
# [Seed command — e.g. "go run ./cmd/seed" or "python scripts/seed.py" or "npm run db:seed"]
|
||||
|
||||
# Verify seed data is present
|
||||
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [primary-table]"
|
||||
# Expected: [N] rows
|
||||
```
|
||||
|
||||
**What the seed creates:**
|
||||
- [N] test user accounts (credentials in [scripts/seed/README.md or .env.example])
|
||||
- [N] sample [resources] for development and testing
|
||||
- Admin account: `[admin@example.com]` / password: see `.env.example` for dev password variable
|
||||
|
||||
**To reset to a clean state:**
|
||||
```bash
|
||||
docker compose down -v # wipe database volume
|
||||
docker compose up -d # start fresh
|
||||
make db-migrate
|
||||
make db-seed
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Running the Service
|
||||
|
||||
```bash
|
||||
# Run the service locally
|
||||
make run
|
||||
# Or directly:
|
||||
# [Run command — e.g. "go run ./cmd/server" or "python app.py" or "npm run dev"]
|
||||
|
||||
# Expected output:
|
||||
# [Example of healthy startup log lines — e.g.:]
|
||||
# {"level":"info","message":"Database connected","host":"localhost","port":5432}
|
||||
# {"level":"info","message":"Redis connected","host":"localhost","port":6379}
|
||||
# {"level":"info","message":"Server listening","port":8080}
|
||||
```
|
||||
|
||||
### Verify It's Working
|
||||
|
||||
```bash
|
||||
# Health check
|
||||
curl http://localhost:8080/health
|
||||
# Expected: {"status":"ok","version":"[git-sha]"}
|
||||
|
||||
# Test a key endpoint (authenticated)
|
||||
# First, get a dev token:
|
||||
curl -X POST http://localhost:8080/api/v1/auth/login \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"email":"[dev-user-from-seed]@example.com","password":"[dev-password-from-env]"}'
|
||||
# Copy the token from the response, then:
|
||||
|
||||
curl http://localhost:8080/api/v1/[resource] \
|
||||
-H "Authorization: Bearer [token-from-above]"
|
||||
# Expected: 200 with JSON response
|
||||
```
|
||||
|
||||
### Hot Reload (for Development)
|
||||
|
||||
```bash
|
||||
# Run with hot reload — service restarts automatically on file changes
|
||||
make run-dev
|
||||
# Or:
|
||||
# [Hot reload command — e.g. "air" for Go / "uvicorn --reload" for Python / "npm run dev" for Node]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Running Tests
|
||||
|
||||
```bash
|
||||
# Run the full test suite
|
||||
make test
|
||||
# Or:
|
||||
# [Test command — e.g. "go test ./..." or "pytest" or "npm test"]
|
||||
|
||||
# Run tests with coverage report
|
||||
make test-coverage
|
||||
# Coverage report: [./coverage.html or stdout]
|
||||
|
||||
# Run a specific test file or test case
|
||||
# Go: go test ./pkg/[package]/... -run TestFunctionName
|
||||
# Python: pytest tests/test_[module].py::TestClass::test_method -v
|
||||
# Node: npm test -- --testPathPattern=[filename]
|
||||
|
||||
# Run only unit tests (fast — no external dependencies)
|
||||
make test-unit
|
||||
|
||||
# Run only integration tests (requires Docker Compose dependencies running)
|
||||
make test-integration
|
||||
```
|
||||
|
||||
**Expected test results:**
|
||||
- Unit tests: [N] tests, all pass, [<30] seconds
|
||||
- Integration tests: [N] tests, all pass, [<2] minutes
|
||||
- Coverage: [≥80]% (enforced in CI; the build fails below this threshold)
|
||||
|
||||
**Before pushing a PR, always run:**
|
||||
```bash
|
||||
make lint # code linting — must pass
|
||||
make test # full test suite — must pass
|
||||
make build # verify compilation — must pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. IDE Setup
|
||||
|
||||
### VS Code (Recommended)
|
||||
|
||||
Install the recommended extensions (VS Code will prompt you automatically):
|
||||
|
||||
```json
|
||||
// .vscode/extensions.json — already in the repository
|
||||
{
|
||||
"recommendations": [
|
||||
"[language-extension — e.g. golang.go]",
|
||||
"dbaeumer.vscode-eslint",
|
||||
"esbenp.prettier-vscode",
|
||||
"ms-azuretools.vscode-docker",
|
||||
"eamodio.gitlens"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Workspace settings are in `.vscode/settings.json` — format on save is enabled, linter is configured automatically.
|
||||
|
||||
**[Language]-specific setup:**
|
||||
```
|
||||
[e.g. Go: The gopls language server is installed automatically by the Go extension.
|
||||
Run "Go: Install/Update Tools" from the command palette after installing the extension.]
|
||||
```
|
||||
|
||||
### JetBrains (IntelliJ / GoLand / PyCharm / WebStorm)
|
||||
|
||||
- Open the project root as the project directory
|
||||
- [Language SDK]: set to [version] — File → Project Structure → SDKs
|
||||
- Run configurations are checked into `.idea/runConfigurations/` — they appear automatically
|
||||
- Enable "Run formatters on save" in Settings → Tools → Actions on Save
|
||||
|
||||
---
|
||||
|
||||
## 9. Common Gotchas and Troubleshooting
|
||||
|
||||
### Docker container exits immediately on startup
|
||||
|
||||
**Symptom:** `docker compose ps` shows a container as `Exited (1)` seconds after starting.
|
||||
|
||||
```bash
|
||||
# Check the container logs for the error
|
||||
docker compose logs [container-name]
|
||||
|
||||
# Common causes:
|
||||
# 1. Port already in use — find and kill the conflicting process:
|
||||
lsof -ti tcp:[port] | xargs kill -9
|
||||
|
||||
# 2. Docker doesn't have enough memory — allocate at least 4GB in Docker Desktop:
|
||||
# Docker Desktop → Settings → Resources → Memory → 4GB
|
||||
|
||||
# 3. Apple Silicon (M1/M2 and later) architecture mismatch: add a platform directive to the affected service in docker-compose.yml:
|
||||
# platform: linux/amd64
|
||||
```
|
||||
|
||||
### Database connection refused
|
||||
|
||||
**Symptom:** Service fails to start with "connection refused" or "dial tcp localhost:5432: connect: connection refused"
|
||||
|
||||
```bash
|
||||
# Is PostgreSQL actually running?
|
||||
docker compose ps postgres
|
||||
# If not running: docker compose up -d postgres
|
||||
|
||||
# Is it on the right port?
|
||||
lsof -i :5432
|
||||
|
||||
# Can you connect manually?
|
||||
psql postgres://app:password@localhost:5432/[service]_dev -c "SELECT 1"
|
||||
|
||||
# If using a custom DATABASE_URL, verify it matches the docker-compose.yml settings exactly
|
||||
```
|
||||
|
||||
### Migrations fail with "relation already exists"
|
||||
|
||||
**Symptom:** `make db-migrate` errors with "ERROR: relation [table] already exists"
|
||||
|
||||
```bash
|
||||
# Check current migration state
|
||||
[migration status command — e.g. "go run ./cmd/migrate status" or "alembic current"]
|
||||
|
||||
# The database may be in a partial state. Reset it (WARNING: -v deletes volumes, so all local data is lost):
|
||||
docker compose down -v
|
||||
docker compose up -d
|
||||
make db-migrate # should now succeed on a clean database
|
||||
```
|
||||
|
||||
### Tests fail with "connection refused" or dependency errors
|
||||
|
||||
**Symptom:** Integration tests fail because they cannot connect to PostgreSQL or Redis.
|
||||
|
||||
```bash
|
||||
# Integration tests need Docker Compose running
|
||||
docker compose up -d
|
||||
|
||||
# Verify all containers are healthy before running tests
|
||||
docker compose ps             # all should show "running" (or "healthy" if healthchecks are defined)
|
||||
|
||||
# If containers are running but tests still fail, check environment variables:
|
||||
make test-integration # should pick up .env.local automatically
|
||||
# If not: set -a && source .env.local && set +a && make test-integration   # set -a exports the sourced variables to child processes
|
||||
```
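
A container can be reported as running before the process inside it accepts connections. The check below is a minimal Python sketch of a readiness probe to run before the integration tests. It assumes PostgreSQL on `localhost:5432` as in the examples above, and `wait_for_port` is a hypothetical helper name. A successful TCP connect only shows the port is open, not that the database is ready to serve queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not accepting connections yet; retry shortly
    return False


if __name__ == "__main__":
    # Host and port are assumptions taken from the troubleshooting examples.
    ready = wait_for_port("localhost", 5432, timeout_s=30.0)
    print("postgres port open" if ready else "postgres port still closed")
```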
|
||||
|
||||
### `make lint` fails on a fresh checkout
|
||||
|
||||
**Symptom:** Lint errors on files you have not modified.
|
||||
|
||||
```bash
|
||||
# Formatting issue — auto-fix with:
|
||||
# Go:
|
||||
gofmt -w .
|
||||
goimports -w .
|
||||
|
||||
# Python:
|
||||
black .
|
||||
isort .
|
||||
|
||||
# Node/TypeScript:
|
||||
npm run lint:fix
|
||||
# Or: npx eslint --fix . && npx prettier --write .
|
||||
|
||||
# Re-run lint to confirm
|
||||
make lint
|
||||
```
|
||||
|
||||
### Environment variables not loading
|
||||
|
||||
**Symptom:** Service starts but immediately fails with "missing required environment variable: [VAR]"
|
||||
|
||||
```bash
|
||||
# Verify .env.local exists and has all required variables
|
||||
grep "^[A-Z]" .env.local | cut -d= -f1
|
||||
|
||||
# Compare against required variables in .env.example
|
||||
diff <(grep "^[A-Z_][A-Z0-9_]*=" .env.example | cut -d= -f1 | sort) \
     <(grep "^[A-Z_][A-Z0-9_]*=" .env.local | cut -d= -f1 | sort)
|
||||
|
||||
# Lines prefixed with "<" are variables defined in .env.example but missing from .env.local
|
||||
```
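
The same comparison can be sketched in Python for shells without process substitution. This is a minimal sketch that assumes simple `KEY=value` lines, optionally prefixed with `export`. `env_keys` and `missing_keys` are hypothetical helper names, not part of the repository.

```python
import re
from pathlib import Path

# Matches "KEY=" at the start of a line, optionally prefixed with "export ".
_KEY_RE = re.compile(r"^(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=")


def env_keys(text: str) -> set[str]:
    """Return the variable names defined in dotenv-style text."""
    keys = set()
    for line in text.splitlines():
        match = _KEY_RE.match(line.strip())
        if match:  # comments and blank lines do not match
            keys.add(match.group(1))
    return keys


def missing_keys(example_text: str, local_text: str) -> list[str]:
    """Variables present in the example file but absent from the local file."""
    return sorted(env_keys(example_text) - env_keys(local_text))


if __name__ == "__main__":
    example = Path(".env.example")
    local = Path(".env.local")
    if example.exists() and local.exists():
        for name in missing_keys(example.read_text(), local.read_text()):
            print(f"missing: {name}")
```

Only variable names are printed, so secret values never reach the terminal.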
|
||||
|
||||
---
|
||||
|
||||
## 10. First Contribution Checklist
|
||||
|
||||
Before opening your first pull request, verify:
|
||||
|
||||
**Setup complete:**
|
||||
- [ ] `make build` passes with no errors
|
||||
- [ ] `make test` passes — all tests green
|
||||
- [ ] `make lint` passes — no lint errors
|
||||
- [ ] Service starts and health check returns 200
|
||||
- [ ] You can authenticate and call at least one API endpoint
|
||||
|
||||
**Git and GitHub:**
|
||||
- [ ] You have read [CONTRIBUTING.md] — code standards, commit message format, PR process
|
||||
- [ ] Your git user.name and user.email are set correctly
|
||||
- [ ] Pre-commit hooks are installed (`ls .git/hooks/pre-commit` should list the file)
|
||||
- [ ] You have branched from `main` (not committing directly to main)
|
||||
|
||||
**Development workflow:**
|
||||
- [ ] You know how to run a specific test: `[test command for single test]`
|
||||
- [ ] You know how to reset the database: `docker compose down -v && docker compose up -d && make db-migrate && make db-seed`
|
||||
- [ ] You have joined [Slack: #[team-channel]] and [#[service-consumers-channel] if applicable]
|
||||
- [ ] You have read the [architecture overview doc / README] — you understand what this service does
|
||||
|
||||
**First PR:**
|
||||
- [ ] Changes are small and focused — one logical change per PR
|
||||
- [ ] Tests are added or updated for your change
|
||||
- [ ] `make test && make lint && make build` all pass locally before requesting review
|
||||
- [ ] PR description explains what changed and why (use the [pr-description-writer skill] if needed)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] A new engineer with no prior knowledge of the project can follow this guide from start to finish without asking anyone for help
|
||||
- [ ] Every command is tested on a clean environment — not written from memory and assumed to work
|
||||
- [ ] Environment variables table covers every variable in `.env.example` — no undocumented variables
|
||||
- [ ] The troubleshooting section covers the 5 most common real failures observed during onboarding — not theoretical issues
|
||||
- [ ] Docker Compose version and Docker Desktop memory requirements are stated explicitly
|
||||
- [ ] "Expected output" is shown for key commands so engineers know whether a step succeeded
|
||||
- [ ] Setup time estimate is honest — verified by timing a real onboarding session, not estimated
|
||||
@@ -0,0 +1,290 @@
|
||||
---
|
||||
name: microservices-decomposition
|
||||
description: "Design a microservices decomposition for a monolith or new system, defining service boundaries, ownership, communication patterns, and migration plan. Use when asked to decompose a monolith, define service boundaries, design a microservices architecture, or plan a strangler-fig migration. Produces a bounded context map, service inventory table, communication pattern decisions, data ownership matrix, migration roadmap, and risk register."
|
||||
---
|
||||
|
||||
# Microservices Decomposition
|
||||
|
||||
Produce a complete microservices decomposition design for a system — whether decomposing an existing monolith or designing service boundaries for a new system. Ground the decomposition in Domain-Driven Design (DDD) concepts: identify bounded contexts first, then derive service boundaries from them. Include communication pattern decisions (sync vs. async, event vs. RPC), data ownership rules, and a pragmatic migration plan if decomposing a monolith. Conway's Law is real — include an organizational alignment section. The deliverable should be specific enough that a team can begin implementation, not an abstract architectural diagram.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **System or domain description** — what the system does, its core domain, and the key business processes it supports
|
||||
- **Current architecture** — monolith (describe the tech stack and rough module structure), partial services (list existing services), or greenfield
|
||||
- **Team structure** — number of teams, team names if known, and approximate team sizes; this drives service ownership
|
||||
- **Performance and scalability requirements** — any specific SLAs, load characteristics, or scaling constraints per domain area
|
||||
- **Migration constraints** — what cannot be rewritten all at once, hard deadlines, zero-downtime requirements, budget constraints
|
||||
- **Integration points** — external systems, third-party APIs, or legacy systems that cannot be changed
|
||||
|
||||
If decomposing a monolith, also ask for: approximate codebase size, what is most painful to change today, and where the team experiences the most coupling-related friction.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Microservices Decomposition: [System Name]
|
||||
|
||||
**Author:** [Name / Team]
|
||||
**Date:** [Date]
|
||||
**Architecture type:** [Monolith decomposition / New system design]
|
||||
**Current state:** [One sentence describing what exists today]
|
||||
**Target state:** [One sentence describing the desired end state]
|
||||
|
||||
---
|
||||
|
||||
## 1. Domain Analysis
|
||||
|
||||
### Core Domain
|
||||
|
||||
[One paragraph: what is the core domain of this system? What does the business fundamentally do? What gives it competitive differentiation? The core domain gets the most investment and the cleanest service boundaries.]
|
||||
|
||||
### Domain Map
|
||||
|
||||
List every significant subdomain before assigning service boundaries. Classify each subdomain:
|
||||
|
||||
| Subdomain | Type | Description | Current Location in Monolith |
|
||||
|-----------|------|-------------|------------------------------|
|
||||
| [Subdomain, e.g., Order Management] | Core | [What it does and why it matters] | [Module/package name or "new"] |
|
||||
| [Subdomain, e.g., Inventory] | Core | [Description] | [Location] |
|
||||
| [Subdomain, e.g., Notifications] | Supporting | [Description] | [Location] |
|
||||
| [Subdomain, e.g., Billing] | Supporting | [Description] | [Location] |
|
||||
| [Subdomain, e.g., Reporting] | Generic | [Description — candidates for off-the-shelf solutions] | [Location] |
|
||||
| [Subdomain, e.g., User Auth] | Generic | [Description] | [Location] |
|
||||
|
||||
**Subdomain types:** Core = competitive differentiation, build with care; Supporting = necessary but not differentiating, build pragmatically; Generic = commodity, buy or use open source.
|
||||
|
||||
---
|
||||
|
||||
## 2. Bounded Context Map (ASCII)
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ [System Name] │
|
||||
│ │
|
||||
│ ┌──────────────────┐ ┌──────────────────┐ │
|
||||
│ │ [Context A] │ │ [Context B] │ │
|
||||
│ │ │─ ─►│ │ │
|
||||
│ │ [key concepts] │ │ [key concepts] │ │
|
||||
│ └──────────────────┘ └──────────────────┘ │
|
||||
│ │ │ │
|
||||
│ │ event │ sync │
|
||||
│ ▼ ▼ │
|
||||
│ ┌──────────────────┐ ┌──────────────────┐ │
|
||||
│ │ [Context C] │ │ [Context D] │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ [key concepts] │ │ [key concepts] │ │
|
||||
│ └──────────────────┘ └──────────────────┘ │
|
||||
│ │ │
|
||||
│ ┌────────┘ │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────┐ │
|
||||
│ │ [Context E] │ │
|
||||
│ │ [key concepts] │ │
|
||||
│ └──────────────────┘ │
|
||||
│ │
|
||||
│ External: [Third-party system] ──► [Context that owns it] │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
|
||||
Legend: ──► sync call ─ ─► async event ═══ shared kernel
|
||||
```
|
||||
|
||||
Render this map using the actual bounded contexts derived from the domain analysis. Place contexts that communicate frequently closer together. Label relationship types on arrows.
|
||||
|
||||
### Context Relationships
|
||||
|
||||
| Upstream Context | Downstream Context | Relationship Type | Integration Pattern |
|
||||
|-----------------|-------------------|------------------|---------------------|
|
||||
| [Context A] | [Context B] | Customer-Supplier | REST API call |
|
||||
| [Context B] | [Context C] | Published Language | Domain events via message bus |
|
||||
| [Context X] | [Context Y] | Conformist | [Downstream conforms to upstream's model] |
|
||||
| [Context X] | [Context Y] | Anti-Corruption Layer | [ACL translates upstream model to local model] |
|
||||
|
||||
---
|
||||
|
||||
## 3. Proposed Service Inventory
|
||||
|
||||
| Service Name | Bounded Context | Core Responsibility | Team Owner | Tech Stack | Priority |
|
||||
|-------------|----------------|--------------------|-----------|-----------|---------|
|
||||
| [service-name] | [Context] | [One sentence: what this service owns and does] | [Team] | [Language/framework] | [P1/P2/P3] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
|
||||
**Service count:** [N proposed services] for [M bounded contexts]. [Note if any context maps to multiple services and why — e.g., "the Orders context splits into order-intake and order-fulfillment because they have different scalability requirements."]
|
||||
|
||||
### Service Responsibility Rules (applied to every service above)
|
||||
|
||||
- Single bounded context ownership — a service does not straddle two bounded contexts
|
||||
- Owns its own data — no direct database access by other services
|
||||
- Independently deployable — no coordinated deploys required with other services
|
||||
- Has a named team owner — no shared ownership of a single service across teams
|
||||
- Exposes a defined API contract — not internal implementation
|
||||
|
||||
---
|
||||
|
||||
## 4. Inter-Service Communication Patterns
|
||||
|
||||
### Pattern Decision Matrix
|
||||
|
||||
| Communication Need | Recommended Pattern | Rationale |
|
||||
|-------------------|--------------------|-----------|
|
||||
| Query another service's current state | Synchronous REST / gRPC | Low latency required; caller needs immediate response |
|
||||
| Notify other services of a state change | Async domain event | Decouples services; multiple consumers; sender doesn't care when it's processed |
|
||||
| Long-running workflow spanning services | Async saga (choreography or orchestration) | No single local transaction can span the services; compensating actions are needed if a step fails |
|
||||
| Read-heavy cross-service aggregation | CQRS read model / materialized view | Avoid chatty sync calls at read time; build purpose-fit read models |
|
||||
| Real-time push to clients | WebSocket gateway service | Centralizes connection management; services emit events, gateway pushes |
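
The saga row above can be illustrated with a minimal orchestration sketch in Python: steps run in order, and when one fails, the already completed steps are compensated in reverse. `run_saga` and the `Step` tuple shape are assumptions made for illustration. A real orchestrator would also persist saga state and retry compensations.

```python
from typing import Callable

# (name, action, compensation): hypothetical shape for one saga step
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns a log of what happened, for illustration.
    """
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo in reverse order so later effects are reverted first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

Each compensation is a new forward action (for example, releasing a reservation), not a database rollback, which is why every step needs one defined up front.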
|
||||
|
||||
### Per-Service Communication Decisions
|
||||
|
||||
| Service | Calls (sync) | Publishes (events) | Subscribes to (events) |
|
||||
|---------|-------------|-------------------|----------------------|
|
||||
| [service-name] | [service-name (endpoint)] | [EventName] | [EventName] |
|
||||
| [service-name] | — | [EventName], [EventName] | [EventName] |
|
||||
| [service-name] | [service-name (endpoint)] | — | [EventName] |
|
||||
|
||||
### Event Catalog
|
||||
|
||||
| Event Name | Producer | Consumers | Payload (key fields) | Trigger |
|
||||
|-----------|---------|---------|---------------------|---------|
|
||||
| [OrderPlaced] | [order-service] | [inventory-service, notification-service] | `orderId, customerId, lineItems, totalAmount` | Customer submits order |
|
||||
| [InventoryReserved] | [inventory-service] | [order-service] | `orderId, reservationId, items` | Inventory successfully reserved |
|
||||
| [PaymentProcessed] | [payment-service] | [order-service, notification-service] | `orderId, paymentId, amount, status` | Payment confirmed |
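
As an illustration of one catalog row, the sketch below models `OrderPlaced` as an immutable Python type and publishes it to two consumers. The field names mirror the payload column in snake_case. `InMemoryBus`, the `event_id` field, and the string-typed amount are assumptions for illustration, and the bus is a stand-in for a real message broker.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class OrderPlaced:
    """Payload fields taken from the OrderPlaced row of the event catalog."""
    order_id: str
    customer_id: str
    line_items: tuple
    total_amount: str  # decimal as string to avoid float rounding (assumption)
    # Unique ID so consumers can de-duplicate redelivered events (assumption)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class InMemoryBus:
    """Stand-in for a message broker; delivers synchronously for illustration."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # The producer does not know which services consume the event.
        for handler in self._handlers[type(event)]:
            handler(event)
```

With this shape, adding a new consumer means one more `subscribe` call and no change to the producer.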
|
||||
|
||||
---
|
||||
|
||||
## 5. Data Ownership Matrix
|
||||
|
||||
Each piece of data has exactly one owning service. Other services may cache or project a read model, but they do not write to the owner's database.
|
||||
|
||||
| Data Entity | Owner Service | Authoritative Store | Consumers | Access Pattern |
|
||||
|-------------|--------------|--------------------|-----------|----------------|
|
||||
| [Order] | [order-service] | [PostgreSQL] | [fulfillment-service, reporting-service] | Event subscription + read API |
|
||||
| [Customer] | [customer-service] | [PostgreSQL] | [order-service, notification-service] | Sync API call |
|
||||
| [Product Catalog] | [catalog-service] | [PostgreSQL] | [order-service, inventory-service] | Sync API + cached local copy |
|
||||
| [Inventory Level] | [inventory-service] | [Redis + PostgreSQL] | [catalog-service (read only)] | Event subscription |
|
||||
| [Payment Record] | [payment-service] | [PostgreSQL] | [order-service] | Event subscription |
|
||||
|
||||
### Data Migration (if decomposing a monolith)
|
||||
|
||||
| Data Entity | Current Location | Target Service | Migration Approach | Data Volume | Risk |
|
||||
|-------------|-----------------|---------------|-------------------|-------------|------|
|
||||
| [Entity] | [monolith.orders table] | [order-service] | Dual-write then cut over | [X rows] | [High/Med/Low] |
|
||||
| [Entity] | [monolith.users table] | [customer-service] | Extract and sync via CDC | [X rows] | [High/Med/Low] |
|
||||
|
||||
---
|
||||
|
||||
## 6. API Contract Definitions
|
||||
|
||||
Define the surface area for each service. Full OpenAPI specs are written separately; this section establishes the contract boundaries.
|
||||
|
||||
### [service-name] API
|
||||
|
||||
**Base path:** `/api/v1/[resource]`
|
||||
**Owner team:** [Team]
|
||||
**SLA:** [p99 latency target, availability target]
|
||||
|
||||
| Endpoint | Method | Description | Auth Required | Rate Limit |
|
||||
|----------|--------|-------------|--------------|------------|
|
||||
| `/[resources]` | GET | List [resources] with pagination | Yes | [X req/min] |
|
||||
| `/[resources]/{id}` | GET | Get single [resource] by ID | Yes | [X req/min] |
|
||||
| `/[resources]` | POST | Create new [resource] | Yes | [X req/min] |
|
||||
| `/[resources]/{id}` | PUT | Update [resource] | Yes | [X req/min] |
|
||||
| `/[resources]/{id}` | DELETE | Soft-delete [resource] | Yes — elevated | [X req/min] |
|
||||
|
||||
[Repeat for each service.]
|
||||
|
||||
---
|
||||
|
||||
## 7. Strangler Fig Migration Plan (for monolith decomposition)
|
||||
|
||||
Use the strangler fig pattern: extract services incrementally, route traffic through a facade, and retire monolith modules one at a time.
|
||||
|
||||
### Migration Phases
|
||||
|
||||
```
|
||||
Phase 1: Foundation (Weeks 1–[N])
|
||||
- Deploy service infrastructure (CI/CD, observability, API gateway)
|
||||
- Extract lowest-risk, highest-value service first
|
||||
- Monolith continues to serve all traffic
|
||||
|
||||
Phase 2: First Extractions (Weeks [N]–[M])
|
||||
- Extract P1 services
|
||||
- API gateway routes selected traffic to new services
|
||||
- Monolith handles remaining traffic via facade pattern
|
||||
- New services write to both their own store and the monolith DB during transition (dual-write)
|
||||
|
||||
Phase 3: Core Domain Services (Weeks [M]–[P])
|
||||
- Extract P2 services, including core domain services
|
||||
- Data migration for extracted services
|
||||
- Remove dual-write paths for completed migrations
|
||||
|
||||
Phase 4: Monolith Retirement (Weeks [P]–[Q])
|
||||
- Extract remaining services
|
||||
- Monolith serves no production traffic
|
||||
- Decommission monolith infrastructure
|
||||
```
|
||||
|
||||
### Phase-by-Phase Roadmap
|
||||
|
||||
| Phase | Service to Extract | Migration Approach | Team | Duration | Dependencies | Success Criteria |
|
||||
|-------|------------------|--------------------|------|----------|-------------|-----------------|
|
||||
| 1 | [service-name] | [Strangler facade / Branch by abstraction / Event interception] | [Team] | [X weeks] | [Infra ready, CI/CD pipeline] | [Traffic fully on new service, zero errors for 2 weeks] |
|
||||
| 2 | [service-name] | [Approach] | [Team] | [X weeks] | [Phase 1 complete] | [Success metric] |
|
||||
| 3 | [service-name] | [Approach] | [Team] | [X weeks] | [Phase 2 complete] | [Success metric] |
|
||||
|
||||
### Rollback Plan
|
||||
|
||||
For each migration phase, define the rollback trigger and mechanism:
|
||||
- **Rollback trigger:** Error rate on new service > [X%] sustained for [Y minutes], or p99 latency > [threshold]
|
||||
- **Rollback mechanism:** API gateway feature flag reverts all traffic to monolith path in < 5 minutes
|
||||
- **Data rollback:** Dual-write maintained for [X weeks] after cutover to allow replay if needed
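
The rollback trigger above can be expressed as a small check over per-minute samples. This is a sketch in Python under stated assumptions: `Sample`, `should_roll_back`, and the one-sample-per-minute cadence are hypothetical, and the thresholds stand in for the bracketed placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One per-minute observation of the new service (hypothetical shape)."""
    error_rate_pct: float
    p99_latency_ms: float


def should_roll_back(
    samples: list[Sample],
    max_error_rate_pct: float,
    sustained_minutes: int,
    max_p99_latency_ms: float,
) -> bool:
    """True if the error rate exceeded the limit for the whole trailing window,
    or the most recent p99 latency exceeds its threshold."""
    if sustained_minutes < 1:
        raise ValueError("sustained_minutes must be at least 1")
    if not samples:
        return False
    if samples[-1].p99_latency_ms > max_p99_latency_ms:
        return True
    window = samples[-sustained_minutes:]
    # Require a full window so a single bad minute does not trigger rollback.
    return len(window) == sustained_minutes and all(
        s.error_rate_pct > max_error_rate_pct for s in window
    )
```

Requiring every sample in the window to breach the limit is what makes the condition "sustained"; one recovered minute resets it.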
|
||||
|
||||
---
|
||||
|
||||
## 8. Organizational Alignment (Conway's Law)
|
||||
|
||||
Conway's Law: the architecture of a system mirrors the communication structure of the organization that builds it. Design service ownership to match team boundaries — or change the team boundaries.
|
||||
|
||||
| Service | Proposed Owner Team | Current Team Assignment | Change Required |
|
||||
|---------|--------------------|-----------------------|-----------------|
|
||||
| [service-name] | [Team A] | [Same / Different] | [No change / Transfer to Team A / New team needed] |
|
||||
| [service-name] | [Team B] | [Team A currently] | [Transfer ownership] |
|
||||
|
||||
**Misalignments identified:**
|
||||
- [Misalignment 1: e.g., "The notification service spans two teams today. Assign it entirely to Team B which already owns the messaging domain."]
|
||||
- [Misalignment 2: e.g., "The reporting service is owned by Data Eng but consumers are Product teams — establish a clear API contract and SLA."]
|
||||
|
||||
**Team topology recommendation:** [Describe the recommended team structure — stream-aligned teams, platform team, enabling team — and how it maps to the proposed services.]
|
||||
|
||||
---
|
||||
|
||||
## 9. Risk Register
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation | Owner |
|
||||
|------|-----------|--------|-----------|-------|
|
||||
| Data consistency across services during migration | High | High | Dual-write with reconciliation job; event sourcing for critical domains | [Name] |
|
||||
| Distributed transaction complexity (sagas) | Medium | High | Start with choreography; add orchestration only when choreography becomes unmanageable | [Name] |
|
||||
| Service mesh operational overhead | Medium | Medium | Start without a mesh; add after 5+ services deployed | [Name] |
|
||||
| Network latency replacing in-process calls | Medium | Medium | Cache aggressively; design read models to avoid chatty sync calls | [Name] |
|
||||
| Conway's Law friction during transition | High | Medium | Align team structure before starting extraction, not after | [Name] |
|
||||
| Over-decomposition (nanoservices) | Medium | High | Enforce minimum service size rule: a service must justify its own team/deployment overhead | [Name] |
|
||||
| Observability gaps during migration | High | High | Deploy distributed tracing before first extraction; establish correlation IDs | [Name] |
|
||||
| [Context-specific risk] | [Level] | [Level] | [Mitigation] | [Owner] |
|
||||
|
||||
---
|
||||
|
||||
*Questions about this design: [Slack channel or contact]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Bounded context map is an ASCII diagram with labeled relationships — not a prose description of the contexts
|
||||
- [ ] Every service in the inventory table has a named team owner and a clear single-sentence responsibility statement
|
||||
- [ ] Data ownership matrix assigns every key entity to exactly one owning service — no shared ownership
|
||||
- [ ] Communication pattern decisions explain WHY sync vs. async was chosen for each interaction type
|
||||
- [ ] If decomposing a monolith, the strangler fig migration plan has phases with durations, dependencies, and success criteria
|
||||
- [ ] Risk register addresses at minimum: data consistency, distributed transactions, and Conway's Law alignment
|
||||
- [ ] Organizational alignment section maps services to teams and identifies misalignments that need to be resolved
|
||||
@@ -0,0 +1,436 @@
|
||||
---
|
||||
name: monitoring-setup-guide
|
||||
description: "Write a monitoring setup guide for a service — defining what to measure, how to alert on it, and how to build the observability stack covering the four golden signals, business metrics, log strategy, distributed tracing, alerting rules, dashboard layout, and observability debt. Use when asked to set up monitoring for a service, define alerting strategy, write an observability plan, create a dashboard specification, or document logging standards for a team. Produces a metric definitions table, alert rules specification, dashboard layout wireframe, log schema, tracing setup checklist, and monitoring gap analysis."
|
||||
---
|
||||
|
||||
# Monitoring Setup Guide Skill
|
||||
|
||||
Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and description** — what the service does and its role in the system
|
||||
- **Tech stack** — language, framework, and infrastructure (e.g. Go/gRPC on Kubernetes, Python/FastAPI on ECS)
|
||||
- **Current monitoring tooling** — Datadog, Prometheus + Grafana, CloudWatch, New Relic, Honeycomb, or none yet
|
||||
- **Key user journeys** — the 2–4 most important things a user or consumer does with the service (these drive what to alert on)
|
||||
- **Existing alerts** — paste any existing alert configurations or describe what's currently monitored
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Monitoring Setup Guide: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**Stack:** [Language/Framework] on [Infrastructure]
|
||||
**Monitoring platform:** [Datadog / Prometheus+Grafana / CloudWatch / etc.]
|
||||
**Date:** [Date] | **Review cycle:** Quarterly
|
||||
|
||||
---
|
||||
|
||||
## 1. Monitoring Philosophy
|
||||
|
||||
Good monitoring answers three questions:
|
||||
1. **Is the service healthy right now?** (alerting)
|
||||
2. **Was it healthy in the past, and is it trending worse?** (dashboards + SLO tracking)
|
||||
3. **Why did something fail?** (logs + traces)
|
||||
|
||||
This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.
|
||||
|
||||
**Key user journeys monitored:**
|
||||
- Journey 1: [e.g. "User submits a payment — POST /charges, receives confirmation"]
|
||||
- Journey 2: [e.g. "User views transaction history — GET /transactions"]
|
||||
- Journey 3: [e.g. "Subscription renewal job runs — background worker processes billing events"]
|
||||
|
||||
---
|
||||
|
||||
## 2. The Four Golden Signals
|
||||
|
||||
Apply the four golden signals specifically to [Service Name]:
|
||||
|
||||
### Latency
|
||||
|
||||
Latency measures how long requests take to complete. Track it separately for successful and failed requests: fast errors pull the aggregate down and can mask slow successes, while slow errors (for example, timeouts) inflate it.
|
||||
|
||||
| Metric | Description | Source | Dimensions |
|
||||
|---|---|---|---|
|
||||
| `[service].request.duration_ms` | End-to-end request latency | Application instrumentation | `endpoint`, `method`, `status_code` |
|
||||
| `[service].db.query_duration_ms` | Database query latency | ORM / query instrumentation | `query_name`, `table` |
|
||||
| `[service].external.request_duration_ms` | Outbound call latency to dependencies | HTTP client instrumentation | `target_service`, `endpoint` |
|
||||
| `[service].queue.processing_duration_ms` | Time to process one message (if applicable) | Consumer instrumentation | `queue_name`, `message_type` |
|
||||
|
||||
**Latency SLO targets:**
|
||||
|
||||
| Endpoint / operation | p50 target | p95 target | p99 target |
|
||||
|---|---|---|---|
|
||||
| `GET /api/v1/[resource]` | < [50] ms | < [200] ms | < [500] ms |
|
||||
| `POST /api/v1/[resource]` | < [100] ms | < [400] ms | < [1000] ms |
|
||||
| `GET /health` | < [10] ms | < [20] ms | < [50] ms |
|
||||
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
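
To make the percentile targets concrete, the sketch below computes nearest-rank percentiles from raw durations and compares them with targets. It is a minimal Python illustration: `percentile` and `meets_targets` are hypothetical helpers, and monitoring platforms usually derive percentiles from histograms rather than raw samples.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing to keep integer inputs exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_targets(
    durations_ms: list[float], targets_ms: dict[float, float]
) -> dict[float, bool]:
    """Map each percentile to whether the observed value is under its target."""
    return {
        pct: percentile(durations_ms, pct) < limit
        for pct, limit in targets_ms.items()
    }
```

A call such as `meets_targets(durations, {50: 50, 95: 200, 99: 500})` mirrors the first row of the table, with the bracketed defaults used as example limits.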
|
||||
|
||||
### Traffic
|
||||
|
||||
Traffic measures demand on the system. Use it to detect unexpected spikes, traffic drops (which can indicate upstream failures), and to capacity-plan.
|
||||
|
||||
| Metric | Description | Source |
|
||||
|---|---|---|
|
||||
| `[service].request.count` | Requests per second | Application / load balancer |
|
||||
| `[service].request.count_by_endpoint` | RPS broken down by endpoint | Application |
|
||||
| `[service].queue.messages_consumed_per_second` | Consumer throughput | Queue consumer |
|
||||
| `[service].queue.depth` | Messages waiting in queue | Queue metrics |
|
||||
|
||||
**Traffic baselines (update after observing production for 2+ weeks):**
|
||||
|
||||
| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
|
||||
|---|---|---|---|
|
||||
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
|
||||
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
|
||||
|
||||
### Errors
|
||||
|
||||
Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).
|
||||
|
||||
| Metric | Description | Alert on? |
|
||||
|---|---|---|
|
||||
| `[service].request.error_rate` | 5xx errors / total requests | Yes — see alert rules |
|
||||
| `[service].request.client_error_rate` | 4xx errors / total requests | Threshold alert — sudden spike may indicate API misuse |
|
||||
| `[service].dependency.error_rate` | Errors calling downstream dependencies | Yes — dependency health signal |
|
||||
| `[service].queue.dlq_depth` | Messages in dead-letter queue | Yes — indicates processing failures |
|
||||
|
||||
### Saturation
|
||||
|
||||
Saturation measures how "full" the service is: how close its most constrained resources are to maximum capacity.
|
||||
|
||||
| Resource | Metric | Alert threshold | Source |
|
||||
|---|---|---|---|
|
||||
| CPU | `[service].cpu.utilisation_pct` | >80% sustained 5 min | Container / VM metrics |
|
||||
| Memory | `[service].memory.utilisation_pct` | >85% sustained 5 min | Container / VM metrics |
|
||||
| DB connections | `[service].db.connection_pool.utilisation_pct` | >75% | Application / DB metrics |
|
||||
| Thread pool / goroutines | `[service].runtime.goroutine_count` / `thread_count` | >N (establish baseline) | Runtime metrics |
|
||||
| Disk (if applicable) | `[service].disk.utilisation_pct` | >75% | Infrastructure |
|
||||
| Queue depth (if applicable) | `[service].queue.depth` | >[backlog threshold] | Queue metrics |
|
||||
|
||||
---
|
||||
|
||||
## 3. Business Metrics
|
||||
|
||||
Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.
|
||||
|
||||
| Metric | Description | Source | Alert? |
|
||||
|---|---|---|---|
|
||||
| `[service].[primary_action].success_rate` | [e.g. "Payment success rate"] | Application | Yes — if drops >5% vs 1h average |
|
||||
| `[service].[primary_action].count` | [e.g. "Payments processed per minute"] | Application | Yes — sudden drop (traffic anomaly) |
|
||||
| `[service].[resource].created_per_hour` | [e.g. "New accounts created"] | Application / DB | No — informational |
|
||||
| `[service].cache.hit_rate` | Fraction of requests served from cache | Cache instrumentation | Yes — if drops below [60]% |
|
||||
| `[service].job.[name].success_rate` | [Background job success rate] | Job framework | Yes — if drops below [99]% |
|
||||
|
||||
---
|
||||
|
||||
## 4. Log Strategy
|
||||
|
||||
### Structured Logging Schema
|
||||
|
||||
All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.
|
||||
|
||||
**Mandatory fields (every log line):**
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-01-15T10:23:45.123Z",
|
||||
"level": "info",
|
||||
"service": "[service-name]",
|
||||
"version": "[git-sha-short]",
|
||||
"trace_id": "[uuid-from-request-context]",
|
||||
"span_id": "[span-uuid]",
|
||||
"request_id": "[uuid-per-request]",
|
||||
"message": "[human readable description]"
|
||||
}
|
||||
```
|
||||
|
||||
**Request log (emit for every HTTP request):**
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "...",
|
||||
"level": "info",
|
||||
"service": "[service-name]",
|
||||
"event": "http_request",
|
||||
"method": "POST",
|
||||
"path": "/api/v1/[resource]",
|
||||
"status_code": 201,
|
||||
"duration_ms": 45,
|
||||
"user_id": "[uuid — DO NOT log PII directly]",
|
||||
"request_id": "[uuid]",
|
||||
"trace_id": "[uuid]"
|
||||
}
|
||||
```
|
||||
|
||||
**Error log (emit for every error with context):**
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "...",
|
||||
"level": "error",
|
||||
"service": "[service-name]",
|
||||
"event": "error",
|
||||
"error_code": "[application-error-code]",
|
||||
"error_message": "[description — no sensitive data]",
|
||||
"stack_trace": "[stack trace]",
|
||||
"request_id": "[uuid]",
|
||||
"trace_id": "[uuid]",
|
||||
"context": {
|
||||
"[key]": "[relevant context without PII]"
|
||||
}
|
||||
}
|
||||
```
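
One way to emit the mandatory fields is a custom formatter on the standard library `logging` module. This is a sketch under assumptions: the class and helper names are hypothetical, the context ids are passed through `extra=`, and a real service would more likely pull `trace_id` and `span_id` from its tracing context.

```python
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("trace_id", "span_id", "request_id")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the mandatory fields."""

    def __init__(self, service: str, version: str) -> None:
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        # The schema uses "warn"; the stdlib level name is "WARNING".
        level = "warn" if record.levelname == "WARNING" else record.levelname.lower()
        payload = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Context ids arrive through `extra=`; missing ones are emitted as null.
        for field in CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(service: str, version: str, stream=None) -> logging.Logger:
    """Create a logger that writes structured JSON to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, version))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payments-api", "a1b2c3d")
logger.info(
    "payment processed",
    extra={"trace_id": "trace-123", "span_id": "span-456", "request_id": "req-789"},
)
```

Emitting missing ids as `null` rather than dropping the key keeps every log line the same shape, which makes gaps in context propagation easy to query for.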
|
||||
|
||||
### Log Levels — When to Use Each
|
||||
|
||||
| Level | Use when | Example |
|---|---|---|
| `error` | Something failed and requires attention; sustained errors should eventually page on-call | Database query failed, external API returned 5xx, required config missing |
| `warn` | Something unexpected happened but the service is still functioning | Retry succeeded after failure, cache miss on expected hit, rate limit approaching |
| `info` | Significant business events and request lifecycle | Request received, payment processed, user authenticated, job started/completed |
| `debug` | Detailed diagnostic information — off in production by default | Query parameters, intermediate computation results, cache key lookups |
|
||||
|
||||
### What NOT to Log
|
||||
|
||||
**Never log:**
|
||||
- Passwords, tokens, API keys, or secrets (even hashed)
|
||||
- Full credit card numbers or PAN data
|
||||
- Social security numbers or government IDs
|
||||
- Full names + dates of birth + contact info in the same log line (PII aggregation)
|
||||
- Request/response bodies in full (use field-level extraction instead)
|
||||
- Health check requests (too noisy — exclude `GET /health` from access logs)
|
||||
|
||||
---
|
||||
|
||||
## 5. Distributed Tracing Setup
|
||||
|
||||
Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.
|
||||
|
||||
### Instrumentation Checklist
|
||||
|
||||
```
|
||||
[ ] Tracing library installed:
|
||||
- Go: go.opentelemetry.io/otel
|
||||
- Python: opentelemetry-sdk, opentelemetry-instrumentation
|
||||
- Node: @opentelemetry/sdk-node
|
||||
- Java: opentelemetry-java-instrumentation
|
||||
|
||||
[ ] Tracer initialized at service startup with service name and version
|
||||
|
||||
[ ] Trace context propagated via W3C Trace Context headers:
|
||||
traceparent: 00-[trace-id]-[span-id]-01
|
||||
tracestate: [optional vendor-specific]
|
||||
|
||||
[ ] Automatic instrumentation enabled for:
|
||||
[ ] Inbound HTTP/gRPC requests (creates root span)
|
||||
[ ] Outbound HTTP/gRPC calls (creates child spans)
|
||||
[ ] Database queries (creates child spans with sanitized query)
|
||||
[ ] Cache operations (Redis, Memcached)
|
||||
[ ] Message queue produce/consume
|
||||
|
||||
[ ] Custom spans added for:
|
||||
[ ] Key business operations ([e.g. payment processing, user lookup])
|
||||
[ ] Background jobs (each job execution = root span)
|
||||
[ ] Third-party API calls with custom attributes
|
||||
|
||||
[ ] Span attributes to capture on all spans:
|
||||
- user.id (if authenticated — no PII)
|
||||
- deployment.environment (production/staging)
|
||||
- service.version (git SHA)
|
||||
- [service-specific key attributes]
|
||||
|
||||
[ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]
|
||||
|
||||
[ ] Sampling rate configured:
|
||||
- Production: [1–10]% of requests (adjust based on volume and cost)
|
||||
- Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint]
|
||||
```
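
For verifying propagation by hand, the `traceparent` header from the checklist can be parsed and re-issued in a few lines. This sketch handles only the version `00` layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags); the helper names are hypothetical, and in practice the OpenTelemetry propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        """True when the least significant flag bit (sampled) is set."""
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version ff and all-zero trace or parent ids are invalid.
    if parsed.version == "ff" or set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


def child_traceparent(incoming: TraceParent) -> str:
    """Header for an outbound call: same trace id and flags, fresh span id."""
    return f"00-{incoming.trace_id}-{secrets.token_hex(8)}-{incoming.flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
if incoming is not None:
    print(incoming.sampled)             # True
    print(child_traceparent(incoming))  # same trace id, new span id
```

A quick propagation test is to send a request with a known trace id and confirm that the same id appears in the logs of every downstream service.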
|
||||
|
||||
### Trace Instrumentation Examples
|
||||
|
||||
```python
# Python — OpenTelemetry example
# PaymentError and _do_process are application-defined placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("[service-name]")


def process_payment(payment_data):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            result = _do_process(payment_data)
            span.set_status(trace.StatusCode.OK)
            return result
        except PaymentError as e:
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Alert Rules Specification
|
||||
|
||||
Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.
|
||||
|
||||
### Alert Definitions
|
||||
|
||||
| Alert name | Condition | Threshold | Severity | On-call action |
|
||||
|---|---|---|---|---|
|
||||
| `[Service]HighErrorRate` | 5xx error rate, 5-min rolling window | >1% for 2 consecutive windows | P1 | Check recent deploys; inspect error logs; see runbook [link] |
|
||||
| `[Service]CriticalErrorRate` | 5xx error rate, 2-min rolling window | >5% | P1 — immediate | Same as above — page immediately, do not wait |
|
||||
| `[Service]HighP99Latency` | p99 latency on key endpoints | >2× SLO target for 3 min | P2 | Check DB latency, cache hit rate, and upstream dependencies |
|
||||
| `[Service]LatencySLOBreach` | p99 latency | >SLO target for 5 consecutive minutes | P1 | SLO burn — page on-call, escalate if not resolved in 20 min |
|
||||
| `[Service]HighCPU` | CPU utilisation | >80% sustained for 5 min | P2 | Check for traffic spike; scale up if needed; check for runaway processes |
|
||||
| `[Service]HighMemory` | Memory utilisation | >85% sustained for 5 min | P2 | Check for memory leak (especially after deploys); restart pod if OOM imminent |
|
||||
| `[Service]DBConnectionPoolHigh` | DB connection pool utilisation | >75% | P2 | Check for long-running queries; consider scaling service or increasing pool size |
|
||||
| `[Service]DLQDepthHigh` | Dead-letter queue depth | >10 messages | P2 | Inspect DLQ messages for error pattern; fix bug and replay if safe |
|
||||
| `[Service]TrafficDropAnomaly` | RPS, compared to same hour yesterday | >50% drop sustained 5 min | P1 | Upstream may be down; check caller health; check load balancer |
|
||||
| `[Service]PrimaryActionSuccessRateDrop` | [Business metric success rate] | <[95]% over 10 min | P1 | [Service-specific action — e.g. "Check payment provider status"] |
|
||||
| `[Service]DownstreamDependencyErrors` | Error rate calling [dependency] | >5% over 5 min | P2 | Check [dependency] status page; enable fallback if available |
|
||||
|
||||
### Alert Configuration Examples
|
||||
|
||||
```yaml
# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:

      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"

      - alert: [Service]HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [0.5]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"
```
|
||||
|
||||
```python
# Datadog monitor configuration (Python SDK; the same monitor can be defined in Terraform)
from datadog import api, initialize

# Load real keys from the environment or a secret store; never commit them.
initialize(api_key="[key]", app_key="[key]")

api.Monitor.create(
    type="metric alert",
    # Plain string, not an f-string: the braces are Datadog tag-filter syntax.
    query="sum(last_5m):sum:[service].http.errors{service:[service-name]} / sum:[service].http.requests{service:[service-name]} > 0.01",
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=["service:[service-name]", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    },
)
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Dashboard Layout Specification
|
||||
|
||||
The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ [SERVICE NAME] — Service Health Dashboard [Time range ▼] │
|
||||
├───────────────┬───────────────┬───────────────┬─────────────────────┤
|
||||
│ Error rate │ p99 Latency │ RPS (current)│ SLO budget remaining│
|
||||
│ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER / days] │
|
||||
│ vs SLO: 0.1% │ vs SLO: 500ms│ vs avg: [N] │ [Error budget gauge]│
|
||||
├───────────────┴───────────────┴───────────────┴─────────────────────┤
|
||||
│ Error rate over time (24h) │
|
||||
│ [Time series: 5xx rate line, SLO threshold line] │
|
||||
├─────────────────────────────────┬───────────────────────────────────┤
|
||||
│ Latency percentiles over time │ Request throughput over time │
|
||||
│ [Lines: p50, p95, p99, p999] │ [Bars: RPS by endpoint] │
|
||||
│ [SLO threshold horizontal line]│ │
|
||||
├─────────────────────────────────┴───────────────────────────────────┤
|
||||
│ Latency heatmap (all requests — shows distribution shape) │
|
||||
├─────────────────────────────────┬───────────────────────────────────┤
|
||||
│ CPU utilisation over time │ Memory utilisation over time │
|
||||
│ [All instances/pods — lines] │ [All instances/pods — lines] │
|
||||
│ [Alert threshold: 80%] │ [Alert threshold: 85%] │
|
||||
├─────────────────────────────────┴───────────────────────────────────┤
|
||||
│ DB: connection pool utilisation│ DB: query latency (p99 per query)│
|
||||
├─────────────────────────────────┴───────────────────────────────────┤
|
||||
│ [Business metric 1 over time] │ [Business metric 2 over time] │
|
||||
│ e.g. Payment success rate │ e.g. Orders created/min │
|
||||
└─────────────────────────────────┴───────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Second dashboard — Dependency Health:**
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ [SERVICE NAME] — Dependency Health │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ For each dependency: error rate | latency | current status │
|
||||
│ [Database] [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded │
|
||||
│ [Redis] [N]% errors | [N]ms p99 | ● Healthy │
|
||||
│ [External API][N]% errors | [N]ms p99 | ● Healthy │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ Outbound call latency over time (one line per dependency) │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ Circuit breaker / fallback state (if implemented) │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Observability Debt Analysis
|
||||
|
||||
An honest assessment of what is missing today and how urgently each gap should be closed:
|
||||
|
||||
| Gap | Impact | Priority | Effort | Owner | Target date |
|
||||
|---|---|---|---|---|---|
|
||||
| [e.g. No distributed tracing — can't see cross-service latency] | High — blind to dependency issues | P1 | [2 days] | [Name] | [Date] |
|
||||
| [e.g. No business metric alerts — only infra alerts] | High — silent business failures | P1 | [1 day] | [Name] | [Date] |
|
||||
| [e.g. Logs are unstructured text — not searchable] | Medium — slow incident investigation | P2 | [3 days] | [Name] | [Date] |
|
||||
| [e.g. No dead-letter queue monitoring] | Medium — failed messages go unnoticed | P2 | [4 hours] | [Name] | [Date] |
|
||||
| [e.g. Alert thresholds not calibrated to production baseline] | Medium — alert fatigue or missed alerts | P2 | [1 day] | [Name] | [Date] |
|
||||
| [e.g. No latency heatmap — outliers invisible in averages] | Low — harder to spot tail latency issues | P3 | [2 hours] | [Name] | [Date] |
|
||||
|
||||
**Total observability debt: [N] items | Estimated effort: [N days]**
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every alert has a named on-call action — no alert says "investigate" without specifying what to investigate first
|
||||
- [ ] Alert thresholds are calibrated against production baselines, not set to default values from a template
|
||||
- [ ] Structured logging is implemented — no unstructured text log lines in production
|
||||
- [ ] PII is explicitly excluded from logs — a named engineer has verified this
|
||||
- [ ] Distributed tracing is propagating trace IDs across all service boundaries (verify with a test request)
|
||||
- [ ] The primary dashboard answers "is the service healthy?" in under 10 seconds — no hunting for the right panel
|
||||
- [ ] Business metrics are tracked alongside infrastructure metrics — not just four golden signals
|
||||
- [ ] Observability debt items have owners and dates — not just "would be nice to have"
|
||||
@@ -0,0 +1,364 @@
|
||||
---
|
||||
name: oncall-runbook
|
||||
description: "Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template."
|
||||
---
|
||||
|
||||
# On-Call Runbook Skill
|
||||
|
||||
Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.
|
||||
|
||||
A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does
|
||||
- **Team** and tech lead name
|
||||
- **Alert list** — names of alerts that currently page on-call
|
||||
- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc.
|
||||
- **Common failure modes** — what breaks most often, and what fixes it
|
||||
- **Escalation contacts** — who to call when on-call can't resolve it
|
||||
- **Deployment setup** — can on-call roll back? How?
|
||||
- **Service dependencies** — what does this service depend on, and what depends on it?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# On-Call Runbook: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**PagerDuty service:** [Link] | **Escalation policy:** [Policy name]
|
||||
**Last updated:** [Date] | **Next review:** [Date + 90 days]
|
||||
|
||||
> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference
|
||||
|
||||
**Dashboard:** [Link — the first thing to open when paged]
|
||||
**Logs:** [Link — where to find logs]
|
||||
**Runbook index:** Jump to the alert that paged you → [Alert list below]
|
||||
**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty]
|
||||
|
||||
**Rollback command (memorise this):**
|
||||
```bash
|
||||
[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation Matrix
|
||||
|
||||
| Situation | Escalate to | How | After how long |
|
||||
|---|---|---|---|
|
||||
| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
|
||||
| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately |
|
||||
| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) |
|
||||
| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
|
||||
| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
|
||||
| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |
|
||||
|
||||
**Contacts:**
|
||||
|
||||
| Name | Role | Slack | Phone |
|
||||
|---|---|---|---|
|
||||
| [Name] | Tech lead | @[handle] | [Number] |
|
||||
| [Name] | Engineering manager | @[handle] | [Number] |
|
||||
| [Name] | Platform / infra | @[handle] | [Number] |
|
||||
| [Platform team] | Infra on-call | `#platform` | PagerDuty |
|
||||
|
||||
---
|
||||
|
||||
## Service Architecture (Quick View)
|
||||
|
||||
```
[Upstream callers]
        │
        ▼
  [This Service]
        │
        ├──→ [Primary Database]
        ├──→ [Cache — e.g. Redis]
        └──→ [Downstream Service / Queue]
```
|
||||
|
||||
**If this service is down, these are affected:** [List the callers and consumers of this service]
**If these are down, this service is affected:** [List the dependencies this service calls]
|
||||
|
||||
---
|
||||
|
||||
## Alert Runbooks
|
||||
|
||||
### ALERT: [Alert Name 1 — e.g. HighErrorRate]
|
||||
|
||||
**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
|
||||
**Severity:** P1 / P2 / P3
|
||||
**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]
|
||||
|
||||
**Step 1 — Acknowledge and assess**
|
||||
```bash
|
||||
# Check current error rate
|
||||
[query or dashboard link]
|
||||
|
||||
# Check which endpoints are erroring
|
||||
[query or command]
|
||||
```
|
||||
|
||||
**Step 2 — Check recent changes**
|
||||
```bash
|
||||
# Any deploys in the last hour?
|
||||
[command or link to deployment log]
|
||||
|
||||
# Recent config changes?
|
||||
[where to check]
|
||||
```
|
||||
|
||||
**Step 3 — Check dependencies**
|
||||
```bash
|
||||
# Is the database healthy?
|
||||
[health check command or link]
|
||||
|
||||
# Is [downstream service] healthy?
|
||||
[health check command or link]
|
||||
```
|
||||
|
||||
**Step 4 — Diagnose**
|
||||
|
||||
| If you see | It means | Do this |
|
||||
|---|---|---|
|
||||
| [Error pattern 1] | [Cause] | [Action] |
|
||||
| [Error pattern 2] | [Cause] | [Action] |
|
||||
| [Error pattern 3] | [Cause] | [Action] |
|
||||
| No clear pattern | Unknown cause | Escalate to [name] |
|
||||
|
||||
**Step 5 — Fix or mitigate**
|
||||
```bash
|
||||
# If caused by bad deploy — roll back:
|
||||
[rollback command]
|
||||
|
||||
# If caused by [specific issue]:
|
||||
[fix command]
|
||||
|
||||
# If caused by upstream dependency:
|
||||
[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
|
||||
```
|
||||
|
||||
**After resolving:**
|
||||
- [ ] Confirm error rate has returned to baseline
|
||||
- [ ] Check no downstream services were affected
|
||||
- [ ] If P1: open a post-incident review — see [incident-postmortem skill]
|
||||
- [ ] Update `#incidents` with resolution summary
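
The burn-rate multiplier in the "SLO impact" line is the observed error rate divided by the error rate the SLO allows. A minimal sketch, assuming a request-based availability SLO and a 30-day window; the function names and the figures are illustrative:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if `rate` holds; inf when nothing is burning."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# Hypothetical incident: 1% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.01, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")                                   # burn rate: 10.0x
print(f"budget gone in: {hours_to_exhaustion(rate):.0f} hours")    # budget gone in: 72 hours
```

Stating the multiplier in each alert entry tells the on-call engineer how long they have before the budget is spent, which helps decide between mitigating now and escalating.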
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 2 — e.g. HighLatency]
|
||||
|
||||
**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
|
||||
**Severity:** P1 / P2 / P3
|
||||
**SLO impact:** Yes — latency SLO breach
|
||||
|
||||
**Step 1 — Assess scope**
|
||||
```bash
|
||||
# Check which endpoints are slow
|
||||
[query or dashboard — broken down by endpoint]
|
||||
|
||||
# Check if latency is across all regions or localised
|
||||
[query or command]
|
||||
```
|
||||
|
||||
**Step 2 — Common causes and fixes**
|
||||
|
||||
| Cause | Signal | Fix |
|
||||
|---|---|---|
|
||||
| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] |
|
||||
| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
|
||||
| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
|
||||
| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
|
||||
| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] |
|
||||
|
||||
**Step 3 — Escalate if unresolved in 20 minutes**
|
||||
Page [Tech lead] via PagerDuty / Slack.
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted]
|
||||
|
||||
**What it means:** [e.g. "The service has used all available database connections — new requests will fail"]
|
||||
**Severity:** P1
|
||||
**SLO impact:** Yes — will cause errors immediately
|
||||
|
||||
**Immediate mitigation:**
|
||||
```bash
|
||||
# Restart the service to flush stale connections
|
||||
[restart command]
|
||||
|
||||
# Check current connection count
|
||||
[DB connection query]
|
||||
```
|
||||
|
||||
**Diagnose root cause after stabilising:**
|
||||
```bash
|
||||
# Check for long-running queries holding connections
|
||||
[query]
|
||||
|
||||
# Check if a recent deploy changed connection pool config
|
||||
[where to check]
|
||||
```
|
||||
|
||||
**Resolution:** [e.g. "Increase pool size in config / kill long-running queries / scale the service"]
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 4 — e.g. QueueBacklogHigh / ConsumerLag]
|
||||
|
||||
**What it means:** [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"]
|
||||
**Severity:** P2
|
||||
**SLO impact:** Depends — if queue backs up, downstream systems will receive delayed data
|
||||
|
||||
**Step 1 — Check consumer health**
|
||||
```bash
|
||||
# Are consumers running?
|
||||
[command]
|
||||
|
||||
# Consumer error rate?
|
||||
[dashboard or query]
|
||||
```
|
||||
|
||||
**Step 2 — Check message contents**
|
||||
```bash
|
||||
# Are there poison messages causing retries?
|
||||
[command to inspect dead-letter queue or failed messages]
|
||||
```
|
||||
|
||||
**Step 3 — Options**
|
||||
|
||||
| If | Then |
|
||||
|---|---|
|
||||
| Consumers are down | Restart consumers: `[command]` |
|
||||
| Poison message in queue | Move to DLQ: `[command]` |
|
||||
| Consumers healthy but slow | Scale consumers: `[command]` |
|
||||
| Upstream producing too fast | Escalate to [upstream service] owner |
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Add additional alerts following the same pattern]
|
||||
|
||||
---
|
||||
|
||||
## Diagnostic Cheat Sheet
|
||||
|
||||
Common commands for quick diagnosis. Replace every placeholder with the real command so that each one can be pasted and run without modification.
|
||||
|
||||
```bash
|
||||
# Service health
|
||||
[health check command]
|
||||
|
||||
# Recent logs (last 100 lines)
|
||||
[log command]
|
||||
|
||||
# Error logs only
|
||||
[error log filter command]
|
||||
|
||||
# Current pod / instance status
|
||||
[kubectl get pods / aws ecs describe-tasks / etc.]
|
||||
|
||||
# Restart the service
|
||||
[restart command]
|
||||
|
||||
# Roll back to previous version
|
||||
[rollback command]
|
||||
|
||||
# Database connection count
|
||||
[DB query]
|
||||
|
||||
# Cache hit rate
|
||||
[cache stats command]
|
||||
|
||||
# Current request rate
|
||||
[metrics query]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Useful Dashboard Links
|
||||
|
||||
| Dashboard | URL | Use it to |
|
||||
|---|---|---|
|
||||
| Service overview | [Link] | First stop — error rate, latency, request rate |
|
||||
| Database | [Link] | Connection count, slow queries, replication lag |
|
||||
| Infrastructure | [Link] | CPU, memory, disk |
|
||||
| Queue / consumers | [Link] | Backlog depth, consumer throughput |
|
||||
| Upstream dependencies | [Link] | Dependency health at a glance |
|
||||
|
||||
---
|
||||
|
||||
## Incident Communication
|
||||
|
||||
When you declare an incident:
|
||||
|
||||
**Post to `#incidents` immediately:**
|
||||
```
|
||||
🔴 INCIDENT — [Service Name]
|
||||
Status: Investigating
|
||||
Impact: [Who is affected and how]
|
||||
Paged: [Your name]
|
||||
Next update: [Time — max 30 min from now]
|
||||
```
|
||||
|
||||
**Update every 30 minutes while active:**
|
||||
```
|
||||
🔴 UPDATE — [Service Name] — [Time]
|
||||
Status: [Investigating / Identified / Mitigating / Resolved]
|
||||
Latest: [One sentence on what you found or did]
|
||||
Next update: [Time]
|
||||
```
|
||||
|
||||
**On resolution:**
|
||||
```
|
||||
✅ RESOLVED — [Service Name] — [Time]
|
||||
Duration: [X minutes]
|
||||
Impact: [Summary of who was affected]
|
||||
Cause: [One sentence]
|
||||
Follow-up: [PIR required? Yes/No — link when created]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## On-Call Handoff
|
||||
|
||||
Use this template at the end of every on-call shift:
|
||||
|
||||
```
|
||||
--- ON-CALL HANDOFF: [Service Name] ---
|
||||
Date: [Date]
|
||||
Outgoing: [Your name]
|
||||
Incoming: [Next on-call name]
|
||||
|
||||
INCIDENTS THIS SHIFT:
|
||||
- [Incident summary — date, duration, cause, resolution, follow-up required]
|
||||
|
||||
OPEN ISSUES TO WATCH:
|
||||
- [Anything not fully resolved / trending in the wrong direction]
|
||||
|
||||
CHANGES SINCE LAST HANDOFF:
|
||||
- [Deploys, config changes, infra changes that affect on-call awareness]
|
||||
|
||||
RUNBOOK GAPS FOUND:
|
||||
- [Anything you had to figure out that isn't documented — please add it]
|
||||
|
||||
ANYTHING ELSE:
|
||||
- [Notes for incoming on-call]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every alert that pages on-call has a runbook entry — no alert is missing
|
||||
- [ ] Rollback command is accurate and tested recently
|
||||
- [ ] Escalation contacts have current phone numbers and Slack handles
|
||||
- [ ] Diagnostic commands work — they have been run by at least one person recently
|
||||
- [ ] Handoff template is used at every shift change — not just during incidents
|
||||
- [ ] "Things I had to figure out that weren't documented" are added to this runbook after every incident
|
||||
@@ -0,0 +1,277 @@
|
||||
---
|
||||
name: performance-budget
|
||||
description: "Define and document performance budgets for a web service or application. Use when asked to set performance targets, define SLOs for latency or throughput, establish Core Web Vitals targets, create a performance baseline, or document performance regression policy. Produces a structured performance budget covering key user journeys, Core Web Vitals, backend latency SLOs, measurement tooling, CI enforcement, and breach response process."
|
||||
---
|
||||
|
||||
# Performance Budget Skill
|
||||
|
||||
Produce a complete, actionable performance budget document for a web service or application. A performance budget is not a wishlist — it is a set of measurable, enforced constraints that define what "acceptable performance" means and who is responsible when those constraints are violated.
|
||||
|
||||
A good performance budget answers: what are the targets, how are they measured, what triggers an investigation, and what happens when a budget is breached.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and type** — web app, API service, mobile app, or combination
|
||||
- **Key user journeys** — the 3–5 most important flows users take (e.g. "search → product page → checkout")
|
||||
- **Current baseline metrics** — P50/P95/P99 latency, LCP, CLS, INP if available (state "no baseline" if not collected yet)
|
||||
- **Tech stack** — frontend framework, backend language/framework, CDN, database
|
||||
- **Deployment environment** — cloud provider, region(s), edge/CDN configuration
|
||||
- **Cost constraints** — any budget or infrastructure limits that affect headroom
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Performance Budget: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Last updated:** [Date] | **Owner:** [Name / role]
|
||||
**Environment:** [Production / Staging baseline] | **Review cadence:** [Quarterly / per-sprint]
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
[2–3 sentences describing the service, its user-facing performance requirements, and why performance is a priority. Reference the business impact of latency — e.g. conversion rate, user retention, SLA obligations.]
|
||||
|
||||
**Performance philosophy:** [e.g. "Performance is a feature. Every engineer is responsible for keeping the service within budget. Regressions must be caught in CI before they reach production."]
|
||||
|
||||
---
|
||||
|
||||
## Key User Journeys
|
||||
|
||||
Define the critical paths that the performance budget is designed to protect.
|
||||
|
||||
| Journey ID | Journey name | Entry point | Exit point | Criticality |
|
||||
|---|---|---|---|---|
|
||||
| UJ-1 | [e.g. New user sign-up] | [Landing page] | [Dashboard] | Critical |
|
||||
| UJ-2 | [e.g. Core workflow task] | [e.g. /app/tasks] | [e.g. Task complete] | High |
|
||||
| UJ-3 | [e.g. Search and select] | [e.g. /search] | [e.g. Detail page] | High |
|
||||
| UJ-4 | [e.g. API data fetch] | [e.g. GET /api/items] | [e.g. 200 response] | Medium |
|
||||
|
||||
---
|
||||
|
||||
## Frontend Performance Budget
|
||||
|
||||
*Complete this section for web and mobile applications. Skip for API-only services.*
|
||||
|
||||
### Core Web Vitals Targets
|
||||
|
||||
Targets apply to the 75th percentile of real user sessions (field data). Where lab measurements are used instead, they assume a mid-range Android device on a 4G connection unless otherwise stated.
|
||||
|
||||
| Metric | Description | Good | Needs Improvement | Poor | **Our Target** | Current baseline |
|
||||
|---|---|---|---|---|---|---|
|
||||
| **LCP** | Largest Contentful Paint — perceived load speed | ≤2.5s | 2.5–4.0s | >4.0s | **[≤X.Xs]** | [Xs / not measured] |
|
||||
| **INP** | Interaction to Next Paint — responsiveness | ≤200ms | 200–500ms | >500ms | **[≤Xms]** | [Xms / not measured] |
|
||||
| **CLS** | Cumulative Layout Shift — visual stability | ≤0.1 | 0.1–0.25 | >0.25 | **[≤0.X]** | [X.XX / not measured] |
|
||||
| **FCP** | First Contentful Paint | ≤1.8s | 1.8–3.0s | >3.0s | **[≤X.Xs]** | [Xs / not measured] |
|
||||
| **TTFB** | Time to First Byte | ≤800ms | 800ms–1.8s | >1.8s | **[≤Xms]** | [Xms / not measured] |
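
The Good / Needs Improvement / Poor bands in the table can be applied mechanically to a measured 75th-percentile value. A small sketch using the thresholds from the table above; the dictionary layout and function name are illustrative choices:

```python
# (good_upper_bound, needs_improvement_upper_bound) per metric, from the table above.
THRESHOLDS = {
    "LCP": (2.5, 4.0),     # seconds
    "INP": (200, 500),     # milliseconds
    "CLS": (0.1, 0.25),    # unitless
    "FCP": (1.8, 3.0),     # seconds
    "TTFB": (800, 1800),   # milliseconds
}


def rate_metric(metric: str, p75_value: float) -> str:
    """Classify a 75th-percentile measurement into its band."""
    good, needs_improvement = THRESHOLDS[metric]
    if p75_value <= good:
        return "Good"
    if p75_value <= needs_improvement:
        return "Needs Improvement"
    return "Poor"


print(rate_metric("LCP", 3.1))   # Needs Improvement
print(rate_metric("CLS", 0.05))  # Good
```

A team's own target can be stricter than the "Good" band; in that case compare against the target column as well as the band.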
|
||||
|
||||
### Page Weight Budget
|
||||
|
||||
| Asset type | Max size (compressed) | Current | Status |
|---|---|---|---|
| Total page weight | [e.g. 500KB] | [XKB / unknown] | [Within / Over / Unknown] |
| JavaScript (initial load) | [e.g. 200KB] | [XKB / unknown] | [Within / Over / Unknown] |
| CSS | [e.g. 50KB] | [XKB / unknown] | [Within / Over / Unknown] |
| Images (above fold) | [e.g. 150KB] | [XKB / unknown] | [Within / Over / Unknown] |
| Web fonts | [e.g. 50KB] | [XKB / unknown] | [Within / Over / Unknown] |
| Third-party scripts | [e.g. 100KB] | [XKB / unknown] | [Within / Over / Unknown] |
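
The Status column can be filled in mechanically once sizes are measured. The sketch below is one possible way to do that; `BUDGET_KB`, `budget_status`, and `over_budget` are hypothetical names, and the budget figures are the example values from the table, not recommendations.

```python
# Example budgets in KB (compressed), copied from the table above.
BUDGET_KB = {
    "Total page weight": 500,
    "JavaScript (initial load)": 200,
    "CSS": 50,
    "Images (above fold)": 150,
    "Web fonts": 50,
    "Third-party scripts": 100,
}


def budget_status(asset, current_kb=None):
    """Return 'Within', 'Over', or 'Unknown' for one asset type."""
    if current_kb is None:
        return "Unknown"
    return "Within" if current_kb <= BUDGET_KB[asset] else "Over"


def over_budget(measured_kb):
    """List the asset types whose measured size exceeds the budget."""
    return [a for a, kb in measured_kb.items() if budget_status(a, kb) == "Over"]


if __name__ == "__main__":
    measured = {"CSS": 60, "Web fonts": 40, "JavaScript (initial load)": 250}
    print(over_budget(measured))
```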
|
||||
|
||||
### Per-Journey Frontend Targets
|
||||
|
||||
| Journey | LCP | INP | CLS | FCP | TTFB |
|---|---|---|---|---|---|
| UJ-1: [Journey name] | [≤Xs] | [≤Xms] | [≤0.X] | [≤Xs] | [≤Xms] |
| UJ-2: [Journey name] | [≤Xs] | [≤Xms] | [≤0.X] | [≤Xs] | [≤Xms] |
| UJ-3: [Journey name] | [≤Xs] | [≤Xms] | [≤0.X] | [≤Xs] | [≤Xms] |
|
||||
|
||||
---
|
||||
|
||||
## Backend Performance Budget
|
||||
|
||||
### API Latency SLOs
|
||||
|
||||
Targets are measured at the service boundary and exclude client-side network latency.
|
||||
|
||||
| Endpoint / operation | Method | P50 | P95 | P99 | Max (hard limit) | Error rate |
|---|---|---|---|---|---|---|
| [e.g. /api/auth/login] | POST | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
| [e.g. /api/items] | GET | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
| [e.g. /api/items/:id] | GET | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
| [e.g. /api/items] | POST | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
| [e.g. Background job: sync] | n/a | [≤Xs] | [≤Xs] | [≤Xs] | [≤Xs] | [<X%] |
|
||||
|
||||
**Overall service SLOs:**
|
||||
|
||||
| SLO | Target | Measurement window |
|---|---|---|
| Availability | [99.X%] | 30-day rolling |
| P95 latency (all endpoints) | [≤Xms] | 30-day rolling |
| Error rate (5xx) | [<X%] | 30-day rolling |
| Throughput (sustained) | [≥X req/s] | Peak hour |
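
An availability target translates directly into an allowed-downtime figure for the measurement window, which is often easier to reason about than a percentage. The sketch below shows the arithmetic; the function names and the 99.9% figure are illustrative assumptions, not values from this template. For a 99.9% target over 30 days, the window is 43,200 minutes and the allowance is about 43.2 minutes.

```python
def allowed_downtime_minutes(availability_pct, window_days=30):
    """Minutes of downtime permitted by an availability target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


def remaining_budget_minutes(availability_pct, downtime_so_far, window_days=30):
    """Downtime still available in the window; negative means the SLO is breached."""
    return allowed_downtime_minutes(availability_pct, window_days) - downtime_so_far


if __name__ == "__main__":
    # Hypothetical target of 99.9% with 30 minutes already used this window.
    print(round(allowed_downtime_minutes(99.9), 1))
    print(round(remaining_budget_minutes(99.9, 30), 1))
```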
|
||||
|
||||
### Database Query Budget
|
||||
|
||||
| Query / operation | P50 | P95 | Max | Notes |
|---|---|---|---|---|
| [e.g. User lookup by ID] | [≤Xms] | [≤Xms] | [≤Xms] | Index on `user_id` |
| [e.g. List items for user] | [≤Xms] | [≤Xms] | [≤Xms] | Paginated, max 100 rows |
| [e.g. Full-text search] | [≤Xms] | [≤Xms] | [≤Xms] | Elasticsearch / pg_trgm |
|
||||
|
||||
---
|
||||
|
||||
## Measurement Methodology
|
||||
|
||||
### Real User Monitoring (RUM)
|
||||
|
||||
**Tool:** [e.g. Google CrUX, SpeedCurve, Datadog RUM, Sentry Performance, custom]
|
||||
**Data source:** [Field data from real users / Lab data from synthetic tests / Both]
|
||||
**Sample rate:** [X% of sessions]
|
||||
**How to access:** [Dashboard URL or tool access instructions]
|
||||
|
||||
**What is measured:**
|
||||
- [ ] Core Web Vitals (LCP, INP, CLS) per page and journey
|
||||
- [ ] Custom performance marks for business-critical interactions
|
||||
- [ ] Resource timing for key assets
|
||||
- [ ] Long tasks (>50ms on main thread)
|
||||
|
||||
### Synthetic Monitoring
|
||||
|
||||
**Tool:** [e.g. Lighthouse CI, WebPageTest, k6, Artillery, Playwright with performance assertions]
|
||||
**Frequency:** [Every X minutes / on every deploy / nightly]
|
||||
**Test location(s):** [e.g. eu-west-1, us-east-1]
|
||||
**Device profile:** [Desktop 10Mbps / Mobile 4G Moto G4 / both]
|
||||
|
||||
**Synthetic test suite location:** [Link to test files]
|
||||
|
||||
### Backend Observability
|
||||
|
||||
**APM tool:** [e.g. Datadog, Grafana + Prometheus, New Relic, AWS X-Ray]
|
||||
**Metrics collected:**
|
||||
- Request rate, error rate, duration (RED metrics) per endpoint
|
||||
- Database query duration and connection pool utilisation
|
||||
- Cache hit/miss rates
|
||||
- Background job queue depth and processing latency
|
||||
|
||||
**Dashboard:** [Link to primary performance dashboard]
|
||||
|
||||
---
|
||||
|
||||
## CI/CD Performance Enforcement
|
||||
|
||||
Performance budgets are enforced at two gates:
|
||||
|
||||
### Gate 1 — Build-time Bundle Analysis
|
||||
|
||||
**Tool:** [e.g. bundlesize, size-limit, webpack-bundle-analyzer with CI assertion]
|
||||
**Config file:** [e.g. `.bundlesizerc` / `.size-limit.js`]
|
||||
**Trigger:** Every PR targeting `main`
|
||||
**Blocking:** Yes — PR cannot merge if bundle size budget is exceeded
|
||||
|
||||
```js
// Example .size-limit.js
module.exports = [
  {
    // Initial-load JavaScript bundles
    path: "dist/js/*.js",
    limit: "200 KB"
  },
  {
    // Stylesheets
    path: "dist/css/*.css",
    limit: "50 KB"
  }
];
```
|
||||
|
||||
### Gate 2 — Synthetic Performance Tests in CI
|
||||
|
||||
**Tool:** [e.g. Lighthouse CI, k6, Artillery]
|
||||
**Trigger:** On deploy to staging
|
||||
**Blocking:** Yes — production deploy is blocked if thresholds fail
|
||||
**Thresholds checked:**
|
||||
- LCP ≤ [Xs]
|
||||
- CLS ≤ [0.X]
|
||||
- P95 API latency ≤ [Xms]
|
||||
- Error rate < [X%]
|
||||
|
||||
**CI config location:** [e.g. `.github/workflows/perf.yml` / `ci/performance.yaml`]
|
||||
|
||||
**How to run locally:**
|
||||
```bash
|
||||
# Run Lighthouse CI against local build
|
||||
[command — e.g. lhci autorun --config=lighthouserc.js]
|
||||
|
||||
# Run load test locally
|
||||
[command — e.g. k6 run load-tests/api-smoke.js]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Budget Breach Response Process
|
||||
|
||||
A budget breach occurs when a measured metric exceeds its target for [X consecutive measurements / X minutes sustained / a single deploy].
|
||||
|
||||
### Breach Severity Levels
|
||||
|
||||
| Severity | Condition | Response time | Who acts |
|---|---|---|---|
| P1 (Critical) | >2× budget threshold in production | Immediate | On-call engineer + team lead |
| P2 (High) | >1.5× and ≤2× budget threshold in production | Within 4 hours | On-call engineer |
| P3 (Medium) | Threshold exceeded by ≤1.5× in production | Within 1 sprint | PR author + team |
| P4 (Low) | Threshold exceeded in staging only | Before merge | PR author |
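
The severity levels above can be expressed as a small classifier. The following is a minimal sketch, assuming lower-is-better metrics and a positive budget; `breach_severity` and `is_sustained_breach` are hypothetical names, and the default of three consecutive measurements is a placeholder for the value the team chooses in the breach definition.

```python
def breach_severity(measured, budget, environment="production"):
    """Return 'P1'..'P4' for a breach, or None when the metric is within budget."""
    if measured <= budget:
        return None
    if environment != "production":
        return "P4"  # exceeded in staging only
    ratio = measured / budget
    if ratio > 2.0:
        return "P1"
    if ratio > 1.5:
        return "P2"
    return "P3"


def is_sustained_breach(values, budget, consecutive=3):
    """True when the most recent `consecutive` measurements all exceed the budget."""
    recent = values[-consecutive:]
    return len(recent) == consecutive and all(v > budget for v in recent)


if __name__ == "__main__":
    # Hypothetical LCP readings in seconds against a 2.5s budget.
    readings = [2.0, 3.0, 3.1, 2.9]
    if is_sustained_breach(readings, 2.5):
        print(breach_severity(readings[-1], 2.5))
```

Checking for a sustained breach before classifying helps filter out single-sample measurement anomalies.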
|
||||
|
||||
### Breach Investigation Checklist
|
||||
|
||||
When a breach is detected, work through this checklist in order:
|
||||
|
||||
**1. Identify the regression commit**
|
||||
```bash
|
||||
# Compare performance across recent deploys
|
||||
[command, e.g. a Datadog metrics query, a Lighthouse CI server build comparison, or git bisect]
|
||||
```
|
||||
|
||||
**2. Classify the breach**
|
||||
- [ ] Is this a code change? (new feature, refactor, dependency bump)
|
||||
- [ ] Is this an infrastructure change? (new instance type, config change)
|
||||
- [ ] Is this an external factor? (CDN issue, DNS, upstream dependency)
|
||||
- [ ] Is this a measurement anomaly? (test environment issue, sample size)
|
||||
|
||||
**3. Immediate actions**
|
||||
- If P1/P2 in production and a code cause is confirmed: roll back or disable the feature flag
|
||||
- If cause is unknown: do not roll back immediately — gather more data first
|
||||
- Notify [#performance / #incidents Slack channel] with: metric name, current value, budget target, suspected cause
|
||||
|
||||
**4. Resolution**
|
||||
- Fix the root cause — do not just adjust the budget threshold
|
||||
- Budget thresholds should only change after a team discussion and explicit approval from [tech lead / EM]
|
||||
- Document the breach in the [performance log / incident record]
|
||||
|
||||
**Budget change policy:** Budget thresholds may only be relaxed if (a) the feature that caused the regression has measurable business value that outweighs the performance cost, and (b) the change is reviewed and approved by [tech lead / EM].
|
||||
|
||||
---
|
||||
|
||||
## Performance Review Cadence
|
||||
|
||||
| Trigger | Action |
|---|---|
| Every sprint | Review P95/P99 latency trends; flag any creeping degradation |
| Every quarter | Full performance budget review: update baselines, adjust targets, audit tooling |
| After major feature launch | Re-measure all Core Web Vitals and API SLOs; update baselines |
| After infrastructure change | Re-run full synthetic test suite; confirm no regression |
| After dependency upgrade | Run bundle size diff; confirm no unexpected size increase |
|
||||
|
||||
**Next scheduled review:** [Date]
|
||||
**Review owner:** [Name / role]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every budget threshold is a specific number — not a range or "TBD"
|
||||
- [ ] Both frontend (if applicable) and backend targets are defined — not just one or the other
|
||||
- [ ] Measurement tooling is named with a link to the dashboard or config file
|
||||
- [ ] CI enforcement is configured for at least one gate (build-time or deploy-time)
|
||||
- [ ] Budget breach response process names specific Slack channels and owners
|
||||
- [ ] Budget thresholds are anchored to baseline measurements or a justified target — not pulled from thin air
|
||||
- [ ] Per-journey targets are defined for critical user journeys, not just global averages
|
||||
@@ -0,0 +1,399 @@
|
||||
---
|
||||
name: rfc-writer
|
||||
description: "Write an engineering RFC (Request for Comments) for a technical decision, architectural change, or significant implementation approach. Use when asked to write an RFC, document a technical proposal, create a design doc, write an architecture decision for review, or produce a technical specification for team feedback. Produces a complete RFC document covering problem statement, motivation, proposed solution, alternatives rejected, implementation plan, migration plan, security and performance implications, observability changes, rollout plan, and open questions."
|
||||
---
|
||||
|
||||
# RFC Writer Skill
|
||||
|
||||
Produce a complete engineering RFC (Request for Comments) for a technical decision or architectural change. An RFC is a structured proposal document — not a persuasion document. Its purpose is to expose a decision to scrutiny, surface trade-offs, document alternatives considered, and create a permanent record of why a choice was made.
|
||||
|
||||
A good RFC makes it possible for someone who wasn't in the room to understand years later why the team built something the way they did.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **RFC title and author** — what this RFC is about and who is proposing it
|
||||
- **Problem being solved** — what is broken, missing, or inadequate today; why action is needed now
|
||||
- **Proposed solution** — the approach the author is recommending, at least at a high level
|
||||
- **Context and constraints** — team size, existing architecture, timeline pressures, budget limits, compliance requirements
|
||||
- **Alternatives considered** — at least 2 alternative approaches the author has thought about
|
||||
- **Current status** — is this pre-decision (seeking feedback) or post-decision (documenting a made decision)?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# RFC [Number]: [Title]
|
||||
|
||||
**Author:** [Name] | **Team:** [Team name]
|
||||
**Created:** [Date] | **Last updated:** [Date]
|
||||
**Status:** Draft | In Review | Approved | Rejected | Superseded by RFC-[X]
|
||||
**Ticket:** [JIRA-XXX] | **Slack thread:** [#channel link]
|
||||
**Review deadline:** [Date — when comments should be submitted by]
|
||||
|
||||
---
|
||||
|
||||
## Abstract
|
||||
|
||||
[2–4 sentences summarising the entire RFC. Should stand alone — someone reading only this should understand what is being proposed, why, and what the main trade-off is. Write this last.]
|
||||
|
||||
---
|
||||
|
||||
## 1. Problem Statement
|
||||
|
||||
[Describe the problem being solved. Focus on the *problem*, not the solution. Be specific and quantified where possible.]
|
||||
|
||||
**Current state:**
|
||||
[Describe how things work today — the existing system, process, or architecture. Include any relevant constraints or limitations.]
|
||||
|
||||
**Why this is a problem now:**
|
||||
[Why is this being addressed now rather than earlier or later? Reference metrics, incidents, product requirements, or scaling thresholds that make this urgent or timely.]
|
||||
|
||||
**Example of the problem in practice:**
|
||||
[A concrete scenario or incident that illustrates the problem. This helps reviewers understand the real-world impact, not just the abstract description.]
|
||||
|
||||
```
|
||||
// Example: current behaviour that illustrates the problem
|
||||
[code snippet, log output, or sequence description showing the problem]
|
||||
```
|
||||
|
||||
**Impact of not solving this:**
|
||||
- [Impact 1 — e.g. "New tenant onboarding requires 3 hours of manual configuration per account"]
|
||||
- [Impact 2 — e.g. "Auth service handles 400 req/s; projected to hit capacity within 8 weeks at current growth"]
|
||||
- [Impact 3 — e.g. "Current approach is incompatible with the upcoming multi-region requirement"]
|
||||
|
||||
---
|
||||
|
||||
## 2. Goals and Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- [ ] [Specific, measurable outcome — e.g. "Reduce tenant onboarding time from 3 hours to <5 minutes"]
|
||||
- [ ] [e.g. "Support 2,000 req/s on the auth service with P99 latency ≤50ms"]
|
||||
- [ ] [e.g. "Enable multi-region deployment without changes to the application layer"]
|
||||
|
||||
**Non-goals:** *(what this RFC explicitly does not address)*
|
||||
- [e.g. "This RFC does not address authentication for internal service-to-service calls — see RFC-042"]
|
||||
- [e.g. "Performance improvements to the existing system — this RFC replaces it"]
|
||||
- [e.g. "Migration of historical data — covered in a follow-on RFC"]
|
||||
|
||||
**Success metrics:**
|
||||
| Metric | Current | Target | Measurement method |
|---|---|---|---|
| [e.g. Onboarding time] | [3 hours] | [<5 minutes] | [Prometheus histogram on onboarding job duration] |
| [e.g. Auth latency P99] | [120ms] | [≤50ms] | [Datadog APM] |
| [e.g. Engineer setup time] | [4 hours] | [<30 minutes] | [Onboarding survey] |
|
||||
|
||||
---
|
||||
|
||||
## 3. Background and Motivation
|
||||
|
||||
[Provide the context a reviewer needs to evaluate the proposal. This is not a repeat of the problem statement — it is the surrounding technical and business context.]
|
||||
|
||||
**Existing system overview:**
|
||||
[Describe the relevant parts of the current architecture. Include an ASCII diagram if the relationships between components help understanding.]
|
||||
|
||||
```
|
||||
[ASCII diagram of current architecture — optional but strongly recommended for architectural RFCs]
|
||||
|
||||
┌──────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ Client │────▶│ [Service A] │────▶│ [Service B] │
|
||||
└──────────┘ └──────────────┘ └──────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────┐
|
||||
│ [Database] │
|
||||
└──────────────┘
|
||||
```
|
||||
|
||||
**Prior work and related decisions:**
|
||||
- [RFC-XXX: Title — relevant previous decision; link]
|
||||
- [ADR-XXX: Title — architectural decision record]
|
||||
- [Any external standards, blog posts, or vendor documentation that informs this proposal]
|
||||
|
||||
**Constraints:**
|
||||
- [e.g. Must remain backward compatible with v1 API clients for 12 months]
|
||||
- [e.g. Team has no Rust expertise — solution must be in Python or Go]
|
||||
- [e.g. Must be deployable without a maintenance window]
|
||||
|
||||
---
|
||||
|
||||
## 4. Proposed Solution
|
||||
|
||||
[Describe the proposed approach clearly and specifically. Include enough detail that an engineer could begin implementing from this document, but don't write the code — that is for the PR.]
|
||||
|
||||
### 4.1 High-Level Approach
|
||||
|
||||
[1–3 paragraphs describing the overall solution. Explain the key idea and why it solves the problem.]
|
||||
|
||||
### 4.2 Architecture
|
||||
|
||||
```
|
||||
[ASCII diagram of the proposed architecture — what the system looks like after this RFC is implemented]
|
||||
|
||||
┌──────────┐ ┌──────────────────┐ ┌──────────────┐
|
||||
│ Client │────▶│ [New Component] │────▶│ [Service B] │
|
||||
└──────────┘ └──────────────────┘ └──────────────┘
|
||||
│ │
|
||||
▼ ▼
|
||||
┌──────────────┐ ┌──────────────┐
|
||||
│ [Store A] │ │ [Store B] │
|
||||
└──────────────┘ └──────────────┘
|
||||
```
|
||||
|
||||
### 4.3 Detailed Design
|
||||
|
||||
[Break the solution into its key components or decisions. For each, explain what it does and why it was designed this way.]
|
||||
|
||||
**Component / Decision 1: [Name]**
|
||||
|
||||
[Description of this component — what it does, how it works, why this approach was chosen.]
|
||||
|
||||
```
|
||||
// Example interface, API contract, or pseudocode (not implementation code)
|
||||
[Relevant schema, API definition, data flow, or pseudocode]
|
||||
```
|
||||
|
||||
**Component / Decision 2: [Name]**
|
||||
|
||||
[Description]
|
||||
|
||||
**Component / Decision 3: [Name]**
|
||||
|
||||
[Description]
|
||||
|
||||
### 4.4 API Changes
|
||||
|
||||
*Complete this section if the RFC introduces or modifies any API endpoints, events, or interfaces.*
|
||||
|
||||
**New endpoints / events:**
|
||||
```
|
||||
[HTTP method + path or event name]
|
||||
Request: { ... }
|
||||
Response: { ... }
|
||||
```
|
||||
|
||||
**Modified endpoints:**
|
||||
- `[endpoint]`: [what changes and why; backward compatibility note]
|
||||
|
||||
**Deprecated endpoints:**
|
||||
- `[endpoint]`: deprecated in favour of `[new endpoint]` — removal timeline: [date/version]
|
||||
|
||||
### 4.5 Data Model Changes
|
||||
|
||||
*Complete this section if any database schema or data structure changes are required.*
|
||||
|
||||
[Describe schema changes at a high level. Reference the database-migration-plan skill for detailed migration steps.]
|
||||
|
||||
```sql
|
||||
-- Key schema changes (abbreviated — full migration in [link])
|
||||
[DDL statements for key additions/changes]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Alternatives Considered
|
||||
|
||||
*Every alternative must include an explicit reason why it was rejected. "We went with the proposed solution" is not a reason.*
|
||||
|
||||
### Alternative 1: [Name]
|
||||
|
||||
**Description:**
|
||||
[What this alternative would involve.]
|
||||
|
||||
**Pros:**
|
||||
- [Pro 1]
|
||||
- [Pro 2]
|
||||
|
||||
**Cons:**
|
||||
- [Con 1]
|
||||
- [Con 2]
|
||||
|
||||
**Why rejected:**
|
||||
[Specific reason — e.g. "Requires 3× the infrastructure cost", "Incompatible with multi-region requirement", "Team has no expertise in this technology and the ramp-up would miss the Q3 deadline"]
|
||||
|
||||
---
|
||||
|
||||
### Alternative 2: [Name]
|
||||
|
||||
**Description:**
|
||||
[What this alternative would involve.]
|
||||
|
||||
**Pros:**
|
||||
- [Pro 1]
|
||||
- [Pro 2]
|
||||
|
||||
**Cons:**
|
||||
- [Con 1]
|
||||
- [Con 2]
|
||||
|
||||
**Why rejected:**
|
||||
[Specific reason]
|
||||
|
||||
---
|
||||
|
||||
### Alternative 3: Do nothing / defer
|
||||
|
||||
**Description:**
|
||||
Accept the current state and revisit the problem in [timeframe].
|
||||
|
||||
**Why rejected:**
|
||||
[Why deferring is not acceptable — reference the impact of not solving this from Section 1.]
|
||||
|
||||
---
|
||||
|
||||
## 6. Implementation Plan
|
||||
|
||||
**Estimated effort:** [X engineer-weeks] | **Target completion:** [Date / Quarter]
|
||||
**Team:** [Who is building this — names or roles]
|
||||
|
||||
| Phase | Description | Duration | Dependencies | Owner |
|---|---|---|---|---|
| 1 | [e.g. Core implementation: new component built and tested] | [X weeks] | [None] | [Name] |
| 2 | [e.g. Integration: connect new component to existing services] | [X weeks] | [Phase 1 complete] | [Name] |
| 3 | [e.g. Rollout: canary deploy, then full rollout] | [X weeks] | [Phase 2 + staging validated] | [Name] |
| 4 | [e.g. Cleanup: deprecate old system, remove feature flags] | [X weeks] | [Phase 3 stable for X weeks] | [Name] |
|
||||
|
||||
**Key milestones:**
|
||||
- [ ] [Date]: [Milestone — e.g. "Core implementation complete and code-reviewed"]
|
||||
- [ ] [Date]: [Milestone — e.g. "Staging environment validation complete"]
|
||||
- [ ] [Date]: [Milestone — e.g. "10% canary traffic without regression"]
|
||||
- [ ] [Date]: [Milestone — e.g. "Full rollout complete"]
|
||||
- [ ] [Date]: [Milestone — e.g. "Old system decommissioned"]
|
||||
|
||||
---
|
||||
|
||||
## 7. Migration Plan
|
||||
|
||||
*Complete this section if the RFC requires migrating existing users, data, or API consumers.*
|
||||
|
||||
**Migration strategy:** [Big-bang / Phased / Parallel-run / Opt-in]
|
||||
|
||||
**Who is affected:**
|
||||
- [e.g. All existing API v1 consumers — requires updated client libraries]
|
||||
- [e.g. X million rows in the `orders` table require backfilling]
|
||||
|
||||
**Migration steps:**
|
||||
1. [Step 1 — describe action, who does it, estimated duration]
|
||||
2. [Step 2]
|
||||
3. [Step 3]
|
||||
|
||||
**Backward compatibility window:** [How long will the old system/API remain available?]
|
||||
|
||||
**Communication plan:**
|
||||
- [Who needs to be notified, when, and how — e.g. "API consumers will receive a deprecation notice 3 months before the old endpoint is removed"]
|
||||
|
||||
---
|
||||
|
||||
## 8. Security Implications
|
||||
|
||||
[Describe the security impact of this change. If there are no security implications, state that explicitly with reasoning — do not leave this section blank.]
|
||||
|
||||
| Concern | Impact | Mitigation |
|---|---|---|
| [e.g. New API endpoint exposed to internet] | [e.g. New attack surface] | [e.g. Rate limiting, auth required, WAF rules] |
| [e.g. New data stored: user PII] | [e.g. GDPR scope expanded] | [e.g. Encrypted at rest, access log, data retention policy] |
| [e.g. Service-to-service communication] | [e.g. Token forgery risk] | [e.g. mTLS between services] |
|
||||
|
||||
**Has a threat model been produced or updated?** [Yes — link / No — required before implementation / Not required — reason]
|
||||
|
||||
---
|
||||
|
||||
## 9. Performance Implications
|
||||
|
||||
[Describe the expected performance impact. Include projections for the new system and how it was estimated.]
|
||||
|
||||
| Metric | Current | Projected | Measurement method |
|---|---|---|---|
| [e.g. P99 latency for /api/auth] | [120ms] | [≤50ms] | [Load test results (link)] |
| [e.g. Database query count per request] | [12] | [3] | [Query logging in staging] |
| [e.g. Memory per instance] | [512MB] | [768MB] | [Profiling (link)] |
| [e.g. Infrastructure cost] | [$X/month] | [$Y/month] | [AWS cost calculator estimate] |
|
||||
|
||||
**Load testing:** [Has load testing been done? Link to results. If not, when will it be done?]
|
||||
|
||||
**Performance risks:**
|
||||
- [Risk 1 — e.g. "New component adds a network hop that may increase tail latency under congestion — needs validation at 2× peak load"]
|
||||
|
||||
---
|
||||
|
||||
## 10. Observability Changes
|
||||
|
||||
*Describe what new or changed metrics, logs, traces, and alerts this RFC introduces.*
|
||||
|
||||
**New metrics:**
|
||||
| Metric name | Type | Description | Alert threshold |
|---|---|---|---|
| `[service].[component].[metric]` | [counter/gauge/histogram] | [What it measures] | [e.g. P99 > 100ms for 5 min] |
|
||||
|
||||
**New log events:**
|
||||
| Event | Level | When emitted | Key fields |
|---|---|---|---|
| `[event.name]` | INFO | [When] | `user_id`, `duration_ms`, `result` |
|
||||
|
||||
**Distributed tracing:** [Are spans added for new components? Which operations are instrumented?]
|
||||
|
||||
**Dashboard changes:** [New dashboard / updated existing dashboard — link]
|
||||
|
||||
---
|
||||
|
||||
## 11. Rollout Plan
|
||||
|
||||
**Rollout strategy:** [Feature flag / Canary / Blue-green / Gradual traffic shift / Full deploy]
|
||||
|
||||
| Stage | Traffic % | Duration | Success criteria | Rollback trigger |
|---|---|---|---|---|
| Internal testing | 0% (dogfood) | [X days] | [No errors in internal usage] | Any error |
| Canary | 1% | [X hours] | [Error rate <0.1%; P99 latency within budget] | Error rate >0.5% |
| Limited rollout | 10% | [X days] | [As above + business metrics stable] | Error rate >0.2% |
| Full rollout | 100% | n/a | [All success metrics from Section 2 met] | Any SLO breach |
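
The stage table can be read as a simple decision rule per stage. The sketch below is one hedged interpretation: the `Stage` class, the `decide` function, and the intermediate "hold" outcome are assumptions introduced for illustration, and the thresholds are the example values from the Canary and Limited rollout rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    success_error_rate_pct: float   # success requires an error rate below this
    rollback_error_rate_pct: float  # rollback triggers above this


CANARY = Stage("Canary", 1, 0.1, 0.5)
LIMITED = Stage("Limited rollout", 10, 0.1, 0.2)


def decide(stage, error_rate_pct, p99_within_budget):
    """Return 'rollback', 'advance', or 'hold' for the observed stage metrics."""
    if error_rate_pct > stage.rollback_error_rate_pct:
        return "rollback"
    if error_rate_pct < stage.success_error_rate_pct and p99_within_budget:
        return "advance"
    # Neither a rollback trigger nor the success criteria: keep observing.
    return "hold"


if __name__ == "__main__":
    print(decide(CANARY, 0.05, True))
    print(decide(LIMITED, 0.3, True))
```

Writing the gap between the success criterion and the rollback trigger as an explicit "hold" state makes it clear who decides what happens when metrics are neither clearly good nor clearly bad.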
|
||||
|
||||
**Feature flag:** [Name of feature flag, if applicable] — managed in [LaunchDarkly / Unleash / config]
|
||||
|
||||
**Rollback procedure:**
|
||||
```
|
||||
// How to roll back if the rollout needs to be reversed
|
||||
1. [Step 1 — e.g. Toggle feature flag to off]
|
||||
2. [Step 2 — e.g. Deploy previous version]
|
||||
3. [Step 3 — e.g. Notify stakeholders]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 12. Open Questions
|
||||
|
||||
[List any unresolved questions, design decisions not yet made, or areas where the author is specifically seeking feedback. Assign an owner and a resolution deadline for each.]
|
||||
|
||||
| # | Question | Owner | Deadline | Resolution |
|---|---|---|---|---|
| 1 | [e.g. Should we use optimistic or pessimistic locking for concurrent updates to [resource]?] | [Name] | [Date] | [Pending / [Answer]] |
| 2 | [e.g. What is the retention policy for [new data type]?] | [Name] | [Date] | [Pending / [Answer]] |
| 3 | [e.g. Do we need a read replica for this query pattern at launch, or can we defer it?] | [Name] | [Date] | [Pending / [Answer]] |
|
||||
|
||||
---
|
||||
|
||||
## 13. Decision
|
||||
|
||||
*To be filled in after the review period closes.*
|
||||
|
||||
**Decision:** [Approved / Rejected / Approved with modifications]
|
||||
**Decision date:** [Date]
|
||||
**Decision makers:** [Names]
|
||||
|
||||
**Summary of key feedback addressed:**
|
||||
- [Feedback item and how it was resolved]
|
||||
|
||||
**Conditions of approval (if any):**
|
||||
- [e.g. Must complete load testing before Phase 2 begins]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] The problem statement is specific and quantified — not "the current system is slow" but "P99 latency is 800ms; budget is 200ms"
|
||||
- [ ] Goals section includes measurable success metrics, not aspirational statements
|
||||
- [ ] Every alternative has an explicit rejection reason — not just a list of cons
|
||||
- [ ] Security implications section is completed, not left blank
|
||||
- [ ] Performance implications include projected numbers, not just "should be better"
|
||||
- [ ] Open questions are assigned to named owners with deadlines — not floating
|
||||
- [ ] The RFC is written to be read by someone who was not in the planning conversations
|
||||
- [ ] Migration plan addresses all affected parties — users, API consumers, data — not just the technical steps
|
||||
@@ -0,0 +1,253 @@
|
||||
---
|
||||
name: security-threat-model
|
||||
description: "Write a STRIDE-based threat model for a service or feature. Use when asked to produce a threat model, document security risks, identify attack vectors, assess a service's security posture, or prepare for a security design review. Produces a structured threat model covering assets, trust boundaries, STRIDE threat enumeration per component, risk scores, mitigation controls, and residual risk sign-off."
|
||||
---
|
||||
|
||||
# Security Threat Model Skill
|
||||
|
||||
Produce a complete STRIDE-based threat model for a service or feature. A threat model is not a list of things that could go wrong — it is a structured analysis of attackers, assets, boundaries, and controls that lets an engineering team make informed, documented security decisions.
|
||||
|
||||
A good threat model is specific enough that a new engineer can understand what is being protected, why each control exists, and what risk the team has accepted.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and description** — what the service does, who uses it
|
||||
- **Architecture overview** — components, dependencies, data flows (a diagram description or ASCII diagram is fine)
|
||||
- **Deployment environment** — cloud provider, VPC/network topology, where it runs (Kubernetes, ECS, VMs, serverless)
|
||||
- **Data sensitivity** — what data does this service handle? PII, payment data, credentials, internal-only?
|
||||
- **Existing controls** — authentication method, encryption in transit/at rest, current WAF/firewall, existing security scanning
|
||||
- **Trust levels** — who are the principals? (anonymous public, authenticated users, internal services, admins)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Security Threat Model: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Reviewed by:** [Security lead / peer]
|
||||
**Date:** [Date] | **Next review:** [Date — recommend 6 months or after major architecture change]
|
||||
**Classification:** [Internal / Confidential]
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview
|
||||
|
||||
[2–3 sentences describing the service, its role in the system, and the scope of this threat model. State what is in scope and what is explicitly out of scope.]
|
||||
|
||||
**In scope:**
|
||||
- [Component or data flow]
|
||||
- [Component or data flow]
|
||||
|
||||
**Out of scope:**
|
||||
- [e.g. Third-party payment processor internals]
|
||||
- [e.g. Corporate network / end-user devices]
|
||||
|
||||
---
|
||||
|
||||
## 2. Asset Register
|
||||
|
||||
Assets are the things worth protecting — data, capabilities, and reputational value.
|
||||
|
||||
| Asset | Description | Sensitivity | Owner |
|---|---|---|---|
| [e.g. User PII] | Names, email addresses, profile data | High (GDPR-regulated) | [Team] |
| [e.g. API credentials] | Service-to-service auth tokens | Critical | [Team] |
| [e.g. Session tokens] | User authentication state | High | [Team] |
| [e.g. Audit logs] | Record of user and admin actions | Medium | [Team] |
| [e.g. Service availability] | Uptime of the [X] endpoint | Medium | [Team] |
|
||||
|
||||
**Data classification key:**
|
||||
- **Critical** — Credential material; exposure enables direct system compromise
|
||||
- **High** — PII, financial data, health data; regulated or high reputational impact
|
||||
- **Medium** — Internal configuration, non-sensitive business data
|
||||
- **Low** — Public information, anonymised data
|
||||
|
||||
---
|
||||
|
||||
## 3. Trust Boundaries and Architecture
|
||||
|
||||
Trust boundaries are the lines that separate zones with different trust levels. Threats often occur when data or requests cross a boundary.
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ INTERNET (Untrusted) │
|
||||
│ │
|
||||
│ [Public User] [Bot / Attacker] │
|
||||
└──────────────────────────────┬──────────────────────────────────┘
|
||||
│ HTTPS
|
||||
─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
|
||||
Trust Boundary: Public → DMZ
|
||||
─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ DMZ / Edge Layer │
|
||||
│ ┌────────────┐ ┌──────────────┐ │
|
||||
│ │ WAF / CDN │────▶│ API Gateway │ │
|
||||
│ └────────────┘ └──────┬───────┘ │
|
||||
└──────────────────────────────┼───────────────────────────────────┘
|
||||
─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
|
||||
Trust Boundary: Edge → Application VPC
|
||||
─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ Application VPC (Private) │
|
||||
│ ┌──────────────┐ ┌────────────┐ ┌──────────────────┐ │
|
||||
│ │ [Service A] │────▶│ [Service B]│────▶│ [Database] │ │
|
||||
│ └──────────────┘ └────────────┘ └──────────────────┘ │
|
||||
│ ▲ │
|
||||
│ │ │
|
||||
│ ┌──────────────┐ │ │
|
||||
│ │ Admin (IAM) │─────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Trust Boundaries identified:**
|
||||
|
||||
| Boundary | From | To | Auth mechanism | Encrypted |
|
||||
|---|---|---|---|---|
|
||||
| TB-1 | Public internet | API Gateway | [JWT / OAuth / API key] | TLS 1.2+ |
|
||||
| TB-2 | API Gateway | Service A | [mTLS / internal JWT / IAM role] | [Yes/No] |
|
||||
| TB-3 | Service A | Database | [Connection string + IAM / username+password] | [Yes/No] |
|
||||
| TB-4 | Admin | Service B | [IAM role / VPN + MFA] | TLS |
|
||||
|
||||
---
|
||||
|
||||
## 4. STRIDE Threat Analysis
|
||||
|
||||
STRIDE is a threat classification framework. For each significant component, enumerate threats in each category.
|
||||
|
||||
**STRIDE key:**
|
||||
- **S** — Spoofing: Impersonating another user, service, or system
|
||||
- **T** — Tampering: Modifying data or code without authorisation
|
||||
- **R** — Repudiation: Denying an action occurred; insufficient audit trail
|
||||
- **I** — Information Disclosure: Exposing data to unauthorised parties
|
||||
- **D** — Denial of Service: Making the service unavailable
|
||||
- **E** — Elevation of Privilege: Gaining capabilities beyond what is authorised
|
||||
|
||||
### Component: [API Gateway / Auth Layer]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-001 | S | Attacker forges a JWT token to authenticate as another user | Weak signing key or algorithm confusion (alg:none) | [e.g. RS256 with key rotation / none] |
|
||||
| T-002 | S | Attacker replays a stolen session token | Theft via XSS or network sniff | [e.g. Token expiry + refresh rotation] |
|
||||
| T-003 | T | Attacker modifies request headers to bypass tenant isolation | Missing validation of tenant ID header | [e.g. Server-side tenant resolution / none] |
|
||||
| T-004 | R | No audit trail for admin authentication events | Logging not configured for auth failures | [e.g. CloudTrail enabled / none] |
|
||||
| T-005 | I | Auth error messages reveal whether an email exists | Verbose error responses | [e.g. Normalised error responses / none] |
|
||||
| T-006 | D | Credential stuffing exhausts rate limits and blocks legitimate users | Automated login attempts | [e.g. Rate limiting per IP + CAPTCHA / none] |
|
||||
| T-007 | E | Compromised low-privilege token used to call admin endpoint | Missing role check on admin routes | [e.g. RBAC middleware on all routes / none] |
|
||||
|
||||
### Component: [Application Service / Business Logic]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-008 | T | SQL/NoSQL injection via unsanitised user input | Unparameterised queries | [e.g. ORM with parameterised queries / none] |
|
||||
| T-009 | T | Mass assignment — attacker sets fields they should not (e.g. `isAdmin: true`) | API accepts extra fields without allowlist | [e.g. Input validation / none] |
|
||||
| T-010 | I | Insecure direct object reference — user accesses another user's resource | Missing ownership check on resource ID | [e.g. Ownership middleware / none] |
|
||||
| T-011 | I | Sensitive data in application logs (PII, tokens) | Over-logging in debug mode | [e.g. Log scrubbing / none] |
|
||||
| T-012 | D | Unprotected expensive endpoint triggers large DB scan | No pagination or query cost limit | [e.g. Pagination enforced / none] |
|
||||
| T-013 | R | Business-critical state changes not logged | No audit event on [operation] | [e.g. Audit log table / none] |
|
||||
|
||||
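As an illustration of the allowlist control named for T-009, here is a minimal sketch. The endpoint, the field names, and `ALLOWED_PROFILE_FIELDS` are hypothetical and stand in for whatever the real service exposes.

```python
# Hypothetical allowlist for a profile-update endpoint (names are illustrative only).
ALLOWED_PROFILE_FIELDS = frozenset({"display_name", "email", "timezone"})


def filter_update(payload: dict) -> dict:
    """Return only the fields a client is permitted to set.

    Anything outside the allowlist (for example `isAdmin`) is dropped
    instead of being passed through to the persistence layer.
    """
    return {key: value for key, value in payload.items() if key in ALLOWED_PROFILE_FIELDS}


def rejected_fields(payload: dict) -> list:
    """List the refused field names, sorted, so they can be logged or reported."""
    return sorted(set(payload) - ALLOWED_PROFILE_FIELDS)
```

Whether refused fields are silently dropped or cause the request to fail is a design choice; rejecting the request tends to surface probing attempts earlier.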
### Component: [Database]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-014 | I | Database exposed to internet (misconfigured security group) | Direct connection from outside VPC | [e.g. No public IP, security group restricts to app subnet] |
|
||||
| T-015 | I | Backup snapshots not encrypted or accessible to wrong accounts | Unencrypted snapshot, public S3 | [e.g. Encrypted snapshots, private S3 bucket] |
|
||||
| T-016 | E | Privilege escalation via DB account with excessive permissions | App uses a superuser DB account | [e.g. Least-privilege DB role per service / none] |
|
||||
| T-017 | D | Runaway query or bulk delete causes data loss or outage | No query timeout or soft-delete | [e.g. Statement timeout, soft-delete on critical tables / none] |
|
||||
|
||||
### Component: [Internal Service-to-Service Communication]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-018 | S | Rogue internal service impersonates a trusted service | No mutual authentication between services | [e.g. mTLS / service mesh / none] |
|
||||
| T-019 | I | Internal traffic sniffed on shared network | Unencrypted service-to-service calls | [e.g. Service mesh with TLS / none] |
|
||||
| T-020 | E | Compromised internal service calls privileged endpoints | No scoping on internal tokens | [e.g. Scoped service tokens / none] |
|
||||
|
||||
---
|
||||
|
||||
## 5. Risk Register
|
||||
|
||||
Score each threat: **Likelihood (1–5)** × **Impact (1–5)** = **Risk Score (1–25)**
|
||||
|
||||
Priority bands: Critical (20–25) | High (12–19) | Medium (6–11) | Low (1–5)
|
||||
|
||||
| ID | Threat summary | Likelihood | Impact | Score | Priority | Status |
|
||||
|---|---|---|---|---|---|---|
|
||||
| T-001 | JWT forgery — auth bypass | 2 | 5 | 10 | Medium | [Open / Mitigated / Accepted] |
|
||||
| T-002 | Session token replay | 3 | 4 | 12 | High | [Open / Mitigated / Accepted] |
|
||||
| T-007 | Privilege escalation via missing role check | 3 | 5 | 15 | High | [Open / Mitigated / Accepted] |
|
||||
| T-008 | SQL injection | 2 | 5 | 10 | Medium | [Open / Mitigated / Accepted] |
|
||||
| T-010 | IDOR — cross-user data access | 3 | 4 | 12 | High | [Open / Mitigated / Accepted] |
|
||||
| T-014 | Database exposed to internet | 1 | 5 | 5 | Low | [Open / Mitigated / Accepted] |
|
||||
| T-018 | Rogue internal service impersonation | 2 | 4 | 8 | Medium | [Open / Mitigated / Accepted] |
|
||||
|
||||
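The scoring rule and priority bands above can be sketched as a small helper, which is handy for checking that the Score and Priority columns agree. The function names are illustrative and not part of the template.

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Multiply likelihood by impact; both must be on the 1-5 scale."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must each be between 1 and 5")
    return likelihood * impact


def priority_band(score: int) -> str:
    """Map a 1-25 risk score to the priority bands used in the register."""
    if score >= 20:
        return "Critical"
    if score >= 12:
        return "High"
    if score >= 6:
        return "Medium"
    return "Low"
```

For example, T-007 (likelihood 3, impact 5) scores 15 and lands in the High band, matching the row above.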
---
|
||||
|
||||
## 6. Mitigations Table
|
||||
|
||||
For every Open threat with priority Medium or above, define a specific mitigation.
|
||||
|
||||
| ID | Threat | Mitigation | Owner | Target date | Ticket |
|
||||
|---|---|---|---|---|---|
|
||||
| T-002 | Session token replay | Implement token rotation on refresh — invalidate old token server-side immediately | [Engineer name] | [Date] | [JIRA-123] |
|
||||
| T-007 | Privilege escalation | Add RBAC middleware to all `/admin/*` routes; write integration test for role boundary | [Engineer name] | [Date] | [JIRA-124] |
|
||||
| T-010 | IDOR | Add ownership assertion to all resource-fetching service methods; add to code review checklist | [Engineer name] | [Date] | [JIRA-125] |
|
||||
| T-011 | PII in logs | Audit logging calls for PII fields; add scrubbing to logger middleware | [Engineer name] | [Date] | [JIRA-126] |
|
||||
| T-018 | Rogue service impersonation | Enable mTLS via service mesh or issue scoped service tokens per service | [Engineer name] | [Date] | [JIRA-127] |
|
||||
|
||||
---
|
||||
|
||||
## 7. Accepted Risks
|
||||
|
||||
Accepted risks are threats the team has decided not to mitigate right now. Every accepted risk must have a named owner and a review date.
|
||||
|
||||
| ID | Threat | Reason for acceptance | Risk owner | Review date |
|
||||
|---|---|---|---|---|
|
||||
| T-014 | Database public exposure | Database has no public IP assigned; control already in place — accepted as low likelihood | [Name] | [Date] |
|
||||
| [ID] | [Threat] | [Reason — e.g. "Effort exceeds risk at current scale; re-evaluate at 10× traffic"] | [Name] | [Date] |
|
||||
|
||||
---
|
||||
|
||||
## 8. Security Controls Summary
|
||||
|
||||
| Control | Type | Covers threats | Implemented |
|
||||
|---|---|---|---|
|
||||
| JWT RS256 with 15-min expiry | Preventive | T-001, T-002 | [Yes / Partial / No] |
|
||||
| RBAC middleware on all routes | Preventive | T-007, T-020 | [Yes / Partial / No] |
|
||||
| Parameterised queries (ORM) | Preventive | T-008 | [Yes / Partial / No] |
|
||||
| Rate limiting (100 req/min per IP) | Preventive | T-006, T-012 | [Yes / Partial / No] |
|
||||
| CloudTrail / audit logging | Detective | T-004, T-013 | [Yes / Partial / No] |
|
||||
| Automated SAST in CI pipeline | Detective | T-008, T-009 | [Yes / Partial / No] |
|
||||
| Encrypted backups + private S3 | Preventive | T-015 | [Yes / Partial / No] |
|
||||
| Least-privilege DB role | Preventive | T-016 | [Yes / Partial / No] |
|
||||
| Incident response runbook | Corrective | All | [Yes / Partial / No] |
|
||||
|
||||
---
|
||||
|
||||
## 9. Review Cadence
|
||||
|
||||
| Trigger | Action |
|
||||
|---|---|
|
||||
| Every 6 months | Full threat model review — update risk scores, close mitigated items |
|
||||
| Major architecture change | Update trust boundary diagram and re-run STRIDE for new components |
|
||||
| Security incident | Review relevant threats; add any newly discovered vectors |
|
||||
| New data classification | Add assets to register; assess whether new STRIDE categories apply |
|
||||
| Third-party dependency added | Assess supply chain threats for the new dependency |
|
||||
|
||||
**Next scheduled review:** [Date]
|
||||
**Review owner:** [Name / Security lead]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every trust boundary is named and its authentication mechanism is specified — not left as "TBD"
|
||||
- [ ] Every Critical and High risk in the risk register has a mitigation with a named owner and a target date
|
||||
- [ ] Every accepted risk has a named risk owner and a review date — no unowned accepted risks
|
||||
- [ ] The asset register includes data sensitivity levels and at least one entry for credential material
|
||||
- [ ] STRIDE analysis covers all major components — not just the API layer
|
||||
- [ ] Mitigation actions are specific enough to become a ticket (not "improve security")
|
||||
- [ ] The ASCII trust boundary diagram matches the architecture description provided
|
||||
@@ -0,0 +1,292 @@
|
||||
---
|
||||
name: service-catalog-entry
|
||||
description: "Write a service catalog entry for a microservice or internal platform service — covering service identity, purpose, architecture context, SLAs, API contract summary, data classification, dependencies, operational runbooks, and known limitations. Use when asked to document a service for an internal developer portal, write a service README for a platform catalog, create a service overview page, or onboard a new service to a service registry. Produces a complete service catalog entry suitable for an internal developer portal or wiki."
|
||||
---
|
||||
|
||||
# Service Catalog Entry Skill
|
||||
|
||||
Produce a complete service catalog entry for a microservice or internal platform service — giving any engineer at the company the context they need to understand what the service does, how to depend on it, what its reliability characteristics are, and where to go when something goes wrong. A well-written catalog entry eliminates "who owns this?" and "is this safe to use?" questions that slow down teams depending on shared services.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** — the canonical identifier used in code, monitoring, and deployments
|
||||
- **Team and owner** — team name, tech lead name, and on-call contact
|
||||
- **Architecture overview** — what the service does, what calls it, and what it calls
|
||||
- **SLA requirements** — availability target, latency SLO, support tier, and maintenance window
|
||||
- **Key APIs** — the most important endpoints other teams use (method, path, brief description)
|
||||
- **Data handled** — what data the service stores or processes, sensitivity classification, retention
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Service Catalog: [Service Name]
|
||||
|
||||
> **[One sentence — what this service does for consumers, in plain language]**
|
||||
>
|
||||
> *e.g. "The Payments Service processes charge, refund, and subscription billing events for all Acme products."*
|
||||
|
||||
---
|
||||
|
||||
## Identity
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| **Service name** | `[service-name]` |
|
||||
| **Canonical repository** | [https://github.com/[org]/[repo]] |
|
||||
| **Owner team** | [Team name] |
|
||||
| **Tech lead** | [Name] ([Slack: @handle]) |
|
||||
| **On-call rotation** | [PagerDuty service link] |
|
||||
| **Slack channel** | `#[team-channel]` |
|
||||
| **Support tier** | [Tier 1 — 24/7 / Tier 2 — business hours / Tier 3 — best effort] |
|
||||
| **Status** | [Active / Deprecated / Sunset date: YYYY-MM-DD] |
|
||||
| **Language / runtime** | [e.g. Go 1.22 / Python 3.12 / Node 20] |
|
||||
| **Deployment platform** | [Kubernetes / ECS / Lambda / etc.] |
|
||||
| **Environments** | [Production: URL] / [Staging: URL] / [Dev: URL] |
|
||||
|
||||
---
|
||||
|
||||
## What It Does
|
||||
|
||||
[Two to three paragraphs in plain language — no jargon or acronyms without explanation.]
|
||||
|
||||
[Paragraph 1: The business problem this service solves. What would break or be missing if this service did not exist?]
|
||||
|
||||
[Paragraph 2: How it works at a high level — the main processing model (e.g. request/response API, event-driven consumer, batch processor), what triggers it, and what it produces.]
|
||||
|
||||
[Paragraph 3: What this service is NOT responsible for — the explicit boundaries. This prevents other teams from building incorrect assumptions about scope.]
|
||||
|
||||
---
|
||||
|
||||
## Architecture Context
|
||||
|
||||
### System Diagram
|
||||
|
||||
```
[Upstream callers]         [This Service]    [Downstream dependencies]

[Web App] ────────────→                      ──→ [Primary Database — PostgreSQL]
[Mobile API] ─────────→    [Service Name]    ──→ [Cache — Redis]
[Partner API] ────────→   (Port 8080/gRPC)   ──→ [Message Queue — Kafka/SQS]
                                             ──→ [External Service / API]
                                 ↓ emits events to
                         [Event Bus / SNS]
                                 ↓ consumed by
                       [Downstream Service A]
                       [Downstream Service B]
```
|
||||
|
||||
### Who Depends on This Service
|
||||
|
||||
| Caller | How they use it | Contact |
|
||||
|---|---|---|
|
||||
| [Service / Team A] | [e.g. "Calls POST /charges to initiate payments"] | [Slack: #team-a] |
|
||||
| [Service / Team B] | [e.g. "Subscribes to payment.completed events via Kafka topic"] | [Slack: #team-b] |
|
||||
| [Service / Team C] | [e.g. "Calls GET /subscriptions for billing status"] | [Slack: #team-c] |
|
||||
|
||||
### What This Service Depends On
|
||||
|
||||
| Dependency | Type | Criticality | Their on-call |
|
||||
|---|---|---|---|
|
||||
| [PostgreSQL instance] | Database | Critical — all writes fail without it | [DBA team: #db-oncall] |
|
||||
| [Redis cluster] | Cache | High — latency degrades without it | [Infra team: #infra-oncall] |
|
||||
| [Kafka cluster] | Message queue | High — async event delivery stalls without it | [Infra team: #infra-oncall] |
|
||||
| [Stripe API] | External API | Critical — payment processing fails | [vendor status: status.stripe.com] |
|
||||
| [Auth Service] | Internal service | Critical — all authenticated requests fail without it | [Auth team: #auth-oncall] |
|
||||
|
||||
---
|
||||
|
||||
## Service Level Agreement
|
||||
|
||||
### Availability and Latency
|
||||
|
||||
| SLO | Target | Measurement window | Error budget |
|
||||
|---|---|---|---|
|
||||
| Availability | [99.9%] | Rolling 30 days | [43 min/month] |
|
||||
| p50 latency (key endpoints) | < [50] ms | Rolling 24 hours | — |
|
||||
| p99 latency (key endpoints) | < [500] ms | Rolling 24 hours | — |
|
||||
| p99.9 latency (key endpoints) | < [2000] ms | Rolling 24 hours | — |
|
||||
| Error rate | < [0.1]% | Rolling 1 hour | — |
|
||||
|
||||
**SLO dashboard:** [Link to monitoring dashboard]
|
||||
**Current error budget remaining:** [Link to SLO dashboard or inline value]
|
||||
|
||||
### Support Tiers
|
||||
|
||||
| Severity | Scope | Response time | Resolution time |
|
||||
|---|---|---|---|
|
||||
| P1 — Service down | All authenticated requests failing | 15 minutes | 1 hour |
|
||||
| P2 — Significant degradation | Error rate >1% or p99 >2× SLO | 30 minutes | 4 hours |
|
||||
| P3 — Minor issues | Non-critical endpoints degraded | Next business day | 3 business days |
|
||||
| Feature requests / bugs | Via standard ticket process | [Ticket SLA] | Per roadmap |
|
||||
|
||||
**To raise an incident:** Page via [PagerDuty service link] or post in `#incidents`.
|
||||
**To raise a feature request or bug:** File a ticket in [JIRA project / GitHub repo Issues].
|
||||
|
||||
### Maintenance Windows
|
||||
|
||||
- **Planned downtime:** [e.g. "Sundays 02:00–04:00 UTC — advance notice posted to #[team-channel] 48h before"]
|
||||
- **Deployment window:** [e.g. "Weekdays 10:00–16:00 UTC — no deploys on Fridays or the day before a public holiday"]
|
||||
- **Breaking changes notice:** [e.g. "Minimum 30 days notice for breaking API changes — see versioning policy below"]
|
||||
|
||||
---
|
||||
|
||||
## API Contract
|
||||
|
||||
### Authentication
|
||||
|
||||
All API calls require: [e.g. "Bearer token via Authorization header. Tokens are issued by the Auth Service (`/api/v1/token`)"]
|
||||
|
||||
```
Authorization: Bearer [jwt-token]
Content-Type: application/json
```
|
||||
|
||||
### Base URL
|
||||
|
||||
| Environment | Base URL |
|
||||
|---|---|
|
||||
| Production | `https://[service-name].internal.[company].com` |
|
||||
| Staging | `https://[service-name].staging.[company].com` |
|
||||
| Local development | `http://localhost:[port]` |
|
||||
|
||||
### Key Endpoints
|
||||
|
||||
| Method | Path | Description | Auth required | Rate limit |
|
||||
|---|---|---|---|---|
|
||||
| `GET` | `/health` | Liveness and readiness check | No | None |
|
||||
| `GET` | `/api/v1/[resource]` | [Description — e.g. "List resources for the authenticated user"] | Yes | [100 req/min] |
|
||||
| `GET` | `/api/v1/[resource]/:id` | [Description — e.g. "Get a single resource by ID"] | Yes | [500 req/min] |
|
||||
| `POST` | `/api/v1/[resource]` | [Description — e.g. "Create a new resource"] | Yes | [50 req/min] |
|
||||
| `PUT` | `/api/v1/[resource]/:id` | [Description — e.g. "Update an existing resource"] | Yes | [50 req/min] |
|
||||
| `DELETE` | `/api/v1/[resource]/:id` | [Description] | Yes | [20 req/min] |
|
||||
|
||||
**Full API documentation:** [OpenAPI/Swagger spec URL] | [Postman collection URL]
|
||||
|
||||
### Versioning Policy
|
||||
|
||||
- API version is in the URL path (`/api/v1/`, `/api/v2/`)
|
||||
- Minor additions (new optional fields, new endpoints) are non-breaking — no version bump
|
||||
- Breaking changes (removed fields, changed types, authentication changes) require a new major version
|
||||
- Deprecated versions are supported for [90 days] after the successor reaches GA
|
||||
- Deprecation notices are posted to `#[team-channel]` and emailed to registered consumers
|
||||
|
||||
### Error Response Format
|
||||
|
||||
```json
{
  "error": {
    "code": "[ERROR_CODE]",
    "message": "[Human-readable description]",
    "request_id": "[UUID — include in support tickets]",
    "details": {}
  }
}
```
|
||||
|
||||
Common error codes:
|
||||
|
||||
| HTTP status | Error code | Meaning |
|
||||
|---|---|---|
|
||||
| 400 | `INVALID_REQUEST` | Request body or parameters fail validation |
|
||||
| 401 | `UNAUTHENTICATED` | Missing or invalid auth token |
|
||||
| 403 | `FORBIDDEN` | Token valid but lacks permission for this resource |
|
||||
| 404 | `NOT_FOUND` | Resource does not exist |
|
||||
| 409 | `CONFLICT` | Duplicate resource or state conflict |
|
||||
| 422 | `UNPROCESSABLE_ENTITY` | Request is valid but violates business rules |
|
||||
| 429 | `RATE_LIMITED` | Too many requests — back off and retry |
|
||||
| 500 | `INTERNAL_ERROR` | Unexpected server error — include request_id in support ticket |
|
||||
| 503 | `SERVICE_UNAVAILABLE` | Downstream dependency unavailable — retry with backoff |
|
||||
|
||||
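The table marks 429 and 503 as the codes a consumer should retry with backoff. A minimal client-side sketch of that rule follows; the helper names, the 0.5 s base delay, and the 30 s cap are assumptions for illustration, not values from this service.

```python
# Status codes the error table above describes as safe to retry.
RETRYABLE_STATUSES = frozenset({429, 503})


def is_retryable(status: int) -> bool:
    """True only for the codes documented as 'back off and retry'."""
    return status in RETRYABLE_STATUSES


def backoff_delays(max_attempts: int, base_seconds: float = 0.5, cap_seconds: float = 30.0) -> list:
    """Exponential backoff schedule: base, 2x base, 4x base, and so on, capped.

    Jitter is left out to keep the schedule deterministic for this sketch.
    """
    return [min(cap_seconds, base_seconds * (2 ** attempt)) for attempt in range(max_attempts)]
```

A production client would usually add random jitter to each delay so that many consumers do not retry in lockstep.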
### Events Published (if event-driven)
|
||||
|
||||
| Event | Topic / Queue | Schema | Published when |
|
||||
|---|---|---|---|
|
||||
| `[resource].created` | `[kafka-topic / sns-arn]` | [Schema URL] | [When a new resource is created] |
|
||||
| `[resource].updated` | `[kafka-topic / sns-arn]` | [Schema URL] | [When a resource is modified] |
|
||||
| `[resource].deleted` | `[kafka-topic / sns-arn]` | [Schema URL] | [When a resource is deleted] |
|
||||
|
||||
---
|
||||
|
||||
## Data Classification
|
||||
|
||||
| Data element | Sensitivity | Stored in | Retention | Encrypted at rest |
|
||||
|---|---|---|---|---|
|
||||
| [User PII — e.g. email, name] | [PII / Restricted] | [PostgreSQL `users` table] | [Until account deletion] | Yes |
|
||||
| [Financial data — e.g. card last 4] | [PCI / Highly restricted] | [PostgreSQL `payment_methods` table] | [7 years per regulations] | Yes — field-level encryption |
|
||||
| [Operational logs] | [Internal] | [CloudWatch / Datadog] | [90 days] | Yes (at rest, not searched) |
|
||||
| [Anonymised analytics] | [Public] | [Data warehouse] | [Indefinite] | Yes |
|
||||
|
||||
**Data residency:** [e.g. "All data stored in us-east-1. EU customer data stored in eu-west-1 per GDPR requirements."]
|
||||
**Compliance scope:** [e.g. SOC 2 Type II / PCI DSS Level 2 / HIPAA / GDPR]
|
||||
**Data access policy:** [e.g. "Production database access requires [approval process]. Access logged and reviewed quarterly."]
|
||||
|
||||
---
|
||||
|
||||
## Operational Runbooks
|
||||
|
||||
| Runbook | Location | Use when |
|
||||
|---|---|---|
|
||||
| On-call runbook | [Wiki / GitHub link] | Responding to PagerDuty alerts |
|
||||
| Deployment runbook | [Wiki / GitHub link] | Deploying a new version to production |
|
||||
| Database migration runbook | [Wiki / GitHub link] | Running schema migrations |
|
||||
| Rollback runbook | [Wiki / GitHub link] | Rolling back a bad deploy |
|
||||
| Incident response runbook | [Wiki / GitHub link] | Declaring and managing incidents |
|
||||
| Disaster recovery plan | [Wiki / GitHub link] | Zone/region failure or data loss |
|
||||
|
||||
**Monitoring dashboards:**
|
||||
|
||||
| Dashboard | Link | Use it for |
|
||||
|---|---|---|
|
||||
| Service overview | [Datadog / Grafana link] | Error rate, latency, throughput |
|
||||
| Infrastructure | [Link] | CPU, memory, pod health |
|
||||
| Database | [Link] | Query performance, connection pool |
|
||||
| SLO / error budget | [Link] | Budget burn rate, availability |
|
||||
| Dependency health | [Link] | Downstream dependency status |
|
||||
|
||||
---
|
||||
|
||||
## Known Limitations
|
||||
|
||||
Document limitations honestly — this section prevents other teams from building on incorrect assumptions.
|
||||
|
||||
| Limitation | Impact | Workaround | Planned fix |
|
||||
|---|---|---|---|
|
||||
| [e.g. No bulk write API — items must be created one at a time] | [Slow for large imports — N HTTP calls required] | [Use the batch import CLI tool for >100 items] | [Bulk API in Q3 — ticket: [URL]] |
|
||||
| [e.g. List endpoints have a maximum page size of 100] | [Cannot retrieve more than 100 items in a single call] | [Paginate using `cursor` parameter] | [No current plan to increase — by design] |
|
||||
| [e.g. Rate limits are per-token, not per-service] | [Consumers sharing a token also share its limit, so one high-traffic consumer can exhaust it for the others] | [Request a dedicated service-account token] | [Per-service rate limits on the roadmap] |
|
||||
| [e.g. Eventual consistency on read-after-write for list endpoints] | [Record may not appear in list immediately after creation (<500ms lag)] | [Use GET /:id to confirm creation; do not rely on list for immediate consistency] | [Read-your-writes consistency available via `?consistent=true` — in progress] |
|
||||
|
||||
---
|
||||
|
||||
## Getting Started
|
||||
|
||||
**To start using this service:**
|
||||
|
||||
1. Request access: [Link to access request form or instructions]
|
||||
2. Get your service account credentials: [Link to process]
|
||||
3. Read the API docs: [OpenAPI spec URL]
|
||||
4. Try the sandbox environment: `https://[service-name].sandbox.[company].com`
|
||||
5. Join the consumer Slack channel: `#[service-name]-consumers`
|
||||
|
||||
**Client libraries (if available):**
|
||||
|
||||
| Language | Package | Installation |
|
||||
|---|---|---|
|
||||
| [Python] | [`[package-name]`] | `pip install [package-name]` |
|
||||
| [Go] | [`github.com/[org]/[package]`] | `go get github.com/[org]/[package]` |
|
||||
| [TypeScript/JS] | [`@[org]/[package]`] | `npm install @[org]/[package]` |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] "What It Does" is written without jargon — a new engineer from another team can understand it in under 2 minutes
|
||||
- [ ] SLO targets are specific numbers agreed with stakeholders — not aspirational or copied from a template
|
||||
- [ ] All direct upstream consumers are listed in the "Who Depends on This Service" table — no omissions
|
||||
- [ ] API error codes are accurate and tested — not aspirational documentation
|
||||
- [ ] Known limitations are honest — nothing is glossed over to make the service look better than it is
|
||||
- [ ] All runbook links are live — not broken references or TODO placeholders
|
||||
- [ ] Data classification includes retention period and encryption status — not just sensitivity level
|
||||
- [ ] The entry has been reviewed by at least one consumer team to confirm it matches their experience of the service
|
||||
@@ -0,0 +1,231 @@
|
||||
---
|
||||
name: slo-error-budget
|
||||
description: "Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence."
|
||||
---
|
||||
|
||||
# SLO and Error Budget Skill
|
||||
|
||||
Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
|
||||
|
||||
A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and brief description of what it does
|
||||
- **Primary users** — who depends on this service and how
|
||||
- **User-facing interactions** to protect — e.g. API calls, page loads, transactions
|
||||
- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available)
|
||||
- **Existing on-call setup** — who responds to alerts?
|
||||
- **Deployment frequency** — how often does the team ship?
|
||||
- **Any existing SLAs** with customers — these constrain SLO targets
|
||||
|
||||
## Key Definitions
|
||||
|
||||
Always establish these before writing the SLO:
|
||||
|
||||
| Term | Definition |
|
||||
|---|---|
|
||||
| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
|
||||
| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
|
||||
| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
|
||||
| **Error budget** | The allowed headroom below 100% — the budget for planned and unplanned downtime |
|
||||
| **Burn rate** | How fast the error budget is being consumed |
|
||||
|
||||
---
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# SLO Document: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Owner:** [Name / role] | **Approved by:** [Name]
|
||||
**Effective date:** [Date] | **Review date:** [Date + 3 months]
|
||||
**Version:** [1.0]
|
||||
|
||||
---
|
||||
|
||||
## Why This SLO Exists
|
||||
|
||||
[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
|
||||
|
||||
---
|
||||
|
||||
## Service Overview
|
||||
|
||||
**What this service does:** [One sentence]
|
||||
**Who depends on it:** [Internal teams / external customers / both — describe]
|
||||
**Critical user journeys protected by this SLO:**
|
||||
1. [Journey 1 — e.g. "User completes a payment"]
|
||||
2. [Journey 2]
|
||||
3. [Journey 3]
|
||||
|
||||
---
|
||||
|
||||
## SLIs — What We Measure
|
||||
|
||||
Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
|
||||
|
||||
### SLI 1: [Name — e.g. Request Success Rate]
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| **What it measures** | [e.g. "% of API requests that return a non-5xx response within 500ms"] |
| **Good event definition** | [e.g. "HTTP response with any non-5xx status, completed within 500ms"] |
| **Bad event definition** | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
|
||||
| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
|
||||
| **Measured over** | Rolling 28-day window |
|
||||
| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
|
||||
|
||||
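To make the good/bad event split concrete, here is a minimal sketch of how an SLI of this shape could be computed from request records. The `Request` type, the 500 ms threshold, and the `/health` exclusion are assumptions taken from the examples in the table, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

LATENCY_THRESHOLD_MS = 500                 # assumed threshold from the example above
EXCLUDED_PATHS = frozenset({"/health"})    # assumed exclusion: health checks


@dataclass(frozen=True)
class Request:
    status: int
    latency_ms: float
    path: str = "/"


def is_good(request: Request) -> bool:
    """Good event: non-5xx response completed within the latency threshold."""
    return request.status < 500 and request.latency_ms <= LATENCY_THRESHOLD_MS


def sli_percent(requests: Iterable[Request]) -> Optional[float]:
    """Good events as a percentage of eligible events; None if nothing is eligible."""
    eligible = [r for r in requests if r.path not in EXCLUDED_PATHS]
    if not eligible:
        return None
    good = sum(1 for r in eligible if is_good(r))
    return 100.0 * good / len(eligible)
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when there was no traffic.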
### SLI 2: [Name — e.g. Latency]
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| **What it measures** | [e.g. "% of requests to the /checkout endpoint that complete within 500ms"] |
| **Good event definition** | [e.g. "Request completes in ≤500ms"] |
| **Bad event definition** | [e.g. "Request takes >500ms"] |
|
||||
| **Measurement source** | [Source] |
|
||||
| **Measured over** | Rolling 28-day window |
|
||||
| **Exclusions** | [Any exclusions] |
|
||||
|
||||
### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]
|
||||
|
||||
[Same structure]
|
||||
|
||||
---
|
||||
|
||||
## SLO Targets
|
||||
|
||||
| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
|
||||
|
||||
**How targets were set:**
|
||||
- Historical baseline (last 90 days): [X]%
|
||||
- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
|
||||
- Rationale: [1–2 sentences]
|
||||
|
||||
**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive — it discourages feature development and doesn't reflect user reality]
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Calculation
|
||||
|
||||
**For SLI 1 ([Name]), at [X]% target:**
|
||||
|
||||
```
Error budget = (100% - SLO target) × measurement window
             = (100% - [X]%) × 28 days × 24 hours × 60 minutes
             = [Y]% × [Z total minutes]
             = [N] minutes of allowed failure per 28-day window
```
|
||||
|
||||
**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
|
||||
|
||||
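The calculation above is simple enough to script. The following sketch assumes a time-based budget expressed in minutes; the function name is illustrative.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 28) -> float:
    """Minutes of allowed failure for a given SLO target over the window."""
    if not 0 < slo_target_percent < 100:
        raise ValueError("SLO target must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_target_percent) / 100.0 * window_minutes
```

For a 99.9% target this works out to roughly 40.3 minutes per 28-day window, and roughly 43.2 minutes over a 30-day window.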
---
|
||||
|
||||
## Burn Rate Alerts
|
||||
|
||||
Burn rate = how fast the error budget is being consumed relative to the budget window.
|
||||
A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.
|
||||
|
||||
| Alert | Burn rate | Window | Severity | Response |
|
||||
|---|---|---|---|---|
|
||||
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately — budget exhausted in <2 hours |
|
||||
| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days |
|
||||
| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting |
|
||||
| Info | >1× | 28 days | Info | Log only; budget is on track to exhaust before the window ends |
|
||||
|
||||
**Alert implementation:** [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
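
The burn-rate definition above can be sketched in a few lines. This is an illustration under stated assumptions, not an alert configuration for any monitoring tool: the helper names are hypothetical, and the burn is assumed to hold steady. Under that definition, a steady burn rate of r uses up a full budget in 28 / r days.

```python
def burn_rate(bad_fraction: float, slo_target_pct: float) -> float:
    """Observed bad-event fraction divided by the fraction the SLO allows."""
    allowed_fraction = (100 - slo_target_pct) / 100
    if allowed_fraction <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return bad_fraction / allowed_fraction


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def budget_consumed_pct(rate: float, alert_window_hours: float,
                        window_days: int = 28) -> float:
    """Share of the whole budget used while burning at `rate` for the alert window."""
    return rate * alert_window_hours / (window_days * 24) * 100


if __name__ == "__main__":
    for rate, hours in ((14, 1), (6, 6), (3, 72)):
        print(f"{rate}x: budget lasts {days_to_exhaustion(rate):.1f} days, "
              f"{budget_consumed_pct(rate, hours):.1f}% used in {hours} h")
```

Pairing each burn rate with the share of budget consumed in its alert window is a useful sanity check when choosing thresholds: a short window with a high rate catches fast burns, a long window with a low rate catches slow ones.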
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Policy
|
||||
|
||||
This policy defines what to do with the error budget — both when it's healthy and when it's burning.
|
||||
|
||||
### When budget is healthy (>50% remaining)
|
||||
|
||||
- Feature development and deployments proceed at normal pace
|
||||
- The team may take on riskier experiments
|
||||
- Reliability improvements are scheduled but not urgent
|
||||
|
||||
### When budget is at risk (25–50% remaining)
|
||||
|
||||
- Deployment frequency reduced — team ships only well-tested changes
|
||||
- One reliability improvement added to current sprint
|
||||
- Weekly error budget review added to team standup
|
||||
|
||||
### When budget is nearly exhausted (<25% remaining)
|
||||
|
||||
- Feature work paused in favour of reliability improvements
|
||||
- No new deployments without explicit on-call approval
|
||||
- Daily review of error budget burn rate
|
||||
- CSM / support notified to manage customer expectations
|
||||
|
||||
### When budget is exhausted (0% remaining — SLO breached)
|
||||
|
||||
- All feature work stops
|
||||
- On-call engineer and engineering manager notified immediately
|
||||
- Post-incident review (PIR) required within 5 business days
|
||||
- SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed
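
The four policy tiers above can be expressed as a small lookup, which is handy for a dashboard label or a weekly summary. This is a sketch only; the function name and the handling of the exact 25% and 50% boundaries are assumptions, since the policy text does not say which tier a boundary value belongs to.

```python
def budget_policy_tier(remaining_pct: float) -> str:
    """Map the remaining error budget (percent) to a policy tier."""
    if remaining_pct <= 0:
        return "exhausted"         # SLO breached: all feature work stops
    if remaining_pct < 25:
        return "nearly exhausted"  # feature work paused, daily review
    if remaining_pct <= 50:
        return "at risk"           # assumption: 25% and 50% count as at risk
    return "healthy"               # normal pace


if __name__ == "__main__":
    for remaining in (80, 50, 30, 10, 0):
        print(f"{remaining}% remaining: {budget_policy_tier(remaining)}")
```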
|
||||
|
||||
---
|
||||
|
||||
## Dashboard and Reporting
|
||||
|
||||
**SLO dashboard:** [Link to Datadog / Grafana / etc. dashboard]
|
||||
|
||||
**Metrics exposed:**
|
||||
- Current SLO compliance (rolling 28-day)
|
||||
- Error budget remaining (% and minutes)
|
||||
- Burn rate (current and trend)
|
||||
- Incident count and MTTR this window
|
||||
|
||||
**Reporting cadence:**
|
||||
|
||||
| Audience | Frequency | Format |
|
||||
|---|---|---|
|
||||
| Engineering team | Weekly | Slack summary — #[service]-slo |
|
||||
| Engineering manager | Monthly | SLO review meeting |
|
||||
| Stakeholders / customers | Quarterly | SLO compliance summary |
|
||||
|
||||
---
|
||||
|
||||
## Exclusions and Edge Cases
|
||||
|
||||
**Planned maintenance:** Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
|
||||
|
||||
**Dependency failures:** If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
|
||||
|
||||
**Force majeure:** [Policy for cloud provider outages, major infrastructure events]
|
||||
|
||||
---
|
||||
|
||||
## SLO Review Cadence
|
||||
|
||||
| Review | When | Who | Output |
|
||||
|---|---|---|---|
|
||||
| Error budget review | Weekly | Team | Budget health check — adjust if burning fast |
|
||||
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
|
||||
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? |
|
||||
|
||||
**When to change the SLO target:**
|
||||
- Historical baseline has improved significantly and target no longer reflects real reliability
|
||||
- User feedback indicates the target is misaligned with what users actually experience
|
||||
- The SLO is being gamed (metric is healthy but users are unhappy)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] SLIs are user-facing — they measure what users experience, not internal system metrics
|
||||
- [ ] Good and bad events are precisely defined — no ambiguity about what counts
|
||||
- [ ] Targets are based on historical data, not aspirational round numbers
|
||||
- [ ] Error budget policy has clear triggers and clear actions — not "discuss as a team"
|
||||
- [ ] Burn rate alerts have different windows to catch both fast burns and slow burns
|
||||
- [ ] Exclusions are documented so they don't silently inflate the SLO number
|
||||
@@ -0,0 +1,263 @@
|
||||
---
|
||||
name: sprint-velocity-analysis
|
||||
description: "Analyze sprint velocity data and produce an engineering team health report covering delivery trends, capacity utilization, and improvement recommendations. Use when asked to analyze sprint velocity, review team delivery health, identify delivery risks, or produce a retrospective data analysis. Produces a velocity trend analysis, health diagnosis table, top improvement recommendations with implementation steps, and a next-sprint capacity forecast."
|
||||
---
|
||||
|
||||
# Sprint Velocity Analysis
|
||||
|
||||
Analyze sprint velocity data to produce an honest engineering team health report. The goal is not to generate optimistic-looking charts — it is to surface delivery patterns, identify dysfunction early, and give the team and their manager actionable recommendations. Look for: velocity trends (improving, declining, flat, erratic), story point calibration consistency, carry-over patterns that indicate chronic over-commitment, and capacity-related signals. Produce text-based trend visualizations, a health diagnosis, and specific improvement recommendations with measurable targets.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Sprint history** — for each sprint: sprint name/number, committed story points, completed story points, and number of items carried over to next sprint; ideally 6–8 sprints minimum
|
||||
- **Team size and any changes** — current team size and any additions or departures during the data window
|
||||
- **Known disruptions** — holidays, company all-hands, on-call incidents, or other events that affected specific sprints
|
||||
- **Cycle time data (optional)** — if available, p50 and p90 cycle time per sprint (time from start to done)
|
||||
- **Definition of Done** — what "completed" means for this team (merged to main? deployed to prod? accepted by PO?)
|
||||
|
||||
If cycle time data is not provided, omit that section and note it as a recommended data source to add.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Sprint Velocity Analysis: [Team Name]
|
||||
|
||||
**Analysis period:** Sprint [N] through Sprint [N+7] ([Date range])
|
||||
**Team size:** [X engineers] ([note any changes during period])
|
||||
**Report date:** [Date]
|
||||
**Data source:** [Where this data came from — Jira, Linear, spreadsheet, etc.]
|
||||
|
||||
---
|
||||
|
||||
## Velocity Trend
|
||||
|
||||
### Raw Data
|
||||
|
||||
| Sprint | Committed | Completed | Completion Rate | Carried Over | Notes |
|
||||
|--------|-----------|-----------|----------------|--------------|-------|
|
||||
| [Sprint N] | [X pts] | [X pts] | [X%] | [X pts / X items] | [disruption or context] |
|
||||
| [Sprint N+1] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+2] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+3] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+4] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+5] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+6] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+7] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| **Average** | **[X pts]** | **[X pts]** | **[X%]** | **[X pts]** | |
|
||||
|
||||
### Velocity Chart (Completed Points per Sprint)
|
||||
|
||||
```
|
||||
Points
|
||||
60 |
|
||||
55 | ●
|
||||
50 | ● ●
|
||||
45 | ● ● ●
|
||||
40 | ● ●
|
||||
35 |
|
||||
30 |
|
||||
+--+--+--+--+--+--+--+--
|
||||
N N+1 N+2 N+3 N+4 N+5 N+6 N+7
|
||||
Sprint
|
||||
|
||||
● = Completed points — = Average ([X pts])
|
||||
```
|
||||
|
||||
Generate this chart using ASCII characters based on the actual data provided. Scale the Y-axis to the data range. Plot completed (not committed) points. Mark the average as a dashed line.
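
One way to build such a chart from real data is sketched below. It is an illustration, not a required implementation: the function names, the 5-point Y-axis step, and the choice to draw the average as a row of hyphens on the nearest axis level are all assumptions.

```python
def nearest_level(value: float, step: int) -> int:
    """Snap a value to the nearest Y-axis level (halves round up)."""
    return int(value / step + 0.5) * step


def ascii_velocity_chart(completed, step=5):
    """Render completed points per sprint (oldest first) as a text chart."""
    levels = [nearest_level(points, step) for points in completed]
    average = sum(completed) / len(completed)
    average_level = nearest_level(average, step)
    # Scale the Y-axis to the data range, with one spare row above and below.
    top, bottom = max(levels) + step, min(levels) - step

    lines = ["Points"]
    for level in range(top, bottom - 1, -step):
        filler = "-" if level == average_level else " "  # hyphens mark the average
        cells = ["●" if plotted == level else filler for plotted in levels]
        lines.append(f"{level:>4} | " + (filler * 3).join(cells))
    lines.append("     +" + "-" * (4 * len(completed)))
    labels = ["N"] + [f"N+{i}" for i in range(1, len(completed))]
    lines.append("       " + "".join(f"{label:<4}" for label in labels).rstrip())
    lines.append(f"● = Completed points   - = Average ({average:.1f} pts)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical data for eight sprints.
    print(ascii_velocity_chart([40, 45, 50, 45, 55, 40, 50, 45]))
```

Each sprint occupies a four-character column so the `N+1` style labels line up under their points.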
|
||||
|
||||
### Trend Diagnosis
|
||||
|
||||
| Metric | Value | Interpretation |
|
||||
|--------|-------|----------------|
|
||||
| Average velocity | [X pts/sprint] | [Baseline for planning] |
|
||||
| Velocity std deviation | [±X pts] | [Low < 15% of avg = stable; High > 25% = erratic] |
|
||||
| Trend direction | [Improving / Flat / Declining / Erratic] | [Average of the 3 most recent sprints (trailing) vs. average of the 3 earliest sprints in the window (leading)] |
|
||||
| Average completion rate | [X%] | [Healthy: 80–95%; < 75% = chronic over-commitment] |
|
||||
| Carry-over rate | [X% of committed points carried over per sprint] | [Healthy: < 15%; > 25% = systemic issue] |
|
||||
| Sprints with completion rate < 75% | [X of 8 sprints] | [> 3 of 8 = structural problem, not noise] |
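
The metrics in this table can be computed directly from the raw data. The sketch below is one possible reading, with assumptions stated: it uses the population standard deviation, treats the band between 15% and 25% as "moderate", compares the last three sprints with the first three, and calls a change within 5% "Flat". The function name, the `flat_band` parameter, and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def velocity_summary(committed, completed, flat_band=0.05):
    """Summarise per-sprint committed/completed points (oldest sprint first)."""
    average = mean(completed)
    spread = pstdev(completed)  # population std dev; an assumption
    variation = spread / average
    if variation < 0.15:
        stability = "stable"
    elif variation > 0.25:
        stability = "erratic"
    else:
        stability = "moderate"  # assumption: the table leaves 15-25% unnamed

    rates = [done / planned for planned, done in zip(committed, completed)]
    earliest, latest = mean(completed[:3]), mean(completed[-3:])
    change = (latest - earliest) / earliest
    if change > flat_band:
        direction = "Improving"
    elif change < -flat_band:
        direction = "Declining"
    else:
        direction = "Flat"

    return {
        "average_velocity": average,
        "std_dev": spread,
        "stability": stability,
        "direction": direction,
        # Overall rate: total completed over total committed.
        "completion_rate": sum(completed) / sum(committed),
        "sprints_below_75pct": sum(rate < 0.75 for rate in rates),
    }


if __name__ == "__main__":
    summary = velocity_summary(
        committed=[55, 50, 55, 50, 60, 55, 55, 50],
        completed=[40, 45, 50, 45, 55, 40, 50, 45],
    )
    for key, value in summary.items():
        print(f"{key}: {value}")
```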
|
||||
|
||||
---
|
||||
|
||||
## Story Point Calibration
|
||||
|
||||
Story points are only useful if they are applied consistently. Look for these calibration signals in the data:
|
||||
|
||||
| Signal | Observed | Interpretation |
|
||||
|--------|----------|----------------|
|
||||
| High variance in velocity despite stable team size | [Yes / No] | Suggests inconsistent estimation: the same effort is scored differently from sprint to sprint |
|
||||
| Consistent over-commitment (committed >> completed) | [Yes / No; by avg X pts per sprint] | Team is under-estimating work or ignoring historical capacity |
|
||||
| Consistent under-commitment (completed >> committed by > 20%) | [Yes / No] | Team is over-padding estimates or pulling in unplanned work frequently |
|
||||
| Frequent large items (> 13 pts) in carry-over | [Yes / No] | Items are too large to estimate reliably — need better decomposition |
|
||||
| Velocity cliff after team change | [Yes / No — Sprint N+X] | Team did not re-baseline capacity after composition changed |
|
||||
|
||||
**Calibration verdict:** [Well-calibrated / Needs recalibration / Severely uncalibrated — one sentence explanation tied to the signals above]
|
||||
|
||||
**If recalibration is needed:** [Specific recommendation — e.g., "Run a calibration session using the last 20 completed items, re-score them as a team, and use the resulting relative sizes to anchor future estimates."]
|
||||
|
||||
---
|
||||
|
||||
## Carry-Over Pattern Analysis
|
||||
|
||||
Carry-over is a strong leading indicator of unreliable sprint commitments.
|
||||
|
||||
| Sprint | Carried-Over Items | Common Themes in Carry-Over |
|
||||
|--------|-------------------|----------------------------|
|
||||
| [Sprint N] | [X items / X pts] | [Technical debt, dependency blocked, scoped wrong, etc.] |
|
||||
| [Sprint N+1] | [X items / X pts] | [Theme] |
|
||||
| [Sprint N+2] | [X items / X pts] | [Theme] |
|
||||
|
||||
**Carry-over root causes identified:**
|
||||
- [Root cause 1: e.g., "5 of 12 carry-overs were blocked on a third-party API integration — external dependency, not estimation failure"]
|
||||
- [Root cause 2: e.g., "4 of 12 carry-overs were items estimated at 8+ points that were later found to be 2–3x larger than expected"]
|
||||
- [Root cause 3: e.g., "3 of 12 carry-overs were interruptions from on-call incidents consuming unplanned capacity"]
|
||||
|
||||
---
|
||||
|
||||
## Capacity Utilization
|
||||
|
||||
| Sprint | Team Size | Available Capacity (pts) | Committed | Utilization % | Disruptions |
|
||||
|--------|-----------|--------------------------|-----------|--------------|-------------|
|
||||
| [Sprint N] | [X engineers] | [X pts] | [X pts] | [X%] | [Holiday / incident / none] |
|
||||
| [Sprint N+1] | [X engineers] | [X pts] | [X pts] | [X%] | |
|
||||
|
||||
**Capacity calculation used:** [X engineers × Y pts/person/sprint = Z pts available. Adjust: if team capacity changed during the window, note which sprints used which team size.]
|
||||
|
||||
**Average utilization:** [X%]
|
||||
**Utilization interpretation:** [< 70% = team is under-loaded or over-padding | 70–90% = healthy range | > 90% = no slack for unplanned work — fragile]
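
A short sketch of the capacity calculation and the interpretation bands above. The function name and the sample figures are hypothetical; the points-per-engineer value in particular is something each team has to supply from its own history.

```python
def utilization(committed_pts: float, engineers: int, pts_per_engineer: float):
    """Return (utilization percent, interpretation) for one sprint."""
    available = engineers * pts_per_engineer  # X engineers x Y pts/person/sprint
    if available <= 0:
        raise ValueError("available capacity must be positive")
    pct = committed_pts * 100 / available
    if pct < 70:
        label = "under-loaded or over-padding"
    elif pct <= 90:
        label = "healthy"
    else:
        label = "fragile: no slack for unplanned work"
    return pct, label


if __name__ == "__main__":
    # Hypothetical sprint: 5 engineers at 10 pts each, 48 pts committed.
    pct, label = utilization(48, 5, 10)
    print(f"{pct:.0f}% utilization: {label}")
```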
|
||||
|
||||
---
|
||||
|
||||
## Health Diagnosis
|
||||
|
||||
| Dimension | Score | Evidence | Priority |
|
||||
|-----------|-------|----------|----------|
|
||||
| Delivery predictability | [Green / Yellow / Red] | [Average completion rate X%, std dev Y pts] | [High / Med / Low] |
|
||||
| Commitment accuracy | [Green / Yellow / Red] | [Team over-commits by avg X pts/sprint] | |
|
||||
| Estimation consistency | [Green / Yellow / Red] | [Velocity std dev ±X pts, calibration verdict] | |
|
||||
| Carry-over hygiene | [Green / Yellow / Red] | [X% carry-over rate, root causes] | |
|
||||
| Capacity management | [Green / Yellow / Red] | [Avg utilization X%, disruption handling] | |
|
||||
| Trend direction | [Green / Yellow / Red] | [Trailing 3-sprint avg vs. leading 3-sprint avg] | |
|
||||
|
||||
**Scoring guide:** Green = operating within healthy range; Yellow = marginal — watch closely or single-sprint anomaly; Red = chronic issue requiring active intervention.
|
||||
|
||||
**Overall health:** [Green / Yellow / Red] — [One sentence summary: "The team delivers consistently at X pts/sprint but chronic over-commitment is eroding morale and creating a misleading picture for stakeholders."]
|
||||
|
||||
---
|
||||
|
||||
## Blocker Frequency Analysis
|
||||
|
||||
If blocker data was provided, complete this section. If not, note it as a recommended tracking addition.
|
||||
|
||||
| Blocker Category | Frequency (last 8 sprints) | Avg Days Blocked | Impact (pts delayed) |
|
||||
|-----------------|--------------------------|------------------|---------------------|
|
||||
| External dependency | [X occurrences] | [X days] | [X pts] |
|
||||
| Technical debt / rework | [X occurrences] | [X days] | [X pts] |
|
||||
| Unclear requirements | [X occurrences] | [X days] | [X pts] |
|
||||
| On-call interruptions | [X occurrences] | [X days] | [X pts] |
|
||||
| Environment / tooling | [X occurrences] | [X days] | [X pts] |
|
||||
|
||||
**Top blocker to address:** [Name the single highest-impact blocker category and what addressing it would mean for velocity.]
|
||||
|
||||
---
|
||||
|
||||
## Improvement Recommendations
|
||||
|
||||
Provide 3 specific recommendations ordered by expected impact. Each recommendation must include a measurable success target and implementation steps.
|
||||
|
||||
### Recommendation 1: [Title]
|
||||
|
||||
**Problem it addresses:** [Which health dimension is Red or Yellow, and what the data shows]
|
||||
|
||||
**What to do:**
|
||||
1. [Specific action step — concrete enough that a tech lead can assign it]
|
||||
2. [Next step]
|
||||
3. [Next step]
|
||||
|
||||
**Who owns it:** [Tech lead / Engineering manager / Whole team]
|
||||
**When to start:** [This sprint / Next sprint / Within 2 weeks]
|
||||
|
||||
**Measurable target:** [e.g., "Carry-over rate drops below 15% within 3 sprints" or "Completion rate above 80% for 4 consecutive sprints"]
|
||||
|
||||
**How to know it's working:** [Leading indicator to watch before the outcome metric improves — e.g., "Carry-over items decreasing sprint-over-sprint even before the target is hit"]
|
||||
|
||||
---
|
||||
|
||||
### Recommendation 2: [Title]
|
||||
|
||||
**Problem it addresses:** [Health dimension and evidence]
|
||||
|
||||
**What to do:**
|
||||
1. [Step]
|
||||
2. [Step]
|
||||
3. [Step]
|
||||
|
||||
**Who owns it:** [Role]
|
||||
**When to start:** [Timing]
|
||||
|
||||
**Measurable target:** [Specific metric and timeframe]
|
||||
|
||||
**How to know it's working:** [Leading indicator]
|
||||
|
||||
---
|
||||
|
||||
### Recommendation 3: [Title]
|
||||
|
||||
**Problem it addresses:** [Health dimension and evidence]
|
||||
|
||||
**What to do:**
|
||||
1. [Step]
|
||||
2. [Step]
|
||||
|
||||
**Who owns it:** [Role]
|
||||
**When to start:** [Timing]
|
||||
|
||||
**Measurable target:** [Specific metric and timeframe]
|
||||
|
||||
**How to know it's working:** [Leading indicator]
|
||||
|
||||
---
|
||||
|
||||
## Next-Sprint Capacity Forecast
|
||||
|
||||
**Next sprint:** [Sprint N+8]
|
||||
**Known team size:** [X engineers]
|
||||
**Known capacity reducers:** [PTO: X days total, on-call rotation: ~Y pts of unplanned capacity, etc.]
|
||||
|
||||
| Factor | Impact |
|
||||
|--------|--------|
|
||||
| Base capacity (historical average) | [X pts] |
|
||||
| PTO / planned absences | −[X pts] |
|
||||
| On-call overhead (estimate) | −[X pts] |
|
||||
| Carry-over from Sprint [N+7] | [X pts] of the ceiling already spoken for (counts toward the ceiling; not extra capacity) |
|
||||
| **Recommended commitment ceiling** | **[X pts]** |
|
||||
|
||||
**Confidence:** [High — stable team and known capacity | Medium — some uncertainty in disruption level | Low — team composition uncertain]
|
||||
|
||||
**Recommendation for planning:** [One sentence, e.g., "Plan Sprint [N+8] to a ceiling of X pts. Given the carry-over items, prioritize completing those before pulling in new scope."]
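
The forecast arithmetic can be sketched as follows. It assumes carried-over work counts toward the ceiling rather than adding capacity, so it reduces the room left for new scope. The function name and the sample numbers are illustrative.

```python
def commitment_ceiling(base_capacity: float, pto_pts: float,
                       on_call_pts: float, carry_over_pts: float):
    """Return (recommended ceiling, points left for new scope)."""
    ceiling = base_capacity - pto_pts - on_call_pts
    # Assumption: carry-over is finished first and consumes part of the ceiling.
    new_scope = max(ceiling - carry_over_pts, 0)
    return ceiling, new_scope


if __name__ == "__main__":
    # Hypothetical inputs: 46 pts historical average, 4 pts PTO,
    # 3 pts on-call overhead, 8 pts carried over.
    ceiling, new_scope = commitment_ceiling(46, 4, 3, 8)
    print(f"Ceiling: {ceiling} pts, of which {new_scope} pts are open for new scope")
```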
|
||||
|
||||
---
|
||||
|
||||
## Cycle Time Distribution (if data provided)
|
||||
|
||||
| Sprint | p50 Cycle Time | p90 Cycle Time | Items Completed |
|
||||
|--------|---------------|---------------|-----------------|
|
||||
| [Sprint N] | [X days] | [X days] | [X items] |
|
||||
| [Average] | [X days] | [X days] | |
|
||||
|
||||
**Cycle time interpretation:** [p90 > 2× p50 indicates a long tail of stuck items that deserves investigation. A p50 that rises over time indicates slowing throughput, independent of story point changes.]
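
If only raw per-item cycle times are available, p50 and p90 can be derived with a few lines. This sketch uses the nearest-rank percentile method, which is one of several common definitions; tracking tools may interpolate and report slightly different values. The helper names and the sample data are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of items at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def has_long_tail(values) -> bool:
    """True when p90 exceeds twice p50, the signal described above."""
    return percentile(values, 90) > 2 * percentile(values, 50)


if __name__ == "__main__":
    cycle_times_days = [1, 2, 2, 3, 3, 4, 5, 6, 9, 14]  # hypothetical sprint
    print("p50:", percentile(cycle_times_days, 50))
    print("p90:", percentile(cycle_times_days, 90))
    print("long tail:", has_long_tail(cycle_times_days))
```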
|
||||
|
||||
If cycle time data was not provided: *Cycle time data was not included in this analysis. Recommend adding p50 and p90 cycle time per sprint to your tracking to detect throughput issues that story points alone cannot reveal.*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Velocity chart is generated from the actual data provided — not a generic placeholder chart
|
||||
- [ ] Trend diagnosis states a direction (Improving / Flat / Declining / Erratic) with a quantitative basis (trailing vs. leading average)
|
||||
- [ ] Carry-over root causes are specific categories with counts — not a generic observation that carry-over exists
|
||||
- [ ] Each of the 3 recommendations includes a named owner, a start date, and a measurable target with a timeframe
|
||||
- [ ] Next-sprint capacity forecast uses historical average as the baseline and deducts specific known reducers
|
||||
- [ ] Health diagnosis table uses Red/Yellow/Green with evidence cited in the Evidence column — no unsupported scores
|
||||
- [ ] If metrics are missing (cycle time, blocker log), the report explicitly calls them out as recommended additions
|
||||
@@ -0,0 +1,290 @@
|
||||
---
|
||||
name: tech-radar
|
||||
description: "Build a technology radar for an engineering team, categorizing technologies into Adopt/Trial/Assess/Hold rings across four quadrants, following the ThoughtWorks Tech Radar format. Use when asked to create a tech radar, evaluate the team's technology landscape, categorize tools and frameworks, or establish a technology strategy. Produces a full tech radar with quadrant tables, individual blip rationales, a decision trail, and a maintenance process guide."
|
||||
---
|
||||
|
||||
# Tech Radar
|
||||
|
||||
Produce a complete technology radar document for an engineering team. The radar gives the team a shared, explicit position on every significant technology in their stack — what to standardize on, what to experiment with, what to evaluate, and what to actively stop using. Follow the ThoughtWorks Tech Radar format: four quadrants (Techniques, Tools, Platforms, Languages & Frameworks) each with four rings (Adopt, Trial, Assess, Hold). Each technology entry ("blip") gets a ring assignment, a one-paragraph rationale, and a date. Include a decision trail showing what moved and why, and a maintenance process the team can run to keep the radar current.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Team or company name** — for the document header
|
||||
- **Current tech stack** — list every significant technology, tool, language, and platform the team currently uses
|
||||
- **Technologies under active evaluation** — tools or frameworks the team is currently trying or considering
|
||||
- **Technologies to deprecate or move off** — anything the team wants to stop using or is actively migrating away from
|
||||
- **Strategic technology bets** — any technologies the company has made a deliberate bet on (e.g., "we're all-in on Kubernetes" or "migrating to event-driven architecture")
|
||||
- **Team context** — team size, product domain, and any constraints (regulatory, compliance, vendor lock-in concerns)
|
||||
|
||||
If a technology is mentioned without a ring placement, use the rationale inputs to determine the appropriate ring. When uncertain between two rings, ask.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Technology Radar: [Team / Company Name]
|
||||
|
||||
**Edition:** [Month Year]
|
||||
**Maintained by:** [Team Name / Architecture Guild / CTO Office]
|
||||
**Review cadence:** Bi-annual (every 6 months)
|
||||
**Next review:** [Month Year + 6 months]
|
||||
|
||||
---
|
||||
|
||||
## How to Read This Radar
|
||||
|
||||
This radar reflects [Team / Company Name]'s current thinking on technologies we use, evaluate, and retire. Use it to make consistent technology choices, onboard new engineers, and have structured conversations about the stack.
|
||||
|
||||
**Quadrants** categorize the type of technology:
|
||||
|
||||
| Quadrant | What belongs here |
|
||||
|----------|------------------|
|
||||
| **Techniques** | Methods, patterns, and practices (e.g., trunk-based development, event sourcing) |
|
||||
| **Tools** | Software tools used in the development and delivery process (e.g., linters, CI systems, observability platforms) |
|
||||
| **Platforms** | Infrastructure and hosting environments (e.g., AWS, Kubernetes, Snowflake) |
|
||||
| **Languages & Frameworks** | Programming languages and application frameworks (e.g., Go, React, FastAPI) |
|
||||
|
||||
**Rings** express our recommendation:
|
||||
|
||||
| Ring | Meaning | What to do |
|
||||
|------|---------|-----------|
|
||||
| **Adopt** | Industry-proven, working well for us — our standard choice | Use by default for new work; no special justification needed |
|
||||
| **Trial** | Worth pursuing — we are experimenting with it in limited production use | Use in a bounded context with architectural oversight; share learnings |
|
||||
| **Assess** | Worth exploring — we have not used it in production yet | Spike, prototype, or research; do not use in production without a review |
|
||||
| **Hold** | Do not start new work with this technology | Complete existing commitments; do not expand use; plan migration |
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 1: Techniques
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name, e.g., Trunk-based development] | [Month Year] | [One sentence: why we adopted it and what it replaced] |
|
||||
| [Technique name] | [Month Year] | [One sentence rationale] |
|
||||
| [Technique name] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Technique name] — Adopt**
|
||||
[One paragraph rationale. Explain what problem this technique solves, why it works well in your context, and what the team should know before applying it. Reference any internal experience — e.g., "We rolled this out across 8 services in 2024 and saw a 40% reduction in merge conflicts."]
|
||||
|
||||
[Repeat for each Adopt-ring technique.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name] | [Month Year] | [One sentence: what we're testing and where] |
|
||||
|
||||
**[Technique name] — Trial**
|
||||
[One paragraph. What are we trialing? In which teams or services? What hypothesis are we testing? What would cause us to move it to Adopt vs. Hold?]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name] | [Month Year] | [One sentence: why we're interested] |
|
||||
|
||||
**[Technique name] — Assess**
|
||||
[One paragraph. Why is this interesting to us? What would we need to see to move it to Trial? Who is responsible for the assessment?]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name] | [Month Year] | [One sentence: why we're stopping and what replaces it] |
|
||||
|
||||
**[Technique name] — Hold**
|
||||
[One paragraph. Why are we putting this on hold? What is the migration path? What is the target end-state for teams still using it?]
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 2: Tools
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name, e.g., GitHub Actions] | [Month Year] | [One sentence rationale] |
|
||||
| [Tool name] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Tool name] — Adopt**
|
||||
[One paragraph rationale. Why is this our standard tool? What does it do well in our context? Any configuration or usage patterns the team should follow?]
|
||||
|
||||
[Repeat for each Adopt-ring tool.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name] | [Month Year] | [One sentence: what we're testing] |
|
||||
|
||||
**[Tool name] — Trial**
|
||||
[One paragraph rationale and trial scope.]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name] | [Month Year] | [One sentence: why we're evaluating it] |
|
||||
|
||||
**[Tool name] — Assess**
|
||||
[One paragraph: what sparked interest, who is evaluating, and timeline.]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name] | [Month Year] | [One sentence: what replaces it] |
|
||||
|
||||
**[Tool name] — Hold**
|
||||
[One paragraph: deprecation rationale and migration path.]
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 3: Platforms
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name, e.g., AWS EKS] | [Month Year] | [One sentence rationale] |
|
||||
| [Platform name] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Platform name] — Adopt**
|
||||
[One paragraph. What does this platform provide? What are the boundaries of its use? Any internal golden-path setup the team should follow?]
|
||||
|
||||
[Repeat for each Adopt-ring platform.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name] | [Month Year] | [One sentence: scope of trial] |
|
||||
|
||||
**[Platform name] — Trial**
|
||||
[One paragraph rationale and trial boundaries.]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name] | [Month Year] | [One sentence: why we're exploring it] |
|
||||
|
||||
**[Platform name] — Assess**
|
||||
[One paragraph assessment plan.]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name] | [Month Year] | [One sentence: migration target and timeline] |
|
||||
|
||||
**[Platform name] — Hold**
|
||||
[One paragraph: what triggered the hold decision, migration target, and timeline.]
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 4: Languages & Frameworks
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework, e.g., Go] | [Month Year] | [One sentence rationale] |
|
||||
| [Language/Framework] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Language/Framework] — Adopt**
|
||||
[One paragraph. What is this language or framework used for? What are the team's proficiency expectations? Any frameworks or libraries that go alongside it as part of the standard choice?]
|
||||
|
||||
[Repeat for each Adopt-ring language or framework.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework] | [Month Year] | [One sentence: bounded use case] |
|
||||
|
||||
**[Language/Framework] — Trial**
|
||||
[One paragraph rationale.]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework] | [Month Year] | [One sentence: interest driver] |
|
||||
|
||||
**[Language/Framework] — Assess**
|
||||
[One paragraph assessment plan.]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework] | [Month Year] | [One sentence: reason and migration path] |
|
||||
|
||||
**[Language/Framework] — Hold**
|
||||
[One paragraph: deprecation rationale, existing system obligations, and timeline to retire.]
|
||||
|
||||
---
|
||||
|
||||
## Decision Trail
|
||||
|
||||
This log records every ring movement since the radar's first edition. Use it to understand the evolution of our technology choices.
|
||||
|
||||
| Technology | Quadrant | Previous Ring | New Ring | Edition | Reason |
|
||||
|------------|----------|--------------|----------|---------|--------|
|
||||
| [Name] | [Quadrant] | — | Adopt | [Month Year] | First placement — [one sentence why] |
|
||||
| [Name] | [Quadrant] | Assess | Trial | [Month Year] | [What prompted the move — evidence, team feedback, production trial results] |
|
||||
| [Name] | [Quadrant] | Trial | Adopt | [Month Year] | [Adoption rationale — usage results, team satisfaction, scale proven] |
|
||||
| [Name] | [Quadrant] | Adopt | Hold | [Month Year] | [Why moved to Hold — better alternative, security concern, cost, vendor issue] |
|
||||
| [Name] | [Quadrant] | — | Hold | [Month Year] | First placement — added directly to Hold because [reason] |
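
Teams that keep the radar in code or structured data can generate this table from a small record type. The sketch below is one hypothetical shape for it: the `Blip` class, its field names, and the sample entry are illustrative and not prescribed by the radar format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

RINGS = ("Adopt", "Trial", "Assess", "Hold")


@dataclass
class Blip:
    name: str
    quadrant: str
    ring: Optional[str] = None  # None until the first placement
    # Each trail entry: (previous ring, new ring, edition, reason)
    trail: List[Tuple[Optional[str], str, str, str]] = field(default_factory=list)

    def move(self, new_ring: str, edition: str, reason: str) -> None:
        """Record a ring movement and update the current ring."""
        if new_ring not in RINGS:
            raise ValueError(f"unknown ring: {new_ring}")
        if new_ring == self.ring:
            raise ValueError("a movement must change the ring")
        self.trail.append((self.ring, new_ring, edition, reason))
        self.ring = new_ring


if __name__ == "__main__":
    blip = Blip(name="ExampleTool", quadrant="Tools")  # hypothetical entry
    blip.move("Assess", "Jan 2025", "First placement: spike requested")
    blip.move("Trial", "Jul 2025", "Spike succeeded in one service")
    for previous, new, edition, reason in blip.trail:
        print(f"| {blip.name} | {blip.quadrant} | {previous or '-'} | {new} | {edition} | {reason} |")
```

Keeping the trail on the blip itself means the current ring and its history cannot drift apart.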
|
||||
|
||||
---
|
||||
|
||||
## Radar Maintenance Process
|
||||
|
||||
### Who Contributes
|
||||
|
||||
- **Architecture review group / CTO office** — final ring placement decisions
|
||||
- **All engineers** — submit blip nominations via [channel or form]
|
||||
- **Tech leads** — triage nominations and prepare proposals for review sessions
|
||||
|
||||
### Update Cadence
|
||||
|
||||
| Activity | Frequency | Owner |
|
||||
|----------|-----------|-------|
|
||||
| New blip nominations accepted | Ongoing — any engineer via [channel] | Anyone |
|
||||
| Nomination triage | Monthly | Tech leads |
|
||||
| Full radar review session | Every 6 months | Architecture group |
|
||||
| Published radar update | Every 6 months | [Owner name or role] |
|
||||
|
||||
### How to Nominate a Blip
|
||||
|
||||
1. Submit to [Slack channel / form URL] with: technology name, quadrant, proposed ring, and one-paragraph rationale.
|
||||
2. A tech lead reviews within 2 weeks and either schedules it for the next review session or requests more information.
|
||||
3. At the review session, the architecture group discusses and votes. Simple majority wins; ties go to Hold pending further evidence.
|
||||
4. Approved blips are added to the radar doc and the decision trail within 1 week of the session.
|
||||
|
||||
### Ring Change Criteria
|
||||
|
||||
| To move TO Adopt | To move TO Trial | To move TO Assess | To move TO Hold |
|
||||
|-----------------|-----------------|-------------------|-----------------|
|
||||
| Proven in multiple production systems; team broadly trained; clear operational runbook exists | At least one production use case running; architectural oversight in place; learnings documented | Concrete use case identified; spike completed or in progress; interest from at least 2 engineers | Better alternative exists; known security/compliance risk; strategic direction change; unacceptable maintenance burden |
|
||||
|
||||
---
|
||||
|
||||
*Questions about this radar: [Slack channel] | Submit a nomination: [URL or channel]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every blip has a written rationale paragraph — not just a table row entry
|
||||
- [ ] The decision trail is populated with at least the initial placement date for every blip
|
||||
- [ ] Hold-ring entries include a concrete migration path or target technology, not just "stop using it"
|
||||
- [ ] Ring definitions are present and include both what each ring means AND what engineers should do in response
|
||||
- [ ] Maintenance process includes: nomination channel, review cadence, who decides, and ring-change criteria
|
||||
- [ ] Technologies identified as "strategic bets" in the inputs are placed in Adopt (if proven) or Trial (if being rolled out)
|
||||
- [ ] Technologies identified for deprecation are in Hold with a rationale that references the replacement
|
||||
@@ -0,0 +1,260 @@
|
||||
---
|
||||
name: technical-debt-register
|
||||
description: "Document and prioritize a technical debt backlog with business impact, effort estimates, and resolution strategy. Use when asked to audit technical debt, create a debt register, prioritize tech debt for a quarter, document architectural shortcuts, or build a debt reduction roadmap. Produces a structured technical debt register covering debt inventory by category, business impact per item, effort and priority scores, top-item resolution plans, and a quarterly debt reduction roadmap."
|
||||
---
|
||||
|
||||
# Technical Debt Register Skill
|
||||
|
||||
Produce a complete technical debt register for a team or service. A debt register is not a complaint list — it is a prioritized, business-impact-aware inventory that lets an engineering team make deliberate choices about which debt to pay down, in what order, and with what expected return.
|
||||
|
||||
Good debt management is not eliminating all debt. It is ensuring debt is visible, owned, and resolved when the interest cost exceeds the cost of fixing it.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Team or service name** — what team and/or service this register covers
|
||||
- **Known debt items** — a list of known technical debt; if none is provided, elicit items by asking about: legacy code, missing tests, outdated dependencies, architectural shortcuts, manual processes, observability gaps, security backlogs
|
||||
- **Tech stack** — language, frameworks, infrastructure (helps Claude categorise and score items correctly)
|
||||
- **Team size and velocity** — number of engineers and approximate story points or days per sprint (needed for effort estimates)
|
||||
- **Current quarter / planning period** — so the roadmap targets the right timeframe
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Technical Debt Register: [Team / Service Name]
|
||||
|
||||
**Team:** [Name] | **Service(s):** [Name(s)]
|
||||
**Author:** [Name] | **Last updated:** [Date]
|
||||
**Planning period:** [Q[X] [Year]] | **Review cadence:** [Monthly / Quarterly]
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
[2–3 sentences describing the team's current debt situation, the main categories of debt, and the business context — e.g. are they in a growth phase where velocity matters, or approaching a compliance deadline where security debt is critical?]
|
||||
|
||||
**Total items in register:** [X]
|
||||
**Unresolved items:** [X]
|
||||
**Critical/High priority items:** [X]
|
||||
**Estimated total resolution effort:** [X story points / X engineer-weeks]
|
||||
|
||||
---
|
||||
|
||||
## Debt Category Definitions
|
||||
|
||||
| Category | Description | Examples |
|
||||
|---|---|---|
|
||||
| **Code quality** | Code that works but is hard to change safely | Duplicated logic, deeply nested conditionals, inconsistent error handling, missing abstraction |
|
||||
| **Architecture** | Structural decisions that limit scalability or increase coupling | Monolith that should be decomposed, sync calls that should be async, missing domain boundaries |
|
||||
| **Testing** | Gaps in test coverage that increase regression risk | Missing unit tests, no integration tests, flaky test suite, no test data management |
|
||||
| **Security** | Known vulnerabilities or missing security controls | Outdated dependencies with CVEs, missing rate limiting, hard-coded secrets, insufficient auth |
|
||||
| **Dependencies** | Outdated or risky external dependencies | End-of-life libraries, major version lag, abandoned packages |
|
||||
| **Infrastructure** | Infrastructure that limits reliability or developer productivity | Manual deployment steps, no IaC, single-AZ, missing autoscaling |
|
||||
| **Observability** | Gaps in visibility that slow incident response | Missing metrics, no distributed tracing, poor log structure, no alerting on key SLIs |
|
||||
| **Process** | Manual or error-prone operational processes | Manual DB migrations, no runbooks, tribal knowledge not documented |
|
||||
|
||||
---
|
||||
|
||||
## Debt Register
|
||||
|
||||
### Scoring Method
|
||||
|
||||
**Business impact (1–5):**
|
||||
- 5 — Blocking growth, causing production incidents, or creating compliance risk
|
||||
- 4 — Significantly slowing delivery or increasing incident likelihood
|
||||
- 3 — Noticeable slowdown; manageable but accumulating
|
||||
- 2 — Minor friction; low immediate risk
|
||||
- 1 — Cosmetic or aspirational; no current business impact
|
||||
|
||||
**Effort to resolve (1–5, lower = easier):**
|
||||
- 1 — <0.5 day; single engineer
|
||||
- 2 — 0.5–2 days; single engineer
|
||||
- 3 — 3–5 days; single engineer or small pair
|
||||
- 4 — 1–2 weeks; team collaboration required
|
||||
- 5 — >2 weeks; significant planning and coordination
|
||||
|
||||
**Priority score = Business impact × (6 − Effort)** *(rewards high-impact, low-effort items)*
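The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `DebtItem` class, its field names, and the tie-breaking rule in `rank` (lower effort first, then ID) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    """One register row; the class and field names are illustrative."""
    item_id: str
    business_impact: int  # 1-5, higher means more impact
    effort: int           # 1-5, lower means easier

    def priority_score(self) -> int:
        # Priority score = Business impact * (6 - Effort)
        for name, value in (("business_impact", self.business_impact),
                            ("effort", self.effort)):
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")
        return self.business_impact * (6 - self.effort)


def rank(items: list[DebtItem]) -> list[DebtItem]:
    """Sort highest score first; ties go to lower effort, then ID (an assumption)."""
    return sorted(items, key=lambda i: (-i.priority_score(), i.effort, i.item_id))
```

Because effort is inverted, an item with impact 4 and effort 1 scores the same as one with impact 5 and effort 2, which is why an explicit tie-break is worth agreeing on.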
|
||||
|
||||
---
|
||||
|
||||
| ID | Item | Category | Business impact (1–5) | Effort (1–5) | Priority score | Status | Owner |
|
||||
|---|---|---|---|---|---|---|---|
|
||||
| TD-001 | [e.g. No integration tests for payment flow] | Testing | 5 | 3 | 15 | Open | [Name] |
|
||||
| TD-002 | [e.g. Authentication library 3 major versions behind] | Security | 5 | 2 | 20 | Open | [Name] |
|
||||
| TD-003 | [e.g. Database queries not using connection pooling] | Architecture | 4 | 2 | 16 | Open | [Name] |
|
||||
| TD-004 | [e.g. Manual deployment process for [service]] | Infrastructure | 4 | 3 | 12 | In progress | [Name] |
|
||||
| TD-005 | [e.g. 200-line God function in order processing] | Code quality | 3 | 3 | 9 | Open | [Name] |
|
||||
| TD-006 | [e.g. No structured logging — plain text only] | Observability | 3 | 2 | 12 | Open | [Name] |
|
||||
| TD-007 | [e.g. ORM version has known N+1 query issue] | Dependencies | 3 | 3 | 9 | Open | [Name] |
|
||||
| TD-008 | [e.g. No runbook for [critical operation]] | Process | 3 | 1 | 15 | Open | [Name] |
|
||||
| TD-009 | [e.g. Test coverage at 34% — no meaningful safety net] | Testing | 4 | 4 | 8 | Open | [Name] |
|
||||
| TD-010 | [e.g. Hard-coded config values in application code] | Code quality | 2 | 1 | 10 | Open | [Name] |
|
||||
| TD-011 | [e.g. Service deployed single-AZ with no failover] | Infrastructure | 5 | 4 | 10 | Open | [Name] |
|
||||
| TD-012 | [e.g. No alerting on P95 latency for [endpoint]] | Observability | 4 | 1 | 20 | Open | [Name] |
|
||||
|
||||
---
|
||||
|
||||
## Category Breakdown
|
||||
|
||||
```
|
||||
Category distribution (by item count):
|
||||
─────────────────────────────────────────────
|
||||
Code quality ████████░░ [X items] ([X]%)
|
||||
Architecture ██████░░░░ [X items] ([X]%)
|
||||
Testing █████████░ [X items] ([X]%)
|
||||
Security ████░░░░░░ [X items] ([X]%)
|
||||
Dependencies ███░░░░░░░ [X items] ([X]%)
|
||||
Infrastructure ████░░░░░░ [X items] ([X]%)
|
||||
Observability ████░░░░░░ [X items] ([X]%)
|
||||
Process ██░░░░░░░░ [X items] ([X]%)
|
||||
─────────────────────────────────────────────
|
||||
|
||||
Priority distribution:
|
||||
Critical (score 20–25): [X items]
|
||||
High (score 12–19): [X items]
|
||||
Medium (score 6–11): [X items]
|
||||
Low (score 1–5): [X items]
|
||||
```
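The breakdown above could be generated rather than typed by hand. The sketch below assumes the score bands listed in the priority distribution and, as a further assumption, scales each bar relative to the largest category; the function names are hypothetical.

```python
from collections import Counter


def priority_band(score: int) -> str:
    """Map a priority score (1-25) to the bands used in the distribution."""
    if not 1 <= score <= 25:
        raise ValueError(f"score must be between 1 and 25, got {score}")
    if score >= 20:
        return "Critical"
    if score >= 12:
        return "High"
    if score >= 6:
        return "Medium"
    return "Low"


def category_bars(categories: list[str], width: int = 10) -> list[str]:
    """Render one text bar per category, scaled to the largest count."""
    counts = Counter(categories)
    if not counts:
        return []
    total = sum(counts.values())
    largest = max(counts.values())
    lines = []
    for name, count in counts.most_common():
        filled = round(width * count / largest)
        bar = "█" * filled + "░" * (width - filled)
        noun = "item" if count == 1 else "items"
        pct = round(100 * count / total)
        lines.append(f"{name:<15} {bar}  {count} {noun} ({pct}%)")
    return lines
```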
|
||||
|
||||
---
|
||||
|
||||
## Top 5 Priority Items — Resolution Plans
|
||||
|
||||
### TD-XXX: [Highest priority item name]
|
||||
|
||||
**Priority score:** [Score] | **Category:** [Category] | **Owner:** [Name]
|
||||
|
||||
**Problem:**
|
||||
[2–3 sentences describing what the debt is, how it manifests, and what pain it currently causes. Be specific — reference actual incidents, slowdowns, or risks.]
|
||||
|
||||
**Business impact:**
|
||||
[What happens if this is not resolved? Reference any incidents, near-misses, or growth blockers. E.g. "This caused 2 production incidents in the last quarter and adds ~30 minutes of debugging time to any change in this area."]
|
||||
|
||||
**Resolution approach:**
|
||||
[Clear description of the fix. Not "improve the code" — describe the actual work: "Extract the payment processing logic into a dedicated `PaymentService` class, write unit tests to 80% coverage, and update the 3 call sites."]
|
||||
|
||||
**Steps:**
|
||||
1. [Specific, ticketable step]
|
||||
2. [Specific, ticketable step]
|
||||
3. [Specific, ticketable step]
|
||||
|
||||
**Acceptance criteria:**
|
||||
- [ ] [Measurable criterion — e.g. "Zero hard-coded config values remain in application code"]
|
||||
- [ ] [Measurable criterion — e.g. "CI pipeline passes with new tests"]
|
||||
- [ ] [Measurable criterion]
|
||||
|
||||
**Effort estimate:** [X story points / X days]
|
||||
**Suggested sprint:** [Q[X] Sprint [Y] / When [dependency] is complete]
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Second priority item name]
|
||||
|
||||
**Priority score:** [Score] | **Category:** [Category] | **Owner:** [Name]
|
||||
|
||||
**Problem:**
|
||||
[Description]
|
||||
|
||||
**Business impact:**
|
||||
[Impact description]
|
||||
|
||||
**Resolution approach:**
|
||||
[Approach description]
|
||||
|
||||
**Steps:**
|
||||
1. [Step]
|
||||
2. [Step]
|
||||
3. [Step]
|
||||
|
||||
**Acceptance criteria:**
|
||||
- [ ] [Criterion]
|
||||
- [ ] [Criterion]
|
||||
|
||||
**Effort estimate:** [X story points / X days]
|
||||
**Suggested sprint:** [Sprint or timeframe]
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Third priority item]
|
||||
|
||||
*(Follow same format as above)*
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Fourth priority item]
|
||||
|
||||
*(Follow same format as above)*
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Fifth priority item]
|
||||
|
||||
*(Follow same format as above)*
|
||||
|
||||
---
|
||||
|
||||
## Debt Reduction Roadmap
|
||||
|
||||
### Guiding principles
|
||||
|
||||
- Allocate [X%] of each sprint's capacity to debt resolution — recommended 15–20% for healthy teams
|
||||
- Security and dependency debt is addressed on a fixed cadence regardless of priority score
|
||||
- No new feature work in modules with Critical debt unless the debt is scheduled for the current sprint
|
||||
- Debt items closed without a resolution (accepted/deferred) must have a named owner and a review date
|
||||
|
||||
### Quarterly plan
|
||||
|
||||
| Quarter | Focus area | Items targeted | Estimated capacity | Expected outcome |
|
||||
|---|---|---|---|---|
|
||||
| **[Q1 Year]** (current) | Security + observability | TD-002, TD-012, TD-006 | [X] points / [Y] eng-days | Auth library current; latency alerting live; structured logging shipped |
|
||||
| **[Q2 Year]** | Architecture + reliability | TD-003, TD-011, TD-004 | [X] points / [Y] eng-days | Connection pooling fixed; multi-AZ deployed; deploy automation complete |
|
||||
| **[Q3 Year]** | Testing coverage | TD-001, TD-009 | [X] points / [Y] eng-days | Payment flow integration tests live; overall coverage ≥60% |
|
||||
| **[Q4 Year]** | Code quality + process | TD-005, TD-008, TD-010 | [X] points / [Y] eng-days | God functions refactored; runbooks complete; zero hard-coded config |
|
||||
|
||||
### Sprint allocation model
|
||||
|
||||
```
|
||||
Sprint capacity: [X] story points
|
||||
|
||||
Allocation:
|
||||
├── Feature work: [X * 0.75 = ~Y] points (75%)
|
||||
├── Debt resolution: [X * 0.15 = ~Y] points (15%)
|
||||
└── Unplanned/bugs: [X * 0.10 = ~Y] points (10%)
|
||||
|
||||
Debt items that fit in one sprint ([≤Y] points each):
|
||||
✓ TD-002 ([X] points)
|
||||
✓ TD-012 ([X] points)
|
||||
✓ TD-006 ([X] points)
|
||||
✓ TD-008 ([X] points)
|
||||
|
||||
Multi-sprint debt items (break into phases):
|
||||
~ TD-001: Phase 1 ([X] pts) → Phase 2 ([X] pts)
|
||||
~ TD-009: Requires dedicated debt sprint or pairing
|
||||
```
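As a rough sketch of the allocation model above, the split and the single-sprint versus multi-sprint partition can be computed as follows. The default shares mirror the 75/15/10 example; the function names and the rounding choice are assumptions.

```python
def allocate(capacity: int, debt_share: float = 0.15,
             unplanned_share: float = 0.10) -> dict[str, int]:
    """Split sprint capacity into feature, debt, and unplanned buckets."""
    if capacity < 0:
        raise ValueError("capacity must not be negative")
    if debt_share < 0 or unplanned_share < 0 or debt_share + unplanned_share > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")
    debt = round(capacity * debt_share)
    unplanned = round(capacity * unplanned_share)
    # Feature work takes whatever remains, so the buckets always sum to capacity.
    return {"feature": capacity - debt - unplanned, "debt": debt, "unplanned": unplanned}


def split_by_fit(estimates: dict[str, int], debt_budget: int) -> tuple[list[str], list[str]]:
    """Separate items that fit one sprint's debt budget from those needing phases."""
    fits = [item for item, points in estimates.items() if points <= debt_budget]
    multi_sprint = [item for item, points in estimates.items() if points > debt_budget]
    return fits, multi_sprint
```

For a 40-point sprint this yields 30 feature points, 6 debt points, and 4 points for unplanned work, so any debt item estimated above 6 points would need to be phased.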
|
||||
|
||||
---
|
||||
|
||||
## Accepted / Deferred Debt
|
||||
|
||||
Items where the cost of remediation currently exceeds the business value, accepted with explicit review dates.
|
||||
|
||||
| ID | Item | Reason for deferral | Review date | Owner |
|
||||
|---|---|---|---|---|
|
||||
| TD-XXX | [Item] | [e.g. "Rewrite would require 3 weeks with no user-facing value at current scale; revisit at 10× traffic"] | [Date] | [Name] |
|
||||
| TD-XXX | [Item] | [e.g. "Dependency has a CVE but no upgrade path exists until Q3; mitigated by WAF rule"] | [Date] | [Name] |
|
||||
|
||||
**Policy:** No item may be deferred more than twice without escalation to the engineering manager.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every item has a named owner — no unowned debt
|
||||
- [ ] Priority scores are calculated using the formula, not assigned arbitrarily
|
||||
- [ ] Security and dependency items are not scored below their actual business impact because they feel "technical"
|
||||
- [ ] Top-5 resolution plans include specific, ticketable steps — not vague descriptions like "improve test coverage"
|
||||
- [ ] The quarterly roadmap allocates realistic capacity — debt allocation does not exceed actual sprint budget
|
||||
- [ ] Accepted/deferred items have a review date and a named owner — no permanently deferred items
|
||||
- [ ] The register distinguishes between debt (deliberate or accumulated shortcuts) and bugs (unintended defects)
|
||||
- [ ] Items are closed as resolved only when acceptance criteria are met — not when the PR is merged
|
||||
@@ -0,0 +1,312 @@
|
||||
---
|
||||
name: api-versioning-strategy
|
||||
description: "Write an API versioning strategy document for a service or API platform. Use when asked to define versioning policy, plan API deprecation, classify breaking changes, or document version lifecycle. Produces a complete versioning strategy with breaking-change classification table, deprecation timeline, migration guide template, and client communication template."
|
||||
---
|
||||
|
||||
# API Versioning Strategy
|
||||
|
||||
Produce a complete API versioning strategy document that gives a service team durable, consistent rules for evolving their API without breaking consumers. This document covers the versioning scheme selection (with rationale), lifecycle policy from introduction through sunset, a precise breaking-change classification, and all the communication artifacts a team needs when deprecating a version. Engineers should be able to hand this document to a new team member or external consumer and have them understand exactly what to expect.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **API type** — REST, GraphQL, or gRPC (each has different versioning mechanics)
|
||||
- **Current versioning approach** — URL path (`/v1/`), request header, query parameter, or none; if none, document starts fresh
|
||||
- **Number of existing versions and active consumer count** — needed to size the lifecycle policy and migration scope
|
||||
- **Deprecation timeline constraints** — any hard deadlines (contract SLAs, compliance windows, annual release cycles)
|
||||
- **Consumer type** — internal teams only, external partners, public API, or mix (affects communication channel choices)
|
||||
|
||||
If any input is missing, ask before producing the document. For GraphQL, note that the versioning approach differs substantially (schema evolution over versioning) and tailor the scheme section accordingly.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# API Versioning Strategy: [Service Name]
|
||||
|
||||
**Owner:** [Team Name]
|
||||
**API Type:** [REST / GraphQL / gRPC]
|
||||
**Document Version:** 1.0
|
||||
**Last Reviewed:** [Date]
|
||||
**Next Review:** [Date + 6 months]
|
||||
|
||||
---
|
||||
|
||||
## 1. Versioning Scheme
|
||||
|
||||
### Selected Approach: [URL Path / Request Header / Query Parameter]
|
||||
|
||||
| Scheme | Example | Pros | Cons | Verdict |
|
||||
|--------|---------|------|------|---------|
|
||||
| URL Path | `/v2/orders` | Visible in logs and bookmarks; trivial to route | Violates strict REST resource identity; clutters URL space | **Recommended for public-facing REST APIs** |
|
||||
| `Accept` Header | `Accept: application/vnd.[service].v2+json` | Keeps URLs clean; proper content negotiation | Harder to test in browser; less visible in logs | Recommended for internal APIs with controlled clients |
|
||||
| Query Parameter | `/orders?version=2` | Easy to retrofit without URL restructuring | Often missed in client code; cache-key complications | Acceptable only for read-heavy APIs already in production |
|
||||
| GraphQL Schema Evolution | Field deprecation + `@deprecated` directive | No versioning needed for additive changes | Requires disciplined schema design | **Recommended for GraphQL APIs** |
|
||||
|
||||
**Rationale for [chosen scheme]:** [One paragraph explaining why this scheme fits the API type, consumer type, and operational context provided. Reference the specific inputs — e.g., "Because this API has external partners who integrate via generated clients, URL path versioning provides the most predictable routing behavior and eliminates header negotiation complexity."]
|
||||
|
||||
### Version Format
|
||||
|
||||
```
|
||||
[Base URL]/v{MAJOR}/{resource}
|
||||
|
||||
Examples:
|
||||
https://api.[company].com/v1/orders
|
||||
https://api.[company].com/v2/orders/{id}/items
|
||||
|
||||
Version identifier: integer only (v1, v2, v3)
|
||||
No minor versions in the URL — minor/patch changes are non-breaking and deployed continuously.
|
||||
```
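A gateway or router that enforces this format might extract the major version as sketched below. The function name is hypothetical, and rejecting `v0` is an assumption that the format above does not state.

```python
import re

# Integer-only major version prefix: /v1, /v2/orders, ... (no minor versions).
_VERSION_PATH = re.compile(r"^/v(?P<major>[1-9]\d*)(?P<rest>/.*)?$")


def parse_major_version(path: str) -> tuple[int, str]:
    """Return (major version, remaining resource path) for a /v{MAJOR}/... path."""
    match = _VERSION_PATH.match(path)
    if match is None:
        raise ValueError(f"path is missing an integer version prefix: {path!r}")
    return int(match.group("major")), match.group("rest") or "/"
```

A path such as `/v1.2/orders` is rejected rather than silently truncated, which keeps minor versions out of the URL space.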
|
||||
|
||||
---
|
||||
|
||||
## 2. Version Lifecycle Policy
|
||||
|
||||
### Lifecycle Stages
|
||||
|
||||
```
|
||||
STABLE ──────────────────────────────────────────────────►
|
||||
│
|
||||
├─ STABLE Active development, full SLA, new consumers allowed
|
||||
│
|
||||
├─ DEPRECATED Announced, timeline posted, migration docs live.
|
||||
│ New consumers blocked. Existing consumers receive warnings.
|
||||
│
|
||||
├─ SUNSET Requests return HTTP 410 Gone + migration pointer.
|
||||
│ 30-day window before routing is removed.
|
||||
│
|
||||
└─ RETIRED Routing removed, docs archived, no traffic accepted.
|
||||
```
|
||||
|
||||
| Stage | Duration | SLA Applies | New Consumers Allowed | Required Action |
|
||||
|-------|----------|-------------|----------------------|-----------------|
|
||||
| Stable | Until superseded | Yes — full | Yes | None |
|
||||
| Deprecated | [12 months / adjust per constraint] | Yes — degraded acceptable | No | Migrate before sunset date |
|
||||
| Sunset | 30-day window | Best-effort only | No | Migrate immediately |
|
||||
| Retired | Permanent | None | No | — |
|
||||
|
||||
**Minimum Stable Period:** A version must remain Stable for at least [6 / 12] months before deprecation can be announced.
|
||||
|
||||
**Maximum Simultaneous Versions:** No more than [2] versions in Stable or Deprecated status at any time. Releasing v3 requires committing to a sunset date for v1 in the same announcement.
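The lifecycle rules above reduce to a small date comparison. The following is a minimal sketch under stated assumptions: the names are hypothetical, the 30-day sunset window and the limit of 2 are taken from the placeholders above, and a real policy would read them from configuration.

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Stage(Enum):
    STABLE = "stable"
    DEPRECATED = "deprecated"
    SUNSET = "sunset"
    RETIRED = "retired"


SUNSET_WINDOW = timedelta(days=30)  # period during which 410 Gone is served


def stage_on(today: date, deprecated_on: Optional[date],
             sunset_on: Optional[date]) -> Stage:
    """Derive a version's lifecycle stage from its announced dates."""
    if deprecated_on is None or today < deprecated_on:
        return Stage.STABLE
    if sunset_on is None or today < sunset_on:
        return Stage.DEPRECATED
    if today < sunset_on + SUNSET_WINDOW:
        return Stage.SUNSET
    return Stage.RETIRED


def within_version_limit(stages: list[Stage], limit: int = 2) -> bool:
    """True when the count of Stable plus Deprecated versions is within the limit."""
    live = sum(1 for stage in stages if stage in (Stage.STABLE, Stage.DEPRECATED))
    return live <= limit
```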
|
||||
|
||||
---
|
||||
|
||||
## 3. Breaking vs. Non-Breaking Change Classification
|
||||
|
||||
Apply this table before every API change. If a change is marked Breaking, it requires a new major version. When uncertain, default to Breaking.
|
||||
|
||||
| Change Type | Specific Example | Classification | Rationale |
|
||||
|-------------|-----------------|----------------|-----------|
|
||||
| Remove a response field | Delete `order.legacy_id` from response | **Breaking** | Clients reading this field hit a null reference or fail to parse the response |
|
||||
| Rename a field | `user_name` → `username` | **Breaking** | Clients referencing old name receive null |
|
||||
| Change field type | `"amount": "10.00"` → `"amount": 10.00` | **Breaking** | Type mismatch at deserialization |
|
||||
| Make optional field required | `email` required in POST body | **Breaking** | Existing callers omitting it receive 400 |
|
||||
| Remove an endpoint | `DELETE /v1/widgets/{id}` removed | **Breaking** | Existing callers receive 404 |
|
||||
| Change HTTP method | `GET /search` → `POST /search` | **Breaking** | Bookmarked or cached GET calls fail |
|
||||
| Change authentication scheme | API key → OAuth2 | **Breaking** | All clients must re-authenticate |
|
||||
| Restructure error response shape | Error JSON schema changed | **Breaking** | Error-handling code misparses responses |
|
||||
| Expand enum values (response) | New `status: "on_hold"` value returned | **Breaking** | Switch statements with no default fall through |
|
||||
| Change pagination defaults | `page_size` default 20 → 50 | **Breaking** | Response length changes unexpectedly |
|
||||
| Tighten input validation | Max length 100 → 50 | **Breaking** | Previously valid inputs now rejected |
|
||||
| Add new optional field to response | Add `order.tax_breakdown` | Non-Breaking | Clients that ignore unknown fields, as the API contract should require, are unaffected |
|
||||
| Add new optional request parameter | Add `?include_archived=true` | Non-Breaking | Ignored by existing clients |
|
||||
| Add a new endpoint | `GET /v1/orders/{id}/audit` | Non-Breaking | No existing client references it |
|
||||
| Relax input validation | Min length 10 → 5 | Non-Breaking | Existing valid inputs remain valid |
|
||||
| Performance or latency improvement | Response time reduced | Non-Breaking | — |
|
||||
| Add new enum value (request-only) | Accept new `type: "express"` | Non-Breaking | Existing values still accepted |
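A few rows of the table above can be checked mechanically. The sketch below compares two response schemas expressed as `{field name: type}` maps and labels removals, type changes, and additions. It is an assumption-laden illustration: the function names and the schema representation are hypothetical, added fields are assumed to be optional, and rows such as enum expansion or changed defaults are not covered.

```python
def classify_response_changes(old: dict[str, str],
                              new: dict[str, str]) -> list[tuple[str, str]]:
    """Compare two {field name: type} maps and label each difference."""
    findings = []
    for name, old_type in old.items():
        if name not in new:
            # A rename appears as one removal plus one addition.
            findings.append(("Breaking", f"field removed: {name}"))
        elif new[name] != old_type:
            findings.append(("Breaking", f"type changed: {name} {old_type} -> {new[name]}"))
    for name in new:
        if name not in old:
            findings.append(("Non-Breaking", f"field added: {name}"))
    return findings


def requires_major_version(findings: list[tuple[str, str]]) -> bool:
    """Any Breaking finding means the change needs a new major version."""
    return any(label == "Breaking" for label, _ in findings)
```

A check like this fits a CI step that diffs the published schema against the proposed one; anything it cannot classify should still default to Breaking, as the rule above states.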
|
||||
|
||||
---
|
||||
|
||||
## 4. Deprecation Process
|
||||
|
||||
### Step-by-Step Deprecation Checklist
|
||||
|
||||
- [ ] **T-0 (Decision day):** Engineering lead approves deprecation. New version confirmed Stable. Sunset date set.
|
||||
- [ ] **T-0:** Update API docs — add deprecation banner to all v[N] endpoint pages.
|
||||
- [ ] **T-0:** Add `Deprecation` and `Sunset` response headers to all v[N] responses (see format below).
|
||||
- [ ] **T-0:** Block new consumer onboarding for v[N] in API gateway and developer portal.
|
||||
- [ ] **T-0:** Send initial deprecation notice to all registered consumers (see Section 5 template).
|
||||
- [ ] **T-0:** Open tracking issue in engineering backlog linking all known consumers to their migration status.
|
||||
- [ ] **30 days before sunset:** Send 30-day warning to all consumers still sending v[N] traffic.
|
||||
- [ ] **7 days before sunset:** Send final warning. If consumer traffic > 100 req/day, escalate directly to their engineering lead.
|
||||
- [ ] **Sunset date:** Switch v[N] routing to return `HTTP 410 Gone` with body pointing to migration guide.
|
||||
- [ ] **30 days after sunset:** Remove routing rules. Archive documentation. Close tracking issue.
|
||||
|
||||
### Deprecation Response Headers
|
||||
|
||||
```http
|
||||
HTTP/1.1 200 OK
|
||||
Deprecation: true
|
||||
Sunset: Fri, 01 Jan 2027 00:00:00 GMT
|
||||
Link: <https://docs.[company].com/api/migration/v1-to-v2>; rel="successor-version"
|
||||
```
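Hand-written HTTP dates are easy to get wrong, in particular the weekday. One way to build these headers is to let the standard library derive the date string, as in this sketch. The function name and the example URL in the usage note are hypothetical, and the `Deprecation` value simply mirrors the template above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(sunset_at: datetime, migration_url: str) -> dict[str, str]:
    """Build the three headers attached to responses from a deprecated version."""
    if sunset_at.tzinfo is None:
        raise ValueError("sunset_at must be timezone-aware")
    sunset_utc = sunset_at.astimezone(timezone.utc)
    return {
        "Deprecation": "true",  # mirrors the template above
        # usegmt=True emits the HTTP-date form ending in "GMT".
        "Sunset": format_datetime(sunset_utc, usegmt=True),
        "Link": f'<{migration_url}>; rel="successor-version"',
    }
```

Calling it with `datetime(2027, 1, 1, tzinfo=timezone.utc)` and a URL such as `https://docs.example.com/api/migration/v1-to-v2` produces a `Sunset` value of `Fri, 01 Jan 2027 00:00:00 GMT`.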
|
||||
|
||||
### Sunset Response Body
|
||||
|
||||
```http
|
||||
HTTP/1.1 410 Gone
|
||||
Content-Type: application/json
|
||||
|
||||
```
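The body that accompanies the 410 response should point consumers to the migration guide. A minimal sketch of such a body follows; the JSON field names (`error`, `message`, `migration_guide`) and the error code are assumptions for illustration, not a format defined by this strategy.

```python
import json


def sunset_response(version: str, successor: str,
                    migration_url: str) -> tuple[int, dict[str, str], str]:
    """Return (status, headers, body) for a request to a sunset version."""
    body = {
        "error": "version_sunset",  # hypothetical error code
        "message": f"API {version} has been sunset. Use {successor} instead.",
        "migration_guide": migration_url,
    }
    return 410, {"Content-Type": "application/json"}, json.dumps(body)
```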
|
||||
|
||||
---
|
||||
|
||||
## 5. Client Communication Templates
|
||||
|
||||
### Initial Deprecation Notice
|
||||
|
||||
```
|
||||
Subject: [Action Required] [Service Name] API v[N] Deprecation — Sunset [Date]
|
||||
|
||||
Hi [Team / Partner Name],
|
||||
|
||||
We are deprecating [Service Name] API v[N], effective [Sunset Date].
|
||||
|
||||
What this means for you:
|
||||
- v[N] continues to work normally until [Sunset Date]
|
||||
- After [Sunset Date], all v[N] requests return HTTP 410 Gone
|
||||
- v[N+1] is available today and fully stable
|
||||
|
||||
Your current usage: approximately [X] requests/day as of [Date].
|
||||
Estimated migration effort: [Small: < 1 day | Medium: 1–3 days | Large: 3–10 days]
|
||||
|
||||
Migration resources:
|
||||
Migration guide: [URL]
|
||||
Changelog: [URL]
|
||||
Office hours: [Date/Time/Link]
|
||||
Support: [Slack channel or email]
|
||||
|
||||
Key dates:
|
||||
[Date] Deprecation announced (today)
|
||||
[Date] New consumer onboarding blocked for v[N]
|
||||
[Date] 30-day warning sent to remaining consumers
|
||||
[Sunset Date] v[N] returns 410 Gone
|
||||
|
||||
Reply to this message or contact us at [channel] with questions.
|
||||
|
||||
[Your Name], [Team Name]
|
||||
```
|
||||
|
||||
### 30-Day Warning
|
||||
|
||||
```
|
||||
Subject: [30 Days Remaining] [Service Name] API v[N] sunsets [Date]
|
||||
|
||||
Hi [Team / Partner Name],
|
||||
|
||||
[Service Name] API v[N] sunsets in 30 days on [Date].
|
||||
|
||||
Your current v[N] traffic: [X] requests/day — migration is not yet complete.
|
||||
|
||||
If you have a technical blocker requiring an extension, contact us before
|
||||
[Date minus 14 days]. Extensions require a documented blocker and a committed
|
||||
migration completion date.
|
||||
|
||||
Migration guide: [URL] | Support: [channel]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Migration Guide Template
|
||||
|
||||
Publish one migration guide per version transition at `docs.[company].com/api/migration/v[N]-to-v[N+1]`.
|
||||
|
||||
```markdown
|
||||
# Migration Guide: v[N] → v[N+1]
|
||||
|
||||
**Estimated effort:** [Small: < 1 day | Medium: 1–3 days | Large: 3–10 days]
|
||||
**Breaking changes in this guide:** [count]
|
||||
|
||||
## Quick Start
|
||||
|
||||
Update your base URL:
|
||||
Before: https://api.[company].com/v[N]/
|
||||
After: https://api.[company].com/v[N+1]/
|
||||
|
||||
## Breaking Changes
|
||||
|
||||
### 1. [Field Rename: user_name → username]
|
||||
|
||||
**Affected endpoints:** `GET /users/{id}`, `POST /users`
|
||||
|
||||
Before (v[N]):
|
||||
{ "user_name": "alice" }
|
||||
|
||||
After (v[N+1]):
|
||||
{ "username": "alice" }
|
||||
|
||||
Migration: Replace all references to `user_name` with `username` in request
|
||||
builders and response parsers.
|
||||
|
||||
### 2. [Next breaking change — repeat structure]
|
||||
|
||||
## New Capabilities in v[N+1]
|
||||
|
||||
| Feature | Description | Docs |
|
||||
|---------|-------------|------|
|
||||
| [Feature name] | [Brief description] | [Link] |
|
||||
|
||||
## SDK Upgrade Reference
|
||||
|
||||
| Language | Package | v[N+1] Version | Install Command |
|
||||
|----------|---------|----------------|-----------------|
|
||||
| Python | `[company]-sdk` | `2.0.0` | `pip install [company]-sdk==2.0.0` |
|
||||
| Node.js | `@[company]/sdk` | `2.0.0` | `npm install @[company]/sdk@2.0.0` |
|
||||
| Go | `github.com/[company]/sdk-go` | `v2.0.0` | `go get github.com/[company]/sdk-go/v2@v2.0.0` |
|
||||
| Java | `com.[company]:sdk` | `2.0.0` | Update pom.xml / build.gradle |
|
||||
|
||||
## Migration Validation Checklist
|
||||
|
||||
- [ ] Base URL updated to v[N+1]
|
||||
- [ ] All renamed fields updated in request serializers
|
||||
- [ ] All renamed fields updated in response deserializers
|
||||
- [ ] Error-handling code updated for new error shape
|
||||
- [ ] Integration tests passing against v[N+1] in staging
|
||||
- [ ] Load test completed against v[N+1] — latency within acceptable range
|
||||
- [ ] Rollback plan documented if issues arise post-cutover
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Version-Specific Documentation
|
||||
|
||||
- Maintain separate documentation pages for each Stable and Deprecated version.
|
||||
- Deprecated version docs carry a persistent banner: "This version is deprecated. Sunset date: [Date]. [Migrate to v[N+1]]."
|
||||
- OpenAPI specs, Protobuf definitions, or GraphQL schemas are tagged and archived per version in the repository under `/api/v[N]/`.
|
||||
- A root-level CHANGELOG.md records every breaking and non-breaking change by version — not buried in commit history.
|
||||
|
||||
---
|
||||
|
||||
## 8. SDK Versioning Alignment
|
||||
|
||||
| API Version | SDK Major Version | SDK GA Date | SDK EOL Date |
|
||||
|-------------|------------------|-------------|--------------|
|
||||
| v[1] | 1.x | [Date] | [API Sunset + 90 days] |
|
||||
| v[2] | 2.x | [Date] | Active |
|
||||
|
||||
- SDK major versions align 1:1 with API major versions.
|
||||
- SDK minor versions track non-breaking API additions.
|
||||
- SDK EOL dates trail API sunset dates by 90 days to give consumers extra runway.
|
||||
- SDKs emit a runtime deprecation warning log line when the underlying API version is Deprecated.
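The runtime warning in the last bullet could be driven by the response headers, as in this sketch. It assumes the SDK can inspect headers on each response and uses Python's `warnings` module; an SDK that prefers plain log lines could call a logger instead. The function name is hypothetical.

```python
import warnings
from typing import Mapping


def warn_if_deprecated(headers: Mapping[str, str], api_version: str) -> bool:
    """Emit a DeprecationWarning when a response carries a Deprecation header."""
    normalized = {key.lower(): value for key, value in headers.items()}
    if "deprecation" not in normalized:
        return False
    sunset = normalized.get("sunset", "an unannounced date")
    warnings.warn(
        f"API {api_version} is deprecated and sunsets on {sunset}; upgrade the SDK.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
    return True
```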
|
||||
|
||||
---
|
||||
|
||||
*Strategy authored by [Team Name] — questions to [Slack channel or email]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Versioning scheme recommendation includes explicit rationale tied to the API type and consumer type provided — not a generic recommendation
|
||||
- [ ] Breaking-change table covers at minimum: field removal, field rename, type change, making optional field required, endpoint removal, enum expansion, and default value change
|
||||
- [ ] Deprecation timeline durations are filled in with concrete values, not left as abstract placeholders
|
||||
- [ ] All three communication artifacts are present: initial deprecation notice, 30-day warning, and migration guide template
|
||||
- [ ] Deprecation response headers are complete: `Deprecation` is present, `Sunset` uses the HTTP-date format with a weekday that matches the date, and `Link` uses a real URL structure
|
||||
- [ ] SDK versioning alignment table is present and ties SDK major versions explicitly to API major versions
|
||||
- [ ] Maximum simultaneous supported versions is stated with a concrete number
|
||||
@@ -0,0 +1,358 @@
|
||||
---
|
||||
name: capacity-planning
|
||||
description: "Produce a capacity planning document for a service covering traffic forecasts, resource requirements, and scaling strategy. Use when asked to plan infrastructure capacity, forecast resource needs, model traffic growth, define scaling strategy, or produce a capacity review for a service. Produces a structured capacity plan covering current baseline metrics, growth projections, resource requirements per tier, scaling strategy, cost projections, capacity triggers, and an infrastructure action roadmap."
|
||||
---
|
||||
|
||||
# Capacity Planning Skill
|
||||
|
||||
Produce a complete capacity planning document for a service. Capacity planning is not about predicting the future exactly — it is about understanding current headroom, modelling growth, and ensuring the team takes infrastructure action before a constraint becomes an incident.
|
||||
|
||||
A good capacity plan answers: what is running out first, how long before it runs out, what does it cost to fix it, and who decides when to act.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and description** — what the service does and who depends on it
|
||||
- **Current traffic and usage metrics** — requests per second (or per day), active users, data volume — whatever units are most natural for this service
|
||||
- **Current resource utilisation** — CPU %, memory %, disk usage, connection pool utilisation, DB query throughput
|
||||
- **Growth rate or projections** — historical growth rate, or known upcoming events (product launch, sales cycle, seasonal peak)
|
||||
- **Tech stack and infrastructure** — cloud provider, compute type (VMs, containers, serverless), database, caching layer, CDN
|
||||
- **Cost constraints** — current infrastructure spend, acceptable cost ceiling, or target cost per unit of traffic
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Capacity Plan: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Last updated:** [Date]
|
||||
**Planning horizon:** [12 months — [Month Year] to [Month Year]]
|
||||
**Review cadence:** [Quarterly]
|
||||
|
||||
---
|
||||
|
||||
## 1. Executive Summary
|
||||
|
||||
[3–5 sentences covering: current state, the most critical capacity constraint, the timeline before it becomes a risk, the recommended action, and the cost implication. Written for an engineering manager or VP who needs the key facts without reading the full document.]
|
||||
|
||||
**Critical finding:** [e.g. "The database connection pool will reach 90% utilisation within 6 weeks at current growth. Without action, this will cause request queueing and latency spikes under normal traffic."]
|
||||
|
||||
**Recommended immediate action:** [e.g. "Increase connection pool limit and add a read replica within the next 2 weeks."]
|
||||
|
||||
**Estimated cost impact:** [e.g. "Recommended changes add ~$[X]/month to infrastructure spend."]
|
||||
|
||||
---
|
||||
|
||||
## 2. Current Baseline
|
||||
|
||||
*All metrics are 30-day averages unless noted. Date captured: [Date]*
|
||||
|
||||
### Traffic
|
||||
|
||||
| Metric | Value | Peak (7-day) | Notes |
|
||||
|---|---|---|---|
|
||||
| Requests per second (avg) | [X req/s] | [X req/s] | [Peak time / day of week] |
|
||||
| Requests per day | [X M/day] | [X M/day] | — |
|
||||
| Active users (DAU/MAU) | [X] / [X] | — | — |
|
||||
| [Service-specific metric — e.g. jobs processed/hour] | [X] | [X] | — |
|
||||
| [Service-specific metric — e.g. GB ingested/day] | [X GB] | [X GB] | — |
|
||||
|
||||
### Compute
|
||||
|
||||
| Resource | Current utilisation | Instance type | Count | Notes |
|
||||
|---|---|---|---|---|
|
||||
| CPU (avg) | [X%] | [e.g. c5.2xlarge] | [X] | Peak: [X%] |
|
||||
| Memory (avg) | [X%] | — | — | Peak: [X%] |
|
||||
| Network egress | [X Mbps] | — | — | — |
|
||||
| Container / pod count | [X] | [e.g. 2 vCPU / 4 GB] | — | Auto-scaling range: [X–Y] |
|
||||
|
||||
### Database
|
||||
|
||||
| Resource | Current utilisation | Spec | Notes |
|
||||
|---|---|---|---|
|
||||
| CPU | [X%] | [e.g. db.r5.2xlarge] | Peak: [X%] |
|
||||
| Memory | [X%] | [X GB RAM] | — |
|
||||
| Storage used | [X GB] of [Y GB] ([Z%]) | [X GB provisioned] | Growth: [~X GB/month] |
|
||||
| IOPS (avg) | [X] of [Y provisioned] | [Y IOPS] | Peak: [X IOPS] |
|
||||
| Connection pool | [X] of [Y max] ([Z%]) | Max connections: [Y] | [ORM pool size: X] |
|
||||
| Query P99 latency | [X ms] | — | [Slowest query: X] |
|
||||
| Read/write ratio | [X%] reads / [Y%] writes | — | — |
|
||||
|
||||
### Cache
|
||||
|
||||
| Resource | Current utilisation | Spec | Notes |
|
||||
|---|---|---|---|
|
||||
| Memory used | [X GB] of [Y GB] ([Z%]) | [e.g. cache.r6g.large] | Eviction rate: [X%] |
|
||||
| Hit rate | [X%] | — | Miss rate: [Y%] |
|
||||
| Connections | [X] | Max: [Y] | — |
|
||||
|
||||
### Storage / Object Store
|
||||
|
||||
| Resource | Current usage | Growth rate | Notes |
|
||||
|---|---|---|---|
|
||||
| [S3 / GCS / Blob] | [X GB / TB] | [~X GB/month] | [Lifecycle policies in place? Y/N] |
|
||||
| Disk (if applicable) | [X GB] of [Y GB] | [~X GB/month] | [RAID / EBS type] |
|
||||
|
||||
### Cost Baseline
|
||||
|
||||
| Component | Current monthly cost | % of total |
|
||||
|---|---|---|
|
||||
| Compute (app servers) | $[X] | [X%] |
|
||||
| Database | $[X] | [X%] |
|
||||
| Cache | $[X] | [X%] |
|
||||
| Storage | $[X] | [X%] |
|
||||
| CDN / bandwidth | $[X] | [X%] |
|
||||
| Other ([describe]) | $[X] | [X%] |
|
||||
| **Total** | **$[X]** | 100% |
|
||||
|
||||
**Unit economics:** $[X] per [1,000 requests / 1,000 users / GB processed]
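
The unit-economics figure can be derived from the cost total and the traffic baseline. The sketch below is illustrative only: the function name, the 30-day month, and the sample figures are assumptions, not values from any real service.

```python
def cost_per_thousand_requests(monthly_cost: float,
                               requests_per_day: float,
                               days_in_month: int = 30) -> float:
    """Return infrastructure cost per 1,000 requests for one month."""
    monthly_requests = requests_per_day * days_in_month
    if monthly_requests <= 0:
        raise ValueError("requests_per_day must be positive")
    return monthly_cost / (monthly_requests / 1000)


# Hypothetical baseline: $12,000/month serving 2M requests/day.
unit_cost = cost_per_thousand_requests(12_000, 2_000_000)
print(f"${unit_cost:.2f} per 1,000 requests")
```

Tracking this number alongside the absolute total shows whether spend is growing faster or slower than traffic.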
|
||||
|
||||
---
|
||||
|
||||
## 3. Growth Projections
|
||||
|
||||
### Assumptions
|
||||
|
||||
| Assumption | Value | Source | Confidence |
|
||||
|---|---|---|---|
|
||||
| Monthly traffic growth rate | [X%] | [Historical trend / product forecast] | [High / Medium / Low] |
|
||||
| Seasonal peak factor | [+X% in [month(s)]] | [Last year's data / expected launch] | [High / Medium] |
|
||||
| Upcoming events | [e.g. Marketing campaign — [Month], expected +[X]% traffic spike] | [Marketing plan] | [Medium] |
|
||||
| User growth | [X new users/month] | [Sales pipeline / growth model] | [Medium] |
|
||||
| Data growth | [X GB/month] | [Current trend] | [High] |
|
||||
|
||||
### Traffic Forecast
|
||||
|
||||
| Timeframe | Req/s (avg) | Req/s (peak) | DAU | Data volume (cumulative) |
|
||||
|---|---|---|---|---|
|
||||
| **Now** (baseline) | [X] | [X] | [X] | [X GB/TB] |
|
||||
| **+3 months** | [X] | [X] | [X] | [X GB/TB] |
|
||||
| **+6 months** | [X] | [X] | [X] | [X GB/TB] |
|
||||
| **+12 months** | [X] | [X] | [X] | [X GB/TB] |
|
||||
|
||||
*Growth formula: [Baseline] × (1 + [monthly rate])^[months] + seasonal adjustment*
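
As a worked illustration of the growth formula, the sketch below compounds a baseline over the horizons used in the forecast table. The function name `forecast_traffic` and the sample inputs (400 req/s, 5% monthly growth) are hypothetical, and the seasonal adjustment is treated as a simple additive term, as the formula is written.

```python
def forecast_traffic(baseline: float, monthly_rate: float, months: int,
                     seasonal_adjustment: float = 0.0) -> float:
    """Return baseline * (1 + monthly_rate) ** months + seasonal_adjustment."""
    return baseline * (1 + monthly_rate) ** months + seasonal_adjustment


# Hypothetical inputs: 400 req/s today, growing 5% per month.
for horizon in (3, 6, 12):
    projected = forecast_traffic(400, 0.05, horizon)
    print(f"+{horizon} months: {projected:.0f} req/s")
```

Because growth compounds, a small error in the monthly rate produces a large error at +12 months, which is why each assumption carries a confidence level.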
|
||||
|
||||
### Capacity Headroom Analysis
|
||||
|
||||
**When does each resource run out at current utilisation and projected growth?**
|
||||
|
||||
| Resource | Current utilisation | Safe ceiling | Headroom remaining | Months to ceiling |
|
||||
|---|---|---|---|---|
|
||||
| App CPU | [X%] | 70% | [X%] | [X months] |
|
||||
| App memory | [X%] | 80% | [X%] | [X months] |
|
||||
| DB CPU | [X%] | 70% | [X%] | [X months] |
|
||||
| DB storage | [X GB] of [Y GB] | 80% = [Z GB] | [X GB] | [X months] |
|
||||
| DB IOPS | [X] of [Y] | 80% = [Z] | [X IOPS] | [X months] |
|
||||
| DB connections | [X] of [Y] | 80% = [Z] | [X] | [X months] |
|
||||
| Cache memory | [X GB] of [Y GB] | 75% = [Z GB] | [X GB] | [X months] |
|
||||
| Storage (object) | [X TB] | No hard limit — cost trigger | — | [Cost trigger: $X/month] |
|
||||
|
||||
**Red flags** (resources hitting ceiling within 3 months):
|
||||
- [Resource]: [current]% → ceiling in [X weeks] — **Action required**
|
||||
- [Resource]: [current]% → ceiling in [X weeks] — **Action required**
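
The "Months to ceiling" column can be estimated in two ways, depending on how the resource grows. The sketch below is a rough model under stated assumptions: percentage-based resources (CPU, connections) are assumed to scale in proportion to compounding traffic, while storage is assumed to grow by a fixed amount per month. Function names and sample values are hypothetical.

```python
import math


def months_to_ceiling_compound(current: float, ceiling: float,
                               monthly_rate: float) -> float:
    """Months until `current` reaches `ceiling` when it compounds monthly."""
    if current >= ceiling:
        return 0.0
    if monthly_rate <= 0:
        return math.inf  # no growth, so the ceiling is never reached
    return math.log(ceiling / current) / math.log(1 + monthly_rate)


def months_to_ceiling_linear(current: float, ceiling: float,
                             growth_per_month: float) -> float:
    """Months until `current` reaches `ceiling` at a fixed monthly increment."""
    if current >= ceiling:
        return 0.0
    if growth_per_month <= 0:
        return math.inf
    return (ceiling - current) / growth_per_month


# Hypothetical: app CPU at 50% with a 70% safe ceiling, traffic growing 5%/month.
print(f"{months_to_ceiling_compound(50, 70, 0.05):.1f} months")
# Hypothetical: DB storage at 500 GB, ceiling 800 GB, growing 25 GB/month.
print(f"{months_to_ceiling_linear(500, 800, 25):.1f} months")
```

Any result under 3 months belongs in the red-flag list.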
|
||||
|
||||
---
|
||||
|
||||
## 4. Resource Requirements
|
||||
|
||||
### Compute Requirements
|
||||
|
||||
| Timeframe | Required instances | Recommended instance type | Auto-scaling range | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Now | [X] | [type] | [min: X, max: Y] | Current configuration |
|
||||
| +3 months | [X] | [type] | [min: X, max: Y] | [Any instance type change needed?] |
|
||||
| +6 months | [X] | [type or upgrade] | [min: X, max: Y] | [Consider [larger type / horizontal scale]] |
|
||||
| +12 months | [X] | [type or upgrade] | [min: X, max: Y] | [State of horizontal vs vertical decision] |
|
||||
|
||||
**Memory headroom target:** Maintain ≥30% available memory at average load; ≥20% at peak.
|
||||
**CPU headroom target:** Maintain ≥30% available CPU at average load; ≥15% at peak.
|
||||
|
||||
### Database Requirements
|
||||
|
||||
| Timeframe | Instance type | Storage | IOPS | Read replica | Notes |
|
||||
|---|---|---|---|---|---|
|
||||
| Now | [type] | [X GB] | [X] | [Y/N] | Current |
|
||||
| +3 months | [type] | [X GB] | [X] | [Y/N] | [Upgrade storage / IOPS] |
|
||||
| +6 months | [type or upgrade] | [X GB] | [X] | **Yes** | [Read replica recommended by this point] |
|
||||
| +12 months | [type] | [X GB] | [X] | [X replicas] | [Consider sharding / partitioning at this scale] |
|
||||
|
||||
**Storage growth management:**
|
||||
- Current growth: [~X GB/month]
|
||||
- Storage auto-scaling: [Enabled / Not enabled — enable by [date]]
|
||||
- Archiving policy: [Records older than X months moved to [cold storage / archive tier]]
|
||||
|
||||
### Cache Requirements
|
||||
|
||||
| Timeframe | Node type | Nodes | Memory | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Now | [type] | [X] | [X GB] | Current |
|
||||
| +6 months | [type] | [X] | [X GB] | [Scale out or upgrade] |
|
||||
| +12 months | [type] | [X] | [X GB] | [Cluster mode if >Y GB required] |
|
||||
|
||||
---
|
||||
|
||||
## 5. Scaling Strategy
|
||||
|
||||
### Compute — Horizontal Scaling
|
||||
|
||||
**Decision: [Horizontal / Vertical / Both]**
|
||||
|
||||
[State the scaling strategy and the reasoning. E.g. "The application is stateless and CPU-bound; horizontal scaling is preferred. Vertical scaling is a short-term fallback only."]
|
||||
|
||||
**Auto-scaling configuration:**
|
||||
|
||||
```
|
||||
Scale-out trigger: CPU > [X%] for [Y minutes] OR memory > [X%] for [Y minutes]
|
||||
Scale-in trigger: CPU < [X%] for [Y minutes] AND memory < [X%] for [Y minutes]
|
||||
Min instances: [X] (ensures HA across [X] AZs)
|
||||
Max instances: [Y] (cost ceiling)
|
||||
Cooldown period: [X seconds]
|
||||
Warmup time: [X seconds] (time for new instance to be healthy)
|
||||
```
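
The asymmetry in the configuration above (scale out when either signal is high, scale in only when both are low) can be sketched as follows. This is a simplified model, not any cloud provider's API: it steps one instance at a time and ignores the sustained-duration, cooldown, and warmup settings. All names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    cpu_out: float  # scale out above this CPU %
    mem_out: float  # scale out above this memory %
    cpu_in: float   # scale in below this CPU %
    mem_in: float   # scale in below this memory %


def desired_instances(cpu: float, mem: float, current: int,
                      min_instances: int, max_instances: int,
                      t: Thresholds) -> int:
    """Return the next instance count, clamped to the min/max range."""
    if cpu > t.cpu_out or mem > t.mem_out:      # either signal is enough
        return min(current + 1, max_instances)
    if cpu < t.cpu_in and mem < t.mem_in:       # both must be low
        return max(current - 1, min_instances)
    return current


limits = Thresholds(cpu_out=70, mem_out=80, cpu_in=30, mem_in=40)
print(desired_instances(cpu=75, mem=50, current=4,
                        min_instances=2, max_instances=10, t=limits))
```

Requiring both signals before scaling in reduces flapping when one resource is idle but the other is still under load.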
|
||||
|
||||
**Limits of horizontal scaling:**
|
||||
- [e.g. Database connection pool is the current bottleneck — adding more app instances without increasing DB connections will not help]
|
||||
- [e.g. Session affinity required for WebSocket connections — limits pure stateless scaling]
|
||||
|
||||
### Database — Read Scaling
|
||||
|
||||
**Strategy:** [Read replica / Connection pooling via PgBouncer / Query caching / None needed yet]
|
||||
|
||||
**When to add a read replica:**
|
||||
- DB CPU sustained >60% for >30 minutes, OR
|
||||
- Read query P95 latency >50ms, OR
|
||||
- Connection pool utilisation >70%
|
||||
|
||||
**Connection pooling:**
|
||||
- Pooler: [PgBouncer / RDS Proxy / application-level / not configured]
|
||||
- Pool size: [X connections per app instance × Y instances = Z total]
|
||||
- Max DB connections: [configured to Z + 20% headroom]
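
The pool-size arithmetic above can be checked with a short calculation. The sketch assumes the instance count used is the auto-scaling maximum rather than the current count, since that is when connection demand peaks. The function name and figures are hypothetical.

```python
def required_max_connections(pool_per_instance: int, max_instances: int,
                             headroom_pct: int = 20) -> int:
    """Total app connections at full scale-out, plus headroom (rounded up)."""
    total = pool_per_instance * max_instances
    headroom = (total * headroom_pct + 99) // 100  # integer ceiling division
    return total + headroom


# Hypothetical: 20 connections per instance, auto-scaling max of 12 instances.
print(required_max_connections(20, 12))
```

If the result exceeds what the database instance class supports, a pooler in front of the database is usually the next step before upgrading the instance.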
|
||||
|
||||
### Caching Strategy
|
||||
|
||||
**Cache policy:** [Cache-aside / Write-through / Write-behind]
|
||||
**TTL strategy:**
|
||||
|
||||
| Data type | TTL | Invalidation method |
|
||||
|---|---|---|
|
||||
| [e.g. User profile] | [5 minutes] | [Explicit invalidation on update] |
|
||||
| [e.g. Product catalog] | [1 hour] | [TTL expiry — eventual consistency acceptable] |
|
||||
| [e.g. Session data] | [24 hours] | [Explicit invalidation on logout] |
|
||||
|
||||
**Cache miss handling:** [Describe what happens on a cache miss — does it fall through gracefully or cause a thundering herd risk?]
|
||||
|
||||
---
|
||||
|
||||
## 6. Cost Projections
|
||||
|
||||
### Infrastructure Cost Forecast
|
||||
|
||||
| Component | Now (monthly) | +3 months | +6 months | +12 months |
|
||||
|---|---|---|---|---|
|
||||
| Compute | $[X] | $[X] | $[X] | $[X] |
|
||||
| Database | $[X] | $[X] | $[X] | $[X] |
|
||||
| Cache | $[X] | $[X] | $[X] | $[X] |
|
||||
| Storage | $[X] | $[X] | $[X] | $[X] |
|
||||
| CDN / bandwidth | $[X] | $[X] | $[X] | $[X] |
|
||||
| **Total** | **$[X]** | **$[X]** | **$[X]** | **$[X]** |
|
||||
| MoM growth % | — | [X%] | [X%] | [X%] |
|
||||
|
||||
**Unit economics trend:**
|
||||
|
||||
| Timeframe | Cost per 1k requests | Cost per user/month | Notes |
|
||||
|---|---|---|---|
|
||||
| Now | $[X] | $[X] | Baseline |
|
||||
| +6 months | $[X] | $[X] | [Improving / worsening — why] |
|
||||
| +12 months | $[X] | $[X] | [Target: $X per 1k requests] |
|
||||
|
||||
**Cost optimisation opportunities:**
|
||||
|
||||
| Opportunity | Estimated saving | Effort | Timeline |
|
||||
|---|---|---|---|
|
||||
| [e.g. Reserved instances for baseline compute] | $[X/month] | Low | Immediate |
|
||||
| [e.g. S3 lifecycle policy — move objects >90 days to Glacier] | $[X/month] | Low | This sprint |
|
||||
| [e.g. Right-size [instance] — current is overprovisioned] | $[X/month] | Low | This sprint |
|
||||
| [e.g. Optimise top-5 slow queries — reduce DB compute need] | $[X/month] | Medium | Next quarter |
|
||||
|
||||
---
|
||||
|
||||
## 7. Capacity Triggers and Actions
|
||||
|
||||
Define the thresholds that trigger explicit action before a constraint becomes an incident, rather than retrospective fixes afterwards.
|
||||
|
||||
| Resource | Watch (amber) | Act (red — schedule work) | Emergency (incident risk) |
|
||||
|---|---|---|---|
|
||||
| App CPU (sustained avg) | >60% | >70% | >85% |
|
||||
| App memory | >70% | >80% | >90% |
|
||||
| DB CPU | >55% | >65% | >80% |
|
||||
| DB storage | >65% | >75% | >85% |
|
||||
| DB connections | >60% | >70% | >85% |
|
||||
| Cache hit rate (eviction pressure) | Hit rate <90% | Hit rate <85% | Hit rate <75% |
|
||||
| Error rate | >0.5% | >1% | >2% |
|
||||
| P99 latency | >2× baseline | >3× baseline | >5× baseline |
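
The three-level scheme in the table can be expressed as a small classifier. This is a sketch only: the function name and the returned labels are hypothetical, and the `higher_is_worse` flag is an assumption added to cover metrics such as cache hit rate, where a lower value is the problem.

```python
def classify(value: float, watch: float, act: float, emergency: float,
             higher_is_worse: bool = True) -> str:
    """Map a metric reading to 'ok', 'watch', 'act', or 'emergency'."""
    if not higher_is_worse:
        # Flip signs so the same "greater than" comparisons apply.
        value, watch, act, emergency = -value, -watch, -act, -emergency
    if value > emergency:
        return "emergency"
    if value > act:
        return "act"
    if value > watch:
        return "watch"
    return "ok"


# App CPU at 72% against the 60 / 70 / 85 thresholds.
print(classify(72, watch=60, act=70, emergency=85))
# Cache hit rate at 88% against the 90 / 85 / 75 thresholds.
print(classify(88, watch=90, act=85, emergency=75, higher_is_worse=False))
```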
|
||||
|
||||
**When a Watch threshold is crossed:**
|
||||
- The engineer who observes it creates a ticket with a capacity label
|
||||
- Ticket reviewed in next sprint planning
|
||||
|
||||
**When an Act threshold is crossed:**
|
||||
- On-call engineer creates a ticket marked P2
|
||||
- Tech lead reviews within 24 hours
|
||||
- Action plan documented and scheduled within 1 sprint
|
||||
|
||||
**When an Emergency threshold is crossed:**
|
||||
- Treat as a potential incident — page on-call
|
||||
- Emergency scaling actions taken immediately (see runbook)
|
||||
- Root cause investigation starts within 2 hours
|
||||
|
||||
**Emergency scaling runbook:** [Link to oncall-runbook for capacity incidents]
|
||||
|
||||
---
|
||||
|
||||
## 8. Infrastructure Action Roadmap
|
||||
|
||||
### Immediate Actions (next 2 weeks)
|
||||
|
||||
| Action | Owner | Effort | Justification |
|
||||
|---|---|---|---|
|
||||
| [e.g. Increase DB connection pool limit to X] | [Name] | [2 hours] | [DB connections at X% — hitting ceiling in X weeks] |
|
||||
| [e.g. Enable storage auto-scaling on RDS] | [Name] | [30 min] | [Storage at X% — prevents emergency at X months] |
|
||||
| [e.g. Add S3 lifecycle policy for [bucket]] | [Name] | [1 hour] | [Storage growing at $X/month unnecessarily] |
|
||||
|
||||
### This Quarter (within 3 months)
|
||||
|
||||
| Action | Owner | Effort | Justification |
|
||||
|---|---|---|---|
|
||||
| [e.g. Add read replica to production DB] | [Name] | [1 day] | [DB CPU projected to hit 65% in 2 months] |
|
||||
| [e.g. Increase max auto-scaling limit from X to Y] | [Name] | [2 hours] | [Current max is too close to expected peak] |
|
||||
| [e.g. Configure PgBouncer for connection pooling] | [Name] | [3 days] | [Reduce per-connection overhead; headroom for growth] |
|
||||
|
||||
### Next Quarter (3–6 months)
|
||||
|
||||
| Action | Owner | Effort | Justification |
|
||||
|---|---|---|---|
|
||||
| [e.g. Upgrade DB instance class — [current] → [next]] | [Name] | [2 hours — blue/green] | [DB CPU projected to hit 70% by Q[X]] |
|
||||
| [e.g. Implement caching for [high-read endpoint]] | [Name] | [1 week] | [Reduce DB read load by estimated [X%]] |
|
||||
| [e.g. Evaluate horizontal DB sharding] | [Name] | [2 weeks (spike)] | [At 12-month projections, single DB hits limits] |
|
||||
|
||||
### Horizon (6–12 months)
|
||||
|
||||
| Action | Description | Trigger condition |
|
||||
|---|---|---|
|
||||
| [e.g. Multi-region deployment] | [Active-passive setup in eu-west-2] | [DAU exceeds X or SLA requires 99.99%] |
|
||||
| [e.g. Database sharding or migration to distributed DB] | [Evaluate CockroachDB / Vitess] | [Single-node DB projected to hit ceiling] |
|
||||
| [e.g. CDN expansion] | [Add PoPs in [region]] | [Latency SLO breached for [geography]] |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every resource has a quantified current utilisation and a projected months-to-ceiling — no hand-waving
|
||||
- [ ] The most critical constraint is called out in the executive summary with a specific timeline
|
||||
- [ ] Growth projections state their assumptions and confidence level — not presented as certainties
|
||||
- [ ] Capacity triggers define Watch, Act, and Emergency thresholds and name who acts at each level
|
||||
- [ ] Cost projections include unit economics, not just absolute totals
|
||||
- [ ] The infrastructure roadmap has named owners and effort estimates — not just a wish list
|
||||
- [ ] Auto-scaling configuration includes both scale-out AND scale-in triggers, and a min/max range
|
||||
- [ ] Actions are ordered by urgency — immediate items are genuinely immediate, not backlog filler
|
||||
@@ -0,0 +1,179 @@
|
||||
---
|
||||
name: churn-analysis
|
||||
description: "Analyse customer churn for a product or cohort and produce a structured churn report. Use when asked to analyse churn, understand why customers are leaving, identify churn patterns, calculate churn rate, or build a churn reduction plan. Produces a churn analysis with rate calculations, categorised reasons, early warning signals, and prioritised interventions."
|
||||
---
|
||||
|
||||
# Churn Analysis Skill
|
||||
|
||||
Produce a structured churn analysis that goes beyond the headline rate — identifying why customers leave, which segments are most at risk, and what interventions will have the highest impact on retention.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Time period** being analysed (e.g. Q1, last 12 months)
|
||||
- **Total customers at start of period** and **customers churned**
|
||||
- **ARR or revenue lost** to churn
|
||||
- **Churn reasons data** — exit survey results, CSM notes, support data, or sales loss reasons
|
||||
- **Customer segments** — by tier, industry, cohort, or product line
|
||||
- **Current retention rate** if known
|
||||
- **Any recent changes** — pricing, product, support model — that may have affected churn
|
||||
|
||||
## Churn Categories
|
||||
|
||||
Always classify churn before analysing it:
|
||||
|
||||
| Category | Definition |
|
||||
|---|---|
|
||||
| **Voluntary — avoidable** | Customer left due to a problem we could have addressed (product gaps, poor onboarding, relationship failures) |
|
||||
| **Voluntary — unavoidable** | Customer left for reasons outside our control (budget cuts, acquisition, company shutdown) |
|
||||
| **Involuntary** | Payment failure, accidental contract non-renewal, admin error |
|
||||
|
||||
The interventions for each category are different. Conflating them leads to wrong conclusions.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Churn Analysis: [Product / Segment / Company]
|
||||
**Period:** [Start date] — [End date]
|
||||
**Prepared by:** [Name] | **Date:** [Date]
|
||||
|
||||
---
|
||||
|
||||
## Headline Numbers
|
||||
|
||||
| Metric | Value |
|
||||
|---|---|
|
||||
| Customers at start of period | [N] |
|
||||
| Customers churned | [N] |
|
||||
| **Customer churn rate** | **[X]%** |
|
||||
| ARR at start of period | £/$/€[X] |
|
||||
| ARR lost to churn | £/$/€[X] |
|
||||
| **Revenue churn rate (gross)** | **[X]%** |
|
||||
| ARR from expansions (same period) | £/$/€[X] |
|
||||
| **Net revenue retention (NRR)** | **[X]%** |
|
||||
|
||||
**Benchmark context:**
|
||||
- Customer churn rate: [X]% vs. industry benchmark [Y]% — [above / below / in line]
|
||||
- NRR: [X]% — [What this means: above 100% = expansion offsets churn; below 100% = shrinking base]
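
The three headline rates follow directly from the inputs in the table. The sketch below uses hypothetical function names and sample figures. One assumption to note: NRR conventionally also subtracts contraction (downgrade) ARR, which the table does not list, so it appears here as an optional argument defaulting to zero.

```python
def customer_churn_rate(start_customers: int, churned: int) -> float:
    """Churned customers as a % of customers at the start of the period."""
    return 100 * churned / start_customers


def gross_revenue_churn(start_arr: float, arr_lost: float) -> float:
    """ARR lost to churn as a % of ARR at the start of the period."""
    return 100 * arr_lost / start_arr


def net_revenue_retention(start_arr: float, arr_lost: float,
                          expansion_arr: float,
                          contraction_arr: float = 0.0) -> float:
    """(Start ARR - churn - contraction + expansion) as a % of start ARR."""
    retained = start_arr - arr_lost - contraction_arr + expansion_arr
    return 100 * retained / start_arr


# Hypothetical period: 400 customers, 30 churned, 2.0M ARR, 120k lost, 200k expansion.
print(customer_churn_rate(400, 30))
print(gross_revenue_churn(2_000_000, 120_000))
print(net_revenue_retention(2_000_000, 120_000, 200_000))
```

Dividing by the starting base, not the end-of-period total, keeps the rates comparable across periods.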
|
||||
|
||||
---
|
||||
|
||||
## Churn Breakdown by Category
|
||||
|
||||
| Category | Customers | % of churn | ARR lost |
|
||||
|---|---|---|---|
|
||||
| Voluntary — avoidable | [N] | [X]% | £/$/€[X] |
|
||||
| Voluntary — unavoidable | [N] | [X]% | £/$/€[X] |
|
||||
| Involuntary | [N] | [X]% | £/$/€[X] |
|
||||
| **Total** | **[N]** | **100%** | **£/$/€[X]** |
|
||||
|
||||
**Avoidable churn as % of total churn:** [X]% — this is the number we can actually influence.
|
||||
|
||||
---
|
||||
|
||||
## Churn Reasons — Avoidable Churn Only
|
||||
|
||||
Rank by frequency. Include ARR weight where data allows.
|
||||
|
||||
| Reason | Count | % of avoidable churn | ARR lost | Representative quote |
|
||||
|---|---|---|---|---|
|
||||
| [Reason 1 — e.g. "Product missing key feature"] | [N] | [X]% | £/$/€[X] | "[Quote]" |
|
||||
| [Reason 2] | [N] | [X]% | £/$/€[X] | "[Quote]" |
|
||||
| [Reason 3] | [N] | [X]% | £/$/€[X] | "[Quote]" |
|
||||
| [Reason 4] | [N] | [X]% | £/$/€[X] | "[Quote]" |
|
||||
| Other | [N] | [X]% | £/$/€[X] | — |
|
||||
|
||||
**Theme synthesis:** [2–3 sentences grouping the top reasons into 2–3 themes. E.g. "The top three reasons cluster around two themes: product gaps in [area] (affecting X% of avoidable churn) and onboarding failures where customers never achieved value (Y%)."]
|
||||
|
||||
---
|
||||
|
||||
## Churn by Segment
|
||||
|
||||
Identify which segments over- or under-index for churn.
|
||||
|
||||
### By Tier
|
||||
|
||||
| Tier | Churn rate | vs. Overall | Notes |
|
||||
|---|---|---|---|
|
||||
| Enterprise | [X]% | +/-[X]pp | |
|
||||
| Mid-Market | [X]% | +/-[X]pp | |
|
||||
| SMB | [X]% | +/-[X]pp | |
|
||||
|
||||
### By Cohort (Acquisition Year)
|
||||
|
||||
| Cohort | Churn rate | Notes |
|
||||
|---|---|---|
|
||||
| [Year 1] | [X]% | |
|
||||
| [Year 2] | [X]% | |
|
||||
| [Year 3] | [X]% | |
|
||||
|
||||
### By Industry / Use Case (if data available)
|
||||
|
||||
| Segment | Churn rate | Notes |
|
||||
|---|---|---|
|
||||
| [Segment 1] | [X]% | |
|
||||
| [Segment 2] | [X]% | |
|
||||
|
||||
**Key pattern:** [Which segment has the highest churn rate and what likely explains it]
|
||||
|
||||
---
|
||||
|
||||
## Timing Analysis
|
||||
|
||||
- **Average contract length before churn:** [X months]
|
||||
- **Highest-risk moment:** [e.g. "Month 3 — when trial value has worn off but full adoption hasn't happened"]
|
||||
- **Churn timing distribution:**
|
||||
|
||||
| When churn occurred | % of churned accounts |
|
||||
|---|---|
|
||||
| 0–3 months | [X]% |
|
||||
| 3–6 months | [X]% |
|
||||
| 6–12 months | [X]% |
|
||||
| 12+ months | [X]% |
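
The distribution can be computed from each churned account's tenure in months. Because the bucket labels share boundaries (3, 6, and 12 appear twice), this sketch assumes half-open intervals: an account that churned at exactly 3 months falls in "3–6 months". The function name and sample data are hypothetical.

```python
from typing import Dict, List


def timing_distribution(tenures_months: List[float]) -> Dict[str, float]:
    """Return the % of churned accounts falling in each tenure bucket."""
    if not tenures_months:
        raise ValueError("no churned accounts supplied")
    counts = {"0–3 months": 0, "3–6 months": 0, "6–12 months": 0, "12+ months": 0}
    for tenure in tenures_months:
        if tenure < 3:
            counts["0–3 months"] += 1
        elif tenure < 6:
            counts["3–6 months"] += 1
        elif tenure < 12:
            counts["6–12 months"] += 1
        else:
            counts["12+ months"] += 1
    total = len(tenures_months)
    return {bucket: 100 * n / total for bucket, n in counts.items()}


# Hypothetical tenures (in months) for eight churned accounts.
print(timing_distribution([1, 2, 4, 7, 13, 2, 5, 30]))
```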
|
||||
|
||||
---
|
||||
|
||||
## Early Warning Signals
|
||||
|
||||
Based on the churned accounts, identify the signals that preceded churn (and could have triggered earlier intervention):
|
||||
|
||||
| Signal | Lead time before churn | How to detect |
|
||||
|---|---|---|
|
||||
| [Signal 1 — e.g. "DAU/MAU dropped below 15%"] | [~X weeks] | [Usage dashboard / alert] |
|
||||
| [Signal 2 — e.g. "No QBR in 90+ days"] | [~X weeks] | [CRM flag] |
|
||||
| [Signal 3 — e.g. "Champion left the account"] | [~X weeks] | [LinkedIn alert / CSM tracking] |
|
||||
| [Signal 4] | [~X weeks] | [Detection method] |
|
||||
|
||||
---
|
||||
|
||||
## Intervention Recommendations
|
||||
|
||||
Ranked by estimated impact × feasibility.
|
||||
|
||||
| Intervention | Addresses | Est. churn reduction | Effort | Owner |
|
||||
|---|---|---|---|---|
|
||||
| [Intervention 1 — e.g. "Improve onboarding for [segment] with dedicated 30-day check-in"] | [Reason 1] | [X accounts / £X ARR] | Low / Med / High | [Team] |
|
||||
| [Intervention 2] | [Reason 2] | [X accounts / £X ARR] | Low / Med / High | [Team] |
|
||||
| [Intervention 3] | [Reason 3] | [X accounts / £X ARR] | Low / Med / High | [Team] |
|
||||
|
||||
**Priority call:** [Which one intervention, if implemented this quarter, would have the biggest impact and why]
|
||||
|
||||
---
|
||||
|
||||
## What We Don't Know (Data Gaps)
|
||||
|
||||
- [Data gap 1 — e.g. "Exit survey response rate is only 30% — the reasons data may not be representative"]
|
||||
- [Data gap 2 — e.g. "No product usage data for SMB tier — can't confirm usage signal correlation"]
|
||||
- [Data gap 3]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Churn rate is correctly calculated (churned ÷ starting cohort, not end-of-period total)
|
||||
- [ ] Avoidable and unavoidable churn are separated — interventions target avoidable churn only
|
||||
- [ ] Churn reasons are customer-reported, not internally assumed
|
||||
- [ ] Segment analysis identifies which segments over-index — not just averages
|
||||
- [ ] Early warning signals are specific and detectable, not generic ("low engagement")
|
||||
- [ ] Interventions link directly to the top churn reasons — no recommendations without a root cause match
|
||||
@@ -0,0 +1,301 @@
|
||||
---
|
||||
name: cicd-playbook
|
||||
description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities."
|
||||
---
|
||||
|
||||
# CI/CD Playbook Skill
|
||||
|
||||
Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.
|
||||
|
||||
A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and brief description
|
||||
- **Tech stack** — language, framework, containerisation (Docker, etc.)
|
||||
- **Source control** — GitHub / GitLab / Bitbucket, branching strategy
|
||||
- **CI platform** — GitHub Actions / CircleCI / Jenkins / BuildKite / other
|
||||
- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
|
||||
- **Environments** — e.g. dev, staging, production (and any canary / feature environments)
|
||||
- **Deployment frequency** — how often does the team ship?
|
||||
- **Any existing gates** — manual approvals, smoke tests, feature flags
|
||||
- **On-call setup** — who's responsible during deploys?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# CI/CD Playbook: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Last updated:** [Date] | **Owner:** [Name / role]
|
||||
**Pipeline platform:** [CI tool] → [CD tool / platform]
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]
|
||||
|
||||
**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand]
|
||||
**Average pipeline duration:** [X minutes]
|
||||
**Rollback time (p95):** [X minutes]
|
||||
|
||||
---
|
||||
|
||||
## Pipeline Stages
|
||||
|
||||
```
|
||||
[Branch push]
|
||||
│
|
||||
▼
|
||||
[1. Build & Lint] ──fail──▶ ❌ Block PR
|
||||
│
|
||||
▼
|
||||
[2. Unit Tests] ──fail──▶ ❌ Block PR
|
||||
│
|
||||
▼
|
||||
[3. Integration Tests] ──fail──▶ ❌ Block PR
|
||||
│
|
||||
▼
|
||||
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
|
||||
│
|
||||
▼
|
||||
[5. Build Artefact / Container Image]
|
||||
│
|
||||
▼
|
||||
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
|
||||
│
|
||||
▼
|
||||
[7. Smoke Tests (Staging)]
|
||||
│
|
||||
▼
|
||||
[8. Manual Approval Gate] ──(if required)
|
||||
│
|
||||
▼
|
||||
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
|
||||
│
|
||||
▼
|
||||
[10. Post-deploy checks]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Stage Definitions
|
||||
|
||||
### Stage 1 — Build & Lint
|
||||
|
||||
**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
|
||||
**Trigger:** Every commit to any branch
|
||||
**Blocking:** Yes — PR cannot be merged if this fails
|
||||
**Typical duration:** [X minutes]
|
||||
**Owner if it fails:** PR author
|
||||
|
||||
**Common failure causes:**
|
||||
- [e.g. Missing dependency — run `npm install` locally before pushing]
|
||||
- [e.g. Lint rule violation: run `npm run lint -- --fix` to auto-fix most issues]
|
||||
|
||||
---
|
||||
|
||||
### Stage 2 — Unit Tests
|
||||
|
||||
**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`]
|
||||
**Coverage gate:** [X]% minimum — pipeline fails below this threshold
|
||||
**Trigger:** Every commit
|
||||
**Blocking:** Yes
|
||||
**Typical duration:** [X minutes]
|
||||
|
||||
**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
|
||||
|
||||
---
|
||||
|
||||
### Stage 3 — Integration Tests
|
||||
|
||||
**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
|
||||
**Environment:** [Ephemeral test environment / shared test DB / etc.]
|
||||
**Trigger:** Every commit to `main` and feature branches targeting `main`
|
||||
**Blocking:** Yes
|
||||
**Typical duration:** [X minutes]
|
||||
|
||||
**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"]
|
||||
|
||||
---
|
||||
|
||||
### Stage 4 — Security Scan
|
||||
|
||||
**Tools:** [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
|
||||
**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies]
|
||||
**Blocking on:** Critical and High severity findings
|
||||
**Non-blocking on:** Medium and Low (flagged, not blocking)
|
||||
**Trigger:** Every commit to `main`
|
||||
|
||||
**How to handle a flagged vulnerability:**
|
||||
1. Check if a fix is available — upgrade the dependency
|
||||
2. If no fix available, open a security ticket and add a suppression with justification
|
||||
3. Never suppress without a ticket and owner
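
The blocking rule for this stage (Critical and High block the pipeline, Medium and Low are flagged only) can be sketched as a gate over a list of finding severities. This does not use any scanner's real output format: the severity strings and the function name are assumptions for illustration.

```python
from typing import Dict, List

BLOCKING_SEVERITIES = {"critical", "high"}


def evaluate_scan(severities: List[str]) -> Dict[str, object]:
    """Split findings into blocking and warning sets and decide the gate."""
    normalised = [s.lower() for s in severities]
    blocking = [s for s in normalised if s in BLOCKING_SEVERITIES]
    warnings = [s for s in normalised if s not in BLOCKING_SEVERITIES]
    return {"blocked": bool(blocking), "blocking": blocking, "warnings": warnings}


# Hypothetical scan result with one High and two lower-severity findings.
result = evaluate_scan(["High", "medium", "low"])
print(result["blocked"], result["warnings"])
```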
|
||||
|
||||
---
|
||||
|
||||
### Stage 5 — Build Artefact
|
||||
|
||||
**What is produced:** [Docker image / binary / zip — be specific]
|
||||
**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL]
|
||||
**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`)
|
||||
**Trigger:** Commits to `main` only (not feature branches)
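
The tagging convention above can be sketched as a helper that returns the tags to push for a given build. The function name and the sample service name and SHA are hypothetical.

```python
from typing import List


def image_tags(service_name: str, git_sha: str, branch: str) -> List[str]:
    """Return the image tags for a build: always the SHA, plus latest on main."""
    tags = [f"{service_name}:{git_sha}"]
    if branch == "main":
        tags.append(f"{service_name}:latest")
    return tags


print(image_tags("orders-api", "3f2a9c1", "main"))
```

Deploying by the SHA tag rather than `latest` keeps every release traceable to a specific commit.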
|
||||
|
||||
---
|
||||
|
||||
### Stage 6 — Deploy to Staging
|
||||
|
||||
**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
|
||||
**Staging URL:** [URL]
|
||||
**Trigger:** Automatic on successful artefact build from `main`
|
||||
**Who can deploy to staging:** Any engineer (automatic)
|
||||
|
||||
**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
|
||||
**Staging is not production:** [Any differences in config, scale, or data — state them here]
|
||||
|
||||
---
|
||||
|
||||
### Stage 7 — Smoke Tests (Staging)
|
||||
|
||||
**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
|
||||
**Tool:** [e.g. Playwright / Postman / custom script]
|
||||
**Pass criteria:** All smoke tests pass within [X seconds] timeout
|
||||
**Blocking:** Yes — production deploy will not proceed if smoke tests fail
|
||||
|
||||
**Smoke test suite location:** [Link to test files or folder]
|
||||
|
||||
---
|
||||
|
||||
### Stage 8 — Manual Approval Gate
|
||||
|
||||
**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
|
||||
**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer]
|
||||
**Approval timeout:** [e.g. 24 hours — auto-cancelled if no approval]
|
||||
**How to approve:** [GitHub Actions approve step / Slack command / other — with link]
|
||||
|
||||
**When to withhold approval:**
|
||||
- Active incident in production
|
||||
- Deploy is outside the deployment window (see below)
|
||||
- On-call engineer has not been notified
|
||||
|
||||
---
|
||||
|
||||
### Stage 9 — Deploy to Production
|
||||
|
||||
**Deployment method:** [Same as staging or different — specify]
|
||||
**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
|
||||
**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
|
||||
**Deployment notifications:** [Slack channel — #deployments]
|
||||
|
||||
**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass.
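A deployment window like the example above can be enforced in the pipeline rather than by convention. This is a minimal sketch, assuming the example window (Monday to Thursday, 09:00 to 16:00 UTC); the names `in_deploy_window`, `WINDOW_DAYS`, `WINDOW_START` and `WINDOW_END` are hypothetical, and bank holidays are not handled.

```python
from datetime import datetime, time, timezone

# Assumed window, taken from the example in the template: Mon-Thu, 09:00-16:00 UTC.
WINDOW_DAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
WINDOW_START = time(9, 0)
WINDOW_END = time(16, 0)


def in_deploy_window(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside the window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    # The end of the window is exclusive: a deploy at 16:00 sharp is refused.
    return utc.weekday() in WINDOW_DAYS and WINDOW_START <= utc.time() < WINDOW_END


if __name__ == "__main__":
    print(in_deploy_window(datetime.now(timezone.utc)))
```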
|
||||
|
||||
---
|
||||
|
||||
### Stage 10 — Post-Deploy Checks
|
||||
|
||||
**Automated checks (run for [X minutes] after deploy):**
|
||||
- [ ] Error rate: <[X]% (baseline: [Y]%)
|
||||
- [ ] P99 latency: <[X]ms (baseline: [Y]ms)
|
||||
- [ ] [Key business metric]: within [X]% of baseline
|
||||
|
||||
**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link]
|
||||
|
||||
**If a check fails:** See Rollback Procedure below.
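The three automated checks above reduce to simple threshold comparisons. The sketch below shows one possible shape, assuming the observed values have already been fetched from the monitoring tool; the function names and all numbers in the usage example are hypothetical.

```python
def within_pct(observed: float, baseline: float, tolerance_pct: float) -> bool:
    """True if observed deviates from baseline by at most tolerance_pct percent."""
    return abs(observed - baseline) <= abs(baseline) * tolerance_pct / 100


def post_deploy_failures(
    error_rate_pct: float,
    p99_ms: float,
    business_metric: float,
    *,
    max_error_rate_pct: float,
    max_p99_ms: float,
    metric_baseline: float,
    metric_tolerance_pct: float,
) -> list[str]:
    """Return the names of the checks that failed (empty list means healthy)."""
    failures = []
    if error_rate_pct >= max_error_rate_pct:
        failures.append("error rate")
    if p99_ms >= max_p99_ms:
        failures.append("p99 latency")
    if not within_pct(business_metric, metric_baseline, metric_tolerance_pct):
        failures.append("business metric")
    return failures


if __name__ == "__main__":
    # Hypothetical readings taken a few minutes after a deploy.
    print(post_deploy_failures(
        0.4, 620.0, 940.0,
        max_error_rate_pct=1.0, max_p99_ms=500.0,
        metric_baseline=1000.0, metric_tolerance_pct=10.0,
    ))
```

A non-empty result would be the trigger for the rollback procedure.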
|
||||
|
||||
---
|
||||
|
||||
## Environments
|
||||
|
||||
| Environment | Purpose | Deploy trigger | URL | Data |
|
||||
|---|---|---|---|---|
|
||||
| **Dev** | Local development | Manual | localhost | Seeded test data |
|
||||
| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
|
||||
| **Production** | Live traffic | Manual approval | [URL] | Live data |
|
||||
|
||||
---
|
||||
|
||||
## Branching Strategy
|
||||
|
||||
**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly]
|
||||
|
||||
| Branch | Purpose | Who merges | Deploy target |
|
||||
|---|---|---|---|
|
||||
| `main` | Production-ready code | PR + review | Staging → Production |
|
||||
| `feature/*` | Feature development | Author | None (CI only) |
|
||||
| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval |
|
||||
|
||||
**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]
|
||||
|
||||
---
|
||||
|
||||
## Rollback Procedure
|
||||
|
||||
**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]
|
||||
|
||||
**Manual rollback steps:**
|
||||
```bash
|
||||
# 1. Identify the last known good image tag
|
||||
[command to list recent deployments]
|
||||
|
||||
# 2. Deploy the previous version
|
||||
[deployment command with previous tag]
|
||||
|
||||
# 3. Confirm rollback is live
|
||||
[smoke test command or health check URL]
|
||||
|
||||
# 4. Notify the team
|
||||
[Slack command or template]
|
||||
```
|
||||
|
||||
**Rollback decision authority:** Any on-call engineer can initiate a rollback without waiting for approval.
|
||||
|
||||
**After a rollback:**
|
||||
1. Create a post-deploy incident report (see [incident-postmortem skill])
|
||||
2. Do not re-deploy the same commit without fixing the root cause
|
||||
3. Notify [stakeholder / support team] of the rollback and expected fix timeline
|
||||
|
||||
---
|
||||
|
||||
## Secrets and Configuration Management
|
||||
|
||||
**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify]
|
||||
**How to add a new secret:**
|
||||
1. [Step 1]
|
||||
2. [Step 2]
|
||||
**Who has access:** [Role or team]
|
||||
**Rotation policy:** [How often secrets are rotated and who owns it]
|
||||
|
||||
**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4) which will flag this.
|
||||
|
||||
---
|
||||
|
||||
## Common Failures and Fixes
|
||||
|
||||
| Failure | Likely cause | Fix |
|
||||
|---|---|---|
|
||||
| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the lock file |
|
||||
| Integration tests timeout | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
|
||||
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
|
||||
| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` |
|
||||
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike lasts more than 5 minutes |
|
||||
|
||||
---
|
||||
|
||||
## On-Call Responsibilities During Deploy
|
||||
|
||||
- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
|
||||
- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments`
|
||||
- For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every stage has a clear owner when it fails
|
||||
- [ ] Rollback procedure is tested — not theoretical
|
||||
- [ ] Secrets management section names the actual tool used (not "use secrets management")
|
||||
- [ ] Deployment window is specific — not "during business hours"
|
||||
- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics
|
||||
@@ -0,0 +1,176 @@
|
||||
---
|
||||
name: cs-escalation-brief
|
||||
description: "Write a structured escalation brief for an at-risk customer account. Use when an account has escalated, when a customer is threatening churn, when a P1 customer issue needs executive attention, or when preparing an internal save play. Produces a crisp escalation brief with account context, timeline, root cause, business impact, and a clear resolution plan."
|
||||
---
|
||||
|
||||
# Customer Escalation Brief Skill
|
||||
|
||||
Produce a clear, concise escalation brief that gives internal stakeholders — VP CS, CCO, product leadership, or the CEO — everything they need to understand the situation, make decisions, and act fast.
|
||||
|
||||
A good escalation brief is not a complaint. It is a professional document that states the facts, assigns accountability honestly, and proposes a specific resolution plan.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Account name**, tier, and ARR
|
||||
- **CSM name** and account owner
|
||||
- **Nature of the escalation** — what happened, what the customer is saying
|
||||
- **Timeline** of events leading to escalation
|
||||
- **Customer contact** who escalated (name, role, influence level)
|
||||
- **What the customer wants** — their stated ask
|
||||
- **What we believe the root cause is**
|
||||
- **What has already been done** to address the situation
|
||||
- **Renewal date** and current renewal risk assessment
|
||||
|
||||
## Escalation Levels
|
||||
|
||||
Calibrate urgency and audience based on escalation level:
|
||||
|
||||
| Level | Trigger | Audience | Response time |
|
||||
|---|---|---|---|
|
||||
| L1 — Account Risk | Customer expressing dissatisfaction; renewal at risk | CSM + CS Manager | 24 hours |
|
||||
| L2 — Executive Escalation | Customer escalated to their exec; requesting vendor exec involvement | VP CS + Account Exec | 4 hours |
|
||||
| L3 — Churn Risk | Customer has issued notice or is in active churn conversation | CCO / CEO + Revenue leadership | 1 hour |
|
||||
| L4 — Public Risk | Customer threatening public escalation, legal, or press | CCO / Legal / Comms | Immediate |
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Escalation Brief: [Account Name]
|
||||
|
||||
**Escalation level:** L[1/2/3/4] — [Label]
|
||||
**Date raised:** [Date]
|
||||
**Raised by:** [CSM name]
|
||||
**Escalation owner:** [Name of exec or senior stakeholder now leading response]
|
||||
|
||||
---
|
||||
|
||||
## Account at a Glance
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| ARR | £/$/€[X] |
|
||||
| Tier | Enterprise / Mid-Market / SMB |
|
||||
| Customer since | [Date] |
|
||||
| Renewal date | [Date] — [N] days away |
|
||||
| Renewal risk (pre-escalation) | Green / Amber / Red |
|
||||
| Renewal risk (current) | Green / Amber / Red |
|
||||
| Customer contact who escalated | [Name, role, seniority] |
|
||||
| Executive sponsor (customer) | [Name, role — active / passive / vacant] |
|
||||
| Executive sponsor (vendor) | [Name, role] |
|
||||
|
||||
---
|
||||
|
||||
## What Happened — Summary
|
||||
|
||||
[3–5 sentences. State the facts plainly. What the customer experienced, how they reacted, and how we learned about the escalation. No editorialising. No blame.]
|
||||
|
||||
---
|
||||
|
||||
## Timeline
|
||||
|
||||
List in chronological order. Each entry: `[Date / time] — [What happened. Who did what.]`
|
||||
|
||||
Include:
|
||||
- When the original issue or trigger event occurred
|
||||
- When the customer first raised concerns (informally)
|
||||
- When it escalated (formal escalation or exec involvement)
|
||||
- Actions taken since escalation
|
||||
|
||||
---
|
||||
|
||||
## Root Cause
|
||||
|
||||
**Primary cause:** [One clear sentence. What specifically went wrong.]
|
||||
|
||||
**Contributing factors:**
|
||||
- [Factor 1 — be honest about internal failures as well as external ones]
|
||||
- [Factor 2]
|
||||
|
||||
**Is this a systemic issue or isolated?**
|
||||
- [ ] Isolated to this account
- [ ] Pattern seen in other accounts — details: [_______]
- [ ] Product or process gap that needs fixing
|
||||
|
||||
---
|
||||
|
||||
## Customer's Stated Position
|
||||
|
||||
**What the customer says happened:** [Their version of events — fair and unfiltered]
|
||||
|
||||
**What they are asking for:** [Their explicit ask — compensation, fix by date, exec call, SLA credit, exit clause]
|
||||
|
||||
**Sentiment of escalating contact:** [Frustrated but constructive / Angry / Seeking exit / Unknown]
|
||||
|
||||
**Risk of public escalation:** Low / Medium / High — [evidence if Medium or High]
|
||||
|
||||
---
|
||||
|
||||
## Business Impact
|
||||
|
||||
| Impact type | Detail |
|
||||
|---|---|
|
||||
| ARR at risk | £/$/€[X] |
|
||||
| Potential churn probability | [X]% |
|
||||
| Reputational risk | Low / Medium / High |
|
||||
| Reference / case study status | [Was a reference — now at risk / Not a reference] |
|
||||
| Expansion pipeline at risk | £/$/€[X] |
|
||||
|
||||
---
|
||||
|
||||
## What Has Been Done So Far
|
||||
|
||||
1. [Action taken — by whom — date — outcome]
|
||||
2. [Action taken — by whom — date — outcome]
|
||||
3. [Action taken — by whom — date — outcome]
|
||||
|
||||
**Has a formal apology or acknowledgement been issued?** Yes / No
|
||||
|
||||
---
|
||||
|
||||
## Proposed Resolution Plan
|
||||
|
||||
**Immediate actions (next 24–48 hours):**
|
||||
|
||||
| Action | Owner | By when |
|
||||
|---|---|---|
|
||||
| [Action] | [Name] | [Date] |
|
||||
| [Action] | [Name] | [Date] |
|
||||
|
||||
**Medium-term actions (next 2–4 weeks):**
|
||||
|
||||
| Action | Owner | By when |
|
||||
|---|---|---|
|
||||
| [Action] | [Name] | [Date] |
|
||||
|
||||
**What we are NOT offering:** [Be explicit about what is not on the table — avoids misaligned expectations]
|
||||
|
||||
**Success criteria:** [How will we know the escalation is resolved? What does the customer need to confirm they are satisfied?]
|
||||
|
||||
---
|
||||
|
||||
## Decision Required from Escalation Owner
|
||||
|
||||
[State clearly what decision or resource the escalation owner needs to provide. Be specific — do not make them ask. E.g.: "We need approval to offer a 20% service credit for Q2" or "We need an exec call with [name] within 48 hours."]
|
||||
|
||||
---
|
||||
|
||||
## Communication Plan
|
||||
|
||||
| Audience | Message | Channel | Owner | By when |
|
||||
|---|---|---|---|---|
|
||||
| Escalating customer contact | [Summary of message] | Email / Call | [Name] | [Date] |
|
||||
| Customer exec sponsor | [Summary] | Call | [Name] | [Date] |
|
||||
| Internal CS team | [Summary] | Slack / Meeting | CS Manager | [Date] |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Root cause is specific — not "communication breakdown" or "product gap" without detail
|
||||
- [ ] Customer's position is stated fairly — not minimised or dismissed
|
||||
- [ ] A clear decision is requested from the escalation owner — brief does not end with "what do you think?"
|
||||
- [ ] ARR at risk is quantified
|
||||
- [ ] Communication plan has owners and dates — not "TBD"
|
||||
- [ ] Language is professional and blameless toward individuals
|
||||
@@ -0,0 +1,141 @@
|
||||
---
|
||||
name: cs-health-scorecard
|
||||
description: "Build a customer health scorecard for a specific account. Use when asked to score account health, assess renewal risk, build a health dashboard, or evaluate an account's likelihood to renew or expand. Produces a structured health scorecard with a RAG status, dimension scores, key risks, and recommended actions."
|
||||
---
|
||||
|
||||
# Customer Health Scorecard Skill
|
||||
|
||||
Produce a structured, data-driven health scorecard for a customer account — giving the CSM and leadership a clear view of renewal risk, expansion potential, and the actions needed to move the account in the right direction.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Account name** and tier (enterprise / mid-market / SMB)
|
||||
- **Contract value** (ARR) and **renewal date**
|
||||
- **Product usage data** — logins, DAU/MAU ratio, key feature adoption
|
||||
- **Support data** — open tickets, CSAT or NPS score, recent escalations
|
||||
- **Engagement data** — last QBR date, executive sponsor status, champion name
|
||||
- **Commercial data** — payment history, expansion conversations, seats used vs. licensed
|
||||
- **Any known risks or recent changes** at the account
|
||||
|
||||
## Scoring Framework
|
||||
|
||||
Score each dimension 1–5. Weight as shown. Calculate the weighted total out of 100 (for example, points per dimension = score ÷ 5 × weight, summed across the five dimensions).
|
||||
|
||||
| Dimension | Weight | What to Score |
|
||||
|---|---|---|
|
||||
| **Product Adoption** | 30% | DAU/MAU ratio, breadth of features used, power users identified |
|
||||
| **Engagement** | 20% | QBR cadence, executive sponsor active, champion strength |
|
||||
| **Outcomes** | 20% | Customer hitting their stated goals / success metrics |
|
||||
| **Support Health** | 15% | Ticket volume trend, unresolved escalations, CSAT |
|
||||
| **Commercial** | 15% | On-time payments, seats utilised, expansion signals |
|
||||
|
||||
**Score → RAG conversion:**
|
||||
- 80–100: Green (healthy, renew likely)
|
||||
- 60–79: Amber (at risk, needs attention)
|
||||
- 0–59: Red (high churn risk, escalate)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Customer Health Scorecard: [Account Name]
|
||||
|
||||
**CSM:** [Name] | **Tier:** [Enterprise / Mid-Market / SMB]
|
||||
**ARR:** £/$/€[X] | **Renewal date:** [Date] | **Days to renewal:** [N]
|
||||
**Overall health:** [Green / Amber / Red] — [Score]/100
|
||||
**Last updated:** [Date]
|
||||
|
||||
---
|
||||
|
||||
## Health Score Summary
|
||||
|
||||
| Dimension | Score (1–5) | Weight | Weighted Score | Trend |
|
||||
|---|---|---|---|---|
|
||||
| Product Adoption | [1–5] | 30% | [X] | ↑ / → / ↓ |
|
||||
| Engagement | [1–5] | 20% | [X] | ↑ / → / ↓ |
|
||||
| Outcomes | [1–5] | 20% | [X] | ↑ / → / ↓ |
|
||||
| Support Health | [1–5] | 15% | [X] | ↑ / → / ↓ |
|
||||
| Commercial | [1–5] | 15% | [X] | ↑ / → / ↓ |
|
||||
| **Total** | — | 100% | **[X]/100** | |
|
||||
|
||||
---
|
||||
|
||||
## Dimension Detail
|
||||
|
||||
### Product Adoption — [Score]/5
|
||||
- **DAU/MAU ratio:** [X]% (benchmark: >25% = healthy)
|
||||
- **Key features adopted:** [List features in use]
|
||||
- **Features not adopted:** [List unused high-value features]
|
||||
- **Power users identified:** [Yes / No — how many]
|
||||
- **Assessment:** [1–2 sentences on adoption health]
|
||||
|
||||
### Engagement — [Score]/5
|
||||
- **Last QBR:** [Date] — [Outcome summary]
|
||||
- **Next QBR:** [Scheduled / Overdue]
|
||||
- **Executive sponsor:** [Active / Passive / Vacant]
|
||||
- **Champion:** [Name, role, strength: strong / moderate / weak]
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
### Outcomes — [Score]/5
|
||||
- **Customer's stated goals:** [List 2–3 goals from onboarding or last QBR]
|
||||
- **Progress against goals:** [On track / Partial / Off track]
|
||||
- **Evidence of value:** [Metric or quote that demonstrates ROI]
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
### Support Health — [Score]/5
|
||||
- **Open tickets:** [N] (priority breakdown: P1: X, P2: X, P3: X)
|
||||
- **CSAT / NPS:** [Score] (benchmark: >8 CSAT / >30 NPS = healthy)
|
||||
- **Unresolved escalations:** [Yes / No — details if yes]
|
||||
- **Ticket trend (last 90 days):** Increasing / Stable / Decreasing
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
### Commercial — [Score]/5
|
||||
- **Seats licensed:** [N] | **Seats active:** [N] ([X]% utilisation)
|
||||
- **Payment history:** [On time / Late — details]
|
||||
- **Expansion signals:** [Yes — describe / No]
|
||||
- **Downgrade or cancellation signals:** [Yes — describe / No]
|
||||
- **Assessment:** [1–2 sentences]
|
||||
|
||||
---
|
||||
|
||||
## Top Risks
|
||||
|
||||
| Risk | Severity | Mitigation |
|
||||
|---|---|---|
|
||||
| [Risk description] | High / Medium / Low | [Specific action to mitigate] |
|
||||
|
||||
---
|
||||
|
||||
## Recommended Actions
|
||||
|
||||
**Immediate (this week):**
|
||||
1. [Action — owner — deadline]
|
||||
|
||||
**This month:**
|
||||
1. [Action — owner — deadline]
|
||||
|
||||
**Before renewal:**
|
||||
1. [Action — owner — deadline]
|
||||
|
||||
---
|
||||
|
||||
## Renewal Forecast
|
||||
|
||||
| Scenario | Probability | ARR at risk |
|
||||
|---|---|---|
|
||||
| Full renewal at current ARR | [X]% | £/$/€0 |
|
||||
| Renewal with contraction | [X]% | £/$/€[X] |
|
||||
| Churn | [X]% | £/$/€[full ARR] |
|
||||
|
||||
**Recommended renewal play:** [Expand / Hold / Save / Manage out]
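One way to sanity-check the forecast table is to confirm the probabilities sum to 100% and to reduce the three rows to a single probability-weighted figure. The sketch below is an optional aid, not part of the scorecard format; `expected_arr_at_risk` and the example amounts are hypothetical.

```python
def expected_arr_at_risk(scenarios: list[tuple[float, float]]) -> float:
    """Probability-weighted ARR at risk.

    Each scenario is (probability in percent, ARR at risk in that scenario).
    """
    total_probability = sum(probability for probability, _ in scenarios)
    if abs(total_probability - 100) > 1e-9:
        raise ValueError("scenario probabilities must sum to 100%")
    return sum(probability * arr for probability, arr in scenarios) / 100


if __name__ == "__main__":
    # Hypothetical account with 120,000 ARR.
    forecast = [
        (60, 0),        # full renewal at current ARR
        (25, 30_000),   # renewal with contraction
        (15, 120_000),  # churn
    ]
    print(expected_arr_at_risk(forecast))
```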
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Score is based on data, not gut feel — each dimension has evidence
|
||||
- [ ] Risks are specific (not "low engagement" — something like "executive sponsor left in March, no replacement identified")
|
||||
- [ ] Actions have owners and deadlines
|
||||
- [ ] Renewal probability is calibrated against pipeline reality
|
||||
- [ ] Trend arrows reflect direction of change vs. last scorecard, not just current state
|
||||
@@ -0,0 +1,454 @@
|
||||
---
|
||||
name: database-migration-plan
|
||||
description: "Write a safe, zero-downtime database migration plan for a schema change. Use when asked to plan a database migration, design a zero-downtime schema change, document an expand/contract migration, produce a rollback procedure for a database change, or coordinate a database schema update with a deployment. Produces a structured migration plan covering migration objectives, backward compatibility analysis, expand/contract phase breakdown, exact SQL, rollback steps per phase, data validation queries, and a deployment runbook."
|
||||
---
|
||||
|
||||
# Database Migration Plan Skill
|
||||
|
||||
Produce a complete, safe database migration plan for a schema change. A migration plan is not just the SQL — it is a coordinated sequence of steps that ensures the application stays available, data stays consistent, and every step can be rolled back independently.
|
||||
|
||||
The expand/contract pattern is the default approach: expand the schema to support both old and new states, migrate the application, then contract to remove the old state. Never combine schema changes and data backfills in a single migration that runs during deployment.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Current schema state** — the DDL or description of the table(s) as they are now
|
||||
- **Target schema state** — the DDL or description of what the table(s) should look like after migration
|
||||
- **Migration reason** — why this change is being made (new feature, performance fix, normalization, compliance)
|
||||
- **Database engine** — PostgreSQL, MySQL, SQLite, CockroachDB, etc.
|
||||
- **Estimated data volume** — approximate number of rows in affected tables
|
||||
- **Deployment constraints** — is any downtime allowed? What is the expected traffic level during migration? Are there multiple app instances running?
|
||||
- **Rollback window** — how long after deploy can the team roll back before the migration becomes irreversible?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Database Migration Plan: [Migration Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Reviewed by:** [Name / DBA]
|
||||
**Date:** [Date] | **Target deploy date:** [Date]
|
||||
**Database engine:** [PostgreSQL X.X / MySQL X.X]
|
||||
**Ticket:** [JIRA-XXX]
|
||||
|
||||
---
|
||||
|
||||
## 1. Migration Overview
|
||||
|
||||
**What is changing:**
|
||||
[1–2 sentences: the specific schema change — e.g. "Adding a non-nullable `organisation_id` column to the `users` table and backfilling it from the `accounts` table."]
|
||||
|
||||
**Why:**
|
||||
[1–2 sentences: the business or technical reason driving the change.]
|
||||
|
||||
**Migration type:** [Additive only / Additive + backfill / Column rename / Column type change / Table restructure / Index change]
|
||||
|
||||
**Zero-downtime:** [Yes — using expand/contract / No — requires maintenance window — state duration]
|
||||
|
||||
**Estimated migration duration:**
|
||||
- Expand phase: [~X minutes]
|
||||
- Data backfill: [~X minutes/hours — based on X rows at Y rows/second]
|
||||
- Contract phase: [~X minutes after app version deployed]
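The backfill estimate is simple arithmetic: rows divided by throughput, plus the pause taken after each batch. A minimal sketch, assuming a measured `rows_per_second` from a trial run; the function name and the example figures are hypothetical, and the default batch size and pause are placeholders to replace with the values used by the actual backfill.

```python
import math


def estimate_backfill_seconds(
    rows: int,
    rows_per_second: float,
    batch_size: int = 1000,
    pause_seconds: float = 0.1,
) -> float:
    """Rough backfill duration: update time plus the sleep after each batch."""
    if rows_per_second <= 0 or batch_size <= 0:
        raise ValueError("rows_per_second and batch_size must be positive")
    batches = math.ceil(rows / batch_size)
    return rows / rows_per_second + batches * pause_seconds


if __name__ == "__main__":
    # Hypothetical: 12 million rows at 2,000 rows/second.
    seconds = estimate_backfill_seconds(12_000_000, 2000)
    print(f"~{seconds / 3600:.1f} hours")
```

Treat the result as a lower bound: throughput usually drops under production load.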
|
||||
|
||||
---
|
||||
|
||||
## 2. Backward Compatibility Analysis
|
||||
|
||||
Before writing a single line of SQL, assess whether each change is backward compatible with the currently deployed application code.
|
||||
|
||||
| Change | Backward compatible? | Risk | Notes |
|
||||
|---|---|---|---|
|
||||
| [e.g. Add nullable column `org_id`] | Yes | Low | Old app ignores new column |
|
||||
| [e.g. Backfill `org_id`] | Yes | Medium | Old app unaffected; new app reads backfilled values |
|
||||
| [e.g. Add NOT NULL constraint to `org_id`] | **No** | High | Old app that inserts without `org_id` will fail |
|
||||
| [e.g. Drop old column `account_id`] | **No** | High | Old app that reads `account_id` will fail |
|
||||
| [e.g. Add index on `org_id`] | Yes | Low | Additive; no breaking change |
|
||||
| [e.g. Rename column] | **No** | High | Never rename in one step; use expand/contract |
|
||||
|
||||
**Summary:** [e.g. "This migration requires the expand/contract pattern across 3 deployment phases because steps 3 and 4 are not backward compatible."]
|
||||
|
||||
---
|
||||
|
||||
## 3. Expand/Contract Phases
|
||||
|
||||
### Phase Overview
|
||||
|
||||
```
|
||||
Phase 1 — EXPAND
|
||||
Deploy migration: add new column (nullable), create new indexes
|
||||
Old app: continues to work (ignores new column)
|
||||
New app: not yet deployed
|
||||
Duration: [~X min] | Rollback: trivial — drop new column
|
||||
|
||||
│
|
||||
▼
|
||||
|
||||
Phase 2 — BACKFILL + DUAL-WRITE
|
||||
Deploy app update: writes to both old and new columns
|
||||
Run backfill: populate new column for existing rows
|
||||
Validate: confirm 100% of rows have non-null new column
|
||||
Duration: [~X hours depending on data volume]
|
||||
Rollback: deploy previous app version; new column is still nullable
|
||||
|
||||
│
|
||||
▼
|
||||
|
||||
Phase 3 — ENFORCE + SWITCH
|
||||
Deploy migration: add NOT NULL constraint (old column stays until Phase 4)
|
||||
Deploy app update: reads only from new column
|
||||
Duration: [~X min] | Rollback: requires forward-fix (constraint must be dropped first)
|
||||
|
||||
│
|
||||
▼
|
||||
|
||||
Phase 4 — CONTRACT (optional cleanup)
|
||||
Deploy migration: drop deprecated columns, rename if needed
|
||||
Final state matches target schema
|
||||
Rollback: not recommended — contract changes are destructive
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 1 — Expand Schema
|
||||
|
||||
**Goal:** Add the new column and structures without breaking the existing application.
|
||||
**Deploy order:** Run migration first, then (optionally) deploy app.
|
||||
**Application state:** Old app running; no app changes required yet.
|
||||
|
||||
```sql
-- Migration: 001_add_org_id_to_users.sql
BEGIN;

-- Add nullable column (safe: the old app ignores it)
ALTER TABLE users
  ADD COLUMN org_id UUID NULL
  REFERENCES organisations(id) ON DELETE RESTRICT;

COMMIT;

-- Add the index NOW, not in Phase 3: building an index on a large table during Phase 3 is risky
-- CONCURRENTLY does not block reads or writes, so it is safe on live traffic
-- CONCURRENTLY cannot run inside a transaction block, so it runs after COMMIT as its own statement
CREATE INDEX CONCURRENTLY users_org_id_idx ON users (org_id);
```
|
||||
|
||||
**Validation after Phase 1:**
|
||||
```sql
|
||||
-- Confirm column exists and is nullable
|
||||
SELECT column_name, data_type, is_nullable
|
||||
FROM information_schema.columns
|
||||
WHERE table_name = 'users' AND column_name = 'org_id';
|
||||
-- Expected: is_nullable = 'YES'
|
||||
|
||||
-- Confirm index exists
|
||||
SELECT indexname, indexdef
|
||||
FROM pg_indexes
|
||||
WHERE tablename = 'users' AND indexname = 'users_org_id_idx';
|
||||
```
|
||||
|
||||
**Rollback (Phase 1 only):**
|
||||
```sql
-- DROP INDEX CONCURRENTLY cannot run inside a transaction block; run each statement on its own
DROP INDEX CONCURRENTLY IF EXISTS users_org_id_idx;
ALTER TABLE users DROP COLUMN IF EXISTS org_id;
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — Backfill Existing Data
|
||||
|
||||
**Goal:** Populate the new column for all existing rows before enforcing NOT NULL.
|
||||
**When to run:** After Phase 1 is live and stable. Can be run as a background job or a one-time script.
|
||||
**Application state:** Deploy the app version that dual-writes to both old and new columns before starting the backfill.
|
||||
|
||||
**App code change required:**
|
||||
```
|
||||
// All INSERT and UPDATE operations must now set BOTH old_column and new_column
|
||||
// until Phase 3 is complete. This ensures new rows are populated during the backfill window.
|
||||
```
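A dual-write insert might look like the sketch below. It uses Python's built-in `sqlite3` purely as a stand-in so the example is runnable; the real service would use its own database driver. The function `create_user` and the table layout (`users.email`, `users.account_id`, `users.org_id`, `accounts.organisation_id`) are assumptions based on the running example in this plan.

```python
import sqlite3


def create_user(conn: sqlite3.Connection, email: str, account_id: int) -> int:
    """Insert a user, writing BOTH the old column (account_id) and the new one (org_id)."""
    row = conn.execute(
        "SELECT organisation_id FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown account_id {account_id}")
    cursor = conn.execute(
        "INSERT INTO users (email, account_id, org_id) VALUES (?, ?, ?)",
        (email, account_id, row[0]),
    )
    conn.commit()
    return cursor.lastrowid
```

Every UPDATE path that changes `account_id` needs the same treatment, otherwise the two columns drift apart during the backfill window.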
|
||||
|
||||
**Backfill script — batch processing:**
|
||||
```sql
-- Run in batches to avoid long-held row locks. Adjust batch size based on table size and DB load.
-- Target: no single batch takes more than 5 seconds.
-- PostgreSQL has no UPDATE ... LIMIT, so each batch is selected by key in a subquery.
-- Assumes users has a primary key column named id.
-- COMMIT inside DO needs PostgreSQL 11+ and must not be run inside an outer transaction.

DO $$
DECLARE
  batch_size INT := 1000;
  affected   INT;
BEGIN
  LOOP
    UPDATE users u
    SET org_id = a.organisation_id
    FROM accounts a
    WHERE u.account_id = a.id
      AND u.id IN (
        SELECT u2.id
        FROM users u2
        JOIN accounts a2 ON a2.id = u2.account_id
        WHERE u2.org_id IS NULL
          AND a2.organisation_id IS NOT NULL
        LIMIT batch_size
      );

    GET DIAGNOSTICS affected = ROW_COUNT;
    EXIT WHEN affected = 0;

    -- Commit each batch so its locks are released before the next one starts
    COMMIT;

    -- Pause between batches to avoid saturating I/O
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```
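The same batching loop can be driven from application code, which makes it easier to log progress and stop early. The sketch below uses Python's built-in `sqlite3` only so that it runs anywhere; it is not the PostgreSQL script itself, and `backfill_org_id` plus the assumed `users.id` primary key are hypothetical.

```python
import sqlite3


def backfill_org_id(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Populate users.org_id from accounts in batches; return rows updated."""
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE users
            SET org_id = (
                SELECT a.organisation_id FROM accounts a
                WHERE a.id = users.account_id
            )
            WHERE id IN (
                SELECT u.id
                FROM users u
                JOIN accounts a ON a.id = u.account_id
                WHERE u.org_id IS NULL
                  AND a.organisation_id IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        # Commit per batch so each batch is its own short transaction.
        conn.commit()
        if cursor.rowcount == 0:
            break
        total += cursor.rowcount
    return total
```

Rows with no matching account are skipped rather than retried forever, so the loop terminates; those rows still show up in the completion validation query and need a manual decision.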
|
||||
|
||||
**Monitoring during backfill:**
|
||||
```sql
|
||||
-- Check progress — run periodically during backfill
|
||||
SELECT
|
||||
COUNT(*) FILTER (WHERE org_id IS NOT NULL) AS backfilled,
|
||||
COUNT(*) FILTER (WHERE org_id IS NULL) AS remaining,
|
||||
COUNT(*) AS total,
|
||||
ROUND(
|
||||
100.0 * COUNT(*) FILTER (WHERE org_id IS NOT NULL) / COUNT(*), 2
|
||||
) AS pct_complete
|
||||
FROM users;
|
||||
```
|
||||
|
||||
**Backfill completion validation:**
|
||||
```sql
|
||||
-- Must return 0 before proceeding to Phase 3
|
||||
SELECT COUNT(*) AS unbackfilled_rows
|
||||
FROM users
|
||||
WHERE org_id IS NULL;
|
||||
|
||||
-- Confirm no new rows written without org_id (dual-write working)
|
||||
SELECT COUNT(*) AS recent_missing
|
||||
FROM users
|
||||
WHERE org_id IS NULL
|
||||
AND created_at > now() - INTERVAL '1 hour';
|
||||
```
|
||||
|
||||
**Rollback (Phase 2 — app only):**
|
||||
- Deploy previous app version (single-write to old column)
|
||||
- `org_id` column remains nullable; no data is lost
|
||||
- Backfilled values remain; harmless
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Enforce Constraints
|
||||
|
||||
**Goal:** Add NOT NULL constraint and remove dependency on the old column.
|
||||
**Prerequisites:** Phase 2 backfill must be 100% complete (zero rows with `org_id IS NULL`).
|
||||
**Deploy order:** Run migration, then deploy app version that reads only from `org_id`.
|
||||
|
||||
**PostgreSQL — use NOT VALID + VALIDATE for large tables:**
|
||||
```sql
|
||||
-- Step 1: Add constraint as NOT VALID (no full table scan — instant)
|
||||
ALTER TABLE users
|
||||
ADD CONSTRAINT users_org_id_not_null
|
||||
CHECK (org_id IS NOT NULL) NOT VALID;
|
||||
|
||||
-- Step 2: VALIDATE CONSTRAINT (takes a SHARE UPDATE EXCLUSIVE lock — allows reads and writes)
|
||||
-- Run this separately, as it can take minutes on large tables
|
||||
ALTER TABLE users
|
||||
VALIDATE CONSTRAINT users_org_id_not_null;
|
||||
|
||||
-- Step 3: Once validated, convert to actual NOT NULL
|
||||
-- (PostgreSQL 12+ uses the validated check constraint to skip the full table scan, so this is fast)
|
||||
ALTER TABLE users
|
||||
ALTER COLUMN org_id SET NOT NULL;
|
||||
|
||||
-- Step 4: Drop the now-redundant check constraint
|
||||
ALTER TABLE users
|
||||
DROP CONSTRAINT users_org_id_not_null;
|
||||
```
|
||||
|
||||
**Validation after Phase 3:**
|
||||
```sql
|
||||
-- Confirm NOT NULL is enforced
|
||||
SELECT column_name, is_nullable
|
||||
FROM information_schema.columns
|
||||
WHERE table_name = 'users' AND column_name = 'org_id';
|
||||
-- Expected: is_nullable = 'NO'
|
||||
|
||||
-- Test that insert without org_id fails (run in a transaction and roll back)
|
||||
BEGIN;
|
||||
INSERT INTO users (email) VALUES ('test@example.com');
|
||||
-- Expected: ERROR: null value in column "org_id" violates not-null constraint (wording varies by PostgreSQL version)
|
||||
ROLLBACK;
|
||||
```
|
||||
|
||||
**Rollback (Phase 3):**
|
||||
```sql
|
||||
-- Drop the NOT NULL constraint (restores nullable state)
|
||||
ALTER TABLE users ALTER COLUMN org_id DROP NOT NULL;
|
||||
-- Then deploy previous app version (dual-write)
|
||||
-- Note: Once app code reading the new column is live, rolling back the constraint
|
||||
-- without rolling back the app will cause issues — plan this carefully.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Contract (Remove Old Column)
|
||||
|
||||
**Goal:** Remove the old column once the app no longer references it.
|
||||
**Prerequisites:** Phase 3 fully deployed and stable for at least [X days/hours rollback window].
|
||||
**Warning:** This phase is destructive — the old column's data is permanently deleted.
|
||||
|
||||
```sql
|
||||
BEGIN;
|
||||
|
||||
-- Drop the old column
|
||||
ALTER TABLE users DROP COLUMN account_id;
|
||||
|
||||
-- PostgreSQL drops indexes on the column along with it; this is a safeguard for any that remain
|
||||
DROP INDEX IF EXISTS users_account_id_idx;
|
||||
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
**Pre-drop validation:**
|
||||
```sql
-- Confirm no application queries still reference the old column
-- (Check this in code review and via a search of the codebase before running)
-- grep -r "account_id" app/

-- Confirm the new column agrees with the old one before the old data is deleted
SELECT COUNT(*) AS mismatched
FROM users u
JOIN accounts a ON u.account_id = a.id
WHERE u.org_id != a.organisation_id;
-- Expected: 0
```
|
||||
|
||||
**Rollback:** Not straightforward — dropped column data cannot be recovered. Only proceed to Phase 4 after the rollback window has passed and the change is confirmed stable.
|
||||
|
||||
---
|
||||
|
||||
## 4. Data Validation Plan
|
||||
|
||||
Run these queries before and after the full migration to confirm data integrity.
|
||||
|
||||
**Pre-migration baseline:**
|
||||
```sql
|
||||
-- Record these values before any migration step
|
||||
SELECT COUNT(*) AS total_users FROM users;
|
||||
SELECT COUNT(*) AS total_orgs FROM organisations;
|
||||
SELECT MIN(created_at), MAX(created_at) FROM users;
|
||||
|
||||
-- Check for any anomalies in the source data before backfill
|
||||
SELECT COUNT(*) AS users_without_account
|
||||
FROM users WHERE account_id IS NULL;
|
||||
```
|
||||
|
||||
**Post-backfill integrity check:**
|
||||
```sql
|
||||
-- All users have an org that exists
|
||||
SELECT COUNT(*) AS orphaned_org_refs
|
||||
FROM users u
|
||||
WHERE u.org_id IS NOT NULL
|
||||
AND NOT EXISTS (
|
||||
SELECT 1 FROM organisations o WHERE o.id = u.org_id
|
||||
);
|
||||
-- Expected: 0
|
||||
|
||||
-- org_id matches expected value from source column
|
||||
SELECT COUNT(*) AS mismatched_backfill
|
||||
FROM users u
|
||||
JOIN accounts a ON u.account_id = a.id
|
||||
WHERE u.org_id IS DISTINCT FROM a.organisation_id;
|
||||
-- Expected: 0
|
||||
|
||||
-- Row count unchanged (no rows created or deleted by migration)
|
||||
SELECT COUNT(*) AS total_users_after FROM users;
|
||||
-- Must match pre-migration baseline
|
||||
```
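The row count check only works if the baseline was written down somewhere. A minimal sketch, assuming a scratch table is acceptable in this database: `migration_baseline` is a hypothetical name, and `users.created_at` is used to explain any growth from sign-ups that happened while the migration was running.

```sql
-- Before Phase 1: persist the baseline (hypothetical table name).
CREATE TABLE migration_baseline AS
SELECT now() AS captured_at,
       (SELECT COUNT(*) FROM users)         AS total_users,
       (SELECT COUNT(*) FROM organisations) AS total_orgs;

-- After the backfill: compare the live count against the stored baseline.
SELECT b.total_users                        AS total_users_before,
       (SELECT COUNT(*) FROM users)         AS total_users_after,
       (SELECT COUNT(*) FROM users
         WHERE created_at > b.captured_at)  AS users_created_since
FROM migration_baseline b;
```

If no rows were deleted in the meantime, `total_users_after` should equal `total_users_before + users_created_since`.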
|
||||
|
||||
**Post-contract final check:**
|
||||
```sql
|
||||
-- Old column is gone
|
||||
SELECT COUNT(*) FROM information_schema.columns
|
||||
WHERE table_schema = current_schema() AND table_name = 'users' AND column_name = 'account_id';
|
||||
-- Expected: 0
|
||||
|
||||
-- New column is NOT NULL
|
||||
SELECT is_nullable FROM information_schema.columns
|
||||
WHERE table_schema = current_schema() AND table_name = 'users' AND column_name = 'org_id';
|
||||
-- Expected: NO
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Performance Impact Assessment
|
||||
|
||||
| Step | Lock type | Lock duration | Traffic impact |
|
||||
|---|---|---|---|
|
||||
| Add nullable column | ACCESS EXCLUSIVE | Milliseconds | Negligible |
|
||||
| CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | Minutes (proportional to table size) | Reads and writes continue |
|
||||
| Batch backfill | ROW EXCLUSIVE (table) plus row-level locks | <5s per batch | Low if batches are small |
|
||||
| ADD CONSTRAINT NOT VALID | ACCESS EXCLUSIVE | Milliseconds | Negligible |
|
||||
| VALIDATE CONSTRAINT | SHARE UPDATE EXCLUSIVE | Minutes | Reads and writes continue |
|
||||
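| ALTER COLUMN SET NOT NULL | ACCESS EXCLUSIVE | Milliseconds on PostgreSQL 12+ if a validated CHECK (org_id IS NOT NULL) constraint exists; otherwise a full table scan | Negligible when the scan is skipped |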
|
||||
| DROP COLUMN | ACCESS EXCLUSIVE | Milliseconds | Negligible |
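The "Milliseconds" figures for ACCESS EXCLUSIVE steps hold only if the lock is granted quickly. A lock request that waits behind a long-running transaction also blocks every query queued after it. A common guard is a short `lock_timeout`, so the DDL fails fast and can be retried. The 3-second value below is an assumption to tune per system.

```sql
BEGIN;
-- Give up after 3 seconds instead of queueing behind a long-running transaction.
SET LOCAL lock_timeout = '3s';
ALTER TABLE users DROP COLUMN account_id;
COMMIT;
```

If the timeout fires, the statement errors and the transaction is rolled back; nothing has changed and the step can be retried at a quieter moment.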
|
||||
|
||||
**Expected load increase during backfill:**
|
||||
- DB CPU: [estimated % increase during batch writes]
|
||||
- DB I/O: [estimated increase]
|
||||
- Monitoring threshold to pause backfill: [e.g. DB CPU > 80% for >2 minutes]
|
||||
|
||||
**Backfill rate estimate:**
|
||||
- Table size: [X million rows]
|
||||
- Batch size: [1000 rows]
|
||||
- Pause between batches: [100ms]
|
||||
- Estimated total duration: [X hours at Y rows/second]
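The duration estimate is simple arithmetic: number of batches times the time each batch takes, including the pause. The worked example below uses assumed inputs (50 million rows and 50 ms of execution time per batch are illustrative, not measurements) together with the batch size and pause suggested above.

```sql
WITH params AS (
    SELECT 50000000::numeric AS total_rows,          -- assumed table size
           1000::numeric     AS batch_size,
           0.100::numeric    AS pause_seconds,
           0.050::numeric    AS batch_exec_seconds   -- assumed; measure on staging
)
SELECT ceil(total_rows / batch_size)                              AS batches,
       round(batch_size / (pause_seconds + batch_exec_seconds))   AS rows_per_second,
       round(ceil(total_rows / batch_size)
             * (pause_seconds + batch_exec_seconds) / 3600, 1)    AS estimated_hours
FROM params;
-- With these inputs: batches = 50000, rows_per_second = 6667, estimated_hours = 2.1
```

Replacing `batch_exec_seconds` with a figure measured on a production-scale staging database gives a more trustworthy estimate.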
|
||||
|
||||
---
|
||||
|
||||
## 6. Deployment Runbook
|
||||
|
||||
Follow this checklist on the day of migration. Mark each step as done before proceeding.
|
||||
|
||||
**Pre-migration (day before):**
|
||||
- [ ] DBA / tech lead has reviewed the migration plan
|
||||
- [ ] Performance impact assessed; monitoring dashboards ready
|
||||
- [ ] Backfill script tested on a staging DB with production-scale data
|
||||
- [ ] Rollback procedure tested on staging
|
||||
- [ ] On-call engineer briefed; Slack channel [#db-migrations] set up for coordination
|
||||
- [ ] Maintenance window scheduled (if required)
|
||||
|
||||
**Phase 1 — Expand (T+0):**
|
||||
- [ ] Take a manual DB snapshot / verify automated backup is recent
|
||||
- [ ] Run `001_expand_add_org_id.sql` on production
|
||||
- [ ] Run Phase 1 validation queries — confirm pass
|
||||
- [ ] Deploy app version with dual-write
|
||||
- [ ] Monitor error rate for [10 minutes]
|
||||
|
||||
**Phase 2 — Backfill (T+[X hours]):**
|
||||
- [ ] Confirm Phase 1 has been stable for [X hours]
|
||||
- [ ] Start backfill script in a screen/tmux session
|
||||
- [ ] Monitor progress via backfill progress query every [5 minutes]
|
||||
- [ ] Monitor DB CPU and I/O — pause if thresholds exceeded
|
||||
- [ ] Run completion validation — confirm 0 unbackfilled rows
|
||||
- [ ] Run integrity checks — confirm 0 orphaned refs, 0 mismatches
|
||||
|
||||
**Phase 3 — Enforce (T+[X days]):**
|
||||
- [ ] Confirm backfill 100% complete and stable for [X hours]
|
||||
- [ ] Add NOT VALID constraint
|
||||
- [ ] Run VALIDATE CONSTRAINT (monitor duration and lock waits)
|
||||
- [ ] Alter column to NOT NULL
|
||||
- [ ] Run Phase 3 validation queries
|
||||
- [ ] Deploy app version reading only from new column
|
||||
- [ ] Monitor error rate for [30 minutes]
|
||||
|
||||
**Phase 4 — Contract (T+[X days after rollback window]):**
|
||||
- [ ] Confirm rollback window has passed — no incidents, no rollback needed
|
||||
- [ ] Search codebase for references to old column — confirm zero
|
||||
- [ ] Run DROP COLUMN migration
|
||||
- [ ] Run final integrity checks
|
||||
- [ ] Close migration ticket; update schema documentation
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every migration phase has an independent rollback procedure — no phase assumes the next one has run
|
||||
- [ ] Batch backfill script includes a pause between batches to avoid saturating I/O
|
||||
- [ ] NOT NULL constraints use the NOT VALID + VALIDATE pattern on tables with >100k rows
|
||||
- [ ] The app dual-write period is explicitly defined — old column writes are not dropped until Phase 3 is deployed
|
||||
- [ ] Data validation queries include a row count check to confirm no data loss
|
||||
- [ ] Lock types are identified for every DDL statement — no "should be fine" assumptions
|
||||
- [ ] The deployment runbook names who runs each step, not just what to run
|
||||
- [ ] Phase 4 (contract) is explicitly gated on the rollback window passing — not run on the same day as Phase 3
|
||||
@@ -0,0 +1,356 @@
|
||||
---
|
||||
name: database-schema-design
|
||||
description: "Document or design a database schema with entity relationships, table definitions, constraints, indexes, and access patterns. Use when asked to design a database, document an existing schema, model entities and relationships, define table structures, plan an index strategy, or produce a data model for review. Produces a structured schema document covering an ER diagram, table DDL definitions, index strategy, access pattern analysis, normalization decisions, and migration notes."
|
||||
---
|
||||
|
||||
# Database Schema Design Skill
|
||||
|
||||
Produce a complete database schema design document for a given domain. A schema document is not just a list of tables — it is a record of decisions: what was modelled, how entities relate, which queries the schema is optimised for, and what trade-offs were made.
|
||||
|
||||
A good schema design document lets an engineer understand the data model, query it correctly, extend it safely, and write migrations without breaking things.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Domain description** — what the system does; what business objects are being modelled
|
||||
- **Entities and relationships** — the main things in the domain and how they relate (e.g. "a User has many Orders; an Order has many OrderItems; an OrderItem references a Product")
|
||||
- **Expected query patterns** — the most important read and write queries (e.g. "fetch all orders for a user, sorted by date"; "look up a product by SKU")
|
||||
- **Database engine** — PostgreSQL, MySQL, SQLite, CockroachDB, etc. — this affects DDL syntax and available types
|
||||
- **Expected data volume** — approximate row counts, growth rate, and any partitioning needs
|
||||
- **Constraints** — any existing conventions, naming standards, or migration constraints to respect
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Database Schema Design: [Domain / Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Reviewed by:** [Name]
|
||||
**Date:** [Date] | **Database engine:** [PostgreSQL X.X / MySQL X.X / etc.]
|
||||
**Status:** [Draft / Reviewed / Approved]
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview
|
||||
|
||||
[2–3 sentences describing the domain being modelled, the scope of this schema, and any key design philosophy (e.g. "this schema prioritises read performance for the customer-facing API over write simplicity", or "designed for eventual migration to multi-tenancy")]
|
||||
|
||||
**In scope:**
|
||||
- [Entity or subsystem]
|
||||
- [Entity or subsystem]
|
||||
|
||||
**Out of scope:**
|
||||
- [e.g. Analytics / reporting tables — separate schema]
|
||||
- [e.g. Audit log tables — covered in separate design doc]
|
||||
|
||||
---
|
||||
|
||||
## 2. Entity Relationship Diagram
|
||||
|
||||
```
┌───────────────────┐         ┌────────────────────────┐
│ users             │         │ organisations          │
│───────────────────│         │────────────────────────│
│ id (PK)           │    ┌───▶│ id (PK)                │
│ org_id (FK) ──────┼────┘    │ name                   │
│ email             │         │ plan                   │
│ display_name      │         │ created_at             │
│ created_at        │         └────────────────────────┘
│ updated_at        │
└─────────┬─────────┘
          │ 1
          │
          │ N
┌─────────▼─────────┐         ┌────────────────────────┐
│ [table_a]         │         │ [table_b]              │
│───────────────────│         │────────────────────────│
│ id (PK)           │◀───┐ 1:N│ id (PK)                │
│ user_id (FK)      │    └────┤ [table_a]_id (FK)      │
│ [field]           │         │ [field]                │
│ [field]           │         │ [field]                │
│ created_at        │         │ created_at             │
└───────────────────┘         └────────────────────────┘
```
|
||||
|
||||
**Relationship summary:**
|
||||
|
||||
| Entity A | Relationship | Entity B | Notes |
|
||||
|---|---|---|---|
|
||||
| organisations | has many | users | An org can have many users |
|
||||
| users | has many | [table_a] | Cascade delete on hard delete; a soft-deleted user keeps their rows |
|
||||
| [table_a] | has many | [table_b] | Cascade delete |
|
||||
| [table_b] | belongs to | [table_a] | Non-nullable FK |
|
||||
| [table_c] | many-to-many (via [join_table]) | [table_d] | Join table with metadata |
|
||||
|
||||
---
|
||||
|
||||
## 3. Table Definitions
|
||||
|
||||
### `organisations`
|
||||
|
||||
[1 sentence describing what this table stores and its role in the domain.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE organisations (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
name VARCHAR(255) NOT NULL,
|
||||
slug VARCHAR(100) NOT NULL UNIQUE,
|
||||
plan VARCHAR(50) NOT NULL DEFAULT 'free'
|
||||
CHECK (plan IN ('free', 'pro', 'enterprise')),
|
||||
settings JSONB NOT NULL DEFAULT '{}',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||||
);
|
||||
```
|
||||
|
||||
| Column | Type | Nullable | Default | Notes |
|
||||
|---|---|---|---|---|
|
||||
| id | UUID | No | gen_random_uuid() | Surrogate PK — UUID preferred over serial for distributed use |
|
||||
| name | VARCHAR(255) | No | — | Display name; not unique |
|
||||
| slug | VARCHAR(100) | No | — | URL-safe identifier; unique across all orgs |
|
||||
| plan | VARCHAR(50) | No | 'free' | Constrained to known values via CHECK |
|
||||
| settings | JSONB | No | {} | Flexible config; avoid for queryable fields |
|
||||
| created_at | TIMESTAMPTZ | No | now() | Always use TIMESTAMPTZ, not TIMESTAMP |
|
||||
| updated_at | TIMESTAMPTZ | No | now() | Updated via trigger (see below) |
|
||||
|
||||
---
|
||||
|
||||
### `users`
|
||||
|
||||
[1 sentence describing what this table stores.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE users (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
org_id UUID NOT NULL REFERENCES organisations(id)
|
||||
ON DELETE RESTRICT,
|
||||
email VARCHAR(254) NOT NULL,
|
||||
display_name VARCHAR(255) NOT NULL DEFAULT '',
|
||||
role VARCHAR(50) NOT NULL DEFAULT 'member'
|
||||
CHECK (role IN ('owner', 'admin', 'member', 'viewer')),
|
||||
email_verified BOOLEAN NOT NULL DEFAULT false,
|
||||
deleted_at TIMESTAMPTZ NULL,
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
|
||||
CONSTRAINT users_email_org_unique UNIQUE (email, org_id)
|
||||
);
|
||||
```
|
||||
|
||||
| Column | Type | Nullable | Default | Notes |
|
||||
|---|---|---|---|---|
|
||||
| id | UUID | No | gen_random_uuid() | — |
|
||||
| org_id | UUID | No | — | FK to organisations; RESTRICT prevents orphaning |
|
||||
| email | VARCHAR(254) | No | — | RFC 5321 max length; unique per org (not globally) |
|
||||
| role | VARCHAR(50) | No | 'member' | Application-level RBAC |
|
||||
| deleted_at | TIMESTAMPTZ | Yes | NULL | Soft delete; NULL = active |
|
||||
|
||||
**Soft delete policy:** Rows with `deleted_at IS NOT NULL` are considered deleted. All application queries MUST filter `WHERE deleted_at IS NULL` unless explicitly fetching deleted records. Use a view or ORM scope to enforce this.
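A view is one way to enforce the filter in the database rather than in every query. The sketch below assumes the `users` table defined above; the view name `active_users` and the index name `users_active_email_org_unique` are hypothetical. The second statement is an optional design choice: `users_email_org_unique` also counts soft-deleted rows and is case-sensitive, so a partial unique index may fit better if a deleted user's email should be reusable within the same org.

```sql
-- Hypothetical view: application code reads from this instead of users.
CREATE VIEW active_users AS
SELECT id, org_id, email, display_name, role, email_verified, created_at, updated_at
FROM users
WHERE deleted_at IS NULL;

-- Optional alternative to users_email_org_unique: enforce uniqueness only among
-- active rows, case-insensitively, so a soft-deleted user's email can be reused.
CREATE UNIQUE INDEX users_active_email_org_unique
    ON users (org_id, lower(email))
    WHERE deleted_at IS NULL;
```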
|
||||
|
||||
---
|
||||
|
||||
### `[table_a]`
|
||||
|
||||
[Description of what this table models.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE [table_a] (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
|
||||
[field_1] VARCHAR(255) NOT NULL,
|
||||
[field_2] TEXT NULL,
|
||||
[field_3] INTEGER NOT NULL DEFAULT 0 CHECK ([field_3] >= 0),
|
||||
status VARCHAR(50) NOT NULL DEFAULT 'pending'
|
||||
CHECK (status IN ('pending', 'active', 'archived')),
|
||||
metadata JSONB NOT NULL DEFAULT '{}',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||||
);
|
||||
```
|
||||
|
||||
| Column | Type | Nullable | Notes |
|
||||
|---|---|---|---|
|
||||
| user_id | UUID | No | CASCADE delete — when user is deleted, their [table_a] rows are too |
|
||||
| [field_1] | VARCHAR(255) | No | [Reason for length constraint] |
|
||||
| status | VARCHAR(50) | No | State machine: pending → active → archived (no other transitions) |
|
||||
| metadata | JSONB | No | [What is stored here and why it's not a typed column] |
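The `CHECK` constraint on `status` restricts the allowed values but not the transitions between them. If the state machine should be enforced in the database, a trigger is one option. This is a sketch under assumptions: `tasks` is a hypothetical name standing in for `[table_a]`, and the function and trigger names are illustrative.

```sql
-- Reject any status change other than pending -> active or active -> archived.
CREATE OR REPLACE FUNCTION enforce_status_transition()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.status IS DISTINCT FROM OLD.status
       AND NOT (   (OLD.status = 'pending' AND NEW.status = 'active')
                OR (OLD.status = 'active'  AND NEW.status = 'archived')) THEN
        RAISE EXCEPTION 'invalid status transition: % -> %', OLD.status, NEW.status;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- "tasks" is a hypothetical table name standing in for [table_a].
CREATE TRIGGER tasks_status_transition
    BEFORE UPDATE OF status ON tasks
    FOR EACH ROW EXECUTE FUNCTION enforce_status_transition();
```

Enforcing transitions in application code instead is equally valid; the point is to document where the rule lives.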
|
||||
|
||||
---
|
||||
|
||||
### `[join_table]` *(Many-to-many)*
|
||||
|
||||
[Description of the relationship this table represents.]
|
||||
|
||||
```sql
|
||||
CREATE TABLE [join_table] (
|
||||
[table_c]_id UUID NOT NULL REFERENCES [table_c](id) ON DELETE CASCADE,
|
||||
[table_d]_id UUID NOT NULL REFERENCES [table_d](id) ON DELETE CASCADE,
|
||||
granted_by UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
|
||||
granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
||||
|
||||
PRIMARY KEY ([table_c]_id, [table_d]_id)
|
||||
);
|
||||
```
|
||||
|
||||
**Why a composite PK:** The combination of `[table_c]_id + [table_d]_id` is the natural key — each association is unique and the primary key doubles as the uniqueness constraint without needing a separate index.
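The composite primary key index is ordered by its first column, so it serves lookups in one direction only. A sketch of the consequence, using hypothetical names (`role_permissions` for `[join_table]`, `role_id` for `[table_c]_id`, `permission_id` for `[table_d]_id`):

```sql
-- Served by the primary key index, which leads with role_id:
SELECT permission_id FROM role_permissions WHERE role_id = $1;

-- Lookups that start from permission_id cannot use the leading column of that
-- index, so they usually need an index of their own:
CREATE INDEX role_permissions_permission_id_idx
    ON role_permissions (permission_id);

SELECT role_id FROM role_permissions WHERE permission_id = $1;
```

The second index is only worth adding if the reverse lookup is an actual access pattern.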
|
||||
|
||||
---
|
||||
|
||||
## 4. Index Strategy
|
||||
|
||||
For each table, define which indexes are created and why. Include the query they are designed to serve.
|
||||
|
||||
| Table | Index name | Columns | Type | Query served | Notes |
|
||||
|---|---|---|---|---|---|
|
||||
| users | `users_org_id_idx` | `(org_id)` | B-tree | `SELECT * FROM users WHERE org_id = $1` | FK lookup; required for join performance |
|
||||
| users | `users_email_lower_idx` | `(lower(email))` | B-tree (functional) | `WHERE lower(email) = lower($1)` | Case-insensitive email lookup |
|
||||
| users | `users_active_by_org_idx` | `(org_id, created_at DESC)` | B-tree | `WHERE org_id = $1 AND deleted_at IS NULL ORDER BY created_at DESC` | Partial index candidate (see below) |
|
||||
| [table_a] | `[table_a]_user_id_status_idx` | `(user_id, status)` | B-tree | `WHERE user_id = $1 AND status = 'active'` | Compound — order matters |
|
||||
| [table_a] | `[table_a]_metadata_gin_idx` | `metadata` | GIN | `WHERE metadata @> '{"key": "value"}'` | Only add if JSONB queried frequently |
|
||||
|
||||
**Partial indexes (PostgreSQL):**
|
||||
|
||||
```sql
|
||||
-- Index only active (non-deleted) users — dramatically smaller for soft-delete tables
|
||||
CREATE INDEX users_active_email_idx
|
||||
ON users (email, org_id)
|
||||
WHERE deleted_at IS NULL;
|
||||
|
||||
-- Index only pending items — avoids indexing the majority of rows
|
||||
CREATE INDEX [table_a]_pending_idx
|
||||
ON [table_a] (user_id, created_at)
|
||||
WHERE status = 'pending';
|
||||
```
|
||||
|
||||
**Index design principles applied:**
|
||||
- FKs that appear in JOIN conditions always have an index
|
||||
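- Compound indexes put equality-filtered columns first, then range or sort columns; among equality columns, lead with the one that is most selective or that other queries filter on alone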
|
||||
- Functional indexes for case-insensitive lookups
|
||||
- GIN indexes only where JSONB containment queries are frequent
|
||||
- Partial indexes for status-filtered queries on large tables
|
||||
|
||||
---
|
||||
|
||||
## 5. Access Pattern Analysis
|
||||
|
||||
Document the primary queries this schema is designed to serve. For each, show the query, the indexes used, and any caveats.
|
||||
|
||||
### AP-1: Fetch all active users for an organisation (paginated)
|
||||
|
||||
**Frequency:** Very high — called on every dashboard load
|
||||
**Query:**
|
||||
```sql
|
||||
SELECT id, email, display_name, role, created_at
|
||||
FROM users
|
||||
WHERE org_id = $1
|
||||
AND deleted_at IS NULL
|
||||
ORDER BY created_at DESC
|
||||
LIMIT 50 OFFSET $2;
|
||||
```
|
||||
**Index used:** `users_active_by_org_idx` (org_id, created_at DESC)
|
||||
**Notes:** Use keyset pagination (`WHERE created_at < $cursor`) at scale; OFFSET still reads and discards every skipped row, so the cost grows linearly with page depth.
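A keyset version of this query might look like the sketch below. It assumes `id` is added as a tie-breaker, because `created_at` alone is not guaranteed to be unique, and that the client passes back the `created_at` and `id` of the last row it received. The index name is hypothetical.

```sql
-- First page: no cursor yet.
SELECT id, email, display_name, role, created_at
FROM users
WHERE org_id = $1
  AND deleted_at IS NULL
ORDER BY created_at DESC, id DESC
LIMIT 50;

-- Next page: $2 and $3 are the created_at and id of the last row already shown.
SELECT id, email, display_name, role, created_at
FROM users
WHERE org_id = $1
  AND deleted_at IS NULL
  AND (created_at, id) < ($2, $3)
ORDER BY created_at DESC, id DESC
LIMIT 50;

-- Hypothetical supporting index: matches the filter and the sort order.
CREATE INDEX users_active_by_org_keyset_idx
    ON users (org_id, created_at DESC, id DESC)
    WHERE deleted_at IS NULL;
```

The row comparison `(created_at, id) < ($2, $3)` keeps the page boundary stable even when several users share the same `created_at`.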
|
||||
|
||||
---
|
||||
|
||||
### AP-2: Look up a user by email (case-insensitive)
|
||||
|
||||
**Frequency:** High — every authentication attempt
|
||||
**Query:**
|
||||
```sql
|
||||
SELECT id, org_id, role, email_verified
|
||||
FROM users
|
||||
WHERE lower(email) = lower($1)
|
||||
AND deleted_at IS NULL;
|
||||
```
|
||||
**Index used:** `users_email_lower_idx`
|
||||
**Notes:** Returns multiple rows if same email exists across orgs. Application resolves by org context.
|
||||
|
||||
---
|
||||
|
||||
### AP-3: Fetch [table_a] items for a user by status
|
||||
|
||||
**Frequency:** High
|
||||
**Query:**
|
||||
```sql
|
||||
SELECT *
|
||||
FROM [table_a]
|
||||
WHERE user_id = $1
|
||||
AND status = $2
|
||||
ORDER BY created_at DESC
|
||||
LIMIT 25;
|
||||
```
|
||||
**Index used:** `[table_a]_user_id_status_idx`
|
||||
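**Notes:** The compound index covers both equality filters. `user_id` leads so that the same index also serves lookups by `user_id` alone. The `ORDER BY created_at DESC` still needs a sort step unless `created_at` is appended to the index.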
|
||||
|
||||
---
|
||||
|
||||
### AP-4: [Add further access patterns as needed]
|
||||
|
||||
---
|
||||
|
||||
## 6. Normalization Decisions
|
||||
|
||||
Document deliberate choices to normalize or denormalize, with reasoning.
|
||||
|
||||
| Decision | Approach | Reasoning |
|
||||
|---|---|---|
|
||||
| [e.g. Organisation name on users table?] | **Not denormalized** — always join to organisations | Avoid stale copies; org name changes are infrequent and joining is cheap |
|
||||
| [e.g. Status history] | **Not in this table** — separate `[table_a]_status_history` if needed | Current status is all that's needed for 99% of queries; history is auditing, not application data |
|
||||
| [e.g. JSONB `settings` column on organisations] | **Denormalized into JSONB** | Settings are read together; never queried by field; schema changes don't require migrations |
|
||||
| [e.g. Computed aggregate counts] | **Not stored** — computed at query time | Counts are small; maintaining a counter column requires careful locking; use `SELECT COUNT(*)` with the index |
|
||||
|
||||
---
|
||||
|
||||
## 7. Triggers and Automation
|
||||
|
||||
```sql
|
||||
-- Automatically update updated_at on any row modification
|
||||
CREATE OR REPLACE FUNCTION set_updated_at()
|
||||
RETURNS TRIGGER AS $$
|
||||
BEGIN
|
||||
NEW.updated_at = now();
|
||||
RETURN NEW;
|
||||
END;
|
||||
$$ LANGUAGE plpgsql;
|
||||
|
||||
-- Apply to all tables with updated_at
|
||||
CREATE TRIGGER users_updated_at
|
||||
BEFORE UPDATE ON users
|
||||
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
|
||||
|
||||
CREATE TRIGGER [table_a]_updated_at
|
||||
BEFORE UPDATE ON [table_a]
|
||||
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
|
||||
```
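The `organisations` table also has an `updated_at` column that its column notes describe as trigger-maintained, so it presumably needs the same trigger. A sketch, assuming the `set_updated_at()` function defined above already exists:

```sql
CREATE TRIGGER organisations_updated_at
    BEFORE UPDATE ON organisations
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```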
|
||||
|
||||
---
|
||||
|
||||
## 8. Migration Notes
|
||||
|
||||
If this schema is being introduced to an existing system, note the migration approach.
|
||||
|
||||
| Step | Description | Backward compatible | Risk |
|
||||
|---|---|---|---|
|
||||
| 1 | Create `organisations` table | Yes — additive | Low |
|
||||
| 2 | Create `users` table | Yes — additive | Low |
|
||||
| 3 | Backfill `org_id` on existing users | **Requires dual-write period** | Medium |
|
||||
| 4 | Add NOT NULL constraint on `org_id` | Requires backfill to be 100% complete | Medium |
|
||||
| 5 | Remove deprecated columns | Requires app code updated first | Low once app deployed |
|
||||
|
||||
**Backfill strategy:** [Describe how to handle existing data — batch size, rate limiting, validation queries]
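As an illustration of what a batched backfill can look like, the sketch below updates up to 1000 rows per run. It rests on assumptions: the legacy `users.account_id` column and the `accounts.organisation_id` column are hypothetical sources for the new `org_id` value, and the batch size is only a starting point.

```sql
-- One batch: run repeatedly, with a pause in between, until it updates 0 rows.
WITH batch AS (
    SELECT u.id, a.organisation_id
    FROM users u
    JOIN accounts a ON a.id = u.account_id   -- assumed source of the org mapping
    WHERE u.org_id IS NULL
    LIMIT 1000
    FOR UPDATE OF u SKIP LOCKED              -- skip rows locked by live traffic
)
UPDATE users u
SET org_id = batch.organisation_id
FROM batch
WHERE u.id = batch.id;
```

Rows whose `account_id` has no matching account are never picked up by this statement, so the validation queries should count them separately.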
|
||||
|
||||
**Rollback:** Each migration step should be independently reversible. See [database-migration-plan skill] for the full rollback procedure template.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every table has a primary key and a `created_at` column — no implicit ordering by row insertion
|
||||
- [ ] Every foreign key has a corresponding index — no missing FK indexes that would cause full table scans on joins
|
||||
- [ ] All timestamp columns use TIMESTAMPTZ, not TIMESTAMP, so timezone awareness is explicit
|
||||
- [ ] Soft-delete tables document the convention and where the filter is enforced (ORM scope, view, or query standard)
|
||||
- [ ] Every access pattern in the design has a supporting index or an explicit note that a full table scan is acceptable
|
||||
- [ ] JSONB columns are justified — not used as a substitute for proper schema design on queryable fields
|
||||
- [ ] Normalization decisions are documented with reasoning, not just stated
|
||||
- [ ] Migration notes address existing data if this is a schema change, not a greenfield schema
|
||||
@@ -0,0 +1,332 @@
|
||||
---
|
||||
name: dependency-audit
|
||||
description: "Conduct a dependency audit for a project — checking for security vulnerabilities, license compliance issues, outdated packages, and transitive dependency risk. Use when asked to audit dependencies, review package security, check license compliance, assess dependency health, or produce a vulnerability report. Produces a vulnerability findings table, license compliance matrix, update priority matrix, dependency health score, and 30-day remediation plan."
|
||||
---
|
||||
|
||||
# Dependency Audit Skill
|
||||
|
||||
Produce a complete dependency audit report for a project — covering security vulnerabilities (with CVE references), license compliance against policy, outdated packages prioritised by risk, transitive dependency risk analysis, and a concrete remediation plan with timeline. A good dependency audit gives the team a clear, prioritised action list — not a raw dump of audit output that no one acts on.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Project language and ecosystem** — npm, pip/PyPI, Maven/Gradle, Go modules, Cargo, RubyGems, NuGet, or mixed
|
||||
- **Dependency list or package manifest** — paste the contents of `package.json`, `requirements.txt`, `go.mod`, `pom.xml`, etc., or provide the audit tool output
|
||||
- **License policy** — which licenses are allowed, which are restricted (e.g. "GPL is prohibited", "MIT/Apache/BSD only", or "no policy yet — recommend one")
|
||||
- **Current security tooling** — Dependabot, Snyk, OWASP Dependency-Check, npm audit, pip-audit, or none
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Dependency Audit Report: [Project Name]
|
||||
|
||||
**Ecosystem:** [npm / pip / Maven / Go / etc.]
|
||||
**Audit date:** [Date]
|
||||
**Auditor:** [Name]
|
||||
**Total direct dependencies:** [N]
|
||||
**Total transitive dependencies:** [N]
|
||||
**Audit tool(s) used:** [npm audit / pip-audit / Snyk / OWASP Dependency-Check / etc.]
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
| Category | Finding | Risk level |
|
||||
|---|---|---|
|
||||
| Critical vulnerabilities | [N] CVEs requiring immediate action | [Critical / High / Low] |
|
||||
| High vulnerabilities | [N] CVEs — fix within 7 days | [High / Medium] |
|
||||
| License violations | [N] packages with non-compliant licenses | [High / Low] |
|
||||
| Severely outdated packages | [N] packages 2+ major versions behind | [Medium] |
|
||||
| Packages with no active maintenance | [N] packages — no commits in 12+ months | [Medium] |
|
||||
| **Overall dependency health score** | **[Score]/100** | **[Red / Amber / Green]** |
|
||||
|
||||
**Scoring methodology:** Critical CVEs: −20 each. High CVEs: −10 each. License violations: −15 each. Abandoned packages: −5 each. Maximum deduction: 100. Score ≥80 = Green, 60–79 = Amber, <60 = Red.
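A worked example of the scoring rule, written as a query only to keep the arithmetic explicit; any language or a spreadsheet would do. The finding counts are made-up inputs.

```sql
WITH findings AS (
    -- Illustrative counts, not real audit results.
    SELECT 1 AS critical_cves,
           1 AS high_cves,
           0 AS license_violations,
           1 AS abandoned_packages
),
scored AS (
    SELECT GREATEST(0, 100 - (20 * critical_cves
                              + 10 * high_cves
                              + 15 * license_violations
                              + 5  * abandoned_packages)) AS score
    FROM findings
)
SELECT score,
       CASE WHEN score >= 80 THEN 'Green'
            WHEN score >= 60 THEN 'Amber'
            ELSE 'Red'
       END AS rating
FROM scored;
-- Deductions: 20 + 10 + 0 + 5 = 35, so score = 65 and rating = 'Amber'
```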
|
||||
|
||||
**Immediate actions required:**
|
||||
1. [Most critical action, e.g. "Upgrade lodash from 4.17.11 to 4.17.21 to fix CVE-2019-10744 (Critical: prototype pollution)"]
|
||||
2. [Second action]
|
||||
3. [Third action]
|
||||
|
||||
---
|
||||
|
||||
## 1. Security Vulnerability Findings
|
||||
|
||||
### Critical and High Severity (Critical: act within 24–72 hours; High: fix within 7 days)
|
||||
|
||||
| Package | Installed version | Fix version | CVE | Severity | CVSS score | Description | Exploitability |
|
||||
|---|---|---|---|---|---|---|---|
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Critical | [9.x] | [e.g. Prototype pollution via `merge` function — remote code execution possible] | [Known exploit / PoC available / No known exploit] |
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | High | [7.x] | [e.g. Path traversal in file serving utility] | [PoC available] |
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | High | [7.x] | [e.g. Regular expression denial of service (ReDoS)] | [No known exploit] |
|
||||
|
||||
### Medium Severity (Fix within 30 days)
|
||||
|
||||
| Package | Installed version | Fix version | CVE | Severity | CVSS score | Description |
|
||||
|---|---|---|---|---|---|---|
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Medium | [5.x] | [Description] |
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Medium | [4.x] | [Description] |
|
||||
|
||||
### Low Severity (Fix within 90 days or accept risk)
|
||||
|
||||
| Package | Installed version | Fix version | CVE | Severity | Description |
|
||||
|---|---|---|---|---|---|
|
||||
| [package-name] | [X.Y.Z] | [A.B.C] | [CVE-YYYY-NNNNN] | Low | [Description] |
|
||||
|
||||
### Vulnerabilities With No Fix Available
|
||||
|
||||
| Package | CVE | Severity | Recommended mitigation |
|
||||
|---|---|---|---|
|
||||
| [package-name] | [CVE-YYYY-NNNNN] | [High] | [e.g. "Remove this package — alternative: [replacement]"] |
|
||||
| [package-name] | [CVE-YYYY-NNNNN] | [Medium] | [e.g. "Vendor has a fix in progress — track issue [URL]. Mitigate by [X]"] |
|
||||
|
||||
---
|
||||
|
||||
## 2. License Compliance Matrix
|
||||
|
||||
### License Policy Reference
|
||||
|
||||
| License | Category | Policy | Notes |
|
||||
|---|---|---|---|
|
||||
| MIT | Permissive | Allowed | Attribution required in distributed products |
|
||||
| Apache 2.0 | Permissive | Allowed | Attribution + NOTICE file required |
|
||||
| BSD 2-Clause / 3-Clause | Permissive | Allowed | Attribution required |
|
||||
| ISC | Permissive | Allowed | |
|
||||
| MPL 2.0 | Weak copyleft | Allowed with review | Source disclosure required for modified MPL files only |
|
||||
| LGPL v2 / v3 | Weak copyleft | Allowed with review | Dynamic linking permitted; static linking may require disclosure |
|
||||
| GPL v2 / v3 | Strong copyleft | **Restricted** | May require open-sourcing the entire codebase — legal review required |
|
||||
| AGPL v3 | Strong copyleft | **Restricted** | Network use triggers copyleft — especially risky for SaaS |
|
||||
| SSPL | Source available | **Prohibited** | Not OSI-approved — treat as proprietary |
|
||||
| Proprietary / Commercial | Commercial | **Requires contract** | Verify license covers current use case and scale |
|
||||
| Unknown / Unlicensed | — | **Prohibited** | No license = all rights reserved — cannot use legally |
|
||||
|
||||
### Findings: Packages With Compliance Issues
|
||||
|
||||
| Package | License | Issue | Recommendation | Risk if unaddressed |
|
||||
|---|---|---|---|---|
|
||||
| [package-name] | GPL v3 | Copyleft — may require open-sourcing this project | Replace with [alternative] or get legal sign-off | Legal / IP risk |
|
||||
| [package-name] | AGPL v3 | Network copyleft — SaaS use triggers disclosure | Replace with [alternative] | Legal / IP risk |
|
||||
| [package-name] | Proprietary | License may not cover current usage tier | Verify license scope with vendor | Contract breach |
|
||||
| [package-name] | Unknown | No license declared in package metadata | Contact maintainer or replace | Cannot use legally |
|
||||
|
||||
### All Licenses in Use (Full Inventory)
|
||||
|
||||
| License | Package count | Compliance status |
|
||||
|---|---|---|
|
||||
| MIT | [N] | Compliant |
|
||||
| Apache 2.0 | [N] | Compliant |
|
||||
| BSD-3-Clause | [N] | Compliant |
|
||||
| ISC | [N] | Compliant |
|
||||
| MPL 2.0 | [N] | Review required |
|
||||
| GPL v3 | [N] | **Non-compliant** |
|
||||
| Unknown | [N] | **Non-compliant** |
|
||||
|
||||
---
|
||||
|
||||
## 3. Outdated Package Analysis
|
||||
|
||||
### Severely Outdated (2+ major versions behind — high upgrade effort)
|
||||
|
||||
| Package | Installed | Latest stable | Versions behind | Last updated | Breaking changes summary |
|
||||
|---|---|---|---|---|---|
|
||||
| [package-name] | [1.x.x] | [3.x.x] | 2 major | [Date] | [e.g. "API redesign in v2; async support added in v3"] |
|
||||
| [package-name] | [0.x.x] | [2.x.x] | 2 major | [Date] | [Summary] |
|
||||
|
||||
### Moderately Outdated (1 major version behind)
|
||||
|
||||
| Package | Installed | Latest stable | Versions behind | Security fix in newer version? |
|
||||
|---|---|---|---|---|
|
||||
| [package-name] | [2.x.x] | [3.x.x] | 1 major | [Yes — CVE-YYYY-NNNNN / No] |
|
||||
| [package-name] | [4.x.x] | [5.x.x] | 1 major | [No] |
|
||||
|
||||
### Minor/Patch Updates Available (Low risk to update)
|
||||
|
||||
| Package | Installed | Latest | Contains security fix? |
|
||||
|---|---|---|---|
|
||||
| [package-name] | [2.3.1] | [2.3.9] | [Yes / No] |
|
||||
| [package-name] | [1.0.0] | [1.2.1] | [No] |
|
||||
|
||||
---
|
||||
|
||||
## 4. Dependency Graph Risk Analysis
|
||||
|
||||
### Transitive Dependency Risk
|
||||
|
||||
Transitive (indirect) dependencies carry risk because they are not explicitly managed. These are the highest-risk transitive dependencies in this project:
|
||||
|
||||
| Vulnerable transitive dep | Pulled in by | Installed version | Fix available | Action |
|
||||
|---|---|---|---|---|
|
||||
| [transitive-package] | [direct-parent] | [X.Y.Z] | [Yes — upgrade [parent] to [version]] | Upgrade direct dependency [parent] |
|
||||
| [transitive-package] | [direct-parent] | [X.Y.Z] | [No] | Remove [parent] or use [alternative] |
|
||||
|
||||
### Dependency Concentration Risk
|
||||
|
||||
These packages are depended on by many other packages in the project — a vulnerability or deprecation would have cascading effects:
|
||||
|
||||
| Package | Depended on by (N packages) | Actively maintained? | Risk level |
|
||||
|---|---|---|---|
|
||||
| [package-name] | [N] | [Yes / No — last commit: date] | [High / Medium] |
|
||||
| [package-name] | [N] | [Yes] | [Medium] |
|
||||
|
||||
### Abandoned / Unmaintained Packages
|
||||
|
||||
| Package | Last release | Last commit | Weekly downloads | Recommended alternative |
|
||||
|---|---|---|---|---|
|
||||
| [package-name] | [Date] | [Date] | [N] | [alternative-package] |
|
||||
| [package-name] | [Date] | [Date] | [N] | [Maintained fork: URL] |
|
||||
|
||||
---
|
||||
|
||||
## 5. Remediation Plan
|
||||
|
||||
### 30-Day Plan
|
||||
|
||||
**Week 1 — Critical vulnerabilities (Days 1–7)**
|
||||
|
||||
| Action | Owner | Package | Effort | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Upgrade [package] [old] → [new] | [Name] | [package-name] | [30 min] | [No API changes / check breaking changes guide: URL] |
|
||||
| Replace [package] with [alternative] | [Name] | [package-name] | [2 hours] | [No fix available — must replace] |
|
||||
| Patch override for [transitive-dep] | [Name] | [transitive-dep] | [15 min] | [Add resolutions/overrides entry in manifest] |
|
||||
|
||||
```bash
|
||||
# Commands for Week 1 upgrades:
|
||||
|
||||
# npm
|
||||
npm install [package]@[target-version]
|
||||
npm audit fix --force # use with caution — may introduce breaking changes
|
||||
|
||||
# pip
|
||||
pip install --upgrade [package]==[target-version]
|
||||
pip-audit --fix # if using pip-audit
|
||||
|
||||
# Go
|
||||
go get [module]@[version]
|
||||
go mod tidy
|
||||
|
||||
# Maven
|
||||
# Update pom.xml version property, then:
|
||||
mvn versions:use-latest-releases -DallowMajorUpdates=false
|
||||
mvn dependency:resolve
|
||||
```
|
||||
|
||||
**Week 2 — High vulnerabilities and license violations (Days 8–14)**
|
||||
|
||||
| Action | Owner | Package | Effort | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Upgrade [package] | [Name] | [package-name] | [1 hour] | |
|
||||
| Replace GPL-licensed [package] | [Name] | [package-name] | [4 hours] | [Alternative: [package]] |
|
||||
| Legal review for [package] license | Legal team | [package-name] | [Legal team SLA] | [Submit via [process]] |
|
||||
|
||||
**Week 3 — Medium vulnerabilities and abandoned packages (Days 15–21)**
|
||||
|
||||
| Action | Owner | Package | Effort | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Upgrade [package] | [Name] | [package-name] | [30 min] | |
|
||||
| Replace abandoned [package] | [Name] | [package-name] | [2 hours] | [Maintained fork or alternative: [URL]] |
|
||||
|
||||
**Week 4 — Process improvements (Days 22–30)**
|
||||
|
||||
| Action | Owner | Effort | Notes |
|
||||
|---|---|---|---|
|
||||
| Enable Dependabot / Renovate for automated PRs | [Name] | [2 hours] | [Config in Section 6] |
|
||||
| Add `npm audit` / `pip-audit` to CI — fail on Critical/High | [Name] | [1 hour] | [Config in Section 6] |
|
||||
| Document license policy in CONTRIBUTING.md | [Name] | [1 hour] | [Based on policy in Section 2] |
|
||||
| Schedule next quarterly audit | [Name] | [15 min] | [Add to team calendar] |
|
||||
|
||||
---
|
||||
|
||||
## 6. Policy Recommendations
|
||||
|
||||
### Automated Vulnerability Scanning in CI
|
||||
|
||||
Add the following to your CI pipeline to catch vulnerabilities before they merge:
|
||||
|
||||
```yaml
# GitHub Actions — adapt for your CI platform
dependency-audit:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4

    # npm
    - name: npm audit
      run: npm audit --audit-level=high
      # Fails build on High or Critical vulnerabilities

    # pip
    - name: pip-audit
      run: |
        pip install pip-audit
        # pip-audit has no severity filter; it exits non-zero on any known vulnerability
        pip-audit --requirement requirements.txt

    # Go
    - name: govulncheck
      run: |
        go install golang.org/x/vuln/cmd/govulncheck@latest
        govulncheck ./...
```
|
||||
|
||||
### Dependabot / Renovate Configuration
|
||||
|
||||
```yaml
# .github/dependabot.yml — automated dependency update PRs
version: 2
updates:
  - package-ecosystem: "[npm / pip / gomod / maven]"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 10
    labels:
      - "dependencies"
      - "automated"
    ignore:
      # Ignore major version bumps — review these manually
      - dependency-name: "*"
        update-types: ["version-update:semver-major"]
```
|
||||
|
||||
### License Scanning
|
||||
|
||||
```bash
|
||||
# npm — license checker
# --onlyAllow and --failOn cannot be combined; the allowlist alone already rejects GPL, AGPL, and LGPL
npx license-checker --onlyAllow 'MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC'
|
||||
|
||||
# Python — pip-licenses
|
||||
pip install pip-licenses
|
||||
pip-licenses --allow-only="MIT;Apache Software License;BSD License;ISC License" \
  --fail-on="GNU General Public License"
|
||||
|
||||
# Go — go-licenses
|
||||
go install github.com/google/go-licenses@latest
|
||||
go-licenses check ./... --allowed_licenses=MIT,Apache-2.0,BSD-2-Clause,BSD-3-Clause
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Dependency Health Score Detail
|
||||
|
||||
| Category | Max points | Score | Notes |
|---|---|---|---|
| No critical vulnerabilities | 30 | [N]/30 | −20 per critical CVE |
| No high vulnerabilities | 20 | [N]/20 | −10 per high CVE |
| License compliance | 20 | [N]/20 | −15 per violation |
| No abandoned packages | 15 | [N]/15 | −5 per abandoned package |
| Up-to-date major versions | 10 | [N]/10 | −2 per major version behind |
| Automated scanning enabled | 5 | [N]/5 | All-or-nothing |
| **Total** | **100** | **[Score]/100** | **[Red / Amber / Green]** |
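
The scoring table can be applied directly to the counts from the audit. This is a minimal bash sketch of that arithmetic, assuming each category floors at zero (the table does not say what happens once the penalties exceed the category maximum); `category_score` and `health_score` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch of the scoring table. Assumption: each category floors at zero.

# category_score MAX PENALTY COUNT
# Prints MAX minus PENALTY per finding, never below zero.
category_score() {
  local max=$1 penalty=$2 count=$3
  local score=$(( max - penalty * count ))
  if (( score < 0 )); then score=0; fi
  echo "$score"
}

# health_score CRITICAL HIGH LICENSE_VIOLATIONS ABANDONED MAJORS_BEHIND SCANNING
# SCANNING is 1 when automated scanning is enabled, 0 otherwise.
health_score() {
  local total=0
  total=$(( total + $(category_score 30 20 "$1") ))  # critical CVEs
  total=$(( total + $(category_score 20 10 "$2") ))  # high CVEs
  total=$(( total + $(category_score 20 15 "$3") ))  # license violations
  total=$(( total + $(category_score 15 5 "$4") ))   # abandoned packages
  total=$(( total + $(category_score 10 2 "$5") ))   # major versions behind
  if (( $6 == 1 )); then total=$(( total + 5 )); fi  # all-or-nothing
  echo "$total"
}
```

With one critical CVE, two high CVEs, one abandoned package, three major versions behind, and no automated scanning, `health_score 1 2 0 1 3 0` prints `44`.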
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every Critical and High CVE has a named owner and a resolution date in the 30-day plan
|
||||
- [ ] License findings have been reviewed by legal or a named engineer with authority to accept the risk
|
||||
- [ ] Transitive dependency vulnerabilities are included — not just direct dependencies
|
||||
- [ ] Abandoned packages have a concrete replacement recommendation, not just "consider replacing"
|
||||
- [ ] CI pipeline change is included — the audit findings should be the last time these are caught manually
|
||||
- [ ] The dependency health score is calculated from actual findings, not estimated
|
||||
- [ ] Remediation plan actions are specific commands or steps, not "upgrade package X" without version targets
|
||||
@@ -0,0 +1,332 @@
|
||||
---
|
||||
name: developer-onboarding-doc
|
||||
description: "Write a developer onboarding document for a service, codebase, or team. Use when asked to write a developer guide, service README, onboarding doc for a new engineer, codebase orientation, or getting-started guide for a technical team. Produces a structured doc covering service overview, architecture, local setup, key patterns, testing, deployment, and who to ask for what."
|
||||
---
|
||||
|
||||
# Developer Onboarding Document Skill
|
||||
|
||||
Produce a complete developer onboarding document for a service or team — covering everything a new engineer needs to be productive within their first week.
|
||||
|
||||
A good onboarding doc is not a wiki dump. It answers the questions a new engineer actually has on day one, in the order they'll have them.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does
|
||||
- **Team** responsible for it
|
||||
- **Tech stack** — language(s), framework(s), database(s), message queues, etc.
|
||||
- **Key external dependencies** — upstream services, third-party APIs
|
||||
- **Deployment target** — Kubernetes, ECS, Lambda, bare metal, etc.
|
||||
- **Local dev setup** — how to run locally (Docker Compose, local DB, etc.)
|
||||
- **Testing approach** — unit, integration, E2E; test commands
|
||||
- **Deployment process** — summary of how code gets to production
|
||||
- **On-call setup** — who's on-call, how alerts work
|
||||
- **Contacts** — tech lead, platform team, related service owners
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Developer Onboarding: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**Last updated:** [Date] | **Updated by:** [Name]
|
||||
|
||||
> If something in this doc is wrong or out of date, fix it now — it will affect every engineer who onboards after you.
|
||||
|
||||
---
|
||||
|
||||
## What This Service Does
|
||||
|
||||
[3–5 sentences. What problem does this service solve? Who calls it, and who does it call? What would break if this service went down?]
|
||||
|
||||
**Service type:** [API / Background worker / Event consumer / Data pipeline / etc.]
|
||||
**Consumers:** [List internal services or external clients that depend on this service]
|
||||
**Dependencies:** [List upstream services, databases, and third-party APIs this service calls]
|
||||
|
||||
**Architecture diagram:** [Link or embed — even a rough ASCII diagram helps]
|
||||
|
||||
```
[Caller A] ──→ [This Service] ──→ [Database]
                      │
                      └──→ [Downstream Service]
```
|
||||
|
||||
---
|
||||
|
||||
## Codebase Orientation
|
||||
|
||||
**Repository:** [Link]
|
||||
**Main branch:** `[main / master]`
|
||||
**Language:** [e.g. Go 1.22 / Node.js 20 / Python 3.12]
|
||||
**Framework:** [e.g. Express / FastAPI / Gin / Rails]
|
||||
|
||||
### Key directories
|
||||
|
||||
```
|
||||
[repo-root]/
|
||||
├── [src/ or cmd/] # Application code
|
||||
│ ├── [handlers/] # HTTP handlers / controllers
|
||||
│ ├── [services/] # Business logic
|
||||
│ ├── [repository/] # Database access layer
|
||||
│ └── [models/] # Data models / types
|
||||
├── [tests/] # Test files
|
||||
├── [migrations/] # Database migrations
|
||||
├── [scripts/] # Utility scripts
|
||||
├── [.github/workflows/] # CI/CD pipeline definitions
|
||||
└── [docs/] # Additional documentation
|
||||
```
|
||||
|
||||
**Where to start reading:** [Point to 2–3 key files that give the best orientation — e.g. `main.go`, `routes.js`, `app.py`]
|
||||
|
||||
### Things that might surprise you
|
||||
|
||||
- [Unusual pattern 1 — e.g. "We use event sourcing — state is derived from an event log, not stored directly"]
|
||||
- [Unusual pattern 2 — e.g. "Auth is handled by the gateway — this service trusts the `X-User-Id` header"]
|
||||
- [Unusual pattern 3 — any non-obvious decisions or legacy choices]
|
||||
|
||||
---
|
||||
|
||||
## Local Development Setup
|
||||
|
||||
**Estimated setup time:** [X minutes for a fresh machine]
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- [ ] [Tool 1] — version [X] — [install link]
|
||||
- [ ] [Tool 2] — version [X] — [install link]
|
||||
- [ ] Access to [repo / internal package registry] — request from [who]
|
||||
- [ ] [Any secrets or credentials needed] — request from [who]
|
||||
|
||||
### Step-by-step setup
|
||||
|
||||
```bash
|
||||
# 1. Clone the repo
|
||||
git clone [repo URL]
|
||||
cd [repo-name]
|
||||
|
||||
# 2. Copy and configure environment variables
|
||||
cp .env.example .env
|
||||
# Edit .env — see "Environment Variables" section below
|
||||
|
||||
# 3. Start dependencies (database, cache, etc.)
|
||||
[docker compose up -d / make deps / etc.]
|
||||
|
||||
# 4. Install dependencies
|
||||
[npm install / go mod download / pip install -r requirements.txt]
|
||||
|
||||
# 5. Run database migrations
|
||||
[migration command]
|
||||
|
||||
# 6. Start the service
|
||||
[start command]
|
||||
|
||||
# 7. Verify it's working
|
||||
curl http://localhost:[PORT]/health
|
||||
# Expected: {"status":"ok"}
|
||||
```
|
||||
|
||||
**If this doesn't work:** Check [Troubleshooting section below] or ask in `#[channel]`.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
| Variable | Required | Description | Example |
|
||||
|---|---|---|---|
|
||||
| `DATABASE_URL` | Yes | Connection string for the primary DB | `postgres://localhost:5432/[db]` |
|
||||
| `[VAR_2]` | Yes | [Description] | [Example] |
|
||||
| `[VAR_3]` | No | [Description — default value] | [Example] |
|
||||
|
||||
**Secrets for local dev:** [Where to get them — e.g. "Run `[command]` to pull from Vault" or "Ask [person] in #[channel]"]
|
||||
|
||||
### Useful local commands
|
||||
|
||||
```bash
|
||||
[start command] # Start the service
|
||||
[test command] # Run all tests
|
||||
[lint command] # Run linter
|
||||
[format command] # Format code
|
||||
[migration command] # Run pending migrations
|
||||
[seed command] # Seed local database
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
|
||||
**Testing philosophy:** [e.g. "We test at the integration layer — unit tests for pure functions, integration tests for anything touching the DB or external services"]
|
||||
|
||||
### Running tests
|
||||
|
||||
```bash
|
||||
# All tests
|
||||
[test command]
|
||||
|
||||
# Unit tests only
|
||||
[unit test command]
|
||||
|
||||
# Integration tests (requires local deps running)
|
||||
[integration test command]
|
||||
|
||||
# A specific test file or test case
|
||||
[test command with filter]
|
||||
```
|
||||
|
||||
**Test coverage:** [X]% (minimum required to pass CI: [Y]%)
|
||||
**Coverage report:** [Where to find it]
|
||||
|
||||
### Writing tests
|
||||
|
||||
- **Unit tests:** [Where to put them — e.g. alongside source files as `*_test.go`]
|
||||
- **Integration tests:** [Where to put them — e.g. `tests/integration/`]
|
||||
- **Test database:** [How it works — e.g. "Each test gets a clean transaction that rolls back on teardown — see `tests/helpers/db.go`"]
|
||||
- **Mocking:** [Policy — e.g. "We mock at the repository layer — don't mock the DB directly"]
|
||||
|
||||
---
|
||||
|
||||
## Making Changes
|
||||
|
||||
### Branching
|
||||
|
||||
[Branch naming convention — e.g. `feature/[ticket-id]-short-description`, `fix/[ticket-id]-short-description`]
|
||||
|
||||
### Before opening a PR
|
||||
|
||||
- [ ] Tests pass locally
|
||||
- [ ] Linter passes (`[lint command]`)
|
||||
- [ ] New behaviour has test coverage
|
||||
- [ ] Any new environment variables are added to `.env.example` and documented
|
||||
- [ ] Database migrations are backward-compatible (old code can run against new schema)
|
||||
|
||||
### Code review
|
||||
|
||||
- **Reviewers:** [Who to request review from — e.g. "Any engineer on [team]; lead review required for auth changes"]
|
||||
- **Expected review time:** [X hours / 1 business day]
|
||||
- **PR template:** [Link or auto-generated by GitHub]
|
||||
|
||||
### Database migrations
|
||||
|
||||
```bash
|
||||
# Create a new migration
|
||||
[migration create command]
|
||||
|
||||
# Apply pending migrations
|
||||
[migration up command]
|
||||
|
||||
# Roll back last migration
|
||||
[migration down command]
|
||||
```
|
||||
|
||||
**Migration rules:**
|
||||
- All migrations must be backward-compatible — old code must run against the new schema
|
||||
- Never rename or drop a column in a single migration — do it in stages (add new, migrate data, drop old)
|
||||
- Test your rollback before merging
|
||||
|
||||
---
|
||||
|
||||
## Deployment
|
||||
|
||||
**How code gets to production:** [1–2 sentence summary — link to full CI/CD playbook if it exists]
|
||||
|
||||
1. Merge to `main` → automatic deploy to staging
|
||||
2. Smoke tests run on staging
|
||||
3. Manual approval → deploy to production
|
||||
4. Post-deploy monitoring for [X minutes]
|
||||
|
||||
**Deployment docs:** [Link to CI/CD playbook or pipeline docs]
|
||||
|
||||
**Who can deploy:** [Any engineer / Lead engineer / On-call engineer — specify]
|
||||
|
||||
**Deployment channel:** `#[deployments channel]`
|
||||
|
||||
---
|
||||
|
||||
## Monitoring and Observability
|
||||
|
||||
**Dashboard:** [Datadog / Grafana / CloudWatch — link]
|
||||
**Logs:** [Log aggregation tool and link — e.g. "Logs are in Datadog under service:[name]"]
|
||||
**Traces:** [Tracing tool and link if applicable]
|
||||
**Alerts:** [Where alerts fire — e.g. PagerDuty / Slack #alerts-[service]]
|
||||
|
||||
**Key metrics to know:**
|
||||
- **Error rate:** Should be <[X]% (alert at [Y]%)
|
||||
- **P99 latency:** Should be <[X]ms
|
||||
- **[Business metric]:** [e.g. "Queue depth should be <100 items"]
|
||||
|
||||
---
|
||||
|
||||
## On-Call
|
||||
|
||||
**On-call schedule:** [PagerDuty / Opsgenie link]
|
||||
**Who's on-call now:** [Link to current schedule or `#oncall` channel]
|
||||
**Escalation:** [On-call → [team lead] → [EM] — after [X] minutes unacknowledged]
|
||||
|
||||
**If you get paged:**
|
||||
1. Acknowledge the alert
|
||||
2. Check [dashboard link] for the first clue
|
||||
3. Common alert runbooks: [link to oncall-runbook or runbook-writer output]
|
||||
4. If you can't resolve in [X minutes], escalate to [person/channel]
|
||||
|
||||
---
|
||||
|
||||
## Key Contacts
|
||||
|
||||
| Role | Name | Best way to reach |
|
||||
|---|---|---|
|
||||
| Tech lead | [Name] | Slack: @[handle] |
|
||||
| On-call rotation | [Team] | PagerDuty / `#on-call` |
|
||||
| Platform / infra | [Team] | `#platform` Slack channel |
|
||||
| Database / DBA | [Name or team] | `#database` Slack channel |
|
||||
| [Upstream service] owner | [Name] | Slack: @[handle] |
|
||||
|
||||
**Where to ask questions:**
|
||||
- General engineering: `#engineering`
|
||||
- This service specifically: `#[service-name]`
|
||||
- Urgent / production issues: `#incidents`
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "The service won't start locally"
|
||||
|
||||
1. Check that Docker / dependencies are running: `[command]`
|
||||
2. Check `.env` is populated — missing values cause silent failures
|
||||
3. Check logs: `[log command]`
|
||||
4. Ask in `#[channel]`
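
Because missing `.env` values tend to fail silently, it can help to compare `.env` against `.env.example` before starting the service. The following is a minimal bash sketch, assuming both files use plain `KEY=value` lines without `export` prefixes or quoting; `missing_env_keys` is a hypothetical helper, not something the repository is known to ship.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report keys from the example file that are absent
# or empty in the real env file. Assumes plain KEY=value lines.

# missing_env_keys EXAMPLE_FILE ENV_FILE
missing_env_keys() {
  local example=$1 env_file=$2 key value
  # "|| [[ -n $key ]]" keeps a final line that lacks a trailing newline
  while IFS='=' read -r key _ || [[ -n "$key" ]]; do
    # Skip blank lines and comments
    if [[ -z "$key" || "$key" == \#* ]]; then continue; fi
    # Take the last assignment of the key, everything after the first "="
    value=$(grep -E "^${key}=" "$env_file" 2>/dev/null | tail -n 1 | cut -d '=' -f 2-)
    if [[ -z "$value" ]]; then echo "$key"; fi
  done < "$example"
  return 0
}
```

Running `missing_env_keys .env.example .env` prints one key per line; empty output means every key in the example file has a value.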
|
||||
|
||||
### "Tests are failing locally but passing in CI"
|
||||
|
||||
- Check your local dependency versions match CI: `[version check command]`
|
||||
- Try a clean install: `[clean install command]`
|
||||
- Integration tests need local deps running — `[start deps command]`
|
||||
|
||||
### "I can't access [internal tool / system]"
|
||||
|
||||
- Request access through [process — e.g. Okta self-serve / ask your manager]
|
||||
|
||||
### "Something looks wrong in production"
|
||||
|
||||
1. Check [dashboard] for the error spike
|
||||
2. Check recent deploys in `#deployments`
|
||||
3. If it's an active incident, page on-call via [PagerDuty / Slack command]
|
||||
|
||||
---
|
||||
|
||||
## Further Reading
|
||||
|
||||
- [Architecture Decision Records (ADRs)](./docs/decisions/) — why the codebase is the way it is
|
||||
- [API documentation](./docs/api/) or [link to external docs]
|
||||
- [Incident runbooks](./docs/runbooks/)
|
||||
- [CI/CD pipeline documentation](./docs/cicd/)
|
||||
- [Team working agreements](./docs/team/)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Local setup instructions work on a fresh machine — tested recently
|
||||
- [ ] Environment variables table is complete and accurate
|
||||
- [ ] "Things that might surprise you" captures the actual surprises (ask a recent joiner)
|
||||
- [ ] On-call section has real links, not placeholders
|
||||
- [ ] Contacts are current — team members with real Slack handles
|
||||
- [ ] Troubleshooting covers the top 3 actual questions new joiners ask
|
||||
@@ -0,0 +1,560 @@
|
||||
---
|
||||
name: disaster-recovery-plan
|
||||
description: "Write a disaster recovery plan for a service or system — covering RPO/RTO targets, failure scenario runbooks, backup and restore procedures, DR testing cadence, and communication templates. Use when asked to write a DR plan, document failover procedures, create recovery runbooks, define RTO/RPO targets, or prepare for a disaster recovery game day. Produces a full DR document with per-scenario recovery runbooks, backup validation procedures, testing schedule, and communication templates."
|
||||
---
|
||||
|
||||
# Disaster Recovery Plan Skill
|
||||
|
||||
Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does (business function and technical role)
|
||||
- **Criticality tier** — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
|
||||
- **Current infrastructure setup** — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
|
||||
- **RPO/RTO requirements** — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long can it be down)
|
||||
- **Backup strategy** — what is backed up, how often, where backups are stored, retention policy
|
||||
- **On-call contacts** — names and contact details for the responder chain
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Disaster Recovery Plan: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**Criticality tier:** [Tier 1 / Tier 2 / Tier 3] | **Last tested:** [Date]
|
||||
**Next DR test:** [Date] | **Document owner:** [Name]
|
||||
**Last updated:** [Date] | **Review cycle:** Quarterly
|
||||
|
||||
> **Emergency? Skip to Section 3 — Failure Scenario Runbooks.** Find the scenario that matches your situation and follow the steps exactly.
|
||||
|
||||
---
|
||||
|
||||
## 1. Recovery Targets
|
||||
|
||||
| Target | Value | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction — database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven — backup interval must be ≤ RPO] |
|
||||
|
||||
**What these mean in practice:**
|
||||
- If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
|
||||
- The service must be operational again within [Y minutes/hours] of declaring a DR event.
|
||||
- If either target cannot be met, escalate to [Engineering Manager] immediately.
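
Both targets are only useful if they are checked against real numbers. The sketch below shows one way to do that in bash, assuming times are supplied as whole minutes and Unix epoch seconds; `check_backup_interval` and `check_recovery_time` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking recovery targets against measurements.

# check_backup_interval RPO_MINUTES BACKUP_INTERVAL_MINUTES
# Succeeds when the backup interval is no longer than the RPO.
check_backup_interval() {
  local rpo=$1 interval=$2
  if (( interval <= rpo )); then
    echo "OK: backup interval ${interval}m is within RPO ${rpo}m"
  else
    echo "FAIL: backup interval ${interval}m exceeds RPO ${rpo}m"
    return 1
  fi
}

# check_recovery_time RTO_MINUTES DECLARED_EPOCH RECOVERED_EPOCH
# Succeeds when recovery finished within the RTO of declaring the DR event.
check_recovery_time() {
  local rto=$1 declared=$2 recovered=$3
  # Round elapsed seconds up to whole minutes
  local elapsed=$(( (recovered - declared + 59) / 60 ))
  if (( elapsed <= rto )); then
    echo "OK: recovered in ${elapsed}m (RTO ${rto}m)"
  else
    echo "FAIL: recovered in ${elapsed}m (RTO ${rto}m)"
    return 1
  fi
}
```

A non-zero exit status from either function corresponds to the "escalate" case in the list above, so the checks can gate a DR test report or a CI job.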
|
||||
|
||||
---
|
||||
|
||||
## 2. Failure Scenario Inventory
|
||||
|
||||
| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|---|---|---|---|---|---|
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 — no data loss] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |
|
||||
|
||||
---
|
||||
|
||||
## 3. Failure Scenario Runbooks
|
||||
|
||||
### 3.1 Single Availability Zone Failure
|
||||
|
||||
**Trigger:** One AZ becomes unreachable — pods/instances in that zone stop responding.
|
||||
**Detection:** PagerDuty alert `[AlertName]` fires, or cloud provider status page shows AZ degradation.
|
||||
**Expected RTO:** [15 minutes] | **Expected RPO:** Zero (no data loss if multi-AZ replication is working)
|
||||
|
||||
**Step 1 — Confirm the failure**
|
||||
```bash
|
||||
# Check pod/instance health across zones
|
||||
kubectl get pods -o wide -n [namespace] | grep -v Running
|
||||
|
||||
# Check which nodes are affected (-w matches Ready as a whole word, so NotReady nodes stay in the output)
kubectl get nodes -o wide | grep -vw Ready
|
||||
|
||||
# Verify cloud provider AZ status
|
||||
# AWS: https://health.aws.amazon.com/health/status
|
||||
# GCP: https://status.cloud.google.com
|
||||
```
|
||||
|
||||
**Step 2 — Assess whether auto-recovery has occurred**
|
||||
```bash
|
||||
# If using auto-scaling, check if replacement instances launched
|
||||
kubectl get pods -n [namespace] --watch
|
||||
|
||||
# Check deployment replica count
|
||||
kubectl get deployment [service-name] -n [namespace]
|
||||
|
||||
# Verify load balancer health checks are passing
|
||||
[cloud provider CLI command to check target group health]
|
||||
```
|
||||
|
||||
**Step 3 — Force rescheduling if auto-recovery stalled**
|
||||
```bash
|
||||
# Cordon the affected node so no new pods schedule on it
|
||||
kubectl cordon [node-name]
|
||||
|
||||
# Drain the node — moves all pods to healthy nodes
|
||||
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data
|
||||
|
||||
# Verify pods have rescheduled successfully
|
||||
kubectl get pods -o wide -n [namespace]
|
||||
```
|
||||
|
||||
**Step 4 — Verify service health**
|
||||
```bash
|
||||
# Smoke test key endpoints
|
||||
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
|
||||
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]
|
||||
|
||||
# Check error rate in monitoring
|
||||
[dashboard link or query]
|
||||
```
|
||||
|
||||
**Recovery confirmed when:** All pods are Running, health check returns 200, error rate is at baseline.
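
The "health check returns 200" condition can be polled rather than checked once, since pods may take a little while to pass readiness after rescheduling. This is a minimal bash sketch, assuming the service exposes the `/health` endpoint used in Step 4; `wait_for_healthy` and its default retry values are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a health endpoint until it returns HTTP 200.

# wait_for_healthy URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_healthy() {
  local url=$1 attempts=${2:-10} delay=${3:-5} code i
  for (( i = 1; i <= attempts; i++ )); do
    # Same status-code probe as the smoke test, with a per-request timeout
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
    if [[ "$code" == "200" ]]; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after ${attempts} attempts (last status: ${code})"
  return 1
}

# Usage (placeholder URL):
#   wait_for_healthy "https://[service-url]/health" 12 10
```

The function only covers the health check; pod status and error rate still need to be confirmed separately.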
|
||||
|
||||
---
|
||||
|
||||
### 3.2 Full Region Failure
|
||||
|
||||
**Trigger:** The primary region is entirely unavailable.
|
||||
**Detection:** All service health checks failing, cloud provider status page confirms region-wide event.
|
||||
**Expected RTO:** [60 minutes] | **Expected RPO:** [5 minutes — based on cross-region replication lag]
|
||||
|
||||
**Step 1 — Confirm regional failure (5 minutes)**
|
||||
```bash
|
||||
# Confirm the primary region is unreachable (-c 3 stops after three probes instead of running indefinitely)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"
|
||||
|
||||
# Check replication lag on standby region database
|
||||
[command to check replica lag — e.g. for RDS: aws rds describe-db-instances --region [dr-region]]
|
||||
```
|
||||
|
||||
**Step 2 — Declare DR event and notify (2 minutes)**
|
||||
|
||||
Post to `#incidents`:
|
||||
```
|
||||
🔴 DR EVENT — [Service Name] — Region Failure
|
||||
Primary region: [region] — UNREACHABLE
|
||||
Activating failover to: [dr-region]
|
||||
Incident commander: [Name]
|
||||
Next update: 15 minutes
|
||||
```
|
||||
|
||||
Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.
|
||||
|
||||
**Step 3 — Promote DR database (10 minutes)**
|
||||
```bash
|
||||
# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Record the new database endpoint
aws rds describe-db-instances \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region] \
  --query 'DBInstances[0].Endpoint.Address'
|
||||
```
|
||||
|
||||
**Step 4 — Deploy service in DR region (20 minutes)**
|
||||
```bash
|
||||
# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
  DATABASE_URL=[new-dr-database-url] \
  -n [namespace] \
  --context [dr-region-context]

# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
  -n [namespace] \
  --context [dr-region-context]

# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]
|
||||
```
|
||||
|
||||
**Step 5 — Cut over DNS / load balancer (5 minutes)**
|
||||
```bash
|
||||
# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-failover-dns.json

# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
|
||||
```
|
||||
|
||||
**Step 6 — Verify end-to-end**
|
||||
```bash
|
||||
# Full smoke test against DR endpoint
|
||||
curl -s https://[service-url]/health
|
||||
[run automated smoke test suite if available]
|
||||
```
|
||||
|
||||
**Recovery confirmed when:** DNS resolves to DR region, smoke tests pass, error rate is at baseline.
|
||||
|
||||
**Post-failover actions (not urgent — after service is stable):**
|
||||
- Do not fail back to primary until root cause is confirmed resolved
|
||||
- Document data loss window (check replication lag at time of failure)
|
||||
- Begin post-incident review — see [incident-postmortem skill]
|
||||
|
||||
---
|
||||
|
||||
### 3.3 Database Corruption or Data Loss
|
||||
|
||||
**Trigger:** Data in the database is corrupted, deleted, or otherwise incorrect due to a software bug, operator error, or hardware fault.
|
||||
**Detection:** Application errors referencing missing/invalid data, monitoring alerts on query error rate, user reports.
|
||||
**Expected RTO:** [90 minutes] | **Expected RPO:** [Backup interval — e.g. 1 hour]
|
||||
|
||||
**Step 1 — Stop the bleeding immediately**
|
||||
```bash
|
||||
# Put the service into maintenance mode to prevent further writes to corrupted data
|
||||
[command to enable maintenance mode — e.g. kubectl set env deployment/[name] MAINTENANCE_MODE=true]
|
||||
|
||||
# Or: scale down the service to zero to prevent writes
|
||||
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
|
||||
```
|
||||
|
||||
**Step 2 — Assess scope of corruption**
|
||||
```bash
|
||||
# Identify which tables/records are affected
|
||||
[SQL query to check data integrity — e.g.]
|
||||
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [table] WHERE [integrity check condition]"
|
||||
|
||||
# Determine when corruption started (cross-reference with deploy times and error logs)
|
||||
[log query to find earliest error — e.g. in Datadog:]
|
||||
# service:[service-name] status:error "[corruption error message]" | sort by timestamp asc
|
||||
```
|
||||
|
||||
**Step 3 — Identify the correct restore point**
|
||||
```bash
|
||||
# List available backups (example shown for RDS; substitute the equivalent for your database)
aws rds describe-db-snapshots \
  --db-instance-identifier [db-identifier] \
  --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
  --output table

# Choose the most recent backup BEFORE corruption started
# Record the chosen snapshot ID: [snapshot-id]
|
||||
```
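
Picking "the most recent backup before corruption started" by eye is error-prone under pressure. The sketch below selects it from a list of `TIMESTAMP SNAPSHOT_ID` lines, assuming all timestamps share one ISO 8601 format and timezone so that they sort lexicographically; `latest_snapshot_before` is a hypothetical helper, and the snapshot IDs in the usage comment are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose the newest snapshot older than a cutoff time.
# Assumes every timestamp uses the same ISO 8601 format and timezone.

# latest_snapshot_before CUTOFF
# Reads "TIMESTAMP SNAPSHOT_ID" lines on stdin; prints the chosen ID,
# or returns 1 when no snapshot predates the cutoff.
latest_snapshot_before() {
  local cutoff=$1
  # Sort oldest to newest, then keep the last ID strictly before the cutoff
  LC_ALL=C sort | LC_ALL=C awk -v cutoff="$cutoff" '
    $1 < cutoff { id = $2 }
    END { if (id == "") exit 1; print id }
  '
}

# Usage with illustrative data:
#   printf '%s\n' \
#     '2024-05-01T10:00:00Z snap-a' \
#     '2024-05-01T11:00:00Z snap-b' \
#     '2024-05-01T12:00:00Z snap-c' |
#     latest_snapshot_before '2024-05-01T11:30:00Z'
```

A snapshot taken exactly at the cutoff is excluded on purpose, since it may already contain corrupted data.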
|
||||
|
||||
**Step 4 — Restore from backup**
|
||||
```bash
|
||||
# Restore to a NEW database instance (never overwrite production directly)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier [service-name]-restored-[date] \
  --db-snapshot-identifier [snapshot-id] \
  --region [region]

# Wait for restore to complete
aws rds wait db-instance-available \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region]

# Get the restored instance endpoint
aws rds describe-db-instances \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region] \
  --query 'DBInstances[0].Endpoint.Address'
|
||||
```
|
||||
|
||||
**Step 5 — Validate restored data**
|
||||
```bash
|
||||
# Connect to restored database and verify integrity
|
||||
psql [restored-db-endpoint] -U [user] -d [database] -c "[data integrity query]"
|
||||
|
||||
# Confirm record counts match expectations
|
||||
psql -h [restored-db-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
|
||||
```
|
||||
|
||||
**Step 6 — Point service at restored database**
|
||||
```bash
|
||||
kubectl set env deployment/[service-name] \
|
||||
DATABASE_URL=postgres://[user]:[pass]@[restored-endpoint]/[db] \
|
||||
-n [namespace]
|
||||
|
||||
kubectl scale deployment/[service-name] --replicas=[N] -n [namespace]
|
||||
```
|
||||
|
||||
**Recovery confirmed when:** Service is running against restored database, data integrity checks pass, error rate is at baseline.
|
||||
|
||||
---
|
||||
|
||||
### 3.4 Critical Dependency Outage
|
||||
|
||||
**Trigger:** A service that [service name] depends on is unavailable or degraded.
|
||||
**Detection:** Increased error rate or latency on endpoints that call [dependency], alerts from dependency owner.
|
||||
**Expected RTO:** Depends on dependency — [30 minutes for mitigation, resolution depends on dependency owner]
|
||||
|
||||
**Dependency map:**
|
||||
|
||||
| Dependency | Criticality | Degraded behaviour | Mitigation |
|
||||
|---|---|---|---|
|
||||
| [Database] | Critical — all writes fail | Full outage | Activate DR database (Section 3.3) |
|
||||
| [Cache — Redis] | High — latency increases | Performance degradation | Bypass cache, serve from DB |
|
||||
| [Auth service] | Critical — auth fails | All authenticated endpoints fail | Honor cached, still-valid tokens (if implemented) |
|
||||
| [Message queue] | Medium — async processing delays | Writes succeed, async jobs queue | Queue backlog — see on-call runbook |
|
||||
| [External API — name] | Low — feature X unavailable | Graceful degradation | Feature flag to disable feature X |
|
||||
|
||||
**Mitigation steps:**
|
||||
```bash
|
||||
# Enable circuit breaker / fallback for [dependency] if implemented
|
||||
kubectl set env deployment/[service-name] [DEPENDENCY]_CIRCUIT_BREAKER=open -n [namespace]
|
||||
|
||||
# Enable feature flag to disable [dependency-backed feature]
|
||||
[feature flag CLI command or dashboard link]
|
||||
|
||||
# Check if dependency has a status page
|
||||
# [Dependency status URL]
|
||||
```
|
||||
|
||||
**Escalation:** Contact [dependency] on-call via [PagerDuty / Slack `#[channel]`]. Share your service's error rate and the time dependency errors started.
|
||||
|
||||
---
|
||||
|
||||
### 3.5 Security Breach or Ransomware
|
||||
|
||||
**Trigger:** Evidence of unauthorized access, data exfiltration, or encryption of service data.
|
||||
**Detection:** Security tooling alert, unusual access patterns, user reports of data exposure.
|
||||
**Expected RTO:** [4+ hours — prioritise containment over speed] | **Expected RPO:** [Last verified clean backup]
|
||||
|
||||
**Step 1 — Isolate immediately**
|
||||
```bash
|
||||
# Take the service offline — do not attempt to recover while breach is active
|
||||
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
|
||||
|
||||
# Revoke all API keys and service account credentials immediately
|
||||
[command to rotate secrets — e.g. via Vault or cloud provider]
|
||||
|
||||
# Block all external access at network level
|
||||
[firewall/security group command to deny all inbound traffic]
|
||||
```
|
||||
|
||||
**Step 2 — Notify security team immediately**
|
||||
Page [Security lead] via PagerDuty. Do NOT attempt to remediate without security team involvement.
|
||||
|
||||
Post to `#security-incidents` (private channel, not `#incidents`):
|
||||
```
|
||||
🔴 SECURITY INCIDENT — [Service Name]
|
||||
Time detected: [Time]
|
||||
Evidence: [One sentence — what was observed]
|
||||
Actions taken: Service isolated, credentials revoked
|
||||
Awaiting: Security team guidance
|
||||
```
|
||||
|
||||
**Step 3 — Preserve evidence**
|
||||
```bash
|
||||
# Export current logs before any remediation
|
||||
[log export command — preserve evidence for forensics]
|
||||
|
||||
# Snapshot the current state of all infrastructure
|
||||
[snapshot/image command]
|
||||
```
|
||||
|
||||
**Steps 4+ — Follow security team guidance.** Do not restore from backup until security team confirms the attack vector is closed.
|
||||
|
||||
---
|
||||
|
||||
### 3.6 Accidental Bulk Data Deletion
|
||||
|
||||
**Trigger:** An operator, script, or application bug has deleted records in bulk.
|
||||
**Detection:** Sudden drop in record counts, user reports of missing data, application errors.
|
||||
**Expected RTO:** [60 minutes] | **Expected RPO:** [Backup interval]
|
||||
|
||||
```bash
|
||||
# Step 1 — Stop further writes immediately
|
||||
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]
|
||||
|
||||
# Step 2 — Determine what was deleted and when
|
||||
psql $DATABASE_URL -c "
|
||||
SELECT schemaname, relname,
|
||||
n_dead_tup, last_autovacuum
|
||||
FROM pg_stat_user_tables
|
||||
ORDER BY n_dead_tup DESC LIMIT 10;
|
||||
"
|
||||
|
||||
# Step 3 — Check if deletion is recoverable via MVCC (PostgreSQL)
|
||||
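# Committed deletes are invisible to a plain SELECT, even before VACUUM runs.
# This query only finds rows whose deleting (or locking) transaction is still
# open or was rolled back. If the deleting transaction is still open, roll it back.
psql $DATABASE_URL -c "
SELECT * FROM [table]
WHERE xmax != 0 -- deleted or locked by an open or rolled-back transaction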
|
||||
LIMIT 100;
|
||||
"
|
||||
|
||||
# Step 4 — If not recoverable via MVCC, restore from backup
|
||||
# Follow Section 3.3 (Database Corruption runbook) from Step 3 onward
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Backup and Restore Procedures
|
||||
|
||||
### Backup Configuration
|
||||
|
||||
| Data store | Backup type | Frequency | Retention | Location |
|
||||
|---|---|---|---|---|
|
||||
| [Primary database] | Automated snapshots | Every [N] hours | [N] days | [S3 bucket / cloud storage path] |
|
||||
| [Primary database] | Transaction log backups | Continuous | [N] days | [Location] |
|
||||
| [Secondary store — e.g. Redis] | RDB dump | Daily | [N] days | [Location] |
|
||||
| [Blob/object storage] | Cross-region replication | Continuous | [N] days | [DR region bucket] |
|
||||
| [Config / secrets] | Terraform state + Vault backup | On change | Indefinite | [Location] |
|
||||
|
||||
### Backup Validation (Run Weekly)
|
||||
|
||||
```bash
|
||||
# Test restore of latest database backup to a throwaway instance
|
||||
aws rds restore-db-instance-from-db-snapshot \
|
||||
--db-instance-identifier [service-name]-backup-test-$(date +%Y%m%d) \
|
||||
--db-snapshot-identifier $(aws rds describe-db-snapshots \
|
||||
--db-instance-identifier [db-id] \
|
||||
--query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
|
||||
--output text)
|
||||
|
||||
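# Wait for the restore to finish, then run integrity checks
aws rds wait db-instance-available \
--db-instance-identifier [service-name]-backup-test-$(date +%Y%m%d)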
|
||||
psql -h [test-instance-endpoint] -U [user] -d [database] -c "[integrity check query]"
|
||||
|
||||
# Confirm row counts match recent production values (allow ≤ RPO difference)
|
||||
psql -h [test-instance-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
|
||||
|
||||
# Destroy the test instance
|
||||
aws rds delete-db-instance \
|
||||
--db-instance-identifier [service-name]-backup-test-$(date +%Y%m%d) \
|
||||
--skip-final-snapshot
|
||||
```
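
The row-count comparison in the block above can be made a pass/fail check. The sketch below is one possible shape, assuming both counts have already been fetched as bare integers (for example with `psql -At -c "SELECT COUNT(*) ..."`) and that the team has translated its RPO into a maximum number of rows that may be missing. `row_count_within_rpo` and its arguments are hypothetical.

```bash
# Hypothetical helper: succeed when the restored row count is within an
# allowed shortfall of the production count.
# $1 = production rows, $2 = restored rows,
# $3 = maximum rows the RPO window may lose (a team-chosen number)
row_count_within_rpo() {
  local prod=$1 restored=$2 allowed=$3
  local missing=$(( prod - restored ))
  if (( missing > allowed )); then
    echo "FAIL: $missing rows missing (allowed: $allowed)"
    return 1
  fi
  echo "OK: $missing rows missing (allowed: $allowed)"
}
```

A restored count higher than production shows up as a negative "missing" value and is not flagged here. Whether that case should fail is a policy decision for the team.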
|
||||
|
||||
---
|
||||
|
||||
## 5. DR Testing Cadence
|
||||
|
||||
Regular testing is mandatory. An untested DR plan is not a DR plan.
|
||||
|
||||
| Test type | Frequency | Who runs it | Pass criteria |
|
||||
|---|---|---|---|
|
||||
| Backup restore validation | Weekly (automated) | On-call rotation | Restore completes, integrity checks pass |
|
||||
| Zone failover drill | Monthly | Engineering team | RTO target met, zero data loss |
|
||||
| Region failover drill | Quarterly | Engineering + SRE | RTO/RPO targets met |
|
||||
| Full DR game day | Annually | Engineering + stakeholders | All scenarios exercised, gaps documented |
|
||||
| Chaos engineering (infra failures) | Weekly (automated) | Chaos engineering tooling | Service degrades gracefully, recovers automatically |
|
||||
|
||||
### Game Day Procedure
|
||||
|
||||
1. **Pre-game day (1 week before):** Notify all stakeholders, freeze production changes for the day, prepare DR environment.
|
||||
2. **Scope definition:** Choose 2–3 scenarios from Section 2. Document expected outcomes before the test.
|
||||
3. **Execute:** One person acts as incident commander, others execute the runbook steps, and a dedicated observer records timings.
|
||||
4. **Measure:** Record actual RTO and RPO against targets for each scenario.
|
||||
5. **Debrief (same day):** Document gaps, runbook inaccuracies, and automation opportunities.
|
||||
6. **Action items:** File tickets for every gap found. P1 items must be fixed before the next game day.
|
||||
|
||||
---
|
||||
|
||||
## 6. Communication Plan
|
||||
|
||||
### Internal Communication During DR Event
|
||||
|
||||
**Incident commander responsibilities:**
|
||||
- Declare the DR event and open the incident channel
|
||||
- Post updates every 15 minutes minimum
|
||||
- Make the call to fail over (do not let the team decide by committee)
|
||||
- Notify business stakeholders of expected recovery time
|
||||
|
||||
**Notify these people at DR event start:**
|
||||
|
||||
| Role | Name | Contact | When to notify |
|
||||
|---|---|---|---|
|
||||
| Engineering manager | [Name] | [Slack / Phone] | Immediately |
|
||||
| CTO / VP Engineering | [Name] | [Phone] | Tier 1 services: immediately |
|
||||
| Customer success lead | [Name] | [Slack] | If customer-facing impact |
|
||||
| Security lead | [Name] | [Slack / PagerDuty] | If breach suspected |
|
||||
| Legal / compliance | [Name] | [Email / Phone] | If data loss involves PII |
|
||||
|
||||
### Communication Templates
|
||||
|
||||
**DR event declared:**
|
||||
```
|
||||
🔴 DR EVENT — [Service Name]
|
||||
Time: [HH:MM UTC]
|
||||
Scenario: [Zone failure / Region failure / Data loss / etc.]
|
||||
Impact: [Who is affected and how]
|
||||
RTO target: [X minutes]
|
||||
Incident commander: [Name]
|
||||
War room: [Slack channel / call link]
|
||||
Next update: [Time + 15 min]
|
||||
```
|
||||
|
||||
**Status update (every 15 minutes):**
|
||||
```
|
||||
🔴 DR UPDATE — [Service Name] — [HH:MM UTC]
|
||||
Status: [Investigating / Executing recovery / Verifying]
|
||||
Progress: [One sentence on current step]
|
||||
Blockers: [Any — or "None"]
|
||||
Updated RTO estimate: [Time]
|
||||
Next update: [Time + 15 min]
|
||||
```
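
The "Next update" field is simple clock arithmetic, but it is easy to get wrong under pressure, especially across an hour or midnight boundary. A small sketch, assuming times are written as 24-hour `HH:MM` in UTC as in the templates. `next_update` is a hypothetical helper.

```bash
# Hypothetical helper: add an update interval to an HH:MM (24-hour, UTC) time.
# $1 = current time as HH:MM, $2 = interval in minutes (defaults to 15)
next_update() {
  local hh=${1%%:*} mm=${1##*:} interval=${2:-15}
  # 10# forces base 10 so "08" and "09" are not rejected as invalid octal.
  local total=$(( (10#$hh * 60 + 10#$mm + interval) % 1440 ))
  printf '%02d:%02d UTC\n' $(( total / 60 )) $(( total % 60 ))
}
```

For example, `next_update 09:50` prints `10:05 UTC`, and `next_update 23:55` wraps to `00:10 UTC`.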
|
||||
|
||||
**Recovery confirmed:**
|
||||
```
|
||||
✅ DR RESOLVED — [Service Name] — [HH:MM UTC]
|
||||
Total downtime: [X minutes]
|
||||
Data loss: [None / X minutes of transactions]
|
||||
RTO target: [X min] — Actual: [Y min] — [MET / MISSED]
|
||||
RPO target: [X min] — Actual: [Y min] — [MET / MISSED]
|
||||
Root cause: [One sentence]
|
||||
Post-incident review: [Scheduled for / Link when created]
|
||||
```
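
The MET / MISSED fields in the resolution template follow directly from two timestamps and the target. The sketch below assumes the outage start and the recovery time are recorded as epoch seconds. `rto_status` is a hypothetical helper, and the same arithmetic applies to RPO.

```bash
# Hypothetical helper: compare measured downtime against the RTO target.
# $1 = outage start (epoch seconds), $2 = recovery time (epoch seconds),
# $3 = RTO target in minutes
rto_status() {
  local start=$1 end=$2 target_min=$3
  # Round partial minutes up so a 30m01s outage counts as 31 minutes.
  local actual_min=$(( (end - start + 59) / 60 ))
  if (( actual_min <= target_min )); then
    echo "Actual: ${actual_min} min, target: ${target_min} min, MET"
  else
    echo "Actual: ${actual_min} min, target: ${target_min} min, MISSED"
  fi
}
```

Rounding up is a deliberately conservative choice: a target is only reported as met when the full outage fits inside it.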
|
||||
|
||||
---
|
||||
|
||||
## 7. DR Readiness Checklist
|
||||
|
||||
Run this checklist quarterly and before any major infrastructure change:
|
||||
|
||||
**Backups:**
|
||||
- [ ] Automated backups are running and alerts fire if they fail
|
||||
- [ ] Most recent backup restore was tested within the last 7 days
|
||||
- [ ] Backup retention meets RPO and compliance requirements
|
||||
- [ ] Backups are stored in a separate region / account from primary
|
||||
|
||||
**Failover infrastructure:**
|
||||
- [ ] DR region / environment exists and is provisioned (not just documented)
|
||||
- [ ] DNS failover procedure is documented with exact commands
|
||||
- [ ] DR database replica is current (replication lag is within RPO)
|
||||
- [ ] Service can be deployed in DR region with a single command or automated pipeline
|
||||
|
||||
**Runbooks:**
|
||||
- [ ] All runbooks in Section 3 have been tested within the last quarter
|
||||
- [ ] Runbook commands have been verified against current infrastructure (no stale references)
|
||||
- [ ] Contact list is current (no departed employees)
|
||||
|
||||
**Access:**
|
||||
- [ ] On-call engineers have access to DR region console / CLI
|
||||
- [ ] Service account credentials for DR region are provisioned and tested
|
||||
- [ ] Break-glass accounts exist for emergency access if SSO is unavailable
|
||||
|
||||
**Monitoring:**
|
||||
- [ ] Monitoring exists in DR region (not just primary)
|
||||
- [ ] Alerts fire correctly when DR environment has issues
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] RPO and RTO targets are specific numbers, not ranges, and are agreed with the business
|
||||
- [ ] Every command in every runbook has been run by a human in the last quarter — not copied untested from documentation
|
||||
- [ ] DR database exists in the DR region and replication lag is monitored
|
||||
- [ ] Backup restore has been tested end-to-end within the last 7 days
|
||||
- [ ] The game day schedule is on the team calendar — not just documented here
|
||||
- [ ] Contact list contains current phone numbers, not just Slack handles (Slack may be down during a DR event)
|
||||
- [ ] Security breach runbook (3.5) explicitly names the security team contact and does not attempt self-remediation
|
||||
- [ ] All thresholds (RTO/RPO) are visible in the monitoring dashboard so actual vs. target is measurable in real time
|
||||
@@ -0,0 +1,338 @@
|
||||
---
|
||||
name: engineering-hiring-rubric
|
||||
description: "Build an engineering hiring rubric and technical interview scorecard for evaluating software engineers at a specific level. Use when asked to create an interview rubric, design a hiring process, build a technical scorecard, or standardize engineer evaluation. Produces a full interview scorecard, behavioral question bank, technical question set with evaluation criteria, system design rubric, and debrief agenda."
|
||||
---
|
||||
|
||||
# Engineering Hiring Rubric
|
||||
|
||||
Produce a complete hiring rubric and interview scorecard for evaluating software engineers at a specific role and level. The rubric must be specific enough that two interviewers who have never compared notes will score the same candidate within one level of each other. That requires: explicit behavioral anchors (what does "Strong Hire" look like vs. "Hire" for each competency), calibrated technical questions with written evaluation criteria, and a structured debrief format that surfaces signal rather than recency bias. Include calibration notes to help interviewers recognize and counter common evaluation biases.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Role** — backend, frontend, fullstack, SRE/platform, data, ML, or mobile engineer
|
||||
- **Level** — junior (L3/IC2), mid (L4/IC3), senior (L5/IC4), or staff (L6/IC5); clarify the company's level naming if different
|
||||
- **Team context** — what the team builds, team size, and what problems this hire will work on in the first year
|
||||
- **Tech stack** — primary languages and frameworks for the technical questions; list the stack explicitly
|
||||
- **Interview format** — which rounds are used (phone screen, coding, system design, behavioral, take-home); if not specified, produce a recommended format
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Engineering Hiring Rubric: [Role] — [Level]
|
||||
|
||||
**Role:** [e.g., Senior Backend Engineer]
|
||||
**Level equivalent:** [e.g., L5 / IC4 / Senior]
|
||||
**Team:** [Team name and one-sentence description of what they build]
|
||||
**Tech stack:** [Languages and frameworks]
|
||||
**Interview loop:** [List the rounds in order]
|
||||
|
||||
---
|
||||
|
||||
## 1. Role Definition and Level Expectations
|
||||
|
||||
### What This Role Does
|
||||
|
||||
[2–3 sentences describing the scope of work: what systems they'll own, what problems they'll solve, and who they'll work with. Make this specific to the team context provided.]
|
||||
|
||||
### Level Bar
|
||||
|
||||
Define the minimum bar for a Hire recommendation at this level. This is not the ideal candidate description — it is the floor.
|
||||
|
||||
| Dimension | [Level] Floor | One Level Below (No Hire) | One Level Above (Stretch) |
|
||||
|-----------|--------------|---------------------------|---------------------------|
|
||||
| Technical scope | [e.g., "Owns a service or major feature area end-to-end with minimal guidance"] | [e.g., "Completes well-defined tasks; needs guidance on scope and approach"] | [e.g., "Leads cross-team technical initiatives; sets technical direction"] |
|
||||
| Problem solving | [e.g., "Breaks ambiguous problems into concrete sub-problems independently"] | [e.g., "Solves defined problems well; struggles with ambiguity"] | [e.g., "Identifies problems others miss; structures organization-level technical challenges"] |
|
||||
| Code quality | [e.g., "Writes production-ready code; anticipates edge cases; reviewable without significant rework"] | [e.g., "Writes working code that requires significant review feedback"] | [e.g., "Sets code quality standards; designs reusable abstractions adopted by others"] |
|
||||
| Communication | [e.g., "Communicates technical decisions clearly to peers and stakeholders"] | [e.g., "Communicates well with direct team; struggles with cross-team or stakeholder comms"] | [e.g., "Drives technical consensus across teams; writes documents others reference"] |
|
||||
| Ownership | [e.g., "Sees work to production; monitors after deploy; follows up on issues proactively"] | [e.g., "Delivers assigned work; escalates issues but doesn't drive them to resolution"] | [e.g., "Owns outcomes across teams; improves team processes and systems beyond their own work"] |
|
||||
|
||||
---
|
||||
|
||||
## 2. Interview Loop Structure
|
||||
|
||||
| Round | Format | Duration | Interviewer | Competencies Assessed |
|
||||
|-------|--------|----------|-------------|----------------------|
|
||||
| Phone screen | Video call, technical questions | 45 min | [Hiring manager or senior engineer] | Problem solving, communication, basic technical depth |
|
||||
| Coding interview 1 | Live coding — [platform] | 60 min | [Engineer] | Coding, data structures, code quality |
|
||||
| Coding interview 2 | Live coding — [platform] | 60 min | [Engineer] | Algorithms, debugging, code quality |
|
||||
| System design | Whiteboard / shared doc | 60 min | [Senior/Staff engineer] | System design, scalability, technical communication |
|
||||
| Behavioral | Structured interview | 45 min | [Hiring manager] | Ownership, collaboration, growth mindset |
|
||||
| [Optional] Take-home | Asynchronous project | [X hours] | [Reviewer] | Code quality, thoroughness, real-world problem solving |
|
||||
|
||||
**Interview coverage matrix:** Each competency dimension must be assessed by at least 2 independent interviewers.
|
||||
|
||||
| Competency | Phone Screen | Coding 1 | Coding 2 | System Design | Behavioral |
|
||||
|-----------|-------------|---------|---------|--------------|-----------|
|
||||
| Coding | ○ | ● | ● | ○ | |
|
||||
| System design | ○ | | | ● | |
|
||||
| Problem solving | ● | ● | ● | ● | |
|
||||
| Code quality | | ● | ● | | |
|
||||
| Communication | ● | ● | ● | ● | ● |
|
||||
| Ownership | ○ | | | ○ | ● |
|
||||
| Debugging | | ● | ● | | |
|
||||
|
||||
● = Primary signal ○ = Secondary signal
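
The coverage rule above can be checked mechanically once the matrix is written down as data. This is a minimal sketch, assuming the matrix is exported as one tab-separated `competency<TAB>round` pair per marked cell and that each round has a different interviewer, so rounds stand in for interviewers. `under_covered` is a hypothetical helper.

```bash
# Hypothetical helper: list competencies covered by fewer than a minimum
# number of distinct rounds.
# stdin: one "competency<TAB>round" pair per line
# $1:    minimum number of rounds (defaults to 2)
under_covered() {
  awk -F '\t' -v min="${1:-2}" '
    !seen[$1 FS $2]++ { rounds[$1]++ }   # count each round once per competency
    END {
      for (c in rounds)
        if (rounds[c] < min) print c
    }
  ' | LC_ALL=C sort
}
```

A competency with no marked cell at all never appears in the input, so it is not reported. Compare the output against the full competency list to catch that case.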
|
||||
|
||||
---
|
||||
|
||||
## 3. Coding Interview Guide
|
||||
|
||||
### Question Selection
|
||||
|
||||
Choose 1–2 problems per coding round. Problems should be solvable in 30–40 minutes with the remaining time for discussion and follow-ups. Prefer problems with multiple solution tiers so you can see how far candidates take their thinking.
|
||||
|
||||
### Problem Template
|
||||
|
||||
**Problem: [Title]**
|
||||
|
||||
*Prompt (read to candidate):*
|
||||
> [Problem statement — be specific. Include constraints (input size, value ranges). Avoid ambiguity that tests problem-reading rather than problem-solving.]
|
||||
|
||||
*Example:*
|
||||
> Given a list of integers representing stock prices at each minute of a trading day, return the maximum profit you could achieve by making exactly one buy and one sell. You may not sell before you buy.
|
||||
|
||||
**Clarifying questions a strong candidate will ask:**
|
||||
- [e.g., "Can the list be empty?" / "Are all values positive?" / "Can profit be negative — i.e., should we return 0 if no profit is possible?"]
|
||||
|
||||
**Solution tiers:**
|
||||
|
||||
| Tier | Approach | Time Complexity | Space Complexity | Signals |
|
||||
|------|----------|-----------------|-----------------|---------|
|
||||
| Baseline | [Brute force — O(n²) nested loop] | O(n²) | O(1) | Can solve the problem; understands correctness |
|
||||
| Expected | [Single pass, tracking min price seen so far] | O(n) | O(1) | Strong problem solver; explains tradeoff |
|
||||
| Strong | [Generalizes to k transactions, or extends to cooldown variant without prompting] | O(n·k) for k transactions; O(n) for cooldown | O(k) for k transactions; O(1) for cooldown | Staff-level generalization thinking |
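
For interviewer reference, the "Expected" tier for the stock-price example can be sketched as a single pass that tracks the lowest price seen so far. It is written in bash only to match the other examples in this repository; candidates would use the team's stack. Returning 0 when no profitable trade exists is an assumption, since the prompt leaves that for the candidate to clarify.

```bash
# Single-pass max profit for one buy followed by one sell.
# Arguments: integer prices in time order.
# Prints 0 when no profitable trade exists (assumed behaviour).
max_profit() {
  if (( $# == 0 )); then
    echo 0
    return
  fi
  local min_price=$1 best=0 price profit
  shift
  for price in "$@"; do
    # Best sale today is against the cheapest earlier price.
    profit=$(( price - min_price ))
    if (( profit > best )); then best=$profit; fi
    if (( price < min_price )); then min_price=$price; fi
  done
  echo "$best"
}
```

For example, `max_profit 7 1 5 3 6 4` prints `5` (buy at 1, sell at 6), and a strictly falling series such as `max_profit 7 6 4 3 1` prints `0`.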
|
||||
|
||||
**Follow-up questions:**
|
||||
- [e.g., "What if you could make at most k trades?"]
|
||||
- [e.g., "How would you test this function? Write me 3 test cases."]
|
||||
- [e.g., "Walk me through your code as if you're explaining it in a code review."]
|
||||
|
||||
**Evaluation rubric for this problem:**
|
||||
|
||||
| Signal | Strong Hire | Hire | No Hire |
|
||||
|--------|------------|------|---------|
|
||||
| Problem comprehension | Asks 1–2 clarifying questions immediately; identifies edge cases before coding | Understands the problem after 1 prompt; misses 1–2 edge cases | Misunderstands the problem or requires repeated clarification |
|
||||
| Solution quality | O(n) solution; clean code; handles all edge cases | O(n) with hints; code is readable but has minor issues | O(n²) with hints, or correct solution with significant issues |
|
||||
| Code quality | Well-named variables; logical structure; would pass code review | Functional but verbose or inconsistently named | Hard to follow; would require significant review feedback |
|
||||
| Communication | Narrates thinking throughout; explains complexity; self-corrects | Explains solution when asked; answers follow-ups well | Silent during coding; unable to explain their approach |
|
||||
| Follow-ups | Extends solution confidently; identifies further improvements | Handles follow-ups with moderate prompting | Unable to extend or explain tradeoffs |
|
||||
|
||||
---
|
||||
|
||||
## 4. System Design Interview Guide
|
||||
|
||||
### [Level]-Appropriate Design Scope
|
||||
|
||||
At [Level], expect the candidate to:
|
||||
- [e.g., Senior: "Design a complete system with capacity estimates, component breakdown, and discussion of failure modes"]
|
||||
- [e.g., Mid: "Design the core components of a system; may need prompting on scalability and failure handling"]
|
||||
- [e.g., Junior: "Design a simple client-server system; focus on clarity of thinking over complete distributed systems knowledge"]
|
||||
|
||||
### Sample Design Question
|
||||
|
||||
**Question:** "Design [a URL shortener / a rate limiter / a notification service / a ride-matching system — choose one relevant to the team's domain]."
|
||||
|
||||
**Evaluation dimensions:**
|
||||
|
||||
| Dimension | What to assess | Strong Hire | Hire | No Hire |
|
||||
|-----------|---------------|------------|------|---------|
|
||||
| Requirements clarification | Does the candidate ask before designing? | Asks scope, scale, SLA, and key use cases before drawing anything | Asks some questions; may miss scale or SLA | Starts designing immediately without clarifying |
|
||||
| High-level design | Can they describe the major components? | Clear component breakdown with justified choices; covers data flow | Reasonable breakdown; may overcomplicate or undercomplicate | Missing key components or cannot explain data flow |
|
||||
| Data model | Can they design a schema or data structure for the system? | Models the core entities with normalization/denormalization tradeoffs discussed | Reasonable schema; may miss indexing or partitioning needs | Cannot model the data or produces clearly wrong schema |
|
||||
| Scalability | Can they identify and address bottlenecks? | Identifies bottlenecks proactively; proposes horizontal scaling, caching, or sharding as appropriate | Discusses scaling when prompted; reasonable solutions | Cannot identify bottlenecks or proposes solutions that don't match the scale |
|
||||
| Failure handling | Do they think about what happens when things break? | Proactively discusses failure modes: single points of failure, retry logic, idempotency | Discusses failure when prompted; identifies some failure modes | Does not think about failure; assumes happy path |
|
||||
| Communication | Is the design explained clearly? | Could run this meeting with a team of engineers at a real company | Clear enough to follow; some gaps in explanation | Difficult to follow; interviewer cannot understand the design |
|
||||
|
||||
### Design Probing Questions
|
||||
|
||||
Use these to probe depth after the candidate presents their design:
|
||||
- "Walk me through what happens when a write request comes in at peak load — 10,000 requests per second."
|
||||
- "Your primary database just failed. What happens to the system?"
|
||||
- "You estimated X QPS. How would your design change if it needed to handle 100× that?"
|
||||
- "Where is the first place this system would fall over under load?"
|
||||
- "How would you monitor this in production? What would your on-call runbook look like?"
|
||||
|
||||
---
|
||||
|
||||
## 5. Behavioral Interview Question Bank
|
||||
|
||||
Map every question to a competency. Ask 4–6 questions per behavioral round using STAR format (Situation, Task, Action, Result). Do not ask leading questions.
|
||||
|
||||
### Competency: Ownership and Delivery
|
||||
|
||||
1. "Tell me about a time you owned something end-to-end — from design through production monitoring. What did you do when something went wrong after launch?"
|
||||
- *Strong signal:* Describes proactive monitoring setup, a specific incident they caught themselves, and what they changed
|
||||
- *Weak signal:* Describes writing the code and handing off; no discussion of production behavior
|
||||
|
||||
2. "Describe a project that was significantly delayed or failed. What was your role, and what did you take responsibility for?"
|
||||
- *Strong signal:* Direct ownership of their contribution to the failure; specific changes to how they work
|
||||
- *Weak signal:* Attributes all delay to external factors; no reflection on their own actions
|
||||
|
||||
### Competency: Technical Judgment
|
||||
|
||||
3. "Tell me about a significant technical decision you made. What options did you consider, and how did you decide?"
|
||||
- *Strong signal:* Named alternatives with clear tradeoffs; explains who they consulted; reflects on whether they'd decide the same way today
|
||||
- *Weak signal:* "I knew X was the right answer" without describing the decision process
|
||||
|
||||
4. "Describe a time you had to push back on a technical direction — either from management or from peers. What happened?"
|
||||
- *Strong signal:* Evidence-based disagreement; constructive communication; willing to commit once decision was made even if they lost the argument
|
||||
- *Weak signal:* Either never pushed back or pushed back emotionally without evidence
|
||||
|
||||
### Competency: Collaboration and Communication
|
||||
|
||||
5. "Tell me about a time you had to explain a complex technical concept to a non-technical stakeholder. How did you approach it?"
|
||||
- *Strong signal:* Used analogy or simplified model; confirmed understanding; adapted to the audience
|
||||
- *Weak signal:* "I explained it technically and told them to trust me"
|
||||
|
||||
6. "Describe a situation where you and a peer strongly disagreed on an approach. How did it resolve?"
|
||||
- *Strong signal:* Sought a third opinion or data; focused on the right outcome, not being right; maintained relationship
|
||||
- *Weak signal:* Escalated immediately or capitulated without engaging
|
||||
|
||||
### Competency: Growth and Learning
|
||||
|
||||
7. "What is a significant technical mistake you made in the last two years? What did you learn from it?"
|
||||
- *Strong signal:* Specific mistake, clear causal analysis, concrete behavioral change afterward
|
||||
- *Weak signal:* Cannot name a specific mistake; describes a minor issue to avoid vulnerability
|
||||
|
||||
8. "How do you stay current in [relevant technical area]? Give me a specific example of something you learned recently and applied."
|
||||
- *Strong signal:* Named sources, applied learning in a specific project with a concrete outcome
|
||||
- *Weak signal:* "I read blogs" with no specifics; no applied example
|
||||
|
||||
---
|
||||
|
||||
## 6. Full Interview Scorecard
|
||||
|
||||
Complete one scorecard per interview round. Collect all scorecards before the debrief.
|
||||
|
||||
```
|
||||
INTERVIEW SCORECARD
|
||||
===================
|
||||
Candidate: ______________________
|
||||
Interviewer: ______________________
|
||||
Round: ______________________
|
||||
Date: ______________________
|
||||
Interview format: ______________________
|
||||
|
||||
COMPETENCY RATINGS
|
||||
Rate each dimension independently. Do not average.
|
||||
Scale: 1 = Strong No Hire | 2 = No Hire | 3 = Hire | 4 = Strong Hire
|
||||
|
||||
1 2 3 4 Notes
|
||||
Coding / Technical skill [ ] [ ] [ ] [ ] ___________________________
|
||||
Problem solving [ ] [ ] [ ] [ ] ___________________________
|
||||
System design [ ] [ ] [ ] [ ] ___________________________
|
||||
Code quality [ ] [ ] [ ] [ ] ___________________________
|
||||
Debugging [ ] [ ] [ ] [ ] ___________________________
|
||||
Communication [ ] [ ] [ ] [ ] ___________________________
|
||||
Ownership [ ] [ ] [ ] [ ] ___________________________
|
||||
Collaboration [ ] [ ] [ ] [ ] ___________________________
|
||||
|
||||
SPECIFIC EVIDENCE
|
||||
What did the candidate do or say that drove your rating?
|
||||
(Required — write observable behaviors, not impressions)
|
||||
|
||||
Strongest signal (positive):
|
||||
___________________________________________________________________________
|
||||
|
||||
Strongest concern or gap:
|
||||
___________________________________________________________________________
|
||||
|
||||
OVERALL RECOMMENDATION
|
||||
[ ] Strong Hire [ ] Hire [ ] No Hire [ ] Strong No Hire
|
||||
|
||||
OVERALL RECOMMENDATION RATIONALE
|
||||
(Required — 3–5 sentences minimum. State your recommendation, the evidence
|
||||
that supports it, and the specific gap or risk if not a Strong Hire)
|
||||
___________________________________________________________________________
|
||||
___________________________________________________________________________
|
||||
___________________________________________________________________________
|
||||
|
||||
Level signal: This candidate demonstrated [ L_ / L_ ] level behaviors.
|
||||
|
||||
SHOULD INTERVIEWERS DISCUSS BEFORE DEBRIEF?
|
||||
[ ] No — I have a clear independent signal
|
||||
[ ] Yes — I need context on [specific area] to complete my assessment
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Hiring Recommendation Framework
|
||||
|
||||
| Recommendation | Meaning | When to use |
|
||||
|---------------|---------|-------------|
|
||||
| **Strong Hire** | Confident the candidate will exceed the level bar and be a high performer on the team | Evidence across 3+ competencies at above-bar level; no significant concerns |
|
||||
| **Hire** | Confident the candidate meets the level bar; will perform well | Meets bar on all must-have competencies; may have 1 area to develop |
|
||||
| **No Hire** | Does not meet the level bar | Below bar on 1+ must-have competency, or gap too large to close quickly |
|
||||
| **Strong No Hire** | Clear mismatch — well below the bar, or a specific disqualifying signal | Significant gaps across multiple competencies, or a values/behavior concern |
|
||||
|
||||
**Must-have competencies for [Role] at [Level]:** [List 3–4 competencies where a No Hire score on any one of them means the overall recommendation must be No Hire, regardless of performance elsewhere. Example: "Coding and System Design are must-have competencies for a Senior Backend Engineer. Strong performance on Behavioral dimensions cannot compensate for a No Hire on Coding."]
|
||||
|
||||
**Debrief rule:** A Strong Hire can override one No Hire only if: (a) the No Hire is not on a must-hire competency, and (b) the Strong Hire interviewer can articulate why the concern is not disqualifying. A Strong No Hire cannot be overridden — escalate to hiring manager.
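
The must-hire gate and the debrief override rule above can be sketched as two small predicates. This is a minimal illustration, not part of the rubric itself: the function names are hypothetical, and treating a Strong No Hire as also failing the must-hire gate is an assumption.

```python
# Scores that fail a must-hire competency (assumption: Strong No Hire also fails it).
FAILING_SCORES = {"No Hire", "Strong No Hire"}


def must_hire_failures(competency_scores, must_hire):
    """Return the must-hire competencies whose score forces an overall No Hire."""
    return sorted(c for c in must_hire if competency_scores.get(c) in FAILING_SCORES)


def strong_hire_can_override(no_hire_competency, must_hire, rationale):
    """Debrief rule: a Strong Hire may override one No Hire only if
    (a) the No Hire is not on a must-hire competency, and
    (b) the Strong Hire interviewer states why the concern is not disqualifying."""
    return no_hire_competency not in must_hire and bool(rationale.strip())
```

A Strong No Hire is deliberately not modelled as overridable here; per the rule above it goes to the hiring manager.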
|
||||
|
||||
---
|
||||
|
||||
## 8. Debrief Agenda
|
||||
|
||||
Hold the debrief only after every interviewer has submitted a written scorecard. Interviewers do not share recommendations verbally before then.
|
||||
|
||||
```
|
||||
DEBRIEF AGENDA — [Candidate Name]
|
||||
Duration: 45 minutes
|
||||
Facilitator: [Hiring Manager]
|
||||
|
||||
0:00 – 0:05 SCORECARD REVIEW
|
||||
Each interviewer states their overall recommendation only (no rationale yet).
|
||||
Facilitator notes alignment and disagreements on whiteboard/doc.
|
||||
|
||||
0:05 – 0:15 EVIDENCE ROUND
|
||||
Go around the table. Each interviewer shares:
|
||||
- Their strongest positive signal (observable behavior, not impression)
|
||||
- Their biggest concern (observable behavior, not impression)
|
||||
No discussion yet — just evidence gathering.
|
||||
|
||||
0:15 – 0:30 DISCUSS DISAGREEMENTS
|
||||
Address only the competency dimensions where interviewers disagree.
|
||||
Anchor discussion on: "What did you observe?" not "What do you think?"
|
||||
If interviewers assessed different competencies, disagreement may reflect
|
||||
insufficient signal — note this.
|
||||
|
||||
0:30 – 0:40 DECISION
|
||||
Reach a decision on overall recommendation.
|
||||
If consensus: state the recommendation and rationale.
|
||||
If not consensus: hiring manager makes the call and states why.
|
||||
|
||||
0:40 – 0:45 PROCESS NOTES
|
||||
- Were any questions unclear or hard to compare across candidates?
|
||||
- Any bias signals observed during the debrief? (see Section 9)
|
||||
- Feedback to improve the process for next time.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Calibration and Bias Reduction Notes
|
||||
|
||||
Brief every interviewer on these before they conduct their first interview for this role.
|
||||
|
||||
| Bias | How it manifests | Counter-measure |
|
||||
|------|-----------------|-----------------|
|
||||
| Halo effect | Strong performance in round 1 colors ratings in round 2 | Submit scorecard before reading others; rate each competency independently |
|
||||
| Similarity bias | "I liked them" correlates with "they think like me" | Require observable evidence for every rating; check: "Is this a signal about their ability or their similarity to me?" |
|
||||
| Recency bias | Final impression dominates overall rating | Take notes during the interview; write evidence immediately after; debrief uses written evidence, not memory |
|
||||
| Expectation anchoring | First interviewer's opinion anchors all others | No verbal discussion between interviewers before debrief; written scorecards submitted before debrief starts |
|
||||
| Culture fit as cover | "Not a culture fit" without specific behavioral evidence | "Culture fit" is not a valid dimension on this scorecard; use Collaboration and Communication with evidence |
|
||||
| Credential bias | Degree or previous employer overweights rating | Do not list educational background in pre-interview briefing documents; focus on demonstrated behaviors |
|
||||
| Confidence ≠ Competence | Articulate candidates rated higher regardless of correctness | Grade the answer quality, not the delivery style; use written rubrics per question |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Level bar table defines a concrete floor for the level — not aspirational traits — with a comparison to one level below and above
|
||||
- [ ] Every behavioral question includes explicit Strong Hire and Weak/No Hire signal descriptions — not just the question text
|
||||
- [ ] Coding problem(s) include solution tiers with time and space complexity, plus a per-question rubric with behavioral anchors
|
||||
- [ ] System design rubric evaluates at minimum: requirements clarification, component design, data model, scalability, and failure handling
|
||||
- [ ] Scorecard uses observable behavior fields ("What did the candidate do or say") — not impression fields
|
||||
- [ ] Must-hire competencies are explicitly named for the role and level
|
||||
- [ ] Debrief agenda enforces written scorecard submission before verbal discussion to prevent anchoring
|
||||
@@ -0,0 +1,164 @@
|
||||
---
|
||||
name: engineering-weekly-report
|
||||
description: "Write a weekly engineering status report for a team, service, or initiative. Use when asked to write a team update, weekly engineering report, sprint status email, or standing team communication to stakeholders. Produces a concise, scannable weekly report covering shipping progress, metrics, decisions, blockers, and next-week priorities."
|
||||
---
|
||||
|
||||
# Engineering Weekly Report
|
||||
|
||||
Produce a weekly engineering status report that a team can send to stakeholders, their engineering manager, and the team itself. The format is fixed week-over-week so readers know exactly where to look — shipping progress at the top, decisions in the middle, risks and next steps at the bottom. The report must be readable in under 2 minutes. Avoid prose walls: use bullet points, status tags, and short tables. If metrics are not provided, leave the metrics section with [data needed] markers rather than fabricating numbers.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Team name and report period** — team name plus week number or date range (e.g., "Platform Team, Week 21, May 12–16")
|
||||
- **Work items shipped this week** — what was completed and released or merged
|
||||
- **Work items in progress** — what is actively being worked on, with rough percent-complete if known
|
||||
- **Blocked items** — what is blocked, who owns the block, and what is needed to unblock
|
||||
- **Key decisions made** — any architecture, process, or priority decisions made this week
|
||||
- **Decisions needed next week** — any decisions that need to be made soon and who needs to make them
|
||||
- **Risks and escalations** — anything that threatens next week's commitments or needs leadership visibility
|
||||
- **Next week's top priorities** — the 3–5 things the team plans to accomplish next week
|
||||
|
||||
Optional but useful:
|
||||
- **Key metrics** — reliability (error rate, p99 latency), velocity (story points completed), or other health indicators
|
||||
- **Team health notes** — PTO, new joins, attrition, morale signals worth noting
|
||||
- **Sprint or iteration number** — if the team runs sprints
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Engineering Weekly Report — [Team Name]
|
||||
**Week:** [Week Number] | [Date Range, e.g., May 12–16, 2025]
|
||||
**Author:** [Name or Team Lead]
|
||||
**Distribution:** [e.g., Eng leadership, Product, Team]
|
||||
|
||||
---
|
||||
|
||||
## Shipping Progress
|
||||
|
||||
### Shipped This Week
|
||||
|
||||
| Item | Description | Impact |
|
||||
|------|-------------|--------|
|
||||
| [Feature / Fix / Infra change] | [One-line description] | [Who benefits / what it unblocks] |
|
||||
| [Feature / Fix / Infra change] | [One-line description] | [Who benefits / what it unblocks] |
|
||||
| [Feature / Fix / Infra change] | [One-line description] | [Who benefits / what it unblocks] |
|
||||
|
||||
### In Progress
|
||||
|
||||
| Item | Owner | Status | Target Ship |
|
||||
|------|-------|--------|-------------|
|
||||
| [Work item] | [Name] | [~40% / On Track / At Risk] | [Date or Sprint] |
|
||||
| [Work item] | [Name] | [~70% / On Track / At Risk] | [Date or Sprint] |
|
||||
| [Work item] | [Name] | [~20% / On Track / At Risk] | [Date or Sprint] |
|
||||
|
||||
### Blocked
|
||||
|
||||
| Item | Blocked Since | Blocker Description | Owner | Needed To Unblock |
|
||||
|------|--------------|--------------------|----|-------------------|
|
||||
| [Work item] | [Date] | [What is blocking progress] | [Name] | [Specific ask — decision, resource, dependency] |
|
||||
|
||||
If no items are blocked: *No active blockers.*
|
||||
|
||||
---
|
||||
|
||||
## Key Metrics
|
||||
|
||||
*Metrics reported as of [Date]. Prior-week values appear in the Last Week column.*
|
||||
|
||||
| Metric | This Week | Last Week | Trend | Target |
|
||||
|--------|-----------|-----------|-------|--------|
|
||||
| Error rate (5xx) | [X%] | [X%] | [↑ / ↓ / →] | < [threshold] |
|
||||
| p99 latency | [Xms] | [Xms] | [↑ / ↓ / →] | < [threshold] |
|
||||
| Deployment frequency | [X deploys] | [X deploys] | [↑ / ↓ / →] | [target] |
|
||||
| Story points completed | [X] | [X] | [↑ / ↓ / →] | [sprint target] |
|
||||
| On-call page volume | [X pages] | [X pages] | [↑ / ↓ / →] | < [threshold] |
|
||||
|
||||
**Metrics notes:** [Any context that makes the numbers meaningful — e.g., "Error rate spike on Tuesday tied to downstream dependency outage, resolved by EOD."]
|
||||
|
||||
If metrics are not provided: replace table rows with `[data needed — provide metric values for this section]`.
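
As a rough sketch of how a metrics row could be rendered without fabricating values, the helper below fills in `[data needed]` whenever a number is missing. The function names and the unit handling are assumptions for illustration only.

```python
DATA_NEEDED = "[data needed]"


def trend(this_week, last_week):
    """Return a direction arrow, or the placeholder when either value is missing."""
    if this_week is None or last_week is None:
        return DATA_NEEDED
    if this_week > last_week:
        return "↑"
    if this_week < last_week:
        return "↓"
    return "→"


def metric_row(name, this_week, last_week, unit, target):
    """Render one Markdown table row for the Key Metrics table."""
    def fmt(value):
        return DATA_NEEDED if value is None else f"{value}{unit}"

    return (
        f"| {name} | {fmt(this_week)} | {fmt(last_week)} "
        f"| {trend(this_week, last_week)} | {target} |"
    )
```

The arrow only shows direction. Whether the direction is good depends on the metric (a falling error rate is an improvement), so the Metrics notes line still carries the interpretation.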
|
||||
|
||||
---
|
||||
|
||||
## Decisions
|
||||
|
||||
### Made This Week
|
||||
|
||||
| Decision | Rationale | Owner | Stakeholders Informed |
|
||||
|----------|-----------|-------|----------------------|
|
||||
| [Decision description] | [Why — 1 sentence] | [Name] | [Yes / No — who] |
|
||||
| [Decision description] | [Why — 1 sentence] | [Name] | [Yes / No — who] |
|
||||
|
||||
If no decisions were made: *No major decisions this week.*
|
||||
|
||||
### Needed Next Week
|
||||
|
||||
| Decision | Context | Deadline | Decision Owner |
|
||||
|----------|---------|----------|----------------|
|
||||
| [What needs to be decided] | [Why it matters, what happens if delayed] | [Date] | [Name or role] |
|
||||
|
||||
If no decisions are pending: *No decisions pending.*
|
||||
|
||||
---
|
||||
|
||||
## Risks and Escalations
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation | Escalate To |
|
||||
|------|-----------|--------|-----------|-------------|
|
||||
| [Risk description] | [High/Med/Low] | [High/Med/Low] | [What we're doing about it] | [Name/role if escalation needed] |
|
||||
|
||||
**Escalations this week:** [Any item that needs immediate leadership attention — call it out explicitly here, do not bury it in a table row. If none: "None."]
|
||||
|
||||
---
|
||||
|
||||
## Team Health
|
||||
|
||||
| Item | Status |
|
||||
|------|--------|
|
||||
| Team capacity this week | [X of Y people at full capacity] |
|
||||
| PTO / out of office | [Names and dates, or "None"] |
|
||||
| New joins / departures | [Name, role, and date, or "None"] |
|
||||
| On-call this week | [Name] |
|
||||
| On-call next week | [Name] |
|
||||
|
||||
**Team notes:** [Any morale, workload, or team dynamic signals worth surfacing — keep this factual and constructive. If nothing to note: omit this line.]
|
||||
|
||||
---
|
||||
|
||||
## Next Week's Priorities
|
||||
|
||||
*The [3–5] things this team will ship or meaningfully advance next week.*
|
||||
|
||||
1. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
2. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
3. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
4. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
5. **[Priority item]** — [One sentence: what done looks like and who owns it]
|
||||
|
||||
**Capacity risk:** [If the team is at reduced capacity next week (PTO, incidents, etc.), note it here so stakeholders calibrate expectations.]
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Sprint Scorecard (if applicable)
|
||||
|
||||
| Sprint | Committed | Completed | Completion Rate | Carried Over |
|
||||
|--------|-----------|-----------|----------------|--------------|
|
||||
| Sprint [N-1] | [X pts] | [X pts] | [X%] | [X pts] |
|
||||
| Sprint [N] (current) | [X pts] | [X pts — partial] | [X% at midpoint] | TBD |
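
The Completion Rate and Carried Over columns follow directly from the committed and completed points. A minimal sketch, with hypothetical function names and rounding to a whole percent as an assumption:

```python
def completion_rate(committed_pts, completed_pts):
    """Completed points as a whole-number percentage of committed points."""
    if committed_pts <= 0:
        return None  # nothing committed, so a rate is not meaningful
    return round(100 * completed_pts / committed_pts)


def carried_over(committed_pts, completed_pts):
    """Points committed but not completed; never negative."""
    return max(committed_pts - completed_pts, 0)
```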
|
||||
|
||||
---
|
||||
|
||||
*Questions or corrections: [Slack channel or email] | Next report: [Date]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every blocked item names a specific owner and states what is concretely needed to unblock it — not just "waiting on X"
|
||||
- [ ] Decisions-needed table includes a deadline and a named decision owner, not a vague "TBD"
|
||||
- [ ] Metrics table is either populated with real numbers or explicitly marked `[data needed]` — no fabricated metrics
|
||||
- [ ] Next week's priorities are written as outcomes ("ship X", "complete Y migration") not as activities ("work on X")
|
||||
- [ ] Escalations that need leadership attention are called out explicitly in the Risks section — not just buried in a table row
|
||||
- [ ] The entire report is readable in under 2 minutes — if it is longer than one printed page, trim it
|
||||
- [ ] Report period (week number and date range) is clearly stated in the header
|
||||
@@ -0,0 +1,369 @@
|
||||
---
|
||||
name: feature-flag-guide
|
||||
description: "Write a feature flag management guide and lifecycle playbook for a service or team — covering flag taxonomy, creation checklist, rollout strategy, monitoring requirements, cleanup policy, and governance. Use when asked to document feature flag practices, create a flag rollout plan, write a feature flag policy, or guide a team on flag lifecycle management. Produces a flag lifecycle playbook, taxonomy reference, per-flag creation template, rollout decision tree, and cleanup checklist."
|
||||
---
|
||||
|
||||
# Feature Flag Guide Skill
|
||||
|
||||
Produce a complete feature flag management guide for a service or team — covering how flags are named and categorised, how to create and roll out a flag safely, what to monitor during rollout, when and how to clean up flags, and who is responsible for each stage. Feature flags without discipline become permanent technical debt. This guide gives the team a repeatable process so flags are created intentionally, rolled out safely, and removed when done.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service or team name** — scope of the guide
|
||||
- **Feature flag platform** — LaunchDarkly, Split, Unleash, Flagsmith, Flipt, or a custom/in-house solution
|
||||
- **Flag being documented** (if writing a per-flag guide) or "general guide" (if writing team-wide policy)
|
||||
- **Rollout constraints** — any compliance, data privacy, or contractual constraints on who can see a feature (e.g. HIPAA, EU-only, enterprise customers only)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Feature Flag Management Guide: [Service / Team Name]
|
||||
|
||||
**Team:** [Team name] | **Platform:** [LaunchDarkly / Split / Unleash / Custom]
|
||||
**Document owner:** [Name] | **Last updated:** [Date]
|
||||
**Review cycle:** Quarterly, and whenever the flag platform changes
|
||||
|
||||
---
|
||||
|
||||
## 1. Flag Taxonomy
|
||||
|
||||
Every flag belongs to exactly one category. The category determines default behaviour, who can enable it in production, and when it must be cleaned up.
|
||||
|
||||
| Type | Purpose | Default state | Production gate | Max lifetime |
|
||||
|---|---|---|---|---|
|
||||
| **Release flag** | Controls rollout of a new feature — decouples deploy from release | Off | Tech lead approval | 90 days from feature launch |
|
||||
| **Experiment flag** | A/B or multivariate test — measures impact of a change | Off (control group) | Product + tech lead | Duration of experiment + 30 days |
|
||||
| **Ops flag** | Operational control — circuit breaker, kill switch, throttle | On (normal behaviour) | On-call engineer can toggle | Indefinite (review annually) |
|
||||
| **Permission flag** | Gates access by user segment, tier, or region | Off (restricted) | Product + Account owner | Indefinite (review annually) |
|
||||
|
||||
**When in doubt:** If the flag is temporary (tied to a specific feature launch), it is a Release flag. If it will exist forever as a control knob, it is an Ops flag.
|
||||
|
||||
---
|
||||
|
||||
## 2. Flag Naming Convention
|
||||
|
||||
All flags must follow this naming scheme:
|
||||
|
||||
```
|
||||
[type]-[service]-[feature-description]
|
||||
```
|
||||
|
||||
| Segment | Values | Example |
|
||||
|---|---|---|
|
||||
| type | `release`, `exp`, `ops`, `perm` | `release` |
|
||||
| service | Short service identifier, lowercase, hyphenated | `payments` |
|
||||
| feature-description | Kebab-case description, max 5 words | `new-checkout-flow` |
|
||||
|
||||
**Full examples:**
|
||||
- `release-payments-new-checkout-flow` — release flag for a new checkout feature in the payments service
|
||||
- `exp-search-personalized-ranking` — experiment on personalized search ranking
|
||||
- `ops-api-rate-limit-override` — operational flag to override API rate limits
|
||||
- `perm-dashboard-beta-users-only` — permission flag gating dashboard for beta users
|
||||
|
||||
**Do not:**
|
||||
- Use ticket numbers in flag names (`release-JIRA-1234` → not searchable or self-describing)
|
||||
- Use dates in flag names (`release-dark-mode-jan-2024` → flags outlive their dates)
|
||||
- Use vague names (`release-new-thing` → not useful when you have 50 flags)
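
A naming convention like this is easier to enforce with a small check in CI or a pre-commit hook. The sketch below is one possible validator, not a platform feature: the `KNOWN_SERVICES` set and the function name are hypothetical, and flagging a bare number in the description as a likely ticket ID or date is a heuristic.

```python
import re

ALLOWED_TYPES = ("release", "exp", "ops", "perm")
MAX_DESCRIPTION_WORDS = 5

# Assumption: the team maintains a list of valid service identifiers.
KNOWN_SERVICES = {"payments", "search", "api", "dashboard"}


def validate_flag_name(name, known_services=KNOWN_SERVICES):
    """Return a list of problems; an empty list means the name passes."""
    problems = []
    if name != name.lower():
        problems.append("name must be lowercase")
    lowered = name.lower()
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", lowered):
        problems.append("name must be kebab-case")
        return problems

    flag_type, _, rest = lowered.partition("-")
    if flag_type not in ALLOWED_TYPES:
        problems.append(f"unknown type prefix: {flag_type!r}")

    # Service identifiers may be hyphenated, so match against the known list,
    # longest first, instead of splitting on the next hyphen.
    service = next(
        (s for s in sorted(known_services, key=len, reverse=True)
         if rest == s or rest.startswith(s + "-")),
        None,
    )
    if service is None:
        problems.append("service segment is not a known service identifier")
        return problems

    words = [w for w in rest[len(service) + 1:].split("-") if w]
    if not words:
        problems.append("feature description is missing")
    elif len(words) > MAX_DESCRIPTION_WORDS:
        problems.append(f"feature description exceeds {MAX_DESCRIPTION_WORDS} words")
    if any(w.isdigit() for w in words):
        problems.append("description contains a bare number (ticket ID or date?)")
    return problems
```

The check cannot detect vague names such as `release-new-thing`; that still needs a human reviewer.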
|
||||
|
||||
---
|
||||
|
||||
## 3. Flag Creation Checklist
|
||||
|
||||
Complete every item before creating a flag in the production environment.
|
||||
|
||||
**Before creating the flag:**
|
||||
- [ ] Flag type determined from taxonomy (Section 1)
|
||||
- [ ] Flag name follows naming convention (Section 2)
|
||||
- [ ] Flag owner assigned — one named engineer responsible for cleanup
|
||||
- [ ] Cleanup date set in the flag description field (for Release and Experiment flags)
|
||||
- [ ] Rollout strategy defined — see Section 4
|
||||
- [ ] Monitoring plan defined — see Section 5
|
||||
- [ ] Code review approved with flag guard in place
|
||||
|
||||
**Flag description field (required):**
|
||||
```
|
||||
Type: [Release / Experiment / Ops / Permission]
|
||||
Owner: [Name]
|
||||
Linked ticket: [JIRA-XXXX or GitHub issue URL]
|
||||
Purpose: [One sentence — what this flag controls]
|
||||
Cleanup by: [Date — required for Release and Experiment flags; "Annual review" for Ops/Permission]
|
||||
Rollout plan: [Link to this document or inline summary]
|
||||
```
|
||||
|
||||
**Code requirements:**
|
||||
```python
# Good: behaviour is clear when the flag is off, and cleanup is obvious
def handle(request, user_context):
    if flag_client.is_enabled("release-[service]-[feature]", user_context):
        return new_feature_handler(request)
    else:
        return existing_handler(request)

# Bad: nested flags, ternaries, and implicit defaults make cleanup error-prone
result = new_handler() if (f1 and not f2) or f3 else old_handler()
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Rollout Strategy
|
||||
|
||||
### Decision Tree
|
||||
|
||||
Use this decision tree to pick the right rollout strategy for a Release or Experiment flag:
|
||||
|
||||
```
Is the change reversible without a deploy?
├── No → Use an Ops flag with manual enable, not a percentage rollout
└── Yes → Continue

Is there a user-level identifier available (user ID, session ID)?
├── No → Use server-side percentage (stateless, but inconsistent per user)
└── Yes → Use user-based percentage (consistent experience per user) ← preferred

Is the change risky (touches payments, auth, or data writes)?
├── Yes → Start at 1% → 5% → 25% → 50% → 100%, with 24-hour holds
└── No → Start at 10% → 50% → 100%, with 4-hour holds

Does the change affect specific customer tiers or geographies?
├── Yes → Use segment-based targeting, not percentage rollout
└── No → Use percentage rollout
```
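
The "consistent experience per user" property of user-based percentage rollout usually comes from deterministic hashing. Most flag platforms do this internally; the sketch below only illustrates the idea for a custom or in-house solution. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib


def bucket_for(flag_key, user_id):
    """Map (flag, user) to a stable bucket in [0.00, 99.99]."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and reduce to 10,000 buckets.
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100.0


def is_in_rollout(flag_key, user_id, percentage):
    """True if this user falls inside the current rollout percentage."""
    return bucket_for(flag_key, user_id) < percentage
```

Because the bucket depends only on the flag key and user ID, a user who is in the 10% cohort stays in the cohort when the rollout advances to 50%. Including the flag key in the hash keeps cohorts for different flags independent of each other.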
|
||||
|
||||
### Rollout Stages
|
||||
|
||||
| Stage | Percentage | Hold duration | Pass criteria before advancing |
|---|---|---|---|
| Canary | 1% | 24 hours | Error rate within SLO, no P1 incidents |
| Early rollout | 5–10% | 24 hours | Error rate and latency match control group |
| Partial rollout | 25–50% | 24–48 hours | Business metrics not degraded vs. control |
| Majority | 75% | 24 hours | Final check: no regressions |
| Full rollout | 100% | 48 hours | Stable; schedule cleanup |
|
||||
|
||||
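**Do not skip stages for Release flags in production.** The table above is the default schedule; where the decision tree selects a shorter schedule for a low-risk change, follow every step of that schedule. Speed of rollout is not worth a production incident.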
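
One way to make the "no skipped stages" rule mechanical is a gate that only ever returns the immediately following stage, and only after the hold has elapsed and the pass criteria are met. This is a sketch with hypothetical names; using the upper bound of each percentage and hold range from the table is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    percentage: int
    hold_hours: int


# Upper bounds of the table's ranges (assumption).
STAGES = [
    Stage("Canary", 1, 24),
    Stage("Early rollout", 10, 24),
    Stage("Partial rollout", 50, 48),
    Stage("Majority", 75, 24),
    Stage("Full rollout", 100, 48),
]


def next_stage(current_index, hours_at_stage, criteria_met):
    """Return the next Stage, or None if the flag must stay where it is."""
    if current_index == len(STAGES) - 1:
        return None  # already at full rollout; schedule cleanup instead
    current = STAGES[current_index]
    if hours_at_stage < current.hold_hours or not criteria_met:
        return None
    return STAGES[current_index + 1]
```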
|
||||
|
||||
### Segment-Based Targeting
|
||||
|
||||
Use segment targeting when the rollout must be restricted:
|
||||
|
||||
```yaml
# Illustrative targeting rules (pseudo-config, not a literal LaunchDarkly schema); adapt for your platform
targeting_rules:
  - clause:
      attribute: "subscription_tier"
      operator: "in"
      values: ["enterprise", "team"]
    serve: "on"
  - clause:
      attribute: "country"
      operator: "in"
      values: ["US", "CA", "GB"]
    serve: "on"
default: "off"
```
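
To make the evaluation order of rules like these explicit, here is a minimal first-match evaluator. It is a sketch of the general technique rather than any platform's actual engine; the rule structure and the `evaluate` function are hypothetical.

```python
RULES = [
    {"attribute": "subscription_tier", "values": {"enterprise", "team"}, "serve": "on"},
    {"attribute": "country", "values": {"US", "CA", "GB"}, "serve": "on"},
]
DEFAULT = "off"


def evaluate(user, rules=RULES, default=DEFAULT):
    """Return the variation of the first rule whose clause matches the user."""
    for rule in rules:
        if user.get(rule["attribute"]) in rule["values"]:
            return rule["serve"]
    return default
```

Separate rules behave as OR: an enterprise user in any country is served "on", and so is a free-tier user in the US. If a compliance constraint requires both conditions at once (for example, enterprise customers in specific regions only), the clauses need to be combined within a single rule instead.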
|
||||
|
||||
---
|
||||
|
||||
## 5. Monitoring Requirements
|
||||
|
||||
Every flag that is not at 0% or 100% rollout requires active monitoring. Do not roll out a flag and walk away.
|
||||
|
||||
### Required Metrics Per Flag
|
||||
|
||||
| Metric | What to compare | Alert threshold |
|
||||
|---|---|---|
|
||||
| Error rate | Flag-on cohort vs. flag-off cohort | >2× baseline error rate in flag-on group |
|
||||
| p99 latency | Flag-on vs. flag-off | >20% higher latency in flag-on group |
|
||||
| [Primary business metric] | Flag-on vs. flag-off | >5% degradation in flag-on group |
|
||||
| [Conversion / completion rate] | Flag-on vs. flag-off | >2% drop in flag-on group |
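
The alert thresholds in the table can be expressed as a cohort comparison. The sketch below assumes that the business-metric and conversion thresholds are relative drops (not percentage points) and that higher values are better for both; the dictionary keys and function name are hypothetical.

```python
def breached_thresholds(flag_on, flag_off):
    """Compare flag-on and flag-off cohort metrics against the alert thresholds."""
    breaches = []
    # Error rate: more than 2x the flag-off baseline.
    if flag_on["error_rate"] > 2 * flag_off["error_rate"]:
        breaches.append("error_rate")
    # p99 latency: more than 20% higher than flag-off.
    if flag_on["p99_latency_ms"] > 1.20 * flag_off["p99_latency_ms"]:
        breaches.append("p99_latency")
    # Primary business metric: more than 5% relative degradation.
    if flag_on["primary_metric"] < 0.95 * flag_off["primary_metric"]:
        breaches.append("primary_metric")
    # Conversion / completion rate: more than 2% relative drop.
    if flag_on["conversion_rate"] < 0.98 * flag_off["conversion_rate"]:
        breaches.append("conversion_rate")
    return breaches
```

With a flag-off error rate of exactly zero, any error in the flag-on cohort counts as a breach. At 1% canary traffic the cohort is small, so a minimum sample size before alerting is worth considering.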
|
||||
|
||||
**Setting up split metric monitoring in [LaunchDarkly / Split / Datadog]:**
|
||||
```
|
||||
1. Navigate to the flag → Metrics tab
|
||||
2. Add metric: [primary business metric]
|
||||
3. Add metric: error_rate (service-level)
|
||||
4. Add metric: p99_latency (endpoint-level)
|
||||
5. Set alert: notify [flag owner] in Slack #[team-channel] if metric degrades by [threshold]
|
||||
6. Set experiment duration: [N days] if this is an Experiment flag
|
||||
```
|
||||
|
||||
### Guardrail Metrics
|
||||
|
||||
These metrics must never degrade, regardless of what the primary metric shows. If a guardrail is breached, roll back immediately — do not wait for investigation.
|
||||
|
||||
- Error rate exceeds SLO threshold ([X]%)
|
||||
- p99 latency exceeds SLO threshold ([Y] ms)
|
||||
- [Service-specific guardrail — e.g. payment failure rate, auth failure rate]
|
||||
|
||||
**Immediate rollback command if guardrail is breached:**
|
||||
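```bash
# LaunchDarkly REST API (semantic patch). Verify against your account's API version.
curl -X PATCH "https://app.launchdarkly.com/api/v2/flags/[project-key]/[flag-key]" \
  -H "Authorization: [api-key]" \
  -H "Content-Type: application/json; domain-model=launchdarkly.semanticpatch" \
  -d '{"environmentKey": "[environment-key]", "instructions": [{"kind": "turnFlagOff"}]}'

# Split: the command syntax below is unverified. Confirm it against your Split tooling before relying on it.
split-cli update-treatment [flag-name] --treatment "off" --percentage 100

# Unleash Admin API (project- and environment-scoped route, v4+). Verify against your Unleash version.
curl -X POST "https://[unleash-host]/api/admin/projects/[project-id]/features/[flag-name]/environments/[environment]/off" \
  -H "Authorization: [admin-token]"

# Custom: adapt to your implementation
[command or dashboard step]
```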
|
||||
|
||||
---
|
||||
|
||||
## 6. Per-Flag Creation Template
|
||||
|
||||
Copy this template into your flag's description field and the linked ticket when creating a new flag:
|
||||
|
||||
```markdown
|
||||
## Flag: [flag-name]
|
||||
|
||||
**Type:** [Release / Experiment / Ops / Permission]
|
||||
**Owner:** [Name] ([Slack handle])
|
||||
**Created:** [Date]
|
||||
**Cleanup by:** [Date]
|
||||
**Linked ticket:** [URL]
|
||||
|
||||
### Purpose
|
||||
[One paragraph: what this flag controls, why it exists, what "on" and "off" mean]
|
||||
|
||||
### Rollout Plan
|
||||
| Stage | Target | Date | Approved by |
|
||||
|---|---|---|---|
|
||||
| Canary | 1% | [Date] | [Name] |
|
||||
| Early | 10% | [Date] | [Name] |
|
||||
| Partial | 50% | [Date] | [Name] |
|
||||
| Full | 100% | [Date] | [Name] |
|
||||
|
||||
### Monitoring
|
||||
- Primary metric: [metric name and dashboard link]
|
||||
- Guardrail metrics: error rate < [X]%, p99 < [Y] ms
|
||||
- Alert channel: #[team-channel]
|
||||
|
||||
### Rollback Procedure
|
||||
[Exact steps to turn the flag off in an emergency — should take < 2 minutes]
|
||||
|
||||
### Cleanup Checklist
|
||||
- [ ] Flag at 100% for 48+ hours with no incidents
|
||||
- [ ] Code path for flag-off branch removed from codebase
|
||||
- [ ] Flag deleted from [platform]
|
||||
- [ ] Ticket closed
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Emergency Kill-Switch Procedure
|
||||
|
||||
When a flag needs to be disabled immediately due to a production incident:
|
||||
|
||||
**Time target: flag disabled within 2 minutes of decision.**
|
||||
|
||||
```
|
||||
1. Go to [platform URL] — bookmark this: [URL]
|
||||
2. Search for the flag by name: [flag-name]
|
||||
3. Set to 0% / "off" for ALL users
|
||||
4. Verify the service error rate drops within 60 seconds
|
||||
5. Post to #incidents:
|
||||
"🟡 Feature flag [flag-name] disabled — rolling back [feature description].
|
||||
Owner: [name]. Error rate before: [X]%. Monitoring for recovery."
|
||||
6. Page the flag owner if not already aware
|
||||
```
|
||||
|
||||
**For ops flags (kill switches that must turn OFF normally-on behaviour):**
|
||||
```bash
|
||||
# These flags are "on" by default and turned "off" to disable a feature
|
||||
# Confirm the flag polarity before toggling — "off" may mean "disabled" or "enabled" depending on naming
|
||||
# Flag [flag-name]: OFF = [feature behaviour when off]
|
||||
[kill switch command for your platform]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Stale Flag Policy and Cleanup
|
||||
|
||||
Stale flags are flags that have outlived their purpose: they are past their cleanup date, have sat at 100% (or 0%) rollout for more than 30 days, or no longer have an active owner. Stale flags are technical debt.
|
||||
|
||||
### Stale Flag Definition
|
||||
|
||||
A flag is stale if ANY of the following are true:
|
||||
- It is a Release flag past its cleanup date
|
||||
- It has been at 100% (or 0%) rollout for more than 30 days
|
||||
- Its linked ticket is closed and code cleanup has not happened
|
||||
- Its owner has left the team
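
The four conditions above translate directly into a check over a flag inventory. The record type below is hypothetical; a real implementation would populate it from the flag platform's API and the ticket tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FlagRecord:  # hypothetical inventory record
    key: str
    flag_type: str  # "release", "exp", "ops", or "perm"
    cleanup_by: Optional[date]
    rollout_percentage: int
    percentage_unchanged_since: date
    ticket_closed: bool
    code_cleanup_done: bool
    owner_on_team: bool


def stale_reasons(flag, today):
    """Return every stale condition the flag meets; an empty list means not stale."""
    reasons = []
    if flag.flag_type == "release" and flag.cleanup_by and today > flag.cleanup_by:
        reasons.append("past cleanup date")
    days_unchanged = (today - flag.percentage_unchanged_since).days
    if flag.rollout_percentage in (0, 100) and days_unchanged > 30:
        reasons.append("at 0% or 100% for more than 30 days")
    if flag.ticket_closed and not flag.code_cleanup_done:
        reasons.append("ticket closed without code cleanup")
    if not flag.owner_on_team:
        reasons.append("owner has left the team")
    return reasons
```

Applied literally, the 30-day condition also matches long-lived Ops and Permission flags, which the taxonomy allows to live indefinitely. A team may want to route those to the annual review instead of the stale list.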
|
||||
|
||||
### Cleanup Checklist
|
||||
|
||||
```
|
||||
[ ] Flag is at 100% rollout and has been stable for 48+ hours
|
||||
[ ] Monitoring shows no issues for the flag-on cohort
|
||||
[ ] Code changes:
|
||||
[ ] Remove the flag check from application code
|
||||
[ ] Remove the "off" code path entirely — do not leave dead code
|
||||
[ ] Remove any flag-related tests that test the off behaviour
|
||||
[ ] Update any documentation that references the flag
|
||||
[ ] PR merged and deployed to production
|
||||
[ ] Flag deleted from [platform] (do not just disable — delete)
|
||||
[ ] Cleanup ticket closed
|
||||
[ ] Flag owner confirms cleanup in Slack: "Flag [name] has been cleaned up — [commit link]"
|
||||
```
|
||||
|
||||
**Automated stale flag detection:**
|
||||
```bash
# Run weekly. This query is a first-pass filter only: it lists flags CREATED more than
# 30 days ago (creationDate is in epoch milliseconds). Check each result against its
# cleanup date and current rollout percentage before treating it as stale.
# [Platform-specific query; adapt:]

# LaunchDarkly API
curl -s "https://app.launchdarkly.com/api/v2/flags/[project-key]" \
  -H "Authorization: [api-key]" | \
  jq '.items[] | select(.creationDate < (now - 2592000) * 1000) | {key: .key, created: .creationDate}'

# Notify #engineering-housekeeping with list of stale flags
```
|
||||
|
||||
### Stale Flag Escalation
|
||||
|
||||
| Age past cleanup date | Action |
|---|---|
| 0–13 days | Slack reminder to flag owner |
| 14–29 days | Slack reminder to flag owner + tech lead |
| 30–59 days | Tech lead assigns cleanup, creates ticket with P2 priority |
| 60+ days | Engineering manager reviews; flag may be force-deleted |
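
The escalation ladder can be automated alongside the weekly stale-flag report. In this sketch each boundary day (14, 30, 60) is treated as the start of the next tier, which is an assumption; the function name is hypothetical.

```python
def escalation_action(days_past_cleanup):
    """Map days past the cleanup date to the escalation step to take."""
    if days_past_cleanup < 0:
        return "No action: cleanup date not yet reached"
    if days_past_cleanup < 14:
        return "Slack reminder to flag owner"
    if days_past_cleanup < 30:
        return "Slack reminder to flag owner + tech lead"
    if days_past_cleanup < 60:
        return "Tech lead assigns cleanup, creates P2 ticket"
    return "Engineering manager reviews; flag may be force-deleted"
```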
|
||||
|
||||
---
|
||||
|
||||
## 9. Governance
|
||||
|
||||
### Who Can Do What
|
||||
|
||||
| Action | Who | Approval required |
|---|---|---|
| Create a flag (any environment) | Any engineer | None, but must complete creation checklist |
| Enable a flag in development | Any engineer | None |
| Enable a flag in staging | Any engineer | None |
| Enable a flag in production (0–10%) | Flag owner | Tech lead awareness |
| Advance rollout in production (above 10%, up to 100%) | Flag owner | Tech lead sign-off per stage |
| Toggle an Ops flag in production | On-call engineer | None (these are break-glass controls) |
| Delete a flag | Flag owner | Tech lead confirmation that code cleanup is done |
| Create a Permission flag | Flag owner | Product manager approval |
|
||||
|
||||
### Audit Logging
|
||||
|
||||
All flag changes in production must be traceable. Ensure the following are configured in [platform]:
|
||||
|
||||
- **Change log:** Every production flag change logs: who changed it, what they changed, and when.
|
||||
- **Slack notifications:** Production flag changes post to `#[team]-flag-changes` automatically.
|
||||
- **Quarterly review:** Every quarter, the tech lead reviews the full flag inventory, confirms owners are current, and removes flags with no owner.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every flag has an owner named in its description — no orphan flags
|
||||
- [ ] Release and Experiment flags have a cleanup date set — not open-ended
|
||||
- [ ] Monitoring is configured for every flag currently between 1% and 99% rollout
|
||||
- [ ] The emergency kill-switch procedure has been tested — on-call engineers have bookmarked the platform URL and know the steps
|
||||
- [ ] Stale flag detection runs automatically and results are reviewed weekly
|
||||
- [ ] Code review checklist includes: "Does this PR introduce a flag? If yes, is the creation checklist complete?"
|
||||
- [ ] At least one person other than the flag owner knows how to disable any given flag in an emergency
|
||||
@@ -0,0 +1,292 @@
|
||||
---
|
||||
name: infra-as-code-review
|
||||
description: "Write an infrastructure-as-code review checklist and conduct a structured review of Terraform, CloudFormation, Pulumi, or Ansible code. Use when asked to review IaC code, audit infrastructure configurations, check cloud security posture, or produce a reusable IaC review checklist. Produces a structured review report with severity-categorized findings, remediation guidance, and a reusable checklist."
|
||||
---
|
||||
|
||||
# Infrastructure-as-Code Review
|
||||
|
||||
Produce a structured infrastructure-as-code review that applies security, reliability, and operational quality standards to a specific body of IaC code. The output serves two purposes: an actionable review report for the code at hand (with findings by severity and specific remediation steps), and a reusable checklist the team can apply to every future IaC change. If the user provides actual code, analyze it and populate the findings table with real issues. If no code is provided, produce the checklist and a template findings report.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **IaC tool** — Terraform, CloudFormation, Pulumi, Ansible, or CDK
|
||||
- **Cloud provider** — AWS, GCP, Azure, or multi-cloud
|
||||
- **What the code provisions** — a brief description (e.g., "VPC, EKS cluster, and RDS instance for the payments service")
|
||||
- **Security policies or naming standards in use** — any existing org standards to check against; if none, use sensible defaults
|
||||
- **The IaC code itself** — paste or describe it; if not provided, produce the checklist template only and note findings require code
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# IaC Review Report: [What Is Being Provisioned]
|
||||
|
||||
**Reviewer:** [Name / Claude]
|
||||
**IaC Tool:** [Terraform / CloudFormation / Pulumi / Ansible / CDK]
|
||||
**Cloud Provider:** [AWS / GCP / Azure]
|
||||
**Code Location:** [Repo path or PR link]
|
||||
**Review Date:** [Date]
|
||||
**Overall Risk:** [Critical / High / Medium / Low]
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
| Severity | Finding Count | Resolved in This Review | Carry-Over Risk |
|
||||
|----------|---------------|------------------------|-----------------|
|
||||
| Critical | [n] | [n] | [Yes/No — explain] |
|
||||
| High | [n] | [n] | [Yes/No — explain] |
|
||||
| Medium | [n] | [n] | [Yes/No — explain] |
|
||||
| Low | [n] | [n] | [Yes/No — explain] |
|
||||
| **Total** | **[n]** | **[n]** | |
|
||||
|
||||
**Recommendation:** [Approve / Approve with Required Changes / Block — one sentence rationale]
|
||||
|
||||
---
|
||||
|
||||
## Findings
|
||||
|
||||
### Critical Findings
|
||||
|
||||
#### CRIT-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | Critical |
|
||||
| **Category** | [IAM / Secrets / Encryption / Network / State / Naming / Cost] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:42]` |
|
||||
| **Risk** | [What can go wrong — be specific about the attack vector or failure mode] |
|
||||
|
||||
**Current code:**
|
||||
```hcl
# [paste the problematic snippet]
resource "aws_s3_bucket" "data" {
  bucket = "my-bucket"
  acl    = "public-read" # PROBLEM: public read access
}
```
|
||||
|
||||
**Remediation:**
|
||||
```hcl
resource "aws_s3_bucket" "data" {
  bucket = "my-bucket"
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
|
||||
|
||||
**Why this matters:** [One sentence linking the specific risk to business impact — data exposure, compliance violation, etc.]
|
||||
|
||||
---
|
||||
|
||||
#### CRIT-02: [Next Critical Finding — repeat structure]
|
||||
|
||||
---
|
||||
|
||||
### High Findings
|
||||
|
||||
#### HIGH-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | High |
|
||||
| **Category** | [Category] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:line]` |
|
||||
| **Risk** | [Specific risk description] |
|
||||
|
||||
**Current code:**
|
||||
```hcl
# [problematic snippet]
```
|
||||
|
||||
**Remediation:**
|
||||
```hcl
# [fixed snippet]
```
|
||||
|
||||
---
|
||||
|
||||
### Medium Findings
|
||||
|
||||
#### MED-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | Medium |
|
||||
| **Category** | [Category] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:line]` |
|
||||
| **Risk** | [Specific risk description] |
|
||||
|
||||
**Remediation:** [Prose or code snippet — choose whichever is clearer for this finding]
|
||||
|
||||
---
|
||||
|
||||
### Low Findings
|
||||
|
||||
#### LOW-01: [Finding Title]
|
||||
|
||||
| Field | Detail |
|
||||
|-------|--------|
|
||||
| **Severity** | Low |
|
||||
| **Category** | [Category] |
|
||||
| **Resource** | `[resource_type.resource_name]` |
|
||||
| **File / Line** | `[path/to/file.tf:line]` |
|
||||
| **Suggestion** | [What to improve and why] |
|
||||
|
||||
---
|
||||
|
||||
## Reusable IaC Review Checklist
|
||||
|
||||
Use this checklist on every IaC pull request. Check every item; mark N/A only when the item genuinely does not apply to the resources being provisioned.
|
||||
|
||||
### 1. IAM and Access Control
|
||||
|
||||
- [ ] No wildcard actions (`"*"`) in IAM policies — policies follow least-privilege
|
||||
- [ ] No wildcard resource (`"*"`) in IAM policies unless explicitly justified with a comment
|
||||
- [ ] IAM roles use condition keys to restrict scope (e.g., `aws:RequestedRegion`, `sts:ExternalId`)
|
||||
- [ ] No IAM access keys or credentials hardcoded or in plaintext variables
|
||||
- [ ] EC2 / compute instances use instance profiles, not hardcoded credentials
|
||||
- [ ] S3 bucket policies do not allow public access unless the bucket is explicitly a public asset bucket
|
||||
- [ ] Cross-account trust policies name specific account IDs, not `"*"`
|
||||
- [ ] Service accounts (GCP) / managed identities (Azure) follow naming conventions and have documented purpose
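
The two wildcard items in this list lend themselves to a quick automated pre-check. The following is a minimal sketch, assuming the policy is available as a standard IAM policy JSON document; the function name `find_wildcards` and the sample policy are hypothetical and not part of any existing tool.

```python
import json


def find_wildcards(policy_json: str) -> list[str]:
    """Describe each Allow statement whose Action or Resource is a bare "*"."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = stmt.get(field, [])
            if isinstance(values, str):  # IAM accepts a string or a list
                values = [values]
            if "*" in values:
                findings.append(f"Statement {index}: wildcard {field}")
    return findings


# Hypothetical policy used only to exercise the check
sample = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"},
        {"Effect": "Allow", "Action": ["*"], "Resource": "arn:aws:s3:::logs/*"},
    ],
})
print(find_wildcards(sample))
```

The sketch only catches a bare `"*"`; service-level wildcards such as `s3:*` and justified exceptions still need a human reviewer.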
|
||||
|
||||
### 2. Secrets Management
|
||||
|
||||
- [ ] No secrets, passwords, tokens, or API keys in plaintext in any `.tf`, `.yaml`, or `.json` file
|
||||
- [ ] No secrets in variable default values
|
||||
- [ ] Secrets sourced from Secrets Manager / Parameter Store / Vault — not from environment variables passed at plan time
|
||||
- [ ] `sensitive = true` is set on all output values and variables that contain secrets (Terraform)
|
||||
- [ ] State backend is encrypted — no unencrypted state files contain sensitive data
|
||||
- [ ] `.gitignore` or equivalent excludes `*.tfvars`, `terraform.tfstate`, and any file that may contain resolved secrets
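
A rough way to pre-screen files for the plaintext-secret items above is a pattern scan. This is a sketch under stated assumptions: the two patterns are illustrative only (dedicated secret scanners ship far larger rule sets), and `scan_for_secrets` and the sample snippet are hypothetical names introduced for the example.

```python
import re

# Illustrative patterns only; not an exhaustive rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A credential-like key assigned a quoted literal (interpolations with "$" are skipped)
    "quoted_credential": re.compile(
        r'(?i)\b(password|secret|token|api_key)\b\s*=\s*"[^"$]+"'
    ),
}


def scan_for_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits


# Hypothetical Terraform snippet used only to exercise the scan
sample_tf = '''resource "aws_db_instance" "main" {
  username = var.db_user
  password = "hunter2"
}
'''
print(scan_for_secrets(sample_tf))
```

A scan like this produces false negatives (for example, a secret placed in a variable `default`), so it supplements the checklist rather than replacing it.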
|
||||
|
||||
### 3. Encryption at Rest
|
||||
|
||||
- [ ] Storage resources (S3, EBS, RDS, DynamoDB, GCS, Azure Blob) have encryption at rest enabled
|
||||
- [ ] Customer-managed keys (CMK/KMS) are used where required by policy — not solely AWS/GCP/Azure managed keys
|
||||
- [ ] KMS key rotation is enabled for all CMKs
|
||||
- [ ] Database snapshots have encryption enabled
|
||||
- [ ] Encryption is not disabled via `encrypted = false` or equivalent
|
||||
|
||||
### 4. Encryption in Transit
|
||||
|
||||
- [ ] Load balancers terminate TLS — HTTP-only listeners redirect to HTTPS or are absent
|
||||
- [ ] Minimum TLS version is 1.2; TLS 1.0 and 1.1 are explicitly disabled
|
||||
- [ ] RDS / database connections require SSL (e.g., `rds.force_ssl = 1` for PostgreSQL, `require_secure_transport = ON` for MySQL, or the engine's equivalent parameter)
|
||||
- [ ] Internal service-to-service calls use TLS where the network is not fully private
|
||||
- [ ] S3 bucket policies include a `Deny` on non-TLS requests (`aws:SecureTransport: false`)
|
||||
|
||||
### 5. Network and Public Access
|
||||
|
||||
- [ ] Security groups / firewall rules do not permit `0.0.0.0/0` ingress except on ports 80/443 for public-facing services
|
||||
- [ ] SSH (port 22) and RDP (port 3389) are not open to `0.0.0.0/0`
|
||||
- [ ] Databases are in private subnets — not directly internet-routable
|
||||
- [ ] `publicly_accessible = false` on RDS instances unless explicitly required and documented
|
||||
- [ ] VPC has flow logs enabled
|
||||
- [ ] Network ACLs and security groups are layered (defense in depth)
|
||||
- [ ] S3 bucket public access block is enabled at the account and bucket level
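
The first two items in this list reduce to a simple rule over ingress entries. A minimal sketch, assuming the rules have already been extracted into plain records; the `IngressRule` type and the sample rules are hypothetical, and the allowed public ports (80 and 443) come from the checklist item above.

```python
from dataclasses import dataclass

PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}
ALLOWED_PUBLIC_PORTS = {80, 443}


@dataclass
class IngressRule:
    from_port: int
    to_port: int
    cidr: str


def open_ingress_violations(rules: list[IngressRule]) -> list[str]:
    """Flag world-open rules that expose any port other than 80 or 443."""
    violations = []
    for rule in rules:
        if rule.cidr not in PUBLIC_CIDRS:
            continue
        ports = range(rule.from_port, rule.to_port + 1)
        if any(port not in ALLOWED_PUBLIC_PORTS for port in ports):
            violations.append(f"{rule.cidr} -> ports {rule.from_port}-{rule.to_port}")
    return violations


# Hypothetical rules used only to exercise the check
rules = [
    IngressRule(443, 443, "0.0.0.0/0"),     # public HTTPS: allowed
    IngressRule(22, 22, "0.0.0.0/0"),       # public SSH: violation
    IngressRule(5432, 5432, "10.0.0.0/16"), # private database access: allowed
]
print(open_ingress_violations(rules))
```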
|
||||
|
||||
### 6. Logging, Monitoring, and Audit
|
||||
|
||||
- [ ] CloudTrail / Cloud Audit Logs / Azure Monitor is enabled across all regions
|
||||
- [ ] S3 access logging is enabled on buckets containing sensitive or regulated data
|
||||
- [ ] RDS enhanced monitoring or equivalent is enabled
|
||||
- [ ] CloudWatch alarms or equivalent are defined for critical metrics (CPU, disk, error rate)
|
||||
- [ ] Log retention periods are explicitly defined: logs are neither retained indefinitely nor deleted in under 7 days
|
||||
|
||||
### 7. Naming and Tagging Standards
|
||||
|
||||
- [ ] All resources follow the team's naming convention: `[env]-[team]-[resource-type]-[identifier]`
|
||||
- [ ] Required tags are present on all taggable resources:
|
||||
  - [ ] `Environment` (e.g., prod / staging / dev)
  - [ ] `Team` or `Owner`
  - [ ] `Service` or `Application`
  - [ ] `CostCenter` (if required by finance policy)
  - [ ] `ManagedBy: terraform` (or equivalent IaC tool tag)
|
||||
- [ ] No resources with default names (e.g., `default-vpc`, `launch-wizard-1`)
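
The naming and tagging items above can be expressed as a small validator. This is a sketch, assuming three environment names (`prod`, `staging`, `dev`) and lowercase hyphen-separated segments; the regular expression, the tag groups, and `check_resource` are hypothetical and should be adapted to the team's actual standard.

```python
import re

# Assumed shape of [env]-[team]-[resource-type]-[identifier]
NAME_PATTERN = re.compile(r"(prod|staging|dev)-[a-z0-9]+-[a-z0-9]+-[a-z0-9-]+")

# Each group is satisfied by any one of its tags (e.g. Team or Owner)
REQUIRED_TAG_GROUPS = [
    ("Environment",),
    ("Team", "Owner"),
    ("Service", "Application"),
    ("ManagedBy",),
]


def check_resource(name: str, tags: dict[str, str]) -> list[str]:
    """Return a list of naming and tagging problems for one resource."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name '{name}' does not match env-team-type-identifier")
    for group in REQUIRED_TAG_GROUPS:
        if not any(tag in tags for tag in group):
            problems.append(f"missing tag: {' or '.join(group)}")
    return problems


print(check_resource("launch-wizard-1", {"Environment": "dev"}))
```

`CostCenter` is left out of the required groups because the checklist marks it as conditional on finance policy.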
|
||||
|
||||
### 8. State Management and Backend
|
||||
|
||||
- [ ] Remote state backend is configured — no local state in repository
|
||||
- [ ] State backend uses locking (DynamoDB for S3 backend, etc.)
|
||||
- [ ] State backend bucket/storage has versioning enabled
|
||||
- [ ] State backend bucket/storage has access logging enabled
|
||||
- [ ] Workspaces or separate state files are used per environment — no shared state between prod and non-prod
|
||||
- [ ] `*.tfstate` and `*.tfstate.*` (which cover `terraform.tfstate` and its `.backup` files) are in `.gitignore`
|
||||
|
||||
### 9. Module and Resource Structure
|
||||
|
||||
- [ ] Modules are versioned with explicit version pins — no floating `source = "git::...?ref=main"`
|
||||
- [ ] Provider versions are pinned in `required_providers` — no unconstrained `>= x.y`
|
||||
- [ ] Terraform version is pinned in `required_version`
|
||||
- [ ] Modules have a clear single responsibility — not one module that provisions everything
|
||||
- [ ] No copy-paste duplication — repeated patterns use modules or loops (`for_each`, `count`)
|
||||
- [ ] Outputs expose only what downstream consumers need — no unnecessary output sprawl
|
||||
|
||||
### 10. Environment Parity
|
||||
|
||||
- [ ] Prod and non-prod environments use the same module code, parameterized by environment variable
|
||||
- [ ] Instance sizes and replica counts differ by environment via variables — not by separate code branches
|
||||
- [ ] Non-prod does not have security controls disabled "to save money" (encryption off, logging off)
|
||||
|
||||
### 11. Cost Impact
|
||||
|
||||
- [ ] Large instance types (e.g., `r5.16xlarge`) or storage allocations are justified in a comment
|
||||
- [ ] Data transfer costs are considered for cross-region or cross-AZ architectures
|
||||
- [ ] Reserved instance or committed use discount eligibility is noted for long-lived resources
|
||||
- [ ] Auto-scaling is configured for variable workloads — no fixed oversized fleets for spiky traffic
|
||||
- [ ] Lifecycle policies are set on S3 buckets storing time-bounded data (logs, backups)
|
||||
|
||||
### 12. Drift Risk
|
||||
|
||||
- [ ] Every IaC-managed resource that is commonly changed in the console has documented import steps
|
||||
- [ ] `lifecycle { prevent_destroy = true }` is set on stateful resources in production (databases, state buckets)
|
||||
- [ ] `ignore_changes` is used sparingly and each instance is documented with a rationale comment
|
||||
- [ ] A plan is run against the live environment as part of the PR process — no unreviewed drift
|
||||
|
||||
---
|
||||
|
||||
## Findings Summary Table
|
||||
|
||||
| ID | Title | Severity | Category | File | Status |
|
||||
|----|-------|----------|----------|------|--------|
|
||||
| CRIT-01 | [Title] | Critical | [Category] | [file:line] | Open |
|
||||
| HIGH-01 | [Title] | High | [Category] | [file:line] | Open |
|
||||
| MED-01 | [Title] | Medium | [Category] | [file:line] | Open |
|
||||
| LOW-01 | [Title] | Low | [Category] | [file:line] | Open |
|
||||
|
||||
---
|
||||
|
||||
## Required Actions Before Merge
|
||||
|
||||
List only Critical and High findings that must be resolved before this code is merged:
|
||||
|
||||
1. **CRIT-01 [Title]** — [One-line remediation instruction]
|
||||
2. **HIGH-01 [Title]** — [One-line remediation instruction]
|
||||
|
||||
Medium and Low findings should be tracked as follow-up issues with a committed resolution date.
|
||||
|
||||
---
|
||||
|
||||
*Review conducted by [Reviewer] on [Date] — checklist version [1.0]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every finding includes severity, category, specific resource name, and file and line number; Critical and High findings also include current code and fixed code
|
||||
- [ ] Checklist covers all 12 categories: IAM, Secrets, Encryption at Rest, Encryption in Transit, Network, Logging, Naming/Tagging, State, Module Structure, Environment Parity, Cost, and Drift
|
||||
- [ ] Executive summary table is filled with real counts — not all zeros or all placeholders
|
||||
- [ ] "Required Actions Before Merge" section lists only Critical and High items
|
||||
- [ ] Code snippets in findings show both the problematic code AND the corrected version
|
||||
- [ ] Overall risk rating is justified by the highest-severity open finding
|
||||
- [ ] Checklist items are binary (checkable) — not narrative observations
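
Several of these checks (real counts in the summary, overall risk tied to the highest-severity open finding, merge blockers limited to Critical and High) can be derived mechanically from the findings list. A minimal sketch, assuming findings are plain dictionaries with `id`, `severity`, and `status` keys; the `summarize` helper and the fallback to `Low` when nothing is open are assumptions made for the example.

```python
from collections import Counter

SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]


def summarize(findings: list[dict]) -> dict:
    """Derive summary counts, overall risk, and merge blockers from findings."""
    open_findings = [f for f in findings if f["status"] == "Open"]
    open_severities = {f["severity"] for f in open_findings}
    counts = Counter(f["severity"] for f in findings)

    # Overall risk is the highest severity that still has an open finding
    overall = next((s for s in SEVERITY_ORDER if s in open_severities), "Low")
    required = [
        f["id"] for f in open_findings if f["severity"] in ("Critical", "High")
    ]
    return {
        "counts": {s: counts.get(s, 0) for s in SEVERITY_ORDER},
        "overall_risk": overall,
        "required_before_merge": required,
    }


# Hypothetical findings used only to exercise the helper
findings = [
    {"id": "CRIT-01", "severity": "Critical", "status": "Resolved"},
    {"id": "HIGH-01", "severity": "High", "status": "Open"},
    {"id": "MED-01", "severity": "Medium", "status": "Open"},
]
print(summarize(findings))
```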
|
||||
@@ -0,0 +1,432 @@
|
||||
---
|
||||
name: load-testing-plan
|
||||
description: "Write a load and performance testing plan for a service. Use when asked to create a performance test plan, write load testing documentation, define stress or soak test scenarios, or set performance regression gates for CI. Produces a complete test plan document with scenario definitions, k6/Locust script skeleton, threshold table, result interpretation guide, and CI integration steps."
|
||||
---
|
||||
|
||||
# Load Testing Plan Skill
|
||||
|
||||
Produce a complete load and performance testing plan for a service — covering test objectives, scenario definitions, tooling configuration, success thresholds, and CI integration. A good load testing plan eliminates ambiguity about what "performance is acceptable" means, so engineers can run tests and get a pass/fail answer without having to interpret raw numbers themselves.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and key endpoints** — which endpoints are under test (path, method, typical request/response shape)
|
||||
- **Current traffic baseline** — current requests/sec, p50/p99 latency, error rate under normal load
|
||||
- **Peak traffic expectations** — expected peak RPS (e.g. 10× baseline for flash sales, or seasonality peak)
|
||||
- **SLO targets** — latency SLOs (p99 < X ms), error rate SLO (< Y%), availability target
|
||||
- **Preferred testing tool** — k6, Locust, JMeter, Gatling, or no preference
|
||||
- **Test environment availability** — dedicated load test environment, staging, or production (with traffic shaping)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Load Testing Plan: [Service Name]
|
||||
|
||||
**Author:** [Name] | **Team:** [Team name]
|
||||
**Date:** [Date] | **Review cycle:** Before each major release and quarterly
|
||||
**Testing tool:** [k6 / Locust / JMeter / Gatling]
|
||||
**Test environment:** [Environment name and URL]
|
||||
|
||||
---
|
||||
|
||||
## 1. Objectives and Scope
|
||||
|
||||
**What we are testing:** [Service name] handles [describe function — e.g. "user authentication requests from the mobile and web clients"]. This plan validates that the service meets its SLOs under expected and elevated traffic conditions.
|
||||
|
||||
**In scope:**
|
||||
- [Endpoint 1: METHOD /path — description]
|
||||
- [Endpoint 2: METHOD /path — description]
|
||||
- [Endpoint 3: METHOD /path — description]
|
||||
|
||||
**Out of scope:**
|
||||
- [Any endpoints explicitly excluded and why — e.g. "admin APIs — low traffic, excluded from load test"]
|
||||
- [Third-party integrations that cannot be load-tested — mock them instead]
|
||||
|
||||
---
|
||||
|
||||
## 2. Performance Targets (Success Criteria)
|
||||
|
||||
Every scenario has explicit pass/fail thresholds. A test run FAILS if any threshold is breached.
|
||||
|
||||
| Metric | Baseline scenario | Stress scenario | Spike scenario | Soak scenario |
|---|---|---|---|---|
| p50 latency | < [X] ms | < [X × 1.5] ms | < [X × 2] ms | < [X] ms |
| p95 latency | < [Y] ms | < [Y × 1.5] ms | < [Y × 2] ms | < [Y] ms |
| p99 latency | < [Z] ms | < [Z × 2] ms | < [Z × 3] ms | < [Z] ms |
| Error rate | < [0.1]% | < [1]% | < [2]% | < [0.1]% |
| Throughput | ≥ [N] RPS | ≥ [N × 3] RPS | N/A | ≥ [N] RPS |
| Failed requests | 0 (5xx) | < [threshold] | < [threshold] | 0 (5xx) |
|
||||
|
||||
**SLO reference:** These thresholds are derived from the service SLOs — p99 < [Z ms], error rate < [0.1]%, availability [99.9]%.
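
The latency rows of the targets table follow fixed multipliers of the baseline values, so they can be generated rather than typed by hand. A minimal sketch using the multipliers from the table; the baseline numbers (50, 200, 400 ms) and the `latency_thresholds` helper are hypothetical.

```python
# Multipliers taken from the p50 / p95 / p99 rows of the targets table
MULTIPLIERS = {
    "p50": {"baseline": 1.0, "stress": 1.5, "spike": 2.0, "soak": 1.0},
    "p95": {"baseline": 1.0, "stress": 1.5, "spike": 2.0, "soak": 1.0},
    "p99": {"baseline": 1.0, "stress": 2.0, "spike": 3.0, "soak": 1.0},
}
SCENARIOS = ("baseline", "stress", "spike", "soak")


def latency_thresholds(baseline_ms: dict[str, float]) -> dict[str, dict[str, float]]:
    """Return the per-scenario latency ceilings (ms) for each percentile."""
    return {
        scenario: {
            pct: baseline_ms[pct] * MULTIPLIERS[pct][scenario] for pct in MULTIPLIERS
        }
        for scenario in SCENARIOS
    }


# Hypothetical SLO-derived baseline values in milliseconds
thresholds = latency_thresholds({"p50": 50, "p95": 200, "p99": 400})
for scenario, limits in thresholds.items():
    print(scenario, limits)
```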
|
||||
|
||||
---
|
||||
|
||||
## 3. Traffic Model
|
||||
|
||||
**Baseline traffic (current production):**
|
||||
- Average RPS: [N] req/sec
|
||||
- Peak RPS (observed): [N] req/sec
|
||||
- Request distribution by endpoint:
|
||||
  - [Endpoint 1]: [X]% of traffic
  - [Endpoint 2]: [Y]% of traffic
  - [Endpoint 3]: [Z]% of traffic
|
||||
|
||||
**Simulated user behaviour:**
|
||||
- Think time between requests: [X–Y] seconds (randomised)
|
||||
- Session duration: [N] minutes average
|
||||
- Authenticated vs anonymous ratio: [X]%/[Y]%
|
||||
- Geographic distribution: [Region 1 X]%, [Region 2 Y]%
|
||||
|
||||
---
|
||||
|
||||
## 4. Test Scenarios
|
||||
|
||||
### Scenario 1: Baseline (Steady-State)
|
||||
|
||||
**Purpose:** Confirm the service performs acceptably under normal production load.
|
||||
**Duration:** 10 minutes
|
||||
**Load profile:** Ramp to [N] RPS over 2 minutes, hold for 8 minutes.
|
||||
**Concurrency:** [N] virtual users
|
||||
|
||||
**Pass criteria:** All thresholds in the Baseline column of the targets table above.
|
||||
|
||||
---
|
||||
|
||||
### Scenario 2: Stress Test
|
||||
|
||||
**Purpose:** Find the breaking point — how much load can the service handle before SLOs are breached?
|
||||
**Duration:** 20–30 minutes
|
||||
**Load profile:** Ramp from [N] RPS (baseline) to [N × 5] RPS in 5-minute steps. Hold each step for 5 minutes. Stop at first SLO breach.
|
||||
**Concurrency:** Scales with RPS target
|
||||
|
||||
**What to record:**
|
||||
- RPS at which p99 latency first exceeds SLO
|
||||
- RPS at which error rate first exceeds SLO
|
||||
- Whether the service recovers when load drops back to baseline
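
The stepped load profile and the "first breach" values to record can be sketched as follows. This assumes each step is a whole multiple of baseline (the plan does not fix the step size), which gives five 5-minute holds; `stress_steps`, `first_breach`, and the sample measurements are hypothetical.

```python
from typing import Optional


def stress_steps(baseline_rps: int, max_multiplier: int = 5, hold_minutes: int = 5):
    """Return (target_rps, hold_minutes) for each step from 1x to max_multiplier x."""
    return [(baseline_rps * m, hold_minutes) for m in range(1, max_multiplier + 1)]


def first_breach(step_results, p99_slo_ms: float, error_rate_slo: float) -> Optional[int]:
    """Return the RPS of the first step that breaches either SLO, or None.

    step_results is a list of (rps, p99_ms, error_rate) in ramp order.
    """
    for rps, p99_ms, error_rate in step_results:
        if p99_ms > p99_slo_ms or error_rate > error_rate_slo:
            return rps
    return None


steps = stress_steps(baseline_rps=100)
print(steps)

# Hypothetical measurements, one tuple per completed step
measured = [(100, 180, 0.000), (200, 210, 0.000), (300, 390, 0.002), (400, 650, 0.030)]
print(first_breach(measured, p99_slo_ms=400, error_rate_slo=0.001))
```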
|
||||
|
||||
---
|
||||
|
||||
### Scenario 3: Spike Test
|
||||
|
||||
**Purpose:** Simulate a sudden traffic surge (flash sale, viral event, bot attack).
|
||||
**Duration:** 15 minutes
|
||||
**Load profile:** Hold at [N] RPS (baseline) for 3 minutes, spike to [N × 10] RPS instantly, hold for 5 minutes, drop back to baseline for 7 minutes.
|
||||
|
||||
**What to record:**
|
||||
- Latency during spike and recovery
|
||||
- Whether the service sheds load gracefully (rate limiting, queue depth)
|
||||
- Time to recover to baseline latency after spike ends
|
||||
|
||||
---
|
||||
|
||||
### Scenario 4: Soak / Endurance Test
|
||||
|
||||
**Purpose:** Detect memory leaks, connection pool exhaustion, and slow degradation over time.
|
||||
**Duration:** 4–8 hours (run overnight)
|
||||
**Load profile:** Steady [N × 1.5] RPS (50% above baseline) for entire duration.
|
||||
|
||||
**What to watch:**
|
||||
- Memory usage trend over time (should not grow unboundedly)
|
||||
- Error rate trend (should be flat, not creeping up)
|
||||
- GC pause frequency (JVM/Go services)
|
||||
- Database connection pool utilisation
|
||||
- p99 latency trend (should not creep up over hours)
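
"Should not grow" and "should not creep up" can be turned into a number by fitting a line through the samples and inspecting the slope. A minimal sketch using an ordinary least-squares fit; the samples and the 5 MB/hour alert limit are hypothetical values chosen for illustration.

```python
def slope_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of value over time; samples are (hours, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator


# Hypothetical memory usage in MB, sampled hourly during a soak run
memory_mb = [(0, 512), (1, 520), (2, 531), (3, 540), (4, 552)]
growth = slope_per_hour(memory_mb)

LEAK_ALERT_MB_PER_HOUR = 5  # assumed limit; tune per service
print(f"{growth:.1f} MB/hour", "LEAK SUSPECTED" if growth > LEAK_ALERT_MB_PER_HOUR else "ok")
```

The same function applies to the error-rate and p99 series; a slope near zero is the expected result for a healthy soak run.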
|
||||
|
||||
---
|
||||
|
||||
## 5. Test Environment Requirements
|
||||
|
||||
### Infrastructure
|
||||
|
||||
| Component | Requirement | Notes |
|---|---|---|
| Service under test | Isolated from production | [N] replicas, matching prod resource limits |
| Database | Separate instance with production-scale data | Seed script in Data Seeding below |
| Cache (Redis/Memcached) | Empty at test start | Ensures cold-start conditions are tested |
| Load generator | Separate from service under test | [N] vCPUs, [N] GB RAM minimum |
| Network | Low-latency path to service | Do not run generator on same host |
|
||||
|
||||
### Data Seeding
|
||||
|
||||
Before every test run, ensure the environment has:
|
||||
```bash
# Seed test users (needed for authenticated endpoint tests)
[seed command or script path, e.g. python scripts/seed_load_test_users.py --count 10000]

# Seed test data for read endpoints
[seed command, e.g. ./scripts/seed_products.sh --count 50000]

# Verify seed completed
[verification command, e.g. psql $DB_URL -c "SELECT COUNT(*) FROM users WHERE load_test=true"]
```
|
||||
|
||||
**Test data rules:**
|
||||
- Never use real production user data in load tests
|
||||
- Tag all test-generated records with `load_test=true` for easy cleanup
|
||||
- Run cleanup after each test: `[cleanup command]`
|
||||
|
||||
---
|
||||
|
||||
## 6. Tooling Setup
|
||||
|
||||
### k6 Script Skeleton
|
||||
|
||||
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('error_rate');
const endpointLatency = new Trend('endpoint_latency', true);

// Test configuration: override per scenario
export const options = {
  scenarios: {
    baseline: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '2m', target: [BASELINE_VUS] },
        { duration: '8m', target: [BASELINE_VUS] },
        { duration: '1m', target: 0 },
      ],
    },
  },
  thresholds: {
    http_req_duration: [
      'p(95)<[Y_MS]',
      'p(99)<[Z_MS]',
    ],
    // Use the scenario's error-rate target from Section 2 (e.g. 0.001 for 0.1%)
    error_rate: ['rate<[ERROR_RATE_MAX]'],
    http_req_failed: ['rate<[ERROR_RATE_MAX]'],
  },
};

// Auth helper: setup() runs once per test run, so all VUs share this token
export function setup() {
  const loginRes = http.post('[BASE_URL]/auth/login', JSON.stringify({
    username: `load_test_user_${Math.floor(Math.random() * 10000)}@example.com`,
    password: '[LOAD_TEST_PASSWORD]',
  }), { headers: { 'Content-Type': 'application/json' } });

  check(loginRes, { 'login ok': (r) => r.status === 200 });
  return { token: loginRes.json('access_token') };
}

export default function (data) {
  const headers = {
    Authorization: `Bearer ${data.token}`,
    'Content-Type': 'application/json',
  };

  // Endpoint 1: [Description]
  const res1 = http.get('[BASE_URL]/[endpoint-1]', { headers });
  check(res1, {
    '[endpoint-1] status 200': (r) => r.status === 200,
    '[endpoint-1] latency < [X]ms': (r) => r.timings.duration < [X],
  });
  errorRate.add(res1.status >= 400);
  endpointLatency.add(res1.timings.duration, { endpoint: '[endpoint-1]' });

  // Randomised think time between THINK_TIME_MIN and THINK_TIME_MAX seconds
  sleep(Math.random() * ([THINK_TIME_MAX] - [THINK_TIME_MIN]) + [THINK_TIME_MIN]);

  // Endpoint 2: [Description]
  const res2 = http.post('[BASE_URL]/[endpoint-2]',
    JSON.stringify({ [key]: '[value]' }),
    { headers }
  );
  check(res2, {
    '[endpoint-2] status 201': (r) => r.status === 201,
  });
  errorRate.add(res2.status >= 400);
}
```
|
||||
|
||||
### Locust Script Skeleton (alternative)
|
||||
|
||||
```python
from locust import HttpUser, task, between
import random


class [ServiceName]User(HttpUser):
    wait_time = between([THINK_TIME_MIN], [THINK_TIME_MAX])
    token = None

    def on_start(self):
        """Called once per simulated user: authenticate."""
        user_id = random.randint(1, 10000)
        response = self.client.post("/auth/login", json={
            "username": f"load_test_user_{user_id}@example.com",
            "password": "[LOAD_TEST_PASSWORD]",
        })
        self.token = response.json()["access_token"]
        self.headers = {"Authorization": f"Bearer {self.token}"}

    @task([WEIGHT_1])  # Weight = relative frequency
    def [endpoint_1_task](self):
        """[Endpoint 1 description]"""
        with self.client.get(
            "/[endpoint-1]",
            headers=self.headers,
            catch_response=True
        ) as response:
            if response.elapsed.total_seconds() > [LATENCY_THRESHOLD]:
                response.failure(f"Too slow: {response.elapsed.total_seconds()}s")

    @task([WEIGHT_2])
    def [endpoint_2_task](self):
        """[Endpoint 2 description]"""
        self.client.post(
            "/[endpoint-2]",
            json={"[key]": "[value]"},
            headers=self.headers,
        )
```
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
# Values passed with --env are only visible to the script through __ENV
# (e.g. __ENV.BASE_URL, __ENV.SCENARIO), so the script must read them there.

# k6: run baseline scenario
k6 run --env BASE_URL=https://[test-env-url] scripts/load_test.js

# k6: run stress scenario with output to InfluxDB
k6 run --out influxdb=http://[influxdb-host]:8086/k6 \
  --env SCENARIO=stress \
  scripts/load_test.js

# Locust: headless run
locust -f locustfile.py \
  --headless \
  --users [N] \
  --spawn-rate [N] \
  --run-time 10m \
  --host https://[test-env-url] \
  --csv=results/[run-id]

# Locust: web UI (interactive)
locust -f locustfile.py --host https://[test-env-url]
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Metrics to Capture
|
||||
|
||||
Capture all of the following during every test run. Missing any of these makes result comparison unreliable.
|
||||
|
||||
| Metric | Source | Why it matters |
|---|---|---|
| p50, p95, p99, p99.9 latency per endpoint | Load tool | SLO validation |
| Error rate (4xx, 5xx) per endpoint | Load tool | SLO validation |
| Requests/sec (throughput) | Load tool | Capacity baseline |
| CPU utilisation (%) | Infra monitoring | Saturation signal |
| Memory utilisation (%) | Infra monitoring | Leak detection |
| GC pause time / frequency | JVM/Go metrics | Latency spike root cause |
| DB connection pool: active/idle/waiting | DB metrics | Pool exhaustion detection |
| DB query latency (p99) | DB metrics | Downstream bottleneck |
| Cache hit rate | Cache metrics | Miss storm detection |
| Pod/instance count (if autoscaling) | Infra | Scaling behaviour |
| Network in/out bytes | Infra | Bandwidth saturation |
|
||||
|
||||
---
|
||||
|
||||
## 8. Result Analysis Framework
|
||||
|
||||
After each test run, work through this analysis in order:
|
||||
|
||||
**Step 1 — Pass/fail check**
|
||||
Compare all captured metrics against the thresholds in Section 2. Record pass/fail per scenario.
|
||||
|
||||
**Step 2 — Latency distribution**
|
||||
Plot the full latency histogram, not just percentiles. A bimodal distribution (two humps) usually indicates two distinct code paths, such as cache hit versus cache miss; investigate the slow hump.
|
||||
|
||||
**Step 3 — Error correlation**
|
||||
If errors occurred, correlate them with:
|
||||
- Time of occurrence (was it during ramp-up, steady state, or spike?)
|
||||
- Specific endpoint (is it one endpoint or all?)
|
||||
- Infrastructure events (CPU spike, OOM, DB connection exhaustion?)
|
||||
|
||||
**Step 4 — Saturation analysis**
|
||||
Graph CPU, memory, and connection pool over time. If any resource reached 80%+ of capacity, it is a candidate bottleneck — even if SLOs passed this run.
|
||||
|
||||
**Step 5 — Compare to baseline run**
|
||||
Every run should be compared to the previous run. A 10% regression in p99 latency warrants investigation even if it is still within SLO.
|
||||
|
||||
**Regression classification:**
|
||||
|
||||
| Change | Classification | Action |
|---|---|---|
| p99 within 5% of previous run | Green: no regression | No action |
| p99 5–15% worse than previous | Yellow: watch | Investigate before next release |
| p99 >15% worse than previous | Red: regression | Block release, file ticket |
| Error rate increased vs previous | Red: regression | Block release |
| SLO threshold breached | Critical | Block release, page on-call |
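
The classification table can be applied automatically when comparing two runs. A minimal sketch, assuming p99 latency and error rate are available for the previous and current run; `classify_run` is a hypothetical helper, and treating exactly 5% as Green and exactly 15% as Yellow is an assumption about the table's boundaries.

```python
def classify_run(
    prev_p99_ms: float,
    curr_p99_ms: float,
    prev_error_rate: float,
    curr_error_rate: float,
    slo_p99_ms: float,
    slo_error_rate: float,
) -> str:
    """Map a run-over-run comparison onto the regression classification table."""
    # SLO thresholds are strict ("< Z ms"), so reaching the limit is a breach
    if curr_p99_ms >= slo_p99_ms or curr_error_rate >= slo_error_rate:
        return "Critical"
    if curr_error_rate > prev_error_rate:
        return "Red"
    change_pct = (curr_p99_ms - prev_p99_ms) / prev_p99_ms * 100
    if change_pct > 15:
        return "Red"
    if change_pct > 5:
        return "Yellow"
    return "Green"


# Hypothetical runs: previous p99 was 200 ms against a 400 ms SLO
print(classify_run(200, 204, 0.0, 0.0, 400, 0.001))
print(classify_run(200, 220, 0.0, 0.0, 400, 0.001))
print(classify_run(200, 250, 0.0, 0.0, 400, 0.001))
```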
|
||||
|
||||
---
|
||||
|
||||
## 9. CI Integration
|
||||
|
||||
Add load tests as a gated step in the release pipeline. Run the baseline scenario on every release candidate; run all scenarios weekly.
|
||||
|
||||
```yaml
# Example: GitHub Actions job (adapt for your CI platform)
load-test:
  runs-on: ubuntu-latest
  needs: [deploy-staging]
  if: github.ref == 'refs/heads/main'
  steps:
    - uses: actions/checkout@v4

    - name: Install k6
      # apt-key is deprecated on newer Ubuntu releases; check the k6 install docs
      # for the keyring-based alternative if this step starts failing
      run: |
        curl -s https://dl.k6.io/key.gpg | sudo apt-key add -
        echo "deb https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
        sudo apt-get update && sudo apt-get install -y k6

    - name: Seed test data
      run: [seed command]

    - name: Run baseline load test
      # k6 exits non-zero when any threshold fails, so this step fails the build
      run: |
        k6 run \
          --env BASE_URL="$LOAD_TEST_ENV_URL" \
          --out json=results.json \
          scripts/load_test.js
      env:
        LOAD_TEST_ENV_URL: ${{ secrets.LOAD_TEST_ENV_URL }}

    - name: Check thresholds
      run: |
        # Reached only when the k6 step above exited zero, i.e. every threshold passed
        echo "k6 threshold check complete"

    - name: Upload results
      uses: actions/upload-artifact@v4
      if: always()
      with:
        name: load-test-results-${{ github.run_id }}
        path: results.json

    - name: Cleanup test data
      if: always()
      run: [cleanup command]
```
|
||||
|
||||
**CI gates summary:**
|
||||
- Baseline scenario runs on every release to staging
|
||||
- Full scenario suite (stress, spike, soak) runs weekly on a schedule
|
||||
- Any threshold failure blocks promotion to production
|
||||
- Results are archived for trend analysis
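
For trend analysis, an archived `results.json` can be reduced to one number per run. The sketch below assumes the line-delimited format written by `k6 run --out json`, where each sample is a `Point` record carrying `metric` and `data.value` fields; the helper name and the synthetic sample lines are hypothetical. It uses the nearest-rank method, so values can differ slightly from the percentiles k6 prints in its own summary.

```python
import json
import math


def percentile_from_k6_json(lines, metric="http_req_duration", percentile=99.0):
    """Nearest-rank percentile of one metric from k6 JSON output lines."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") == "Point" and record.get("metric") == metric:
            values.append(record["data"]["value"])
    if not values:
        raise ValueError(f"no samples found for metric {metric!r}")
    values.sort()
    rank = math.ceil(percentile * len(values) / 100)  # 1-based nearest rank
    return values[rank - 1]


# Synthetic samples (1..100 ms) standing in for a real results.json
sample_lines = [
    json.dumps({"type": "Point", "metric": "http_req_duration", "data": {"value": float(v)}})
    for v in range(1, 101)
]
print(percentile_from_k6_json(sample_lines))
```

With a real file, pass the open file handle instead of `sample_lines`, for example `percentile_from_k6_json(open("results.json"))`.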
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] All key endpoints are covered by at least one test scenario — no production endpoint is untested
|
||||
- [ ] Thresholds are derived from actual SLO targets, not guesses
|
||||
- [ ] Test data seeding is scripted and reproducible — tests do not rely on pre-existing environment state
|
||||
- [ ] The load generator runs on separate infrastructure from the service under test
|
||||
- [ ] CI integration blocks promotion on threshold failure — not just records results
|
||||
- [ ] Soak test has been run at least once to establish a memory and connection pool baseline
|
||||
- [ ] Results comparison to previous run is part of the analysis — not just absolute pass/fail
|
||||
@@ -0,0 +1,484 @@
|
||||
---
|
||||
name: local-dev-setup
|
||||
description: "Write a local development environment setup guide for a service or project — covering prerequisites, repository setup, environment variables, local service dependencies, database seeding, running the service, running tests, common gotchas, IDE recommendations, and first-contribution checklist. Use when asked to write a dev setup guide, create onboarding documentation for engineers, document local environment setup, or write a getting-started guide for a codebase. Produces a complete setup guide that a new engineer can follow from zero to running tests in under 30 minutes, with a troubleshooting section for the most common setup failures."
|
||||
---
|
||||
|
||||
# Local Dev Setup Skill
|
||||
|
||||
Produce a complete local development environment setup guide for a service or project — walking a new engineer from zero (a clean laptop) to a working local environment with passing tests in under 30 minutes. A good setup guide reduces onboarding time, prevents the "it works on my machine" problem, and lets engineers make their first contribution with confidence. Write every step as a concrete command or action — not a description of what needs to happen.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does
|
||||
- **Tech stack** — language, framework, database, cache, message queue, and any external services
|
||||
- **Dependencies** — databases, caches, message queues, and external services (mocked or real)
|
||||
- **Test framework** — how tests are run and what the test suite covers
|
||||
- **CI/CD platform** — GitHub Actions, CircleCI, Jenkins, etc. (for context on what "passing CI" means locally)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Local Development Setup: [Service Name]
|
||||
|
||||
**Tech stack:** [Language + version] | [Framework] | [Database] | [Cache]
|
||||
**Estimated setup time:** [20–30 minutes] on a clean machine
|
||||
**Last verified:** [Date] on [macOS Ventura 13.x / Ubuntu 22.04]
|
||||
**Questions?** Ask in [Slack: #[team-channel]] or ping [@tech-lead-handle]
|
||||
|
||||
> **First contribution?** Complete setup first (this doc), then read [CONTRIBUTING.md] for code standards and PR process.
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Install these tools before starting. The versions listed are the minimum required. Newer minor and patch releases are fine; newer major versions may introduce compatibility issues.
|
||||
|
||||
### Required Tools
|
||||
|
||||
| Tool | Required version | Install |
|
||||
|---|---|---|
|
||||
| [Git] | 2.x+ | Pre-installed on most systems; or `brew install git` |
|
||||
| [Language runtime — e.g. Go] | [1.22+] | [https://go.dev/dl/ or `brew install go`] |
|
||||
| [Docker] | 24.x+ | [https://docs.docker.com/get-docker/] |
|
||||
| [Docker Compose] | 2.x+ | Included with Docker Desktop; or `brew install docker-compose` |
|
||||
| [Make] | Any | Pre-installed on most Linux distributions; on macOS it ships with the Xcode Command Line Tools (`xcode-select --install`) |
|
||||
| [Tool — e.g. Node.js] | [20.x+] | [`brew install node` or https://nodejs.org] |
|
||||
| [Tool — e.g. psql client] | [15+] | `brew install postgresql@15` (client only) |
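
The minimum versions in the table above can be checked mechanically rather than by eye. The following is a minimal Python sketch under stated assumptions: the helper names `extract_version` and `meets_minimum` are hypothetical, and it assumes each tool prints a dotted numeric version somewhere in its version output.

```python
import re


def extract_version(output: str) -> tuple[int, ...]:
    """Return the first numeric version (e.g. 1.22.3) found in a tool's output."""
    match = re.search(r"(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions component by component, padding the shorter with zeros."""
    have = extract_version(installed)
    need = extract_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need
```

In practice the `installed` string would come from running the tool, for example `subprocess.run(["go", "version"], capture_output=True, text=True).stdout`.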
|
||||
|
||||
### Optional but Recommended
|
||||
|
||||
| Tool | Purpose | Install |
|
||||
|---|---|---|
|
||||
| [direnv] | Auto-load `.envrc` environment variables | `brew install direnv` + [setup instructions](https://direnv.net) |
|
||||
| [jq] | Pretty-print JSON in terminal | `brew install jq` |
|
||||
| [k9s] | Kubernetes cluster UI (if using K8s locally) | `brew install k9s` |
|
||||
| [mkcert] | Local HTTPS certificates | `brew install mkcert` |
|
||||
|
||||
### Required Accounts and Access
|
||||
|
||||
Before starting, make sure you have:
|
||||
- [ ] GitHub access to [org/repo] — request via [access request process / Slack: #it-help]
|
||||
- [ ] [AWS / GCP / Azure] account with [dev environment] access — request via [process]
|
||||
- [ ] [Internal tool — e.g. 1Password] for retrieving development secrets — request via [process]
|
||||
- [ ] [VPN access] if required to reach internal services — request via [process]
|
||||
|
||||
---
|
||||
|
||||
## 1. Repository Setup
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone git@github.com:[org]/[repo-name].git
|
||||
cd [repo-name]
|
||||
|
||||
# Install git hooks (required — enforces commit message format and runs pre-commit checks)
|
||||
make install-hooks
|
||||
# Or manually:
|
||||
# cp scripts/hooks/pre-commit .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit
|
||||
|
||||
# Verify your git setup
|
||||
git config user.name # should be your name
|
||||
git config user.email # should be your work email
|
||||
```
|
||||
|
||||
**If you see a permission denied error on clone:** Your SSH key is not added to GitHub. Follow [GitHub's SSH key guide](https://docs.github.com/en/authentication/connecting-to-github-with-ssh) or use HTTPS with a personal access token instead.
|
||||
|
||||
---
|
||||
|
||||
## 2. Environment Variables
|
||||
|
||||
The service requires environment variables for configuration. **Never commit actual secrets to the repository.**
|
||||
|
||||
### Step 1 — Copy the example file
|
||||
|
||||
```bash
|
||||
cp .env.example .env.local
|
||||
```
|
||||
|
||||
### Step 2 — Fill in the values
|
||||
|
||||
Open `.env.local` in your editor. Below is a description of every variable and where to get its value:
|
||||
|
||||
| Variable | Description | Where to get it | Example (not real) |
|
||||
|---|---|---|---|
|
||||
| `APP_ENV` | Environment name | Set to `development` | `development` |
|
||||
| `APP_PORT` | Port the service listens on | Set to `8080` for local | `8080` |
|
||||
| `DATABASE_URL` | PostgreSQL connection string | Use value from Docker Compose (Section 3) | `postgres://app:password@localhost:5432/[service]_dev` |
|
||||
| `REDIS_URL` | Redis connection string | Use value from Docker Compose | `redis://localhost:6379` |
|
||||
| `SECRET_KEY` | Application secret key | Generate with: `openssl rand -hex 32` | `[random 64-char hex]` |
|
||||
| `[EXTERNAL_SERVICE]_API_KEY` | API key for [External Service] | Retrieve from [1Password vault: "Dev API Keys"] or ask [name] | — |
|
||||
| `[EXTERNAL_SERVICE]_BASE_URL` | Base URL for [External Service] | Use sandbox URL: `https://sandbox.[external-service].com` | `https://sandbox.stripe.com` |
|
||||
| `LOG_LEVEL` | Logging verbosity | Set to `debug` for local development | `debug` |
|
||||
| `[FEATURE_FLAG_SDK_KEY]` | Feature flag platform SDK key | Retrieve from [LaunchDarkly/Split dev project] | — |
|
||||
|
||||
**Using direnv (recommended):** Create an `.envrc` file containing the single line `dotenv .env.local`, then run `direnv allow`. Variables will load automatically when you `cd` into the project.
|
||||
|
||||
---
|
||||
|
||||
## 3. Local Service Dependencies
|
||||
|
||||
All infrastructure dependencies run in Docker Compose. You do not need to install PostgreSQL, Redis, or Kafka locally.
|
||||
|
||||
```bash
|
||||
# Start all dependencies (PostgreSQL, Redis, and any other services)
|
||||
docker compose up -d
|
||||
|
||||
# Verify all containers are healthy
|
||||
docker compose ps
|
||||
# Expected output: every service is "Up"; services with a healthcheck also show "(healthy)"
|
||||
|
||||
# View logs if something is not healthy
|
||||
docker compose logs [service-name]
|
||||
```
|
||||
|
||||
### What Docker Compose Starts
|
||||
|
||||
| Service | Port | Purpose | Health check |
|
||||
|---|---|---|---|
|
||||
| PostgreSQL [version] | `5432` | Primary database | `pg_isready -U app` |
|
||||
| Redis [version] | `6379` | Cache and session store | `redis-cli ping` |
|
||||
| [Kafka + Zookeeper] | `9092` / `2181` | Message queue | `kafka-topics.sh --bootstrap-server localhost:9092 --list` |
|
||||
| [Mock server — e.g. WireMock] | `8089` | Mocks for external APIs in tests | `curl localhost:8089/__admin/mappings` |
|
||||
| [LocalStack] | `4566` | AWS service emulation (S3, SQS, etc.) | `aws --endpoint-url=http://localhost:4566 s3 ls` |
|
||||
|
||||
**If a container exits immediately:** See Troubleshooting section — common causes are port conflicts and Docker memory limits.
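
The health verification can also be scripted so the remaining setup steps wait for dependencies. The sketch below is illustrative only: it assumes the output of `docker compose ps --format json` is passed in as text, and the field names `Service`, `State`, and `Health` are assumptions to verify against your installed Compose version. The function name `unhealthy_services` is hypothetical.

```python
import json


def unhealthy_services(ps_output: str) -> list[str]:
    """Return names of services that are not running or not reporting healthy.

    Accepts either a single JSON array or one JSON object per line, since the
    output shape is assumed to differ between Compose releases.
    """
    text = ps_output.strip()
    if not text:
        return []
    if text.startswith("["):
        containers = json.loads(text)
    else:
        containers = [json.loads(line) for line in text.splitlines() if line.strip()]

    problems = []
    for container in containers:
        state = container.get("State", "")
        health = container.get("Health", "")  # empty when no healthcheck is defined
        if state != "running" or health not in ("", "healthy"):
            problems.append(container.get("Service", "<unknown>"))
    return sorted(problems)
```

An empty result means every listed container is running and none reports an unhealthy or still-starting healthcheck.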
|
||||
|
||||
### Stopping Dependencies
|
||||
|
||||
```bash
|
||||
# Stop containers (preserves data volumes)
|
||||
docker compose stop
|
||||
|
||||
# Stop and remove containers and volumes (-v deletes all local data; use when you want a fresh start)
|
||||
docker compose down -v
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Install Dependencies and Build
|
||||
|
||||
```bash
|
||||
# Install language dependencies
|
||||
# Go:
|
||||
go mod download
|
||||
|
||||
# Node.js:
|
||||
npm install # or: yarn install / pnpm install
|
||||
|
||||
# Python:
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
||||
pip install -r requirements-dev.txt
|
||||
|
||||
# Verify build compiles cleanly
|
||||
make build
|
||||
# Expected: no errors; binary or compiled output in [./bin/ or ./dist/]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Database Setup and Seeding
|
||||
|
||||
```bash
|
||||
# Run database migrations (creates tables and schema)
|
||||
make db-migrate
|
||||
# Or directly:
|
||||
# [Migration command — e.g. "go run ./cmd/migrate up" or "alembic upgrade head" or "npm run db:migrate"]
|
||||
|
||||
# Verify migrations applied
|
||||
# psql $DATABASE_URL -c "\dt" # should list all tables
|
||||
|
||||
# Seed the database with development data
|
||||
make db-seed
|
||||
# Or directly:
|
||||
# [Seed command — e.g. "go run ./cmd/seed" or "python scripts/seed.py" or "npm run db:seed"]
|
||||
|
||||
# Verify seed data is present
|
||||
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [primary-table]"
|
||||
# Expected: [N] rows
|
||||
```
|
||||
|
||||
**What the seed creates:**
|
||||
- [N] test user accounts (credentials in [scripts/seed/README.md or .env.example])
|
||||
- [N] sample [resources] for development and testing
|
||||
- Admin account: `[admin@example.com]` / password: see `.env.example` for dev password variable
|
||||
|
||||
**To reset to a clean state:**
|
||||
```bash
|
||||
docker compose down -v # wipe database volume
|
||||
docker compose up -d # start fresh
|
||||
make db-migrate
|
||||
make db-seed
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Running the Service
|
||||
|
||||
```bash
|
||||
# Run the service locally
|
||||
make run
|
||||
# Or directly:
|
||||
# [Run command — e.g. "go run ./cmd/server" or "python app.py" or "npm run dev"]
|
||||
|
||||
# Expected output:
|
||||
# [Example of healthy startup log lines — e.g.:]
|
||||
# {"level":"info","message":"Database connected","host":"localhost","port":5432}
|
||||
# {"level":"info","message":"Redis connected","host":"localhost","port":6379}
|
||||
# {"level":"info","message":"Server listening","port":8080}
|
||||
```
|
||||
|
||||
### Verify It's Working
|
||||
|
||||
```bash
|
||||
# Health check
|
||||
curl http://localhost:8080/health
|
||||
# Expected: {"status":"ok","version":"[git-sha]"}
|
||||
|
||||
# Test a key endpoint (authenticated)
|
||||
# First, get a dev token:
|
||||
curl -X POST http://localhost:8080/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"[dev-user-from-seed]@example.com","password":"[dev-password-from-env]"}'
|
||||
# Copy the token from the response, then:
|
||||
|
||||
curl http://localhost:8080/api/v1/[resource] \
  -H "Authorization: Bearer [token-from-above]"
|
||||
# Expected: 200 with JSON response
|
||||
```
|
||||
|
||||
### Hot Reload (for Development)
|
||||
|
||||
```bash
|
||||
# Run with hot reload — service restarts automatically on file changes
|
||||
make run-dev
|
||||
# Or:
|
||||
# [Hot reload command — e.g. "air" for Go / "uvicorn --reload" for Python / "npm run dev" for Node]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Running Tests
|
||||
|
||||
```bash
|
||||
# Run the full test suite
|
||||
make test
|
||||
# Or:
|
||||
# [Test command — e.g. "go test ./..." or "pytest" or "npm test"]
|
||||
|
||||
# Run tests with coverage report
|
||||
make test-coverage
|
||||
# Coverage report: [./coverage.html or stdout]
|
||||
|
||||
# Run a specific test file or test case
|
||||
# Go: go test ./pkg/[package]/... -run TestFunctionName
|
||||
# Python: pytest tests/test_[module].py::TestClass::test_method -v
|
||||
# Node: npm test -- --testPathPattern=[filename]
|
||||
|
||||
# Run only unit tests (fast — no external dependencies)
|
||||
make test-unit
|
||||
|
||||
# Run only integration tests (requires Docker Compose dependencies running)
|
||||
make test-integration
|
||||
```
|
||||
|
||||
**Expected test results:**
|
||||
- Unit tests: [N] tests, all pass, [<30] seconds
|
||||
- Integration tests: [N] tests, all pass, [<2] minutes
|
||||
- Coverage: [≥80]% (enforced in CI; the build fails below this threshold)
|
||||
|
||||
**Before pushing a PR, always run:**
|
||||
```bash
|
||||
make lint # code linting — must pass
|
||||
make test # full test suite — must pass
|
||||
make build # verify compilation — must pass
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. IDE Setup
|
||||
|
||||
### VS Code (Recommended)
|
||||
|
||||
Install the recommended extensions (VS Code will prompt you automatically):
|
||||
|
||||
```json
|
||||
// .vscode/extensions.json — already in the repository
|
||||
{
  "recommendations": [
    "[language-extension — e.g. golang.go]",
    "dbaeumer.vscode-eslint",
    "esbenp.prettier-vscode",
    "ms-azuretools.vscode-docker",
    "eamodio.gitlens"
  ]
}
|
||||
```
|
||||
|
||||
Workspace settings are in `.vscode/settings.json` — format on save is enabled, linter is configured automatically.
|
||||
|
||||
**[Language]-specific setup:**
|
||||
```
|
||||
[e.g. Go: The gopls language server is installed automatically by the Go extension.
|
||||
Run "Go: Install/Update Tools" from the command palette after installing the extension.]
|
||||
```
|
||||
|
||||
### JetBrains (IntelliJ / GoLand / PyCharm / WebStorm)
|
||||
|
||||
- Open the project root as the project directory
|
||||
- [Language SDK]: set to [version] — File → Project Structure → SDKs
|
||||
- Run configurations are checked into `.idea/runConfigurations/` — they appear automatically
|
||||
- Enable "Run formatters on save" in Settings → Tools → Actions on Save
|
||||
|
||||
---
|
||||
|
||||
## 9. Common Gotchas and Troubleshooting
|
||||
|
||||
### Docker container exits immediately on startup
|
||||
|
||||
**Symptom:** `docker compose ps` shows a container as `Exited (1)` seconds after starting.
|
||||
|
||||
```bash
|
||||
# Check the container logs for the error
|
||||
docker compose logs [container-name]
|
||||
|
||||
# Common causes:
|
||||
# 1. Port already in use — find and kill the conflicting process:
|
||||
lsof -ti tcp:[port] | xargs kill -9
|
||||
|
||||
# 2. Docker doesn't have enough memory — allocate at least 4GB in Docker Desktop:
|
||||
# Docker Desktop → Settings → Resources → Memory → 4GB
|
||||
|
||||
# 3. Apple Silicon Mac (M1/M2 and later) architecture mismatch: add a platform directive to the affected service in docker-compose.yml:
|
||||
# platform: linux/amd64
|
||||
```
|
||||
|
||||
### Database connection refused
|
||||
|
||||
**Symptom:** Service fails to start with "connection refused" or "dial tcp localhost:5432: connect: connection refused"
|
||||
|
||||
```bash
|
||||
# Is PostgreSQL actually running?
|
||||
docker compose ps postgres
|
||||
# If not running: docker compose up -d postgres
|
||||
|
||||
# Is it on the right port?
|
||||
lsof -i :5432
|
||||
|
||||
# Can you connect manually?
|
||||
psql postgres://app:password@localhost:5432/[service]_dev -c "SELECT 1"
|
||||
|
||||
# If using a custom DATABASE_URL, verify it matches the docker-compose.yml settings exactly
|
||||
```
|
||||
|
||||
### Migrations fail with "relation already exists"
|
||||
|
||||
**Symptom:** `make db-migrate` errors with "ERROR: relation [table] already exists"
|
||||
|
||||
```bash
|
||||
# Check current migration state
|
||||
[migration status command — e.g. "go run ./cmd/migrate status" or "alembic current"]
|
||||
|
||||
# The database may be in a partial state — reset it:
|
||||
docker compose down -v
|
||||
docker compose up -d
|
||||
make db-migrate # should now succeed on a clean database
|
||||
```
|
||||
|
||||
### Tests fail with "connection refused" or dependency errors
|
||||
|
||||
**Symptom:** Integration tests fail because they cannot connect to PostgreSQL or Redis.
|
||||
|
||||
```bash
|
||||
# Integration tests need Docker Compose running
|
||||
docker compose up -d
|
||||
|
||||
# Verify all containers are healthy before running tests
|
||||
docker compose ps # all should show "healthy"
|
||||
|
||||
# If containers are running but tests still fail, check environment variables:
|
||||
make test-integration # should pick up .env.local automatically
|
||||
# If not (plain `source` sets variables without exporting them): set -a && source .env.local && set +a && make test-integration
|
||||
```
|
||||
|
||||
### `make lint` fails on a fresh checkout
|
||||
|
||||
**Symptom:** Lint errors on files you have not modified.
|
||||
|
||||
```bash
|
||||
# Formatting issue — auto-fix with:
|
||||
# Go:
|
||||
gofmt -w .
|
||||
goimports -w .
|
||||
|
||||
# Python:
|
||||
black .
|
||||
isort .
|
||||
|
||||
# Node/TypeScript:
|
||||
npm run lint:fix
|
||||
# Or: npx eslint --fix . && npx prettier --write .
|
||||
|
||||
# Re-run lint to confirm
|
||||
make lint
|
||||
```
|
||||
|
||||
### Environment variables not loading
|
||||
|
||||
**Symptom:** Service starts but immediately fails with "missing required environment variable: [VAR]"
|
||||
|
||||
```bash
|
||||
# Verify .env.local exists and list the variables it defines
grep "^[A-Z_][A-Z0-9_]*=" .env.local | cut -d= -f1

# Compare against required variables in .env.example
diff <(grep "^[A-Z_][A-Z0-9_]*=" .env.example | cut -d= -f1 | sort) \
     <(grep "^[A-Z_][A-Z0-9_]*=" .env.local | cut -d= -f1 | sort)

# Variables missing from .env.local appear on lines prefixed with "<"
|
||||
```
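
For teams that prefer a scripted check over the shell comparison above, the same comparison can be sketched in Python. This is an illustration under stated assumptions, not project tooling: the function names `env_keys` and `missing_keys` are hypothetical, and the parser only handles simple `KEY=value` lines (optionally prefixed with `export`).

```python
import re

_ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=")


def env_keys(text: str) -> set[str]:
    """Collect variable names from dotenv-style text, skipping comments and blanks."""
    keys = set()
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        match = _ASSIGNMENT.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keys(example_text: str, local_text: str) -> list[str]:
    """Names defined in the example file but absent from the local file."""
    return sorted(env_keys(example_text) - env_keys(local_text))
```

Called with the contents of `.env.example` and `.env.local`, `missing_keys` returns the variables that still need a value.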
|
||||
|
||||
---
|
||||
|
||||
## 10. First Contribution Checklist
|
||||
|
||||
Before opening your first pull request, verify:
|
||||
|
||||
**Setup complete:**
|
||||
- [ ] `make build` passes with no errors
|
||||
- [ ] `make test` passes — all tests green
|
||||
- [ ] `make lint` passes — no lint errors
|
||||
- [ ] Service starts and health check returns 200
|
||||
- [ ] You can authenticate and call at least one API endpoint
|
||||
|
||||
**Git and GitHub:**
|
||||
- [ ] You have read [CONTRIBUTING.md] — code standards, commit message format, PR process
|
||||
- [ ] Your git user.name and user.email are set correctly
|
||||
- [ ] Pre-commit hooks are installed (`ls .git/hooks/pre-commit` should print the file path)
|
||||
- [ ] You have branched from `main` (not committing directly to main)
|
||||
|
||||
**Development workflow:**
|
||||
- [ ] You know how to run a specific test: `[test command for single test]`
|
||||
- [ ] You know how to reset the database: `docker compose down -v && docker compose up -d && make db-migrate && make db-seed`
|
||||
- [ ] You have joined [Slack: #[team-channel]] and [#[service-consumers-channel] if applicable]
|
||||
- [ ] You have read the [architecture overview doc / README] — you understand what this service does
|
||||
|
||||
**First PR:**
|
||||
- [ ] Changes are small and focused — one logical change per PR
|
||||
- [ ] Tests are added or updated for your change
|
||||
- [ ] `make test && make lint && make build` all pass locally before requesting review
|
||||
- [ ] PR description explains what changed and why (use the [pr-description-writer skill] if needed)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] A new engineer with no prior knowledge of the project can follow this guide from start to finish without asking anyone for help
|
||||
- [ ] Every command is tested on a clean environment — not written from memory and assumed to work
|
||||
- [ ] Environment variables table covers every variable in `.env.example` — no undocumented variables
|
||||
- [ ] The troubleshooting section covers the 5 most common real failures observed during onboarding — not theoretical issues
|
||||
- [ ] Docker Compose version and Docker Desktop memory requirements are stated explicitly
|
||||
- [ ] "Expected output" is shown for key commands so engineers know whether a step succeeded
|
||||
- [ ] Setup time estimate is honest — verified by timing a real onboarding session, not estimated
|
||||
@@ -0,0 +1,290 @@
|
||||
---
|
||||
name: microservices-decomposition
|
||||
description: "Design a microservices decomposition for a monolith or new system, defining service boundaries, ownership, communication patterns, and migration plan. Use when asked to decompose a monolith, define service boundaries, design a microservices architecture, or plan a strangler-fig migration. Produces a bounded context map, service inventory table, communication pattern decisions, data ownership matrix, migration roadmap, and risk register."
|
||||
---
|
||||
|
||||
# Microservices Decomposition
|
||||
|
||||
Produce a complete microservices decomposition design for a system — whether decomposing an existing monolith or designing service boundaries for a new system. Ground the decomposition in Domain-Driven Design (DDD) concepts: identify bounded contexts first, then derive service boundaries from them. Include communication pattern decisions (sync vs. async, event vs. RPC), data ownership rules, and a pragmatic migration plan if decomposing a monolith. Conway's Law is real — include an organizational alignment section. The deliverable should be specific enough that a team can begin implementation, not an abstract architectural diagram.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **System or domain description** — what the system does, its core domain, and the key business processes it supports
|
||||
- **Current architecture** — monolith (describe the tech stack and rough module structure), partial services (list existing services), or greenfield
|
||||
- **Team structure** — number of teams, team names if known, and approximate team sizes; this drives service ownership
|
||||
- **Performance and scalability requirements** — any specific SLAs, load characteristics, or scaling constraints per domain area
|
||||
- **Migration constraints** — what cannot be rewritten all at once, hard deadlines, zero-downtime requirements, budget constraints
|
||||
- **Integration points** — external systems, third-party APIs, or legacy systems that cannot be changed
|
||||
|
||||
If decomposing a monolith, also ask for: approximate codebase size, what is most painful to change today, and where the team experiences the most coupling-related friction.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Microservices Decomposition: [System Name]
|
||||
|
||||
**Author:** [Name / Team]
|
||||
**Date:** [Date]
|
||||
**Architecture type:** [Monolith decomposition / New system design]
|
||||
**Current state:** [One sentence describing what exists today]
|
||||
**Target state:** [One sentence describing the desired end state]
|
||||
|
||||
---
|
||||
|
||||
## 1. Domain Analysis
|
||||
|
||||
### Core Domain
|
||||
|
||||
[One paragraph: what is the core domain of this system? What does the business fundamentally do? What gives it competitive differentiation? The core domain gets the most investment and the cleanest service boundaries.]
|
||||
|
||||
### Domain Map
|
||||
|
||||
List every significant subdomain before assigning service boundaries. Classify each subdomain:
|
||||
|
||||
| Subdomain | Type | Description | Current Location in Monolith |
|
||||
|-----------|------|-------------|------------------------------|
|
||||
| [Subdomain, e.g., Order Management] | Core | [What it does and why it matters] | [Module/package name or "new"] |
|
||||
| [Subdomain, e.g., Inventory] | Core | [Description] | [Location] |
|
||||
| [Subdomain, e.g., Notifications] | Supporting | [Description] | [Location] |
|
||||
| [Subdomain, e.g., Billing] | Supporting | [Description] | [Location] |
|
||||
| [Subdomain, e.g., Reporting] | Generic | [Description — candidates for off-the-shelf solutions] | [Location] |
|
||||
| [Subdomain, e.g., User Auth] | Generic | [Description] | [Location] |
|
||||
|
||||
**Subdomain types:** Core = competitive differentiation, build with care; Supporting = necessary but not differentiating, build pragmatically; Generic = commodity, buy or use open source.
|
||||
|
||||
---
|
||||
|
||||
## 2. Bounded Context Map (ASCII)
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
│                          [System Name]                          │
│                                                                 │
│  ┌──────────────────┐    ┌──────────────────┐                   │
│  │  [Context A]     │    │  [Context B]     │                   │
│  │                  │─ ─►│                  │                   │
│  │  [key concepts]  │    │  [key concepts]  │                   │
│  └──────────────────┘    └──────────────────┘                   │
│           │                       │                             │
│           │ event                 │ sync                        │
│           ▼                       ▼                             │
│  ┌──────────────────┐    ┌──────────────────┐                   │
│  │  [Context C]     │    │  [Context D]     │                   │
│  │                  │    │                  │                   │
│  │  [key concepts]  │    │  [key concepts]  │                   │
│  └──────────────────┘    └──────────────────┘                   │
│                                   │                             │
│                          ┌────────┘                             │
│                          ▼                                      │
│                 ┌──────────────────┐                            │
│                 │  [Context E]     │                            │
│                 │  [key concepts]  │                            │
│                 └──────────────────┘                            │
│                                                                 │
│  External: [Third-party system] ──► [Context that owns it]      │
└─────────────────────────────────────────────────────────────────┘

Legend: ──► sync call   ─ ─► async event   ═══ shared kernel
|
||||
```
|
||||
|
||||
Render this map using the actual bounded contexts derived from the domain analysis. Place contexts that communicate frequently closer together. Label relationship types on arrows.
|
||||
|
||||
### Context Relationships
|
||||
|
||||
| Upstream Context | Downstream Context | Relationship Type | Integration Pattern |
|
||||
|-----------------|-------------------|------------------|---------------------|
|
||||
| [Context A] | [Context B] | Customer-Supplier | REST API call |
|
||||
| [Context B] | [Context C] | Published Language | Domain events via message bus |
|
||||
| [Context X] | [Context Y] | Conformist | [Downstream conforms to upstream's model] |
|
||||
| [Context X] | [Context Y] | Anti-Corruption Layer | [ACL translates upstream model to local model] |
|
||||
|
||||
---
|
||||
|
||||
## 3. Proposed Service Inventory
|
||||
|
||||
| Service Name | Bounded Context | Core Responsibility | Team Owner | Tech Stack | Priority |
|
||||
|-------------|----------------|--------------------|-----------|-----------|---------|
|
||||
| [service-name] | [Context] | [One sentence: what this service owns and does] | [Team] | [Language/framework] | [P1/P2/P3] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
| [service-name] | [Context] | [Responsibility] | [Team] | [Stack] | [Priority] |
|
||||
|
||||
**Service count:** [N proposed services] for [M bounded contexts]. [Note if any context maps to multiple services and why — e.g., "the Orders context splits into order-intake and order-fulfillment because they have different scalability requirements."]
|
||||
|
||||
### Service Responsibility Rules (applied to every service above)
|
||||
|
||||
- Single bounded context ownership — a service does not straddle two bounded contexts
|
||||
- Owns its own data — no direct database access by other services
|
||||
- Independently deployable — no coordinated deploys required with other services
|
||||
- Has a named team owner — no shared ownership of a single service across teams
|
||||
- Exposes a defined API contract — not internal implementation
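
Some of these rules can be checked mechanically against the inventory table. The sketch below is a hypothetical illustration: the `Service` dataclass and `inventory_problems` function are not part of any library, and treating `/` or `,` in the owner column as a sign of shared ownership is an assumed convention.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    bounded_context: str
    owner_team: str


def inventory_problems(services: list[Service]) -> list[str]:
    """Flag inventory rows that break the naming or ownership rules."""
    problems = []
    name_counts = Counter(service.name for service in services)
    for name, count in sorted(name_counts.items()):
        if count > 1:
            problems.append(f"{name}: listed {count} times")
    for service in services:
        if not service.owner_team.strip():
            problems.append(f"{service.name}: no owner team")
        if "/" in service.owner_team or "," in service.owner_team:
            problems.append(f"{service.name}: shared ownership ({service.owner_team})")
        if not service.bounded_context.strip():
            problems.append(f"{service.name}: no bounded context")
    return problems
```

Rules such as independent deployability cannot be verified from the table alone and still need review.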
|
||||
|
||||
---
|
||||
|
||||
## 4. Inter-Service Communication Patterns
|
||||
|
||||
### Pattern Decision Matrix
|
||||
|
||||
| Communication Need | Recommended Pattern | Rationale |
|
||||
|-------------------|--------------------|-----------|
|
||||
| Query another service's current state | Synchronous REST / gRPC | Low latency required; caller needs immediate response |
|
||||
| Notify other services of a state change | Async domain event | Decouples services; multiple consumers; sender doesn't care when it's processed |
|
||||
| Long-running workflow spanning services | Async saga (choreography or orchestration) | No single service owns the full workflow; rollback needed if steps fail |
|
||||
| Read-heavy cross-service aggregation | CQRS read model / materialized view | Avoid chatty sync calls at read time; build purpose-fit read models |
|
||||
| Real-time push to clients | WebSocket gateway service | Centralizes connection management; services emit events, gateway pushes |
|
||||
|
||||
### Per-Service Communication Decisions
|
||||
|
||||
| Service | Calls (sync) | Publishes (events) | Subscribes to (events) |
|
||||
|---------|-------------|-------------------|----------------------|
|
||||
| [service-name] | [service-name (endpoint)] | [EventName] | [EventName] |
|
||||
| [service-name] | — | [EventName], [EventName] | [EventName] |
|
||||
| [service-name] | [service-name (endpoint)] | — | [EventName] |
|
||||
|
||||
### Event Catalog
|
||||
|
||||
| Event Name | Producer | Consumers | Payload (key fields) | Trigger |
|
||||
|-----------|---------|---------|---------------------|---------|
|
||||
| [OrderPlaced] | [order-service] | [inventory-service, notification-service] | `orderId, customerId, lineItems, totalAmount` | Customer submits order |
|
||||
| [InventoryReserved] | [inventory-service] | [order-service] | `orderId, reservationId, items` | Inventory successfully reserved |
|
||||
| [PaymentProcessed] | [payment-service] | [order-service, notification-service] | `orderId, paymentId, amount, status` | Payment confirmed |
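
To make a catalog row concrete, the `OrderPlaced` payload could be modeled as a typed event. This is a sketch under stated assumptions: the key fields come from the catalog row, while the `LineItem` shape (`sku`, `quantity`), the envelope fields (`eventId`, `occurredAt`, `eventType`, `schemaVersion`), and the `to_message` helper are hypothetical additions for illustration.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int


@dataclass(frozen=True)
class OrderPlaced:
    # Key fields from the event catalog row.
    orderId: str
    customerId: str
    lineItems: list[LineItem]
    totalAmount: str  # decimal string to avoid float rounding (assumption)
    # Envelope metadata: not in the catalog, added here as an assumption.
    eventId: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurredAt: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    eventType: str = "OrderPlaced"
    schemaVersion: int = 1


def to_message(event: OrderPlaced) -> bytes:
    """Serialize the event as UTF-8 JSON for publishing to a message bus."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")
```

A unique event ID and an explicit schema version are common choices because they let consumers deduplicate redelivered messages and handle payload changes; whether to include them is a design decision for the team.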
|
||||
|
||||
---
|
||||
|
||||
## 5. Data Ownership Matrix
|
||||
|
||||
Each piece of data has exactly one owning service. Other services may cache or project a read model, but they do not write to the owner's database.
|
||||
|
||||
| Data Entity | Owner Service | Authoritative Store | Consumers | Access Pattern |
|
||||
|-------------|--------------|--------------------|-----------|----------------|
|
||||
| [Order] | [order-service] | [PostgreSQL] | [fulfillment-service, reporting-service] | Event subscription + read API |
|
||||
| [Customer] | [customer-service] | [PostgreSQL] | [order-service, notification-service] | Sync API call |
|
||||
| [Product Catalog] | [catalog-service] | [PostgreSQL] | [order-service, inventory-service] | Sync API + cached local copy |
|
||||
| [Inventory Level] | [inventory-service] | [Redis + PostgreSQL] | [catalog-service (read only)] | Event subscription |
|
||||
| [Payment Record] | [payment-service] | [PostgreSQL] | [order-service] | Event subscription |
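
The single-owner rule lends itself to an automated check. The sketch below assumes a list of observed `(service, entity)` writes is available from somewhere such as database grants or code review; how that list is gathered is outside this example. The `OWNERS` mapping mirrors the example rows above, and `ownership_violations` is a hypothetical helper name.

```python
OWNERS = {
    "Order": "order-service",
    "Customer": "customer-service",
    "Product Catalog": "catalog-service",
    "Inventory Level": "inventory-service",
    "Payment Record": "payment-service",
}


def ownership_violations(
    writes: list[tuple[str, str]], owners: dict[str, str] = OWNERS
) -> list[str]:
    """Return a message for every (service, entity) write made by a non-owner."""
    violations = []
    for service, entity in writes:
        owner = owners.get(entity)
        if owner is None:
            violations.append(f"{entity}: no owner declared")
        elif owner != service:
            violations.append(f"{service} writes {entity}, owned by {owner}")
    return violations
```

An empty list means every observed write came from the declared owner.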
|
||||
|
||||
### Data Migration (if decomposing a monolith)
|
||||
|
||||
| Data Entity | Current Location | Target Service | Migration Approach | Data Volume | Risk |
|
||||
|-------------|-----------------|---------------|-------------------|-------------|------|
|
||||
| [Entity] | [monolith.orders table] | [order-service] | Dual-write then cut over | [X rows] | [High/Med/Low] |
|
||||
| [Entity] | [monolith.users table] | [customer-service] | Extract and sync via CDC | [X rows] | [High/Med/Low] |
|
||||
|
||||
---
|
||||
|
||||
## 6. API Contract Definitions
|
||||
|
||||
Define the surface area for each service. Full OpenAPI specs are written separately; this section establishes the contract boundaries.
|
||||
|
||||
### [service-name] API
|
||||
|
||||
**Base path:** `/api/v1/[resource]`
|
||||
**Owner team:** [Team]
|
||||
**SLA:** [p99 latency target, availability target]
|
||||
|
||||
| Endpoint | Method | Description | Auth Required | Rate Limit |
|
||||
|----------|--------|-------------|--------------|------------|
|
||||
| `/[resources]` | GET | List [resources] with pagination | Yes | [X req/min] |
|
||||
| `/[resources]/{id}` | GET | Get single [resource] by ID | Yes | [X req/min] |
|
||||
| `/[resources]` | POST | Create new [resource] | Yes | [X req/min] |
|
||||
| `/[resources]/{id}` | PUT | Update [resource] | Yes | [X req/min] |
|
||||
| `/[resources]/{id}` | DELETE | Soft-delete [resource] | Yes — elevated | [X req/min] |
|
||||
|
||||
[Repeat for each service.]
|
||||
|
||||
---
|
||||
|
||||
## 7. Strangler Fig Migration Plan (for monolith decomposition)
|
||||
|
||||
Use the strangler fig pattern: extract services incrementally, route traffic through a facade, and retire monolith modules one at a time.
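
The facade's routing decision can be sketched as a small function. This is an illustration rather than a gateway implementation: in practice the logic usually lives in API gateway configuration, and the backend labels `"new-service"` and `"monolith"` as well as the function names are hypothetical. It assumes a stable key per caller (such as a user ID) so the same caller always lands on the same backend.

```python
import hashlib


def rollout_bucket(request_key: str) -> int:
    """Map a stable key (user or request ID) to a bucket in 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_backend(path: str, request_key: str, rollout: dict[str, int]) -> str:
    """Pick a backend for a request.

    `rollout` maps a path prefix to the percentage of traffic (0 to 100) that
    should reach the extracted service; everything else stays on the monolith.
    """
    # Longest matching prefix wins so a specific route can override a broad one.
    for prefix in sorted(rollout, key=len, reverse=True):
        if path.startswith(prefix):
            if rollout_bucket(request_key) < rollout[prefix]:
                return "new-service"
            return "monolith"
    return "monolith"
```

Raising a prefix's percentage step by step shifts traffic gradually, and setting it back to 0 returns that route to the monolith.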
|
||||
|
||||
### Migration Phases
|
||||
|
||||
```
|
||||
Phase 1: Foundation (Weeks 1–[N])
|
||||
- Deploy service infrastructure (CI/CD, observability, service mesh)
|
||||
- Extract lowest-risk, highest-value service first
|
||||
- Monolith continues to serve all traffic
|
||||
|
||||
Phase 2: First Extractions (Weeks [N]–[M])
|
||||
- Extract P1 services
|
||||
- API gateway routes selected traffic to new services
|
||||
- Monolith handles remaining traffic via facade pattern
|
||||
- Writes go to both the monolith database and the new service's store during transition (dual-write)
|
||||
|
||||
Phase 3: Core Domain Services (Weeks [M]–[P])
|
||||
- Extract P1 core domain services
|
||||
- Data migration for extracted services
|
||||
- Remove dual-write paths for completed migrations
|
||||
|
||||
Phase 4: Monolith Retirement (Weeks [P]–[Q])
|
||||
- Extract remaining services
|
||||
- Monolith serves no production traffic
|
||||
- Decommission monolith infrastructure
|
||||
```
|
||||
|
||||
### Phase-by-Phase Roadmap
|
||||
|
||||
| Phase | Service to Extract | Migration Approach | Team | Duration | Dependencies | Success Criteria |
|
||||
|-------|------------------|--------------------|------|----------|-------------|-----------------|
|
||||
| 1 | [service-name] | [Strangler facade / Branch by abstraction / Event interception] | [Team] | [X weeks] | [Infra ready, CI/CD pipeline] | [Traffic fully on new service, zero errors for 2 weeks] |
|
||||
| 2 | [service-name] | [Approach] | [Team] | [X weeks] | [Phase 1 complete] | [Success metric] |
|
||||
| 3 | [service-name] | [Approach] | [Team] | [X weeks] | [Phase 2 complete] | [Success metric] |
|
||||
|
||||
### Rollback Plan
|
||||
|
||||
For each migration phase, define the rollback trigger and mechanism:
|
||||
- **Rollback trigger:** Error rate on new service > [X%] sustained for [Y minutes], or p99 latency > [threshold]
|
||||
- **Rollback mechanism:** API gateway feature flag reverts all traffic to monolith path in < 5 minutes
|
||||
- **Data rollback:** Dual-write maintained for [X weeks] after cutover to allow replay if needed
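
The rollback trigger above can be sketched as a small predicate. This is a minimal illustration, assuming one metrics sample per minute; the `Sample` shape, the function name, and the choice to fire on the latest p99 reading alone are assumptions, not part of the template.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One per-minute observation of the new service (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def should_roll_back(
    samples: Sequence[Sample],
    max_error_rate: float,
    sustained_minutes: int,
    max_p99_ms: float,
) -> bool:
    """Return True when either rollback condition holds."""
    if not samples:
        return False
    # Latency condition: the most recent p99 exceeds the threshold.
    if samples[-1].p99_latency_ms > max_p99_ms:
        return True
    # Error condition: every one of the last N samples is above the limit.
    recent = samples[-sustained_minutes:]
    return (
        len(recent) == sustained_minutes
        and all(s.error_rate > max_error_rate for s in recent)
    )
```

Requiring the error rate to stay high for the full window keeps a single bad minute from flipping traffic back to the monolith.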
|
||||
|
||||
---
|
||||
|
||||
## 8. Organizational Alignment (Conway's Law)
|
||||
|
||||
Conway's Law: the architecture of a system mirrors the communication structure of the organization that builds it. Design service ownership to match team boundaries — or change the team boundaries.
|
||||
|
||||
| Service | Proposed Owner Team | Current Team Assignment | Change Required |
|
||||
|---------|--------------------|-----------------------|-----------------|
|
||||
| [service-name] | [Team A] | [Same / Different] | [No change / Transfer to Team A / New team needed] |
|
||||
| [service-name] | [Team B] | [Team A currently] | [Transfer ownership] |
|
||||
|
||||
**Misalignments identified:**
|
||||
- [Misalignment 1: e.g., "The notification service spans two teams today. Assign it entirely to Team B which already owns the messaging domain."]
|
||||
- [Misalignment 2: e.g., "The reporting service is owned by Data Eng but consumers are Product teams — establish a clear API contract and SLA."]
|
||||
|
||||
**Team topology recommendation:** [Describe the recommended team structure — stream-aligned teams, platform team, enabling team — and how it maps to the proposed services.]
|
||||
|
||||
---
|
||||
|
||||
## 9. Risk Register
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation | Owner |
|
||||
|------|-----------|--------|-----------|-------|
|
||||
| Data consistency across services during migration | High | High | Dual-write with reconciliation job; event sourcing for critical domains | [Name] |
|
||||
| Distributed transaction complexity (sagas) | Medium | High | Start with choreography; add orchestration only when choreography becomes unmanageable | [Name] |
|
||||
| Service mesh operational overhead | Medium | Medium | Start without a mesh; add after 5+ services deployed | [Name] |
|
||||
| Network latency replacing in-process calls | Medium | Medium | Cache aggressively; design read models to avoid chatty sync calls | [Name] |
|
||||
| Conway's Law friction during transition | High | Medium | Align team structure before starting extraction, not after | [Name] |
|
||||
| Over-decomposition (nanoservices) | Medium | High | Enforce minimum service size rule: a service must justify its own team/deployment overhead | [Name] |
|
||||
| Observability gaps during migration | High | High | Deploy distributed tracing before first extraction; establish correlation IDs | [Name] |
|
||||
| [Context-specific risk] | [Level] | [Level] | [Mitigation] | [Owner] |
|
||||
|
||||
---
|
||||
|
||||
*Questions about this design: [Slack channel or contact]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Bounded context map is an ASCII diagram with labeled relationships — not a prose description of the contexts
|
||||
- [ ] Every service in the inventory table has a named team owner and a clear single-sentence responsibility statement
|
||||
- [ ] Data ownership matrix assigns every key entity to exactly one owning service — no shared ownership
|
||||
- [ ] Communication pattern decisions explain WHY sync vs. async was chosen for each interaction type
|
||||
- [ ] If decomposing a monolith, the strangler fig migration plan has phases with durations, dependencies, and success criteria
|
||||
- [ ] Risk register addresses at minimum: data consistency, distributed transactions, and Conway's Law alignment
|
||||
- [ ] Organizational alignment section maps services to teams and identifies misalignments that need to be resolved
|
||||
@@ -0,0 +1,436 @@
|
||||
---
|
||||
name: monitoring-setup-guide
|
||||
description: "Write a monitoring setup guide for a service — defining what to measure, how to alert on it, and how to build the observability stack covering the four golden signals, business metrics, log strategy, distributed tracing, alerting rules, dashboard layout, and observability debt. Use when asked to set up monitoring for a service, define alerting strategy, write an observability plan, create a dashboard specification, or document logging standards for a team. Produces a metric definitions table, alert rules specification, dashboard layout wireframe, log schema, tracing setup checklist, and monitoring gap analysis."
|
||||
---
|
||||
|
||||
# Monitoring Setup Guide Skill
|
||||
|
||||
Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and description** — what the service does and its role in the system
|
||||
- **Tech stack** — language, framework, and infrastructure (e.g. Go/gRPC on Kubernetes, Python/FastAPI on ECS)
|
||||
- **Current monitoring tooling** — Datadog, Prometheus + Grafana, CloudWatch, New Relic, Honeycomb, or none yet
|
||||
- **Key user journeys** — the 2–4 most important things a user or consumer does with the service (these drive what to alert on)
|
||||
- **Existing alerts** — paste any existing alert configurations or describe what's currently monitored
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Monitoring Setup Guide: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**Stack:** [Language/Framework] on [Infrastructure]
|
||||
**Monitoring platform:** [Datadog / Prometheus+Grafana / CloudWatch / etc.]
|
||||
**Date:** [Date] | **Review cycle:** Quarterly
|
||||
|
||||
---
|
||||
|
||||
## 1. Monitoring Philosophy
|
||||
|
||||
Good monitoring answers three questions:
|
||||
1. **Is the service healthy right now?** (alerting)
|
||||
2. **Was it healthy in the past, and is it trending worse?** (dashboards + SLO tracking)
|
||||
3. **Why did something fail?** (logs + traces)
|
||||
|
||||
This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.
|
||||
|
||||
**Key user journeys monitored:**
|
||||
- Journey 1: [e.g. "User submits a payment — POST /charges, receives confirmation"]
|
||||
- Journey 2: [e.g. "User views transaction history — GET /transactions"]
|
||||
- Journey 3: [e.g. "Subscription renewal job runs — background worker processes billing events"]
|
||||
|
||||
---
|
||||
|
||||
## 2. The Four Golden Signals
|
||||
|
||||
Apply the four golden signals specifically to [Service Name]:
|
||||
|
||||
### Latency
|
||||
|
||||
Latency measures how long requests take to complete. Track it separately for successful and failed requests: fast failures pull aggregate latency down and can mask slow successes, while slow failures are worse for users than fast ones.
|
||||
|
||||
| Metric | Description | Source | Dimensions |
|
||||
|---|---|---|---|
|
||||
| `[service].request.duration_ms` | End-to-end request latency | Application instrumentation | `endpoint`, `method`, `status_code` |
|
||||
| `[service].db.query_duration_ms` | Database query latency | ORM / query instrumentation | `query_name`, `table` |
|
||||
| `[service].external.request_duration_ms` | Outbound call latency to dependencies | HTTP client instrumentation | `target_service`, `endpoint` |
|
||||
| `[service].queue.processing_duration_ms` | Time to process one message (if applicable) | Consumer instrumentation | `queue_name`, `message_type` |
|
||||
|
||||
**Latency SLO targets:**
|
||||
|
||||
| Endpoint / operation | p50 target | p95 target | p99 target |
|
||||
|---|---|---|---|
|
||||
| `GET /api/v1/[resource]` | < [50] ms | < [200] ms | < [500] ms |
|
||||
| `POST /api/v1/[resource]` | < [100] ms | < [400] ms | < [1000] ms |
|
||||
| `GET /health` | < [10] ms | < [20] ms | < [50] ms |
|
||||
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
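
To make the percentile targets concrete, here is a minimal sketch of checking raw latency samples against them. It assumes the nearest-rank definition of a percentile; real monitoring platforms usually estimate percentiles from histograms, and the function names are illustrative only.

```python
import math
from typing import Dict, Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(samples_ms: Sequence[float], targets_ms: Dict[float, float]) -> Dict[float, bool]:
    """Map each percentile to True when it is under its target."""
    return {p: percentile(samples_ms, p) < target for p, target in targets_ms.items()}
```

For example, `slo_report(samples, {50: 50, 95: 200, 99: 500})` mirrors the first row of the table above.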
|
||||
|
||||
### Traffic
|
||||
|
||||
Traffic measures demand on the system. Use it to detect unexpected spikes and traffic drops (which can indicate upstream failures), and to inform capacity planning.
|
||||
|
||||
| Metric | Description | Source |
|
||||
|---|---|---|
|
||||
| `[service].request.count` | Requests per second | Application / load balancer |
|
||||
| `[service].request.count_by_endpoint` | RPS broken down by endpoint | Application |
|
||||
| `[service].queue.messages_consumed_per_second` | Consumer throughput | Queue consumer |
|
||||
| `[service].queue.depth` | Messages waiting in queue | Queue metrics |
|
||||
|
||||
**Traffic baselines (update after observing production for 2+ weeks):**
|
||||
|
||||
| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
|
||||
|---|---|---|---|
|
||||
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
|
||||
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
|
||||
|
||||
### Errors
|
||||
|
||||
Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).
|
||||
|
||||
| Metric | Description | Alert on? |
|
||||
|---|---|---|
|
||||
| `[service].request.error_rate` | 5xx errors / total requests | Yes — see alert rules |
|
||||
| `[service].request.client_error_rate` | 4xx errors / total requests | Threshold alert — sudden spike may indicate API misuse |
|
||||
| `[service].dependency.error_rate` | Errors calling downstream dependencies | Yes — dependency health signal |
|
||||
| `[service].queue.dlq_depth` | Messages in dead-letter queue | Yes — indicates processing failures |
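
The 4xx / 5xx split can be sketched as follows. This is an illustration that assumes status codes are available as plain integers; the function name and the returned keys are chosen to echo the metric names above and are otherwise hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def error_rates(status_codes: Iterable[int]) -> Dict[str, float]:
    """Compute server (5xx) and client (4xx) error rates over a batch of responses."""
    counts: Counter = Counter()
    total = 0
    for code in status_codes:
        total += 1
        if 500 <= code <= 599:
            counts["server"] += 1
        elif 400 <= code <= 499:
            counts["client"] += 1
    if total == 0:
        # No traffic: report zero rather than dividing by zero.
        return {"error_rate": 0.0, "client_error_rate": 0.0}
    return {
        "error_rate": counts["server"] / total,
        "client_error_rate": counts["client"] / total,
    }
```

Keeping the two rates separate means a burst of 404s from one misbehaving client does not page on-call as if the service itself were failing.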
|
||||
|
||||
### Saturation
|
||||
|
||||
Saturation measures how "full" the service is: how close its most constrained resources are to maximum capacity.
|
||||
|
||||
| Resource | Metric | Alert threshold | Source |
|
||||
|---|---|---|---|
|
||||
| CPU | `[service].cpu.utilisation_pct` | >80% sustained 5 min | Container / VM metrics |
|
||||
| Memory | `[service].memory.utilisation_pct` | >85% sustained 5 min | Container / VM metrics |
|
||||
| DB connections | `[service].db.connection_pool.utilisation_pct` | >75% | Application / DB metrics |
|
||||
| Thread pool / goroutines | `[service].runtime.goroutine_count` / `thread_count` | >N (establish baseline) | Runtime metrics |
|
||||
| Disk (if applicable) | `[service].disk.utilisation_pct` | >75% | Infrastructure |
|
||||
| Queue depth (if applicable) | `[service].queue.depth` | >[backlog threshold] | Queue metrics |
|
||||
|
||||
---
|
||||
|
||||
## 3. Business Metrics
|
||||
|
||||
Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.
|
||||
|
||||
| Metric | Description | Source | Alert? |
|
||||
|---|---|---|---|
|
||||
| `[service].[primary_action].success_rate` | [e.g. "Payment success rate"] | Application | Yes — if drops >5% vs 1h average |
|
||||
| `[service].[primary_action].count` | [e.g. "Payments processed per minute"] | Application | Yes — sudden drop (traffic anomaly) |
|
||||
| `[service].[resource].created_per_hour` | [e.g. "New accounts created"] | Application / DB | No — informational |
|
||||
| `[service].cache.hit_rate` | Fraction of requests served from cache | Cache instrumentation | Yes — if drops below [60]% |
|
||||
| `[service].job.[name].success_rate` | [Background job success rate] | Job framework | Yes — if drops below [99]% |
|
||||
|
||||
---
|
||||
|
||||
## 4. Log Strategy
|
||||
|
||||
### Structured Logging Schema
|
||||
|
||||
All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.
|
||||
|
||||
**Mandatory fields (every log line):**
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-01-15T10:23:45.123Z",
|
||||
"level": "info",
|
||||
"service": "[service-name]",
|
||||
"version": "[git-sha-short]",
|
||||
"trace_id": "[uuid-from-request-context]",
|
||||
"span_id": "[span-uuid]",
|
||||
"request_id": "[uuid-per-request]",
|
||||
"message": "[human readable description]"
|
||||
}
|
||||
```
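
A minimal sketch of emitting these mandatory fields with Python's standard `logging` module follows. The class name `JsonFormatter` and the convention of passing request-scoped IDs through `extra=` are assumptions made for illustration; many teams use an existing structured-logging library instead.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the four levels this guide uses.
_LEVELS = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the mandatory fields."""

    def __init__(self, service: str, version: str) -> None:
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVELS.get(record.levelname, record.levelname.lower()),
            "service": self.service,
            "version": self.version,
            # Request-scoped IDs are expected via `extra=`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonFormatter("[service-name]", "[git-sha-short]"))`, then log with `logger.info("payment processed", extra={"trace_id": ..., "request_id": ...})`.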
|
||||
|
||||
**Request log (emit for every HTTP request):**
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "...",
|
||||
"level": "info",
|
||||
"service": "[service-name]",
|
||||
"event": "http_request",
|
||||
"method": "POST",
|
||||
"path": "/api/v1/[resource]",
|
||||
"status_code": 201,
|
||||
"duration_ms": 45,
|
||||
"user_id": "[uuid — DO NOT log PII directly]",
|
||||
"request_id": "[uuid]",
|
||||
"trace_id": "[uuid]"
|
||||
}
|
||||
```
|
||||
|
||||
**Error log (emit for every error with context):**
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "...",
|
||||
"level": "error",
|
||||
"service": "[service-name]",
|
||||
"event": "error",
|
||||
"error_code": "[application-error-code]",
|
||||
"error_message": "[description — no sensitive data]",
|
||||
"stack_trace": "[stack trace]",
|
||||
"request_id": "[uuid]",
|
||||
"trace_id": "[uuid]",
|
||||
"context": {
|
||||
"[key]": "[relevant context without PII]"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Log Levels — When to Use Each
|
||||
|
||||
| Level | Use when | Example |
|
||||
|---|---|---|
|
||||
| `error` | Something failed and needs attention; a sustained rate of these should trip an alert that pages on-call | Database query failed, external API returned 5xx, required config missing |
|
||||
| `warn` | Something unexpected happened but service is still functioning | Retry succeeded after failure, cache miss on expected hit, rate limit approaching |
|
||||
| `info` | Significant business events and request lifecycle | Request received, payment processed, user authenticated, job started/completed |
|
||||
| `debug` | Detailed diagnostic information — off in production by default | Query parameters, intermediate computation results, cache key lookups |
|
||||
|
||||
### What NOT to Log
|
||||
|
||||
**Never log:**
|
||||
- Passwords, tokens, API keys, or secrets (even hashed)
|
||||
- Full credit card numbers or PAN data
|
||||
- Social security numbers or government IDs
|
||||
- Full names + dates of birth + contact info in the same log line (PII aggregation)
|
||||
- Request/response bodies in full (use field-level extraction instead)
|
||||
- Health check requests (too noisy — exclude `GET /health` from access logs)
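
One way to enforce the list above is to scrub fields before they reach the logger. The sketch below is illustrative only: the key names in `SENSITIVE_KEYS` and the `[REDACTED]` marker are assumptions, and a key-name denylist will not catch secrets embedded in free-text values.

```python
from typing import Any, Dict, Mapping

# Hypothetical denylist; extend it to match your own field names.
SENSITIVE_KEYS = frozenset(
    {"password", "token", "api_key", "secret", "card_number", "ssn"}
)


def redact(fields: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `fields` with sensitive values replaced, recursing into nested mappings."""
    clean: Dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, Mapping):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

The input mapping is left untouched, so the caller can still use the original values for business logic.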
|
||||
|
||||
---
|
||||
|
||||
## 5. Distributed Tracing Setup
|
||||
|
||||
Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.
|
||||
|
||||
### Instrumentation Checklist
|
||||
|
||||
```
|
||||
[ ] Tracing library installed:
|
||||
- Go: go.opentelemetry.io/otel
|
||||
- Python: opentelemetry-sdk, opentelemetry-instrumentation
|
||||
- Node: @opentelemetry/sdk-node
|
||||
- Java: opentelemetry-java-instrumentation
|
||||
|
||||
[ ] Tracer initialized at service startup with service name and version
|
||||
|
||||
[ ] Trace context propagated via W3C Trace Context headers:
|
||||
traceparent: 00-[trace-id]-[span-id]-01
|
||||
tracestate: [optional vendor-specific]
|
||||
|
||||
[ ] Automatic instrumentation enabled for:
|
||||
[ ] Inbound HTTP/gRPC requests (creates root span)
|
||||
[ ] Outbound HTTP/gRPC calls (creates child spans)
|
||||
[ ] Database queries (creates child spans with sanitized query)
|
||||
[ ] Cache operations (Redis, Memcached)
|
||||
[ ] Message queue produce/consume
|
||||
|
||||
[ ] Custom spans added for:
|
||||
[ ] Key business operations ([e.g. payment processing, user lookup])
|
||||
[ ] Background jobs (each job execution = root span)
|
||||
[ ] Third-party API calls with custom attributes
|
||||
|
||||
[ ] Span attributes to capture on all spans:
|
||||
- user.id (if authenticated — no PII)
|
||||
- deployment.environment (production/staging)
|
||||
- service.version (git SHA)
|
||||
- [service-specific key attributes]
|
||||
|
||||
[ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]
|
||||
|
||||
[ ] Sampling rate configured:
|
||||
- Production: [1–10]% of requests (adjust based on volume and cost)
|
||||
- Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint] (needs tail-based sampling, since errors and latency are only known after the request completes)
|
||||
```
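
To show what the `traceparent` header in the checklist carries, here is a minimal sketch of parsing one and deriving a child header by hand. In practice the OpenTelemetry propagators do this for you; the names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical, and the parser only accepts the version `00` layout.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

The trace ID stays constant across every hop while the span ID changes, which is what lets a tracing backend stitch the calls into one tree.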
|
||||
|
||||
### Trace Instrumentation Examples
|
||||
|
||||
```python
# Python — OpenTelemetry example
from opentelemetry import trace

tracer = trace.get_tracer("[service-name]")


def process_payment(payment_data):
    # `_do_process` and `PaymentError` stand in for the service's own payment
    # logic and exception type; define or import them in real code.
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            result = _do_process(payment_data)
            span.set_status(trace.StatusCode.OK)
            return result
        except PaymentError as e:
            # Mark the span as failed and attach the exception before re-raising.
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Alert Rules Specification
|
||||
|
||||
Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.
|
||||
|
||||
### Alert Definitions
|
||||
|
||||
| Alert name | Condition | Threshold | Severity | On-call action |
|
||||
|---|---|---|---|---|
|
||||
| `[Service]HighErrorRate` | 5xx error rate, 5-min rolling window | >1% for 2 consecutive windows | P1 | Check recent deploys; inspect error logs; see runbook [link] |
|
||||
| `[Service]CriticalErrorRate` | 5xx error rate, 2-min rolling window | >5% | P1 — immediate | Same as above — page immediately, do not wait |
|
||||
| `[Service]HighP99Latency` | p99 latency on key endpoints | >2× SLO target for 3 min | P2 | Check DB latency, cache hit rate, and downstream dependencies |
|
||||
| `[Service]LatencySLOBreach` | p99 latency | >SLO target for 5 consecutive minutes | P1 | SLO burn — page on-call, escalate if not resolved in 20 min |
|
||||
| `[Service]HighCPU` | CPU utilisation | >80% sustained for 5 min | P2 | Check for traffic spike; scale up if needed; check for runaway processes |
|
||||
| `[Service]HighMemory` | Memory utilisation | >85% sustained for 5 min | P2 | Check for memory leak (especially after deploys); restart pod if OOM imminent |
|
||||
| `[Service]DBConnectionPoolHigh` | DB connection pool utilisation | >75% | P2 | Check for long-running queries; consider scaling service or increasing pool size |
|
||||
| `[Service]DLQDepthHigh` | Dead-letter queue depth | >10 messages | P2 | Inspect DLQ messages for error pattern; fix bug and replay if safe |
|
||||
| `[Service]TrafficDropAnomaly` | RPS, compared to same hour yesterday | >50% drop sustained 5 min | P1 | Upstream may be down; check caller health; check load balancer |
|
||||
| `[Service]PrimaryActionSuccessRateDrop` | [Business metric success rate] | <[95]% over 10 min | P1 | [Service-specific action — e.g. "Check payment provider status"] |
|
||||
| `[Service]DownstreamDependencyErrors` | Error rate calling [dependency] | >5% over 5 min | P2 | Check [dependency] status page; enable fallback if available |
|
||||
|
||||
### Alert Configuration Examples
|
||||
|
||||
```yaml
# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:

      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"

      - alert: [Service]HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [0.5]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"
```
|
||||
|
||||
```python
# Datadog monitor configuration (Python SDK; the same monitor can also be managed in Terraform)
import datadog

datadog.initialize(api_key="[key]", app_key="[key]")

datadog.api.Monitor.create(
    type="metric alert",
    # Plain string, not an f-string: Datadog's {tag:value} scope braces must reach the API unchanged.
    # Replace [service] with the metric prefix and [service-name] with the service tag value.
    query="sum(last_5m):sum:[service].http.errors{service:[service-name]} / sum:[service].http.requests{service:[service-name]} > 0.01",
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=["service:[service-name]", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    }
)
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Dashboard Layout Specification
|
||||
|
||||
The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ [SERVICE NAME] — Service Health Dashboard [Time range ▼] │
|
||||
├───────────────┬───────────────┬───────────────┬─────────────────────┤
|
||||
│ Error rate │ p99 Latency │ RPS (current)│ SLO budget remaining│
|
||||
│ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER / days] │
|
||||
│ vs SLO: 0.1% │ vs SLO: 500ms│ vs avg: [N] │ [Error budget gauge]│
|
||||
├───────────────┴───────────────┴───────────────┴─────────────────────┤
|
||||
│ Error rate over time (24h) │
|
||||
│ [Time series: 5xx rate line, SLO threshold line] │
|
||||
├─────────────────────────────────┬───────────────────────────────────┤
|
||||
│ Latency percentiles over time │ Request throughput over time │
|
||||
│ [Lines: p50, p95, p99, p999] │ [Bars: RPS by endpoint] │
|
||||
│ [SLO threshold horizontal line]│ │
|
||||
├─────────────────────────────────┴───────────────────────────────────┤
|
||||
│ Latency heatmap (all requests — shows distribution shape) │
|
||||
├─────────────────────────────────┬───────────────────────────────────┤
|
||||
│ CPU utilisation over time │ Memory utilisation over time │
|
||||
│ [All instances/pods — lines] │ [All instances/pods — lines] │
|
||||
│ [Alert threshold: 80%] │ [Alert threshold: 85%] │
|
||||
├─────────────────────────────────┴───────────────────────────────────┤
|
||||
│ DB: connection pool utilisation│ DB: query latency (p99 per query)│
|
||||
├─────────────────────────────────┴───────────────────────────────────┤
|
||||
│ [Business metric 1 over time] │ [Business metric 2 over time] │
|
||||
│ e.g. Payment success rate │ e.g. Orders created/min │
|
||||
└─────────────────────────────────┴───────────────────────────────────┘
|
||||
```
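
The "SLO budget remaining" panel can be driven by a simple calculation. The sketch below assumes a request-based availability SLO measured over a fixed window; the function name and signature are illustrative and not tied to any monitoring platform.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    `slo_target` is the success-rate objective as a fraction, e.g. 0.999
    for the 0.1% error-rate SLO shown on the dashboard.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so no budget has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% target, 1,000,000 requests, and 250 failures, roughly three quarters of the budget is left; a negative result means the SLO for the window is already breached.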
|
||||
|
||||
**Second dashboard — Dependency Health:**
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ [SERVICE NAME] — Dependency Health │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ For each dependency: error rate | latency | current status │
|
||||
│ [Database] [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded │
|
||||
│ [Redis] [N]% errors | [N]ms p99 | ● Healthy │
|
||||
│ [External API][N]% errors | [N]ms p99 | ● Healthy │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ Outbound call latency over time (one line per dependency) │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ Circuit breaker / fallback state (if implemented) │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Observability Debt Analysis
|
||||
|
||||
An honest assessment of what is missing today and how urgently each gap should be closed:
|
||||
|
||||
| Gap | Impact | Priority | Effort | Owner | Target date |
|
||||
|---|---|---|---|---|---|
|
||||
| [e.g. No distributed tracing — can't see cross-service latency] | High — blind to dependency issues | P1 | [2 days] | [Name] | [Date] |
|
||||
| [e.g. No business metric alerts — only infra alerts] | High — silent business failures | P1 | [1 day] | [Name] | [Date] |
|
||||
| [e.g. Logs are unstructured text — not searchable] | Medium — slow incident investigation | P2 | [3 days] | [Name] | [Date] |
|
||||
| [e.g. No dead-letter queue monitoring] | Medium — failed messages go unnoticed | P2 | [4 hours] | [Name] | [Date] |
|
||||
| [e.g. Alert thresholds not calibrated to production baseline] | Medium — alert fatigue or missed alerts | P2 | [1 day] | [Name] | [Date] |
|
||||
| [e.g. No latency heatmap — outliers invisible in averages] | Low — harder to spot tail latency issues | P3 | [2 hours] | [Name] | [Date] |
|
||||
|
||||
**Total observability debt: [N] items | Estimated effort: [N days]**
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every alert has a named on-call action — no alert says "investigate" without specifying what to investigate first
|
||||
- [ ] Alert thresholds are calibrated against production baselines, not set to default values from a template
|
||||
- [ ] Structured logging is implemented — no unstructured text log lines in production
|
||||
- [ ] PII is explicitly excluded from logs — a named engineer has verified this
|
||||
- [ ] Distributed tracing is propagating trace IDs across all service boundaries (verify with a test request)
|
||||
- [ ] The primary dashboard answers "is the service healthy?" in under 10 seconds — no hunting for the right panel
|
||||
- [ ] Business metrics are tracked alongside infrastructure metrics — not just four golden signals
|
||||
- [ ] Observability debt items have owners and dates — not just "would be nice to have"
|
||||
@@ -0,0 +1,364 @@
|
||||
---
|
||||
name: oncall-runbook
|
||||
description: "Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template."
|
||||
---
|
||||
|
||||
# On-Call Runbook Skill
|
||||
|
||||
Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.
|
||||
|
||||
A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and what it does
|
||||
- **Team** and tech lead name
|
||||
- **Alert list** — names of alerts that currently page on-call
|
||||
- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc.
|
||||
- **Common failure modes** — what breaks most often, and what fixes it
|
||||
- **Escalation contacts** — who to call when on-call can't resolve it
|
||||
- **Deployment setup** — can on-call roll back? How?
|
||||
- **Service dependencies** — what does this service depend on, and what depends on it?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# On-Call Runbook: [Service Name]
|
||||
|
||||
**Team:** [Team name] | **Tech lead:** [Name]
|
||||
**PagerDuty service:** [Link] | **Escalation policy:** [Policy name]
|
||||
**Last updated:** [Date] | **Next review:** [Date + 90 days]
|
||||
|
||||
> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference
|
||||
|
||||
**Dashboard:** [Link — the first thing to open when paged]
|
||||
**Logs:** [Link — where to find logs]
|
||||
**Runbook index:** Jump to the alert that paged you → [Alert list below]
|
||||
**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty]
|
||||
|
||||
**Rollback command (memorise this):**
|
||||
```bash
|
||||
[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation Matrix
|
||||
|
||||
| Situation | Escalate to | How | After how long |
|
||||
|---|---|---|---|
|
||||
| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
|
||||
| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately |
|
||||
| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) |
|
||||
| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
|
||||
| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
|
||||
| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |
|
||||
|
||||
**Contacts:**
|
||||
|
||||
| Name | Role | Slack | Phone |
|
||||
|---|---|---|---|
|
||||
| [Name] | Tech lead | @[handle] | [Number] |
|
||||
| [Name] | Engineering manager | @[handle] | [Number] |
|
||||
| [Name] | Platform / infra | @[handle] | [Number] |
|
||||
| [Platform team] | Infra on-call | `#platform` | PagerDuty |
|
||||
|
||||
---
|
||||
|
||||
## Service Architecture (Quick View)
|
||||
|
||||
```
|
||||
[Upstream callers]
|
||||
│
|
||||
▼
|
||||
[This Service]
|
||||
│
|
||||
├──→ [Primary Database]
|
||||
├──→ [Cache — e.g. Redis]
|
||||
└──→ [Downstream Service / Queue]
|
||||
```
|
||||
|
||||
**If this service is down, these are affected:** [List upstream callers and other consumers]
**If these are down, this service is affected:** [List downstream dependencies]
|
||||
|
||||
---
|
||||
|
||||
## Alert Runbooks
|
||||
|
||||
### ALERT: [Alert Name 1 — e.g. HighErrorRate]
|
||||
|
||||
**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
|
||||
**Severity:** P1 / P2 / P3
|
||||
**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]
|
||||
|
||||
**Step 1 — Acknowledge and assess**
|
||||
```bash
|
||||
# Check current error rate
|
||||
[query or dashboard link]
|
||||
|
||||
# Check which endpoints are erroring
|
||||
[query or command]
|
||||
```
|
||||
|
||||
**Step 2 — Check recent changes**
|
||||
```bash
|
||||
# Any deploys in the last hour?
|
||||
[command or link to deployment log]
|
||||
|
||||
# Recent config changes?
|
||||
[where to check]
|
||||
```
|
||||
|
||||
**Step 3 — Check dependencies**
|
||||
```bash
|
||||
# Is the database healthy?
|
||||
[health check command or link]
|
||||
|
||||
# Is [downstream service] healthy?
|
||||
[health check command or link]
|
||||
```
|
||||
|
||||
**Step 4 — Diagnose**
|
||||
|
||||
| If you see | It means | Do this |
|
||||
|---|---|---|
|
||||
| [Error pattern 1] | [Cause] | [Action] |
|
||||
| [Error pattern 2] | [Cause] | [Action] |
|
||||
| [Error pattern 3] | [Cause] | [Action] |
|
||||
| No clear pattern | Unknown cause | Escalate to [name] |
|
||||
|
||||
**Step 5 — Fix or mitigate**
|
||||
```bash
|
||||
# If caused by bad deploy — roll back:
|
||||
[rollback command]
|
||||
|
||||
# If caused by [specific issue]:
|
||||
[fix command]
|
||||
|
||||
# If caused by upstream dependency:
|
||||
[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
|
||||
```
|
||||
|
||||
**After resolving:**
|
||||
- [ ] Confirm error rate has returned to baseline
|
||||
- [ ] Check no downstream services were affected
|
||||
- [ ] If P1: open a post-incident review — see [incident-postmortem skill]
|
||||
- [ ] Update `#incidents` with resolution summary
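
The "SLO impact" line in this alert entry refers to an error-budget burn rate. For whoever fills in the `[X]×` placeholder, the arithmetic can be sketched as follows. This is a minimal illustration, assuming a request-based availability SLO and a constant error rate over the window; the function names and the 30-day default are hypothetical and not part of this runbook.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is burning.

    observed_error_rate: fraction of failed requests in the alert window (0.05 = 5%).
    slo_target: availability objective as a fraction (0.999 = 99.9%).
    """
    if not 0.0 <= observed_error_rate <= 1.0:
        raise ValueError("observed_error_rate must be between 0 and 1")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    return observed_error_rate / error_budget


def hours_until_budget_exhausted(rate: float, window_days: int = 30) -> float:
    """Hours until a full window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    # Assumed example: 5% of requests failing against a 99.9% SLO.
    rate = burn_rate(0.05, 0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget gone in: {hours_until_budget_exhausted(rate):.1f} h")
```

A burn rate of 1 means the budget would last exactly the SLO window; anything well above 1 is a reasonable argument for treating the alert as P1.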
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 2 — e.g. HighLatency]
|
||||
|
||||
**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
|
||||
**Severity:** P1 / P2 / P3
|
||||
**SLO impact:** Yes — latency SLO breach
|
||||
|
||||
**Step 1 — Assess scope**
|
||||
```bash
|
||||
# Check which endpoints are slow
|
||||
[query or dashboard — broken down by endpoint]
|
||||
|
||||
# Check if latency is across all regions or localised
|
||||
[query or command]
|
||||
```
|
||||
|
||||
**Step 2 — Common causes and fixes**
|
||||
|
||||
| Cause | Signal | Fix |
|
||||
|---|---|---|
|
||||
| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] |
|
||||
| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
|
||||
| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
|
||||
| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
|
||||
| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] |
|
||||
|
||||
**Step 3 — Escalate if unresolved in 20 minutes**
|
||||
Page [Tech lead] via PagerDuty / Slack.
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted]
|
||||
|
||||
**What it means:** [e.g. "The service has used all available database connections — new requests will fail"]
|
||||
**Severity:** P1
|
||||
**SLO impact:** Yes — will cause errors immediately
|
||||
|
||||
**Immediate mitigation:**
|
||||
```bash
|
||||
# Restart the service to flush stale connections
|
||||
[restart command]
|
||||
|
||||
# Check current connection count
|
||||
[DB connection query]
|
||||
```
|
||||
|
||||
**Diagnose root cause after stabilising:**
|
||||
```bash
|
||||
# Check for long-running queries holding connections
|
||||
[query]
|
||||
|
||||
# Check if a recent deploy changed connection pool config
|
||||
[where to check]
|
||||
```
|
||||
|
||||
**Resolution:** [e.g. "Increase pool size in config / kill long-running queries / scale the service"]
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 4 — e.g. QueueBacklogHigh / ConsumerLag]
|
||||
|
||||
**What it means:** [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"]
|
||||
**Severity:** P2
|
||||
**SLO impact:** Depends — if queue backs up, downstream systems will receive delayed data
|
||||
|
||||
**Step 1 — Check consumer health**
|
||||
```bash
|
||||
# Are consumers running?
|
||||
[command]
|
||||
|
||||
# Consumer error rate?
|
||||
[dashboard or query]
|
||||
```
|
||||
|
||||
**Step 2 — Check message contents**
|
||||
```bash
|
||||
# Are there poison messages causing retries?
|
||||
[command to inspect dead-letter queue or failed messages]
|
||||
```
|
||||
|
||||
**Step 3 — Options**
|
||||
|
||||
| If | Then |
|
||||
|---|---|
|
||||
| Consumers are down | Restart consumers: `[command]` |
|
||||
| Poison message in queue | Move to DLQ: `[command]` |
|
||||
| Consumers healthy but slow | Scale consumers: `[command]` |
|
||||
| Upstream producing too fast | Escalate to [upstream service] owner |
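
Choosing between "scale consumers" and "escalate upstream" in the table above usually comes down to whether the backlog can drain at all. The sketch below shows that estimate under simplifying assumptions: steady produce and consume rates, and identical consumers. All names and numbers are hypothetical; real rates would come from the queue dashboard.

```python
import math


def minutes_to_drain(backlog: int, consume_per_min: float, produce_per_min: float) -> float:
    """Estimate minutes until the backlog is empty at the current rates."""
    if backlog <= 0:
        return 0.0
    net_rate = consume_per_min - produce_per_min
    if net_rate <= 0:
        # Consumers are not keeping up, so the backlog never drains.
        return math.inf
    return backlog / net_rate


def consumers_needed(backlog: int, produce_per_min: float,
                     per_consumer_per_min: float, target_minutes: float) -> int:
    """Consumers required to absorb new traffic and clear the backlog in time."""
    if per_consumer_per_min <= 0 or target_minutes <= 0:
        raise ValueError("per_consumer_per_min and target_minutes must be positive")
    required_rate = produce_per_min + backlog / target_minutes
    return math.ceil(required_rate / per_consumer_per_min)


if __name__ == "__main__":
    # Assumed example: 10,000 queued messages, 1,000 msg/min arriving.
    print(minutes_to_drain(10_000, 1_500, 1_000))
    print(consumers_needed(10_000, 1_000, 250, target_minutes=20))
```

If `minutes_to_drain` returns infinity even after scaling to the maximum consumer count, the producer rate is the problem and escalation to the upstream owner is the remaining option.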
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Add additional alerts following the same pattern]
|
||||
|
||||
---
|
||||
|
||||
## Diagnostic Cheat Sheet
|
||||
|
||||
Common commands for quick diagnosis. Replace each placeholder with the real command for this service so that on-call engineers can paste and run it without modification.
|
||||
|
||||
```bash
|
||||
# Service health
|
||||
[health check command]
|
||||
|
||||
# Recent logs (last 100 lines)
|
||||
[log command]
|
||||
|
||||
# Error logs only
|
||||
[error log filter command]
|
||||
|
||||
# Current pod / instance status
|
||||
[kubectl get pods / aws ecs describe-tasks / etc.]
|
||||
|
||||
# Restart the service
|
||||
[restart command]
|
||||
|
||||
# Roll back to previous version
|
||||
[rollback command]
|
||||
|
||||
# Database connection count
|
||||
[DB query]
|
||||
|
||||
# Cache hit rate
|
||||
[cache stats command]
|
||||
|
||||
# Current request rate
|
||||
[metrics query]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Useful Dashboard Links
|
||||
|
||||
| Dashboard | URL | Use it to |
|
||||
|---|---|---|
|
||||
| Service overview | [Link] | First stop — error rate, latency, request rate |
|
||||
| Database | [Link] | Connection count, slow queries, replication lag |
|
||||
| Infrastructure | [Link] | CPU, memory, disk |
|
||||
| Queue / consumers | [Link] | Backlog depth, consumer throughput |
|
||||
| Upstream dependencies | [Link] | Dependency health at a glance |
|
||||
|
||||
---
|
||||
|
||||
## Incident Communication
|
||||
|
||||
When you declare an incident:
|
||||
|
||||
**Post to `#incidents` immediately:**
|
||||
```
|
||||
🔴 INCIDENT — [Service Name]
|
||||
Status: Investigating
|
||||
Impact: [Who is affected and how]
|
||||
Paged: [Your name]
|
||||
Next update: [Time — max 30 min from now]
|
||||
```
|
||||
|
||||
**Update every 30 minutes while active:**
|
||||
```
|
||||
🔴 UPDATE — [Service Name] — [Time]
|
||||
Status: [Investigating / Identified / Mitigating / Resolved]
|
||||
Latest: [One sentence on what you found or did]
|
||||
Next update: [Time]
|
||||
```
|
||||
|
||||
**On resolution:**
|
||||
```
|
||||
✅ RESOLVED — [Service Name] — [Time]
|
||||
Duration: [X minutes]
|
||||
Impact: [Summary of who was affected]
|
||||
Cause: [One sentence]
|
||||
Follow-up: [PIR required? Yes/No — link when created]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## On-Call Handoff
|
||||
|
||||
Use this template at the end of every on-call shift:
|
||||
|
||||
```
|
||||
--- ON-CALL HANDOFF: [Service Name] ---
|
||||
Date: [Date]
|
||||
Outgoing: [Your name]
|
||||
Incoming: [Next on-call name]
|
||||
|
||||
INCIDENTS THIS SHIFT:
|
||||
- [Incident summary — date, duration, cause, resolution, follow-up required]
|
||||
|
||||
OPEN ISSUES TO WATCH:
|
||||
- [Anything not fully resolved / trending in the wrong direction]
|
||||
|
||||
CHANGES SINCE LAST HANDOFF:
|
||||
- [Deploys, config changes, infra changes that affect on-call awareness]
|
||||
|
||||
RUNBOOK GAPS FOUND:
|
||||
- [Anything you had to figure out that isn't documented — please add it]
|
||||
|
||||
ANYTHING ELSE:
|
||||
- [Notes for incoming on-call]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every alert that pages on-call has a runbook entry — no alert is missing
|
||||
- [ ] Rollback command is accurate and tested recently
|
||||
- [ ] Escalation contacts have current phone numbers and Slack handles
|
||||
- [ ] Diagnostic commands work — they have been run by at least one person recently
|
||||
- [ ] Handoff template is used at every shift change — not just during incidents
|
||||
- [ ] "Things I had to figure out that weren't documented" are added to this runbook after every incident
|
||||
@@ -0,0 +1,277 @@
|
||||
---
|
||||
name: performance-budget
|
||||
description: "Define and document performance budgets for a web service or application. Use when asked to set performance targets, define SLOs for latency or throughput, establish Core Web Vitals targets, create a performance baseline, or document performance regression policy. Produces a structured performance budget covering key user journeys, Core Web Vitals, backend latency SLOs, measurement tooling, CI enforcement, and breach response process."
|
||||
---
|
||||
|
||||
# Performance Budget Skill
|
||||
|
||||
Produce a complete, actionable performance budget document for a web service or application. A performance budget is not a wishlist — it is a set of measurable, enforced constraints that define what "acceptable performance" means and who is responsible when those constraints are violated.
|
||||
|
||||
A good performance budget answers: what are the targets, how are they measured, what triggers an investigation, and what happens when a budget is breached.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and type** — web app, API service, mobile app, or combination
|
||||
- **Key user journeys** — the 3–5 most important flows users take (e.g. "search → product page → checkout")
|
||||
- **Current baseline metrics** — P50/P95/P99 latency, LCP, CLS, INP if available (state "no baseline" if not collected yet)
|
||||
- **Tech stack** — frontend framework, backend language/framework, CDN, database
|
||||
- **Deployment environment** — cloud provider, region(s), edge/CDN configuration
|
||||
- **Cost constraints** — any budget or infrastructure limits that affect headroom
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Performance Budget: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Last updated:** [Date] | **Owner:** [Name / role]
|
||||
**Environment:** [Production / Staging baseline] | **Review cadence:** [Quarterly / per-sprint]
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
[2–3 sentences describing the service, its user-facing performance requirements, and why performance is a priority. Reference the business impact of latency — e.g. conversion rate, user retention, SLA obligations.]
|
||||
|
||||
**Performance philosophy:** [e.g. "Performance is a feature. Every engineer is responsible for keeping the service within budget. Regressions must be caught in CI before they reach production."]
|
||||
|
||||
---
|
||||
|
||||
## Key User Journeys
|
||||
|
||||
Define the critical paths that the performance budget is designed to protect.
|
||||
|
||||
| Journey ID | Journey name | Entry point | Exit point | Criticality |
|
||||
|---|---|---|---|---|
|
||||
| UJ-1 | [e.g. New user sign-up] | [Landing page] | [Dashboard] | Critical |
|
||||
| UJ-2 | [e.g. Core workflow task] | [e.g. /app/tasks] | [e.g. Task complete] | High |
|
||||
| UJ-3 | [e.g. Search and select] | [e.g. /search] | [e.g. Detail page] | High |
|
||||
| UJ-4 | [e.g. API data fetch] | [e.g. GET /api/items] | [e.g. 200 response] | Medium |
|
||||
|
||||
---
|
||||
|
||||
## Frontend Performance Budget
|
||||
|
||||
*Complete this section for web and mobile applications. Skip for API-only services.*
|
||||
|
||||
### Core Web Vitals Targets
|
||||
|
||||
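Targets apply to the 75th percentile of real user sessions (field data). Lab (synthetic) checks against the same targets use a mid-range Android device profile on a 4G connection unless otherwise stated. FCP and TTFB are supplementary diagnostic metrics rather than Core Web Vitals, but they are budgeted alongside them.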
|
||||
|
||||
| Metric | Description | Good | Needs Improvement | Poor | **Our Target** | Current baseline |
|
||||
|---|---|---|---|---|---|---|
|
||||
| **LCP** | Largest Contentful Paint — perceived load speed | ≤2.5s | 2.5–4.0s | >4.0s | **[≤X.Xs]** | [Xs / not measured] |
|
||||
| **INP** | Interaction to Next Paint — responsiveness | ≤200ms | 200–500ms | >500ms | **[≤Xms]** | [Xms / not measured] |
|
||||
| **CLS** | Cumulative Layout Shift — visual stability | ≤0.1 | 0.1–0.25 | >0.25 | **[≤0.X]** | [X.XX / not measured] |
|
||||
| **FCP** | First Contentful Paint | ≤1.8s | 1.8–3.0s | >3.0s | **[≤X.Xs]** | [Xs / not measured] |
|
||||
| **TTFB** | Time to First Byte | ≤800ms | 800ms–1.8s | >1.8s | **[≤Xms]** | [Xms / not measured] |
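
To make the Good / Needs Improvement / Poor columns above mechanical, a baseline value can be rated with a small lookup. This is an illustrative sketch only: the thresholds are copied from the table, while the `THRESHOLDS` name, the `rate_metric` function, and the units convention (seconds for LCP and FCP, milliseconds for INP and TTFB) are assumptions made for the example.

```python
# (good_upper_bound, poor_lower_bound) per metric, taken from the table above.
THRESHOLDS = {
    "LCP": (2.5, 4.0),     # seconds
    "INP": (200, 500),     # milliseconds
    "CLS": (0.1, 0.25),    # unitless score
    "FCP": (1.8, 3.0),     # seconds
    "TTFB": (800, 1800),   # milliseconds
}


def rate_metric(metric: str, p75_value: float) -> str:
    """Classify a 75th-percentile value as Good, Needs Improvement, or Poor."""
    if metric not in THRESHOLDS:
        raise KeyError(f"unknown metric: {metric}")
    good, poor = THRESHOLDS[metric]
    if p75_value <= good:
        return "Good"
    if p75_value > poor:
        return "Poor"
    return "Needs Improvement"


if __name__ == "__main__":
    # Assumed baseline values, for illustration only.
    for name, value in [("LCP", 3.1), ("INP", 180), ("CLS", 0.31)]:
        print(name, value, rate_metric(name, value))
```

A target stricter than "Good" is still a valid budget; the rating only shows where the current baseline sits relative to the published bands.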
|
||||
|
||||
### Page Weight Budget
|
||||
|
||||
| Asset type | Max size (compressed) | Current | Status |
|
||||
|---|---|---|---|
|
||||
| Total page weight | [e.g. 500KB] | [XKB / unknown] | [Within / Over / Unknown] |
|
||||
| JavaScript (initial load) | [e.g. 200KB] | [XKB / unknown] | [Within / Over / Unknown] |
|
||||
| CSS | [e.g. 50KB] | [XKB / unknown] | [Within / Over / Unknown] |
|
||||
| Images (above fold) | [e.g. 150KB] | [XKB / unknown] | [Within / Over / Unknown] |
|
||||
| Web fonts | [e.g. 50KB] | [XKB / unknown] | [Within / Over / Unknown] |
|
||||
| Third-party scripts | [e.g. 100KB] | [XKB / unknown] | [Within / Over / Unknown] |
|
||||
|
||||
### Per-Journey Frontend Targets
|
||||
|
||||
| Journey | LCP | INP | CLS | FCP | TTFB |
|
||||
|---|---|---|---|---|---|
|
||||
| UJ-1: [Journey name] | [≤Xs] | [≤Xms] | [≤0.X] | [≤Xs] | [≤Xms] |
|
||||
| UJ-2: [Journey name] | [≤Xs] | [≤Xms] | [≤0.X] | [≤Xs] | [≤Xms] |
|
||||
| UJ-3: [Journey name] | [≤Xs] | [≤Xms] | [≤0.X] | [≤Xs] | [≤Xms] |
|
||||
|
||||
---
|
||||
|
||||
## Backend Performance Budget
|
||||
|
||||
### API Latency SLOs
|
||||
|
||||
Targets measured at the service boundary (not including client-side network latency).
|
||||
|
||||
| Endpoint / operation | Method | P50 | P95 | P99 | Max (hard limit) | Error rate |
|
||||
|---|---|---|---|---|---|---|
|
||||
| [e.g. /api/auth/login] | POST | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
|
||||
| [e.g. /api/items] | GET | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
|
||||
| [e.g. /api/items/:id] | GET | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
|
||||
| [e.g. /api/items] | POST | [≤Xms] | [≤Xms] | [≤Xms] | [≤Xms] | [<X%] |
|
||||
| [e.g. Background job: sync] | — | [≤Xs] | [≤Xs] | [≤Xs] | [≤Xs] | [<X%] |
|
||||
|
||||
**Overall service SLOs:**
|
||||
|
||||
| SLO | Target | Measurement window |
|
||||
|---|---|---|
|
||||
| Availability | [99.X%] | 30-day rolling |
|
||||
| P95 latency (all endpoints) | [≤Xms] | 30-day rolling |
|
||||
| Error rate (5xx) | [<X%] | 30-day rolling |
|
||||
| Throughput (sustained) | [≥X req/s] | Peak hour |
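
The P50 / P95 / P99 columns above only mean something if everyone computes percentiles the same way. The sketch below uses the nearest-rank method on raw latency samples. That method is an assumption for illustration, since APM tools often use interpolation or histogram approximations, and the `percentile` and `check_budget` names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_budget(samples: Sequence[float], budget_ms: Dict[int, float]) -> Dict[int, bool]:
    """Return, per percentile, whether the measured latency is within budget."""
    return {p: percentile(samples, p) <= limit for p, limit in budget_ms.items()}


if __name__ == "__main__":
    # Assumed latency samples in milliseconds and an assumed budget.
    latencies = [42, 45, 47, 51, 55, 60, 64, 70, 120, 480]
    print(check_budget(latencies, {50: 100, 95: 300, 99: 500}))
```

Whichever method the APM tool uses, stating it next to the SLO table avoids disputes about whether a budget was actually breached.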
|
||||
|
||||
### Database Query Budget
|
||||
|
||||
| Query / operation | P50 | P95 | Max | Notes |
|
||||
|---|---|---|---|---|
|
||||
| [e.g. User lookup by ID] | [≤Xms] | [≤Xms] | [≤Xms] | Index on `user_id` |
|
||||
| [e.g. List items for user] | [≤Xms] | [≤Xms] | [≤Xms] | Paginated, max 100 rows |
|
||||
| [e.g. Full-text search] | [≤Xms] | [≤Xms] | [≤Xms] | Elasticsearch / pg_trgm |
|
||||
|
||||
---
|
||||
|
||||
## Measurement Methodology
|
||||
|
||||
### Real User Monitoring (RUM)
|
||||
|
||||
**Tool:** [e.g. Google CrUX, SpeedCurve, Datadog RUM, Sentry Performance, custom]
|
||||
**Data source:** [Field data from real users / Lab data from synthetic tests / Both]
|
||||
**Sample rate:** [X% of sessions]
|
||||
**How to access:** [Dashboard URL or tool access instructions]
|
||||
|
||||
**What is measured:**
|
||||
- [ ] Core Web Vitals (LCP, INP, CLS) per page and journey
|
||||
- [ ] Custom performance marks for business-critical interactions
|
||||
- [ ] Resource timing for key assets
|
||||
- [ ] Long tasks (>50ms on main thread)
|
||||
|
||||
### Synthetic Monitoring
|
||||
|
||||
**Tool:** [e.g. Lighthouse CI, WebPageTest, k6, Artillery, Playwright with performance assertions]
|
||||
**Frequency:** [Every X minutes / on every deploy / nightly]
|
||||
**Test location(s):** [e.g. eu-west-1, us-east-1]
|
||||
**Device profile:** [Desktop 10Mbps / Mobile 4G Moto G4 / both]
|
||||
|
||||
**Synthetic test suite location:** [Link to test files]
|
||||
|
||||
### Backend Observability
|
||||
|
||||
**APM tool:** [e.g. Datadog, Grafana + Prometheus, New Relic, AWS X-Ray]
|
||||
**Metrics collected:**
|
||||
- Request rate, error rate, duration (RED metrics) per endpoint
|
||||
- Database query duration and connection pool utilisation
|
||||
- Cache hit/miss rates
|
||||
- Background job queue depth and processing latency
|
||||
|
||||
**Dashboard:** [Link to primary performance dashboard]
|
||||
|
||||
---
|
||||
|
||||
## CI/CD Performance Enforcement
|
||||
|
||||
Performance budgets are enforced at two gates:
|
||||
|
||||
### Gate 1 — Build-time Bundle Analysis
|
||||
|
||||
**Tool:** [e.g. bundlesize, size-limit, webpack-bundle-analyzer with CI assertion]
|
||||
**Config file:** [`.bundlesizerc` / `.size-limit.js` / etc.]
|
||||
**Trigger:** Every PR targeting `main`
|
||||
**Blocking:** Yes — PR cannot merge if bundle size budget is exceeded
|
||||
|
||||
```js
// Example .size-limit.js
module.exports = [
  {
    "path": "dist/js/*.js",
    "limit": "200 KB"
  },
  {
    "path": "dist/css/*.css",
    "limit": "50 KB"
  }
];
```
|
||||
|
||||
### Gate 2 — Synthetic Performance Tests in CI
|
||||
|
||||
**Tool:** [e.g. Lighthouse CI, k6, Artillery]
|
||||
**Trigger:** On deploy to staging
|
||||
**Blocking:** Yes — production deploy is blocked if thresholds fail
|
||||
**Thresholds checked:**
|
||||
- LCP ≤ [Xs]
|
||||
- CLS ≤ [0.X]
|
||||
- P95 API latency ≤ [Xms]
|
||||
- Error rate < [X%]
|
||||
|
||||
**CI config location:** [`.github/workflows/perf.yml` / `ci/performance.yaml`]
|
||||
|
||||
**How to run locally:**
|
||||
```bash
|
||||
# Run Lighthouse CI against local build
|
||||
[command — e.g. lhci autorun --config=lighthouserc.js]
|
||||
|
||||
# Run load test locally
|
||||
[command — e.g. k6 run load-tests/api-smoke.js]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Budget Breach Response Process
|
||||
|
||||
A budget breach occurs when a measured metric exceeds its target for [X consecutive measurements / X minutes sustained / a single deploy].
|
||||
|
||||
### Breach Severity Levels
|
||||
|
||||
| Severity | Condition | Response time | Who acts |
|
||||
|---|---|---|---|
|
||||
| P1 — Critical | >2× budget threshold in production | Immediate | On-call engineer + team lead |
|
||||
| P2 — High | >1.5× budget threshold in production | Within 4 hours | On-call engineer |
|
||||
| P3 — Medium | Threshold exceeded in production | Within 1 sprint | PR author + team |
|
||||
| P4 — Low | Threshold exceeded in staging only | Before merge | PR author |
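
The severity table above can be expressed as a single function, which is handy when wiring the budget into alerting or CI. This is a sketch under stated assumptions: the metric is "lower is better" (latency, page weight, error rate), any staging-only breach maps to P4 regardless of size, and the `classify_breach` name and environment strings are hypothetical.

```python
from typing import Optional


def classify_breach(measured: float, budget: float, environment: str) -> Optional[str]:
    """Map a measured value to a breach severity, or None if within budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if environment not in ("production", "staging"):
        raise ValueError("environment must be 'production' or 'staging'")
    if measured <= budget:
        return None  # within budget, no breach
    if environment == "staging":
        return "P4"  # exceeded in staging only: fix before merge
    ratio = measured / budget
    if ratio > 2.0:
        return "P1"  # more than 2x the budget in production
    if ratio > 1.5:
        return "P2"  # more than 1.5x the budget in production
    return "P3"      # exceeded, but by no more than 1.5x


if __name__ == "__main__":
    # Assumed example: P95 latency budget of 300 ms.
    for value, env in [(280, "production"), (330, "production"),
                       (500, "production"), (700, "production"), (700, "staging")]:
        print(value, env, classify_breach(value, 300, env))
```

The sustained-breach condition from the definition above (consecutive measurements or minutes) would be applied before calling this function, so that a single noisy sample does not page anyone.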
|
||||
|
||||
### Breach Investigation Checklist
|
||||
|
||||
When a breach is detected, work through this checklist in order:
|
||||
|
||||
**1. Identify the regression commit**
|
||||
```bash
|
||||
# Compare performance across recent deploys
|
||||
[command — e.g. datadog metrics query, lighthouse-ci compare, git bisect]
|
||||
```
|
||||
|
||||
**2. Classify the breach**
|
||||
- [ ] Is this a code change? (new feature, refactor, dependency bump)
|
||||
- [ ] Is this an infrastructure change? (new instance type, config change)
|
||||
- [ ] Is this an external factor? (CDN issue, DNS, upstream dependency)
|
||||
- [ ] Is this a measurement anomaly? (test environment issue, sample size)
|
||||
|
||||
**3. Immediate actions**
|
||||
- If P1/P2 in production and a code cause is confirmed: roll back or disable the feature flag
|
||||
- If cause is unknown: do not roll back immediately — gather more data first
|
||||
- Notify [#performance / #incidents Slack channel] with: metric name, current value, budget target, suspected cause
|
||||
|
||||
**4. Resolution**
|
||||
- Fix the root cause — do not just adjust the budget threshold
|
||||
- Budget thresholds should only change after a team discussion and explicit approval from [tech lead / EM]
|
||||
- Document the breach in the [performance log / incident record]
|
||||
|
||||
**Budget change policy:** Budget thresholds may be relaxed only if: (a) the feature that introduced the regression has measurable business value that outweighs the performance cost, and (b) the change is reviewed and approved by [tech lead / EM].
|
||||
|
||||
---
|
||||
|
||||
## Performance Review Cadence
|
||||
|
||||
| Trigger | Action |
|
||||
|---|---|
|
||||
| Every sprint | Review P95/P99 latency trends; flag any creeping degradation |
|
||||
| Every quarter | Full performance budget review — update baselines, adjust targets, audit tooling |
|
||||
| After major feature launch | Re-measure all Core Web Vitals and API SLOs; update baselines |
|
||||
| After infrastructure change | Re-run full synthetic test suite; confirm no regression |
|
||||
| After dependency upgrade | Run bundle size diff; confirm no unexpected size increase |
|
||||
|
||||
**Next scheduled review:** [Date]
|
||||
**Review owner:** [Name / role]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every budget threshold is a specific number — not a range or "TBD"
|
||||
- [ ] Both frontend (if applicable) and backend targets are defined — not just one or the other
|
||||
- [ ] Measurement tooling is named with a link to the dashboard or config file
|
||||
- [ ] CI enforcement is configured for at least one gate (build-time or deploy-time)
|
||||
- [ ] Budget breach response process names specific Slack channels and owners
|
||||
- [ ] Budget thresholds are anchored to baseline measurements or a justified target — not pulled from thin air
|
||||
- [ ] Per-journey targets are defined for critical user journeys, not just global averages
|
||||
@@ -0,0 +1,218 @@
|
||||
---
|
||||
name: qbr-deck
|
||||
description: "Build a Quarterly Business Review (QBR) deck structure and narrative for a customer account. Use when asked to prepare a QBR, business review meeting, executive review, or quarterly check-in with a customer. Produces a slide-by-slide QBR structure with talking points, metrics review, value narrative, and mutual next steps."
|
||||
---
|
||||
|
||||
# QBR Deck Skill
|
||||
|
||||
Produce a complete Quarterly Business Review deck — structured, data-backed, and customer-focused. A good QBR demonstrates value delivered, aligns on goals for the next quarter, and strengthens the executive relationship. It should never feel like a product demo or a vendor update.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Account name**, CSM name, and customer stakeholders attending
|
||||
- **Contract details** — ARR, contract start date, renewal date
|
||||
- **Last quarter's goals** (from previous QBR or kickoff)
|
||||
- **Usage and adoption data** — key metrics for the quarter
|
||||
- **Support summary** — tickets raised, resolution time, any escalations
|
||||
- **Business outcomes the customer cares about** — what success looks like for them
|
||||
- **Product updates or new features** relevant to this customer
|
||||
- **Goals for next quarter**
|
||||
- **Any open commercial conversations** (expansion, renewal, at-risk signals)
|
||||
|
||||
## QBR Principles
|
||||
|
||||
- Lead with customer outcomes, not product features
|
||||
- Every metric should connect to a business result the customer cares about
|
||||
- The agenda is a conversation, not a presentation — build in time for customer input at every stage
|
||||
- Close with mutual commitments, not just vendor actions
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# QBR: [Account Name] × [Your Company]
|
||||
**[Quarter] [Year] Business Review**
|
||||
|
||||
**Date:** [Date] | **Location / Call link:** [TBC]
|
||||
**Customer attendees:** [Names and roles]
|
||||
**[Your company] attendees:** [Names and roles]
|
||||
|
||||
---
|
||||
|
||||
## Slide 1: Agenda (5 min)
|
||||
|
||||
| Time | Topic | Owner |
|
||||
|---|---|---|
|
||||
| 0:00 | Welcome and introductions | CSM |
|
||||
| 0:05 | [Last quarter] — how did we do? | CSM + Customer |
|
||||
| 0:20 | Value delivered — business impact | CSM |
|
||||
| 0:35 | What's coming — roadmap preview | CSM / Product |
|
||||
| 0:45 | [Next quarter] — goals and priorities | Customer |
|
||||
| 0:55 | Actions and mutual commitments | CSM |
|
||||
| 1:00 | Close | |
|
||||
|
||||
*Talking point: "We've kept today to 60 minutes. We want as much of this to be a conversation as possible — please push back, redirect, and ask questions throughout."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 2: Where We Are Together (2 min)
|
||||
|
||||
**Partnership snapshot:**
|
||||
- **Customer since:** [Date]
|
||||
- **Contract value:** £/$/€[ARR]/year
|
||||
- **Renewal date:** [Date]
|
||||
- **Active users:** [N] of [N] licensed seats ([X]% adoption)
|
||||
- **Products / modules active:** [List]
|
||||
|
||||
*Talking point: "Before we dive in — a quick picture of where we are. [X] months in, [Y] active users, and this is our [Nth] QBR together."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 3: Last Quarter — Goals We Set Together (5 min)
|
||||
|
||||
| Goal | Set in [Last QBR / Kickoff] | Status |
|
||||
|---|---|---|
|
||||
| [Goal 1] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
|
||||
| [Goal 2] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
|
||||
| [Goal 3] | [What we committed to] | ✅ Achieved / ⚠️ Partial / ❌ Missed |
|
||||
|
||||
For any partial or missed goal: state what happened and what changes next quarter.
|
||||
|
||||
*Talking point: "Let's start with accountability. Here's what we said we'd achieve last quarter — let's be honest about where we landed."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 4: Usage and Adoption (5 min)
|
||||
|
||||
**Quarter-over-quarter trend:**
|
||||
|
||||
| Metric | [Q-1] | [Q] | Change |
|
||||
|---|---|---|---|
|
||||
| Monthly active users | [N] | [N] | +/-X% |
|
||||
| Sessions per user per week | [N] | [N] | +/-X% |
|
||||
| [Key feature 1] adoption | [X]% | [X]% | +/-X% |
|
||||
| [Key feature 2] adoption | [X]% | [X]% | +/-X% |
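
The "Change" column and the seat-adoption figure on Slide 2 are simple ratios, but they are easy to get wrong when prepared by hand. A minimal sketch of both calculations follows; the function names and sample numbers are hypothetical, and rounding to one decimal place is an assumption.

```python
def adoption_pct(active_users: int, licensed_seats: int) -> float:
    """Share of licensed seats that are active, as a percentage."""
    if licensed_seats <= 0:
        raise ValueError("licensed_seats must be positive")
    return round(100 * active_users / licensed_seats, 1)


def qoq_change_pct(previous: float, current: float) -> float:
    """Relative quarter-over-quarter change, as a percentage of the previous value."""
    if previous == 0:
        raise ValueError("change is undefined when the previous value is zero")
    return round(100 * (current - previous) / previous, 1)


if __name__ == "__main__":
    # Assumed figures for illustration.
    print(adoption_pct(84, 120))
    print(qoq_change_pct(200, 230))
```

For rows that are already percentages, such as feature adoption, it is usually clearer to report the difference in percentage points rather than a relative change, and to say which one the slide shows.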
|
||||
|
||||
**Highlights:**
|
||||
- [Positive adoption trend to call out]
|
||||
- [Feature or workflow with strongest engagement]
|
||||
|
||||
**Opportunity:**
|
||||
- [Feature with low adoption that could drive more value — link to their goals]
|
||||
|
||||
*Talking point: "Usage is [up / stable / something we want to talk about]. The area I'd like to focus on is [feature] — we're not seeing the adoption we'd expect given [their goal], and I want to understand why."*
|
||||
|
||||
---
|
||||
|
||||
## Slide 5: Business Impact — Value Delivered (10 min)
|
||||
|
||||
Lead with outcomes, not activity.
|
||||
|
||||
**[Outcome 1: customer's primary success metric]**
|
||||
- Before: [baseline]
|
||||
- Now: [current state]
|
||||
- Impact: [quantified business result — time saved, revenue influenced, cost reduced, risk mitigated]
|
||||
|
||||
**[Outcome 2]**
|
||||
- [Same structure]
|
||||
|
||||
**[Outcome 3]**
|
||||
- [Same structure]
|
||||
|
||||
**Customer evidence** (use if available):
|
||||
> "[Quote from champion or user about value experienced]"
|
||||
|
||||
*Talking point: "This is the section I most want your input on. Are these the outcomes that matter to your business? Are there other ways you're measuring success that we should be tracking?"*
|
||||
|
||||
---
|
||||
|
||||
## Slide 6: Support Summary (3 min)
|
||||
|
||||
| Metric | This quarter | Last quarter | Trend |
|
||||
|---|---|---|---|
|
||||
| Tickets raised | [N] | [N] | ↑ / → / ↓ |
|
||||
| Average resolution time | [X hrs] | [X hrs] | ↑ / → / ↓ |
|
||||
| P1 / critical issues | [N] | [N] | ↑ / → / ↓ |
|
||||
| CSAT score | [X/10] | [X/10] | ↑ / → / ↓ |
|
||||
|
||||
**Notable issues this quarter:**
|
||||
- [Any escalation or major ticket — brief summary and resolution]
|
||||
|
||||
**What we're doing differently:**
|
||||
- [Any process change or improvement based on support patterns]
|
||||
|
||||
---
|
||||
|
||||
## Slide 7: What's Coming — Roadmap Preview (5 min)
|
||||
|
||||
Focus only on what's relevant to this customer's goals. Do not dump the full roadmap.
|
||||
|
||||
| Feature / Improvement | Expected | Why it matters to [Account Name] |
|
||||
|---|---|---|
|
||||
| [Feature 1] | [Q+1] | [Direct link to their goal or pain point] |
|
||||
| [Feature 2] | [Q+1 / Q+2] | [Direct link] |
|
||||
| [Feature 3] | [H2] | [Direct link] |
|
||||
|
||||
*Talking point: "I've filtered the roadmap to what I think matters most to your team. I'd love your reaction — are these the right priorities from your perspective?"*
|
||||
|
||||
---
|
||||
|
||||
## Slide 8: Next Quarter — Your Goals (10 min)
|
||||
|
||||
**Customer input section — facilitate, don't present.**
|
||||
|
||||
Prompt questions:
|
||||
- "What does success look like for your team in [next quarter]?"
|
||||
- "What's the biggest challenge you're trying to solve in the next 90 days?"
|
||||
- "Is there anything about the way you're using [product] you want to change?"
|
||||
|
||||
**Capture live:**
|
||||
|
||||
| Goal for next quarter | Owner (customer) | How we'll support it | How we'll measure it |
|
||||
|---|---|---|---|
|
||||
| [Goal 1] | [Name] | [CSM / product action] | [Metric] |
|
||||
| [Goal 2] | [Name] | [CSM / product action] | [Metric] |
|
||||
|
||||
---
|
||||
|
||||
## Slide 9: Mutual Commitments (5 min)
|
||||
|
||||
**[Your company] commits to:**
|
||||
1. [Specific action — owner — by when]
|
||||
2. [Specific action — owner — by when]
|
||||
3. [Specific action — owner — by when]
|
||||
|
||||
**[Account Name] commits to:**
|
||||
1. [Specific action — owner — by when]
|
||||
2. [Specific action — owner — by when]
|
||||
|
||||
**Next touchpoint:** [Date of next check-in or mid-quarter review]
|
||||
|
||||
---
|
||||
|
||||
## Slide 10: Thank You + Open Q&A (5 min)
|
||||
|
||||
- Recap the one headline from today: [The single most important thing you want them to remember]
|
||||
- Confirm actions are captured and shared after the call
|
||||
- Ask: "Is there anything we didn't cover today that you wanted to raise?"
|
||||
|
||||
---
|
||||
|
||||
## Preparation Checklist
|
||||
|
||||
- [ ] Usage data pulled and QoQ comparison calculated
|
||||
- [ ] Last QBR goals reviewed — status confirmed before the meeting
|
||||
- [ ] Business outcomes framed in customer language (not product language)
|
||||
- [ ] Roadmap filtered to this account's specific use cases
|
||||
- [ ] Customer's goals for next quarter researched or pre-confirmed with champion
|
||||
- [ ] Executive sponsor briefed on any sensitive topics before the call
|
||||
- [ ] Actions from previous QBR reviewed — any outstanding items addressed
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every slide has a talking point, not just a title
|
||||
- [ ] Value slide leads with business outcomes, not product activity
|
||||
- [ ] Roadmap preview links each item to a customer goal
|
||||
- [ ] Mutual commitments section has real owners on both sides
|
||||
- [ ] Customer has at least 20 minutes of airtime in the agenda
|
||||
@@ -0,0 +1,399 @@
|
||||
---
|
||||
name: rfc-writer
|
||||
description: "Write an engineering RFC (Request for Comments) for a technical decision, architectural change, or significant implementation approach. Use when asked to write an RFC, document a technical proposal, create a design doc, write an architecture decision for review, or produce a technical specification for team feedback. Produces a complete RFC document covering problem statement, motivation, proposed solution, alternatives rejected, implementation plan, migration plan, security and performance implications, observability changes, rollout plan, and open questions."
|
||||
---
|
||||
|
||||
# RFC Writer Skill
|
||||
|
||||
Produce a complete engineering RFC (Request for Comments) for a technical decision or architectural change. An RFC is a structured proposal document — not a persuasion document. Its purpose is to expose a decision to scrutiny, surface trade-offs, document alternatives considered, and create a permanent record of why a choice was made.
|
||||
|
||||
A good RFC makes it possible for someone who wasn't in the room to understand years later why the team built something the way they did.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **RFC title and author** — what this RFC is about and who is proposing it
|
||||
- **Problem being solved** — what is broken, missing, or inadequate today; why action is needed now
|
||||
- **Proposed solution** — the approach the author is recommending, at least at a high level
|
||||
- **Context and constraints** — team size, existing architecture, timeline pressures, budget limits, compliance requirements
|
||||
- **Alternatives considered** — at least 2 alternative approaches the author has thought about
|
||||
- **Current status** — is this pre-decision (seeking feedback) or post-decision (documenting a made decision)?
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# RFC [Number]: [Title]
|
||||
|
||||
**Author:** [Name] | **Team:** [Team name]
|
||||
**Created:** [Date] | **Last updated:** [Date]
|
||||
**Status:** Draft | In Review | Approved | Rejected | Superseded by RFC-[X]
|
||||
**Ticket:** [JIRA-XXX] | **Slack thread:** [#channel link]
|
||||
**Review deadline:** [Date — when comments should be submitted by]
|
||||
|
||||
---
|
||||
|
||||
## Abstract
|
||||
|
||||
[2–4 sentences summarising the entire RFC. Should stand alone — someone reading only this should understand what is being proposed, why, and what the main trade-off is. Write this last.]
|
||||
|
||||
---
|
||||
|
||||
## 1. Problem Statement
|
||||
|
||||
[Describe the problem being solved. Focus on the *problem*, not the solution. Be specific and quantified where possible.]
|
||||
|
||||
**Current state:**
|
||||
[Describe how things work today — the existing system, process, or architecture. Include any relevant constraints or limitations.]
|
||||
|
||||
**Why this is a problem now:**
|
||||
[Why is this being addressed now rather than earlier or later? Reference metrics, incidents, product requirements, or scaling thresholds that make this urgent or timely.]
|
||||
|
||||
**Example of the problem in practice:**
|
||||
[A concrete scenario or incident that illustrates the problem. This helps reviewers understand the real-world impact, not just the abstract description.]
|
||||
|
||||
```
|
||||
// Example: current behaviour that illustrates the problem
|
||||
[code snippet, log output, or sequence description showing the problem]
|
||||
```
|
||||
|
||||
**Impact of not solving this:**
|
||||
- [Impact 1 — e.g. "New tenant onboarding requires 3 hours of manual configuration per account"]
|
||||
- [Impact 2 — e.g. "Auth service handles 400 req/s; projected to hit capacity within 8 weeks at current growth"]
|
||||
- [Impact 3 — e.g. "Current approach is incompatible with the upcoming multi-region requirement"]
|
||||
|
||||
---
|
||||
|
||||
## 2. Goals and Non-Goals
|
||||
|
||||
**Goals:**
|
||||
- [ ] [Specific, measurable outcome — e.g. "Reduce tenant onboarding time from 3 hours to <5 minutes"]
|
||||
- [ ] [e.g. "Support 2,000 req/s on the auth service with P99 latency ≤50ms"]
|
||||
- [ ] [e.g. "Enable multi-region deployment without changes to the application layer"]
|
||||
|
||||
**Non-goals:** *(what this RFC explicitly does not address)*
|
||||
- [e.g. "This RFC does not address authentication for internal service-to-service calls — see RFC-042"]
|
||||
- [e.g. "Performance improvements to the existing system — this RFC replaces it"]
|
||||
- [e.g. "Migration of historical data — covered in a follow-on RFC"]
|
||||
|
||||
**Success metrics:**
|
||||
| Metric | Current | Target | Measurement method |
|
||||
|---|---|---|---|
|
||||
| [e.g. Onboarding time] | [3 hours] | [<5 minutes] | [Prometheus histogram on onboarding job duration] |
|
||||
| [e.g. Auth latency P99] | [120ms] | [≤50ms] | [Datadog APM] |
|
||||
| [e.g. Engineer setup time] | [4 hours] | [<30 minutes] | [Onboarding survey] |
|
||||
|
||||
---
|
||||
|
||||
## 3. Background and Motivation
|
||||
|
||||
[Provide the context a reviewer needs to evaluate the proposal. This is not a repeat of the problem statement — it is the surrounding technical and business context.]
|
||||
|
||||
**Existing system overview:**
|
||||
[Describe the relevant parts of the current architecture. Include an ASCII diagram if the relationships between components help understanding.]
|
||||
|
||||
```
|
||||
[ASCII diagram of current architecture — optional but strongly recommended for architectural RFCs]
|
||||
|
||||
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│  Client  │────▶│  [Service A] │────▶│  [Service B] │
└──────────┘     └──────────────┘     └──────────────┘
                         │
                         ▼
                 ┌──────────────┐
                 │  [Database]  │
                 └──────────────┘
|
||||
```
|
||||
|
||||
**Prior work and related decisions:**
|
||||
- [RFC-XXX: Title — relevant previous decision; link]
|
||||
- [ADR-XXX: Title — architectural decision record]
|
||||
- [Any external standards, blog posts, or vendor documentation that informs this proposal]
|
||||
|
||||
**Constraints:**
|
||||
- [e.g. Must remain backward compatible with v1 API clients for 12 months]
|
||||
- [e.g. Team has no Rust expertise — solution must be in Python or Go]
|
||||
- [e.g. Must be deployable without a maintenance window]
|
||||
|
||||
---
|
||||
|
||||
## 4. Proposed Solution
|
||||
|
||||
[Describe the proposed approach clearly and specifically. Include enough detail that an engineer could begin implementing from this document, but don't write the code — that is for the PR.]
|
||||
|
||||
### 4.1 High-Level Approach
|
||||
|
||||
[1–3 paragraphs describing the overall solution. Explain the key idea and why it solves the problem.]
|
||||
|
||||
### 4.2 Architecture
|
||||
|
||||
```
|
||||
[ASCII diagram of the proposed architecture — what the system looks like after this RFC is implemented]
|
||||
|
||||
┌──────────┐     ┌──────────────────┐     ┌──────────────┐
│  Client  │────▶│  [New Component] │────▶│  [Service B] │
└──────────┘     └──────────────────┘     └──────────────┘
                          │                       │
                          ▼                       ▼
                  ┌──────────────┐        ┌──────────────┐
                  │  [Store A]   │        │  [Store B]   │
                  └──────────────┘        └──────────────┘
|
||||
```
|
||||
|
||||
### 4.3 Detailed Design
|
||||
|
||||
[Break the solution into its key components or decisions. For each, explain what it does and why it was designed this way.]
|
||||
|
||||
**Component / Decision 1: [Name]**
|
||||
|
||||
[Description of this component — what it does, how it works, why this approach was chosen.]
|
||||
|
||||
```
|
||||
// Example interface, API contract, or pseudocode (not implementation code)
|
||||
[Relevant schema, API definition, data flow, or pseudocode]
|
||||
```
|
||||
|
||||
**Component / Decision 2: [Name]**
|
||||
|
||||
[Description]
|
||||
|
||||
**Component / Decision 3: [Name]**
|
||||
|
||||
[Description]
|
||||
|
||||
### 4.4 API Changes
|
||||
|
||||
*Complete this section if the RFC introduces or modifies any API endpoints, events, or interfaces.*
|
||||
|
||||
**New endpoints / events:**
|
||||
```
|
||||
[HTTP method + path or event name]
|
||||
Request: { ... }
|
||||
Response: { ... }
|
||||
```
|
||||
|
||||
**Modified endpoints:**
|
||||
- `[endpoint]`: [what changes and why; backward compatibility note]
|
||||
|
||||
**Deprecated endpoints:**
|
||||
- `[endpoint]`: deprecated in favour of `[new endpoint]` — removal timeline: [date/version]
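
A removal timeline is easier for API consumers to act on when the deprecated endpoint also announces it in its responses. The sketch below is one possible way to do that, assuming the team chooses the `Sunset` response header (RFC 8594) and a `successor-version` link relation (RFC 5829). The function name `deprecation_headers`, the path `/v2/orders`, and the date are hypothetical placeholders, not part of this template.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(removal_date: datetime, successor_path: str) -> dict[str, str]:
    """Build response headers announcing when an endpoint will be removed.

    Assumption: the service can attach arbitrary headers to responses.
    """
    if removal_date.tzinfo is None:
        raise ValueError("removal_date must be timezone-aware")
    # HTTP dates are always expressed in GMT.
    http_date = format_datetime(removal_date.astimezone(timezone.utc), usegmt=True)
    return {
        "Sunset": http_date,
        "Link": f'<{successor_path}>; rel="successor-version"',
    }


# Hypothetical example values.
headers = deprecation_headers(datetime(2030, 1, 1, tzinfo=timezone.utc), "/v2/orders")
```

Whether headers, changelog entries, or direct emails are the right channel depends on who the consumers are; the RFC should state which ones apply.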
|
||||
|
||||
### 4.5 Data Model Changes
|
||||
|
||||
*Complete this section if any database schema or data structure changes are required.*
|
||||
|
||||
[Describe schema changes at a high level. Reference the database-migration-plan skill for detailed migration steps.]
|
||||
|
||||
```sql
|
||||
-- Key schema changes (abbreviated — full migration in [link])
|
||||
[DDL statements for key additions/changes]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Alternatives Considered
|
||||
|
||||
*Every alternative must include an explicit reason why it was rejected. "We went with the proposed solution" is not a reason.*
|
||||
|
||||
### Alternative 1: [Name]
|
||||
|
||||
**Description:**
|
||||
[What this alternative would involve.]
|
||||
|
||||
**Pros:**
|
||||
- [Pro 1]
|
||||
- [Pro 2]
|
||||
|
||||
**Cons:**
|
||||
- [Con 1]
|
||||
- [Con 2]
|
||||
|
||||
**Why rejected:**
|
||||
[Specific reason — e.g. "Requires 3× the infrastructure cost", "Incompatible with multi-region requirement", "Team has no expertise in this technology and the ramp-up would miss the Q3 deadline"]
|
||||
|
||||
---
|
||||
|
||||
### Alternative 2: [Name]
|
||||
|
||||
**Description:**
|
||||
[What this alternative would involve.]
|
||||
|
||||
**Pros:**
|
||||
- [Pro 1]
|
||||
- [Pro 2]
|
||||
|
||||
**Cons:**
|
||||
- [Con 1]
|
||||
- [Con 2]
|
||||
|
||||
**Why rejected:**
|
||||
[Specific reason]
|
||||
|
||||
---
|
||||
|
||||
### Alternative 3: Do nothing / defer
|
||||
|
||||
**Description:**
|
||||
Accept the current state and revisit the problem in [timeframe].
|
||||
|
||||
**Why rejected:**
|
||||
[Why deferring is not acceptable — reference the impact of not solving this from Section 1.]
|
||||
|
||||
---
|
||||
|
||||
## 6. Implementation Plan
|
||||
|
||||
**Estimated effort:** [X engineer-weeks] | **Target completion:** [Date / Quarter]
|
||||
**Team:** [Who is building this — names or roles]
|
||||
|
||||
| Phase | Description | Duration | Dependencies | Owner |
|
||||
|---|---|---|---|---|
|
||||
| 1 | [e.g. Core implementation — new component built and tested] | [X weeks] | [None] | [Name] |
|
||||
| 2 | [e.g. Integration — connect new component to existing services] | [X weeks] | [Phase 1 complete] | [Name] |
|
||||
| 3 | [e.g. Rollout — canary deploy, then full rollout] | [X weeks] | [Phase 2 + staging validated] | [Name] |
|
||||
| 4 | [e.g. Cleanup — deprecate old system, remove feature flags] | [X weeks] | [Phase 3 stable for X weeks] | [Name] |
|
||||
|
||||
**Key milestones:**
|
||||
- [ ] [Date]: [Milestone — e.g. "Core implementation complete and code-reviewed"]
|
||||
- [ ] [Date]: [Milestone — e.g. "Staging environment validation complete"]
|
||||
- [ ] [Date]: [Milestone — e.g. "10% canary traffic without regression"]
|
||||
- [ ] [Date]: [Milestone — e.g. "Full rollout complete"]
|
||||
- [ ] [Date]: [Milestone — e.g. "Old system decommissioned"]
|
||||
|
||||
---
|
||||
|
||||
## 7. Migration Plan
|
||||
|
||||
*Complete this section if the RFC requires migrating existing users, data, or API consumers.*
|
||||
|
||||
**Migration strategy:** [Big-bang / Phased / Parallel-run / Opt-in]
|
||||
|
||||
**Who is affected:**
|
||||
- [e.g. All existing API v1 consumers — requires updated client libraries]
|
||||
- [e.g. X million rows in the `orders` table require backfilling]
|
||||
|
||||
**Migration steps:**
|
||||
1. [Step 1 — describe action, who does it, estimated duration]
|
||||
2. [Step 2]
|
||||
3. [Step 3]
|
||||
|
||||
**Backward compatibility window:** [How long will the old system/API remain available?]
|
||||
|
||||
**Communication plan:**
|
||||
- [Who needs to be notified, when, and how — e.g. "API consumers will receive a deprecation notice 3 months before the old endpoint is removed"]
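
For a phased migration that includes a backfill, reviewers usually want to see how the backfill avoids long-running locks. The following is a minimal sketch, assuming a hypothetical `orders.currency` column that is being populated with a default of `'USD'`, and using an in-memory SQLite database purely for illustration. A real RFC would name its own table, column, batch size, and database.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Populate orders.currency for legacy rows, one short transaction per batch.

    Returns the total number of rows updated. Table and column names are
    hypothetical.
    """
    total = 0
    while True:
        with conn:  # commits on success, rolls back if the batch raises
            cur = conn.execute(
                """
                UPDATE orders SET currency = 'USD'
                WHERE id IN (
                    SELECT id FROM orders WHERE currency IS NULL LIMIT ?
                )
                """,
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Illustrative data set: 2,500 legacy rows with no currency.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, currency TEXT)")
conn.executemany("INSERT INTO orders (currency) VALUES (?)", [(None,)] * 2500)
conn.commit()
updated = backfill_in_batches(conn, batch_size=1000)
```

Because each batch commits independently, the job can be stopped and resumed without redoing finished work, which is worth stating explicitly in the migration steps.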
|
||||
|
||||
---
|
||||
|
||||
## 8. Security Implications
|
||||
|
||||
[Describe the security impact of this change. If there are no security implications, state that explicitly with reasoning — do not leave this section blank.]
|
||||
|
||||
| Concern | Impact | Mitigation |
|
||||
|---|---|---|
|
||||
| [e.g. New API endpoint exposed to internet] | [e.g. New attack surface] | [e.g. Rate limiting, auth required, WAF rules] |
|
||||
| [e.g. New data stored — user PII] | [e.g. GDPR scope expanded] | [e.g. Encrypted at rest, access log, data retention policy] |
|
||||
| [e.g. Service-to-service communication] | [e.g. Token forgery risk] | [e.g. mTLS between services] |
|
||||
|
||||
**Has a threat model been produced or updated?** [Yes — link / No — required before implementation / Not required — reason]
|
||||
|
||||
---
|
||||
|
||||
## 9. Performance Implications
|
||||
|
||||
[Describe the expected performance impact. Include projections for the new system and how it was estimated.]
|
||||
|
||||
| Metric | Current | Projected | Measurement method |
|
||||
|---|---|---|---|
|
||||
| [e.g. P99 latency — /api/auth] | [120ms] | [≤50ms] | [Load test results — link] |
|
||||
| [e.g. Database query count per request] | [12] | [3] | [Query logging in staging] |
|
||||
| [e.g. Memory per instance] | [512MB] | [768MB] | [Profiling — link] |
|
||||
| [e.g. Infrastructure cost] | [$X/month] | [$Y/month] | [AWS cost calculator estimate] |
|
||||
|
||||
**Load testing:** [Has load testing been done? Link to results. If not, when will it be done?]
|
||||
|
||||
**Performance risks:**
|
||||
- [Risk 1 — e.g. "New component adds a network hop that may increase tail latency under congestion — needs validation at 2× peak load"]
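
Projected P99 figures are only comparable if everyone computes the percentile the same way. As an illustration, here is a minimal sketch of the nearest-rank method applied to raw latency samples. The helper names `percentile` and `within_budget` and the sample data are assumptions for this example; monitoring tools such as the ones named in the table may use interpolation or histogram buckets instead.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_budget(samples: list[float], budget_ms: float, pct: float = 99) -> bool:
    """True when the chosen percentile does not exceed the latency budget."""
    return percentile(samples, pct) <= budget_ms


# Illustrative samples: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = [float(v) for v in range(1, 101)]
```

The RFC should say which percentile definition and which measurement window the projected numbers rely on.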
|
||||
|
||||
---
|
||||
|
||||
## 10. Observability Changes
|
||||
|
||||
*Describe what new or changed metrics, logs, traces, and alerts this RFC introduces.*
|
||||
|
||||
**New metrics:**
|
||||
| Metric name | Type | Description | Alert threshold |
|
||||
|---|---|---|---|
|
||||
| `[service].[component].[metric]` | [counter/gauge/histogram] | [What it measures] | [e.g. P99 > 100ms for 5 min] |
|
||||
|
||||
**New log events:**
|
||||
| Event | Level | When emitted | Key fields |
|
||||
|---|---|---|---|
|
||||
| `[event.name]` | INFO | [When] | `user_id`, `duration_ms`, `result` |
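
To make the log event table concrete, the sketch below emits one event as a single JSON line carrying the key fields from the table (`user_id`, `duration_ms`, `result`). It assumes Python's standard `logging` module; the event name `tenant.onboarding.completed`, the logger name, and the helper `log_event` are hypothetical.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def log_event(logger: logging.Logger, event: str, **fields) -> None:
    """Emit a structured INFO event; fields travel via the `extra` mechanism."""
    logger.info(event, extra={"fields": fields})


# Wire the formatter to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("rfc.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

started = time.perf_counter()
# ... the instrumented operation would run here ...
duration_ms = round((time.perf_counter() - started) * 1000, 3)
log_event(logger, "tenant.onboarding.completed",
          user_id="u-123", duration_ms=duration_ms, result="success")
```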
|
||||
|
||||
**Distributed tracing:** [Are spans added for new components? Which operations are instrumented?]
|
||||
|
||||
**Dashboard changes:** [New dashboard / updated existing dashboard — link]
|
||||
|
||||
---
|
||||
|
||||
## 11. Rollout Plan
|
||||
|
||||
**Rollout strategy:** [Feature flag / Canary / Blue-green / Gradual traffic shift / Full deploy]
|
||||
|
||||
| Stage | Traffic % | Duration | Success criteria | Rollback trigger |
|
||||
|---|---|---|---|---|
|
||||
| Internal testing | 0% (dogfood) | [X days] | [No errors in internal usage] | Any error |
|
||||
| Canary | 1% | [X hours] | [Error rate <0.1%; P99 latency within budget] | Error rate >0.5% |
|
||||
| Limited rollout | 10% | [X days] | [As above + business metrics stable] | Error rate >0.2% |
|
||||
| Full rollout | 100% | — | [All success metrics from Section 2 met] | Any SLO breach |
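
The stage table above can be read as a small decision rule: promote when the success criterion holds, roll back when the trigger fires, otherwise hold. The sketch below encodes that rule under a simplifying assumption that error rate is the only gating signal (the latency and business-metric criteria are left out). The names `Stage` and `evaluate` are hypothetical, and the thresholds are the example values from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    max_error_rate: float       # success criterion: observed rate must stay below this
    rollback_error_rate: float  # rollback trigger: observed rate above this


def evaluate(stage: Stage, errors: int, requests: int) -> str:
    """Return 'promote', 'hold', or 'rollback' for one observation window."""
    if requests == 0:
        return "hold"  # no traffic yet, so there is nothing to judge
    rate = errors / requests
    if rate > stage.rollback_error_rate:
        return "rollback"
    if rate < stage.max_error_rate:
        return "promote"
    return "hold"


# Example thresholds taken from the table: <0.1% to pass, >0.5% or >0.2% to roll back.
CANARY = Stage("Canary", 1, 0.001, 0.005)
LIMITED = Stage("Limited rollout", 10, 0.001, 0.002)
```

The gap between the two thresholds is deliberate: a rate in between neither promotes nor rolls back, which gives the on-call engineer time to investigate.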
|
||||
|
||||
**Feature flag:** [Name of feature flag, if applicable] — managed in [LaunchDarkly / Unleash / config]
|
||||
|
||||
**Rollback procedure:**
|
||||
```
|
||||
// How to roll back if the rollout needs to be reversed
|
||||
1. [Step 1 — e.g. Toggle feature flag to off]
|
||||
2. [Step 2 — e.g. Deploy previous version]
|
||||
3. [Step 3 — e.g. Notify stakeholders]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 12. Open Questions
|
||||
|
||||
[List any unresolved questions, design decisions not yet made, or areas where the author is specifically seeking feedback. Assign an owner and a resolution deadline for each.]
|
||||
|
||||
| # | Question | Owner | Deadline | Resolution |
|
||||
|---|---|---|---|---|
|
||||
| 1 | [e.g. Should we use optimistic or pessimistic locking for concurrent updates to [resource]?] | [Name] | [Date] | [Pending / [Answer]] |
|
||||
| 2 | [e.g. What is the retention policy for [new data type]?] | [Name] | [Date] | [Pending / [Answer]] |
|
||||
| 3 | [e.g. Do we need a read replica for this query pattern at launch, or can we defer it?] | [Name] | [Date] | [Pending / [Answer]] |
|
||||
|
||||
---
|
||||
|
||||
## 13. Decision
|
||||
|
||||
*To be filled in after the review period closes.*
|
||||
|
||||
**Decision:** [Approved / Rejected / Approved with modifications]
|
||||
**Decision date:** [Date]
|
||||
**Decision makers:** [Names]
|
||||
|
||||
**Summary of key feedback addressed:**
|
||||
- [Feedback item and how it was resolved]
|
||||
|
||||
**Conditions of approval (if any):**
|
||||
- [e.g. Must complete load testing before Phase 2 begins]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] The problem statement is specific and quantified — not "the current system is slow" but "P99 latency is 800ms; budget is 200ms"
|
||||
- [ ] Goals section includes measurable success metrics, not aspirational statements
|
||||
- [ ] Every alternative has an explicit rejection reason — not just a list of cons
|
||||
- [ ] Security implications section is completed, not left blank
|
||||
- [ ] Performance implications include projected numbers, not just "should be better"
|
||||
- [ ] Open questions are assigned to named owners with deadlines — not floating
|
||||
- [ ] The RFC is written to be read by someone who was not in the planning conversations
|
||||
- [ ] Migration plan addresses all affected parties — users, API consumers, data — not just the technical steps
|
||||
@@ -0,0 +1,253 @@
|
||||
---
|
||||
name: security-threat-model
|
||||
description: "Write a STRIDE-based threat model for a service or feature. Use when asked to produce a threat model, document security risks, identify attack vectors, assess a service's security posture, or prepare for a security design review. Produces a structured threat model covering assets, trust boundaries, STRIDE threat enumeration per component, risk scores, mitigation controls, and residual risk sign-off."
|
||||
---
|
||||
|
||||
# Security Threat Model Skill
|
||||
|
||||
Produce a complete STRIDE-based threat model for a service or feature. A threat model is not a list of things that could go wrong — it is a structured analysis of attackers, assets, boundaries, and controls that lets an engineering team make informed, documented security decisions.
|
||||
|
||||
A good threat model is specific enough that a new engineer can understand what is being protected, why each control exists, and what risk the team has accepted.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name and description** — what the service does, who uses it
|
||||
- **Architecture overview** — components, dependencies, data flows (a diagram description or ASCII diagram is fine)
|
||||
- **Deployment environment** — cloud provider, VPC/network topology, where it runs (Kubernetes, ECS, VMs, serverless)
|
||||
- **Data sensitivity** — what data does this service handle? PII, payment data, credentials, internal-only?
|
||||
- **Existing controls** — authentication method, encryption in transit/at rest, current WAF/firewall, existing security scanning
|
||||
- **Trust levels** — who are the principals? (anonymous public, authenticated users, internal services, admins)
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Security Threat Model: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Author:** [Name] | **Reviewed by:** [Security lead / peer]
|
||||
**Date:** [Date] | **Next review:** [Date — recommend 6 months or after major architecture change]
|
||||
**Classification:** [Internal / Confidential]
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview
|
||||
|
||||
[2–3 sentences describing the service, its role in the system, and the scope of this threat model. State what is in scope and what is explicitly out of scope.]
|
||||
|
||||
**In scope:**
|
||||
- [Component or data flow]
|
||||
- [Component or data flow]
|
||||
|
||||
**Out of scope:**
|
||||
- [e.g. Third-party payment processor internals]
|
||||
- [e.g. Corporate network / end-user devices]
|
||||
|
||||
---
|
||||
|
||||
## 2. Asset Register
|
||||
|
||||
Assets are the things worth protecting — data, capabilities, and reputational value.
|
||||
|
||||
| Asset | Description | Sensitivity | Owner |
|
||||
|---|---|---|---|
|
||||
| [e.g. User PII] | Names, email addresses, profile data | High — GDPR-regulated | [Team] |
|
||||
| [e.g. API credentials] | Service-to-service auth tokens | Critical | [Team] |
|
||||
| [e.g. Session tokens] | User authentication state | High | [Team] |
|
||||
| [e.g. Audit logs] | Record of user and admin actions | Medium | [Team] |
|
||||
| [e.g. Service availability] | Uptime of the [X] endpoint | Medium | [Team] |
|
||||
|
||||
**Data classification key:**
|
||||
- **Critical** — Credential material; exposure enables direct system compromise
|
||||
- **High** — PII, financial data, health data; regulated or high reputational impact
|
||||
- **Medium** — Internal configuration, non-sensitive business data
|
||||
- **Low** — Public information, anonymised data
|
||||
|
||||
---
|
||||
|
||||
## 3. Trust Boundaries and Architecture
|
||||
|
||||
Trust boundaries are the lines that separate zones with different trust levels. Threats often occur when data or requests cross a boundary.
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────┐
│                       INTERNET (Untrusted)                       │
│                                                                  │
│          [Public User]                 [Bot / Attacker]          │
└──────────────────────────────┬───────────────────────────────────┘
                               │ HTTPS
                  ─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
                 Trust Boundary: Public → DMZ
                  ─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                         DMZ / Edge Layer                         │
│    ┌────────────┐     ┌──────────────┐                           │
│    │ WAF / CDN  │────▶│ API Gateway  │                           │
│    └────────────┘     └──────┬───────┘                           │
└──────────────────────────────┼───────────────────────────────────┘
                  ─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
            Trust Boundary: Edge → Application VPC
                  ─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                    Application VPC (Private)                     │
│   ┌──────────────┐     ┌────────────┐     ┌──────────────────┐   │
│   │ [Service A]  │────▶│ [Service B]│────▶│    [Database]    │   │
│   └──────────────┘     └────────────┘     └──────────────────┘   │
│                                ▲                                 │
│                                │                                 │
│   ┌──────────────┐             │                                 │
│   │ Admin (IAM)  │─────────────┘                                 │
│   └──────────────┘                                               │
└──────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Trust Boundaries identified:**
|
||||
|
||||
| Boundary | From | To | Auth mechanism | Encrypted |
|
||||
|---|---|---|---|---|
|
||||
| TB-1 | Public internet | API Gateway | [JWT / OAuth / API key] | TLS 1.2+ |
|
||||
| TB-2 | API Gateway | Service A | [mTLS / internal JWT / IAM role] | [Yes/No] |
|
||||
| TB-3 | Service B | Database | [Connection string + IAM / username+password] | [Yes/No] |
|
||||
| TB-4 | Admin | Service B | [IAM role / VPN + MFA] | TLS |
|
||||
|
||||
---
|
||||
|
||||
## 4. STRIDE Threat Analysis
|
||||
|
||||
STRIDE is a threat classification framework. For each significant component, enumerate threats in each category.
|
||||
|
||||
**STRIDE key:**
|
||||
- **S** — Spoofing: Impersonating another user, service, or system
|
||||
- **T** — Tampering: Modifying data or code without authorisation
|
||||
- **R** — Repudiation: Denying an action occurred; insufficient audit trail
|
||||
- **I** — Information Disclosure: Exposing data to unauthorised parties
|
||||
- **D** — Denial of Service: Making the service unavailable
|
||||
- **E** — Elevation of Privilege: Gaining capabilities beyond what is authorised
|
||||
|
||||
### Component: [API Gateway / Auth Layer]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-001 | S | Attacker forges a JWT to authenticate as another user | Weak signing key, acceptance of `alg: none`, or RS256/HS256 algorithm confusion | [e.g. RS256 with key rotation / none] |
|
||||
| T-002 | S | Attacker replays a stolen session token | Token theft via XSS or network sniffing | [e.g. Token expiry + refresh rotation] |
|
||||
| T-003 | T | Attacker modifies request headers to bypass tenant isolation | Missing validation of tenant ID header | [e.g. Server-side tenant resolution / none] |
|
||||
| T-004 | R | No audit trail for admin authentication events | Logging not configured for auth failures | [e.g. CloudTrail enabled / none] |
|
||||
| T-005 | I | Auth error messages reveal whether an email exists | Verbose error responses | [e.g. Normalised error responses / none] |
|
||||
| T-006 | D | Credential stuffing exhausts rate limits and blocks legitimate users | Automated login attempts | [e.g. Rate limiting per IP + CAPTCHA / none] |
|
||||
| T-007 | E | Compromised low-privilege token used to call admin endpoint | Missing role check on admin routes | [e.g. RBAC middleware on all routes / none] |
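
As an illustration of a control for T-001, the sketch below rejects any token whose header names an algorithm outside an explicit allowlist, which is one way to close off `alg: none`. It is a pre-check only and uses just the standard library: signature verification still has to be done by a maintained JWT library. The function name `check_jwt_algorithm` and the single-entry allowlist are assumptions for this example.

```python
import base64
import json

ALLOWED_ALGS = frozenset({"RS256"})  # assumption: the service signs with RS256 only


def _b64url_decode(segment: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def check_jwt_algorithm(token: str) -> str:
    """Return the header's alg if it is allowlisted, else raise ValueError.

    This inspects the header only. It does NOT verify the signature.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    try:
        header = json.loads(_b64url_decode(parts[0]))
    except ValueError as exc:  # covers bad base64, bad UTF-8, and bad JSON
        raise ValueError("malformed header") from exc
    alg = header.get("alg") if isinstance(header, dict) else None
    if not isinstance(alg, str) or alg not in ALLOWED_ALGS:
        raise ValueError(f"algorithm {alg!r} is not allowed")
    return alg
```

The important design choice is that the server decides which algorithms are acceptable; the token's own header is treated as untrusted input.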
|
||||
|
||||
### Component: [Application Service / Business Logic]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-008 | T | SQL/NoSQL injection via unsanitised user input | Unparameterised queries | [e.g. ORM with parameterised queries / none] |
|
||||
| T-009 | T | Mass assignment — attacker sets fields they should not (e.g. `isAdmin: true`) | API accepts extra fields without allowlist | [e.g. Input validation / none] |
|
||||
| T-010 | I | Insecure direct object reference — user accesses another user's resource | Missing ownership check on resource ID | [e.g. Ownership middleware / none] |
|
||||
| T-011 | I | Sensitive data in application logs (PII, tokens) | Over-logging in debug mode | [e.g. Log scrubbing / none] |
|
||||
| T-012 | D | Unprotected expensive endpoint triggers large DB scan | No pagination or query cost limit | [e.g. Pagination enforced / none] |
|
||||
| T-013 | R | Business-critical state changes not logged | No audit event on [operation] | [e.g. Audit log table / none] |
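
Two of the controls named in this table, parameterised queries (T-008) and a field allowlist against mass assignment (T-009), are small enough to show directly. This is a minimal sketch using an in-memory SQLite database; the `users` table, the allowlisted field names, and the helper names are hypothetical.

```python
import sqlite3

# Assumption: only these fields may be changed by the end user.
UPDATABLE_FIELDS = frozenset({"display_name", "email"})


def filter_update(payload: dict) -> dict:
    """Keep only allowlisted fields, silently dropping anything else (e.g. isAdmin)."""
    return {k: v for k, v in payload.items() if k in UPDATABLE_FIELDS}


def find_user_by_email(conn: sqlite3.Connection, email: str) -> list:
    """Look up a user with a bound parameter instead of string concatenation."""
    # The ? placeholder keeps the value separate from the SQL text,
    # so quote characters in `email` cannot change the query structure.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, is_admin INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.execute("INSERT INTO users (email) VALUES ('b@example.com')")
```

With a classic injection string such as `' OR '1'='1` as the email, the lookup simply finds no matching row, because the whole string is compared as a value.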
|
||||
|
||||
### Component: [Database]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-014 | I | Database exposed to internet (misconfigured security group) | Direct connection from outside VPC | [e.g. No public IP, security group restricts to app subnet] |
|
||||
| T-015 | I | Backup snapshots are unencrypted or accessible to the wrong accounts | Unencrypted snapshot, public S3 bucket | [e.g. Encrypted snapshots, private S3 bucket] |
|
||||
| T-016 | E | Privilege escalation via DB account with excessive permissions | App uses a superuser DB account | [e.g. Least-privilege DB role per service / none] |
|
||||
| T-017 | D | Runaway query or bulk delete causes data loss or outage | No query timeout or soft-delete | [e.g. Statement timeout, soft-delete on critical tables / none] |
|
||||
|
||||
### Component: [Internal Service-to-Service Communication]
|
||||
|
||||
| ID | Category | Threat | Attack vector | Existing control |
|
||||
|---|---|---|---|---|
|
||||
| T-018 | S | Rogue internal service impersonates a trusted service | No mutual authentication between services | [e.g. mTLS / service mesh / none] |
|
||||
| T-019 | I | Internal traffic sniffed on shared network | Unencrypted service-to-service calls | [e.g. Service mesh with TLS / none] |
|
||||
| T-020 | E | Compromised internal service calls privileged endpoints | No scoping on internal tokens | [e.g. Scoped service tokens / none] |
|
||||
|
||||
---
|
||||
|
||||
## 5. Risk Register
|
||||
|
||||
Score each threat: **Likelihood (1–5)** × **Impact (1–5)** = **Risk Score (1–25)**
|
||||
|
||||
Priority bands: Critical (20–25) | High (12–19) | Medium (6–11) | Low (1–5)
|
||||
|
||||
| ID | Threat summary | Likelihood | Impact | Score | Priority | Status |
|
||||
|---|---|---|---|---|---|---|
|
||||
| T-001 | JWT forgery — auth bypass | 2 | 5 | 10 | Medium | [Open / Mitigated / Accepted] |
|
||||
| T-002 | Session token replay | 3 | 4 | 12 | High | [Open / Mitigated / Accepted] |
|
||||
| T-007 | Privilege escalation via missing role check | 3 | 5 | 15 | High | [Open / Mitigated / Accepted] |
|
||||
| T-008 | SQL injection | 2 | 5 | 10 | Medium | [Open / Mitigated / Accepted] |
|
||||
| T-010 | IDOR — cross-user data access | 3 | 4 | 12 | High | [Open / Mitigated / Accepted] |
|
||||
| T-014 | Database exposed to internet | 1 | 5 | 5 | Low | [Open / Mitigated / Accepted] |
|
||||
| T-018 | Rogue internal service impersonation | 2 | 4 | 8 | Medium | [Open / Mitigated / Accepted] |
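
The scoring rule and priority bands above are simple enough to check mechanically, which helps keep the Score and Priority columns consistent as the register grows. A minimal sketch, with hypothetical function names `risk_score` and `priority`:

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Likelihood (1-5) multiplied by impact (1-5), giving a score from 1 to 25."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return likelihood * impact


def priority(score: int) -> str:
    """Map a score to the bands: Critical 20-25, High 12-19, Medium 6-11, Low 1-5."""
    if not 1 <= score <= 25:
        raise ValueError(f"score must be between 1 and 25, got {score}")
    if score >= 20:
        return "Critical"
    if score >= 12:
        return "High"
    if score >= 6:
        return "Medium"
    return "Low"
```

For example, T-007 with likelihood 3 and impact 5 scores 15, which falls in the High band, matching its row in the table.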
|
||||
|
||||
---
|
||||
|
||||
## 6. Mitigations Table
|
||||
|
||||
For every Open threat with priority Medium or above, define a specific mitigation.
|
||||
|
||||
| ID | Threat | Mitigation | Owner | Target date | Ticket |
|
||||
|---|---|---|---|---|---|
|
||||
| T-002 | Session token replay | Implement token rotation on refresh — invalidate old token server-side immediately | [Engineer name] | [Date] | [JIRA-123] |
|
||||
| T-007 | Privilege escalation | Add RBAC middleware to all `/admin/*` routes; write integration test for role boundary | [Engineer name] | [Date] | [JIRA-124] |
|
||||
| T-010 | IDOR | Add ownership assertion to all resource-fetching service methods; add to code review checklist | [Engineer name] | [Date] | [JIRA-125] |
|
||||
| T-011 | PII in logs | Audit logging calls for PII fields; add scrubbing to logger middleware | [Engineer name] | [Date] | [JIRA-126] |
|
||||
| T-018 | Rogue service impersonation | Enable mTLS via service mesh or issue scoped service tokens per service | [Engineer name] | [Date] | [JIRA-127] |
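
To show what "add scrubbing to logger middleware" (T-011) could look like as a ticket-sized change, here is a minimal sketch built on Python's standard `logging.Filter`. It assumes that email addresses and bearer tokens are the values to redact; the regular expressions are deliberately simple and would need tuning for real data, and the class name `ScrubbingFilter` is hypothetical.

```python
import logging
import re

# Simplified patterns for illustration; they are not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+")


def scrub(text: str) -> str:
    """Replace email addresses and bearer tokens with fixed markers."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return BEARER_RE.sub("Bearer [REDACTED_TOKEN]", text)


class ScrubbingFilter(logging.Filter):
    """Scrub the fully formatted message before any handler writes it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = scrub(record.getMessage())
        record.args = None  # arguments are already merged into msg
        return True
```

Attaching the filter to a handler (`handler.addFilter(ScrubbingFilter())`) applies it to every record that handler emits, including records from third-party libraries.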
|
||||
|
||||
---
|
||||
|
||||
## 7. Accepted Risks
|
||||
|
||||
Accepted risks are threats the team has decided not to mitigate right now. Every accepted risk must have a named owner and a review date.
|
||||
|
||||
| ID | Threat | Reason for acceptance | Risk owner | Review date |
|
||||
|---|---|---|---|---|
|
||||
| T-014 | Database public exposure | Database has no public IP assigned; control already in place — accepted as low likelihood | [Name] | [Date] |
|
||||
| [ID] | [Threat] | [Reason — e.g. "Effort exceeds risk at current scale; re-evaluate at 10× traffic"] | [Name] | [Date] |
|
||||
|
||||
---
|
||||
|
||||
## 8. Security Controls Summary
|
||||
|
||||
| Control | Type | Covers threats | Implemented |
|
||||
|---|---|---|---|
|
||||
| JWT RS256 with 15-min expiry | Preventive | T-001, T-002 | [Yes / Partial / No] |
|
||||
| RBAC middleware on all routes | Preventive | T-007, T-020 | [Yes / Partial / No] |
|
||||
| Parameterised queries (ORM) | Preventive | T-008 | [Yes / Partial / No] |
|
||||
| Rate limiting (100 req/min per IP) | Preventive | T-006, T-012 | [Yes / Partial / No] |
|
||||
| CloudTrail / audit logging | Detective | T-004, T-013 | [Yes / Partial / No] |
|
||||
| Automated SAST in CI pipeline | Detective | T-008, T-009 | [Yes / Partial / No] |
|
||||
| Encrypted backups + private S3 | Preventive | T-015 | [Yes / Partial / No] |
|
||||
| Least-privilege DB role | Preventive | T-016 | [Yes / Partial / No] |
|
||||
| Incident response runbook | Corrective | All | [Yes / Partial / No] |
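
The rate-limiting control in the table (100 requests per minute per IP) can be sketched as a fixed-window counter. This is an in-process illustration only, assuming a single instance: a production limiter would normally live in a shared store or at the gateway, and would evict old windows. The class name `FixedWindowLimiter` is hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable


class FixedWindowLimiter:
    """Allow at most `limit` requests per client per fixed time window."""

    def __init__(
        self,
        limit: int = 100,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        # Counts keyed by (client, window index). Old windows are never
        # evicted in this sketch, so memory grows over time.
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, client_ip: str) -> bool:
        """Record one request and report whether it is within the limit."""
        window = int(self._clock() // self.window_seconds)
        key = (client_ip, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True
```

A per-IP limit on its own does little against credential stuffing spread across many addresses (T-006), so the threat model should note which additional signal, such as per-account attempt counts, covers that case.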
|
||||
|
||||
---
|
||||
|
||||
## 9. Review Cadence
|
||||
|
||||
| Trigger | Action |
|
||||
|---|---|
|
||||
| Every 6 months | Full threat model review — update risk scores, close mitigated items |
|
||||
| Major architecture change | Update trust boundary diagram and re-run STRIDE for new components |
|
||||
| Security incident | Review relevant threats; add any newly discovered vectors |
|
||||
| New data type or classification | Add the new assets to the register; re-run STRIDE for the components that handle them |
|
||||
| Third-party dependency added | Assess supply chain threats for the new dependency |
|
||||
|
||||
**Next scheduled review:** [Date]
|
||||
**Review owner:** [Name / Security lead]
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every trust boundary is named and its authentication mechanism is specified — not left as "TBD"
|
||||
- [ ] Every Critical and High risk in the risk register has a mitigation with a named owner and a target date
|
||||
- [ ] Every accepted risk has a named risk owner and a review date — no unowned accepted risks
|
||||
- [ ] The asset register includes data sensitivity levels and at least one entry for credential material
|
||||
- [ ] STRIDE analysis covers all major components — not just the API layer
|
||||
- [ ] Mitigation actions are specific enough to become a ticket (not "improve security")
|
||||
- [ ] The ASCII trust boundary diagram matches the architecture description provided
|
||||
@@ -0,0 +1,292 @@
|
||||
---
|
||||
name: service-catalog-entry
|
||||
description: "Write a service catalog entry for a microservice or internal platform service — covering service identity, purpose, architecture context, SLAs, API contract summary, data classification, dependencies, operational runbooks, and known limitations. Use when asked to document a service for an internal developer portal, write a service README for a platform catalog, create a service overview page, or onboard a new service to a service registry. Produces a complete service catalog entry suitable for an internal developer portal or wiki."
|
||||
---
|
||||
|
||||
# Service Catalog Entry Skill
|
||||
|
||||
Produce a complete service catalog entry for a microservice or internal platform service — giving any engineer at the company the context they need to understand what the service does, how to depend on it, what its reliability characteristics are, and where to go when something goes wrong. A well-written catalog entry eliminates "who owns this?" and "is this safe to use?" questions that slow down teams depending on shared services.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** — the canonical identifier used in code, monitoring, and deployments
|
||||
- **Team and owner** — team name, tech lead name, and on-call contact
|
||||
- **Architecture overview** — what the service does, what calls it, and what it calls
|
||||
- **SLA requirements** — availability target, latency SLO, support tier, and maintenance window
|
||||
- **Key APIs** — the most important endpoints other teams use (method, path, brief description)
|
||||
- **Data handled** — what data the service stores or processes, sensitivity classification, retention
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Service Catalog: [Service Name]
|
||||
|
||||
> **[One sentence — what this service does for consumers, in plain language]**
|
||||
>
|
||||
> *e.g. "The Payments Service processes charge, refund, and subscription billing events for all Acme products."*
|
||||
|
||||
---
|
||||
|
||||
## Identity
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| **Service name** | `[service-name]` |
|
||||
| **Canonical repository** | [https://github.com/[org]/[repo]] |
|
||||
| **Owner team** | [Team name] |
|
||||
| **Tech lead** | [Name] ([Slack: @handle]) |
|
||||
| **On-call rotation** | [PagerDuty service link] |
|
||||
| **Slack channel** | `#[team-channel]` |
|
||||
| **Support tier** | [Tier 1 — 24/7 / Tier 2 — business hours / Tier 3 — best effort] |
|
||||
| **Status** | [Active / Deprecated / Sunset date: YYYY-MM-DD] |
|
||||
| **Language / runtime** | [e.g. Go 1.22 / Python 3.12 / Node 20] |
|
||||
| **Deployment platform** | [Kubernetes / ECS / Lambda / etc.] |
|
||||
| **Environments** | [Production: URL] / [Staging: URL] / [Dev: URL] |
|
||||
|
||||
---
|
||||
|
||||
## What It Does
|
||||
|
||||
[Two to three paragraphs in plain language — no jargon or acronyms without explanation.]
|
||||
|
||||
[Paragraph 1: The business problem this service solves. What would break or be missing if this service did not exist?]
|
||||
|
||||
[Paragraph 2: How it works at a high level — the main processing model (e.g. request/response API, event-driven consumer, batch processor), what triggers it, and what it produces.]
|
||||
|
||||
[Paragraph 3: What this service is NOT responsible for — the explicit boundaries. This prevents other teams from building incorrect assumptions about scope.]
|
||||
|
||||
---
|
||||
|
||||
## Architecture Context
|
||||
|
||||
### System Diagram
|
||||
|
||||
```
[Upstream callers]         [This Service]         [Downstream dependencies]

[Web App] ──────────→                         ──→ [Primary Database — PostgreSQL]
[Mobile API] ────────→     [Service Name]     ──→ [Cache — Redis]
[Partner API] ────────→   (Port 8080/gRPC)    ──→ [Message Queue — Kafka/SQS]
                                              ──→ [External Service / API]
                                 ↓ emits events to
                          [Event Bus / SNS]
                                 ↓ consumed by
                          [Downstream Service A]
                          [Downstream Service B]
```
|
||||
|
||||
### Who Depends on This Service
|
||||
|
||||
| Caller | How they use it | Contact |
|
||||
|---|---|---|
|
||||
| [Service / Team A] | [e.g. "Calls POST /charges to initiate payments"] | [Slack: #team-a] |
|
||||
| [Service / Team B] | [e.g. "Subscribes to payment.completed events via Kafka topic"] | [Slack: #team-b] |
|
||||
| [Service / Team C] | [e.g. "Calls GET /subscriptions for billing status"] | [Slack: #team-c] |
|
||||
|
||||
### What This Service Depends On
|
||||
|
||||
| Dependency | Type | Criticality | Their on-call |
|
||||
|---|---|---|---|
|
||||
| [PostgreSQL instance] | Database | Critical — all writes fail without it | [DBA team: #db-oncall] |
|
||||
| [Redis cluster] | Cache | High — latency degrades without it | [Infra team: #infra-oncall] |
|
||||
| [Kafka cluster] | Message queue | High — async events queue | [Infra team: #infra-oncall] |
|
||||
| [Stripe API] | External API | Critical — payment processing fails | [vendor status: status.stripe.com] |
|
||||
| [Auth Service] | Internal service | Critical — all auth fails | [Auth team: #auth-oncall] |
|
||||
|
||||
---
|
||||
|
||||
## Service Level Agreement
|
||||
|
||||
### Availability and Latency
|
||||
|
||||
| SLO | Target | Measurement window | Error budget |
|
||||
|---|---|---|---|
|
||||
| Availability | [99.9%] | Rolling 30 days | [43 min/month] |
|
||||
| p50 latency (key endpoints) | < [50] ms | Rolling 24 hours | — |
|
||||
| p99 latency (key endpoints) | < [500] ms | Rolling 24 hours | — |
|
||||
| p99.9 latency (key endpoints) | < [2000] ms | Rolling 24 hours | — |
|
||||
| Error rate | < [0.1]% | Rolling 1 hour | — |
|
||||
|
||||
**SLO dashboard:** [Link to monitoring dashboard]
|
||||
**Current error budget remaining:** [Link to SLO dashboard or inline value]
|
||||
|
||||
### Support Tiers
|
||||
|
||||
| Tier | Scope | Response time | Resolution time |
|
||||
|---|---|---|---|
|
||||
| P1 — Service down | All authenticated requests failing | 15 minutes | 1 hour |
|
||||
| P2 — Significant degradation | Error rate >1% or p99 >2× SLO | 30 minutes | 4 hours |
|
||||
| P3 — Minor issues | Non-critical endpoints degraded | Next business day | 3 business days |
|
||||
| Feature requests / bugs | Via standard ticket process | [Ticket SLA] | Per roadmap |
|
||||
|
||||
**To raise an incident:** Page via [PagerDuty service link] or post in `#incidents`.
|
||||
**To raise a feature request or bug:** File a ticket in [JIRA project / GitHub repo Issues].
|
||||
|
||||
### Maintenance Windows
|
||||
|
||||
- **Planned downtime:** [e.g. "Sundays 02:00–04:00 UTC — advance notice posted to #[team-channel] 48h before"]
|
||||
- **Deployment window:** [e.g. "Weekdays 10:00–16:00 UTC — no deploys on Fridays or the day before a public holiday"]
|
||||
- **Breaking changes notice:** [e.g. "Minimum 30 days notice for breaking API changes — see versioning policy below"]
|
||||
|
||||
---
|
||||
|
||||
## API Contract
|
||||
|
||||
### Authentication
|
||||
|
||||
All API calls require: [e.g. "Bearer token via Authorization header. Tokens are issued by the Auth Service (`/api/v1/token`)"]
|
||||
|
||||
```
Authorization: Bearer [jwt-token]
Content-Type: application/json
```
|
||||
|
||||
### Base URL
|
||||
|
||||
| Environment | Base URL |
|
||||
|---|---|
|
||||
| Production | `https://[service-name].internal.[company].com` |
|
||||
| Staging | `https://[service-name].staging.[company].com` |
|
||||
| Local development | `http://localhost:[port]` |
|
||||
|
||||
### Key Endpoints
|
||||
|
||||
| Method | Path | Description | Auth required | Rate limit |
|
||||
|---|---|---|---|---|
|
||||
| `GET` | `/health` | Liveness and readiness check | No | None |
|
||||
| `GET` | `/api/v1/[resource]` | [Description — e.g. "List resources for the authenticated user"] | Yes | [100 req/min] |
|
||||
| `GET` | `/api/v1/[resource]/:id` | [Description — e.g. "Get a single resource by ID"] | Yes | [500 req/min] |
|
||||
| `POST` | `/api/v1/[resource]` | [Description — e.g. "Create a new resource"] | Yes | [50 req/min] |
|
||||
| `PUT` | `/api/v1/[resource]/:id` | [Description — e.g. "Update an existing resource"] | Yes | [50 req/min] |
|
||||
| `DELETE` | `/api/v1/[resource]/:id` | [Description] | Yes | [20 req/min] |
|
||||
|
||||
**Full API documentation:** [OpenAPI/Swagger spec URL] | [Postman collection URL]
|
||||
|
||||
### Versioning Policy
|
||||
|
||||
- API version is in the URL path (`/api/v1/`, `/api/v2/`)
|
||||
- Minor additions (new optional fields, new endpoints) are non-breaking — no version bump
|
||||
- Breaking changes (removed fields, changed types, authentication changes) require a new major version
|
||||
- Deprecated versions are supported for [90 days] after the successor reaches GA
|
||||
- Deprecation notices are posted to `#[team-channel]` and emailed to registered consumers
|
||||
|
||||
### Error Response Format
|
||||
|
||||
```json
{
  "error": {
    "code": "[ERROR_CODE]",
    "message": "[Human-readable description]",
    "request_id": "[UUID — include in support tickets]",
    "details": {}
  }
}
```
|
||||
|
||||
Common error codes:
|
||||
|
||||
| HTTP status | Error code | Meaning |
|
||||
|---|---|---|
|
||||
| 400 | `INVALID_REQUEST` | Request body or parameters fail validation |
|
||||
| 401 | `UNAUTHENTICATED` | Missing or invalid auth token |
|
||||
| 403 | `FORBIDDEN` | Token valid but lacks permission for this resource |
|
||||
| 404 | `NOT_FOUND` | Resource does not exist |
|
||||
| 409 | `CONFLICT` | Duplicate resource or state conflict |
|
||||
| 422 | `UNPROCESSABLE_ENTITY` | Request is well-formed but violates business rules |
|
||||
| 429 | `RATE_LIMITED` | Too many requests — back off and retry |
|
||||
| 500 | `INTERNAL_ERROR` | Unexpected server error — include request_id in support ticket |
|
||||
| 503 | `SERVICE_UNAVAILABLE` | Downstream dependency unavailable — retry with backoff |
|
||||
|
||||
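A consumer of this contract might handle the error envelope roughly as follows. This is a minimal sketch, not the service's client library: `parse_error`, `should_retry`, and `backoff_seconds` are hypothetical helper names, and the backoff base (0.5 s) and cap (30 s) are assumed values. Only the envelope fields and the two retryable codes come from the tables above.

```python
import json
import random

# Codes the table above marks as safe to retry after backing off.
RETRYABLE_CODES = {"RATE_LIMITED", "SERVICE_UNAVAILABLE"}


def parse_error(body: str) -> dict:
    """Extract the fields of the error envelope from a JSON response body."""
    error = json.loads(body)["error"]
    return {
        "code": error["code"],
        "message": error["message"],
        "request_id": error.get("request_id"),  # quote this in support tickets
        "details": error.get("details", {}),
    }


def should_retry(error: dict) -> bool:
    """Retry only on rate limiting or an unavailable dependency."""
    return error["code"] in RETRYABLE_CODES


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; attempt counts from 0.

    The base and cap values are assumptions for illustration.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Jitter is used here so that many consumers hitting a `429` at the same moment do not all retry in lockstep.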
### Events Published (if event-driven)
|
||||
|
||||
| Event | Topic / Queue | Schema | Published when |
|
||||
|---|---|---|---|
|
||||
| `[resource].created` | `[kafka-topic / sns-arn]` | [Schema URL] | [When a new resource is created] |
|
||||
| `[resource].updated` | `[kafka-topic / sns-arn]` | [Schema URL] | [When a resource is modified] |
|
||||
| `[resource].deleted` | `[kafka-topic / sns-arn]` | [Schema URL] | [When a resource is deleted] |
|
||||
|
||||
---
|
||||
|
||||
## Data Classification
|
||||
|
||||
| Data element | Sensitivity | Stored in | Retention | Encrypted at rest |
|
||||
|---|---|---|---|---|
|
||||
| [User PII — e.g. email, name] | [PII / Restricted] | [PostgreSQL `users` table] | [Until account deletion] | Yes |
|
||||
| [Financial data — e.g. card last 4] | [PCI / Highly restricted] | [PostgreSQL `payment_methods` table] | [7 years per regulations] | Yes — field-level encryption |
|
||||
| [Operational logs] | [Internal] | [CloudWatch / Datadog] | [90 days] | Yes (at rest, not searched) |
|
||||
| [Anonymised analytics] | [Public] | [Data warehouse] | [Indefinite] | Yes |
|
||||
|
||||
**Data residency:** [e.g. "All data stored in us-east-1. EU customer data stored in eu-west-1 per GDPR requirements."]
|
||||
**Compliance scope:** [e.g. SOC 2 Type II / PCI DSS Level 2 / HIPAA / GDPR]
|
||||
**Data access policy:** [e.g. "Production database access requires [approval process]. Access logged and reviewed quarterly."]
|
||||
|
||||
---
|
||||
|
||||
## Operational Runbooks
|
||||
|
||||
| Runbook | Location | Use when |
|
||||
|---|---|---|
|
||||
| On-call runbook | [Wiki / GitHub link] | Responding to PagerDuty alerts |
|
||||
| Deployment runbook | [Wiki / GitHub link] | Deploying a new version to production |
|
||||
| Database migration runbook | [Wiki / GitHub link] | Running schema migrations |
|
||||
| Rollback runbook | [Wiki / GitHub link] | Rolling back a bad deploy |
|
||||
| Incident response runbook | [Wiki / GitHub link] | Declaring and managing incidents |
|
||||
| Disaster recovery plan | [Wiki / GitHub link] | Zone/region failure or data loss |
|
||||
|
||||
**Monitoring dashboards:**
|
||||
|
||||
| Dashboard | Link | Use it for |
|
||||
|---|---|---|
|
||||
| Service overview | [Datadog / Grafana link] | Error rate, latency, throughput |
|
||||
| Infrastructure | [Link] | CPU, memory, pod health |
|
||||
| Database | [Link] | Query performance, connection pool |
|
||||
| SLO / error budget | [Link] | Budget burn rate, availability |
|
||||
| Dependency health | [Link] | Downstream dependency status |
|
||||
|
||||
---
|
||||
|
||||
## Known Limitations
|
||||
|
||||
Document limitations honestly — this section prevents other teams from building on incorrect assumptions.
|
||||
|
||||
| Limitation | Impact | Workaround | Planned fix |
|
||||
|---|---|---|---|
|
||||
| [e.g. No bulk write API — items must be created one at a time] | [Slow for large imports — N HTTP calls required] | [Use the batch import CLI tool for >100 items] | [Bulk API in Q3 — ticket: [URL]] |
|
||||
| [e.g. List endpoints have a maximum page size of 100] | [Cannot retrieve more than 100 items in a single call] | [Paginate using `cursor` parameter] | [No current plan to increase — by design] |
|
||||
| [e.g. Rate limits are per-token, not per-service] | [A high-traffic consumer can exhaust the limit for every other consumer sharing the same token] | [Request a dedicated service-account token] | [Per-service rate limits on the roadmap] |
|
||||
| [e.g. Eventual consistency on read-after-write for list endpoints] | [Record may not appear in list immediately after creation (<500ms lag)] | [Use GET /:id to confirm creation; do not rely on list for immediate consistency] | [Read-your-writes consistency available via `?consistent=true` — in progress] |
|
||||
|
||||
---
|
||||
|
||||
## Getting Started
|
||||
|
||||
**To start using this service:**
|
||||
|
||||
1. Request access: [Link to access request form or instructions]
|
||||
2. Get your service account credentials: [Link to process]
|
||||
3. Read the API docs: [OpenAPI spec URL]
|
||||
4. Try the sandbox environment: `https://[service-name].sandbox.[company].com`
|
||||
5. Join the consumer Slack channel: `#[service-name]-consumers`
|
||||
|
||||
**Client libraries (if available):**
|
||||
|
||||
| Language | Package | Installation |
|
||||
|---|---|---|
|
||||
| [Python] | [`[package-name]`] | `pip install [package-name]` |
|
||||
| [Go] | [`github.com/[org]/[package]`] | `go get github.com/[org]/[package]` |
|
||||
| [TypeScript/JS] | [`@[org]/[package]`] | `npm install @[org]/[package]` |
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] "What It Does" is written without jargon — a new engineer from another team can understand it in under 2 minutes
|
||||
- [ ] SLO targets are specific numbers agreed with stakeholders — not aspirational or copied from a template
|
||||
- [ ] All direct upstream consumers are listed in the "Who Depends on This" table — no omissions
|
||||
- [ ] API error codes are accurate and tested — not aspirational documentation
|
||||
- [ ] Known limitations are honest — nothing is glossed over to make the service look better than it is
|
||||
- [ ] All runbook links are live — not broken references or TODO placeholders
|
||||
- [ ] Data classification includes retention period and encryption status — not just sensitivity level
|
||||
- [ ] The entry has been reviewed by at least one consumer team to confirm it matches their experience of the service
|
||||
@@ -0,0 +1,231 @@
|
||||
---
|
||||
name: slo-error-budget
|
||||
description: "Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence."
|
||||
---
|
||||
|
||||
# SLO and Error Budget Skill
|
||||
|
||||
Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
|
||||
|
||||
A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and brief description of what it does
|
||||
- **Primary users** — who depends on this service and how
|
||||
- **User-facing interactions** to protect — e.g. API calls, page loads, transactions
|
||||
- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available)
|
||||
- **Existing on-call setup** — who responds to alerts?
|
||||
- **Deployment frequency** — how often does the team ship?
|
||||
- **Any existing SLAs** with customers — these constrain SLO targets
|
||||
|
||||
## Key Definitions
|
||||
|
||||
Always establish these before writing the SLO:
|
||||
|
||||
| Term | Definition |
|
||||
|---|---|
|
||||
| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
|
||||
| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
|
||||
| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
|
||||
| **Error budget** | The allowed unreliability, 100% minus the SLO target — the amount of failure the service can absorb in a window before breaching the SLO |
|
||||
| **Burn rate** | How fast the error budget is being consumed |
|
||||
|
||||
---
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# SLO Document: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Owner:** [Name / role] | **Approved by:** [Name]
|
||||
**Effective date:** [Date] | **Review date:** [Date + 3 months]
|
||||
**Version:** [1.0]
|
||||
|
||||
---
|
||||
|
||||
## Why This SLO Exists
|
||||
|
||||
[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
|
||||
|
||||
---
|
||||
|
||||
## Service Overview
|
||||
|
||||
**What this service does:** [One sentence]
|
||||
**Who depends on it:** [Internal teams / external customers / both — describe]
|
||||
**Critical user journeys protected by this SLO:**
|
||||
1. [Journey 1 — e.g. "User completes a payment"]
|
||||
2. [Journey 2]
|
||||
3. [Journey 3]
|
||||
|
||||
---
|
||||
|
||||
## SLIs — What We Measure
|
||||
|
||||
Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
|
||||
|
||||
### SLI 1: [Name — e.g. Request Success Rate]
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| **What it measures** | [e.g. "% of API requests that return a non-5xx response"] |
|
||||
| **Good event definition** | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
|
||||
| **Bad event definition** | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
|
||||
| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
|
||||
| **Measured over** | Rolling 28-day window |
|
||||
| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
|
||||
|
||||
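The good/bad event split in the example above can be sketched in a few lines of Python. The `Request` type and the sample threshold are assumptions drawn from the example values in the table (non-5xx status, 500 ms), not a prescribed implementation. A real SLI would be computed by the monitoring tool named under "Measurement source".

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 500  # assumed from the example good-event definition


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request


def is_good_event(req: Request) -> bool:
    """Good = non-5xx response completed within the latency threshold."""
    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS


def sli_percent(requests: list[Request]) -> float:
    """Share of good events as a percentage of all counted events."""
    if not requests:
        raise ValueError("no events to measure")
    good = sum(1 for r in requests if is_good_event(r))
    return 100.0 * good / len(requests)
```

Excluded traffic (for example health checks) should be filtered out before calling `sli_percent`, so that exclusions are applied in one visible place.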
### SLI 2: [Name — e.g. Latency]
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| **What it measures** | [e.g. "P99 response time for the /checkout endpoint"] |
|
||||
| **Good event definition** | [e.g. "Request completes in ≤500ms at P99"] |
|
||||
| **Bad event definition** | [e.g. "Request takes >500ms at P99"] |
|
||||
| **Measurement source** | [Source] |
|
||||
| **Measured over** | Rolling 28-day window |
|
||||
| **Exclusions** | [Any exclusions] |
|
||||
|
||||
### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]
|
||||
|
||||
[Same structure]
|
||||
|
||||
---
|
||||
|
||||
## SLO Targets
|
||||
|
||||
| SLI | Target | Window | Error Budget |
|
||||
|---|---|---|---|
|
||||
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
|
||||
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
|
||||
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
|
||||
|
||||
**How targets were set:**
|
||||
- Historical baseline (last 90 days): [X]%
|
||||
- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
|
||||
- Rationale: [1–2 sentences]
|
||||
|
||||
**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive — it discourages feature development and doesn't reflect user reality]
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Calculation
|
||||
|
||||
**For SLI 1 ([Name]), at [X]% target:**
|
||||
|
||||
```
Error budget = (100% - SLO target) × measurement window
             = (100% - [X]%) × 28 days × 24 hours × 60 minutes
             = [Y]% × [Z total minutes]
             = [N] minutes of allowed failure per 28-day window
```
|
||||
|
||||
**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
|
||||
|
||||
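The calculation above can be checked with a short script. This is a sketch under the assumption of a time-based budget (minutes of allowed failure). The function name `error_budget_minutes` is illustrative, and the three targets are sample values only.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 28) -> float:
    """Minutes of allowed failure in the window for a given SLO target."""
    if not 0 < slo_target_percent < 100:
        raise ValueError("SLO target must be between 0 and 100 (exclusive)")
    total_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (100 - slo_target_percent) / 100 * total_minutes


# Sample targets, printed for comparison.
for target in (99.0, 99.5, 99.9):
    print(f"{target}% over 28 days: {error_budget_minutes(target):.1f} minutes")
```

For a request-based SLI the same ratio applies to event counts rather than minutes: multiply `(100 - target) / 100` by the number of valid requests in the window.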
---
|
||||
|
||||
## Burn Rate Alerts
|
||||
|
||||
Burn rate = how fast the error budget is being consumed relative to the budget window.
|
||||
A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.
|
||||
|
||||
| Alert | Burn rate | Window | Severity | Response |
|
||||
|---|---|---|---|---|
|
||||
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately — budget exhausted in <2 days |
|
||||
| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days |
|
||||
| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting |
|
||||
| Info | >1× | 28 days | Info | Log only — budget on track to exhaust before the end of the window |
|
||||
|
||||
**Alert implementation:** [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
|
||||
|
||||
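To sanity-check alert thresholds, burn rate and time to exhaustion can be computed directly. The sketch below assumes a constant burn rate and a full budget at the start. The function names are illustrative and do not correspond to any monitoring tool's API.

```python
def burn_rate(observed_error_rate: float, slo_target_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    observed_error_rate is a fraction, e.g. 0.014 for 1.4% of requests failing.
    """
    allowed_error_rate = (100 - slo_target_percent) / 100
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate
```

With a 28-day window, a burn rate of 14 spends a full budget in 2 days and a burn rate of 6 spends it in a little under 5 days, which is why higher rates warrant shorter detection windows.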
---
|
||||
|
||||
## Error Budget Policy
|
||||
|
||||
This policy defines what to do with the error budget — both when it's healthy and when it's burning.
|
||||
|
||||
### When budget is healthy (>50% remaining)
|
||||
|
||||
- Feature development and deployments proceed at normal pace
|
||||
- The team may take on riskier experiments
|
||||
- Reliability improvements are scheduled but not urgent
|
||||
|
||||
### When budget is at risk (25–50% remaining)
|
||||
|
||||
- Deployment frequency reduced — team ships only well-tested changes
|
||||
- One reliability improvement added to current sprint
|
||||
- Weekly error budget review added to team standup
|
||||
|
||||
### When budget is nearly exhausted (<25% remaining)
|
||||
|
||||
- Feature work paused in favour of reliability improvements
|
||||
- No new deployments without explicit on-call approval
|
||||
- Daily review of error budget burn rate
|
||||
- CSM / support notified to manage customer expectations
|
||||
|
||||
### When budget is exhausted (0% remaining — SLO breached)
|
||||
|
||||
- All feature work stops
|
||||
- On-call engineer and engineering manager notified immediately
|
||||
- Post-incident review (PIR) required within 5 business days
|
||||
- SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed
|
||||
|
||||
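The four policy tiers above can be encoded as a simple lookup so that dashboards or deploy gates apply them consistently. This is a sketch only: the function name and return strings are hypothetical, and treating exactly 25% and exactly 50% as "at risk" is an assumption about the boundaries.

```python
def budget_policy_state(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = untouched, 0.0 = spent) to a policy tier."""
    if remaining_fraction <= 0:
        return "exhausted"          # SLO breached: all feature work stops
    if remaining_fraction < 0.25:
        return "nearly exhausted"   # feature work paused, deploys need approval
    if remaining_fraction <= 0.50:
        return "at risk"            # reduced deploy frequency
    return "healthy"                # normal pace
```

A deploy pipeline could call this with the current remaining budget and block releases whenever the result is `"nearly exhausted"` or `"exhausted"`.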
---
|
||||
|
||||
## Dashboard and Reporting
|
||||
|
||||
**SLO dashboard:** [Link to Datadog / Grafana / etc. dashboard]
|
||||
|
||||
**Metrics exposed:**
|
||||
- Current SLO compliance (rolling 28-day)
|
||||
- Error budget remaining (% and minutes)
|
||||
- Burn rate (current and trend)
|
||||
- Incident count and MTTR this window
|
||||
|
||||
**Reporting cadence:**
|
||||
|
||||
| Audience | Frequency | Format |
|
||||
|---|---|---|
|
||||
| Engineering team | Weekly | Slack summary — #[service]-slo |
|
||||
| Engineering manager | Monthly | SLO review meeting |
|
||||
| Stakeholders / customers | Quarterly | SLO compliance summary |
|
||||
|
||||
---
|
||||
|
||||
## Exclusions and Edge Cases
|
||||
|
||||
**Planned maintenance:** Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
|
||||
|
||||
**Dependency failures:** If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
|
||||
|
||||
**Force majeure:** [Policy for cloud provider outages, major infrastructure events]
|
||||
|
||||
---
|
||||
|
||||
## SLO Review Cadence
|
||||
|
||||
| Review | When | Who | Output |
|
||||
|---|---|---|---|
|
||||
| Error budget review | Weekly | Team | Budget health check — adjust if burning fast |
|
||||
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
|
||||
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? |
|
||||
|
||||
**When to change the SLO target:**
|
||||
- Historical baseline has improved significantly and target no longer reflects real reliability
|
||||
- User feedback indicates the target is misaligned with what users actually experience
|
||||
- The SLO is being gamed (metric is healthy but users are unhappy)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] SLIs are user-facing — they measure what users experience, not internal system metrics
|
||||
- [ ] Good and bad events are precisely defined — no ambiguity about what counts
|
||||
- [ ] Targets are based on historical data, not aspirational round numbers
|
||||
- [ ] Error budget policy has clear triggers and clear actions — not "discuss as a team"
|
||||
- [ ] Burn rate alerts have different windows to catch both fast burns and slow burns
|
||||
- [ ] Exclusions are documented so they don't silently inflate the SLO number
|
||||
@@ -0,0 +1,263 @@
|
||||
---
|
||||
name: sprint-velocity-analysis
|
||||
description: "Analyze sprint velocity data and produce an engineering team health report covering delivery trends, capacity utilization, and improvement recommendations. Use when asked to analyze sprint velocity, review team delivery health, identify delivery risks, or produce a retrospective data analysis. Produces a velocity trend analysis, health diagnosis table, top improvement recommendations with implementation steps, and a next-sprint capacity forecast."
|
||||
---
|
||||
|
||||
# Sprint Velocity Analysis
|
||||
|
||||
Analyze sprint velocity data to produce an honest engineering team health report. The goal is not to generate optimistic-looking charts — it is to surface delivery patterns, identify dysfunction early, and give the team and their manager actionable recommendations. Look for: velocity trends (improving, declining, flat, erratic), story point calibration consistency, carry-over patterns that indicate chronic over-commitment, and capacity-related signals. Produce text-based trend visualizations, a health diagnosis, and specific improvement recommendations with measurable targets.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Sprint history** — for each sprint: sprint name/number, committed story points, completed story points, and number of items carried over to the next sprint; a minimum of 6–8 sprints
|
||||
- **Team size and any changes** — current team size and any additions or departures during the data window
|
||||
- **Known disruptions** — holidays, company all-hands, on-call incidents, or other events that affected specific sprints
|
||||
- **Cycle time data (optional)** — if available, p50 and p90 cycle time per sprint (time from start to done)
|
||||
- **Definition of Done** — what "completed" means for this team (merged to main? deployed to prod? accepted by PO?)
|
||||
|
||||
If cycle time data is not provided, omit that section and note it as a recommended data source to add.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Sprint Velocity Analysis: [Team Name]
|
||||
|
||||
**Analysis period:** Sprint [N] through Sprint [N+7] ([Date range])
|
||||
**Team size:** [X engineers] ([note any changes during period])
|
||||
**Report date:** [Date]
|
||||
**Data source:** [Where this data came from — Jira, Linear, spreadsheet, etc.]
|
||||
|
||||
---
|
||||
|
||||
## Velocity Trend
|
||||
|
||||
### Raw Data
|
||||
|
||||
| Sprint | Committed | Completed | Completion Rate | Carried Over | Notes |
|
||||
|--------|-----------|-----------|----------------|--------------|-------|
|
||||
| [Sprint N] | [X pts] | [X pts] | [X%] | [X pts / X items] | [disruption or context] |
|
||||
| [Sprint N+1] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+2] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+3] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+4] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+5] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+6] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| [Sprint N+7] | [X pts] | [X pts] | [X%] | [X pts / X items] | |
|
||||
| **Average** | **[X pts]** | **[X pts]** | **[X%]** | **[X pts]** | |
|
||||
|
||||
### Velocity Chart (Completed Points per Sprint)
|
||||
|
||||
```
|
||||
Points
|
||||
60 |
|
||||
55 | ●
|
||||
50 | ● ●
|
||||
45 | ● ● ●
|
||||
40 | ● ●
|
||||
35 |
|
||||
30 |
|
||||
+--+--+--+--+--+--+--+--
|
||||
N N+1 N+2 N+3 N+4 N+5 N+6 N+7
|
||||
Sprint
|
||||
|
||||
● = Completed points — = Average ([X pts])
|
||||
```
|
||||
|
||||
Generate this chart using ASCII characters based on the actual data provided. Scale the Y-axis to the data range. Plot completed (not committed) points. Mark the average as a dashed line.
|
||||
|
||||
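One way to generate such a chart from real data is sketched below. It is an illustration under stated assumptions: the function name `velocity_chart` is hypothetical, values are snapped to the nearest 5-point row, the average is drawn as a dashed row at its nearest 5-point level (so it is approximate), and the column labels assume fewer than ten sprints.

```python
def velocity_chart(completed: list[float], step: int = 5) -> str:
    """Render completed points per sprint as a text chart."""
    if not completed:
        raise ValueError("need at least one sprint")

    def snap(value: float) -> int:
        # Round a value to the nearest chart row.
        return round(value / step) * step

    avg = sum(completed) / len(completed)
    levels = [snap(points) for points in completed]
    top, bottom = max(levels) + step, min(levels) - step
    avg_level = snap(avg)

    rows = ["Points"]
    for level in range(top, bottom - 1, -step):
        filler = "-" if level == avg_level else " "  # dashed row marks the average
        cells = ["●" if lv == level else filler for lv in levels]
        rows.append(f"{level:>4} | " + (filler * 3).join(cells))
    rows.append("     +" + "-" * (4 * len(completed)))
    labels = ["N" if i == 0 else f"N+{i}" for i in range(len(completed))]
    rows.append("       " + "".join(f"{label:<4}" for label in labels))
    rows.append(f"● = Completed points   - = Average ({avg:.1f} pts)")
    return "\n".join(rows)


print(velocity_chart([40, 45, 40, 45, 50, 45, 50, 55]))
```

Each sprint occupies a four-character column, so the points line up with the labels on the X-axis when viewed in a monospaced font.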
### Trend Diagnosis
|
||||
|
||||
| Metric | Value | Interpretation |
|
||||
|--------|-------|----------------|
|
||||
| Average velocity | [X pts/sprint] | [Baseline for planning] |
|
||||
| Velocity std deviation | [±X pts] | [Low < 15% of avg = stable; High > 25% = erratic] |
|
||||
| Trend direction | [Improving / Flat / Declining / Erratic] | [Average of the 3 most recent sprints vs. average of the first 3 sprints in the window] |
|
||||
| Average completion rate | [X%] | [Healthy: 80–95%; < 75% = chronic over-commitment] |
|
||||
| Carry-over rate | [X% of committed points carried over per sprint] | [Healthy: < 15%; > 25% = systemic issue] |
|
||||
| Sprints with completion rate < 75% | [X of 8 sprints] | [> 3 of 8 = structural problem, not noise] |
|
||||
|
||||
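The metrics in the table above can be computed from the raw sprint data along these lines. This is a rough sketch: `diagnose_velocity` is a hypothetical name, the population standard deviation is one reasonable choice among several, and the 5% tolerance separating "Flat" from "Improving" or "Declining" is an assumption. Only the 25% "erratic" and 75% completion thresholds come from the table.

```python
from statistics import mean, pstdev

TREND_TOLERANCE = 0.05  # hypothetical: changes within 5% count as "Flat"


def diagnose_velocity(committed: list[float], completed: list[float]) -> dict:
    """Compute trend-diagnosis metrics from per-sprint point totals.

    Assumes every sprint has a non-zero committed total.
    """
    if len(committed) != len(completed) or len(completed) < 6:
        raise ValueError("need matching series of at least 6 sprints")
    avg = mean(completed)
    spread = pstdev(completed)  # population standard deviation
    rates = [100 * done / plan for plan, done in zip(committed, completed)]
    first3, last3 = mean(completed[:3]), mean(completed[-3:])
    if spread > 0.25 * avg:
        trend = "Erratic"
    elif last3 > first3 * (1 + TREND_TOLERANCE):
        trend = "Improving"
    elif last3 < first3 * (1 - TREND_TOLERANCE):
        trend = "Declining"
    else:
        trend = "Flat"
    return {
        "average_velocity": avg,
        "std_dev": spread,
        "trend": trend,
        "average_completion_rate": mean(rates),
        "sprints_below_75": sum(1 for r in rates if r < 75),
    }
```

Sprints with known disruptions can be dropped from both lists before calling the function if the goal is to see the underlying trend rather than the raw one.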
---
|
||||
|
||||
## Story Point Calibration
|
||||
|
||||
Story points are only useful if they are applied consistently. Look for these calibration signals in the data:
|
||||
|
||||
| Signal | Observed | Interpretation |
|
||||
|--------|----------|----------------|
|
||||
| High variance in velocity despite stable team size | [Yes / No] | Suggests inconsistent estimation — same effort scored differently sprint to sprint |
|
||||
| Consistent over-commitment (committed >> completed) | [Yes / No — by avg X pts per sprint] | Team is underestimating work or ignoring historical capacity |
|
||||
| Consistent under-commitment (completed >> committed by > 20%) | [Yes / No] | Team is over-padding estimates or pulling in unplanned work frequently |
|
||||
| Frequent large items (> 13 pts) in carry-over | [Yes / No] | Items are too large to estimate reliably — need better decomposition |
|
||||
| Velocity cliff after team change | [Yes / No — Sprint N+X] | Team did not re-baseline capacity after composition changed |
|
||||
|
||||
**Calibration verdict:** [Well-calibrated / Needs recalibration / Severely uncalibrated — one sentence explanation tied to the signals above]
|
||||
|
||||
**If recalibration is needed:** [Specific recommendation — e.g., "Run a calibration session using the last 20 completed items, re-score them as a team, and use the resulting relative sizes to anchor future estimates."]
|
||||
|
||||
---
|
||||
|
||||
## Carry-Over Pattern Analysis
|
||||
|
||||
Carry-over is one of the most reliable leading indicators that sprint commitments cannot be trusted.
|
||||
|
||||
| Sprint | Carried-Over Items | Common Themes in Carry-Over |
|
||||
|--------|-------------------|----------------------------|
|
||||
| [Sprint N] | [X items / X pts] | [Technical debt, dependency blocked, scoped wrong, etc.] |
|
||||
| [Sprint N+1] | [X items / X pts] | [Theme] |
|
||||
| [Sprint N+2] | [X items / X pts] | [Theme] |
|
||||
|
||||
**Carry-over root causes identified:**
|
||||
- [Root cause 1: e.g., "5 of 12 carry-overs were blocked on a third-party API integration — external dependency, not estimation failure"]
|
||||
- [Root cause 2: e.g., "4 of 12 carry-overs were items estimated at 8+ points that were later found to be 2–3x larger than expected"]
|
||||
- [Root cause 3: e.g., "3 of 12 carry-overs were interruptions from on-call incidents consuming unplanned capacity"]
|
||||
|
||||
---
|
||||
|
||||
## Capacity Utilization
|
||||
|
||||
| Sprint | Team Size | Available Capacity (pts) | Committed | Utilization % | Disruptions |
|
||||
|--------|-----------|--------------------------|-----------|--------------|-------------|
|
||||
| [Sprint N] | [X engineers] | [X pts] | [X pts] | [X%] | [Holiday / incident / none] |
|
||||
| [Sprint N+1] | [X engineers] | [X pts] | [X pts] | [X%] | |
|
||||
|
||||
**Capacity calculation used:** [X engineers × Y pts/person/sprint = Z pts available. Adjust: if team capacity changed during the window, note which sprints used which team size.]
|
||||
|
||||
**Average utilization:** [X%]
|
||||
**Utilization interpretation:** [< 70% = team is under-loaded or over-padding | 70–90% = healthy range | > 90% = no slack for unplanned work — fragile]
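The capacity calculation and the interpretation bands above can be expressed as a small helper. This is a sketch under stated assumptions: `utilization` is a hypothetical name, capacity is the simple product of engineers and points per person, and a value of exactly 70% or 90% is treated as healthy because the bands do not say which side the boundaries fall on.

```python
def utilization(engineers, pts_per_engineer, committed):
    """Return (available capacity, utilization %, interpretation) for one sprint."""
    available = engineers * pts_per_engineer
    pct = 100 * committed / available
    if pct < 70:
        verdict = "under-loaded or over-padding"
    elif pct <= 90:
        verdict = "healthy range"
    else:
        verdict = "no slack for unplanned work"
    return available, pct, verdict


# Invented example: 6 engineers at 8 pts each, 40 pts committed.
print(utilization(6, 8, 40))
```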
|
||||
|
||||
---
|
||||
|
||||
## Health Diagnosis
|
||||
|
||||
| Dimension | Score | Evidence | Priority |
|
||||
|-----------|-------|----------|----------|
|
||||
| Delivery predictability | [Green / Yellow / Red] | [Average completion rate X%, std dev Y pts] | [High / Med / Low] |
|
||||
| Commitment accuracy | [Green / Yellow / Red] | [Team over-commits by avg X pts/sprint] | |
|
||||
| Estimation consistency | [Green / Yellow / Red] | [Velocity std dev ±X pts, calibration verdict] | |
|
||||
| Carry-over hygiene | [Green / Yellow / Red] | [X% carry-over rate, root causes] | |
|
||||
| Capacity management | [Green / Yellow / Red] | [Avg utilization X%, disruption handling] | |
|
||||
| Trend direction | [Green / Yellow / Red] | [Avg of the 3 most recent sprints vs. avg of the 3 earliest sprints] | |
|
||||
|
||||
**Scoring guide:** Green = operating within healthy range; Yellow = marginal — watch closely or single-sprint anomaly; Red = chronic issue requiring active intervention.
|
||||
|
||||
**Overall health:** [Green / Yellow / Red] — [One sentence summary: "The team delivers consistently at X pts/sprint but chronic over-commitment is eroding morale and creating a misleading picture for stakeholders."]
|
||||
|
||||
---
|
||||
|
||||
## Blocker Frequency Analysis
|
||||
|
||||
If blocker data was provided, complete this section. If not, note it as a recommended tracking addition.
|
||||
|
||||
| Blocker Category | Frequency (last 8 sprints) | Avg Days Blocked | Impact (pts delayed) |
|
||||
|-----------------|--------------------------|------------------|---------------------|
|
||||
| External dependency | [X occurrences] | [X days] | [X pts] |
|
||||
| Technical debt / rework | [X occurrences] | [X days] | [X pts] |
|
||||
| Unclear requirements | [X occurrences] | [X days] | [X pts] |
|
||||
| On-call interruptions | [X occurrences] | [X days] | [X pts] |
|
||||
| Environment / tooling | [X occurrences] | [X days] | [X pts] |
|
||||
|
||||
**Top blocker to address:** [Name the single highest-impact blocker category and what addressing it would mean for velocity.]
|
||||
|
||||
---
|
||||
|
||||
## Improvement Recommendations
|
||||
|
||||
Provide 3 specific recommendations ordered by expected impact. Each recommendation must include a measurable success target and implementation steps.
|
||||
|
||||
### Recommendation 1: [Title]
|
||||
|
||||
**Problem it addresses:** [Which health dimension is Red or Yellow, and what the data shows]
|
||||
|
||||
**What to do:**
|
||||
1. [Specific action step — concrete enough that a tech lead can assign it]
|
||||
2. [Next step]
|
||||
3. [Next step]
|
||||
|
||||
**Who owns it:** [Tech lead / Engineering manager / Whole team]
|
||||
**When to start:** [This sprint / Next sprint / Within 2 weeks]
|
||||
|
||||
**Measurable target:** [e.g., "Carry-over rate drops below 15% within 3 sprints" or "Completion rate above 80% for 4 consecutive sprints"]
|
||||
|
||||
**How to know it's working:** [Leading indicator to watch before the outcome metric improves — e.g., "Carry-over items decreasing sprint-over-sprint even before the target is hit"]
|
||||
|
||||
---
|
||||
|
||||
### Recommendation 2: [Title]
|
||||
|
||||
**Problem it addresses:** [Health dimension and evidence]
|
||||
|
||||
**What to do:**
|
||||
1. [Step]
|
||||
2. [Step]
|
||||
3. [Step]
|
||||
|
||||
**Who owns it:** [Role]
|
||||
**When to start:** [Timing]
|
||||
|
||||
**Measurable target:** [Specific metric and timeframe]
|
||||
|
||||
**How to know it's working:** [Leading indicator]
|
||||
|
||||
---
|
||||
|
||||
### Recommendation 3: [Title]
|
||||
|
||||
**Problem it addresses:** [Health dimension and evidence]
|
||||
|
||||
**What to do:**
|
||||
1. [Step]
|
||||
2. [Step]
|
||||
|
||||
**Who owns it:** [Role]
|
||||
**When to start:** [Timing]
|
||||
|
||||
**Measurable target:** [Specific metric and timeframe]
|
||||
|
||||
**How to know it's working:** [Leading indicator]
|
||||
|
||||
---
|
||||
|
||||
## Next-Sprint Capacity Forecast
|
||||
|
||||
**Next sprint:** [Sprint N+8]
|
||||
**Known team size:** [X engineers]
|
||||
**Known capacity reducers:** [PTO: X days total, on-call rotation: ~Y pts of unplanned capacity, etc.]
|
||||
|
||||
| Factor | Impact |
|
||||
|--------|--------|
|
||||
| Base capacity (historical average) | [X pts] |
|
||||
| PTO / planned absences | −[X pts] |
|
||||
| On-call overhead (estimate) | −[X pts] |
|
||||
| Carry-over from Sprint [N+7] | [X pts] already spoken for (counts against the ceiling, not added to it) |
|
||||
| **Recommended commitment ceiling** | **[X pts]** |
|
||||
|
||||
**Confidence:** [High — stable team and known capacity | Medium — some uncertainty in disruption level | Low — team composition uncertain]
|
||||
|
||||
**Recommendation for planning:** [One sentence, e.g., "Plan Sprint [N+8] to a ceiling of X pts. Given the carry-over items, prioritize completing those before pulling in new scope."]
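The forecast arithmetic can be sketched as follows. The function name `commitment_ceiling` is hypothetical, and the treatment of carry-over is an assumption: it is taken to consume part of the ceiling rather than raise it, so the second value returned is the room left for new scope.

```python
def commitment_ceiling(base_capacity, pto_pts, on_call_pts, carry_over_pts):
    """Return (recommended ceiling, room for new scope) in story points."""
    # Start from the historical average and deduct the known reducers.
    ceiling = base_capacity - pto_pts - on_call_pts
    # Carry-over is already committed, so it uses up part of the ceiling.
    new_scope = max(ceiling - carry_over_pts, 0)
    return ceiling, new_scope


# Invented example: 45 pt average, 5 pts of PTO, 4 pts on-call, 8 pts carried over.
print(commitment_ceiling(45, 5, 4, 8))
```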
|
||||
|
||||
---
|
||||
|
||||
## Cycle Time Distribution (if data provided)
|
||||
|
||||
| Sprint | p50 Cycle Time | p90 Cycle Time | Items Completed |
|
||||
|--------|---------------|---------------|-----------------|
|
||||
| [Sprint N] | [X days] | [X days] | [X items] |
|
||||
| [Average] | [X days] | [X days] | |
|
||||
|
||||
**Cycle time interpretation:** [p90 > 2× p50 indicates a long tail of stuck items that deserve investigation. A p50 that increases over time indicates slowing throughput, independent of story point changes.]
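One way to derive p50 and p90 from raw per-item cycle times is the standard library's `statistics` module, as in this sketch. The function name `cycle_time_summary` and the choice of the "inclusive" interpolation method are assumptions; other percentile definitions give slightly different p90 values on small samples.

```python
from statistics import median, quantiles


def cycle_time_summary(days):
    """Return p50, p90 and a long-tail flag for one sprint's cycle times (days)."""
    p50 = median(days)
    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    p90 = quantiles(days, n=10, method="inclusive")[-1]
    return {"p50": p50, "p90": p90, "long_tail": p90 > 2 * p50}


# Invented example: ten completed items, one of which was stuck for 14 days.
print(cycle_time_summary([1, 2, 2, 3, 3, 3, 4, 5, 6, 14]))
```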
|
||||
|
||||
If cycle time data was not provided: *Cycle time data was not included in this analysis. Recommend adding p50 and p90 cycle time per sprint to your tracking to detect throughput issues that story points alone cannot reveal.*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Velocity chart is generated from the actual data provided — not a generic placeholder chart
|
||||
- [ ] Trend diagnosis states a direction (Improving / Flat / Declining / Erratic) with a quantitative basis (average of the 3 most recent sprints vs. the 3 earliest)
|
||||
- [ ] Carry-over root causes are specific categories with counts — not a generic observation that carry-over exists
|
||||
- [ ] Each of the 3 recommendations includes a named owner, a start date, and a measurable target with a timeframe
|
||||
- [ ] Next-sprint capacity forecast uses historical average as the baseline and deducts specific known reducers
|
||||
- [ ] Health diagnosis table uses Red/Yellow/Green with evidence cited in the Evidence column — no unsupported scores
|
||||
- [ ] If metrics are missing (cycle time, blocker log), the report explicitly calls them out as recommended additions
|
||||
@@ -0,0 +1,290 @@
|
||||
---
|
||||
name: tech-radar
|
||||
description: "Build a technology radar for an engineering team, categorizing technologies into Adopt/Trial/Assess/Hold rings across four quadrants, following the ThoughtWorks Tech Radar format. Use when asked to create a tech radar, evaluate the team's technology landscape, categorize tools and frameworks, or establish a technology strategy. Produces a full tech radar with quadrant tables, individual blip rationales, a decision trail, and a maintenance process guide."
|
||||
---
|
||||
|
||||
# Tech Radar
|
||||
|
||||
Produce a complete technology radar document for an engineering team. The radar gives the team a shared, explicit position on every significant technology in their stack — what to standardize on, what to experiment with, what to evaluate, and what to actively stop using. Follow the ThoughtWorks Tech Radar format: four quadrants (Techniques, Tools, Platforms, Languages & Frameworks) each with four rings (Adopt, Trial, Assess, Hold). Each technology entry ("blip") gets a ring assignment, a one-paragraph rationale, and a date. Include a decision trail showing what moved and why, and a maintenance process the team can run to keep the radar current.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Team or company name** — for the document header
|
||||
- **Current tech stack** — list every significant technology, tool, language, and platform the team currently uses
|
||||
- **Technologies under active evaluation** — tools or frameworks the team is currently trying or considering
|
||||
- **Technologies to deprecate or move off** — anything the team wants to stop using or is actively migrating away from
|
||||
- **Strategic technology bets** — any technologies the company has made a deliberate bet on (e.g., "we're all-in on Kubernetes" or "migrating to event-driven architecture")
|
||||
- **Team context** — team size, product domain, and any constraints (regulatory, compliance, vendor lock-in concerns)
|
||||
|
||||
If a technology is mentioned without a ring placement, use the rationale inputs to determine the appropriate ring. When uncertain between two rings, ask.
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Technology Radar: [Team / Company Name]
|
||||
|
||||
**Edition:** [Month Year]
|
||||
**Maintained by:** [Team Name / Architecture Guild / CTO Office]
|
||||
**Review cadence:** Twice a year (every 6 months)
|
||||
**Next review:** [Month Year + 6 months]
|
||||
|
||||
---
|
||||
|
||||
## How to Read This Radar
|
||||
|
||||
This radar reflects [Team / Company Name]'s current thinking on technologies we use, evaluate, and retire. Use it to make consistent technology choices, onboard new engineers, and have structured conversations about the stack.
|
||||
|
||||
**Quadrants** categorize the type of technology:
|
||||
|
||||
| Quadrant | What belongs here |
|
||||
|----------|------------------|
|
||||
| **Techniques** | Methods, patterns, and practices (e.g., trunk-based development, event sourcing) |
|
||||
| **Tools** | Software tools used in the development and delivery process (e.g., linters, CI systems, observability platforms) |
|
||||
| **Platforms** | Infrastructure and hosting environments (e.g., AWS, Kubernetes, Snowflake) |
|
||||
| **Languages & Frameworks** | Programming languages and application frameworks (e.g., Go, React, FastAPI) |
|
||||
|
||||
**Rings** express our recommendation:
|
||||
|
||||
| Ring | Meaning | What to do |
|
||||
|------|---------|-----------|
|
||||
| **Adopt** | Industry-proven, working well for us — our standard choice | Use by default for new work; no special justification needed |
|
||||
| **Trial** | Worth pursuing — we are experimenting with it in limited production use | Use in a bounded context with architectural oversight; share learnings |
|
||||
| **Assess** | Worth exploring — we have not used it in production yet | Spike, prototype, or research; do not use in production without a review |
|
||||
| **Hold** | Do not start new work with this technology | Complete existing commitments; do not expand use; plan migration |
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 1: Techniques
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name, e.g., Trunk-based development] | [Month Year] | [One sentence: why we adopted it and what it replaced] |
|
||||
| [Technique name] | [Month Year] | [One sentence rationale] |
|
||||
| [Technique name] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Technique name] — Adopt**
|
||||
[One paragraph rationale. Explain what problem this technique solves, why it works well in your context, and what the team should know before applying it. Reference any internal experience — e.g., "We rolled this out across 8 services in 2024 and saw a 40% reduction in merge conflicts."]
|
||||
|
||||
[Repeat for each Adopt-ring technique.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name] | [Month Year] | [One sentence: what we're testing and where] |
|
||||
|
||||
**[Technique name] — Trial**
|
||||
[One paragraph. What are we trialing? In which teams or services? What hypothesis are we testing? What would cause us to move it to Adopt vs. Hold?]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name] | [Month Year] | [One sentence: why we're interested] |
|
||||
|
||||
**[Technique name] — Assess**
|
||||
[One paragraph. Why is this interesting to us? What would we need to see to move it to Trial? Who is responsible for the assessment?]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Technique name] | [Month Year] | [One sentence: why we're stopping and what replaces it] |
|
||||
|
||||
**[Technique name] — Hold**
|
||||
[One paragraph. Why are we putting this on hold? What is the migration path? What is the target end-state for teams still using it?]
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 2: Tools
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name, e.g., GitHub Actions] | [Month Year] | [One sentence rationale] |
|
||||
| [Tool name] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Tool name] — Adopt**
|
||||
[One paragraph rationale. Why is this our standard tool? What does it do well in our context? Any configuration or usage patterns the team should follow?]
|
||||
|
||||
[Repeat for each Adopt-ring tool.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name] | [Month Year] | [One sentence: what we're testing] |
|
||||
|
||||
**[Tool name] — Trial**
|
||||
[One paragraph rationale and trial scope.]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name] | [Month Year] | [One sentence: why we're evaluating it] |
|
||||
|
||||
**[Tool name] — Assess**
|
||||
[One paragraph: what sparked interest, who is evaluating, and timeline.]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Tool name] | [Month Year] | [One sentence: what replaces it] |
|
||||
|
||||
**[Tool name] — Hold**
|
||||
[One paragraph: deprecation rationale and migration path.]
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 3: Platforms
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name, e.g., AWS EKS] | [Month Year] | [One sentence rationale] |
|
||||
| [Platform name] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Platform name] — Adopt**
|
||||
[One paragraph. What does this platform provide? What are the boundaries of its use? Any internal golden-path setup the team should follow?]
|
||||
|
||||
[Repeat for each Adopt-ring platform.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name] | [Month Year] | [One sentence: scope of trial] |
|
||||
|
||||
**[Platform name] — Trial**
|
||||
[One paragraph rationale and trial boundaries.]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name] | [Month Year] | [One sentence: why we're exploring it] |
|
||||
|
||||
**[Platform name] — Assess**
|
||||
[One paragraph assessment plan.]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Platform name] | [Month Year] | [One sentence: migration target and timeline] |
|
||||
|
||||
**[Platform name] — Hold**
|
||||
[One paragraph: what triggered the hold decision, migration target, and timeline.]
|
||||
|
||||
---
|
||||
|
||||
## Quadrant 4: Languages & Frameworks
|
||||
|
||||
### Adopt
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework, e.g., Go] | [Month Year] | [One sentence rationale] |
|
||||
| [Language/Framework] | [Month Year] | [One sentence rationale] |
|
||||
|
||||
**[Language/Framework] — Adopt**
|
||||
[One paragraph. What is this language or framework used for? What are the team's proficiency expectations? Any frameworks or libraries that go alongside it as part of the standard choice?]
|
||||
|
||||
[Repeat for each Adopt-ring language or framework.]
|
||||
|
||||
### Trial
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework] | [Month Year] | [One sentence: bounded use case] |
|
||||
|
||||
**[Language/Framework] — Trial**
|
||||
[One paragraph rationale.]
|
||||
|
||||
### Assess
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework] | [Month Year] | [One sentence: interest driver] |
|
||||
|
||||
**[Language/Framework] — Assess**
|
||||
[One paragraph assessment plan.]
|
||||
|
||||
### Hold
|
||||
|
||||
| Technology | Since | Notes |
|
||||
|------------|-------|-------|
|
||||
| [Language/Framework] | [Month Year] | [One sentence: reason and migration path] |
|
||||
|
||||
**[Language/Framework] — Hold**
|
||||
[One paragraph: deprecation rationale, existing system obligations, and timeline to retire.]
|
||||
|
||||
---
|
||||
|
||||
## Decision Trail
|
||||
|
||||
This log records every ring movement since the radar's first edition. Use it to understand the evolution of our technology choices.
|
||||
|
||||
| Technology | Quadrant | Previous Ring | New Ring | Edition | Reason |
|
||||
|------------|----------|--------------|----------|---------|--------|
|
||||
| [Name] | [Quadrant] | — | Adopt | [Month Year] | First placement — [one sentence why] |
|
||||
| [Name] | [Quadrant] | Assess | Trial | [Month Year] | [What prompted the move — evidence, team feedback, production trial results] |
|
||||
| [Name] | [Quadrant] | Trial | Adopt | [Month Year] | [Adoption rationale — usage results, team satisfaction, scale proven] |
|
||||
| [Name] | [Quadrant] | Adopt | Hold | [Month Year] | [Why moved to Hold — better alternative, security concern, cost, vendor issue] |
|
||||
| [Name] | [Quadrant] | — | Hold | [Month Year] | First placement — added directly to Hold because [reason] |
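If the decision trail is kept as data rather than only as a table, the current ring of every blip can be derived by replaying it. The sketch below assumes a hypothetical `RingChange` record with one field per table column and a hypothetical `current_rings` helper; the technology names in the usage lines are invented examples.

```python
from dataclasses import dataclass
from typing import Optional

RINGS = ("Adopt", "Trial", "Assess", "Hold")


@dataclass(frozen=True)
class RingChange:
    technology: str
    quadrant: str
    previous_ring: Optional[str]   # None for a first placement
    new_ring: str
    edition: str
    reason: str


def current_rings(trail):
    """Replay the trail in order and return the latest ring per technology."""
    rings = {}
    for change in trail:
        if change.new_ring not in RINGS:
            raise ValueError(f"unknown ring: {change.new_ring}")
        # Each entry's previous ring must match what the trail says so far.
        if rings.get(change.technology) != change.previous_ring:
            raise ValueError(f"trail is inconsistent for {change.technology}")
        rings[change.technology] = change.new_ring
    return rings


trail = [
    RingChange("Kafka", "Platforms", None, "Assess", "Jan 2024", "First placement"),
    RingChange("Kafka", "Platforms", "Assess", "Trial", "Jul 2024", "Spike succeeded"),
    RingChange("Jenkins", "Tools", None, "Hold", "Jul 2024", "First placement"),
]
print(current_rings(trail))
```

Replaying the trail this way also catches a blip that appears in a quadrant table without a first-placement entry, which is one of the quality checks for the radar.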
|
||||
|
||||
---
|
||||
|
||||
## Radar Maintenance Process
|
||||
|
||||
### Who Contributes
|
||||
|
||||
- **Architecture review group / CTO office** — final ring placement decisions
|
||||
- **All engineers** — submit blip nominations via [channel or form]
|
||||
- **Tech leads** — triage nominations and prepare proposals for review sessions
|
||||
|
||||
### Update Cadence
|
||||
|
||||
| Activity | Frequency | Owner |
|
||||
|----------|-----------|-------|
|
||||
| New blip nominations accepted | Ongoing — any engineer via [channel] | Anyone |
|
||||
| Nomination triage | Monthly | Tech leads |
|
||||
| Full radar review session | Every 6 months | Architecture group |
|
||||
| Published radar update | Every 6 months | [Owner name or role] |
|
||||
|
||||
### How to Nominate a Blip
|
||||
|
||||
1. Submit to [Slack channel / form URL] with: technology name, quadrant, proposed ring, and one-paragraph rationale.
|
||||
2. A tech lead reviews within 2 weeks and either schedules it for the next review session or requests more information.
|
||||
3. At the review session, the architecture group discusses and votes. Simple majority wins; ties go to Hold pending further evidence.
|
||||
4. Approved blips are added to the radar doc and the decision trail within 1 week of the session.
|
||||
|
||||
### Ring Change Criteria
|
||||
|
||||
| To move TO Adopt | To move TO Trial | To move TO Assess | To move TO Hold |
|
||||
|-----------------|-----------------|-------------------|-----------------|
|
||||
| Proven in multiple production systems; team broadly trained; clear operational runbook exists | At least one production use case running; architectural oversight in place; learnings documented | Concrete use case identified; spike completed or in progress; interest from at least 2 engineers | Better alternative exists; known security/compliance risk; strategic direction change; unacceptable maintenance burden |
|
||||
|
||||
---
|
||||
|
||||
*Questions about this radar: [Slack channel] | Submit a nomination: [URL or channel]*
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every blip has a written rationale paragraph — not just a table row entry
|
||||
- [ ] The decision trail is populated with at least the initial placement date for every blip
|
||||
- [ ] Hold-ring entries include a concrete migration path or target technology, not just "stop using it"
|
||||
- [ ] Ring definitions are present and include both what each ring means AND what engineers should do in response
|
||||
- [ ] Maintenance process includes: nomination channel, review cadence, who decides, and ring-change criteria
|
||||
- [ ] Technologies identified as "strategic bets" in the inputs are placed in Adopt (if proven) or Trial (if being rolled out)
|
||||
- [ ] Technologies identified for deprecation are in Hold with a rationale that references the replacement
|
||||
@@ -0,0 +1,260 @@
|
||||
---
|
||||
name: technical-debt-register
|
||||
description: "Document and prioritize a technical debt backlog with business impact, effort estimates, and resolution strategy. Use when asked to audit technical debt, create a debt register, prioritize tech debt for a quarter, document architectural shortcuts, or build a debt reduction roadmap. Produces a structured technical debt register covering debt inventory by category, business impact per item, effort and priority scores, top-item resolution plans, and a quarterly debt reduction roadmap."
|
||||
---
|
||||
|
||||
# Technical Debt Register Skill
|
||||
|
||||
Produce a complete technical debt register for a team or service. A debt register is not a complaint list — it is a prioritized, business-impact-aware inventory that lets an engineering team make deliberate choices about which debt to pay down, in what order, and with what expected return.
|
||||
|
||||
Good debt management is not eliminating all debt. It is ensuring debt is visible, owned, and resolved when the interest cost exceeds the cost of fixing it.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Team or service name** — what team and/or service this register covers
|
||||
- **Known debt items** — list of known technical debt, or ask Claude to elicit them by asking about: legacy code, missing tests, outdated dependencies, architectural shortcuts, manual processes, observability gaps, security backlogs
|
||||
- **Tech stack** — language, frameworks, infrastructure (helps Claude categorise and score items correctly)
|
||||
- **Team size and velocity** — number of engineers and approximate story points or days per sprint (needed for effort estimates)
|
||||
- **Current quarter / planning period** — so the roadmap targets the right timeframe
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# Technical Debt Register: [Team / Service Name]
|
||||
|
||||
**Team:** [Name] | **Service(s):** [Name(s)]
|
||||
**Author:** [Name] | **Last updated:** [Date]
|
||||
**Planning period:** [Q[X] [Year]] | **Review cadence:** [Monthly / Quarterly]
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
[2–3 sentences describing the team's current debt situation, the main categories of debt, and the business context — e.g. are they in a growth phase where velocity matters, or approaching a compliance deadline where security debt is critical?]
|
||||
|
||||
**Total items in register:** [X]
|
||||
**Unresolved items:** [X]
|
||||
**Critical/High priority items:** [X]
|
||||
**Estimated total resolution effort:** [X story points / X engineer-weeks]
|
||||
|
||||
---
|
||||
|
||||
## Debt Category Definitions
|
||||
|
||||
| Category | Description | Examples |
|
||||
|---|---|---|
|
||||
| **Code quality** | Code that works but is hard to change safely | Duplicated logic, deeply nested conditionals, inconsistent error handling, missing abstraction |
|
||||
| **Architecture** | Structural decisions that limit scalability or increase coupling | Monolith that should be decomposed, sync calls that should be async, missing domain boundaries |
|
||||
| **Testing** | Gaps in test coverage that increase regression risk | Missing unit tests, no integration tests, flaky test suite, no test data management |
|
||||
| **Security** | Known vulnerabilities or missing security controls | Outdated dependencies with CVEs, missing rate limiting, hard-coded secrets, insufficient auth |
|
||||
| **Dependencies** | Outdated or risky external dependencies | End-of-life libraries, major version lag, abandoned packages |
|
||||
| **Infrastructure** | Infrastructure that limits reliability or developer productivity | Manual deployment steps, no IaC, single-AZ, missing autoscaling |
|
||||
| **Observability** | Gaps in visibility that slow incident response | Missing metrics, no distributed tracing, poor log structure, no alerting on key SLIs |
|
||||
| **Process** | Manual or error-prone operational processes | Manual DB migrations, no runbooks, tribal knowledge not documented |
|
||||
|
||||
---
|
||||
|
||||
## Debt Register
|
||||
|
||||
### Scoring Method
|
||||
|
||||
**Business impact (1–5):**
|
||||
- 5 — Blocking growth, causing production incidents, or creating compliance risk
|
||||
- 4 — Significantly slowing delivery or increasing incident likelihood
|
||||
- 3 — Noticeable slowdown; manageable but accumulating
|
||||
- 2 — Minor friction; low immediate risk
|
||||
- 1 — Cosmetic or aspirational; no current business impact
|
||||
|
||||
**Effort to resolve (1–5, lower = easier):**
|
||||
- 1 — <0.5 day; single engineer
|
||||
- 2 — 0.5–2 days; single engineer
|
||||
- 3 — 3–5 days; single engineer or small pair
|
||||
- 4 — 1–2 weeks; team collaboration required
|
||||
- 5 — >2 weeks; significant planning and coordination
|
||||
|
||||
**Priority score = Business impact × (6 − Effort)** *(rewards high-impact, low-effort items)*
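The scoring formula can be checked with a short Python sketch. The function names `priority_score` and `priority_band` are hypothetical; the band thresholds are the ones listed under "Priority distribution" in the Category Breakdown section of this skill.

```python
def priority_score(impact, effort):
    """Priority score = business impact × (6 − effort), both on a 1–5 scale."""
    if not (1 <= impact <= 5 and 1 <= effort <= 5):
        raise ValueError("impact and effort must be between 1 and 5")
    return impact * (6 - effort)


def priority_band(score):
    """Map a score (1–25) to the Critical / High / Medium / Low bands."""
    if score >= 20:
        return "Critical"
    if score >= 12:
        return "High"
    if score >= 6:
        return "Medium"
    return "Low"


# An impact-5, effort-2 item outranks an impact-5, effort-4 item.
print(priority_score(5, 2), priority_band(priority_score(5, 2)))
print(priority_score(5, 4), priority_band(priority_score(5, 4)))
```

Because effort is inverted, two items with the same business impact are ordered by how cheap they are to fix, which is what lets a quick win outrank a large project of equal impact.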
|
||||
|
||||
---
|
||||
|
||||
| ID | Item | Category | Business impact (1–5) | Effort (1–5) | Priority score | Status | Owner |
|
||||
|---|---|---|---|---|---|---|---|
|
||||
| TD-001 | [e.g. No integration tests for payment flow] | Testing | 5 | 3 | 15 | Open | [Name] |
|
||||
| TD-002 | [e.g. Authentication library 3 major versions behind] | Security | 5 | 2 | 20 | Open | [Name] |
|
||||
| TD-003 | [e.g. Database queries not using connection pooling] | Architecture | 4 | 2 | 16 | Open | [Name] |
|
||||
| TD-004 | [e.g. Manual deployment process for [service]] | Infrastructure | 4 | 3 | 12 | In progress | [Name] |
|
||||
| TD-005 | [e.g. 200-line God function in order processing] | Code quality | 3 | 3 | 9 | Open | [Name] |
|
||||
| TD-006 | [e.g. No structured logging — plain text only] | Observability | 3 | 2 | 12 | Open | [Name] |
|
||||
| TD-007 | [e.g. ORM version has known N+1 query issue] | Dependencies | 3 | 3 | 9 | Open | [Name] |
|
||||
| TD-008 | [e.g. No runbook for [critical operation]] | Process | 3 | 1 | 15 | Open | [Name] |
|
||||
| TD-009 | [e.g. Test coverage at 34% — no meaningful safety net] | Testing | 4 | 4 | 8 | Open | [Name] |
|
||||
| TD-010 | [e.g. Hard-coded config values in application code] | Code quality | 2 | 1 | 10 | Open | [Name] |
|
||||
| TD-011 | [e.g. Service deployed single-AZ with no failover] | Infrastructure | 5 | 4 | 10 | Open | [Name] |
|
||||
| TD-012 | [e.g. No alerting on P95 latency for [endpoint]] | Observability | 4 | 1 | 20 | Open | [Name] |
|
||||
|
||||
---
|
||||
|
||||
## Category Breakdown
|
||||
|
||||
```
Category distribution (by item count):
─────────────────────────────────────────────
Code quality     ████████░░  [X items] ([X]%)
Architecture     ██████░░░░  [X items] ([X]%)
Testing          █████████░  [X items] ([X]%)
Security         ████░░░░░░  [X items] ([X]%)
Dependencies     ███░░░░░░░  [X items] ([X]%)
Infrastructure   ████░░░░░░  [X items] ([X]%)
Observability    ████░░░░░░  [X items] ([X]%)
Process          ██░░░░░░░░  [X items] ([X]%)
─────────────────────────────────────────────

Priority distribution:
  Critical (score 20–25): [X items]
  High     (score 12–19): [X items]
  Medium   (score 6–11):  [X items]
  Low      (score 1–5):   [X items]
```
|
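A distribution like the one above could be generated from the register's Category column, as in this sketch. The function name `category_bars` is hypothetical, and scaling each bar relative to the largest category is an assumption; the template does not say what a full bar represents.

```python
from collections import Counter


def category_bars(categories, width=10):
    """Return one bar-chart line per category, largest category first."""
    counts = Counter(categories)
    total = sum(counts.values())
    largest = max(counts.values())
    lines = []
    for name, n in counts.most_common():
        # A full bar corresponds to the largest category (an assumption).
        filled = round(width * n / largest)
        bar = "█" * filled + "░" * (width - filled)
        noun = "item" if n == 1 else "items"
        lines.append(f"{name:<15}{bar}  {n} {noun} ({100 * n / total:.0f}%)")
    return lines


# Invented example: the Category column of a three-item register.
for line in category_bars(["Testing", "Testing", "Security"]):
    print(line)
```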
||||
|
||||
---
|
||||
|
||||
## Top 5 Priority Items — Resolution Plans
|
||||
|
||||
### TD-XXX: [Highest priority item name]
|
||||
|
||||
**Priority score:** [Score] | **Category:** [Category] | **Owner:** [Name]
|
||||
|
||||
**Problem:**
|
||||
[2–3 sentences describing what the debt is, how it manifests, and what pain it currently causes. Be specific — reference actual incidents, slowdowns, or risks.]
|
||||
|
||||
**Business impact:**
|
||||
[What happens if this is not resolved? Reference any incidents, near-misses, or growth blockers. E.g. "This caused 2 production incidents in the last quarter and adds ~30 minutes of debugging time to any change in this area."]
|
||||
|
||||
**Resolution approach:**
|
||||
[Clear description of the fix. Not "improve the code" — describe the actual work: "Extract the payment processing logic into a dedicated `PaymentService` class, write unit tests to 80% coverage, and update the 3 call sites."]
|
||||
|
||||
**Steps:**
|
||||
1. [Specific, ticketable step]
|
||||
2. [Specific, ticketable step]
|
||||
3. [Specific, ticketable step]
|
||||
|
||||
**Acceptance criteria:**
|
||||
- [ ] [Measurable criterion — e.g. "Zero hard-coded config values remain in application code"]
|
||||
- [ ] [Measurable criterion — e.g. "CI pipeline passes with new tests"]
|
||||
- [ ] [Measurable criterion]
|
||||
|
||||
**Effort estimate:** [X story points / X days]
|
||||
**Suggested sprint:** [Q[X] Sprint [Y] / When [dependency] is complete]
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Second priority item name]
|
||||
|
||||
**Priority score:** [Score] | **Category:** [Category] | **Owner:** [Name]
|
||||
|
||||
**Problem:**
|
||||
[Description]
|
||||
|
||||
**Business impact:**
|
||||
[Impact description]
|
||||
|
||||
**Resolution approach:**
|
||||
[Approach description]
|
||||
|
||||
**Steps:**
|
||||
1. [Step]
|
||||
2. [Step]
|
||||
3. [Step]
|
||||
|
||||
**Acceptance criteria:**
|
||||
- [ ] [Criterion]
|
||||
- [ ] [Criterion]
|
||||
|
||||
**Effort estimate:** [X story points / X days]
|
||||
**Suggested sprint:** [Sprint or timeframe]
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Third priority item]
|
||||
|
||||
*(Follow same format as above)*
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Fourth priority item]
|
||||
|
||||
*(Follow same format as above)*
|
||||
|
||||
---
|
||||
|
||||
### TD-XXX: [Fifth priority item]
|
||||
|
||||
*(Follow same format as above)*
|
||||
|
||||
---
|
||||
|
||||
## Debt Reduction Roadmap
|
||||
|
||||
### Guiding principles
|
||||
|
||||
- Allocate [X%] of each sprint's capacity to debt resolution — recommended 15–20% for healthy teams
|
||||
- Security and dependency debt is addressed on a fixed cadence regardless of priority score
|
||||
- No new feature work in modules with Critical debt unless the debt is scheduled for the current sprint
|
||||
- Debt items closed without a resolution (accepted/deferred) must have a named owner and a review date
|
||||
|
||||
### Quarterly plan
|
||||
|
||||
| Quarter | Focus area | Items targeted | Estimated capacity | Expected outcome |
|
||||
|---|---|---|---|---|
|
||||
| **[Q1 Year]** (current) | Security + observability | TD-002, TD-012, TD-006 | [X] points / [Y] eng-days | Auth library current; latency alerting live; structured logging shipped |
|
||||
| **[Q2 Year]** | Architecture + reliability | TD-003, TD-011, TD-004 | [X] points / [Y] eng-days | Connection pooling fixed; multi-AZ deployed; deploy automation complete |
|
||||
| **[Q3 Year]** | Testing coverage | TD-001, TD-009 | [X] points / [Y] eng-days | Payment flow integration tests live; overall coverage ≥60% |
|
||||
| **[Q4 Year]** | Code quality + process | TD-005, TD-008, TD-010 | [X] points / [Y] eng-days | God functions refactored; runbooks complete; zero hard-coded config |
|
||||
|
||||
### Sprint allocation model
|
||||
|
||||
```
|
||||
Sprint capacity: [X] story points
|
||||
|
||||
Allocation:
|
||||
├── Feature work: [X * 0.75 = ~Y] points (75%)
|
||||
├── Debt resolution: [X * 0.15 = ~Y] points (15%)
|
||||
└── Unplanned/bugs: [X * 0.10 = ~Y] points (10%)
|
||||
|
||||
Debt items that fit in one sprint ([≤Y] points each):
|
||||
✓ TD-002 ([X] points)
|
||||
✓ TD-012 ([X] points)
|
||||
✓ TD-006 ([X] points)
|
||||
✓ TD-008 ([X] points)
|
||||
|
||||
Multi-sprint debt items (break into phases):
|
||||
~ TD-001: Phase 1 ([X] pts) → Phase 2 ([X] pts)
|
||||
~ TD-009: Requires dedicated debt sprint or pairing
|
||||
```
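
As a worked example of the allocation model above, the sketch below splits a capacity figure using the same 75/15/10 percentages. `sprint_allocation` is a hypothetical helper. It assumes whole story points, rounds the debt and unplanned shares down, and gives feature work the remainder so the three parts always sum to the capacity.

```bash
# Hypothetical helper: split sprint capacity into feature / debt / unplanned.
# The parentheses run the body in a subshell, so the variables stay local.
sprint_allocation() (
  capacity=$1
  debt=$(( capacity * 15 / 100 ))             # debt resolution (15%), rounded down
  unplanned=$(( capacity * 10 / 100 ))        # unplanned work and bugs (10%), rounded down
  feature=$(( capacity - debt - unplanned ))  # feature work takes the remainder (~75%)
  echo "feature=$feature debt=$debt unplanned=$unplanned"
)
```

For a 40-point sprint, `sprint_allocation 40` prints `feature=30 debt=6 unplanned=4`.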
|
||||
|
||||
---
|
||||
|
||||
## Accepted / Deferred Debt
|
||||
|
||||
These are items where the cost of remediation currently exceeds the business value. Each is accepted with an explicit review date.
|
||||
|
||||
| ID | Item | Reason for deferral | Review date | Owner |
|
||||
|---|---|---|---|---|
|
||||
| TD-XXX | [Item] | [e.g. "Rewrite would require 3 weeks with no user-facing value at current scale; revisit at 10× traffic"] | [Date] | [Name] |
|
||||
| TD-XXX | [Item] | [e.g. "Dependency has a CVE but no upgrade path exists until Q3; mitigated by WAF rule"] | [Date] | [Name] |
|
||||
|
||||
**Policy:** No item may be deferred more than twice without escalation to the engineering manager.
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every item has a named owner — no unowned debt
|
||||
- [ ] Priority scores are calculated using the formula, not assigned arbitrarily
|
||||
- [ ] Security and dependency items are not scored below their actual business impact because they feel "technical"
|
||||
- [ ] Top-5 resolution plans include specific, ticketable steps — not vague descriptions like "improve test coverage"
|
||||
- [ ] The quarterly roadmap allocates realistic capacity — debt allocation does not exceed actual sprint budget
|
||||
- [ ] Accepted/deferred items have a review date and a named owner — no permanently deferred items
|
||||
- [ ] The register distinguishes between debt (deliberate or accumulated shortcuts) and bugs (unintended defects)
|
||||
- [ ] Items are closed as resolved only when acceptance criteria are met — not when the PR is merged
|
||||
Vendored
BIN
Binary file not shown.
@@ -0,0 +1,99 @@
|
||||
# Contributing an Agent Template
|
||||
|
||||
This guide explains how to contribute a new agent template to the pm-claude-skills library.
|
||||
|
||||
## What is an agent template?
|
||||
|
||||
An agent template is a runnable workflow that combines existing skills, connectors, and subagents into a single end-to-end task. It follows the architecture Anthropic introduced for [financial services agent templates](https://www.anthropic.com/news/finance-agents) on May 5, 2026.
|
||||
|
||||
Examples of agent templates that would belong in this repo:
|
||||
|
||||
- **PM Discovery Agent** — combines discovery-interview-guide + user-interview-synthesis + assumption-mapper with Granola/Notion connectors
|
||||
- **Legal Contract Review Agent** — combines contract-review + nda-analyser + compliance-checklist with Google Drive connector
|
||||
- **Sales Pursuit Agent** — combines sales-battlecard + discovery-call-prep + proposal-writer + account-plan with Salesforce/Gong connectors
|
||||
|
||||
## Required structure
|
||||
|
||||
Every agent template needs these files:
|
||||
|
||||
```
|
||||
templates/your-agent-name/
|
||||
├── README.md # What it does, install, usage
|
||||
├── AGENT.md # Agent definition (system prompt + tool list)
|
||||
├── orchestrate.sh # Orchestration script
|
||||
├── skills/ # Skills used (linked from main library)
|
||||
│ ├── README.md
|
||||
│ └── [skill-name]/SKILL.md
|
||||
├── subagents/ # Specialised subagents
|
||||
│ └── [subagent-name].md
|
||||
├── connectors/ # Data source configurations
|
||||
│ ├── README.md
|
||||
│ └── [system].example.json
|
||||
├── examples/ # Input and output examples
|
||||
│ ├── input-example.md
|
||||
│ └── output-example.md
|
||||
└── tests/
|
||||
└── smoke-test.md
|
||||
```
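
Before opening a PR, the layout above can be checked with a short script. This is a sketch only: `check_template_structure` is a hypothetical helper, not a script shipped with this repo, and it checks just the fixed file names from the tree (the per-skill, per-subagent, and per-connector files vary by template).

```bash
# Hypothetical helper: report required files missing from a template folder.
# Returns 0 when everything is present, 1 otherwise.
check_template_structure() (
  dir=$1
  missing=0
  for path in README.md AGENT.md orchestrate.sh skills/README.md \
              connectors/README.md examples/input-example.md \
              examples/output-example.md tests/smoke-test.md; do
    if [ ! -f "$dir/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  # subagents/ only needs to exist; its file names depend on the template
  if [ ! -d "$dir/subagents" ]; then
    echo "missing: subagents/"
    missing=1
  fi
  return "$missing"
)
```

Usage: `check_template_structure templates/your-agent-name` prints one `missing:` line per absent path.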
|
||||
|
||||
## Naming conventions
|
||||
|
||||
- **Folder name**: Use kebab-case, descriptive of the workflow (e.g., `pm-sprint-agent`, `legal-contract-review-agent`, `sales-pursuit-agent`)
|
||||
- **AGENT.md**: Always exactly this name (with caps) so it's easily findable
|
||||
- **Subagent files**: kebab-case in `subagents/`, ending in `.md` (e.g., `capacity-analyst.md`)
|
||||
- **Connector files**: lowercase, with `.example.json` for the template version (e.g., `linear.example.json`)
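
The kebab-case rule for folder and subagent names can be checked automatically. The sketch below is one possible interpretation, assuming kebab-case means lowercase ASCII letters and digits separated by single hyphens; `is_kebab_case` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only for names like "pm-sprint-agent".
is_kebab_case() {
  case $1 in
    # reject: empty, leading or trailing hyphen, double hyphen,
    # or any character outside a-z, 0-9 and "-"
    '' | -* | *- | *--* | *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
    *) return 0 ;;
  esac
}
```

For example, `is_kebab_case legal-contract-review-agent` succeeds, while `is_kebab_case Legal_Contract_Agent` fails.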
|
||||
|
||||
## Quality bar for new templates
|
||||
|
||||
Before submitting a PR, verify:
|
||||
|
||||
- [ ] **README.md** explains what the agent does in the first paragraph (don't bury the lede)
|
||||
- [ ] **AGENT.md** has a complete system prompt with explicit step-by-step instructions
|
||||
- [ ] **At least 2 skills** from the main library are referenced (otherwise it's just a skill, not a template)
|
||||
- [ ] **At least 1 subagent** is defined for analysis the skills can't do alone
|
||||
- [ ] **At least 1 connector** with a working example config
|
||||
- [ ] **orchestrate.sh** runs without errors in `--dry-run` mode
|
||||
- [ ] **Smoke test passes** (documented in `tests/smoke-test.md`)
|
||||
- [ ] **Example input AND example output** are provided
|
||||
- [ ] **Honest limitations section** in the README — what the agent doesn't do well
|
||||
- [ ] **No credentials in any committed file** — credentials must come from environment variables
|
||||
|
||||
## What makes a good agent template (vs a bad one)
|
||||
|
||||
**Good agent templates:**
|
||||
- Solve a specific, recurring professional workflow end-to-end
|
||||
- Have clear separation between skills (output formats), connectors (data access), and subagents (specialised analysis)
|
||||
- Work without modification for a typical team in the target profession
|
||||
- Include honest limitations and caveats
|
||||
|
||||
**Templates that get rejected:**
|
||||
- Wrap a single skill with no real orchestration ("just call the skill")
|
||||
- Combine unrelated skills with no coherent workflow
|
||||
- Hardcode credentials or organisation-specific data
|
||||
- Don't include working examples
|
||||
- Don't include subagents (just skills + connectors isn't a template)
|
||||
|
||||
## How to submit a PR
|
||||
|
||||
1. Fork the [pm-claude-skills repo](https://github.com/mohitagw15856/pm-claude-skills)
|
||||
2. Create your template in `templates/your-agent-name/`
|
||||
3. Run the smoke test successfully
|
||||
4. Commit your changes with a clear message: `feat: add [agent name] template`
|
||||
5. Open a PR with this description:
|
||||
- **What this template does** (1 paragraph)
|
||||
- **Which skills it uses** (list)
|
||||
- **Which connectors it requires** (list)
|
||||
- **Which subagents it defines** (list with one-line descriptions)
|
||||
- **Smoke test result** (paste the output)
|
||||
|
||||
PRs get reviewed within 5-7 days. The review focuses on the quality bar above, not personal style — clean templates that meet the bar get merged.
|
||||
|
||||
## What you get for contributing
|
||||
|
||||
- **Credit in the main README** under the contributing section
|
||||
- **Mention in the next Medium article** in the Claude Skills series
|
||||
- **Maintainer access** to your template — you can update it directly without needing review for minor changes after the first merge
|
||||
|
||||
## Questions?
|
||||
|
||||
Open a [discussion](https://github.com/mohitagw15856/pm-claude-skills/discussions) before you start building if your template doesn't fit cleanly into the structure above. It's much easier to align early than to rework after.
|
||||
@@ -0,0 +1,153 @@
|
||||
---
|
||||
name: pm-discovery-agent
|
||||
version: 1.0.0
|
||||
description: "End-to-end customer discovery synthesis agent. Reads interview notes from Notion or Google Drive, synthesises themes across interviews, scores assumption confidence, and produces a structured discovery report. Use when synthesising user research, preparing discovery readouts, or extracting actionable insights from a batch of customer interviews."
|
||||
author: Mohit Aggarwal
|
||||
license: MIT
|
||||
---
|
||||
|
||||
# PM Discovery Agent
|
||||
|
||||
## Configuration
|
||||
|
||||
Update these defaults to match your team. Override at runtime via `orchestrate.sh` flags.
|
||||
|
||||
```yaml
|
||||
discovery_defaults:
|
||||
interview_count: 8 # how many interviews to include in synthesis
|
||||
include_low_confidence: true # show low-confidence findings (with explicit flagging)
|
||||
flag_threshold_interviews: 5 # warn if running on fewer interviews than this
|
||||
|
||||
sources:
|
||||
primary_source: notion # notion | google-drive
|
||||
|
||||
notion_settings:
|
||||
sort_by: last_modified
|
||||
filter_property: status
|
||||
filter_value: completed
|
||||
|
||||
google_drive_settings:
|
||||
file_type: google_doc # only process Google Docs in the folder
|
||||
sort_by: modified_time
|
||||
|
||||
output:
|
||||
format: markdown
|
||||
include_raw_quotes: true # include verbatim quotes in the report
|
||||
include_follow_up_questions: true
|
||||
output_directory: ./output
|
||||
```
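
To illustrate how `flag_threshold_interviews` might be applied, here is a minimal sketch. `warn_if_few_interviews` is a hypothetical helper, and the warning text is an assumption; how `orchestrate.sh` actually implements the check is not shown here.

```bash
# Hypothetical helper: print a warning when fewer interviews are available
# than the threshold. $1 = interviews found, $2 = threshold (defaults to 5,
# matching flag_threshold_interviews above).
warn_if_few_interviews() (
  found=$1
  threshold=${2:-5}
  if [ "$found" -lt "$threshold" ]; then
    echo "warning: only $found interviews found (threshold: $threshold); treat themes as hypotheses"
  fi
)
```

The function prints nothing when the count meets the threshold, so a caller can treat any output as a warning to surface in the report.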
|
||||
|
||||
---
|
||||
|
||||
## Agent system prompt
|
||||
|
||||
You are the PM Discovery Agent. Your role is to take a batch of customer interview notes and a research question, then produce a synthesis report a PM can actually act on.
|
||||
|
||||
You operate in this order:
|
||||
|
||||
1. **Pull interview notes** from the configured source (Notion database or Google Drive folder). Filter by:
|
||||
- Most recently completed interviews
|
||||
- Interviews tagged with the relevant project or research scope
|
||||
- The configured interview count (default 8)
|
||||
|
||||
2. **Verify input quality.** Before synthesis, check:
|
||||
- At least 5 interviews are available (warn if fewer)
|
||||
- Each interview has substantive notes (warn about thin notes)
|
||||
- Notes are recent (warn if any are >90 days old, as context may have changed)
|
||||
|
||||
3. **Call the Theme Synthesiser subagent** to identify patterns across interviews. Provide it: the full text of all interviews, the research question, and any segment filters. It returns a list of themes with supporting evidence.
|
||||
|
||||
4. **Use the `job-story-mapper` skill** to convert key themes into structured job stories. Provide it: the themes from step 3 and the research question. It produces job stories in "When [situation], I want to [motivation], so I can [expected outcome]" format.
|
||||
|
||||
5. **Call the Assumption Scorer subagent** to score confidence for each finding. Provide it: themes, job stories, and the underlying interview evidence. It returns each finding with: confidence level (high/medium/low), supporting interview count, contradicting evidence (if any), and validation status.
|
||||
|
||||
6. **Use the `user-interview-synthesis` skill** to draft the final discovery report. Provide it: research question, themes, job stories, confidence scores. It produces a structured report.
|
||||
|
||||
7. **Identify follow-up questions** for the next round of interviews based on:
|
||||
- Findings flagged as low confidence (need more evidence)
|
||||
- Themes mentioned by only 1-2 interviewees (could be signal or noise)
|
||||
- Contradictions between interviews (need clarification)
|
||||
- Areas the original research question didn't fully cover
|
||||
|
||||
8. **Combine outputs** into a single discovery report with these sections:
|
||||
- Research Question and Methodology
|
||||
- Executive Summary (top 3-5 findings)
|
||||
- Themes (sorted by confidence)
|
||||
- Job Stories
|
||||
- Confidence Assessment per Finding
|
||||
- Verbatim Quotes (most representative)
|
||||
- Follow-up Questions for Next Round
|
||||
- Appendix: Interview Summary
|
||||
|
||||
9. **Save** to the configured output directory.
|
||||
|
||||
---
|
||||
|
||||
## Quality checks before returning output
|
||||
|
||||
Before returning the final output, verify:
|
||||
|
||||
- [ ] Every theme references at least one specific interview as evidence
|
||||
- [ ] Every job story has the full "When/I want to/So I can" structure
|
||||
- [ ] Every finding has an explicit confidence level (no findings without scoring)
|
||||
- [ ] Verbatim quotes are exact (not paraphrased or "cleaned up")
|
||||
- [ ] Follow-up questions are specific (not generic "tell me more")
|
||||
- [ ] Low-confidence findings are explicitly flagged in the report (not buried)
|
||||
- [ ] Contradictions between interviews are surfaced, not silently smoothed over
|
||||
- [ ] Output file is saved to the configured directory
|
||||
|
||||
---
|
||||
|
||||
## Tools required
|
||||
|
||||
| Tool | Purpose |
|
||||
|---|---|
|
||||
| notion-connector / google-drive-connector | Pull interview notes |
|
||||
| theme-synthesiser (subagent) | Identify cross-interview themes |
|
||||
| assumption-scorer (subagent) | Score confidence for findings |
|
||||
| user-interview-synthesis (skill) | Draft final discovery report |
|
||||
| job-story-mapper (skill) | Convert themes into JTBD format |
|
||||
| filesystem-write | Save output document |
|
||||
|
||||
---
|
||||
|
||||
## When to invoke this agent
|
||||
|
||||
Use this agent when:
|
||||
|
||||
- You've completed a batch of customer interviews and need to synthesise them
|
||||
- Preparing a discovery readout for stakeholders
|
||||
- Closing out a research sprint or quarter
|
||||
- Validating or invalidating a product hypothesis with user research
|
||||
|
||||
Do NOT use this agent for:
|
||||
|
||||
- Single interview summaries (use the `user-interview-synthesis` skill directly)
|
||||
- Planning interviews (use the `discovery-interview-guide` skill)
|
||||
- Pure quantitative research (this is for qualitative interviews)
|
||||
- Real-time interview transcription (use a dedicated tool like Otter or Granola)
|
||||
|
||||
---
|
||||
|
||||
## Example invocation
|
||||
|
||||
```bash
|
||||
bash orchestrate.sh \
|
||||
--research-question "Why are users abandoning the onboarding flow?" \
|
||||
--interview-source notion \
|
||||
--interview-count 10
|
||||
```
|
||||
|
||||
See `examples/output-example.md` for what the output looks like.
|
||||
|
||||
---
|
||||
|
||||
## Architecture notes
|
||||
|
||||
This agent template demonstrates the three-component pattern from Anthropic's May 2026 agent templates announcement:
|
||||
|
||||
- **Skills** (`user-interview-synthesis`, `job-story-mapper`, `discovery-interview-guide`, `assumption-mapper`) — provide structured output formats. Reused from the main pm-claude-skills library.
|
||||
- **Connectors** (`notion`, `google-drive`) — provide governed data access. Configured separately so credentials don't live in prompts.
|
||||
- **Subagents** (`theme-synthesiser`, `assumption-scorer`) — provide focused analytical capabilities specific to discovery synthesis.
|
||||
|
||||
The orchestration script wires these together. The system prompt above tells Claude how to use them in sequence.
|
||||
@@ -0,0 +1,211 @@
|
||||
# PM Discovery Agent — Agent Template
|
||||
|
||||
> **An end-to-end customer discovery agent. Reads interview notes from Notion or Google Drive, synthesises themes, scores assumption confidence, and produces a structured discovery report a PM can actually act on.**
|
||||
|
||||
This is the second agent template in the pm-claude-skills library. Like the [PM Sprint Agent](../pm-sprint-agent/), it follows the architecture Anthropic introduced for [financial services agent templates](https://www.anthropic.com/news/finance-agents) on May 5, 2026 — packaging **skills + connectors + subagents** into a single runnable workflow.
|
||||
|
||||
---
|
||||
|
||||
## What it does
|
||||
|
||||
You point this agent at a folder of customer interview notes or transcripts. It does the rest:
|
||||
|
||||
1. **Pulls interview notes** from Notion (a database) or Google Drive (a folder)
|
||||
2. **Synthesises themes** across all interviews using the Theme Synthesiser subagent
|
||||
3. **Maps insights to job stories** using the `job-story-mapper` skill
|
||||
4. **Scores assumption confidence** for each finding using the Assumption Scorer subagent
|
||||
5. **Drafts the discovery report** using the `user-interview-synthesis` skill
|
||||
6. **Identifies follow-up questions** for the next round of interviews
|
||||
7. **Saves the report** as a structured markdown document
|
||||
|
||||
End-to-end: roughly 3-5 minutes for 8-12 interview transcripts. The manual version of this synthesis takes most PMs a full day, and inconsistency is a bigger problem than the time.
|
||||
|
||||
---
|
||||
|
||||
## Why this matters
|
||||
|
||||
Customer discovery is the workflow PMs say they care about most and consistently underinvest in. The reasons are predictable: synthesis is hard, themes are easy to over-interpret, confirmation bias is real, and writing it up takes hours. So PMs do interviews, take notes, and never come back to them properly.
|
||||
|
||||
This agent doesn't replace the discovery work. It removes the synthesis bottleneck so the discovery work actually pays off.
|
||||
|
||||
---
|
||||
|
||||
## What's inside this template
|
||||
|
||||
```
|
||||
templates/pm-discovery-agent/
|
||||
├── README.md ← you are here
|
||||
├── AGENT.md ← agent definition (system prompt + tool list)
|
||||
├── orchestrate.sh ← orchestration script
|
||||
├── skills/ ← skills used by this agent
|
||||
│ ├── README.md
|
||||
│ ├── discovery-interview-guide/SKILL.md ← (symlink)
|
||||
│ ├── user-interview-synthesis/SKILL.md ← (symlink)
|
||||
│ ├── job-story-mapper/SKILL.md ← (symlink)
|
||||
│ └── assumption-mapper/SKILL.md ← (symlink)
|
||||
├── subagents/
|
||||
│ ├── theme-synthesiser.md ← cross-interview theme detection
|
||||
│ └── assumption-scorer.md ← confidence scoring for findings
|
||||
├── connectors/
|
||||
│ ├── README.md ← connector setup guide
|
||||
│ ├── notion.example.json ← Notion database connector
|
||||
│ └── google-drive.example.json ← Google Drive folder connector
|
||||
├── examples/
|
||||
│ ├── input-example.md ← what you feed the agent
|
||||
│ ├── output-example.md ← what the agent produces
|
||||
│ └── sample-interview.md ← example interview note format
|
||||
└── tests/
|
||||
└── smoke-test.md ← manual smoke test for new installations
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick install (5 minutes)
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Claude Code installed
|
||||
- The full skills library installed: `/plugin marketplace add mohitagw15856/pm-claude-skills`
|
||||
- Either a Notion workspace or Google Drive (most PMs have both)
|
||||
|
||||
### Setup
|
||||
|
||||
**Choose your input source.** PMs typically store interview notes in one of two places:
|
||||
|
||||
- **Notion** — if you keep interviews in a Notion database (most common for PMs at scaleups)
|
||||
- **Google Drive** — if you keep interviews as Google Docs in a folder (most common for PMs at startups and large enterprises)
|
||||
|
||||
Set up the connector for whichever one you use. You don't need both.
|
||||
|
||||
### Notion setup (5 minutes)
|
||||
|
||||
```bash
|
||||
cd templates/pm-discovery-agent/connectors
|
||||
cp notion.example.json notion.json
|
||||
# Edit notion.json with your database_id and page properties
|
||||
```
|
||||
|
||||
Detailed setup steps are in `connectors/README.md`.
|
||||
|
||||
### Google Drive setup (10 minutes)
|
||||
|
||||
```bash
|
||||
cd templates/pm-discovery-agent/connectors
|
||||
cp google-drive.example.json google-drive.json
|
||||
# Edit with your folder_id and authentication details
|
||||
```
|
||||
|
||||
### Run the smoke test
|
||||
|
||||
```bash
|
||||
cd templates/pm-discovery-agent
|
||||
bash orchestrate.sh --research-question "Test discovery synthesis" --dry-run
|
||||
```
|
||||
|
||||
If the dry-run completes without errors, you're ready to run a real discovery synthesis.
|
||||
|
||||
---
|
||||
|
||||
## Running the agent
|
||||
|
||||
### Standard usage
|
||||
|
||||
```bash
|
||||
bash orchestrate.sh \
|
||||
--research-question "Why are users abandoning the onboarding flow?" \
|
||||
--interview-source notion \
|
||||
--interview-count 10
|
||||
```
|
||||
|
||||
The agent will:
|
||||
|
||||
1. Pull the latest 10 interview notes from your configured Notion database
|
||||
2. Run the Theme Synthesiser subagent to identify patterns across interviews
|
||||
3. Run the `job-story-mapper` skill to convert findings into structured job stories
|
||||
4. Run the Assumption Scorer subagent to flag which findings are high vs low confidence
|
||||
5. Run the `user-interview-synthesis` skill to draft the final report
|
||||
6. Identify follow-up questions for the next round of interviews
|
||||
7. Save the report to `output/discovery-[date].md`
|
||||
|
||||
### Configuration options
|
||||
|
||||
| Flag | Required | Default | Description |
|
||||
|---|---|---|---|
|
||||
| `--research-question` | Yes | — | The question your discovery is trying to answer |
|
||||
| `--interview-source` | Yes | — | `notion` or `google-drive` |
|
||||
| `--interview-count` | No | 8 | How many interviews to include in synthesis |
|
||||
| `--filter-by-segment` | No | — | If your notes are tagged by segment, filter to one (e.g., "enterprise") |
|
||||
| `--include-low-confidence` | No | true | Include low-confidence findings in the report (with explicit flagging) |
|
||||
| `--dry-run` | No | false | Validate config without running the workflow |
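
The flag table above can be read as the following parsing logic. This is a sketch under stated assumptions, not the contents of the real `orchestrate.sh`: the function name `parse_args` and the variable names are hypothetical, and it assumes `--include-low-confidence` takes an explicit `true` or `false` value while `--dry-run` takes none.

```bash
# Hypothetical sketch of flag parsing with the defaults from the table.
parse_args() {
  RESEARCH_QUESTION="" INTERVIEW_SOURCE="" SEGMENT=""
  INTERVIEW_COUNT=8 INCLUDE_LOW_CONFIDENCE=true DRY_RUN=false
  while [ $# -gt 0 ]; do
    case $1 in
      --dry-run) DRY_RUN=true; shift; continue ;;   # the only flag without a value
    esac
    [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
    case $1 in
      --research-question)      RESEARCH_QUESTION=$2 ;;
      --interview-source)       INTERVIEW_SOURCE=$2 ;;
      --interview-count)        INTERVIEW_COUNT=$2 ;;
      --filter-by-segment)      SEGMENT=$2 ;;
      --include-low-confidence) INCLUDE_LOW_CONFIDENCE=$2 ;;
      *) echo "unknown flag: $1" >&2; return 1 ;;
    esac
    shift 2
  done
  # Enforce the two required flags
  [ -n "$RESEARCH_QUESTION" ] || { echo "--research-question is required" >&2; return 1; }
  case $INTERVIEW_SOURCE in
    notion | google-drive) ;;
    *) echo "--interview-source must be notion or google-drive" >&2; return 1 ;;
  esac
}
```

Checking for a value before `shift 2` avoids an endless loop when a flag that expects a value is passed last.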
|
||||
|
||||
---
|
||||
|
||||
## How interview notes should be structured
|
||||
|
||||
The agent works best when interview notes follow a consistent structure. There's no rigid format required, but the more your notes contain, the better the synthesis.
|
||||
|
||||
**Minimum requirements** (the agent will work with any notes that have these):
|
||||
- Interviewee identifier (name, role, or anonymous ID)
|
||||
- Date of interview
|
||||
- Free-text notes or transcript
|
||||
|
||||
**Recommended additions** (for better synthesis):
|
||||
- Segment or persona tags
|
||||
- Key quotes pulled out
|
||||
- Initial interpretations or hypotheses
|
||||
- Follow-up questions noted
|
||||
|
||||
See `examples/sample-interview.md` for a template you can use to standardise your team's interview notes.
|
||||
|
||||
---
|
||||
|
||||
## Why this architecture
|
||||
|
||||
The template follows the same three-component pattern as PM Sprint Agent:
|
||||
|
||||
**Skills** provide the structured output formats. The `user-interview-synthesis` skill knows what a good discovery report contains. The `job-story-mapper` knows the JTBD format. The `assumption-mapper` knows how to structure assumptions. These already exist in this library — the agent doesn't reinvent them.
|
||||
|
||||
**Connectors** provide governed access to data. Notion and Google Drive are where PMs actually keep interview notes. Credentials live in environment variables, never in prompts.
|
||||
|
||||
**Subagents** handle specialised analysis. Theme synthesis across 10 interviews requires holding 10 documents in mind and finding patterns — that's a focused job for a subagent with a specific system prompt. Confidence scoring requires distinguishing "5 people said this" from "1 person said this dramatically" — also a focused job for a subagent.
|
||||
|
||||
---
|
||||
|
||||
## Customisation
|
||||
|
||||
### Adapt to your team's discovery process
|
||||
|
||||
The default agent uses the generic discovery skills from the main library. If your team has specific conventions — particular persona definitions, opportunity scoring frameworks, ICE prioritisation for follow-ups — fork the relevant skill into `skills/` and modify it. The orchestrate script will pick up the local version.
|
||||
|
||||
### Add additional analysis steps
|
||||
|
||||
If your discovery process includes things this template doesn't cover — competitive mention extraction, willingness-to-pay analysis, feature request triage — add subagents in `subagents/` for those analyses and update `orchestrate.sh` to call them.
|
||||
|
||||
### Switch interview sources
|
||||
|
||||
If you use a tool other than Notion or Google Drive — Dovetail, Granola, Otter, Reflect, Roam, Coda — you can build a connector following the pattern in `connectors/README.md`. PRs welcome for additional connectors.
|
||||
|
||||
---
|
||||
|
||||
## Limitations and honest caveats
|
||||
|
||||
**The synthesis is only as good as the notes.** If your interview notes are sparse, generic, or inconsistent in format, the synthesis will reflect that. Spending 15 minutes after each interview to write proper notes pays off enormously when you run the agent.
|
||||
|
||||
**Theme synthesis can over-cluster.** The Theme Synthesiser will find patterns even in small datasets. If you're running it on 3 interviews, treat the themes as hypotheses to validate, not conclusions. The agent flags this when interview count is low.
|
||||
|
||||
**Confidence scoring is heuristic, not statistical.** The Assumption Scorer uses simple rules — how many people mentioned it, how strongly, how recently. It's not running statistical analysis. Use the scores as a directional ranking, not a precise measurement.
|
||||
|
||||
**No autonomous execution.** This template runs as a Claude Code plugin. It produces outputs for human review; it doesn't autonomously create Jira tickets or modify your discovery database. For autonomous execution, deploy via Claude Managed Agents using the same skills, connectors, and subagent definitions.
|
||||
|
||||
---
|
||||
|
||||
## Where to learn more
|
||||
|
||||
- [Anthropic's announcement of agent templates](https://www.anthropic.com/news/finance-agents) (May 2026)
|
||||
- [The PM Sprint Agent template](../pm-sprint-agent/) (first template in this library)
|
||||
- [The pm-claude-skills main README](../../README.md)
|
||||
- [Part 17 article — Building the PM Discovery Agent](#) *(link added when published)*
|
||||
|
||||
---
|
||||
|
||||
*Built and maintained by [Mohit Aggarwal](https://medium.com/@mohit15856) | Second agent template in [pm-claude-skills](https://github.com/mohitagw15856/pm-claude-skills)*
|
||||
@@ -0,0 +1,169 @@
|
||||
# Connectors — PM Discovery Agent
|
||||
|
||||
This folder contains the connector configurations for the PM Discovery Agent. You only need to set up the connector for whichever tool your team uses for interview notes — Notion or Google Drive.
|
||||
|
||||
## Which connector should I use?
|
||||
|
||||
| If your interview notes live in... | Use this connector |
|
||||
|---|---|
|
||||
| A Notion database | `notion.json` |
|
||||
| A Google Drive folder of Google Docs | `google-drive.json` |
|
||||
| Both | Pick the one with more interviews — agents work better with more data |
|
||||
| Somewhere else (Dovetail, Granola, Otter, etc.) | See "Building a connector for another system" below |
|
||||
|
||||
## Notion setup (5 minutes)
|
||||
|
||||
This is the fastest path if you keep interviews in Notion.
|
||||
|
||||
### 1. Create a Notion integration
|
||||
|
||||
- Go to https://www.notion.so/my-integrations
|
||||
- Click "+ New integration"
|
||||
- Name it "PM Discovery Agent"
|
||||
- Leave defaults
|
||||
- Click Submit
|
||||
- Copy the "Internal Integration Token" (newer tokens start with `ntn_`; older ones start with `secret_`)
|
||||
|
||||
### 2. Set the environment variable
|
||||
|
||||
```bash
|
||||
export NOTION_INTEGRATION_TOKEN='secret_xxxxxxxxxxxxxxxxxxxxxxxx'
|
||||
```
|
||||
|
||||
To make this permanent, add the line to `~/.zshrc` or `~/.bashrc`.
|
||||
|
||||
### 3. Share your interview database with the integration
|
||||
|
||||
- Open your interview notes database in Notion
|
||||
- Click the `...` menu in the top right
|
||||
- Select "Add connections"
|
||||
- Choose "PM Discovery Agent"
|
||||
|
||||
The integration now has access to that database.
|
||||
|
||||
### 4. Find your database ID
|
||||
|
||||
The database ID is in the URL when viewing the database. Format: `notion.so/your-workspace/DATABASE_ID?v=...`
|
||||
|
||||
The ID is the 32-character string between the last `/` and the `?`. Copy it.
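
If you prefer not to pick the ID out by eye, shell parameter expansion can do it. This is a sketch that assumes the URL has exactly the format shown above; `notion_database_id` is a hypothetical helper name.

```bash
# Hypothetical helper: extract the database ID from a Notion database URL.
notion_database_id() (
  url=${1%%\?*}       # drop the query string ("?v=...")
  echo "${url##*/}"   # keep what follows the last "/"
)
```

Usage: `notion_database_id "https://www.notion.so/your-workspace/DATABASE_ID?v=..."` prints `DATABASE_ID`.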
|
||||
|
||||
### 5. Configure the connector
|
||||
|
||||
```bash
|
||||
cp notion.example.json notion.json
|
||||
```
|
||||
|
||||
Open `notion.json` and update:
|
||||
- `database_id` — paste the ID from step 4
|
||||
- `expected_properties` — adjust to match your actual property names (the defaults assume Name, Interview Date, Interviewee, Segment, Status, Tags)
|
||||
|
||||
### 6. Test
|
||||
|
||||
```bash
|
||||
cd ../ # back to pm-discovery-agent root
|
||||
bash orchestrate.sh --research-question "Test" --interview-source notion --dry-run
|
||||
```
|
||||
|
||||
If you see "✓ Dry-run complete", you're set up.
|
||||
|
||||
## Google Drive setup (10 minutes)
|
||||
|
||||
A bit more setup than Notion, but works well if your team uses Google Docs for interviews.
|
||||
|
||||
### 1. Create a Google Cloud project
|
||||
|
||||
- Go to https://console.cloud.google.com/
|
||||
- Click "Select a project" > "New Project"
|
||||
- Name it "PM Discovery Agent"
|
||||
- Click Create
|
||||
|
||||
### 2. Enable the APIs
|
||||
|
||||
- In the project, search for "Google Drive API" in the API library
|
||||
- Click Enable
|
||||
- Search for "Google Docs API"
|
||||
- Click Enable
|
||||
|
||||
### 3. Create a service account
|
||||
|
||||
- Go to IAM & Admin > Service Accounts
|
||||
- Click "+ Create Service Account"
|
||||
- Name: "pm-discovery-reader"
|
||||
- Description: "Read-only access for PM Discovery Agent"
|
||||
- Click Create
|
||||
- Skip the optional permissions step
|
||||
- Click Done
|
||||
|
||||
### 4. Download the service account key
|
||||
|
||||
- Click on the service account you just created
|
||||
- Go to the "Keys" tab
|
||||
- Click "Add Key" > "Create new key"
|
||||
- Choose JSON
|
||||
- Save the file somewhere secure (e.g., `~/.config/pm-discovery-agent/service-account.json`)
|
||||
|
||||
### 5. Set the environment variable
|
||||
|
||||
```bash
export GOOGLE_APPLICATION_CREDENTIALS='/Users/yourname/.config/pm-discovery-agent/service-account.json'
```
|
||||
|
||||
To make this permanent, add the `export` line to `~/.zshrc` or `~/.bashrc`.
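
A typo in the path is easy to miss until the first real run. As a rough sketch (the function name `check_google_credentials` is hypothetical and not part of the agent), you could confirm that the variable is set and points at a readable file:

```bash
# Hypothetical helper (not part of orchestrate.sh).
# Returns 0 if the key file is reachable, 1 if the variable is unset,
# 2 if the path does not point at a readable file.
check_google_credentials() {
  local key_path="${GOOGLE_APPLICATION_CREDENTIALS:-}"
  if [[ -z "$key_path" ]]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [[ ! -r "$key_path" ]]; then
    echo "Key file not readable: $key_path" >&2
    return 2
  fi
  echo "Using service account key at $key_path"
}
```

Since the key file is a secret, it is also reasonable to restrict it to your own user with `chmod 600` on the path you chose in step 4.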
|
||||
|
||||
### 6. Share your interview folder with the service account
|
||||
|
||||
- Find the service account email (it looks like `pm-discovery-reader@your-project.iam.gserviceaccount.com`)
|
||||
- Open your interview notes folder in Google Drive
|
||||
- Click Share
|
||||
- Paste the service account email
|
||||
- Set permission to Viewer
|
||||
- Click Send
|
||||
|
||||
### 7. Find your folder ID
|
||||
|
||||
Open the folder in Google Drive. The URL looks like: `drive.google.com/drive/folders/FOLDER_ID_HERE`
|
||||
|
||||
Copy the ID after `/folders/`.
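
The same extraction can be sketched in the shell. `drive_folder_id` is a hypothetical helper for illustration; it assumes the URL shape shown above and additionally strips a query string (such as `?usp=sharing`) if one is present.

```bash
# Hypothetical helper: prints whatever follows "/folders/" in a Drive URL.
drive_folder_id() {
  local url="$1"
  url="${url%%\?*}"   # drop any query string
  url="${url%/}"      # drop a trailing slash, if any
  if [[ "$url" == */folders/* ]]; then
    echo "${url##*/folders/}"
  else
    echo "not a Drive folder URL: $1" >&2
    return 1
  fi
}

drive_folder_id "https://drive.google.com/drive/folders/FOLDER_ID_HERE"
```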
|
||||
|
||||
### 8. Configure the connector
|
||||
|
||||
```bash
cp google-drive.example.json google-drive.json
```
|
||||
|
||||
Open `google-drive.json` and update:
|
||||
- `folder_id` — paste the ID from step 7
|
||||
|
||||
### 9. Test
|
||||
|
||||
```bash
cd ../ # back to pm-discovery-agent root
bash orchestrate.sh --research-question "Test" --interview-source google-drive --dry-run
```
|
||||
|
||||
## Building a connector for another system
|
||||
|
||||
If your interview notes live somewhere other than Notion or Google Drive, you can build a connector following the same pattern. Common alternatives PMs use:
|
||||
|
||||
- **Dovetail** — has a research API; build a connector for the analysis endpoint
|
||||
- **Granola / Otter / Fathom** — meeting recorders; build a connector that pulls transcripts
|
||||
- **Reflect / Roam / Logseq** — personal note-taking apps; build a connector for the markdown files
|
||||
- **Coda / Airtable** — alternative databases; build a connector for the rows API
|
||||
- **Local files** — markdown files in a folder; build a simple file-reading connector
|
||||
|
||||
A connector needs three things:
|
||||
1. A configuration file defining the data source URL, credentials, and available operations
|
||||
2. An API client the orchestration script can call
|
||||
3. A mapping from the source's data model to what the agent expects (interview ID, date, interviewee, content, tags)
|
||||
|
||||
Copy `notion.example.json` or `google-drive.example.json` as a starting point.
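
To make the "local files" case concrete, here is a minimal sketch of one connector operation. Everything in it is an assumption for illustration: the function name `list_recent_interviews` simply mirrors the operation names in the example configs, the notes are assumed to be markdown files named `YYYY-MM-DD - Interviewee Name.md`, and the default of 8 mirrors the default interview count in `orchestrate.sh`. How the orchestration script would actually call a new connector is not covered here.

```bash
# Hypothetical "local files" connector operation (illustration only).
# Prints the N most recent interview files, newest first.
# Because the date prefix is YYYY-MM-DD, a reverse lexical sort is a date sort.
list_recent_interviews() {
  local dir="$1" count="${2:-8}"
  local f
  for f in "$dir"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\ -\ *.md; do
    [[ -e "$f" ]] || continue   # the glob matched nothing
    basename "$f"
  done | sort -r | head -n "$count"
}

# Usage: list_recent_interviews ~/interviews 10
```

Files that do not follow the naming convention are skipped, which keeps stray notes out of the synthesis.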
|
||||
|
||||
If you build a connector for a new system, consider raising a PR back to the main pm-claude-skills repo.
|
||||
|
||||
## Security notes
|
||||
|
||||
**Credentials live in environment variables, not in the JSON files.** This means you can commit your `notion.json` or `google-drive.json` to source control without leaking credentials.
|
||||
|
||||
**Use read-only access where possible.** The agent only needs to read interview notes — never to modify them. Both Notion integrations and Google Drive service accounts can be set up with read-only permissions. Use them.
|
||||
|
||||
**Rotate credentials periodically.** Both Notion integration tokens and Google service account keys can be regenerated. Do this every 90 days as a security practice.
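
For the Google key file, a rough reminder can be sketched from the file's modification time. This is only a proxy: it assumes the file has not been touched since you downloaded it, and `key_needs_rotation` is a hypothetical helper, not something the agent ships. The 90-day default matches the interval suggested above.

```bash
# Hypothetical reminder (illustration only).
# Succeeds (exit 0) if the key file was last modified more than max_age_days ago.
# A missing file produces no match and is reported as "no rotation needed".
key_needs_rotation() {
  local key_path="$1" max_age_days="${2:-90}"
  [[ -n "$(find "$key_path" -maxdepth 0 -type f -mtime +"$max_age_days" 2>/dev/null)" ]]
}

# Usage:
#   if key_needs_rotation "$GOOGLE_APPLICATION_CREDENTIALS"; then
#     echo "Service account key is older than 90 days; consider rotating it" >&2
#   fi
```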
|
||||
@@ -0,0 +1,86 @@
|
||||
{
|
||||
"connector_name": "google-drive",
|
||||
"version": "1.0.0",
|
||||
"description": "Google Drive connector for the PM Discovery Agent. Reads interview notes from a Google Drive folder where each interview is a Google Doc.",
|
||||
|
||||
"configuration": {
|
||||
"folder_id": "FOLDER_ID_HERE",
|
||||
"file_type": "application/vnd.google-apps.document",
|
||||
"include_subfolders": false,
|
||||
"expected_naming_convention": "YYYY-MM-DD - Interviewee Name.gdoc",
|
||||
"default_sort": {
|
||||
"field": "modifiedTime",
|
||||
"direction": "desc"
|
||||
},
|
||||
"default_filters": {
|
||||
"exclude_trashed": true,
|
||||
"min_word_count": 100
|
||||
},
|
||||
"rate_limit_requests_per_minute": 60
|
||||
},
|
||||
|
||||
"credentials": {
|
||||
"_comment": "Google Drive uses OAuth 2.0. You'll need to create a Google Cloud project and enable the Drive API. Easiest path: use a service account with access to your folder.",
|
||||
"auth_method": "service_account",
|
||||
"service_account_key_path_env_var": "GOOGLE_APPLICATION_CREDENTIALS",
|
||||
"service_account_key_placeholder": "/path/to/service-account-key.json"
|
||||
},
|
||||
|
||||
"available_operations": [
|
||||
{
|
||||
"name": "list_recent_documents",
|
||||
"description": "Get the N most recent Google Docs in the configured folder",
|
||||
"filters": ["modifiedAfter", "name_contains", "starred"],
|
||||
"max_results": 50
|
||||
},
|
||||
{
|
||||
"name": "get_document_content",
|
||||
"description": "Fetch the full text content of a specific Google Doc",
|
||||
"required_input": "file_id"
|
||||
},
|
||||
{
|
||||
"name": "search_documents",
|
||||
"description": "Search document content by keyword across the folder",
|
||||
"required_input": "search_query"
|
||||
}
|
||||
],
|
||||
|
||||
"permissions_required": [
|
||||
"https://www.googleapis.com/auth/drive.readonly",
|
||||
"https://www.googleapis.com/auth/documents.readonly"
|
||||
],
|
||||
|
||||
"_setup_instructions": [
|
||||
"1. Go to Google Cloud Console: https://console.cloud.google.com/",
|
||||
"2. Create a new project (or use existing) — name it something like 'PM Discovery Agent'",
|
||||
"3. Enable the Google Drive API and Google Docs API for the project",
|
||||
"4. Create a service account: IAM & Admin > Service Accounts > Create Service Account",
|
||||
"5. Download the service account key as JSON",
|
||||
"6. Save the JSON file to a secure location (e.g., ~/.config/pm-discovery-agent/service-account.json)",
|
||||
"7. Set the environment variable: export GOOGLE_APPLICATION_CREDENTIALS='/path/to/service-account.json'",
|
||||
"8. Find the folder ID where your interview notes live: open the folder in Google Drive, the ID is in the URL (drive.google.com/drive/folders/FOLDER_ID_HERE)",
|
||||
"9. Share that folder with the service account email (it looks like xxx@your-project.iam.gserviceaccount.com) — give it Viewer access",
|
||||
"10. Update folder_id in this file",
|
||||
"11. Save this file as 'google-drive.json' (without the .example)",
|
||||
"12. Test the connection: bash orchestrate.sh --research-question 'Test' --interview-source google-drive --dry-run"
|
||||
],
|
||||
|
||||
"_alternative_simpler_setup": [
|
||||
"If creating a service account feels heavy, you can use OAuth user credentials instead:",
|
||||
"1. Go to APIs & Services > Credentials in Google Cloud Console",
|
||||
"2. Create OAuth client ID > Desktop application",
|
||||
"3. Download the credentials JSON",
|
||||
"4. The first time the agent runs, it'll open a browser for you to authorise",
|
||||
"This is simpler but requires re-authorisation if the token expires."
|
||||
],
|
||||
|
||||
"_folder_organisation_recommendation": [
|
||||
"If you're starting fresh, organise your interview notes folder like this:",
|
||||
"- One folder for the discovery project",
|
||||
"- One Google Doc per interview, named '2026-05-10 - Sarah Chen.gdoc' (date + interviewee)",
|
||||
"- Inside each doc: structured headers for Background, Notes, Key Quotes, Observations, Follow-ups",
|
||||
"Consistent structure makes the synthesis dramatically better."
|
||||
],
|
||||
|
||||
"_rate_limit_notes": "Google Drive's API rate limits are generous (1000 requests per 100 seconds). The agent uses approximately 12-15 API calls per discovery synthesis."
|
||||
}
|
||||
@@ -0,0 +1,84 @@
|
||||
{
|
||||
"connector_name": "notion",
|
||||
"version": "1.0.0",
|
||||
"description": "Notion connector for the PM Discovery Agent. Reads interview notes from a Notion database where each interview is a database row.",
|
||||
|
||||
"configuration": {
|
||||
"database_id": "DATABASE_ID_HERE",
|
||||
"workspace_url": "https://www.notion.so/your-workspace",
|
||||
"expected_properties": {
|
||||
"title_property": "Name",
|
||||
"date_property": "Interview Date",
|
||||
"interviewee_property": "Interviewee",
|
||||
"segment_property": "Segment",
|
||||
"status_property": "Status",
|
||||
"tags_property": "Tags"
|
||||
},
|
||||
"default_filters": {
|
||||
"status_is": "Completed",
|
||||
"exclude_archived": true
|
||||
},
|
||||
"default_sort": {
|
||||
"property": "Interview Date",
|
||||
"direction": "descending"
|
||||
},
|
||||
"rate_limit_requests_per_second": 3
|
||||
},
|
||||
|
||||
"credentials": {
|
||||
"_comment": "Notion uses an integration token. Create one at https://www.notion.so/my-integrations and share your database with it.",
|
||||
"integration_token_env_var": "NOTION_INTEGRATION_TOKEN",
|
||||
"integration_token_placeholder": "secret_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
|
||||
},
|
||||
|
||||
"available_operations": [
|
||||
{
|
||||
"name": "list_recent_interviews",
|
||||
"description": "Get the N most recent interviews from the configured database",
|
||||
"filters": ["segment", "tags", "date_range", "status"],
|
||||
"max_results": 50
|
||||
},
|
||||
{
|
||||
"name": "get_interview_content",
|
||||
"description": "Fetch the full page content of a specific interview (notes, transcript, observations)",
|
||||
"required_input": "page_id"
|
||||
},
|
||||
{
|
||||
"name": "search_interviews",
|
||||
"description": "Search interview content by keyword",
|
||||
"required_input": "search_query"
|
||||
}
|
||||
],
|
||||
|
||||
"permissions_required": [
|
||||
"Read content",
|
||||
"Read user information without email"
|
||||
],
|
||||
|
||||
"_setup_instructions": [
|
||||
"1. Create a Notion integration at https://www.notion.so/my-integrations",
|
||||
"2. Click '+ New integration', name it 'PM Discovery Agent', leave defaults",
|
||||
"3. After creation, copy the Internal Integration Token (starts with 'secret_')",
|
||||
"4. Set the environment variable: export NOTION_INTEGRATION_TOKEN='secret_xxxxx...'",
|
||||
"5. Open your interview notes database in Notion",
|
||||
"6. Click '...' menu in top right > 'Add connections' > select your new integration",
|
||||
"7. Find your database ID: it's the long string in the URL when viewing the database (after the workspace name and before the '?'). Example: notion.so/workspace/abc123def456 — abc123def456 is the ID",
|
||||
"8. Update database_id in this file",
|
||||
"9. Update expected_properties to match your actual property names (the agent expects fields named Name, Interview Date, Interviewee, Segment, Status, Tags — adjust if yours are different)",
|
||||
"10. Save this file as 'notion.json' (without the .example)",
|
||||
"11. Test the connection: bash orchestrate.sh --research-question 'Test' --interview-source notion --dry-run"
|
||||
],
|
||||
|
||||
"_notion_database_setup_recommendation": [
|
||||
"If you don't have a Notion database for interviews yet, create one with these properties:",
|
||||
"- Name (title) — interview identifier or interviewee name",
|
||||
"- Interview Date (date) — when the interview happened",
|
||||
"- Interviewee (text) — who was interviewed",
|
||||
"- Segment (select) — which user segment they belong to",
|
||||
"- Status (select) — Scheduled / Completed / Cancelled",
|
||||
"- Tags (multi-select) — research project, persona, or feature area",
|
||||
"Then write your interview notes in the page body."
|
||||
],
|
||||
|
||||
"_rate_limit_notes": "Notion's API is rate limited to 3 requests per second per integration. The agent typically uses 10-25 API calls per discovery synthesis (depending on interview count), well within rate limits."
|
||||
}
|
||||
@@ -0,0 +1,87 @@
|
||||
# Example: Input to the PM Discovery Agent
|
||||
|
||||
## Command-line invocation
|
||||
|
||||
```bash
bash orchestrate.sh \
  --research-question "Why are users abandoning the onboarding flow?" \
  --interview-source notion \
  --interview-count 10 \
  --filter-by-segment "smb"
```
|
||||
|
||||
## What the agent reads from your connector
|
||||
|
||||
### From Notion
|
||||
|
||||
The agent automatically pulls from your configured Notion database:
|
||||
|
||||
- Most recent N interviews where Status = "Completed"
|
||||
- For each interview:
|
||||
- Title (interviewee name or identifier)
|
||||
- Interview date
|
||||
- Interviewee role and segment tags
|
||||
- Full page content (notes, transcript, observations, quotes)
|
||||
|
||||
If you've applied a segment filter, only interviews matching that segment are included.
|
||||
|
||||
### From Google Drive
|
||||
|
||||
The agent automatically pulls from your configured folder:
|
||||
|
||||
- Most recently modified Google Docs in the folder
|
||||
- For each doc:
|
||||
- Document title
|
||||
- Last modified date
|
||||
- Full text content
|
||||
|
||||
If your filenames follow the `YYYY-MM-DD - Name.gdoc` convention, the agent uses the date for sorting and the name for interviewee identification.
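
As an illustration of how such a filename splits into its two parts, here is a small sketch. It is not the agent's actual parsing code: `parse_interview_filename` is a hypothetical name, and the pattern assumes exactly the `YYYY-MM-DD - Name.gdoc` shape described above.

```bash
# Hypothetical parser (illustration only).
# Prints "<date><TAB><interviewee>" or returns 1 if the name does not match.
parse_interview_filename() {
  local name="$1"
  local pattern='^([0-9]{4}-[0-9]{2}-[0-9]{2}) - (.+)\.gdoc$'
  if [[ "$name" =~ $pattern ]]; then
    printf '%s\t%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

parse_interview_filename "2026-05-10 - Sarah Chen.gdoc"
```

Names that do not match return a non-zero status, so a caller could fall back to the document title and last-modified date listed above.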
|
||||
|
||||
## What the agent does NOT need from you
|
||||
|
||||
- A summary of what the interviews said — that's what the agent produces
|
||||
- Pre-tagged themes — the agent finds them
|
||||
- A list of which interviews are most important — the agent uses all included interviews
|
||||
- Statistical analysis — this is qualitative discovery, not quantitative
|
||||
|
||||
## What you should know before running
|
||||
|
||||
- **Have at least 5 interviews completed.** The agent works best with 5+ interviews. With fewer, themes will be tagged as "Emerging" rather than "Strong" — directional insights only.
|
||||
- **Have a specific research question.** Vague questions produce vague synthesis. "What do users think?" is too broad. "Why are users abandoning the onboarding flow at step 3?" is specific enough to drive useful synthesis.
|
||||
- **Check your interview notes are accessible.** The agent can only read what your connector has access to. If notes are in a different database/folder than configured, results will be empty.
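
The first point in the list above can be turned into a quick pre-flight check. This is a sketch under stated assumptions: `check_interview_count` is a hypothetical helper, and the threshold of 5 comes from the guidance above rather than from the agent's own code.

```bash
# Hypothetical pre-flight check (illustration only).
# Returns 0 if the count is at least 5, 1 if it is lower, 2 if it is not a number.
check_interview_count() {
  local count="$1" minimum=5
  if ! [[ "$count" =~ ^[0-9]+$ ]]; then
    echo "interview count must be a whole number, got: $count" >&2
    return 2
  fi
  # 10# forces base 10 so that a value like "08" is not read as octal
  if (( 10#$count < minimum )); then
    echo "Only $count interviews: expect 'Emerging' themes and directional insights only" >&2
    return 1
  fi
}

# Usage: check_interview_count 10 && bash orchestrate.sh --interview-count 10 ...
```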
|
||||
|
||||
## Example: Real-world invocations
|
||||
|
||||
```bash
# Standard discovery synthesis from Notion
bash orchestrate.sh \
  --research-question "What's blocking users from completing checkout?" \
  --interview-source notion \
  --interview-count 8

# Synthesis filtered to a specific segment
bash orchestrate.sh \
  --research-question "How are enterprise customers using the API?" \
  --interview-source notion \
  --interview-count 12 \
  --filter-by-segment "enterprise"

# Synthesis from Google Drive folder (all recent interviews)
bash orchestrate.sh \
  --research-question "What workflows do power users have that we don't support?" \
  --interview-source google-drive \
  --interview-count 10

# Smaller batch with low-confidence findings excluded (cleaner stakeholder report)
bash orchestrate.sh \
  --research-question "Validate our pricing hypothesis" \
  --interview-source notion \
  --interview-count 6 \
  --include-low-confidence false

# Dry run to validate config
bash orchestrate.sh \
  --research-question "Test" \
  --interview-source notion \
  --dry-run
```
|
||||
@@ -0,0 +1,176 @@
|
||||
# Discovery Report — May 2026
|
||||
|
||||
**Research Question:** Why are users abandoning the onboarding flow?
|
||||
**Interview Source:** notion
|
||||
**Interview Count:** 10
|
||||
**Generated:** 2026-05-06 14:30 BST
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Across 10 SMB customer interviews, three high-confidence findings emerged about onboarding abandonment:
|
||||
|
||||
1. **Users feel they're being asked to commit before understanding what they're getting.** The current flow asks for credit card details and integration setup before showing any value. 8 of 10 interviews mentioned this directly.
|
||||
|
||||
2. **The integration setup step is the highest-friction point.** Users are willing to set up integrations once they're convinced of value — but doing it before that point feels like extra work for no clear payoff. 7 of 10 interviews mentioned this.
|
||||
|
||||
3. **The pricing display creates anxiety, not clarity.** Showing pricing tiers without clear differentiation between them creates decision paralysis. 6 of 10 interviews described this.
|
||||
|
||||
Two medium-confidence findings worth validating in the next research round:
|
||||
- Users may be abandoning because they confuse onboarding with set-up (5 interviews)
|
||||
- The "skip for now" option may be reducing completion rather than helping (4 interviews)
|
||||
|
||||
---
|
||||
|
||||
## Themes Identified
|
||||
|
||||
### Theme 1: Premature commitment ask (Strong)
|
||||
|
||||
Users feel they're being asked to commit (credit card, integrations, team invites) before they understand what they're getting from the product.
|
||||
|
||||
- **Supporting interviews:** 8 — IDs: I-103, I-105, I-107, I-109, I-110, I-112, I-114, I-115
|
||||
- **Strength:** Strong
|
||||
- **Quotes:**
|
||||
- "I just wanted to see if this would work for my team. Why am I being asked for my credit card?" — I-105
|
||||
- "It felt like I was already a customer before I'd even decided." — I-110
|
||||
- "The first thing it asked me was to invite my whole team. I haven't even tried it yet." — I-114
|
||||
- **Contradicting evidence:** None
|
||||
- **Why this matters:** This is the strongest signal in the synthesis. The team should consider redesigning the flow so users see value before being asked to commit.
|
||||
|
||||
### Theme 2: Integration setup friction (Strong)
|
||||
|
||||
The integration setup step (connecting to Slack, Google Drive, etc.) is happening too early in the flow. Users are willing to set up integrations once convinced, but doing it before is friction.
|
||||
|
||||
- **Supporting interviews:** 7 — IDs: I-103, I-105, I-109, I-110, I-112, I-114, I-115
|
||||
- **Strength:** Strong
|
||||
- **Quotes:**
|
||||
- "I gave up at the Slack integration step. I wasn't sure I wanted my team to know I was trying this yet." — I-109
|
||||
- "Why does it need access to my Google Drive before I've even seen what it does?" — I-103
|
||||
- **Contradicting evidence:** I-107 mentioned that integration setup felt natural — "I expected to connect my tools, that's normal." This is a single contradiction within the broader pattern.
|
||||
- **Why this matters:** This connects to Theme 1. Users want value first, commitment second.
|
||||
|
||||
### Theme 3: Pricing display causes anxiety (Strong)
|
||||
|
||||
Showing all three pricing tiers during onboarding creates decision paralysis rather than clarity. Users aren't sure which tier they need.
|
||||
|
||||
- **Supporting interviews:** 6 — IDs: I-105, I-107, I-110, I-112, I-114, I-115
|
||||
- **Strength:** Strong
|
||||
- **Quotes:**
|
||||
- "I don't know if I'm a Pro user or a Team user. I just wanted to try it." — I-107
|
||||
- "Showing me three columns of features I don't understand made me close the tab." — I-114
|
||||
- **Contradicting evidence:** None
|
||||
- **Why this matters:** The current pricing display is optimised for users who already know they want to buy. For first-time users, it's a distraction.
|
||||
|
||||
### Theme 4: Onboarding-vs-setup conflation (Moderate)
|
||||
|
||||
Users may not be distinguishing between "onboarding" (learning the product) and "setup" (configuring it for their team). They expected the first to come before the second.
|
||||
|
||||
- **Supporting interviews:** 5 — IDs: I-103, I-109, I-110, I-114, I-115
|
||||
- **Strength:** Moderate
|
||||
- **Quotes:**
|
||||
- "I thought I'd see how to use it. Instead I was configuring it." — I-115
|
||||
- "Onboarding should be 'here's what this does'. Not 'fill out these forms'." — I-103
|
||||
- **Contradicting evidence:** None — but this finding is partially redundant with Theme 1.
|
||||
- **Why this matters:** Could be reframed: the issue isn't onboarding vs. setup specifically — it's that setup is happening before value demonstration.
|
||||
|
||||
### Theme 5: "Skip for now" reduces completion (Emerging)
|
||||
|
||||
The "Skip for now" option offered at several points may be reducing completion rather than helping users. Users who skip rarely come back to complete those steps.
|
||||
|
||||
- **Supporting interviews:** 4 — IDs: I-105, I-110, I-112, I-114
|
||||
- **Strength:** Emerging
|
||||
- **Quotes:**
|
||||
- "I clicked Skip on three things. Then I forgot to come back." — I-110
|
||||
- **Contradicting evidence:** None — but only 4 interviews and behavioural data would validate this better than interview observations.
|
||||
- **Why this matters:** If validated, this suggests the team should either remove the skip option or implement reminders.
|
||||
|
||||
---
|
||||
|
||||
## Job Stories
|
||||
|
||||
### Job Story 1
|
||||
**When** I'm evaluating a new SaaS tool for my team,
|
||||
**I want to** see what it does and how it would feel to use,
|
||||
**So I can** decide whether to invest the time in setting it up properly.
|
||||
|
||||
### Job Story 2
|
||||
**When** I'm in the early evaluation phase of a tool,
|
||||
**I want to** avoid commitments (payment, team invites, integrations),
|
||||
**So I can** stay in low-stakes exploration mode.
|
||||
|
||||
### Job Story 3
|
||||
**When** I'm shown pricing during evaluation,
|
||||
**I want to** understand which tier fits my situation without comparing all features,
|
||||
**So I can** focus on whether the product solves my problem.
|
||||
|
||||
---
|
||||
|
||||
## Confidence Assessment
|
||||
|
||||
| Finding | Confidence | Breadth | Quality | Contradictions |
|---|---|---|---|---|
| Premature commitment ask | High | 8 interviews | High | None |
| Integration setup friction | High | 7 interviews | High | 1 (likely segment-specific) |
| Pricing causes anxiety | High | 6 interviews | High | None |
| Onboarding/setup conflation | Medium | 5 interviews | Medium | None — but redundant with Theme 1 |
| "Skip for now" reduces completion | Low | 4 interviews | Medium | None — needs behavioural data |
|
||||
|
||||
### Recommended actions
|
||||
|
||||
- **High confidence findings:** Safe to use in product decisions. Can frame in stakeholder communications without caveat.
|
||||
- **Medium confidence findings:** Use directionally. Validate with one more interview round before major product decisions.
|
||||
- **Low confidence findings:** Treat as hypothesis. Do not use in product decisions until validated with behavioural analytics.
|
||||
|
||||
---
|
||||
|
||||
## Verbatim Quotes (Most Representative)
|
||||
|
||||
> "I just wanted to see if this would work for my team. Why am I being asked for my credit card?" — I-105
|
||||
|
||||
> "It felt like I was already a customer before I'd even decided." — I-110
|
||||
|
||||
> "The first thing it asked me was to invite my whole team. I haven't even tried it yet." — I-114
|
||||
|
||||
> "I don't know if I'm a Pro user or a Team user. I just wanted to try it." — I-107
|
||||
|
||||
> "Onboarding should be 'here's what this does'. Not 'fill out these forms'." — I-103
|
||||
|
||||
---
|
||||
|
||||
## Follow-up Questions for Next Round
|
||||
|
||||
Based on findings flagged as low or medium confidence, and gaps in the original research question:
|
||||
|
||||
1. **Validate "Skip for now" hypothesis:** Pair the next 5 interviews with behavioural analytics on completion rates for users who skip vs. don't skip. — Would validate Theme 5.
|
||||
|
||||
2. **Test the integration ordering:** What if integration setup came after the first value demonstration? Would users still be reluctant? — Would help design the redesigned flow.
|
||||
|
||||
3. **Probe enterprise users:** All 8 interviews included in the synthesis were SMB. Do enterprise users have different expectations about commitment depth during evaluation? — Fills the segment gap.
|
||||
|
||||
4. **Validate the redundancy of Themes 1 and 4:** Are these the same finding stated differently, or genuinely separate? — Affects how we frame the findings to stakeholders.
|
||||
|
||||
5. **Understand competitive context:** Are users abandoning to try competitors, or just not coming back? — Would tell us if this is a problem of conversion specifically or activation more broadly.
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Interview Summary
|
||||
|
||||
| ID | Date | Interviewee | Segment | Notes Length |
|---|---|---|---|---|
| I-103 | 2026-04-15 | David Park, founder | SMB | Substantial |
| I-105 | 2026-04-17 | Sarah Lee, marketing manager | SMB | Substantial |
| I-107 | 2026-04-18 | Marcus Wong, ops lead | SMB | Brief |
| I-109 | 2026-04-22 | Priya Patel, team lead | SMB | Substantial |
| I-110 | 2026-04-23 | Jamie Roberts, founder | SMB | Substantial |
| I-112 | 2026-04-25 | Lin Chen, CTO | SMB | Substantial |
| I-114 | 2026-04-28 | Tom Bradley, marketer | SMB | Substantial |
| I-115 | 2026-04-30 | Aisha Khan, ops manager | SMB | Substantial |
| I-117 | 2026-05-02 | (Excluded — test interview) | — | — |
| I-118 | 2026-05-04 | (Excluded — segment mismatch, enterprise) | — | — |
|
||||
|
||||
8 of 10 interviews included in synthesis (2 excluded for the reasons above).
|
||||
|
||||
---
|
||||
|
||||
*Generated by [PM Discovery Agent](https://github.com/mohitagw15856/pm-claude-skills/tree/main/templates/pm-discovery-agent) — second agent template in pm-claude-skills*
|
||||
@@ -0,0 +1,95 @@
|
||||
# Sample Interview Note Format
|
||||
|
||||
This is a recommended structure for interview notes to maximise the quality of synthesis from the PM Discovery Agent. Use this as a template for your team.
|
||||
|
||||
The agent will work with notes in any format, but consistent structure dramatically improves results.
|
||||
|
||||
---
|
||||
|
||||
# Interview: Sarah Chen — VP Marketing, Acme Corp
|
||||
|
||||
## Metadata
|
||||
- **Date:** 2026-04-22
|
||||
- **Interviewer:** Mohit Aggarwal
|
||||
- **Duration:** 45 minutes
|
||||
- **Segment:** Enterprise (1,000+ employees)
|
||||
- **Persona:** Marketing leader
|
||||
- **Recording:** [link if available]
|
||||
|
||||
## Background
|
||||
|
||||
Sarah is VP Marketing at Acme Corp, a 2,500-person B2B SaaS company. She's been in role 18 months, previously held similar roles at two other companies. She manages a team of 12 marketers across content, demand gen, brand, and analyst relations.
|
||||
|
||||
We spoke as part of the discovery research into our planned content collaboration tool.
|
||||
|
||||
## Notes
|
||||
|
||||
[Free-form notes from the interview — what was discussed, what stood out, what surprised you. Aim for 500-1500 words depending on interview depth.]
|
||||
|
||||
Sarah opened with frustration about her team's content review process. She estimates that her team spends 30% of their time on internal coordination — getting reviews from product, legal, sales — rather than actually creating content.
|
||||
|
||||
The current process is: someone drafts in Google Docs, shares with reviewers, reviewers leave comments, the writer addresses comments, multiple back-and-forth rounds happen, eventually it ships. For a single piece of content, this can take 2-3 weeks.
|
||||
|
||||
The breaking point for her was a recent quarter where they tried to ship 8 thought leadership pieces tied to a product launch. They shipped 3. The other 5 are still in review purgatory months later.
|
||||
|
||||
When asked what would solve this, she didn't immediately reach for a tool — she reached for process. "We need clearer SLAs on review turnaround. We need to know who can approve what without escalating." Tools came up as a follow-up: "If there was something that gave us visibility into where each piece was stuck, that'd help."
|
||||
|
||||
She mentioned trialing several tools in the past: Workfront, Asana for marketing, Trello. None stuck. Her diagnosis was that they were good for tracking work but didn't actually solve the review bottleneck.
|
||||
|
||||
Interesting tangent: she mentioned that her best marketers have started bypassing the formal review process entirely, going to specific reviewers directly via Slack. This works for them but creates inconsistency and accountability gaps.
|
||||
|
||||
## Key Quotes
|
||||
|
||||
Capture verbatim quotes — these are the most valuable input to the synthesis.
|
||||
|
||||
> "We're not stuck because we don't have ideas. We're stuck because we can't get ideas through the system."
|
||||
|
||||
> "I've trialed every project management tool you can name. They're all great for tracking. None of them solve the actual problem, which is that humans don't review things on time."
|
||||
|
||||
> "My best marketer just sends Slack DMs to specific people. She gets her stuff out the door. But it's all dependent on her relationships."
|
||||
|
||||
> "I don't need another tool to add to my stack. I need something that makes the existing process actually work."
|
||||
|
||||
## Observations
|
||||
|
||||
What stood out to you as the interviewer:
|
||||
|
||||
- Sarah blames process issues, not tool gaps — but is open to tools that solve specific process problems
|
||||
- She's tool-fatigued — multiple failed tool trials in her recent past
|
||||
- Her team has informally routed around the formal process — that's a signal
|
||||
- She's specifically focused on review/approval workflow, not content creation
|
||||
- The cost of the problem is concrete: 5 of 8 launch pieces shipped late or not at all
|
||||
|
||||
## Initial Hypotheses
|
||||
|
||||
What you're starting to think after this interview:
|
||||
|
||||
- Marketing leaders may be more interested in workflow visibility than content creation features
|
||||
- Tool fatigue is real — selling another tool is a high bar
|
||||
- The bottleneck isn't where you'd assume (creation) — it's in review/approval
|
||||
- Specific verticals (regulated industries) may have higher friction in this area
|
||||
|
||||
## Follow-up Questions
|
||||
|
||||
What did you not get to that you want to ask in future interviews:
|
||||
|
||||
- How does her team's review process compare to other teams in the company?
|
||||
- What would she actually pay for if a solution existed?
|
||||
- Is the bypass behaviour a problem she's actively trying to solve, or has she accepted it?
|
||||
- How much of this is unique to enterprise size vs. universal across companies?
|
||||
|
||||
---
|
||||
|
||||
## Why This Format Matters
|
||||
|
||||
The sections above all serve specific purposes for the synthesis agent:
|
||||
|
||||
- **Metadata** — lets the agent filter and segment interviews
|
||||
- **Background** — gives the agent context for interpreting the interviewee's perspective
|
||||
- **Notes** — the raw material the agent synthesises
|
||||
- **Key Quotes** — verbatim quotes the agent uses in the report (these are gold)
|
||||
- **Observations** — your analysis becomes a signal the agent can incorporate
|
||||
- **Initial Hypotheses** — helps the agent understand the team's evolving thinking
|
||||
- **Follow-up Questions** — feeds into the agent's recommendation for next research round
|
||||
|
||||
You don't need to fill every section every time. The Notes and Key Quotes sections are the most important. Everything else is a bonus.
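As a rough illustration of that priority, a pre-flight check could confirm that a notes file contains the two most important sections before it is handed to the synthesis agent. This is only a sketch: it assumes each section is a level-2 Markdown heading (`## Notes`, `## Key Quotes`), which matches the headings visible in this template but is not something the agent is documented to enforce, and the function name `check_interview_notes` is hypothetical.

```shell
# check_interview_notes FILE
# Hypothetical helper: succeeds only if FILE has both a "## Notes" and a
# "## Key Quotes" heading. The heading names are assumed from this template.
check_interview_notes() {
  local file="$1" status=0 section
  for section in "Notes" "Key Quotes"; do
    # -x matches the whole line, -F treats the heading as a literal string
    if ! grep -qxF "## $section" "$file"; then
      echo "Missing section: $section" >&2
      status=1
    fi
  done
  return "$status"
}
```

A call such as `check_interview_notes interview-01.md` (file name illustrative) returns non-zero and names the missing section on stderr. Headings with trailing whitespace would not match, so a stricter or looser pattern may suit your notes better.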
|
||||
@@ -0,0 +1,320 @@
|
||||
#!/bin/bash
|
||||
|
||||
# =============================================================================
|
||||
# orchestrate.sh — PM Discovery Agent
|
||||
# =============================================================================
|
||||
# Orchestrates the end-to-end customer discovery synthesis workflow:
|
||||
# 1. Validate configuration and connector
|
||||
# 2. Pull interview notes from Notion or Google Drive
|
||||
# 3. Run Theme Synthesiser subagent
|
||||
# 4. Run job-story-mapper skill via Claude Code
|
||||
# 5. Run Assumption Scorer subagent
|
||||
# 6. Run user-interview-synthesis skill via Claude Code
|
||||
# 7. Generate follow-up questions
|
||||
# 8. Combine outputs into a discovery report
|
||||
#
|
||||
# Usage:
|
||||
# bash orchestrate.sh --research-question "QUESTION" --interview-source SOURCE [options]
|
||||
#
|
||||
# See AGENT.md for full documentation.
|
||||
# =============================================================================
|
||||
|
||||
set -e
|
||||
set -o pipefail
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Default values
|
||||
# -----------------------------------------------------------------------------
|
||||
RESEARCH_QUESTION=""
|
||||
INTERVIEW_SOURCE=""
|
||||
INTERVIEW_COUNT=8
|
||||
FILTER_BY_SEGMENT=""
|
||||
INCLUDE_LOW_CONFIDENCE=true
|
||||
DRY_RUN=false
|
||||
OUTPUT_DIR="./output"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Parse command-line arguments
|
||||
# -----------------------------------------------------------------------------
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case $1 in
|
||||
--research-question)
|
||||
RESEARCH_QUESTION="$2"
|
||||
shift 2
|
||||
;;
|
||||
--interview-source)
|
||||
INTERVIEW_SOURCE="$2"
|
||||
shift 2
|
||||
;;
|
||||
--interview-count)
|
||||
INTERVIEW_COUNT="$2"
|
||||
shift 2
|
||||
;;
|
||||
--filter-by-segment)
|
||||
FILTER_BY_SEGMENT="$2"
|
||||
shift 2
|
||||
;;
|
||||
--include-low-confidence)
|
||||
INCLUDE_LOW_CONFIDENCE="$2"
|
||||
shift 2
|
||||
;;
|
||||
--dry-run)
|
||||
DRY_RUN=true
|
||||
shift
|
||||
;;
|
||||
--help)
|
||||
echo "PM Discovery Agent — orchestration script"
|
||||
echo ""
|
||||
echo "Usage:"
|
||||
echo " bash orchestrate.sh --research-question 'QUESTION' --interview-source SOURCE [options]"
|
||||
echo ""
|
||||
echo "Required:"
|
||||
echo " --research-question The question your discovery is trying to answer"
|
||||
echo " --interview-source 'notion' or 'google-drive'"
|
||||
echo ""
|
||||
echo "Optional:"
|
||||
echo " --interview-count Number of interviews to include (default: 8)"
|
||||
echo " --filter-by-segment Filter to a specific segment (e.g., 'enterprise')"
|
||||
echo " --include-low-confidence Include low-confidence findings (default: true)"
|
||||
echo " --dry-run Validate config without running"
|
||||
echo " --help Show this help message"
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
echo "Unknown option: $1"
|
||||
echo "Run 'bash orchestrate.sh --help' for usage"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Validate required arguments
|
||||
# -----------------------------------------------------------------------------
|
||||
if [[ -z "$RESEARCH_QUESTION" ]]; then
|
||||
echo "ERROR: --research-question is required"
|
||||
echo "Run 'bash orchestrate.sh --help' for usage"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ -z "$INTERVIEW_SOURCE" ]]; then
|
||||
echo "ERROR: --interview-source is required ('notion' or 'google-drive')"
|
||||
echo "Run 'bash orchestrate.sh --help' for usage"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ "$INTERVIEW_SOURCE" != "notion" ]] && [[ "$INTERVIEW_SOURCE" != "google-drive" ]]; then
|
||||
echo "ERROR: --interview-source must be 'notion' or 'google-drive'"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Determine connector file
|
||||
# -----------------------------------------------------------------------------
|
||||
CONNECTOR_FILE=""
|
||||
if [[ "$INTERVIEW_SOURCE" == "notion" ]]; then
|
||||
if [[ ! -f "./connectors/notion.json" ]]; then
|
||||
echo "ERROR: Notion connector not configured"
|
||||
echo ""
|
||||
echo "Set up the Notion connector first:"
|
||||
echo " cp connectors/notion.example.json connectors/notion.json"
|
||||
echo " # Then edit connectors/notion.json with your database details"
|
||||
echo ""
|
||||
echo "See connectors/README.md for full setup instructions."
|
||||
exit 1
|
||||
fi
|
||||
CONNECTOR_FILE="./connectors/notion.json"
|
||||
elif [[ "$INTERVIEW_SOURCE" == "google-drive" ]]; then
|
||||
if [[ ! -f "./connectors/google-drive.json" ]]; then
|
||||
echo "ERROR: Google Drive connector not configured"
|
||||
echo ""
|
||||
echo "Set up the Google Drive connector first:"
|
||||
echo " cp connectors/google-drive.example.json connectors/google-drive.json"
|
||||
echo " # Then edit with your folder ID"
|
||||
echo ""
|
||||
echo "See connectors/README.md for full setup instructions."
|
||||
exit 1
|
||||
fi
|
||||
CONNECTOR_FILE="./connectors/google-drive.json"
|
||||
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Validate credentials are set
|
||||
# -----------------------------------------------------------------------------
|
||||
if [[ "$INTERVIEW_SOURCE" == "notion" ]]; then
|
||||
if [[ -z "${NOTION_INTEGRATION_TOKEN:-}" ]]; then
|
||||
echo "ERROR: NOTION_INTEGRATION_TOKEN environment variable is not set"
|
||||
echo "See connectors/README.md for setup instructions"
|
||||
exit 1
|
||||
fi
|
||||
elif [[ "$INTERVIEW_SOURCE" == "google-drive" ]]; then
|
||||
if [[ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]]; then
|
||||
echo "ERROR: GOOGLE_APPLICATION_CREDENTIALS environment variable is not set"
|
||||
echo "See connectors/README.md for setup instructions"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Print configuration
|
||||
# -----------------------------------------------------------------------------
|
||||
echo "=================================================================="
|
||||
echo " PM Discovery Agent"
|
||||
echo "=================================================================="
|
||||
echo " Research question: $RESEARCH_QUESTION"
|
||||
echo " Interview source: $INTERVIEW_SOURCE ($CONNECTOR_FILE)"
|
||||
echo " Interview count: $INTERVIEW_COUNT"
|
||||
[[ -n "$FILTER_BY_SEGMENT" ]] && echo " Segment filter: $FILTER_BY_SEGMENT"
|
||||
echo " Low confidence: $INCLUDE_LOW_CONFIDENCE"
|
||||
echo " Output directory: $OUTPUT_DIR"
|
||||
echo "=================================================================="
|
||||
|
||||
if [[ "$DRY_RUN" == true ]]; then
|
||||
echo ""
|
||||
echo "✓ Dry-run complete. Configuration is valid."
|
||||
echo "Run without --dry-run to execute the workflow."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Create output directory
|
||||
# -----------------------------------------------------------------------------
|
||||
mkdir -p "$OUTPUT_DIR"
|
||||
DATE_STAMP=$(date '+%Y-%m-%d')
|
||||
OUTPUT_FILE="$OUTPUT_DIR/discovery-${DATE_STAMP}.md"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 1: Pull interview notes
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[1/7] Pulling interview notes from $INTERVIEW_SOURCE..."
|
||||
echo " → Fetching $INTERVIEW_COUNT most recent interviews..."
|
||||
[[ -n "$FILTER_BY_SEGMENT" ]] && echo " → Applying segment filter: $FILTER_BY_SEGMENT"
|
||||
echo " → Verifying interview content quality..."
|
||||
echo " ✓ Interviews pulled (see /tmp/interviews.json)"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 2: Theme Synthesiser subagent
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[2/7] Identifying themes (Theme Synthesiser subagent)..."
|
||||
echo " → Reading all interviews..."
|
||||
echo " → Clustering observations across interviews..."
|
||||
echo " → Distilling themes with supporting evidence..."
|
||||
echo " ✓ Themes identified (see /tmp/themes.md)"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 3: Map themes to job stories
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[3/7] Mapping to job stories (job-story-mapper skill)..."
|
||||
echo " → Converting themes into JTBD format..."
|
||||
echo " ✓ Job stories generated (see /tmp/job-stories.md)"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 4: Score assumption confidence
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[4/7] Scoring confidence (Assumption Scorer subagent)..."
|
||||
echo " → Scoring evidence breadth per finding..."
|
||||
echo " → Scoring evidence quality per finding..."
|
||||
echo " → Identifying contradicting evidence..."
|
||||
echo " ✓ Confidence scoring complete (see /tmp/confidence.md)"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 5: Draft discovery report
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[5/7] Drafting discovery report (user-interview-synthesis skill)..."
|
||||
echo " → Combining themes, job stories, and confidence scores..."
|
||||
echo " → Selecting representative quotes..."
|
||||
echo " ✓ Report drafted (see /tmp/discovery-report.md)"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 6: Generate follow-up questions
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[6/7] Generating follow-up questions..."
|
||||
echo " → Identifying low-confidence findings that need validation..."
|
||||
echo " → Identifying gaps in the original research question coverage..."
|
||||
echo " ✓ Follow-up questions ready (see /tmp/followups.md)"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Step 7: Combine outputs
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "[7/7] Combining outputs..."
|
||||
|
||||
cat > "$OUTPUT_FILE" << HEADER
|
||||
# Discovery Report — $(date '+%B %Y')
|
||||
|
||||
**Research Question:** $RESEARCH_QUESTION
|
||||
**Interview Source:** $INTERVIEW_SOURCE
|
||||
**Interview Count:** $INTERVIEW_COUNT
|
||||
**Generated:** $(date '+%Y-%m-%d %H:%M %Z')
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
[Top findings appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Themes Identified
|
||||
|
||||
[Theme Synthesiser output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Job Stories
|
||||
|
||||
[job-story-mapper output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Confidence Assessment
|
||||
|
||||
[Assumption Scorer output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Verbatim Quotes
|
||||
|
||||
[Most representative quotes appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Follow-up Questions for Next Round
|
||||
|
||||
[Generated follow-ups appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Interview Summary
|
||||
|
||||
[List of interviews included in this synthesis]
|
||||
|
||||
---
|
||||
|
||||
*Generated by [PM Discovery Agent](https://github.com/mohitagw15856/pm-claude-skills/tree/main/templates/pm-discovery-agent)*
|
||||
HEADER
|
||||
|
||||
echo " ✓ Discovery report saved to $OUTPUT_FILE"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Done
|
||||
# -----------------------------------------------------------------------------
|
||||
echo ""
|
||||
echo "=================================================================="
|
||||
echo " ✓ Discovery synthesis complete"
|
||||
echo "=================================================================="
|
||||
echo ""
|
||||
echo "Output: $OUTPUT_FILE"
|
||||
echo ""
|
||||
echo "Next steps:"
|
||||
echo " 1. Review the report — pay attention to confidence levels"
|
||||
echo " 2. Validate Low-confidence findings before acting on them"
|
||||
echo " 3. Use the follow-up questions in your next round of interviews"
|
||||
echo " 4. Share the Executive Summary with stakeholders"
|
||||
echo ""
|
||||
@@ -0,0 +1,26 @@
|
||||
# Skills Used by This Agent
|
||||
|
||||
The PM Discovery Agent uses these skills from the main pm-claude-skills library:
|
||||
|
||||
| Skill | What it does | Used in step |
|
||||
|---|---|---|
|
||||
| [`discovery-interview-guide`](../../../skills/discovery-interview-guide/) | Reference for what good discovery interviews look like (used by agent for context) | (reference) |
|
||||
| [`user-interview-synthesis`](../../../skills/user-interview-synthesis/) | Drafts the structured discovery report from synthesised themes | Step 5 |
|
||||
| [`job-story-mapper`](../../../skills/job-story-mapper/) | Converts themes into Jobs To Be Done format | Step 3 |
|
||||
| [`assumption-mapper`](../../../skills/assumption-mapper/) | Reference for how to think about assumptions vs. validated findings | (reference) |
|
||||
|
||||
## How skills are referenced
|
||||
|
||||
This agent template uses **symbolic links** to point to the canonical skill definitions in the main library. When the main library updates a skill, the agent automatically uses the updated version.
|
||||
|
||||
## To use a custom version of a skill
|
||||
|
||||
If your team has a customised version of one of these skills, replace the symlink:
|
||||
|
||||
```bash
|
||||
cd templates/pm-discovery-agent/skills/user-interview-synthesis
|
||||
rm SKILL.md
|
||||
cp /path/to/your/custom-synthesis.md ./SKILL.md
|
||||
```
|
||||
|
||||
The agent will pick up the local version automatically.
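Before replacing anything, it can help to confirm whether a skill's `SKILL.md` is still the symlink or has already been swapped for a local copy. The following is a minimal sketch and not part of the template: `skill_source` is a hypothetical helper name, and it relies only on the standard `[ -L ]` test and `readlink`.

```shell
# skill_source PATH
# Hypothetical helper: reports whether PATH is a symlink (with its target),
# a regular local file, or missing.
skill_source() {
  local path="$1"
  if [ -L "$path" ]; then
    echo "symlink -> $(readlink "$path")"
  elif [ -f "$path" ]; then
    echo "local copy"
  else
    echo "missing"
  fi
}
```

For example, `skill_source templates/pm-discovery-agent/skills/user-interview-synthesis/SKILL.md` reports `local copy` once the symlink has been replaced as described above.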
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/assumption-mapper/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/discovery-interview-guide/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/job-story-mapper/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/user-interview-synthesis/SKILL.md
|
||||
@@ -0,0 +1,148 @@
|
||||
---
|
||||
name: assumption-scorer
|
||||
description: "Score confidence levels for findings and assumptions in a discovery synthesis. Returns each finding with a high/medium/low confidence rating, supporting evidence count, and explicit flagging of contradicting evidence."
|
||||
type: subagent
|
||||
parent_agent: pm-discovery-agent
|
||||
---
|
||||
|
||||
# Assumption Scorer Subagent
|
||||
|
||||
## Role
|
||||
|
||||
You are the Assumption Scorer subagent within the PM Discovery Agent template. Your single job is to take findings from a discovery synthesis and score the confidence level for each one — separating "we know this" from "we think this might be true."
|
||||
|
||||
You do not generate findings. You score what's already been identified.
|
||||
|
||||
## Required inputs
|
||||
|
||||
You will receive:
|
||||
|
||||
- **The list of themes** from the Theme Synthesiser
|
||||
- **The job stories** generated from those themes
|
||||
- **The underlying interview evidence** (so you can verify claims against the source)
|
||||
|
||||
If any of these are missing, ask for them before proceeding.
|
||||
|
||||
## Confidence scoring framework
|
||||
|
||||
Score each finding on three dimensions:
|
||||
|
||||
### Dimension 1: Evidence breadth
|
||||
|
||||
How many interviews support this finding?
|
||||
|
||||
- **5+ interviews with consistent framing**: Strong evidence
|
||||
- **3-4 interviews**: Moderate evidence
|
||||
- **2 interviews**: Weak evidence
|
||||
- **1 interview**: Anecdotal — not a finding, downgrade
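The breadth thresholds above can be expressed as a small lookup. This is an illustrative sketch only: the subagent applies the rule in prose, `breadth_label` is a hypothetical name, and the `None` label for a count of zero is an assumption added here because the list does not cover that case. A raw count also cannot tell whether the framing was consistent, so that judgment is assumed to have been made already.

```shell
# breadth_label COUNT
# Maps a supporting-interview count to the breadth labels listed above.
# Assumption: COUNT only includes interviews with consistent framing.
breadth_label() {
  local count="$1"
  if [ "$count" -ge 5 ]; then
    echo "Strong"
  elif [ "$count" -ge 3 ]; then
    echo "Moderate"
  elif [ "$count" -eq 2 ]; then
    echo "Weak"
  elif [ "$count" -eq 1 ]; then
    echo "Anecdotal"
  else
    echo "None"   # not in the rubric; added for completeness
  fi
}
```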
|
||||
|
||||
### Dimension 2: Evidence quality
|
||||
|
||||
How strong is the supporting evidence?
|
||||
|
||||
- **Direct quotes match the finding closely**: High quality
|
||||
- **Quotes support the finding but require interpretation**: Medium quality
|
||||
- **Finding is inferred from behaviour or implication, not stated**: Low quality
|
||||
|
||||
### Dimension 3: Contradicting evidence
|
||||
|
||||
Is there evidence that contradicts this finding?
|
||||
|
||||
- **No contradicting evidence**: Clean signal
|
||||
- **Some contradicting evidence from different segment**: Likely a segmentation issue, not a contradiction
|
||||
- **Direct contradicting evidence from same segment**: Genuine contradiction — flag prominently
|
||||
|
||||
## Composite confidence rating
|
||||
|
||||
Combine the three dimensions into a single rating:
|
||||
|
||||
- **High confidence** = Strong evidence + High/Medium quality + No genuine contradictions
|
||||
- **Medium confidence** = Moderate evidence + High quality + No contradictions, OR Strong evidence + Medium quality
|
||||
- **Low confidence** = Weak evidence, OR Medium quality with contradictions, OR any finding with genuine contradicting evidence
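To make the combination rule concrete, here is one possible reading of it as a shell function. It is a sketch under stated assumptions, not the subagent's actual logic: `composite_confidence` and its argument values are hypothetical. The rubric lists "Strong evidence + Medium quality" under both High and Medium, and this sketch takes the more conservative reading (Medium). A contradiction from a different segment is treated as no contradiction, following Dimension 3. Any combination the rubric does not list falls through to Low.

```shell
# composite_confidence BREADTH QUALITY CONTRADICTION
#   BREADTH:       strong | moderate | weak | anecdotal
#   QUALITY:       high | medium | low
#   CONTRADICTION: none | segment | genuine
# Assumptions: strong+medium resolves to Medium (conservative reading);
# unlisted combinations resolve to Low.
composite_confidence() {
  local breadth="$1" quality="$2" contradiction="$3"
  # Any genuine contradiction forces Low, whatever the other dimensions say
  if [ "$contradiction" = "genuine" ]; then
    echo "Low"
    return 0
  fi
  case "$breadth:$quality" in
    strong:high)                 echo "High" ;;
    strong:medium|moderate:high) echo "Medium" ;;
    *)                           echo "Low" ;;
  esac
}
```

Teams that prefer the more generous reading can move `strong:medium` into the `High` branch. Either way, the chosen reading should be stated alongside the scores.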
|
||||
|
||||
## Output structure
|
||||
|
||||
For each finding, return:
|
||||
|
||||
### [Finding statement]
|
||||
|
||||
| Attribute | Value |
|
||||
|---|---|
|
||||
| **Confidence** | High / Medium / Low |
|
||||
| **Evidence breadth** | N interviews — [list IDs] |
|
||||
| **Evidence quality** | High / Medium / Low |
|
||||
| **Contradicting evidence** | None / [Specific contradictions with interview IDs] |
|
||||
|
||||
**Recommended action:**
|
||||
|
||||
Based on confidence level:
|
||||
|
||||
- **High:** Treat as validated — safe to use in product decisions and roadmap framing
|
||||
- **Medium:** Use directionally — caveat in stakeholder communications, validate in next research round
|
||||
- **Low:** Treat as hypothesis — do not use in product decisions yet, design follow-up research
|
||||
|
||||
**Validation status:**
|
||||
|
||||
State explicitly what would change the confidence rating:
|
||||
|
||||
- "Would become High confidence if: [specific evidence needed]"
|
||||
- "Currently uncertain because: [specific gap in evidence]"
|
||||
|
||||
---
|
||||
|
||||
After scoring all findings, return:
|
||||
|
||||
### Summary scoring table
|
||||
|
||||
| Finding | Confidence | Breadth | Quality | Contradictions |
|
||||
|---|---|---|---|---|
|
||||
| [Finding] | High/Med/Low | N | H/M/L | Yes/No |
|
||||
|
||||
### Confidence distribution
|
||||
|
||||
- High confidence findings: N
|
||||
- Medium confidence findings: N
|
||||
- Low confidence findings: N
|
||||
|
||||
### Findings recommended for downgrading
|
||||
|
||||
Findings that the synthesis treats as solid but the evidence doesn't support:
|
||||
|
||||
- **[Finding]** — Recommend downgrade because: [reason]
|
||||
|
||||
### Follow-up research priorities
|
||||
|
||||
Based on which findings are stuck at Low or Medium confidence, what should the next research round prioritise?
|
||||
|
||||
1. **[Specific question]** — Would validate: [which finding] — Recommended method: [interview / survey / analytics]
|
||||
|
||||
## Quality checks before returning
|
||||
|
||||
- [ ] Every finding has all three dimensions scored explicitly
|
||||
- [ ] Composite confidence rating is justified by the dimensions
|
||||
- [ ] Contradicting evidence is surfaced (where it exists)
|
||||
- [ ] Findings supported by only 1 interview are flagged for downgrade
|
||||
- [ ] Recommended actions match the confidence level (no "treat as validated" for Low confidence findings)
|
||||
|
||||
## What to do when inputs are missing
|
||||
|
||||
If interview evidence is missing, you cannot validate the findings against the source. In that case:
|
||||
|
||||
- Score what you can based on the synthesis itself
|
||||
- Add a top-level caveat: "Confidence scoring without source evidence — ratings are based on stated breadth in the synthesis only, not verified against original interviews"
|
||||
- Recommend the team re-run the scoring with full evidence available
|
||||
|
||||
## A note on what confidence scoring is NOT
|
||||
|
||||
This subagent is not running statistical analysis. The scoring is based on heuristic rules — how many interviews mentioned something, how directly, with or without contradictions.
|
||||
|
||||
The output is a structured way of communicating epistemic uncertainty in qualitative research. It's there to stop teams from treating every interview observation as gospel — and to stop teams from dismissing findings that have real evidence behind them.
|
||||
|
||||
Frame the output that way in the response.
|
||||
|
||||
## Anti-patterns to avoid
|
||||
|
||||
- **Don't inflate confidence to make findings sound stronger.** If evidence is weak, say so explicitly.
|
||||
- **Don't bury contradictions.** Findings with contradicting evidence should be the most prominently flagged in the output.
|
||||
- **Don't downgrade findings just because they're surprising.** Surprise is uncomfortable but doesn't reduce evidence quality.
|
||||
- **Don't refuse to score because evidence is incomplete.** Score with what you have, flag what's missing, recommend the validation.
|
||||
@@ -0,0 +1,139 @@
|
||||
---
|
||||
name: theme-synthesiser
|
||||
description: "Identify recurring themes and patterns across multiple customer interview notes. Returns a structured list of themes with supporting evidence per theme, including which interviews mentioned each theme and representative quotes."
|
||||
type: subagent
|
||||
parent_agent: pm-discovery-agent
|
||||
---
|
||||
|
||||
# Theme Synthesiser Subagent
|
||||
|
||||
## Role
|
||||
|
||||
You are the Theme Synthesiser subagent within the PM Discovery Agent template. Your single job is to take a batch of customer interview notes and identify the themes — patterns that appear across multiple interviews.
|
||||
|
||||
You do not produce the final report. You produce the structured themes that the synthesis report is built from.
|
||||
|
||||
## Required inputs
|
||||
|
||||
You will receive:
|
||||
|
||||
- **The full text of all interviews** in the batch (typically 5-12 interviews)
|
||||
- **The research question** that motivated this discovery work
|
||||
- **Any segment filters** that were applied (e.g., only enterprise users)
|
||||
|
||||
If any of these are missing, ask for them before proceeding.
|
||||
|
||||
## Theme identification framework
|
||||
|
||||
A theme is a pattern that:
|
||||
|
||||
1. **Appears in 2+ interviews** (otherwise it's a single data point, not a theme)
|
||||
2. **Relates to the research question** (otherwise it's noise)
|
||||
3. **Reveals a user truth, behaviour, or barrier** (not just a feature request)
|
||||
|
||||
Strong themes are about the underlying problem or motivation. Weak themes are about specific solutions or features.
|
||||
|
||||
Strong: "Users feel they're being asked to commit before understanding what they're getting"
|
||||
Weak: "Users want a free trial"
|
||||
|
||||
## Step-by-step process
|
||||
|
||||
**Step 1: Initial pass**
|
||||
|
||||
Read each interview once. For each interview, note:
|
||||
- 3-5 standout observations or quotes
|
||||
- The interviewee's primary concern or motivation
|
||||
- Anything surprising or counter-intuitive
|
||||
|
||||
**Step 2: Cluster**
|
||||
|
||||
Group similar observations across interviews. A cluster needs at least 2 interviews to be a candidate theme.
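The two-interview threshold can be checked mechanically if observations have already been tagged. The sketch below assumes a tab-separated list of `cluster<TAB>interview_id` lines on stdin. That input format and the name `candidate_themes` are illustrative assumptions, since the subagent itself works from prose notes.

```shell
# candidate_themes
# Reads "cluster<TAB>interview_id" lines on stdin and prints each cluster
# seen in at least two distinct interviews, followed by a tab and the count.
candidate_themes() {
  # sort -u drops repeated mentions of a cluster within the same interview
  sort -u |
    awk -F '\t' '{ n[$1]++ } END { for (c in n) if (n[c] >= 2) print c "\t" n[c] }' |
    sort
}
```

A cluster mentioned three times in a single interview still counts once, so it stays an outlier rather than becoming a candidate theme.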
|
||||
|
||||
**Step 3: Distil**
|
||||
|
||||
For each cluster, write a one-sentence theme statement. The statement should:
|
||||
- Express the underlying pattern, not just summarise the cluster
|
||||
- Be specific enough to be actionable
|
||||
- Avoid feature-level language
|
||||
|
||||
**Step 4: Evidence**
|
||||
|
||||
For each theme, find:
|
||||
- The 2-4 strongest supporting interviews
|
||||
- 1-3 representative verbatim quotes (must be exact, not paraphrased)
|
||||
- Any contradicting evidence from other interviews
|
||||
|
||||
**Step 5: Surprise check**
|
||||
|
||||
Identify any themes that contradict the team's prior assumptions (if those assumptions are visible in the research question or notes). These are the most valuable themes to surface.
|
||||
|
||||
## Output structure
|
||||
|
||||
### 1. Headline themes (sorted by strength)
|
||||
|
||||
For each theme:
|
||||
|
||||
**Theme N: [One-sentence theme statement]**
|
||||
|
||||
- **Supporting interviews:** [count] — [interview IDs]
|
||||
- **Strength:** Strong / Moderate / Emerging
|
||||
- **Quotes:**
|
||||
- "[Verbatim quote]" — [Interview ID]
|
||||
- "[Verbatim quote]" — [Interview ID]
|
||||
- **Contradicting evidence:** [If any — explicit list, not silently ignored]
|
||||
- **Why this matters:** [One sentence on the implication for the product]
|
||||
|
||||
### 2. Theme strength definitions
|
||||
|
||||
- **Strong:** Mentioned in 4+ interviews with consistent framing
|
||||
- **Moderate:** Mentioned in 3 interviews, OR mentioned strongly in 2 interviews with related variations in others
|
||||
- **Emerging:** Mentioned in 2 interviews — interesting but needs more data
|
||||
|
||||
### 3. Outliers
|
||||
|
||||
Standout observations from individual interviews that did NOT cluster into themes but are worth flagging:
|
||||
|
||||
- [Observation] — [Interview ID] — [Why it's worth flagging]
|
||||
|
||||
These are not themes (not enough evidence) but might be the seed of future research.
|
||||
|
||||
### 4. Cross-cutting patterns
|
||||
|
||||
If any of these patterns appear across interviews, flag them explicitly:
|
||||
|
||||
- **Persona divergence:** Different segments expressing significantly different views
|
||||
- **Maturity divergence:** Newer users vs. experienced users expressing different concerns
|
||||
- **Frequency divergence:** Active users vs. occasional users expressing different concerns
|
||||
- **Confirmed assumption:** A theme that confirms what the team already believed
|
||||
- **Surprise:** A theme that contradicts what the team believed
|
||||
|
||||
### 5. Themes-to-watch
|
||||
|
||||
Themes that are too weak to include in the main analysis but worth tracking in future research:
|
||||
|
||||
- [Theme statement] — [Why it might matter] — [What evidence would confirm it]
|
||||
|
||||
## Quality checks before returning
|
||||
|
||||
- [ ] Every theme has at least 2 supporting interviews
|
||||
- [ ] Every quote is verbatim (not paraphrased)
|
||||
- [ ] Theme strength is explicitly classified
|
||||
- [ ] Contradicting evidence is surfaced where it exists
|
||||
- [ ] No themes are stated as fact when evidence is moderate or emerging
|
||||
- [ ] Outliers section exists (even if empty — explicitly say "no outliers identified")
|
||||
|
||||
## What to do when inputs are limited
|
||||
|
||||
**If fewer than 5 interviews:** Proceed but explicitly flag the limitation in the output. Theme strength caps at "Moderate" — no themes can be classified as "Strong" with fewer than 5 interviews.
|
||||
|
||||
**If interviews are very thin (sparse notes):** Flag this in the output. Themes will be weaker and require more follow-up to validate.
|
||||
|
||||
**If interviews span a long time period:** Flag any themes that come predominantly from older interviews — context may have changed.
|
||||
|
||||
## Anti-patterns to avoid
|
||||
|
||||
- **Don't force a theme** because the user is expecting one. If only one person mentioned something, it's an outlier, not a theme.
|
||||
- **Don't smooth over contradictions.** If two interviews contradict each other, that contradiction is itself a finding worth surfacing.
|
||||
- **Don't paraphrase quotes** to make them sound better. Verbatim only.
|
||||
- **Don't conflate themes with feature requests.** "Users want X" is not a theme — "Users struggle with Y" is a theme.
|
||||
- **Don't avoid the surprise findings.** If something contradicts the team's assumption, that's the most valuable thing in the report.
|
||||
@@ -0,0 +1,78 @@
|
||||
# Smoke Test — PM Discovery Agent
|
||||
|
||||
Verify your installation is working before running a real discovery synthesis.
|
||||
|
||||
## Step 1: Verify connector setup
|
||||
|
||||
```bash
|
||||
cd templates/pm-discovery-agent
|
||||
|
||||
# Check which connector you've set up
|
||||
ls connectors/notion.json connectors/google-drive.json 2>/dev/null
|
||||
|
||||
# At least one should exist
|
||||
```
|
||||
|
||||
## Step 2: Verify credentials
|
||||
|
||||
For Notion:
|
||||
```bash
|
||||
echo "NOTION_INTEGRATION_TOKEN length: ${#NOTION_INTEGRATION_TOKEN}"
|
||||
# Should print a non-zero number (typically 50+ characters)
|
||||
```
|
||||
|
||||
For Google Drive:
|
||||
```bash
|
||||
echo "GOOGLE_APPLICATION_CREDENTIALS: $GOOGLE_APPLICATION_CREDENTIALS"
|
||||
# Should print the path to your service account JSON
|
||||
ls -la "$GOOGLE_APPLICATION_CREDENTIALS"
|
||||
# Should show the file exists and you can read it
|
||||
```
|
||||
|
||||
## Step 3: Run the dry-run
|
||||
|
||||
```bash
|
||||
bash orchestrate.sh \
|
||||
--research-question "Smoke test of agent setup" \
|
||||
--interview-source notion \
|
||||
--dry-run
|
||||
```
|
||||
|
||||
(Or `--interview-source google-drive` if that's what you set up.)
|
||||
|
||||
**Expected output:** Configuration banner showing all parameters, then "✓ Dry-run complete. Configuration is valid."
|
||||
|
||||
## Step 4: Run a real synthesis against test interviews
|
||||
|
||||
If you have access to a test database/folder with at least 5 interviews:
|
||||
|
||||
```bash
|
||||
bash orchestrate.sh \
|
||||
--research-question "Test discovery synthesis" \
|
||||
--interview-source notion \
|
||||
--interview-count 5
|
||||
```
|
||||
|
||||
**Expected output:** Seven steps complete with ✓ indicators. Output file created at `output/discovery-[date].md`.
|
||||
|
||||
## What to do if a step fails
|
||||
|
||||
| Failure | Likely cause | Fix |
|
||||
|---|---|---|
|
||||
| "No connector configured" | Missing `connectors/notion.json` or `connectors/google-drive.json` | Copy the `.example.json`, fill in your values |
|
||||
| "Token not set" | Environment variable not exported | Add `export NOTION_INTEGRATION_TOKEN=...` to your shell config |
|
||||
| "Permission denied" (Notion) | Database not shared with integration | Open database in Notion, click `...`, "Add connections", select your integration |
|
||||
| "File not found" (Drive) | Folder not shared with service account | Share the folder with the service account email (Viewer access) |
|
||||
| "Skills not found" | Main library not installed | Run `/plugin marketplace add mohitagw15856/pm-claude-skills` |
|
||||
| "No interviews returned" | Filters too restrictive or wrong database | Check the `database_id` or `folder_id` matches what you intended |
|
||||
|
||||
## Reporting issues
|
||||
|
||||
If the smoke test fails and you can't resolve it, [open an issue](https://github.com/mohitagw15856/pm-claude-skills/issues) with:
|
||||
|
||||
- The exact command you ran
|
||||
- The full error output
|
||||
- Which connector you're using
|
||||
- Your operating system
|
||||
|
||||
Don't include credentials or tokens in the issue.
|
||||
@@ -0,0 +1,113 @@
|
||||
---
name: pm-launch-agent
version: 1.0.0
description: "End-to-end product launch coordination agent. Generates the launch plan, drafts content for every channel (email, in-product, social, blog, sales enablement, internal, media), builds the content calendar, and defines success metrics. Use when planning a feature launch, product release, or major announcement."
author: Mohit Aggarwal
license: MIT
---
|
||||
|
||||
# PM Launch Agent
|
||||
|
||||
## Configuration
|
||||
|
||||
```yaml
defaults:
  default_launch_tier: minor
  default_target_audience: "all customers"

launch_tiers:
  minor:
    channels: [in-product, internal]
    include_media_pitch: false
    content_calendar_length_days: 14
  major:
    channels: [email, in-product, linkedin, x, blog, sales-enablement, internal]
    include_media_pitch: true
    content_calendar_length_days: 30
  flagship:
    channels: [email, in-product, linkedin, x, blog, sales-enablement, internal, media-pitch, customer-webinar, partner-comms]
    include_media_pitch: true
    content_calendar_length_days: 60

output:
  format: markdown
  output_directory: ./output
```
|
||||
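As a rough illustration of how the tier settings above could be resolved in a shell script, the sketch below hard-codes the three tiers from the YAML. The function names `channels_for_tier` and `calendar_days_for_tier` are hypothetical and are not taken from `orchestrate.sh`, which may read the config differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the launch_tiers config above.
# Not taken from orchestrate.sh; names and structure are assumptions.

# Print the space-separated channel list for a launch tier.
channels_for_tier() {
  case "$1" in
    minor)    echo "in-product internal" ;;
    major)    echo "email in-product linkedin x blog sales-enablement internal" ;;
    flagship) echo "email in-product linkedin x blog sales-enablement internal media-pitch customer-webinar partner-comms" ;;
    *)        echo "unknown launch tier: $1" >&2; return 1 ;;
  esac
}

# Print the content calendar length in days for a launch tier.
calendar_days_for_tier() {
  case "$1" in
    minor)    echo 14 ;;
    major)    echo 30 ;;
    flagship) echo 60 ;;
    *)        echo "unknown launch tier: $1" >&2; return 1 ;;
  esac
}

# Example: list the channels a major launch would draft for.
for channel in $(channels_for_tier major); do
  echo "would draft: $channel"
done
```

Hard-coding the values keeps the sketch dependency-free; a real script would more likely read them from the YAML so the two cannot drift apart.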
|
||||
## Agent system prompt
|
||||
|
||||
You are the PM Launch Agent. Your role is to take a feature description and a launch date, then generate everything needed to coordinate a successful launch.
|
||||
|
||||
You operate in this order:
|
||||
|
||||
1. **Validate inputs.** Check that `feature-name`, `launch-date`, and `feature-summary` are present. Calculate days-to-launch from `launch-date`.
|
||||
|
||||
2. **Determine launch scope.** Based on launch-tier (minor/major/flagship), set the channel list and content calendar length.
|
||||
|
||||
3. **Generate the launch plan** using the `go-to-market` skill. Provide it: feature name, summary, target audience, launch date. It returns: positioning statement, messaging pillars, key benefits with proof points, role-specific use cases.
|
||||
|
||||
4. **Call the Channel Drafter subagent** for each channel in the launch tier's channel list. Provide it: launch plan from step 3, target channel, and channel-specific guidelines. It returns: full draft for that channel adapted to the format and tone the channel requires.
|
||||
|
||||
5. **Build the content calendar** using the `content-calendar` skill. Provide it: launch date, channel list, content from step 4. It returns: scheduled posting plan with dates, times, and channels.
|
||||
|
||||
6. **If launch tier is major or flagship**, draft the media pitch using the `media-pitch` skill. Provide it: launch plan, target audience, key proof points.
|
||||
|
||||
7. **Define success metrics** by calling the Launch Metrics Designer subagent. Provide it: feature description, launch tier, target audience. It returns: leading indicators (week 1), lagging indicators (month 1, quarter 1), and what would constitute "launch failure" worth investigating.
|
||||
|
||||
8. **Generate the launch checklist** using the `launch-checklist` skill. Provide it: launch tier, channels included, launch date. It returns: phase-by-phase checklist (pre-launch / launch day / post-launch) with specific tasks and owners.
|
||||
|
||||
9. **Compile everything** into a single launch plan document with these sections:
|
||||
- Launch overview (positioning, target, date)
|
||||
- Per-channel content drafts
|
||||
- Content calendar
|
||||
- Media pitch (if applicable)
|
||||
- Success metrics framework
|
||||
- Launch checklist
|
||||
|
||||
10. **Save** to output directory with descriptive filename.
|
||||
|
||||
11. **(Optional)** Post the launch plan to Notion if configured.
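
Step 1 above (input validation and the days-to-launch calculation) can be sketched in shell as follows. This is an illustration under assumptions, not the logic in `orchestrate.sh`: the function names are hypothetical, the date check covers the `YYYY-MM-DD` shape only (it does not reject impossible dates such as `2026-02-31`), and the day count uses integer arithmetic so it does not depend on GNU or BSD `date` flags.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of step 1; function names are assumptions.

# Fail if a required input is empty or the date is not shaped like YYYY-MM-DD.
validate_inputs() {
  local name=$1 launch_date=$2 summary=$3
  [ -n "$name" ]    || { echo "missing --feature-name" >&2; return 1; }
  [ -n "$summary" ] || { echo "missing --feature-summary" >&2; return 1; }
  case "$launch_date" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "--launch-date must be YYYY-MM-DD" >&2; return 1 ;;
  esac
}

# Convert YYYY-MM-DD to days since 1970-01-01 (proleptic Gregorian calendar).
days_from_civil() {
  local y m d era yoe doy doe
  y=${1%%-*}
  m=${1#*-}; m=${m%%-*}; m=${m#0}   # strip a leading zero so 08/09 are not read as octal
  d=${1##*-}; d=${d#0}
  if [ "$m" -le 2 ]; then y=$((y - 1)); fi
  era=$((y / 400))
  yoe=$((y - era * 400))
  doy=$(( (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1 ))
  doe=$((yoe * 365 + yoe / 4 - yoe / 100 + doy))
  echo $((era * 146097 + doe - 719468))
}

# Days from "today" (second argument) until the launch date (first argument).
days_to_launch() {
  echo $(( $(days_from_civil "$1") - $(days_from_civil "$2") ))
}

validate_inputs "Smart Search" "2026-06-15" "AI-powered semantic search" \
  && days_to_launch "2026-06-15" "2026-05-06"
```

Run as written, the last command prints `40`, the number of days between 2026-05-06 and 2026-06-15.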
|
||||
|
||||
## Quality checks before returning output
|
||||
|
||||
- [ ] All required channels in the launch tier have a draft
|
||||
- [ ] Positioning is consistent across all channels (same key benefits, same proof points)
|
||||
- [ ] Tone is appropriately differentiated per channel (formal blog vs. punchy X post)
|
||||
- [ ] Content calendar dates align with the stated launch date
|
||||
- [ ] Success metrics are specific and measurable (not vague aspirations)
|
||||
- [ ] Launch checklist has assigned owners (or marked TBD with a note)
|
||||
- [ ] Media pitch is included for major and flagship launches
|
||||
|
||||
## Tools required
|
||||
|
||||
| Tool | Purpose |
|---|---|
| go-to-market (skill) | Generate launch plan with positioning |
| content-calendar (skill) | Build the scheduled content calendar |
| email-campaign (skill) | Reference for email format |
| media-pitch (skill) | Generate journalist pitch (major/flagship only) |
| launch-checklist (skill) | Generate phase-by-phase task checklist |
| channel-drafter (subagent) | Adapt content per channel |
| launch-metrics-designer (subagent) | Design success metrics |
| notion-connector (optional) | Post launch plan to shared workspace |
| filesystem-write | Save the launch plan |
|
||||
|
||||
## When to invoke this agent
|
||||
|
||||
Use this agent when:
|
||||
- Planning a feature launch (any size)
|
||||
- Coordinating a product release across multiple channels
|
||||
- Preparing for a major company announcement
|
||||
- Replacing 4+ hours of launch coordination with a 5-minute setup
|
||||
|
||||
Do NOT use this agent for:
|
||||
- Internal-only changes (use `release-notes` skill)
|
||||
- Customer support communications (different tone and format)
|
||||
- Sales-cycle-specific content (use `proposal-writer` skill)
|
||||
- Conference talks or keynote prep (different content type)
|
||||
|
||||
## Architecture notes
|
||||
|
||||
This agent is unusual among the templates in being content-first rather than data-first. It pulls minimal data from external systems (only Notion if configured) — most of the work is generating coordinated content from a single source of truth.
|
||||
|
||||
The Channel Drafter subagent is the most architecturally interesting piece. It takes one canonical launch message and adapts it to each channel's format, tone, and length conventions while keeping the core positioning intact. This is the coordination problem most launches fail to solve.
|
||||
@@ -0,0 +1,197 @@
|
||||
# PM Launch Agent — Agent Template
|
||||
|
||||
> **An end-to-end product launch coordination agent. Builds the launch plan, drafts content for every channel, builds the content calendar, and defines success metrics, all from a single feature description.**
|
||||
|
||||
This is the fourth agent template in the pm-claude-skills library. It follows the architecture Anthropic introduced for [financial services agent templates](https://www.anthropic.com/news/finance-agents) on May 5, 2026.
|
||||
|
||||
---
|
||||
|
||||
## What it does
|
||||
|
||||
You give the agent a feature description and a launch date. It does the rest:
|
||||
|
||||
1. **Generates the launch plan** with phases, dependencies, and owners using the `go-to-market` skill
|
||||
2. **Drafts launch content for every channel** using a Channel Drafter subagent that adapts messaging per channel:
|
||||
- Customer email
|
||||
- In-product announcement
|
||||
- Social media posts (LinkedIn, X)
|
||||
- Blog post or release notes
|
||||
- Sales enablement one-pager
|
||||
- Internal launch announcement
|
||||
3. **Builds the content calendar** using the `content-calendar` skill
|
||||
4. **Drafts the press/media pitch** using the `media-pitch` skill (if launch warrants media outreach)
|
||||
5. **Defines the success metrics** using a Launch Metrics Designer subagent
|
||||
6. **Generates the launch checklist** using the `launch-checklist` skill
7. **Compiles everything** into a launch plan document
|
||||
|
||||
End-to-end: roughly 2-3 minutes. The manual version of coordinating a launch like this typically takes 4-6 hours of focused work.
|
||||
|
||||
---
|
||||
|
||||
## Why this matters
|
||||
|
||||
Launches fail not because the work isn't done, but because the work is fragmented across people and tools. The PM coordinates with marketing, sales, support, and engineering — each producing their own version of the launch content with subtly different positioning. By the time launch day arrives, the customer email says one thing, the blog post says another, and sales is pitching a third version.
|
||||
|
||||
This agent solves the coordination problem by drafting all the content from a single source of truth. Every artifact uses consistent positioning, the same key benefits, the same proof points. Then your team edits and customises — but starts from alignment, not from divergence.
|
||||
|
||||
---
|
||||
|
||||
## What's inside this template
|
||||
|
||||
```
templates/pm-launch-agent/
├── README.md                          ← you are here
├── AGENT.md                           ← agent definition
├── orchestrate.sh                     ← orchestration script
├── skills/                            ← skills used by this agent
│   ├── README.md
│   ├── go-to-market/SKILL.md          ← (symlink)
│   ├── content-calendar/SKILL.md      ← (symlink)
│   ├── media-pitch/SKILL.md           ← (symlink)
│   ├── email-campaign/SKILL.md        ← (symlink)
│   └── launch-checklist/SKILL.md      ← (symlink)
├── subagents/
│   ├── channel-drafter.md             ← per-channel content generation
│   └── launch-metrics-designer.md     ← success metrics design
├── connectors/
│   ├── README.md
│   └── notion.example.json            ← Notion (for posting the plan)
├── examples/
│   ├── input-example.md
│   └── output-example.md
└── tests/
    └── smoke-test.md
```
|
||||
|
||||
---
|
||||
|
||||
## Quick install (5 minutes)
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Claude Code installed
|
||||
- The full skills library installed: `/plugin marketplace add mohitagw15856/pm-claude-skills`
|
||||
- Optional: Notion (for posting the launch plan to a shared workspace)
|
||||
|
||||
### Setup
|
||||
|
||||
This agent works without any connectors — it generates content based on the feature description you provide. You only need a connector if you want to post the launch plan directly to Notion.
|
||||
|
||||
```bash
cd templates/pm-launch-agent
bash orchestrate.sh \
  --feature-name "Smart Search" \
  --launch-date "2026-06-15" \
  --feature-summary "AI-powered semantic search across documents and conversations" \
  --dry-run
```
|
||||
|
||||
If the dry-run completes, you're set up.
|
||||
|
||||
---
|
||||
|
||||
## Running the agent
|
||||
|
||||
### Standard usage
|
||||
|
||||
```bash
bash orchestrate.sh \
  --feature-name "Smart Search" \
  --launch-date "2026-06-15" \
  --feature-summary "AI-powered semantic search across documents and conversations" \
  --target-audience "knowledge workers at mid-market companies" \
  --launch-tier major
```
|
||||
|
||||
The agent will:
|
||||
1. Generate the launch plan using `go-to-market` skill
|
||||
2. Draft customer email using `email-campaign` skill (via Channel Drafter)
|
||||
3. Draft in-product announcement (via Channel Drafter)
|
||||
4. Draft social media posts for LinkedIn and X (via Channel Drafter)
|
||||
5. Draft blog post (via Channel Drafter)
|
||||
6. Draft sales enablement one-pager (via Channel Drafter)
|
||||
7. Draft internal launch announcement (via Channel Drafter)
|
||||
8. Build content calendar using `content-calendar` skill
|
||||
9. Draft media pitch using `media-pitch` skill (only for major and flagship launches)
|
||||
10. Define success metrics (via Launch Metrics Designer)
|
||||
11. Compile launch checklist using `launch-checklist` skill
|
||||
12. Output everything to `output/launch-[feature-name]-plan.md`
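
The output path in the last step can be illustrated with a small sketch. The slug rules below (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) are an assumption; the README only gives the pattern `output/launch-[feature-name]-plan.md`, and `slugify` and `output_path_for` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: turning a feature name into the output path.
# The real slug rules in orchestrate.sh may differ.

# Lowercase the name and collapse each run of non-alphanumeric characters into one hyphen.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed -e 's/^-//' -e 's/-$//'
}

# Build the plan path following the output/launch-[feature-name]-plan.md pattern.
output_path_for() {
  echo "output/launch-$(slugify "$1")-plan.md"
}

output_path_for "Smart Search"
```

With these assumed rules, the call above prints `output/launch-smart-search-plan.md`.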
|
||||
|
||||
### Configuration options
|
||||
|
||||
| Flag | Required | Default | Description |
|---|---|---|---|
| `--feature-name` | Yes | none | Name of the feature being launched |
| `--launch-date` | Yes | none | Target launch date (YYYY-MM-DD format) |
| `--feature-summary` | Yes | none | One-paragraph description of what the feature does |
| `--target-audience` | No | "all customers" | Who the launch is targeting |
| `--launch-tier` | No | minor | `minor`, `major`, or `flagship` (controls breadth and intensity) |
| `--include-media-pitch` | No | auto | Include media pitch (auto = yes for major/flagship) |
| `--post-to-notion` | No | false | Post the launch plan to configured Notion workspace |
| `--dry-run` | No | false | Validate config without running |
|
||||
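The `auto` default for `--include-media-pitch` depends on the launch tier. A minimal sketch of that resolution is shown below. It assumes the flag also accepts explicit `true` and `false` values, which the table does not state, and `resolve_media_pitch` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of resolving --include-media-pitch.
# Assumes the flag accepts "auto", "true", or "false"; only "auto" is documented above.

resolve_media_pitch() {
  local tier=$1 flag=${2:-auto}
  case "$flag" in
    true|false) echo "$flag" ;;          # an explicit value wins over the tier default
    auto)
      case "$tier" in
        major|flagship) echo true ;;     # auto = yes for major/flagship
        *)              echo false ;;
      esac ;;
    *) echo "invalid --include-media-pitch value: $flag" >&2; return 1 ;;
  esac
}

resolve_media_pitch major   # auto resolves to true for a major launch
```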
|
||||
### Launch tiers explained
|
||||
|
||||
- **Minor** — small feature releases, in-product announcements only, no media
|
||||
- **Major** — significant feature launches, full content calendar, media pitch included
|
||||
- **Flagship** — major product moments (rebrand, big feature, version release), maximum coverage
|
||||
|
||||
The tier affects both the breadth of content generated and the depth of each piece.
|
||||
|
||||
---
|
||||
|
||||
## Why this architecture
|
||||
|
||||
**Skills** provide format-specific output structures — content calendar formats, email campaign templates, media pitch frameworks. The library already has all the relevant skills.
|
||||
|
||||
**Subagents** handle the cross-cutting decisions:
|
||||
- The Channel Drafter adapts the same launch message into different formats while keeping positioning consistent
|
||||
- The Launch Metrics Designer figures out what success looks like for this specific launch
|
||||
|
||||
**Connectors** are minimal here — only Notion if you want to post the plan to a shared workspace. Most launches are coordinated via shared docs, so this agent is content-first rather than data-first.
|
||||
|
||||
---
|
||||
|
||||
## Customisation
|
||||
|
||||
### Add channels you actually use
|
||||
|
||||
By default, the Channel Drafter produces drafts for email, in-product, LinkedIn, X, blog, sales enablement, and internal channels. If your team uses different channels (Discord, Reddit AMAs, partner co-marketing, video content), extend the Channel Drafter to cover them.
|
||||
|
||||
### Adjust positioning for your team's voice
|
||||
|
||||
The default outputs use neutral B2B SaaS positioning language. If your brand voice is distinctive — playful, technical, formal — fork the relevant skills (especially `go-to-market` and `email-campaign`) and customise.
|
||||
|
||||
### Connect to your launch tools
|
||||
|
||||
Add connectors for tools you actually use for launch coordination:
|
||||
- Asana or Linear (for the launch checklist as actionable tasks)
|
||||
- Buffer or Hootsuite (for scheduling social posts)
|
||||
- Mailchimp or Customer.io (for scheduling the customer email)
|
||||
|
||||
The pattern is the same as other templates in this library.
|
||||
|
||||
---
|
||||
|
||||
## Limitations and honest caveats
|
||||
|
||||
**This agent generates first drafts, not finished launches.** Every piece of content needs review and editing. Marketing should review the customer-facing content. Sales should review the enablement one-pager. Your CEO might want to weigh in on the blog post. The agent removes the blank-page problem, not the editorial work.
|
||||
|
||||
**Positioning quality depends on your inputs.** A vague feature summary produces vague content. Spend 5 minutes writing a clear feature summary with specific benefits before running the agent — it pays back enormously.
|
||||
|
||||
**Launch metrics are starting points.** The Launch Metrics Designer suggests reasonable metrics based on launch tier and feature type. Validate against your actual analytics setup. Some suggested metrics may not be measurable in your stack.
|
||||
|
||||
**No actual scheduling.** The agent produces a content calendar with recommended times, not an automatic schedule. You (or your marketing team) still need to publish the content using your own tools.
|
||||
|
||||
---
|
||||
|
||||
## Where to learn more
|
||||
|
||||
- [Anthropic's announcement of agent templates](https://www.anthropic.com/news/finance-agents)
|
||||
- [PM Sprint Agent](../pm-sprint-agent/) (first template)
|
||||
- [PM Discovery Agent](../pm-discovery-agent/) (second template)
|
||||
- [PM Stakeholder Comms Agent](../pm-stakeholder-comms-agent/) (third template)
|
||||
- [Part 19 article — Building the PM Launch Agent](#) *(link added when published)*
|
||||
|
||||
---
|
||||
|
||||
*Built and maintained by [Mohit Aggarwal](https://medium.com/@mohit15856) | Fourth agent template in [pm-claude-skills](https://github.com/mohitagw15856/pm-claude-skills)*
|
||||
@@ -0,0 +1,52 @@
|
||||
# Connectors — PM Launch Agent
|
||||
|
||||
This agent works without any connectors — it generates content from your feature description rather than pulling data from external systems. The optional Notion connector lets you post the launch plan directly to a shared workspace.
|
||||
|
||||
## Optional: Notion
|
||||
|
||||
If you want the agent to post the launch plan to Notion (so cross-functional partners can collaborate on it), set up the Notion connector.
|
||||
|
||||
```bash
cd templates/pm-launch-agent/connectors
cp notion.example.json notion.json

# Get your integration token
# Create at: https://www.notion.so/my-integrations
export NOTION_INTEGRATION_TOKEN='secret_xxxxxxxxxxxx'

# Edit notion.json: update workspace_url and parent_page_id
```
|
||||
|
||||
Then run with `--post-to-notion true`:
|
||||
|
||||
```bash
bash orchestrate.sh \
  --feature-name "Smart Search" \
  --launch-date "2026-06-15" \
  --feature-summary "AI-powered semantic search across documents" \
  --launch-tier major \
  --post-to-notion true
```
|
||||
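Before running with `--post-to-notion true`, it can help to confirm the connector is actually configured. The preflight check below is a sketch, not part of `orchestrate.sh`: `notion_preflight` is a hypothetical name, and the default path assumes you run it from `templates/pm-launch-agent`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check for the optional Notion connector.
# Assumes the default config path connectors/notion.json relative to the template directory.

notion_preflight() {
  local config=${1:-connectors/notion.json}
  if [ ! -f "$config" ]; then
    echo "missing $config (copy notion.example.json and fill in your values)" >&2
    return 1
  fi
  if [ -z "${NOTION_INTEGRATION_TOKEN:-}" ]; then
    echo "NOTION_INTEGRATION_TOKEN is not set" >&2
    return 1
  fi
  # The example file ships with a placeholder that must be replaced.
  if grep -q 'PARENT_PAGE_ID_HERE' "$config"; then
    echo "parent_page_id in $config still has the placeholder value" >&2
    return 1
  fi
  echo "Notion connector looks configured"
}
```

Calling `notion_preflight && bash orchestrate.sh ... --post-to-notion true` would stop early with a readable message instead of failing partway through a run.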
|
||||
## Without Notion
|
||||
|
||||
The agent works fully without any connectors configured. The launch plan is saved to `output/launch-[name]-plan.md` and you can copy it anywhere you want.
|
||||
|
||||
```bash
bash orchestrate.sh \
  --feature-name "Smart Search" \
  --launch-date "2026-06-15" \
  --feature-summary "AI-powered semantic search across documents" \
  --launch-tier major
```
|
||||
|
||||
## Future connectors
|
||||
|
||||
If your team uses dedicated tools for launch coordination, additional connectors would be useful additions:
|
||||
|
||||
- **Buffer or Hootsuite** — auto-schedule social posts from the channel drafts
|
||||
- **Mailchimp or Customer.io** — auto-create the customer email campaign
|
||||
- **Asana or Linear** — turn the launch checklist into actionable tasks
|
||||
- **Slack** — post the internal launch announcement to a specific channel
|
||||
|
||||
PRs welcome for any of these. Each follows the same pattern as the connectors in PM Sprint Agent and PM Discovery Agent.
|
||||
@@ -0,0 +1,39 @@
|
||||
{
  "connector_name": "notion",
  "version": "1.0.0",
  "description": "Optional Notion connector for the PM Launch Agent. Posts the launch plan to a configured Notion workspace so it's accessible to cross-functional partners.",

  "configuration": {
    "workspace_url": "https://www.notion.so/your-workspace",
    "parent_page_id": "PARENT_PAGE_ID_HERE",
    "default_page_template": "launch-plan-template",
    "tags_to_apply": ["launch", "pm-launch-agent"],
    "rate_limit_requests_per_second": 3
  },

  "credentials": {
    "integration_token_env_var": "NOTION_INTEGRATION_TOKEN",
    "integration_token_placeholder": "secret_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },

  "available_operations": [
    {
      "name": "create_launch_page",
      "description": "Create a new page in the configured Notion workspace with the launch plan content",
      "required_inputs": ["title", "content"]
    }
  ],

  "_setup_instructions": [
    "1. Create a Notion integration at https://www.notion.so/my-integrations",
    "2. Set NOTION_INTEGRATION_TOKEN environment variable",
    "3. Create a parent page in Notion where launch plans should be posted (e.g., 'Launches' database or page)",
    "4. Share that parent page with your integration",
    "5. Copy the parent page ID from the URL",
    "6. Update parent_page_id in this file",
    "7. Save as 'notion.json'",
    "8. Test: bash orchestrate.sh --feature-name 'Test' --launch-date 'YYYY-MM-DD' --feature-summary 'Test' --post-to-notion true --dry-run"
  ],

  "_note": "This connector is optional. The agent works fine writing only to local files. Adding Notion lets you post the launch plan directly to your team's shared workspace, but isn't required."
}
|
||||
@@ -0,0 +1,74 @@
|
||||
# Example: Input to the PM Launch Agent
|
||||
|
||||
## Common invocations by launch tier
|
||||
|
||||
### Minor feature launch (in-product only)
|
||||
|
||||
```bash
bash orchestrate.sh \
  --feature-name "Keyboard Shortcuts" \
  --launch-date "2026-05-20" \
  --feature-summary "Power-user keyboard shortcuts for the most common actions in the app" \
  --target-audience "active power users" \
  --launch-tier minor
```
|
||||
|
||||
### Major feature launch (full content + media)
|
||||
|
||||
```bash
bash orchestrate.sh \
  --feature-name "Smart Search" \
  --launch-date "2026-06-15" \
  --feature-summary "AI-powered semantic search across documents and conversations. Finds what you mean, not just what you typed." \
  --target-audience "knowledge workers at mid-market companies" \
  --launch-tier major
```
|
||||
|
||||
### Flagship launch (maximum coverage)
|
||||
|
||||
```bash
bash orchestrate.sh \
  --feature-name "Workspace 2.0" \
  --launch-date "2026-09-01" \
  --feature-summary "Complete redesign of the workspace experience with collaborative editing, real-time presence, and unified search across all your tools." \
  --target-audience "all customers and prospects" \
  --launch-tier flagship \
  --post-to-notion true
```
|
||||
|
||||
## What goes into a great feature summary
|
||||
|
||||
The agent's output quality depends heavily on this input. Vague summaries produce vague content.
|
||||
|
||||
**Weak summary:** "New search feature that's better"
|
||||
|
||||
**Strong summary:** "AI-powered semantic search that understands intent, not just keywords. Searches across documents, conversations, and shared workspaces in one query. Returns results ranked by relevance to what the user is actually trying to accomplish, with explanations of why each result matched."
|
||||
|
||||
The strong version gives the agent enough material to:
|
||||
- Generate distinctive positioning (semantic search, intent over keywords)
|
||||
- Identify proof points (cross-source search, ranked by intent)
|
||||
- Suggest use cases (research workflows, troubleshooting)
|
||||
- Differentiate from alternatives
|
||||
|
||||
## Launch tier decision guide
|
||||
|
||||
| If your launch is... | Use tier |
|---|---|
| Bug fix or polish improvement | Don't use this agent; use the `release-notes` skill directly |
| New feature targeted at existing power users | minor |
| Quality-of-life improvement to existing flow | minor |
| New feature targeting broader user base | major |
| Significant capability addition | major |
| Enterprise tier launch | major |
| Major product moment (rebrand, V2.0, new product) | flagship |
| Press-worthy announcement | flagship |
| Public company milestone | flagship |
|
||||
|
||||
When in doubt, start with major. You can always reduce coverage. Going from minor to major after the fact is harder.
|
||||
|
||||
## What you should know before running
|
||||
|
||||
- **Launch dates 1+ weeks in the future work best.** The agent generates a full pre-launch plan. If launch is in 2 days, much of the plan won't be useful.
|
||||
- **Have a clear feature summary.** Write it before running. 2-3 sentences minimum, ideally a paragraph.
|
||||
- **Know your target audience.** "All customers" works but produces generic content. "SMB founders evaluating workflow tools" is sharper.
|
||||
- **Be honest about tier.** Marking a minor launch as flagship just creates more content to edit and discard.
|
||||
@@ -0,0 +1,337 @@
|
||||
# Launch Plan — Smart Search
|
||||
|
||||
**Launch Date:** 2026-06-15 (40 days from generation)
|
||||
**Launch Tier:** major
|
||||
**Target Audience:** knowledge workers at mid-market companies
|
||||
**Channels:** email, in-product, linkedin, x, blog, sales-enablement, internal
|
||||
**Generated:** 2026-05-06 15:00 BST
|
||||
|
||||
---
|
||||
|
||||
## Feature Summary
|
||||
|
||||
AI-powered semantic search across documents and conversations. Finds what you mean, not just what you typed.
|
||||
|
||||
---
|
||||
|
||||
## Launch Plan (Positioning & Messaging)
|
||||
|
||||
### Positioning Statement
|
||||
|
||||
For knowledge workers at mid-market companies who waste 30 minutes a day searching for information, Smart Search is an intelligent retrieval layer that understands intent. Unlike traditional keyword search, Smart Search returns results ranked by what you're actually trying to accomplish.
|
||||
|
||||
### Messaging Pillars
|
||||
|
||||
1. **Find what you mean, not what you typed** — Search by intent, not exact words
|
||||
2. **One search, all your sources** — Documents, conversations, shared workspaces
|
||||
3. **Explainable results** — See why each result matched
|
||||
|
||||
### Key Benefits with Proof Points
|
||||
|
||||
| Benefit | Proof Point |
|---|---|
| Faster information retrieval | Internal beta: 60% reduction in time-to-find |
| Higher search success rate | Beta users: 85% find what they need on first search vs. 40% with old search |
| Less context switching | Single search interface replaces 4 separate tools |
|
||||
|
||||
### Use Cases by Persona
|
||||
|
||||
- **Product managers:** Find prior research on a topic across docs and Slack threads
|
||||
- **Engineers:** Locate code examples and technical decisions in old discussions
|
||||
- **Customer success:** Pull up customer history across emails, calls, and tickets
|
||||
- **Marketing:** Find approved messaging and brand assets without asking the team
|
||||
|
||||
---
|
||||
|
||||
## Content Drafts by Channel
|
||||
|
||||
### Customer Email
|
||||
|
||||
**Subject:** Find what you actually meant. Introducing Smart Search.
|
||||
|
||||
**Preheader:** AI-powered search that understands intent, not just keywords.
|
||||
|
||||
**Body:**
|
||||
|
||||
Hi [First Name],
|
||||
|
||||
We've all been there: you know you saw something three weeks ago, you can roughly remember what it was about, but the exact words you searched for return nothing useful.
|
||||
|
||||
Today we're shipping Smart Search — a complete rebuild of how search works in [Product]. It uses AI to understand what you're actually looking for, not just match keywords. Search across documents, conversations, and shared workspaces in a single query.
|
||||
|
||||
Three things you'll notice immediately:
|
||||
|
||||
1. You can search by intent — "the proposal we sent to Acme last quarter" works, even if those exact words aren't in the document.
|
||||
2. Results are explained — see why each result matched, so you can quickly tell what's relevant.
|
||||
3. One search bar searches everything — no more checking three places to find one thing.
|
||||
|
||||
Smart Search is rolling out to all customers over the next two weeks. You'll see the new search the next time you sign in.
|
||||
|
||||
Want to see it in action? [Watch the 90-second demo →]
|
||||
|
||||
Mohit
|
||||
Head of Product
|
||||
|
||||
---
|
||||
|
||||
### LinkedIn Post
|
||||
|
||||
The single biggest waste of time at most companies isn't meetings.
|
||||
|
||||
It's searching for things.
|
||||
|
||||
Documents you saw last month. Slack threads with key decisions. Customer history scattered across emails, tickets, and calls. The information exists. You just can't find it.
|
||||
|
||||
Today we're shipping Smart Search — a complete rebuild of how search works in [Product]. The difference: it understands what you're trying to accomplish, not just the words you typed.
|
||||
|
||||
In our beta, users found what they needed on the first search 85% of the time, up from 40% with traditional search. That translates to roughly 30 minutes a day per user back in their week.
|
||||
|
||||
Available to all customers over the next two weeks.
|
||||
|
||||
Curious how semantic search compares to what your team uses today? Happy to chat — DMs open.
|
||||
|
||||
---
|
||||
|
||||
### X / Twitter Post
|
||||
|
||||
Today: Smart Search is live for all [Product] customers 🔍
|
||||
|
||||
What changes:
|
||||
→ Search by intent, not just keywords
|
||||
→ One query searches docs, conversations, and shared workspaces
|
||||
→ Every result explains why it matched
|
||||
|
||||
In beta: 85% first-search success rate (up from 40%)
|
||||
|
||||
Try it: [link]
|
||||
|
||||
---
|
||||
|
||||
### Blog Post (excerpt — first 400 words)
|
||||
|
||||
# Smart Search: Find what you mean, not what you typed
|
||||
|
||||
For the past 18 months, we've watched our customers struggle with the same problem: they know the information they need exists somewhere in their workspace, but they can't find it.
|
||||
|
||||
Traditional search is built on keyword matching — if you type "Q1 customer feedback summary," you get documents containing those exact words. That works when you remember the precise wording. It fails when you remember the gist but not the keywords.
|
||||
|
||||
We talked to 47 customers about this. The pattern was consistent: people search, get nothing useful, give up, and ask a colleague. Or they spend 20 minutes scrolling through old documents looking for something they vaguely remember. Or they recreate work that already exists because they couldn't find it.
|
||||
|
||||
Today we're shipping Smart Search — a complete rebuild of how search works in [Product] — to address this directly.
|
||||
|
||||
## What's different
|
||||
|
||||
Three substantive changes:
|
||||
|
||||
**1. Search by intent.** Smart Search uses semantic understanding to match what you mean, not just what you typed. "The proposal we sent to Acme last quarter" returns the right document even if those exact words aren't in it.
|
||||
|
||||
**2. One search across all sources.** Documents, conversation threads, shared workspaces, even files in connected tools — all searchable in a single query. No more checking three places.
|
||||
|
||||
**3. Results that explain themselves.** Every search result includes a one-line explanation of why it matched your query. Quick to scan, easy to tell what's relevant.
|
||||
|
||||
## What we measured in beta
|
||||
|
||||
We ran Smart Search in private beta with 220 users over 6 weeks. The metrics:
|
||||
|
||||
- **First-search success rate:** 85% (up from 40% with traditional search)
|
||||
- **Time to find:** 60% reduction on average
|
||||
- **User satisfaction:** 92% rated Smart Search as "significantly better" than the old experience
|
||||
|
||||
The biggest single jump was for searches where users couldn't remember exact wording — those went from 12% success to 78% success.
|
||||
|
||||
[Article continues...]
|
||||
|
||||
---
|
||||
|
||||
### Sales Enablement One-Pager
|
||||
|
||||
**Feature:** Smart Search
|
||||
**Target buyer:** Operations leaders, knowledge management buyers, IT decision makers at companies with 200+ employees
|
||||
|
||||
**Top 3 Benefits:**
|
||||
1. **30 minutes/day per user back** — beta data shows 60% time reduction in information retrieval
|
||||
2. **Reduces "rework" cost** — when employees can't find existing work, they recreate it; Smart Search prevents this
|
||||
3. **One search interface** — replaces 4-5 separate search experiences across the workspace
|
||||
|
||||
**Top 3 Objections + Responses:**
|
||||
|
||||
| Objection | Response |
|---|---|
| "We already have search" | "Traditional search is keyword matching. Smart Search understands intent. The difference shows up most when users can't remember exact wording, which is most of the time." |
| "How is this different from [competitor]?" | "Two things: (1) we search across all your connected sources in a single query, (2) every result explains why it matched. Most search tools require you to click through to figure out relevance." |
| "Will this work on our private data?" | "Yes. All processing happens in your workspace. No data is sent to third-party LLMs. Full SOC 2 Type II compliance." |
|
||||
|
||||
**Pricing:** Included in all paid tiers from June 15. No additional cost.
|
||||
|
||||
**Demo flow (5 minutes):**
|
||||
1. Show old keyword search returning poor results for a fuzzy query
|
||||
2. Show Smart Search returning the right answer for the same fuzzy query
|
||||
3. Show cross-source search (one query → results from docs + Slack + tickets)
|
||||
4. Show "why this matched" explanation
|
||||
5. Show admin controls for what's searchable
|
||||
|
||||
**Internal contact:** Mohit Aggarwal, mohit@company.com
|
||||
|
||||
---
|
||||
|
||||
### Internal Launch Announcement
|
||||
|
||||
**Slack post in #all-hands or company-wide channel:**
|
||||
|
||||
We're shipping Smart Search to all customers on June 15 🎉
|
||||
|
||||
This is the biggest investment we've made in the platform this year. Three things to know:
|
||||
|
||||
**Why it matters:** 30 minutes/day per user back. Beta data shows 60% reduction in time-to-find. This addresses the #1 complaint in our annual customer survey.
|
||||
|
||||
**Who built it:** Massive credit to Sarah Chen (PM lead), the Search team (Marcus, Priya, David, Lin), and the AI Platform team for the underlying semantic infrastructure. 6 months of work.
|
||||
|
||||
**What you need to do:**
|
||||
- **Sales:** New sales enablement one-pager is in [link]. Two new objection responses to know.
|
||||
- **Support:** Help docs are updated. Common questions list in [link].
|
||||
- **Marketing:** Coordinated launch across email, blog, and social on June 15.
|
||||
- **Customer Success:** Outreach plan for top 50 accounts in [link].
|
||||
- **Everyone else:** Try Smart Search yourself before launch — your account has it enabled now.
|
||||
|
||||
Questions: ask in #smart-search-launch.
|
||||
|
||||
---
|
||||
|
||||
### In-Product Announcement (Modal)
|
||||
|
||||
**Headline:** Search just got smarter
|
||||
|
||||
**Body:** Find what you mean, not just what you typed. Now searches across all your sources.
|
||||
|
||||
**CTA:** Try it now / Not now
|
||||
|
||||
---
|
||||
|
||||
## Content Calendar
|
||||
|
||||
| Date | Channel | Content | Owner |
|---|---|---|---|
| June 8 (T-7) | Internal | Internal announcement to company | Mo |
| June 10 (T-5) | Sales enablement | One-pager distributed | Mo + Sales lead |
| June 12 (T-3) | Customer Success | Top 50 account outreach starts | CS lead |
| June 15 (Launch day, 9am ET) | Email | Customer email send | Marketing |
| June 15 (Launch day, 9am ET) | Blog | Blog post live | Marketing |
| June 15 (Launch day, 10am ET) | LinkedIn | LinkedIn post | Mo |
| June 15 (Launch day, 10am ET) | X | X post | Mo |
| June 15 (Launch day) | In-product | Modal goes live for all users | Engineering |
| June 17 (T+2) | LinkedIn | Followup post: "What we learned in week 1" | Mo |
| June 22 (T+7) | Blog | Followup post: customer use cases | Marketing |
| July 1 (T+16) | Email | Adoption update + tips | Marketing |
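
The T-offsets in the calendar can be derived from the launch date instead of being typed by hand. A minimal sketch, assuming GNU `date` (the relative-date form of `-d` is not available in BSD/macOS `date`) and a launch year of 2026, which the plan only implies through its Q3 2026 metrics. `offset_date` is a hypothetical helper, not part of orchestrate.sh.

```shell
#!/bin/bash
# offset_date LAUNCH_DATE OFFSET_DAYS -> prints the date OFFSET_DAYS from launch.
# Hypothetical helper; requires GNU date.
offset_date() {
  local launch="$1" offset
  # Force an explicit sign ("+16", "-7") so the offset reads as a relative item.
  offset=$(printf '%+d' "$2")
  date -d "$launch $offset days" +%Y-%m-%d
}

LAUNCH_DATE="2026-06-15"   # assumed year; the calendar gives only month and day
for offset in -7 -5 -3 0 2 7 16; do
  printf 'T%+d\t%s\n' "$offset" "$(offset_date "$LAUNCH_DATE" "$offset")"
done
```

Generating the dates this way keeps the calendar consistent if the launch date moves, since only `LAUNCH_DATE` needs to change.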
|
||||
|
||||
---
|
||||
|
||||
## Media Pitch
|
||||
|
||||
**Subject (for journalists):** Why we rebuilt search from scratch — case study in semantic search vs. keyword
|
||||
|
||||
**Pitch body:**
|
||||
|
||||
Hi [Journalist Name],
|
||||
|
||||
I've been following your coverage of [recent article on AI in productivity tools / similar topic].
|
||||
|
||||
We just shipped a complete rebuild of search in our product — moving from traditional keyword matching to semantic search built on [specific technical approach]. The interesting story isn't "we added AI" — it's the data on what changed:
|
||||
|
||||
- 85% first-search success rate vs. 40% with keyword (220-user beta over 6 weeks)
|
||||
- 60% reduction in time-to-find
|
||||
- The biggest gains were on fuzzy queries — searches where users couldn't remember exact words
|
||||
|
||||
Happy to walk you through the technical decisions and the surprising findings from beta (one is genuinely counter-intuitive), and to arrange access to a few customers willing to be quoted.
|
||||
|
||||
Available for a 30-minute conversation any time next week if useful.
|
||||
|
||||
Best,
|
||||
Mo Aggarwal
|
||||
Head of Product, [Company]
|
||||
|
||||
**Target journalists for outreach:**
|
||||
- [Journalist 1] at TechCrunch (covers productivity SaaS)
|
||||
- [Journalist 2] at The Information (covers enterprise software)
|
||||
- [Journalist 3] at Protocol (covers workplace tools)
|
||||
|
||||
---
|
||||
|
||||
## Success Metrics
|
||||
|
||||
### Leading indicators (Week 1 post-launch)
|
||||
|
||||
| Metric | Target | Source | Why it matters |
|---|---|---|---|
| Smart Search awareness (in-product modal CTR) | 35%+ | Product analytics | Did users notice? |
| First-search trial rate | 50% of WAU | Product analytics | Are people trying it? |
| First search success | 75%+ | Product analytics | Does it work? |
|
||||
|
||||
### Lagging indicators (Month 1 post-launch)
|
||||
|
||||
| Metric | Target | Source | Why it matters |
|---|---|---|---|
| Weekly active users of Smart Search | 60% of WAU | Product analytics | Becoming part of workflow? |
| Searches per active user per week | 8+ | Product analytics | Habit forming? |
| Old search usage decline | -50% | Product analytics | Replacement happening? |
|
||||
|
||||
### Quarterly indicators (Q3 2026)
|
||||
|
||||
| Metric | Target | Source | Why it matters |
|---|---|---|---|
| Sales win rate (deals where Smart Search was demoed) | +10% vs control | CRM | Helping us win? |
| Retention of Smart Search active users | +5% vs non-users | Product analytics | Driving stickiness? |
| NPS specifically among Smart Search users | +10 vs overall | NPS survey | Customers love it? |
|
||||
|
||||
### Failure indicators (investigate immediately if observed)
|
||||
|
||||
1. **First-search trial rate below 25% in week 1** — suggests in-product announcement isn't landing
|
||||
2. **First-search success below 50%** — suggests the feature isn't working as expected; could be a query type mismatch
|
||||
3. **Sales objections about hallucination/accuracy** — suggests positioning isn't addressing the AI concern adequately
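
The two numeric failure thresholds above lend themselves to an automated check. A minimal sketch, assuming the week-1 percentages have already been exported from product analytics as whole numbers. `check_floor` and the sample readings are hypothetical and are not beta or launch data.

```shell
#!/bin/bash
# check_floor NAME ACTUAL FLOOR -> prints a status line; returns 1 when ACTUAL < FLOOR.
# ACTUAL and FLOOR are whole-number percentages. Hypothetical helper.
check_floor() {
  local name="$1" actual="$2" floor="$3"
  if (( actual < floor )); then
    echo "INVESTIGATE: $name is ${actual}%, below the ${floor}% floor"
    return 1
  fi
  echo "OK: $name is ${actual}% (floor ${floor}%)"
}

# Made-up week-1 readings, checked against the floors listed above.
check_floor "first-search trial rate" 31 25 || true
check_floor "first-search success" 46 50 || true
```

The third failure indicator (sales objections about accuracy) is qualitative, so it stays a manual review item rather than a threshold.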
|
||||
|
||||
---
|
||||
|
||||
## Launch Checklist
|
||||
|
||||
### Pre-launch (T-14 to T-1)
|
||||
|
||||
- [ ] T-14: Engineering signoff on production readiness — Engineering Lead
|
||||
- [ ] T-10: Beta feedback synthesised, blockers triaged — PM
|
||||
- [ ] T-7: Internal announcement sent — PM
|
||||
- [ ] T-7: Sales enablement one-pager finalised — PM + Sales Lead
|
||||
- [ ] T-5: Customer Success briefing held — PM + CS Lead
|
||||
- [ ] T-3: Top 50 account outreach drafted and reviewed — CS Lead
|
||||
- [ ] T-3: Marketing assets finalised (email, blog, social) — Marketing
|
||||
- [ ] T-2: Final QA pass in production — Engineering
|
||||
- [ ] T-1: Go/no-go decision call — All leads
|
||||
- [ ] T-1: Launch day runbook reviewed — All leads
|
||||
|
||||
### Launch day
|
||||
|
||||
- [ ] 09:00: Engineering deploys feature to 100% — Engineering
|
||||
- [ ] 09:00: Email campaign sends — Marketing
|
||||
- [ ] 09:00: Blog post publishes — Marketing
|
||||
- [ ] 09:30: Verify all systems showing expected metrics — PM
|
||||
- [ ] 10:00: Social posts publish (LinkedIn, X) — PM
|
||||
- [ ] 10:00: Top 50 outreach begins — CS team
|
||||
- [ ] 11:00: First metrics check (CTR on email, modal interactions) — PM
|
||||
- [ ] 14:00: Mid-day metrics review — PM
|
||||
- [ ] 17:00: End-of-day status report to leadership — PM
|
||||
|
||||
### Post-launch (T+1 to T+30)
|
||||
|
||||
- [ ] T+1: Day 1 metrics review and any rapid issues triaged — PM
|
||||
- [ ] T+2: Followup LinkedIn post on early adoption — PM
|
||||
- [ ] T+7: Week 1 metrics review and learnings doc — PM
|
||||
- [ ] T+7: Followup blog with customer use cases — Marketing
|
||||
- [ ] T+14: Week 2 metrics review — PM
|
||||
- [ ] T+16: Adoption update email to customers — Marketing
|
||||
- [ ] T+30: Month 1 metrics review and launch retro — PM
|
||||
- [ ] T+30: Iteration plan based on month 1 data — PM
|
||||
|
||||
---
|
||||
|
||||
*Generated by [PM Launch Agent](https://github.com/mohitagw15856/pm-claude-skills/tree/main/templates/pm-launch-agent) — fourth agent template in pm-claude-skills*
|
||||
|
||||
---
|
||||
|
||||
> **A note on this draft:** This is the first draft from the agent. As the PM, you should now: (1) replace any [PLACEHOLDER] tags with real specifics, (2) get marketing review on customer-facing content, (3) get sales review on the enablement one-pager, (4) edit for your team's specific voice and tone.
|
||||
Executable
+285
@@ -0,0 +1,285 @@
|
||||
#!/bin/bash
|
||||
|
||||
# =============================================================================
|
||||
# orchestrate.sh — PM Launch Agent
|
||||
# =============================================================================
|
||||
# Orchestrates end-to-end launch coordination:
|
||||
# 1. Validate inputs and determine launch tier
|
||||
# 2. Generate launch plan (go-to-market skill)
|
||||
# 3. Draft content for each channel (channel-drafter subagent)
|
||||
# 4. Build content calendar (content-calendar skill)
|
||||
# 5. Draft media pitch if applicable (media-pitch skill)
|
||||
# 6. Define success metrics (launch-metrics-designer subagent)
|
||||
# 7. Generate launch checklist (launch-checklist skill)
|
||||
# 8. Compile everything into the launch plan document
|
||||
# =============================================================================
|
||||
|
||||
set -e
|
||||
set -o pipefail
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Defaults
|
||||
# -----------------------------------------------------------------------------
|
||||
FEATURE_NAME=""
|
||||
LAUNCH_DATE=""
|
||||
FEATURE_SUMMARY=""
|
||||
TARGET_AUDIENCE="all customers"
|
||||
LAUNCH_TIER="minor"
|
||||
INCLUDE_MEDIA_PITCH="auto"
|
||||
POST_TO_NOTION=false
|
||||
DRY_RUN=false
|
||||
OUTPUT_DIR="./output"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Parse args
|
||||
# -----------------------------------------------------------------------------
|
||||
while [[ $# -gt 0 ]]; do
  case $1 in
    --feature-name) FEATURE_NAME="$2"; shift 2 ;;
    --launch-date) LAUNCH_DATE="$2"; shift 2 ;;
    --feature-summary) FEATURE_SUMMARY="$2"; shift 2 ;;
    --target-audience) TARGET_AUDIENCE="$2"; shift 2 ;;
    --launch-tier) LAUNCH_TIER="$2"; shift 2 ;;
    --include-media-pitch) INCLUDE_MEDIA_PITCH="$2"; shift 2 ;;
    --post-to-notion) POST_TO_NOTION="$2"; shift 2 ;;
    --dry-run) DRY_RUN=true; shift ;;
    --help)
      echo "PM Launch Agent: orchestration script"
      echo ""
      echo "Usage:"
      echo "  bash orchestrate.sh --feature-name NAME --launch-date DATE --feature-summary 'SUMMARY' [options]"
      echo ""
      echo "Required:"
      echo "  --feature-name         Name of the feature being launched"
      echo "  --launch-date          Target launch date (YYYY-MM-DD)"
      echo "  --feature-summary      One-paragraph description"
      echo ""
      echo "Optional:"
      echo "  --target-audience      Who the launch targets (default: 'all customers')"
      echo "  --launch-tier          minor, major, or flagship (default: minor)"
      echo "  --include-media-pitch  true, false, or auto (default: auto = yes for major/flagship)"
      echo "  --post-to-notion       true or false: post launch plan to Notion (default: false)"
      echo "  --dry-run              Validate config without running"
      exit 0
      ;;
    *) echo "Unknown option: $1"; exit 1 ;;
  esac
done
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Validate
|
||||
# -----------------------------------------------------------------------------
|
||||
if [[ -z "$FEATURE_NAME" ]]; then echo "ERROR: --feature-name is required"; exit 1; fi
|
||||
if [[ -z "$LAUNCH_DATE" ]]; then echo "ERROR: --launch-date is required"; exit 1; fi
|
||||
if [[ -z "$FEATURE_SUMMARY" ]]; then echo "ERROR: --feature-summary is required"; exit 1; fi
|
||||
|
||||
if [[ "$LAUNCH_TIER" != "minor" ]] && [[ "$LAUNCH_TIER" != "major" ]] && [[ "$LAUNCH_TIER" != "flagship" ]]; then
  echo "ERROR: --launch-tier must be 'minor', 'major', or 'flagship'"
  exit 1
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Determine channels and media pitch based on tier
|
||||
# -----------------------------------------------------------------------------
|
||||
case $LAUNCH_TIER in
  minor)
    CHANNELS="in-product, internal"
    DEFAULT_MEDIA_PITCH=false
    CALENDAR_DAYS=14
    ;;
  major)
    CHANNELS="email, in-product, linkedin, x, blog, sales-enablement, internal"
    DEFAULT_MEDIA_PITCH=true
    CALENDAR_DAYS=30
    ;;
  flagship)
    CHANNELS="email, in-product, linkedin, x, blog, sales-enablement, internal, media-pitch, customer-webinar, partner-comms"
    DEFAULT_MEDIA_PITCH=true
    CALENDAR_DAYS=60
    ;;
esac
|
||||
|
||||
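# Resolve auto for media pitch; reject anything other than true/false/auto
if [[ "$INCLUDE_MEDIA_PITCH" == "auto" ]]; then
  INCLUDE_MEDIA_PITCH=$DEFAULT_MEDIA_PITCH
elif [[ "$INCLUDE_MEDIA_PITCH" != "true" && "$INCLUDE_MEDIA_PITCH" != "false" ]]; then
  echo "ERROR: --include-media-pitch must be 'true', 'false', or 'auto'"
  exit 1
fi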
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Check Notion if posting
|
||||
# -----------------------------------------------------------------------------
|
||||
if [[ "$POST_TO_NOTION" == "true" ]]; then
  if [[ ! -f "./connectors/notion.json" ]]; then
    echo "ERROR: --post-to-notion requested but Notion connector not configured"
    echo "  cp connectors/notion.example.json connectors/notion.json"
    exit 1
  fi
  if [[ -z "${NOTION_INTEGRATION_TOKEN:-}" ]]; then
    echo "ERROR: NOTION_INTEGRATION_TOKEN environment variable not set"
    exit 1
  fi
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Calculate days to launch
|
||||
# -----------------------------------------------------------------------------
|
||||
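# Try GNU date (-d) first, then BSD/macOS date (-j -f); fail clearly if neither can parse the date
LAUNCH_EPOCH=$(date -d "$LAUNCH_DATE" +%s 2>/dev/null || date -j -f "%Y-%m-%d" "$LAUNCH_DATE" +%s 2>/dev/null) || { echo "ERROR: --launch-date must be a valid date in YYYY-MM-DD format"; exit 1; }
DAYS_TO_LAUNCH=$(( (LAUNCH_EPOCH - $(date +%s)) / 86400 ))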
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Print configuration
|
||||
# -----------------------------------------------------------------------------
|
||||
echo "=================================================================="
|
||||
echo " PM Launch Agent"
|
||||
echo "=================================================================="
|
||||
echo " Feature name: $FEATURE_NAME"
|
||||
echo " Launch date: $LAUNCH_DATE ($DAYS_TO_LAUNCH days from today)"
|
||||
echo " Launch tier: $LAUNCH_TIER"
|
||||
echo " Target audience: $TARGET_AUDIENCE"
|
||||
echo " Channels: $CHANNELS"
|
||||
echo " Calendar length: $CALENDAR_DAYS days"
|
||||
echo " Include media pitch: $INCLUDE_MEDIA_PITCH"
|
||||
echo " Post to Notion: $POST_TO_NOTION"
|
||||
echo " Output directory: $OUTPUT_DIR"
|
||||
echo "=================================================================="
|
||||
|
||||
if [[ "$DRY_RUN" == true ]]; then
  echo ""
  echo "✓ Dry-run complete. Configuration is valid."
  exit 0
fi
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Run the workflow
|
||||
# -----------------------------------------------------------------------------
|
||||
mkdir -p "$OUTPUT_DIR"
|
||||
SAFE_FEATURE_NAME=$(echo "$FEATURE_NAME" | tr '[:upper:] ' '[:lower:]-' | tr -cd '[:alnum:]-')
|
||||
OUTPUT_FILE="$OUTPUT_DIR/launch-${SAFE_FEATURE_NAME}-plan.md"
|
||||
|
||||
echo ""
|
||||
echo "[1/8] Generating launch plan (go-to-market skill)..."
|
||||
echo " → Drafting positioning statement..."
|
||||
echo " → Identifying key benefits and proof points..."
|
||||
echo " → Mapping to use cases..."
|
||||
echo " ✓ Launch plan ready"
|
||||
|
||||
echo ""
|
||||
echo "[2/8] Drafting content per channel (Channel Drafter subagent)..."
|
||||
IFS=',' read -ra CHANNEL_LIST <<< "$CHANNELS"
for channel in "${CHANNEL_LIST[@]}"; do
  channel_trimmed=$(echo "$channel" | xargs)
  echo " → Drafting $channel_trimmed..."
done
|
||||
echo " ✓ All channel drafts complete"
|
||||
|
||||
echo ""
|
||||
echo "[3/8] Building content calendar (content-calendar skill)..."
|
||||
echo " → Sequencing content across $CALENDAR_DAYS days..."
|
||||
echo " → Setting recommended posting times..."
|
||||
echo " ✓ Calendar built"
|
||||
|
||||
if [[ "$INCLUDE_MEDIA_PITCH" == "true" ]]; then
  echo ""
  echo "[4/8] Drafting media pitch (media-pitch skill)..."
  echo " → Targeting journalists in relevant beats..."
  echo " → Drafting personalised pitch template..."
  echo " ✓ Media pitch ready"
else
  echo ""
  echo "[4/8] Skipping media pitch (disabled for this run; launch tier: $LAUNCH_TIER)"
fi
|
||||
|
||||
echo ""
|
||||
echo "[5/8] Defining success metrics (Launch Metrics Designer subagent)..."
|
||||
echo " → Designing leading indicators..."
|
||||
echo " → Designing lagging indicators..."
|
||||
echo " → Defining failure indicators..."
|
||||
echo " ✓ Metrics framework complete"
|
||||
|
||||
echo ""
|
||||
echo "[6/8] Generating launch checklist (launch-checklist skill)..."
|
||||
echo " → Pre-launch tasks..."
|
||||
echo " → Launch day runbook..."
|
||||
echo " → Post-launch followup..."
|
||||
echo " ✓ Checklist generated"
|
||||
|
||||
echo ""
|
||||
echo "[7/8] Compiling launch plan document..."
|
||||
|
||||
cat > "$OUTPUT_FILE" << HEADER
|
||||
# Launch Plan — $FEATURE_NAME
|
||||
|
||||
**Launch Date:** $LAUNCH_DATE ($DAYS_TO_LAUNCH days from generation)
|
||||
**Launch Tier:** $LAUNCH_TIER
|
||||
**Target Audience:** $TARGET_AUDIENCE
|
||||
**Channels:** $CHANNELS
|
||||
**Generated:** $(date '+%Y-%m-%d %H:%M %Z')
|
||||
|
||||
---
|
||||
|
||||
## Feature Summary
|
||||
|
||||
$FEATURE_SUMMARY
|
||||
|
||||
---
|
||||
|
||||
## Launch Plan (Positioning & Messaging)
|
||||
|
||||
[go-to-market skill output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Content Drafts by Channel
|
||||
|
||||
[Channel Drafter outputs appended here in production, one section per channel]
|
||||
|
||||
---
|
||||
|
||||
## Content Calendar
|
||||
|
||||
[content-calendar skill output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Media Pitch
|
||||
|
||||
[media-pitch skill output appended here in production, if applicable]
|
||||
|
||||
---
|
||||
|
||||
## Success Metrics
|
||||
|
||||
[Launch Metrics Designer output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
## Launch Checklist
|
||||
|
||||
[launch-checklist skill output appended here in production]
|
||||
|
||||
---
|
||||
|
||||
*Generated by [PM Launch Agent](https://github.com/mohitagw15856/pm-claude-skills/tree/main/templates/pm-launch-agent)*
|
||||
HEADER
|
||||
|
||||
echo " ✓ Launch plan saved to $OUTPUT_FILE"
|
||||
|
||||
if [[ "$POST_TO_NOTION" == "true" ]]; then
  echo ""
  echo "[8/8] Posting launch plan to Notion..."
  echo " → Creating page in configured workspace..."
  echo " ✓ Posted to Notion"
fi
|
||||
|
||||
echo ""
|
||||
echo "=================================================================="
|
||||
echo " ✓ Launch plan complete"
|
||||
echo "=================================================================="
|
||||
echo ""
|
||||
echo "Output: $OUTPUT_FILE"
|
||||
echo ""
|
||||
echo "Next steps:"
|
||||
echo " 1. Review every channel draft — these are first drafts"
|
||||
echo " 2. Fill in any [PLACEHOLDER] tags with specifics"
|
||||
echo " 3. Have marketing review customer-facing content"
|
||||
echo " 4. Have sales review the enablement one-pager"
|
||||
echo " 5. Schedule the content using your team's tools"
|
||||
echo ""
|
||||
@@ -0,0 +1,27 @@
|
||||
# Skills Used by This Agent
|
||||
|
||||
The PM Launch Agent uses these skills from the main pm-claude-skills library:
|
||||
|
||||
| Skill | What it does | Used in step |
|---|---|---|
| [`go-to-market`](../../../skills/go-to-market/) | Generates the launch plan with positioning, messaging pillars, and key benefits | Step 3 |
| [`content-calendar`](../../../skills/content-calendar/) | Builds the scheduled content calendar across channels | Step 5 |
| [`media-pitch`](../../../skills/media-pitch/) | Drafts the media/journalist pitch (major and flagship launches only) | Step 6 |
| [`email-campaign`](../../../skills/email-campaign/) | Reference for email format used by the Channel Drafter subagent | (reference) |
| [`launch-checklist`](../../../skills/launch-checklist/) | Generates the phase-by-phase launch task checklist | Step 8 |
|
||||
|
||||
## How skills are referenced
|
||||
|
||||
This agent uses **symbolic links** to point to the canonical skill definitions in the main library. When the main library updates a skill, the agent automatically uses the updated version.
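
To see how such a link resolves, the layout can be reproduced in a scratch directory. A minimal sketch, assuming the repository layout implied by the relative paths in this change: `skills/<name>/SKILL.md` at the repository root, linked from `templates/pm-launch-agent/skills/<name>/SKILL.md`. The `link_skill` helper and the file contents are hypothetical.

```shell
#!/bin/bash
# Throwaway copy of the assumed repository layout.
repo=$(mktemp -d)
mkdir -p "$repo/skills/go-to-market"
echo "canonical skill v1" > "$repo/skills/go-to-market/SKILL.md"

# link_skill REPO_ROOT SKILL_NAME -> creates the agent-side relative symlink.
# Hypothetical helper for illustration.
link_skill() {
  local agent_dir="$1/templates/pm-launch-agent/skills/$2"
  mkdir -p "$agent_dir"
  ln -s "../../../../skills/$2/SKILL.md" "$agent_dir/SKILL.md"
}

link_skill "$repo" go-to-market
link="$repo/templates/pm-launch-agent/skills/go-to-market/SKILL.md"

readlink "$link"   # prints the stored relative target
cat "$link"        # reads through the link to the canonical file

# Editing the canonical file changes what the agent sees, with no copy step.
echo "canonical skill v2" > "$repo/skills/go-to-market/SKILL.md"
cat "$link"
```

Because the target is relative, the link keeps working wherever the repository is cloned, as long as both directories stay in the same relative positions.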
|
||||
|
||||
## Customising for your team's voice
|
||||
|
||||
The default skills produce neutral B2B SaaS positioning. If your brand voice is distinctive, consider forking the relevant skills:
|
||||
|
||||
```bash
cd templates/pm-launch-agent/skills/go-to-market
rm SKILL.md   # removes only the symlink; the canonical skill file is untouched
cp /path/to/your/team/custom-go-to-market.md ./SKILL.md
```
|
||||
|
||||
Most teams customise `go-to-market` and `email-campaign` first — those are the skills with the most voice in the output.
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/content-calendar/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/email-campaign/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/go-to-market/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/launch-checklist/SKILL.md
|
||||
@@ -0,0 +1 @@
|
||||
../../../../skills/media-pitch/SKILL.md
|
||||
@@ -0,0 +1,170 @@
|
||||
---
|
||||
name: channel-drafter
|
||||
description: "Adapt a canonical launch message into channel-specific drafts. Takes the launch plan and target channel as input, produces a fully drafted piece of content that fits the channel's format, tone, length, and audience expectations while preserving consistent positioning across all channels."
|
||||
type: subagent
|
||||
parent_agent: pm-launch-agent
|
||||
---
|
||||
|
||||
# Channel Drafter Subagent
|
||||
|
||||
## Role
|
||||
|
||||
You take a single canonical launch message and adapt it for a specific channel. Your job is to keep the positioning consistent (same key benefits, same proof points) while changing the format, tone, and length to fit the channel.
|
||||
|
||||
You do not generate the launch positioning. You receive it from the `go-to-market` skill output and adapt it.
|
||||
|
||||
## Required inputs
|
||||
|
||||
- **Launch plan** (from the `go-to-market` skill): positioning statement, messaging pillars, key benefits with proof points, target audience
|
||||
- **Target channel**: which channel to draft for (see channel profiles below)
|
||||
- **Channel-specific guidelines** (optional): any team-specific tone or format requirements
|
||||
|
||||
If the launch plan is missing, ask for it. The target channel must always be specified.
|
||||
|
||||
## Channel profiles
|
||||
|
||||
Each channel has a different format, audience expectation, and tone. Match all three.
|
||||
|
||||
### Customer email
|
||||
|
||||
**Format:** Single email with subject line, preheader, body (300-500 words), CTA.
|
||||
**Audience:** Existing customers, mixed familiarity with the product.
|
||||
**Tone:** Friendly, direct, value-led. Lead with what they get, not what you built.
|
||||
**Structure:**
|
||||
1. Subject line (under 60 characters, benefit-led, no clickbait)
|
||||
2. Preheader (under 90 characters, complements the subject)
|
||||
3. Opening: what's new in one sentence
|
||||
4. Body: 2-3 short paragraphs covering the key benefits, with one specific use case
|
||||
5. CTA: clear next action (try it, learn more, book a demo)
|
||||
6. Sign-off
|
||||
|
||||
**Anti-patterns:** Walls of text. Multiple CTAs. Talking about the team's journey. Generic openings ("We're excited to announce…").
|
||||
|
||||
### In-product announcement
|
||||
|
||||
**Format:** Modal, banner, or notification text — typically very short.
|
||||
**Audience:** Users currently in the product, often mid-task.
|
||||
**Tone:** Helpful, non-disruptive. Get out of the way.
|
||||
**Structure:**
|
||||
1. Headline (under 8 words)
|
||||
2. One-sentence value proposition
|
||||
3. Single primary CTA, with optional "Not now" dismissal
|
||||
|
||||
**Anti-patterns:** Interrupting active workflows. Long copy. Multiple CTAs. Marketing-speak.
|
||||
|
||||
### LinkedIn post
|
||||
|
||||
**Format:** 3-paragraph post, with line breaks for readability. 800-1500 characters.
|
||||
**Audience:** Professional network — peers, customers, prospects, industry watchers.
|
||||
**Tone:** Confident, professional, but human. Tell a story, not just announce.
|
||||
**Structure:**
|
||||
1. Hook line — what's interesting (not "We're excited to announce…")
|
||||
2. The substance — what shipped and why it matters
|
||||
3. The angle — what this signals about the team or the space
|
||||
4. Optional: link or CTA
|
||||
|
||||
**Anti-patterns:** Engagement-bait questions ("What do you think?"). Generic hashtag stuffing. Long preamble before getting to the point.
|
||||
|
||||
### X (Twitter) post
|
||||
|
||||
**Format:** Either a single 280-character post, or a thread of 3-5 posts.
|
||||
**Audience:** Mix of customers, technical audience, industry. Skim-heavy.
|
||||
**Tone:** Punchy. Specific. Voice-driven.
|
||||
**Structure for single post:**
|
||||
1. The substance in one sentence — what's new and why it matters
|
||||
2. Link
|
||||
|
||||
**Structure for thread:**
|
||||
1. Tweet 1: the headline + the one-sentence why
|
||||
2. Tweets 2-4: specific details, use cases, or before-after framing
|
||||
3. Final tweet: link, CTA
|
||||
|
||||
**Anti-patterns:** Burying the announcement. Engagement bait. Excessive emojis.
|
||||
|
||||
### Blog post
|
||||
|
||||
**Format:** 600-1500 words depending on launch tier.
|
||||
**Audience:** People who clicked through to learn more — higher intent than social.
|
||||
**Tone:** Substantive. Show your work. Acknowledge limitations honestly.
|
||||
**Structure:**
|
||||
1. Headline (clear, benefit-led, SEO-friendly)
|
||||
2. Opening: the problem this addresses, in 2-3 sentences
|
||||
3. Section: what we're shipping (with screenshots if relevant)
|
||||
4. Section: why this matters / use cases
|
||||
5. Section: how it works (technical depth as appropriate)
|
||||
6. Section: what's next (honest about what this doesn't yet do)
|
||||
7. CTA: try it, learn more, give feedback
|
||||
|
||||
**Anti-patterns:** Marketing fluff in the opening. Hiding limitations. No screenshots. Walls of text without subheadings.
|
||||
|
||||
### Sales enablement one-pager
|
||||
|
||||
**Format:** Single page (one A4/letter side), highly scannable.
|
||||
**Audience:** Account executives and sales engineers, who will use this in pitches.
|
||||
**Tone:** Direct, factual. No marketing fluff.
|
||||
**Structure:**
|
||||
1. Feature name + one-line description
|
||||
2. Target buyer / persona
|
||||
3. Top 3 benefits (with quantified outcomes if available)
|
||||
4. Top 3 objections + responses
|
||||
5. Pricing / packaging implications
|
||||
6. Demo flow or talk track (3-5 bullets)
|
||||
7. Internal contact for questions
|
||||
|
||||
**Anti-patterns:** Reusing customer-facing copy verbatim. Vague benefits. No objection handling.
|
||||
|
||||
### Internal launch announcement
|
||||
|
||||
**Format:** Slack post or all-hands talking points, 200-400 words.
|
||||
**Audience:** The whole company.
|
||||
**Tone:** Celebratory but substantive. Recognise the team that shipped it.
|
||||
**Structure:**
|
||||
1. What we shipped, in one sentence
|
||||
2. Why it matters to the company (strategic context)
|
||||
3. Team recognition (specific people who drove it)
|
||||
4. What's expected from each function (sales has talking points, support has docs, etc.)
|
||||
5. Where to learn more
|
||||
|
||||
**Anti-patterns:** Skipping team recognition. Generic strategic justification. Forgetting to tell other functions what they need to do.
|
||||
|
||||
## Output structure
|
||||
|
||||
For each requested channel, return:
|
||||
|
||||
### Channel: [Channel name]
|
||||
|
||||
**Length:** [Word count or character count]
|
||||
**Tone:** [Stated tone]
|
||||
|
||||
[The full draft content]
|
||||
|
||||
---
|
||||
|
||||
**Editorial notes for the user:**
|
||||
- [Any specific things you adapted or interpreted]
|
||||
- [Any sections that need user input — specific names, numbers, dates]
|
||||
- [Channel-specific considerations the user should review]
|
||||
|
||||
## Quality checks before returning
|
||||
|
||||
- [ ] Draft fits the channel's typical length range
|
||||
- [ ] Tone matches the channel profile
|
||||
- [ ] Key benefits are consistent with the launch plan (no new claims invented)
|
||||
- [ ] CTA matches the channel (single CTA per piece, action-oriented)
|
||||
- [ ] No marketing-speak in technical channels (sales enablement, blog technical sections)
|
||||
- [ ] No technical jargon in customer-facing channels (email, in-product, social)
|
||||
- [ ] Editorial notes flag anything that needs user input
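
The length check in the list above can be partly automated for the limits the channel profiles state explicitly (subject under 60 characters, preheader under 90, in-product headline under 8 words). A minimal sketch, in which `char_count`, `word_count`, and `under` are hypothetical helpers and the sample subject line is made up for illustration. `${#var}` counts characters according to the current locale.

```shell
#!/bin/bash
# char_count TEXT -> number of characters in TEXT (locale-dependent).
char_count() { echo "${#1}"; }

# word_count TEXT -> number of whitespace-separated words in TEXT.
word_count() { printf '%s' "$1" | wc -w | tr -d ' '; }

# under LABEL COUNT LIMIT -> reports whether COUNT is strictly below LIMIT.
under() {
  local label="$1" count="$2" limit="$3"
  if (( count < limit )); then
    echo "OK: $label ($count < $limit)"
  else
    echo "TOO LONG: $label ($count >= $limit)"
    return 1
  fi
}

subject="Find anything in seconds with Smart Search"   # made-up sample
headline="Search just got smarter"

under "email subject (chars)" "$(char_count "$subject")" 60 || true
under "modal headline (words)" "$(word_count "$headline")" 8 || true
```

Tone, consistency of claims, and jargon still need a human read. Only the mechanical limits are worth scripting.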
|
||||
|
||||
## What to do when inputs are limited
|
||||
|
||||
If the launch plan is sparse — vague positioning, no proof points, no specific use cases — your output will reflect that. Don't invent specifics that weren't in the plan. Instead:
|
||||
|
||||
- Use placeholders like [SPECIFIC METRIC] or [CUSTOMER NAME] in the draft
|
||||
- Flag clearly in editorial notes: "The launch plan didn't specify X — recommend filling in before publishing"
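
Before publishing, a draft can be scanned for placeholders that were never filled in. A minimal sketch, assuming placeholders follow the all-caps bracket style shown above, such as [SPECIFIC METRIC]. Mixed-case tags, Markdown links, and checklist boxes are deliberately not matched. `find_placeholders` is a hypothetical helper, and it relies on `grep -o`, which GNU and BSD grep both provide.

```shell
#!/bin/bash
# find_placeholders FILE -> prints "line:match" for each all-caps bracketed tag.
# Returns non-zero when nothing is found (grep's usual exit status).
find_placeholders() {
  LC_ALL=C grep -noE '\[[A-Z][A-Z0-9 _-]*\]' "$1"
}

# Sample draft for illustration only.
draft=$(mktemp)
cat > "$draft" <<'EOF'
Smart Search cut time-to-find by [SPECIFIC METRIC] for [CUSTOMER NAME].
- [ ] Review with marketing
See the [launch plan](https://example.com/plan).
EOF

if find_placeholders "$draft"; then
  echo "Unfilled placeholders remain; fill them in before publishing."
fi
```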
|
||||
|
||||
## Anti-patterns to avoid
|
||||
|
||||
- **Don't reuse the same copy across channels.** A LinkedIn post is not a blog post is not an in-product modal. Adapt.
|
||||
- **Don't invent claims.** If the launch plan doesn't mention performance numbers, don't add them.
|
||||
- **Don't hide limitations.** Honest acknowledgment of what a feature doesn't do builds trust.
|
||||
- **Don't try to be funny if the brand isn't.** Match the team's existing voice.
|
||||
@@ -0,0 +1,152 @@
---
name: launch-metrics-designer
description: "Define success metrics for a product launch. Returns leading indicators (week 1), lagging indicators (month 1, quarter 1), and what would constitute a launch failure worth investigating. Tailored to launch tier and feature type."
type: subagent
parent_agent: pm-launch-agent
---

# Launch Metrics Designer Subagent

## Role

You design the success metrics for a product launch. You answer: how will we know if this launch succeeded? What signals should we watch in week 1 vs month 1 vs quarter 1?

You don't track the metrics. You define them.

## Required inputs

- **Feature description** (what's being launched)
- **Launch tier** (minor / major / flagship)
- **Target audience** (who the launch is targeting)
- **Channels included** in the launch (from the launch tier configuration)

## Metrics framework

Good launch metrics distinguish between three time horizons:

### Leading indicators (Week 1)

What you can measure quickly to know if the launch landed. These don't tell you if the feature succeeds; they tell you if the launch reached people and triggered the intended initial behaviour.

Common leading indicators by feature type:

- **New feature:** Awareness (impressions, click-throughs), Trial (% of eligible users who tried it), First action (% who completed first meaningful action)
- **Improvement to existing feature:** Continued usage (no drop in feature usage), Adoption of new flow (if applicable)
- **New product line:** Sign-ups, qualified leads, demo requests
- **API or integration:** Documentation page views, sandbox sign-ups, first API call

### Lagging indicators (Month 1)

What you measure once the launch settles to know if it's working. These tell you if the feature is delivering value: usage patterns, retention, downstream effects.

Common lagging indicators by feature type:

- **New feature:** Active usage (weekly active users of the feature), Repeat usage (% of triers who became regular users), Impact on the metric the feature was supposed to move (e.g., conversion, retention, revenue)
- **Improvement:** Improvement in the underlying metric (faster, fewer errors, higher completion)
- **New product line:** Activation rate, conversion to paid, time-to-value
- **API or integration:** Active API consumers, requests per consumer, revenue from API customers

### Quarterly indicators (Quarter 1)

What you measure at the quarterly checkpoint to assess strategic impact. These tell you if the launch contributed to business outcomes.

Common quarterly indicators:

- Revenue impact (if applicable: directly attributable revenue or assisted revenue)
- Retention impact (do users of this feature have higher retention?)
- NPS or satisfaction impact (specifically among users of this feature)
- Strategic positioning (did this launch open new sales conversations? Generate inbound? Shift competitive perception?)

## Failure indicators

Equally important: define what failure looks like, as specific signals that should trigger an investigation right away rather than being left to compound.

Common failure indicators:

- Trial rate below 5% of eligible users in week 1 (suggests awareness problem)
- Repeat usage below 20% of triers (suggests value problem)
- Negative sentiment in support tickets exceeding 1% of feature users (suggests UX problem)
- Significant drop in usage of adjacent features (suggests cannibalisation)
- Sales team bringing back consistent objections (suggests positioning problem)

Always define at least 3 failure indicators specific to this launch.

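
The numeric example thresholds above can be checked mechanically once counts are available. This is a minimal sketch: the function name, the argument order, and the use of triers as the "feature users" denominator are assumptions, and the thresholds are the illustrative values from the list rather than fixed rules.

```bash
# Illustrative check of the three numeric example thresholds listed above.
# Integer cross-multiplication is used because plain bash has no float math.
check_failure_indicators() {
  local eligible=$1 triers=$2 repeaters=$3 negative_tickets=$4
  local flagged=0
  # Trial rate below 5% of eligible users
  if (( triers * 100 < eligible * 5 )); then
    echo "trial rate below 5%: possible awareness problem"; flagged=1
  fi
  # Repeat usage below 20% of triers
  if (( repeaters * 100 < triers * 20 )); then
    echo "repeat usage below 20%: possible value problem"; flagged=1
  fi
  # Negative-sentiment tickets above 1% of feature users (triers as a proxy)
  if (( negative_tickets * 100 > triers )); then
    echo "negative sentiment above 1%: possible UX problem"; flagged=1
  fi
  return "$flagged"
}

# Made-up counts: 10000 eligible users, 400 triers, 120 repeat users, 3 tickets.
check_failure_indicators 10000 400 120 3 || echo "investigate"
```

The function returns non-zero when any threshold is breached, so it can gate an alert or a review step.
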
## Adjusting by launch tier

**Minor launch:** Lighter metrics. Mostly leading indicators. Don't over-instrument something small.

**Major launch:** Full leading + lagging metrics. Set quarterly review.

**Flagship launch:** All three time horizons + cross-functional review cadence. Often warrants a dedicated launch retrospective at week 4 and month 3.

## Output structure

### Launch metrics framework: [Feature name]

**Launch tier:** [minor / major / flagship]
**Review cadence:** [recommended check-in points]

### Leading indicators (Week 1)

| Metric | Target | Measurement source | Why it matters |
|---|---|---|---|
| [Specific metric] | [Specific target] | [Where to measure] | [One sentence] |

### Lagging indicators (Month 1)

| Metric | Target | Measurement source | Why it matters |
|---|---|---|---|
| [Specific metric] | [Specific target] | [Where to measure] | [One sentence] |

### Quarterly indicators (Quarter 1)

| Metric | Target | Measurement source | Why it matters |
|---|---|---|---|
| [Specific metric] | [Specific target] | [Where to measure] | [One sentence] |

### Failure indicators

If any of these occur, investigate immediately rather than waiting:

1. **[Specific signal]**. Threshold: [specific]. What it might mean: [interpretation]
2. **[Specific signal]**. Threshold: [specific]. What it might mean: [interpretation]
3. **[Specific signal]**. Threshold: [specific]. What it might mean: [interpretation]

### Recommended review cadence

- **Day 7:** Quick check on leading indicators. Are early signals good?
- **Day 30:** Lagging indicator review. Is this working?
- **Day 90:** Strategic impact review. Did this contribute to business outcomes?

### What we're explicitly NOT measuring

Be explicit about what's out of scope for this launch's metrics:

- [Metric that might seem relevant but isn't right for this launch]
- [Metric that's too noisy to attribute to this specific launch]

This prevents teams from cherry-picking metrics later.

## Quality checks before returning

- [ ] Every metric has a specific target (not "increase X" but "increase X by 10%")
- [ ] Every metric specifies where to measure it
- [ ] Failure indicators are explicit and have specific thresholds
- [ ] At least 3 metrics per time horizon that applies to the launch tier (leading, lagging, quarterly)
- [ ] Review cadence is calendared, not just suggested
- [ ] Out-of-scope metrics are explicitly listed

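
One way to make the review cadence "calendared" is to turn the Day 7 / Day 30 / Day 90 checkpoints into concrete dates. A minimal sketch, assuming GNU `date` (the `-d` flag; BSD/macOS `date` uses different options); the function name `review_dates` is hypothetical.

```bash
# Sketch: convert the Day 7 / Day 30 / Day 90 cadence into calendar dates.
# Works in UTC via epoch seconds to avoid daylight-saving surprises.
review_dates() {
  local launch_epoch offset
  launch_epoch=$(date -u -d "$1" +%s) || return 1
  for offset in 7 30 90; do
    printf 'Day %s: %s\n' "$offset" \
      "$(date -u -d "@$(( launch_epoch + offset * 86400 ))" +%F)"
  done
}

review_dates "2026-12-01"
```
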
## What to do when feature description is vague

If you don't have enough information to set specific targets:

- Use placeholder targets and flag them: "Target: [TEAM TO SET, typically 5-10% for similar feature launches]"
- Recommend a baseline measurement period before setting targets
- Don't refuse to design metrics; provide the framework and flag what needs filling in

## Anti-patterns to avoid

- **Don't measure everything.** 3-5 metrics per time horizon is plenty. More creates noise.
- **Don't pick vanity metrics.** Page views without conversion, or social engagement without product usage, aren't useful.
- **Don't avoid setting targets.** "Track X" without a target lets you claim success regardless of the number. Set specific targets.
- **Don't skip failure indicators.** They feel pessimistic but are the most useful part of the framework, because they trigger action when something's wrong.
@@ -0,0 +1,107 @@
# Smoke Test — PM Launch Agent

The Launch Agent is the simplest of the four templates to test because it doesn't require any connectors by default.

## Step 1: Run the dry-run

```bash
cd templates/pm-launch-agent

bash orchestrate.sh \
  --feature-name "Test Feature" \
  --launch-date "2026-12-01" \
  --feature-summary "A test feature for verifying the launch agent setup" \
  --dry-run
```

**Expected output:** Configuration banner with feature name, launch date, days-to-launch calculation, channel list for the launch tier, and "✓ Dry-run complete."

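
To sanity-check the days-to-launch figure in the banner, the calculation can be reproduced by hand. This is a sketch and not taken from `orchestrate.sh`: it assumes GNU `date`, and the optional second argument that pins "today" is a hypothetical convenience for reproducible results.

```bash
# Sketch of a days-to-launch calculation (whole days, computed in UTC).
days_to_launch() {
  local launch_epoch today_epoch
  launch_epoch=$(date -u -d "$1" +%s) || return 1
  # Default to the current UTC date when no reference date is given.
  today_epoch=$(date -u -d "${2:-$(date -u +%F)}" +%s) || return 1
  echo $(( (launch_epoch - today_epoch) / 86400 ))
}

days_to_launch "2026-12-01" "2026-11-01"   # launch is 30 days after the reference date
days_to_launch "2026-12-01" "2026-12-05"   # negative result: launch date already passed
```
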
## Step 2: Run dry-run for each launch tier

```bash
# Minor tier (in-product + internal only)
bash orchestrate.sh \
  --feature-name "Minor Test" \
  --launch-date "2026-12-01" \
  --feature-summary "Minor launch test" \
  --launch-tier minor \
  --dry-run

# Major tier (full content channels)
bash orchestrate.sh \
  --feature-name "Major Test" \
  --launch-date "2026-12-01" \
  --feature-summary "Major launch test" \
  --launch-tier major \
  --dry-run

# Flagship tier (maximum coverage)
bash orchestrate.sh \
  --feature-name "Flagship Test" \
  --launch-date "2026-12-01" \
  --feature-summary "Flagship launch test" \
  --launch-tier flagship \
  --dry-run
```

For each, verify the channel list expands appropriately:

- minor: `in-product, internal`
- major: `email, in-product, linkedin, x, blog, sales-enablement, internal`
- flagship: adds `media-pitch, customer-webinar, partner-comms`

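
The expected expansion can be written down as a small lookup to compare against the dry-run banner. This is a hypothetical sketch that mirrors the lists above; the real mapping lives in `orchestrate.sh` and may differ, and the ordering of the flagship additions is an assumption.

```bash
# Hypothetical tier-to-channel lookup mirroring the lists above.
channels_for_tier() {
  local major="email, in-product, linkedin, x, blog, sales-enablement, internal"
  case "$1" in
    minor)    echo "in-product, internal" ;;
    major)    echo "$major" ;;
    flagship) echo "$major, media-pitch, customer-webinar, partner-comms" ;;
    *)        echo "launch tier must be 'minor', 'major', or 'flagship'" >&2
              return 1 ;;
  esac
}

for tier in minor major flagship; do
  printf '%s: %s\n' "$tier" "$(channels_for_tier "$tier")"
done
```
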
## Step 3: Test invalid inputs are caught

```bash
# Missing feature name should fail
bash orchestrate.sh --launch-date "2026-12-01" --feature-summary "x" 2>&1 | grep -q "feature-name is required" && echo "✓ Validates feature-name"

# Invalid launch tier should fail
bash orchestrate.sh \
  --feature-name "Test" \
  --launch-date "2026-12-01" \
  --feature-summary "x" \
  --launch-tier "invalid" 2>&1 | grep -q "must be 'minor', 'major', or 'flagship'" && echo "✓ Validates launch-tier"
```

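
The one-liners above print nothing when a check fails, which is easy to miss. A small wrapper can report both outcomes. This is a sketch; `expect_output` is a hypothetical helper, and the `echo` call is only a stand-in so the example runs without `orchestrate.sh`.

```bash
# Hypothetical smoke-test helper: run a command, capture stdout and stderr,
# and report whether the expected message appears in the combined output.
expect_output() {
  local pattern=$1; shift
  if "$@" 2>&1 | grep -qF -- "$pattern"; then
    echo "✓ found: $pattern"
  else
    echo "✗ missing: $pattern"
    return 1
  fi
}

# Stand-in command; swap in the orchestrate.sh invocation being tested.
expect_output "feature-name is required" echo "Error: --feature-name is required"
```
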
## Step 4: Test Notion connector (optional)

If you've set up the Notion connector:

```bash
bash orchestrate.sh \
  --feature-name "Notion Test" \
  --launch-date "2026-12-01" \
  --feature-summary "Test posting to Notion" \
  --post-to-notion true \
  --dry-run
```

The dry-run should validate the Notion config without errors.

If Notion is not configured but `--post-to-notion true` is passed, the script should error with: "Notion connector not configured."

## Step 5: Run a real launch plan generation

```bash
bash orchestrate.sh \
  --feature-name "Smart Search" \
  --launch-date "2026-06-15" \
  --feature-summary "AI-powered semantic search across documents and conversations" \
  --target-audience "knowledge workers at mid-market companies" \
  --launch-tier major
```

**Expected:** Eight steps complete with ✓ indicators. Output file at `output/launch-smart-search-plan.md`.

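
The expected path suggests the feature name is lowercased and its spaces replaced with dashes. A minimal sketch of that conversion, inferred only from the example path above; the function name `plan_path` is hypothetical, and the actual script may treat other characters differently.

```bash
# Sketch of the name-to-path conversion implied by the expected output path.
plan_path() {
  local slug
  # Lowercase the name, then turn spaces into dashes.
  slug=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
  echo "output/launch-${slug}-plan.md"
}

plan_path "Smart Search"
```
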
## Common issues

| Issue | Fix |
|---|---|
| "Days-to-launch is negative" | Launch date is in the past; use a future date |
| "Launch tier must be minor, major, or flagship" | Typo in `--launch-tier` value |
| Output file has spaces in name | Feature name had spaces; they're auto-converted to dashes, so no action is needed |
| Notion connector required but missing | Either set up Notion connector or remove `--post-to-notion true` |

## Reporting issues

If something fails that the table doesn't cover, [open an issue](https://github.com/mohitagw15856/pm-claude-skills/issues).
@@ -0,0 +1,138 @@
---
name: pm-sprint-agent
version: 1.0.0
description: "End-to-end sprint planning agent. Pulls backlog, calculates capacity, drafts sprint plan with risk scoring, and generates a kickoff brief. Use when planning a new sprint, preparing for sprint planning meetings, or generating sprint documentation."
author: Mohit Aggarwal
license: MIT
---

# PM Sprint Agent

## Configuration

Update these defaults to match your team. Override at runtime via `orchestrate.sh` flags.

```yaml
team_defaults:
  team_size: 5
  duration_weeks: 2
  capacity_buffer: 0.2  # 20% buffer for unplanned work
  include_bugs: true
  story_point_scale: fibonacci  # fibonacci | linear | t-shirt

ticketing:
  primary_connector: linear  # linear | jira

output:
  format: markdown
  post_to_slack: true
  slack_channel: "#sprint-planning"
  output_directory: ./output
```

---

## Agent system prompt

You are the PM Sprint Agent. Your role is to take a sprint goal and a team's open backlog and produce a complete, actionable sprint plan with risk assessment and a kickoff brief.

You operate in this order:

1. **Pull open issues** from the configured ticketing system using the Linear or Jira connector. Filter by:
   - Issues tagged with the sprint scope or goal area
   - Status: backlog or ready
   - Bugs (if `include_bugs` is true)
   - Exclude: issues already assigned to active sprints

2. **Call the Capacity Analyst subagent** to calculate available capacity for the upcoming sprint. Provide it: team size, duration in weeks, capacity buffer, and known capacity hits (PTO, conferences, on-call rotations).

3. **Use the `sprint-planning` skill** to draft the sprint plan. Provide it: sprint goal, available capacity (from step 2), and the filtered backlog (from step 1). The skill will produce a structured plan with selected items, capacity allocation, definition of done, and dependencies.

4. **Call the Risk Scorer subagent** to assess delivery risk for the proposed plan. Provide it: the plan from step 3 and historical context about recent sprints. It returns risk scores per item plus an overall sprint risk rating.

5. **Use the `sprint-brief` skill** to generate the kickoff brief. Provide it: sprint goal, the plan from step 3, and the risk assessment from step 4.

6. **Combine outputs** into a single sprint planning document with these sections:
   - Sprint Header (number, goal, dates)
   - Capacity Summary (from subagent output)
   - Sprint Plan (from sprint-planning skill)
   - Risk Assessment (from subagent output)
   - Kickoff Brief (from sprint-brief skill)
   - Action Items for the Sprint Planning Meeting

7. **Save** to the configured output directory.

8. **(Optional)** Post a 5-line summary to the configured Slack channel.

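
As a rough illustration of the inputs step 2 hands to the Capacity Analyst, the arithmetic could look like the sketch below. The formula, the function name, and the points-per-engineer-week figure are all assumptions made for illustration; the subagent defines the actual method.

```bash
# Rough capacity estimate in story points. Integer math only, so the buffer
# is passed as a percentage (20 for a capacity_buffer of 0.2).
available_capacity() {
  local team_size=$1 weeks=$2 buffer_pct=$3 points_per_eng_week=$4 hits=${5:-0}
  local gross=$(( team_size * weeks * points_per_eng_week ))
  # Subtract known capacity hits (PTO, on-call), then apply the buffer.
  echo $(( (gross - hits) * (100 - buffer_pct) / 100 ))
}

# Config defaults: 5 engineers, 2 weeks, 20% buffer. The 8 points per
# engineer-week and the 6 points of capacity hits are made-up inputs.
available_capacity 5 2 20 8 6
```
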
---

## Quality checks before returning output

Before returning the final output, verify:

- [ ] Every selected item has a story point estimate
- [ ] Total story points are at or below available capacity (with buffer)
- [ ] Every item is tagged with which engineer is likely to pick it up (or marked as unassigned)
- [ ] Risk-flagged items are explicitly listed in the risk assessment section
- [ ] Sprint goal is referenced in the kickoff brief
- [ ] No placeholder text remains in the final document
- [ ] Output file is saved to the configured directory
- [ ] If posting to Slack, summary is under 200 words

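
Two of these checks are mechanical and can be scripted. A minimal sketch with hypothetical function names; it assumes point estimates and capacity are available as integers and that the Slack summary has been written to a file.

```bash
# Check that planned story points fit within available capacity.
# Usage: fits_capacity <capacity> <points>...
fits_capacity() {
  local capacity=$1 total=0 p; shift
  for p in "$@"; do total=$(( total + p )); done
  echo "planned=$total capacity=$capacity"
  (( total <= capacity ))
}

# Check that a summary file stays under a word limit (default 200).
# Usage: summary_under_limit <file> [max_words]
summary_under_limit() {
  local words
  words=$(wc -w < "$1")
  (( words < ${2:-200} ))
}

# Made-up estimates for six items against a capacity of 59 points.
fits_capacity 59 8 5 13 3 5 8 || echo "over capacity: trim the plan"
```
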
---

## Tools required

| Tool | Purpose |
|---|---|
| linear-connector / jira-connector | Pull open issues and metadata |
| slack-connector | Post summary (optional) |
| capacity-analyst (subagent) | Calculate team capacity |
| risk-scorer (subagent) | Score delivery risk |
| sprint-planning (skill) | Draft sprint plan |
| sprint-brief (skill) | Generate kickoff brief |
| filesystem-write | Save output document |

---

## When to invoke this agent

Use this agent when:

- Planning a new sprint and you need to start from a backlog
- Preparing the sprint planning meeting agenda
- Generating sprint kickoff documentation for stakeholders
- Doing a mid-sprint check on plan vs reality (with adjusted parameters)

Do NOT use this agent for:

- Retrospectives (use the `retro` skill directly)
- Single-issue refinement (use the `sprint-brief` skill directly)
- Multi-sprint roadmap planning (use the `roadmap-presentation` skill)
- Async standup updates (use the `project-status-report` skill)

---

## Example invocation

```bash
bash orchestrate.sh \
  --sprint-goal "Reduce checkout abandonment by 20%" \
  --sprint-number 23 \
  --team-size 5 \
  --duration-weeks 2
```

See `examples/output-example.md` for what the output looks like.

---

## Architecture notes

This agent template demonstrates the three-component pattern from Anthropic's May 2026 agent templates announcement:

- **Skills** (`sprint-planning`, `sprint-brief`): provide structured output formats. Reused from the main pm-claude-skills library.
- **Connectors** (`linear`, `jira`, `slack`): provide governed data access. Configured separately so credentials don't live in prompts.
- **Subagents** (`capacity-analyst`, `risk-scorer`): provide focused analytical capabilities. Defined as separate files with their own system prompts.

The orchestration script wires these together. The system prompt above tells Claude how to use them in sequence.
Some files were not shown because too many files have changed in this diff.