feat: v10.0.0 — 8 new skills across Customer Success and Engineering (500-star milestone)
Two star milestones shipped together:

Customer Success bundle (pm-cs) — 250-star milestone:
- cs-health-scorecard: weighted RAG health score across 5 dimensions with renewal forecast
- qbr-deck: slide-by-slide QBR structure with value narrative and mutual commitments
- cs-escalation-brief: 4-level escalation framework with root cause, impact, and decision required
- churn-analysis: voluntary/unavoidable churn split, early warning signals, prioritised interventions

Engineering expansion (pm-engineering) — 500-star milestone:
- cicd-playbook: full pipeline playbook from build through post-deploy checks and rollback
- slo-error-budget: SLI definitions, burn rate alerts, and error budget policy
- developer-onboarding-doc: first-week guide covering architecture, setup, testing, and contacts
- oncall-runbook: per-alert response procedures, escalation matrix, and handoff template

Also:
- Added pm-cs plugin to marketplace.json
- Updated pm-engineering plugin.json to v3.0.0 (14 skills)
- Updated marketplace.json to v10.0.0 (114 skills, 23 bundles, 16 professions)
- README updated with new CS section, corrected skill numbering (106 → 114)
- Added bug report link to Contributing section
- Star milestones updated to show 250 and 500 as unlocked
@@ -0,0 +1,301 @@
---
name: cicd-playbook
description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities."
---

# CI/CD Playbook Skill

Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.

A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.

## Required Inputs

Ask for these if not already provided:
- **Service name** and brief description
- **Tech stack** — language, framework, containerisation (Docker, etc.)
- **Source control** — GitHub / GitLab / Bitbucket, branching strategy
- **CI platform** — GitHub Actions / CircleCI / Jenkins / Buildkite / other
- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
- **Environments** — e.g. dev, staging, production (and any canary / feature environments)
- **Deployment frequency** — how often does the team ship?
- **Any existing gates** — manual approvals, smoke tests, feature flags
- **On-call setup** — who's responsible during deploys?

## Output Format

---

# CI/CD Playbook: [Service Name]

**Service:** [Name] | **Team:** [Team name]
**Last updated:** [Date] | **Owner:** [Name / role]
**Pipeline platform:** [CI tool] → [CD tool / platform]

---

## Overview

[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]

**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand]
**Average pipeline duration:** [X minutes]
**Rollback time (p95):** [X minutes]

---
## Pipeline Stages

```
[Branch push]
      │
      ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
      │
      ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
      │
      ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
      │
      ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
      │
      ▼
[5. Build Artefact / Container Image]
      │
      ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
      │
      ▼
[7. Smoke Tests (Staging)]
      │
      ▼
[8. Manual Approval Gate] ──(if required)
      │
      ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
      │
      ▼
[10. Post-deploy checks]
```

---
## Stage Definitions

### Stage 1 — Build & Lint

**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
**Trigger:** Every commit to any branch
**Blocking:** Yes — PR cannot be merged if this fails
**Typical duration:** [X minutes]
**Owner if it fails:** PR author

**Common failure causes:**
- [e.g. Missing dependency — run `npm install` locally before pushing]
- [e.g. Lint rule violation — run `npm run lint -- --fix` to auto-fix most issues]

---
### Stage 2 — Unit Tests

**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`]
**Coverage gate:** [X]% minimum — pipeline fails below this threshold
**Trigger:** Every commit
**Blocking:** Yes
**Typical duration:** [X minutes]

**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]

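The coverage gate above reduces to a single comparison. The sketch below is illustrative only: `coverage_gate` and its parameters are hypothetical names, not part of any CI tool, and most test runners already provide an equivalent (for example the `--cov-fail-under` option of pytest-cov).

```python
def coverage_gate(covered_lines: int, total_lines: int, minimum_percent: float) -> bool:
    """Return True when line coverage meets or exceeds the minimum threshold."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    percent = 100.0 * covered_lines / total_lines
    return percent >= minimum_percent


def exit_code(covered_lines: int, total_lines: int, minimum_percent: float) -> int:
    """Map the gate result to a process exit code (0 = pass, 1 = fail the pipeline)."""
    return 0 if coverage_gate(covered_lines, total_lines, minimum_percent) else 1
```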
---

### Stage 3 — Integration Tests

**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
**Environment:** [Ephemeral test environment / shared test DB / etc.]
**Trigger:** Every commit to `main` and feature branches targeting `main`
**Blocking:** Yes
**Typical duration:** [X minutes]

**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"]

---
### Stage 4 — Security Scan

**Tools:** [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies]
**Blocking on:** Critical and High severity findings
**Non-blocking on:** Medium and Low (flagged, not blocking)
**Trigger:** Every commit to `main`

**How to handle a flagged vulnerability:**
1. Check if a fix is available — upgrade the dependency
2. If no fix available, open a security ticket and add a suppression with justification
3. Never suppress without a ticket and owner

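The blocking rule above (Critical and High block, Medium and Low only warn) can be sketched as a small verdict function. This is a minimal illustration under stated assumptions: the finding format (a dict with `id` and `severity` keys) and the name `scan_verdict` are hypothetical and do not match the output of any particular scanner.

```python
BLOCKING_SEVERITIES = {"critical", "high"}


def scan_verdict(findings: list[dict]) -> tuple[str, list[dict]]:
    """Return ("block" | "warn" | "pass", findings that caused the verdict)."""
    blocking = [f for f in findings if f["severity"].lower() in BLOCKING_SEVERITIES]
    if blocking:
        return "block", blocking   # fail the pipeline
    if findings:
        return "warn", findings    # flag in the report, do not fail
    return "pass", []
```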
---

### Stage 5 — Build Artefact

**What is produced:** [Docker image / binary / zip — be specific]
**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL]
**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`)
**Trigger:** Commits to `main` only (not feature branches)

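As a rough sketch of the tagging convention above, the helper below derives the tags for one build. The function name `image_tags` and the example service name are hypothetical; the registry prefix is left out because it depends on the registry in use.

```python
def image_tags(service: str, git_sha: str, branch: str) -> list[str]:
    """Return the image tags for a build: always the SHA tag, plus :latest on main."""
    tags = [f"{service}:{git_sha}"]
    if branch == "main":
        tags.append(f"{service}:latest")
    return tags
```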
---

### Stage 6 — Deploy to Staging

**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
**Staging URL:** [URL]
**Trigger:** Automatic on successful artefact build from `main`
**Who can deploy to staging:** Any engineer (automatic)

**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
**Staging is not production:** [Any differences in config, scale, or data — state them here]

---

### Stage 7 — Smoke Tests (Staging)

**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
**Tool:** [e.g. Playwright / Postman / custom script]
**Pass criteria:** All smoke tests pass within [X seconds] timeout
**Blocking:** Yes — production deploy will not proceed if smoke tests fail

**Smoke test suite location:** [Link to test files or folder]

---

### Stage 8 — Manual Approval Gate

**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer]
**Approval timeout:** [e.g. 24 hours — auto-cancelled if no approval]
**How to approve:** [GitHub Actions approve step / Slack command / other — with link]

**When to withhold approval:**
- Active incident in production
- Deploy is outside the deployment window (see below)
- On-call engineer has not been notified

---
### Stage 9 — Deploy to Production

**Deployment method:** [Same as staging or different — specify]
**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
**Deployment notifications:** [Slack channel — #deployments]

**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass.

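A deployment window like the example above (Monday to Thursday, 09:00 to 16:00 UTC) is easy to enforce in a pre-deploy check. The sketch below assumes exactly that example window; `in_deploy_window` is a hypothetical helper, and it does not model bank holidays, which would need a separate calendar.

```python
from datetime import datetime, time, timezone

WINDOW_DAYS = {0, 1, 2, 3}   # Monday=0 .. Thursday=3
WINDOW_START = time(9, 0)    # inclusive, UTC
WINDOW_END = time(16, 0)     # exclusive, UTC


def in_deploy_window(now: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the deploy window."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    return now_utc.weekday() in WINDOW_DAYS and WINDOW_START <= now_utc.time() < WINDOW_END
```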
---

### Stage 10 — Post-Deploy Checks

**Automated checks (run for [X minutes] after deploy):**
- [ ] Error rate: <[X]% (baseline: [Y]%)
- [ ] P99 latency: <[X]ms (baseline: [Y]ms)
- [ ] [Key business metric]: within [X]% of baseline

**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link]

**If a check fails:** See Rollback Procedure below.

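The three automated checks above can be evaluated together after each deploy. The following is a minimal sketch, assuming the metrics have already been fetched from the monitoring tool; `Thresholds` and `failed_checks` are hypothetical names and the threshold values are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_error_rate: float    # fraction of requests, e.g. 0.01 for 1%
    max_p99_ms: float        # milliseconds
    max_metric_drift: float  # fraction of baseline, e.g. 0.10 for 10%


def failed_checks(error_rate: float, p99_ms: float, metric: float,
                  metric_baseline: float, t: Thresholds) -> list[str]:
    """Return the names of the post-deploy checks that failed (empty list = healthy)."""
    if metric_baseline == 0:
        raise ValueError("metric_baseline must be non-zero")
    failures = []
    if error_rate >= t.max_error_rate:
        failures.append("error_rate")
    if p99_ms >= t.max_p99_ms:
        failures.append("p99_latency")
    if abs(metric - metric_baseline) / abs(metric_baseline) > t.max_metric_drift:
        failures.append("business_metric")
    return failures
```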
---

## Environments

| Environment | Purpose | Deploy trigger | URL | Data |
|---|---|---|---|---|
| **Dev** | Local development | Manual | localhost | Seeded test data |
| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
| **Production** | Live traffic | Manual approval | [URL] | Live data |

---

## Branching Strategy

**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly]

| Branch | Purpose | Who merges | Deploy target |
|---|---|---|---|
| `main` | Production-ready code | PR + review | Staging → Production |
| `feature/*` | Feature development | Author | None (CI only) |
| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval |

**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]

---
## Rollback Procedure

**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]

**Manual rollback steps:**
```bash
# 1. Identify the last known good image tag
[command to list recent deployments]

# 2. Deploy the previous version
[deployment command with previous tag]

# 3. Confirm rollback is live
[smoke test command or health check URL]

# 4. Notify the team
[Slack command or template]
```

**Rollback decision authority:** Any engineer on-call can initiate a rollback without waiting for approval.

**After a rollback:**
1. Create a post-deploy incident report (see [incident-postmortem skill])
2. Do not re-deploy the same commit without fixing the root cause
3. Notify [stakeholder / support team] of the rollback and expected fix timeline

---
## Secrets and Configuration Management

**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify]
**How to add a new secret:**
1. [Step 1]
2. [Step 2]

**Who has access:** [Role or team]
**Rotation policy:** [How often secrets are rotated and who owns it]

**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4) which will flag this.

---
## Common Failures and Fixes

| Failure | Likely cause | Fix |
|---|---|---|
| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the updated lock file |
| Integration tests timeout | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` |
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike persists for more than 5 minutes |

---
## On-Call Responsibilities During Deploy

- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments`
- For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying

---

## Quality Checks

- [ ] Every stage has a clear owner when it fails
- [ ] Rollback procedure is tested — not theoretical
- [ ] Secrets management section names the actual tool used (not "use secrets management")
- [ ] Deployment window is specific — not "during business hours"
- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics
@@ -0,0 +1,332 @@
---
name: developer-onboarding-doc
description: "Write a developer onboarding document for a service, codebase, or team. Use when asked to write a developer guide, service README, onboarding doc for a new engineer, codebase orientation, or getting-started guide for a technical team. Produces a structured doc covering service overview, architecture, local setup, key patterns, testing, deployment, and who to ask for what."
---

# Developer Onboarding Document Skill

Produce a complete developer onboarding document for a service or team — covering everything a new engineer needs to be productive within their first week.

A good onboarding doc is not a wiki dump. It answers the questions a new engineer actually has on day one, in the order they'll have them.

## Required Inputs

Ask for these if not already provided:
- **Service name** and what it does
- **Team** responsible for it
- **Tech stack** — language(s), framework(s), database(s), message queues, etc.
- **Key external dependencies** — upstream services, third-party APIs
- **Deployment target** — Kubernetes, ECS, Lambda, bare metal, etc.
- **Local dev setup** — how to run locally (Docker Compose, local DB, etc.)
- **Testing approach** — unit, integration, E2E; test commands
- **Deployment process** — summary of how code gets to production
- **On-call setup** — who's on-call, how alerts work
- **Contacts** — tech lead, platform team, related service owners

## Output Format

---

# Developer Onboarding: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**Last updated:** [Date] | **Updated by:** [Name]

> If something in this doc is wrong or out of date, fix it now — it will affect every engineer who onboards after you.

---
## What This Service Does

[3–5 sentences. What problem does this service solve? Who calls it, and who does it call? What would break if this service went down?]

**Service type:** [API / Background worker / Event consumer / Data pipeline / etc.]
**Consumers:** [List internal services or external clients that depend on this service]
**Dependencies:** [List upstream services, databases, and third-party APIs this service calls]

**Architecture diagram:** [Link or embed — even a rough ASCII diagram helps]

```
[Caller A] ──→ [This Service] ──→ [Database]
                     │
                     └──→ [Downstream Service]
```

---
## Codebase Orientation

**Repository:** [Link]
**Main branch:** `[main / master]`
**Language:** [e.g. Go 1.22 / Node.js 20 / Python 3.12]
**Framework:** [e.g. Express / FastAPI / Gin / Rails]

### Key directories

```
[repo-root]/
├── [src/ or cmd/]          # Application code
│   ├── [handlers/]         # HTTP handlers / controllers
│   ├── [services/]         # Business logic
│   ├── [repository/]       # Database access layer
│   └── [models/]           # Data models / types
├── [tests/]                # Test files
├── [migrations/]           # Database migrations
├── [scripts/]              # Utility scripts
├── [.github/workflows/]    # CI/CD pipeline definitions
└── [docs/]                 # Additional documentation
```

**Where to start reading:** [Point to 2–3 key files that give the best orientation — e.g. `main.go`, `routes.js`, `app.py`]

### Things that might surprise you

- [Unusual pattern 1 — e.g. "We use event sourcing — state is derived from an event log, not stored directly"]
- [Unusual pattern 2 — e.g. "Auth is handled by the gateway — this service trusts the `X-User-Id` header"]
- [Unusual pattern 3 — any non-obvious decisions or legacy choices]

---
## Local Development Setup

**Estimated setup time:** [X minutes for a fresh machine]

### Prerequisites

- [ ] [Tool 1] — version [X] — [install link]
- [ ] [Tool 2] — version [X] — [install link]
- [ ] Access to [repo / internal package registry] — request from [who]
- [ ] [Any secrets or credentials needed] — request from [who]

### Step-by-step setup

```bash
# 1. Clone the repo
git clone [repo URL]
cd [repo-name]

# 2. Copy and configure environment variables
cp .env.example .env
# Edit .env — see "Environment Variables" section below

# 3. Start dependencies (database, cache, etc.)
[docker compose up -d / make deps / etc.]

# 4. Install dependencies
[npm install / go mod download / pip install -r requirements.txt]

# 5. Run database migrations
[migration command]

# 6. Start the service
[start command]

# 7. Verify it's working
curl http://localhost:[PORT]/health
# Expected: {"status":"ok"}
```

**If this doesn't work:** Check [Troubleshooting section below] or ask in `#[channel]`.
### Environment Variables

| Variable | Required | Description | Example |
|---|---|---|---|
| `DATABASE_URL` | Yes | Connection string for the primary DB | `postgres://localhost:5432/[db]` |
| `[VAR_2]` | Yes | [Description] | [Example] |
| `[VAR_3]` | No | [Description — default value] | [Example] |

**Secrets for local dev:** [Where to get them — e.g. "Run `[command]` to pull from Vault" or "Ask [person] in #[channel]"]

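One way to keep the required/optional split in the table honest is to validate it at startup so that a missing value fails loudly. The sketch below is an assumption-laden illustration: `load_config` is a hypothetical helper, `LOG_LEVEL` is an invented optional variable, and only `DATABASE_URL` comes from the table above.

```python
import os

REQUIRED = ["DATABASE_URL"]
OPTIONAL_DEFAULTS = {"LOG_LEVEL": "info"}  # hypothetical optional variable with a default


def load_config(env=None) -> dict:
    """Read configuration from the environment, failing fast on missing required values."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        config[name] = env.get(name, default)
    return config
```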
### Useful local commands

```bash
[start command]        # Start the service
[test command]         # Run all tests
[lint command]         # Run linter
[format command]       # Format code
[migration command]    # Run pending migrations
[seed command]         # Seed local database
```

---
## Testing

**Testing philosophy:** [e.g. "We test at the integration layer — unit tests for pure functions, integration tests for anything touching the DB or external services"]

### Running tests

```bash
# All tests
[test command]

# Unit tests only
[unit test command]

# Integration tests (requires local deps running)
[integration test command]

# A specific test file or test case
[test command with filter]
```

**Test coverage:** [X]% (minimum required to pass CI: [Y]%)
**Coverage report:** [Where to find it]

### Writing tests

- **Unit tests:** [Where to put them — e.g. alongside source files as `*_test.go`]
- **Integration tests:** [Where to put them — e.g. `tests/integration/`]
- **Test database:** [How it works — e.g. "Each test gets a clean transaction that rolls back on teardown — see `tests/helpers/db.go`"]
- **Mocking:** [Policy — e.g. "We mock at the repository layer — don't mock the DB directly"]

---
## Making Changes

### Branching

[Branch naming convention — e.g. `feature/[ticket-id]-short-description`, `fix/[ticket-id]-short-description`]

### Before opening a PR

- [ ] Tests pass locally
- [ ] Linter passes (`[lint command]`)
- [ ] New behaviour has test coverage
- [ ] Any new environment variables are added to `.env.example` and documented
- [ ] Database migrations are backward-compatible (old code can run against new schema)

### Code review

- **Reviewers:** [Who to request review from — e.g. "Any engineer on [team]; lead review required for auth changes"]
- **Expected review time:** [X hours / 1 business day]
- **PR template:** [Link or auto-generated by GitHub]

### Database migrations

```bash
# Create a new migration
[migration create command]

# Apply pending migrations
[migration up command]

# Roll back last migration
[migration down command]
```

**Migration rules:**
- All migrations must be backward-compatible — old code must run against the new schema
- Never rename or drop a column in a single migration. Do it in stages: add the new column, migrate the data, then drop the old column
- Test your rollback before merging

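The staged rename described in the migration rules can be sketched against an in-memory SQLite database. Everything here is hypothetical (the `users` table, the `name` and `full_name` columns, the function name); a real project would express the same stages in its own migration tool.

```python
import sqlite3


def expand_and_backfill(conn: sqlite3.Connection) -> None:
    """Stages 1 and 2 of renaming users.name to users.full_name."""
    # Stage 1 (add new): old code keeps working because `name` is untouched.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    # Stage 2 (migrate data): copy existing values into the new column.
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
    conn.commit()
    # Stage 3 (drop old) belongs in a later migration, shipped only once
    # no deployed code still reads `name`.


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one row, for trying the migration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")
    return conn
```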
---

## Deployment

**How code gets to production:** [1–2 sentence summary — link to full CI/CD playbook if it exists]

1. Merge to `main` → automatic deploy to staging
2. Smoke tests run on staging
3. Manual approval → deploy to production
4. Post-deploy monitoring for [X minutes]

**Deployment docs:** [Link to CI/CD playbook or pipeline docs]

**Who can deploy:** [Any engineer / Lead engineer / On-call engineer — specify]

**Deployment channel:** `#[deployments channel]`

---

## Monitoring and Observability

**Dashboard:** [Datadog / Grafana / CloudWatch — link]
**Logs:** [Log aggregation tool and link — e.g. "Logs are in Datadog under service:[name]"]
**Traces:** [Tracing tool and link if applicable]
**Alerts:** [Where alerts fire — e.g. PagerDuty / Slack #alerts-[service]]

**Key metrics to know:**
- **Error rate:** Should be <[X]% (alert at [Y]%)
- **P99 latency:** Should be <[X]ms
- **[Business metric]:** [e.g. "Queue depth should be <100 items"]

---
## On-Call

**On-call schedule:** [PagerDuty / Opsgenie link]
**Who's on-call now:** [Link to current schedule or `#on-call` channel]
**Escalation:** [On-call → [team lead] → [EM] — after [X] minutes unacknowledged]

**If you get paged:**
1. Acknowledge the alert
2. Check [dashboard link] for the first clue
3. Common alert runbooks: [link to oncall-runbook or runbook-writer output]
4. If you can't resolve in [X minutes], escalate to [person/channel]

---

## Key Contacts

| Role | Name | Best way to reach |
|---|---|---|
| Tech lead | [Name] | Slack: @[handle] |
| On-call rotation | [Team] | PagerDuty / `#on-call` |
| Platform / infra | [Team] | `#platform` Slack channel |
| Database / DBA | [Name or team] | `#database` Slack channel |
| [Upstream service] owner | [Name] | Slack: @[handle] |

**Where to ask questions:**
- General engineering: `#engineering`
- This service specifically: `#[service-name]`
- Urgent / production issues: `#incidents`

---
## Troubleshooting

### "The service won't start locally"

1. Check that Docker / dependencies are running: `[command]`
2. Check `.env` is populated — missing values cause silent failures
3. Check logs: `[log command]`
4. Ask in `#[channel]`

### "Tests are failing locally but passing in CI"

- Check your local dependency versions match CI: `[version check command]`
- Try a clean install: `[clean install command]`
- Integration tests need local deps running — `[start deps command]`

### "I can't access [internal tool / system]"

- Request access through [process — e.g. Okta self-serve / ask your manager]

### "Something looks wrong in production"

1. Check [dashboard] for the error spike
2. Check recent deploys in `#deployments`
3. If it's an active incident, page on-call via [PagerDuty / Slack command]

---

## Further Reading

- [Architecture Decision Records (ADRs)](./docs/decisions/) — why the codebase is the way it is
- [API documentation](./docs/api/) or [link to external docs]
- [Incident runbooks](./docs/runbooks/)
- [CI/CD pipeline documentation](./docs/cicd/)
- [Team working agreements](./docs/team/)

---

## Quality Checks

- [ ] Local setup instructions work on a fresh machine — tested recently
- [ ] Environment variables table is complete and accurate
- [ ] "Things that might surprise you" captures the actual surprises (ask a recent joiner)
- [ ] On-call section has real links, not placeholders
- [ ] Contacts are current — team members with real Slack handles
- [ ] Troubleshooting covers the top 3 actual questions new joiners ask
@@ -0,0 +1,364 @@
---
name: oncall-runbook
description: "Write an on-call runbook for a service — covering alert definitions, escalation paths, common incident responses, and on-call handoff procedures. Use when asked to write an on-call guide, create alert runbooks, document escalation procedures, or prepare an on-call handoff document. Produces a structured on-call runbook with per-alert response procedures, escalation matrix, diagnostic commands, and handoff template."
---

# On-Call Runbook Skill

Produce a complete on-call runbook for a service — giving the on-call engineer everything they need to respond confidently to alerts at 3am, without having to ask anyone for help.

A good on-call runbook reduces mean time to resolution (MTTR) by eliminating the "what do I do first?" problem. It is written for the on-call engineer who has just been paged and needs to act, not for someone calmly reading documentation.

## Required Inputs

Ask for these if not already provided:
- **Service name** and what it does
- **Team** and tech lead name
- **Alert list** — names of alerts that currently page on-call
- **Monitoring setup** — Datadog / Grafana / CloudWatch / PagerDuty / etc.
- **Common failure modes** — what breaks most often, and what fixes it
- **Escalation contacts** — who to call when on-call can't resolve it
- **Deployment setup** — can on-call roll back? How?
- **Service dependencies** — what does this service depend on, and what depends on it?

## Output Format

---

# On-Call Runbook: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**PagerDuty service:** [Link] | **Escalation policy:** [Policy name]
**Last updated:** [Date] | **Next review:** [Date + 90 days]

> **First time on-call for this service?** Read the [developer onboarding doc] first — it covers the architecture and how things work. This runbook assumes you understand the service.

---
## Quick Reference

**Dashboard:** [Link — the first thing to open when paged]
**Logs:** [Link — where to find logs]
**Runbook index:** Jump to the alert that paged you → [Alert list below]
**Can't resolve in 30 min?** Escalate to: [Name] via [Slack / PagerDuty]

**Rollback command (memorise this):**
```bash
[rollback command — e.g. kubectl rollout undo deployment/[service-name]]
```

---
## Escalation Matrix

| Situation | Escalate to | How | After how long |
|---|---|---|---|
| Can't diagnose the alert | [Tech lead name] | Slack DM / Phone | 30 minutes |
| Alert requires infra change | [Platform team] | `#platform` Slack | Immediately |
| Customer-facing impact | [CSM / Support lead] | `#incidents` Slack | Immediately (P1) |
| Database issue | [DBA or data team] | Slack / PagerDuty | Immediately |
| [Specific dependency] down | [[Dependency] on-call] | PagerDuty / Slack | Immediately |
| Extended outage (>1 hour) | [Engineering manager] | Phone | 1 hour |

**Contacts:**

| Name | Role | Slack | Phone |
|---|---|---|---|
| [Name] | Tech lead | @[handle] | [Number] |
| [Name] | Engineering manager | @[handle] | [Number] |
| [Name] | Platform / infra | @[handle] | [Number] |
| [Platform team] | Infra on-call | `#platform` | PagerDuty |

---
## Service Architecture (Quick View)

```
[Upstream callers]
        │
        ▼
[This Service]
        │
        ├──→ [Primary Database]
        ├──→ [Cache — e.g. Redis]
        └──→ [Downstream Service / Queue]
```

**If this service is down, these are affected:** [List the callers and consumers that depend on this service]
**If these are down, this service is affected:** [List the databases, caches, and services this service calls]

---

## Alert Runbooks
### ALERT: [Alert Name 1 — e.g. HighErrorRate]

**What it means:** [Plain English — e.g. "More than 5% of API requests are returning 5xx errors in the last 5 minutes"]
**Severity:** P1 / P2 / P3
**SLO impact:** Yes / No — [If yes: this alert means the error budget is burning at [X]× rate]

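To make the example alert definition and the burn-rate note concrete, the sketch below computes both from raw request counts. It assumes the example threshold above (more than 5% of requests returning 5xx) and the usual definition of burn rate as the observed error rate divided by the error budget (1 minus the SLO target); the function names and the 99% target in the usage are hypothetical.

```python
def error_rate(total_requests: int, server_errors: int) -> float:
    """Fraction of requests in the window that returned a 5xx response."""
    if total_requests == 0:
        return 0.0
    return server_errors / total_requests


def alert_fires(total_requests: int, server_errors: int, threshold: float = 0.05) -> bool:
    """True when the error rate in the window is strictly above the threshold."""
    return error_rate(total_requests, server_errors) > threshold


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_rate / budget
```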
**Step 1 — Acknowledge and assess**
```bash
# Check current error rate
[query or dashboard link]

# Check which endpoints are erroring
[query or command]
```

**Step 2 — Check recent changes**
```bash
# Any deploys in the last hour?
[command or link to deployment log]

# Recent config changes?
[where to check]
```

**Step 3 — Check dependencies**
```bash
# Is the database healthy?
[health check command or link]

# Is [downstream service] healthy?
[health check command or link]
```

**Step 4 — Diagnose**

| If you see | It means | Do this |
|---|---|---|
| [Error pattern 1] | [Cause] | [Action] |
| [Error pattern 2] | [Cause] | [Action] |
| [Error pattern 3] | [Cause] | [Action] |
| No clear pattern | Unknown cause | Escalate to [name] |

**Step 5 — Fix or mitigate**
```bash
# If caused by bad deploy — roll back:
[rollback command]

# If caused by [specific issue]:
[fix command]

# If caused by upstream dependency:
[mitigation — e.g. enable circuit breaker, reduce traffic, etc.]
```

**After resolving:**
- [ ] Confirm error rate has returned to baseline
- [ ] Check no downstream services were affected
- [ ] If P1: open a post-incident review — see [incident-postmortem skill]
- [ ] Update `#incidents` with resolution summary

---
### ALERT: [Alert Name 2 — e.g. HighLatency]
|
||||
|
||||
**What it means:** [e.g. "P99 response time has exceeded 1s for more than 3 consecutive minutes"]
|
||||
**Severity:** P1 / P2 / P3
|
||||
**SLO impact:** Yes — latency SLO breach
|
||||
|
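The example condition above ("P99 response time has exceeded 1s for more than 3 consecutive minutes") can be sketched in Python as follows. This is a minimal illustration, not the alerting rule of any particular monitoring tool: the function names, the nearest-rank percentile method, and the reading of "more than 3 consecutive minutes" as at least four breaching minutes in a row are all assumptions.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n), avoids float rounding
    return ordered[rank - 1]


def latency_alert_firing(per_minute_samples, threshold_s=1.0, for_minutes=3):
    """True if P99 stays above the threshold for more than `for_minutes` minutes in a row.

    `per_minute_samples` is a list with one list of request latencies per minute,
    oldest first (hypothetical input shape).
    """
    streak = 0
    for samples in per_minute_samples:
        if samples and p99(samples) > threshold_s:
            streak += 1
            if streak > for_minutes:
                return True
        else:
            streak = 0  # a healthy or empty minute resets the count
    return False


slow_minute = [0.2] * 98 + [2.0] * 2   # 2% of requests take 2s, so P99 is 2.0s
fast_minute = [0.2] * 100
print(latency_alert_firing([slow_minute] * 4))                      # four breaching minutes
print(latency_alert_firing([slow_minute, fast_minute] * 3))         # never consecutive
```

Writing the condition down this way makes the "what it means" line testable: a single slow minute does not page anyone, while a sustained breach does.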
||||
**Step 1 — Assess scope**
|
||||
```bash
|
||||
# Check which endpoints are slow
|
||||
[query or dashboard — broken down by endpoint]
|
||||
|
||||
# Check if latency is across all regions or localised
|
||||
[query or command]
|
||||
```
|
||||
|
||||
**Step 2 — Common causes and fixes**
|
||||
|
||||
| Cause | Signal | Fix |
|
||||
|---|---|---|
|
||||
| Database slow queries | DB latency spike on dashboard | [Check slow query log: `command`] |
|
||||
| Cache miss storm | Cache hit rate drops on dashboard | [command or action] |
|
||||
| Memory pressure / GC | High memory on service dashboard | [command or action — e.g. restart, scale up] |
|
||||
| Upstream service slow | Trace shows time in external call | Escalate to [service] on-call |
|
||||
| Traffic spike | Request rate spike on dashboard | [Scale up: `command`] |
|
||||
|
||||
**Step 3 — Escalate if unresolved in 20 minutes**
|
||||
Page [Tech lead] via PagerDuty / Slack.
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 3 — e.g. DatabaseConnectionPoolExhausted]
|
||||
|
||||
**What it means:** [e.g. "The service has used all available database connections — new requests will fail"]
|
||||
**Severity:** P1
|
||||
**SLO impact:** Yes — will cause errors immediately
|
||||
|
||||
**Immediate mitigation:**
|
||||
```bash
|
||||
# Restart the service to flush stale connections
|
||||
[restart command]
|
||||
|
||||
# Check current connection count
|
||||
[DB connection query]
|
||||
```
|
||||
|
||||
**Diagnose root cause after stabilising:**
|
||||
```bash
|
||||
# Check for long-running queries holding connections
|
||||
[query]
|
||||
|
||||
# Check if a recent deploy changed connection pool config
|
||||
[where to check]
|
||||
```
|
||||
|
||||
**Resolution:** [e.g. "Increase pool size in config / kill long-running queries / scale the service"]
|
||||
|
||||
---
|
||||
|
||||
### ALERT: [Alert Name 4 — e.g. QueueBacklogHigh / ConsumerLag]
|
||||
|
||||
**What it means:** [e.g. "The message queue backlog exceeds 10,000 messages — consumers are not keeping up"]
|
||||
**Severity:** P2
|
||||
**SLO impact:** Depends — if queue backs up, downstream systems will receive delayed data
|
||||
|
||||
**Step 1 — Check consumer health**
|
||||
```bash
|
||||
# Are consumers running?
|
||||
[command]
|
||||
|
||||
# Consumer error rate?
|
||||
[dashboard or query]
|
||||
```
|
||||
|
||||
**Step 2 — Check message contents**
|
||||
```bash
|
||||
# Are there poison messages causing retries?
|
||||
[command to inspect dead-letter queue or failed messages]
|
||||
```
|
||||
|
||||
**Step 3 — Options**
|
||||
|
||||
| If | Then |
|
||||
|---|---|
|
||||
| Consumers are down | Restart consumers: `[command]` |
|
||||
| Poison message in queue | Move to DLQ: `[command]` |
|
||||
| Consumers healthy but slow | Scale consumers: `[command]` |
|
||||
| Upstream producing too fast | Escalate to [upstream service] owner |
|
||||
|
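When choosing between the options above, it can help to estimate how long the backlog will take to clear. The sketch below is a back-of-the-envelope calculation in Python; the function names and the example rates are hypothetical, and it assumes steady produce and consume rates, which real queues rarely have.

```python
import math


def minutes_to_drain(backlog, consume_per_min, produce_per_min):
    """Minutes until the backlog is empty, or None if it will never drain."""
    net_rate = consume_per_min - produce_per_min
    if net_rate <= 0:
        return None  # consumers are not outpacing producers; backlog holds or grows
    return backlog / net_rate


def consumers_needed(backlog, produce_per_min, per_consumer_per_min, target_minutes):
    """Consumers required to absorb new traffic and clear the backlog in time."""
    required_rate = produce_per_min + backlog / target_minutes
    return math.ceil(required_rate / per_consumer_per_min)


# 10,000 queued messages, consumers handle 1,500/min, producers add 1,000/min
print(minutes_to_drain(10_000, consume_per_min=1_500, produce_per_min=1_000))
# How many 500 msg/min consumers clear the same backlog within 10 minutes?
print(consumers_needed(10_000, produce_per_min=1_000,
                       per_consumer_per_min=500, target_minutes=10))
```

If `minutes_to_drain` returns `None`, scaling consumers or slowing the producer is the only way out; restarting healthy consumers will not help.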
||||
---
|
||||
|
||||
### ALERT: [Add additional alerts following the same pattern]
|
||||
|
||||
---
|
||||
|
||||
## Diagnostic Cheat Sheet
|
||||
|
||||
Common commands for quick diagnosis. Replace every placeholder with a real command that can be pasted and run without modification.
|
||||
|
||||
```bash
|
||||
# Service health
|
||||
[health check command]
|
||||
|
||||
# Recent logs (last 100 lines)
|
||||
[log command]
|
||||
|
||||
# Error logs only
|
||||
[error log filter command]
|
||||
|
||||
# Current pod / instance status
|
||||
[kubectl get pods / aws ecs describe-tasks / etc.]
|
||||
|
||||
# Restart the service
|
||||
[restart command]
|
||||
|
||||
# Roll back to previous version
|
||||
[rollback command]
|
||||
|
||||
# Database connection count
|
||||
[DB query]
|
||||
|
||||
# Cache hit rate
|
||||
[cache stats command]
|
||||
|
||||
# Current request rate
|
||||
[metrics query]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Useful Dashboard Links
|
||||
|
||||
| Dashboard | URL | Use it to |
|
||||
|---|---|---|
|
||||
| Service overview | [Link] | First stop — error rate, latency, request rate |
|
||||
| Database | [Link] | Connection count, slow queries, replication lag |
|
||||
| Infrastructure | [Link] | CPU, memory, disk |
|
||||
| Queue / consumers | [Link] | Backlog depth, consumer throughput |
|
||||
| Upstream dependencies | [Link] | Dependency health at a glance |
|
||||
|
||||
---
|
||||
|
||||
## Incident Communication
|
||||
|
||||
When you declare an incident:
|
||||
|
||||
**Post to `#incidents` immediately:**
|
||||
```
|
||||
🔴 INCIDENT — [Service Name]
|
||||
Status: Investigating
|
||||
Impact: [Who is affected and how]
|
||||
Paged: [Your name]
|
||||
Next update: [Time — max 30 min from now]
|
||||
```
|
||||
|
||||
**Update every 30 minutes while active:**
|
||||
```
|
||||
🔴 UPDATE — [Service Name] — [Time]
|
||||
Status: [Investigating / Identified / Mitigating / Resolved]
|
||||
Latest: [One sentence on what you found or did]
|
||||
Next update: [Time]
|
||||
```
|
||||
|
||||
**On resolution:**
|
||||
```
|
||||
✅ RESOLVED — [Service Name] — [Time]
|
||||
Duration: [X minutes]
|
||||
Impact: [Summary of who was affected]
|
||||
Cause: [One sentence]
|
||||
Follow-up: [PIR required? Yes/No — link when created]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## On-Call Handoff
|
||||
|
||||
Use this template at the end of every on-call shift:
|
||||
|
||||
```
|
||||
--- ON-CALL HANDOFF: [Service Name] ---
|
||||
Date: [Date]
|
||||
Outgoing: [Your name]
|
||||
Incoming: [Next on-call name]
|
||||
|
||||
INCIDENTS THIS SHIFT:
|
||||
- [Incident summary — date, duration, cause, resolution, follow-up required]
|
||||
|
||||
OPEN ISSUES TO WATCH:
|
||||
- [Anything not fully resolved / trending in the wrong direction]
|
||||
|
||||
CHANGES SINCE LAST HANDOFF:
|
||||
- [Deploys, config changes, infra changes that affect on-call awareness]
|
||||
|
||||
RUNBOOK GAPS FOUND:
|
||||
- [Anything you had to figure out that isn't documented — please add it]
|
||||
|
||||
ANYTHING ELSE:
|
||||
- [Notes for incoming on-call]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] Every alert that pages on-call has a runbook entry — no alert is missing
|
||||
- [ ] Rollback command is accurate and tested recently
|
||||
- [ ] Escalation contacts have current phone numbers and Slack handles
|
||||
- [ ] Diagnostic commands work — they have been run by at least one person recently
|
||||
- [ ] Handoff template is used at every shift change — not just during incidents
|
||||
- [ ] "Things I had to figure out that weren't documented" are added to this runbook after every incident
|
||||
@@ -0,0 +1,231 @@
|
||||
---
|
||||
name: slo-error-budget
|
||||
description: "Define Service Level Objectives (SLOs) and an error budget policy for a service. Use when asked to write SLOs, define SLIs, calculate an error budget, set reliability targets, or create an error budget policy. Produces a complete SLO document with SLI definitions, target calculation, error budget policy, burn rate alerts, and review cadence."
|
||||
---
|
||||
|
||||
# SLO and Error Budget Skill
|
||||
|
||||
Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
|
||||
|
||||
A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Ask for these if not already provided:
|
||||
- **Service name** and brief description of what it does
|
||||
- **Primary users** — who depends on this service and how
|
||||
- **User-facing interactions** to protect — e.g. API calls, page loads, transactions
|
||||
- **Current reliability data** — error rate, latency, uptime (last 30–90 days if available)
|
||||
- **Existing on-call setup** — who responds to alerts?
|
||||
- **Deployment frequency** — how often does the team ship?
|
||||
- **Any existing SLAs** with customers — these constrain SLO targets
|
||||
|
||||
## Key Definitions
|
||||
|
||||
Always establish these before writing the SLO:
|
||||
|
||||
| Term | Definition |
|
||||
|---|---|
|
||||
| **SLI** (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
|
||||
| **SLO** (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
|
||||
| **SLA** (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
|
||||
| **Error budget** | The allowed unreliability: 100% minus the SLO target, spent by every bad event that is not explicitly excluded |
|
||||
| **Burn rate** | How fast the error budget is being consumed |
|
||||
|
||||
---
|
||||
|
||||
## Output Format
|
||||
|
||||
---
|
||||
|
||||
# SLO Document: [Service Name]
|
||||
|
||||
**Service:** [Name] | **Team:** [Team name]
|
||||
**Owner:** [Name / role] | **Approved by:** [Name]
|
||||
**Effective date:** [Date] | **Review date:** [Date + 3 months]
|
||||
**Version:** [1.0]
|
||||
|
||||
---
|
||||
|
||||
## Why This SLO Exists
|
||||
|
||||
[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
|
||||
|
||||
---
|
||||
|
||||
## Service Overview
|
||||
|
||||
**What this service does:** [One sentence]
|
||||
**Who depends on it:** [Internal teams / external customers / both — describe]
|
||||
**Critical user journeys protected by this SLO:**
|
||||
1. [Journey 1 — e.g. "User completes a payment"]
|
||||
2. [Journey 2]
|
||||
3. [Journey 3]
|
||||
|
||||
---
|
||||
|
||||
## SLIs — What We Measure
|
||||
|
||||
Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
|
||||
|
||||
### SLI 1: [Name — e.g. Request Success Rate]
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| **What it measures** | [e.g. "% of API requests that return a non-5xx response"] |
|
||||
| **Good event definition** | [e.g. "HTTP response with any non-5xx status (2xx, 3xx, or 4xx)"] |
|
||||
| **Bad event definition** | [e.g. "HTTP response with status 5xx, or a request that times out without a response"] |
|
||||
| **Measurement source** | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
|
||||
| **Measured over** | Rolling 28-day window |
|
||||
| **Exclusions** | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
|
||||
|
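To make the good/bad split concrete, here is a minimal Python sketch that computes a success-rate SLI from request records, following the "what it measures" row (the share of requests with a non-5xx response). The `Request` type, the `/healthz` path, and the sample data are hypothetical; a real SLI would be computed by the measurement source, not by application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    status: int


def success_rate_sli(requests, excluded_paths=("/healthz",)):
    """Percentage of counted requests that returned a non-5xx status.

    Requests to excluded paths (e.g. health checks) are dropped before counting.
    Returns None when nothing is left to count.
    """
    counted = [r for r in requests if r.path not in excluded_paths]
    if not counted:
        return None
    good = sum(1 for r in counted if r.status < 500)
    return 100 * good / len(counted)


sample = [
    Request("/checkout", 200),
    Request("/checkout", 404),   # client error: still a good event for this SLI
    Request("/checkout", 503),   # server error: bad event
    Request("/healthz", 500),    # excluded, does not count either way
    Request("/orders", 200),
]
print(success_rate_sli(sample))
```

Note that the excluded health check would have lowered the number had it been counted, which is why exclusions need to be written down.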
||||
### SLI 2: [Name — e.g. Latency]
|
||||
|
||||
| Field | Detail |
|
||||
|---|---|
|
||||
| **What it measures** | [e.g. "% of requests to the /checkout endpoint that complete within 500ms (a 99% target is equivalent to P99 ≤ 500ms)"] |
|
||||
| **Good event definition** | [e.g. "Request completes in ≤500ms"] |
|
||||
| **Bad event definition** | [e.g. "Request takes >500ms"] |
|
||||
| **Measurement source** | [Source] |
|
||||
| **Measured over** | Rolling 28-day window |
|
||||
| **Exclusions** | [Any exclusions] |
|
||||
|
||||
### SLI 3: [Name — e.g. Data Freshness / Queue Depth / etc.]
|
||||
|
||||
[Same structure]
|
||||
|
||||
---
|
||||
|
||||
## SLO Targets
|
||||
|
||||
| SLI | Target | Window | Error Budget |
|
||||
|---|---|---|---|
|
||||
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
|
||||
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
|
||||
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28-day window] |
|
||||
|
||||
**How targets were set:**
|
||||
- Historical baseline (last 90 days): [X]%
|
||||
- Target is set [above / at] historical baseline to [improve reliability / reflect current reality while formalising the commitment]
|
||||
- Rationale: [1–2 sentences]
|
||||
|
||||
**Why 100% is NOT the target:** [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Calculation
|
||||
|
||||
**For SLI 1 ([Name]), at [X]% target:**
|
||||
|
||||
```
|
||||
Error budget = (100% - SLO target) × measurement window
|
||||
= (100% - [X]%) × 28 days × 24 hours × 60 minutes
|
||||
= [Y]% × [Z total minutes]
|
||||
= [N] minutes of allowed failure per 28-day window
|
||||
```
|
||||
|
||||
**In plain terms:** We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
|
||||
|
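As a worked instance of the formula above, the following Python sketch computes the budget for an illustrative 99.5% target. The function name and the example targets are assumptions, and the minutes reading fits time-based SLIs; for request-based SLIs the same fraction applies to the request count instead.

```python
def error_budget_minutes(slo_target_percent, window_days=28):
    """Minutes of allowed failure in the window for a given SLO target."""
    if not 0 < slo_target_percent < 100:
        raise ValueError("SLO target must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60      # 28 days = 40,320 minutes
    budget_fraction = (100 - slo_target_percent) / 100
    return budget_fraction * window_minutes


for target in (99.0, 99.5, 99.9):
    print(f"{target}% -> {error_budget_minutes(target):.1f} minutes per 28 days")
```

At 99.5% the budget is 0.5% of 40,320 minutes, which is 201.6 minutes (a little under 3.5 hours) per 28-day window.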
||||
---
|
||||
|
||||
## Burn Rate Alerts
|
||||
|
||||
Burn rate = how fast the error budget is being consumed relative to the budget window.
|
||||
A burn rate of 1 = consuming the budget at exactly the rate that would exhaust it over 28 days.
|
||||
|
||||
| Alert | Burn rate | Window | Severity | Response |
|
||||
|---|---|---|---|---|
|
||||
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately; at this rate the full 28-day budget is exhausted in <2 days |
|
||||
| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days |
|
||||
| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting |
|
||||
| Info | >1× | 28 days | Info | Log only; budget on track to exhaust before the end of the window |
|
||||
|
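The burn rate definition above can be turned into a short calculation. The Python sketch below is one way to do it, assuming a request-based SLI; the function names and the sample counts are hypothetical, and real alerts would normally be expressed in the monitoring tool's own query language.

```python
def burn_rate(bad_events, total_events, slo_target_percent):
    """Observed bad-event ratio divided by the ratio the SLO allows."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_ratio = (100 - slo_target_percent) / 100
    return (bad_events / total_events) / allowed_ratio


def days_to_exhaustion(rate, window_days=28):
    """Days until a full budget is spent if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")  # nothing is burning
    return window_days / rate


# 300 failed requests out of 10,000 in the alert window, against a 99.5% SLO
rate = burn_rate(bad_events=300, total_events=10_000, slo_target_percent=99.5)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.2f} days")
```

With these numbers the observed failure ratio is 3% against an allowed 0.5%, a burn rate of 6×, so a full budget lasts 28 / 6 ≈ 4.67 days. That matches the "<5 days" note on the >6× row.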
||||
**Alert implementation:** [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Policy
|
||||
|
||||
This policy defines what to do with the error budget — both when it's healthy and when it's burning.
|
||||
|
||||
### When budget is healthy (>50% remaining)
|
||||
|
||||
- Feature development and deployments proceed at normal pace
|
||||
- The team may take on riskier experiments
|
||||
- Reliability improvements are scheduled but not urgent
|
||||
|
||||
### When budget is at risk (25–50% remaining)
|
||||
|
||||
- Deployment frequency reduced — team ships only well-tested changes
|
||||
- One reliability improvement added to current sprint
|
||||
- Weekly error budget review added to team standup
|
||||
|
||||
### When budget is nearly exhausted (<25% remaining)
|
||||
|
||||
- Feature work paused in favour of reliability improvements
|
||||
- No new deployments without explicit on-call approval
|
||||
- Daily review of error budget burn rate
|
||||
- CSM / support notified to manage customer expectations
|
||||
|
||||
### When budget is exhausted (0% remaining — SLO breached)
|
||||
|
||||
- All feature work stops
|
||||
- On-call engineer and engineering manager notified immediately
|
||||
- Post-incident review (PIR) required within 5 business days
|
||||
- SLO target may be temporarily relaxed (with stakeholder approval) while root cause is addressed
|
||||
|
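The four tiers above can be expressed as a simple lookup so that dashboards or bots report the same state the policy describes. This Python sketch is illustrative only: the function names are hypothetical, and treating exactly 25% and exactly 50% as "at risk" is an assumption about the boundaries the policy leaves implicit.

```python
def budget_remaining_percent(budget_minutes, consumed_minutes):
    """Share of the error budget still unspent, as a percentage."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return 100 * (budget_minutes - consumed_minutes) / budget_minutes


def budget_policy_tier(remaining_percent):
    """Map remaining budget to the policy tier that applies."""
    if remaining_percent <= 0:
        return "exhausted"          # SLO breached: all feature work stops
    if remaining_percent < 25:
        return "nearly exhausted"   # feature work paused for reliability
    if remaining_percent <= 50:
        return "at risk"            # ship only well-tested changes
    return "healthy"                # normal pace


remaining = budget_remaining_percent(budget_minutes=201.6, consumed_minutes=160)
print(f"{remaining:.1f}% remaining -> {budget_policy_tier(remaining)}")
```

Here 160 of 201.6 budget minutes are spent, leaving about 20.6%, which puts the service in the "nearly exhausted" tier.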
||||
---
|
||||
|
||||
## Dashboard and Reporting
|
||||
|
||||
**SLO dashboard:** [Link to Datadog / Grafana / etc. dashboard]
|
||||
|
||||
**Metrics exposed:**
|
||||
- Current SLO compliance (rolling 28-day)
|
||||
- Error budget remaining (% and minutes)
|
||||
- Burn rate (current and trend)
|
||||
- Incident count and MTTR this window
|
||||
|
||||
**Reporting cadence:**
|
||||
|
||||
| Audience | Frequency | Format |
|
||||
|---|---|---|
|
||||
| Engineering team | Weekly | Slack summary — #[service]-slo |
|
||||
| Engineering manager | Monthly | SLO review meeting |
|
||||
| Stakeholders / customers | Quarterly | SLO compliance summary |
|
||||
|
||||
---
|
||||
|
||||
## Exclusions and Edge Cases
|
||||
|
||||
**Planned maintenance:** Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
|
||||
|
||||
**Dependency failures:** If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
|
||||
|
||||
**Force majeure:** [Policy for cloud provider outages, major infrastructure events]
|
||||
|
||||
---
|
||||
|
||||
## SLO Review Cadence
|
||||
|
||||
| Review | When | Who | Output |
|
||||
|---|---|---|---|
|
||||
| Error budget review | Weekly | Team | Budget health check — adjust if burning fast |
|
||||
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
|
||||
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? |
|
||||
|
||||
**When to change the SLO target:**
|
||||
- Historical baseline has improved significantly and target no longer reflects real reliability
|
||||
- User feedback indicates the target is misaligned with what users actually experience
|
||||
- The SLO is being gamed (metric is healthy but users are unhappy)
|
||||
|
||||
---
|
||||
|
||||
## Quality Checks
|
||||
|
||||
- [ ] SLIs are user-facing — they measure what users experience, not internal system metrics
|
||||
- [ ] Good and bad events are precisely defined — no ambiguity about what counts
|
||||
- [ ] Targets are based on historical data, not aspirational round numbers
|
||||
- [ ] Error budget policy has clear triggers and clear actions — not "discuss as a team"
|
||||
- [ ] Burn rate alerts have different windows to catch both fast burns and slow burns
|
||||
- [ ] Exclusions are documented so they don't silently inflate the SLO number
|
||||