---
description: "Write a CI/CD pipeline playbook for a service or team. Use when asked to document a CI/CD pipeline, write a deployment process, define release gates, document build and test stages, or create a deployment guide. Produces a structured playbook covering pipeline stages, environment definitions, deployment gates, rollback procedures, and on-call responsibilities."
globs:
alwaysApply: false
---

# CI/CD Playbook Skill

Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.

A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.

## Required Inputs

Ask for these if not already provided:
- **Service name** and brief description
- **Tech stack** — language, framework, containerisation (Docker, etc.)
- **Source control** — GitHub / GitLab / Bitbucket, branching strategy
- **CI platform** — GitHub Actions / CircleCI / Jenkins / Buildkite / other
- **CD platform / deployment target** — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
- **Environments** — e.g. dev, staging, production (and any canary / feature environments)
- **Deployment frequency** — how often does the team ship?
- **Any existing gates** — manual approvals, smoke tests, feature flags
- **On-call setup** — who's responsible during deploys?

## Output Format

---

# CI/CD Playbook: [Service Name]

**Service:** [Name] | **Team:** [Team name]
**Last updated:** [Date] | **Owner:** [Name / role]
**Pipeline platform:** [CI tool] → [CD tool / platform]

---

## Overview

[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]

**Deployment frequency:** [Multiple times per day / Daily / Weekly / On-demand]
**Average pipeline duration:** [X minutes]
**Rollback time (p95):** [X minutes]

---

## Pipeline Stages

```
[Branch push]
      │
      ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
      │
      ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
      │
      ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
      │
      ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
      │
      ▼
[5. Build Artefact / Container Image]
      │
      ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
      │
      ▼
[7. Smoke Tests (Staging)]
      │
      ▼
[8. Manual Approval Gate] ──(if required)
      │
      ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
      │
      ▼
[10. Post-deploy checks]
```

---

## Stage Definitions

### Stage 1 — Build & Lint

**What runs:** [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
**Trigger:** Every commit to any branch
**Blocking:** Yes — PR cannot be merged if this fails
**Typical duration:** [X minutes]
**Owner if it fails:** PR author

**Common failure causes:**
- [e.g. Missing dependency — run `npm install` locally before pushing]
- [e.g. Lint rule violation — run `npm run lint -- --fix` to auto-fix most issues]

---

### Stage 2 — Unit Tests

**What runs:** [Test command — e.g. `npm test`, `go test ./...`, `pytest`]
**Coverage gate:** [X]% minimum — pipeline fails below this threshold
**Trigger:** Every commit
**Blocking:** Yes
**Typical duration:** [X minutes]

**Coverage report:** [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
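
A minimal sketch of how the coverage gate might be enforced, assuming the measured coverage is already available as a percentage. The function names and the example threshold are hypothetical; most coverage tools ship their own fail-under option, which is usually the simpler choice.

```python
def coverage_gate(measured_percent: float, minimum_percent: float) -> bool:
    """Return True when measured coverage meets or exceeds the minimum."""
    return measured_percent >= minimum_percent


def gate_exit_code(measured_percent: float, minimum_percent: float) -> int:
    """Map the gate result to a process exit code: 0 passes, 1 fails the stage."""
    return 0 if coverage_gate(measured_percent, minimum_percent) else 1


if __name__ == "__main__":
    # Illustrative values only: 78.4% measured against an 80% minimum.
    raise SystemExit(gate_exit_code(78.4, 80.0))
```

Returning a non-zero exit code is what makes the stage blocking on most CI platforms.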

---

### Stage 3 — Integration Tests

**What runs:** [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
**Environment:** [Ephemeral test environment / shared test DB / etc.]
**Trigger:** Every commit to `main` and feature branches targeting `main`
**Blocking:** Yes
**Typical duration:** [X minutes]

**If slow:** [e.g. "Integration tests can be skipped locally with `SKIP_INTEGRATION=true` — never skip in CI"]

---

### Stage 4 — Security Scan

**Tools:** [e.g. Snyk, Trivy, OWASP Dependency-Check, Semgrep]
**What it checks:** [Dependency vulnerabilities / SAST / secrets detection — list what applies]
**Blocking on:** Critical and High severity findings
**Non-blocking on:** Medium and Low (flagged, not blocking)
**Trigger:** Every commit to `main`

**How to handle a flagged vulnerability:**
1. Check if a fix is available — upgrade the dependency
2. If no fix is available, open a security ticket and add a suppression with justification
3. Never suppress without a ticket and owner
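
The blocking policy and the suppression rule can be sketched as follows. This is an illustration under assumptions, not any scanner's real API: `Finding`, `Suppression`, and `triage` are hypothetical names, and a suppression only counts when it carries both a ticket and an owner.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BLOCKING_SEVERITIES = {"critical", "high"}


@dataclass(frozen=True)
class Finding:
    finding_id: str
    severity: str  # "critical", "high", "medium" or "low"


@dataclass(frozen=True)
class Suppression:
    ticket: str
    owner: str


def is_valid_suppression(suppression: Suppression) -> bool:
    """A suppression is honoured only with a linked ticket and a named owner."""
    return bool(suppression.ticket.strip()) and bool(suppression.owner.strip())


def triage(
    findings: List[Finding], suppressions: Dict[str, Suppression]
) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into (blocking, warnings), skipping validly suppressed ones."""
    blocking: List[Finding] = []
    warnings: List[Finding] = []
    for finding in findings:
        suppression = suppressions.get(finding.finding_id)
        if suppression is not None and is_valid_suppression(suppression):
            continue
        if finding.severity.lower() in BLOCKING_SEVERITIES:
            blocking.append(finding)
        else:
            warnings.append(finding)
    return blocking, warnings
```

A pipeline step would fail when the first list is non-empty and only report the second.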

---

### Stage 5 — Build Artefact

**What is produced:** [Docker image / binary / zip — be specific]
**Registry:** [ECR / GCR / Docker Hub / Artifactory — URL]
**Tagging convention:** `[service-name]:[git-sha]` (also tagged `:latest` on `main`)
**Trigger:** Commits to `main` only (not feature branches)
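
As an illustration of the tagging convention and trigger above, the following sketch computes the tags for a build. The function name `image_tags` is hypothetical, and returning an empty list for non-`main` branches is an assumption that mirrors the "`main` only" trigger.

```python
from typing import List


def image_tags(service_name: str, git_sha: str, branch: str) -> List[str]:
    """Return the image tags to push for this build.

    Artefacts are built from main only, so other branches get no tags.
    """
    if branch != "main":
        return []
    return [f"{service_name}:{git_sha}", f"{service_name}:latest"]
```

Deploying by the SHA tag rather than `:latest` keeps rollbacks unambiguous, because every deployed version maps to exactly one commit.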

---

### Stage 6 — Deploy to Staging

**Deployment method:** [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
**Staging URL:** [URL]
**Trigger:** Automatic on successful artefact build from `main`
**Who can deploy to staging:** Any engineer (automatic)

**Environment variables:** Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
**Staging is not production:** [Any differences in config, scale, or data — state them here]

---

### Stage 7 — Smoke Tests (Staging)

**What runs:** [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
**Tool:** [e.g. Playwright / Postman / custom script]
**Pass criteria:** All smoke tests pass within [X seconds] timeout
**Blocking:** Yes — production deploy will not proceed if smoke tests fail

**Smoke test suite location:** [Link to test files or folder]

---

### Stage 8 — Manual Approval Gate

**Required for:** [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
**Who can approve:** [e.g. Any engineer on the team / Lead engineer / On-call engineer]
**Approval timeout:** [e.g. 24 hours — auto-cancelled if no approval]
**How to approve:** [GitHub Actions approve step / Slack command / other — with link]

**When to withhold approval:**
- Active incident in production
- Deploy is outside the deployment window (see below)
- On-call engineer has not been notified

---

### Stage 9 — Deploy to Production

**Deployment method:** [Same as staging or different — specify]
**Deployment window:** [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
**Canary / progressive rollout:** [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
**Deployment notifications:** [Slack channel — #deployments]

**Who is on-call during deploy:** Deploying engineer is responsible until post-deploy checks pass.
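
A deployment window like the example above (Monday to Thursday, 09:00 to 16:00 UTC) can be checked mechanically. This is a sketch under assumptions: the constants and function name are hypothetical, the end of the window is treated as exclusive, and bank holidays are not modelled.

```python
from datetime import datetime, time, timezone

ALLOWED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
WINDOW_START = time(9, 0)        # 09:00 UTC, inclusive
WINDOW_END = time(16, 0)         # 16:00 UTC, exclusive (assumption)


def in_deployment_window(moment: datetime) -> bool:
    """Return True if a timezone-aware datetime falls inside the window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    return (
        utc_moment.weekday() in ALLOWED_WEEKDAYS
        and WINDOW_START <= utc_moment.time() < WINDOW_END
    )
```

A pipeline could call this with `datetime.now(timezone.utc)` before the approval step and refuse to proceed outside the window unless the deploy is a hotfix.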

---

### Stage 10 — Post-Deploy Checks

**Automated checks (run for [X minutes] after deploy):**
- [ ] Error rate: <[X]% (baseline: [Y]%)
- [ ] P99 latency: <[X]ms (baseline: [Y]ms)
- [ ] [Key business metric]: within [X]% of baseline

**Where to watch:** [Datadog / Grafana / CloudWatch dashboard — link]

**If a check fails:** See Rollback Procedure below.
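
The three automated checks can be evaluated together, as in this sketch. It assumes the metrics have already been fetched from the monitoring system; `failed_checks` and its parameter names are hypothetical, and the thresholds are whatever values the playbook fills in for [X].

```python
from typing import List


def failed_checks(
    error_rate_pct: float,
    p99_latency_ms: float,
    business_metric: float,
    *,
    max_error_rate_pct: float,
    max_p99_latency_ms: float,
    business_baseline: float,
    business_tolerance_pct: float,
) -> List[str]:
    """Return the names of post-deploy checks that failed (empty means healthy)."""
    failures: List[str] = []
    # Error rate and latency must stay strictly below their thresholds.
    if error_rate_pct >= max_error_rate_pct:
        failures.append("error_rate")
    if p99_latency_ms >= max_p99_latency_ms:
        failures.append("p99_latency")
    # The business metric must stay within a percentage band around baseline.
    allowed_deviation = abs(business_baseline) * business_tolerance_pct / 100.0
    if abs(business_metric - business_baseline) > allowed_deviation:
        failures.append("business_metric")
    return failures
```

A non-empty result during the observation window is the signal to start the rollback procedure.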

---

## Environments

| Environment | Purpose | Deploy trigger | URL | Data |
|---|---|---|---|---|
| **Dev** | Local development | Manual | localhost | Seeded test data |
| **Staging** | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
| **Production** | Live traffic | Manual approval | [URL] | Live data |

---

## Branching Strategy

**Model:** [Trunk-based / GitFlow / GitHub Flow — describe briefly]

| Branch | Purpose | Who merges | Deploy target |
|---|---|---|---|
| `main` | Production-ready code | PR + review | Staging → Production |
| `feature/*` | Feature development | Author | None (CI only) |
| `hotfix/*` | Critical production fixes | Lead engineer | Can bypass staging gate with approval |

**Hotfix process:** [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]

---

## Rollback Procedure

**Automated rollback:** [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]

**Manual rollback steps:**
```bash
# 1. Identify the last known good image tag
[command to list recent deployments]

# 2. Deploy the previous version
[deployment command with previous tag]

# 3. Confirm rollback is live
[smoke test command or health check URL]

# 4. Notify the team
[Slack command or template]
```

**Rollback decision authority:** Any engineer on-call can initiate a rollback without waiting for approval.

**After a rollback:**
1. Create a post-deploy incident report (see [incident-postmortem skill])
2. Do not re-deploy the same commit without fixing the root cause
3. Notify [stakeholder / support team] of the rollback and expected fix timeline

---

## Secrets and Configuration Management

**Secret store:** [Vault / AWS SSM / GitHub Secrets / Doppler — specify]
**How to add a new secret:**
1. [Step 1]
2. [Step 2]

**Who has access:** [Role or team]
**Rotation policy:** [How often secrets are rotated and who owns it]

**Never do:** Commit secrets to source control, even in `.env` files. The pipeline includes secret scanning (Stage 4) which will flag this.

---

## Common Failures and Fixes

| Failure | Likely cause | Fix |
|---|---|---|
| Build fails with "module not found" | Dependency not installed | Run `[install command]` and commit the updated lock file |
| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
| Production deploy stuck at approval | Approver not notified | Tag `@[on-call handle]` in `#deployments` |
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike lasts more than 5 minutes |

---

## On-Call Responsibilities During Deploy

- The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
- If you cannot monitor after deploying, hand off explicitly to another engineer in `#deployments`
- Outside business hours, deploy hotfixes only, and always page the on-call engineer before deploying

---

## Anti-Patterns

- [ ] Do not describe a rollback procedure that has never been tested — a theoretical rollback is not a rollback plan; test it in staging before production
- [ ] Do not allow deploys on Fridays or before holidays without an explicit on-call engineer who will monitor through the weekend
- [ ] Do not commit secrets to source control even in non-production branches — secret scanning in the pipeline catches this, but prevention is the standard
- [ ] Do not skip post-deploy monitoring after a production deploy — the deploying engineer must watch error rates and latency for the specified observation window
- [ ] Do not suppress a security scan finding without a linked ticket and a named owner — suppressions without accountability accumulate into unmanaged risk

## Quality Checks

- [ ] Every stage has a clear owner when it fails
- [ ] Rollback procedure is tested — not theoretical
- [ ] Secrets management section names the actual tool used (not "use secrets management")
- [ ] Deployment window is specific — not "during business hours"
- [ ] Post-deploy check thresholds are calibrated to actual baseline metrics