---
name: rfc-writer
description: "Write an engineering RFC (Request for Comments) for a technical decision, architectural change, or significant implementation approach. Use when asked to write an RFC, document a technical proposal, create a design doc, write an architecture decision for review, or produce a technical specification for team feedback. Produces a complete RFC document covering problem statement, motivation, proposed solution, alternatives rejected, implementation plan, migration plan, security and performance implications, observability changes, rollout plan, and open questions."
---

# RFC Writer Skill

Produce a complete engineering RFC (Request for Comments) for a technical decision or architectural change. An RFC is a structured proposal document — not a persuasion document. Its purpose is to expose a decision to scrutiny, surface trade-offs, document alternatives considered, and create a permanent record of why a choice was made.

A good RFC makes it possible for someone who wasn't in the room to understand years later why the team built something the way they did.

## Required Inputs

Ask for these if not already provided:
- **RFC title and author** — what this RFC is about and who is proposing it
- **Problem being solved** — what is broken, missing, or inadequate today; why action is needed now
- **Proposed solution** — the approach the author is recommending, at least at a high level
- **Context and constraints** — team size, existing architecture, timeline pressures, budget limits, compliance requirements
- **Alternatives considered** — at least 2 alternative approaches the author has thought about
- **Current status** — is this pre-decision (seeking feedback) or post-decision (documenting a made decision)?

## Output Format

---

# RFC [Number]: [Title]

**Author:** [Name] | **Team:** [Team name]
**Created:** [Date] | **Last updated:** [Date]
**Status:** Draft | In Review | Approved | Rejected | Superseded by RFC-[X]
**Ticket:** [JIRA-XXX] | **Slack thread:** [#channel link]
**Review deadline:** [Date — when comments should be submitted by]

---

## Abstract

[2–4 sentences summarising the entire RFC. Should stand alone — someone reading only this should understand what is being proposed, why, and what the main trade-off is. Write this last.]

---

## 1. Problem Statement

[Describe the problem being solved. Focus on the *problem*, not the solution. Be specific and quantified where possible.]

**Current state:**
[Describe how things work today — the existing system, process, or architecture. Include any relevant constraints or limitations.]

**Why this is a problem now:**
[Why is this being addressed now rather than earlier or later? Reference metrics, incidents, product requirements, or scaling thresholds that make this urgent or timely.]

**Example of the problem in practice:**
[A concrete scenario or incident that illustrates the problem. This helps reviewers understand the real-world impact, not just the abstract description.]

```
// Example: current behaviour that illustrates the problem
[code snippet, log output, or sequence description showing the problem]
```

**Impact of not solving this:**
- [Impact 1 — e.g. "New tenant onboarding requires 3 hours of manual configuration per account"]
- [Impact 2 — e.g. "Auth service handles 400 req/s; projected to hit capacity within 8 weeks at current growth"]
- [Impact 3 — e.g. "Current approach is incompatible with the upcoming multi-region requirement"]

---

## 2. Goals and Non-Goals

**Goals:**
- [ ] [Specific, measurable outcome — e.g. "Reduce tenant onboarding time from 3 hours to <5 minutes"]
- [ ] [e.g. "Support 2,000 req/s on the auth service with P99 latency ≤50ms"]
- [ ] [e.g. "Enable multi-region deployment without changes to the application layer"]

**Non-goals:** *(what this RFC explicitly does not address)*
- [e.g. "This RFC does not address authentication for internal service-to-service calls — see RFC-042"]
- [e.g. "Performance improvements to the existing system — this RFC replaces it"]
- [e.g. "Migration of historical data — covered in a follow-on RFC"]

**Success metrics:**
| Metric | Current | Target | Measurement method |
|---|---|---|---|
| [e.g. Onboarding time] | [3 hours] | [<5 minutes] | [Prometheus histogram on onboarding job duration] |
| [e.g. Auth latency P99] | [120ms] | [≤50ms] | [Datadog APM] |
| [e.g. Engineer setup time] | [4 hours] | [<30 minutes] | [Onboarding survey] |

---

## 3. Background and Motivation

[Provide the context a reviewer needs to evaluate the proposal. This is not a repeat of the problem statement — it is the surrounding technical and business context.]

**Existing system overview:**
[Describe the relevant parts of the current architecture. Include an ASCII diagram if the relationships between components help understanding.]

```
[ASCII diagram of current architecture — optional but strongly recommended for architectural RFCs]

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│  Client  │────▶│ [Service A]  │────▶│ [Service B]  │
└──────────┘     └──────────────┘     └──────────────┘
                        │
                        ▼
                 ┌──────────────┐
                 │  [Database]  │
                 └──────────────┘
```

**Prior work and related decisions:**
- [RFC-XXX: Title — relevant previous decision; link]
- [ADR-XXX: Title — architectural decision record]
- [Any external standards, blog posts, or vendor documentation that informs this proposal]

**Constraints:**
- [e.g. Must remain backward compatible with v1 API clients for 12 months]
- [e.g. Team has no Rust expertise — solution must be in Python or Go]
- [e.g. Must be deployable without a maintenance window]

---

## 4. Proposed Solution

[Describe the proposed approach clearly and specifically. Include enough detail that an engineer could begin implementing from this document, but don't write the code — that is for the PR.]

### 4.1 High-Level Approach

[1–3 paragraphs describing the overall solution. Explain the key idea and why it solves the problem.]

### 4.2 Architecture

```
[ASCII diagram of the proposed architecture — what the system looks like after this RFC is implemented]

┌──────────┐     ┌──────────────────┐     ┌──────────────┐
│  Client  │────▶│ [New Component]  │────▶│ [Service B]  │
└──────────┘     └──────────────────┘     └──────────────┘
                          │                      │
                          ▼                      ▼
                   ┌──────────────┐       ┌──────────────┐
                   │  [Store A]   │       │  [Store B]   │
                   └──────────────┘       └──────────────┘
```

### 4.3 Detailed Design

[Break the solution into its key components or decisions. For each, explain what it does and why it was designed this way.]

**Component / Decision 1: [Name]**

[Description of this component — what it does, how it works, why this approach was chosen.]

```
// Example interface, API contract, or pseudocode (not implementation code)
[Relevant schema, API definition, data flow, or pseudocode]
```

**Component / Decision 2: [Name]**

[Description]

**Component / Decision 3: [Name]**

[Description]

### 4.4 API Changes

*Complete this section if the RFC introduces or modifies any API endpoints, events, or interfaces.*

**New endpoints / events:**
```
[HTTP method + path or event name]
Request: { ... }
Response: { ... }
```

**Modified endpoints:**
- `[endpoint]`: [what changes and why; backward compatibility note]

**Deprecated endpoints:**
- `[endpoint]`: deprecated in favour of `[new endpoint]` — removal timeline: [date/version]

### 4.5 Data Model Changes

*Complete this section if any database schema or data structure changes are required.*

[Describe schema changes at a high level. Reference the database-migration-plan skill for detailed migration steps.]

```sql
-- Key schema changes (abbreviated — full migration in [link])
[DDL statements for key additions/changes]
```

---

## 5. Alternatives Considered

*Every alternative must include an explicit reason why it was rejected. "We went with the proposed solution" is not a reason.*

### Alternative 1: [Name]

**Description:**
[What this alternative would involve.]

**Pros:**
- [Pro 1]
- [Pro 2]

**Cons:**
- [Con 1]
- [Con 2]

**Why rejected:**
[Specific reason — e.g. "Requires 3× the infrastructure cost", "Incompatible with multi-region requirement", "Team has no expertise in this technology and the ramp-up would miss the Q3 deadline"]

---

### Alternative 2: [Name]

**Description:**
[What this alternative would involve.]

**Pros:**
- [Pro 1]
- [Pro 2]

**Cons:**
- [Con 1]
- [Con 2]

**Why rejected:**
[Specific reason]

---

### Alternative 3: Do nothing / defer

**Description:**
Accept the current state and revisit the problem in [timeframe].

**Why rejected:**
[Why deferring is not acceptable — reference the impact of not solving this from Section 1.]

---

## 6. Implementation Plan

**Estimated effort:** [X engineer-weeks] | **Target completion:** [Date / Quarter]
**Team:** [Who is building this — names or roles]

| Phase | Description | Duration | Dependencies | Owner |
|---|---|---|---|---|
| 1 | [e.g. Core implementation — new component built and tested] | [X weeks] | [None] | [Name] |
| 2 | [e.g. Integration — connect new component to existing services] | [X weeks] | [Phase 1 complete] | [Name] |
| 3 | [e.g. Rollout — canary deploy, then full rollout] | [X weeks] | [Phase 2 + staging validated] | [Name] |
| 4 | [e.g. Cleanup — deprecate old system, remove feature flags] | [X weeks] | [Phase 3 stable for X weeks] | [Name] |

**Key milestones:**
- [ ] [Date]: [Milestone — e.g. "Core implementation complete and code-reviewed"]
- [ ] [Date]: [Milestone — e.g. "Staging environment validation complete"]
- [ ] [Date]: [Milestone — e.g. "10% canary traffic without regression"]
- [ ] [Date]: [Milestone — e.g. "Full rollout complete"]
- [ ] [Date]: [Milestone — e.g. "Old system decommissioned"]

---

## 7. Migration Plan

*Complete this section if the RFC requires migrating existing users, data, or API consumers.*

**Migration strategy:** [Big-bang / Phased / Parallel-run / Opt-in]

**Who is affected:**
- [e.g. All existing API v1 consumers — requires updated client libraries]
- [e.g. X million rows in the `orders` table require backfilling]

**Migration steps:**
1. [Step 1 — describe action, who does it, estimated duration]
2. [Step 2]
3. [Step 3]

**Backward compatibility window:** [How long will the old system/API remain available?]

**Communication plan:**
- [Who needs to be notified, when, and how — e.g. "API consumers will receive a deprecation notice 3 months before the old endpoint is removed"]

---

## 8. Security Implications

[Describe the security impact of this change. If there are no security implications, state that explicitly with reasoning — do not leave this section blank.]

| Concern | Impact | Mitigation |
|---|---|---|
| [e.g. New API endpoint exposed to internet] | [e.g. New attack surface] | [e.g. Rate limiting, auth required, WAF rules] |
| [e.g. New data stored — user PII] | [e.g. GDPR scope expanded] | [e.g. Encrypted at rest, access log, data retention policy] |
| [e.g. Service-to-service communication] | [e.g. Token forgery risk] | [e.g. mTLS between services] |

**Has a threat model been produced or updated?** [Yes — link / No — required before implementation / Not required — reason]

---

## 9. Performance Implications

[Describe the expected performance impact. Include projections for the new system and how it was estimated.]

| Metric | Current | Projected | Measurement method |
|---|---|---|---|
| [e.g. P99 latency — /api/auth] | [120ms] | [≤50ms] | [Load test results — link] |
| [e.g. Database query count per request] | [12] | [3] | [Query logging in staging] |
| [e.g. Memory per instance] | [512MB] | [768MB] | [Profiling — link] |
| [e.g. Infrastructure cost] | [$X/month] | [$Y/month] | [AWS cost calculator estimate] |

**Load testing:** [Has load testing been done? Link to results. If not, when will it be done?]

**Performance risks:**
- [Risk 1 — e.g. "New component adds a network hop that may increase tail latency under congestion — needs validation at 2× peak load"]

---

## 10. Observability Changes

*Describe what new or changed metrics, logs, traces, and alerts this RFC introduces.*

**New metrics:**
| Metric name | Type | Description | Alert threshold |
|---|---|---|---|
| `[service].[component].[metric]` | [counter/gauge/histogram] | [What it measures] | [e.g. P99 > 100ms for 5 min] |

**New log events:**
| Event | Level | When emitted | Key fields |
|---|---|---|---|
| `[event.name]` | INFO | [When] | `user_id`, `duration_ms`, `result` |

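
The key fields in the log events table map naturally onto structured (JSON) log lines. The following is a minimal sketch using only the Python standard library. The helper names `format_event` and `log_event` and the event name `onboarding.completed` are hypothetical and chosen for illustration; only the field names `user_id`, `duration_ms`, and `result` come from the table row.

```python
import json
import logging
import sys
from datetime import datetime, timezone


def format_event(event: str, level: str, **fields) -> str:
    """Serialise one log event as a single JSON line (hypothetical helper)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "level": level,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def log_event(logger: logging.Logger, event: str, **fields) -> None:
    """Emit the event at INFO, matching the Level column of the table row."""
    logger.info(format_event(event, "INFO", **fields))


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("rfc.example")
    # "onboarding.completed" is an illustrative event name, not one defined by this template.
    log_event(log, "onboarding.completed", user_id="u-123", duration_ms=842, result="success")
```

Listing the key fields in the RFC before implementation lets reviewers check that dashboards and alerts can be built from the events as specified.
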

**Distributed tracing:** [Are spans added for new components? Which operations are instrumented?]

**Dashboard changes:** [New dashboard / updated existing dashboard — link]

---

## 11. Rollout Plan

**Rollout strategy:** [Feature flag / Canary / Blue-green / Gradual traffic shift / Full deploy]

| Stage | Traffic % | Duration | Success criteria | Rollback trigger |
|---|---|---|---|---|
| Internal testing | 0% (dogfood) | [X days] | [No errors in internal usage] | Any error |
| Canary | 1% | [X hours] | [Error rate <0.1%; P99 latency within budget] | Error rate >0.5% |
| Limited rollout | 10% | [X days] | [As above + business metrics stable] | Error rate >0.2% |
| Full rollout | 100% | — | [All success metrics from Section 2 met] | Any SLO breach |

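
The success criteria and rollback triggers in the stage table can be written down as an explicit gate, which makes the promote/rollback decision reviewable. The following is a minimal sketch, not a definitive implementation. The names `Stage`, `STAGES`, and `evaluate` are hypothetical, the thresholds are copied from the example rows of the table, and the 50 ms latency budget is an assumed value. The "business metrics stable" criterion is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    promote_below_error_rate: float   # success criterion, as a fraction (0.001 == 0.1%)
    rollback_above_error_rate: float  # rollback trigger, as a fraction


# Thresholds copied from the example table rows; replace with the RFC's real values.
STAGES = [
    Stage("canary", 1, 0.001, 0.005),
    Stage("limited", 10, 0.001, 0.002),
]


def evaluate(stage: Stage, error_rate: float, p99_ms: float, p99_budget_ms: float) -> str:
    """Return "rollback", "promote", or "hold" for one observation window."""
    if error_rate > stage.rollback_above_error_rate:
        return "rollback"
    if error_rate < stage.promote_below_error_rate and p99_ms <= p99_budget_ms:
        return "promote"
    # Between the two thresholds, or over the latency budget: stay at this stage.
    return "hold"


if __name__ == "__main__":
    canary = STAGES[0]
    # p99_budget_ms=50.0 is an assumed latency budget for illustration.
    print(evaluate(canary, error_rate=0.0004, p99_ms=42.0, p99_budget_ms=50.0))  # promote
    print(evaluate(canary, error_rate=0.0030, p99_ms=42.0, p99_budget_ms=50.0))  # hold
    print(evaluate(canary, error_rate=0.0080, p99_ms=42.0, p99_budget_ms=50.0))  # rollback
```

The "hold" outcome covers the gap between the success criterion and the rollback trigger, which the table leaves implicit.
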

**Feature flag:** [Name of feature flag, if applicable] — managed in [LaunchDarkly / Unleash / config]

**Rollback procedure:**
```
// How to roll back if the rollout needs to be reversed
1. [Step 1 — e.g. Toggle feature flag to off]
2. [Step 2 — e.g. Deploy previous version]
3. [Step 3 — e.g. Notify stakeholders]
```

---

## 12. Open Questions

[List any unresolved questions, design decisions not yet made, or areas where the author is specifically seeking feedback. Assign an owner and a resolution deadline for each.]

| # | Question | Owner | Deadline | Resolution |
|---|---|---|---|---|
| 1 | [e.g. Should we use optimistic or pessimistic locking for concurrent updates to [resource]?] | [Name] | [Date] | [Pending / [Answer]] |
| 2 | [e.g. What is the retention policy for [new data type]?] | [Name] | [Date] | [Pending / [Answer]] |
| 3 | [e.g. Do we need a read replica for this query pattern at launch, or can we defer it?] | [Name] | [Date] | [Pending / [Answer]] |

---

## 13. Decision

*To be filled in after the review period closes.*

**Decision:** [Approved / Rejected / Approved with modifications]
**Decision date:** [Date]
**Decision makers:** [Names]

**Summary of key feedback addressed:**
- [Feedback item and how it was resolved]

**Conditions of approval (if any):**
- [e.g. Must complete load testing before Phase 2 begins]

---

## Quality Checks

- [ ] The problem statement is specific and quantified — not "the current system is slow" but "P99 latency is 800ms; budget is 200ms"
- [ ] Goals section includes measurable success metrics, not aspirational statements
- [ ] Every alternative has an explicit rejection reason — not just a list of cons
- [ ] Security implications section is completed, not left blank
- [ ] Performance implications include projected numbers, not just "should be better"
- [ ] Open questions are assigned to named owners with deadlines — not floating
- [ ] The RFC is written to be read by someone who was not in the planning conversations
- [ ] Migration plan addresses all affected parties — users, API consumers, data — not just the technical steps
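
Some of these checks are structural and can be partly automated. The following is a minimal sketch, assuming the RFC is stored as Markdown and uses the headings from this template. The names `REQUIRED_SECTIONS` and `lint_rfc` are hypothetical. The sketch only detects missing structure (absent sections, alternatives without a rejection reason); it cannot judge whether the content is specific or quantified.

```python
import re

# A subset of template headings that the checklist singles out (assumed list).
REQUIRED_SECTIONS = [
    "Problem Statement",
    "Goals and Non-Goals",
    "Alternatives Considered",
    "Security Implications",
    "Performance Implications",
    "Open Questions",
]


def lint_rfc(markdown: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing was detected."""
    findings = []

    # Collect level-2 headings, ignoring an optional "N." section number.
    headings = re.findall(r"^## (?:\d+\.\s*)?(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            findings.append(f"missing section: {section}")

    # Each "### Alternative N: ..." block needs its own rejection reason.
    blocks = re.split(r"^### Alternative ", markdown, flags=re.MULTILINE)[1:]
    for block in blocks:
        # Stop at the next level-2 heading so later sections are not scanned.
        block = re.split(r"^## ", block, flags=re.MULTILINE)[0]
        title = block.splitlines()[0].strip()
        if "**Why rejected:**" not in block:
            findings.append(f"alternative without rejection reason: {title}")
    return findings


if __name__ == "__main__":
    draft = "## 1. Problem Statement\n\n### Alternative 1: Queue\n\n**Pros:**\n- Simple\n"
    for finding in lint_rfc(draft):
        print(finding)
```

A check like this fits as a pre-review step; the remaining items on the list still need a human reader.
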