05b6d799f0
Three more learnings from alirezarezvani/claude-skills, applied: 1. SkillCheck validator (scripts/skillcheck.mjs) — validates every SKILL.md against the authoring standard (frontmatter, name/folder match, trigger + produces clauses, required headings) plus tier referential integrity. Errors fail CI; --strict fails on warnings too. New skillcheck.yml workflow and a SkillCheck status badge in the README. Current: 0 errors / 14 advisory warnings across 172 skills. 2. Cursor export platform — build-exports.mjs now generates exports/cursor/<bundle>/<skill>/<skill>.mdc rule files. The PLATFORMS registry now supports per-skill filenames (file as a function). 3. Per-agent installers — scripts/install.sh unifies install for claude/hermes/codex/openclaw/cursor (--link, --target, --dry-run, --list). Curl-able one-liners codex-install.sh, openclaw-install.sh, and cursor-install.sh clone the library and install in a single command. README documents the one-line installs and Cursor exports; CHANGELOG and the authoring standard updated. Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>
123 lines
5.4 KiB
Plaintext
123 lines
5.4 KiB
Plaintext
---
|
||
description: "Design statistically rigorous A/B tests for product features, UI changes, onboarding flows, and pricing experiments. Use when asked to set up an experiment, design an A/B test, calculate sample size, or interpret test results. Produces a complete test plan with hypothesis, variant definitions, sample size, duration estimate, guardrail metrics, and a results interpretation guide."
|
||
globs:
|
||
alwaysApply: false
|
||
---
|
||
|
||
# A/B Test Planner Skill
|
||
|
||
Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.
|
||
|
||
## Required Inputs
|
||
|
||
Ask the user for these if not provided:
|
||
- **What is being tested** (feature, UI change, copy, pricing, onboarding step)
|
||
- **Hypothesis** (or ask to help formulate one)
|
||
- **Primary metric** (conversion rate, click-through, completion rate, etc.)
|
||
- **Baseline rate** and **minimum detectable effect** (MDE)
|
||
- **Daily eligible users** (to calculate duration)
|
||
|
||
## Experiment Design Checklist
|
||
|
||
Before running any test, confirm:
|
||
- [ ] Clear hypothesis with predicted direction
|
||
- [ ] Single primary metric (plus up to 2 guardrail metrics)
|
||
- [ ] Minimum detectable effect (MDE) defined
|
||
- [ ] Sample size calculated
|
||
- [ ] Test duration estimated
|
||
- [ ] Segment isolated (no overlap with other running tests)
|
||
- [ ] Rollback plan defined
|
||
|
||
## Hypothesis Template
|
||
|
||
> "We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."
|
||
|
||
Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.
|
||
|
||
## Sample Size Calculator Logic
|
||
|
||
Use this formula (provide the output, not the formula, to the user):
|
||
|
||
- **Baseline conversion rate:** Current rate of primary metric
|
||
- **MDE:** Smallest change worth detecting (recommend 10–20% relative lift for most features)
|
||
- **Statistical power:** 80% (standard)
|
||
- **Significance level:** 95% (p < 0.05)
|
||
|
||
For common scenarios, provide pre-calculated estimates:
|
||
|
||
| Baseline Rate | MDE (Relative) | Required Sample per Variant |
|
||
|---|---|---|
|
||
| 5% | 20% | ~19,000 |
|
||
| 10% | 15% | ~14,000 |
|
||
| 20% | 10% | ~15,000 |
|
||
| 40% | 10% | ~9,500 |
|
||
| 60% | 5% | ~42,000 |
|
||
|
||
Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."
|
||
|
||
## Test Duration Guidance
|
||
|
||
Minimum: 2 full weeks (to capture weekly seasonality)
|
||
Maximum: 4 weeks (novelty effect distorts results beyond this)
|
||
|
||
`Duration = Required sample ÷ (Daily traffic × % exposed)`
|
||
|
||
Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).
|
||
|
||
## Output Format
|
||
|
||
### A/B Test Plan — [Test Name] — [Date]
|
||
|
||
**Hypothesis:**
|
||
> [Filled hypothesis template]
|
||
|
||
**Variants:**
|
||
- Control (A): [Current experience]
|
||
- Treatment (B): [Changed experience — be specific]
|
||
|
||
**Primary Metric:** [Metric name + how measured]
|
||
**Guardrail Metrics:** [Metrics that must not degrade]
|
||
|
||
**Target Segment:** [Who sees the test — % of traffic, user type]
|
||
**Traffic Split:** [50/50 recommended unless ramp-up needed]
|
||
|
||
**Sample Size Required:** ~[N] users per variant
|
||
**Estimated Duration:** [X] weeks (based on [Y] daily eligible users)
|
||
**Significance Threshold:** 95% confidence, 80% power
|
||
|
||
**Exclusions:** [Any user segments to exclude and why]
|
||
|
||
**Rollback Trigger:** If [guardrail metric] degrades by [X%], stop the test immediately.
|
||
|
||
**Results Interpretation Guide:**
|
||
- ✅ Ship if: Treatment shows [X%]+ lift on primary metric at 95% confidence AND guardrail metrics are stable
|
||
- 🔄 Iterate if: Direction is positive but not significant — consider extending or redesigning
|
||
- ❌ Reject if: No lift or negative direction at significance
|
||
- ⚠️ Inconclusive: Do not ship. Do not call it a win.
|
||
|
||
---
|
||
|
||
## Guidelines
|
||
|
||
- Always recommend against peeking at results before the test reaches planned sample size — explain p-hacking risk
|
||
- If user wants to test multiple variants, explain the multiple comparisons problem and recommend a Bonferroni correction or a Bayesian approach
|
||
- If traffic is very low (<1,000 users/day), recommend qualitative alternatives: moderated testing, 5-second tests, or user interviews
|
||
- Never approve a test with no guardrail metrics — always protect revenue, retention, or core engagement
|
||
|
||
## Anti-Patterns
|
||
|
||
- [ ] Do not run a test without a directional hypothesis — "let's see what happens" produces uninterpretable results
|
||
- [ ] Do not declare a winner before reaching the pre-planned sample size — peeking at results inflates false positive rates
|
||
- [ ] Do not test multiple independent changes in a single variant — you won't know which change caused the result
|
||
- [ ] Do not use engagement metrics (clicks, time-on-page) as the primary metric when the goal is revenue or retention — proxy metrics mislead
|
||
- [ ] Do not ignore guardrail metrics — a conversion lift that causes a support ticket spike is not a win
|
||
|
||
## Quality Checks
|
||
|
||
- [ ] Hypothesis is directional (predicts a specific direction and magnitude, not "let's see")
|
||
- [ ] Primary metric is singular (guardrail metrics are secondary)
|
||
- [ ] Sample size is calculated from actual MDE and baseline (not guessed)
|
||
- [ ] Test duration accounts for weekly seasonality (minimum 2 weeks)
|
||
- [ ] Guardrail metrics are defined (at least one to protect revenue or core engagement)
|
||
- [ ] Rollback trigger is specified with a concrete threshold
|