diff --git a/skills/experiment-designer/SKILL.md b/skills/experiment-designer/SKILL.md new file mode 100644 index 0000000..122e7d5 --- /dev/null +++ b/skills/experiment-designer/SKILL.md @@ -0,0 +1,55 @@ +--- +name: experiment-designer +description: Designs A/B tests from hypotheses and interprets experiment results +with statistical rigour. Use when user says "run an experiment", "design an A/B +test", "test this feature", "interpret these results", "was this experiment +successful", or "what sample size do I need". +metadata: + author: Mohit Aggarwal + version: 1.0.0 + category: data-and-metrics + tags: [experimentation, data, analytics, ab-testing] + documentation: https://github.com/mohitagw15856/pm-claude-skills +--- +# Experiment Designer Skill + +## Purpose +Produce rigorous experiment designs from product hypotheses, and interpret +results with statistical and practical significance — so you can defend every +decision to a sceptical engineering lead or data scientist. + +## Two-Phase Process + +### Phase 1: Experiment Design +**Required inputs:** hypothesis, primary metric, current baseline, minimum +detectable effect (MDE), available sample size per day. + +**Output:** +- Hypothesis restated as: "If we [change], we expect [metric] to [move by X%] + because [reason]" +- Control and variant definitions +- Primary metric (one only) +- Secondary guardrail metrics (2-3 max) +- Required sample size (calculated from MDE and baseline) +- Estimated run time in days +- Pre-defined success criteria (before the test runs — no moving goalposts) +- Design risk flags: novelty effects, seasonal confounds, multiple testing issues, + network effects, sample ratio mismatch risks + +### Phase 2: Results Interpretation +**Required inputs:** control results, variant results, p-value or raw numbers, +run duration, any anomalies observed. + +**Output:** +- Statistical significance assessment (p < 0.05 threshold) +- Practical significance: was the lift meaningful for the business, not just real? +- Confidence interval interpretation +- Confounding factors to investigate +- Recommendation: Ship / Iterate / Kill / Run follow-up test +- If "Iterate": specific hypotheses to test next + +## Quality Checks +- Never interpret results from an underpowered test without flagging it +- Always distinguish statistical from practical significance +- Flag if test was stopped early (peeking problem) +- Note if sample ratio mismatch occurred