Create SKILL.md
New skill: experiment-designer
This commit is contained in:
@@ -0,0 +1,55 @@
|
|||||||
|
---
|
||||||
|
name: experiment-designer
|
||||||
|
description: Designs A/B tests from hypotheses and interprets experiment results
|
||||||
|
with statistical rigour. Use when user says "run an experiment", "design an A/B
|
||||||
|
test", "test this feature", "interpret these results", "was this experiment
|
||||||
|
successful", or "what sample size do I need".
|
||||||
|
metadata:
|
||||||
|
author: Mohit Aggarwal
|
||||||
|
version: 1.0.0
|
||||||
|
category: data-and-metrics
|
||||||
|
tags: [experimentation, data, analytics, ab-testing]
|
||||||
|
documentation: https://github.com/mohitagw15856/pm-claude-skills
|
||||||
|
---
|
||||||
|
# Experiment Designer Skill
|
||||||
|
|
||||||
|
## Purpose
|
||||||
|
Produce rigorous experiment designs from product hypotheses, and interpret
|
||||||
|
results with statistical and practical significance — so you can defend every
|
||||||
|
decision to a sceptical engineering lead or data scientist.
|
||||||
|
|
||||||
|
## Two-Phase Process
|
||||||
|
|
||||||
|
### Phase 1: Experiment Design
|
||||||
|
**Required inputs:** hypothesis, primary metric, current baseline, minimum
|
||||||
|
detectable effect (MDE), available sample size per day.
|
||||||
|
|
||||||
|
**Output:**
|
||||||
|
- Hypothesis restated as: "If we [change], we expect [metric] to [move by X%]
|
||||||
|
because [reason]"
|
||||||
|
- Control and variant definitions
|
||||||
|
- Primary metric (one only)
|
||||||
|
- Secondary guardrail metrics (2-3 max)
|
||||||
|
- Required sample size (calculated from MDE and baseline)
|
||||||
|
- Estimated run time in days
|
||||||
|
- Pre-defined success criteria (before the test runs — no moving goalposts)
|
||||||
|
- Design risk flags: novelty effects, seasonal confounds, multiple testing issues,
|
||||||
|
network effects, sample ratio mismatch risks
|
||||||
|
|
||||||
|
### Phase 2: Results Interpretation
|
||||||
|
**Required inputs:** control results, variant results, p-value or raw numbers,
|
||||||
|
run duration, any anomalies observed.
|
||||||
|
|
||||||
|
**Output:**
|
||||||
|
- Statistical significance assessment (p < 0.05 threshold)
|
||||||
|
- Practical significance: was the lift meaningful for the business, not just real?
|
||||||
|
- Confidence interval interpretation
|
||||||
|
- Confounding factors to investigate
|
||||||
|
- Recommendation: Ship / Iterate / Kill / Run follow-up test
|
||||||
|
- If "Iterate": specific hypotheses to test next
|
||||||
|
|
||||||
|
## Quality Checks
|
||||||
|
- Never interpret results from an underpowered test without flagging it
|
||||||
|
- Always distinguish statistical from practical significance
|
||||||
|
- Flag if test was stopped early (peeking problem)
|
||||||
|
- Note if sample ratio mismatch occurred
|
||||||
Reference in New Issue
Block a user