Files
pm-claude-skills/exports/windsurf/pm-advanced/experiment-designer/experiment-designer.md
T
mohitagw15856 036511ab3e Windsurf + Aider targets, MCP server, and demo placement (#33)
Broadens both reach (more tools) and content types (an MCP server), continuing
the multi-platform story.

Windsurf + Aider:
- build-exports.mjs gains two platforms: exports/windsurf/*.md (workspace rules,
  trigger: model_decision) and exports/aider/*.md (conventions for `aider --read`).
  Now 5 platforms (ChatGPT, Gemini, Cursor, Windsurf, Aider).
- install.sh + bin/cli.mjs install both (windsurf -> .windsurf/rules, aider ->
  .aider/skills with a --read hint); generated README index is excluded from copies.
- One-line windsurf-install.sh / aider-install.sh wrappers for parity.

MCP server (new content type):
- mcp/server.mjs — zero-dependency stdio MCP server exposing list_skills,
  search_skills, get_skill. Published as a second bin (pm-claude-skills-mcp).
  Logs to stderr; reads bundled skills/ at startup. mcp/README.md documents
  client config.

Also: README hero "See it in action" demo placement (ready to swap in a GIF;
recording guide in web/docs-assets/README.md), Works-With table + exports +
install docs updated, CHANGELOG Unreleased. package.json files/bin updated.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-17 23:15:38 +01:00

4.0 KiB

trigger, description
trigger description
model_decision Design statistically rigorous A/B tests and interpret experiment results. Use when asked to design an experiment, run an A/B test, calculate sample size, interpret test results, or assess whether an experiment was successful. Produces a complete experiment design with hypothesis, sample size, run time, success criteria, and risk flags — or a results interpretation with ship/iterate/kill recommendation.

Experiment Designer Skill

Produce rigorous experiment designs from product hypotheses, and interpret results with statistical and practical significance — so you can defend every decision to a sceptical engineering lead or data scientist.

Required Inputs

Ask the user for these if not provided: For experiment design:

  • Hypothesis (what change, what metric, what expected movement)
  • Current baseline metric value
  • Minimum detectable effect (MDE) — the smallest lift worth caring about
  • Available daily sample size

For results interpretation:

  • Control and variant results (raw numbers or percentages)
  • P-value or confidence interval
  • Run duration (days)
  • Any anomalies observed during the test

Two-Phase Process

Phase 1: Experiment Design

  1. Restate hypothesis as: "If we [change], we expect [metric] to [move by X%] because [reason]"
  2. Define control and variant clearly
  3. Select primary metric (one only) and secondary guardrail metrics (2-3 max)
  4. Calculate required sample size from MDE and baseline
  5. Estimate run time in days
  6. Set pre-defined success criteria before the test runs — no moving goalposts
  7. Flag design risks: novelty effects, seasonal confounds, multiple testing issues, network effects, sample ratio mismatch

Phase 2: Results Interpretation

  1. Assess statistical significance (p < 0.05 threshold)
  2. Assess practical significance: was the lift meaningful for the business, not just real?
  3. Interpret confidence intervals
  4. Investigate confounding factors
  5. Recommend: Ship / Iterate / Kill / Run follow-up test
  6. Validate — Confirm the test ran for the full planned duration. Flag if it was stopped early (peeking problem). Confirm sample ratio mismatch did not occur.

Output Structure

[Design or Results header based on phase]

Hypothesis: "If we [change], we expect [metric] to [move by X%] because [reason]"

Primary metric: [One metric only] Guardrail metrics: [2-3 max] Required sample size: [n per variant] Estimated run time: [days] Pre-defined success threshold: [specific number] Design risk flags: [any concerns]

Results (Phase 2 only): Statistical significance: [p-value and conclusion] Practical significance: [lift size vs. business threshold] Recommendation: Ship / Iterate / Kill / Follow-up — [rationale]

Quality Checks

  • Hypothesis specifies the change, the metric, the direction, and the reason
  • Primary metric is singular — guardrail metrics are secondary
  • Success criteria are defined before the test launches (not after seeing results)
  • Test was not stopped early (or flagged clearly if it was)
  • Practical significance assessed separately from statistical significance
  • Sample ratio mismatch is checked in results interpretation

Anti-Patterns

  • Do not define success criteria after seeing preliminary results — post-hoc success definitions are HARKing (Hypothesising After Results are Known) and invalidate the experiment
  • Do not stop a test early because the result looks significant — early stopping dramatically inflates false positive rates; the test must run to the planned sample size
  • Do not treat statistical significance as the same as practical significance — a p < 0.05 result with a 0.1% lift is real but may not be worth shipping
  • Do not run the same experiment on the same population multiple times without correction — multiple testing inflates the chance of a false positive proportionally
  • Do not use more than one primary metric — multiple primary metrics require multiple hypothesis corrections and make the ship/kill decision ambiguous