feat: compare-mode demo GIF, expanded eval cases, sample-generation workflow

- Add compare-mode demo GIF + its Playwright recorder; embed in README eval section
- Expand evals/cases.json (6 → 15 flagship skills) so more skills can be eval-scored and sample-generated
- Add --generate-missing mode to build-samples.mjs
- Add generate-samples.yml: workflow_dispatch job that generates real sample outputs via the ANTHROPIC_API_KEY secret (key never leaves GitHub) and commits

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,45 @@
+name: Generate Sample Outputs
+
+# Generates real model outputs for the sample-output gallery using the
+# ANTHROPIC_API_KEY repo secret — the key never leaves GitHub. Generates a
+# sample for every eval-case skill that doesn't already have one (it never
+# overwrites hand-written samples), rebuilds web/samples.json, and commits.
+#
+# Run it from the Actions tab → "Generate Sample Outputs" → Run workflow.
+
+on:
+  workflow_dispatch: {}
+
+permissions:
+  contents: write
+
+jobs:
+  generate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+
+      - name: Generate missing samples + rebuild gallery
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          if [ -z "$ANTHROPIC_API_KEY" ]; then
+            echo "::error::ANTHROPIC_API_KEY secret is not set."
+            exit 1
+          fi
+          node scripts/build-samples.mjs --generate-missing
+
+      - name: Commit new samples
+        run: |
+          if ! git diff --quiet -- examples/samples web/samples.json; then
+            git config user.name "github-actions[bot]"
+            git config user.email "github-actions[bot]@users.noreply.github.com"
+            git add examples/samples web/samples.json
+            git commit -m "chore(samples): generate sample outputs for the gallery"
+            git push
+          else
+            echo "No new samples to commit."
+          fi
@@ -121,6 +121,10 @@ The flagship skills score consistently high (out of 5):
 
 These scores show up as badges in the [Playground](https://mohitagw15856.github.io/pm-claude-skills/) and the [🏆 leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html). Coverage is expanding — run it yourself with `node evals/run-evals.mjs` (needs an API key). *Honest note: 6 skills are eval-scored today; the rest are reviewed against the [authoring standard](SKILL-AUTHORING-STANDARD.md) but not yet auto-scored.*
 
+**See the difference for yourself.** The Playground's *Compare* toggle runs the same inputs with and without the skill, side by side — structured, shippable output on the left; generic mush on the right:
+
+[](https://mohitagw15856.github.io/pm-claude-skills/)
+
 ---
 
 ## Contents
@@ -24,6 +24,42 @@
     {
       "skill": "sprint-planning",
       "input": "Team of 5, 2-week sprint, average velocity 30 points, one engineer out 3 days. Backlog: checkout redesign (8), payment retries (5), analytics events (3), bug bash (3), API rate limiting (5)."
+    },
+    {
+      "skill": "roadmap-narrative",
+      "input": "H2 roadmap for a B2B analytics product. Themes: self-serve onboarding, an integrations marketplace, and enterprise SSO/audit logs. Audience: the exec team and key customers. We want the story, not a feature list."
+    },
+    {
+      "skill": "okr-builder",
+      "input": "Company objective: become the default analytics tool for startups. For the product team, next quarter. We care about activation, retention, and word-of-mouth growth."
+    },
+    {
+      "skill": "go-to-market",
+      "input": "Launching an integrations marketplace for our analytics product. Target: existing mid-market customers and their ops teams. Goal: 30% of accounts install at least one integration within 60 days."
+    },
+    {
+      "skill": "churn-analysis",
+      "input": "SMB SaaS, $49/mo. Monthly logo churn rose from 3% to 5% over two quarters. Most cancellations happen in month 2-3. Top stated reasons: 'too hard to set up' and 'didn't see value'. Annual plans churn far less than monthly."
+    },
+    {
+      "skill": "stakeholder-update",
+      "input": "Weekly update for sales, support, and exec stakeholders on the checkout revamp. Status: 10% rollout live, conversion +4%, one payments edge case under investigation, full launch gated on a Legal PCI review due Tuesday."
+    },
+    {
+      "skill": "user-story-writer",
+      "input": "Feature: let users export a dashboard to PDF and schedule a recurring email of it. Users are analysts and their managers. Keep stories small and testable with clear acceptance criteria."
+    },
+    {
+      "skill": "incident-postmortem",
+      "input": "Checkout was down 42 minutes after a deploy set a wrong env var on the payments service; 5xx spiked, ~1,200 failed checkouts. Detected by alert in 6 min, fixed by rollback. Blameless postmortem with timeline and action items."
+    },
+    {
+      "skill": "ab-test-planner",
+      "input": "Test whether moving the signup CTA above the fold on the pricing page increases free-trial starts. Current trial-start rate 8%, ~20k weekly visitors. We want to detect a 10% relative lift."
+    },
+    {
+      "skill": "metrics-framework",
+      "input": "Define the metrics framework for a B2B analytics product: the north star, input metrics across acquisition/activation/retention/revenue, and guardrails. Stage: early growth, ~500 paying accounts."
     }
   ]
 }
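Because `generate()` looks up a skill's input with `cases.find(...)`, it uses only the first case for each skill and fails when `input` is empty, so a quick sanity check on the cases file can catch mistakes before any API call. A minimal sketch follows; `validateCases` is a hypothetical helper, not part of this commit, and it assumes the `{ "cases": [{ "skill", "input" }] }` shape shown in the hunk above.

```javascript
// Hypothetical helper (not in the repo): report problems in an eval-cases object.
function validateCases(doc) {
  const problems = [];
  const seen = new Set();
  const cases = doc && Array.isArray(doc.cases) ? doc.cases : [];
  if (cases.length === 0) problems.push('no cases found');

  cases.forEach((entry, i) => {
    const { skill, input } = entry || {};
    const hasSkill = typeof skill === 'string' && skill.trim() !== '';
    const hasInput = typeof input === 'string' && input.trim() !== '';
    if (!hasSkill) problems.push(`case ${i}: missing "skill"`);
    if (!hasInput) problems.push(`case ${i}: missing "input"`);
    // generate() only ever reads the first match, so later duplicates are dead entries.
    if (hasSkill && seen.has(skill)) problems.push(`case ${i}: duplicate skill "${skill}"`);
    if (hasSkill) seen.add(skill);
  });
  return problems;
}

// Example data is made up for illustration.
const sampleDoc = {
  cases: [
    { skill: 'okr-builder', input: 'Company objective: ...' },
    { skill: 'okr-builder', input: 'A second, unreachable case' },
    { skill: 'go-to-market', input: '' },
  ],
};
const problems = validateCases(sampleDoc);
console.log(problems);
```

An empty result means every case has a non-empty `skill` and `input` and no skill appears twice.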
@@ -30,13 +30,13 @@ function parseFrontmatter(text) {
 async function generate(skillName) {
   const { complete, parseSkill } = await import('../bin/lib/anthropic.mjs');
   const apiKey = process.env.ANTHROPIC_API_KEY;
-  if (!apiKey) { console.error('Set ANTHROPIC_API_KEY to generate.'); process.exit(1); }
+  if (!apiKey) throw new Error('Set ANTHROPIC_API_KEY to generate.');
   const skillFile = join(root, 'skills', skillName, 'SKILL.md');
-  if (!existsSync(skillFile)) { console.error(`Unknown skill: ${skillName}`); process.exit(1); }
+  if (!existsSync(skillFile)) throw new Error(`Unknown skill: ${skillName}`);
   const { body } = parseSkill(readFileSync(skillFile, 'utf8'));
   const cases = JSON.parse(readFileSync(join(root, 'evals', 'cases.json'), 'utf8')).cases;
   const input = (cases.find((c) => c.skill === skillName) || {}).input;
-  if (!input) { console.error(`No eval case input for ${skillName}; add one to evals/cases.json first.`); process.exit(1); }
+  if (!input) throw new Error(`No eval case input for ${skillName}; add one to evals/cases.json first.`);
   const system = body + '\n\n---\nExecute this skill now on the input below and produce the complete output. Do not ask questions.';
   const output = await complete({ apiKey, model: 'claude-sonnet-4-6', system, messages: [{ role: 'user', content: input }], maxTokens: 4096 });
   const title = skillName.split('-').map((w) => w[0].toUpperCase() + w.slice(1)).join(' ');
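Replacing `process.exit(1)` with `throw` moves the decision about what a failure means from `generate()` to its caller: a single-skill run can still exit, while a batch run can log the error and continue. The sketch below illustrates that pattern under simplified assumptions; `generateStub` and `generateMany` are illustrative stand-ins, not functions from the repository, and no API call is made.

```javascript
// Illustrative stand-in for generate(): it throws instead of exiting the process.
function generateStub(skillName, knownSkills) {
  if (!knownSkills.includes(skillName)) throw new Error(`Unknown skill: ${skillName}`);
  return `sample for ${skillName}`;
}

// Batch caller: one bad skill is logged and recorded, and the loop moves on.
function generateMany(names, knownSkills) {
  const failed = [];
  for (const name of names) {
    try {
      generateStub(name, knownSkills);
    } catch (e) {
      console.error(`failed ${name}: ${e.message}`);
      failed.push(name);
    }
  }
  return failed;
}

const failedSkills = generateMany(
  ['okr-builder', 'no-such-skill', 'go-to-market'],
  ['okr-builder', 'go-to-market'],
);
console.log('failed:', failedSkills);
```

Returning the list of failures is a design choice of this sketch; a caller that wants CI to notice partial failures could, for example, set `process.exitCode` when the list is non-empty.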
@@ -48,7 +48,18 @@ async function generate(skillName) {
 
 const genIdx = process.argv.indexOf('--generate');
 if (genIdx !== -1) {
-  await generate(process.argv[genIdx + 1]);
+  try { await generate(process.argv[genIdx + 1]); }
+  catch (e) { console.error(e.message); process.exit(1); }
+}
+
+// --generate-missing: generate a sample for every eval-case skill that doesn't
+// already have one (never overwrites hand-written samples).
+if (process.argv.includes('--generate-missing')) {
+  const cases = JSON.parse(readFileSync(join(root, 'evals', 'cases.json'), 'utf8')).cases;
+  for (const c of cases) {
+    if (existsSync(join(samplesDir, `${c.skill}.md`))) { console.log(`skip ${c.skill} (already has a sample)`); continue; }
+    try { await generate(c.skill); } catch (e) { console.error(`failed ${c.skill}: ${e.message}`); }
+  }
 }
 
 // Build samples.json
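The skip rule in `--generate-missing` depends only on whether `<skill>.md` already exists in the samples directory, so the set of skills a run would touch can be worked out without calling the API. A rough dry-run sketch follows; `planMissing` is a hypothetical helper, and it takes a plain list of existing file names rather than reading the file system.

```javascript
// Hypothetical dry-run helper: decide which eval-case skills would be generated.
// `existingSampleFiles` is a list of file names such as 'sprint-planning.md'.
function planMissing(cases, existingSampleFiles) {
  const have = new Set(existingSampleFiles);
  const plan = { generate: [], skip: [] };
  for (const c of cases) {
    // Same rule as the hunk above: an existing <skill>.md is never overwritten.
    if (have.has(`${c.skill}.md`)) plan.skip.push(c.skill);
    else plan.generate.push(c.skill);
  }
  return plan;
}

const plan = planMissing(
  [{ skill: 'sprint-planning' }, { skill: 'okr-builder' }, { skill: 'churn-analysis' }],
  ['sprint-planning.md'],
);
console.log(plan);
```

With the example data, only `okr-builder` and `churn-analysis` land in `generate`, and the existing `sprint-planning` sample is left alone.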
Binary file not shown. Size: 1.7 MiB
@@ -0,0 +1,89 @@
+// Records the "Compare vs. plain prompt" demo. Drives the real playground UI;
+// mocks BOTH API calls so the contrast is visible without a key — the with-skill
+// call (has a system prompt) streams a structured exec update, the plain call
+// (no system) streams generic mush. Re-record live by removing the fetch override.
+//
+// Usage: node web/docs-assets/record-compare-demo.mjs (serve web/ on :8080 first)
+import { fileURLToPath } from 'url';
+import path from 'path';
+
+const __dirname = path.dirname(fileURLToPath(import.meta.url));
+const pwPath = process.env.PLAYWRIGHT_PATH || 'playwright';
+const pw = await import(pwPath);
+const chromium = pw.chromium || (pw.default && pw.default.chromium);
+
+const BASE = process.env.DEMO_URL || 'http://localhost:8080/';
+const VIEWPORT = { width: 1180, height: 820 };
+
+const WITH_SKILL = `# Executive Update — Q2, for the CEO
+
+**Headline:** Activation up 14 pts after the onboarding redesign; on track for Q2 with one staffing risk.
+
+## Key metrics
+- Weekly active accounts: 12,400 (+18% QoQ)
+- Activation rate: 61% (+14 pts)
+- Net revenue retention: 112%
+
+## Risk & decision needed
+Backend hiring is two roles behind — the Q3 billing revamp is at risk. Approve contractor budget to hold the date.`;
+
+const PLAIN = `Here is an update on the quarter. The team worked hard this quarter and made good progress on a number of initiatives. Onboarding was improved and we saw some positive movement in our metrics overall. There are a few things we are keeping an eye on going forward, and we will continue to monitor the situation and provide updates as needed. Overall it was a solid quarter with room to improve.`;
+
+const chunk = (t) => t.match(/\S+\s*/g) || [t];
+
+const initScript = ({ withChunks, plainChunks }) => {
+  try { localStorage.setItem('anthropic_api_key', 'sk-ant-demo-key-not-real'); } catch (e) {}
+  const realFetch = window.fetch.bind(window);
+  window.fetch = (url, opts) => {
+    const u = typeof url === 'string' ? url : (url && url.url) || '';
+    if (!u.includes('api.anthropic.com')) return realFetch(url, opts);
+    let hasSystem = false;
+    try { hasSystem = !!JSON.parse(opts.body).system; } catch (e) {}
+    const chunks = hasSystem ? withChunks : plainChunks;
+    const enc = new TextEncoder();
+    let i = 0;
+    const stream = new ReadableStream({
+      start(controller) {
+        const push = () => {
+          if (i >= chunks.length) {
+            controller.enqueue(enc.encode('data: {"type":"message_stop"}\n\n'));
+            controller.close();
+            return;
+          }
+          const evt = { type: 'content_block_delta', delta: { type: 'text_delta', text: chunks[i++] } };
+          controller.enqueue(enc.encode('data: ' + JSON.stringify(evt) + '\n\n'));
+          setTimeout(push, 38);
+        };
+        setTimeout(push, 150);
+      },
+    });
+    return Promise.resolve(new Response(stream, { status: 200, headers: { 'content-type': 'text/event-stream' } }));
+  };
+};
+
+const pause = (ms) => new Promise((r) => setTimeout(r, ms));
+
+const browser = await chromium.launch();
+const context = await browser.newContext({ viewport: VIEWPORT, deviceScaleFactor: 2, recordVideo: { dir: __dirname, size: VIEWPORT } });
+const page = await context.newPage();
+await page.addInitScript(initScript, { withChunks: chunk(WITH_SKILL), plainChunks: chunk(PLAIN) });
+
+await page.goto(BASE + '?skill=executive-update', { waitUntil: 'networkidle' });
+await page.waitForSelector('#inputForm input, #inputForm textarea');
+await pause(700);
+
+await page.locator('#f_0').type('Shipped onboarding redesign; activation up sharply; backend hiring behind, billing revamp at risk for Q3.', { delay: 6 });
+await page.locator('#f_1').type('CEO', { delay: 22 });
+await page.locator('#f_2').type('Q2', { delay: 22 });
+await page.locator('#f_3').type('WAA, activation, NRR', { delay: 16 });
+await pause(400);
+
+await page.check('#compareToggle');
+await pause(700);
+await page.click('#runBtn');
+await page.waitForFunction(() => document.querySelector('#status')?.textContent?.includes('Done'), { timeout: 15000 });
+await pause(2200);
+
+await context.close();
+await browser.close();
+console.log('Recorded compare demo into', __dirname);
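The mock stream in the recorder works by splitting the canned text into word-sized chunks and wrapping each one in a `data:` server-sent event. The sketch below reuses the recorder's `chunk` expression and event shape to show that the framed chunks reassemble into the original text. `frame` and `readDeltas` are hypothetical helpers added for illustration, and the parser is a simplification of whatever the playground actually does.

```javascript
// Same tokenizer as the recorder: each chunk is a word plus its trailing whitespace.
const chunk = (t) => t.match(/\S+\s*/g) || [t];

// Wrap one text fragment as an SSE "data:" event, mirroring the mock's event shape.
const frame = (text) =>
  'data: ' + JSON.stringify({ type: 'content_block_delta', delta: { type: 'text_delta', text } }) + '\n\n';

// Hypothetical consumer: rebuild the text from a raw SSE body, ignoring other events.
function readDeltas(body) {
  let out = '';
  for (const line of body.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const evt = JSON.parse(line.slice('data: '.length));
    if (evt.type === 'content_block_delta' && evt.delta && evt.delta.type === 'text_delta') {
      out += evt.delta.text;
    }
  }
  return out;
}

const text = '## Key metrics\n- Activation rate: 61% (+14 pts)';
const pieces = chunk(text);
const body = pieces.map(frame).join('') + 'data: {"type":"message_stop"}\n\n';

console.log(pieces.length);              // number of word-sized chunks
console.log(readDeltas(body) === text);  // true when the round trip is lossless
```

Splitting the body on `\n` is safe here because `JSON.stringify` escapes newlines inside each chunk. One caveat of the tokenizer: leading whitespace at the very start of the text is not captured by `/\S+\s*/g`, which does not matter for the two canned strings in the recorder.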