From 51bf4be52f1c04a207c2f45901b9c1fab33084b8 Mon Sep 17 00:00:00 2001 From: mohitagw15856 <119053560+mohitagw15856@users.noreply.github.com> Date: Thu, 18 Jun 2026 08:37:40 +0100 Subject: [PATCH] AI-powered tooling: GitHub Action, generate command, evals + leaderboard (#41) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three features riding 2026 trends (agentic CI, codegen, evals), sharing one dependency-free Anthropic client (bin/lib/anthropic.mjs). 1. GitHub Action (action/) — run any skill in a consumer repo's CI: uses: mohitagw15856/pm-claude-skills/action@main. Composite action + run.mjs (loads the bundled SKILL.md, calls the API, exposes result as a step output / file). Docs with auto-PR-description example. 2. generate command — `npx pm-claude-skills generate --from ` turns a team's docs into a SKILL.md following the authoring standard (bin/generate.mjs, wired into the CLI; needs ANTHROPIC_API_KEY). 3. Skill evals + Leaderboard — evals/run-evals.mjs runs each case across models and scores output with an LLM judge (structure/completeness/usefulness/ grounding); scripts/build-leaderboard.mjs renders web/leaderboard.html (built in the Pages deploy, falls back to clearly-labelled example data). Linked from README, catalog, and playground. Offline-testable parts verified (prompt building, skill loading, graceful errors, leaderboard render). SkillCheck/audit/exports all green. 
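A minimal sketch of the judge-score arithmetic from item 3, assuming the judge replies with a JSON object keyed by the four dimensions. It mirrors the shape of `parseScores` in evals/run-evals.mjs but is a standalone illustration; `overallScore` and `sampleReply` are hypothetical names, not part of the patch.

```javascript
// Illustrative only: a standalone sketch, not the code shipped in this patch.
const DIMENSIONS = ['structure', 'completeness', 'usefulness', 'grounding'];

// Pull the first {...} span out of the judge reply and clamp each score to 1..5.
function parseScores(reply) {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('judge did not return JSON');
  const raw = JSON.parse(match[0]);
  const scores = {};
  for (const d of DIMENSIONS) {
    // Missing or non-numeric values fall back to 0 and are then clamped up to 1.
    scores[d] = Math.max(1, Math.min(5, Number(raw[d]) || 0));
  }
  return scores;
}

// Unweighted mean of the four dimensions, rounded to two decimals.
function overallScore(scores) {
  const sum = DIMENSIONS.reduce((acc, d) => acc + scores[d], 0);
  return Math.round((sum / DIMENSIONS.length) * 100) / 100;
}

// Hypothetical judge reply with stray prose around the JSON and one
// out-of-range value (9), which is clamped to 5.
const sampleReply = 'Scores: {"structure":5,"completeness":4,"usefulness":4,"grounding":9}';
const sampleScores = parseScores(sampleReply);
console.log(sampleScores, overallScore(sampleScores));
```

Clamping before averaging keeps a single malformed judge value from pushing a skill's overall score outside the 1 to 5 range shown on the leaderboard.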
Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude --- .github/workflows/deploy-playground.yml | 3 + .gitignore | 1 + CHANGELOG.md | 14 ++- README.md | 24 ++++++ action/README.md | 65 ++++++++++++++ action/action.yml | 51 +++++++++++ action/run.mjs | 58 +++++++++++++ bin/cli.mjs | 7 ++ bin/generate.mjs | 109 ++++++++++++++++++++++++ bin/lib/anthropic.mjs | 51 +++++++++++ evals/README.md | 40 +++++++++ evals/cases.json | 29 +++++++ evals/results.example.json | 22 +++++ evals/run-evals.mjs | 93 ++++++++++++++++++++ scripts/build-docs.mjs | 1 + scripts/build-leaderboard.mjs | 76 +++++++++++++++++ web/index.html | 2 +- 17 files changed, 644 insertions(+), 2 deletions(-) create mode 100644 action/README.md create mode 100644 action/action.yml create mode 100644 action/run.mjs create mode 100644 bin/generate.mjs create mode 100644 bin/lib/anthropic.mjs create mode 100644 evals/README.md create mode 100644 evals/cases.json create mode 100644 evals/results.example.json create mode 100644 evals/run-evals.mjs create mode 100644 scripts/build-leaderboard.mjs diff --git a/.github/workflows/deploy-playground.yml b/.github/workflows/deploy-playground.yml index dd1bd46..ccfc6ea 100644 --- a/.github/workflows/deploy-playground.yml +++ b/.github/workflows/deploy-playground.yml @@ -41,6 +41,9 @@ jobs: - name: Build the static skill catalog (web/catalog.html) run: node scripts/build-docs.mjs + - name: Build the skill leaderboard (web/leaderboard.html) + run: node scripts/build-leaderboard.mjs + - name: Configure Pages uses: actions/configure-pages@v5 diff --git a/.gitignore b/.gitignore index 1959bc9..88a16b3 100644 --- a/.gitignore +++ b/.gitignore @@ -13,3 +13,4 @@ venv/ # Generated docs catalog (built in CI for Pages) web/catalog.html +web/leaderboard.html diff --git a/CHANGELOG.md b/CHANGELOG.md index cc0c4e5..3176111 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,7 +9,19 @@ each new wave of skills bumps the **major** version, 
extensions and fixes bump ## [Unreleased] -_Nothing yet._ +### Added +- **GitHub Action** ([`action/`](action/)) — run any skill in CI: `uses: + mohitagw15856/pm-claude-skills/action@main` to auto-write PR descriptions, + changelogs, release notes, or code-review checklists. Composite action + + dependency-free runner. +- **`generate` command** — `npx pm-claude-skills generate --from ` turns a + team's documentation into a `SKILL.md` that follows the authoring standard + (`bin/generate.mjs`, needs `ANTHROPIC_API_KEY`). +- **Skill evals + Leaderboard** — `evals/run-evals.mjs` scores skill output across models + with an LLM judge (structure / completeness / usefulness / grounding); + `scripts/build-leaderboard.mjs` renders a public `web/leaderboard.html` (built in the + Pages deploy, linked from the README, catalog, and playground). +- Shared, dependency-free Anthropic client (`bin/lib/anthropic.mjs`) used by all three. ## [19.0.0] — Security Auditor, Personas & Catalog — 2026-06-18 diff --git a/README.md b/README.md index 5b4398c..0e6da32 100644 --- a/README.md +++ b/README.md @@ -226,6 +226,30 @@ Then ask: *"search the skills for customer churn, then apply the best one to my --- +## ⚙️ AI-Powered Tooling + +Three ways to put the library to work beyond installing files: + +**🤖 Run a skill in your CI — [GitHub Action](action/).** Auto-write PR descriptions, changelogs, release notes, or run a code-review checklist on every PR: + +```yaml +- uses: mohitagw15856/pm-claude-skills/action@main + with: + skill: pr-description-writer + input: ${{ steps.diff.outputs.text }} + api_key: ${{ secrets.ANTHROPIC_API_KEY }} +``` + +**🏗️ Turn your docs into a skill — `generate`.** Point it at a URL or file and it writes a `SKILL.md` that follows the authoring standard: + +```bash +ANTHROPIC_API_KEY=sk-ant-… npx pm-claude-skills generate --from ./team-process.md +``` + +**🏆 Skill Leaderboard — [evals](evals/).** An LLM-as-judge harness scores each skill across Claude models on 
structure, completeness, usefulness, and grounding. **[View the leaderboard →](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html)** + +--- + ## 🌐 Skill Playground — Try Any Skill in Your Browser **▶ Live: [mohitagw15856.github.io/pm-claude-skills](https://mohitagw15856.github.io/pm-claude-skills/)** · 📚 [Browse the full skill catalog](https://mohitagw15856.github.io/pm-claude-skills/catalog.html) diff --git a/action/README.md b/action/README.md new file mode 100644 index 0000000..b6e55ee --- /dev/null +++ b/action/README.md @@ -0,0 +1,65 @@ +# PM Skills — GitHub Action + +Run any skill from this library inside **your** repo's CI. Turn the library's frameworks +into automation: auto-write PR descriptions, generate release notes and changelogs, or run +a code-review checklist — on every push or PR. + +```yaml +- uses: mohitagw15856/pm-claude-skills/action@main + with: + skill: pr-description-writer + input: ${{ steps.diff.outputs.text }} + api_key: ${{ secrets.ANTHROPIC_API_KEY }} +``` + +## Inputs + +| Input | Required | Description | +|---|---|---| +| `skill` | ✅ | Skill name, e.g. `pr-description-writer`, `changelog-generator`, `code-review-checklist`. | +| `input` | — | The text/context to run the skill on. | +| `input_file` | — | Read input from a file instead of `input`. | +| `api_key` | ✅ | Anthropic API key (store as a repo secret). | +| `model` | — | Model id (default `claude-sonnet-4-6`). | +| `output_file` | — | Also write the result to this file. | + +**Output:** `result` — the skill's output (use `output_file` for long, multi-line results). 
+
+## Example — auto-write a PR description
+
+```yaml
+name: PR description
+on: { pull_request: { types: [opened] } }
+permissions: { contents: read, pull-requests: write }
+jobs:
+  describe:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with: { fetch-depth: 0 }
+      - id: diff
+        run: |
+          echo "text<<EOF" >> "$GITHUB_OUTPUT"
+          git diff origin/${{ github.base_ref }}...HEAD --stat >> "$GITHUB_OUTPUT"
+          echo "EOF" >> "$GITHUB_OUTPUT"
+      - id: skill
+        uses: mohitagw15856/pm-claude-skills/action@main
+        with:
+          skill: pr-description-writer
+          input: ${{ steps.diff.outputs.text }}
+          api_key: ${{ secrets.ANTHROPIC_API_KEY }}
+      - uses: actions/github-script@v7
+        with:
+          script: |
+            github.rest.pulls.update({ owner: context.repo.owner, repo: context.repo.repo,
+              pull_number: context.issue.number, body: process.env.BODY })
+        env: { BODY: ${{ steps.skill.outputs.result }} }
+```
+
+## Other ideas
+
+- `skill: changelog-generator` from `git log` → write `CHANGELOG.md`.
+- `skill: release-notes` on tag push → set the GitHub Release body.
+- `skill: code-review-checklist` → post a review checklist as a PR comment.
+
+Pin to a release tag (e.g. `@v19`) for stability once you've tried `@main`.
diff --git a/action/action.yml b/action/action.yml
new file mode 100644
index 0000000..7bae11a
--- /dev/null
+++ b/action/action.yml
@@ -0,0 +1,51 @@
+name: 'PM Skills — Run a Skill'
+description: 'Run any pm-claude-skills SKILL.md in CI — auto PR descriptions, changelogs, release notes, code-review checklists, and more.'
+author: 'Mohit Aggarwal'
+branding:
+  icon: 'cpu'
+  color: 'purple'
+
+inputs:
+  skill:
+    description: 'Skill name to run (e.g. pr-description-writer, changelog-generator, code-review-checklist).'
+    required: true
+  input:
+    description: 'The input/context text the skill should work on.'
+    required: false
+  input_file:
+    description: 'Read the input from this file instead of the `input` string.'
+ required: false + api_key: + description: 'Anthropic API key (store it as a secret).' + required: true + model: + description: 'Claude model id.' + required: false + default: 'claude-sonnet-4-6' + output_file: + description: 'If set, also write the result to this file.' + required: false + max_tokens: + description: 'Max output tokens.' + required: false + default: '4096' + +outputs: + result: + description: 'The skill output (also use output_file for multi-line results).' + value: ${{ steps.run.outputs.result }} + +runs: + using: composite + steps: + - id: run + shell: bash + run: node "$GITHUB_ACTION_PATH/run.mjs" + env: + INPUT_SKILL: ${{ inputs.skill }} + INPUT_INPUT: ${{ inputs.input }} + INPUT_INPUT_FILE: ${{ inputs.input_file }} + INPUT_API_KEY: ${{ inputs.api_key }} + INPUT_MODEL: ${{ inputs.model }} + INPUT_OUTPUT_FILE: ${{ inputs.output_file }} + INPUT_MAX_TOKENS: ${{ inputs.max_tokens }} diff --git a/action/run.mjs b/action/run.mjs new file mode 100644 index 0000000..0b2780a --- /dev/null +++ b/action/run.mjs @@ -0,0 +1,58 @@ +#!/usr/bin/env node +// Runner for the pm-skills GitHub Action. Loads a bundled SKILL.md, runs it on +// the provided input via the Anthropic API, and exposes the result as a step +// output (and optionally a file). Inputs arrive as INPUT_* env vars. +import { readFileSync, existsSync, writeFileSync, appendFileSync } from 'node:fs'; +import { join, dirname } from 'node:path'; +import { fileURLToPath, pathToFileURL } from 'node:url'; +import { complete, parseSkill } from '../bin/lib/anthropic.mjs'; + +const ACTION_DIR = dirname(fileURLToPath(import.meta.url)); +const REPO_ROOT = join(ACTION_DIR, '..'); + +const inp = (name, def = '') => (process.env[`INPUT_${name.toUpperCase()}`] ?? def).trim(); + +// Pure: assemble the system prompt + user message for a skill run (testable offline). 
+export function buildRequest(skillBody, userInput) { + const system = skillBody + + '\n\n---\nExecute this skill now on the input below and produce the complete output. ' + + 'Do not ask follow-up questions — work with what is given and note any reasonable assumptions. ' + + 'Output only the finished artifact (no preamble).'; + return { system, messages: [{ role: 'user', content: userInput }] }; +} + +async function main() { + const skill = inp('skill'); + if (!skill) throw new Error('Input `skill` is required.'); + const apiKey = inp('api_key') || process.env.ANTHROPIC_API_KEY || ''; + const model = inp('model', 'claude-sonnet-4-6'); + const maxTokens = parseInt(inp('max_tokens', '4096'), 10) || 4096; + + let input = inp('input'); + const inputFile = inp('input_file'); + if (!input && inputFile && existsSync(inputFile)) input = readFileSync(inputFile, 'utf8'); + if (!input) throw new Error('Provide `input` or `input_file`.'); + + const skillFile = join(REPO_ROOT, 'skills', skill, 'SKILL.md'); + if (!existsSync(skillFile)) throw new Error(`Unknown skill "${skill}" (no skills/${skill}/SKILL.md).`); + const { body } = parseSkill(readFileSync(skillFile, 'utf8')); + + const { system, messages } = buildRequest(body, input); + console.log(`Running skill "${skill}" with ${model}…`); + const result = await complete({ apiKey, model, system, messages, maxTokens }); + + // Step output (multiline-safe heredoc) + optional file. + if (process.env.GITHUB_OUTPUT) { + const d = `EOF_${Math.random().toString(36).slice(2)}`; + appendFileSync(process.env.GITHUB_OUTPUT, `result<<${d}\n${result}\n${d}\n`); + } + const outFile = inp('output_file'); + if (outFile) { writeFileSync(outFile, result + '\n'); console.log(`Wrote ${outFile}`); } + + console.log('\n----- skill output -----\n' + result); +} + +// Run only when executed directly (so tests can import buildRequest). 
+if (import.meta.url === pathToFileURL(process.argv[1] || '').href) { + main().catch((e) => { console.error(`Error: ${e.message}`); process.exit(1); }); +} diff --git a/bin/cli.mjs b/bin/cli.mjs index 7ee823f..7c304e7 100755 --- a/bin/cli.mjs +++ b/bin/cli.mjs @@ -153,6 +153,8 @@ Examples: npx pm-claude-skills add --agent cursor # .mdc rules into ./.cursor/rules npx pm-claude-skills add --agent windsurf # .md rules into ./.windsurf/rules npx pm-claude-skills add --agent codex --link + + npx pm-claude-skills generate --from # turn your docs into a SKILL.md (needs ANTHROPIC_API_KEY) `; const opts = parse(process.argv.slice(2)); @@ -161,4 +163,9 @@ if (opts.version) console.log(VERSION); else if (opts.help || !cmd || cmd === 'help') console.log(HELP); else if (cmd === 'list') list(); else if (cmd === 'add') add(opts); +else if (cmd === 'generate') { + const { run } = await import('./generate.mjs'); + try { process.exit(await run(process.argv.slice(3))); } + catch (e) { console.error(`Error: ${e.message}`); process.exit(1); } +} else { console.error(`Unknown command: ${cmd}\n`); console.log(HELP); process.exit(2); } diff --git a/bin/generate.mjs b/bin/generate.mjs new file mode 100644 index 0000000..9ec07a5 --- /dev/null +++ b/bin/generate.mjs @@ -0,0 +1,109 @@ +// `pm-claude-skills generate` — turn a doc (URL or file) into a SKILL.md that +// follows this library's authoring standard. Uses the Anthropic API. +// +// ANTHROPIC_API_KEY=sk-ant-... npx pm-claude-skills generate --from ./process.md +// ... generate --from https://example.com/runbook --name incident-runbook +// ... generate --from notes.txt --out ./skills --dry-run +import { writeFileSync, mkdirSync, existsSync, readFileSync } from 'node:fs'; +import { join } from 'node:path'; +import { complete, parseSkill } from './lib/anthropic.mjs'; + +function getArg(argv, name, def) { + const i = argv.indexOf(`--${name}`); + return i !== -1 ? 
argv[i + 1] : def;
+}
+
+// Strip tags/scripts/styles from HTML to rough text (good enough for an LLM).
+function htmlToText(html) {
+  return html
+    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
+    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
+    .replace(/<[^>]+>/g, ' ')
+    .replace(/&[a-z]+;/gi, ' ')
+    .replace(/\s+/g, ' ')
+    .trim();
+}
+
+async function loadSource(from) {
+  if (/^https?:\/\//i.test(from)) {
+    const res = await fetch(from);
+    if (!res.ok) throw new Error(`Could not fetch ${from} (HTTP ${res.status}).`);
+    const text = await res.text();
+    return /
+description: ". Use when . Produces ."
+---
+
+# Skill
+
+<one-line value summary>
+
+## What This Skill Produces
+- <deliverables>
+
+## Required Inputs
+Ask for (if not provided):
+- <inputs to gather; never invent them>
+
+## Process
+1. <steps>
+
+## Output Format
+<a concrete template — headings/tables — of the final artifact>
+
+## Quality Checks
+- [ ] <checks the output must pass>
+
+## Anti-Patterns
+- [ ] Do not <mistakes this skill prevents>
+
+Rules: be specific to the documentation provided; turn its rules/process into the skill. The description MUST contain "Use when" and "Produces". Do not include any text outside the file.`;
+
+export async function run(argv) {
+  const from = getArg(argv, 'from');
+  if (!from || argv.includes('--help')) {
+    console.log('Usage: pm-claude-skills generate --from <url|file> [--name x] [--out dir] [--model m] [--dry-run]');
+    return from ?
0 : 1; + } + const apiKey = process.env.ANTHROPIC_API_KEY || ''; + if (!apiKey) { console.error('Set ANTHROPIC_API_KEY to generate a skill.'); return 1; } + const model = getArg(argv, 'model', 'claude-sonnet-4-6'); + const outDir = getArg(argv, 'out', 'skills'); + const dryRun = argv.includes('--dry-run'); + + console.error(`Reading ${from}…`); + const source = (await loadSource(from)).slice(0, 24000); // cap context + + console.error(`Generating a SKILL.md with ${model}…`); + const out = await complete({ + apiKey, model, system: META_PROMPT, + messages: [{ role: 'user', content: `Documentation to convert into a skill:\n\n${source}` }], + maxTokens: 3000, + }); + + const cleaned = out.replace(/^```[a-z]*\n?/i, '').replace(/\n?```$/i, '').trim(); + const { meta } = parseSkill(cleaned); + const name = getArg(argv, 'name', meta.name); + if (!name) { console.error('Could not determine a skill name — pass --name.'); return 1; } + + if (dryRun) { + console.log(cleaned); + console.error(`\n[dry-run] Would write ${join(outDir, name, 'SKILL.md')}`); + return 0; + } + const dir = join(outDir, name); + mkdirSync(dir, { recursive: true }); + writeFileSync(join(dir, 'SKILL.md'), cleaned + '\n'); + console.log(`Created ${join(dir, 'SKILL.md')}`); + console.log('Next: review it, then validate — node scripts/skillcheck.mjs && node scripts/skill-audit.mjs'); + return 0; +} diff --git a/bin/lib/anthropic.mjs b/bin/lib/anthropic.mjs new file mode 100644 index 0000000..409d5b4 --- /dev/null +++ b/bin/lib/anthropic.mjs @@ -0,0 +1,51 @@ +// Minimal, dependency-free Anthropic Messages API client (Node 18+ global fetch). +// Shared by the GitHub Action runner, the eval harness, and skill generation. +// No SDK, no install — just a thin POST wrapper. + +const API_URL = 'https://api.anthropic.com/v1/messages'; + +/** + * Call the Anthropic Messages API and return the concatenated text output. + * @param {object} o + * @param {string} o.apiKey - Anthropic API key. 
+ * @param {string} [o.model] - Model id (default claude-sonnet-4-6). + * @param {string} [o.system]- System prompt. + * @param {Array} o.messages- [{role, content}] messages. + * @param {number} [o.maxTokens] + * @returns {Promise<string>} + */ +export async function complete({ apiKey, model = 'claude-sonnet-4-6', system, messages, maxTokens = 4096 }) { + if (!apiKey) throw new Error('Missing Anthropic API key (set ANTHROPIC_API_KEY).'); + const res = await fetch(API_URL, { + method: 'POST', + headers: { + 'content-type': 'application/json', + 'x-api-key': apiKey, + 'anthropic-version': '2023-06-01', + }, + body: JSON.stringify({ model, max_tokens: maxTokens, ...(system ? { system } : {}), messages }), + }); + if (!res.ok) { + const body = await res.text().catch(() => ''); + throw new Error(`Anthropic API ${res.status}: ${body.slice(0, 500)}`); + } + const data = await res.json(); + return (data.content || []).map((c) => c.text || '').join('').trim(); +} + +/** Parse "name: value" YAML-ish frontmatter + body from a SKILL.md string. */ +export function parseSkill(text) { + const m = text.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/); + const meta = {}; + if (m) { + for (const line of m[1].split('\n')) { + const kv = line.match(/^(\w[\w-]*):\s*(.*)$/); + if (kv) { + let v = kv[2].trim(); + if ((v.startsWith('"') && v.endsWith('"')) || (v.startsWith("'") && v.endsWith("'"))) v = v.slice(1, -1); + meta[kv[1]] = v; + } + } + } + return { meta, body: m ? m[2].trim() : text.trim() }; +} diff --git a/evals/README.md b/evals/README.md new file mode 100644 index 0000000..a830901 --- /dev/null +++ b/evals/README.md @@ -0,0 +1,40 @@ +# Skill Evals + +An LLM-as-judge harness that scores skill output quality across models — so claims like +"production-ready" are backed by numbers, not vibes. Results render as a public +[Skill Leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html). 
+ +## What it measures + +For each [case](cases.json), a model runs the skill, then a **judge model** scores the +output 1–5 on four dimensions: + +- **structure** — follows a clear, expected structure +- **completeness** — covers what the task needs +- **usefulness** — specific and actually useful, not generic +- **grounding** — stays grounded in the input, no invented facts + +## Run it + +Needs an Anthropic API key (this calls the API and costs tokens): + +```bash +ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs +# --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001 +# --judge claude-opus-4-8 +node scripts/build-leaderboard.mjs # render web/leaderboard.html +``` + +`run-evals.mjs` writes `evals/results.json`; the leaderboard builder prefers it and falls +back to `results.example.json` (clearly labelled) so the page renders before you run real evals. + +## Add a case + +Append to [`cases.json`](cases.json): `{ "skill": "<name>", "input": "<a realistic prompt>" }`. +Keep inputs short but representative of how the skill is actually used. + +## Honesty notes + +- Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal. +- The judge sees the skill's stated purpose and the output, not the model name (reduces bias). +- Re-run after model upgrades; numbers drift. diff --git a/evals/cases.json b/evals/cases.json new file mode 100644 index 0000000..2b1b2e0 --- /dev/null +++ b/evals/cases.json @@ -0,0 +1,29 @@ +{ + "_comment": "Eval cases: a representative input per skill. Run with: node evals/run-evals.mjs", + "cases": [ + { + "skill": "rice-prioritisation", + "input": "Rank these for next quarter:\n1. Onboarding redesign — reach ~5000 users/qtr, big activation impact, ~3 person-months.\n2. Dark mode — ~8000 users want it, low impact, ~1 person-month.\n3. SSO for enterprise — ~400 accounts, high deal impact, ~4 person-months, low confidence." 
+ }, + { + "skill": "prd-template", + "input": "Feature: in-app referral program so existing users invite colleagues and both get a credit. Target: activated B2B users. Goal: grow signups 15% in Q3." + }, + { + "skill": "cs-health-scorecard", + "input": "Account: Acme Corp, enterprise, ARR $120k, renewal in 90 days. DAU/MAU 18%, 2 open P2 tickets, CSAT 7, exec sponsor left last month, seats 80/100 used, payments on time." + }, + { + "skill": "executive-summary", + "input": "Summarise: our Q2 retention dropped from 82% to 76% driven by a new onboarding flow that confused mobile users; we shipped a fix in week 10 and retention recovered to 80%; we recommend a full mobile onboarding rework next quarter." + }, + { + "skill": "competitive-analysis", + "input": "Analyse our position vs Notion and Coda for a lightweight team wiki aimed at small startups. We're cheaper and faster to set up but have fewer integrations." + }, + { + "skill": "sprint-planning", + "input": "Team of 5, 2-week sprint, average velocity 30 points, one engineer out 3 days. Backlog: checkout redesign (8), payment retries (5), analytics events (3), bug bash (3), API rate limiting (5)." + } + ] +} diff --git a/evals/results.example.json b/evals/results.example.json new file mode 100644 index 0000000..489abd9 --- /dev/null +++ b/evals/results.example.json @@ -0,0 +1,22 @@ +{ + "_comment": "EXAMPLE data so the leaderboard renders before you run real evals. Replace by running: ANTHROPIC_API_KEY=... 
node evals/run-evals.mjs", + "example": true, + "generatedAt": "2026-06-18T00:00:00.000Z", + "judge": "claude-opus-4-8", + "models": ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"], + "dimensions": ["structure", "completeness", "usefulness", "grounding"], + "results": [ + { "skill": "rice-prioritisation", "model": "claude-sonnet-4-6", "scores": {"structure":5,"completeness":5,"usefulness":5,"grounding":4}, "overall": 4.75 }, + { "skill": "rice-prioritisation", "model": "claude-haiku-4-5-20251001", "scores": {"structure":5,"completeness":4,"usefulness":4,"grounding":4}, "overall": 4.25 }, + { "skill": "prd-template", "model": "claude-sonnet-4-6", "scores": {"structure":5,"completeness":4,"usefulness":5,"grounding":4}, "overall": 4.5 }, + { "skill": "prd-template", "model": "claude-haiku-4-5-20251001", "scores": {"structure":4,"completeness":4,"usefulness":4,"grounding":4}, "overall": 4.0 }, + { "skill": "cs-health-scorecard", "model": "claude-sonnet-4-6", "scores": {"structure":5,"completeness":5,"usefulness":5,"grounding":5}, "overall": 5.0 }, + { "skill": "cs-health-scorecard", "model": "claude-haiku-4-5-20251001", "scores": {"structure":5,"completeness":4,"usefulness":4,"grounding":4}, "overall": 4.25 }, + { "skill": "executive-summary", "model": "claude-sonnet-4-6", "scores": {"structure":5,"completeness":5,"usefulness":4,"grounding":5}, "overall": 4.75 }, + { "skill": "executive-summary", "model": "claude-haiku-4-5-20251001", "scores": {"structure":5,"completeness":4,"usefulness":4,"grounding":5}, "overall": 4.5 }, + { "skill": "competitive-analysis", "model": "claude-sonnet-4-6", "scores": {"structure":4,"completeness":4,"usefulness":5,"grounding":4}, "overall": 4.25 }, + { "skill": "competitive-analysis", "model": "claude-haiku-4-5-20251001", "scores": {"structure":4,"completeness":4,"usefulness":4,"grounding":4}, "overall": 4.0 }, + { "skill": "sprint-planning", "model": "claude-sonnet-4-6", "scores": 
{"structure":5,"completeness":5,"usefulness":5,"grounding":5}, "overall": 5.0 }, + { "skill": "sprint-planning", "model": "claude-haiku-4-5-20251001", "scores": {"structure":5,"completeness":4,"usefulness":4,"grounding":5}, "overall": 4.5 } + ] +} diff --git a/evals/run-evals.mjs b/evals/run-evals.mjs new file mode 100644 index 0000000..9669fdd --- /dev/null +++ b/evals/run-evals.mjs @@ -0,0 +1,93 @@ +#!/usr/bin/env node +// Skill eval harness. For each case × model: run the skill, then score the output +// with an LLM judge on a fixed rubric. Writes evals/results.json — feed it to +// scripts/build-leaderboard.mjs to render web/leaderboard.html. +// +// Requires an Anthropic API key (this calls the API and costs tokens). +// +// Usage: +// ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs +// ... node evals/run-evals.mjs --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001 +// ... node evals/run-evals.mjs --judge claude-opus-4-8 --cases evals/cases.json +import { readFileSync, writeFileSync, existsSync } from 'node:fs'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { complete, parseSkill } from '../bin/lib/anthropic.mjs'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const root = join(__dirname, '..'); + +function arg(name, def) { + const i = process.argv.indexOf(`--${name}`); + return i !== -1 ? process.argv[i + 1] : def; +} + +const apiKey = process.env.ANTHROPIC_API_KEY || ''; +const models = arg('models', 'claude-sonnet-4-6,claude-haiku-4-5-20251001').split(',').map((s) => s.trim()); +const judge = arg('judge', 'claude-opus-4-8'); +const casesPath = arg('cases', join(__dirname, 'cases.json')); +const outPath = arg('out', join(__dirname, 'results.json')); + +const DIMENSIONS = ['structure', 'completeness', 'usefulness', 'grounding']; + +function runPrompt(skillBody) { + return skillBody + '\n\n---\nExecute this skill now on the input. 
Output only the finished artifact.'; +} + +function judgePrompt(description, output) { + return `You are a strict evaluator of a professional work artifact. + +The artifact was produced by a skill whose job is: +"${description}" + +Score the artifact below from 1 (poor) to 5 (excellent) on each dimension: +- structure: follows a clear, expected structure for this kind of output +- completeness: covers what the task needs, nothing important missing +- usefulness: actually useful to a professional, specific not generic +- grounding: stays grounded in the given input, no invented facts/metrics + +Return ONLY a JSON object, no prose: {"structure":N,"completeness":N,"usefulness":N,"grounding":N} + +--- ARTIFACT --- +${output}`; +} + +function parseScores(text) { + const m = text.match(/\{[\s\S]*\}/); + if (!m) throw new Error('judge did not return JSON'); + const j = JSON.parse(m[0]); + const s = {}; + for (const d of DIMENSIONS) s[d] = Math.max(1, Math.min(5, Number(j[d]) || 0)); + return s; +} + +async function main() { + if (!apiKey) { console.error('Set ANTHROPIC_API_KEY to run evals.'); process.exit(1); } + const { cases } = JSON.parse(readFileSync(casesPath, 'utf8')); + const results = []; + + for (const c of cases) { + const skillFile = join(root, 'skills', c.skill, 'SKILL.md'); + if (!existsSync(skillFile)) { console.error(`skip ${c.skill}: no SKILL.md`); continue; } + const { meta, body } = parseSkill(readFileSync(skillFile, 'utf8')); + for (const model of models) { + process.stderr.write(`Running ${c.skill} on ${model}… `); + try { + const output = await complete({ apiKey, model, system: runPrompt(body), messages: [{ role: 'user', content: c.input }], maxTokens: 3000 }); + const judged = await complete({ apiKey, model: judge, messages: [{ role: 'user', content: judgePrompt(meta.description || c.skill, output) }], maxTokens: 200 }); + const scores = parseScores(judged); + const overall = DIMENSIONS.reduce((a, d) => a + scores[d], 0) / DIMENSIONS.length; + 
results.push({ skill: c.skill, model, scores, overall: Math.round(overall * 100) / 100 }); + process.stderr.write(`${overall.toFixed(2)}/5\n`); + } catch (e) { + process.stderr.write(`FAILED (${e.message})\n`); + } + } + } + + const out = { generatedAt: new Date().toISOString(), judge, models, dimensions: DIMENSIONS, results }; + writeFileSync(outPath, JSON.stringify(out, null, 2)); + console.log(`\nWrote ${outPath} — ${results.length} scored runs. Build the page: node scripts/build-leaderboard.mjs`); +} + +main(); diff --git a/scripts/build-docs.mjs b/scripts/build-docs.mjs index 719141a..f618813 100644 --- a/scripts/build-docs.mjs +++ b/scripts/build-docs.mjs @@ -89,6 +89,7 @@ const html = `<!DOCTYPE html> <a href="https://mohitagw15856.github.io/pm-claude-skills/">▶ Live Playground</a> <a href="${REPO}">GitHub</a> <a href="${REPO}#-quick-install-2-minutes">Install</a> + <a href="leaderboard.html">Leaderboard</a> <a href="${REPO}/blob/main/TIERS.md">Tiers</a> </div> </header> diff --git a/scripts/build-leaderboard.mjs b/scripts/build-leaderboard.mjs new file mode 100644 index 0000000..c8c79e4 --- /dev/null +++ b/scripts/build-leaderboard.mjs @@ -0,0 +1,76 @@ +#!/usr/bin/env node +// Renders web/leaderboard.html from evals/results.json (or evals/results.example.json +// as a clearly-labelled placeholder). Run after evals/run-evals.mjs. No dependencies. +import { readFileSync, writeFileSync, existsSync } from 'node:fs'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const root = join(__dirname, '..'); +const REPO = 'https://github.com/mohitagw15856/pm-claude-skills'; + +const real = join(root, 'evals', 'results.json'); +const example = join(root, 'evals', 'results.example.json'); +const src = existsSync(real) ? 
real : example;
+const data = JSON.parse(readFileSync(src, 'utf8'));
+const isExample = !!data.example || src === example;
+
+const esc = (s) => String(s).replace(/[&<>"]/g, (c) => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' }[c]));
+const skills = [...new Set(data.results.map((r) => r.skill))].sort();
+const models = data.models || [...new Set(data.results.map((r) => r.model))];
+const cell = (skill, model) => data.results.find((r) => r.skill === skill && r.model === model);
+const colour = (v) => v >= 4.5 ? '#6ee7b7' : v >= 4 ? '#93c5fd' : v >= 3 ? '#fcd34d' : '#fca5a5';
+
+const modelAvg = (m) => {
+  const xs = data.results.filter((r) => r.model === m).map((r) => r.overall);
+  return xs.length ? (xs.reduce((a, b) => a + b, 0) / xs.length) : 0;
+};
+
+const headRow = `<tr><th>Skill</th>${models.map((m) => `<th>${esc(m)}</th>`).join('')}</tr>`;
+const rows = skills.map((s) => `<tr><td class="skill">${esc(s)}</td>${models.map((m) => {
+  const c = cell(s, m);
+  return c ? `<td><span class="score" style="color:${colour(c.overall)}">${c.overall.toFixed(2)}</span></td>` : '<td class="na">—</td>';
+}).join('')}</tr>`).join('\n');
+const avgRow = `<tr class="avg"><td>Average</td>${models.map((m) => `<td><strong>${modelAvg(m).toFixed(2)}</strong></td>`).join('')}</tr>`;
+
+const html = `<!DOCTYPE html>
+<html lang="en"><head>
+<meta charset="UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0" />
+<title>Skill Leaderboard — how pm-claude-skills score across Claude models
+
+
+

🏆 Skill Leaderboard

+

LLM-judged quality (1–5) for each skill across Claude models — scored on structure, completeness, usefulness & grounding by ${esc(data.judge || 'an LLM judge')}.

+ +
+
+ ${isExample ? '' : ''} + + ${headRow} + +${rows} +${avgRow} + +
+

Higher is better (max 5). ${esc(skills.length)} skills × ${esc(models.length)} models${data.generatedAt ? ` · generated ${esc(String(data.generatedAt).slice(0, 10))}` : ''}. Methodology and cases in evals/.

+
+`; + +writeFileSync(join(root, 'web', 'leaderboard.html'), html); +console.log(`Wrote web/leaderboard.html — ${skills.length} skills × ${models.length} models${isExample ? ' (EXAMPLE data)' : ''}.`); diff --git a/web/index.html b/web/index.html index 332e03e..4e7c47c 100644 --- a/web/index.html +++ b/web/index.html @@ -34,7 +34,7 @@
🔒 Your key is stored only in this browser and sent directly to api.anthropic.com — never to us. Get one at console.anthropic.com. - · 📚 Browse the full skill catalog + · 📚 Catalog · 🏆 Leaderboard