AI-powered tooling: GitHub Action, generate command, evals + leaderboard (#41)

Three features riding 2026 trends (agentic CI, codegen, evals), sharing one dependency-free Anthropic client (bin/lib/anthropic.mjs).

1. GitHub Action (action/) - run any skill in a consumer repo's CI: uses: mohitagw15856/pm-claude-skills/action@main. Composite action + run.mjs (loads the bundled SKILL.md, calls the API, exposes result as a step output / file). Docs with auto-PR-description example.
2. generate command - `npx pm-claude-skills generate --from <url|file>` turns a team's docs into a SKILL.md following the authoring standard (bin/generate.mjs, wired into the CLI; needs ANTHROPIC_API_KEY).
3. Skill evals + Leaderboard - evals/run-evals.mjs runs each case across models and scores output with an LLM judge (structure/completeness/usefulness/grounding); scripts/build-leaderboard.mjs renders web/leaderboard.html (built in the Pages deploy, falls back to clearly-labelled example data). Linked from README, catalog, and playground.

Offline-testable parts verified (prompt building, skill loading, graceful errors, leaderboard render). SkillCheck/audit/exports all green.

Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 08:37:40 +01:00
parent 288a340dbe
commit 51bf4be52f
17 changed files with 644 additions and 2 deletions
@@ -0,0 +1,40 @@
+# Skill Evals
+
+An LLM-as-judge harness that scores skill output quality across models — so claims like
+"production-ready" are backed by numbers, not vibes. Results render as a public
+[Skill Leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html).
+
+## What it measures
+
+For each [case](cases.json), a model runs the skill, then a **judge model** scores the
+output 1–5 on four dimensions:
+
+- **structure** — follows a clear, expected structure
+- **completeness** — covers what the task needs
+- **usefulness** — specific and actually useful, not generic
+- **grounding** — stays grounded in the input, no invented facts
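To make the four-dimension rubric above concrete, here is a minimal sketch of how one judge verdict might be validated and rolled up into a single number. The record shape, the `DIMENSIONS` constant, the `overallScore` helper, the integer-only scores, and the unweighted mean are all assumptions for illustration; the format that `run-evals.mjs` actually writes may differ.

```javascript
// Hypothetical sketch: names and record shape are assumptions, not the harness's real API.
const DIMENSIONS = ["structure", "completeness", "usefulness", "grounding"];

// Validate one judge verdict and return the unweighted mean of its four scores.
function overallScore(scores) {
  for (const dim of DIMENSIONS) {
    const value = scores[dim];
    // Assumption: the judge returns whole numbers on the 1-5 scale.
    if (!Number.isInteger(value) || value < 1 || value > 5) {
      throw new RangeError(`${dim} must be an integer from 1 to 5, got ${value}`);
    }
  }
  const total = DIMENSIONS.reduce((sum, dim) => sum + scores[dim], 0);
  return total / DIMENSIONS.length;
}

const example = { structure: 5, completeness: 4, usefulness: 4, grounding: 3 };
console.log(overallScore(example)); // 4
```

An unweighted mean treats all four dimensions as equally important; a leaderboard could just as reasonably report them separately.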
+
+## Run it
+
+Needs an Anthropic API key (this calls the API and costs tokens):
+
+```bash
+ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs
+#   --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001
+#   --judge  claude-opus-4-8
+node scripts/build-leaderboard.mjs       # render web/leaderboard.html
+```
+
+`run-evals.mjs` writes `evals/results.json`; the leaderboard builder prefers it and falls
+back to `results.example.json` (clearly labelled) so the page renders before you run real evals.
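The prefer-then-fall-back behaviour described above could look roughly like the sketch below. The two file names come from this README; the `pickResultsFile` helper, its return shape, and the assumption that both files live in `evals/` are hypothetical and may not match `scripts/build-leaderboard.mjs`.

```javascript
// Hypothetical sketch: helper name, return shape, and directory are assumptions.
// `exists` is any predicate that reports whether a path is present,
// for example `fs.existsSync` from Node's built-in `node:fs` module.
function pickResultsFile(exists, dir = "evals") {
  const real = `${dir}/results.json`;
  const example = `${dir}/results.example.json`;

  // Real eval output wins whenever it is available.
  if (exists(real)) return { file: real, isExample: false };

  // Otherwise use the example data and flag it so the page can label it.
  if (exists(example)) return { file: example, isExample: true };

  throw new Error(`No results found in ${dir}/`);
}

// Simulate a fresh checkout where only the example file is present.
const present = new Set(["evals/results.example.json"]);
console.log(pickResultsFile((p) => present.has(p)));
```

Returning an `isExample` flag alongside the path keeps the "clearly labelled" decision in one place, so the renderer does not have to infer it from the file name.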
+
+## Add a case
+
+Append to [`cases.json`](cases.json): `{ "skill": "<name>", "input": "<a realistic prompt>" }`.
+Keep inputs short but representative of how the skill is actually used.
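Before appending a case, a quick shape check along these lines might catch typos. This is a sketch only: `validateCase` is a hypothetical helper, the skill name in the example is made up, and the harness may enforce different or additional rules.

```javascript
// Hypothetical sketch: checks only the two fields shown in the README's case format.
function validateCase(entry) {
  if (entry === null || typeof entry !== "object" || Array.isArray(entry)) {
    return ["case must be a JSON object"];
  }
  const problems = [];
  for (const key of ["skill", "input"]) {
    if (typeof entry[key] !== "string" || entry[key].trim() === "") {
      problems.push(`"${key}" must be a non-empty string`);
    }
  }
  return problems; // an empty array means the case looks well-formed
}

// "example-skill" is a placeholder name, not a skill known to exist in the repo.
const candidate = { skill: "example-skill", input: "Draft a one-page brief for a checkout redesign." };
console.log(validateCase(candidate)); // []
console.log(validateCase({ skill: "example-skill" })); // [ '"input" must be a non-empty string' ]
```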
+
+## Honesty notes
+
+- Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal.
+- The judge sees the skill's stated purpose and the output, not the model name, which reduces bias toward any particular model.
+- Re-run after model upgrades; numbers drift.