pm-claude-skills/evals/README.md
mohitagw15856 51bf4be52f AI-powered tooling: GitHub Action, generate command, evals + leaderboard (#41)
Three features riding 2026 trends (agentic CI, codegen, evals), sharing one
dependency-free Anthropic client (bin/lib/anthropic.mjs).

1. GitHub Action (action/) — run any skill in a consumer repo's CI:
   uses: mohitagw15856/pm-claude-skills/action@main. Composite action +
   run.mjs (loads the bundled SKILL.md, calls the API, exposes result as a
   step output / file). Docs with auto-PR-description example.

2. generate command — `npx pm-claude-skills generate --from <url|file>` turns
   a team's docs into a SKILL.md following the authoring standard
   (bin/generate.mjs, wired into the CLI; needs ANTHROPIC_API_KEY).

3. Skill evals + Leaderboard — evals/run-evals.mjs runs each case across models
   and scores output with an LLM judge (structure/completeness/usefulness/
   grounding); scripts/build-leaderboard.mjs renders web/leaderboard.html
   (built in the Pages deploy, falls back to clearly-labelled example data).
   Linked from README, catalog, and playground.

Offline-testable parts verified (prompt building, skill loading, graceful
errors, leaderboard render). SkillCheck/audit/exports all green.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 08:37:40 +01:00


Skill Evals

An LLM-as-judge harness that scores skill output quality across models — so claims like "production-ready" are backed by numbers, not vibes. Results render as a public Skill Leaderboard.

What it measures

For each case, a model runs the skill, then a judge model scores the output from 1 to 5 on four dimensions:

  • structure — follows a clear, expected structure
  • completeness — covers what the task needs
  • usefulness — specific and actually useful, not generic
  • grounding — stays grounded in the input, no invented facts
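As a rough illustration of how the four dimension scores could be checked and combined, here is a minimal sketch. The score object shape, the integer-only assumption, the unweighted mean, and the names `DIMENSIONS`, `validateScores`, and `overallScore` are all hypothetical; the actual judge output format and aggregation in run-evals.mjs may differ.

```javascript
// Hypothetical score shape; the real judge output format may differ.
const DIMENSIONS = ["structure", "completeness", "usefulness", "grounding"];

// Throw if any dimension is missing or outside the 1 to 5 range.
// Assumes scores are integers, which this README does not confirm.
function validateScores(scores) {
  for (const dim of DIMENSIONS) {
    const value = scores[dim];
    if (!Number.isInteger(value) || value < 1 || value > 5) {
      throw new Error(`Invalid score for ${dim}: ${value}`);
    }
  }
  return scores;
}

// Unweighted mean across the four dimensions (an assumed aggregation).
function overallScore(scores) {
  validateScores(scores);
  const total = DIMENSIONS.reduce((sum, dim) => sum + scores[dim], 0);
  return total / DIMENSIONS.length;
}

const example = { structure: 5, completeness: 4, usefulness: 4, grounding: 3 };
console.log(overallScore(example)); // 4
```

Rejecting out-of-range values early keeps a malformed judge reply from silently skewing an average.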

Run it

Needs an Anthropic API key (this calls the API and costs tokens):

ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs
#   --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001
#   --judge  claude-opus-4-8
node scripts/build-leaderboard.mjs       # render web/leaderboard.html

run-evals.mjs writes evals/results.json; the leaderboard builder prefers it and falls back to results.example.json (clearly labelled) so the page renders before you run real evals.
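The prefer-then-fall-back behaviour can be sketched as below. This is not the code in scripts/build-leaderboard.mjs: the helper name `pickResultsFile`, the `isExample` flag, and the assumption that results.example.json sits in evals/ next to results.json are hypothetical. The existence check is passed in as a function so the sketch runs without touching the disk; in a real script it could be `fs.existsSync`.

```javascript
// Hypothetical helper; the actual logic in scripts/build-leaderboard.mjs may differ.
// `exists` is any function that reports whether a path exists (for example fs.existsSync).
function pickResultsFile(exists, dir = "evals") {
  const real = `${dir}/results.json`;
  const example = `${dir}/results.example.json`; // location is an assumption
  if (exists(real)) return { path: real, isExample: false };
  if (exists(example)) return { path: example, isExample: true };
  throw new Error(`No results file found in ${dir}/`);
}

// Simulate a fresh checkout where only the example data is present.
const onDisk = new Set(["evals/results.example.json"]);
const chosen = pickResultsFile((p) => onDisk.has(p));
console.log(chosen.path, chosen.isExample); // evals/results.example.json true
```

Carrying an `isExample` flag alongside the path is one way to let the renderer label example data clearly.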

Add a case

Append to cases.json: { "skill": "<name>", "input": "<a realistic prompt>" }. Keep inputs short but representative of how the skill is actually used.
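Before appending, a quick shape check can catch typos in a new case. The sketch below assumes cases.json holds a top-level array of case objects; `validateCase`, `appendCase`, and the skill name `example-skill` are hypothetical, and run-evals.mjs may not perform any such validation.

```javascript
// Hypothetical validator; run-evals.mjs may not perform these checks.
function validateCase(entry) {
  const errors = [];
  for (const key of ["skill", "input"]) {
    if (typeof entry[key] !== "string" || entry[key].trim() === "") {
      errors.push(`"${key}" must be a non-empty string`);
    }
  }
  return errors;
}

// Assumes cases.json holds a top-level array of case objects.
// Returns a new array rather than mutating the one passed in.
function appendCase(cases, entry) {
  const errors = validateCase(entry);
  if (errors.length > 0) throw new Error(errors.join("; "));
  return [...cases, entry];
}

const existing = [];
const updated = appendCase(existing, {
  skill: "example-skill", // placeholder name, not a real skill in this repo
  input: "Summarise last week's customer interviews into three themes.",
});
console.log(updated.length); // 1
```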

Honesty notes

  • Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal.
  • The judge sees the skill's stated purpose and the output, not the model name (reduces bias).
  • Re-run after model upgrades; numbers drift.
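To illustrate the second note, here is one way a judge prompt could be built so that the model name never reaches the judge. The prompt wording, the `buildJudgePrompt` name, and the `run` record shape are hypothetical; the real prompt in run-evals.mjs is not shown in this README.

```javascript
// Hypothetical prompt builder; the real judge prompt is not shown in this README.
// It accepts only the skill's purpose and the output, so a model name cannot leak in.
function buildJudgePrompt({ purpose, output }) {
  return [
    "You are grading the output of a skill.",
    `Skill purpose: ${purpose}`,
    "Score the output from 1 to 5 on structure, completeness, usefulness, and grounding.",
    "Output to grade:",
    output,
  ].join("\n");
}

// Illustrative run record; "model-a" is a placeholder, not a real model ID.
const run = { model: "model-a", purpose: "Draft a PRD", output: "## Problem\n..." };

// Pass only the fields the judge should see; the model name stays behind.
const prompt = buildJudgePrompt({ purpose: run.purpose, output: run.output });
console.log(prompt.includes(run.model)); // false
```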