Mohit 7b02261a3c feat: compare-mode demo GIF, expanded eval cases, sample-generation workflow
- Add compare-mode demo GIF + its Playwright recorder; embed in README eval section
- Expand evals/cases.json (6 → 15 flagship skills) so more skills can be
  eval-scored and sample-generated
- Add --generate-missing mode to build-samples.mjs
- Add generate-samples.yml: workflow_dispatch job that generates real sample
  outputs via the ANTHROPIC_API_KEY secret (key never leaves GitHub) and commits

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:05:17 +01:00

Skill Evals

An LLM-as-judge harness that scores skill output quality across models — so claims like "production-ready" are backed by numbers, not vibes. Results render as a public Skill Leaderboard.

What it measures

For each case, a model runs the skill, then a judge model scores the output from 1 to 5 on four dimensions:

  • structure — follows a clear, expected structure
  • completeness — covers what the task needs
  • usefulness — specific and actually useful, not generic
  • grounding — stays grounded in the input, no invented facts
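
As a rough illustration of how one verdict over these four dimensions could be reduced to a single number, here is a minimal sketch. It assumes the judge returns one integer from 1 to 5 per dimension and that a plain mean is an acceptable summary. The `meanScore` name and the verdict shape are hypothetical, not taken from run-evals.mjs.

```javascript
// Assumed verdict shape: one integer score (1 to 5) per dimension.
const DIMENSIONS = ["structure", "completeness", "usefulness", "grounding"];

// Hypothetical helper: validate a single verdict and return its mean score.
function meanScore(verdict) {
  for (const dim of DIMENSIONS) {
    const value = verdict[dim];
    if (!Number.isInteger(value) || value < 1 || value > 5) {
      throw new RangeError(`${dim} must be an integer from 1 to 5, got ${value}`);
    }
  }
  const total = DIMENSIONS.reduce((sum, dim) => sum + verdict[dim], 0);
  return total / DIMENSIONS.length;
}

const exampleVerdict = { structure: 5, completeness: 4, usefulness: 3, grounding: 4 };
console.log(meanScore(exampleVerdict)); // 4
```

Rejecting out-of-range values early keeps a malformed judge reply from silently skewing an average.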

Run it

Needs an Anthropic API key (this calls the API and costs tokens):

ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs
#   --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001
#   --judge  claude-opus-4-8
node scripts/build-leaderboard.mjs       # render web/leaderboard.html

run-evals.mjs writes evals/results.json; the leaderboard builder prefers it and falls back to results.example.json (clearly labelled) so the page renders before you run real evals.
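
The prefer-then-fall-back behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the code in build-leaderboard.mjs: `pickResultsFile` is a hypothetical name, and the sketch assumes results.example.json lives next to results.json under evals/. The existence check is passed in as a function so the sketch does no file I/O itself.

```javascript
// Hypothetical helper. `fileExists` is any function (path) => boolean,
// for example existsSync from node:fs when run inside Node.
function pickResultsFile(fileExists) {
  const real = "evals/results.json";
  // Assumed location: the README does not state the example file's directory.
  const example = "evals/results.example.json";
  if (fileExists(real)) {
    return { path: real, isExample: false };
  }
  // isExample lets the page label the numbers as sample data.
  return { path: example, isExample: true };
}

// Simulate a fresh checkout where no real results exist yet.
const choice = pickResultsFile(() => false);
console.log(choice.path, choice.isExample);
```

Returning an `isExample` flag alongside the path is one way to keep the "clearly labelled" promise: the renderer can decide how to mark the page without re-deriving which file was chosen.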

No local key? Run it in CI

  1. Add an ANTHROPIC_API_KEY repo secret.
  2. Enable Settings → Actions → General → Workflow permissions → "Allow GitHub Actions to create and approve pull requests" (so the workflow can open its results PR — main requires PRs).
  3. Actions → "Update Skill Leaderboard" → Run workflow. It runs the evals and opens a PR with evals/results.json. Merge that PR and the Pages deploy re-renders the public leaderboard with real numbers — no laptop required.

Add a case

Append to cases.json: { "skill": "<name>", "input": "<a realistic prompt>" }. Keep inputs short but representative of how the skill is actually used.
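
Before committing a new case, a quick shape check can catch typos. The sketch below assumes each entry is an object with exactly the two string fields shown above. `validateCase` is a hypothetical helper, not part of the repo.

```javascript
// Hypothetical validator for one cases.json entry.
// Returns a list of problems; an empty list means the entry looks fine.
function validateCase(entry) {
  const problems = [];
  for (const key of ["skill", "input"]) {
    if (typeof entry?.[key] !== "string" || entry[key].trim() === "") {
      problems.push(`"${key}" must be a non-empty string`);
    }
  }
  return problems;
}

// Placeholder values, as in the template above.
console.log(validateCase({ skill: "<name>", input: "<a realistic prompt>" })); // []
console.log(validateCase({ skill: "<name>" }).length); // 1
```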

Honesty notes

  • Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal.
  • The judge sees the skill's stated purpose and the output, not the model name (reduces bias).
  • Re-run after model upgrades; numbers drift.
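
To make the blind-judging note concrete, here is a minimal sketch of a prompt builder that receives a full run record but passes only the skill's purpose and the output to the judge. The function name, the record fields, and the prompt wording are all illustrative assumptions, not the harness's actual prompt.

```javascript
// Hypothetical prompt builder: only skillPurpose and output are destructured,
// so the model name never reaches the judge.
function buildJudgePrompt({ skillPurpose, output }) {
  return [
    "You are grading the output of a skill.",
    `Skill purpose: ${skillPurpose}`,
    "Score the output from 1 to 5 on structure, completeness, usefulness, and grounding.",
    "Output to grade:",
    output,
  ].join("\n");
}

// Assumed run record shape; field names are made up for this example.
const run = {
  model: "claude-opus-4-8",
  skillPurpose: "Summarise a changelog for release notes",
  output: "Added compare mode. Expanded eval cases.",
};
const judgePrompt = buildJudgePrompt(run);
console.log(judgePrompt.includes(run.model)); // false
```

Note that this only hides the name the harness knows about. If an output names its own model, the judge will still see it.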