4209963cff
The eval run worked (12 scored runs) but the final step failed: it pushed
evals/results.json directly to main, which the branch ruleset blocks
("Changes must be made through a pull request").
- eval-leaderboard.yml: replace the direct commit/push with
peter-evans/create-pull-request@v7 (branch eval-results), add
pull-requests: write. Merging that PR triggers the Pages deploy (which
watches evals/results.json) to publish real numbers.
- evals/README documents the PR flow + the required "Allow GitHub Actions to
create and approve pull requests" setting.
Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
Skill Evals
An LLM-as-judge harness that scores skill output quality across models — so claims like "production-ready" are backed by numbers, not vibes. Results render as a public Skill Leaderboard.
What it measures
For each case, a model runs the skill, then a judge model scores the output 1–5 on four dimensions:
- structure — follows a clear, expected structure
- completeness — covers what the task needs
- usefulness — specific and actually useful, not generic
- grounding — stays grounded in the input, no invented facts
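To illustrate how the four 1–5 scores could be rolled up into a single number per case, here is a minimal sketch. The verdict shape, the integer-only check, and the `meanScore` helper are assumptions made for illustration; the harness's real schema and aggregation may differ.

```javascript
// Hypothetical verdict shape: the real results.json schema may differ.
const DIMENSIONS = ["structure", "completeness", "usefulness", "grounding"];

// Check that every dimension has an integer score from 1 to 5 (an assumption),
// then return the unweighted mean across the four dimensions.
function meanScore(verdict) {
  for (const dim of DIMENSIONS) {
    const score = verdict[dim];
    if (!Number.isInteger(score) || score < 1 || score > 5) {
      throw new RangeError(`${dim} must be an integer from 1 to 5, got ${score}`);
    }
  }
  const total = DIMENSIONS.reduce((sum, dim) => sum + verdict[dim], 0);
  return total / DIMENSIONS.length;
}

const example = { structure: 5, completeness: 4, usefulness: 4, grounding: 3 };
console.log(meanScore(example)); // 4
```

Rejecting out-of-range values up front keeps a malformed judge reply from silently skewing an average.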
Run it
Needs an Anthropic API key (this calls the API and costs tokens):
ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs
# --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001
# --judge claude-opus-4-8
node scripts/build-leaderboard.mjs # render web/leaderboard.html
run-evals.mjs writes evals/results.json; the leaderboard builder prefers it and falls
back to results.example.json (clearly labelled) so the page renders before you run real evals.
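The prefer-then-fall-back rule described above can be sketched as follows. This is not the code in build-leaderboard.mjs: `pickResultsFile` is a hypothetical helper, and placing results.example.json under evals/ is an assumption.

```javascript
// Hypothetical helper mirroring "prefer real results, else use the example file".
// `exists` is any path predicate (for instance existsSync from node:fs); it is
// passed in so the sketch has no file-system dependency of its own.
function pickResultsFile(exists) {
  const real = "evals/results.json";
  const example = "evals/results.example.json"; // directory is an assumption
  if (exists(real)) {
    return { path: real, isExample: false };
  }
  // The flag lets the renderer label the page as example data.
  return { path: example, isExample: true };
}

// Simulate a fresh checkout where no real results exist yet.
console.log(pickResultsFile(() => false));
```

Returning an `isExample` flag alongside the path is one way to keep the "clearly labelled" promise in the rendered page.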
No local key? Run it in CI
- Add an ANTHROPIC_API_KEY repo secret.
- Enable Settings → Actions → General → Workflow permissions → "Allow GitHub Actions to create and approve pull requests" (so the workflow can open its results PR; main requires PRs).
- Actions → "Update Skill Leaderboard" → Run workflow. It runs the evals and opens a PR with evals/results.json.
Merge that PR and the Pages deploy re-renders the public leaderboard with real numbers, no laptop required.
Add a case
Append to cases.json: { "skill": "<name>", "input": "<a realistic prompt>" }.
Keep inputs short but representative of how the skill is actually used.
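A small check before appending can catch malformed entries early. The sketch below assumes cases.json holds a top-level array of `{ skill, input }` objects; `validateCase`, `appendCase`, and the `example-skill` name are hypothetical and not part of the harness.

```javascript
// Hypothetical validator: field names come from the README, the checks are assumptions.
function validateCase(entry) {
  for (const field of ["skill", "input"]) {
    if (typeof entry[field] !== "string" || entry[field].trim() === "") {
      throw new TypeError(`case.${field} must be a non-empty string`);
    }
  }
  return { skill: entry.skill.trim(), input: entry.input.trim() };
}

// Return a new array instead of mutating, so a failed validation leaves `cases` untouched.
function appendCase(cases, entry) {
  return [...cases, validateCase(entry)];
}

const cases = appendCase([], {
  skill: "example-skill", // placeholder name
  input: "Summarize this changelog for a release note.",
});
console.log(JSON.stringify(cases, null, 2));
```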
Honesty notes
- Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal.
- The judge sees the skill's stated purpose and the output, not the model name (reduces bias).
- Re-run after model upgrades; numbers drift.
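The blinding mentioned in the notes above (the judge sees the purpose and the output, not the model name) might look like the sketch below. The run record fields and `buildJudgeInput` are hypothetical; the harness may assemble its judge prompt differently.

```javascript
// Hypothetical run record: { model, purpose, output }.
// Only `purpose` and `output` are forwarded, so the judge cannot key on the model name.
function buildJudgeInput(run) {
  const { purpose, output } = run; // `model` is deliberately left out
  return { purpose, output };
}

const run = {
  model: "model-under-test", // placeholder, never shown to the judge
  purpose: "Draft a release note from a changelog.",
  output: "Version 1.2 adds ...",
};
console.log(Object.keys(buildJudgeInput(run))); // [ 'purpose', 'output' ]
```

Building the judge input from an explicit allow-list of fields, rather than deleting `model` from a copy, means any field added to the run record later stays hidden by default.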