Files

T

mohitagw15856 edb663ad72 CI workflow to run evals and update the leaderboard (#43 )

Lets the leaderboard show real numbers without a local key: the new
"Update Skill Leaderboard" workflow (workflow_dispatch) runs the eval harness
with the ANTHROPIC_API_KEY secret, commits evals/results.json, and the Pages
deploy re-renders the public leaderboard with real data.

- .github/workflows/eval-leaderboard.yml: manual trigger, contents: write,
  runs run-evals.mjs + build-leaderboard.mjs, commits results.json.
- deploy-playground.yml: also trigger on evals/results.json (and the build
  scripts) so the committed results refresh the live page.
- evals/README + CHANGELOG document the CI route.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>

2026-06-18 12:58:45 +01:00

cases.json

AI-powered tooling: GitHub Action, generate command, evals + leaderboard (#41 )

2026-06-18 08:37:40 +01:00

README.md

CI workflow to run evals and update the leaderboard (#43 )

2026-06-18 12:58:45 +01:00

results.example.json

AI-powered tooling: GitHub Action, generate command, evals + leaderboard (#41 )

2026-06-18 08:37:40 +01:00

run-evals.mjs

AI-powered tooling: GitHub Action, generate command, evals + leaderboard (#41 )

2026-06-18 08:37:40 +01:00

README.md

Skill Evals

An LLM-as-judge harness that scores skill output quality across models — so claims like "production-ready" are backed by numbers, not vibes. Results render as a public Skill Leaderboard.

What it measures

For each case, a model runs the skill, then a judge model scores the output 1–5 on four dimensions:

structure — follows a clear, expected structure
completeness — covers what the task needs
usefulness — specific and actually useful, not generic
grounding — stays grounded in the input, no invented facts

Run it

Needs an Anthropic API key (this calls the API and costs tokens):

ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs
#   --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001
#   --judge  claude-opus-4-8
node scripts/build-leaderboard.mjs       # render web/leaderboard.html

run-evals.mjs writes evals/results.json; the leaderboard builder prefers it and falls back to results.example.json (clearly labelled) so the page renders before you run real evals.

No local key? Run it in CI

Add an ANTHROPIC_API_KEY repo secret, then go to Actions → "Update Skill Leaderboard" → Run workflow. It runs the evals, commits evals/results.json, and the Pages deploy re-renders the public leaderboard with real numbers — no laptop required.

Add a case

Append to cases.json: { "skill": "<name>", "input": "<a realistic prompt>" }. Keep inputs short but representative of how the skill is actually used.

Honesty notes

Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal.
The judge sees the skill's stated purpose and the output, not the model name (reduces bias).
Re-run after model upgrades; numbers drift.

README.md Unescape Escape

Skill Evals

What it measures

Run it

No local key? Run it in CI

Add a case

Honesty notes

README.md