CI workflow to run evals and update the leaderboard (#43)

Lets the leaderboard show real numbers without a local key: the new "Update Skill Leaderboard" workflow (workflow_dispatch) runs the eval harness with the ANTHROPIC_API_KEY secret and commits evals/results.json; the Pages deploy then re-renders the public leaderboard with real data.

- .github/workflows/eval-leaderboard.yml: manual trigger, contents: write, runs run-evals.mjs + build-leaderboard.mjs, commits results.json.
- deploy-playground.yml: also trigger on evals/results.json (and the build scripts) so the committed results refresh the live page.
- evals/README + CHANGELOG document the CI route.

Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 12:58:45 +01:00
parent 3ccfd6b5c7
commit edb663ad72
4 changed files with 83 additions and 1 deletion
@@ -28,6 +28,12 @@ node scripts/build-leaderboard.mjs       # render web/leaderboard.html
 `run-evals.mjs` writes `evals/results.json`; the leaderboard builder prefers it and falls
 back to `results.example.json` (clearly labelled) so the page renders before you run real evals.

+### No local key? Run it in CI
+
+Add an `ANTHROPIC_API_KEY` repo secret, then go to **Actions → "Update Skill Leaderboard"
+→ Run workflow**. It runs the evals, commits `evals/results.json`, and the Pages deploy
+re-renders the public leaderboard with real numbers — no laptop required.
+
 ## Add a case

 Append to [`cases.json`](cases.json): `{ "skill": "<name>", "input": "<a realistic prompt>" }`.
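
The README text in this hunk says the leaderboard builder prefers `evals/results.json` and falls back to `results.example.json`. A minimal sketch of that preference, assuming a plain existence check, might look like the following. `pickResultsFile` and the `isExample` flag are hypothetical names chosen for illustration; the actual logic in `build-leaderboard.mjs` is not shown in this commit and may differ. The sketch uses `require` so it runs as a standalone script, whereas the repo's scripts are ES modules.

```javascript
// Sketch only: pickResultsFile is a hypothetical helper, not code from build-leaderboard.mjs.
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Return the results file to render, plus a flag the page could use
// to label example data clearly.
function pickResultsFile(evalsDir) {
  const real = path.join(evalsDir, "results.json");
  const example = path.join(evalsDir, "results.example.json");
  if (fs.existsSync(real)) {
    return { file: real, isExample: false };
  }
  return { file: example, isExample: true };
}

// Demo in a throwaway directory: only the example file exists at first,
// then a real results.json appears (as it would after the CI workflow commits it).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "evals-"));
fs.writeFileSync(path.join(demoDir, "results.example.json"), "{}");
const before = pickResultsFile(demoDir);
fs.writeFileSync(path.join(demoDir, "results.json"), "{}");
const after = pickResultsFile(demoDir);

console.log(before.isExample, after.isExample); // prints: true false
```

Returning a flag alongside the path keeps the "clearly labelled" behaviour in one place: whatever renders the page can decide how to mark example data without re-checking the filesystem.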