CI workflow to run evals and update the leaderboard (#43)
Lets the leaderboard show real numbers without a local key: the new "Update Skill Leaderboard" workflow (workflow_dispatch) runs the eval harness with the ANTHROPIC_API_KEY secret, commits evals/results.json, and the Pages deploy re-renders the public leaderboard with real data.

- .github/workflows/eval-leaderboard.yml: manual trigger, contents: write, runs run-evals.mjs + build-leaderboard.mjs, commits results.json.
- deploy-playground.yml: also trigger on evals/results.json (and the build scripts) so the committed results refresh the live page.
- evals/README + CHANGELOG document the CI route.

Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
@@ -10,6 +10,10 @@ on:
     paths:
       - 'skills/**'
       - 'web/**'
+      - 'evals/results.json'
+      - 'skill-tiers.json'
+      - 'scripts/build-docs.mjs'
+      - 'scripts/build-leaderboard.mjs'
       - '.github/workflows/deploy-playground.yml'
   workflow_dispatch:
 
@@ -0,0 +1,67 @@
+name: Update Skill Leaderboard
+
+# Runs the eval harness with your ANTHROPIC_API_KEY secret, commits the real
+# results (evals/results.json), and lets the Pages deploy re-render the public
+# leaderboard with real numbers. Manual trigger so it never burns tokens by
+# surprise. (Uncomment the schedule to re-run, e.g. monthly, after model upgrades.)
+
+on:
+  workflow_dispatch:
+    inputs:
+      models:
+        description: 'Comma-separated model ids to score'
+        required: false
+        default: 'claude-sonnet-4-6,claude-haiku-4-5-20251001'
+      judge:
+        description: 'Judge model id'
+        required: false
+        default: 'claude-opus-4-8'
+  # schedule:
+  #   - cron: '0 6 1 * *'  # 06:00 on the 1st of each month
+
+permissions:
+  contents: write
+
+concurrency:
+  group: eval-leaderboard
+  cancel-in-progress: false
+
+jobs:
+  evaluate:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Node
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+
+      - name: Run evals
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          if [ -z "$ANTHROPIC_API_KEY" ]; then
+            echo "::error::ANTHROPIC_API_KEY secret is not set. Add it in Settings → Secrets and variables → Actions."
+            exit 1
+          fi
+          node evals/run-evals.mjs \
+            --models "${{ github.event.inputs.models || 'claude-sonnet-4-6,claude-haiku-4-5-20251001' }}" \
+            --judge "${{ github.event.inputs.judge || 'claude-opus-4-8' }}"
+
+      - name: Build the leaderboard page (sanity check)
+        run: node scripts/build-leaderboard.mjs
+
+      - name: Commit results
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+          git add evals/results.json
+          if git diff --cached --quiet; then
+            echo "No change in results."
+          else
+            git commit -m "chore(evals): refresh leaderboard results"
+            git push
+            echo "Committed evals/results.json — the Pages deploy will render real numbers."
+          fi
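Review note: the "Run evals" step passes a comma-separated `--models` list and a single `--judge` id to the eval script. This commit does not show how `evals/run-evals.mjs` reads them, so the following is only a minimal sketch of one way a Node script could consume those flags. The function name `parseEvalArgs` and the in-script defaults are assumptions for illustration, and the snippet is written as CommonJS so it runs standalone.

```javascript
// Hypothetical sketch: NOT the real evals/run-evals.mjs, whose source is not shown here.
const { parseArgs } = require('node:util');

// Assumed defaults, copied from the workflow_dispatch inputs in the workflow file.
const DEFAULT_MODELS = 'claude-sonnet-4-6,claude-haiku-4-5-20251001';
const DEFAULT_JUDGE = 'claude-opus-4-8';

function parseEvalArgs(argv) {
  // parseArgs is strict by default, so an unknown flag throws instead of being ignored.
  const { values } = parseArgs({
    args: argv,
    options: {
      models: { type: 'string' },
      judge: { type: 'string' },
    },
  });

  // Split the comma-separated list, trim whitespace, and drop empty entries.
  const models = (values.models ?? DEFAULT_MODELS)
    .split(',')
    .map((id) => id.trim())
    .filter(Boolean);

  return { models, judge: values.judge ?? DEFAULT_JUDGE };
}

// Example call with placeholder ids (in a real script, pass process.argv.slice(2)).
console.log(parseEvalArgs(['--models', 'model-a, model-b', '--judge', 'model-c']));
```

Trimming and filtering keeps a stray space or trailing comma in the "Run workflow" form from producing an empty model id.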
@@ -9,7 +9,12 @@ each new wave of skills bumps the **major** version, extensions and fixes bump
 
 ## [Unreleased]
 
-_Nothing yet._
+### Added
+- **One-click leaderboard updates in CI** — `.github/workflows/eval-leaderboard.yml`
+  ("Update Skill Leaderboard") runs the evals with the `ANTHROPIC_API_KEY` secret, commits
+  `evals/results.json`, and the Pages deploy re-renders the public leaderboard with real
+  numbers — no local key needed. The deploy workflow now also triggers on
+  `evals/results.json`.
 
 ## [20.0.0] — Agentic Tooling — 2026-06-18
 
@@ -28,6 +28,12 @@ node scripts/build-leaderboard.mjs # render web/leaderboard.html
 `run-evals.mjs` writes `evals/results.json`; the leaderboard builder prefers it and falls
 back to `results.example.json` (clearly labelled) so the page renders before you run real evals.
 
+### No local key? Run it in CI
+
+Add an `ANTHROPIC_API_KEY` repo secret, then go to **Actions → "Update Skill Leaderboard"
+→ Run workflow**. It runs the evals, commits `evals/results.json`, and the Pages deploy
+re-renders the public leaderboard with real numbers — no laptop required.
+
 ## Add a case
 
 Append to [`cases.json`](cases.json): `{ "skill": "<name>", "input": "<a realistic prompt>" }`.
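Review note: the README context says the leaderboard builder prefers `evals/results.json` and falls back to `results.example.json`. The source of `scripts/build-leaderboard.mjs` is not part of this diff, so the sketch below only illustrates that "prefer real, fall back to example" rule. `pickResultsFile` and the `isExample` flag are hypothetical names, and the snippet is written as CommonJS so it runs standalone.

```javascript
// Hypothetical sketch: NOT the real scripts/build-leaderboard.mjs.
const fs = require('node:fs');
const path = require('node:path');

// Prefer committed real results; otherwise fall back to the labelled example data.
function pickResultsFile(evalsDir) {
  const real = path.join(evalsDir, 'results.json');
  if (fs.existsSync(real)) {
    return { file: real, isExample: false };
  }
  // isExample lets the page renderer label the numbers as sample data.
  return { file: path.join(evalsDir, 'results.example.json'), isExample: true };
}

// Example call against the repo's evals/ directory.
console.log(pickResultsFile('evals'));
```

Under this rule, once the workflow commits `evals/results.json`, the next Pages build would pick it up without any change to the builder.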