Commit Graph

3 Commits

Author SHA1 Message Date
mohitagw15856 4209963cff Leaderboard workflow: open a PR instead of pushing to protected main (#45)
The eval run worked (12 scored runs) but the final step failed: it pushed
evals/results.json directly to main, which the branch ruleset blocks
("Changes must be made through a pull request").

- eval-leaderboard.yml: replace the direct commit/push with
  peter-evans/create-pull-request@v7 (branch eval-results), add
  pull-requests: write. Merging that PR triggers the Pages deploy (which
  watches evals/results.json) to publish real numbers.
- evals/README documents the PR flow + the required "Allow GitHub Actions to
  create and approve pull requests" setting.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 13:33:15 +01:00
mohitagw15856 827d7f62ec Make evals fast and hang-proof (timeout, retry, concurrency) (#44)
The "Run evals" step ran 24 API calls sequentially with no request timeout, so
it was slow and could stall indefinitely if one call hung.

- bin/lib/anthropic.mjs: per-request timeout (120s) via AbortController + retry
  (2x, backoff) on 429/5xx/timeout. Fails fast on 4xx (bad key/model).
- evals/run-evals.mjs: run (case × model) tasks through a concurrency pool
  (default 4, --concurrency to tune); preserves result order.
- eval-leaderboard.yml: job timeout-minutes: 20 as a safety net.

Applies to the next run. The hardening also benefits the Action runner and
`generate`, which share the client.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 13:30:06 +01:00
mohitagw15856 edb663ad72 CI workflow to run evals and update the leaderboard (#43)
Lets the leaderboard show real numbers without a local key: the new
"Update Skill Leaderboard" workflow (workflow_dispatch) runs the eval harness
with the ANTHROPIC_API_KEY secret, commits evals/results.json, and the Pages
deploy re-renders the public leaderboard with real data.

- .github/workflows/eval-leaderboard.yml: manual trigger, contents: write,
  runs run-evals.mjs + build-leaderboard.mjs, commits results.json.
- deploy-playground.yml: also trigger on evals/results.json (and the build
  scripts) so the committed results refresh the live page.
- evals/README + CHANGELOG document the CI route.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 12:58:45 +01:00