mohitagw15856 4209963cff Leaderboard workflow: open a PR instead of pushing to protected main (#45)
The eval run worked (12 scored runs) but the final step failed: it pushed
evals/results.json directly to main, which the branch ruleset blocks
("Changes must be made through a pull request").

- eval-leaderboard.yml: replace the direct commit/push with
  peter-evans/create-pull-request@v7 (branch eval-results), add
  pull-requests: write. Merging that PR triggers the Pages deploy (which
  watches evals/results.json) to publish real numbers.
- evals/README documents the PR flow + the required "Allow GitHub Actions to
  create and approve pull requests" setting.


Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-18 13:33:15 +01:00

# Skill Evals
An LLM-as-judge harness that scores skill output quality across models — so claims like
"production-ready" are backed by numbers, not vibes. Results render as a public
[Skill Leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html).
## What it measures
For each [case](cases.json), a model runs the skill, then a **judge model** scores the
output from 1 to 5 on four dimensions:
- **structure** — follows a clear, expected structure
- **completeness** — covers what the task needs
- **usefulness** — specific and actually useful, not generic
- **grounding** — stays grounded in the input, no invented facts
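
To make the scoring concrete, here is a minimal sketch of how one judge verdict could be collapsed into a single number. It assumes integer scores on a 1 to 5 scale and a plain mean across the four dimensions; the `overallScore` helper and the verdict shape are hypothetical and may not match what `run-evals.mjs` actually does.

```javascript
// Hypothetical shape of one judge verdict: one integer per dimension.
// The field names mirror the four dimensions above, but this is an
// assumption, not the harness's actual schema.
const DIMENSIONS = ["structure", "completeness", "usefulness", "grounding"];

// Return the mean of the four dimension scores, rejecting anything
// that is missing or outside the assumed 1..5 range.
function overallScore(verdict) {
  let total = 0;
  for (const dim of DIMENSIONS) {
    const score = verdict[dim];
    if (!Number.isInteger(score) || score < 1 || score > 5) {
      throw new RangeError(`${dim} must be an integer from 1 to 5, got ${score}`);
    }
    total += score;
  }
  return total / DIMENSIONS.length;
}

const example = { structure: 5, completeness: 4, usefulness: 4, grounding: 3 };
console.log(overallScore(example)); // 4
```

Rejecting out-of-range values early keeps a malformed judge reply from silently skewing an average.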
## Run it
Needs an Anthropic API key (this calls the API and costs tokens):
```bash
ANTHROPIC_API_KEY=sk-ant-... node evals/run-evals.mjs
# Optional flags:
#   --models claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5-20251001
#   --judge claude-opus-4-8
node scripts/build-leaderboard.mjs # render web/leaderboard.html
```
`run-evals.mjs` writes `evals/results.json`; the leaderboard builder prefers it and falls
back to `results.example.json` (clearly labelled) so the page renders before you run real evals.
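
The prefer-then-fall-back behaviour described above can be sketched as follows. This is an illustration only: `pickResultsFile` is a hypothetical name, the `evals/` directory for the example file is assumed, and the real logic in `build-leaderboard.mjs` may differ. The existence check is passed in as a function so the sketch runs without touching the file system.

```javascript
// File names come from the text above; the directory of the example
// file is an assumption.
const REAL_RESULTS = "evals/results.json";
const EXAMPLE_RESULTS = "evals/results.example.json";

// Prefer real eval results; otherwise fall back to the example data and
// flag it, so the page can label the numbers as examples.
function pickResultsFile(fileExists) {
  if (fileExists(REAL_RESULTS)) {
    return { path: REAL_RESULTS, isExample: false };
  }
  return { path: EXAMPLE_RESULTS, isExample: true };
}

// Stand-in for a fresh checkout where no real eval run exists yet.
const freshCheckout = (path) => path === EXAMPLE_RESULTS;
console.log(pickResultsFile(freshCheckout).isExample); // true
```

In a Node script, `fileExists` could simply be `existsSync` from `node:fs`.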
### No local key? Run it in CI
1. Add an `ANTHROPIC_API_KEY` repo secret.
2. Enable **Settings → Actions → General → Workflow permissions → "Allow GitHub Actions to
create and approve pull requests"** (so the workflow can open its results PR — `main`
requires PRs).
3. **Actions → "Update Skill Leaderboard" → Run workflow.** It runs the evals and opens a
PR with `evals/results.json`. **Merge that PR** and the Pages deploy re-renders the
public leaderboard with real numbers — no laptop required.
## Add a case
Append to [`cases.json`](cases.json): `{ "skill": "<name>", "input": "<a realistic prompt>" }`.
Keep inputs short but representative of how the skill is actually used.
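
Before committing a new entry, a quick shape check can catch typos. The sketch below assumes a case needs only the two fields shown above, both non-empty strings; `validateCase` is a hypothetical helper, not part of the harness.

```javascript
// Return a list of problems with a case entry; an empty list means the
// entry has the two fields shown above, both as non-empty strings.
function validateCase(entry) {
  const problems = [];
  for (const field of ["skill", "input"]) {
    const value = entry == null ? undefined : entry[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`"${field}" must be a non-empty string`);
    }
  }
  return problems;
}

const good = { skill: "<name>", input: "<a realistic prompt>" };
console.log(validateCase(good).length);           // 0
console.log(validateCase({ skill: "x" }).length); // 1
```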
## Honesty notes
- Scores are an LLM judge's opinion, not ground truth — treat them as a comparative signal.
- The judge sees the skill's stated purpose and the output, not the model name (reduces bias).
- Re-run after model upgrades; numbers drift.
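
One way to keep the model name away from the judge is to build the judge prompt from a function that never receives it. The sketch below is an assumption about how that could look; `buildJudgePrompt`, the prompt wording, the run-record fields, and the "Draft a PRD" purpose are all hypothetical and not taken from the harness.

```javascript
// Build the text sent to the judge. The model that produced `output` is
// deliberately not a parameter, so it cannot leak into the prompt.
function buildJudgePrompt({ skillPurpose, output }) {
  return [
    "You are grading the output of a skill.",
    `Skill purpose: ${skillPurpose}`,
    "Score the output from 1 to 5 on structure, completeness, usefulness, and grounding.",
    "Output to grade:",
    output,
  ].join("\n");
}

// A run record may carry the model name for the leaderboard, but only the
// purpose and the output are read when building the judge prompt.
const run = {
  model: "model-under-test",
  skillPurpose: "Draft a PRD",
  output: "## Goals\n...",
};
const prompt = buildJudgePrompt(run);
console.log(prompt.includes(run.model)); // false
```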