- Add compare-mode demo GIF + its Playwright recorder; embed in README eval section
- Expand evals/cases.json (6 → 15 flagship skills) so more skills can be
eval-scored and sample-generated
- Add --generate-missing mode to build-samples.mjs
- Add generate-samples.yml: workflow_dispatch job that generates real sample
outputs via the ANTHROPIC_API_KEY secret (key never leaves GitHub) and commits
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The eval run worked (12 scored runs) but the final step failed: it pushed
evals/results.json directly to main, which the branch ruleset blocks
("Changes must be made through a pull request").
- eval-leaderboard.yml: replace the direct commit/push with
peter-evans/create-pull-request@v7 (branch eval-results), add
pull-requests: write. Merging that PR triggers the Pages deploy (which
watches evals/results.json) to publish real numbers.
- evals/README documents the PR flow + the required "Allow GitHub Actions to
create and approve pull requests" setting.
Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
The "Run evals" step ran 24 API calls sequentially with no request timeout, so
it was slow and could stall indefinitely if one call hung.
- bin/lib/anthropic.mjs: per-request timeout (120s) via AbortController + retry
(2x, backoff) on 429/5xx/timeout. Fails fast on 4xx (bad key/model).
- evals/run-evals.mjs: run (case × model) tasks through a concurrency pool
(default 4, --concurrency to tune); preserves result order.
- eval-leaderboard.yml: job timeout-minutes: 20 as a safety net.
Applies to the next run. The hardening also benefits the Action runner and
`generate`, which share the client.
Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
Lets the leaderboard show real numbers without a local key: the new
"Update Skill Leaderboard" workflow (workflow_dispatch) runs the eval harness
with the ANTHROPIC_API_KEY secret, commits evals/results.json, and the Pages
deploy re-renders the public leaderboard with real data.
- .github/workflows/eval-leaderboard.yml: manual trigger, contents: write,
runs run-evals.mjs + build-leaderboard.mjs, commits results.json.
- deploy-playground.yml: also trigger on evals/results.json (and the build
scripts) so the committed results refresh the live page.
- evals/README + CHANGELOG document the CI route.
Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
Three features riding 2026 trends (agentic CI, codegen, evals), sharing one
dependency-free Anthropic client (bin/lib/anthropic.mjs).
1. GitHub Action (action/) — run any skill in a consumer repo's CI:
uses: mohitagw15856/pm-claude-skills/action@main. Composite action +
run.mjs (loads the bundled SKILL.md, calls the API, exposes result as a
step output / file). Docs with auto-PR-description example.
2. generate command — `npx pm-claude-skills generate --from <url|file>` turns
a team's docs into a SKILL.md following the authoring standard
(bin/generate.mjs, wired into the CLI; needs ANTHROPIC_API_KEY).
3. Skill evals + Leaderboard — evals/run-evals.mjs runs each case across models
and scores output with an LLM judge (structure/completeness/usefulness/
grounding); scripts/build-leaderboard.mjs renders web/leaderboard.html
(built in the Pages deploy, falls back to clearly-labelled example data).
Linked from README, catalog, and playground.
Offline-testable parts verified (prompt building, skill loading, graceful
errors, leaderboard render). SkillCheck/audit/exports all green.
Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>