docs: surface v20.1.0 in README changelog and latest-release line (#53)

The structured CHANGELOG.md already had 20.1.0; this updates the README's embedded 'What's New' changelog and the top 'Latest release' line to match.

Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px
Co-authored-by: Claude <noreply@anthropic.com>
release: v20.1.0 — star nudges & eval hardening (#52)
10 changed files with 286 additions and 49 deletions
@@ -21,6 +21,7 @@ on:
 permissions:
  contents: write
  pull-requests: write
 concurrency:
  group: eval-leaderboard
@@ -29,6 +30,7 @@ concurrency:
 jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - name: Checkout
        uses: actions/checkout@v4
@@ -53,15 +55,16 @@ jobs:
      - name: Build the leaderboard page (sanity check)
        run: node scripts/build-leaderboard.mjs
-      - name: Commit results
-        run: |
-          git config user.name "github-actions[bot]"
-          git config user.email "github-actions[bot]@users.noreply.github.com"
-          git add evals/results.json
-          if git diff --cached --quiet; then
-            echo "No change in results."
-          else
-            git commit -m "chore(evals): refresh leaderboard results"
-            git push
-            echo "Committed evals/results.json — the Pages deploy will render real numbers."
-          fi
+      - name: Open a PR with the refreshed results
+        uses: peter-evans/create-pull-request@v7
+        with:
+          add-paths: evals/results.json
+          branch: eval-results
+          delete-branch: true
+          commit-message: "chore(evals): refresh leaderboard results"
+          title: "chore(evals): refresh leaderboard results"
+          body: |
+            Auto-generated by the **Update Skill Leaderboard** workflow.
+
+            Merging this publishes the **real** numbers on the live leaderboard — the
+            Pages deploy is triggered by changes to `evals/results.json`.
@@ -9,13 +9,28 @@ each new wave of skills bumps the **major** version, extensions and fixes bump
 ## [Unreleased]
 ## [20.1.0] — Star Nudges & Eval Hardening — 2026-06-18
 ### Added
 - **Star the repo, from anywhere you use it.** Tasteful, non-spammy calls-to-action that turn
  npm/CLI users into stargazers — no `postinstall` hook: a prompt after a successful
  `npx pm-claude-skills add`, in `--help`, in `list`, in the MCP server's startup banner, a
  CTA below the README badges (npm renders it on the package page), and a `funding` field in
  `package.json` so npm shows a Fund/Sponsor link.
 - **One-click leaderboard updates in CI** — `.github/workflows/eval-leaderboard.yml`
  ("Update Skill Leaderboard") runs the evals with the `ANTHROPIC_API_KEY` secret, commits
  `evals/results.json`, and the Pages deploy re-renders the public leaderboard with real
  numbers — no local key needed. The deploy workflow now also triggers on
  `evals/results.json`.
 ### Changed
 - **Leaderboard workflow opens a PR** instead of pushing to `main` (which the branch
  ruleset blocks). After it runs, merge the auto-created results PR to publish real numbers.
 - **Faster, hang-proof evals.** The Anthropic client now has a per-request timeout (120s)
  and limited retries (429/5xx/timeout); the eval harness runs cases concurrently
  (default 4). The leaderboard workflow has a 20-minute job timeout. A 24-call run that
  was sequential now finishes in a few minutes and can't stall a job indefinitely.
 ## [20.0.0] — Agentic Tooling — 2026-06-18
 ### Added
@@ -12,17 +12,19 @@
 [![Platforms](https://img.shields.io/badge/works%20with-Claude%20%7C%20ChatGPT%20%7C%20Gemini%20%7C%20Cursor%20%7C%20Codex%20%7C%20Hermes-8A2BE2)](#-works-with--cross-tool-compatibility)
 [![SkillCheck](https://img.shields.io/github/actions/workflow/status/mohitagw15856/pm-claude-skills/skillcheck.yml?branch=main&label=SkillCheck)](.github/workflows/skillcheck.yml)
 [![Security Audit](https://img.shields.io/github/actions/workflow/status/mohitagw15856/pm-claude-skills/skill-audit.yml?branch=main&label=security%20audit)](.github/workflows/skill-audit.yml)
-[![Version](https://img.shields.io/badge/version-20.0.0-brightgreen)](https://github.com/mohitagw15856/pm-claude-skills/releases)
+[![Version](https://img.shields.io/badge/version-20.1.0-brightgreen)](https://github.com/mohitagw15856/pm-claude-skills/releases)
 [![Install](https://img.shields.io/badge/Install%20in%20Claude%20Code-2%20minutes-orange)](https://github.com/mohitagw15856/pm-claude-skills#-quick-install-2-minutes)
 [![License](https://img.shields.io/badge/license-MIT-lightgrey)](LICENSE)
 [![Sponsor](https://img.shields.io/badge/sponsor-❤️-ff69b4)](https://github.com/sponsors/mohitagw15856)
 ### ⭐ If this saves you time, [star the repo](https://github.com/mohitagw15856/pm-claude-skills) — it's the #1 way to help others find it.
 > **PM stands for Professional, not just Product Management.**
 > 167 professional skills + 4 agent templates across 26 bundles covering 18 professions. Built for Claude Code — and now portable to ChatGPT, Gemini, and Hermes Agent. Built by a PM, used by everyone.
 A community-built library of professional skills for every field — product management, engineering, customer success, marketing, social media, writers, design, legal, finance, HR, sales, operations, research, and more. Each skill is a structured `SKILL.md` file that teaches an AI assistant how to produce professional-grade outputs for your workflows. Skills run natively in **Claude Code** and **Hermes Agent** (same open `SKILL.md` standard), and ship as ready-to-paste exports for **ChatGPT** and **Gemini** — see [Works With](#-works-with--cross-tool-compatibility).
-**🆕 Latest release (v20.0.0 — Agentic Tooling):** run any skill in CI with the new **[GitHub Action](action/)**, turn your docs into a skill with **`npx pm-claude-skills generate`**, and compare skills across models on the **[Skill Leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html)** (LLM-judge evals). See the [changelog](#-changelog).
+**🆕 Latest release (v20.1.0 — Star Nudges & Eval Hardening):** run any skill in CI with the **[GitHub Action](action/)**, turn your docs into a skill with **`npx pm-claude-skills generate`**, compare skills across models on the **[Skill Leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html)** (now one-click in CI), and — if it saves you time — ⭐ the repo. See the [changelog](#-changelog).
 <!-- DEMO: replace web/docs-assets/playground.png below with web/docs-assets/playground-demo.gif
     once recorded (see web/docs-assets/README.md for how). The link goes to the live app. -->
@@ -403,7 +405,14 @@ More templates will follow. If you want to contribute one, see the [template con
 The highlights are below. For the structured, [Keep a Changelog](https://keepachangelog.com/)-format history, see **[CHANGELOG.md](CHANGELOG.md)**.
-### 🆕 What's New in v20.0.0 — Agentic Tooling
+### 🆕 What's New in v20.1.0 — Star Nudges & Eval Hardening
 - **Star the repo, from anywhere you use it** — tasteful, non-spammy CTAs (no `postinstall`): after a successful `npx pm-claude-skills add`, in `--help`, in `list`, in the MCP server banner, below the README badges, and a `funding` link on npm.
 - **One-click leaderboard in CI** — the "Update Skill Leaderboard" workflow runs the evals with your `ANTHROPIC_API_KEY` secret and opens a results PR; merge it to publish real numbers.
 - **Faster, hang-proof evals** — per-request timeout + retries in the API client and concurrent eval runs, so a CI run finishes in minutes and can't stall.
 <details>
 <summary><strong>v20.0.0 — Agentic Tooling</strong> (click to expand)</summary>
 The library starts *doing* the work, not just describing it:
@@ -411,6 +420,8 @@ The library starts *doing* the work, not just describing it:
 - **`generate` command** — `npx pm-claude-skills generate --from <url|file>` turns your docs into a standard-compliant `SKILL.md`.
 - **Skill evals + Leaderboard** — LLM-as-judge scoring of skills across models, rendered as a public [leaderboard](https://mohitagw15856.github.io/pm-claude-skills/leaderboard.html).
 </details>
 <details>
 <summary><strong>v19.0.0 — Security Auditor, Personas & Catalog</strong> (click to expand)</summary>
@@ -19,6 +19,7 @@ import { homedir } from 'node:os';
 import { createRequire } from 'node:module';
 const PKG_ROOT = dirname(dirname(fileURLToPath(import.meta.url)));
 const STAR = '⭐ Find this useful? Star the repo: https://github.com/mohitagw15856/pm-claude-skills';
 const VERSION = (() => {
  try { return createRequire(import.meta.url)('../package.json').version; } catch { return '0.0.0'; }
 })();
@@ -128,6 +129,7 @@ function add(opts) {
      aider: `Load any of them with:  aider --read ${join(target, '<skill>.md')}`,
    }[agent] || `Restart ${agent} — it auto-discovers SKILL.md skills in ${target} by their description.`;
    console.log(note);
    console.log(`\n${STAR}`);
  }
 }
@@ -139,6 +141,7 @@ function list() {
  console.log('\nNative SKILL.md agents: claude, hermes, codex, openclaw (install skill folders).');
  console.log('Claude also gets subagents + slash commands. Cursor/Windsurf install rule files;');
  console.log('Aider installs conventions you load with "aider --read".');
  console.log(`\n${STAR}`);
 }
 const HELP = `pm-claude-skills — install professional Agent Skills into any AI coding tool.
@@ -155,6 +158,8 @@ Examples:
  npx pm-claude-skills add --agent codex --link
  npx pm-claude-skills generate --from <url|file>   # turn your docs into a SKILL.md (needs ANTHROPIC_API_KEY)
 ${STAR}
 `;
 const opts = parse(process.argv.slice(2));
@@ -6,31 +6,57 @@ const API_URL = 'https://api.anthropic.com/v1/messages';
 /**
 * Call the Anthropic Messages API and return the concatenated text output.
 * Adds a per-request timeout and limited retries so a slow/transient failure
 * can't hang a CI job forever.
 * @param {object} o
 * @param {string} o.apiKey  - Anthropic API key.
 * @param {string} [o.model] - Model id (default claude-sonnet-4-6).
 * @param {string} [o.system]- System prompt.
 * @param {Array}  o.messages- [{role, content}] messages.
 * @param {number} [o.maxTokens]
 * @param {number} [o.timeoutMs] - Per-request timeout (default 120s).
 * @param {number} [o.retries]   - Retries on timeout / 429 / 5xx (default 2).
 * @returns {Promise<string>}
 */
-export async function complete({ apiKey, model = 'claude-sonnet-4-6', system, messages, maxTokens = 4096 }) {
+export async function complete({ apiKey, model = 'claude-sonnet-4-6', system, messages, maxTokens = 4096, timeoutMs = 120000, retries = 2 }) {
  if (!apiKey) throw new Error('Missing Anthropic API key (set ANTHROPIC_API_KEY).');
-  const res = await fetch(API_URL, {
-    method: 'POST',
-    headers: {
-      'content-type': 'application/json',
-      'x-api-key': apiKey,
-      'anthropic-version': '2023-06-01',
-    },
-    body: JSON.stringify({ model, max_tokens: maxTokens, ...(system ? { system } : {}), messages }),
-  });
-  if (!res.ok) {
-    const body = await res.text().catch(() => '');
-    throw new Error(`Anthropic API ${res.status}: ${body.slice(0, 500)}`);
+  let lastErr;
+  for (let attempt = 0; attempt <= retries; attempt++) {
+    const ctrl = new AbortController();
+    const timer = setTimeout(() => ctrl.abort(), timeoutMs);
+    try {
+      const res = await fetch(API_URL, {
+        method: 'POST',
+        headers: {
+          'content-type': 'application/json',
+          'x-api-key': apiKey,
+          'anthropic-version': '2023-06-01',
+        },
+        body: JSON.stringify({ model, max_tokens: maxTokens, ...(system ? { system } : {}), messages }),
+        signal: ctrl.signal,
+      });
+      if (res.ok) {
+        const data = await res.json();
+        return (data.content || []).map((c) => c.text || '').join('').trim();
+      }
+      const body = await res.text().catch(() => '');
+      // Retry transient server / rate-limit errors; fail fast on 4xx (bad key/model).
+      if ((res.status === 429 || res.status >= 500) && attempt < retries) {
+        lastErr = new Error(`Anthropic API ${res.status}`);
+      } else {
+        throw new Error(`Anthropic API ${res.status}: ${body.slice(0, 500)}`);
+      }
+    } catch (e) {
+      if (e.name === 'AbortError') e = new Error(`Anthropic API request timed out after ${timeoutMs}ms`);
+      const retryable = /timed out/.test(e.message) || e.name === 'TypeError' || /Anthropic API (429|5\d\d)/.test(e.message);
+      if (!retryable || attempt >= retries) throw e;
+      lastErr = e;
+    } finally {
+      clearTimeout(timer);
+    }
+    await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt)); // backoff: 1s, 2s, 4s
  }
-  const data = await res.json();
-  return (data.content || []).map((c) => c.text || '').join('').trim();
+  throw lastErr || new Error('Anthropic API request failed.');
 }
 /** Parse "name: value" YAML-ish frontmatter + body from a SKILL.md string. */
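Note on the retry loop in the hunk above: a minimal sketch, assuming the loop sleeps `1000 * 2 ** attempt` ms only after a failed attempt that still has retries left, as the diff shows. `backoffSchedule` and `worstCaseMs` are hypothetical helper names used for illustration, not part of the repo. The sketch lists the waits for a given `retries` value and a rough upper bound on how long one `complete()` call can take if every attempt hits the timeout.

```javascript
// Hypothetical helper (not in the repo): waits between attempts, in ms.
// There is one wait per retry, because the last attempt throws instead of sleeping.
function backoffSchedule(retries, baseMs = 1000) {
  return Array.from({ length: retries }, (_, attempt) => baseMs * 2 ** attempt);
}

// Rough worst case for a single call: every attempt runs into the timeout,
// plus the backoff waits between attempts.
function worstCaseMs({ timeoutMs = 120000, retries = 2 } = {}) {
  const waits = backoffSchedule(retries).reduce((sum, ms) => sum + ms, 0);
  return (retries + 1) * timeoutMs + waits;
}

console.log(backoffSchedule(2)); // [ 1000, 2000 ]
console.log(backoffSchedule(3)); // [ 1000, 2000, 4000 ]
console.log(worstCaseMs());      // 363000
```

With the default `retries = 2` only the 1s and 2s waits occur; the 4s step named in the inline comment is reached from `retries = 3` upward. Under these assumptions a single call is bounded at roughly six minutes, which sits well inside the workflow's 20-minute job timeout.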
@@ -30,9 +30,13 @@ back to `results.example.json` (clearly labelled) so the page renders before you
 ### No local key? Run it in CI
-Add an `ANTHROPIC_API_KEY` repo secret, then go to **Actions → "Update Skill Leaderboard"
-→ Run workflow**. It runs the evals, commits `evals/results.json`, and the Pages deploy
-re-renders the public leaderboard with real numbers — no laptop required.
+1. Add an `ANTHROPIC_API_KEY` repo secret.
+2. Enable **Settings → Actions → General → Workflow permissions → "Allow GitHub Actions to
+   create and approve pull requests"** (so the workflow can open its results PR — `main`
+   requires PRs).
+3. **Actions → "Update Skill Leaderboard" → Run workflow.** It runs the evals and opens a
+   PR with `evals/results.json`. **Merge that PR** and the Pages deploy re-renders the
+   public leaderboard with real numbers — no laptop required.
 ## Add a case
@@ -0,0 +1,148 @@
 {
  "generatedAt": "2026-06-18T12:40:14.995Z",
  "judge": "claude-opus-4-8",
  "models": [
    "claude-sonnet-4-6",
    "claude-haiku-4-5-20251001"
  ],
  "dimensions": [
    "structure",
    "completeness",
    "usefulness",
    "grounding"
  ],
  "results": [
    {
      "skill": "rice-prioritisation",
      "model": "claude-sonnet-4-6",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 5
      },
      "overall": 5
    },
    {
      "skill": "rice-prioritisation",
      "model": "claude-haiku-4-5-20251001",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 4
      },
      "overall": 4.75
    },
    {
      "skill": "prd-template",
      "model": "claude-sonnet-4-6",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 4
      },
      "overall": 4.75
    },
    {
      "skill": "prd-template",
      "model": "claude-haiku-4-5-20251001",
      "scores": {
        "structure": 5,
        "completeness": 4,
        "usefulness": 5,
        "grounding": 3
      },
      "overall": 4.25
    },
    {
      "skill": "cs-health-scorecard",
      "model": "claude-sonnet-4-6",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 5
      },
      "overall": 5
    },
    {
      "skill": "cs-health-scorecard",
      "model": "claude-haiku-4-5-20251001",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 4
      },
      "overall": 4.75
    },
    {
      "skill": "executive-summary",
      "model": "claude-sonnet-4-6",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 5
      },
      "overall": 5
    },
    {
      "skill": "executive-summary",
      "model": "claude-haiku-4-5-20251001",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 4
      },
      "overall": 4.75
    },
    {
      "skill": "competitive-analysis",
      "model": "claude-sonnet-4-6",
      "scores": {
        "structure": 5,
        "completeness": 4,
        "usefulness": 5,
        "grounding": 5
      },
      "overall": 4.75
    },
    {
      "skill": "competitive-analysis",
      "model": "claude-haiku-4-5-20251001",
      "scores": {
        "structure": 5,
        "completeness": 4,
        "usefulness": 5,
        "grounding": 4
      },
      "overall": 4.5
    },
    {
      "skill": "sprint-planning",
      "model": "claude-sonnet-4-6",
      "scores": {
        "structure": 5,
        "completeness": 5,
        "usefulness": 5,
        "grounding": 4
      },
      "overall": 4.75
    },
    {
      "skill": "sprint-planning",
      "model": "claude-haiku-4-5-20251001",
      "scores": {
        "structure": 5,
        "completeness": 4,
        "usefulness": 4,
        "grounding": 3
      },
      "overall": 4
    }
  ]
 }
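To read the `overall` field in the results above: a small sketch, assuming `overall` is the mean of the four dimension scores rounded to two decimals, which is what `scoreTask` in `evals/run-evals.mjs` computes in this diff. `overallScore` and `meanByModel` are hypothetical names introduced here; the `overall` values are copied from the results file above.

```javascript
const DIMENSIONS = ['structure', 'completeness', 'usefulness', 'grounding'];

// Mean of the dimension scores, rounded to two decimals.
function overallScore(scores) {
  const mean = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0) / DIMENSIONS.length;
  return Math.round(mean * 100) / 100;
}

// Hypothetical helper: unrounded mean of `overall` per model.
function meanByModel(results) {
  const totals = {};
  for (const { model, overall } of results) {
    totals[model] = totals[model] || { sum: 0, n: 0 };
    totals[model].sum += overall;
    totals[model].n += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([model, t]) => [model, t.sum / t.n]),
  );
}

// `overall` values copied from the results file, in file order.
const sonnet = [5, 4.75, 5, 5, 4.75, 4.75];
const haiku = [4.75, 4.25, 4.75, 4.75, 4.5, 4];
const results = [
  ...sonnet.map((overall) => ({ model: 'claude-sonnet-4-6', overall })),
  ...haiku.map((overall) => ({ model: 'claude-haiku-4-5-20251001', overall })),
];

// prd-template on claude-haiku-4-5-20251001: (5 + 4 + 5 + 3) / 4
console.log(overallScore({ structure: 5, completeness: 4, usefulness: 5, grounding: 3 })); // 4.25
console.log(meanByModel(results));
// { 'claude-sonnet-4-6': 4.875, 'claude-haiku-4-5-20251001': 4.5 }
```

Each model was scored on only six cases here, so the per-model means are a quick summary of this run and not a stable ranking.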
@@ -61,33 +61,53 @@ function parseScores(text) {
  return s;
 }
 // Run an async worker over `items` with at most `limit` in flight.
 async function pool(items, limit, worker) {
  const out = [];
  let i = 0;
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (i < items.length) {
      const idx = i++;
      out[idx] = await worker(items[idx]);
    }
  }));
  return out;
 }
 async function scoreTask({ c, body, description, model }) {
  try {
    const output = await complete({ apiKey, model, system: runPrompt(body), messages: [{ role: 'user', content: c.input }], maxTokens: 3000 });
    const judged = await complete({ apiKey, model: judge, messages: [{ role: 'user', content: judgePrompt(description, output) }], maxTokens: 200 });
    const scores = parseScores(judged);
    const overall = DIMENSIONS.reduce((a, d) => a + scores[d], 0) / DIMENSIONS.length;
    process.stderr.write(`✓ ${c.skill} on ${model} — ${overall.toFixed(2)}/5\n`);
    return { skill: c.skill, model, scores, overall: Math.round(overall * 100) / 100 };
  } catch (e) {
    process.stderr.write(`✗ ${c.skill} on ${model} — FAILED (${e.message})\n`);
    return null;
  }
 }
 async function main() {
  if (!apiKey) { console.error('Set ANTHROPIC_API_KEY to run evals.'); process.exit(1); }
  const concurrency = parseInt(arg('concurrency', '4'), 10) || 4;
  const { cases } = JSON.parse(readFileSync(casesPath, 'utf8'));
-  const results = [];
  // Build the full (case × model) task list.
  const tasks = [];
  for (const c of cases) {
    const skillFile = join(root, 'skills', c.skill, 'SKILL.md');
    if (!existsSync(skillFile)) { console.error(`skip ${c.skill}: no SKILL.md`); continue; }
    const { meta, body } = parseSkill(readFileSync(skillFile, 'utf8'));
-    for (const model of models) {
-      process.stderr.write(`Running ${c.skill} on ${model}… `);
-      try {
-        const output = await complete({ apiKey, model, system: runPrompt(body), messages: [{ role: 'user', content: c.input }], maxTokens: 3000 });
-        const judged = await complete({ apiKey, model: judge, messages: [{ role: 'user', content: judgePrompt(meta.description || c.skill, output) }], maxTokens: 200 });
-        const scores = parseScores(judged);
-        const overall = DIMENSIONS.reduce((a, d) => a + scores[d], 0) / DIMENSIONS.length;
-        results.push({ skill: c.skill, model, scores, overall: Math.round(overall * 100) / 100 });
-        process.stderr.write(`${overall.toFixed(2)}/5\n`);
-      } catch (e) {
-        process.stderr.write(`FAILED (${e.message})\n`);
-      }
-    }
+    for (const model of models) tasks.push({ c, body, description: meta.description || c.skill, model });
  }
  process.stderr.write(`Scoring ${tasks.length} runs (concurrency ${concurrency})…\n`);
  const results = (await pool(tasks, concurrency, scoreTask)).filter(Boolean);
  const out = { generatedAt: new Date().toISOString(), judge, models, dimensions: DIMENSIONS, results };
  writeFileSync(outPath, JSON.stringify(out, null, 2));
-  console.log(`\nWrote ${outPath} — ${results.length} scored runs. Build the page: node scripts/build-leaderboard.mjs`);
+  console.log(`\nWrote ${outPath} — ${results.length}/${tasks.length} scored runs. Build the page: node scripts/build-leaderboard.mjs`);
 }
 main();
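The `pool` helper added in this hunk can be exercised on its own. The sketch below copies it from the diff and drives it with a hypothetical `fakeTask` worker (a timer standing in for an API call) to show two properties the commit message claims: results keep the input order, and no more than `limit` tasks are in flight at once. The `demo` function and its inputs are assumptions made for illustration.

```javascript
// Copied from the diff: run an async worker over `items` with at most `limit` in flight.
async function pool(items, limit, worker) {
  const out = [];
  let i = 0;
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (i < items.length) {
      const idx = i++;
      out[idx] = await worker(items[idx]);
    }
  }));
  return out;
}

// Hypothetical demo: each task sleeps for `ms` and returns `ms * 2`.
async function demo() {
  let inFlight = 0;
  let peak = 0;
  const fakeTask = async (ms) => {
    inFlight++;
    peak = Math.max(peak, inFlight); // track the highest concurrency seen
    await new Promise((resolve) => setTimeout(resolve, ms));
    inFlight--;
    return ms * 2;
  };
  const results = await pool([30, 10, 20, 5, 15], 2, fakeTask);
  return { results, peak };
}

demo().then(({ results, peak }) => {
  console.log(results); // [ 60, 20, 40, 10, 30 ]
  console.log(peak);    // 2
});
```

Because each runner writes to `out[idx]`, a slow early task does not reorder the output. In the eval script a failed task returns `null` from `scoreTask`, which is why the caller filters the pooled results with `.filter(Boolean)`.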
@@ -166,6 +166,7 @@ function handle(msg) {
 }
 process.stderr.write(`[${SERVER_NAME}] MCP server ready — ${SKILLS.length} skills, ${TOOLS.length} tools.\n`);
 process.stderr.write(`[${SERVER_NAME}] ⭐ Star the repo: https://github.com/mohitagw15856/pm-claude-skills\n`);
 const rl = createInterface({ input: process.stdin });
 rl.on('line', (line) => {
  const s = line.trim();
@@ -1,6 +1,6 @@
 {
  "name": "pm-claude-skills",
-  "version": "20.0.0",
+  "version": "20.1.0",
  "type": "module",
  "description": "167 professional Agent Skills (SKILL.md) + subagents + slash commands for Claude, ChatGPT, Gemini, Cursor, Codex & Hermes. Install into any AI coding tool with: npx pm-claude-skills add --agent <tool>.",
  "keywords": [
@@ -29,6 +29,10 @@
  "bugs": {
    "url": "https://github.com/mohitagw15856/pm-claude-skills/issues"
  },
  "funding": {
    "type": "github",
    "url": "https://github.com/mohitagw15856/pm-claude-skills"
  },
  "author": "Mohit Aggarwal",
  "bin": {
    "pm-claude-skills": "bin/cli.mjs",
Author	SHA1	Message	Date
mohitagw15856	83bfff4f2f	docs: surface v20.1.0 in README changelog and latest-release line (#53) The structured CHANGELOG.md already had 20.1.0; this updates the README's embedded 'What's New' changelog and the top 'Latest release' line to match. Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>	2026-06-18 20:35:03 +01:00
mohitagw15856	0c33330211	release: v20.1.0 — star nudges & eval hardening (#52) Bump to 20.1.0. Folds the prior Unreleased items (CI leaderboard, PR-based results flow, faster/hang-proof evals) plus the new star CTAs into a [20.1.0] changelog section. Updates the README version badge. Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>	2026-06-18 20:29:01 +01:00
mohitagw15856	82beaed5c6	feat: add star CTA to CLI list output and MCP server banner (#51) More touchpoints to convert users into stargazers: the `list` command footer and the MCP server's stderr startup banner (stderr is safe — it never corrupts the JSON-RPC stream on stdout). Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>	2026-06-18 20:25:13 +01:00
mohitagw15856	511bad19b0	feat: nudge npm users to star the repo (CLI + README + funding) (#49) - CLI prints a star CTA after a successful install and in --help - README adds a prominent star line below the badges (npm renders this) - package.json gains a funding field so npm shows a Fund/Star link Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>	2026-06-18 19:03:41 +01:00
mohitagw15856	63cef03324	Merge pull request #46 from mohitagw15856/eval-results chore(evals): refresh leaderboard results	2026-06-18 13:41:58 +01:00
mohitagw15856	c28825dd38	chore(evals): refresh leaderboard results	2026-06-18 12:40:15 +00:00
mohitagw15856	4209963cff	Leaderboard workflow: open a PR instead of pushing to protected main (#45) The eval run worked (12 scored runs) but the final step failed: it pushed evals/results.json directly to main, which the branch ruleset blocks ("Changes must be made through a pull request"). - eval-leaderboard.yml: replace the direct commit/push with peter-evans/create-pull-request@v7 (branch eval-results), add pull-requests: write. Merging that PR triggers the Pages deploy (which watches evals/results.json) to publish real numbers. - evals/README documents the PR flow + the required "Allow GitHub Actions to create and approve pull requests" setting. Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>	2026-06-18 13:33:15 +01:00
mohitagw15856	827d7f62ec	Make evals fast and hang-proof (timeout, retry, concurrency) (#44) The "Run evals" step ran 24 API calls sequentially with no request timeout, so it was slow and could stall indefinitely if one call hung. - bin/lib/anthropic.mjs: per-request timeout (120s) via AbortController + retry (2x, backoff) on 429/5xx/timeout. Fails fast on 4xx (bad key/model). - evals/run-evals.mjs: run (case × model) tasks through a concurrency pool (default 4, --concurrency to tune); preserves result order. - eval-leaderboard.yml: job timeout-minutes: 20 as a safety net. Applies to the next run. The hardening also benefits the Action runner and `generate`, which share the client. Claude-Session: https://claude.ai/code/session_016JWn5jRD5tcEFKrubjQ6Px Co-authored-by: Claude <noreply@anthropic.com>	2026-06-18 13:30:06 +01:00