This commit was merged in pull request #105.
@@ -245,7 +245,7 @@ output.

 ```bash
 cd modules/27-evals/lab
-python run_eval.py candidates/current_model
+python3 run_eval.py candidates/current_model
 echo "exit code: $?"
 ```

@@ -258,7 +258,7 @@ output.
 2. Now simulate the swap: run the *exact same eval set* against the other candidate:

 ```bash
-python run_eval.py candidates/swapped_model
+python3 run_eval.py candidates/swapped_model
 echo "exit code: $?"
 ```

@@ -275,7 +275,7 @@ output.
 yourself and read the scorecard:

 ```bash
-python run_eval.py candidates/my_run_1
+python3 run_eval.py candidates/my_run_1
 ```

 4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
@@ -309,10 +309,10 @@ output.
 `working-directory:` line makes the CI job `cd` into the lab folder first, so the
 `candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
 did on your machine. (Drop it and point a repo-root job straight at
-`python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
+`python3 modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
 won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
 gate. If the agent prefers a single line, it can spell both paths out from the repo root:
-`python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
+`python3 modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
 --threshold 1.0`.)

 Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
@@ -388,5 +388,5 @@ This is an expansion-zone module over fast-moving ground. Re-check at build/publ
 - [ ] **Module cross-references.** Confirm Modules 13, 14, 10, and 24–26 still carry the
 responsibilities referenced here (tests, CI gating, review, the agent autonomy ladder) and that
 none were renumbered.
-- [ ] **Lab still runs.** `python run_eval.py candidates/current_model` exits 0 at 100%, and
+- [ ] **Lab still runs.** `python3 run_eval.py candidates/current_model` exits 0 at 100%, and
 `candidates/swapped_model` exits 1 below threshold, on a current Python 3.x.
@@ -14,7 +14,7 @@ than pretending. NOTHING here pins a provider.
 EVAL_JUDGE_MODEL # the model name to ask for

 Run it standalone to grade one sample:
-python llm_judge.py "Add count command" "fix"
+python3 llm_judge.py "Add count command" "fix"
 """

 import json
@@ -1,9 +1,9 @@
 """Run the eval set against one candidate and print a scorecard.

 Usage:
-python run_eval.py candidates/current_model
-python run_eval.py candidates/swapped_model
-python run_eval.py candidates/current_model --threshold 0.9
+python3 run_eval.py candidates/current_model
+python3 run_eval.py candidates/swapped_model
+python3 run_eval.py candidates/current_model --threshold 0.9

 A "candidate" is a directory containing a tasks.py that an agent produced. The
 runner imports that tasks.py, runs every case in eval_set.py against it, prints