Use python3 as the canonical command name course-wide (#104) (#105)
CI / check (push) Successful in 7s
Sync course wiki / sync-wiki (push) Successful in 4s

This commit was merged in pull request #105.
2026-06-23 20:25:05 -04:00
parent 7f439212ac
commit 95e5911957
102 changed files with 380 additions and 378 deletions
+6 -6
@@ -245,7 +245,7 @@ output.
```bash
cd modules/27-evals/lab
-python run_eval.py candidates/current_model
+python3 run_eval.py candidates/current_model
echo "exit code: $?"
```
@@ -258,7 +258,7 @@ output.
2. Now simulate the swap: run the *exact same eval set* against the other candidate:
```bash
-python run_eval.py candidates/swapped_model
+python3 run_eval.py candidates/swapped_model
echo "exit code: $?"
```
@@ -275,7 +275,7 @@ output.
yourself and read the scorecard:
```bash
-python run_eval.py candidates/my_run_1
+python3 run_eval.py candidates/my_run_1
```
4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
@@ -309,10 +309,10 @@ output.
`working-directory:` line makes the CI job `cd` into the lab folder first, so the
`candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
did on your machine. (Drop it and point a repo-root job straight at
-`python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
+`python3 modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
gate. If the agent prefers a single line, it can spell both paths out from the repo root:
-`python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
+`python3 modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
--threshold 1.0`.)
Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
@@ -388,5 +388,5 @@ This is an expansion-zone module over fast-moving ground. Re-check at build/publ
- [ ] **Module cross-references.** Confirm Modules 13, 14, 10, and 24-26 still carry the
responsibilities referenced here (tests, CI gating, review, the agent autonomy ladder) and that
none were renumbered.
-- [ ] **Lab still runs.** `python run_eval.py candidates/current_model` exits 0 at 100%, and
+- [ ] **Lab still runs.** `python3 run_eval.py candidates/current_model` exits 0 at 100%, and
`candidates/swapped_model` exits 1 below threshold, on a current Python 3.x.
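
The exit-code contract in the checklist item above (exit 0 at or above the threshold, exit 1 below it) can be sketched roughly as follows. This is a minimal illustration, not the actual `run_eval.py`: the function name `gate` and its parameters are hypothetical, and treating an empty eval set as a failure is an assumption.

```python
def gate(passed: int, total: int, threshold: float = 1.0) -> int:
    """Map an eval pass rate to a process exit code (hypothetical helper).

    Returns 0 when the pass rate meets the threshold, 1 otherwise, so a CI
    job can block on the result the same way it blocks on a failing test.
    """
    if total <= 0:
        # Assumption: no cases run means nothing was verified, so fail.
        return 1
    rate = passed / total
    print(f"pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


# A script would typically end with something like:
#     import sys
#     sys.exit(gate(passed, total, threshold))
```

With the default threshold of 1.0, a single failing case is enough to return 1; passing a lower threshold such as 0.9 would let a 9-of-10 run through.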
+1 -1
@@ -14,7 +14,7 @@ than pretending. NOTHING here pins a provider.
EVAL_JUDGE_MODEL # the model name to ask for
Run it standalone to grade one sample:
-python llm_judge.py "Add count command" "fix"
+python3 llm_judge.py "Add count command" "fix"
"""
import json
+3 -3
@@ -1,9 +1,9 @@
"""Run the eval set against one candidate and print a scorecard.
Usage:
-python run_eval.py candidates/current_model
-python run_eval.py candidates/swapped_model
-python run_eval.py candidates/current_model --threshold 0.9
+python3 run_eval.py candidates/current_model
+python3 run_eval.py candidates/swapped_model
+python3 run_eval.py candidates/current_model --threshold 0.9
A "candidate" is a directory containing a tasks.py that an agent produced. The
runner imports that tasks.py, runs every case in eval_set.py against it, prints