Use python3 as the canonical command name course-wide (#104) (#105)

2026-06-23 20:25:05 -04:00
parent 7f439212ac
commit 95e5911957
102 changed files with 380 additions and 378 deletions
@@ -245,7 +245,7 @@ output.

   ```bash
   cd modules/27-evals/lab
-   python run_eval.py candidates/current_model
+   python3 run_eval.py candidates/current_model
   echo "exit code: $?"
   ```

@@ -258,7 +258,7 @@ output.
 2. Now simulate the swap: run the *exact same eval set* against the other candidate:

   ```bash
-   python run_eval.py candidates/swapped_model
+   python3 run_eval.py candidates/swapped_model
   echo "exit code: $?"
   ```

@@ -275,7 +275,7 @@ output.
   yourself and read the scorecard:

   ```bash
-   python run_eval.py candidates/my_run_1
+   python3 run_eval.py candidates/my_run_1
   ```

 4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
@@ -309,10 +309,10 @@ output.
   `working-directory:` line makes the CI job `cd` into the lab folder first, so the
   `candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
   did on your machine. (Drop it and point a repo-root job straight at
-   `python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
+   `python3 modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
   won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
   gate. If the agent prefers a single line, it can spell both paths out from the repo root:
-   `python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
+   `python3 modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
   --threshold 1.0`.)

   Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
@@ -388,5 +388,5 @@ This is an expansion-zone module over fast-moving ground. Re-check at build/publ
 - [ ] **Module cross-references.** Confirm Modules 13, 14, 10, and 24–26 still carry the
  responsibilities referenced here (tests, CI gating, review, the agent autonomy ladder) and that
  none were renumbered.
-- [ ] **Lab still runs.** `python run_eval.py candidates/current_model` exits 0 at 100%, and
+- [ ] **Lab still runs.** `python3 run_eval.py candidates/current_model` exits 0 at 100%, and
  `candidates/swapped_model` exits 1 below threshold, on a current Python 3.x.
@@ -14,7 +14,7 @@ than pretending. NOTHING here pins a provider.
    EVAL_JUDGE_MODEL  # the model name to ask for

 Run it standalone to grade one sample:
-    python llm_judge.py "Add count command" "fix"
+    python3 llm_judge.py "Add count command" "fix"
 """

 import json
@@ -1,9 +1,9 @@
 """Run the eval set against one candidate and print a scorecard.

 Usage:
-    python run_eval.py candidates/current_model
-    python run_eval.py candidates/swapped_model
-    python run_eval.py candidates/current_model --threshold 0.9
+    python3 run_eval.py candidates/current_model
+    python3 run_eval.py candidates/swapped_model
+    python3 run_eval.py candidates/current_model --threshold 0.9

 A "candidate" is a directory containing a tasks.py that an agent produced. The
 runner imports that tasks.py, runs every case in eval_set.py against it, prints