fix(testing/ci/tooling): consistent unittest, venv guidance, runnable lab commands

- #9: standardize the test chain on stdlib unittest (nothing-to-install, which keeps M13's claims true and its planted bug intact). Aligned M5/M14/M16 prose, M14 lab/test_tasks.py, and ci/gitlab starters; ruff stays the only pip install.
- #20: add venv / PEP 668 / which-python guidance to M20 (+ M14/M15 local installs); point MCP config at the venv's absolute python.
- #21: replace M21 Part D's empty `git diff HEAD~1` with `git log -p` (no .gitignore added — device preserved).
- #22: add a dependency-install step before M23's green baseline on a fresh clone.
- #23: M24 reviewer/triage now tolerate code-fence-wrapped JSON (stdlib only); feature.patch trap untouched.
- #28: fix M27 Part D CI snippet path (working-directory) and require the gate to target a varying candidate; swapped_model regression kept as the fixture.

Closes #9
Closes #20
Closes #21
Closes #22
Closes #23
Closes #28

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
@@ -287,16 +287,35 @@ agent's output.
 6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
 *"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
-human reviews."* Then make it enforceable — this is one line in a CI workflow (Module 14):
+human reviews."* Then make it enforceable — this is one job in a CI workflow (Module 14), running
+the exact command you ran in Parts A–B:
 
 ```yaml
 - name: Eval gate
-  run: python modules/27-evals/lab/run_eval.py candidates/current_model --threshold 1.0
+  working-directory: modules/27-evals/lab
+  run: python run_eval.py candidates/current_model --threshold 1.0
 ```

+The `working-directory:` line makes the CI job `cd` into the lab folder first, so the
+`candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
+did on your machine. (Drop it and point a repo-root job straight at
+`python modules/27-evals/lab/run_eval.py candidates/current_model` instead, and `candidates/`
+won't exist from the repo root — the gate crashes with a *false* failure, which is worse than no
+gate. If you'd rather keep a single line, spell both paths out from the repo root:
+`python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
+--threshold 1.0`.)
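
To make the path behaviour concrete, here is a minimal Python sketch that is not part of the lab. It builds a throwaway directory tree mimicking the layout named in the text and checks the same relative path from two working directories. The temporary root and the `resolves` and `demo` helpers are assumptions made for illustration only.

```python
import os
import tempfile
from pathlib import Path


def resolves(relative: str, cwd: Path) -> bool:
    """Return True if `relative` exists when the process runs from `cwd`."""
    previous = Path.cwd()
    os.chdir(cwd)
    try:
        return Path(relative).exists()
    finally:
        # Always restore the original working directory.
        os.chdir(previous)


def demo() -> tuple:
    """Check the candidate path from the lab folder and from the repo root."""
    with tempfile.TemporaryDirectory() as tmp:
        repo_root = Path(tmp).resolve()  # hypothetical stand-in for a fresh clone
        lab = repo_root / "modules" / "27-evals" / "lab"
        (lab / "candidates" / "current_model").mkdir(parents=True)

        from_lab = resolves("candidates/current_model", lab)
        from_root = resolves("candidates/current_model", repo_root)
        spelled_out = resolves(
            "modules/27-evals/lab/candidates/current_model", repo_root
        )
        return from_lab, from_root, spelled_out


if __name__ == "__main__":
    print(demo())  # (True, False, True)
```

The short path only resolves from the lab folder; from the repo root it has to be spelled out in full, which matches the two options the paragraph describes.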
+
 Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
 is now structural, not a promise.
 
+**One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled,
+always-correct stand-in — it scores 100% on every run, forever, so a gate pointed at it can never
+fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real
+pipeline, point the gate at the candidate that actually *varies* — your agent's real output for
+this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the
+model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the
+same command drops to 60%, exits `1`, and blocks the merge.
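
The gate logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lab's `run_eval.py`: the five-case `EXPECTED` list, the `score` and `gate` helpers, and both sample candidates are hypothetical. It shows why an always-correct candidate can never trip the gate while a regressed one does.

```python
# Hypothetical expected outputs; the lab's real cases live in eval_set.py.
EXPECTED = [3, 0, 1, 2, 5]


def score(outputs, expected):
    """Fraction of cases where the candidate's output matches the expected value."""
    hits = sum(1 for got, want in zip(outputs, expected) if got == want)
    return hits / len(expected)


def gate(outputs, threshold=1.0):
    """Return a process exit code: 0 if the score meets the threshold, else 1."""
    return 0 if score(outputs, EXPECTED) >= threshold else 1


# Stand-in that matches every case, so the gate can never fail on it.
always_correct = list(EXPECTED)

# Illustrative regressed candidate: only 3 of the 5 cases are right.
regressed = [3, 0, 1, 9, 9]

print(gate(always_correct))        # 0
print(score(regressed, EXPECTED))  # 0.6
print(gate(regressed))             # 1
```

A command-line wrapper would pass the value returned by `gate` to `sys.exit`, which is what lets the CI runner treat a low score like a failing test.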
+
 ---
 
 ## Where it breaks