Testing/CI/tooling consistency (#9,#20,#21,#22,#23,#28) (#59)

Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #59.
2026-06-22 16:07:58 -04:00
committed by Claude (agent)
parent a6a3cfdc50
commit 391df7fc6d
17 changed files with 216 additions and 82 deletions
+21 -2
@@ -287,16 +287,35 @@ agent's output.
6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
*"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
human reviews."* Then make it enforceable — this is one line in a CI workflow (Module 14):
human reviews."* Then make it enforceable — this is one job in a CI workflow (Module 14), running
the exact command you ran in Parts A and B:
```yaml
- name: Eval gate
run: python modules/27-evals/lab/run_eval.py candidates/current_model --threshold 1.0
working-directory: modules/27-evals/lab
run: python run_eval.py candidates/current_model --threshold 1.0
```
The `working-directory:` line makes the CI job `cd` into the lab folder first, so the
`candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
did on your machine. (Drop it and point a repo-root job straight at
`python modules/27-evals/lab/run_eval.py candidates/current_model` instead, and `candidates/`
won't exist from the repo root — the gate crashes with a *false* failure, which is worse than no
gate. If you'd rather keep a single line, spell both paths out from the repo root:
`python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
--threshold 1.0`.)
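The path behavior described above can be reproduced in isolation. The sketch below is an illustration, not part of the lab: it builds a throwaway directory tree mirroring the layout named in the text and checks whether a relative path exists from each starting directory. The helper name `resolves_from` is hypothetical, and modelling `current_model` as a directory is an assumption; the lookup works the same way for a file.

```python
import tempfile
from pathlib import Path


def resolves_from(cwd: Path, relative: str) -> bool:
    """Return True if `relative` exists when resolved against `cwd`."""
    return (cwd / relative).exists()


# Build a throwaway tree that mirrors the lab layout (assumed shape).
repo_root = Path(tempfile.mkdtemp())
lab_dir = repo_root / "modules" / "27-evals" / "lab"
(lab_dir / "candidates" / "current_model").mkdir(parents=True)

# From the lab folder (what `working-directory:` gives the job) the path resolves.
from_lab = resolves_from(lab_dir, "candidates/current_model")
# From the repo root it does not, so the job would fail before any scoring.
from_root = resolves_from(repo_root, "candidates/current_model")
# Spelling the full path out from the repo root resolves again.
from_root_full = resolves_from(
    repo_root, "modules/27-evals/lab/candidates/current_model"
)

print(from_lab, from_root, from_root_full)  # True False True
```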
Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
is now structural, not a promise.
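How a score turns into a blocked pipeline can be sketched in a few lines. This is not the source of `run_eval.py`, which is not shown here; `gate_exit_code` and its parameters are hypothetical names, and the case counts are illustrative. The only behavior assumed is the one described above: a score below the threshold produces a non-zero exit.

```python
def gate_exit_code(passed: int, total: int, threshold: float) -> int:
    """Map an eval score to an exit code: 0 passes the gate, 1 blocks it."""
    if total <= 0:
        # An empty eval set proves nothing, so treat it as a failure.
        return 1
    score = passed / total
    return 0 if score >= threshold else 1


# A real script would finish with sys.exit(gate_exit_code(...)) so CI sees the code.
print(gate_exit_code(5, 5, 1.0))  # 0: 100% meets the threshold
print(gate_exit_code(3, 5, 1.0))  # 1: 60% is below the threshold
```

CI runners treat any non-zero exit from a step as a failure, which is why no extra wiring is needed beyond the exit code itself.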
**One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled,
always-correct stand-in — it scores 100% on every run, forever, so a gate pointed at it can never
fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real
pipeline, point the gate at the candidate that actually *varies* — your agent's real output for
this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the
model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the
same command drops to 60%, exits `1`, and blocks the merge.
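One way to make "prove the gate bites" a repeatable check is a small script that runs a gate command and reports whether it would block. This is a sketch under assumptions: `gate_blocks` is a hypothetical helper, and the two commands below are stand-ins that simply exit `1` and `0` so the example runs anywhere. In the lab folder you would pass the real command from the caveat above instead.

```python
import subprocess
import sys


def gate_blocks(command: list[str]) -> bool:
    """Run a gate command and report whether it would block the pipeline."""
    result = subprocess.run(command)
    return result.returncode != 0


# Stand-in commands; in the lab folder the known-bad case would be
# [sys.executable, "run_eval.py", "candidates/swapped_model", "--threshold", "1.0"].
known_bad = [sys.executable, "-c", "import sys; sys.exit(1)"]
known_good = [sys.executable, "-c", "import sys; sys.exit(0)"]

print(gate_blocks(known_bad))   # True: a non-zero exit blocks the merge
print(gate_blocks(known_good))  # False: a zero exit lets it through
```

A gate that never returns `True` for a known-bad candidate is the dashboard case described above.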
---
## Where it breaks