Fix Module 27 Part D CI snippet path (won't resolve from repo root) and the frozen always-100% gate fixture #28

Closed
opened 2026-06-22 14:23:51 -04:00 by claude · 0 comments
Contributor

Problem

Two issues in the capstone-evals Part D CI snippet:

  1. Path base mismatch. The script path is repo-root-relative (modules/27-evals/lab/run_eval.py) but the candidate arg is lab-relative (candidates/current_model). A CI job runs from the repo root, where candidates/ doesn't exist, so the gate the module calls "structural, not a promise" crashes with a false failure ("no tasks.py in candidates/current_model", exit 1).
  2. Frozen fixture. Even once the path is fixed, the snippet gates on the bundled current_model candidate, whose tasks.py is the always-correct baseline and scores 100% on every run. The gate therefore guards nothing, which contradicts the section's own line: "an eval nobody must act on is a dashboard, not a guardrail."
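
The path base mismatch in item 1 can be reproduced in isolation. The Python sketch below builds a throwaway copy of the directory layout named in this issue and checks where the candidate argument resolves from each working directory. The tasks.py content and the tasks_found helper are hypothetical; they are not taken from run_eval.py, whose source is not shown here.

```python
import tempfile
from pathlib import Path

# Throwaway miniature of the layout described in this issue.
with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp).resolve()
    lab = repo_root / "modules" / "27-evals" / "lab"
    candidate = lab / "candidates" / "current_model"
    candidate.mkdir(parents=True)
    # Hypothetical stand-in for the bundled baseline's tasks.py.
    (candidate / "tasks.py").write_text("# stand-in baseline\n")

    def tasks_found(cwd: Path, candidate_arg: str) -> bool:
        """True if <candidate_arg>/tasks.py exists when resolved against cwd."""
        return (cwd / candidate_arg / "tasks.py").is_file()

    # Lab-relative argument, resolved from the repo root (what CI does).
    from_root_lab_relative = tasks_found(repo_root, "candidates/current_model")
    # Repo-root-relative argument, resolved from the repo root.
    from_root_root_relative = tasks_found(
        repo_root, "modules/27-evals/lab/candidates/current_model"
    )
    # Lab-relative argument, resolved from the lab directory.
    from_lab_lab_relative = tasks_found(lab, "candidates/current_model")

print(from_root_lab_relative, from_root_root_relative, from_lab_lab_relative)
```

Only the first combination fails to find tasks.py, which matches the "no tasks.py in candidates/current_model" failure reported from the repo root.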

Evidence

modules/27-evals/README.md Part D (~line 294): "run: python modules/27-evals/lab/run_eval.py candidates/current_model --threshold 1.0". From repo root → "no tasks.py in candidates/current_model", exit 1. current_model is the always-passing baseline (Part A); its tasks.py is commented "It's correct".

Why it matters

The closing module's flagship "structural, not a promise" example crashes when copy-pasted. Even with the path fixed, it gates on something that can never fail, which undermines the lesson it is meant to teach.
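
A minimal Python sketch of why a gate on a frozen fixture cannot fail. It does not reproduce run_eval.py: CASES, score, gate_exit_code, and both candidate functions are hypothetical stand-ins. Only the threshold of 1.0 and the 0/1 exit codes come from this issue.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical eval cases: (input, expected output).
CASES: Sequence[Tuple[int, int]] = [(1, 2), (2, 4), (3, 6)]

def score(solve: Callable[[int], int], cases: Sequence[Tuple[int, int]]) -> float:
    """Fraction of cases the candidate answers correctly."""
    passed = sum(1 for x, expected in cases if solve(x) == expected)
    return passed / len(cases)

def gate_exit_code(solve: Callable[[int], int], threshold: float) -> int:
    """0 when the candidate meets the threshold, 1 otherwise."""
    return 0 if score(solve, CASES) >= threshold else 1

def frozen_baseline(x: int) -> int:
    # Stand-in for the bundled always-correct candidate: it never changes.
    return 2 * x

def regressed_candidate(x: int) -> int:
    # Stand-in for real, varying output that has picked up a bug.
    return 2 * x if x < 3 else 0

frozen_code = gate_exit_code(frozen_baseline, threshold=1.0)
regressed_code = gate_exit_code(regressed_candidate, threshold=1.0)
print(frozen_code, regressed_code)
```

The gate only returns a nonzero code when it is pointed at output that can change between runs. Aimed at the frozen baseline, it returns 0 regardless of what happens elsewhere in the repo.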

Proposed change

  1. Make both paths repo-root-relative: python3 modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model --threshold 1.0 (verified to run and exit 0; from eval_set import CASES still resolves via the script dir on sys.path[0]). Alternatively add working-directory: modules/27-evals/lab and keep the original relative command.
  2. Point the gate at the candidate that actually varies (the agent's or repo's real output). Alternatively, for a generic course snippet, add a note that the bundled current_model is an illustrative stand-in and that a real gate should target the varying output.
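
The note in item 1 that from eval_set import CASES still resolves relies on standard Python behavior: when a script is run by path, its own directory is placed at sys.path[0], independent of the working directory. The sketch below checks this with throwaway files. The contents of eval_set.py and run_eval.py here are hypothetical stand-ins, and the sketch assumes the interpreter is not run with options that suppress the script directory (such as isolated mode).

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp).resolve()
    lab = repo_root / "modules" / "27-evals" / "lab"
    lab.mkdir(parents=True)
    # Hypothetical stand-ins for the real lab files.
    (lab / "eval_set.py").write_text("CASES = ['case-a', 'case-b']\n")
    (lab / "run_eval.py").write_text(
        "from eval_set import CASES\n"
        "print(len(CASES))\n"
    )
    # Run the script by its repo-root-relative path, with cwd at the repo root.
    result = subprocess.run(
        [sys.executable, "modules/27-evals/lab/run_eval.py"],
        cwd=repo_root,
        capture_output=True,
        text=True,
    )

import_resolved = result.returncode == 0
print(import_resolved, result.stdout.strip())
```

So moving to repo-root-relative paths changes how the candidate argument resolves, but not how the script finds its sibling eval_set module.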

Acceptance criteria

  • The Part D snippet runs and exits 0 when executed exactly as written (from repo root or with the stated working-directory).
  • The snippet or its prose makes clear the gate must target a varying candidate, not the frozen 100% fixture.

Affected files

  • modules/27-evals/README.md

References

Source finding F42 (realVotes 3/3).


Filed from an adversarial multi-agent course review (217 raw findings → 54 adversarially-verified survivors). Scoped for manual review; intentionally not auto-assigned to an agent.

claude added the ai-ready, bug, and P1 labels 2026-06-22 14:23:51 -04:00

Reference: justin/ai-workflow-course#28