Reframe sweep M7-27 + capstone (AI drives git, lesson=theory, de-slop) (#93)

Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #93.
2026-06-22 21:58:36 -04:00
committed by Claude (agent)
parent a29823f4b3
commit 513d7e7ac8
38 changed files with 1735 additions and 1424 deletions
+39 -38
@@ -51,10 +51,10 @@ from a loop. So the question this module exists to answer is blunt:
> **An agent did work while you were asleep. How do you *know* it did good work?**
"I read the diff" doesn't scale – the whole point of an unattended agent is that you weren't there.
"CI passed" is necessary but thin: CI proves the code builds and your existing tests are green, not
"I read the diff" doesn't scale: the whole point of an unattended agent is that you weren't there.
"CI passed" is necessary but thin. CI proves the code builds and your existing tests are green, not
that the agent actually did the *right thing*, well, on the cases that matter. You need a way to
measure agent output **systematically** – the same way every time, on a fixed set of cases, with a
measure agent output **systematically**, the same way every time, on a fixed set of cases, with a
score you can compare across runs. That measurement is an **eval**.
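
As a minimal sketch of that definition (fixed cases, one scoring rule, a comparable number), the following assumes a hypothetical case format of `(task list, expected pending count)` and two made-up candidates. It is not the lab's `run_eval.py`, only an illustration of the idea:

```python
from typing import Callable

# Hypothetical fixed cases: (input task list, expected pending count).
CASES = [
    ([], 0),
    ([{"title": "a", "done": False}], 1),
    ([{"title": "a", "done": True}], 0),
    ([{"title": "a", "done": False}, {"title": "b", "done": True}], 1),
]


def score(candidate: Callable[[list], int], cases=CASES) -> float:
    """Return the fraction of cases the candidate gets exactly right."""
    passed = 0
    for tasks, expected in cases:
        try:
            if candidate(tasks) == expected:
                passed += 1
        except Exception:
            # A crashing candidate fails the case; it does not crash the eval.
            pass
    return passed / len(cases)


def good(tasks):
    """Illustrative candidate: counts only unfinished tasks."""
    return sum(1 for t in tasks if not t["done"])


def bad(tasks):
    """Illustrative plausible-but-wrong candidate: counts finished tasks too."""
    return len(tasks)


print(score(good))  # 1.0
print(score(bad))   # 0.5
```

Because the cases and the scoring rule never change between runs, the two numbers are directly comparable, which is the property the rest of the module relies on.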
### What an eval actually is
@@ -113,7 +113,7 @@ good set is mostly edges. Three sources fill it fast:
head and forgetting the results.
Keep it small and sharp. Twenty discriminating cases beat two hundred that all test the happy path.
A case that every candidate passes tells you nothing – the cases that *separate* a good agent from a
A case that every candidate passes tells you nothing; the cases that *separate* a good agent from a
bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it
in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way
the syllabus means — it outlives every model it ever judges.
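
One way to find cases that carry no signal is to record a pass/fail verdict per case for each candidate and flag the cases everyone passes. The sketch below assumes a hypothetical `results` mapping with made-up verdicts; the candidate names mirror the lab folders, but the numbers are illustrative only:

```python
def always_passed(results: dict[str, list[bool]]) -> list[int]:
    """Return indices of cases that every candidate passes (no signal)."""
    per_case = zip(*results.values())
    return [i for i, verdicts in enumerate(per_case) if all(verdicts)]


# Hypothetical per-case verdicts, one list per candidate.
results = {
    "current_model": [True, True, True, True],
    "swapped_model": [True, True, False, False],
}

print(always_passed(results))  # [0, 1]
```

Cases 2 and 3 are the ones separating the two candidates here. Cases 0 and 1 are worth a second look: they may still guard against a future regression, but on current evidence they distinguish nothing.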
@@ -129,7 +129,7 @@ either runs and produces the right thing or it doesn't.
**LLM-as-judge.** Some output has no `==`: "is this commit message clear?", "does this PR
description explain the change?", "is this refactor actually cleaner?" The standard move is to ask
*another* model to grade it against a rubric. It works, and sometimes it's the only option – but be
*another* model to grade it against a rubric. It works, and sometimes it's the only option, but be
honest about what you've built:
- **Correlated blind spots.** A judge is a model grading a model. It can share the candidate's
@@ -153,17 +153,14 @@ Here is where the course thesis stops being a slogan and becomes a procedure.
You *will* swap the model. A cheaper one ships, your provider deprecates the one you're on, a new
release benchmarks better, someone edits the agent's prompt or its committed instructions file
(Module 5). Every one of those changes the behavior of every agent you run – silently. The code
(Module 5). Every one of those changes the behavior of every agent you run, silently. The code
around the model didn't change; the model did, and the model is the part you don't control.
A **regression eval** is the discipline of running the *same eval set* before and after the change
and comparing the scores:
1. Run the eval against the current model/prompt. Record the score — this is your baseline.
2. Make the change (new model, new prompt).
3. Run the *same* eval set again.
4. Compare. Score held or rose → the swap is safe by this eval. Score dropped → you just caught a
regression *before* it ran unattended against real work, not after.
and comparing the scores. The current model/prompt earns a baseline score. After the change (a new
model, a new prompt), the same eval set runs again and the two scores get compared. A score that
held or rose means the swap is safe by this eval; a score that dropped is a regression caught
*before* it ran unattended against real work, not after.
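
The comparison itself is small. A hedged sketch, assuming each run is stored as a list of per-case pass/fail verdicts in the same case order (the function names and sample data are hypothetical, not part of the lab):

```python
def newly_failing(before: list[bool], after: list[bool]) -> list[int]:
    """Indices of cases that passed at baseline but fail after the change."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if b and not a]


def verdict(before: list[bool], after: list[bool]) -> str:
    """Compare two runs of the same eval set and describe the outcome."""
    base = sum(before) / len(before)
    new = sum(after) / len(after)
    if new < base:
        broken = newly_failing(before, after)
        return f"regression: {base:.2f} -> {new:.2f}, newly failing cases {broken}"
    return f"safe by this eval: {base:.2f} -> {new:.2f}"


baseline = [True, True, True, True]   # run before the swap
swapped = [True, True, False, True]   # same cases, run after the swap

print(verdict(baseline, swapped))
# regression: 1.00 -> 0.75, newly failing cases [2]
```

Keeping per-case verdicts rather than only the aggregate score means a drop points straight at the case to read, instead of just announcing that something got worse.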
This is the answer to "the model is swappable." It's swappable **because** the eval set is what
makes swapping safe. Your prompts, your pipeline, your review reflexes, and — most of all — your
@@ -184,7 +181,7 @@ autonomy.
| At/above bar, stable across runs | Unattended on this *narrow* task, landing behind CI + the eval as a gate. |
| High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |
Two things make a guardrail real rather than decorative:
Two things make a guardrail bite:
- **The threshold blocks.** The eval returns an exit code; below-bar exits non-zero and stops the
pipeline exactly like a failing test (Module 14). The lab does this. An eval whose result nobody is
@@ -198,15 +195,15 @@ Two things make a guardrail real rather than decorative:
## The AI angle
Every other module made a tool more valuable *because* you're using AI. This one is the load-bearing
case, and it closes the argument the course opened with.
Every other module made a tool more valuable *because* you're using AI. This module closes the
argument the course opened with.
Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every
module since has been an installment on that claim — version control, review, CI, containers,
secrets, MCP, agents. **Evals are where it's proven.** An eval set is, literally, a model-agnostic
instrument: it judges output without caring which model produced it, which is exactly why it survives
the swap that retires the model. You don't trust an agent because you trust the vendor or this
quarter's benchmark; you trust it because *your* eval, on *your* cases, scored it above *your* bar
quarter's benchmark; you trust it because *your* eval, on *your* cases, scored it above *your* bar,
and you'll re-run that same eval the day the model changes under you, which it will.
That's the durable skill. Models are weather. The eval set is the thermometer you keep.
@@ -228,10 +225,10 @@ The lab files are in [`lab/`](lab/):
- `candidates/swapped_model/tasks.py` — a plausible-but-wrong candidate (stand-in for a bad swap).
- `llm_judge.py` — a model-agnostic LLM-as-judge stub, with its limits written in.
**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and your usual agentic
tool (any vendor). No API key or paid model is required to complete the lab – the bundled candidates
let the regression demo run offline — but the real payoff comes when you replace them with your own
agent's output.
**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and Claude Code (or
substitute your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
the regression demo run offline. The real payoff comes when you replace them with your own agent's
output.
### Part A — Run the eval against the current model
@@ -263,20 +260,22 @@ agent's output.
### Part C — Make it real with your own agent
3. Open your `tasks-app` and ask your agentic tool to implement (or re-implement) `pending_count()`
in `tasks.py`. Copy the `tasks.py` it produces into a new folder, e.g.
`candidates/my_run_1/tasks.py`, and score it:
3. Open your `tasks-app` and tell Claude Code (or substitute your own agent) to implement (or re-implement)
`pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the
folder if it doesn't exist. You direct; the agent does the file plumbing. Then run the eval
yourself and read the scorecard:
```bash
python run_eval.py candidates/my_run_1
```
4. Now actually swap something. Either change the model your tool uses, or change the *prompt* (ask
the same thing a different way, or tweak your committed instructions file from Module 5). Save the
new output as `candidates/my_run_2/` and score it. Compare the two scores. You just ran a
regression eval on a real model/prompt change and got a number that tells you whether the change
was safe. If a run scores below 100%, read the failing case and add the input that broke it as a
new permanent case in `eval_set.py` — the set gets sharper every time an agent surprises you.
4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
the same thing a different way, or tweak your committed instructions file from Module 5). Have the
agent write this run into `candidates/my_run_2/`, then run `run_eval.py` yourself and compare the
two scores. You just ran a regression eval on a real model/prompt change and got a number that
tells you whether the change was safe. If a run scores below 100%, read the failing case and direct
the agent to append the input that broke it as a new permanent case in `eval_set.py`; verify the
case it added. The set gets sharper every time an agent surprises you.
5. *(Optional, needs a model endpoint.)* Open `llm_judge.py`, read the limits at the bottom, set the
`EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output — say, a
@@ -287,8 +286,9 @@ agent's output.
6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
*"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
human reviews."* Then make it enforceable — this is one job in a CI workflow (Module 14), running
the exact command you ran in Parts A–B:
human reviews."* Then make it enforceable. This is one job in a CI workflow (Module 14), so direct
Claude Code (or substitute your own agent) to add an eval-gate job to the workflow it already wired up in
Module 14, running the same command from Parts A–B. The job it adds should look like this:
```yaml
- name: Eval gate
@@ -296,12 +296,13 @@ agent's output.
run: python run_eval.py candidates/current_model --threshold 1.0
```
The `working-directory:` line makes the CI job `cd` into the lab folder first, so the
Review the diff before you accept it, and confirm the path logic is right. The
`working-directory:` line makes the CI job `cd` into the lab folder first, so the
`candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
did on your machine. (Drop it and point a repo-root job straight at
`python modules/27-evals/lab/run_eval.py candidates/current_model` instead, and `candidates/`
won't exist from the repo root – the gate crashes with a *false* failure, which is worse than no
gate. If you'd rather keep a single line, spell both paths out from the repo root:
`python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
gate. If the agent prefers a single line, it can spell both paths out from the repo root:
`python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
--threshold 1.0`.)
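
For intuition about why the gate blocks, here is a rough sketch of how a `--threshold` flag could turn a score into an exit code. This is an assumption about the shape of such a script, not the actual contents of `run_eval.py`; `gate` and `main` are hypothetical names, and the score is passed in directly rather than computed:

```python
import argparse


def gate(score: float, threshold: float) -> int:
    """Map a score to an exit code: 0 lets the pipeline continue, 1 blocks it."""
    return 0 if score >= threshold else 1


def main(argv: list[str], score: float) -> int:
    """Parse gate arguments and report the result for an already-computed score."""
    parser = argparse.ArgumentParser(description="Eval gate sketch")
    parser.add_argument("candidate_dir")
    parser.add_argument("--threshold", type=float, default=1.0)
    args = parser.parse_args(argv)

    code = gate(score, args.threshold)
    status = "PASS" if code == 0 else "FAIL"
    print(f"{status}: {args.candidate_dir} scored {score:.0%} (bar {args.threshold:.0%})")
    return code


# Illustrative scores only; a real script would compute them from the eval set.
print(main(["candidates/current_model", "--threshold", "1.0"], score=1.0))
print(main(["candidates/swapped_model", "--threshold", "1.0"], score=0.5))
```

In a real script the return value would be handed to `sys.exit(...)`, so the CI runner sees a non-zero status and fails the job the same way it would for a failing test.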
@@ -367,10 +368,10 @@ line will change many times. The line is yours to keep.
This is an expansion-zone module over fast-moving ground. Re-check at build/publish time:
- [ ] **No vendor pinned.** Confirm the prose, lab, and `llm_judge.py` still name no specific LLM
- [ ] **No vendor pinned.** Confirm the module text, lab, and `llm_judge.py` still name no specific LLM
provider, model id, or pricing, and that `llm_judge.py`'s endpoint config is still generic
(env-var driven, OpenAI-style-compatible but not branded).
- [ ] **Eval tooling landscape.** If the module names any eval framework or LLM-as-judge tool by
- [ ] **Eval frameworks named.** If the module names any eval framework or LLM-as-judge tool by
name (it currently names none on purpose), verify it still exists and behaves as described. Prefer
keeping it tool-agnostic.
- [ ] **LLM-as-judge claims.** The bias/drift/correlation caveats are durable, but re-check that no