Reframe sweep M7-27 + capstone (AI drives git, lesson=theory, de-slop) (#93)
Sync course wiki / sync-wiki (push) Successful in 11s
Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #93.
+39
-38
@@ -51,10 +51,10 @@ from a loop. So the question this module exists to answer is blunt:
 
 > **An agent did work while you were asleep. How do you *know* it did good work?**
 
-"I read the diff" doesn't scale — the whole point of an unattended agent is that you weren't there.
-"CI passed" is necessary but thin: CI proves the code builds and your existing tests are green, not
+"I read the diff" doesn't scale: the whole point of an unattended agent is that you weren't there.
+"CI passed" is necessary but thin. CI proves the code builds and your existing tests are green, not
 that the agent actually did the *right thing*, well, on the cases that matter. You need a way to
-measure agent output **systematically** — the same way every time, on a fixed set of cases, with a
+measure agent output **systematically**, the same way every time, on a fixed set of cases, with a
 score you can compare across runs. That measurement is an **eval**.
 
 ### What an eval actually is
@@ -113,7 +113,7 @@ good set is mostly edges. Three sources fill it fast:
 head and forgetting the results.
 
 Keep it small and sharp. Twenty discriminating cases beat two hundred that all test the happy path.
-A case that every candidate passes tells you nothing — the cases that *separate* a good agent from a
+A case that every candidate passes tells you nothing; the cases that *separate* a good agent from a
 bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it
 in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way
 the syllabus means — it outlives every model it ever judges.
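The point about discriminating cases in the hunk above can be sketched in a few lines. This is a minimal illustration, not the lab's code: the task shape (dicts with a `done` flag), both candidate functions, and the `discriminating_cases` helper are hypothetical names chosen for the example. It flags the cases on which candidates disagree in pass/fail, which are the only cases that separate them.

```python
# Hypothetical eval cases for a pending_count-style function: (input, expected).
# The dict-with-"done" task shape is an assumption made for this sketch.
CASES = [
    ([], 0),
    ([{"done": False}], 1),
    ([{"done": True}, {"done": False}], 1),
    ([{"done": True}], 0),
]


def good_candidate(tasks):
    """Counts only the tasks that are not done."""
    return sum(1 for t in tasks if not t["done"])


def bad_candidate(tasks):
    """Plausible-but-wrong: counts every task, done or not."""
    return len(tasks)


def discriminating_cases(cases, candidates):
    """Return indices of cases where the candidates disagree on pass/fail."""
    keep = []
    for i, (inp, expected) in enumerate(cases):
        outcomes = {cand(inp) == expected for cand in candidates}
        if len(outcomes) > 1:  # at least one passed and at least one failed
            keep.append(i)
    return keep


separating = discriminating_cases(CASES, [good_candidate, bad_candidate])
print(separating)
```

In this toy set the first two cases are passed by both candidates, so they carry no signal; only the cases that include a completed task tell the two apart.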
@@ -129,7 +129,7 @@ either runs and produces the right thing or it doesn't.
 
 **LLM-as-judge.** Some output has no `==`: "is this commit message clear?", "does this PR
 description explain the change?", "is this refactor actually cleaner?" The standard move is to ask
-*another* model to grade it against a rubric. It works, and sometimes it's the only option — but be
+*another* model to grade it against a rubric. It works, and sometimes it's the only option, but be
 honest about what you've built:
 
 - **Correlated blind spots.** A judge is a model grading a model. It can share the candidate's
@@ -153,17 +153,14 @@ Here is where the course thesis stops being a slogan and becomes a procedure.
 
 You *will* swap the model. A cheaper one ships, your provider deprecates the one you're on, a new
 release benchmarks better, someone edits the agent's prompt or its committed instructions file
-(Module 5). Every one of those changes the behavior of every agent you run — silently. The code
+(Module 5). Every one of those changes the behavior of every agent you run, silently. The code
 around the model didn't change; the model did, and the model is the part you don't control.
 
 A **regression eval** is the discipline of running the *same eval set* before and after the change
-and comparing the scores:
-
-1. Run the eval against the current model/prompt. Record the score — this is your baseline.
-2. Make the change (new model, new prompt).
-3. Run the *same* eval set again.
-4. Compare. Score held or rose → the swap is safe by this eval. Score dropped → you just caught a
-regression *before* it ran unattended against real work, not after.
+and comparing the scores. The current model/prompt earns a baseline score. After the change (a new
+model, a new prompt), the same eval set runs again and the two scores get compared. A score that
+held or rose means the swap is safe by this eval; a score that dropped is a regression caught
+*before* it ran unattended against real work, not after.
 
 This is the answer to "the model is swappable." It's swappable **because** the eval set is what
 makes swapping safe. Your prompts, your pipeline, your review reflexes, and — most of all — your
@@ -184,7 +181,7 @@ autonomy.
 | At/above bar, stable across runs | Unattended on this *narrow* task, landing behind CI + the eval as a gate. |
 | High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |
 
-Two things make a guardrail real rather than decorative:
+Two things make a guardrail bite:
 
 - **The threshold blocks.** The eval returns an exit code; below-bar exits non-zero and stops the
 pipeline exactly like a failing test (Module 14). The lab does this. An eval whose result nobody is
@@ -198,15 +195,15 @@ Two things make a guardrail real rather than decorative:
 
 ## The AI angle
 
-Every other module made a tool more valuable *because* you're using AI. This one is the load-bearing
-case, and it closes the argument the course opened with.
+Every other module made a tool more valuable *because* you're using AI. This module closes the
+argument the course opened with.
 
 Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every
 module since has been an installment on that claim — version control, review, CI, containers,
 secrets, MCP, agents. **Evals are where it's proven.** An eval set is, literally, a model-agnostic
 instrument: it judges output without caring which model produced it, which is exactly why it survives
 the swap that retires the model. You don't trust an agent because you trust the vendor or this
-quarter's benchmark; you trust it because *your* eval, on *your* cases, scored it above *your* bar —
+quarter's benchmark; you trust it because *your* eval, on *your* cases, scored it above *your* bar,
 and you'll re-run that same eval the day the model changes under you, which it will.
 
 That's the durable skill. Models are weather. The eval set is the thermometer you keep.
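Re-running the same eval across a model swap, as the hunk above describes, reduces to scoring two candidates on one fixed case list and comparing. The sketch below assumes a simple pass-rate score; the case data, both candidate functions, and the `score`/`compare` helpers are hypothetical stand-ins, not the lab's `run_eval.py`.

```python
# Fixed eval set: (input, expected). The task shape is an assumption for this sketch.
CASES = [
    ([], 0),
    ([{"done": False}], 1),
    ([{"done": True}, {"done": False}], 1),
    ([{"done": True}], 0),
]


def current_model(tasks):
    """Stand-in for the output of the model in use today."""
    return sum(1 for t in tasks if not t["done"])


def swapped_model(tasks):
    """Stand-in for a plausible-but-wrong output after a swap."""
    return len(tasks)


def score(candidate, cases):
    """Fraction of cases the candidate gets exactly right."""
    passed = sum(1 for inp, expected in cases if candidate(inp) == expected)
    return passed / len(cases)


def compare(baseline, changed, cases):
    """Score both candidates on the same cases and flag a drop as a regression."""
    before = score(baseline, cases)
    after = score(changed, cases)
    return {"before": before, "after": after, "regressed": after < before}


report = compare(current_model, swapped_model, CASES)
print(report)
```

The only design choice that matters here is that `cases` is passed unchanged to both runs; a comparison across two different case lists would not be a regression eval.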
@@ -228,10 +225,10 @@ The lab files are in [`lab/`](lab/):
 - `candidates/swapped_model/tasks.py` — a plausible-but-wrong candidate (stand-in for a bad swap).
 - `llm_judge.py` — a model-agnostic LLM-as-judge stub, with its limits written in.
 
-**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and your usual agentic
-tool (any vendor). No API key or paid model is required to complete the lab — the bundled candidates
-let the regression demo run offline — but the real payoff comes when you replace them with your own
-agent's output.
+**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and Claude Code (sub
+your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
+the regression demo run offline. The real payoff comes when you replace them with your own agent's
+output.
 
 ### Part A — Run the eval against the current model
 
@@ -263,20 +260,22 @@ agent's output.
 
 ### Part C — Make it real with your own agent
 
-3. Open your `tasks-app` and ask your agentic tool to implement (or re-implement) `pending_count()`
-in `tasks.py`. Copy the `tasks.py` it produces into a new folder, e.g.
-`candidates/my_run_1/tasks.py`, and score it:
+3. Open your `tasks-app` and tell Claude Code (sub your own agent) to implement (or re-implement)
+`pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the
+folder if it doesn't exist. You direct; the agent does the file plumbing. Then run the eval
+yourself and read the scorecard:
 
 ```bash
 python run_eval.py candidates/my_run_1
 ```
 
-4. Now actually swap something. Either change the model your tool uses, or change the *prompt* (ask
-the same thing a different way, or tweak your committed instructions file from Module 5). Save the
-new output as `candidates/my_run_2/` and score it. Compare the two scores. You just ran a
-regression eval on a real model/prompt change and got a number that tells you whether the change
-was safe. If a run scores below 100%, read the failing case and add the input that broke it as a
-new permanent case in `eval_set.py` — the set gets sharper every time an agent surprises you.
+4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
+the same thing a different way, or tweak your committed instructions file from Module 5). Have the
+agent write this run into `candidates/my_run_2/`, then run `run_eval.py` yourself and compare the
+two scores. You just ran a regression eval on a real model/prompt change and got a number that
+tells you whether the change was safe. If a run scores below 100%, read the failing case and direct
+the agent to append the input that broke it as a new permanent case in `eval_set.py`; verify the
+case it added. The set gets sharper every time an agent surprises you.
 
 5. *(Optional, needs a model endpoint.)* Open `llm_judge.py`, read the limits at the bottom, set the
 `EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output — say, a
@@ -287,8 +286,9 @@ agent's output.
 
 6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
 *"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
-human reviews."* Then make it enforceable — this is one job in a CI workflow (Module 14), running
-the exact command you ran in Parts A–B:
+human reviews."* Then make it enforceable. This is one job in a CI workflow (Module 14), so direct
+Claude Code (sub your own agent) to add an eval-gate job to the workflow it already wired up in
+Module 14, running the same command from Parts A–B. The job it adds should look like this:
 
 ```yaml
 - name: Eval gate
@@ -296,12 +296,13 @@ agent's output.
 run: python run_eval.py candidates/current_model --threshold 1.0
 ```
 
-The `working-directory:` line makes the CI job `cd` into the lab folder first, so the
+Review the diff before you accept it, and confirm the path logic is right. The
+`working-directory:` line makes the CI job `cd` into the lab folder first, so the
 `candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
 did on your machine. (Drop it and point a repo-root job straight at
-`python modules/27-evals/lab/run_eval.py candidates/current_model` instead, and `candidates/`
-won't exist from the repo root — the gate crashes with a *false* failure, which is worse than no
-gate. If you'd rather keep a single line, spell both paths out from the repo root:
+`python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
+won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
+gate. If the agent prefers a single line, it can spell both paths out from the repo root:
 `python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
 --threshold 1.0`.)
 
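For readers following the gate logic in the two hunks above, here is one way a `--threshold` flag can turn a score into a blocking exit code. It is a sketch under stated assumptions, not the contents of the lab's `run_eval.py`: the `gate_exit_code` and `main` names are hypothetical, and the score is passed in directly rather than computed from a candidate directory.

```python
import argparse


def gate_exit_code(score, threshold):
    """Return 0 when the score meets the bar, 1 when it falls below it."""
    return 0 if score >= threshold else 1


def main(argv, score):
    """Parse a run_eval-style command line and return the exit code to use.

    `score` is injected here to keep the sketch self-contained; a real
    runner would compute it from the candidate directory.
    """
    parser = argparse.ArgumentParser(description="Eval gate sketch")
    parser.add_argument("candidate_dir")
    parser.add_argument("--threshold", type=float, default=1.0)
    args = parser.parse_args(argv)

    code = gate_exit_code(score, args.threshold)
    print(f"{args.candidate_dir}: score={score:.2f} "
          f"threshold={args.threshold:.2f} exit={code}")
    return code


passing = main(["candidates/current_model", "--threshold", "1.0"], score=1.0)
failing = main(["candidates/current_model", "--threshold", "1.0"], score=0.75)
```

A script built this way would hand the returned code to `sys.exit`, and a CI step treats any non-zero exit as a failed job, which is what lets the threshold block a merge.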
@@ -367,10 +368,10 @@ line will change many times. The line is yours to keep.
 
 This is an expansion-zone module over fast-moving ground. Re-check at build/publish time:
 
-- [ ] **No vendor pinned.** Confirm the prose, lab, and `llm_judge.py` still name no specific LLM
+- [ ] **No vendor pinned.** Confirm the module text, lab, and `llm_judge.py` still name no specific LLM
 provider, model id, or pricing, and that `llm_judge.py`'s endpoint config is still generic
 (env-var driven, OpenAI-style-compatible but not branded).
-- [ ] **Eval tooling landscape.** If the module names any eval framework or LLM-as-judge tool by
+- [ ] **Eval frameworks named.** If the module names any eval framework or LLM-as-judge tool by
 name (it currently names none on purpose), verify it still exists and behaves as described. Prefer
 keeping it tool-agnostic.
 - [ ] **LLM-as-judge claims.** The bias/drift/correlation caveats are durable, but re-check that no