De-slop: remove every em-dash + banned words across all modules + capstone (#94)

Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #94.
2026-06-22 23:21:22 -04:00
committed by Claude (agent)
parent 513d7e7ac8
commit c098933f25
99 changed files with 1324 additions and 1315 deletions
+49 -49
@@ -1,7 +1,7 @@
# Module 27 Evals: Trusting an Agent That Acts Without You
# Module 27. Evals: Trusting an Agent That Acts Without You
> **You will swap the model. Evals are the only thing that tells you whether the swap was safe.**
> This is the instrument that turns "the agent's output looks fine" into a number you can gate on
> This is the instrument that turns "the agent's output looks fine" into a number you can gate on,
> and it's where the whole course's thesis finally pays out.
---
@@ -10,16 +10,16 @@
This is the closer. It assumes the whole course, but it leans hardest on:
- **Module 1** the thesis (the model is the cheap, swappable part; the workflow is the durable
- **Module 1**: the thesis (the model is the cheap, swappable part; the workflow is the durable
skill) and the `tasks-app` we've carried the whole way. This module is where the thesis gets its
proof.
- **Module 13 Testing in the AI Era** you can write a deterministic pass/fail check. Evals are
- **Module 13, Testing in the AI Era**: you can write a deterministic pass/fail check. Evals are
the next thing up the ladder: scoring output that a single test can't fully pin down.
- **Module 14 Continuous Integration** running checks automatically on every change, with an
- **Module 14, Continuous Integration**: running checks automatically on every change, with an
exit code that gates. Evals run the same way and gate the same way.
- **Module 10 Reviewing Code You Didn't Write** the human review skill evals partially automate
- **Module 10, Reviewing Code You Didn't Write**: the human review skill evals partially automate
and partially *replace* once a human isn't in the loop.
- **Modules 24-26 the Unit 5 agent ladder** assistive agents (24), autonomous-but-supervised
- **Modules 24-26, the Unit 5 agent ladder**: assistive agents (24), autonomous-but-supervised
agents (25), and orchestrated fleets (26). Evals are what decide how far up that ladder any given
agent is allowed to climb.
@@ -29,11 +29,11 @@ This is the closer. It assumes the whole course, but it leans hardest on:
By the end of this module you can:
1. State precisely what an eval is and how it differs from a test and when you need one instead of
1. State precisely what an eval is and how it differs from a test, and when you need one instead of
the other.
2. Build a small eval set for a concrete agent task: representative cases plus a grader that turns
output into a score.
3. Score agent output programmatically, and use an LLM-as-judge where you must honestly, knowing
3. Score agent output programmatically, and use an LLM-as-judge only where you must, staying honest about
its failure modes.
4. Run a **regression eval** across a model or prompt change and read whether the change was safe.
5. Set a **guardrail**: tie an autonomy level to an eval score so an agent earns the right to act
@@ -61,18 +61,18 @@ score you can compare across runs. That measurement is an **eval**.
An eval has exactly three parts. None of them are exotic:
1. **An eval set** a fixed list of representative cases. Inputs the agent will face, chosen to
1. **An eval set**: a fixed list of representative cases. Inputs the agent will face, chosen to
cover the normal path *and* the edges where it tends to fail.
2. **A grader** something that turns each case's output into a result. Pass/fail, or a score. The
2. **A grader**: something that turns each case's output into a result. Pass/fail, or a score. The
grader can be code (`==`, a regex, "does it compile, run, and produce this output") or, when the
output is open-ended, another model (LLM-as-judge).
3. **An aggregate + a threshold** roll the per-case results into one number, and a line that number
3. **An aggregate + a threshold**: roll the per-case results into one number, and a line that number
has to clear. "18/20 = 90%, and I require 90%."
That's it. An eval is a test suite pointed at *agent behavior* instead of a function, with a score
instead of a single green check, run against a moving target (the model) instead of frozen code.
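
The three parts above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab's `run_eval.py`: the `Case` shape, `grade`, `run_eval`, and the toy `candidate` function are hypothetical names chosen for this example.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    """One eval case (hypothetical shape): a label, an input, an expected result."""
    name: str
    done_flags: list[bool]  # one flag per task: True means the task is done
    expected: int           # how many tasks should count as pending


# 1. The eval set: a fixed list of representative cases (illustrative only).
CASES = [
    Case("empty list", [], 0),
    Case("three pending", [False, False, False], 3),
    Case("one done, two pending", [True, False, False], 2),
    Case("all done", [True, True], 0),
]


def grade(candidate: Callable[[list[bool]], int], case: Case) -> bool:
    """2. The grader: a programmatic pass/fail check for one case."""
    try:
        return candidate(case.done_flags) == case.expected
    except Exception:
        return False  # a crash counts as a failed case


def run_eval(candidate: Callable[[list[bool]], int], threshold: float) -> tuple[float, bool]:
    """3. Aggregate + threshold: one score, and whether it clears the bar."""
    passed = sum(grade(candidate, case) for case in CASES)
    score = passed / len(CASES)
    return score, score >= threshold


def candidate(done_flags: list[bool]) -> int:
    """A toy stand-in for agent output: count the tasks that are not done."""
    return sum(1 for done in done_flags if not done)


if __name__ == "__main__":
    score, ok = run_eval(candidate, threshold=0.9)
    print(f"score: {score:.0%}  clears threshold: {ok}")
```

Swapping `candidate` for a different function while leaving `CASES` untouched is the whole mechanism: the cases stay fixed and only the thing being judged changes.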
### Eval vs. test the distinction that matters
### Eval vs. test: the distinction that matters
This audience already writes tests (Module 13). The instinct to ask "isn't an eval just a test?" is
correct enough to be dangerous. Where they diverge:
@@ -82,7 +82,7 @@ correct enough to be dangerous. Where they diverge:
| **Subject** | Your code, frozen | An agent/model's output, which changes under you |
| **Result** | Binary: pass/fail | A score across many cases (90%, not "green") |
| **Determinism** | Same input → same output | Same input may give *different* output run to run |
| **Failure meaning** | The code is broken | The agent is *less good* maybe still acceptable |
| **Failure meaning** | The code is broken | The agent is *less good*, maybe still acceptable |
| **What it gates** | "Is the code correct?" | "Is this model/prompt good enough to trust here?" |
The practical upshot: a single failing case doesn't condemn an agent the way a failing unit test
@@ -91,7 +91,7 @@ want unattended on low-stakes work and nowhere near enough for high-stakes work.
the rate; *you* set the bar per task.
And the inverse: **where a deterministic test is possible, write the test, not an eval.** Evals are
for the band of behavior tests can't pin down open-ended output, judgment calls, "did it pick a
for the band of behavior tests can't pin down: open-ended output, judgment calls, "did it pick a
reasonable approach." Reaching for an LLM judge to grade something `==` could have caught is how you
get a slower, flakier, more expensive test that you trust less. (The lab's grader is deliberately
programmatic for exactly this reason.)
@@ -101,14 +101,14 @@ programmatic for exactly this reason.)
The eval set is the asset. The grader is plumbing; the *cases* are where the judgment lives, and a
good set is mostly edges. Three sources fill it fast:
- **The normal path** a couple of cases proving the agent does the obvious thing. These rarely
- **The normal path**: a couple of cases proving the agent does the obvious thing. These rarely
catch anything; they're the floor.
- **The edges you already know break** every "it looked right but" bug your agents have shipped is
- **The edges you already know break**: every "it looked right but" bug your agents have shipped is
a permanent case. Module 13 left us a perfect one: an agent implemented `pending_count()` as
`len(self.tasks)`. It passes any quick manual check (add three tasks, count says three) and is
wrong the instant a task is marked done. *That bug becomes case #4 in this module's lab and never
escapes again.*
- **The cases you'd manually check anyway** write down the inputs you reflexively try when
- **The cases you'd manually check anyway**: write down the inputs you reflexively try when
reviewing this kind of change. That list *is* your eval set; you've just been running it in your
head and forgetting the results.
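
To make the `pending_count()` edge concrete, here is a small sketch of why the bug survives a quick manual check. The `TaskList` class below is an assumed, simplified shape; the real `tasks-app` may store tasks differently. Only the `len(self.tasks)` behavior comes from the description above.

```python
class TaskList:
    """Hypothetical minimal task list; the real tasks-app may be shaped differently."""

    def __init__(self) -> None:
        self.tasks: list[dict] = []

    def add(self, title: str) -> None:
        self.tasks.append({"title": title, "done": False})

    def mark_done(self, index: int) -> None:
        self.tasks[index]["done"] = True

    def pending_count_buggy(self) -> int:
        # The bug described above: counts every task, done or not.
        return len(self.tasks)

    def pending_count(self) -> int:
        # Counts only the tasks that are still pending.
        return sum(1 for task in self.tasks if not task["done"])


if __name__ == "__main__":
    tl = TaskList()
    for title in ("write", "review", "ship"):
        tl.add(title)
    # The quick manual check: both versions agree, so the bug is invisible.
    print(tl.pending_count_buggy(), tl.pending_count())  # prints "3 3"
    tl.mark_done(0)
    # The edge case: the two versions now disagree.
    print(tl.pending_count_buggy(), tl.pending_count())  # prints "3 2"
```

A case that marks one task done before counting is the one that separates the two implementations, which is why it earns a permanent place in the set.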
@@ -116,14 +116,14 @@ Keep it small and sharp. Twenty discriminating cases beat two hundred that all t
A case that every candidate passes tells you nothing; the cases that *separate* a good agent from a
bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it
in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way
the syllabus means it outlives every model it ever judges.
the syllabus means: it outlives every model it ever judges.
### Scoring: programmatic first, LLM-as-judge only when you must
Two graders, in strict priority order.
**Programmatic.** If "correct" is checkable in code exact value, output matches, exit code is 0,
the file it shouldn't have touched is untouched do that. It's deterministic, free, fast, and you
**Programmatic.** If "correct" is checkable in code (exact value, output matches, exit code is 0,
the file it shouldn't have touched is untouched), do that. It's deterministic, free, fast, and you
trust it completely. Most of what an agent does to a codebase is checkable this way, because code
either runs and produces the right thing or it doesn't.
@@ -138,11 +138,11 @@ honest about what you've built:
- **Bias.** Judges favor longer, more confident, and first-presented answers regardless of
correctness. Control for position and length or your scores measure verbosity.
- **Drift.** Swap the judge model and your scores move while the candidate didn't change. The ruler
is made of rubber which is poison for *regression* evals, whose entire job is to hold the ruler
is made of rubber, which is poison for *regression* evals, whose entire job is to hold the ruler
still.
So when you must use a judge: pin it (fixed model, `temperature: 0`), keep it **separate** from the
model under test, and **calibrate it against human labels** hand-grade ~20 examples, run the judge
model under test, and **calibrate it against human labels**: hand-grade ~20 examples, run the judge
on the same 20, and confirm it agrees with you *before* you let it gate anything. An uncalibrated
judge is a vibe with a number attached. The lab ships a model-agnostic judge stub (`llm_judge.py`)
that abstains until you point it at your own endpoint, with these limits written into the file.
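
The calibration step can be sketched as a simple agreement check between your hand labels and the judge's verdicts on the same examples. This is an assumption-laden illustration, not part of `llm_judge.py`: the function names and the 90% agreement bar are hypothetical, and you would pick your own bar.

```python
def agreement_rate(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("both label lists must cover the same examples")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def judge_is_calibrated(
    human_labels: list[bool],
    judge_labels: list[bool],
    min_agreement: float = 0.9,  # hypothetical bar; choose your own
) -> bool:
    """True only if the judge agrees with the human labels often enough to gate on."""
    return agreement_rate(human_labels, judge_labels) >= min_agreement


if __name__ == "__main__":
    # 20 hand-graded examples; the judge disagrees on the last two.
    human = [True] * 12 + [False] * 8
    judge = [True] * 12 + [False] * 6 + [True] * 2
    print(f"agreement: {agreement_rate(human, judge):.0%}")
    print("calibrated:", judge_is_calibrated(human, judge))
```

Plain agreement is the simplest possible measure. It says nothing about which direction the judge errs in, so reading the disagreeing examples by hand is still worth the time.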
@@ -163,7 +163,7 @@ held or rose means the swap is safe by this eval; a score that dropped is a regr
*before* it ran unattended against real work, not after.
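
Reading a regression eval comes down to comparing two runs of the same eval set. The sketch below assumes each run is recorded as a mapping from case name to pass/fail; that format, the function name, and the case names are hypothetical and are not what `run_eval.py` emits.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case results from the same eval set before and after a swap."""
    if not baseline:
        raise ValueError("the eval set is empty")
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must use the exact same eval set")
    total = len(baseline)
    base_score = sum(baseline.values()) / total
    cand_score = sum(candidate.values()) / total
    # Cases that passed before the swap and fail after it.
    regressed = sorted(name for name, ok in baseline.items() if ok and not candidate[name])
    return {
        "baseline": base_score,
        "candidate": cand_score,
        "regressed_cases": regressed,
        # A score that held or rose is safe by this eval; a drop is a regression.
        "safe": cand_score >= base_score,
    }


if __name__ == "__main__":
    before = {"empty": True, "one": True, "three": True, "done": True, "mixed": True}
    after = {"empty": True, "one": True, "three": True, "done": False, "mixed": False}
    report = compare_runs(before, after)
    print(f"{report['baseline']:.0%} -> {report['candidate']:.0%}, safe: {report['safe']}")
    print("regressed:", report["regressed_cases"])
```

Listing the regressed cases by name matters as much as the score: it tells you which behavior the swap broke, not just that something did.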
This is the answer to "the model is swappable." It's swappable **because** the eval set is what
makes swapping safe. Your prompts, your pipeline, your review reflexes, and most of all your
makes swapping safe. Your prompts, your pipeline, your review reflexes, and, most of all, your
eval set don't expire when the model does. They're the durable skill the course promised in
Module 1. The model is a component you can replace; the eval is the regression test that tells you the
replacement fits. That's the whole argument, made operational.
@@ -176,8 +176,8 @@ autonomy.
| Eval score on this task | Reasonable autonomy (the Unit 5 ladder) |
|---|---|
| Low / unmeasured | Assistive only it suggests, a human decides (Module 24). |
| Solid, below your bar | Autonomous but fully gated opens a PR, a human reviews and merges (Module 25). |
| Low / unmeasured | Assistive only; it suggests, a human decides (Module 24). |
| Solid, below your bar | Autonomous but fully gated; opens a PR, a human reviews and merges (Module 25). |
| At/above bar, stable across runs | Unattended on this *narrow* task, landing behind CI + the eval as a gate. |
| High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |
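
The first three rows of the table can be sketched as a lookup from recent eval scores to an autonomy level. Everything here is an assumption made for illustration: the `Autonomy` enum, the `bar` and `floor` parameters, and the choice to read "stable across runs" as "the worst recent run still clears the bar." The fourth row is deliberately left out, since it depends on breadth and history that a single task's score can't show.

```python
from enum import Enum


class Autonomy(Enum):
    """Hypothetical labels for the first three rungs of the ladder."""
    ASSISTIVE = "assistive: it suggests, a human decides"
    GATED = "autonomous but gated: opens a PR, a human reviews and merges"
    UNATTENDED = "unattended on this narrow task, behind CI and the eval gate"


def allowed_autonomy(recent_scores: list[float], bar: float, floor: float) -> Autonomy:
    """Map recent eval scores for one task to the most autonomy it has earned.

    bar:   the score you require for unattended work (you set this per task).
    floor: the score below which the agent stays assistive (also yours to set).
    """
    if not recent_scores:
        return Autonomy.ASSISTIVE  # unmeasured counts as low
    worst = min(recent_scores)     # "stable" read as: even the worst run clears the line
    if worst >= bar:
        return Autonomy.UNATTENDED
    if worst >= floor:
        return Autonomy.GATED
    return Autonomy.ASSISTIVE


if __name__ == "__main__":
    print(allowed_autonomy([], bar=1.0, floor=0.8).name)
    print(allowed_autonomy([1.0, 1.0, 1.0], bar=1.0, floor=0.8).name)
    print(allowed_autonomy([1.0, 0.8], bar=1.0, floor=0.8).name)
    print(allowed_autonomy([0.6], bar=1.0, floor=0.8).name)
```

Using the minimum rather than the average is a conservative choice: one bad run among several good ones is enough to keep a human in the loop.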
@@ -199,7 +199,7 @@ Every other module made a tool more valuable *because* you're using AI. This mod
argument the course opened with.
Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every
module since has been an installment on that claim version control, review, CI, containers,
module since has been an installment on that claim: version control, review, CI, containers,
secrets, MCP, agents. **Evals are where it's proven.** An eval set is, literally, a model-agnostic
instrument: it judges output without caring which model produced it, which is exactly why it survives
the swap that retires the model. You don't trust an agent because you trust the vendor or this
@@ -217,20 +217,20 @@ a regression eval across a "model swap."
The lab files are in [`lab/`](lab/):
- `eval_set.py` five cases for the `pending_count` task (data only).
- `run_eval.py` the runner: imports a candidate, scores it, prints a scorecard, exits non-zero
- `eval_set.py`: five cases for the `pending_count` task (data only).
- `run_eval.py`: the runner; it imports a candidate, scores it, prints a scorecard, exits non-zero
below threshold.
- `candidates/current_model/tasks.py` a correct candidate (stand-in for your current model's
- `candidates/current_model/tasks.py`: a correct candidate (stand-in for your current model's
output).
- `candidates/swapped_model/tasks.py` a plausible-but-wrong candidate (stand-in for a bad swap).
- `llm_judge.py` a model-agnostic LLM-as-judge stub, with its limits written in.
- `candidates/swapped_model/tasks.py`: a plausible-but-wrong candidate (stand-in for a bad swap).
- `llm_judge.py`: a model-agnostic LLM-as-judge stub, with its limits written in.
**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and Claude Code (sub
your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
the regression demo run offline. The real payoff comes when you replace them with your own agent's
output.
### Part A Run the eval against the current model
### Part A: Run the eval against the current model
1. From the lab folder, run the eval against the passing candidate:
@@ -240,25 +240,25 @@ output.
echo "exit code: $?"
```
Five cases pass, the score is 100%, and the exit code is `0`. **This is your baseline** the
Five cases pass, the score is 100%, and the exit code is `0`. **This is your baseline**: the
score the current model earns on this task. Read the cases in `eval_set.py`: notice case #4,
"completed tasks are NOT pending." That's the Module 13 bug, now a permanent case.
### Part B Swap the model and re-run (the whole point)
### Part B: Swap the model and re-run (the whole point)
2. Now simulate the swap run the *exact same eval set* against the other candidate:
2. Now simulate the swap: run the *exact same eval set* against the other candidate:
```bash
python run_eval.py candidates/swapped_model
echo "exit code: $?"
```
It drops to 60% and exits `1`. Look at *which* cases failed: the easy ones still pass this
It drops to 60% and exits `1`. Look at *which* cases failed: the easy ones still pass; this
output would sail through a casual manual check. The eval caught a regression that a skim would
have missed, **and the non-zero exit code means a pipeline would have blocked it.** That is a
guardrail doing its job.
### Part C Make it real with your own agent
### Part C: Make it real with your own agent
3. Open your `tasks-app` and tell Claude Code (sub your own agent) to implement (or re-implement)
`pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the
@@ -278,11 +278,11 @@ output.
case it added. The set gets sharper every time an agent surprises you.
5. *(Optional, needs a model endpoint.)* Open `llm_judge.py`, read the limits at the bottom, set the
`EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output say, a
`EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output, say a
commit message your agent wrote. Note how much shakier that score feels than the programmatic one.
That feeling is correct, and it's why programmatic graders come first.
### Part D Set the guardrail (on paper, then in CI)
### Part D: Set the guardrail (on paper, then in CI)
6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
*"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
@@ -310,9 +310,9 @@ output.
is now structural, not a promise.
**One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled,
always-correct stand-in it scores 100% on every run, forever, so a gate pointed at it can never
always-correct stand-in: it scores 100% on every run, forever, so a gate pointed at it can never
fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real
pipeline, point the gate at the candidate that actually *varies* your agent's real output for
pipeline, point the gate at the candidate that actually *varies*: your agent's real output for
this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the
model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the
same command drops to 60%, exits `1`, and blocks the merge.
@@ -323,22 +323,22 @@ output.
The honesty this course has insisted on all the way through applies hardest to its own closer.
- **Evals measure what you put in them and nothing else.** A 100% score means the agent passed
- **Evals measure what you put in them, and nothing else.** A 100% score means the agent passed
*your cases*, not that it's correct in general. The gap between "passes my eval" and "is actually
good" is exactly the cases you didn't think to write. An eval set is a lower bound on quality, never
a proof. Treat a green eval as "no known regression," not "verified correct."
- **Eval sets rot.** Cases that no model ever fails stop discriminating; tasks drift away from what
you actually do. An eval set you don't prune and grow becomes a comforting green light that's
measuring last year's problems. Budget maintenance for it like any other test suite.
- **LLM-as-judge is a model grading a model.** Re-read that section correlated blind spots, bias,
- **LLM-as-judge is a model grading a model.** Re-read that section: correlated blind spots, bias,
and drift are not edge cases, they're the default behavior. An uncalibrated judge can hand you a
confident wrong score, which is worse than no score. Where you can grade in code, do.
- **A score is not a decision.** The eval tells you the rate; *you* still set the bar, and the right
bar depends on stakes the eval can't see. 95% might be plenty for triaging issue labels and
reckless for anything touching auth, money, or customer data. The number informs the judgment; it
doesn't replace it.
- **Evals don't catch novel harms, only measured ones.** A genuinely new failure mode a class of
mistake no case anticipates passes every eval until the day it doesn't and you add the case after
- **Evals don't catch novel harms, only measured ones.** A genuinely new failure mode (a class of
mistake no case anticipates) passes every eval until the day it doesn't and you add the case after
the fact. Evals make agents *trustworthy on known territory*. They are not a substitute for the
recovery muscles (Module 12) that exist for when something gets through anyway.
@@ -350,13 +350,13 @@ The honesty this course has insisted on all the way through applies hardest to i
- You can explain the difference between a test and an eval, and say when you'd reach for each.
- You've run `run_eval.py` against both bundled candidates and watched the same eval set pass one and
fail the other including the exit code flipping to `1`.
fail the other, including the exit code flipping to `1`.
- You've graded your *own* agent's output, then changed the model or prompt and re-run the same eval
set as a regression check, and you can read the before/after scores as "safe" or "not safe."
- You can state, for one concrete task, the eval score that would let an agent act unattended on it
- You can state, for one concrete task, the eval score that would let an agent act unattended on it,
and where that threshold would live in your pipeline.
- You can say, in your own words, why the eval set is the durable skill and the model is the swappable
part. That's the whole course in one sentence and you can now run it from the keyboard.
part. That's the whole course in one sentence, and you can now run it from the keyboard.
That's the close. You started by copy-pasting out of a chat window; you're ending by letting an agent
act without you and holding a measured, enforceable line on whether to trust it. The model under that
@@ -1,12 +1,12 @@
"""Candidate output: a SWAPPED model/prompt.
Same task, different model (or a tweaked prompt). This output "looks right" and
passes a casual manual check adding three tasks and calling count returns 3.
passes a casual manual check; adding three tasks and calling count returns 3.
But pending_count() returns the total number of tasks, not the number of
*pending* ones, so it's wrong the moment anything is marked done.
Nobody would notice this by skimming. The eval set notices it instantly. That's
the regression eval catching an unsafe swap exactly the scenario this module
the regression eval catching an unsafe swap, exactly the scenario this module
exists for. Replace this with your own swapped-model output when you run it for
real; you may get lucky and have it pass, or you may catch a regression like
this one.
+1 -1
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
- the expected result (here: how many tasks should count as pending).
The grading lives in run_eval.py; this file is just data. Keeping the cases
separate from any model, prompt, or runner is the whole point the same eval
separate from any model, prompt, or runner is the whole point; the same eval
set judges *any* candidate you point it at, which is what makes it useful when
you swap the model out from under it.
+2 -2
@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
    key = os.environ.get("EVAL_JUDGE_KEY")
    model = os.environ.get("EVAL_JUDGE_MODEL")
    if not (url and key and model):
        return {"score": None, "reason": "judge not configured abstaining (set EVAL_JUDGE_* to enable)"}
        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}
    payload = json.dumps({
        "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
# about the candidate changed. The ruler is itself made of rubber.
#
# So: use a programmatic grader (run_eval.py) wherever a deterministic check is
# possible that is most of the time. Reach for an LLM judge only for genuinely
# possible; that is most of the time. Reach for an LLM judge only for genuinely
# open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
# run the judge on them, and confirm it agrees with you before you let it gate
# anything. An uncalibrated judge is a vibe with a number attached.
+2 -2
@@ -68,9 +68,9 @@ def main(argv):
    print(f"\nscore: {passed}/{len(CASES)} = {score:.0%} threshold: {args.threshold:.0%}")
    if score < args.threshold:
        print("RESULT: below threshold this change is NOT safe to ship.\n")
        print("RESULT: below threshold; this change is NOT safe to ship.\n")
        return 1
    print("RESULT: at or above threshold safe by this eval.\n")
    print("RESULT: at or above threshold; safe by this eval.\n")
    return 0