c098933f25
Sync course wiki / sync-wiki (push) Successful in 4s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
384 lines
22 KiB
Markdown
384 lines
22 KiB
Markdown
# Module 27. Evals: Trusting an Agent That Acts Without You
|
||
|
||
> **You will swap the model. Evals are the only thing that tells you whether the swap was safe.**
|
||
> This is the instrument that turns "the agent's output looks fine" into a number you can gate on,
|
||
> and it's where the whole course's thesis finally pays out.
|
||
|
||
---
|
||
|
||
## Prerequisites
|
||
|
||
This is the closer. It assumes the whole course, but it leans hardest on:
|
||
|
||
- **Module 1**: the thesis (the model is the cheap, swappable part; the workflow is the durable
|
||
skill) and the `tasks-app` we've carried the whole way. This module is where the thesis gets its
|
||
proof.
|
||
- **Module 13, Testing in the AI Era**: you can write a deterministic pass/fail check. Evals are
|
||
the next thing up the ladder: scoring output that a single test can't fully pin down.
|
||
- **Module 14, Continuous Integration**: running checks automatically on every change, with an
|
||
exit code that gates. Evals run the same way and gate the same way.
|
||
- **Module 10, Reviewing Code You Didn't Write**: the human review skill evals partially automate
|
||
and partially *replace* once a human isn't in the loop.
|
||
- **Modules 24–26, the Unit 5 agent ladder**: assistive agents (24), autonomous-but-supervised
|
||
agents (25), and orchestrated fleets (26). Evals are what decide how far up that ladder any given
|
||
agent is allowed to climb.
|
||
|
||
---
|
||
|
||
## Learning objectives
|
||
|
||
By the end of this module you can:
|
||
|
||
1. State precisely what an eval is and how it differs from a test, and when you need one instead of
|
||
the other.
|
||
2. Build a small eval set for a concrete agent task: representative cases plus a grader that turns
|
||
output into a score.
|
||
3. Score agent output programmatically, and use an LLM-as-judge where you must, honestly, knowing
|
||
its failure modes.
|
||
4. Run a **regression eval** across a model or prompt change and read whether the change was safe.
|
||
5. Set a **guardrail**: tie an autonomy level to an eval score so an agent earns the right to act
|
||
unattended instead of being granted it on faith.
|
||
|
||
---
|
||
|
||
## Key concepts
|
||
|
||
### The question Unit 5 has been building toward
|
||
|
||
Unit 5 walked the agent from your elbow into the pipeline: assisting you (Module 24), then acting
|
||
under supervision (Module 25), then several of them at once (Module 26). Each step removed a human
|
||
from a loop. So the question this module exists to answer is blunt:
|
||
|
||
> **An agent did work while you were asleep. How do you *know* it did good work?**
|
||
|
||
"I read the diff" doesn't scale: the whole point of an unattended agent is that you weren't there.
|
||
"CI passed" is necessary but thin. CI proves the code builds and your existing tests are green, not
|
||
that the agent actually did the *right thing*, well, on the cases that matter. You need a way to
|
||
measure agent output **systematically**, the same way every time, on a fixed set of cases, with a
|
||
score you can compare across runs. That measurement is an **eval**.
|
||
|
||
### What an eval actually is
|
||
|
||
An eval has exactly three parts. None of them are exotic:
|
||
|
||
1. **An eval set**: a fixed list of representative cases. Inputs the agent will face, chosen to
|
||
cover the normal path *and* the edges where it tends to fail.
|
||
2. **A grader**: something that turns each case's output into a result. Pass/fail, or a score. The
|
||
grader can be code (`==`, a regex, "does it compile, run, and produce this output") or, when the
|
||
output is open-ended, another model (LLM-as-judge).
|
||
3. **An aggregate + a threshold**: roll the per-case results into one number, and a line that number
|
||
has to clear. "18/20 = 90%, and I require 90%."
|
||
|
||
That's it. An eval is a test suite pointed at *agent behavior* instead of a function, with a score
|
||
instead of a single green check, run against a moving target (the model) instead of frozen code.
|
||
|
||
### Eval vs. test: the distinction that matters
|
||
|
||
This audience already writes tests (Module 13). The instinct to ask "isn't an eval just a test?" is
|
||
correct enough to be dangerous. Where they diverge:
|
||
|
||
| | A test (Module 13) | An eval |
|
||
|---|---|---|
|
||
| **Subject** | Your code, frozen | An agent/model's output, which changes under you |
|
||
| **Result** | Binary: pass/fail | A score across many cases (90%, not "green") |
|
||
| **Determinism** | Same input → same output | Same input may give *different* output run to run |
|
||
| **Failure meaning** | The code is broken | The agent is *less good*, maybe still acceptable |
|
||
| **What it gates** | "Is the code correct?" | "Is this model/prompt good enough to trust here?" |
|
||
|
||
The practical upshot: a single failing case doesn't condemn an agent the way a failing unit test
|
||
condemns code. You're measuring a *rate*. An agent that gets 19/20 right may be exactly what you
|
||
want unattended on low-stakes work and nowhere near enough for high-stakes work. The eval gives you
|
||
the rate; *you* set the bar per task.
|
||
|
||
And the inverse: **where a deterministic test is possible, write the test, not an eval.** Evals are
|
||
for the band of behavior tests can't pin down: open-ended output, judgment calls, "did it pick a
|
||
reasonable approach." Reaching for an LLM judge to grade something `==` could have caught is how you
|
||
get a slower, flakier, more expensive test that you trust less. (The lab's grader is deliberately
|
||
programmatic for exactly this reason.)
|
||
|
||
### Building the eval set
|
||
|
||
The eval set is the asset. The grader is plumbing; the *cases* are where the judgment lives, and a
|
||
good set is mostly edges. Three sources fill it fast:
|
||
|
||
- **The normal path**: a couple of cases proving the agent does the obvious thing. These rarely
|
||
catch anything; they're the floor.
|
||
- **The edges you already know break**: every "it looked right but" bug your agents have shipped is
|
||
a permanent case. Module 13 left us a perfect one: an agent implemented `pending_count()` as
|
||
`len(self.tasks)`. It passes any quick manual check (add three tasks, count says three) and is
|
||
wrong the instant a task is marked done. *That bug becomes case #4 in this module's lab and never
|
||
escapes again.*
|
||
- **The cases you'd manually check anyway**: write down the inputs you reflexively try when
|
||
reviewing this kind of change. That list *is* your eval set; you've just been running it in your
|
||
head and forgetting the results.
|
||
|
||
Keep it small and sharp. Twenty discriminating cases beat two hundred that all test the happy path.
|
||
A case that every candidate passes tells you nothing; the cases that *separate* a good agent from a
|
||
bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it
|
||
in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way
|
||
the syllabus means: it outlives every model it ever judges.
|
||
|
||
### Scoring: programmatic first, LLM-as-judge only when you must
|
||
|
||
Two graders, in strict priority order.
|
||
|
||
**Programmatic.** If "correct" is checkable in code (exact value, output matches, exit code is 0,
|
||
the file it shouldn't have touched is untouched), do that. It's deterministic, free, fast, and you
|
||
trust it completely. Most of what an agent does to a codebase is checkable this way, because code
|
||
either runs and produces the right thing or it doesn't.
|
||
|
||
**LLM-as-judge.** Some output has no `==`: "is this commit message clear?", "does this PR
|
||
description explain the change?", "is this refactor actually cleaner?" The standard move is to ask
|
||
*another* model to grade it against a rubric. It works, and sometimes it's the only option, but be
|
||
honest about what you've built:
|
||
|
||
- **Correlated blind spots.** A judge is a model grading a model. It can share the candidate's
|
||
confusion and pass a wrong answer because both are wrong the same way. Your grader and the thing
|
||
it grades are not independent.
|
||
- **Bias.** Judges favor longer, more confident, and first-presented answers regardless of
|
||
correctness. Control for position and length or your scores measure verbosity.
|
||
- **Drift.** Swap the judge model and your scores move while the candidate didn't change. The ruler
|
||
is made of rubber, which is poison for *regression* evals, whose entire job is to hold the ruler
|
||
still.
|
||
|
||
So when you must use a judge: pin it (fixed model, `temperature: 0`), keep it **separate** from the
|
||
model under test, and **calibrate it against human labels**: hand-grade ~20 examples, run the judge
|
||
on the same 20, and confirm it agrees with you *before* you let it gate anything. An uncalibrated
|
||
judge is a vibe with a number attached. The lab ships a model-agnostic judge stub (`llm_judge.py`)
|
||
that abstains until you point it at your own endpoint, with these limits written into the file.
|
||
|
||
### Regression evals: the safety check on a swap
|
||
|
||
Here is where the course thesis stops being a slogan and becomes a procedure.
|
||
|
||
You *will* swap the model. A cheaper one ships, your provider deprecates the one you're on, a new
|
||
release benchmarks better, someone edits the agent's prompt or its committed instructions file
|
||
(Module 5). Every one of those changes the behavior of every agent you run, silently. The code
|
||
around the model didn't change; the model did, and the model is the part you don't control.
|
||
|
||
A **regression eval** is the discipline of running the *same eval set* before and after the change
|
||
and comparing the scores. The current model/prompt earns a baseline score. After the change (a new
|
||
model, a new prompt), the same eval set runs again and the two scores get compared. A score that
|
||
held or rose means the swap is safe by this eval; a score that dropped is a regression caught
|
||
*before* it ran unattended against real work, not after.
|
||
|
||
This is the answer to "the model is swappable." It's swappable **because** the eval set is what
|
||
makes swapping safe. Your prompts, your pipeline, your review reflexes, and, most of all, your
|
||
eval set don't expire when the model does. They're the durable skill the course promised in Module
|
||
1. The model is a component you can replace; the eval is the regression test that tells you the
|
||
replacement fits. That's the whole argument, made operational.
|
||
|
||
### Guardrails: tying autonomy to a score
|
||
|
||
The last piece, and the real subject of Unit 5: **how much is this agent allowed to do without a
|
||
human?** Don't answer that by gut. Answer it with the eval score, and make the score *gate* the
|
||
autonomy.
|
||
|
||
| Eval score on this task | Reasonable autonomy (the Unit 5 ladder) |
|
||
|---|---|
|
||
| Low / unmeasured | Assistive only; it suggests, a human decides (Module 24). |
|
||
| Solid, below your bar | Autonomous but fully gated; opens a PR, a human reviews and merges (Module 25). |
|
||
| At/above bar, stable across runs | Unattended on this *narrow* task, landing behind CI + the eval as a gate. |
|
||
| High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |
|
||
|
||
Two things make a guardrail bite:
|
||
|
||
- **The threshold blocks.** The eval returns an exit code; below-bar exits non-zero and stops the
|
||
pipeline exactly like a failing test (Module 14). The lab does this. An eval whose result nobody is
|
||
forced to act on is a dashboard, not a guardrail.
|
||
- **Autonomy is per-task, not per-agent.** The same model can be trustworthy enough to merge
|
||
doc fixes unattended and nowhere near enough to touch auth code. You hold a *different* eval and a
|
||
*different* bar for each. "Trust the agent" is the wrong granularity; "trust this agent, on this
|
||
task, to this score" is the right one.
|
||
|
||
---
|
||
|
||
## The AI angle
|
||
|
||
Every other module made a tool more valuable *because* you're using AI. This module closes the
|
||
argument the course opened with.
|
||
|
||
Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every
|
||
module since has been an installment on that claim: version control, review, CI, containers,
|
||
secrets, MCP, agents. **Evals are where it's proven.** An eval set is, literally, a model-agnostic
|
||
instrument: it judges output without caring which model produced it, which is exactly why it survives
|
||
the swap that retires the model. You don't trust an agent because you trust the vendor or this
|
||
quarter's benchmark; you trust it because *your* eval, on *your* cases, scored it above *your* bar,
|
||
and you'll re-run that same eval the day the model changes under you, which it will.
|
||
|
||
That's the durable skill. Models are weather. The eval set is the thermometer you keep.
|
||
|
||
---
|
||
|
||
## Hands-on lab
|
||
|
||
**Lab language:** Python + shell. You'll run a tiny eval harness, point an agent at a task, and run
|
||
a regression eval across a "model swap."
|
||
|
||
The lab files are in [`lab/`](lab/):
|
||
|
||
- `eval_set.py`: five cases for the `pending_count` task (data only).
|
||
- `run_eval.py` is the runner; it imports a candidate, scores it, prints a scorecard, exits non-zero
|
||
below threshold.
|
||
- `candidates/current_model/tasks.py`: a correct candidate (stand-in for your current model's
|
||
output).
|
||
- `candidates/swapped_model/tasks.py`: a plausible-but-wrong candidate (stand-in for a bad swap).
|
||
- `llm_judge.py`: a model-agnostic LLM-as-judge stub, with its limits written in.
|
||
|
||
**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and Claude Code (sub
|
||
your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
|
||
the regression demo run offline. The real payoff comes when you replace them with your own agent's
|
||
output.
|
||
|
||
### Part A: Run the eval against the current model
|
||
|
||
1. From the lab folder, run the eval against the passing candidate:
|
||
|
||
```bash
|
||
cd modules/27-evals/lab
|
||
python run_eval.py candidates/current_model
|
||
echo "exit code: $?"
|
||
```
|
||
|
||
Five cases pass, the score is 100%, and the exit code is `0`. **This is your baseline**: the
|
||
score the current model earns on this task. Read the cases in `eval_set.py`: notice case #4,
|
||
"completed tasks are NOT pending." That's the Module 13 bug, now a permanent case.
|
||
|
||
### Part B: Swap the model and re-run (the whole point)
|
||
|
||
2. Now simulate the swap: run the *exact same eval set* against the other candidate:
|
||
|
||
```bash
|
||
python run_eval.py candidates/swapped_model
|
||
echo "exit code: $?"
|
||
```
|
||
|
||
It drops to 60% and exits `1`. Look at *which* cases failed: the easy ones still pass; this
|
||
output would sail through a casual manual check. The eval caught a regression that a skim would
|
||
have missed, **and the non-zero exit code means a pipeline would have blocked it.** That is a
|
||
guardrail doing its job.
|
||
|
||
### Part C: Make it real with your own agent
|
||
|
||
3. Open your `tasks-app` and tell Claude Code (sub your own agent) to implement (or re-implement)
|
||
`pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the
|
||
folder if it doesn't exist. You direct; the agent does the file plumbing. Then run the eval
|
||
yourself and read the scorecard:
|
||
|
||
```bash
|
||
python run_eval.py candidates/my_run_1
|
||
```
|
||
|
||
4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
|
||
the same thing a different way, or tweak your committed instructions file from Module 5). Have the
|
||
agent write this run into `candidates/my_run_2/`, then run `run_eval.py` yourself and compare the
|
||
two scores. You just ran a regression eval on a real model/prompt change and got a number that
|
||
tells you whether the change was safe. If a run scores below 100%, read the failing case and direct
|
||
the agent to append the input that broke it as a new permanent case in `eval_set.py`; verify the
|
||
case it added. The set gets sharper every time an agent surprises you.
|
||
|
||
5. *(Optional, needs a model endpoint.)* Open `llm_judge.py`, read the limits at the bottom, set the
|
||
`EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output, say a
|
||
commit message your agent wrote. Note how much shakier that score feels than the programmatic one.
|
||
That feeling is correct, and it's why programmatic graders come first.
|
||
|
||
### Part D: Set the guardrail (on paper, then in CI)
|
||
|
||
6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
|
||
*"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
|
||
human reviews."* Then make it enforceable. This is one job in a CI workflow (Module 14), so direct
|
||
Claude Code (sub your own agent) to add an eval-gate job to the workflow it already wired up in
|
||
Module 14, running the same command from Parts A–B. The job it adds should look like this:
|
||
|
||
```yaml
|
||
- name: Eval gate
|
||
working-directory: modules/27-evals/lab
|
||
run: python run_eval.py candidates/current_model --threshold 1.0
|
||
```
|
||
|
||
Review the diff before you accept it, and confirm the path logic is right. The
|
||
`working-directory:` line makes the CI job `cd` into the lab folder first, so the
|
||
`candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
|
||
did on your machine. (Drop it and point a repo-root job straight at
|
||
`python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
|
||
won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
|
||
gate. If the agent prefers a single line, it can spell both paths out from the repo root:
|
||
`python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
|
||
--threshold 1.0`.)
|
||
|
||
Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
|
||
is now structural, not a promise.
|
||
|
||
**One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled,
|
||
always-correct stand-in: it scores 100% on every run, forever, so a gate pointed at it can never
|
||
fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real
|
||
pipeline, point the gate at the candidate that actually *varies*: your agent's real output for
|
||
this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the
|
||
model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the
|
||
same command drops to 60%, exits `1`, and blocks the merge.
|
||
|
||
---
|
||
|
||
## Where it breaks
|
||
|
||
The honesty this course has insisted on all the way through applies hardest to its own closer.
|
||
|
||
- **Evals measure what you put in them, and nothing else.** A 100% score means the agent passed
|
||
*your cases*, not that it's correct in general. The gap between "passes my eval" and "is actually
|
||
good" is exactly the cases you didn't think to write. An eval set is a lower bound on quality, never
|
||
a proof. Treat a green eval as "no known regression," not "verified correct."
|
||
- **Eval sets rot.** Cases that no model ever fails stop discriminating; tasks drift away from what
|
||
you actually do. An eval set you don't prune and grow becomes a comforting green light that's
|
||
measuring last year's problems. Budget maintenance for it like any other test suite.
|
||
- **LLM-as-judge is a model grading a model.** Re-read that section: correlated blind spots, bias,
|
||
and drift are not edge cases, they're the default behavior. An uncalibrated judge can hand you a
|
||
confident wrong score, which is worse than no score. Where you can grade in code, do.
|
||
- **A score is not a decision.** The eval tells you the rate; *you* still set the bar, and the right
|
||
bar depends on stakes the eval can't see. 95% might be plenty for triaging issue labels and
|
||
reckless for anything touching auth, money, or customer data. The number informs the judgment; it
|
||
doesn't replace it.
|
||
- **Evals don't catch novel harms, only measured ones.** A genuinely new failure mode (a class of
|
||
mistake no case anticipates) passes every eval until the day it doesn't and you add the case after
|
||
the fact. Evals make agents *trustworthy on known territory*. They are not a substitute for the
|
||
recovery muscles (Module 12) that exist for when something gets through anyway.
|
||
|
||
---
|
||
|
||
## Check for understanding
|
||
|
||
**You're done when:**
|
||
|
||
- You can explain the difference between a test and an eval, and say when you'd reach for each.
|
||
- You've run `run_eval.py` against both bundled candidates and watched the same eval set pass one and
|
||
fail the other, including the exit code flipping to `1`.
|
||
- You've graded your *own* agent's output, then changed the model or prompt and re-run the same eval
|
||
set as a regression check, and you can read the before/after scores as "safe" or "not safe."
|
||
- You can state, for one concrete task, the eval score that would let an agent act unattended on it,
|
||
and where that threshold would live in your pipeline.
|
||
- You can say, in your own words, why the eval set is the durable skill and the model is the swappable
|
||
part. That's the whole course in one sentence, and you can now run it from the keyboard.
|
||
|
||
That's the close. You started by copy-pasting out of a chat window; you're ending by letting an agent
|
||
act without you and holding a measured, enforceable line on whether to trust it. The model under that
|
||
line will change many times. The line is yours to keep.
|
||
|
||
---
|
||
|
||
## Verify-before-publish
|
||
|
||
This is an expansion-zone module over fast-moving ground. Re-check at build/publish time:
|
||
|
||
- [ ] **No vendor pinned.** Confirm the module text, lab, and `llm_judge.py` still name no specific LLM
|
||
provider, model id, or pricing, and that `llm_judge.py`'s endpoint config is still generic
|
||
(env-var driven, OpenAI-style-compatible but not branded).
|
||
- [ ] **Eval frameworks named.** If the module names any eval framework or LLM-as-judge tool by
|
||
name (it currently names none on purpose), verify it still exists and behaves as described. Prefer
|
||
keeping it tool-agnostic.
|
||
- [ ] **LLM-as-judge claims.** The bias/drift/correlation caveats are durable, but re-check that no
|
||
cited best practice (e.g., calibration-against-human-labels guidance) has been superseded.
|
||
- [ ] **Module cross-references.** Confirm Modules 13, 14, 10, and 24–26 still carry the
|
||
responsibilities referenced here (tests, CI gating, review, the agent autonomy ladder) and that
|
||
none were renumbered.
|
||
- [ ] **Lab still runs.** `python run_eval.py candidates/current_model` exits 0 at 100%, and
|
||
`candidates/swapped_model` exits 1 below threshold, on a current Python 3.x.
|