De-slop: remove every em-dash + banned words across all modules + capstone (#94)
Sync course wiki / sync-wiki (push) Successful in 4s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #94.
+49
-49
@@ -1,7 +1,7 @@
-# Module 27 — Evals: Trusting an Agent That Acts Without You
+# Module 27. Evals: Trusting an Agent That Acts Without You

 > **You will swap the model. Evals are the only thing that tells you whether the swap was safe.**
-> This is the instrument that turns "the agent's output looks fine" into a number you can gate on —
+> This is the instrument that turns "the agent's output looks fine" into a number you can gate on,
 > and it's where the whole course's thesis finally pays out.

 ---
@@ -10,16 +10,16 @@

 This is the closer. It assumes the whole course, but it leans hardest on:

-- **Module 1** — the thesis (the model is the cheap, swappable part; the workflow is the durable
+- **Module 1**: the thesis (the model is the cheap, swappable part; the workflow is the durable
 skill) and the `tasks-app` we've carried the whole way. This module is where the thesis gets its
 proof.
-- **Module 13 — Testing in the AI Era** — you can write a deterministic pass/fail check. Evals are
+- **Module 13, Testing in the AI Era**: you can write a deterministic pass/fail check. Evals are
 the next thing up the ladder: scoring output that a single test can't fully pin down.
-- **Module 14 — Continuous Integration** — running checks automatically on every change, with an
+- **Module 14, Continuous Integration**: running checks automatically on every change, with an
 exit code that gates. Evals run the same way and gate the same way.
-- **Module 10 — Reviewing Code You Didn't Write** — the human review skill evals partially automate
+- **Module 10, Reviewing Code You Didn't Write**: the human review skill evals partially automate
 and partially *replace* once a human isn't in the loop.
-- **Modules 24–26 — the Unit 5 agent ladder** — assistive agents (24), autonomous-but-supervised
+- **Modules 24–26, the Unit 5 agent ladder**: assistive agents (24), autonomous-but-supervised
 agents (25), and orchestrated fleets (26). Evals are what decide how far up that ladder any given
 agent is allowed to climb.

@@ -29,11 +29,11 @@ This is the closer. It assumes the whole course, but it leans hardest on:

 By the end of this module you can:

-1. State precisely what an eval is and how it differs from a test — and when you need one instead of
+1. State precisely what an eval is and how it differs from a test, and when you need one instead of
 the other.
 2. Build a small eval set for a concrete agent task: representative cases plus a grader that turns
 output into a score.
-3. Score agent output programmatically, and use an LLM-as-judge where you must — honestly, knowing
+3. Score agent output programmatically, and use an LLM-as-judge where you must, honestly, knowing
 its failure modes.
 4. Run a **regression eval** across a model or prompt change and read whether the change was safe.
 5. Set a **guardrail**: tie an autonomy level to an eval score so an agent earns the right to act
@@ -61,18 +61,18 @@ score you can compare across runs. That measurement is an **eval**.

 An eval has exactly three parts. None of them are exotic:

-1. **An eval set** — a fixed list of representative cases. Inputs the agent will face, chosen to
+1. **An eval set**: a fixed list of representative cases. Inputs the agent will face, chosen to
 cover the normal path *and* the edges where it tends to fail.
-2. **A grader** — something that turns each case's output into a result. Pass/fail, or a score. The
+2. **A grader**: something that turns each case's output into a result. Pass/fail, or a score. The
 grader can be code (`==`, a regex, "does it compile, run, and produce this output") or, when the
 output is open-ended, another model (LLM-as-judge).
-3. **An aggregate + a threshold** — roll the per-case results into one number, and a line that number
+3. **An aggregate + a threshold**: roll the per-case results into one number, and a line that number
 has to clear. "18/20 = 90%, and I require 90%."

 That's it. An eval is a test suite pointed at *agent behavior* instead of a function, with a score
 instead of a single green check, run against a moving target (the model) instead of frozen code.
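
The three parts above fit in a few lines. What follows is a minimal sketch, not the lab's actual `eval_set.py` or `run_eval.py`: the `Case` shape, the case names, and both candidate functions are hypothetical stand-ins chosen to mirror the `pending_count` task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One eval case (hypothetical shape, not the lab's format)."""
    name: str
    done_flags: tuple[bool, ...]  # one flag per task: True means "done"
    expected_pending: int


# Part 1: the eval set. Fixed data, independent of any model or runner.
EVAL_SET = [
    Case("empty list", (), 0),
    Case("three pending", (False, False, False), 3),
    Case("completed tasks are not pending", (False, True, True), 1),
]


# Part 2: the grader. Programmatic here: a plain equality check.
def grade(candidate: Callable[[tuple[bool, ...]], int], case: Case) -> bool:
    return candidate(case.done_flags) == case.expected_pending


# Part 3: the aggregate and the threshold.
def run_eval(candidate, threshold: float = 0.9) -> tuple[float, bool]:
    passed = sum(grade(candidate, case) for case in EVAL_SET)
    score = passed / len(EVAL_SET)
    return score, score >= threshold


def correct_candidate(done_flags):
    return sum(1 for done in done_flags if not done)


def buggy_candidate(done_flags):
    # Counts every task, not only the pending ones.
    return len(done_flags)
```

Only the third case separates the two candidates; the first two pass either way, which is why the edge case carries the value.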

-### Eval vs. test — the distinction that matters
+### Eval vs. test: the distinction that matters

 This audience already writes tests (Module 13). The instinct to ask "isn't an eval just a test?" is
 correct enough to be dangerous. Where they diverge:
@@ -82,7 +82,7 @@ correct enough to be dangerous. Where they diverge:
 | **Subject** | Your code, frozen | An agent/model's output, which changes under you |
 | **Result** | Binary: pass/fail | A score across many cases (90%, not "green") |
 | **Determinism** | Same input → same output | Same input may give *different* output run to run |
-| **Failure meaning** | The code is broken | The agent is *less good* — maybe still acceptable |
+| **Failure meaning** | The code is broken | The agent is *less good*, maybe still acceptable |
 | **What it gates** | "Is the code correct?" | "Is this model/prompt good enough to trust here?" |

 The practical upshot: a single failing case doesn't condemn an agent the way a failing unit test
@@ -91,7 +91,7 @@ want unattended on low-stakes work and nowhere near enough for high-stakes work.
 the rate; *you* set the bar per task.

 And the inverse: **where a deterministic test is possible, write the test, not an eval.** Evals are
-for the band of behavior tests can't pin down — open-ended output, judgment calls, "did it pick a
+for the band of behavior tests can't pin down: open-ended output, judgment calls, "did it pick a
 reasonable approach." Reaching for an LLM judge to grade something `==` could have caught is how you
 get a slower, flakier, more expensive test that you trust less. (The lab's grader is deliberately
 programmatic for exactly this reason.)
@@ -101,14 +101,14 @@ programmatic for exactly this reason.)
 The eval set is the asset. The grader is plumbing; the *cases* are where the judgment lives, and a
 good set is mostly edges. Three sources fill it fast:

-- **The normal path** — a couple of cases proving the agent does the obvious thing. These rarely
+- **The normal path**: a couple of cases proving the agent does the obvious thing. These rarely
 catch anything; they're the floor.
-- **The edges you already know break** — every "it looked right but" bug your agents have shipped is
+- **The edges you already know break**: every "it looked right but" bug your agents have shipped is
 a permanent case. Module 13 left us a perfect one: an agent implemented `pending_count()` as
 `len(self.tasks)`. It passes any quick manual check (add three tasks, count says three) and is
 wrong the instant a task is marked done. *That bug becomes case #4 in this module's lab and never
 escapes again.*
-- **The cases you'd manually check anyway** — write down the inputs you reflexively try when
+- **The cases you'd manually check anyway**: write down the inputs you reflexively try when
 reviewing this kind of change. That list *is* your eval set; you've just been running it in your
 head and forgetting the results.

@@ -116,14 +116,14 @@ Keep it small and sharp. Twenty discriminating cases beat two hundred that all t
 A case that every candidate passes tells you nothing; the cases that *separate* a good agent from a
 bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it
 in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way
-the syllabus means — it outlives every model it ever judges.
+the syllabus means: it outlives every model it ever judges.

 ### Scoring: programmatic first, LLM-as-judge only when you must

 Two graders, in strict priority order.

-**Programmatic.** If "correct" is checkable in code — exact value, output matches, exit code is 0,
-the file it shouldn't have touched is untouched — do that. It's deterministic, free, fast, and you
+**Programmatic.** If "correct" is checkable in code (exact value, output matches, exit code is 0,
+the file it shouldn't have touched is untouched), do that. It's deterministic, free, fast, and you
 trust it completely. Most of what an agent does to a codebase is checkable this way, because code
 either runs and produces the right thing or it doesn't.
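
The checks named in that paragraph (exit code, matching output, an untouched file) can each be expressed in code. The sketch below is one possible shape, not the lab's grader: `grade_run`, `demo`, and the `do_not_touch.txt` file are hypothetical names, and the "agent output" is simulated with one-line Python commands.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Fingerprint a file so any modification is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def grade_run(command: list[str], expected_stdout: str, protected: Path) -> bool:
    """Pass only if the command exits 0, prints the expected output,
    and leaves the protected file byte-for-byte unchanged."""
    before = sha256(protected)
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return (
        result.returncode == 0
        and result.stdout.strip() == expected_stdout
        and sha256(protected) == before
    )


def demo() -> tuple[bool, bool]:
    """Grade a well-behaved command, then one that prints the right
    answer but also rewrites a file it should not have touched."""
    with tempfile.TemporaryDirectory() as tmp:
        protected = Path(tmp) / "do_not_touch.txt"
        protected.write_text("original\n")
        good = [sys.executable, "-c", "print(2)"]
        bad = [
            sys.executable,
            "-c",
            f"import pathlib; pathlib.Path({str(protected)!r}).write_text('oops'); print(2)",
        ]
        return grade_run(good, "2", protected), grade_run(bad, "2", protected)
```

The second command would pass a check that looked only at output; the file hash is what fails it.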

@@ -138,11 +138,11 @@ honest about what you've built:
 - **Bias.** Judges favor longer, more confident, and first-presented answers regardless of
 correctness. Control for position and length or your scores measure verbosity.
 - **Drift.** Swap the judge model and your scores move while the candidate didn't change. The ruler
-is made of rubber — which is poison for *regression* evals, whose entire job is to hold the ruler
+is made of rubber, which is poison for *regression* evals, whose entire job is to hold the ruler
 still.
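
One common way to control for the position bias described above is to ask the judge twice with the order swapped and keep only verdicts that survive the swap. This is a sketch under assumptions: the pairwise judge interface (a function returning `"first"` or `"second"`) is hypothetical, and the two toy judges stand in for a real model call. It addresses position only; length bias needs its own control.

```python
from typing import Callable, Optional

# Hypothetical interface: a pairwise judge sees two answers in order
# and returns "first" or "second" for the one it prefers.
PairwiseJudge = Callable[[str, str], str]


def position_controlled_verdict(judge: PairwiseJudge, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" only if the judge picks the same answer in both
    orders; return None when the verdict flips with presentation order."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return None  # order-dependent verdict: treat as no signal


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_shorter(first: str, second: str) -> str:
    """Toy judge whose preference does not depend on order
    (for answers of different length)."""
    return "first" if len(first) <= len(second) else "second"
```

A judge that returns `None` on a large share of pairs is telling you its scores mostly reflect ordering, which is worth knowing before it gates anything.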

 So when you must use a judge: pin it (fixed model, `temperature: 0`), keep it **separate** from the
-model under test, and **calibrate it against human labels** — hand-grade ~20 examples, run the judge
+model under test, and **calibrate it against human labels**: hand-grade ~20 examples, run the judge
 on the same 20, and confirm it agrees with you *before* you let it gate anything. An uncalibrated
 judge is a vibe with a number attached. The lab ships a model-agnostic judge stub (`llm_judge.py`)
 that abstains until you point it at your own endpoint, with these limits written into the file.
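
The calibration step reduces to a comparison of two label lists. A minimal sketch follows; the function names and the 90% minimum agreement are assumptions for illustration, since the text above does not fix a number.

```python
def agreement_rate(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of examples where the judge's pass/fail matches the human's."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def judge_may_gate(
    human_labels: list[bool],
    judge_labels: list[bool],
    min_agreement: float = 0.9,  # assumed bar; pick your own per task
) -> bool:
    """Let the judge gate only after it agrees with hand labels often enough."""
    return agreement_rate(human_labels, judge_labels) >= min_agreement


# Twenty hand-graded examples, and a judge that disagrees on three of them.
human = [True] * 12 + [False] * 8
judge = list(human)
for index in (0, 5, 15):
    judge[index] = not judge[index]
```

With three disagreements out of twenty the judge agrees 85% of the time, so under the assumed 90% bar it does not get to gate yet. Looking at which three examples it missed is usually more informative than the rate itself.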
@@ -163,7 +163,7 @@ held or rose means the swap is safe by this eval; a score that dropped is a regr
 *before* it ran unattended against real work, not after.

 This is the answer to "the model is swappable." It's swappable **because** the eval set is what
-makes swapping safe. Your prompts, your pipeline, your review reflexes, and — most of all — your
+makes swapping safe. Your prompts, your pipeline, your review reflexes, and, most of all, your
 eval set don't expire when the model does. They're the durable skill the course promised in Module
 1. The model is a component you can replace; the eval is the regression test that tells you the
 replacement fits. That's the whole argument, made operational.
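
Reading a regression eval means comparing two runs of the same cases. The sketch below assumes each run is recorded as a mapping from case name to pass/fail; that format, the function name, and the case names are hypothetical, and the "safe" rule is the one stated above (the score held or rose).

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case results from before and after a model or prompt swap."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    baseline_score = sum(baseline.values()) / len(baseline)
    candidate_score = sum(candidate.values()) / len(candidate)
    # Cases that passed before the swap and fail after it.
    regressed = sorted(
        name for name in baseline if baseline[name] and not candidate[name]
    )
    return {
        "baseline": baseline_score,
        "candidate": candidate_score,
        "regressed_cases": regressed,
        "safe": candidate_score >= baseline_score,
    }


# Hypothetical case names; the numbers mirror a drop from 100% to 60%.
before = {"empty": True, "one": True, "three": True, "done": True, "mixed": True}
after = {"empty": True, "one": True, "three": True, "done": False, "mixed": False}
report = compare_runs(before, after)
```

The list of regressed cases matters as much as the two scores: a swap can hold its overall score while trading one failure for another, and only the per-case view shows that.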
@@ -176,8 +176,8 @@ autonomy.

 | Eval score on this task | Reasonable autonomy (the Unit 5 ladder) |
 |---|---|
-| Low / unmeasured | Assistive only — it suggests, a human decides (Module 24). |
-| Solid, below your bar | Autonomous but fully gated — opens a PR, a human reviews and merges (Module 25). |
+| Low / unmeasured | Assistive only; it suggests, a human decides (Module 24). |
+| Solid, below your bar | Autonomous but fully gated; opens a PR, a human reviews and merges (Module 25). |
 | At/above bar, stable across runs | Unattended on this *narrow* task, landing behind CI + the eval as a gate. |
 | High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |
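
The first three rows of the table can be written as a small policy function. This is an illustrative sketch only: the table gives no numbers, so the `solid_floor` value is an assumption, and "stable across runs" is interpreted here as "every recorded run cleared the bar." The fleet row is left out because it depends on a broad set of tasks held over time, which a single task's scores cannot show.

```python
from enum import Enum


class Autonomy(Enum):
    ASSISTIVE = "suggests; a human decides"
    GATED = "opens a PR; a human reviews and merges"
    UNATTENDED_NARROW = "unattended on this narrow task, behind CI and the eval gate"


def allowed_autonomy(
    run_scores: list[float],
    bar: float,
    solid_floor: float = 0.7,  # assumed meaning of "solid"; not from the table
) -> Autonomy:
    """Map repeated eval scores on one task to a rung of the ladder."""
    if not run_scores:
        return Autonomy.ASSISTIVE  # unmeasured counts as low
    worst = min(run_scores)  # judge by the worst run, not the average
    if worst >= bar:
        return Autonomy.UNATTENDED_NARROW
    if worst >= solid_floor:
        return Autonomy.GATED
    return Autonomy.ASSISTIVE
```

Using the worst run rather than the mean is a deliberate choice in this sketch: one bad run is exactly what an unattended agent would have shipped.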

@@ -199,7 +199,7 @@ Every other module made a tool more valuable *because* you're using AI. This mod
 argument the course opened with.

 Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every
-module since has been an installment on that claim — version control, review, CI, containers,
+module since has been an installment on that claim: version control, review, CI, containers,
 secrets, MCP, agents. **Evals are where it's proven.** An eval set is, literally, a model-agnostic
 instrument: it judges output without caring which model produced it, which is exactly why it survives
 the swap that retires the model. You don't trust an agent because you trust the vendor or this
@@ -217,20 +217,20 @@ a regression eval across a "model swap."

 The lab files are in [`lab/`](lab/):

-- `eval_set.py` — five cases for the `pending_count` task (data only).
-- `run_eval.py` — the runner: imports a candidate, scores it, prints a scorecard, exits non-zero
+- `eval_set.py`: five cases for the `pending_count` task (data only).
+- `run_eval.py` is the runner; it imports a candidate, scores it, prints a scorecard, exits non-zero
 below threshold.
-- `candidates/current_model/tasks.py` — a correct candidate (stand-in for your current model's
+- `candidates/current_model/tasks.py`: a correct candidate (stand-in for your current model's
 output).
-- `candidates/swapped_model/tasks.py` — a plausible-but-wrong candidate (stand-in for a bad swap).
-- `llm_judge.py` — a model-agnostic LLM-as-judge stub, with its limits written in.
+- `candidates/swapped_model/tasks.py`: a plausible-but-wrong candidate (stand-in for a bad swap).
+- `llm_judge.py`: a model-agnostic LLM-as-judge stub, with its limits written in.

 **You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and Claude Code (sub
 your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
 the regression demo run offline. The real payoff comes when you replace them with your own agent's
 output.

-### Part A — Run the eval against the current model
+### Part A: Run the eval against the current model

 1. From the lab folder, run the eval against the passing candidate:

@@ -240,25 +240,25 @@ output.
 echo "exit code: $?"
 ```

-Five cases pass, the score is 100%, and the exit code is `0`. **This is your baseline** — the
+Five cases pass, the score is 100%, and the exit code is `0`. **This is your baseline**: the
 score the current model earns on this task. Read the cases in `eval_set.py`: notice case #4,
 "completed tasks are NOT pending." That's the Module 13 bug, now a permanent case.

-### Part B — Swap the model and re-run (the whole point)
+### Part B: Swap the model and re-run (the whole point)

-2. Now simulate the swap — run the *exact same eval set* against the other candidate:
+2. Now simulate the swap: run the *exact same eval set* against the other candidate:

 ```bash
 python run_eval.py candidates/swapped_model
 echo "exit code: $?"
 ```

-It drops to 60% and exits `1`. Look at *which* cases failed: the easy ones still pass — this
+It drops to 60% and exits `1`. Look at *which* cases failed: the easy ones still pass; this
 output would sail through a casual manual check. The eval caught a regression that a skim would
 have missed, **and the non-zero exit code means a pipeline would have blocked it.** That is a
 guardrail doing its job.

-### Part C — Make it real with your own agent
+### Part C: Make it real with your own agent

 3. Open your `tasks-app` and tell Claude Code (sub your own agent) to implement (or re-implement)
 `pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the
@@ -278,11 +278,11 @@ output.
 case it added. The set gets sharper every time an agent surprises you.

 5. *(Optional, needs a model endpoint.)* Open `llm_judge.py`, read the limits at the bottom, set the
-`EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output — say, a
+`EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output, say a
 commit message your agent wrote. Note how much shakier that score feels than the programmatic one.
 That feeling is correct, and it's why programmatic graders come first.

-### Part D — Set the guardrail (on paper, then in CI)
+### Part D: Set the guardrail (on paper, then in CI)

 6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
 *"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
@@ -310,9 +310,9 @@ output.
 is now structural, not a promise.

 **One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled,
-always-correct stand-in — it scores 100% on every run, forever, so a gate pointed at it can never
+always-correct stand-in: it scores 100% on every run, forever, so a gate pointed at it can never
 fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real
-pipeline, point the gate at the candidate that actually *varies* — your agent's real output for
+pipeline, point the gate at the candidate that actually *varies*: your agent's real output for
 this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the
 model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the
 same command drops to 60%, exits `1`, and blocks the merge.
@@ -323,22 +323,22 @@ output.

 The honesty this course has insisted on all the way through applies hardest to its own closer.

-- **Evals measure what you put in them — and nothing else.** A 100% score means the agent passed
+- **Evals measure what you put in them, and nothing else.** A 100% score means the agent passed
 *your cases*, not that it's correct in general. The gap between "passes my eval" and "is actually
 good" is exactly the cases you didn't think to write. An eval set is a lower bound on quality, never
 a proof. Treat a green eval as "no known regression," not "verified correct."
 - **Eval sets rot.** Cases that no model ever fails stop discriminating; tasks drift away from what
 you actually do. An eval set you don't prune and grow becomes a comforting green light that's
 measuring last year's problems. Budget maintenance for it like any other test suite.
-- **LLM-as-judge is a model grading a model.** Re-read that section — correlated blind spots, bias,
+- **LLM-as-judge is a model grading a model.** Re-read that section: correlated blind spots, bias,
 and drift are not edge cases, they're the default behavior. An uncalibrated judge can hand you a
 confident wrong score, which is worse than no score. Where you can grade in code, do.
 - **A score is not a decision.** The eval tells you the rate; *you* still set the bar, and the right
 bar depends on stakes the eval can't see. 95% might be plenty for triaging issue labels and
 reckless for anything touching auth, money, or customer data. The number informs the judgment; it
 doesn't replace it.
-- **Evals don't catch novel harms, only measured ones.** A genuinely new failure mode — a class of
-mistake no case anticipates — passes every eval until the day it doesn't and you add the case after
+- **Evals don't catch novel harms, only measured ones.** A genuinely new failure mode (a class of
+mistake no case anticipates) passes every eval until the day it doesn't and you add the case after
 the fact. Evals make agents *trustworthy on known territory*. They are not a substitute for the
 recovery muscles (Module 12) that exist for when something gets through anyway.
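
The "eval sets rot" point in the list above can be checked mechanically if you keep per-case results from past runs. The sketch below assumes a hypothetical record format (candidate name mapped to per-case pass/fail, with every candidate run on the same cases); nothing in the lab stores results this way.

```python
def non_discriminating_cases(results: dict[str, dict[str, bool]]) -> list[str]:
    """Return the cases that every recorded candidate passed.

    `results` maps a candidate name to its {case name: passed} results.
    All candidates are assumed to have been run on the same cases.
    """
    runs = list(results.values())
    if not runs:
        return []
    case_names = runs[0].keys()
    return sorted(
        name for name in case_names if all(run[name] for run in runs)
    )


# Hypothetical history: two candidates, three cases.
history = {
    "current_model": {"empty": True, "three pending": True, "done not pending": True},
    "swapped_model": {"empty": True, "three pending": True, "done not pending": False},
}
stale_candidates = non_discriminating_cases(history)
```

A case that shows up in this list is a candidate for review, not automatic deletion: normal-path cases are the floor, and some are worth keeping even though they rarely fail.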

@@ -350,13 +350,13 @@ The honesty this course has insisted on all the way through applies hardest to i

 - You can explain the difference between a test and an eval, and say when you'd reach for each.
 - You've run `run_eval.py` against both bundled candidates and watched the same eval set pass one and
-fail the other — including the exit code flipping to `1`.
+fail the other, including the exit code flipping to `1`.
 - You've graded your *own* agent's output, then changed the model or prompt and re-run the same eval
 set as a regression check, and you can read the before/after scores as "safe" or "not safe."
-- You can state, for one concrete task, the eval score that would let an agent act unattended on it —
+- You can state, for one concrete task, the eval score that would let an agent act unattended on it,
 and where that threshold would live in your pipeline.
 - You can say, in your own words, why the eval set is the durable skill and the model is the swappable
-part. That's the whole course in one sentence — and you can now run it from the keyboard.
+part. That's the whole course in one sentence, and you can now run it from the keyboard.

 That's the close. You started by copy-pasting out of a chat window; you're ending by letting an agent
 act without you and holding a measured, enforceable line on whether to trust it. The model under that
@@ -1,12 +1,12 @@
 """Candidate output: a SWAPPED model/prompt.

 Same task, different model (or a tweaked prompt). This output "looks right" and
-passes a casual manual check — adding three tasks and calling count returns 3.
+passes a casual manual check; adding three tasks and calling count returns 3.
 But pending_count() returns the total number of tasks, not the number of
 *pending* ones, so it's wrong the moment anything is marked done.

 Nobody would notice this by skimming. The eval set notices it instantly. That's
-the regression eval catching an unsafe swap — exactly the scenario this module
+the regression eval catching an unsafe swap, exactly the scenario this module
 exists for. Replace this with your own swapped-model output when you run it for
 real; you may get lucky and have it pass, or you may catch a regression like
 this one.
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
 - the expected result (here: how many tasks should count as pending).

 The grading lives in run_eval.py; this file is just data. Keeping the cases
-separate from any model, prompt, or runner is the whole point — the same eval
+separate from any model, prompt, or runner is the whole point; the same eval
 set judges *any* candidate you point it at, which is what makes it useful when
 you swap the model out from under it.

@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
     key = os.environ.get("EVAL_JUDGE_KEY")
     model = os.environ.get("EVAL_JUDGE_MODEL")
     if not (url and key and model):
-        return {"score": None, "reason": "judge not configured — abstaining (set EVAL_JUDGE_* to enable)"}
+        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}

     payload = json.dumps({
         "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
 # about the candidate changed. The ruler is itself made of rubber.
 #
 # So: use a programmatic grader (run_eval.py) wherever a deterministic check is
-# possible — that is most of the time. Reach for an LLM judge only for genuinely
+# possible; that is most of the time. Reach for an LLM judge only for genuinely
 # open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
 # run the judge on them, and confirm it agrees with you before you let it gate
 # anything. An uncalibrated judge is a vibe with a number attached.
@@ -68,9 +68,9 @@ def main(argv):
     print(f"\nscore: {passed}/{len(CASES)} = {score:.0%} threshold: {args.threshold:.0%}")

     if score < args.threshold:
-        print("RESULT: below threshold — this change is NOT safe to ship.\n")
+        print("RESULT: below threshold; this change is NOT safe to ship.\n")
         return 1
-    print("RESULT: at or above threshold — safe by this eval.\n")
+    print("RESULT: at or above threshold; safe by this eval.\n")
     return 0
