
# Module 27. Evals: Trusting an Agent That Acts Without You

> **You will swap the model. Evals are the only thing that tells you whether the swap was safe.**
> This is the instrument that turns "the agent's output looks fine" into a number you can gate on,
> and it's where the whole course's thesis finally pays out.

---

## Prerequisites

This is the closer. It assumes the whole course, but it leans hardest on:

- **Module 1**: the thesis (the model is the cheap, swappable part; the workflow is the durable
  skill) and the `tasks-app` we've carried the whole way. This module is where the thesis gets its
  proof.
- **Module 10, Reviewing Code You Didn't Write**: the human review skill that evals partially
  automate and, once no human is in the loop, partially *replace*.
- **Module 13, Testing in the AI Era**: you can write a deterministic pass/fail check. Evals are
  the next thing up the ladder: scoring output that a single test can't fully pin down.
- **Module 14, Continuous Integration**: running checks automatically on every change, with an
  exit code that gates. Evals run the same way and gate the same way.
- **Modules 24–26, the Unit 5 agent ladder**: assistive agents (24), autonomous-but-supervised
  agents (25), and orchestrated fleets (26). Evals are what decide how far up that ladder any given
  agent is allowed to climb.

---

## Learning objectives

By the end of this module you can:

1. State precisely what an eval is and how it differs from a test, and when you need one instead of
   the other.
2. Build a small eval set for a concrete agent task: representative cases plus a grader that turns
   output into a score.
3. Score agent output programmatically, and use an LLM-as-judge only where you must, with its
   failure modes accounted for.
4. Run a **regression eval** across a model or prompt change and read whether the change was safe.
5. Set a **guardrail**: tie an autonomy level to an eval score so an agent earns the right to act
   unattended instead of being granted it on faith.

---

## Key concepts

### The question Unit 5 has been building toward

Unit 5 walked the agent from your elbow into the pipeline: assisting you (Module 24), then acting
under supervision (Module 25), then several of them at once (Module 26). Each step removed a human
from a loop. So the question this module exists to answer is blunt:

> **An agent did work while you were asleep. How do you *know* it did good work?**

"I read the diff" doesn't scale: the whole point of an unattended agent is that you weren't there.
"CI passed" is necessary but thin. CI proves the code builds and your existing tests are green, not
that the agent actually did the *right thing*, well, on the cases that matter. You need a way to
measure agent output **systematically**, the same way every time, on a fixed set of cases, with a
score you can compare across runs. That measurement is an **eval**.

### What an eval actually is

An eval has exactly three parts. None of them are exotic:

1. **An eval set**: a fixed list of representative cases. Inputs the agent will face, chosen to
   cover the normal path *and* the edges where it tends to fail.
2. **A grader**: something that turns each case's output into a result. Pass/fail, or a score. The
   grader can be code (`==`, a regex, "does it compile, run, and produce this output") or, when the
   output is open-ended, another model (LLM-as-judge).
3. **An aggregate + a threshold**: roll the per-case results into one number, and a line that number
   has to clear. "18/20 = 90%, and I require 90%."

That's it. An eval is a test suite pointed at *agent behavior* instead of a function, with a score
instead of a single green check, run against a moving target (the model) instead of frozen code.
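
The three parts fit in a few lines. What follows is a minimal sketch, not the lab's `run_eval.py`: the `Case` shape, the `grade` and `run_eval` names, and the idea of passing task statuses as a tuple are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    name: str
    statuses: tuple[str, ...]  # input: the status of every task in the list
    expected: int              # the pending count a correct answer returns


# Part 1: the eval set, a fixed list of representative cases.
EVAL_SET = [
    Case("empty list", (), 0),
    Case("all pending", ("pending", "pending", "pending"), 3),
    Case("one task done", ("pending", "done"), 1),
    Case("all done", ("done", "done"), 0),
]


# Part 2: the grader, which turns one case's output into pass/fail.
def grade(candidate: Callable[[tuple[str, ...]], int], case: Case) -> bool:
    try:
        return candidate(case.statuses) == case.expected
    except Exception:
        return False  # a crashing candidate fails the case; the eval keeps going


# Part 3: the aggregate and the threshold it has to clear.
def run_eval(
    candidate: Callable[[tuple[str, ...]], int], threshold: float
) -> tuple[float, bool]:
    results = [grade(candidate, case) for case in EVAL_SET]
    score = sum(results) / len(results)
    return score, score >= threshold


def count_pending(statuses: tuple[str, ...]) -> int:
    """Stand-in for the agent output under evaluation."""
    return sum(1 for status in statuses if status == "pending")


score, passed = run_eval(count_pending, threshold=0.9)
print(f"score={score:.0%} gate={'pass' if passed else 'block'}")
```

Swapping `count_pending` for Python's built-in `len` in the last call scores 50% on these four cases, because `len` counts finished tasks too.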

### Eval vs. test: the distinction that matters

You already write tests (Module 13), so the instinct to ask "isn't an eval just a test?" is
correct enough to be dangerous. Where they diverge:

| | A test (Module 13) | An eval |
|---|---|---|
| **Subject** | Your code, frozen | An agent/model's output, which changes under you |
| **Result** | Binary: pass/fail | A score across many cases (90%, not "green") |
| **Determinism** | Same input → same output | Same input may give *different* output run to run |
| **Failure meaning** | The code is broken | The agent is *less good*, maybe still acceptable |
| **What it gates** | "Is the code correct?" | "Is this model/prompt good enough to trust here?" |

The practical upshot: a single failing case doesn't condemn an agent the way a failing unit test
condemns code. You're measuring a *rate*. An agent that gets 19/20 right may be exactly what you
want unattended on low-stakes work and nowhere near enough for high-stakes work. The eval gives you
the rate; *you* set the bar per task.

And the inverse: **where a deterministic test is possible, write the test, not an eval.** Evals are
for the band of behavior tests can't pin down: open-ended output, judgment calls, "did it pick a
reasonable approach." Reaching for an LLM judge to grade something `==` could have caught is how you
get a slower, flakier, more expensive test that you trust less. (The lab's grader is deliberately
programmatic for exactly this reason.)
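
To make "you're measuring a rate" concrete, here is a small sketch under loud assumptions: `agent_run_passes` is a hypothetical stand-in for "run the agent once and grade the output in code", wired to pass roughly nine times in ten, and the two bars are illustrative, not recommendations.

```python
import random


def agent_run_passes(rng: random.Random) -> bool:
    """Hypothetical stand-in for 'run the agent once, grade the output'.

    Passes about 9 times in 10 to imitate output that varies run to run.
    """
    return rng.random() < 0.9


def pass_rate(runs: int, seed: int = 0) -> float:
    rng = random.Random(seed)  # seeded so the demo itself is repeatable
    passes = sum(agent_run_passes(rng) for _ in range(runs))
    return passes / runs


rate = pass_rate(runs=20)

# A test asks one yes/no question. An eval compares a rate to a per-task bar.
BARS = {"fix doc typos": 0.90, "touch auth code": 1.00}  # illustrative bars
for task, bar in BARS.items():
    verdict = "meets bar" if rate >= bar else "below bar"
    print(f"{task}: rate={rate:.0%} bar={bar:.0%} -> {verdict}")
```

The same measured rate can clear one bar and miss the other, which is why the bar belongs to the task and not to the agent.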

### Building the eval set

The eval set is the asset. The grader is plumbing; the *cases* are where the judgment lives, and a
good set is mostly edges. Three sources fill it fast:

- **The normal path**: a couple of cases proving the agent does the obvious thing. These rarely
  catch anything; they're the floor.
- **The edges you already know break**: every "it looked right but" bug your agents have shipped is
  a permanent case. Module 13 left us a perfect one: an agent implemented `pending_count()` as
  `len(self.tasks)`. It passes any quick manual check (add three tasks, count says three) and is
  wrong the instant a task is marked done. *That bug becomes case #4 in this module's lab and never
  escapes again.*
- **The cases you'd manually check anyway**: write down the inputs you reflexively try when
  reviewing this kind of change. That list *is* your eval set; you've just been running it in your
  head and forgetting the results.

Keep it small and sharp. Twenty discriminating cases beat two hundred that all test the happy path.
A case that every candidate passes tells you nothing; the cases that *separate* a good agent from a
bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it
in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way
the syllabus means: it outlives every model it ever judges.
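
A short sketch shows why the edge case carries the value. The case layout and function names below are assumptions for illustration (the lab's `eval_set.py` may be shaped differently); the bug is the `len`-based one from Module 13, reduced to a plain function over a list of statuses.

```python
# Each case: (description, task statuses, expected pending count).
CASES = [
    ("empty list", [], 0),
    ("three pending tasks", ["pending", "pending", "pending"], 3),
    ("completed tasks are NOT pending", ["pending", "done", "pending"], 2),
]


def correct_pending_count(statuses: list[str]) -> int:
    return sum(1 for status in statuses if status == "pending")


def buggy_pending_count(statuses: list[str]) -> int:
    return len(statuses)  # the Module 13 bug: counts every task, done or not


CANDIDATES = {"correct": correct_pending_count, "len_bug": buggy_pending_count}

# results[candidate][case description] -> True if the case passed
results = {
    name: {desc: fn(statuses) == expected for desc, statuses, expected in CASES}
    for name, fn in CANDIDATES.items()
}

for desc, _, _ in CASES:
    row = "  ".join(
        f"{name}={'pass' if results[name][desc] else 'FAIL'}" for name in CANDIDATES
    )
    print(f"{desc:<34}{row}")
```

Both candidates pass the first two cases, so those rows say nothing about which one to trust. Only the third row separates them.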

### Scoring: programmatic first, LLM-as-judge only when you must

Two graders, in strict priority order.

**Programmatic.** If "correct" is checkable in code (exact value, output matches, exit code is 0,
the file it shouldn't have touched is untouched), do that. It's deterministic, free, fast, and you
trust it completely. Most of what an agent does to a codebase is checkable this way, because code
either runs and produces the right thing or it doesn't.

**LLM-as-judge.** Some output has no `==`: "is this commit message clear?", "does this PR
description explain the change?", "is this refactor actually cleaner?" The standard move is to ask
*another* model to grade it against a rubric. It works, and sometimes it's the only option, but be
honest about what you've built:

- **Correlated blind spots.** A judge is a model grading a model. It can share the candidate's
  confusion and pass a wrong answer because both are wrong the same way. Your grader and the thing
  it grades are not independent.
- **Bias.** Judges tend to favor longer, more confident, and first-presented answers regardless of
  correctness. Control for position and length, or your scores measure verbosity and ordering.
- **Drift.** Swap the judge model and your scores move while the candidate didn't change. The ruler
  is made of rubber, which is poison for *regression* evals, whose entire job is to hold the ruler
  still.

So when you must use a judge: pin it (fixed model, `temperature: 0`), keep it **separate** from the
model under test, and **calibrate it against human labels**: hand-grade ~20 examples, run the judge
on the same 20, and confirm it agrees with you *before* you let it gate anything. An uncalibrated
judge is a vibe with a number attached. The lab ships a model-agnostic judge stub (`llm_judge.py`)
that abstains until you point it at your own endpoint, with these limits written into the file.
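
Calibration and position control can both be sketched without calling any model. Everything here is an assumption for illustration: the labels are made up (and shortened to ten examples for space), `MIN_AGREEMENT` is a bar you would choose yourself, and `first_biased_judge` is a toy that has nothing to do with the lab's `llm_judge.py`.

```python
from typing import Callable


def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need the same non-empty set of examples for both")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def position_consistent(pick: Callable[[str, str], str], a: str, b: str) -> bool:
    """Ask the judge twice with the order swapped; accept the verdict only if
    it picks the same answer both times. `pick(first, second)` returns the
    text the judge prefers.
    """
    return pick(a, b) == pick(b, a)


# Made-up labels for illustration: True means "acceptable commit message".
human_labels = [True, True, False, True, False, True, True, False, True, True]
judge_labels = [True, True, True, True, False, True, False, False, True, True]

MIN_AGREEMENT = 0.9  # assumed bar; set your own before the judge gates anything
rate = agreement(human_labels, judge_labels)
print(f"judge agrees with the human labels on {rate:.0%} of examples")
print("judge may gate:", rate >= MIN_AGREEMENT)


def first_biased_judge(first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers what it saw first."""
    return first


consistent = position_consistent(
    first_biased_judge, "fix bug", "Fix off-by-one in pending_count"
)
print("position-consistent:", consistent)
```

With these made-up labels the judge disagrees on two of ten examples, so it stays out of the gate until the rubric or the judge improves. The toy judge fails the swap check because its pick follows the order, not the content.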

### Regression evals: the safety check on a swap

Here is where the course thesis stops being a slogan and becomes a procedure.

You *will* swap the model. A cheaper one ships, your provider deprecates the one you're on, a new
release benchmarks better, someone edits the agent's prompt or its committed instructions file
(Module 5). Every one of those changes the behavior of every agent you run, silently. The code
around the model didn't change; the model did, and the model is the part you don't control.

A **regression eval** is the discipline of running the *same eval set* before and after the change
and comparing the scores. The current model/prompt earns a baseline score. After the change (a new
model, a new prompt), the same eval set runs again and the two scores get compared. A score that
held or rose means the swap is safe by this eval; a score that dropped is a regression caught
*before* it ran unattended against real work, not after.
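
The comparison itself is small. This sketch assumes each run is recorded as a mapping from case name to pass/fail. The case names and results below are invented to mirror a 100% to 60% drop, and `compare_runs` is a hypothetical helper, not part of the lab.

```python
def score(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs of the SAME eval set, before and after a change."""
    if baseline.keys() != candidate.keys():
        raise ValueError("a regression eval needs the same eval set on both sides")
    before, after = score(baseline), score(candidate)
    return {
        "before": before,
        "after": after,
        "safe": after >= before,  # held or rose: safe by this eval
        "regressed": sorted(c for c in baseline if baseline[c] and not candidate[c]),
        "fixed": sorted(c for c in baseline if candidate[c] and not baseline[c]),
    }


# Illustrative per-case results, not real lab output.
baseline_run = {"case_1": True, "case_2": True, "case_3": True,
                "case_4": True, "case_5": True}
swapped_run = {"case_1": True, "case_2": True, "case_3": True,
               "case_4": False, "case_5": False}

report = compare_runs(baseline_run, swapped_run)
print(f"{report['before']:.0%} -> {report['after']:.0%}, safe={report['safe']}")
print("regressed:", report["regressed"])
```

Listing regressed cases by name is worth the extra lines: an unchanged aggregate can hide one case fixed and another broken.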

This is the answer to "the model is swappable." It's swappable **because** the eval set is what
makes swapping safe. Your prompts, your pipeline, your review reflexes, and, most of all, your
eval set don't expire when the model does. They're the durable skill the course promised in
Module 1. The model is a component you can replace; the eval is the regression test that tells
you the replacement fits. That's the whole argument, made operational.

### Guardrails: tying autonomy to a score

The last piece, and the real subject of Unit 5: **how much is this agent allowed to do without a
human?** Don't answer that by gut. Answer it with the eval score, and make the score *gate* the
autonomy.

| Eval score on this task | Reasonable autonomy (the Unit 5 ladder) |
|---|---|
| Low / unmeasured | Assistive only; it suggests, a human decides (Module 24). |
| Solid, below your bar | Autonomous but fully gated; opens a PR, a human reviews and merges (Module 25). |
| At/above bar, stable across runs | Unattended on this *narrow* task, landing behind CI + the eval as a gate. |
| High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |

Two things make a guardrail bite:

- **The threshold blocks.** The eval returns an exit code; below-bar exits non-zero and stops the
  pipeline exactly like a failing test (Module 14). The lab does this. An eval whose result nobody is
  forced to act on is a dashboard, not a guardrail.
- **Autonomy is per-task, not per-agent.** The same model can be trustworthy enough to merge
  doc fixes unattended and nowhere near enough to touch auth code. You hold a *different* eval and a
  *different* bar for each. "Trust the agent" is the wrong granularity; "trust this agent, on this
  task, to this score" is the right one.
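
Here is one way the ladder and the blocking threshold might look in code. It is a sketch under assumptions: `TaskPolicy`, the `floor` cutoff that separates "low" from "solid", and the specific bars are all hypothetical, and the fleet rung is left out because it depends on a broad eval set held over time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskPolicy:
    task: str
    bar: float    # score required before the agent may act unattended
    floor: float  # assumed cutoff below which a score counts as "low"


def autonomy(score: Optional[float], stable: bool, policy: TaskPolicy) -> str:
    """Map an eval score to a rung of the ladder for ONE task."""
    if score is None or score < policy.floor:
        return "assistive: it suggests, a human decides"
    if score >= policy.bar and stable:
        return "unattended on this task, behind CI and the eval gate"
    return "gated: it opens a PR, a human reviews and merges"


def gate_exit_code(score: float, policy: TaskPolicy) -> int:
    """0 lets the pipeline continue; non-zero blocks it, like a failing test."""
    return 0 if score >= policy.bar else 1


docs = TaskPolicy("doc fixes", bar=0.90, floor=0.50)
auth = TaskPolicy("auth changes", bar=1.00, floor=0.50)

same_score = 0.95  # one agent, one score, two different answers
for policy in (docs, auth):
    print(f"{policy.task}: {autonomy(same_score, True, policy)} "
          f"(exit {gate_exit_code(same_score, policy)})")
```

The same 95% earns unattended doc fixes and only a gated PR on auth changes, which is the per-task granularity in practice.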

---

## The AI angle

Every other module made a tool more valuable *because* you're using AI. This module closes the
argument the course opened with.

Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every
module since has been an installment on that claim: version control, review, CI, containers,
secrets, MCP, agents. **Evals are where it's proven.** An eval set is, literally, a model-agnostic
instrument: it judges output without caring which model produced it, which is exactly why it survives
the swap that retires the model. You don't trust an agent because you trust the vendor or this
quarter's benchmark; you trust it because *your* eval, on *your* cases, scored it above *your* bar,
and you'll re-run that same eval the day the model changes under you, which it will.

That's the durable skill. Models are weather. The eval set is the thermometer you keep.

---

## Hands-on lab


> **Starting point (this lab is skip-friendly).** This lab is self-contained and does not depend
> on the earlier labs. Its files live in `modules/27-evals/lab/`. Copy them into a working folder
> and make a first commit so you start clean:
>
> ```bash
> cp -r ~/ai-workflow-course/modules/27-evals/lab ~/ai-workflow-course/27-evals-lab
> cd ~/ai-workflow-course/27-evals-lab && git init -b main && git add -A && git commit -m "start: module 27"
> ```

**Lab language:** Python + shell. You'll run a tiny eval harness, point an agent at a task, and run
a regression eval across a "model swap."

The lab files are in [`lab/`](lab/):

- `eval_set.py`: five cases for the `pending_count` task (data only).
- `run_eval.py`: the runner; it imports a candidate, scores it, prints a scorecard, and exits
  non-zero below threshold.
- `candidates/current_model/tasks.py`: a correct candidate (stand-in for your current model's
  output).
- `candidates/swapped_model/tasks.py`: a plausible-but-wrong candidate (stand-in for a bad swap).
- `llm_judge.py`: a model-agnostic LLM-as-judge stub, with its limits written in.

**You'll need:** Python 3.10+, the `tasks-app` you've carried since Module 1, and Claude Code (sub
your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
the regression demo run offline. The real payoff comes when you replace them with your own agent's
output.

### Part A: Run the eval against the current model

1. From the lab folder, run the eval against the passing candidate:

   ```bash
   cd ~/ai-workflow-course/27-evals-lab   # the working copy from the starting point
   python run_eval.py candidates/current_model
   echo "exit code: $?"
   ```

   Five cases pass, the score is 100%, and the exit code is `0`. **This is your baseline**: the
   score the current model earns on this task. Read the cases in `eval_set.py`: notice case #4,
   "completed tasks are NOT pending." That's the Module 13 bug, now a permanent case.

### Part B: Swap the model and re-run (the whole point)

2. Now simulate the swap: run the *exact same eval set* against the other candidate:

   ```bash
   python run_eval.py candidates/swapped_model
   echo "exit code: $?"
   ```

   It drops to 60% and exits `1`. Look at *which* cases failed: the easy ones still pass; this
   output would sail through a casual manual check. The eval caught a regression that a skim would
   have missed, **and the non-zero exit code means a pipeline would have blocked it.** That is a
   guardrail doing its job.

### Part C: Make it real with your own agent

3. Open your `tasks-app` and tell Claude Code (sub your own agent) to implement (or re-implement)
   `pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the
   folder if it doesn't exist. You direct; the agent does the file plumbing. Then run the eval
   yourself and read the scorecard:

   ```bash
   python run_eval.py candidates/my_run_1
   ```

4. Now actually swap something. Either change the model Claude Code uses, or change the *prompt* (ask
   the same thing a different way, or tweak your committed instructions file from Module 5). Have the
   agent write this run into `candidates/my_run_2/`, then run `run_eval.py` yourself and compare the
   two scores. You just ran a regression eval on a real model/prompt change and got a number that
   tells you whether the change was safe. If a run scores below 100%, read the failing case and direct
   the agent to append the input that broke it as a new permanent case in `eval_set.py`; verify the
   case it added. The set gets sharper every time an agent surprises you.

5. *(Optional, needs a model endpoint.)* Open `llm_judge.py`, read the limits at the bottom, set the
   `EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output, say a
   commit message your agent wrote. Note how much shakier that score feels than the programmatic one.
   That feeling is correct, and it's why programmatic graders come first.

### Part D: Set the guardrail (on paper, then in CI)

6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
   *"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
   human reviews."* Then make it enforceable. This is one job in a CI workflow (Module 14), so direct
   Claude Code (sub your own agent) to add an eval-gate job to the workflow it already wired up in
   Module 14, running the same command from Parts A–B. The job it adds should look like this:

   ```yaml
   - name: Eval gate
     working-directory: modules/27-evals/lab
     run: python run_eval.py candidates/current_model --threshold 1.0
   ```

   Review the diff before you accept it, and confirm the path logic is right. The
   `working-directory:` line makes the CI job `cd` into the lab folder first, so the
   `candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
   did on your machine. (Drop it and point a repo-root job straight at
   `python modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/`
   won't exist from the repo root: the gate crashes with a *false* failure, which is worse than no
   gate. If the agent prefers a single line, it can spell both paths out from the repo root:
   `python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model
   --threshold 1.0`.)

   Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail
   is now structural, not a promise.

   **One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled,
   always-correct stand-in: it scores 100% on every run, forever, so a gate pointed at it can never
   fail. That's a dashboard, not a guardrail: the exact trap the Guardrails section warns about. In a real
   pipeline, point the gate at the candidate that actually *varies*: your agent's real output for
   this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the
   model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the
   same command drops to 60%, exits `1`, and blocks the merge.

---

## Where it breaks

The honesty this course has insisted on all the way through applies hardest to its own closer.

- **Evals measure what you put in them, and nothing else.** A 100% score means the agent passed
  *your cases*, not that it's correct in general. The gap between "passes my eval" and "is actually
  good" is exactly the cases you didn't think to write. An eval set is a lower bound on quality, never
  a proof. Treat a green eval as "no known regression," not "verified correct."
- **Eval sets rot.** Cases that no model ever fails stop discriminating; tasks drift away from what
  you actually do. An eval set you don't prune and grow becomes a comforting green light that's
  measuring last year's problems. Budget maintenance for it like any other test suite.
- **LLM-as-judge is a model grading a model.** Re-read that section: correlated blind spots, bias,
  and drift are not edge cases, they're the default behavior. An uncalibrated judge can hand you a
  confident wrong score, which is worse than no score. Where you can grade in code, do.
- **A score is not a decision.** The eval tells you the rate; *you* still set the bar, and the right
  bar depends on stakes the eval can't see. 95% might be plenty for triaging issue labels and
  reckless for anything touching auth, money, or customer data. The number informs the judgment; it
  doesn't replace it.
- **Evals don't catch novel harms, only measured ones.** A genuinely new failure mode (a class of
  mistake no case anticipates) passes every eval until the day it doesn't and you add the case after
  the fact. Evals make agents *trustworthy on known territory*. They are not a substitute for the
  recovery muscles (Module 12) that exist for when something gets through anyway.
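
Rot is one of the few items on this list you can partly detect. A minimal sketch, assuming you keep each run's per-case results as a dict: the `never_failed` helper and the history below are hypothetical, and a flagged case is a prompt to review it, not to delete it automatically.

```python
def never_failed(history: list[dict[str, bool]]) -> list[str]:
    """Cases that were present and passing in every recorded run.

    They have stopped separating good output from bad, at least for the
    models and prompts in this history.
    """
    if not history:
        return []
    all_cases = set().union(*history)  # every case name seen in any run
    return sorted(
        case for case in all_cases
        if all(run.get(case, False) for run in history)  # missing counts as not passing
    )


# Illustrative run history: one dict of case -> passed per eval run.
history = [
    {"empty list": True, "all pending": True, "one task done": True},
    {"empty list": True, "all pending": True, "one task done": False},
    {"empty list": True, "all pending": True, "one task done": True},
]

stale = never_failed(history)
print("never failed:", stale)
```

Some of the flagged cases are the normal-path floor and stay on purpose. The rest are candidates to replace with a sharper edge.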

---

## Check for understanding

**You're done when:**

- You can explain the difference between a test and an eval, and say when you'd reach for each.
- You've run `run_eval.py` against both bundled candidates and watched the same eval set pass one and
  fail the other, including the exit code flipping to `1`.
- You've graded your *own* agent's output, then changed the model or prompt and re-run the same eval
  set as a regression check, and you can read the before/after scores as "safe" or "not safe."
- You can state, for one concrete task, the eval score that would let an agent act unattended on it,
  and where that threshold would live in your pipeline.
- You can say, in your own words, why the eval set is the durable skill and the model is the swappable
  part. That's the whole course in one sentence, and you can now run it from the keyboard.

That's the close. You started by copy-pasting out of a chat window; you're ending by letting an agent
act without you and holding a measured, enforceable line on whether to trust it. The model under that
line will change many times. The line is yours to keep.

---

## Verify-before-publish

This is an expansion-zone module over fast-moving ground. Re-check at build/publish time:

- [ ] **No vendor pinned.** Confirm the module text, lab, and `llm_judge.py` still name no specific LLM
  provider, model id, or pricing, and that `llm_judge.py`'s endpoint config is still generic
  (env-var driven, OpenAI-style-compatible but not branded).
- [ ] **Eval frameworks named.** If the module names any eval framework or LLM-as-judge tool by
  name (it currently names none on purpose), verify it still exists and behaves as described. Prefer
  keeping it tool-agnostic.
- [ ] **LLM-as-judge claims.** The bias/drift/correlation caveats are durable, but re-check that no
  cited best practice (e.g., calibration-against-human-labels guidance) has been superseded.
- [ ] **Module cross-references.** Confirm Modules 1, 5, 10, 12, 13, 14, and 24–26 still carry the
  responsibilities referenced here (the thesis and `tasks-app`, the committed instructions file,
  review, recovery, tests, CI gating, the agent autonomy ladder) and that none were renumbered.
- [ ] **Lab still runs.** `python run_eval.py candidates/current_model` exits 0 at 100%, and
  `candidates/swapped_model` exits 1 below threshold, on a current Python 3.x.