2026-06-23 07:28:55 -04:00

Letting the AI Off the Leash (Without Getting Bitten)

For fifteen posts now I've been telling you to keep the AI on a short leash. Review every diff. Gate every change behind CI. Commit everything so you can undo it. Treat the model like a fast, confident, occasionally-wrong contributor who needs a net under it.

This is the post where I tell you to walk away and let it work.

Not because the leash was wrong, but because the leash is exactly what makes walking away safe. That's the whole idea of Unit 5 of The Workflow, the final unit before the capstone, and it's the part people skip straight to and then wonder why it goes badly. They want the agent that fixes its own failing build at 3am. They don't want the eight modules of review reflexes, CI gates, security scanning, and recovery muscle that are the only reason that agent isn't a liability. You can't have the first without the second. The whole back half of this course was load-bearing for this exact moment.

So let me walk you up the ladder, because Unit 5 is a ladder: four modules, each handing the AI a little more rope, and each rung only reachable because the one below it held.

The honest through-line

Here's the thing I most want you to take from this unit, even if you read nothing else:

You don't supervise an autonomous agent by watching it work. You supervise it structurally, by making everything it produces pass through gates that don't care whether a human or a machine wrote the change.

Read that twice. The instinct everybody brings to "AI agents" is "I'll keep an eye on it." But watching an agent type is both a terrible use of your attention and a lie you tell yourself: you'll watch the first three and rubber-stamp the next thirty. Supervision that depends on your vigilance isn't supervision; it's hope.

The fix is to move the supervision off the human and into the structure. The agent's output lands in a PR. CI runs on it. Security scans it. A human reviews a sample. Recovery is one git revert away if something slips. You're not trusting the agent. You're trusting the catches, and you built every one of those catches in earlier units, on purpose, before you needed them. That's why this unit is at the end and not the start.

Rung 1, Assistive: the AI comments, you decide

The bottom rung is the safest possible way to put an AI inside your workflow instead of beside it: let it comment and label, and keep every decision yours.

Two patterns. The AI reviewer reads a pull request diff against a rubric you committed to the repo and posts review comments: the tireless first pass that catches the boring-but-deadly stuff (a handler that prints "saved" without persisting, a behavior change with no new test, a hardcoded secret) so your fresh human attention lands on the judgment calls. The triage agent reads an incoming issue and proposes labels and a route (ai-ready for the small, well-scoped stuff an agent could take, needs-human for the ambiguous and risky) from a taxonomy you committed.

Notice the word I keep using: proposes. The output is text. Comments and suggestions. And text changes nothing until a person acts on it. That's the entire reason this is the safe on-ramp: the blast radius of a wrong answer is a comment you ignore or a label you fix with one click. Same agent, same model you'll use on the scary rungs, but here being wrong is free. You build the reflex of working with an agent while its mistakes cost nothing.
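Here's a minimal sketch of what "proposes" can look like in code. The taxonomy contents, the `Proposal` shape, and the function names are my assumptions for illustration, not the lab's actual format. The point is that the agent's output is checked against the committed taxonomy and rendered as text, and nothing in the script applies a label.

```python
from dataclasses import dataclass

# Hypothetical taxonomy; in practice this would be loaded from a file
# committed to the repo, not hardcoded.
TAXONOMY = {"ai-ready", "needs-human", "bug", "docs"}


@dataclass(frozen=True)
class Proposal:
    issue: int
    labels: tuple[str, ...]
    rationale: str


def unknown_labels(proposal: Proposal) -> list[str]:
    """Return proposed labels that are not in the committed taxonomy."""
    return [label for label in proposal.labels if label not in TAXONOMY]


def render(proposal: Proposal) -> str:
    """Format the proposal as text for a human. Nothing is applied here."""
    lines = [
        f"Issue #{proposal.issue}: proposed labels {', '.join(proposal.labels)}",
        f"Why: {proposal.rationale}",
    ]
    rejected = unknown_labels(proposal)
    if rejected:
        lines.append(f"Not in taxonomy (ignored): {', '.join(rejected)}")
    lines.append("Decision: pending human")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(Proposal(42, ("ai-ready", "urgent"), "small, well-scoped change")))
```

The design choice worth copying is the last line of `render`: the script ends at a decision it doesn't make.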

The lab makes this concrete and local: no hosted bot account required. You run a little Python script that assembles the prompt, you hand it to your own AI, and the script renders the result and stops at a decision gate:

cd modules/24-assistive-agents/lab
python reviewer.py prompt          # builds: your committed rubric + the diff
# (paste into your AI, save its JSON to my-review.json)
python reviewer.py apply my-review.json

The diff it's reviewing has a real trap planted in it: a new clear command that prints "cleared all tasks" but never actually calls save(), so tasks.json is untouched. Did your AI catch it? Either way, you make the merge call, and you learn exactly how much this reviewer is worth before the stakes go up.
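If you want to see why that trap is so easy to miss, here's a sketch of the shape of the bug. This is not the lab's code; the function names and the storage helpers are assumptions I made up to reproduce the pattern. The buggy version empties an in-memory list and reports success, and only a check that reloads from disk notices the difference.

```python
import json
import tempfile
from pathlib import Path


def save(path: Path, tasks: list[str]) -> None:
    """Persist the task list to disk as JSON."""
    path.write_text(json.dumps(tasks))


def load(path: Path) -> list[str]:
    """Read the task list back from disk."""
    return json.loads(path.read_text())


def clear_buggy(path: Path) -> str:
    tasks = load(path)
    tasks.clear()  # empties only the in-memory list
    return "cleared all tasks"  # claims success, but save() was never called


def clear_fixed(path: Path) -> str:
    save(path, [])  # the change actually reaches the file
    return "cleared all tasks"


def persists_clear(clear, path: Path) -> bool:
    """The check a reviewer should ask for: reload from disk after the command."""
    save(path, ["write post", "ship lab"])
    clear(path)
    return load(path) == []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "tasks.json"
        print("buggy persists:", persists_clear(clear_buggy, store))
        print("fixed persists:", persists_clear(clear_fixed, store))
```

Both versions return the same message, which is exactly why a skim of the output tells you nothing.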

[Screenshot: reviewer.py output showing AI comments sorted by severity, a recommendation, and the "human decides" gate]

One caveat that's really the whole game: an assistive agent is only assistive if its permissions say so. "It just comments" is a property of its access token, not its prompt. Grant the reviewer bot merge rights "for convenience" and you've silently jumped two rungs up the ladder without the gate that makes the higher rung safe. Scope it to comment-and-label. Verify the scope. The human-decides guarantee has to be structural, not a promise.
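"Verify the scope" can be a few lines of code that run before the bot does anything. This is a sketch under loud assumptions: the permission names below are invented, and how you read the granted permissions off a token depends entirely on your hosting platform. What it shows is the check itself, an allowlist where anything extra is a hard failure.

```python
# Hypothetical permission names; real ones depend on your hosting platform.
ALLOWED = frozenset({"pull_requests:comment", "issues:label"})


def excess_scopes(granted: set[str]) -> set[str]:
    """Return every granted permission that is outside the assistive allowlist."""
    return set(granted) - ALLOWED


def assert_assistive(granted: set[str]) -> None:
    """Refuse to run if the token can do more than comment and label."""
    extra = excess_scopes(granted)
    if extra:
        raise PermissionError(f"reviewer token is over-scoped: {sorted(extra)}")


if __name__ == "__main__":
    assert_assistive({"pull_requests:comment", "issues:label"})  # passes quietly
    try:
        assert_assistive({"pull_requests:comment", "contents:write"})
    except PermissionError as err:
        print(err)
```

An allowlist fails closed: a permission you've never heard of is rejected by default, which is the behavior you want from a guarantee.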

Rung 2, Autonomous: the AI acts, supervised

Now the agent stops suggesting and starts doing. You hand it an issue; it reads the acceptance criteria, makes a branch, edits files, commits, and opens a pull request. Or you point it at a red CI build and it reads the failing logs, proposes a fix, and pushes it back. The AI is taking real actions now, and the obvious worry is: if I'm not watching, what stops it from shipping garbage?

The gates do. The exact ones you already built:

| Gate | Built in | What it catches on an agent's PR |
| --- | --- | --- |
| Review | Unit 2 | Plausible-but-wrong logic, scope creep, dropped edge cases. |
| CI | Unit 3 | Lint failures, broken tests, anything that doesn't build. |
| Security | Unit 3 | Hardcoded secrets, vulnerable or hallucinated dependencies. |
| Recovery | Unit 2 | The backstop: if something slips through, revert undoes it cleanly. |

The agent is autonomous inside that box and powerless to escape it. It cannot merge past a failing check or an unapproved review. Its last step is "open a PR," not "merge." If your mental model of "autonomous" was "merges to main unseen," this is where you fix it; nothing in this unit does that, and the moment you wire an agent to merge its own work past a gate a human controls, you've left supervised autonomy and you own whatever it ships.

The lab runs the whole thing locally against the tasks-app, and the best part is watching the gate reject a bad change:

git checkout -b agent/delete-command
python agent_runner.py issue-to-pr issue-delete-command.md --simulate bad
# → ruff + pytest run, a test fails, the script refuses to call the work ready.
#   Exit code non-zero. No PR. Nothing reached main.

That's structural supervision in four seconds. It didn't matter that the change looked plausible; the gate didn't care who wrote it.
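For a feel of how little code a gate like that needs, here's a minimal sketch. It is not the lab's agent_runner.py; the function names are mine, and I'm assuming ruff and pytest are installed and run from the repo root. Each gate is just a callable that returns pass or fail, and the work is only "ready" if every gate passes.

```python
import subprocess
import sys
from typing import Callable, Sequence

Gate = tuple[str, Callable[[], bool]]


def command_gate(argv: Sequence[str]) -> Callable[[], bool]:
    """Wrap a command as a gate that passes only on exit code 0."""
    def run() -> bool:
        return subprocess.run(list(argv)).returncode == 0
    return run


def run_gates(gates: Sequence[Gate]) -> list[str]:
    """Run every gate and return the names of the ones that failed."""
    return [name for name, check in gates if not check()]


def main() -> int:
    gates = [
        ("lint", command_gate(["ruff", "check", "."])),
        ("tests", command_gate(["pytest", "-q"])),
    ]
    failed = run_gates(gates)
    if failed:
        print(f"not ready: {', '.join(failed)} failed; no PR opened")
        return 1  # non-zero exit is what a pipeline keys off
    print("all gates green; ready to open a PR for human review")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Notice what's absent: there is no parameter for who wrote the change. That omission is the whole design.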

There's a second pattern here worth its own warning, self-healing CI, because it tempts the single worst shortcut in the toolkit. Point an agent at a failing test and it will cheerfully "fix" it by editing the test to pass. A human would feel the dishonesty. The agent just optimizes the objective you gave it. So the green result still lands as a reviewable PR where a human reads the removed ("-") lines in the test file's diff, and the retry loop is capped at two or three attempts, because an agent that can retry forever on a flaky test will, with a runner bill to match.
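Both safeguards in that paragraph fit in a short sketch. The names here are mine, and the test-file naming convention (a tests/ directory, test_ prefixes) is an assumption you'd adjust to your repo. The first function bounds the retry loop; the second flags any changed path that looks like a test so a human knows where to look first.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # hard cap on self-heal retries


def heal(attempt_fix: Callable[[int], bool],
         max_attempts: int = MAX_ATTEMPTS) -> Optional[int]:
    """Call attempt_fix until it reports green or the cap is reached.

    Returns the attempt number that went green, or None if none did.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_fix(attempt):
            return attempt
    return None


def touched_tests(changed_paths: list[str]) -> list[str]:
    """Return changed files that look like tests (assumed naming convention)."""
    flagged = []
    for path in changed_paths:
        name = path.rsplit("/", 1)[-1]
        in_tests_dir = path.startswith("tests/") or "/tests/" in path
        if in_tests_dir or name.startswith("test_") or name.endswith("_test.py"):
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    changed = ["cli.py", "tests/test_cli.py"]
    flagged = touched_tests(changed)
    if flagged:
        print("green, but the fix edited tests; read these first:", flagged)
```

The flag doesn't block anything by itself. It only makes the cheat impossible to miss, and the human reading the diff does the rest.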

Which brings me to the one rule that actually governs how much autonomy you can hand out:

An autonomous agent is exactly as safe as the gates it lands behind; no safer.

If your tests cover 30% of behavior, an agent can silently break the other 70% and still go green. The honest version of "should I let an agent do this unattended?" is "would my CI catch it if it got it wrong?" Autonomy doesn't ask you to trust the model more. It asks you to trust your gates more, and to have earned it.

Rung 3, Orchestration: more than one, without the collisions

One agent on a branch was the experiment. The thing nobody tells you is how fast you want a second one. The agent works in wall-clock minutes, so the instant one job is running you notice three others sitting idle. The model was never the constraint; the constraint was that every job wanted the same repo, the same files, the same checked-out branch.

This is where the worktrees from way back in Unit 1 finally pay the rent. Each agent gets its own worktree on its own branch tied to its own issue, with main reserved as the sacred integration point that no agent works in:

tasks-app/            ← main worktree, on main, the integration point, no agent here
tasks-app-42-count/   ← issue #42, branch feature/42-count, agent A
tasks-app-43-docs/    ← issue #43, branch feature/43-docs,  agent B
tasks-app-44-clear/   ← issue #44, branch feature/44-clear, agent C
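That layout is mechanical enough to generate. Here's a small sketch that builds the `git worktree add` call for one issue, following the naming in the listing above. The function name and the `/work` path are my assumptions, and the sketch only builds the command; running it is left to you.

```python
from pathlib import Path


def worktree_command(repo: Path, issue: int, slug: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` call for one issue (built, not executed)."""
    branch = f"feature/{issue}-{slug}"
    path = repo.parent / f"{repo.name}-{issue}-{slug}"  # sibling of the main worktree
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


if __name__ == "__main__":
    repo = Path("/work/tasks-app")  # hypothetical location of the main worktree
    for issue, slug in [(42, "count"), (43, "docs"), (44, "clear")]:
        print(" ".join(worktree_command(repo, issue, slug)))
```

Every worktree branches from main and none of them is main, so the integration point stays clean no matter how many agents are running.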

But here's the reframe that organizes the whole module, and it surprised me the first time it clicked:

Running multiple agents is not a parallel-programming problem. It's a project-management problem that happens to have agents as the workers.

Splitting work so it doesn't overlap, coordinating who owns what, integrating the results, reviewing it all: those are the hard parts a tech lead has always had. The agents just make the doing fast enough that the coordinating becomes the whole job. The lab hands you three issues where two are genuinely independent (different files) and one is deliberately set to collide (it touches the same cli.py dispatch chain as another). You predict the conflict from a one-table coordination plan before launching anything, and then watch it come true at merge, exactly where the plan said it would.
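The "predict the conflict before launching" step is a set intersection. Here's a sketch, assuming a plan that maps each issue to the files its agent is expected to touch. The file lists are made up for illustration; only the cli.py overlap reflects the lab's setup.

```python
from itertools import combinations

# Hypothetical coordination plan: issue number -> files the agent should touch.
PLAN = {
    42: {"cli.py", "tasks.py"},
    43: {"README.md"},
    44: {"cli.py", "storage.py"},
}


def predicted_conflicts(plan: dict[int, set[str]]) -> dict[tuple[int, int], set[str]]:
    """Return {(issue_a, issue_b): shared_files} for every overlapping pair."""
    conflicts = {}
    for (a, files_a), (b, files_b) in combinations(sorted(plan.items()), 2):
        shared = files_a & files_b
        if shared:
            conflicts[(a, b)] = shared
    return conflicts


if __name__ == "__main__":
    for (a, b), files in predicted_conflicts(PLAN).items():
        print(f"#{a} and #{b} will collide on: {', '.join(sorted(files))}")
```

File-level overlap is a coarse predictor: two agents can share a file without conflicting, and can break each other without sharing one. It's still the cheapest warning you'll get, and it costs nothing to run before you launch.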

And then you hit the wall that every honest practitioner hits:

Compute stopped being the bottleneck the moment agents got cheap. Your attention is the new bottleneck, and it doesn't fan out.

Five agents finish in parallel. You read their diffs in series. Splitting the work (one brain deciding the seams) and reviewing the results (one brain reading the diffs) are the two things that stay exactly as serial as they ever were. Three well-scoped agents routinely beat one. Eight overlapping agents routinely lose to one. The right fleet size isn't "as many as the tool allows"; it's "as many as the work genuinely splits into and you can still review." Merging unread AI diffs to clear the queue is how a fleet quietly ships bugs at scale.

Rung 4, Evals: how you actually know

Which forces the question the entire unit has been building toward, and it's blunt:

An agent did work while you were asleep. How do you know it did good work?

"I read the diff" doesn't scale; the whole point was that you weren't there. "CI passed" is necessary but thin; it proves the code builds and your existing tests are green, not that the agent did the right thing on the cases that matter. You need to measure agent output systematically: the same way every time, on a fixed set of cases, with a score you can compare run to run. That measurement is an eval, and it's the close of the whole course.

An eval has three parts, none exotic: an eval set (a fixed list of representative cases, mostly edges), a grader (code where you can: ==, exit codes, "did it touch the file it shouldn't have"; an LLM-as-judge only where the output is genuinely open-ended), and a threshold the aggregate score has to clear. It's a test suite pointed at agent behavior instead of a frozen function, scored as a rate instead of a single green check.
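Those three parts map almost one-to-one onto code. This is a minimal sketch, not the lab's run_eval.py: the `Case` shape, the exact-match grader, and the function names are assumptions, and a real agent call would replace the plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One entry in the eval set: an input and the output we expect."""
    name: str
    prompt: str
    expected: str


def grade_exact(output: str, case: Case) -> bool:
    """Code grader: plain equality after trimming whitespace."""
    return output.strip() == case.expected


def run_eval(agent: Callable[[str], str],
             cases: Sequence[Case],
             threshold: float,
             grader: Callable[[str, Case], bool] = grade_exact):
    """Run every case, then return (score, cleared_threshold, per_case_results)."""
    if not cases:
        raise ValueError("an eval set needs at least one case")
    results = {case.name: grader(agent(case.prompt), case) for case in cases}
    score = sum(results.values()) / len(results)  # a pass rate, not one green check
    return score, score >= threshold, results
```

Because the agent is just a callable, the same eval set runs unchanged against any model you put behind it.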

The lab is the punchline of the whole series. You run the same eval set against two candidates:

cd modules/27-evals/lab
python run_eval.py candidates/current_model   # 100%, exit 0, your baseline
python run_eval.py candidates/swapped_model    # 60%, exit 1, blocked

The "swapped model" is a stand-in for the day a cheaper model ships, or your provider deprecates the one you're on, or someone edits the agent's prompt. The easy cases still pass (this output would sail through a casual skim), but the eval caught a regression a skim would have missed, and the non-zero exit code means a pipeline would have blocked the merge. That's a regression eval, and it's the moment this course's thesis stops being a slogan and becomes a procedure you run from the keyboard.
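The comparison behind that block can be sketched in a few lines. The case names and the 0.8 threshold below are invented; only the 100% versus 60% split mirrors the lab's numbers. Given per-case pass/fail results for two candidates, it lists what regressed and turns the score into an exit code.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases the baseline passed that the candidate fails or never ran."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


def exit_code(results: dict[str, bool], threshold: float) -> int:
    """0 if the pass rate clears the threshold, 1 otherwise."""
    return 0 if pass_rate(results) >= threshold else 1


# Hypothetical results: five cases, the swapped model fails the two edge cases.
BASELINE = {"add": True, "list": True, "done": True, "edge_empty": True, "edge_unicode": True}
SWAPPED = {"add": True, "list": True, "done": True, "edge_empty": False, "edge_unicode": False}

if __name__ == "__main__":
    print("regressed:", regressions(BASELINE, SWAPPED))
    raise SystemExit(exit_code(SWAPPED, threshold=0.8))
```

The list of regressed cases is the part you read; the exit code is the part the pipeline reads.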

Because here's where it all lands: the model is the cheap, swappable part. The workflow around it is the skill that lasts. An eval set is, literally, a model-agnostic instrument: it judges output without caring which model produced it, which is exactly why it survives the swap that retires the model. You will swap the model; you don't get a vote. You trust an agent not because you trust the vendor or this quarter's benchmark, but because your eval, on your cases, scored it above your bar, and you'll re-run that same eval the day the model changes under you. Models are weather. The eval set is the thermometer you keep.

And the eval is what finally lets you set the autonomy honestly. Not by gut, but by tying the rung of the ladder to the score:

| Eval score on this task | Reasonable autonomy |
| --- | --- |
| Low / unmeasured | Assistive only; it suggests, a human decides. |
| Solid, below your bar | Autonomous but fully gated; opens a PR, a human merges. |
| At/above bar, stable | Unattended on this narrow task, behind CI + the eval as a gate. |
| High across a broad set, held over time | Orchestrate it; run it in a fleet. |

Autonomy is per-task, not per-agent. The same model can be trustworthy enough to merge doc fixes unattended and nowhere near enough to touch auth code. "Trust the agent" is the wrong granularity. "Trust this agent, on this task, to this score" is the right one.
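If you want that mapping written down where a pipeline can read it, here's one way to sketch it. The numeric floor, the `stable` and `broad` flags, and the rung names are my assumptions; the course gives the table, not these cutoffs. Treating an above-bar score that isn't yet stable as "gated" is also my call, not something the table spells out.

```python
from typing import Optional

RUNGS = ("assistive", "autonomous-gated", "unattended-narrow", "orchestrated")


def reasonable_autonomy(score: Optional[float], bar: float, floor: float = 0.5,
                        stable: bool = False, broad: bool = False) -> str:
    """Map one task's eval score to a rung of the ladder.

    score  -- pass rate on this task's eval set, or None if unmeasured
    bar    -- the threshold you chose for this task
    floor  -- assumed cutoff below which a score counts as "low"
    stable -- the score has held across repeated runs
    broad  -- the eval set covers a wide range of cases
    """
    if score is None or score < floor:
        return RUNGS[0]
    if score < bar or not stable:
        return RUNGS[1]
    if not broad:
        return RUNGS[2]
    return RUNGS[3]


if __name__ == "__main__":
    # Same agent, two tasks, two different answers.
    print("doc fixes:", reasonable_autonomy(0.97, bar=0.9, stable=True))
    print("auth code:", reasonable_autonomy(0.40, bar=0.99))
```

The function takes a score for one task, never an agent name, which is the granularity point in code form.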

Where it breaks (because I always tell you)

  • An eval is a lower bound, never a proof. A 100% score means the agent passed your cases, not that it's correct in general. The gap between "passes my eval" and "is actually good" is exactly the cases you didn't think to write. Treat a green eval as "no known regression," not "verified correct," and grow the set every time an agent surprises you.
  • LLM-as-judge is a model grading a model. Correlated blind spots, length bias, and drift when you swap the judge aren't edge cases; they're the default. Where you can grade in code, grade in code. An uncalibrated judge is a vibe with a number attached.
  • Self-healing fixes the evidence, not the bug, if you let it. The bounded-retry cap stops the loop; only a human reading the diff stops the cheat. Never auto-merge a self-heal PR on green alone.
  • Fanning out non-parallel work is strictly worse than doing it in order: same work, plus a merge tax, plus N reviews instead of one. When in doubt, run it as one agent.
  • Your gates are the ceiling, and most gates are weaker than they look. Thin coverage, skipped scans, review-by-rubber-stamp: those don't just lower quality, they directly set how much an agent can quietly break. The unglamorous work of hardening your gates is the work of making agents trustworthy.
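On the LLM-as-judge point, "calibrated" can start as something very plain: have a human grade a sample, have the judge grade the same sample, and measure how often they agree. A sketch, with made-up verdicts and a function name of my own:

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of sampled cases where the judge's verdict matches the human's."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


if __name__ == "__main__":
    judge_verdicts = [True, True, False, True]   # hypothetical judge output
    human_verdicts = [True, False, False, True]  # hypothetical human labels
    print(f"judge agrees with human on {agreement(judge_verdicts, human_verdicts):.0%} of the sample")
```

Raw agreement is a crude measure, but it turns "I think the judge is fine" into a number you can re-check the day you swap the judge.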

That's the close

You started this course copy-pasting code out of a chat window, hoping you didn't drop a function in the shuffle. You're ending it letting an agent act without you and holding a measured, enforceable line on whether to trust it. The model under that line will change many times. The line is yours to keep, and it's the same line whether you run today's model or next year's.

That's the last unit. The next post is the capstone: one real feature taken end to end (prompt to branch to AI implementation to tests to PR to CI to security scan to review to merge to deploy) so the whole thing clicks into a single motion instead of a pile of tips.

If you've made it this far in the series, I'd genuinely love to know which rung of this ladder you actually use day to day, and which one still feels like a step too far. Drop a comment; I read them, and the honest pushback is what makes the course better.