Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
13 KiB
The AI's Code Looks Right. That's the Problem.
Here's a thing I had to unlearn the hard way: I'd spent years using how clean code looks as a shortcut for whether it's correct. Messy, confusing code made me slow down and squint. Tidy code with good variable names and a clear comment on top got a nod and a merge. That heuristic served me well for a long time.
Then I started reviewing code an AI wrote, and that instinct walked me straight into a wall.
This is the eleventh post in my walk through The Workflow, my free course on the toolchain around AI coding. And I'll say this plainly, the way the course does: reviewing a diff you didn't write is one of the most important and least-taught skills in this whole space. If you take one habit from the entire series, I'd be tempted to point at this one. So this post gets the weight it deserves.
Why your review instinct is now lying to you
Think about where bugs live in code a human wrote. They cluster where the human was uncertain: the gnarly edge case, the bit they rushed, the function with the TODO they meant to come back to. You can often feel the soft spots. The roughness is a signal. Confusing code is suspicious code, and your eye learned to slow down right where it mattered.
AI output inverts that signal completely. It is uniformly fluent. The variable names are good. The structure is clean. The comment above the broken line confidently states the correct intention. And the one wrong line looks exactly as polished as the forty right ones around it. The fluency is constant; the correctness is not, and you've spent a career using fluency as a proxy for correctness. That proxy is now actively misleading you.
So the question you're asking has to change. With human code, you mostly ask "is this good code?" With AI code, you have to ask something colder: "is this code true?" Does it actually do what it claims? Against the request I actually made? Using things that actually exist? That's a different activity, and assuming it's the same one is how people get burned.
The four plausibility traps
I call these plausibility traps because that's exactly what they are: code produced by a process optimizing for plausible-looking output, engineered (not on purpose, but effectively) to pass the quick skim you're tempted to give it. They're not random bugs. They're the characteristic ways fluent-but-untrue code goes wrong, and once you can name them you start seeing them.
1. Invented APIs. The model reaches for a function, a keyword argument, a config key, a flag, an endpoint that should exist by analogy, and doesn't, or exists with a different signature. The tell is that it reads more natural than the real API, because it was generated to be plausible rather than recalled from docs. Classic shape: assuming list.pop(i, default) works because dict.pop(k, default) does. The fix is unglamorous: verify every unfamiliar symbol against real docs or source. Confidence in the surrounding writing is not evidence.
2. Silent scope creep. You asked for one thing. The diff does that thing and quietly "improves" three others it was never asked to touch: reformats a file, reshuffles imports, renames a variable across the module, "simplifies" an unrelated function. Each extra edit is an unrequested change you now have to review with no stated intent behind it, and it's exactly where regressions hide. The discipline: every hunk must trace back to the request. Anything that doesn't is guilty until proven innocent.
3. Deleted edge-case handling. This is the most dangerous one, because it lives in the - lines you skim. While building the feature, the model drops a bounds check, removes a None guard, or, the worst version, replaces a real error with a silent swallow (except: pass) under the banner of "making it safer." The code now looks cleaner and passes every test you'd casually run, because you'd test the path that works. The bad input the deleted guard existed to catch now fails silently. Read every deletion. Deletions are where behavior disappears.
4. Convincing-but-wrong logic. An inverted condition (if not x where it meant if x), an off-by-one, < where it meant <=, and where it meant or, a filter quietly dropped from a comprehension. On the happy path it produces a believable-enough result, and the comment above it cheerfully narrates the correct behavior, so the comment actively vouches for the bug. The defense is to trace one real call through the changed code yourself instead of trusting the narration.
A real AI diff usually has most lines correct and one trap buried in legitimate work. That's the whole danger. The feature genuinely works when you try it. The trap is somewhere you didn't look.
The pull request is a gate, not a formality
So where do you run this review? At a gate. And the gate already has a name you know: the pull request (or merge request, if you're on GitLab; same thing).
A PR proposes merging a branch into main and pauses there so the change can be looked at before it lands. The trap is treating that pause as a rubber stamp ("looks good, merge"), which is exactly how bad changes get the institutional blessing of "well, it was reviewed."
Reframe it the way you already think about change control: a PR is a change gate, and merge is a one-way door. Once it's on main, it's in everyone's next clone, in CI, on its way to a deploy. The cheapest place to catch a problem is in the diff, before the door closes.
And here's the part people resist: this holds even when you're the only human on the repo. Not for bureaucracy's sake. For two reasons that genuinely pay off solo. Traceability: the PR is a durable record of what changed and why, linked to the issue it answers; git log tells you the change happened, the PR tells you the reasoning. And a forced read: opening the PR makes you look at the whole change as one diff, away from the chat you generated it in. That context switch is where you catch the thing you were too close to see. When the author is an AI with total confidence and zero memory of why, both reasons get sharper.
[insert a screenshot referencing a pull request diff view on GitHub/Gitea with a line comment on a deletion here]
Let me show you a trap
Talk is cheap, so here's the lab the course runs, compressed. You've got a tiny tasks-app, a command-line to-do list. In the base version, complete() validates the index, so done 99 on a list with three tasks gives you a clean, loud error and a non-zero exit code:
python cli.py done 99 # prints "error: no task at index 99", exits non-zero
echo "exit code: $?"
Remember that. A bad index is a clean failure today.
Now a teammate-who-is-an-AI opens a PR. The prompt it was given was exactly one sentence: "Add a delete <index> command to the tasks app." You bring its change in on its own branch and read it as one unit before deciding anything:
git switch -c ai-delete-command
git apply /path/to/lab/ai-change.patch
git diff main..ai-delete-command
The diff adds a delete command. It works: try delete 0, the task goes away, clean exit. If you stopped there, you'd approve it. The feature you asked for is genuinely fine.
But run the failure path, not the happy one:
python cli.py done 99 # the trap
echo "exit code: $?"
In the base app that was a loud error. After this "add a delete command" change, it prints updated and exits 0, silently claiming success while marking nothing. Why? Because while it was in the file, the AI also rewrote complete() to swallow the IndexError "for safety." That's three traps in one small hunk: scope creep (it touched complete(), which the request never mentioned), deleted edge-case handling (the guard done relied on is gone), and convincing-but-wrong logic wearing a reassuring comment. The diff said it was adding delete. It quietly turned a loud failure into a silent lie.
That's the whole lesson in one hunk. The feature works. The trap is in the part the description didn't mention and you didn't run.
How to actually read the diff
Mechanically, you want the change as one reviewable unit, separate from the chat you generated it in: git diff main..feature-branch in the terminal, or the PR page on your host (which gives you the same diff plus line comments and CI results). The content of the review is the same either way. The pass goes in this order:
- State the request in one sentence. That's your scope yardstick. If it answers an issue, that's your sentence.
- Read the diff, not the AI's summary. The summary is what it intended. The diff is what it did. Only the diff is real.
- Scope check. Every hunk maps to the request. Flag everything that doesn't.
- Deletions first. Read every
-line and ask what behavior just left the codebase. - Verify the unfamiliar. Every API, flag, and key you don't personally know exists: check it.
- Trace one real call, including a failure case. Not the happy path. The bad input.
- Decide. Approve only if you can explain every hunk. Otherwise request changes.
That last point is the whole posture: a diff is guilty until proven correct. "It runs" is the weakest evidence there is; the traps above are designed to run.
The AI angle
Every other tool in this course gets more valuable because of AI. This is the one module where the human stays in the loop on purpose, and it's worth being precise about why.
The thing AI is best at (fluent, confident, well-structured output) is precisely the thing that defeats the review reflex you built reviewing humans. You learned to trust clean code and distrust messy code; AI produces uniformly clean code regardless of correctness, so that heuristic now points the wrong way. Reviewing AI diffs means consciously overriding an instinct that served you well for years.
And the volume cuts against you. AI makes generating a 300-line PR almost free, which quietly moves the bottleneck from writing to reviewing, and tempts everyone to review at the speed they generate. The whole economics of a team now hinge on review being the gate that writing no longer is. The fluent-but-wrong line costs nothing to produce and everything to miss.
Where it breaks (because I like to be honest)
A few caveats, because I'd rather you trust me than oversell you:
- A checklist is a floor, not a ceiling. It reliably catches the characteristic traps. It will not catch a deep logic error that needs you to understand the whole system. Reviewing an isolated diff in code you don't know is a harder case (a later module's problem).
- Tests catch what review misses, and vice versa. This is human review; it pairs with testing and CI, not replaces them. The trap in that lab passes a casual run and would pass a test suite that only tests the happy path. Review is what notices the test you should have written.
- Review fatigue is real, and AI makes it worse. Twenty fluent PRs in a day will wear down the exact attention this skill needs, and a rubber-stamped review is worse than none; it launders the change as "reviewed." The mitigation is small PRs. A change too big to review honestly should be sent back to be split, not skimmed.
- You can't review what you don't understand. If a diff uses a corner of the language you don't know, "looks fine" isn't a review. Verify it, or pull in someone who can. "I'm not qualified to approve this" is a valid and honest result.
You're done when
"It runs" stops feeling like sufficient evidence, and "I read every - line" starts feeling mandatory. You can name the four traps from memory (invented APIs, silent scope creep, deleted edge-case handling, convincing-but-wrong logic) and you treat every diff as guilty until proven correct. That's the skill.
Next up, I take this review gate and wire it into the full collaboration loop, issue to branch to PR to review to merge, with both humans and agents as contributors. The gate you just learned is what makes letting an agent open PRs survivable.
If you've been burned by a clean-looking AI diff that turned out to be quietly wrong: I want to hear that story. Drop it in the comments. I read them, and the traps you've hit are exactly what makes this lesson sharper.