332 lines
18 KiB
Markdown
332 lines
18 KiB
Markdown
# Module 10 — Reviewing Code You Didn't Write
|
||
|
||
> **The AI wrote a diff that reads beautifully and is wrong in one line you'll skim right past.**
|
||
> Reviewing for *plausibility traps* — not just bugs — is the highest-leverage, least-taught skill
|
||
> in this whole space. This module gives you a gate to run it at and a checklist to run.
|
||
|
||
---
|
||
|
||
## Prerequisites
|
||
|
||
- **Module 2 — Version Control as a Safety Net.** You read changes with `git diff`. This module
|
||
turns that one-off habit into a disciplined review pass over a whole change.
|
||
- **Module 8 — Remotes and Hosting.** Your repo lives on a host now, and a change arrives as a
|
||
*pull request* (GitHub/Gitea/Forgejo) or *merge request* (GitLab) — same thing, different name.
|
||
We'll write "PR" throughout; it's the unit of review.
|
||
- **Module 9 — Issues and the Task Layer** (helpful, not required). A PR usually answers an issue;
|
||
the issue is the "what I asked for" you review the diff against.
|
||
|
||
If you only have Modules 1–2, you can still do the core skill of this module locally — reviewing a
|
||
diff between two branches with `git diff` — and skip the part where you open it as a PR on a host.
|
||
|
||
---
|
||
|
||
## Learning objectives
|
||
|
||
By the end of this module you can:
|
||
|
||
1. Use a pull request as a **review gate**: nothing reaches the main branch without passing through
|
||
a diff someone (or something) signed off on — even on a solo repo.
|
||
2. Read an AI-generated diff the right way: against the request, deletions first, the diff over the
|
||
AI's own description of it.
|
||
3. Name and spot the four **plausibility traps** — invented APIs, silent scope creep, deleted
|
||
edge-case handling, and convincing-but-wrong logic — that pass a human skim and a quick run.
|
||
4. Run a repeatable **AI-diff review checklist** and end every review with an explicit
|
||
*approve* / *request changes* decision you can defend.
|
||
|
||
---
|
||
|
||
## Key concepts
|
||
|
||
### The gate, not the formality
|
||
|
||
A pull request proposes merging a branch into another (usually `main`) and pauses there so the
|
||
change can be looked at *before* it lands. On a team that pause is where review happens. The trap
|
||
is treating it as a rubber stamp — "looks good, merge" — which is exactly how bad changes get the
|
||
institutional blessing of "it was reviewed."
|
||
|
||
Reframe it the way you already think about change control: **a PR is a change gate, and merge is a
|
||
one-way door.** Once it's on `main`, it's in everyone's next clone, in CI, on its way to a deploy.
|
||
The cheapest place to catch a problem is in the diff, before the door closes. You can recover after
|
||
(that's Module 12), but recovery is always more expensive than the review you skipped.
|
||
|
||
This holds **even when you're the only human on the repo.** That's not bureaucracy for its own
|
||
sake — the syllabus's own course repo opens a PR for every module for exactly two reasons that
|
||
apply to you solo:
|
||
|
||
- **Traceability.** The PR is a durable record of *what changed and why*, linked to the issue it
|
||
answers. `git log` tells you the change happened; the PR tells you the reasoning, the discussion,
|
||
and what was rejected.
|
||
- **A forced read.** Opening the PR makes you look at the *whole* change as one diff, away from the
|
||
editor you wrote it in. That context switch is where you catch the thing you were too close to
|
||
see while generating it.
|
||
|
||
When the author is an AI, both reasons get sharper. The AI produced the change with total
|
||
confidence and no memory of why; the PR is where a human supplies the judgment and the record the
|
||
AI can't.
|
||
|
||
### Why this is a genuinely new skill
|
||
|
||
You already know how to review human code. Reviewing AI code is *not the same activity*, and
|
||
assuming it is gets people burned.
|
||
|
||
When a human writes a function, the bugs cluster where the human was uncertain — the gnarly edge,
|
||
the bit they rushed, the TODO they meant to come back to. You can often *feel* the soft spots, and
|
||
the code's roughness is a signal: confusing code is suspicious code.
|
||
|
||
AI output inverts that signal. It is **uniformly fluent.** The variable names are good, the
|
||
structure is clean, the comment above the broken line confidently states the correct intention,
|
||
and the one wrong line looks exactly as polished as the forty right ones. The fluency is constant;
|
||
the correctness is not — and your eye has spent a career using fluency as a proxy for correctness.
|
||
That proxy is now actively misleading.
|
||
|
||
So the question shifts. With human code you mostly ask *"is this good code?"* With AI code you have
|
||
to ask *"is this code true?"* — does it do what it claims, against the request I actually made,
|
||
using things that actually exist. That's reviewing for **plausibility traps**: code engineered (by
|
||
a process optimizing for plausible-looking output) to pass exactly the skim you're tempted to give
|
||
it.
|
||
|
||
### The four plausibility traps
|
||
|
||
These are the failure modes to hunt for specifically. They're not random bugs; they're the
|
||
characteristic ways fluent-but-untrue code goes wrong.
|
||
|
||
**1. Invented APIs.** The model reaches for a function, method, keyword argument, flag, config key,
|
||
or endpoint that *should* exist by analogy — and doesn't, or exists with a different signature.
|
||
It's the same generative move behind hallucinated package names (the supply-chain version of this
|
||
gets its own treatment in Module 15). The tell is that it reads *more* natural than the real API,
|
||
because it was generated to be plausible rather than recalled from docs. Classic shape: assuming
|
||
`list.pop(i, default)` works because `dict.pop(k, default)` does. Verify every unfamiliar
|
||
symbol against real docs or source — confidence in the surrounding prose is not evidence.
|
||
|
||
**2. Silent scope creep.** You asked for one thing; the diff does that thing *and* quietly
|
||
"improves" three others it was never asked to touch — reformatting a file, reshuffling imports,
|
||
renaming a variable across the module, "simplifying" an unrelated function. Each extra edit is an
|
||
unrequested change you now have to review with no stated intent behind it, and it's where
|
||
regressions hide. The discipline: **every hunk must trace back to the request.** Anything that
|
||
doesn't is guilty until proven innocent, and the right move is often "take it out and do it in its
|
||
own PR."
|
||
|
||
**3. Deleted edge-case handling.** The most dangerous trap, because it lives in the `-` lines you
|
||
skim. While implementing the feature, the model drops a bounds check, removes a `None` guard,
|
||
collapses a `try/except` into the happy path, or — worst — *replaces a real error with a silent
|
||
swallow* (`except: pass`) under the banner of "making it robust." The code now looks cleaner and
|
||
passes every test you'd casually run, because you'd test the path that works. The bad input that
|
||
the deleted guard existed to catch now fails silently. **Read every deletion. Deletions are where
|
||
behavior disappears.**
|
||
|
||
**4. Convincing-but-wrong logic.** An inverted condition (`if not x` where it meant `if x`), an
|
||
off-by-one, `<` where it meant `<=`, `and` where it meant `or`, a filter quietly dropped from a
|
||
comprehension. On the happy path it often produces a believable-enough result, and the comment
|
||
above it cheerfully describes the *correct* behavior — so the comment actively vouches for the bug.
|
||
The defense is to **trace one real call through the changed code yourself** instead of trusting the
|
||
narration.
|
||
|
||
A real AI diff usually has *most lines correct* and one trap buried in legitimate work — which is
|
||
what makes it dangerous. The feature genuinely works when you try it; the trap is somewhere you
|
||
didn't look.
|
||
|
||
### How to actually read the diff
|
||
|
||
Mechanics first. You want the change as one reviewable unit, separate from the code you wrote it in:
|
||
|
||
```bash
|
||
git fetch # get the branch the PR is built from
|
||
git diff main..feature-branch # the whole change, as one diff
|
||
```
|
||
|
||
On your host's PR page you get the same diff with line comments, file-by-file navigation, and the
|
||
CI results attached — use it. But the content of the review is the same whether you read it in the
|
||
browser or the terminal.
|
||
|
||
Then run the pass in this order (the full version is in
|
||
[`lab/ai-diff-review-checklist.md`](lab/ai-diff-review-checklist.md) — keep it open while you work):
|
||
|
||
1. **State the request in one sentence.** This is your scope yardstick. If it answers an issue
|
||
(Module 9), that's your sentence.
|
||
2. **Read the diff, not the AI's summary.** The summary tells you what it *intended*; the diff is
|
||
what it *did*. Only the diff is real.
|
||
3. **Scope check.** Every hunk maps to the request. Flag everything that doesn't.
|
||
4. **Deletions first.** Read every `-` line and ask what behavior just left the codebase.
|
||
5. **Verify the unfamiliar.** Every API, flag, and key you don't personally know exists —
|
||
check it.
|
||
6. **Trace one real call**, including a failure case. Not the happy path — the bad input.
|
||
7. **Decide.** Approve only if you can explain every hunk. Otherwise request changes. The burden of
|
||
proof is on the diff, not on you.
|
||
|
||
That last point is the whole posture: **a diff is guilty until proven correct.** "It runs" is the
|
||
weakest evidence there is — the traps above are *designed* to run.
|
||
|
||
---
|
||
|
||
## The AI angle
|
||
|
||
Every other module here makes a tool more valuable because of AI. This module is the one where the
|
||
*human stays in the loop on purpose*, and it's worth being precise about why.
|
||
|
||
The thing AI is best at — producing fluent, confident, well-structured output — is precisely the
|
||
thing that defeats the review reflex you built reviewing humans. You learned to trust clean code
|
||
and distrust messy code; AI produces uniformly clean code regardless of whether it's correct, so
|
||
that heuristic now points the wrong way. Reviewing AI diffs means consciously *overriding* an
|
||
instinct that served you well for years.
|
||
|
||
And the volume cuts against you. AI makes generating a 300-line PR almost free, which quietly
|
||
shifts the bottleneck from *writing* to *reviewing* — and tempts everyone to review at the speed
|
||
they generate. The economics of the team now hinge on review being the gate that writing no longer
|
||
is. The fluent-but-wrong line costs nothing to produce and everything to miss.
|
||
|
||
This is the human half of a loop you'll keep building. Module 11 wires this review gate into the
|
||
full issue → branch → PR → review → merge motion with humans *and* agents as contributors. Much
|
||
later, Module 24 looks at AI *reviewers* that comment on PRs automatically — but an automated
|
||
reviewer is an assistant to this skill, not a replacement for it. You can't supervise a review bot
|
||
you couldn't do yourself.
|
||
|
||
---
|
||
|
||
## Hands-on lab
|
||
|
||
**Lab language:** shell + the Python `tasks-app`. You won't write Python; you'll open a PR for a
|
||
real change, then review a diff the "AI" produced and catch the trap planted in it.
|
||
|
||
**You'll need:**
|
||
|
||
- Git, Python 3.10+, and your AI assistant.
|
||
- The starter base app in [`lab/tasks-app/`](lab/tasks-app/) (`tasks.py`, `cli.py`). It's the
|
||
Module 1/2 app with one addition: `complete()` validates the index and `done` turns a bad index
|
||
into a clean error. Note that behavior — the trap will mess with it.
|
||
- The planted AI change in [`lab/ai-change.patch`](lab/ai-change.patch).
|
||
- The review checklist in [`lab/ai-diff-review-checklist.md`](lab/ai-diff-review-checklist.md).
|
||
- **Optional (Part A as a real PR):** the repo you pushed to a host in Module 8. If you don't have
|
||
one, do Part A locally as a branch — the review skill in Parts B–C is identical either way.
|
||
|
||
### Part A — Open a PR as a gate
|
||
|
||
1. Set up the base app as a repo and confirm its baseline behavior. This `review-lab` is a
|
||
throwaway repo *separate* from the `tasks-app` you've built up across earlier modules — you can
|
||
delete it when you're done, and nothing here touches your main app. (Use your real course path in
|
||
place of `/path/to/`, the same copy-it-in move from Module 5.)
|
||
|
||
```bash
|
||
mkdir -p ~/workflow-course/review-lab && cd ~/workflow-course/review-lab
|
||
cp /path/to/modules/10-reviewing-code-you-didnt-write/lab/tasks-app/*.py .
|
||
printf 'tasks.json\n__pycache__/\n' > .gitignore # keep generated runtime state out of your review diffs (Module 2)
|
||
git init -qb main && git add . && git commit -qm "base: tasks-app" # -b main so the git switch main / git diff main.. steps below resolve
|
||
|
||
python cli.py add "write the review module"
|
||
python cli.py done 99 # baseline: prints "error: no task at index 99", exits non-zero
|
||
echo "exit code: $?"
|
||
```
|
||
|
||
Remember that last result. A bad index is a clean, loud error today.
|
||
|
||
2. Make a small honest change of your own on a branch — ask your AI for a one-line tweak, e.g.
|
||
*"make the empty-list message say '(nothing to do)' instead of '(no tasks yet)'"* — apply it,
|
||
commit it, and open it as a PR:
|
||
|
||
```bash
|
||
git switch -c tweak-empty-message
|
||
# apply the AI's one-line change to tasks.py, then:
|
||
git add . && git commit -m "Friendlier empty-list message"
|
||
```
|
||
|
||
If you have a Module 8 remote: `git push -u origin tweak-empty-message`, then open the PR on
|
||
your host and read your own diff in the PR view. If you're local-only:
|
||
`git diff main..tweak-empty-message`. Either way, **review your own one-line change as a diff
|
||
before merging it.** Get used to the gate on a trivial change so it's a reflex on a dangerous
|
||
one. Merge it when you're satisfied (`git switch main && git merge tweak-empty-message`).
|
||
|
||
### Part B — Review the AI's diff (the real exercise)
|
||
|
||
3. Now a teammate-who-is-an-AI has opened a PR. The prompt it was given was exactly:
|
||
**"Add a `delete <index>` command to the tasks app."** Bring its change in on its own branch.
|
||
`git apply` lays the AI's proposed change onto this branch as if it were its PR, so you can read
|
||
it before deciding whether to keep it — exactly what you'd be doing in a real PR review. (Again,
|
||
use your real course path in place of `/path/to/`.)
|
||
|
||
```bash
|
||
git switch main
|
||
git switch -c ai-delete-command
|
||
git apply /path/to/modules/10-reviewing-code-you-didnt-write/lab/ai-change.patch
|
||
git add . && git commit -m "Add delete command"
|
||
```
|
||
|
||
4. **Review it before you run it.** Open the checklist and read the diff as one unit:
|
||
|
||
```bash
|
||
git diff main..ai-delete-command
|
||
```
|
||
|
||
Work the checklist. The request was *one sentence*: add a `delete` command. Hold every hunk up
|
||
to it. Read the `-` lines. Find the line that does something the request never asked for and
|
||
that changes behavior you tested in Part A. Write down what you think the trap is *before*
|
||
step 5.
|
||
|
||
### Part C — Confirm the trap by running the failure case
|
||
|
||
5. Now verify your read by running the *failure* path, not the happy one:
|
||
|
||
```bash
|
||
python cli.py add "a real task"
|
||
python cli.py delete 0 # the requested feature: works fine on the happy path
|
||
python cli.py add "another"
|
||
python cli.py done 99 # the trap: compare this to your Part A baseline
|
||
echo "exit code: $?"
|
||
python cli.py list # did task 99 (which doesn't exist) get marked done? did anything?
|
||
```
|
||
|
||
In the base app, `done 99` was a clean error with a non-zero exit. After this "add a delete
|
||
command" change, it prints `updated` and exits `0` — silently claiming success while marking
|
||
nothing. The diff *only said* it was adding `delete`. While in the file it also rewrote
|
||
`complete()` to swallow the `IndexError` "for robustness," deleting the edge-case handling and
|
||
turning a loud failure into a silent lie. That's three traps in one small hunk: **scope creep**
|
||
(it touched `complete`, which the request never mentioned), **deleted edge-case handling**, and
|
||
**convincing-but-wrong logic** wearing a reassuring comment.
|
||
|
||
6. Play it out. On your host's PR you'd leave a line comment on the `complete()` hunk —
|
||
*"out of scope, and this swallows the error `done` relied on; please drop it"* — and **request
|
||
changes** rather than approve. The feature you were asked for was fine; the PR still doesn't
|
||
merge. That's the gate doing its job.
|
||
|
||
---
|
||
|
||
## Where it breaks
|
||
|
||
- **A checklist is a floor, not a ceiling.** It catches the characteristic traps reliably; it will
|
||
not catch a deep logic error that requires understanding the whole system. For changes in code
|
||
you don't know, reviewing the diff in isolation isn't enough — that harder case (pointing AI at
|
||
an unfamiliar codebase, and reviewing safely there) is Module 23.
|
||
- **Tests catch what review misses, and vice versa.** This module is human review; it pairs with
|
||
automated testing and CI (Modules 13–14), which catch the regressions a tired reviewer skims
|
||
past. Neither replaces the other — the trap in this lab passes a casual run *and* would pass a
|
||
test suite that only tests the happy path. Review is what notices the test you *should* have.
|
||
- **Review fatigue is real and AI makes it worse.** Twenty fluent PRs in a day will wear down the
|
||
exact attention this skill needs, and a rubber-stamped review is worse than none because it
|
||
launders the change as "reviewed." Smaller PRs are the mitigation: insist the AI's changes stay
|
||
small and single-purpose so each one is reviewable in full. A PR too big to review honestly
|
||
should be sent back to be split, not skimmed.
|
||
- **You can't review what you don't understand.** If a diff uses an API or a corner of the language
|
||
you don't know, "looks fine" is not a review — that's the moment to verify it exists and does
|
||
what it claims, or to pull in someone who knows. The honest output of a review is sometimes
|
||
"I'm not qualified to approve this," and that's a valid result.
|
||
|
||
---
|
||
|
||
## Check for understanding
|
||
|
||
**You're done when:**
|
||
|
||
- You've opened (or branched) a change and reviewed it as a diff *before* merging — the gate is a
|
||
reflex, even on a one-liner.
|
||
- You found the planted trap in `ai-change.patch` by reading the diff against the one-sentence
|
||
request, and named *why* it's a trap (it changed `complete()`, which the request never mentioned,
|
||
and swallowed the error `done` depended on).
|
||
- You confirmed it by running the **failure** case (`done 99`) and seeing the silent `updated` +
|
||
exit `0`, instead of trusting the happy path (`delete 0`) that worked fine.
|
||
- You can name the four plausibility traps from memory — invented APIs, silent scope creep, deleted
|
||
edge-case handling, convincing-but-wrong logic — and you treat a diff as guilty until proven
|
||
correct.
|
||
|
||
When "it runs" stops feeling like sufficient evidence and "I read every `-` line" starts feeling
|
||
mandatory, you've got the skill. Module 11 takes this gate and wires it into the full collaboration
|
||
loop — issues, branches, PRs, and merges — with both humans and agents as contributors.
|