fix(M7-27+capstone): apply AI-drives-git reframe, lesson=theory, de-slop course-wide

Phase 2 sweep — all modules are post-pivot, so the learner directs the AI agent
(Claude Code as the worked example) to do the git/setup work and verifies, instead
of typing commands by hand; no re-teaching basics. Lesson sections are theory with
example output; all execution lives in the labs. De-slopped ("prose" etc. gone
course-wide, em-dash density thinned). /path/to placeholders -> ~/ai-workflow-course.

Every deliberate teaching device verified intact: M10 ai-change.patch trap,
M12 bad-clear-snippet, M13/M27 planted pending_count bug, M15 secret+typosquat+MD5,
M18 BREAK=1, M21 absent-.gitignore, M22 poisoned skill, M24 no-op patch, M25 --simulate.
Labs compile/parse (py/sh/yaml/json); no junk.

Closes #83
Closes #86
Closes #89

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
2026-06-22 21:58:17 -04:00
parent a29823f4b3
commit f925fd9645
38 changed files with 1735 additions and 1424 deletions
+55 -44
@@ -1,8 +1,8 @@
# Module 13 — Testing in the AI Era
> **AI writes code that looks right and passes a human skim — that's exactly the code that needs a
> test.** The happy turn: the same AI that produces the risk is excellent at writing the tests that
> catch it, once you know how to direct it.
> **AI writes code that looks right and passes a human skim. That's exactly the code that needs a
> test.** The same AI that produces the risk is excellent at writing the tests that catch it, once
> you know how to direct it.
---
@@ -15,7 +15,7 @@
This module is the automated, repeatable version of that same instinct: a test reviews the code for
you, the same way, every time.
You can parachute in here with only Modules 1–2 if you must — you'll have the app and version control,
You can parachute in here with only Modules 1–2 if you must. You'll have the app and version control,
which is enough to do the lab. But the payoff lands hardest if you've already felt the review problem
from Module 10, because a test is how you stop reviewing the same thing by hand forever.
@@ -55,7 +55,7 @@ manual version is the same problem copy-paste had in Module 1: it doesn't scale
across time. You can't re-run "eyeball every command" on every change, so you don't, so regressions
slip in. An automated test is that same check, written down once and run forever for free.
Python ships a test framework in the standard library `unittest` so there is nothing to install.
Python ships a test framework in the standard library, `unittest`, so there is nothing to install.
A test is a method whose name starts with `test_`, living in a class that subclasses
`unittest.TestCase`, using assertion methods to state expectations:
@@ -71,19 +71,26 @@ class TestTaskList(unittest.TestCase):
self.assertEqual(tl.tasks[0].title, "write the tests")
```
Run the whole suite from the project folder:
The whole suite runs from the project folder with a single command: `python -m unittest`
auto-discovers files named `test_*.py`, and `-v` prints each test name and its result. A verbose run
looks like:
```bash
python -m unittest # auto-discovers files named test_*.py
python -m unittest -v # verbose: prints each test name and pass/fail
```text
$ python -m unittest -v
test_add_appends_a_task (test_tasks.TestTaskList) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.000s
OK
```
A passing run ends in `OK`. A failing one ends in `FAILED (failures=1)` and shows you the line, the
A passing run ends in `OK`. A failing one ends in `FAILED (failures=1)` and shows the line, the
expected value, and the actual value. That diff between *expected* and *actual* is the entire value
of the thing.
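To see what that expected-versus-actual report looks like without the lab app, here is a minimal sketch. The values `2` and `1` are made up for illustration (pretend the code returned 2 where the spec says 1), and the helper names `make_failing_case` and `run_and_capture` are not part of the course files.

```python
import io
import unittest


def make_failing_case():
    """Build a TestCase whose single test fails on purpose."""

    class ExpectedVsActual(unittest.TestCase):
        def test_one_task_still_pending(self):
            actual = 2    # hypothetical: what the code under test returned
            expected = 1  # what the spec says it should return
            self.assertEqual(actual, expected)

    return ExpectedVsActual


def run_and_capture():
    """Run the case above; return the TestResult and the runner's report text."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_failing_case())
    stream = io.StringIO()
    result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return result, stream.getvalue()


if __name__ == "__main__":
    result, report = run_and_capture()
    print(report)
```

The printed report includes the traceback line `AssertionError: 2 != 1` and ends with `FAILED (failures=1)`. The two numbers in that line are the whole diagnostic: what the code did against what you said it should do.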
> A note on `unittest` vs `pytest`. The wider Python world mostly uses `pytest`, which is terser
> (plain `assert`, no class boilerplate) and genuinely nicer — but it's a third-party install. We use
> (plain `assert`, no class boilerplate) and nicer to use, but it's a third-party install. We use
> `unittest` here so the lab runs on a clean machine with zero dependencies and the test file is
> something you can drop into CI in Module 14 without a `pip install` step first. Everything you learn
> transfers directly; if your team standardizes on `pytest` later, the *thinking* is identical and the
@@ -99,24 +106,23 @@ human skim — because "looks like correct code" is close to what it was trained
and the surface gives you almost no signal about which.
This is the exact trap from Module 10's review skill, sharpened. When you review human code, sloppy
code looks sloppy odd naming, weird structure, obvious gaps and the look is a useful tripwire.
code looks sloppy (odd naming, weird structure, obvious gaps), and the look is a useful tripwire.
AI code removes that tripwire. The buggy version and the correct version look equally clean. You can
read a wrong implementation three times and approve it, because nothing about it *looks* wrong.
A test doesn't read the code. It *runs* the code and checks the result. It is immune to plausibility.
That immunity is precisely what AI-assisted work needs more of, because the one signal you used to
rely on "does this look right?" has been actively defeated.
rely on, "does this look right?", has been actively defeated.
### The happy fact: AI is excellent at writing tests
### AI is excellent at writing tests
Now the good news, and it's genuinely good. Writing tests is the chore that keeps most people from
having a real suite — it's tedious, it's not the feature, it's easy to skip. AI removes that excuse
almost entirely. Describe the code and the behavior you care about, and a competent model will
Writing tests is the chore that keeps most people from having a real suite: it's tedious, it's not
the feature, it's easy to skip. AI removes that excuse almost entirely. Describe the code and the behavior you care about, and a competent model will
produce a solid first draft of a test suite faster than you could write the boilerplate: it knows
`unittest`, it'll cover the obvious cases, set up fixtures, and name the tests sensibly.
So the economics flip. The thing that was too tedious to do consistently is now cheap. The remaining
skill isn't *writing* tests it's *directing* the AI to write the right ones, and knowing how to
The economics change. The thing that was too tedious to do consistently is now cheap. The remaining
skill isn't *writing* tests; it's *directing* the AI to write the right ones, and knowing how to
tell a good test from a worthless one. Which brings us to the trap.
### The trap: tests that assert current behavior instead of intent
@@ -134,7 +140,7 @@ paper trail.
The fix is a discipline, and it's the whole craft of testing in one sentence:
> **A test must encode intent what the code is *for* derived from the spec, not from the
> **A test must encode intent (what the code is *for*), derived from the spec, not from the
> implementation.**
Concretely, that changes how you direct the AI. Don't say "write tests for `pending_count`." Say
@@ -147,11 +153,11 @@ Concretely, that changes how you direct the AI. Don't say "write tests for `pend
count; all done returns 0. Derive the expected values from that description, not from the current
implementation."*
The second prompt does something the first can't: it describes a case *after completing some*
The second prompt does something the first can't: it describes a case (*after completing some*)
where a buggy implementation and a correct one give *different* answers. A tautological test only
ever exercises the case where they happen to agree. **The intent test is the one that can fail, and a
test that can't fail isn't testing anything.** Your job when reviewing AI-written tests is to ask of
each one: *if the code were wrong, would this test notice?* If the answer is no, it's decoration.
each one: *if the code were wrong, would this test notice?* If the answer is no, the test is worthless.
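The contrast between a tautological test and an intent test can be sketched in a few lines. This is an illustration under assumptions, not the lab's code: `BuggyTaskList` and `FixedTaskList` are hypothetical stand-ins for the real `TaskList`, and `make_tests` and `failed_test_names` are helper names invented here.

```python
import io
import unittest


class BuggyTaskList:
    """Hypothetical stand-in for the lab's TaskList, with a counting bug."""

    def __init__(self):
        self.done_flags = []  # one bool per task: True once completed

    def add(self, title):
        self.done_flags.append(False)

    def complete(self, index):
        self.done_flags[index] = True

    def pending_count(self):
        return len(self.done_flags)  # bug: counts every task, done or not


class FixedTaskList(BuggyTaskList):
    def pending_count(self):
        return sum(1 for done in self.done_flags if not done)


def make_tests(task_list_cls):
    """Build a TestCase bound to one implementation."""

    class PendingCountTests(unittest.TestCase):
        def test_tautology_add_only(self):
            # Nothing is completed, so total == pending: the bug is invisible.
            tl = task_list_cls()
            tl.add("a")
            tl.add("b")
            self.assertEqual(tl.pending_count(), 2)

        def test_intent_after_completing_one(self):
            # From the spec: two added, one completed, so one is pending.
            tl = task_list_cls()
            tl.add("a")
            tl.add("b")
            tl.complete(0)
            self.assertEqual(tl.pending_count(), 1)

    return PendingCountTests


def failed_test_names(task_list_cls):
    """Run both tests against one implementation; return the names that failed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_tests(task_list_cls))
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return sorted(test.id().rsplit(".", 1)[-1] for test, _ in result.failures)


if __name__ == "__main__":
    print("buggy:", failed_test_names(BuggyTaskList))
    print("fixed:", failed_test_names(FixedTaskList))
```

Against the buggy class only `test_intent_after_completing_one` fails; against the fixed class nothing does. The add-only test passes in both runs, so it cannot tell the two implementations apart.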
This is also why you write the test against the *spec*, even when the AI wrote both the code and the
tests. If you let the same source produce both, they agree by construction and verify nothing. The
@@ -181,7 +187,7 @@ Generic testing courses teach assertions and frameworks. What's specific to AI-a
verify behavior, which is the thing the surface no longer tells you.
- **AI is also what makes a real test suite finally affordable.** The boilerplate that used to make
testing a discipline you skipped is now nearly free to generate. The barrier moves from "writing
tests is tedious" to "directing and judging tests is a skill" a much better place for the barrier
tests is tedious" to "directing and judging tests is a skill," a much better place for the barrier
to be.
- **The danger is letting the same AI close the loop on itself.** AI writes the code, then AI writes
tests *from that code*, the tests pass, and you've certified a bug. The discipline that breaks the
@@ -189,7 +195,7 @@ Generic testing courses teach assertions and frameworks. What's specific to AI-a
that, so the test can disagree with the code. A test that can't disagree with the code is theater.
The reflex to build: when an AI hands you code *and* tests, review the tests first, and review them by
asking "would this fail if the code were wrong?" not "do these pass?" Passing is the easy part.
asking "would this fail if the code were wrong?", not "do these pass?" Passing is the easy part.
Passing for the right reason is the skill.
---
@@ -205,12 +211,14 @@ to catch a bug that has been sitting in the code looking perfectly fine.
**You'll need:**
- Python 3.10+ and a terminal.
- The lab copy of the app in this module's `lab/tasks-app/` (`tasks.py`, `cli.py`). It's the
Module 1/2 app plus a `count` command — and a planted bug. Copy it somewhere to work in, or use
- The lab copy of the app at
`~/ai-workflow-course/modules/13-testing-in-the-ai-era/lab/tasks-app/` (`tasks.py`, `cli.py`).
It's the Module 1/2 app plus a `count` command, and a planted bug. Have Claude Code copy it to a
working directory (`~/ai-workflow-course/work/tasks-app/`) and confirm both files landed. Alternatively,
use your own `tasks-app` if it has a `count` command (see note in step 6).
- Your AI assistant. By now you may be running it editor-integrated (Module 4); browser chat is fine
too — paste `tasks.py` in when asked.
- Git initialized in your working copy (Module 2), so you can commit the test file at the end.
- Claude Code running in your editor or terminal (Module 4), with file access to the working copy.
Substitute your own agent if you prefer (check the install with `claude --version`, or your agent's
equivalent).
- Git initialized in your working copy (Module 2), so the agent can commit the test file at the end.
### Part A — Write and run a first test by hand
@@ -243,20 +251,20 @@ Do this once yourself so the tool isn't magic. From inside your working copy of
### Part B — Direct the AI to write tests that encode intent
3. Now hand the AI the job, but direct it properly. Give it `tasks.py` and a prompt that supplies
**intent**, not just "write tests." Something like:
3. Now hand Claude Code the job, but direct it properly. Point it at `tasks.py` with a prompt that
supplies **intent**, not just "write tests." Something like:
> "Here is `tasks.py`. Write a `unittest` test suite in `test_tasks.py` covering `add`,
> "Look at `tasks.py`. Write a `unittest` test suite in `test_tasks.py` covering `add`,
> `complete`, `pending`, and `pending_count`. For `pending_count`, the intended behavior is: it
> returns the number of tasks that are *not done*. Cover these cases and derive the expected
> numbers from that description, not from the current code: (a) empty list → 0; (b) two added,
> none completed → 2; (c) two added, one completed → 1; (d) one added then completed → 0."
Note what you did: you described a case *one completed* where a correct `pending_count` and a
Note what you did: you described a case (*one completed*) where a correct `pending_count` and a
wrong one give different answers. That's the case that can catch a bug.
4. Put the AI's `test_tasks.py` next to `tasks.py`. **Review it before running it** — this is the
Module 10 skill applied to tests. For each test ask: *if `pending_count` were wrong, would this
4. Claude Code writes `test_tasks.py` next to `tasks.py`. **Review it before running it** — this is
the Module 10 skill applied to tests. For each test ask: *if `pending_count` were wrong, would this
one notice?* A test that only ever adds tasks (never completes one) would pass no matter what
`pending_count` returns, because with nothing done, total and pending are the same number. That
test is a tautology; the "one completed" test is the one with teeth.
@@ -279,7 +287,7 @@ Do this once yourself so the tool isn't magic. From inside your working copy of
```
There's the bug. It "worked" in every quick manual check because nobody ran `count` *after*
completing a task the one case where total and pending diverge. It passes a human skim. It does
completing a task, the one case where total and pending diverge. It passes a human skim. It does
not pass a test that encodes intent.
6. **Fix the code, not the test.** The test is correct; the code is wrong. Change it to honor the
@@ -299,15 +307,18 @@ Do this once yourself so the tool isn't magic. From inside your working copy of
> to `len(self.tasks)`, confirm an intent-encoding test goes red, then fix it. The muscle is
> "write the test that would have caught this," and you build it by watching it catch something.
7. Commit the test file — this is the artifact Module 14 will automate:
7. Commit the test file. This is the artifact Module 14 will automate. Tell Claude Code to stage
`tasks.py` and `test_tasks.py` and commit them with a message describing the test addition and the
`pending_count` fix. Before it commits, check the staged diff and the proposed message yourself. It
should have staged exactly those two files, with a commit message equivalent to:
```bash
git add tasks.py test_tasks.py
git commit -m "Add tests for TaskList; fix pending_count to count only pending"
```text
Add tests for TaskList; fix pending_count to count only pending
```
A reference suite (including the tautology-vs-intent contrast spelled out) is in
`lab/solution/reference_test_tasks.py` — compare against it *after* you've written your own.
`~/ai-workflow-course/modules/13-testing-in-the-ai-era/lab/solution/reference_test_tasks.py`. Compare
against it *after* you've written your own.
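The red-then-green check from the step 6 note (plant `len(self.tasks)`, watch an intent test fail, restore the fix) can also be sketched programmatically. Everything below is an assumption for illustration: this `Task`/`TaskList` pair is a hypothetical, already-correct stand-in for the lab app, and `planted_bug`, `suite_passes`, and `red_then_green` are names invented here.

```python
import io
import unittest
from unittest import mock


class Task:
    def __init__(self, title):
        self.title = title
        self.done = False


class TaskList:
    """Hypothetical, already-correct stand-in for the lab's TaskList."""

    def __init__(self):
        self.tasks = []

    def add(self, title):
        self.tasks.append(Task(title))

    def complete(self, index):
        self.tasks[index].done = True

    def pending_count(self):
        return sum(1 for task in self.tasks if not task.done)


def make_intent_test():
    class IntentTest(unittest.TestCase):
        def test_one_of_two_completed_leaves_one_pending(self):
            tl = TaskList()
            tl.add("a")
            tl.add("b")
            tl.complete(0)
            self.assertEqual(tl.pending_count(), 1)

    return IntentTest


def suite_passes():
    """Run the intent test quietly; return True only if it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_intent_test())
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite).wasSuccessful()


def planted_bug(self):
    return len(self.tasks)  # the bug from the lab note: counts done tasks too


def red_then_green():
    """Return (passes_with_bug_planted, passes_with_correct_code)."""
    with mock.patch.object(TaskList, "pending_count", planted_bug):
        with_bug = suite_passes()
    without_bug = suite_passes()
    return with_bug, without_bug


if __name__ == "__main__":
    print(red_then_green())
```

`red_then_green()` returns `(False, True)`: the test goes red while the bug is planted and green once the patch is lifted. A test that stayed green in both runs would be one that cannot notice the bug.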
---
@@ -320,7 +331,7 @@ The honest limits, because a green suite invites overconfidence:
code, includes the edge cases the model also didn't think about. Tests narrow risk; they don't
eliminate it. "All tests pass" is not "the code is correct."
- **Tests written from the implementation are worse than no tests.** A suite that locks in current
behavior gives you false confidence with a paper trail the worst combination. The whole module
behavior gives you false confidence with a paper trail, the worst combination. The whole module
hinges on intent coming from *you*, not from the code the AI just wrote. If you ever let the same
AI write both code and tests with no spec from you, assume the tests verify nothing until you've
checked each one against intent.
@@ -331,8 +342,8 @@ The honest limits, because a green suite invites overconfidence:
- **Not everything is a unit test.** The `tasks-app` is pure logic, which is the easy case. Code that
hits a database, a network, the filesystem, or an external service needs more setup (fixtures,
fakes, integration tests) than this module covers. The thinking transfers; the mechanics get
heavier, and that's a deliberately out-of-scope rabbit hole here.
- **A test suite is code too and the AI wrote it.** Tests can have bugs, including the silent kind
heavier, and that's out of scope here.
- **A test suite is code too, and the AI wrote it.** Tests can have bugs, including the silent kind
that always pass. Reviewing tests is as real a task as reviewing code, which is exactly why Part B
has you read them before trusting them.
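One concrete shape of "the silent kind that always pass" is sketched below, as an assumption-laden illustration rather than anything from the lab files. The helper `pending_count_buggy` is hypothetical. The first test misuses `assertTrue`, whose second argument is a failure message, not an expected value.

```python
import io
import unittest


def pending_count_buggy(done_flags):
    """Hypothetical buggy helper: counts every task, done or not."""
    return len(done_flags)


def make_tests():
    class SilentVersusReal(unittest.TestCase):
        def test_always_passes(self):
            # Bug in the test: assertTrue only checks truthiness. The 1 is
            # taken as the failure message, so any nonzero count passes.
            self.assertTrue(pending_count_buggy([True, False]), 1)

        def test_can_fail(self):
            # One of two tasks is done, so the spec says 1 is pending.
            self.assertEqual(pending_count_buggy([True, False]), 1)

    return SilentVersusReal


def failed_test_names():
    """Run both tests; return the names of the ones that failed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_tests())
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return sorted(test.id().rsplit(".", 1)[-1] for test, _ in result.failures)


if __name__ == "__main__":
    print(failed_test_names())
```

Only `test_can_fail` reports a failure. `test_always_passes` looks like it checks the same thing and stays green against the buggy helper, which is the kind of test a review should catch before you trust the suite.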