fix(M7-27+capstone): apply AI-drives-git reframe, lesson=theory, de-slop course-wide
Phase 2 sweep — all modules are post-pivot, so the learner directs the AI agent
(Claude Code as the worked example) to do the git/setup work and verifies, instead
of typing commands by hand; no re-teaching basics. Lesson sections are theory with
example output; all execution lives in the labs. De-slopped ("prose" etc. gone
course-wide, em-dash density thinned). /path/to placeholders -> ~/ai-workflow-course.
Every deliberate teaching device verified intact: M10 ai-change.patch trap,
M12 bad-clear-snippet, M13/M27 planted pending_count bug, M15 secret+typosquat+MD5,
M18 BREAK=1, M21 absent-.gitignore, M22 poisoned skill, M24 no-op patch, M25 --simulate.
Labs compile/parse (py/sh/yaml/json); no junk.
Closes #83
Closes #86
Closes #89
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
This commit is contained in:
@@ -1,8 +1,8 @@
|
||||
# Module 13 — Testing in the AI Era
|
||||
|
||||
> **AI writes code that looks right and passes a human skim — that's exactly the code that needs a
|
||||
> test.** The happy turn: the same AI that produces the risk is excellent at writing the tests that
|
||||
> catch it, once you know how to direct it.
|
||||
> **AI writes code that looks right and passes a human skim. That's exactly the code that needs a
|
||||
> test.** The same AI that produces the risk is excellent at writing the tests that catch it, once
|
||||
> you know how to direct it.
|
||||
|
||||
---
|
||||
|
||||
@@ -15,7 +15,7 @@
|
||||
This module is the automated, repeatable version of that same instinct: a test reviews the code for
|
||||
you, the same way, every time.
|
||||
|
||||
You can parachute in here with only Modules 1–2 if you must — you'll have the app and version control,
|
||||
You can parachute in here with only Modules 1–2 if you must. You'll have the app and version control,
|
||||
which is enough to do the lab. But the payoff lands hardest if you've already felt the review problem
|
||||
from Module 10, because a test is how you stop reviewing the same thing by hand forever.
|
||||
|
||||
@@ -55,7 +55,7 @@ manual version is the same problem copy-paste had in Module 1: it doesn't scale
|
||||
across time. You can't re-run "eyeball every command" on every change, so you don't, so regressions
|
||||
slip in. An automated test is that same check, written down once and run forever for free.
|
||||
|
||||
Python ships a test framework in the standard library — `unittest` — so there is nothing to install.
|
||||
Python ships a test framework in the standard library, `unittest`, so there is nothing to install.
|
||||
A test is a method whose name starts with `test_`, living in a class that subclasses
|
||||
`unittest.TestCase`, using assertion methods to state expectations:
|
||||
|
||||
@@ -71,19 +71,26 @@ class TestTaskList(unittest.TestCase):
|
||||
self.assertEqual(tl.tasks[0].title, "write the tests")
|
||||
```
|
||||
|
||||
Run the whole suite from the project folder:
|
||||
The whole suite runs from the project folder with a single command: `python -m unittest`
|
||||
auto-discovers files named `test_*.py`, and `-v` prints each test name and its result. A verbose run
|
||||
looks like:
|
||||
|
||||
```bash
|
||||
python -m unittest # auto-discovers files named test_*.py
|
||||
python -m unittest -v # verbose: prints each test name and pass/fail
|
||||
```text
|
||||
$ python -m unittest -v
|
||||
test_add_appends_a_task (test_tasks.TestTaskList) ... ok
|
||||
|
||||
----------------------------------------------------------------------
|
||||
Ran 1 test in 0.000s
|
||||
|
||||
OK
|
||||
```
|
||||
|
||||
A passing run ends in `OK`. A failing one ends in `FAILED (failures=1)` and shows you the line, the
|
||||
A passing run ends in `OK`. A failing one ends in `FAILED (failures=1)` and shows the line, the
|
||||
expected value, and the actual value. That diff between *expected* and *actual* is the entire value
|
||||
of the thing.
|
||||
|
||||
> A note on `unittest` vs `pytest`. The wider Python world mostly uses `pytest`, which is terser
|
||||
> (plain `assert`, no class boilerplate) and genuinely nicer — but it's a third-party install. We use
|
||||
> (plain `assert`, no class boilerplate) and nicer to use, but it's a third-party install. We use
|
||||
> `unittest` here so the lab runs on a clean machine with zero dependencies and the test file is
|
||||
> something you can drop into CI in Module 14 without a `pip install` step first. Everything you learn
|
||||
> transfers directly; if your team standardizes on `pytest` later, the *thinking* is identical and the
|
||||
@@ -99,24 +106,23 @@ human skim — because "looks like correct code" is close to what it was trained
|
||||
and the surface gives you almost no signal about which.
|
||||
|
||||
This is the exact trap from Module 10's review skill, sharpened. When you review human code, sloppy
|
||||
code looks sloppy — odd naming, weird structure, obvious gaps — and the look is a useful tripwire.
|
||||
code looks sloppy (odd naming, weird structure, obvious gaps), and the look is a useful tripwire.
|
||||
AI code removes that tripwire. The buggy version and the correct version look equally clean. You can
|
||||
read a wrong implementation three times and approve it, because nothing about it *looks* wrong.
|
||||
|
||||
A test doesn't read the code. It *runs* the code and checks the result. It is immune to plausibility.
|
||||
That immunity is precisely what AI-assisted work needs more of, because the one signal you used to
|
||||
rely on — "does this look right?" — has been actively defeated.
|
||||
rely on, "does this look right?", has been actively defeated.
|
||||
|
||||
### The happy fact: AI is excellent at writing tests
|
||||
### AI is excellent at writing tests
|
||||
|
||||
Now the good news, and it's genuinely good. Writing tests is the chore that keeps most people from
|
||||
having a real suite — it's tedious, it's not the feature, it's easy to skip. AI removes that excuse
|
||||
almost entirely. Describe the code and the behavior you care about, and a competent model will
|
||||
Writing tests is the chore that keeps most people from having a real suite: it's tedious, it's not
|
||||
the feature, it's easy to skip. AI removes that excuse almost entirely. Describe the code and the behavior you care about, and a competent model will
|
||||
produce a solid first draft of a test suite faster than you could write the boilerplate: it knows
|
||||
`unittest`, it'll cover the obvious cases, set up fixtures, and name the tests sensibly.
|
||||
|
||||
So the economics flip. The thing that was too tedious to do consistently is now cheap. The remaining
|
||||
skill isn't *writing* tests — it's *directing* the AI to write the right ones, and knowing how to
|
||||
The economics change. The thing that was too tedious to do consistently is now cheap. The remaining
|
||||
skill isn't *writing* tests, it's *directing* the AI to write the right ones, and knowing how to
|
||||
tell a good test from a worthless one. Which brings us to the trap.
|
||||
|
||||
### The trap: tests that assert current behavior instead of intent
|
||||
@@ -134,7 +140,7 @@ paper trail.
|
||||
|
||||
The fix is a discipline, and it's the whole craft of testing in one sentence:
|
||||
|
||||
> **A test must encode intent — what the code is *for* — derived from the spec, not from the
|
||||
> **A test must encode intent (what the code is *for*) derived from the spec, not from the
|
||||
> implementation.**
|
||||
|
||||
Concretely, that changes how you direct the AI. Don't say "write tests for `pending_count`." Say
|
||||
@@ -147,11 +153,11 @@ Concretely, that changes how you direct the AI. Don't say "write tests for `pend
|
||||
count; all done returns 0. Derive the expected values from that description, not from the current
|
||||
implementation."*
|
||||
|
||||
The second prompt does something the first can't: it describes a case — *after completing some* —
|
||||
The second prompt does something the first can't: it describes a case (*after completing some*)
|
||||
where a buggy implementation and a correct one give *different* answers. A tautological test only
|
||||
ever exercises the case where they happen to agree. **The intent test is the one that can fail, and a
|
||||
test that can't fail isn't testing anything.** Your job when reviewing AI-written tests is to ask of
|
||||
each one: *if the code were wrong, would this test notice?* If the answer is no, it's decoration.
|
||||
each one: *if the code were wrong, would this test notice?* If the answer is no, the test is worthless.
|
||||
|
||||
This is also why you write the test against the *spec*, even when the AI wrote both the code and the
|
||||
tests. If you let the same source produce both, they agree by construction and verify nothing. The
|
||||
@@ -181,7 +187,7 @@ Generic testing courses teach assertions and frameworks. What's specific to AI-a
|
||||
verify behavior, which is the thing the surface no longer tells you.
|
||||
- **AI is also what makes a real test suite finally affordable.** The boilerplate that used to make
|
||||
testing a discipline you skipped is now nearly free to generate. The barrier moves from "writing
|
||||
tests is tedious" to "directing and judging tests is a skill" — a much better place for the barrier
|
||||
tests is tedious" to "directing and judging tests is a skill," a much better place for the barrier
|
||||
to be.
|
||||
- **The danger is letting the same AI close the loop on itself.** AI writes the code, then AI writes
|
||||
tests *from that code*, the tests pass, and you've certified a bug. The discipline that breaks the
|
||||
@@ -189,7 +195,7 @@ Generic testing courses teach assertions and frameworks. What's specific to AI-a
|
||||
that, so the test can disagree with the code. A test that can't disagree with the code is theater.
|
||||
|
||||
The reflex to build: when an AI hands you code *and* tests, review the tests first, and review them by
|
||||
asking "would this fail if the code were wrong?" — not "do these pass?" Passing is the easy part.
|
||||
asking "would this fail if the code were wrong?", not "do these pass?" Passing is the easy part.
|
||||
Passing for the right reason is the skill.
|
||||
|
||||
---
|
||||
@@ -205,12 +211,14 @@ to catch a bug that has been sitting in the code looking perfectly fine.
|
||||
**You'll need:**
|
||||
|
||||
- Python 3.10+ and a terminal.
|
||||
- The lab copy of the app in this module's `lab/tasks-app/` (`tasks.py`, `cli.py`). It's the
|
||||
Module 1/2 app plus a `count` command — and a planted bug. Copy it somewhere to work in, or use
|
||||
- The lab copy of the app at
|
||||
`~/ai-workflow-course/modules/13-testing-in-the-ai-era/lab/tasks-app/` (`tasks.py`, `cli.py`).
|
||||
It's the Module 1/2 app plus a `count` command, and a planted bug. Have Claude Code copy it to a
|
||||
working directory (`~/ai-workflow-course/work/tasks-app/`) and confirm both files landed; or use
|
||||
your own `tasks-app` if it has a `count` command (see note in step 6).
|
||||
- Your AI assistant. By now you may be running it editor-integrated (Module 4); browser chat is fine
|
||||
too — paste `tasks.py` in when asked.
|
||||
- Git initialized in your working copy (Module 2), so you can commit the test file at the end.
|
||||
- Claude Code running in your editor or terminal (Module 4), with file access to the working copy.
|
||||
Sub your own agent if you prefer (`claude --version # sub your own agent`).
|
||||
- Git initialized in your working copy (Module 2), so the agent can commit the test file at the end.
|
||||
|
||||
### Part A — Write and run a first test by hand
|
||||
|
||||
@@ -243,20 +251,20 @@ Do this once yourself so the tool isn't magic. From inside your working copy of
|
||||
|
||||
### Part B — Direct the AI to write tests that encode intent
|
||||
|
||||
3. Now hand the AI the job, but direct it properly. Give it `tasks.py` and a prompt that supplies
|
||||
**intent**, not just "write tests." Something like:
|
||||
3. Now hand Claude Code the job, but direct it properly. Point it at `tasks.py` with a prompt that
|
||||
supplies **intent**, not just "write tests." Something like:
|
||||
|
||||
> "Here is `tasks.py`. Write a `unittest` test suite in `test_tasks.py` covering `add`,
|
||||
> "Look at `tasks.py`. Write a `unittest` test suite in `test_tasks.py` covering `add`,
|
||||
> `complete`, `pending`, and `pending_count`. For `pending_count`, the intended behavior is: it
|
||||
> returns the number of tasks that are *not done*. Cover these cases and derive the expected
|
||||
> numbers from that description, not from the current code: (a) empty list → 0; (b) two added,
|
||||
> none completed → 2; (c) two added, one completed → 1; (d) one added then completed → 0."
|
||||
|
||||
Note what you did: you described a case — *one completed* — where a correct `pending_count` and a
|
||||
Note what you did: you described a case (*one completed*) where a correct `pending_count` and a
|
||||
wrong one give different answers. That's the case that can catch a bug.
|
||||
|
||||
4. Put the AI's `test_tasks.py` next to `tasks.py`. **Review it before running it** — this is the
|
||||
Module 10 skill applied to tests. For each test ask: *if `pending_count` were wrong, would this
|
||||
4. Claude Code writes `test_tasks.py` next to `tasks.py`. **Review it before running it** — this is
|
||||
the Module 10 skill applied to tests. For each test ask: *if `pending_count` were wrong, would this
|
||||
one notice?* A test that only ever adds tasks (never completes one) would pass no matter what
|
||||
`pending_count` returns, because with nothing done, total and pending are the same number. That
|
||||
test is a tautology; the "one completed" test is the one with teeth.
|
||||
@@ -279,7 +287,7 @@ Do this once yourself so the tool isn't magic. From inside your working copy of
|
||||
```
|
||||
|
||||
There's the bug. It "worked" in every quick manual check because nobody ran `count` *after*
|
||||
completing a task — the one case where total and pending diverge. It passes a human skim. It does
|
||||
completing a task, the one case where total and pending diverge. It passes a human skim. It does
|
||||
not pass a test that encodes intent.
|
||||
|
||||
6. **Fix the code, not the test.** The test is correct; the code is wrong. Change it to honor the
|
||||
@@ -299,15 +307,18 @@ Do this once yourself so the tool isn't magic. From inside your working copy of
|
||||
> to `len(self.tasks)`, confirm an intent-encoding test goes red, then fix it. The muscle is
|
||||
> "write the test that would have caught this," and you build it by watching it catch something.
|
||||
|
||||
7. Commit the test file — this is the artifact Module 14 will automate:
|
||||
7. Commit the test file. This is the artifact Module 14 will automate. Tell Claude Code to stage
|
||||
`tasks.py` and `test_tasks.py` and commit them with a message describing the test addition and the
|
||||
`pending_count` fix. Before it commits, check the staged diff and the message yourself; you're
|
||||
verifying it staged exactly those two files and landed a commit equivalent to:
|
||||
|
||||
```bash
|
||||
git add tasks.py test_tasks.py
|
||||
git commit -m "Add tests for TaskList; fix pending_count to count only pending"
|
||||
```text
|
||||
Add tests for TaskList; fix pending_count to count only pending
|
||||
```
|
||||
|
||||
A reference suite (including the tautology-vs-intent contrast spelled out) is in
|
||||
`lab/solution/reference_test_tasks.py` — compare against it *after* you've written your own.
|
||||
`~/ai-workflow-course/modules/13-testing-in-the-ai-era/lab/solution/reference_test_tasks.py`. Compare
|
||||
against it *after* you've written your own.
|
||||
|
||||
---
|
||||
|
||||
@@ -320,7 +331,7 @@ The honest limits, because a green suite invites overconfidence:
|
||||
code, includes the edge cases the model also didn't think about. Tests narrow risk; they don't
|
||||
eliminate it. "All tests pass" is not "the code is correct."
|
||||
- **Tests written from the implementation are worse than no tests.** A suite that locks in current
|
||||
behavior gives you false confidence with a paper trail — the worst combination. The whole module
|
||||
behavior gives you false confidence with a paper trail, the worst combination. The whole module
|
||||
hinges on intent coming from *you*, not from the code the AI just wrote. If you ever let the same
|
||||
AI write both code and tests with no spec from you, assume the tests verify nothing until you've
|
||||
checked each one against intent.
|
||||
@@ -331,8 +342,8 @@ The honest limits, because a green suite invites overconfidence:
|
||||
- **Not everything is a unit test.** The `tasks-app` is pure logic, which is the easy case. Code that
|
||||
hits a database, a network, the filesystem, or an external service needs more setup (fixtures,
|
||||
fakes, integration tests) than this module covers. The thinking transfers; the mechanics get
|
||||
heavier, and that's a deliberately out-of-scope rabbit hole here.
|
||||
- **A test suite is code too — and the AI wrote it.** Tests can have bugs, including the silent kind
|
||||
heavier, and that's out of scope here.
|
||||
- **A test suite is code too, and the AI wrote it.** Tests can have bugs, including the silent kind
|
||||
that always pass. Reviewing tests is as real a task as reviewing code, which is exactly why Part B
|
||||
has you read them before trusting them.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user