fix(M7-27+capstone): apply AI-drives-git reframe, lesson=theory, de-slop course-wide

Phase 2 sweep — all modules are post-pivot, so the learner directs the AI agent
(Claude Code as the worked example) to do the git/setup work and verifies, instead
of typing commands by hand; no re-teaching basics. Lesson sections are theory with
example output; all execution lives in the labs. De-slopped ("prose" etc. gone
course-wide, em-dash density thinned). /path/to placeholders -> ~/ai-workflow-course.

Every deliberate teaching device verified intact: M10 ai-change.patch trap,
M12 bad-clear-snippet, M13/M27 planted pending_count bug, M15 secret+typosquat+MD5,
M18 BREAK=1, M21 absent-.gitignore, M22 poisoned skill, M24 no-op patch, M25 --simulate.
Labs compile/parse (py/sh/yaml/json); no junk.

Closes #83
Closes #86
Closes #89

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
2026-06-22 21:58:17 -04:00
parent a29823f4b3
commit f925fd9645
38 changed files with 1735 additions and 1424 deletions
+55 -52
@@ -9,29 +9,29 @@
## Prerequisites
This is the module the whole back half of the course was load-bearing for. It assumes a lot, on
purpose each piece is a wall the autonomous agent has to land behind.
purpose; each piece is a wall the autonomous agent has to land behind.
- **Module 24** assistive agents, where the AI helped and *you* decided every step. This module is
- **Module 24**: assistive agents, where the AI helped and *you* decided every step. This module is
the escalation: the agent now takes a step on its own. The only reason that's responsible is the
rest of this list.
- **Module 9** issues as an agent's task specification, including the `ready` label and the idea of
- **Module 9**: issues as an agent's task specification, including the `ready` label and the idea of
an agent as an *assignee*. An issue is the agent's input here.
- **Module 6** branches. The agent's work goes on a branch, never straight onto `main`.
- **Modules 10 and 11** the PR review gate and the full issue → branch → implementation → PR →
- **Module 6**: branches. The agent's work goes on a branch, never straight onto `main`.
- **Modules 10 and 11**: the PR review gate and the full issue → branch → implementation → PR →
review → merge → close loop. The PR *is* the unit of supervision in this module.
- **Modules 13 and 14** tests and CI. The automated gate that runs on the agent's PR.
- **Module 15** security scanning as another gate on the same pushes. Autonomy makes this
- **Modules 13 and 14**: tests and CI. The automated gate that runs on the agent's PR.
- **Module 15**: security scanning as another gate on the same pushes. Autonomy makes this
non-optional.
- **Module 19** runners. A triggered or scheduled agent is just a runner job; you need to know
- **Module 19**: runners. A triggered or scheduled agent is just a runner job; you need to know
what's executing it and whose compute it's burning.
- **Module 12** revert, reset, recovery. The backstop for when a gate misses something.
- **Module 5** your committed AI instructions file: the agent's standing brief, the half of the
- **Module 12**: revert, reset, recovery. The backstop for when a gate misses something.
- **Module 5**: your committed AI instructions file, the agent's standing brief and the half of the
spec that isn't in the issue.
- **Modules 16, 17, 22** containers (sandboxing), secrets (scoped credentials), and the prompt-
- **Modules 16, 17, 22**: containers (sandboxing), secrets (scoped credentials), and the prompt-
injection attack surface. An unattended agent with a push token is a security boundary; these are
why.
If you skipped straight here, the lesson will read as reckless because without those gates, it
If you skipped straight here, the lesson will read as reckless, because without those gates, it
*would* be.
---
@@ -48,7 +48,7 @@ By the end of this module you can:
`main`, and explain why that's *structural* supervision rather than *behavioral*.
4. Build a bounded self-healing loop: when a gate fails, feed the failure back to the agent for a
fix, capped at N attempts, with the result landing as a PR you review.
5. Decide how much autonomy to grant by reasoning about the strength of your gates not the
5. Decide how much autonomy to grant by reasoning about the strength of your gates, not the
intelligence of your model.
---
@@ -99,15 +99,15 @@ issue (assigned/labeled) → agent reads it → branch → implement →
What the agent reads as its brief is two artifacts you already maintain:
- **The issue** (Module 9) the *specific* task: title, context, acceptance criteria, scope. The
- **The issue** (Module 9): the *specific* task, with its title, context, acceptance criteria, and scope. The
acceptance criteria are the agent's literal definition of done.
- **The committed config** (Module 5) the *standing* brief: conventions, the build and test
- **The committed config** (Module 5): the *standing* brief, covering conventions, the build and test
commands, "don't touch these files," house style. Every assignee inherits it, including this one.
Together they're enough for the agent to attempt the work with **no live conversation**. That's the
point of having spent modules making both artifacts good: a well-formed issue plus a committed config
is a complete, handoff-ready spec. Hand it a vague issue and you get the Module 9 failure mode at
full volume a confident, plausible, wrong PR that costs more to review than the work would have
full volume: a confident, plausible, wrong PR that costs more to review than the work would have
taken.
Crucially: the agent's last step is **open a PR**, not **merge**. The output is a proposal. Nothing
@@ -129,14 +129,14 @@ push → CI fails → agent reads the failure → proposes a fix → pus
green? PR for review
```
Two design rules make this safe rather than a money-burning loop:
Two design rules make this safe rather than a runaway loop:
1. **Bound the retries.** Two or three attempts, then stop and tag a human. An agent that can retry
forever *will*, on a flaky test, producing an endless stream of plausible "fixes" and a runner
bill to match.
2. **Watch what it's fixing.** The classic failure mode: the test fails, so the agent "fixes" it by
*editing the test to pass* instead of fixing the bug. That's why the green result still lands as a
**reviewable PR** a human confirms it fixed the code, not the evidence. Self-healing CI proposes
**reviewable PR**: a human confirms it fixed the code, not the evidence. Self-healing CI proposes
a fix; it doesn't certify one.
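The two rules above can be sketched as a small loop. This is a minimal illustration, not the lab's `agent_runner.py`: the names `self_heal`, `run_gate`, and `propose_fix` are hypothetical, and the gate and the agent are passed in as plain callables so the cap is the only logic on display.

```python
from typing import Callable


def self_heal(
    run_gate: Callable[[], tuple[bool, str]],
    propose_fix: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Retry a failing gate at most `max_attempts` times (hypothetical sketch).

    run_gate    -> returns (passed, output) for the current working tree.
    propose_fix -> hands the failure output to the agent, which edits the tree.
    Returns "propose-pr" when the gate goes green, "needs-human" when the cap trips.
    """
    passed, output = run_gate()
    attempts = 0
    while not passed and attempts < max_attempts:
        attempts += 1
        propose_fix(output)          # the agent only sees the failure text
        passed, output = run_gate()  # the gate, not the agent, decides what is green
    # Even a green result is only a proposal: a human still reviews the diff.
    return "propose-pr" if passed else "needs-human"
```

Neither return value is a merge. The cap decides when to stop spending runner time, and the PR review decides whether the "fix" touched the code or the evidence.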
### Pattern 3 — Triggered and scheduled agent jobs
@@ -145,9 +145,9 @@ How does an agent *start* without you launching it? It runs as a runner job (Mod
machinery that runs your CI, pointed at an agent instead of a test suite. Two triggers cover almost
everything:
- **Triggered** an event fires the job: an issue gets a `ready`/`agent` label, a comment says
- **Triggered**: an event fires the job, for example an issue gets a `ready`/`agent` label, a comment says
`/agent fix this`, a CI run goes red. Event in, agent runs, PR out.
- **Scheduled** a cron-style timer fires it: "every night, attempt the top `ready`-labelled issue,"
- **Scheduled**: a cron-style timer fires it, for example "every night, attempt the top `ready`-labelled issue,"
or "hourly, retry any red `main` build." This is where "the workflow starts running itself" stops
being a slogan.
@@ -170,7 +170,7 @@ Here's the load-bearing idea of the module, and it's not about the model:
If your test suite covers 30% of behavior, an autonomous agent can silently break the other 70% and
still go green. If your only "review" is rubber-stamping the diff, the review gate isn't real and the
agent is effectively merging unseen. The work of making agents trustworthy is mostly the unglamorous
work of making your gates strong which is the work of Modules 10, 13, 14, and 15. Autonomy doesn't
work of making your gates strong, which is the work of Modules 10, 13, 14, and 15. Autonomy doesn't
ask you to trust the model more. It asks you to trust your gates more, and to have earned it.
---
@@ -181,22 +181,22 @@ Scripting a runner job is ordinary automation. What's specific to AI here is tha
the job is non-deterministic and persuasive**, and that changes what "automation" has to mean:
- **The output is a proposal, not a result.** A normal scheduled job (back up the database, rotate
logs) you trust to *complete*. An agent job you trust only to *propose* because its output is a
logs) you trust to *complete*. An agent job you trust only to *propose*, because its output is a
confident artifact that might be subtly wrong. That's why the universal endpoint is a PR behind a
gate, never a merge. The structure absorbs the non-determinism.
- **Supervision shifts from the action to the gate.** With deterministic automation you review the
*script* once. With an agent you can't, because it writes something new every run so you review
*script* once. With an agent you can't, because it writes something new every run, so you review
the *output* every run, automatically (CI, security) and by sample (human review). The supervision
didn't disappear; it moved from watching the agent to hardening the wall it hits.
- **Self-healing tempts the worst shortcut in the toolkit.** Pointed at a failing test, an agent will
cheerfully delete or weaken the test, because that does technically make CI green. A human would
feel the dishonesty; the agent just optimizes the objective you gave it. The defense is structural:
the fix is a reviewable diff, and the reviewer's job (Module 10) explicitly includes reading the
`-` lines on the *test* file.
delete or weaken the test, because that does technically make CI green. A human would feel the
dishonesty; the agent just optimizes the objective you gave it. The defense is structural: the fix
is a reviewable diff, and the reviewer's job (Module 10) explicitly includes reading the `-` lines
on the *test* file.
- **Autonomy multiplies your earlier discipline, for good or ill.** A clean repo with strong gates
and a good committed config turns an agent into a tireless contributor. A repo with flaky tests, no
security scanning, and an empty config turns the same agent into an automated mess-generator running
on a timer. The agent doesn't fix your engineering it amplifies it.
and a good committed config lets an agent contribute real work on a timer. A repo with flaky tests,
no security scanning, and an empty config lets the same agent generate mess on a timer. The agent
doesn't fix your engineering; it amplifies it.
---
@@ -216,11 +216,11 @@ shows how the exact same flow runs on a real forge as a triggered/scheduled job.
`pytest` and `ruff` installed (`pip install pytest ruff`). The lab runs these as the CI gate,
locally — the same checks `ci.yml` runs in Module 14.
- The starter files in this module's `lab/` folder:
- `agent_runner.py` the orchestrator. Drives the agent (real or simulated), then runs the gate,
- `agent_runner.py`: the orchestrator. Drives the agent (real or simulated), then runs the gate,
and only ever produces a branch + PR proposal, never a merge.
- `issue-delete-command.md` a well-formed issue (Module 9 format) for a `delete <index>` command:
- `issue-delete-command.md`: a well-formed issue (Module 9 format) for a `delete <index>` command. This is
the agent's input.
- `agent-job.yml` a reference forge workflow showing the triggered + scheduled runner version.
- `agent-job.yml`: a reference forge workflow showing the triggered + scheduled runner version.
Read it; you'll run it for real only in Part D.
- *Optional, for the "for real" path:* an agentic coding tool that has a non-interactive / headless /
one-shot mode (most expose a flag for running a single prompt without the interactive UI). If you
@@ -240,22 +240,23 @@ shows how the exact same flow runs on a real forge as a triggered/scheduled job.
Copy `agent_runner.py` and `issue-delete-command.md` into your `tasks-app` folder, along with this
module's `lab/.gitignore` (append its lines to the `.gitignore` you already have from Module 2 rather
than overwriting it). Commit that `.gitignore` first — it keeps the lab scaffolding and Python caches
out of the agent's `git add -A`, so the change you review in Part B is clean. Then, from a clean
branch:
than overwriting it). Direct your agent (Claude Code as the worked example; substitute your own) to commit
that updated `.gitignore`, then verify with `git log`. It keeps the lab scaffolding and Python caches
out of the agent's `git add -A`, so the change you review in Part B is clean. Then, from
`~/ai-workflow-course/tasks-app`, run the orchestrator:
```bash
cd ~/ai-workflow-course/tasks-app
git checkout -b agent/delete-command
# Simulate an agent that produces a BROKEN change, then run the gate on it:
python agent_runner.py issue-to-pr issue-delete-command.md --simulate bad
```
Watch the output. The "agent" plants a change, the script runs the gate (`ruff check` then
`pytest -q`), a test fails, and the script **stops and refuses to call the work ready** — exit code
non-zero, no PR proposed. That is structural supervision: it didn't matter that the change looked
plausible; the gate caught it. Nothing reached `main`.
The orchestrator creates and switches to its own `agent/issue-delete-command` branch first (the same
`git switch -c` the runner does in `agent-job.yml`), so you direct the automation and verify the
branch with `git branch` rather than typing `git checkout`. Then watch the output: the "agent" plants
a change, the script runs the gate (`ruff check` then `pytest -q`), a test fails, and the script
**stops and refuses to call the work ready**: the exit code is non-zero and no PR is proposed. That is structural
supervision. It didn't matter that the change looked plausible; the gate caught it, and nothing
reached `main`.
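The gate step described above, run the checks in order and stop at the first failure, can be sketched as follows. This is an illustration under assumptions rather than the code in `agent_runner.py`, which this diff does not show: `run_gate` is a hypothetical name, and the default commands are simply the two checks the lab names (`ruff check`, then `pytest -q`).

```python
import subprocess
from typing import Sequence


def run_gate(
    commands: Sequence[Sequence[str]] = (("ruff", "check"), ("pytest", "-q")),
) -> tuple[bool, str]:
    """Run each check in order; stop at the first non-zero exit code.

    Returns (passed, output). `output` is the failing command's combined
    stdout/stderr, which is what a self-healing loop would feed back to the agent.
    """
    for cmd in commands:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, ""
```

The gate judges the change only by exit codes, so how plausible the diff looks never enters the decision. A caller would turn a `False` result into a non-zero exit of its own and skip the PR proposal.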
### Part B — See a good change land as a PR proposal
@@ -264,19 +265,21 @@ python agent_runner.py issue-to-pr issue-delete-command.md --simulate good
```
This time the planted change is correct. The gate passes, the script commits to the branch and prints
the diff for review plus the exact `git push` / open-PR command. **It does not merge.** Open the diff
and review it with the Module 10 checklist. Remember (from the note above) that the simulated diff is
the self-contained `discount()` stand-in, not a `delete` command — but the review *motion* is the real
lesson: you are the human gate, and that step doesn't go away just because an agent did the typing.
the diff plus the push / open-PR command it would run. **It does not merge.** Review the diff with the
Module 10 checklist, then direct your agent (Claude Code; substitute your own) to run that push and open the
PR, and verify the PR appeared. Remember (from the note above) that the simulated diff is the
self-contained `discount()` stand-in, not a `delete` command. The review *motion* is the real lesson:
you are the human gate, and that step doesn't go away just because an agent did the typing. The agent
stops at a PR; it never merges.
### Part C — Run the self-healing loop
```bash
git checkout -b agent/self-heal
python agent_runner.py self-heal --simulate bad
```
The script plants a failing change, runs the gate (red), feeds the failure back to the "agent" for a
The orchestrator switches to its own `agent/self-heal` branch (again, you direct the automation, not
your fingers), then plants a failing change, runs the gate (red), feeds the failure back to the "agent" for a
fix, re-runs the gate, and repeats up to its retry cap. With `--simulate bad` the fix succeeds on the
second attempt and the result is offered as a PR proposal. Run it with `--simulate stuck` to watch the
cap trip: after N attempts it gives up and tags the work for a human instead of looping forever.
@@ -311,7 +314,7 @@ Two ways to go from simulation to a genuine autonomous run:
The honest limits — and for autonomous agents, the limits *are* the lesson:
- **Your gates are the ceiling, and most gates are weaker than they look.** Thin test coverage,
skipped security scans, or review-by-rubber-stamp don't just reduce quality they directly set how
skipped security scans, or review-by-rubber-stamp don't just reduce quality; they directly set how
much an autonomous agent can quietly break. Don't grant more autonomy than your gates can verify.
The honest version of "should I let an agent do this unattended?" is "would my CI catch it if it got
it wrong?"
@@ -352,8 +355,8 @@ The honest limits — and for autonomous agents, the limits *are* the lesson:
- You can name the three patterns (issue-to-PR, self-healing CI, triggered/scheduled jobs) and the
four gates that make any of them safe (review M10, CI M14, security M15, recovery M12).
When "let the agent take the first pass" feels safe because you trust the wall it lands behind not
because you trust the model — you've got the model right. Module 26 takes the next step: more than one
When "let the agent take the first pass" feels safe because you trust the wall it lands behind, not
because you trust the model, you've got the model right. Module 26 takes the next step: more than one
agent working at once without colliding, which is where the worktrees from Module 7 finally pay off at
scale.
@@ -161,6 +161,18 @@ def in_git_repo() -> bool:
capture_output=True).returncode == 0
def ensure_branch(name: str) -> None:
    """Create and switch to the agent's working branch. The orchestrator owns this git step the same
    way agent-job.yml's runner does (`git switch -c`): you direct the automation and then verify the
    branch (`git branch`), instead of typing `git checkout` by hand. No-op outside a Git repo."""
    if not in_git_repo():
        return
    exists = subprocess.run(["git", "rev-parse", "--verify", "--quiet", name],
                            capture_output=True).returncode == 0
    subprocess.run(["git", "switch", name] if exists else ["git", "switch", "-c", name])
    print(f"[git] working on branch {name} (the orchestrator created/switched it for you).")
def propose_pr(message: str) -> None:
print("\n" + "=" * 80)
print("GATE PASSED. Proposing a PR — NOT merging. A human reviews the diff (Module 10).")
@@ -202,6 +214,7 @@ def reject(reason: str, gate_output: str, *, simulated: bool = False) -> None:
# --------------------------------------------------------------------------------------------------
def cmd_issue_to_pr(issue_path: Path, simulate: str | None) -> int:
    print(f"[issue-to-pr] brief: {issue_path}")
    ensure_branch(f"agent/{issue_path.stem}")
    if simulate:
        print(f"[issue-to-pr] simulating a '{simulate}' agent on the self-contained demo target.")
        simulate_implement(simulate)
@@ -218,6 +231,7 @@ def cmd_issue_to_pr(issue_path: Path, simulate: str | None) -> int:
def cmd_self_heal(simulate: str | None) -> int:
    ensure_branch("agent/self-heal")
    # Establish a failing state to heal. In a real pipeline this is "CI just went red on a push".
    if simulate:
        print(f"[self-heal] simulating a red build ('{simulate}') on the demo target.")