fix(M7-27+capstone): apply AI-drives-git reframe, lesson=theory, de-slop course-wide
Phase 2 sweep — all modules are post-pivot, so the learner directs the AI agent
(Claude Code as the worked example) to do the git/setup work and verifies, instead
of typing commands by hand; no re-teaching basics. Lesson sections are theory with
example output; all execution lives in the labs. De-slopped ("prose" etc. gone
course-wide, em-dash density thinned). /path/to placeholders -> ~/ai-workflow-course.
Every deliberate teaching device verified intact: M10 ai-change.patch trap,
M12 bad-clear-snippet, M13/M27 planted pending_count bug, M15 secret+typosquat+MD5,
M18 BREAK=1, M21 absent-.gitignore, M22 poisoned skill, M24 no-op patch, M25 --simulate.
Labs compile/parse (py/sh/yaml/json); no junk.
Closes #83
Closes #86
Closes #89
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
@@ -9,29 +9,29 @@
 ## Prerequisites

 This is the module the whole back half of the course was load-bearing for. It assumes a lot, on
-purpose — each piece is a wall the autonomous agent has to land behind.
+purpose; each piece is a wall the autonomous agent has to land behind.

-- **Module 24** — assistive agents, where the AI helped and *you* decided every step. This module is
+- **Module 24**: assistive agents, where the AI helped and *you* decided every step. This module is
 the escalation: the agent now takes a step on its own. The only reason that's responsible is the
 rest of this list.
-- **Module 9** — issues as an agent's task specification, including the `ready` label and the idea of
+- **Module 9**: issues as an agent's task specification, including the `ready` label and the idea of
 an agent as an *assignee*. An issue is the agent's input here.
-- **Module 6** — branches. The agent's work goes on a branch, never straight onto `main`.
-- **Modules 10 and 11** — the PR review gate and the full issue → branch → implementation → PR →
+- **Module 6**: branches. The agent's work goes on a branch, never straight onto `main`.
+- **Modules 10 and 11**: the PR review gate and the full issue → branch → implementation → PR →
 review → merge → close loop. The PR *is* the unit of supervision in this module.
-- **Modules 13 and 14** — tests and CI. The automated gate that runs on the agent's PR.
-- **Module 15** — security scanning as another gate on the same pushes. Autonomy makes this
+- **Modules 13 and 14**: tests and CI. The automated gate that runs on the agent's PR.
+- **Module 15**: security scanning as another gate on the same pushes. Autonomy makes this
 mandatory, not optional.
-- **Module 19** — runners. A triggered or scheduled agent is just a runner job; you need to know
+- **Module 19**: runners. A triggered or scheduled agent is just a runner job; you need to know
 what's executing it and whose compute it's burning.
-- **Module 12** — revert, reset, recovery. The backstop for when a gate misses something.
-- **Module 5** — your committed AI instructions file: the agent's standing brief, the half of the
+- **Module 12**: revert, reset, recovery. The backstop for when a gate misses something.
+- **Module 5**: your committed AI instructions file, which is the agent's standing brief, the half of the
 spec that isn't in the issue.
-- **Modules 16, 17, 22** — containers (sandboxing), secrets (scoped credentials), and the prompt-
+- **Modules 16, 17, 22**: containers (sandboxing), secrets (scoped credentials), and the prompt-
 injection attack surface. An unattended agent with a push token is a security boundary; these are
 why.

-If you skipped straight here, the lesson will read as reckless — because without those gates, it
+If you skipped straight here, the lesson will read as reckless, because without those gates, it
 *would* be.

 ---
@@ -48,7 +48,7 @@ By the end of this module you can:
 `main`, and explain why that's *structural* supervision rather than *behavioral*.
 4. Build a bounded self-healing loop: when a gate fails, feed the failure back to the agent for a
 fix, capped at N attempts, with the result landing as a PR you review.
-5. Decide how much autonomy to grant by reasoning about the strength of your gates — not the
+5. Decide how much autonomy to grant by reasoning about the strength of your gates, not the
 intelligence of your model.

 ---
@@ -99,15 +99,15 @@ issue (assigned/labeled) → agent reads it → branch → implement →

 What the agent reads as its brief is two artifacts you already maintain:

-- **The issue** (Module 9) — the *specific* task: title, context, acceptance criteria, scope. The
+- **The issue** (Module 9), the *specific* task: title, context, acceptance criteria, scope. The
 acceptance criteria are the agent's literal definition of done.
-- **The committed config** (Module 5) — the *standing* brief: conventions, the build and test
+- **The committed config** (Module 5), the *standing* brief: conventions, the build and test
 commands, "don't touch these files," house style. Every assignee inherits it, including this one.

 Together they're enough for the agent to attempt the work with **no live conversation**. That's the
 point of having spent modules making both artifacts good: a well-formed issue plus a committed config
 is a complete, handoff-ready spec. Hand it a vague issue and you get the Module 9 failure mode at
-full volume — a confident, plausible, wrong PR that costs more to review than the work would have
+full volume: a confident, plausible, wrong PR that costs more to review than the work would have
 taken.

 Crucially: the agent's last step is **open a PR**, not **merge**. The output is a proposal. Nothing
@@ -129,14 +129,14 @@ push → CI fails → agent reads the failure → proposes a fix → pus
 green? PR for review
 ```

-Two design rules make this safe rather than a money-burning loop:
+Two design rules make this safe rather than a runaway loop:

 1. **Bound the retries.** Two or three attempts, then stop and tag a human. An agent that can retry
 forever *will*, on a flaky test, producing an endless stream of plausible "fixes" and a runner
 bill to match.
 2. **Watch what it's fixing.** The classic failure mode: the test fails, so the agent "fixes" it by
 *editing the test to pass* instead of fixing the bug. That's why the green result still lands as a
-**reviewable PR** — a human confirms it fixed the code, not the evidence. Self-healing CI proposes
+**reviewable PR**: a human confirms it fixed the code, not the evidence. Self-healing CI proposes
 a fix; it doesn't certify one.

 ### Pattern 3 — Triggered and scheduled agent jobs
@@ -145,9 +145,9 @@ How does an agent *start* without you launching it? It runs as a runner job (Mod
 machinery that runs your CI, pointed at an agent instead of a test suite. Two triggers cover almost
 everything:

-- **Triggered** — an event fires the job: an issue gets a `ready`/`agent` label, a comment says
+- **Triggered**, where an event fires the job: an issue gets a `ready`/`agent` label, a comment says
 `/agent fix this`, a CI run goes red. Event in, agent runs, PR out.
-- **Scheduled** — a cron-style timer fires it: "every night, attempt the top `ready`-labelled issue,"
+- **Scheduled**, where a cron-style timer fires it: "every night, attempt the top `ready`-labelled issue,"
 or "hourly, retry any red `main` build." This is where "the workflow starts running itself" stops
 being a slogan.

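The triggered and scheduled cases above reduce to one small decision: given an event, should an agent job start? The sketch below is illustrative only. The `event` dictionary, its keys (`kind`, `label`, `body`, `conclusion`), and the function name `should_run_agent` are hypothetical and do not match any particular forge's payload; on a real forge this filtering normally lives in the workflow file (the lab's `agent-job.yml` is the reference), not in Python.

```python
from typing import Any, Dict

AGENT_LABELS = {"ready", "agent"}  # labels named in the lesson text
AGENT_COMMAND = "/agent"           # comment command named in the lesson text


def should_run_agent(event: Dict[str, Any]) -> bool:
    """Decide whether an event should start an agent job.

    `event` is a hypothetical, simplified payload: a "kind" key plus
    kind-specific keys. Real forge payloads are shaped differently.
    """
    kind = event.get("kind")
    if kind == "issue_labeled":
        # Triggered: an issue received one of the agent labels.
        return event.get("label") in AGENT_LABELS
    if kind == "issue_comment":
        # Triggered: the first word of the comment is the agent command.
        words = str(event.get("body", "")).split()
        return words[:1] == [AGENT_COMMAND]
    if kind == "ci_finished":
        # Triggered: a CI run went red.
        return event.get("conclusion") == "failure"
    if kind == "schedule":
        # Scheduled: the timer itself is the trigger.
        return True
    return False
```

Whatever fires the job, the output is the same: a branch and a PR proposal, never a merge.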
@@ -170,7 +170,7 @@ Here's the load-bearing idea of the module, and it's not about the model:
 If your test suite covers 30% of behavior, an autonomous agent can silently break the other 70% and
 still go green. If your only "review" is rubber-stamping the diff, the review gate isn't real and the
 agent is effectively merging unseen. The work of making agents trustworthy is mostly the unglamorous
-work of making your gates strong — which is the work of Modules 10, 13, 14, and 15. Autonomy doesn't
+work of making your gates strong, which is the work of Modules 10, 13, 14, and 15. Autonomy doesn't
 ask you to trust the model more. It asks you to trust your gates more, and to have earned it.

 ---
@@ -181,22 +181,22 @@ Scripting a runner job is ordinary automation. What's specific to AI here is tha
 the job is non-deterministic and persuasive**, and that changes what "automation" has to mean:

 - **The output is a proposal, not a result.** A normal scheduled job (back up the database, rotate
-logs) you trust to *complete*. An agent job you trust only to *propose* — because its output is a
+logs) you trust to *complete*. An agent job you trust only to *propose*, because its output is a
 confident artifact that might be subtly wrong. That's why the universal endpoint is a PR behind a
 gate, never a merge. The structure absorbs the non-determinism.
 - **Supervision shifts from the action to the gate.** With deterministic automation you review the
-*script* once. With an agent you can't, because it writes something new every run — so you review
+*script* once. With an agent you can't, because it writes something new every run, so you review
 the *output* every run, automatically (CI, security) and by sample (human review). The supervision
 didn't disappear; it moved from watching the agent to hardening the wall it hits.
 - **Self-healing tempts the worst shortcut in the toolkit.** Pointed at a failing test, an agent will
-cheerfully delete or weaken the test, because that does technically make CI green. A human would
-feel the dishonesty; the agent just optimizes the objective you gave it. The defense is structural:
-the fix is a reviewable diff, and the reviewer's job (Module 10) explicitly includes reading the
-`-` lines on the *test* file.
+delete or weaken the test, because that does technically make CI green. A human would feel the
+dishonesty; the agent just optimizes the objective you gave it. The defense is structural: the fix
+is a reviewable diff, and the reviewer's job (Module 10) explicitly includes reading the `-` lines
+on the *test* file.
 - **Autonomy multiplies your earlier discipline, for good or ill.** A clean repo with strong gates
-and a good committed config turns an agent into a tireless contributor. A repo with flaky tests, no
-security scanning, and an empty config turns the same agent into an automated mess-generator running
-on a timer. The agent doesn't fix your engineering — it amplifies it.
+and a good committed config lets an agent contribute real work on a timer. A repo with flaky tests,
+no security scanning, and an empty config lets the same agent generate mess on a timer. The agent
+doesn't fix your engineering; it amplifies it.

 ---

@@ -216,11 +216,11 @@ shows how the exact same flow runs on a real forge as a triggered/scheduled job.
 `pytest` and `ruff` installed (`pip install pytest ruff`). The lab runs these as the CI gate,
 locally — the same checks `ci.yml` runs in Module 14.
 - The starter files in this module's `lab/` folder:
-- `agent_runner.py` — the orchestrator. Drives the agent (real or simulated), then runs the gate,
+- `agent_runner.py`: the orchestrator. Drives the agent (real or simulated), then runs the gate,
 and only ever produces a branch + PR proposal, never a merge.
-- `issue-delete-command.md` — a well-formed issue (Module 9 format) for a `delete <index>` command:
+- `issue-delete-command.md`: a well-formed issue (Module 9 format) for a `delete <index>` command,
 the agent's input.
-- `agent-job.yml` — a reference forge workflow showing the triggered + scheduled runner version.
+- `agent-job.yml`: a reference forge workflow showing the triggered + scheduled runner version.
 Read it; you'll run it for real only in Part D.
 - *Optional, for the "for real" path:* an agentic coding tool that has a non-interactive / headless /
 one-shot mode (most expose a flag for running a single prompt without the interactive UI). If you
@@ -240,22 +240,23 @@ shows how the exact same flow runs on a real forge as a triggered/scheduled job.

 Copy `agent_runner.py` and `issue-delete-command.md` into your `tasks-app` folder, along with this
 module's `lab/.gitignore` (append its lines to the `.gitignore` you already have from Module 2 rather
-than overwriting it). Commit that `.gitignore` first — it keeps the lab scaffolding and Python caches
-out of the agent's `git add -A`, so the change you review in Part B is clean. Then, from a clean
-branch:
+than overwriting it). Direct your agent (Claude Code as the worked example; substitute your own) to commit
+that updated `.gitignore`, then verify with `git log`. That `.gitignore` keeps the lab scaffolding and Python caches
+out of the agent's `git add -A`, so the change you review in Part B is clean. Then, from
+`~/ai-workflow-course/tasks-app`, run the orchestrator:

 ```bash
-cd ~/ai-workflow-course/tasks-app
-git checkout -b agent/delete-command
-
 # Simulate an agent that produces a BROKEN change, then run the gate on it:
 python agent_runner.py issue-to-pr issue-delete-command.md --simulate bad
 ```

-Watch the output. The "agent" plants a change, the script runs the gate (`ruff check` then
-`pytest -q`), a test fails, and the script **stops and refuses to call the work ready** — exit code
-non-zero, no PR proposed. That is structural supervision: it didn't matter that the change looked
-plausible; the gate caught it. Nothing reached `main`.
+The orchestrator creates and switches to its own `agent/issue-delete-command` branch first (the same
+`git switch -c` the runner does in `agent-job.yml`), so you direct the automation and verify the
+branch with `git branch` rather than typing `git checkout`. Then watch the output: the "agent" plants
+a change, the script runs the gate (`ruff check` then `pytest -q`), a test fails, and the script
+**stops and refuses to call the work ready**, with a non-zero exit code and no PR proposed. That is structural
+supervision. It didn't matter that the change looked plausible; the gate caught it, and nothing
+reached `main`.

 ### Part B — See a good change land as a PR proposal

@@ -264,19 +265,21 @@ python agent_runner.py issue-to-pr issue-delete-command.md --simulate good
 ```

 This time the planted change is correct. The gate passes, the script commits to the branch and prints
-the diff for review plus the exact `git push` / open-PR command. **It does not merge.** Open the diff
-and review it with the Module 10 checklist. Remember (from the note above) that the simulated diff is
-the self-contained `discount()` stand-in, not a `delete` command — but the review *motion* is the real
-lesson: you are the human gate, and that step doesn't go away just because an agent did the typing.
+the diff plus the push / open-PR command it would run. **It does not merge.** Review the diff with the
+Module 10 checklist, then direct your agent (Claude Code; substitute your own) to run that push and open the
+PR, and verify the PR appeared. Remember (from the note above) that the simulated diff is the
+self-contained `discount()` stand-in, not a `delete` command. The review *motion* is the real lesson:
+you are the human gate, and that step doesn't go away just because an agent did the typing. The agent
+stops at a PR; it never merges.

 ### Part C — Run the self-healing loop

 ```bash
-git checkout -b agent/self-heal
 python agent_runner.py self-heal --simulate bad
 ```

-The script plants a failing change, runs the gate (red), feeds the failure back to the "agent" for a
+The orchestrator switches to its own `agent/self-heal` branch (again, you direct the automation rather
+than typing the command), then plants a failing change, runs the gate (red), feeds the failure back to the "agent" for a
 fix, re-runs the gate, and repeats up to its retry cap. With `--simulate bad` the fix succeeds on the
 second attempt and the result is offered as a PR proposal. Run it with `--simulate stuck` to watch the
 cap trip: after N attempts it gives up and tags the work for a human instead of looping forever.
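The loop Part C describes (gate, feed the failure back, retry up to a cap, then either propose a PR or hand off to a human) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab's `agent_runner.py`: the names `self_heal`, `run_gate`, and `attempt_fix` are hypothetical, the gate is assumed to return a pass flag plus its captured output, and the default cap of 3 follows the "two or three attempts" guidance in the lesson.

```python
from typing import Callable, Tuple

GateResult = Tuple[bool, str]  # (passed, captured gate output)


def self_heal(run_gate: Callable[[], GateResult],
              attempt_fix: Callable[[str], None],
              max_attempts: int = 3) -> str:
    """Run the gate; on failure, hand the output to the agent and retry, up to a cap.

    Returns "propose-pr" once the gate is green, or "needs-human" when the cap
    trips. It never merges: the best outcome is a proposal for review.
    """
    passed, output = run_gate()
    for attempt in range(1, max_attempts + 1):
        if passed:
            return "propose-pr"
        print(f"[self-heal] gate red, fix attempt {attempt}/{max_attempts}")
        attempt_fix(output)            # the agent is given the failure text
        passed, output = run_gate()    # always re-verify; never trust the fix
    return "propose-pr" if passed else "needs-human"
```

The cap is the part that matters. Without `max_attempts`, a flaky or genuinely stuck test would keep the loop (and the runner bill) going indefinitely; with it, the worst case is a bounded number of attempts followed by a handoff to a person.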
@@ -311,7 +314,7 @@ Two ways to go from simulation to a genuine autonomous run:
 The honest limits — and for autonomous agents, the limits *are* the lesson:

 - **Your gates are the ceiling, and most gates are weaker than they look.** Thin test coverage,
-skipped security scans, or review-by-rubber-stamp don't just reduce quality — they directly set how
+skipped security scans, or review-by-rubber-stamp don't just reduce quality; they directly set how
 much an autonomous agent can quietly break. Don't grant more autonomy than your gates can verify.
 The honest version of "should I let an agent do this unattended?" is "would my CI catch it if it got
 it wrong?"
@@ -352,8 +355,8 @@ The honest limits — and for autonomous agents, the limits *are* the lesson:
 - You can name the three patterns (issue-to-PR, self-healing CI, triggered/scheduled jobs) and the
 four gates that make any of them safe (review M10, CI M14, security M15, recovery M12).

-When "let the agent take the first pass" feels safe because you trust the wall it lands behind — not
-because you trust the model — you've got the model right. Module 26 takes the next step: more than one
+When "let the agent take the first pass" feels safe because you trust the wall it lands behind, not
+because you trust the model, you've got the model right. Module 26 takes the next step: more than one
 agent working at once without colliding, which is where the worktrees from Module 7 finally pay off at
 scale.

@@ -161,6 +161,18 @@ def in_git_repo() -> bool:
                           capture_output=True).returncode == 0


+def ensure_branch(name: str) -> None:
+    """Create and switch to the agent's working branch. The orchestrator owns this git step the same
+    way agent-job.yml's runner does (`git switch -c`) — you direct the automation and then verify the
+    branch (`git branch`), instead of typing `git checkout` by hand. No-op outside a Git repo."""
+    if not in_git_repo():
+        return
+    exists = subprocess.run(["git", "rev-parse", "--verify", "--quiet", name],
+                            capture_output=True).returncode == 0
+    subprocess.run(["git", "switch", name] if exists else ["git", "switch", "-c", name], check=True)
+    print(f"[git] working on branch {name} (the orchestrator created/switched it for you).")
+
+
 def propose_pr(message: str) -> None:
     print("\n" + "=" * 80)
     print("GATE PASSED. Proposing a PR — NOT merging. A human reviews the diff (Module 10).")
@@ -202,6 +214,7 @@ def reject(reason: str, gate_output: str, *, simulated: bool = False) -> None:
 # --------------------------------------------------------------------------------------------------
 def cmd_issue_to_pr(issue_path: Path, simulate: str | None) -> int:
     print(f"[issue-to-pr] brief: {issue_path}")
+    ensure_branch(f"agent/{issue_path.stem}")
     if simulate:
         print(f"[issue-to-pr] simulating a '{simulate}' agent on the self-contained demo target.")
         simulate_implement(simulate)
@@ -218,6 +231,7 @@ def cmd_issue_to_pr(issue_path: Path, simulate: str | None) -> int:


 def cmd_self_heal(simulate: str | None) -> int:
+    ensure_branch("agent/self-heal")
     # Establish a failing state to heal. In a real pipeline this is "CI just went red on a push".
     if simulate:
         print(f"[self-heal] simulating a red build ('{simulate}') on the demo target.")