From f98eacb196b38495b0cd3c41ee8d78eaf9496f0c Mon Sep 17 00:00:00 2001
From: claude
Date: Mon, 22 Jun 2026 16:07:47 -0400
Subject: [PATCH] fix(testing/ci/tooling): consistent unittest, venv guidance, runnable lab commands
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- #9: standardize the test chain on stdlib unittest (nothing-to-install, which keeps M13's claims
  true and its planted bug intact). Aligned M5/M14/M16 prose, M14 lab/test_tasks.py, and ci/gitlab
  starters; ruff stays the only pip install.
- #20: add venv / PEP 668 / which-python guidance to M20 (+ M14/M15 local installs); point MCP
  config at the venv's absolute python.
- #21: replace M21 Part D's empty `git diff HEAD~1` with `git log -p` (no .gitignore added — device
  preserved).
- #22: add a dependency-install step before M23's green baseline on a fresh clone.
- #23: M24 reviewer/triage now tolerate code-fence-wrapped JSON (stdlib only); feature.patch trap
  untouched.
- #28: fix M27 Part D CI snippet path (working-directory) and require the gate to target a varying
  candidate; swapped_model regression kept as the fixture.
Closes #9
Closes #20
Closes #21
Closes #22
Closes #23
Closes #28

Co-Authored-By: Claude Opus 4.8 (1M context)
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
---
 modules/05-commit-the-ai-config/README.md     | 21 +++----
 .../lab/instructions-file-starter.md          |  2 +-
 modules/14-continuous-integration/README.md   | 35 +++++++-----
 .../lab/ci-starter.yml                        | 10 ++--
 .../lab/gitlab-ci-starter.yml                 |  6 +-
 .../lab/test_tasks.py                         | 57 ++++++++++---------
 modules/15-security-scanning/README.md        |  6 ++
 .../README.md                                 |  7 ++-
 .../README.md                                 | 52 ++++++++++++++---
 .../lab/mcp-config-example.json               |  4 +-
 .../README.md                                 |  7 ++-
 .../README.md                                 | 17 ++++--
 .../lab/skills/safe-change.md                 |  8 ++-
 modules/24-assistive-agents/README.md         |  4 ++
 modules/24-assistive-agents/lab/reviewer.py   | 20 ++++++-
 modules/24-assistive-agents/lab/triage.py     | 19 ++++++-
 modules/27-evals/README.md                    | 23 +++++++-
 17 files changed, 216 insertions(+), 82 deletions(-)

diff --git a/modules/05-commit-the-ai-config/README.md b/modules/05-commit-the-ai-config/README.md
index 81eb93a..3d40536 100644
--- a/modules/05-commit-the-ai-config/README.md
+++ b/modules/05-commit-the-ai-config/README.md
@@ -47,8 +47,8 @@ committed instructions file from the repo, and you control what's in it.**
 > repo-root config file). Some tools even read more than one filename — point them all at the same
 > content if so. The principle outlives any one vendor's filename.

-Without this file, you re-explain your project every session: "we use 4-space indent," "the tests are
-`pytest`, run them before you say you're done," "don't touch the generated `tasks.json`." You say it,
+Without this file, you re-explain your project every session: "we use 4-space indent," "run the tests
+with `python -m unittest` before you say you're done," "don't touch the generated `tasks.json`." You say it,
 the AI complies, the session ends, the memory evaporates (Module 1's second seam), and tomorrow you
 say it all again.
The instructions file is where that knowledge stops being something you retype and becomes something the project *carries*. @@ -62,8 +62,8 @@ a briefing for an agent that will edit this code. Keep it to what changes the AI uses. "Core logic lives in `tasks.py`; the CLI front end is `cli.py`; state persists to `tasks.json`." - **Build and test commands** — the exact commands, copy-pasteable. "Run the app with - `python cli.py `. Run tests with `pytest`. Don't claim a change works until the tests - pass." This single line stops the AI from inventing a test runner you don't use. + `python cli.py `. Run tests with `python -m unittest`. Don't claim a change works until + the tests pass." This single line stops the AI from inventing a test runner you don't use. - **Coding standards** — formatting, typing, error handling, the libraries you do and don't want. "Use the standard library only — no third-party packages. Type-hint public functions." - **"Don't touch these files."** — the off-limits list. Generated files, vendored code, secrets, @@ -83,7 +83,7 @@ useful for personal preferences, but it's the wrong home for project knowledge, lives: on *your* laptop, invisible to everyone else. Picture a two-person project with no committed instructions file. You've trained your local setup to -run `pytest` and avoid `tasks.json`. Your teammate's setup hasn't — their agent reformats whole files +run `python -m unittest` and avoid `tasks.json`. Your teammate's setup hasn't — their agent reformats whole files and hand-edits the generated JSON. You're both "using AI on the same repo," but you're getting different behavior, and neither of you can see the other's configuration. That's **drift**: the same codebase, diverging because the rules live in two heads instead of one file. @@ -176,7 +176,8 @@ editor-integrated AI (Module 4) for the part where the AI obeys the file. - The `tasks-app` repo from Module 2 (already a Git repo with some history). 
- Your agentic coding tool from Module 4, and knowledge of which filename it reads for repo-level instructions (check its docs — see the note in *Key concepts*). -- Optionally `pytest` (`pip install pytest`) so the AI has a real test command to honor. +- Optionally, a test command for the AI to honor — Python's built-in `python -m unittest` works with + nothing to install (you'll write a real suite in Module 13; until then it simply reports no tests). ### Part A — Write the instructions file @@ -192,8 +193,8 @@ editor-integrated AI (Module 4) for the part where the AI obeys the file. 2. Open it in your editor and make it true for *your* project. The starter is filled in for the `tasks-app`, but read every line and confirm it matches reality — wrong instructions are worse - than none. At minimum, set the real test command (or delete the line if you didn't install - `pytest`). + than none. At minimum, set the real test command (or delete the line if you don't have tests + yet). 3. Commit it. This is the point of the whole module: @@ -272,8 +273,8 @@ Be honest about what a committed instructions file does and doesn't buy you: - **Bloat kills it.** A 300-line instructions file is read the way *you* read a 300-line terms-of- service: not really. Every line you add dilutes the rest. Keep it to what actually changes behavior, and prune lines the model already honors without being told. -- **Stale instructions are worse than none.** A file that says "tests are `pytest`" after you've moved - to something else will actively misdirect the AI. The file is code-adjacent — it has to be +- **Stale instructions are worse than none.** A file that says "run the tests with `python -m + unittest`" after you've switched to a different runner will actively misdirect the AI. The file is code-adjacent — it has to be maintained like code, and reviewed like code. That's exactly why committing it (so changes are visible) matters. 
- **The team payoff isn't here yet.** On a solo local repo, the "no more drift between teammates" diff --git a/modules/05-commit-the-ai-config/lab/instructions-file-starter.md b/modules/05-commit-the-ai-config/lab/instructions-file-starter.md index 6b2cb29..c01d53b 100644 --- a/modules/05-commit-the-ai-config/lab/instructions-file-starter.md +++ b/modules/05-commit-the-ai-config/lab/instructions-file-starter.md @@ -26,7 +26,7 @@ minute but real enough to have more than one file. Keep it that way — don't gr ## Build and test commands - Run the app: `python cli.py ` (e.g. `python cli.py list`). -- Run the tests: `pytest` +- Run the tests: `python -m unittest` - Do not claim a change works until you have actually run it. If tests exist, they must pass first. ## Coding standards diff --git a/modules/14-continuous-integration/README.md b/modules/14-continuous-integration/README.md index 6347da7..cb05a40 100644 --- a/modules/14-continuous-integration/README.md +++ b/modules/14-continuous-integration/README.md @@ -78,8 +78,8 @@ Almost every CI configuration, on every forge, is the same four moves: 4. **Run the checks** — lint, then test. Any check that exits non-zero fails the whole run. That last point is the load-bearing one. CI's entire enforcement mechanism is the **exit code**. -Every tool you'd run in a terminal returns 0 for success and non-zero for failure. `pytest` exits -non-zero if a test fails. `ruff check` exits non-zero if it finds a lint problem. CI runs your +Every tool you'd run in a terminal returns 0 for success and non-zero for failure. `python -m +unittest` exits non-zero if a test fails. `ruff check` exits non-zero if it finds a lint problem. CI runs your commands and watches those exit codes; one failure turns the run red. You're not learning a new testing system — you're wiring the tools you already have to a trigger. 
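To watch the exit-code contract from the paragraph above in action, here is a minimal sketch (not part of the course labs). It writes a throwaway one-test suite, runs `python -m unittest` on it in a subprocess, and returns the exit code a CI runner would see. The names `TEMPLATE` and `unittest_exit_code` are hypothetical, made up for this illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical one-test suite; {expected} decides whether the assertion holds.
TEMPLATE = """\
import unittest


class Demo(unittest.TestCase):
    def test_sum(self):
        self.assertEqual(1 + 1, {expected})
"""


def unittest_exit_code(expected: int) -> int:
    """Run `python -m unittest` on a throwaway suite and return its exit code."""
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "test_demo.py").write_text(TEMPLATE.format(expected=expected))
        result = subprocess.run(
            [sys.executable, "-m", "unittest"],  # the same command the CI step runs
            cwd=folder,                          # test discovery starts in this folder
            capture_output=True,
            text=True,
        )
    return result.returncode


if __name__ == "__main__":
    print("passing suite ->", unittest_exit_code(2))
    print("failing suite ->", unittest_exit_code(3))
```

The passing suite should come back as 0 and the failing one as non-zero, which is the only signal a CI runner needs from the step.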
@@ -125,18 +125,19 @@ jobs: with: python-version: "3.12" - name: Install tools - run: pip install pytest ruff + run: pip install ruff - name: Lint run: ruff check . - name: Test - run: pytest -q + run: python -m unittest ``` Reading it top to bottom: `on:` is the trigger (push and pull request). `runs-on:` picks the clean machine. The `steps:` are the four moves — checkout, set up Python, install the tools, then the two checks. `uses:` pulls in a pre-built action (someone else's reusable step); `run:` is just a shell command. The linter runs first because it's cheap; the tests run last because they're the -expensive, decisive check. +expensive, decisive check. Only the linter needs a `pip install` here — the tests run on Python's +standard-library `unittest` runner from Module 13, so there's nothing to install for them. This file lives *in the repo*, committed and versioned like everything else. That's deliberate and on-thesis: your pipeline is code, it's reviewed as a diff in a PR (Module 10), and a teammate or an @@ -151,9 +152,9 @@ When CI goes red, the skill is triage, and it's fast once you know the shape: 2. **The first red step is the cause.** Steps run in order and stop at the first failure; everything after it is skipped, not broken. Don't get distracted by the skipped steps. 3. **Read that step's log.** It's the same output the tool prints in your terminal — a failing - `pytest` assertion, a `ruff` finding with a file and line number. CI didn't invent a new error + `unittest` assertion, a `ruff` finding with a file and line number. CI didn't invent a new error format; it's showing you the command's own output. -4. **Reproduce it locally.** Run the exact command from the failed step (`pytest -q` or +4. **Reproduce it locally.** Run the exact command from the failed step (`python -m unittest` or `ruff check .`) on your machine. It will fail the same way, because CI ran the same command. Fix it locally, confirm it's green locally, push again. 
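The "first red step is the cause" rule in the triage steps above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular forge implements its runner: `run_steps` is a hypothetical helper, and the commands are stand-ins that simply exit with a chosen code.

```python
import subprocess
import sys


def run_steps(steps: list[tuple[str, list[str]]]) -> dict[str, str]:
    """Run named steps in order; after the first failure, mark the rest skipped."""
    statuses: dict[str, str] = {}
    failed = False
    for name, command in steps:
        if failed:
            statuses[name] = "skipped"  # never ran, so it says nothing about the code
            continue
        code = subprocess.run(command, capture_output=True).returncode
        statuses[name] = "passed" if code == 0 else "failed"
        failed = code != 0
    return statuses


if __name__ == "__main__":
    # Stand-in commands: each one just exits with the given code.
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_steps([("Install tools", ok), ("Lint", bad), ("Test", ok)]))
```

In the example run, `Lint` is the failed step and `Test` shows up as skipped rather than failed, which is why the triage starts at the first red step and ignores everything after it.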
@@ -225,14 +226,21 @@ your machine first. ```bash cd ~/workflow-course/tasks-app - pip install pytest ruff - pytest -q # should report all tests passing - ruff check . # should report no issues (or fix what it flags) + pip install ruff + python -m unittest # should report all tests passing + ruff check . # should report no issues (or fix what it flags) ``` If both are clean locally, CI will be green. If not, fix it here — it's faster than waiting on a runner. + > **If `pip install` is refused** with "externally-managed-environment" (PEP 668 — common on + > recent Debian/Ubuntu and Homebrew Python), install into a per-project virtual environment + > instead: `python3 -m venv .venv && source .venv/bin/activate` (Windows: + > `.venv\Scripts\activate`), then re-run `pip install ruff`. Only the linter needs installing — the + > stdlib `unittest` runner needs nothing. (`pipx` or `pip install --break-system-packages` also + > work; a venv is the clean default.) + ### Part B — Add the workflow and watch it pass 2. Put the workflow where your forge looks for it: @@ -288,7 +296,7 @@ and watch CI stop it. bad one, instead of rewriting history other people may have pulled. ```bash - pytest -q # fails locally too — same command, same failure + python -m unittest # fails locally too — same command, same failure git revert HEAD # new commit that undoes "Simplify pending()" (Module 12) git push # CI re-runs on the fixed code and goes green again ``` @@ -371,5 +379,6 @@ Re-check at build time: - [ ] **Forge UI labels.** The tab names in the lab ("Actions," "CI/CD," "Pipelines") and the workflow file locations (`.github/workflows/`, `.gitlab-ci.yml`, `.forgejo/`, `.gitea/`) match what the current forge versions actually use. -- [ ] **Tool names.** The example linter and test runner (`ruff`, `pytest`) are current, installable, - and still behave as described — or swap in the equivalents the rest of the course uses. 
+- [ ] **Tool names.** The example linter (`ruff`) is current, installable, and still behaves as
+  described — or swap in the equivalent the rest of the course uses. (The test runner is Python's
+  standard-library `unittest`, which ships with Python — no install, nothing to drift.)
diff --git a/modules/14-continuous-integration/lab/ci-starter.yml b/modules/14-continuous-integration/lab/ci-starter.yml
index 727e068..8655e4f 100644
--- a/modules/14-continuous-integration/lab/ci-starter.yml
+++ b/modules/14-continuous-integration/lab/ci-starter.yml
@@ -33,14 +33,16 @@ jobs:
         with:
           python-version: "3.12"

-      # Step 3: install the tools the checks need — the test runner and the linter from Module 13.
+      # Step 3: install the linter (ruff), the new tool this module adds. The test runner is
+      # Python's standard-library unittest from Module 13 — nothing to install for it.
       - name: Install tools
-        run: pip install pytest ruff
+        run: pip install ruff

       # Step 4: lint. Style and obvious-mistake check. Fails the job on any finding (non-zero exit).
       - name: Lint
         run: ruff check .

-      # Step 5: test. The Module 13 tests. A single failing assertion fails the whole job.
+      # Step 5: test. The Module 13 tests, run with the stdlib unittest runner. A single failing
+      # assertion fails the whole job.
       - name: Test
-        run: pytest -q
+        run: python -m unittest
diff --git a/modules/14-continuous-integration/lab/gitlab-ci-starter.yml b/modules/14-continuous-integration/lab/gitlab-ci-starter.yml
index 9f0f9a9..5ebfce5 100644
--- a/modules/14-continuous-integration/lab/gitlab-ci-starter.yml
+++ b/modules/14-continuous-integration/lab/gitlab-ci-starter.yml
@@ -17,6 +17,6 @@ check:
   # of "runs-on: ubuntu-latest" plus "set up Python".
   image: python:3.12
   script:
-    - pip install pytest ruff
-    - ruff check .   # lint
-    - pytest -q      # test
+    - pip install ruff
+    - ruff check .          # lint
+    - python -m unittest    # test (stdlib runner from Module 13 — nothing to install)
diff --git a/modules/14-continuous-integration/lab/test_tasks.py b/modules/14-continuous-integration/lab/test_tasks.py
index 5e10fc0..02c7f95 100644
--- a/modules/14-continuous-integration/lab/test_tasks.py
+++ b/modules/14-continuous-integration/lab/test_tasks.py
@@ -1,36 +1,41 @@
 """Tests for the tasks-app core logic — the kind of suite Module 13 has you write.

 Reproduced here so this module's lab is self-contained: if you already wrote tests in Module 13,
-use those instead. Run locally with `pytest -q` from the project folder. CI runs exactly this.
+use those instead. Standard-library `unittest`, exactly like Module 13 — nothing to install.
+Run locally with `python -m unittest` from the project folder. CI runs exactly this.
 """

+import unittest
+
 from tasks import TaskList


-def test_add_appends_a_task():
-    tl = TaskList()
-    tl.add("write the CI lesson")
-    assert len(tl.tasks) == 1
-    assert tl.tasks[0].title == "write the CI lesson"
-    assert tl.tasks[0].done is False
+class TestTaskList(unittest.TestCase):
+    def test_add_appends_a_task(self):
+        tl = TaskList()
+        tl.add("write the CI lesson")
+        self.assertEqual(len(tl.tasks), 1)
+        self.assertEqual(tl.tasks[0].title, "write the CI lesson")
+        self.assertFalse(tl.tasks[0].done)
+
+    def test_complete_marks_a_task_done(self):
+        tl = TaskList()
+        tl.add("ship it")
+        tl.complete(0)
+        self.assertTrue(tl.tasks[0].done)
+
+    def test_pending_excludes_completed_tasks(self):
+        tl = TaskList()
+        tl.add("a")
+        tl.add("b")
+        tl.complete(0)
+        pending = tl.pending()
+        self.assertEqual(len(pending), 1)
+        self.assertEqual(pending[0].title, "b")
+
+    def test_render_is_friendly_when_empty(self):
+        self.assertEqual(TaskList().render(), "(no tasks yet)")


-def test_complete_marks_a_task_done():
-    tl = TaskList()
-    tl.add("ship it")
-    tl.complete(0)
-    assert tl.tasks[0].done is True
-
-
-def test_pending_excludes_completed_tasks():
-    tl = TaskList()
-    tl.add("a")
-    tl.add("b")
-    tl.complete(0)
-    pending = tl.pending()
-    assert len(pending) == 1
-    assert pending[0].title == "b"
-
-
-def test_render_is_friendly_when_empty():
-    assert TaskList().render() == "(no tasks yet)"
+if __name__ == "__main__":
+    unittest.main()
diff --git a/modules/15-security-scanning/README.md b/modules/15-security-scanning/README.md
index 8809f73..f52090f 100644
--- a/modules/15-security-scanning/README.md
+++ b/modules/15-security-scanning/README.md
@@ -220,6 +220,12 @@ and wire the catch into your pipeline.
    pip install pip-audit detect-secrets
    ```

+   > **If `pip install` is refused** with "externally-managed-environment" (PEP 668 — common on
+   > recent Debian/Ubuntu and Homebrew Python), install into a per-project virtual environment
+   > instead: `python3 -m venv .venv && source .venv/bin/activate` (Windows: `.venv\Scripts\activate`),
+   > then re-run the install. (`pipx` or `pip install --break-system-packages` also work; a venv is the
+   > clean default.)
+
    These are concrete, currently-maintained examples of the **SCA** and **secret-scanning**
    categories — not the only choices (see *Where it breaks* and *Verify-before-publish*). The lab
    teaches the moves; the moves transfer to any tool in the category.
diff --git a/modules/16-containers-and-reproducible-environments/README.md b/modules/16-containers-and-reproducible-environments/README.md
index 27e4799..6623d10 100644
--- a/modules/16-containers-and-reproducible-environments/README.md
+++ b/modules/16-containers-and-reproducible-environments/README.md
@@ -221,12 +221,13 @@ containerize and run the app you already have.
    ```

    That's a clean Python with none of your code.
Now confirm CI-grade reproducibility — run the - Module 14 test suite in a clean, throwaway container that mounts your code but installs its tools - fresh (no test tools baked into your app image — that keeps it lean; see *Where it breaks*): + Module 14 test suite in a clean, throwaway container that mounts your code and runs it with the + standard-library `unittest` runner: nothing to install, and no test tooling baked into your app + image (that keeps it lean; see *Where it breaks*): ```bash docker run --rm -v "$PWD":/app -w /app python:3.12-slim \ - sh -c "pip install pytest -q && pytest -q" + python -m unittest ``` This is, in miniature, exactly what containerized CI does. If it passes here, it passes the same diff --git a/modules/20-mcp-servers-giving-the-ai-hands/README.md b/modules/20-mcp-servers-giving-the-ai-hands/README.md index 2f3eca6..b1dec3f 100644 --- a/modules/20-mcp-servers-giving-the-ai-hands/README.md +++ b/modules/20-mcp-servers-giving-the-ai-hands/README.md @@ -239,10 +239,40 @@ is the one that lands the concept. - Your agentic coding tool from Module 4, which is the **MCP client**. Find, in its docs, *where it reads MCP server configuration* and *how it shows that a server is connected* (often a list of connected servers or available tools). -- Python 3.10+ and the official MCP Python SDK: `pip install "mcp[cli]"`. +- Python 3.10+ and the official MCP Python SDK, installed into a virtual environment — read the + **Python packages and which `python`** note just below *before* you run `pip`. - The starter files in this module's `lab/` folder: `tasks_mcp_server.py` and `mcp-config-example.json`. +> **Python packages and which `python`.** This lab's one dependency is the MCP SDK, and *how* you +> install it decides whether the server ever connects. Two things bite people: +> +> - **PEP 668 ("externally-managed-environment").** On modern Debian/Ubuntu and Homebrew Python, a +> global `pip install` is refused on purpose. 
The clean fix is a virtual environment per project: +> +> ```bash +> cd ~/workflow-course/tasks-app +> python3 -m venv .venv # one-time +> source .venv/bin/activate # Windows: .venv\Scripts\activate +> python3 -m pip install "mcp[cli]" +> ``` +> +> (If you'd rather not manage a venv: `pipx`, or `pip install --break-system-packages` — but a venv +> is the clean default and keeps this lab's dependency out of your system Python.) +> - **The install interpreter must match the config's launch command.** Your MCP client starts the +> server by running the `"command"` in its config — *not* your activated shell — so activating a +> venv does nothing to help the client find the SDK. You must point `"command"` at the venv's +> **absolute** python path (e.g. `~/workflow-course/tasks-app/.venv/bin/python`, or +> `...\.venv\Scripts\python.exe` on Windows). If they don't match, the server dies on `import mcp` +> and your tool just says "not connected" with no obvious reason — the exact failure this lab is +> about avoiding. +> +> Before wiring anything, verify with the *same* interpreter the config will launch: +> +> ```bash +> ~/workflow-course/tasks-app/.venv/bin/python -c "import mcp; print('mcp ok')" +> ``` + ### Part A — Connect an existing server (warm-up, ~10 min) Before building anything, prove the plumbing works by connecting a server someone else already @@ -291,8 +321,8 @@ That's the entire client/server loop, end to end, with zero code you wrote. Now 2. Sanity-check it starts. From inside `tasks-app`: ```bash - pip install "mcp[cli]" # once - python tasks_mcp_server.py # it will sit there waiting for a client — that's correct + python3 -m pip install "mcp[cli]" # into the venv from the note above, once + python tasks_mcp_server.py # it will sit there waiting for a client — that's correct ``` It looks like it's hanging. It isn't — a stdio server waits for a client on its stdin/stdout. 
@@ -301,20 +331,26 @@ That's the entire client/server loop, end to end, with zero code you wrote. Now ### Part C — Wire it into your agentic tool 3. Open `lab/mcp-config-example.json`. Copy the `tasks` entry into wherever your tool reads MCP - config, and replace the path with the **absolute** path to your `tasks_mcp_server.py`. (Use - `python3` or a venv's python if that's what runs the SDK on your system.) + config. Set `"command"` to the **absolute path of the python that has `mcp` installed** — the venv + python from the note above, *not* a bare `python` — and set `args` to the **absolute** path to + your `tasks_mcp_server.py`: ```json "tasks": { - "command": "python", + "command": "/ABSOLUTE/PATH/TO/workflow-course/tasks-app/.venv/bin/python", "args": ["/ABSOLUTE/PATH/TO/workflow-course/tasks-app/tasks_mcp_server.py"] } ``` + (On Windows the venv python is `...\.venv\Scripts\python.exe`.) A bare `"command": "python"` is the + single most common reason the server "won't connect": the client launches whatever `python` is on + *its* PATH, which is usually not the interpreter that has the SDK. + 4. Reload your agentic tool and confirm it shows the `tasks` server **connected**, with `list_tasks` and `add_task` among its available tools. If it doesn't connect, the usual culprits are a wrong - path, the wrong `python`, or the SDK not installed for that interpreter — check the tool's MCP - logs. + path, the wrong `python`, or the SDK not installed for that interpreter — re-run the + `... .venv/bin/python -c "import mcp"` check from the note above against the *exact* path you put + in `"command"`, then check the tool's MCP logs. 
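The interpreter check in step 4 can also be scripted, which is handy when you want to test the exact string you pasted into `"command"`. A minimal sketch, assuming nothing beyond the standard library; `can_import` is a hypothetical helper name, not part of the MCP SDK or of your agentic tool.

```python
import subprocess
import sys


def can_import(interpreter: str, module: str) -> bool:
    """Return True if running `interpreter -c "import module"` succeeds."""
    try:
        result = subprocess.run(
            [interpreter, "-c", f"import {module}"],
            capture_output=True,
            text=True,
        )
    except OSError:
        return False  # the path does not exist or is not executable
    return result.returncode == 0


if __name__ == "__main__":
    # Replace sys.executable with the absolute path from your MCP config's "command".
    print("mcp importable:", can_import(sys.executable, "mcp"))
```

Pass it the same absolute path the client will launch. A `False` here points at the wrong interpreter or a missing install before you go digging through the tool's MCP logs.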
### Part D — Watch the AI use its new hands diff --git a/modules/20-mcp-servers-giving-the-ai-hands/lab/mcp-config-example.json b/modules/20-mcp-servers-giving-the-ai-hands/lab/mcp-config-example.json index d08b896..b9e6650 100644 --- a/modules/20-mcp-servers-giving-the-ai-hands/lab/mcp-config-example.json +++ b/modules/20-mcp-servers-giving-the-ai-hands/lab/mcp-config-example.json @@ -1,8 +1,8 @@ { - "_comment": "Common shape of an MCP server entry for a local (stdio) server. Many agentic tools accept this 'mcpServers' map; yours may use a different key or location (check its docs). Replace the path with the ABSOLUTE path to tasks_mcp_server.py in your tasks-app. Use 'python3' instead of 'python' if that's what your system calls it, or the full path to a virtualenv's python.", + "_comment": "Common shape of an MCP server entry for a local (stdio) server. Many agentic tools accept this 'mcpServers' map; yours may use a different key or location (check its docs). IMPORTANT: 'command' must be the ABSOLUTE path to the python interpreter that has the MCP SDK installed (e.g. your venv's python) -- a bare 'python' makes the client launch whatever is on its PATH, which usually does NOT have the SDK, and the server then reports 'not connected'. On Windows the venv python is ...\\.venv\\Scripts\\python.exe. Set 'args' to the ABSOLUTE path to tasks_mcp_server.py in your tasks-app.", "mcpServers": { "tasks": { - "command": "python", + "command": "/ABSOLUTE/PATH/TO/workflow-course/tasks-app/.venv/bin/python", "args": ["/ABSOLUTE/PATH/TO/workflow-course/tasks-app/tasks_mcp_server.py"] } } diff --git a/modules/21-skills-teaching-the-ai-your-playbook/README.md b/modules/21-skills-teaching-the-ai-your-playbook/README.md index f752c53..428cda3 100644 --- a/modules/21-skills-teaching-the-ai-your-playbook/README.md +++ b/modules/21-skills-teaching-the-ai-your-playbook/README.md @@ -234,10 +234,13 @@ seen, producing all four parts without you listing the steps. 
```bash git log --oneline add-command.md # the procedure's own history - git diff HEAD~1 add-command.md # if you tightened it in Part C — your workflow change as a diff + git log -p -- add-command.md # full patch history: the file's creation, plus the Part C tighten if you made one ``` - That diff *is* a change to how your team adds commands — readable, attributable, revertable. In a + (`git log -p` surfaces the skill's own patches no matter what you committed *after* tightening it — + unlike `git diff HEAD~1`, which would be empty here because the most recent commit added the second + *command*, not a change to the skill.) Each entry in that history *is* a change to how your team adds + commands — readable, attributable, revertable. In a team repo (Modules 8, 11) it reaches everyone on `git pull`; behind review (Module 10) it lands as a PR someone approves. You've turned a procedure you used to narrate into a versioned capability. diff --git a/modules/23-working-with-existing-codebases/README.md b/modules/23-working-with-existing-codebases/README.md index e6b2752..6b148c1 100644 --- a/modules/23-working-with-existing-codebases/README.md +++ b/modules/23-working-with-existing-codebases/README.md @@ -170,8 +170,12 @@ This lab does **not** use `tasks-app` — the entire point is a codebase you *di - Git, Python 3.10+, and your agentic AI tool from Module 4. - A real, small-to-medium open-source repo to clone. Pick something with **tests** and a clear build/test command, in a language you can at least read. Good traits: a few thousand lines, an - obvious entry point, a green test suite. (Avoid giant frameworks for a first run — you want a + obvious entry point, a documented install (`pip install -e .`, `npm install`, `go mod download`, + …), and a test suite that **goes green on a clean clone after that documented install** — confirm + that before you rely on it as a baseline. 
(Avoid giant frameworks for a first run — you want a system you can't fully hold in your head, but whose test suite finishes in under a minute.) + **First time? Pick a small Python repo**, so the Module 13 testing toolchain you already have + transfers with the least friction. - The starter files from this module's `lab/` folder: `orient.py` and `skills/`. ### Part A — Clone and orient @@ -205,9 +209,14 @@ This lab does **not** use `tasks-app` — the entire point is a codebase you *di ### Part C — One small, scoped, tested change 6. Pick a genuinely small change — a clearer error message, a fixed edge case, a tiny missing - validation, a documented-but-unhandled input. Something a single function owns. Run the existing - tests first to establish a green baseline (`pytest`, `npm test`, `go test ./...` — whatever - `ORIENT.md` and the README confirmed). + validation, a documented-but-unhandled input. Something a single function owns. First **install + the project's dependencies** the way its README says — typically `pip install -e .` (Python), + `npm install` (JS/TS), `go mod download` (Go), or the equivalent — *then* run the existing tests + to establish a green baseline (`python -m unittest`, `pytest`, `npm test`, `go test ./...` — + whatever `ORIENT.md` and the README confirmed). A fresh clone usually won't run green until its + deps are installed; if it still won't go green on a clean clone *after* a documented install, + that's a setup problem, not your baseline — pick another repo rather than change code on top of an + environment you can't trust. 7. 
Branch, then load the `safe-change` skill (`lab/skills/safe-change.md`) and work the change with the AI: diff --git a/modules/23-working-with-existing-codebases/lab/skills/safe-change.md b/modules/23-working-with-existing-codebases/lab/skills/safe-change.md index cc96679..f35c835 100644 --- a/modules/23-working-with-existing-codebases/lab/skills/safe-change.md +++ b/modules/23-working-with-existing-codebases/lab/skills/safe-change.md @@ -22,8 +22,12 @@ When making a concrete change to an unfamiliar repo. 1. **State the change in one sentence** and the acceptance criterion ("done when X"). 2. **Find the blast radius first:** search for every caller/usage of what you're about to touch. List them. If you can't enumerate them, you're not ready to change it. -3. **Run the existing tests before touching anything** — establish a green baseline. If they were - already red, note it; don't let a pre-existing failure get blamed on you. +3. **Install the project's dependencies, then run the existing tests before touching anything** — + establish a green baseline. Tell two failures apart: if the suite errors with missing imports, + "no module named …", or "no tests ran," that's an **unconfigured environment**, not a baseline — + finish the documented install (and pick a different repo if it still won't go green on a clean + clone). A genuine **pre-existing failure** (install succeeded, but a real test fails) is the other + case — note it so it doesn't get blamed on you, and don't build on top of it. 4. **Make the minimal edit.** Keep it to the files identified in step 2. 5. **Add or extend a test** that fails without your change and passes with it. 6. **Run the full suite.** All green, including the baseline tests. 
diff --git a/modules/24-assistive-agents/README.md b/modules/24-assistive-agents/README.md
index 1741472..c8d86b6 100644
--- a/modules/24-assistive-agents/README.md
+++ b/modules/24-assistive-agents/README.md
@@ -214,6 +214,10 @@ You're reviewing a branch that adds a `clear` command to the tasks-app. The diff
    python reviewer.py apply my-review.json
    ```

+   (If your assistant wrapped the JSON in a ```` ```json ```` code fence even though the prompt said
+   "JSON only," don't worry — `apply` tolerates a fenced or prose-wrapped response and reads the JSON
+   out of it.)
+
 4. **Make the human decision.** Open `feature.patch` and check the agent's headline claim: the
    `clear` branch in `cli.py` never calls `save(tlist)`, so it prints "cleared all tasks" while
    `tasks.json` is untouched — a silent no-op, the exact kind of plausibility trap Module 10 trained
diff --git a/modules/24-assistive-agents/lab/reviewer.py b/modules/24-assistive-agents/lab/reviewer.py
index 51321d1..3573dfa 100644
--- a/modules/24-assistive-agents/lab/reviewer.py
+++ b/modules/24-assistive-agents/lab/reviewer.py
@@ -19,6 +19,24 @@ from pathlib import Path

 HERE = Path(__file__).parent

+
+def load_json_response(path: Path):
+    """Parse the JSON the AI returned.
+
+    Chat assistants very often wrap their output in a ```json ... ``` code fence (or add a line of
+    prose) even when told to "return only the JSON" — so a strict json.loads on the raw paste fails
+    on the most likely real output. Try a strict parse first; if that fails, fall back to the
+    outermost { ... } block, which survives a code fence or surrounding text. Stdlib only."""
+    raw = path.read_text()
+    try:
+        return json.loads(raw)
+    except json.JSONDecodeError:
+        start, end = raw.find("{"), raw.rfind("}")
+        if start != -1 and end > start:
+            return json.loads(raw[start : end + 1])
+        raise
+
+
 PROMPT_HEADER = """\
 You are an assistive code reviewer. Follow the rubric below exactly, then review the diff that
 follows it. Return ONLY the JSON object the rubric specifies — no prose before or after.
@@ -40,7 +58,7 @@ def cmd_prompt(args: argparse.Namespace) -> int:

 def cmd_apply(args: argparse.Namespace) -> int:
     try:
-        review = json.loads(Path(args.response).read_text())
+        review = load_json_response(Path(args.response))
     except (json.JSONDecodeError, FileNotFoundError) as exc:
         print(f"error: could not read a JSON review from {args.response}: {exc}")
         return 1
diff --git a/modules/24-assistive-agents/lab/triage.py b/modules/24-assistive-agents/lab/triage.py
index 84a6398..f2d2854 100644
--- a/modules/24-assistive-agents/lab/triage.py
+++ b/modules/24-assistive-agents/lab/triage.py
@@ -39,6 +39,23 @@ def allowed_labels(taxonomy_text: str) -> set[str]:
     return set(LABEL_RE.findall(taxonomy_text))


+def load_json_response(path: Path):
+    """Parse the JSON the AI returned.
+
+    Chat assistants very often wrap their output in a ```json ... ``` code fence (or add a line of
+    prose) even when told to "return only the JSON" — so a strict json.loads on the raw paste fails
+    on the most likely real output. Try a strict parse first; if that fails, fall back to the
+    outermost { ... } block, which survives a code fence or surrounding text. Stdlib only."""
+    raw = path.read_text()
+    try:
+        return json.loads(raw)
+    except json.JSONDecodeError:
+        start, end = raw.find("{"), raw.rfind("}")
+        if start != -1 and end > start:
+            return json.loads(raw[start : end + 1])
+        raise
+
+
 def cmd_prompt(args: argparse.Namespace) -> int:
     taxonomy = Path(args.taxonomy).read_text()
     issue = Path(args.issue).read_text()
@@ -49,7 +66,7 @@ def cmd_prompt(args: argparse.Namespace) -> int:
 def cmd_apply(args: argparse.Namespace) -> int:
     allowed = allowed_labels(Path(args.taxonomy).read_text())
     try:
-        sug = json.loads(Path(args.response).read_text())
+        sug = load_json_response(Path(args.response))
     except (json.JSONDecodeError, FileNotFoundError) as exc:
         print(f"error: could not read a JSON suggestion from {args.response}: {exc}")
         return 1
diff --git a/modules/27-evals/README.md b/modules/27-evals/README.md
index dc54049..9ae1125 100644
--- a/modules/27-evals/README.md
+++ b/modules/27-evals/README.md
@@ -287,16 +287,35 @@ agent's output.
 6. Decide the autonomy for this task using the ladder in Key concepts. Write one sentence:
    *"`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a
-   human reviews."* Then make it enforceable — this is one line in a CI workflow (Module 14):
+   human reviews."* Then make it enforceable — this is one job in a CI workflow (Module 14), running
+   the exact command you ran in Parts A–B:

    ```yaml
    - name: Eval gate
-     run: python modules/27-evals/lab/run_eval.py candidates/current_model --threshold 1.0
+     working-directory: modules/27-evals/lab
+     run: python run_eval.py candidates/current_model --threshold 1.0
    ```

+   The `working-directory:` line makes the CI job `cd` into the lab folder first, so the
+   `candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they
+   did on your machine.
(Drop it and point a repo-root job straight at + `python modules/27-evals/lab/run_eval.py candidates/current_model` instead, and `candidates/` + won't exist from the repo root — the gate crashes with a *false* failure, which is worse than no + gate. If you'd rather keep a single line, spell both paths out from the repo root: + `python modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model + --threshold 1.0`.) + Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail is now structural, not a promise. + **One honest caveat, or this gate guards nothing.** `candidates/current_model` is the bundled, + always-correct stand-in — it scores 100% on every run, forever, so a gate pointed at it can never + fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real + pipeline, point the gate at the candidate that actually *varies* — your agent's real output for + this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the + model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the + same command drops to 60%, exits `1`, and blocks the merge. + --- ## Where it breaks