AI Made Writing Code Cheap. Now Automate the Catching.

Here's a thing that should worry you a little more than it does: AI is fast, and most of what makes it fast also makes it dangerous. It writes a function in three seconds. It also writes a wrong function in three seconds, one that reads beautifully, uses the right names, follows your conventions, and ships a flipped comparison you'll never catch by skimming. The generation got cheap. The catching didn't, unless you make it.

That's this whole unit, and it's the post where The Workflow shifts gears. The first half of the course was about getting out of the chat window and making your work shareable and recoverable: Git as undo for the AI, hosting, review. Useful, foundational, a little slow-burn. This is where it speeds up. Seven modules, one job: build the machine that checks AI's work and ships it, automatically, so AI's speed becomes shipped software instead of shipped risk.

If you run infrastructure for a living, the punchline lands early and it lands hard, so I'll spoil it now: by the end of this unit you own a pipeline end to end. Tests, gates, containers, deploys, and the actual compute underneath. Not "I use someone's CI." Yours. Let me walk the arc.

It starts with tests, because AI output needs a witness

The unit opens on testing, and the reframe is sharper than the usual "you should write tests" sermon. Normal buggy code looks buggy: odd naming, weird structure, a tripwire your eye catches. AI code removes that tripwire. The buggy version and the correct version look equally clean, because "looks like correct code" is roughly what the model was trained to produce. You can read a wrong implementation three times and approve it.

A test doesn't read the code. It runs it and checks the result. It's immune to plausibility, which is exactly the signal AI just defeated.

And here's the happy turn that makes the whole unit feel less like eating your vegetables: the same AI that produces the risk is genuinely excellent at writing the tests that catch it. The chore that used to keep people from having a real suite (the tedious boilerplate) is now nearly free. The skill moves from writing tests to directing them. With one trap to avoid, and it's a doozy:

Weak prompt: "Write unit tests for the pending_count method." You'll get tests that assert whatever the code currently does. If the code is wrong, the test faithfully certifies the wrong answer. Now you've got a green checkmark on a bug.
Strong prompt: "pending_count should return the number of tasks that are still pending. Test these cases and derive the expected numbers from that description, not the current code: empty list → 0; two added, none done → 2; two added, one done → 1; one added then completed → 0."

That "one done" case is the one where a correct implementation and a buggy one give different answers. The whole craft in one sentence: a test that can't fail isn't testing anything. When the AI hands you code and tests, review the tests first, and review them by asking "would this fail if the code were wrong?", not "do these pass?" Passing is the easy part.
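Here's roughly what the strong prompt should produce, as a minimal sketch. The TaskList class and its add and complete methods are my stand-ins for illustration, not the course's actual code; only pending_count and the four cases come from the prompt above.

```python
import unittest


class TaskList:
    """Hypothetical stand-in for the task store; only pending_count is from the text."""

    def __init__(self):
        self._done = {}  # task title -> completed flag

    def add(self, title):
        self._done[title] = False

    def complete(self, title):
        self._done[title] = True

    def pending_count(self):
        # A buggy version that returned len(self._done) would read just as cleanly
        # and would still pass the first two tests below.
        return sum(1 for done in self._done.values() if not done)


class PendingCountTest(unittest.TestCase):
    """Expected values come from the written description, not from the code."""

    def test_empty_list(self):
        self.assertEqual(TaskList().pending_count(), 0)

    def test_two_added_none_done(self):
        tasks = TaskList()
        tasks.add("a")
        tasks.add("b")
        self.assertEqual(tasks.pending_count(), 2)

    def test_two_added_one_done(self):
        # The discriminating case: a count-everything bug returns 2 here.
        tasks = TaskList()
        tasks.add("a")
        tasks.add("b")
        tasks.complete("a")
        self.assertEqual(tasks.pending_count(), 1)

    def test_one_added_then_completed(self):
        tasks = TaskList()
        tasks.add("a")
        tasks.complete("a")
        self.assertEqual(tasks.pending_count(), 0)
```

Save it as a test file and run python3 -m unittest. Then swap the body of pending_count for the len version and watch the last two tests go red. If nothing goes red when you break the code on purpose, the suite isn't testing anything.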

CI: the reviewer that doesn't skim

A test file sitting in your repo is useful right up until you forget to run it, which, like every manual check, you eventually will. Continuous Integration removes the "eventually." It's a grand name for a mundane core: the same checks you'd run by hand (lint, build, test) bound to a trigger, on a clean machine you don't control, on every single push.

The magic is entirely in the word "automatically." You don't run CI; pushing runs it. It can't be skipped by forgetting, it doesn't get tired on the fortieth push of the day, and its whole enforcement mechanism is the humble exit code: python3 -m unittest returns non-zero when a test fails, and one non-zero exit turns the run red. The actual config is shorter than this paragraph:

name: CI
on: [push, pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff
      - run: ruff check .
      - run: python3 -m unittest

That's a real, working pipeline. Cheap check first (the linter, three seconds), expensive check last (the tests). The reason this matters more with AI is the same reason tests do: AI raises your push rate and lowers how carefully each diff gets read. Manual pre-push discipline doesn't survive that volume. An automated gate scales for free. CI is the reviewer that runs the code instead of believing the diff.
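If the exit-code mechanism feels abstract, here is a rough sketch of what the forge is doing with those run steps, written as a local script. The run_checks function and the CHECKS list are names I made up for illustration; a real CI system does much more (clean machine, logs, status reporting), but the pass/fail logic is essentially this.

```python
import subprocess
import sys


def run_checks(checks):
    """Run (name, argv) checks in order; return the first non-zero exit code, else 0."""
    for name, argv in checks:
        code = subprocess.run(argv).returncode
        if code != 0:
            # One non-zero exit stops the run: later checks never start.
            print(f"FAILED: {name} (exit code {code})")
            return code
        print(f"ok: {name}")
    return 0


# Cheap check first, expensive check last, mirroring the workflow above.
# Assumes ruff is installed and the script runs from the repository root.
CHECKS = [
    ("lint", ["ruff", "check", "."]),
    ("test", ["python3", "-m", "unittest"]),
]
```

To use it as a pre-push check, call sys.exit(run_checks(CHECKS)) at the bottom of the script. It's a fine habit, but notice it only works when you remember to run it, which is the whole argument for letting the push trigger it instead.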

[insert a screenshot referencing a CI run going from red to green on a forge, with the failed Test step expanded here]

Then the gates AI specifically needs: security scanning

Your build is green and your tests pass. Is the code safe? Different question, and CI structurally can't answer it. This is the module where the AI angle stops being "more of the same" and gets genuinely novel, because AI doesn't just fail to prevent security problems; it actively manufactures three of them:

It hardcodes secrets. Ask for code that calls an authenticated API and the model cheerfully writes API_KEY = "sk-live-..." into the source, because that makes the example run, and "make it run" is what it optimizes for. It has no instinct that the string is dangerous.
It reproduces insecure idioms (string-concatenated SQL, weak crypto) with total confidence, because a million tutorials did it that way and insecure code looks plausible.
And the one that should make the hair stand up: it invents dependencies that don't exist. LLMs generate plausible text, and a package name is plausible text. The model will confidently pull in a package like requests-oauth or task-store-client: a name that sounds exactly right but may never have been published, or may not be the project it seems to be.

That last one has a name now: slopsquatting. Attackers watch which fake package names LLMs habitually invent (and they invent the same plausible names repeatedly), then register those exact names on the public index with malware inside. The next developer who pastes AI output and runs pip install -r requirements.txt pulls the payload, which runs with their privileges, in their dev environment or, worse, in CI. It's a supply-chain attack that exists because of how LLMs fail. So the habit to build: a dependency the AI added is an untrusted claim until you've verified it's the real, intended, widely-used project. Treat the requirements file the AI hands you like a stranger handing you a USB stick. Then bolt three scanners onto your pipeline (dependency scanning, secret scanning, static analysis) so a planted key or a fake package turns the build red before it merges.
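To make "turns the build red" concrete, here's a toy version of two of those gates. Everything in it is an assumption for illustration: the VETTED_PACKAGES allowlist, the function names, and the single deliberately narrow regex. Real dependency and secret scanners ship far more rules than this, so read it as the shape of the check, not a replacement for one.

```python
import re

# Hypothetical allowlist: packages a human has verified are the real, intended project.
VETTED_PACKAGES = {"requests", "ruff"}

# Illustrative pattern only: a secret-looking name assigned to a string literal.
SECRET_PATTERN = re.compile(
    r"""(api_key|secret|token|password)\s*=\s*["'][^"']{8,}["']""",
    re.IGNORECASE,
)


def unvetted_dependencies(requirements_text, vetted=VETTED_PACKAGES):
    """Return package names from a requirements file that are not on the allowlist."""
    unknown = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r or -e
        # Keep only the name before any version specifier, extras, or marker.
        name = re.split(r"[\s<>=!~;\[]", line, maxsplit=1)[0]
        # Normalize the way the package index does: lowercase, runs of -_. become -.
        name = re.sub(r"[-_.]+", "-", name).lower()
        if name not in vetted:
            unknown.append(name)
    return unknown


def hardcoded_secrets(source_text):
    """Return 1-based line numbers that look like a secret assigned to a literal."""
    return [
        number
        for number, line in enumerate(source_text.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]
```

Wire either function into a script that exits non-zero when the returned list isn't empty, and it becomes a gate like any other CI step. The allowlist is the important design choice: a new name fails by default, and a human has to go look at the project before adding it.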

Containers: kill "works on my machine," and get a sandbox for agents

"Works on my machine" is a confession, not a defense. Your code never runs alone: it runs on top of an invisible stack of OS libraries, a runtime version, env vars, paths you've never written down. A container packages the code and that invisible stack into one artifact that runs the same on your laptop, in CI, and in production. You stop shipping the code and start shipping the machine. It dissolves the "passes locally, fails in CI" bug by construction: there's one environment now, not two that drift.

There's a forward-looking payoff here too, and it's the one I'd flag for anyone nervous about letting AI off the leash. A throwaway container is a blast-radius box for a command (or an agent) you don't fully trust:

docker run --rm --network none --read-only python:3.12-slim \
  sh -c "<the sketchy command the AI gave you>"

No network, no writes, destroyed on exit. The host never saw it. That's the practical foundation for running less-trusted agents later in the course. (One honest caveat the module hammers: a container is not a strong security boundary by default, because it shares the host kernel. It raises the cost of mischief; it's not a guarantee against a determined attacker.)
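If you end up doing this from a script or an agent harness, it may be worth building the command as an argument list so the untrusted string is never interpreted by a shell on the host. A small sketch, where sandbox_argv and preview are hypothetical helper names and the flags are the same three from the command above:

```python
import shlex


def sandbox_argv(command, image="python:3.12-slim"):
    """Build the argv for a throwaway, offline, read-only container run."""
    return [
        "docker", "run",
        "--rm",               # remove the container when the command exits
        "--network", "none",  # no network access
        "--read-only",        # root filesystem mounted read-only
        image,
        "sh", "-c", command,  # the untrusted command runs only inside the container
    ]


def preview(command):
    """Return the exact command line as a copy-pasteable, shell-quoted string."""
    return shlex.join(sandbox_argv(command))
```

Pass the list to subprocess.run(sandbox_argv(cmd)) to execute it, assuming Docker is installed. Same caveat as above: this limits the blast radius, it doesn't make the command safe.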

Secrets, then shipping, then the compute underneath

The last three modules close the loop. Secrets is the prevention for the AI failure you met in scanning: instead of catching the hardcoded key after the fact, you teach the AI the pattern up front ("never hardcode secrets; read from the environment; fail loudly if it's missing") and move config into the environment so the same built-once artifact runs in dev, staging, and prod with nothing but different variables injected. Gitignore the real .env, commit a .env.example template, and the leak window never opens.
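The "read from the environment; fail loudly if it's missing" pattern fits in a few lines. This is a minimal sketch; require_env and MissingConfigError are names I picked for illustration, and the variable name in the usage line is only an example.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised at startup when a required environment variable is absent."""


def require_env(name):
    """Return the value of an environment variable, or fail loudly if unset or empty."""
    value = os.environ.get(name)
    if not value:
        # Fail at startup with the variable's name, never with its value.
        raise MissingConfigError(f"Required environment variable {name} is not set")
    return value
```

In application code that becomes api_key = require_env("API_KEY") near the top of the program. Failing at startup is the point: a missing key should stop the deploy, not surface as a confusing authentication error three requests later.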

Continuous delivery and deployment answers the question CI doesn't: merged isn't running. It's more stages on the same pipeline: build a versioned image tagged by commit SHA, push it to a registry, deploy that exact artifact (never a rebuild on the prod box), health-check it, and roll back automatically when it's wrong. The distinction worth memorizing: continuous delivery keeps a human on the prod button; continuous deployment removes the button. And the AI-era posture falls right out of it: strengthen the early gates, then automate the late ones. Auto-deploy is only survivable because review, CI, and scanning sit in front of it. Take it without those gates and you've built a machine that ships AI mistakes to production at full speed.
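The deploy, health-check, roll-back loop is easier to reason about once you see how little logic is in it. A rough sketch under stated assumptions: run_image and is_healthy are callables you would supply (they stand in for whatever your platform uses to start a container and probe it), and every name here is hypothetical.

```python
import time


def image_ref(registry, name, commit_sha):
    """Tag the image by commit SHA so the deployed artifact is traceable to a commit."""
    return f"{registry}/{name}:{commit_sha}"


def deploy_with_rollback(new_image, previous_image, run_image, is_healthy,
                         attempts=3, delay_seconds=0.0):
    """Start new_image; if no health check passes, restart previous_image.

    Returns the image that is left running.
    """
    run_image(new_image)
    for _ in range(attempts):
        if is_healthy():
            return new_image
        time.sleep(delay_seconds)  # give the app time to come up before retrying
    # Every attempt failed: redeploy the exact artifact that was running before.
    run_image(previous_image)
    return previous_image
```

Note what the function never does: build anything. It only moves between images that already exist in the registry, which is what makes the rollback cheap. It also only rolls back the image, so anything the new version already did to your data stays done.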

And then runners, the module that delivers the IT-pro payoff this whole unit was building toward. Every green check in the previous five modules ran on someone else's computer. This is where you find out whose, and decide whether it should be yours. A runner is just a process on a machine that checks out your code and executes the YAML. Hosted runners are rented, clean-room, metered. A self-hosted runner runs the identical loop on hardware you own, and flipping to it is often one line:

# before, renting:
runs-on: ubuntu-latest
# after, your hardware, inside your network:
runs-on: [self-hosted, linux, internal-net]

That one line is the "I now own this pipeline" switch. You'd do it for real reasons (cost at volume, data that can't leave your perimeter, network line-of-sight to private systems a hosted runner can't reach, specialized hardware, air-gapped operation), not for the vibe. And it comes with the sharpest edge in the course: a runner executes arbitrary code, is persistent by default, and a self-hosted one wired into your network is a backdoor into that network if you're careless with it. Never casually attach one to a public repo. But owned and isolated properly, it's the thing that turns "I use a pipeline" into "I own the pipeline, end to end."
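How does a job find your machine? As I understand label-based routing on GitHub-Actions-style forges, a job can run on any runner that carries every label listed in runs-on. Here's a sketch of that matching rule, assuming that behavior; the function name and the runner inventory are invented for illustration, so check your forge's documentation for the exact rules.

```python
def eligible_runners(job_labels, runners):
    """Return names of runners whose labels include every label the job asks for."""
    wanted = {label.lower() for label in job_labels}
    return [
        name
        for name, labels in runners.items()
        if wanted <= {label.lower() for label in labels}  # subset test
    ]


# Hypothetical inventory: runner name -> labels assigned at registration.
RUNNERS = {
    "build-box-1": ["self-hosted", "linux", "internal-net"],
    "gpu-box": ["self-hosted", "linux", "gpu"],
}
```

With that inventory, eligible_runners(["self-hosted", "linux", "internal-net"], RUNNERS) selects only build-box-1. The practical consequence is that labels are your routing and your isolation: a label like internal-net should exist only on machines you are willing to let that repository's code run on.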

Where this unit breaks (the honest part)

I'd be doing you a disservice if I made this sound like a finish line. A few things to keep your skepticism calibrated:

A green pipeline is not a correct, safe codebase. Tests prove the behaviors you thought to test work. Scanners find the vulns they know about. "No findings" means "none of the things these tools know," not "secure." This unit narrows risk dramatically; it doesn't eliminate it, and it never replaces human review.
The gates are only as good as what's in them. CI is exactly as good as your test suite and no better. A scanner with no manifest to read is blind. A health check that returns 200 when the app started (but before it can serve a real request) lies to you.
Some things don't roll back. Reverting a running image is cheap. Reverting a database migration, a sent email, or a charged card is not. "We can always roll back" does not cover your data.
Don't over-build for a five-line script. Same honesty as the first post in this series: the toolchain earns its keep on real projects, meaning more than one file and more than one day. Don't bring a deploy pipeline to a throwaway utility.

But for anything real? This is the unit where AI's speed stops being a liability and starts being an asset. You're merging more code, faster, with less of it read line-by-line, because the AI made generation cheap. The one defense that scales with that volume is the one that doesn't depend on a human remembering to look. That's the whole pipeline. You don't build it despite using AI. Using AI is what moves it from "nice to have" to "required."

The model is the cheap, swappable part. The workflow around it is the skill that lasts, and this unit is a big, durable chunk of that workflow.

Your turn

We've crossed into the back half of the course now, and the pace picks up from here: this is the faster-moving material, the part where the tools come quicker and the payoff compounds. If you've built any piece of this pipeline on your own projects, I want to hear how it went, especially the slopsquatting bit, because I suspect a lot of people are one pip install away from a bad day and don't know it. Drop a comment, tell me where it clicked or where I lost you. I read them, and the rough edges you hit are what makes the course better.

Next up: Unit 4, where we stop defending against the AI and start extending it into your systems: MCP servers, skills, and pointing AI at a big codebase you didn't write.
