f743bb671b
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
# Module 19 — Runners: The Compute Behind the Automation

> **Every green check in the last five modules ran on someone else's computer. This module is where
> you find out whose — and decide whether it should be yours.** Owning the runner is what turns "I
> use a CI pipeline" into "I own the pipeline, end to end."

---

## Prerequisites

- **Module 8 — Remotes and Hosting.** You push to a forge, and you met the self-host track
  (Forgejo, Gitea, GitLab CE, and others). Self-hosted runners are the compute half of that same
  "own your own infrastructure" decision.
- **Module 14 — Continuous Integration.** You have a CI workflow that lints and tests `tasks-app`
  on every push. Module 14 mentioned, in passing, that the job runs on "a fresh, throwaway Linux
  machine the forge spins up." This module is the full accounting of that machine.
- **Module 18 — Continuous Delivery and Deployment.** The deploy jobs you automated there run on
  the same compute. Once you self-host, deploy steps get direct line-of-sight to your private
  infrastructure — a feature and a footgun, both covered here.
- Helpful but not required: **Module 16 — Containers**, since most runners execute jobs in
  containers and ephemeral runners lean on them.

You don't need to have read Module 18 in full — if you only have CI from Module 14, everything here
still lands. CD just gives you a second, higher-stakes reason to care where jobs run.

---

## Learning objectives

By the end of this module you can:

1. Explain what a runner *is* — the actual process and machine that executes your pipeline steps —
   and tell, for any job, whether it ran on hosted or self-hosted compute.
2. Make a reasoned hosted-vs-self-hosted decision for a given pipeline, on the five axes that
   actually move the needle: cost, data control, network reach, hardware, and air-gap/compliance.
3. Register a self-hosted runner against your forge and run the `tasks-app` CI job on it.
4. State, without flinching, the central security tradeoff: a self-hosted runner executes arbitrary
   code, is non-ephemeral by default, and can be a backdoor into your network — and name the
   mitigations that make it survivable.

---

## Key concepts

### A runner is just a computer that does what the YAML says

A runner is **a process, on some machine, that checks out your code and executes the steps in your
pipeline** — nothing more exotic than that. When your Module 14 workflow says "set up
Python, install pytest, run the tests," *something physical* has to do that — pull the repo onto a
disk, run `pip install`, run `pytest`, report pass or fail back to the forge. That something is the
runner.

The loop every runner runs, regardless of forge:

1. **Register** with the forge once, using a registration token, so the forge knows it exists.
2. **Poll** the forge: "got any jobs for me?"
3. When a job matches, **pull the code and the job definition**, then execute each step in order.
4. **Stream logs and the final status** (pass/fail) back to the forge.
5. Go to 2.

That's the whole machine. Everything else — hosted vs. self-hosted, ephemeral vs. persistent,
containerized vs. bare metal — is a variation on *which computer runs that loop and who owns it.*
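
To make the loop concrete, here is a toy model in shell. It is a sketch under loud assumptions: a local directory stands in for the forge's job queue, a `*.job` file holds one shell step per line, and the function names `poll_once` and `run_until_idle` are invented for illustration. No real runner agent works off files like this; only the poll, execute, report shape carries over.

```bash
#!/usr/bin/env bash
# Toy model of the runner loop. A directory plays the role of the forge's
# queue; all names are hypothetical, not any real agent's interface.

# poll_once QUEUE_DIR: claim the oldest *.job file, run it, report a status.
# Returns 1 when the queue is empty.
poll_once() {
  local queue="$1" job log status step
  job="$(find "$queue" -maxdepth 1 -name '*.job' | sort | head -n 1)"
  [ -n "$job" ] || return 1           # step 2: "got any jobs for me?" -> no

  log="${job%.job}.log"
  status="pass"
  # step 3: execute each line of the job definition in order, stop on failure
  while IFS= read -r step; do
    [ -n "$step" ] || continue
    echo "+ $step" >>"$log"
    if ! bash -c "$step" </dev/null >>"$log" 2>&1; then
      status="fail"
      break
    fi
  done <"$job"

  # step 4: report the final status (here, a file next to the log)
  echo "$status" >"${job%.job}.status"
  rm -f "$job"
}

# run_until_idle QUEUE_DIR: step 5 ("go to 2"), stopping once the queue drains.
# A real agent would sleep and keep polling instead of returning.
run_until_idle() {
  while poll_once "$1"; do :; done
}
```

Notice that `poll_once` runs whatever the job file contains, with whatever access the current user has. That is the property the rest of this module keeps returning to.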

### Hosted runners: you've been renting

Up to now, every job ran on a **hosted runner** — a machine the forge owns, spins up on demand, and
bills you for. This is the default and, for most work, the right default. What you're actually
getting:

- **A fresh, throwaway machine per job.** This is the property Module 14 leaned on: "works on my
  machine" can't hide, because the machine has *nothing of yours on it.* The job starts from a clean
  image and the machine is destroyed afterward. Clean room, every time.
- **No ops burden.** You don't patch it, scale it, or keep it online. It exists for the length of
  your job and then it's gone.
- **Metered billing.** You pay in **runner-minutes** — wall-clock time your jobs spend executing,
  usually with a free monthly allotment and then per-minute pricing above it. Different machine
  sizes (more CPU/RAM, GPUs) bill at higher multipliers.

For a small Python test suite, hosted is perfect. The job is short, needs nothing private, and the
clean-room property is pure upside. You will keep using hosted runners for most of what you do.

### Self-hosted runners: you own the computer

A **self-hosted runner** runs that exact same loop — register, poll, execute, report — but on a
machine *you* own: a spare server, a VM in your own cloud account, a box in your homelab, a beefy
workstation under a desk. You install the forge's runner agent, register it with a token, and it
starts pulling jobs. To the pipeline author, almost nothing changes; the workflow just targets your
runner instead of a hosted one (more on the targeting mechanic below).

This is the compute analogue of the Module 8 decision. There, you chose between pushing your repo to
a hosted forge versus self-hosting one. Here, you choose between renting compute to run your
pipeline versus owning it. Same instinct, applied one layer down.

### Why you'd run your own — the five real reasons

Don't self-host for the vibe of it. Self-host when one of these actually applies:

1. **Cost at volume.** Runner-minutes are cheap until they aren't. A heavy pipeline — large test
   matrices, container builds, long integration suites, or the AI eval/agent jobs from Unit 5 that
   call models on every run — can run the meter hard. If you already own idle hardware, a self-hosted
   runner turns "per-minute forever" into "electricity you're already paying for." (Verify the
   crossover with real numbers; see the checklist at the end.)

2. **Data control.** Hosted runners execute your code, with your secrets, on infrastructure you
   don't own. For a lot of work that's fine. For regulated data, customer data under contract, or a
   shop with a "source never leaves our perimeter" rule, it isn't. A self-hosted runner keeps the
   checkout, the build, and the secrets on hardware you control.

3. **Network access to private systems.** This is the one IT pros hit first and hardest. Your CD job
   (Module 18) needs to deploy to a server on your private network. Your tests need a database that
   lives on an internal VLAN. A hosted runner sits on the public internet and cannot reach any of
   that without you punching holes in your firewall. A self-hosted runner placed *inside* your
   network already has line-of-sight — no inbound holes, no VPN gymnastics. (This is also exactly why
   it's a security problem; hold that thought.)

4. **Custom or specialized hardware.** GPUs for ML work, a specific CPU architecture, more RAM than
   any hosted tier offers, a hardware security module, a USB device for hardware-in-the-loop tests.
   If your job needs hardware the forge doesn't rent, you bring your own.

5. **Air-gapped or fully on-prem operation.** A self-hosted forge (Module 8) on an isolated network
   has nowhere to send jobs *except* a self-hosted runner on that same network. There is no hosted
   option in an air gap. If your whole stack lives behind a wall, the runner lives there too.

If none of these apply, stay on hosted. "I want to" is not on the list.
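
The cost reason is the easiest of the five to check with arithmetic. The sketch below compares a hosted bill with a rough self-hosted cost. Every number you feed it is your own assumption: the functions hard-code no prices, and the billing model (multiplied minutes count against the free allotment, then a flat rate) is a simplification that may not match your forge's plan. The function names are made up for this example.

```bash
#!/usr/bin/env bash
# Rough rent-vs-own arithmetic in integer cents. Illustrative model only.

# hosted_monthly_cents MINUTES FREE_MINUTES CENTS_PER_1000_MIN MULTIPLIER
hosted_monthly_cents() {
  local minutes="$1" free="$2" rate="$3" mult="$4" billable
  billable=$(( minutes * mult - free ))
  if (( billable < 0 )); then billable=0; fi
  echo $(( billable * rate / 1000 ))
}

# self_hosted_monthly_cents OPS_HOURS HOURLY_CENTS POWER_CENTS
# Counts your own time (patching, babysitting the agent) plus electricity.
self_hosted_monthly_cents() {
  echo $(( $1 * $2 + $3 ))
}

# cheaper_option MINUTES FREE RATE MULT OPS_HOURS HOURLY_CENTS POWER_CENTS
cheaper_option() {
  local hosted self
  hosted="$(hosted_monthly_cents "$1" "$2" "$3" "$4")"
  self="$(self_hosted_monthly_cents "$5" "$6" "$7")"
  if (( self < hosted )); then echo "self-hosted"; else echo "hosted"; fi
}
```

For example, with placeholder figures of 12,000 minutes a month, 2,000 free minutes, 800 cents per 1,000 minutes, two hours of ops time at 5,000 cents an hour, and 1,500 cents of power, `cheaper_option 12000 2000 800 1 2 5000 1500` favors hosted. Raise the minutes to 50,000 and it flips. The point is to find where that flip happens for your own numbers.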

### The mechanic: register, target, run

The shape is the same on every forge; only the command names and config filenames differ. The
pattern, vendor-neutral:

- **Get a registration token** from the forge — at the repo, org, or instance level, in the
  forge's settings under its "Runners" or "CI/CD" section. On some forges the token is short-lived;
  either way, it proves you're allowed to attach a runner here.
- **Run the runner agent's register/config command** on your machine, pointing it at your forge URL
  and handing it the token. This writes a small local config/identity file and starts the agent
  polling. Concretely, the agent and command differ per forge — for example:
  - GitHub-style Actions: a `config` script that registers the agent, then a `run` script (or a
    service) that starts polling.
  - GitLab: a `gitlab-runner register` command, then the runner runs as a service.
  - Forgejo/Gitea: an `act_runner register` command (Actions-compatible), then `act_runner daemon`.

  All three do the same two things: *register an identity*, then *start the poll loop.* Don't memorize
  the flags — read your forge's runner docs at build time (the commands drift; see the checklist).
- **Label the runner and target it from the workflow.** A runner advertises **labels** (e.g.
  `self-hosted`, `linux`, `gpu`, `internal-net`). Your job selects runners by label — in
  Actions-style YAML that's the `runs-on:` field; in GitLab it's `tags:`. So changing a job from
  hosted to your own runner is often a one-line edit:

  ```yaml
  # before — hosted:
  runs-on: ubuntu-latest
  # after — your runner, selected by label:
  runs-on: [self-hosted, linux, internal-net]
  ```

That one line is the whole "I now own this pipeline" switch. Everything else in your Module 14
workflow stays identical, because the runner runs the same loop either way.
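
Label targeting is a subset check: a job that lists several labels is only handed to a runner that advertises all of them. The function below is a minimal sketch of that rule, assuming exact, case-sensitive label names separated by spaces. The name `runner_matches` is hypothetical, and real forges may add details (implicit labels, case handling) that this ignores.

```bash
#!/usr/bin/env bash
# runner_matches "RUNNER LABELS" "REQUESTED LABELS"
# Succeeds only if every requested label appears among the runner's labels.
runner_matches() {
  local runner_labels=" $1 " wanted
  for wanted in $2; do
    case "$runner_labels" in
      *" $wanted "*) ;;        # this label is present, keep checking
      *) return 1 ;;           # one missing label disqualifies the runner
    esac
  done
  return 0
}
```

So `runner_matches "self-hosted linux internal-net gpu" "self-hosted linux internal-net"` succeeds, while a runner labeled only `self-hosted linux` would not pick up a job that also asks for `gpu`. Labels are therefore a routing tool; they are not an access control.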

### Ephemeral vs. persistent — the property that matters most

A hosted runner is **ephemeral**: fresh machine per job, destroyed after. A self-hosted runner is
**persistent by default**: the same machine, with the same disk, runs job after job. That difference
is the source of nearly every self-hosted runner security incident, so it gets a fuller treatment
under *Where it breaks* — but flag it now. The clean-room guarantee you got for free with hosted
runners is something you have to *rebuild on purpose* when you self-host.
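
The difference is easy to see in miniature. In the sketch below, `run_persistent` reuses one workspace directory across jobs and `run_ephemeral` creates a new one per job and deletes it afterward. Both names are invented for this example. Big caveat: this only isolates a working directory. A real ephemeral runner rebuilds the whole environment (a fresh container or a disposable VM), because a job can also leave things in the home directory, on `PATH`, or running in the background.

```bash
#!/usr/bin/env bash
# Miniature of persistent vs. ephemeral workspaces. Directory-level only;
# it does NOT sandbox the job from the rest of the machine.

# run_persistent WORKDIR COMMAND: the workspace survives between jobs.
run_persistent() {
  local workdir="$1"; shift
  mkdir -p "$workdir"
  ( cd "$workdir" && bash -c "$*" )
}

# run_ephemeral COMMAND: fresh workspace per job, removed when the job ends.
run_ephemeral() {
  local workdir status=0
  workdir="$(mktemp -d)"
  ( cd "$workdir" && bash -c "$*" ) || status=$?
  rm -rf "$workdir"
  return "$status"
}
```

Run `run_persistent /tmp/ws 'echo token > leftover.txt'` and a later job in the same workspace can read `leftover.txt`. Do the same through `run_ephemeral` and the next job starts in an empty directory.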

---

## The AI angle

Three things make runners specifically an AI-era topic, not a generic ops footnote.

**1. AI pipelines are compute-hungry, and that changes the cost math.** Unit 5 puts agents *inside*
the pipeline: jobs that call a model to review a PR, triage an issue, or attempt a fix on a failing
build. Module 25 takes this further — agents running as **triggered or scheduled runner jobs**, kicked
off on a cron or by an event rather than a human push. Those jobs run longer and fire more often than
a lint-and-test pass, and every one of them consumes runner-minutes. The "rent vs. own compute"
decision you're learning here is the one that keeps an AI-heavy pipeline from quietly becoming your
biggest line item. When you reach Module 25 and stand up an agent that runs unattended on a schedule,
*this* is the machine it runs on.

**2. The agent needs hands, and the self-hosted runner is the hands.** A self-hosted runner inside
your network is the most direct way to give an automated agent real reach — deploy access, internal
databases, private services. That's the payoff and the peril in one sentence. The same property that
makes a self-hosted runner useful for an unattended agent (it can touch your real systems) is exactly
what makes it dangerous when the code it runs isn't yours. Which brings us to the part you cannot skip.

**3. AI writes the CI config too.** Ask an agent to "set up CI" and it will happily emit
`runs-on: self-hosted` or wire a deploy step, because it's pattern-matching on examples that did. AI
also opens PRs (Module 11) — and a pull request, from a human or an agent, is *untrusted code that
your pipeline may execute.* You review the *code* in a PR (Module 10); you also have to review what
your pipeline *does with that PR's code* before it runs on hardware that can reach your network. The
review reflex from Module 10 has to extend to the workflow files, not just the application code.
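
One cheap way to build that reflex is to have a script point at the PRs that deserve a second look. The two helpers below are a sketch, not a security control. `touches_workflows` reads changed file paths on stdin and succeeds if any is a workflow file in the Actions-style directories named in this module (or a top-level `.gitlab-ci.yml`). `targets_self_hosted` greps a single workflow file for a one-line `runs-on:` that mentions `self-hosted`. Both names are hypothetical, and the second check misses multi-line label lists and GitLab `tags:`.

```bash
#!/usr/bin/env bash
# Review aids only: they flag files for a human, they do not block anything.

# touches_workflows: paths on stdin, one per line.
touches_workflows() {
  grep -Eq '^\.(github|forgejo|gitea)/workflows/[^/]+\.ya?ml$|^\.gitlab-ci\.yml$'
}

# targets_self_hosted FILE: does a runs-on line mention the self-hosted label?
targets_self_hosted() {
  grep -Eq '^[[:space:]]*runs-on:.*self-hosted' "$1"
}
```

A typical use, assuming your default branch is `main`, would be `git diff --name-only origin/main...HEAD | touches_workflows && echo "workflow changed: review what it runs, and where"`.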

---

## Hands-on lab

**Lab language:** shell, plus a one-line edit to the YAML workflow from Module 14. Runs on your own
machine and your own forge — no hosted account required for the core of it.

This lab has two tracks. **Track A** is mandatory and works for everyone: find out exactly where your
jobs run today and walk the security tradeoffs concretely. **Track B** is the real thing: register a
self-hosted runner and run `tasks-app` CI on it. Do Track A always; do Track B if you have a forge you
can attach a runner to (a self-hosted forge from Module 8 is ideal; a hosted account where you control
a repo also works). If a real runner is too heavy right now, Track A alone satisfies the module.

**You'll need:**

- Your `tasks-app` repo with the Module 14 CI workflow in it.
- The two starter files in this module's `lab/` folder:
  - `whoami-runner.yml` — a tiny workflow that reports *where it ran*.
  - `inspect-runner.sh` — a script you run on a candidate runner machine to see what an attacker
    would see if they got code execution on it.
- For Track B: a forge you can register a runner against, and a spare machine or VM to be the runner
  (your laptop is fine for a one-off; don't leave it registered).
- Your AI assistant.

### Track A — Find out whose computer you've been using (everyone)

1. **Make the invisible visible.** Copy `lab/whoami-runner.yml` into your repo's workflow directory
   (the same place your Module 14 `ci.yml` lives; on Actions-style forges that's
   `.github/workflows/`, `.forgejo/workflows/`, or `.gitea/workflows/`, and the file comments tell
   you where). Commit and push. It runs the same lint-and-test as Module 14, then prints the runner's
   hostname, OS, user, whether it looks ephemeral, and whether it can reach the public internet. The
   receipt step carries `if: always()` so it still prints even when lint or test fail — a diagnostic
   shouldn't disappear on a red build (the job still reports red). On GitLab CI the same idea is
   `when: always` on the job.

2. **Read the receipt.** Open the job logs on your forge and read the `Where did this run?` step.
   You're now able to answer, for a real job, the question this module opened with: *whose computer
   was that?* On a hosted runner you'll see a generic cloud hostname and a throwaway user. Note it —
   you'll compare against your own runner in Track B.

3. **See what code execution would expose.** On the machine you'd *consider* using as a self-hosted
   runner (your laptop is fine for the exercise), run:

   ```bash
   bash lab/inspect-runner.sh
   ```

   It inventories what a job — *any* job, including one from a pull request — could see if it ran
   here: environment secrets, cloud credential files, SSH keys, Docker socket access, and which
   private hosts on your network are reachable. This is not hypothetical. A workflow step is a shell
   command; whatever the script can see, a malicious workflow step can see too.

4. **Walk the tradeoff with your AI, grounded in that output.** Paste the `inspect-runner.sh` output
   into your AI and ask: *"If this machine were a self-hosted CI runner and someone opened a pull
   request with a malicious workflow step, what could they reach or steal? Rank it worst-first."*
   Read the answer against your real output. This is the honest version of "why you'd run your own" —
   the network reach that makes a self-hosted runner *useful* is the exact same reach that makes a
   compromised one *catastrophic.*
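
If you want a feel for what a receipt step like the one in step 1 might contain, here is a stand-in written from the description above. It is not the contents of `lab/whoami-runner.yml`. The function name `runner_receipt`, the `RUNNER_MARKER` variable, the marker-file heuristic, and the use of `example.com` as a reachability probe are all assumptions made for this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical "Where did this run?" receipt. The real lab file may differ.
runner_receipt() {
  echo "hostname: $(hostname 2>/dev/null || uname -n)"
  echo "os: $(uname -sr)"
  echo "user: $(id -un)"

  # Heuristic only: a marker left by an earlier run means the machine
  # (or at least this home directory) persisted between jobs.
  local marker="${RUNNER_MARKER:-$HOME/.runner-was-here}"
  if [ -e "$marker" ]; then
    echo "ephemeral: unlikely (marker from a previous run found)"
  else
    echo "ephemeral: possibly (no marker found; writing one)"
    : >"$marker"
  fi

  # Outbound check with a short timeout so a blocked network cannot hang the job.
  if command -v curl >/dev/null 2>&1 &&
     curl -fsS --max-time 5 -o /dev/null https://example.com; then
    echo "internet: reachable"
  else
    echo "internet: not reachable (or curl missing)"
  fi
}
```

On a hosted runner the marker should never be found, because each job gets a new machine. On a persistent machine, the second run reports the marker left by the first.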

### Track B — Own the pipeline (if you can attach a runner)

5. **Get a registration token.** In your forge's settings, find the Runners / CI/CD section and
   generate a runner registration token (repo-level is the tightest scope — start there).

6. **Register the runner.** On your runner machine, download your forge's runner agent and run its
   register command, pointing at your forge URL with the token, and give it a clear label like
   `self-hosted`. The exact command is forge-specific — open your forge's runner docs and follow the
   register step (the Key concepts section names the three common agents). When it's registered, start
   the agent so it begins polling. Confirm it shows as **online** in the forge's Runners list.

7. **Aim CI at your runner — the one-line switch.** Edit the `runs-on:` (or `tags:`) line in your
   `tasks-app` CI workflow to select your runner's label instead of the hosted image, exactly as
   shown in Key concepts. Commit and push.

8. **Watch your own machine do the work.** Open the job logs. The lint-and-test pass from Module 14
   now runs on hardware you own. Re-run the `whoami-runner.yml` workflow too and compare its output to
   step 2: your hostname, your user, and — critically — note that it is **not** a fresh throwaway
   machine. Run it twice and look for leftovers (a `pip` cache, files from the previous run). That
   persistence is the thing to respect.

9. **Clean up.** If this was a one-off on your laptop, **remove the runner** from the forge and stop
   the agent. A registered-but-forgotten runner is a standing liability — exactly the kind of stale
   backdoor the *Where it breaks* section warns about.
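
For step 6, it can help to see the general shape of a registration command before opening the docs. The helper below only prints a candidate command for you to review; it runs nothing. The flags shown (`--url`, `--token`, `--labels`, `--instance`) are how I recall the three agents' command lines and should be treated as assumptions: they drift between releases, so confirm them against your forge's current runner documentation. The function name `register_command` is made up.

```bash
#!/usr/bin/env bash
# register_command FORGE URL TOKEN LABELS
# Prints (never executes) a plausible registration command. Flags are
# recalled from vendor docs and may be out of date: verify before use.
register_command() {
  local forge="$1" url="$2" token="$3" labels="$4"
  case "$forge" in
    github)
      echo "./config.sh --url $url --token $token --labels $labels" ;;
    gitlab)
      # Labels ("tags") may be set elsewhere depending on the GitLab version.
      echo "gitlab-runner register --url $url --token $token" ;;
    forgejo|gitea)
      echo "act_runner register --instance $url --token $token --labels $labels" ;;
    *)
      echo "unknown forge: $forge" >&2
      return 1 ;;
  esac
}
```

For instance, `register_command gitea https://forge.example.com TOKEN self-hosted` prints an `act_runner register` line with a placeholder URL and token. After registering for real, you would start the agent as described in Key concepts.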

---

## Where it breaks

This is the section that earns the module. Self-hosted runners are the single sharpest-edged tool in
this course. Be honest about all of it.

- **A runner executes arbitrary code — that's its entire job.** A "workflow step" is just a shell
  command someone put in a file in the repo. The runner runs it, faithfully, with whatever access
  that machine has. There is no sandbox unless you build one.

- **Pull requests are untrusted code, and this is the headline risk.** On a public repository, *anyone
  can fork it, edit the workflow, and open a PR* — and on a misconfigured setup, your self-hosted
  runner will dutifully execute their workflow on your hardware, inside your network. This is not
  theoretical: in 2025, real attacks used exactly this path — a malicious fork PR pulled a reverse
  shell onto a self-hosted runner and used the available token to push malicious code back to the
  origin repo. The blunt, widely repeated guidance: **do not attach self-hosted runners to public
  repositories.** If you must, require manual approval before workflows from forks or first-time
  contributors run, and never give those jobs your real secrets.

- **Persistent runners accumulate compromise.** Because the default self-hosted runner is *not*
  ephemeral, anything a job leaves behind — a cached credential, a background process, a tampered
  tool on `PATH` — survives into the next job. A single compromised run can become a permanent
  implant. The fix is **ephemeral runners**: tear the environment down and rebuild it after every
  job (typically by running each job in a fresh container or a disposable VM). This is more setup, and
  it's the price of getting back the clean-room property hosted runners gave you for free.

- **Network reach cuts both ways.** The reason you self-host — line-of-sight to internal systems — is
  also why a compromised runner is a pivot point into your network. Put runners on an isolated
  segment with only the egress they actually need, run them as a dedicated low-privilege user (never
  root, never your own login), and scope their secrets to the minimum. Treat the runner as
  semi-trusted at best.

- **"Free" compute isn't free.** You trade per-minute billing for ops work: patching the OS, keeping
  the agent online and version-matched to the forge (a runner significantly older than the server can
  fail jobs in subtle ways), scaling under load, and securing all of the above. For a busy pipeline
  on idle hardware that math wins. For an occasional test run, the hosted clean room is cheaper once
  you count your own time.

- **Autoscaling is a real project, not a checkbox.** Matching a fleet of runners to bursty demand —
  spinning ephemeral runners up and down on a queue — is its own piece of infrastructure. Don't
  assume one box; don't assume it's trivial to make it many.
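
Several of the mitigations above can be checked mechanically before a machine is allowed to become a runner. The function below is a sketch of such a preflight, assuming a Linux host: it reports whether it is running as root, whether a few common credential files are readable, and whether the Docker socket is writable. The name `preflight_findings` and the short file list are illustrative choices, nowhere near a complete audit.

```bash
#!/usr/bin/env bash
# preflight_findings [HOME_DIR]
# Prints one line per risky condition found; prints nothing if none are found.
# Illustrative and incomplete: a clean result does not mean the host is safe.
preflight_findings() {
  local home_dir="${1:-$HOME}" f

  if [ "$(id -u)" -eq 0 ]; then
    echo "running as root: use a dedicated low-privilege user"
  fi

  # A few well-known credential locations a job running as this user could read.
  for f in .aws/credentials .ssh/id_rsa .ssh/id_ed25519 .docker/config.json; do
    if [ -r "$home_dir/$f" ]; then
      echo "readable credential file: $home_dir/$f"
    fi
  done

  # Write access to the Docker socket lets a job control the host's Docker daemon.
  if [ -w /var/run/docker.sock ]; then
    echo "docker socket is writable: jobs can control the Docker daemon"
  fi
  return 0
}
```

Run it as the user the agent will run as, not as yourself. Any line it prints names something a workflow step could reach on that machine.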

---

## Check for understanding

**You're done when:**

- You can look at any pipeline run and state whether it executed on hosted or self-hosted compute,
  and back it up from the job's own output (you ran `whoami-runner.yml` and read the receipt).
- You can give the five reasons to self-host and honestly say which, if any, apply to your situation
  — instead of self-hosting by default.
- (Track B) You ran `tasks-app` CI on a runner you own, by changing a single targeting line, and you
  saw firsthand that it is not a throwaway machine.
- You can explain, to a skeptical colleague, the central tradeoff in one breath: a self-hosted runner
  executes arbitrary code on your hardware with reach into your network, is persistent by default, and
  must never be casually attached to a public repo — and you can name ephemeral runners, network
  isolation, and least-privilege as the mitigations.

When "where does this run, and what can it touch?" is a question you ask reflexively about every job —
and especially every job triggered by a PR or, soon, by an agent — you own the pipeline end to end.
Module 25 will put autonomous agents on exactly this compute; you now know what they're standing on.

---

## Verify-before-publish

This is an expansion-zone module and the runner ecosystem moves. Re-check at build/publish time:

- [ ] **Runner agent commands and config filenames** for each forge named (the GitHub-style
  `config`/`run` scripts, `gitlab-runner register`, `act_runner register`/`daemon`). Flags and
  script names drift between releases — confirm against current official runner docs, don't pin
  from memory.
- [ ] **Hosted runner pricing and free-minute allotments**, and the machine-size multipliers, for any
  forge a reader is likely to use. These change and vary by plan; state them as "check current
  pricing" rather than a hard number, and re-verify the cost-crossover framing.
- [ ] **Fork-PR / untrusted-workflow defaults** — whether the major forges run fork PRs on
  self-hosted runners by default or require approval, and the exact setting names. The security
  guidance here depends on current defaults; confirm them.
- [ ] **Ephemeral-runner mechanics** — the current supported way to run jobs ephemerally
  (per-job containers, disposable VMs, the `--ephemeral`-style flags) on each forge.
- [ ] **The 2025 attack reference** — keep it accurate and current; if newer, clearer public
  incidents exist at publish time, cite the most representative one rather than an aging example.
- [ ] **Runner-to-server version-compatibility guidance** — confirm the "keep the agent version
  matched to the forge" caveat still reflects current behavior.