style(no-slop): remove every em-dash + banned words across all modules + capstone

Apply the no-ai-slop standard (now binding in AGENTS.md): the em-dash character is
banned outright (restructured, not blind-replaced), plus the banned word/phrase
list (delve, leverage, robust, seamless, truly, unlock, etc.). 0 em-dashes remain
in modules + capstone; the only "robust" left is the planted M10 ai-change.patch
trap. Module H1 titles use a colon separator.

All deliberate teaching devices preserved; labs compile/parse (py/sh/yaml/json);
no junk. AGENTS.md updated with the hard no-slop rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
2026-06-22 23:21:09 -04:00
parent 513d7e7ac8
commit 389ac2e460
99 changed files with 1324 additions and 1315 deletions
@@ -1,12 +1,12 @@
"""Candidate output: a SWAPPED model/prompt.
Same task, different model (or a tweaked prompt). This output "looks right" and
-passes a casual manual check adding three tasks and calling count returns 3.
+passes a casual manual check; adding three tasks and calling count returns 3.
But pending_count() returns the total number of tasks, not the number of
*pending* ones, so it's wrong the moment anything is marked done.
Nobody would notice this by skimming. The eval set notices it instantly. That's
-the regression eval catching an unsafe swap exactly the scenario this module
+the regression eval catching an unsafe swap, exactly the scenario this module
exists for. Replace this with your own swapped-model output when you run it for
real; you may get lucky and have it pass, or you may catch a regression like
this one.
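
The failure mode the docstring describes can be reproduced in a few lines. This is a minimal sketch, not the candidate file itself, which this diff does not show: the `TaskList` class, `add()`, `mark_done()`, and `pending_count_buggy()` are hypothetical names invented for illustration. Only `pending_count()` comes from the source.

```python
class TaskList:
    """Hypothetical task list; only pending_count() is named in the source."""

    def __init__(self):
        self._tasks = []  # each task is {"title": str, "done": bool}

    def add(self, title):
        self._tasks.append({"title": title, "done": False})

    def mark_done(self, title):
        for task in self._tasks:
            if task["title"] == title:
                task["done"] = True

    def pending_count_buggy(self):
        # The regression: counts every task, done or not.
        return len(self._tasks)

    def pending_count(self):
        # Intended behaviour: count only tasks not yet marked done.
        return sum(1 for task in self._tasks if not task["done"])


tasks = TaskList()
for title in ("a", "b", "c"):
    tasks.add(title)

# The casual manual check: both versions return 3, so the bug is invisible.
assert tasks.pending_count_buggy() == tasks.pending_count() == 3

tasks.mark_done("b")
# Once anything is marked done, the two versions diverge.
print(tasks.pending_count_buggy(), tasks.pending_count())  # prints "3 2"
```

A manual check that never marks a task done exercises only the state where both versions agree. An eval case with at least one completed task is what separates them.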
+1 -1
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
- the expected result (here: how many tasks should count as pending).
The grading lives in run_eval.py; this file is just data. Keeping the cases
-separate from any model, prompt, or runner is the whole point the same eval
+separate from any model, prompt, or runner is the whole point; the same eval
set judges *any* candidate you point it at, which is what makes it useful when
you swap the model out from under it.
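
The sketch below shows one way cases kept as plain data can judge any candidate. The field names (`name`, `steps`, `expected_pending`) and both candidate functions are assumptions made for illustration; the real case layout is only partly visible in this hunk.

```python
# Hypothetical case layout; the real field names are not shown in the diff.
CASES = [
    {"name": "three added",
     "steps": [("add", "a"), ("add", "b"), ("add", "c")],
     "expected_pending": 3},
    {"name": "one done",
     "steps": [("add", "a"), ("add", "b"), ("done", "a")],
     "expected_pending": 1},
]


def correct_candidate(steps):
    """Counts tasks that were added and not later marked done."""
    pending = set()
    for op, title in steps:
        if op == "add":
            pending.add(title)
        elif op == "done":
            pending.discard(title)
    return len(pending)


def swapped_candidate(steps):
    """Counts every added task and ignores 'done' (the regression)."""
    return sum(1 for op, _ in steps if op == "add")


def failing_cases(candidate):
    """Return the names of cases the candidate gets wrong."""
    return [case["name"] for case in CASES
            if candidate(case["steps"]) != case["expected_pending"]]


failures_correct = failing_cases(correct_candidate)
failures_swapped = failing_cases(swapped_candidate)
print(failures_correct)  # prints []
print(failures_swapped)  # prints ['one done']
```

Because `CASES` knows nothing about either candidate, pointing the same list at a new model's output needs no change to the data.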
+2 -2
@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
     key = os.environ.get("EVAL_JUDGE_KEY")
     model = os.environ.get("EVAL_JUDGE_MODEL")
     if not (url and key and model):
-        return {"score": None, "reason": "judge not configured abstaining (set EVAL_JUDGE_* to enable)"}
+        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}
     payload = json.dumps({
         "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
# about the candidate changed. The ruler is itself made of rubber.
#
# So: use a programmatic grader (run_eval.py) wherever a deterministic check is
-# possible that is most of the time. Reach for an LLM judge only for genuinely
+# possible; that is most of the time. Reach for an LLM judge only for genuinely
# open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
# run the judge on them, and confirm it agrees with you before you let it gate
# anything. An uncalibrated judge is a vibe with a number attached.
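
The calibration step in the comment above can be sketched as a simple agreement check. Everything here is an assumption for illustration: the `agreement_rate()` helper, the `MIN_AGREEMENT` bar of 0.8, and the made-up labels are not from the repository, and the judge's verdicts are hard-coded rather than fetched from a real model.

```python
def agreement_rate(hand_labels, judge_labels):
    """Fraction of examples where the judge matches the hand label."""
    if not hand_labels:
        raise ValueError("need at least one labelled example")
    if len(hand_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(1 for hand, judged in zip(hand_labels, judge_labels)
                  if hand == judged)
    return matches / len(hand_labels)


# 20 made-up hand labels (True = acceptable output), per the "~20 examples" advice.
hand = [True] * 12 + [False] * 8

# Made-up judge verdicts: identical except for two disagreements.
judge_verdicts = list(hand)
judge_verdicts[3] = False
judge_verdicts[15] = True

MIN_AGREEMENT = 0.8  # hypothetical bar; pick your own before trusting the judge

rate = agreement_rate(hand, judge_verdicts)
judge_may_gate = rate >= MIN_AGREEMENT
print(f"agreement: {rate:.0%}, may gate: {judge_may_gate}")
```

If the rate falls below whatever bar you set, the judge's scores are better treated as advisory than as a gate.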
+2 -2
@@ -68,9 +68,9 @@ def main(argv):
     print(f"\nscore: {passed}/{len(CASES)} = {score:.0%} threshold: {args.threshold:.0%}")
     if score < args.threshold:
-        print("RESULT: below threshold this change is NOT safe to ship.\n")
+        print("RESULT: below threshold; this change is NOT safe to ship.\n")
         return 1
-    print("RESULT: at or above threshold safe by this eval.\n")
+    print("RESULT: at or above threshold; safe by this eval.\n")
     return 0