style(no-slop): remove every em-dash + banned words across all modules + capstone

Apply the no-ai-slop standard (now binding in AGENTS.md): the em-dash character is banned outright (restructured, not blind-replaced), plus the banned word/phrase list (delve, leverage, robust, seamless, truly, unlock, etc.). 0 em-dashes remain in modules + capstone; the only "robust" left is the planted M10 ai-change.patch trap. Module H1 titles use a colon separator. All deliberate teaching devices preserved; labs compile/parse (py/sh/yaml/json); no junk. AGENTS.md updated with the hard no-slop rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
2026-06-22 23:21:09 -04:00
parent 513d7e7ac8
commit 389ac2e460
99 changed files with 1324 additions and 1315 deletions
@@ -1,12 +1,12 @@
 """Candidate output: a SWAPPED model/prompt.

 Same task, different model (or a tweaked prompt). This output "looks right" and
-passes a casual manual check — adding three tasks and calling count returns 3.
+passes a casual manual check; adding three tasks and calling count returns 3.
 But pending_count() returns the total number of tasks, not the number of
 *pending* ones, so it's wrong the moment anything is marked done.

 Nobody would notice this by skimming. The eval set notices it instantly. That's
-the regression eval catching an unsafe swap — exactly the scenario this module
+the regression eval catching an unsafe swap, exactly the scenario this module
 exists for. Replace this with your own swapped-model output when you run it for
 real; you may get lucky and have it pass, or you may catch a regression like
 this one.
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
  - the expected result (here: how many tasks should count as pending).

 The grading lives in run_eval.py; this file is just data. Keeping the cases
-separate from any model, prompt, or runner is the whole point — the same eval
+separate from any model, prompt, or runner is the whole point; the same eval
 set judges *any* candidate you point it at, which is what makes it useful when
 you swap the model out from under it.

@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
    key = os.environ.get("EVAL_JUDGE_KEY")
    model = os.environ.get("EVAL_JUDGE_MODEL")
    if not (url and key and model):
-        return {"score": None, "reason": "judge not configured — abstaining (set EVAL_JUDGE_* to enable)"}
+        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}

    payload = json.dumps({
        "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
 #     about the candidate changed. The ruler is itself made of rubber.
 #
 # So: use a programmatic grader (run_eval.py) wherever a deterministic check is
-# possible — that is most of the time. Reach for an LLM judge only for genuinely
+# possible; that is most of the time. Reach for an LLM judge only for genuinely
 # open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
 # run the judge on them, and confirm it agrees with you before you let it gate
 # anything. An uncalibrated judge is a vibe with a number attached.
@@ -68,9 +68,9 @@ def main(argv):
    print(f"\nscore: {passed}/{len(CASES)} = {score:.0%}   threshold: {args.threshold:.0%}")

    if score < args.threshold:
-        print("RESULT: below threshold — this change is NOT safe to ship.\n")
+        print("RESULT: below threshold; this change is NOT safe to ship.\n")
        return 1
-    print("RESULT: at or above threshold — safe by this eval.\n")
+    print("RESULT: at or above threshold; safe by this eval.\n")
    return 0