De-slop: remove every em-dash + banned words across all modules + capstone (#94)

Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
2026-06-22 23:21:22 -04:00
parent 513d7e7ac8
commit c098933f25
99 changed files with 1324 additions and 1315 deletions
@@ -1,12 +1,12 @@
 """Candidate output: a SWAPPED model/prompt.

 Same task, different model (or a tweaked prompt). This output "looks right" and
-passes a casual manual check — adding three tasks and calling count returns 3.
+passes a casual manual check; adding three tasks and calling count returns 3.
 But pending_count() returns the total number of tasks, not the number of
 *pending* ones, so it's wrong the moment anything is marked done.

 Nobody would notice this by skimming. The eval set notices it instantly. That's
-the regression eval catching an unsafe swap — exactly the scenario this module
+the regression eval catching an unsafe swap, exactly the scenario this module
 exists for. Replace this with your own swapped-model output when you run it for
 real; you may get lucky and have it pass, or you may catch a regression like
 this one.
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
  - the expected result (here: how many tasks should count as pending).

 The grading lives in run_eval.py; this file is just data. Keeping the cases
-separate from any model, prompt, or runner is the whole point — the same eval
+separate from any model, prompt, or runner is the whole point; the same eval
 set judges *any* candidate you point it at, which is what makes it useful when
 you swap the model out from under it.

@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
    key = os.environ.get("EVAL_JUDGE_KEY")
    model = os.environ.get("EVAL_JUDGE_MODEL")
    if not (url and key and model):
-        return {"score": None, "reason": "judge not configured — abstaining (set EVAL_JUDGE_* to enable)"}
+        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}

    payload = json.dumps({
        "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
 #     about the candidate changed. The ruler is itself made of rubber.
 #
 # So: use a programmatic grader (run_eval.py) wherever a deterministic check is
-# possible — that is most of the time. Reach for an LLM judge only for genuinely
+# possible; that is most of the time. Reach for an LLM judge only for genuinely
 # open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
 # run the judge on them, and confirm it agrees with you before you let it gate
 # anything. An uncalibrated judge is a vibe with a number attached.
@@ -68,9 +68,9 @@ def main(argv):
    print(f"\nscore: {passed}/{len(CASES)} = {score:.0%}   threshold: {args.threshold:.0%}")

    if score < args.threshold:
-        print("RESULT: below threshold — this change is NOT safe to ship.\n")
+        print("RESULT: below threshold; this change is NOT safe to ship.\n")
        return 1
-    print("RESULT: at or above threshold — safe by this eval.\n")
+    print("RESULT: at or above threshold; safe by this eval.\n")
    return 0
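
The docstrings and comments touched in this diff describe a regression eval: a fixed set of cases, a candidate whose pending_count() returns the total instead of the pending count, and a threshold that gates shipping. The sketch below illustrates that idea in isolation. It is not the code from the changed files: TaskList, BuggyTaskList, the layout of CASES, and this run_eval() helper are all hypothetical stand-ins chosen for the example, and the default threshold of 100% is an assumption.

```python
# Hypothetical names throughout: TaskList, BuggyTaskList, CASES and run_eval
# are illustrative stand-ins, not the contents of the files in this diff.


class TaskList:
    """Reference behaviour: pending_count() ignores tasks marked done."""

    def __init__(self):
        self.tasks = []  # list of [title, done] pairs

    def add(self, title):
        self.tasks.append([title, False])

    def mark_done(self, index):
        self.tasks[index][1] = True

    def pending_count(self):
        return sum(1 for _, done in self.tasks if not done)


class BuggyTaskList(TaskList):
    """The regression described above: returns the total, not the pending count."""

    def pending_count(self):
        return len(self.tasks)


# Each case: (name, titles to add, indexes to mark done, expected pending count).
# This tuple layout is an assumption made for the sketch.
CASES = [
    ("three added, none done", ["a", "b", "c"], [], 3),
    ("three added, one done", ["a", "b", "c"], [0], 2),
    ("two added, all done", ["a", "b"], [0, 1], 0),
    ("empty list", [], [], 0),
]


def run_eval(candidate_cls, threshold=1.0):
    """Return (score, exit_code); exit_code is 1 when score is below threshold."""
    passed = 0
    for _name, titles, done_indexes, expected in CASES:
        tasks = candidate_cls()
        for title in titles:
            tasks.add(title)
        for index in done_indexes:
            tasks.mark_done(index)
        if tasks.pending_count() == expected:
            passed += 1
    score = passed / len(CASES)
    return score, (1 if score < threshold else 0)


if __name__ == "__main__":
    for cls in (TaskList, BuggyTaskList):
        score, code = run_eval(cls)
        print(f"{cls.__name__}: score {score:.0%}, exit code {code}")
```

The buggy candidate passes the "three added, none done" case, which is the casual manual check the docstring mentions, and fails as soon as a case marks something done. Because the cases live apart from any candidate, the same list can judge whichever implementation is passed in.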