style(no-slop): remove every em-dash + banned words across all modules + capstone

Apply the no-ai-slop standard (now binding in AGENTS.md): the em-dash character is banned outright (restructured, not blind-replaced), plus the banned word/phrase list (delve, leverage, robust, seamless, truly, unlock, etc.). 0 em-dashes remain in modules + capstone; the only "robust" left is the planted M10 ai-change.patch trap. Module H1 titles use a colon separator. All deliberate teaching devices preserved; labs compile/parse (py/sh/yaml/json); no junk. AGENTS.md updated with the hard no-slop rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
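
The "0 em-dashes remain" claim above is the kind of thing a small script can check mechanically. The following is a minimal sketch, not the check this commit actually used: the names `find_slop` and `scan_tree`, the file suffixes, and the word list (copied from the partial list in the message; the binding list lives in AGENTS.md) are all assumptions made for illustration.

```python
import re
from pathlib import Path

EM_DASH = "\u2014"  # written as an escape so this file passes its own check

# Assumption: only the words named in the commit message; the full list is in AGENTS.md.
BANNED_WORDS = ("delve", "leverage", "robust", "seamless", "truly", "unlock")
_WORD_RE = re.compile(r"\b(" + "|".join(BANNED_WORDS) + r")\b", re.IGNORECASE)


def find_slop(text):
    """Return (line_number, finding) pairs for em-dashes and banned words."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if EM_DASH in line:
            findings.append((lineno, "em-dash"))
        for match in _WORD_RE.finditer(line):
            findings.append((lineno, match.group(1).lower()))
    return findings


def scan_tree(root, suffixes=(".md", ".py", ".sh", ".yaml", ".json")):
    """Scan files under root; return {path: findings} for files with hits."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = find_slop(path.read_text(encoding="utf-8", errors="replace"))
            if hits:
                report[str(path)] = hits
    return report
```

A real check would also need an allowlist, since the message says one "robust" is left on purpose in the planted M10 ai-change.patch trap. Whole-word matching means "robustness" is not flagged; whether that is desired is a policy choice this sketch does not settle.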
@@ -1,12 +1,12 @@
 """Candidate output: a SWAPPED model/prompt.

 Same task, different model (or a tweaked prompt). This output "looks right" and
-passes a casual manual check — adding three tasks and calling count returns 3.
+passes a casual manual check; adding three tasks and calling count returns 3.
 But pending_count() returns the total number of tasks, not the number of
 *pending* ones, so it's wrong the moment anything is marked done.

 Nobody would notice this by skimming. The eval set notices it instantly. That's
-the regression eval catching an unsafe swap — exactly the scenario this module
+the regression eval catching an unsafe swap, exactly the scenario this module
 exists for. Replace this with your own swapped-model output when you run it for
 real; you may get lucky and have it pass, or you may catch a regression like
 this one.
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
 - the expected result (here: how many tasks should count as pending).

 The grading lives in run_eval.py; this file is just data. Keeping the cases
-separate from any model, prompt, or runner is the whole point — the same eval
+separate from any model, prompt, or runner is the whole point; the same eval
 set judges *any* candidate you point it at, which is what makes it useful when
 you swap the model out from under it.
@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
     key = os.environ.get("EVAL_JUDGE_KEY")
     model = os.environ.get("EVAL_JUDGE_MODEL")
     if not (url and key and model):
-        return {"score": None, "reason": "judge not configured — abstaining (set EVAL_JUDGE_* to enable)"}
+        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}

     payload = json.dumps({
         "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
 # about the candidate changed. The ruler is itself made of rubber.
 #
 # So: use a programmatic grader (run_eval.py) wherever a deterministic check is
-# possible — that is most of the time. Reach for an LLM judge only for genuinely
+# possible; that is most of the time. Reach for an LLM judge only for genuinely
 # open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
 # run the judge on them, and confirm it agrees with you before you let it gate
 # anything. An uncalibrated judge is a vibe with a number attached.
@@ -68,9 +68,9 @@ def main(argv):
     print(f"\nscore: {passed}/{len(CASES)} = {score:.0%} threshold: {args.threshold:.0%}")

     if score < args.threshold:
-        print("RESULT: below threshold — this change is NOT safe to ship.\n")
+        print("RESULT: below threshold; this change is NOT safe to ship.\n")
         return 1
-    print("RESULT: at or above threshold — safe by this eval.\n")
+    print("RESULT: at or above threshold; safe by this eval.\n")
     return 0
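
The touched docstrings describe a candidate whose pending_count() returns the total task count, which passes a casual check and fails once anything is marked done. The sketch below shows how a small eval set with a threshold could catch that. It is an illustration only: the class names, the case fields (`add`, `done`, `expected`), and the `evaluate` function are hypothetical and are not taken from the repository's run_eval.py or its CASES data.

```python
class GoodTasks:
    """Reference behaviour: only tasks not marked done count as pending."""

    def __init__(self):
        self._tasks = []

    def add(self, name):
        self._tasks.append({"name": name, "done": False})

    def mark_done(self, name):
        for task in self._tasks:
            if task["name"] == name:
                task["done"] = True

    def pending_count(self):
        return sum(1 for task in self._tasks if not task["done"])


class SwappedTasks(GoodTasks):
    """Candidate with the bug from the docstring: counts every task."""

    def pending_count(self):
        return len(self._tasks)


# Hypothetical case shape: what to add, what to mark done, expected pending count.
CASES = [
    {"add": ["a", "b", "c"], "done": [], "expected": 3},
    {"add": ["a", "b", "c"], "done": ["b"], "expected": 2},
    {"add": ["a", "b"], "done": ["a", "b"], "expected": 0},
    {"add": [], "done": [], "expected": 0},
]


def evaluate(candidate_cls, threshold=1.0):
    """Run every case against a fresh candidate; return (score, exit_code)."""
    passed = 0
    for case in CASES:
        tasks = candidate_cls()
        for name in case["add"]:
            tasks.add(name)
        for name in case["done"]:
            tasks.mark_done(name)
        if tasks.pending_count() == case["expected"]:
            passed += 1
    score = passed / len(CASES)
    return score, (0 if score >= threshold else 1)
```

Under these assumed cases, the buggy candidate still passes the "add three, count three" case and the empty case, so it only fails where tasks have been marked done. That is why the cases need to cover the done state for the eval to catch the swap.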