De-slop: remove every em-dash + banned words across all modules + capstone (#94)
Sync course wiki / sync-wiki (push) Successful in 4s
Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #94.
@@ -1,12 +1,12 @@
 """Candidate output: a SWAPPED model/prompt.

 Same task, different model (or a tweaked prompt). This output "looks right" and
-passes a casual manual check — adding three tasks and calling count returns 3.
+passes a casual manual check; adding three tasks and calling count returns 3.
 But pending_count() returns the total number of tasks, not the number of
 *pending* ones, so it's wrong the moment anything is marked done.

 Nobody would notice this by skimming. The eval set notices it instantly. That's
-the regression eval catching an unsafe swap — exactly the scenario this module
+the regression eval catching an unsafe swap, exactly the scenario this module
 exists for. Replace this with your own swapped-model output when you run it for
 real; you may get lucky and have it pass, or you may catch a regression like
 this one.
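The failure mode this docstring describes can be reproduced in a few lines. The sketch below is a hypothetical reconstruction: the candidate's actual code is not part of this diff, so the class names, the `add` and `mark_done` methods, and the task representation are all assumptions made for illustration. Only `pending_count()` and its bug come from the docstring.

```python
# Hypothetical reconstruction; the real candidate code is not shown in this diff.
class BuggyTaskList:
    def __init__(self):
        # Each task is a dict: {"title": str, "done": bool}
        self._tasks = []

    def add(self, title):
        self._tasks.append({"title": title, "done": False})

    def mark_done(self, title):
        for task in self._tasks:
            if task["title"] == title:
                task["done"] = True

    def pending_count(self):
        # Bug: returns the total number of tasks, done or not.
        return len(self._tasks)


class FixedTaskList(BuggyTaskList):
    def pending_count(self):
        # Counts only the tasks that are not yet marked done.
        return sum(1 for task in self._tasks if not task["done"])


def casual_check(task_list_cls):
    """The manual check from the docstring: add three tasks, then count."""
    tasks = task_list_cls()
    for title in ("a", "b", "c"):
        tasks.add(title)
    return tasks.pending_count()


def check_after_done(task_list_cls):
    """Same as casual_check, but one task is marked done before counting."""
    tasks = task_list_cls()
    for title in ("a", "b", "c"):
        tasks.add(title)
    tasks.mark_done("b")
    return tasks.pending_count()
```

Both classes return 3 from `casual_check`, which is why skimming misses the bug. Only `check_after_done` separates them: the buggy class still reports 3 while the fixed one reports 2.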
@@ -7,7 +7,7 @@ An *eval set* is a list of CASES. Each case is three things:
 - the expected result (here: how many tasks should count as pending).

 The grading lives in run_eval.py; this file is just data. Keeping the cases
-separate from any model, prompt, or runner is the whole point — the same eval
+separate from any model, prompt, or runner is the whole point; the same eval
 set judges *any* candidate you point it at, which is what makes it useful when
 you swap the model out from under it.
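To make the "cases are just data" point concrete, here is a minimal sketch of cases kept apart from the grader. The real shape of `CASES` and the grading logic in run_eval.py are not visible in this diff, so the field names (`name`, `steps`, `expected`), the `grade` function, and both candidate classes are assumptions, not the course's code.

```python
# Assumed case shape; the real CASES structure is not shown in this diff.
CASES = [
    {"name": "three added, none done",
     "steps": [("add", "a"), ("add", "b"), ("add", "c")],
     "expected": 3},
    {"name": "three added, one done",
     "steps": [("add", "a"), ("add", "b"), ("add", "c"), ("done", "b")],
     "expected": 2},
]


def grade(make_candidate, cases):
    """Run every case against a fresh candidate; return how many passed."""
    passed = 0
    for case in cases:
        candidate = make_candidate()
        for op, title in case["steps"]:
            if op == "add":
                candidate.add(title)
            else:
                candidate.mark_done(title)
        if candidate.pending_count() == case["expected"]:
            passed += 1
    return passed


class Correct:
    """Hypothetical candidate that tracks pending titles in a set."""
    def __init__(self):
        self.pending = set()

    def add(self, title):
        self.pending.add(title)

    def mark_done(self, title):
        self.pending.discard(title)

    def pending_count(self):
        return len(self.pending)


class CountsEverything(Correct):
    """Hypothetical swapped candidate: counts every task ever added."""
    def __init__(self):
        super().__init__()
        self.total = 0

    def add(self, title):
        super().add(title)
        self.total += 1

    def pending_count(self):
        return self.total
```

Because `grade` knows nothing about any particular candidate, the same `CASES` list scores both: `grade(Correct, CASES)` passes 2 of 2, and `grade(CountsEverything, CASES)` passes only 1.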
@@ -34,7 +34,7 @@ def judge(candidate_text: str) -> dict:
     key = os.environ.get("EVAL_JUDGE_KEY")
     model = os.environ.get("EVAL_JUDGE_MODEL")
     if not (url and key and model):
-        return {"score": None, "reason": "judge not configured — abstaining (set EVAL_JUDGE_* to enable)"}
+        return {"score": None, "reason": "judge not configured; abstaining (set EVAL_JUDGE_* to enable)"}

     payload = json.dumps({
         "model": model,
@@ -72,7 +72,7 @@ if __name__ == "__main__":
 # about the candidate changed. The ruler is itself made of rubber.
 #
 # So: use a programmatic grader (run_eval.py) wherever a deterministic check is
-# possible — that is most of the time. Reach for an LLM judge only for genuinely
+# possible; that is most of the time. Reach for an LLM judge only for genuinely
 # open-ended output, and CALIBRATE it first: hand-label ~20 examples yourself,
 # run the judge on them, and confirm it agrees with you before you let it gate
 # anything. An uncalibrated judge is a vibe with a number attached.
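The calibration step in the comment above comes down to an agreement check between your hand labels and the judge's scores. The sketch below assumes binary labels (1 = acceptable, 0 = not) and a hypothetical `agreement_rate` helper; neither is part of this commit. It skips cases where the judge abstained with a score of `None`, mirroring the abstain behaviour of `judge()`. Skipping them is a design choice, and you may prefer to count abstentions as disagreements.

```python
def agreement_rate(human_labels, judge_scores):
    """Fraction of non-abstained cases where the judge matches the human label.

    Returns None when the judge abstained on every case.
    """
    if len(human_labels) != len(judge_scores):
        raise ValueError("need exactly one judge score per hand-labelled example")
    # Keep only the pairs where the judge actually produced a score.
    scored = [(h, j) for h, j in zip(human_labels, judge_scores) if j is not None]
    if not scored:
        return None
    matches = sum(1 for h, j in scored if h == j)
    return matches / len(scored)
```

With hand labels `[1, 1, 0, 0]` and judge scores `[1, 0, 0, None]`, the judge agrees on two of the three cases it scored. How much agreement is enough before the judge may gate anything is a threshold you have to choose yourself.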
@@ -68,9 +68,9 @@ def main(argv):
     print(f"\nscore: {passed}/{len(CASES)} = {score:.0%} threshold: {args.threshold:.0%}")

     if score < args.threshold:
-        print("RESULT: below threshold — this change is NOT safe to ship.\n")
+        print("RESULT: below threshold; this change is NOT safe to ship.\n")
         return 1
-    print("RESULT: at or above threshold — safe by this eval.\n")
+    print("RESULT: at or above threshold; safe by this eval.\n")
     return 0