Files
ai-workflow-course/modules/27-evals/lab/run_eval.py
T
justin 3221f7abe8
CI / check (pull_request) Successful in 7s
Use python3 as the canonical command name course-wide (#104)
Most current systems (default Debian/Ubuntu, recent macOS) install Python
only as `python3`, with no bare `python` on PATH, so learners who copied
`python cli.py ...` into their host shell hit "command not found".

Convert host-shell `python <cmd>` -> `python3 <cmd>` across module/lab
READMEs, lab `.py` docstrings & usage strings, blog posts, lab prompt and
instruction files, the M04 verify.sh message, and the M10/M24 lab patches.
Module 01's convention note (and its blog/02 mirror) is rewritten so
`python3` is canonical and `python` is the documented fallback.

Stop-lines respected: Docker image tags (`python:3.12-slim`), `.venv/.../python`
and `...\.venv\Scripts\python.exe` paths, the M20 `"command": "python"`
teaching example and surrounding venv prose, container-internal invocations
(M16/M18 Dockerfiles, M16 README `docker run` examples), and CI-workflow
`run:` steps fed by `actions/setup-python` / `image: python:3.12` are left
as `python` on purpose.

pip was left out of scope: most occurrences are prose or CI/container-internal,
and `pip3` does not fix the PEP 668 externally-managed-environment refusal that
the course already addresses with venvs. The M01 note is worded to stay
consistent with bare `pip` (use whichever pip pairs with your Python).

Build (tools/build_wiki.py) and tools/check.sh both pass.

Closes #104

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01GAEzanEoGJT5o1VizQar47
2026-06-23 20:18:04 -04:00

79 lines
2.8 KiB
Python

"""Run the eval set against one candidate and print a scorecard.
Usage:
python3 run_eval.py candidates/current_model
python3 run_eval.py candidates/swapped_model
python3 run_eval.py candidates/current_model --threshold 0.9
A "candidate" is a directory containing a tasks.py that an agent produced. The
runner imports that tasks.py, runs every case in eval_set.py against it, prints
a pass/fail line per case, and reports an aggregate score.
The exit code is the guardrail: 0 if the score meets the threshold, 1 if it
doesn't. That single integer is what lets an eval gate a model swap, a prompt
change, or an unattended agent in CI (Module 14) instead of a human eyeballing
output. "Below threshold" should block exactly like a failing test does.
"""
import argparse
import importlib.util
import sys
from pathlib import Path
from eval_set import CASES
def load_candidate(candidate_dir: Path):
"""Import the tasks.py living in candidate_dir as an isolated module."""
tasks_py = candidate_dir / "tasks.py"
if not tasks_py.exists():
sys.exit(f"no tasks.py in {candidate_dir}")
spec = importlib.util.spec_from_file_location("candidate_tasks", tasks_py)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
return module
def run_case(candidate, items, expected):
"""Build the input state with the candidate's own classes, then grade."""
tlist = candidate.TaskList()
for title, done in items:
tlist.tasks.append(candidate.Task(title=title, done=done))
try:
actual = tlist.pending_count()
except Exception as exc: # a crash is a failed case, not a crashed harness
return False, f"raised {type(exc).__name__}: {exc}"
return actual == expected, f"expected {expected}, got {actual}"
def main(argv):
parser = argparse.ArgumentParser()
parser.add_argument("candidate", help="path to a candidate dir with tasks.py")
parser.add_argument("--threshold", type=float, default=1.0,
help="minimum passing fraction to exit 0 (default 1.0)")
args = parser.parse_args(argv)
candidate_dir = Path(args.candidate)
candidate = load_candidate(candidate_dir)
passed = 0
print(f"\neval set: {len(CASES)} cases candidate: {candidate_dir}\n")
for name, items, expected in CASES:
ok, detail = run_case(candidate, items, expected)
mark = "PASS" if ok else "FAIL"
print(f" [{mark}] {name:<40} ({detail})")
passed += ok
score = passed / len(CASES)
print(f"\nscore: {passed}/{len(CASES)} = {score:.0%} threshold: {args.threshold:.0%}")
if score < args.threshold:
print("RESULT: below threshold; this change is NOT safe to ship.\n")
return 1
print("RESULT: at or above threshold; safe by this eval.\n")
return 0
if __name__ == "__main__":
raise SystemExit(main(sys.argv[1:]))