ai-workflow-course/modules/27-evals/lab/eval_set.py

"""The eval set for the tasks-app `pending_count` agent task.

An *eval set* is a list of CASES. Each case is three things:

  - a name (so the scorecard is readable),
  - an input (here: the state a TaskList is in), and
  - the expected result (here: how many tasks should count as pending).

The grading lives in run_eval.py; this file is just data. Keeping the cases
separate from any model, prompt, or runner is the whole point; the same eval
set judges *any* candidate you point it at, which is what makes it useful when
you swap the model out from under it.

The task we're evaluating: an agent was asked to implement
`TaskList.pending_count()` so it returns the number of tasks that are NOT done.
That sounds trivial. The discriminating cases below are the ones a
"looks-right" implementation quietly fails.
"""

# Each case: (name, [(title, done), ...], expected_pending_count)
CASES = [
    ("empty list has zero pending", [], 0),
    ("one open task counts as one", [("write tests", False)], 1),
    (
        "three open tasks count as three",
        [("a", False), ("b", False), ("c", False)],
        3,
    ),
    # The discriminating case. A candidate that returns len(tasks) passes
    # everything above and fails right here. This is the eval earning its keep.
    (
        "completed tasks are NOT pending",
        [("done thing", True), ("open thing", False), ("also done", True)],
        1,
    ),
    ("all done means zero pending", [("x", True), ("y", True)], 0),
]