Files
ai-workflow-course/modules/27-evals/lab/eval_set.py
T
claude fbec36cb67 feat(course): build out all 27 modules, capstone, scaffold, and conventions
Scaffold the course repo and author the full curriculum in dependency-chain
order, following the settled build decisions in handoff.md.

- Scaffold: course README, vendor-neutral AGENTS.md (dogfoods Module 5),
  _TEMPLATE.md (the fixed 9-section module shape), root .gitignore, ship config.
- Modules 1-2: reference exemplars (locked for tone/depth/lab style).
- Modules 3-27: full lessons + runnable labs, each following the template,
  respecting the chain, vendor/model-agnostic, with "feel the pain" labs.
- Module 8 hosting comparison web-researched and date-stamped (as of 2026-06-22),
  not written from memory; expansion-zone modules carry Verify-before-publish.
- Capstone: the full loop end to end on the running tasks-app example.

Lab code syntax-checked (Python/shell/YAML); every module has the 7 core
template sections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
2026-06-22 12:18:30 -04:00

38 lines
1.5 KiB
Python

"""The eval set for the tasks-app `pending_count` agent task.
An *eval set* is a list of CASES. Each case is three things:
- a name (so the scorecard is readable),
- an input (here: the state a TaskList is in), and
- the expected result (here: how many tasks should count as pending).
The grading lives in run_eval.py; this file is just data. Keeping the cases
separate from any model, prompt, or runner is the whole point — the same eval
set judges *any* candidate you point it at, which is what makes it useful when
you swap the model out from under it.
The task we're evaluating: an agent was asked to implement
`TaskList.pending_count()` so it returns the number of tasks that are NOT done.
That sounds trivial. The discriminating cases below are the ones a
"looks-right" implementation quietly fails.
"""
# Each case: (name, [(title, done), ...], expected_pending_count)
CASES = [
("empty list has zero pending", [], 0),
("one open task counts as one", [("write tests", False)], 1),
(
"three open tasks count as three",
[("a", False), ("b", False), ("c", False)],
3,
),
# The discriminating case. A candidate that returns len(tasks) passes
# everything above and fails right here. This is the eval earning its keep.
(
"completed tasks are NOT pending",
[("done thing", True), ("open thing", False), ("also done", True)],
1,
),
("all done means zero pending", [("x", True), ("y", True)], 0),
]