389ac2e460
Apply the no-ai-slop standard (now binding in AGENTS.md): the em-dash character is banned outright (restructured, not blind-replaced), plus the banned word/phrase list (delve, leverage, robust, seamless, truly, unlock, etc.). 0 em-dashes remain in modules + capstone; the only "robust" left is the planted M10 ai-change.patch trap. Module H1 titles use a colon separator. All deliberate teaching devices preserved; labs compile/parse (py/sh/yaml/json); no junk. AGENTS.md updated with the hard no-slop rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TfzV5QvtPDz8LJS3Pu5VLT
38 lines
1.5 KiB
Python
"""The eval set for the tasks-app `pending_count` agent task.

An *eval set* is a list of CASES. Each case is three things:

- a name (so the scorecard is readable),
- an input (here: the state a TaskList is in), and
- the expected result (here: how many tasks should count as pending).

The grading lives in run_eval.py; this file is just data. Keeping the cases
separate from any model, prompt, or runner is the whole point; the same eval
set judges *any* candidate you point it at, which is what makes it useful when
you swap the model out from under it.

The task we're evaluating: an agent was asked to implement
`TaskList.pending_count()` so it returns the number of tasks that are NOT done.
That sounds trivial. The discriminating cases below are the ones a
"looks-right" implementation quietly fails.
"""

# Each case: (name, [(title, done), ...], expected_pending_count)
CASES = [
    ("empty list has zero pending", [], 0),
    ("one open task counts as one", [("write tests", False)], 1),
    (
        "three open tasks count as three",
        [("a", False), ("b", False), ("c", False)],
        3,
    ),
    # The discriminating case. A candidate that returns len(tasks) passes
    # everything above and fails right here. This is the eval earning its keep.
    (
        "completed tasks are NOT pending",
        [("done thing", True), ("open thing", False), ("also done", True)],
        1,
    ),
    ("all done means zero pending", [("x", True), ("y", True)], 0),
]
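The claim that a `len(tasks)` candidate passes the early cases and fails the discriminating one can be checked with a small sketch. This is not the real run_eval.py, which is not shown here. It assumes a candidate is a plain function that takes the list of `(title, done)` pairs directly rather than a `TaskList` method, and `score`, `naive_pending_count`, and `correct_pending_count` are hypothetical names introduced only for illustration.

```python
# Hypothetical sketch: NOT the real run_eval.py. Candidates here are plain
# functions over a list of (title, done) pairs, which is an assumption.

# Same cases as the eval set, repeated so the sketch is self-contained.
CASES = [
    ("empty list has zero pending", [], 0),
    ("one open task counts as one", [("write tests", False)], 1),
    (
        "three open tasks count as three",
        [("a", False), ("b", False), ("c", False)],
        3,
    ),
    (
        "completed tasks are NOT pending",
        [("done thing", True), ("open thing", False), ("also done", True)],
        1,
    ),
    ("all done means zero pending", [("x", True), ("y", True)], 0),
]


def naive_pending_count(tasks):
    """A "looks-right" candidate: counts every task, done or not."""
    return len(tasks)


def correct_pending_count(tasks):
    """Counts only the tasks whose done flag is False."""
    return sum(1 for _title, done in tasks if not done)


def score(candidate, cases):
    """Run one candidate over every case and return (name, passed) pairs."""
    return [
        (name, candidate(tasks) == expected)
        for name, tasks, expected in cases
    ]


naive_results = score(naive_pending_count, CASES)
correct_results = score(correct_pending_count, CASES)

if __name__ == "__main__":
    for label, results in (("naive", naive_results), ("correct", correct_results)):
        print(label)
        for name, passed in results:
            print("  PASS" if passed else "  FAIL", name)
```

Because `score` takes the candidate as an argument, the same cases judge both implementations unchanged, which is the separation the docstring argues for. In this sketch the naive candidate also fails the final "all done" case, since that case contains completed tasks too.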