Bayer's seed-treatment catalog query re-serves products already
returned by the herbicide/fungicide/insecticide queries whenever those
products list seed-treatment use sites. safe_slug() strips the class
suffix when the catalog product type matches the queried class, but
not when the query class is seed-treatment, so the same product gets
written twice: once as "<base>" (canonical class) and once as
"<base>-<class>" (class=seed-treatment).
The first full scrape produced 159 files for 87 unique EPA reg nos,
roughly 45% redundant. Fix:
- process_product accepts an optional seen_regs set and returns
"dup-skip" when the product's EPA reg no is already in it.
- run() seeds seen_regs from existing sidecars on disk via
_load_seen_regs() so dedup survives re-runs (force overrides).
- run() updates seen_regs after each successful write, so within-run
dedup works for the seed-treatment query (which iterates last).
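A minimal sketch of the dedup flow described in the bullets above, not
the actual scraper code. Assumptions: sidecars are *.json files with an
"epa_reg_no" key, products are plain dicts with "slug" and
"epa_reg_no", and the guard is passed only while iterating the
seed-treatment query (one way to avoid skipping same-reg brand
variants; the real placement of the check may differ). load_seen_regs
stands in for _load_seen_regs().

```python
from __future__ import annotations

import json
from pathlib import Path


def load_seen_regs(corpus_dir: Path) -> set[str]:
    """Collect EPA reg nos from sidecars already on disk (assumed *.json)."""
    seen: set[str] = set()
    if not corpus_dir.is_dir():
        return seen
    for sidecar in corpus_dir.glob("*.json"):
        try:
            data = json.loads(sidecar.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable sidecar: skip it rather than abort the run
        if isinstance(data, dict) and data.get("epa_reg_no"):
            seen.add(data["epa_reg_no"])
    return seen


def process_product(product: dict, out_dir: Path,
                    seen_regs: set[str] | None = None) -> str:
    """Write one sidecar, or return "dup-skip" if its reg no was already seen."""
    if seen_regs is not None and product.get("epa_reg_no") in seen_regs:
        return "dup-skip"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{product['slug']}.json"
    path.write_text(json.dumps(product), encoding="utf-8")
    return "written"


def run(queries, out_dir: Path, force: bool = False) -> list[tuple[str, str]]:
    """queries: iterable of (query_class, products); seed-treatment goes last."""
    # Seed from disk so dedup survives re-runs; force starts from an empty set.
    seen_regs = set() if force else load_seen_regs(out_dir)
    results = []
    for query_class, products in queries:
        # Assumption: only the seed-treatment query is guarded, and force
        # disables the guard entirely.
        guarded = query_class == "seed-treatment" and not force
        for product in products:
            status = process_product(product, out_dir,
                                     seen_regs if guarded else None)
            if status == "written" and product.get("epa_reg_no"):
                seen_regs.add(product["epa_reg_no"])
            results.append((product["slug"], status))
    return results
```

Because seen_regs is updated after every successful write, a clone
served by the last query is skipped even on a first run against an
empty directory.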
Important nuance preserved: when two genuinely different brand-name
products share the same EPA reg no (e.g., Absolute Maxx and Adament
Flow are both 264-849), they are NOT treated as dups. They are
different catalog entries with different slugs and the same canonical
class. Only the seed-treatment-clone pattern (slug = <canonical>-<class>
AND class=seed-treatment AND sibling at same reg with matching class)
is the bug being fixed.
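The three-part clone condition can be written as a single predicate.
This is an illustrative sketch only: the Entry dataclass and its field
names are hypothetical stand-ins for whatever the sidecars actually
hold, and the slugs and reg nos in the usage note are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    slug: str
    product_class: str
    epa_reg_no: str


def is_seed_treatment_clone(entry: Entry, corpus: list[Entry]) -> bool:
    """True only for the <canonical>-<class> seed-treatment clone pattern."""
    if entry.product_class != "seed-treatment":
        return False
    for sibling in corpus:
        if sibling is entry or sibling.epa_reg_no != entry.epa_reg_no:
            continue  # must be a different entry at the same reg no
        if sibling.product_class == "seed-treatment":
            continue  # the sibling has to carry the canonical class
        if entry.slug == f"{sibling.slug}-{sibling.product_class}":
            return True
    return False
```

With entries alpha (herbicide, 000-1), beta (herbicide, 000-1) and
alpha-herbicide (seed-treatment, 000-1), only alpha-herbicide matches;
alpha and beta share a reg no but are left alone.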
One-off cleanup of the existing USB corpus resolved 68 dup pairs by
removing one file from each; 159 → 91 files (73 canonical-class + 18
true seed-treatments).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on
external storage (USB drive, NAS mount, secondary partition) without
editing the repo. Default behavior unchanged: corpus/ at repo root
when the env var is unset.
Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/,
${PPLS_CORPUS_ROOT}/epa_ppls/, etc.
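A sketch of how the override could be resolved, under stated
assumptions: corpus_dir and its parameters are hypothetical names, and
treating an empty PPLS_CORPUS_ROOT the same as unset is a choice made
here, not something the scrapers are known to do.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path


def corpus_dir(source_id: str, repo_root: Path,
               env: Mapping[str, str] | None = None) -> Path:
    """Return the per-source corpus directory, honoring PPLS_CORPUS_ROOT."""
    env = os.environ if env is None else env
    override = env.get("PPLS_CORPUS_ROOT")
    # Default: corpus/ at the repo root when the variable is unset or empty.
    root = Path(override).expanduser() if override else repo_root / "corpus"
    return root / source_id
```

Passing env explicitly keeps the function testable without mutating
the process environment.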
Live-verified against /run/media/justin/USB (vfat, 59 GB free):
    PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
      python -m scrape.runner --source epa_ppls --reg-no 524-475
    -> wrote to USB, root disk untouched
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0", only the current accepted label per
EPA Reg No), so the scraper layer is reshaped around a "source"
abstraction: one source per manufacturer or regulator, and one label
per product within each source.
Sources shipped:
- bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
- epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint
Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
- active_ingredients always [{name, cas, percent}]
- label/* nested (url, filename, accepted_date, last_modified,
page_count, text_layer)
- all timestamps normalized to ISO 8601 UTC
- signal_word surfaced (operationally critical for the farmer advisor)
- source_key (per-source PK) kept separate from epa_reg_no (the
cross-source join key)
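To make the field list concrete, here is a hedged sketch of a sidecar
builder. The authoritative schema lives in scrape/README.md; the
function names, parameter names, and the choice to treat naive
datetimes as UTC are assumptions for illustration.

```python
from __future__ import annotations

from datetime import datetime, timezone


def to_iso_utc(value: datetime) -> str:
    """Render a datetime as ISO 8601 UTC, e.g. 2024-01-02T08:04:05Z."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # assumption: naive = UTC
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_sidecar(source_key: str, epa_reg_no: str, signal_word: str | None,
                  ingredients: list[tuple[str, str | None, float | None]],
                  url: str, filename: str, last_modified: datetime,
                  page_count: int, text_layer: bool,
                  accepted_date: datetime | None = None) -> dict:
    """Assemble one sidecar using the field names listed above."""
    return {
        "source_key": source_key,    # per-source primary key
        "epa_reg_no": epa_reg_no,    # cross-source join key
        "signal_word": signal_word,
        "active_ingredients": [
            {"name": name, "cas": cas, "percent": percent}
            for name, cas, percent in ingredients
        ],
        "label": {
            "url": url,
            "filename": filename,
            "accepted_date": to_iso_utc(accepted_date) if accepted_date else None,
            "last_modified": to_iso_utc(last_modified),
            "page_count": page_count,
            "text_layer": text_layer,
        },
    }
```

Keeping active_ingredients as a list of dicts even for single-ingredient
products means downstream code never has to branch on shape.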
Renames: bundles.json → sources.json and --bundle → --source. The
runner walks sources.json and dispatches by id; per-source modules
remain independently runnable for development.
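The dispatch-by-id step might look roughly like this. It is a sketch,
not the real runner: the manifest shape ({"sources": [{"id": ...}]}),
the registry of callables, and making --source and --all mutually
exclusive are all assumptions. Only the flag names come from this
commit.

```python
from __future__ import annotations

import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--source", help="id of one entry in sources.json")
    group.add_argument("--all", action="store_true", help="run every source")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--reg-no", default=None)
    return parser


def dispatch(manifest: dict, registry: dict[str, Callable[..., object]],
             args: argparse.Namespace) -> dict[str, object]:
    """Walk the manifest and call the scraper registered under each id."""
    ids = [entry["id"] for entry in manifest["sources"]]
    if args.all:
        selected = ids
    elif args.source in ids:
        selected = [args.source]
    else:
        raise SystemExit(f"unknown source: {args.source}")
    return {
        source_id: registry[source_id](limit=args.limit, reg_no=args.reg_no)
        for source_id in selected
    }
```

A per-source module stays independently runnable as long as it exposes
the same callable that the registry points at.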
PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.
Smoke test:
    python -m scrape.runner --all --limit 2                      # works
    python -m scrape.runner --source bayer --limit 3             # 3 written, idempotent re-run skips
    python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>