crop-chem-docs

Author	SHA1	Message	Date
justin	420e00b44b	bayer: dedup by EPA reg no across catalog product-type queries Bayer's seed-treatment catalog query re-serves products from herbicide/fungicide/insecticide queries that have seed-treatment use sites listed. safe_slug() correctly strips the class suffix when the catalog product type matches, but doesn't strip when querying as seed-treatment, so the same product gets written twice — once as "<base>" (canonical class) and once as "<base>-<class>" (class=seed-treatment). First full scrape produced 159 files for 87 unique EPA reg nos — ~45% redundant. Fix: - process_product accepts an optional seen_regs set and returns "dup-skip" when the product's EPA reg no is already in it. - run() seeds seen_regs from existing sidecars on disk via _load_seen_regs() so dedup survives re-runs (force overrides). - run() updates seen_regs after each successful write, so within-run dedup works for the seed-treatment query (which iterates last). Important nuance preserved: when two genuinely-different brand-name products share the same EPA reg (e.g., Absolute Maxx + Adament Flow both = 264-849), they are NOT treated as dups — they're different catalog entries with different slugs and same canonical class. Only the seed-treatment-clone pattern (slug = <canonical>-<class> AND class=seed-treatment AND sibling at same reg with matching class) is the bug we're fixing. One-off cleanup of the existing USB corpus removed 68 dup pairs; 159 → 91 files (73 canonical-class + 18 true seed-treatments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 21:27:45 -04:00
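The dedup flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names mirror the commit message, but the signatures, the `product_type` argument, the sidecar layout (`*.json` files carrying an `epa_reg_no` key), and the choice to apply the check only to the seed-treatment query are all assumptions made to keep same-reg brand-name siblings from being skipped.

```python
import json
from pathlib import Path

SEED_TREATMENT = "seed-treatment"  # assumed label for the catalog product type


def load_seen_regs(corpus_dir):
    """Collect EPA reg nos from sidecars already on disk (assumed *.json layout)."""
    seen = set()
    for sidecar in Path(corpus_dir).glob("*.json"):
        try:
            data = json.loads(sidecar.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed sidecar: ignore it
        if isinstance(data, dict) and data.get("epa_reg_no"):
            seen.add(data["epa_reg_no"])
    return seen


def process_product(product, product_type, seen_regs=None):
    """Return "dup-skip" for a seed-treatment re-serve of an already-seen reg no."""
    reg = product.get("epa_reg_no")
    if seen_regs is not None and product_type == SEED_TREATMENT and reg in seen_regs:
        return "dup-skip"
    # A real scraper would download the label and write the sidecar here.
    return "written"


def run(queries, seen_regs):
    """Walk (product_type, products) pairs in order, updating seen_regs after each write."""
    results = {}
    for product_type, products in queries:
        for product in products:
            status = process_product(product, product_type, seen_regs)
            if status == "written" and product.get("epa_reg_no"):
                seen_regs.add(product["epa_reg_no"])
            results[(product_type, product["slug"])] = status
    return results
```

Because the seed-treatment query iterates last, every canonical-class reg no is already in `seen_regs` by the time a clone shows up, while two brand-name products sharing a reg no within the same class query are both written.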
justin	717426f873	scrape: route corpus via PPLS_CORPUS_ROOT env var Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on external storage (USB drive, NAS mount, secondary partition) without editing the repo. Default behavior unchanged: corpus/ at repo root when the env var is unset. Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/, ${PPLS_CORPUS_ROOT}/epa_ppls/, etc. Live-verified against /run/media/justin/USB (vfat, 59GB free): PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \ python -m scrape.runner --source epa_ppls --reg-no 524-475 -> wrote to USB, root disk untouched Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 20:41:56 -04:00
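One way the env-var routing in this commit might look, as a hedged sketch: `corpus_dir` and `REPO_ROOT` are hypothetical names (the commit does not show the actual helper), and `Path.cwd()` stands in for the repo root.

```python
import os
from pathlib import Path

REPO_ROOT = Path.cwd()  # assumption: stand-in for the repository root


def corpus_dir(source_id, env=None):
    """Resolve the per-source corpus directory, honoring PPLS_CORPUS_ROOT if set."""
    env = os.environ if env is None else env
    root = env.get("PPLS_CORPUS_ROOT")
    # Unset or empty falls back to corpus/ at the repo root (default behavior).
    base = Path(root) if root else REPO_ROOT / "corpus"
    return base / source_id
```

Passing `env` explicitly keeps the helper testable without mutating the process environment; with `PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus`, `corpus_dir("bayer")` would resolve to `/run/media/justin/USB/ppls-corpus/bayer`.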
justin	ea3aea5871	epa_ppls: narrow row-crop filter to corn/soy/wheat only App focus is corn, soybeans, and wheat. Dropping the broader US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa). Empirical impact (random N=100 sample): broad list matched 17/100 products, narrow list matches 16/100, only a 6% reduction, because corn/soy/wheat dominate ag-chem registrations so thoroughly that products registered for cotton/sorghum/etc. are almost always co-registered for one of corn/soy/wheat. One sampled product was dropped: a peanut-only herbicide (2749-614). Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN sites); 2749-614 AG36448 (PEANUTS only) correctly filtered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:39:55 -04:00
justin	60657aa6df	epa_ppls: filter PPLS enumeration to row-crop products The farmer-advisor consumer only cares about US row crops, so the EPA scraper now drops products without at least one row-crop site in the PPLS API response. Filter is on by default; --no-row-crop-filter overrides for one-off broader pulls. Filter shape: - Word-boundary regex match against each entry in the API's `sites` array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS, SHIPHOLDS" → drop even though it contains "OATS" as substring). - Allowlist covers the major US row + small-grain + oilseed + sugar/fiber crops, plus alfalfa as a common rotation crop. See ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list. Cost model: - 102K PPIS rows still need one API call each (no bulk filter available upstream), so enumeration still takes ~28h at 1 req/sec. - But PDF downloads drop from ~67K → ~5-10K (estimated row-crop hit rate), saving ~17h wall time and ~60GB disk on a full backfill. Smoke test (4 mixed reg nos): 524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites) 524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites) 100-1486 Advion Cockroach → filtered (building/transport sites only) 432-1276 (Bayer pet flea) → filtered (no row crops) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:05:26 -04:00
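The word-boundary match described in this commit could be sketched like this. The keyword list below is a short illustrative stand-in, not the real `ROW_CROP_KEYWORDS` from scrape/sources/epa_ppls.py, and `has_row_crop_site` is a hypothetical helper name.

```python
import re

# Illustrative subset only; the actual allowlist lives in ROW_CROP_KEYWORDS
# in scrape/sources/epa_ppls.py and may differ in content and spelling.
ROW_CROP_KEYWORDS = ["corn", "soybean", "soybeans", "wheat", "oats", "cotton", "sorghum"]

# \b anchors each keyword on word boundaries, so "OATS" inside "BOATS"
# does not count as a match.
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_row_crop_site(sites):
    """Return True if any entry of the API's `sites` array names a row crop."""
    return any(_ROW_CROP_RE.search(site) for site in sites)
```

With this sketch, `has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"])` keeps the product, while `has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"])` drops it, matching the two cases called out in the commit message.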
justin	e9250de8e7	scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0" — there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source. Sources shipped: - bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs) - epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint Canonical sidecar schema (see scrape/README.md) unifies fields across sources: - active_ingredients always [{name, cas, percent}] - label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer) - all timestamps normalized to ISO 8601 UTC - signal_word surfaced (operationally critical for the farmer advisor) - source_key + epa_reg_no separate per-source PK from the cross-source join key bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development. PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written. Smoke test: python -m scrape.runner --all --limit 2 # works python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 18:27:07 -04:00
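As a rough illustration of the canonical sidecar shape listed in this commit, the sketch below builds a sidecar dict and normalizes a timestamp to ISO 8601 UTC. The helper names (`to_iso_utc`, `make_sidecar`), the `source` key, and the trailing-`Z` timestamp format are assumptions; scrape/README.md is the authority on the real schema.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(dt):
    """Format a datetime as ISO 8601 UTC; naive values are assumed to be UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def make_sidecar(source, source_key, epa_reg_no, signal_word,
                 ingredients, label_url, last_modified):
    """Build a sidecar dict using the field names from the commit message."""
    return {
        "source": source,            # assumed field: which scraper produced the record
        "source_key": source_key,    # per-source primary key
        "epa_reg_no": epa_reg_no,    # cross-source join key
        "signal_word": signal_word,
        "active_ingredients": [
            {"name": name, "cas": cas, "percent": percent}
            for name, cas, percent in ingredients
        ],
        "label": {
            "url": label_url,
            "last_modified": to_iso_utc(last_modified),
        },
    }


# Usage with placeholder values (not real product data):
eastern = timezone(timedelta(hours=-4))
sidecar = make_sidecar(
    "bayer", "example-product", "000-000", "CAUTION",
    [("example ingredient", "0000-00-0", 41.0)],
    "https://example.invalid/label.pdf",
    datetime(2026, 5, 23, 21, 27, 45, tzinfo=eastern),
)
```

Keeping `source_key` and `epa_reg_no` as separate fields reflects the commit's distinction between the per-source primary key and the key used to join records across sources.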

5 Commits