crop-chem-docs

Author	SHA1	Message	Date
justin	92a95d5e78	epa_ppls: add registrant allowlist pre-API filter Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by dropping products from non-row-crop-ag registrants BEFORE the per- product API call. This is the biggest cost lever we have on the EPA scraper — full backfill drops from ~28 h to ~3.5 h. scrape/sources/epa_registrant_allowlist.json holds the 34 verified ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.). Each entry was verified by querying the EPA PPLS API for the first active product registered under that company number. Edit the JSON freely — scraper loads it at run time. Bypass with --no-registrant-filter when you suspect a row-crop product registered to a specialty company not on the list. Why a curated allowlist rather than blacklist consumer brands: the 102K PPIS rows are 89% non-ag-relevant; an allowlist is shorter to maintain and harder to false-positive. Excluded with intent (not omissions): Bayer Environmental Science (turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon (animal flea/tick), Control Solutions (structural pest), Cleary (turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water). Smoke test --limit 100: - 1239 PPIS rows considered (in first slice of file) - 1139 skipped by registrant filter (no API call paid) - 100 hit API, 81 filtered by row-crop sites, 19 written - = 91% API-call reduction over the prior version Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 23:55:38 -04:00
justin	ea3aea5871	epa_ppls: narrow row-crop filter to corn/soy/wheat only App focus is corn, soybeans, and wheat. Dropping the broader US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/ sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa). Empirical impact (random N=100 sample): broad list matched 17/100 products, narrow list matches 16/100 — only 6% reduction, because corn/soy/wheat dominate ag-chem registrations so thoroughly that products registered for cotton/sorghum/etc. are almost always co-registered for one of corn/soy/wheat. One sampled product was dropped: a peanut-only herbicide (2749-614). Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN sites); 2749-614 AG36448 (PEANUTS only) correctly filtered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:39:55 -04:00
justin	60657aa6df	epa_ppls: filter PPLS enumeration to row-crop products The farmer-advisor consumer only cares about US row crops, so the EPA scraper now drops products without at least one row-crop site in the PPLS API response. Filter is on by default; --no-row-crop-filter overrides for one-off broader pulls. Filter shape: - Word-boundary regex match against each entry in the API's `sites` array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS, SHIPHOLDS" → drop even though it contains "OATS" as substring). - Allowlist covers the major US row + small-grain + oilseed + sugar/ fiber crops, plus alfalfa as a common rotation crop. See ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list. Cost model: - 102K PPIS rows still need one API call each (no bulk filter available upstream), so enumeration still takes ~28h at 1 req/sec. - But PDF downloads drop from ~67K → ~5-10K (estimated row-crop hit rate), saving ~17h wall time and ~60GB disk on a full backfill. Smoke test (4 mixed reg nos): 524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites) 524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites) 100-1486 Advion Cockroach → filtered (building/transport sites only) 432-1276 (Bayer pet flea) → filtered (no row crops) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:05:26 -04:00
justin	e9250de8e7	scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0" — there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source. Sources shipped: - bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs) - epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint Canonical sidecar schema (see scrape/README.md) unifies fields across sources: - active_ingredients always [{name, cas, percent}] - label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer) - all timestamps normalized to ISO 8601 UTC - signal_word surfaced (operationally critical for the farmer advisor) - source_key + epa_reg_no separate per-source PK from the cross-source join key bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development. PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written. Smoke test: python -m scrape.runner --all --limit 2 # works python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 18:27:07 -04:00

4 Commits