justin 60657aa6df epa_ppls: filter PPLS enumeration to row-crop products
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. The filter is on by default; --no-row-crop-filter
disables it for one-off broader pulls.
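A default-on filter with a `--no-...` override can be wired up in argparse roughly as follows. This is a minimal sketch, not the scraper's actual CLI code: the `build_parser` helper and the `row_crop_filter` destination name are assumptions for illustration; only the flag name comes from this commit.

```python
import argparse


def build_parser():
    """Hypothetical parser showing a filter that is on unless explicitly disabled."""
    parser = argparse.ArgumentParser(description="EPA PPLS scraper (sketch)")
    # store_false makes the destination default to True, so the filter
    # stays on unless --no-row-crop-filter is passed.
    parser.add_argument(
        "--no-row-crop-filter",
        dest="row_crop_filter",  # assumed attribute name
        action="store_false",
        help="keep all products, not just those with a row-crop site",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("row-crop filter enabled:", args.row_crop_filter)
```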

Filter shape:
  - Word-boundary regex match against each entry in the API's `sites`
    array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
    SHIPHOLDS" → drop even though it contains "OATS" as substring).
  - Allowlist covers the major US row + small-grain + oilseed + sugar/
    fiber crops, plus alfalfa as a common rotation crop. See
    ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.
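The word-boundary match described above might look like the sketch below. The keyword tuple is an illustrative subset chosen for this example, not the real ROW_CROP_KEYWORDS list, and `has_row_crop_site` is a hypothetical helper name.

```python
import re

# Illustrative subset only; the real allowlist is ROW_CROP_KEYWORDS in
# scrape/sources/epa_ppls.py and is not reproduced here.
EXAMPLE_KEYWORDS = ("CORN", "SOYBEANS", "COTTON", "SORGHUM", "OATS", "ALFALFA")

# \b anchors each keyword at word boundaries, so "OATS" cannot match
# inside "BOATS".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in EXAMPLE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_row_crop_site(sites):
    """Return True if any site string names an allowlisted crop as a whole word."""
    return any(_PATTERN.search(site) for site in sites)


if __name__ == "__main__":
    print(has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"]))
    print(has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"]))
```

A plain substring check (`"OATS" in site`) would wrongly keep the second product, which is the case the word boundaries guard against.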

Cost model:
  - 102K PPIS rows still need one API call each (no bulk filter
    available upstream), so enumeration still takes ~28h at 1 req/sec.
  - PDF downloads, however, drop from ~67K to an estimated ~5-10K
    (row-crop hits only), saving ~17h wall time and ~60GB disk on a
    full backfill.
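The figures above can be checked with back-of-envelope arithmetic. Two inputs in this sketch are assumptions not stated in the commit: that PDF downloads are paced at the same 1 req/sec as enumeration, and that a label PDF averages about 1 MB (the value implied by ~60GB over roughly 60K files).

```python
SECONDS_PER_HOUR = 3600
ASSUMED_MB_PER_PDF = 1.0  # assumption: implied by ~60GB / ~60K PDFs


def hours_at_rate(requests, req_per_sec=1.0):
    """Wall-clock hours to issue `requests` sequentially at a fixed rate."""
    return requests / req_per_sec / SECONDS_PER_HOUR


# Enumeration: one API call per row, unchanged by the filter (roughly 28.3 h).
enumeration_h = hours_at_rate(102_000)

# Downloads avoided: ~67K before, ~5-10K after.
saved_low = 67_000 - 10_000   # conservative end of the estimate
saved_high = 67_000 - 5_000   # optimistic end of the estimate

# Assumes downloads are also paced at 1 req/sec.
saved_h = (hours_at_rate(saved_low), hours_at_rate(saved_high))
saved_gb = tuple(n * ASSUMED_MB_PER_PDF / 1000 for n in (saved_low, saved_high))

if __name__ == "__main__":
    print(f"enumeration: {enumeration_h:.1f} h")
    print(f"download time saved: {saved_h[0]:.1f} to {saved_h[1]:.1f} h")
    print(f"disk saved: {saved_gb[0]:.0f} to {saved_gb[1]:.0f} GB")
```

Under those assumptions the saved range brackets the ~17h and ~60GB quoted above.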

Smoke test (4 mixed reg nos):
  524-475 Roundup Ultra        → kept (CORN/SOYBEANS/COTTON sites)
  524-591 Warrant              → kept (CORN/SOYBEANS/SORGHUM sites)
  100-1486 Advion Cockroach    → filtered (building/transport sites only)
  432-1276 (Bayer pet flea)    → filtered (no row crops)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:05:26 -04:00