epa_ppls: narrow row-crop filter to corn/soy/wheat only

App focus is corn, soybeans, and wheat. This drops every other crop
from the broader US-row-crops allowlist (cotton/rice/sorghum/milo/
barley/oats/rye/sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa).

Empirical impact (random N=100 sample): broad list matched 17/100
products, narrow list matches 16/100. That is only a ~6% reduction
in matches (1 of 17), because corn/soy/wheat dominate ag-chem
registrations so thoroughly that products registered for
cotton/sorghum/etc. are almost always co-registered for one of
corn/soy/wheat. One sampled product was dropped: a peanut-only
herbicide (2749-614).

Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN
sites); 2749-614 AG36448 (PEANUTS only) correctly filtered.
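
A rough sketch of the word-boundary check, for illustration only. It
is not the scraper's actual code: the keyword values, the helper name
`has_matching_site`, and the assumption that `sites` is a flat list of
site-name strings are all hypothetical.

```python
import re

# Hypothetical keyword list; the real ROW_CROP_KEYWORDS may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

# One case-insensitive pattern that matches any keyword as a whole word.
_SITE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_matching_site(sites):
    """Return True if any site name contains a keyword as a whole word."""
    return any(_SITE_RE.search(site) for site in sites)


# A product co-registered for corn is kept even if it also lists cotton.
print(has_matching_site(["CORN (FIELD)", "COTTON"]))
# A peanut-only product has no matching site and would be filtered out.
print(has_matching_site(["PEANUTS"]))
```

The word boundary is what keeps a site such as PEPPERCORN from matching
`corn`, which is also why `popcorn` needs its own entry in this sketch.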

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:39:55 -04:00
parent 60657aa6df
commit ea3aea5871
3 changed files with 25 additions and 25 deletions
+17 -11
@@ -41,20 +41,26 @@ Every scraper is **idempotent** by default — re-running with the
 same arguments skips records already on disk. Use `--force` to
 re-fetch.

-## Scope: row crops only
+## Scope: corn / soybeans / wheat

-The corpus is scoped to **US row crops** — corn, soybeans, cotton,
-wheat, rice, sorghum/milo, barley, oats, rye, sunflowers, peanuts,
-sugar beets, dry/field beans, canola/rapeseed, and alfalfa. The
-EPA PPLS scraper enforces this by inspecting the `sites` array on
-each product's PPLS API response and dropping anything without a
-row-crop site (word-boundary match).
+The corpus is scoped to the three crops the consumer app focuses on:
+**corn (incl. maize, popcorn), soybeans, and wheat.** The EPA PPLS
+scraper enforces this by inspecting the `sites` array on each
+product's PPLS API response and dropping anything without a matching
+site (word-boundary match against `ROW_CROP_KEYWORDS`).
+
+Empirically (random N=100 sample): this narrow allowlist matches
+~16% of all PPLS products and only loses ~6% of the broader
+"all US row crops" hit set, because corn/soy/wheat dominate ag
+chemistry registrations — products registered for cotton/sorghum/
+rice/etc. are almost always *also* registered for one of corn,
+soy, or wheat.

 The Bayer scraper doesn't filter — its catalog is implicitly
-ag-focused, and dropping fungicide/insecticide/seed-treatment
-products there would lose row-crop-relevant chemistry. Add
-per-source filters as needed if other manufacturer sources cover
-non-ag products.
+ag-focused, and the catalog product names + descriptions don't
+expose enough crop metadata for a pre-API filter to be reliable.
+Add per-source filters as needed if other manufacturer sources
+turn up non-ag products.

 Override the EPA filter for a one-off broader pull: