epa_ppls: filter PPLS enumeration to row-crop products
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. Filter is on by default; --no-row-crop-filter
overrides for one-off broader pulls.
Filter shape:
- Word-boundary regex match against each entry in the API's `sites`
array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
SHIPHOLDS" → drop even though it contains "OATS" as substring).
- Allowlist covers the major US row + small-grain + oilseed + sugar/
fiber crops, plus alfalfa as a common rotation crop. See
ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.
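A minimal sketch of the word-boundary match described above, assuming `sites` is a list of strings. The keyword list here is a hypothetical subset for illustration, and `has_row_crop_site` is an illustrative name; the real allowlist and function names in scrape/sources/epa_ppls.py may differ.

```python
import re

# Hypothetical subset of the allowlist, for illustration only. The real
# ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py may contain other entries.
ROW_CROP_KEYWORDS = [
    "CORN", "SOYBEAN", "SOYBEANS", "COTTON", "WHEAT",
    "OATS", "SORGHUM", "ALFALFA",
]

# One alternation wrapped in \b...\b, so "OATS" cannot match inside "BOATS".
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_row_crop_site(sites):
    """Return True if any entry in `sites` names an allowlisted crop."""
    return any(_ROW_CROP_RE.search(site) for site in sites)


if __name__ == "__main__":
    print(has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"]))  # keep
    print(has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"]))      # drop
```

A plain substring test (`"OATS" in site`) would keep the ships-and-boats entry, which is why the boundary anchors matter here.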
Cost model:
- 102K PPLS rows still need one API call each (no bulk filter
available upstream), so enumeration still takes ~28h at 1 req/sec.
- But PDF downloads drop from ~67K → ~5-10K (estimated row-crop
hit rate), saving ~17h wall time and ~60GB disk on a full backfill.
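The arithmetic behind these estimates can be checked with a short sketch. The row and PDF counts come from the figures above; the assumption that PDF downloads are paced at roughly one per second, like the API calls, is mine and is not stated in the commit.

```python
ROWS = 102_000        # PPLS rows to enumerate
REQ_PER_SEC = 1.0     # stated enumeration rate
SECONDS_PER_HOUR = 3600

# Enumeration cost is unchanged by the filter: one API call per row.
enumeration_hours = ROWS / REQ_PER_SEC / SECONDS_PER_HOUR

PDFS_BEFORE = 67_000
PDFS_AFTER_LOW, PDFS_AFTER_HIGH = 5_000, 10_000  # estimated row-crop hits

# Assumption: downloads also run at about one per second.
saved_hours_min = (PDFS_BEFORE - PDFS_AFTER_HIGH) / REQ_PER_SEC / SECONDS_PER_HOUR
saved_hours_max = (PDFS_BEFORE - PDFS_AFTER_LOW) / REQ_PER_SEC / SECONDS_PER_HOUR

if __name__ == "__main__":
    print(f"enumeration: {enumeration_hours:.1f} h")
    print(f"download time saved: {saved_hours_min:.1f} to {saved_hours_max:.1f} h")
```

Under that pacing assumption the savings come out between roughly 16h and 17h, so the ~17h figure corresponds to the low end of the estimated hit rate.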
Smoke test (4 mixed reg nos):
524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites)
524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites)
100-1486 Advion Cockroach → filtered (building/transport sites only)
432-1276 (Bayer pet flea) → filtered (no row crops)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -41,6 +41,27 @@ Every scraper is **idempotent** by default — re-running with the
same arguments skips records already on disk. Use `--force` to
re-fetch.

## Scope: row crops only

The corpus is scoped to **US row crops** — corn, soybeans, cotton,
wheat, rice, sorghum/milo, barley, oats, rye, sunflowers, peanuts,
sugar beets, dry/field beans, canola/rapeseed, and alfalfa. The
EPA PPLS scraper enforces this by inspecting the `sites` array on
each product's PPLS API response and dropping anything without a
row-crop site (word-boundary match).

The Bayer scraper doesn't filter — its catalog is implicitly
ag-focused, and dropping fungicide/insecticide/seed-treatment
products there would lose row-crop-relevant chemistry. Add
per-source filters as needed if other manufacturer sources cover
non-ag products.

Override the EPA filter for a one-off broader pull:

```bash
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
```

## Canonical sidecar schema

Every `corpus/<source>/<key>.json` conforms to this shape. Fields