epa_ppls: add registrant allowlist pre-API filter

Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by
dropping products from non-row-crop-ag registrants BEFORE the per-
product API call. This is the biggest cost lever we have on the EPA
scraper — full backfill drops from ~28 h to ~3.5 h.

scrape/sources/epa_registrant_allowlist.json holds the 34 verified
ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm,
ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.).
Each entry was verified by querying the EPA PPLS API for the first
active product registered under that company number. Edit the JSON
freely — scraper loads it at run time. Bypass with
--no-registrant-filter when you suspect a row-crop product registered
to a specialty company not on the list.

Why a curated allowlist rather than a blacklist of consumer brands:
about 89% of the 102K PPIS rows are not ag-relevant, so an allowlist
is shorter to maintain and less likely to let false positives through.

Excluded with intent (not omissions): Bayer Environmental Science
(turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon
(animal flea/tick), Control Solutions (structural pest), Cleary
(turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).

Smoke test --limit 100:
  - 1239 PPIS rows considered (in first slice of file)
  - 1139 skipped by registrant filter (no API call paid)
  - 100 hit API, 81 filtered by row-crop sites, 19 written
  - = 91% API-call reduction over the prior version

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 23:55:38 -04:00
parent 420e00b44b
commit 92a95d5e78
4 changed files with 117 additions and 2 deletions
@@ -85,6 +85,29 @@ Override the EPA filter for a one-off broader pull:
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
```
### EPA registrant allowlist

The EPA scraper applies a second filter at PPIS enumeration time:
**only consider products from companies on the row-crop ag-chem
allowlist** at [`scrape/sources/epa_registrant_allowlist.json`](sources/epa_registrant_allowlist.json).
This is a pre-API filter: products from non-allowlist registrants
are dropped before paying the per-product API call cost.
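
A minimal sketch of how such a pre-API filter could look, assuming the company number is the first hyphen-separated segment of the EPA registration number (as in `100-1486`). The JSON layout, the `reg_no` row key, and the function names are hypothetical and not taken from the scraper's actual code.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def load_allowlist(text: str) -> set[str]:
    """Parse allowlist JSON into a set of company-number strings.

    Hypothetical layout: {"registrants": [{"company_number": "100", ...}]}.
    The real epa_registrant_allowlist.json may be shaped differently.
    """
    data = json.loads(text)
    return {str(entry["company_number"]) for entry in data["registrants"]}


def company_number(reg_no: str) -> str:
    """Return the segment before the first hyphen: '100-1486' -> '100'."""
    return reg_no.split("-", 1)[0].strip()


def filter_rows(rows: Iterable[dict], allowlist: set[str],
                enabled: bool = True) -> Iterator[dict]:
    """Yield only rows whose registrant is allowlisted.

    With enabled=False (the --no-registrant-filter case) every row passes.
    """
    for row in rows:
        if not enabled or company_number(row["reg_no"]) in allowlist:
            yield row


# Illustrative data only; 99999 is a made-up company number.
SAMPLE_ALLOWLIST = '{"registrants": [{"company_number": "100", "name": "Example Registrant"}]}'
SAMPLE_ROWS = [{"reg_no": "100-1486"}, {"reg_no": "99999-1"}, {"reg_no": "100-1486-12345"}]

allow = load_allowlist(SAMPLE_ALLOWLIST)
kept = [r["reg_no"] for r in filter_rows(SAMPLE_ROWS, allow)]
print(kept)  # ['100-1486', '100-1486-12345']
```

Because the check is a set lookup on a string prefix, it costs nothing compared with an API round trip, which is why it pays to run it before the per-product call.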

Effect: the 102,378-row PPIS universe shrinks to ~11,500 rows
(~89% reduction). Full backfill drops from ~28 h to ~3.5 h.
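
The row counts above can be sanity-checked with a few lines of arithmetic. This sketch assumes backfill time scales linearly with the number of per-product API calls, which ignores any fixed overhead, so the result is a rough estimate and not a measurement.

```python
# Figures quoted in this section.
TOTAL_ROWS = 102_378
KEPT_ROWS = 11_500          # approximate
FULL_BACKFILL_HOURS = 28.0  # approximate, before the registrant filter

# Share of PPIS rows dropped before any API call is made.
reduction = 1 - KEPT_ROWS / TOTAL_ROWS

# Assumption: runtime is proportional to API calls (no fixed overhead).
estimated_hours = FULL_BACKFILL_HOURS * KEPT_ROWS / TOTAL_ROWS

print(f"reduction: {reduction:.1%}")                    # reduction: 88.8%
print(f"estimated backfill: {estimated_hours:.1f} h")   # estimated backfill: 3.1 h
```

Under that assumption the estimate comes out a little over 3 h, in the same ballpark as the ~3.5 h quoted in the commit message for this change.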

The allowlist covers the major US row-crop ag-chem registrants
(Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh,
Loveland, AMVAC, Helena, Drexel, Atticus, etc.); see the JSON file
for the full set with verified company names. Edit it freely; the
scraper loads it at run time. Each entry was verified by querying
the EPA PPLS API for the first active product registered under that
company number.

Bypass with `--no-registrant-filter` to enumerate the full universe
(useful if you suspect a row-crop product is registered to a small
or specialty company not on the list).

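
One way a default-on filter with an opt-out flag can be wired up with `argparse` is shown below. This is a hypothetical sketch: the flag names come from this README and the commit message, but the `dest` name, the help text, and the assumed meaning of `--limit` are illustrative, and the real parser may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring; the real scraper's parser may differ."""
    parser = argparse.ArgumentParser(prog="scrape.sources.epa_ppls")
    # store_false keeps the filter on by default; passing the flag turns it off.
    parser.add_argument(
        "--no-registrant-filter",
        dest="registrant_filter",
        action="store_false",
        help="enumerate the full PPIS universe, ignoring the allowlist",
    )
    # Assumed meaning: cap on the number of per-product API calls.
    parser.add_argument("--limit", type=int, default=None,
                        help="stop after this many API calls")
    return parser


args = build_parser().parse_args(["--no-registrant-filter", "--limit", "100"])
print(args.registrant_filter, args.limit)  # False 100
```

Keeping the flag negative (`--no-...`) means the cheap, filtered path is what runs when nothing is specified, and the expensive full enumeration has to be requested explicitly.
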
## Canonical sidecar schema
Every `corpus/<source>/<key>.json` conforms to this shape. Fields