92a95d5e78
Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by dropping products from non-row-crop-ag registrants BEFORE the per-product API call. This is the biggest cost lever we have on the EPA scraper: full backfill drops from ~28 h to ~3.5 h.

scrape/sources/epa_registrant_allowlist.json holds the 34 verified ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.). Each entry was verified by querying the EPA PPLS API for the first active product registered under that company number. Edit the JSON freely; the scraper loads it at run time. Bypass with --no-registrant-filter when you suspect a row-crop product registered to a specialty company not on the list.

Why a curated allowlist rather than blacklist consumer brands: the 102K PPIS rows are 89% non-ag-relevant; an allowlist is shorter to maintain and harder to false-positive.

Excluded with intent (not omissions): Bayer Environmental Science (turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon (animal flea/tick), Control Solutions (structural pest), Cleary (turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).

Smoke test --limit 100:
- 1239 PPIS rows considered (in first slice of file)
- 1139 skipped by registrant filter (no API call paid)
- 100 hit API, 81 filtered by row-crop sites, 19 written
- = 91% API-call reduction over the prior version

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
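The pre-API registrant filter described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the allowlist shape (a JSON array of objects with a `company_number` key), the `reg_no` field on PPIS rows, and the function names are all assumptions. The one thing it relies on from outside the commit is that an EPA registration number begins with the registrant's company number, followed by a hyphen.

```python
import json


def load_allowlist(text):
    """Parse allowlist JSON into a set of company-number strings.

    Assumed shape (hypothetical): [{"company_number": 100, ...}, ...].
    """
    return {str(entry["company_number"]).lstrip("0") for entry in json.loads(text)}


def company_number(reg_no):
    """Return the company-number segment of an EPA registration number."""
    # "100-1234" -> "100"; leading zeros are dropped so "000100-..." also works.
    return reg_no.split("-", 1)[0].strip().lstrip("0")


def filter_rows(rows, allowlist, enabled=True):
    """Keep rows whose registrant is allowlisted; count the rest as skipped.

    enabled=False mimics --no-registrant-filter: every row is kept.
    """
    kept, skipped = [], 0
    for row in rows:
        if enabled and company_number(row["reg_no"]) not in allowlist:
            skipped += 1  # no API call is paid for this row
            continue
        kept.append(row)
    return kept, skipped


# Placeholder data: the names and numbers here are illustrative only.
allowlist = load_allowlist(
    '[{"company_number": 100, "name": "Example A"},'
    ' {"company_number": 264, "name": "Example B"}]'
)
rows = [{"reg_no": "100-1234"}, {"reg_no": "9999-1"}, {"reg_no": "264-77-5555"}]
kept, skipped = filter_rows(rows, allowlist)
```

Because the filter runs during enumeration, its saving is simply the share of rows skipped: in the smoke test above, 1139 of 1239 rows never reached the API.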
23 lines
1.3 KiB
JSON
[
{
"id": "bayer",
"title": "Bayer Crop Science US — Product Labels",
"type": "manufacturer",
"homepage": "https://www.cropscience.bayer.us",
"scraper": "scrape.sources.bayer",
"scraper_version": "0.1.0",
"license_note": "robots.txt explicitly permits scraping for AI retrieval-augmented generation (verified 2026-05)"
},
{
"id": "epa_ppls",
"title": "EPA Pesticide Product Label System",
"type": "regulator",
"homepage": "https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:1",
"scraper": "scrape.sources.epa_ppls",
"scraper_version": "0.1.0",
"license_note": "US federal government — public domain (no ToS restriction)",
"scope_filter": "corn / soybean / wheat only — products with at least one site matching CORN, MAIZE, POPCORN, SOYBEAN(S), or WHEAT (word-boundary match). Hits ~16% of the PPLS universe in sampling. Pass --no-row-crop-filter to scrape the full PPLS universe.",
"registrant_filter": "Pre-API filter at PPIS enumeration: only products from registrants on scrape/sources/epa_registrant_allowlist.json (34 major US ag-chem companies — Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.) hit the API. Cuts the 102K-row PPIS universe to ~11.5K — full backfill drops from ~28h to ~5-6h. --no-registrant-filter to skip."
}
]
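
The word-boundary site match named in `scope_filter` might look something like the sketch below. The pattern is built from the site names listed in that field; the function name, the list-of-strings input, and the case-insensitive flag are assumptions made for illustration, not the scraper's confirmed behavior.

```python
import re

# Sites taken from the scope_filter text; SOYBEANS? covers SOYBEAN and SOYBEANS.
# Case-insensitivity is an assumption, in case site strings are not upper-case.
ROW_CROP_SITE = re.compile(
    r"\b(?:CORN|MAIZE|POPCORN|SOYBEANS?|WHEAT)\b", re.IGNORECASE
)


def has_row_crop_site(sites):
    """Return True if at least one site string names a corn/soybean/wheat site."""
    return any(ROW_CROP_SITE.search(site) for site in sites)


# Word boundaries keep substrings from matching: BUCKWHEAT is not WHEAT,
# and ACORN is not CORN. POPCORN matches only because it is listed explicitly.
examples = {
    "field corn": has_row_crop_site(["CORN (FIELD)", "TURF"]),
    "soybeans": has_row_crop_site(["SOYBEANS"]),
    "popcorn": has_row_crop_site(["POPCORN"]),
    "buckwheat": has_row_crop_site(["BUCKWHEAT"]),
    "acorn squash": has_row_crop_site(["ACORN SQUASH"]),
}
```

Listing POPCORN separately matters because `\bCORN\b` alone cannot match inside "POPCORN": there is no word boundary between the "P" and the "C".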