epa_ppls: narrow row-crop filter to corn/soy/wheat only
App focus is corn, soybeans, and wheat. Drop the broader US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa).

Empirical impact (random N=100 sample): the broad list matched 17/100 products and the narrow list matches 16/100. That is a relative reduction of only about 6% (1 of 17 hits), because corn/soy/wheat dominate ag-chem registrations so thoroughly that products registered for cotton/sorghum/etc. are almost always co-registered for at least one of corn/soy/wheat. The one sampled product that was dropped is a peanut-only herbicide (2749-614).

Verified live: 524-475 Roundup and 524-591 Warrant are kept (CORN/SOYBEAN sites); 2749-614 AG36448 (PEANUTS only) is correctly filtered out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
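The sampling figures quoted in this message can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative and not part of the scraper. With N=100 the difference comes down to a single product, so the percentages are rough estimates.

```python
# Counts taken from the commit message (random sample of 100 PPLS products).
sample_size = 100
broad_hits = 17    # matched by the old, broad row-crop allowlist
narrow_hits = 16   # matched by the corn/soy/wheat allowlist

# Share of the sampled PPLS universe that the narrow filter keeps.
match_rate = narrow_hits / sample_size

# Share of the broad hit set that the narrow filter loses (relative, not
# percentage points): 1 dropped product out of 17 broad hits.
relative_loss = (broad_hits - narrow_hits) / broad_hits

print(f"{match_rate:.0%} of sample matched; {relative_loss:.1%} of broad hits lost")
# prints: 16% of sample matched; 5.9% of broad hits lost
```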
+17
-11
@@ -41,20 +41,26 @@ Every scraper is **idempotent** by default — re-running with the
 same arguments skips records already on disk. Use `--force` to
 re-fetch.
 
-## Scope: row crops only
+## Scope: corn / soybeans / wheat
 
-The corpus is scoped to **US row crops** — corn, soybeans, cotton,
-wheat, rice, sorghum/milo, barley, oats, rye, sunflowers, peanuts,
-sugar beets, dry/field beans, canola/rapeseed, and alfalfa. The
-EPA PPLS scraper enforces this by inspecting the `sites` array on
-each product's PPLS API response and dropping anything without a
-row-crop site (word-boundary match).
+The corpus is scoped to the three crops the consumer app focuses on:
+**corn (incl. maize, popcorn), soybeans, and wheat.** The EPA PPLS
+scraper enforces this by inspecting the `sites` array on each
+product's PPLS API response and dropping anything without a matching
+site (word-boundary match against `ROW_CROP_KEYWORDS`).
+
+Empirically (random N=100 sample): this narrow allowlist matches
+~16% of all PPLS products and only loses ~6% of the broader
+"all US row crops" hit set, because corn/soy/wheat dominate ag
+chemistry registrations — products registered for cotton/sorghum/
+rice/etc. are almost always *also* registered for one of corn,
+soy, or wheat.
 
 The Bayer scraper doesn't filter — its catalog is implicitly
-ag-focused, and dropping fungicide/insecticide/seed-treatment
-products there would lose row-crop-relevant chemistry. Add
-per-source filters as needed if other manufacturer sources cover
-non-ag products.
+ag-focused, and the catalog product names + descriptions don't
+expose enough crop metadata for a pre-API filter to be reliable.
+Add per-source filters as needed if other manufacturer sources
+turn up non-ag products.
 
 Override the EPA filter for a one-off broader pull:
 
@@ -83,23 +83,17 @@ MAX_RETRIES = 4
 # "OATS" naively matches "SHIPS, BOATS, SHIPHOLDS"; bare "RICE" matches
 # "LICORICE"; bare "RYE" matches "FRYER".
 #
-# Scope = the major US row + small-grain + oilseed + sugar/fiber crops the
-# farmer-advisor consumer cares about. Alfalfa included as a common rotation
-# crop; sweet/seed corn included alongside field corn.
+# Scope = the three crops the farmer-advisor consumer focuses on: corn,
+# soybeans, and wheat. Sweet/seed/pop corn included alongside field corn.
+# Empirically (random N=100 sample, 2026-05-23): this narrow allowlist
+# matches ~16% of all PPLS products and only loses ~6% of the broader
+# "all US row crops" hit set, because corn/soy/wheat dominate ag chemistry
+# registrations — almost every product registered for e.g. cotton or
+# sorghum is co-registered for at least one of corn/soy/wheat.
 ROW_CROP_KEYWORDS = (
     "CORN", "MAIZE", "POPCORN",
     "SOYBEAN", "SOYBEANS",
-    "COTTON",
     "WHEAT",
-    "RICE",
-    "SORGHUM", "MILO",
-    "BARLEY", "OATS", "RYE",
-    "SUNFLOWER", "SUNFLOWERS",
-    "PEANUT", "PEANUTS",
-    "SUGAR BEET", "SUGAR BEETS",
-    "DRY BEAN", "DRY BEANS", "FIELD BEAN", "FIELD BEANS",
-    "CANOLA", "RAPESEED",
-    "ALFALFA",
 )
 _ROW_CROP_PATTERNS = tuple(
     re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
+1
-1
@@ -16,6 +16,6 @@
 "scraper": "scrape.sources.epa_ppls",
 "scraper_version": "0.1.0",
 "license_note": "US federal government — public domain (no ToS restriction)",
-"scope_filter": "row-crop only — products with at least one site matching CORN, MAIZE, POPCORN, SOYBEAN(S), COTTON, WHEAT, RICE, SORGHUM, MILO, BARLEY, OATS, RYE, SUNFLOWER(S), PEANUT(S), SUGAR BEET(S), DRY/FIELD BEAN(S), CANOLA, RAPESEED, or ALFALFA (word-boundary match). Pass --no-row-crop-filter to scrape the full PPLS universe."
+"scope_filter": "corn / soybean / wheat only — products with at least one site matching CORN, MAIZE, POPCORN, SOYBEAN(S), or WHEAT (word-boundary match). Hits ~16% of the PPLS universe in sampling. Pass --no-row-crop-filter to scrape the full PPLS universe."
 }
 ]