Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by
dropping products from non-row-crop-ag registrants BEFORE the per-
product API call. This is the biggest cost lever we have on the EPA
scraper — full backfill drops from ~28 h to ~3.5 h.
scrape/sources/epa_registrant_allowlist.json holds the 34 verified
ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm,
ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.).
Each entry was verified by querying the EPA PPLS API for the first
active product registered under that company number. Edit the JSON
freely; the scraper loads it at run time. Bypass the filter with
--no-registrant-filter when you suspect a row-crop product is
registered to a specialty company not on the list.
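A minimal sketch of the registrant check, assuming the allowlist JSON
is an object with a "company_numbers" list of strings (the real file
layout may differ). load_allowlist, company_number, and
passes_registrant_filter are hypothetical names, and the numbers in the
demo are illustrative, not a claim about what the file contains. The
only fact relied on is that an EPA reg no begins with the registrant's
company number.

```python
import json


def load_allowlist(raw_json: str) -> frozenset:
    """Parse the allowlist; the {"company_numbers": [...]} shape is an assumption."""
    data = json.loads(raw_json)
    return frozenset(str(n) for n in data["company_numbers"])


def company_number(epa_reg_no: str) -> str:
    """Return the leading company-number segment of an EPA reg no like '524-475'."""
    return epa_reg_no.split("-", 1)[0].strip()


def passes_registrant_filter(epa_reg_no: str, allowlist: frozenset,
                             enabled: bool = True) -> bool:
    """Keep a product when the filter is off or its registrant is allowlisted."""
    if not enabled:  # mirrors --no-registrant-filter
        return True
    return company_number(epa_reg_no) in allowlist


# Illustrative values only.
allowlist = load_allowlist('{"company_numbers": ["524", "264"]}')
kept = passes_registrant_filter("524-475", allowlist)
dropped = passes_registrant_filter("432-1276", allowlist)
bypassed = passes_registrant_filter("432-1276", allowlist, enabled=False)
```

Because the check needs only the reg no from the PPIS row, it can run
before any per-product API call is made.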
Why a curated allowlist rather than a blocklist of consumer brands:
the 102K PPIS rows are 89% non-ag-relevant, so an allowlist is shorter
to maintain and less prone to false positives.
Excluded with intent (not omissions): Bayer Environmental Science
(turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon
(animal flea/tick), Control Solutions (structural pest), Cleary
(turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).
Smoke test --limit 100:
- 1239 PPIS rows considered (in first slice of file)
- 1139 skipped by registrant filter (no API call paid)
- 100 hit API, 81 filtered by row-crop sites, 19 written
- = ~92% API-call reduction over the prior version (1139 of 1239 calls avoided)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bayer's seed-treatment catalog query re-serves products from the
herbicide/fungicide/insecticide queries when they list seed-treatment
use sites. safe_slug() correctly strips the class suffix when the
catalog product type matches the queried class, but does not strip it
when the query is seed-treatment, so the same product gets written
twice: once as "<base>" (canonical class) and once as
"<base>-<class>" (class=seed-treatment).
First full scrape produced 159 files for 87 unique EPA reg nos —
~45% redundant. Fix:
- process_product accepts an optional seen_regs set and returns
"dup-skip" when the product's EPA reg no is already in it.
- run() seeds seen_regs from existing sidecars on disk via
_load_seen_regs() so dedup survives re-runs (force overrides).
- run() updates seen_regs after each successful write, so within-run
dedup works for the seed-treatment query (which iterates last).
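A rough sketch of seeding the seen set from disk, assuming sidecars are
flat *.json files carrying an "epa_reg_no" field. The glob pattern,
file layout, and the load_seen_regs name used here are illustrative;
the real _load_seen_regs() may look different.

```python
import json
import tempfile
from pathlib import Path


def load_seen_regs(corpus_dir: Path) -> set:
    """Collect EPA reg nos from sidecars already on disk (hypothetical layout)."""
    seen = set()
    for sidecar in sorted(corpus_dir.glob("*.json")):
        try:
            data = json.loads(sidecar.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed sidecar: ignore it for dedup
        reg = data.get("epa_reg_no") if isinstance(data, dict) else None
        if reg:
            seen.add(reg)
    return seen


# Demo against a throwaway directory with made-up sidecars.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "product-a.json").write_text(
        json.dumps({"epa_reg_no": "111-222"}), encoding="utf-8")
    (root / "broken.json").write_text("{not json", encoding="utf-8")
    seen_regs = load_seen_regs(root)
    seen_regs.add("333-444")  # what a run would do after a successful write
```

Seeding from disk is what lets a later run skip products an earlier run
already wrote; adding to the set after each write covers duplicates
inside a single run.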
Important nuance preserved: when two genuinely different brand-name
products share the same EPA reg no (e.g., Absolute Maxx + Adament Flow,
both 264-849), they are NOT treated as dups. They are different
catalog entries with different slugs and the same canonical class. Only
the seed-treatment-clone pattern (slug = <canonical>-<class> AND
class=seed-treatment AND sibling at same reg with matching class) is
the bug we're fixing.
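The clone pattern above can be written as a predicate. This is a sketch
under assumptions: Entry and is_seed_treatment_clone are hypothetical
names, and the slugs and reg nos in the demo are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    slug: str
    product_class: str
    epa_reg_no: str


def is_seed_treatment_clone(entry: Entry, catalog: list) -> bool:
    """True when entry is '<canonical>-<class>' served under seed-treatment
    and a sibling with the same reg no carries that canonical slug and class."""
    if entry.product_class != "seed-treatment":
        return False
    for sib in catalog:
        if sib is entry or sib.epa_reg_no != entry.epa_reg_no:
            continue
        if entry.slug == f"{sib.slug}-{sib.product_class}":
            return True
    return False


catalog = [
    Entry("alpha", "herbicide", "111-222"),
    Entry("alpha-herbicide", "seed-treatment", "111-222"),  # the clone
    Entry("beta", "herbicide", "111-222"),       # different brand, same reg no
    Entry("gamma", "seed-treatment", "333-444"),  # genuine seed treatment
]
clones = [e.slug for e in catalog if is_seed_treatment_clone(e, catalog)]
```

Only "alpha-herbicide" is flagged. "beta" shares the reg no but has its
own slug and canonical class, and "gamma" has no canonical sibling, so
both are kept.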
One-off cleanup of the existing USB corpus removed 68 dup pairs;
159 → 91 files (73 canonical-class + 18 true seed-treatments).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on
external storage (USB drive, NAS mount, secondary partition) without
editing the repo. Default behavior unchanged: corpus/ at repo root
when the env var is unset.
Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/,
${PPLS_CORPUS_ROOT}/epa_ppls/, etc.
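A minimal sketch of how the output directory could be resolved. Only
the PPLS_CORPUS_ROOT variable, the corpus/ default, and the per-source
subdirectories come from this change; the corpus_dir helper, its
parameters, and the paths in the demo are hypothetical.

```python
import os
from pathlib import Path


def corpus_dir(source_id: str, repo_root: Path, env=None) -> Path:
    """Resolve <root>/<source_id>, where root is PPLS_CORPUS_ROOT if set,
    else corpus/ under the repo root."""
    env = os.environ if env is None else env
    override = env.get("PPLS_CORPUS_ROOT")
    root = Path(override).expanduser() if override else repo_root / "corpus"
    return root / source_id


# Passing env explicitly keeps the demo independent of the real environment.
default_dir = corpus_dir("bayer", Path("/repo"), env={})
usb_dir = corpus_dir("epa_ppls", Path("/repo"),
                     env={"PPLS_CORPUS_ROOT": "/mnt/usb/ppls-corpus"})
```

An empty PPLS_CORPUS_ROOT is treated the same as unset in this sketch,
which avoids writing to the filesystem root by accident.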
Live-verified against /run/media/justin/USB (vfat, 59GB free):
PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
python -m scrape.runner --source epa_ppls --reg-no 524-475
-> wrote to USB, root disk untouched
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
App focus is corn, soybeans, and wheat, so this drops the broader
US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/
sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa).
Empirical impact (random N=100 sample): broad list matched 17/100
products, narrow list matches 16/100 — only 6% reduction, because
corn/soy/wheat dominate ag-chem registrations so thoroughly that
products registered for cotton/sorghum/etc. are almost always
co-registered for one of corn/soy/wheat. One sampled product was
dropped: a peanut-only herbicide (2749-614).
Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN
sites); 2749-614 AG36448 (PEANUTS only) correctly filtered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. Filter is on by default; --no-row-crop-filter
overrides for one-off broader pulls.
Filter shape:
- Word-boundary regex match against each entry in the API's `sites`
array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
SHIPHOLDS" → drop even though it contains "OATS" as substring).
- Allowlist covers the major US row + small-grain + oilseed + sugar/
fiber crops, plus alfalfa as a common rotation crop. See
ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.
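The word-boundary match can be sketched as below. KEYWORDS is an
illustrative subset, not the real ROW_CROP_KEYWORDS, and the optional
plural "S" and case-insensitive flag are assumptions about how the
pattern is built.

```python
import re

# Illustrative subset only; the full list lives in ROW_CROP_KEYWORDS.
KEYWORDS = ("CORN", "SOYBEAN", "WHEAT", "OAT", "COTTON", "SORGHUM")

# \b on both sides stops "OATS" from matching inside "BOATS".
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")S?\b",
    re.IGNORECASE,
)


def has_row_crop_site(sites: list) -> bool:
    """True when at least one site string names a row crop as a whole word."""
    return any(_ROW_CROP_RE.search(site) for site in sites)


keep = has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"])
drop = has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"])
```

A plain substring test would keep the second product, because "OATS"
appears inside "BOATS"; the boundary anchors reject it.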
Cost model:
- 102K PPIS rows still need one API call each (no bulk filter
available upstream), so enumeration still takes ~28h at 1 req/sec.
- But PDF downloads drop from ~67K → ~5-10K (estimated row-crop
hit rate), saving ~17h wall time and ~60GB disk on a full backfill.
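The arithmetic behind those estimates, as a quick check. It assumes PDF
downloads are paced at the same 1 req/sec as API calls, which is not
stated above; hours_at_rate is a hypothetical helper.

```python
def hours_at_rate(requests: int, requests_per_second: float = 1.0) -> float:
    """Wall-clock hours to issue `requests` at a fixed request rate."""
    return requests / requests_per_second / 3600


enumeration_h = hours_at_rate(102_000)       # one API call per PPIS row
saved_low_h = hours_at_rate(67_000 - 10_000)  # if ~10K PDFs are still fetched
saved_high_h = hours_at_rate(67_000 - 5_000)  # if only ~5K PDFs are fetched
```

This gives roughly 28.3 h of enumeration and 15.8 to 17.2 h of download
time saved, in line with the ~28h and ~17h figures.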
Smoke test (4 mixed reg nos):
524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites)
524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites)
100-1486 Advion Cockroach → filtered (building/transport sites only)
432-1276 (Bayer pet flea) → filtered (no row crops)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.
Sources shipped:
- bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
- epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint
Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
- active_ingredients always [{name, cas, percent}]
- label/* nested (url, filename, accepted_date, last_modified,
page_count, text_layer)
- all timestamps normalized to ISO 8601 UTC
- signal_word surfaced (operationally critical for the farmer advisor)
- source_key and epa_reg_no kept as separate fields: the per-source PK
vs. the cross-source join key
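One way to picture the schema is as typed dicts. The field names come
from the list above; the value types, the Optional markers, and the
to_iso_utc helper are assumptions, and scrape/README.md remains the
reference.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, TypedDict


class ActiveIngredient(TypedDict):
    name: str
    cas: Optional[str]
    percent: Optional[float]


class Label(TypedDict):
    url: str
    filename: str
    accepted_date: Optional[str]   # ISO 8601 UTC
    last_modified: Optional[str]   # ISO 8601 UTC
    page_count: Optional[int]
    text_layer: Optional[bool]


class Sidecar(TypedDict):
    source_key: str                # per-source primary key
    epa_reg_no: str                # cross-source join key
    signal_word: Optional[str]
    active_ingredients: List[ActiveIngredient]
    label: Label


def to_iso_utc(dt: datetime) -> str:
    """Render a datetime as ISO 8601 in UTC with a trailing 'Z'."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


eastern = timezone(timedelta(hours=-5))
stamp = to_iso_utc(datetime(2024, 1, 2, 3, 4, 5, tzinfo=eastern))
```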
bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.
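A sketch of dispatch-by-id, assuming sources.json is a list of objects
with an "id" field and that each source exposes a callable taking a
limit. The dispatch function, the registry dict, and the stub handlers
are hypothetical; the real runner may resolve modules differently.

```python
import json
from typing import Callable, Dict, Optional

Handler = Callable[[Optional[int]], int]  # takes a limit, returns files written


def dispatch(sources_json: str, registry: Dict[str, Handler],
             only: Optional[str] = None,
             limit: Optional[int] = None) -> Dict[str, int]:
    """Walk the source list and call the handler registered for each id."""
    results = {}
    for entry in json.loads(sources_json):
        source_id = entry["id"]
        if only is not None and source_id != only:
            continue  # --source <id> narrows the run to one source
        handler = registry.get(source_id)
        if handler is None:
            raise KeyError(f"no scraper registered for source id {source_id!r}")
        results[source_id] = handler(limit)
    return results


# Stub handlers stand in for the per-source modules.
registry = {"bayer": lambda limit: limit or 0, "epa_ppls": lambda limit: limit or 0}
sources = '[{"id": "bayer"}, {"id": "epa_ppls"}]'
all_run = dispatch(sources, registry, limit=2)
one_run = dispatch(sources, registry, only="bayer", limit=3)
```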
PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.
Smoke test:
python -m scrape.runner --all --limit 2 # works
python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips
python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>