crop-chem-docs/scrape/README.md
justin 92a95d5e78 epa_ppls: add registrant allowlist pre-API filter
Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by
dropping products from non-row-crop-ag registrants BEFORE the per-
product API call. This is the biggest cost lever we have on the EPA
scraper — full backfill drops from ~28 h to ~3.5 h.

scrape/sources/epa_registrant_allowlist.json holds the 34 verified
ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm,
ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.).
Each entry was verified by querying the EPA PPLS API for the first
active product registered under that company number. Edit the JSON
freely — scraper loads it at run time. Bypass with
--no-registrant-filter when you suspect a row-crop product registered
to a specialty company not on the list.

Why a curated allowlist rather than blacklist consumer brands: the
102K PPIS rows are 89% non-ag-relevant; an allowlist is shorter to
maintain and harder to false-positive.

Excluded with intent (not omissions): Bayer Environmental Science
(turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon
(animal flea/tick), Control Solutions (structural pest), Cleary
(turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).

Smoke test --limit 100:
  - 1239 PPIS rows considered (in first slice of file)
  - 1139 skipped by registrant filter (no API call paid)
  - 100 hit API, 81 filtered by row-crop sites, 19 written
  - = 91% API-call reduction over the prior version

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 23:55:38 -04:00


scrape/

Per-source scrapers for pesticide / herbicide product labels. Each module under scrape/sources/ pulls a single upstream catalog and writes its results into corpus/<source_id>/ using the canonical sidecar schema documented below.

Architecture

sources.json                       — registry of active sources
scrape/runner.py                   — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py             — one source per file
corpus/<id>/<key>.md               — extracted label text (markdown)
corpus/<id>/<key>.json             — canonical metadata sidecar

<key> is the per-source primary key — a slug for manufacturer sources (e.g. warrant, roundup-powermax-3) or an EPA Reg No for regulator sources (e.g. 524-475). The sidecar's epa_reg_no field is the cross-source join key that lets the corpus consumer reconcile records from different sources for the same product.
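A minimal sketch of how a corpus consumer might use that join key, assuming the directory layout listed above. The helper name `group_by_reg_no` is hypothetical and not part of the repo.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_reg_no(corpus_root: Path) -> dict[str, list[dict]]:
    """Group sidecars from every source by epa_reg_no (hypothetical helper)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    # Sidecars are expected at <corpus_root>/<source_id>/<key>.json
    for path in sorted(corpus_root.glob("*/*.json")):
        sidecar = json.loads(path.read_text(encoding="utf-8"))
        reg_no = sidecar.get("epa_reg_no")
        if reg_no:  # epa_reg_no is best-effort, so it can be null
            groups[reg_no].append(sidecar)
    return dict(groups)
```

This sketch compares registration numbers as exact strings. Whether a distributor-suffixed number should be folded into its base registration is a decision for the consumer and is left open here.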

CLI

# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt

Every scraper is idempotent by default — re-running with the same arguments skips records already on disk. Use --force to re-fetch.
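The skip check can be sketched roughly as follows. This assumes a record counts as "on disk" only when both the markdown file and the sidecar exist; the actual scrapers may use a different test, and `needs_fetch` is a hypothetical name.

```python
from pathlib import Path


def needs_fetch(corpus_dir: Path, key: str, force: bool = False) -> bool:
    """Return True when a record should be fetched or re-fetched."""
    if force:
        return True
    md_path = corpus_dir / f"{key}.md"
    json_path = corpus_dir / f"{key}.json"
    # Assumption: a half-written record (only one of the two files) is re-fetched.
    return not (md_path.exists() and json_path.exists())
```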

Corpus location

Default: corpus/ at the repo root. Override with the PPLS_CORPUS_ROOT env var to route the corpus to external storage (USB drive, NAS mount, secondary partition):

export PPLS_CORPUS_ROOT=/mnt/big-disk/ppls-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/ppls-corpus/bayer/...

All sources honor the same env var; each creates its own <source_id>/ subdirectory beneath it. Per-source code paths still resolve CORPUS_DIR correctly whether the env var is set or not.
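One way the per-source directory could be resolved is sketched below. Only the `PPLS_CORPUS_ROOT` variable name comes from this README; the function name, the `env` parameter (added so the logic can be exercised without touching the real environment), and the relative `corpus` fallback are assumptions.

```python
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional

# Assumed fallback; the real modules resolve corpus/ relative to the repo root.
DEFAULT_CORPUS_ROOT = Path("corpus")


def corpus_dir(source_id: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return <root>/<source_id>, honoring PPLS_CORPUS_ROOT when it is set."""
    env = os.environ if env is None else env
    override = env.get("PPLS_CORPUS_ROOT")
    # An unset or empty variable falls back to the default root.
    root = Path(override).expanduser() if override else DEFAULT_CORPUS_ROOT
    return root / source_id
```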

Scope: corn / soybeans / wheat

The corpus is scoped to the three crops the consumer app focuses on: corn (incl. maize, popcorn), soybeans, and wheat. The EPA PPLS scraper enforces this by inspecting the sites array on each product's PPLS API response and dropping anything without a matching site (word-boundary match against ROW_CROP_KEYWORDS).
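A word-boundary match of this kind might look like the sketch below. The keyword tuple is illustrative only (the scraper's real `ROW_CROP_KEYWORDS` may differ), and treating `sites` as a flat list of strings is an assumption about the API response shape.

```python
import re

# Illustrative keyword set; the real ROW_CROP_KEYWORDS may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

# \b on both sides keeps "corn" from matching inside "acorn" or "cornflower".
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_row_crop(sites: list[str]) -> bool:
    """True if any site string contains a row-crop keyword as a whole word."""
    return any(_ROW_CROP_RE.search(site) for site in sites)
```

A plain substring test would be simpler, but it would also accept sites such as "ACORN SQUASH", which is why the match is anchored on word boundaries.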

Empirically (random N=100 sample), this narrow keyword list matches ~16% of all PPLS products and loses only ~6% of the broader "all US row crops" hit set, because corn/soy/wheat dominate ag chemistry registrations: products registered for cotton, sorghum, rice, etc. are almost always also registered for one of corn, soy, or wheat.

The Bayer scraper doesn't filter — its catalog is implicitly ag-focused, and the catalog product names + descriptions don't expose enough crop metadata for a pre-API filter to be reliable. Add per-source filters as needed if other manufacturer sources turn up non-ag products.

Override the EPA filter for a one-off broader pull:

python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486

EPA registrant allowlist

The EPA scraper applies a second filter at PPIS enumeration time: only consider products from companies on the row-crop ag-chem allowlist at scrape/sources/epa_registrant_allowlist.json. This is a pre-API filter — products from non-allowlist registrants are dropped before paying the per-product API call cost.

Effect: the 102,378-row PPIS universe shrinks to ~11,500 rows (~89% reduction). Full backfill drops from ~28 h to ~3.5 h.

The allowlist covers the major US row-crop ag-chem registrants (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.) — see the JSON file for the full set with verified company names. Edit it freely; the scraper loads it at run time. Each entry was verified by querying the EPA PPLS API for the first active product registered under that company number.

Bypass with --no-registrant-filter to enumerate the full universe (useful if you suspect a row-crop product is registered to a small or specialty company not on the list).
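As a rough illustration of the pre-API filter: an EPA Reg No begins with the registrant's company number (524 in 524-475), so the check reduces to a set lookup on that first segment. The function names, the `enabled` flag standing in for --no-registrant-filter, and passing the allowlist as a set of strings are all assumptions; the JSON file's actual structure is not shown in this README.

```python
from collections.abc import Iterable, Iterator


def company_number(epa_reg_no: str) -> str:
    """First dash-separated segment of an EPA Reg No, e.g. '524' for '524-475'."""
    return epa_reg_no.strip().split("-", 1)[0]


def filter_by_registrant(
    reg_nos: Iterable[str], allowlist: set[str], enabled: bool = True
) -> Iterator[str]:
    """Yield only registration numbers whose company number is allowlisted.

    With enabled=False every row passes through (the bypass case).
    """
    for reg_no in reg_nos:
        if not enabled or company_number(reg_no) in allowlist:
            yield reg_no
```

Because the lookup needs only the registration number, rejected rows never trigger a per-product API request.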

Canonical sidecar schema

Every corpus/<source>/<key>.json conforms to this shape. Fields that don't apply to a given source are null (not omitted) so the JSON is uniform across sources.

{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}

Field reference

| Field | Type | Required | Notes |
|---|---|---|---|
| source | string | yes | Matches an id in sources.json. |
| source_key | string | yes | Per-source primary key. Filesystem-safe. |
| epa_reg_no | string \| null | best-effort | Canonical EPA registration (e.g. 524-591, or 524-591-12345 with distributor suffix). The cross-source join key. |
| product_name | string \| null | yes | Display name. |
| product_class | string \| null | best-effort | One of herbicide, fungicide, insecticide, seed-treatment, rodenticide, other. EPA PPLS leaves this null; manufacturer sources usually know. |
| registrant | string \| null | best-effort | Required-ish for regulator sources, often null for MFR sources where redundant. |
| active_ingredients | array of objects | yes (may be empty) | [{name, cas, percent}]. cas and percent are null when the source doesn't expose them. |
| signal_word | string \| null | best-effort | Danger, Warning, Caution, or null. Operationally critical for the farmer advisor. |
| label.url | string \| null | yes | Direct URL of the current label PDF. |
| label.filename | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| label.accepted_date | ISO date \| null | best-effort | EPA-stamped acceptance date. MFR sources may not expose this. |
| label.last_modified | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP Last-Modified header. Always normalized to ISO 8601 UTC. |
| label.page_count | int \| null | best-effort | After download. |
| label.text_layer | bool \| null | best-effort | false for scanned PDFs that need OCR. |
| supplemental_documents | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| source_urls.product_page | string \| null | best-effort | The HTML product page on the source site. |
| source_urls.label_api | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| source_urls.label_index | string \| null | best-effort | The human-readable index/search URL. |
| fetched_at | ISO 8601 datetime | yes | When this sidecar was generated. |
| scraper_version | string | yes | Source module's SCRAPER_VERSION constant. |
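The label.last_modified normalization can be sketched with the standard library alone. The function name is hypothetical, and returning null for a missing or unparseable header is an assumption about how the scrapers handle that case.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_last_modified(header_value: Optional[str]) -> Optional[str]:
    """Convert an HTTP Last-Modified header value to an ISO 8601 UTC string."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Unparseable header: record null rather than guessing.
        return None
    if parsed.tzinfo is None:
        # HTTP dates are defined in GMT, so treat a naive result as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()
```

For example, the header value `Fri, 15 May 2026 20:21:54 GMT` becomes `2026-05-15T20:21:54+00:00`, the form used in the sample sidecar.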

Sources may add their own extra fields beyond the canonical schema (EPA's sidecars carry registration_status and registrant_company_number, for instance). Consumers should ignore unknown fields.
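A consumer that wants only the canonical fields could project each sidecar as in the sketch below. The field tuple mirrors the top-level keys of the schema above; the function name is hypothetical.

```python
# Top-level keys of the canonical sidecar schema.
CANONICAL_TOP_LEVEL = (
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
)


def canonical_view(sidecar: dict) -> dict:
    """Keep canonical keys only; source-specific extras are silently dropped.

    A missing canonical key comes back as None, matching the schema's
    null-not-omitted convention.
    """
    return {field: sidecar.get(field) for field in CANONICAL_TOP_LEVEL}
```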

Adding a new source

  1. Write scrape/sources/<id>.py exposing a main(argv: list[str]) -> int that accepts at minimum --limit N and --force.
  2. Conform to the canonical sidecar schema. Add source-specific extras as additional top-level keys if they don't fit.
  3. Add an entry to sources.json (id, title, type, homepage, scraper, scraper_version, license_note).
  4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real User-Agent identifying the project, retry with backoff on 429/5xx, and respect robots.txt unless an explicit carve-out exists (e.g. Bayer's RAG allowlist).
  5. Scrapers MUST be idempotent: skip records already on disk unless --force is set.
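
A bare-bones starting point for a new source module, under the requirements listed above, might look like this. Everything except the `main(argv)` signature, the `--limit` and `--force` flags, and the `SCRAPER_VERSION` constant name is an assumption, including the `backoff_delay` helper and its defaults; the fetching itself is deliberately left out.

```python
import argparse

SCRAPER_VERSION = "0.1.0"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry `attempt` (0-based): doubles each time, capped."""
    return min(cap, base * (2 ** attempt))


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Example source scraper (sketch)")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="stop after N records")
    parser.add_argument("--force", action="store_true",
                        help="re-fetch records that are already on disk")
    return parser


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    # A real module would enumerate its catalog here, skip records already on
    # disk unless args.force is set, sleep between requests to stay at or
    # below 1 req/sec, and wait backoff_delay(n) seconds after a 429/5xx.
    print(f"limit={args.limit} force={args.force}")
    return 0
```

To run such a module standalone, as the existing sources do, it would also need the usual `if __name__ == "__main__":` guard that passes `sys.argv[1:]` to `main`.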