Bayer's seed-treatment catalog query re-serves products from herbicide/fungicide/insecticide queries that have seed-treatment use sites listed. safe_slug() correctly strips the class suffix when the catalog product type matches, but doesn't strip it when querying as seed-treatment, so the same product gets written twice: once as "<base>" (canonical class) and once as "<base>-<class>" (class=seed-treatment). The first full scrape produced 159 files for 87 unique EPA reg nos, ~45% redundant.

Fix:
- process_product accepts an optional seen_regs set and returns "dup-skip" when the product's EPA reg no is already in it.
- run() seeds seen_regs from existing sidecars on disk via _load_seen_regs() so dedup survives re-runs (force overrides).
- run() updates seen_regs after each successful write, so within-run dedup works for the seed-treatment query (which iterates last).

Important nuance preserved: when two genuinely different brand-name products share the same EPA reg (e.g., Absolute Maxx + Adament Flow both = 264-849), they are NOT treated as dups. They are different catalog entries with different slugs and the same canonical class. Only the seed-treatment-clone pattern (slug = <canonical>-<class> AND class=seed-treatment AND sibling at same reg with matching class) is the bug being fixed.

One-off cleanup of the existing USB corpus removed 68 dup pairs; 159 → 91 files (73 canonical-class + 18 true seed-treatments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
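The seed-treatment-clone pattern described above could be checked roughly as follows. This is a minimal sketch and not the scraper's actual code: the helper name is_seed_treatment_clone, the in-memory list of sidecar dicts, and the placeholder records are assumptions for illustration. Only the field names source_key, epa_reg_no, and product_class come from the sidecar schema.

```python
def is_seed_treatment_clone(sidecar: dict, siblings: list[dict]) -> bool:
    """Return True if `sidecar` looks like a seed-treatment re-serve of a sibling.

    Hypothetical helper. A clone has class "seed-treatment", a slug of the
    form "<base>-<class>", and a sibling at the same EPA reg no whose slug is
    "<base>" and whose class is "<class>".
    """
    if sidecar.get("product_class") != "seed-treatment":
        return False
    reg = sidecar.get("epa_reg_no")
    if not reg:
        return False
    slug = sidecar["source_key"]
    for sib in siblings:
        if sib is sidecar or sib.get("epa_reg_no") != reg:
            continue
        cls = sib.get("product_class")
        if cls and cls != "seed-treatment" and slug == f"{sib['source_key']}-{cls}":
            return True
    return False


# Placeholder records (not real products): two brands sharing one reg no,
# plus a seed-treatment clone of the first.
records = [
    {"source_key": "alpha", "epa_reg_no": "1-1", "product_class": "fungicide"},
    {"source_key": "beta", "epa_reg_no": "1-1", "product_class": "fungicide"},
    {"source_key": "alpha-fungicide", "epa_reg_no": "1-1", "product_class": "seed-treatment"},
]
clones = [r["source_key"] for r in records if is_seed_treatment_clone(r, records)]
```

Under these assumptions only the third record is flagged; the two brand-name records that share a reg no are both kept.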
scrape/
Per-source scrapers for pesticide / herbicide product labels. Each
module under scrape/sources/ pulls a single upstream catalog and
writes its results into corpus/<source_id>/ using the canonical
sidecar schema documented below.
Architecture
sources.json — registry of active sources
scrape/runner.py — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py — one source per file
corpus/<id>/<key>.md — extracted label text (markdown)
corpus/<id>/<key>.json — canonical metadata sidecar
<key> is the per-source primary key — a slug for manufacturer
sources (e.g. warrant, roundup-powermax-3) or an EPA Reg No
for regulator sources (e.g. 524-475). The sidecar's
epa_reg_no field is the cross-source join key that lets the
corpus consumer reconcile records from different sources for the
same product.
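As an illustration of the join described above, a consumer might group sidecars by epa_reg_no. This is a sketch under assumptions: the function name index_by_reg_no is hypothetical, and it does an exact string match, so a reg no carrying a distributor suffix would land in its own group.

```python
import json
from collections import defaultdict
from pathlib import Path


def index_by_reg_no(corpus_root: Path) -> dict[str, list[dict]]:
    """Group every sidecar under corpus_root/<source>/ by its epa_reg_no."""
    by_reg: dict[str, list[dict]] = defaultdict(list)
    for sidecar_path in sorted(corpus_root.glob("*/*.json")):
        record = json.loads(sidecar_path.read_text(encoding="utf-8"))
        reg_no = record.get("epa_reg_no")
        if reg_no:  # best-effort field: skip records without a reg no
            by_reg[reg_no].append(record)
    return dict(by_reg)
```

Each value in the returned dict is the list of records, possibly from different sources, that claim the same registration number.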
CLI
# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475
# Run every source registered in sources.json
python -m scrape.runner --all --limit 50
# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt
Every scraper is idempotent by default — re-running with the
same arguments skips records already on disk. Use --force to
re-fetch.
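The skip-unless-forced behavior can be pictured with a small check like the one below. It is a sketch, not the scrapers' actual logic: should_fetch is a hypothetical name, and treating a record as "on disk" only when both the .md and the .json exist is an assumption.

```python
from pathlib import Path


def should_fetch(corpus_dir: Path, key: str, force: bool = False) -> bool:
    """Return True when the record for `key` needs to be fetched.

    Assumption: a record counts as present only when both the markdown
    text and the JSON sidecar exist in corpus_dir.
    """
    if force:
        return True
    markdown = corpus_dir / f"{key}.md"
    sidecar = corpus_dir / f"{key}.json"
    return not (markdown.exists() and sidecar.exists())
```

Requiring both files means a run interrupted between the two writes is retried on the next pass instead of being silently skipped.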
Corpus location
Default: corpus/ at the repo root. Override with the
PPLS_CORPUS_ROOT env var to route the corpus to external storage
(USB drive, NAS mount, secondary partition):
export PPLS_CORPUS_ROOT=/mnt/big-disk/ppls-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/ppls-corpus/bayer/...
All sources honor the same env var; each creates its own
<source_id>/ subdirectory beneath it. Per-source code paths
still resolve CORPUS_DIR correctly whether the env var is set
or not.
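One way a source module could resolve its output directory is sketched below. The function name corpus_dir and the explicit repo_root parameter are assumptions; only the PPLS_CORPUS_ROOT variable and the <source_id>/ layout come from the text above.

```python
import os
from pathlib import Path


def corpus_dir(source_id: str, repo_root: Path) -> Path:
    """Return the directory a source writes into.

    Uses PPLS_CORPUS_ROOT when set (and non-empty), otherwise falls back
    to <repo_root>/corpus. Hypothetical helper for illustration.
    """
    override = os.environ.get("PPLS_CORPUS_ROOT")
    base = Path(override) if override else repo_root / "corpus"
    return base / source_id
```

Reading the variable at call time rather than at import time keeps the override effective even if it is exported after the module has been loaded.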
Scope: corn / soybeans / wheat
The corpus is scoped to the three crops the consumer app focuses on:
corn (incl. maize, popcorn), soybeans, and wheat. The EPA PPLS
scraper enforces this by inspecting the sites array on each
product's PPLS API response and dropping anything without a matching
site (word-boundary match against ROW_CROP_KEYWORDS).
Empirically (random N=100 sample), this narrow allowlist matches ~16% of all PPLS products and loses only ~6% of the broader "all US row crops" hit set, because corn/soy/wheat dominate ag chemistry registrations: products registered for cotton, sorghum, rice, etc. are almost always also registered for one of corn, soy, or wheat.
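A word-boundary filter of the kind described could look like this. It is a sketch: the keyword tuple below is an assumed stand-in for the real ROW_CROP_KEYWORDS, the sites array is assumed to be a list of plain strings, and matches_row_crop is a hypothetical name.

```python
import re

# Assumed contents; the real ROW_CROP_KEYWORDS in epa_ppls may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

# \b on both sides so "corn" does not match inside "acorn".
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_row_crop(sites: list[str]) -> bool:
    """True if any site name contains a row-crop keyword as a whole word."""
    return any(_ROW_CROP_RE.search(site) for site in sites)
```

The word boundaries are what keep site names such as "ACORN SQUASH" or "Buckwheat" from passing as corn or wheat.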
The Bayer scraper doesn't filter — its catalog is implicitly ag-focused, and the catalog product names + descriptions don't expose enough crop metadata for a pre-API filter to be reliable. Add per-source filters as needed if other manufacturer sources turn up non-ag products.
Override the EPA filter for a one-off broader pull:
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
Canonical sidecar schema
Every corpus/<source>/<key>.json conforms to this shape. Fields
that don't apply to a given source are null (not omitted) so the
JSON is uniform across sources.
{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
Field reference
| Field | Type | Required | Notes |
|---|---|---|---|
| source | string | yes | Matches an id in sources.json. |
| source_key | string | yes | Per-source primary key. Filesystem-safe. |
| epa_reg_no | string \| null | best-effort | Canonical EPA registration (e.g. 524-591, or 524-591-12345 with distributor suffix). The cross-source join key. |
| product_name | string \| null | yes | Display name. |
| product_class | string \| null | best-effort | One of herbicide, fungicide, insecticide, seed-treatment, rodenticide, other. EPA PPLS leaves this null; manufacturer sources usually know. |
| registrant | string \| null | best-effort | Required-ish for regulator sources, often null for MFR sources where redundant. |
| active_ingredients | array of objects | yes (may be empty) | [{name, cas, percent}]. cas and percent are null when the source doesn't expose them. |
| signal_word | string \| null | best-effort | Danger, Warning, Caution, or null. Operationally critical for the farmer advisor. |
| label.url | string \| null | yes | Direct URL of the current label PDF. |
| label.filename | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| label.accepted_date | ISO date \| null | best-effort | EPA-stamped acceptance date. MFR sources may not expose this. |
| label.last_modified | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP Last-Modified header. Always normalized to ISO 8601 UTC. |
| label.page_count | int \| null | best-effort | After download. |
| label.text_layer | bool \| null | best-effort | false for scanned PDFs that need OCR. |
| supplemental_documents | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| source_urls.product_page | string \| null | best-effort | The HTML product page on the source site. |
| source_urls.label_api | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| source_urls.label_index | string \| null | best-effort | The human-readable index/search URL. |
| fetched_at | ISO 8601 datetime | yes | When this sidecar was generated. |
| scraper_version | string | yes | Source module's SCRAPER_VERSION constant. |
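For label.last_modified, the HTTP Last-Modified header can be normalized with the standard library. This is a sketch; normalize_last_modified is a hypothetical name, and returning null for a missing or unparseable header is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_last_modified(header_value: Optional[str]) -> Optional[str]:
    """Convert an HTTP Last-Modified header to an ISO 8601 UTC string."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):  # exception type varies by Python version
        return None
    if parsed.tzinfo is None:
        # A "-0000" offset parses as naive; treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(normalize_last_modified("Fri, 15 May 2026 20:21:54 GMT"))
# prints 2026-05-15T20:21:54+00:00
```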
Sources may add their own extra fields beyond the canonical schema
(EPA's sidecars carry registration_status and
registrant_company_number, for instance). Consumers should ignore
unknown fields.
Adding a new source
- Write scrape/sources/<id>.py exposing a main(argv: list[str]) -> int that accepts at minimum --limit N and --force.
- Conform to the canonical sidecar schema. Add source-specific extras as additional top-level keys if they don't fit.
- Add an entry to sources.json (id, title, type, homepage, scraper, scraper_version, license_note).
- Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real User-Agent identifying the project, retry with backoff on 429/5xx, and respect robots.txt unless an explicit carve-out exists (e.g. Bayer's RAG allowlist).
- Scrapers MUST be idempotent: skip records already on disk unless --force is set.
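A skeleton covering the entry point and the politeness rules might start like this. It is a sketch under assumptions: Throttle, fetch_with_backoff, the injected fetch callable returning (status, body), and the prog name are all hypothetical, and the User-Agent and robots.txt handling are left out.

```python
import argparse
import time
from typing import Callable, Optional


class Throttle:
    """Sleep as needed so successive wait() calls are min_interval apart."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval  # 1.0 s gives at most 1 req/sec
        self._last: Optional[float] = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch_with_backoff(
    fetch: Callable[[], tuple[int, bytes]],
    retries: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, bytes]:
    """Call fetch(), retrying on 429/5xx with exponential backoff."""
    for attempt in range(retries + 1):
        status, body = fetch()
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            return status, body
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
    raise RuntimeError(f"giving up after {retries + 1} attempts (last status {status})")


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(prog="scrape.sources.example")
    parser.add_argument("--limit", type=int, default=None, help="max records to process")
    parser.add_argument("--force", action="store_true", help="re-fetch records already on disk")
    args = parser.parse_args(argv)
    # Source-specific catalog iteration would go here, calling
    # Throttle.wait() before each request.
    print(f"limit={args.limit} force={args.force}")
    return 0
```

Passing sleep and fetch in as parameters keeps the retry logic testable without touching the network or actually waiting.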