# scrape/

Per-source scrapers for pesticide / herbicide product labels. Each module under
`scrape/sources/` pulls a single upstream catalog and writes its results into
`corpus//` using the canonical sidecar schema documented below.

## Architecture

```
sources.json          — registry of active sources
scrape/runner.py      — thin dispatcher (--source | --all)
scrape/sources/.py    — one source per file
corpus//.md           — extracted label text (markdown)
corpus//.json         — canonical metadata sidecar
```

The per-source primary key is a slug for manufacturer sources (e.g. `warrant`,
`roundup-powermax-3`) or an EPA Reg No for regulator sources (e.g. `524-475`).
The sidecar's `epa_reg_no` field is the cross-source join key that lets the
corpus consumer reconcile records from different sources for the same product.

## CLI

```bash
# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt
```

Every scraper is **idempotent** by default: re-running with the same arguments
skips records already on disk. Use `--force` to re-fetch.

### Corpus location

Default: `corpus/` at the repo root. Override with the `CORPUS_ROOT` env var to
route the corpus to external storage (USB drive, NAS mount, secondary
partition):

```bash
export CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/crop-chem-corpus/bayer/...
```

All sources honor the same env var; each creates its own per-source
subdirectory beneath it. Per-source code paths still resolve `CORPUS_DIR`
correctly whether the env var is set or not.

## Scope: corn / soybeans / wheat

The corpus is scoped to the three crops the consumer app focuses on: **corn (incl.
maize, popcorn), soybeans, and wheat.** The EPA PPLS scraper enforces this by
inspecting the `sites` array on each product's PPLS API response and dropping
anything without a matching site (word-boundary match against
`ROW_CROP_KEYWORDS`).

Empirically (random sample, N=100), this narrow allowlist matches ~16% of all
PPLS products and loses only ~6% of the broader "all US row crops" hit set,
because corn/soy/wheat dominate ag chemistry registrations: products registered
for cotton/sorghum/rice/etc. are almost always *also* registered for one of
corn, soy, or wheat.

The Bayer scraper doesn't filter: its catalog is implicitly ag-focused, and the
catalog product names + descriptions don't expose enough crop metadata for a
pre-API filter to be reliable. Add per-source filters as needed if other
manufacturer sources turn up non-ag products.

Override the EPA filter for a one-off broader pull:

```bash
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
```

### EPA registrant allowlist

The EPA scraper applies a second filter at PPIS enumeration time: **only
consider products from companies on the row-crop ag-chem allowlist** at
[`scrape/sources/epa_registrant_allowlist.json`](sources/epa_registrant_allowlist.json).
This is a pre-API filter: products from non-allowlist registrants are dropped
before paying the per-product API call cost.

Effect: the 102,378-row PPIS universe shrinks to ~11,500 rows (~89% reduction).
Full backfill drops from ~28 h to ~5–6 h.

The allowlist covers the major US row-crop ag-chem registrants (Syngenta,
Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena,
Drexel, Atticus, etc.); see the JSON file for the full set with verified
company names. Edit it freely; the scraper loads it at run time. Each entry was
verified by querying the EPA PPLS API for the first active product registered
under that company number.
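
As a rough illustration of the two EPA filters described in this section, the sketch below applies a registrant pre-filter and then a word-boundary crop match. The keyword tuple, the allowlist set, the shape of `sites` (a plain list of strings), and all helper names are assumptions made for the example; the scraper's real `ROW_CROP_KEYWORDS` and the layout of the allowlist JSON may differ. The only fact relied on is that the company number is the first segment of an EPA registration number.

```python
import re

# Assumed keyword list for illustration; the real ROW_CROP_KEYWORDS may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

# One case-insensitive pattern with word boundaries on both sides, so that
# "CORN (FIELD)" matches but "ACORN SQUASH" does not.
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ROW_CROP_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def company_number(epa_reg_no: str) -> str:
    """Return the leading company segment of an EPA Reg No (e.g. '524-591' -> '524')."""
    return epa_reg_no.split("-", 1)[0]


def passes_registrant_filter(epa_reg_no: str, allowlist: set[str]) -> bool:
    """Cheap pre-API check: keep only products whose company is on the allowlist."""
    return company_number(epa_reg_no) in allowlist


def matches_row_crop(sites: list[str]) -> bool:
    """Post-API check: keep a product if any registered site names a target crop."""
    return any(_ROW_CROP_RE.search(site) for site in sites)


# Hypothetical inputs: an allowlist of company numbers and two enumerated rows.
allowlist = {"524"}
rows = ["524-591", "9999-1"]
candidates = [r for r in rows if passes_registrant_filter(r, allowlist)]
```

Running the cheap set-membership test first means the per-product API call, and the site match that depends on it, only happens for rows that survive the allowlist.
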
Bypass with `--no-registrant-filter` to enumerate the full universe (useful if
you suspect a row-crop product is registered to a small or specialty company
not on the list).

## Canonical sidecar schema

Every `corpus//.json` conforms to this shape. Fields that don't apply to a
given source are `null` (not omitted) so the JSON is uniform across sources.

```json
{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE",
     "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
```

### Field reference

| Field | Type | Required | Notes |
|---|---|---|---|
| `source` | string | yes | Matches an `id` in `sources.json`. |
| `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
| `epa_reg_no` | string \| null | best-effort | Canonical EPA registration (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
| `product_name` | string \| null | yes | Display name. |
| `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
| `registrant` | string \| null | best-effort | Effectively required for regulator sources; often `null` for manufacturer sources, where it is redundant. |
| `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
| `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
| `label.url` | string \| null | yes | Direct URL of the current label PDF. |
| `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. Manufacturer sources may not expose this. |
| `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
| `label.page_count` | int \| null | best-effort | Counted after download. |
| `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
| `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
| `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
| `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
| `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |

Sources may add their own extra fields beyond the canonical schema (EPA's
sidecars carry `registration_status` and `registrant_company_number`, for
instance). Consumers should ignore unknown fields.

## Adding a new source

1. Write `scrape/sources/.py` exposing a `main(argv: list[str]) -> int` that
   accepts at minimum `--limit N` and `--force`.
2. Conform to the canonical sidecar schema. Add source-specific extras as
   additional top-level keys if they don't fit.
3. Add an entry to `sources.json` (`id`, `title`, `type`, `homepage`,
   `scraper`, `scraper_version`, `license_note`).
4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real User-Agent
   identifying the project, retry with backoff on 429/5xx, and respect
   robots.txt unless an explicit carve-out exists (e.g. Bayer's RAG allowlist).
5. Scrapers MUST be idempotent: skip records already on disk unless `--force`
   is set.
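
The checklist above can be sketched as a minimal source module. This is an illustrative skeleton under stated assumptions, not code from the repository: the `example` source id, the `list_products()` stub, and the helper names are hypothetical, and the network layer (rate limiting, User-Agent, retries, robots.txt) is left out so the sketch stays self-contained. It shows the `main(argv)` entry point, the `CORPUS_ROOT` override, a sidecar with every canonical field present, and the idempotent skip.

```python
import argparse
import json
import os
from datetime import datetime, timezone
from pathlib import Path

SCRAPER_VERSION = "0.1.0"
SOURCE_ID = "example"  # hypothetical id; a real source needs a sources.json entry


def corpus_dir() -> Path:
    # CORPUS_ROOT overrides the default corpus/ directory (resolved here
    # relative to the working directory); each source gets its own subdirectory.
    return Path(os.environ.get("CORPUS_ROOT", "corpus")) / SOURCE_ID


def blank_sidecar(source_key: str) -> dict:
    """Canonical sidecar: fields that don't apply are None (JSON null), never omitted."""
    return {
        "source": SOURCE_ID,
        "source_key": source_key,
        "epa_reg_no": None,
        "product_name": None,
        "product_class": None,
        "registrant": None,
        "active_ingredients": [],
        "signal_word": None,
        "label": {"url": None, "filename": None, "accepted_date": None,
                  "last_modified": None, "page_count": None, "text_layer": None},
        "supplemental_documents": [],
        "source_urls": {"product_page": None, "label_api": None, "label_index": None},
        "fetched_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "scraper_version": SCRAPER_VERSION,
    }


def list_products() -> list[dict]:
    # Stub standing in for the real (polite) catalog fetch.
    return [{"key": "demo-product", "name": "Demo Herbicide", "text": "# Demo label\n"}]


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(prog=f"scrape.sources.{SOURCE_ID}")
    parser.add_argument("--limit", type=int, default=None, help="max records to process")
    parser.add_argument("--force", action="store_true", help="re-fetch records already on disk")
    args = parser.parse_args(argv)

    out = corpus_dir()
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for product in list_products()[: args.limit]:
        sidecar_path = out / f"{product['key']}.json"
        if sidecar_path.exists() and not args.force:
            continue  # idempotent: skip records already on disk
        sidecar = blank_sidecar(product["key"])
        sidecar["product_name"] = product["name"]
        (out / f"{product['key']}.md").write_text(product["text"], encoding="utf-8")
        sidecar_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
        written += 1
    print(f"{SOURCE_ID}: wrote {written} record(s)")
    return 0
```

Calling `main(["--limit", "1"])` twice writes the record once and skips it the second time; adding `--force` rewrites it. Using the sidecar file's existence as the "already on disk" test is one possible convention chosen for the sketch.
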