60657aa6df
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. Filter is on by default; --no-row-crop-filter
overrides for one-off broader pulls.

Filter shape:
- Word-boundary regex match against each entry in the API's `sites`
  array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
  SHIPHOLDS" → drop even though it contains "OATS" as substring).
- Allowlist covers the major US row + small-grain + oilseed + sugar/
  fiber crops, plus alfalfa as a common rotation crop. See
  ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.

Cost model:
- 102K PPIS rows still need one API call each (no bulk filter
  available upstream), so enumeration still takes ~28h at 1 req/sec.
- But PDF downloads drop from ~67K → ~5-10K (estimated row-crop
  hit rate), saving ~17h wall time and ~60GB disk on a full backfill.

Smoke test (4 mixed reg nos):
  524-475   Roundup Ultra      → kept (CORN/SOYBEANS/COTTON sites)
  524-591   Warrant            → kept (CORN/SOYBEANS/SORGHUM sites)
  100-1486  Advion Cockroach   → filtered (building/transport sites only)
  432-1276  (Bayer pet flea)   → filtered (no row crops)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
150 lines
6.6 KiB
Markdown
# scrape/

Per-source scrapers for pesticide / herbicide product labels. Each
module under `scrape/sources/` pulls a single upstream catalog and
writes its results into `corpus/<source_id>/` using the canonical
sidecar schema documented below.

## Architecture

```
sources.json              — registry of active sources
scrape/runner.py          — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py    — one source per file
corpus/<id>/<key>.md      — extracted label text (markdown)
corpus/<id>/<key>.json    — canonical metadata sidecar
```

`<key>` is the per-source primary key: a slug for manufacturer
sources (e.g. `warrant`, `roundup-powermax-3`) or an EPA Reg No
for regulator sources (e.g. `524-475`). The sidecar's
`epa_reg_no` field is the cross-source join key that lets the
corpus consumer reconcile records from different sources for the
same product.

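As an illustration of how a consumer might use that join key, the sketch below groups every sidecar under a corpus root by `epa_reg_no`. It relies only on the `corpus/<id>/<key>.json` layout described above; the function name `group_by_reg_no` is hypothetical, and skipping records whose `epa_reg_no` is `null` is an assumption about what a consumer would want.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_reg_no(corpus_root: Path) -> dict[str, list[dict]]:
    """Group sidecars from every source by their epa_reg_no (hypothetical helper)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    # Layout from the tree above: corpus/<id>/<key>.json
    for sidecar_path in sorted(corpus_root.glob("*/*.json")):
        sidecar = json.loads(sidecar_path.read_text(encoding="utf-8"))
        reg_no = sidecar.get("epa_reg_no")
        if reg_no is None:
            # Assumption: records without a reg no cannot be joined, so skip them.
            continue
        groups[reg_no].append(sidecar)
    return dict(groups)
```

Calling `group_by_reg_no(Path("corpus"))` would then return one list per registration number, with one entry for each source that carries that product.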
## CLI

```bash
# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt
```

Every scraper is **idempotent** by default: re-running with the
same arguments skips records already on disk. Use `--force` to
re-fetch.

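The skip-if-present check could look roughly like the sketch below. It is not the project's code: `should_fetch` is a hypothetical helper, and treating a record as "on disk" only when both the `.md` and the `.json` exist is an assumption.

```python
from pathlib import Path


def should_fetch(corpus_dir: Path, key: str, force: bool = False) -> bool:
    """Return True when the record for `key` still needs fetching."""
    if force:
        return True
    # Assumption: a record counts as present only when both outputs exist,
    # so a run interrupted between the two writes is retried next time.
    markdown_path = corpus_dir / f"{key}.md"
    sidecar_path = corpus_dir / f"{key}.json"
    return not (markdown_path.exists() and sidecar_path.exists())
```

Checking both files rather than only the sidecar costs one extra `stat` per record and avoids leaving half-written records in place.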
## Scope: row crops only

The corpus is scoped to **US row crops**: corn, soybeans, cotton,
wheat, rice, sorghum/milo, barley, oats, rye, sunflowers, peanuts,
sugar beets, dry/field beans, canola/rapeseed, and alfalfa. The
EPA PPLS scraper enforces this by inspecting the `sites` array on
each product's PPLS API response and dropping anything without a
row-crop site (word-boundary match).

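A minimal sketch of a word-boundary match is shown below. The keyword list is an illustrative subset, not the scraper's actual allowlist, and `has_row_crop_site` is a hypothetical name; the sketch also assumes each `sites` entry is a plain string.

```python
import re

# Illustrative subset only; not the project's actual allowlist.
SAMPLE_KEYWORDS = ["CORN", "SOYBEANS", "COTTON", "WHEAT", "OATS", "ALFALFA"]

# \b anchors each keyword at word boundaries, so "OATS" does not match "BOATS".
ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SAMPLE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_row_crop_site(sites: list[str]) -> bool:
    """True if any entry in the product's `sites` array names a row crop."""
    return any(ROW_CROP_RE.search(site) for site in sites)


print(has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"]))  # True
print(has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"]))      # False
```

A plain substring test would keep the second product because "BOATS" contains "OATS"; the `\b` anchors are what prevent that false positive.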
The Bayer scraper doesn't filter: its catalog is implicitly
ag-focused, and dropping fungicide/insecticide/seed-treatment
products there would lose row-crop-relevant chemistry. Add
per-source filters as needed if other manufacturer sources cover
non-ag products.

Override the EPA filter for a one-off broader pull:

```bash
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
```

## Canonical sidecar schema

Every `corpus/<source>/<key>.json` conforms to this shape. Fields
that don't apply to a given source are `null` (not omitted) so the
JSON is uniform across sources.

```json
{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
```

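Because inapplicable fields are `null` rather than omitted, a consumer or test can check that every canonical key is present. The sketch below is one possible check, not part of the project: the key sets are transcribed from the example above, and `missing_keys` is a hypothetical helper. Extra source-specific keys are deliberately ignored.

```python
CANONICAL_KEYS = {
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
}
LABEL_KEYS = {
    "url", "filename", "accepted_date", "last_modified", "page_count", "text_layer",
}
SOURCE_URL_KEYS = {"product_page", "label_api", "label_index"}


def missing_keys(sidecar: dict) -> list[str]:
    """List canonical keys absent from a sidecar; null values count as present."""
    missing = sorted(CANONICAL_KEYS - set(sidecar))
    # Treat a missing or null nested object as empty so its fields are reported.
    label = sidecar.get("label") or {}
    missing += sorted(f"label.{k}" for k in LABEL_KEYS - set(label))
    urls = sidecar.get("source_urls") or {}
    missing += sorted(f"source_urls.{k}" for k in SOURCE_URL_KEYS - set(urls))
    return missing
```

An empty list means the sidecar has the uniform shape; the check says nothing about whether the values themselves have the right types.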
### Field reference

| Field | Type | Required | Notes |
|---|---|---|---|
| `source` | string | yes | Matches an `id` in `sources.json`. |
| `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
| `epa_reg_no` | string \| null | best-effort | Canonical EPA registration (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
| `product_name` | string \| null | yes | Display name. |
| `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
| `registrant` | string \| null | best-effort | Effectively required for regulator sources; often `null` for manufacturer (MFR) sources, where it is redundant. |
| `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
| `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
| `label.url` | string \| null | yes | Direct URL of the current label PDF. |
| `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. MFR sources may not expose this. |
| `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
| `label.page_count` | int \| null | best-effort | Counted after the PDF is downloaded. |
| `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
| `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
| `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
| `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
| `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |

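The `label.last_modified` normalization can be done with the standard library alone. The sketch below is one way to do it; `normalize_last_modified` is a hypothetical name, and returning `None` for a missing or malformed header is an assumption that matches the field's nullable type.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_last_modified(header_value: Optional[str]) -> Optional[str]:
    """Convert an HTTP Last-Modified header to an ISO 8601 UTC string."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return None  # malformed header: treat as unknown
    if parsed.tzinfo is None:
        # Assumption: a header without a usable zone is taken to be UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(normalize_last_modified("Fri, 15 May 2026 20:21:54 GMT"))
# 2026-05-15T20:21:54+00:00
```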
Sources may add their own extra fields beyond the canonical schema
(EPA's sidecars carry `registration_status` and
`registrant_company_number`, for instance). Consumers should ignore
unknown fields.

## Adding a new source

1. Write `scrape/sources/<id>.py` exposing a `main(argv: list[str]) -> int`
   that accepts at minimum `--limit N` and `--force`.
2. Conform to the canonical sidecar schema. Add source-specific
   extras as additional top-level keys if they don't fit.
3. Add an entry to `sources.json` (`id`, `title`, `type`, `homepage`,
   `scraper`, `scraper_version`, `license_note`).
4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real
   User-Agent identifying the project, retry with backoff on 429/5xx,
   and respect robots.txt unless an explicit carve-out exists (e.g.
   Bayer's RAG allowlist).
5. Scrapers MUST be idempotent: skip records already on disk unless
   `--force` is set.
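
A skeleton covering steps 1 and 4 might look like the sketch below. It is a starting point under stated assumptions, not the project's template: `polite_get`, `build_parser`, the `USER_AGENT` string, and the retry count are all hypothetical, and robots.txt handling is left out.

```python
import argparse
import time
import urllib.error
import urllib.request

# Placeholder value; a real scraper should identify the actual project.
USER_AGENT = "example-label-scraper/0.1 (+https://example.invalid/contact)"
MIN_INTERVAL_S = 1.0  # keeps the request rate at or below 1 req/sec
_last_request_at = 0.0


def polite_get(url: str, max_attempts: int = 4) -> bytes:
    """GET with a 1 req/sec floor and exponential backoff on 429/5xx."""
    global _last_request_at
    for attempt in range(max_attempts):
        wait = MIN_INTERVAL_S - (time.monotonic() - _last_request_at)
        if wait > 0:
            time.sleep(wait)
        _last_request_at = time.monotonic()
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            retryable = err.code == 429 or 500 <= err.code < 600
            if not retryable or attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between retries
    raise RuntimeError("max_attempts must be at least 1")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Example source scraper")
    parser.add_argument("--limit", type=int, default=None, help="max records to fetch")
    parser.add_argument("--force", action="store_true", help="re-fetch records already on disk")
    return parser


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    # A real source would enumerate keys here, skip those already on disk
    # unless args.force is set, fetch the rest with polite_get(), and write
    # <key>.md plus <key>.json for each record.
    print(f"limit={args.limit} force={args.force}")
    return 0
```

To run standalone with `python -m`, the module would also need the usual `__main__` guard that passes `sys.argv[1:]` to `main`.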