
# scrape/

Per-source scrapers for pesticide / herbicide product labels. Each
module under `scrape/sources/` pulls a single upstream catalog and
writes its results into `corpus/<source_id>/` using the canonical
sidecar schema documented below.

## Architecture

```
sources.json                       — registry of active sources
scrape/runner.py                   — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py             — one source per file
corpus/<id>/<key>.md               — extracted label text (markdown)
corpus/<id>/<key>.json             — canonical metadata sidecar
```

`<key>` is the per-source primary key — a slug for manufacturer
sources (e.g. `warrant`, `roundup-powermax-3`) or an EPA Reg No
for regulator sources (e.g. `524-475`). The sidecar's
`epa_reg_no` field is the cross-source join key that lets the
corpus consumer reconcile records from different sources for the
same product.
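
As a rough illustration of that reconciliation step, a consumer could group sidecars by `epa_reg_no` along these lines. The function name `group_by_reg_no` is hypothetical and not part of this repo; the sketch assumes only the `corpus/<id>/<key>.json` layout described above.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_reg_no(corpus_root: Path) -> dict[str, list[dict]]:
    """Group every sidecar under corpus_root by its epa_reg_no (hypothetical helper)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    # Sidecars live at <corpus_root>/<source_id>/<key>.json
    for path in sorted(corpus_root.glob("*/*.json")):
        sidecar = json.loads(path.read_text(encoding="utf-8"))
        reg_no = sidecar.get("epa_reg_no")
        if reg_no is None:
            continue  # the join key is best-effort; records without it cannot be joined
        groups[reg_no].append(sidecar)
    return dict(groups)
```

This joins on the exact string, so a distributor-suffixed number such as `524-591-12345` would land in a different group than `524-591` unless the consumer strips the suffix first.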

## CLI

```bash
# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt
```

Every scraper is **idempotent** by default — re-running with the
same arguments skips records already on disk. Use `--force` to
re-fetch.
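
The skip check can be pictured as follows. This is a minimal sketch, not the scrapers' actual code: `should_fetch` is a hypothetical name, and it assumes a record counts as "on disk" only when both the `.md` and `.json` files exist.

```python
from pathlib import Path


def should_fetch(corpus_dir: Path, key: str, force: bool = False) -> bool:
    """Return True when the record for `key` needs to be fetched (hypothetical helper)."""
    if force:
        return True  # --force always re-fetches
    md_path = corpus_dir / f"{key}.md"
    sidecar_path = corpus_dir / f"{key}.json"
    # Assumption: a record is complete only when both files are present.
    return not (md_path.exists() and sidecar_path.exists())
```

Requiring both files means a run that was interrupted between the two writes gets retried instead of being silently skipped.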

### Corpus location

Default: `corpus/` at the repo root. Override with the
`CORPUS_ROOT` env var to route the corpus to external storage
(USB drive, NAS mount, secondary partition):

```bash
export CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/crop-chem-corpus/bayer/...
```

All sources honor the same env var; each creates its own
`<source_id>/` subdirectory beneath it. Each source resolves its
own `CORPUS_DIR` from `CORPUS_ROOT` when the variable is set and
falls back to `corpus/` at the repo root when it is not.
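
One plausible shape for that resolution is sketched below. `resolve_corpus_dir` and its `repo_root` parameter are assumptions made for illustration; the real modules may compute `CORPUS_DIR` differently.

```python
import os
from pathlib import Path


def resolve_corpus_dir(source_id: str, repo_root: Path) -> Path:
    """Return the directory a source writes into (hypothetical helper)."""
    env_root = os.environ.get("CORPUS_ROOT")
    # An unset or empty CORPUS_ROOT falls back to <repo_root>/corpus.
    root = Path(env_root) if env_root else repo_root / "corpus"
    return root / source_id
```

A caller would typically follow this with `mkdir(parents=True, exist_ok=True)` before writing.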

## Scope: corn / soybeans / wheat

The corpus is scoped to the three crops the consumer app focuses on:
**corn (incl. maize, popcorn), soybeans, and wheat.** The EPA PPLS
scraper enforces this by inspecting the `sites` array on each
product's PPLS API response and dropping anything without a matching
site (word-boundary match against `ROW_CROP_KEYWORDS`).
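
A word-boundary match of this kind might look like the sketch below. The keyword tuple is an illustrative stand-in (the real `ROW_CROP_KEYWORDS` may differ), `matches_row_crop` is a hypothetical name, and `sites` is assumed to be a list of plain strings.

```python
import re

# Illustrative stand-in; the scraper's actual ROW_CROP_KEYWORDS may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

# \b on both sides keeps "corn" from matching inside "acorn" and
# "wheat" from matching inside "buckwheat".
_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_row_crop(sites: list[str]) -> bool:
    """True when at least one site string names an in-scope crop."""
    return any(_ROW_CROP_RE.search(site) for site in sites)
```

For example, `matches_row_crop(["ACORN SQUASH", "BUCKWHEAT"])` is `False`, while `matches_row_crop(["CORN (FIELD)"])` is `True`.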

Empirically, in a random sample of 100 PPLS products, this narrow
allowlist matched ~16% of products and dropped only ~6% of the
products that a broader "all US row crops" filter would have kept.
Corn, soy, and wheat dominate ag-chemistry registrations: products
registered for cotton, sorghum, rice, and similar crops are almost
always *also* registered for one of corn, soy, or wheat.

The Bayer scraper doesn't filter — its catalog is implicitly
ag-focused, and the catalog product names + descriptions don't
expose enough crop metadata for a pre-API filter to be reliable.
Add per-source filters as needed if other manufacturer sources
turn up non-ag products.

Override the EPA filter for a one-off broader pull:

```bash
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
```

### EPA registrant allowlist

The EPA scraper applies a second filter at PPIS enumeration time:
**only consider products from companies on the row-crop ag-chem
allowlist** at [`scrape/sources/epa_registrant_allowlist.json`](sources/epa_registrant_allowlist.json).
This is a pre-API filter — products from non-allowlist registrants
are dropped before paying the per-product API call cost.

Effect: the 102,378-row PPIS universe shrinks to ~11,500 rows
(~89% reduction). Full backfill drops from ~28 h to ~5–6 h.

The allowlist covers the major US row-crop ag-chem registrants
(Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh,
Loveland, AMVAC, Helena, Drexel, Atticus, etc.) — see the JSON file
for the full set with verified company names. Edit it freely; the
scraper loads it at run time. Each entry was verified by querying
the EPA PPLS API for the first active product registered under that
company number.
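
The registrant filter can be sketched as below. This is a guess at the mechanics, not the scraper's code: it assumes the JSON file is an object keyed by company number, and relies on the company number being the first dash-separated segment of an EPA Reg No. All function names are hypothetical.

```python
import json
from pathlib import Path


def load_allowlist(path: Path) -> set[str]:
    """Load company numbers; assumed file shape is {"<company_number>": "<name>", ...}."""
    return set(json.loads(path.read_text(encoding="utf-8")))


def company_number(epa_reg_no: str) -> str:
    """Return the company-number segment, e.g. "524" for "524-475"."""
    return epa_reg_no.split("-", 1)[0].strip()


def filter_by_registrant(reg_nos: list[str], allowlist: set[str]) -> list[str]:
    """Keep only registration numbers whose company number is allowlisted."""
    return [r for r in reg_nos if company_number(r) in allowlist]
```

Because the check needs nothing beyond the registration number itself, it can run during enumeration, before any per-product API call is made.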

Bypass with `--no-registrant-filter` to enumerate the full universe
(useful if you suspect a row-crop product is registered to a small
or specialty company not on the list).

## Canonical sidecar schema

Every `corpus/<source>/<key>.json` conforms to this shape. Fields
that don't apply to a given source are `null` (not omitted) so the
JSON is uniform across sources.

```json
{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
```
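
To keep the shape uniform, a source could start from a fully populated skeleton and fill in what it knows. The sketch below is one way to do that; `blank_sidecar` is a hypothetical helper, and the `SCRAPER_VERSION` value is a placeholder for the source module's own constant.

```python
from datetime import datetime, timezone

SCRAPER_VERSION = "0.1.0"  # placeholder; normally the source module's constant


def blank_sidecar(source: str, source_key: str) -> dict:
    """Return a sidecar with every canonical field present (hypothetical helper)."""
    return {
        "source": source,
        "source_key": source_key,
        "epa_reg_no": None,
        "product_name": None,
        "product_class": None,
        "registrant": None,
        "active_ingredients": [],  # arrays default to empty, not None
        "signal_word": None,
        "label": {
            "url": None,
            "filename": None,
            "accepted_date": None,
            "last_modified": None,
            "page_count": None,
            "text_layer": None,
        },
        "supplemental_documents": [],
        "source_urls": {"product_page": None, "label_api": None, "label_index": None},
        # ISO 8601 UTC, e.g. 2026-05-23T22:05:29+00:00
        "fetched_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "scraper_version": SCRAPER_VERSION,
    }
```

Serializing with `json.dumps` turns each `None` into `null`, so unknown fields stay present in the output rather than being dropped.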

### Field reference

| Field | Type | Required | Notes |
|---|---|---|---|
| `source` | string | yes | Matches an `id` in `sources.json`. |
| `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
| `epa_reg_no` | string \| null | best-effort | Canonical EPA registration number (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
| `product_name` | string \| null | yes | Display name. |
| `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
| `registrant` | string \| null | best-effort | Expected for regulator sources; often `null` for manufacturer sources, where it would be redundant. |
| `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
| `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
| `label.url` | string \| null | yes | Direct URL of the current label PDF. |
| `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. Manufacturer sources may not expose this. |
| `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
| `label.page_count` | int \| null | best-effort | After download. |
| `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
| `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
| `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
| `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
| `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |
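
For `label.last_modified`, the normalization from an HTTP `Last-Modified` header to ISO 8601 UTC can be done with the standard library alone. The following is a sketch under that assumption; `normalize_last_modified` is a hypothetical name.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_last_modified(header_value: Optional[str]) -> Optional[str]:
    """Convert an HTTP Last-Modified value to ISO 8601 UTC, or None if unusable."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return None  # malformed header; exception type varies by Python version
    if parsed.tzinfo is None:
        # Treat a value without usable zone information as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat(timespec="seconds")
```

With this helper, `"Fri, 15 May 2026 20:21:54 GMT"` becomes `"2026-05-15T20:21:54+00:00"`, the form used in the example sidecar.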

Sources may add their own extra fields beyond the canonical schema
(EPA's sidecars carry `registration_status` and
`registrant_company_number`, for instance). Consumers should ignore
unknown fields.

## Adding a new source

1. Write `scrape/sources/<id>.py` exposing a `main(argv: list[str]) -> int`
   that accepts at minimum `--limit N` and `--force`.
2. Conform to the canonical sidecar schema. Add source-specific
   extras as additional top-level keys if they don't fit.
3. Add an entry to `sources.json` (`id`, `title`, `type`, `homepage`,
   `scraper`, `scraper_version`, `license_note`).
4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real
   User-Agent identifying the project, retry with backoff on 429/5xx,
   and respect robots.txt unless an explicit carve-out exists (e.g.
   Bayer's RAG allowlist).
5. Scrapers MUST be idempotent: skip records already on disk unless
   `--force` is set.
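
A starting point for a new source module, covering the entry point from step 1 and the politeness rules from step 4, might look like this. Everything here is a sketch: the module name `example`, the `Throttle` class, `backoff_delay`, and the placeholder key list are all hypothetical, and the actual fetching is left out.

```python
import argparse
import time
from typing import Optional

SCRAPER_VERSION = "0.1.0"
MIN_INTERVAL_S = 1.0  # at most one request per second


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scrape.sources.example")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="stop after N records")
    parser.add_argument("--force", action="store_true",
                        help="re-fetch records already on disk")
    return parser


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry `attempt` (0-based) after a 429/5xx."""
    return min(cap, base * (2 ** attempt))


class Throttle:
    """Sleep as needed so consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval: float = MIN_INTERVAL_S) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    throttle = Throttle()
    keys = ["example-product"]  # placeholder for the catalog enumeration
    for key in keys[: args.limit]:  # a limit of None keeps every key
        throttle.wait()
        # Fetch `key` here, honoring args.force and retrying with
        # backoff_delay(attempt) on 429/5xx responses.
    return 0
```

Adding an `if __name__ == "__main__":` guard that calls `sys.exit(main(sys.argv[1:]))` would let the module run standalone.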