justin 1a45280e45 rename: ppls-docs → crop-chem-docs
Repo/project rename to better reflect scope. PPLS is EPA's term for
their Pesticide Product Label System — accurate when the corpus was
EPA-only, narrow now that it also pulls from Bayer's own catalog
(and may expand to Syngenta/Corteva/BASF/FMC labels in the future).
crop-chem-docs scopes flexibly without acronyms to explain.

Renames:
- directory:           ppls-docs            → crop-chem-docs
- PRODUCT_NAME:        ppls                 → crop_chem
- Chroma collection:   ppls_docs            → crop_chem_docs  (in-place via .modify(), no re-embed)
- BM25 db:             bm25/ppls_docs.db    → bm25/crop_chem_docs.db
- MCP tool name:       ppls_api_lessons     → crop_chem_api_lessons
- FastMCP server name: ppls-docs            → crop-chem-docs
- Env vars:            PPLS_CORPUS_ROOT     → CORPUS_ROOT
                       PPLS_CHROMA_DIR      → CHROMA_DIR_OVERRIDE
- User-Agent:          ppls-docs-scraper    → crop-chem-docs-scraper

Preserved (intentional, correct):
- epa_ppls (source id) — refers specifically to EPA's PPLS database
- "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings)
- PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in
  scrape/sources/epa_ppls.py — these point at EPA's actual endpoints

Memory entries get updated in a follow-up commit so the rename is
isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:25:59 -04:00

# scrape/
Per-source scrapers for pesticide / herbicide product labels. Each
module under `scrape/sources/` pulls a single upstream catalog and
writes its results into `corpus/<source_id>/` using the canonical
sidecar schema documented below.
## Architecture
```
sources.json — registry of active sources
scrape/runner.py — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py — one source per file
corpus/<id>/<key>.md — extracted label text (markdown)
corpus/<id>/<key>.json — canonical metadata sidecar
```
`<key>` is the per-source primary key — a slug for manufacturer
sources (e.g. `warrant`, `roundup-powermax-3`) or an EPA Reg No
for regulator sources (e.g. `524-475`). The sidecar's
`epa_reg_no` field is the cross-source join key that lets the
corpus consumer reconcile records from different sources for the
same product.
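
The cross-source join can be sketched as below. This is a minimal illustration, assuming sidecars have already been loaded as plain dicts; `group_by_reg_no` is a hypothetical helper, not a function in this repo, and the records are illustrative.

```python
from collections import defaultdict


def group_by_reg_no(sidecars):
    """Group sidecar dicts from any source by their epa_reg_no.

    Records whose epa_reg_no is null (None) cannot be joined and are skipped.
    """
    groups = defaultdict(list)
    for sidecar in sidecars:
        reg_no = sidecar.get("epa_reg_no")
        if reg_no is not None:
            groups[reg_no].append(sidecar)
    return dict(groups)


# Illustrative records only: one manufacturer record, one regulator record,
# and one record that has no registration number to join on.
records = [
    {"source": "bayer", "source_key": "warrant", "epa_reg_no": "524-591"},
    {"source": "epa_ppls", "source_key": "524-591", "epa_reg_no": "524-591"},
    {"source": "bayer", "source_key": "no-reg-no", "epa_reg_no": None},
]
joined = group_by_reg_no(records)
```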
## CLI
```bash
# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475
# Run every source registered in sources.json
python -m scrape.runner --all --limit 50
# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt
```
Every scraper is **idempotent** by default — re-running with the
same arguments skips records already on disk. Use `--force` to
re-fetch.
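
One way to implement the skip check is sketched here. It assumes a record counts as "on disk" once both the `.md` file and the `.json` sidecar exist; the actual scrapers may use a different criterion, and `should_fetch` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def should_fetch(corpus_dir: Path, key: str, force: bool = False) -> bool:
    """Return True when the record for `key` needs to be (re-)fetched."""
    if force:
        return True
    md_path = corpus_dir / f"{key}.md"
    sidecar_path = corpus_dir / f"{key}.json"
    # Assumption: both files must be present for the record to be skipped.
    return not (md_path.exists() and sidecar_path.exists())


with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp)
    first_run = should_fetch(corpus_dir, "warrant")
    (corpus_dir / "warrant.md").write_text("label text")
    (corpus_dir / "warrant.json").write_text("{}")
    second_run = should_fetch(corpus_dir, "warrant")
    forced_run = should_fetch(corpus_dir, "warrant", force=True)
```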
### Corpus location
Default: `corpus/` at the repo root. Override with the
`CORPUS_ROOT` env var to route the corpus to external storage
(USB drive, NAS mount, secondary partition):
```bash
export CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/crop-chem-corpus/bayer/...
```
All sources honor the same env var; each creates its own
`<source_id>/` subdirectory beneath it. Per-source code paths
still resolve `CORPUS_DIR` correctly whether the env var is set
or not.
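
The resolution rule might look like the following sketch. `REPO_ROOT` and `resolve_corpus_dir` are placeholders for illustration; the real modules may compute the repo root and `CORPUS_DIR` differently.

```python
import os
from pathlib import Path

# Assumption: stand-in for the repo root; not the real location.
REPO_ROOT = Path("/path/to/repo")


def resolve_corpus_dir(source_id: str, environ=None) -> Path:
    """Return the per-source corpus directory, honoring CORPUS_ROOT if set."""
    environ = os.environ if environ is None else environ
    root = environ.get("CORPUS_ROOT")
    base = Path(root) if root else REPO_ROOT / "corpus"
    return base / source_id


default_dir = resolve_corpus_dir("bayer", environ={})
override_dir = resolve_corpus_dir(
    "bayer", environ={"CORPUS_ROOT": "/mnt/big-disk/crop-chem-corpus"}
)
```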
## Scope: corn / soybeans / wheat
The corpus is scoped to the three crops the consumer app focuses on:
**corn (incl. maize, popcorn), soybeans, and wheat.** The EPA PPLS
scraper enforces this by inspecting the `sites` array on each
product's PPLS API response and dropping anything without a matching
site (word-boundary match against `ROW_CROP_KEYWORDS`).
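
A word-boundary match of this kind can be sketched as follows. The keyword list here is illustrative, not the real `ROW_CROP_KEYWORDS`, and the sketch assumes `sites` is a list of plain strings; the actual API response shape may differ.

```python
import re

# Assumption: illustrative keywords only; the real ROW_CROP_KEYWORDS may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

# \b on both sides prevents hits inside longer words such as "acorn".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_row_crop(sites):
    """Return True if any site string contains a keyword as a whole word."""
    return any(_PATTERN.search(site) for site in sites)


keep = matches_row_crop(["COTTON", "CORN (FIELD)"])
drop = matches_row_crop(["ACORN SQUASH", "BUCKWHEAT"])
```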
Empirically (random N=100 sample): this narrow allowlist matches
~16% of all PPLS products and only loses ~6% of the broader
"all US row crops" hit set, because corn/soy/wheat dominate ag
chemistry registrations — products registered for cotton/sorghum/
rice/etc. are almost always *also* registered for one of corn,
soy, or wheat.
The Bayer scraper doesn't filter — its catalog is implicitly
ag-focused, and the catalog product names + descriptions don't
expose enough crop metadata for a pre-API filter to be reliable.
Add per-source filters as needed if other manufacturer sources
turn up non-ag products.
Override the EPA filter for a one-off broader pull:
```bash
python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486
```
### EPA registrant allowlist
The EPA scraper applies a second filter at PPIS enumeration time:
**only consider products from companies on the row-crop ag-chem
allowlist** at [`scrape/sources/epa_registrant_allowlist.json`](sources/epa_registrant_allowlist.json).
This is a pre-API filter — products from non-allowlist registrants
are dropped before paying the per-product API call cost.
Effect: the 102,378-row PPIS universe shrinks to ~11,500 rows
(~89% reduction). Full backfill drops from ~28 h to ~3 h.
The allowlist covers the major US row-crop ag-chem registrants
(Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh,
Loveland, AMVAC, Helena, Drexel, Atticus, etc.) — see the JSON file
for the full set with verified company names. Edit it freely; the
scraper loads it at run time. Each entry was verified by querying
the EPA PPLS API for the first active product registered under that
company number.
Bypass with `--no-registrant-filter` to enumerate the full universe
(useful if you suspect a row-crop product is registered to a small
or specialty company not on the list).
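
The pre-API filter can be sketched as below, relying on the fact that the first segment of an EPA Reg No is the registrant's company number. The allowlist layout (company number mapped to a name), the placeholder names, and the row shape are all assumptions for illustration; the real JSON file and PPIS rows may be structured differently.

```python
# Assumption: company-number strings mapped to names; names are placeholders.
ALLOWLIST = {
    "524": "Example Registrant A",
    "100": "Example Registrant B",
}


def company_number(reg_no: str) -> str:
    """Return the company-number segment of an EPA Reg No (e.g. '524')."""
    return reg_no.split("-", 1)[0]


def filter_rows(rows, allowlist):
    """Keep only rows whose registrant company number is on the allowlist."""
    return [r for r in rows if company_number(r["epa_reg_no"]) in allowlist]


# Hypothetical enumeration rows; '99999-1' stands for a non-allowlist registrant.
rows = [
    {"epa_reg_no": "524-475"},
    {"epa_reg_no": "99999-1"},
    {"epa_reg_no": "100-1486"},
]
kept = filter_rows(rows, ALLOWLIST)
```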
## Canonical sidecar schema
Every `corpus/<source>/<key>.json` conforms to this shape. Fields
that don't apply to a given source are `null` (not omitted) so the
JSON is uniform across sources.
```json
{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
```
### Field reference
| Field | Type | Required | Notes |
|---|---|---|---|
| `source` | string | yes | Matches an `id` in `sources.json`. |
| `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
| `epa_reg_no` | string \| null | best-effort | Canonical EPA registration (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
| `product_name` | string \| null | yes | Display name. |
| `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
| `registrant` | string \| null | best-effort | Expected for regulator sources; often `null` for manufacturer sources, where it would be redundant. |
| `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
| `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
| `label.url` | string \| null | yes | Direct URL of the current label PDF. |
| `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. Manufacturer sources may not expose this. |
| `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
| `label.page_count` | int \| null | best-effort | Page count of the label PDF, determined after download. |
| `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
| `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
| `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
| `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
| `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |
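
The `label.last_modified` normalization can be sketched with the standard library. `normalize_last_modified` is a hypothetical helper; how the scrapers actually parse the header is not documented here.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def normalize_last_modified(header_value):
    """Convert an HTTP Last-Modified header to ISO 8601 UTC, or None."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Unparseable header: record null rather than guess.
        return None
    if parsed.tzinfo is None:
        # Assumption: a header without zone information is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


iso = normalize_last_modified("Fri, 15 May 2026 20:21:54 GMT")
# iso == "2026-05-15T20:21:54+00:00"
```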
Sources may add their own extra fields beyond the canonical schema
(EPA's sidecars carry `registration_status` and
`registrant_company_number`, for instance). Consumers should ignore
unknown fields.
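
A consumer that tolerates source-specific extras might project each sidecar onto the canonical top-level keys, as in this sketch. `canonical_view` is a hypothetical helper; the key list is taken from the field reference above.

```python
CANONICAL_KEYS = (
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
)


def canonical_view(sidecar):
    """Return only the canonical top-level keys, ignoring unknown extras.

    Canonical keys must be present (null is fine), so a missing key is an error.
    """
    missing = [k for k in CANONICAL_KEYS if k not in sidecar]
    if missing:
        raise KeyError(f"sidecar is missing canonical keys: {missing}")
    return {k: sidecar[k] for k in CANONICAL_KEYS}


# Illustrative sidecar: every canonical field null, plus one EPA-style extra.
sidecar = {k: None for k in CANONICAL_KEYS}
sidecar["registration_status"] = "Active"
view = canonical_view(sidecar)
```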
## Adding a new source
1. Write `scrape/sources/<id>.py` exposing a `main(argv: list[str]) -> int`
   that accepts at minimum `--limit N` and `--force`.
2. Conform to the canonical sidecar schema. Add source-specific
   extras as additional top-level keys if they don't fit.
3. Add an entry to `sources.json` (`id`, `title`, `type`, `homepage`,
   `scraper`, `scraper_version`, `license_note`).
4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real
   User-Agent identifying the project, retry with backoff on 429/5xx,
   and respect robots.txt unless an explicit carve-out exists (e.g.
   Bayer's RAG allowlist).
5. Scrapers MUST be idempotent: skip records already on disk unless
   `--force` is set.
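
A skeleton for a new source module could start from the sketch below. The module name `example`, the help strings, and the exponential backoff schedule are assumptions; the requirements above only say "retry with backoff" and do not prescribe a schedule.

```python
import argparse

SCRAPER_VERSION = "0.1.0"
MIN_REQUEST_INTERVAL_S = 1.0  # politeness ceiling: at most 1 request per second


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) on 429/5xx.

    Assumption: capped exponential backoff; any backoff scheme would satisfy
    the stated requirement.
    """
    return min(cap, base * 2 ** attempt)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scrape.sources.example")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="stop after N records")
    parser.add_argument("--force", action="store_true",
                        help="re-fetch records already on disk")
    return parser


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    # A real source would enumerate its catalog here, skip records already on
    # disk unless args.force is set, and write <key>.md plus <key>.json.
    print(f"limit={args.limit} force={args.force}")
    return 0


exit_code = main(["--limit", "5", "--force"])
```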