justin 1a45280e45 rename: ppls-docs → crop-chem-docs
Repo/project rename to better reflect scope. PPLS is EPA's term for
their Pesticide Product Label System — accurate when the corpus was
EPA-only, narrow now that it also pulls from Bayer's own catalog
(and may expand to Syngenta/Corteva/BASF/FMC labels in the future).
crop-chem-docs scopes flexibly without acronyms to explain.

Renames:
- directory:           ppls-docs            → crop-chem-docs
- PRODUCT_NAME:        ppls                 → crop_chem
- Chroma collection:   ppls_docs            → crop_chem_docs  (in-place via .modify(), no re-embed)
- BM25 db:             bm25/ppls_docs.db    → bm25/crop_chem_docs.db
- MCP tool name:       ppls_api_lessons     → crop_chem_api_lessons
- FastMCP server name: ppls-docs            → crop-chem-docs
- Env vars:            PPLS_CORPUS_ROOT     → CORPUS_ROOT
                       PPLS_CHROMA_DIR      → CHROMA_DIR_OVERRIDE
- User-Agent:          ppls-docs-scraper    → crop-chem-docs-scraper

Preserved (intentional, correct):
- epa_ppls (source id) — refers specifically to EPA's PPLS database
- "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings)
- PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in
  scrape/sources/epa_ppls.py — these point at EPA's actual endpoints

Memory entries get updated in a follow-up commit so the rename is
isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:25:59 -04:00


scrape/

Per-source scrapers for pesticide / herbicide product labels. Each module under scrape/sources/ pulls a single upstream catalog and writes its results into corpus/<source_id>/ using the canonical sidecar schema documented below.

Architecture

sources.json                       — registry of active sources
scrape/runner.py                   — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py             — one source per file
corpus/<id>/<key>.md               — extracted label text (markdown)
corpus/<id>/<key>.json             — canonical metadata sidecar

<key> is the per-source primary key — a slug for manufacturer sources (e.g. warrant, roundup-powermax-3) or an EPA Reg No for regulator sources (e.g. 524-475). The sidecar's epa_reg_no field is the cross-source join key that lets the corpus consumer reconcile records from different sources for the same product.
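
A consumer-side join on epa_reg_no could look roughly like the sketch below. The function name group_by_reg_no is hypothetical, and the sketch assumes exact string matches only; it makes no attempt to reconcile registration numbers that carry a distributor suffix.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_reg_no(corpus_root: Path) -> dict[str, list[dict]]:
    """Group sidecars from every source by their epa_reg_no join key."""
    groups: dict[str, list[dict]] = defaultdict(list)
    # Layout from the Architecture section: corpus/<id>/<key>.json
    for sidecar_path in sorted(corpus_root.glob("*/*.json")):
        record = json.loads(sidecar_path.read_text(encoding="utf-8"))
        reg_no = record.get("epa_reg_no")
        if reg_no:  # best-effort field: skip records that lack it
            groups[reg_no].append(record)
    return dict(groups)
```

Any group with more than one record is a product seen by several sources, for example a manufacturer record keyed by slug and a regulator record keyed by registration number.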

CLI

# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt

Every scraper is idempotent by default — re-running with the same arguments skips records already on disk. Use --force to re-fetch.
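
The skip-unless-forced check can be sketched as follows. The helper name should_fetch is hypothetical, and treating a record as complete only when both the .md and .json files exist is an assumption, not necessarily what each scraper does.

```python
from pathlib import Path


def should_fetch(corpus_dir: Path, key: str, force: bool = False) -> bool:
    """Return True when the record for `key` still needs to be fetched."""
    if force:
        return True
    markdown_path = corpus_dir / f"{key}.md"
    sidecar_path = corpus_dir / f"{key}.json"
    # Assumption: a record counts as done only when both files are present.
    return not (markdown_path.exists() and sidecar_path.exists())
```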

Corpus location

Default: corpus/ at the repo root. Override with the CORPUS_ROOT env var to route the corpus to external storage (USB drive, NAS mount, secondary partition):

export CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/crop-chem-corpus/bayer/...

All sources honor the same env var; each creates its own <source_id>/ subdirectory beneath it. Per-source code paths still resolve CORPUS_DIR correctly whether the env var is set or not.
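
One way this resolution might be written is shown below. The function name resolve_corpus_dir and its parameters are illustrative; the actual modules may compute CORPUS_DIR differently.

```python
import os
from pathlib import Path


def resolve_corpus_dir(source_id: str, repo_root: Path, env=None) -> Path:
    """Return the directory a source writes into, honoring CORPUS_ROOT."""
    env = os.environ if env is None else env
    override = env.get("CORPUS_ROOT")
    # Fall back to corpus/ at the repo root when the env var is unset or empty.
    root = Path(override).expanduser() if override else repo_root / "corpus"
    return root / source_id
```

Passing env explicitly keeps the function easy to test without touching the real process environment.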

Scope: corn / soybeans / wheat

The corpus is scoped to the three crops the consumer app focuses on: corn (incl. maize, popcorn), soybeans, and wheat. The EPA PPLS scraper enforces this by inspecting the sites array on each product's PPLS API response and dropping anything without a matching site (word-boundary match against ROW_CROP_KEYWORDS).
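
A word-boundary filter of this kind could be sketched as below. The keyword tuple is a guess at what ROW_CROP_KEYWORDS contains, and the sketch assumes sites is a list of plain strings; the real API response may nest site names inside objects.

```python
import re

# Hypothetical contents: the real ROW_CROP_KEYWORDS may differ.
ROW_CROP_KEYWORDS = ("corn", "maize", "popcorn", "soybean", "soybeans", "wheat")

_ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROW_CROP_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_row_crop_product(sites: list[str]) -> bool:
    """True when at least one registered site names corn, soybeans, or wheat."""
    return any(_ROW_CROP_RE.search(site) for site in sites)
```

The word boundaries are what keep a site such as "ACORNS" or "CORNFLOWER" from matching on the substring "corn".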

Empirically (random N=100 sample), this narrow allowlist matches ~16% of all PPLS products and loses only ~6% of the broader "all US row crops" hit set. Corn, soy, and wheat dominate ag-chemistry registrations: products registered for cotton, sorghum, rice, etc. are almost always also registered for one of the three.

The Bayer scraper doesn't filter — its catalog is implicitly ag-focused, and the catalog product names + descriptions don't expose enough crop metadata for a pre-API filter to be reliable. Add per-source filters as needed if other manufacturer sources turn up non-ag products.

Override the EPA filter for a one-off broader pull:

python -m scrape.sources.epa_ppls --no-row-crop-filter --reg-no 100-1486

EPA registrant allowlist

The EPA scraper applies a second filter at PPIS enumeration time: it considers only products from companies on the row-crop ag-chem allowlist at scrape/sources/epa_registrant_allowlist.json. This is a pre-API filter, so products from non-allowlist registrants are dropped before paying the per-product API call cost.

Effect: the 102,378-row PPIS universe shrinks to ~11,500 rows (~89% reduction). Full backfill drops from ~28 h to roughly 3 h, assuming the same pace of about one request per second at which 102,378 rows take ~28 h.

The allowlist covers the major US row-crop ag-chem registrants (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.) — see the JSON file for the full set with verified company names. Edit it freely; the scraper loads it at run time. Each entry was verified by querying the EPA PPLS API for the first active product registered under that company number.

Bypass with --no-registrant-filter to enumerate the full universe (useful if you suspect a row-crop product is registered to a small or specialty company not on the list).
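
Because the first segment of an EPA Reg No is the registrant's company number, the pre-API filter can be sketched as a simple prefix check. The function names are hypothetical, and the allowlist is passed in as a plain set because the layout of epa_registrant_allowlist.json is not described here.

```python
def company_number(epa_reg_no: str) -> str:
    """Return the company-number segment of an EPA Reg No such as 524-475."""
    return epa_reg_no.split("-", 1)[0].strip()


def filter_by_registrant(reg_nos: list[str], allowed_companies: set[str]) -> list[str]:
    """Keep only registration numbers whose company number is allowlisted."""
    return [r for r in reg_nos if company_number(r) in allowed_companies]


# Illustrative values only: the real allowlist lives in the JSON file.
kept = filter_by_registrant(["524-475", "100-1486", "9999-1"], {"524", "100"})
```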

Canonical sidecar schema

Every corpus/<source>/<key>.json conforms to this shape. Fields that don't apply to a given source are null (not omitted) so the JSON is uniform across sources.

{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
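
To keep every sidecar uniform, a scraper might start from a template in which each canonical key is present and defaults to null, then fill in what it knows. The helper below is a sketch; empty_sidecar is not a name taken from the codebase.

```python
import json
from datetime import datetime, timezone


def empty_sidecar(source: str, source_key: str, scraper_version: str) -> dict:
    """Return a sidecar with every canonical key present (None, not omitted)."""
    return {
        "source": source,
        "source_key": source_key,
        "epa_reg_no": None,
        "product_name": None,
        "product_class": None,
        "registrant": None,
        "active_ingredients": [],
        "signal_word": None,
        "label": {
            "url": None, "filename": None, "accepted_date": None,
            "last_modified": None, "page_count": None, "text_layer": None,
        },
        "supplemental_documents": [],
        "source_urls": {"product_page": None, "label_api": None, "label_index": None},
        "fetched_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "scraper_version": scraper_version,
    }


sidecar = empty_sidecar("bayer", "warrant", "0.1.0")
sidecar["product_name"] = "Warrant Herbicide"
serialized = json.dumps(sidecar, indent=2)  # None values serialize as null
```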

Field reference

| Field | Type | Required | Notes |
|---|---|---|---|
| source | string | yes | Matches an id in sources.json. |
| source_key | string | yes | Per-source primary key. Filesystem-safe. |
| epa_reg_no | string \| null | best-effort | Canonical EPA registration (e.g. 524-591, or 524-591-12345 with distributor suffix). The cross-source join key. |
| product_name | string \| null | yes | Display name. |
| product_class | string \| null | best-effort | One of herbicide, fungicide, insecticide, seed-treatment, rodenticide, other. EPA PPLS leaves this null; manufacturer sources usually know. |
| registrant | string \| null | best-effort | Required-ish for regulator sources, often null for MFR sources where redundant. |
| active_ingredients | array of objects | yes (may be empty) | [{name, cas, percent}]. cas and percent are null when the source doesn't expose them. |
| signal_word | string \| null | best-effort | Danger, Warning, Caution, or null. Operationally critical for the farmer advisor. |
| label.url | string \| null | yes | Direct URL of the current label PDF. |
| label.filename | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| label.accepted_date | ISO date \| null | best-effort | EPA-stamped acceptance date. MFR sources may not expose this. |
| label.last_modified | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP Last-Modified header. Always normalized to ISO 8601 UTC. |
| label.page_count | int \| null | best-effort | After download. |
| label.text_layer | bool \| null | best-effort | false for scanned PDFs that need OCR. |
| supplemental_documents | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| source_urls.product_page | string \| null | best-effort | The HTML product page on the source site. |
| source_urls.label_api | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| source_urls.label_index | string \| null | best-effort | The human-readable index/search URL. |
| fetched_at | ISO 8601 datetime | yes | When this sidecar was generated. |
| scraper_version | string | yes | Source module's SCRAPER_VERSION constant. |
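
The label.last_modified normalization can be done with the standard library alone. The sketch below is one possible approach, not necessarily the one the scrapers use, and the function name is hypothetical.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_last_modified(header_value: Optional[str]) -> Optional[str]:
    """Convert an HTTP Last-Modified header to an ISO 8601 UTC string."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return None  # unparseable header: leave the field null
    if parsed.tzinfo is None:
        # Headers with a "-0000" zone parse as naive; treat them as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()
```

For example, "Fri, 15 May 2026 20:21:54 GMT" becomes "2026-05-15T20:21:54+00:00", which matches the format in the sample sidecar.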

Sources may add their own extra fields beyond the canonical schema (EPA's sidecars carry registration_status and registrant_company_number, for instance). Consumers should ignore unknown fields.
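
On the consumer side, ignoring unknown fields can be as simple as projecting each record onto the canonical top-level keys. This is a sketch; canonical_view is a hypothetical helper name.

```python
CANONICAL_TOP_LEVEL = (
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
)


def canonical_view(record: dict) -> dict:
    """Keep only canonical keys; source-specific extras are dropped."""
    return {key: record.get(key) for key in CANONICAL_TOP_LEVEL}


epa_record = {"source": "epa_ppls", "source_key": "524-475",
              "registration_status": "Active", "registrant_company_number": "524"}
view = canonical_view(epa_record)  # extras such as registration_status are gone
```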

Adding a new source

  1. Write scrape/sources/<id>.py exposing a main(argv: list[str]) -> int that accepts at minimum --limit N and --force.
  2. Conform to the canonical sidecar schema. Add source-specific extras as additional top-level keys if they don't fit.
  3. Add an entry to sources.json (id, title, type, homepage, scraper, scraper_version, license_note).
  4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real User-Agent identifying the project, retry with backoff on 429/5xx, and respect robots.txt unless an explicit carve-out exists (e.g. Bayer's RAG allowlist).
  5. Scrapers MUST be idempotent: skip records already on disk unless --force is set.
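
A new source module meeting the requirements above might start from a skeleton like the one below. Everything in it is illustrative: the RateLimiter class, the backoff parameters, and the placeholder key list are assumptions, and robots.txt handling is left out.

```python
import argparse
import time

SCRAPER_VERSION = "0.1.0"
MIN_INTERVAL_S = 1.0  # politeness floor: at most 1 request per second


class RateLimiter:
    """Blocks so consecutive wait() calls are at least min_interval_s apart."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request made yet

    def wait(self) -> None:
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval_s


def is_retryable(status: int) -> bool:
    """Retry on 429 and on any 5xx response."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(retries: int = 4, base_s: float = 2.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base_s * (2 ** attempt) for attempt in range(retries)]


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description="Example source scraper")
    parser.add_argument("--limit", type=int, default=None, metavar="N")
    parser.add_argument("--force", action="store_true")
    args = parser.parse_args(argv)

    keys = ["example-product"]  # placeholder for real catalog discovery
    if args.limit is not None:
        keys = keys[: args.limit]
    limiter = RateLimiter(MIN_INTERVAL_S)
    for key in keys:
        limiter.wait()
        # Fetch `key` here, then write <key>.md and <key>.json,
        # skipping records already on disk unless args.force is set.
    return 0
```

The clock and sleep functions are injectable so the limiter can be tested without real delays.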