justin e9250de8e7 scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema

Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.

Sources shipped:
  - bayer       — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
  - epa_ppls    — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
  - active_ingredients always [{name, cas, percent}]
  - label/* nested (url, filename, accepted_date, last_modified,
    page_count, text_layer)
  - all timestamps normalized to ISO 8601 UTC
  - signal_word surfaced (operationally critical for the farmer advisor)
  - source_key + epa_reg_no separate per-source PK from the
    cross-source join key

bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.

PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2     # works
  python -m scrape.runner --source bayer --limit 3    # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 18:27:07 -04:00

5.8 KiB


scrape/

Per-source scrapers for pesticide / herbicide product labels. Each module under scrape/sources/ pulls a single upstream catalog and writes its results into corpus/<source_id>/ using the canonical sidecar schema documented below.

Architecture

sources.json                       — registry of active sources
scrape/runner.py                   — thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py             — one source per file
corpus/<id>/<key>.md               — extracted label text (markdown)
corpus/<id>/<key>.json             — canonical metadata sidecar
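
A minimal sketch of how a thin dispatcher along these lines could pick the sources to run. It assumes `sources.json` is a JSON array of objects that each carry an `id`; the real file layout is not shown here, and the helper names `select_sources` and `module_name_for` are hypothetical.

```python
import json


def select_sources(registry, source_id=None, run_all=False):
    """Return the registry entries selected by --source <id> or --all."""
    if run_all:
        return list(registry)
    matches = [entry for entry in registry if entry["id"] == source_id]
    if not matches:
        raise KeyError(f"unknown source: {source_id!r}")
    return matches


def module_name_for(entry):
    """Map a registry entry to its module path under scrape/sources/."""
    return f"scrape.sources.{entry['id']}"


# Stand-in for the parsed contents of sources.json (assumed shape).
registry = json.loads('[{"id": "bayer"}, {"id": "epa_ppls"}]')
selected = [module_name_for(e) for e in select_sources(registry, run_all=True)]
```

A real runner would then load each module with `importlib.import_module(name)` and call its `main(argv)`.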

<key> is the per-source primary key — a slug for manufacturer sources (e.g. warrant, roundup-powermax-3) or an EPA Reg No for regulator sources (e.g. 524-475). The sidecar's epa_reg_no field is the cross-source join key that lets the corpus consumer reconcile records from different sources for the same product.
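
As an illustration of the join, a corpus consumer could group sidecars by `epa_reg_no` roughly as follows. This is a sketch, not code from the repository: `group_by_reg_no` is a hypothetical helper and the two records are made-up minimal sidecars.

```python
from collections import defaultdict


def group_by_reg_no(sidecars):
    """Group sidecar dicts by epa_reg_no, skipping records without one."""
    groups = defaultdict(list)
    for sidecar in sidecars:
        reg_no = sidecar.get("epa_reg_no")
        if reg_no is not None:
            groups[reg_no].append(sidecar)
    return dict(groups)


# Illustrative records: same product seen through two sources.
records = [
    {"source": "bayer", "source_key": "warrant", "epa_reg_no": "524-591"},
    {"source": "epa_ppls", "source_key": "524-591", "epa_reg_no": "524-591"},
    {"source": "bayer", "source_key": "no-reg-no-yet", "epa_reg_no": None},
]
joined = group_by_reg_no(records)
```

Note that this matches on the exact string, so a record carrying a distributor suffix would land in a separate group from its parent registration unless the consumer strips the suffix first.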

CLI

# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt

Every scraper is idempotent by default — re-running with the same arguments skips records already on disk. Use --force to re-fetch.
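
The skip check can be sketched as below. This assumes a record counts as "on disk" when both its `.md` and `.json` files exist; the actual criterion used by the scrapers is not stated here, and `needs_fetch` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def needs_fetch(corpus_dir, source_id, key, force=False):
    """Return True when a record should be (re-)fetched."""
    base = Path(corpus_dir) / source_id
    on_disk = (base / f"{key}.md").exists() and (base / f"{key}.json").exists()
    return force or not on_disk


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as corpus_dir:
    before = needs_fetch(corpus_dir, "bayer", "warrant")
    source_dir = Path(corpus_dir) / "bayer"
    source_dir.mkdir()
    (source_dir / "warrant.md").write_text("label text")
    (source_dir / "warrant.json").write_text("{}")
    after = needs_fetch(corpus_dir, "bayer", "warrant")
    forced = needs_fetch(corpus_dir, "bayer", "warrant", force=True)
```

Requiring both files means a run interrupted between the two writes is retried rather than silently skipped.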

Canonical sidecar schema

Every corpus/<source>/<key>.json conforms to this shape. Fields that don't apply to a given source are null (not omitted) so the JSON is uniform across sources.

{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
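
One way to guarantee the "null, not omitted" rule is to start every sidecar from a fully populated template and overlay only what the source knows. The sketch below assumes that approach; `CANONICAL_DEFAULTS` and `make_sidecar` are hypothetical names, not part of the repository.

```python
import copy

# Every canonical field present, with null/empty defaults.
CANONICAL_DEFAULTS = {
    "source": None,
    "source_key": None,
    "epa_reg_no": None,
    "product_name": None,
    "product_class": None,
    "registrant": None,
    "active_ingredients": [],
    "signal_word": None,
    "label": {
        "url": None, "filename": None, "accepted_date": None,
        "last_modified": None, "page_count": None, "text_layer": None,
    },
    "supplemental_documents": [],
    "source_urls": {"product_page": None, "label_api": None, "label_index": None},
    "fetched_at": None,
    "scraper_version": None,
}


def make_sidecar(**known):
    """Overlay known values on the template; nested dicts are merged."""
    sidecar = copy.deepcopy(CANONICAL_DEFAULTS)
    for field, value in known.items():
        if isinstance(value, dict) and isinstance(sidecar.get(field), dict):
            sidecar[field].update(value)
        else:
            sidecar[field] = value
    return sidecar


sidecar = make_sidecar(
    source="epa_ppls", source_key="524-475", epa_reg_no="524-475",
    label={"page_count": 167},
)
```

Fields the source did not supply, such as `product_class` or `label.url`, stay in the output as `null` when the dict is serialized with `json.dumps`.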

Field reference

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `source` | string | yes | Matches an `id` in `sources.json`. |
| `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
| `epa_reg_no` | string \| null | best-effort | Canonical EPA registration number (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
| `product_name` | string \| null | yes | Display name. |
| `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
| `registrant` | string \| null | best-effort | Expected for regulator sources; often `null` for manufacturer sources, where it is redundant. |
| `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
| `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
| `label.url` | string \| null | yes | Direct URL of the current label PDF. |
| `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. Manufacturer sources may not expose this. |
| `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
| `label.page_count` | int \| null | best-effort | Counted after the PDF is downloaded. |
| `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
| `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
| `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
| `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
| `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |
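
For `label.last_modified`, the normalization from an HTTP `Last-Modified` header to ISO 8601 UTC can be sketched with the standard library alone. The function name `normalize_last_modified` is hypothetical, and treating an offset-less date as UTC is an assumption of this sketch.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def normalize_last_modified(header_value):
    """Convert an HTTP date header to an ISO 8601 UTC string, or None."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Unparseable header: record null rather than a guessed timestamp.
        return None
    if parsed.tzinfo is None:
        # Assumption: a date without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


example = normalize_last_modified("Fri, 15 May 2026 20:21:54 GMT")
# example == "2026-05-15T20:21:54+00:00"
```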

Sources may add their own extra fields beyond the canonical schema (EPA's sidecars carry registration_status and registrant_company_number, for instance). Consumers should ignore unknown fields.
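
On the consumer side, ignoring unknown fields can be as simple as filtering on the canonical top-level keys. A sketch, with `CANONICAL_FIELDS` and `canonical_view` as hypothetical names and a made-up value for the extra field:

```python
CANONICAL_FIELDS = frozenset({
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
})


def canonical_view(sidecar):
    """Return only the canonical top-level fields of a sidecar."""
    return {k: v for k, v in sidecar.items() if k in CANONICAL_FIELDS}


raw = {
    "source": "epa_ppls",
    "source_key": "524-475",
    "registration_status": "example-status",  # source-specific extra
}
view = canonical_view(raw)
```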

Adding a new source

1. Write scrape/sources/<id>.py exposing a main(argv: list[str]) -> int that accepts at minimum --limit N and --force.
2. Conform to the canonical sidecar schema. Add source-specific extras as additional top-level keys if they don't fit.
3. Add an entry to sources.json (id, title, type, homepage, scraper, scraper_version, license_note).
4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real User-Agent identifying the project, retry with backoff on 429/5xx, and respect robots.txt unless an explicit carve-out exists (e.g. Bayer's RAG allowlist).
5. Scrapers MUST be idempotent: skip records already on disk unless --force is set.
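
A skeleton for a new source module might look like the following. It is a sketch under assumptions: the module name `example`, the exponential backoff schedule, and the helper names are all hypothetical, and the fetch loop itself is left out.

```python
import argparse

SCRAPER_VERSION = "0.1.0"
MIN_INTERVAL_SECONDS = 1.0  # at most one request per second
RETRYABLE_STATUSES = {429} | set(range(500, 600))


def parse_args(argv):
    """Parse the minimum CLI every source must accept."""
    parser = argparse.ArgumentParser(prog="scrape.sources.example")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="stop after N records")
    parser.add_argument("--force", action="store_true",
                        help="re-fetch records already on disk")
    return parser.parse_args(argv)


def should_retry(status):
    """Retry only on 429 and 5xx responses."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff in seconds: base, 2*base, 4*base, ... up to cap."""
    return min(cap, base * 2 ** attempt)


def main(argv: list[str]) -> int:
    args = parse_args(argv)
    # A real source would iterate its catalog here, sleeping
    # MIN_INTERVAL_SECONDS between requests, honoring args.limit,
    # and skipping records on disk unless args.force is set.
    return 0
```

Keeping the retry policy in small pure functions makes it easy to unit-test without touching the network.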
