scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema

Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, and one
label per product within each source.

Sources shipped:
  - bayer       — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
  - epa_ppls    — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
  - active_ingredients always [{name, cas, percent}]
  - label/* nested (url, filename, accepted_date, last_modified,
    page_count, text_layer)
  - all timestamps normalized to ISO 8601 UTC
  - signal_word surfaced (operationally critical for the farmer advisor)
  - source_key and epa_reg_no kept as separate fields: source_key is the
    per-source primary key, epa_reg_no is the cross-source join key
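
A minimal sketch of a sidecar record built from the fields listed above.
The top-level and label/* key names come from this message; the value
types, the helper names (to_iso_utc, make_sidecar), the "Z" suffix, and
every example value are assumptions for illustration, not the shipped
implementation.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(ts: datetime) -> str:
    """Render a datetime as an ISO 8601 UTC string.

    Assumption: naive datetimes are treated as already being in UTC.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    utc = ts.astimezone(timezone.utc)
    return utc.isoformat(timespec="seconds").replace("+00:00", "Z")


def make_sidecar(source_key, epa_reg_no, signal_word, active_ingredients, label):
    """Assemble a sidecar dict with the unified field layout (hypothetical helper)."""
    return {
        "source_key": source_key,      # per-source primary key
        "epa_reg_no": epa_reg_no,      # cross-source join key
        "signal_word": signal_word,
        # Always a list of {name, cas, percent}; missing values become None.
        "active_ingredients": [
            {"name": a["name"], "cas": a.get("cas"), "percent": a.get("percent")}
            for a in active_ingredients
        ],
        "label": {
            "url": label["url"],
            "filename": label["filename"],
            "accepted_date": to_iso_utc(label["accepted_date"]),
            "last_modified": to_iso_utc(label["last_modified"]),
            "page_count": label["page_count"],
            "text_layer": label["text_layer"],
        },
    }


# Placeholder values only; none of these describe a real product.
eastern = timezone(timedelta(hours=-4))
example = make_sidecar(
    source_key="example-product",
    epa_reg_no="0000-000",
    signal_word="CAUTION",
    active_ingredients=[{"name": "example ingredient", "percent": 41.0}],
    label={
        "url": "https://example.invalid/label.pdf",
        "filename": "label.pdf",
        "accepted_date": datetime(2026, 1, 15),
        "last_modified": datetime(2026, 5, 23, 18, 27, 7, tzinfo=eastern),
        "page_count": 10,
        "text_layer": True,
    },
)
```

Keeping the timestamp conversion in one helper means every source goes
through the same normalization path.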

bundles.json is renamed to sources.json and --bundle to --source. The
runner walks sources.json and dispatches by id; per-source modules
remain independently runnable for development.
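
The dispatch-by-id step could look roughly like the sketch below. The
"id" and "scraper" keys come from sources.json; the function names and
the run(limit=...) entry point on each source module are assumptions,
since this message does not show the runner's actual interface.

```python
import importlib
import json
from pathlib import Path


def load_sources(path="sources.json"):
    """Read the list of source entries from the JSON registry."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def select_sources(sources, source_id=None):
    """Return every entry (the --all case) or the one matching --source."""
    if source_id is None:
        return list(sources)
    matches = [s for s in sources if s["id"] == source_id]
    if not matches:
        known = ", ".join(s["id"] for s in sources)
        raise SystemExit(f"unknown source {source_id!r}; known: {known}")
    return matches


def run_source(entry, limit=None):
    """Import the module named in the entry and call its entry point.

    Assumption: each source module exposes run(limit=...).
    """
    module = importlib.import_module(entry["scraper"])
    return module.run(limit=limit)
```

Resolving the module from the "scraper" string keeps the runner free of
per-source imports, so a new source only needs a new sources.json entry.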

PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2     # works
  python -m scrape.runner --source bayer --limit 3    # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified
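
One way the idempotent re-run noted above could work is to skip any
product whose sidecar file already exists. This is a sketch under that
assumption; the function name, the file layout, and the existence check
are hypothetical and may differ from what the scrapers actually do.

```python
import json
from pathlib import Path


def write_if_missing(out_dir, source_key, record):
    """Write <out_dir>/<source_key>.json unless it already exists.

    Returns True when a file was written, False when the write was skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{source_key}.json"
    if path.exists():
        return False  # already scraped on an earlier run
    # Write to a temporary name first so an interrupted run never leaves
    # a half-written sidecar that a later run would treat as complete.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)
    return True
```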

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:27:07 -04:00
parent 3ca96a3716
commit e9250de8e7
9 changed files with 1531 additions and 45 deletions
@@ -0,0 +1,20 @@
[
  {
    "id": "bayer",
    "title": "Bayer Crop Science US — Product Labels",
    "type": "manufacturer",
    "homepage": "https://www.cropscience.bayer.us",
    "scraper": "scrape.sources.bayer",
    "scraper_version": "0.1.0",
    "license_note": "robots.txt explicitly permits scraping for AI retrieval-augmented generation (verified 2026-05)"
  },
  {
    "id": "epa_ppls",
    "title": "EPA Pesticide Product Label System",
    "type": "regulator",
    "homepage": "https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:1",
    "scraper": "scrape.sources.epa_ppls",
    "scraper_version": "0.1.0",
    "license_note": "US federal government — public domain (no ToS restriction)"
  }
]