e9250de8e7
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels: there's no "Bayer 8.1.0", just the current accepted label per
EPA Reg No. The scraper layer is therefore reshaped around a "source"
abstraction: one source per manufacturer or regulator, and one label
per product within each source.

Sources shipped:
- bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
- epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
- active_ingredients always [{name, cas, percent}]
- label/* nested (url, filename, accepted_date, last_modified,
  page_count, text_layer)
- all timestamps normalized to ISO 8601 UTC
- signal_word surfaced (operationally critical for the farmer advisor)
- source_key + epa_reg_no separate per-source PK from the
  cross-source join key

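As a rough illustration of the sidecar shape listed above, the following Python sketch assembles one record. Only the field names come from this commit; the helpers `to_iso_utc` and `build_sidecar`, the input dict layout, the boolean type of `text_layer`, and every sample value are assumptions made for illustration. scrape/README.md remains the authoritative schema.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def to_iso_utc(dt: datetime) -> str:
    """Render a datetime as ISO 8601 in UTC (naive values are assumed to be UTC)."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


def build_sidecar(
    source_key: str,
    epa_reg_no: str,
    signal_word: str,
    ingredients: List[Dict[str, Any]],
    label: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper: map per-source fields onto the canonical layout."""
    return {
        "source_key": source_key,   # per-source primary key
        "epa_reg_no": epa_reg_no,   # cross-source join key
        "signal_word": signal_word,
        # Always a list of {name, cas, percent}; missing values become None.
        "active_ingredients": [
            {"name": i["name"], "cas": i.get("cas"), "percent": i.get("percent")}
            for i in ingredients
        ],
        "label": {
            "url": label["url"],
            "filename": label["filename"],
            "accepted_date": to_iso_utc(label["accepted_date"]),
            "last_modified": to_iso_utc(label["last_modified"]),
            "page_count": label["page_count"],
            "text_layer": label["text_layer"],
        },
    }


# Placeholder inputs only; none of these values come from a real label.
sidecar = build_sidecar(
    source_key="example-product",
    epa_reg_no="0000-000",
    signal_word="CAUTION",
    ingredients=[{"name": "example ingredient", "percent": 50.0}],
    label={
        "url": "https://example.com/label.pdf",
        "filename": "label.pdf",
        "accepted_date": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone(timedelta(hours=-5))),
        "last_modified": datetime(2024, 6, 1, 12, 0, 0),
        "page_count": 10,
        "text_layer": True,
    },
)
```

Keeping `source_key` and `epa_reg_no` as separate top-level fields means a record stays addressable within its own source even when several sources carry a label for the same registration number.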
Renames: bundles.json → sources.json and --bundle → --source. The
runner walks sources.json and dispatches to each source's scraper by
id; per-source modules remain independently runnable for development.

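The walk-and-dispatch behaviour described above could look roughly like the following Python sketch. The shape of sources.json (a top-level "sources" list of objects with an "id"), the `SCRAPERS` registry, the stand-in scraper functions, and `run_sources` are all assumptions for illustration; the actual scrape.runner may be organised differently.

```python
import json
from typing import Callable, Dict, List, Optional


def scrape_bayer(limit: Optional[int]) -> List[str]:
    """Hypothetical stand-in for the bayer module's entry point."""
    return [f"bayer-{n}" for n in range(limit if limit is not None else 1)]


def scrape_epa_ppls(limit: Optional[int]) -> List[str]:
    """Hypothetical stand-in for the epa_ppls module's entry point."""
    return [f"epa_ppls-{n}" for n in range(limit if limit is not None else 1)]


# Registry keyed by source id, so the runner never imports a source by name.
SCRAPERS: Dict[str, Callable[[Optional[int]], List[str]]] = {
    "bayer": scrape_bayer,
    "epa_ppls": scrape_epa_ppls,
}

# Assumed file shape; the real sources.json may carry more fields per entry.
SOURCES_JSON = '{"sources": [{"id": "bayer"}, {"id": "epa_ppls"}]}'


def run_sources(
    sources_json: str,
    source_id: Optional[str] = None,
    limit: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Walk the source list and dispatch each entry to its scraper by id."""
    results: Dict[str, List[str]] = {}
    for entry in json.loads(sources_json)["sources"]:
        sid = entry["id"]
        if source_id is not None and sid != source_id:
            continue  # a single source was requested; skip the others
        if sid not in SCRAPERS:
            raise ValueError(f"no scraper registered for source {sid!r}")
        results[sid] = SCRAPERS[sid](limit)
    return results


all_results = run_sources(SOURCES_JSON, limit=2)
bayer_only = run_sources(SOURCES_JSON, source_id="bayer", limit=3)
```

A registry keyed by id keeps the runner thin, which fits the note that each per-source module stays runnable on its own.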
PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
    python -m scrape.runner --all --limit 2  # works
    python -m scrape.runner --source bayer --limit 3  # 3 written, idempotent re-run skips
    python -m scrape.runner --source epa_ppls --reg-no 524-475  # Roundup Ultra, 167 pages, ISO last_modified

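The "idempotent re-run skips" behaviour noted in the smoke test can be approximated with a write-if-missing check, sketched below in Python. The commit does not say how the scraper decides to skip, so the file-existence test, the `<key>.json` naming, and the helper `write_if_missing` are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_if_missing(out_dir: Path, key: str, record: dict) -> bool:
    """Write <key>.json under out_dir unless it already exists.

    Returns True when a file was written and False when it was skipped.
    """
    path = out_dir / f"{key}.json"
    if path.exists():
        return False
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    keys = ("a", "b", "c")  # placeholder source keys
    first_run = [write_if_missing(out, k, {"source_key": k}) for k in keys]
    second_run = [write_if_missing(out, k, {"source_key": k}) for k in keys]
```

Under this assumption the first pass writes all three records and the second pass skips them, matching the "3 written" then "skips" pattern reported above.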
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
33 lines
412 B
Plaintext
# Virtualenv
venv/
.venv/

# Regenerable from corpus + CI
corpus/
chroma/
bm25/

# Python detritus
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.mypy_cache/
.ruff_cache/

# Eval results (regenerable; commit only the headline baseline if you want)
# eval/results/

# Usage logs (host-mounted volume in prod; don't commit dev logs)
var/

# Local-only env
.env
.env.local

# IDE
.vscode/
.idea/
*.swp
.claude/