crop-chem-docs

Author	SHA1	Message	Date
justin	1a45280e45	rename: ppls-docs → crop-chem-docs Repo/project rename to better reflect scope. PPLS is EPA's term for their Pesticide Product Label System — accurate when the corpus was EPA-only, narrow now that it also pulls from Bayer's own catalog (and may expand to Syngenta/Corteva/BASF/FMC labels in the future). crop-chem-docs scopes flexibly without acronyms to explain. Renames: - directory: ppls-docs → crop-chem-docs - PRODUCT_NAME: ppls → crop_chem - Chroma collection: ppls_docs → crop_chem_docs (in-place via .modify(), no re-embed) - BM25 db: bm25/ppls_docs.db → bm25/crop_chem_docs.db - MCP tool name: ppls_api_lessons → crop_chem_api_lessons - FastMCP server name: ppls-docs → crop-chem-docs - Env vars: PPLS_CORPUS_ROOT → CORPUS_ROOT PPLS_CHROMA_DIR → CHROMA_DIR_OVERRIDE - User-Agent: ppls-docs-scraper → crop-chem-docs-scraper Preserved (intentional, correct): - epa_ppls (source id) — refers specifically to EPA's PPLS database - "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings) - PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in scrape/sources/epa_ppls.py — these point at EPA's actual endpoints Memory entries get updated in a follow-up commit so the rename is isolated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:25:59 -04:00
justin	420e00b44b	bayer: dedup by EPA reg no across catalog product-type queries Bayer's seed-treatment catalog query re-serves products from herbicide/fungicide/insecticide queries that have seed-treatment use sites listed. safe_slug() correctly strips the class suffix when the catalog product type matches, but doesn't strip when querying as seed-treatment, so the same product gets written twice — once as "<base>" (canonical class) and once as "<base>-<class>" (class=seed-treatment). First full scrape produced 159 files for 87 unique EPA reg nos — ~45% redundant. Fix: - process_product accepts an optional seen_regs set and returns "dup-skip" when the product's EPA reg no is already in it. - run() seeds seen_regs from existing sidecars on disk via _load_seen_regs() so dedup survives re-runs (force overrides). - run() updates seen_regs after each successful write, so within-run dedup works for the seed-treatment query (which iterates last). Important nuance preserved: when two genuinely-different brand-name products share the same EPA reg (e.g., Absolute Maxx + Adament Flow both = 264-849), they are NOT treated as dups — they're different catalog entries with different slugs and same canonical class. Only the seed-treatment-clone pattern (slug = <canonical>-<class> AND class=seed-treatment AND sibling at same reg with matching class) is the bug we're fixing. One-off cleanup of the existing USB corpus removed 68 dup pairs; 159 → 91 files (73 canonical-class + 18 true seed-treatments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 21:27:45 -04:00
justin	717426f873	scrape: route corpus via PPLS_CORPUS_ROOT env var Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on external storage (USB drive, NAS mount, secondary partition) without editing the repo. Default behavior unchanged: corpus/ at repo root when the env var is unset. Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/, ${PPLS_CORPUS_ROOT}/epa_ppls/, etc. Live-verified against /run/media/justin/USB (vfat, 59GB free): PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \ python -m scrape.runner --source epa_ppls --reg-no 524-475 -> wrote to USB, root disk untouched Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 20:41:56 -04:00
justin	e9250de8e7	scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0" — there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source. Sources shipped: - bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs) - epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint Canonical sidecar schema (see scrape/README.md) unifies fields across sources: - active_ingredients always [{name, cas, percent}] - label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer) - all timestamps normalized to ISO 8601 UTC - signal_word surfaced (operationally critical for the farmer advisor) - source_key + epa_reg_no separate per-source PK from the cross-source join key bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development. PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written. Smoke test: python -m scrape.runner --all --limit 2 # works python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 18:27:07 -04:00

4 Commits