crop-chem-docs

Author	SHA1	Message	Date
justin	ea3aea5871	epa_ppls: narrow row-crop filter to corn/soy/wheat only App focus is corn, soybeans, and wheat. Dropping the broader US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa). Empirical impact (random N=100 sample): broad list matched 17/100 products, narrow list matches 16/100, only a 6% reduction (1 of 17 matches), because corn/soy/wheat dominate ag-chem registrations so thoroughly that products registered for cotton/sorghum/etc. are almost always co-registered for one of corn/soy/wheat. One sampled product was dropped: a peanut-only herbicide (2749-614). Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN sites); 2749-614 AG36448 (PEANUTS only) correctly filtered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:39:55 -04:00
justin	60657aa6df	epa_ppls: filter PPLS enumeration to row-crop products The farmer-advisor consumer only cares about US row crops, so the EPA scraper now drops products without at least one row-crop site in the PPLS API response. Filter is on by default; --no-row-crop-filter overrides for one-off broader pulls. Filter shape: - Word-boundary regex match against each entry in the API's `sites` array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS, SHIPHOLDS" → drop even though it contains "OATS" as a substring). - Allowlist covers the major US row + small-grain + oilseed + sugar/fiber crops, plus alfalfa as a common rotation crop. See ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list. Cost model: - 102K PPIS rows still need one API call each (no bulk filter available upstream), so enumeration still takes ~28h at 1 req/sec. - But PDF downloads drop from ~67K to ~5-10K (estimated row-crop hit rate), saving ~17h wall time and ~60GB disk on a full backfill. Smoke test (4 mixed reg nos): 524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites) 524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites) 100-1486 Advion Cockroach → filtered (building/transport sites only) 432-1276 (Bayer pet flea) → filtered (no row crops) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:05:26 -04:00
justin	e9250de8e7	scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0" — there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source. Sources shipped: - bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs) - epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint Canonical sidecar schema (see scrape/README.md) unifies fields across sources: - active_ingredients always [{name, cas, percent}] - label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer) - all timestamps normalized to ISO 8601 UTC - signal_word surfaced (operationally critical for the farmer advisor) - source_key + epa_reg_no separate per-source PK from the cross-source join key bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development. PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written. Smoke test: python -m scrape.runner --all --limit 2 # works python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 18:27:07 -04:00
justin	9ba615c8ee	initial: docs-mcp-template — build guide + scaffolded server Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out. Contents: - PLAN.md — comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist. - CLAUDE.md — guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands. - README.md — quick-start summary. Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required): docs_mcp/server.py FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set. docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is. rag/embeddings.py Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable. rag/chunk.py Paragraph-aware chunker with synthetic chunk 0. Per-product tunable. rag/index.py Chroma + BM25 builder. --rebuild and --bm25-only flags. rag/bm25.py SQLite FTS5 lexical index. Reusable. scrape/changelog.py --cached / --ref / --json / --history-out. Reusable. scrape/README.md What you write per-product. eval/queries.jsonl.example Curate ~25 hand-labeled queries here. eval/retrievers.py Retriever protocol + stub classes. eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton. scripts/usage_report.py Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring. scripts/registry_gc.py Gitea container registry cleanup. Reusable. 
Deployment + CI: Dockerfile Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency. deploy/docker-compose.yml MCP + reranker sidecar + Watchtower. Templated with <placeholders>. .gitea/workflows/refresh.yml Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme. .gitea/workflows/image-only.yml Code-only ship cycle, ~18min. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00
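
The "Filter shape" notes in 60657aa6df (narrowed in ea3aea5871) describe a word-boundary regex match against each entry in the API's `sites` array. A minimal sketch of that idea follows. The function names and both keyword tuples are hypothetical stand-ins; the real allowlist is ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py and may be shaped differently.

```python
import re

# Hypothetical allowlists for illustration only; the actual ROW_CROP_KEYWORDS
# in scrape/sources/epa_ppls.py may use different entries or spellings.
NARROW_KEYWORDS = ("corn", "soybean", "soybeans", "wheat")
BROAD_KEYWORDS = NARROW_KEYWORDS + ("oats", "peanut", "peanuts", "cotton", "sorghum")


def compile_site_filter(keywords):
    """Build one case-insensitive regex that matches any keyword as a whole word."""
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def has_row_crop_site(sites, pattern):
    """Return True if at least one site string contains an allowlisted crop word."""
    return any(pattern.search(site) for site in sites)


narrow = compile_site_filter(NARROW_KEYWORDS)
broad = compile_site_filter(BROAD_KEYWORDS)

# A soybean site is kept under either list.
kept = has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"], narrow)

# "BOATS" contains "OATS", but there is no word boundary between "B" and "O",
# so the broad list does not match it.
boats = has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"], broad)

# A peanut-only product matches the broad list but not the narrow one.
peanut_broad = has_row_crop_site(["PEANUTS"], broad)
peanut_narrow = has_row_crop_site(["PEANUTS"], narrow)
```

Compiling a single alternation once, rather than looping over keywords per site, keeps the per-product check cheap across roughly 102K enumerated rows. Plural forms are listed explicitly here because `\b` will not match "soybean" inside "SOYBEANS".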

4 Commits
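
The cost model quoted in 60657aa6df can be checked with a few lines of arithmetic. This sketch assumes PDF downloads are paced at the same 1 req/sec as enumeration, which the commit message does not state outright; the helper name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def hours_at_rate(requests, req_per_sec=1.0):
    """Wall-clock hours needed to issue `requests` sequential calls at a fixed rate."""
    return requests / req_per_sec / SECONDS_PER_HOUR


# Enumeration: one API call per PPIS row, no bulk filter upstream.
enumeration_hours = hours_at_rate(102_000)

# PDF downloads before the filter (~67K) versus after (~5-10K estimated).
# Assumption: downloads are rate-limited like the API calls.
saved_hours_low = hours_at_rate(67_000 - 10_000)
saved_hours_high = hours_at_rate(67_000 - 5_000)

print(f"enumeration: ~{enumeration_hours:.1f}h")
print(f"PDF time saved: ~{saved_hours_low:.1f}h to ~{saved_hours_high:.1f}h")
```

Under that assumption the enumeration figure comes out near 28h and the download savings fall between roughly 16h and 17h, consistent with the "~28h" and "~17h" estimates in the commit message.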