crop-chem-docs

Author	SHA1	Message	Date
justin	2acba0aa86	server: catch one more "PPLS" → "crop-chem-docs" rename miss in corpus_status header Functional smoke test from trashpanda confirmed end-to-end working: $ docker run -d ... git.jpaul.io/justin/crop-chem-docs:corpus-2026.05.24 $ docker exec ... python -c 'from docs_mcp.server import corpus_status; print(corpus_status())' Output: 4,159 labels on disk (4,068 epa_ppls + 91 bayer), 216,467 chunks in Chroma collection `crop_chem_docs`, BM25 db 416 MB, HYBRID_SEARCH=on, RERANK_URL=http://10.10.1.65:8082. Image is production-ready for Drawbar compose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 13:02:45 -04:00
justin	8766d73327	deploy: Drawbar compose snippet — first image is published Image pushed to git.jpaul.io/justin/crop-chem-docs with three tags: :latest — Watchtower auto-pull target :a97107de4636 — commit-sha rollback pin :corpus-2026.05.24 — corpus-snapshot pin (prod-recommended) Drawbar compose snippet at deploy/drawbar-compose-snippet.md. Wires the container against the existing infra: - Ollama pool: 192.168.0.2:11434, 192.168.0.2:11435, 192.168.0.125:11434, 10.10.1.65:11434 - Reranker: http://10.10.1.65:8082 - HYBRID_SEARCH=true (production retrieval — BM25 + dense + rerank) - Exposes streamable-HTTP MCP on port 8000 Pull path uses git.jpaul.io (public hostname, CF-fronted; pull response bodies aren't capped). Push path uses 192.168.0.2:1234 (LAN endpoint, bypasses CF 100MB body cap). Same registry, different URLs — per the template gotcha doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:48:24 -04:00
justin	420b4fa2d8	workflows: use LAN registry endpoint for push (CF 100MB cap) Cloudflare in front of git.jpaul.io caps HTTP request bodies at 100 MB, which kills container blob pushes for our 6 GB image (Chroma layer alone is ~2 GB). Per the template gotcha doc: Push via LAN endpoint (192.168.0.2:1234, plain HTTP, in the Gitea host's insecure-registries list). Pull via public hostname (git.jpaul.io) — pull response bodies aren't capped. REGISTRY_PUSH: 192.168.0.2:1234 REGISTRY_PULL: git.jpaul.io (unchanged; used for the package-link API) This matches how hvm-docs / morpheus-docs / opsramp-docs / zerto-docs CI workflows push successfully on the same Gitea host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:47:41 -04:00
justin	a97107de46	docker: production image + Gitea Actions for monthly refresh Dockerfile: self-contained image with corpus + Chroma + BM25 baked in. Drawbar's compose pulls + runs without volume mounts. Built from sources.json (labels schema), PRODUCT_NAME=crop_chem by default, HYBRID_SEARCH=true (always-on for production quality). RERANK_URL + OLLAMA_URL get set at compose time. .gitea/workflows/refresh.yml: monthly cron (1st @ 06:00 UTC) does full scrape → reindex → image push. Scrapes Bayer (~30 min) + EPA PPLS row-crop filtered (~7h). Skips reindex+push if no corpus diff. Tags pushed: :latest, :<sha12>, :corpus-<YYYY.MM.DD>. .gitea/workflows/image-only.yml: on-demand or auto on code-only pushes to main (paths: docs_mcp/, rag/, scrape/, requirements.txt, Dockerfile, sources.json). Reindexes from committed corpus, builds image, pushes. ~10 min vs ~9h full refresh. .gitignore: corpus/ now COMMITTED (4,159 labels, 265 MB of .md + sidecars). Lets image-only.yml rebuild indexes without re-scraping. chroma/ + bm25/ still gitignored (regenerable binary indexes). .dockerignore: drops venv, eval results, PLAN/README/CLAUDE.md, deploy/, .git/ to keep the image lean. corpus + chroma + bm25 explicitly NOT in dockerignore (those go INTO the image). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:32:41 -04:00
justin	1a45280e45	rename: ppls-docs → crop-chem-docs Repo/project rename to better reflect scope. PPLS is EPA's term for their Pesticide Product Label System — accurate when the corpus was EPA-only, narrow now that it also pulls from Bayer's own catalog (and may expand to Syngenta/Corteva/BASF/FMC labels in the future). crop-chem-docs scopes flexibly without acronyms to explain. Renames: - directory: ppls-docs → crop-chem-docs - PRODUCT_NAME: ppls → crop_chem - Chroma collection: ppls_docs → crop_chem_docs (in-place via .modify(), no re-embed) - BM25 db: bm25/ppls_docs.db → bm25/crop_chem_docs.db - MCP tool name: ppls_api_lessons → crop_chem_api_lessons - FastMCP server name: ppls-docs → crop-chem-docs - Env vars: PPLS_CORPUS_ROOT → CORPUS_ROOT PPLS_CHROMA_DIR → CHROMA_DIR_OVERRIDE - User-Agent: ppls-docs-scraper → crop-chem-docs-scraper Preserved (intentional, correct): - epa_ppls (source id) — refers specifically to EPA's PPLS database - "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings) - PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in scrape/sources/epa_ppls.py — these point at EPA's actual endpoints Memory entries get updated in a follow-up commit so the rename is isolated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:25:59 -04:00
justin	3c3178a6ad	eval: GPU rerank baseline + CLI fix GPU eval (hybrid+rerank, RERANK_URL=http://10.10.1.65:8082): MRR=0.672 Recall@5=0.638 nDCG@5=0.621 (35 queries, 1 transient 500, otherwise clean) Quality identical to the CPU rerank run as expected — only latency changed (single rerank call dropped from ~23s to ~0.7-1.5s on the Tesla P4). Per-query report at eval/results/with_rerank_gpu.md. CLI parser fix: `--retrievers dense+rerank,hybrid+rerank` now correctly wires the dense+rerank variant. Previously only literal "rerank" (without prefix) matched the dense+rerank branch, so combined-retriever runs silently dropped dense+rerank. (Note: the eval's RerankedRetriever does 50 individual Chroma `get` calls per query to fetch chunk text by (source, source_key); this adds ~15s per query of pure SQLite lookup overhead. Not a production concern — docs_mcp/server.py's _rerank_pool reranks docs already in the dense pool, no extra Chroma round-trips. Worth tightening the eval-side impl on a later pass.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:12:51 -04:00
justin	af44d7a102	Phase 11 + Phase 6 GPU move ## Phase 11 — Curated agronomy / label-handling knowledge layer docs_mcp/lessons.md: 13 topic-anchored markdown sections covering the LLM-side context a farmer-advisor needs alongside the raw label corpus — - how-to-use-this-corpus - epa-signal-words - rei-phi-fundamentals - rup-handling - supplemental-labels-24c-2ee - tank-mix-fundamentals - resistance-management-hrac-frac-irac - glufosinate-application-rules - dicamba-application-rules - lake-erie-watershed-ohio - scn-and-other-seed-treatment-context - drift-management-essentials - how-to-format-recommendations Each Topic block is independently retrievable via the new MCP tool: ppls_api_lessons(topic="rup-handling") Or with no topic to get the full TOC, or with a substring to match-and-return matching sections ("dicamba" → dicamba-application-rules). Tool docstring instructs the LLM to call this proactively before any pesticide recommendation so the recommendation lands with regulatory framing, resistance-group callouts, RUP applicator language, and the canonical recommendation format — not just a rate from a label. ## Phase 6 — Reranker moved to GPU on trashpanda Stopped the local CPU container and started on trashpanda's Tesla P4 (8 GB VRAM) via: docker run -d --name llama-rerank --restart unless-stopped --gpus all \ -p 8082:8080 \ ghcr.io/ggml-org/llama.cpp:server-cuda \ -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \ --reranking --host 0.0.0.0 --port 8080 -ngl 99 The :server-cuda image variant (not :server) is required for CUDA backend; -ngl 99 offloads all layers to GPU. Latency: 50-doc rerank dropped from ~23 s on CPU to ~0.7-1.5 s on the Tesla P4 — production-grade interactive speeds. deploy/rerank-docker.md updated with the trashpanda deploy recipe, troubleshooting (mostly "did you use server-cuda?"), and a perf reference table. The MCP server's RERANK_URL just points at http://10.10.1.65:8082 now. 
GPU eval still completing in background; results land in eval/results/with_rerank_gpu.md as a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:10:09 -04:00
justin	278fe5f456	Phase 6: reranker sidecar (jina-reranker-v2-base via llama.cpp) Wires the docs_mcp/server.py reranker hook into a real backend: ghcr.io/ggml-org/llama.cpp:server \\ -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \\ --reranking --host 0.0.0.0 --port 8080 Setup recipe at deploy/rerank-docker.md. The MCP server already honors RERANK_URL (added in Phase 7+8 commit); setting it to http://<host>:8082 turns on rerank automatically. ## Eval results (35 queries, k=5, pool=50) \| Retriever \| MRR \| Recall@5 \| nDCG@5 \| \|----------------\|-------\|----------\|--------\| \| dense \| 0.027 \| 0.086 \| 0.041 \| \| bm25 \| 0.544 \| 0.586 \| 0.524 \| \| hybrid-rrf \| 0.114 \| 0.114 \| 0.108 \| \| dense+rerank \| 0.171 \| 0.143 \| 0.149 \| \| hybrid+rerank \| 0.672 \| 0.638 \| 0.621 \| ← winner The reranker fixes hybrid's failure mode (dense noise polluting the fused pool) by scoring each (query, chunk) pair independently. Net: hybrid+rerank gives +24% MRR over BM25-only. Smoke test for the reranker itself (query: "soybean herbicide for waterhemp", 4 candidates): index=1 SENCOR metribuzin waterhemp soybean → score=0.84 ← right index=3 Headline wheat fungicide → score=-2.80 index=2 Lorsban corn rootworm → score=-2.91 index=0 Roundup fallow burndown → score=-3.44 Strong separation between the right doc and the rest. ## Production gotchas - CPU-only reranker is slow (~23s for a 50-doc pool). For interactive use put it on GPU (`--gpus all`); ~10-20× faster. - jina-reranker rejects the ENTIRE batch if any pair exceeds n_ctx_train=1024 — server truncates each doc to 2000 chars before sending. Already handled in _rerank_pool. Per-query rerank report at eval/results/with_rerank.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:50:03 -04:00
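The MRR, Recall@5 and nDCG@5 columns reported in the commit above follow standard ranking-metric definitions. A minimal sketch under the assumption of binary relevance and (source, source_key) tuples as the match key; the function names and the sample ranking are hypothetical, and eval/run_eval.py may compute these differently:

```python
import math

def reciprocal_rank(ranked, expected):
    """1/rank of the first expected item in the ranking, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in expected:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, expected, k=5):
    """Fraction of expected items that show up in the top k."""
    if not expected:
        return 0.0
    return len(set(ranked[:k]) & set(expected)) / len(set(expected))

def ndcg_at_k(ranked, expected, k=5):
    """Binary-relevance nDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in expected
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(expected), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical single query: the expected label is ranked second.
ranked = [("bayer", "example-slug"), ("epa_ppls", "524-591"), ("epa_ppls", "524-475")]
expected = {("epa_ppls", "524-591")}

rr = reciprocal_rank(ranked, expected)
recall = recall_at_k(ranked, expected)
ndcg = ndcg_at_k(ranked, expected)
```

MRR is then the mean of `reciprocal_rank` over all queries in the set; the other two columns are averaged the same way.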
justin	335c33465b	Phase 7+8: eval harness + hybrid retrieval ## Phase 7 — Eval harness eval/retrievers.py + rag/retrieval.py: Retriever protocol with DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60), RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a thin shim re-exporting from rag.retrieval so the MCP server can use the same code at request time without making eval/ a runtime dep. eval/run_eval.py: drives N retrievers against eval/queries.jsonl, computes MRR / Recall@K / nDCG@K, emits a markdown report with a summary table + per-query breakdown for the first retriever. Each query carries expected (source, source_key) tuples — matches the labels-domain page-level keying. eval/queries.jsonl: 35 curated queries — 25 brand-name (Warrant, Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro, Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic ("what controls horseweed before soybean", "fungicide for fusarium head blight", "rainfast interval for glyphosate", ...). ## Phase 8 — Hybrid retrieval (BM25 + dense + RRF) docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env. When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py existing impl), fuses on chunk_id with reciprocal-rank-fusion (RRF k=60), and returns the combined pool. Dense-only path unchanged when HYBRID_SEARCH is unset. The rendering layer (_format_hit) is untouched. The RERANK_URL hook is also wired (_rerank_pool sends docs to llama.cpp /v1/rerank, truncated to 2000 chars per the jina-reranker n_ctx_train=1024 batch-rejection gotcha). Fails open to base order on any exception. 
## Baseline numbers (k=5, pool=50, 35 queries) \| Retriever \| MRR \| Recall@5 \| nDCG@5 \| \|------------\|-------\|----------\|--------\| \| dense \| 0.027 \| 0.086 \| 0.041 \| \| bm25 \| 0.544 \| 0.586 \| 0.524 \| \| hybrid-rrf \| 0.114 \| 0.114 \| 0.108 \| Headline: BM25 dominates because farmers search for products by brand name, and brand names are exact-match tokens that lexical search nails. Dense is poor — semantic embeddings spread across similar products and don't preferentially weight brand-name tokens. Textbook RRF hurts when one retriever is much weaker than the other: dense's irrelevant top-50 pollute the fused pool with ties at 1/(60+rank). Phase 6 reranker is the planned fix — the reranker scores each (query, chunk) pair independently and can recover the right answer regardless of base order. Per-query report at eval/results/baseline.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:19:05 -04:00
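The reciprocal-rank fusion step described in the Phase 8 notes can be sketched roughly as follows. This is an illustration only, assuming each retriever returns an ordered list of chunk ids; `rrf_fuse` and the sample ids are hypothetical and not taken from docs_mcp/server.py:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with reciprocal-rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical top-3 pools from the two retrievers.
dense = ["c7", "c2", "c9"]
bm25 = ["c2", "c4", "c7"]
fused = rrf_fuse([dense, bm25])
```

With k=60 the per-rank contributions differ only slightly (1/61 versus 1/62), which is consistent with the observation above that a weak retriever's irrelevant hits can crowd the fused pool.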
justin	97a2a05b24	Phase 3: MCP server tools for the labels corpus Adapt docs_mcp/server.py from versioned-software-docs domain to pesticide-labels domain. Standard MCP tool names preserved (search_docs / get_page / list_versions) so existing MCP clients (Claude Desktop, Cursor) still pick them up; docstrings + argument shape are labels-domain. Tools shipped: - search_docs(query, source, product_class, registrant_contains, signal_word, epa_reg_no, k) — dense Chroma query with optional filters, post-filtered for registrant substring. Returns top-k chunks rendered as markdown with product / reg / registrant / actives / signal / section / label-PDF URL. - get_page(source, source_key) — full label markdown + metadata header. source_key is slug for MFR sources, EPA Reg No for EPA PPLS. - list_versions() — discovers facet values: sources, product classes, signal words, registrants (samples up to 50K chunks from Chroma to enumerate distinct metadata values). - corpus_status() — fast no-embedder counts: labels on disk per source, chunks in Chroma, BM25 db size, active feature flags. Wiring: - Reads PPLS_CORPUS_ROOT + PPLS_CHROMA_DIR (matches the scrapers and indexer). - Uses sources.json (not the template's bundles.json). - Lazy Chroma init so the server starts cleanly even when Ollama is briefly down (e.g. during HVM corpus rebuilds). - Phase 6 reranker + Phase 8 hybrid hooks left as feature flags (RERANK_URL, HYBRID_SEARCH) — fail open to dense-only when unset. Smoke test against the live 216K-chunk corpus: - corpus_status: 4,157 labels / 216,467 chunks / 416 MB BM25 ✓ - search_docs("waterhemp control on soybeans", k=2) returns Tackle Herbicide (FMC, 279-3564, glyph+imazethapyr) and R14640 Herbicide (Bayer, 524-724, glyph) with section context (ROUNDUP READY SOYBEANS / SOYBEAN) and dist-derived scores of 0.76 each — highly relevant. 
Run as stdio for Claude Desktop: PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \ OLLAMA_URL=http://gpu1:11434,http://gpu2:11434 \ PRODUCT_NAME=ppls \ python -m docs_mcp.server --transport stdio Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:02:01 -04:00
justin	38141c362e	Phase 2: chunking + parallel Ollama embeddings + Chroma + BM25 indexes End-to-end RAG pipeline for the pesticide-labels corpus. From the 4,066 labels on USB, the indexer produces 216,467 chunks, embeds them via N parallel Ollama endpoints, upserts to Chroma, and builds a BM25 lexical index. ## Files - rag/index.py: adapted to labels schema (source / source_key / epa_reg_no / product_name / product_class / registrant / signal_word / active_ingredients flattened for Chroma where-filter); honors PPLS_CORPUS_ROOT (corpus on USB) and PPLS_CHROMA_DIR; upsert batch size auto-tuned to 64 * N URLs; --limit + --source flags for incremental work. - rag/chunk.py: label-aware. ALL-CAPS section heading detector (heuristic) for EPA labels alongside markdown `#` headings. TARGET_CHARS=2000 (~500 tokens), MAX_CHUNK_CHARS=4000 (~1000 tokens) hard cap with _force_split sentence/char fallback to defend against monolithic crop+rate tables. Chunk 0 is a synthetic anchor with product name, EPA Reg No, registrant, signal word, product class, active ingredients + keyword bag for joint dense/BM25 retrieval. - rag/embeddings.py: parallel-dispatch across N Ollama URLs via ThreadPoolExecutor. Each __call__ stride-slices input into N shards, fires N concurrent HTTP requests, joins in original order. Bisect-resilient on 400 (context-length): recursively splits the failing shard down to single doc, logs+drops single bad doc with zero-vector placeholder so Chroma upsert never sees a gap. Real HTTP/connection errors still propagate. - requirements.txt: chromadb already pinned via template. ## Run PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \ OLLAMA_URL=http://host1:11434,http://host2:11434,... \ PRODUCT_NAME=ppls \ python -m rag.index --rebuild ## Build stats - 216,467 chunks across 4,066 labels (~53 chunks/label avg) - Wall time: 75.7 min on 4 parallel GPU-backed Ollama endpoints (Bayer-Crop / BASF / Corteva / FMC / Nufarm / Syngenta / etc. 
chemistry; production Ollama on trashpanda + 2× 192.168.0.2 + 1× Windows 192.168.0.125) - 473 bisect-drops (0.22%) — all from monolithic-table sections in 1970s-90s scanned PDFs whose pypdf extracts tokenized past the model's context. Acceptable; the dropped chunks were garbled OCR with no useful content. - Chroma: 2.2 GB persistent SQLite at ./chroma/ - BM25: 416 MB SQLite FTS5 at ./bm25/ppls_docs.db ## Smoke-test queries (top-3 dense-only) "what can I spray on soybeans to control waterhemp" → Rage (glyphosate+carfentrazone), Sencor (metribuzin) "REI for dicamba on corn" → Nufarm Credit (DICAMBA tank-mix restrictions section) "fungicide for wheat head scab" → MCW 710 SC (azoxystrobin+tebuconazole), Sercadis (fluxapyroxad) Distances 0.16-0.23. Dense-only quality is OK-not-great in spots (exactly the failure mode Phase 6 reranker + Phase 8 hybrid BM25 fusion address). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 09:56:49 -04:00
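The stride-slice dispatch described for rag/embeddings.py (split the input into N shards, embed concurrently, rejoin in the original order) can be sketched as below. This is a simplified illustration, not the repo's implementation: `embed_sharded` and `fake_embed` are hypothetical names, each endpoint is modeled as a plain callable instead of an HTTP request, and the bisect-on-400 handling is left out:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_sharded(texts, embed_fns):
    """Embed texts across N callables in parallel, preserving input order."""
    n = len(embed_fns)
    # Stride-slice: shard i receives items i, i+n, i+2n, ...
    shards = [texts[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda pair: pair[0](pair[1]), zip(embed_fns, shards)))
    # Write each shard's vectors back into the same strided positions.
    out = [None] * len(texts)
    for i, vectors in enumerate(results):
        out[i::n] = vectors
    return out

def fake_embed(batch):
    """Stand-in for one endpoint: a one-dimensional 'embedding' of the text length."""
    return [[float(len(text))] for text in batch]

vectors = embed_sharded(["a", "bb", "ccc", "dddd", "eeeee"], [fake_embed, fake_embed])
```

Because the rejoin uses the same stride as the split, the caller sees vectors in input order regardless of which shard finishes first.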
justin	92a95d5e78	epa_ppls: add registrant allowlist pre-API filter Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by dropping products from non-row-crop-ag registrants BEFORE the per- product API call. This is the biggest cost lever we have on the EPA scraper — full backfill drops from ~28 h to ~3.5 h. scrape/sources/epa_registrant_allowlist.json holds the 34 verified ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm, ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.). Each entry was verified by querying the EPA PPLS API for the first active product registered under that company number. Edit the JSON freely — scraper loads it at run time. Bypass with --no-registrant-filter when you suspect a row-crop product registered to a specialty company not on the list. Why a curated allowlist rather than blacklist consumer brands: the 102K PPIS rows are 89% non-ag-relevant; an allowlist is shorter to maintain and harder to false-positive. Excluded with intent (not omissions): Bayer Environmental Science (turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon (animal flea/tick), Control Solutions (structural pest), Cleary (turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water). Smoke test --limit 100: - 1239 PPIS rows considered (in first slice of file) - 1139 skipped by registrant filter (no API call paid) - 100 hit API, 81 filtered by row-crop sites, 19 written - = 91% API-call reduction over the prior version Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 23:55:38 -04:00
justin	420e00b44b	bayer: dedup by EPA reg no across catalog product-type queries Bayer's seed-treatment catalog query re-serves products from herbicide/fungicide/insecticide queries that have seed-treatment use sites listed. safe_slug() correctly strips the class suffix when the catalog product type matches, but doesn't strip when querying as seed-treatment, so the same product gets written twice — once as "<base>" (canonical class) and once as "<base>-<class>" (class=seed-treatment). First full scrape produced 159 files for 87 unique EPA reg nos — ~45% redundant. Fix: - process_product accepts an optional seen_regs set and returns "dup-skip" when the product's EPA reg no is already in it. - run() seeds seen_regs from existing sidecars on disk via _load_seen_regs() so dedup survives re-runs (force overrides). - run() updates seen_regs after each successful write, so within-run dedup works for the seed-treatment query (which iterates last). Important nuance preserved: when two genuinely-different brand-name products share the same EPA reg (e.g., Absolute Maxx + Adament Flow both = 264-849), they are NOT treated as dups — they're different catalog entries with different slugs and same canonical class. Only the seed-treatment-clone pattern (slug = <canonical>-<class> AND class=seed-treatment AND sibling at same reg with matching class) is the bug we're fixing. One-off cleanup of the existing USB corpus removed 68 dup pairs; 159 → 91 files (73 canonical-class + 18 true seed-treatments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 21:27:45 -04:00
justin	717426f873	scrape: route corpus via PPLS_CORPUS_ROOT env var Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on external storage (USB drive, NAS mount, secondary partition) without editing the repo. Default behavior unchanged: corpus/ at repo root when the env var is unset. Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/, ${PPLS_CORPUS_ROOT}/epa_ppls/, etc. Live-verified against /run/media/justin/USB (vfat, 59GB free): PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \ python -m scrape.runner --source epa_ppls --reg-no 524-475 -> wrote to USB, root disk untouched Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 20:41:56 -04:00
justin	ea3aea5871	epa_ppls: narrow row-crop filter to corn/soy/wheat only App focus is corn, soybeans, and wheat. Dropping the broader US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/ sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa). Empirical impact (random N=100 sample): broad list matched 17/100 products, narrow list matches 16/100 — only 6% reduction, because corn/soy/wheat dominate ag-chem registrations so thoroughly that products registered for cotton/sorghum/etc. are almost always co-registered for one of corn/soy/wheat. One sampled product was dropped: a peanut-only herbicide (2749-614). Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN sites); 2749-614 AG36448 (PEANUTS only) correctly filtered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:39:55 -04:00
justin	60657aa6df	epa_ppls: filter PPLS enumeration to row-crop products The farmer-advisor consumer only cares about US row crops, so the EPA scraper now drops products without at least one row-crop site in the PPLS API response. Filter is on by default; --no-row-crop-filter overrides for one-off broader pulls. Filter shape: - Word-boundary regex match against each entry in the API's `sites` array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS, SHIPHOLDS" → drop even though it contains "OATS" as substring). - Allowlist covers the major US row + small-grain + oilseed + sugar/ fiber crops, plus alfalfa as a common rotation crop. See ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list. Cost model: - 102K PPIS rows still need one API call each (no bulk filter available upstream), so enumeration still takes ~28h at 1 req/sec. - But PDF downloads drop from ~67K → ~5-10K (estimated row-crop hit rate), saving ~17h wall time and ~60GB disk on a full backfill. Smoke test (4 mixed reg nos): 524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites) 524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites) 100-1486 Advion Cockroach → filtered (building/transport sites only) 432-1276 (Bayer pet flea) → filtered (no row crops) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:05:26 -04:00
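The word-boundary match described in the commit above can be illustrated with a short sketch. The keyword subset and the names `EXAMPLE_KEYWORDS` and `has_row_crop_site` are assumptions for illustration; the real list lives in ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py and is not reproduced here:

```python
import re

# Illustrative subset only (assumption); not the repo's actual keyword list.
EXAMPLE_KEYWORDS = ["CORN", "SOYBEANS?", "WHEAT", "OATS"]

# \b on both sides so "OATS" does not match inside "BOATS".
_SITE_PATTERN = re.compile(r"\b(?:" + "|".join(EXAMPLE_KEYWORDS) + r")\b", re.IGNORECASE)

def has_row_crop_site(sites):
    """Return True if any entry in the `sites` list names a row crop as a whole word."""
    return any(_SITE_PATTERN.search(site) for site in sites)

kept = has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"])
dropped = has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"])
```

A plain substring test (`"OATS" in site`) would keep the second example, which is the false positive the word-boundary match avoids.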
justin	e9250de8e7	scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0" — there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source. Sources shipped: - bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs) - epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint Canonical sidecar schema (see scrape/README.md) unifies fields across sources: - active_ingredients always [{name, cas, percent}] - label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer) - all timestamps normalized to ISO 8601 UTC - signal_word surfaced (operationally critical for the farmer advisor) - source_key + epa_reg_no separate per-source PK from the cross-source join key bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development. PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written. Smoke test: python -m scrape.runner --all --limit 2 # works python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 18:27:07 -04:00
justin	3ca96a3716	Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP)	2026-05-23 17:51:56 -04:00
justin	43728320bf	ci: default PRODUCT_NAME to repo name (caught by template dispatch test) First dispatch on the empty template failed at Chroma collection creation because PRODUCT_NAME was the literal string "<product>" (YAML doesn't expand placeholders), and Chroma rejects collection names containing characters outside [a-zA-Z0-9._-]: chromadb.errors.InvalidArgumentError: Validation error: name: Expected a name containing 3-512 characters from [a-zA-Z0-9._-], starting and ending with a character in [a-zA-Z0-9]. Got: <product>_docs Same fix as the IMAGE env: derive from the repo name dynamically via ${{ github.event.repository.name }}. Cloners can still override explicitly, but a fresh clone now runs the index-rebuild step cleanly out of the box. Verified by re-dispatch — should fail next at docker login (placeholder REGISTRY_PUSH hostname), which is the next-expected fail point and a real per-deployment config the cloner has to fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:37:07 -04:00
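A pre-flight check that mirrors the rule quoted in the Chroma error message above could catch an unexpanded placeholder before the index-rebuild step runs. This is a sketch only: `is_valid_collection_name` is a hypothetical helper, and it encodes just the length and character constraints stated in that message, while Chroma itself may enforce additional rules:

```python
import re

# 3-512 characters from [a-zA-Z0-9._-], starting and ending with [a-zA-Z0-9].
_NAME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9._-]{1,510}[a-zA-Z0-9]")

def is_valid_collection_name(name):
    """Check a collection name against the constraints quoted in the error message."""
    return _NAME_RE.fullmatch(name) is not None

placeholder_ok = is_valid_collection_name("<product>_docs")
renamed_ok = is_valid_collection_name("crop_chem_docs")
```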
justin	33b0fd652e	ci: derive image name + package linking from repo, add link step Both workflows had a static IMAGE env (<owner>/<product>-docs-mcp) and a static --package arg in the GC step. Switch both to Gitea Actions context variables so a clone of the template into any repo name works on the first CI run without find/replace: IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }} --owner ${{ github.repository_owner }} --package ${{ github.event.repository.name }} Also add the "Link container package to this repo" step that was missing from the template (and which, naively copy-pasted from the reference build, would have linked everything back to docs-mcp- template). The new step derives owner + package + link-target all from the running repo's context. The github.* namespace is Gitea Actions' inherited GitHub-Actions context — values come from the Gitea server, not github.com. Same mechanism the reference build's $GITHUB_SHA tag-builder uses. CLAUDE.md updated to note that image and package naming are repo-derived; only registry endpoints and the Ollama URL need per-clone editing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:34:26 -04:00
justin	9ba615c8ee	initial: docs-mcp-template — build guide + scaffolded server Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out. Contents: - PLAN.md — comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist. - CLAUDE.md — guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands. - README.md — quick-start summary. Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required): docs_mcp/server.py FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set. docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is. rag/embeddings.py Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable. rag/chunk.py Paragraph-aware chunker with synthetic chunk 0. Per-product tunable. rag/index.py Chroma + BM25 builder. --rebuild and --bm25-only flags. rag/bm25.py SQLite FTS5 lexical index. Reusable. scrape/changelog.py --cached / --ref / --json / --history-out. Reusable. scrape/README.md What you write per-product. eval/queries.jsonl.example Curate ~25 hand-labeled queries here. eval/retrievers.py Retriever protocol + stub classes. eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton. scripts/usage_report.py Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring. scripts/registry_gc.py Gitea container registry cleanup. Reusable. 
Deployment + CI: Dockerfile Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency. deploy/docker-compose.yml MCP + reranker sidecar + Watchtower. Templated with <placeholders>. .gitea/workflows/refresh.yml Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme. .gitea/workflows/image-only.yml Code-only ship cycle, ~18min. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00

21 Commits