crop-chem-docs

Author	SHA1	Message	Date
justin	335c33465b	Phase 7+8: eval harness + hybrid retrieval

## Phase 7: Eval harness

eval/retrievers.py + rag/retrieval.py: Retriever protocol with DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60), RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a thin shim re-exporting from rag.retrieval so the MCP server can use the same code at request time without making eval/ a runtime dep.

eval/run_eval.py: drives N retrievers against eval/queries.jsonl, computes MRR / Recall@K / nDCG@K, emits a markdown report with a summary table + per-query breakdown for the first retriever. Each query carries expected (source, source_key) tuples, matching the labels-domain page-level keying.

eval/queries.jsonl: 35 curated queries, 25 brand-name (Warrant, Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro, Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic ("what controls horseweed before soybean", "fungicide for fusarium head blight", "rainfast interval for glyphosate", ...).

## Phase 8: Hybrid retrieval (BM25 + dense + RRF)

docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env. When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py existing impl), fuses on chunk_id with reciprocal-rank-fusion (RRF k=60), and returns the combined pool. Dense-only path unchanged when HYBRID_SEARCH is unset. The rendering layer (_format_hit) is untouched.

The RERANK_URL hook is also wired (_rerank_pool sends docs to llama.cpp /v1/rerank, truncated to 2000 chars per the jina-reranker n_ctx_train=1024 batch-rejection gotcha). Fails open to base order on any exception.

## Baseline numbers (k=5, pool=50, 35 queries)

| Retriever  | MRR   | Recall@5 | nDCG@5 |
|------------|-------|----------|--------|
| dense      | 0.027 | 0.086    | 0.041  |
| bm25       | 0.544 | 0.586    | 0.524  |
| hybrid-rrf | 0.114 | 0.114    | 0.108  |

Headline: BM25 dominates because farmers search for products by brand name, and brand names are exact-match tokens that lexical search nails. Dense is poor: semantic embeddings spread across similar products and don't preferentially weight brand-name tokens. Textbook RRF hurts when one retriever is much weaker than the other: dense's irrelevant top-50 pollute the fused pool with ties at 1/(60+rank).

Phase 6 reranker is the planned fix: the reranker scores each (query, chunk) pair independently and can recover the right answer regardless of base order.

Per-query report at eval/results/baseline.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:19:05 -04:00
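The tie effect described in the baseline note can be reproduced with a few lines. This is a minimal sketch of reciprocal rank fusion with k=60, not the code in docs_mcp/server.py: the function name `rrf_fuse`, the 1-based rank convention, the alphabetical tie-break, and the chunk ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk id, with rank starting
    at 1 (an assumed convention; the commit only states 1/(60+rank)).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk ids: BM25 finds the right label pages, while the
# dense retriever returns unrelated chunks with no overlap.
bm25 = ["warrant-p3", "warrant-p1", "huskie-p2"]
dense = ["generic-a", "generic-b", "generic-c"]

fused = rrf_fuse([bm25, dense])
print(fused)
# ['generic-a', 'warrant-p3', 'generic-b', 'warrant-p1', 'generic-c', 'huskie-p2']
```

With no overlap between the two lists, every dense hit ties exactly with the BM25 hit at the same rank, so half of any top-k slice is taken by the weaker retriever. That is consistent with hybrid-rrf landing between dense and bm25 in the table.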
justin	97a2a05b24	Phase 3: MCP server tools for the labels corpus

Adapt docs_mcp/server.py from versioned-software-docs domain to pesticide-labels domain. Standard MCP tool names preserved (search_docs / get_page / list_versions) so existing MCP clients (Claude Desktop, Cursor) still pick them up; docstrings + argument shape are labels-domain.

Tools shipped:
- search_docs(query, source, product_class, registrant_contains, signal_word, epa_reg_no, k): dense Chroma query with optional filters, post-filtered for registrant substring. Returns top-k chunks rendered as markdown with product / reg / registrant / actives / signal / section / label-PDF URL.
- get_page(source, source_key): full label markdown + metadata header. source_key is slug for MFR sources, EPA Reg No for EPA PPLS.
- list_versions(): discovers facet values (sources, product classes, signal words, registrants); samples up to 50K chunks from Chroma to enumerate distinct metadata values.
- corpus_status(): fast no-embedder counts of labels on disk per source, chunks in Chroma, BM25 db size, active feature flags.

Wiring:
- Reads PPLS_CORPUS_ROOT + PPLS_CHROMA_DIR (matches the scrapers and indexer).
- Uses sources.json (not the template's bundles.json).
- Lazy Chroma init so the server starts cleanly even when Ollama is briefly down (e.g. during HVM corpus rebuilds).
- Phase 6 reranker + Phase 8 hybrid hooks left as feature flags (RERANK_URL, HYBRID_SEARCH); fail open to dense-only when unset.

Smoke test against the live 216K-chunk corpus:
- corpus_status: 4,157 labels / 216,467 chunks / 416 MB BM25 ✓
- search_docs("waterhemp control on soybeans", k=2) returns Tackle Herbicide (FMC, 279-3564, glyph+imazethapyr) and R14640 Herbicide (Bayer, 524-724, glyph) with section context (ROUNDUP READY SOYBEANS / SOYBEAN) and dist-derived scores of 0.76 each, highly relevant.

Run as stdio for Claude Desktop:

    PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
    OLLAMA_URL=http://gpu1:11434,http://gpu2:11434 \
    PRODUCT_NAME=ppls \
    python -m docs_mcp.server --transport stdio

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:02:01 -04:00
justin	3ca96a3716	Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP)	2026-05-23 17:51:56 -04:00
justin	9ba615c8ee	initial: docs-mcp-template - build guide + scaffolded server

Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

Contents:
- PLAN.md: comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist.
- CLAUDE.md: guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands.
- README.md: quick-start summary.

Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required):

  docs_mcp/server.py          FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set.
  docs_mcp/usage.py           TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is.
  rag/embeddings.py           Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable.
  rag/chunk.py                Paragraph-aware chunker with synthetic chunk 0. Per-product tunable.
  rag/index.py                Chroma + BM25 builder. --rebuild and --bm25-only flags.
  rag/bm25.py                 SQLite FTS5 lexical index. Reusable.
  scrape/changelog.py         --cached / --ref / --json / --history-out. Reusable.
  scrape/README.md            What you write per-product.
  eval/queries.jsonl.example  Curate ~25 hand-labeled queries here.
  eval/retrievers.py          Retriever protocol + stub classes.
  eval/run_eval.py            MRR / Recall@K / nDCG@K harness skeleton.
  scripts/usage_report.py     Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring.
  scripts/registry_gc.py      Gitea container registry cleanup. Reusable.

Deployment + CI:

  Dockerfile                       Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency.
  deploy/docker-compose.yml        MCP + reranker sidecar + Watchtower. Templated with <placeholders>.
  .gitea/workflows/refresh.yml     Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme.
  .gitea/workflows/image-only.yml  Code-only ship cycle, ~18min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00
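For readers unfamiliar with the three metrics the eval/run_eval.py skeleton names, here is a minimal sketch of MRR, Recall@K, and nDCG@K with binary relevance. It is not the harness's code. It assumes each result is a (source, source_key) tuple, that the ranked list is already deduplicated to page level, and that reciprocal rank is taken over the whole list rather than cut at K. The function names and sample keys are hypothetical.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the top k over the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query with one expected page, retrieved at rank 3.
ranked = [("mfr", "huskie"), ("epa", "524-724"), ("mfr", "warrant")]
relevant = {("mfr", "warrant")}

rr = reciprocal_rank(ranked, relevant)    # 1/3
recall = recall_at_k(ranked, relevant, 5)  # 1.0
ndcg = ndcg_at_k(ranked, relevant, 5)      # 0.5
```

MRR is then the mean of `reciprocal_rank` over all queries. Averaging the other two functions per query gives the Recall@K and nDCG@K columns of a summary table.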

4 Commits