crop-chem-docs

Author	SHA1	Message	Date
justin	3c3178a6ad	eval: GPU rerank baseline + CLI fix

GPU eval (hybrid+rerank, RERANK_URL=http://10.10.1.65:8082):
  MRR=0.672 Recall@5=0.638 nDCG@5=0.621
  (35 queries, 1 transient 500, otherwise clean)

Quality identical to the CPU rerank run as expected — only latency changed (single rerank call dropped from ~23s to ~0.7-1.5s on the Tesla P4). Per-query report at eval/results/with_rerank_gpu.md.

CLI parser fix: `--retrievers dense+rerank,hybrid+rerank` now correctly wires the dense+rerank variant. Previously only literal "rerank" (without prefix) matched the dense+rerank branch, so combined-retriever runs silently dropped dense+rerank.

(Note: the eval's RerankedRetriever does 50 individual Chroma `get` calls per query to fetch chunk text by (source, source_key); this adds ~15s per query of pure SQLite lookup overhead. Not a production concern — docs_mcp/server.py's _rerank_pool reranks docs already in the dense pool, no extra Chroma round-trips. Worth tightening the eval-side impl on a later pass.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:12:51 -04:00
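The parser fix described in commit 3c3178a6ad could look roughly like the following. This is a minimal sketch, not the repository's code: `parse_retrievers`, `KNOWN_BASES`, and the `(base, rerank)` tuple shape are hypothetical names chosen for illustration. Only the `--retrievers dense+rerank,hybrid+rerank` syntax and the legacy bare `rerank` spelling come from the commit message.

```python
from __future__ import annotations

# Hypothetical set of base retriever names, taken from the eval tables.
KNOWN_BASES = {"dense", "bm25", "hybrid", "hybrid-rrf"}


def parse_retrievers(spec: str) -> list[tuple[str, bool]]:
    """Parse a comma-separated --retrievers value into (base, rerank) pairs."""
    parsed: list[tuple[str, bool]] = []
    for raw in spec.split(","):
        name = raw.strip().lower()
        if not name:
            continue  # tolerate trailing commas
        if name == "rerank":
            # Legacy spelling: a bare "rerank" means dense+rerank.
            name = "dense+rerank"
        base, _, suffix = name.partition("+")
        if base not in KNOWN_BASES or suffix not in ("", "rerank"):
            # Fail loudly instead of silently dropping an unrecognised variant.
            raise ValueError(f"unknown retriever: {raw!r}")
        parsed.append((base, suffix == "rerank"))
    return parsed


if __name__ == "__main__":
    print(parse_retrievers("dense+rerank,hybrid+rerank"))
```

Raising on an unrecognised name is a design choice of this sketch: it turns the "silently dropped" failure described above into an immediate error.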
justin	278fe5f456	Phase 6: reranker sidecar (jina-reranker-v2-base via llama.cpp)

Wires the docs_mcp/server.py reranker hook into a real backend:

    ghcr.io/ggml-org/llama.cpp:server \
      -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
      --reranking --host 0.0.0.0 --port 8080

Setup recipe at deploy/rerank-docker.md. The MCP server already honors RERANK_URL (added in Phase 7+8 commit); setting it to http://<host>:8082 turns on rerank automatically.

## Eval results (35 queries, k=5, pool=50)

| Retriever     | MRR   | Recall@5 | nDCG@5 |
|---------------|-------|----------|--------|
| dense         | 0.027 | 0.086    | 0.041  |
| bm25          | 0.544 | 0.586    | 0.524  |
| hybrid-rrf    | 0.114 | 0.114    | 0.108  |
| dense+rerank  | 0.171 | 0.143    | 0.149  |
| hybrid+rerank | 0.672 | 0.638    | 0.621  | ← winner

The reranker fixes hybrid's failure mode (dense noise polluting the fused pool) by scoring each (query, chunk) pair independently. Net: hybrid+rerank gives +24% MRR over BM25-only.

Smoke test for the reranker itself (query: "soybean herbicide for waterhemp", 4 candidates):

    index=1  SENCOR metribuzin waterhemp soybean → score=0.84   ← right
    index=3  Headline wheat fungicide            → score=-2.80
    index=2  Lorsban corn rootworm               → score=-2.91
    index=0  Roundup fallow burndown             → score=-3.44

Strong separation between the right doc and the rest.

## Production gotchas

- CPU-only reranker is slow (~23s for a 50-doc pool). For interactive use put it on GPU (`--gpus all`); ~10-20× faster.
- jina-reranker rejects the ENTIRE batch if any pair exceeds n_ctx_train=1024 — server truncates each doc to 2000 chars before sending. Already handled in _rerank_pool.

Per-query rerank report at eval/results/with_rerank.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:50:03 -04:00
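The truncate-then-rerank behaviour described for `_rerank_pool` (2000-char cap per doc, fail open to base order) might be sketched as below. This is an illustration under stated assumptions, not the code in docs_mcp/server.py: `rerank_pool`, `_post_json`, and the injectable `post` parameter are hypothetical, and the request/response field names (`query`, `documents`, `results`, `index`, `relevance_score`) are assumed from the Jina-style rerank format that llama.cpp's `/v1/rerank` follows and should be checked against the running server.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Callable

MAX_DOC_CHARS = 2000  # per-doc cap quoted in the commit message


def _post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def rerank_pool(
    query: str,
    docs: list[str],
    rerank_url: str,
    top_k: int = 5,
    post: Callable[[str, dict], dict] = _post_json,  # injectable for tests
) -> list[str]:
    """Return docs reordered by reranker score; keep base order on any failure."""
    try:
        payload = {
            "query": query,
            # Truncate every doc so one long chunk cannot sink the whole batch.
            "documents": [d[:MAX_DOC_CHARS] for d in docs],
        }
        body = post(rerank_url.rstrip("/") + "/v1/rerank", payload)
        ranked = sorted(
            body["results"], key=lambda r: r["relevance_score"], reverse=True
        )
        return [docs[r["index"]] for r in ranked][:top_k]
    except Exception:
        # Fail open: a reranker outage should degrade quality, not break search.
        return docs[:top_k]
```

Passing the HTTP call in as `post` keeps the ordering and fallback logic testable without a running sidecar.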
justin	335c33465b	Phase 7+8: eval harness + hybrid retrieval

## Phase 7 — Eval harness

eval/retrievers.py + rag/retrieval.py: Retriever protocol with DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60), RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a thin shim re-exporting from rag.retrieval so the MCP server can use the same code at request time without making eval/ a runtime dep.

eval/run_eval.py: drives N retrievers against eval/queries.jsonl, computes MRR / Recall@K / nDCG@K, emits a markdown report with a summary table + per-query breakdown for the first retriever. Each query carries expected (source, source_key) tuples — matches the labels-domain page-level keying.

eval/queries.jsonl: 35 curated queries — 25 brand-name (Warrant, Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro, Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic ("what controls horseweed before soybean", "fungicide for fusarium head blight", "rainfast interval for glyphosate", ...).

## Phase 8 — Hybrid retrieval (BM25 + dense + RRF)

docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env. When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py existing impl), fuses on chunk_id with reciprocal-rank-fusion (RRF k=60), and returns the combined pool. Dense-only path unchanged when HYBRID_SEARCH is unset. The rendering layer (_format_hit) is untouched.

The RERANK_URL hook is also wired (_rerank_pool sends docs to llama.cpp /v1/rerank, truncated to 2000 chars per the jina-reranker n_ctx_train=1024 batch-rejection gotcha). Fails open to base order on any exception.
## Baseline numbers (k=5, pool=50, 35 queries)

| Retriever  | MRR   | Recall@5 | nDCG@5 |
|------------|-------|----------|--------|
| dense      | 0.027 | 0.086    | 0.041  |
| bm25       | 0.544 | 0.586    | 0.524  |
| hybrid-rrf | 0.114 | 0.114    | 0.108  |

Headline: BM25 dominates because farmers search for products by brand name, and brand names are exact-match tokens that lexical search nails. Dense is poor — semantic embeddings spread across similar products and don't preferentially weight brand-name tokens.

Textbook RRF hurts when one retriever is much weaker than the other: dense's irrelevant top-50 pollute the fused pool with ties at 1/(60+rank). Phase 6 reranker is the planned fix — the reranker scores each (query, chunk) pair independently and can recover the right answer regardless of base order.

Per-query report at eval/results/baseline.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:19:05 -04:00
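For reference, reciprocal-rank fusion with k=60 as described in commit 335c33465b can be sketched in a few lines. `rrf_fuse` and its tie-break by chunk id are assumptions made for this example; the commit states only that fusion happens on chunk_id with RRF k=60.

```python
from __future__ import annotations

from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_n: int | None = None
) -> list[str]:
    """Fuse ranked lists of chunk ids with reciprocal-rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    bm25 = ["a", "b", "c"]       # lexical hits
    dense = ["x1", "x2", "x3"]   # unrelated dense hits, no overlap with bm25
    print(rrf_fuse([bm25, dense], top_n=4))
```

When the two lists do not overlap, the item at rank r in each list gets the same score 1/(60+r), so the fused order simply interleaves the lists. That is the tie behaviour the commit message blames for dense noise displacing BM25 hits from the top 5.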
justin	9ba615c8ee	initial: docs-mcp-template — build guide + scaffolded server

Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

Contents:

- PLAN.md — comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist.
- CLAUDE.md — guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands.
- README.md — quick-start summary.

Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required):

  docs_mcp/server.py          FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set.
  docs_mcp/usage.py           TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is.
  rag/embeddings.py           Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable.
  rag/chunk.py                Paragraph-aware chunker with synthetic chunk 0. Per-product tunable.
  rag/index.py                Chroma + BM25 builder. --rebuild and --bm25-only flags.
  rag/bm25.py                 SQLite FTS5 lexical index. Reusable.
  scrape/changelog.py         --cached / --ref / --json / --history-out. Reusable.
  scrape/README.md            What you write per-product.
  eval/queries.jsonl.example  Curate ~25 hand-labeled queries here.
  eval/retrievers.py          Retriever protocol + stub classes.
  eval/run_eval.py            MRR / Recall@K / nDCG@K harness skeleton.
  scripts/usage_report.py     Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring.
  scripts/registry_gc.py      Gitea container registry cleanup. Reusable.
Deployment + CI:

  Dockerfile                       Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency.
  deploy/docker-compose.yml        MCP + reranker sidecar + Watchtower. Templated with <placeholders>.
  .gitea/workflows/refresh.yml     Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme.
  .gitea/workflows/image-only.yml  Code-only ship cycle, ~18min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00
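The three metrics named for the eval/run_eval.py harness skeleton (MRR, Recall@K, nDCG@K) could be filled in along these lines. This is a sketch with binary relevance over page-level `(source, source_key)` tuples, which is how the later commits describe the labels. The function names, the order-preserving dedupe step, and the choice to compute reciprocal rank over the full returned list are assumptions for illustration, not the harness's actual code.

```python
from __future__ import annotations

import math

Key = tuple[str, str]  # (source, source_key), page-level


def unique_in_order(ranked: list[Key]) -> list[Key]:
    """Collapse repeated page keys (several chunks of one page), keeping order."""
    return list(dict.fromkeys(ranked))


def reciprocal_rank(ranked: list[Key], relevant: set[Key]) -> float:
    """1/rank of the first relevant hit, or 0.0 if none is returned."""
    for rank, key in enumerate(unique_in_order(ranked), start=1):
        if key in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[Key], relevant: set[Key], k: int) -> float:
    """Fraction of the expected pages that appear in the top k."""
    if not relevant:
        return 0.0
    top = set(unique_in_order(ranked)[:k])
    return len(top & relevant) / len(relevant)


def ndcg_at_k(ranked: list[Key], relevant: set[Key], k: int) -> float:
    """Binary-gain nDCG: DCG of the top k divided by the best achievable DCG."""
    top = unique_in_order(ranked)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, key in enumerate(top, start=1)
        if key in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


def mean(values: list[float]) -> float:
    """Average per-query scores; MRR is the mean of reciprocal_rank values."""
    return sum(values) / len(values) if values else 0.0
```

Deduplicating before scoring matters with page-level labels: without it, two chunks from the same expected page would be counted twice and could push nDCG above 1.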

4 Commits