From b1a712308c8f92e576d366a28ee76ac882aef86c Mon Sep 17 00:00:00 2001
From: Justin Paul
Date: Mon, 25 May 2026 17:50:36 -0400
Subject: [PATCH] README: rewrite for crop-chem-docs as a product (was template README)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The README had never been customized after cloning the
docs-mcp-template — title said "docs-mcp-template" and it read as the
template's generic introduction with no mention of EPA PPLS, the Bayer
scraper, the ~4k label corpus, or the production deploy.

Replace with a crop-chem-docs-specific README that covers:

- Corpus inventory: 4,159 indexed pages (91 Bayer + 4,068 EPA PPLS)
- MCP tool catalog with crop_chem_api_lessons specifics
- Eval baseline from eval/results/with_rerank.md showing hybrid+rerank
  wins (MRR 0.672) over BM25-only (0.544) and that
  hybrid-without-rerank actively HURTS (0.114) — same pattern seed-mcp
  found independently
- Note that the deployed rerank was silently failing through 2026-05-25
  due to the llama-rerank Docker network gotcha; fixed and re-running
  eval is on the followup list
- Quick-start commands
- Repo layout reference
- Infrastructure: registry, embedder pool, shared llama-rerank sidecar,
  PRODUCT_NAME=crop_chem
- Cross-link to the sibling seed-mcp project

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md | 177 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 110 insertions(+), 67 deletions(-)

diff --git a/README.md b/README.md
index a27ae47..b94059d 100644
--- a/README.md
+++ b/README.md
@@ -1,103 +1,146 @@
-# docs-mcp-template
+# crop-chem-docs
 
-A reusable template for building hosted MCP servers over a product's
-public documentation. Distilled from one production build; everything
-product-specific has been factored out.
+MCP server over ~4,000 public US row-crop pesticide / herbicide / fertilizer labels — feeding the same Drawbar farm-advisor AI as the sibling [`seed-mcp`](https://git.jpaul.io/justin/seed-mcp). The advisor calls this MCP for label rates, REI/PHI, rotation restrictions, tank-mix guidance, and active-ingredient lookups.
 
-The end product is a streamable-HTTP MCP server with ~15 tools that
-any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
-call to answer questions against the docs, surface what changed
-recently, and flag likely inconsistencies.
+Built on [`docs-mcp-template`](https://git.jpaul.io/justin/docs-mcp-template) (same template lineage as seed-mcp). **In production** on trashpanda; the Drawbar advisor calls it via the `chem:` prefix.
 
-## What's here
+## What's in the corpus
 
-- **[PLAN.md](PLAN.md)** — comprehensive build guide. Phased
-  approach (13 phases, ~2–3 weeks of focused work for the full
-  stack). Includes the design decisions, the gotchas, and a
-  per-product customization checklist.
-- **Scaffolded skeleton** — working FastMCP server with stub tools,
-  Dockerfile, docker-compose, CI workflows, eval harness layout,
-  usage logging. Everything you need to `git clone` and start
-  filling in the product-specific bits.
+**4,159 indexed pages** across two complementary sources:
+
+| Source | Pages | Notes |
+|---|---|---|
+| `bayer` | 91 | Bayer Crop Science US product pages — Warrant, Harness, Roundup, Liberty, Capreno, etc. Rich Next.js `__NEXT_DATA__` payload: active ingredients, label rates, MOA codes, supplemental PDFs (24c / 2EE / bulletins). robots.txt explicitly whitelists RAG indexing. |
+| `epa_ppls` | 4,068 | EPA Pesticide Product Label System — every registered ag chemistry product. Authoritative source of truth for EPA reg numbers, master labels, signal words, registrant info, formulations. |
+
+## MCP tools
+
+Same shape as the docs-mcp-template's standard tools (see [`docs_mcp/server.py`](docs_mcp/server.py)):
+
+| Tool | Purpose |
+|---|---|
+| `search_docs` | Hybrid dense + BM25 + rerank search over the label corpus, filterable by source. |
+| `get_page` | Full label record by `(source, source_key)`. Returns marketing copy + extracted PDF text + sidecar metadata. |
+| `list_versions` | Facet discovery (sources, EPA registrant codes, label categories). |
+| `crop_chem_api_lessons` | Curated agronomy / regulatory lessons — EPA reg-number normalization, label-supersession ordering, common tank-mix gotchas. |
+| `diff_versions`, `bundle_changelog`, `weekly_digest` | Standard template tools, available if needed. |
+
+## Retrieval — eval-validated
+
+From [`eval/results/with_rerank.md`](eval/results/with_rerank.md) (35 golden queries, k=5):
+
+| Retriever | MRR | Recall@5 | nDCG@5 | Time (s) |
+|---|---|---|---|---|
+| **hybrid+rerank** | **0.672** | **0.638** | **0.621** | 823 |
+| bm25 | 0.544 | 0.586 | 0.524 | 5 |
+| dense+rerank | 0.171 | 0.143 | 0.149 | 805 |
+| hybrid-rrf | 0.114 | 0.114 | 0.108 | 8 |
+| dense | 0.027 | 0.086 | 0.041 | 5 |
+
+**Deploy config**: `HYBRID_SEARCH=true` + `RERANK_URL=http://llama-rerank:8080`.
+
+Pattern matches what seed-mcp found independently:
+
+1. **Dense embedding alone is essentially useless** on this corpus (MRR 0.027). Variety codes, EPA reg numbers, and active-ingredient names have no semantic neighbors — `nomic-embed-text` returns noise.
+2. **Hybrid-rrf (no rerank) is worse than BM25 alone.** RRF dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
+3. **BM25 alone (MRR 0.544, 5 sec) is a great fallback** when the rerank sidecar is unavailable.
+4. **Rerank brings the win**: `hybrid+rerank` MRR 0.672 is about 24% higher than BM25 alone (0.544) and beats every other configuration on MRR, Recall@5, and nDCG@5.
+
+**Note on rerank in production**: through 2026-05-25 the `llama-rerank` sidecar was attached to Docker's default `bridge` network instead of `drawbar-backend_default`, so the hostname in chem-mcp's `RERANK_URL=http://llama-rerank:8080` resolved via public DNS to an unrelated IP and the connection was refused. The MCP silently fell back to dense+BM25. Fixed with `docker network connect drawbar-backend_default llama-rerank`. Re-running the eval is on the follow-up list; expect the deployed MRR to rise toward the lab number.
 
 ## Quick start
 
 ```bash
-git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
-cd my-product-docs
-git remote remove origin  # detach from template
+git clone https://git.jpaul.io/justin/crop-chem-docs.git
+cd crop-chem-docs
 python -m venv venv && source venv/bin/activate
 pip install -r requirements.txt
 
-# Read PLAN.md before doing anything else. Pay particular attention to
-# Phase 1 (scraper) — that's the most product-specific phase.
+# Sample-scrape to verify wiring:
+python -m scrape.runner --source bayer --limit 5
 
-# Run the stub server (no corpus yet — just verifies the wiring):
-python -m docs_mcp.server --transport stdio
+# Full refresh (be polite — bayer is small, epa_ppls is hours):
+python -m scrape.runner --source bayer --force
+python -m scrape.runner --source epa_ppls --force
+
+# Rebuild Chroma + BM25:
+OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_chem \
+    python -m rag.index --rebuild
+
+# Run the eval harness:
+RERANK_URL=http://localhost:18080 python -m eval.run_eval \
+    --queries eval/queries.jsonl --k 5 \
+    --output eval/results/baseline.md
+
+# Local MCP server (stdio for Claude Desktop dev):
+PRODUCT_NAME=crop_chem python -m docs_mcp.server --transport stdio
 ```
 
 ## Repo layout
 
 ```
 .
-├── PLAN.md  # The build guide. Read first.
+├── CLAUDE.md  # Canonical agent guide
+├── PLAN.md  # Template's 13-phase build guide
 ├── README.md
 ├── requirements.txt
 ├── Dockerfile
-├── .gitignore
+├── deploy/
+│   ├── docker-compose.yml  # Drop-in compose for Drawbar
+│   ├── drawbar-compose-snippet.md  # Notes on the parent compose merge
+│   └── rerank-docker.md  # llama-rerank service deployment
 ├── .gitea/workflows/
-│   ├── refresh.yml  # Weekly scrape + index + image push
-│   └── image-only.yml  # On-demand code-only ship
+│   ├── refresh.yml  # Monthly cron: scrape + index + image push
+│   └── image-only.yml  # On-demand code-only ship cycle
 ├── scrape/
-│   ├── README.md  # Product-specific scraper goes here
-│   └── changelog.py  # Reusable: --json, --history-out
+│   ├── runner.py  # Dispatches `--source `
+│   ├── changelog.py  # Reusable: --json, --history-out
+│   └── sources/
+│       ├── bayer.py  # cropscience.bayer.us Next.js scraper
+│       └── epa_ppls.py  # EPA PPLS pagination + label PDFs
 ├── rag/
-│   ├── embeddings.py  # Ollama embedder, swappable
-│   ├── chunk.py  # Chunker — adjust per page format
-│   ├── index.py  # Builds Chroma + (optionally) BM25
-│   └── bm25.py  # SQLite FTS5 lexical index
+│   ├── embeddings.py  # nomic-embed-text via Ollama
+│   ├── chunk.py  # Chunker w/ EPA-reg-number preamble
+│   ├── index.py  # Chroma + BM25 builder
+│   └── bm25.py  # FTS5 lexical index
 ├── docs_mcp/
-│   ├── server.py  # FastMCP server with stub tools
+│   ├── server.py  # FastMCP — hybrid+rerank
+│   ├── lessons.md  # Curated knowledge layer
 │   └── usage.py  # TimedCall + JSONL telemetry
 ├── eval/
-│   ├── queries.jsonl.example  # Curate ~25 hand-labeled queries
-│   ├── retrievers.py  # Retriever protocol + implementations
-│   └── run_eval.py  # MRR / Recall@k / nDCG@k harness
+│   ├── queries.jsonl  # 35 golden queries
+│   ├── retrievers.py  # 5 named configurations
+│   ├── run_eval.py  # MRR / Recall@k / nDCG@k
+│   └── results/  # Baseline + with_rerank measurements
 ├── scripts/
-│   ├── usage_report.py  # Standalone log analyzer
+│   ├── usage_report.py
 │   └── registry_gc.py  # Container registry cleanup
-└── deploy/
-    └── docker-compose.yml  # Hosting stack: MCP + reranker + Watchtower
+└── corpus/  # Committed scrape output (CI-refreshed)
+    ├── bayer/
+    └── epa_ppls/
 ```
 
-## What's product-specific (must implement)
+## Infrastructure
 
-- `scrape/` — the scraper itself. The template gives you the corpus
-  layout contract and a working `changelog.py`; the actual extraction
-  logic is yours.
-- The corpus on disk (gitignored; rebuilt by CI).
-- The reranker GGUF model and llama.cpp container (commented in
-  `deploy/docker-compose.yml`).
-- The reverse proxy / TLS layer in front of the public endpoint.
-- The hand-curated knowledge surface (your product's API gotchas,
-  example scripts, anything the LLM should know that the docs
-  don't say).
+- **Registry**: pushes to `192.168.0.2:1234` (LAN, no Cloudflare body cap); deploys pull `git.jpaul.io/justin/crop-chem-docs:latest` (public, via Cloudflare tunnel). Also tagged `:` for rollback pinning and `:corpus-YYYY.MM.DD` for snapshot pinning.
+- **Embedder pool (CI)**: 3 GPU-pinned Ollama endpoints, weighted toward `.0.125` (RTX 40-series).
+- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla P4 (`jina-reranker-v2-base-multilingual` via llama.cpp). Same container serves crop-chem-docs and seed-mcp.
+- **PRODUCT_NAME**: `crop_chem` — used in `crop_chem_docs` Chroma collection, `bm25/crop_chem_docs.db`, and the `crop_chem_api_lessons` tool name.
 
-## What's NOT product-specific (works as-is)
+## Deploy mechanics
 
-- FastMCP server skeleton + tool decoration pattern
-- Chroma + Ollama embedding pipeline
-- BM25 / SQLite FTS5 lexical index
-- Hybrid retrieval (RRF) + reranker integration
-- Eval harness (Retriever protocol, MRR/Recall/nDCG)
-- Usage logging (TimedCall, JSONL, daily rotation)
-- CI workflow shape (weekly + on-demand, retry-on-race, three-tag
-  image scheme)
-- Registry GC script
-- Standard tools: `search_docs`, `get_page`, `list_versions`,
-  `diff_versions`, `bundle_changelog`, `weekly_digest`,
-  `find_doc_inconsistencies`, etc.
+Same Watchtower auto-deploy chain as seed-mcp. On every push to `main` that touches `docs_mcp/`, `rag/`, `scrape/`, `requirements.txt`, `Dockerfile`, or `sources.json`:
 
-## License
+1. `image-only.yml` checks out main + committed corpus
+2. Rebuilds Chroma + BM25 (a few minutes on the GPU pool)
+3. `docker build` + push three tags to the LAN registry
+4. Links the package to the repo via Gitea API
+5. Watchtower on trashpanda polls `:latest` every 5 min → recreates `drawbar-backend-chem-mcp-1`
 
-Internal template. Adjust before publishing.
+Corpus refresh runs monthly via `refresh.yml`. EPA PPLS is the slow source: a full-scale scrape takes hours at 1 req/sec.
+
+## Sibling
+
+[`seed-mcp`](https://git.jpaul.io/justin/seed-mcp) covers the row-crop seed-variety + yield-trial side of the advisor's tool catalog. Both MCPs are docs-mcp-template clones running side-by-side on trashpanda, sharing the Ollama pool and the `llama-rerank` sidecar.
+
+See [`CLAUDE.md`](./CLAUDE.md) for canonical sidecar schemas, the EPA reg-number normalization rules, and label-supersession ordering.
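
Reviewer note on the retrieval section: the README's claim that hybrid-rrf without rerank scores below BM25 alone can be illustrated with a toy reciprocal rank fusion. This is a minimal sketch, assuming the textbook RRF formula with the conventional constant k = 60; it is not taken from `eval/retrievers.py`, and the function names `rrf_fuse` and `reciprocal_rank` as well as the `label_*` document ids are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion (illustrative)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank(ranking, relevant, k=5):
    """Return 1/rank of the first relevant doc in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical toy data: BM25 ranks the right label first; dense misses it
# entirely but agrees with BM25 on two irrelevant labels.
bm25 = ["label_a", "label_x", "label_y"]
dense = ["label_x", "label_y", "label_z"]
relevant = {"label_a"}

fused = rrf_fuse([bm25, dense])
print(fused)
print(reciprocal_rank(bm25, relevant), reciprocal_rank(fused, relevant))
```

In this toy case the relevant label falls from rank 1 under BM25 to rank 3 after fusion, because the two irrelevant labels collect score from both lists. Its reciprocal rank drops from 1.0 to about 0.33, which is the dilution effect described in item 2 of the retrieval notes.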