97a2a05b248c2520c87c40a68543cd1cd0555c7f
Adapt docs_mcp/server.py from versioned-software-docs domain to
pesticide-labels domain. Standard MCP tool names preserved
(search_docs / get_page / list_versions) so existing MCP clients
(Claude Desktop, Cursor) still pick them up; docstrings + argument
shape are labels-domain.
Tools shipped:
- search_docs(query, source, product_class, registrant_contains,
signal_word, epa_reg_no, k) — dense Chroma query with optional
filters, post-filtered for registrant substring. Returns top-k
chunks rendered as markdown with product / reg / registrant /
actives / signal / section / label-PDF URL.
- get_page(source, source_key) — full label markdown + metadata
header. source_key is slug for MFR sources, EPA Reg No for EPA PPLS.
- list_versions() — discovers facet values: sources, product
classes, signal words, registrants (samples up to 50K chunks
from Chroma to enumerate distinct metadata values).
- corpus_status() — fast no-embedder counts: labels on disk per
source, chunks in Chroma, BM25 db size, active feature flags.
Wiring:
- Reads PPLS_CORPUS_ROOT + PPLS_CHROMA_DIR (matches the scrapers
and indexer).
- Uses sources.json (not the template's bundles.json).
- Lazy Chroma init so the server starts cleanly even when Ollama
is briefly down (e.g. during HVM corpus rebuilds).
- Phase 6 reranker + Phase 8 hybrid hooks left as feature flags
(RERANK_URL, HYBRID_SEARCH) — fail open to dense-only when unset.
Smoke test against the live 216K-chunk corpus:
- corpus_status: 4,157 labels / 216,467 chunks / 416 MB BM25 ✓
- search_docs("waterhemp control on soybeans", k=2) returns
Tackle Herbicide (FMC, 279-3564, glyph+imazethapyr) and
R14640 Herbicide (Bayer, 524-724, glyph) with section context
(ROUNDUP READY SOYBEANS / SOYBEAN) and dist-derived scores
of 0.76 each — highly relevant.
Run as stdio for Claude Desktop:
PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
OLLAMA_URL=http://gpu1:11434,http://gpu2:11434 \
PRODUCT_NAME=ppls \
python -m docs_mcp.server --transport stdio
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs-mcp-template
A reusable template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.
The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.
What's here
- PLAN.md — comprehensive build guide. Phased approach (13 phases, ~2–3 weeks of focused work for the full stack). Includes the design decisions, the gotchas, and a per-product customization checklist.
- Scaffolded skeleton — working FastMCP server with stub tools,
Dockerfile, docker-compose, CI workflows, eval harness layout,
usage logging. Everything you need to
git cloneand start filling in the product-specific bits.
Quick start
git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.
# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio
Repo layout
.
├── PLAN.md # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│ ├── refresh.yml # Weekly scrape + index + image push
│ └── image-only.yml # On-demand code-only ship
├── scrape/
│ ├── README.md # Product-specific scraper goes here
│ └── changelog.py # Reusable: --json, --history-out
├── rag/
│ ├── embeddings.py # Ollama embedder, swappable
│ ├── chunk.py # Chunker — adjust per page format
│ ├── index.py # Builds Chroma + (optionally) BM25
│ └── bm25.py # SQLite FTS5 lexical index
├── docs_mcp/
│ ├── server.py # FastMCP server with stub tools
│ └── usage.py # TimedCall + JSONL telemetry
├── eval/
│ ├── queries.jsonl.example # Curate ~25 hand-labeled queries
│ ├── retrievers.py # Retriever protocol + implementations
│ └── run_eval.py # MRR / Recall@k / nDCG@k harness
├── scripts/
│ ├── usage_report.py # Standalone log analyzer
│ └── registry_gc.py # Container registry cleanup
└── deploy/
└── docker-compose.yml # Hosting stack: MCP + reranker + Watchtower
What's product-specific (must implement)
scrape/— the scraper itself. The template gives you the corpus layout contract and a workingchangelog.py; the actual extraction logic is yours.- The corpus on disk (gitignored; rebuilt by CI).
- The reranker GGUF model and llama.cpp container (commented in
deploy/docker-compose.yml). - The reverse proxy / TLS layer in front of the public endpoint.
- The hand-curated knowledge surface (your product's API gotchas, example scripts, anything the LLM should know that the docs don't say).
What's NOT product-specific (works as-is)
- FastMCP server skeleton + tool decoration pattern
- Chroma + Ollama embedding pipeline
- BM25 / SQLite FTS5 lexical index
- Hybrid retrieval (RRF) + reranker integration
- Eval harness (Retriever protocol, MRR/Recall/nDCG)
- Usage logging (TimedCall, JSONL, daily rotation)
- CI workflow shape (weekly + on-demand, retry-on-race, three-tag image scheme)
- Registry GC script
- Standard tools:
search_docs,get_page,list_versions,diff_versions,bundle_changelog,weekly_digest,find_doc_inconsistencies, etc.
License
Internal template. Adjust before publishing.
Description
MCP server over US row-crop pesticide labels (EPA PPLS + manufacturer sites). Feeds Drawbar farmer advisor.
Languages
Python
98.8%
Dockerfile
1.2%