justin 335c33465b Phase 7+8: eval harness + hybrid retrieval
## Phase 7 — Eval harness

eval/retrievers.py + rag/retrieval.py: Retriever protocol with
DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60),
RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a
thin shim re-exporting from rag.retrieval so the MCP server can
use the same code at request time without making eval/ a runtime
dep.
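
As a rough illustration of the shape such a protocol could take (a sketch only: `Hit`, the `retrieve(query, k)` signature, and the toy `KeywordRetriever` are assumptions for this example, not the actual contents of rag/retrieval.py):

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Hit:
    """One retrieved chunk. Field names are illustrative."""
    chunk_id: str
    score: float


@runtime_checkable
class Retriever(Protocol):
    """Any object with a matching retrieve() method satisfies the protocol."""

    def retrieve(self, query: str, k: int) -> list[Hit]: ...


class KeywordRetriever:
    """Toy stand-in for a real backend: scores chunks by shared tokens."""

    def __init__(self, chunks: dict[str, str]) -> None:
        self._chunks = chunks

    def retrieve(self, query: str, k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = [
            Hit(cid, float(len(terms & set(text.lower().split()))))
            for cid, text in self._chunks.items()
        ]
        # Drop chunks with no overlap, then sort best-first with a stable tie-break.
        hits = [h for h in hits if h.score > 0]
        hits.sort(key=lambda h: (-h.score, h.chunk_id))
        return hits[:k]
```

Because the protocol is structural, an eval harness and a server could both accept any retriever without sharing a base class, which is one plausible reason to keep the implementations in rag/ and only re-export them from eval/.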

eval/run_eval.py: drives N retrievers against eval/queries.jsonl,
computes MRR / Recall@K / nDCG@K, and emits a markdown report with a
summary table plus a per-query breakdown for the first retriever. Each
query carries expected (source, source_key) tuples, which match the
page-level keying used in the labels domain.
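
For reference, the three metrics can be sketched with binary relevance over (source, source_key) tuples. This is a minimal version under stated assumptions: the function names are hypothetical, results are deduplicated to page-level keys before scoring, and run_eval.py may differ in details such as whether the reciprocal rank is truncated at K.

```python
import math
from typing import Sequence

Key = tuple[str, str]  # (source, source_key), as described above


def _unique(ranked: Sequence[Key]) -> list[Key]:
    """Collapse repeated page keys, keeping the first (best) occurrence."""
    return list(dict.fromkeys(ranked))


def reciprocal_rank(ranked: Sequence[Key], relevant: set[Key]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, key in enumerate(_unique(ranked), start=1):
        if key in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[Key], relevant: set[Key], k: int) -> float:
    """Fraction of the expected keys that appear in the top k."""
    if not relevant:
        return 0.0
    return len(relevant & set(_unique(ranked)[:k])) / len(relevant)


def ndcg_at_k(ranked: Sequence[Key], relevant: set[Key], k: int) -> float:
    """Binary-gain nDCG: DCG of the top k divided by the best achievable DCG."""
    top = _unique(ranked)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, key in enumerate(top, start=1)
        if key in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR is then the mean of `reciprocal_rank` across all queries, and the table-level Recall@K and nDCG@K are the per-query values averaged the same way.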

eval/queries.jsonl: 35 curated queries — 25 brand-name (Warrant,
Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro,
Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic
("what controls horseweed before soybean", "fungicide for fusarium
head blight", "rainfast interval for glyphosate", ...).

## Phase 8 — Hybrid retrieval (BM25 + dense + RRF)

docs_mcp/server.py: search_docs now branches on the HYBRID_SEARCH env
var. When it is set, _search_chunks queries both Chroma and BM25 (the
existing rag/bm25.py implementation), fuses the two rankings on chunk_id
with reciprocal rank fusion (RRF, k=60), and returns the combined pool.
The dense-only path is unchanged when HYBRID_SEARCH is unset. The
rendering layer (_format_hit) is untouched.
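
The fusion step can be sketched as follows. This is the standard RRF formula with k=60 as stated above; the function name `rrf_fuse` and the alphabetical tie-break are assumptions for illustration, not necessarily what _search_chunks does.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse best-first lists of chunk_ids with reciprocal rank fusion.

    Each appearance of a chunk at 1-based rank r adds 1 / (k + r) to its
    score, so chunks found by several retrievers accumulate more weight.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk_id breaks ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c1", "c2", "c3"]
bm25 = ["c3", "c1", "c4"]
fused = rrf_fuse([dense, bm25])
```

Here `c1` and `c3` appear in both lists and therefore outrank `c2` and `c4`, which each appear in only one.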

The RERANK_URL hook is also wired: _rerank_pool sends docs, truncated
to 2000 chars, to llama.cpp /v1/rerank. The truncation works around the
jina-reranker n_ctx_train=1024 batch-rejection gotcha. The hook fails
open to base order on any exception.
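
The truncate-then-fail-open behaviour might look roughly like this. It is a sketch, not the real _rerank_pool: `score_fn` is a hypothetical callable standing in for the HTTP call to the reranker, and the pool is assumed to be a list of (chunk_id, text) pairs.

```python
from typing import Callable, Sequence

MAX_DOC_CHARS = 2000  # truncation limit taken from the commit message


def rerank_pool(
    query: str,
    pool: Sequence[tuple[str, str]],
    score_fn: Callable[[str, list[str]], Sequence[float]],
) -> list[tuple[str, str]]:
    """Reorder (chunk_id, text) pairs by reranker score, best first.

    score_fn is assumed to return one relevance score per document, in
    input order. Any failure leaves the base order untouched.
    """
    docs = [text[:MAX_DOC_CHARS] for _, text in pool]
    try:
        scores = score_fn(query, docs)
        if len(scores) != len(pool):
            raise ValueError("reranker returned an unexpected number of scores")
        # sorted() is stable, so equal scores keep their base order.
        order = sorted(range(len(pool)), key=lambda i: -scores[i])
    except Exception:
        # Fail open: a broken or unreachable reranker must not break search.
        return list(pool)
    return [pool[i] for i in order]
```

Injecting the scorer keeps the ordering logic testable without a running llama.cpp container; in a server the callable would wrap the request to RERANK_URL.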

## Baseline numbers (k=5, pool=50, 35 queries)

  | Retriever  | MRR   | Recall@5 | nDCG@5 |
  |------------|-------|----------|--------|
  | dense      | 0.027 | 0.086    | 0.041  |
  | bm25       | 0.544 | 0.586    | 0.524  |
  | hybrid-rrf | 0.114 | 0.114    | 0.108  |

Headline: BM25 dominates because farmers search for products by
brand name, and brand names are exact-match tokens that lexical
search nails. Dense is poor: semantic embeddings spread across
similar products and don't preferentially weight brand-name tokens.
Textbook RRF hurts when one retriever is much weaker than the
other. A hit at rank r scores 1/(60+r) in either list, so dense's
irrelevant top-50 tie with BM25's hits rank for rank and pollute the
fused pool. Phase 6 reranker is the planned fix: the reranker scores
each (query, chunk) pair independently and can recover the right
answer regardless of base order.
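
A toy calculation makes the tie effect concrete. The chunk ids below are invented and the two lists are assumed to be fully disjoint, so this illustrates the mechanism only and does not reproduce the eval numbers.

```python
from fractions import Fraction

K = 60  # RRF constant used in this commit

# Hypothetical rankings: suppose bm25-1 is the right answer and every
# dense hit is irrelevant. The lists share no chunks.
bm25 = [f"bm25-{i}" for i in range(1, 6)]
dense = [f"dense-{i}" for i in range(1, 6)]

scores: dict[str, Fraction] = {}
for ranking in (bm25, dense):
    for rank, chunk_id in enumerate(ranking, start=1):
        scores[chunk_id] = scores.get(chunk_id, Fraction(0)) + Fraction(1, K + rank)

# Exact fractions show the ties are real, not floating-point noise.
fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
top5 = fused[:5]
```

With disjoint lists every dense hit ties exactly with the BM25 hit at the same rank, so the fused top 5 holds only three BM25 results. Which side of each tie comes first depends entirely on the tie-break (alphabetical here, an arbitrary choice), so a different tie-break could push the right answer from rank 1 to rank 2.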

Per-query report at eval/results/baseline.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 10:19:05 -04:00

docs-mcp-template

A reusable template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.

What's here

  • PLAN.md — comprehensive build guide. Phased approach (13 phases, ~23 weeks of focused work for the full stack). Includes the design decisions, the gotchas, and a per-product customization checklist.
  • Scaffolded skeleton — working FastMCP server with stub tools, Dockerfile, docker-compose, CI workflows, eval harness layout, usage logging. Everything you need to git clone and start filling in the product-specific bits.

Quick start

git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin  # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.

# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio

Repo layout

.
├── PLAN.md                        # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│   ├── refresh.yml                # Weekly scrape + index + image push
│   └── image-only.yml             # On-demand code-only ship
├── scrape/
│   ├── README.md                  # Product-specific scraper goes here
│   └── changelog.py               # Reusable: --json, --history-out
├── rag/
│   ├── embeddings.py              # Ollama embedder, swappable
│   ├── chunk.py                   # Chunker — adjust per page format
│   ├── index.py                   # Builds Chroma + (optionally) BM25
│   └── bm25.py                    # SQLite FTS5 lexical index
├── docs_mcp/
│   ├── server.py                  # FastMCP server with stub tools
│   └── usage.py                   # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl.example      # Curate ~25 hand-labeled queries
│   ├── retrievers.py              # Retriever protocol + implementations
│   └── run_eval.py                # MRR / Recall@k / nDCG@k harness
├── scripts/
│   ├── usage_report.py            # Standalone log analyzer
│   └── registry_gc.py             # Container registry cleanup
└── deploy/
    └── docker-compose.yml         # Hosting stack: MCP + reranker + Watchtower
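
To give a feel for what a SQLite FTS5 lexical index like rag/bm25.py involves, here is a minimal sketch. The table name, column names, and helper functions are assumptions for illustration and not the template's actual schema; it also assumes the Python build's SQLite includes the FTS5 extension.

```python
import sqlite3


def build_index(chunks: dict[str, str]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load chunk_id -> text pairs."""
    conn = sqlite3.connect(":memory:")
    # chunk_id is stored but not tokenized; only body is searchable.
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(chunk_id UNINDEXED, body)")
    conn.executemany(
        "INSERT INTO chunks (chunk_id, body) VALUES (?, ?)", chunks.items()
    )
    return conn


def search(conn: sqlite3.Connection, query: str, k: int = 5) -> list[tuple[str, float]]:
    """Return up to k (chunk_id, bm25 score) rows, best match first."""
    # Quote each token so punctuation in user input is not parsed as FTS5 syntax.
    tokens = [t.replace('"', '""') for t in query.split()]
    if not tokens:
        return []
    match = " OR ".join(f'"{t}"' for t in tokens)
    # FTS5's bm25() returns lower values for better matches, so ascending
    # order puts the best hit first.
    return conn.execute(
        "SELECT chunk_id, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (match, k),
    ).fetchall()
```

A file-backed database path in place of `":memory:"` would let the index be built once in CI and shipped inside the image.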

What's product-specific (must implement)

  • scrape/ — the scraper itself. The template gives you the corpus layout contract and a working changelog.py; the actual extraction logic is yours.
  • The corpus on disk (gitignored; rebuilt by CI).
  • The reranker GGUF model and llama.cpp container (commented in deploy/docker-compose.yml).
  • The reverse proxy / TLS layer in front of the public endpoint.
  • The hand-curated knowledge surface (your product's API gotchas, example scripts, anything the LLM should know that the docs don't say).

What's NOT product-specific (works as-is)

  • FastMCP server skeleton + tool decoration pattern
  • Chroma + Ollama embedding pipeline
  • BM25 / SQLite FTS5 lexical index
  • Hybrid retrieval (RRF) + reranker integration
  • Eval harness (Retriever protocol, MRR/Recall/nDCG)
  • Usage logging (TimedCall, JSONL, daily rotation)
  • CI workflow shape (weekly + on-demand, retry-on-race, three-tag image scheme)
  • Registry GC script
  • Standard tools: search_docs, get_page, list_versions, diff_versions, bundle_changelog, weekly_digest, find_doc_inconsistencies, etc.
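
As an illustration of the usage-logging pattern listed above (TimedCall, JSONL, daily rotation), a context manager along these lines would do the job. This is a sketch under assumptions: the constructor arguments, record fields, and file naming are hypothetical and may not match docs_mcp/usage.py.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


class TimedCall:
    """Time a tool call and append one JSON line to a per-day log file."""

    def __init__(self, tool: str, log_dir: Path, **fields: object) -> None:
        self.tool = tool
        self.log_dir = log_dir
        self.fields = fields  # extra JSON-serializable context, e.g. the query

    def __enter__(self) -> "TimedCall":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        now = datetime.now(timezone.utc)
        record = {
            "ts": now.isoformat(),
            "tool": self.tool,
            "duration_ms": round((time.perf_counter() - self._start) * 1000, 3),
            "ok": exc_type is None,
            **self.fields,
        }
        self.log_dir.mkdir(parents=True, exist_ok=True)
        # Daily rotation falls out of the file name: one file per UTC date.
        path = self.log_dir / f"usage-{now:%Y-%m-%d}.jsonl"
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return False  # never swallow the tool's own exception
```

A tool body would then be wrapped as `with TimedCall("search_docs", Path("logs"), query=q): ...`, which records failures as well as successes because `__exit__` runs either way.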

License

Internal template. Adjust before publishing.

Description
MCP server over US row-crop pesticide labels (EPA PPLS + manufacturer sites). Feeds Drawbar farmer advisor.