# Docs MCP Server — Build Guide A reusable recipe for building a hosted MCP server over a product's public documentation. Distilled from one production build; everything product-specific has been factored out. The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies. > **Domain note for crop-chem-docs.** This template was originally written > for versioned software product documentation (Zoomin bundles, Hugo > sites, etc.). For crop-chem-docs the domain is pesticide product labels — > the "bundle" abstraction has been replaced with "source" > (manufacturer or regulator), and "page" with "product label". The > canonical on-disk schema lives in [`scrape/README.md`](scrape/README.md), > not in this document. References below to `bundles.json`, `bundle_id`, > `--bundle`, `version`, and `platform` are template artifacts — read > them as `sources.json`, `source_id`, `--source`, and (mostly) > not-applicable. Phase 1 (scraper) is the most heavily adapted; later > phases (chunking, embeddings, retrieval, eval) apply largely as > written. --- ## What you're building A pipeline with these stages: ``` upstream docs portal │ ▼ scrape ──► corpus//.md + .json sidecar │ ▼ chunk + embed ──► chroma/ (dense vectors) │ ──► bm25/ (FTS5 lexical index) ▼ MCP server ──► search_docs / get_page / diff_versions / weekly_digest / find_doc_inconsistencies / ... │ ▼ reverse proxy / Cloudflare Tunnel ──► public endpoint ``` Two CI cadences: - **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed, image build & push. - **On-demand image-only** (~18 min): code-only rebuild from committed corpus, image build & push. A container registry (self-hosted Gitea works well), a host running Docker Compose, Watchtower auto-updating from `:latest`, and a reverse proxy in front. --- ## Build phases Each phase is a discrete, shippable unit. Build them in order; each one is useful on its own and unlocks the next. Realistic effort per phase is given as a rough order of magnitude. Total: roughly 2–3 weeks of focused work for the full stack. ### Phase 0 — Project skeleton *(half a day)* Goals: directory layout, dependency manifest, virtualenv. - Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`, `docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`. - `requirements.txt` with the dependencies you'll need across all phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML parser, ollama or sentence-transformers client, etc.). - `python -m venv venv` and pin Python version (3.11 or 3.12 — be conservative; some embedding libraries have version-specific wheels). - `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/` (regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`, `.pytest_cache/`. ### Phase 1 — Scraper *(2–4 days, product-specific)* This is the most product-dependent phase. The goal is to write a scraper that produces a normalized corpus layout regardless of upstream portal shape. Output shape (mandatory): ``` corpus/ / # one dir per "doc bundle" — see Glossary .md # markdown body .json # sidecar with structured metadata ... bundles.json # catalog of bundles with metadata ``` **Bundle metadata** (`bundles.json` is a list of these): ```json { "slug": "", "title": "User-facing title", "version": "10.9", "platform": "VMware vSphere", // may be null "product": "Admin Guide", // optional but useful "language": "en-US", "page_count": 127, "dates": { "Added on": "2024-01-15", "Updated on": "2026-05-20" }, "landing_page": "" } ``` **Per-page sidecar** (`.json`) carries page-level metadata. The one field that matters cross-cutting is `topic_cluster` (see Phase 9): ```json { "bundle_id": "", "page_id": "", "title": "How to ...", "ordinal": 42, "topic_cluster": { "clustering_title": "How to ...", "clustered_topics": [ {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."}, {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."} ] } } ``` If the portal exposes a cross-version "this page corresponds to that page" mapping, capture it here. If it doesn't, you can synthesize a filename-based fallback (same filename across bundle versions = same topic) and live without the editor-curated mapping. The features that read `topic_cluster` (`list_cluster`, `diff_versions`, `find_doc_inconsistencies`, parts of `weekly_digest`) will work either way; they're more accurate with real clusters. **Patterns that recur across doc portals:** - Most modern doc portals are SPAs. Plain `requests.get` won't see rendered content. Either find the underlying API the SPA calls (the cheapest, most reliable path), or fall back to a headless browser (Playwright). The API path is almost always available; sniff the network tab. - Portals usually expose a "bundle/topic" hierarchy under the hood (Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map it to `bundles.json` + `corpus//`. - Many portals expose `?save_local=` or `.pdf` rendered versions; the HTML they serve is structurally cleaner than what the page shows through the SPA shell. **`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides `summarize_diff()`, `render_human()`, `walk_history()` and the `--json` / `--history-out` modes. Mostly reusable as-is; the only product-specific bit is the path layout assumption. ### Phase 2 — Chunking + embeddings + Chroma *(2 days)* Goal: build a queryable dense index from the scraped corpus. - `rag/chunk.py` — split each page's markdown into ~400-600 token chunks. Strategy that works: paragraph-aware splitter with a rich "chunk 0" containing the page title + 1-sentence summary + bag-of-words from key terms. Chunk 0 is what dense retrieval lands on first; getting it right dominates retrieval quality. - `rag/embeddings.py` — pluggable embedder. Recommended start: Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline). Other defensible choices: `text-embedding-3-small` (OpenAI), `bge-m3` (also via Ollama). The embedder is a Chroma `EmbeddingFunction` that returns `list[list[float]]` for a list of texts. - `rag/index.py` — orchestrates: read corpus → emit chunks (with metadata: bundle_id, page_id, version, platform, ordinal) → upsert into Chroma collection. `--rebuild` flag for a clean reindex. Run via `python -m rag.index --rebuild`. Chroma settings: `PersistentClient(path="chroma/")` and `Settings(anonymized_telemetry=False)`. Single collection (`_docs`). **GPU note**: embedding 70K chunks on CPU takes hours; on a GPU (via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two GPUs in parallel: ~5 minutes. The orchestrator just needs to load- balance HTTP requests across multiple Ollama endpoints. ### Phase 3 — MCP server skeleton *(1 day)* Goal: working FastMCP server with three tools — `search_docs`, `get_page`, `list_versions`. - `docs_mcp/server.py` — `FastMCP("-docs", stateless_http=True)`. `stateless_http=True` is critical for production hosting: every request creates an ephemeral session, so container recreates don't produce a 404 storm from stale `mcp-session-id` headers on clients. - Lazy initialization for everything expensive (Chroma client, embedder, bundles catalog) so the server starts cleanly even when Ollama is briefly unreachable. - Tool: `search_docs(query, version=None, platform=None, bundle_id=None, k=10)`. Returns markdown of top-k chunks with full source URLs. - Tool: `get_page(bundle_id, page_id)`. Returns full page markdown + metadata. - Tool: `list_versions()`. Returns the version/platform facets available, drawn from `bundles.json`. Helps the LLM pick filter values. Transports: stdio (for local Claude Desktop dev), streamable-HTTP (for hosted production). One argparse switch. ```python @mcp.tool() def search_docs( query: Annotated[str, Field(description="Natural-language query about .")], version: Annotated[str | None, Field(description="Restrict to one version")] = None, ... ) -> str: ... ``` The tool descriptions are first-class context — the LLM reads them and decides whether to call the tool. Treat them as button labels; use "Call when..." / "Use proactively whenever..." phrasings. ### Phase 4 — Containerization *(1 day)* Goal: image you can run anywhere. - `Dockerfile`: Python 3.12-slim base, install requirements, COPY `scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/` + (later) `bm25/`. Don't COPY `scripts/` — those stay external for ops use only. - `ENTRYPOINT ["python", "-m", "docs_mcp.server", "--transport", "streamable-http"]`. Configurable host/port via env. - `deploy/docker-compose.yml`: one service, named volumes for usage logs and any state, Watchtower label, depends_on for the reranker sidecar (Phase 6). Smoke-test locally: `docker compose up` should expose `http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC. ### Phase 5 — CI on self-hosted Gitea Actions *(1–2 days)* Goal: weekly cron rebuild + on-demand code-only ship cycle. **Two workflows, two cadences:** | Workflow | Trigger | Steps | Runtime | |---|---|---|---| | `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min | | `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min | **Critical settings (learned the hard way):** - `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1 (shallow), which breaks any step that walks git history (changelog, digest history walker). Pay the ~10 second cost; never debug a "0-byte history file" mystery. - `runs-on: docker` (Gitea convention, not `ubuntu-latest`). - Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring expansion doesn't exist; use `cut` / `printf` / `awk`. **Retry-on-race pattern for long-running scrapes:** ```bash attempt=1 while [ $attempt -le 3 ]; do if git push; then echo "pushed (attempt $attempt)" break fi [ $attempt -eq 3 ] && { echo "still failing"; exit 1; } git fetch origin main git rebase origin/main || { echo "conflict — bail"; exit 1; } attempt=$((attempt + 1)) done ``` Works because scrape commits only touch `corpus/` + `bundles.json`, and code merges only touch `.py` / `.yml` — disjoint paths, trivially clean rebases. **Image tagging — three tags per build:** | Tag | Purpose | |---|---| | `:latest` | Watchtower watches this for auto-deploy | | `:` | Immutable; rollback target | | `:` | Human-readable in incident notes | Same tag set on every build; rollback is a one-line compose edit to pin `:` instead of `:latest`. **Container registry behind Cloudflare:** Cloudflare's free tier has a 100 MB request body limit. Big image layers (Chroma index can easily be 800+ MB) exceed it on push. The fix is a LAN registry endpoint for push, public hostname for pull: ```yaml env: REGISTRY_PUSH: : # bypasses Cloudflare REGISTRY_PULL: # response bodies aren't capped ``` Runner needs the LAN endpoint in `/etc/docker/daemon.json` `insecure-registries`. Costs nothing operationally; saves hours of debugging. **Registry GC:** weekly cron in the workflow that walks the package versions, keeps `:latest` + N most-recent date tags + anything pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the package GC on the Gitea side reclaims disk after. ### Phase 6 — Reranker *(half a day)* Goal: lift retrieval quality 3× by cross-encoder reranking the top-N dense candidates. - A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving `jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose. GPU strongly recommended (CPU latency is unworkable for live queries). - `_rerank(query, docs)` helper in the server: POST to the endpoint, apply the scores, re-sort the top-N candidates. Defensive: on any failure log a warning and fall through to dense-only. - Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to pull candidates for reranking; 200 is a good default), `RERANK_TIMEOUT` (30s for cold-start tolerance). - **Watch the per-pair token limit.** Jina's GGUF reports `n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if any pair exceeds it. Truncate doc text to ~2000 chars before reranking. The full untruncated chunk still goes back to the user; truncation is only for the reranker scoring path. ### Phase 7 — Eval harness *(1 day)* Goal: hand-curated golden queries + standard metrics so you can measure the impact of any retrieval change. - `eval/queries.jsonl`: 20–25 hand-curated queries with expected pages. Spread across versions, platforms, and difficulty levels. Include the queries that "obviously" should work and DON'T — those are the ones to track. - `eval/retrievers.py`: a `Retriever` protocol with concrete implementations: `DenseRetriever`, `RerankedRetriever`, `BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One matrix dimension per knob. - `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all retrievers; emits a markdown comparison table at `eval/results/.md`. Commit the result so PRs land with the A/B evidence in the diff. Three numbers are enough — don't overengineer. The hand-curated queries are the value; the metrics are just a stable way to score them. ### Phase 8 — BM25 + Hybrid retrieval *(half a day, conditional)* **Skip unless your eval shows specific failure modes.** Dense embeddings + cross-encoder reranker handle most queries. The case where they don't: queries with rare technical tokens (filenames, language names, error codes) get buried at dense rank 1000+ by a much larger prose corpus that's semantically nearby. The reranker only sees top-200, so it never gets a shot. - `rag/bm25.py`: SQLite FTS5 index, in the stdlib, on-disk (`bm25/.db`). Two tables — metadata table keyed by rowid, FTS5 virtual table for full-text. Sanitize the query (strip FTS5 reserved keywords, OR-join tokens for recall). ~210 LOC. - `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`. Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns ordered ids plus per-retriever contribution dict for telemetry. - `search_docs` hybrid path: run dense + BM25 in parallel, RRF-fuse, hand the merged top-200 to the reranker. Env-gated: `HYBRID_SEARCH=true`. - Log `top1_source` per call (`dense_only` / `bm25_only` / `both`) to usage logs so you can measure whether BM25 is actually earning its keep on production traffic. If after 4–6 weeks of production data you see `bm25_only >= 80%`, you can simplify to BM25-only (much less infrastructure). If `both >= 50%`, hybrid is acting as tie-breaker not rescue — keep it or simplify depending on how much you care about the long tail. ### Phase 9 — Multi-version diff tooling *(1 day, if applicable)* **Only relevant if the product has multiple maintained versions.** - `diff_versions(bundle_id, page_id, against_bundle_id)`: unified diff between two versions of the same page. Two matching strategies: editor-curated `topic_cluster` peer (if the portal exposes it), or same-filename fallback. - `list_cluster(bundle_id, page_id)`: list cross-version peers for one page. - `bundle_changelog(bundle_id_new, bundle_id_old)`: added / removed / changed pages between two bundles, sorted by churn. - `_diff_churn(a, b)`: small helper, ~15 LOC of `difflib.unified_diff --unified=0` line counting. Used by `bundle_changelog`, `find_doc_inconsistencies`, and `weekly_digest`. ### Phase 10 — Usage logging *(half a day)* Goal: per-call JSONL telemetry so you can answer "what are people actually asking for" and "is the new feature getting used." - `docs_mcp/usage.py`: `TimedCall` context manager that captures tool name, args, elapsed time, hits returned, any extra fields set by the tool via `_call.set(key=value)`. Writes JSONL to `var/logs/usage.jsonl`, rotated daily, kept 90 days. - Mount the log dir as a named compose volume so logs survive container recreates. - `scripts/usage_report.py` (standalone, no docs_mcp deps): reads the JSONL files, prints per-tool counts, top queries, 0-hit queries, filter usage histogram, reranker activity. Markdown output flag for piping into weekly digest emails. What to log: query text, filters, hits returned, elapsed_ms, reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT to log: anything PII-shaped. The corpus is public, queries are usually about the product, not personal — but be deliberate. ### Phase 11 — Curated knowledge layer *(2 days)* The "RAG can't tell you what isn't in the docs" gap. Surfaces: - **API quickstart repos** if the product has them. Ingest the example scripts (Python, PowerShell, curl) into the corpus. Rewrite chunk-0 for each script to embed naturally — explicit natural-language H1, task description sentence, keyword bag. Dense embeddings need an anchor. - **A curated `_api_lessons` markdown doc** for things the swagger / OpenAPI doesn't say: auth flow gotchas, async-task patterns, schema bugs you've hit, platform-detection quirks. Surface as a dedicated MCP tool whose description tells the LLM: *"Call proactively whenever the user asks you to write a script / integrate with the API / debug a 4xx response."* - **An auto-hint banner** in `search_docs` results — when the query matches a script/API trigger word, render a one-line nudge at the top of results pointing at the dedicated tool. Belt-and- suspenders for queries where the LLM doesn't think to call it proactively. ### Phase 12 — Doc-inconsistency tool *(half a day, optional)* A *"scan the corpus for likely doc bugs"* tool the model can call when an operator asks "is this section reliable?" - `find_doc_inconsistencies(scope_query, version=None, platform=None, max_pages=30, checks=None)`: deterministic, read-only. Two checks: cross-version drift (pages whose content shifted between immediate- previous versions in the actionable 10–60% churn band) and redirect-chain detection (short pages whose body is just a "see [other page] for details" pointer). Heavy lifting is line-level diff (`difflib`) against editor-curated cluster peers; the model judges which findings are real bugs. ### Phase 13 — Weekly digest tool *(half a day)* Goal: a tool that answers *"what changed in the docs in the last N days?"* with no runtime git dependency (the prod container has no git). - Extend `scrape/changelog.py` with `--json` (one-shot structured output) and `--history-out PATH` (walks `git log --first-parent --since=" days ago"` for corpus-touching commits, writes one JSON line per commit to a JSONL file). - CI workflows write the JSONL file into the image at build time: `corpus/.digest/history.jsonl`. Both `refresh.yml` and `image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5. - New MCP tool `weekly_digest(days=7, version=None, platform=None, max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL, filters to the window, applies version/platform via `bundles.json` metadata, aggregates per-bundle change counts and page lists, renders markdown. - Post-filter totals are critical: the headline "X page changes across Y bundles" must compute X from the filtered set, not the raw record count. Otherwise filtered calls look wrong to the reader. Out of scope but trivial bolt-ons: scheduled HTML email of the digest, auto-publish to a blog, per-page diff excerpts as a follow-up tool. --- ## Standard tool set By the end you'll have ~15 tools registered. Production-tested shape: | Tool | What it does | |---|---| | `search_docs` | Semantic search with version/platform/bundle filters | | `get_page` | Full markdown + metadata for one page | | `list_versions` | Discover available facet values | | `list_cluster` | Cross-version peers for one page (if applicable) | | `diff_versions` | Unified diff of a page across two versions | | `bundle_changelog` | Added / removed / changed pages between two bundles | | `weekly_digest` | What changed in the last N days, with filters | | `corpus_status` | Freshness + size of the knowledge base | | `find_doc_inconsistencies` | Scoped scan for doc bugs | | `_api_lessons` | Curated API gotchas, proactively-called | | product-specific tools | Interop matrix, lifecycle queries, etc. | --- ## Per-product customization checklist When applying this template to a new product, here's what you have to figure out yourself — everything else is shared infrastructure: - **Doc portal mechanics** - URL pattern for pages - Bundle/version concept (Zoomin "bundle", Madcap "project", GitBook "space", Docusaurus "docs version" — same idea, different name) - SPA backing API (sniff the network tab) or fallback to headless browser - How `topic_cluster` -equivalent cross-version peers are exposed (or whether you synthesize them from filenames) - **Bundle metadata schema** - What does `version` look like? Semver, calendar, named? - What does `platform` mean for this product? Is there a useful facet at all? - Other useful facets (language, product line, edition)? - **Filterable facets** for `search_docs` - One filter per high-cardinality facet - Skip filters that have <5 distinct values — they're not worth the surface area - **Curated knowledge** for the `_api_lessons` tool - What does the product's API documentation NOT say that you've learned from real integration work? - **Quickstart / example repos** - Does the vendor publish working code? Ingest it; rewrite chunk-0 for natural-language retrieval. --- ## Decisions worth carrying forward Things you'll save time on by deciding the same way again: - **Tool descriptions are user interface.** The LLM reads them verbatim and decides whether to call the tool. *"Use when..."* and *"Call proactively whenever..."* are real surfaces; treat them like button labels. Most retrieval improvements turn out to be tool-description rewrites in disguise. - **`stateless_http=True`** on the FastMCP server. Eliminates whole categories of session-ID-related 404 storms after container recreates. - **Pre-bake everything at CI time.** No runtime calls to git, external services, or anything you wouldn't trust on a Cloudflare outage. If the digest needs git history, write a JSONL file at CI time. If the lessons doc needs to load fast, bake it into the image. - **Env-gate every side-effecting tool.** Off by default in dev; on only in production compose. Belt and suspenders against accidental writes from staging environments. - **Operator-confirmation pattern for side-effecting tools.** The tool docstring is the only place to enforce human-in-the-loop. Make it loud. "MANDATORY", "Do not loop", "show-confirm-then-submit" — those phrasings work. - **Verify with hand-curated golden queries before shipping any retrieval change.** Numbers in the diff, in the commit message. Don't ship retrieval changes on vibes. - **Two-cadence CI** (weekly scrape vs on-demand code-only) saves hours per code iteration once you're past the one-iteration-a-week stage. - **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is what Watchtower watches; `:` is your safety net. Rollback is a one-line compose edit, not a redeploy. - **Usage logging is non-negotiable.** You will be wrong about what people use. Capture the truth from day one; let it tell you which features to keep building and which to delete. --- ## Glossary - **Bundle** — one logical doc set in the portal. Zoomin calls them bundles; Madcap calls them projects; the concept is the same: a versioned, titled collection of pages. One dir under `corpus/`. - **Page** — one HTML page in a bundle. One `.md` + one `.json` sidecar under the bundle dir. - **Topic cluster** — Zoomin's name for "this page in version 10.9 corresponds to that page in version 10.8." Stored in the per-page sidecar. The portal-agnostic concept is "cross-version peer mapping." - **Chunk** — a unit of text that gets independently embedded and stored in Chroma. Target ~400-600 tokens; preserve paragraph boundaries. - **RRF** — Reciprocal Rank Fusion. The way to merge two ranked lists from independent retrievers without score calibration. --- ## What's deliberately NOT in this template Decisions you should make per-product (not copy from the original build): - The reverse proxy and TLS termination layer. Could be Caddy, nginx, Traefik, Cloudflare Tunnel — pick what your infra uses. - The Gateway / aggregator in front of multiple MCPs (MetaMCP is one option; you may not need any aggregator if you're running a single product MCP). - The specific embedding model — `nomic-embed-text` is a strong default but newer / domain-specific models may be better for some products. - The Ollama containers / GPU setup — depends on what hardware you have. The pattern is one container per GPU with explicit `NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances across them. - Whether to publish a blog series alongside the build. Strongly recommended (forces clarity, builds an audience), but optional.