Files
crop-chem-docs/PLAN.md
T

25 KiB
Raw Blame History

Docs MCP Server — Build Guide

A reusable recipe for building a hosted MCP server over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.


What you're building

A pipeline with these stages:

upstream docs portal
        │
        ▼
   scrape  ──► corpus/<bundle>/<page>.md + .json sidecar
        │
        ▼
    chunk + embed  ──► chroma/  (dense vectors)
        │           ──► bm25/   (FTS5 lexical index)
        ▼
   MCP server  ──► search_docs / get_page / diff_versions / weekly_digest /
                   find_doc_inconsistencies / ...
        │
        ▼
   reverse proxy / Cloudflare Tunnel ──► public endpoint

Two CI cadences:

  • Weekly cron (~40 min): full re-scrape, re-chunk, re-embed, image build & push.
  • On-demand image-only (~18 min): code-only rebuild from committed corpus, image build & push.

A container registry (self-hosted Gitea works well), a host running Docker Compose, Watchtower auto-updating from :latest, and a reverse proxy in front.


Build phases

Each phase is a discrete, shippable unit. Build them in order; each one is useful on its own and unlocks the next. Realistic effort per phase is given as a rough order of magnitude. Total: roughly 23 weeks of focused work for the full stack.

Phase 0 — Project skeleton (half a day)

Goals: directory layout, dependency manifest, virtualenv.

  • Top-level dirs: scrape/, corpus/ (gitignored), rag/, docs_mcp/, eval/, scripts/, deploy/, .gitea/workflows/.
  • requirements.txt with the dependencies you'll need across all phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML parser, ollama or sentence-transformers client, etc.).
  • python -m venv venv and pin Python version (3.11 or 3.12 — be conservative; some embedding libraries have version-specific wheels).
  • .gitignore: venv/, corpus/ (regenerable), chroma/ (regenerable), bm25/ (regenerable), *.pyc, __pycache__/, .pytest_cache/.

Phase 1 — Scraper (24 days, product-specific)

This is the most product-dependent phase. The goal is to write a scraper that produces a normalized corpus layout regardless of upstream portal shape.

Output shape (mandatory):

corpus/
  <bundle_id>/             # one dir per "doc bundle" — see Glossary
    <page_id>.md           # markdown body
    <page_id>.json         # sidecar with structured metadata
  ...
bundles.json               # catalog of bundles with metadata

Bundle metadata (bundles.json is a list of these):

{
  "slug":          "<bundle_id>",
  "title":         "User-facing title",
  "version":       "10.9",
  "platform":      "VMware vSphere",   // may be null
  "product":       "Admin Guide",       // optional but useful
  "language":      "en-US",
  "page_count":    127,
  "dates": {
    "Added on":    "2024-01-15",
    "Updated on":  "2026-05-20"
  },
  "landing_page":  "<page_id>"
}

Per-page sidecar (<page_id>.json) carries page-level metadata. The one field that matters cross-cutting is topic_cluster (see Phase 9):

{
  "bundle_id":     "<bundle_id>",
  "page_id":       "<page_id>",
  "title":         "How to ...",
  "ordinal":       42,
  "topic_cluster": {
    "clustering_title": "How to ...",
    "clustered_topics": [
      {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
      {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
    ]
  }
}

If the portal exposes a cross-version "this page corresponds to that page" mapping, capture it here. If it doesn't, you can synthesize a filename-based fallback (same filename across bundle versions = same topic) and live without the editor-curated mapping. The features that read topic_cluster (list_cluster, diff_versions, find_doc_inconsistencies, parts of weekly_digest) will work either way; they're more accurate with real clusters.

Patterns that recur across doc portals:

  • Most modern doc portals are SPAs. Plain requests.get won't see rendered content. Either find the underlying API the SPA calls (the cheapest, most reliable path), or fall back to a headless browser (Playwright). The API path is almost always available; sniff the network tab.
  • Portals usually expose a "bundle/topic" hierarchy under the hood (Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map it to bundles.json + corpus/<bundle>/<page>.
  • Many portals expose ?save_local= or .pdf rendered versions; the HTML they serve is structurally cleaner than what the page shows through the SPA shell.

scrape/changelog.py (~250 LOC; see Phase 13) — provides summarize_diff(), render_human(), walk_history() and the --json / --history-out modes. Mostly reusable as-is; the only product-specific bit is the path layout assumption.

Phase 2 — Chunking + embeddings + Chroma (2 days)

Goal: build a queryable dense index from the scraped corpus.

  • rag/chunk.py — split each page's markdown into ~400-600 token chunks. Strategy that works: paragraph-aware splitter with a rich "chunk 0" containing the page title + 1-sentence summary + bag-of-words from key terms. Chunk 0 is what dense retrieval lands on first; getting it right dominates retrieval quality.
  • rag/embeddings.py — pluggable embedder. Recommended start: Ollama-hosted nomic-embed-text (768-dim, free, good baseline). Other defensible choices: text-embedding-3-small (OpenAI), bge-m3 (also via Ollama). The embedder is a Chroma EmbeddingFunction that returns list[list[float]] for a list of texts.
  • rag/index.py — orchestrates: read corpus → emit chunks (with metadata: bundle_id, page_id, version, platform, ordinal) → upsert into Chroma collection. --rebuild flag for a clean reindex. Run via python -m rag.index --rebuild.

Chroma settings: PersistentClient(path="chroma/") and Settings(anonymized_telemetry=False). Single collection (<product>_docs).

GPU note: embedding 70K chunks on CPU takes hours; on a GPU (via Ollama with NVIDIA_VISIBLE_DEVICES) takes ~10 minutes. Two GPUs in parallel: ~5 minutes. The orchestrator just needs to load- balance HTTP requests across multiple Ollama endpoints.

Phase 3 — MCP server skeleton (1 day)

Goal: working FastMCP server with three tools — search_docs, get_page, list_versions.

  • docs_mcp/server.pyFastMCP("<product>-docs", stateless_http=True). stateless_http=True is critical for production hosting: every request creates an ephemeral session, so container recreates don't produce a 404 storm from stale mcp-session-id headers on clients.
  • Lazy initialization for everything expensive (Chroma client, embedder, bundles catalog) so the server starts cleanly even when Ollama is briefly unreachable.
  • Tool: search_docs(query, version=None, platform=None, bundle_id=None, k=10). Returns markdown of top-k chunks with full source URLs.
  • Tool: get_page(bundle_id, page_id). Returns full page markdown + metadata.
  • Tool: list_versions(). Returns the version/platform facets available, drawn from bundles.json. Helps the LLM pick filter values.

Transports: stdio (for local Claude Desktop dev), streamable-HTTP (for hosted production). One argparse switch.

@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
    ...
) -> str:
    ...

The tool descriptions are first-class context — the LLM reads them and decides whether to call the tool. Treat them as button labels; use "Call when..." / "Use proactively whenever..." phrasings.

Phase 4 — Containerization (1 day)

Goal: image you can run anywhere.

  • Dockerfile: Python 3.12-slim base, install requirements, COPY scrape rag diff docs_mcp + bundles.json + corpus/ chroma/
    • (later) bm25/. Don't COPY scripts/ — those stay external for ops use only.
  • ENTRYPOINT ["python", "-m", "docs_mcp.server", "--transport", "streamable-http"]. Configurable host/port via env.
  • deploy/docker-compose.yml: one service, named volumes for usage logs and any state, Watchtower label, depends_on for the reranker sidecar (Phase 6).

Smoke-test locally: docker compose up should expose http://localhost:8000/mcp and respond to an MCP initialize JSON-RPC.

Phase 5 — CI on self-hosted Gitea Actions (12 days)

Goal: weekly cron rebuild + on-demand code-only ship cycle.

Two workflows, two cadences:

Workflow Trigger Steps Runtime
refresh.yml Monday cron + manual dispatch scrape → commit corpus → rebuild indexes → build & push image ~40 min
image-only.yml manual dispatch only rebuild indexes from committed corpus → build & push image ~18 min

Critical settings (learned the hard way):

  • fetch-depth: 0 on actions/checkout@v4. The default depth is 1 (shallow), which breaks any step that walks git history (changelog, digest history walker). Pay the ~10 second cost; never debug a "0-byte history file" mystery.
  • runs-on: docker (Gitea convention, not ubuntu-latest).
  • Runner shell is /bin/sh (dash), not bash. ${VAR::N} substring expansion doesn't exist; use cut / printf / awk.

Retry-on-race pattern for long-running scrapes:

attempt=1
while [ $attempt -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  [ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
done

Works because scrape commits only touch corpus/ + bundles.json, and code merges only touch .py / .yml — disjoint paths, trivially clean rebases.

Image tagging — three tags per build:

Tag Purpose
:latest Watchtower watches this for auto-deploy
:<sha12> Immutable; rollback target
:<YYYY.MM.DD> Human-readable in incident notes

Same tag set on every build; rollback is a one-line compose edit to pin :<sha> instead of :latest.

Container registry behind Cloudflare:

Cloudflare's free tier has a 100 MB request body limit. Big image layers (Chroma index can easily be 800+ MB) exceed it on push. The fix is a LAN registry endpoint for push, public hostname for pull:

env:
  REGISTRY_PUSH: <lan-ip>:<port>     # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>   # response bodies aren't capped

Runner needs the LAN endpoint in /etc/docker/daemon.json insecure-registries. Costs nothing operationally; saves hours of debugging.

Registry GC: weekly cron in the workflow that walks the package versions, keeps :latest + N most-recent date tags + anything pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the package GC on the Gitea side reclaims disk after.

Phase 6 — Reranker (half a day)

Goal: lift retrieval quality 3× by cross-encoder reranking the top-N dense candidates.

  • A /v1/rerank HTTP endpoint backed by llama.cpp serving jina-reranker-v2-base (GGUF). Runs as a sidecar in compose. GPU strongly recommended (CPU latency is unworkable for live queries).
  • _rerank(query, docs) helper in the server: POST to the endpoint, apply the scores, re-sort the top-N candidates. Defensive: on any failure log a warning and fall through to dense-only.
  • Env: RERANK_URL (off by default), RERANK_POOL (how deep to pull candidates for reranking; 200 is a good default), RERANK_TIMEOUT (30s for cold-start tolerance).
  • Watch the per-pair token limit. Jina's GGUF reports n_ctx_train=1024 and llama.cpp will reject the ENTIRE batch if any pair exceeds it. Truncate doc text to ~2000 chars before reranking. The full untruncated chunk still goes back to the user; truncation is only for the reranker scoring path.

Phase 7 — Eval harness (1 day)

Goal: hand-curated golden queries + standard metrics so you can measure the impact of any retrieval change.

  • eval/queries.jsonl: 2025 hand-curated queries with expected pages. Spread across versions, platforms, and difficulty levels. Include the queries that "obviously" should work and DON'T — those are the ones to track.
  • eval/retrievers.py: a Retriever protocol with concrete implementations: DenseRetriever, RerankedRetriever, BM25Retriever (Phase 8), HybridRetriever (Phase 8). One matrix dimension per knob.
  • eval/run_eval.py: computes MRR / Recall@5 / nDCG@5 across all retrievers; emits a markdown comparison table at eval/results/<baseline>.md. Commit the result so PRs land with the A/B evidence in the diff.

Three numbers are enough — don't overengineer. The hand-curated queries are the value; the metrics are just a stable way to score them.

Phase 8 — BM25 + Hybrid retrieval (half a day, conditional)

Skip unless your eval shows specific failure modes. Dense embeddings + cross-encoder reranker handle most queries. The case where they don't: queries with rare technical tokens (filenames, language names, error codes) get buried at dense rank 1000+ by a much larger prose corpus that's semantically nearby. The reranker only sees top-200, so it never gets a shot.

  • rag/bm25.py: SQLite FTS5 index, in the stdlib, on-disk (bm25/<product>.db). Two tables — metadata table keyed by rowid, FTS5 virtual table for full-text. Sanitize the query (strip FTS5 reserved keywords, OR-join tokens for recall). ~210 LOC.
  • _rrf_fuse() in the server — Reciprocal Rank Fusion with k=60. Per-id score = sum_over_retrievers(1 / (k + rank)). Returns ordered ids plus per-retriever contribution dict for telemetry.
  • search_docs hybrid path: run dense + BM25 in parallel, RRF-fuse, hand the merged top-200 to the reranker. Env-gated: HYBRID_SEARCH=true.
  • Log top1_source per call (dense_only / bm25_only / both) to usage logs so you can measure whether BM25 is actually earning its keep on production traffic.

If after 46 weeks of production data you see bm25_only >= 80%, you can simplify to BM25-only (much less infrastructure). If both >= 50%, hybrid is acting as tie-breaker not rescue — keep it or simplify depending on how much you care about the long tail.

Phase 9 — Multi-version diff tooling (1 day, if applicable)

Only relevant if the product has multiple maintained versions.

  • diff_versions(bundle_id, page_id, against_bundle_id): unified diff between two versions of the same page. Two matching strategies: editor-curated topic_cluster peer (if the portal exposes it), or same-filename fallback.
  • list_cluster(bundle_id, page_id): list cross-version peers for one page.
  • bundle_changelog(bundle_id_new, bundle_id_old): added / removed / changed pages between two bundles, sorted by churn.
  • _diff_churn(a, b): small helper, ~15 LOC of difflib.unified_diff --unified=0 line counting. Used by bundle_changelog, find_doc_inconsistencies, and weekly_digest.

Phase 10 — Usage logging (half a day)

Goal: per-call JSONL telemetry so you can answer "what are people actually asking for" and "is the new feature getting used."

  • docs_mcp/usage.py: TimedCall context manager that captures tool name, args, elapsed time, hits returned, any extra fields set by the tool via _call.set(key=value). Writes JSONL to var/logs/usage.jsonl, rotated daily, kept 90 days.
  • Mount the log dir as a named compose volume so logs survive container recreates.
  • scripts/usage_report.py (standalone, no docs_mcp deps): reads the JSONL files, prints per-tool counts, top queries, 0-hit queries, filter usage histogram, reranker activity. Markdown output flag for piping into weekly digest emails.

What to log: query text, filters, hits returned, elapsed_ms, reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT to log: anything PII-shaped. The corpus is public, queries are usually about the product, not personal — but be deliberate.

Phase 11 — Curated knowledge layer (2 days)

The "RAG can't tell you what isn't in the docs" gap. Surfaces:

  • API quickstart repos if the product has them. Ingest the example scripts (Python, PowerShell, curl) into the corpus. Rewrite chunk-0 for each script to embed naturally — explicit natural-language H1, task description sentence, keyword bag. Dense embeddings need an anchor.
  • A curated <product>_api_lessons markdown doc for things the swagger / OpenAPI doesn't say: auth flow gotchas, async-task patterns, schema bugs you've hit, platform-detection quirks. Surface as a dedicated MCP tool whose description tells the LLM: "Call proactively whenever the user asks you to write a script / integrate with the API / debug a 4xx response."
  • An auto-hint banner in search_docs results — when the query matches a script/API trigger word, render a one-line nudge at the top of results pointing at the dedicated tool. Belt-and- suspenders for queries where the LLM doesn't think to call it proactively.

Phase 12 — Doc-inconsistency tool (half a day, optional)

A "scan the corpus for likely doc bugs" tool the model can call when an operator asks "is this section reliable?"

  • find_doc_inconsistencies(scope_query, version=None, platform=None, max_pages=30, checks=None): deterministic, read-only. Two checks: cross-version drift (pages whose content shifted between immediate- previous versions in the actionable 1060% churn band) and redirect-chain detection (short pages whose body is just a "see [other page] for details" pointer). Heavy lifting is line-level diff (difflib) against editor-curated cluster peers; the model judges which findings are real bugs.

Phase 13 — Weekly digest tool (half a day)

Goal: a tool that answers "what changed in the docs in the last N days?" with no runtime git dependency (the prod container has no git).

  • Extend scrape/changelog.py with --json (one-shot structured output) and --history-out PATH (walks git log --first-parent --since="<N> days ago" for corpus-touching commits, writes one JSON line per commit to a JSONL file).
  • CI workflows write the JSONL file into the image at build time: corpus/.digest/history.jsonl. Both refresh.yml and image-only.yml. fetch-depth: 0 is required — see Phase 5.
  • New MCP tool weekly_digest(days=7, version=None, platform=None, max_bundles=25, max_pages_per_bundle=10): reads the JSONL, filters to the window, applies version/platform via bundles.json metadata, aggregates per-bundle change counts and page lists, renders markdown.
  • Post-filter totals are critical: the headline "X page changes across Y bundles" must compute X from the filtered set, not the raw record count. Otherwise filtered calls look wrong to the reader.

Out of scope but trivial bolt-ons: scheduled HTML email of the digest, auto-publish to a blog, per-page diff excerpts as a follow-up tool.


Standard tool set

By the end you'll have ~15 tools registered. Production-tested shape:

Tool What it does
search_docs Semantic search with version/platform/bundle filters
get_page Full markdown + metadata for one page
list_versions Discover available facet values
list_cluster Cross-version peers for one page (if applicable)
diff_versions Unified diff of a page across two versions
bundle_changelog Added / removed / changed pages between two bundles
weekly_digest What changed in the last N days, with filters
corpus_status Freshness + size of the knowledge base
find_doc_inconsistencies Scoped scan for doc bugs
<product>_api_lessons Curated API gotchas, proactively-called
product-specific tools Interop matrix, lifecycle queries, etc.

Per-product customization checklist

When applying this template to a new product, here's what you have to figure out yourself — everything else is shared infrastructure:

  • Doc portal mechanics
    • URL pattern for pages
    • Bundle/version concept (Zoomin "bundle", Madcap "project", GitBook "space", Docusaurus "docs version" — same idea, different name)
    • SPA backing API (sniff the network tab) or fallback to headless browser
    • How topic_cluster -equivalent cross-version peers are exposed (or whether you synthesize them from filenames)
  • Bundle metadata schema
    • What does version look like? Semver, calendar, named?
    • What does platform mean for this product? Is there a useful facet at all?
    • Other useful facets (language, product line, edition)?
  • Filterable facets for search_docs
    • One filter per high-cardinality facet
    • Skip filters that have <5 distinct values — they're not worth the surface area
  • Curated knowledge for the _api_lessons tool
    • What does the product's API documentation NOT say that you've learned from real integration work?
  • Quickstart / example repos
    • Does the vendor publish working code? Ingest it; rewrite chunk-0 for natural-language retrieval.

Decisions worth carrying forward

Things you'll save time on by deciding the same way again:

  • Tool descriptions are user interface. The LLM reads them verbatim and decides whether to call the tool. "Use when..." and "Call proactively whenever..." are real surfaces; treat them like button labels. Most retrieval improvements turn out to be tool-description rewrites in disguise.
  • stateless_http=True on the FastMCP server. Eliminates whole categories of session-ID-related 404 storms after container recreates.
  • Pre-bake everything at CI time. No runtime calls to git, external services, or anything you wouldn't trust on a Cloudflare outage. If the digest needs git history, write a JSONL file at CI time. If the lessons doc needs to load fast, bake it into the image.
  • Env-gate every side-effecting tool. Off by default in dev; on only in production compose. Belt and suspenders against accidental writes from staging environments.
  • Operator-confirmation pattern for side-effecting tools. The tool docstring is the only place to enforce human-in-the-loop. Make it loud. "MANDATORY", "Do not loop", "show-confirm-then-submit" — those phrasings work.
  • Verify with hand-curated golden queries before shipping any retrieval change. Numbers in the diff, in the commit message. Don't ship retrieval changes on vibes.
  • Two-cadence CI (weekly scrape vs on-demand code-only) saves hours per code iteration once you're past the one-iteration-a-week stage.
  • Rolling tag + sha-pinned tag deploy pattern. :latest is what Watchtower watches; :<sha> is your safety net. Rollback is a one-line compose edit, not a redeploy.
  • Usage logging is non-negotiable. You will be wrong about what people use. Capture the truth from day one; let it tell you which features to keep building and which to delete.

Glossary

  • Bundle — one logical doc set in the portal. Zoomin calls them bundles; Madcap calls them projects; the concept is the same: a versioned, titled collection of pages. One dir under corpus/.
  • Page — one HTML page in a bundle. One .md + one .json sidecar under the bundle dir.
  • Topic cluster — Zoomin's name for "this page in version 10.9 corresponds to that page in version 10.8." Stored in the per-page sidecar. The portal-agnostic concept is "cross-version peer mapping."
  • Chunk — a unit of text that gets independently embedded and stored in Chroma. Target ~400-600 tokens; preserve paragraph boundaries.
  • RRF — Reciprocal Rank Fusion. The way to merge two ranked lists from independent retrievers without score calibration.

What's deliberately NOT in this template

Decisions you should make per-product (not copy from the original build):

  • The reverse proxy and TLS termination layer. Could be Caddy, nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
  • The Gateway / aggregator in front of multiple MCPs (MetaMCP is one option; you may not need any aggregator if you're running a single product MCP).
  • The specific embedding model — nomic-embed-text is a strong default but newer / domain-specific models may be better for some products.
  • The Ollama containers / GPU setup — depends on what hardware you have. The pattern is one container per GPU with explicit NVIDIA_VISIBLE_DEVICES pinning; the indexer load-balances across them.
  • Whether to publish a blog series alongside the build. Strongly recommended (forces clarity, builds an audience), but optional.