25 KiB
Docs MCP Server — Build Guide
A reusable recipe for building a hosted MCP server over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.
The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.
What you're building
A pipeline with these stages:
upstream docs portal
│
▼
scrape ──► corpus/<bundle>/<page>.md + .json sidecar
│
▼
chunk + embed ──► chroma/ (dense vectors)
│ ──► bm25/ (FTS5 lexical index)
▼
MCP server ──► search_docs / get_page / diff_versions / weekly_digest /
find_doc_inconsistencies / ...
│
▼
reverse proxy / Cloudflare Tunnel ──► public endpoint
Two CI cadences:
- Weekly cron (~40 min): full re-scrape, re-chunk, re-embed, image build & push.
- On-demand image-only (~18 min): code-only rebuild from committed corpus, image build & push.
A container registry (self-hosted Gitea works well), a host running
Docker Compose, Watchtower auto-updating from :latest, and a
reverse proxy in front.
Build phases
Each phase is a discrete, shippable unit. Build them in order; each one is useful on its own and unlocks the next. Realistic effort per phase is given as a rough order of magnitude. Total: roughly 2–3 weeks of focused work for the full stack.
Phase 0 — Project skeleton (half a day)
Goals: directory layout, dependency manifest, virtualenv.
- Top-level dirs:
scrape/,corpus/(gitignored),rag/,docs_mcp/,eval/,scripts/,deploy/,.gitea/workflows/. requirements.txtwith the dependencies you'll need across all phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML parser, ollama or sentence-transformers client, etc.).python -m venv venvand pin Python version (3.11 or 3.12 — be conservative; some embedding libraries have version-specific wheels)..gitignore:venv/,corpus/(regenerable),chroma/(regenerable),bm25/(regenerable),*.pyc,__pycache__/,.pytest_cache/.
Phase 1 — Scraper (2–4 days, product-specific)
This is the most product-dependent phase. The goal is to write a scraper that produces a normalized corpus layout regardless of upstream portal shape.
Output shape (mandatory):
corpus/
<bundle_id>/ # one dir per "doc bundle" — see Glossary
<page_id>.md # markdown body
<page_id>.json # sidecar with structured metadata
...
bundles.json # catalog of bundles with metadata
Bundle metadata (bundles.json is a list of these):
{
"slug": "<bundle_id>",
"title": "User-facing title",
"version": "10.9",
"platform": "VMware vSphere", // may be null
"product": "Admin Guide", // optional but useful
"language": "en-US",
"page_count": 127,
"dates": {
"Added on": "2024-01-15",
"Updated on": "2026-05-20"
},
"landing_page": "<page_id>"
}
Per-page sidecar (<page_id>.json) carries page-level metadata.
The one field that matters cross-cutting is topic_cluster (see
Phase 9):
{
"bundle_id": "<bundle_id>",
"page_id": "<page_id>",
"title": "How to ...",
"ordinal": 42,
"topic_cluster": {
"clustering_title": "How to ...",
"clustered_topics": [
{"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
{"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
]
}
}
If the portal exposes a cross-version "this page corresponds to that
page" mapping, capture it here. If it doesn't, you can synthesize a
filename-based fallback (same filename across bundle versions = same
topic) and live without the editor-curated mapping. The features that
read topic_cluster (list_cluster, diff_versions,
find_doc_inconsistencies, parts of weekly_digest) will work
either way; they're more accurate with real clusters.
Patterns that recur across doc portals:
- Most modern doc portals are SPAs. Plain
requests.getwon't see rendered content. Either find the underlying API the SPA calls (the cheapest, most reliable path), or fall back to a headless browser (Playwright). The API path is almost always available; sniff the network tab. - Portals usually expose a "bundle/topic" hierarchy under the hood
(Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
it to
bundles.json+corpus/<bundle>/<page>. - Many portals expose
?save_local=or.pdfrendered versions; the HTML they serve is structurally cleaner than what the page shows through the SPA shell.
scrape/changelog.py (~250 LOC; see Phase 13) — provides
summarize_diff(), render_human(), walk_history() and the
--json / --history-out modes. Mostly reusable as-is; the only
product-specific bit is the path layout assumption.
Phase 2 — Chunking + embeddings + Chroma (2 days)
Goal: build a queryable dense index from the scraped corpus.
rag/chunk.py— split each page's markdown into ~400-600 token chunks. Strategy that works: paragraph-aware splitter with a rich "chunk 0" containing the page title + 1-sentence summary + bag-of-words from key terms. Chunk 0 is what dense retrieval lands on first; getting it right dominates retrieval quality.rag/embeddings.py— pluggable embedder. Recommended start: Ollama-hostednomic-embed-text(768-dim, free, good baseline). Other defensible choices:text-embedding-3-small(OpenAI),bge-m3(also via Ollama). The embedder is a ChromaEmbeddingFunctionthat returnslist[list[float]]for a list of texts.rag/index.py— orchestrates: read corpus → emit chunks (with metadata: bundle_id, page_id, version, platform, ordinal) → upsert into Chroma collection.--rebuildflag for a clean reindex. Run viapython -m rag.index --rebuild.
Chroma settings: PersistentClient(path="chroma/") and
Settings(anonymized_telemetry=False). Single collection
(<product>_docs).
GPU note: embedding 70K chunks on CPU takes hours; on a GPU
(via Ollama with NVIDIA_VISIBLE_DEVICES) takes ~10 minutes. Two
GPUs in parallel: ~5 minutes. The orchestrator just needs to load-
balance HTTP requests across multiple Ollama endpoints.
Phase 3 — MCP server skeleton (1 day)
Goal: working FastMCP server with three tools — search_docs,
get_page, list_versions.
docs_mcp/server.py—FastMCP("<product>-docs", stateless_http=True).stateless_http=Trueis critical for production hosting: every request creates an ephemeral session, so container recreates don't produce a 404 storm from stalemcp-session-idheaders on clients.- Lazy initialization for everything expensive (Chroma client, embedder, bundles catalog) so the server starts cleanly even when Ollama is briefly unreachable.
- Tool:
search_docs(query, version=None, platform=None, bundle_id=None, k=10). Returns markdown of top-k chunks with full source URLs. - Tool:
get_page(bundle_id, page_id). Returns full page markdown + metadata. - Tool:
list_versions(). Returns the version/platform facets available, drawn frombundles.json. Helps the LLM pick filter values.
Transports: stdio (for local Claude Desktop dev), streamable-HTTP (for hosted production). One argparse switch.
@mcp.tool()
def search_docs(
query: Annotated[str, Field(description="Natural-language query about <product>.")],
version: Annotated[str | None, Field(description="Restrict to one version")] = None,
...
) -> str:
...
The tool descriptions are first-class context — the LLM reads them and decides whether to call the tool. Treat them as button labels; use "Call when..." / "Use proactively whenever..." phrasings.
Phase 4 — Containerization (1 day)
Goal: image you can run anywhere.
Dockerfile: Python 3.12-slim base, install requirements, COPYscrape rag diff docs_mcp+bundles.json+corpus/ chroma/- (later)
bm25/. Don't COPYscripts/— those stay external for ops use only.
- (later)
ENTRYPOINT ["python", "-m", "docs_mcp.server", "--transport", "streamable-http"]. Configurable host/port via env.deploy/docker-compose.yml: one service, named volumes for usage logs and any state, Watchtower label, depends_on for the reranker sidecar (Phase 6).
Smoke-test locally: docker compose up should expose
http://localhost:8000/mcp and respond to an MCP initialize JSON-RPC.
Phase 5 — CI on self-hosted Gitea Actions (1–2 days)
Goal: weekly cron rebuild + on-demand code-only ship cycle.
Two workflows, two cadences:
| Workflow | Trigger | Steps | Runtime |
|---|---|---|---|
refresh.yml |
Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
image-only.yml |
manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |
Critical settings (learned the hard way):
fetch-depth: 0onactions/checkout@v4. The default depth is 1 (shallow), which breaks any step that walks git history (changelog, digest history walker). Pay the ~10 second cost; never debug a "0-byte history file" mystery.runs-on: docker(Gitea convention, notubuntu-latest).- Runner shell is
/bin/sh(dash), not bash.${VAR::N}substring expansion doesn't exist; usecut/printf/awk.
Retry-on-race pattern for long-running scrapes:
attempt=1
while [ $attempt -le 3 ]; do
if git push; then
echo "pushed (attempt $attempt)"
break
fi
[ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
git fetch origin main
git rebase origin/main || { echo "conflict — bail"; exit 1; }
attempt=$((attempt + 1))
done
Works because scrape commits only touch corpus/ + bundles.json,
and code merges only touch .py / .yml — disjoint paths, trivially
clean rebases.
Image tagging — three tags per build:
| Tag | Purpose |
|---|---|
:latest |
Watchtower watches this for auto-deploy |
:<sha12> |
Immutable; rollback target |
:<YYYY.MM.DD> |
Human-readable in incident notes |
Same tag set on every build; rollback is a one-line compose edit
to pin :<sha> instead of :latest.
Container registry behind Cloudflare:
Cloudflare's free tier has a 100 MB request body limit. Big image layers (Chroma index can easily be 800+ MB) exceed it on push. The fix is a LAN registry endpoint for push, public hostname for pull:
env:
REGISTRY_PUSH: <lan-ip>:<port> # bypasses Cloudflare
REGISTRY_PULL: <public-hostname> # response bodies aren't capped
Runner needs the LAN endpoint in /etc/docker/daemon.json
insecure-registries. Costs nothing operationally; saves hours
of debugging.
Registry GC: weekly cron in the workflow that walks the package
versions, keeps :latest + N most-recent date tags + anything
pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the
package GC on the Gitea side reclaims disk after.
Phase 6 — Reranker (half a day)
Goal: lift retrieval quality 3× by cross-encoder reranking the top-N dense candidates.
- A
/v1/rerankHTTP endpoint backed byllama.cppservingjina-reranker-v2-base(GGUF). Runs as a sidecar in compose. GPU strongly recommended (CPU latency is unworkable for live queries). _rerank(query, docs)helper in the server: POST to the endpoint, apply the scores, re-sort the top-N candidates. Defensive: on any failure log a warning and fall through to dense-only.- Env:
RERANK_URL(off by default),RERANK_POOL(how deep to pull candidates for reranking; 200 is a good default),RERANK_TIMEOUT(30s for cold-start tolerance). - Watch the per-pair token limit. Jina's GGUF reports
n_ctx_train=1024and llama.cpp will reject the ENTIRE batch if any pair exceeds it. Truncate doc text to ~2000 chars before reranking. The full untruncated chunk still goes back to the user; truncation is only for the reranker scoring path.
Phase 7 — Eval harness (1 day)
Goal: hand-curated golden queries + standard metrics so you can measure the impact of any retrieval change.
eval/queries.jsonl: 20–25 hand-curated queries with expected pages. Spread across versions, platforms, and difficulty levels. Include the queries that "obviously" should work and DON'T — those are the ones to track.eval/retrievers.py: aRetrieverprotocol with concrete implementations:DenseRetriever,RerankedRetriever,BM25Retriever(Phase 8),HybridRetriever(Phase 8). One matrix dimension per knob.eval/run_eval.py: computes MRR / Recall@5 / nDCG@5 across all retrievers; emits a markdown comparison table ateval/results/<baseline>.md. Commit the result so PRs land with the A/B evidence in the diff.
Three numbers are enough — don't overengineer. The hand-curated queries are the value; the metrics are just a stable way to score them.
Phase 8 — BM25 + Hybrid retrieval (half a day, conditional)
Skip unless your eval shows specific failure modes. Dense embeddings + cross-encoder reranker handle most queries. The case where they don't: queries with rare technical tokens (filenames, language names, error codes) get buried at dense rank 1000+ by a much larger prose corpus that's semantically nearby. The reranker only sees top-200, so it never gets a shot.
rag/bm25.py: SQLite FTS5 index, in the stdlib, on-disk (bm25/<product>.db). Two tables — metadata table keyed by rowid, FTS5 virtual table for full-text. Sanitize the query (strip FTS5 reserved keywords, OR-join tokens for recall). ~210 LOC._rrf_fuse()in the server — Reciprocal Rank Fusion withk=60. Per-id score =sum_over_retrievers(1 / (k + rank)). Returns ordered ids plus per-retriever contribution dict for telemetry.search_docshybrid path: run dense + BM25 in parallel, RRF-fuse, hand the merged top-200 to the reranker. Env-gated:HYBRID_SEARCH=true.- Log
top1_sourceper call (dense_only/bm25_only/both) to usage logs so you can measure whether BM25 is actually earning its keep on production traffic.
If after 4–6 weeks of production data you see bm25_only >= 80%,
you can simplify to BM25-only (much less infrastructure). If
both >= 50%, hybrid is acting as tie-breaker not rescue — keep it
or simplify depending on how much you care about the long tail.
Phase 9 — Multi-version diff tooling (1 day, if applicable)
Only relevant if the product has multiple maintained versions.
diff_versions(bundle_id, page_id, against_bundle_id): unified diff between two versions of the same page. Two matching strategies: editor-curatedtopic_clusterpeer (if the portal exposes it), or same-filename fallback.list_cluster(bundle_id, page_id): list cross-version peers for one page.bundle_changelog(bundle_id_new, bundle_id_old): added / removed / changed pages between two bundles, sorted by churn._diff_churn(a, b): small helper, ~15 LOC ofdifflib.unified_diff --unified=0line counting. Used bybundle_changelog,find_doc_inconsistencies, andweekly_digest.
Phase 10 — Usage logging (half a day)
Goal: per-call JSONL telemetry so you can answer "what are people actually asking for" and "is the new feature getting used."
docs_mcp/usage.py:TimedCallcontext manager that captures tool name, args, elapsed time, hits returned, any extra fields set by the tool via_call.set(key=value). Writes JSONL tovar/logs/usage.jsonl, rotated daily, kept 90 days.- Mount the log dir as a named compose volume so logs survive container recreates.
scripts/usage_report.py(standalone, no docs_mcp deps): reads the JSONL files, prints per-tool counts, top queries, 0-hit queries, filter usage histogram, reranker activity. Markdown output flag for piping into weekly digest emails.
What to log: query text, filters, hits returned, elapsed_ms, reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT to log: anything PII-shaped. The corpus is public, queries are usually about the product, not personal — but be deliberate.
Phase 11 — Curated knowledge layer (2 days)
The "RAG can't tell you what isn't in the docs" gap. Surfaces:
- API quickstart repos if the product has them. Ingest the example scripts (Python, PowerShell, curl) into the corpus. Rewrite chunk-0 for each script to embed naturally — explicit natural-language H1, task description sentence, keyword bag. Dense embeddings need an anchor.
- A curated
<product>_api_lessonsmarkdown doc for things the swagger / OpenAPI doesn't say: auth flow gotchas, async-task patterns, schema bugs you've hit, platform-detection quirks. Surface as a dedicated MCP tool whose description tells the LLM: "Call proactively whenever the user asks you to write a script / integrate with the API / debug a 4xx response." - An auto-hint banner in
search_docsresults — when the query matches a script/API trigger word, render a one-line nudge at the top of results pointing at the dedicated tool. Belt-and- suspenders for queries where the LLM doesn't think to call it proactively.
Phase 12 — Doc-inconsistency tool (half a day, optional)
A "scan the corpus for likely doc bugs" tool the model can call when an operator asks "is this section reliable?"
find_doc_inconsistencies(scope_query, version=None, platform=None, max_pages=30, checks=None): deterministic, read-only. Two checks: cross-version drift (pages whose content shifted between immediate- previous versions in the actionable 10–60% churn band) and redirect-chain detection (short pages whose body is just a "see [other page] for details" pointer). Heavy lifting is line-level diff (difflib) against editor-curated cluster peers; the model judges which findings are real bugs.
Phase 13 — Weekly digest tool (half a day)
Goal: a tool that answers "what changed in the docs in the last N days?" with no runtime git dependency (the prod container has no git).
- Extend
scrape/changelog.pywith--json(one-shot structured output) and--history-out PATH(walksgit log --first-parent --since="<N> days ago"for corpus-touching commits, writes one JSON line per commit to a JSONL file). - CI workflows write the JSONL file into the image at build time:
corpus/.digest/history.jsonl. Bothrefresh.ymlandimage-only.yml.fetch-depth: 0is required — see Phase 5. - New MCP tool
weekly_digest(days=7, version=None, platform=None, max_bundles=25, max_pages_per_bundle=10): reads the JSONL, filters to the window, applies version/platform viabundles.jsonmetadata, aggregates per-bundle change counts and page lists, renders markdown. - Post-filter totals are critical: the headline "X page changes across Y bundles" must compute X from the filtered set, not the raw record count. Otherwise filtered calls look wrong to the reader.
Out of scope but trivial bolt-ons: scheduled HTML email of the digest, auto-publish to a blog, per-page diff excerpts as a follow-up tool.
Standard tool set
By the end you'll have ~15 tools registered. Production-tested shape:
| Tool | What it does |
|---|---|
search_docs |
Semantic search with version/platform/bundle filters |
get_page |
Full markdown + metadata for one page |
list_versions |
Discover available facet values |
list_cluster |
Cross-version peers for one page (if applicable) |
diff_versions |
Unified diff of a page across two versions |
bundle_changelog |
Added / removed / changed pages between two bundles |
weekly_digest |
What changed in the last N days, with filters |
corpus_status |
Freshness + size of the knowledge base |
find_doc_inconsistencies |
Scoped scan for doc bugs |
<product>_api_lessons |
Curated API gotchas, proactively-called |
| product-specific tools | Interop matrix, lifecycle queries, etc. |
Per-product customization checklist
When applying this template to a new product, here's what you have to figure out yourself — everything else is shared infrastructure:
- Doc portal mechanics
- URL pattern for pages
- Bundle/version concept (Zoomin "bundle", Madcap "project", GitBook "space", Docusaurus "docs version" — same idea, different name)
- SPA backing API (sniff the network tab) or fallback to headless browser
- How
topic_cluster-equivalent cross-version peers are exposed (or whether you synthesize them from filenames)
- Bundle metadata schema
- What does
versionlook like? Semver, calendar, named? - What does
platformmean for this product? Is there a useful facet at all? - Other useful facets (language, product line, edition)?
- What does
- Filterable facets for
search_docs- One filter per high-cardinality facet
- Skip filters that have <5 distinct values — they're not worth the surface area
- Curated knowledge for the
_api_lessonstool- What does the product's API documentation NOT say that you've learned from real integration work?
- Quickstart / example repos
- Does the vendor publish working code? Ingest it; rewrite chunk-0 for natural-language retrieval.
Decisions worth carrying forward
Things you'll save time on by deciding the same way again:
- Tool descriptions are user interface. The LLM reads them verbatim and decides whether to call the tool. "Use when..." and "Call proactively whenever..." are real surfaces; treat them like button labels. Most retrieval improvements turn out to be tool-description rewrites in disguise.
stateless_http=Trueon the FastMCP server. Eliminates whole categories of session-ID-related 404 storms after container recreates.- Pre-bake everything at CI time. No runtime calls to git, external services, or anything you wouldn't trust on a Cloudflare outage. If the digest needs git history, write a JSONL file at CI time. If the lessons doc needs to load fast, bake it into the image.
- Env-gate every side-effecting tool. Off by default in dev; on only in production compose. Belt and suspenders against accidental writes from staging environments.
- Operator-confirmation pattern for side-effecting tools. The tool docstring is the only place to enforce human-in-the-loop. Make it loud. "MANDATORY", "Do not loop", "show-confirm-then-submit" — those phrasings work.
- Verify with hand-curated golden queries before shipping any retrieval change. Numbers in the diff, in the commit message. Don't ship retrieval changes on vibes.
- Two-cadence CI (weekly scrape vs on-demand code-only) saves hours per code iteration once you're past the one-iteration-a-week stage.
- Rolling tag + sha-pinned tag deploy pattern.
:latestis what Watchtower watches;:<sha>is your safety net. Rollback is a one-line compose edit, not a redeploy. - Usage logging is non-negotiable. You will be wrong about what people use. Capture the truth from day one; let it tell you which features to keep building and which to delete.
Glossary
- Bundle — one logical doc set in the portal. Zoomin calls
them bundles; Madcap calls them projects; the concept is the
same: a versioned, titled collection of pages. One dir under
corpus/. - Page — one HTML page in a bundle. One
.md+ one.jsonsidecar under the bundle dir. - Topic cluster — Zoomin's name for "this page in version 10.9 corresponds to that page in version 10.8." Stored in the per-page sidecar. The portal-agnostic concept is "cross-version peer mapping."
- Chunk — a unit of text that gets independently embedded and stored in Chroma. Target ~400-600 tokens; preserve paragraph boundaries.
- RRF — Reciprocal Rank Fusion. The way to merge two ranked lists from independent retrievers without score calibration.
What's deliberately NOT in this template
Decisions you should make per-product (not copy from the original build):
- The reverse proxy and TLS termination layer. Could be Caddy, nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
- The Gateway / aggregator in front of multiple MCPs (MetaMCP is one option; you may not need any aggregator if you're running a single product MCP).
- The specific embedding model —
nomic-embed-textis a strong default but newer / domain-specific models may be better for some products. - The Ollama containers / GPU setup — depends on what hardware you
have. The pattern is one container per GPU with explicit
NVIDIA_VISIBLE_DEVICESpinning; the indexer load-balances across them. - Whether to publish a blog series alongside the build. Strongly recommended (forces clarity, builds an audience), but optional.