9ba615c8ee
Template for building hosted MCP servers over a product's public
documentation. Distilled from one production build; everything
product-specific has been factored out.

Contents:

- PLAN.md — comprehensive build guide. 13 phases from project
  skeleton through weekly_digest. Includes the gotchas
  ("fetch-depth: 0 always", reranker per-pair token limit,
  Cloudflare body cap, dash-not-bash on Gitea runners), the
  decisions worth carrying forward, and a per-product
  customization checklist.
- CLAUDE.md — guidance for Claude Code working in a clone of this
  template. Phase identification table, conventions (env-gating +
  operator confirmation for side-effecting tools, defensive
  fallback for retrieval components), common commands.
- README.md — quick-start summary.

Scaffolded code (all signature-stable, with NotImplementedError
stubs where phase-specific work is required):

  docs_mcp/server.py      FastMCP server, stateless_http=True, with
                          search_docs / get_page / list_versions
                          baseline tools and commented stubs for the
                          rest of the phase set.
  docs_mcp/usage.py       TimedCall telemetry, JSONL, daily rotation,
                          90-day retention. Reusable as-is.
  rag/embeddings.py       Ollama embedder (nomic-embed-text default),
                          load-balanced across N URLs. Reusable.
  rag/chunk.py            Paragraph-aware chunker with synthetic
                          chunk 0. Per-product tunable.
  rag/index.py            Chroma + BM25 builder. --rebuild and
                          --bm25-only flags.
  rag/bm25.py             SQLite FTS5 lexical index. Reusable.
  scrape/changelog.py     --cached / --ref / --json / --history-out.
                          Reusable.
  scrape/README.md        What you write per-product.
  eval/queries.jsonl.example
                          Curate ~25 hand-labeled queries here.
  eval/retrievers.py      Retriever protocol + stub classes.
  eval/run_eval.py        MRR / Recall@K / nDCG@K harness skeleton.
  scripts/usage_report.py
                          Standalone log analyzer; the
                          FOLLOW-UP CHECKS pattern noted in the
                          module docstring.
  scripts/registry_gc.py
                          Gitea container registry cleanup. Reusable.

Deployment + CI:

  Dockerfile                       Python 3.12-slim; COPY corpus + chroma
                                   + bm25 last for cache efficiency.
  deploy/docker-compose.yml        MCP + reranker sidecar + Watchtower.
                                   Templated with <placeholders>.
  .gitea/workflows/refresh.yml     Weekly cron + manual dispatch.
                                   fetch-depth: 0, retry-on-race,
                                   three-tag image scheme.
  .gitea/workflows/image-only.yml  Code-only ship cycle, ~18min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

648 lines
26 KiB
Markdown
# Docs MCP Server — Build Guide

A reusable recipe for building a hosted MCP server over a product's
public documentation. Distilled from one production build; everything
product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, find inconsistencies, and (optionally) submit doc bugs
back upstream.

---

## What you're building

A pipeline with these stages:

```
upstream docs portal
        │
        ▼
     scrape ──► corpus/<bundle>/<page>.md + .json sidecar
        │
        ▼
 chunk + embed ──► chroma/   (dense vectors)
        │      ──► bm25/     (FTS5 lexical index)
        ▼
   MCP server ──► search_docs / get_page / diff_versions / weekly_digest /
                  find_doc_inconsistencies / submit_doc_bug / ...
        │
        ▼
reverse proxy / Cloudflare Tunnel ──► public endpoint
```

Two CI cadences:

- **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
  image build & push.
- **On-demand image-only** (~18 min): code-only rebuild from
  committed corpus, image build & push.

Supporting infrastructure: a container registry (self-hosted Gitea
works well), a host running Docker Compose, Watchtower auto-updating
from `:latest`, and a reverse proxy in front.

---

## Build phases

Each phase is a discrete, shippable unit. Build them in order; each
one is useful on its own and unlocks the next. Realistic effort per
phase is given as a rough order of magnitude. Total: roughly 2–3
weeks of focused work for the full stack.

### Phase 0 — Project skeleton *(half a day)*

Goals: directory layout, dependency manifest, virtualenv.

- Top-level dirs: `scrape/`, `corpus/` (gitignored for now), `rag/`,
  `docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
- `requirements.txt` with the dependencies you'll need across all
  phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML
  parser, ollama or sentence-transformers client, etc.).
- `python -m venv venv` and pin Python version (3.11 or 3.12 — be
  conservative; some embedding libraries have version-specific
  wheels).
- `.gitignore`: `venv/`, `corpus/` (regenerable; un-ignore it in
  Phase 5, where CI commits each scrape), `chroma/` (regenerable),
  `bm25/` (regenerable), `*.pyc`, `__pycache__/`, `.pytest_cache/`.

### Phase 1 — Scraper *(2–4 days, product-specific)*

This is the most product-dependent phase. The goal is to write a
scraper that produces a normalized corpus layout regardless of
upstream portal shape.

Output shape (mandatory):

```
corpus/
  <bundle_id>/          # one dir per "doc bundle" — see Glossary
    <page_id>.md        # markdown body
    <page_id>.json      # sidecar with structured metadata
    ...
bundles.json            # catalog of bundles with metadata
```

**Bundle metadata** (`bundles.json` is a list of these). `platform`
may be null; `product` is optional but useful:

```json
{
  "slug": "<bundle_id>",
  "title": "User-facing title",
  "version": "10.9",
  "platform": "VMware vSphere",
  "product": "Admin Guide",
  "language": "en-US",
  "page_count": 127,
  "dates": {
    "Added on": "2024-01-15",
    "Updated on": "2026-05-20"
  },
  "landing_page": "<page_id>"
}
```

**Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
The one field that matters cross-cutting is `topic_cluster` (see
Phase 9):

```json
{
  "bundle_id": "<bundle_id>",
  "page_id": "<page_id>",
  "title": "How to ...",
  "ordinal": 42,
  "topic_cluster": {
    "clustering_title": "How to ...",
    "clustered_topics": [
      {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
      {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
    ]
  }
}
```

If the portal exposes a cross-version "this page corresponds to that
page" mapping, capture it here. If it doesn't, you can synthesize a
filename-based fallback (same filename across bundle versions = same
topic) and live without the editor-curated mapping. The features that
read `topic_cluster` (`list_cluster`, `diff_versions`,
`find_doc_inconsistencies`, parts of `weekly_digest`) will work
either way; they're more accurate with real clusters.

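The filename-based fallback can be sketched in a few lines. This is a minimal illustration, assuming the corpus layout above with pages stored as `<page_id>.md`; the function name `synthesize_clusters` and the returned shape are hypothetical, not part of the template.

```python
from collections import defaultdict
from pathlib import Path


def synthesize_clusters(corpus_dir: Path) -> dict[str, list[dict[str, str]]]:
    """Group pages that share a filename across bundles (hypothetical helper).

    Returns {page_id: [{"bundle_id": ..., "page_id": ...}, ...]} for every
    page_id that appears in at least two bundles.
    """
    peers: dict[str, list[dict[str, str]]] = defaultdict(list)
    # corpus/<bundle_id>/<page_id>.md, so the parent dir names the bundle.
    for md_path in sorted(corpus_dir.glob("*/*.md")):
        page_id = md_path.stem  # "How_to_X.htm.md" -> "How_to_X.htm"
        peers[page_id].append(
            {"bundle_id": md_path.parent.name, "page_id": page_id}
        )
    # A page with no same-named peer in another bundle is not a cluster.
    return {pid: group for pid, group in peers.items() if len(group) > 1}
```

Each group can then be written into the `topic_cluster.clustered_topics` field of the matching sidecars, so downstream tools read one shape whether the mapping was curated or synthesized.
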
**Patterns that recur across doc portals:**

- Most modern doc portals are SPAs. Plain `requests.get` won't see
  rendered content. Either find the underlying API the SPA calls (the
  cheapest, most reliable path), or fall back to a headless browser
  (Playwright). The API path is almost always available; sniff the
  network tab.
- Portals usually expose a "bundle/topic" hierarchy under the hood
  (Zoomin, MadCap Flare, Paligo, GitBook, Docusaurus all do). Map
  it to `bundles.json` + `corpus/<bundle>/<page>`.
- Many portals expose `?save_local=` or `.pdf` rendered versions; the
  HTML they serve is structurally cleaner than what the page shows
  through the SPA shell.

**`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
`summarize_diff()`, `render_human()`, `walk_history()` and the
`--json` / `--history-out` modes. Mostly reusable as-is; the only
product-specific bit is the path layout assumption.

### Phase 2 — Chunking + embeddings + Chroma *(2 days)*

Goal: build a queryable dense index from the scraped corpus.

- `rag/chunk.py` — split each page's markdown into ~400–600 token
  chunks. Strategy that works: paragraph-aware splitter with a
  rich "chunk 0" containing the page title + 1-sentence summary +
  bag-of-words from key terms. Chunk 0 is what dense retrieval lands
  on first; getting it right dominates retrieval quality.
- `rag/embeddings.py` — pluggable embedder. Recommended start:
  Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
  Other defensible choices: `text-embedding-3-small` (OpenAI),
  `bge-m3` (also via Ollama). The embedder is a Chroma
  `EmbeddingFunction` that returns `list[list[float]]` for a list
  of texts.
- `rag/index.py` — orchestrates: read corpus → emit chunks (with
  metadata: bundle_id, page_id, version, platform, ordinal) →
  upsert into Chroma collection. `--rebuild` flag for a clean
  reindex. Run via `python -m rag.index --rebuild`.

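A paragraph-aware splitter with a synthetic chunk 0 might look like the sketch below. It is illustrative only: the whitespace word count stands in for a real tokenizer, the "longest distinct words" keyword bag is a placeholder heuristic, and the name `chunk_page` is an assumption rather than the template's actual signature.

```python
def chunk_page(title: str, body: str, max_tokens: int = 500) -> list[str]:
    """Split markdown into paragraph-aligned chunks, led by a synthetic chunk 0."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]

    # Chunk 0: title + first sentence + a small keyword bag.
    # Placeholder heuristic: the eight longest distinct words on the page.
    first_sentence = paragraphs[0].split(". ")[0] if paragraphs else ""
    words = {w.strip(".,:;()`").lower() for w in body.split()}
    keywords = sorted(words, key=lambda w: (-len(w), w))[:8]
    chunks = [f"{title}\n{first_sentence}\nKeywords: {', '.join(keywords)}"]

    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(para.split())  # crude whitespace token estimate
        # Flush before a paragraph would overflow; never split inside one.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than `max_tokens` is kept whole here; a production chunker would likely need a sentence-level fallback for that case.
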
Chroma settings: `PersistentClient(path="chroma/")` and
`Settings(anonymized_telemetry=False)`. Single collection
(`<product>_docs`).

**GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
(via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two
GPUs in parallel: ~5 minutes. The orchestrator just needs to
load-balance HTTP requests across multiple Ollama endpoints.

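The load-balancing idea can be sketched as batches assigned round-robin to endpoints and sent concurrently. The function name `embed_all`, the batch size, and the injectable `post` callable are assumptions for illustration; the request shape follows Ollama's `/api/embed` endpoint as I understand it (`model` + `input` in, `embeddings` out), so verify it against the Ollama version you run.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def _http_post(url: str, payload: dict) -> dict:
    """POST JSON and decode the JSON reply (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def embed_all(texts, urls, model="nomic-embed-text", batch_size=32, post=_http_post):
    """Embed texts across several Ollama endpoints, preserving input order."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    def embed_batch(job):
        index, batch = job
        url = urls[index % len(urls)]  # round-robin endpoint choice
        reply = post(f"{url}/api/embed", {"model": model, "input": batch})
        return reply["embeddings"]

    # One worker per endpoint keeps every GPU busy without oversubscribing.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        results = pool.map(embed_batch, enumerate(batches))  # order-preserving
        return [vec for batch_vecs in results for vec in batch_vecs]
```

Passing `post` as a parameter keeps the sketch testable without a running Ollama instance.
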
### Phase 3 — MCP server skeleton *(1 day)*

Goal: working FastMCP server with three tools — `search_docs`,
`get_page`, `list_versions`.

- `docs_mcp/server.py` — `FastMCP("<product>-docs", stateless_http=True)`.
  `stateless_http=True` is critical for production hosting: every
  request creates an ephemeral session, so container recreates don't
  produce a 404 storm from stale `mcp-session-id` headers on
  clients.
- Lazy initialization for everything expensive (Chroma client,
  embedder, bundles catalog) so the server starts cleanly even when
  Ollama is briefly unreachable.
- Tool: `search_docs(query, version=None, platform=None,
  bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
  source URLs.
- Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
  metadata.
- Tool: `list_versions()`. Returns the version/platform facets
  available, drawn from `bundles.json`. Helps the LLM pick filter
  values.

Transports: stdio (for local Claude Desktop dev), streamable-HTTP
(for hosted production). One argparse switch.

```python
from typing import Annotated

from pydantic import Field

# `mcp` is the FastMCP instance created in docs_mcp/server.py.


@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
    platform: str | None = None,
    bundle_id: str | None = None,
    k: int = 10,
) -> str:
    ...
```

The tool descriptions are first-class context — the LLM reads them
and decides whether to call the tool. Treat them as button labels;
use "Call when..." / "Use proactively whenever..." phrasings.

### Phase 4 — Containerization *(1 day)*

Goal: an image you can run anywhere.

- `Dockerfile`: Python 3.12-slim base, install requirements, COPY
  `scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
  + (later) `bm25/`. Don't COPY `scripts/` — those stay external
  for ops use only.
- `ENTRYPOINT ["python", "-m", "docs_mcp.server",
  "--transport", "streamable-http"]`. Configurable host/port via env.
- `deploy/docker-compose.yml`: one service, named volumes for usage
  logs and any state, Watchtower label, depends_on for the reranker
  sidecar (Phase 6).

Smoke-test locally: `docker compose up` should expose
`http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC
request.

### Phase 5 — CI on self-hosted Gitea Actions *(1–2 days)*

Goal: weekly cron rebuild + on-demand code-only ship cycle.

**Two workflows, two cadences:**

| Workflow | Trigger | Steps | Runtime |
|---|---|---|---|
| `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
| `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |

**Critical settings (learned the hard way):**

- `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
  (shallow), which breaks any step that walks git history (changelog,
  digest history walker). Pay the ~10 second cost; never debug a
  "0-byte history file" mystery.
- `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
- Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
  expansion doesn't exist; use `cut` / `printf` / `awk`.

**Retry-on-race pattern for long-running scrapes:**

```bash
attempt=1
while [ "$attempt" -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  # Give up after the third failed push.
  [ "$attempt" -eq 3 ] && { echo "still failing"; exit 1; }
  # Someone pushed while the scrape ran: replay our commit on top of theirs.
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
done
```

Works because scrape commits only touch `corpus/` + `bundles.json`,
and code merges only touch `.py` / `.yml` — disjoint paths, trivially
clean rebases.

**Image tagging — three tags per build:**

| Tag | Purpose |
|---|---|
| `:latest` | Watchtower watches this for auto-deploy |
| `:<sha12>` | Immutable; rollback target |
| `:<YYYY.MM.DD>` | Human-readable in incident notes |

Same tag set on every build; rollback is a one-line compose edit
to pin `:<sha12>` instead of `:latest`.

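As a small illustration of the three-tag scheme, the helper below derives all three references from a commit SHA and a build date. The function name `image_tags` and the repository string are hypothetical; in the workflow itself the same values would be computed in shell.

```python
from datetime import date


def image_tags(repository: str, commit_sha: str, build_date: date) -> list[str]:
    """Return the rolling, immutable, and date-stamped tags for one build."""
    return [
        f"{repository}:latest",                  # what Watchtower follows
        f"{repository}:{commit_sha[:12]}",       # immutable rollback target
        f"{repository}:{build_date:%Y.%m.%d}",   # human-readable in incident notes
    ]
```

On the dash runner the 12-character cut has to avoid `${VAR::12}`; something like `printf '%s' "$SHA" | cut -c1-12` does the same job.
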
**Container registry behind Cloudflare:**

Cloudflare's free tier has a 100 MB request body limit. Big image
layers (Chroma index can easily be 800+ MB) exceed it on push. The
fix is a LAN registry endpoint for push, public hostname for pull:

```yaml
env:
  REGISTRY_PUSH: <lan-ip>:<port>       # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>     # response bodies aren't capped
```

The runner needs the LAN endpoint listed under `insecure-registries`
in `/etc/docker/daemon.json`. Costs nothing operationally; saves hours
of debugging.

**Registry GC:** a weekly cron job in the workflow walks the package
versions, keeps `:latest` + N most-recent date tags + anything
pushed in the last 90 days, and deletes the rest. Worth ~50 LOC; the
package GC on the Gitea side reclaims disk afterwards.

### Phase 6 — Reranker *(half a day)*

Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
dense candidates.

- A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
  `jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
  GPU strongly recommended (CPU latency is unworkable for live
  queries).
- `_rerank(query, docs)` helper in the server: POST to the endpoint,
  apply the scores, re-sort the top-N candidates. Defensive: on any
  failure log a warning and fall through to dense-only.
- Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
  pull candidates for reranking; 200 is a good default),
  `RERANK_TIMEOUT` (30s for cold-start tolerance).
- **Watch the per-pair token limit.** Jina's GGUF reports
  `n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
  any pair exceeds it. Truncate doc text to ~2000 chars before
  reranking. The full untruncated chunk still goes back to the user;
  truncation is only for the reranker scoring path.

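The defensive rerank path described in this phase could be sketched as follows. The `post` callable is injected so the sketch stays testable; the response shape (`results` entries carrying `index` and `relevance_score`) is an assumption based on the Jina-style rerank format, so check it against what your llama.cpp build actually returns.

```python
import logging
from typing import Callable

log = logging.getLogger("docs_mcp.rerank")

MAX_RERANK_CHARS = 2000  # keeps each query/doc pair under the reranker's context


def rerank(query: str, docs: list[str], post: Callable[[dict], dict], top_n: int = 10) -> list[str]:
    """Re-sort dense candidates by reranker score; fall back to dense order."""
    payload = {
        "query": query,
        # Truncate only what the reranker scores, not what the user receives.
        "documents": [doc[:MAX_RERANK_CHARS] for doc in docs],
    }
    try:
        reply = post(payload)
        scored = sorted(reply["results"], key=lambda r: r["relevance_score"], reverse=True)
        order = [r["index"] for r in scored]
    except Exception as exc:  # any failure degrades to dense-only
        log.warning("reranker unavailable, using dense order: %s", exc)
        order = list(range(len(docs)))
    return [docs[i] for i in order[:top_n]]
```

In the server, `post` would wrap an HTTP client call to `RERANK_URL` with `RERANK_TIMEOUT`, and `docs` would be the top `RERANK_POOL` dense candidates.
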
### Phase 7 — Eval harness *(1 day)*

Goal: hand-curated golden queries + standard metrics so you can
measure the impact of any retrieval change.

- `eval/queries.jsonl`: 20–25 hand-curated queries with expected
  pages. Spread across versions, platforms, and difficulty levels.
  Include the queries that "obviously" should work and DON'T —
  those are the ones to track.
- `eval/retrievers.py`: a `Retriever` protocol with concrete
  implementations: `DenseRetriever`, `RerankedRetriever`,
  `BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
  matrix dimension per knob.
- `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
  retrievers; emits a markdown comparison table at
  `eval/results/<baseline>.md`. Commit the result so PRs land with
  the A/B evidence in the diff.

Three numbers are enough — don't overengineer. The hand-curated
queries are the value; the metrics are just a stable way to score
them.

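For reference, the three metrics reduce to a few lines each. This sketch assumes binary relevance (a page is either an expected hit or not) and per-query inputs of a ranked page list plus a set of expected pages; the function names are illustrative, not the harness's actual API.

```python
import math


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant page, or 0.0 if none was retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the expected pages that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Binary-gain nDCG: discounted hits divided by the best achievable score."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, page in enumerate(ranked[:k], start=1)
        if page in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

MRR is then the mean of `reciprocal_rank` over all queries in `eval/queries.jsonl`, computed once per retriever.
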
### Phase 8 — BM25 + Hybrid retrieval *(half a day, conditional)*

**Skip unless your eval shows specific failure modes.** Dense
embeddings + cross-encoder reranker handle most queries. The case
where they don't: queries with rare technical tokens (filenames,
language names, error codes) get buried at dense rank 1000+ by a
much larger prose corpus that's semantically nearby. The reranker
only sees top-200, so it never gets a shot.

- `rag/bm25.py`: SQLite FTS5 index via the stdlib `sqlite3` module,
  on-disk (`bm25/<product>.db`). Two tables — metadata table keyed by
  rowid, FTS5 virtual table for full-text. Sanitize the query
  (strip FTS5 reserved keywords, OR-join tokens for recall). ~210
  LOC.
- `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
  Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
  ordered ids plus per-retriever contribution dict for telemetry.
- `search_docs` hybrid path: run dense + BM25 in parallel,
  RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
  `HYBRID_SEARCH=true`.
- Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
  to usage logs so you can measure whether BM25 is actually earning
  its keep on production traffic.

If after 4–6 weeks of production data you see `bm25_only >= 80%`,
you can simplify to BM25-only (much less infrastructure). If
`both >= 50%`, hybrid is acting as a tie-breaker, not a rescue: keep
it or simplify depending on how much you care about the long tail.

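The fusion step and the `top1_source` label from this phase can be sketched as below. The formula and `k=60` come from the text; the input shape (a dict of retriever name to ranked ids), the alphabetical tie-break, and the `"none"` label for an empty result are assumptions made for the example.

```python
def rrf_fuse(
    rankings: dict[str, list[str]], k: int = 60
) -> tuple[list[str], dict[str, dict[str, float]]]:
    """Reciprocal Rank Fusion: score(id) = sum over retrievers of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    contributions: dict[str, dict[str, float]] = {}
    for retriever, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            share = 1.0 / (k + rank)
            scores[doc_id] = scores.get(doc_id, 0.0) + share
            contributions.setdefault(doc_id, {})[retriever] = share
    # Highest fused score first; ties broken by id for a stable order.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered, contributions


def top1_source(ordered: list[str], contributions: dict[str, dict[str, float]]) -> str:
    """Label which retriever(s) surfaced the top fused hit, for usage logs."""
    if not ordered:
        return "none"
    sources = set(contributions[ordered[0]])
    if sources == {"dense"}:
        return "dense_only"
    if sources == {"bm25"}:
        return "bm25_only"
    return "both"
```

Because RRF only looks at ranks, the dense similarity scores and BM25 scores never need to be put on a common scale.
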
### Phase 9 — Multi-version diff tooling *(1 day, if applicable)*

**Only relevant if the product has multiple maintained versions.**

- `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
  diff between two versions of the same page. Two matching
  strategies: editor-curated `topic_cluster` peer (if the portal
  exposes it), or same-filename fallback.
- `list_cluster(bundle_id, page_id)`: list cross-version peers
  for one page.
- `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
  removed / changed pages between two bundles, sorted by churn.
- `_diff_churn(a, b)`: small helper, ~15 LOC that counts the changed
  lines from `difflib.unified_diff(a, b, n=0)`. Used by
  `bundle_changelog`, `find_doc_inconsistencies`, and `weekly_digest`.

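A churn helper along these lines is one way to do the line counting. The normalization (changed lines divided by the combined line count of both versions, giving a value between 0 and 1) is an assumption for this sketch; the original helper may normalize differently.

```python
import difflib
import itertools


def diff_churn(old: str, new: str) -> float:
    """Fraction of lines added or removed between two versions of a page."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    diff = difflib.unified_diff(old_lines, new_lines, n=0, lineterm="")
    # Skip the two file-header lines ("---" / "+++") so that content lines
    # which themselves begin with dashes are still counted correctly.
    body = itertools.islice(diff, 2, None)
    changed = sum(1 for line in body if line.startswith(("+", "-")))
    return changed / max(len(old_lines) + len(new_lines), 1)
```

With `n=0` the diff carries no context lines, so every line that is not a hunk header (`@@ ... @@`) is an addition or a removal.
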
### Phase 10 — Usage logging *(half a day)*

Goal: per-call JSONL telemetry so you can answer "what are people
actually asking for" and "is the new feature getting used."

- `docs_mcp/usage.py`: `TimedCall` context manager that captures
  tool name, args, elapsed time, hits returned, any extra fields
  set by the tool via `_call.set(key=value)`. Writes JSONL to
  `var/logs/usage.jsonl`, rotated daily, kept 90 days.
- Mount the log dir as a named compose volume so logs survive
  container recreates.
- `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
  the JSONL files, prints per-tool counts, top queries, 0-hit
  queries, filter usage histogram, reranker activity. Markdown
  output flag for piping into weekly digest emails.

What to log: query text, filters, hits returned, elapsed_ms,
reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
to log: anything PII-shaped. The corpus is public, queries are
usually about the product, not personal — but be deliberate.

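A context manager in the spirit of `TimedCall` might look like this. It is a sketch under assumptions: it writes to a date-stamped file per day instead of rotating a single `usage.jsonl`, it omits the 90-day pruning, and the record field names (`ts`, `elapsed_ms`, `error`) are illustrative.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


class TimedCall:
    """Record one tool call as a JSON line, including elapsed time."""

    def __init__(self, tool: str, args: dict, log_dir: Path = Path("var/logs")):
        self.record = {"tool": tool, "args": args}
        self.log_dir = log_dir

    def set(self, **fields) -> None:
        """Attach extra fields (hits, reranker_fired, top1_source, ...)."""
        self.record.update(fields)

    def __enter__(self) -> "TimedCall":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        now = datetime.now(timezone.utc)
        self.record["ts"] = now.isoformat()
        self.record["elapsed_ms"] = round((time.perf_counter() - self._start) * 1000, 1)
        self.record["error"] = exc_type.__name__ if exc_type else None
        self.log_dir.mkdir(parents=True, exist_ok=True)
        path = self.log_dir / f"usage-{now:%Y-%m-%d}.jsonl"  # one file per UTC day
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(self.record, default=str) + "\n")
        return False  # never swallow the tool's own exception
```

Inside a tool the pattern is `with TimedCall("search_docs", {"query": query}) as _call:` followed by `_call.set(hits=len(results))` once the results are known.
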
### Phase 11 — Curated knowledge layer *(2 days)*

This phase closes the "RAG can't tell you what isn't in the docs"
gap. Three surfaces:

- **API quickstart repos** if the product has them. Ingest the
  example scripts (Python, PowerShell, curl) into the corpus.
  Rewrite chunk-0 for each script to embed naturally — explicit
  natural-language H1, task description sentence, keyword bag.
  Dense embeddings need an anchor.
- **A curated `<product>_api_lessons` markdown doc** for things
  the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
  patterns, schema bugs you've hit, platform-detection quirks.
  Surface as a dedicated MCP tool whose description tells the LLM:
  *"Call proactively whenever the user asks you to write a script
  / integrate with the API / debug a 4xx response."*
- **An auto-hint banner** in `search_docs` results — when the
  query matches a script/API trigger word, render a one-line nudge
  at the top of results pointing at the dedicated tool.
  Belt-and-suspenders for queries where the LLM doesn't think to
  call it proactively.

### Phase 12 — Doc-bug workflow tools *(1 day, optional)*

Two tools that pair up to enable a *"check the docs for
inconsistencies, draft bugs, confirm, submit"* workflow.

- `find_doc_inconsistencies(scope_query, version=None, platform=None,
  max_pages=30, checks=None)`: deterministic, read-only. Two checks:
  cross-version drift (pages whose content shifted between a version
  and its immediate predecessor, in the actionable 10–60% churn band)
  and redirect-chain detection (short pages whose body is just a "see
  [other page] for details" pointer). Heavy lifting is line-level
  diff (`difflib`) against editor-curated cluster peers; the model
  judges which findings are real bugs.

- `submit_doc_bug(page_url, content, email=None, rating=None,
  like=None)`: POSTs to the docs portal's feedback endpoint.
  Env-gated by `DOC_BUG_SUBMIT_ENABLED=true` so dev/staging
  deployments can't accidentally hit the upstream. The tool's
  docstring is loud about a mandatory operator-confirmation
  workflow per submission — LLM must draft, show, ask, then
  submit. Explicit *"do not loop"* instruction. Defensive
  validation upfront (URL host matches expected portal, content
  non-empty, etc.) so the LLM gets a clean error instead of a
  rejected POST.

**You'll need to find the docs portal's feedback endpoint.** Most
portals route the "Was this helpful?" widget through a backend
API; sniff the browser network tab on the live site. The payload
shape varies; common fields: content/body, page url/href, optional
email, optional rating, optional thumbs. Most accept anonymous
POSTs with no captcha at the JSON-API layer (even if the widget
shows a captcha). Validate before you ship. If the endpoint
has rate limits or captcha enforcement, have the tool return a clean
"submission rejected — paste manually at <url>" fallback.

The whole point is the per-bug operator confirmation in the
LLM-side conversation flow; the tool description enforces it. Do
not bypass.

### Phase 13 — Weekly digest tool *(half a day)*

Goal: a tool that answers *"what changed in the docs in the last N
days?"* with no runtime git dependency (the prod container has no
git).

- Extend `scrape/changelog.py` with `--json` (one-shot structured
  output) and `--history-out PATH` (walks `git log --first-parent
  --since="<N> days ago"` for corpus-touching commits, writes one
  JSON line per commit to a JSONL file).
- CI workflows write the JSONL file into the image at build time:
  `corpus/.digest/history.jsonl`. Both `refresh.yml` and
  `image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
- New MCP tool `weekly_digest(days=7, version=None, platform=None,
  max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
  filters to the window, applies version/platform via
  `bundles.json` metadata, aggregates per-bundle change counts and
  page lists, renders markdown.
- Post-filter totals are critical: the headline "X page changes
  across Y bundles" must compute X from the filtered set, not the
  raw record count. Otherwise filtered calls look wrong to the
  reader.

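The post-filter totals point is easiest to see in code. The sketch below assumes a hypothetical record shape (one dict per commit with an ISO `date` and a `changes` list of `bundle_id` / `page_id` pairs) and a `bundles` lookup keyed by slug; the real JSONL written by `scrape/changelog.py` may differ.

```python
from collections import defaultdict
from datetime import date, timedelta


def summarize(records, bundles, today, days=7, version=None, platform=None):
    """Aggregate changed pages per bundle, counting only what survives the filters."""
    cutoff = today - timedelta(days=days)
    pages_by_bundle: dict[str, set[str]] = defaultdict(set)
    for record in records:
        if date.fromisoformat(record["date"]) < cutoff:
            continue  # outside the requested window
        for change in record["changes"]:
            meta = bundles.get(change["bundle_id"], {})
            if version is not None and meta.get("version") != version:
                continue
            if platform is not None and meta.get("platform") != platform:
                continue
            pages_by_bundle[change["bundle_id"]].add(change["page_id"])
    # Headline numbers come from the filtered set, never from len(records).
    total_pages = sum(len(pages) for pages in pages_by_bundle.values())
    headline = f"{total_pages} page changes across {len(pages_by_bundle)} bundles"
    return headline, {b: sorted(p) for b, p in pages_by_bundle.items()}
```

Using a set per bundle also means a page touched by two commits in the window is counted once, which is a design choice worth confirming against how you want the headline to read.
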
Out of scope but trivial bolt-ons: scheduled HTML email of the
digest, auto-publish to a blog, per-page diff excerpts as a
follow-up tool.

---

## Standard tool set

By the end you'll have ~15 tools registered. Production-tested
shape:

| Tool | What it does |
|---|---|
| `search_docs` | Semantic search with version/platform/bundle filters |
| `get_page` | Full markdown + metadata for one page |
| `list_versions` | Discover available facet values |
| `list_cluster` | Cross-version peers for one page (if applicable) |
| `diff_versions` | Unified diff of a page across two versions |
| `bundle_changelog` | Added / removed / changed pages between two bundles |
| `weekly_digest` | What changed in the last N days, with filters |
| `corpus_status` | Freshness + size of the knowledge base |
| `find_doc_inconsistencies` | Scoped scan for doc bugs |
| `submit_doc_bug` | Submit a drafted bug (env-gated, operator-confirmed) |
| `<product>_api_lessons` | Curated API gotchas, proactively-called |
| product-specific tools | Interop matrix, lifecycle queries, etc. |

---

## Per-product customization checklist

When applying this template to a new product, here's what you have
to figure out yourself — everything else is shared infrastructure:

- **Doc portal mechanics**
  - URL pattern for pages
  - Bundle/version concept (Zoomin "bundle", MadCap "project",
    GitBook "space", Docusaurus "docs version" — same idea, different
    name)
  - SPA backing API (sniff the network tab) or fallback to
    headless browser
  - How `topic_cluster`-equivalent cross-version peers are exposed
    (or whether you synthesize them from filenames)
- **Bundle metadata schema**
  - What does `version` look like? Semver, calendar, named?
  - What does `platform` mean for this product? Is there a useful
    facet at all?
  - Other useful facets (language, product line, edition)?
- **Filterable facets** for `search_docs`
  - One filter per high-cardinality facet
  - Skip filters that have <5 distinct values — they're not worth
    the surface area
- **Feedback endpoint** (for `submit_doc_bug`, if you want it)
  - URL of the POST endpoint
  - Required + optional payload fields
  - Captcha / rate-limit behavior
  - Whether anonymous submissions are accepted
- **Curated knowledge** for the `_api_lessons` tool
  - What does the product's API documentation NOT say that you've
    learned from real integration work?
- **Quickstart / example repos**
  - Does the vendor publish working code? Ingest it; rewrite
    chunk-0 for natural-language retrieval.

---

## Decisions worth carrying forward

Things you'll save time on by deciding the same way again:

- **Tool descriptions are user interface.** The LLM reads them
  verbatim and decides whether to call the tool. *"Use when..."*
  and *"Call proactively whenever..."* are real surfaces; treat
  them like button labels. Most retrieval improvements turn out
  to be tool-description rewrites in disguise.
- **`stateless_http=True`** on the FastMCP server. Eliminates
  whole categories of session-ID-related 404 storms after
  container recreates.
- **Pre-bake everything at CI time.** No runtime calls to git,
  external services, or anything you wouldn't trust on a
  Cloudflare outage. If the digest needs git history, write a
  JSONL file at CI time. If the lessons doc needs to load fast,
  bake it into the image.
- **Env-gate every side-effecting tool.** Off by default in dev;
  on only in production compose. Belt and suspenders against
  accidental writes from staging environments.
- **Operator-confirmation pattern for side-effecting tools.**
  The tool docstring is the only place to enforce
  human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
  "show-confirm-then-submit" — those phrasings work.
- **Verify with hand-curated golden queries before shipping any
  retrieval change.** Numbers in the diff, in the commit message.
  Don't ship retrieval changes on vibes.
- **Two-cadence CI** (weekly scrape vs on-demand code-only)
  saves hours per code iteration once you're past the
  one-iteration-a-week stage.
- **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
  what Watchtower watches; `:<sha>` is your safety net. Rollback
  is a one-line compose edit, not a redeploy.
- **Usage logging is non-negotiable.** You will be wrong about
  what people use. Capture the truth from day one; let it tell
  you which features to keep building and which to delete.

---

## Glossary

- **Bundle** — one logical doc set in the portal. Zoomin calls
  them bundles; MadCap calls them projects; the concept is the
  same: a versioned, titled collection of pages. One dir under
  `corpus/`.
- **Page** — one HTML page in a bundle. One `.md` + one `.json`
  sidecar under the bundle dir.
- **Topic cluster** — Zoomin's name for "this page in version
  10.9 corresponds to that page in version 10.8." Stored in the
  per-page sidecar. The portal-agnostic concept is "cross-version
  peer mapping."
- **Chunk** — a unit of text that gets independently embedded and
  stored in Chroma. Target ~400–600 tokens; preserve paragraph
  boundaries.
- **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
  lists from independent retrievers without score calibration.

---

## What's deliberately NOT in this template

Decisions you should make per-product (not copy from the original
build):

- The reverse proxy and TLS termination layer. Could be Caddy,
  nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
- The gateway / aggregator in front of multiple MCPs (MetaMCP is one
  option; you may not need any aggregator if you're running a
  single product MCP).
- The specific embedding model — `nomic-embed-text` is a strong
  default but newer / domain-specific models may be better for
  some products.
- The Ollama containers / GPU setup — depends on what hardware you
  have. The pattern is one container per GPU with explicit
  `NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
  across them.
- Whether to publish a blog series alongside the build. Strongly
  recommended (forces clarity, builds an audience), but optional.