# Docs MCP Server — Build Guide

A reusable recipe for building a hosted MCP server over a product's
public documentation. Distilled from one production build; everything
product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, and flag likely inconsistencies.

> **Domain note for crop-chem-docs.** This template was originally written
> for versioned software product documentation (Zoomin bundles, Hugo
> sites, etc.). For crop-chem-docs the domain is pesticide product labels —
> the "bundle" abstraction has been replaced with "source"
> (manufacturer or regulator), and "page" with "product label". The
> canonical on-disk schema lives in [`scrape/README.md`](scrape/README.md),
> not in this document. References below to `bundles.json`, `bundle_id`,
> `--bundle`, `version`, and `platform` are template artifacts — read
> them as `sources.json`, `source_id`, `--source`, and (mostly)
> not-applicable. Phase 1 (scraper) is the most heavily adapted; later
> phases (chunking, embeddings, retrieval, eval) apply largely as
> written.

---

## What you're building

A pipeline with these stages:

```
upstream docs portal
        │
        ▼
   scrape  ──► corpus/<bundle>/<page>.md + .json sidecar
        │
        ▼
    chunk + embed  ──► chroma/  (dense vectors)
        │           ──► bm25/   (FTS5 lexical index)
        ▼
   MCP server  ──► search_docs / get_page / diff_versions / weekly_digest /
                   find_doc_inconsistencies / ...
        │
        ▼
   reverse proxy / Cloudflare Tunnel ──► public endpoint
```

Two CI cadences:

- **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
  image build & push.
- **On-demand image-only** (~18 min): code-only rebuild from
  committed corpus, image build & push.

A container registry (self-hosted Gitea works well), a host running
Docker Compose, Watchtower auto-updating from `:latest`, and a
reverse proxy in front.

---

## Build phases

Each phase is a discrete, shippable unit. Build them in order; each
one is useful on its own and unlocks the next. Realistic effort per
phase is given as a rough order of magnitude. Total: roughly 2–3
weeks of focused work for the full stack.

### Phase 0 — Project skeleton  *(half a day)*

Goals: directory layout, dependency manifest, virtualenv.

- Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`,
  `docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
- `requirements.txt` with the dependencies you'll need across all
  phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML
  parser, ollama or sentence-transformers client, etc.).
- `python -m venv venv` and pin Python version (3.11 or 3.12 — be
  conservative; some embedding libraries have version-specific
  wheels).
- `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/`
  (regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`,
  `.pytest_cache/`.

### Phase 1 — Scraper  *(2–4 days, product-specific)*

This is the most product-dependent phase. The goal is to write a
scraper that produces a normalized corpus layout regardless of
upstream portal shape.

Output shape (mandatory):

```
corpus/
  <bundle_id>/             # one dir per "doc bundle" — see Glossary
    <page_id>.md           # markdown body
    <page_id>.json         # sidecar with structured metadata
  ...
bundles.json               # catalog of bundles with metadata
```

**Bundle metadata** (`bundles.json` is a list of these):

```json
{
  "slug":          "<bundle_id>",
  "title":         "User-facing title",
  "version":       "10.9",
  "platform":      "VMware vSphere",   // may be null
  "product":       "Admin Guide",       // optional but useful
  "language":      "en-US",
  "page_count":    127,
  "dates": {
    "Added on":    "2024-01-15",
    "Updated on":  "2026-05-20"
  },
  "landing_page":  "<page_id>"
}
```

**Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
The one field that matters cross-cutting is `topic_cluster` (see
Phase 9):

```json
{
  "bundle_id":     "<bundle_id>",
  "page_id":       "<page_id>",
  "title":         "How to ...",
  "ordinal":       42,
  "topic_cluster": {
    "clustering_title": "How to ...",
    "clustered_topics": [
      {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
      {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
    ]
  }
}
```

If the portal exposes a cross-version "this page corresponds to that
page" mapping, capture it here. If it doesn't, you can synthesize a
filename-based fallback (same filename across bundle versions = same
topic) and live without the editor-curated mapping. The features that
read `topic_cluster` (`list_cluster`, `diff_versions`,
`find_doc_inconsistencies`, parts of `weekly_digest`) will work
either way; they're more accurate with real clusters.

**Patterns that recur across doc portals:**

- Most modern doc portals are SPAs. Plain `requests.get` won't see
  rendered content. Either find the underlying API the SPA calls (the
  cheapest, most reliable path), or fall back to a headless browser
  (Playwright). The API path is almost always available; sniff the
  network tab.
- Portals usually expose a "bundle/topic" hierarchy under the hood
  (Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
  it to `bundles.json` + `corpus/<bundle>/<page>`.
- Many portals expose `?save_local=` or `.pdf` rendered versions; the
  HTML they serve is structurally cleaner than what the page shows
  through the SPA shell.

**`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
`summarize_diff()`, `render_human()`, `walk_history()` and the
`--json` / `--history-out` modes. Mostly reusable as-is; the only
product-specific bit is the path layout assumption.

### Phase 2 — Chunking + embeddings + Chroma  *(2 days)*

Goal: build a queryable dense index from the scraped corpus.

- `rag/chunk.py` — split each page's markdown into ~400-600 token
  chunks. Strategy that works: paragraph-aware splitter with a
  rich "chunk 0" containing the page title + 1-sentence summary +
  bag-of-words from key terms. Chunk 0 is what dense retrieval lands
  on first; getting it right dominates retrieval quality.
- `rag/embeddings.py` — pluggable embedder. Recommended start:
  Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
  Other defensible choices: `text-embedding-3-small` (OpenAI),
  `bge-m3` (also via Ollama). The embedder is a Chroma
  `EmbeddingFunction` that returns `list[list[float]]` for a list
  of texts.
- `rag/index.py` — orchestrates: read corpus → emit chunks (with
  metadata: bundle_id, page_id, version, platform, ordinal) →
  upsert into Chroma collection. `--rebuild` flag for a clean
  reindex. Run via `python -m rag.index --rebuild`.

Chroma settings: `PersistentClient(path="chroma/")` and
`Settings(anonymized_telemetry=False)`. Single collection
(`<product>_docs`).

**GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
(via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two
GPUs in parallel: ~5 minutes. The orchestrator just needs to load-
balance HTTP requests across multiple Ollama endpoints.

### Phase 3 — MCP server skeleton  *(1 day)*

Goal: working FastMCP server with three tools — `search_docs`,
`get_page`, `list_versions`.

- `docs_mcp/server.py` — `FastMCP("<product>-docs", stateless_http=True)`.
  `stateless_http=True` is critical for production hosting: every
  request creates an ephemeral session, so container recreates don't
  produce a 404 storm from stale `mcp-session-id` headers on
  clients.
- Lazy initialization for everything expensive (Chroma client,
  embedder, bundles catalog) so the server starts cleanly even when
  Ollama is briefly unreachable.
- Tool: `search_docs(query, version=None, platform=None,
  bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
  source URLs.
- Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
  metadata.
- Tool: `list_versions()`. Returns the version/platform facets
  available, drawn from `bundles.json`. Helps the LLM pick filter
  values.

Transports: stdio (for local Claude Desktop dev), streamable-HTTP
(for hosted production). One argparse switch.

```python
@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
    ...
) -> str:
    ...
```

The tool descriptions are first-class context — the LLM reads them
and decides whether to call the tool. Treat them as button labels;
use "Call when..." / "Use proactively whenever..." phrasings.

### Phase 4 — Containerization  *(1 day)*

Goal: image you can run anywhere.

- `Dockerfile`: Python 3.12-slim base, install requirements, COPY
  `scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
  + (later) `bm25/`. Don't COPY `scripts/` — those stay external
  for ops use only.
- `ENTRYPOINT ["python", "-m", "docs_mcp.server",
  "--transport", "streamable-http"]`. Configurable host/port via env.
- `deploy/docker-compose.yml`: one service, named volumes for usage
  logs and any state, Watchtower label, depends_on for the reranker
  sidecar (Phase 6).

Smoke-test locally: `docker compose up` should expose
`http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC.

### Phase 5 — CI on self-hosted Gitea Actions  *(1–2 days)*

Goal: weekly cron rebuild + on-demand code-only ship cycle.

**Two workflows, two cadences:**

| Workflow | Trigger | Steps | Runtime |
|---|---|---|---|
| `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
| `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |

**Critical settings (learned the hard way):**

- `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
  (shallow), which breaks any step that walks git history (changelog,
  digest history walker). Pay the ~10 second cost; never debug a
  "0-byte history file" mystery.
- `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
- Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
  expansion doesn't exist; use `cut` / `printf` / `awk`.

**Retry-on-race pattern for long-running scrapes:**

```bash
attempt=1
while [ $attempt -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  [ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
done
```

Works because scrape commits only touch `corpus/` + `bundles.json`,
and code merges only touch `.py` / `.yml` — disjoint paths, trivially
clean rebases.

**Image tagging — three tags per build:**

| Tag | Purpose |
|---|---|
| `:latest` | Watchtower watches this for auto-deploy |
| `:<sha12>` | Immutable; rollback target |
| `:<YYYY.MM.DD>` | Human-readable in incident notes |

Same tag set on every build; rollback is a one-line compose edit
to pin `:<sha>` instead of `:latest`.

**Container registry behind Cloudflare:**

Cloudflare's free tier has a 100 MB request body limit. Big image
layers (Chroma index can easily be 800+ MB) exceed it on push. The
fix is a LAN registry endpoint for push, public hostname for pull:

```yaml
env:
  REGISTRY_PUSH: <lan-ip>:<port>     # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>   # response bodies aren't capped
```

Runner needs the LAN endpoint in `/etc/docker/daemon.json`
`insecure-registries`. Costs nothing operationally; saves hours
of debugging.

**Registry GC:** weekly cron in the workflow that walks the package
versions, keeps `:latest` + N most-recent date tags + anything
pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the
package GC on the Gitea side reclaims disk after.

### Phase 6 — Reranker  *(half a day)*

Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
dense candidates.

- A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
  `jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
  GPU strongly recommended (CPU latency is unworkable for live
  queries).
- `_rerank(query, docs)` helper in the server: POST to the endpoint,
  apply the scores, re-sort the top-N candidates. Defensive: on any
  failure log a warning and fall through to dense-only.
- Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
  pull candidates for reranking; 200 is a good default),
  `RERANK_TIMEOUT` (30s for cold-start tolerance).
- **Watch the per-pair token limit.** Jina's GGUF reports
  `n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
  any pair exceeds it. Truncate doc text to ~2000 chars before
  reranking. The full untruncated chunk still goes back to the user;
  truncation is only for the reranker scoring path.

### Phase 7 — Eval harness  *(1 day)*

Goal: hand-curated golden queries + standard metrics so you can
measure the impact of any retrieval change.

- `eval/queries.jsonl`: 20–25 hand-curated queries with expected
  pages. Spread across versions, platforms, and difficulty levels.
  Include the queries that "obviously" should work and DON'T —
  those are the ones to track.
- `eval/retrievers.py`: a `Retriever` protocol with concrete
  implementations: `DenseRetriever`, `RerankedRetriever`,
  `BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
  matrix dimension per knob.
- `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
  retrievers; emits a markdown comparison table at
  `eval/results/<baseline>.md`. Commit the result so PRs land with
  the A/B evidence in the diff.

Three numbers are enough — don't overengineer. The hand-curated
queries are the value; the metrics are just a stable way to score
them.

### Phase 8 — BM25 + Hybrid retrieval  *(half a day, conditional)*

**Skip unless your eval shows specific failure modes.** Dense
embeddings + cross-encoder reranker handle most queries. The case
where they don't: queries with rare technical tokens (filenames,
language names, error codes) get buried at dense rank 1000+ by a
much larger prose corpus that's semantically nearby. The reranker
only sees top-200, so it never gets a shot.

- `rag/bm25.py`: SQLite FTS5 index, in the stdlib, on-disk
  (`bm25/<product>.db`). Two tables — metadata table keyed by
  rowid, FTS5 virtual table for full-text. Sanitize the query
  (strip FTS5 reserved keywords, OR-join tokens for recall). ~210
  LOC.
- `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
  Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
  ordered ids plus per-retriever contribution dict for telemetry.
- `search_docs` hybrid path: run dense + BM25 in parallel,
  RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
  `HYBRID_SEARCH=true`.
- Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
  to usage logs so you can measure whether BM25 is actually earning
  its keep on production traffic.

If after 4–6 weeks of production data you see `bm25_only >= 80%`,
you can simplify to BM25-only (much less infrastructure). If
`both >= 50%`, hybrid is acting as tie-breaker not rescue — keep it
or simplify depending on how much you care about the long tail.

### Phase 9 — Multi-version diff tooling  *(1 day, if applicable)*

**Only relevant if the product has multiple maintained versions.**

- `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
  diff between two versions of the same page. Two matching
  strategies: editor-curated `topic_cluster` peer (if the portal
  exposes it), or same-filename fallback.
- `list_cluster(bundle_id, page_id)`: list cross-version peers
  for one page.
- `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
  removed / changed pages between two bundles, sorted by churn.
- `_diff_churn(a, b)`: small helper, ~15 LOC of `difflib.unified_diff
  --unified=0` line counting. Used by `bundle_changelog`,
  `find_doc_inconsistencies`, and `weekly_digest`.

### Phase 10 — Usage logging  *(half a day)*

Goal: per-call JSONL telemetry so you can answer "what are people
actually asking for" and "is the new feature getting used."

- `docs_mcp/usage.py`: `TimedCall` context manager that captures
  tool name, args, elapsed time, hits returned, any extra fields
  set by the tool via `_call.set(key=value)`. Writes JSONL to
  `var/logs/usage.jsonl`, rotated daily, kept 90 days.
- Mount the log dir as a named compose volume so logs survive
  container recreates.
- `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
  the JSONL files, prints per-tool counts, top queries, 0-hit
  queries, filter usage histogram, reranker activity. Markdown
  output flag for piping into weekly digest emails.

What to log: query text, filters, hits returned, elapsed_ms,
reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
to log: anything PII-shaped. The corpus is public, queries are
usually about the product, not personal — but be deliberate.

### Phase 11 — Curated knowledge layer  *(2 days)*

The "RAG can't tell you what isn't in the docs" gap. Surfaces:

- **API quickstart repos** if the product has them. Ingest the
  example scripts (Python, PowerShell, curl) into the corpus.
  Rewrite chunk-0 for each script to embed naturally — explicit
  natural-language H1, task description sentence, keyword bag.
  Dense embeddings need an anchor.
- **A curated `<product>_api_lessons` markdown doc** for things
  the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
  patterns, schema bugs you've hit, platform-detection quirks.
  Surface as a dedicated MCP tool whose description tells the LLM:
  *"Call proactively whenever the user asks you to write a script
  / integrate with the API / debug a 4xx response."*
- **An auto-hint banner** in `search_docs` results — when the
  query matches a script/API trigger word, render a one-line nudge
  at the top of results pointing at the dedicated tool. Belt-and-
  suspenders for queries where the LLM doesn't think to call it
  proactively.

### Phase 12 — Doc-inconsistency tool  *(half a day, optional)*

A *"scan the corpus for likely doc bugs"* tool the model can call
when an operator asks "is this section reliable?"

- `find_doc_inconsistencies(scope_query, version=None, platform=None,
  max_pages=30, checks=None)`: deterministic, read-only. Two checks:
  cross-version drift (pages whose content shifted between immediate-
  previous versions in the actionable 10–60% churn band) and
  redirect-chain detection (short pages whose body is just a "see
  [other page] for details" pointer). Heavy lifting is line-level
  diff (`difflib`) against editor-curated cluster peers; the model
  judges which findings are real bugs.

### Phase 13 — Weekly digest tool  *(half a day)*

Goal: a tool that answers *"what changed in the docs in the last N
days?"* with no runtime git dependency (the prod container has no
git).

- Extend `scrape/changelog.py` with `--json` (one-shot structured
  output) and `--history-out PATH` (walks `git log --first-parent
  --since="<N> days ago"` for corpus-touching commits, writes one
  JSON line per commit to a JSONL file).
- CI workflows write the JSONL file into the image at build time:
  `corpus/.digest/history.jsonl`. Both `refresh.yml` and
  `image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
- New MCP tool `weekly_digest(days=7, version=None, platform=None,
  max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
  filters to the window, applies version/platform via
  `bundles.json` metadata, aggregates per-bundle change counts and
  page lists, renders markdown.
- Post-filter totals are critical: the headline "X page changes
  across Y bundles" must compute X from the filtered set, not the
  raw record count. Otherwise filtered calls look wrong to the
  reader.

Out of scope but trivial bolt-ons: scheduled HTML email of the
digest, auto-publish to a blog, per-page diff excerpts as a
follow-up tool.

---

## Standard tool set

By the end you'll have ~15 tools registered. Production-tested
shape:

| Tool | What it does |
|---|---|
| `search_docs` | Semantic search with version/platform/bundle filters |
| `get_page` | Full markdown + metadata for one page |
| `list_versions` | Discover available facet values |
| `list_cluster` | Cross-version peers for one page (if applicable) |
| `diff_versions` | Unified diff of a page across two versions |
| `bundle_changelog` | Added / removed / changed pages between two bundles |
| `weekly_digest` | What changed in the last N days, with filters |
| `corpus_status` | Freshness + size of the knowledge base |
| `find_doc_inconsistencies` | Scoped scan for doc bugs |
| `<product>_api_lessons` | Curated API gotchas, proactively-called |
| product-specific tools | Interop matrix, lifecycle queries, etc. |

---

## Per-product customization checklist

When applying this template to a new product, here's what you have
to figure out yourself — everything else is shared infrastructure:

- **Doc portal mechanics**
  - URL pattern for pages
  - Bundle/version concept (Zoomin "bundle", Madcap "project",
    GitBook "space", Docusaurus "docs version" — same idea, different
    name)
  - SPA backing API (sniff the network tab) or fallback to
    headless browser
  - How `topic_cluster` -equivalent cross-version peers are exposed
    (or whether you synthesize them from filenames)
- **Bundle metadata schema**
  - What does `version` look like? Semver, calendar, named?
  - What does `platform` mean for this product? Is there a useful
    facet at all?
  - Other useful facets (language, product line, edition)?
- **Filterable facets** for `search_docs`
  - One filter per high-cardinality facet
  - Skip filters that have <5 distinct values — they're not worth
    the surface area
- **Curated knowledge** for the `_api_lessons` tool
  - What does the product's API documentation NOT say that you've
    learned from real integration work?
- **Quickstart / example repos**
  - Does the vendor publish working code? Ingest it; rewrite
    chunk-0 for natural-language retrieval.

---

## Decisions worth carrying forward

Things you'll save time on by deciding the same way again:

- **Tool descriptions are user interface.** The LLM reads them
  verbatim and decides whether to call the tool. *"Use when..."*
  and *"Call proactively whenever..."* are real surfaces; treat
  them like button labels. Most retrieval improvements turn out
  to be tool-description rewrites in disguise.
- **`stateless_http=True`** on the FastMCP server. Eliminates
  whole categories of session-ID-related 404 storms after
  container recreates.
- **Pre-bake everything at CI time.** No runtime calls to git,
  external services, or anything you wouldn't trust on a
  Cloudflare outage. If the digest needs git history, write a
  JSONL file at CI time. If the lessons doc needs to load fast,
  bake it into the image.
- **Env-gate every side-effecting tool.** Off by default in dev;
  on only in production compose. Belt and suspenders against
  accidental writes from staging environments.
- **Operator-confirmation pattern for side-effecting tools.**
  The tool docstring is the only place to enforce
  human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
  "show-confirm-then-submit" — those phrasings work.
- **Verify with hand-curated golden queries before shipping any
  retrieval change.** Numbers in the diff, in the commit message.
  Don't ship retrieval changes on vibes.
- **Two-cadence CI** (weekly scrape vs on-demand code-only)
  saves hours per code iteration once you're past the
  one-iteration-a-week stage.
- **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
  what Watchtower watches; `:<sha>` is your safety net. Rollback
  is a one-line compose edit, not a redeploy.
- **Usage logging is non-negotiable.** You will be wrong about
  what people use. Capture the truth from day one; let it tell
  you which features to keep building and which to delete.

---

## Glossary

- **Bundle** — one logical doc set in the portal. Zoomin calls
  them bundles; Madcap calls them projects; the concept is the
  same: a versioned, titled collection of pages. One dir under
  `corpus/`.
- **Page** — one HTML page in a bundle. One `.md` + one `.json`
  sidecar under the bundle dir.
- **Topic cluster** — Zoomin's name for "this page in version
  10.9 corresponds to that page in version 10.8." Stored in the
  per-page sidecar. The portal-agnostic concept is "cross-version
  peer mapping."
- **Chunk** — a unit of text that gets independently embedded and
  stored in Chroma. Target ~400-600 tokens; preserve paragraph
  boundaries.
- **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
  lists from independent retrievers without score calibration.

---

## What's deliberately NOT in this template

Decisions you should make per-product (not copy from the original
build):

- The reverse proxy and TLS termination layer. Could be Caddy,
  nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
- The Gateway / aggregator in front of multiple MCPs (MetaMCP is one
  option; you may not need any aggregator if you're running a
  single product MCP).
- The specific embedding model — `nomic-embed-text` is a strong
  default but newer / domain-specific models may be better for
  some products.
- The Ollama containers / GPU setup — depends on what hardware you
  have. The pattern is one container per GPU with explicit
  `NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
  across them.
- Whether to publish a blog series alongside the build. Strongly
  recommended (forces clarity, builds an audience), but optional.