initial: docs-mcp-template — build guide + scaffolded server

Template for building hosted MCP servers over a product's public
documentation. Distilled from one production build; everything
product-specific has been factored out.

Contents:

- PLAN.md — comprehensive build guide. 13 phases from project
  skeleton through weekly_digest. Includes the gotchas
  ("fetch-depth: 0 always", reranker per-pair token limit,
  Cloudflare body cap, dash-not-bash on Gitea runners), the
  decisions worth carrying forward, and a per-product
  customization checklist.

- CLAUDE.md — guidance for Claude Code working in a clone of this
  template. Phase identification table, conventions (env-gating +
  operator confirmation for side-effecting tools, defensive
  fallback for retrieval components), common commands.

- README.md — quick-start summary.

Scaffolded code (all signature-stable, with NotImplementedError
stubs where phase-specific work is required):

  docs_mcp/server.py    FastMCP server, stateless_http=True, with
                        search_docs / get_page / list_versions
                        baseline tools and commented stubs for the
                        rest of the phase set.
  docs_mcp/usage.py     TimedCall telemetry, JSONL, daily rotation,
                        90-day retention. Reusable as-is.
  rag/embeddings.py     Ollama embedder (nomic-embed-text default),
                        load-balanced across N URLs. Reusable.
  rag/chunk.py          Paragraph-aware chunker with synthetic
                        chunk 0. Per-product tunable.
  rag/index.py          Chroma + BM25 builder. --rebuild and
                        --bm25-only flags.
  rag/bm25.py           SQLite FTS5 lexical index. Reusable.
  scrape/changelog.py   --cached / --ref / --json / --history-out.
                        Reusable.
  scrape/README.md      What you write per-product.
  eval/queries.jsonl.example
                        Curate ~25 hand-labeled queries here.
  eval/retrievers.py    Retriever protocol + stub classes.
  eval/run_eval.py      MRR / Recall@K / nDCG@K harness skeleton.
  scripts/usage_report.py
                        Standalone log analyzer; see the
                        FOLLOW-UP CHECKS pattern noted in the
                        module docstring.
  scripts/registry_gc.py
                        Gitea container registry cleanup. Reusable.

Deployment + CI:

  Dockerfile               Python 3.12-slim; COPY corpus + chroma
                           + bm25 last for cache efficiency.
  deploy/docker-compose.yml MCP + reranker sidecar + Watchtower.
                           Templated with <placeholders>.
  .gitea/workflows/refresh.yml    Weekly cron + manual dispatch.
                                  fetch-depth: 0, retry-on-race,
                                  three-tag image scheme.
  .gitea/workflows/image-only.yml Code-only ship cycle, ~18min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:18:17 -04:00
commit 9ba615c8ee
26 changed files with 3280 additions and 0 deletions
@@ -0,0 +1,89 @@
name: Image rebuild (skip scrape)
# Fast path for code-only changes. Skips the scrape and goes straight to:
# rebuild indexes (from corpus already committed on main) + image build
# + push. Runtime is ~18 min vs ~40 min for the full refresh.
#
# Use when a PR only changes code/config — anything where the upstream
# corpus hasn't moved but we want the new Python in the running image.
#
# IMPORTANT: fetch-depth: 0 is required for the digest-history step
# to find commits to walk. Don't change to 1.
on:
  workflow_dispatch:
env:
  REGISTRY_PUSH: <lan-host>:<port>
  REGISTRY_PULL: <public-registry-hostname>
  IMAGE: <owner>/<product>-docs-mcp
  OLLAMA_URL: http://<gpu-host>:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: <product>
jobs:
  build:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Full history (not shallow) so the digest-history step can
          # walk git log up to --history-days back.
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt
      - name: Refresh digest history
        # Cheap (a few seconds); doesn't touch corpus content.
        # Without this step, a code-only deploy would ship an
        # increasingly-stale digest history relative to git.
        run: |
          mkdir -p corpus/.digest
          python -m scrape.changelog \
            --history-out corpus/.digest/history.jsonl \
            --history-days 120
      - name: Verify committed corpus is present
        run: |
          test -d corpus || { echo "ERROR: corpus/ missing on this ref"; exit 1; }
          echo "corpus: $(du -sh corpus | cut -f1), $(find corpus -name '*.md' | wc -l) markdown files"
      - name: Rebuild indexes from existing corpus
        run: python -m rag.index --rebuild
      - name: Log in to registry (LAN endpoint)
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u <user> --password-stdin
      - name: Build & push image
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          DATE_TAG=$(date -u +%Y.%m.%d)
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
      - name: Prune old container versions
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner <user> \
            --package <product>-docs-mcp \
            --keep-days 90 \
            --keep-latest 5
@@ -0,0 +1,158 @@
name: Weekly docs refresh
# Runs the full pipeline: scrape upstream → rebuild indexes → push
# image. Cron'd weekly (Mondays). Skips the reindex + image-push if the
# scrape produced no diff against the committed corpus.
#
# IMPORTANT: actions/checkout@v4 fetch-depth: 0 is required because
# the digest-history step walks git log up to --history-days back.
# With a shallow checkout the history file ships empty.
on:
  schedule:
    - cron: "0 6 * * 1" # Mondays 06:00 UTC
  workflow_dispatch:
    inputs:
      force_build:
        description: "Rebuild indexes + push image even if corpus is unchanged"
        type: boolean
        default: false
env:
  # If your registry sits behind Cloudflare with its 100 MB body cap,
  # use a LAN endpoint for pushes (bypasses CF) and the public hostname
  # for pulls (response bodies aren't capped).
  REGISTRY_PUSH: <lan-host>:<port>
  REGISTRY_PULL: <public-registry-hostname>
  IMAGE: <owner>/<product>-docs-mcp
  # Embedder. One URL per GPU; the indexer round-robins.
  OLLAMA_URL: http://<gpu-host>:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: <product>
jobs:
  refresh:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Full history — required for the digest-history step to
          # walk git log. Default fetch-depth: 1 silently produces a
          # 0-byte history file.
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt
      # ---- Phase 1: scrape ---------------------------------------
      - name: Refresh bundle catalog
        run: python -m scrape.bundles
      - name: Re-scrape all bundles
        # --force re-fetches every page so we actually see upstream
        # edits. Without it the runner skips pages already on disk.
        run: python -m scrape.runner --all --force --concurrency 6
      # ---- Build the digest history BEFORE committing ------------
      # See PLAN.md Phase 13. Walks recent corpus-touching commits
      # and writes corpus/.digest/history.jsonl. The current refresh
      # gets added on the NEXT run's history (one-week lag is fine).
      - name: Build digest history
        run: |
          mkdir -p corpus/.digest
          python -m scrape.changelog \
            --history-out corpus/.digest/history.jsonl \
            --history-days 120
      # ---- Commit + retry-on-race --------------------------------
      - name: Commit corpus changes (if any)
        id: commit
        run: |
          git config user.name "<product>-docs-refresh"
          git config user.email "actions@<your-domain>"
          git add bundles.json corpus
          if git diff --cached --quiet; then
            echo "no corpus changes — skipping reindex and image build"
            echo "changed=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "changed=true" >> "$GITHUB_OUTPUT"
          python -m scrape.changelog --cached > /tmp/changelog.txt
          summary=$(head -1 /tmp/changelog.txt)
          ts=$(date -u +"%Y-%m-%dT%H:%MZ")
          {
            echo "weekly refresh: ${ts} — ${summary}"
            echo ""
            cat /tmp/changelog.txt
          } > /tmp/commitmsg.txt
          git commit -F /tmp/commitmsg.txt
          # Retry on race: if main moved while we were scraping (a
          # human merged a PR during the run), `git push` rejects
          # with "fetch first". Rebase our corpus commit onto new
          # main and retry. Corpus + code paths are disjoint, so
          # the rebase is trivially clean.
          attempt=1
          while [ $attempt -le 3 ]; do
            if git push; then
              echo "pushed corpus changes (attempt $attempt)"
              break
            fi
            if [ $attempt -eq 3 ]; then
              echo "push still failing after 3 attempts — bailing"
              exit 1
            fi
            git fetch origin main
            git rebase origin/main || { echo "rebase conflict — bailing"; exit 1; }
            attempt=$((attempt + 1))
          done
      # ---- Reindex Chroma + BM25 ---------------------------------
      - name: Rebuild indexes
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: python -m rag.index --rebuild
      # ---- Build & push image ------------------------------------
      - name: Log in to registry (LAN endpoint)
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u <user> --password-stdin
      - name: Build & push image
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        # Runner shell is /bin/sh — use cut instead of ${VAR::N}.
        # Three tags: :latest (Watchtower target), :<sha12>
        # (rollback pin), :<YYYY.MM.DD> (human-readable).
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          DATE_TAG=$(date -u +%Y.%m.%d)
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
      # ---- Registry GC -------------------------------------------
      - name: Prune old container versions
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner <user> \
            --package <product>-docs-mcp \
            --keep-days 90 \
            --keep-latest 5
@@ -0,0 +1,31 @@
# Virtualenv
venv/
.venv/
# Regenerable from corpus + CI
corpus/
chroma/
bm25/
# Python detritus
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.mypy_cache/
.ruff_cache/
# Eval results (regenerable; commit only the headline baseline if you want)
# eval/results/
# Usage logs (host-mounted volume in prod; don't commit dev logs)
var/
# Local-only env
.env
.env.local
# IDE
.vscode/
.idea/
*.swp
@@ -0,0 +1,232 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when
working with code in this repository.
## Purpose
This is a **template** for building an MCP server over a product's
public documentation. When you (Claude) are working in a clone of this
repo, you are helping the user implement one specific product's docs
MCP — not editing the template itself.
**Read `PLAN.md` first.** It's the canonical build guide and lays out
13 phases. Most user requests will be "implement Phase N" or "we hit
a bug in Phase N." Identify the phase before doing anything else.
## Working with this template
### Identifying the current phase
When the user clones this template and starts working, figure out
which phase they're on by inspecting:
| Signal | Likely phase |
|---|---|
| `corpus/` doesn't exist | Phase 1 (scraper) — they need to build it before anything else works |
| `corpus/` exists, `chroma/` doesn't | Phase 2 (indexing) |
| Indexes exist, only `search_docs` / `get_page` / `list_versions` implemented | Phase 3 (server skeleton done; next: Dockerfile + CI) |
| No `Dockerfile` or `.gitea/workflows/` updated | Phase 4-5 |
| `RERANK_URL` env unset in compose | Phase 6 not done |
| `HYBRID_SEARCH` env unset, no `rag/bm25.py` content | Phase 8 not done |
| No `eval/results/` directory | Phase 7 not done |
| `find_doc_inconsistencies` / `submit_doc_bug` are commented-out stubs in `docs_mcp/server.py` | Phase 12 |
| No `corpus/.digest/` produced by CI | Phase 13 |
When in doubt, ask the user: *"Which phase from PLAN.md are we
working on?"*
### The scaffolded server has stubs
`docs_mcp/server.py` ships with three working tools (`search_docs`,
`get_page`, `list_versions`) and signature-only stubs for the
phase-specific tools. The stubs `raise NotImplementedError` with a
phase hint in the docstring. When implementing a phase, you'll be
filling these bodies in — DO NOT change the signatures unless the
user has a specific reason. Signatures are the public contract
between the MCP and its clients (Claude Desktop, Claude Code,
Cursor, etc.).
## Layout
```
.
├── PLAN.md # Read first. Phase-by-phase build guide.
├── README.md # Quick-start summary.
├── CLAUDE.md # This file.
├── requirements.txt
├── Dockerfile
├── deploy/docker-compose.yml
├── .gitea/workflows/
│ ├── refresh.yml # Weekly cron: scrape + index + image
│ └── image-only.yml # On-demand: code-only ship cycle
├── scrape/ # Phase 1 — product-specific scraper here
│ └── changelog.py # Reusable: --json, --history-out
├── rag/ # Phase 2/8 — indexing
│ ├── embeddings.py # Ollama embedder (swappable)
│ ├── chunk.py # Page → chunks (adjust per page format)
│ ├── index.py # Builds Chroma + BM25
│ └── bm25.py # SQLite FTS5 lexical index
├── docs_mcp/ # Phase 3+ — MCP server
│ ├── server.py # FastMCP + tool definitions
│ └── usage.py # TimedCall telemetry
├── eval/ # Phase 7 — golden-query harness
│ ├── queries.jsonl.example
│ ├── retrievers.py
│ └── run_eval.py
├── scripts/ # Standalone ops scripts
│ ├── usage_report.py
│ └── registry_gc.py
└── deploy/
    └── docker-compose.yml
```
## Conventions
### Tool docstrings are user interface
The text in `@mcp.tool()` docstrings is what the LLM sees and uses to
decide whether to call the tool. Treat it like a button label.
*"Use when..."*, *"Call proactively whenever..."* phrasings work
well. Don't bury the headline in implementation notes.
### Side-effecting tools must be env-gated AND operator-confirmed
Any tool that POSTs to an external service (submit_doc_bug being the
canonical example):
1. Must check an env flag at call time and return a "disabled,
manual fallback at <URL>" message if unset.
2. Must have a loud docstring requiring per-call operator
confirmation in the LLM conversation flow (the LLM drafts, shows
the operator the exact payload, asks yes/no, only then calls).
3. Must do upfront validation (URL allowlist, content length, etc.)
so the LLM gets a clean error instead of a wire-level failure.
Match the `submit_doc_bug` patterns documented in PLAN.md Phase 12.
### Defensive fallback for retrieval components
The reranker, BM25 index, and any external dependency must fail
gracefully:
- Catch the specific exception type
- Log a warning with enough info to debug
- Fall back to a working baseline (dense-only, no reranker, etc.)
- Never block a search_docs call on a single failure
The user's MCP is in front of real people; partial degradation
beats a 500.
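
A minimal sketch of that fallback shape, assuming the dense search and the reranker are passed in as plain callables. `RerankUnavailable` and `search_with_fallback` are hypothetical names, not part of the scaffolded server.

```python
import logging
from typing import Callable, Sequence

log = logging.getLogger("docs_mcp.search")


class RerankUnavailable(Exception):
    """Hypothetical error type raised by a reranker client."""


def search_with_fallback(
    query: str,
    dense_search: Callable[[str], Sequence[str]],
    rerank: Callable[[str, Sequence[str]], Sequence[str]],
) -> list[str]:
    """Return reranked results, or the dense order if the reranker fails."""
    candidates = list(dense_search(query))
    try:
        return list(rerank(query, candidates))
    except (RerankUnavailable, TimeoutError) as exc:
        # Catch specific types only; log enough context to debug later.
        log.warning(
            "rerank failed for %r (%d candidates): %s; using dense order",
            query, len(candidates), exc,
        )
        return candidates
```

Unexpected exception types still propagate, which keeps genuine bugs visible instead of hiding them behind the fallback.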
### Verify retrieval changes with the eval harness
Any change that touches retrieval (new embedder, chunker tweak,
reranker model, filter shape) ships with eval numbers in the commit
message. Don't ship retrieval changes on vibes. If `eval/queries.jsonl`
isn't populated yet, populate it before changing retrieval — it's
the most important file in the repo.
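
For orientation, two of the metrics the harness reports (reciprocal rank and Recall@K) can be sketched as below. This assumes each query is labeled with a set of expected page ids; the function names are illustrative and are not taken from `eval/run_eval.py`.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/position of the first relevant result, or 0.0 if none was retrieved."""
    for position, page_id in enumerate(ranked, start=1):
        if page_id in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected pages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, expected pages) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)
```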
### Standard infrastructure choices
These are reasoned defaults — only deviate if you have a specific
need:
- **Embedding model**: `nomic-embed-text` via Ollama (768-dim,
free, on-prem)
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
`/v1/rerank` endpoint
- **Vector store**: Chroma `PersistentClient`
- **Lexical store**: SQLite FTS5 (stdlib)
- **Fusion**: Reciprocal Rank Fusion with k=60
- **Transport**: streamable-HTTP in prod, stdio for local dev
- **MCP framework**: FastMCP with `stateless_http=True`
- **Container deploy**: Watchtower auto-pull on `:latest`, rollback
via `:<sha12>` pin
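
As a reference point for the fusion default, Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over the rankings it appears in. A minimal sketch, assuming each retriever returns an ordered list of ids; `rrf_fuse` is an illustrative name, not a function from the template.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]
lexical = ["b", "d", "a"]
print(rrf_fuse([dense, lexical]))  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, the dense and lexical scores never need to be put on a common scale.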
### Naming the product
The template uses `PRODUCT_NAME` env var (defaults to `"myproduct"`)
throughout. Set it on first build. References show up in:
- `docs_mcp/server.py`: `FastMCP(f"{PRODUCT_NAME}-docs", ...)`
- Collection name (`<product>_docs`)
- BM25 db filename
- Tool names that include the product name (e.g., the `_api_lessons`
tool — convention is to name it `<product>_api_lessons`)
Use lowercase, underscores not hyphens, since it ends up in tool
identifiers that the LLM reads.
## Common commands
```bash
# Set up dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Run the MCP server locally for Claude Desktop dev
python -m docs_mcp.server --transport stdio
# Run as HTTP for integration testing
python -m docs_mcp.server --transport streamable-http --port 8000
# Rebuild Chroma + BM25 indexes from corpus
python -m rag.index --rebuild
# Rebuild only BM25 (fast iteration)
python -m rag.index --bm25-only
# Run the eval harness
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
# Generate changelog summary (called by CI, useful locally too)
python -m scrape.changelog --cached
python -m scrape.changelog --history-out corpus/.digest/history.jsonl --history-days 120
```
## Gotchas (carried forward from the reference build)
- **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
Default is shallow; history-walking steps (changelog, digest)
silently produce empty output otherwise. This is the #1 thing
people miss.
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the
ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate docs
to ~2000 chars before sending to rerank. Full chunk text still
goes back to the user; truncation is reranking-only.
- **FastMCP `stateless_http=True`**: critical for production
hosting behind Watchtower auto-updates. Without it, every
container recreate produces a 404 storm from clients with
stale session IDs.
- **Runner shell is `/bin/sh` (dash)**: no `${VAR::N}` substring
expansion in workflow scripts. Use `cut`/`awk`/`printf`.
- **Cloudflare 100 MB body cap**: if pushing through a Cloudflare-
fronted registry, push via LAN endpoint, pull via public
hostname. Same registry, different URLs.
## When the user says...
| User says | You do |
|---|---|
| "Let's start building" / "set up the project" | Read PLAN.md Phase 0; create dirs, requirements.txt, etc. Confirm Python version and existing tooling. |
| "Build the scraper" / "scrape the docs" | Read PLAN.md Phase 1. Find the upstream portal's underlying API by sniffing; AVOID headless-browser solutions unless the API path is truly closed. |
| "Get retrieval working" / "make search work" | Read PLAN.md Phase 2-3. Implement chunking, embedder, Chroma indexer, then the three baseline tools. |
| "Add a reranker" | Read PLAN.md Phase 6. Stand up the llama.cpp sidecar, implement `_rerank()`. Verify with the eval harness. |
| "Search is missing X queries" | Run the eval harness first to confirm the failure. Then consider: rich chunk-0 rewrites, hybrid retrieval, curated knowledge layer. Don't just tune cosine. |
| "Let's add hybrid search" | Read PLAN.md Phase 8. Only after you've established the failure mode with eval queries — hybrid is not free. |
| "Make a tool that submits doc bugs" | Read PLAN.md Phase 12. Find the docs portal's feedback endpoint by sniffing. Build with operator confirmation as a hard requirement in the tool docstring. |
| "I want a 'what changed' tool" | Read PLAN.md Phase 13. Don't try to do this at runtime — pre-bake the history JSONL at CI time. |
## Out-of-scope concerns (don't try to solve here)
- **Reverse proxy / TLS termination** — outside the repo. User
picks Caddy / Cloudflare Tunnel / nginx / Traefik based on their
infra.
- **MetaMCP or other gateway** — outside the repo. Optional, only
matters when running multiple MCPs.
- **GPU container orchestration** — outside the repo. Pattern is
one Ollama container per GPU; the indexer load-balances. Document
it in deploy/ but don't build it in this template.
- **Email/blog delivery for weekly_digest** — out of scope per
PLAN.md ("Out of scope" section). Add a separate script in
scripts/ if/when the user asks.
@@ -0,0 +1,43 @@
# Docs MCP server — production image.
#
# Structure: copy code first, then the regenerable indexes last so a
# code-only change doesn't invalidate the corpus COPY layer.
#
# The container runs the MCP server via streamable-http on PORT 8000.
# Override via MCP_HOST / MCP_PORT env if you front it with a different
# reverse-proxy setup.
FROM python:3.12-slim
WORKDIR /app
# Install Python deps first for cacheability.
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
# Code.
COPY scrape /app/scrape
COPY rag /app/rag
COPY docs_mcp /app/docs_mcp
# Catalog. Written by the scraper at CI time.
COPY bundles.json /app/
# Regenerable indexes. CI builds these from corpus/ in the same job
# that builds the image. Listed last so code changes don't invalidate
# the COPY layer cache for these (much larger) directories.
#
# bm25/ is only consulted when HYBRID_SEARCH=true (the server falls
# back to dense-only if it's missing).
COPY corpus /app/corpus
COPY chroma /app/chroma
COPY bm25 /app/bm25
ENV PYTHONUNBUFFERED=1 \
MCP_TRANSPORT=streamable-http \
MCP_HOST=0.0.0.0 \
MCP_PORT=8000
EXPOSE 8000
ENTRYPOINT ["python", "-m", "docs_mcp.server"]
@@ -0,0 +1,647 @@
# Docs MCP Server — Build Guide
A reusable recipe for building a hosted MCP server over a product's
public documentation. Distilled from one production build; everything
product-specific has been factored out.
The end product is a streamable-HTTP MCP server with ~15 tools that
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, find inconsistencies, and (optionally) submit doc bugs
back upstream.
---
## What you're building
A pipeline with these stages:
```
upstream docs portal
scrape ──► corpus/<bundle>/<page>.md + .json sidecar
chunk + embed ──► chroma/ (dense vectors)
│ ──► bm25/ (FTS5 lexical index)
MCP server ──► search_docs / get_page / diff_versions / weekly_digest /
find_doc_inconsistencies / submit_doc_bug / ...
reverse proxy / Cloudflare Tunnel ──► public endpoint
```
Two CI cadences:
- **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
image build & push.
- **On-demand image-only** (~18 min): code-only rebuild from
committed corpus, image build & push.
Supporting infrastructure: a container registry (self-hosted Gitea
works well), a host running Docker Compose, Watchtower auto-updating
from `:latest`, and a reverse proxy in front.
---
## Build phases
Each phase is a discrete, shippable unit. Build them in order; each
one is useful on its own and unlocks the next. Realistic effort per
phase is given as a rough order of magnitude. Total: roughly 2-3
weeks of focused work for the full stack.
### Phase 0 — Project skeleton *(half a day)*
Goals: directory layout, dependency manifest, virtualenv.
- Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`,
`docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
- `requirements.txt` with the dependencies you'll need across all
phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML
parser, ollama or sentence-transformers client, etc.).
- `python -m venv venv` and pin Python version (3.11 or 3.12 — be
conservative; some embedding libraries have version-specific
wheels).
- `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/`
(regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`,
`.pytest_cache/`.
### Phase 1 — Scraper *(2-4 days, product-specific)*
This is the most product-dependent phase. The goal is to write a
scraper that produces a normalized corpus layout regardless of
upstream portal shape.
Output shape (mandatory):
```
corpus/
  <bundle_id>/        # one dir per "doc bundle" — see Glossary
    <page_id>.md      # markdown body
    <page_id>.json    # sidecar with structured metadata
  ...
bundles.json          # catalog of bundles with metadata
```
**Bundle metadata** (`bundles.json` is a list of these):
```json
{
  "slug": "<bundle_id>",
  "title": "User-facing title",
  "version": "10.9",
  "platform": "VMware vSphere", // may be null
  "product": "Admin Guide", // optional but useful
  "language": "en-US",
  "page_count": 127,
  "dates": {
    "Added on": "2024-01-15",
    "Updated on": "2026-05-20"
  },
  "landing_page": "<page_id>"
}
```
**Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
The one field that matters cross-cutting is `topic_cluster` (see
Phase 9):
```json
{
  "bundle_id": "<bundle_id>",
  "page_id": "<page_id>",
  "title": "How to ...",
  "ordinal": 42,
  "topic_cluster": {
    "clustering_title": "How to ...",
    "clustered_topics": [
      {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
      {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
    ]
  }
}
```
If the portal exposes a cross-version "this page corresponds to that
page" mapping, capture it here. If it doesn't, you can synthesize a
filename-based fallback (same filename across bundle versions = same
topic) and live without the editor-curated mapping. The features that
read `topic_cluster` (`list_cluster`, `diff_versions`,
`find_doc_inconsistencies`, parts of `weekly_digest`) will work
either way; they're more accurate with real clusters.
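
The filename-based fallback could look roughly like the sketch below. It assumes the scraper can enumerate `(bundle_id, page_id)` pairs and that bundles of the same guide reuse page filenames across versions; `synthesize_clusters` is a hypothetical helper, not part of the template.

```python
from collections import defaultdict


def synthesize_clusters(pages: list[tuple[str, str]]) -> dict[str, list[dict[str, str]]]:
    """Group pages that share a filename across bundles.

    Returns page_id -> clustered_topics entries, keeping only filenames
    that occur in more than one bundle.
    """
    by_name: dict[str, list[dict[str, str]]] = defaultdict(list)
    for bundle_id, page_id in pages:
        by_name[page_id].append({"bundle_id": bundle_id, "page_id": page_id})
    return {
        name: sorted(members, key=lambda m: m["bundle_id"])
        for name, members in by_name.items()
        if len(members) > 1
    }
```

This fallback treats renamed pages as unrelated and matches unrelated pages that happen to share a filename, which is one reason real clusters are more accurate.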
**Patterns that recur across doc portals:**
- Most modern doc portals are SPAs. Plain `requests.get` won't see
rendered content. Either find the underlying API the SPA calls (the
cheapest, most reliable path), or fall back to a headless browser
(Playwright). The API path is almost always available; sniff the
network tab.
- Portals usually expose a "bundle/topic" hierarchy under the hood
(Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
it to `bundles.json` + `corpus/<bundle>/<page>`.
- Many portals expose `?save_local=` or `.pdf` rendered versions; the
HTML they serve is structurally cleaner than what the page shows
through the SPA shell.
**`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
`summarize_diff()`, `render_human()`, `walk_history()` and the
`--json` / `--history-out` modes. Mostly reusable as-is; the only
product-specific bit is the path layout assumption.
### Phase 2 — Chunking + embeddings + Chroma *(2 days)*
Goal: build a queryable dense index from the scraped corpus.
- `rag/chunk.py` — split each page's markdown into ~400-600 token
chunks. Strategy that works: paragraph-aware splitter with a
rich "chunk 0" containing the page title + 1-sentence summary +
bag-of-words from key terms. Chunk 0 is what dense retrieval lands
on first; getting it right dominates retrieval quality.
- `rag/embeddings.py` — pluggable embedder. Recommended start:
Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
Other defensible choices: `text-embedding-3-small` (OpenAI),
`bge-m3` (also via Ollama). The embedder is a Chroma
`EmbeddingFunction` that returns `list[list[float]]` for a list
of texts.
- `rag/index.py` — orchestrates: read corpus → emit chunks (with
metadata: bundle_id, page_id, version, platform, ordinal) →
upsert into Chroma collection. `--rebuild` flag for a clean
reindex. Run via `python -m rag.index --rebuild`.
Chroma settings: `PersistentClient(path="chroma/")` and
`Settings(anonymized_telemetry=False)`. Single collection
(`<product>_docs`).
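
The chunking strategy described above can be sketched as follows. This is an illustration rather than the contents of `rag/chunk.py`: it approximates tokens by whitespace-separated words, and `chunk_page` with its parameters is a hypothetical signature.

```python
def chunk_page(
    title: str,
    summary: str,
    key_terms: list[str],
    body: str,
    max_words: int = 450,
) -> list[str]:
    """Split a page into a synthetic chunk 0 plus paragraph-aligned chunks."""
    # Chunk 0: title + one-sentence summary + bag of key terms.
    chunks = [f"{title}\n{summary}\n{' '.join(key_terms)}".strip()]
    current: list[str] = []
    current_words = 0
    for paragraph in (p.strip() for p in body.split("\n\n")):
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Flush before the budget would be exceeded; never split a paragraph,
        # so a single oversized paragraph becomes its own oversized chunk.
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real tokenizer for the chosen embedding model would give tighter control over the 400-600 token target than the word-count approximation used here.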
**GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
(via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two
GPUs in parallel: ~5 minutes. The orchestrator just needs to load-
balance HTTP requests across multiple Ollama endpoints.
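
One simple way to spread the work is round-robin assignment of batches to endpoints. The sketch below covers only the assignment step and makes no HTTP calls; `assign_batches` and the batch size are illustrative assumptions, not the contents of `rag/embeddings.py`.

```python
from itertools import cycle


def assign_batches(
    texts: list[str],
    urls: list[str],
    batch_size: int = 64,
) -> list[tuple[str, list[str]]]:
    """Pair each batch of texts with an Ollama endpoint, round-robin."""
    if not urls:
        raise ValueError("at least one embedder URL is required")
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    endpoints = cycle(urls)
    return [
        (next(endpoints), texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
```

Each `(url, batch)` pair can then be sent from a small worker pool so that all endpoints stay busy at once.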
### Phase 3 — MCP server skeleton *(1 day)*
Goal: working FastMCP server with three tools — `search_docs`,
`get_page`, `list_versions`.
- `docs_mcp/server.py`: `FastMCP("<product>-docs", stateless_http=True)`.
`stateless_http=True` is critical for production hosting: every
request creates an ephemeral session, so container recreates don't
produce a 404 storm from stale `mcp-session-id` headers on
clients.
- Lazy initialization for everything expensive (Chroma client,
embedder, bundles catalog) so the server starts cleanly even when
Ollama is briefly unreachable.
- Tool: `search_docs(query, version=None, platform=None,
bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
source URLs.
- Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
metadata.
- Tool: `list_versions()`. Returns the version/platform facets
available, drawn from `bundles.json`. Helps the LLM pick filter
values.
Transports: stdio (for local Claude Desktop dev), streamable-HTTP
(for hosted production). One argparse switch.
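
The transport switch might be wired up along these lines. `--transport` and `--port` match the commands shown in CLAUDE.md and the `MCP_*` variables match the Dockerfile; the `--host` flag, the local defaults, and `build_parser` itself are assumptions. How the parsed values are handed to FastMCP is left out.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """CLI for the server; env vars supply defaults so the container needs no flags."""
    parser = argparse.ArgumentParser(prog="docs_mcp.server")
    parser.add_argument(
        "--transport",
        choices=["stdio", "streamable-http"],
        default=os.environ.get("MCP_TRANSPORT", "stdio"),
        help="stdio for local dev, streamable-http for hosted use",
    )
    # --host is an assumed flag; only the MCP_HOST env var appears in the Dockerfile.
    parser.add_argument("--host", default=os.environ.get("MCP_HOST", "127.0.0.1"))
    parser.add_argument(
        "--port", type=int, default=int(os.environ.get("MCP_PORT", "8000"))
    )
    return parser
```

For example, `build_parser().parse_args(["--transport", "streamable-http", "--port", "8000"])` yields a namespace with `transport` and `port` set accordingly.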
```python
@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
    ...
) -> str:
    ...
The tool descriptions are first-class context — the LLM reads them
and decides whether to call the tool. Treat them as button labels;
use "Call when..." / "Use proactively whenever..." phrasings.
### Phase 4 — Containerization *(1 day)*
Goal: image you can run anywhere.
- `Dockerfile`: Python 3.12-slim base, install requirements, COPY
`scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
+ (later) `bm25/`. Don't COPY `scripts/` — those stay external
for ops use only.
- `ENTRYPOINT ["python", "-m", "docs_mcp.server",
"--transport", "streamable-http"]`. Configurable host/port via env.
- `deploy/docker-compose.yml`: one service, named volumes for usage
logs and any state, Watchtower label, depends_on for the reranker
sidecar (Phase 6).
Smoke-test locally: `docker compose up` should expose
`http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC.
### Phase 5 — CI on self-hosted Gitea Actions *(1-2 days)*
Goal: weekly cron rebuild + on-demand code-only ship cycle.
**Two workflows, two cadences:**
| Workflow | Trigger | Steps | Runtime |
|---|---|---|---|
| `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
| `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |
**Critical settings (learned the hard way):**
- `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
(shallow), which breaks any step that walks git history (changelog,
digest history walker). Pay the ~10 second cost; never debug a
"0-byte history file" mystery.
- `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
- Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
expansion doesn't exist; use `cut` / `printf` / `awk`.
**Retry-on-race pattern for long-running scrapes:**
```bash
attempt=1
while [ $attempt -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  [ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
done
```
Works because scrape commits only touch `corpus/` + `bundles.json`,
and code merges only touch `.py` / `.yml` — disjoint paths, trivially
clean rebases.
**Image tagging — three tags per build:**
| Tag | Purpose |
|---|---|
| `:latest` | Watchtower watches this for auto-deploy |
| `:<sha12>` | Immutable; rollback target |
| `:<YYYY.MM.DD>` | Human-readable in incident notes |
Same tag set on every build; rollback is a one-line compose edit
to pin `:<sha>` instead of `:latest`.
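
A minimal sketch of how the three tags could be derived at build time. The helper name `image_tags` is hypothetical; the template's workflows may compute these in shell instead.

```python
from datetime import date


def image_tags(commit_sha: str, build_date: date) -> list[str]:
    """Return the three tags applied to every build (hypothetical helper)."""
    return [
        "latest",                         # rolling tag that Watchtower follows
        commit_sha[:12],                  # immutable 12-char sha, the rollback target
        build_date.strftime("%Y.%m.%d"),  # human-readable date tag
    ]
```

Because the date tag is zero-padded, date tags sort chronologically as plain strings, which keeps registry cleanup simple.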
**Container registry behind Cloudflare:**
Cloudflare's free tier has a 100 MB request body limit. Big image
layers (Chroma index can easily be 800+ MB) exceed it on push. The
fix is a LAN registry endpoint for push, public hostname for pull:
```yaml
env:
  REGISTRY_PUSH: <lan-ip>:<port>      # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>    # response bodies aren't capped
```
Runner needs the LAN endpoint in `/etc/docker/daemon.json`
`insecure-registries`. Costs nothing operationally; saves hours
of debugging.
**Registry GC:** weekly cron in the workflow that walks the package
versions, keeps `:latest` + N most-recent date tags + anything
pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the
package GC on the Gitea side reclaims disk after.
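
The retention rule above can be sketched as a pure function, kept separate from whatever registry API call does the listing and deleting. The function name, the `pushed_at` mapping, and the `keep_dates` default are assumptions for illustration, not the contents of `scripts/registry_gc.py`.

```python
import re
from datetime import datetime, timedelta, timezone

DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")


def tags_to_delete(pushed_at: dict[str, datetime], now: datetime,
                   keep_dates: int = 8, keep_days: int = 90) -> list[str]:
    """Pick the tags that fall outside the retention policy.

    pushed_at maps tag name -> push timestamp (timezone-aware).
    """
    cutoff = now - timedelta(days=keep_days)
    # Zero-padded YYYY.MM.DD tags sort chronologically as strings.
    date_tags = sorted((t for t in pushed_at if DATE_TAG.match(t)), reverse=True)
    keep = {"latest", *date_tags[:keep_dates]}
    # Anything pushed inside the retention window is kept regardless of name.
    keep |= {t for t, ts in pushed_at.items() if ts >= cutoff}
    return sorted(t for t in pushed_at if t not in keep)
```

Keeping the selection logic free of network calls makes it easy to dry-run against a listing before any delete is issued.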
### Phase 6 — Reranker *(half a day)*
Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
dense candidates.
- A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
`jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
GPU strongly recommended (CPU latency is unworkable for live
queries).
- `_rerank(query, docs)` helper in the server: POST to the endpoint,
apply the scores, re-sort the top-N candidates. Defensive: on any
failure log a warning and fall through to dense-only.
- Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
pull candidates for reranking; 200 is a good default),
`RERANK_TIMEOUT` (30s for cold-start tolerance).
- **Watch the per-pair token limit.** Jina's GGUF reports
`n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
any pair exceeds it. Truncate doc text to ~2000 chars before
reranking. The full untruncated chunk still goes back to the user;
truncation is only for the reranker scoring path.
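
A hedged sketch of the rerank helper described above: truncate for scoring only, and fall back to dense order on any failure. The request and response field names (`query`, `documents`, `results`, `index`, `relevance_score`) are assumed to follow the Jina-style schema; verify them against your llama.cpp build. The `post` parameter is an illustrative seam for testing, not part of the template.

```python
import json
import logging
import urllib.request
from typing import Callable

log = logging.getLogger(__name__)
MAX_RERANK_CHARS = 2000  # keeps each pair under the reranker's context limit


def _post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def rerank(query: str, docs: list[str], url: str, timeout: float = 30.0,
           post: Callable[[str, dict, float], dict] = _post_json) -> list[int]:
    """Return indices into docs, best first; dense order on any failure."""
    dense_order = list(range(len(docs)))
    if not docs:
        return dense_order
    # Truncate only the copy sent for scoring; callers keep the full chunks.
    payload = {"query": query, "documents": [d[:MAX_RERANK_CHARS] for d in docs]}
    try:
        results = post(f"{url}/v1/rerank", payload, timeout)["results"]
        scores = {int(r["index"]): float(r["relevance_score"]) for r in results}
        if set(scores) != set(dense_order):
            raise ValueError("reranker returned an incomplete result set")
    except Exception as exc:  # defensive: never let reranking break search
        log.warning("rerank failed, falling back to dense order: %s", exc)
        return dense_order
    return sorted(dense_order, key=lambda i: scores[i], reverse=True)
```

Returning indices rather than documents lets the caller reorder the original, untruncated chunks and their metadata together.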
### Phase 7 — Eval harness *(1 day)*
Goal: hand-curated golden queries + standard metrics so you can
measure the impact of any retrieval change.
- `eval/queries.jsonl`: 20-25 hand-curated queries with expected
pages. Spread across versions, platforms, and difficulty levels.
Include the queries that "obviously" should work and DON'T —
those are the ones to track.
- `eval/retrievers.py`: a `Retriever` protocol with concrete
implementations: `DenseRetriever`, `RerankedRetriever`,
`BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
matrix dimension per knob.
- `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
retrievers; emits a markdown comparison table at
`eval/results/<baseline>.md`. Commit the result so PRs land with
the A/B evidence in the diff.
Three numbers are enough — don't overengineer. The hand-curated
queries are the value; the metrics are just a stable way to score
them.
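
For reference, the three metrics fit in a few lines each. This is a sketch with binary relevance (a page is either expected or not) and assumes each ranked list contains no duplicates; the function names are illustrative and need not match `eval/run_eval.py`. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant page, or 0.0 if none was retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the expected pages that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, page in enumerate(ranked[:k], start=1)
              if page in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```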
### Phase 8 — BM25 + Hybrid retrieval *(half a day, conditional)*
**Skip unless your eval shows specific failure modes.** Dense
embeddings + cross-encoder reranker handle most queries. The case
where they don't: queries with rare technical tokens (filenames,
language names, error codes) get buried at dense rank 1000+ by a
much larger prose corpus that's semantically nearby. The reranker
only sees top-200, so it never gets a shot.
- `rag/bm25.py`: SQLite FTS5 index, in the stdlib, on-disk
(`bm25/<product>.db`). Two tables — metadata table keyed by
rowid, FTS5 virtual table for full-text. Sanitize the query
(strip FTS5 reserved keywords, OR-join tokens for recall). ~210
LOC.
- `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
ordered ids plus per-retriever contribution dict for telemetry.
- `search_docs` hybrid path: run dense + BM25 in parallel,
RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
`HYBRID_SEARCH=true`.
- Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
to usage logs so you can measure whether BM25 is actually earning
its keep on production traffic.
If after 4-6 weeks of production data you see `bm25_only >= 80%`,
you can simplify to BM25-only (much less infrastructure). If
`both >= 50%`, hybrid is acting as tie-breaker not rescue — keep it
or simplify depending on how much you care about the long tail.
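
A minimal sketch of the fusion step and the `top1_source` label, assuming the two retrievers are named `dense` and `bm25` and that each returns a list of unique ids, best first. The function names are illustrative and not necessarily those used in the server.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60):
    """Reciprocal Rank Fusion: score(id) = sum of 1 / (k + rank) per retriever.

    Returns (ordered ids, {id: {retriever: contribution}}).
    """
    scores: dict[str, float] = defaultdict(float)
    contrib: dict[str, dict[str, float]] = defaultdict(dict)
    for name, ids in rankings.items():
        for rank, doc_id in enumerate(ids, start=1):
            share = 1.0 / (k + rank)
            scores[doc_id] += share
            contrib[doc_id][name] = share
    # Sort by fused score; break ties on id so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered, dict(contrib)


def top1_source(ordered: list[str], contrib: dict[str, dict[str, float]]) -> str:
    """Label which retriever(s) surfaced the top fused result."""
    if not ordered:
        return "none"
    sources = set(contrib[ordered[0]])
    if sources == {"dense"}:
        return "dense_only"
    if sources == {"bm25"}:
        return "bm25_only"
    return "both"
```

Because RRF uses ranks only, cosine distances and BM25 scores never need to be put on a common scale.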
### Phase 9 — Multi-version diff tooling *(1 day, if applicable)*
**Only relevant if the product has multiple maintained versions.**
- `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
diff between two versions of the same page. Two matching
strategies: editor-curated `topic_cluster` peer (if the portal
exposes it), or same-filename fallback.
- `list_cluster(bundle_id, page_id)`: list cross-version peers
for one page.
- `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
removed / changed pages between two bundles, sorted by churn.
- `_diff_churn(a, b)`: small helper, ~15 LOC of line counting over
`difflib.unified_diff(..., n=0)` output. Used by `bundle_changelog`,
`find_doc_inconsistencies`, and `weekly_digest`.
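
One way the churn helper could look, as a sketch rather than the template's actual code: count added plus removed lines in a zero-context unified diff. The name `diff_churn` is illustrative.

```python
import difflib


def diff_churn(a: str, b: str) -> int:
    """Count added + removed lines between two page bodies."""
    diff = list(difflib.unified_diff(a.splitlines(), b.splitlines(),
                                     n=0, lineterm=""))
    # The first two lines are the '---' / '+++' file headers (present only
    # when the texts differ). Skipping them by position avoids mistaking a
    # removed line that itself starts with '--' for a header.
    return sum(1 for line in diff[2:] if line.startswith(("+", "-")))
```

Dividing the result by the page's line count gives a churn percentage, which is what a band such as 10-60% would be measured against.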
### Phase 10 — Usage logging *(half a day)*
Goal: per-call JSONL telemetry so you can answer "what are people
actually asking for" and "is the new feature getting used."
- `docs_mcp/usage.py`: `TimedCall` context manager that captures
tool name, args, elapsed time, hits returned, any extra fields
set by the tool via `_call.set(key=value)`. Writes JSONL to
`var/logs/usage.jsonl`, rotated daily, kept 90 days.
- Mount the log dir as a named compose volume so logs survive
container recreates.
- `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
the JSONL files, prints per-tool counts, top queries, 0-hit
queries, filter usage histogram, reranker activity. Markdown
output flag for piping into weekly digest emails.
What to log: query text, filters, hits returned, elapsed_ms,
reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
to log: anything PII-shaped. The corpus is public, queries are
usually about the product, not personal — but be deliberate.
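
A rough sketch of the report's core loop: per-tool counts and zero-hit queries read straight from the JSONL files. The `*.jsonl*` glob is an assumption about how rotated files are named, and `summarize` is an illustrative name; adjust both to match `usage.py`.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(log_dir: Path) -> tuple[Counter, Counter]:
    """Return (calls per tool, zero-hit search queries) across all log files."""
    per_tool: Counter = Counter()
    zero_hit: Counter = Counter()
    for path in sorted(log_dir.glob("*.jsonl*")):
        for raw in path.read_text(encoding="utf-8").splitlines():
            try:
                rec = json.loads(raw)
            except json.JSONDecodeError:
                continue  # tolerate blank or partially written lines
            if not isinstance(rec, dict):
                continue
            per_tool[rec.get("tool", "?")] += 1
            if rec.get("tool") == "search_docs" and rec.get("hits_returned") == 0:
                zero_hit[rec.get("args", {}).get("query", "")] += 1
    return per_tool, zero_hit
```

`zero_hit.most_common(20)` is usually the most actionable line of the report: it lists the questions the corpus could not answer.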
### Phase 11 — Curated knowledge layer *(2 days)*
The "RAG can't tell you what isn't in the docs" gap. Surfaces:
- **API quickstart repos** if the product has them. Ingest the
example scripts (Python, PowerShell, curl) into the corpus.
Rewrite chunk-0 for each script to embed naturally — explicit
natural-language H1, task description sentence, keyword bag.
Dense embeddings need an anchor.
- **A curated `<product>_api_lessons` markdown doc** for things
the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
patterns, schema bugs you've hit, platform-detection quirks.
Surface as a dedicated MCP tool whose description tells the LLM:
*"Call proactively whenever the user asks you to write a script
/ integrate with the API / debug a 4xx response."*
- **An auto-hint banner** in `search_docs` results — when the
query matches a script/API trigger word, render a one-line nudge
at the top of results pointing at the dedicated tool. Belt-and-
suspenders for queries where the LLM doesn't think to call it
proactively.
### Phase 12 — Doc-bug workflow tools *(1 day, optional)*
Two tools that pair up to enable a *"check the docs for
inconsistencies, draft bugs, confirm, submit"* workflow.
- `find_doc_inconsistencies(scope_query, version=None, platform=None,
max_pages=30, checks=None)`: deterministic, read-only. Two checks:
cross-version drift (pages whose content shifted between
immediate-previous versions in the actionable 10-60% churn band) and
redirect-chain detection (short pages whose body is just a "see
[other page] for details" pointer). Heavy lifting is line-level
diff (`difflib`) against editor-curated cluster peers; the model
judges which findings are real bugs.
- `submit_doc_bug(page_url, content, email=None, rating=None,
like=None)`: POSTs to the docs portal's feedback endpoint.
Env-gated by `DOC_BUG_SUBMIT_ENABLED=true` so dev/staging
deployments can't accidentally hit the upstream. The tool's
docstring is loud about a mandatory operator-confirmation
workflow per submission — LLM must draft, show, ask, then
submit. Explicit *"do not loop"* instruction. Defensive
validation upfront (URL host matches expected portal, content
non-empty, etc.) so the LLM gets a clean error instead of a
rejected POST.
**You'll need to find the docs portal's feedback endpoint.** Most
portals route the "Was this helpful?" widget through a backend
API; sniff the browser network tab on the live site. The payload
shape varies; common fields: content/body, page url/href, optional
email, optional rating, optional thumbs. Most accept anonymous
POSTs with no captcha at the JSON-API layer (even if the widget
shows a captcha). Validate before you ship — and if the endpoint
has rate limits or captcha enforcement, the tool returns a clean
"submission rejected — paste manually at <url>" fallback.
The whole point is the per-bug operator confirmation in the
LLM-side conversation flow; the tool description enforces it. Do
not bypass.
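
The upfront validation might look like the sketch below. The function name, parameters, and error strings are all hypothetical; the point is that the env gate and the host check run before any network call, so the LLM gets a clear message instead of a rejected POST.

```python
from __future__ import annotations

from urllib.parse import urlparse


def validate_doc_bug(page_url: str, content: str,
                     expected_host: str, enabled: bool) -> str | None:
    """Return an error message for the LLM, or None if the draft may be sent."""
    if not enabled:
        return "submission disabled: set DOC_BUG_SUBMIT_ENABLED=true to enable"
    parsed = urlparse(page_url)
    # Only accept pages on the portal this server was built for.
    if parsed.scheme != "https" or parsed.hostname != expected_host:
        return f"page_url must be an https URL on {expected_host}"
    if not content.strip():
        return "content must not be empty"
    return None
```

This check does not replace the operator confirmation: it only guards the POST itself, while the confirmation lives in the tool description and the conversation.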
### Phase 13 — Weekly digest tool *(half a day)*
Goal: a tool that answers *"what changed in the docs in the last N
days?"* with no runtime git dependency (the prod container has no
git).
- Extend `scrape/changelog.py` with `--json` (one-shot structured
output) and `--history-out PATH` (walks `git log --first-parent
--since="<N> days ago"` for corpus-touching commits, writes one
JSON line per commit to a JSONL file).
- CI workflows write the JSONL file into the image at build time:
`corpus/.digest/history.jsonl`. Both `refresh.yml` and
`image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
- New MCP tool `weekly_digest(days=7, version=None, platform=None,
max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
filters to the window, applies version/platform via
`bundles.json` metadata, aggregates per-bundle change counts and
page lists, renders markdown.
- Post-filter totals are critical: the headline "X page changes
across Y bundles" must compute X from the filtered set, not the
raw record count. Otherwise filtered calls look wrong to the
reader.
Out of scope but trivial bolt-ons: scheduled HTML email of the
digest, auto-publish to a blog, per-page diff excerpts as a
follow-up tool.
---
## Standard tool set
By the end you'll have ~15 tools registered. Production-tested
shape:
| Tool | What it does |
|---|---|
| `search_docs` | Semantic search with version/platform/bundle filters |
| `get_page` | Full markdown + metadata for one page |
| `list_versions` | Discover available facet values |
| `list_cluster` | Cross-version peers for one page (if applicable) |
| `diff_versions` | Unified diff of a page across two versions |
| `bundle_changelog` | Added / removed / changed pages between two bundles |
| `weekly_digest` | What changed in the last N days, with filters |
| `corpus_status` | Freshness + size of the knowledge base |
| `find_doc_inconsistencies` | Scoped scan for doc bugs |
| `submit_doc_bug` | Submit a drafted bug (env-gated, operator-confirmed) |
| `<product>_api_lessons` | Curated API gotchas, proactively-called |
| product-specific tools | Interop matrix, lifecycle queries, etc. |
---
## Per-product customization checklist
When applying this template to a new product, here's what you have
to figure out yourself — everything else is shared infrastructure:
- **Doc portal mechanics**
  - URL pattern for pages
  - Bundle/version concept (Zoomin "bundle", Madcap "project",
    GitBook "space", Docusaurus "docs version": same idea, different
    name)
  - SPA backing API (sniff the network tab) or fallback to
    headless browser
  - How `topic_cluster`-equivalent cross-version peers are exposed
    (or whether you synthesize them from filenames)
- **Bundle metadata schema**
  - What does `version` look like? Semver, calendar, named?
  - What does `platform` mean for this product? Is there a useful
    facet at all?
  - Other useful facets (language, product line, edition)?
- **Filterable facets** for `search_docs`
  - One filter per high-cardinality facet
  - Skip filters that have <5 distinct values; they're not worth
    the surface area
- **Feedback endpoint** (for `submit_doc_bug`, if you want it)
  - URL of the POST endpoint
  - Required + optional payload fields
  - Captcha / rate-limit behavior
  - Whether anonymous submissions are accepted
- **Curated knowledge** for the `_api_lessons` tool
  - What does the product's API documentation NOT say that you've
    learned from real integration work?
- **Quickstart / example repos**
  - Does the vendor publish working code? Ingest it; rewrite
    chunk-0 for natural-language retrieval.
---
## Decisions worth carrying forward
Things you'll save time on by deciding the same way again:
- **Tool descriptions are user interface.** The LLM reads them
verbatim and decides whether to call the tool. *"Use when..."*
and *"Call proactively whenever..."* are real surfaces; treat
them like button labels. Most retrieval improvements turn out
to be tool-description rewrites in disguise.
- **`stateless_http=True`** on the FastMCP server. Eliminates
whole categories of session-ID-related 404 storms after
container recreates.
- **Pre-bake everything at CI time.** No runtime calls to git,
external services, or anything you wouldn't trust on a
Cloudflare outage. If the digest needs git history, write a
JSONL file at CI time. If the lessons doc needs to load fast,
bake it into the image.
- **Env-gate every side-effecting tool.** Off by default in dev;
on only in production compose. Belt and suspenders against
accidental writes from staging environments.
- **Operator-confirmation pattern for side-effecting tools.**
The tool docstring is the only place to enforce
human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
"show-confirm-then-submit" — those phrasings work.
- **Verify with hand-curated golden queries before shipping any
retrieval change.** Numbers in the diff, in the commit message.
Don't ship retrieval changes on vibes.
- **Two-cadence CI** (weekly scrape vs on-demand code-only)
saves hours per code iteration once you're past the
one-iteration-a-week stage.
- **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
what Watchtower watches; `:<sha>` is your safety net. Rollback
is a one-line compose edit, not a redeploy.
- **Usage logging is non-negotiable.** You will be wrong about
what people use. Capture the truth from day one; let it tell
you which features to keep building and which to delete.
---
## Glossary
- **Bundle** — one logical doc set in the portal. Zoomin calls
them bundles; Madcap calls them projects; the concept is the
same: a versioned, titled collection of pages. One dir under
`corpus/`.
- **Page** — one HTML page in a bundle. One `.md` + one `.json`
sidecar under the bundle dir.
- **Topic cluster** — Zoomin's name for "this page in version
10.9 corresponds to that page in version 10.8." Stored in the
per-page sidecar. The portal-agnostic concept is "cross-version
peer mapping."
- **Chunk** — a unit of text that gets independently embedded and
stored in Chroma. Target ~400-600 tokens; preserve paragraph
boundaries.
- **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
lists from independent retrievers without score calibration.
---
## What's deliberately NOT in this template
Decisions you should make per-product (not copy from the original
build):
- The reverse proxy and TLS termination layer. Could be Caddy,
nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
- The Gateway / aggregator in front of multiple MCPs (MetaMCP is one
option; you may not need any aggregator if you're running a
single product MCP).
- The specific embedding model — `nomic-embed-text` is a strong
default but newer / domain-specific models may be better for
some products.
- The Ollama containers / GPU setup — depends on what hardware you
have. The pattern is one container per GPU with explicit
`NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
across them.
- Whether to publish a blog series alongside the build. Strongly
recommended (forces clarity, builds an audience), but optional.
@@ -0,0 +1,104 @@
# docs-mcp-template
A reusable template for building hosted MCP servers over a product's
public documentation. Distilled from one production build; everything
product-specific has been factored out.
The end product is a streamable-HTTP MCP server with ~15 tools that
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, find inconsistencies, and (optionally) submit doc bugs
back upstream.
## What's here
- **[PLAN.md](PLAN.md)** — comprehensive build guide. Phased
approach (13 phases, ~2-3 weeks of focused work for the full
stack). Includes the design decisions, the gotchas, and a
per-product customization checklist.
- **Scaffolded skeleton** — working FastMCP server with stub tools,
Dockerfile, docker-compose, CI workflows, eval harness layout,
usage logging. Everything you need to `git clone` and start
filling in the product-specific bits.
## Quick start
```bash
git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.
# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio
```
## Repo layout
```
.
├── PLAN.md # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│ ├── refresh.yml # Weekly scrape + index + image push
│ └── image-only.yml # On-demand code-only ship
├── scrape/
│ ├── README.md # Product-specific scraper goes here
│ └── changelog.py # Reusable: --json, --history-out
├── rag/
│ ├── embeddings.py # Ollama embedder, swappable
│ ├── chunk.py # Chunker — adjust per page format
│ ├── index.py # Builds Chroma + (optionally) BM25
│ └── bm25.py # SQLite FTS5 lexical index
├── docs_mcp/
│ ├── server.py # FastMCP server with stub tools
│ └── usage.py # TimedCall + JSONL telemetry
├── eval/
│ ├── queries.jsonl.example # Curate ~25 hand-labeled queries
│ ├── retrievers.py # Retriever protocol + implementations
│ └── run_eval.py # MRR / Recall@k / nDCG@k harness
├── scripts/
│ ├── usage_report.py # Standalone log analyzer
│ └── registry_gc.py # Container registry cleanup
└── deploy/
└── docker-compose.yml # Hosting stack: MCP + reranker + Watchtower
```
## What's product-specific (must implement)
- `scrape/` — the scraper itself. The template gives you the corpus
layout contract and a working `changelog.py`; the actual extraction
logic is yours.
- The corpus on disk (gitignored; rebuilt by CI).
- The reranker GGUF model and llama.cpp container (commented in
`deploy/docker-compose.yml`).
- The reverse proxy / TLS layer in front of the public endpoint.
- The hand-curated knowledge surface (your product's API gotchas,
example scripts, anything the LLM should know that the docs
don't say).
## What's NOT product-specific (works as-is)
- FastMCP server skeleton + tool decoration pattern
- Chroma + Ollama embedding pipeline
- BM25 / SQLite FTS5 lexical index
- Hybrid retrieval (RRF) + reranker integration
- Eval harness (Retriever protocol, MRR/Recall/nDCG)
- Usage logging (TimedCall, JSONL, daily rotation)
- CI workflow shape (weekly + on-demand, retry-on-race, three-tag
image scheme)
- Registry GC script
- Standard tools: `search_docs`, `get_page`, `list_versions`,
`diff_versions`, `bundle_changelog`, `weekly_digest`,
`find_doc_inconsistencies`, `submit_doc_bug`, etc.
## License
Internal template. Adjust before publishing.
@@ -0,0 +1,111 @@
# Hosting stack for a docs MCP server.
#
# Replace <product> below with your product name on first deploy.
# Volumes: usage logs are mounted to a host path so they survive
# Watchtower-driven container recreates.
#
# This template assumes a reverse proxy / Cloudflare Tunnel terminates
# TLS in front of port 8000. Adjust if your infra differs.
services:

  # The MCP server. Watchtower auto-pulls on :latest changes.
  <product>-docs-mcp:
    image: <registry>/<owner>/<product>-docs-mcp:latest
    container_name: <product>-docs-mcp
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      PRODUCT_NAME: "<product>"
      PRODUCT_DOCS_URL: "https://docs.example.com"

      # Streamable-HTTP transport. Stateless mode is required for
      # production: clients don't lose sessions when Watchtower
      # recreates the container.
      MCP_TRANSPORT: streamable-http
      MCP_HOST: 0.0.0.0
      MCP_PORT: "8000"

      # If you run MetaMCP or another gateway in front and reach
      # this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
      # add that hostname here. "*" disables the rebind check entirely.
      MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"

      # Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
      RERANK_URL: http://<product>-rerank:8080
      RERANK_POOL: "200"
      RERANK_TIMEOUT: "30"

      # Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
      # only after the eval harness shows the dense-only path
      # missing technical-term queries that BM25 catches.
      HYBRID_SEARCH: "true"

      # Phase 10 — usage telemetry.
      USAGE_LOG_DIR: /app/var/logs
      USAGE_LOG_KEEP_DAYS: "90"

      # Phase 12 — doc-bug submission gate. Off by default; on only
      # in production after you've verified the endpoint contract.
      DOC_BUG_SUBMIT_ENABLED: "false"
      # DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
    volumes:
      # Usage logs persist across container recreates.
      - ./<product>-docs-mcp-logs:/app/var/logs
    depends_on:
      - <product>-rerank
    labels:
      # Watchtower polls *only* containers with this label set true.
      com.centurylinklabs.watchtower.enable: "true"
    networks:
      - mcp

  # Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
  # Requires GPU access; adjust runtime/devices for your hardware.
  <product>-rerank:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: <product>-rerank
    restart: unless-stopped
    # Mount the GGUF model from the host. Download from huggingface
    # (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
    volumes:
      - /path/to/models:/models:ro
    command: >
      --model /models/jina-reranker-v2-base.Q8_0.gguf
      --reranking
      --host 0.0.0.0
      --port 8080
      --n-gpu-layers 99
      --ctx-size 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - mcp

  # Watchtower — auto-pulls :latest on push.
  # Only watches containers labeled `com.centurylinklabs.watchtower.enable=true`.
  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      WATCHTOWER_POLL_INTERVAL: "300"   # 5 min
      WATCHTOWER_LABEL_ENABLE: "true"
      WATCHTOWER_CLEANUP: "true"        # remove old images after pull
    # If your registry requires auth, mount a docker config:
    # volumes:
    #   - ./registry-auth.json:/config.json:ro
    networks:
      - mcp

networks:
  mcp:
    driver: bridge
@@ -0,0 +1,263 @@
"""MCP server skeleton — fill in PRODUCT_NAME and the tool bodies.
This file is the template's structural anchor. The phases described in
PLAN.md add or extend pieces of this file:
Phase 3 — search_docs, get_page, list_versions stubs (you are here)
Phase 6 — reranker integration in search_docs
Phase 8 — BM25 + hybrid retrieval (HYBRID_SEARCH env gate, _rrf_fuse)
Phase 9 — diff_versions, list_cluster, bundle_changelog
Phase 10 — TimedCall wiring (already imported below)
Phase 11 — <product>_api_lessons tool
Phase 12 — find_doc_inconsistencies, submit_doc_bug
Phase 13 — weekly_digest + _digest_history reader
Every stub below has a docstring + `raise NotImplementedError`. Replace
the body when you reach the corresponding phase. Keep the signatures
stable across products — clients depend on them.
"""
from __future__ import annotations
import json
import logging
import os
import re
from pathlib import Path
from typing import Annotated
from mcp.server.fastmcp import FastMCP
from pydantic import Field
from .usage import TimedCall
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Product-specific configuration. Set these for each new build.
# ---------------------------------------------------------------------------
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
PRODUCT_DOCS_URL = os.environ.get("PRODUCT_DOCS_URL", "https://docs.example.com")
COLLECTION = f"{PRODUCT_NAME}_docs"
# Paths inside the deployed container (and matching layout locally for dev).
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
CHROMA_DIR = ROOT / "chroma"
BM25_DB = Path(os.environ.get("BM25_DB", str(ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
BUNDLES_JSON = ROOT / "bundles.json"
# ---------------------------------------------------------------------------
# Feature flags (Phase 6 / 8 / 12 enable these as you ship each phase).
# ---------------------------------------------------------------------------
RERANK_URL = os.environ.get("RERANK_URL", "").rstrip("/") or None
RERANK_POOL = int(os.environ.get("RERANK_POOL", "50"))
RERANK_TIMEOUT = float(os.environ.get("RERANK_TIMEOUT", "30"))
HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "").lower() in ("true", "1", "yes", "on")
RRF_K = int(os.environ.get("RRF_K", "60"))
DOC_BUG_SUBMIT_ENABLED = os.environ.get("DOC_BUG_SUBMIT_ENABLED", "").lower() in ("true", "1", "yes", "on")
DOC_BUG_API_URL = os.environ.get("DOC_BUG_API_URL", "") # product-specific endpoint
DOC_BUG_TIMEOUT = float(os.environ.get("DOC_BUG_TIMEOUT", "15"))
# ---------------------------------------------------------------------------
# FastMCP setup.
#
# stateless_http=True — every request creates an ephemeral session and
# discards it on return. Critical for production: clients don't get
# 404 storms when the container is recreated by Watchtower.
# ---------------------------------------------------------------------------
mcp = FastMCP(f"{PRODUCT_NAME}-docs", stateless_http=True)
# ---------------------------------------------------------------------------
# Lazy helpers — instantiate expensive things only when actually needed,
# so the server still starts when (e.g.) Ollama is briefly unreachable.
# ---------------------------------------------------------------------------
def _bundles() -> dict[str, dict]:
    """Cached load of bundles.json into a {slug: bundle_dict} mapping.

    bundles.json is the product-specific catalog written by the Phase 1
    scraper. See PLAN.md Phase 1 for the schema.
    """
    if not BUNDLES_JSON.exists():
        return {}
    cat = json.loads(BUNDLES_JSON.read_text())
    return {b["slug"]: b for b in cat}


def _build_where(version: str | None, platform: str | None, bundle_id: str | None) -> dict | None:
    """Translate filter args into a Chroma `where` clause."""
    conds: list[dict] = []
    if version:
        conds.append({"version": version})
    if platform:
        conds.append({"platform": platform})
    if bundle_id:
        conds.append({"bundle_id": bundle_id})
    if not conds:
        return None
    if len(conds) == 1:
        return conds[0]
    return {"$and": conds}


def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
    """Read a corpus page off disk. Returns (markdown_body, metadata_dict)."""
    md_path = CORPUS / bundle_id / (page_id + ".md")
    json_path = CORPUS / bundle_id / (page_id + ".json")
    if not md_path.exists() or not json_path.exists():
        return None
    return md_path.read_text(), json.loads(json_path.read_text())
# ===========================================================================
# Tools
# ===========================================================================
@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description=f"Natural-language query about {PRODUCT_NAME}.")],
    version: Annotated[
        str | None,
        Field(description="OPTIONAL version filter — restrict to one product version."),
    ] = None,
    platform: Annotated[
        str | None,
        Field(description="OPTIONAL platform filter. Set to one of the platforms listed by list_versions(); omit for all platforms."),
    ] = None,
    bundle_id: Annotated[
        str | None,
        Field(description="OPTIONAL bundle filter — pin to a specific doc bundle slug."),
    ] = None,
    k: Annotated[int, Field(description="Number of results to return.", ge=1, le=50)] = 10,
) -> str:
    """Search the {product} docs corpus.

    Returns the top-k most relevant chunks (with full source page URLs)
    given a natural-language query. Optional filters narrow the search
    to one version, one platform, or one bundle. Use list_versions()
    first if you need to discover the available facet values.

    Call this tool whenever the user asks anything that should be
    answerable from the official product documentation.
    """
    with TimedCall("search_docs", {
        "query": query, "version": version, "platform": platform,
        "bundle_id": bundle_id, "k": k,
    }) as _call:
        # TODO Phase 2-3: query Chroma collection (see rag/index.py for
        #   how it was built). Render the top-k chunks as markdown with
        #   source URLs.
        # TODO Phase 6: optional reranker via _rerank() if RERANK_URL set.
        # TODO Phase 8: hybrid retrieval if HYBRID_SEARCH=true — run
        #   dense + BM25 in parallel, RRF-fuse, hand merged pool to rerank.
        _call.set(hits_returned=0)
        raise NotImplementedError("Phase 2/3: implement Chroma query + rendering")


@mcp.tool()
def get_page(
    bundle_id: Annotated[str, Field(description="Bundle slug.")],
    page_id: Annotated[str, Field(description="Page filename within the bundle.")],
) -> str:
    """Return the full markdown for one page, plus a metadata header.

    Use after search_docs surfaces a relevant page and the user (or you)
    want the complete text — not just the matched chunks.
    """
    with TimedCall("get_page", {"bundle_id": bundle_id, "page_id": page_id}) as _call:
        data = _read_page(bundle_id, page_id)
        if data is None:
            _call.set(found=False)
            return f"Page not found: {bundle_id}/{page_id}"
        md, meta = data
        _call.set(found=True, page_chars=len(md))
        # TODO: add a metadata header (title, version, source URL) above
        # the body. Product-specific shape.
        return md


@mcp.tool()
def list_versions() -> str:
    """List the available version/platform facets across all bundles.

    Use this to discover valid filter values for search_docs.
    """
    with TimedCall("list_versions", {}) as _call:
        cat = _bundles()
        if not cat:
            return "_(no bundles indexed yet — run the scraper + indexer)_"
        versions = sorted({b.get("version") for b in cat.values() if b.get("version")})
        platforms = sorted({b.get("platform") for b in cat.values() if b.get("platform")})
        _call.set(versions=len(versions), platforms=len(platforms))
        lines = [f"# Facets across {len(cat)} bundle(s)", ""]
        if versions:
            lines += ["## Versions", ""]
            lines += [f"- `{v}`" for v in versions]
            lines.append("")
        if platforms:
            lines += ["## Platforms", ""]
            lines += [f"- `{p}`" for p in platforms]
        return "\n".join(lines)
# ---------------------------------------------------------------------------
# Stubs for later phases — keep the signatures in this file so refactors
# don't lose the contracts. Implementations come per phase.
# ---------------------------------------------------------------------------
# @mcp.tool() # Phase 9
# def list_cluster(bundle_id: str, page_id: str) -> str: ...
# @mcp.tool() # Phase 9
# def diff_versions(bundle_id: str, page_id: str, against_bundle_id: str, context: int = 3) -> str: ...
# @mcp.tool() # Phase 9
# def bundle_changelog(bundle_id_new: str, bundle_id_old: str, min_churn: int = 5, max_changed: int = 50) -> str: ...
# @mcp.tool() # Phase 13
# def weekly_digest(days: int = 7, version: str | None = None, platform: str | None = None, ...) -> str: ...
# @mcp.tool() # Phase 9 (or 3 — useful early)
# def corpus_status() -> str: ...
# @mcp.tool() # Phase 11
# def myproduct_api_lessons(topic: str | None = None) -> str: ...
# @mcp.tool() # Phase 12
# def find_doc_inconsistencies(scope_query: str, ...) -> str: ...
# @mcp.tool() # Phase 12
# def submit_doc_bug(page_url: str, content: str, email: str | None = None, ...) -> str: ...
# ===========================================================================
# Entry point
# ===========================================================================
def main() -> None:
import argparse
p = argparse.ArgumentParser(description=f"{PRODUCT_NAME} docs MCP server")
p.add_argument("--transport", choices=["stdio", "streamable-http", "sse"],
default=os.environ.get("MCP_TRANSPORT", "stdio"))
p.add_argument("--host", default=os.environ.get("MCP_HOST", "0.0.0.0"))
p.add_argument("--port", type=int, default=int(os.environ.get("MCP_PORT", "8000")))
args = p.parse_args()
if args.transport == "stdio":
mcp.run()
else:
mcp.settings.host = args.host
mcp.settings.port = args.port
# DNS-rebinding protection defaults to localhost-only — disable for
# container-network DNS hostnames. See PLAN.md "Hosting" notes.
if os.environ.get("MCP_DISABLE_DNS_REBINDING_PROTECTION") in {"1", "true", "yes"}:
mcp.settings.transport_security.enable_dns_rebinding_protection = False
mcp.run(transport=args.transport)
if __name__ == "__main__":
main()
@@ -0,0 +1,127 @@
"""Per-call usage telemetry — JSONL with daily rotation and retention.
Reusable as-is across products. Drop the import + `with TimedCall(...)`
into any tool body and the call gets logged with the tool name, args,
elapsed time, and any extra fields the tool sets via `_call.set(...)`.
The log file is `var/logs/usage.jsonl` by default (override the
directory with the `USAGE_LOG_DIR` env var). The file rotates daily;
rotated files older than `USAGE_LOG_KEEP_DAYS` (default 90) are deleted
during the rotation check, which runs on write at most once every five
minutes.
Layout of one record:
{
"ts": "2026-05-22T13:14:15+00:00",
"tool": "search_docs",
"args": {"query": "...", "version": "10.9", "k": 10},
"elapsed_ms": 142.5,
"hits_returned": 7, # optional, set by the tool
"reranked": true, # optional, set by the tool
# ... any other key the tool sets via _call.set(...)
}
"""
from __future__ import annotations
import json
import os
import time
import threading
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
USAGE_LOG_DIR = Path(os.environ.get("USAGE_LOG_DIR", "var/logs"))
USAGE_LOG_KEEP_DAYS = int(os.environ.get("USAGE_LOG_KEEP_DAYS", "90"))
# Single global lock to serialize writes from multiple request handlers.
# JSONL appends are atomic at the OS level for short records on most
# filesystems, but the lock is cheap and saves you from cross-platform
# surprises.
_lock = threading.Lock()
_last_rotation_check: float = 0.0
def _maybe_rotate() -> None:
    """Move usage.jsonl → usage.jsonl.<date of last write> if the UTC date has rolled.

    Cheap to call: the filesystem is touched at most once every five
    minutes, and the file is renamed only when its last write happened
    on an earlier UTC date than today.
    """
    global _last_rotation_check
    now = time.time()
    if now - _last_rotation_check < 300:  # at most one check every 5 min
        return
    _last_rotation_check = now
    USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
    active = USAGE_LOG_DIR / "usage.jsonl"
    if active.exists():
        try:
            mtime = datetime.fromtimestamp(active.stat().st_mtime, tz=timezone.utc).date()
            today = datetime.now(timezone.utc).date()
            if mtime < today:
                rotated = USAGE_LOG_DIR / f"usage.jsonl.{mtime.isoformat()}"
                if not rotated.exists():
                    active.rename(rotated)
        except OSError:
            pass
    # Retention: delete usage.jsonl.YYYY-MM-DD files older than the
    # retention window. The active file is never deleted by this.
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=USAGE_LOG_KEEP_DAYS)
    for f in USAGE_LOG_DIR.glob("usage.jsonl.*"):
        try:
            datestamp = f.name.split(".", 2)[-1]
            if datetime.fromisoformat(datestamp).date() < cutoff:
                f.unlink()
        except (ValueError, OSError):
            continue
class TimedCall:
    """Context manager that captures one tool call's telemetry record.

    Usage:
        with TimedCall("search_docs", {"query": q, ...}) as call:
            ... do the work ...
            call.set(hits_returned=len(results), reranked=True)

    On exit, writes one JSONL record to usage.jsonl. Exceptions are
    captured into the `error` field; the exception is re-raised so
    the tool's caller sees the failure.
    """

    def __init__(self, tool: str, args: dict[str, Any]):
        self.tool = tool
        self.args = args
        self.extra: dict[str, Any] = {}
        self._t0: float = 0.0

    def set(self, **kwargs: Any) -> None:
        """Attach extra fields to the eventual telemetry record."""
        self.extra.update(kwargs)

    def __enter__(self) -> "TimedCall":
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        record: dict[str, Any] = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "tool": self.tool,
            "args": self.args,
            "elapsed_ms": round(elapsed_ms, 2),
        }
        if exc_type is not None:
            record["error"] = f"{exc_type.__name__}: {exc_val}"
        record.update(self.extra)
        # Rotate and append under the same lock so a rename cannot race
        # another request handler's write.
        with _lock:
            _maybe_rotate()
            USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
            with open(USAGE_LOG_DIR / "usage.jsonl", "a", encoding="utf-8") as fh:
                # default=str keeps a non-JSON-serializable arg from
                # raising inside the telemetry path.
                fh.write(json.dumps(record, separators=(",", ":"), default=str) + "\n")
        # Returning None (falsy) lets any in-flight exception propagate;
        # the caller still needs to see it.
@@ -0,0 +1,4 @@
{"query": "how to install <product> on Linux", "expected": [{"bundle_id": "Install.Linux.10.0", "page_id": "Installation.htm"}], "tags": ["install", "linux"]}
{"query": "configure database connection for high availability", "expected": [{"bundle_id": "Admin.10.0", "page_id": "HA_Setup.htm"}], "tags": ["ha", "config"]}
{"query": "API endpoint to list users", "expected": [{"bundle_id": "API.10.0", "page_id": "Users_API.htm"}], "tags": ["api"]}
{"query": "what changed between 10.0 and 10.1", "expected": [{"bundle_id": "Release_Notes.10.1", "page_id": "Whats_New.htm"}], "tags": ["release-notes"]}
@@ -0,0 +1,62 @@
"""Retriever protocol + concrete implementations.
One retriever class per configuration (dense / reranked / bm25 /
hybrid), all behind the same interface, so the eval harness can compare
them like for like. Implement these once at Phase 7 and reuse them
across every retrieval change.
Each retriever returns a ranked list of (bundle_id, page_id) tuples
deduplicated to the page level (chunks within the same page collapse
to one entry; the highest-ranked chunk's position wins).
"""
from __future__ import annotations
from typing import Protocol, Iterable
class Retriever(Protocol):
name: str
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
"""Return up to k (bundle_id, page_id) tuples in rank order."""
...
def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
"""Take a stream of (bundle_id, page_id, chunk_ordinal) and return
the first k unique pages in their first-seen order."""
seen: set[tuple[str, str]] = set()
out: list[tuple[str, str]] = []
for bid, pid, _ord in chunk_ids:
key = (bid, pid)
if key in seen:
continue
seen.add(key)
out.append(key)
if len(out) >= k:
break
return out
# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
# in place. Each one is small (15-30 LOC). The eval harness imports
# from this module by class name.
#
# class DenseRetriever:
# name = "dense"
# def __init__(self, collection): self.col = collection
# def retrieve(self, query, k=10): ...
#
# class RerankedRetriever:
# name = "dense+rerank"
# def __init__(self, collection, rerank_url, pool=200): ...
# def retrieve(self, query, k=10): ...
#
# class BM25Retriever:
# name = "bm25"
# def __init__(self, bm25_index): ...
# def retrieve(self, query, k=10): ...
#
# class HybridRetriever:
# name = "bm25+dense+rrf"
# def __init__(self, dense, bm25, k_rrf=60): ...
# def retrieve(self, query, k=10): ...
@@ -0,0 +1,91 @@
"""Run all retrievers against eval/queries.jsonl, emit a markdown report.
Metrics computed per retriever:
MRR — mean reciprocal rank of the FIRST expected page in the
ranked result list (0 if not in top-k).
Recall@K — fraction of expected pages that appear in top-K.
nDCG@K — discounted gain weighted by rank position.
The "right" number depends on what you're measuring. MRR tracks "the
first-line answer is correct"; Recall@K tracks "everything relevant
is there to draw from"; nDCG@K is a smoother combination of both.
For docs-RAG, MRR is usually the headline metric.
Usage:
python -m eval.run_eval \\
--queries eval/queries.jsonl \\
--k 5 \\
--output eval/results/baseline.md
"""
from __future__ import annotations
import argparse
import json
import math
import time
from pathlib import Path
from typing import Iterable
def load_queries(path: Path) -> list[dict]:
with open(path) as fh:
return [json.loads(line) for line in fh if line.strip()]
def reciprocal_rank(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]]) -> float:
expected_set = set(expected)
for i, page in enumerate(retrieved, start=1):
if page in expected_set:
return 1.0 / i
return 0.0
def recall_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
if not expected:
return 0.0
retrieved_set = set(retrieved[:k])
hits = sum(1 for e in expected if e in retrieved_set)
return hits / len(expected)
def ndcg_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
expected_set = set(expected)
dcg = 0.0
for i, page in enumerate(retrieved[:k], start=1):
if page in expected_set:
dcg += 1.0 / math.log2(i + 1)
# Ideal DCG: every expected page in the top positions.
idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
return dcg / idcg if idcg else 0.0
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--queries", type=Path, default=Path("eval/queries.jsonl"))
p.add_argument("--k", type=int, default=5)
p.add_argument("--output", type=Path, default=Path("eval/results/baseline.md"))
args = p.parse_args()
if not args.queries.exists():
print(f"queries file not found: {args.queries}")
print("hint: copy eval/queries.jsonl.example and edit")
return 1
queries = load_queries(args.queries)
print(f"loaded {len(queries)} queries")
# TODO Phase 7: instantiate the retrievers you implemented in
# eval/retrievers.py and run each one against each query.
# Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
# markdown table to args.output. Commit the file alongside the
# PR that changes retrieval.
raise NotImplementedError(
"Wire up the retrievers in eval/retrievers.py first, then "
"fill in this evaluation loop. See PLAN.md Phase 7."
)
if __name__ == "__main__":
raise SystemExit(main())
@@ -0,0 +1,277 @@
"""SQLite FTS5-backed BM25 retrieval over the same chunks Chroma indexes.
Hybrid retrieval (BM25 + dense + Reciprocal Rank Fusion) addresses a
limit of single-tower dense embeddings: when a query has specific
technical terms (filenames, language names, error codes, API paths),
the dense embedding doesn't bridge from the query into a short
code-focused chunk. The chunk loses to the much larger crowd of
prose chunks that semantically match the query topic.
BM25 handles this directly. Lexical overlap on rare terms ("python",
"create_vpg.py", "PROTECTED_SITE_ID", "applyUpgrade") scores those
chunks high. Fused with the dense ranking via RRF, the hybrid result
is strictly better than either alone for the queries we've seen
fail.
Why SQLite FTS5:
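- In the stdlib (`sqlite3`). Zero new deps, provided the bundled SQLite
  was compiled with FTS5, which is the case for most current Python
  builds.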
- On-disk. Same persistence model as Chroma — Docker COPY the dir,
`rag.index --rebuild` regenerates from corpus.
- Built-in `bm25()` ranking function. No knobs to tune that matter
for our use case (k1=1.2, b=0.75 defaults are fine).
- Builds 70k+ chunks in seconds. Faster than the Chroma rebuild's
embedding step by 100×, so it adds basically nothing to the
full-rebuild cycle.
Schema is two tables to keep filtering clean. FTS5 doesn't filter
nicely on its own columns, so metadata lives in a regular table and
the FTS5 table holds only the text. The two are joined by rowid:
build() inserts each FTS5 row with an explicit rowid that matches its
chunks_meta row. The external-content options (content=,
content_rowid=) are not used.
    CREATE TABLE chunks_meta (
        rowid INTEGER PRIMARY KEY AUTOINCREMENT,
        id TEXT UNIQUE NOT NULL,
        bundle_id TEXT, page_id TEXT, version TEXT,
        platform TEXT, product TEXT, ordinal INTEGER
    );
    CREATE VIRTUAL TABLE chunks_fts USING fts5(
        text,
        tokenize = 'porter unicode61 remove_diacritics 2'
    );
Queries:
    SELECT m.id, bm25(chunks_fts) AS score
    FROM chunks_fts
    JOIN chunks_meta m ON m.rowid = chunks_fts.rowid
    WHERE chunks_fts MATCH ?
      AND m.version = ?          -- optional metadata filter
    ORDER BY bm25(chunks_fts)    -- lower = better in FTS5
    LIMIT ?;
"""
from __future__ import annotations
import logging
import re
import sqlite3
from pathlib import Path
from typing import Any
log = logging.getLogger(__name__)
# Default location: bm25/<product>_docs.db at the repo root, next to chroma/.
ROOT = Path(__file__).resolve().parent.parent
DEFAULT_DB_DIR = ROOT / "bm25"
DEFAULT_DB_NAME = "<product>_docs.db"
# Columns we expose as filterable metadata. Mirrors what _build_where in
# docs_mcp/server.py accepts so the same filter dicts work for both
# Chroma and BM25 without per-retriever translation in the caller.
FILTER_COLUMNS = ("bundle_id", "page_id", "version", "platform", "product", "ordinal")
# Allowlist tokenizer for free-text queries. FTS5's parser chokes on lots
# of punctuation we routinely see in user queries (".10.9", "?", "VPG's",
# em-dash, etc.). Rather than blocklist every operator, just keep
# alphanumerics + a few separators and replace everything else with a
# space. This loses the ability to phrase-search ("exact match") but we
# don't expose that to users anyway — they ask natural-language questions
# and want the answer, not a Boolean DSL.
_KEEP_RE = re.compile(r"[^A-Za-z0-9_\s]")
# FTS5 reserves these Boolean operator KEYWORDS at the token level
# (uppercase only). Stripping them avoids accidental operator behavior
# or syntax errors when a user query happens to contain bare "AND",
# "OR", "NOT", "NEAR".
_BOOLEAN_KW_RE = re.compile(r"(?<!\w)(AND|OR|NOT|NEAR)(?!\w)")
def _sanitize_query(text: str) -> str:
"""Reduce a natural-language query to an FTS5 OR-of-tokens query.
Three transformations:
1. Non-alphanumeric → space (drops punctuation; "10.9?" becomes
"10 9"). Lets us handle versions, parens, question marks, etc.
without inviting FTS5 parse errors.
2. Boolean keywords stripped (FTS5 reserves AND/OR/NOT/NEAR).
3. Tokens explicitly OR'd. FTS5's default is AND-of-tokens — for
any non-trivial natural-language query that means zero hits
(no chunk contains every word). OR semantics is what we want:
BM25 already weights documents containing more query terms
higher, so we don't lose precision, but we DO gain recall.
"""
cleaned = _KEEP_RE.sub(" ", text)
cleaned = _BOOLEAN_KW_RE.sub(" ", cleaned)
tokens = cleaned.split()
if not tokens:
return ""
return " OR ".join(tokens)
def _where_to_sql(where: dict | None) -> tuple[str, list[Any]]:
"""Translate a Chroma-shaped filter dict into a SQL fragment + params.
Accepts the same shapes ``docs_mcp.server._build_where`` produces:
None → ("", [])
{"version": "10.9"} → ("AND m.version = ?", ["10.9"])
{"$and": [{...}, {...}]} → ("AND m.X = ? AND m.Y = ?", [...])
Unknown keys are silently dropped (defensive — better to over-match
than to crash on a filter we don't know).
"""
if not where:
return "", []
parts: list[str] = []
params: list[Any] = []
def _emit_eq(cond: dict[str, Any]) -> None:
for k, v in cond.items():
if k in FILTER_COLUMNS:
parts.append(f"m.{k} = ?")
params.append(v)
if "$and" in where:
for sub in where["$and"]:
_emit_eq(sub)
else:
_emit_eq(where)
if not parts:
return "", []
return "AND " + " AND ".join(parts), params
class BM25Index:
"""Thin wrapper around an FTS5-backed sqlite db.
Single-writer model. Reads are connection-per-call (sqlite handles
concurrency through file locks; for our read-heavy workload that's
fine and avoids cross-thread connection sharing issues with the MCP
server's request handlers).
"""
def __init__(self, db_path: Path | None = None):
self.db_path = Path(db_path) if db_path else (DEFAULT_DB_DIR / DEFAULT_DB_NAME)
# -- build ----------------------------------------------------------
def build(self, records: list[dict]) -> int:
"""Rebuild the index from scratch from `records`.
`records` is the same list ``rag.index.page_records`` produces:
``[{"id": ..., "text": ..., "metadata": {...}}, ...]``. Bulk
insert wrapped in a transaction — single-digit seconds for the
full 73k-chunk corpus.
"""
self.db_path.parent.mkdir(parents=True, exist_ok=True)
# Drop and recreate. Idempotent rebuild.
if self.db_path.exists():
self.db_path.unlink()
with sqlite3.connect(self.db_path) as con:
con.executescript(self._schema_sql())
con.executemany(
"INSERT INTO chunks_meta (id, bundle_id, page_id, version, "
"platform, product, ordinal) VALUES (?, ?, ?, ?, ?, ?, ?)",
[
(
r["id"],
r["metadata"].get("bundle_id") or "",
r["metadata"].get("page_id") or "",
r["metadata"].get("version") or "",
r["metadata"].get("platform") or "",
r["metadata"].get("product") or "",
int(r["metadata"].get("ordinal") or 0),
)
for r in records
],
)
# Populate the FTS5 table with explicit rowids. chunks_meta was filled
# first, into a fresh database, so its AUTOINCREMENT rowids run 1..N in
# insertion order and i + 1 lines up with record i.
con.executemany(
"INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
[
(i + 1, r["text"])
for i, r in enumerate(records)
],
)
con.commit()
log.info("bm25: indexed %d chunks → %s", len(records), self.db_path)
return len(records)
# -- query ----------------------------------------------------------
def query(
self,
text: str,
n: int = 200,
where: dict | None = None,
) -> list[tuple[str, float]]:
"""Return up to `n` (chunk_id, bm25_score) pairs, lowest score first.
FTS5's bm25() returns NEGATIVE numbers — more relevant docs have
smaller (more negative) scores. We order ASC so the first row is
the most relevant. Callers that need a "rank" should enumerate
the returned list.
"""
sanitized = _sanitize_query(text)
if not sanitized:
return []
where_sql, params = _where_to_sql(where)
# FTS5 MATCH wants the unaliased table name on its left, so we use
# chunks_fts (no alias) and JOIN by rowid against chunks_meta.
sql = (
"SELECT m.id, bm25(chunks_fts) AS score "
"FROM chunks_fts "
"JOIN chunks_meta m ON m.rowid = chunks_fts.rowid "
f"WHERE chunks_fts MATCH ? {where_sql} "
"ORDER BY bm25(chunks_fts) "
"LIMIT ?"
)
try:
with sqlite3.connect(self.db_path) as con:
cur = con.execute(sql, [sanitized, *params, n])
return [(row[0], float(row[1])) for row in cur.fetchall()]
except sqlite3.OperationalError as e:
# FTS5 syntax error (rare after sanitization) or db missing.
# Caller decides whether to fall back to dense-only.
log.warning("bm25 query failed (%s); query=%r", e, sanitized[:80])
return []
def exists(self) -> bool:
"""Cheap probe — does the index file exist on disk?"""
return self.db_path.exists()
def count(self) -> int:
"""Number of chunks indexed. 0 if the db is missing or empty."""
if not self.exists():
return 0
try:
with sqlite3.connect(self.db_path) as con:
return con.execute("SELECT COUNT(*) FROM chunks_meta").fetchone()[0]
except sqlite3.OperationalError:
return 0
# -- schema ---------------------------------------------------------
@staticmethod
def _schema_sql() -> str:
return """
CREATE TABLE chunks_meta (
rowid INTEGER PRIMARY KEY AUTOINCREMENT,
id TEXT UNIQUE NOT NULL,
bundle_id TEXT,
page_id TEXT,
version TEXT,
platform TEXT,
product TEXT,
ordinal INTEGER
);
CREATE INDEX idx_meta_version ON chunks_meta(version);
CREATE INDEX idx_meta_platform ON chunks_meta(platform);
CREATE INDEX idx_meta_bundle ON chunks_meta(bundle_id);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
text,
tokenize = 'porter unicode61 remove_diacritics 2'
);
"""
@@ -0,0 +1,126 @@
"""Markdown chunker — paragraph-aware, ~400-600 token target.
Adjust the chunking strategy per product if your page format differs
significantly from prose. The output shape (id, text, metadata) is
fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
change that.
The key knob you'll tune per product is chunk-0. Dense retrieval lands
on chunk 0 first for most queries. Make it a synthetic chunk built
from:
- the page title (as natural-language H1)
- a 1-sentence task description (you'll have to generate this — for
pages that already have a "## Overview" or "## Introduction" the
first sentence usually works)
- a keyword bag of important terms (filenames, API names, error
codes — the rare technical tokens that BM25 lights up on)
Without a rich chunk 0, dense retrieval gets dominated by the much
larger prose body, and short pages (script examples, reference cards)
get buried.
"""
from __future__ import annotations
import re
from typing import Iterator
# Approximate token estimate from char count. Tunable — set per
# embedder if the default 4 chars/token is wrong.
CHARS_PER_TOKEN = 4
TARGET_TOKENS = 500
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
def estimate_tokens(text: str) -> int:
return max(1, len(text) // CHARS_PER_TOKEN)
def split_paragraphs(md: str) -> list[str]:
    """Split markdown into paragraph-ish blocks.

    Keeps fenced code blocks together (don't slice through ```).
    Headings start new paragraphs.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            # Toggle fence state; the fence line itself stays in the block.
            in_fence = not in_fence
            current.append(line)
            continue
        if in_fence:
            # Inside a fence nothing splits, blank lines included.
            current.append(line)
            continue
        if stripped.startswith("#"):
            # A heading closes the block in progress and opens a new one.
            if current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
            continue
        if not stripped:
            # A blank line outside a fence ends the current block.
            if current:
                blocks.append("".join(current).strip())
                current = []
            continue
        current.append(line)
    if current:
        blocks.append("".join(current).strip())
    return [b for b in blocks if b]
def chunks_from_page(
text: str,
page_id: str,
metadata: dict,
) -> Iterator[dict]:
"""Yield chunk dicts ready for index.py to upsert.
The synthetic chunk 0 is the per-product customization point. The
default below is a simple title + body-first-paragraph; rewrite
for richer retrieval signal (see module docstring).
"""
paragraphs = split_paragraphs(text)
if not paragraphs:
return
# ----- Chunk 0: synthetic anchor for dense retrieval ---------
title = metadata.get("title") or page_id
first_para = next((p for p in paragraphs if not p.startswith("#")), "")
chunk0_body = (
f"# {title}\n\n"
f"{first_para[:300]}"
# TODO per product: append a keyword bag here (filenames,
# API names, error codes) for BM25 + dense joint coverage.
)
yield {
"id": f"{metadata['bundle_id']}::{page_id}::0",
"text": chunk0_body,
"metadata": {**metadata, "ordinal": 0},
}
# ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
ordinal = 1
buf: list[str] = []
buf_chars = 0
for p in paragraphs:
if buf_chars + len(p) > TARGET_CHARS and buf:
yield {
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
"text": "\n\n".join(buf),
"metadata": {**metadata, "ordinal": ordinal},
}
ordinal += 1
buf = []
buf_chars = 0
buf.append(p)
buf_chars += len(p)
if buf:
yield {
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
"text": "\n\n".join(buf),
"metadata": {**metadata, "ordinal": ordinal},
}
@@ -0,0 +1,72 @@
"""Embedding function for Chroma — Ollama-hosted nomic-embed-text by default.
Swappable: implement the same `embedding_function()` interface returning
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
Defaults (override via env):
OLLAMA_URL one or more comma-separated URLs (load-balanced)
EMBED_MODEL model name; default 'nomic-embed-text'
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
"""
from __future__ import annotations
import os
import logging
from typing import Any
import httpx
from chromadb import EmbeddingFunction, Documents, Embeddings
log = logging.getLogger(__name__)
OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
"http://localhost:11434").split(",") if u.strip()]
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
class OllamaEmbeddings(EmbeddingFunction):
"""Calls /api/embed across N Ollama endpoints, naive round-robin.
For indexing throughput on multiple GPUs, run one Ollama container
per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
in OLLAMA_URL — the embedder picks the next endpoint per batch.
"""
def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
self.urls = urls
self.model = model
self._next = 0
def __call__(self, input: Documents) -> Embeddings:
url = self.urls[self._next % len(self.urls)]
self._next += 1
with httpx.Client(timeout=300) as c:
r = c.post(f"{url}/api/embed",
json={"model": self.model, "input": list(input)})
r.raise_for_status()
data = r.json()
return data.get("embeddings") or []
def name(self) -> str: # newer chromadb requires this
return f"ollama:{self.model}"
@staticmethod
def build_from_config(config: dict) -> "OllamaEmbeddings": # newer chromadb
return OllamaEmbeddings(
urls=config.get("urls", OLLAMA_URLS),
model=config.get("model", EMBED_MODEL),
)
def get_config(self) -> dict: # newer chromadb
return {"urls": self.urls, "model": self.model}
def default_space(self) -> str:
return "cosine"
def supported_spaces(self) -> list[str]:
return ["cosine", "l2", "ip"]
def embedding_function() -> EmbeddingFunction:
return OllamaEmbeddings()
@@ -0,0 +1,134 @@
"""Build Chroma (and optionally BM25) indexes from corpus on disk.
Reads `corpus/<bundle>/<page>.{md,json}` and chunks each page. With
--rebuild, drops + recreates the Chroma collection (clean state),
upserts every chunk, then rebuilds the BM25 index. With --bm25-only,
skips Chroma and rebuilds only the FTS5 index, which is useful for fast
iteration when chunking didn't change. With neither flag, the script
loads the corpus and exits without writing anything.
"""
from __future__ import annotations
import argparse
import json
import logging
import time
from pathlib import Path
from typing import Iterator
import chromadb
from chromadb.config import Settings
from .chunk import chunks_from_page
from .embeddings import embedding_function
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
CHROMA_DIR = ROOT / "chroma"
# Collection name — convention: <product>_docs. Override via env if needed.
import os
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
COLLECTION = f"{PRODUCT_NAME}_docs"
def page_records() -> Iterator[dict]:
"""Walk corpus/, yield chunks for every page."""
if not CORPUS.exists():
log.error("corpus/ doesn't exist; run the scraper first")
return
for bundle_dir in sorted(CORPUS.iterdir()):
if not bundle_dir.is_dir() or bundle_dir.name.startswith("."):
continue
for md_path in sorted(bundle_dir.glob("*.md")):
page_id = md_path.stem
sidecar = md_path.with_suffix(".json")
if not sidecar.exists():
log.warning("skipping %s — no JSON sidecar", md_path)
continue
md = md_path.read_text()
meta = json.loads(sidecar.read_text())
# Surface common filter fields at the chunk-metadata level
# so Chroma's `where` filter can use them.
base_meta = {
"bundle_id": bundle_dir.name,
"page_id": page_id,
"title": meta.get("title") or "",
"version": meta.get("version") or "",
"platform": meta.get("platform") or "",
"product": meta.get("product") or "",
}
yield from chunks_from_page(md, page_id, base_meta)
def upsert_to_chroma(records: list[dict]) -> int:
client = chromadb.PersistentClient(
path=str(CHROMA_DIR),
settings=Settings(anonymized_telemetry=False),
)
# Drop + recreate for --rebuild semantics
try:
client.delete_collection(COLLECTION)
except Exception:
pass
col = client.create_collection(COLLECTION, embedding_function=embedding_function())
BATCH = 64
total = 0
for i in range(0, len(records), BATCH):
chunk = records[i:i + BATCH]
col.upsert(
ids=[r["id"] for r in chunk],
documents=[r["text"] for r in chunk],
metadatas=[r["metadata"] for r in chunk],
)
total += len(chunk)
log.info("upserted %d / %d chunks", total, len(records))
return total
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--rebuild", action="store_true",
help="Drop and recreate the Chroma collection.")
p.add_argument("--bm25-only", action="store_true",
help="Rebuild only the BM25 index, skip Chroma.")
p.add_argument("--bm25-db", type=Path,
default=ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db",
help="Path to the BM25 sqlite db.")
args = p.parse_args()
log.info("reading corpus from %s", CORPUS)
t0 = time.time()
records = list(page_records())
log.info("loaded %d chunks in %.1fs", len(records), time.time() - t0)
if args.bm25_only:
from .bm25 import BM25Index
log.info("--bm25-only: building FTS5 only")
BM25Index(args.bm25_db).build(records)
return 0
if not args.rebuild:
log.info("no --rebuild; nothing to do. (Use --rebuild to upsert.)")
return 0
t_c = time.time()
n = upsert_to_chroma(records)
log.info("chroma: %d chunks in %.1fs", n, time.time() - t_c)
# Build BM25 too — see PLAN.md Phase 8. Safe to remove this block
# for products that don't need hybrid retrieval.
try:
from .bm25 import BM25Index
t_b = time.time()
BM25Index(args.bm25_db).build(records)
log.info("bm25 done in %.1fs", time.time() - t_b)
except ImportError:
log.info("rag.bm25 not available — skipping BM25 build")
return 0
if __name__ == "__main__":
raise SystemExit(main())
@@ -0,0 +1,19 @@
# MCP server
mcp[fastmcp]>=1.0.0
pydantic>=2.0
httpx>=0.27
# Vector store + embeddings
chromadb>=0.5.0
ollama>=0.4.0 # optional client library; rag/embeddings.py calls Ollama over httpx and does not import it
# Scraping (Phase 1; adjust per product)
beautifulsoup4>=4.12
requests>=2.31
# playwright>=1.40 # uncomment if you need headless browser fallback
# Evaluation
numpy>=1.26
# Dev / utility
python-dateutil>=2.8
@@ -0,0 +1,59 @@
# scrape/
Product-specific. **You implement this for each product.** The
template gives you the contract; the extraction logic depends on
the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.
## What you write
At minimum, two scripts:
### `scrape/bundles.py`
Discovers the upstream portal's bundle catalog and writes
`bundles.json` at the repo root. One entry per bundle (versioned doc
set) with the schema in PLAN.md.
```bash
python -m scrape.bundles
```
### `scrape/runner.py`
Scrapes the pages of each bundle (or a single bundle with
`--bundle <slug>`). Writes:
- `corpus/<bundle_id>/<page_id>.md` — extracted markdown body
- `corpus/<bundle_id>/<page_id>.json` — per-page metadata sidecar
```bash
python -m scrape.runner --all --force --concurrency 6
python -m scrape.runner --bundle Admin.VC.HTML.10.9
```
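A minimal sketch of the write side of that contract, for illustration only. `write_page` is a hypothetical helper, not something the template ships; the sidecar keys shown in the usage note (`title`, `version`, `platform`, `product`) are the ones `rag/index.py` reads, and anything beyond those is up to you.

```python
import json
from pathlib import Path


def write_page(corpus: Path, bundle_id: str, page_id: str, markdown: str,
               meta: dict, force: bool = False) -> bool:
    """Write one page plus its JSON sidecar; return True if files were written.

    Hypothetical helper. Without force, a page whose .md and .json both
    exist on disk is skipped, so a resumed scrape does not redo it.
    """
    bundle_dir = corpus / bundle_id
    md_path = bundle_dir / f"{page_id}.md"
    sidecar = bundle_dir / f"{page_id}.json"
    if not force and md_path.exists() and sidecar.exists():
        return False
    bundle_dir.mkdir(parents=True, exist_ok=True)
    md_path.write_text(markdown, encoding="utf-8")
    # Sorted keys keep sidecar diffs stable between scrapes.
    sidecar.write_text(json.dumps(meta, indent=2, sort_keys=True) + "\n",
                       encoding="utf-8")
    return True
```

A runner would call it once per fetched page, for example `write_page(Path("corpus"), bundle_id, page_id, body, {"title": ..., "version": ..., "platform": ..., "product": ...}, force=args.force)`.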
## Tips
- **Sniff before you scrape.** Almost every modern doc portal is an
SPA that calls a backend API. Open the browser's Network tab,
click around, find the underlying JSON. Scraping the API is 10×
cheaper and 100× more reliable than scraping the rendered HTML.
- **Idempotent re-scrapes.** Without `--force`, the runner should
skip pages already on disk so a resume doesn't have to re-fetch
everything. With `--force`, re-fetch every page — that's the
weekly cron mode that catches edits.
- **Respect the portal.** Backoff on 429s. Set a recognizable
user-agent so the portal owner can identify you if they want to.
- **Whitespace normalize.** Markdown that round-trips through HTML
often has extra blank lines. Normalize to a single blank between
paragraphs so diffs are clean (the changelog summary and digest
tools care about line counts).
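
Two of the tips above can be sketched in a few lines. Both functions are hypothetical helpers, not part of the template. The normalizer is deliberately naive: it also collapses blank runs inside fenced code blocks and strips trailing spaces, so adapt it if your corpus depends on either. The backoff helper assumes you pass it the raw `Retry-After` header value, if the portal sends one.

```python
from __future__ import annotations

import re

_BLANK_RUN = re.compile(r"\n{3,}")


def normalize_blank_lines(markdown: str) -> str:
    """Strip trailing whitespace and collapse blank-line runs to one blank line."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines).strip("\n")
    if not text:
        return ""
    return _BLANK_RUN.sub("\n\n", text) + "\n"


def backoff_delay(attempt: int, retry_after: str | None = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form; fall back to exponential backoff.
    return min(cap, base * (2 ** attempt))
```

Running the normalizer on every page before it is written keeps the weekly diff limited to real content changes.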
## What's already reusable
`scrape/changelog.py` is fully product-agnostic and ready to use
as-is. It walks `git diff --name-status` output to produce a
structured summary, and walks `git log` for the digest history
(Phase 13).
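
For orientation, the input it parses looks like `M<TAB>path` per line, with renames written as `R100<TAB>old<TAB>new`. A rough sketch of that first parsing step follows; `parse_name_status` is a hypothetical name used only here, and the real module goes on to group paths by bundle.

```python
def parse_name_status(diff_output: str) -> list[tuple[str, str]]:
    """Return (status, path) pairs from `git diff --name-status` output.

    For renames the post-rename path is kept as the canonical location.
    """
    entries: list[tuple[str, str]] = []
    for line in diff_output.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        entries.append((parts[0], parts[-1]))
    return entries
```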
@@ -0,0 +1,272 @@
"""Generate a summary of corpus changes.

Two output shapes for two consumers:

1. Human-readable text (default) — written into the weekly-refresh
   commit message so the commit log is greppable for *"what changed
   this week"* instead of *"806 files changed"*.

2. Structured JSON (``--json``) and rolling JSONL history
   (``--history-out``) — consumed by the ``weekly_digest`` MCP tool.
   Computed in CI and committed at ``corpus/.digest/history.jsonl``;
   the tool reads it at runtime because the prod container is a
   static filesystem COPY with no git available.

Usage:

    # Commit-message helper (existing behavior — unchanged)
    python -m scrape.changelog [--cached] [--ref REF]

    # One-shot JSON for the staged changes
    python -m scrape.changelog --cached --json

    # Build / refresh the digest history file (CI use)
    python -m scrape.changelog --history-out corpus/.digest/history.jsonl \\
        --history-days 120

The history walker only includes commits that change at least one
``.md`` or ``.json`` file under ``corpus/<bundle>/``; it skips pure
code/CI commits. Each emitted record carries the commit's short sha,
ISO timestamp, subject, and the same structured summary the ``--json``
path produces, so the consumer can treat history records and one-shot
summaries interchangeably.
"""

from __future__ import annotations

import argparse
import json
import subprocess
import sys
from collections import defaultdict
from typing import Any


def git(*args: str) -> str:
    """Run a git subcommand and return its stdout as text."""
    return subprocess.check_output(["git", *args], text=True)


def summarize_diff(diff_output: str) -> dict[str, Any]:
    """Parse ``git diff --name-status`` output into a structured summary.

    Pure function (no IO, no git calls) so the same logic is exercised
    by the human-readable, JSON-one-shot, and history-walking paths.

    Returns a dict with:
      md_count           int  — total .md files changed
      json_count         int  — total .json sidecars changed
      content_bundles    dict — {bundle_id: [page_id_without_.md, ...]}
                                Only bundles where at least one .md
                                file moved. Lists are in the order
                                git emitted them.
      json_only_bundles  list[str] — bundles whose ONLY change was sidecar
                                drift (no .md changes). Sorted.
      new_bundles        list[str] — bundles with at least one .md Added
                                in this diff (a page added to an
                                existing bundle also counts). Sorted.
      other_files        list[str] — any non-corpus path mentioned in the
                                diff, as ``"STATUS path"`` strings.
    """
    md_changes: dict[str, list[str]] = defaultdict(list)
    json_only_bundles: set[str] = set()
    new_bundles: set[str] = set()
    md_count = json_count = 0
    other_files: list[str] = []

    for line in diff_output.splitlines():
        if not line.strip():
            continue
        # status<TAB>path (or status<TAB>old<TAB>new for renames; we take
        # the post-rename path as the canonical location).
        parts = line.split("\t")
        status, path = parts[0], parts[-1]
        if not path.startswith("corpus/"):
            other_files.append(f"{status} {path}")
            continue
        segs = path.split("/", 2)
        if len(segs) < 3:
            # corpus/<filename> with no bundle dir — skip.
            continue
        _, bundle, page = segs
        if page.endswith(".md"):
            md_changes[bundle].append(page[:-3])
            md_count += 1
            if status == "A":
                new_bundles.add(bundle)
        elif page.endswith(".json"):
            json_count += 1
            json_only_bundles.add(bundle)

    # A bundle counts as "content-changing" if it had any .md edit. Sidecar-
    # only drift goes in the separate bucket so the commit message doesn't
    # report timestamp churn as if it were real edits.
    content_bundles_set = set(md_changes)
    drift_only = sorted(json_only_bundles - content_bundles_set)

    return {
        "md_count": md_count,
        "json_count": json_count,
        "content_bundles": dict(md_changes),  # cast back to plain dict for JSON
        "json_only_bundles": drift_only,
        "new_bundles": sorted(new_bundles),
        "other_files": other_files,
    }


def render_human(summary: dict[str, Any]) -> str:
    """Format a summary dict as the multi-line commit-message text.

    Matches the historical output exactly so existing commit-message
    tooling and downstream readers don't have to change.
    """
    lines: list[str] = []
    content_bundles = sorted(summary["content_bundles"])
    md_count = summary["md_count"]
    json_count = summary["json_count"]
    new_bundles = set(summary["new_bundles"])
    drift_only = summary["json_only_bundles"]
    other_files = summary["other_files"]

    lines.append(f"{md_count} content change(s) across {len(content_bundles)} bundle(s)")
    lines.append(f"{json_count} sidecar metadata update(s)")
    if new_bundles:
        lines.append(f"{len(new_bundles)} new bundle(s) added")
    if other_files:
        lines.append(f"{len(other_files)} other file change(s)")

    if content_bundles:
        lines.append("")
        lines.append("Bundles with content changes:")
        for b in content_bundles:
            pages = summary["content_bundles"][b]
            tag = " (NEW)" if b in new_bundles else ""
            lines.append(f" {b}{tag}: {len(pages)} page(s)")
            for p in pages[:5]:
                lines.append(f" - {p}")
            if len(pages) > 5:
                lines.append(f" ... and {len(pages) - 5} more")

    if drift_only:
        lines.append("")
        head = ", ".join(drift_only[:10])
        # Mark truncation when more than 10 bundles drifted (both branches
        # were empty strings before, so the suffix never appeared).
        suffix = ", ..." if len(drift_only) > 10 else ""
        lines.append(f"Bundles with sidecar-only drift ({len(drift_only)}): {head}{suffix}")

    return "\n".join(lines)


def walk_history(history_days: int) -> list[dict[str, Any]]:
    """Walk recent corpus-touching commits, emit one summary per commit.

    Uses ``git log --first-parent main`` to keep the rolling
    weekly-refresh line clean of branch-merge noise. Only commits that
    change at least one ``.md`` or ``.json`` file under
    ``corpus/<bundle>/`` are emitted; pure code commits are skipped
    (they have nothing to digest).

    Each record:
        {
          "sha": "<short sha>",
          "timestamp": "<ISO 8601 committer date, with UTC offset>",
          "subject": "<commit subject line>",
          ... + every field from summarize_diff()
        }
    """
    # Find candidate commits. --first-parent keeps the linear refresh history
    # on main and ignores branch-side merges. We still need to filter by what
    # the commit actually touched, because non-corpus commits can land on
    # main (PR merges for code, CI tweaks, etc.).
    raw = git(
        "log",
        f"--since={history_days} days ago",
        "--first-parent",
        "main",
        "--pretty=format:%H%x09%cI%x09%s",
    )

    records: list[dict[str, Any]] = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue
        sha, ts, subject = parts

        # What did this commit actually touch? Cheap: just the name-status diff
        # against its first parent. Empty stdout = commit didn't change any
        # files we care about. Root commits (no parent) error out — suppress
        # the stderr noise and skip them.
        try:
            diff = subprocess.check_output(
                ["git", "diff", "--name-status", f"{sha}^..{sha}"],
                text=True,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.CalledProcessError:
            continue
        if not diff.strip():
            continue

        summary = summarize_diff(diff)
        # Skip pure code commits — only emit records that have actual corpus
        # content motion. This is what makes the history "interesting" for
        # the weekly digest.
        if summary["md_count"] == 0 and summary["json_count"] == 0 and not summary["new_bundles"]:
            continue

        records.append({
            "sha": sha[:12],
            "timestamp": ts,
            "subject": subject,
            **summary,
        })
    return records


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--cached", action="store_true",
                   help="Summarize staged changes instead of a ref range.")
    p.add_argument("--ref", default="HEAD^..HEAD",
                   help="Diff range to summarize (default: HEAD^..HEAD).")
    p.add_argument("--json", dest="as_json", action="store_true",
                   help="Emit one JSON object instead of the human-readable form.")
    p.add_argument("--history-out", metavar="PATH",
                   help="Walk recent corpus-touching commits and write a "
                        "JSONL history file at PATH. Overwrites if it exists. "
                        "Implies the history walker; --cached/--ref are ignored.")
    p.add_argument("--history-days", type=int, default=120,
                   help="How far back the history walker looks (default 120).")
    args = p.parse_args()

    # History-walker path: build the JSONL file consumed by the
    # weekly_digest MCP tool, then exit. CI uses this.
    if args.history_out:
        records = walk_history(args.history_days)
        # Sort by timestamp ascending so the file is roughly stable
        # across rebuilds (commits within a single run could otherwise
        # depend on git log default ordering).
        records.sort(key=lambda r: r["timestamp"])
        with open(args.history_out, "w") as fh:
            for rec in records:
                fh.write(json.dumps(rec, separators=(",", ":")) + "\n")
        # Brief stdout signal for CI logs — easy to spot in the workflow run.
        print(f"wrote {len(records)} commit record(s) to {args.history_out} "
              f"covering up to {args.history_days} days")
        return 0

    # One-shot summary path. Unchanged behavior for --cached / --ref.
    if args.cached:
        diff_args = ["diff", "--name-status", "--cached"]
    else:
        diff_args = ["diff", "--name-status", args.ref]
    diff = git(*diff_args)
    summary = summarize_diff(diff)

    if args.as_json:
        print(json.dumps(summary, separators=(",", ":")))
    else:
        print(render_human(summary))
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,108 @@
"""Gitea container-registry garbage collection.

Lists package versions for one container package and deletes versions
older than --keep-days. Always preserves:
  - the :latest tag
  - the --keep-latest most-recent date-tagged versions
  - anything pushed in the last --keep-days days

The actual disk reclaim happens on Gitea's next package GC cron (admin
site settings). This script just marks the versions for deletion.

Usage:
    python scripts/registry_gc.py \\
        --owner <user> \\
        --package <product>-docs-mcp \\
        --keep-days 90 \\
        --keep-latest 5

Auth: reads GITEA_TOKEN from env (set in the workflow as a secret).
"""

from __future__ import annotations

import argparse
import json
import os
import sys
from datetime import datetime, timedelta, timezone
from urllib.error import HTTPError
from urllib.request import Request, urlopen

GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")


def api(token: str, method: str, path: str) -> object:
    """Call the Gitea API; return parsed JSON, or None for an empty body or a 404."""
    req = Request(f"{GITEA_HOST}{path}",
                  headers={"Authorization": f"token {token}"},
                  method=method)
    try:
        with urlopen(req, timeout=30) as r:
            body = r.read()
            return json.loads(body) if body else None
    except HTTPError as e:
        if e.code == 404:
            return None
        raise


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--owner", required=True)
    p.add_argument("--package", required=True)
    p.add_argument("--keep-days", type=int, default=90)
    p.add_argument("--keep-latest", type=int, default=5)
    p.add_argument("--dry-run", action="store_true")
    args = p.parse_args()

    token = os.environ.get("GITEA_TOKEN")
    if not token:
        print("GITEA_TOKEN not set", file=sys.stderr)
        return 1

    versions = api(token, "GET",
                   f"/api/v1/packages/{args.owner}/container/{args.package}/versions") or []
    if not versions:
        print(f"no versions found for {args.owner}/{args.package}")
        return 0

    cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)

    # Date-tagged versions (YYYY.MM.DD), newest first
    date_tagged = []
    for v in versions:
        tags = v.get("tags") or []
        for t in tags:
            if len(t) == 10 and t[4] == "." and t[7] == ".":
                date_tagged.append((t, v))
                break
    date_tagged.sort(key=lambda kv: kv[0], reverse=True)
    keep_date_tags = {t for t, _ in date_tagged[:args.keep_latest]}

    deleted = 0
    for v in versions:
        tags = v.get("tags") or []
        if "latest" in tags:
            continue
        if any(t in keep_date_tags for t in tags):
            continue
        try:
            created = datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
        except (KeyError, ValueError):
            continue
        if created >= cutoff:
            continue
        version_id = v.get("id")
        # Make dry runs distinguishable in the log: nothing is deleted then.
        action = "would delete" if args.dry_run else "deleting"
        print(f" {action} v{version_id} tags={tags} created={v['created_at']}")
        if not args.dry_run:
            api(token, "DELETE",
                f"/api/v1/packages/{args.owner}/container/{args.package}/versions/{version_id}")
        deleted += 1

    if args.dry_run:
        print(f"done (dry run): {deleted} version(s) would be deleted")
    else:
        print(f"done: {deleted} version(s) deleted")
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,251 @@
"""Summarize usage logs from docs_mcp.usage into a quick scan.

Reads one or more usage.jsonl* files and prints sections for:
  - per-tool call counts
  - top search_docs queries by frequency
  - 0-hit queries (where we returned nothing — high-signal for tuning)
  - filter usage histogram (which version / platform / bundle filters get hit)
  - reranker effectiveness (calls where the reranker fired vs not)
  - hybrid retrieval top-1 attribution (dense vs bm25 vs both)

Usage:
    # Default: read /app/var/logs in the production container
    python scripts/usage_report.py

    # Explicit log directory:
    python scripts/usage_report.py --logs-dir /path/to/usage/logs

    # Last N days only:
    python scripts/usage_report.py --logs-dir <dir> --since 7d

    # Markdown output (for piping into a weekly digest email, etc):
    python scripts/usage_report.py --logs-dir <dir> --format markdown

The script doesn't depend on anything in the docs_mcp package — it's a
standalone tool that can run anywhere with the log files available
(scp them off the host, point it at the directory).

----------------------------------------------------------------------
FOLLOW-UP CHECKS
----------------------------------------------------------------------

Pattern: when you ship a retrieval change with a hypothesis attached
(e.g. "hybrid will rescue queries dense misses"), add a note HERE
describing what the usage report should show and at what threshold
the change earns its keep. Future-you running the report a month
later will be glad. Example:

    Q: Does the dense leg of hybrid retrieval earn its keep on
       real traffic, or could we simplify to BM25-only?
       - bm25_only >= 80%%       --> dense not doing much; consider
                                    simplifying to BM25 mode
       - both >= 50%%            --> hybrid is tie-breaking; keep it
       - dense_only > bm25_only  --> dense is the workhorse; keep

Also worth a glance every month:
  - 0-hit queries list (tuning candidates)
  - reranker p95 latency drift (slow reranker = bad UX)
  - filter usage (does anyone actually use version/platform
    filters? if not, simplify the tool surface)
"""

from __future__ import annotations

import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Iterable


def parse_since(s: str | None) -> datetime | None:
    """Accept '7d', '24h', '30m', or an ISO timestamp. None → no cutoff."""
    if not s:
        return None
    m = re.fullmatch(r"(\d+)([dhm])", s)
    if m:
        n, unit = int(m.group(1)), m.group(2)
        delta = {"d": timedelta(days=n), "h": timedelta(hours=n), "m": timedelta(minutes=n)}[unit]
        return datetime.now(timezone.utc) - delta
    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    # Treat a naive timestamp as UTC so it compares cleanly with aware ones.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def load_events(logs_dir: Path, since: datetime | None) -> Iterable[dict[str, Any]]:
    """Yield JSONL records across all files in logs_dir, newer than `since` if set."""
    if not logs_dir.exists():
        print(f"warning: logs dir {logs_dir} does not exist", file=sys.stderr)
        return
    # usage.jsonl is the active file; usage.jsonl.YYYY-MM-DD are rotated.
    files = sorted(logs_dir.glob("usage.jsonl*"))
    for f in files:
        with open(f) as fh:
            for ln, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError as e:
                    print(f" ! skipping {f}:{ln}: {e}", file=sys.stderr)
                    continue
                if since:
                    ts = rec.get("ts", "")
                    try:
                        rec_ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    except (AttributeError, ValueError):
                        # Missing, non-string, or malformed timestamp: skip.
                        continue
                    if rec_ts.tzinfo is None:
                        # Naive timestamps are assumed to be UTC.
                        rec_ts = rec_ts.replace(tzinfo=timezone.utc)
                    if rec_ts < since:
                        continue
                yield rec


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--logs-dir", type=Path, default=Path("/app/var/logs"),
                   help="directory with usage.jsonl* files")
    p.add_argument("--since", default=None,
                   help="time window: '7d', '24h', '30m', or ISO timestamp")
    p.add_argument("--top", type=int, default=25,
                   help="how many top queries / filters to show")
    p.add_argument("--format", choices=("text", "markdown"), default="text")
    args = p.parse_args()

    since = parse_since(args.since)
    events = list(load_events(args.logs_dir, since))
    if not events:
        print("(no events in window)")
        return 0

    print(f"# Usage report — {len(events)} events"
          + (f" since {since.isoformat()}" if since else "")
          + f" from {args.logs_dir}")
    print()

    # 1. Per-tool counts
    by_tool = Counter(e["tool"] for e in events)
    print("## Per-tool call counts")
    print()
    if args.format == "markdown":
        print("| tool | calls |")
        print("|---|---|")
        for tool, n in by_tool.most_common():
            print(f"| `{tool}` | {n} |")
    else:
        for tool, n in by_tool.most_common():
            print(f" {tool:<25s} {n:>6d}")
    print()

    # 2. Top search_docs queries
    search_events = [e for e in events if e["tool"] == "search_docs"]
    queries = Counter(e["args"].get("query", "") for e in search_events)
    print(f"## Top {args.top} search_docs queries (of {len(search_events)} searches)")
    print()
    if args.format == "markdown":
        print("| count | query |")
        print("|---|---|")
        for q, n in queries.most_common(args.top):
            print(f"| {n} | `{q}` |")
    else:
        for q, n in queries.most_common(args.top):
            print(f" {n:>5d} {q!r}")
    print()

    # 3. 0-hit queries — the highest-signal data for tuning
    zero_hit = [e for e in search_events if e.get("hits_returned") == 0]
    zero_q = Counter(e["args"].get("query", "") for e in zero_hit)
    print(f"## 0-hit queries ({len(zero_hit)} of {len(search_events)} searches returned nothing)")
    print()
    if zero_q:
        if args.format == "markdown":
            print("| count | query | filters |")
            print("|---|---|---|")
            # Group by query, show filter examples for each
            examples_by_query: dict[str, list[dict]] = defaultdict(list)
            for e in zero_hit:
                examples_by_query[e["args"].get("query", "")].append(e["args"])
            for q, n in zero_q.most_common(args.top):
                ex = examples_by_query[q][0]
                f = {k: v for k, v in ex.items()
                     if k in ("version", "platform", "bundle_id") and v}
                print(f"| {n} | `{q}` | `{f}` |")
        else:
            for q, n in zero_q.most_common(args.top):
                print(f" {n:>5d} {q!r}")
    else:
        print(" _(no 0-hit queries in window)_")
    print()

    # 4. Filter usage
    filter_use: Counter[str] = Counter()
    for e in search_events:
        a = e["args"]
        v = a.get("version")
        p_ = a.get("platform")
        b = a.get("bundle_id")
        if v:
            filter_use[f"version={v}"] += 1
        if p_:
            filter_use[f"platform={p_}"] += 1
        if b:
            filter_use[f"bundle_id={b}"] += 1
        if not (v or p_ or b):
            filter_use["(no filter)"] += 1
    print("## search_docs filter usage")
    print()
    if args.format == "markdown":
        print("| filter | count |")
        print("|---|---|")
        for f, n in filter_use.most_common(args.top):
            print(f"| `{f}` | {n} |")
    else:
        for f, n in filter_use.most_common(args.top):
            print(f" {n:>5d} {f}")
    print()

    # 5. Reranker effectiveness
    reranked = [e for e in search_events if e.get("reranked") is True]
    dense_only = [e for e in search_events if e.get("reranked") is False]
    print("## Reranker activity")
    print()
    print(f" reranked: {len(reranked):>5d}")
    print(f" dense only: {len(dense_only):>5d} (filter too narrow or 0 results)")
    if reranked:
        elapsed = [e["elapsed_ms"] for e in reranked if e.get("elapsed_ms") is not None]
        if elapsed:
            elapsed.sort()
            # Nearest-rank approximations; both indexes stay in range.
            p50 = elapsed[len(elapsed) // 2]
            p95 = elapsed[int(len(elapsed) * 0.95)]
            print(f" reranked latency p50: {p50:.0f} ms, p95: {p95:.0f} ms")
    print()

    # 6. Hybrid retrieval activity — which retriever contributed the top-1?
    # Empty unless HYBRID_SEARCH=true is set on the MCP container.
    hybrid_events = [e for e in search_events if e.get("retrieval_mode") == "hybrid"]
    if hybrid_events:
        by_source = Counter(e.get("top1_source") for e in hybrid_events
                            if e.get("top1_source"))
        print("## Hybrid retrieval — top-1 attribution")
        print()
        print(f" hybrid mode events: {len(hybrid_events)}")
        total = sum(by_source.values()) or 1
        for src in ("both", "dense_only", "bm25_only"):
            n = by_source.get(src, 0)
            pct = 100.0 * n / total
            label = {
                "both": "in BOTH retrievers' top-N",
                "dense_only": "dense found it, BM25 didn't",
                "bm25_only": "BM25 found it, dense didn't",
            }[src]
            print(f" {src:<11s} {n:>5d} ({pct:5.1f}%) — {label}")
        rescued = by_source.get("bm25_only", 0)
        if rescued and total:
            print(f"\n{rescued} ({100.0 * rescued / total:.1f}%) of hybrid queries had a top-1 "
                  "result that ONLY BM25 surfaced. Without hybrid those would have been dense misses.")
    return 0


if __name__ == "__main__":
    sys.exit(main())