initial: docs-mcp-template — build guide + scaffolded server
Template for building hosted MCP servers over a product's public
documentation. Distilled from one production build; everything
product-specific has been factored out.
Contents:
- PLAN.md — comprehensive build guide. 13 phases from project
skeleton through weekly_digest. Includes the gotchas
("fetch-depth: 0 always", reranker per-pair token limit,
Cloudflare body cap, dash-not-bash on Gitea runners), the
decisions worth carrying forward, and a per-product
customization checklist.
- CLAUDE.md — guidance for Claude Code working in a clone of this
template. Phase identification table, conventions (env-gating +
operator confirmation for side-effecting tools, defensive
fallback for retrieval components), common commands.
- README.md — quick-start summary.
Scaffolded code (all signature-stable, with NotImplementedError
stubs where phase-specific work is required):
docs_mcp/server.py FastMCP server, stateless_http=True, with
search_docs / get_page / list_versions
baseline tools and commented stubs for the
rest of the phase set.
docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation,
90-day retention. Reusable as-is.
rag/embeddings.py Ollama embedder (nomic-embed-text default),
load-balanced across N URLs. Reusable.
rag/chunk.py Paragraph-aware chunker with synthetic
chunk 0. Per-product tunable.
rag/index.py Chroma + BM25 builder. --rebuild and
--bm25-only flags.
rag/bm25.py SQLite FTS5 lexical index. Reusable.
scrape/changelog.py --cached / --ref / --json / --history-out.
Reusable.
scrape/README.md What you write per-product.
eval/queries.jsonl.example
Curate ~25 hand-labeled queries here.
eval/retrievers.py Retriever protocol + stub classes.
eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton.
scripts/usage_report.py
Standalone log analyzer; the
FOLLOW-UP CHECKS pattern noted in the
module docstring.
scripts/registry_gc.py
Gitea container registry cleanup. Reusable.
Deployment + CI:
Dockerfile Python 3.12-slim; COPY corpus + chroma
+ bm25 last for cache efficiency.
deploy/docker-compose.yml MCP + reranker sidecar + Watchtower.
Templated with <placeholders>.
.gitea/workflows/refresh.yml Weekly cron + manual dispatch.
fetch-depth: 0, retry-on-race,
three-tag image scheme.
.gitea/workflows/image-only.yml Code-only ship cycle, ~18min.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,89 @@
|
||||
name: Image rebuild (skip scrape)
|
||||
|
||||
# Fast path for code-only changes. Skips the scrape and goes straight to:
|
||||
# rebuild indexes (from corpus already committed on main) + image build
|
||||
# + push. Runtime is ~18 min vs ~40 min for the full refresh.
|
||||
#
|
||||
# Use when a PR only changes code/config — anything where the upstream
|
||||
# corpus hasn't moved but we want the new Python in the running image.
|
||||
#
|
||||
# IMPORTANT: fetch-depth: 0 is required for the digest-history step
|
||||
# to find commits to walk. Don't change to 1.
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
|
||||
env:
|
||||
REGISTRY_PUSH: <lan-host>:<port>
|
||||
REGISTRY_PULL: <public-registry-hostname>
|
||||
IMAGE: <owner>/<product>-docs-mcp
|
||||
OLLAMA_URL: http://<gpu-host>:11434
|
||||
EMBED_MODEL: nomic-embed-text
|
||||
PRODUCT_NAME: <product>
|
||||
|
||||
jobs:
|
||||
build:
|
||||
runs-on: docker
|
||||
container:
|
||||
image: catthehacker/ubuntu:act-latest
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
# Full history (not shallow) so the digest-history step can
|
||||
# walk git log up to --history-days back.
|
||||
fetch-depth: 0
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.12"
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install -q --upgrade pip
|
||||
python -m pip install -q -r requirements.txt
|
||||
|
||||
- name: Refresh digest history
|
||||
# Cheap (a few seconds); doesn't touch corpus content.
|
||||
# Without this step, a code-only deploy would ship an
|
||||
# increasingly-stale digest history relative to git.
|
||||
run: |
|
||||
mkdir -p corpus/.digest
|
||||
python -m scrape.changelog \
|
||||
--history-out corpus/.digest/history.jsonl \
|
||||
--history-days 120
|
||||
|
||||
- name: Verify committed corpus is present
|
||||
run: |
|
||||
test -d corpus || { echo "ERROR: corpus/ missing on this ref"; exit 1; }
|
||||
echo "corpus: $(du -sh corpus | cut -f1), $(find corpus -name '*.md' | wc -l) markdown files"
|
||||
|
||||
- name: Rebuild indexes from existing corpus
|
||||
run: python -m rag.index --rebuild
|
||||
|
||||
- name: Log in to registry (LAN endpoint)
|
||||
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u <user> --password-stdin
|
||||
|
||||
- name: Build & push image
|
||||
run: |
|
||||
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
|
||||
DATE_TAG=$(date -u +%Y.%m.%d)
|
||||
docker build \
|
||||
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \
|
||||
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
|
||||
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
|
||||
.
|
||||
docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
|
||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
|
||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
|
||||
|
||||
- name: Prune old container versions
|
||||
env:
|
||||
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
||||
run: |
|
||||
python scripts/registry_gc.py \
|
||||
--owner <user> \
|
||||
--package <product>-docs-mcp \
|
||||
--keep-days 90 \
|
||||
--keep-latest 5
|
||||
@@ -0,0 +1,158 @@
|
||||
name: Weekly docs refresh
|
||||
|
||||
# Runs the full pipeline: scrape upstream → rebuild indexes → push
|
||||
# image. Cron'd weekly (Mondays). Skip the reindex + image-push if the
|
||||
# scrape produced no diff against the committed corpus.
|
||||
#
|
||||
# IMPORTANT: actions/checkout@v4 fetch-depth: 0 is required because
|
||||
# the digest-history step walks git log up to --history-days back.
|
||||
# With a shallow checkout the history file ships empty.
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: "0 6 * * 1" # Mondays 06:00 UTC
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
force_build:
|
||||
description: "Rebuild indexes + push image even if corpus is unchanged"
|
||||
type: boolean
|
||||
default: false
|
||||
|
||||
env:
|
||||
# If your registry sits behind Cloudflare with its 100 MB body cap,
|
||||
# use a LAN endpoint for pushes (bypasses CF) and the public hostname
|
||||
# for pulls (response bodies aren't capped).
|
||||
REGISTRY_PUSH: <lan-host>:<port>
|
||||
REGISTRY_PULL: <public-registry-hostname>
|
||||
IMAGE: <owner>/<product>-docs-mcp
|
||||
|
||||
# Embedder. One URL per GPU; the indexer round-robins.
|
||||
OLLAMA_URL: http://<gpu-host>:11434
|
||||
EMBED_MODEL: nomic-embed-text
|
||||
|
||||
PRODUCT_NAME: <product>
|
||||
|
||||
jobs:
|
||||
refresh:
|
||||
runs-on: docker
|
||||
container:
|
||||
image: catthehacker/ubuntu:act-latest
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
# Full history — required for the digest-history step to
|
||||
# walk git log. Default fetch-depth: 1 silently produces a
|
||||
# 0-byte history file.
|
||||
fetch-depth: 0
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.12"
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install -q --upgrade pip
|
||||
python -m pip install -q -r requirements.txt
|
||||
|
||||
# ---- Phase 1: scrape ---------------------------------------
|
||||
- name: Refresh bundle catalog
|
||||
run: python -m scrape.bundles
|
||||
|
||||
- name: Re-scrape all bundles
|
||||
# --force re-fetches every page so we actually see upstream
|
||||
# edits. Without it the runner skips pages already on disk.
|
||||
run: python -m scrape.runner --all --force --concurrency 6
|
||||
|
||||
# ---- Build the digest history BEFORE committing ------------
|
||||
# See PLAN.md Phase 13. Walks recent corpus-touching commits
|
||||
# and writes corpus/.digest/history.jsonl. The current refresh
|
||||
# gets added on the NEXT run's history (one-week lag is fine).
|
||||
- name: Build digest history
|
||||
run: |
|
||||
mkdir -p corpus/.digest
|
||||
python -m scrape.changelog \
|
||||
--history-out corpus/.digest/history.jsonl \
|
||||
--history-days 120
|
||||
|
||||
# ---- Commit + retry-on-race --------------------------------
|
||||
- name: Commit corpus changes (if any)
|
||||
id: commit
|
||||
run: |
|
||||
git config user.name "<product>-docs-refresh"
|
||||
git config user.email "actions@<your-domain>"
|
||||
git add bundles.json corpus
|
||||
if git diff --cached --quiet; then
|
||||
echo "no corpus changes — skipping reindex and image build"
|
||||
echo "changed=false" >> "$GITHUB_OUTPUT"
|
||||
exit 0
|
||||
fi
|
||||
echo "changed=true" >> "$GITHUB_OUTPUT"
|
||||
python -m scrape.changelog --cached > /tmp/changelog.txt
|
||||
summary=$(head -1 /tmp/changelog.txt)
|
||||
ts=$(date -u +"%Y-%m-%dT%H:%MZ")
|
||||
{
|
||||
echo "weekly refresh: ${ts} — ${summary}"
|
||||
echo ""
|
||||
cat /tmp/changelog.txt
|
||||
} > /tmp/commitmsg.txt
|
||||
git commit -F /tmp/commitmsg.txt
|
||||
# Retry on race: if main moved while we were scraping (a
|
||||
# human merged a PR during the run), `git push` rejects
|
||||
# with "fetch first". Rebase our corpus commit onto new
|
||||
# main and retry. Corpus + code paths are disjoint, so
|
||||
# the rebase is trivially clean.
|
||||
attempt=1
|
||||
while [ $attempt -le 3 ]; do
|
||||
if git push; then
|
||||
echo "pushed corpus changes (attempt $attempt)"
|
||||
break
|
||||
fi
|
||||
if [ $attempt -eq 3 ]; then
|
||||
echo "push still failing after 3 attempts — bailing"
|
||||
exit 1
|
||||
fi
|
||||
git fetch origin main
|
||||
git rebase origin/main || { echo "rebase conflict — bailing"; exit 1; }
|
||||
attempt=$((attempt + 1))
|
||||
done
|
||||
|
||||
# ---- Reindex Chroma + BM25 ---------------------------------
|
||||
- name: Rebuild indexes
|
||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||
run: python -m rag.index --rebuild
|
||||
|
||||
# ---- Build & push image ------------------------------------
|
||||
- name: Log in to registry (LAN endpoint)
|
||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u <user> --password-stdin
|
||||
|
||||
- name: Build & push image
|
||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||
# Runner shell is /bin/sh — use cut instead of ${VAR::N}.
|
||||
# Three tags: :latest (Watchtower target), :<sha12>
|
||||
# (rollback pin), :<YYYY.MM.DD> (human-readable).
|
||||
run: |
|
||||
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
|
||||
DATE_TAG=$(date -u +%Y.%m.%d)
|
||||
docker build \
|
||||
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \
|
||||
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
|
||||
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
|
||||
.
|
||||
docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
|
||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
|
||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
|
||||
|
||||
# ---- Registry GC -------------------------------------------
|
||||
- name: Prune old container versions
|
||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||
env:
|
||||
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
||||
run: |
|
||||
python scripts/registry_gc.py \
|
||||
--owner <user> \
|
||||
--package <product>-docs-mcp \
|
||||
--keep-days 90 \
|
||||
--keep-latest 5
|
||||
+31
@@ -0,0 +1,31 @@
|
||||
# Virtualenv
|
||||
venv/
|
||||
.venv/
|
||||
|
||||
# Regenerable from corpus + CI
|
||||
corpus/
|
||||
chroma/
|
||||
bm25/
|
||||
|
||||
# Python detritus
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*.egg-info/
|
||||
.pytest_cache/
|
||||
.mypy_cache/
|
||||
.ruff_cache/
|
||||
|
||||
# Eval results (regenerable; commit only the headline baseline if you want)
|
||||
# eval/results/
|
||||
|
||||
# Usage logs (host-mounted volume in prod; don't commit dev logs)
|
||||
var/
|
||||
|
||||
# Local-only env
|
||||
.env
|
||||
.env.local
|
||||
|
||||
# IDE
|
||||
.vscode/
|
||||
.idea/
|
||||
*.swp
|
||||
@@ -0,0 +1,232 @@
|
||||
# CLAUDE.md
|
||||
|
||||
This file provides guidance to Claude Code (claude.ai/code) when
|
||||
working with code in this repository.
|
||||
|
||||
## Purpose
|
||||
|
||||
This is a **template** for building an MCP server over a product's
|
||||
public documentation. When you (Claude) are working in a clone of this
|
||||
repo, you are helping the user implement one specific product's docs
|
||||
MCP — not editing the template itself.
|
||||
|
||||
**Read `PLAN.md` first.** It's the canonical build guide and lays out
|
||||
13 phases. Most user requests will be "implement Phase N" or "we hit
|
||||
a bug in Phase N." Identify the phase before doing anything else.
|
||||
|
||||
## Working with this template
|
||||
|
||||
### Identifying the current phase
|
||||
|
||||
When the user clones this template and starts working, figure out
|
||||
which phase they're on by inspecting:
|
||||
|
||||
| Signal | Likely phase |
|
||||
|---|---|
|
||||
| `corpus/` doesn't exist | Phase 1 (scraper) — they need to build it before anything else works |
|
||||
| `corpus/` exists, `chroma/` doesn't | Phase 2 (indexing) |
|
||||
| Indexes exist, only `search_docs` / `get_page` / `list_versions` implemented | Phase 3 (server skeleton done; next: Dockerfile + CI) |
|
||||
| No `Dockerfile` or `.gitea/workflows/` updated | Phase 4–5 |
|
||||
| `RERANK_URL` env unset in compose | Phase 6 not done |
|
||||
| `HYBRID_SEARCH` env unset, no `rag/bm25.py` content | Phase 8 not done |
|
||||
| No `eval/results/` directory | Phase 7 not done |
|
||||
| `find_doc_inconsistencies` / `submit_doc_bug` are commented-out stubs in `docs_mcp/server.py` | Phase 12 |
|
||||
| No `corpus/.digest/` produced by CI | Phase 13 |
|
||||
|
||||
When in doubt, ask the user: *"Which phase from PLAN.md are we
|
||||
working on?"*
|
||||
|
||||
### The scaffolded server has stubs
|
||||
|
||||
`docs_mcp/server.py` ships with three working tools (`search_docs`,
|
||||
`get_page`, `list_versions`) and signature-only stubs for the
|
||||
phase-specific tools. The stubs `raise NotImplementedError` with a
|
||||
phase hint in the docstring. When implementing a phase, you'll be
|
||||
filling these bodies in — DO NOT change the signatures unless the
|
||||
user has a specific reason. Signatures are the public contract
|
||||
between the MCP and its clients (Claude Desktop, Claude Code,
|
||||
Cursor, etc.).
|
||||
|
||||
## Layout
|
||||
|
||||
```
|
||||
.
|
||||
├── PLAN.md # Read first. Phase-by-phase build guide.
|
||||
├── README.md # Quick-start summary.
|
||||
├── CLAUDE.md # This file.
|
||||
├── requirements.txt
|
||||
├── Dockerfile
|
||||
├── deploy/docker-compose.yml
|
||||
├── .gitea/workflows/
|
||||
│ ├── refresh.yml # Weekly cron: scrape + index + image
|
||||
│ └── image-only.yml # On-demand: code-only ship cycle
|
||||
├── scrape/ # Phase 1 — product-specific scraper here
|
||||
│ └── changelog.py # Reusable: --json, --history-out
|
||||
├── rag/ # Phase 2/8 — indexing
|
||||
│ ├── embeddings.py # Ollama embedder (swappable)
|
||||
│ ├── chunk.py # Page → chunks (adjust per page format)
|
||||
│ ├── index.py # Builds Chroma + BM25
|
||||
│ └── bm25.py # SQLite FTS5 lexical index
|
||||
├── docs_mcp/ # Phase 3+ — MCP server
|
||||
│ ├── server.py # FastMCP + tool definitions
|
||||
│ └── usage.py # TimedCall telemetry
|
||||
├── eval/ # Phase 7 — golden-query harness
|
||||
│ ├── queries.jsonl.example
|
||||
│ ├── retrievers.py
|
||||
│ └── run_eval.py
|
||||
├── scripts/ # Standalone ops scripts
|
||||
│ ├── usage_report.py
|
||||
│ └── registry_gc.py
|
||||
└── deploy/
|
||||
└── docker-compose.yml
|
||||
```
|
||||
|
||||
## Conventions
|
||||
|
||||
### Tool docstrings are user interface
|
||||
|
||||
The text in `@mcp.tool()` docstrings is what the LLM sees and uses to
|
||||
decide whether to call the tool. Treat it like a button label.
|
||||
*"Use when..."*, *"Call proactively whenever..."* phrasings work
|
||||
well. Don't bury the headline in implementation notes.
|
||||
|
||||
### Side-effecting tools must be env-gated AND operator-confirmed
|
||||
|
||||
Any tool that POSTs to an external service (submit_doc_bug being the
|
||||
canonical example):
|
||||
|
||||
1. Must check an env flag at call time and return a "disabled,
|
||||
manual fallback at <URL>" message if unset.
|
||||
2. Must have a loud docstring requiring per-call operator
|
||||
confirmation in the LLM conversation flow (the LLM drafts, shows
|
||||
the operator the exact payload, asks yes/no, only then calls).
|
||||
3. Must do upfront validation (URL allowlist, content length, etc.)
|
||||
so the LLM gets a clean error instead of a wire-level failure.
|
||||
|
||||
Match the `submit_doc_bug` patterns documented in PLAN.md Phase 12.
|
||||
|
||||
### Defensive fallback for retrieval components
|
||||
|
||||
The reranker, BM25 index, and any external dependency must fail
|
||||
gracefully:
|
||||
|
||||
- Catch the specific exception type
|
||||
- Log a warning with enough info to debug
|
||||
- Fall back to a working baseline (dense-only, no reranker, etc.)
|
||||
- Never block a search_docs call on a single failure
|
||||
|
||||
The user's MCP is in front of real people; partial degradation
|
||||
beats a 500.
|
||||
|
||||
### Verify retrieval changes with the eval harness
|
||||
|
||||
Any change that touches retrieval (new embedder, chunker tweak,
|
||||
reranker model, filter shape) ships with eval numbers in the commit
|
||||
message. Don't ship retrieval changes on vibes. If `eval/queries.jsonl`
|
||||
isn't populated yet, populate it before changing retrieval — it's
|
||||
the most important file in the repo.
|
||||
|
||||
### Standard infrastructure choices
|
||||
|
||||
These are reasoned defaults — only deviate if you have a specific
|
||||
need:
|
||||
|
||||
- **Embedding model**: `nomic-embed-text` via Ollama (768-dim,
|
||||
free, on-prem)
|
||||
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
|
||||
`/v1/rerank` endpoint
|
||||
- **Vector store**: Chroma `PersistentClient`
|
||||
- **Lexical store**: SQLite FTS5 (stdlib)
|
||||
- **Fusion**: Reciprocal Rank Fusion with k=60
|
||||
- **Transport**: streamable-HTTP in prod, stdio for local dev
|
||||
- **MCP framework**: FastMCP with `stateless_http=True`
|
||||
- **Container deploy**: Watchtower auto-pull on `:latest`, rollback
|
||||
via `:<sha12>` pin
|
||||
|
||||
### Naming the product
|
||||
|
||||
The template uses `PRODUCT_NAME` env var (defaults to `"myproduct"`)
|
||||
throughout. Set it on first build. References show up in:
|
||||
|
||||
- `docs_mcp/server.py` — `FastMCP(f"{PRODUCT_NAME}-docs", ...)`
|
||||
- Collection name (`<product>_docs`)
|
||||
- BM25 db filename
|
||||
- Tool names that include the product name (e.g., the `_api_lessons`
|
||||
tool — convention is to name it `<product>_api_lessons`)
|
||||
|
||||
Use lowercase, underscores not hyphens, since it ends up in tool
|
||||
identifiers that the LLM reads.
|
||||
|
||||
## Common commands
|
||||
|
||||
```bash
|
||||
# Set up dev environment
|
||||
python -m venv venv && source venv/bin/activate
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Run the MCP server locally for Claude Desktop dev
|
||||
python -m docs_mcp.server --transport stdio
|
||||
|
||||
# Run as HTTP for integration testing
|
||||
python -m docs_mcp.server --transport streamable-http --port 8000
|
||||
|
||||
# Rebuild Chroma + BM25 indexes from corpus
|
||||
python -m rag.index --rebuild
|
||||
|
||||
# Rebuild only BM25 (fast iteration)
|
||||
python -m rag.index --bm25-only
|
||||
|
||||
# Run the eval harness
|
||||
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
|
||||
|
||||
# Generate changelog summary (called by CI, useful locally too)
|
||||
python -m scrape.changelog --cached
|
||||
python -m scrape.changelog --history-out corpus/.digest/history.jsonl --history-days 120
|
||||
```
|
||||
|
||||
## Gotchas (carried forward from the reference build)
|
||||
|
||||
- **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
|
||||
Default is shallow; history-walking steps (changelog, digest)
|
||||
silently produce empty output otherwise. This is the #1 thing
|
||||
people miss.
|
||||
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the
|
||||
ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate docs
|
||||
to ~2000 chars before sending to rerank. Full chunk text still
|
||||
goes back to the user; truncation is reranking-only.
|
||||
- **FastMCP `stateless_http=True`**: critical for production
|
||||
hosting behind Watchtower auto-updates. Without it, every
|
||||
container recreate produces a 404 storm from clients with
|
||||
stale session IDs.
|
||||
- **Runner shell is `/bin/sh` (dash)**: no `${VAR::N}` substring
|
||||
expansion in workflow scripts. Use `cut`/`awk`/`printf`.
|
||||
- **Cloudflare 100 MB body cap**: if pushing through a Cloudflare-
|
||||
fronted registry, push via LAN endpoint, pull via public
|
||||
hostname. Same registry, different URLs.
|
||||
|
||||
## When the user says...
|
||||
|
||||
| User says | You do |
|
||||
|---|---|
|
||||
| "Let's start building" / "set up the project" | Read PLAN.md Phase 0; create dirs, requirements.txt, etc. Confirm Python version and existing tooling. |
|
||||
| "Build the scraper" / "scrape the docs" | Read PLAN.md Phase 1. Find the upstream portal's underlying API by sniffing; AVOID headless-browser solutions unless the API path is truly closed. |
|
||||
| "Get retrieval working" / "make search work" | Read PLAN.md Phase 2-3. Implement chunking, embedder, Chroma indexer, then the three baseline tools. |
|
||||
| "Add a reranker" | Read PLAN.md Phase 6. Stand up the llama.cpp sidecar, implement `_rerank()`. Verify with the eval harness. |
|
||||
| "Search is missing X queries" | Run the eval harness first to confirm the failure. Then consider: rich chunk-0 rewrites, hybrid retrieval, curated knowledge layer. Don't just tune cosine. |
|
||||
| "Let's add hybrid search" | Read PLAN.md Phase 8. Only after you've established the failure mode with eval queries — hybrid is not free. |
|
||||
| "Make a tool that submits doc bugs" | Read PLAN.md Phase 12. Find the docs portal's feedback endpoint by sniffing. Build with operator confirmation as a hard requirement in the tool docstring. |
|
||||
| "I want a 'what changed' tool" | Read PLAN.md Phase 13. Don't try to do this at runtime — pre-bake the history JSONL at CI time. |
|
||||
|
||||
## Out-of-scope concerns (don't try to solve here)
|
||||
|
||||
- **Reverse proxy / TLS termination** — outside the repo. User
|
||||
picks Caddy / Cloudflare Tunnel / nginx / Traefik based on their
|
||||
infra.
|
||||
- **MetaMCP or other gateway** — outside the repo. Optional, only
|
||||
matters when running multiple MCPs.
|
||||
- **GPU container orchestration** — outside the repo. Pattern is
|
||||
one Ollama container per GPU; the indexer load-balances. Document
|
||||
it in deploy/ but don't build it in this template.
|
||||
- **Email/blog delivery for weekly_digest** — out of scope per
|
||||
PLAN.md ("Out of scope" section). Add a separate script in
|
||||
scripts/ if/when the user asks.
|
||||
+43
@@ -0,0 +1,43 @@
|
||||
# Docs MCP server — production image.
|
||||
#
|
||||
# Structure: copy code first, then the regenerable indexes last so a
|
||||
# code-only change doesn't invalidate the corpus COPY layer.
|
||||
#
|
||||
# The container runs the MCP server via streamable-http on PORT 8000.
|
||||
# Override via MCP_HOST / MCP_PORT env if you front it with a different
|
||||
# reverse-proxy setup.
|
||||
|
||||
FROM python:3.12-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Install Python deps first for cacheability.
|
||||
COPY requirements.txt /app/
|
||||
RUN pip install --no-cache-dir -r requirements.txt
|
||||
|
||||
# Code.
|
||||
COPY scrape /app/scrape
|
||||
COPY rag /app/rag
|
||||
COPY docs_mcp /app/docs_mcp
|
||||
|
||||
# Catalog. Written by the scraper at CI time.
|
||||
COPY bundles.json /app/
|
||||
|
||||
# Regenerable indexes. CI builds these from corpus/ in the same job
|
||||
# that builds the image. Listed last so code changes don't invalidate
|
||||
# the COPY layer cache for these (much larger) directories.
|
||||
#
|
||||
# bm25/ is only consulted when HYBRID_SEARCH=true (the server falls
|
||||
# back to dense-only if it's missing).
|
||||
COPY corpus /app/corpus
|
||||
COPY chroma /app/chroma
|
||||
COPY bm25 /app/bm25
|
||||
|
||||
ENV PYTHONUNBUFFERED=1 \
|
||||
MCP_TRANSPORT=streamable-http \
|
||||
MCP_HOST=0.0.0.0 \
|
||||
MCP_PORT=8000
|
||||
|
||||
EXPOSE 8000
|
||||
|
||||
ENTRYPOINT ["python", "-m", "docs_mcp.server"]
|
||||
@@ -0,0 +1,647 @@
|
||||
# Docs MCP Server — Build Guide
|
||||
|
||||
A reusable recipe for building a hosted MCP server over a product's
|
||||
public documentation. Distilled from one production build; everything
|
||||
product-specific has been factored out.
|
||||
|
||||
The end product is a streamable-HTTP MCP server with ~15 tools that
|
||||
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
|
||||
call to answer questions against the docs, surface what changed
|
||||
recently, find inconsistencies, and (optionally) submit doc bugs
|
||||
back upstream.
|
||||
|
||||
---
|
||||
|
||||
## What you're building
|
||||
|
||||
A pipeline with these stages:
|
||||
|
||||
```
|
||||
upstream docs portal
|
||||
│
|
||||
▼
|
||||
scrape ──► corpus/<bundle>/<page>.md + .json sidecar
|
||||
│
|
||||
▼
|
||||
chunk + embed ──► chroma/ (dense vectors)
|
||||
│ ──► bm25/ (FTS5 lexical index)
|
||||
▼
|
||||
MCP server ──► search_docs / get_page / diff_versions / weekly_digest /
|
||||
find_doc_inconsistencies / submit_doc_bug / ...
|
||||
│
|
||||
▼
|
||||
reverse proxy / Cloudflare Tunnel ──► public endpoint
|
||||
```
|
||||
|
||||
Two CI cadences:
|
||||
|
||||
- **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
|
||||
image build & push.
|
||||
- **On-demand image-only** (~18 min): code-only rebuild from
|
||||
committed corpus, image build & push.
|
||||
|
||||
A container registry (self-hosted Gitea works well), a host running
|
||||
Docker Compose, Watchtower auto-updating from `:latest`, and a
|
||||
reverse proxy in front.
|
||||
|
||||
---
|
||||
|
||||
## Build phases
|
||||
|
||||
Each phase is a discrete, shippable unit. Build them in order; each
|
||||
one is useful on its own and unlocks the next. Realistic effort per
|
||||
phase is given as a rough order of magnitude. Total: roughly 2–3
|
||||
weeks of focused work for the full stack.
|
||||
|
||||
### Phase 0 — Project skeleton *(half a day)*
|
||||
|
||||
Goals: directory layout, dependency manifest, virtualenv.
|
||||
|
||||
- Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`,
|
||||
`docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
|
||||
- `requirements.txt` with the dependencies you'll need across all
|
||||
phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML
|
||||
parser, ollama or sentence-transformers client, etc.).
|
||||
- `python -m venv venv` and pin Python version (3.11 or 3.12 — be
|
||||
conservative; some embedding libraries have version-specific
|
||||
wheels).
|
||||
- `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/`
|
||||
(regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`,
|
||||
`.pytest_cache/`.
|
||||
|
||||
### Phase 1 — Scraper *(2–4 days, product-specific)*
|
||||
|
||||
This is the most product-dependent phase. The goal is to write a
|
||||
scraper that produces a normalized corpus layout regardless of
|
||||
upstream portal shape.
|
||||
|
||||
Output shape (mandatory):
|
||||
|
||||
```
|
||||
corpus/
|
||||
<bundle_id>/ # one dir per "doc bundle" — see Glossary
|
||||
<page_id>.md # markdown body
|
||||
<page_id>.json # sidecar with structured metadata
|
||||
...
|
||||
bundles.json # catalog of bundles with metadata
|
||||
```
|
||||
|
||||
**Bundle metadata** (`bundles.json` is a list of these):
|
||||
|
||||
```json
|
||||
{
|
||||
"slug": "<bundle_id>",
|
||||
"title": "User-facing title",
|
||||
"version": "10.9",
|
||||
"platform": "VMware vSphere", // may be null
|
||||
"product": "Admin Guide", // optional but useful
|
||||
"language": "en-US",
|
||||
"page_count": 127,
|
||||
"dates": {
|
||||
"Added on": "2024-01-15",
|
||||
"Updated on": "2026-05-20"
|
||||
},
|
||||
"landing_page": "<page_id>"
|
||||
}
|
||||
```
|
||||
|
||||
**Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
|
||||
The one field that matters cross-cutting is `topic_cluster` (see
|
||||
Phase 9):
|
||||
|
||||
```json
|
||||
{
|
||||
"bundle_id": "<bundle_id>",
|
||||
"page_id": "<page_id>",
|
||||
"title": "How to ...",
|
||||
"ordinal": 42,
|
||||
"topic_cluster": {
|
||||
"clustering_title": "How to ...",
|
||||
"clustered_topics": [
|
||||
{"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
|
||||
{"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
If the portal exposes a cross-version "this page corresponds to that
|
||||
page" mapping, capture it here. If it doesn't, you can synthesize a
|
||||
filename-based fallback (same filename across bundle versions = same
|
||||
topic) and live without the editor-curated mapping. The features that
|
||||
read `topic_cluster` (`list_cluster`, `diff_versions`,
|
||||
`find_doc_inconsistencies`, parts of `weekly_digest`) will work
|
||||
either way; they're more accurate with real clusters.
|
||||
|
||||
**Patterns that recur across doc portals:**
|
||||
|
||||
- Most modern doc portals are SPAs. Plain `requests.get` won't see
|
||||
rendered content. Either find the underlying API the SPA calls (the
|
||||
cheapest, most reliable path), or fall back to a headless browser
|
||||
(Playwright). The API path is almost always available; sniff the
|
||||
network tab.
|
||||
- Portals usually expose a "bundle/topic" hierarchy under the hood
|
||||
(Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
|
||||
it to `bundles.json` + `corpus/<bundle>/<page>`.
|
||||
- Many portals expose `?save_local=` or `.pdf` rendered versions; the
|
||||
HTML they serve is structurally cleaner than what the page shows
|
||||
through the SPA shell.
|
||||
|
||||
**`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
|
||||
`summarize_diff()`, `render_human()`, `walk_history()` and the
|
||||
`--json` / `--history-out` modes. Mostly reusable as-is; the only
|
||||
product-specific bit is the path layout assumption.
|
||||
|
||||
### Phase 2 — Chunking + embeddings + Chroma *(2 days)*
|
||||
|
||||
Goal: build a queryable dense index from the scraped corpus.
|
||||
|
||||
- `rag/chunk.py` — split each page's markdown into ~400-600 token
|
||||
chunks. Strategy that works: paragraph-aware splitter with a
|
||||
rich "chunk 0" containing the page title + 1-sentence summary +
|
||||
bag-of-words from key terms. Chunk 0 is what dense retrieval lands
|
||||
on first; getting it right dominates retrieval quality.
|
||||
- `rag/embeddings.py` — pluggable embedder. Recommended start:
|
||||
Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
|
||||
Other defensible choices: `text-embedding-3-small` (OpenAI),
|
||||
`bge-m3` (also via Ollama). The embedder is a Chroma
|
||||
`EmbeddingFunction` that returns `list[list[float]]` for a list
|
||||
of texts.
|
||||
- `rag/index.py` — orchestrates: read corpus → emit chunks (with
|
||||
metadata: bundle_id, page_id, version, platform, ordinal) →
|
||||
upsert into Chroma collection. `--rebuild` flag for a clean
|
||||
reindex. Run via `python -m rag.index --rebuild`.
|
||||
|
||||
Chroma settings: `PersistentClient(path="chroma/")` and
|
||||
`Settings(anonymized_telemetry=False)`. Single collection
|
||||
(`<product>_docs`).
|
||||
|
||||
**GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
|
||||
(via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two
|
||||
GPUs in parallel: ~5 minutes. The orchestrator just needs to load-
|
||||
balance HTTP requests across multiple Ollama endpoints.
|
||||
|
||||
### Phase 3 — MCP server skeleton *(1 day)*
|
||||
|
||||
Goal: working FastMCP server with three tools — `search_docs`,
|
||||
`get_page`, `list_versions`.
|
||||
|
||||
- `docs_mcp/server.py` — `FastMCP("<product>-docs", stateless_http=True)`.
|
||||
`stateless_http=True` is critical for production hosting: every
|
||||
request creates an ephemeral session, so container recreates don't
|
||||
produce a 404 storm from stale `mcp-session-id` headers on
|
||||
clients.
|
||||
- Lazy initialization for everything expensive (Chroma client,
|
||||
embedder, bundles catalog) so the server starts cleanly even when
|
||||
Ollama is briefly unreachable.
|
||||
- Tool: `search_docs(query, version=None, platform=None,
|
||||
bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
|
||||
source URLs.
|
||||
- Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
|
||||
metadata.
|
||||
- Tool: `list_versions()`. Returns the version/platform facets
|
||||
available, drawn from `bundles.json`. Helps the LLM pick filter
|
||||
values.
|
||||
|
||||
Transports: stdio (for local Claude Desktop dev), streamable-HTTP
|
||||
(for hosted production). One argparse switch.
|
||||
|
||||
```python
|
||||
@mcp.tool()
|
||||
def search_docs(
|
||||
query: Annotated[str, Field(description="Natural-language query about <product>.")],
|
||||
version: Annotated[str | None, Field(description="Restrict to one version")] = None,
|
||||
...
|
||||
) -> str:
|
||||
...
|
||||
```
|
||||
|
||||
The tool descriptions are first-class context — the LLM reads them
|
||||
and decides whether to call the tool. Treat them as button labels;
|
||||
use "Call when..." / "Use proactively whenever..." phrasings.
|
||||
|
||||
### Phase 4 — Containerization *(1 day)*
|
||||
|
||||
Goal: image you can run anywhere.
|
||||
|
||||
- `Dockerfile`: Python 3.12-slim base, install requirements, COPY
|
||||
`scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
|
||||
+ (later) `bm25/`. Don't COPY `scripts/` — those stay external
|
||||
for ops use only.
|
||||
- `ENTRYPOINT ["python", "-m", "docs_mcp.server",
|
||||
"--transport", "streamable-http"]`. Configurable host/port via env.
|
||||
- `deploy/docker-compose.yml`: one service, named volumes for usage
|
||||
logs and any state, Watchtower label, depends_on for the reranker
|
||||
sidecar (Phase 6).
|
||||
|
||||
Smoke-test locally: `docker compose up` should expose
|
||||
`http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC.
|
||||
|
||||
### Phase 5 — CI on self-hosted Gitea Actions *(1–2 days)*
|
||||
|
||||
Goal: weekly cron rebuild + on-demand code-only ship cycle.
|
||||
|
||||
**Two workflows, two cadences:**
|
||||
|
||||
| Workflow | Trigger | Steps | Runtime |
|
||||
|---|---|---|---|
|
||||
| `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
|
||||
| `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |
|
||||
|
||||
**Critical settings (learned the hard way):**
|
||||
|
||||
- `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
|
||||
(shallow), which breaks any step that walks git history (changelog,
|
||||
digest history walker). Pay the ~10 second cost; never debug a
|
||||
"0-byte history file" mystery.
|
||||
- `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
|
||||
- Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
|
||||
expansion doesn't exist; use `cut` / `printf` / `awk`.
|
||||
|
||||
**Retry-on-race pattern for long-running scrapes:**
|
||||
|
||||
```bash
|
||||
attempt=1
|
||||
while [ $attempt -le 3 ]; do
|
||||
if git push; then
|
||||
echo "pushed (attempt $attempt)"
|
||||
break
|
||||
fi
|
||||
[ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
|
||||
git fetch origin main
|
||||
git rebase origin/main || { echo "conflict — bail"; exit 1; }
|
||||
attempt=$((attempt + 1))
|
||||
done
|
||||
```
|
||||
|
||||
Works because scrape commits only touch `corpus/` + `bundles.json`,
|
||||
and code merges only touch `.py` / `.yml` — disjoint paths, trivially
|
||||
clean rebases.
|
||||
|
||||
**Image tagging — three tags per build:**
|
||||
|
||||
| Tag | Purpose |
|
||||
|---|---|
|
||||
| `:latest` | Watchtower watches this for auto-deploy |
|
||||
| `:<sha12>` | Immutable; rollback target |
|
||||
| `:<YYYY.MM.DD>` | Human-readable in incident notes |
|
||||
|
||||
Same tag set on every build; rollback is a one-line compose edit
|
||||
to pin `:<sha>` instead of `:latest`.
|
||||
|
||||
**Container registry behind Cloudflare:**
|
||||
|
||||
Cloudflare's free tier has a 100 MB request body limit. Big image
|
||||
layers (Chroma index can easily be 800+ MB) exceed it on push. The
|
||||
fix is a LAN registry endpoint for push, public hostname for pull:
|
||||
|
||||
```yaml
|
||||
env:
|
||||
REGISTRY_PUSH: <lan-ip>:<port> # bypasses Cloudflare
|
||||
REGISTRY_PULL: <public-hostname> # response bodies aren't capped
|
||||
```
|
||||
|
||||
Runner needs the LAN endpoint in `/etc/docker/daemon.json`
|
||||
`insecure-registries`. Costs nothing operationally; saves hours
|
||||
of debugging.
|
||||
|
||||
**Registry GC:** weekly cron in the workflow that walks the package
|
||||
versions, keeps `:latest` + N most-recent date tags + anything
|
||||
pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the
|
||||
package GC on the Gitea side reclaims disk after.
|
||||
|
||||
### Phase 6 — Reranker *(half a day)*
|
||||
|
||||
Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
|
||||
dense candidates.
|
||||
|
||||
- A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
|
||||
`jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
|
||||
GPU strongly recommended (CPU latency is unworkable for live
|
||||
queries).
|
||||
- `_rerank(query, docs)` helper in the server: POST to the endpoint,
|
||||
apply the scores, re-sort the top-N candidates. Defensive: on any
|
||||
failure log a warning and fall through to dense-only.
|
||||
- Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
|
||||
pull candidates for reranking; 200 is a good default),
|
||||
`RERANK_TIMEOUT` (30s for cold-start tolerance).
|
||||
- **Watch the per-pair token limit.** Jina's GGUF reports
|
||||
`n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
|
||||
any pair exceeds it. Truncate doc text to ~2000 chars before
|
||||
reranking. The full untruncated chunk still goes back to the user;
|
||||
truncation is only for the reranker scoring path.
|
||||
|
||||
### Phase 7 — Eval harness *(1 day)*
|
||||
|
||||
Goal: hand-curated golden queries + standard metrics so you can
|
||||
measure the impact of any retrieval change.
|
||||
|
||||
- `eval/queries.jsonl`: 20–25 hand-curated queries with expected
|
||||
pages. Spread across versions, platforms, and difficulty levels.
|
||||
Include the queries that "obviously" should work and DON'T —
|
||||
those are the ones to track.
|
||||
- `eval/retrievers.py`: a `Retriever` protocol with concrete
|
||||
implementations: `DenseRetriever`, `RerankedRetriever`,
|
||||
`BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
|
||||
matrix dimension per knob.
|
||||
- `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
|
||||
retrievers; emits a markdown comparison table at
|
||||
`eval/results/<baseline>.md`. Commit the result so PRs land with
|
||||
the A/B evidence in the diff.
|
||||
|
||||
Three numbers are enough — don't overengineer. The hand-curated
|
||||
queries are the value; the metrics are just a stable way to score
|
||||
them.
|
||||
|
||||
### Phase 8 — BM25 + Hybrid retrieval *(half a day, conditional)*
|
||||
|
||||
**Skip unless your eval shows specific failure modes.** Dense
|
||||
embeddings + cross-encoder reranker handle most queries. The case
|
||||
where they don't: queries with rare technical tokens (filenames,
|
||||
language names, error codes) get buried at dense rank 1000+ by a
|
||||
much larger prose corpus that's semantically nearby. The reranker
|
||||
only sees top-200, so it never gets a shot.
|
||||
|
||||
- `rag/bm25.py`: SQLite FTS5 index, in the stdlib, on-disk
|
||||
(`bm25/<product>.db`). Two tables — metadata table keyed by
|
||||
rowid, FTS5 virtual table for full-text. Sanitize the query
|
||||
(strip FTS5 reserved keywords, OR-join tokens for recall). ~210
|
||||
LOC.
|
||||
- `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
|
||||
Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
|
||||
ordered ids plus per-retriever contribution dict for telemetry.
|
||||
- `search_docs` hybrid path: run dense + BM25 in parallel,
|
||||
RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
|
||||
`HYBRID_SEARCH=true`.
|
||||
- Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
|
||||
to usage logs so you can measure whether BM25 is actually earning
|
||||
its keep on production traffic.
|
||||
|
||||
If after 4–6 weeks of production data you see `bm25_only >= 80%`,
|
||||
you can simplify to BM25-only (much less infrastructure). If
|
||||
`both >= 50%`, hybrid is acting as tie-breaker not rescue — keep it
|
||||
or simplify depending on how much you care about the long tail.
|
||||
|
||||
### Phase 9 — Multi-version diff tooling *(1 day, if applicable)*
|
||||
|
||||
**Only relevant if the product has multiple maintained versions.**
|
||||
|
||||
- `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
|
||||
diff between two versions of the same page. Two matching
|
||||
strategies: editor-curated `topic_cluster` peer (if the portal
|
||||
exposes it), or same-filename fallback.
|
||||
- `list_cluster(bundle_id, page_id)`: list cross-version peers
|
||||
for one page.
|
||||
- `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
|
||||
removed / changed pages between two bundles, sorted by churn.
|
||||
- `_diff_churn(a, b)`: small helper, ~15 LOC of `difflib.unified_diff
|
||||
--unified=0` line counting. Used by `bundle_changelog`,
|
||||
`find_doc_inconsistencies`, and `weekly_digest`.
|
||||
|
||||
### Phase 10 — Usage logging *(half a day)*
|
||||
|
||||
Goal: per-call JSONL telemetry so you can answer "what are people
|
||||
actually asking for" and "is the new feature getting used."
|
||||
|
||||
- `docs_mcp/usage.py`: `TimedCall` context manager that captures
|
||||
tool name, args, elapsed time, hits returned, any extra fields
|
||||
set by the tool via `_call.set(key=value)`. Writes JSONL to
|
||||
`var/logs/usage.jsonl`, rotated daily, kept 90 days.
|
||||
- Mount the log dir as a named compose volume so logs survive
|
||||
container recreates.
|
||||
- `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
|
||||
the JSONL files, prints per-tool counts, top queries, 0-hit
|
||||
queries, filter usage histogram, reranker activity. Markdown
|
||||
output flag for piping into weekly digest emails.
|
||||
|
||||
What to log: query text, filters, hits returned, elapsed_ms,
|
||||
reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
|
||||
to log: anything PII-shaped. The corpus is public, queries are
|
||||
usually about the product, not personal — but be deliberate.
|
||||
|
||||
### Phase 11 — Curated knowledge layer *(2 days)*
|
||||
|
||||
The "RAG can't tell you what isn't in the docs" gap. Surfaces:
|
||||
|
||||
- **API quickstart repos** if the product has them. Ingest the
|
||||
example scripts (Python, PowerShell, curl) into the corpus.
|
||||
Rewrite chunk-0 for each script to embed naturally — explicit
|
||||
natural-language H1, task description sentence, keyword bag.
|
||||
Dense embeddings need an anchor.
|
||||
- **A curated `<product>_api_lessons` markdown doc** for things
|
||||
the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
|
||||
patterns, schema bugs you've hit, platform-detection quirks.
|
||||
Surface as a dedicated MCP tool whose description tells the LLM:
|
||||
*"Call proactively whenever the user asks you to write a script
|
||||
/ integrate with the API / debug a 4xx response."*
|
||||
- **An auto-hint banner** in `search_docs` results — when the
|
||||
query matches a script/API trigger word, render a one-line nudge
|
||||
at the top of results pointing at the dedicated tool. Belt-and-
|
||||
suspenders for queries where the LLM doesn't think to call it
|
||||
proactively.
|
||||
|
||||
### Phase 12 — Doc-bug workflow tools *(1 day, optional)*
|
||||
|
||||
Two tools that pair up to enable a *"check the docs for
|
||||
inconsistencies, draft bugs, confirm, submit"* workflow.
|
||||
|
||||
- `find_doc_inconsistencies(scope_query, version=None, platform=None,
|
||||
max_pages=30, checks=None)`: deterministic, read-only. Two checks:
|
||||
cross-version drift (pages whose content shifted between immediate-
|
||||
previous versions in the actionable 10–60% churn band) and
|
||||
redirect-chain detection (short pages whose body is just a "see
|
||||
[other page] for details" pointer). Heavy lifting is line-level
|
||||
diff (`difflib`) against editor-curated cluster peers; the model
|
||||
judges which findings are real bugs.
|
||||
|
||||
- `submit_doc_bug(page_url, content, email=None, rating=None,
|
||||
like=None)`: POSTs to the docs portal's feedback endpoint.
|
||||
Env-gated by `DOC_BUG_SUBMIT_ENABLED=true` so dev/staging
|
||||
deployments can't accidentally hit the upstream. The tool's
|
||||
docstring is loud about a mandatory operator-confirmation
|
||||
workflow per submission — LLM must draft, show, ask, then
|
||||
submit. Explicit *"do not loop"* instruction. Defensive
|
||||
validation upfront (URL host matches expected portal, content
|
||||
non-empty, etc.) so the LLM gets a clean error instead of a
|
||||
rejected POST.
|
||||
|
||||
**You'll need to find the docs portal's feedback endpoint.** Most
|
||||
portals route the "Was this helpful?" widget through a backend
|
||||
API; sniff the browser network tab on the live site. The payload
|
||||
shape varies; common fields: content/body, page url/href, optional
|
||||
email, optional rating, optional thumbs. Most accept anonymous
|
||||
POSTs with no captcha at the JSON-API layer (even if the widget
|
||||
shows a captcha). Validate before you ship — and if the endpoint
|
||||
has rate limits or captcha enforcement, the tool returns a clean
|
||||
"submission rejected — paste manually at <url>" fallback.
|
||||
|
||||
The whole point is the per-bug operator confirmation in the
|
||||
LLM-side conversation flow; the tool description enforces it. Do
|
||||
not bypass.
|
||||
|
||||
### Phase 13 — Weekly digest tool *(half a day)*
|
||||
|
||||
Goal: a tool that answers *"what changed in the docs in the last N
|
||||
days?"* with no runtime git dependency (the prod container has no
|
||||
git).
|
||||
|
||||
- Extend `scrape/changelog.py` with `--json` (one-shot structured
|
||||
output) and `--history-out PATH` (walks `git log --first-parent
|
||||
--since="<N> days ago"` for corpus-touching commits, writes one
|
||||
JSON line per commit to a JSONL file).
|
||||
- CI workflows write the JSONL file into the image at build time:
|
||||
`corpus/.digest/history.jsonl`. Both `refresh.yml` and
|
||||
`image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
|
||||
- New MCP tool `weekly_digest(days=7, version=None, platform=None,
|
||||
max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
|
||||
filters to the window, applies version/platform via
|
||||
`bundles.json` metadata, aggregates per-bundle change counts and
|
||||
page lists, renders markdown.
|
||||
- Post-filter totals are critical: the headline "X page changes
|
||||
across Y bundles" must compute X from the filtered set, not the
|
||||
raw record count. Otherwise filtered calls look wrong to the
|
||||
reader.
|
||||
|
||||
Out of scope but trivial bolt-ons: scheduled HTML email of the
|
||||
digest, auto-publish to a blog, per-page diff excerpts as a
|
||||
follow-up tool.
|
||||
|
||||
---
|
||||
|
||||
## Standard tool set
|
||||
|
||||
By the end you'll have ~15 tools registered. Production-tested
|
||||
shape:
|
||||
|
||||
| Tool | What it does |
|
||||
|---|---|
|
||||
| `search_docs` | Semantic search with version/platform/bundle filters |
|
||||
| `get_page` | Full markdown + metadata for one page |
|
||||
| `list_versions` | Discover available facet values |
|
||||
| `list_cluster` | Cross-version peers for one page (if applicable) |
|
||||
| `diff_versions` | Unified diff of a page across two versions |
|
||||
| `bundle_changelog` | Added / removed / changed pages between two bundles |
|
||||
| `weekly_digest` | What changed in the last N days, with filters |
|
||||
| `corpus_status` | Freshness + size of the knowledge base |
|
||||
| `find_doc_inconsistencies` | Scoped scan for doc bugs |
|
||||
| `submit_doc_bug` | Submit a drafted bug (env-gated, operator-confirmed) |
|
||||
| `<product>_api_lessons` | Curated API gotchas, proactively-called |
|
||||
| product-specific tools | Interop matrix, lifecycle queries, etc. |
|
||||
|
||||
---
|
||||
|
||||
## Per-product customization checklist
|
||||
|
||||
When applying this template to a new product, here's what you have
|
||||
to figure out yourself — everything else is shared infrastructure:
|
||||
|
||||
- **Doc portal mechanics**
|
||||
- URL pattern for pages
|
||||
- Bundle/version concept (Zoomin "bundle", Madcap "project",
|
||||
GitBook "space", Docusaurus "docs version" — same idea, different
|
||||
name)
|
||||
- SPA backing API (sniff the network tab) or fallback to
|
||||
headless browser
|
||||
- How `topic_cluster` -equivalent cross-version peers are exposed
|
||||
(or whether you synthesize them from filenames)
|
||||
- **Bundle metadata schema**
|
||||
- What does `version` look like? Semver, calendar, named?
|
||||
- What does `platform` mean for this product? Is there a useful
|
||||
facet at all?
|
||||
- Other useful facets (language, product line, edition)?
|
||||
- **Filterable facets** for `search_docs`
|
||||
- One filter per high-cardinality facet
|
||||
- Skip filters that have <5 distinct values — they're not worth
|
||||
the surface area
|
||||
- **Feedback endpoint** (for `submit_doc_bug`, if you want it)
|
||||
- URL of the POST endpoint
|
||||
- Required + optional payload fields
|
||||
- Captcha / rate-limit behavior
|
||||
- Whether anonymous submissions are accepted
|
||||
- **Curated knowledge** for the `_api_lessons` tool
|
||||
- What does the product's API documentation NOT say that you've
|
||||
learned from real integration work?
|
||||
- **Quickstart / example repos**
|
||||
- Does the vendor publish working code? Ingest it; rewrite
|
||||
chunk-0 for natural-language retrieval.
|
||||
|
||||
---
|
||||
|
||||
## Decisions worth carrying forward
|
||||
|
||||
Things you'll save time on by deciding the same way again:
|
||||
|
||||
- **Tool descriptions are user interface.** The LLM reads them
|
||||
verbatim and decides whether to call the tool. *"Use when..."*
|
||||
and *"Call proactively whenever..."* are real surfaces; treat
|
||||
them like button labels. Most retrieval improvements turn out
|
||||
to be tool-description rewrites in disguise.
|
||||
- **`stateless_http=True`** on the FastMCP server. Eliminates
|
||||
whole categories of session-ID-related 404 storms after
|
||||
container recreates.
|
||||
- **Pre-bake everything at CI time.** No runtime calls to git,
|
||||
external services, or anything you wouldn't trust on a
|
||||
Cloudflare outage. If the digest needs git history, write a
|
||||
JSONL file at CI time. If the lessons doc needs to load fast,
|
||||
bake it into the image.
|
||||
- **Env-gate every side-effecting tool.** Off by default in dev;
|
||||
on only in production compose. Belt and suspenders against
|
||||
accidental writes from staging environments.
|
||||
- **Operator-confirmation pattern for side-effecting tools.**
|
||||
The tool docstring is the only place to enforce
|
||||
human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
|
||||
"show-confirm-then-submit" — those phrasings work.
|
||||
- **Verify with hand-curated golden queries before shipping any
|
||||
retrieval change.** Numbers in the diff, in the commit message.
|
||||
Don't ship retrieval changes on vibes.
|
||||
- **Two-cadence CI** (weekly scrape vs on-demand code-only)
|
||||
saves hours per code iteration once you're past the
|
||||
one-iteration-a-week stage.
|
||||
- **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
|
||||
what Watchtower watches; `:<sha>` is your safety net. Rollback
|
||||
is a one-line compose edit, not a redeploy.
|
||||
- **Usage logging is non-negotiable.** You will be wrong about
|
||||
what people use. Capture the truth from day one; let it tell
|
||||
you which features to keep building and which to delete.
|
||||
|
||||
---
|
||||
|
||||
## Glossary
|
||||
|
||||
- **Bundle** — one logical doc set in the portal. Zoomin calls
|
||||
them bundles; Madcap calls them projects; the concept is the
|
||||
same: a versioned, titled collection of pages. One dir under
|
||||
`corpus/`.
|
||||
- **Page** — one HTML page in a bundle. One `.md` + one `.json`
|
||||
sidecar under the bundle dir.
|
||||
- **Topic cluster** — Zoomin's name for "this page in version
|
||||
10.9 corresponds to that page in version 10.8." Stored in the
|
||||
per-page sidecar. The portal-agnostic concept is "cross-version
|
||||
peer mapping."
|
||||
- **Chunk** — a unit of text that gets independently embedded and
|
||||
stored in Chroma. Target ~400-600 tokens; preserve paragraph
|
||||
boundaries.
|
||||
- **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
|
||||
lists from independent retrievers without score calibration.
|
||||
|
||||
---
|
||||
|
||||
## What's deliberately NOT in this template
|
||||
|
||||
Decisions you should make per-product (not copy from the original
|
||||
build):
|
||||
|
||||
- The reverse proxy and TLS termination layer. Could be Caddy,
|
||||
nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
|
||||
- The Gateway / aggregator in front of multiple MCPs (MetaMCP is one
|
||||
option; you may not need any aggregator if you're running a
|
||||
single product MCP).
|
||||
- The specific embedding model — `nomic-embed-text` is a strong
|
||||
default but newer / domain-specific models may be better for
|
||||
some products.
|
||||
- The Ollama containers / GPU setup — depends on what hardware you
|
||||
have. The pattern is one container per GPU with explicit
|
||||
`NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
|
||||
across them.
|
||||
- Whether to publish a blog series alongside the build. Strongly
|
||||
recommended (forces clarity, builds an audience), but optional.
|
||||
@@ -0,0 +1,104 @@
|
||||
# docs-mcp-template
|
||||
|
||||
A reusable template for building hosted MCP servers over a product's
|
||||
public documentation. Distilled from one production build; everything
|
||||
product-specific has been factored out.
|
||||
|
||||
The end product is a streamable-HTTP MCP server with ~15 tools that
|
||||
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
|
||||
call to answer questions against the docs, surface what changed
|
||||
recently, find inconsistencies, and (optionally) submit doc bugs
|
||||
back upstream.
|
||||
|
||||
## What's here
|
||||
|
||||
- **[PLAN.md](PLAN.md)** — comprehensive build guide. Phased
|
||||
approach (13 phases, ~2–3 weeks of focused work for the full
|
||||
stack). Includes the design decisions, the gotchas, and a
|
||||
per-product customization checklist.
|
||||
- **Scaffolded skeleton** — working FastMCP server with stub tools,
|
||||
Dockerfile, docker-compose, CI workflows, eval harness layout,
|
||||
usage logging. Everything you need to `git clone` and start
|
||||
filling in the product-specific bits.
|
||||
|
||||
## Quick start
|
||||
|
||||
```bash
|
||||
git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
|
||||
cd my-product-docs
|
||||
git remote remove origin # detach from template
|
||||
python -m venv venv && source venv/bin/activate
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Read PLAN.md before doing anything else. Pay particular attention to
|
||||
# Phase 1 (scraper) — that's the most product-specific phase.
|
||||
|
||||
# Run the stub server (no corpus yet — just verifies the wiring):
|
||||
python -m docs_mcp.server --transport stdio
|
||||
```
|
||||
|
||||
## Repo layout
|
||||
|
||||
```
|
||||
.
|
||||
├── PLAN.md # The build guide. Read first.
|
||||
├── README.md
|
||||
├── requirements.txt
|
||||
├── Dockerfile
|
||||
├── .gitignore
|
||||
├── .gitea/workflows/
|
||||
│ ├── refresh.yml # Weekly scrape + index + image push
|
||||
│ └── image-only.yml # On-demand code-only ship
|
||||
├── scrape/
|
||||
│ ├── README.md # Product-specific scraper goes here
|
||||
│ └── changelog.py # Reusable: --json, --history-out
|
||||
├── rag/
|
||||
│ ├── embeddings.py # Ollama embedder, swappable
|
||||
│ ├── chunk.py # Chunker — adjust per page format
|
||||
│ ├── index.py # Builds Chroma + (optionally) BM25
|
||||
│ └── bm25.py # SQLite FTS5 lexical index
|
||||
├── docs_mcp/
|
||||
│ ├── server.py # FastMCP server with stub tools
|
||||
│ └── usage.py # TimedCall + JSONL telemetry
|
||||
├── eval/
|
||||
│ ├── queries.jsonl.example # Curate ~25 hand-labeled queries
|
||||
│ ├── retrievers.py # Retriever protocol + implementations
|
||||
│ └── run_eval.py # MRR / Recall@k / nDCG@k harness
|
||||
├── scripts/
|
||||
│ ├── usage_report.py # Standalone log analyzer
|
||||
│ └── registry_gc.py # Container registry cleanup
|
||||
└── deploy/
|
||||
└── docker-compose.yml # Hosting stack: MCP + reranker + Watchtower
|
||||
```
|
||||
|
||||
## What's product-specific (must implement)
|
||||
|
||||
- `scrape/` — the scraper itself. The template gives you the corpus
|
||||
layout contract and a working `changelog.py`; the actual extraction
|
||||
logic is yours.
|
||||
- The corpus on disk (gitignored; rebuilt by CI).
|
||||
- The reranker GGUF model and llama.cpp container (commented in
|
||||
`deploy/docker-compose.yml`).
|
||||
- The reverse proxy / TLS layer in front of the public endpoint.
|
||||
- The hand-curated knowledge surface (your product's API gotchas,
|
||||
example scripts, anything the LLM should know that the docs
|
||||
don't say).
|
||||
|
||||
## What's NOT product-specific (works as-is)
|
||||
|
||||
- FastMCP server skeleton + tool decoration pattern
|
||||
- Chroma + Ollama embedding pipeline
|
||||
- BM25 / SQLite FTS5 lexical index
|
||||
- Hybrid retrieval (RRF) + reranker integration
|
||||
- Eval harness (Retriever protocol, MRR/Recall/nDCG)
|
||||
- Usage logging (TimedCall, JSONL, daily rotation)
|
||||
- CI workflow shape (weekly + on-demand, retry-on-race, three-tag
|
||||
image scheme)
|
||||
- Registry GC script
|
||||
- Standard tools: `search_docs`, `get_page`, `list_versions`,
|
||||
`diff_versions`, `bundle_changelog`, `weekly_digest`,
|
||||
`find_doc_inconsistencies`, `submit_doc_bug`, etc.
|
||||
|
||||
## License
|
||||
|
||||
Internal template. Adjust before publishing.
|
||||
@@ -0,0 +1,111 @@
|
||||
# Hosting stack for a docs MCP server.
|
||||
#
|
||||
# Replace <product> below with your product name on first deploy.
|
||||
# Volumes: usage logs are mounted to a host path so they survive
|
||||
# Watchtower-driven container recreates.
|
||||
#
|
||||
# This template assumes a reverse proxy / Cloudflare Tunnel terminates
|
||||
# TLS in front of port 8000. Adjust if your infra differs.
|
||||
|
||||
services:
|
||||
|
||||
# The MCP server. Watchtower auto-pulls on :latest changes.
|
||||
<product>-docs-mcp:
|
||||
image: <registry>/<owner>/<product>-docs-mcp:latest
|
||||
container_name: <product>-docs-mcp
|
||||
restart: unless-stopped
|
||||
ports:
|
||||
- "8000:8000"
|
||||
environment:
|
||||
PRODUCT_NAME: "<product>"
|
||||
PRODUCT_DOCS_URL: "https://docs.example.com"
|
||||
|
||||
# Streamable-HTTP transport. Stateless mode is required for
|
||||
# production: clients don't lose sessions when Watchtower
|
||||
# recreates the container.
|
||||
MCP_TRANSPORT: streamable-http
|
||||
MCP_HOST: 0.0.0.0
|
||||
MCP_PORT: "8000"
|
||||
|
||||
# If you run MetaMCP or another gateway in front and reach
|
||||
# this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
|
||||
# add that hostname here. "*" disables the rebind check entirely.
|
||||
MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"
|
||||
|
||||
# Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
|
||||
RERANK_URL: http://<product>-rerank:8080
|
||||
RERANK_POOL: "200"
|
||||
RERANK_TIMEOUT: "30"
|
||||
|
||||
# Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
|
||||
# only after the eval harness shows the dense-only path
|
||||
# missing technical-term queries that BM25 catches.
|
||||
HYBRID_SEARCH: "true"
|
||||
|
||||
# Phase 10 — usage telemetry.
|
||||
USAGE_LOG_DIR: /app/var/logs
|
||||
USAGE_LOG_KEEP_DAYS: "90"
|
||||
|
||||
# Phase 12 — doc-bug submission gate. Off by default; on only
|
||||
# in production after you've verified the endpoint contract.
|
||||
DOC_BUG_SUBMIT_ENABLED: "false"
|
||||
# DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
|
||||
volumes:
|
||||
# Usage logs persist across container recreates.
|
||||
- ./<product>-docs-mcp-logs:/app/var/logs
|
||||
depends_on:
|
||||
- <product>-rerank
|
||||
labels:
|
||||
# Watchtower polls *only* containers with this label set true.
|
||||
com.centurylinklabs.watchtower.enable: "true"
|
||||
networks:
|
||||
- mcp
|
||||
|
||||
# Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
|
||||
# Requires GPU access; adjust runtime/devices for your hardware.
|
||||
<product>-rerank:
|
||||
image: ghcr.io/ggml-org/llama.cpp:server-cuda
|
||||
container_name: <product>-rerank
|
||||
restart: unless-stopped
|
||||
# Mount the GGUF model from the host. Download from huggingface
|
||||
# (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
|
||||
volumes:
|
||||
- /path/to/models:/models:ro
|
||||
command: >
|
||||
--model /models/jina-reranker-v2-base.Q8_0.gguf
|
||||
--reranking
|
||||
--host 0.0.0.0
|
||||
--port 8080
|
||||
--n-gpu-layers 99
|
||||
--ctx-size 4096
|
||||
deploy:
|
||||
resources:
|
||||
reservations:
|
||||
devices:
|
||||
- driver: nvidia
|
||||
count: 1
|
||||
capabilities: [gpu]
|
||||
networks:
|
||||
- mcp
|
||||
|
||||
# Watchtower — auto-pulls :latest on push.
|
||||
# Only watches containers labeled `com.centurylinklabs.watchtower.enable=true`.
|
||||
watchtower:
|
||||
image: containrrr/watchtower:latest
|
||||
container_name: watchtower
|
||||
restart: unless-stopped
|
||||
volumes:
|
||||
- /var/run/docker.sock:/var/run/docker.sock
|
||||
environment:
|
||||
WATCHTOWER_POLL_INTERVAL: "300" # 5 min
|
||||
WATCHTOWER_LABEL_ENABLE: "true"
|
||||
WATCHTOWER_CLEANUP: "true" # remove old images after pull
|
||||
# If your registry requires auth, mount a docker config:
|
||||
# volumes:
|
||||
# - ./registry-auth.json:/config.json:ro
|
||||
networks:
|
||||
- mcp
|
||||
|
||||
networks:
|
||||
mcp:
|
||||
driver: bridge
|
||||
@@ -0,0 +1,263 @@
|
||||
"""MCP server skeleton — fill in PRODUCT_NAME and the tool bodies.
|
||||
|
||||
This file is the template's structural anchor. The phases described in
|
||||
PLAN.md add or extend pieces of this file:
|
||||
|
||||
Phase 3 — search_docs, get_page, list_versions stubs (you are here)
|
||||
Phase 6 — reranker integration in search_docs
|
||||
Phase 8 — BM25 + hybrid retrieval (HYBRID_SEARCH env gate, _rrf_fuse)
|
||||
Phase 9 — diff_versions, list_cluster, bundle_changelog
|
||||
Phase 10 — TimedCall wiring (already imported below)
|
||||
Phase 11 — <product>_api_lessons tool
|
||||
Phase 12 — find_doc_inconsistencies, submit_doc_bug
|
||||
Phase 13 — weekly_digest + _digest_history reader
|
||||
|
||||
Every stub below has a docstring + `raise NotImplementedError`. Replace
|
||||
the body when you reach the corresponding phase. Keep the signatures
|
||||
stable across products — clients depend on them.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Annotated
|
||||
|
||||
from mcp.server.fastmcp import FastMCP
|
||||
from pydantic import Field
|
||||
|
||||
from .usage import TimedCall
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Product-specific configuration. Set these for each new build.
|
||||
# ---------------------------------------------------------------------------
|
||||
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
|
||||
PRODUCT_DOCS_URL = os.environ.get("PRODUCT_DOCS_URL", "https://docs.example.com")
|
||||
COLLECTION = f"{PRODUCT_NAME}_docs"
|
||||
|
||||
# Paths inside the deployed container (and matching layout locally for dev).
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
CORPUS = ROOT / "corpus"
|
||||
CHROMA_DIR = ROOT / "chroma"
|
||||
BM25_DB = Path(os.environ.get("BM25_DB", str(ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
|
||||
BUNDLES_JSON = ROOT / "bundles.json"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Feature flags (Phase 6 / 8 / 12 enable these as you ship each phase).
|
||||
# ---------------------------------------------------------------------------
|
||||
RERANK_URL = os.environ.get("RERANK_URL", "").rstrip("/") or None
|
||||
RERANK_POOL = int(os.environ.get("RERANK_POOL", "50"))
|
||||
RERANK_TIMEOUT = float(os.environ.get("RERANK_TIMEOUT", "30"))
|
||||
|
||||
HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "").lower() in ("true", "1", "yes", "on")
|
||||
RRF_K = int(os.environ.get("RRF_K", "60"))
|
||||
|
||||
DOC_BUG_SUBMIT_ENABLED = os.environ.get("DOC_BUG_SUBMIT_ENABLED", "").lower() in ("true", "1", "yes", "on")
|
||||
DOC_BUG_API_URL = os.environ.get("DOC_BUG_API_URL", "") # product-specific endpoint
|
||||
DOC_BUG_TIMEOUT = float(os.environ.get("DOC_BUG_TIMEOUT", "15"))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# FastMCP setup.
|
||||
#
|
||||
# stateless_http=True — every request creates an ephemeral session and
|
||||
# discards it on return. Critical for production: clients don't get
|
||||
# 404 storms when the container is recreated by Watchtower.
|
||||
# ---------------------------------------------------------------------------
|
||||
mcp = FastMCP(f"{PRODUCT_NAME}-docs", stateless_http=True)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Lazy helpers — instantiate expensive things only when actually needed,
|
||||
# so the server still starts when (e.g.) Ollama is briefly unreachable.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _bundles() -> dict[str, dict]:
|
||||
"""Cached load of bundles.json into a {slug: bundle_dict} mapping.
|
||||
|
||||
bundles.json is the product-specific catalog written by the Phase 1
|
||||
scraper. See PLAN.md Phase 1 for the schema.
|
||||
"""
|
||||
if not BUNDLES_JSON.exists():
|
||||
return {}
|
||||
cat = json.loads(BUNDLES_JSON.read_text())
|
||||
return {b["slug"]: b for b in cat}
|
||||
|
||||
|
||||
def _build_where(version: str | None, platform: str | None, bundle_id: str | None) -> dict | None:
|
||||
"""Translate filter args into a Chroma `where` clause."""
|
||||
conds: list[dict] = []
|
||||
if version:
|
||||
conds.append({"version": version})
|
||||
if platform:
|
||||
conds.append({"platform": platform})
|
||||
if bundle_id:
|
||||
conds.append({"bundle_id": bundle_id})
|
||||
if not conds:
|
||||
return None
|
||||
if len(conds) == 1:
|
||||
return conds[0]
|
||||
return {"$and": conds}
|
||||
|
||||
|
||||
def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
|
||||
"""Read a corpus page off disk. Returns (markdown_body, metadata_dict)."""
|
||||
md_path = CORPUS / bundle_id / (page_id + ".md")
|
||||
json_path = CORPUS / bundle_id / (page_id + ".json")
|
||||
if not md_path.exists() or not json_path.exists():
|
||||
return None
|
||||
return md_path.read_text(), json.loads(json_path.read_text())
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Tools
|
||||
# ===========================================================================
|
||||
|
||||
@mcp.tool()
|
||||
def search_docs(
|
||||
query: Annotated[str, Field(description=f"Natural-language query about {PRODUCT_NAME}.")],
|
||||
version: Annotated[
|
||||
str | None,
|
||||
Field(description="OPTIONAL version filter — restrict to one product version."),
|
||||
] = None,
|
||||
platform: Annotated[
|
||||
str | None,
|
||||
Field(description="OPTIONAL platform filter. Set to one of the platforms listed by list_versions(); omit for all platforms."),
|
||||
] = None,
|
||||
bundle_id: Annotated[
|
||||
str | None,
|
||||
Field(description="OPTIONAL bundle filter — pin to a specific doc bundle slug."),
|
||||
] = None,
|
||||
k: Annotated[int, Field(description="Number of results to return.", ge=1, le=50)] = 10,
|
||||
) -> str:
|
||||
"""Search the {product} docs corpus.
|
||||
|
||||
Returns the top-k most relevant chunks (with full source page URLs)
|
||||
given a natural-language query. Optional filters narrow the search
|
||||
to one version, one platform, or one bundle. Use list_versions()
|
||||
first if you need to discover the available facet values.
|
||||
|
||||
Call this tool whenever the user asks anything that should be
|
||||
answerable from the official product documentation.
|
||||
"""
|
||||
with TimedCall("search_docs", {
|
||||
"query": query, "version": version, "platform": platform,
|
||||
"bundle_id": bundle_id, "k": k,
|
||||
}) as _call:
|
||||
# TODO Phase 2-3: query Chroma collection (see rag/index.py for
|
||||
# how it was built). Render the top-k chunks as markdown with
|
||||
# source URLs.
|
||||
# TODO Phase 6: optional reranker via _rerank() if RERANK_URL set.
|
||||
# TODO Phase 8: hybrid retrieval if HYBRID_SEARCH=true — run
|
||||
# dense + BM25 in parallel, RRF-fuse, hand merged pool to rerank.
|
||||
_call.set(hits_returned=0)
|
||||
raise NotImplementedError("Phase 2/3: implement Chroma query + rendering")
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def get_page(
|
||||
bundle_id: Annotated[str, Field(description="Bundle slug.")],
|
||||
page_id: Annotated[str, Field(description="Page filename within the bundle.")],
|
||||
) -> str:
|
||||
"""Return the full markdown for one page, plus a metadata header.
|
||||
|
||||
Use after search_docs surfaces a relevant page and the user (or you)
|
||||
want the complete text — not just the matched chunks.
|
||||
"""
|
||||
with TimedCall("get_page", {"bundle_id": bundle_id, "page_id": page_id}) as _call:
|
||||
data = _read_page(bundle_id, page_id)
|
||||
if data is None:
|
||||
_call.set(found=False)
|
||||
return f"Page not found: {bundle_id}/{page_id}"
|
||||
md, meta = data
|
||||
_call.set(found=True, page_chars=len(md))
|
||||
# TODO: add a metadata header (title, version, source URL) above
|
||||
# the body. Product-specific shape.
|
||||
return md
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def list_versions() -> str:
|
||||
"""List the available version/platform facets across all bundles.
|
||||
|
||||
Use this to discover valid filter values for search_docs.
|
||||
"""
|
||||
with TimedCall("list_versions", {}) as _call:
|
||||
cat = _bundles()
|
||||
if not cat:
|
||||
return "_(no bundles indexed yet — run the scraper + indexer)_"
|
||||
versions = sorted({b.get("version") for b in cat.values() if b.get("version")})
|
||||
platforms = sorted({b.get("platform") for b in cat.values() if b.get("platform")})
|
||||
_call.set(versions=len(versions), platforms=len(platforms))
|
||||
lines = [f"# Facets across {len(cat)} bundle(s)", ""]
|
||||
if versions:
|
||||
lines.append("## Versions"); lines.append("")
|
||||
for v in versions: lines.append(f"- `{v}`")
|
||||
lines.append("")
|
||||
if platforms:
|
||||
lines.append("## Platforms"); lines.append("")
|
||||
for p in platforms: lines.append(f"- `{p}`")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stubs for later phases — keep the signatures in this file so refactors
|
||||
# don't lose the contracts. Implementations come per phase.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# @mcp.tool() # Phase 9
|
||||
# def list_cluster(bundle_id: str, page_id: str) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 9
|
||||
# def diff_versions(bundle_id: str, page_id: str, against_bundle_id: str, context: int = 3) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 9
|
||||
# def bundle_changelog(bundle_id_new: str, bundle_id_old: str, min_churn: int = 5, max_changed: int = 50) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 13
|
||||
# def weekly_digest(days: int = 7, version: str | None = None, platform: str | None = None, ...) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 9 (or 3 — useful early)
|
||||
# def corpus_status() -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 11
|
||||
# def myproduct_api_lessons(topic: str | None = None) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 12
|
||||
# def find_doc_inconsistencies(scope_query: str, ...) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 12
|
||||
# def submit_doc_bug(page_url: str, content: str, email: str | None = None, ...) -> str: ...
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# Entry point
|
||||
# ===========================================================================
|
||||
|
||||
def main() -> None:
|
||||
import argparse
|
||||
p = argparse.ArgumentParser(description=f"{PRODUCT_NAME} docs MCP server")
|
||||
p.add_argument("--transport", choices=["stdio", "streamable-http", "sse"],
|
||||
default=os.environ.get("MCP_TRANSPORT", "stdio"))
|
||||
p.add_argument("--host", default=os.environ.get("MCP_HOST", "0.0.0.0"))
|
||||
p.add_argument("--port", type=int, default=int(os.environ.get("MCP_PORT", "8000")))
|
||||
args = p.parse_args()
|
||||
|
||||
if args.transport == "stdio":
|
||||
mcp.run()
|
||||
else:
|
||||
mcp.settings.host = args.host
|
||||
mcp.settings.port = args.port
|
||||
# DNS-rebinding protection defaults to localhost-only — disable for
|
||||
# container-network DNS hostnames. See PLAN.md "Hosting" notes.
|
||||
if os.environ.get("MCP_DISABLE_DNS_REBINDING_PROTECTION") in {"1", "true", "yes"}:
|
||||
mcp.settings.transport_security.enable_dns_rebinding_protection = False
|
||||
mcp.run(transport=args.transport)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,127 @@
|
||||
"""Per-call usage telemetry — JSONL with daily rotation and retention.
|
||||
|
||||
Reusable as-is across products. Drop the import + `with TimedCall(...)`
|
||||
into any tool body and the call gets logged with the tool name, args,
|
||||
elapsed time, and any extra fields the tool sets via `_call.set(...)`.
|
||||
|
||||
The log file is `var/logs/usage.jsonl` by default (override with the
|
||||
`USAGE_LOG_DIR` env). Daily rotation; files older than
|
||||
`USAGE_LOG_KEEP_DAYS` (default 90) are deleted on next write.
|
||||
|
||||
Layout of one record:
|
||||
|
||||
{
|
||||
"ts": "2026-05-22T13:14:15+00:00",
|
||||
"tool": "search_docs",
|
||||
"args": {"query": "...", "version": "10.9", "k": 10},
|
||||
"elapsed_ms": 142.5,
|
||||
"hits_returned": 7, # optional, set by the tool
|
||||
"reranked": true, # optional, set by the tool
|
||||
// ... any other key the tool sets via _call.set(...)
|
||||
}
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
import threading
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
USAGE_LOG_DIR = Path(os.environ.get("USAGE_LOG_DIR", "var/logs"))
|
||||
USAGE_LOG_KEEP_DAYS = int(os.environ.get("USAGE_LOG_KEEP_DAYS", "90"))
|
||||
|
||||
# Single global lock to serialize writes from multiple request handlers.
|
||||
# JSONL appends are atomic at the OS level for short records on most
|
||||
# filesystems, but the lock is cheap and saves you from cross-platform
|
||||
# surprises.
|
||||
_lock = threading.Lock()
|
||||
_last_rotation_check: float = 0.0
|
||||
|
||||
|
||||
def _maybe_rotate() -> None:
|
||||
"""Move usage.jsonl → usage.jsonl.<yesterday> if the date has rolled.
|
||||
|
||||
Cheap to call; we only do filesystem work when a day has actually
|
||||
passed since the last check.
|
||||
"""
|
||||
global _last_rotation_check
|
||||
now = time.time()
|
||||
if now - _last_rotation_check < 300: # 5 min cap between checks
|
||||
return
|
||||
_last_rotation_check = now
|
||||
|
||||
USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
|
||||
active = USAGE_LOG_DIR / "usage.jsonl"
|
||||
if active.exists():
|
||||
try:
|
||||
mtime = datetime.fromtimestamp(active.stat().st_mtime, tz=timezone.utc).date()
|
||||
today = datetime.now(timezone.utc).date()
|
||||
if mtime < today:
|
||||
rotated = USAGE_LOG_DIR / f"usage.jsonl.{mtime.isoformat()}"
|
||||
if not rotated.exists():
|
||||
active.rename(rotated)
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
# Retention: delete usage.jsonl.YYYY-MM-DD files older than the
|
||||
# retention window. The active file is never deleted by this.
|
||||
cutoff = datetime.now(timezone.utc).date() - timedelta(days=USAGE_LOG_KEEP_DAYS)
|
||||
for f in USAGE_LOG_DIR.glob("usage.jsonl.*"):
|
||||
try:
|
||||
datestamp = f.name.split(".", 2)[-1]
|
||||
if datetime.fromisoformat(datestamp).date() < cutoff:
|
||||
f.unlink()
|
||||
except (ValueError, OSError):
|
||||
continue
|
||||
|
||||
|
||||
class TimedCall:
|
||||
"""Context manager that captures one tool call's telemetry record.
|
||||
|
||||
Usage:
|
||||
|
||||
with TimedCall("search_docs", {"query": q, ...}) as call:
|
||||
... do the work ...
|
||||
call.set(hits_returned=len(results), reranked=True)
|
||||
|
||||
On exit, writes one JSONL record to usage.jsonl. Exceptions are
|
||||
captured into the `error` field; the exception is re-raised so
|
||||
the tool's caller sees the failure.
|
||||
"""
|
||||
|
||||
def __init__(self, tool: str, args: dict[str, Any]):
|
||||
self.tool = tool
|
||||
self.args = args
|
||||
self.extra: dict[str, Any] = {}
|
||||
self._t0: float = 0.0
|
||||
|
||||
def set(self, **kwargs: Any) -> None:
|
||||
"""Attach extra fields to the eventual telemetry record."""
|
||||
self.extra.update(kwargs)
|
||||
|
||||
def __enter__(self) -> "TimedCall":
|
||||
self._t0 = time.perf_counter()
|
||||
return self
|
||||
|
||||
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
|
||||
elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
|
||||
record: dict[str, Any] = {
|
||||
"ts": datetime.now(timezone.utc).isoformat(),
|
||||
"tool": self.tool,
|
||||
"args": self.args,
|
||||
"elapsed_ms": round(elapsed_ms, 2),
|
||||
}
|
||||
if exc_type is not None:
|
||||
record["error"] = f"{exc_type.__name__}: {exc_val}"
|
||||
record.update(self.extra)
|
||||
|
||||
_maybe_rotate()
|
||||
with _lock:
|
||||
USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
|
||||
with open(USAGE_LOG_DIR / "usage.jsonl", "a") as fh:
|
||||
fh.write(json.dumps(record, separators=(",", ":")) + "\n")
|
||||
# Don't swallow the exception — the caller still needs to see it.
|
||||
@@ -0,0 +1,4 @@
|
||||
{"query": "how to install <product> on Linux", "expected": [{"bundle_id": "Install.Linux.10.0", "page_id": "Installation.htm"}], "tags": ["install", "linux"]}
|
||||
{"query": "configure database connection for high availability", "expected": [{"bundle_id": "Admin.10.0", "page_id": "HA_Setup.htm"}], "tags": ["ha", "config"]}
|
||||
{"query": "API endpoint to list users", "expected": [{"bundle_id": "API.10.0", "page_id": "Users_API.htm"}], "tags": ["api"]}
|
||||
{"query": "what changed between 10.0 and 10.1", "expected": [{"bundle_id": "Release_Notes.10.1", "page_id": "Whats_New.htm"}], "tags": ["release-notes"]}
|
||||
@@ -0,0 +1,62 @@
|
||||
"""Retriever protocol + concrete implementations.
|
||||
|
||||
A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
|
||||
so the eval harness can compare them apples-to-apples. Implement these
|
||||
once at Phase 7 and reuse them across every retrieval change.
|
||||
|
||||
Each retriever returns a ranked list of (bundle_id, page_id) tuples
|
||||
deduplicated to the page level (chunks within the same page collapse
|
||||
to one entry; the highest-ranked chunk's position wins).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Protocol, Iterable
|
||||
|
||||
|
||||
class Retriever(Protocol):
|
||||
name: str
|
||||
|
||||
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
|
||||
"""Return up to k (bundle_id, page_id) tuples in rank order."""
|
||||
...
|
||||
|
||||
|
||||
def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
|
||||
"""Take a stream of (bundle_id, page_id, chunk_ordinal) and return
|
||||
the first k unique pages in their first-seen order."""
|
||||
seen: set[tuple[str, str]] = set()
|
||||
out: list[tuple[str, str]] = []
|
||||
for bid, pid, _ord in chunk_ids:
|
||||
key = (bid, pid)
|
||||
if key in seen:
|
||||
continue
|
||||
seen.add(key)
|
||||
out.append(key)
|
||||
if len(out) >= k:
|
||||
break
|
||||
return out
|
||||
|
||||
|
||||
# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
|
||||
# in place. Each one is small (15-30 LOC). The eval harness imports
|
||||
# from this module by class name.
|
||||
#
|
||||
# class DenseRetriever:
|
||||
# name = "dense"
|
||||
# def __init__(self, collection): self.col = collection
|
||||
# def retrieve(self, query, k=10): ...
|
||||
#
|
||||
# class RerankedRetriever:
|
||||
# name = "dense+rerank"
|
||||
# def __init__(self, collection, rerank_url, pool=200): ...
|
||||
# def retrieve(self, query, k=10): ...
|
||||
#
|
||||
# class BM25Retriever:
|
||||
# name = "bm25"
|
||||
# def __init__(self, bm25_index): ...
|
||||
# def retrieve(self, query, k=10): ...
|
||||
#
|
||||
# class HybridRetriever:
|
||||
# name = "bm25+dense+rrf"
|
||||
# def __init__(self, dense, bm25, k_rrf=60): ...
|
||||
# def retrieve(self, query, k=10): ...
|
||||
@@ -0,0 +1,91 @@
|
||||
"""Run all retrievers against eval/queries.jsonl, emit a markdown report.
|
||||
|
||||
Metrics computed per retriever:
|
||||
|
||||
MRR — mean reciprocal rank of the FIRST expected page in the
|
||||
ranked result list (0 if not in top-k).
|
||||
Recall@K — fraction of expected pages that appear in top-K.
|
||||
nDCG@K — discounted gain weighted by rank position.
|
||||
|
||||
The "right" number depends on what you're measuring. MRR tracks "the
|
||||
first-line answer is correct"; Recall@K tracks "everything relevant
|
||||
is there to draw from"; nDCG@K is a smoother combination of both.
|
||||
For docs-RAG, MRR is usually the headline metric.
|
||||
|
||||
Usage:
|
||||
|
||||
python -m eval.run_eval \\
|
||||
--queries eval/queries.jsonl \\
|
||||
--k 5 \\
|
||||
--output eval/results/baseline.md
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
|
||||
|
||||
def load_queries(path: Path) -> list[dict]:
|
||||
with open(path) as fh:
|
||||
return [json.loads(line) for line in fh if line.strip()]
|
||||
|
||||
|
||||
def reciprocal_rank(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]]) -> float:
|
||||
expected_set = set(expected)
|
||||
for i, page in enumerate(retrieved, start=1):
|
||||
if page in expected_set:
|
||||
return 1.0 / i
|
||||
return 0.0
|
||||
|
||||
|
||||
def recall_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
|
||||
if not expected:
|
||||
return 0.0
|
||||
retrieved_set = set(retrieved[:k])
|
||||
hits = sum(1 for e in expected if e in retrieved_set)
|
||||
return hits / len(expected)
|
||||
|
||||
|
||||
def ndcg_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
|
||||
expected_set = set(expected)
|
||||
dcg = 0.0
|
||||
for i, page in enumerate(retrieved[:k], start=1):
|
||||
if page in expected_set:
|
||||
dcg += 1.0 / math.log2(i + 1)
|
||||
# Ideal DCG: every expected page in the top positions.
|
||||
idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
|
||||
return dcg / idcg if idcg else 0.0
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--queries", type=Path, default=Path("eval/queries.jsonl"))
|
||||
p.add_argument("--k", type=int, default=5)
|
||||
p.add_argument("--output", type=Path, default=Path("eval/results/baseline.md"))
|
||||
args = p.parse_args()
|
||||
|
||||
if not args.queries.exists():
|
||||
print(f"queries file not found: {args.queries}")
|
||||
print("hint: copy eval/queries.jsonl.example and edit")
|
||||
return 1
|
||||
|
||||
queries = load_queries(args.queries)
|
||||
print(f"loaded {len(queries)} queries")
|
||||
|
||||
# TODO Phase 7: instantiate the retrievers you implemented in
|
||||
# eval/retrievers.py and run each one against each query.
|
||||
# Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
|
||||
# markdown table to args.output. Commit the file alongside the
|
||||
# PR that changes retrieval.
|
||||
raise NotImplementedError(
|
||||
"Wire up the retrievers in eval/retrievers.py first, then "
|
||||
"fill in this evaluation loop. See PLAN.md Phase 7."
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
+277
@@ -0,0 +1,277 @@
|
||||
"""SQLite FTS5-backed BM25 retrieval over the same chunks Chroma indexes.
|
||||
|
||||
Hybrid retrieval (BM25 + dense + Reciprocal Rank Fusion) addresses a
|
||||
limit of single-tower dense embeddings: when a query has specific
|
||||
technical terms (filenames, language names, error codes, API paths),
|
||||
the dense embedding doesn't bridge from the query into a short
|
||||
code-focused chunk. The chunk loses to the much larger crowd of
|
||||
prose chunks that semantically match the query topic.
|
||||
|
||||
BM25 handles this directly. Lexical overlap on rare terms ("python",
|
||||
"create_vpg.py", "PROTECTED_SITE_ID", "applyUpgrade") scores those
|
||||
chunks high. Fused with the dense ranking via RRF, the hybrid result
|
||||
is strictly better than either alone for the queries we've seen
|
||||
fail.
|
||||
|
||||
Why SQLite FTS5:
|
||||
- In the stdlib. Zero new deps.
|
||||
- On-disk. Same persistence model as Chroma — Docker COPY the dir,
|
||||
`rag.index --rebuild` regenerates from corpus.
|
||||
- Built-in `bm25()` ranking function. No knobs to tune that matter
|
||||
for our use case (k1=1.2, b=0.75 defaults are fine).
|
||||
- Builds 70k+ chunks in seconds. Faster than the Chroma rebuild's
|
||||
embedding step by 100×, so it adds basically nothing to the
|
||||
full-rebuild cycle.
|
||||
|
||||
Schema is two tables to keep filtering clean. FTS5 doesn't filter
|
||||
nicely on its own columns; the content_rowid pattern keeps an
|
||||
external metadata table joinable by rowid:
|
||||
|
||||
CREATE TABLE chunks_meta (
|
||||
rowid INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
id TEXT UNIQUE,
|
||||
bundle_id TEXT, page_id TEXT, version TEXT,
|
||||
platform TEXT, product TEXT, ordinal INTEGER
|
||||
);
|
||||
CREATE VIRTUAL TABLE chunks_fts USING fts5(
|
||||
text,
|
||||
tokenize = 'porter unicode61 remove_diacritics 2',
|
||||
content = 'chunks_meta',
|
||||
content_rowid = 'rowid'
|
||||
);
|
||||
|
||||
Queries:
|
||||
|
||||
SELECT m.id, bm25(chunks_fts) AS score
|
||||
FROM chunks_meta m
|
||||
JOIN chunks_fts f ON m.rowid = f.rowid
|
||||
WHERE f MATCH ?
|
||||
AND m.version = ? -- optional metadata filter
|
||||
ORDER BY bm25(chunks_fts) -- lower = better in FTS5
|
||||
LIMIT ?;
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import re
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
# Default location: bm25/<product>_docs.db at the repo root, next to chroma/.
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
DEFAULT_DB_DIR = ROOT / "bm25"
|
||||
DEFAULT_DB_NAME = "<product>_docs.db"
|
||||
|
||||
# Columns we expose as filterable metadata. Mirrors what _build_where in
|
||||
# docs_mcp/server.py accepts so the same filter dicts work for both
|
||||
# Chroma and BM25 without per-retriever translation in the caller.
|
||||
FILTER_COLUMNS = ("bundle_id", "page_id", "version", "platform", "product", "ordinal")
|
||||
|
||||
|
||||
# Allowlist tokenizer for free-text queries. FTS5's parser chokes on lots
|
||||
# of punctuation we routinely see in user queries (".10.9", "?", "VPG's",
|
||||
# em-dash, etc.). Rather than blocklist every operator, just keep
|
||||
# alphanumerics + a few separators and replace everything else with a
|
||||
# space. This loses the ability to phrase-search ("exact match") but we
|
||||
# don't expose that to users anyway — they ask natural-language questions
|
||||
# and want the answer, not a Boolean DSL.
|
||||
_KEEP_RE = re.compile(r"[^A-Za-z0-9_\s]")
|
||||
# FTS5 reserves these Boolean operator KEYWORDS at the token level —
|
||||
# stripping them avoids accidental phrase-query behavior when a user
|
||||
# query happens to contain bare "AND", "OR", "NOT", "NEAR".
|
||||
_BOOLEAN_KW_RE = re.compile(r"(?<!\w)(AND|OR|NOT|NEAR)(?!\w)")
|
||||
|
||||
|
||||
def _sanitize_query(text: str) -> str:
|
||||
"""Reduce a natural-language query to an FTS5 OR-of-tokens query.
|
||||
|
||||
Two transformations:
|
||||
|
||||
1. Non-alphanumeric → space (drops punctuation; "10.9?" becomes
|
||||
"10 9"). Lets us handle versions, parens, question marks, etc.
|
||||
without inviting FTS5 parse errors.
|
||||
2. Boolean keywords stripped (FTS5 reserves AND/OR/NOT/NEAR).
|
||||
3. Tokens explicitly OR'd. FTS5's default is AND-of-tokens — for
|
||||
any non-trivial natural-language query that means zero hits
|
||||
(no chunk contains every word). OR semantics is what we want:
|
||||
BM25 already weights documents containing more query terms
|
||||
higher, so we don't lose precision, but we DO gain recall.
|
||||
"""
|
||||
cleaned = _KEEP_RE.sub(" ", text)
|
||||
cleaned = _BOOLEAN_KW_RE.sub(" ", cleaned)
|
||||
tokens = cleaned.split()
|
||||
if not tokens:
|
||||
return ""
|
||||
return " OR ".join(tokens)
|
||||
|
||||
|
||||
def _where_to_sql(where: dict | None) -> tuple[str, list[Any]]:
|
||||
"""Translate a Chroma-shaped filter dict into a SQL fragment + params.
|
||||
|
||||
Accepts the same shapes ``docs_mcp.server._build_where`` produces:
|
||||
|
||||
None → ("", [])
|
||||
{"version": "10.9"} → ("AND m.version = ?", ["10.9"])
|
||||
{"$and": [{...}, {...}]} → ("AND m.X = ? AND m.Y = ?", [...])
|
||||
|
||||
Unknown keys are silently dropped (defensive — better to over-match
|
||||
than to crash on a filter we don't know).
|
||||
"""
|
||||
if not where:
|
||||
return "", []
|
||||
parts: list[str] = []
|
||||
params: list[Any] = []
|
||||
|
||||
def _emit_eq(cond: dict[str, Any]) -> None:
|
||||
for k, v in cond.items():
|
||||
if k in FILTER_COLUMNS:
|
||||
parts.append(f"m.{k} = ?")
|
||||
params.append(v)
|
||||
|
||||
if "$and" in where:
|
||||
for sub in where["$and"]:
|
||||
_emit_eq(sub)
|
||||
else:
|
||||
_emit_eq(where)
|
||||
if not parts:
|
||||
return "", []
|
||||
return "AND " + " AND ".join(parts), params
|
||||
|
||||
|
||||
class BM25Index:
|
||||
"""Thin wrapper around an FTS5-backed sqlite db.
|
||||
|
||||
Single-writer model. Reads are connection-per-call (sqlite handles
|
||||
concurrency through file locks; for our read-heavy workload that's
|
||||
fine and avoids cross-thread connection sharing issues with the MCP
|
||||
server's request handlers).
|
||||
"""
|
||||
|
||||
def __init__(self, db_path: Path | None = None):
|
||||
self.db_path = Path(db_path) if db_path else (DEFAULT_DB_DIR / DEFAULT_DB_NAME)
|
||||
|
||||
# -- build ----------------------------------------------------------
|
||||
|
||||
def build(self, records: list[dict]) -> int:
|
||||
"""Rebuild the index from scratch from `records`.
|
||||
|
||||
`records` is the same list ``rag.index.page_records`` produces:
|
||||
``[{"id": ..., "text": ..., "metadata": {...}}, ...]``. Bulk
|
||||
insert wrapped in a transaction — single-digit seconds for the
|
||||
full 73k-chunk corpus.
|
||||
"""
|
||||
self.db_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
# Drop and recreate. Idempotent rebuild.
|
||||
if self.db_path.exists():
|
||||
self.db_path.unlink()
|
||||
with sqlite3.connect(self.db_path) as con:
|
||||
con.executescript(self._schema_sql())
|
||||
con.executemany(
|
||||
"INSERT INTO chunks_meta (id, bundle_id, page_id, version, "
|
||||
"platform, product, ordinal) VALUES (?, ?, ?, ?, ?, ?, ?)",
|
||||
[
|
||||
(
|
||||
r["id"],
|
||||
r["metadata"].get("bundle_id") or "",
|
||||
r["metadata"].get("page_id") or "",
|
||||
r["metadata"].get("version") or "",
|
||||
r["metadata"].get("platform") or "",
|
||||
r["metadata"].get("product") or "",
|
||||
int(r["metadata"].get("ordinal") or 0),
|
||||
)
|
||||
for r in records
|
||||
],
|
||||
)
|
||||
# Populate the FTS5 contentless-ish table by rowid. We populated
|
||||
# chunks_meta first; rowids align with insertion order.
|
||||
con.executemany(
|
||||
"INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
|
||||
[
|
||||
(i + 1, r["text"])
|
||||
for i, r in enumerate(records)
|
||||
],
|
||||
)
|
||||
con.commit()
|
||||
log.info("bm25: indexed %d chunks → %s", len(records), self.db_path)
|
||||
return len(records)
|
||||
|
||||
# -- query ----------------------------------------------------------
|
||||
|
||||
def query(
|
||||
self,
|
||||
text: str,
|
||||
n: int = 200,
|
||||
where: dict | None = None,
|
||||
) -> list[tuple[str, float]]:
|
||||
"""Return up to `n` (chunk_id, bm25_score) pairs, lowest score first.
|
||||
|
||||
FTS5's bm25() returns NEGATIVE numbers — more relevant docs have
|
||||
smaller (more negative) scores. We order ASC so the first row is
|
||||
the most relevant. Callers that need a "rank" should enumerate
|
||||
the returned list.
|
||||
"""
|
||||
sanitized = _sanitize_query(text)
|
||||
if not sanitized:
|
||||
return []
|
||||
where_sql, params = _where_to_sql(where)
|
||||
# FTS5 MATCH wants the unaliased table name on its left, so we use
|
||||
# chunks_fts (no alias) and JOIN by rowid against chunks_meta.
|
||||
sql = (
|
||||
"SELECT m.id, bm25(chunks_fts) AS score "
|
||||
"FROM chunks_fts "
|
||||
"JOIN chunks_meta m ON m.rowid = chunks_fts.rowid "
|
||||
f"WHERE chunks_fts MATCH ? {where_sql} "
|
||||
"ORDER BY bm25(chunks_fts) "
|
||||
"LIMIT ?"
|
||||
)
|
||||
try:
|
||||
with sqlite3.connect(self.db_path) as con:
|
||||
cur = con.execute(sql, [sanitized, *params, n])
|
||||
return [(row[0], float(row[1])) for row in cur.fetchall()]
|
||||
except sqlite3.OperationalError as e:
|
||||
# FTS5 syntax error (rare after sanitization) or db missing.
|
||||
# Caller decides whether to fall back to dense-only.
|
||||
log.warning("bm25 query failed (%s); query=%r", e, sanitized[:80])
|
||||
return []
|
||||
|
||||
def exists(self) -> bool:
|
||||
"""Cheap probe — does the index file exist on disk?"""
|
||||
return self.db_path.exists()
|
||||
|
||||
def count(self) -> int:
|
||||
"""Number of chunks indexed. 0 if the db is missing or empty."""
|
||||
if not self.exists():
|
||||
return 0
|
||||
try:
|
||||
with sqlite3.connect(self.db_path) as con:
|
||||
return con.execute("SELECT COUNT(*) FROM chunks_meta").fetchone()[0]
|
||||
except sqlite3.OperationalError:
|
||||
return 0
|
||||
|
||||
# -- schema ---------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _schema_sql() -> str:
|
||||
return """
|
||||
CREATE TABLE chunks_meta (
|
||||
rowid INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
id TEXT UNIQUE NOT NULL,
|
||||
bundle_id TEXT,
|
||||
page_id TEXT,
|
||||
version TEXT,
|
||||
platform TEXT,
|
||||
product TEXT,
|
||||
ordinal INTEGER
|
||||
);
|
||||
CREATE INDEX idx_meta_version ON chunks_meta(version);
|
||||
CREATE INDEX idx_meta_platform ON chunks_meta(platform);
|
||||
CREATE INDEX idx_meta_bundle ON chunks_meta(bundle_id);
|
||||
|
||||
CREATE VIRTUAL TABLE chunks_fts USING fts5(
|
||||
text,
|
||||
tokenize = 'porter unicode61 remove_diacritics 2'
|
||||
);
|
||||
"""
|
||||
+126
@@ -0,0 +1,126 @@
|
||||
"""Markdown chunker — paragraph-aware, ~400-600 token target.
|
||||
|
||||
Adjust the chunking strategy per product if your page format differs
|
||||
significantly from prose. The output shape (id, text, metadata) is
|
||||
fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
|
||||
change that.
|
||||
|
||||
The key knob you'll tune per product is chunk-0. Dense retrieval lands
|
||||
on chunk 0 first for most queries. Make it a synthetic chunk built
|
||||
from:
|
||||
|
||||
- the page title (as natural-language H1)
|
||||
- a 1-sentence task description (you'll have to generate this — for
|
||||
pages that already have a "## Overview" or "## Introduction" the
|
||||
first sentence usually works)
|
||||
- a keyword bag of important terms (filenames, API names, error
|
||||
codes — the rare technical tokens that BM25 lights up on)
|
||||
|
||||
Without a rich chunk 0, dense retrieval gets dominated by the much
|
||||
larger prose body, and short pages (script examples, reference cards)
|
||||
get buried.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from typing import Iterator
|
||||
|
||||
|
||||
# Approximate token estimate from char count. Tunable — set per
|
||||
# embedder if the default 4 chars/token is wrong.
|
||||
CHARS_PER_TOKEN = 4
|
||||
TARGET_TOKENS = 500
|
||||
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
|
||||
|
||||
|
||||
def estimate_tokens(text: str) -> int:
|
||||
return max(1, len(text) // CHARS_PER_TOKEN)
|
||||
|
||||
|
||||
def split_paragraphs(md: str) -> list[str]:
|
||||
"""Split markdown into paragraph-ish blocks.
|
||||
|
||||
Keeps fenced code blocks together (don't slice through ```).
|
||||
Headings start new paragraphs.
|
||||
"""
|
||||
blocks: list[str] = []
|
||||
current: list[str] = []
|
||||
in_fence = False
|
||||
for line in md.splitlines(keepends=True):
|
||||
stripped = line.strip()
|
||||
if stripped.startswith("```"):
|
||||
in_fence = not in_fence
|
||||
current.append(line)
|
||||
continue
|
||||
if in_fence:
|
||||
current.append(line)
|
||||
continue
|
||||
if stripped.startswith("#"):
|
||||
if current:
|
||||
blocks.append("".join(current).strip())
|
||||
current = []
|
||||
current.append(line)
|
||||
continue
|
||||
if not stripped and current and not "".join(current).strip().endswith("\n\n"):
|
||||
current.append(line)
|
||||
blocks.append("".join(current).strip())
|
||||
current = []
|
||||
continue
|
||||
current.append(line)
|
||||
if current:
|
||||
blocks.append("".join(current).strip())
|
||||
return [b for b in blocks if b]
|
||||
|
||||
|
||||
def chunks_from_page(
|
||||
text: str,
|
||||
page_id: str,
|
||||
metadata: dict,
|
||||
) -> Iterator[dict]:
|
||||
"""Yield chunk dicts ready for index.py to upsert.
|
||||
|
||||
The synthetic chunk 0 is the per-product customization point. The
|
||||
default below is a simple title + body-first-paragraph; rewrite
|
||||
for richer retrieval signal (see module docstring).
|
||||
"""
|
||||
paragraphs = split_paragraphs(text)
|
||||
if not paragraphs:
|
||||
return
|
||||
|
||||
# ----- Chunk 0: synthetic anchor for dense retrieval ---------
|
||||
title = metadata.get("title") or page_id
|
||||
first_para = next((p for p in paragraphs if not p.startswith("#")), "")
|
||||
chunk0_body = (
|
||||
f"# {title}\n\n"
|
||||
f"{first_para[:300]}"
|
||||
# TODO per product: append a keyword bag here (filenames,
|
||||
# API names, error codes) for BM25 + dense joint coverage.
|
||||
)
|
||||
yield {
|
||||
"id": f"{metadata['bundle_id']}::{page_id}::0",
|
||||
"text": chunk0_body,
|
||||
"metadata": {**metadata, "ordinal": 0},
|
||||
}
|
||||
|
||||
# ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
|
||||
ordinal = 1
|
||||
buf: list[str] = []
|
||||
buf_chars = 0
|
||||
for p in paragraphs:
|
||||
if buf_chars + len(p) > TARGET_CHARS and buf:
|
||||
yield {
|
||||
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
||||
"text": "\n\n".join(buf),
|
||||
"metadata": {**metadata, "ordinal": ordinal},
|
||||
}
|
||||
ordinal += 1
|
||||
buf = []
|
||||
buf_chars = 0
|
||||
buf.append(p)
|
||||
buf_chars += len(p)
|
||||
if buf:
|
||||
yield {
|
||||
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
||||
"text": "\n\n".join(buf),
|
||||
"metadata": {**metadata, "ordinal": ordinal},
|
||||
}
|
||||
@@ -0,0 +1,72 @@
|
||||
"""Embedding function for Chroma — Ollama-hosted nomic-embed-text by default.
|
||||
|
||||
Swappable: implement the same `embedding_function()` interface returning
|
||||
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
|
||||
|
||||
Defaults (override via env):
|
||||
OLLAMA_URL one or more comma-separated URLs (load-balanced)
|
||||
EMBED_MODEL model name; default 'nomic-embed-text'
|
||||
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
import httpx
|
||||
from chromadb import EmbeddingFunction, Documents, Embeddings
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
|
||||
"http://localhost:11434").split(",") if u.strip()]
|
||||
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
|
||||
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
|
||||
|
||||
|
||||
class OllamaEmbeddings(EmbeddingFunction):
|
||||
"""Calls /api/embed across N Ollama endpoints, naive round-robin.
|
||||
|
||||
For indexing throughput on multiple GPUs, run one Ollama container
|
||||
per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
|
||||
in OLLAMA_URL — the embedder picks the next endpoint per batch.
|
||||
"""
|
||||
|
||||
def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
|
||||
self.urls = urls
|
||||
self.model = model
|
||||
self._next = 0
|
||||
|
||||
def __call__(self, input: Documents) -> Embeddings:
|
||||
url = self.urls[self._next % len(self.urls)]
|
||||
self._next += 1
|
||||
with httpx.Client(timeout=300) as c:
|
||||
r = c.post(f"{url}/api/embed",
|
||||
json={"model": self.model, "input": list(input)})
|
||||
r.raise_for_status()
|
||||
data = r.json()
|
||||
return data.get("embeddings") or []
|
||||
|
||||
def name(self) -> str: # newer chromadb requires this
|
||||
return f"ollama:{self.model}"
|
||||
|
||||
@staticmethod
|
||||
def build_from_config(config: dict) -> "OllamaEmbeddings": # newer chromadb
|
||||
return OllamaEmbeddings(
|
||||
urls=config.get("urls", OLLAMA_URLS),
|
||||
model=config.get("model", EMBED_MODEL),
|
||||
)
|
||||
|
||||
def get_config(self) -> dict: # newer chromadb
|
||||
return {"urls": self.urls, "model": self.model}
|
||||
|
||||
def default_space(self) -> str:
|
||||
return "cosine"
|
||||
|
||||
def supported_spaces(self) -> list[str]:
|
||||
return ["cosine", "l2", "ip"]
|
||||
|
||||
|
||||
def embedding_function() -> EmbeddingFunction:
|
||||
return OllamaEmbeddings()
|
||||
+134
@@ -0,0 +1,134 @@
|
||||
"""Build Chroma (and optionally BM25) indexes from corpus on disk.
|
||||
|
||||
Reads `corpus/<bundle>/<page>.{md,json}`, chunks each page, upserts
|
||||
into Chroma. With --rebuild, drops + recreates the collection (clean
|
||||
state). With --bm25-only, skips Chroma and rebuilds only the FTS5
|
||||
index — useful for fast iteration when chunking didn't change.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Iterator
|
||||
|
||||
import chromadb
|
||||
from chromadb.config import Settings
|
||||
|
||||
from .chunk import chunks_from_page
|
||||
from .embeddings import embedding_function
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
CORPUS = ROOT / "corpus"
|
||||
CHROMA_DIR = ROOT / "chroma"
|
||||
|
||||
# Collection name — convention: <product>_docs. Override via env if needed.
|
||||
import os
|
||||
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
|
||||
COLLECTION = f"{PRODUCT_NAME}_docs"
|
||||
|
||||
|
||||
def page_records() -> Iterator[dict]:
|
||||
"""Walk corpus/, yield chunks for every page."""
|
||||
if not CORPUS.exists():
|
||||
log.error("corpus/ doesn't exist; run the scraper first")
|
||||
return
|
||||
for bundle_dir in sorted(CORPUS.iterdir()):
|
||||
if not bundle_dir.is_dir() or bundle_dir.name.startswith("."):
|
||||
continue
|
||||
for md_path in sorted(bundle_dir.glob("*.md")):
|
||||
page_id = md_path.stem
|
||||
sidecar = md_path.with_suffix(".json")
|
||||
if not sidecar.exists():
|
||||
log.warning("skipping %s — no JSON sidecar", md_path)
|
||||
continue
|
||||
md = md_path.read_text()
|
||||
meta = json.loads(sidecar.read_text())
|
||||
# Surface common filter fields at the chunk-metadata level
|
||||
# so Chroma's `where` filter can use them.
|
||||
base_meta = {
|
||||
"bundle_id": bundle_dir.name,
|
||||
"page_id": page_id,
|
||||
"title": meta.get("title") or "",
|
||||
"version": meta.get("version") or "",
|
||||
"platform": meta.get("platform") or "",
|
||||
"product": meta.get("product") or "",
|
||||
}
|
||||
yield from chunks_from_page(md, page_id, base_meta)
|
||||
|
||||
|
||||
def upsert_to_chroma(records: list[dict]) -> int:
|
||||
client = chromadb.PersistentClient(
|
||||
path=str(CHROMA_DIR),
|
||||
settings=Settings(anonymized_telemetry=False),
|
||||
)
|
||||
# Drop + recreate for --rebuild semantics
|
||||
try:
|
||||
client.delete_collection(COLLECTION)
|
||||
except Exception:
|
||||
pass
|
||||
col = client.create_collection(COLLECTION, embedding_function=embedding_function())
|
||||
|
||||
BATCH = 64
|
||||
total = 0
|
||||
for i in range(0, len(records), BATCH):
|
||||
chunk = records[i:i + BATCH]
|
||||
col.upsert(
|
||||
ids=[r["id"] for r in chunk],
|
||||
documents=[r["text"] for r in chunk],
|
||||
metadatas=[r["metadata"] for r in chunk],
|
||||
)
|
||||
total += len(chunk)
|
||||
log.info("upserted %d / %d chunks", total, len(records))
|
||||
return total
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--rebuild", action="store_true",
|
||||
help="Drop and recreate the Chroma collection.")
|
||||
p.add_argument("--bm25-only", action="store_true",
|
||||
help="Rebuild only the BM25 index, skip Chroma.")
|
||||
p.add_argument("--bm25-db", type=Path,
|
||||
default=ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db",
|
||||
help="Path to the BM25 sqlite db.")
|
||||
args = p.parse_args()
|
||||
|
||||
log.info("reading corpus from %s", CORPUS)
|
||||
t0 = time.time()
|
||||
records = list(page_records())
|
||||
log.info("loaded %d chunks in %.1fs", len(records), time.time() - t0)
|
||||
|
||||
if args.bm25_only:
|
||||
from .bm25 import BM25Index
|
||||
log.info("--bm25-only: building FTS5 only")
|
||||
BM25Index(args.bm25_db).build(records)
|
||||
return 0
|
||||
|
||||
if not args.rebuild:
|
||||
log.info("no --rebuild; nothing to do. (Use --rebuild to upsert.)")
|
||||
return 0
|
||||
|
||||
t_c = time.time()
|
||||
n = upsert_to_chroma(records)
|
||||
log.info("chroma: %d chunks in %.1fs", n, time.time() - t_c)
|
||||
|
||||
# Build BM25 too — see PLAN.md Phase 8. Safe to remove this block
|
||||
# for products that don't need hybrid retrieval.
|
||||
try:
|
||||
from .bm25 import BM25Index
|
||||
t_b = time.time()
|
||||
BM25Index(args.bm25_db).build(records)
|
||||
log.info("bm25 done in %.1fs", time.time() - t_b)
|
||||
except ImportError:
|
||||
log.info("rag.bm25 not available — skipping BM25 build")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -0,0 +1,19 @@
|
||||
# MCP server
|
||||
mcp[fastmcp]>=1.0.0
|
||||
pydantic>=2.0
|
||||
httpx>=0.27
|
||||
|
||||
# Vector store + embeddings
|
||||
chromadb>=0.5.0
|
||||
ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not
|
||||
|
||||
# Scraping (Phase 1; adjust per product)
|
||||
beautifulsoup4>=4.12
|
||||
requests>=2.31
|
||||
# playwright>=1.40 # uncomment if you need headless browser fallback
|
||||
|
||||
# Evaluation
|
||||
numpy>=1.26
|
||||
|
||||
# Dev / utility
|
||||
python-dateutil>=2.8
|
||||
@@ -0,0 +1,59 @@
|
||||
# scrape/
|
||||
|
||||
Product-specific. **You implement this for each product.** The
|
||||
template gives you the contract; the extraction logic depends on
|
||||
the upstream doc portal.
|
||||
|
||||
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
|
||||
expects.
|
||||
|
||||
## What you write
|
||||
|
||||
At minimum, two scripts:
|
||||
|
||||
### `scrape/bundles.py`
|
||||
|
||||
Discovers the upstream portal's bundle catalog and writes
|
||||
`bundles.json` at the repo root. One entry per bundle (versioned doc
|
||||
set) with the schema in PLAN.md.
|
||||
|
||||
```bash
|
||||
python -m scrape.bundles
|
||||
```
|
||||
|
||||
### `scrape/runner.py`
|
||||
|
||||
Scrapes the pages of each bundle (or a single bundle with `--bundle
|
||||
<slug>`). Writes:
|
||||
|
||||
- `corpus/<bundle_id>/<page_id>.md` — extracted markdown body
|
||||
- `corpus/<bundle_id>/<page_id>.json` — per-page metadata sidecar
|
||||
|
||||
```bash
|
||||
python -m scrape.runner --all --force --concurrency 6
|
||||
python -m scrape.runner --bundle Admin.VC.HTML.10.9
|
||||
```
|
||||
|
||||
## Tips
|
||||
|
||||
- **Sniff before you scrape.** Almost every modern doc portal is an
|
||||
SPA that calls a backend API. Open the browser's Network tab,
|
||||
click around, find the underlying JSON. Scraping the API is 10×
|
||||
cheaper and 100× more reliable than scraping the rendered HTML.
|
||||
- **Idempotent re-scrapes.** Without `--force`, the runner should
|
||||
skip pages already on disk so a resume doesn't have to re-fetch
|
||||
everything. With `--force`, re-fetch every page — that's the
|
||||
weekly cron mode that catches edits.
|
||||
- **Respect the portal.** Backoff on 429s. Set a recognizable
|
||||
user-agent so the portal owner can identify you if they want to.
|
||||
- **Whitespace normalize.** Markdown that round-trips through HTML
|
||||
often has extra blank lines. Normalize to a single blank between
|
||||
paragraphs so diffs are clean (the changelog summary and digest
|
||||
tools care about line counts).
|
||||
|
||||
## What's already reusable
|
||||
|
||||
`scrape/changelog.py` is fully product-agnostic and ready to use
|
||||
as-is. It walks `git diff --name-status` output to produce a
|
||||
structured summary, and walks `git log` for the digest history
|
||||
(Phase 13).
|
||||
@@ -0,0 +1,272 @@
|
||||
"""Generate a summary of corpus changes.
|
||||
|
||||
Two output shapes for two consumers:
|
||||
|
||||
1. Human-readable text (default) — written into the weekly-refresh
|
||||
commit message so the commit log is greppable for *"what changed
|
||||
this week"* instead of *"806 files changed"*.
|
||||
|
||||
2. Structured JSON (``--json``) and rolling JSONL history
|
||||
(``--history-out``) — consumed by the ``weekly_digest`` MCP tool.
|
||||
Computed in CI and committed at ``corpus/.digest/history.jsonl``;
|
||||
the tool reads it at runtime because the prod container is a
|
||||
static filesystem COPY with no git available.
|
||||
|
||||
Usage:
|
||||
|
||||
# Commit-message helper (existing behavior — unchanged)
|
||||
python -m scrape.changelog [--cached] [--ref REF]
|
||||
|
||||
# One-shot JSON for the current diff range
|
||||
python -m scrape.changelog --cached --json
|
||||
|
||||
# Build / refresh the digest history file (CI use)
|
||||
python -m scrape.changelog --history-out corpus/.digest/history.jsonl \\
|
||||
--history-days 120
|
||||
|
||||
The history walker only includes commits that touch ``corpus/`` (or
|
||||
``bundles.json``); it skips pure code/CI commits. Each emitted record
|
||||
carries the commit's short sha, ISO timestamp, subject, and the same
|
||||
structured summary the ``--json`` path produces, so the consumer can
|
||||
treat history records and one-shot summaries interchangeably.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import subprocess
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from typing import Any
|
||||
|
||||
|
||||
def git(*args: str) -> str:
|
||||
return subprocess.check_output(["git", *args], text=True)
|
||||
|
||||
|
||||
def summarize_diff(diff_output: str) -> dict[str, Any]:
|
||||
"""Parse ``git diff --name-status`` output into a structured summary.
|
||||
|
||||
Pure function (no IO, no git calls) so the same logic is exercised
|
||||
by the human-readable, JSON-one-shot, and history-walking paths.
|
||||
|
||||
Returns a dict with:
|
||||
|
||||
md_count int — total .md files changed
|
||||
json_count int — total .json sidecars changed
|
||||
content_bundles dict — {bundle_id: [page_id_without_.md, ...]}
|
||||
Only bundles where at least one .md
|
||||
file moved. Lists are in the order
|
||||
git emitted them.
|
||||
json_only_bundles list[str] — bundles whose ONLY change was sidecar
|
||||
drift (no .md changes). Sorted.
|
||||
new_bundles list[str] — bundles whose first .md was Added
|
||||
in this diff. Sorted.
|
||||
other_files list[str] — any non-corpus path mentioned in the
|
||||
diff, as ``"STATUS path"`` strings.
|
||||
"""
|
||||
md_changes: dict[str, list[str]] = defaultdict(list)
|
||||
json_only_bundles: set[str] = set()
|
||||
new_bundles: set[str] = set()
|
||||
md_count = json_count = 0
|
||||
other_files: list[str] = []
|
||||
|
||||
for line in diff_output.splitlines():
|
||||
if not line.strip():
|
||||
continue
|
||||
# status<TAB>path (or status<TAB>old<TAB>new for renames; we take
|
||||
# the post-rename path as the canonical location).
|
||||
parts = line.split("\t")
|
||||
status, path = parts[0], parts[-1]
|
||||
if not path.startswith("corpus/"):
|
||||
other_files.append(f"{status} {path}")
|
||||
continue
|
||||
segs = path.split("/", 2)
|
||||
if len(segs) < 3:
|
||||
# corpus/<filename> with no bundle dir — skip.
|
||||
continue
|
||||
_, bundle, page = segs
|
||||
if page.endswith(".md"):
|
||||
md_changes[bundle].append(page[:-3])
|
||||
md_count += 1
|
||||
if status == "A":
|
||||
new_bundles.add(bundle)
|
||||
elif page.endswith(".json"):
|
||||
json_count += 1
|
||||
json_only_bundles.add(bundle)
|
||||
|
||||
# A bundle counts as "content-changing" if it had any .md edit. Sidecar-
|
||||
# only drift goes in the separate bucket so the commit message doesn't
|
||||
# report timestamp churn as if it were real edits.
|
||||
content_bundles_set = set(md_changes)
|
||||
drift_only = sorted(json_only_bundles - content_bundles_set)
|
||||
|
||||
return {
|
||||
"md_count": md_count,
|
||||
"json_count": json_count,
|
||||
"content_bundles": dict(md_changes), # cast back to plain dict for JSON
|
||||
"json_only_bundles": drift_only,
|
||||
"new_bundles": sorted(new_bundles),
|
||||
"other_files": other_files,
|
||||
}
|
||||
|
||||
|
||||
def render_human(summary: dict[str, Any]) -> str:
|
||||
"""Format a summary dict as the multi-line commit-message text.
|
||||
|
||||
Matches the historical output exactly so existing commit-message
|
||||
tooling and downstream readers don't have to change.
|
||||
"""
|
||||
lines: list[str] = []
|
||||
content_bundles = sorted(summary["content_bundles"])
|
||||
md_count = summary["md_count"]
|
||||
json_count = summary["json_count"]
|
||||
new_bundles = set(summary["new_bundles"])
|
||||
drift_only = summary["json_only_bundles"]
|
||||
other_files = summary["other_files"]
|
||||
|
||||
lines.append(f"{md_count} content change(s) across {len(content_bundles)} bundle(s)")
|
||||
lines.append(f"{json_count} sidecar metadata update(s)")
|
||||
if new_bundles:
|
||||
lines.append(f"{len(new_bundles)} new bundle(s) added")
|
||||
if other_files:
|
||||
lines.append(f"{len(other_files)} other file change(s)")
|
||||
|
||||
if content_bundles:
|
||||
lines.append("")
|
||||
lines.append("Bundles with content changes:")
|
||||
for b in content_bundles:
|
||||
pages = summary["content_bundles"][b]
|
||||
tag = " (NEW)" if b in new_bundles else ""
|
||||
lines.append(f" {b}{tag}: {len(pages)} page(s)")
|
||||
for p in pages[:5]:
|
||||
lines.append(f" - {p}")
|
||||
if len(pages) > 5:
|
||||
lines.append(f" ... and {len(pages) - 5} more")
|
||||
if drift_only:
|
||||
lines.append("")
|
||||
head = ", ".join(drift_only[:10])
|
||||
suffix = " …" if len(drift_only) > 10 else ""
|
||||
lines.append(f"Bundles with sidecar-only drift ({len(drift_only)}): {head}{suffix}")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def walk_history(history_days: int) -> list[dict[str, Any]]:
|
||||
"""Walk recent corpus-touching commits, emit one summary per commit.
|
||||
|
||||
Uses ``git log --first-parent main`` to keep the rolling weekly-
|
||||
refresh line clean of branch-merge noise. Only commits whose diff
|
||||
touches ``corpus/`` or ``bundles.json`` are emitted; pure code
|
||||
commits are skipped (they have nothing to digest).
|
||||
|
||||
Each record:
|
||||
|
||||
{
|
||||
"sha": "<short sha>",
|
||||
"timestamp": "<ISO 8601, UTC>",
|
||||
"subject": "<commit subject line>",
|
||||
... + every field from summarize_diff()
|
||||
}
|
||||
"""
|
||||
# Find candidate commits. --first-parent keeps the linear refresh history
|
||||
# on main and ignores branch-side merges. We still need to filter by what
|
||||
# the commit actually touched, because non-corpus commits can land on
|
||||
# main (PR merges for code, CI tweaks, etc.).
|
||||
raw = git(
|
||||
"log",
|
||||
f"--since={history_days} days ago",
|
||||
"--first-parent",
|
||||
"main",
|
||||
"--pretty=format:%H%x09%cI%x09%s",
|
||||
)
|
||||
|
||||
records: list[dict[str, Any]] = []
|
||||
for line in raw.splitlines():
|
||||
if not line.strip():
|
||||
continue
|
||||
parts = line.split("\t", 2)
|
||||
if len(parts) < 3:
|
||||
continue
|
||||
sha, ts, subject = parts
|
||||
|
||||
# What did this commit actually touch? Cheap: just the name-status diff
|
||||
# against its first parent. Empty stdout = commit didn't change any
|
||||
# files we care about. Root commits (no parent) error out — suppress
|
||||
# the stderr noise and skip them.
|
||||
try:
|
||||
diff = subprocess.check_output(
|
||||
["git", "diff", "--name-status", f"{sha}^..{sha}"],
|
||||
text=True,
|
||||
stderr=subprocess.DEVNULL,
|
||||
)
|
||||
except subprocess.CalledProcessError:
|
||||
continue
|
||||
if not diff.strip():
|
||||
continue
|
||||
|
||||
summary = summarize_diff(diff)
|
||||
# Skip pure code commits — only emit records that have actual corpus
|
||||
# content motion. This is what makes the history "interesting" for
|
||||
# the weekly digest.
|
||||
if summary["md_count"] == 0 and summary["json_count"] == 0 and not summary["new_bundles"]:
|
||||
continue
|
||||
|
||||
records.append({
|
||||
"sha": sha[:12],
|
||||
"timestamp": ts,
|
||||
"subject": subject,
|
||||
**summary,
|
||||
})
|
||||
|
||||
return records
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--cached", action="store_true",
|
||||
help="Summarize staged changes instead of a ref range.")
|
||||
p.add_argument("--ref", default="HEAD^..HEAD",
|
||||
help="Diff range to summarize (default: HEAD^..HEAD).")
|
||||
p.add_argument("--json", dest="as_json", action="store_true",
|
||||
help="Emit one JSON object instead of the human-readable form.")
|
||||
p.add_argument("--history-out", metavar="PATH",
|
||||
help="Walk recent corpus-touching commits and write a "
|
||||
"JSONL history file at PATH. Overwrites if it exists. "
|
||||
"Implies the history walker; --cached/--ref are ignored.")
|
||||
p.add_argument("--history-days", type=int, default=120,
|
||||
help="How far back the history walker looks (default 120).")
|
||||
args = p.parse_args()
|
||||
|
||||
# History-walker path: build the JSONL file consumed by the
|
||||
# weekly_digest MCP tool, then exit. CI uses this.
|
||||
if args.history_out:
|
||||
records = walk_history(args.history_days)
|
||||
# Sort by timestamp ascending so the file is roughly stable
|
||||
# across rebuilds (commits within a single run could otherwise
|
||||
# depend on git log default ordering).
|
||||
records.sort(key=lambda r: r["timestamp"])
|
||||
with open(args.history_out, "w") as fh:
|
||||
for rec in records:
|
||||
fh.write(json.dumps(rec, separators=(",", ":")) + "\n")
|
||||
# Brief stdout signal for CI logs — easy to spot in the workflow run.
|
||||
print(f"wrote {len(records)} commit record(s) to {args.history_out} "
|
||||
f"covering up to {args.history_days} days")
|
||||
return 0
|
||||
|
||||
# One-shot summary path. Unchanged behavior for --cached / --ref.
|
||||
if args.cached:
|
||||
diff_args = ["diff", "--name-status", "--cached"]
|
||||
else:
|
||||
diff_args = ["diff", "--name-status", args.ref]
|
||||
diff = git(*diff_args)
|
||||
summary = summarize_diff(diff)
|
||||
|
||||
if args.as_json:
|
||||
print(json.dumps(summary, separators=(",", ":")))
|
||||
else:
|
||||
print(render_human(summary))
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
@@ -0,0 +1,108 @@
|
||||
"""Gitea container-registry garbage collection.
|
||||
|
||||
Lists package versions for one container package and deletes versions
|
||||
older than --keep-days. Always preserves:
|
||||
|
||||
- the :latest tag
|
||||
- the --keep-latest most-recent date-tagged versions
|
||||
- anything pushed in the last --keep-days days
|
||||
|
||||
The actual disk reclaim happens on Gitea's next package GC cron (admin
|
||||
site settings). This script just marks the versions for deletion.
|
||||
|
||||
Usage:
|
||||
|
||||
python scripts/registry_gc.py \\
|
||||
--owner <user> \\
|
||||
--package <product>-docs-mcp \\
|
||||
--keep-days 90 \\
|
||||
--keep-latest 5
|
||||
|
||||
Auth: reads GITEA_TOKEN from env (set in the workflow as a secret).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from urllib.request import Request, urlopen
|
||||
from urllib.error import HTTPError
|
||||
import json
|
||||
|
||||
|
||||
GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
|
||||
|
||||
|
||||
def api(token: str, method: str, path: str) -> object:
|
||||
req = Request(f"{GITEA_HOST}{path}",
|
||||
headers={"Authorization": f"token {token}"},
|
||||
method=method)
|
||||
try:
|
||||
with urlopen(req, timeout=30) as r:
|
||||
body = r.read()
|
||||
return json.loads(body) if body else None
|
||||
except HTTPError as e:
|
||||
if e.code == 404:
|
||||
return None
|
||||
raise
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--owner", required=True)
|
||||
p.add_argument("--package", required=True)
|
||||
p.add_argument("--keep-days", type=int, default=90)
|
||||
p.add_argument("--keep-latest", type=int, default=5)
|
||||
p.add_argument("--dry-run", action="store_true")
|
||||
args = p.parse_args()
|
||||
|
||||
token = os.environ.get("GITEA_TOKEN")
|
||||
if not token:
|
||||
print("GITEA_TOKEN not set", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
versions = api(token, "GET",
|
||||
f"/api/v1/packages/{args.owner}/container/{args.package}/versions") or []
|
||||
if not versions:
|
||||
print(f"no versions found for {args.owner}/{args.package}")
|
||||
return 0
|
||||
|
||||
cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
|
||||
|
||||
# Date-tagged versions (YYYY.MM.DD), newest first
|
||||
date_tagged = []
|
||||
for v in versions:
|
||||
tags = v.get("tags") or []
|
||||
for t in tags:
|
||||
if len(t) == 10 and t[4] == "." and t[7] == ".":
|
||||
date_tagged.append((t, v))
|
||||
break
|
||||
date_tagged.sort(key=lambda kv: kv[0], reverse=True)
|
||||
keep_date_tags = {t for t, _ in date_tagged[:args.keep_latest]}
|
||||
|
||||
deleted = 0
|
||||
for v in versions:
|
||||
tags = v.get("tags") or []
|
||||
if "latest" in tags:
|
||||
continue
|
||||
if any(t in keep_date_tags for t in tags):
|
||||
continue
|
||||
try:
|
||||
created = datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
|
||||
except (KeyError, ValueError):
|
||||
continue
|
||||
if created >= cutoff:
|
||||
continue
|
||||
version_id = v.get("id")
|
||||
print(f" deleting v{version_id} tags={tags} created={v['created_at']}")
|
||||
if not args.dry_run:
|
||||
api(token, "DELETE",
|
||||
f"/api/v1/packages/{args.owner}/container/{args.package}/versions/{version_id}")
|
||||
deleted += 1
|
||||
print(f"done: {deleted} version(s) deleted")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
@@ -0,0 +1,251 @@
|
||||
"""Summarize usage logs from docs_mcp.usage into a quick scan.
|
||||
|
||||
Reads one or more usage.jsonl* files and prints sections for:
|
||||
|
||||
- per-tool call counts
|
||||
- top search_docs queries by frequency
|
||||
- 0-hit queries (where we returned nothing — high-signal for tuning)
|
||||
- filter usage histogram (which version / platform / bundle filters get hit)
|
||||
- reranker effectiveness (calls where the reranker fired vs not)
|
||||
- hybrid retrieval top-1 attribution (dense vs bm25 vs both)
|
||||
|
||||
Usage:
|
||||
|
||||
# Default: read /app/var/logs in the production container
|
||||
python scripts/usage_report.py --logs-dir /path/to/usage/logs
|
||||
|
||||
# Last N days only:
|
||||
python scripts/usage_report.py --logs-dir <dir> --since 7d
|
||||
|
||||
# Markdown output (for piping into a weekly digest email, etc):
|
||||
python scripts/usage_report.py --logs-dir <dir> --format markdown
|
||||
|
||||
The script doesn't depend on anything in the docs_mcp package — it's a
|
||||
standalone tool that can run anywhere with the log files available
|
||||
(scp them off the host, point it at the directory).
|
||||
|
||||
----------------------------------------------------------------------
|
||||
FOLLOW-UP CHECKS
|
||||
----------------------------------------------------------------------
|
||||
|
||||
Pattern: when you ship a retrieval change with a hypothesis attached
|
||||
(e.g. "hybrid will rescue queries dense misses"), add a note HERE
|
||||
describing what the usage report should show and at what threshold
|
||||
the change earns its keep. Future-you running the report a month
|
||||
later will be glad. Example:
|
||||
|
||||
Q: Does the dense leg of hybrid retrieval earn its keep on
|
||||
real traffic, or could we simplify to BM25-only?
|
||||
|
||||
- bm25_only >= 80%% --> dense not doing much; consider
|
||||
simplifying to BM25 mode
|
||||
- both >= 50%% --> hybrid is tie-breaking; keep it
|
||||
- dense_only > bm25_only --> dense is the workhorse; keep
|
||||
|
||||
Also worth a glance every month:
|
||||
|
||||
- 0-hit queries list (tuning candidates)
|
||||
- reranker p95 latency drift (slow reranker = bad UX)
|
||||
- filter usage (does anyone actually use version/platform
|
||||
filters? if not, simplify the tool surface)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any, Iterable
|
||||
|
||||
|
||||
def parse_since(s: str | None) -> datetime | None:
|
||||
"""Accept '7d', '24h', '30m', or an ISO timestamp. None → no cutoff."""
|
||||
if not s:
|
||||
return None
|
||||
m = re.fullmatch(r"(\d+)([dhm])", s)
|
||||
if m:
|
||||
n, unit = int(m.group(1)), m.group(2)
|
||||
delta = {"d": timedelta(days=n), "h": timedelta(hours=n), "m": timedelta(minutes=n)}[unit]
|
||||
return datetime.now(timezone.utc) - delta
|
||||
return datetime.fromisoformat(s.replace("Z", "+00:00"))
|
||||
|
||||
|
||||
def load_events(logs_dir: Path, since: datetime | None) -> Iterable[dict[str, Any]]:
|
||||
"""Yield every JSONL record across all files in logs_dir."""
|
||||
if not logs_dir.exists():
|
||||
print(f"warning: logs dir {logs_dir} does not exist", file=sys.stderr)
|
||||
return
|
||||
# usage.jsonl is the active file; usage.jsonl.YYYY-MM-DD are rotated.
|
||||
files = sorted(logs_dir.glob("usage.jsonl*"))
|
||||
for f in files:
|
||||
with open(f) as fh:
|
||||
for ln, line in enumerate(fh, start=1):
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
rec = json.loads(line)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f" ! skipping {f}:{ln}: {e}", file=sys.stderr)
|
||||
continue
|
||||
if since:
|
||||
ts = rec.get("ts", "")
|
||||
try:
|
||||
rec_ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
continue
|
||||
if rec_ts < since:
|
||||
continue
|
||||
yield rec
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--logs-dir", type=Path, default=Path("/app/var/logs"),
|
||||
help="directory with usage.jsonl* files")
|
||||
p.add_argument("--since", default=None,
|
||||
help="time window: '7d', '24h', '30m', or ISO timestamp")
|
||||
p.add_argument("--top", type=int, default=25,
|
||||
help="how many top queries / filters to show")
|
||||
p.add_argument("--format", choices=("text", "markdown"), default="text")
|
||||
args = p.parse_args()
|
||||
|
||||
since = parse_since(args.since)
|
||||
events = list(load_events(args.logs_dir, since))
|
||||
if not events:
|
||||
print("(no events in window)")
|
||||
return 0
|
||||
|
||||
print(f"# Usage report — {len(events)} events"
|
||||
+ (f" since {since.isoformat()}" if since else "")
|
||||
+ f" from {args.logs_dir}")
|
||||
print()
|
||||
|
||||
# 1. Per-tool counts
|
||||
by_tool = Counter(e["tool"] for e in events)
|
||||
print("## Per-tool call counts")
|
||||
print()
|
||||
if args.format == "markdown":
|
||||
print("| tool | calls |")
|
||||
print("|---|---|")
|
||||
for tool, n in by_tool.most_common():
|
||||
print(f"| `{tool}` | {n} |")
|
||||
else:
|
||||
for tool, n in by_tool.most_common():
|
||||
print(f" {tool:<25s} {n:>6d}")
|
||||
print()
|
||||
|
||||
# 2. Top search_docs queries
|
||||
search_events = [e for e in events if e["tool"] == "search_docs"]
|
||||
queries = Counter(e["args"].get("query", "") for e in search_events)
|
||||
print(f"## Top {args.top} search_docs queries (of {len(search_events)} searches)")
|
||||
print()
|
||||
if args.format == "markdown":
|
||||
print("| count | query |")
|
||||
print("|---|---|")
|
||||
for q, n in queries.most_common(args.top):
|
||||
print(f"| {n} | `{q}` |")
|
||||
else:
|
||||
for q, n in queries.most_common(args.top):
|
||||
print(f" {n:>5d} {q!r}")
|
||||
print()
|
||||
|
||||
# 3. 0-hit queries — the highest-signal data for tuning
|
||||
zero_hit = [e for e in search_events if e.get("hits_returned") == 0]
|
||||
zero_q = Counter(e["args"].get("query", "") for e in zero_hit)
|
||||
print(f"## 0-hit queries ({len(zero_hit)} of {len(search_events)} searches returned nothing)")
|
||||
print()
|
||||
if zero_q:
|
||||
if args.format == "markdown":
|
||||
print("| count | query | filters |")
|
||||
print("|---|---|---|")
|
||||
# Group by query, show filter examples for each
|
||||
examples_by_query: dict[str, list[dict]] = defaultdict(list)
|
||||
for e in zero_hit:
|
||||
examples_by_query[e["args"].get("query", "")].append(e["args"])
|
||||
for q, n in zero_q.most_common(args.top):
|
||||
ex = examples_by_query[q][0]
|
||||
f = {k: v for k, v in ex.items()
|
||||
if k in ("version", "platform", "bundle_id") and v}
|
||||
print(f"| {n} | `{q}` | `{f}` |")
|
||||
else:
|
||||
for q, n in zero_q.most_common(args.top):
|
||||
print(f" {n:>5d} {q!r}")
|
||||
else:
|
||||
print(" _(no 0-hit queries in window)_")
|
||||
print()
|
||||
|
||||
# 4. Filter usage
|
||||
filter_use = Counter()
|
||||
for e in search_events:
|
||||
a = e["args"]
|
||||
v = a.get("version")
|
||||
p_ = a.get("platform")
|
||||
b = a.get("bundle_id")
|
||||
if v:
|
||||
filter_use[f"version={v}"] += 1
|
||||
if p_:
|
||||
filter_use[f"platform={p_}"] += 1
|
||||
if b:
|
||||
filter_use[f"bundle_id={b}"] += 1
|
||||
if not (v or p_ or b):
|
||||
filter_use["(no filter)"] += 1
|
||||
print(f"## search_docs filter usage")
|
||||
print()
|
||||
if args.format == "markdown":
|
||||
print("| filter | count |")
|
||||
print("|---|---|")
|
||||
for f, n in filter_use.most_common(args.top):
|
||||
print(f"| `{f}` | {n} |")
|
||||
else:
|
||||
for f, n in filter_use.most_common(args.top):
|
||||
print(f" {n:>5d} {f}")
|
||||
print()
|
||||
|
||||
# 5. Reranker effectiveness
|
||||
reranked = [e for e in search_events if e.get("reranked") is True]
|
||||
dense_only = [e for e in search_events if e.get("reranked") is False]
|
||||
print(f"## Reranker activity")
|
||||
print()
|
||||
print(f" reranked: {len(reranked):>5d}")
|
||||
print(f" dense only: {len(dense_only):>5d} (filter too narrow or 0 results)")
|
||||
if reranked:
|
||||
elapsed = [e["elapsed_ms"] for e in reranked if e.get("elapsed_ms") is not None]
|
||||
if elapsed:
|
||||
elapsed.sort()
|
||||
p50 = elapsed[len(elapsed) // 2]
|
||||
p95 = elapsed[int(len(elapsed) * 0.95)]
|
||||
print(f" reranked latency p50: {p50:.0f} ms, p95: {p95:.0f} ms")
|
||||
print()
|
||||
|
||||
# 6. Hybrid retrieval activity — which retriever contributed the top-1?
|
||||
# Empty unless HYBRID_SEARCH=true is set on the MCP container.
|
||||
hybrid_events = [e for e in search_events if e.get("retrieval_mode") == "hybrid"]
|
||||
if hybrid_events:
|
||||
by_source = Counter(e.get("top1_source") for e in hybrid_events
|
||||
if e.get("top1_source"))
|
||||
print("## Hybrid retrieval — top-1 attribution")
|
||||
print()
|
||||
print(f" hybrid mode events: {len(hybrid_events)}")
|
||||
total = sum(by_source.values()) or 1
|
||||
for src in ("both", "dense_only", "bm25_only"):
|
||||
n = by_source.get(src, 0)
|
||||
pct = 100.0 * n / total
|
||||
label = {
|
||||
"both": "in BOTH retrievers' top-N",
|
||||
"dense_only": "dense found it, BM25 didn't",
|
||||
"bm25_only": "BM25 found it, dense didn't",
|
||||
}[src]
|
||||
print(f" {src:<11s} {n:>5d} ({pct:5.1f}%) — {label}")
|
||||
rescued = by_source.get("bm25_only", 0)
|
||||
if rescued and total:
|
||||
print(f"\n → {rescued} ({100.0 * rescued / total:.1f}%) of hybrid queries had the top-1 "
|
||||
"result that ONLY BM25 surfaced. Without hybrid those would have been dense-misses.")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
Reference in New Issue
Block a user