initial: docs-mcp-template — build guide + scaffolded server

Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

Contents:

- PLAN.md: comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist.
- CLAUDE.md: guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands.
- README.md: quick-start summary.

Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required):

  docs_mcp/server.py          FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set.
  docs_mcp/usage.py           TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is.
  rag/embeddings.py           Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable.
  rag/chunk.py                Paragraph-aware chunker with synthetic chunk 0. Per-product tunable.
  rag/index.py                Chroma + BM25 builder. --rebuild and --bm25-only flags.
  rag/bm25.py                 SQLite FTS5 lexical index. Reusable.
  scrape/changelog.py         --cached / --ref / --json / --history-out. Reusable.
  scrape/README.md            What you write per-product.
  eval/queries.jsonl.example  Curate ~25 hand-labeled queries here.
  eval/retrievers.py          Retriever protocol + stub classes.
  eval/run_eval.py            MRR / Recall@K / nDCG@K harness skeleton.
  scripts/usage_report.py     Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring.
  scripts/registry_gc.py      Gitea container registry cleanup. Reusable.

Deployment + CI:

  Dockerfile                       Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency.
  deploy/docker-compose.yml        MCP + reranker sidecar + Watchtower. Templated with <placeholders>.
  .gitea/workflows/refresh.yml     Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme.
  .gitea/workflows/image-only.yml  Code-only ship cycle, ~18min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:18:17 -04:00
commit 9ba615c8ee
26 changed files with 3280 additions and 0 deletions
@@ -0,0 +1,126 @@
+"""Markdown chunker — paragraph-aware, ~400-600 token target.
+
+Adjust the chunking strategy per product if your page format differs
+significantly from prose. The output shape (id, text, metadata) is
+fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
+change that.
+
+The key knob you'll tune per product is chunk 0. Dense retrieval lands
+on chunk 0 first for most queries. Make it a synthetic chunk built
+from:
+
+  - the page title (as natural-language H1)
+  - a 1-sentence task description (you'll have to generate this — for
+    pages that already have a "## Overview" or "## Introduction" the
+    first sentence usually works)
+  - a keyword bag of important terms (filenames, API names, error
+    codes — the rare technical tokens that BM25 lights up on)
+
+Without a rich chunk 0, dense retrieval gets dominated by the much
+larger prose body, and short pages (script examples, reference cards)
+get buried.
+"""
+from __future__ import annotations
+
+import re
+from typing import Iterator
+
+
+# Approximate token estimate from char count. Tunable — set per
+# embedder if the default 4 chars/token is wrong.
+CHARS_PER_TOKEN = 4
+TARGET_TOKENS = 500
+TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
+
+
+def estimate_tokens(text: str) -> int:
+    return max(1, len(text) // CHARS_PER_TOKEN)  # floor division, never below 1
+
+
+def split_paragraphs(md: str) -> list[str]:
+    """Split markdown into paragraph-ish blocks.
+
+    Keeps fenced code blocks together (don't slice through ```).
+    Headings start new paragraphs.
+    """
+    blocks: list[str] = []
+    current: list[str] = []
+    in_fence = False
+    for line in md.splitlines(keepends=True):
+        stripped = line.strip()
+        if stripped.startswith("```"):
+            in_fence = not in_fence
+            current.append(line)
+            continue
+        if in_fence:
+            current.append(line)
+            continue
+        if re.match(r"#{1,6}(?:\s|$)", stripped):  # ATX heading, not "#!shebang" or "#tag"
+            if current:
+                blocks.append("".join(current).strip())
+                current = []
+            current.append(line)
+            continue
+        if not stripped and current:
+            # A blank line closes the current block.
+            blocks.append("".join(current).strip())
+            current = []
+            continue
+        current.append(line)
+    if current:
+        blocks.append("".join(current).strip())
+    return [b for b in blocks if b]
+
+
+def chunks_from_page(
+    text: str,
+    page_id: str,
+    metadata: dict,
+) -> Iterator[dict]:
+    """Yield chunk dicts ready for index.py to upsert.
+
+    metadata must include "bundle_id" (used in chunk ids). Chunk 0 is the
+    per-product customization point: by default the title plus the first 300
+    chars of the first non-heading block; enrich it per the module docstring.
+    """
+    paragraphs = split_paragraphs(text)
+    if not paragraphs:
+        return
+
+    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
+    title = metadata.get("title") or page_id
+    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
+    chunk0_body = (
+        f"# {title}\n\n"
+        f"{first_para[:300]}"
+        # TODO per product: append a keyword bag here (filenames,
+        # API names, error codes) for BM25 + dense joint coverage.
+    )
+    yield {
+        "id":       f"{metadata['bundle_id']}::{page_id}::0",
+        "text":     chunk0_body,
+        "metadata": {**metadata, "ordinal": 0},
+    }
+
+    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
+    ordinal = 1
+    buf: list[str] = []
+    buf_chars = 0
+    for p in paragraphs:
+        if buf_chars + len(p) > TARGET_CHARS and buf:
+            yield {
+                "id":       f"{metadata['bundle_id']}::{page_id}::{ordinal}",
+                "text":     "\n\n".join(buf),
+                "metadata": {**metadata, "ordinal": ordinal},
+            }
+            ordinal += 1
+            buf = []
+            buf_chars = 0
+        buf.append(p)
+        buf_chars += len(p)
+    if buf:
+        yield {
+            "id":       f"{metadata['bundle_id']}::{page_id}::{ordinal}",
+            "text":     "\n\n".join(buf),
+            "metadata": {**metadata, "ordinal": ordinal},
+        }
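
The TODO in chunks_from_page leaves the keyword bag for chunk 0 to each product. One possible shape for it is sketched below, assuming the rare technical tokens can be found with a few regular expressions. The names keyword_bag and build_chunk0 and all three patterns are hypothetical, are not part of the template, and would need tuning against a real corpus.

```python
import re

# Hypothetical patterns (assumptions, not part of the template); tune per product.
_CODE_SPAN = re.compile(r"`([^`\n]+)`")  # inline code spans
_FILENAME = re.compile(r"\b[\w./-]+\.(?:py|json|ya?ml|toml|cfg|ini|sh|md)\b")
_ERROR_CODE = re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b|\bE\d{3,5}\b")


def keyword_bag(text: str, limit: int = 20) -> list[str]:
    """Collect technical tokens, deduplicated, in first-seen order per pattern."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for pattern in (_CODE_SPAN, _FILENAME, _ERROR_CODE):
        for match in pattern.finditer(text):
            # Use the capture group when the pattern has one, else the whole match.
            token = match.group(1) if pattern.groups else match.group(0)
            token = token.strip()
            if token and len(token) <= 60:
                seen.setdefault(token, None)
    return list(seen)[:limit]


def build_chunk0(title: str, summary: str, text: str) -> str:
    """Assemble a synthetic chunk 0: title, one-sentence summary, keyword bag."""
    parts = [f"# {title}", summary.strip()]
    bag = keyword_bag(text)
    if bag:
        parts.append("Keywords: " + ", ".join(bag))
    return "\n\n".join(p for p in parts if p)
```

In this sketch the summary is supplied by the caller, since the module docstring notes that it usually has to be generated per page. The string returned by build_chunk0 could stand in for chunk0_body in chunks_from_page.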