
justin e9b37e86df feat: implement Phases 9 / 11 / 12 / 13 — diff/lessons/inconsistencies/digest

Eight new MCP tools on top of the Phase 3 baseline. Each one uses
TimedCall so calls show up in usage.jsonl alongside search/get/list.
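
A minimal sketch of what a TimedCall-style wrapper could look like, assuming it is a context manager that appends one JSON line per tool call. The field names (`ts`, `tool`, `ms`, `ok`) and the constructor signature are hypothetical; the real docs_mcp/usage.py may differ.

```python
import json
import time
from pathlib import Path


class TimedCall:
    """Hypothetical sketch: time a tool call and append one JSON line to a log."""

    def __init__(self, tool: str, log_path: Path, **fields):
        self.tool = tool
        self.log_path = Path(log_path)
        self.fields = fields  # extra metadata, e.g. the query string

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        record = {
            "ts": time.time(),
            "tool": self.tool,
            "ms": round((time.perf_counter() - self._start) * 1000, 3),
            "ok": exc_type is None,  # False when the tool body raised
            **self.fields,
        }
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
        with self.log_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return False  # never swallow the tool's own exception
```

Wrapping every tool body in `with TimedCall("diff_versions", log_path): ...` would put new and baseline tools in the same log, which is the behavior the commit describes.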

Phase 9 — multi-version diff:
  * list_cluster(bundle_id, page_id) — cross-version peers from the
    synthesized topic_cluster (same GUID across 8.1.x versions).
  * diff_versions(bundle_id, page_id, against_bundle_id) — unified
    diff between two bundles; uses topic_cluster first, falls back to
    same page_id (which works because HVM GUIDs are stable cross-version).
  * bundle_changelog(new, old) — page-level adds/removes/churn summary,
    sorted by lines moved; uses _diff_churn helper.
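
The Phase 9 tools can be illustrated with the standard-library `difflib`. This is a sketch under assumptions: the function names, the `{page_id: text}` input shape, and the definition of churn as "lines added plus lines removed" are illustrative and not taken from the repository's `_diff_churn` helper.

```python
import difflib


def unified_page_diff(old_text: str, new_text: str,
                      old_label: str, new_label: str) -> str:
    """Unified diff between two versions of the same page."""
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))


def diff_churn(old_text: str, new_text: str) -> int:
    """Assumed churn metric: lines added plus lines removed."""
    delta = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return sum(1 for line in delta if line.startswith(("+ ", "- ")))


def changelog(old_pages: dict[str, str], new_pages: dict[str, str]) -> dict:
    """Page-level adds/removes/changes, changed pages sorted by churn."""
    added = sorted(new_pages.keys() - old_pages.keys())
    removed = sorted(old_pages.keys() - new_pages.keys())
    changed = [
        (pid, diff_churn(old_pages[pid], new_pages[pid]))
        for pid in old_pages.keys() & new_pages.keys()
        if old_pages[pid] != new_pages[pid]
    ]
    changed.sort(key=lambda item: (-item[1], item[0]))  # most churn first
    return {"added": added, "removed": removed, "changed": changed}
```

Matching pages by a shared key is what makes the same-page_id fallback workable: if GUIDs are stable across versions, the key intersection already pairs up the right pages.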

Phase 11 — curated knowledge:
  * hvm_api_lessons(topic?) — surfaces docs_mcp/api_lessons.md (manager
    sizing, upgrade ordering, plugin/worker version compat, backups
    setup, console keyboard, elevation, ops gotchas). topic= filters to
    matching H2 sections. Marked "call proactively for HVM scripting /
    integration / upgrade questions" in the docstring so the LLM uses it.
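
One way the topic= filter over H2 sections might work is sketched below. It assumes the lessons file uses `## ` headings, that matching is a case-insensitive substring test on the heading only, and that no fenced code in the file starts a line with `## `. The function names are hypothetical.

```python
import re
from typing import Optional

H2_RE = re.compile(r"^##\s+(.*\S)\s*$")  # matches "## Title", not "### Title"


def split_h2_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first H2 is dropped."""
    sections: list[tuple[str, str]] = []
    title, body = None, []
    for line in markdown.splitlines():
        match = H2_RE.match(line)
        if match:
            if title is not None:
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(1), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body).strip()))
    return sections


def filter_lessons(markdown: str, topic: Optional[str] = None) -> str:
    """Return all H2 sections, or only those whose heading contains topic."""
    sections = split_h2_sections(markdown)
    if topic:
        needle = topic.casefold()
        sections = [(t, b) for t, b in sections if needle in t.casefold()]
    return "\n\n".join(f"## {t}\n{b}" for t, b in sections)
```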

Phase 12 — doc-bug workflow:
  * find_doc_inconsistencies(scope_query, ...) — read-only scan with two
    checks: cross_version_drift (line-diff vs cluster peers, in-band
    10-60% of file = high confidence) and redirect_chain (short body
    that's mostly a "see [other page]" pointer).
  * submit_doc_bug(page_url, content, ...) — env-gated OFF
    (DOC_BUG_SUBMIT_ENABLED) AND requires DOC_BUG_API_URL. Refuses
    cleanly with a manual-fallback message when either is unset.
    Allowlist: support.hpe.com only. Mandatory operator-confirmation
    pattern in the docstring; loud "do not loop" warning. The actual
    HPE feedback endpoint hasn't been sniffed yet — when it is, set
    both env vars and verify the payload shape against the schema.
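
The two read-only checks and the submit gate described above can be sketched as follows. Only the 10-60% band, the two environment variable names, and the support.hpe.com allowlist come from the commit message. Everything else is an assumption: the drift ratio definition, the word limit for a "short body", the accepted truthy values, and all function names.

```python
import difflib
import os
import re
from typing import Optional
from urllib.parse import urlparse

DRIFT_BAND = (0.10, 0.60)           # from the commit message: 10-60% of file
ALLOWED_HOSTS = {"support.hpe.com"}  # from the commit message
LINK_RE = re.compile(r"\[[^\]]+\]\([^)]+\)")  # markdown-style [text](url)


def drift_ratio(page: str, peer: str) -> float:
    """Assumed metric: (lines added + removed) / line count of the page."""
    a, b = page.splitlines(), peer.splitlines()
    changed = sum(1 for line in difflib.ndiff(a, b)
                  if line.startswith(("+ ", "- ")))
    return changed / max(len(a), 1)


def drift_confidence(page: str, peer: str) -> str:
    """'high' inside the band, 'low' outside it, 'none' when identical."""
    ratio = drift_ratio(page, peer)
    if ratio == 0:
        return "none"
    low, high = DRIFT_BAND
    return "high" if low <= ratio <= high else "low"


def is_redirect_stub(body: str, max_words: int = 40) -> bool:
    """Short body containing 'see' and a link (max_words is a guess)."""
    short = len(body.split()) <= max_words
    return (short and re.search(r"\bsee\b", body, re.I) is not None
            and LINK_RE.search(body) is not None)


def submit_refusal_reason(page_url: str, env=os.environ) -> Optional[str]:
    """Return why a submission must be refused, or None if it may proceed."""
    if env.get("DOC_BUG_SUBMIT_ENABLED", "").lower() not in {"1", "true", "yes"}:
        return "submission is disabled; file the bug manually"
    if not env.get("DOC_BUG_API_URL"):
        return "DOC_BUG_API_URL is not set; file the bug manually"
    if urlparse(page_url).hostname not in ALLOWED_HOSTS:
        return "page_url host is not on the allowlist"
    return None
```

Returning a reason string instead of raising keeps the refusal path cheap to surface as a tool result, which fits the "refuses cleanly" wording.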

Phase 13 — weekly digest:
  * _digest_history() reads corpus/.digest/history.jsonl (built by
    scrape.changelog --history-out in the CI refresh workflow).
  * weekly_digest(days, version?, platform?, ...) aggregates
    corpus-touching commits in the window. Totals are computed after
    filtering, so version / platform filters report the number of
    matching page changes rather than the pre-filter commit count.
  * corpus_status() reports image build time, latest upstream Published
    date, total bundles/pages/chunks, and the 5 most-recently-edited
    bundles.
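
The post-filter counting in weekly_digest can be sketched like this. The history.jsonl record shape used here (`ts` as an ISO 8601 string with an explicit offset, plus a `changes` list carrying `version`, `platform`, and `page_id`) is hypothetical; the file written by scrape.changelog --history-out may look different.

```python
import json
from datetime import datetime, timedelta
from typing import Iterable, Optional


def weekly_digest(lines: Iterable[str], days: int, now: datetime,
                  version: Optional[str] = None,
                  platform: Optional[str] = None) -> dict:
    """Count commits and page changes that survive the filters."""
    cutoff = now - timedelta(days=days)
    commits = 0
    page_changes = 0
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank lines in the JSONL file
        entry = json.loads(raw)
        if datetime.fromisoformat(entry["ts"]) < cutoff:
            continue  # outside the requested window
        kept = [
            change for change in entry["changes"]
            if (version is None or change["version"] == version)
            and (platform is None or change["platform"] == platform)
        ]
        if kept:
            # Totals are taken after filtering, so a commit that only
            # touched other versions or platforms contributes nothing.
            commits += 1
            page_changes += len(kept)
    return {"commits": commits, "page_changes": page_changes}
```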

Tool count now: 11 registered (search_docs, get_page, list_versions,
list_cluster, diff_versions, bundle_changelog, weekly_digest,
corpus_status, hvm_api_lessons, find_doc_inconsistencies, submit_doc_bug).
Verified end-to-end via MCP stdio tools/list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 13:58:19 -04:00

Path | Last commit | Date
.gitea/workflows | ci: use zerto-docs's load-balanced Ollama GPU pool on the Gitea host | 2026-05-22 13:22:59 -04:00
corpus | weekly refresh: 2026-05-22T17:39Z — 1161 content change(s) across 7 bundle(s) | 2026-05-22 17:39:22 +00:00
deploy | ci+deploy: target git.jpaul.io registry, PRODUCT_NAME=hvm | 2026-05-22 13:07:15 -04:00
docs_mcp | feat: implement Phases 9 / 11 / 12 / 13 — diff/lessons/inconsistencies/digest | 2026-05-22 13:58:19 -04:00
eval | search: BM25-default + cross-encoder rerank, hybrid behind env gate | 2026-05-22 13:06:51 -04:00
rag | ci: use zerto-docs's load-balanced Ollama GPU pool on the Gitea host | 2026-05-22 13:22:59 -04:00
scrape | scrape: HVM bundles + runner for HPE Support DocPortal | 2026-05-22 13:06:26 -04:00
scripts | fix(registry_gc): correct Gitea packages API + Cloudflare-friendly UA (#2) | 2026-05-22 13:44:43 -04:00
.gitignore | fix: stop ignoring corpus/ so refresh workflow can commit it (#1) | 2026-05-22 13:38:23 -04:00
bundles.json | scrape: HVM bundles + runner for HPE Support DocPortal | 2026-05-22 13:06:26 -04:00
CLAUDE.md | ci: derive image name + package linking from repo, add link step | 2026-05-22 09:34:26 -04:00
Dockerfile | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00
PLAN.md | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00
README.md | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00
requirements-rerank.txt | scrape: HVM bundles + runner for HPE Support DocPortal | 2026-05-22 13:06:26 -04:00
requirements.txt | scrape: HVM bundles + runner for HPE Support DocPortal | 2026-05-22 13:06:26 -04:00

README.md

docs-mcp-template

A reusable template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, find inconsistencies, and (optionally) submit doc bugs back upstream.

What's here

  * PLAN.md: comprehensive build guide. Phased approach (13 phases, ~2–3 weeks of focused work for the full stack). Includes the design decisions, the gotchas, and a per-product customization checklist.
  * Scaffolded skeleton: working FastMCP server with stub tools, Dockerfile, docker-compose, CI workflows, eval harness layout, usage logging. Everything you need to git clone and start filling in the product-specific bits.

Quick start

git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin  # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.

# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio

Repo layout

.
├── PLAN.md                        # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│   ├── refresh.yml                # Weekly scrape + index + image push
│   └── image-only.yml             # On-demand code-only ship
├── scrape/
│   ├── README.md                  # Product-specific scraper goes here
│   └── changelog.py               # Reusable: --json, --history-out
├── rag/
│   ├── embeddings.py              # Ollama embedder, swappable
│   ├── chunk.py                   # Chunker — adjust per page format
│   ├── index.py                   # Builds Chroma + (optionally) BM25
│   └── bm25.py                    # SQLite FTS5 lexical index
├── docs_mcp/
│   ├── server.py                  # FastMCP server with stub tools
│   └── usage.py                   # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl.example      # Curate ~25 hand-labeled queries
│   ├── retrievers.py              # Retriever protocol + implementations
│   └── run_eval.py                # MRR / Recall@k / nDCG@k harness
├── scripts/
│   ├── usage_report.py            # Standalone log analyzer
│   └── registry_gc.py             # Container registry cleanup
└── deploy/
    └── docker-compose.yml         # Hosting stack: MCP + reranker + Watchtower
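
The three metrics named for eval/run_eval.py in the layout above have standard definitions. The sketch below assumes binary relevance (a page is either labeled relevant for a query or not) and illustrative function names; it is not the template's harness code.

```python
import math


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set that appears in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain DCG over the top k, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: mean of reciprocal_rank over (ranked, relevant) query runs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```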

What's product-specific (must implement)

  * scrape/: the scraper itself. The template gives you the corpus layout contract and a working changelog.py; the actual extraction logic is yours.
  * The corpus on disk (gitignored; rebuilt by CI).
  * The reranker GGUF model and llama.cpp container (commented in deploy/docker-compose.yml).
  * The reverse proxy / TLS layer in front of the public endpoint.
  * The hand-curated knowledge surface (your product's API gotchas, example scripts, anything the LLM should know that the docs don't say).

What's NOT product-specific (works as-is)

  * FastMCP server skeleton + tool decoration pattern
  * Chroma + Ollama embedding pipeline
  * BM25 / SQLite FTS5 lexical index
  * Hybrid retrieval (RRF) + reranker integration
  * Eval harness (Retriever protocol, MRR/Recall/nDCG)
  * Usage logging (TimedCall, JSONL, daily rotation)
  * CI workflow shape (weekly + on-demand, retry-on-race, three-tag image scheme)
  * Registry GC script
  * Standard tools: search_docs, get_page, list_versions, diff_versions, bundle_changelog, weekly_digest, find_doc_inconsistencies, submit_doc_bug, etc.
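
For readers unfamiliar with the hybrid retrieval item in the list above, reciprocal rank fusion (RRF) merges ranked lists by summing 1 / (k + rank) for each document across lists. The sketch below is a generic version, not the template's code; the function name is illustrative, and k = 60 is the constant conventionally used for RRF.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # e.g. from the lexical index
dense_hits = ["b", "c", "a"]   # e.g. from the embedding index
fused = rrf_fuse([bm25_hits, dense_hits])  # ["b", "a", "c"]
```

Because RRF only uses ranks, it needs no score normalization between the lexical and embedding retrievers, which is why it pairs well with a reranker applied to the fused top results.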

License

Internal template. Adjust before publishing.
