hvm-docs

Author	SHA1	Message	Date
justin	e9b37e86df	feat: implement Phases 9 / 11 / 12 / 13 — diff/lessons/inconsistencies/digest Eight new MCP tools on top of the Phase 3 baseline. Each one uses TimedCall so calls show up in usage.jsonl alongside search/get/list. Phase 9 — multi-version diff: * list_cluster(bundle_id, page_id) — cross-version peers from the synthesized topic_cluster (same GUID across 8.1.x versions). * diff_versions(bundle_id, page_id, against_bundle_id) — unified diff between two bundles; uses topic_cluster first, falls back to same page_id (which works because HVM GUIDs are stable cross-version). * bundle_changelog(new, old) — page-level adds/removes/churn summary, sorted by lines moved; uses _diff_churn helper. Phase 11 — curated knowledge: * hvm_api_lessons(topic?) — surfaces docs_mcp/api_lessons.md (manager sizing, upgrade ordering, plugin/worker version compat, backups setup, console keyboard, elevation, ops gotchas). topic= filters to matching H2 sections. Marked "call proactively for HVM scripting / integration / upgrade questions" in the docstring so the LLM uses it. Phase 12 — doc-bug workflow: * find_doc_inconsistencies(scope_query, ...) — read-only scan with two checks: cross_version_drift (line-diff vs cluster peers, in-band 10-60% of file = high confidence) and redirect_chain (short body that's mostly a "see [other page]" pointer). * submit_doc_bug(page_url, content, ...) — env-gated OFF (DOC_BUG_SUBMIT_ENABLED) AND requires DOC_BUG_API_URL. Refuses cleanly with a manual-fallback message when either is unset. Allowlist: support.hpe.com only. Mandatory operator-confirmation pattern in the docstring; loud "do not loop" warning. The actual HPE feedback endpoint hasn't been sniffed yet — when it is, set both env vars and verify the payload shape against the schema. Phase 13 — weekly digest: * _digest_history() reads corpus/.digest/history.jsonl (built by scrape.changelog --history-out in the CI refresh workflow). * weekly_digest(days, version?, platform?, ...) 
aggregates corpus-touching commits in the window. Post-filter totals so version / platform filters give honest "X page changes" numbers, not the pre-filter commit count. * corpus_status() reports image build time, latest upstream Published date, total bundles/pages/chunks, and the 5 most-recently-edited bundles. Tool count now: 11 registered (search_docs, get_page, list_versions, list_cluster, diff_versions, bundle_changelog, weekly_digest, corpus_status, hvm_api_lessons, find_doc_inconsistencies, submit_doc_bug). Verified end-to-end via MCP stdio tools/list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:58:19 -04:00
justin	761552fe69	fix(registry_gc): correct Gitea packages API + Cloudflare-friendly UA (#2)	2026-05-22 13:44:43 -04:00
hvm-docs-refresh	8743fff510	weekly refresh: 2026-05-22T17:39Z — 1161 content change(s) across 7 bundle(s) 1161 content change(s) across 7 bundle(s) 1161 sidecar metadata update(s) 7 new bundle(s) added Bundles with content changes: hvm_deployment_guide (NEW): 32 page(s) - GUID-0F55384D-5632-4CDC-AA39-A21C1C089AFA - GUID-28F18596-4902-4CD1-83F3-1411430C5534 - GUID-2DD9D39D-9031-4BB5-A4ED-A0179BEF5259 - GUID-34B1D00A-C42E-4691-8B4F-3B110E34FE7C - GUID-3DA92E9D-0635-427A-BA9D-5A7E475B55DB ... and 27 more hvm_release_notes_8_1_0 (NEW): 1 page(s) - sd00007497en_us hvm_release_notes_8_1_1 (NEW): 1 page(s) - sd00007609en_us hvm_release_notes_8_1_2 (NEW): 1 page(s) - sd00007734en_us hvm_user_manual_8_1_0 (NEW): 374 page(s) - GUID-008AF6CD-E219-4D76-B175-B763E5C397CE - GUID-02679208-A796-4A58-80AC-33DCF6A4899F - GUID-034C2E21-6B14-4AAD-A582-2638A4C7D04C - GUID-04410F73-1BA6-46D4-A7A4-E4706C5FD522 - GUID-0503F050-177F-4360-9B1A-49439AF366B8 ... and 369 more hvm_user_manual_8_1_1 (NEW): 376 page(s) - GUID-008AF6CD-E219-4D76-B175-B763E5C397CE - GUID-02679208-A796-4A58-80AC-33DCF6A4899F - GUID-034C2E21-6B14-4AAD-A582-2638A4C7D04C - GUID-04410F73-1BA6-46D4-A7A4-E4706C5FD522 - GUID-0503F050-177F-4360-9B1A-49439AF366B8 ... and 371 more hvm_user_manual_8_1_2 (NEW): 376 page(s) - GUID-008AF6CD-E219-4D76-B175-B763E5C397CE - GUID-02679208-A796-4A58-80AC-33DCF6A4899F - GUID-034C2E21-6B14-4AAD-A582-2638A4C7D04C - GUID-04410F73-1BA6-46D4-A7A4-E4706C5FD522 - GUID-0503F050-177F-4360-9B1A-49439AF366B8 ... and 371 more	2026-05-22 17:39:22 +00:00
justin	21500b1aaa	fix: stop ignoring corpus/ so refresh workflow can commit it (#1)	2026-05-22 13:38:23 -04:00
justin	6b11993688	ci: use zerto-docs's load-balanced Ollama GPU pool on the Gitea host Match the OLLAMA_URLS pattern from zerto-docs-rag so every docs MCP build fans out across the same two GPU-pinned Ollama containers on 192.168.0.2 (:11435 Titan X, :11436 1080 Ti). The host's primary Ollama on :11434 is left alone for OpenWebUI. rag.embeddings now reads OLLAMA_URLS (plural CSV) preferentially with fallback to OLLAMA_URL, defaulting to http://192.168.0.2:11434 — same shape as zerto's embeddings.py. The OllamaEmbeddings class already round-robins per batch, so both GPUs run in parallel during the chroma rebuild. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:22:59 -04:00
justin	fd376fab77	ci+deploy: target git.jpaul.io registry, PRODUCT_NAME=hvm Phase 4/5 — adapt the template workflows to Justin's self-hosted Gitea + act_runner setup (see reference_gitea_server memory): * PUSH via LAN endpoint 192.168.0.2:1234 (bypasses Cloudflare's 100 MB request-body cap on the Free plan); PULL via git.jpaul.io. * buildx with config-inline insecure-registry for the LAN endpoint — docker/login-action can't be used there (host daemon rejects HTTP). Auth is written into ~/.docker/config.json so buildx reads it directly. * docker/metadata-action labels org.opencontainers.image.source with the PUBLIC URL so Gitea auto-links the package; explicit POST to /api/v1/packages/.../-/link/{repo} as belt-and-suspenders (201 newly linked, 400 already linked, both treated as success). * deploy/docker-compose.yml: substitute <product> placeholders, point image at git.jpaul.io/justin/hvm-docs:latest, set HYBRID_SEARCH=false to match the eval winner (bm25+rerank), keep the llama.cpp + jina GGUF reranker sidecar as the production target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:07:15 -04:00
justin	dda044eb95	search: BM25-default + cross-encoder rerank, hybrid behind env gate Phase 3/6/7/8 in one pass since they depend on each other. * docs_mcp/server.py - Wire search_docs / get_page / list_versions tool bodies. - search_docs flow: BM25 first (rag.bm25 FTS5) → over-fetch RERANK_POOL chunks → POST to RERANK_URL/v1/rerank → return top-k. Dense is the fallback when BM25 finds nothing. HYBRID_SEARCH=true switches to dense+BM25+RRF (fused via the new _rrf_fuse helper). - All retrieval failures are caught and fall back to the next layer, so a dead reranker or missing BM25 db never blocks a search. - Source URLs built from the bundle's docId so results link straight into support.hpe.com. * eval/ - 22 hand-curated golden queries grounded in real corpus page titles. - DenseRetriever / BM25Retriever / HybridRetriever / RerankedRetriever + MRR/Recall@K/nDCG@K harness. RERANK_URL env activates the reranked variants. - Committed eval/results/baseline.md. On this corpus: dense: MRR 0.539 bm25: MRR 0.880 hybrid_rrf: MRR 0.692 bm25+rerank: MRR 0.920 (winner) hybrid_rrf+rerank: MRR 0.875 HPE structured docs use controlled vocabulary, so lexical match dominates. Hybrid loses because dense pollutes the fused pool. * scripts/rerank_server.py - Minimal HTTP /v1/rerank over sentence-transformers cross-encoder/ms-marco-MiniLM-L-6-v2. Cohere-style request/response. - This is the dev/CPU fallback; production replaces it with the llama.cpp + jina-reranker-v2-base GGUF sidecar (same wire protocol). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:06:51 -04:00
justin	dd691b0111	rag: cap chunk size at 6KB to fit nomic-embed-text 2048-tok context The chunker emits any single paragraph as a stand-alone chunk regardless of size. One HVM page had a 14,858-char paragraph (a big config table) — nomic-embed-text 400'd the entire embed batch because the model's context is 2048 tokens. Added a hard-split fallback that splits any oversized chunk on line boundaries to MAX_CHARS=6000 (~1500 tokens, headroom). Also defaulted PRODUCT_NAME to "hvm" in rag/index.py to match server.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:06:35 -04:00
justin	7a491ba9e4	scrape: HVM bundles + runner for HPE Support DocPortal Phase 1: scrape User Manual (8.1.0/.1/.2), Release Notes (8.1.0/.1/.2), and the unversioned Deployment Guide. Total ~1,160 pages, 9.7 MB markdown. Discovers via the anonymous JSON API at /hpesc/public/api/document/{docId}: /toc walks the page tree (for TOC-paginated docs), /render?page=GUID fetches per-page HTML, /document/{docId} returns the whole body for single-doc shapes like Release Notes. Runner converts DITA-source HTML to clean markdown (strips Notices/ Acknowledgments/Abstract boilerplate), writes corpus/<bundle>/<page>.{md,json}, then a finalize pass synthesizes topic_cluster.clustered_topics by GUID overlap across versions (HPE GUIDs are stable cross-version — confirmed 374/376/376 with 100% overlap on shared pages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:06:26 -04:00
justin	43728320bf	ci: default PRODUCT_NAME to repo name (caught by template dispatch test) First dispatch on the empty template failed at Chroma collection creation because PRODUCT_NAME was the literal string "<product>" (YAML doesn't expand placeholders), and Chroma rejects collection names containing characters outside [a-zA-Z0-9._-]: chromadb.errors.InvalidArgumentError: Validation error: name: Expected a name containing 3-512 characters from [a-zA-Z0-9._-], starting and ending with a character in [a-zA-Z0-9]. Got: <product>_docs Same fix as the IMAGE env: derive from the repo name dynamically via ${{ github.event.repository.name }}. Cloners can still override explicitly, but a fresh clone now runs the index-rebuild step cleanly out of the box. Verified by re-dispatch — should fail next at docker login (placeholder REGISTRY_PUSH hostname), which is the next-expected fail point and a real per-deployment config the cloner has to fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:37:07 -04:00
justin	33b0fd652e	ci: derive image name + package linking from repo, add link step Both workflows had a static IMAGE env (<owner>/<product>-docs-mcp) and a static --package arg in the GC step. Switch both to Gitea Actions context variables so a clone of the template into any repo name works on the first CI run without find/replace: IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }} --owner ${{ github.repository_owner }} --package ${{ github.event.repository.name }} Also add the "Link container package to this repo" step that was missing from the template (and which, naively copy-pasted from the reference build, would have linked everything back to docs-mcp-template). The new step derives owner + package + link-target all from the running repo's context. The github.* namespace is Gitea Actions' inherited GitHub-Actions context — values come from the Gitea server, not github.com. Same mechanism the reference build's $GITHUB_SHA tag-builder uses. CLAUDE.md updated to note that image and package naming are repo-derived; only registry endpoints and the Ollama URL need per-clone editing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:34:26 -04:00
justin	9ba615c8ee	initial: docs-mcp-template — build guide + scaffolded server Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out. Contents: - PLAN.md — comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist. - CLAUDE.md — guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands. - README.md — quick-start summary. Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required): docs_mcp/server.py FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set. docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is. rag/embeddings.py Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable. rag/chunk.py Paragraph-aware chunker with synthetic chunk 0. Per-product tunable. rag/index.py Chroma + BM25 builder. --rebuild and --bm25-only flags. rag/bm25.py SQLite FTS5 lexical index. Reusable. scrape/changelog.py --cached / --ref / --json / --history-out. Reusable. scrape/README.md What you write per-product. eval/queries.jsonl.example Curate ~25 hand-labeled queries here. eval/retrievers.py Retriever protocol + stub classes. eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton. scripts/usage_report.py Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring. scripts/registry_gc.py Gitea container registry cleanup. Reusable. 
Deployment + CI: Dockerfile Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency. deploy/docker-compose.yml MCP + reranker sidecar + Watchtower. Templated with <placeholders>. .gitea/workflows/refresh.yml Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme. .gitea/workflows/image-only.yml Code-only ship cycle, ~18min. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00
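
Commit dda044eb95 says that HYBRID_SEARCH=true fuses the dense and BM25 result lists with RRF through a `_rrf_fuse` helper. The helper itself is not shown in this log, so the following is only a minimal sketch of reciprocal rank fusion, not the repository's code. The function name `rrf_fuse`, the list-of-chunk-IDs input shape, and the constant `k=60` (a common default for RRF) are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Hypothetical helper: each ranking is a list of IDs, best first.
    An ID's fused score is the sum of 1 / (k + rank) over every list
    it appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    dense = ["a", "b", "c"]  # illustrative dense-retriever order
    bm25 = ["b", "d", "a"]   # illustrative BM25 order
    print(rrf_fuse([dense, bm25]))
```

Because every list contributes to the fused score, a chunk that only the dense retriever ranks highly can still displace a strong lexical hit, which is consistent with the commit's remark that dense results pollute the fused pool on this corpus.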

12 Commits
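
Commit dd691b0111 describes a hard-split fallback that breaks any oversized chunk on line boundaries so that no piece exceeds MAX_CHARS=6000. The actual code in rag/chunk.py is not shown in this log; the sketch below is one way such a fallback could look. The function name `hard_split` and the character-level slicing of a single over-long line are assumptions, not the author's implementation.

```python
MAX_CHARS = 6000  # value taken from the commit message


def hard_split(chunk, max_chars=MAX_CHARS):
    """Split an oversized chunk on line boundaries (hypothetical helper).

    Chunks at or under the cap are returned unchanged. A single line
    longer than the cap is sliced by characters as a last resort
    (an assumption; the commit only mentions line boundaries).
    """
    if len(chunk) <= max_chars:
        return [chunk]

    pieces, current, size = [], [], 0
    for line in chunk.splitlines(keepends=True):
        while len(line) > max_chars:
            if current:  # flush what has accumulated so far
                pieces.append("".join(current))
                current, size = [], 0
            pieces.append(line[:max_chars])
            line = line[max_chars:]
        if current and size + len(line) > max_chars:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        pieces.append("".join(current))
    return pieces


if __name__ == "__main__":
    table = "row\n" * 3715  # 14,860 chars, close to the paragraph in the commit
    parts = hard_split(table)
    print(len(parts), [len(p) for p in parts])
```

Joining the pieces reproduces the input exactly, so no text is lost. Each piece stays at or below the cap, which the commit estimates at roughly 1500 tokens against the embedding model's 2048-token context.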