hvm-docs

Author	SHA1	Message	Date
justin	c33682c968	eval: new baseline on the 4-endpoint embed pool index. 22 queries against the prod image index, rebuilt today on the expanded GPU pool with the resilient embedder (PR #8): dense MRR 0.539→0.557, bm25+rerank 0.920→0.959, hybrid_rrf+rerank 0.875→0.960 vs the 2026-05-22 baseline. No regression from mixed-provenance embeddings. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 20:37:41 -04:00
justin	41d431670f	remove submit_doc_bug tool	2026-05-24 07:44:39 -04:00
justin	e07df7a1ae	fix(chunker): MAX_CHARS 6000 → 4000 for table-dense content (#6)	2026-05-22 15:11:23 -04:00
hvm-docs-refresh	faad4767c6	weekly refresh: 2026-05-22T19:08Z — 0 content change(s) across 0 bundle(s), 0 sidecar metadata update(s)	2026-05-22 19:08:42 +00:00
justin	6e938c05c4	feat: Qualification Matrix + QuickSpecs bundles (#5)	2026-05-22 15:05:13 -04:00
justin	ab1de47475	docs: replace template README with HVM-specific content (#4)	2026-05-22 14:04:15 -04:00
justin	79d3455de5	feat: Phases 9/11/12/13 — diff / lessons / inconsistencies / digest (#3)	2026-05-22 13:58:21 -04:00
justin	761552fe69	fix(registry_gc): correct Gitea packages API + Cloudflare-friendly UA (#2)	2026-05-22 13:44:43 -04:00
hvm-docs-refresh	8743fff510	weekly refresh: 2026-05-22T17:39Z — 1161 content change(s) across 7 bundle(s), 1161 sidecar metadata update(s), 7 new bundle(s) added. Bundles with content changes: hvm_deployment_guide (NEW), 32 page(s): GUID-0F55384D-5632-4CDC-AA39-A21C1C089AFA, GUID-28F18596-4902-4CD1-83F3-1411430C5534, GUID-2DD9D39D-9031-4BB5-A4ED-A0179BEF5259, GUID-34B1D00A-C42E-4691-8B4F-3B110E34FE7C, GUID-3DA92E9D-0635-427A-BA9D-5A7E475B55DB, ... and 27 more; hvm_release_notes_8_1_0 (NEW), 1 page(s): sd00007497en_us; hvm_release_notes_8_1_1 (NEW), 1 page(s): sd00007609en_us; hvm_release_notes_8_1_2 (NEW), 1 page(s): sd00007734en_us; hvm_user_manual_8_1_0 (NEW), 374 page(s): GUID-008AF6CD-E219-4D76-B175-B763E5C397CE, GUID-02679208-A796-4A58-80AC-33DCF6A4899F, GUID-034C2E21-6B14-4AAD-A582-2638A4C7D04C, GUID-04410F73-1BA6-46D4-A7A4-E4706C5FD522, GUID-0503F050-177F-4360-9B1A-49439AF366B8, ... and 369 more; hvm_user_manual_8_1_1 (NEW), 376 page(s): GUID-008AF6CD-E219-4D76-B175-B763E5C397CE, GUID-02679208-A796-4A58-80AC-33DCF6A4899F, GUID-034C2E21-6B14-4AAD-A582-2638A4C7D04C, GUID-04410F73-1BA6-46D4-A7A4-E4706C5FD522, GUID-0503F050-177F-4360-9B1A-49439AF366B8, ... and 371 more; hvm_user_manual_8_1_2 (NEW), 376 page(s): GUID-008AF6CD-E219-4D76-B175-B763E5C397CE, GUID-02679208-A796-4A58-80AC-33DCF6A4899F, GUID-034C2E21-6B14-4AAD-A582-2638A4C7D04C, GUID-04410F73-1BA6-46D4-A7A4-E4706C5FD522, GUID-0503F050-177F-4360-9B1A-49439AF366B8, ... and 371 more	2026-05-22 17:39:22 +00:00
justin	21500b1aaa	fix: stop ignoring corpus/ so refresh workflow can commit it (#1)	2026-05-22 13:38:23 -04:00
justin	6b11993688	ci: use zerto-docs's load-balanced Ollama GPU pool on the Gitea host. Match the OLLAMA_URLS pattern from zerto-docs-rag so every docs MCP build fans out across the same two GPU-pinned Ollama containers on 192.168.0.2 (:11435 Titan X, :11436 1080 Ti). The host's primary Ollama on :11434 is left alone for OpenWebUI. rag.embeddings now reads OLLAMA_URLS (plural CSV) preferentially with fallback to OLLAMA_URL, defaulting to http://192.168.0.2:11434 — same shape as zerto's embeddings.py. The OllamaEmbeddings class already round-robins per batch, so both GPUs run in parallel during the chroma rebuild. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:22:59 -04:00
justin	fd376fab77	ci+deploy: target git.jpaul.io registry, PRODUCT_NAME=hvm. Phase 4/5 — adapt the template workflows to Justin's self-hosted Gitea + act_runner setup (see reference_gitea_server memory): * PUSH via LAN endpoint 192.168.0.2:1234 (bypasses Cloudflare's 100 MB request-body cap on the Free plan); PULL via git.jpaul.io. * buildx with config-inline insecure-registry for the LAN endpoint — docker/login-action can't be used there (host daemon rejects HTTP). Auth is written into ~/.docker/config.json so buildx reads it directly. * docker/metadata-action labels org.opencontainers.image.source with the PUBLIC URL so Gitea auto-links the package; explicit POST to /api/v1/packages/.../-/link/{repo} as belt-and-suspenders (201 newly linked, 400 already linked, both treated as success). * deploy/docker-compose.yml: substitute <product> placeholders, point image at git.jpaul.io/justin/hvm-docs:latest, set HYBRID_SEARCH=false to match the eval winner (bm25+rerank), keep the llama.cpp + jina GGUF reranker sidecar as the production target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:07:15 -04:00
justin	dda044eb95	search: BM25-default + cross-encoder rerank, hybrid behind env gate. Phase 3/6/7/8 in one pass since they depend on each other. * docs_mcp/server.py: - Wire search_docs / get_page / list_versions tool bodies. - search_docs flow: BM25 first (rag.bm25 FTS5) → over-fetch RERANK_POOL chunks → POST to RERANK_URL/v1/rerank → return top-k. Dense is the fallback when BM25 finds nothing. HYBRID_SEARCH=true switches to dense+BM25+RRF (fused via the new _rrf_fuse helper). - All retrieval failures are caught and fall back to the next layer, so a dead reranker or missing BM25 db never blocks a search. - Source URLs built from the bundle's docId so results link straight into support.hpe.com. * eval/: - 22 hand-curated golden queries grounded in real corpus page titles. - DenseRetriever / BM25Retriever / HybridRetriever / RerankedRetriever + MRR/Recall@K/nDCG@K harness. RERANK_URL env activates the reranked variants. - Committed eval/results/baseline.md. On this corpus: dense: MRR 0.539; bm25: MRR 0.880; hybrid_rrf: MRR 0.692; bm25+rerank: MRR 0.920 (winner); hybrid_rrf+rerank: MRR 0.875. HPE structured docs use controlled vocabulary, so lexical match dominates. Hybrid loses because dense pollutes the fused pool. * scripts/rerank_server.py: - Minimal HTTP /v1/rerank over sentence-transformers cross-encoder/ms-marco-MiniLM-L-6-v2. Cohere-style request/response. - This is the dev/CPU fallback; production replaces it with the llama.cpp + jina-reranker-v2-base GGUF sidecar (same wire protocol). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:06:51 -04:00
justin	dd691b0111	rag: cap chunk size at 6KB to fit nomic-embed-text 2048-tok context. The chunker emits any single paragraph as a stand-alone chunk regardless of size. One HVM page had a 14,858-char paragraph (a big config table) — nomic-embed-text 400'd the entire embed batch because the model's context is 2048 tokens. Added a hard-split fallback that splits any oversized chunk on line boundaries to MAX_CHARS=6000 (~1500 tokens, headroom). Also defaulted PRODUCT_NAME to "hvm" in rag/index.py to match server.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:06:35 -04:00
justin	7a491ba9e4	scrape: HVM bundles + runner for HPE Support DocPortal. Phase 1: scrape User Manual (8.1.0/.1/.2), Release Notes (8.1.0/.1/.2), and the unversioned Deployment Guide. Total ~1,160 pages, 9.7 MB markdown. Discovers via the anonymous JSON API at /hpesc/public/api/document/{docId}: /toc walks the page tree (for TOC-paginated docs), /render?page=GUID fetches per-page HTML, /document/{docId} returns the whole body for single-doc shapes like Release Notes. Runner converts DITA-source HTML to clean markdown (strips Notices/Acknowledgments/Abstract boilerplate), writes corpus/<bundle>/<page>.{md,json}, then a finalize pass synthesizes topic_cluster.clustered_topics by GUID overlap across versions (HPE GUIDs are stable cross-version — confirmed 374/376/376 with 100% overlap on shared pages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 13:06:26 -04:00
justin	43728320bf	ci: default PRODUCT_NAME to repo name (caught by template dispatch test). First dispatch on the empty template failed at Chroma collection creation because PRODUCT_NAME was the literal string "<product>" (YAML doesn't expand placeholders), and Chroma rejects collection names containing characters outside [a-zA-Z0-9._-]: chromadb.errors.InvalidArgumentError: Validation error: name: Expected a name containing 3-512 characters from [a-zA-Z0-9._-], starting and ending with a character in [a-zA-Z0-9]. Got: <product>_docs. Same fix as the IMAGE env: derive from the repo name dynamically via ${{ github.event.repository.name }}. Cloners can still override explicitly, but a fresh clone now runs the index-rebuild step cleanly out of the box. Verified by re-dispatch — should fail next at docker login (placeholder REGISTRY_PUSH hostname), which is the next-expected fail point and a real per-deployment config the cloner has to fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:37:07 -04:00
justin	33b0fd652e	ci: derive image name + package linking from repo, add link step. Both workflows had a static IMAGE env (<owner>/<product>-docs-mcp) and a static --package arg in the GC step. Switch both to Gitea Actions context variables so a clone of the template into any repo name works on the first CI run without find/replace: IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}; --owner ${{ github.repository_owner }} --package ${{ github.event.repository.name }}. Also add the "Link container package to this repo" step that was missing from the template (and which, naively copy-pasted from the reference build, would have linked everything back to docs-mcp-template). The new step derives owner + package + link-target all from the running repo's context. The github.* namespace is Gitea Actions' inherited GitHub-Actions context — values come from the Gitea server, not github.com. Same mechanism the reference build's $GITHUB_SHA tag-builder uses. CLAUDE.md updated to note that image and package naming are repo-derived; only registry endpoints and the Ollama URL need per-clone editing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:34:26 -04:00
justin	9ba615c8ee	initial: docs-mcp-template — build guide + scaffolded server. Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out. Contents: - PLAN.md — comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist. - CLAUDE.md — guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands. - README.md — quick-start summary. Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required): docs_mcp/server.py: FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set. docs_mcp/usage.py: TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is. rag/embeddings.py: Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable. rag/chunk.py: Paragraph-aware chunker with synthetic chunk 0. Per-product tunable. rag/index.py: Chroma + BM25 builder. --rebuild and --bm25-only flags. rag/bm25.py: SQLite FTS5 lexical index. Reusable. scrape/changelog.py: --cached / --ref / --json / --history-out. Reusable. scrape/README.md: What you write per-product. eval/queries.jsonl.example: Curate ~25 hand-labeled queries here. eval/retrievers.py: Retriever protocol + stub classes. eval/run_eval.py: MRR / Recall@K / nDCG@K harness skeleton. scripts/usage_report.py: Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring. scripts/registry_gc.py: Gitea container registry cleanup. Reusable. Deployment + CI: Dockerfile: Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency. deploy/docker-compose.yml: MCP + reranker sidecar + Watchtower. Templated with <placeholders>. .gitea/workflows/refresh.yml: Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme. .gitea/workflows/image-only.yml: Code-only ship cycle, ~18min. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00
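
Commit dda044eb95 says that HYBRID_SEARCH=true fuses dense and BM25 results through a `_rrf_fuse` helper, but the helper's body does not appear in this log. The sketch below shows reciprocal rank fusion in general and is not the repository's code. The function name `rrf_fuse`, the constant `k=60`, and the sample chunk ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default and an
    assumption here, not a value taken from the commit log.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


# Hypothetical chunk ids standing in for dense and BM25 result lists.
dense = ["c3", "c1", "c7"]
bm25 = ["c1", "c9", "c3"]
fused = rrf_fuse([dense, bm25])
print(fused)
```

Ids that both retrievers return rise to the top. An id that only one retriever ranks well still enters the fused pool, which is consistent with the commit's remark that dense results can pollute the pool on this corpus.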

18 Commits
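
Commits dd691b0111 and e07df7a1ae describe a hard-split fallback that breaks any oversized chunk on line boundaries, capped at MAX_CHARS (first 6000, later 4000). The following is a minimal sketch of that idea and not the code in rag/chunk.py. The function name `hard_split` is hypothetical, and slicing a single over-long line by characters is an assumption the log does not confirm.

```python
MAX_CHARS = 4000  # cap after commit e07df7a1ae; it was 6000 in dd691b0111


def hard_split(text, max_chars=MAX_CHARS):
    """Split text into pieces of at most max_chars, preferring line boundaries."""
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Assumption: a single line longer than the cap is sliced by characters.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the cap.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


# Hypothetical table-dense paragraph, similar in spirit to the one in the log.
table = "\n".join(f"| row {i} | value |" for i in range(1000))
parts = hard_split(table)
print(len(parts), max(len(p) for p in parts))
```

Joining the pieces reproduces the input exactly, so no text is lost at the split points. A character cap only approximates the embedder's token limit, which is presumably why the later commit lowered it for table-dense content.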