justin a0727da8da scrape: add Qualification Matrix + QuickSpecs bundles (live curl_cffi for HPE www)
Two new bundles:

* hvm_qualification_matrix (sd00006551en_us) — the "Qualification Matrix
  for HVM Clusters Managed by HPE Morpheus Software". Single TOC bundle,
  2 pages (parent + content). The content page is ~100 KB of HTML
  containing five tables: Server Hardware Support, Storage Hardware
  Support, Independent Software Vendor (ISV) Support, Hypervisor OS
  Compatibility and Interoperability Matrix, and Guest OS. Scraped via
  the same /hpesc/public/api/document/{docId}/render endpoint as every
  other bundle on support.hpe.com — the API returns server-rendered
  DITA HTML, so no JS/SPA shenanigans.

* hvm_quickspecs (a50004260enw) — HPE Morpheus VM Essentials Software
  QuickSpecs, Version 4 (02-Feb-2026). SKUs: S5Q81AAE (1-yr per Socket
  E-LTU), S5Q82AAE (3-yr), S5Q83AAE (5-yr); each includes Tech Care
  Essentials. QuickSpecs lives at www.hpe.com (not support.hpe.com),
  which drops connections at the edge for non-browser TLS fingerprints —
  verified 2026-05-22 against curl, wget, urllib, and Anthropic's
  WebFetch (each returned 0 bytes or timed out before any response
  headers arrived). Bypassed here via curl_cffi impersonating Chrome
  120's JA3/JA4 fingerprint.
  HTTP 200, 255 KB on first try, all four sections + all three SKUs
  cleanly parseable from the server-rendered HTML.
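
The two fetch paths described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the function names are hypothetical, the https scheme on support.hpe.com is assumed, and curl_cffi is imported lazily so the URL helper works without it installed.

```python
SUPPORT_HOST = "https://support.hpe.com"  # assumption: https scheme


def render_url(doc_id: str) -> str:
    """Build the DocPortal render URL for one document id."""
    return f"{SUPPORT_HOST}/hpesc/public/api/document/{doc_id}/render"


def fetch_like_chrome(url: str, timeout: float = 30.0) -> str:
    """Fetch a page while presenting a Chrome 120 TLS fingerprint.

    Hypothetical helper; needs the third-party curl_cffi package.
    """
    from curl_cffi import requests  # pip install curl_cffi

    response = requests.get(url, impersonate="chrome120", timeout=timeout)
    response.raise_for_status()
    return response.text
```

`render_url("sd00006551en_us")` yields the render endpoint for the Qualification Matrix, while `fetch_like_chrome` would be pointed at the QuickSpecs page on www.hpe.com.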

New module scrape/quickspecs.py drives the live fetch + parse for any
hvm_*_quickspecs bundle. CSS selectors taken from the captured DOM:
  .lr-right-rail hpe-highlights-container .collateral-content
       — one block per H3 section
  h3.txto-title             — section title
  div.txto-description      — section body
  uc-table.uc-table-polaris — SKU and version-history tables
On any live failure the parser falls back to a committed HTML fixture
at scrape/quickspecs/<doc_id>.html so the build never breaks on a
transient edge hiccup.
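
A rough stdlib-only sketch of the parse-with-fallback behaviour described above. It only pairs h3.txto-title with div.txto-description; the class name, function name, and signature are hypothetical, and the real module may well use a proper CSS-selector library instead.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Optional


class SectionParser(HTMLParser):
    """Collect (title, body) pairs from h3.txto-title / div.txto-description."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[tuple[str, str]] = []
        self._target: Optional[str] = None  # "title" or "body" while capturing
        self._tag = ""                      # tag name of the captured element
        self._depth = 0                     # nesting depth of same-named tags
        self._buf: list[str] = []
        self._title = ""

    def _open(self, target: str, tag: str) -> None:
        self._target, self._tag, self._depth, self._buf = target, tag, 1, []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._target is None:
            if tag == "h3" and "txto-title" in classes:
                self._open("title", tag)
            elif tag == "div" and "txto-description" in classes:
                self._open("body", tag)
        elif tag == self._tag:
            self._depth += 1  # nested <div> inside the description

    def handle_endtag(self, tag):
        if self._target is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth:
            return
        text = " ".join(" ".join(self._buf).split())  # normalise whitespace
        if self._target == "title":
            self._title = text
        else:
            self.sections.append((self._title, text))
        self._target = None

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)


def load_html(doc_id: str, fetch: Callable[[], str], fixture_dir: Path) -> str:
    """Try the live fetch; on any failure read the committed fixture."""
    try:
        return fetch()
    except Exception:
        return (fixture_dir / f"{doc_id}.html").read_text(encoding="utf-8")
```

Catching every exception is deliberate in this sketch: the stated goal is that a transient edge failure never breaks the build.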

scrape/runner.py learned a new mode "html-file" that dispatches to
scrape.quickspecs; bundles.py extended with an optional source_url on
BundleSpec for cases where the page lives outside support.hpe.com.
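
One way the optional source_url and the mode dispatch might fit together is sketched below. Only BundleSpec, source_url, and the "html-file" mode come from this commit; the other field names, the "toc" default, fetch_url, and dispatch are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class BundleSpec:
    slug: str
    doc_id: str
    mode: str = "toc"                 # hypothetical default mode name
    source_url: Optional[str] = None  # set when the page lives off support.hpe.com

    def fetch_url(self) -> str:
        """Prefer the explicit source_url, else the DocPortal render endpoint."""
        if self.source_url:
            return self.source_url
        return (
            "https://support.hpe.com/hpesc/public/api/document/"
            f"{self.doc_id}/render"
        )


def dispatch(spec: BundleSpec, handlers: Dict[str, Callable[[BundleSpec], str]]) -> str:
    """Route a bundle to the handler registered for its mode."""
    try:
        handler = handlers[spec.mode]
    except KeyError:
        raise ValueError(f"unknown scrape mode: {spec.mode!r}") from None
    return handler(spec)
```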

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:05:11 -04:00

hvm-docs

A hosted MCP server over the public documentation for HPE Morpheus VM Essentials Software (HVM) — the KVM-based hypervisor platform from HPE. Lets any MCP-aware client (Claude Desktop, Claude Code, Cursor, Copilot, MetaMCP) answer questions against the User Manual, Release Notes, and Deployment Guide; diff pages across 8.1.x versions; surface what changed recently; and (when enabled) submit documentation bugs back to HPE.

Served behind MetaMCP at https://mcp.jpaul.io/metamcp/hvm-docs/mcp once deployed.

Tools

11 tools, registered over MCP streamable-HTTP:

| Tool | Use |
|---|---|
| search_docs | BM25-default search with optional version / platform / bundle filters; cross-encoder reranked when RERANK_URL is set |
| get_page | Full markdown of one page with metadata header + source URL |
| list_versions | Discover available versions, doc types, and bundle slugs |
| list_cluster | Cross-version peers of a page (synthesized from same-GUID overlap) |
| diff_versions | Unified diff of one topic between two bundles |
| bundle_changelog | Added / removed / churn-ranked changed pages between two bundles |
| weekly_digest | "What changed in the docs in the last N days"; reads CI-baked history.jsonl |
| corpus_status | Image build time, upstream Published date, total bundles/pages/chunks |
| hvm_api_lessons | Curated operator gotchas (manager sizing, upgrade ordering, plugin/worker compat, console keyboards, backups setup) |
| find_doc_inconsistencies | Scoped scan for cross-version drift + redirect-chain stub pages |
| submit_doc_bug | Env-gated draft → confirm → submit workflow to HPE's docs feedback (endpoint TBD; currently refuses with manual-fallback) |
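
As an illustration of what diff_versions returns, a unified diff of one page between two bundles can be produced with the standard library. This is a sketch under assumptions: the function name and the one-line context window are hypothetical, and the server's real output format may differ.

```python
import difflib


def unified_page_diff(old_md: str, new_md: str, old_label: str, new_label: str) -> str:
    """Return a unified diff between two markdown versions of the same topic."""
    diff = difflib.unified_diff(
        old_md.splitlines(),
        new_md.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",  # inputs were split without newlines
        n=1,          # one line of context around each change
    )
    return "\n".join(diff)
```

Identical inputs produce an empty string, which is a convenient "no change" signal for a caller.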

Corpus

Confirmed bundles (scraped 2026-05-22 from HPE Support DocPortal):

| Bundle | docId | Pages |
|---|---|---|
| hvm_user_manual_8_1_0 | sd00007520en_us | 374 |
| hvm_user_manual_8_1_1 | sd00007620en_us | 376 |
| hvm_user_manual_8_1_2 | sd00007735en_us | 376 |
| hvm_release_notes_8_1_0 | sd00007497en_us | 1 |
| hvm_release_notes_8_1_1 | sd00007609en_us | 1 |
| hvm_release_notes_8_1_2 | sd00007734en_us | 1 |
| hvm_deployment_guide | sd00007332en_us | 32 |

Total: 1,161 pages → 2,650 chunks in Chroma, with the same chunks also indexed in SQLite FTS5 (BM25).
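
For readers unfamiliar with FTS5, a lexical index of this kind can be sketched with the standard sqlite3 module. The table and column names here are assumptions, not the schema in rag/bm25.py, and the sketch needs an SQLite build with FTS5 enabled (most CPython builds have it).

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(chunks: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Index (guid, bundle, body) rows in an in-memory FTS5 table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE chunks USING fts5(guid UNINDEXED, bundle UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO chunks (guid, bundle, body) VALUES (?, ?, ?)", chunks
    )
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> List[tuple]:
    """Return (guid, bundle, score) rows, best match first.

    FTS5's bm25() gives lower values to better matches, so ascending
    order puts the most relevant chunk on top.
    """
    rows = conn.execute(
        "SELECT guid, bundle, bm25(chunks) AS score "
        "FROM chunks WHERE chunks MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Raw user input can contain characters that FTS5 parses as query syntax, so a production path would likely quote or sanitise the query before passing it to MATCH.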

GUIDs are stable across HVM versions, so topic_cluster cross-version peer mapping is free (no fuzzy matching needed).
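
Because the GUID is the join key, the peer mapping reduces to a group-by. A minimal sketch, with hypothetical function names and a made-up GUID in the usage:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_clusters(pages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (bundle_slug, guid) pairs into guid -> sorted bundle slugs."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for bundle, guid in pages:
        clusters[guid].append(bundle)
    return {guid: sorted(bundles) for guid, bundles in clusters.items()}


def peers(clusters: Dict[str, List[str]], guid: str, bundle: str) -> List[str]:
    """Bundles other than `bundle` that carry the same GUID."""
    return [b for b in clusters.get(guid, []) if b != bundle]
```

A page that exists in all three user manuals therefore has two peers; a page added in 8.1.1 has one peer when viewed from 8.1.2.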

Retrieval

Eval against 22 hand-curated golden queries — see eval/results/baseline.md:

| Retriever | MRR | Recall@5 | nDCG@5 | Latency |
|---|---|---|---|---|
| dense (Ollama nomic-embed-text) | 0.539 | 0.621 | 0.558 | 88 ms |
| BM25 (SQLite FTS5) | 0.880 | 0.909 | 0.883 | 3 ms |
| hybrid (dense + BM25 + RRF) | 0.692 | 0.818 | 0.713 | 69 ms |
| bm25 + jina-rerank | 0.920 | 0.939 | 0.927 | 490 ms (CPU) / ~50 ms (GPU) |

HPE docs use controlled vocabulary, so lexical match dominates; the cross-encoder cleans up the long tail. See PLAN.md Phase 7/8 for the reasoning.
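
For reference, the hybrid row's reciprocal rank fusion and the three reported metrics can be written in a few lines. This is a generic sketch with binary relevance, not the code in eval/; the constant k=60 is a common RRF default and an assumption here. MRR is the mean of reciprocal_rank over all golden queries.

```python
import math
from typing import Dict, List, Sequence, Set


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across lists."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties are broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant hit, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

RRF rewards documents that both retrievers rank reasonably well, which is one plausible reading of why a weak dense list drags the hybrid below BM25 alone on this corpus.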

Architecture

HPE Support DocPortal (sniff-the-API, no auth)
        │
        ▼
   scrape/        ──► corpus/<bundle>/<GUID>.{md,json}  (committed)
        │
        ▼
   rag/index      ──► chroma/  (dense, 768-dim nomic-embed-text)
                  ──► bm25/    (SQLite FTS5)
        │
        ▼
   docs_mcp.server (FastMCP, streamable-HTTP)
        │
        ├── BM25 → reranker (jina-reranker-v2-base GGUF, GPU sidecar)
        │
        ▼
   deploy/docker-compose.yml
        │
        ├── MetaMCP gateway   ── public at mcp.jpaul.io behind Cloudflare Tunnel
        ├── jina-rerank       ── shared GPU sidecar (1080 Ti)
        └── Watchtower        ── auto-pulls :latest on weekly refresh
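
The BM25 → reranker hop in the diagram might look roughly like this. The /v1/rerank path and RERANK_URL appear in this README; the request and response shapes below follow the Jina-style rerank convention and are an assumption about this sidecar, as are the function names.

```python
import json
import os
import urllib.request
from typing import Callable, List


def http_post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def rerank(
    query: str,
    candidates: List[str],
    top_n: int = 5,
    post: Callable[[str, dict], dict] = http_post_json,
) -> List[str]:
    """Reorder BM25 candidates with the cross-encoder when RERANK_URL is set.

    Without RERANK_URL the BM25 order is returned unchanged.
    """
    base = os.environ.get("RERANK_URL")
    if not base or not candidates:
        return candidates[:top_n]
    # Assumed payload: {"query", "documents", "top_n"} ->
    # {"results": [{"index", "relevance_score"}, ...]}
    reply = post(
        base.rstrip("/") + "/v1/rerank",
        {"query": query, "documents": candidates, "top_n": top_n},
    )
    ranked = sorted(reply["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [candidates[r["index"]] for r in ranked[:top_n]]
```

Keeping the HTTP call injectable makes the ordering logic testable without a GPU sidecar running.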

CI (Gitea Actions on git.jpaul.io)

Two cadences:

  • refresh.yml — weekly Monday 06:00 UTC cron + manual dispatch. Re-scrapes upstream, commits corpus diffs, rebuilds Chroma + BM25, builds & pushes image. ~58 min on the GPU pool.
  • image-only.yml — manual dispatch. Skips scrape; rebuilds indexes from committed corpus and ships a new image. ~3 min.

Image: git.jpaul.io/justin/hvm-docs:latest (Watchtower target), plus rolling :<sha7> and :YYYY.MM.DD tags.

Embeddings fan out across the two GPU-pinned Ollama containers on the Gitea host (192.168.0.2:11435 Titan X, :11436 1080 Ti) — same infra zerto-docs uses; see OLLAMA_URLS in both workflows.
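
A sketch of how a fan-out over OLLAMA_URLS could be planned. The comma-separated format, the single-host default, the batch size, and both function names are assumptions; the workflows define the real value.

```python
import os
from itertools import cycle
from typing import List, Sequence, Tuple


def ollama_urls(default: str = "http://127.0.0.1:11434") -> List[str]:
    """Parse OLLAMA_URLS (assumed comma-separated) into a clean list."""
    raw = os.environ.get("OLLAMA_URLS", default)
    return [u.strip().rstrip("/") for u in raw.split(",") if u.strip()]


def assign_batches(
    texts: Sequence[str], urls: Sequence[str], batch_size: int = 16
) -> List[Tuple[str, List[str]]]:
    """Split texts into batches and hand them out round-robin across hosts."""
    if not urls:
        raise ValueError("at least one Ollama URL is required")
    batches = [
        list(texts[i : i + batch_size]) for i in range(0, len(texts), batch_size)
    ]
    return list(zip(cycle(urls), batches))
```

Round-robin is the simplest policy; with GPUs of unequal speed, a work queue that each host pulls from would balance better.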

Local dev

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# (Optional) the CPU dev reranker — pulls PyTorch (~2 GB); skip if
# you'll just be running stdio queries.
pip install -r requirements-rerank.txt

# Build / refresh the corpus + indexes
python -m scrape.bundles
python -m scrape.runner --all --force --concurrency 6
python -m rag.index --rebuild

# Local stdio server (Claude Desktop dev)
python -m docs_mcp.server --transport stdio

# Local streamable-HTTP for integration testing
python -m docs_mcp.server --transport streamable-http --port 8000

# Run the eval harness (without reranker)
python -m eval.run_eval --k 5

# With the dev reranker
python -m scripts.rerank_server &
RERANK_URL=http://127.0.0.1:8001 python -m eval.run_eval --k 5

Repo layout

.
├── PLAN.md                       # 13-phase build guide (template-shared)
├── CLAUDE.md                     # Claude Code guidance
├── README.md                     # this file
├── Dockerfile
├── requirements.txt              # production deps
├── requirements-rerank.txt       # dev CPU reranker only
├── bundles.json                  # bundle catalog (committed)
├── corpus/                       # 1,161 scraped pages (committed)
├── .gitea/workflows/             # refresh.yml + image-only.yml
├── scrape/
│   ├── bundles.py                # HVM bundle catalog + discovery
│   ├── runner.py                 # TOC + single-doc page scraper
│   └── changelog.py              # git-history → digest JSONL
├── rag/
│   ├── chunk.py                  # paragraph-aware splitter w/ 6 KB hard cap
│   ├── embeddings.py             # OLLAMA_URLS (zerto-style fan-out)
│   ├── index.py                  # builds Chroma + BM25
│   └── bm25.py                   # FTS5 lexical index
├── docs_mcp/
│   ├── server.py                 # FastMCP + 11 tools
│   ├── usage.py                  # TimedCall JSONL telemetry
│   └── api_lessons.md            # curated HVM operator gotchas
├── eval/
│   ├── queries.jsonl             # 22 hand-curated golden queries
│   ├── retrievers.py             # Dense/BM25/Hybrid/Reranked
│   ├── run_eval.py               # MRR / Recall@K / nDCG@K
│   └── results/baseline.md       # committed eval results
├── scripts/
│   ├── rerank_server.py          # dev/CPU cross-encoder /v1/rerank
│   ├── usage_report.py           # log summarizer
│   └── registry_gc.py            # Gitea container-registry cleanup
└── deploy/
    └── docker-compose.yml        # production hosting (MCP + reranker + Watchtower)
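
As an illustration of the "paragraph-aware splitter w/ 6 KB hard cap" noted for rag/chunk.py, here is a minimal sketch. It measures the cap in characters rather than bytes, which is an assumption, and the function name is hypothetical.

```python
from typing import List

HARD_CAP = 6 * 1024  # assumption: cap counted in characters


def split_paragraphs(text: str, cap: int = HARD_CAP) -> List[str]:
    """Pack whole paragraphs into chunks of at most `cap` characters.

    A single paragraph longer than the cap is hard-sliced.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) > cap:
            # Flush what we have, then slice the oversized paragraph.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(para[i : i + cap] for i in range(0, len(para), cap))
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > cap:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines keeps each chunk aligned with the markdown structure, and the hard cap bounds the input size sent to the embedding model.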

License

Internal — HVM is HPE's product; the docs MCP is a side project, not HPE-sanctioned.
