build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13

Initial scaffold: the docs-mcp-template clone with all the HVM-validated stack ported across, customized for Morpheus Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).

Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0 sd00007510en_us 568 pages (Feb 2026)
* morpheus_user_manual_8_1_1 sd00007621en_us 569 pages (Mar 2026)
* morpheus_user_manual_8_1_2 sd00007732en_us 569 pages (Apr 2026)
* morpheus_release_notes_8_1_0 sd00007496en_us single-doc
* morpheus_release_notes_8_1_1 sd00007610en_us single-doc
* morpheus_release_notes_8_1_2 sd00007733en_us single-doc
* morpheus_quickspecs a50009231enw html-file (live curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured: S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus services SKUs HA124A1#V38/V39 and H46SBA1)

No Deployment Guide or Qualification Matrix on HPE Support for Morpheus Enterprise specifically; the only QM (sd00006551en_us) covers HVM clusters managed by Morpheus and lives in hvm-docs.

Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py, including the MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py: 11 MCP tools, BM25-default search, cross-encoder rerank, hybrid behind HYBRID_SEARCH=true, morpheus_api_lessons (renamed from hvm_api_lessons), env-gated submit_doc_bug
* docs_mcp/api_lessons.md: Morpheus-specific scaffold covering licensing model, HVM elevation path, REST vs Plugin API, with TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py: TOC + single-doc + html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml: same Gitea Actions setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml: morpheus-docs-mcp service definition, shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt

Verified locally: scrape produced 1599 .md pages (some TOC entries are parent-only and yield no body), 6353 chunks all under the 4 KB cap, MCP server boots and lists 11 tools cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:26:24 -04:00
parent 43728320bf
commit fa448f94e1
22 changed files with 2822 additions and 247 deletions
@@ -7,6 +7,72 @@ the upstream doc portal.
 See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
 expects.

+---
+
+## Product context — HPE Morpheus Enterprise Software
+
+**This repo is for HPE Morpheus Enterprise**, the full cloud-management
+platform. It is a **different SKU** from HPE Morpheus VM Essentials
+(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
+docs here; they're a separate, smaller product (the "VM-only" subset
+of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
+Morpheus Enterprise as the "elevate to" target — that's the
+relationship.
+
+`PRODUCT_NAME=morpheus`. The tool will be named `morpheus_api_lessons`,
+the collection `morpheus_docs`, etc.
+
+### Upstream portal
+
+HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
+the Zerto docs). Anonymous JSON API, no auth required.
+
+| Endpoint | Returns |
+|---|---|
+| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
+| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
+| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |
+
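As a rough illustration of the `/toc` response shape in the table above, the sketch below walks a nested tree and collects page GUIDs in document order. The sample tree, the placeholder docId, and the helper name `iter_page_ids` are made up for illustration; only the `topicName` / `topicLink` / `children` keys come from the endpoint description.

```python
from __future__ import annotations

import re
from typing import Iterator

# Matches the page GUID inside a topicLink such as
# "docDisplay?docId=X&page=GUID-1234-ABCD.html".
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")


def iter_page_ids(nodes: list[dict] | None) -> Iterator[str]:
    """Yield page GUIDs depth-first; nodes without a page link are skipped."""
    for node in nodes or []:
        match = GUID_RE.search(node.get("topicLink") or "")
        if match:
            yield match.group(1)
        # Children are visited even when the parent itself has no link.
        yield from iter_page_ids(node.get("children"))


# Hypothetical sample tree (not real portal data).
sample_toc = [
    {
        "topicName": "Getting started",
        "topicLink": "docDisplay?docId=sd00000000en_us&page=GUID-AAAA-0001.html",
        "children": [
            {
                "topicName": "Grouping node without its own page",
                "topicLink": "",
                "children": [
                    {
                        "topicName": "Install",
                        "topicLink": "docDisplay?docId=sd00000000en_us&page=GUID-AAAA-0002.html",
                        "children": [],
                    },
                ],
            },
        ],
    },
]

page_ids = list(iter_page_ids(sample_toc))
print(page_ids)
```

Parent-only nodes contribute no page of their own, which is one reason a cataloged TOC count can exceed the number of pages that end up on disk.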
+User-facing URL format:
+`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
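A minimal sketch of how the API and user-facing URLs above fit together, assuming a page is identified by a docId plus a `GUID-...` page id. The helper names `render_url` and `display_url` are hypothetical; the URL patterns are the ones listed in this section.

```python
from __future__ import annotations

from urllib.parse import urlencode

API = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY = "https://support.hpe.com/hpesc/public/docDisplay"


def render_url(doc_id: str, page_id: str) -> str:
    """API URL that returns the JSON payload for a single page body."""
    return f"{API}/{doc_id}/render?page={page_id}.html"


def display_url(doc_id: str, page_id: str | None = None) -> str:
    """Human-facing URL; omit page_id for single-doc bundles."""
    params = {"docId": doc_id}
    if page_id:
        params["page"] = f"{page_id}.html"
    return f"{DISPLAY}?{urlencode(params)}"


# "GUID-XXXX" is a placeholder page id, as in the URL format above.
print(render_url("sd00007732en_us", "GUID-XXXX"))
print(display_url("sd00007732en_us", "GUID-XXXX"))
print(display_url("sd00007733en_us"))
```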
+
+### Bundle IDs (confirmed 2026-05-22)
+
+**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:
+
+| Version | docId |
+|---|---|
+| 8.1.0  | `sd00007510en_us` |
+| 8.1.1  | `sd00007621en_us` |
+| 8.1.2  | `sd00007732en_us` |
+
+**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
+(no TOC; full body returned by the `/document/{docId}` endpoint
+itself; scraper needs a `--single-doc` mode for these):
+
+| Version | docId |
+|---|---|
+| 8.1.0  | `sd00007496en_us` |
+| 8.1.1  | `sd00007610en_us` |
+| 8.1.2  | `sd00007733en_us` |
+
+### Cross-version peers are free
+
+GUIDs are stable across versions (confirmed on HVM where 374/376/376
+pages had 100% GUID overlap between adjacent versions). Same-GUID =
+same-topic. Synthesize `topic_cluster.clustered_topics` by looking
+up the same GUID in the other bundle slugs — no fuzzy matching
+needed.
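The exact-match lookup described above could be sketched roughly as follows, assuming each scraped page is known by a `(bundle_slug, guid)` pair. The function name `peers_by_guid` and the sample GUIDs are illustrative only and are not taken from the corpus.

```python
from __future__ import annotations

from collections import defaultdict


def peers_by_guid(pages: list[tuple[str, str]]) -> dict[tuple[str, str], list[str]]:
    """Map each (bundle_slug, guid) to the other bundle slugs sharing that GUID."""
    slugs_by_guid: dict[str, set[str]] = defaultdict(set)
    for slug, guid in pages:
        slugs_by_guid[guid].add(slug)
    # A page's peers are every bundle holding the same GUID, minus its own.
    return {
        (slug, guid): sorted(slugs_by_guid[guid] - {slug})
        for slug, guid in pages
    }


# Hypothetical sample: one topic present in two versions, one only in 8.1.2.
pages = [
    ("morpheus_user_manual_8_1_1", "GUID-AAAA-0001"),
    ("morpheus_user_manual_8_1_2", "GUID-AAAA-0001"),
    ("morpheus_user_manual_8_1_2", "GUID-BBBB-0002"),
]
peers = peers_by_guid(pages)
print(peers[("morpheus_user_manual_8_1_2", "GUID-AAAA-0001")])
```

Because the key is the GUID itself, a topic that exists in only one version simply ends up with an empty peer list.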
+
+### Reusable from hvm-docs
+
+`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
+solve the identical portal shape. Copy and adapt the BUNDLES list +
+PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
+TOC-paginated path and the single-doc path are needed (the HVM
+build covers both because HVM Release Notes follow the same shape).
+
+
 ## What you write

 At minimum, two scripts:
@@ -0,0 +1,200 @@
+"""Discover Morpheus Enterprise doc bundles on HPE Support DocPortal and write bundles.json.
+
+Mirrors hvm-docs/scrape/bundles.py — same portal, same API shape, same single-doc-blob
+treatment for Release Notes, but pointing at the Morpheus Enterprise docId range.
+
+For each bundle this script:
+  1. GETs /hpesc/public/api/document/{docId}        → abstract HTML
+  2. GETs /hpesc/public/api/document/{docId}/toc    → page tree (or 404 for single-doc)
+  3. Writes bundles.json at repo root with the schema PLAN.md Phase 1 documents.
+
+QuickSpecs is a special case: lives at www.hpe.com (not support.hpe.com), gets the
+html-file mode and is scraped via curl_cffi (see scrape/quickspecs.py).
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import requests
+from bs4 import BeautifulSoup
+
+API = "https://support.hpe.com/hpesc/public/api/document"
+DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
+UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
+ROOT = Path(__file__).resolve().parent.parent
+BUNDLES_JSON = ROOT / "bundles.json"
+
+
+@dataclass
+class BundleSpec:
+    slug: str
+    doc_id: str
+    title: str
+    version: str | None
+    product: str  # e.g. "User Manual", "Release Notes", "QuickSpecs"
+    mode: str    # "toc", "single", or "html-file"
+    platform: str | None = None
+    language: str = "en-US"
+    source_url: str | None = None   # overrides the default support.hpe.com URL
+
+
+# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
+# range sd00006500..7740 for `Morpheus Enterprise` matches in the abstract.
+#
+# Notes:
+#   - Morpheus Enterprise has User Manuals dating back to 8.0.10
+#     (sd00006774en_us, Sep 2025) but we only ship the 8.1.x line for
+#     now. Add the 8.0.x bundles here if you need older versions in the
+#     corpus.
+#   - No dedicated Deployment Guide or Qualification Matrix for Morpheus
+#     Enterprise on HPE Support — the only QM (sd00006551en_us) covers
+#     HVM clusters managed by Morpheus, which lives in hvm-docs.
+#   - QuickSpecs lives on www.hpe.com (not support.hpe.com), uses the
+#     html-file scrape mode with curl_cffi Chrome impersonation.
+BUNDLES: list[BundleSpec] = [
+    BundleSpec("morpheus_user_manual_8_1_0",   "sd00007510en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.0", "User Manual",   "toc"),
+    BundleSpec("morpheus_user_manual_8_1_1",   "sd00007621en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.1", "User Manual",   "toc"),
+    BundleSpec("morpheus_user_manual_8_1_2",   "sd00007732en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.2", "User Manual",   "toc"),
+    BundleSpec("morpheus_release_notes_8_1_0", "sd00007496en_us", "HPE Morpheus Enterprise Software Release Notes",  "8.1.0", "Release Notes", "single"),
+    BundleSpec("morpheus_release_notes_8_1_1", "sd00007610en_us", "HPE Morpheus Enterprise Software Release Notes",  "8.1.1", "Release Notes", "single"),
+    BundleSpec("morpheus_release_notes_8_1_2", "sd00007733en_us", "HPE Morpheus Enterprise Software Release Notes",  "8.1.2", "Release Notes", "single"),
+    BundleSpec("morpheus_quickspecs",          "a50009231enw",    "HPE Morpheus Enterprise Software QuickSpecs",
+               "v1", "QuickSpecs", "html-file",
+               source_url="https://www.hpe.com/psnow/doc/a50009231enw"),
+]
+
+
+def _session() -> requests.Session:
+    s = requests.Session()
+    s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
+    return s
+
+
+def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
+    delay = 1.0
+    for attempt in range(retries):
+        r = s.get(url, timeout=30)
+        if r.status_code == 200:
+            return r.json() if expect_json else r.text
+        if r.status_code == 404:
+            return None
+        if r.status_code in (429, 500, 502, 503, 504):
+            time.sleep(delay)
+            delay *= 2
+            continue
+        r.raise_for_status()
+    raise RuntimeError(f"GET failed after {retries} retries: {url}")
+
+
+def _count_toc(toc: list[dict] | None) -> tuple[int, str | None]:
+    if not toc:
+        return 0, None
+    landing = None
+    n = 0
+
+    def walk(nodes: list[dict] | None, depth: int) -> None:
+        nonlocal n, landing
+        for node in nodes or []:
+            link = node.get("topicLink")
+            if link:
+                n += 1
+                m = re.search(r"page=(GUID-[A-F0-9-]+)\.html", link)
+                if m and landing is None:
+                    landing = m.group(1)
+            walk(node.get("children"), depth + 1)
+
+    walk(toc, 0)
+    return n, landing
+
+
+def _parse_abstract(html: str) -> dict[str, str]:
+    soup = BeautifulSoup(html, "html.parser")
+    out: dict[str, str] = {}
+    h1 = soup.select_one("h1.title.topictitle1")
+    if h1:
+        out["title"] = h1.get_text(" ", strip=True)
+    desc = soup.select_one("div.desc")
+    if desc:
+        out["abstract"] = desc.get_text(" ", strip=True)
+    pub = soup.select_one("div.publishedDate")
+    if pub:
+        out["published"] = pub.get_text(" ", strip=True).replace("Published:", "").strip()
+    return out
+
+
+def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
+    # html-file bundles are static fixtures or live-fetched outside support.hpe.com.
+    if spec.mode == "html-file":
+        return {
+            "slug": spec.slug,
+            "doc_id": spec.doc_id,
+            "title": spec.title,
+            "version": spec.version,
+            "platform": spec.platform,
+            "product": spec.product,
+            "language": spec.language,
+            "page_count": 1,
+            "mode": "html-file",
+            "abstract": "",
+            "dates": {},
+            "landing_page": spec.doc_id,
+            "source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
+        }
+
+    abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
+    meta = _parse_abstract(abstract_html or "")
+
+    page_count: int
+    landing: str | None
+    if spec.mode == "toc":
+        toc = _get(s, f"{API}/{spec.doc_id}/toc", expect_json=True)
+        page_count, landing = _count_toc(toc)
+        if page_count == 0:
+            print(f"  ! {spec.slug}: TOC empty — falling back to single-doc mode", file=sys.stderr)
+            spec.mode = "single"
+            page_count, landing = 1, spec.doc_id
+    else:
+        page_count, landing = 1, spec.doc_id
+
+    return {
+        "slug": spec.slug,
+        "doc_id": spec.doc_id,
+        "title": meta.get("title") or spec.title,
+        "version": spec.version,
+        "platform": spec.platform,
+        "product": spec.product,
+        "language": spec.language,
+        "page_count": page_count,
+        "mode": spec.mode,
+        "abstract": meta.get("abstract", ""),
+        "dates": {"Published": meta.get("published", "")},
+        "landing_page": landing,
+        "source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
+    }
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description="Build bundles.json from BUNDLES list.")
+    p.add_argument("--out", default=str(BUNDLES_JSON))
+    args = p.parse_args()
+
+    s = _session()
+    out: list[dict[str, Any]] = []
+    for spec in BUNDLES:
+        print(f"  • {spec.slug} ({spec.doc_id}) ...", file=sys.stderr)
+        out.append(discover_bundle(s, spec))
+
+    Path(args.out).write_text(json.dumps(out, indent=2) + "\n")
+    print(f"wrote {args.out}: {len(out)} bundles, {sum(b['page_count'] for b in out)} pages total", file=sys.stderr)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -0,0 +1,194 @@
+"""Scrape HPE QuickSpecs collateral pages into corpus markdown.
+
+HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
+with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
+captured DOM). The blocker for automated scraping is `www.hpe.com`'s
+edge bot defense, which drops connections from non-browser TLS
+fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
+by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.
+
+Content extraction uses these stable CSS selectors found in the page:
+
+  .lr-right-rail hpe-highlights-container .collateral-content
+       — one per section ("Overview", "Standard Features", etc.)
+  h3.txto-title          — section title
+  div.txto-description   — section body
+  uc-table.uc-table-polaris   — SKU / version-history tables
+
+A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
+as a fallback when the live fetch fails (HPE edge churn, network
+issues). Keeping a current fixture in the repo also makes diffing
+QuickSpecs revisions easy.
+
+Usage (called by scrape.runner for bundles with mode="html-file"):
+
+    python -m scrape.quickspecs a50009231enw
+
+Or programmatically:
+
+    from scrape.quickspecs import scrape_quickspecs
+    scrape_quickspecs("a50009231enw", bundle_id="morpheus_quickspecs", title="...")
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import sys
+from pathlib import Path
+
+from bs4 import BeautifulSoup, NavigableString
+from markdownify import markdownify as md
+
+log = logging.getLogger(__name__)
+
+ROOT = Path(__file__).resolve().parent.parent
+SOURCE_DIR = ROOT / "scrape" / "quickspecs"
+CORPUS_DIR = ROOT / "corpus"
+
+COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"
+
+
+def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
+    """GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
+    Returns the HTML body on success, None on any failure."""
+    try:
+        from curl_cffi import requests as cc
+    except ImportError:
+        log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
+        return None
+    try:
+        r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
+                   impersonate="chrome120", timeout=timeout)
+        if r.status_code != 200 or not r.text:
+            log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
+            return None
+        return r.text
+    except Exception as e:
+        log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
+        return None
+
+
+def fetch_fixture(doc_id: str) -> str | None:
+    """Read the committed HTML fixture as fallback."""
+    p = SOURCE_DIR / f"{doc_id}.html"
+    if not p.exists():
+        return None
+    return p.read_text()
+
+
+def _extract_content_blocks(html: str) -> list[str]:
+    """Pull each section block (.collateral-content under .lr-right-rail).
+
+    The fixture format (just .quickspecs-content wrapper) and the live
+    format (.lr-right-rail with nested hpe-highlights-container) are
+    both supported. Returns a list of section HTML strings, in document
+    order.
+    """
+    soup = BeautifulSoup(html, "html.parser")
+    # Live format: each <hpe-highlights-container> under .lr-right-rail has
+    # one or more .collateral-content blocks; concat them.
+    rail = soup.select_one(".lr-right-rail")
+    if rail is not None:
+        blocks = rail.select(".collateral-content")
+        return [str(b) for b in blocks]
+    # Fixture format: a single wrapper holding all the H2/H3 sections.
+    wrapper = soup.select_one(".quickspecs-content")
+    if wrapper is not None:
+        return [str(wrapper)]
+    # Last-resort: whole body.
+    body = soup.body or soup
+    return [str(body)]
+
+
+def parse_html(html: str) -> str:
+    """Convert QuickSpecs HTML to clean markdown.
+
+    Filters out the page chrome (nav, footer, recommendations carousel,
+    cookie banner, analytics blobs) by extracting only the content
+    blocks, then runs markdownify."""
+    blocks = _extract_content_blocks(html)
+    chunks: list[str] = []
+    for block in blocks:
+        soup = BeautifulSoup(block, "html.parser")
+        # Drop anchor placeholders that markdownify turns into noisy links
+        for a in soup.select('[hpe-left-rail-anchor]'):
+            a.decompose()
+        # Drop carousel / share / recommendation widgets if any leaked in.
+        for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
+                    "esl-scrollbar", "esl-trigger", "video-overlay",
+                    "generic-modal-loader", "style", "script"):
+            for el in soup.select(sel):
+                el.decompose()
+        chunks.append(md(str(soup), heading_style="ATX", bullets="-",
+                          strip=["span", "div"]))
+    text = "\n\n".join(chunks)
+    # Collapse runs of blank lines markdownify likes to emit.
+    text = "\n".join(line.rstrip() for line in text.splitlines())
+    while "\n\n\n" in text:
+        text = text.replace("\n\n\n", "\n\n")
+    return text.strip() + "\n"
+
+
+def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
+                     version: str | None = None,
+                     product: str = "QuickSpecs",
+                     source_url: str | None = None,
+                     force: bool = False) -> bool:
+    """Live-fetch (or fall back to fixture), parse, write corpus files.
+
+    Returns True if files were written. Returns False if the page was
+    skipped (already on disk and --force not set) or if neither the live
+    fetch nor the fixture produced any HTML."""
+    bundle_dir = CORPUS_DIR / bundle_id
+    md_path = bundle_dir / f"{doc_id}.md"
+    json_path = bundle_dir / f"{doc_id}.json"
+    if not force and md_path.exists() and json_path.exists():
+        log.info("  %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
+        return False
+
+    html = fetch_live(doc_id)
+    fetched_from = "live"
+    if html is None:
+        html = fetch_fixture(doc_id)
+        fetched_from = "fixture"
+    if html is None:
+        log.error("QuickSpecs %s: no live response and no fixture at %s",
+                  doc_id, SOURCE_DIR / f"{doc_id}.html")
+        return False
+
+    body_md = parse_html(html)
+    bundle_dir.mkdir(parents=True, exist_ok=True)
+    md_path.write_text(body_md)
+    sidecar = {
+        "bundle_id": bundle_id,
+        "page_id": doc_id,
+        "title": title,
+        "ordinal": 1,
+        "parent_title": None,
+        "doc_id": doc_id,
+        "version": version,
+        "product": product,
+        "source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
+        "fetched_from": fetched_from,
+    }
+    json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
+    log.info("  %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
+    return True
+
+
+def main() -> int:
+    logging.basicConfig(level=logging.INFO, format="%(message)s")
+    p = argparse.ArgumentParser()
+    p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50009231enw")
+    p.add_argument("--bundle-id", default="morpheus_quickspecs")
+    p.add_argument("--title", default="HPE Morpheus Enterprise Software QuickSpecs")
+    p.add_argument("--version", default=None)
+    p.add_argument("--force", action="store_true")
+    args = p.parse_args()
+    ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
+                            args.version, force=args.force)
+    return 0 if ok else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -0,0 +1,27 @@
+# scrape/quickspecs/
+
+Static HTML fixtures for HPE QuickSpecs documents, used as a fallback
+when the live `curl_cffi` fetch in `scrape/quickspecs.py` fails. The
+www.hpe.com edge drops connections from datacenter IPs with non-browser
+User-Agents (verified 2026-05-22 with curl, wget, and Anthropic's
+WebFetch), so a plain HTTP client cannot reach these pages.
+
+## Workflow
+
+1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
+   browser, opens DevTools → Elements → Copy the `<body>` HTML.
+2. Save it at `scrape/quickspecs/<doc_id>.html`.
+3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
+4. `python -m scrape.runner --bundle morpheus_quickspecs --force` tries the
+   live fetch first and otherwise reads the committed HTML, then writes
+   `corpus/morpheus_quickspecs/<doc_id>.{md,json}`.
+5. Re-index and ship.
+
+QuickSpecs only update every few months (HPE rebrand, new SKU added,
+feature change). When a new version drops, refresh the local HTML
+file and re-run the scrape.
+
+## Current fixtures
+
+- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
+  (Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
+  (3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
+  Essentials included.
@@ -0,0 +1,339 @@
+"""Scrape Morpheus Enterprise doc bundles into corpus/<slug>/<page_id>.{md,json}.
+
+Reads bundles.json (produced by scrape.bundles), then for each bundle:
+  - mode="toc":    walks the TOC tree, fetches each page via the render
+                   endpoint, converts page_html to markdown, writes
+                   <page_id>.md + <page_id>.json sidecar.
+  - mode="single": fetches /document/{docId} directly, treats the whole
+                   body as one page with page_id = doc_id.
+  - mode="html-file": delegates to scrape.quickspecs (live curl_cffi fetch
+                   with a committed-fixture fallback); page_id = doc_id.
+
+After all bundles are on disk, runs a finalize pass that synthesizes
+topic_cluster.clustered_topics for each page by looking up the same
+GUID in sibling bundles (HPE GUIDs are stable across versions — see
+reference_hpe_docs_portal_api.md).
+
+Usage:
+    python -m scrape.runner --all
+    python -m scrape.runner --bundle morpheus_user_manual_8_1_2
+    python -m scrape.runner --all --force        # re-download already-on-disk pages
+    python -m scrape.runner --finalize-only      # only redo the topic_cluster pass
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import requests
+from bs4 import BeautifulSoup
+from markdownify import markdownify as md
+
+API = "https://support.hpe.com/hpesc/public/api/document"
+DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}&page={page_id}.html"
+DOC_URL_SINGLE = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
+UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
+ROOT = Path(__file__).resolve().parent.parent
+CORPUS = ROOT / "corpus"
+BUNDLES_JSON = ROOT / "bundles.json"
+
+GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")
+
+
+@dataclass
+class TocEntry:
+    page_id: str
+    title: str
+    ordinal: int
+    parent_title: str | None
+
+
+def _session() -> requests.Session:
+    s = requests.Session()
+    s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
+    return s
+
+
+def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
+    delay = 1.0
+    for attempt in range(retries):
+        r = s.get(url, timeout=30)
+        if r.status_code == 200:
+            return r.json() if expect_json else r.text
+        if r.status_code == 404:
+            return None
+        if r.status_code in (429, 500, 502, 503, 504):
+            time.sleep(delay)
+            delay *= 2
+            continue
+        r.raise_for_status()
+    raise RuntimeError(f"GET failed after {retries} retries: {url}")
+
+
+def _flatten_toc(toc: list[dict]) -> list[TocEntry]:
+    out: list[TocEntry] = []
+    ordinal = 0
+
+    def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
+        nonlocal ordinal
+        for node in nodes or []:
+            title = node.get("topicName") or ""
+            link = node.get("topicLink") or ""
+            m = GUID_RE.search(link)
+            if m:
+                ordinal += 1
+                out.append(TocEntry(page_id=m.group(1), title=title, ordinal=ordinal, parent_title=parent_title))
+            walk(node.get("children"), title or parent_title)
+
+    walk(toc, None)
+    return out
+
+
+def _strip_dita_wrappers(html: str) -> str:
+    """Remove the outer <main class="ditasrc">, drop the trademark Notices section,
+    and unwrap aria-only span markup so markdownify produces clean text.
+
+    DITA's notices boilerplate repeats across every doc; if we leave it in,
+    every page chunk inherits the same trademark text and pollutes retrieval."""
+    soup = BeautifulSoup(html, "html.parser")
+    # Drop the Notices/Acknowledgments/Abstract boilerplate by section heading.
+    # Every doc on the portal carries the same legal Notices and trademark
+    # Acknowledgments; if we leave them in, every chunk inherits the same
+    # text and pollutes retrieval. Abstract is one-line marketing.
+    boilerplate = {"Notices", "Acknowledgments", "Abstract"}
+    # Wrapped form: <article>/<section>/<div> whose first heading child is boilerplate.
+    for sec in soup.select("article, section, div"):
+        h = sec.find(["h1", "h2"], recursive=False)
+        if h and h.get_text(strip=True) in boilerplate:
+            sec.decompose()
+    # Unwrapped form: bare <h1>/<h2>Boilerplate</h2> followed by its .desc/.body sibling.
+    for h in soup.find_all(["h1", "h2"]):
+        if h.get_text(strip=True) in boilerplate:
+            sib = h.find_next_sibling()
+            if sib and (sib.name in {"div", "section"}):
+                cls = " ".join(sib.get("class", []) or [])
+                if "desc" in cls or "body" in cls or "notices" in cls:
+                    sib.decompose()
+            h.decompose()
+    main = soup.find("main")
+    return str(main) if main else str(soup)
+
+
+def html_to_md(page_html: str) -> str:
+    cleaned = _strip_dita_wrappers(page_html)
+    text = md(cleaned, heading_style="ATX", bullets="-")
+    # collapse runs of blank lines
+    text = re.sub(r"\n{3,}", "\n\n", text).strip()
+    return text + "\n"
+
+
+def fetch_toc_page(s: requests.Session, doc_id: str, page_id: str) -> str:
+    payload = _get(s, f"{API}/{doc_id}/render?page={page_id}.html", expect_json=True)
+    if not payload:
+        return ""
+    return payload.get("page_html") or ""
+
+
+def fetch_single_doc(s: requests.Session, doc_id: str) -> tuple[str, str]:
+    """Returns (page_html, title) for a single-doc-shape bundle."""
+    html = _get(s, f"{API}/{doc_id}")
+    if not html:
+        return "", ""
+    soup = BeautifulSoup(html, "html.parser")
+    h1 = soup.select_one("h1.title.topictitle1")
+    title = h1.get_text(" ", strip=True) if h1 else doc_id
+    return html, title
+
+
+def write_page(bundle_dir: Path, page_id: str, body_md: str, sidecar: dict[str, Any], force: bool) -> bool:
+    bundle_dir.mkdir(parents=True, exist_ok=True)
+    md_path = bundle_dir / f"{page_id}.md"
+    json_path = bundle_dir / f"{page_id}.json"
+    if not force and md_path.exists() and json_path.exists():
+        return False
+    md_path.write_text(body_md)
+    json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
+    return True
+
+
+def scrape_toc_bundle(s: requests.Session, bundle: dict, force: bool, concurrency: int) -> int:
+    doc_id = bundle["doc_id"]
+    slug = bundle["slug"]
+    bundle_dir = CORPUS / slug
+
+    toc = _get(s, f"{API}/{doc_id}/toc", expect_json=True) or []
+    entries = _flatten_toc(toc)
+    print(f"  {slug}: {len(entries)} pages", file=sys.stderr)
+
+    written = 0
+    def do_one(entry: TocEntry) -> bool:
+        page_html = fetch_toc_page(s, doc_id, entry.page_id)
+        if not page_html:
+            return False
+        body_md = html_to_md(page_html)
+        sidecar = {
+            "bundle_id": slug,
+            "page_id": entry.page_id,
+            "title": entry.title,
+            "ordinal": entry.ordinal,
+            "parent_title": entry.parent_title,
+            "doc_id": doc_id,
+            "version": bundle.get("version"),
+            "product": bundle.get("product"),
+            "source_url": DOC_URL.format(doc_id=doc_id, page_id=entry.page_id),
+            # topic_cluster filled in by finalize()
+        }
+        return write_page(bundle_dir, entry.page_id, body_md, sidecar, force)
+
+    with ThreadPoolExecutor(max_workers=concurrency) as pool:
+        for fut in as_completed(pool.submit(do_one, e) for e in entries):
+            if fut.result():
+                written += 1
+    return written
+
+
+def scrape_single_bundle(s: requests.Session, bundle: dict, force: bool) -> int:
+    doc_id = bundle["doc_id"]
+    slug = bundle["slug"]
+    bundle_dir = CORPUS / slug
+
+    html, title = fetch_single_doc(s, doc_id)
+    if not html:
+        print(f"  ! {slug}: empty body", file=sys.stderr)
+        return 0
+    body_md = html_to_md(html)
+    sidecar = {
+        "bundle_id": slug,
+        "page_id": doc_id,
+        "title": title or bundle["title"],
+        "ordinal": 1,
+        "parent_title": None,
+        "doc_id": doc_id,
+        "version": bundle.get("version"),
+        "product": bundle.get("product"),
+        "source_url": DOC_URL_SINGLE.format(doc_id=doc_id),
+    }
+    print(f"  {slug}: 1 page (single-doc)", file=sys.stderr)
+    return 1 if write_page(bundle_dir, doc_id, body_md, sidecar, force) else 0
+
+
+def finalize_clusters(bundles: list[dict]) -> int:
+    """Cross-link sibling pages with the same GUID across version bundles.
+
+    For TOC bundles, page_id == GUID; same GUID across two bundles = same
+    underlying topic. Single-doc bundles (page_id == doc_id) have no GUID,
+    so they are peered by grouping on the `product` field instead (e.g.
+    Release Notes 8.1.0/8.1.1/8.1.2)."""
+    # GUID → list[(slug, sidecar_path, sidecar_dict)]
+    guid_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
+    # product → list[(slug, sidecar_path, sidecar_dict)] for single-doc peering
+    product_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
+
+    for b in bundles:
+        slug = b["slug"]
+        bundle_dir = CORPUS / slug
+        if not bundle_dir.exists():
+            continue
+        for jp in bundle_dir.glob("*.json"):
+            data = json.loads(jp.read_text())
+            pid = data["page_id"]
+            if pid.startswith("GUID-"):
+                guid_to_pages.setdefault(pid, []).append((slug, jp, data))
+            else:
+                product_to_pages.setdefault(b["product"], []).append((slug, jp, data))
+
+    updated = 0
+    # TOC pages — cluster by GUID
+    for guid, peers in guid_to_pages.items():
+        if len(peers) < 2:
+            continue
+        for slug, jp, data in peers:
+            others = [
+                {"bundle_id": s2, "page_id": guid, "clustering_title": d2.get("title", "")}
+                for s2, _, d2 in peers if s2 != slug
+            ]
+            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
+            jp.write_text(json.dumps(data, indent=2) + "\n")
+            updated += 1
+    # Single-doc pages — cluster by product (e.g. Release Notes 8.1.0/.1/.2)
+    for product, peers in product_to_pages.items():
+        if len(peers) < 2:
+            continue
+        for slug, jp, data in peers:
+            others = [
+                {"bundle_id": s2, "page_id": d2["page_id"], "clustering_title": d2.get("title", "")}
+                for s2, _, d2 in peers if s2 != slug
+            ]
+            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
+            jp.write_text(json.dumps(data, indent=2) + "\n")
+            updated += 1
+
+    return updated
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description="Scrape Morpheus Enterprise bundles into corpus/.")
+    p.add_argument("--all", action="store_true", help="scrape every bundle in bundles.json")
+    p.add_argument("--bundle", action="append", help="scrape one bundle by slug (repeatable)")
+    p.add_argument("--force", action="store_true", help="re-fetch pages already on disk")
+    p.add_argument("--concurrency", type=int, default=6)
+    p.add_argument("--finalize-only", action="store_true", help="only rebuild topic_cluster sidecar fields")
+    args = p.parse_args()
+
+    if not BUNDLES_JSON.exists():
+        print("bundles.json missing; run `python -m scrape.bundles` first", file=sys.stderr)
+        return 2
+
+    bundles = json.loads(BUNDLES_JSON.read_text())
+
+    if args.finalize_only:
+        n = finalize_clusters(bundles)
+        print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
+        return 0
+
+    if args.bundle:
+        bundles = [b for b in bundles if b["slug"] in args.bundle]
+        if not bundles:
+            print(f"no bundles matched: {args.bundle}", file=sys.stderr)
+            return 2
+    elif not args.all:
+        print("specify --all or --bundle <slug>", file=sys.stderr)
+        return 2
+
+    s = _session()
+    total = 0
+    for b in bundles:
+        mode = b.get("mode")
+        if mode == "single":
+            total += scrape_single_bundle(s, b, args.force)
+        elif mode == "html-file":
+            # Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
+            # to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
+            from scrape.quickspecs import scrape_quickspecs
+            ok = scrape_quickspecs(
+                doc_id=b["doc_id"], bundle_id=b["slug"],
+                title=b.get("title", b["doc_id"]),
+                version=b.get("version"),
+                product=b.get("product", "QuickSpecs"),
+                source_url=b.get("source_url"),
+                force=args.force,
+            )
+            total += 1 if ok else 0
+        else:
+            total += scrape_toc_bundle(s, b, args.force, args.concurrency)
+    print(f"scraped {total} new/updated pages", file=sys.stderr)
+
+    # Always finalize after a scrape so sidecars are consistent.
+    all_bundles = json.loads(BUNDLES_JSON.read_text())
+    n = finalize_clusters(all_bundles)
+    print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())