build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13

Initial scaffold: the docs-mcp-template clone with all the
HVM-validated stack ported across, customized for Morpheus
Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).

Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0  sd00007510en_us  568 pages (Feb 2026)
* morpheus_user_manual_8_1_1  sd00007621en_us  569 pages (Mar 2026)
* morpheus_user_manual_8_1_2  sd00007732en_us  569 pages (Apr 2026)
* morpheus_release_notes_8_1_0  sd00007496en_us  single-doc
* morpheus_release_notes_8_1_1  sd00007610en_us  single-doc
* morpheus_release_notes_8_1_2  sd00007733en_us  single-doc
* morpheus_quickspecs            a50009231enw     html-file (live
  curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured —
  S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus
  services SKUs HA124A1#V38/V39 and H46SBA1).

No Deployment Guide or Qualification Matrix on HPE Support for
Morpheus Enterprise specifically — the only QM (sd00006551en_us)
covers HVM clusters managed by Morpheus and lives in hvm-docs.

Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py — including the
  MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py — 11 MCP tools, BM25-default search,
  cross-encoder rerank, hybrid behind HYBRID_SEARCH=true,
  morpheus_api_lessons (renamed from hvm_api_lessons), env-gated
  submit_doc_bug
* docs_mcp/api_lessons.md — Morpheus-specific scaffold covering
  licensing model, HVM elevation path, REST vs Plugin API, with
  TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py — TOC + single-doc
  + html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder
  queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml — same Gitea Actions
  setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml — morpheus-docs-mcp service definition,
  shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt

Verified locally: scrape produced 1599 .md pages (some TOC entries
are parent-only and yield no body), 6353 chunks all under the
4000-character cap, MCP server boots and lists 11 tools cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:26:24 -04:00
parent 43728320bf
commit fa448f94e1
22 changed files with 2822 additions and 247 deletions
@@ -7,6 +7,72 @@ the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.
---
## Product context — HPE Morpheus Enterprise Software
**This repo is for HPE Morpheus Enterprise**, the full cloud-management
platform. It is a **different SKU** from HPE Morpheus VM Essentials
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
docs here; they're a separate, smaller product (the "VM-only" subset
of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
Morpheus Enterprise as the "elevate to" target — that's the
relationship.
`PRODUCT_NAME=morpheus`, so the lessons tool is named `morpheus_api_lessons`,
the collection `morpheus_docs`, and so on.
### Upstream portal
HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
the Zerto docs). Anonymous JSON API, no auth required.
| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |
User-facing URL format:
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
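As a rough sketch of how the three endpoints and the user-facing URL relate, the helpers below only build URLs from a `docId` and a page GUID. The function names are illustrative and not part of this repo, and `GUID-0000` is a placeholder rather than a real page.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Base URLs taken from the endpoint table above.
API = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY = "https://support.hpe.com/hpesc/public/docDisplay"


def abstract_url(doc_id: str) -> str:
    """Title page / abstract, or the whole body for single-doc bundles."""
    return f"{API}/{doc_id}"


def toc_url(doc_id: str) -> str:
    """Nested TOC tree; empty or 404 for single-doc Release Notes."""
    return f"{API}/{doc_id}/toc"


def render_url(doc_id: str, page_id: str) -> str:
    """JSON payload carrying page_html for one page."""
    return f"{API}/{doc_id}/render?page={page_id}.html"


def display_url(doc_id: str, page_id: str | None = None) -> str:
    """Human-facing link to store alongside each scraped page."""
    params = {"docId": doc_id}
    if page_id:
        params["page"] = f"{page_id}.html"
    return f"{DISPLAY}?{urlencode(params)}"
```

Keeping the human-facing URL separate from the API URLs makes it easy to store a citable link next to each page without re-deriving it later.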
### Bundle IDs (confirmed 2026-05-22)
**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:
| Version | docId |
|---|---|
| 8.1.0 | `sd00007510en_us` |
| 8.1.1 | `sd00007621en_us` |
| 8.1.2 | `sd00007732en_us` |
**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
(no TOC; full body returned by the `/document/{docId}` endpoint
itself; scraper needs a `--single-doc` mode for these):
| Version | docId |
|---|---|
| 8.1.0 | `sd00007496en_us` |
| 8.1.1 | `sd00007610en_us` |
| 8.1.2 | `sd00007733en_us` |
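One way to tell the two shapes apart is to look at what the `/toc` endpoint returned. The sketch below assumes the TOC arrives as a list of nodes with `topicLink` and `children` keys, as described in the endpoint table; `pick_mode` is a hypothetical helper name, not something the repo defines.

```python
from __future__ import annotations

from typing import Any


def pick_mode(toc: Any) -> str:
    """Return "toc" when the TOC lists at least one linked page, else "single".

    `toc` is whatever the /toc endpoint produced: a list of nodes,
    an empty list, or None when the endpoint answered 404.
    """

    def has_page(nodes: list[dict] | None) -> bool:
        # A node counts if it links to a page itself or any descendant does.
        for node in nodes or []:
            if node.get("topicLink") or has_page(node.get("children")):
                return True
        return False

    return "toc" if isinstance(toc, list) and has_page(toc) else "single"
```

With this shape, a Release Notes docId that answers 404 or an empty tree is treated as a single-doc bundle without needing a hard-coded list.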
### Cross-version peers are free
GUIDs are stable across versions (confirmed on HVM where 374/376/376
pages had 100% GUID overlap between adjacent versions). Same-GUID =
same-topic. Synthesize `topic_cluster.clustered_topics` by looking
up the same GUID in the other bundle slugs — no fuzzy matching
needed.
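The lookup can be sketched as a plain inverted index from GUID to the bundles that contain it. This is a minimal illustration, assuming each bundle's page GUIDs are already known; `cluster_by_guid` and the GUID values are hypothetical, and the real sidecar schema is whatever `PLAN.md` documents.

```python
from __future__ import annotations

from collections import defaultdict


def cluster_by_guid(pages: dict[str, set[str]]) -> dict[tuple[str, str], list[str]]:
    """Map (bundle slug, GUID) to the other bundle slugs holding the same GUID.

    `pages` maps each bundle slug to the set of page GUIDs scraped for it.
    """
    # Invert: GUID -> set of bundle slugs that contain it.
    owners: dict[str, set[str]] = defaultdict(set)
    for slug, guids in pages.items():
        for guid in guids:
            owners[guid].add(slug)

    # For every page, its peers are the owners of the same GUID minus itself.
    return {
        (slug, guid): sorted(owners[guid] - {slug})
        for slug, guids in pages.items()
        for guid in guids
    }


# Placeholder GUIDs, for illustration only.
peers = cluster_by_guid({
    "morpheus_user_manual_8_1_1": {"GUID-AAAA", "GUID-BBBB"},
    "morpheus_user_manual_8_1_2": {"GUID-AAAA"},
})
```

Here `peers[("morpheus_user_manual_8_1_1", "GUID-AAAA")]` lists the 8.1.2 bundle, while `GUID-BBBB` has no peers because it appears in only one bundle.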
### Reusable from hvm-docs
`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
solve the identical portal shape. Copy and adapt the BUNDLES list +
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
TOC-paginated path and the single-doc path are needed (the HVM
build covers both because HVM Release Notes follow the same shape).
## What you write
At minimum, two scripts:
@@ -0,0 +1,200 @@
"""Discover Morpheus Enterprise doc bundles on HPE Support DocPortal and write bundles.json.
Mirrors hvm-docs/scrape/bundles.py — same portal, same API shape, same single-doc-blob
treatment for Release Notes, but pointing at the Morpheus Enterprise docId range.
For each bundle this script:
1. GETs /hpesc/public/api/document/{docId} → abstract HTML
2. GETs /hpesc/public/api/document/{docId}/toc → page tree (or 404 for single-doc)
3. Writes bundles.json at repo root with the schema PLAN.md Phase 1 documents.
QuickSpecs is a special case: lives at www.hpe.com (not support.hpe.com), gets the
html-file mode and is scraped via curl_cffi (see scrape/quickspecs.py).
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
BUNDLES_JSON = ROOT / "bundles.json"
@dataclass
class BundleSpec:
slug: str
doc_id: str
title: str
version: str | None
product: str # e.g. "User Manual", "Release Notes", "QuickSpecs"
mode: str # "toc", "single", or "html-file"
platform: str | None = None
language: str = "en-US"
source_url: str | None = None # overrides the default support.hpe.com URL
# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
# range sd00006500..7740 for `Morpheus Enterprise` matches in the abstract.
#
# Notes:
# - Morpheus Enterprise has User Manuals dating back to 8.0.10
# (sd00006774en_us, Sep 2025) but we only ship the 8.1.x line for
# now. Add the 8.0.x bundles here if you need older versions in the
# corpus.
# - No dedicated Deployment Guide or Qualification Matrix for Morpheus
# Enterprise on HPE Support — the only QM (sd00006551en_us) covers
# HVM clusters managed by Morpheus, which lives in hvm-docs.
# - QuickSpecs lives on www.hpe.com (not support.hpe.com), uses the
# html-file scrape mode with curl_cffi Chrome impersonation.
BUNDLES: list[BundleSpec] = [
BundleSpec("morpheus_user_manual_8_1_0", "sd00007510en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.0", "User Manual", "toc"),
BundleSpec("morpheus_user_manual_8_1_1", "sd00007621en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.1", "User Manual", "toc"),
BundleSpec("morpheus_user_manual_8_1_2", "sd00007732en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.2", "User Manual", "toc"),
BundleSpec("morpheus_release_notes_8_1_0", "sd00007496en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.0", "Release Notes", "single"),
BundleSpec("morpheus_release_notes_8_1_1", "sd00007610en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.1", "Release Notes", "single"),
BundleSpec("morpheus_release_notes_8_1_2", "sd00007733en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.2", "Release Notes", "single"),
BundleSpec("morpheus_quickspecs", "a50009231enw", "HPE Morpheus Enterprise Software QuickSpecs",
"v1", "QuickSpecs", "html-file",
source_url="https://www.hpe.com/psnow/doc/a50009231enw"),
]
def _session() -> requests.Session:
s = requests.Session()
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
return s
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
    """GET with exponential backoff. Returns None on 404."""
    delay = 1.0
    for _ in range(retries):
        r = s.get(url, timeout=30)
        if r.status_code == 200:
            return r.json() if expect_json else r.text
        if r.status_code == 404:
            # Expected for the /toc endpoint of single-doc bundles.
            return None
        if r.status_code in (429, 500, 502, 503, 504):
            # Transient failure: wait 1s, 2s, 4s, ... and try again.
            time.sleep(delay)
            delay *= 2
            continue
        r.raise_for_status()
    raise RuntimeError(f"GET failed after {retries} retries: {url}")
def _count_toc(toc: list[dict] | None) -> tuple[int, str | None]:
    if not toc:
        return 0, None
    landing = None
    n = 0

    def walk(nodes: list[dict] | None, depth: int) -> None:
        nonlocal n, landing
        for node in nodes or []:
            link = node.get("topicLink")
            if link:
                n += 1
                m = re.search(r"page=(GUID-[A-F0-9-]+)\.html", link)
                if m and landing is None:
                    landing = m.group(1)
            walk(node.get("children"), depth + 1)

    walk(toc, 0)
    return n, landing
def _parse_abstract(html: str) -> dict[str, str]:
soup = BeautifulSoup(html, "html.parser")
out: dict[str, str] = {}
h1 = soup.select_one("h1.title.topictitle1")
if h1:
out["title"] = h1.get_text(" ", strip=True)
desc = soup.select_one("div.desc")
if desc:
out["abstract"] = desc.get_text(" ", strip=True)
pub = soup.select_one("div.publishedDate")
if pub:
out["published"] = pub.get_text(" ", strip=True).replace("Published:", "").strip()
return out
def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
# html-file bundles are static fixtures or live-fetched outside support.hpe.com.
if spec.mode == "html-file":
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": 1,
"mode": "html-file",
"abstract": "",
"dates": {},
"landing_page": spec.doc_id,
"source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
}
abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
meta = _parse_abstract(abstract_html or "")
page_count: int
landing: str | None
if spec.mode == "toc":
toc = _get(s, f"{API}/{spec.doc_id}/toc", expect_json=True)
page_count, landing = _count_toc(toc)
if page_count == 0:
print(f" ! {spec.slug}: TOC empty — falling back to single-doc mode", file=sys.stderr)
spec.mode = "single"
page_count, landing = 1, spec.doc_id
else:
page_count, landing = 1, spec.doc_id
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": meta.get("title") or spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": page_count,
"mode": spec.mode,
"abstract": meta.get("abstract", ""),
"dates": {"Published": meta.get("published", "")},
"landing_page": landing,
"source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
}
def main() -> int:
p = argparse.ArgumentParser(description="Build bundles.json from BUNDLES list.")
p.add_argument("--out", default=str(BUNDLES_JSON))
args = p.parse_args()
s = _session()
out: list[dict[str, Any]] = []
for spec in BUNDLES:
print(f"{spec.slug} ({spec.doc_id}) ...", file=sys.stderr)
out.append(discover_bundle(s, spec))
Path(args.out).write_text(json.dumps(out, indent=2) + "\n")
print(f"wrote {args.out}: {len(out)} bundles, {sum(b['page_count'] for b in out)} pages total", file=sys.stderr)
return 0
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,194 @@
"""Scrape HPE QuickSpecs collateral pages into corpus markdown.
HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
captured DOM). The blocker for automated scraping is `www.hpe.com`'s
edge bot defense, which drops connections from non-browser TLS
fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.
Content extraction uses these stable CSS selectors found in the page:
.lr-right-rail hpe-highlights-container .collateral-content
— one per section ("Overview", "Standard Features", etc.)
h3.txto-title — section title
div.txto-description — section body
uc-table.uc-table-polaris — SKU / version-history tables
A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
as a fallback when the live fetch fails (HPE edge churn, network
issues). Keeping a current fixture in the repo also makes diffing
QuickSpecs revisions easy.
Usage (called by scrape.runner for bundles with mode="html-file"):
    python -m scrape.quickspecs a50009231enw
Or programmatically:
    from scrape.quickspecs import scrape_quickspecs
    scrape_quickspecs("a50009231enw", bundle_id="morpheus_quickspecs", title="...")
"""
from __future__ import annotations
import argparse
import json
import logging
import sys
from pathlib import Path
from bs4 import BeautifulSoup
from markdownify import markdownify as md
log = logging.getLogger(__name__)
ROOT = Path(__file__).resolve().parent.parent
SOURCE_DIR = ROOT / "scrape" / "quickspecs"
CORPUS_DIR = ROOT / "corpus"
COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"
def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
"""GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
Returns the HTML body on success, None on any failure."""
try:
from curl_cffi import requests as cc
except ImportError:
log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
return None
try:
r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
impersonate="chrome120", timeout=timeout)
if r.status_code != 200 or not r.text:
log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
return None
return r.text
except Exception as e:
log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
return None
def fetch_fixture(doc_id: str) -> str | None:
"""Read the committed HTML fixture as fallback."""
p = SOURCE_DIR / f"{doc_id}.html"
if not p.exists():
return None
return p.read_text(encoding="utf-8")
def _extract_content_blocks(html: str) -> list[str]:
"""Pull each section block (.collateral-content under .lr-right-rail).
The fixture format (just .quickspecs-content wrapper) and the live
format (.lr-right-rail with nested hpe-highlights-container) are
both supported. Returns a list of section HTML strings, in document
order.
"""
soup = BeautifulSoup(html, "html.parser")
# Live format: each <hpe-highlights-container> under .lr-right-rail has
# one or more .collateral-content blocks; concat them.
rail = soup.select_one(".lr-right-rail")
if rail is not None:
blocks = rail.select(".collateral-content")
return [str(b) for b in blocks]
# Fixture format: a single wrapper holding all the H2/H3 sections.
wrapper = soup.select_one(".quickspecs-content")
if wrapper is not None:
return [str(wrapper)]
# Last-resort: whole body.
body = soup.body or soup
return [str(body)]
def parse_html(html: str) -> str:
"""Convert QuickSpecs HTML to clean markdown.
Filters out the page chrome (nav, footer, recommendations carousel,
cookie banner, analytics blobs) by extracting only the content
blocks, then runs markdownify."""
blocks = _extract_content_blocks(html)
chunks: list[str] = []
for block in blocks:
soup = BeautifulSoup(block, "html.parser")
# Drop anchor placeholders that markdownify turns into noisy links
for a in soup.select('[hpe-left-rail-anchor]'):
a.decompose()
# Drop carousel / share / recommendation widgets if any leaked in.
for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
"esl-scrollbar", "esl-trigger", "video-overlay",
"generic-modal-loader", "style", "script"):
for el in soup.select(sel):
el.decompose()
chunks.append(md(str(soup), heading_style="ATX", bullets="-",
strip=["span", "div"]))
text = "\n\n".join(chunks)
# Collapse runs of blank lines markdownify likes to emit.
text = "\n".join(line.rstrip() for line in text.splitlines())
while "\n\n\n" in text:
text = text.replace("\n\n\n", "\n\n")
return text.strip() + "\n"
def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
version: str | None = None,
product: str = "QuickSpecs",
source_url: str | None = None,
force: bool = False) -> bool:
"""Live-fetch (or fall back to fixture), parse, write corpus files.
Returns True if files were written, False if skipped (already exists
and --force not set)."""
bundle_dir = CORPUS_DIR / bundle_id
md_path = bundle_dir / f"{doc_id}.md"
json_path = bundle_dir / f"{doc_id}.json"
if not force and md_path.exists() and json_path.exists():
log.info(" %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
return False
html = fetch_live(doc_id)
fetched_from = "live"
if html is None:
html = fetch_fixture(doc_id)
fetched_from = "fixture"
if html is None:
log.error("QuickSpecs %s: no live response and no fixture at %s",
doc_id, SOURCE_DIR / f"{doc_id}.html")
return False
body_md = parse_html(html)
bundle_dir.mkdir(parents=True, exist_ok=True)
md_path.write_text(body_md, encoding="utf-8")
sidecar = {
"bundle_id": bundle_id,
"page_id": doc_id,
"title": title,
"ordinal": 1,
"parent_title": None,
"doc_id": doc_id,
"version": version,
"product": product,
"source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
"fetched_from": fetched_from,
}
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
log.info(" %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
return True
def main() -> int:
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    p = argparse.ArgumentParser()
    p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50009231enw")
    # Defaults match the morpheus_quickspecs entry in scrape/bundles.py.
    p.add_argument("--bundle-id", default="morpheus_quickspecs")
    p.add_argument("--title", default="HPE Morpheus Enterprise Software QuickSpecs")
    p.add_argument("--version", default=None)
    p.add_argument("--force", action="store_true")
    args = p.parse_args()
    ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
                           args.version, force=args.force)
    return 0 if ok else 1
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,27 @@
# scrape/quickspecs/
Static HTML fixtures for HPE QuickSpecs documents. The www.hpe.com edge
drops connections from datacenter IPs with non-browser User-Agents
(verified 2026-05-22 with curl, wget, and Anthropic's WebFetch), so
`scrape/quickspecs.py` fetches live via `curl_cffi` and falls back to
these fixtures when the live fetch fails.
## Workflow
1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
browser, opens DevTools → Elements → Copy the `<body>` HTML.
2. Save it at `scrape/quickspecs/<doc_id>.html`.
3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
4. `python -m scrape.runner --bundle morpheus_quickspecs --force` reads the
committed HTML and writes `corpus/morpheus_quickspecs/<doc_id>.{md,json}`.
5. Re-index and ship.
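The fixture's role in this workflow can be sketched as a two-step lookup, assuming the live-first order that `scrape_quickspecs` uses. `load_quickspecs_html` and its injectable `fetch_live` parameter are hypothetical names introduced here for illustration; the real module wires the fetchers together directly.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Optional


def load_quickspecs_html(
    doc_id: str,
    fixture_dir: Path,
    fetch_live: Callable[[str], Optional[str]],
) -> tuple[Optional[str], str]:
    """Return (html, origin) where origin is "live", "fixture", or "missing"."""
    # Prefer the live page so the corpus tracks the current revision.
    html = fetch_live(doc_id)
    if html:
        return html, "live"

    # Fall back to the committed fixture saved by the operator.
    path = fixture_dir / f"{doc_id}.html"
    if path.exists():
        return path.read_text(encoding="utf-8"), "fixture"

    return None, "missing"
```

Passing the live fetcher in as a parameter keeps the fallback order testable without touching the network.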
QuickSpecs only update every few months (HPE rebrand, new SKU added,
feature change). When a new version drops, refresh the local HTML
file and re-run the scrape.
## Current fixtures
- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
(Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
(3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
Essentials included.
@@ -0,0 +1,339 @@
"""Scrape Morpheus Enterprise doc bundles into corpus/<slug>/<page_id>.{md,json}.
Reads bundles.json (produced by scrape.bundles), then for each bundle:
- mode="toc": walks the TOC tree, fetches each page via the render
endpoint, converts page_html to markdown, writes
<page_id>.md + <page_id>.json sidecar.
- mode="single": fetches /document/{docId} directly, treats the whole
body as one page with page_id = doc_id.
After all bundles are on disk, runs a finalize pass that synthesizes
topic_cluster.clustered_topics for each page by looking up the same
GUID in sibling bundles (HPE GUIDs are stable across versions — see
reference_hpe_docs_portal_api.md).
Usage:
python -m scrape.runner --all
python -m scrape.runner --bundle morpheus_user_manual_8_1_2
python -m scrape.runner --all --force # re-download already-on-disk pages
python -m scrape.runner --finalize-only # only redo the topic_cluster pass
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}&page={page_id}.html"
DOC_URL_SINGLE = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
BUNDLES_JSON = ROOT / "bundles.json"
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")
@dataclass
class TocEntry:
page_id: str
title: str
ordinal: int
parent_title: str | None
def _session() -> requests.Session:
s = requests.Session()
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
return s
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
    """GET with exponential backoff. Returns None on 404."""
    delay = 1.0
    for _ in range(retries):
        r = s.get(url, timeout=30)
        if r.status_code == 200:
            return r.json() if expect_json else r.text
        if r.status_code == 404:
            # Callers treat a missing TOC or page as "nothing to scrape".
            return None
        if r.status_code in (429, 500, 502, 503, 504):
            # Transient failure: wait 1s, 2s, 4s, ... and try again.
            time.sleep(delay)
            delay *= 2
            continue
        r.raise_for_status()
    raise RuntimeError(f"GET failed after {retries} retries: {url}")
def _flatten_toc(toc: list[dict]) -> list[TocEntry]:
out: list[TocEntry] = []
ordinal = 0
def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
nonlocal ordinal
for node in nodes or []:
title = node.get("topicName") or ""
link = node.get("topicLink") or ""
m = GUID_RE.search(link)
if m:
ordinal += 1
out.append(TocEntry(page_id=m.group(1), title=title, ordinal=ordinal, parent_title=parent_title))
walk(node.get("children"), title or parent_title)
walk(toc, None)
return out
def _strip_dita_wrappers(html: str) -> str:
"""Remove the outer <main class="ditasrc">, drop the trademark Notices section,
and unwrap aria-only span markup so markdownify produces clean text.
DITA's notices boilerplate repeats across every doc; if we leave it in,
every page chunk inherits the same trademark text and pollutes retrieval."""
soup = BeautifulSoup(html, "html.parser")
# Drop the Notices/Acknowledgments/Abstract boilerplate by section heading.
# Every doc on the portal carries the same legal Notices and trademark
# Acknowledgments; if we leave them in, every chunk inherits the same
# text and pollutes retrieval. Abstract is one-line marketing.
boilerplate = {"Notices", "Acknowledgments", "Abstract"}
# Wrapped form: <article>/<section>/<div> whose first heading child is boilerplate.
for sec in soup.select("article, section, div"):
h = sec.find(["h1", "h2"], recursive=False)
if h and h.get_text(strip=True) in boilerplate:
sec.decompose()
# Unwrapped form: bare <h1>/<h2>Boilerplate</h2> followed by its .desc/.body sibling.
for h in soup.find_all(["h1", "h2"]):
if h.get_text(strip=True) in boilerplate:
sib = h.find_next_sibling()
if sib and (sib.name in {"div", "section"}):
cls = " ".join(sib.get("class", []) or [])
if "desc" in cls or "body" in cls or "notices" in cls:
sib.decompose()
h.decompose()
main = soup.find("main")
return str(main) if main else str(soup)
def html_to_md(page_html: str) -> str:
cleaned = _strip_dita_wrappers(page_html)
text = md(cleaned, heading_style="ATX", bullets="-")
# collapse runs of blank lines
text = re.sub(r"\n{3,}", "\n\n", text).strip()
return text + "\n"
def fetch_toc_page(s: requests.Session, doc_id: str, page_id: str) -> str:
payload = _get(s, f"{API}/{doc_id}/render?page={page_id}.html", expect_json=True)
if not payload:
return ""
return payload.get("page_html") or ""
def fetch_single_doc(s: requests.Session, doc_id: str) -> tuple[str, str]:
"""Returns (page_html, title) for a single-doc-shape bundle."""
html = _get(s, f"{API}/{doc_id}")
if not html:
return "", ""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.title.topictitle1")
title = h1.get_text(" ", strip=True) if h1 else doc_id
return html, title
def write_page(bundle_dir: Path, page_id: str, body_md: str, sidecar: dict[str, Any], force: bool) -> bool:
bundle_dir.mkdir(parents=True, exist_ok=True)
md_path = bundle_dir / f"{page_id}.md"
json_path = bundle_dir / f"{page_id}.json"
if not force and md_path.exists() and json_path.exists():
return False
md_path.write_text(body_md, encoding="utf-8")
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
return True
def scrape_toc_bundle(s: requests.Session, bundle: dict, force: bool, concurrency: int) -> int:
doc_id = bundle["doc_id"]
slug = bundle["slug"]
bundle_dir = CORPUS / slug
toc = _get(s, f"{API}/{doc_id}/toc", expect_json=True) or []
entries = _flatten_toc(toc)
print(f" {slug}: {len(entries)} pages", file=sys.stderr)
written = 0
def do_one(entry: TocEntry) -> bool:
page_html = fetch_toc_page(s, doc_id, entry.page_id)
if not page_html:
return False
body_md = html_to_md(page_html)
sidecar = {
"bundle_id": slug,
"page_id": entry.page_id,
"title": entry.title,
"ordinal": entry.ordinal,
"parent_title": entry.parent_title,
"doc_id": doc_id,
"version": bundle.get("version"),
"product": bundle.get("product"),
"source_url": DOC_URL.format(doc_id=doc_id, page_id=entry.page_id),
# topic_cluster filled in by finalize()
}
return write_page(bundle_dir, entry.page_id, body_md, sidecar, force)
with ThreadPoolExecutor(max_workers=concurrency) as pool:
for fut in as_completed(pool.submit(do_one, e) for e in entries):
if fut.result():
written += 1
return written
def scrape_single_bundle(s: requests.Session, bundle: dict, force: bool) -> int:
doc_id = bundle["doc_id"]
slug = bundle["slug"]
bundle_dir = CORPUS / slug
html, title = fetch_single_doc(s, doc_id)
if not html:
print(f" ! {slug}: empty body", file=sys.stderr)
return 0
body_md = html_to_md(html)
sidecar = {
"bundle_id": slug,
"page_id": doc_id,
"title": title or bundle["title"],
"ordinal": 1,
"parent_title": None,
"doc_id": doc_id,
"version": bundle.get("version"),
"product": bundle.get("product"),
"source_url": DOC_URL_SINGLE.format(doc_id=doc_id),
}
print(f" {slug}: 1 page (single-doc)", file=sys.stderr)
return 1 if write_page(bundle_dir, doc_id, body_md, sidecar, force) else 0
def finalize_clusters(bundles: list[dict]) -> int:
"""Cross-link sibling pages with the same GUID across version bundles.
For TOC bundles, page_id == GUID; the same GUID in two bundles means the
same underlying topic. Single-doc bundles (page_id == doc_id) have no GUID,
so they are peered by grouping on the `product` field."""
# GUID → list[(slug, sidecar_path, sidecar_dict)]
guid_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
# product → list[(slug, sidecar_path, sidecar_dict)] for single-doc peering
product_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
for b in bundles:
slug = b["slug"]
bundle_dir = CORPUS / slug
if not bundle_dir.exists():
continue
for jp in bundle_dir.glob("*.json"):
data = json.loads(jp.read_text())
pid = data["page_id"]
if pid.startswith("GUID-"):
guid_to_pages.setdefault(pid, []).append((slug, jp, data))
else:
product_to_pages.setdefault(b["product"], []).append((slug, jp, data))
updated = 0
# TOC pages — cluster by GUID
for guid, peers in guid_to_pages.items():
if len(peers) < 2:
continue
for slug, jp, data in peers:
others = [
{"bundle_id": s2, "page_id": guid, "clustering_title": d2.get("title", "")}
for s2, _, d2 in peers if s2 != slug
]
data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
jp.write_text(json.dumps(data, indent=2) + "\n")
updated += 1
# Single-doc pages — cluster by product (e.g. Release Notes 8.1.0/.1/.2)
for product, peers in product_to_pages.items():
if len(peers) < 2:
continue
for slug, jp, data in peers:
others = [
{"bundle_id": s2, "page_id": d2["page_id"], "clustering_title": d2.get("title", "")}
for s2, _, d2 in peers if s2 != slug
]
data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
jp.write_text(json.dumps(data, indent=2) + "\n")
updated += 1
return updated
def main() -> int:
p = argparse.ArgumentParser(description="Scrape Morpheus Enterprise bundles into corpus/.")
p.add_argument("--all", action="store_true", help="scrape every bundle in bundles.json")
p.add_argument("--bundle", action="append", help="scrape one bundle by slug (repeatable)")
p.add_argument("--force", action="store_true", help="re-fetch pages already on disk")
p.add_argument("--concurrency", type=int, default=6)
p.add_argument("--finalize-only", action="store_true", help="only rebuild topic_cluster sidecar fields")
args = p.parse_args()
if not BUNDLES_JSON.exists():
print(f"bundles.json missing — run `python -m scrape.bundles` first", file=sys.stderr)
return 2
bundles = json.loads(BUNDLES_JSON.read_text())
if args.finalize_only:
n = finalize_clusters(bundles)
print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
return 0
if args.bundle:
bundles = [b for b in bundles if b["slug"] in args.bundle]
if not bundles:
print(f"no bundles matched: {args.bundle}", file=sys.stderr)
return 2
elif not args.all:
print("specify --all or --bundle <slug>", file=sys.stderr)
return 2
s = _session()
total = 0
for b in bundles:
mode = b.get("mode")
if mode == "single":
total += scrape_single_bundle(s, b, args.force)
elif mode == "html-file":
# Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
# to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
from scrape.quickspecs import scrape_quickspecs
ok = scrape_quickspecs(
doc_id=b["doc_id"], bundle_id=b["slug"],
title=b.get("title", b["doc_id"]),
version=b.get("version"),
product=b.get("product", "QuickSpecs"),
source_url=b.get("source_url"),
force=args.force,
)
total += 1 if ok else 0
else:
total += scrape_toc_bundle(s, b, args.force, args.concurrency)
print(f"scraped {total} new/updated pages", file=sys.stderr)
# Always finalize after a scrape so sidecars are consistent.
all_bundles = json.loads(BUNDLES_JSON.read_text())
n = finalize_clusters(all_bundles)
print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
return 0
if __name__ == "__main__":
sys.exit(main())