build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13

Initial scaffold: the docs-mcp-template clone with all the
HVM-validated stack ported across, customized for Morpheus
Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).

Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0 sd00007510en_us 568 pages (Feb 2026)
* morpheus_user_manual_8_1_1 sd00007621en_us 569 pages (Mar 2026)
* morpheus_user_manual_8_1_2 sd00007732en_us 569 pages (Apr 2026)
* morpheus_release_notes_8_1_0 sd00007496en_us single-doc
* morpheus_release_notes_8_1_1 sd00007610en_us single-doc
* morpheus_release_notes_8_1_2 sd00007733en_us single-doc
* morpheus_quickspecs a50009231enw html-file (live
  curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured —
  S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus
  services SKUs HA124A1#V38/V39 and H46SBA1).

No Deployment Guide or Qualification Matrix on HPE Support for
Morpheus Enterprise specifically — the only QM (sd00006551en_us)
covers HVM clusters managed by Morpheus and lives in hvm-docs.

Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py — including the
  MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py — 11 MCP tools, BM25-default search,
  cross-encoder rerank, hybrid behind HYBRID_SEARCH=true,
  morpheus_api_lessons (renamed from hvm_api_lessons), env-gated
  submit_doc_bug
* docs_mcp/api_lessons.md — Morpheus-specific scaffold covering
  licensing model, HVM elevation path, REST vs Plugin API, with
  TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py — TOC + single-doc
  + html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder
  queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml — same Gitea Actions
  setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml — morpheus-docs-mcp service definition,
  shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt

Verified locally: scrape produced 1599 .md pages (some TOC entries
are parent-only and yield no body), 6353 chunks all under the 4 KB
cap, MCP server boots and lists 11 tools cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -7,6 +7,72 @@ the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.

---

## Product context — HPE Morpheus Enterprise Software

**This repo is for HPE Morpheus Enterprise**, the full cloud-management
platform. It is a **different SKU** from HPE Morpheus VM Essentials
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
docs here; they're a separate, smaller product (the "VM-only" subset
of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
Morpheus Enterprise as the "elevate to" target — that's the
relationship.

`PRODUCT_NAME=morpheus`. Tool will be named `morpheus_api_lessons`,
collection `morpheus_docs`, etc.

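As a rough illustration of the naming convention above, the sketch below derives the per-product identifiers from `PRODUCT_NAME`. The helper `derived_names` and its dictionary keys are hypothetical and not part of the repo; only the resulting values are grounded (`morpheus_api_lessons` and `morpheus_docs` in this README, the server name `morpheus-docs` in the commit message).

```python
import os


def derived_names(product_name: str) -> dict[str, str]:
    """Derive per-product identifiers from PRODUCT_NAME (hypothetical helper)."""
    p = product_name.strip().lower()
    if not p:
        raise ValueError("PRODUCT_NAME must not be empty")
    return {
        "server": f"{p}-docs",        # MCP server name
        "tool": f"{p}_api_lessons",   # lessons tool name
        "collection": f"{p}_docs",    # index collection name
    }


# Assumption: the product name comes from the environment, defaulting to "morpheus".
names = derived_names(os.environ.get("PRODUCT_NAME") or "morpheus")
```

Keeping the derivation in one place means a port to another product only has to change `PRODUCT_NAME`.
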
### Upstream portal

HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
the Zerto docs). Anonymous JSON API, no auth required.

| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |

User-facing URL format:
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`

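The endpoints and URL format above can be sketched as a pair of small helpers: one builds the URLs for a given `docId`, the other walks the nested TOC tree and collects page GUIDs in document order. The function names, the `SAMPLE_TOC` data, and the exact shape of `topicLink` (assumed here to contain `page=GUID-....html`) are illustrative assumptions, not captured portal output. No network access is performed.

```python
from __future__ import annotations

import re

API = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY = "https://support.hpe.com/hpesc/public/docDisplay"
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")


def portal_urls(doc_id: str, page_id: str | None = None) -> dict[str, str]:
    """Build the API URLs from the table plus the user-facing display URL."""
    urls = {
        "document": f"{API}/{doc_id}",
        "toc": f"{API}/{doc_id}/toc",
        "display": f"{DISPLAY}?docId={doc_id}",
    }
    if page_id:
        urls["render"] = f"{API}/{doc_id}/render?page={page_id}.html"
        urls["display"] += f"&page={page_id}.html"
    return urls


def page_ids(toc: list[dict] | None) -> list[str]:
    """Depth-first walk of the TOC tree, returning GUIDs in document order."""
    found: list[str] = []
    for node in toc or []:
        m = GUID_RE.search(node.get("topicLink") or "")
        if m:
            found.append(m.group(1))
        found.extend(page_ids(node.get("children")))
    return found


# Hypothetical TOC fragment; the second node is parent-only (no link of its own).
SAMPLE_TOC = [
    {"topicName": "Intro", "topicLink": "docDisplay?docId=x&page=GUID-AAAA0001.html", "children": []},
    {"topicName": "Group", "topicLink": "", "children": [
        {"topicName": "Child", "topicLink": "docDisplay?docId=x&page=GUID-AAAA0002.html"},
    ]},
]
```

For example, `page_ids(SAMPLE_TOC)` yields the two GUIDs in order, and `portal_urls("sd00007732en_us", "GUID-AAAA0001")["render"]` gives the render endpoint for that page.
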
### Bundle IDs (confirmed 2026-05-22)

**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:

| Version | docId |
|---|---|
| 8.1.0 | `sd00007510en_us` |
| 8.1.1 | `sd00007621en_us` |
| 8.1.2 | `sd00007732en_us` |

**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
(no TOC; full body returned by the `/document/{docId}` endpoint
itself; scraper needs a `--single-doc` mode for these):

| Version | docId |
|---|---|
| 8.1.0 | `sd00007496en_us` |
| 8.1.1 | `sd00007610en_us` |
| 8.1.2 | `sd00007733en_us` |

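The split between the two bundle shapes can be decided from the TOC response alone. The sketch below is one way to do that, assuming the TOC has already been fetched and parsed (or is `None` after a 404); `count_linked` and `pick_mode` are hypothetical names chosen for illustration.

```python
from __future__ import annotations


def count_linked(nodes: list[dict] | None) -> int:
    """Count TOC nodes that carry a topicLink, recursing into children."""
    total = 0
    for node in nodes or []:
        if node.get("topicLink"):
            total += 1
        total += count_linked(node.get("children"))
    return total


def pick_mode(toc: list[dict] | None) -> str:
    """Return "toc" when the tree links to at least one page, else "single"."""
    return "toc" if count_linked(toc) > 0 else "single"
```

A User Manual with a populated tree maps to `"toc"`, while Release Notes, whose TOC is empty or missing, fall through to `"single"`.
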
### Cross-version peers are free

GUIDs are stable across versions (confirmed on HVM where 374/376/376
pages had 100% GUID overlap between adjacent versions). Same-GUID =
same-topic. Synthesize `topic_cluster.clustered_topics` by looking
up the same GUID in the other bundle slugs — no fuzzy matching
needed.

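A minimal sketch of that GUID lookup, working on in-memory tuples rather than sidecar files. The function name `cluster_by_guid` and the `(slug, guid, title)` input shape are assumptions for illustration; the output keys follow the `topic_cluster.clustered_topics` naming used above.

```python
from collections import defaultdict


def cluster_by_guid(pages):
    """Group pages by GUID and list each page's same-GUID peers in other bundles.

    pages: iterable of (bundle_slug, guid, title) tuples.
    Returns {(bundle_slug, guid): topic_cluster_dict} for GUIDs seen in 2+ bundles.
    """
    by_guid = defaultdict(list)
    for slug, guid, title in pages:
        by_guid[guid].append((slug, title))

    clusters = {}
    for guid, members in by_guid.items():
        if len(members) < 2:
            continue  # no peers in any other bundle
        for slug, title in members:
            peers = [
                {"bundle_id": s, "page_id": guid, "clustering_title": t}
                for s, t in members
                if s != slug
            ]
            clusters[(slug, guid)] = {"clustering_title": title, "clustered_topics": peers}
    return clusters
```

Because the match is an exact dictionary lookup on the GUID, a topic that exists in only one version simply gets no cluster entry.
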
### Reusable from hvm-docs

`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
solve the identical portal shape. Copy and adapt the BUNDLES list +
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
TOC-paginated path and the single-doc path are needed (the HVM
build covers both because HVM Release Notes follow the same shape).


## What you write

At minimum, two scripts:

@@ -0,0 +1,200 @@
"""Discover Morpheus Enterprise doc bundles on HPE Support DocPortal and write bundles.json.

Mirrors hvm-docs/scrape/bundles.py — same portal, same API shape, same single-doc-blob
treatment for Release Notes, but pointing at the Morpheus Enterprise docId range.

For each bundle this script:
  1. GETs /hpesc/public/api/document/{docId} → abstract HTML
  2. GETs /hpesc/public/api/document/{docId}/toc → page tree (or 404 for single-doc)
  3. Writes bundles.json at repo root with the schema PLAN.md Phase 1 documents.

QuickSpecs is a special case: lives at www.hpe.com (not support.hpe.com), gets the
html-file mode and is scraped via curl_cffi (see scrape/quickspecs.py).
"""
from __future__ import annotations

import argparse
import json
import re
import sys
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
BUNDLES_JSON = ROOT / "bundles.json"


@dataclass
class BundleSpec:
    slug: str
    doc_id: str
    title: str
    version: str | None
    product: str  # e.g. "User Manual", "Release Notes", "QuickSpecs"
    mode: str  # "toc", "single", or "html-file"
    platform: str | None = None
    language: str = "en-US"
    source_url: str | None = None  # overrides the default support.hpe.com URL


# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
# range sd00006500..7740 for `Morpheus Enterprise` matches in the abstract.
#
# Notes:
# - Morpheus Enterprise has User Manuals dating back to 8.0.10
#   (sd00006774en_us, Sep 2025) but we only ship the 8.1.x line for
#   now. Add the 8.0.x bundles here if you need older versions in the
#   corpus.
# - No dedicated Deployment Guide or Qualification Matrix for Morpheus
#   Enterprise on HPE Support — the only QM (sd00006551en_us) covers
#   HVM clusters managed by Morpheus, which lives in hvm-docs.
# - QuickSpecs lives on www.hpe.com (not support.hpe.com), uses the
#   html-file scrape mode with curl_cffi Chrome impersonation.
BUNDLES: list[BundleSpec] = [
    BundleSpec("morpheus_user_manual_8_1_0", "sd00007510en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.0", "User Manual", "toc"),
    BundleSpec("morpheus_user_manual_8_1_1", "sd00007621en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.1", "User Manual", "toc"),
    BundleSpec("morpheus_user_manual_8_1_2", "sd00007732en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.2", "User Manual", "toc"),
    BundleSpec("morpheus_release_notes_8_1_0", "sd00007496en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.0", "Release Notes", "single"),
    BundleSpec("morpheus_release_notes_8_1_1", "sd00007610en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.1", "Release Notes", "single"),
    BundleSpec("morpheus_release_notes_8_1_2", "sd00007733en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.2", "Release Notes", "single"),
    BundleSpec("morpheus_quickspecs", "a50009231enw", "HPE Morpheus Enterprise Software QuickSpecs",
               "v1", "QuickSpecs", "html-file",
               source_url="https://www.hpe.com/psnow/doc/a50009231enw"),
]


def _session() -> requests.Session:
    s = requests.Session()
    s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
    return s


def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
    delay = 1.0
    for attempt in range(retries):
        r = s.get(url, timeout=30)
        if r.status_code == 200:
            return r.json() if expect_json else r.text
        if r.status_code == 404:
            return None
        if r.status_code in (429, 500, 502, 503, 504):
            time.sleep(delay)
            delay *= 2
            continue
        r.raise_for_status()
    raise RuntimeError(f"GET failed after {retries} retries: {url}")


def _count_toc(toc: list[dict] | None) -> tuple[int, str | None]:
    if not toc:
        return 0, None
    landing = None
    n = 0

    def walk(nodes: list[dict] | None, depth: int) -> None:
        nonlocal n, landing
        for node in nodes or []:
            link = node.get("topicLink")
            if link:
                n += 1
                m = re.search(r"page=(GUID-[A-F0-9-]+)\.html", link)
                if m and landing is None:
                    landing = m.group(1)
            walk(node.get("children"), depth + 1)

    walk(toc, 0)
    return n, landing


def _parse_abstract(html: str) -> dict[str, str]:
    soup = BeautifulSoup(html, "html.parser")
    out: dict[str, str] = {}
    h1 = soup.select_one("h1.title.topictitle1")
    if h1:
        out["title"] = h1.get_text(" ", strip=True)
    desc = soup.select_one("div.desc")
    if desc:
        out["abstract"] = desc.get_text(" ", strip=True)
    pub = soup.select_one("div.publishedDate")
    if pub:
        out["published"] = pub.get_text(" ", strip=True).replace("Published:", "").strip()
    return out


def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
    # html-file bundles are static fixtures or live-fetched outside support.hpe.com.
    if spec.mode == "html-file":
        return {
            "slug": spec.slug,
            "doc_id": spec.doc_id,
            "title": spec.title,
            "version": spec.version,
            "platform": spec.platform,
            "product": spec.product,
            "language": spec.language,
            "page_count": 1,
            "mode": "html-file",
            "abstract": "",
            "dates": {},
            "landing_page": spec.doc_id,
            "source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
        }

    abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
    meta = _parse_abstract(abstract_html or "")

    page_count: int
    landing: str | None
    if spec.mode == "toc":
        toc = _get(s, f"{API}/{spec.doc_id}/toc", expect_json=True)
        page_count, landing = _count_toc(toc)
        if page_count == 0:
            print(f" ! {spec.slug}: TOC empty — falling back to single-doc mode", file=sys.stderr)
            spec.mode = "single"
            page_count, landing = 1, spec.doc_id
    else:
        page_count, landing = 1, spec.doc_id

    return {
        "slug": spec.slug,
        "doc_id": spec.doc_id,
        "title": meta.get("title") or spec.title,
        "version": spec.version,
        "platform": spec.platform,
        "product": spec.product,
        "language": spec.language,
        "page_count": page_count,
        "mode": spec.mode,
        "abstract": meta.get("abstract", ""),
        "dates": {"Published": meta.get("published", "")},
        "landing_page": landing,
        "source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
    }


def main() -> int:
    p = argparse.ArgumentParser(description="Build bundles.json from BUNDLES list.")
    p.add_argument("--out", default=str(BUNDLES_JSON))
    args = p.parse_args()

    s = _session()
    out: list[dict[str, Any]] = []
    for spec in BUNDLES:
        print(f" • {spec.slug} ({spec.doc_id}) ...", file=sys.stderr)
        out.append(discover_bundle(s, spec))

    Path(args.out).write_text(json.dumps(out, indent=2) + "\n")
    print(f"wrote {args.out}: {len(out)} bundles, {sum(b['page_count'] for b in out)} pages total", file=sys.stderr)
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,194 @@
"""Scrape HPE QuickSpecs collateral pages into corpus markdown.

HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
captured DOM). The blocker for automated scraping is `www.hpe.com`'s
edge bot defense, which drops connections from non-browser TLS
fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.

Content extraction uses these stable CSS selectors found in the page:

    .lr-right-rail hpe-highlights-container .collateral-content
        — one per section ("Overview", "Standard Features", etc.)
    h3.txto-title — section title
    div.txto-description — section body
    uc-table.uc-table-polaris — SKU / version-history tables

A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
as a fallback when the live fetch fails (HPE edge churn, network
issues). Keeping a current fixture in the repo also makes diffing
QuickSpecs revisions easy.

Usage (called by scrape.runner for bundles with mode="html-file"):

    python -m scrape.quickspecs a50004260enw

Or programmatically:

    from scrape.quickspecs import scrape_quickspecs
    scrape_quickspecs("a50004260enw", bundle_id="hvm_quickspecs", title="...")
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
from pathlib import Path

from bs4 import BeautifulSoup, NavigableString
from markdownify import markdownify as md

log = logging.getLogger(__name__)

ROOT = Path(__file__).resolve().parent.parent
SOURCE_DIR = ROOT / "scrape" / "quickspecs"
CORPUS_DIR = ROOT / "corpus"

COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"


def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
    """GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
    Returns the HTML body on success, None on any failure."""
    try:
        from curl_cffi import requests as cc
    except ImportError:
        log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
        return None
    try:
        r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
                   impersonate="chrome120", timeout=timeout)
        if r.status_code != 200 or not r.text:
            log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
            return None
        return r.text
    except Exception as e:
        log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
        return None


def fetch_fixture(doc_id: str) -> str | None:
    """Read the committed HTML fixture as fallback."""
    p = SOURCE_DIR / f"{doc_id}.html"
    if not p.exists():
        return None
    return p.read_text()


def _extract_content_blocks(html: str) -> list[str]:
    """Pull each section block (.collateral-content under .lr-right-rail).

    The fixture format (just .quickspecs-content wrapper) and the live
    format (.lr-right-rail with nested hpe-highlights-container) are
    both supported. Returns a list of section HTML strings, in document
    order.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Live format: each <hpe-highlights-container> under .lr-right-rail has
    # one or more .collateral-content blocks; concat them.
    rail = soup.select_one(".lr-right-rail")
    if rail is not None:
        blocks = rail.select(".collateral-content")
        return [str(b) for b in blocks]
    # Fixture format: a single wrapper holding all the H2/H3 sections.
    wrapper = soup.select_one(".quickspecs-content")
    if wrapper is not None:
        return [str(wrapper)]
    # Last-resort: whole body.
    body = soup.body or soup
    return [str(body)]


def parse_html(html: str) -> str:
    """Convert QuickSpecs HTML to clean markdown.

    Filters out the page chrome (nav, footer, recommendations carousel,
    cookie banner, analytics blobs) by extracting only the content
    blocks, then runs markdownify."""
    blocks = _extract_content_blocks(html)
    chunks: list[str] = []
    for block in blocks:
        soup = BeautifulSoup(block, "html.parser")
        # Drop anchor placeholders that markdownify turns into noisy links
        for a in soup.select('[hpe-left-rail-anchor]'):
            a.decompose()
        # Drop carousel / share / recommendation widgets if any leaked in.
        for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
                    "esl-scrollbar", "esl-trigger", "video-overlay",
                    "generic-modal-loader", "style", "script"):
            for el in soup.select(sel):
                el.decompose()
        chunks.append(md(str(soup), heading_style="ATX", bullets="-",
                         strip=["span", "div"]))
    text = "\n\n".join(chunks)
    # Collapse runs of blank lines markdownify likes to emit.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    while "\n\n\n" in text:
        text = text.replace("\n\n\n", "\n\n")
    return text.strip() + "\n"


def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
                      version: str | None = None,
                      product: str = "QuickSpecs",
                      source_url: str | None = None,
                      force: bool = False) -> bool:
    """Live-fetch (or fall back to fixture), parse, write corpus files.

    Returns True if files were written, False if skipped (already exists
    and --force not set)."""
    bundle_dir = CORPUS_DIR / bundle_id
    md_path = bundle_dir / f"{doc_id}.md"
    json_path = bundle_dir / f"{doc_id}.json"
    if not force and md_path.exists() and json_path.exists():
        log.info(" %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
        return False

    html = fetch_live(doc_id)
    fetched_from = "live"
    if html is None:
        html = fetch_fixture(doc_id)
        fetched_from = "fixture"
    if html is None:
        log.error("QuickSpecs %s: no live response and no fixture at %s",
                  doc_id, SOURCE_DIR / f"{doc_id}.html")
        return False

    body_md = parse_html(html)
    bundle_dir.mkdir(parents=True, exist_ok=True)
    md_path.write_text(body_md)
    sidecar = {
        "bundle_id": bundle_id,
        "page_id": doc_id,
        "title": title,
        "ordinal": 1,
        "parent_title": None,
        "doc_id": doc_id,
        "version": version,
        "product": product,
        "source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
        "fetched_from": fetched_from,
    }
    json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
    log.info(" %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
    return True


def main() -> int:
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    p = argparse.ArgumentParser()
    p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50004260enw")
    p.add_argument("--bundle-id", default="hvm_quickspecs")
    p.add_argument("--title", default="HPE Morpheus VM Essentials Software QuickSpecs")
    p.add_argument("--version", default=None)
    p.add_argument("--force", action="store_true")
    args = p.parse_args()
    ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
                           args.version, force=args.force)
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,27 @@
# scrape/quickspecs/

Static HTML fixtures for HPE QuickSpecs documents that aren't reachable
from the runner (www.hpe.com edge drops connections from datacenter IPs
with non-browser User-Agents — verified 2026-05-22 with curl, wget, and
Anthropic's WebFetch).

## Workflow

1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
   browser, opens DevTools → Elements → Copy the `<body>` HTML.
2. Save it at `scrape/quickspecs/<doc_id>.html`.
3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
4. `python -m scrape.runner --bundle hvm_quickspecs --force` reads the
   committed HTML and writes `corpus/hvm_quickspecs/<doc_id>.{md,json}`.
5. Re-index and ship.

QuickSpecs only update every few months (HPE rebrand, new SKU added,
feature change). When a new version drops, refresh the local HTML
file and re-run the scrape.

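The fixture saved by the workflow above can also act as a fallback behind a live fetch. The sketch below shows that ordering in isolation: try a live fetcher first, then read `<doc_id>.html` from a fixture directory. `load_html` is a hypothetical helper, and the fetcher is passed in as a plain callable so the example makes no network calls.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def load_html(
    doc_id: str,
    fixture_dir: Path,
    fetch_live: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (html, origin), where origin is "live", "fixture", or "missing"."""
    html = fetch_live(doc_id)
    if html:
        return html, "live"
    fixture = fixture_dir / f"{doc_id}.html"
    if fixture.exists():
        return fixture.read_text(encoding="utf-8"), "fixture"
    return None, "missing"


if __name__ == "__main__":
    # Demo with a throwaway fixture directory and a fetcher that always fails.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a50004260enw.html").write_text("<body>fixture</body>", encoding="utf-8")
        print(load_html("a50004260enw", Path(tmp), lambda _doc_id: None))
```

Recording the origin alongside the HTML makes it easy to tell later whether a corpus page came from a fresh fetch or from a possibly stale fixture.
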
## Current fixtures

- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
  (Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
  (3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
  Essentials included.
@@ -0,0 +1,339 @@
|
||||
"""Scrape HVM doc bundles into corpus/<slug>/<page_id>.{md,json}.
|
||||
|
||||
Reads bundles.json (produced by scrape.bundles), then for each bundle:
|
||||
- mode="toc": walks the TOC tree, fetches each page via the render
|
||||
endpoint, converts page_html to markdown, writes
|
||||
<page_id>.md + <page_id>.json sidecar.
|
||||
- mode="single": fetches /document/{docId} directly, treats the whole
|
||||
body as one page with page_id = doc_id.
|
||||
|
||||
After all bundles are on disk, runs a finalize pass that synthesizes
|
||||
topic_cluster.clustered_topics for each page by looking up the same
|
||||
GUID in sibling bundles (HPE GUIDs are stable across versions — see
|
||||
reference_hpe_docs_portal_api.md).
|
||||
|
||||
Usage:
|
||||
python -m scrape.runner --all
|
||||
python -m scrape.runner --bundle hvm_user_manual_8_1_2
|
||||
python -m scrape.runner --all --force # re-download already-on-disk pages
|
||||
python -m scrape.runner --finalize-only # only redo the topic_cluster pass
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
from markdownify import markdownify as md
|
||||
|
||||
API = "https://support.hpe.com/hpesc/public/api/document"
|
||||
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}&page={page_id}.html"
|
||||
DOC_URL_SINGLE = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
|
||||
UA = "hvm-docs-mcp/0.1 (+https://git.jpaul.io/justin/hvm-docs; admin@jpaul.io)"
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
CORPUS = ROOT / "corpus"
|
||||
BUNDLES_JSON = ROOT / "bundles.json"
|
||||
|
||||
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")
|
||||
|
||||
|
||||
@dataclass
|
||||
class TocEntry:
|
||||
page_id: str
|
||||
title: str
|
||||
ordinal: int
|
||||
parent_title: str | None
|
||||
|
||||
|
||||
def _session() -> requests.Session:
|
||||
s = requests.Session()
|
||||
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
|
||||
return s
|
||||
|
||||
|
||||
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
|
||||
delay = 1.0
|
||||
for attempt in range(retries):
|
||||
r = s.get(url, timeout=30)
|
||||
if r.status_code == 200:
|
||||
return r.json() if expect_json else r.text
|
||||
if r.status_code == 404:
|
||||
return None
|
||||
if r.status_code in (429, 500, 502, 503, 504):
|
||||
time.sleep(delay)
|
||||
delay *= 2
|
||||
continue
|
||||
r.raise_for_status()
|
||||
raise RuntimeError(f"GET failed after {retries} retries: {url}")
|
||||
|
||||
|
||||
def _flatten_toc(toc: list[dict]) -> list[TocEntry]:
|
||||
out: list[TocEntry] = []
|
||||
ordinal = 0
|
||||
|
||||
def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
|
||||
nonlocal ordinal
|
||||
for node in nodes or []:
|
||||
title = node.get("topicName") or ""
|
||||
link = node.get("topicLink") or ""
|
||||
m = GUID_RE.search(link)
|
||||
if m:
|
||||
ordinal += 1
|
||||
out.append(TocEntry(page_id=m.group(1), title=title, ordinal=ordinal, parent_title=parent_title))
|
||||
walk(node.get("children"), title or parent_title)
|
||||
|
||||
walk(toc, None)
|
||||
return out
|
||||
|
||||
|
||||
def _strip_dita_wrappers(html: str) -> str:
|
||||
"""Remove the outer <main class="ditasrc">, drop the trademark Notices section,
|
||||
and unwrap aria-only span markup so markdownify produces clean text.
|
||||
|
||||
DITA's notices boilerplate repeats across every doc; if we leave it in,
|
||||
every page chunk inherits the same trademark text and pollutes retrieval."""
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
# Drop the Notices/Acknowledgments/Abstract boilerplate by section heading.
|
||||
# Every doc on the portal carries the same legal Notices and trademark
|
||||
# Acknowledgments; if we leave them in, every chunk inherits the same
|
||||
# text and pollutes retrieval. Abstract is one-line marketing.
|
||||
boilerplate = {"Notices", "Acknowledgments", "Abstract"}
|
||||
# Wrapped form: <article>/<section>/<div> whose first heading child is boilerplate.
|
||||
for sec in soup.select("article, section, div"):
|
||||
h = sec.find(["h1", "h2"], recursive=False)
|
||||
if h and h.get_text(strip=True) in boilerplate:
|
||||
sec.decompose()
|
||||
# Unwrapped form: bare <h1>/<h2>Boilerplate</h2> followed by its .desc/.body sibling.
|
||||
for h in soup.find_all(["h1", "h2"]):
|
||||
if h.get_text(strip=True) in boilerplate:
|
||||
sib = h.find_next_sibling()
|
||||
if sib and (sib.name in {"div", "section"}):
|
||||
cls = " ".join(sib.get("class", []) or [])
|
||||
if "desc" in cls or "body" in cls or "notices" in cls:
|
||||
sib.decompose()
|
||||
h.decompose()
|
||||
main = soup.find("main")
|
||||
return str(main) if main else str(soup)
|
||||
|
||||
|
||||
def html_to_md(page_html: str) -> str:
|
||||
cleaned = _strip_dita_wrappers(page_html)
|
||||
text = md(cleaned, heading_style="ATX", bullets="-")
|
||||
# collapse runs of blank lines
|
||||
text = re.sub(r"\n{3,}", "\n\n", text).strip()
|
||||
return text + "\n"
|
||||
|
||||
|
||||
def fetch_toc_page(s: requests.Session, doc_id: str, page_id: str) -> str:
|
||||
payload = _get(s, f"{API}/{doc_id}/render?page={page_id}.html", expect_json=True)
|
||||
if not payload:
|
||||
return ""
|
||||
return payload.get("page_html") or ""
|
||||
|
||||
|
||||
def fetch_single_doc(s: requests.Session, doc_id: str) -> tuple[str, str]:
|
||||
"""Returns (page_html, title) for a single-doc-shape bundle."""
|
||||
html = _get(s, f"{API}/{doc_id}")
|
||||
if not html:
|
||||
return "", ""
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
h1 = soup.select_one("h1.title.topictitle1")
|
||||
title = h1.get_text(" ", strip=True) if h1 else doc_id
|
||||
return html, title
|
||||
|
||||
|
||||
def write_page(bundle_dir: Path, page_id: str, body_md: str, sidecar: dict[str, Any], force: bool) -> bool:
|
||||
bundle_dir.mkdir(parents=True, exist_ok=True)
|
||||
md_path = bundle_dir / f"{page_id}.md"
|
||||
json_path = bundle_dir / f"{page_id}.json"
|
||||
if not force and md_path.exists() and json_path.exists():
|
||||
return False
|
||||
md_path.write_text(body_md)
|
||||
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
|
||||
return True
|
||||
|
||||
|
||||
def scrape_toc_bundle(s: requests.Session, bundle: dict, force: bool, concurrency: int) -> int:
|
||||
doc_id = bundle["doc_id"]
|
||||
slug = bundle["slug"]
|
||||
bundle_dir = CORPUS / slug
|
||||
|
||||
toc = _get(s, f"{API}/{doc_id}/toc", expect_json=True) or []
|
||||
entries = _flatten_toc(toc)
|
||||
print(f" {slug}: {len(entries)} pages", file=sys.stderr)
|
||||
|
||||
written = 0
|
||||
def do_one(entry: TocEntry) -> bool:
|
||||
page_html = fetch_toc_page(s, doc_id, entry.page_id)
|
||||
if not page_html:
|
||||
return False
|
||||
body_md = html_to_md(page_html)
|
||||
sidecar = {
|
||||
"bundle_id": slug,
|
||||
"page_id": entry.page_id,
|
||||
"title": entry.title,
|
||||
"ordinal": entry.ordinal,
|
||||
"parent_title": entry.parent_title,
|
||||
"doc_id": doc_id,
|
||||
"version": bundle.get("version"),
|
||||
"product": bundle.get("product"),
|
||||
"source_url": DOC_URL.format(doc_id=doc_id, page_id=entry.page_id),
|
||||
# topic_cluster filled in by finalize()
|
||||
}
|
||||
return write_page(bundle_dir, entry.page_id, body_md, sidecar, force)
|
||||
|
||||
with ThreadPoolExecutor(max_workers=concurrency) as pool:
|
||||
for fut in as_completed(pool.submit(do_one, e) for e in entries):
|
||||
if fut.result():
|
||||
written += 1
|
||||
return written
|
||||
|
||||
|
||||
def scrape_single_bundle(s: requests.Session, bundle: dict, force: bool) -> int:
    doc_id = bundle["doc_id"]
    slug = bundle["slug"]
    bundle_dir = CORPUS / slug

    html, title = fetch_single_doc(s, doc_id)
    if not html:
        print(f" ! {slug}: empty body", file=sys.stderr)
        return 0
    body_md = html_to_md(html)
    sidecar = {
        "bundle_id": slug,
        "page_id": doc_id,
        "title": title or bundle["title"],
        "ordinal": 1,
        "parent_title": None,
        "doc_id": doc_id,
        "version": bundle.get("version"),
        "product": bundle.get("product"),
        "source_url": DOC_URL_SINGLE.format(doc_id=doc_id),
    }
    print(f" {slug}: 1 page (single-doc)", file=sys.stderr)
    return 1 if write_page(bundle_dir, doc_id, body_md, sidecar, force) else 0


def finalize_clusters(bundles: list[dict]) -> int:
    """Cross-link sibling pages with the same GUID across version bundles.

    For TOC bundles, page_id == GUID; same GUID across two bundles = same
    underlying topic. For single-doc bundles (page_id == doc_id), peer them
    by matching product+version-sibling on the `product` field."""
    # GUID → list[(slug, sidecar_path, sidecar_dict)]
    guid_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
    # product → list[(slug, sidecar_path, sidecar_dict)] for single-doc peering
    product_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}

    for b in bundles:
        slug = b["slug"]
        bundle_dir = CORPUS / slug
        if not bundle_dir.exists():
            continue
        for jp in bundle_dir.glob("*.json"):
            data = json.loads(jp.read_text())
            pid = data["page_id"]
            if pid.startswith("GUID-"):
                guid_to_pages.setdefault(pid, []).append((slug, jp, data))
            else:
                product_to_pages.setdefault(b["product"], []).append((slug, jp, data))

    updated = 0
    # TOC pages — cluster by GUID
    for guid, peers in guid_to_pages.items():
        if len(peers) < 2:
            continue
        for slug, jp, data in peers:
            others = [
                {"bundle_id": s2, "page_id": guid, "clustering_title": d2.get("title", "")}
                for s2, _, d2 in peers if s2 != slug
            ]
            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
            jp.write_text(json.dumps(data, indent=2) + "\n")
            updated += 1
    # Single-doc pages — cluster by product (e.g. Release Notes 8.1.0/.1/.2)
    for product, peers in product_to_pages.items():
        if len(peers) < 2:
            continue
        for slug, jp, data in peers:
            others = [
                {"bundle_id": s2, "page_id": d2["page_id"], "clustering_title": d2.get("title", "")}
                for s2, _, d2 in peers if s2 != slug
            ]
            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
            jp.write_text(json.dumps(data, indent=2) + "\n")
            updated += 1

    return updated


def main() -> int:
    p = argparse.ArgumentParser(description="Scrape HVM bundles into corpus/.")
    p.add_argument("--all", action="store_true", help="scrape every bundle in bundles.json")
    p.add_argument("--bundle", action="append", help="scrape one bundle by slug (repeatable)")
    p.add_argument("--force", action="store_true", help="re-fetch pages already on disk")
    p.add_argument("--concurrency", type=int, default=6)
    p.add_argument("--finalize-only", action="store_true", help="only rebuild topic_cluster sidecar fields")
    args = p.parse_args()

    if not BUNDLES_JSON.exists():
        print("bundles.json missing — run `python -m scrape.bundles` first", file=sys.stderr)
        return 2

    bundles = json.loads(BUNDLES_JSON.read_text())

    if args.finalize_only:
        n = finalize_clusters(bundles)
        print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
        return 0

    if args.bundle:
        bundles = [b for b in bundles if b["slug"] in args.bundle]
        if not bundles:
            print(f"no bundles matched: {args.bundle}", file=sys.stderr)
            return 2
    elif not args.all:
        print("specify --all or --bundle <slug>", file=sys.stderr)
        return 2

    s = _session()
    total = 0
    for b in bundles:
        mode = b.get("mode")
        if mode == "single":
            total += scrape_single_bundle(s, b, args.force)
        elif mode == "html-file":
            # Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
            # to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
            from scrape.quickspecs import scrape_quickspecs
            ok = scrape_quickspecs(
                doc_id=b["doc_id"], bundle_id=b["slug"],
                title=b.get("title", b["doc_id"]),
                version=b.get("version"),
                product=b.get("product", "QuickSpecs"),
                source_url=b.get("source_url"),
                force=args.force,
            )
            total += 1 if ok else 0
        else:
            # Any other mode (including a missing one) is treated as a TOC bundle.
            total += scrape_toc_bundle(s, b, args.force, args.concurrency)
    print(f"scraped {total} new/updated pages", file=sys.stderr)

    # Always finalize after a scrape so sidecars are consistent.
    # bundles.json is re-read so clustering covers every bundle, not just the --bundle subset.
    all_bundles = json.loads(BUNDLES_JSON.read_text())
    n = finalize_clusters(all_bundles)
    print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)

    return 0


if __name__ == "__main__":
    sys.exit(main())