scrape: HVM bundles + runner for HPE Support DocPortal

Phase 1: scrape the User Manual (8.1.0/.1/.2), Release Notes (8.1.0/.1/.2), and the unversioned Deployment Guide. Total ~1,160 pages, 9.7 MB of markdown.

Discovery uses the anonymous JSON API at /hpesc/public/api/document/{docId}:
- /toc walks the page tree (for TOC-paginated docs)
- /render?page=GUID fetches per-page HTML
- /document/{docId} returns the whole body for single-doc shapes like Release Notes

The runner converts DITA-source HTML to clean markdown (stripping Notices/Acknowledgments/Abstract boilerplate) and writes corpus/<bundle>/<page>.{md,json}. A finalize pass then synthesizes topic_cluster.clustered_topics by GUID overlap across versions (HPE GUIDs are stable cross-version: confirmed 374/376/376 with 100% overlap on shared pages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
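
For reviewers, the GUID-overlap idea behind the finalize pass can be sketched roughly as follows. This is an illustrative sketch only, not the runner's actual code: the function names, the bundle labels, and the input shape (a mapping of bundle name to page GUIDs) are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_topics_by_guid(bundles):
    """Group pages that share a GUID across bundles (hypothetical helper).

    bundles: mapping of bundle name -> iterable of page GUIDs.
    Returns a mapping of GUID -> sorted list of bundle names containing it,
    keeping only GUIDs that appear in at least two bundles.
    """
    seen = defaultdict(set)
    for bundle, guids in bundles.items():
        for guid in guids:
            seen[guid].add(bundle)
    return {guid: sorted(names) for guid, names in seen.items() if len(names) > 1}


def overlap_ratio(a, b):
    """Fraction of the smaller GUID set that also appears in the other set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Made-up GUIDs and bundle names, purely for illustration.
bundles = {
    "um-8.1.0": ["g1", "g2"],
    "um-8.1.1": ["g1", "g2", "g3"],
    "um-8.1.2": ["g1", "g3", "g4"],
}
clusters = cluster_topics_by_guid(bundles)
ratio = overlap_ratio(bundles["um-8.1.0"], bundles["um-8.1.1"])
```

Here a GUID seen in only one bundle (g4) is left out of the clusters, and the overlap ratio is measured against the smaller bundle so that pages added in a later version do not lower it.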
2026-05-22 13:06:26 -04:00
parent 43728320bf
commit 7a491ba9e4
5 changed files with 633 additions and 0 deletions
@@ -10,10 +10,17 @@ ollama>=0.4.0      # if using Ollama-hosted embedder; swap if not
 # Scraping (Phase 1; adjust per product)
 beautifulsoup4>=4.12
 requests>=2.31
+markdownify>=0.11
 # playwright>=1.40  # uncomment if you need headless browser fallback

 # Evaluation
 numpy>=1.26

+# Reranker is a sidecar (see deploy/docker-compose.yml). The MCP server
+# only needs httpx (declared above) to call it. For the dev / CPU
+# fallback reranker (scripts/rerank_server.py), install
+# requirements-rerank.txt separately — it pulls in PyTorch which would
+# triple the production image size.
+
 # Dev / utility
 python-dateutil>=2.8