scrape: HVM bundles + runner for HPE Support DocPortal
Phase 1: scrape User Manual (8.1.0/.1/.2), Release Notes (8.1.0/.1/.2),
and the unversioned Deployment Guide. Total ~1,160 pages, 9.7 MB markdown.
Discovers pages via the anonymous JSON API at /hpesc/public/api/document/{docId}:
/toc walks the page tree (for TOC-paginated docs), /render?page=GUID
fetches per-page HTML, and /document/{docId} itself returns the whole body for
single-doc shapes like Release Notes.
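A minimal sketch of how the three endpoints above might be addressed, assuming /toc and /render are suffixes of the /hpesc/public/api/document/{docId} path. The host is passed in rather than guessed, and the TOC node keys (`guid`, `children`) are hypothetical names chosen for illustration; the real JSON shape is not shown in this commit.

```python
from urllib.parse import urlencode

# Path prefix taken from the commit message; the host is supplied by the caller.
API_PREFIX = "/hpesc/public/api/document"


def document_url(base: str, doc_id: str) -> str:
    """Whole-body endpoint, used for single-doc shapes such as Release Notes."""
    return f"{base.rstrip('/')}{API_PREFIX}/{doc_id}"


def toc_url(base: str, doc_id: str) -> str:
    """Page-tree endpoint for TOC-paginated documents."""
    return f"{document_url(base, doc_id)}/toc"


def render_url(base: str, doc_id: str, page_guid: str) -> str:
    """Per-page HTML endpoint for one page GUID."""
    return f"{document_url(base, doc_id)}/render?{urlencode({'page': page_guid})}"


def iter_page_guids(node: dict, guid_key: str = "guid", children_key: str = "children"):
    """Depth-first walk over a TOC tree, yielding every page GUID found.

    The key names are assumptions; adjust them to the actual /toc payload.
    """
    guid = node.get(guid_key)
    if guid:
        yield guid
    for child in node.get(children_key, []):
        yield from iter_page_guids(child, guid_key, children_key)
```

Each GUID yielded by `iter_page_guids` would then be passed to `render_url` to fetch that page.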
The runner converts DITA-source HTML to clean markdown (stripping Notices/
Acknowledgments/Abstract boilerplate) and writes corpus/<bundle>/<page>.{md,json};
a finalize pass then synthesizes topic_cluster.clustered_topics by GUID
overlap across versions (HPE GUIDs are stable across versions: confirmed
374/376/376 with 100% overlap on shared pages).
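The finalize pass described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's actual code: the function names, the `{version: {guid: path}}` input shape, and the overlap definition (shared GUIDs divided by the smaller version's GUID count) are all hypothetical.

```python
from collections import defaultdict


def cluster_by_guid(pages_by_version: dict) -> dict:
    """Group pages that share a GUID across versions.

    pages_by_version maps version -> {guid: page_path} (assumed shape).
    Returns guid -> {version: page_path} for GUIDs seen in two or more versions.
    """
    clusters = defaultdict(dict)
    for version, pages in pages_by_version.items():
        for guid, path in pages.items():
            clusters[guid][version] = path
    return {guid: members for guid, members in clusters.items() if len(members) > 1}


def overlap_ratio(guids_a, guids_b) -> float:
    """Fraction of the smaller GUID set that also appears in the other set."""
    a, b = set(guids_a), set(guids_b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0
```

Keying clusters on the GUID rather than on titles or paths relies on the commit's observation that GUIDs stay stable from one version to the next.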
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,10 @@
# Dev/CPU reranker — only for running scripts/rerank_server.py locally.
# Production uses the llama.cpp + jina-reranker GGUF sidecar (see
# deploy/docker-compose.yml). Install with:
#
#   pip install -r requirements-rerank.txt
#
# This adds PyTorch (~2 GB) and the sentence-transformers cross-encoder
# (cross-encoder/ms-marco-MiniLM-L-6-v2, ~22 MB). Keep out of the main
# requirements.txt so the production image stays slim.
sentence-transformers>=3.0
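For context, a local reranker built on this dependency might look roughly like the sketch below. It is not the contents of scripts/rerank_server.py, which this commit does not show; `load_cross_encoder_scorer` and `rerank` are hypothetical helpers. The only library call used is `CrossEncoder(...).predict(pairs)` from sentence-transformers, imported lazily so the ranking logic can be exercised without PyTorch installed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A scorer takes (query, document) pairs and returns one relevance score per pair.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def load_cross_encoder_scorer(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
) -> ScoreFn:
    """Build a scorer backed by the sentence-transformers cross-encoder."""
    # Lazy import: only needed when the dev/CPU reranker is actually used.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return lambda pairs: model.predict(pairs)


def rerank(
    query: str,
    docs: List[str],
    score_fn: ScoreFn,
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (doc, score) pairs sorted by descending relevance to the query."""
    if not docs:
        return []
    scores = score_fn([(query, doc) for doc in docs])
    ranked = sorted(
        zip(docs, (float(s) for s in scores)),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]
```

With the dependency installed, `rerank(query, docs, load_cross_encoder_scorer())` would score each candidate against the query; any other callable with the same signature can stand in for the model during testing.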