7a491ba9e4
Phase 1: scrape User Manual (8.1.0/.1/.2), Release Notes (8.1.0/.1/.2),
and the unversioned Deployment Guide. Total ~1,160 pages, 9.7 MB markdown.
Discovers pages via the anonymous JSON API at /hpesc/public/api/document/{docId}:
/toc walks the page tree (for TOC-paginated docs), /render?page=GUID
fetches per-page HTML, /document/{docId} returns the whole body for
single-doc shapes like Release Notes.
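
A minimal sketch of the discovery step described above, assuming the three endpoints hang off the document base path as written in this message. The `base_url` argument, the helper names, and especially the TOC JSON shape (nested nodes with `id` and `children` keys) are assumptions for illustration only; the real payload may be shaped differently.

```python
from typing import Any, Iterator
from urllib.parse import quote, urlencode

API_PREFIX = "/hpesc/public/api/document"  # path taken from the commit message


def document_url(base_url: str, doc_id: str) -> str:
    """Whole-body endpoint, used for single-doc shapes such as Release Notes."""
    return f"{base_url.rstrip('/')}{API_PREFIX}/{quote(doc_id, safe='')}"


def toc_url(base_url: str, doc_id: str) -> str:
    """Endpoint that exposes the page tree for TOC-paginated docs."""
    return f"{document_url(base_url, doc_id)}/toc"


def render_url(base_url: str, doc_id: str, page_guid: str) -> str:
    """Endpoint that returns the HTML for one page, addressed by GUID."""
    query = urlencode({"page": page_guid})
    return f"{document_url(base_url, doc_id)}/render?{query}"


def walk_toc(node: dict[str, Any]) -> Iterator[str]:
    """Yield page GUIDs depth-first.

    ASSUMED node shape: {"id": "<GUID>", "children": [<node>, ...]}.
    """
    guid = node.get("id")
    if guid:
        yield guid
    for child in node.get("children", []):
        yield from walk_toc(child)
```

The URL builders are kept separate from any HTTP client so the tree walk can be exercised against a saved TOC response without network access.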
Runner converts DITA-source HTML to clean markdown (strips Notices/
Acknowledgments/Abstract boilerplate), writes corpus/<bundle>/<page>.{md,json},
then a finalize pass synthesizes topic_cluster.clustered_topics by GUID
overlap across versions (HPE GUIDs are stable cross-version — confirmed
374/376/376 with 100% overlap on shared pages).
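
The finalize pass can be pictured roughly as follows. This is a sketch under stated assumptions, not the runner's actual code: the input shape (`{version: {guid: page_path}}`), the function names, and the overlap metric (share of the smaller GUID set found in the other) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def cluster_by_guid(
    pages_by_version: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Group pages that share a GUID: {guid: {version: page_path}}."""
    clusters: dict[str, dict[str, str]] = defaultdict(dict)
    for version, pages in pages_by_version.items():
        for guid, path in pages.items():
            clusters[guid][version] = path
    # Keep only GUIDs seen in more than one version; singletons have no siblings.
    return {guid: members for guid, members in clusters.items() if len(members) > 1}


def guid_overlap(a: set[str], b: set[str]) -> float:
    """Fraction of the smaller GUID set that also appears in the other set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0
```

Keying on the GUID rather than the page title means a page renamed between versions would still land in the same cluster, which is the property the stable-GUID observation relies on.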
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
27 lines
704 B
Plaintext
# MCP server
mcp[fastmcp]>=1.0.0
pydantic>=2.0
httpx>=0.27

# Vector store + embeddings
chromadb>=0.5.0
ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not

# Scraping (Phase 1; adjust per product)
beautifulsoup4>=4.12
requests>=2.31
markdownify>=0.11
# playwright>=1.40 # uncomment if you need headless browser fallback

# Evaluation
numpy>=1.26

# Reranker is a sidecar (see deploy/docker-compose.yml). The MCP server
# only needs httpx (declared above) to call it. For the dev / CPU
# fallback reranker (scripts/rerank_server.py), install
# requirements-rerank.txt separately — it pulls in PyTorch which would
# triple the production image size.

# Dev / utility
python-dateutil>=2.8