morpheus-docs/scrape/README.md
justin fa448f94e1 build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13
Initial scaffold: the docs-mcp-template clone with all the
HVM-validated stack ported across, customized for Morpheus
Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).

Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0  sd00007510en_us  568 pages (Feb 2026)
* morpheus_user_manual_8_1_1  sd00007621en_us  569 pages (Mar 2026)
* morpheus_user_manual_8_1_2  sd00007732en_us  569 pages (Apr 2026)
* morpheus_release_notes_8_1_0  sd00007496en_us  single-doc
* morpheus_release_notes_8_1_1  sd00007610en_us  single-doc
* morpheus_release_notes_8_1_2  sd00007733en_us  single-doc
* morpheus_quickspecs            a50009231enw     html-file (live
  curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured —
  S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus
  services SKUs HA124A1#V38/V39 and H46SBA1).

No Deployment Guide or Qualification Matrix on HPE Support for
Morpheus Enterprise specifically — the only QM (sd00006551en_us)
covers HVM clusters managed by Morpheus and lives in hvm-docs.

Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py — including the
  MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py — 11 MCP tools, BM25-default search,
  cross-encoder rerank, hybrid behind HYBRID_SEARCH=true,
  morpheus_api_lessons (renamed from hvm_api_lessons), env-gated
  submit_doc_bug
* docs_mcp/api_lessons.md — Morpheus-specific scaffold covering
  licensing model, HVM elevation path, REST vs Plugin API, with
  TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py — TOC + single-doc
  + html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder
  queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml — same Gitea Actions
  setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml — morpheus-docs-mcp service definition,
  shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt

Verified locally: scrape produced 1599 .md pages (some TOC entries
are parent-only and yield no body), 6353 chunks all under the 4 KB
cap, MCP server boots and lists 11 tools cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:26:24 -04:00

# scrape/
Product-specific. **You implement this for each product.** The
template gives you the contract; the extraction logic depends on
the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.
---
## Product context — HPE Morpheus Enterprise Software
**This repo is for HPE Morpheus Enterprise**, the full cloud-management
platform. It is a **different SKU** from HPE Morpheus VM Essentials
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
docs here: HVM is a separate, smaller product (the "VM-only" subset
of Morpheus). The relationship between the two is that the Morpheus
VM Essentials Deployment Guide refers to Morpheus Enterprise as the
"elevate to" target.

`PRODUCT_NAME=morpheus`, so derived names follow from it: the tool is
`morpheus_api_lessons`, the collection is `morpheus_docs`, etc.
### Upstream portal
HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
the Zerto docs). Anonymous JSON API, no auth required.

| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}`: the body of a single page |

User-facing URL format:
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
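
The endpoint table above can be turned into a few small helpers. This is a minimal sketch, not the actual `scrape/runner.py`: the function names are hypothetical, and it assumes the `/toc` response is a list of nodes whose `children` key holds nested nodes of the same shape. The real payload may wrap the tree differently, so check a live response first.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY_BASE = "https://support.hpe.com/hpesc/public/docDisplay"


def toc_url(doc_id):
    """URL of the nested TOC JSON for a bundle."""
    return f"{API_BASE}/{quote(doc_id)}/toc"


def render_url(doc_id, page):
    """URL of the JSON payload for a single page body."""
    return f"{API_BASE}/{quote(doc_id)}/render?{urlencode({'page': page})}"


def display_url(doc_id, page):
    """User-facing URL, useful for citations in search results."""
    return f"{DISPLAY_BASE}?{urlencode({'docId': doc_id, 'page': page})}"


def flatten_toc(nodes, depth=0):
    """Yield (depth, topicName, topicLink) for every node, depth-first.

    Assumption: `nodes` is a list of dicts; a missing or null
    `children` value is treated as a leaf.
    """
    for node in nodes:
        yield depth, node.get("topicName"), node.get("topicLink")
        yield from flatten_toc(node.get("children") or [], depth + 1)
```

Parent-only TOC entries still appear in the flattened output; whether they yield a body is only known after fetching them.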
### Bundle IDs (confirmed 2026-05-22)
**Morpheus Enterprise User Manual** (~569 pages each, full nested TOC):

| Version | docId |
|---|---|
| 8.1.0 | `sd00007510en_us` |
| 8.1.1 | `sd00007621en_us` |
| 8.1.2 | `sd00007732en_us` |

**Morpheus Enterprise Release Notes** (short, single-doc-blob shape:
no TOC, and the full body is returned by the `/document/{docId}`
endpoint itself, so the scraper needs a `--single-doc` mode for these):

| Version | docId |
|---|---|
| 8.1.0 | `sd00007496en_us` |
| 8.1.1 | `sd00007610en_us` |
| 8.1.2 | `sd00007733en_us` |
### Cross-version peers are free
GUIDs are stable across versions (confirmed on HVM, where 374/376/376
pages had 100% GUID overlap between adjacent versions). The same GUID
therefore identifies the same topic, so `topic_cluster.clustered_topics`
can be synthesized by looking up that GUID in the other bundle slugs,
with no fuzzy matching needed.
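
As an illustration of that lookup, here is a minimal sketch. The function name, the input mapping, and the shape of each returned entry are assumptions made for the example; the real `topic_cluster.clustered_topics` schema is whatever `PLAN.md` specifies.

```python
def clustered_topics(pages_by_bundle, bundle, guid):
    """Return peer entries for `guid` in every other bundle that has it.

    pages_by_bundle: dict mapping bundle slug -> set of page GUIDs
    (hypothetical in-memory index built from the scraped corpus).
    """
    return [
        {"bundle": other, "page_id": guid}
        for other in sorted(pages_by_bundle)
        if other != bundle and guid in pages_by_bundle[other]
    ]


# Example with made-up GUIDs:
index = {
    "morpheus_user_manual_8_1_0": {"GUID-AAAA", "GUID-BBBB"},
    "morpheus_user_manual_8_1_1": {"GUID-AAAA", "GUID-BBBB", "GUID-CCCC"},
    "morpheus_user_manual_8_1_2": {"GUID-AAAA", "GUID-CCCC"},
}
peers = clustered_topics(index, "morpheus_user_manual_8_1_1", "GUID-CCCC")
```

Because the match is an exact set lookup, a page that exists in only one version simply gets an empty peer list.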
### Reusable from hvm-docs
`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
solve the identical portal shape. Copy and adapt the BUNDLES list +
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
TOC-paginated path and the single-doc path are needed (the HVM
build covers both because HVM Release Notes follow the same shape).
## What you write
At minimum, two scripts:
### `scrape/bundles.py`
Discovers the upstream portal's bundle catalog and writes
`bundles.json` at the repo root. One entry per bundle (versioned doc
set) with the schema in PLAN.md.
```bash
python -m scrape.bundles
```
### `scrape/runner.py`
Scrapes the pages of each bundle, or a single bundle with
`--bundle <slug>`. Writes:
- `corpus/<bundle_id>/<page_id>.md` — extracted markdown body
- `corpus/<bundle_id>/<page_id>.json` — per-page metadata sidecar
```bash
python -m scrape.runner --all --force --concurrency 6
python -m scrape.runner --bundle morpheus_user_manual_8_1_2
```
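
The output layout and the `--force` flag suggest a simple per-page check before fetching. The sketch below is one way to express it, assuming a page counts as "already on disk" only when both the `.md` body and the `.json` sidecar exist. The helper names are hypothetical and not taken from the real runner.

```python
from pathlib import Path


def page_paths(corpus_root, bundle_id, page_id):
    """Return (markdown_path, sidecar_path) for one page."""
    base = Path(corpus_root) / bundle_id
    return base / f"{page_id}.md", base / f"{page_id}.json"


def should_fetch(corpus_root, bundle_id, page_id, force=False):
    """Decide whether a page needs to be (re-)fetched.

    With force=True every page is fetched. Otherwise a page is skipped
    only if both of its files are present, so a run interrupted between
    the two writes is repaired on resume.
    """
    if force:
        return True
    md_path, meta_path = page_paths(corpus_root, bundle_id, page_id)
    return not (md_path.is_file() and meta_path.is_file())
```

Writing the sidecar last, after the markdown body, keeps this check meaningful as a completion marker.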
## Tips
- **Sniff before you scrape.** Almost every modern doc portal is an
SPA that calls a backend API. Open the browser's Network tab,
click around, find the underlying JSON. Scraping the API is 10×
cheaper and 100× more reliable than scraping the rendered HTML.
- **Idempotent re-scrapes.** Without `--force`, the runner should
skip pages already on disk so a resume doesn't have to re-fetch
everything. With `--force`, re-fetch every page — that's the
weekly cron mode that catches edits.
- **Respect the portal.** Backoff on 429s. Set a recognizable
user-agent so the portal owner can identify you if they want to.
- **Whitespace normalize.** Markdown that round-trips through HTML
often has extra blank lines. Normalize to a single blank between
paragraphs so diffs are clean (the changelog summary and digest
tools care about line counts).
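
Two of the tips above lend themselves to small pure functions. The following is a hedged sketch rather than the runner's actual code. `normalize_markdown` collapses runs of blank lines everywhere, including inside fenced code blocks, and strips trailing spaces, which removes Markdown hard line breaks. `backoff_delay` only understands a numeric `Retry-After` value (not the HTTP-date form) and omits jitter for clarity. Both names and the default values are assumptions.

```python
import re


def normalize_markdown(text):
    """Normalize line endings and collapse blank-line runs to one."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Exactly one trailing newline keeps file-level diffs stable.
    return text.strip("\n") + "\n"


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    A numeric Retry-After value from the server wins over the
    exponential schedule; both are capped at `cap` seconds.
    """
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(base * 2 ** attempt, cap)
```

In a real fetch loop the caller would sleep for `backoff_delay(...)` seconds and give up after a fixed number of attempts.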
## What's already reusable
`scrape/changelog.py` is fully product-agnostic and ready to use
as-is. It walks `git diff --name-status` output to produce a
structured summary, and walks `git log` for the digest history
(Phase 13).
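
For readers unfamiliar with the `git diff --name-status` format, the sketch below shows how such output can be grouped into a structured summary. It is illustrative only and is not the code in `scrape/changelog.py`. The function name and the summary keys are assumptions, and status codes other than `A`, `M`, `D`, and `R` are ignored.

```python
def summarize_name_status(output):
    """Group `git diff --name-status` lines by change type.

    Each line is tab-separated: a status code, then one path, or two
    paths for renames (whose code carries a similarity score, e.g. R100).
    """
    summary = {"added": [], "modified": [], "deleted": [], "renamed": []}
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        code = status[:1]
        if code == "A":
            summary["added"].append(paths[0])
        elif code == "M":
            summary["modified"].append(paths[0])
        elif code == "D":
            summary["deleted"].append(paths[0])
        elif code == "R" and len(paths) == 2:
            summary["renamed"].append((paths[0], paths[1]))
    return summary
```

The input would typically come from something like `subprocess.run(["git", "diff", "--name-status", "HEAD~1", "HEAD"], capture_output=True, text=True).stdout`.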