# scrape/

Product-specific. **You implement this for each product.** The template gives you the contract; the extraction logic depends on the upstream doc portal.

See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline expects.

---

## Product context: HPE Morpheus Enterprise Software

**This repo is for HPE Morpheus Enterprise**, the full cloud-management platform. It is a **different SKU** from HPE Morpheus VM Essentials (HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM docs here; they're a separate, smaller product (the "VM-only" subset of Morpheus). The Morpheus VM Essentials Deployment Guide refers to Morpheus Enterprise as the "elevate to" target; that is the relationship between the two.

`PRODUCT_NAME=morpheus`. The tool will be named `morpheus_api_lessons`, the collection `morpheus_docs`, etc.

### Upstream portal

HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and the Zerto docs). Anonymous JSON API, no auth required.

| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML: title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}`: single page body |

User-facing URL format: `https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`

### Bundle IDs (confirmed 2026-05-22)

**Morpheus Enterprise User Manual** (~569 pages each, full nested TOC):

| Version | docId |
|---|---|
| 8.1.0 | `sd00007510en_us` |
| 8.1.1 | `sd00007621en_us` |
| 8.1.2 | `sd00007732en_us` |

**Morpheus Enterprise Release Notes**: short, single-doc-blob shape (no TOC; the full body is returned by the `/document/{docId}` endpoint itself; the scraper needs a `--single-doc` mode for these):

| Version | docId |
|---|---|
| 8.1.0 | `sd00007496en_us` |
| 8.1.1 | `sd00007610en_us` |
| 8.1.2 | `sd00007733en_us` |

### Cross-version peers are free

GUIDs are stable across versions (confirmed on HVM where 374/376/376 pages had 100% GUID overlap between adjacent versions). Same-GUID = same-topic. Synthesize `topic_cluster.clustered_topics` by looking up the same GUID in the other bundle slugs; no fuzzy matching is needed.

### Reusable from hvm-docs

`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py` solve the identical portal shape. Copy and adapt the BUNDLES list + PRODUCT_NAME; the fetch logic should drop in unchanged. Both the TOC-paginated path and the single-doc path are needed (the HVM build covers both because HVM Release Notes follow the same shape).

## What you write

At minimum, two scripts:

### `scrape/bundles.py`

Discovers the upstream portal's bundle catalog and writes `bundles.json` at the repo root. One entry per bundle (versioned doc set) with the schema in PLAN.md.

```bash
python -m scrape.bundles
```

### `scrape/runner.py`

Scrapes the pages of each bundle (or a single bundle with `--bundle `). Writes:

- `corpus//.md`: extracted markdown body
- `corpus//.json`: per-page metadata sidecar

```bash
python -m scrape.runner --all --force --concurrency 6
python -m scrape.runner --bundle Admin.VC.HTML.10.9
```

## Tips

- **Sniff before you scrape.** Almost every modern doc portal is an SPA that calls a backend API. Open the browser's Network tab, click around, find the underlying JSON. Scraping the API is 10× cheaper and 100× more reliable than scraping the rendered HTML.
- **Idempotent re-scrapes.** Without `--force`, the runner should skip pages already on disk so a resume doesn't have to re-fetch everything. With `--force`, re-fetch every page; that's the weekly cron mode that catches edits.
- **Respect the portal.** Back off on 429s. Set a recognizable user-agent so the portal owner can identify you if they want to.
- **Whitespace normalize.** Markdown that round-trips through HTML often has extra blank lines. Normalize to a single blank line between paragraphs so diffs are clean (the changelog summary and digest tools care about line counts).

## What's already reusable

`scrape/changelog.py` is fully product-agnostic and ready to use as-is. It walks `git diff --name-status` output to produce a structured summary, and walks `git log` for the digest history (Phase 13).
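
The `/toc` endpoint listed under "Upstream portal" returns a nested tree, while the runner needs a flat, ordered page list. The sketch below shows one way to flatten it. The names `flatten_toc` and `render_url` are hypothetical, and it assumes `topicLink` carries the `GUID-XXXX.html` page name and that `children` may be missing or null on leaves; neither is confirmed against the live portal or `../hvm-docs/`.

```python
from typing import Any, Iterator


def flatten_toc(nodes: list[dict[str, Any]], depth: int = 0) -> Iterator[dict[str, Any]]:
    """Yield one flat record per linked TOC node, depth-first, in document order."""
    for node in nodes:
        link = node.get("topicLink")
        if link:  # assumption: nodes without a link are pure section headers
            yield {"page": link, "title": node.get("topicName", ""), "depth": depth}
        # assumption: `children` can be absent or null on leaf nodes
        yield from flatten_toc(node.get("children") or [], depth + 1)


def render_url(doc_id: str, page: str) -> str:
    """Build the per-page API URL in the shape shown in the endpoint table."""
    return (
        "https://support.hpe.com/hpesc/public/api/document/"
        f"{doc_id}/render?page={page}"
    )


# Hand-written sample tree (not real portal output) to show the traversal order.
sample_toc = [
    {"topicName": "Intro", "topicLink": "GUID-A.html", "children": [
        {"topicName": "Install", "topicLink": "GUID-B.html", "children": None},
    ]},
    {"topicName": "Section", "topicLink": "", "children": [
        {"topicName": "Leaf", "topicLink": "GUID-C.html"},
    ]},
]
pages = list(flatten_toc(sample_toc))
```

For single-doc Release Notes the TOC is empty or 404, so `flatten_toc([])` yields nothing and the runner would fall back to the `--single-doc` path.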
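
The exact-match lookup described under "Cross-version peers are free" can be sketched as a plain GUID index. The bundle slugs and the output shape below are illustrative assumptions; the real `topic_cluster.clustered_topics` schema lives in `PLAN.md` and is not reproduced here.

```python
from collections import defaultdict


def build_guid_index(pages_by_bundle: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each page GUID to the sorted list of bundle slugs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for slug, guids in pages_by_bundle.items():
        for guid in guids:
            index[guid].add(slug)
    return {guid: sorted(slugs) for guid, slugs in index.items()}


def peers_for(guid: str, own_slug: str, index: dict[str, list[str]]) -> list[str]:
    """Return the other bundles carrying the same GUID (exact match, no fuzzing)."""
    return [slug for slug in index.get(guid, []) if slug != own_slug]


# Hypothetical slugs and GUIDs, for illustration only.
pages_by_bundle = {
    "morpheus-8.1.0": ["GUID-A", "GUID-B"],
    "morpheus-8.1.1": ["GUID-A", "GUID-B", "GUID-C"],
    "morpheus-8.1.2": ["GUID-A", "GUID-C"],
}
guid_index = build_guid_index(pages_by_bundle)
```

A page that exists in only one version simply gets an empty peer list, which is the correct result rather than an error.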
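
The "Idempotent re-scrapes" tip comes down to a single check before each fetch. A minimal sketch, assuming the runner already knows the markdown and sidecar paths for a page (the helper name `needs_fetch` is hypothetical):

```python
from pathlib import Path


def needs_fetch(md_path: Path, json_path: Path, force: bool) -> bool:
    """Decide whether a page must be (re-)fetched.

    With force=True every page is fetched. Otherwise a page is skipped only
    when BOTH the markdown body and its metadata sidecar are already on disk,
    so a half-written pair from an interrupted run is fetched again.
    """
    if force:
        return True
    return not (md_path.exists() and json_path.exists())
```

Requiring both files is a design choice in this sketch: it treats a missing sidecar as an incomplete page rather than trusting the markdown alone.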
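
For the "Respect the portal" tip, the sketch below pairs a recognizable user-agent with exponential backoff on HTTP 429, using only the standard library. The user-agent string, the delay constants, and the function names are assumptions for illustration; whether the portal sends a `Retry-After` header has not been verified.

```python
import time
import urllib.error
import urllib.request
from typing import Optional

# Hypothetical value: replace with something that identifies you to the portal owner.
USER_AGENT = "morpheus-docs-scraper/0.1 (contact: you@example.com)"


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Honors a numeric Retry-After value if one is given; otherwise doubles
    the base delay each attempt. Both paths are capped at `cap`.
    """
    if retry_after and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    return min(base * (2 ** attempt), cap)


def fetch(url: str, max_attempts: int = 5) -> bytes:
    """GET a URL, retrying only on HTTP 429; other errors propagate."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on non-429 errors, or when the attempts are exhausted.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt, err.headers.get("Retry-After")))
    raise RuntimeError("max_attempts must be at least 1")
```

With `--concurrency 6`, each worker would back off independently in this sketch; a shared rate limiter would be gentler on the portal but is beyond this example.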
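
The "Whitespace normalize" tip can be implemented as a small pure function. This is a sketch, not the template's own code; the name `normalize_blank_lines` is hypothetical, and it deliberately ignores fenced code blocks, so blank-line runs inside a fence are collapsed too.

```python
import re


def normalize_blank_lines(markdown: str) -> str:
    """Collapse runs of blank lines to one and end the text with one newline."""
    # Strip trailing spaces so whitespace-only lines count as blank.
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    # Three or more newlines in a row means two or more blank lines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n") + "\n"


cleaned = normalize_blank_lines("Title\n\n\n\nBody text.  \n   \n\nMore.\n\n")
```

The function is idempotent, so running it on an already-normalized page produces no diff, which keeps the weekly `--force` run from generating spurious changelog entries.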