morpheus-docs/scrape/README.md

# scrape/

Product-specific. **You implement this for each product.** The
template gives you the contract; the extraction logic depends on
the upstream doc portal.

See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.

---

## Product context — HPE Morpheus Enterprise Software

**This repo is for HPE Morpheus Enterprise**, the full cloud-management
platform. It is a **different SKU** from HPE Morpheus VM Essentials
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
docs here; HVM is a separate, smaller product (the "VM-only" subset
of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
Morpheus Enterprise as the "elevate to" target — that's the
relationship.

Set `PRODUCT_NAME=morpheus`. The tool will be named
`morpheus_api_lessons`, the collection `morpheus_docs`, and so on.

### Upstream portal

HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
the Zerto docs). Anonymous JSON API, no auth required.

| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |

User-facing URL format:
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
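
The endpoints and URL format above can be sketched as a few small helpers. This is a minimal sketch, not the hvm-docs implementation: the function names are hypothetical, and `flatten_toc` assumes the TOC response (or its top level) is a list of nodes and that `children` may be missing or empty on leaves.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY_BASE = "https://support.hpe.com/hpesc/public/docDisplay"


def document_url(doc_id: str) -> str:
    """Title page / abstract, or the whole body for single-doc bundles."""
    return f"{API_BASE}/{quote(doc_id)}"


def toc_url(doc_id: str) -> str:
    """Nested TOC tree for a bundle."""
    return f"{document_url(doc_id)}/toc"


def render_url(doc_id: str, page: str) -> str:
    """Single page body; `page` looks like 'GUID-XXXX.html'."""
    return f"{document_url(doc_id)}/render?{urlencode({'page': page})}"


def display_url(doc_id: str, page: str) -> str:
    """User-facing URL, useful as the citation link in page metadata."""
    return f"{DISPLAY_BASE}?{urlencode({'docId': doc_id, 'page': page})}"


def flatten_toc(nodes, depth=0):
    """Yield (depth, topicName, topicLink) for every node, depth-first.

    Assumption: `nodes` is a list of dicts shaped like
    {topicName, topicLink, description, children}.
    """
    for node in nodes or []:
        yield depth, node.get("topicName"), node.get("topicLink")
        yield from flatten_toc(node.get("children"), depth + 1)
```

Walking the flattened TOC gives the runner its page list for a bundle. Whether `topicLink` is a bare `GUID-XXXX.html` or a longer path is worth checking against a live response before relying on it.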

### Bundle IDs (confirmed 2026-05-22)

**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:

| Version | docId |
|---|---|
| 8.1.0  | `sd00007510en_us` |
| 8.1.1  | `sd00007621en_us` |
| 8.1.2  | `sd00007732en_us` |

**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
(no TOC; full body returned by the `/document/{docId}` endpoint
itself; scraper needs a `--single-doc` mode for these):

| Version | docId |
|---|---|
| 8.1.0  | `sd00007496en_us` |
| 8.1.1  | `sd00007610en_us` |
| 8.1.2  | `sd00007733en_us` |
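
One way to carry the two tables above into `scrape/bundles.py` is a static list. This is a sketch under assumptions: the `Bundle` fields and the slug naming scheme are hypothetical, and the real entry schema is whatever `PLAN.md` specifies. Only the versions and docIds come from the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    slug: str         # hypothetical naming scheme, not from PLAN.md
    doc_id: str       # HPE DocPortal docId
    version: str
    single_doc: bool  # True: no TOC, body comes from /document/{docId}


USER_MANUALS = {
    "8.1.0": "sd00007510en_us",
    "8.1.1": "sd00007621en_us",
    "8.1.2": "sd00007732en_us",
}

RELEASE_NOTES = {
    "8.1.0": "sd00007496en_us",
    "8.1.1": "sd00007610en_us",
    "8.1.2": "sd00007733en_us",
}

BUNDLES = [
    Bundle(f"morpheus-user-manual-{version}", doc_id, version, single_doc=False)
    for version, doc_id in USER_MANUALS.items()
] + [
    Bundle(f"morpheus-release-notes-{version}", doc_id, version, single_doc=True)
    for version, doc_id in RELEASE_NOTES.items()
]
```

Keeping `single_doc` on the bundle lets the runner pick the TOC-paginated path or the single-doc path per bundle instead of probing the `/toc` endpoint each time.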

### Cross-version peers are free

GUIDs are stable across versions (confirmed on HVM, where 374/376/376
pages had 100% GUID overlap between adjacent versions). The same GUID
means the same topic, so synthesize `topic_cluster.clustered_topics`
by looking up the same GUID in the other bundle slugs; no fuzzy
matching is needed.
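
The GUID lookup can be sketched as a plain set-membership check. The function name, the `pages_by_bundle` input, and the shape of each returned entry are assumptions for illustration; the actual `clustered_topics` schema is defined by the rest of the pipeline.

```python
def cross_version_peers(page_id, bundle_slug, pages_by_bundle):
    """Return peers of `page_id` in every other bundle that has the same GUID.

    `pages_by_bundle` maps bundle slug -> set of page GUIDs in that bundle.
    The {"bundle", "page_id"} entry shape is hypothetical.
    """
    return [
        {"bundle": slug, "page_id": page_id}
        for slug in sorted(pages_by_bundle)
        if slug != bundle_slug and page_id in pages_by_bundle[slug]
    ]
```

In practice the lookup should stay within one document family (User Manual versions with each other, Release Notes with each other), since only adjacent versions of the same document were checked for overlap.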

### Reusable from hvm-docs

`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
solve the identical portal shape. Copy and adapt the BUNDLES list +
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
TOC-paginated path and the single-doc path are needed (the HVM
build covers both because HVM Release Notes follow the same shape).


## What you write

At minimum, two scripts:

### `scrape/bundles.py`

Discovers the upstream portal's bundle catalog and writes
`bundles.json` at the repo root. One entry per bundle (versioned doc
set) with the schema in `PLAN.md`.

```bash
python -m scrape.bundles
```

### `scrape/runner.py`

Scrapes the pages of each bundle (or a single bundle with `--bundle
<slug>`). Writes:

- `corpus/<bundle_id>/<page_id>.md` — extracted markdown body
- `corpus/<bundle_id>/<page_id>.json` — per-page metadata sidecar

```bash
python -m scrape.runner --all --force --concurrency 6
python -m scrape.runner --bundle <slug>
```
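
The output layout above might be produced by a helper like the following. It is a sketch: `page_paths` and `write_page` are hypothetical names, and the sidecar's fields are whatever `PLAN.md` requires, so `meta` is treated here as an opaque dict.

```python
import json
from pathlib import Path


def page_paths(corpus_root, bundle_id, page_id):
    """Return (markdown_path, sidecar_path) for one page."""
    base = Path(corpus_root) / bundle_id
    return base / f"{page_id}.md", base / f"{page_id}.json"


def write_page(corpus_root, bundle_id, page_id, markdown, meta):
    """Write the markdown body and its JSON metadata sidecar side by side."""
    md_path, meta_path = page_paths(corpus_root, bundle_id, page_id)
    md_path.parent.mkdir(parents=True, exist_ok=True)
    md_path.write_text(markdown, encoding="utf-8")
    # Sorted keys and a trailing newline keep sidecar diffs stable.
    meta_path.write_text(
        json.dumps(meta, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )
    return md_path, meta_path
```

Writing the sidecar with sorted keys means a re-scrape that changes nothing produces no diff.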

## Tips

- **Sniff before you scrape.** Almost every modern doc portal is an
  SPA that calls a backend API. Open the browser's Network tab,
  click around, and find the underlying JSON. Scraping the API is far
  cheaper and more reliable than scraping the rendered HTML.
- **Idempotent re-scrapes.** Without `--force`, the runner should
  skip pages already on disk so a resume doesn't have to re-fetch
  everything. With `--force`, re-fetch every page — that's the
  weekly cron mode that catches edits.
- **Respect the portal.** Back off on 429s. Set a recognizable
  user-agent so the portal owner can identify you if they want to.
- **Whitespace normalize.** Markdown that round-trips through HTML
  often has extra blank lines. Normalize to a single blank between
  paragraphs so diffs are clean (the changelog summary and digest
  tools care about line counts).
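
Three of the tips above reduce to small pure functions. This is a sketch with hypothetical names and defaults (the 1-second base and 60-second cap are illustrative, not values the portal documents).

```python
import re
from pathlib import Path


def should_fetch(md_path: Path, force: bool) -> bool:
    """Skip pages already on disk unless --force was given."""
    return force or not md_path.exists()


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    Exponential growth, capped. Base and cap are illustrative defaults.
    """
    return min(cap, base * 2 ** attempt)


def normalize_blank_lines(markdown: str) -> str:
    """Collapse runs of blank lines to one and strip trailing whitespace.

    Caveat: this also collapses blank lines inside fenced code blocks.
    """
    text = markdown.replace("\r\n", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n") + "\n"
```

If a 429 response carries a `Retry-After` header, honoring it is generally preferable to the computed delay.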

## What's already reusable

`scrape/changelog.py` is fully product-agnostic and ready to use
as-is. It walks `git diff --name-status` output to produce a
structured summary, and walks `git log` for the digest history
(Phase 13).
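
For orientation, `git diff --name-status` emits one tab-separated line per changed file, with a status letter first (renames and copies carry a similarity score, such as `R100`, and two paths). The sketch below tallies those lines by status letter. It is an illustration of the input format only, not the code in `scrape/changelog.py`, and `summarize_name_status` is a hypothetical name.

```python
from collections import Counter


def summarize_name_status(output: str) -> Counter:
    """Count changed files per status letter (A, M, D, R, ...)."""
    counts = Counter()
    for line in output.splitlines():
        if not line.strip():
            continue
        status = line.split("\t", 1)[0]
        # 'R100' and 'C75' reduce to 'R' and 'C'.
        counts[status[0]] += 1
    return counts
```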