build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13

Initial scaffold: the docs-mcp-template clone with all the
HVM-validated stack ported across, customized for Morpheus
Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).

Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0  sd00007510en_us  568 pages (Feb 2026)
* morpheus_user_manual_8_1_1  sd00007621en_us  569 pages (Mar 2026)
* morpheus_user_manual_8_1_2  sd00007732en_us  569 pages (Apr 2026)
* morpheus_release_notes_8_1_0  sd00007496en_us  single-doc
* morpheus_release_notes_8_1_1  sd00007610en_us  single-doc
* morpheus_release_notes_8_1_2  sd00007733en_us  single-doc
* morpheus_quickspecs            a50009231enw     html-file (live
  curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured —
  S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus
  services SKUs HA124A1#V38/V39 and H46SBA1).

No Deployment Guide or Qualification Matrix on HPE Support for
Morpheus Enterprise specifically — the only QM (sd00006551en_us)
covers HVM clusters managed by Morpheus and lives in hvm-docs.

Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py — including the
  MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py — 11 MCP tools, BM25-default search,
  cross-encoder rerank, hybrid behind HYBRID_SEARCH=true,
  morpheus_api_lessons (renamed from hvm_api_lessons), env-gated
  submit_doc_bug
* docs_mcp/api_lessons.md — Morpheus-specific scaffold covering
  licensing model, HVM elevation path, REST vs Plugin API, with
  TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py — TOC + single-doc
  + html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder
  queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml — same Gitea Actions
  setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml — morpheus-docs-mcp service definition,
  shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt

Verified locally: scrape produced 1599 .md pages (some TOC entries
are parent-only and yield no body), 6353 chunks all under the 4 KB
cap, MCP server boots and lists 11 tools cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:26:24 -04:00
parent 43728320bf
commit fa448f94e1
22 changed files with 2822 additions and 247 deletions
+65 -44
@@ -14,21 +14,17 @@ on:
workflow_dispatch:
env:
REGISTRY_PUSH: <lan-host>:<port>
REGISTRY_PULL: <public-registry-hostname>
# Image name derives from the actual repo at runtime, so a clone
# doesn't need to find/replace anything. e.g. justin/my-product-docs.
# github.* context is Gitea Actions' inherited GitHub-Actions namespace
# — values come from the Gitea server, not github.com.
# PUSH goes to the LAN endpoint (HTTP) to bypass Cloudflare's 100 MB
# body cap. PULL uses the public hostname (HTTPS). Same Gitea registry.
REGISTRY_PUSH: 192.168.0.2:1234
REGISTRY_PULL: git.jpaul.io
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
OLLAMA_URL: http://<gpu-host>:11434
# Two GPU-pinned Ollama containers on the Gitea host — same infra
# zerto-docs uses. :11435 = Titan X, :11436 = 1080 Ti. Indexer
# round-robins per batch.
OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
EMBED_MODEL: nomic-embed-text
# PRODUCT_NAME defaults to the repo name so a clone works without
# editing. Override here if you want a different identifier (e.g.
# repo "my-product-docs" → PRODUCT_NAME "myproduct"). Used as the
# Chroma collection name, BM25 db filename, and MCP server name —
# see docs_mcp/server.py.
PRODUCT_NAME: ${{ github.event.repository.name }}
PRODUCT_NAME: morpheus
jobs:
build:
@@ -39,8 +35,7 @@ jobs:
- name: Checkout
uses: actions/checkout@v4
with:
# Full history (not shallow) so the digest-history step can
# walk git log up to --history-days back.
# Full history so digest-history can walk git log.
fetch-depth: 0
- name: Set up Python
@@ -54,9 +49,8 @@ jobs:
python -m pip install -q -r requirements.txt
- name: Refresh digest history
# Cheap (a few seconds); doesn't touch corpus content.
# Without this step, a code-only deploy would ship an
# increasingly-stale digest history relative to git.
# Cheap (few seconds). Without this step, a code-only deploy
# would ship an increasingly-stale digest history.
run: |
mkdir -p corpus/.digest
python -m scrape.changelog \
@@ -71,42 +65,69 @@ jobs:
- name: Rebuild indexes from existing corpus
run: python -m rag.index --rebuild
- name: Log in to registry (LAN endpoint)
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
with:
# LAN registry is HTTP only.
config-inline: |
[registry."192.168.0.2:1234"]
http = true
insecure = true
- name: Build & push image
- name: Configure registry credentials for buildx
env:
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
REGISTRY_USER: ${{ github.actor }}
run: |
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
DATE_TAG=$(date -u +%Y.%m.%d)
docker build \
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
.
docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
mkdir -p ~/.docker
AUTH=$(printf '%s:%s' "$REGISTRY_USER" "$REGISTRY_TOKEN" | base64 -w0)
cat > ~/.docker/config.json <<EOF
{
"auths": {
"192.168.0.2:1234": {
"auth": "$AUTH"
}
}
}
EOF
- name: Compute tags
id: meta
uses: docker/metadata-action@v5
with:
images: 192.168.0.2:1234/${{ github.repository_owner }}/${{ github.event.repository.name }}
tags: |
type=raw,value=latest
type=sha,prefix=,format=short
type=raw,value={{date 'YYYY.MM.DD'}}
labels: |
org.opencontainers.image.source=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
org.opencontainers.image.url=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
- name: Build & push (amd64)
uses: docker/build-push-action@v6
with:
context: .
platforms: linux/amd64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
- name: Link container package to this repo
# Gitea container packages are owned by a USER, not a repo —
# they don't auto-appear under the repo's Packages tab.
# This API call creates the association. One-time-effective:
# re-running returns 400 once linked, which we swallow.
# Endpoint requires Gitea 1.21+.
env:
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
run: |
OWNER="${{ github.repository_owner }}"
PKG="${{ github.event.repository.name }}"
BODY=$(mktemp)
CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
code=$(curl -s -o /tmp/link.out -w "%{http_code}" -X POST \
-H "Authorization: token ${GITEA_TOKEN}" \
"https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
echo "link http=$CODE body=$(cat "$BODY")"
case "$CODE" in
201) echo "linked package to ${OWNER}/${PKG}" ;;
400) echo "already linked (re-link returns 400) — ok" ;;
*) echo "unexpected status $CODE"; exit 1 ;;
"https://git.jpaul.io/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
echo "link ${OWNER}/container/${PKG} -> ${PKG}: HTTP ${code}"
body=$(cat /tmp/link.out)
case "$code" in
201) echo "OK — newly linked" ;;
400|409) echo "OK — already linked: ${body}" ;;
*) echo "unexpected: ${body}"; exit 1 ;;
esac
- name: Prune old container versions
+92 -52
@@ -19,27 +19,25 @@ on:
default: false
env:
# If your registry sits behind Cloudflare with its 100 MB body cap,
# use a LAN endpoint for pushes (bypasses CF) and the public hostname
# for pulls (response bodies aren't capped).
REGISTRY_PUSH: <lan-host>:<port>
REGISTRY_PULL: <public-registry-hostname>
# Image name derives from the actual repo at runtime, so a clone
# doesn't need to find/replace anything. e.g. justin/my-product-docs.
# github.* context is Gitea Actions' inherited GitHub-Actions namespace
# — values come from the Gitea server, not github.com.
# PUSH goes to the LAN endpoint (HTTP) to bypass Cloudflare Tunnel's
# 100 MB body cap. PULL uses the public hostname (HTTPS). Same Gitea
# registry either way — package lands under the same owner/repo.
REGISTRY_PUSH: 192.168.0.2:1234
REGISTRY_PULL: git.jpaul.io
# Image name derives from the repo at runtime — clones don't need to
# edit this. github.* is the Gitea-Actions inherited namespace.
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
# Embedder. One URL per GPU; the indexer round-robins.
OLLAMA_URL: http://<gpu-host>:11434
# Two GPU-pinned Ollama containers on the Gitea host — same infra
# zerto-docs uses (deploy/ollama-rag.docker-compose.yml over there).
# :11435 owns the Titan X, :11436 owns the 1080 Ti; the indexer
# round-robins per batch so both cards run in parallel. The host's
# primary Ollama on :11434 is left alone for OpenWebUI etc.
OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
EMBED_MODEL: nomic-embed-text
# PRODUCT_NAME defaults to the repo name so a clone works without
# editing. Override here if you want a different identifier (e.g.
# repo "my-product-docs" → PRODUCT_NAME "myproduct"). Used as the
# Chroma collection name, BM25 db filename, and MCP server name —
# see docs_mcp/server.py.
PRODUCT_NAME: ${{ github.event.repository.name }}
PRODUCT_NAME: morpheus
jobs:
refresh:
@@ -50,10 +48,12 @@ jobs:
- name: Checkout
uses: actions/checkout@v4
with:
# Full history — required for the digest-history step to
# walk git log. Default fetch-depth: 1 silently produces a
# 0-byte history file.
# Full history — required for digest-history. Default depth 1
# silently produces a 0-byte history file.
fetch-depth: 0
# Set the credentials Gitea injects so we can push corpus
# commits back. Persist them across the run.
token: ${{ secrets.GITEA_TOKEN }}
- name: Set up Python
uses: actions/setup-python@v5
@@ -89,8 +89,8 @@ jobs:
- name: Commit corpus changes (if any)
id: commit
run: |
git config user.name "<product>-docs-refresh"
git config user.email "actions@<your-domain>"
git config user.name "hvm-docs-refresh"
git config user.email "actions@jpaul.io"
git add bundles.json corpus
if git diff --cached --quiet; then
echo "no corpus changes — skipping reindex and image build"
@@ -132,49 +132,89 @@ jobs:
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
run: python -m rag.index --rebuild
# ---- Build & push image ------------------------------------
- name: Log in to registry (LAN endpoint)
# ---- Build & push image (LAN endpoint, buildx) -------------
- name: Set up Docker Buildx
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
uses: docker/setup-buildx-action@v3
with:
# LAN registry is HTTP only. Buildkit needs an explicit
# insecure-registry config or it tries to upgrade to HTTPS.
config-inline: |
[registry."192.168.0.2:1234"]
http = true
insecure = true
- name: Build & push image
- name: Configure registry credentials for buildx
# Can't use docker/login-action against the LAN endpoint —
# the host docker daemon errors on HTTP-vs-HTTPS. Buildx reads
# ~/.docker/config.json directly, so write the auth ourselves.
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
# Runner shell is /bin/sh — use cut instead of ${VAR::N}.
# Three tags: :latest (Watchtower target), :<sha12>
# (rollback pin), :<YYYY.MM.DD> (human-readable).
env:
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
REGISTRY_USER: ${{ github.actor }}
run: |
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
DATE_TAG=$(date -u +%Y.%m.%d)
docker build \
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
.
docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
mkdir -p ~/.docker
AUTH=$(printf '%s:%s' "$REGISTRY_USER" "$REGISTRY_TOKEN" | base64 -w0)
cat > ~/.docker/config.json <<EOF
{
"auths": {
"192.168.0.2:1234": {
"auth": "$AUTH"
}
}
}
EOF
- name: Compute tags
id: meta
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
uses: docker/metadata-action@v5
with:
# Tag with the LAN hostname so the push goes over LAN.
# docker-compose on the deploy host pulls via git.jpaul.io.
images: 192.168.0.2:1234/${{ github.repository_owner }}/${{ github.event.repository.name }}
tags: |
type=raw,value=latest
type=sha,prefix=,format=short
type=schedule,pattern={{date 'YYYY.MM.DD'}}
type=raw,value={{date 'YYYY.MM.DD'}}
# Override auto-derived labels with the PUBLIC URL so Gitea
# can auto-link the package back to this repo.
labels: |
org.opencontainers.image.source=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
org.opencontainers.image.url=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
- name: Build & push (amd64)
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
uses: docker/build-push-action@v6
with:
context: .
platforms: linux/amd64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
- name: Link container package to this repo
# Gitea container packages are owned by a USER, not a repo
# they don't auto-appear under the repo's Packages tab.
# This API call creates the association. One-time-effective:
# re-running returns 400 once linked, which we swallow.
# Endpoint requires Gitea 1.21+.
# Idempotent linkage so the package shows under the repo's
# Packages tab. Gitea's auto-link from the source label is
# unreliable in this setup (the runner reports an internal
# server URL), so we link explicitly. 201 = newly linked,
# 400 = already linked (treated as success).
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
env:
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
run: |
OWNER="${{ github.repository_owner }}"
PKG="${{ github.event.repository.name }}"
BODY=$(mktemp)
CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
code=$(curl -s -o /tmp/link.out -w "%{http_code}" -X POST \
-H "Authorization: token ${GITEA_TOKEN}" \
"https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
echo "link http=$CODE body=$(cat "$BODY")"
case "$CODE" in
201) echo "linked package to ${OWNER}/${PKG}" ;;
400) echo "already linked (re-link returns 400) — ok" ;;
*) echo "unexpected status $CODE"; exit 1 ;;
"https://git.jpaul.io/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
echo "link ${OWNER}/container/${PKG} -> ${PKG}: HTTP ${code}"
body=$(cat /tmp/link.out)
case "$code" in
201) echo "OK — newly linked" ;;
400|409) echo "OK — already linked: ${body}" ;;
*) echo "unexpected: ${body}"; exit 1 ;;
esac
# ---- Registry GC -------------------------------------------
+119
@@ -0,0 +1,119 @@
[
{
"slug": "morpheus_user_manual_8_1_0",
"doc_id": "sd00007510en_us",
"title": "HPE Morpheus Enterprise Software Documentation v8.1.0",
"version": "8.1.0",
"platform": null,
"product": "User Manual",
"language": "en-US",
"page_count": 568,
"mode": "toc",
"abstract": "",
"dates": {
"Published": "February 2026"
},
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007510en_us"
},
{
"slug": "morpheus_user_manual_8_1_1",
"doc_id": "sd00007621en_us",
"title": "HPE Morpheus Enterprise Software Documentation v8.1.1",
"version": "8.1.1",
"platform": null,
"product": "User Manual",
"language": "en-US",
"page_count": 569,
"mode": "toc",
"abstract": "",
"dates": {
"Published": "March 2026"
},
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007621en_us"
},
{
"slug": "morpheus_user_manual_8_1_2",
"doc_id": "sd00007732en_us",
"title": "HPE Morpheus Enterprise Software Documentation v8.1.2",
"version": "8.1.2",
"platform": null,
"product": "User Manual",
"language": "en-US",
"page_count": 569,
"mode": "toc",
"abstract": "",
"dates": {
"Published": "April 2026"
},
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007732en_us"
},
{
"slug": "morpheus_release_notes_8_1_0",
"doc_id": "sd00007496en_us",
"title": "v8.1.0 Release Notes",
"version": "8.1.0",
"platform": null,
"product": "Release Notes",
"language": "en-US",
"page_count": 1,
"mode": "single",
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.0",
"dates": {
"Published": "February 2026"
},
"landing_page": "sd00007496en_us",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007496en_us"
},
{
"slug": "morpheus_release_notes_8_1_1",
"doc_id": "sd00007610en_us",
"title": "v8.1.1 Release Notes",
"version": "8.1.1",
"platform": null,
"product": "Release Notes",
"language": "en-US",
"page_count": 1,
"mode": "single",
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.1",
"dates": {
"Published": "March 2026"
},
"landing_page": "sd00007610en_us",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007610en_us"
},
{
"slug": "morpheus_release_notes_8_1_2",
"doc_id": "sd00007733en_us",
"title": "v8.1.2 Release Notes",
"version": "8.1.2",
"platform": null,
"product": "Release Notes",
"language": "en-US",
"page_count": 1,
"mode": "single",
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.2",
"dates": {
"Published": "April 2026"
},
"landing_page": "sd00007733en_us",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007733en_us"
},
{
"slug": "morpheus_quickspecs",
"doc_id": "a50009231enw",
"title": "HPE Morpheus Enterprise Software QuickSpecs",
"version": "v1",
"platform": null,
"product": "QuickSpecs",
"language": "en-US",
"page_count": 1,
"mode": "html-file",
"abstract": "",
"dates": {},
"landing_page": "a50009231enw",
"source_url": "https://www.hpe.com/psnow/doc/a50009231enw"
}
]
+23 -17
@@ -1,6 +1,6 @@
# Hosting stack for a docs MCP server.
#
# Replace <product> below with your product name on first deploy.
# Replace hvm below with your product name on first deploy.
# Volumes: usage logs are mounted to a host path so they survive
# Watchtower-driven container recreates.
#
@@ -10,15 +10,15 @@
services:
# The MCP server. Watchtower auto-pulls on :latest changes.
<product>-docs-mcp:
image: <registry>/<owner>/<product>-docs-mcp:latest
container_name: <product>-docs-mcp
morpheus-docs-mcp:
image: git.jpaul.io/justin/morpheus-docs:latest
container_name: morpheus-docs-mcp
restart: unless-stopped
ports:
- "8000:8000"
environment:
PRODUCT_NAME: "<product>"
PRODUCT_DOCS_URL: "https://docs.example.com"
PRODUCT_NAME: "morpheus"
PRODUCT_DOCS_URL: "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007732en_us"
# Streamable-HTTP transport. Stateless mode is required for
# production: clients don't lose sessions when Watchtower
@@ -28,19 +28,21 @@ services:
MCP_PORT: "8000"
# If you run MetaMCP or another gateway in front and reach
# this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
# this container via its compose DNS name (e.g. morpheus-docs-mcp:8000),
# add that hostname here. "*" disables the rebind check entirely.
MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"
MCP_ALLOWED_HOSTS: "morpheus-docs-mcp,localhost,127.0.0.1"
# Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
RERANK_URL: http://<product>-rerank:8080
RERANK_URL: http://hvm-rerank:8080
RERANK_POOL: "200"
RERANK_TIMEOUT: "30"
# Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
# only after the eval harness shows the dense-only path
# missing technical-term queries that BM25 catches.
HYBRID_SEARCH: "true"
# Phase 8 — hybrid retrieval (BM25 + dense + RRF).
# Eval on the HVM corpus (eval/results/baseline.md, 2026-05-22) shows
# BM25-default + reranker beats hybrid on every metric (MRR 0.920 vs
# 0.875). Leaving HYBRID_SEARCH off so search_docs runs BM25-first +
# reranker; dense is the fallback when BM25 finds nothing.
HYBRID_SEARCH: "false"
# Phase 10 — usage telemetry.
USAGE_LOG_DIR: /app/var/logs
@@ -52,9 +54,9 @@ services:
# DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
volumes:
# Usage logs persist across container recreates.
- ./<product>-docs-mcp-logs:/app/var/logs
- ./morpheus-docs-mcp-logs:/app/var/logs
depends_on:
- <product>-rerank
- hvm-rerank
labels:
# Watchtower polls *only* containers with this label set true.
com.centurylinklabs.watchtower.enable: "true"
@@ -63,9 +65,13 @@ services:
# Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
# Requires GPU access; adjust runtime/devices for your hardware.
<product>-rerank:
#
# For dev / CPU-only hosts, swap this service for scripts/rerank_server.py
# (sentence-transformers ms-marco-MiniLM-L-6-v2). Same /v1/rerank shape,
# ~500ms/batch on CPU vs ~50ms on GPU with the jina GGUF.
hvm-rerank:
image: ghcr.io/ggml-org/llama.cpp:server-cuda
container_name: <product>-rerank
container_name: hvm-rerank
restart: unless-stopped
# Mount the GGUF model from the host. Download from huggingface
# (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
+148
@@ -0,0 +1,148 @@
# HPE Morpheus Enterprise — Lessons
Notes and gotchas about running, integrating with, and licensing
**HPE Morpheus Enterprise Software** that aren't obvious from the
official docs alone. The official User Manual + Release Notes +
QuickSpecs describe the product as designed; this file is what
experienced operators actually learn.
> Treat this as living context. Update it when you (or the LLM
> driving this MCP) discover something non-obvious that the docs
> don't say or don't make findable. Each section is an H2 so the
> `morpheus_api_lessons(topic=...)` tool can return just the
> relevant piece.
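A minimal sketch of how a per-topic lookup over H2 sections could work, assuming the file is split on lines that start with `## ` and that the topic is matched case-insensitively as a substring of the heading. The names `split_h2_sections` and `lessons_for_topic` are hypothetical and exist only for illustration; the actual `morpheus_api_lessons` tool in `docs_mcp/server.py` may behave differently.

```python
from typing import Optional


def split_h2_sections(markdown: str) -> dict[str, str]:
    """Map each H2 heading to the body text beneath it (hypothetical helper)."""
    sections: dict[str, list[str]] = {}
    current: Optional[str] = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            # A new H2 starts a new section; H1 and H3+ lines do not match.
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Lines before the first H2 (title, intro) are ignored.
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def lessons_for_topic(markdown: str, topic: str) -> Optional[str]:
    """Return the first section whose heading contains `topic`, else None."""
    needle = topic.lower()
    for title, body in split_h2_sections(markdown).items():
        if needle in title.lower():
            return f"## {title}\n{body}"
    return None
```

Under these assumptions, `lessons_for_topic(text, "licensing")` would return only the "Licensing and SKUs" section, which is why keeping one topic per H2 matters.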
## TL;DR
- **Morpheus Enterprise is the full cloud-management platform.** HPE
Morpheus VM Essentials (HVM) is the VM-only subset; Morpheus
Enterprise is what you "elevate to" when you need multi-cloud,
containers, automation, policy, FinOps, ITSM integration, and
self-service catalogs. The relationship is a one-way upgrade.
- **Licensing is per physical CPU socket** on connected on-prem
clouds (bare metal, hypervisor hosts, Kubernetes worker nodes).
Public-cloud workloads (AWS / Azure / GCP / OCI) are factored at
**15 workloads per socket** equivalent.
- **All license SKUs include Tech Care Essentials 24×7** as part
of the license cost. There is no separate purchase for support
on the license tier.
- **`morpheus_quickspecs` is the source of truth for SKUs.** Don't
guess part numbers; query the QuickSpecs bundle.
## Licensing and SKUs
**Source of truth: the `morpheus_quickspecs` bundle.** Query it for
the current SKU list — the catalog updates more often than this
file does.
Pricing model summary (from QuickSpecs v1, 2026):
- **Per physical CPU socket** for connected on-prem clouds —
KVM/HVM hosts, VMware ESXi hosts, bare metal servers, Kubernetes
worker nodes. Count the **sockets**, not the cores and not the VMs.
- **Public cloud workloads factor at 15:1** — one socket of license
covers up to 15 public-cloud workloads (instances) across AWS,
Azure, GCP, OCI.
- **Term-based** licensing (not perpetual). 1-, 3-, and 5-year terms
on E-LTU SKUs.
- **All include HPE Tech Care Essentials** (24×7 support, 15-minute
response for severity-1) bundled into the license cost.
> The exact ratios and SKU names can change between QuickSpecs
> revisions. Use the `morpheus_quickspecs` tool / bundle for current
> values rather than memorizing.
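The sizing arithmetic above can be sketched as follows. This is only an illustration of the summary in this section: the 15:1 ratio is the figure quoted here, and rounding a partial group of public-cloud workloads up to a whole socket is an assumption, not something this file confirms. The function name is hypothetical. Check the QuickSpecs bundle before relying on the result.

```python
import math

# Ratio quoted in this section; an assumption until verified against QuickSpecs.
WORKLOADS_PER_SOCKET = 15


def required_socket_licenses(on_prem_sockets: int, public_cloud_workloads: int) -> int:
    """Estimate socket-equivalent licenses (hypothetical helper).

    On-prem hosts count one license per physical CPU socket.
    Public-cloud workloads are grouped 15 per socket-equivalent;
    rounding partial groups UP is an assumption of this sketch.
    """
    if on_prem_sockets < 0 or public_cloud_workloads < 0:
        raise ValueError("counts must be non-negative")
    cloud_equivalent = math.ceil(public_cloud_workloads / WORKLOADS_PER_SOCKET)
    return on_prem_sockets + cloud_equivalent
```

For example, 8 on-prem sockets plus 31 public-cloud workloads comes to 8 + 3 = 11 socket-equivalents under these assumptions.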
## Elevation from HVM
The "elevate to Morpheus Enterprise" path is the canonical journey
for customers who started on HVM and outgrew it:
- **HVM clusters keep working unchanged after elevation.** You
don't redeploy the manager; you upgrade-in-place using a
Morpheus Enterprise license.
- **What changes:** the manager UI unlocks the full Enterprise
feature set — public-cloud integrations, container/Kubernetes
management, blueprints/catalogs, automation workflows, policy
engine, FinOps cost dashboards, ITSM connectors (ServiceNow etc.),
and the full REST API surface.
- **Existing HVM-tier work products survive the elevation:**
Instance backups, network pools, storage providers, user
accounts, integrations, scheduled jobs, etc.
The HVM User Manual page `Elevating to HPE Morpheus Enterprise`
(GUID-ECCA4FDD-37C8-45CE-A71F-C6E73B3BA713) walks the procedure.
See also the sibling `hvm-docs` MCP's
`hvm_user_manual_8_1_*` bundles.
## API surface — Plugin vs REST
Morpheus exposes two completely separate extensibility surfaces:
- **REST API** at `https://<manager>/api/` — external automation
and integration. Bearer-token authentication; tokens issued from
the user profile → API tokens UI. The full Enterprise API surface
is available here; HVM-only managers return 404 on Enterprise-only
endpoints.
- **Plugin API** — server-side extensions that load INTO the
manager process. Versioned independently of the platform
(Plugin API version listed in the Release Notes for each
Morpheus version). A plugin built for Plugin API 1.3.x may not
load on 1.4.x without changes.
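As a rough illustration of the REST side, the sketch below builds (but does not send) a bearer-authenticated request against the `https://<manager>/api/` base described above. The helper name, the host, and the `example-resource` path are placeholders rather than real Morpheus endpoints, and the `Bearer` header form is the generic HTTP convention, assumed here rather than confirmed by this file.

```python
import urllib.request


def build_api_request(manager_host: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET request with a bearer token (hypothetical helper).

    `path` is relative to the /api/ base; the caller supplies a real
    endpoint taken from the official API documentation.
    """
    url = f"https://{manager_host}/api/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Usage (placeholder values only; nothing is sent until urlopen is called):
req = build_api_request("manager.example.com", "/example-resource", "TOKEN")
```

Passing the request to `urllib.request.urlopen(req)` would perform the call; on an HVM-only manager, an Enterprise-only endpoint would be expected to surface as an HTTP 404 error per the note above.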
**TODO — fill in real operational lessons as we accumulate them.**
## Multi-cloud onboarding
**TODO.** Each cloud (AWS, Azure, GCP, OCI, VMware vSphere, KVM/HVM,
OpenStack, Nutanix, etc.) has its own onboarding ritual: credentials,
networking, IAM roles, regions, storage providers, image catalogs.
Search the User Manual: `search_docs(query="Add AWS cloud
integration")`, `search_docs(query="Azure subscription cost")`, etc.
## Tenancy, RBAC, and groups
**TODO.** Morpheus Enterprise tenancy is one of the more complex areas
— tenants, roles, groups, account groups, persona-based access.
Lessons specific to "what surprised me" go here.
## Backups
**TODO.** Morpheus Enterprise inherits the backup framework HVM
introduced (Storage Buckets, Execution Schedules, Backup Jobs)
and adds: cloud-native backup integrations (AWS Backup, Azure
Backup), per-instance backup policies via the policy engine,
ServiceNow-driven backup orchestration. Document the gotchas you
hit.
## Common operational gotchas
**TODO.** This is where the "experienced operator hallway
conversation" notes go. Examples to seed (delete or replace as you
learn):
- **Service plan vs Instance type** — same concept, different
contexts. A service plan is the sizing template ("small / medium
/ large with these CPU/RAM"); an instance type is what you
provision FROM the plan. Operators conflate them.
- **Cloud integration credentials are tenant-scoped, not
global.** Adding a credential at the master tenant doesn't
cascade — sub-tenants need their own (or the policy engine
granting access).
- **Policy engine vs Logic library** — both live under
Library/Automation, both can gate provisioning. Policies are
preventive (block bad config), logic is generative (run scripts
on lifecycle events). Pick the right tool.
## Adding to this doc
Two ways:
1. Manually edit `docs_mcp/api_lessons.md` in this repo and commit.
The next image build picks it up.
2. Use `submit_doc_bug` for upstream issues, and append the
takeaway here once the docs team responds.
The point of this doc is to surface the kind of context an
experienced operator would mention in a hallway conversation but
that doesn't quite fit anywhere in the formal product docs. Keep
sections tight — one H2 = one topic the LLM can return on demand.
+1077 -40
File diff suppressed because it is too large
+4
@@ -0,0 +1,4 @@
{"query": "what's the per-socket licensing model for Morpheus Enterprise", "expected": [{"bundle_id": "morpheus_quickspecs", "page_id": "a50009231enw"}], "tags": ["licensing", "skus"]}
{"query": "add an AWS cloud integration", "expected": [], "tags": ["cloud", "TODO-populate-after-first-scrape"]}
{"query": "Plugin API version compatibility", "expected": [], "tags": ["api", "TODO"]}
{"query": "Morpheus Enterprise 8.1.2 what's new", "expected": [{"bundle_id": "morpheus_release_notes_8_1_2", "page_id": "sd00007733en_us"}], "tags": ["release-notes"]}
+118 -28
@@ -10,7 +10,7 @@ to one entry; the highest-ranked chunk's position wins).
"""
from __future__ import annotations
from typing import Protocol, Iterable
from typing import Iterable, Protocol
class Retriever(Protocol):
@@ -21,12 +21,17 @@ class Retriever(Protocol):
...
def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
"""Take a stream of (bundle_id, page_id, chunk_ordinal) and return
the first k unique pages in their first-seen order."""
def _split_chunk_id(chunk_id: str) -> tuple[str, str, int]:
"""`bundle::page::ordinal` -> (bundle, page, int(ordinal))."""
bid, pid, ordinal = chunk_id.split("::")
return bid, pid, int(ordinal)
def _collapse_to_pages(chunk_ids: Iterable[str], k: int) -> list[tuple[str, str]]:
seen: set[tuple[str, str]] = set()
out: list[tuple[str, str]] = []
for bid, pid, _ord in chunk_ids:
for cid in chunk_ids:
bid, pid, _ord = _split_chunk_id(cid)
key = (bid, pid)
if key in seen:
continue
@@ -37,26 +42,111 @@ def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> lis
return out
# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
# in place. Each one is small (15-30 LOC). The eval harness imports
# from this module by class name.
#
# class DenseRetriever:
# name = "dense"
# def __init__(self, collection): self.col = collection
# def retrieve(self, query, k=10): ...
#
# class RerankedRetriever:
# name = "dense+rerank"
# def __init__(self, collection, rerank_url, pool=200): ...
# def retrieve(self, query, k=10): ...
#
# class BM25Retriever:
# name = "bm25"
# def __init__(self, bm25_index): ...
# def retrieve(self, query, k=10): ...
#
# class HybridRetriever:
# name = "bm25+dense+rrf"
# def __init__(self, dense, bm25, k_rrf=60): ...
# def retrieve(self, query, k=10): ...
class DenseRetriever:
"""Chroma cosine search via the live embedding function."""
name = "dense"
def __init__(self, collection, pool: int = 50):
self.col = collection
self.pool = pool
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
res = self.col.query(query_texts=[query], n_results=self.pool)
ids = (res.get("ids") or [[]])[0]
return _collapse_to_pages(ids, k)
class BM25Retriever:
"""SQLite FTS5 lexical search."""
name = "bm25"
def __init__(self, bm25_index, pool: int = 200):
self.bm = bm25_index
self.pool = pool
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
hits = self.bm.query(query, n=self.pool)
return _collapse_to_pages((cid for cid, _score in hits), k)
class HybridRetriever:
"""Reciprocal Rank Fusion of dense + BM25 rankings."""
name = "hybrid_rrf"
def __init__(self, dense: DenseRetriever, bm25: BM25Retriever, k_rrf: int = 60, pool: int = 100):
self.dense = dense
self.bm25 = bm25
self.k_rrf = k_rrf
self.pool = pool
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
dense_pages = self.dense.retrieve(query, k=self.pool)
bm25_pages = self.bm25.retrieve(query, k=self.pool)
scores: dict[tuple[str, str], float] = {}
for rank, page in enumerate(dense_pages, start=1):
scores[page] = scores.get(page, 0.0) + 1.0 / (self.k_rrf + rank)
for rank, page in enumerate(bm25_pages, start=1):
scores[page] = scores.get(page, 0.0) + 1.0 / (self.k_rrf + rank)
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
return [page for page, _s in ranked[:k]]
def _rerank_pool(rerank_url: str, query: str, ids_and_texts: list[tuple[str, str]],
                 timeout: float = 30.0) -> list[str] | None:
    """POST to /v1/rerank, return ids in reranked order. None on failure."""
    if not ids_and_texts:
        return []
    import httpx
    try:
        with httpx.Client(timeout=timeout) as c:
            r = c.post(f"{rerank_url}/v1/rerank", json={
                "query": query,
                "documents": [(t or "")[:2000] for _i, t in ids_and_texts],
                "top_n": len(ids_and_texts),
            })
            r.raise_for_status()
            results = r.json().get("results") or []
            return [ids_and_texts[item["index"]][0] for item in results
                    if isinstance(item.get("index"), int)
                    and 0 <= item["index"] < len(ids_and_texts)]
    except Exception:
        return None
class RerankedRetriever:
    """Pull a candidate pool via a base retriever, then cross-encoder re-rank."""

    def __init__(self, base: Retriever, collection, rerank_url: str, name_suffix: str = "rerank",
                 pool: int = 50, timeout: float = 30.0):
        self.base = base
        self.col = collection
        self.url = rerank_url
        self.name = f"{base.name}+{name_suffix}"
        self.pool = pool
        self.timeout = timeout

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        # Base returns deduplicated page-level tuples; rerank needs CHUNK-level
        # texts to be informative. Pull each page's chunk 0 text from Chroma.
        pages = self.base.retrieve(query, k=self.pool)
        if not pages:
            return []
        chunk_ids = [f"{bid}::{pid}::0" for bid, pid in pages]
        g = self.col.get(ids=chunk_ids, include=["documents"])
        by_id = dict(zip(g["ids"], g["documents"]))
        ids_and_texts = [(cid, by_id.get(cid, "")) for cid in chunk_ids]
        order = _rerank_pool(self.url, query, ids_and_texts, timeout=self.timeout)
        if order is None:
            return pages[:k]
        out: list[tuple[str, str]] = []
        seen: set[tuple[str, str]] = set()
        for cid in order:
            bid, pid, _ = cid.split("::")
            key = (bid, pid)
            if key in seen:
                continue
            seen.add(key)
            out.append(key)
            if len(out) >= k:
                break
        return out
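The Reciprocal Rank Fusion step in `HybridRetriever.retrieve` can be tried on its own. What follows is a minimal sketch, assuming two already-ranked page lists; `rrf_fuse` and the sample page ids are hypothetical names for illustration and are not part of `eval/retrievers.py`.

```python
def rrf_fuse(rankings, k_rrf=60, k=10):
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    Each item scores sum(1 / (k_rrf + rank)) over the lists it appears in,
    with rank starting at 1, mirroring the two loops in HybridRetriever.
    """
    scores = {}
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (k_rrf + rank)
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [page for page, _score in ranked[:k]]


# Hypothetical (bundle_id, page_id) tuples.
dense = [("um", "A"), ("um", "B"), ("um", "C")]
bm25 = [("um", "B"), ("um", "D"), ("um", "A")]
fused = rrf_fuse([dense, bm25])
print(fused)
```

Pages that appear in both lists (`A` and `B` here) accumulate two reciprocal terms and so outrank pages found by only one retriever, which is the point of fusing by rank rather than by raw score.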
+81 -9
@@ -76,15 +76,87 @@ def main() -> int:
queries = load_queries(args.queries)
print(f"loaded {len(queries)} queries")
# TODO Phase 7: instantiate the retrievers you implemented in
# eval/retrievers.py and run each one against each query.
# Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
# markdown table to args.output. Commit the file alongside the
# PR that changes retrieval.
raise NotImplementedError(
"Wire up the retrievers in eval/retrievers.py first, then "
"fill in this evaluation loop. See PLAN.md Phase 7."
)
import os
import chromadb
from chromadb.config import Settings
from rag.embeddings import embedding_function
from rag.bm25 import BM25Index
from eval.retrievers import BM25Retriever, DenseRetriever, HybridRetriever, RerankedRetriever
# Default to this repo's product so the collection resolves to morpheus_docs.
product = os.environ.get("PRODUCT_NAME", "morpheus")
repo_root = Path(__file__).resolve().parent.parent
client = chromadb.PersistentClient(path=str(repo_root / "chroma"),
settings=Settings(anonymized_telemetry=False))
col = client.get_collection(f"{product}_docs", embedding_function=embedding_function())
bm = BM25Index(str(repo_root / "bm25" / f"{product}_docs.db"))
dense = DenseRetriever(col)
bm25 = BM25Retriever(bm)
hybrid = HybridRetriever(DenseRetriever(col, pool=100), BM25Retriever(bm, pool=100))
retrievers = [dense, bm25, hybrid]
rerank_url = os.environ.get("RERANK_URL", "").rstrip("/")
if rerank_url:
retrievers += [
RerankedRetriever(bm25, col, rerank_url, name_suffix="rerank", pool=50),
RerankedRetriever(hybrid, col, rerank_url, name_suffix="rerank", pool=50),
]
print(f"reranker enabled: {rerank_url}")
rows: dict[str, dict[str, float]] = {}
per_query: list[dict] = []
for r in retrievers:
mrr_sum = recall_sum = ndcg_sum = 0.0
elapsed_sum = 0.0
for q in queries:
expected = [(e["bundle_id"], e["page_id"]) for e in q["expected"]]
t0 = time.time()
retrieved = r.retrieve(q["query"], k=max(args.k, 10))
elapsed = time.time() - t0
mrr = reciprocal_rank(retrieved, expected)
recall = recall_at_k(retrieved, expected, args.k)
ndcg = ndcg_at_k(retrieved, expected, args.k)
mrr_sum += mrr
recall_sum += recall
ndcg_sum += ndcg
elapsed_sum += elapsed
per_query.append({
"retriever": r.name, "query": q["query"],
"mrr": mrr, "recall@k": recall, "ndcg@k": ndcg,
"top1": list(retrieved[0]) if retrieved else None,
"elapsed_s": round(elapsed, 3),
})
n = max(len(queries), 1)  # guard against ZeroDivisionError on an empty query file
rows[r.name] = {
"MRR": mrr_sum / n,
f"Recall@{args.k}": recall_sum / n,
f"nDCG@{args.k}": ndcg_sum / n,
"avg_latency_s": elapsed_sum / n,
}
print(f" {r.name}: MRR={rows[r.name]['MRR']:.3f} "
f"Recall@{args.k}={rows[r.name][f'Recall@{args.k}']:.3f} "
f"nDCG@{args.k}={rows[r.name][f'nDCG@{args.k}']:.3f} "
f"avg={rows[r.name]['avg_latency_s']*1000:.0f}ms")
args.output.parent.mkdir(parents=True, exist_ok=True)
md = [f"# Retrieval eval — k={args.k}", "",
f"_{len(queries)} hand-curated queries, generated {time.strftime('%Y-%m-%d %H:%M:%S')}_", "",
"| Retriever | MRR | Recall@{k} | nDCG@{k} | avg latency |".replace("{k}", str(args.k)),
"| --- | ---: | ---: | ---: | ---: |"]
for name, m in rows.items():
md.append(f"| `{name}` | {m['MRR']:.3f} | {m[f'Recall@{args.k}']:.3f} "
f"| {m[f'nDCG@{args.k}']:.3f} | {m['avg_latency_s']*1000:.0f}ms |")
md += ["", "## Per-query results", "",
"| Retriever | Query | MRR | top-1 |", "| --- | --- | ---: | --- |"]
for r in per_query:
top1 = f"`{r['top1'][0]}/{r['top1'][1][:24]}...`" if r["top1"] else ""
md.append(f"| `{r['retriever']}` | {r['query'][:60]} | {r['mrr']:.3f} | {top1} |")
args.output.write_text("\n".join(md) + "\n")
print(f"wrote {args.output}")
return 0
if __name__ == "__main__":
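The loop above calls `reciprocal_rank`, `recall_at_k`, and `ndcg_at_k`, whose bodies sit outside this hunk. As a point of reference, here is a minimal sketch of the standard binary-relevance definitions, assuming `retrieved` is a ranked list of `(bundle_id, page_id)` tuples and `expected` is the set of relevant ones. The actual helpers in the eval script may differ in detail.

```python
import math


def reciprocal_rank(retrieved, expected):
    """1 / rank of the first relevant hit, or 0.0 if none is retrieved."""
    wanted = set(expected)
    for rank, page in enumerate(retrieved, start=1):
        if page in wanted:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved, expected, k):
    """Fraction of expected pages that appear in the top k."""
    wanted = set(expected)
    if not wanted:
        return 0.0
    return len(wanted & set(retrieved[:k])) / len(wanted)


def ndcg_at_k(retrieved, expected, k):
    """Binary-relevance nDCG: DCG of the top k divided by the ideal DCG."""
    wanted = set(expected)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, page in enumerate(retrieved[:k], start=1)
              if page in wanted)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(wanted), k) + 1))
    return dcg / ideal if ideal else 0.0


# Made-up sample: one relevant page found at rank 2, one never retrieved.
retrieved = [("um", "X"), ("um", "A"), ("um", "B")]
expected = [("um", "A"), ("um", "Z")]
print(reciprocal_rank(retrieved, expected),
      recall_at_k(retrieved, expected, 3),
      ndcg_at_k(retrieved, expected, 3))
```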
+39 -11
@@ -31,6 +31,31 @@ from typing import Iterator
CHARS_PER_TOKEN = 4
TARGET_TOKENS = 500
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
# Hard cap: nomic-embed-text's context is 2048 tokens. Anything larger
# 400s the entire embed batch. 6000 chars works for prose but markdown
# tables with lots of `|` separators tokenize ~1.4× denser; a 5839-char
# table chunk from the HVM qualification matrix tokenized past 2048 and
# crashed the rebuild. 4000 chars stays under 2048 tokens even for
# dense table content while leaving headroom for the query side.
MAX_CHARS = 4000
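def _hard_split(text: str) -> list[str]:
    """Split an oversized block on line boundaries into pieces of at most MAX_CHARS."""
    if len(text) <= MAX_CHARS:
        return [text]
    out: list[str] = []
    buf: list[str] = []
    buf_chars = 0
    for line in text.splitlines(keepends=True):
        # A single line longer than MAX_CHARS (e.g. one very wide table row)
        # cannot be split on a line boundary, so slice it by character count.
        for start in range(0, len(line), MAX_CHARS):
            part = line[start:start + MAX_CHARS]
            if buf_chars + len(part) > MAX_CHARS and buf:
                out.append("".join(buf).rstrip())
                buf, buf_chars = [], 0
            buf.append(part)
            buf_chars += len(part)
    if buf:
        out.append("".join(buf).rstrip())
    return out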
def estimate_tokens(text: str) -> int:
@@ -104,23 +129,26 @@ def chunks_from_page(
# ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
ordinal = 1
def emit(buf: list[str]) -> Iterator[dict]:
nonlocal ordinal
merged = "\n\n".join(buf)
for piece in _hard_split(merged):
yield {
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
"text": piece,
"metadata": {**metadata, "ordinal": ordinal},
}
ordinal += 1
buf: list[str] = []
buf_chars = 0
for p in paragraphs:
if buf_chars + len(p) > TARGET_CHARS and buf:
yield {
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
"text": "\n\n".join(buf),
"metadata": {**metadata, "ordinal": ordinal},
}
ordinal += 1
yield from emit(buf)
buf = []
buf_chars = 0
buf.append(p)
buf_chars += len(p)
if buf:
yield {
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
"text": "\n\n".join(buf),
"metadata": {**metadata, "ordinal": ordinal},
}
yield from emit(buf)
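The reasoning behind `MAX_CHARS = 4000` can be checked with a little arithmetic. This is a rough sketch, assuming the ~4 chars/token prose estimate and the ~1.4× table density quoted in the comment on `MAX_CHARS`; `worst_case_tokens` is a hypothetical helper, and real tokenizer counts will vary from these estimates.

```python
CHARS_PER_TOKEN = 4      # prose estimate used by the chunker
TABLE_DENSITY = 1.4      # approximate extra tokens per char for `|`-heavy tables
CONTEXT_TOKENS = 2048    # nomic-embed-text context length, per the comment


def worst_case_tokens(chars: int) -> float:
    """Estimate the token count of table-dense text of the given length."""
    return chars / CHARS_PER_TOKEN * TABLE_DENSITY


at_6000 = worst_case_tokens(6000)  # about 2100 tokens: over the context limit
at_4000 = worst_case_tokens(4000)  # about 1400 tokens: under it with headroom
print(at_6000 > CONTEXT_TOKENS, at_4000 < CONTEXT_TOKENS)
```

Under these assumptions a 6000-char chunk of table content can exceed the 2048-token context, while a 4000-char chunk stays well below it.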
+21 -4
@@ -3,8 +3,15 @@
Swappable: implement the same `embedding_function()` interface returning
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
Defaults (override via env):
OLLAMA_URL one or more comma-separated URLs (load-balanced)
Env-configurable (matches the zerto-docs-rag pattern so the same Gitea
runner + GPU-pinned Ollama containers can serve every docs MCP build):
OLLAMA_URLS comma-separated list, load-balanced round-robin per batch.
Preferred — set in the CI workflow to fan out across two
GPU-pinned Ollama containers on the Gitea host.
OLLAMA_URL single endpoint, fallback when OLLAMA_URLS is unset.
Default http://192.168.0.2:11434 (the host where the GPUs
live in Justin's lab).
EMBED_MODEL model name; default 'nomic-embed-text'
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
"""
@@ -19,8 +26,18 @@ from chromadb import EmbeddingFunction, Documents, Embeddings
log = logging.getLogger(__name__)
OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
"http://localhost:11434").split(",") if u.strip()]
DEFAULT_OLLAMA_URL = "http://192.168.0.2:11434"
def _resolve_urls() -> list[str]:
    raw = os.environ.get("OLLAMA_URLS", "").strip()
    if raw:
        return [u.strip().rstrip("/") for u in raw.split(",") if u.strip()]
    single = os.environ.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).strip().rstrip("/")
    return [single]
OLLAMA_URLS = _resolve_urls()
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
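The docstring describes `OLLAMA_URLS` as load-balanced round-robin per batch, but the dispatch code is outside this hunk. One way that could look, as a sketch only: `RoundRobin` and the example URLs are hypothetical, and the real embedder may pick endpoints differently.

```python
import itertools
import threading


class RoundRobin:
    """Hand out endpoints in rotation, one per embed batch."""

    def __init__(self, urls: list[str]):
        if not urls:
            raise ValueError("at least one URL is required")
        self._cycle = itertools.cycle(urls)
        # Guard the shared iterator in case batches run on several threads.
        self._lock = threading.Lock()

    def next_url(self) -> str:
        with self._lock:
            return next(self._cycle)


# Hypothetical endpoints standing in for two GPU-pinned Ollama containers.
rr = RoundRobin(["http://gpu0:11434", "http://gpu1:11434"])
picked = [rr.next_url() for _ in range(4)]
print(picked)
```

With a single URL the rotation degenerates to always returning that URL, so the `OLLAMA_URL` fallback needs no special case.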
+1 -1
@@ -29,7 +29,7 @@ CHROMA_DIR = ROOT / "chroma"
# Collection name — convention: <product>_docs. Override via env if needed.
import os
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "morpheus")
COLLECTION = f"{PRODUCT_NAME}_docs"
+10
@@ -0,0 +1,10 @@
# Dev/CPU reranker — only for running scripts/rerank_server.py locally.
# Production uses the llama.cpp + jina-reranker GGUF sidecar (see
# deploy/docker-compose.yml). Install with:
#
# pip install -r requirements-rerank.txt
#
# This adds PyTorch (~2 GB) and the sentence-transformers cross-encoder
# (cross-encoder/ms-marco-MiniLM-L-6-v2, ~22 MB). Keep out of the main
# requirements.txt so the production image stays slim.
sentence-transformers>=3.0
+8
@@ -10,10 +10,18 @@ ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not
# Scraping (Phase 1; adjust per product)
beautifulsoup4>=4.12
requests>=2.31
curl_cffi>=0.7 # for HPE QuickSpecs scrape (Chrome TLS impersonation)
markdownify>=0.11
# playwright>=1.40 # uncomment if you need headless browser fallback
# Evaluation
numpy>=1.26
# Reranker is a sidecar (see deploy/docker-compose.yml). The MCP server
# only needs httpx (declared above) to call it. For the dev / CPU
# fallback reranker (scripts/rerank_server.py), install
# requirements-rerank.txt separately — it pulls in PyTorch which would
# triple the production image size.
# Dev / utility
python-dateutil>=2.8
+66
@@ -7,6 +7,72 @@ the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.
---
## Product context — HPE Morpheus Enterprise Software
**This repo is for HPE Morpheus Enterprise**, the full cloud-management
platform. It is a **different SKU** from HPE Morpheus VM Essentials
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
docs here; they're a separate, smaller product (the "VM-only" subset
of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
Morpheus Enterprise as the "elevate to" target — that's the
relationship.
`PRODUCT_NAME=morpheus`. Tool will be named `morpheus_api_lessons`,
collection `morpheus_docs`, etc.
### Upstream portal
HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
the Zerto docs). Anonymous JSON API, no auth required.
| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |
User-facing URL format:
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
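The endpoint table and user-facing URL format above reduce to a few string templates. A small sketch using only the URL shapes listed here; the helper names are hypothetical, and no request is made.

```python
API = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY = "https://support.hpe.com/hpesc/public/docDisplay"


def abstract_url(doc_id: str) -> str:
    """Title page / abstract, or the whole body for single-doc Release Notes."""
    return f"{API}/{doc_id}"


def toc_url(doc_id: str) -> str:
    """Nested TOC tree; empty or 404 for single-doc bundles."""
    return f"{API}/{doc_id}/toc"


def render_url(doc_id: str, page_id: str) -> str:
    """JSON payload carrying page_html for one page."""
    return f"{API}/{doc_id}/render?page={page_id}.html"


def display_url(doc_id: str, page_id: str) -> str:
    """Human-facing link to the same page."""
    return f"{DISPLAY}?docId={doc_id}&page={page_id}.html"


# GUID-XXXX is a placeholder page id, as in the table above.
print(render_url("sd00007732en_us", "GUID-XXXX"))
print(display_url("sd00007732en_us", "GUID-XXXX"))
```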
### Bundle IDs (confirmed 2026-05-22)
**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:
| Version | docId |
|---|---|
| 8.1.0 | `sd00007510en_us` |
| 8.1.1 | `sd00007621en_us` |
| 8.1.2 | `sd00007732en_us` |
**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
(no TOC; full body returned by the `/document/{docId}` endpoint
itself; scraper needs a `--single-doc` mode for these):
| Version | docId |
|---|---|
| 8.1.0 | `sd00007496en_us` |
| 8.1.1 | `sd00007610en_us` |
| 8.1.2 | `sd00007733en_us` |
### Cross-version peers are free
GUIDs are stable across versions (confirmed on HVM where 374/376/376
pages had 100% GUID overlap between adjacent versions). Same-GUID =
same-topic. Synthesize `topic_cluster.clustered_topics` by looking
up the same GUID in the other bundle slugs — no fuzzy matching
needed.
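Because the same GUID means the same topic, the peer lookup reduces to a dictionary join. A minimal sketch, assuming an in-memory map from bundle slug to the set of page GUIDs it contains; `cluster_peers`, the sample GUIDs, and the output shape are illustrative assumptions rather than the runner's actual finalize pass.

```python
def cluster_peers(pages_by_bundle: dict[str, set[str]]) -> dict[tuple[str, str], list[str]]:
    """For each (bundle, guid), list the other bundles that hold the same GUID."""
    peers: dict[tuple[str, str], list[str]] = {}
    for bundle, guids in pages_by_bundle.items():
        for guid in guids:
            peers[(bundle, guid)] = sorted(
                other for other, other_guids in pages_by_bundle.items()
                if other != bundle and guid in other_guids
            )
    return peers


# Made-up GUIDs for illustration; real ones are longer hex strings.
corpus = {
    "morpheus_user_manual_8_1_0": {"GUID-AAAA", "GUID-BBBB"},
    "morpheus_user_manual_8_1_1": {"GUID-AAAA", "GUID-BBBB", "GUID-CCCC"},
    "morpheus_user_manual_8_1_2": {"GUID-AAAA", "GUID-CCCC"},
}
peers = cluster_peers(corpus)
print(peers[("morpheus_user_manual_8_1_0", "GUID-AAAA")])
```

A page that exists in only one version simply gets an empty peer list, so no special handling is needed for topics added or dropped between releases.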
### Reusable from hvm-docs
`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
solve the identical portal shape. Copy and adapt the BUNDLES list +
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
TOC-paginated path and the single-doc path are needed (the HVM
build covers both because HVM Release Notes follow the same shape).
## What you write
At minimum, two scripts:
+200
@@ -0,0 +1,200 @@
"""Discover Morpheus Enterprise doc bundles on HPE Support DocPortal and write bundles.json.
Mirrors hvm-docs/scrape/bundles.py — same portal, same API shape, same single-doc-blob
treatment for Release Notes, but pointing at the Morpheus Enterprise docId range.
For each bundle this script:
1. GETs /hpesc/public/api/document/{docId} → abstract HTML
2. GETs /hpesc/public/api/document/{docId}/toc → page tree (or 404 for single-doc)
3. Writes bundles.json at repo root with the schema PLAN.md Phase 1 documents.
QuickSpecs is a special case: lives at www.hpe.com (not support.hpe.com), gets the
html-file mode and is scraped via curl_cffi (see scrape/quickspecs.py).
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
BUNDLES_JSON = ROOT / "bundles.json"
@dataclass
class BundleSpec:
slug: str
doc_id: str
title: str
version: str | None
product: str # e.g. "User Manual", "Release Notes", "QuickSpecs"
mode: str # "toc", "single", or "html-file"
platform: str | None = None
language: str = "en-US"
source_url: str | None = None # overrides the default support.hpe.com URL
# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
# range sd00006500..7740 for `Morpheus Enterprise` matches in the abstract.
#
# Notes:
# - Morpheus Enterprise has User Manuals dating back to 8.0.10
# (sd00006774en_us, Sep 2025) but we only ship the 8.1.x line for
# now. Add the 8.0.x bundles here if you need older versions in the
# corpus.
# - No dedicated Deployment Guide or Qualification Matrix for Morpheus
# Enterprise on HPE Support — the only QM (sd00006551en_us) covers
# HVM clusters managed by Morpheus, which lives in hvm-docs.
# - QuickSpecs lives on www.hpe.com (not support.hpe.com), uses the
# html-file scrape mode with curl_cffi Chrome impersonation.
BUNDLES: list[BundleSpec] = [
BundleSpec("morpheus_user_manual_8_1_0", "sd00007510en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.0", "User Manual", "toc"),
BundleSpec("morpheus_user_manual_8_1_1", "sd00007621en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.1", "User Manual", "toc"),
BundleSpec("morpheus_user_manual_8_1_2", "sd00007732en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.2", "User Manual", "toc"),
BundleSpec("morpheus_release_notes_8_1_0", "sd00007496en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.0", "Release Notes", "single"),
BundleSpec("morpheus_release_notes_8_1_1", "sd00007610en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.1", "Release Notes", "single"),
BundleSpec("morpheus_release_notes_8_1_2", "sd00007733en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.2", "Release Notes", "single"),
BundleSpec("morpheus_quickspecs", "a50009231enw", "HPE Morpheus Enterprise Software QuickSpecs",
"v1", "QuickSpecs", "html-file",
source_url="https://www.hpe.com/psnow/doc/a50009231enw"),
]
def _session() -> requests.Session:
s = requests.Session()
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
return s
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
delay = 1.0
for attempt in range(retries):
r = s.get(url, timeout=30)
if r.status_code == 200:
return r.json() if expect_json else r.text
if r.status_code == 404:
return None
if r.status_code in (429, 500, 502, 503, 504):
time.sleep(delay)
delay *= 2
continue
r.raise_for_status()
raise RuntimeError(f"GET failed after {retries} retries: {url}")
def _count_toc(toc: list[dict] | None) -> tuple[int, str | None]:
if not toc:
return 0, None
landing = None
n = 0
def walk(nodes: list[dict] | None, depth: int) -> None:
nonlocal n, landing
for node in nodes or []:
link = node.get("topicLink")
if link:
n += 1
m = re.search(r"page=(GUID-[A-F0-9-]+)\.html", link)
if m and landing is None:
landing = m.group(1)
walk(node.get("children"), depth + 1)
walk(toc, 0)
return n, landing
def _parse_abstract(html: str) -> dict[str, str]:
soup = BeautifulSoup(html, "html.parser")
out: dict[str, str] = {}
h1 = soup.select_one("h1.title.topictitle1")
if h1:
out["title"] = h1.get_text(" ", strip=True)
desc = soup.select_one("div.desc")
if desc:
out["abstract"] = desc.get_text(" ", strip=True)
pub = soup.select_one("div.publishedDate")
if pub:
out["published"] = pub.get_text(" ", strip=True).replace("Published:", "").strip()
return out
def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
# html-file bundles are static fixtures or live-fetched outside support.hpe.com.
if spec.mode == "html-file":
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": 1,
"mode": "html-file",
"abstract": "",
"dates": {},
"landing_page": spec.doc_id,
"source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
}
abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
meta = _parse_abstract(abstract_html or "")
page_count: int
landing: str | None
if spec.mode == "toc":
toc = _get(s, f"{API}/{spec.doc_id}/toc", expect_json=True)
page_count, landing = _count_toc(toc)
if page_count == 0:
print(f" ! {spec.slug}: TOC empty — falling back to single-doc mode", file=sys.stderr)
spec.mode = "single"
page_count, landing = 1, spec.doc_id
else:
page_count, landing = 1, spec.doc_id
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": meta.get("title") or spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": page_count,
"mode": spec.mode,
"abstract": meta.get("abstract", ""),
"dates": {"Published": meta.get("published", "")},
"landing_page": landing,
"source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
}
def main() -> int:
p = argparse.ArgumentParser(description="Build bundles.json from BUNDLES list.")
p.add_argument("--out", default=str(BUNDLES_JSON))
args = p.parse_args()
s = _session()
out: list[dict[str, Any]] = []
for spec in BUNDLES:
print(f"{spec.slug} ({spec.doc_id}) ...", file=sys.stderr)
out.append(discover_bundle(s, spec))
Path(args.out).write_text(json.dumps(out, indent=2) + "\n")
print(f"wrote {args.out}: {len(out)} bundles, {sum(b['page_count'] for b in out)} pages total", file=sys.stderr)
return 0
if __name__ == "__main__":
sys.exit(main())
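To make the `/toc` handling concrete, here is a self-contained sketch that walks a hand-written tree in the nested `{topicName, topicLink, children}` shape. The sample tree and the iterative `count_pages` helper are made up for illustration; the real payload carries more fields, and the script itself uses the recursive `_count_toc`.

```python
import re

GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")


def count_pages(toc):
    """Return (page_count, first_guid) for a nested TOC list, in document order."""
    count, landing = 0, None
    stack = list(reversed(toc or []))
    while stack:
        node = stack.pop()
        link = node.get("topicLink")
        if link:
            count += 1
            m = GUID_RE.search(link)
            if m and landing is None:
                landing = m.group(1)
        # Push children reversed so the first child is visited next (pre-order).
        stack.extend(reversed(node.get("children") or []))
    return count, landing


# Hypothetical TOC: one group heading without a link, three linked pages.
sample = [
    {"topicName": "Intro", "topicLink": "docDisplay?docId=x&page=GUID-0A1B.html",
     "children": [
         {"topicName": "Sub", "topicLink": "docDisplay?docId=x&page=GUID-2C3D.html",
          "children": []},
     ]},
    {"topicName": "Group heading", "topicLink": None,
     "children": [
         {"topicName": "Leaf", "topicLink": "docDisplay?docId=x&page=GUID-4E5F.html"},
     ]},
]
result = count_pages(sample)
print(result)
```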
+194
@@ -0,0 +1,194 @@
"""Scrape HPE QuickSpecs collateral pages into corpus markdown.
HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
captured DOM). The blocker for automated scraping is `www.hpe.com`'s
edge bot defense, which drops connections from non-browser TLS
fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.
Content extraction uses these stable CSS selectors found in the page:
.lr-right-rail hpe-highlights-container .collateral-content
— one per section ("Overview", "Standard Features", etc.)
h3.txto-title — section title
div.txto-description — section body
uc-table.uc-table-polaris — SKU / version-history tables
A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
as a fallback when the live fetch fails (HPE edge churn, network
issues). Keeping a current fixture in the repo also makes diffing
QuickSpecs revisions easy.
Usage (called by scrape.runner for bundles with mode="html-file"):
python -m scrape.quickspecs a50009231enw --bundle-id morpheus_quickspecs
Or programmatically:
from scrape.quickspecs import scrape_quickspecs
scrape_quickspecs("a50009231enw", bundle_id="morpheus_quickspecs", title="...")
"""
from __future__ import annotations
import argparse
import json
import logging
import sys
from pathlib import Path
from bs4 import BeautifulSoup, NavigableString
from markdownify import markdownify as md
log = logging.getLogger(__name__)
ROOT = Path(__file__).resolve().parent.parent
SOURCE_DIR = ROOT / "scrape" / "quickspecs"
CORPUS_DIR = ROOT / "corpus"
COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"
def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
"""GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
Returns the HTML body on success, None on any failure."""
try:
from curl_cffi import requests as cc
except ImportError:
log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
return None
try:
r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
impersonate="chrome120", timeout=timeout)
if r.status_code != 200 or not r.text:
log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
return None
return r.text
except Exception as e:
log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
return None
def fetch_fixture(doc_id: str) -> str | None:
"""Read the committed HTML fixture as fallback."""
p = SOURCE_DIR / f"{doc_id}.html"
if not p.exists():
return None
return p.read_text()
def _extract_content_blocks(html: str) -> list[str]:
"""Pull each section block (.collateral-content under .lr-right-rail).
The fixture format (just .quickspecs-content wrapper) and the live
format (.lr-right-rail with nested hpe-highlights-container) are
both supported. Returns a list of section HTML strings, in document
order.
"""
soup = BeautifulSoup(html, "html.parser")
# Live format: each <hpe-highlights-container> under .lr-right-rail has
# one or more .collateral-content blocks; concat them.
rail = soup.select_one(".lr-right-rail")
if rail is not None:
blocks = rail.select(".collateral-content")
return [str(b) for b in blocks]
# Fixture format: a single wrapper holding all the H2/H3 sections.
wrapper = soup.select_one(".quickspecs-content")
if wrapper is not None:
return [str(wrapper)]
# Last-resort: whole body.
body = soup.body or soup
return [str(body)]
def parse_html(html: str) -> str:
"""Convert QuickSpecs HTML to clean markdown.
Filters out the page chrome (nav, footer, recommendations carousel,
cookie banner, analytics blobs) by extracting only the content
blocks, then runs markdownify."""
blocks = _extract_content_blocks(html)
chunks: list[str] = []
for block in blocks:
soup = BeautifulSoup(block, "html.parser")
# Drop anchor placeholders that markdownify turns into noisy links
for a in soup.select('[hpe-left-rail-anchor]'):
a.decompose()
# Drop carousel / share / recommendation widgets if any leaked in.
for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
"esl-scrollbar", "esl-trigger", "video-overlay",
"generic-modal-loader", "style", "script"):
for el in soup.select(sel):
el.decompose()
chunks.append(md(str(soup), heading_style="ATX", bullets="-",
strip=["span", "div"]))
text = "\n\n".join(chunks)
# Collapse runs of blank lines markdownify likes to emit.
text = "\n".join(line.rstrip() for line in text.splitlines())
while "\n\n\n" in text:
text = text.replace("\n\n\n", "\n\n")
return text.strip() + "\n"
def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
version: str | None = None,
product: str = "QuickSpecs",
source_url: str | None = None,
force: bool = False) -> bool:
"""Live-fetch (or fall back to fixture), parse, write corpus files.
Returns True if files were written, False if skipped (already exists
and --force not set)."""
bundle_dir = CORPUS_DIR / bundle_id
md_path = bundle_dir / f"{doc_id}.md"
json_path = bundle_dir / f"{doc_id}.json"
if not force and md_path.exists() and json_path.exists():
log.info(" %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
return False
html = fetch_live(doc_id)
fetched_from = "live"
if html is None:
html = fetch_fixture(doc_id)
fetched_from = "fixture"
if html is None:
log.error("QuickSpecs %s: no live response and no fixture at %s",
doc_id, SOURCE_DIR / f"{doc_id}.html")
return False
body_md = parse_html(html)
bundle_dir.mkdir(parents=True, exist_ok=True)
md_path.write_text(body_md)
sidecar = {
"bundle_id": bundle_id,
"page_id": doc_id,
"title": title,
"ordinal": 1,
"parent_title": None,
"doc_id": doc_id,
"version": version,
"product": product,
"source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
"fetched_from": fetched_from,
}
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
log.info(" %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
return True
def main() -> int:
logging.basicConfig(level=logging.INFO, format="%(message)s")
p = argparse.ArgumentParser()
p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50009231enw")
p.add_argument("--bundle-id", default="morpheus_quickspecs")
p.add_argument("--title", default="HPE Morpheus Enterprise Software QuickSpecs")
p.add_argument("--version", default=None)
p.add_argument("--force", action="store_true")
args = p.parse_args()
ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
args.version, force=args.force)
return 0 if ok else 1
if __name__ == "__main__":
sys.exit(main())
+27
@@ -0,0 +1,27 @@
# scrape/quickspecs/
Static HTML fixtures for HPE QuickSpecs documents that aren't reachable
from the runner (www.hpe.com edge drops connections from datacenter IPs
with non-browser User-Agents — verified 2026-05-22 with curl, wget, and
Anthropic's WebFetch).
## Workflow
1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
browser, opens DevTools → Elements → Copy the `<body>` HTML.
2. Save it at `scrape/quickspecs/<doc_id>.html`.
3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
4. `python -m scrape.runner --bundle morpheus_quickspecs --force` reads the
committed HTML and writes `corpus/morpheus_quickspecs/<doc_id>.{md,json}`.
5. Re-index and ship.
QuickSpecs only update every few months (HPE rebrand, new SKU added,
feature change). When a new version drops, refresh the local HTML
file and re-run the scrape.
## Current fixtures
- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
(Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
(3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
Essentials included.
+339
@@ -0,0 +1,339 @@
"""Scrape Morpheus Enterprise doc bundles into corpus/<slug>/<page_id>.{md,json}.
Reads bundles.json (produced by scrape.bundles), then for each bundle:
- mode="toc": walks the TOC tree, fetches each page via the render
endpoint, converts page_html to markdown, writes
<page_id>.md + <page_id>.json sidecar.
- mode="single": fetches /document/{docId} directly, treats the whole
body as one page with page_id = doc_id.
After all bundles are on disk, runs a finalize pass that synthesizes
topic_cluster.clustered_topics for each page by looking up the same
GUID in sibling bundles (HPE GUIDs are stable across versions — see
reference_hpe_docs_portal_api.md).
Usage:
python -m scrape.runner --all
python -m scrape.runner --bundle morpheus_user_manual_8_1_2
python -m scrape.runner --all --force # re-download already-on-disk pages
python -m scrape.runner --finalize-only # only redo the topic_cluster pass
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}&page={page_id}.html"
DOC_URL_SINGLE = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
BUNDLES_JSON = ROOT / "bundles.json"
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")
@dataclass
class TocEntry:
page_id: str
title: str
ordinal: int
parent_title: str | None
def _session() -> requests.Session:
s = requests.Session()
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
return s
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
delay = 1.0
for attempt in range(retries):
r = s.get(url, timeout=30)
if r.status_code == 200:
return r.json() if expect_json else r.text
if r.status_code == 404:
return None
if r.status_code in (429, 500, 502, 503, 504):
time.sleep(delay)
delay *= 2
continue
r.raise_for_status()
raise RuntimeError(f"GET failed after {retries} retries: {url}")
def _flatten_toc(toc: list[dict]) -> list[TocEntry]:
    out: list[TocEntry] = []
    ordinal = 0

    def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
        nonlocal ordinal
        for node in nodes or []:
            title = node.get("topicName") or ""
            link = node.get("topicLink") or ""
            m = GUID_RE.search(link)
            if m:
                ordinal += 1
                out.append(TocEntry(page_id=m.group(1), title=title, ordinal=ordinal, parent_title=parent_title))
            walk(node.get("children"), title or parent_title)

    walk(toc, None)
    return out


def _strip_dita_wrappers(html: str) -> str:
    """Remove the outer <main class="ditasrc">, drop the trademark Notices section,
    and unwrap aria-only span markup so markdownify produces clean text.
    DITA's notices boilerplate repeats across every doc; if we leave it in,
    every page chunk inherits the same trademark text and pollutes retrieval."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop the Notices/Acknowledgments/Abstract boilerplate by section heading.
    # Every doc on the portal carries the same legal Notices and trademark
    # Acknowledgments; if we leave them in, every chunk inherits the same
    # text and pollutes retrieval. Abstract is one-line marketing.
    boilerplate = {"Notices", "Acknowledgments", "Abstract"}
    # Wrapped form: <article>/<section>/<div> whose first heading child is boilerplate.
    for sec in soup.select("article, section, div"):
        h = sec.find(["h1", "h2"], recursive=False)
        if h and h.get_text(strip=True) in boilerplate:
            sec.decompose()
    # Unwrapped form: bare <h1>/<h2>Boilerplate</h2> followed by its .desc/.body sibling.
    for h in soup.find_all(["h1", "h2"]):
        if h.get_text(strip=True) in boilerplate:
            sib = h.find_next_sibling()
            if sib and (sib.name in {"div", "section"}):
                cls = " ".join(sib.get("class", []) or [])
                if "desc" in cls or "body" in cls or "notices" in cls:
                    sib.decompose()
            h.decompose()
    main = soup.find("main")
    return str(main) if main else str(soup)


def html_to_md(page_html: str) -> str:
    cleaned = _strip_dita_wrappers(page_html)
    text = md(cleaned, heading_style="ATX", bullets="-")
    # collapse runs of blank lines
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text + "\n"


def fetch_toc_page(s: requests.Session, doc_id: str, page_id: str) -> str:
    payload = _get(s, f"{API}/{doc_id}/render?page={page_id}.html", expect_json=True)
    if not payload:
        return ""
    return payload.get("page_html") or ""


def fetch_single_doc(s: requests.Session, doc_id: str) -> tuple[str, str]:
    """Returns (page_html, title) for a single-doc-shape bundle."""
    html = _get(s, f"{API}/{doc_id}")
    if not html:
        return "", ""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.select_one("h1.title.topictitle1")
    title = h1.get_text(" ", strip=True) if h1 else doc_id
    return html, title

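def write_page(bundle_dir: Path, page_id: str, body_md: str, sidecar: dict[str, Any], force: bool) -> bool:
    bundle_dir.mkdir(parents=True, exist_ok=True)
    md_path = bundle_dir / f"{page_id}.md"
    json_path = bundle_dir / f"{page_id}.json"
    # Skip pages already on disk unless --force was given.
    if not force and md_path.exists() and json_path.exists():
        return False
    # Scraped pages contain non-ASCII text; pin the encoding so the result
    # does not depend on the platform default.
    md_path.write_text(body_md, encoding="utf-8")
    json_path.write_text(json.dumps(sidecar, indent=2) + "\n", encoding="utf-8")
    return True
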
def scrape_toc_bundle(s: requests.Session, bundle: dict, force: bool, concurrency: int) -> int:
    doc_id = bundle["doc_id"]
    slug = bundle["slug"]
    bundle_dir = CORPUS / slug
    toc = _get(s, f"{API}/{doc_id}/toc", expect_json=True) or []
    entries = _flatten_toc(toc)
    print(f" {slug}: {len(entries)} pages", file=sys.stderr)
    written = 0

    def do_one(entry: TocEntry) -> bool:
        page_html = fetch_toc_page(s, doc_id, entry.page_id)
        if not page_html:
            return False
        body_md = html_to_md(page_html)
        sidecar = {
            "bundle_id": slug,
            "page_id": entry.page_id,
            "title": entry.title,
            "ordinal": entry.ordinal,
            "parent_title": entry.parent_title,
            "doc_id": doc_id,
            "version": bundle.get("version"),
            "product": bundle.get("product"),
            "source_url": DOC_URL.format(doc_id=doc_id, page_id=entry.page_id),
            # topic_cluster filled in by finalize()
        }
        return write_page(bundle_dir, entry.page_id, body_md, sidecar, force)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for fut in as_completed(pool.submit(do_one, e) for e in entries):
            if fut.result():
                written += 1
    return written


def scrape_single_bundle(s: requests.Session, bundle: dict, force: bool) -> int:
    doc_id = bundle["doc_id"]
    slug = bundle["slug"]
    bundle_dir = CORPUS / slug
    html, title = fetch_single_doc(s, doc_id)
    if not html:
        print(f" ! {slug}: empty body", file=sys.stderr)
        return 0
    body_md = html_to_md(html)
    sidecar = {
        "bundle_id": slug,
        "page_id": doc_id,
        "title": title or bundle["title"],
        "ordinal": 1,
        "parent_title": None,
        "doc_id": doc_id,
        "version": bundle.get("version"),
        "product": bundle.get("product"),
        "source_url": DOC_URL_SINGLE.format(doc_id=doc_id),
    }
    print(f" {slug}: 1 page (single-doc)", file=sys.stderr)
    return 1 if write_page(bundle_dir, doc_id, body_md, sidecar, force) else 0


def finalize_clusters(bundles: list[dict]) -> int:
    """Cross-link sibling pages with the same GUID across version bundles.
    For TOC bundles, page_id == GUID; same GUID across two bundles = same
    underlying topic. For single-doc bundles (page_id == doc_id), peer them
    by matching product+version-sibling on the `product` field."""
    # GUID → list[(slug, sidecar_path, sidecar_dict)]
    guid_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
    # product → list[(slug, sidecar_path, sidecar_dict)] for single-doc peering
    product_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
    for b in bundles:
        slug = b["slug"]
        bundle_dir = CORPUS / slug
        if not bundle_dir.exists():
            continue
        for jp in bundle_dir.glob("*.json"):
            data = json.loads(jp.read_text())
            pid = data["page_id"]
            if pid.startswith("GUID-"):
                guid_to_pages.setdefault(pid, []).append((slug, jp, data))
            else:
                product_to_pages.setdefault(b["product"], []).append((slug, jp, data))
    updated = 0
    # TOC pages — cluster by GUID
    for guid, peers in guid_to_pages.items():
        if len(peers) < 2:
            continue
        for slug, jp, data in peers:
            others = [
                {"bundle_id": s2, "page_id": guid, "clustering_title": d2.get("title", "")}
                for s2, _, d2 in peers if s2 != slug
            ]
            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
            jp.write_text(json.dumps(data, indent=2) + "\n")
            updated += 1
    # Single-doc pages — cluster by product (e.g. Release Notes 8.1.0/.1/.2)
    for product, peers in product_to_pages.items():
        if len(peers) < 2:
            continue
        for slug, jp, data in peers:
            others = [
                {"bundle_id": s2, "page_id": d2["page_id"], "clustering_title": d2.get("title", "")}
                for s2, _, d2 in peers if s2 != slug
            ]
            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
            jp.write_text(json.dumps(data, indent=2) + "\n")
            updated += 1
    return updated

def main() -> int:
    p = argparse.ArgumentParser(description="Scrape Morpheus bundles into corpus/.")
    p.add_argument("--all", action="store_true", help="scrape every bundle in bundles.json")
    p.add_argument("--bundle", action="append", help="scrape one bundle by slug (repeatable)")
    p.add_argument("--force", action="store_true", help="re-fetch pages already on disk")
    p.add_argument("--concurrency", type=int, default=6)
    p.add_argument("--finalize-only", action="store_true", help="only rebuild topic_cluster sidecar fields")
    args = p.parse_args()
    if not BUNDLES_JSON.exists():
        print("bundles.json missing — run `python -m scrape.bundles` first", file=sys.stderr)
        return 2
    bundles = json.loads(BUNDLES_JSON.read_text())
    if args.finalize_only:
        n = finalize_clusters(bundles)
        print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
        return 0
    if args.bundle:
        bundles = [b for b in bundles if b["slug"] in args.bundle]
        if not bundles:
            print(f"no bundles matched: {args.bundle}", file=sys.stderr)
            return 2
    elif not args.all:
        print("specify --all or --bundle <slug>", file=sys.stderr)
        return 2
    s = _session()
    total = 0
    for b in bundles:
        mode = b.get("mode")
        if mode == "single":
            total += scrape_single_bundle(s, b, args.force)
        elif mode == "html-file":
            # Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
            # to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
            from scrape.quickspecs import scrape_quickspecs
            ok = scrape_quickspecs(
                doc_id=b["doc_id"], bundle_id=b["slug"],
                title=b.get("title", b["doc_id"]),
                version=b.get("version"),
                product=b.get("product", "QuickSpecs"),
                source_url=b.get("source_url"),
                force=args.force,
            )
            total += 1 if ok else 0
        else:
            # Any other mode (or none) is treated as a TOC bundle.
            total += scrape_toc_bundle(s, b, args.force, args.concurrency)
    print(f"scraped {total} new/updated pages", file=sys.stderr)
    # Always finalize after a scrape so sidecars are consistent.
    all_bundles = json.loads(BUNDLES_JSON.read_text())
    n = finalize_clusters(all_bundles)
    print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
    return 0


if __name__ == "__main__":
    sys.exit(main())
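
To make the TOC flattening concrete, here is a minimal standalone sketch of the same depth-first walk run against a hand-written sample. The `GUID_RE` pattern, the `TocEntry` dataclass, and the sample TOC are assumptions made for illustration; the real definitions sit earlier in the file and may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumption: a simplified stand-in for the module's real GUID_RE.
GUID_RE = re.compile(r"(GUID-[0-9A-Fa-f-]+)")


@dataclass
class TocEntry:
    # Assumption: field names taken from the keyword arguments used in _flatten_toc.
    page_id: str
    title: str
    ordinal: int
    parent_title: str | None


def flatten_toc(toc: list[dict]) -> list[TocEntry]:
    out: list[TocEntry] = []
    ordinal = 0

    def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
        nonlocal ordinal
        for node in nodes or []:
            title = node.get("topicName") or ""
            link = node.get("topicLink") or ""
            m = GUID_RE.search(link)
            if m:
                # Only nodes whose link carries a GUID become pages.
                ordinal += 1
                out.append(TocEntry(m.group(1), title, ordinal, parent_title))
            # Children are walked either way; an untitled node passes its
            # own parent's title down instead of an empty string.
            walk(node.get("children"), title or parent_title)

    walk(toc, None)
    return out


# Hypothetical TOC payload: one titled section, one external link, one untitled group.
sample_toc = [
    {"topicName": "Getting started", "topicLink": "GUID-AAAA1111.html", "children": [
        {"topicName": "Requirements", "topicLink": "GUID-BBBB2222.html"},
        {"topicName": "External link", "topicLink": "https://example.com/other"},
    ]},
    {"topicName": "", "topicLink": "", "children": [
        {"topicName": "Orphan", "topicLink": "GUID-CCCC3333.html"},
    ]},
]

entries = flatten_toc(sample_toc)
for e in entries:
    print(e.ordinal, e.page_id, repr(e.title), repr(e.parent_title))
```

Ordinals are assigned in document order and skip nodes without a GUID, so the sidecar `ordinal` reflects reading order within the bundle rather than position in the raw TOC tree.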
+70 -41
@@ -1,42 +1,58 @@
 """Gitea container-registry garbage collection.
-Lists package versions for one container package and deletes versions
-older than --keep-days. Always preserves:
+Lists tagged versions of one container package and deletes old ones.
+Always preserves:
-  - the :latest tag
-  - the --keep-latest most-recent date-tagged versions
-  - anything pushed in the last --keep-days days
+  - the `latest` tag (Watchtower's auto-deploy target)
+  - the `--keep-latest` most-recent date-tagged versions (YYYY.MM.DD)
+  - the `--keep-latest` most-recent short-SHA tags (rollback pins)
+  - anything pushed within `--keep-days` days
-The actual disk reclaim happens on Gitea's next package GC cron (admin
-site settings). This script just marks the versions for deletion.
+OCI blob-level versions (`sha256:...`) are never touched directly — those
+are managed by Gitea's internal package GC cron when their last tag
+goes away.
 Usage:
-    python scripts/registry_gc.py \\
-        --owner <user> \\
-        --package <product>-docs-mcp \\
+    GITEA_TOKEN=... python scripts/registry_gc.py \\
+        --owner justin \\
+        --package hvm-docs \\
         --keep-days 90 \\
         --keep-latest 5
-Auth: reads GITEA_TOKEN from env (set in the workflow as a secret).
+The Gitea endpoint shape (confirmed 2026-05-22 against git.jpaul.io):
+    GET /api/v1/packages/{owner}/container/{package}
+        -> [{id, version, created_at, ...}, ...]
+    DELETE /api/v1/packages/{owner}/container/{package}/{version}
 """
 from __future__ import annotations
 import argparse
+import json
 import os
+import re
 import sys
 from datetime import datetime, timedelta, timezone
-from urllib.request import Request, urlopen
 from urllib.error import HTTPError
-import json
+from urllib.parse import quote
+from urllib.request import Request, urlopen
 GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
+DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")
+SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")  # short or full git SHA
+BLOB_VER = re.compile(r"^sha256:")  # OCI blob versions — skip
 def api(token: str, method: str, path: str) -> object:
+    # Explicit User-Agent: git.jpaul.io is behind Cloudflare, whose default
+    # Bot Fight Mode 403s `Python-urllib/X.Y` with error 1010. Any
+    # recognizable browser/curl-style UA passes.
     req = Request(f"{GITEA_HOST}{path}",
-                  headers={"Authorization": f"token {token}"},
+                  headers={
+                      "Authorization": f"token {token}",
+                      "User-Agent": "hvm-docs-registry-gc/1.0",
+                  },
                   method=method)
     try:
         with urlopen(req, timeout=30) as r:
@@ -63,44 +79,57 @@ def main() -> int:
         return 1
     versions = api(token, "GET",
-                   f"/api/v1/packages/{args.owner}/container/{args.package}/versions") or []
+                   f"/api/v1/packages/{args.owner}/container/{args.package}") or []
     if not versions:
-        print(f"no versions found for {args.owner}/{args.package}")
+        print(f"no versions found for {args.owner}/container/{args.package}")
         return 0
     cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
+    print(f" {len(versions)} version(s); cutoff={cutoff.isoformat()} "
+          f"keep_days={args.keep_days} keep_latest={args.keep_latest}")
-    # Date-tagged versions (YYYY.MM.DD), newest first
-    date_tagged = []
-    for v in versions:
-        tags = v.get("tags") or []
-        for t in tags:
-            if len(t) == 10 and t[4] == "." and t[7] == ".":
-                date_tagged.append((t, v))
-                break
-    date_tagged.sort(key=lambda kv: kv[0], reverse=True)
-    keep_date_tags = {t for t, _ in date_tagged[:args.keep_latest]}
-    deleted = 0
-    for v in versions:
-        tags = v.get("tags") or []
-        if "latest" in tags:
-            continue
-        if any(t in keep_date_tags for t in tags):
-            continue
+    # Sort newest first by created_at.
+    def parsed_ts(v: dict) -> datetime:
         try:
-            created = datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
+            return datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
         except (KeyError, ValueError):
+            return datetime.min.replace(tzinfo=timezone.utc)
+    versions.sort(key=parsed_ts, reverse=True)
+    # Compute the keep-set: top-N date tags + top-N sha tags + always latest.
+    keep_dates: list[str] = []
+    keep_shas: list[str] = []
+    for v in versions:
+        ver = v.get("version") or ""
+        if DATE_TAG.match(ver) and len(keep_dates) < args.keep_latest:
+            keep_dates.append(ver)
+        elif SHA_TAG.match(ver) and len(keep_shas) < args.keep_latest:
+            keep_shas.append(ver)
+    keep = {"latest", *keep_dates, *keep_shas}
+    print(f" keep tags: {sorted(keep)}")
+    deleted = skipped_blob = skipped_age = skipped_keep = 0
+    for v in versions:
+        ver = v.get("version") or ""
+        ts = parsed_ts(v)
+        if BLOB_VER.match(ver):
+            skipped_blob += 1
             continue
-        if created >= cutoff:
+        if ver in keep:
+            skipped_keep += 1
             continue
-        version_id = v.get("id")
-        print(f" deleting v{version_id} tags={tags} created={v['created_at']}")
+        if ts >= cutoff:
+            skipped_age += 1
+            continue
+        print(f" deleting {ver!r} id={v.get('id')} created={v.get('created_at')}")
         if not args.dry_run:
             api(token, "DELETE",
-                f"/api/v1/packages/{args.owner}/container/{args.package}/versions/{version_id}")
+                f"/api/v1/packages/{args.owner}/container/{args.package}/{quote(ver, safe='')}")
         deleted += 1
-    print(f"done: {deleted} version(s) deleted")
+    print(f"done: deleted={deleted} kept_named={skipped_keep} "
+          f"kept_recent={skipped_age} skipped_blobs={skipped_blob}")
     return 0
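
The retention rules in the new version can be exercised without touching a registry. The sketch below pulls the keep-set and deletion decision into a pure function and runs it on made-up version records; `plan_deletions`, its `now` parameter, and the sample data are hypothetical names introduced here, while the three regexes and the timestamp fallback are copied from the diff.

```python
from datetime import datetime, timedelta, timezone
import re

DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")
SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")
BLOB_VER = re.compile(r"^sha256:")


def parsed_ts(v: dict) -> datetime:
    try:
        return datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
    except (KeyError, ValueError):
        # Unparseable timestamps sort as oldest, so they are never "recent".
        return datetime.min.replace(tzinfo=timezone.utc)


def plan_deletions(versions: list[dict], keep_days: int, keep_latest: int,
                   now: datetime) -> list[str]:
    """Hypothetical helper: return the version names the GC would delete."""
    cutoff = now - timedelta(days=keep_days)
    ordered = sorted(versions, key=parsed_ts, reverse=True)
    keep_dates: list[str] = []
    keep_shas: list[str] = []
    for v in ordered:
        ver = v.get("version") or ""
        if DATE_TAG.match(ver) and len(keep_dates) < keep_latest:
            keep_dates.append(ver)
        elif SHA_TAG.match(ver) and len(keep_shas) < keep_latest:
            keep_shas.append(ver)
    keep = {"latest", *keep_dates, *keep_shas}
    doomed = []
    for v in ordered:
        ver = v.get("version") or ""
        # Same three guards as the script: blobs, named keeps, recent pushes.
        if BLOB_VER.match(ver) or ver in keep or parsed_ts(v) >= cutoff:
            continue
        doomed.append(ver)
    return doomed


# Made-up registry listing for illustration only.
sample_versions = [
    {"version": "latest", "created_at": "2026-05-20T00:00:00Z"},
    {"version": "2026.05.20", "created_at": "2026-05-20T00:00:00Z"},
    {"version": "2026.03.01", "created_at": "2026-03-01T00:00:00Z"},
    {"version": "2026.01.10", "created_at": "2026-01-10T00:00:00Z"},
    {"version": "abc1234", "created_at": "2026-05-20T00:00:00Z"},
    {"version": "deadbee", "created_at": "2025-12-01T00:00:00Z"},
    {"version": "sha256:0f3a", "created_at": "2025-01-01T00:00:00Z"},
]
doomed = plan_deletions(sample_versions, keep_days=90, keep_latest=1,
                        now=datetime(2026, 5, 22, tzinfo=timezone.utc))
print(doomed)
```

With `keep_latest=1` and a 90-day window, only the January date tag and the December SHA tag are selected: the March tag is outside the keep-set but still inside the window, and the `sha256:` entry is skipped regardless of age. One consequence of the timestamp fallback is that a version with a missing or malformed `created_at` is treated as arbitrarily old, so it is deleted unless its name is in the keep-set.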
+120
@@ -0,0 +1,120 @@
"""Minimal HTTP reranker — `/v1/rerank` endpoint over a sentence-transformers CrossEncoder.

Matches the Cohere `/v1/rerank` request/response shape, which is what the
server's `_rerank()` helper expects. This is the dev-friendly fallback;
production replaces this with the llama.cpp + jina-reranker-v2-base GGUF
sidecar (see deploy/docker-compose.yml) without changing the client.

Request:
    POST /v1/rerank
    {"model": "...", "query": "...", "documents": ["text", ...], "top_n": 10}

Response:
    {"model": "...", "results": [{"index": 0, "relevance_score": 0.93}, ...]}

Usage:
    python -m scripts.rerank_server  # localhost:8001
    RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-12-v2 \\
        RERANK_PORT=8001 python -m scripts.rerank_server
"""
from __future__ import annotations

import json
import logging
import os
import sys
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

MODEL_NAME = os.environ.get("RERANK_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")
PORT = int(os.environ.get("RERANK_PORT", "8001"))
HOST = os.environ.get("RERANK_HOST", "127.0.0.1")
# Truncate docs to this many chars before scoring. jina-reranker GGUF has a
# 1024-token per-pair cap that 400s the entire batch; ms-marco is more
# forgiving but we still cap to keep latency predictable.
MAX_DOC_CHARS = int(os.environ.get("RERANK_MAX_DOC_CHARS", "2000"))

_model = None


def _get_model():
    global _model
    if _model is None:
        from sentence_transformers import CrossEncoder
        log.info("loading %s", MODEL_NAME)
        _model = CrossEncoder(MODEL_NAME)
        log.info("loaded")
    return _model


def _rerank(query: str, documents: list[str], top_n: int | None) -> list[dict]:
    model = _get_model()
    pairs = [[query, (d or "")[:MAX_DOC_CHARS]] for d in documents]
    scores = model.predict(pairs)
    ranked = sorted(
        ({"index": i, "relevance_score": float(s)} for i, s in enumerate(scores)),
        key=lambda r: -r["relevance_score"],
    )
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked

class Handler(BaseHTTPRequestHandler):
    def log_message(self, fmt, *args):
        log.info("%s - %s", self.address_string(), fmt % args)

    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):  # noqa: N802
        if self.path in ("/", "/health"):
            self._send_json(200, {"status": "ok", "model": MODEL_NAME})
            return
        self._send_json(404, {"error": "not found"})

    def do_POST(self):  # noqa: N802
        if self.path not in ("/v1/rerank", "/rerank"):
            self._send_json(404, {"error": "not found"})
            return
        try:
            # Parse Content-Length inside the try so a non-numeric header
            # yields a 400 instead of an unhandled exception.
            length = int(self.headers.get("Content-Length", "0"))
            req = json.loads(self.rfile.read(length).decode())
        except Exception as e:
            self._send_json(400, {"error": f"bad json: {e}"})
            return
        # Valid JSON that is not an object (e.g. a bare list) has no .get().
        if not isinstance(req, dict):
            self._send_json(400, {"error": "expected {query: str, documents: list[str]}"})
            return
        query = req.get("query")
        documents = req.get("documents")
        if not isinstance(query, str) or not isinstance(documents, list):
            self._send_json(400, {"error": "expected {query: str, documents: list[str]}"})
            return
        top_n = req.get("top_n")
        try:
            results = _rerank(query, documents, top_n if isinstance(top_n, int) else None)
        except Exception as e:
            log.exception("rerank failed")
            self._send_json(500, {"error": str(e)})
            return
        self._send_json(200, {"model": MODEL_NAME, "results": results})

def main() -> int:
    _get_model()  # warm-load before accepting traffic
    server = ThreadingHTTPServer((HOST, PORT), Handler)
    log.info("listening on http://%s:%d", HOST, PORT)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        log.info("shutting down")
    return 0


if __name__ == "__main__":
    sys.exit(main())
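
The response shape is easier to see with the model taken out of the picture. The sketch below keeps the truncate, score, sort, and `top_n` steps of `_rerank()` but swaps the CrossEncoder for a toy token-overlap scorer so it runs with no model download. `overlap_score` and the sample documents are hypothetical stand-ins, not part of the server.

```python
from __future__ import annotations

MAX_DOC_CHARS = 2000


def overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of query tokens found in doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def rerank(query: str, documents: list, top_n: int | None = None) -> list[dict]:
    # Same pipeline as _rerank(): truncate each doc (None becomes ""),
    # score every (query, doc) pair, sort by descending score, cut to top_n.
    pairs = [(query, (d or "")[:MAX_DOC_CHARS]) for d in documents]
    scores = [overlap_score(q, d) for q, d in pairs]
    ranked = sorted(
        ({"index": i, "relevance_score": float(s)} for i, s in enumerate(scores)),
        key=lambda r: -r["relevance_score"],
    )
    return ranked if top_n is None else ranked[:top_n]


docs = ["install the license key", "configure backups", None, "license renewal"]
results = rerank("license key", docs, top_n=2)
print(results)
```

Each result carries the `index` of the document in the request's `documents` list rather than the text itself, so a client maps results back to its own candidates by position. Because the sort is stable, documents with equal scores keep their request order.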