build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13

Initial scaffold: a clone of docs-mcp-template with the full
HVM-validated stack ported across and customized for Morpheus
Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).

Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0  sd00007510en_us  568 pages (Feb 2026)
* morpheus_user_manual_8_1_1  sd00007621en_us  569 pages (Mar 2026)
* morpheus_user_manual_8_1_2  sd00007732en_us  569 pages (Apr 2026)
* morpheus_release_notes_8_1_0  sd00007496en_us  single-doc
* morpheus_release_notes_8_1_1  sd00007610en_us  single-doc
* morpheus_release_notes_8_1_2  sd00007733en_us  single-doc
* morpheus_quickspecs            a50009231enw     html-file (live
  curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured —
  S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus
  services SKUs HA124A1#V38/V39 and H46SBA1).

No Deployment Guide or Qualification Matrix on HPE Support for
Morpheus Enterprise specifically — the only QM (sd00006551en_us)
covers HVM clusters managed by Morpheus and lives in hvm-docs.

Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py — including the
  MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py — 11 MCP tools, BM25-default search,
  cross-encoder rerank, hybrid behind HYBRID_SEARCH=true,
  morpheus_api_lessons (renamed from hvm_api_lessons), env-gated
  submit_doc_bug
* docs_mcp/api_lessons.md — Morpheus-specific scaffold covering
  licensing model, HVM elevation path, REST vs Plugin API, with
  TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py — TOC + single-doc
  + html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder
  queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml — same Gitea Actions
  setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml — morpheus-docs-mcp service definition,
  shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt

Verified locally: scrape produced 1599 .md pages (some TOC entries
are parent-only and yield no body), 6353 chunks all under the
4000-character cap, MCP server boots and lists 11 tools cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:26:24 -04:00
parent 43728320bf
commit fa448f94e1
22 changed files with 2822 additions and 247 deletions
+65 -44
@@ -14,21 +14,17 @@ on:
workflow_dispatch: workflow_dispatch:
env: env:
REGISTRY_PUSH: <lan-host>:<port> # PUSH goes to the LAN endpoint (HTTP) to bypass Cloudflare's 100 MB
REGISTRY_PULL: <public-registry-hostname> # body cap. PULL uses the public hostname (HTTPS). Same Gitea registry.
# Image name derives from the actual repo at runtime, so a clone REGISTRY_PUSH: 192.168.0.2:1234
# doesn't need to find/replace anything. e.g. justin/my-product-docs. REGISTRY_PULL: git.jpaul.io
# github.* context is Gitea Actions' inherited GitHub-Actions namespace
# — values come from the Gitea server, not github.com.
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }} IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
OLLAMA_URL: http://<gpu-host>:11434 # Two GPU-pinned Ollama containers on the Gitea host — same infra
# zerto-docs uses. :11435 = Titan X, :11436 = 1080 Ti. Indexer
# round-robins per batch.
OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
EMBED_MODEL: nomic-embed-text EMBED_MODEL: nomic-embed-text
# PRODUCT_NAME defaults to the repo name so a clone works without PRODUCT_NAME: morpheus
# editing. Override here if you want a different identifier (e.g.
# repo "my-product-docs" → PRODUCT_NAME "myproduct"). Used as the
# Chroma collection name, BM25 db filename, and MCP server name —
# see docs_mcp/server.py.
PRODUCT_NAME: ${{ github.event.repository.name }}
jobs: jobs:
build: build:
@@ -39,8 +35,7 @@ jobs:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v4
with: with:
# Full history (not shallow) so the digest-history step can # Full history so digest-history can walk git log.
# walk git log up to --history-days back.
fetch-depth: 0 fetch-depth: 0
- name: Set up Python - name: Set up Python
@@ -54,9 +49,8 @@ jobs:
python -m pip install -q -r requirements.txt python -m pip install -q -r requirements.txt
- name: Refresh digest history - name: Refresh digest history
# Cheap (a few seconds); doesn't touch corpus content. # Cheap (few seconds). Without this step, a code-only deploy
# Without this step, a code-only deploy would ship an # would ship an increasingly-stale digest history.
# increasingly-stale digest history relative to git.
run: | run: |
mkdir -p corpus/.digest mkdir -p corpus/.digest
python -m scrape.changelog \ python -m scrape.changelog \
@@ -71,42 +65,69 @@ jobs:
- name: Rebuild indexes from existing corpus - name: Rebuild indexes from existing corpus
run: python -m rag.index --rebuild run: python -m rag.index --rebuild
- name: Log in to registry (LAN endpoint) - name: Set up Docker Buildx
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin uses: docker/setup-buildx-action@v3
with:
# LAN registry is HTTP only.
config-inline: |
[registry."192.168.0.2:1234"]
http = true
insecure = true
- name: Build & push image - name: Configure registry credentials for buildx
env:
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
REGISTRY_USER: ${{ github.actor }}
run: | run: |
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12) mkdir -p ~/.docker
DATE_TAG=$(date -u +%Y.%m.%d) AUTH=$(printf '%s:%s' "$REGISTRY_USER" "$REGISTRY_TOKEN" | base64 -w0)
docker build \ cat > ~/.docker/config.json <<EOF
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \ {
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \ "auths": {
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \ "192.168.0.2:1234": {
. "auth": "$AUTH"
docker push "${REGISTRY_PUSH}/${IMAGE}:latest" }
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" }
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" }
EOF
- name: Compute tags
id: meta
uses: docker/metadata-action@v5
with:
images: 192.168.0.2:1234/${{ github.repository_owner }}/${{ github.event.repository.name }}
tags: |
type=raw,value=latest
type=sha,prefix=,format=short
type=raw,value={{date 'YYYY.MM.DD'}}
labels: |
org.opencontainers.image.source=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
org.opencontainers.image.url=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
- name: Build & push (amd64)
uses: docker/build-push-action@v6
with:
context: .
platforms: linux/amd64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
- name: Link container package to this repo - name: Link container package to this repo
# Gitea container packages are owned by a USER, not a repo —
# they don't auto-appear under the repo's Packages tab.
# This API call creates the association. One-time-effective:
# re-running returns 400 once linked, which we swallow.
# Endpoint requires Gitea 1.21+.
env: env:
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }} GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
run: | run: |
OWNER="${{ github.repository_owner }}" OWNER="${{ github.repository_owner }}"
PKG="${{ github.event.repository.name }}" PKG="${{ github.event.repository.name }}"
BODY=$(mktemp) code=$(curl -s -o /tmp/link.out -w "%{http_code}" -X POST \
CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
-H "Authorization: token ${GITEA_TOKEN}" \ -H "Authorization: token ${GITEA_TOKEN}" \
"https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}") "https://git.jpaul.io/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
echo "link http=$CODE body=$(cat "$BODY")" echo "link ${OWNER}/container/${PKG} -> ${PKG}: HTTP ${code}"
case "$CODE" in body=$(cat /tmp/link.out)
201) echo "linked package to ${OWNER}/${PKG}" ;; case "$code" in
400) echo "already linked (re-link returns 400) — ok" ;; 201) echo "OK — newly linked" ;;
*) echo "unexpected status $CODE"; exit 1 ;; 400|409) echo "OK — already linked: ${body}" ;;
*) echo "unexpected: ${body}"; exit 1 ;;
esac esac
- name: Prune old container versions - name: Prune old container versions
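The OLLAMA_URLS comment in the workflow above says the indexer round-robins per batch across the two GPU-pinned Ollama endpoints. A minimal sketch of that assignment, assuming a comma-separated OLLAMA_URLS value; `parse_urls` and `assign_batches` are hypothetical helpers, not the actual rag/index.py code:

```python
import itertools


def parse_urls(raw: str) -> list[str]:
    """Split a comma-separated OLLAMA_URLS value into clean endpoint URLs."""
    return [u.strip() for u in raw.split(",") if u.strip()]


def assign_batches(batches: list[list[str]], urls: list[str]) -> list[tuple[str, list[str]]]:
    """Pair each batch with an endpoint, cycling through the endpoints in order."""
    if not urls:
        raise ValueError("OLLAMA_URLS must list at least one endpoint")
    # zip stops when the batches run out, so the infinite cycle is safe here.
    return list(zip(itertools.cycle(urls), batches))


if __name__ == "__main__":
    endpoints = parse_urls("http://192.168.0.2:11435,http://192.168.0.2:11436")
    for url, batch in assign_batches([["chunk-1"], ["chunk-2"], ["chunk-3"]], endpoints):
        print(url, batch)
```

With two endpoints, odd-numbered batches land on the first URL and even-numbered batches on the second, which is what lets both cards work in parallel.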
+92 -52
@@ -19,27 +19,25 @@ on:
default: false default: false
env: env:
# If your registry sits behind Cloudflare with its 100 MB body cap, # PUSH goes to the LAN endpoint (HTTP) to bypass Cloudflare Tunnel's
# use a LAN endpoint for pushes (bypasses CF) and the public hostname # 100 MB body cap. PULL uses the public hostname (HTTPS). Same Gitea
# for pulls (response bodies aren't capped). # registry either way — package lands under the same owner/repo.
REGISTRY_PUSH: <lan-host>:<port> REGISTRY_PUSH: 192.168.0.2:1234
REGISTRY_PULL: <public-registry-hostname> REGISTRY_PULL: git.jpaul.io
# Image name derives from the actual repo at runtime, so a clone
# doesn't need to find/replace anything. e.g. justin/my-product-docs. # Image name derives from the repo at runtime — clones don't need to
# github.* context is Gitea Actions' inherited GitHub-Actions namespace # edit this. github.* is the Gitea-Actions inherited namespace.
# — values come from the Gitea server, not github.com.
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }} IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
# Embedder. One URL per GPU; the indexer round-robins. # Two GPU-pinned Ollama containers on the Gitea host — same infra
OLLAMA_URL: http://<gpu-host>:11434 # zerto-docs uses (deploy/ollama-rag.docker-compose.yml over there).
# :11435 owns the Titan X, :11436 owns the 1080 Ti; the indexer
# round-robins per batch so both cards run in parallel. The host's
# primary Ollama on :11434 is left alone for OpenWebUI etc.
OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
EMBED_MODEL: nomic-embed-text EMBED_MODEL: nomic-embed-text
# PRODUCT_NAME defaults to the repo name so a clone works without PRODUCT_NAME: morpheus
# editing. Override here if you want a different identifier (e.g.
# repo "my-product-docs" → PRODUCT_NAME "myproduct"). Used as the
# Chroma collection name, BM25 db filename, and MCP server name —
# see docs_mcp/server.py.
PRODUCT_NAME: ${{ github.event.repository.name }}
jobs: jobs:
refresh: refresh:
@@ -50,10 +48,12 @@ jobs:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v4
with: with:
# Full history — required for the digest-history step to # Full history — required for digest-history. Default depth 1
# walk git log. Default fetch-depth: 1 silently produces a # silently produces a 0-byte history file.
# 0-byte history file.
fetch-depth: 0 fetch-depth: 0
# Set the credentials Gitea injects so we can push corpus
# commits back. Persist them across the run.
token: ${{ secrets.GITEA_TOKEN }}
- name: Set up Python - name: Set up Python
uses: actions/setup-python@v5 uses: actions/setup-python@v5
@@ -89,8 +89,8 @@ jobs:
- name: Commit corpus changes (if any) - name: Commit corpus changes (if any)
id: commit id: commit
run: | run: |
git config user.name "<product>-docs-refresh" git config user.name "hvm-docs-refresh"
git config user.email "actions@<your-domain>" git config user.email "actions@jpaul.io"
git add bundles.json corpus git add bundles.json corpus
if git diff --cached --quiet; then if git diff --cached --quiet; then
echo "no corpus changes — skipping reindex and image build" echo "no corpus changes — skipping reindex and image build"
@@ -132,49 +132,89 @@ jobs:
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
run: python -m rag.index --rebuild run: python -m rag.index --rebuild
# ---- Build & push image ------------------------------------ # ---- Build & push image (LAN endpoint, buildx) -------------
- name: Log in to registry (LAN endpoint) - name: Set up Docker Buildx
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin uses: docker/setup-buildx-action@v3
with:
# LAN registry is HTTP only. Buildkit needs an explicit
# insecure-registry config or it tries to upgrade to HTTPS.
config-inline: |
[registry."192.168.0.2:1234"]
http = true
insecure = true
- name: Build & push image - name: Configure registry credentials for buildx
# Can't use docker/login-action against the LAN endpoint —
# the host docker daemon errors on HTTP-vs-HTTPS. Buildx reads
# ~/.docker/config.json directly, so write the auth ourselves.
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
# Runner shell is /bin/sh — use cut instead of ${VAR::N}. env:
# Three tags: :latest (Watchtower target), :<sha12> REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
# (rollback pin), :<YYYY.MM.DD> (human-readable). REGISTRY_USER: ${{ github.actor }}
run: | run: |
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12) mkdir -p ~/.docker
DATE_TAG=$(date -u +%Y.%m.%d) AUTH=$(printf '%s:%s' "$REGISTRY_USER" "$REGISTRY_TOKEN" | base64 -w0)
docker build \ cat > ~/.docker/config.json <<EOF
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \ {
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \ "auths": {
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \ "192.168.0.2:1234": {
. "auth": "$AUTH"
docker push "${REGISTRY_PUSH}/${IMAGE}:latest" }
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" }
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" }
EOF
- name: Compute tags
id: meta
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
uses: docker/metadata-action@v5
with:
# Tag with the LAN hostname so the push goes over LAN.
# docker-compose on the deploy host pulls via git.jpaul.io.
images: 192.168.0.2:1234/${{ github.repository_owner }}/${{ github.event.repository.name }}
tags: |
type=raw,value=latest
type=sha,prefix=,format=short
type=schedule,pattern={{date 'YYYY.MM.DD'}}
type=raw,value={{date 'YYYY.MM.DD'}}
# Override auto-derived labels with the PUBLIC URL so Gitea
# can auto-link the package back to this repo.
labels: |
org.opencontainers.image.source=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
org.opencontainers.image.url=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
- name: Build & push (amd64)
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
uses: docker/build-push-action@v6
with:
context: .
platforms: linux/amd64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
- name: Link container package to this repo - name: Link container package to this repo
# Gitea container packages are owned by a USER, not a repo # Idempotent linkage so the package shows under the repo's
# they don't auto-appear under the repo's Packages tab. # Packages tab. Gitea's auto-link from the source label is
# This API call creates the association. One-time-effective: # unreliable in this setup (the runner reports an internal
# re-running returns 400 once linked, which we swallow. # server URL), so we link explicitly. 201 = newly linked,
# Endpoint requires Gitea 1.21+. # 400 = already linked (treated as success).
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
env: env:
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }} GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
run: | run: |
OWNER="${{ github.repository_owner }}" OWNER="${{ github.repository_owner }}"
PKG="${{ github.event.repository.name }}" PKG="${{ github.event.repository.name }}"
BODY=$(mktemp) code=$(curl -s -o /tmp/link.out -w "%{http_code}" -X POST \
CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
-H "Authorization: token ${GITEA_TOKEN}" \ -H "Authorization: token ${GITEA_TOKEN}" \
"https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}") "https://git.jpaul.io/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
echo "link http=$CODE body=$(cat "$BODY")" echo "link ${OWNER}/container/${PKG} -> ${PKG}: HTTP ${code}"
case "$CODE" in body=$(cat /tmp/link.out)
201) echo "linked package to ${OWNER}/${PKG}" ;; case "$code" in
400) echo "already linked (re-link returns 400) — ok" ;; 201) echo "OK — newly linked" ;;
*) echo "unexpected status $CODE"; exit 1 ;; 400|409) echo "OK — already linked: ${body}" ;;
*) echo "unexpected: ${body}"; exit 1 ;;
esac esac
# ---- Registry GC ------------------------------------------- # ---- Registry GC -------------------------------------------
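The "Configure registry credentials for buildx" step above writes ~/.docker/config.json by hand, base64-encoding `user:token` into an `auths` entry. The same structure, sketched in Python for illustration only; `build_docker_config` is a hypothetical helper and the credentials in the usage lines are placeholders:

```python
import base64
import json


def build_docker_config(registry: str, user: str, token: str) -> str:
    """Return Docker config.json text with a basic-auth entry for one registry."""
    # Docker stores basic auth as base64("user:token") under auths[registry].auth.
    auth = base64.b64encode(f"{user}:{token}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": auth}}}, indent=2)


if __name__ == "__main__":
    # Placeholder credentials; the workflow reads the real ones from secrets.
    print(build_docker_config("192.168.0.2:1234", "example-user", "example-token"))
```

Building the JSON with a serializer rather than a heredoc avoids quoting problems if a token ever contains characters that are special in JSON.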
+119
@@ -0,0 +1,119 @@
[
{
"slug": "morpheus_user_manual_8_1_0",
"doc_id": "sd00007510en_us",
"title": "HPE Morpheus Enterprise Software Documentation v8.1.0",
"version": "8.1.0",
"platform": null,
"product": "User Manual",
"language": "en-US",
"page_count": 568,
"mode": "toc",
"abstract": "",
"dates": {
"Published": "February 2026"
},
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007510en_us"
},
{
"slug": "morpheus_user_manual_8_1_1",
"doc_id": "sd00007621en_us",
"title": "HPE Morpheus Enterprise Software Documentation v8.1.1",
"version": "8.1.1",
"platform": null,
"product": "User Manual",
"language": "en-US",
"page_count": 569,
"mode": "toc",
"abstract": "",
"dates": {
"Published": "March 2026"
},
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007621en_us"
},
{
"slug": "morpheus_user_manual_8_1_2",
"doc_id": "sd00007732en_us",
"title": "HPE Morpheus Enterprise Software Documentation v8.1.2",
"version": "8.1.2",
"platform": null,
"product": "User Manual",
"language": "en-US",
"page_count": 569,
"mode": "toc",
"abstract": "",
"dates": {
"Published": "April 2026"
},
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007732en_us"
},
{
"slug": "morpheus_release_notes_8_1_0",
"doc_id": "sd00007496en_us",
"title": "v8.1.0 Release Notes",
"version": "8.1.0",
"platform": null,
"product": "Release Notes",
"language": "en-US",
"page_count": 1,
"mode": "single",
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.0",
"dates": {
"Published": "February 2026"
},
"landing_page": "sd00007496en_us",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007496en_us"
},
{
"slug": "morpheus_release_notes_8_1_1",
"doc_id": "sd00007610en_us",
"title": "v8.1.1 Release Notes",
"version": "8.1.1",
"platform": null,
"product": "Release Notes",
"language": "en-US",
"page_count": 1,
"mode": "single",
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.1",
"dates": {
"Published": "March 2026"
},
"landing_page": "sd00007610en_us",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007610en_us"
},
{
"slug": "morpheus_release_notes_8_1_2",
"doc_id": "sd00007733en_us",
"title": "v8.1.2 Release Notes",
"version": "8.1.2",
"platform": null,
"product": "Release Notes",
"language": "en-US",
"page_count": 1,
"mode": "single",
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.2",
"dates": {
"Published": "April 2026"
},
"landing_page": "sd00007733en_us",
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007733en_us"
},
{
"slug": "morpheus_quickspecs",
"doc_id": "a50009231enw",
"title": "HPE Morpheus Enterprise Software QuickSpecs",
"version": "v1",
"platform": null,
"product": "QuickSpecs",
"language": "en-US",
"page_count": 1,
"mode": "html-file",
"abstract": "",
"dates": {},
"landing_page": "a50009231enw",
"source_url": "https://www.hpe.com/psnow/doc/a50009231enw"
}
]
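The commit message cites 1710 cataloged pages in total. That figure can be cross-checked against the `page_count` fields in the catalog above. A small sketch, with the slugs and counts copied from those entries; the inline dict stands in for reading bundles.json from disk:

```python
# Page counts copied from the bundles.json entries above.
PAGE_COUNTS = {
    "morpheus_user_manual_8_1_0": 568,
    "morpheus_user_manual_8_1_1": 569,
    "morpheus_user_manual_8_1_2": 569,
    "morpheus_release_notes_8_1_0": 1,
    "morpheus_release_notes_8_1_1": 1,
    "morpheus_release_notes_8_1_2": 1,
    "morpheus_quickspecs": 1,
}


def total_pages(counts: dict[str, int]) -> int:
    """Sum the cataloged page counts across all bundles."""
    return sum(counts.values())


if __name__ == "__main__":
    print(total_pages(PAGE_COUNTS))
```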
+23 -17
@@ -1,6 +1,6 @@
# Hosting stack for a docs MCP server. # Hosting stack for a docs MCP server.
# #
# Replace <product> below with your product name on first deploy. # Replace hvm below with your product name on first deploy.
# Volumes: usage logs are mounted to a host path so they survive # Volumes: usage logs are mounted to a host path so they survive
# Watchtower-driven container recreates. # Watchtower-driven container recreates.
# #
@@ -10,15 +10,15 @@
services: services:
# The MCP server. Watchtower auto-pulls on :latest changes. # The MCP server. Watchtower auto-pulls on :latest changes.
<product>-docs-mcp: morpheus-docs-mcp:
image: <registry>/<owner>/<product>-docs-mcp:latest image: git.jpaul.io/justin/morpheus-docs:latest
container_name: <product>-docs-mcp container_name: morpheus-docs-mcp
restart: unless-stopped restart: unless-stopped
ports: ports:
- "8000:8000" - "8000:8000"
environment: environment:
PRODUCT_NAME: "<product>" PRODUCT_NAME: "morpheus"
PRODUCT_DOCS_URL: "https://docs.example.com" PRODUCT_DOCS_URL: "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007732en_us"
# Streamable-HTTP transport. Stateless mode is required for # Streamable-HTTP transport. Stateless mode is required for
# production: clients don't lose sessions when Watchtower # production: clients don't lose sessions when Watchtower
@@ -28,19 +28,21 @@ services:
MCP_PORT: "8000" MCP_PORT: "8000"
# If you run MetaMCP or another gateway in front and reach # If you run MetaMCP or another gateway in front and reach
# this container via its compose DNS name (e.g. <product>-docs-mcp:8000), # this container via its compose DNS name (e.g. morpheus-docs-mcp:8000),
# add that hostname here. "*" disables the rebind check entirely. # add that hostname here. "*" disables the rebind check entirely.
MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1" MCP_ALLOWED_HOSTS: "morpheus-docs-mcp,localhost,127.0.0.1"
# Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp). # Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
RERANK_URL: http://<product>-rerank:8080 RERANK_URL: http://hvm-rerank:8080
RERANK_POOL: "200" RERANK_POOL: "200"
RERANK_TIMEOUT: "30" RERANK_TIMEOUT: "30"
# Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true # Phase 8 — hybrid retrieval (BM25 + dense + RRF).
# only after the eval harness shows the dense-only path # Eval on the HVM corpus (eval/results/baseline.md, 2026-05-22) shows
# missing technical-term queries that BM25 catches. # BM25-default + reranker beats hybrid on every metric (MRR 0.920 vs
HYBRID_SEARCH: "true" # 0.875). Leaving HYBRID_SEARCH off so search_docs runs BM25-first +
# reranker; dense is the fallback when BM25 finds nothing.
HYBRID_SEARCH: "false"
# Phase 10 — usage telemetry. # Phase 10 — usage telemetry.
USAGE_LOG_DIR: /app/var/logs USAGE_LOG_DIR: /app/var/logs
@@ -52,9 +54,9 @@ services:
# DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback" # DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
volumes: volumes:
# Usage logs persist across container recreates. # Usage logs persist across container recreates.
- ./<product>-docs-mcp-logs:/app/var/logs - ./morpheus-docs-mcp-logs:/app/var/logs
depends_on: depends_on:
- <product>-rerank - hvm-rerank
labels: labels:
# Watchtower polls *only* containers with this label set true. # Watchtower polls *only* containers with this label set true.
com.centurylinklabs.watchtower.enable: "true" com.centurylinklabs.watchtower.enable: "true"
@@ -63,9 +65,13 @@ services:
# Reranker sidecar — llama.cpp serving jina-reranker-v2-base. # Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
# Requires GPU access; adjust runtime/devices for your hardware. # Requires GPU access; adjust runtime/devices for your hardware.
<product>-rerank: #
# For dev / CPU-only hosts, swap this service for scripts/rerank_server.py
# (sentence-transformers ms-marco-MiniLM-L-6-v2). Same /v1/rerank shape,
# ~500ms/batch on CPU vs ~50ms on GPU with the jina GGUF.
hvm-rerank:
image: ghcr.io/ggml-org/llama.cpp:server-cuda image: ghcr.io/ggml-org/llama.cpp:server-cuda
container_name: <product>-rerank container_name: hvm-rerank
restart: unless-stopped restart: unless-stopped
# Mount the GGUF model from the host. Download from huggingface # Mount the GGUF model from the host. Download from huggingface
# (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first. # (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
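The Phase 8 comment in the compose file above describes hybrid retrieval as BM25 + dense + RRF. For readers unfamiliar with reciprocal rank fusion, a minimal sketch follows. The constant `k=60` is the value commonly seen in descriptions of RRF and is an assumption here, not something read from this repo's code; `rrf_fuse` is a hypothetical name:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on the ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]
    dense_hits = ["b", "d", "a"]
    print(rrf_fuse([bm25_hits, dense_hits]))
```

IDs that appear near the top of both lists accumulate the largest scores, which is why fusion can help when the two retrievers disagree and hurt when one of them is mostly noise.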
+148
@@ -0,0 +1,148 @@
# HPE Morpheus Enterprise — Lessons
Notes and gotchas about running, integrating with, and licensing
**HPE Morpheus Enterprise Software** that aren't obvious from the
official docs alone. The official User Manual + Release Notes +
QuickSpecs describe the product as designed; this file is what
experienced operators actually learn.
> Treat this as living context. Update it when you (or the LLM
> driving this MCP) discover something non-obvious that the docs
> don't say or don't make findable. Each section is an H2 so the
> `morpheus_api_lessons(topic=...)` tool can return just the
> relevant piece.
## TL;DR
- **Morpheus Enterprise is the full cloud-management platform.** HPE
Morpheus VM Essentials (HVM) is the VM-only subset; Morpheus
Enterprise is what you "elevate to" when you need multi-cloud,
containers, automation, policy, FinOps, ITSM integration, and
self-service catalogs. The relationship is one-way upgrade.
- **Licensing is per physical CPU socket** on connected on-prem
clouds (bare metal, hypervisor hosts, Kubernetes worker nodes).
Public-cloud workloads (AWS / Azure / GCP / OCI) are factored at
**15 workloads per socket** equivalent.
- **All license SKUs include Tech Care Essentials 24×7** as part
of the license cost. There is no separate purchase for support
on the license tier.
- **`morpheus_quickspecs` is the source of truth for SKUs.** Don't
guess part numbers; query the QuickSpecs bundle.
## Licensing and SKUs
**Source of truth: the `morpheus_quickspecs` bundle.** Query it for
the current SKU list — the catalog updates more often than this
file does.
Pricing model summary (from QuickSpecs v1, 2026):
- **Per physical CPU socket** for connected on-prem clouds —
KVM/HVM hosts, VMware ESXi hosts, bare metal servers, Kubernetes
worker nodes. Count **sockets**, not cores and not VMs.
- **Public cloud workloads factor at 15:1** — one socket of license
covers up to 15 public-cloud workloads (instances) across AWS,
Azure, GCP, OCI.
- **Term-based** licensing (not perpetual). 1, 3, and 5-year terms
on E-LTU SKUs.
- **All include HPE Tech Care Essentials** (24×7 support, 15-minute
response for severity-1) bundled into the license cost.
> The exact ratios and SKU names can change between QuickSpecs
> revisions. Use the `morpheus_quickspecs` tool / bundle for current
> values rather than memorizing.
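The sizing rule above reduces to simple arithmetic. A sketch, assuming on-prem sockets and public-cloud socket equivalents are simply added together and that a partial group of 15 workloads rounds up to a full socket. Both assumptions should be checked against the current QuickSpecs, and `required_socket_licenses` is a hypothetical helper:

```python
WORKLOADS_PER_SOCKET = 15  # ratio quoted above from QuickSpecs v1; may change between revisions


def required_socket_licenses(onprem_sockets: int, public_cloud_workloads: int) -> int:
    """Estimate socket licenses: on-prem sockets plus cloud workloads at 15 per socket."""
    if onprem_sockets < 0 or public_cloud_workloads < 0:
        raise ValueError("counts must be non-negative")
    # Integer ceiling division: 1..15 workloads -> 1 socket, 16..30 -> 2, and so on.
    cloud_equivalent = (public_cloud_workloads + WORKLOADS_PER_SOCKET - 1) // WORKLOADS_PER_SOCKET
    return onprem_sockets + cloud_equivalent


if __name__ == "__main__":
    # 8 physical sockets on-prem plus 40 public-cloud instances.
    print(required_socket_licenses(8, 40))
```

Under these assumptions, 40 public-cloud workloads count as 3 socket equivalents, so the example estate needs 11 socket licenses.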
## Elevation from HVM
The "elevate to Morpheus Enterprise" path is the canonical journey
for customers who started on HVM and outgrew it:
- **HVM clusters keep working unchanged after elevation.** You
don't redeploy the manager; you upgrade in place using a
Morpheus Enterprise license.
- **What changes:** the manager UI unlocks the full Enterprise
feature set — public-cloud integrations, container/Kubernetes
management, blueprints/catalogs, automation workflows, policy
engine, FinOps cost dashboards, ITSM connectors (ServiceNow etc.),
and the full REST API surface.
- **Existing HVM-tier work products survive the elevation:**
Instance backups, network pools, storage providers, user
accounts, integrations, scheduled jobs, etc.
The HVM User Manual page `Elevating to HPE Morpheus Enterprise`
(GUID-ECCA4FDD-37C8-45CE-A71F-C6E73B3BA713) walks the procedure.
See also the sibling `hvm-docs` MCP's
`hvm_user_manual_8_1_*` bundles.
## API surface — Plugin vs REST
Morpheus exposes two completely separate extensibility surfaces:
- **REST API** at `https://<manager>/api/` — external automation
and integration. Bearer-token authentication; tokens issued from
the user profile → API tokens UI. Full Enterprise API surface
available (vs HVM-only managers which 404 on Enterprise-only
endpoints).
- **Plugin API** — server-side extensions that load INTO the
manager process. Versioned independently of the platform
(Plugin API version listed in the Release Notes for each
Morpheus version). A plugin built for Plugin API 1.3.x may not
load on 1.4.x without changes.
**TODO — fill in real operational lessons as we accumulate them.**
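A minimal sketch of the bearer-token pattern described for the REST API. It only constructs the request and does not send it. The manager hostname, the token, and the `instances` path in the usage lines are placeholders for illustration, not values taken from this repo, and `build_api_request` is a hypothetical helper:

```python
import urllib.request


def build_api_request(manager: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET request for https://<manager>/api/<path> with a bearer token."""
    url = f"https://{manager}/api/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


if __name__ == "__main__":
    # Placeholder host, path, and token; nothing is sent over the network here.
    req = build_api_request("morpheus.example.com", "instances", "example-token")
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call; an HVM-only manager is expected to return 404 for Enterprise-only endpoints, as noted above.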
## Multi-cloud onboarding
**TODO.** Each cloud (AWS, Azure, GCP, OCI, VMware vSphere, KVM/HVM,
OpenStack, Nutanix, etc.) has its own onboarding ritual: credentials,
networking, IAM roles, regions, storage providers, image catalogs.
Search the User Manual: `search_docs(query="Add AWS cloud
integration")`, `search_docs(query="Azure subscription cost")`, etc.
## Tenancy, RBAC, and groups
**TODO.** Morpheus Enterprise tenancy is one of the more complex areas
— tenants, roles, groups, account groups, persona-based access.
Lessons specific to "what surprised me" go here.
## Backups
**TODO.** Morpheus Enterprise inherits the backup framework HVM
introduced (Storage Buckets, Execution Schedules, Backup Jobs)
and adds: cloud-native backup integrations (AWS Backup, Azure
Backup), per-instance backup policies via the policy engine,
ServiceNow-driven backup orchestration. Document the gotchas you
hit.
## Common operational gotchas
**TODO.** This is where the "experienced operator hallway
conversation" notes go. Examples to seed (delete or replace as you
learn):
- **Service plan vs Instance type** — same concept, different
contexts. A service plan is the sizing template ("small / medium
/ large with these CPU/RAM"); an instance type is what you
provision FROM the plan. Operators conflate them.
- **Cloud integration credentials are tenant-scoped, not
global.** Adding a credential at the master tenant doesn't
cascade; sub-tenants need their own credential (or access granted
through the policy engine).
- **Policy engine vs Logic library** — both live under
Library/Automation, both can gate provisioning. Policies are
preventive (block bad config), logic is generative (run scripts
on lifecycle events). Pick the right tool.
## Adding to this doc
Two ways:
1. Manually edit `docs_mcp/api_lessons.md` in this repo and commit.
The next image build picks it up.
2. Use `submit_doc_bug` for upstream issues, and append the
takeaway here once the docs team responds.
The point of this doc is to surface the kind of context an
experienced operator would mention in a hallway conversation but
that doesn't quite fit anywhere in the formal product docs. Keep
sections tight — one H2 = one topic the LLM can return on demand.
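The "one H2 = one topic" convention makes topic lookup a simple split on headings. A sketch of how a tool like `morpheus_api_lessons(topic=...)` could select a section; `split_h2_sections` and `find_section` are hypothetical helpers, and the real docs_mcp/server.py implementation may differ:

```python
from __future__ import annotations


def split_h2_sections(markdown: str) -> dict[str, str]:
    """Map each H2 title to the text under it, up to the next H2."""
    sections: dict[str, list[str]] = {}
    current: str | None = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def find_section(markdown: str, topic: str) -> str | None:
    """Return the first section whose title contains topic, ignoring case."""
    needle = topic.lower()
    for title, body in split_h2_sections(markdown).items():
        if needle in title.lower():
            return body
    return None


if __name__ == "__main__":
    doc = "# Lessons\nintro\n## Licensing and SKUs\nPer socket.\n## Backups\nTODO.\n"
    print(find_section(doc, "licensing"))
```

This naive split treats any line starting with `## ` as a heading, so a fenced code block containing such a line would be split incorrectly; that is acceptable for a sketch but worth handling in real code.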
+1077 -40
File diff suppressed because it is too large.
+4
@@ -0,0 +1,4 @@
{"query": "what's the per-socket licensing model for Morpheus Enterprise", "expected": [{"bundle_id": "morpheus_quickspecs", "page_id": "a50009231enw"}], "tags": ["licensing", "skus"]}
{"query": "add an AWS cloud integration", "expected": [], "tags": ["cloud", "TODO-populate-after-first-scrape"]}
{"query": "Plugin API version compatibility", "expected": [], "tags": ["api", "TODO"]}
{"query": "Morpheus Enterprise 8.1.2 what's new", "expected": [{"bundle_id": "morpheus_release_notes_8_1_2", "page_id": "sd00007733en_us"}], "tags": ["release-notes"]}
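Each line of queries.jsonl pairs a query with its expected `(bundle_id, page_id)` hits. A sketch of how mean reciprocal rank could be scored against that shape, skipping placeholder rows whose `expected` list is empty. `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical helpers, the `retrieve` callable is assumed to return ranked `(bundle_id, page_id)` tuples, and the actual eval/run_eval.py may compute this differently:

```python
import json


def reciprocal_rank(retrieved: list[tuple[str, str]], expected: set[tuple[str, str]]) -> float:
    """Return 1/rank of the first expected page in the results, or 0.0 if none appears."""
    for rank, page in enumerate(retrieved, start=1):
        if page in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(jsonl_text: str, retrieve) -> float:
    """Average reciprocal rank over rows that have at least one expected page."""
    scores = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        expected = {(e["bundle_id"], e["page_id"]) for e in row["expected"]}
        if not expected:
            continue  # placeholder row, not yet populated
        scores.append(reciprocal_rank(retrieve(row["query"]), expected))
    return sum(scores) / len(scores) if scores else 0.0
```

Skipping unpopulated rows keeps the TODO placeholders from dragging the average toward zero before the expected pages are filled in.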
+118 -28
@@ -10,7 +10,7 @@ to one entry; the highest-ranked chunk's position wins).
 """
 from __future__ import annotations

-from typing import Protocol, Iterable
+from typing import Iterable, Protocol


 class Retriever(Protocol):
@@ -21,12 +21,17 @@ class Retriever(Protocol):
         ...


-def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
-    """Take a stream of (bundle_id, page_id, chunk_ordinal) and return
-    the first k unique pages in their first-seen order."""
+def _split_chunk_id(chunk_id: str) -> tuple[str, str, int]:
+    """`bundle::page::ordinal` -> (bundle, page, int(ordinal))."""
+    bid, pid, ordinal = chunk_id.split("::")
+    return bid, pid, int(ordinal)
+
+
+def _collapse_to_pages(chunk_ids: Iterable[str], k: int) -> list[tuple[str, str]]:
     seen: set[tuple[str, str]] = set()
     out: list[tuple[str, str]] = []
-    for bid, pid, _ord in chunk_ids:
+    for cid in chunk_ids:
+        bid, pid, _ord = _split_chunk_id(cid)
         key = (bid, pid)
         if key in seen:
             continue
@@ -37,26 +42,111 @@ def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> lis
     return out
-# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
-# in place. Each one is small (15-30 LOC). The eval harness imports
-# from this module by class name.
-#
-# class DenseRetriever:
-#     name = "dense"
-#     def __init__(self, collection): self.col = collection
-#     def retrieve(self, query, k=10): ...
-#
-# class RerankedRetriever:
-#     name = "dense+rerank"
-#     def __init__(self, collection, rerank_url, pool=200): ...
-#     def retrieve(self, query, k=10): ...
-#
-# class BM25Retriever:
-#     name = "bm25"
-#     def __init__(self, bm25_index): ...
-#     def retrieve(self, query, k=10): ...
-#
-# class HybridRetriever:
-#     name = "bm25+dense+rrf"
-#     def __init__(self, dense, bm25, k_rrf=60): ...
-#     def retrieve(self, query, k=10): ...
+class DenseRetriever:
+    """Chroma cosine search via the live embedding function."""
+    name = "dense"
+
+    def __init__(self, collection, pool: int = 50):
+        self.col = collection
+        self.pool = pool
+
+    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
+        res = self.col.query(query_texts=[query], n_results=self.pool)
+        ids = (res.get("ids") or [[]])[0]
+        return _collapse_to_pages(ids, k)
+
+
+class BM25Retriever:
+    """SQLite FTS5 lexical search."""
+    name = "bm25"
+
+    def __init__(self, bm25_index, pool: int = 200):
+        self.bm = bm25_index
+        self.pool = pool
+
+    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
+        hits = self.bm.query(query, n=self.pool)
+        return _collapse_to_pages((cid for cid, _score in hits), k)
class HybridRetriever:
"""Reciprocal Rank Fusion of dense + BM25 rankings."""
name = "hybrid_rrf"
def __init__(self, dense: DenseRetriever, bm25: BM25Retriever, k_rrf: int = 60, pool: int = 100):
self.dense = dense
self.bm25 = bm25
self.k_rrf = k_rrf
self.pool = pool
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
dense_pages = self.dense.retrieve(query, k=self.pool)
bm25_pages = self.bm25.retrieve(query, k=self.pool)
scores: dict[tuple[str, str], float] = {}
for rank, page in enumerate(dense_pages, start=1):
scores[page] = scores.get(page, 0.0) + 1.0 / (self.k_rrf + rank)
for rank, page in enumerate(bm25_pages, start=1):
scores[page] = scores.get(page, 0.0) + 1.0 / (self.k_rrf + rank)
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
return [page for page, _s in ranked[:k]]
def _rerank_pool(rerank_url: str, query: str, ids_and_texts: list[tuple[str, str]],
timeout: float = 30.0) -> list[str] | None:
"""POST to /v1/rerank, return ids in reranked order. None on failure."""
if not ids_and_texts:
return []
import httpx
try:
with httpx.Client(timeout=timeout) as c:
r = c.post(f"{rerank_url}/v1/rerank", json={
"query": query,
"documents": [(t or "")[:2000] for _i, t in ids_and_texts],
"top_n": len(ids_and_texts),
})
r.raise_for_status()
results = r.json().get("results") or []
return [ids_and_texts[item["index"]][0] for item in results
if isinstance(item.get("index"), int)
and 0 <= item["index"] < len(ids_and_texts)]
except Exception:
return None
class RerankedRetriever:
"""Pull a candidate pool via a base retriever, then cross-encoder re-rank."""
def __init__(self, base: Retriever, collection, rerank_url: str, name_suffix: str = "rerank",
pool: int = 50, timeout: float = 30.0):
self.base = base
self.col = collection
self.url = rerank_url
self.name = f"{base.name}+{name_suffix}"
self.pool = pool
self.timeout = timeout
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
# Base returns deduplicated page-level tuples; rerank needs CHUNK-level
# texts to be informative. Pull each page's chunk 0 text from Chroma.
pages = self.base.retrieve(query, k=self.pool)
if not pages:
return []
chunk_ids = [f"{bid}::{pid}::0" for bid, pid in pages]
g = self.col.get(ids=chunk_ids, include=["documents"])
by_id = dict(zip(g["ids"], g["documents"]))
ids_and_texts = [(cid, by_id.get(cid, "")) for cid in chunk_ids]
order = _rerank_pool(self.url, query, ids_and_texts, timeout=self.timeout)
if order is None:
return pages[:k]
out: list[tuple[str, str]] = []
seen: set[tuple[str, str]] = set()
for cid in order:
bid, pid, _ = cid.split("::")
key = (bid, pid)
if key in seen:
continue
seen.add(key)
out.append(key)
if len(out) >= k:
break
return out
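As a worked illustration of the Reciprocal Rank Fusion used by `HybridRetriever` above: each page scores the sum of `1 / (k_rrf + rank)` over every ranking it appears in, with ranks starting at 1. The sketch below is standalone; `rrf_fuse` and the page ids are made up for the example and are not part of the module.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k_rrf: int = 60, k: int = 10) -> list[str]:
    """Fuse ranked lists; an item's score is sum(1 / (k_rrf + rank)), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k_rrf + rank)
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    return [item for item, _score in ordered[:k]]


dense_hits = ["page_a", "page_b", "page_c"]  # hypothetical dense ranking
bm25_hits = ["page_b", "page_d"]             # hypothetical BM25 ranking
fused = rrf_fuse([dense_hits, bm25_hits])
# page_b appears in both lists (1/62 + 1/61), so it outranks page_a (1/61 only).
```

Because only ranks enter the formula, the dense and BM25 scores never need to be put on a common scale, which is the usual reason for choosing RRF over a weighted score sum.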
+81 -9
@@ -76,15 +76,87 @@ def main() -> int:
     queries = load_queries(args.queries)
     print(f"loaded {len(queries)} queries")

-    # TODO Phase 7: instantiate the retrievers you implemented in
-    # eval/retrievers.py and run each one against each query.
-    # Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
-    # markdown table to args.output. Commit the file alongside the
-    # PR that changes retrieval.
-    raise NotImplementedError(
-        "Wire up the retrievers in eval/retrievers.py first, then "
-        "fill in this evaluation loop. See PLAN.md Phase 7."
-    )
+    import os
+    import chromadb
+    from chromadb.config import Settings
+    from rag.embeddings import embedding_function
+    from rag.bm25 import BM25Index
+    from eval.retrievers import DenseRetriever, BM25Retriever, HybridRetriever
+
+    # Default to this repo's product so an unset PRODUCT_NAME still
+    # resolves to the morpheus_docs collection.
+    product = os.environ.get("PRODUCT_NAME", "morpheus")
+    repo_root = Path(__file__).resolve().parent.parent
+    client = chromadb.PersistentClient(path=str(repo_root / "chroma"),
+                                       settings=Settings(anonymized_telemetry=False))
+    col = client.get_collection(f"{product}_docs", embedding_function=embedding_function())
+    bm = BM25Index(str(repo_root / "bm25" / f"{product}_docs.db"))
from eval.retrievers import RerankedRetriever
dense = DenseRetriever(col)
bm25 = BM25Retriever(bm)
hybrid = HybridRetriever(DenseRetriever(col, pool=100), BM25Retriever(bm, pool=100))
retrievers = [dense, bm25, hybrid]
rerank_url = os.environ.get("RERANK_URL", "").rstrip("/")
if rerank_url:
retrievers += [
RerankedRetriever(bm25, col, rerank_url, name_suffix="rerank", pool=50),
RerankedRetriever(hybrid, col, rerank_url, name_suffix="rerank", pool=50),
]
print(f"reranker enabled: {rerank_url}")
rows: dict[str, dict[str, float]] = {}
per_query: list[dict] = []
for r in retrievers:
mrr_sum = recall_sum = ndcg_sum = 0.0
elapsed_sum = 0.0
for q in queries:
expected = [(e["bundle_id"], e["page_id"]) for e in q["expected"]]
t0 = time.time()
retrieved = r.retrieve(q["query"], k=max(args.k, 10))
elapsed = time.time() - t0
mrr = reciprocal_rank(retrieved, expected)
recall = recall_at_k(retrieved, expected, args.k)
ndcg = ndcg_at_k(retrieved, expected, args.k)
mrr_sum += mrr
recall_sum += recall
ndcg_sum += ndcg
elapsed_sum += elapsed
per_query.append({
"retriever": r.name, "query": q["query"],
"mrr": mrr, "recall@k": recall, "ndcg@k": ndcg,
"top1": list(retrieved[0]) if retrieved else None,
"elapsed_s": round(elapsed, 3),
})
n = len(queries)
rows[r.name] = {
"MRR": mrr_sum / n,
f"Recall@{args.k}": recall_sum / n,
f"nDCG@{args.k}": ndcg_sum / n,
"avg_latency_s": elapsed_sum / n,
}
print(f" {r.name}: MRR={rows[r.name]['MRR']:.3f} "
f"Recall@{args.k}={rows[r.name][f'Recall@{args.k}']:.3f} "
f"nDCG@{args.k}={rows[r.name][f'nDCG@{args.k}']:.3f} "
f"avg={rows[r.name]['avg_latency_s']*1000:.0f}ms")
args.output.parent.mkdir(parents=True, exist_ok=True)
md = [f"# Retrieval eval — k={args.k}", "",
f"_{len(queries)} hand-curated queries, generated {time.strftime('%Y-%m-%d %H:%M:%S')}_", "",
"| Retriever | MRR | Recall@{k} | nDCG@{k} | avg latency |".replace("{k}", str(args.k)),
"| --- | ---: | ---: | ---: | ---: |"]
for name, m in rows.items():
md.append(f"| `{name}` | {m['MRR']:.3f} | {m[f'Recall@{args.k}']:.3f} "
f"| {m[f'nDCG@{args.k}']:.3f} | {m['avg_latency_s']*1000:.0f}ms |")
md += ["", "## Per-query results", "",
"| Retriever | Query | MRR | top-1 |", "| --- | --- | ---: | --- |"]
for r in per_query:
top1 = f"`{r['top1'][0]}/{r['top1'][1][:24]}...`" if r["top1"] else ""
md.append(f"| `{r['retriever']}` | {r['query'][:60]} | {r['mrr']:.3f} | {top1} |")
args.output.write_text("\n".join(md) + "\n")
print(f"wrote {args.output}")
return 0
if __name__ == "__main__":
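The evaluation loop above calls `reciprocal_rank`, `recall_at_k`, and `ndcg_at_k`, whose bodies are not part of this diff. The following is one plausible binary-relevance version, offered as an assumption about what they compute rather than the repo's actual code. It treats `retrieved` as an already-deduplicated page list (which is what the retrievers return) and scores a query with no expected pages as 0.

```python
import math

Page = tuple[str, str]  # (bundle_id, page_id)


def reciprocal_rank(retrieved: list[Page], expected: list[Page]) -> float:
    """1 / rank of the first relevant page, or 0.0 if none was retrieved."""
    wanted = set(expected)
    for rank, page in enumerate(retrieved, start=1):
        if page in wanted:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved: list[Page], expected: list[Page], k: int) -> float:
    """Fraction of expected pages found in the top k."""
    wanted = set(expected)
    if not wanted:
        return 0.0
    return len(wanted & set(retrieved[:k])) / len(wanted)


def ndcg_at_k(retrieved: list[Page], expected: list[Page], k: int) -> float:
    """Binary-gain nDCG: DCG of the top k divided by the best achievable DCG."""
    wanted = set(expected)
    if not wanted:
        return 0.0
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, page in enumerate(retrieved[:k], start=1) if page in wanted)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(wanted), k) + 1))
    return dcg / ideal


hits = [("b", "x"), ("b", "a"), ("b", "c")]  # made-up retrieval result
gold = [("b", "a")]                          # made-up expected page
```

With the relevant page at rank 2, the reciprocal rank is 0.5, recall@3 is 1.0, and nDCG@3 is `1 / log2(3)`.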
+39 -11
@@ -31,6 +31,31 @@ from typing import Iterator
 CHARS_PER_TOKEN = 4
 TARGET_TOKENS = 500
 TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
+# Hard cap: nomic-embed-text's context is 2048 tokens. Anything larger
+# 400s the entire embed batch. 6000 chars works for prose but markdown
+# tables with lots of `|` separators tokenize ~1.4× denser; a 5839-char
+# table chunk from the HVM qualification matrix tokenized past 2048 and
+# crashed the rebuild. 4000 chars stays under 2048 tokens even for
+# dense table content while leaving headroom for the query side.
+MAX_CHARS = 4000
+
+
+def _hard_split(text: str) -> list[str]:
+    """Split an oversized block on line boundaries into MAX_CHARS pieces."""
+    if len(text) <= MAX_CHARS:
+        return [text]
+    out: list[str] = []
+    buf: list[str] = []
+    buf_chars = 0
+    for line in text.splitlines(keepends=True):
+        if buf_chars + len(line) > MAX_CHARS and buf:
+            out.append("".join(buf).rstrip())
+            buf, buf_chars = [], 0
+        buf.append(line)
+        buf_chars += len(line)
+    if buf:
+        out.append("".join(buf).rstrip())
+    return out


 def estimate_tokens(text: str) -> int:
@@ -104,23 +129,26 @@ def chunks_from_page(
     # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
     ordinal = 1
+
+    def emit(buf: list[str]) -> Iterator[dict]:
+        nonlocal ordinal
+        merged = "\n\n".join(buf)
+        for piece in _hard_split(merged):
+            yield {
+                "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
+                "text": piece,
+                "metadata": {**metadata, "ordinal": ordinal},
+            }
+            ordinal += 1
+
     buf: list[str] = []
     buf_chars = 0
     for p in paragraphs:
         if buf_chars + len(p) > TARGET_CHARS and buf:
-            yield {
-                "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
-                "text": "\n\n".join(buf),
-                "metadata": {**metadata, "ordinal": ordinal},
-            }
-            ordinal += 1
+            yield from emit(buf)
             buf = []
             buf_chars = 0
         buf.append(p)
         buf_chars += len(p)
     if buf:
-        yield {
-            "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
-            "text": "\n\n".join(buf),
-            "metadata": {**metadata, "ordinal": ordinal},
-        }
+        yield from emit(buf)
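One property of the line-boundary split above is worth spelling out: a single line longer than `MAX_CHARS` (for example a paragraph with no internal newlines) has no boundary to break on and comes through whole. The sketch below is a hypothetical stricter variant, not the code in this commit, that additionally slices over-long lines so that no piece can exceed the cap. `hard_split_strict` is an invented name.

```python
MAX_CHARS = 4000  # same cap as in the diff above


def hard_split_strict(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split on line boundaries, slicing any single line longer than max_chars."""
    if len(text) <= max_chars:
        return [text]
    out: list[str] = []
    buf: list[str] = []
    buf_chars = 0
    for line in text.splitlines(keepends=True):
        # Most lines yield exactly one part; an over-long line yields several.
        for start in range(0, len(line), max_chars):
            part = line[start:start + max_chars]
            if buf and buf_chars + len(part) > max_chars:
                out.append("".join(buf).rstrip())
                buf, buf_chars = [], 0
            buf.append(part)
            buf_chars += len(part)
    if buf:
        out.append("".join(buf).rstrip())
    return out


one_long_line = hard_split_strict("x" * 9000)       # no newline anywhere
table_like = hard_split_strict("row\n" * 1500)      # 6000 chars, many short lines
```

Slicing mid-line can cut a table row or a word in half, so whether this trade-off is acceptable depends on how often such lines occur in the corpus.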
+21 -4
@@ -3,8 +3,15 @@
 Swappable: implement the same `embedding_function()` interface returning
 a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.

-Defaults (override via env):
-  OLLAMA_URL    one or more comma-separated URLs (load-balanced)
+Env-configurable (matches the zerto-docs-rag pattern so the same Gitea
+runner + GPU-pinned Ollama containers can serve every docs MCP build):
+
+  OLLAMA_URLS   comma-separated list, load-balanced round-robin per batch.
+                Preferred — set in the CI workflow to fan out across two
+                GPU-pinned Ollama containers on the Gitea host.
+  OLLAMA_URL    single endpoint, fallback when OLLAMA_URLS is unset.
+                Default http://192.168.0.2:11434 (the host where the GPUs
+                live in Justin's lab).
   EMBED_MODEL   model name; default 'nomic-embed-text'
   EMBED_DIM     expected embedding dim; default 768 (nomic-embed-text)
 """
@@ -19,8 +26,18 @@ from chromadb import EmbeddingFunction, Documents, Embeddings
 log = logging.getLogger(__name__)

-OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
-               "http://localhost:11434").split(",") if u.strip()]
+DEFAULT_OLLAMA_URL = "http://192.168.0.2:11434"
+
+
+def _resolve_urls() -> list[str]:
+    raw = os.environ.get("OLLAMA_URLS", "").strip()
+    if raw:
+        return [u.strip().rstrip("/") for u in raw.split(",") if u.strip()]
+    single = os.environ.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).strip().rstrip("/")
+    return [single]
+
+
+OLLAMA_URLS = _resolve_urls()
 EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
 EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
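To make the precedence concrete (`OLLAMA_URLS` first, then `OLLAMA_URL`, then the built-in default), here is a small sketch that takes the environment as a plain mapping so it can be exercised without touching `os.environ`. `resolve_urls` mirrors the logic of `_resolve_urls` above. `assign_batches` is a hypothetical illustration of "round-robin per batch", since the actual dispatch code is not shown in this diff.

```python
from itertools import cycle
from typing import Mapping

DEFAULT_OLLAMA_URL = "http://192.168.0.2:11434"  # default from the diff above


def resolve_urls(env: Mapping[str, str]) -> list[str]:
    """OLLAMA_URLS (comma-separated) wins; otherwise OLLAMA_URL or the default."""
    raw = env.get("OLLAMA_URLS", "").strip()
    if raw:
        return [u.strip().rstrip("/") for u in raw.split(",") if u.strip()]
    return [env.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).strip().rstrip("/")]


def assign_batches(urls: list[str], n_batches: int) -> list[str]:
    """Hypothetical round-robin: batch i is sent to urls[i % len(urls)]."""
    picker = cycle(urls)
    return [next(picker) for _ in range(n_batches)]


# Placeholder host names, not real endpoints.
two_gpus = resolve_urls({"OLLAMA_URLS": "http://gpu0:11434/, http://gpu1:11434"})
plan = assign_batches(two_gpus, 4)
```

Note that a degenerate value such as `OLLAMA_URLS=","` is non-empty after stripping yet yields an empty list, so a caller may want to guard against that case.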
+1 -1
@@ -29,7 +29,7 @@ CHROMA_DIR = ROOT / "chroma"
 # Collection name — convention: <product>_docs. Override via env if needed.
 import os
-PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
+PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "morpheus")
 COLLECTION = f"{PRODUCT_NAME}_docs"
+10
@@ -0,0 +1,10 @@
# Dev/CPU reranker — only for running scripts/rerank_server.py locally.
# Production uses the llama.cpp + jina-reranker GGUF sidecar (see
# deploy/docker-compose.yml). Install with:
#
# pip install -r requirements-rerank.txt
#
# This adds PyTorch (~2 GB) and the sentence-transformers cross-encoder
# (cross-encoder/ms-marco-MiniLM-L-6-v2, ~22 MB). Keep out of the main
# requirements.txt so the production image stays slim.
sentence-transformers>=3.0
+8
@@ -10,10 +10,18 @@ ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not
 # Scraping (Phase 1; adjust per product)
 beautifulsoup4>=4.12
 requests>=2.31
+curl_cffi>=0.7    # for HPE QuickSpecs scrape (Chrome TLS impersonation)
+markdownify>=0.11
 # playwright>=1.40   # uncomment if you need headless browser fallback

 # Evaluation
 numpy>=1.26

+# Reranker is a sidecar (see deploy/docker-compose.yml). The MCP server
+# only needs httpx (declared above) to call it. For the dev / CPU
+# fallback reranker (scripts/rerank_server.py), install
+# requirements-rerank.txt separately — it pulls in PyTorch which would
+# triple the production image size.
+
 # Dev / utility
 python-dateutil>=2.8
+66
@@ -7,6 +7,72 @@ the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.
---
## Product context — HPE Morpheus Enterprise Software
**This repo is for HPE Morpheus Enterprise**, the full cloud-management
platform. It is a **different SKU** from HPE Morpheus VM Essentials
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
docs here; they're a separate, smaller product (the "VM-only" subset
of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
Morpheus Enterprise as the "elevate to" target — that's the
relationship.
`PRODUCT_NAME=morpheus`. Tool will be named `morpheus_api_lessons`,
collection `morpheus_docs`, etc.
### Upstream portal
HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
the Zerto docs). Anonymous JSON API, no auth required.
| Endpoint | Returns |
|---|---|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |
User-facing URL format:
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
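The endpoint table and URL format above can be captured as a few string builders. This is a minimal sketch: the helper names are invented here, and `GUID-1234ABCD` is a placeholder rather than a real page id.

```python
from __future__ import annotations

from urllib.parse import urlencode

API = "https://support.hpe.com/hpesc/public/api/document"
DISPLAY = "https://support.hpe.com/hpesc/public/docDisplay"


def document_url(doc_id: str) -> str:
    """Title page / abstract, or the whole body for single-doc Release Notes."""
    return f"{API}/{doc_id}"


def toc_url(doc_id: str) -> str:
    """Nested TOC tree (empty or 404 for single-doc bundles)."""
    return f"{API}/{doc_id}/toc"


def render_url(doc_id: str, guid: str) -> str:
    """Single page body for one GUID."""
    return f"{API}/{doc_id}/render?" + urlencode({"page": f"{guid}.html"})


def display_url(doc_id: str, guid: str | None = None) -> str:
    """User-facing link, optionally deep-linked to one page."""
    params = {"docId": doc_id}
    if guid is not None:
        params["page"] = f"{guid}.html"
    return f"{DISPLAY}?" + urlencode(params)


manual = "sd00007732en_us"  # 8.1.2 User Manual docId from the table below
deep_link = display_url(manual, "GUID-1234ABCD")
```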
### Bundle IDs (confirmed 2026-05-22)
**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:
| Version | docId |
|---|---|
| 8.1.0 | `sd00007510en_us` |
| 8.1.1 | `sd00007621en_us` |
| 8.1.2 | `sd00007732en_us` |
**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
(no TOC; full body returned by the `/document/{docId}` endpoint
itself; scraper needs a `--single-doc` mode for these):
| Version | docId |
|---|---|
| 8.1.0 | `sd00007496en_us` |
| 8.1.1 | `sd00007610en_us` |
| 8.1.2 | `sd00007733en_us` |
### Cross-version peers are free
GUIDs are stable across versions (confirmed on HVM where 374/376/376
pages had 100% GUID overlap between adjacent versions). Same-GUID =
same-topic. Synthesize `topic_cluster.clustered_topics` by looking
up the same GUID in the other bundle slugs — no fuzzy matching
needed.
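Since matching is an exact GUID lookup, the peer synthesis reduces to a set-membership check per bundle. The sketch below assumes an in-memory map from bundle slug to the set of page GUIDs on disk. The function name and the shape of the result are illustrative only; the real `topic_cluster.clustered_topics` entries may carry more fields.

```python
def cross_version_peers(
    pages_by_bundle: dict[str, set[str]],
) -> dict[tuple[str, str], list[str]]:
    """Map (bundle, guid) to the sorted list of other bundles holding the same guid."""
    peers: dict[tuple[str, str], list[str]] = {}
    for bundle, guids in pages_by_bundle.items():
        for guid in guids:
            peers[(bundle, guid)] = sorted(
                other
                for other, other_guids in pages_by_bundle.items()
                if other != bundle and guid in other_guids
            )
    return peers


# Made-up GUIDs; the bundle slugs follow this repo's naming.
corpus = {
    "morpheus_user_manual_8_1_1": {"GUID-AAAA", "GUID-BBBB"},
    "morpheus_user_manual_8_1_2": {"GUID-AAAA", "GUID-CCCC"},
}
peers = cross_version_peers(corpus)
```

A page that exists in only one version simply gets an empty peer list, which is how topics added or removed between releases would show up.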
### Reusable from hvm-docs
`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
solve the identical portal shape. Copy and adapt the BUNDLES list +
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
TOC-paginated path and the single-doc path are needed (the HVM
build covers both because HVM Release Notes follow the same shape).
## What you write
At minimum, two scripts:
+200
@@ -0,0 +1,200 @@
"""Discover Morpheus Enterprise doc bundles on HPE Support DocPortal and write bundles.json.
Mirrors hvm-docs/scrape/bundles.py — same portal, same API shape, same single-doc-blob
treatment for Release Notes, but pointing at the Morpheus Enterprise docId range.
For each bundle this script:
1. GETs /hpesc/public/api/document/{docId} → abstract HTML
2. GETs /hpesc/public/api/document/{docId}/toc → page tree (or 404 for single-doc)
3. Writes bundles.json at repo root with the schema PLAN.md Phase 1 documents.
QuickSpecs is a special case: lives at www.hpe.com (not support.hpe.com), gets the
html-file mode and is scraped via curl_cffi (see scrape/quickspecs.py).
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
BUNDLES_JSON = ROOT / "bundles.json"
@dataclass
class BundleSpec:
slug: str
doc_id: str
title: str
version: str | None
product: str # e.g. "User Manual", "Release Notes", "QuickSpecs"
mode: str # "toc", "single", or "html-file"
platform: str | None = None
language: str = "en-US"
source_url: str | None = None # overrides the default support.hpe.com URL
# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
# range sd00006500..7740 for `Morpheus Enterprise` matches in the abstract.
#
# Notes:
# - Morpheus Enterprise has User Manuals dating back to 8.0.10
# (sd00006774en_us, Sep 2025) but we only ship the 8.1.x line for
# now. Add the 8.0.x bundles here if you need older versions in the
# corpus.
# - No dedicated Deployment Guide or Qualification Matrix for Morpheus
# Enterprise on HPE Support — the only QM (sd00006551en_us) covers
# HVM clusters managed by Morpheus, which lives in hvm-docs.
# - QuickSpecs lives on www.hpe.com (not support.hpe.com), uses the
# html-file scrape mode with curl_cffi Chrome impersonation.
BUNDLES: list[BundleSpec] = [
BundleSpec("morpheus_user_manual_8_1_0", "sd00007510en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.0", "User Manual", "toc"),
BundleSpec("morpheus_user_manual_8_1_1", "sd00007621en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.1", "User Manual", "toc"),
BundleSpec("morpheus_user_manual_8_1_2", "sd00007732en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.2", "User Manual", "toc"),
BundleSpec("morpheus_release_notes_8_1_0", "sd00007496en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.0", "Release Notes", "single"),
BundleSpec("morpheus_release_notes_8_1_1", "sd00007610en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.1", "Release Notes", "single"),
BundleSpec("morpheus_release_notes_8_1_2", "sd00007733en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.2", "Release Notes", "single"),
BundleSpec("morpheus_quickspecs", "a50009231enw", "HPE Morpheus Enterprise Software QuickSpecs",
"v1", "QuickSpecs", "html-file",
source_url="https://www.hpe.com/psnow/doc/a50009231enw"),
]
def _session() -> requests.Session:
s = requests.Session()
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
return s
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
delay = 1.0
for attempt in range(retries):
r = s.get(url, timeout=30)
if r.status_code == 200:
return r.json() if expect_json else r.text
if r.status_code == 404:
return None
if r.status_code in (429, 500, 502, 503, 504):
time.sleep(delay)
delay *= 2
continue
r.raise_for_status()
raise RuntimeError(f"GET failed after {retries} retries: {url}")
def _count_toc(toc: list[dict] | None) -> tuple[int, str | None]:
if not toc:
return 0, None
landing = None
n = 0
def walk(nodes: list[dict] | None, depth: int) -> None:
nonlocal n, landing
for node in nodes or []:
link = node.get("topicLink")
if link:
n += 1
m = re.search(r"page=(GUID-[A-F0-9-]+)\.html", link)
if m and landing is None:
landing = m.group(1)
walk(node.get("children"), depth + 1)
walk(toc, 0)
return n, landing
def _parse_abstract(html: str) -> dict[str, str]:
soup = BeautifulSoup(html, "html.parser")
out: dict[str, str] = {}
h1 = soup.select_one("h1.title.topictitle1")
if h1:
out["title"] = h1.get_text(" ", strip=True)
desc = soup.select_one("div.desc")
if desc:
out["abstract"] = desc.get_text(" ", strip=True)
pub = soup.select_one("div.publishedDate")
if pub:
out["published"] = pub.get_text(" ", strip=True).replace("Published:", "").strip()
return out
def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
# html-file bundles are static fixtures or live-fetched outside support.hpe.com.
if spec.mode == "html-file":
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": 1,
"mode": "html-file",
"abstract": "",
"dates": {},
"landing_page": spec.doc_id,
"source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
}
abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
meta = _parse_abstract(abstract_html or "")
page_count: int
landing: str | None
if spec.mode == "toc":
toc = _get(s, f"{API}/{spec.doc_id}/toc", expect_json=True)
page_count, landing = _count_toc(toc)
if page_count == 0:
print(f" ! {spec.slug}: TOC empty — falling back to single-doc mode", file=sys.stderr)
spec.mode = "single"
page_count, landing = 1, spec.doc_id
else:
page_count, landing = 1, spec.doc_id
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": meta.get("title") or spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": page_count,
"mode": spec.mode,
"abstract": meta.get("abstract", ""),
"dates": {"Published": meta.get("published", "")},
"landing_page": landing,
"source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
}
def main() -> int:
p = argparse.ArgumentParser(description="Build bundles.json from BUNDLES list.")
p.add_argument("--out", default=str(BUNDLES_JSON))
args = p.parse_args()
s = _session()
out: list[dict[str, Any]] = []
for spec in BUNDLES:
print(f"{spec.slug} ({spec.doc_id}) ...", file=sys.stderr)
out.append(discover_bundle(s, spec))
Path(args.out).write_text(json.dumps(out, indent=2) + "\n")
print(f"wrote {args.out}: {len(out)} bundles, {sum(b['page_count'] for b in out)} pages total", file=sys.stderr)
return 0
if __name__ == "__main__":
sys.exit(main())
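To show what the TOC walk in `_count_toc` operates on, here is a sketch that flattens a nested tree into `(GUID, title)` pairs in document order. The field names (`topicName`, `topicLink`, `children`) come from the portal description. The sample tree and `iter_toc_pages` are made up for illustration. Unlike `_count_toc`, this variant skips nodes whose link does not contain a GUID.

```python
from __future__ import annotations

import re
from typing import Iterator

GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")


def iter_toc_pages(nodes: list[dict] | None) -> Iterator[tuple[str, str]]:
    """Depth-first walk yielding (guid, title) for every node with a page link."""
    for node in nodes or []:
        match = GUID_RE.search(node.get("topicLink") or "")
        if match:
            yield match.group(1), node.get("topicName", "")
        yield from iter_toc_pages(node.get("children"))


# Hypothetical TOC fragment in the shape the /toc endpoint is described as returning.
sample_toc = [
    {"topicName": "Getting Started",
     "topicLink": "docDisplay?docId=sd0000en_us&page=GUID-AAAA-01.html",
     "children": [
         {"topicName": "Requirements",
          "topicLink": "docDisplay?docId=sd0000en_us&page=GUID-BBBB-02.html",
          "children": []},
     ]},
    {"topicName": "Heading without a page", "topicLink": None,
     "children": [
         {"topicName": "Provisioning",
          "topicLink": "docDisplay?docId=sd0000en_us&page=GUID-CCCC-03.html"},
     ]},
]
pages = list(iter_toc_pages(sample_toc))
```

The first pair yielded corresponds to the `landing_page` value that `discover_bundle` records for a TOC-mode bundle.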
+194
@@ -0,0 +1,194 @@
"""Scrape HPE QuickSpecs collateral pages into corpus markdown.
HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
captured DOM). The blocker for automated scraping is `www.hpe.com`'s
edge bot defense, which drops connections from non-browser TLS
fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.
Content extraction uses these stable CSS selectors found in the page:
.lr-right-rail hpe-highlights-container .collateral-content
— one per section ("Overview", "Standard Features", etc.)
h3.txto-title — section title
div.txto-description — section body
uc-table.uc-table-polaris — SKU / version-history tables
A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
as a fallback when the live fetch fails (HPE edge churn, network
issues). Keeping a current fixture in the repo also makes diffing
QuickSpecs revisions easy.
Usage (called by scrape.runner for bundles with mode="html-file"):
    python -m scrape.quickspecs a50009231enw --bundle-id morpheus_quickspecs
Or programmatically:
    from scrape.quickspecs import scrape_quickspecs
    scrape_quickspecs("a50009231enw", bundle_id="morpheus_quickspecs", title="...")
"""
from __future__ import annotations
import argparse
import json
import logging
import sys
from pathlib import Path
from bs4 import BeautifulSoup, NavigableString
from markdownify import markdownify as md
log = logging.getLogger(__name__)
ROOT = Path(__file__).resolve().parent.parent
SOURCE_DIR = ROOT / "scrape" / "quickspecs"
CORPUS_DIR = ROOT / "corpus"
COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"
def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
"""GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
Returns the HTML body on success, None on any failure."""
try:
from curl_cffi import requests as cc
except ImportError:
log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
return None
try:
r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
impersonate="chrome120", timeout=timeout)
if r.status_code != 200 or not r.text:
log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
return None
return r.text
except Exception as e:
log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
return None
def fetch_fixture(doc_id: str) -> str | None:
"""Read the committed HTML fixture as fallback."""
p = SOURCE_DIR / f"{doc_id}.html"
if not p.exists():
return None
return p.read_text()
def _extract_content_blocks(html: str) -> list[str]:
"""Pull each section block (.collateral-content under .lr-right-rail).
The fixture format (just .quickspecs-content wrapper) and the live
format (.lr-right-rail with nested hpe-highlights-container) are
both supported. Returns a list of section HTML strings, in document
order.
"""
soup = BeautifulSoup(html, "html.parser")
# Live format: each <hpe-highlights-container> under .lr-right-rail has
# one or more .collateral-content blocks; concat them.
rail = soup.select_one(".lr-right-rail")
if rail is not None:
blocks = rail.select(".collateral-content")
return [str(b) for b in blocks]
# Fixture format: a single wrapper holding all the H2/H3 sections.
wrapper = soup.select_one(".quickspecs-content")
if wrapper is not None:
return [str(wrapper)]
# Last-resort: whole body.
body = soup.body or soup
return [str(body)]
def parse_html(html: str) -> str:
"""Convert QuickSpecs HTML to clean markdown.
Filters out the page chrome (nav, footer, recommendations carousel,
cookie banner, analytics blobs) by extracting only the content
blocks, then runs markdownify."""
blocks = _extract_content_blocks(html)
chunks: list[str] = []
for block in blocks:
soup = BeautifulSoup(block, "html.parser")
# Drop anchor placeholders that markdownify turns into noisy links
for a in soup.select('[hpe-left-rail-anchor]'):
a.decompose()
# Drop carousel / share / recommendation widgets if any leaked in.
for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
"esl-scrollbar", "esl-trigger", "video-overlay",
"generic-modal-loader", "style", "script"):
for el in soup.select(sel):
el.decompose()
chunks.append(md(str(soup), heading_style="ATX", bullets="-",
strip=["span", "div"]))
text = "\n\n".join(chunks)
# Collapse runs of blank lines markdownify likes to emit.
text = "\n".join(line.rstrip() for line in text.splitlines())
while "\n\n\n" in text:
text = text.replace("\n\n\n", "\n\n")
return text.strip() + "\n"
def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
version: str | None = None,
product: str = "QuickSpecs",
source_url: str | None = None,
force: bool = False) -> bool:
"""Live-fetch (or fall back to fixture), parse, write corpus files.
Returns True if files were written, False if skipped (already exists
and --force not set)."""
bundle_dir = CORPUS_DIR / bundle_id
md_path = bundle_dir / f"{doc_id}.md"
json_path = bundle_dir / f"{doc_id}.json"
if not force and md_path.exists() and json_path.exists():
log.info(" %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
return False
html = fetch_live(doc_id)
fetched_from = "live"
if html is None:
html = fetch_fixture(doc_id)
fetched_from = "fixture"
if html is None:
log.error("QuickSpecs %s: no live response and no fixture at %s",
doc_id, SOURCE_DIR / f"{doc_id}.html")
return False
body_md = parse_html(html)
bundle_dir.mkdir(parents=True, exist_ok=True)
md_path.write_text(body_md)
sidecar = {
"bundle_id": bundle_id,
"page_id": doc_id,
"title": title,
"ordinal": 1,
"parent_title": None,
"doc_id": doc_id,
"version": version,
"product": product,
"source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
"fetched_from": fetched_from,
}
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
log.info(" %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
return True
def main() -> int:
logging.basicConfig(level=logging.INFO, format="%(message)s")
p = argparse.ArgumentParser()
    p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50009231enw")
    # Defaults match the morpheus_quickspecs entry in scrape/bundles.py.
    p.add_argument("--bundle-id", default="morpheus_quickspecs")
    p.add_argument("--title", default="HPE Morpheus Enterprise Software QuickSpecs")
p.add_argument("--version", default=None)
p.add_argument("--force", action="store_true")
args = p.parse_args()
ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
args.version, force=args.force)
return 0 if ok else 1
if __name__ == "__main__":
sys.exit(main())
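The whitespace cleanup at the end of `parse_html` (strip trailing spaces, then collapse runs of blank lines) can be checked in isolation. The sketch below uses a single regex in place of the `while` loop, which should give the same result because both reduce any run of three or more newlines to exactly two. `collapse_blank_lines` is a name invented for this example.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Strip trailing whitespace per line and allow at most one blank line in a row."""
    text = "\n".join(line.rstrip() for line in text.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


messy = "# Overview  \n\n\n\nBody\t\n"
clean = collapse_blank_lines(messy)
```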
+27
@@ -0,0 +1,27 @@
# scrape/quickspecs/
Static HTML fixtures for HPE QuickSpecs documents that aren't reachable
from the runner (www.hpe.com edge drops connections from datacenter IPs
with non-browser User-Agents — verified 2026-05-22 with curl, wget, and
Anthropic's WebFetch).
## Workflow
1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
browser, opens DevTools → Elements → Copy the `<body>` HTML.
2. Save it at `scrape/quickspecs/<doc_id>.html`.
3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
4. `python -m scrape.runner --bundle hvm_quickspecs --force` reads the
committed HTML and writes `corpus/hvm_quickspecs/<doc_id>.{md,json}`.
5. Re-index and ship.
QuickSpecs only update every few months (HPE rebrand, new SKU added,
feature change). When a new version drops, refresh the local HTML
file and re-run the scrape.
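
As a rough illustration of step 3, the sketch below shows what an `html-file` bundle entry might look like and how the fixture path from step 2 follows from it. The key names mirror what `scrape/runner.py` reads from each bundle (`slug`, `doc_id`, `title`, `version`, `product`, `mode`); the actual layout of `scrape/bundles.py` is not shown in this README, so the `QUICKSPECS_BUNDLE` constant and the `fixture_path` helper are hypothetical names used only for illustration.

```python
from pathlib import Path

# Hypothetical bundle entry; the real structure in scrape/bundles.py may differ.
QUICKSPECS_BUNDLE = {
    "slug": "hvm_quickspecs",
    "doc_id": "a50004260enw",
    "title": "HPE Morpheus VM Essentials Software QuickSpecs",
    "version": None,
    "product": "QuickSpecs",
    "mode": "html-file",
}


def fixture_path(bundle: dict, root: Path = Path("scrape/quickspecs")) -> Path:
    """Return where the committed HTML fixture for this bundle is expected."""
    return root / f"{bundle['doc_id']}.html"


if __name__ == "__main__":
    path = fixture_path(QUICKSPECS_BUNDLE)
    print(f"{QUICKSPECS_BUNDLE['slug']}: fixture at {path} (exists={path.exists()})")
```

Checking `path.exists()` before running the scrape is a cheap way to catch a fixture saved under the wrong `doc_id`.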
## Current fixtures
- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
(Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
(3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
Essentials included.
+339
@@ -0,0 +1,339 @@
"""Scrape HVM doc bundles into corpus/<slug>/<page_id>.{md,json}.

Reads bundles.json (produced by scrape.bundles), then for each bundle:
  - mode="toc": walks the TOC tree, fetches each page via the render
    endpoint, converts page_html to markdown, writes
    <page_id>.md + <page_id>.json sidecar.
  - mode="single": fetches /document/{docId} directly, treats the whole
    body as one page with page_id = doc_id.
  - mode="html-file": delegates to scrape.quickspecs, which live-scrapes
    the QuickSpecs page via curl_cffi and falls back to the committed
    scrape/quickspecs/<doc_id>.html fixture if the edge blocks the request.

After all bundles are on disk, runs a finalize pass that synthesizes
topic_cluster.clustered_topics for each page by looking up the same
GUID in sibling bundles (HPE GUIDs are stable across versions — see
reference_hpe_docs_portal_api.md).

Usage:
    python -m scrape.runner --all
    python -m scrape.runner --bundle hvm_user_manual_8_1_2
    python -m scrape.runner --all --force      # re-download already-on-disk pages
    python -m scrape.runner --finalize-only    # only redo the topic_cluster pass
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
API = "https://support.hpe.com/hpesc/public/api/document"
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}&page={page_id}.html"
DOC_URL_SINGLE = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
UA = "hvm-docs-mcp/0.1 (+https://git.jpaul.io/justin/hvm-docs; admin@jpaul.io)"
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
BUNDLES_JSON = ROOT / "bundles.json"
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")
@dataclass
class TocEntry:
    page_id: str
    title: str
    ordinal: int
    parent_title: str | None
def _session() -> requests.Session:
    s = requests.Session()
    s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
    return s
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
    delay = 1.0
    for attempt in range(retries):
        r = s.get(url, timeout=30)
        if r.status_code == 200:
            return r.json() if expect_json else r.text
        if r.status_code == 404:
            return None
        if r.status_code in (429, 500, 502, 503, 504):
            time.sleep(delay)
            delay *= 2
            continue
        r.raise_for_status()
    raise RuntimeError(f"GET failed after {retries} retries: {url}")
def _flatten_toc(toc: list[dict]) -> list[TocEntry]:
    out: list[TocEntry] = []
    ordinal = 0
    def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
        nonlocal ordinal
        for node in nodes or []:
            title = node.get("topicName") or ""
            link = node.get("topicLink") or ""
            m = GUID_RE.search(link)
            if m:
                ordinal += 1
                out.append(TocEntry(page_id=m.group(1), title=title, ordinal=ordinal, parent_title=parent_title))
            walk(node.get("children"), title or parent_title)
    walk(toc, None)
    return out
def _strip_dita_wrappers(html: str) -> str:
    """Keep only the <main> element (when present) and drop the Notices,
    Acknowledgments, and Abstract boilerplate so markdownify produces clean text.
    DITA's notices boilerplate repeats across every doc; if we leave it in,
    every page chunk inherits the same trademark text and pollutes retrieval."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop the Notices/Acknowledgments/Abstract boilerplate by section heading.
    # Every doc on the portal carries the same legal Notices and trademark
    # Acknowledgments; if we leave them in, every chunk inherits the same
    # text and pollutes retrieval. Abstract is one-line marketing.
    boilerplate = {"Notices", "Acknowledgments", "Abstract"}
    # Wrapped form: <article>/<section>/<div> whose first heading child is boilerplate.
    for sec in soup.select("article, section, div"):
        h = sec.find(["h1", "h2"], recursive=False)
        if h and h.get_text(strip=True) in boilerplate:
            sec.decompose()
    # Unwrapped form: bare <h1>/<h2>Boilerplate</h2> followed by its .desc/.body sibling.
    for h in soup.find_all(["h1", "h2"]):
        if h.get_text(strip=True) in boilerplate:
            sib = h.find_next_sibling()
            if sib and (sib.name in {"div", "section"}):
                cls = " ".join(sib.get("class", []) or [])
                if "desc" in cls or "body" in cls or "notices" in cls:
                    sib.decompose()
            h.decompose()
    main = soup.find("main")
    return str(main) if main else str(soup)
def html_to_md(page_html: str) -> str:
    cleaned = _strip_dita_wrappers(page_html)
    text = md(cleaned, heading_style="ATX", bullets="-")
    # collapse runs of blank lines
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text + "\n"
def fetch_toc_page(s: requests.Session, doc_id: str, page_id: str) -> str:
    payload = _get(s, f"{API}/{doc_id}/render?page={page_id}.html", expect_json=True)
    if not payload:
        return ""
    return payload.get("page_html") or ""
def fetch_single_doc(s: requests.Session, doc_id: str) -> tuple[str, str]:
    """Returns (page_html, title) for a single-doc-shape bundle."""
    html = _get(s, f"{API}/{doc_id}")
    if not html:
        return "", ""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.select_one("h1.title.topictitle1")
    title = h1.get_text(" ", strip=True) if h1 else doc_id
    return html, title
def write_page(bundle_dir: Path, page_id: str, body_md: str, sidecar: dict[str, Any], force: bool) -> bool:
    bundle_dir.mkdir(parents=True, exist_ok=True)
    md_path = bundle_dir / f"{page_id}.md"
    json_path = bundle_dir / f"{page_id}.json"
    if not force and md_path.exists() and json_path.exists():
        return False
    md_path.write_text(body_md)
    json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
    return True
def scrape_toc_bundle(s: requests.Session, bundle: dict, force: bool, concurrency: int) -> int:
    doc_id = bundle["doc_id"]
    slug = bundle["slug"]
    bundle_dir = CORPUS / slug
    toc = _get(s, f"{API}/{doc_id}/toc", expect_json=True) or []
    entries = _flatten_toc(toc)
    print(f" {slug}: {len(entries)} pages", file=sys.stderr)
    written = 0
    def do_one(entry: TocEntry) -> bool:
        page_html = fetch_toc_page(s, doc_id, entry.page_id)
        if not page_html:
            return False
        body_md = html_to_md(page_html)
        sidecar = {
            "bundle_id": slug,
            "page_id": entry.page_id,
            "title": entry.title,
            "ordinal": entry.ordinal,
            "parent_title": entry.parent_title,
            "doc_id": doc_id,
            "version": bundle.get("version"),
            "product": bundle.get("product"),
            "source_url": DOC_URL.format(doc_id=doc_id, page_id=entry.page_id),
            # topic_cluster filled in by finalize()
        }
        return write_page(bundle_dir, entry.page_id, body_md, sidecar, force)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for fut in as_completed(pool.submit(do_one, e) for e in entries):
            if fut.result():
                written += 1
    return written
def scrape_single_bundle(s: requests.Session, bundle: dict, force: bool) -> int:
    doc_id = bundle["doc_id"]
    slug = bundle["slug"]
    bundle_dir = CORPUS / slug
    html, title = fetch_single_doc(s, doc_id)
    if not html:
        print(f" ! {slug}: empty body", file=sys.stderr)
        return 0
    body_md = html_to_md(html)
    sidecar = {
        "bundle_id": slug,
        "page_id": doc_id,
        "title": title or bundle["title"],
        "ordinal": 1,
        "parent_title": None,
        "doc_id": doc_id,
        "version": bundle.get("version"),
        "product": bundle.get("product"),
        "source_url": DOC_URL_SINGLE.format(doc_id=doc_id),
    }
    print(f" {slug}: 1 page (single-doc)", file=sys.stderr)
    return 1 if write_page(bundle_dir, doc_id, body_md, sidecar, force) else 0
def finalize_clusters(bundles: list[dict]) -> int:
    """Cross-link sibling pages with the same GUID across version bundles.

    For TOC bundles, page_id == GUID; same GUID across two bundles = same
    underlying topic. Single-doc bundles (page_id == doc_id) have no GUID,
    so they are peered with every other non-GUID page that shares the same
    `product` value (e.g. Release Notes 8.1.0/8.1.1/8.1.2)."""
    # GUID → list[(slug, sidecar_path, sidecar_dict)]
    guid_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
    # product → list[(slug, sidecar_path, sidecar_dict)] for single-doc peering
    product_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
    for b in bundles:
        slug = b["slug"]
        bundle_dir = CORPUS / slug
        if not bundle_dir.exists():
            continue
        for jp in bundle_dir.glob("*.json"):
            data = json.loads(jp.read_text())
            pid = data["page_id"]
            if pid.startswith("GUID-"):
                guid_to_pages.setdefault(pid, []).append((slug, jp, data))
            else:
                product_to_pages.setdefault(b["product"], []).append((slug, jp, data))
    updated = 0
    # TOC pages — cluster by GUID
    for guid, peers in guid_to_pages.items():
        if len(peers) < 2:
            continue
        for slug, jp, data in peers:
            others = [
                {"bundle_id": s2, "page_id": guid, "clustering_title": d2.get("title", "")}
                for s2, _, d2 in peers if s2 != slug
            ]
            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
            jp.write_text(json.dumps(data, indent=2) + "\n")
            updated += 1
    # Single-doc pages — cluster by product (e.g. Release Notes 8.1.0/.1/.2)
    for product, peers in product_to_pages.items():
        if len(peers) < 2:
            continue
        for slug, jp, data in peers:
            others = [
                {"bundle_id": s2, "page_id": d2["page_id"], "clustering_title": d2.get("title", "")}
                for s2, _, d2 in peers if s2 != slug
            ]
            data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
            jp.write_text(json.dumps(data, indent=2) + "\n")
            updated += 1
    return updated
def main() -> int:
    p = argparse.ArgumentParser(description="Scrape HVM bundles into corpus/.")
    p.add_argument("--all", action="store_true", help="scrape every bundle in bundles.json")
    p.add_argument("--bundle", action="append", help="scrape one bundle by slug (repeatable)")
    p.add_argument("--force", action="store_true", help="re-fetch pages already on disk")
    p.add_argument("--concurrency", type=int, default=6)
    p.add_argument("--finalize-only", action="store_true", help="only rebuild topic_cluster sidecar fields")
    args = p.parse_args()
    if not BUNDLES_JSON.exists():
        print(f"bundles.json missing — run `python -m scrape.bundles` first", file=sys.stderr)
        return 2
    bundles = json.loads(BUNDLES_JSON.read_text())
    if args.finalize_only:
        n = finalize_clusters(bundles)
        print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
        return 0
    if args.bundle:
        bundles = [b for b in bundles if b["slug"] in args.bundle]
        if not bundles:
            print(f"no bundles matched: {args.bundle}", file=sys.stderr)
            return 2
    elif not args.all:
        print("specify --all or --bundle <slug>", file=sys.stderr)
        return 2
    s = _session()
    total = 0
    for b in bundles:
        mode = b.get("mode")
        if mode == "single":
            total += scrape_single_bundle(s, b, args.force)
        elif mode == "html-file":
            # Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
            # to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
            from scrape.quickspecs import scrape_quickspecs
            ok = scrape_quickspecs(
                doc_id=b["doc_id"], bundle_id=b["slug"],
                title=b.get("title", b["doc_id"]),
                version=b.get("version"),
                product=b.get("product", "QuickSpecs"),
                source_url=b.get("source_url"),
                force=args.force,
            )
            total += 1 if ok else 0
        else:
            total += scrape_toc_bundle(s, b, args.force, args.concurrency)
    print(f"scraped {total} new/updated pages", file=sys.stderr)
    # Always finalize after a scrape so sidecars are consistent.
    all_bundles = json.loads(BUNDLES_JSON.read_text())
    n = finalize_clusters(all_bundles)
    print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
    return 0
if __name__ == "__main__":
    sys.exit(main())
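
The GUID-based finalize pass in the runner above can be illustrated without touching disk. The following is a minimal in-memory sketch, not the runner's code: `cluster_by_guid` is a hypothetical helper, and the sample sidecars (including the `GUID-AAAA-1111` ids and the 8.1.1 slug) are made up for the example. It assumes Python 3.9+ for the built-in generic annotations.

```python
from collections import defaultdict


def cluster_by_guid(sidecars: list[dict]) -> dict[tuple[str, str], dict]:
    """Map (bundle_id, page_id) -> topic_cluster for GUIDs seen in 2+ bundles."""
    by_guid: dict[str, list[dict]] = defaultdict(list)
    for sc in sidecars:
        if sc["page_id"].startswith("GUID-"):
            by_guid[sc["page_id"]].append(sc)

    clusters: dict[tuple[str, str], dict] = {}
    for guid, peers in by_guid.items():
        if len(peers) < 2:
            continue  # a page with no sibling in another bundle gets no cluster
        for sc in peers:
            others = [
                {"bundle_id": p["bundle_id"], "page_id": guid,
                 "clustering_title": p.get("title", "")}
                for p in peers if p["bundle_id"] != sc["bundle_id"]
            ]
            clusters[(sc["bundle_id"], guid)] = {
                "clustering_title": sc.get("title", ""),
                "clustered_topics": others,
            }
    return clusters


# Made-up sidecars: one topic present in two versions, one only in 8.1.2.
sample = [
    {"bundle_id": "hvm_user_manual_8_1_1", "page_id": "GUID-AAAA-1111", "title": "Install"},
    {"bundle_id": "hvm_user_manual_8_1_2", "page_id": "GUID-AAAA-1111", "title": "Installing"},
    {"bundle_id": "hvm_user_manual_8_1_2", "page_id": "GUID-BBBB-2222", "title": "New in 8.1.2"},
]
clusters = cluster_by_guid(sample)
```

Each clustered page lists only its siblings, never itself, which matches the `s2 != slug` filter in `finalize_clusters`.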
+70 -41
@@ -1,42 +1,58 @@
 """Gitea container-registry garbage collection.
-Lists package versions for one container package and deletes versions
-older than --keep-days. Always preserves:
-  - the :latest tag
-  - the --keep-latest most-recent date-tagged versions
-  - anything pushed in the last --keep-days days
+Lists tagged versions of one container package and deletes old ones.
+Always preserves:
+  - the `latest` tag (Watchtower's auto-deploy target)
+  - the `--keep-latest` most-recent date-tagged versions (YYYY.MM.DD)
+  - the `--keep-latest` most-recent short-SHA tags (rollback pins)
+  - anything pushed within `--keep-days` days
-The actual disk reclaim happens on Gitea's next package GC cron (admin
-site settings). This script just marks the versions for deletion.
+OCI blob-level versions (`sha256:...`) are never touched directly — those
+are managed by Gitea's internal package GC cron when their last tag
+goes away.
 Usage:
-    python scripts/registry_gc.py \\
-        --owner <user> \\
-        --package <product>-docs-mcp \\
+    GITEA_TOKEN=... python scripts/registry_gc.py \\
+        --owner justin \\
+        --package hvm-docs \\
         --keep-days 90 \\
         --keep-latest 5
-Auth: reads GITEA_TOKEN from env (set in the workflow as a secret).
+The Gitea endpoint shape (confirmed 2026-05-22 against git.jpaul.io):
+    GET /api/v1/packages/{owner}/container/{package}
+        -> [{id, version, created_at, ...}, ...]
+    DELETE /api/v1/packages/{owner}/container/{package}/{version}
 """
 from __future__ import annotations
 import argparse
+import json
 import os
+import re
 import sys
 from datetime import datetime, timedelta, timezone
-from urllib.request import Request, urlopen
 from urllib.error import HTTPError
-import json
+from urllib.parse import quote
+from urllib.request import Request, urlopen
 GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
+DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")
+SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")  # short or full git SHA
+BLOB_VER = re.compile(r"^sha256:")  # OCI blob versions — skip
 def api(token: str, method: str, path: str) -> object:
+    # Explicit User-Agent: git.jpaul.io is behind Cloudflare, whose default
+    # Bot Fight Mode 403s `Python-urllib/X.Y` with error 1010. Any
+    # recognizable browser/curl-style UA passes.
     req = Request(f"{GITEA_HOST}{path}",
-                  headers={"Authorization": f"token {token}"},
+                  headers={
+                      "Authorization": f"token {token}",
+                      "User-Agent": "hvm-docs-registry-gc/1.0",
+                  },
                   method=method)
     try:
         with urlopen(req, timeout=30) as r:
@@ -63,44 +79,57 @@ def main() -> int:
         return 1
     versions = api(token, "GET",
-        f"/api/v1/packages/{args.owner}/container/{args.package}/versions") or []
+        f"/api/v1/packages/{args.owner}/container/{args.package}") or []
     if not versions:
-        print(f"no versions found for {args.owner}/{args.package}")
+        print(f"no versions found for {args.owner}/container/{args.package}")
         return 0
     cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
+    print(f" {len(versions)} version(s); cutoff={cutoff.isoformat()} "
+          f"keep_days={args.keep_days} keep_latest={args.keep_latest}")
-    # Date-tagged versions (YYYY.MM.DD), newest first
-    date_tagged = []
-    for v in versions:
-        tags = v.get("tags") or []
-        for t in tags:
-            if len(t) == 10 and t[4] == "." and t[7] == ".":
-                date_tagged.append((t, v))
-                break
-    date_tagged.sort(key=lambda kv: kv[0], reverse=True)
-    keep_date_tags = {t for t, _ in date_tagged[:args.keep_latest]}
-    deleted = 0
-    for v in versions:
-        tags = v.get("tags") or []
-        if "latest" in tags:
-            continue
-        if any(t in keep_date_tags for t in tags):
-            continue
+    # Sort newest first by created_at.
+    def parsed_ts(v: dict) -> datetime:
         try:
-            created = datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
+            return datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
         except (KeyError, ValueError):
+            return datetime.min.replace(tzinfo=timezone.utc)
+    versions.sort(key=parsed_ts, reverse=True)
+    # Compute the keep-set: top-N date tags + top-N sha tags + always latest.
+    keep_dates: list[str] = []
+    keep_shas: list[str] = []
+    for v in versions:
+        ver = v.get("version") or ""
+        if DATE_TAG.match(ver) and len(keep_dates) < args.keep_latest:
+            keep_dates.append(ver)
+        elif SHA_TAG.match(ver) and len(keep_shas) < args.keep_latest:
+            keep_shas.append(ver)
+    keep = {"latest", *keep_dates, *keep_shas}
+    print(f" keep tags: {sorted(keep)}")
+    deleted = skipped_blob = skipped_age = skipped_keep = 0
+    for v in versions:
+        ver = v.get("version") or ""
+        ts = parsed_ts(v)
+        if BLOB_VER.match(ver):
+            skipped_blob += 1
             continue
-        if created >= cutoff:
+        if ver in keep:
+            skipped_keep += 1
             continue
-        version_id = v.get("id")
-        print(f" deleting v{version_id} tags={tags} created={v['created_at']}")
+        if ts >= cutoff:
+            skipped_age += 1
+            continue
+        print(f" deleting {ver!r} id={v.get('id')} created={v.get('created_at')}")
         if not args.dry_run:
             api(token, "DELETE",
-                f"/api/v1/packages/{args.owner}/container/{args.package}/versions/{version_id}")
+                f"/api/v1/packages/{args.owner}/container/{args.package}/{quote(ver, safe='')}")
         deleted += 1
-    print(f"done: {deleted} version(s) deleted")
+    print(f"done: deleted={deleted} kept_named={skipped_keep} "
+          f"kept_recent={skipped_age} skipped_blobs={skipped_blob}")
     return 0
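
To make the new keep-set rule easier to check in isolation, here is a small sketch that applies the same two tag patterns to a plain list of version strings. `compute_keep` is a hypothetical helper written for this example (the script computes the set inline in `main`), and the sample tags are invented. The input is assumed to be sorted newest first already.

```python
import re

DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")   # e.g. 2026.05.22
SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")         # short or full git SHA


def compute_keep(versions_newest_first: list[str], keep_latest: int) -> set[str]:
    """Return the tags that must survive GC: latest + top-N dates + top-N SHAs."""
    keep_dates: list[str] = []
    keep_shas: list[str] = []
    for ver in versions_newest_first:
        if DATE_TAG.match(ver) and len(keep_dates) < keep_latest:
            keep_dates.append(ver)
        elif SHA_TAG.match(ver) and len(keep_shas) < keep_latest:
            keep_shas.append(ver)
    return {"latest", *keep_dates, *keep_shas}


# Invented tag list, newest first.
tags = ["latest", "2026.05.22", "a1b2c3d", "sha256:deadbeef",
        "2026.05.21", "9f8e7d6", "2026.05.20"]
keep = compute_keep(tags, keep_latest=2)
```

With `keep_latest=2` the third date tag falls out of the keep-set, so it is deleted only if it is also older than the `--keep-days` cutoff.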
+120
@@ -0,0 +1,120 @@
"""Minimal HTTP reranker — `/v1/rerank` endpoint over a sentence-transformers CrossEncoder.
Matches the Cohere `/v1/rerank` request/response shape, which is what the
server's `_rerank()` helper expects. This is the dev-friendly fallback;
production replaces this with the llama.cpp + jina-reranker-v2-base GGUF
sidecar (see deploy/docker-compose.yml) without changing the client.
Request:
POST /v1/rerank
{"model": "...", "query": "...", "documents": ["text", ...], "top_n": 10}
Response:
{"model": "...", "results": [{"index": 0, "relevance_score": 0.93}, ...]}
Usage:
python -m scripts.rerank_server # localhost:8001
RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-12-v2 \\
RERANK_PORT=8001 python -m scripts.rerank_server
"""
from __future__ import annotations
import json
import logging
import os
import sys
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
MODEL_NAME = os.environ.get("RERANK_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")
PORT = int(os.environ.get("RERANK_PORT", "8001"))
HOST = os.environ.get("RERANK_HOST", "127.0.0.1")
# Truncate docs to this many chars before scoring. jina-reranker GGUF has a
# 1024-token per-pair cap that 400s the entire batch; ms-marco is more
# forgiving but we still cap to keep latency predictable.
MAX_DOC_CHARS = int(os.environ.get("RERANK_MAX_DOC_CHARS", "2000"))
_model = None
def _get_model():
    global _model
    if _model is None:
        from sentence_transformers import CrossEncoder
        log.info("loading %s", MODEL_NAME)
        _model = CrossEncoder(MODEL_NAME)
        log.info("loaded")
    return _model
def _rerank(query: str, documents: list[str], top_n: int | None) -> list[dict]:
    model = _get_model()
    pairs = [[query, (d or "")[:MAX_DOC_CHARS]] for d in documents]
    scores = model.predict(pairs)
    ranked = sorted(
        ({"index": i, "relevance_score": float(s)} for i, s in enumerate(scores)),
        key=lambda r: -r["relevance_score"],
    )
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
class Handler(BaseHTTPRequestHandler):
    def log_message(self, fmt, *args):
        log.info("%s - %s", self.address_string(), fmt % args)
    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def do_GET(self):  # noqa: N802
        if self.path in ("/", "/health"):
            self._send_json(200, {"status": "ok", "model": MODEL_NAME})
            return
        self._send_json(404, {"error": "not found"})
    def do_POST(self):  # noqa: N802
        if self.path not in ("/v1/rerank", "/rerank"):
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", "0"))
        try:
            req = json.loads(self.rfile.read(length).decode())
        except Exception as e:
            self._send_json(400, {"error": f"bad json: {e}"})
            return
        query = req.get("query")
        documents = req.get("documents")
        if not isinstance(query, str) or not isinstance(documents, list):
            self._send_json(400, {"error": "expected {query: str, documents: list[str]}"})
            return
        top_n = req.get("top_n")
        try:
            results = _rerank(query, documents, top_n if isinstance(top_n, int) else None)
        except Exception as e:
            log.exception("rerank failed")
            self._send_json(500, {"error": str(e)})
            return
        self._send_json(200, {"model": MODEL_NAME, "results": results})
def main() -> int:
    _get_model()  # warm-load before accepting traffic
    server = ThreadingHTTPServer((HOST, PORT), Handler)
    log.info("listening on http://%s:%d", HOST, PORT)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        log.info("shutting down")
    return 0
if __name__ == "__main__":
    sys.exit(main())
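
For reference, a client for the request/response shape documented in the module docstring might look like the sketch below. It uses only the standard library; `build_rerank_request` and `apply_ranking` are hypothetical helper names, not the server's `_rerank()` client, and the response shown is a hand-written stub so the example runs without a live reranker.

```python
import json
from urllib.request import Request


def build_rerank_request(base_url, query, documents, top_n=None):
    """Build a POST /v1/rerank request in the shape the docstring describes."""
    payload = {"query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n
    return Request(
        f"{base_url}/v1/rerank",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def apply_ranking(documents, response):
    """Pair each returned index with its original document text and score."""
    return [(documents[r["index"]], r["relevance_score"]) for r in response["results"]]


docs = ["install guide", "license terms", "release notes"]
req = build_rerank_request("http://127.0.0.1:8001", "how to install", docs, top_n=2)

# Stub response; a real call would be json.load(urllib.request.urlopen(req)).
fake_response = {
    "model": "stub",
    "results": [
        {"index": 0, "relevance_score": 0.93},
        {"index": 2, "relevance_score": 0.12},
    ],
}
ranked = apply_ranking(docs, fake_response)
```

Because the server returns indexes rather than document text, the caller has to keep the original `documents` list in the same order it was sent.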