build out morpheus-docs MCP stack, mirroring hvm-docs through Phases 1-13
Initial scaffold: the docs-mcp-template clone with all the
HVM-validated stack ported across, customized for Morpheus
Enterprise (PRODUCT_NAME=morpheus, server name morpheus-docs).
Bundles (live-discovered 2026-05-22; 1710 cataloged pages total):
* morpheus_user_manual_8_1_0 sd00007510en_us 568 pages (Feb 2026)
* morpheus_user_manual_8_1_1 sd00007621en_us 569 pages (Mar 2026)
* morpheus_user_manual_8_1_2 sd00007732en_us 569 pages (Apr 2026)
* morpheus_release_notes_8_1_0 sd00007496en_us single-doc
* morpheus_release_notes_8_1_1 sd00007610en_us single-doc
* morpheus_release_notes_8_1_2 sd00007733en_us single-doc
* morpheus_quickspecs a50009231enw html-file (live
curl_cffi against www.hpe.com; all 12+ Enterprise SKUs captured —
S6E64..S6E73AAE for new/renewal/upgrade × 1/3/5-yr terms, plus
services SKUs HA124A1#V38/V39 and H46SBA1).
No Deployment Guide or Qualification Matrix on HPE Support for
Morpheus Enterprise specifically — the only QM (sd00006551en_us)
covers HVM clusters managed by Morpheus and lives in hvm-docs.
Stack carried forward from hvm-docs:
* rag/{index,chunk,embeddings,bm25}.py — including the
MAX_CHARS=4000 chunk-cap fix for table-dense content
* docs_mcp/{server,usage}.py — 11 MCP tools, BM25-default search,
cross-encoder rerank, hybrid behind HYBRID_SEARCH=true,
morpheus_api_lessons (renamed from hvm_api_lessons), env-gated
submit_doc_bug
* docs_mcp/api_lessons.md — Morpheus-specific scaffold covering
licensing model, HVM elevation path, REST vs Plugin API, with
TODO markers for sections to flesh out from real ops experience
* scrape/{runner,quickspecs,changelog,bundles}.py — TOC + single-doc
+ html-file modes, curl_cffi Chrome120 for www.hpe.com edge bypass
* eval/{retrievers,run_eval}.py + queries.jsonl scaffold (4 placeholder
queries; populate after first scrape)
* scripts/{rerank_server,usage_report,registry_gc}.py
* .gitea/workflows/{refresh,image-only}.yml — same Gitea Actions
setup zerto-docs uses (push LAN, pull public-URL, GPU Ollama pool)
* deploy/docker-compose.yml — morpheus-docs-mcp service definition,
shared jina-rerank sidecar, Watchtower-labeled
* Dockerfile, requirements.txt, requirements-rerank.txt
Verified locally: scrape produced 1599 .md pages (some TOC entries
are parent-only and yield no body), 6353 chunks all under the 4 KB
cap, MCP server boots and lists 11 tools cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -14,21 +14,17 @@ on:
|
|||||||
workflow_dispatch:
|
workflow_dispatch:
|
||||||
|
|
||||||
env:
|
env:
|
||||||
REGISTRY_PUSH: <lan-host>:<port>
|
# PUSH goes to the LAN endpoint (HTTP) to bypass Cloudflare's 100 MB
|
||||||
REGISTRY_PULL: <public-registry-hostname>
|
# body cap. PULL uses the public hostname (HTTPS). Same Gitea registry.
|
||||||
# Image name derives from the actual repo at runtime, so a clone
|
REGISTRY_PUSH: 192.168.0.2:1234
|
||||||
# doesn't need to find/replace anything. e.g. justin/my-product-docs.
|
REGISTRY_PULL: git.jpaul.io
|
||||||
# github.* context is Gitea Actions' inherited GitHub-Actions namespace
|
|
||||||
# — values come from the Gitea server, not github.com.
|
|
||||||
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
|
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
OLLAMA_URL: http://<gpu-host>:11434
|
# Two GPU-pinned Ollama containers on the Gitea host — same infra
|
||||||
|
# zerto-docs uses. :11435 = Titan X, :11436 = 1080 Ti. Indexer
|
||||||
|
# round-robins per batch.
|
||||||
|
OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
|
||||||
EMBED_MODEL: nomic-embed-text
|
EMBED_MODEL: nomic-embed-text
|
||||||
# PRODUCT_NAME defaults to the repo name so a clone works without
|
PRODUCT_NAME: morpheus
|
||||||
# editing. Override here if you want a different identifier (e.g.
|
|
||||||
# repo "my-product-docs" → PRODUCT_NAME "myproduct"). Used as the
|
|
||||||
# Chroma collection name, BM25 db filename, and MCP server name —
|
|
||||||
# see docs_mcp/server.py.
|
|
||||||
PRODUCT_NAME: ${{ github.event.repository.name }}
|
|
||||||
|
|
||||||
jobs:
|
jobs:
|
||||||
build:
|
build:
|
||||||
@@ -39,8 +35,7 @@ jobs:
|
|||||||
- name: Checkout
|
- name: Checkout
|
||||||
uses: actions/checkout@v4
|
uses: actions/checkout@v4
|
||||||
with:
|
with:
|
||||||
# Full history (not shallow) so the digest-history step can
|
# Full history so digest-history can walk git log.
|
||||||
# walk git log up to --history-days back.
|
|
||||||
fetch-depth: 0
|
fetch-depth: 0
|
||||||
|
|
||||||
- name: Set up Python
|
- name: Set up Python
|
||||||
@@ -54,9 +49,8 @@ jobs:
|
|||||||
python -m pip install -q -r requirements.txt
|
python -m pip install -q -r requirements.txt
|
||||||
|
|
||||||
- name: Refresh digest history
|
- name: Refresh digest history
|
||||||
# Cheap (a few seconds); doesn't touch corpus content.
|
# Cheap (few seconds). Without this step, a code-only deploy
|
||||||
# Without this step, a code-only deploy would ship an
|
# would ship an increasingly-stale digest history.
|
||||||
# increasingly-stale digest history relative to git.
|
|
||||||
run: |
|
run: |
|
||||||
mkdir -p corpus/.digest
|
mkdir -p corpus/.digest
|
||||||
python -m scrape.changelog \
|
python -m scrape.changelog \
|
||||||
@@ -71,42 +65,69 @@ jobs:
|
|||||||
- name: Rebuild indexes from existing corpus
|
- name: Rebuild indexes from existing corpus
|
||||||
run: python -m rag.index --rebuild
|
run: python -m rag.index --rebuild
|
||||||
|
|
||||||
- name: Log in to registry (LAN endpoint)
|
- name: Set up Docker Buildx
|
||||||
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
|
uses: docker/setup-buildx-action@v3
|
||||||
|
with:
|
||||||
|
# LAN registry is HTTP only.
|
||||||
|
config-inline: |
|
||||||
|
[registry."192.168.0.2:1234"]
|
||||||
|
http = true
|
||||||
|
insecure = true
|
||||||
|
|
||||||
- name: Build & push image
|
- name: Configure registry credentials for buildx
|
||||||
|
env:
|
||||||
|
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
||||||
|
REGISTRY_USER: ${{ github.actor }}
|
||||||
run: |
|
run: |
|
||||||
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
|
mkdir -p ~/.docker
|
||||||
DATE_TAG=$(date -u +%Y.%m.%d)
|
AUTH=$(printf '%s:%s' "$REGISTRY_USER" "$REGISTRY_TOKEN" | base64 -w0)
|
||||||
docker build \
|
cat > ~/.docker/config.json <<EOF
|
||||||
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \
|
{
|
||||||
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
|
"auths": {
|
||||||
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
|
"192.168.0.2:1234": {
|
||||||
.
|
"auth": "$AUTH"
|
||||||
docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
|
}
|
||||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
|
}
|
||||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
- name: Compute tags
|
||||||
|
id: meta
|
||||||
|
uses: docker/metadata-action@v5
|
||||||
|
with:
|
||||||
|
images: 192.168.0.2:1234/${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
tags: |
|
||||||
|
type=raw,value=latest
|
||||||
|
type=sha,prefix=,format=short
|
||||||
|
type=raw,value={{date 'YYYY.MM.DD'}}
|
||||||
|
labels: |
|
||||||
|
org.opencontainers.image.source=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
org.opencontainers.image.url=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
|
||||||
|
- name: Build & push (amd64)
|
||||||
|
uses: docker/build-push-action@v6
|
||||||
|
with:
|
||||||
|
context: .
|
||||||
|
platforms: linux/amd64
|
||||||
|
push: true
|
||||||
|
tags: ${{ steps.meta.outputs.tags }}
|
||||||
|
labels: ${{ steps.meta.outputs.labels }}
|
||||||
|
|
||||||
- name: Link container package to this repo
|
- name: Link container package to this repo
|
||||||
# Gitea container packages are owned by a USER, not a repo —
|
|
||||||
# they don't auto-appear under the repo's Packages tab.
|
|
||||||
# This API call creates the association. One-time-effective:
|
|
||||||
# re-running returns 400 once linked, which we swallow.
|
|
||||||
# Endpoint requires Gitea 1.21+.
|
|
||||||
env:
|
env:
|
||||||
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
||||||
run: |
|
run: |
|
||||||
OWNER="${{ github.repository_owner }}"
|
OWNER="${{ github.repository_owner }}"
|
||||||
PKG="${{ github.event.repository.name }}"
|
PKG="${{ github.event.repository.name }}"
|
||||||
BODY=$(mktemp)
|
code=$(curl -s -o /tmp/link.out -w "%{http_code}" -X POST \
|
||||||
CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
|
|
||||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||||
"https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
|
"https://git.jpaul.io/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
|
||||||
echo "link http=$CODE body=$(cat "$BODY")"
|
echo "link ${OWNER}/container/${PKG} -> ${PKG}: HTTP ${code}"
|
||||||
case "$CODE" in
|
body=$(cat /tmp/link.out)
|
||||||
201) echo "linked package to ${OWNER}/${PKG}" ;;
|
case "$code" in
|
||||||
400) echo "already linked (re-link returns 400) — ok" ;;
|
201) echo "OK — newly linked" ;;
|
||||||
*) echo "unexpected status $CODE"; exit 1 ;;
|
400|409) echo "OK — already linked: ${body}" ;;
|
||||||
|
*) echo "unexpected: ${body}"; exit 1 ;;
|
||||||
esac
|
esac
|
||||||
|
|
||||||
- name: Prune old container versions
|
- name: Prune old container versions
|
||||||
|
|||||||
@@ -19,27 +19,25 @@ on:
|
|||||||
default: false
|
default: false
|
||||||
|
|
||||||
env:
|
env:
|
||||||
# If your registry sits behind Cloudflare with its 100 MB body cap,
|
# PUSH goes to the LAN endpoint (HTTP) to bypass Cloudflare Tunnel's
|
||||||
# use a LAN endpoint for pushes (bypasses CF) and the public hostname
|
# 100 MB body cap. PULL uses the public hostname (HTTPS). Same Gitea
|
||||||
# for pulls (response bodies aren't capped).
|
# registry either way — package lands under the same owner/repo.
|
||||||
REGISTRY_PUSH: <lan-host>:<port>
|
REGISTRY_PUSH: 192.168.0.2:1234
|
||||||
REGISTRY_PULL: <public-registry-hostname>
|
REGISTRY_PULL: git.jpaul.io
|
||||||
# Image name derives from the actual repo at runtime, so a clone
|
|
||||||
# doesn't need to find/replace anything. e.g. justin/my-product-docs.
|
# Image name derives from the repo at runtime — clones don't need to
|
||||||
# github.* context is Gitea Actions' inherited GitHub-Actions namespace
|
# edit this. github.* is the Gitea-Actions inherited namespace.
|
||||||
# — values come from the Gitea server, not github.com.
|
|
||||||
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
|
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
|
||||||
# Embedder. One URL per GPU; the indexer round-robins.
|
# Two GPU-pinned Ollama containers on the Gitea host — same infra
|
||||||
OLLAMA_URL: http://<gpu-host>:11434
|
# zerto-docs uses (deploy/ollama-rag.docker-compose.yml over there).
|
||||||
|
# :11435 owns the Titan X, :11436 owns the 1080 Ti; the indexer
|
||||||
|
# round-robins per batch so both cards run in parallel. The host's
|
||||||
|
# primary Ollama on :11434 is left alone for OpenWebUI etc.
|
||||||
|
OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
|
||||||
EMBED_MODEL: nomic-embed-text
|
EMBED_MODEL: nomic-embed-text
|
||||||
|
|
||||||
# PRODUCT_NAME defaults to the repo name so a clone works without
|
PRODUCT_NAME: morpheus
|
||||||
# editing. Override here if you want a different identifier (e.g.
|
|
||||||
# repo "my-product-docs" → PRODUCT_NAME "myproduct"). Used as the
|
|
||||||
# Chroma collection name, BM25 db filename, and MCP server name —
|
|
||||||
# see docs_mcp/server.py.
|
|
||||||
PRODUCT_NAME: ${{ github.event.repository.name }}
|
|
||||||
|
|
||||||
jobs:
|
jobs:
|
||||||
refresh:
|
refresh:
|
||||||
@@ -50,10 +48,12 @@ jobs:
|
|||||||
- name: Checkout
|
- name: Checkout
|
||||||
uses: actions/checkout@v4
|
uses: actions/checkout@v4
|
||||||
with:
|
with:
|
||||||
# Full history — required for the digest-history step to
|
# Full history — required for digest-history. Default depth 1
|
||||||
# walk git log. Default fetch-depth: 1 silently produces a
|
# silently produces a 0-byte history file.
|
||||||
# 0-byte history file.
|
|
||||||
fetch-depth: 0
|
fetch-depth: 0
|
||||||
|
# Set the credentials Gitea injects so we can push corpus
|
||||||
|
# commits back. Persist them across the run.
|
||||||
|
token: ${{ secrets.GITEA_TOKEN }}
|
||||||
|
|
||||||
- name: Set up Python
|
- name: Set up Python
|
||||||
uses: actions/setup-python@v5
|
uses: actions/setup-python@v5
|
||||||
@@ -89,8 +89,8 @@ jobs:
|
|||||||
- name: Commit corpus changes (if any)
|
- name: Commit corpus changes (if any)
|
||||||
id: commit
|
id: commit
|
||||||
run: |
|
run: |
|
||||||
git config user.name "<product>-docs-refresh"
|
git config user.name "hvm-docs-refresh"
|
||||||
git config user.email "actions@<your-domain>"
|
git config user.email "actions@jpaul.io"
|
||||||
git add bundles.json corpus
|
git add bundles.json corpus
|
||||||
if git diff --cached --quiet; then
|
if git diff --cached --quiet; then
|
||||||
echo "no corpus changes — skipping reindex and image build"
|
echo "no corpus changes — skipping reindex and image build"
|
||||||
@@ -132,49 +132,89 @@ jobs:
|
|||||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||||
run: python -m rag.index --rebuild
|
run: python -m rag.index --rebuild
|
||||||
|
|
||||||
# ---- Build & push image ------------------------------------
|
# ---- Build & push image (LAN endpoint, buildx) -------------
|
||||||
- name: Log in to registry (LAN endpoint)
|
- name: Set up Docker Buildx
|
||||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||||
run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
|
uses: docker/setup-buildx-action@v3
|
||||||
|
with:
|
||||||
|
# LAN registry is HTTP only. Buildkit needs an explicit
|
||||||
|
# insecure-registry config or it tries to upgrade to HTTPS.
|
||||||
|
config-inline: |
|
||||||
|
[registry."192.168.0.2:1234"]
|
||||||
|
http = true
|
||||||
|
insecure = true
|
||||||
|
|
||||||
- name: Build & push image
|
- name: Configure registry credentials for buildx
|
||||||
|
# Can't use docker/login-action against the LAN endpoint —
|
||||||
|
# the host docker daemon errors on HTTP-vs-HTTPS. Buildx reads
|
||||||
|
# ~/.docker/config.json directly, so write the auth ourselves.
|
||||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||||
# Runner shell is /bin/sh — use cut instead of ${VAR::N}.
|
env:
|
||||||
# Three tags: :latest (Watchtower target), :<sha12>
|
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
||||||
# (rollback pin), :<YYYY.MM.DD> (human-readable).
|
REGISTRY_USER: ${{ github.actor }}
|
||||||
run: |
|
run: |
|
||||||
SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
|
mkdir -p ~/.docker
|
||||||
DATE_TAG=$(date -u +%Y.%m.%d)
|
AUTH=$(printf '%s:%s' "$REGISTRY_USER" "$REGISTRY_TOKEN" | base64 -w0)
|
||||||
docker build \
|
cat > ~/.docker/config.json <<EOF
|
||||||
-t "${REGISTRY_PUSH}/${IMAGE}:latest" \
|
{
|
||||||
-t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
|
"auths": {
|
||||||
-t "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}" \
|
"192.168.0.2:1234": {
|
||||||
.
|
"auth": "$AUTH"
|
||||||
docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
|
}
|
||||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
|
}
|
||||||
docker push "${REGISTRY_PUSH}/${IMAGE}:${DATE_TAG}"
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
- name: Compute tags
|
||||||
|
id: meta
|
||||||
|
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||||
|
uses: docker/metadata-action@v5
|
||||||
|
with:
|
||||||
|
# Tag with the LAN hostname so the push goes over LAN.
|
||||||
|
# docker-compose on the deploy host pulls via git.jpaul.io.
|
||||||
|
images: 192.168.0.2:1234/${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
tags: |
|
||||||
|
type=raw,value=latest
|
||||||
|
type=sha,prefix=,format=short
|
||||||
|
type=schedule,pattern={{date 'YYYY.MM.DD'}}
|
||||||
|
type=raw,value={{date 'YYYY.MM.DD'}}
|
||||||
|
# Override auto-derived labels with the PUBLIC URL so Gitea
|
||||||
|
# can auto-link the package back to this repo.
|
||||||
|
labels: |
|
||||||
|
org.opencontainers.image.source=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
org.opencontainers.image.url=https://git.jpaul.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
|
||||||
|
|
||||||
|
- name: Build & push (amd64)
|
||||||
|
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||||
|
uses: docker/build-push-action@v6
|
||||||
|
with:
|
||||||
|
context: .
|
||||||
|
platforms: linux/amd64
|
||||||
|
push: true
|
||||||
|
tags: ${{ steps.meta.outputs.tags }}
|
||||||
|
labels: ${{ steps.meta.outputs.labels }}
|
||||||
|
|
||||||
- name: Link container package to this repo
|
- name: Link container package to this repo
|
||||||
# Gitea container packages are owned by a USER, not a repo —
|
# Idempotent linkage so the package shows under the repo's
|
||||||
# they don't auto-appear under the repo's Packages tab.
|
# Packages tab. Gitea's auto-link from the source label is
|
||||||
# This API call creates the association. One-time-effective:
|
# unreliable in this setup (the runner reports an internal
|
||||||
# re-running returns 400 once linked, which we swallow.
|
# server URL), so we link explicitly. 201 = newly linked,
|
||||||
# Endpoint requires Gitea 1.21+.
|
# 400 = already linked (treated as success).
|
||||||
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
|
||||||
env:
|
env:
|
||||||
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
|
||||||
run: |
|
run: |
|
||||||
OWNER="${{ github.repository_owner }}"
|
OWNER="${{ github.repository_owner }}"
|
||||||
PKG="${{ github.event.repository.name }}"
|
PKG="${{ github.event.repository.name }}"
|
||||||
BODY=$(mktemp)
|
code=$(curl -s -o /tmp/link.out -w "%{http_code}" -X POST \
|
||||||
CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
|
|
||||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||||
"https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
|
"https://git.jpaul.io/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
|
||||||
echo "link http=$CODE body=$(cat "$BODY")"
|
echo "link ${OWNER}/container/${PKG} -> ${PKG}: HTTP ${code}"
|
||||||
case "$CODE" in
|
body=$(cat /tmp/link.out)
|
||||||
201) echo "linked package to ${OWNER}/${PKG}" ;;
|
case "$code" in
|
||||||
400) echo "already linked (re-link returns 400) — ok" ;;
|
201) echo "OK — newly linked" ;;
|
||||||
*) echo "unexpected status $CODE"; exit 1 ;;
|
400|409) echo "OK — already linked: ${body}" ;;
|
||||||
|
*) echo "unexpected: ${body}"; exit 1 ;;
|
||||||
esac
|
esac
|
||||||
|
|
||||||
# ---- Registry GC -------------------------------------------
|
# ---- Registry GC -------------------------------------------
|
||||||
|
|||||||
+119
@@ -0,0 +1,119 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"slug": "morpheus_user_manual_8_1_0",
|
||||||
|
"doc_id": "sd00007510en_us",
|
||||||
|
"title": "HPE Morpheus Enterprise Software Documentation v8.1.0",
|
||||||
|
"version": "8.1.0",
|
||||||
|
"platform": null,
|
||||||
|
"product": "User Manual",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 568,
|
||||||
|
"mode": "toc",
|
||||||
|
"abstract": "",
|
||||||
|
"dates": {
|
||||||
|
"Published": "February 2026"
|
||||||
|
},
|
||||||
|
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
|
||||||
|
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007510en_us"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"slug": "morpheus_user_manual_8_1_1",
|
||||||
|
"doc_id": "sd00007621en_us",
|
||||||
|
"title": "HPE Morpheus Enterprise Software Documentation v8.1.1",
|
||||||
|
"version": "8.1.1",
|
||||||
|
"platform": null,
|
||||||
|
"product": "User Manual",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 569,
|
||||||
|
"mode": "toc",
|
||||||
|
"abstract": "",
|
||||||
|
"dates": {
|
||||||
|
"Published": "March 2026"
|
||||||
|
},
|
||||||
|
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
|
||||||
|
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007621en_us"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"slug": "morpheus_user_manual_8_1_2",
|
||||||
|
"doc_id": "sd00007732en_us",
|
||||||
|
"title": "HPE Morpheus Enterprise Software Documentation v8.1.2",
|
||||||
|
"version": "8.1.2",
|
||||||
|
"platform": null,
|
||||||
|
"product": "User Manual",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 569,
|
||||||
|
"mode": "toc",
|
||||||
|
"abstract": "",
|
||||||
|
"dates": {
|
||||||
|
"Published": "April 2026"
|
||||||
|
},
|
||||||
|
"landing_page": "GUID-709AAADB-A9C1-40B6-AD22-958EE7E6F312",
|
||||||
|
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007732en_us"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"slug": "morpheus_release_notes_8_1_0",
|
||||||
|
"doc_id": "sd00007496en_us",
|
||||||
|
"title": "v8.1.0 Release Notes",
|
||||||
|
"version": "8.1.0",
|
||||||
|
"platform": null,
|
||||||
|
"product": "Release Notes",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 1,
|
||||||
|
"mode": "single",
|
||||||
|
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.0",
|
||||||
|
"dates": {
|
||||||
|
"Published": "February 2026"
|
||||||
|
},
|
||||||
|
"landing_page": "sd00007496en_us",
|
||||||
|
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007496en_us"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"slug": "morpheus_release_notes_8_1_1",
|
||||||
|
"doc_id": "sd00007610en_us",
|
||||||
|
"title": "v8.1.1 Release Notes",
|
||||||
|
"version": "8.1.1",
|
||||||
|
"platform": null,
|
||||||
|
"product": "Release Notes",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 1,
|
||||||
|
"mode": "single",
|
||||||
|
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.1",
|
||||||
|
"dates": {
|
||||||
|
"Published": "March 2026"
|
||||||
|
},
|
||||||
|
"landing_page": "sd00007610en_us",
|
||||||
|
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007610en_us"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"slug": "morpheus_release_notes_8_1_2",
|
||||||
|
"doc_id": "sd00007733en_us",
|
||||||
|
"title": "v8.1.2 Release Notes",
|
||||||
|
"version": "8.1.2",
|
||||||
|
"platform": null,
|
||||||
|
"product": "Release Notes",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 1,
|
||||||
|
"mode": "single",
|
||||||
|
"abstract": "Release notes for HPE Morpheus Enterprise Software version v8.1.2",
|
||||||
|
"dates": {
|
||||||
|
"Published": "April 2026"
|
||||||
|
},
|
||||||
|
"landing_page": "sd00007733en_us",
|
||||||
|
"source_url": "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007733en_us"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"slug": "morpheus_quickspecs",
|
||||||
|
"doc_id": "a50009231enw",
|
||||||
|
"title": "HPE Morpheus Enterprise Software QuickSpecs",
|
||||||
|
"version": "v1",
|
||||||
|
"platform": null,
|
||||||
|
"product": "QuickSpecs",
|
||||||
|
"language": "en-US",
|
||||||
|
"page_count": 1,
|
||||||
|
"mode": "html-file",
|
||||||
|
"abstract": "",
|
||||||
|
"dates": {},
|
||||||
|
"landing_page": "a50009231enw",
|
||||||
|
"source_url": "https://www.hpe.com/psnow/doc/a50009231enw"
|
||||||
|
}
|
||||||
|
]
|
||||||
+23
-17
@@ -1,6 +1,6 @@
|
|||||||
# Hosting stack for a docs MCP server.
|
# Hosting stack for a docs MCP server.
|
||||||
#
|
#
|
||||||
# Replace <product> below with your product name on first deploy.
|
# Replace hvm below with your product name on first deploy.
|
||||||
# Volumes: usage logs are mounted to a host path so they survive
|
# Volumes: usage logs are mounted to a host path so they survive
|
||||||
# Watchtower-driven container recreates.
|
# Watchtower-driven container recreates.
|
||||||
#
|
#
|
||||||
@@ -10,15 +10,15 @@
|
|||||||
services:
|
services:
|
||||||
|
|
||||||
# The MCP server. Watchtower auto-pulls on :latest changes.
|
# The MCP server. Watchtower auto-pulls on :latest changes.
|
||||||
<product>-docs-mcp:
|
morpheus-docs-mcp:
|
||||||
image: <registry>/<owner>/<product>-docs-mcp:latest
|
image: git.jpaul.io/justin/morpheus-docs:latest
|
||||||
container_name: <product>-docs-mcp
|
container_name: morpheus-docs-mcp
|
||||||
restart: unless-stopped
|
restart: unless-stopped
|
||||||
ports:
|
ports:
|
||||||
- "8000:8000"
|
- "8000:8000"
|
||||||
environment:
|
environment:
|
||||||
PRODUCT_NAME: "<product>"
|
PRODUCT_NAME: "morpheus"
|
||||||
PRODUCT_DOCS_URL: "https://docs.example.com"
|
PRODUCT_DOCS_URL: "https://support.hpe.com/hpesc/public/docDisplay?docId=sd00007732en_us"
|
||||||
|
|
||||||
# Streamable-HTTP transport. Stateless mode is required for
|
# Streamable-HTTP transport. Stateless mode is required for
|
||||||
# production: clients don't lose sessions when Watchtower
|
# production: clients don't lose sessions when Watchtower
|
||||||
@@ -28,19 +28,21 @@ services:
|
|||||||
MCP_PORT: "8000"
|
MCP_PORT: "8000"
|
||||||
|
|
||||||
# If you run MetaMCP or another gateway in front and reach
|
# If you run MetaMCP or another gateway in front and reach
|
||||||
# this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
|
# this container via its compose DNS name (e.g. morpheus-docs-mcp:8000),
|
||||||
# add that hostname here. "*" disables the rebind check entirely.
|
# add that hostname here. "*" disables the rebind check entirely.
|
||||||
MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"
|
MCP_ALLOWED_HOSTS: "morpheus-docs-mcp,localhost,127.0.0.1"
|
||||||
|
|
||||||
# Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
|
# Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
|
||||||
RERANK_URL: http://<product>-rerank:8080
|
RERANK_URL: http://hvm-rerank:8080
|
||||||
RERANK_POOL: "200"
|
RERANK_POOL: "200"
|
||||||
RERANK_TIMEOUT: "30"
|
RERANK_TIMEOUT: "30"
|
||||||
|
|
||||||
# Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
|
# Phase 8 — hybrid retrieval (BM25 + dense + RRF).
|
||||||
# only after the eval harness shows the dense-only path
|
# Eval on the HVM corpus (eval/results/baseline.md, 2026-05-22) shows
|
||||||
# missing technical-term queries that BM25 catches.
|
# BM25-default + reranker beats hybrid on every metric (MRR 0.920 vs
|
||||||
HYBRID_SEARCH: "true"
|
# 0.875). Leaving HYBRID_SEARCH off so search_docs runs BM25-first +
|
||||||
|
# reranker; dense is the fallback when BM25 finds nothing.
|
||||||
|
HYBRID_SEARCH: "false"
|
||||||
|
|
||||||
# Phase 10 — usage telemetry.
|
# Phase 10 — usage telemetry.
|
||||||
USAGE_LOG_DIR: /app/var/logs
|
USAGE_LOG_DIR: /app/var/logs
|
||||||
@@ -52,9 +54,9 @@ services:
|
|||||||
# DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
|
# DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
|
||||||
volumes:
|
volumes:
|
||||||
# Usage logs persist across container recreates.
|
# Usage logs persist across container recreates.
|
||||||
- ./<product>-docs-mcp-logs:/app/var/logs
|
- ./morpheus-docs-mcp-logs:/app/var/logs
|
||||||
depends_on:
|
depends_on:
|
||||||
- <product>-rerank
|
- hvm-rerank
|
||||||
labels:
|
labels:
|
||||||
# Watchtower polls *only* containers with this label set true.
|
# Watchtower polls *only* containers with this label set true.
|
||||||
com.centurylinklabs.watchtower.enable: "true"
|
com.centurylinklabs.watchtower.enable: "true"
|
||||||
@@ -63,9 +65,13 @@ services:
|
|||||||
|
|
||||||
# Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
|
# Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
|
||||||
# Requires GPU access; adjust runtime/devices for your hardware.
|
# Requires GPU access; adjust runtime/devices for your hardware.
|
||||||
<product>-rerank:
|
#
|
||||||
|
# For dev / CPU-only hosts, swap this service for scripts/rerank_server.py
|
||||||
|
# (sentence-transformers ms-marco-MiniLM-L-6-v2). Same /v1/rerank shape,
|
||||||
|
# ~500ms/batch on CPU vs ~50ms on GPU with the jina GGUF.
|
||||||
|
hvm-rerank:
|
||||||
image: ghcr.io/ggml-org/llama.cpp:server-cuda
|
image: ghcr.io/ggml-org/llama.cpp:server-cuda
|
||||||
container_name: <product>-rerank
|
container_name: hvm-rerank
|
||||||
restart: unless-stopped
|
restart: unless-stopped
|
||||||
# Mount the GGUF model from the host. Download from huggingface
|
# Mount the GGUF model from the host. Download from huggingface
|
||||||
# (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
|
# (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
|
||||||
|
|||||||
@@ -0,0 +1,148 @@
|
|||||||
|
# HPE Morpheus Enterprise — Lessons
|
||||||
|
|
||||||
|
Notes and gotchas about running, integrating with, and licensing
|
||||||
|
**HPE Morpheus Enterprise Software** that aren't obvious from the
|
||||||
|
official docs alone. The official User Manual + Release Notes +
|
||||||
|
QuickSpecs describe the product as designed; this file is what
|
||||||
|
experienced operators actually learn.
|
||||||
|
|
||||||
|
> Treat this as living context. Update it when you (or the LLM
|
||||||
|
> driving this MCP) discover something non-obvious that the docs
|
||||||
|
> don't say or don't make findable. Each section is an H2 so the
|
||||||
|
> `morpheus_api_lessons(topic=...)` tool can return just the
|
||||||
|
> relevant piece.
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
- **Morpheus Enterprise is the full cloud-management platform.** HPE
|
||||||
|
Morpheus VM Essentials (HVM) is the VM-only subset; Morpheus
|
||||||
|
Enterprise is what you "elevate to" when you need multi-cloud,
|
||||||
|
containers, automation, policy, FinOps, ITSM integration, and
|
||||||
|
self-service catalogs. The relationship is one-way upgrade.
|
||||||
|
- **Licensing is per physical CPU socket** on connected on-prem
|
||||||
|
clouds (bare metal, hypervisor hosts, Kubernetes worker nodes).
|
||||||
|
Public-cloud workloads (AWS / Azure / GCP / OCI) are factored at
|
||||||
|
**15 workloads per socket** equivalent.
|
||||||
|
- **All license SKUs include Tech Care Essentials 24×7** as part
|
||||||
|
of the license cost. There is no separate purchase for support
|
||||||
|
on the license tier.
|
||||||
|
- **`morpheus_quickspecs` is the source of truth for SKUs.** Don't
|
||||||
|
guess part numbers; query the QuickSpecs bundle.
|
||||||
|
|
||||||
|
## Licensing and SKUs
|
||||||
|
|
||||||
|
**Source of truth: the `morpheus_quickspecs` bundle.** Query it for
|
||||||
|
the current SKU list — the catalog updates more often than this
|
||||||
|
file does.
|
||||||
|
|
||||||
|
Pricing model summary (from QuickSpecs v1, 2026):
|
||||||
|
|
||||||
|
- **Per physical CPU socket** for connected on-prem clouds —
|
||||||
|
KVM/HVM hosts, VMware ESXi hosts, bare metal servers, Kubernetes
|
||||||
|
worker nodes. Count the **sockets**, not the cores; not the VMs.
|
||||||
|
- **Public cloud workloads factor at 15:1** — one socket of license
|
||||||
|
covers up to 15 public-cloud workloads (instances) across AWS,
|
||||||
|
Azure, GCP, OCI.
|
||||||
|
- **Term-based** licensing (not perpetual). 1, 3, and 5-year terms
|
||||||
|
on E-LTU SKUs.
|
||||||
|
- **All include HPE Tech Care Essentials** (24×7 support, 15-minute
|
||||||
|
response for severity-1) bundled into the license cost.
|
||||||
|
|
||||||
|
> The exact ratios and SKU names can change between QuickSpecs
|
||||||
|
> revisions. Use the `morpheus_quickspecs` tool / bundle for current
|
||||||
|
> values rather than memorizing.
|
||||||
|
|
||||||
|
## Elevation from HVM
|
||||||
|
|
||||||
|
The "elevate to Morpheus Enterprise" path is the canonical journey
|
||||||
|
for customers who started on HVM and outgrew it:
|
||||||
|
|
||||||
|
- **HVM clusters keep working unchanged after elevation.** You
|
||||||
|
don't redeploy the manager; you upgrade-in-place using a
|
||||||
|
Morpheus Enterprise license.
|
||||||
|
- **What changes:** the manager UI unlocks the full Enterprise
|
||||||
|
feature set — public-cloud integrations, container/Kubernetes
|
||||||
|
management, blueprints/catalogs, automation workflows, policy
|
||||||
|
engine, FinOps cost dashboards, ITSM connectors (ServiceNow etc.),
|
||||||
|
and the full REST API surface.
|
||||||
|
- **Existing HVM-tier work products survive the elevation:**
|
||||||
|
Instance backups, network pools, storage providers, user
|
||||||
|
accounts, integrations, scheduled jobs, etc.
|
||||||
|
|
||||||
|
The HVM User Manual page `Elevating to HPE Morpheus Enterprise`
|
||||||
|
(GUID-ECCA4FDD-37C8-45CE-A71F-C6E73B3BA713) walks the procedure.
|
||||||
|
See also the HVM `morpheus-docs` sibling MCP's
|
||||||
|
`hvm_user_manual_8_1_*` bundles.
|
||||||
|
|
||||||
|
## API surface — Plugin vs REST
|
||||||
|
|
||||||
|
Morpheus exposes two completely separate extensibility surfaces:
|
||||||
|
|
||||||
|
- **REST API** at `https://<manager>/api/` — external automation
|
||||||
|
and integration. Bearer-token authentication; tokens issued from
|
||||||
|
the user profile → API tokens UI. Full Enterprise API surface
|
||||||
|
available (vs HVM-only managers which 404 on Enterprise-only
|
||||||
|
endpoints).
|
||||||
|
- **Plugin API** — server-side extensions that load INTO the
|
||||||
|
manager process. Versioned independently of the platform
|
||||||
|
(Plugin API version listed in the Release Notes for each
|
||||||
|
Morpheus version). A plugin built for Plugin API 1.3.x may not
|
||||||
|
load on 1.4.x without changes.
|
||||||
|
|
||||||
|
**TODO — fill in real operational lessons as we accumulate them.**
|
||||||
|
|
||||||
|
## Multi-cloud onboarding
|
||||||
|
|
||||||
|
**TODO.** Each cloud (AWS, Azure, GCP, OCI, VMware vSphere, KVM/HVM,
|
||||||
|
OpenStack, Nutanix, etc.) has its own onboarding ritual: credentials,
|
||||||
|
networking, IAM roles, regions, storage providers, image catalogs.
|
||||||
|
Search the User Manual: `search_docs(query="Add AWS cloud
|
||||||
|
integration")`, `search_docs(query="Azure subscription cost")`, etc.
|
||||||
|
|
||||||
|
## Tenancy, RBAC, and groups
|
||||||
|
|
||||||
|
**TODO.** Morpheus Enterprise tenancy is one of the more complex areas
|
||||||
|
— tenants, roles, groups, account groups, persona-based access.
|
||||||
|
Lessons specific to "what surprised me" go here.
|
||||||
|
|
||||||
|
## Backups
|
||||||
|
|
||||||
|
**TODO.** Morpheus Enterprise inherits the backup framework HVM
|
||||||
|
introduced (Storage Buckets, Execution Schedules, Backup Jobs)
|
||||||
|
and adds: cloud-native backup integrations (AWS Backup, Azure
|
||||||
|
Backup), per-instance backup policies via the policy engine,
|
||||||
|
ServiceNow-driven backup orchestration. Document the gotchas you
|
||||||
|
hit.
|
||||||
|
|
||||||
|
## Common operational gotchas
|
||||||
|
|
||||||
|
**TODO.** This is where the "experienced operator hallway
|
||||||
|
conversation" notes go. Examples to seed (delete or replace as you
|
||||||
|
learn):
|
||||||
|
|
||||||
|
- **Service plan vs Instance type** — same concept, different
|
||||||
|
contexts. A service plan is the sizing template ("small / medium
|
||||||
|
/ large with these CPU/RAM"); an instance type is what you
|
||||||
|
provision FROM the plan. Operators conflate them.
|
||||||
|
- **Cloud integration credentials are tenant-scoped, not
|
||||||
|
global.** Adding a credential at the master tenant doesn't
|
||||||
|
cascade — sub-tenants need their own (or the policy engine
|
||||||
|
granting access).
|
||||||
|
- **Policy engine vs Logic library** — both live under
|
||||||
|
Library/Automation, both can gate provisioning. Policies are
|
||||||
|
preventive (block bad config), logic is generative (run scripts
|
||||||
|
on lifecycle events). Pick the right tool.
|
||||||
|
|
||||||
|
## Adding to this doc
|
||||||
|
|
||||||
|
Two ways:
|
||||||
|
|
||||||
|
1. Manually edit `docs_mcp/api_lessons.md` in this repo and commit.
|
||||||
|
The next image build picks it up.
|
||||||
|
2. Use `submit_doc_bug` for upstream issues, and append the
|
||||||
|
takeaway here once the docs team responds.
|
||||||
|
|
||||||
|
The point of this doc is to surface the kind of context an
|
||||||
|
experienced operator would mention in a hallway conversation but
|
||||||
|
that doesn't quite fit anywhere in the formal product docs. Keep
|
||||||
|
sections tight — one H2 = one topic the LLM can return on demand.
|
||||||
+1077
-40
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,4 @@
|
|||||||
|
{"query": "what's the per-socket licensing model for Morpheus Enterprise", "expected": [{"bundle_id": "morpheus_quickspecs", "page_id": "a50009231enw"}], "tags": ["licensing", "skus"]}
|
||||||
|
{"query": "add an AWS cloud integration", "expected": [], "tags": ["cloud", "TODO-populate-after-first-scrape"]}
|
||||||
|
{"query": "Plugin API version compatibility", "expected": [], "tags": ["api", "TODO"]}
|
||||||
|
{"query": "Morpheus Enterprise 8.1.2 what's new", "expected": [{"bundle_id": "morpheus_release_notes_8_1_2", "page_id": "sd00007733en_us"}], "tags": ["release-notes"]}
|
||||||
+118
-28
@@ -10,7 +10,7 @@ to one entry; the highest-ranked chunk's position wins).
|
|||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
from typing import Protocol, Iterable
|
from typing import Iterable, Protocol
|
||||||
|
|
||||||
|
|
||||||
class Retriever(Protocol):
|
class Retriever(Protocol):
|
||||||
@@ -21,12 +21,17 @@ class Retriever(Protocol):
|
|||||||
...
|
...
|
||||||
|
|
||||||
|
|
||||||
def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
|
def _split_chunk_id(chunk_id: str) -> tuple[str, str, int]:
|
||||||
"""Take a stream of (bundle_id, page_id, chunk_ordinal) and return
|
"""`bundle::page::ordinal` -> (bundle, page, int(ordinal))."""
|
||||||
the first k unique pages in their first-seen order."""
|
bid, pid, ordinal = chunk_id.split("::")
|
||||||
|
return bid, pid, int(ordinal)
|
||||||
|
|
||||||
|
|
||||||
|
def _collapse_to_pages(chunk_ids: Iterable[str], k: int) -> list[tuple[str, str]]:
|
||||||
seen: set[tuple[str, str]] = set()
|
seen: set[tuple[str, str]] = set()
|
||||||
out: list[tuple[str, str]] = []
|
out: list[tuple[str, str]] = []
|
||||||
for bid, pid, _ord in chunk_ids:
|
for cid in chunk_ids:
|
||||||
|
bid, pid, _ord = _split_chunk_id(cid)
|
||||||
key = (bid, pid)
|
key = (bid, pid)
|
||||||
if key in seen:
|
if key in seen:
|
||||||
continue
|
continue
|
||||||
@@ -37,26 +42,111 @@ def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> lis
|
|||||||
return out
|
return out
|
||||||
|
|
||||||
|
|
||||||
# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
|
class DenseRetriever:
|
||||||
# in place. Each one is small (15-30 LOC). The eval harness imports
|
"""Chroma cosine search via the live embedding function."""
|
||||||
# from this module by class name.
|
name = "dense"
|
||||||
#
|
|
||||||
# class DenseRetriever:
|
def __init__(self, collection, pool: int = 50):
|
||||||
# name = "dense"
|
self.col = collection
|
||||||
# def __init__(self, collection): self.col = collection
|
self.pool = pool
|
||||||
# def retrieve(self, query, k=10): ...
|
|
||||||
#
|
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
|
||||||
# class RerankedRetriever:
|
res = self.col.query(query_texts=[query], n_results=self.pool)
|
||||||
# name = "dense+rerank"
|
ids = (res.get("ids") or [[]])[0]
|
||||||
# def __init__(self, collection, rerank_url, pool=200): ...
|
return _collapse_to_pages(ids, k)
|
||||||
# def retrieve(self, query, k=10): ...
|
|
||||||
#
|
|
||||||
# class BM25Retriever:
|
class BM25Retriever:
|
||||||
# name = "bm25"
|
"""SQLite FTS5 lexical search."""
|
||||||
# def __init__(self, bm25_index): ...
|
name = "bm25"
|
||||||
# def retrieve(self, query, k=10): ...
|
|
||||||
#
|
def __init__(self, bm25_index, pool: int = 200):
|
||||||
# class HybridRetriever:
|
self.bm = bm25_index
|
||||||
# name = "bm25+dense+rrf"
|
self.pool = pool
|
||||||
# def __init__(self, dense, bm25, k_rrf=60): ...
|
|
||||||
# def retrieve(self, query, k=10): ...
|
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
|
||||||
|
hits = self.bm.query(query, n=self.pool)
|
||||||
|
return _collapse_to_pages((cid for cid, _score in hits), k)
|
||||||
|
|
||||||
|
|
||||||
|
class HybridRetriever:
|
||||||
|
"""Reciprocal Rank Fusion of dense + BM25 rankings."""
|
||||||
|
name = "hybrid_rrf"
|
||||||
|
|
||||||
|
def __init__(self, dense: DenseRetriever, bm25: BM25Retriever, k_rrf: int = 60, pool: int = 100):
|
||||||
|
self.dense = dense
|
||||||
|
self.bm25 = bm25
|
||||||
|
self.k_rrf = k_rrf
|
||||||
|
self.pool = pool
|
||||||
|
|
||||||
|
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
|
||||||
|
dense_pages = self.dense.retrieve(query, k=self.pool)
|
||||||
|
bm25_pages = self.bm25.retrieve(query, k=self.pool)
|
||||||
|
scores: dict[tuple[str, str], float] = {}
|
||||||
|
for rank, page in enumerate(dense_pages, start=1):
|
||||||
|
scores[page] = scores.get(page, 0.0) + 1.0 / (self.k_rrf + rank)
|
||||||
|
for rank, page in enumerate(bm25_pages, start=1):
|
||||||
|
scores[page] = scores.get(page, 0.0) + 1.0 / (self.k_rrf + rank)
|
||||||
|
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
|
||||||
|
return [page for page, _s in ranked[:k]]
|
||||||
|
|
||||||
|
|
||||||
|
def _rerank_pool(rerank_url: str, query: str, ids_and_texts: list[tuple[str, str]],
|
||||||
|
timeout: float = 30.0) -> list[str] | None:
|
||||||
|
"""POST to /v1/rerank, return ids in reranked order. None on failure."""
|
||||||
|
if not ids_and_texts:
|
||||||
|
return []
|
||||||
|
import httpx
|
||||||
|
try:
|
||||||
|
with httpx.Client(timeout=timeout) as c:
|
||||||
|
r = c.post(f"{rerank_url}/v1/rerank", json={
|
||||||
|
"query": query,
|
||||||
|
"documents": [(t or "")[:2000] for _i, t in ids_and_texts],
|
||||||
|
"top_n": len(ids_and_texts),
|
||||||
|
})
|
||||||
|
r.raise_for_status()
|
||||||
|
results = r.json().get("results") or []
|
||||||
|
return [ids_and_texts[item["index"]][0] for item in results
|
||||||
|
if isinstance(item.get("index"), int)
|
||||||
|
and 0 <= item["index"] < len(ids_and_texts)]
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
class RerankedRetriever:
|
||||||
|
"""Pull a candidate pool via a base retriever, then cross-encoder re-rank."""
|
||||||
|
|
||||||
|
def __init__(self, base: Retriever, collection, rerank_url: str, name_suffix: str = "rerank",
|
||||||
|
pool: int = 50, timeout: float = 30.0):
|
||||||
|
self.base = base
|
||||||
|
self.col = collection
|
||||||
|
self.url = rerank_url
|
||||||
|
self.name = f"{base.name}+{name_suffix}"
|
||||||
|
self.pool = pool
|
||||||
|
self.timeout = timeout
|
||||||
|
|
||||||
|
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
|
||||||
|
# Base returns deduplicated page-level tuples; rerank needs CHUNK-level
|
||||||
|
# texts to be informative. Pull each page's chunk 0 text from Chroma.
|
||||||
|
pages = self.base.retrieve(query, k=self.pool)
|
||||||
|
if not pages:
|
||||||
|
return []
|
||||||
|
chunk_ids = [f"{bid}::{pid}::0" for bid, pid in pages]
|
||||||
|
g = self.col.get(ids=chunk_ids, include=["documents"])
|
||||||
|
by_id = dict(zip(g["ids"], g["documents"]))
|
||||||
|
ids_and_texts = [(cid, by_id.get(cid, "")) for cid in chunk_ids]
|
||||||
|
order = _rerank_pool(self.url, query, ids_and_texts, timeout=self.timeout)
|
||||||
|
if order is None:
|
||||||
|
return pages[:k]
|
||||||
|
out: list[tuple[str, str]] = []
|
||||||
|
seen: set[tuple[str, str]] = set()
|
||||||
|
for cid in order:
|
||||||
|
bid, pid, _ = cid.split("::")
|
||||||
|
key = (bid, pid)
|
||||||
|
if key in seen:
|
||||||
|
continue
|
||||||
|
seen.add(key)
|
||||||
|
out.append(key)
|
||||||
|
if len(out) >= k:
|
||||||
|
break
|
||||||
|
return out
|
||||||
|
|||||||
+81
-9
@@ -76,15 +76,87 @@ def main() -> int:
|
|||||||
queries = load_queries(args.queries)
|
queries = load_queries(args.queries)
|
||||||
print(f"loaded {len(queries)} queries")
|
print(f"loaded {len(queries)} queries")
|
||||||
|
|
||||||
# TODO Phase 7: instantiate the retrievers you implemented in
|
import os
|
||||||
# eval/retrievers.py and run each one against each query.
|
import chromadb
|
||||||
# Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
|
from chromadb.config import Settings
|
||||||
# markdown table to args.output. Commit the file alongside the
|
from rag.embeddings import embedding_function
|
||||||
# PR that changes retrieval.
|
from rag.bm25 import BM25Index
|
||||||
raise NotImplementedError(
|
from eval.retrievers import DenseRetriever, BM25Retriever, HybridRetriever
|
||||||
"Wire up the retrievers in eval/retrievers.py first, then "
|
|
||||||
"fill in this evaluation loop. See PLAN.md Phase 7."
|
product = os.environ.get("PRODUCT_NAME", "hvm")
|
||||||
)
|
repo_root = Path(__file__).resolve().parent.parent
|
||||||
|
client = chromadb.PersistentClient(path=str(repo_root / "chroma"),
|
||||||
|
settings=Settings(anonymized_telemetry=False))
|
||||||
|
col = client.get_collection(f"{product}_docs", embedding_function=embedding_function())
|
||||||
|
bm = BM25Index(str(repo_root / "bm25" / f"{product}_docs.db"))
|
||||||
|
|
||||||
|
from eval.retrievers import RerankedRetriever
|
||||||
|
|
||||||
|
dense = DenseRetriever(col)
|
||||||
|
bm25 = BM25Retriever(bm)
|
||||||
|
hybrid = HybridRetriever(DenseRetriever(col, pool=100), BM25Retriever(bm, pool=100))
|
||||||
|
|
||||||
|
retrievers = [dense, bm25, hybrid]
|
||||||
|
|
||||||
|
rerank_url = os.environ.get("RERANK_URL", "").rstrip("/")
|
||||||
|
if rerank_url:
|
||||||
|
retrievers += [
|
||||||
|
RerankedRetriever(bm25, col, rerank_url, name_suffix="rerank", pool=50),
|
||||||
|
RerankedRetriever(hybrid, col, rerank_url, name_suffix="rerank", pool=50),
|
||||||
|
]
|
||||||
|
print(f"reranker enabled: {rerank_url}")
|
||||||
|
|
||||||
|
rows: dict[str, dict[str, float]] = {}
|
||||||
|
per_query: list[dict] = []
|
||||||
|
for r in retrievers:
|
||||||
|
mrr_sum = recall_sum = ndcg_sum = 0.0
|
||||||
|
elapsed_sum = 0.0
|
||||||
|
for q in queries:
|
||||||
|
expected = [(e["bundle_id"], e["page_id"]) for e in q["expected"]]
|
||||||
|
t0 = time.time()
|
||||||
|
retrieved = r.retrieve(q["query"], k=max(args.k, 10))
|
||||||
|
elapsed = time.time() - t0
|
||||||
|
mrr = reciprocal_rank(retrieved, expected)
|
||||||
|
recall = recall_at_k(retrieved, expected, args.k)
|
||||||
|
ndcg = ndcg_at_k(retrieved, expected, args.k)
|
||||||
|
mrr_sum += mrr
|
||||||
|
recall_sum += recall
|
||||||
|
ndcg_sum += ndcg
|
||||||
|
elapsed_sum += elapsed
|
||||||
|
per_query.append({
|
||||||
|
"retriever": r.name, "query": q["query"],
|
||||||
|
"mrr": mrr, "recall@k": recall, "ndcg@k": ndcg,
|
||||||
|
"top1": list(retrieved[0]) if retrieved else None,
|
||||||
|
"elapsed_s": round(elapsed, 3),
|
||||||
|
})
|
||||||
|
n = len(queries)
|
||||||
|
rows[r.name] = {
|
||||||
|
"MRR": mrr_sum / n,
|
||||||
|
f"Recall@{args.k}": recall_sum / n,
|
||||||
|
f"nDCG@{args.k}": ndcg_sum / n,
|
||||||
|
"avg_latency_s": elapsed_sum / n,
|
||||||
|
}
|
||||||
|
print(f" {r.name}: MRR={rows[r.name]['MRR']:.3f} "
|
||||||
|
f"Recall@{args.k}={rows[r.name][f'Recall@{args.k}']:.3f} "
|
||||||
|
f"nDCG@{args.k}={rows[r.name][f'nDCG@{args.k}']:.3f} "
|
||||||
|
f"avg={rows[r.name]['avg_latency_s']*1000:.0f}ms")
|
||||||
|
|
||||||
|
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
md = [f"# Retrieval eval — k={args.k}", "",
|
||||||
|
f"_{len(queries)} hand-curated queries, generated {time.strftime('%Y-%m-%d %H:%M:%S')}_", "",
|
||||||
|
"| Retriever | MRR | Recall@{k} | nDCG@{k} | avg latency |".replace("{k}", str(args.k)),
|
||||||
|
"| --- | ---: | ---: | ---: | ---: |"]
|
||||||
|
for name, m in rows.items():
|
||||||
|
md.append(f"| `{name}` | {m['MRR']:.3f} | {m[f'Recall@{args.k}']:.3f} "
|
||||||
|
f"| {m[f'nDCG@{args.k}']:.3f} | {m['avg_latency_s']*1000:.0f}ms |")
|
||||||
|
md += ["", "## Per-query results", "",
|
||||||
|
"| Retriever | Query | MRR | top-1 |", "| --- | --- | ---: | --- |"]
|
||||||
|
for r in per_query:
|
||||||
|
top1 = f"`{r['top1'][0]}/{r['top1'][1][:24]}...`" if r["top1"] else "—"
|
||||||
|
md.append(f"| `{r['retriever']}` | {r['query'][:60]} | {r['mrr']:.3f} | {top1} |")
|
||||||
|
args.output.write_text("\n".join(md) + "\n")
|
||||||
|
print(f"wrote {args.output}")
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||
+39
-11
@@ -31,6 +31,31 @@ from typing import Iterator
|
|||||||
CHARS_PER_TOKEN = 4
|
CHARS_PER_TOKEN = 4
|
||||||
TARGET_TOKENS = 500
|
TARGET_TOKENS = 500
|
||||||
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
|
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
|
||||||
|
# Hard cap: nomic-embed-text's context is 2048 tokens. Anything larger
|
||||||
|
# 400s the entire embed batch. 6000 chars works for prose but markdown
|
||||||
|
# tables with lots of `|` separators tokenize ~1.4× denser; a 5839-char
|
||||||
|
# table chunk from the HVM qualification matrix tokenized past 2048 and
|
||||||
|
# crashed the rebuild. 4000 chars stays under 2048 tokens even for
|
||||||
|
# dense table content while leaving headroom for the query side.
|
||||||
|
MAX_CHARS = 4000
|
||||||
|
|
||||||
|
|
||||||
|
def _hard_split(text: str) -> list[str]:
|
||||||
|
"""Split an oversized block on line boundaries into MAX_CHARS pieces."""
|
||||||
|
if len(text) <= MAX_CHARS:
|
||||||
|
return [text]
|
||||||
|
out: list[str] = []
|
||||||
|
buf: list[str] = []
|
||||||
|
buf_chars = 0
|
||||||
|
for line in text.splitlines(keepends=True):
|
||||||
|
if buf_chars + len(line) > MAX_CHARS and buf:
|
||||||
|
out.append("".join(buf).rstrip())
|
||||||
|
buf, buf_chars = [], 0
|
||||||
|
buf.append(line)
|
||||||
|
buf_chars += len(line)
|
||||||
|
if buf:
|
||||||
|
out.append("".join(buf).rstrip())
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
def estimate_tokens(text: str) -> int:
|
def estimate_tokens(text: str) -> int:
|
||||||
@@ -104,23 +129,26 @@ def chunks_from_page(
|
|||||||
|
|
||||||
# ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
|
# ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
|
||||||
ordinal = 1
|
ordinal = 1
|
||||||
|
|
||||||
|
def emit(buf: list[str]) -> Iterator[dict]:
|
||||||
|
nonlocal ordinal
|
||||||
|
merged = "\n\n".join(buf)
|
||||||
|
for piece in _hard_split(merged):
|
||||||
|
yield {
|
||||||
|
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
||||||
|
"text": piece,
|
||||||
|
"metadata": {**metadata, "ordinal": ordinal},
|
||||||
|
}
|
||||||
|
ordinal += 1
|
||||||
|
|
||||||
buf: list[str] = []
|
buf: list[str] = []
|
||||||
buf_chars = 0
|
buf_chars = 0
|
||||||
for p in paragraphs:
|
for p in paragraphs:
|
||||||
if buf_chars + len(p) > TARGET_CHARS and buf:
|
if buf_chars + len(p) > TARGET_CHARS and buf:
|
||||||
yield {
|
yield from emit(buf)
|
||||||
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
|
||||||
"text": "\n\n".join(buf),
|
|
||||||
"metadata": {**metadata, "ordinal": ordinal},
|
|
||||||
}
|
|
||||||
ordinal += 1
|
|
||||||
buf = []
|
buf = []
|
||||||
buf_chars = 0
|
buf_chars = 0
|
||||||
buf.append(p)
|
buf.append(p)
|
||||||
buf_chars += len(p)
|
buf_chars += len(p)
|
||||||
if buf:
|
if buf:
|
||||||
yield {
|
yield from emit(buf)
|
||||||
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
|
||||||
"text": "\n\n".join(buf),
|
|
||||||
"metadata": {**metadata, "ordinal": ordinal},
|
|
||||||
}
|
|
||||||
|
|||||||
+21
-4
@@ -3,8 +3,15 @@
|
|||||||
Swappable: implement the same `embedding_function()` interface returning
|
Swappable: implement the same `embedding_function()` interface returning
|
||||||
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
|
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
|
||||||
|
|
||||||
Defaults (override via env):
|
Env-configurable (matches the zerto-docs-rag pattern so the same Gitea
|
||||||
OLLAMA_URL one or more comma-separated URLs (load-balanced)
|
runner + GPU-pinned Ollama containers can serve every docs MCP build):
|
||||||
|
|
||||||
|
OLLAMA_URLS comma-separated list, load-balanced round-robin per batch.
|
||||||
|
Preferred — set in the CI workflow to fan out across two
|
||||||
|
GPU-pinned Ollama containers on the Gitea host.
|
||||||
|
OLLAMA_URL single endpoint, fallback when OLLAMA_URLS is unset.
|
||||||
|
Default http://192.168.0.2:11434 (the host where the GPUs
|
||||||
|
live in Justin's lab).
|
||||||
EMBED_MODEL model name; default 'nomic-embed-text'
|
EMBED_MODEL model name; default 'nomic-embed-text'
|
||||||
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
|
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
|
||||||
"""
|
"""
|
||||||
@@ -19,8 +26,18 @@ from chromadb import EmbeddingFunction, Documents, Embeddings
|
|||||||
|
|
||||||
log = logging.getLogger(__name__)
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
|
DEFAULT_OLLAMA_URL = "http://192.168.0.2:11434"
|
||||||
"http://localhost:11434").split(",") if u.strip()]
|
|
||||||
|
|
||||||
|
def _resolve_urls() -> list[str]:
|
||||||
|
raw = os.environ.get("OLLAMA_URLS", "").strip()
|
||||||
|
if raw:
|
||||||
|
return [u.strip().rstrip("/") for u in raw.split(",") if u.strip()]
|
||||||
|
single = os.environ.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).strip().rstrip("/")
|
||||||
|
return [single]
|
||||||
|
|
||||||
|
|
||||||
|
OLLAMA_URLS = _resolve_urls()
|
||||||
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
|
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
|
||||||
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
|
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
|
||||||
|
|
||||||
|
|||||||
+1
-1
@@ -29,7 +29,7 @@ CHROMA_DIR = ROOT / "chroma"
|
|||||||
|
|
||||||
# Collection name — convention: <product>_docs. Override via env if needed.
|
# Collection name — convention: <product>_docs. Override via env if needed.
|
||||||
import os
|
import os
|
||||||
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
|
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "morpheus")
|
||||||
COLLECTION = f"{PRODUCT_NAME}_docs"
|
COLLECTION = f"{PRODUCT_NAME}_docs"
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,10 @@
|
|||||||
|
# Dev/CPU reranker — only for running scripts/rerank_server.py locally.
|
||||||
|
# Production uses the llama.cpp + jina-reranker GGUF sidecar (see
|
||||||
|
# deploy/docker-compose.yml). Install with:
|
||||||
|
#
|
||||||
|
# pip install -r requirements-rerank.txt
|
||||||
|
#
|
||||||
|
# This adds PyTorch (~2 GB) and the sentence-transformers cross-encoder
|
||||||
|
# (cross-encoder/ms-marco-MiniLM-L-6-v2, ~22 MB). Keep out of the main
|
||||||
|
# requirements.txt so the production image stays slim.
|
||||||
|
sentence-transformers>=3.0
|
||||||
@@ -10,10 +10,18 @@ ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not
|
|||||||
# Scraping (Phase 1; adjust per product)
|
# Scraping (Phase 1; adjust per product)
|
||||||
beautifulsoup4>=4.12
|
beautifulsoup4>=4.12
|
||||||
requests>=2.31
|
requests>=2.31
|
||||||
|
curl_cffi>=0.7 # for HPE QuickSpecs scrape (Chrome TLS impersonation)
|
||||||
|
markdownify>=0.11
|
||||||
# playwright>=1.40 # uncomment if you need headless browser fallback
|
# playwright>=1.40 # uncomment if you need headless browser fallback
|
||||||
|
|
||||||
# Evaluation
|
# Evaluation
|
||||||
numpy>=1.26
|
numpy>=1.26
|
||||||
|
|
||||||
|
# Reranker is a sidecar (see deploy/docker-compose.yml). The MCP server
|
||||||
|
# only needs httpx (declared above) to call it. For the dev / CPU
|
||||||
|
# fallback reranker (scripts/rerank_server.py), install
|
||||||
|
# requirements-rerank.txt separately — it pulls in PyTorch which would
|
||||||
|
# triple the production image size.
|
||||||
|
|
||||||
# Dev / utility
|
# Dev / utility
|
||||||
python-dateutil>=2.8
|
python-dateutil>=2.8
|
||||||
|
|||||||
@@ -7,6 +7,72 @@ the upstream doc portal.
|
|||||||
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
|
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
|
||||||
expects.
|
expects.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Product context — HPE Morpheus Enterprise Software
|
||||||
|
|
||||||
|
**This repo is for HPE Morpheus Enterprise**, the full cloud-management
|
||||||
|
platform. It is a **different SKU** from HPE Morpheus VM Essentials
|
||||||
|
(HVM), which has its own MCP at `../hvm-docs/`. Don't ingest HVM
|
||||||
|
docs here; they're a separate, smaller product (the "VM-only" subset
|
||||||
|
of Morpheus). The Morpheus VM Essentials Deployment Guide refers to
|
||||||
|
Morpheus Enterprise as the "elevate to" target — that's the
|
||||||
|
relationship.
|
||||||
|
|
||||||
|
`PRODUCT_NAME=morpheus`. Tool will be named `morpheus_api_lessons`,
|
||||||
|
collection `morpheus_docs`, etc.
|
||||||
|
|
||||||
|
### Upstream portal
|
||||||
|
|
||||||
|
HPE Support DocPortal (Tridion/SDL-derived, same surface as HVM and
|
||||||
|
the Zerto docs). Anonymous JSON API, no auth required.
|
||||||
|
|
||||||
|
| Endpoint | Returns |
|
||||||
|
|---|---|
|
||||||
|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}` | DITA-source HTML — title page / abstract OR (for short docs like Release Notes) the entire body |
|
||||||
|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/toc` | Nested JSON tree of `{topicName, topicLink, description, children}`. Empty/404 for single-doc Release Notes. |
|
||||||
|
| `GET https://support.hpe.com/hpesc/public/api/document/{docId}/render?page=GUID-XXXX.html` | `{docId, page_html, doc_meta, page_meta}` — single page body |
|
||||||
|
|
||||||
|
User-facing URL format:
|
||||||
|
`https://support.hpe.com/hpesc/public/docDisplay?docId={docId}&page=GUID-XXXX.html`
|
||||||
|
|
||||||
|
### Bundle IDs (confirmed 2026-05-22)
|
||||||
|
|
||||||
|
**Morpheus Enterprise User Manual** — ~569 pages each, full nested TOC:
|
||||||
|
|
||||||
|
| Version | docId |
|
||||||
|
|---|---|
|
||||||
|
| 8.1.0 | `sd00007510en_us` |
|
||||||
|
| 8.1.1 | `sd00007621en_us` |
|
||||||
|
| 8.1.2 | `sd00007732en_us` |
|
||||||
|
|
||||||
|
**Morpheus Enterprise Release Notes** — short, single-doc-blob shape
|
||||||
|
(no TOC; full body returned by the `/document/{docId}` endpoint
|
||||||
|
itself; scraper needs a `--single-doc` mode for these):
|
||||||
|
|
||||||
|
| Version | docId |
|
||||||
|
|---|---|
|
||||||
|
| 8.1.0 | `sd00007496en_us` |
|
||||||
|
| 8.1.1 | `sd00007610en_us` |
|
||||||
|
| 8.1.2 | `sd00007733en_us` |
|
||||||
|
|
||||||
|
### Cross-version peers are free
|
||||||
|
|
||||||
|
GUIDs are stable across versions (confirmed on HVM where 374/376/376
|
||||||
|
pages had 100% GUID overlap between adjacent versions). Same-GUID =
|
||||||
|
same-topic. Synthesize `topic_cluster.clustered_topics` by looking
|
||||||
|
up the same GUID in the other bundle slugs — no fuzzy matching
|
||||||
|
needed.
|
||||||
|
|
||||||
|
### Reusable from hvm-docs
|
||||||
|
|
||||||
|
`../hvm-docs/scrape/bundles.py` and `../hvm-docs/scrape/runner.py`
|
||||||
|
solve the identical portal shape. Copy and adapt the BUNDLES list +
|
||||||
|
PRODUCT_NAME; the fetch logic should drop in unchanged. Both the
|
||||||
|
TOC-paginated path and the single-doc path are needed (the HVM
|
||||||
|
build covers both because HVM Release Notes follow the same shape).
|
||||||
|
|
||||||
|
|
||||||
## What you write
|
## What you write
|
||||||
|
|
||||||
At minimum, two scripts:
|
At minimum, two scripts:
|
||||||
|
|||||||
@@ -0,0 +1,200 @@
|
|||||||
|
"""Discover Morpheus Enterprise doc bundles on HPE Support DocPortal and write bundles.json.
|
||||||
|
|
||||||
|
Mirrors hvm-docs/scrape/bundles.py — same portal, same API shape, same single-doc-blob
|
||||||
|
treatment for Release Notes, but pointing at the Morpheus Enterprise docId range.
|
||||||
|
|
||||||
|
For each bundle this script:
|
||||||
|
1. GETs /hpesc/public/api/document/{docId} → abstract HTML
|
||||||
|
2. GETs /hpesc/public/api/document/{docId}/toc → page tree (or 404 for single-doc)
|
||||||
|
3. Writes bundles.json at repo root with the schema PLAN.md Phase 1 documents.
|
||||||
|
|
||||||
|
QuickSpecs is a special case: lives at www.hpe.com (not support.hpe.com), gets the
|
||||||
|
html-file mode and is scraped via curl_cffi (see scrape/quickspecs.py).
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import requests
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
API = "https://support.hpe.com/hpesc/public/api/document"
|
||||||
|
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
|
||||||
|
UA = "morpheus-docs-mcp/0.1 (+https://git.jpaul.io/justin/morpheus-docs; admin@jpaul.io)"
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
BUNDLES_JSON = ROOT / "bundles.json"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class BundleSpec:
|
||||||
|
slug: str
|
||||||
|
doc_id: str
|
||||||
|
title: str
|
||||||
|
version: str | None
|
||||||
|
product: str # e.g. "User Manual", "Release Notes", "QuickSpecs"
|
||||||
|
mode: str # "toc", "single", or "html-file"
|
||||||
|
platform: str | None = None
|
||||||
|
language: str = "en-US"
|
||||||
|
source_url: str | None = None # overrides the default support.hpe.com URL
|
||||||
|
|
||||||
|
|
||||||
|
# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
|
||||||
|
# range sd00006500..7740 for `Morpheus Enterprise` matches in the abstract.
|
||||||
|
#
|
||||||
|
# Notes:
|
||||||
|
# - Morpheus Enterprise has User Manuals dating back to 8.0.10
|
||||||
|
# (sd00006774en_us, Sep 2025) but we only ship the 8.1.x line for
|
||||||
|
# now. Add the 8.0.x bundles here if you need older versions in the
|
||||||
|
# corpus.
|
||||||
|
# - No dedicated Deployment Guide or Qualification Matrix for Morpheus
|
||||||
|
# Enterprise on HPE Support — the only QM (sd00006551en_us) covers
|
||||||
|
# HVM clusters managed by Morpheus, which lives in hvm-docs.
|
||||||
|
# - QuickSpecs lives on www.hpe.com (not support.hpe.com), uses the
|
||||||
|
# html-file scrape mode with curl_cffi Chrome impersonation.
|
||||||
|
BUNDLES: list[BundleSpec] = [
|
||||||
|
BundleSpec("morpheus_user_manual_8_1_0", "sd00007510en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.0", "User Manual", "toc"),
|
||||||
|
BundleSpec("morpheus_user_manual_8_1_1", "sd00007621en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.1", "User Manual", "toc"),
|
||||||
|
BundleSpec("morpheus_user_manual_8_1_2", "sd00007732en_us", "HPE Morpheus Enterprise Software Documentation", "8.1.2", "User Manual", "toc"),
|
||||||
|
BundleSpec("morpheus_release_notes_8_1_0", "sd00007496en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.0", "Release Notes", "single"),
|
||||||
|
BundleSpec("morpheus_release_notes_8_1_1", "sd00007610en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.1", "Release Notes", "single"),
|
||||||
|
BundleSpec("morpheus_release_notes_8_1_2", "sd00007733en_us", "HPE Morpheus Enterprise Software Release Notes", "8.1.2", "Release Notes", "single"),
|
||||||
|
BundleSpec("morpheus_quickspecs", "a50009231enw", "HPE Morpheus Enterprise Software QuickSpecs",
|
||||||
|
"v1", "QuickSpecs", "html-file",
|
||||||
|
source_url="https://www.hpe.com/psnow/doc/a50009231enw"),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _session() -> requests.Session:
|
||||||
|
s = requests.Session()
|
||||||
|
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
|
||||||
|
return s
|
||||||
|
|
||||||
|
|
||||||
|
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
|
||||||
|
delay = 1.0
|
||||||
|
for attempt in range(retries):
|
||||||
|
r = s.get(url, timeout=30)
|
||||||
|
if r.status_code == 200:
|
||||||
|
return r.json() if expect_json else r.text
|
||||||
|
if r.status_code == 404:
|
||||||
|
return None
|
||||||
|
if r.status_code in (429, 500, 502, 503, 504):
|
||||||
|
time.sleep(delay)
|
||||||
|
delay *= 2
|
||||||
|
continue
|
||||||
|
r.raise_for_status()
|
||||||
|
raise RuntimeError(f"GET failed after {retries} retries: {url}")
|
||||||
|
|
||||||
|
|
||||||
|
def _count_toc(toc: list[dict] | None) -> tuple[int, str | None]:
|
||||||
|
if not toc:
|
||||||
|
return 0, None
|
||||||
|
landing = None
|
||||||
|
n = 0
|
||||||
|
|
||||||
|
def walk(nodes: list[dict] | None, depth: int) -> None:
|
||||||
|
nonlocal n, landing
|
||||||
|
for node in nodes or []:
|
||||||
|
link = node.get("topicLink")
|
||||||
|
if link:
|
||||||
|
n += 1
|
||||||
|
m = re.search(r"page=(GUID-[A-F0-9-]+)\.html", link)
|
||||||
|
if m and landing is None:
|
||||||
|
landing = m.group(1)
|
||||||
|
walk(node.get("children"), depth + 1)
|
||||||
|
|
||||||
|
walk(toc, 0)
|
||||||
|
return n, landing
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_abstract(html: str) -> dict[str, str]:
|
||||||
|
soup = BeautifulSoup(html, "html.parser")
|
||||||
|
out: dict[str, str] = {}
|
||||||
|
h1 = soup.select_one("h1.title.topictitle1")
|
||||||
|
if h1:
|
||||||
|
out["title"] = h1.get_text(" ", strip=True)
|
||||||
|
desc = soup.select_one("div.desc")
|
||||||
|
if desc:
|
||||||
|
out["abstract"] = desc.get_text(" ", strip=True)
|
||||||
|
pub = soup.select_one("div.publishedDate")
|
||||||
|
if pub:
|
||||||
|
out["published"] = pub.get_text(" ", strip=True).replace("Published:", "").strip()
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
|
||||||
|
# html-file bundles are static fixtures or live-fetched outside support.hpe.com.
|
||||||
|
if spec.mode == "html-file":
|
||||||
|
return {
|
||||||
|
"slug": spec.slug,
|
||||||
|
"doc_id": spec.doc_id,
|
||||||
|
"title": spec.title,
|
||||||
|
"version": spec.version,
|
||||||
|
"platform": spec.platform,
|
||||||
|
"product": spec.product,
|
||||||
|
"language": spec.language,
|
||||||
|
"page_count": 1,
|
||||||
|
"mode": "html-file",
|
||||||
|
"abstract": "",
|
||||||
|
"dates": {},
|
||||||
|
"landing_page": spec.doc_id,
|
||||||
|
"source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
|
||||||
|
}
|
||||||
|
|
||||||
|
abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
|
||||||
|
meta = _parse_abstract(abstract_html or "")
|
||||||
|
|
||||||
|
page_count: int
|
||||||
|
landing: str | None
|
||||||
|
if spec.mode == "toc":
|
||||||
|
toc = _get(s, f"{API}/{spec.doc_id}/toc", expect_json=True)
|
||||||
|
page_count, landing = _count_toc(toc)
|
||||||
|
if page_count == 0:
|
||||||
|
print(f" ! {spec.slug}: TOC empty — falling back to single-doc mode", file=sys.stderr)
|
||||||
|
spec.mode = "single"
|
||||||
|
page_count, landing = 1, spec.doc_id
|
||||||
|
else:
|
||||||
|
page_count, landing = 1, spec.doc_id
|
||||||
|
|
||||||
|
return {
|
||||||
|
"slug": spec.slug,
|
||||||
|
"doc_id": spec.doc_id,
|
||||||
|
"title": meta.get("title") or spec.title,
|
||||||
|
"version": spec.version,
|
||||||
|
"platform": spec.platform,
|
||||||
|
"product": spec.product,
|
||||||
|
"language": spec.language,
|
||||||
|
"page_count": page_count,
|
||||||
|
"mode": spec.mode,
|
||||||
|
"abstract": meta.get("abstract", ""),
|
||||||
|
"dates": {"Published": meta.get("published", "")},
|
||||||
|
"landing_page": landing,
|
||||||
|
"source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
p = argparse.ArgumentParser(description="Build bundles.json from BUNDLES list.")
|
||||||
|
p.add_argument("--out", default=str(BUNDLES_JSON))
|
||||||
|
args = p.parse_args()
|
||||||
|
|
||||||
|
s = _session()
|
||||||
|
out: list[dict[str, Any]] = []
|
||||||
|
for spec in BUNDLES:
|
||||||
|
print(f" • {spec.slug} ({spec.doc_id}) ...", file=sys.stderr)
|
||||||
|
out.append(discover_bundle(s, spec))
|
||||||
|
|
||||||
|
Path(args.out).write_text(json.dumps(out, indent=2) + "\n")
|
||||||
|
print(f"wrote {args.out}: {len(out)} bundles, {sum(b['page_count'] for b in out)} pages total", file=sys.stderr)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
@@ -0,0 +1,194 @@
|
|||||||
|
"""Scrape HPE QuickSpecs collateral pages into corpus markdown.
|
||||||
|
|
||||||
|
HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
|
||||||
|
with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
|
||||||
|
captured DOM). The blocker for automated scraping is `www.hpe.com`'s
|
||||||
|
edge bot defense, which drops connections from non-browser TLS
|
||||||
|
fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
|
||||||
|
by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.
|
||||||
|
|
||||||
|
Content extraction uses these stable CSS selectors found in the page:
|
||||||
|
|
||||||
|
.lr-right-rail hpe-highlights-container .collateral-content
|
||||||
|
— one per section ("Overview", "Standard Features", etc.)
|
||||||
|
h3.txto-title — section title
|
||||||
|
div.txto-description — section body
|
||||||
|
uc-table.uc-table-polaris — SKU / version-history tables
|
||||||
|
|
||||||
|
A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
|
||||||
|
as a fallback when the live fetch fails (HPE edge churn, network
|
||||||
|
issues). Keeping a current fixture in the repo also makes diffing
|
||||||
|
QuickSpecs revisions easy.
|
||||||
|
|
||||||
|
Usage (called by scrape.runner for bundles with mode="quickspecs"):
|
||||||
|
|
||||||
|
python -m scrape.quickspecs a50004260enw
|
||||||
|
|
||||||
|
Or programmatically:
|
||||||
|
|
||||||
|
from scrape.quickspecs import scrape_quickspecs
|
||||||
|
scrape_quickspecs("a50004260enw", bundle_id="hvm_quickspecs", title="...")
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from bs4 import BeautifulSoup, NavigableString
|
||||||
|
from markdownify import markdownify as md
|
||||||
|
|
||||||
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
SOURCE_DIR = ROOT / "scrape" / "quickspecs"
|
||||||
|
CORPUS_DIR = ROOT / "corpus"
|
||||||
|
|
||||||
|
COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
|
||||||
|
"""GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
|
||||||
|
Returns the HTML body on success, None on any failure."""
|
||||||
|
try:
|
||||||
|
from curl_cffi import requests as cc
|
||||||
|
except ImportError:
|
||||||
|
log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
|
||||||
|
impersonate="chrome120", timeout=timeout)
|
||||||
|
if r.status_code != 200 or not r.text:
|
||||||
|
log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
|
||||||
|
return None
|
||||||
|
return r.text
|
||||||
|
except Exception as e:
|
||||||
|
log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_fixture(doc_id: str) -> str | None:
|
||||||
|
"""Read the committed HTML fixture as fallback."""
|
||||||
|
p = SOURCE_DIR / f"{doc_id}.html"
|
||||||
|
if not p.exists():
|
||||||
|
return None
|
||||||
|
return p.read_text()
|
||||||
|
|
||||||
|
|
||||||
|
def _extract_content_blocks(html: str) -> list[str]:
|
||||||
|
"""Pull each section block (.collateral-content under .lr-right-rail).
|
||||||
|
|
||||||
|
The fixture format (just .quickspecs-content wrapper) and the live
|
||||||
|
format (.lr-right-rail with nested hpe-highlights-container) are
|
||||||
|
both supported. Returns a list of section HTML strings, in document
|
||||||
|
order.
|
||||||
|
"""
|
||||||
|
soup = BeautifulSoup(html, "html.parser")
|
||||||
|
# Live format: each <hpe-highlights-container> under .lr-right-rail has
|
||||||
|
# one or more .collateral-content blocks; concat them.
|
||||||
|
rail = soup.select_one(".lr-right-rail")
|
||||||
|
if rail is not None:
|
||||||
|
blocks = rail.select(".collateral-content")
|
||||||
|
return [str(b) for b in blocks]
|
||||||
|
# Fixture format: a single wrapper holding all the H2/H3 sections.
|
||||||
|
wrapper = soup.select_one(".quickspecs-content")
|
||||||
|
if wrapper is not None:
|
||||||
|
return [str(wrapper)]
|
||||||
|
# Last-resort: whole body.
|
||||||
|
body = soup.body or soup
|
||||||
|
return [str(body)]
|
||||||
|
|
||||||
|
|
||||||
|
def parse_html(html: str) -> str:
|
||||||
|
"""Convert QuickSpecs HTML to clean markdown.
|
||||||
|
|
||||||
|
Filters out the page chrome (nav, footer, recommendations carousel,
|
||||||
|
cookie banner, analytics blobs) by extracting only the content
|
||||||
|
blocks, then runs markdownify."""
|
||||||
|
blocks = _extract_content_blocks(html)
|
||||||
|
chunks: list[str] = []
|
||||||
|
for block in blocks:
|
||||||
|
soup = BeautifulSoup(block, "html.parser")
|
||||||
|
# Drop anchor placeholders that markdownify turns into noisy links
|
||||||
|
for a in soup.select('[hpe-left-rail-anchor]'):
|
||||||
|
a.decompose()
|
||||||
|
# Drop carousel / share / recommendation widgets if any leaked in.
|
||||||
|
for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
|
||||||
|
"esl-scrollbar", "esl-trigger", "video-overlay",
|
||||||
|
"generic-modal-loader", "style", "script"):
|
||||||
|
for el in soup.select(sel):
|
||||||
|
el.decompose()
|
||||||
|
chunks.append(md(str(soup), heading_style="ATX", bullets="-",
|
||||||
|
strip=["span", "div"]))
|
||||||
|
text = "\n\n".join(chunks)
|
||||||
|
# Collapse runs of blank lines markdownify likes to emit.
|
||||||
|
text = "\n".join(line.rstrip() for line in text.splitlines())
|
||||||
|
while "\n\n\n" in text:
|
||||||
|
text = text.replace("\n\n\n", "\n\n")
|
||||||
|
return text.strip() + "\n"
|
||||||
|
|
||||||
|
|
||||||
|
def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
|
||||||
|
version: str | None = None,
|
||||||
|
product: str = "QuickSpecs",
|
||||||
|
source_url: str | None = None,
|
||||||
|
force: bool = False) -> bool:
|
||||||
|
"""Live-fetch (or fall back to fixture), parse, write corpus files.
|
||||||
|
|
||||||
|
Returns True if files were written, False if skipped (already exists
|
||||||
|
and --force not set)."""
|
||||||
|
bundle_dir = CORPUS_DIR / bundle_id
|
||||||
|
md_path = bundle_dir / f"{doc_id}.md"
|
||||||
|
json_path = bundle_dir / f"{doc_id}.json"
|
||||||
|
if not force and md_path.exists() and json_path.exists():
|
||||||
|
log.info(" %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
|
||||||
|
return False
|
||||||
|
|
||||||
|
html = fetch_live(doc_id)
|
||||||
|
fetched_from = "live"
|
||||||
|
if html is None:
|
||||||
|
html = fetch_fixture(doc_id)
|
||||||
|
fetched_from = "fixture"
|
||||||
|
if html is None:
|
||||||
|
log.error("QuickSpecs %s: no live response and no fixture at %s",
|
||||||
|
doc_id, SOURCE_DIR / f"{doc_id}.html")
|
||||||
|
return False
|
||||||
|
|
||||||
|
body_md = parse_html(html)
|
||||||
|
bundle_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
md_path.write_text(body_md)
|
||||||
|
sidecar = {
|
||||||
|
"bundle_id": bundle_id,
|
||||||
|
"page_id": doc_id,
|
||||||
|
"title": title,
|
||||||
|
"ordinal": 1,
|
||||||
|
"parent_title": None,
|
||||||
|
"doc_id": doc_id,
|
||||||
|
"version": version,
|
||||||
|
"product": product,
|
||||||
|
"source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
|
||||||
|
"fetched_from": fetched_from,
|
||||||
|
}
|
||||||
|
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
|
||||||
|
log.info(" %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
logging.basicConfig(level=logging.INFO, format="%(message)s")
|
||||||
|
p = argparse.ArgumentParser()
|
||||||
|
p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50004260enw")
|
||||||
|
p.add_argument("--bundle-id", default="hvm_quickspecs")
|
||||||
|
p.add_argument("--title", default="HPE Morpheus VM Essentials Software QuickSpecs")
|
||||||
|
p.add_argument("--version", default=None)
|
||||||
|
p.add_argument("--force", action="store_true")
|
||||||
|
args = p.parse_args()
|
||||||
|
ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
|
||||||
|
args.version, force=args.force)
|
||||||
|
return 0 if ok else 1
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
@@ -0,0 +1,27 @@
|
|||||||
|
# scrape/quickspecs/
|
||||||
|
|
||||||
|
Static HTML fixtures for HPE QuickSpecs documents that aren't reachable
|
||||||
|
from the runner (www.hpe.com edge drops connections from datacenter IPs
|
||||||
|
with non-browser User-Agents — verified 2026-05-22 with curl, wget, and
|
||||||
|
Anthropic's WebFetch).
|
||||||
|
|
||||||
|
## Workflow
|
||||||
|
|
||||||
|
1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
|
||||||
|
browser, opens DevTools → Elements → Copy the `<body>` HTML.
|
||||||
|
2. Save it at `scrape/quickspecs/<doc_id>.html`.
|
||||||
|
3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
|
||||||
|
4. `python -m scrape.runner --bundle hvm_quickspecs --force` reads the
|
||||||
|
committed HTML and writes `corpus/hvm_quickspecs/<doc_id>.{md,json}`.
|
||||||
|
5. Re-index and ship.
|
||||||
|
|
||||||
|
QuickSpecs only update every few months (HPE rebrand, new SKU added,
|
||||||
|
feature change). When a new version drops, refresh the local HTML
|
||||||
|
file and re-run the scrape.
|
||||||
|
|
||||||
|
## Current fixtures
|
||||||
|
|
||||||
|
- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
|
||||||
|
(Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
|
||||||
|
(3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
|
||||||
|
Essentials included.
|
||||||
@@ -0,0 +1,339 @@
|
|||||||
|
"""Scrape HVM doc bundles into corpus/<slug>/<page_id>.{md,json}.
|
||||||
|
|
||||||
|
Reads bundles.json (produced by scrape.bundles), then for each bundle:
|
||||||
|
- mode="toc": walks the TOC tree, fetches each page via the render
|
||||||
|
endpoint, converts page_html to markdown, writes
|
||||||
|
<page_id>.md + <page_id>.json sidecar.
|
||||||
|
- mode="single": fetches /document/{docId} directly, treats the whole
|
||||||
|
body as one page with page_id = doc_id.
|
||||||
|
|
||||||
|
After all bundles are on disk, runs a finalize pass that synthesizes
|
||||||
|
topic_cluster.clustered_topics for each page by looking up the same
|
||||||
|
GUID in sibling bundles (HPE GUIDs are stable across versions — see
|
||||||
|
reference_hpe_docs_portal_api.md).
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python -m scrape.runner --all
|
||||||
|
python -m scrape.runner --bundle hvm_user_manual_8_1_2
|
||||||
|
python -m scrape.runner --all --force # re-download already-on-disk pages
|
||||||
|
python -m scrape.runner --finalize-only # only redo the topic_cluster pass
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import requests
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
from markdownify import markdownify as md
|
||||||
|
|
||||||
|
API = "https://support.hpe.com/hpesc/public/api/document"
|
||||||
|
DOC_URL = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}&page={page_id}.html"
|
||||||
|
DOC_URL_SINGLE = "https://support.hpe.com/hpesc/public/docDisplay?docId={doc_id}"
|
||||||
|
UA = "hvm-docs-mcp/0.1 (+https://git.jpaul.io/justin/hvm-docs; admin@jpaul.io)"
|
||||||
|
ROOT = Path(__file__).resolve().parent.parent
|
||||||
|
CORPUS = ROOT / "corpus"
|
||||||
|
BUNDLES_JSON = ROOT / "bundles.json"
|
||||||
|
|
||||||
|
GUID_RE = re.compile(r"page=(GUID-[A-F0-9-]+)\.html")
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class TocEntry:
|
||||||
|
page_id: str
|
||||||
|
title: str
|
||||||
|
ordinal: int
|
||||||
|
parent_title: str | None
|
||||||
|
|
||||||
|
|
||||||
|
def _session() -> requests.Session:
|
||||||
|
s = requests.Session()
|
||||||
|
s.headers.update({"User-Agent": UA, "Accept": "application/json, text/html"})
|
||||||
|
return s
|
||||||
|
|
||||||
|
|
||||||
|
def _get(s: requests.Session, url: str, expect_json: bool = False, retries: int = 4) -> Any:
|
||||||
|
delay = 1.0
|
||||||
|
for attempt in range(retries):
|
||||||
|
r = s.get(url, timeout=30)
|
||||||
|
if r.status_code == 200:
|
||||||
|
return r.json() if expect_json else r.text
|
||||||
|
if r.status_code == 404:
|
||||||
|
return None
|
||||||
|
if r.status_code in (429, 500, 502, 503, 504):
|
||||||
|
time.sleep(delay)
|
||||||
|
delay *= 2
|
||||||
|
continue
|
||||||
|
r.raise_for_status()
|
||||||
|
raise RuntimeError(f"GET failed after {retries} retries: {url}")
|
||||||
|
|
||||||
|
|
||||||
|
def _flatten_toc(toc: list[dict]) -> list[TocEntry]:
|
||||||
|
out: list[TocEntry] = []
|
||||||
|
ordinal = 0
|
||||||
|
|
||||||
|
def walk(nodes: list[dict] | None, parent_title: str | None) -> None:
|
||||||
|
nonlocal ordinal
|
||||||
|
for node in nodes or []:
|
||||||
|
title = node.get("topicName") or ""
|
||||||
|
link = node.get("topicLink") or ""
|
||||||
|
m = GUID_RE.search(link)
|
||||||
|
if m:
|
||||||
|
ordinal += 1
|
||||||
|
out.append(TocEntry(page_id=m.group(1), title=title, ordinal=ordinal, parent_title=parent_title))
|
||||||
|
walk(node.get("children"), title or parent_title)
|
||||||
|
|
||||||
|
walk(toc, None)
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def _strip_dita_wrappers(html: str) -> str:
|
||||||
|
"""Remove the outer <main class="ditasrc">, drop the trademark Notices section,
|
||||||
|
and unwrap aria-only span markup so markdownify produces clean text.
|
||||||
|
|
||||||
|
DITA's notices boilerplate repeats across every doc; if we leave it in,
|
||||||
|
every page chunk inherits the same trademark text and pollutes retrieval."""
|
||||||
|
soup = BeautifulSoup(html, "html.parser")
|
||||||
|
# Drop the Notices/Acknowledgments/Abstract boilerplate by section heading.
|
||||||
|
# Every doc on the portal carries the same legal Notices and trademark
|
||||||
|
# Acknowledgments; if we leave them in, every chunk inherits the same
|
||||||
|
# text and pollutes retrieval. Abstract is one-line marketing.
|
||||||
|
boilerplate = {"Notices", "Acknowledgments", "Abstract"}
|
||||||
|
# Wrapped form: <article>/<section>/<div> whose first heading child is boilerplate.
|
||||||
|
for sec in soup.select("article, section, div"):
|
||||||
|
h = sec.find(["h1", "h2"], recursive=False)
|
||||||
|
if h and h.get_text(strip=True) in boilerplate:
|
||||||
|
sec.decompose()
|
||||||
|
# Unwrapped form: bare <h1>/<h2>Boilerplate</h2> followed by its .desc/.body sibling.
|
||||||
|
for h in soup.find_all(["h1", "h2"]):
|
||||||
|
if h.get_text(strip=True) in boilerplate:
|
||||||
|
sib = h.find_next_sibling()
|
||||||
|
if sib and (sib.name in {"div", "section"}):
|
||||||
|
cls = " ".join(sib.get("class", []) or [])
|
||||||
|
if "desc" in cls or "body" in cls or "notices" in cls:
|
||||||
|
sib.decompose()
|
||||||
|
h.decompose()
|
||||||
|
main = soup.find("main")
|
||||||
|
return str(main) if main else str(soup)
|
||||||
|
|
||||||
|
|
||||||
|
def html_to_md(page_html: str) -> str:
|
||||||
|
cleaned = _strip_dita_wrappers(page_html)
|
||||||
|
text = md(cleaned, heading_style="ATX", bullets="-")
|
||||||
|
# collapse runs of blank lines
|
||||||
|
text = re.sub(r"\n{3,}", "\n\n", text).strip()
|
||||||
|
return text + "\n"
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_toc_page(s: requests.Session, doc_id: str, page_id: str) -> str:
|
||||||
|
payload = _get(s, f"{API}/{doc_id}/render?page={page_id}.html", expect_json=True)
|
||||||
|
if not payload:
|
||||||
|
return ""
|
||||||
|
return payload.get("page_html") or ""
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_single_doc(s: requests.Session, doc_id: str) -> tuple[str, str]:
|
||||||
|
"""Returns (page_html, title) for a single-doc-shape bundle."""
|
||||||
|
html = _get(s, f"{API}/{doc_id}")
|
||||||
|
if not html:
|
||||||
|
return "", ""
|
||||||
|
soup = BeautifulSoup(html, "html.parser")
|
||||||
|
h1 = soup.select_one("h1.title.topictitle1")
|
||||||
|
title = h1.get_text(" ", strip=True) if h1 else doc_id
|
||||||
|
return html, title
|
||||||
|
|
||||||
|
|
||||||
|
def write_page(bundle_dir: Path, page_id: str, body_md: str, sidecar: dict[str, Any], force: bool) -> bool:
|
||||||
|
bundle_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
md_path = bundle_dir / f"{page_id}.md"
|
||||||
|
json_path = bundle_dir / f"{page_id}.json"
|
||||||
|
if not force and md_path.exists() and json_path.exists():
|
||||||
|
return False
|
||||||
|
md_path.write_text(body_md)
|
||||||
|
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def scrape_toc_bundle(s: requests.Session, bundle: dict, force: bool, concurrency: int) -> int:
|
||||||
|
doc_id = bundle["doc_id"]
|
||||||
|
slug = bundle["slug"]
|
||||||
|
bundle_dir = CORPUS / slug
|
||||||
|
|
||||||
|
toc = _get(s, f"{API}/{doc_id}/toc", expect_json=True) or []
|
||||||
|
entries = _flatten_toc(toc)
|
||||||
|
print(f" {slug}: {len(entries)} pages", file=sys.stderr)
|
||||||
|
|
||||||
|
written = 0
|
||||||
|
def do_one(entry: TocEntry) -> bool:
|
||||||
|
page_html = fetch_toc_page(s, doc_id, entry.page_id)
|
||||||
|
if not page_html:
|
||||||
|
return False
|
||||||
|
body_md = html_to_md(page_html)
|
||||||
|
sidecar = {
|
||||||
|
"bundle_id": slug,
|
||||||
|
"page_id": entry.page_id,
|
||||||
|
"title": entry.title,
|
||||||
|
"ordinal": entry.ordinal,
|
||||||
|
"parent_title": entry.parent_title,
|
||||||
|
"doc_id": doc_id,
|
||||||
|
"version": bundle.get("version"),
|
||||||
|
"product": bundle.get("product"),
|
||||||
|
"source_url": DOC_URL.format(doc_id=doc_id, page_id=entry.page_id),
|
||||||
|
# topic_cluster filled in by finalize()
|
||||||
|
}
|
||||||
|
return write_page(bundle_dir, entry.page_id, body_md, sidecar, force)
|
||||||
|
|
||||||
|
with ThreadPoolExecutor(max_workers=concurrency) as pool:
|
||||||
|
for fut in as_completed(pool.submit(do_one, e) for e in entries):
|
||||||
|
if fut.result():
|
||||||
|
written += 1
|
||||||
|
return written
|
||||||
|
|
||||||
|
|
||||||
|
def scrape_single_bundle(s: requests.Session, bundle: dict, force: bool) -> int:
|
||||||
|
doc_id = bundle["doc_id"]
|
||||||
|
slug = bundle["slug"]
|
||||||
|
bundle_dir = CORPUS / slug
|
||||||
|
|
||||||
|
html, title = fetch_single_doc(s, doc_id)
|
||||||
|
if not html:
|
||||||
|
print(f" ! {slug}: empty body", file=sys.stderr)
|
||||||
|
return 0
|
||||||
|
body_md = html_to_md(html)
|
||||||
|
sidecar = {
|
||||||
|
"bundle_id": slug,
|
||||||
|
"page_id": doc_id,
|
||||||
|
"title": title or bundle["title"],
|
||||||
|
"ordinal": 1,
|
||||||
|
"parent_title": None,
|
||||||
|
"doc_id": doc_id,
|
||||||
|
"version": bundle.get("version"),
|
||||||
|
"product": bundle.get("product"),
|
||||||
|
"source_url": DOC_URL_SINGLE.format(doc_id=doc_id),
|
||||||
|
}
|
||||||
|
print(f" {slug}: 1 page (single-doc)", file=sys.stderr)
|
||||||
|
return 1 if write_page(bundle_dir, doc_id, body_md, sidecar, force) else 0
|
||||||
|
|
||||||
|
|
||||||
|
def finalize_clusters(bundles: list[dict]) -> int:
|
||||||
|
"""Cross-link sibling pages with the same GUID across version bundles.
|
||||||
|
|
||||||
|
For TOC bundles, page_id == GUID; same GUID across two bundles = same
|
||||||
|
underlying topic. For single-doc bundles (page_id == doc_id), peer them
|
||||||
|
by matching product+version-sibling on the `product` field."""
|
||||||
|
# GUID → list[(slug, sidecar_path, sidecar_dict)]
|
||||||
|
guid_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
|
||||||
|
# product → list[(slug, sidecar_path, sidecar_dict)] for single-doc peering
|
||||||
|
product_to_pages: dict[str, list[tuple[str, Path, dict]]] = {}
|
||||||
|
|
||||||
|
for b in bundles:
|
||||||
|
slug = b["slug"]
|
||||||
|
bundle_dir = CORPUS / slug
|
||||||
|
if not bundle_dir.exists():
|
||||||
|
continue
|
||||||
|
for jp in bundle_dir.glob("*.json"):
|
||||||
|
data = json.loads(jp.read_text())
|
||||||
|
pid = data["page_id"]
|
||||||
|
if pid.startswith("GUID-"):
|
||||||
|
guid_to_pages.setdefault(pid, []).append((slug, jp, data))
|
||||||
|
else:
|
||||||
|
product_to_pages.setdefault(b["product"], []).append((slug, jp, data))
|
||||||
|
|
||||||
|
updated = 0
|
||||||
|
# TOC pages — cluster by GUID
|
||||||
|
for guid, peers in guid_to_pages.items():
|
||||||
|
if len(peers) < 2:
|
||||||
|
continue
|
||||||
|
for slug, jp, data in peers:
|
||||||
|
others = [
|
||||||
|
{"bundle_id": s2, "page_id": guid, "clustering_title": d2.get("title", "")}
|
||||||
|
for s2, _, d2 in peers if s2 != slug
|
||||||
|
]
|
||||||
|
data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
|
||||||
|
jp.write_text(json.dumps(data, indent=2) + "\n")
|
||||||
|
updated += 1
|
||||||
|
# Single-doc pages — cluster by product (e.g. Release Notes 8.1.0/.1/.2)
|
||||||
|
for product, peers in product_to_pages.items():
|
||||||
|
if len(peers) < 2:
|
||||||
|
continue
|
||||||
|
for slug, jp, data in peers:
|
||||||
|
others = [
|
||||||
|
{"bundle_id": s2, "page_id": d2["page_id"], "clustering_title": d2.get("title", "")}
|
||||||
|
for s2, _, d2 in peers if s2 != slug
|
||||||
|
]
|
||||||
|
data["topic_cluster"] = {"clustering_title": data.get("title", ""), "clustered_topics": others}
|
||||||
|
jp.write_text(json.dumps(data, indent=2) + "\n")
|
||||||
|
updated += 1
|
||||||
|
|
||||||
|
return updated
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
p = argparse.ArgumentParser(description="Scrape HVM bundles into corpus/.")
|
||||||
|
p.add_argument("--all", action="store_true", help="scrape every bundle in bundles.json")
|
||||||
|
p.add_argument("--bundle", action="append", help="scrape one bundle by slug (repeatable)")
|
||||||
|
p.add_argument("--force", action="store_true", help="re-fetch pages already on disk")
|
||||||
|
p.add_argument("--concurrency", type=int, default=6)
|
||||||
|
p.add_argument("--finalize-only", action="store_true", help="only rebuild topic_cluster sidecar fields")
|
||||||
|
args = p.parse_args()
|
||||||
|
|
||||||
|
if not BUNDLES_JSON.exists():
|
||||||
|
print(f"bundles.json missing — run `python -m scrape.bundles` first", file=sys.stderr)
|
||||||
|
return 2
|
||||||
|
|
||||||
|
bundles = json.loads(BUNDLES_JSON.read_text())
|
||||||
|
|
||||||
|
if args.finalize_only:
|
||||||
|
n = finalize_clusters(bundles)
|
||||||
|
print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
if args.bundle:
|
||||||
|
bundles = [b for b in bundles if b["slug"] in args.bundle]
|
||||||
|
if not bundles:
|
||||||
|
print(f"no bundles matched: {args.bundle}", file=sys.stderr)
|
||||||
|
return 2
|
||||||
|
elif not args.all:
|
||||||
|
print("specify --all or --bundle <slug>", file=sys.stderr)
|
||||||
|
return 2
|
||||||
|
|
||||||
|
s = _session()
|
||||||
|
total = 0
|
||||||
|
for b in bundles:
|
||||||
|
mode = b.get("mode")
|
||||||
|
if mode == "single":
|
||||||
|
total += scrape_single_bundle(s, b, args.force)
|
||||||
|
elif mode == "html-file":
|
||||||
|
# Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
|
||||||
|
# to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
|
||||||
|
from scrape.quickspecs import scrape_quickspecs
|
||||||
|
ok = scrape_quickspecs(
|
||||||
|
doc_id=b["doc_id"], bundle_id=b["slug"],
|
||||||
|
title=b.get("title", b["doc_id"]),
|
||||||
|
version=b.get("version"),
|
||||||
|
product=b.get("product", "QuickSpecs"),
|
||||||
|
source_url=b.get("source_url"),
|
||||||
|
force=args.force,
|
||||||
|
)
|
||||||
|
total += 1 if ok else 0
|
||||||
|
else:
|
||||||
|
total += scrape_toc_bundle(s, b, args.force, args.concurrency)
|
||||||
|
print(f"scraped {total} new/updated pages", file=sys.stderr)
|
||||||
|
|
||||||
|
# Always finalize after a scrape so sidecars are consistent.
|
||||||
|
all_bundles = json.loads(BUNDLES_JSON.read_text())
|
||||||
|
n = finalize_clusters(all_bundles)
|
||||||
|
print(f"finalize: updated topic_cluster on {n} sidecars", file=sys.stderr)
|
||||||
|
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
+70
-41
@@ -1,42 +1,58 @@
|
|||||||
"""Gitea container-registry garbage collection.
|
"""Gitea container-registry garbage collection.
|
||||||
|
|
||||||
Lists package versions for one container package and deletes versions
|
Lists tagged versions of one container package and deletes old ones.
|
||||||
older than --keep-days. Always preserves:
|
Always preserves:
|
||||||
|
|
||||||
- the :latest tag
|
- the `latest` tag (Watchtower's auto-deploy target)
|
||||||
- the --keep-latest most-recent date-tagged versions
|
- the `--keep-latest` most-recent date-tagged versions (YYYY.MM.DD)
|
||||||
- anything pushed in the last --keep-days days
|
- the `--keep-latest` most-recent short-SHA tags (rollback pins)
|
||||||
|
- anything pushed within `--keep-days` days
|
||||||
|
|
||||||
The actual disk reclaim happens on Gitea's next package GC cron (admin
|
OCI blob-level versions (`sha256:...`) are never touched directly — those
|
||||||
site settings). This script just marks the versions for deletion.
|
are managed by Gitea's internal package GC cron when their last tag
|
||||||
|
goes away.
|
||||||
|
|
||||||
Usage:
|
Usage:
|
||||||
|
|
||||||
python scripts/registry_gc.py \\
|
GITEA_TOKEN=... python scripts/registry_gc.py \\
|
||||||
--owner <user> \\
|
--owner justin \\
|
||||||
--package <product>-docs-mcp \\
|
--package hvm-docs \\
|
||||||
--keep-days 90 \\
|
--keep-days 90 \\
|
||||||
--keep-latest 5
|
--keep-latest 5
|
||||||
|
|
||||||
Auth: reads GITEA_TOKEN from env (set in the workflow as a secret).
|
The Gitea endpoint shape (confirmed 2026-05-22 against git.jpaul.io):
|
||||||
|
|
||||||
|
GET /api/v1/packages/{owner}/container/{package}
|
||||||
|
-> [{id, version, created_at, ...}, ...]
|
||||||
|
DELETE /api/v1/packages/{owner}/container/{package}/{version}
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import argparse
|
import argparse
|
||||||
|
import json
|
||||||
import os
|
import os
|
||||||
|
import re
|
||||||
import sys
|
import sys
|
||||||
from datetime import datetime, timedelta, timezone
|
from datetime import datetime, timedelta, timezone
|
||||||
from urllib.request import Request, urlopen
|
|
||||||
from urllib.error import HTTPError
|
from urllib.error import HTTPError
|
||||||
import json
|
from urllib.parse import quote
|
||||||
|
from urllib.request import Request, urlopen
|
||||||
|
|
||||||
GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
|
GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
|
||||||
|
DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")
|
||||||
|
SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$") # short or full git SHA
|
||||||
|
BLOB_VER = re.compile(r"^sha256:") # OCI blob versions — skip
|
||||||
|
|
||||||
|
|
||||||
def api(token: str, method: str, path: str) -> object:
|
def api(token: str, method: str, path: str) -> object:
|
||||||
|
# Explicit User-Agent: git.jpaul.io is behind Cloudflare, whose default
|
||||||
|
# Bot Fight Mode 403s `Python-urllib/X.Y` with error 1010. Any
|
||||||
|
# recognizable browser/curl-style UA passes.
|
||||||
req = Request(f"{GITEA_HOST}{path}",
|
req = Request(f"{GITEA_HOST}{path}",
|
||||||
headers={"Authorization": f"token {token}"},
|
headers={
|
||||||
|
"Authorization": f"token {token}",
|
||||||
|
"User-Agent": "hvm-docs-registry-gc/1.0",
|
||||||
|
},
|
||||||
method=method)
|
method=method)
|
||||||
try:
|
try:
|
||||||
with urlopen(req, timeout=30) as r:
|
with urlopen(req, timeout=30) as r:
|
||||||
@@ -63,44 +79,57 @@ def main() -> int:
|
|||||||
return 1
|
return 1
|
||||||
|
|
||||||
versions = api(token, "GET",
|
versions = api(token, "GET",
|
||||||
f"/api/v1/packages/{args.owner}/container/{args.package}/versions") or []
|
f"/api/v1/packages/{args.owner}/container/{args.package}") or []
|
||||||
if not versions:
|
if not versions:
|
||||||
print(f"no versions found for {args.owner}/{args.package}")
|
print(f"no versions found for {args.owner}/container/{args.package}")
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
|
cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
|
||||||
|
print(f" {len(versions)} version(s); cutoff={cutoff.isoformat()} "
|
||||||
|
f"keep_days={args.keep_days} keep_latest={args.keep_latest}")
|
||||||
|
|
||||||
# Date-tagged versions (YYYY.MM.DD), newest first
|
# Sort newest first by created_at.
|
||||||
date_tagged = []
|
def parsed_ts(v: dict) -> datetime:
|
||||||
for v in versions:
|
|
||||||
tags = v.get("tags") or []
|
|
||||||
for t in tags:
|
|
||||||
if len(t) == 10 and t[4] == "." and t[7] == ".":
|
|
||||||
date_tagged.append((t, v))
|
|
||||||
break
|
|
||||||
date_tagged.sort(key=lambda kv: kv[0], reverse=True)
|
|
||||||
keep_date_tags = {t for t, _ in date_tagged[:args.keep_latest]}
|
|
||||||
|
|
||||||
deleted = 0
|
|
||||||
for v in versions:
|
|
||||||
tags = v.get("tags") or []
|
|
||||||
if "latest" in tags:
|
|
||||||
continue
|
|
||||||
if any(t in keep_date_tags for t in tags):
|
|
||||||
continue
|
|
||||||
try:
|
try:
|
||||||
created = datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
|
return datetime.fromisoformat(v["created_at"].replace("Z", "+00:00"))
|
||||||
except (KeyError, ValueError):
|
except (KeyError, ValueError):
|
||||||
|
return datetime.min.replace(tzinfo=timezone.utc)
|
||||||
|
|
||||||
|
versions.sort(key=parsed_ts, reverse=True)
|
||||||
|
|
||||||
|
# Compute the keep-set: top-N date tags + top-N sha tags + always latest.
|
||||||
|
keep_dates: list[str] = []
|
||||||
|
keep_shas: list[str] = []
|
||||||
|
for v in versions:
|
||||||
|
ver = v.get("version") or ""
|
||||||
|
if DATE_TAG.match(ver) and len(keep_dates) < args.keep_latest:
|
||||||
|
keep_dates.append(ver)
|
||||||
|
elif SHA_TAG.match(ver) and len(keep_shas) < args.keep_latest:
|
||||||
|
keep_shas.append(ver)
|
||||||
|
keep = {"latest", *keep_dates, *keep_shas}
|
||||||
|
print(f" keep tags: {sorted(keep)}")
|
||||||
|
|
||||||
|
deleted = skipped_blob = skipped_age = skipped_keep = 0
|
||||||
|
for v in versions:
|
||||||
|
ver = v.get("version") or ""
|
||||||
|
ts = parsed_ts(v)
|
||||||
|
if BLOB_VER.match(ver):
|
||||||
|
skipped_blob += 1
|
||||||
continue
|
continue
|
||||||
if created >= cutoff:
|
if ver in keep:
|
||||||
|
skipped_keep += 1
|
||||||
continue
|
continue
|
||||||
version_id = v.get("id")
|
if ts >= cutoff:
|
||||||
print(f" deleting v{version_id} tags={tags} created={v['created_at']}")
|
skipped_age += 1
|
||||||
|
continue
|
||||||
|
print(f" deleting {ver!r} id={v.get('id')} created={v.get('created_at')}")
|
||||||
if not args.dry_run:
|
if not args.dry_run:
|
||||||
api(token, "DELETE",
|
api(token, "DELETE",
|
||||||
f"/api/v1/packages/{args.owner}/container/{args.package}/versions/{version_id}")
|
f"/api/v1/packages/{args.owner}/container/{args.package}/{quote(ver, safe='')}")
|
||||||
deleted += 1
|
deleted += 1
|
||||||
print(f"done: {deleted} version(s) deleted")
|
|
||||||
|
print(f"done: deleted={deleted} kept_named={skipped_keep} "
|
||||||
|
f"kept_recent={skipped_age} skipped_blobs={skipped_blob}")
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,120 @@
|
|||||||
|
"""Minimal HTTP reranker — `/v1/rerank` endpoint over a sentence-transformers CrossEncoder.
|
||||||
|
|
||||||
|
Matches the Cohere `/v1/rerank` request/response shape, which is what the
|
||||||
|
server's `_rerank()` helper expects. This is the dev-friendly fallback;
|
||||||
|
production replaces this with the llama.cpp + jina-reranker-v2-base GGUF
|
||||||
|
sidecar (see deploy/docker-compose.yml) without changing the client.
|
||||||
|
|
||||||
|
Request:
|
||||||
|
POST /v1/rerank
|
||||||
|
{"model": "...", "query": "...", "documents": ["text", ...], "top_n": 10}
|
||||||
|
|
||||||
|
Response:
|
||||||
|
{"model": "...", "results": [{"index": 0, "relevance_score": 0.93}, ...]}
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python -m scripts.rerank_server # localhost:8001
|
||||||
|
RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-12-v2 \\
|
||||||
|
RERANK_PORT=8001 python -m scripts.rerank_server
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
|
||||||
|
|
||||||
|
log = logging.getLogger(__name__)
|
||||||
|
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
|
||||||
|
|
||||||
|
MODEL_NAME = os.environ.get("RERANK_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")
|
||||||
|
PORT = int(os.environ.get("RERANK_PORT", "8001"))
|
||||||
|
HOST = os.environ.get("RERANK_HOST", "127.0.0.1")
|
||||||
|
# Truncate docs to this many chars before scoring. jina-reranker GGUF has a
|
||||||
|
# 1024-token per-pair cap that 400s the entire batch; ms-marco is more
|
||||||
|
# forgiving but we still cap to keep latency predictable.
|
||||||
|
MAX_DOC_CHARS = int(os.environ.get("RERANK_MAX_DOC_CHARS", "2000"))
|
||||||
|
|
||||||
|
_model = None
|
||||||
|
|
||||||
|
|
||||||
|
def _get_model():
|
||||||
|
global _model
|
||||||
|
if _model is None:
|
||||||
|
from sentence_transformers import CrossEncoder
|
||||||
|
log.info("loading %s", MODEL_NAME)
|
||||||
|
_model = CrossEncoder(MODEL_NAME)
|
||||||
|
log.info("loaded")
|
||||||
|
return _model
|
||||||
|
|
||||||
|
|
||||||
|
def _rerank(query: str, documents: list[str], top_n: int | None) -> list[dict]:
|
||||||
|
model = _get_model()
|
||||||
|
pairs = [[query, (d or "")[:MAX_DOC_CHARS]] for d in documents]
|
||||||
|
scores = model.predict(pairs)
|
||||||
|
ranked = sorted(
|
||||||
|
({"index": i, "relevance_score": float(s)} for i, s in enumerate(scores)),
|
||||||
|
key=lambda r: -r["relevance_score"],
|
||||||
|
)
|
||||||
|
if top_n is not None:
|
||||||
|
ranked = ranked[:top_n]
|
||||||
|
return ranked
|
||||||
|
|
||||||
|
|
||||||
|
class Handler(BaseHTTPRequestHandler):
|
||||||
|
def log_message(self, fmt, *args):
|
||||||
|
log.info("%s - %s", self.address_string(), fmt % args)
|
||||||
|
|
||||||
|
def _send_json(self, status: int, payload: dict) -> None:
|
||||||
|
body = json.dumps(payload).encode()
|
||||||
|
self.send_response(status)
|
||||||
|
self.send_header("Content-Type", "application/json")
|
||||||
|
self.send_header("Content-Length", str(len(body)))
|
||||||
|
self.end_headers()
|
||||||
|
self.wfile.write(body)
|
||||||
|
|
||||||
|
def do_GET(self): # noqa: N802
|
||||||
|
if self.path in ("/", "/health"):
|
||||||
|
self._send_json(200, {"status": "ok", "model": MODEL_NAME})
|
||||||
|
return
|
||||||
|
self._send_json(404, {"error": "not found"})
|
||||||
|
|
||||||
|
def do_POST(self): # noqa: N802
|
||||||
|
if self.path not in ("/v1/rerank", "/rerank"):
|
||||||
|
self._send_json(404, {"error": "not found"})
|
||||||
|
return
|
||||||
|
length = int(self.headers.get("Content-Length", "0"))
|
||||||
|
try:
|
||||||
|
req = json.loads(self.rfile.read(length).decode())
|
||||||
|
except Exception as e:
|
||||||
|
self._send_json(400, {"error": f"bad json: {e}"})
|
||||||
|
return
|
||||||
|
query = req.get("query")
|
||||||
|
documents = req.get("documents")
|
||||||
|
if not isinstance(query, str) or not isinstance(documents, list):
|
||||||
|
self._send_json(400, {"error": "expected {query: str, documents: list[str]}"})
|
||||||
|
return
|
||||||
|
top_n = req.get("top_n")
|
||||||
|
try:
|
||||||
|
results = _rerank(query, documents, top_n if isinstance(top_n, int) else None)
|
||||||
|
except Exception as e:
|
||||||
|
log.exception("rerank failed")
|
||||||
|
self._send_json(500, {"error": str(e)})
|
||||||
|
return
|
||||||
|
self._send_json(200, {"model": MODEL_NAME, "results": results})
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
_get_model() # warm-load before accepting traffic
|
||||||
|
server = ThreadingHTTPServer((HOST, PORT), Handler)
|
||||||
|
log.info("listening on http://%s:%d", HOST, PORT)
|
||||||
|
try:
|
||||||
|
server.serve_forever()
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
log.info("shutting down")
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
Reference in New Issue
Block a user