scrape: add Qualification Matrix + QuickSpecs bundles (live curl_cffi for HPE www)

Two new bundles:

* hvm_qualification_matrix (sd00006551en_us) — the "Qualification Matrix
  for HVM Clusters Managed by HPE Morpheus Software". Single TOC bundle,
  2 pages (parent + content). The content page is ~100 KB of HTML
  containing five tables: Server Hardware Support, Storage Hardware
  Support, Independent Software Vendor (ISV) Support, Hypervisor OS
  Compatibility and Interoperability Matrix, and Guest OS. Scraped via
  the same /hpesc/public/api/document/{docId}/render endpoint as every
  other bundle on support.hpe.com — the API returns server-rendered
  DITA HTML, so no JS/SPA shenanigans.

* hvm_quickspecs (a50004260enw) — HPE Morpheus VM Essentials Software
  QuickSpecs, Version 4 (02-Feb-2026). SKUs: S5Q81AAE (1-yr per Socket
  E-LTU), S5Q82AAE (3-yr), S5Q83AAE (5-yr); each includes Tech Care
  Essentials. QuickSpecs lives at www.hpe.com (not support.hpe.com),
  which drops connections at the edge for non-browser TLS fingerprints —
  verified 2026-05-22 against curl, wget, urllib, and Anthropic's
  WebFetch (all = 0 bytes / connection timeout in headers). Bypassed
  here via curl_cffi impersonating Chrome 120's JA3/JA4 fingerprint.
  HTTP 200, 255 KB on first try, all four sections + all three SKUs
  cleanly parseable from the server-rendered HTML.

New module scrape/quickspecs.py drives the live fetch + parse for any
hvm_*_quickspecs bundle. CSS selectors taken from the captured DOM:
  .lr-right-rail hpe-highlights-container .collateral-content
       — one block per H3 section
  h3.txto-title             — section title
  div.txto-description      — section body
  uc-table.uc-table-polaris — SKU and version-history tables
On any live failure the parser falls back to a committed HTML fixture
at scrape/quickspecs/<doc_id>.html so the build never breaks on a
transient edge hiccup.

scrape/runner.py learned a new mode "html-file" that dispatches to
scrape.quickspecs; bundles.py extended with an optional source_url on
BundleSpec for cases where the page lives outside support.hpe.com.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-22 15:05:11 -04:00
parent ab1de47475
commit a0727da8da
16 changed files with 919 additions and 3 deletions
+29 -2
View File
@@ -39,9 +39,10 @@ class BundleSpec:
title: str
version: str | None
product: str # e.g. "User Manual", "Release Notes", "Deployment Guide"
mode: str # "toc" or "single"
mode: str # "toc", "single", or "html-file" (committed fixture under scrape/quickspecs/)
platform: str | None = None
language: str = "en-US"
source_url: str | None = None # overrides the default support.hpe.com URL
# Declared bundles. Versions confirmed 2026-05-22 by probing the docId
@@ -54,6 +55,14 @@ BUNDLES: list[BundleSpec] = [
BundleSpec("hvm_release_notes_8_1_1", "sd00007609en_us", "HPE Morpheus VM Essentials Software Release Notes", "8.1.1", "Release Notes", "single"),
BundleSpec("hvm_release_notes_8_1_2", "sd00007734en_us", "HPE Morpheus VM Essentials Software Release Notes", "8.1.2", "Release Notes", "single"),
BundleSpec("hvm_deployment_guide", "sd00007332en_us", "HPE Morpheus VM Essentials Deployment Guide", None, "Deployment Guide","toc"),
BundleSpec("hvm_qualification_matrix","sd00006551en_us", "Qualification Matrix for HVM Clusters Managed by HPE Morpheus Software", None, "Qualification Matrix", "toc"),
# QuickSpecs is a static-HTML fixture (www.hpe.com edge drops automated
# connections — see scrape/quickspecs/README.md). doc_id = the QuickSpecs
# PSNow ref (a50004260enw). page_count is 1; source_url points at the
# public PSNow URL.
BundleSpec("hvm_quickspecs", "a50004260enw", "HPE Morpheus VM Essentials Software QuickSpecs",
"v4-2026-02-02", "QuickSpecs", "html-file",
source_url="https://www.hpe.com/psnow/doc/a50004260enw"),
]
@@ -118,6 +127,24 @@ def _parse_abstract(html: str) -> dict[str, str]:
def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
# html-file bundles are static fixtures — no upstream fetch.
if spec.mode == "html-file":
return {
"slug": spec.slug,
"doc_id": spec.doc_id,
"title": spec.title,
"version": spec.version,
"platform": spec.platform,
"product": spec.product,
"language": spec.language,
"page_count": 1,
"mode": "html-file",
"abstract": "",
"dates": {},
"landing_page": spec.doc_id,
"source_url": spec.source_url or f"https://www.hpe.com/psnow/doc/{spec.doc_id}",
}
abstract_html = _get(s, f"{API}/{spec.doc_id}", expect_json=False)
meta = _parse_abstract(abstract_html or "")
@@ -146,7 +173,7 @@ def discover_bundle(s: requests.Session, spec: BundleSpec) -> dict[str, Any]:
"abstract": meta.get("abstract", ""),
"dates": {"Published": meta.get("published", "")},
"landing_page": landing,
"source_url": DOC_URL.format(doc_id=spec.doc_id),
"source_url": spec.source_url or DOC_URL.format(doc_id=spec.doc_id),
}
+194
View File
@@ -0,0 +1,194 @@
"""Scrape HPE QuickSpecs collateral pages into corpus markdown.
HPE QuickSpecs live at `https://www.hpe.com/us/en/collaterals/collateral.<doc_id>.html`
with a server-rendered HTML body (confirmed 2026-05-22 by inspecting the
captured DOM). The blocker for automated scraping is `www.hpe.com`'s
edge bot defense, which drops connections from non-browser TLS
fingerprints (curl, wget, Python-urllib, even WebFetch). Bypassed here
by `curl_cffi` impersonating Chrome 120's JA3/JA4 fingerprint.
Content extraction uses these stable CSS selectors found in the page:
.lr-right-rail hpe-highlights-container .collateral-content
— one per section ("Overview", "Standard Features", etc.)
h3.txto-title — section title
div.txto-description — section body
uc-table.uc-table-polaris — SKU / version-history tables
A committed HTML fixture at `scrape/quickspecs/<doc_id>.html` is used
as a fallback when the live fetch fails (HPE edge churn, network
issues). Keeping a current fixture in the repo also makes diffing
QuickSpecs revisions easy.
Usage (called by scrape.runner for bundles with mode="quickspecs"):
python -m scrape.quickspecs a50004260enw
Or programmatically:
from scrape.quickspecs import scrape_quickspecs
scrape_quickspecs("a50004260enw", bundle_id="hvm_quickspecs", title="...")
"""
from __future__ import annotations
import argparse
import json
import logging
import sys
from pathlib import Path
from bs4 import BeautifulSoup, NavigableString
from markdownify import markdownify as md
log = logging.getLogger(__name__)
ROOT = Path(__file__).resolve().parent.parent
SOURCE_DIR = ROOT / "scrape" / "quickspecs"
CORPUS_DIR = ROOT / "corpus"
COLLATERAL_URL = "https://www.hpe.com/us/en/collaterals/collateral.{doc_id}.html"
def fetch_live(doc_id: str, timeout: float = 30.0) -> str | None:
"""GET the collateral page via curl_cffi (Chrome 120 TLS fingerprint).
Returns the HTML body on success, None on any failure."""
try:
from curl_cffi import requests as cc
except ImportError:
log.warning("curl_cffi not installed; can't fetch QuickSpecs live")
return None
try:
r = cc.get(COLLATERAL_URL.format(doc_id=doc_id),
impersonate="chrome120", timeout=timeout)
if r.status_code != 200 or not r.text:
log.warning("QuickSpecs %s: http=%s bytes=%d", doc_id, r.status_code, len(r.text or ""))
return None
return r.text
except Exception as e:
log.warning("QuickSpecs %s live fetch failed: %s", doc_id, e)
return None
def fetch_fixture(doc_id: str) -> str | None:
"""Read the committed HTML fixture as fallback."""
p = SOURCE_DIR / f"{doc_id}.html"
if not p.exists():
return None
return p.read_text()
def _extract_content_blocks(html: str) -> list[str]:
"""Pull each section block (.collateral-content under .lr-right-rail).
The fixture format (just .quickspecs-content wrapper) and the live
format (.lr-right-rail with nested hpe-highlights-container) are
both supported. Returns a list of section HTML strings, in document
order.
"""
soup = BeautifulSoup(html, "html.parser")
# Live format: each <hpe-highlights-container> under .lr-right-rail has
# one or more .collateral-content blocks; concat them.
rail = soup.select_one(".lr-right-rail")
if rail is not None:
blocks = rail.select(".collateral-content")
return [str(b) for b in blocks]
# Fixture format: a single wrapper holding all the H2/H3 sections.
wrapper = soup.select_one(".quickspecs-content")
if wrapper is not None:
return [str(wrapper)]
# Last-resort: whole body.
body = soup.body or soup
return [str(body)]
def parse_html(html: str) -> str:
"""Convert QuickSpecs HTML to clean markdown.
Filters out the page chrome (nav, footer, recommendations carousel,
cookie banner, analytics blobs) by extracting only the content
blocks, then runs markdownify."""
blocks = _extract_content_blocks(html)
chunks: list[str] = []
for block in blocks:
soup = BeautifulSoup(block, "html.parser")
# Drop anchor placeholders that markdownify turns into noisy links
for a in soup.select('[hpe-left-rail-anchor]'):
a.decompose()
# Drop carousel / share / recommendation widgets if any leaked in.
for sel in ("esl-share", "hpe-recommendations", "hpe-sticky-bar",
"esl-scrollbar", "esl-trigger", "video-overlay",
"generic-modal-loader", "style", "script"):
for el in soup.select(sel):
el.decompose()
chunks.append(md(str(soup), heading_style="ATX", bullets="-",
strip=["span", "div"]))
text = "\n\n".join(chunks)
# Collapse runs of blank lines markdownify likes to emit.
text = "\n".join(line.rstrip() for line in text.splitlines())
while "\n\n\n" in text:
text = text.replace("\n\n\n", "\n\n")
return text.strip() + "\n"
def scrape_quickspecs(doc_id: str, bundle_id: str, title: str,
version: str | None = None,
product: str = "QuickSpecs",
source_url: str | None = None,
force: bool = False) -> bool:
"""Live-fetch (or fall back to fixture), parse, write corpus files.
Returns True if files were written, False if skipped (already exists
and --force not set)."""
bundle_dir = CORPUS_DIR / bundle_id
md_path = bundle_dir / f"{doc_id}.md"
json_path = bundle_dir / f"{doc_id}.json"
if not force and md_path.exists() and json_path.exists():
log.info(" %s/%s: already on disk (use --force to refresh)", bundle_id, doc_id)
return False
html = fetch_live(doc_id)
fetched_from = "live"
if html is None:
html = fetch_fixture(doc_id)
fetched_from = "fixture"
if html is None:
log.error("QuickSpecs %s: no live response and no fixture at %s",
doc_id, SOURCE_DIR / f"{doc_id}.html")
return False
body_md = parse_html(html)
bundle_dir.mkdir(parents=True, exist_ok=True)
md_path.write_text(body_md)
sidecar = {
"bundle_id": bundle_id,
"page_id": doc_id,
"title": title,
"ordinal": 1,
"parent_title": None,
"doc_id": doc_id,
"version": version,
"product": product,
"source_url": source_url or f"https://www.hpe.com/psnow/doc/{doc_id}",
"fetched_from": fetched_from,
}
json_path.write_text(json.dumps(sidecar, indent=2) + "\n")
log.info(" %s/%s: %d bytes from %s", bundle_id, doc_id, len(body_md), fetched_from)
return True
def main() -> int:
logging.basicConfig(level=logging.INFO, format="%(message)s")
p = argparse.ArgumentParser()
p.add_argument("doc_id", help="QuickSpecs document id, e.g. a50004260enw")
p.add_argument("--bundle-id", default="hvm_quickspecs")
p.add_argument("--title", default="HPE Morpheus VM Essentials Software QuickSpecs")
p.add_argument("--version", default=None)
p.add_argument("--force", action="store_true")
args = p.parse_args()
ok = scrape_quickspecs(args.doc_id, args.bundle_id, args.title,
args.version, force=args.force)
return 0 if ok else 1
if __name__ == "__main__":
sys.exit(main())
+27
View File
@@ -0,0 +1,27 @@
# scrape/quickspecs/
Static HTML fixtures for HPE QuickSpecs documents that aren't reachable
from the runner (www.hpe.com edge drops connections from datacenter IPs
with non-browser User-Agents — verified 2026-05-22 with curl, wget, and
Anthropic's WebFetch).
## Workflow
1. Operator visits `https://www.hpe.com/psnow/doc/<doc_id>` in a real
browser, opens DevTools → Elements → Copy the `<body>` HTML.
2. Save it at `scrape/quickspecs/<doc_id>.html`.
3. Add a bundle entry in `scrape/bundles.py` with `mode="html-file"`.
4. `python -m scrape.runner --bundle hvm_quickspecs --force` reads the
committed HTML and writes `corpus/hvm_quickspecs/<doc_id>.{md,json}`.
5. Re-index and ship.
QuickSpecs only update every few months (HPE rebrand, new SKU added,
feature change). When a new version drops, refresh the local HTML
file and re-run the scrape.
## Current fixtures
- `a50004260enw.html` — HPE Morpheus VM Essentials Software QuickSpecs
(Version 4, 02-February-2026). SKUs: S5Q81AAE (1-yr), S5Q82AAE
(3-yr), S5Q83AAE (5-yr) — all "per Socket E-LTU" with Tech Care
Essentials included.
+219
View File
@@ -0,0 +1,219 @@
<!-- Source: https://www.hpe.com/us/en/collaterals/collateral.a50004260enw.html
Captured: 2026-05-22 (Version 4, 02-February-2026)
Reason for static fixture: www.hpe.com edge drops connections from
datacenter IPs / non-browser User-Agents. Operator captures the
.lr-right-rail HTML in a browser and commits it here.
Parser: python -m scrape.quickspecs_parse a50004260enw -->
<div class="quickspecs-content">
<h1>HPE Morpheus VM Essentials Software QuickSpecs</h1>
<p><em>Version 4 — 02-February-2026 — a50004260enw</em></p>
<p><strong>HPE Morpheus VM Essentials Software is a virtualization
software solution that allows customers to provision and manage KVM
and VMware-based VMs from a single intuitive interface.</strong></p>
<h2 id="overview">Overview</h2>
<p>The solution comes with the KVM-based HVM hypervisor that is enhanced
to include enterprise-grade cluster management with capabilities such
as high availability, live compute and storage migration, distributed
workload placement, integrated data protection, secure hardening, and
external storage support. To enable flexibility for those continuing to
host VMware-based applications, VM Essentials can also be used to
connect to and manage existing VMware clusters. This means unified
management and simple VM provisioning across both the HVM hypervisor
and VMware ESXi™ so you can provision workloads on demand to the right
environment, on your terms, with zero lock-in. When you're ready, you
can use the included toolset to convert existing VMware images to
VM Essentials.</p>
<h3>Validated Hardware and Software</h3>
<p>The list of validated compute and storage hardware for VM Essentials
can be found in the <a href="https://www.hpe.com/support/VME-Compatibility-Matrix">compatibility matrix</a>,
along with validated operating systems and ISV software.</p>
<h3>Models</h3>
<p>Hewlett Packard Enterprise is making the following VM Essentials
SKUs available. VM Essentials SKUs are licensed per physical CPU
socket. Each SKU includes Tech Care Essentials support.</p>
<table>
<caption>HPE Morpheus VM Essentials Software SKUs</caption>
<thead><tr><th>Description</th><th>SKU</th></tr></thead>
<tbody>
<tr><td>HPE Morpheus VM Essentials Software per Socket 1-year E-LTU</td><td><code>S5Q81AAE</code></td></tr>
<tr><td>HPE Morpheus VM Essentials Software per Socket 3-year E-LTU</td><td><code>S5Q82AAE</code></td></tr>
<tr><td>HPE Morpheus VM Essentials Software per Socket 5-year E-LTU</td><td><code>S5Q83AAE</code></td></tr>
</tbody>
</table>
<h2 id="standard-features">Standard Features</h2>
<h3>Key Features</h3>
<h4>Enterprise Virtualization</h4>
<ul>
<li><strong>VM Live Migration:</strong> Migrate running virtual
machines from one physical host to another within an HVM cluster
without downtime to improve host utilization or to perform host
maintenance.</li>
<li><strong>VM High Availability:</strong> Enable workload resiliency
with virtual machine high availability to quickly restart virtual
machines on another physical host in the event of a host failure.</li>
<li><strong>Distributed Workload Placement:</strong> Dynamically
schedule the placement of virtual machines within an HVM cluster
based on intelligent placement logic that determines the optimal host
for the virtual machine.</li>
<li><strong>VM Live Storage Migration:</strong> Migrate a running
virtual machine's storage disks from one datastore to another without
downtime.</li>
<li><strong>External Storage Support:</strong> Integrate with existing
external storage (NFS, iSCSI, Fibre Channel) to take advantage of
existing infrastructure investments.</li>
</ul>
<h4>Solution Integrations</h4>
<ul>
<li><strong>VMware vSphere Integration:</strong> Integrate VM
Essentials with a vSphere deployment to discover existing virtual
machines, provision new virtual machines, as well as manage
provisioned or discovered machines.</li>
<li><strong>Native IP Pools:</strong> Assign virtual machine IP
addresses using the VM Essentials native IP pools feature to define
and manage pools of IP addresses associated with virtual machine
networks.</li>
<li><strong>Native Secrets Management:</strong> Securely store and
retrieve credentials and other sensitive information used in
automation tasks for bootstrapping and managing managed virtual
machines.</li>
<li><strong>IP Address Management (IPAM) Integration:</strong>
Integrate 3rd party IPAM solutions (InfoBlox, BlueCat, SolarWinds,
phpIPAM, EfficientIP) to automate the reservation and assignment of
IP address.</li>
<li><strong>Domain Name System (DNS) Integration:</strong> Integrate
3rd party DNS solutions (PowerDNS, Microsoft DNS, BlueCat, InfoBlox,
EfficientIP) to automate the creation of DNS A and PTR records.</li>
<li><strong>Native Data Protection:</strong> Create and restore
snapshot-based backups for VM Essentials and VMware virtual machines
using the native data protection functionality.</li>
<li><strong>Data Protection Integration:</strong> Integrate 3rd party
Data Protection solutions (Commvault, Rubrik, Veeam) to create backup
jobs during the creation of a VMware virtual machine and restore
backups through the VM Essentials web interface.</li>
<li><strong>Provisioning Task Automation:</strong> Execute automation
scripts (Bash and PowerShell) during the provisioning of virtual
machines to orchestrate bootstrap operations such as software
installation and system configuration.</li>
<li><strong>HPE Alletra Storage MP Integration:</strong> Integrate
with the HPE Alletra Storage MP B10000 storage array to utilize
direct virtual machine volume mapping to the storage array to enable
granular performance configuration and array-based snapshotting.</li>
</ul>
<h3>Virtual Machine Management</h3>
<ul>
<li><strong>Power Operations:</strong> Start, stop, and restart
VMware and HVM virtual machines.</li>
<li><strong>Snapshot Management:</strong> Create, revert, and delete
virtual machine snapshots for VMware and HVM virtual machines.</li>
<li><strong>Virtual Hardware Management:</strong> Add and remove
virtual hardware such as hard disks, network interfaces, CPU and
memory from a managed virtual machine (VM Essentials or VMware).</li>
<li><strong>Integrated Backup Management:</strong> Create, restore,
and delete virtual machine backups for VMware and HVM virtual
machines.</li>
<li><strong>HTML5 Console:</strong> Access the console of managed
virtual machines via the VM Essentials manager web interface with
support for Virtual Network Computing (VNC), Secure Shell (SSH), and
Remote Desktop Protocol (RDP).</li>
<li><strong>Day 2 Task Automation:</strong> Execute automation
scripts (Bash and PowerShell) against managed virtual machines to
perform day 2 operational tasks such as freeing up disk space or
updating system packages.</li>
<li><strong>Tag Management:</strong> Create and manage virtual
machine tags for VMware and VM Essentials virtual machines.</li>
</ul>
<h2 id="service-and-support">Service and Support</h2>
<h3>HPE Services</h3>
<p>No matter where you are in your digital transformation journey, you
can count on HPE Services to deliver the expertise you need when, where
and how you need it. From planning to deployment, ongoing operations
and beyond, our experts can help you realize your digital ambitions.
See <a href="https://www.hpe.com/services">https://www.hpe.com/services</a>.</p>
<h3>HPE Complete Care Service</h3>
<p>A modular, edge-to-cloud IT environment service designed to help
optimize your entire IT environment and achieve agreed-upon IT outcomes
and business goals through a personalized experience. Includes complete
coverage, an assigned HPE team, modular engagement, enhanced incident
management with priority access, and AI-driven customer experience.
See <a href="https://www.hpe.com/services/completecare">https://www.hpe.com/services/completecare</a>.</p>
<h3>HPE Tech Care Service</h3>
<p>Operational support service experience for HPE products. Goes beyond
traditional support by providing access to product-specific experts,
an AI-driven digital experience, and general technical guidance.
Available in three response levels:</p>
<ul>
<li><strong>Basic:</strong> 9×5 business hours, 2-hour response.</li>
<li><strong>Essential:</strong> 15-minute response 24×7 (most
enterprise customers).</li>
<li><strong>Critical:</strong> 6-hour repair commitment where
available, plus outage management response for severity 1
incidents.</li>
</ul>
<p>See <a href="https://www.hpe.com/services/techcare">https://www.hpe.com/services/techcare</a>.</p>
<h3>HPE Lifecycle Services</h3>
<ul>
<li>Lifecycle Install and Startup Services.</li>
<li>Firmware Update Analysis Service.</li>
<li>Firmware Update Implementation Service.</li>
<li>Implementation assistance services.</li>
<li>HPE Service Credits.</li>
</ul>
<p>See <a href="https://www.hpe.com/services/lifecycle">https://www.hpe.com/services/lifecycle</a>.</p>
<h3>Other Related Services</h3>
<ul>
<li><strong>HPE Education Services:</strong> Training and
certification. See <a href="https://www.hpe.com/services/training">https://www.hpe.com/services/training</a>.</li>
<li><strong>Defective Media Retention:</strong> Available with
Complete Care and Tech Care for disks/SSDs replaced due to
malfunction.</li>
<li><strong>Parts and Materials:</strong> HPE provides supported
replacement parts including engineering improvements; parts past
maximum supported lifetime are not provided.</li>
<li><strong>How to Purchase:</strong> Services sold by HPE and HPE
Authorized Service Partners. Customers from commercial resellers see
<a href="https://ssc.hpe.com/portal/site/ssc/">ssc.hpe.com</a>.</li>
</ul>
<h3>AI Powered and Digitally Enabled Support Experience</h3>
<p>Sign into the HPE Support Center for streamlined self-serve case
creation, knowledge recommendations, personalized task alerts, and an
intelligent virtual agent with seamless transition to a live support
agent when needed.
See <a href="https://support.hpe.com/hpesc/public/home/signin">support.hpe.com</a>.</p>
<h3>Consume IT On Your Terms</h3>
<p><a href="https://www.hpe.com/GreenLake">HPE GreenLake</a>
edge-to-cloud platform brings the cloud experience directly to your
apps and data — at the edge, in colocations, or in your data center.
Pay-per-use, scalable, self-service experience.</p>
<h2 id="summary-of-changes">Summary of Changes</h2>
<table>
<thead><tr><th>Date</th><th>Version</th><th>Action</th><th>Description</th></tr></thead>
<tbody>
<tr><td>02-Feb-2026</td><td>Version 4</td><td>Changed</td><td>HPE Rebranding applied</td></tr>
<tr><td>05-May-2025</td><td>Version 3</td><td>Changed</td><td>Branding updates, Overview and Standard Features sections updated.</td></tr>
<tr><td>13-Jan-2025</td><td>Version 2</td><td>Changed</td><td>Overview and Standard Features sections updated.</td></tr>
<tr><td>02-Dec-2024</td><td>Version 1</td><td>New</td><td>New QuickSpecs.</td></tr>
</tbody>
</table>
<hr />
<p><em>© Copyright 2026 Hewlett Packard Enterprise Development LP.
a50004260enw, 16864 - Worldwide - V4 - 02-February-2026.</em></p>
</div>
+15 -1
View File
@@ -307,8 +307,22 @@ def main() -> int:
s = _session()
total = 0
for b in bundles:
if b.get("mode") == "single":
mode = b.get("mode")
if mode == "single":
total += scrape_single_bundle(s, b, args.force)
elif mode == "html-file":
# Live-scrape HPE collateral (QuickSpecs) via curl_cffi; falls back
# to scrape/quickspecs/<doc_id>.html fixture if the edge blocks us.
from scrape.quickspecs import scrape_quickspecs
ok = scrape_quickspecs(
doc_id=b["doc_id"], bundle_id=b["slug"],
title=b.get("title", b["doc_id"]),
version=b.get("version"),
product=b.get("product", "QuickSpecs"),
source_url=b.get("source_url"),
force=args.force,
)
total += 1 if ok else 0
else:
total += scrape_toc_bundle(s, b, args.force, args.concurrency)
print(f"scraped {total} new/updated pages", file=sys.stderr)