Files
hvm-docs/scrape/README.md
justin 9ba615c8ee initial: docs-mcp-template — build guide + scaffolded server
Template for building hosted MCP servers over a product's public
documentation. Distilled from one production build; everything
product-specific has been factored out.

Contents:

- PLAN.md — comprehensive build guide. 13 phases from project
  skeleton through weekly_digest. Includes the gotchas
  ("fetch-depth: 0 always", reranker per-pair token limit,
  Cloudflare body cap, dash-not-bash on Gitea runners), the
  decisions worth carrying forward, and a per-product
  customization checklist.

- CLAUDE.md — guidance for Claude Code working in a clone of this
  template. Phase identification table, conventions (env-gating +
  operator confirmation for side-effecting tools, defensive
  fallback for retrieval components), common commands.

- README.md — quick-start summary.

Scaffolded code (all signature-stable, with NotImplementedError
stubs where phase-specific work is required):

  docs_mcp/server.py    FastMCP server, stateless_http=True, with
                        search_docs / get_page / list_versions
                        baseline tools and commented stubs for the
                        rest of the phase set.
  docs_mcp/usage.py     TimedCall telemetry, JSONL, daily rotation,
                        90-day retention. Reusable as-is.
  rag/embeddings.py     Ollama embedder (nomic-embed-text default),
                        load-balanced across N URLs. Reusable.
  rag/chunk.py          Paragraph-aware chunker with synthetic
                        chunk 0. Per-product tunable.
  rag/index.py          Chroma + BM25 builder. --rebuild and
                        --bm25-only flags.
  rag/bm25.py           SQLite FTS5 lexical index. Reusable.
  scrape/changelog.py   --cached / --ref / --json / --history-out.
                        Reusable.
  scrape/README.md      What you write per-product.
  eval/queries.jsonl.example
                        Curate ~25 hand-labeled queries here.
  eval/retrievers.py    Retriever protocol + stub classes.
  eval/run_eval.py      MRR / Recall@K / nDCG@K harness skeleton.
  scripts/usage_report.py
                        Standalone log analyzer; the
                        FOLLOW-UP CHECKS pattern noted in the
                        module docstring.
  scripts/registry_gc.py
                        Gitea container registry cleanup. Reusable.

Deployment + CI:

  Dockerfile               Python 3.12-slim; COPY corpus + chroma
                           + bm25 last for cache efficiency.
  deploy/docker-compose.yml MCP + reranker sidecar + Watchtower.
                           Templated with <placeholders>.
  .gitea/workflows/refresh.yml    Weekly cron + manual dispatch.
                                  fetch-depth: 0, retry-on-race,
                                  three-tag image scheme.
  .gitea/workflows/image-only.yml Code-only ship cycle, ~18min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:18:17 -04:00

60 lines
1.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# scrape/
Product-specific. **You implement this for each product.** The
template gives you the contract; the extraction logic depends on
the upstream doc portal.
See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.
## What you write
At minimum, two scripts:
### `scrape/bundles.py`
Discovers the upstream portal's bundle catalog and writes
`bundles.json` at the repo root. One entry per bundle (versioned doc
set) with the schema in PLAN.md.
```bash
python -m scrape.bundles
```
### `scrape/runner.py`
Scrapes the pages of each bundle (or a single bundle with `--bundle
<slug>`). Writes:
- `corpus/<bundle_id>/<page_id>.md` — extracted markdown body
- `corpus/<bundle_id>/<page_id>.json` — per-page metadata sidecar
```bash
python -m scrape.runner --all --force --concurrency 6
python -m scrape.runner --bundle Admin.VC.HTML.10.9
```
## Tips
- **Sniff before you scrape.** Almost every modern doc portal is an
SPA that calls a backend API. Open the browser's Network tab,
click around, find the underlying JSON. Scraping the API is 10×
cheaper and 100× more reliable than scraping the rendered HTML.
- **Idempotent re-scrapes.** Without `--force`, the runner should
skip pages already on disk so a resume doesn't have to re-fetch
everything. With `--force`, re-fetch every page — that's the
weekly cron mode that catches edits.
- **Respect the portal.** Backoff on 429s. Set a recognizable
user-agent so the portal owner can identify you if they want to.
- **Whitespace normalize.** Markdown that round-trips through HTML
often has extra blank lines. Normalize to a single blank between
paragraphs so diffs are clean (the changelog summary and digest
tools care about line counts).
## What's already reusable
`scrape/changelog.py` is fully product-agnostic and ready to use
as-is. It walks `git diff --name-status` output to produce a
structured summary, and walks `git log` for the digest history
(Phase 13).