initial: docs-mcp-template — build guide + scaffolded server
Template for building hosted MCP servers over a product's public
documentation. Distilled from one production build; everything
product-specific has been factored out.
Contents:
- PLAN.md — comprehensive build guide. 13 phases from project
skeleton through weekly_digest. Includes the gotchas
("fetch-depth: 0 always", reranker per-pair token limit,
Cloudflare body cap, dash-not-bash on Gitea runners), the
decisions worth carrying forward, and a per-product
customization checklist.
- CLAUDE.md — guidance for Claude Code working in a clone of this
template. Phase identification table, conventions (env-gating +
operator confirmation for side-effecting tools, defensive
fallback for retrieval components), common commands.
- README.md — quick-start summary.
Scaffolded code (all signature-stable, with NotImplementedError
stubs where phase-specific work is required):
docs_mcp/server.py FastMCP server, stateless_http=True, with
search_docs / get_page / list_versions
baseline tools and commented stubs for the
rest of the phase set.
docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation,
90-day retention. Reusable as-is.
rag/embeddings.py Ollama embedder (nomic-embed-text default),
load-balanced across N URLs. Reusable.
rag/chunk.py Paragraph-aware chunker with synthetic
chunk 0. Per-product tunable.
rag/index.py Chroma + BM25 builder. --rebuild and
--bm25-only flags.
rag/bm25.py SQLite FTS5 lexical index. Reusable.
scrape/changelog.py --cached / --ref / --json / --history-out.
Reusable.
scrape/README.md What you write per-product.
eval/queries.jsonl.example
Curate ~25 hand-labeled queries here.
eval/retrievers.py Retriever protocol + stub classes.
eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton.
scripts/usage_report.py
Standalone log analyzer; the
FOLLOW-UP CHECKS pattern noted in the
module docstring.
scripts/registry_gc.py
Gitea container registry cleanup. Reusable.
Deployment + CI:
Dockerfile Python 3.12-slim; COPY corpus + chroma
+ bm25 last for cache efficiency.
deploy/docker-compose.yml MCP + reranker sidecar + Watchtower.
Templated with <placeholders>.
.gitea/workflows/refresh.yml Weekly cron + manual dispatch.
fetch-depth: 0, retry-on-race,
three-tag image scheme.
.gitea/workflows/image-only.yml Code-only ship cycle, ~18min.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+277
@@ -0,0 +1,277 @@
|
||||
"""SQLite FTS5-backed BM25 retrieval over the same chunks Chroma indexes.
|
||||
|
||||
Hybrid retrieval (BM25 + dense + Reciprocal Rank Fusion) addresses a
|
||||
limit of single-tower dense embeddings: when a query has specific
|
||||
technical terms (filenames, language names, error codes, API paths),
|
||||
the dense embedding doesn't bridge from the query into a short
|
||||
code-focused chunk. The chunk loses to the much larger crowd of
|
||||
prose chunks that semantically match the query topic.
|
||||
|
||||
BM25 handles this directly. Lexical overlap on rare terms ("python",
|
||||
"create_vpg.py", "PROTECTED_SITE_ID", "applyUpgrade") scores those
|
||||
chunks high. Fused with the dense ranking via RRF, the hybrid result
|
||||
is strictly better than either alone for the queries we've seen
|
||||
fail.
|
||||
|
||||
Why SQLite FTS5:
|
||||
- In the stdlib. Zero new deps.
|
||||
- On-disk. Same persistence model as Chroma — Docker COPY the dir,
|
||||
`rag.index --rebuild` regenerates from corpus.
|
||||
- Built-in `bm25()` ranking function. No knobs to tune that matter
|
||||
for our use case (k1=1.2, b=0.75 defaults are fine).
|
||||
- Builds 70k+ chunks in seconds. Faster than the Chroma rebuild's
|
||||
embedding step by 100×, so it adds basically nothing to the
|
||||
full-rebuild cycle.
|
||||
|
||||
Schema is two tables to keep filtering clean. FTS5 doesn't filter
|
||||
nicely on its own columns; the content_rowid pattern keeps an
|
||||
external metadata table joinable by rowid:
|
||||
|
||||
CREATE TABLE chunks_meta (
|
||||
rowid INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
id TEXT UNIQUE,
|
||||
bundle_id TEXT, page_id TEXT, version TEXT,
|
||||
platform TEXT, product TEXT, ordinal INTEGER
|
||||
);
|
||||
CREATE VIRTUAL TABLE chunks_fts USING fts5(
|
||||
text,
|
||||
tokenize = 'porter unicode61 remove_diacritics 2',
|
||||
content = 'chunks_meta',
|
||||
content_rowid = 'rowid'
|
||||
);
|
||||
|
||||
Queries:
|
||||
|
||||
SELECT m.id, bm25(chunks_fts) AS score
|
||||
FROM chunks_meta m
|
||||
JOIN chunks_fts f ON m.rowid = f.rowid
|
||||
WHERE f MATCH ?
|
||||
AND m.version = ? -- optional metadata filter
|
||||
ORDER BY bm25(chunks_fts) -- lower = better in FTS5
|
||||
LIMIT ?;
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import re
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
# Default location: bm25/<product>_docs.db at the repo root, next to chroma/.
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
DEFAULT_DB_DIR = ROOT / "bm25"
|
||||
DEFAULT_DB_NAME = "<product>_docs.db"
|
||||
|
||||
# Columns we expose as filterable metadata. Mirrors what _build_where in
|
||||
# docs_mcp/server.py accepts so the same filter dicts work for both
|
||||
# Chroma and BM25 without per-retriever translation in the caller.
|
||||
FILTER_COLUMNS = ("bundle_id", "page_id", "version", "platform", "product", "ordinal")
|
||||
|
||||
|
||||
# Allowlist tokenizer for free-text queries. FTS5's parser chokes on lots
|
||||
# of punctuation we routinely see in user queries (".10.9", "?", "VPG's",
|
||||
# em-dash, etc.). Rather than blocklist every operator, just keep
|
||||
# alphanumerics + a few separators and replace everything else with a
|
||||
# space. This loses the ability to phrase-search ("exact match") but we
|
||||
# don't expose that to users anyway — they ask natural-language questions
|
||||
# and want the answer, not a Boolean DSL.
|
||||
_KEEP_RE = re.compile(r"[^A-Za-z0-9_\s]")
|
||||
# FTS5 reserves these Boolean operator KEYWORDS at the token level —
|
||||
# stripping them avoids accidental phrase-query behavior when a user
|
||||
# query happens to contain bare "AND", "OR", "NOT", "NEAR".
|
||||
_BOOLEAN_KW_RE = re.compile(r"(?<!\w)(AND|OR|NOT|NEAR)(?!\w)")
|
||||
|
||||
|
||||
def _sanitize_query(text: str) -> str:
|
||||
"""Reduce a natural-language query to an FTS5 OR-of-tokens query.
|
||||
|
||||
Two transformations:
|
||||
|
||||
1. Non-alphanumeric → space (drops punctuation; "10.9?" becomes
|
||||
"10 9"). Lets us handle versions, parens, question marks, etc.
|
||||
without inviting FTS5 parse errors.
|
||||
2. Boolean keywords stripped (FTS5 reserves AND/OR/NOT/NEAR).
|
||||
3. Tokens explicitly OR'd. FTS5's default is AND-of-tokens — for
|
||||
any non-trivial natural-language query that means zero hits
|
||||
(no chunk contains every word). OR semantics is what we want:
|
||||
BM25 already weights documents containing more query terms
|
||||
higher, so we don't lose precision, but we DO gain recall.
|
||||
"""
|
||||
cleaned = _KEEP_RE.sub(" ", text)
|
||||
cleaned = _BOOLEAN_KW_RE.sub(" ", cleaned)
|
||||
tokens = cleaned.split()
|
||||
if not tokens:
|
||||
return ""
|
||||
return " OR ".join(tokens)
|
||||
|
||||
|
||||
def _where_to_sql(where: dict | None) -> tuple[str, list[Any]]:
|
||||
"""Translate a Chroma-shaped filter dict into a SQL fragment + params.
|
||||
|
||||
Accepts the same shapes ``docs_mcp.server._build_where`` produces:
|
||||
|
||||
None → ("", [])
|
||||
{"version": "10.9"} → ("AND m.version = ?", ["10.9"])
|
||||
{"$and": [{...}, {...}]} → ("AND m.X = ? AND m.Y = ?", [...])
|
||||
|
||||
Unknown keys are silently dropped (defensive — better to over-match
|
||||
than to crash on a filter we don't know).
|
||||
"""
|
||||
if not where:
|
||||
return "", []
|
||||
parts: list[str] = []
|
||||
params: list[Any] = []
|
||||
|
||||
def _emit_eq(cond: dict[str, Any]) -> None:
|
||||
for k, v in cond.items():
|
||||
if k in FILTER_COLUMNS:
|
||||
parts.append(f"m.{k} = ?")
|
||||
params.append(v)
|
||||
|
||||
if "$and" in where:
|
||||
for sub in where["$and"]:
|
||||
_emit_eq(sub)
|
||||
else:
|
||||
_emit_eq(where)
|
||||
if not parts:
|
||||
return "", []
|
||||
return "AND " + " AND ".join(parts), params
|
||||
|
||||
|
||||
class BM25Index:
|
||||
"""Thin wrapper around an FTS5-backed sqlite db.
|
||||
|
||||
Single-writer model. Reads are connection-per-call (sqlite handles
|
||||
concurrency through file locks; for our read-heavy workload that's
|
||||
fine and avoids cross-thread connection sharing issues with the MCP
|
||||
server's request handlers).
|
||||
"""
|
||||
|
||||
def __init__(self, db_path: Path | None = None):
|
||||
self.db_path = Path(db_path) if db_path else (DEFAULT_DB_DIR / DEFAULT_DB_NAME)
|
||||
|
||||
# -- build ----------------------------------------------------------
|
||||
|
||||
def build(self, records: list[dict]) -> int:
|
||||
"""Rebuild the index from scratch from `records`.
|
||||
|
||||
`records` is the same list ``rag.index.page_records`` produces:
|
||||
``[{"id": ..., "text": ..., "metadata": {...}}, ...]``. Bulk
|
||||
insert wrapped in a transaction — single-digit seconds for the
|
||||
full 73k-chunk corpus.
|
||||
"""
|
||||
self.db_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
# Drop and recreate. Idempotent rebuild.
|
||||
if self.db_path.exists():
|
||||
self.db_path.unlink()
|
||||
with sqlite3.connect(self.db_path) as con:
|
||||
con.executescript(self._schema_sql())
|
||||
con.executemany(
|
||||
"INSERT INTO chunks_meta (id, bundle_id, page_id, version, "
|
||||
"platform, product, ordinal) VALUES (?, ?, ?, ?, ?, ?, ?)",
|
||||
[
|
||||
(
|
||||
r["id"],
|
||||
r["metadata"].get("bundle_id") or "",
|
||||
r["metadata"].get("page_id") or "",
|
||||
r["metadata"].get("version") or "",
|
||||
r["metadata"].get("platform") or "",
|
||||
r["metadata"].get("product") or "",
|
||||
int(r["metadata"].get("ordinal") or 0),
|
||||
)
|
||||
for r in records
|
||||
],
|
||||
)
|
||||
# Populate the FTS5 contentless-ish table by rowid. We populated
|
||||
# chunks_meta first; rowids align with insertion order.
|
||||
con.executemany(
|
||||
"INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
|
||||
[
|
||||
(i + 1, r["text"])
|
||||
for i, r in enumerate(records)
|
||||
],
|
||||
)
|
||||
con.commit()
|
||||
log.info("bm25: indexed %d chunks → %s", len(records), self.db_path)
|
||||
return len(records)
|
||||
|
||||
# -- query ----------------------------------------------------------
|
||||
|
||||
def query(
|
||||
self,
|
||||
text: str,
|
||||
n: int = 200,
|
||||
where: dict | None = None,
|
||||
) -> list[tuple[str, float]]:
|
||||
"""Return up to `n` (chunk_id, bm25_score) pairs, lowest score first.
|
||||
|
||||
FTS5's bm25() returns NEGATIVE numbers — more relevant docs have
|
||||
smaller (more negative) scores. We order ASC so the first row is
|
||||
the most relevant. Callers that need a "rank" should enumerate
|
||||
the returned list.
|
||||
"""
|
||||
sanitized = _sanitize_query(text)
|
||||
if not sanitized:
|
||||
return []
|
||||
where_sql, params = _where_to_sql(where)
|
||||
# FTS5 MATCH wants the unaliased table name on its left, so we use
|
||||
# chunks_fts (no alias) and JOIN by rowid against chunks_meta.
|
||||
sql = (
|
||||
"SELECT m.id, bm25(chunks_fts) AS score "
|
||||
"FROM chunks_fts "
|
||||
"JOIN chunks_meta m ON m.rowid = chunks_fts.rowid "
|
||||
f"WHERE chunks_fts MATCH ? {where_sql} "
|
||||
"ORDER BY bm25(chunks_fts) "
|
||||
"LIMIT ?"
|
||||
)
|
||||
try:
|
||||
with sqlite3.connect(self.db_path) as con:
|
||||
cur = con.execute(sql, [sanitized, *params, n])
|
||||
return [(row[0], float(row[1])) for row in cur.fetchall()]
|
||||
except sqlite3.OperationalError as e:
|
||||
# FTS5 syntax error (rare after sanitization) or db missing.
|
||||
# Caller decides whether to fall back to dense-only.
|
||||
log.warning("bm25 query failed (%s); query=%r", e, sanitized[:80])
|
||||
return []
|
||||
|
||||
def exists(self) -> bool:
|
||||
"""Cheap probe — does the index file exist on disk?"""
|
||||
return self.db_path.exists()
|
||||
|
||||
def count(self) -> int:
|
||||
"""Number of chunks indexed. 0 if the db is missing or empty."""
|
||||
if not self.exists():
|
||||
return 0
|
||||
try:
|
||||
with sqlite3.connect(self.db_path) as con:
|
||||
return con.execute("SELECT COUNT(*) FROM chunks_meta").fetchone()[0]
|
||||
except sqlite3.OperationalError:
|
||||
return 0
|
||||
|
||||
# -- schema ---------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _schema_sql() -> str:
|
||||
return """
|
||||
CREATE TABLE chunks_meta (
|
||||
rowid INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
id TEXT UNIQUE NOT NULL,
|
||||
bundle_id TEXT,
|
||||
page_id TEXT,
|
||||
version TEXT,
|
||||
platform TEXT,
|
||||
product TEXT,
|
||||
ordinal INTEGER
|
||||
);
|
||||
CREATE INDEX idx_meta_version ON chunks_meta(version);
|
||||
CREATE INDEX idx_meta_platform ON chunks_meta(platform);
|
||||
CREATE INDEX idx_meta_bundle ON chunks_meta(bundle_id);
|
||||
|
||||
CREATE VIRTUAL TABLE chunks_fts USING fts5(
|
||||
text,
|
||||
tokenize = 'porter unicode61 remove_diacritics 2'
|
||||
);
|
||||
"""
|
||||
+126
@@ -0,0 +1,126 @@
|
||||
"""Markdown chunker — paragraph-aware, ~400-600 token target.
|
||||
|
||||
Adjust the chunking strategy per product if your page format differs
|
||||
significantly from prose. The output shape (id, text, metadata) is
|
||||
fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
|
||||
change that.
|
||||
|
||||
The key knob you'll tune per product is chunk-0. Dense retrieval lands
|
||||
on chunk 0 first for most queries. Make it a synthetic chunk built
|
||||
from:
|
||||
|
||||
- the page title (as natural-language H1)
|
||||
- a 1-sentence task description (you'll have to generate this — for
|
||||
pages that already have a "## Overview" or "## Introduction" the
|
||||
first sentence usually works)
|
||||
- a keyword bag of important terms (filenames, API names, error
|
||||
codes — the rare technical tokens that BM25 lights up on)
|
||||
|
||||
Without a rich chunk 0, dense retrieval gets dominated by the much
|
||||
larger prose body, and short pages (script examples, reference cards)
|
||||
get buried.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from typing import Iterator
|
||||
|
||||
|
||||
# Approximate token estimate from char count. Tunable — set per
|
||||
# embedder if the default 4 chars/token is wrong.
|
||||
CHARS_PER_TOKEN = 4
|
||||
TARGET_TOKENS = 500
|
||||
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
|
||||
|
||||
|
||||
def estimate_tokens(text: str) -> int:
|
||||
return max(1, len(text) // CHARS_PER_TOKEN)
|
||||
|
||||
|
||||
def split_paragraphs(md: str) -> list[str]:
|
||||
"""Split markdown into paragraph-ish blocks.
|
||||
|
||||
Keeps fenced code blocks together (don't slice through ```).
|
||||
Headings start new paragraphs.
|
||||
"""
|
||||
blocks: list[str] = []
|
||||
current: list[str] = []
|
||||
in_fence = False
|
||||
for line in md.splitlines(keepends=True):
|
||||
stripped = line.strip()
|
||||
if stripped.startswith("```"):
|
||||
in_fence = not in_fence
|
||||
current.append(line)
|
||||
continue
|
||||
if in_fence:
|
||||
current.append(line)
|
||||
continue
|
||||
if stripped.startswith("#"):
|
||||
if current:
|
||||
blocks.append("".join(current).strip())
|
||||
current = []
|
||||
current.append(line)
|
||||
continue
|
||||
if not stripped and current and not "".join(current).strip().endswith("\n\n"):
|
||||
current.append(line)
|
||||
blocks.append("".join(current).strip())
|
||||
current = []
|
||||
continue
|
||||
current.append(line)
|
||||
if current:
|
||||
blocks.append("".join(current).strip())
|
||||
return [b for b in blocks if b]
|
||||
|
||||
|
||||
def chunks_from_page(
|
||||
text: str,
|
||||
page_id: str,
|
||||
metadata: dict,
|
||||
) -> Iterator[dict]:
|
||||
"""Yield chunk dicts ready for index.py to upsert.
|
||||
|
||||
The synthetic chunk 0 is the per-product customization point. The
|
||||
default below is a simple title + body-first-paragraph; rewrite
|
||||
for richer retrieval signal (see module docstring).
|
||||
"""
|
||||
paragraphs = split_paragraphs(text)
|
||||
if not paragraphs:
|
||||
return
|
||||
|
||||
# ----- Chunk 0: synthetic anchor for dense retrieval ---------
|
||||
title = metadata.get("title") or page_id
|
||||
first_para = next((p for p in paragraphs if not p.startswith("#")), "")
|
||||
chunk0_body = (
|
||||
f"# {title}\n\n"
|
||||
f"{first_para[:300]}"
|
||||
# TODO per product: append a keyword bag here (filenames,
|
||||
# API names, error codes) for BM25 + dense joint coverage.
|
||||
)
|
||||
yield {
|
||||
"id": f"{metadata['bundle_id']}::{page_id}::0",
|
||||
"text": chunk0_body,
|
||||
"metadata": {**metadata, "ordinal": 0},
|
||||
}
|
||||
|
||||
# ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
|
||||
ordinal = 1
|
||||
buf: list[str] = []
|
||||
buf_chars = 0
|
||||
for p in paragraphs:
|
||||
if buf_chars + len(p) > TARGET_CHARS and buf:
|
||||
yield {
|
||||
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
||||
"text": "\n\n".join(buf),
|
||||
"metadata": {**metadata, "ordinal": ordinal},
|
||||
}
|
||||
ordinal += 1
|
||||
buf = []
|
||||
buf_chars = 0
|
||||
buf.append(p)
|
||||
buf_chars += len(p)
|
||||
if buf:
|
||||
yield {
|
||||
"id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
|
||||
"text": "\n\n".join(buf),
|
||||
"metadata": {**metadata, "ordinal": ordinal},
|
||||
}
|
||||
@@ -0,0 +1,72 @@
|
||||
"""Embedding function for Chroma — Ollama-hosted nomic-embed-text by default.
|
||||
|
||||
Swappable: implement the same `embedding_function()` interface returning
|
||||
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
|
||||
|
||||
Defaults (override via env):
|
||||
OLLAMA_URL one or more comma-separated URLs (load-balanced)
|
||||
EMBED_MODEL model name; default 'nomic-embed-text'
|
||||
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
import httpx
|
||||
from chromadb import EmbeddingFunction, Documents, Embeddings
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
|
||||
"http://localhost:11434").split(",") if u.strip()]
|
||||
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
|
||||
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
|
||||
|
||||
|
||||
class OllamaEmbeddings(EmbeddingFunction):
|
||||
"""Calls /api/embed across N Ollama endpoints, naive round-robin.
|
||||
|
||||
For indexing throughput on multiple GPUs, run one Ollama container
|
||||
per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
|
||||
in OLLAMA_URL — the embedder picks the next endpoint per batch.
|
||||
"""
|
||||
|
||||
def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
|
||||
self.urls = urls
|
||||
self.model = model
|
||||
self._next = 0
|
||||
|
||||
def __call__(self, input: Documents) -> Embeddings:
|
||||
url = self.urls[self._next % len(self.urls)]
|
||||
self._next += 1
|
||||
with httpx.Client(timeout=300) as c:
|
||||
r = c.post(f"{url}/api/embed",
|
||||
json={"model": self.model, "input": list(input)})
|
||||
r.raise_for_status()
|
||||
data = r.json()
|
||||
return data.get("embeddings") or []
|
||||
|
||||
def name(self) -> str: # newer chromadb requires this
|
||||
return f"ollama:{self.model}"
|
||||
|
||||
@staticmethod
|
||||
def build_from_config(config: dict) -> "OllamaEmbeddings": # newer chromadb
|
||||
return OllamaEmbeddings(
|
||||
urls=config.get("urls", OLLAMA_URLS),
|
||||
model=config.get("model", EMBED_MODEL),
|
||||
)
|
||||
|
||||
def get_config(self) -> dict: # newer chromadb
|
||||
return {"urls": self.urls, "model": self.model}
|
||||
|
||||
def default_space(self) -> str:
|
||||
return "cosine"
|
||||
|
||||
def supported_spaces(self) -> list[str]:
|
||||
return ["cosine", "l2", "ip"]
|
||||
|
||||
|
||||
def embedding_function() -> EmbeddingFunction:
|
||||
return OllamaEmbeddings()
|
||||
+134
@@ -0,0 +1,134 @@
|
||||
"""Build Chroma (and optionally BM25) indexes from corpus on disk.
|
||||
|
||||
Reads `corpus/<bundle>/<page>.{md,json}`, chunks each page, upserts
|
||||
into Chroma. With --rebuild, drops + recreates the collection (clean
|
||||
state). With --bm25-only, skips Chroma and rebuilds only the FTS5
|
||||
index — useful for fast iteration when chunking didn't change.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Iterator
|
||||
|
||||
import chromadb
|
||||
from chromadb.config import Settings
|
||||
|
||||
from .chunk import chunks_from_page
|
||||
from .embeddings import embedding_function
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
CORPUS = ROOT / "corpus"
|
||||
CHROMA_DIR = ROOT / "chroma"
|
||||
|
||||
# Collection name — convention: <product>_docs. Override via env if needed.
|
||||
import os
|
||||
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
|
||||
COLLECTION = f"{PRODUCT_NAME}_docs"
|
||||
|
||||
|
||||
def page_records() -> Iterator[dict]:
|
||||
"""Walk corpus/, yield chunks for every page."""
|
||||
if not CORPUS.exists():
|
||||
log.error("corpus/ doesn't exist; run the scraper first")
|
||||
return
|
||||
for bundle_dir in sorted(CORPUS.iterdir()):
|
||||
if not bundle_dir.is_dir() or bundle_dir.name.startswith("."):
|
||||
continue
|
||||
for md_path in sorted(bundle_dir.glob("*.md")):
|
||||
page_id = md_path.stem
|
||||
sidecar = md_path.with_suffix(".json")
|
||||
if not sidecar.exists():
|
||||
log.warning("skipping %s — no JSON sidecar", md_path)
|
||||
continue
|
||||
md = md_path.read_text()
|
||||
meta = json.loads(sidecar.read_text())
|
||||
# Surface common filter fields at the chunk-metadata level
|
||||
# so Chroma's `where` filter can use them.
|
||||
base_meta = {
|
||||
"bundle_id": bundle_dir.name,
|
||||
"page_id": page_id,
|
||||
"title": meta.get("title") or "",
|
||||
"version": meta.get("version") or "",
|
||||
"platform": meta.get("platform") or "",
|
||||
"product": meta.get("product") or "",
|
||||
}
|
||||
yield from chunks_from_page(md, page_id, base_meta)
|
||||
|
||||
|
||||
def upsert_to_chroma(records: list[dict]) -> int:
|
||||
client = chromadb.PersistentClient(
|
||||
path=str(CHROMA_DIR),
|
||||
settings=Settings(anonymized_telemetry=False),
|
||||
)
|
||||
# Drop + recreate for --rebuild semantics
|
||||
try:
|
||||
client.delete_collection(COLLECTION)
|
||||
except Exception:
|
||||
pass
|
||||
col = client.create_collection(COLLECTION, embedding_function=embedding_function())
|
||||
|
||||
BATCH = 64
|
||||
total = 0
|
||||
for i in range(0, len(records), BATCH):
|
||||
chunk = records[i:i + BATCH]
|
||||
col.upsert(
|
||||
ids=[r["id"] for r in chunk],
|
||||
documents=[r["text"] for r in chunk],
|
||||
metadatas=[r["metadata"] for r in chunk],
|
||||
)
|
||||
total += len(chunk)
|
||||
log.info("upserted %d / %d chunks", total, len(records))
|
||||
return total
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--rebuild", action="store_true",
|
||||
help="Drop and recreate the Chroma collection.")
|
||||
p.add_argument("--bm25-only", action="store_true",
|
||||
help="Rebuild only the BM25 index, skip Chroma.")
|
||||
p.add_argument("--bm25-db", type=Path,
|
||||
default=ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db",
|
||||
help="Path to the BM25 sqlite db.")
|
||||
args = p.parse_args()
|
||||
|
||||
log.info("reading corpus from %s", CORPUS)
|
||||
t0 = time.time()
|
||||
records = list(page_records())
|
||||
log.info("loaded %d chunks in %.1fs", len(records), time.time() - t0)
|
||||
|
||||
if args.bm25_only:
|
||||
from .bm25 import BM25Index
|
||||
log.info("--bm25-only: building FTS5 only")
|
||||
BM25Index(args.bm25_db).build(records)
|
||||
return 0
|
||||
|
||||
if not args.rebuild:
|
||||
log.info("no --rebuild; nothing to do. (Use --rebuild to upsert.)")
|
||||
return 0
|
||||
|
||||
t_c = time.time()
|
||||
n = upsert_to_chroma(records)
|
||||
log.info("chroma: %d chunks in %.1fs", n, time.time() - t_c)
|
||||
|
||||
# Build BM25 too — see PLAN.md Phase 8. Safe to remove this block
|
||||
# for products that don't need hybrid retrieval.
|
||||
try:
|
||||
from .bm25 import BM25Index
|
||||
t_b = time.time()
|
||||
BM25Index(args.bm25_db).build(records)
|
||||
log.info("bm25 done in %.1fs", time.time() - t_b)
|
||||
except ImportError:
|
||||
log.info("rag.bm25 not available — skipping BM25 build")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
Reference in New Issue
Block a user