Phase 2: chunking + parallel Ollama embeddings + Chroma + BM25 indexes

End-to-end RAG pipeline for the pesticide-labels corpus. From the 4,066 labels on USB, the indexer produces 216,467 chunks, embeds them via N parallel Ollama endpoints, upserts to Chroma, and builds a BM25 lexical index.

## Files

- rag/index.py: adapted to labels schema (source / source_key / epa_reg_no / product_name / product_class / registrant / signal_word / active_ingredients flattened for Chroma where-filter); honors PPLS_CORPUS_ROOT (corpus on USB) and PPLS_CHROMA_DIR; upsert batch size auto-tuned to 64 * N URLs; --limit + --source flags for incremental work.
- rag/chunk.py: label-aware. ALL-CAPS section heading detector (heuristic) for EPA labels alongside markdown `#` headings. TARGET_CHARS=2000 (~500 tokens), MAX_CHUNK_CHARS=4000 (~1000 tokens) hard cap with _force_split sentence/char fallback to defend against monolithic crop+rate tables. Chunk 0 is a synthetic anchor with product name, EPA Reg No, registrant, signal word, product class, active ingredients + keyword bag for joint dense/BM25 retrieval.
- rag/embeddings.py: parallel-dispatch across N Ollama URLs via ThreadPoolExecutor. Each __call__ stride-slices input into N shards, fires N concurrent HTTP requests, joins in original order. Bisect-resilient on 400 (context-length): recursively splits the failing shard down to single doc, logs+drops single bad doc with zero-vector placeholder so Chroma upsert never sees a gap. Real HTTP/connection errors still propagate.
- requirements.txt: chromadb already pinned via template.

## Run

    PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
    OLLAMA_URL=http://host1:11434,http://host2:11434,... \
    PRODUCT_NAME=ppls \
    python -m rag.index --rebuild

## Build stats

- 216,467 chunks across 4,066 labels (~53 chunks/label avg)
- Wall time: 75.7 min on 4 parallel GPU-backed Ollama endpoints (Bayer-Crop / BASF / Corteva / FMC / Nufarm / Syngenta / etc. chemistry; production Ollama on trashpanda + 2× 192.168.0.2 + 1× Windows 192.168.0.125)
- 473 bisect-drops (0.22%), all from monolithic-table sections in 1970s-90s scanned PDFs whose pypdf extracts tokenized past the model's context. Acceptable; the dropped chunks were garbled OCR with no useful content.
- Chroma: 2.2 GB persistent SQLite at ./chroma/
- BM25: 416 MB SQLite FTS5 at ./bm25/ppls_docs.db

## Smoke-test queries (top-3 dense-only)

"what can I spray on soybeans to control waterhemp"
  → Rage (glyphosate+carfentrazone), Sencor (metribuzin)
"REI for dicamba on corn"
  → Nufarm Credit (DICAMBA tank-mix restrictions section)
"fungicide for wheat head scab"
  → MCW 710 SC (azoxystrobin+tebuconazole), Sercadis (fluxapyroxad)

Distances 0.16-0.23. Dense-only quality is OK-not-great in spots (exactly the failure mode Phase 6 reranker + Phase 8 hybrid BM25 fusion address).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 09:56:49 -04:00
parent 92a95d5e78
commit 38141c362e
3 changed files with 431 additions and 150 deletions
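
A minimal sketch of the stride-slice and rejoin that the message describes for rag/embeddings.py, using plain Python lists and no HTTP. The helper names `stride_shards` and `interleave` are hypothetical and do not appear in the commit; only the index arithmetic (`docs[i::n]` on the way out, `i + offset * n` on the way back) is taken from the diff.

```python
from typing import Optional


def stride_shards(docs: list[str], n: int) -> dict[int, list[str]]:
    """Shard i receives docs[i], docs[i + n], docs[i + 2n], ...

    Empty shards (when len(docs) < n) are skipped, as in the diff.
    """
    shards: dict[int, list[str]] = {}
    for shard_idx in range(n):
        shard_docs = docs[shard_idx::n]
        if shard_docs:
            shards[shard_idx] = shard_docs
    return shards


def interleave(shards: dict[int, list[str]], n: int, total: int) -> list[Optional[str]]:
    """Reverse the stride: item `offset` of shard i goes back to slot i + offset * n."""
    out: list[Optional[str]] = [None] * total
    for shard_idx, items in shards.items():
        for offset, item in enumerate(items):
            out[shard_idx + offset * n] = item
    return out


docs = [f"doc{i}" for i in range(7)]
shards = stride_shards(docs, 3)
restored = interleave(shards, 3, len(docs))
print(shards)
print(restored == docs)
```

Stride slicing keeps shard sizes within one document of each other, so no endpoint sits idle while another works through a long tail. In the real embedder, a slot left as `None` after the join would mean an endpoint returned fewer vectors than documents, which is why the diff raises on any gap.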
@@ -1,10 +1,14 @@
 """Embedding function for Chroma — Ollama-hosted nomic-embed-text by default.

+Supports parallel dispatch across multiple Ollama endpoints. Each call
+splits its input across the configured URLs and embeds them concurrently
+via a thread pool; results are recombined in original order.
+
 Swappable: implement the same `embedding_function()` interface returning
 a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.

 Defaults (override via env):
-  OLLAMA_URL    one or more comma-separated URLs (load-balanced)
+  OLLAMA_URL    one or more comma-separated URLs (parallel-dispatched)
  EMBED_MODEL   model name; default 'nomic-embed-text'
  EMBED_DIM     expected embedding dim; default 768 (nomic-embed-text)
 """
@@ -12,6 +16,7 @@ from __future__ import annotations

 import os
 import logging
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import Any

 import httpx
@@ -23,30 +28,114 @@ OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
               "http://localhost:11434").split(",") if u.strip()]
 EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
 EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
+HTTP_TIMEOUT = float(os.environ.get("EMBED_TIMEOUT", "300"))


 class OllamaEmbeddings(EmbeddingFunction):
-    """Calls /api/embed across N Ollama endpoints, naive round-robin.
+    """Calls /api/embed across N Ollama endpoints **in parallel**.

-    For indexing throughput on multiple GPUs, run one Ollama container
-    per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
-    in OLLAMA_URL — the embedder picks the next endpoint per batch.
+    Each __call__ splits its input documents into len(urls) shards via
+    stride slicing, fires len(urls) concurrent HTTP requests, then
+    interleaves the results back to original order. With N GPU-backed
+    Ollamas, throughput scales close to Nx (Chroma upsert overhead and
+    slowest-shard barrier cap it shy of true linear).
+
+    For best per-call efficiency, size batches at ~64 docs per shard
+    (i.e., BATCH = 64 * N in the indexer) so each Ollama does real
+    work each round.
    """

    def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
+        if not urls:
+            raise ValueError("OllamaEmbeddings requires at least one URL")
        self.urls = urls
        self.model = model
-        self._next = 0
+        # One persistent thread per URL. Embedding throughput is HTTP-bound,
+        # so the threads are essentially free.
+        self._pool = ThreadPoolExecutor(
+            max_workers=len(urls),
+            thread_name_prefix="ollama-embed",
+        )

    def __call__(self, input: Documents) -> Embeddings:
-        url = self.urls[self._next % len(self.urls)]
-        self._next += 1
-        with httpx.Client(timeout=300) as c:
-            r = c.post(f"{url}/api/embed",
-                       json={"model": self.model, "input": list(input)})
-            r.raise_for_status()
-            data = r.json()
-        return data.get("embeddings") or []
+        docs = list(input)
+        n = len(self.urls)
+        if not docs:
+            return []
+        if n == 1:
+            return self._embed_one(self.urls[0], docs)
+
+        # Stride-slice into n shards so docs are distributed evenly.
+        # Reconstruction reverses the stride via index arithmetic.
+        shards: list[tuple[int, str, list[str]]] = []
+        for shard_idx in range(n):
+            shard_docs = docs[shard_idx::n]
+            if shard_docs:
+                shards.append((shard_idx, self.urls[shard_idx], shard_docs))
+
+        # Parallel dispatch + barrier-wait
+        results: dict[int, list[list[float]]] = {}
+        futures = {
+            self._pool.submit(self._embed_one, url, shard_docs): shard_idx
+            for shard_idx, url, shard_docs in shards
+        }
+        for fut in as_completed(futures):
+            shard_idx = futures[fut]
+            results[shard_idx] = fut.result()
+
+        # Interleave back to original order
+        out: list[list[float] | None] = [None] * len(docs)
+        for shard_idx, shard_embeds in results.items():
+            for offset, embed in enumerate(shard_embeds):
+                out[shard_idx + offset * n] = embed
+        # Surface any missing slot loudly rather than silently returning Nones
+        if any(v is None for v in out):
+            missing = [i for i, v in enumerate(out) if v is None]
+            raise RuntimeError(
+                f"embedding gap: {len(missing)} missing slot(s) after parallel "
+                f"join; first missing index={missing[0]}"
+            )
+        return out  # type: ignore[return-value]
+
+    def _embed_one(self, url: str, docs: list[str]) -> list[list[float]]:
+        """Single HTTP call to one Ollama. On a 400 (typically one doc in
+        the batch exceeded the model's context), bisect the batch until
+        the offending doc(s) are isolated, then emit a zero-vector for
+        each bad doc and continue. A 400 never raises; connection errors
+        and other non-2xx statuses propagate to the caller."""
+        if not docs:
+            return []
+        try:
+            with httpx.Client(timeout=HTTP_TIMEOUT) as c:
+                r = c.post(
+                    f"{url}/api/embed",
+                    json={"model": self.model, "input": docs},
+                )
+                if r.status_code == 400:
+                    return self._bisect_400(url, docs, r.text)
+                r.raise_for_status()
+                data = r.json()
+            return data.get("embeddings") or []
+        except httpx.HTTPStatusError:
+            # Anything other than 400 propagates so retries / monitors fire.
+            raise
+
+    def _bisect_400(self, url: str, docs: list[str], err_text: str) -> list[list[float]]:
+        """Recursive bisection: split docs in half, retry each half. If
+        one doc alone still 400s, log it with size + a snippet and
+        return a zero-vector placeholder for that slot (so order is
+        preserved and Chroma upsert succeeds)."""
+        if len(docs) == 1:
+            log.warning(
+                "embed: dropping single bad doc on %s (chars=%d, err=%s); "
+                "snippet=%r",
+                url, len(docs[0]), err_text[:120], docs[0][:80],
+            )
+            return [[0.0] * EMBED_DIM]
+        mid = len(docs) // 2
+        left = self._embed_one(url, docs[:mid])
+        right = self._embed_one(url, docs[mid:])
+        return left + right

    def name(self) -> str:                  # newer chromadb requires this
        return f"ollama:{self.model}"
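
The bisect-on-400 path can be exercised without a live Ollama. The sketch below is illustrative only: `fake_embed`, `ContextTooLong`, `MAX_CHARS`, and the 4-wide `EMBED_DIM` are stand-ins invented for the demo (the commit uses an HTTP 400 from `/api/embed` and a default dimension of 768). It shows the same recursion as `_bisect_400`: halve the failing batch, retry each half, and substitute a zero vector once a single document still fails.

```python
EMBED_DIM = 4   # assumption: tiny dimension so the output is easy to read
MAX_CHARS = 10  # assumption: stand-in for the model's context limit


class ContextTooLong(Exception):
    """Stand-in for the HTTP 400 a real endpoint would return."""


def fake_embed(batch: list[str]) -> list[list[float]]:
    # Like a real batch call, one oversized doc fails the whole request.
    if any(len(doc) > MAX_CHARS for doc in batch):
        raise ContextTooLong
    return [[float(len(doc))] * EMBED_DIM for doc in batch]


def embed_with_bisect(batch: list[str]) -> list[list[float]]:
    if not batch:
        return []
    try:
        return fake_embed(batch)
    except ContextTooLong:
        if len(batch) == 1:
            # Isolated the bad doc: keep its slot with a zero-vector placeholder.
            return [[0.0] * EMBED_DIM]
        mid = len(batch) // 2
        # Left half first, then right, so concatenation preserves input order.
        return embed_with_bisect(batch[:mid]) + embed_with_bisect(batch[mid:])


docs = ["a", "bb", "x" * 50, "ccc", "dddd"]
vectors = embed_with_bisect(docs)
for doc, vec in zip(docs, vectors):
    print(f"{doc[:8]!r:12} -> {vec}")
```

Only the oversized third document ends up as zeros; its neighbours are embedded normally after the batch is split around it. The cost is a few extra requests per bad document, roughly logarithmic in the batch size.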