rag: resilient embedder — rotate/split on endpoint errors; 4-GPU embed pool

Port of zerto-docs PR #45. OllamaEmbeddings previously made a single
attempt per batch — any transient connection drop or HTTP error from
one endpoint failed the entire index rebuild.

- _embed() now rotates to the next endpoint and retries with backoff
  (5 attempts) on transport errors. On HTTP status errors it also
  halves the input (floor 16): the Windows Ollama on .0.125 (4090)
  returns HTTP 400 when its model runner dies on an oversized input
  array. Error response bodies are now logged instead of swallowed.
- CI workflows: OLLAMA_URLS extended from the two ripper instances to
  the full 4-endpoint GPU pool (+ .0.125 4090, + .0.126). At the
  64-chunk batches this indexer already uses, .0.125 is the fastest
  embedder in the fleet (242 embeds/s measured on seed-mcp).
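
For reviewers, the rotate/backoff/split behaviour described above can be
sketched roughly as follows. This is an illustration under assumptions,
not the code in this commit: embed_resilient, endpoint_pool, parse_urls,
the two exception classes, the injected post callable and the 0.5 s base
delay are all hypothetical names and values, and "floor 16" is read here
as "batches of 16 or fewer are not split further".

```python
import itertools
import time
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


class TransportError(Exception):
    """Hypothetical stand-in: connection refused, dropped, or timed out."""


class HTTPStatusError(Exception):
    """Hypothetical stand-in: the endpoint answered with a non-2xx status."""

    def __init__(self, status: int, body: str = "") -> None:
        super().__init__(f"HTTP {status}: {body}")
        self.status = status
        self.body = body


def parse_urls(raw: str) -> List[str]:
    """Split a comma-separated OLLAMA_URLS value into endpoint URLs."""
    return [u.strip() for u in raw.split(",") if u.strip()]


def endpoint_pool(raw: str) -> Iterator[str]:
    """Endless round-robin iterator over the configured endpoints."""
    return itertools.cycle(parse_urls(raw))


def embed_resilient(
    texts: Sequence[str],
    endpoints: Iterator[str],
    post: Callable[[str, Sequence[str]], List[Vector]],
    *,
    max_attempts: int = 5,
    min_split: int = 16,      # assumed meaning of "floor 16"
    base_delay: float = 0.5,  # assumed; the commit does not state the delay
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> List[Vector]:
    """Embed texts, rotating endpoints on failure and halving on HTTP errors."""
    opts = dict(max_attempts=max_attempts, min_split=min_split,
                base_delay=base_delay, sleep=sleep, log=log)
    last_error: Exception = RuntimeError("no attempt made")
    for attempt in range(max_attempts):
        url = next(endpoints)  # rotate: every attempt takes the next endpoint
        try:
            return post(url, texts)
        except HTTPStatusError as exc:
            # Log the response body instead of swallowing it.
            log(f"{url}: HTTP {exc.status}, body={exc.body!r}")
            last_error = exc
            if len(texts) > min_split:
                # Oversized input: embed each half separately, keeping order.
                mid = len(texts) // 2
                return (embed_resilient(texts[:mid], endpoints, post, **opts)
                        + embed_resilient(texts[mid:], endpoints, post, **opts))
        except TransportError as exc:
            log(f"{url}: transport error: {exc}")
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"embedding failed after {max_attempts} attempts") from last_error
```

In real use, post would wrap whatever HTTP request the indexer sends to
an Ollama endpoint and translate the client library's errors into the
two exception types. Sharing one endpoint iterator across the recursive
calls means the two halves of a split batch land on different endpoints
whenever the pool has more than one.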

Verified against the live pool: 64-text happy path, dead-endpoint
rotation, and a forced 512-text 400 on .0.125 that split and completed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:46:45 -04:00
parent 41d431670f
commit 9fa920d0ce
3 changed files with 40 additions and 9 deletions
+1 -1
@@ -34,7 +34,7 @@ env:
# :11435 owns the Titan X, :11436 owns the 1080 Ti; the indexer
# round-robins per batch so both cards run in parallel. The host's
# primary Ollama on :11434 is left alone for OpenWebUI etc.
-OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436
+OLLAMA_URLS: http://192.168.0.2:11435,http://192.168.0.2:11436,http://192.168.0.125:11434,http://192.168.0.126:11434
EMBED_MODEL: nomic-embed-text
PRODUCT_NAME: hvm