Phase 6: reranker sidecar (jina-reranker-v2-base via llama.cpp)

Wires the docs_mcp/server.py reranker hook into a real backend:

    ghcr.io/ggml-org/llama.cpp:server \
      -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
      --reranking --host 0.0.0.0 --port 8080

Setup recipe at deploy/rerank-docker.md.

The MCP server already honors RERANK_URL (added in Phase 7+8 commit);
setting it to http://<host>:8082 turns on rerank automatically.

## Eval results (35 queries, k=5, pool=50)

| Retriever      | MRR   | Recall@5 | nDCG@5 |
|----------------|-------|----------|--------|
| dense          | 0.027 | 0.086    | 0.041  |
| bm25           | 0.544 | 0.586    | 0.524  |
| hybrid-rrf     | 0.114 | 0.114    | 0.108  |
| dense+rerank   | 0.171 | 0.143    | 0.149  |
| hybrid+rerank  | 0.672 | 0.638    | 0.621  |  ← winner

The reranker fixes hybrid's failure mode (dense noise polluting the
fused pool) by scoring each (query, chunk) pair independently.
Net: hybrid+rerank gives +24% MRR over BM25-only.

Smoke test for the reranker itself (query: "soybean herbicide for
waterhemp", 4 candidates):

    index=1 SENCOR metribuzin waterhemp soybean → score=0.84   ← right
    index=3 Headline wheat fungicide            → score=-2.80
    index=2 Lorsban corn rootworm               → score=-2.91
    index=0 Roundup fallow burndown             → score=-3.44

Strong separation between the right doc and the rest.

## Production gotchas

- CPU-only reranker is slow (~23s for a 50-doc pool). For interactive
  use put it on GPU (`--gpus all`); ~10-20× faster.
- jina-reranker rejects the ENTIRE batch if any pair exceeds
  n_ctx_train=1024; the server truncates each doc to 2000 chars before
  sending. Already handled in _rerank_pool.

Per-query rerank report at eval/results/with_rerank.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 10:50:03 -04:00
parent 335c33465b
commit 278fe5f456
2 changed files with 108 additions and 0 deletions
@@ -0,0 +1,52 @@
+# Reranker sidecar — llama.cpp + jina-reranker-v2-base
+
+Phase 6 setup. The MCP server reads `RERANK_URL` and, when it is set, sends
+the top-50 dense (or hybrid) chunks through this sidecar before
+returning them to the LLM. See `docs_mcp/server.py:_rerank_pool`.
+
+## Run
+
+```bash
+docker run -d --name llama-rerank -p 8082:8080 \
+  ghcr.io/ggml-org/llama.cpp:server \
+  -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
+  --reranking --host 0.0.0.0 --port 8080
+```
+
+The image auto-downloads the GGUF on first start (~280 MB, one-time).
+First request loads the model into memory (~1s on CPU).
+
+## Configure the MCP server
+
+```bash
+export RERANK_URL=http://localhost:8082
+# search_docs will now rerank automatically
+```
+
+## Verify
+
+```bash
+curl http://localhost:8082/v1/rerank -H 'Content-Type: application/json' -d '{
+  "query": "soybean herbicide for waterhemp",
+  "documents": [
+    "Roundup Custom for fallow burndown",
+    "Sencor metribuzin controls waterhemp in soybean pre-emergence"
+  ]
+}'
+```
+
+Expect index=1 (the Sencor doc) at score ~0.8, index=0 at a strongly
+negative score.
+
+## Performance notes
+
+- **CPU-only is slow.** ~0.5s per (query, doc) pair → ~23s for a
+  50-doc pool. Fine for batch eval; painful for interactive queries.
+- For production, run on GPU: add `--gpus all` to `docker run` and switch
+  to a CUDA image tag such as `server-cuda`. Expect ~10-20× speedup.
+- Alternative: drop `RERANK_POOL` from 50 to ~20 in the server env.
+  Cuts latency 2.5× at the cost of some quality (rerank gets fewer
+  candidates to choose from).
+- For very small batches the reranker can also run alongside
+  Ollama on the same GPU box — `jina-reranker-v2-base` is ~280 MB
+  and won't conflict with `nomic-embed-text` (~560 MB VRAM each).
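
As a companion to the Verify step above, here is a minimal Python sketch of the same request made from client code. It is not the project's `_rerank_pool`; the function names are hypothetical. It assumes the sidecar returns a Jina-style body of the form `{"results": [{"index": ..., "relevance_score": ...}]}`, and it applies the 2000-character per-document cap described in the commit message.

```python
import json
import urllib.request

MAX_DOC_CHARS = 2000  # per-document cap quoted in the commit message


def build_rerank_payload(query, documents, max_chars=MAX_DOC_CHARS):
    """Build the JSON body for /v1/rerank, truncating each document."""
    return {"query": query, "documents": [d[:max_chars] for d in documents]}


def rank_results(response, documents):
    """Return (score, document) pairs sorted best-first.

    Assumes a Jina-style response:
    {"results": [{"index": i, "relevance_score": s}, ...]}
    """
    ordered = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(r["relevance_score"], documents[r["index"]]) for r in ordered]


def rerank(base_url, query, documents, timeout=60):
    """POST the pool to the sidecar and return ranked (score, document) pairs."""
    body = json.dumps(build_rerank_payload(query, documents)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return rank_results(json.load(resp), documents)
```

With the container from the Run section up, `rerank("http://localhost:8082", "soybean herbicide for waterhemp", docs)` should put the Sencor document first. The generous default timeout reflects the CPU-only latency noted above.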