justin 278fe5f456 Phase 6: reranker sidecar (jina-reranker-v2-base via llama.cpp)

Wires the docs_mcp/server.py reranker hook into a real backend:
  ghcr.io/ggml-org/llama.cpp:server \
    -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
    --reranking --host 0.0.0.0 --port 8080

Setup recipe at deploy/rerank-docker.md. The MCP server already
honors RERANK_URL (added in the Phase 7+8 commit); setting it to
http://<host>:8082 turns on rerank automatically.

## Eval results (35 queries, k=5, pool=50)

  | Retriever      | MRR   | Recall@5 | nDCG@5 |
  |----------------|-------|----------|--------|
  | dense          | 0.027 | 0.086    | 0.041  |
  | bm25           | 0.544 | 0.586    | 0.524  |
  | hybrid-rrf     | 0.114 | 0.114    | 0.108  |
  | dense+rerank   | 0.171 | 0.143    | 0.149  |
  | hybrid+rerank  | 0.672 | 0.638    | 0.621  |  ← winner

The reranker fixes hybrid's failure mode (dense noise polluting
the fused pool) by scoring each (query, chunk) pair independently.
Net: hybrid+rerank gives +24% MRR over BM25-only.
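
For readers who want to check the table, here is a minimal sketch of the three metrics under binary relevance. The eval harness itself is not shown in this commit, so the function names and the binary-relevance assumption are illustrative, not the project's actual code. The last line reproduces the +24% figure from the MRR column.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant hit, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG with binary gains (assumption: a doc is relevant or it is not)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Relative MRR gain of hybrid+rerank over bm25, using the table values.
relative_gain = (0.672 - 0.544) / 0.544  # about 0.235, reported as +24%
```

MRR is the mean of `reciprocal_rank` over all 35 queries; the other two columns are means of the per-query values in the same way.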

Smoke test for the reranker itself (query: "soybean herbicide for
waterhemp", 4 candidates):
  index=1 SENCOR metribuzin waterhemp soybean → score=0.84  ← right
  index=3 Headline wheat fungicide           → score=-2.80
  index=2 Lorsban corn rootworm              → score=-2.91
  index=0 Roundup fallow burndown            → score=-3.44
Strong separation between the right doc and the rest.

## Production gotchas

- CPU-only reranker is slow (~23s for a 50-doc pool). For
  interactive use put it on GPU (`--gpus all`); ~10-20× faster.
- jina-reranker rejects the ENTIRE batch if any pair exceeds
  n_ctx_train=1024 tokens, so the server truncates each doc to
  2000 chars before sending. Already handled in _rerank_pool.
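
The truncation guard can be sketched as below. This is an illustration only: the real logic lives in _rerank_pool and is not reproduced here, and `truncate_docs` and `MAX_DOC_CHARS` are hypothetical names. Only the 2000-character cap comes from this commit.

```python
MAX_DOC_CHARS = 2000  # per-document cap quoted in this commit message

def truncate_docs(docs, max_chars=MAX_DOC_CHARS):
    """Clip every candidate so one oversized chunk cannot fail the whole batch."""
    return [doc[:max_chars] for doc in docs]
```

A character cap is a cheap proxy for a token limit; it needs no tokenizer on the client side, at the price of occasionally clipping more text than strictly necessary.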

Per-query rerank report at eval/results/with_rerank.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-24 10:50:03 -04:00


Reranker sidecar — llama.cpp + jina-reranker-v2-base

Phase 6 setup. The MCP server reads RERANK_URL and, when set, pipes the top-50 dense (or hybrid) chunks through this sidecar before returning to the LLM. See docs_mcp/server.py:_rerank_pool.

Run

docker run -d --name llama-rerank -p 8082:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
  --reranking --host 0.0.0.0 --port 8080

The image auto-downloads the GGUF on first start (~280 MB, one-time). First request loads the model into memory (~1s on CPU).

Configure the MCP server

export RERANK_URL=http://localhost:8082
# search_docs will now rerank automatically
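
A minimal sketch of how a server could gate reranking on these variables. The actual code in docs_mcp/server.py is not shown here, so `rerank_settings` is a hypothetical helper, and the default pool of 50 is an assumption based on the pool size quoted elsewhere in this document.

```python
import os

def rerank_settings(env=None):
    """Read rerank config; reranking is enabled only when RERANK_URL is set."""
    env = os.environ if env is None else env
    url = env.get("RERANK_URL", "").strip().rstrip("/")
    pool = int(env.get("RERANK_POOL", "50"))  # assumed default
    return {"enabled": bool(url), "url": url or None, "pool": pool}
```

With RERANK_URL unset or empty, `enabled` is False and the caller can return the retriever's ordering unchanged.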

Verify

curl http://localhost:8082/v1/rerank -H 'Content-Type: application/json' -d '{
  "query": "soybean herbicide for waterhemp",
  "documents": [
    "Roundup Custom for fallow burndown",
    "Sencor metribuzin controls waterhemp in soybean pre-emergence"
  ]
}'

Expect index=1 (the Sencor doc) at score ~0.8, index=0 at a strongly negative score.
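
The same check can be scripted. The sketch below uses only the Python standard library and assumes the response carries a `results` list whose items have `index` and `relevance_score` fields; confirm that shape against the curl output from your own sidecar before relying on it. The function names are hypothetical and are not taken from docs_mcp/server.py.

```python
import json
import urllib.request

def build_rerank_request(base_url, query, documents):
    """Build the POST request for the sidecar's /v1/rerank endpoint."""
    payload = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ranked_indices(response_body):
    """Return document indices ordered from best to worst score."""
    results = response_body["results"]  # assumed response shape, see lead-in
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [r["index"] for r in ordered]

def rerank(base_url, query, documents, timeout=60):
    """Call the sidecar and return indices in reranked order."""
    req = build_rerank_request(base_url, query, documents)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return ranked_indices(json.load(resp))
```

For the two documents in the curl example, `rerank("http://localhost:8082", query, documents)` should put index 1 first if the sidecar behaves as described above.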

Performance notes

- CPU-only is slow. ~0.5s per (query, doc) pair → ~23s for a 50-doc pool. Fine for batch eval; painful for interactive queries.
- For production, run on GPU: add --gpus all to docker run and use a CUDA build of the image (llama.cpp publishes a server-cuda tag; the plain server tag used above is a CPU build). Expect ~10-20× speedup.
- Alternative: drop RERANK_POOL from 50 to ~20 in the server env. Cuts latency 2.5× at the cost of some quality (rerank gets fewer candidates to choose from).
- For very small batches the reranker can also run alongside Ollama on the same GPU box: jina-reranker-v2-base is ~280 MB and won't conflict with nomic-embed-text (~560 MB VRAM each).
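
The latency trade-offs above reduce to simple arithmetic. The sketch below assumes cost scales linearly with pool size and uses 0.46 s per pair, which is just the quoted 23 s divided by 50 docs; the function name is hypothetical and the numbers are estimates, not measurements.

```python
def estimated_rerank_seconds(pool_size, seconds_per_pair=0.46, speedup=1.0):
    """Rough latency estimate: pairs are scored one after another, so cost is linear."""
    return pool_size * seconds_per_pair / speedup

cpu_pool_50 = estimated_rerank_seconds(50)               # about 23 s
cpu_pool_20 = estimated_rerank_seconds(20)               # about 9.2 s, 2.5x less
gpu_pool_50 = estimated_rerank_seconds(50, speedup=10)   # about 2.3 s at the low end of 10-20x
```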
