crop-chem-docs/deploy/rerank-docker.md
justin 278fe5f456 Phase 6: reranker sidecar (jina-reranker-v2-base via llama.cpp)
Wires the docs_mcp/server.py reranker hook into a real backend:
  ghcr.io/ggml-org/llama.cpp:server \
    -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
    --reranking --host 0.0.0.0 --port 8080

Setup recipe at deploy/rerank-docker.md. The MCP server already
honors RERANK_URL (added in Phase 7+8 commit); setting it to
http://<host>:8082 turns on rerank automatically.

## Eval results (35 queries, k=5, pool=50)

  | Retriever      | MRR   | Recall@5 | nDCG@5 |
  |----------------|-------|----------|--------|
  | dense          | 0.027 | 0.086    | 0.041  |
  | bm25           | 0.544 | 0.586    | 0.524  |
  | hybrid-rrf     | 0.114 | 0.114    | 0.108  |
  | dense+rerank   | 0.171 | 0.143    | 0.149  |
  | hybrid+rerank  | 0.672 | 0.638    | 0.621  |  ← winner
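
For readers unfamiliar with the three columns, here is a minimal sketch of how such per-query metrics are commonly computed. It assumes binary relevance labels and a reciprocal rank truncated at k; the eval script itself is not shown here, so the function names and those choices are illustrative rather than the repo's actual code. The table values would be the mean of each metric over the 35 queries.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant hit within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```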

The reranker fixes hybrid's failure mode (dense noise polluting
the fused pool) by scoring each (query, chunk) pair independently.
Net: hybrid+rerank gives +24% MRR over BM25-only.
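
To illustrate the failure mode, here is a small sketch of reciprocal rank fusion. It assumes `hybrid-rrf` is standard RRF with the commonly used constant k=60; the commit does not show the fusion code, so treat `rrf_fuse` and the toy document ids as hypothetical. When the dense list is mostly noise, its top entries still earn a full 1/(k+rank) share and crowd out BM25's lower-ranked hits.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists; each list adds 1/(k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25 = ["sencor", "roundup", "lorsban"]   # toy BM25 ranking
dense = ["noise1", "noise2", "sencor"]    # toy dense ranking, mostly off-topic
fused = rrf_fuse([bm25, dense])
# "noise1" and "noise2" land ahead of BM25's second and third hits.

# Relative MRR gain quoted above, from the table values.
relative_gain = (0.672 - 0.544) / 0.544
```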

Smoke test for the reranker itself (query: "soybean herbicide for
waterhemp", 4 candidates):
  index=1 SENCOR metribuzin waterhemp soybean → score=0.84  ← right
  index=3 Headline wheat fungicide           → score=-2.80
  index=2 Lorsban corn rootworm              → score=-2.91
  index=0 Roundup fallow burndown            → score=-3.44
Strong separation between the right doc and the rest.

## Production gotchas

- CPU-only reranker is slow (~23s for a 50-doc pool). For
  interactive use put it on GPU (`--gpus all`); ~10-20× faster.
- jina-reranker rejects the ENTIRE batch if any pair exceeds
  n_ctx_train=1024 tokens, so the MCP server truncates each doc to
  2000 chars before sending. Already handled in _rerank_pool.

Per-query rerank report at eval/results/with_rerank.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 10:50:03 -04:00

# Reranker sidecar — llama.cpp + jina-reranker-v2-base
Phase 6 setup. The MCP server reads `RERANK_URL` and, when set, pipes
the top-50 dense (or hybrid) chunks through this sidecar before
returning to the LLM. See `docs_mcp/server.py:_rerank_pool`.
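
The flow described above can be sketched roughly as follows. This is not the actual `_rerank_pool` from `docs_mcp/server.py`; the function signature, the injected `score_fn`, and the `top_k` default are assumptions made so the example stays self-contained. The 2000-character truncation comes from the Phase 6 commit message.

```python
import os

MAX_DOC_CHARS = 2000  # truncation length quoted in the Phase 6 commit message

def rerank_pool(query, chunks, score_fn, top_k=5, rerank_url=None):
    """Hypothetical stand-in for the server's rerank hook.

    score_fn(url, query, docs) is assumed to return one score per doc,
    in the same order as docs.
    """
    if rerank_url is None:
        rerank_url = os.environ.get("RERANK_URL")
    if not rerank_url or not chunks:
        # Reranking disabled: keep the retriever's own order.
        return list(chunks[:top_k])
    # Truncate so a single long chunk cannot sink the whole batch.
    truncated = [chunk[:MAX_DOC_CHARS] for chunk in chunks]
    scores = score_fn(rerank_url, query, truncated)
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_k]]
```

Passing the scorer in as a function keeps the HTTP call out of the ranking logic, which makes the hook easy to exercise without a running sidecar.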
## Run
```bash
docker run -d --name llama-rerank -p 8082:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
  --reranking --host 0.0.0.0 --port 8080
```
The image auto-downloads the GGUF on first start (~280 MB, one-time).
First request loads the model into memory (~1s on CPU).
## Configure the MCP server
```bash
export RERANK_URL=http://localhost:8082
# search_docs will now rerank automatically
```
## Verify
```bash
curl http://localhost:8082/v1/rerank -H 'Content-Type: application/json' -d '{
  "query": "soybean herbicide for waterhemp",
  "documents": [
    "Roundup Custom for fallow burndown",
    "Sencor metribuzin controls waterhemp in soybean pre-emergence"
  ]
}'
```
Expect index=1 (the Sencor doc) at score ~0.8, index=0 at a strongly
negative score.
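
The same check can be scripted from Python with only the standard library. This sketch assumes the response carries a `results` list whose entries have `index` and `relevance_score` fields, which matches the llama.cpp rerank output as far as I know; confirm against your server version. The helper names are hypothetical.

```python
import json
import urllib.request

def build_rerank_request(base_url, query, documents):
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_rerank_response(payload):
    """Return (index, score) pairs, best first.

    Assumes payload["results"] entries have "index" and "relevance_score".
    """
    pairs = [(r["index"], r["relevance_score"]) for r in payload["results"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def rerank(base_url, query, documents, timeout=60.0):
    """Call the sidecar and return ranked (index, score) pairs."""
    request = build_rerank_request(base_url, query, documents)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_rerank_response(json.load(response))
```

With the sidecar running, `rerank("http://localhost:8082", "soybean herbicide for waterhemp", docs)` should put index 1 first for the two documents above.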
## Performance notes
- **CPU-only is slow.** ~0.5s per (query, doc) pair → ~23s for a
50-doc pool. Fine for batch eval; painful for interactive queries.
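- For production, run on GPU: add `--gpus all` to the `docker run`
command and switch to a CUDA build of the image (the `:server` tag
used above is the CPU build; llama.cpp also publishes `:server-cuda`).
The host needs the NVIDIA Container Toolkit. Expect ~10-20× speedup.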
- Alternative: drop `RERANK_POOL` from 50 to ~20 in the server env.
Cuts latency 2.5× at the cost of some quality (rerank gets fewer
candidates to choose from).
- For very small batches the reranker can also run alongside
Ollama on the same GPU box — `jina-reranker-v2-base` is ~280 MB
and won't conflict with `nomic-embed-text` (~560 MB VRAM each).
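
The figures above allow a quick back-of-the-envelope latency estimate. The sketch below assumes latency scales linearly with pool size and that a GPU speedup applies uniformly; both are simplifications, and `estimate_latency_s` is an illustrative helper, not part of the server.

```python
def estimate_latency_s(pool_size, secs_per_pair=0.5, gpu_speedup=1.0):
    """Rough rerank latency: one forward pass per (query, doc) pair."""
    return pool_size * secs_per_pair / gpu_speedup

cpu_50 = estimate_latency_s(50)                      # close to the ~23s measured on CPU
cpu_20 = estimate_latency_s(20)                      # RERANK_POOL lowered to 20
gpu_50_slow = estimate_latency_s(50, gpu_speedup=10)
gpu_50_fast = estimate_latency_s(50, gpu_speedup=20)
```

Under these assumptions a 50-doc pool on GPU lands somewhere between `gpu_50_fast` and `gpu_50_slow` seconds, and shrinking the pool to 20 gives the 2.5× reduction mentioned above.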