# Reranker sidecar — llama.cpp + jina-reranker-v2-base

Phase 6 setup. The MCP server reads `RERANK_URL` and, when set, pipes
the top-50 dense (or hybrid) chunks through this sidecar before
returning to the LLM. See `docs_mcp/server.py:_rerank_pool`.

## Run

```bash
docker run -d --name llama-rerank -p 8082:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
  --reranking --host 0.0.0.0 --port 8080
```

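The image auto-downloads the GGUF on first start (~280 MB, one-time).
The download is cached inside the container, so removing the container
(`docker rm`) triggers a fresh download unless a volume is mounted.
First request loads the model into memory (~1s on CPU).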

## Configure the MCP server

```bash
export RERANK_URL=http://localhost:8082
# search_docs will now rerank automatically
```
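
As a rough illustration of what "rerank automatically" means, here is a minimal sketch of a client-side helper. It is not the code in `docs_mcp/server.py:_rerank_pool`; the names `rerank` and `post_json` are hypothetical, and the response shape (a `results` list of `index` / `relevance_score` entries) is an assumption about what the sidecar returns.

```python
import json
import os
import urllib.request


def post_json(url, payload, timeout=60.0):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def rerank(query, chunks, post=post_json):
    """Reorder chunks by sidecar score; pass them through if RERANK_URL is unset."""
    base = os.environ.get("RERANK_URL", "").rstrip("/")
    if not base or not chunks:
        # Reranking is opt-in: without the env var the pool is returned as-is.
        return list(chunks)
    body = post(base + "/v1/rerank", {"query": query, "documents": list(chunks)})
    # Assumed response shape:
    # {"results": [{"index": int, "relevance_score": float}, ...]}
    ranked = sorted(body["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [chunks[r["index"]] for r in ranked]
```

The `post` parameter exists only so the HTTP call can be swapped for a stub in tests; sorting explicitly by score avoids relying on the order in which the sidecar returns results.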

## Verify

```bash
curl http://localhost:8082/v1/rerank -H 'Content-Type: application/json' -d '{
  "query": "soybean herbicide for waterhemp",
  "documents": [
    "Roundup Custom for fallow burndown",
    "Sencor metribuzin controls waterhemp in soybean pre-emergence"
  ]
}'
```

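Expect the entry with `index` 1 (the Sencor doc) to score ~0.8 and the
entry with `index` 0 to get a strongly negative score. The `index`
values refer to positions in the `documents` array of the request.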

## Performance notes

- **CPU-only is slow.** ~0.5s per (query, doc) pair → ~23s for a
  50-doc pool. Fine for batch eval; painful for interactive queries.
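- For production, run on GPU: switch the image tag to `server-cuda`
  and add `--gpus all` to `docker run` (the plain `server` image is a
  CPU-only build). Depending on the llama.cpp version you may also
  need `-ngl 99` to offload the model layers to the GPU. Expect
  ~10-20× speedup.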
- Alternative: drop `RERANK_POOL` from 50 to ~20 in the server env.
  Cuts latency 2.5× at the cost of some quality (rerank gets fewer
  candidates to choose from).
- For very small batches the reranker can also run alongside
  Ollama on the same GPU box — `jina-reranker-v2-base` is ~280 MB
  and won't conflict with `nomic-embed-text` (~560 MB VRAM each).
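
The latency figures above follow from simple per-pair arithmetic, sketched below. The function name `rerank_latency_s` is hypothetical, and the model assumes pairs are scored one after another with no batching gains, which is a simplification. With the rounded 0.5s per pair it gives 25s for a 50-doc pool, slightly above the ~23s quoted above.

```python
def rerank_latency_s(pool_size, per_pair_s=0.5, speedup=1.0):
    """Estimated seconds to score pool_size (query, doc) pairs sequentially."""
    return pool_size * per_pair_s / speedup


cpu_50 = rerank_latency_s(50)               # CPU, default pool
cpu_20 = rerank_latency_s(20)               # CPU, RERANK_POOL=20
gpu_50 = [rerank_latency_s(50, speedup=s) for s in (10, 20)]  # GPU range

print(f"CPU, pool 50: {cpu_50:.1f}s")
print(f"CPU, pool 20: {cpu_20:.1f}s ({cpu_50 / cpu_20:.1f}x faster)")
print(f"GPU, pool 50: {gpu_50[1]:.2f}s to {gpu_50[0]:.2f}s")
```

Under these assumptions, shrinking the pool and moving to GPU multiply together, so a 20-doc pool on GPU would land well under a second.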