# Reranker sidecar: llama.cpp + jina-reranker-v2-base

Phase 6 setup. The MCP server reads `RERANK_URL` and, when it is set, pipes the top-50 dense (or hybrid) chunks through this sidecar before returning them to the LLM. See `docs_mcp/server.py:_rerank_pool`.

## Run

```bash
docker run -d --name llama-rerank -p 8082:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
  --reranking --host 0.0.0.0 --port 8080
```

The image auto-downloads the GGUF on first start (~280 MB, one-time). The first request loads the model into memory (~1s on CPU).

## Configure the MCP server

```bash
export RERANK_URL=http://localhost:8082
# search_docs will now rerank automatically
```

## Verify

```bash
curl http://localhost:8082/v1/rerank -H 'Content-Type: application/json' -d '{
  "query": "soybean herbicide for waterhemp",
  "documents": [
    "Roundup Custom for fallow burndown",
    "Sencor metribuzin controls waterhemp in soybean pre-emergence"
  ]
}'
```

Expect `index=1` (the Sencor doc) at a score of ~0.8 and `index=0` at a strongly negative score.

## Performance notes

- **CPU-only is slow.** ~0.5s per (query, doc) pair → ~23s for a 50-doc pool. Fine for batch eval; painful for interactive queries.
- For production, run on GPU: add `--gpus all` to the `docker run` command and use a CUDA-enabled image (the `server-cuda` tag; the plain `server` tag is a CPU build). Depending on the llama.cpp version, you may also need `-ngl` to offload layers to the GPU. Expect ~10-20× speedup.
- Alternative: drop `RERANK_POOL` from 50 to ~20 in the server env. This cuts latency 2.5× at the cost of some quality (the reranker gets fewer candidates to choose from).
- For very small batches the reranker can also run alongside Ollama on the same GPU box: `jina-reranker-v2-base` is ~280 MB and won't conflict with `nomic-embed-text` (~560 MB VRAM each).
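
The Verify request can also be issued from Python, which is closer to how a client such as the MCP server would use the sidecar. The following is a minimal sketch, not the code in `docs_mcp/server.py`: the function names `rerank` and `order_by_score` are hypothetical, and it assumes the response carries a `results` list whose entries have `index` and `relevance_score` fields (check the output of the `curl` command against your llama.cpp build before relying on this).

```python
import json
import os
import urllib.request


def order_by_score(documents, results):
    """Pair each document with its score, highest relevance first.

    Assumes each result is a dict with `index` (position in `documents`)
    and `relevance_score`; this shape is an assumption, not confirmed here.
    """
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ranked]


def rerank(query, documents, base_url=None, timeout=60.0):
    """POST a query and candidate documents to the sidecar's /v1/rerank endpoint."""
    # Fall back to the port mapping used in the Run section if RERANK_URL is unset.
    base_url = base_url or os.environ.get("RERANK_URL", "http://localhost:8082")
    payload = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return order_by_score(documents, body["results"])
```

With the container running, `rerank("soybean herbicide for waterhemp", docs)` on the two documents from the Verify section should return the Sencor document first. The generous default timeout reflects the CPU latency noted above; a 50-doc pool can take tens of seconds without a GPU.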