From e5da4b21b0af7a628ef606e0d2c317107f95731e Mon Sep 17 00:00:00 2001
From: Justin Paul
Date: Sun, 24 May 2026 13:25:34 -0400
Subject: [PATCH] deploy: add llama-rerank service to compose snippet
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drawbar's compose doesn't have a rerank service today — the
llama-rerank container I spun up earlier was a standalone docker run,
not a compose service. For Docker DNS resolution
(http://llama-rerank:8080) to work between MCP + reranker, both need
to be siblings in the same compose stack.

Added the llama-rerank service entry with:
- :server-cuda image (CUDA-built llama.cpp; the plain :server is
  CPU-only and 25× slower for our 50-doc rerank pool)
- -ngl 99 to offload all layers to GPU
- deploy.resources.reservations.devices block for compose v3 GPU
  passthrough (preferred over the older `runtime: nvidia` syntax)
- volume for the HuggingFace model cache so first-start GGUF download
  survives container recreates
- no host port mapping — internal-network-only

Tesla P4 compatibility notes inline: Pascal (CC 6.1) is in the
:server-cuda image's compute-arch list (500-1200) so no special
handling beyond the standard compose entry.

Also: cleanup instruction to docker rm -f the standalone llama-rerank
from the earlier setup before bringing up compose (name collision).

And: noted that if trashpanda's existing Ollama is a host-mode process
rather than a compose service, the MCP needs host.docker.internal
override (snippet included).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 deploy/drawbar-compose-snippet.md | 107 +++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 23 deletions(-)

diff --git a/deploy/drawbar-compose-snippet.md b/deploy/drawbar-compose-snippet.md
index 63ac955..203a01b 100644
--- a/deploy/drawbar-compose-snippet.md
+++ b/deploy/drawbar-compose-snippet.md
@@ -1,24 +1,64 @@
 # Drawbar deploy — `crop-chem-docs` MCP server snippet
 
-Drop this into Drawbar's `docker-compose.yml`. Targets the existing
-trashpanda stack: shared Docker network with `ollama` + `llama-rerank`
-service containers, Cloudflare Tunnel out front.
+Drop these two services into Drawbar's `docker-compose.yml`. Targets
+the trashpanda stack: shared Docker network with the existing
+Drawbar services + the Cloudflare Tunnel.
 
 ## Pre-reqs (one-time on the deploy host)
 
-1. **Login to the Gitea registry** so the host can pull:
+1. **Docker login to the Gitea registry:**
    ```bash
    docker login git.jpaul.io -u justin # PAT for password
    ```
-2. **`ollama` and `llama-rerank` services** are already running in
-   the same compose stack on the same Docker network. The MCP
-   container resolves them by service name via Docker's embedded
-   DNS — no IPs to maintain.
+2. **NVIDIA Container Toolkit** — already installed on trashpanda
+   (the existing standalone `llama-rerank` container ran with
+   `--gpus all` fine).
+3. **If a standalone `llama-rerank` container is already running**
+   (left over from earlier setup), remove it so the compose service
+   can bind the same name:
+   ```bash
+   docker rm -f llama-rerank
+   ```
 
-## Compose service
+## Compose services
 
 ```yaml
 services:
+
+  # ---- Reranker sidecar -----------------------------------------
+  # jina-reranker-v2-base-multilingual via llama.cpp on the Tesla P4.
+  # Internal port only (no host port mapping needed — the MCP reaches
+  # it via Docker DNS). ~280 MB GPU VRAM at idle, ~500 MB during a
+  # 50-doc rerank. Co-exists fine with any other GPU users on the P4.
+  llama-rerank:
+    image: ghcr.io/ggml-org/llama.cpp:server-cuda
+    container_name: llama-rerank
+    restart: unless-stopped
+    command:
+      - "-hf"
+      - "gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0"
+      - "--reranking"
+      - "--host"
+      - "0.0.0.0"
+      - "--port"
+      - "8080"
+      - "-ngl"
+      - "99"  # offload all layers to GPU
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    # Model cache survives container recreates; first start downloads
+    # the GGUF (~280 MB) from HuggingFace.
+    volumes:
+      - llama-rerank-cache:/root/.cache/huggingface
+    networks:
+      - default
+
+  # ---- MCP server ------------------------------------------------
   crop-chem-docs:
     image: git.jpaul.io/justin/crop-chem-docs:corpus-2026.05.24
     # :latest for dev / Watchtower auto-pull
@@ -32,29 +72,57 @@ services:
     # HYBRID_SEARCH=true
     # PRODUCT_NAME=crop_chem
     # Override here only if your services have different names.
+    depends_on:
+      - llama-rerank
     networks:
-      - default # or whichever shared network ollama/llama-rerank are on
+      - default
     labels:
       com.centurylinklabs.watchtower.enable: "true"
+
+volumes:
+  llama-rerank-cache:
 ```
 
-If your stack uses non-default service names:
+## Note on the existing `ollama` service
+
+The Dockerfile default is `OLLAMA_URL=http://ollama:11434` — that
+assumes there's an `ollama` service in the same compose stack. If
+trashpanda's Ollama is a host-mode process (not a compose service),
+override the env in the `crop-chem-docs` block:
 
 ```yaml
     environment:
-      OLLAMA_URL: "http://:11434"
-      RERANK_URL: "http://:8080"
+      OLLAMA_URL: "http://host.docker.internal:11434"
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
 ```
 
-## Test from the host
+Or just add Ollama itself to the compose stack as a sibling service.
+
+## Test once both are up
 
 ```bash
-# Verify counts + indexes from inside the container:
+docker compose up -d llama-rerank crop-chem-docs
+
+# Wait ~10s for both to come up, then:
 docker exec crop-chem-docs python -c \
   "from docs_mcp.server import corpus_status; print(corpus_status())"
 ```
 
-## What the container exposes
+Expect: `# crop-chem-docs corpus status`, 4,159 labels, 216,467
+chunks, BM25 db present, `RERANK_URL=http://llama-rerank:8080`,
+`HYBRID_SEARCH=on`.
+
+Then a live search to verify hybrid+rerank:
+
+```bash
+docker exec crop-chem-docs python -c \
+  "from docs_mcp.server import search_docs; print(search_docs('soybean herbicide for waterhemp', k=2))"
+```
+
+Expect: 2 hits with Sencor/Tackle/Warrant in top-2, `mode=hybrid-rrf+rerank` in the header.
+
+## What the MCP container exposes
 
 | Tool | What it does |
 |---|---|
@@ -78,10 +146,3 @@ docker exec crop-chem-docs python -c \
   reindex, image push. Watchtower pulls the new `:latest` automatically.
 - **Manual** — Gitea Actions UI → `Monthly corpus refresh` → `Run
   workflow`. Optional `sources` input for single-source refresh (e.g., `bayer` only).
-
-## Switching corpus scope
-
-The row-crop filter (corn/soybeans/wheat) is in
-`scrape/sources/epa_ppls.py` as `ROW_CROP_KEYWORDS`. Edit + push +
-let the next workflow run pick it up. Same for the registrant
-allowlist at `scrape/sources/epa_registrant_allowlist.json`.
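
The commit message relies on the MCP container reaching the reranker at `http://llama-rerank:8080` through Docker DNS. A minimal sketch of a direct check against that service, independent of `docs_mcp`, might look like the following. The `/v1/rerank` path, the `query`/`documents` request body, and the `results[].index` / `results[].relevance_score` response shape are assumptions about the llama.cpp server in the `:server-cuda` image and are not stated in this patch; the function names are hypothetical.

```python
import json
import urllib.request

# Service URL from the compose snippet; it resolves only inside the compose network.
RERANK_URL = "http://llama-rerank:8080"


def build_rerank_request(query, documents, base_url=RERANK_URL):
    """Build a POST request for the reranker.

    The /v1/rerank path and the JSON body shape are assumptions, not taken from the patch.
    """
    body = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rank_indices(response_json):
    """Return document indices, best first.

    Assumes a response shaped like
    {"results": [{"index": 0, "relevance_score": 0.42}, ...]}.
    """
    results = response_json.get("results", [])
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [r["index"] for r in ordered]


def rerank(query, documents, base_url=RERANK_URL, timeout=10.0):
    """Send the request and return the ranked indices (needs network access to the service)."""
    request = build_rerank_request(query, documents, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rank_indices(json.load(response))
```

Calling `rerank("soybean herbicide for waterhemp", ["label text A", "label text B"])` from a Python shell inside the `crop-chem-docs` container would exercise both name resolution and the reranker itself; a DNS failure surfaces as a `urllib.error.URLError` rather than a ranking.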