From e5da4b21b0af7a628ef606e0d2c317107f95731e Mon Sep 17 00:00:00 2001
From: Justin Paul <justin@jpaul.me>
Date: Sun, 24 May 2026 13:25:34 -0400
Subject: [PATCH] deploy: add llama-rerank service to compose snippet
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drawbar's compose doesn't have a rerank service today — the
llama-rerank container I spun up earlier was a standalone
docker run, not a compose service. For Docker DNS resolution
(http://llama-rerank:8080) to work between MCP + reranker, both
need to be siblings in the same compose stack.

Added the llama-rerank service entry with:
- :server-cuda image (CUDA-built llama.cpp; the plain :server is
  CPU-only and 25× slower for our 50-doc rerank pool)
- -ngl 99 to offload all layers to GPU
- deploy.resources.reservations.devices block for compose v3 GPU
  passthrough (preferred over the older `runtime: nvidia` syntax)
- volume for the HuggingFace model cache so first-start GGUF
  download survives container recreates
- no host port mapping — internal-network-only

Tesla P4 compatibility notes inline: Pascal (CC 6.1) is in the
:server-cuda image's compute-arch list (500-1200) so no special
handling beyond the standard compose entry.

Also: cleanup instruction to docker rm -f the standalone
llama-rerank from the earlier setup before bringing up compose
(name collision).

And: noted that if trashpanda's existing Ollama is a host-mode
process rather than a compose service, the MCP needs
host.docker.internal override (snippet included).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 deploy/drawbar-compose-snippet.md | 107 +++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 23 deletions(-)

diff --git a/deploy/drawbar-compose-snippet.md b/deploy/drawbar-compose-snippet.md
index 63ac955..203a01b 100644
--- a/deploy/drawbar-compose-snippet.md
+++ b/deploy/drawbar-compose-snippet.md
@@ -1,24 +1,64 @@
 # Drawbar deploy — `crop-chem-docs` MCP server snippet
 
-Drop this into Drawbar's `docker-compose.yml`. Targets the existing
-trashpanda stack: shared Docker network with `ollama` + `llama-rerank`
-service containers, Cloudflare Tunnel out front.
+Drop these two services into Drawbar's `docker-compose.yml`. Targets
+the trashpanda stack: shared Docker network with the existing
+Drawbar services + the Cloudflare Tunnel.
 
 ## Pre-reqs (one-time on the deploy host)
 
-1. **Login to the Gitea registry** so the host can pull:
+1. **Docker login to the Gitea registry:**
    ```bash
    docker login git.jpaul.io -u justin   # PAT for password
    ```
-2. **`ollama` and `llama-rerank` services** are already running in
-   the same compose stack on the same Docker network. The MCP
-   container resolves them by service name via Docker's embedded
-   DNS — no IPs to maintain.
+2. **NVIDIA Container Toolkit** — already installed on trashpanda
+   (the existing standalone `llama-rerank` container ran with
+   `--gpus all` fine).
+3. **If a standalone `llama-rerank` container is already running**
+   (left over from earlier setup), remove it so the compose service
+   can bind the same name:
+   ```bash
+   docker rm -f llama-rerank
+   ```
 
-## Compose service
+## Compose services
 
 ```yaml
 services:
+
+  # ---- Reranker sidecar -----------------------------------------
+  # jina-reranker-v2-base-multilingual via llama.cpp on the Tesla P4.
+  # Internal port only (no host port mapping needed — the MCP reaches
+  # it via Docker DNS). ~280 MB GPU VRAM at idle, ~500 MB during a
+  # 50-doc rerank. Co-exists fine with any other GPU users on the P4.
+  llama-rerank:
+    image: ghcr.io/ggml-org/llama.cpp:server-cuda
+    container_name: llama-rerank
+    restart: unless-stopped
+    command:
+      - "-hf"
+      - "gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0"
+      - "--reranking"
+      - "--host"
+      - "0.0.0.0"
+      - "--port"
+      - "8080"
+      - "-ngl"
+      - "99"            # offload all layers to GPU
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    # Model cache survives container recreates; first start downloads
+    # the GGUF (~280 MB) from HuggingFace.
+    volumes:
+      - llama-rerank-cache:/root/.cache/huggingface
+    networks:
+      - default
+
+  # ---- MCP server ------------------------------------------------
   crop-chem-docs:
     image: git.jpaul.io/justin/crop-chem-docs:corpus-2026.05.24
     # :latest for dev / Watchtower auto-pull
@@ -32,29 +72,57 @@ services:
     #   HYBRID_SEARCH=true
     #   PRODUCT_NAME=crop_chem
     # Override here only if your services have different names.
+    depends_on:
+      - llama-rerank
     networks:
-      - default  # or whichever shared network ollama/llama-rerank are on
+      - default
     labels:
       com.centurylinklabs.watchtower.enable: "true"
+
+volumes:
+  llama-rerank-cache:
 ```
 
-If your stack uses non-default service names:
+## Note on the existing `ollama` service
+
+The Dockerfile default is `OLLAMA_URL=http://ollama:11434` — that
+assumes there's an `ollama` service in the same compose stack. If
+trashpanda's Ollama is a host-mode process (not a compose service),
+override the env in the `crop-chem-docs` block:
 
 ```yaml
     environment:
-      OLLAMA_URL: "http://<your-ollama-service>:11434"
-      RERANK_URL: "http://<your-rerank-service>:8080"
+      OLLAMA_URL: "http://host.docker.internal:11434"
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
 ```
 
-## Test from the host
+Or just add Ollama itself to the compose stack as a sibling service.
+
+## Test once both are up
 
 ```bash
-# Verify counts + indexes from inside the container:
+docker compose up -d llama-rerank crop-chem-docs
+
+# Wait ~10s for both to come up, then:
 docker exec crop-chem-docs python -c \
   "from docs_mcp.server import corpus_status; print(corpus_status())"
 ```
 
-## What the container exposes
+Expect: `# crop-chem-docs corpus status`, 4,159 labels, 216,467
+chunks, BM25 db present, `RERANK_URL=http://llama-rerank:8080`,
+`HYBRID_SEARCH=on`.
+
+Then a live search to verify hybrid+rerank:
+
+```bash
+docker exec crop-chem-docs python -c \
+  "from docs_mcp.server import search_docs; print(search_docs('soybean herbicide for waterhemp', k=2))"
+```
+
+Expect: 2 hits with Sencor/Tackle/Warrant in top-2, `mode=hybrid-rrf+rerank` in the header.
+
+## What the MCP container exposes
 
 | Tool | What it does |
 |---|---|
@@ -78,10 +146,3 @@ docker exec crop-chem-docs python -c \
   reindex, image push. Watchtower pulls the new `:latest` automatically.
 - **Manual** — Gitea Actions UI → `Monthly corpus refresh` → `Run workflow`.
   Optional `sources` input for single-source refresh (e.g., `bayer` only).
-
-## Switching corpus scope
-
-The row-crop filter (corn/soybeans/wheat) is in
-`scrape/sources/epa_ppls.py` as `ROW_CROP_KEYWORDS`. Edit + push +
-let the next workflow run pick it up. Same for the registrant
-allowlist at `scrape/sources/epa_registrant_allowlist.json`.