justin af44d7a102 Phase 11 + Phase 6 GPU move
## Phase 11 — Curated agronomy / label-handling knowledge layer

docs_mcp/lessons.md: 13 topic-anchored markdown sections covering
the LLM-side context a farmer-advisor needs alongside the raw
label corpus —
  - how-to-use-this-corpus
  - epa-signal-words
  - rei-phi-fundamentals
  - rup-handling
  - supplemental-labels-24c-2ee
  - tank-mix-fundamentals
  - resistance-management-hrac-frac-irac
  - glufosinate-application-rules
  - dicamba-application-rules
  - lake-erie-watershed-ohio
  - scn-and-other-seed-treatment-context
  - drift-management-essentials
  - how-to-format-recommendations

Each Topic block is independently retrievable via the new MCP tool:

  ppls_api_lessons(topic="rup-handling")

Call it with no topic to get the full TOC, or pass a substring to
return every matching section ("dicamba" → dicamba-application-rules).
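
The three lookup modes above can be sketched roughly as follows. This is an illustration only, not the code in the repository: it assumes each topic in lessons.md starts with a `## <topic-slug>` heading, and `split_topics`, `lookup_lessons`, and the sample text are hypothetical.

```python
import re

def split_topics(markdown: str) -> dict[str, str]:
    """Split a markdown document into {slug: section_text} on '## ' headings."""
    topics: dict[str, str] = {}
    slug = None
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(\S+)\s*$", line)
        if match:
            slug = match.group(1)
            topics[slug] = ""
        elif slug is not None:
            topics[slug] += line + "\n"
    return topics

def lookup_lessons(markdown: str, topic: str = "") -> str:
    """Return the TOC, one exact section, or every section whose slug contains topic."""
    topics = split_topics(markdown)
    if not topic:
        # No topic given: return the table of contents (one slug per line).
        return "\n".join(topics)
    if topic in topics:
        # Exact slug match wins over substring matching.
        return topics[topic]
    # Fall back to case-insensitive substring matching on the slug.
    hits = [text for slug, text in topics.items() if topic.lower() in slug.lower()]
    return "\n".join(hits) if hits else f"no section matches {topic!r}"

# Hypothetical two-section sample standing in for lessons.md.
SAMPLE = """## rup-handling
Certified applicators only.
## dicamba-application-rules
Check cutoff dates.
"""

print(lookup_lessons(SAMPLE))             # table of contents
print(lookup_lessons(SAMPLE, "dicamba"))  # substring match
```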

Tool docstring instructs the LLM to call this proactively before any
pesticide recommendation so the recommendation lands with regulatory
framing, resistance-group callouts, RUP applicator language, and the
canonical recommendation format — not just a rate from a label.

## Phase 6 — Reranker moved to GPU on trashpanda

Stopped the local CPU container and started the reranker on
trashpanda's Tesla P4 (8 GB VRAM) via:

  docker run -d --name llama-rerank --restart unless-stopped --gpus all \
    -p 8082:8080 \
    ghcr.io/ggml-org/llama.cpp:server-cuda \
    -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
    --reranking --host 0.0.0.0 --port 8080 -ngl 99

The :server-cuda image variant (not :server) is required for CUDA
backend; -ngl 99 offloads all layers to GPU.

Latency: 50-doc rerank dropped from ~23 s on CPU to ~0.7-1.5 s on
the Tesla P4 — production-grade interactive speeds.

deploy/rerank-docker.md updated with the trashpanda deploy recipe,
troubleshooting (mostly "did you use server-cuda?"), and a perf
reference table. The MCP server's RERANK_URL just points at
http://10.10.1.65:8082 now.

GPU eval still completing in background; results land in
eval/results/with_rerank_gpu.md as a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:10:09 -04:00

# Reranker sidecar — llama.cpp + jina-reranker-v2-base
Phase 6 setup. The MCP server reads `RERANK_URL` and, when set, pipes
the top-50 dense (or hybrid) chunks through this sidecar before
returning to the LLM. See `docs_mcp/server.py:_rerank_pool`.
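
The flow described above (read `RERANK_URL`, send the candidate pool to the sidecar, reorder by score) might look roughly like the sketch below. It is not the actual `_rerank_pool` in `docs_mcp/server.py`: the names `rerank_pool` and `reorder`, the `top_k` default, and the fall-back-on-error behaviour are assumptions, as is the response shape (a `results` list of objects with `index` and `relevance_score`, which is what llama.cpp's rerank endpoint is understood to return).

```python
import json
import os
import urllib.request

def reorder(chunks: list[str], results: list[dict], top_k: int) -> list[str]:
    """Order chunks by descending relevance_score and keep the first top_k."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [chunks[r["index"]] for r in ranked[:top_k]]

def rerank_pool(query: str, chunks: list[str], top_k: int = 10,
                timeout: float = 30.0) -> list[str]:
    """Rerank chunks through the sidecar, or pass them through unchanged."""
    base_url = os.environ.get("RERANK_URL")
    if not base_url or not chunks:
        # Reranking disabled: keep the retrieval order.
        return chunks[:top_k]
    body = json.dumps({"query": query, "documents": chunks}).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
        return reorder(chunks, payload["results"], top_k)
    except (OSError, KeyError, IndexError, ValueError):
        # Assumed policy: if the sidecar is down or the payload is
        # unexpected, fall back to the retrieval order instead of failing.
        return chunks[:top_k]
```

Falling back to the un-reranked pool keeps search usable while the sidecar restarts, at the cost of silently lower ranking quality; logging the exception would make that visible.
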
## Production deploy — trashpanda (Tesla P4, 8 GB VRAM)
This is where the reranker lives. Same box that runs the Drawbar
backend + Cloudflare Tunnel, so the MCP server can reach it on the
internal LAN.
```bash
ssh justin@10.10.1.65 \
'docker run -d --name llama-rerank --restart unless-stopped --gpus all \
-p 8082:8080 \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
--reranking --host 0.0.0.0 --port 8080 -ngl 99'
```
Key flags:
- `--gpus all` — pass through the Tesla P4
- `server-cuda` image — CUDA-built llama.cpp (not the CPU-only `:server`)
- `-ngl 99` — offload all layers to GPU
- `-hf <repo>` — auto-download from HuggingFace on first start (~280 MB,
cached inside the container, so it is downloaded again if the container is recreated)
- `--reranking` — enables `/v1/rerank` endpoint
- `--restart unless-stopped` — survives reboot

VRAM usage: ~280 MB for the model plus the CUDA context. That is well
under the Tesla P4's 8 GB and leaves room for nomic-embed-text
(~560 MB) if you later co-host it.
## Configure the MCP server
```bash
export RERANK_URL=http://10.10.1.65:8082
# search_docs now reranks the hybrid pool through the GPU before returning
```
In production (the MetaMCP-fronted Drawbar deploy), this is baked
into the MCP server's container env.
## Verify
```bash
curl http://10.10.1.65:8082/v1/rerank -H 'Content-Type: application/json' -d '{
"query": "soybean herbicide for waterhemp",
"documents": [
"Roundup Custom for fallow burndown",
"Sencor metribuzin controls waterhemp in soybean pre-emergence"
]
}'
```
Expect index=1 (the Sencor doc) at score ~0.8, index=0 at a strongly
negative score, in under 1 s.
## Performance reference
| Mode | Pool | Wall time |
|---|---|---|
| CPU (local 28-thread Xeon) | 50 docs | ~23 s |
| GPU (Tesla P4 on trashpanda) | 50 docs | ~0.7-1.5 s |
| GPU (Tesla P4) | 20 docs | ~0.4 s |

The Tesla P4 is Pascal-era (~5.5 TFLOPS FP32), so a modern Ampere or
Ada Lovelace GPU would likely be ~3-5× faster, but for the row-crop
label corpus query rate the P4 is plenty.
## Troubleshooting
- **Model not on GPU?** Check `docker logs llama-rerank | grep CUDA`;
you should see `CUDA0 : Tesla P4 (8109 MiB, ... free)` and tensor
load lines. If you see CPU-only init, you forgot `--gpus all` or
used `:server` instead of `:server-cuda`.
- **Conflict with Ollama on the same GPU?** No. Both processes can
share the GPU; each allocates its own VRAM from the card's shared
pool. nomic-embed-text + jina-reranker-v2-base together use ~840 MB
on the 8 GB card.
- **First rerank call is slow (~4 s)?** Warm-up. Subsequent calls are
~0.7 s for 50 docs.
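
Because the first call pays a one-time warm-up cost, one option is to send a throwaway request when the MCP server starts and compare first-call and steady-state latency. The sketch below is a minimal illustration: `time_calls` is a hypothetical helper, and `make_fake_rerank` is a stand-in that only simulates a slower first call with `time.sleep`. In practice you would pass in a function that posts to `/v1/rerank`.

```python
import time
from typing import Callable

def time_calls(call: Callable[[], object], n: int = 3) -> list[float]:
    """Run call() n times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations

def make_fake_rerank(first_delay: float = 0.05,
                     later_delay: float = 0.005) -> Callable[[], None]:
    """Build a stand-in for a rerank request whose first call is slower."""
    state = {"warm": False}

    def fake() -> None:
        time.sleep(later_delay if state["warm"] else first_delay)
        state["warm"] = True

    return fake

durations = time_calls(make_fake_rerank())
warmup, steady = durations[0], min(durations[1:])
print(f"warm-up: {warmup:.3f} s, steady state: {steady:.3f} s")
```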