Phase 6/7: rerank wiring + eval harness (hybrid+rerank = 100% pass, 90% P@1) #10

justin merged 1 commit from eval-and-rerank into main 2026-05-25 17:03:38 -04:00

Summary

You called this out before deploy: we'd shipped BM25 + rerank stubs but never actually wired rerank or measured retrieval quality. This PR closes both gaps and shows, on a 21-query golden set, that the hybrid+rerank deploy config is the strongest of the four retrievers tested.

Phase 6 — reranker integration:

  • _rerank() in server.py calls llama.cpp's /v1/rerank; returns None on any failure (graceful fallback, search never blocks on the sidecar).
  • Both search_docs and search_trials call rerank on the post-hybrid pool before truncating to k. The variety-code prefilter still pins exact matches.
  • Per-doc truncation to 2000 chars to fit jina-reranker-v2-base-multilingual's per-pair token budget. Full chunk text still goes back to the LLM — truncation is rerank-input-only.
  • Telemetry: usage records now include reranked: true|false.
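
A minimal sketch of the fallback behaviour described above, not the actual `_rerank()` from `server.py`. The function names, the timeout, and the use of `urllib` are assumptions for illustration. The request and response shape (`query` + `documents` in, `results` with `index` and `relevance_score` out) is assumed to follow the Jina-style format that llama.cpp's `/v1/rerank` mirrors.

```python
import json
import urllib.request
from typing import Optional

MAX_RERANK_CHARS = 2000  # per-doc truncation from the PR; rerank input only


def build_rerank_payload(query: str, docs: list[tuple[str, str]]) -> dict:
    """Build the request body, truncating each (chunk_id, text) pair's text."""
    return {
        "query": query,
        "documents": [text[:MAX_RERANK_CHARS] for _, text in docs],
    }


def order_ids(docs: list[tuple[str, str]], results: list[dict]) -> list[str]:
    """Map reranker results back to chunk ids, highest score first."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [docs[r["index"]][0] for r in ranked]


def rerank_ids(query: str, docs: list[tuple[str, str]], base_url: str,
               timeout: float = 5.0) -> Optional[list[str]]:
    """Return chunk ids in reranker order, or None on any failure."""
    if not docs:
        return None
    try:
        req = urllib.request.Request(
            base_url.rstrip("/") + "/v1/rerank",
            data=json.dumps(build_rerank_payload(query, docs)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.load(resp)
        return order_ids(docs, body["results"])
    except Exception:
        # Graceful fallback: the caller keeps its pre-rerank ordering.
        return None
```

Because truncation happens only while building the payload, the caller still holds the full chunk text and can return it untouched.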

Phase 7 — eval harness:

  • eval/queries.jsonl: 21 golden queries spanning variety-code lookups, semantic ag queries, trial queries (regional + product-filter + cross-vendor), and anti-hallucination tests (Pioneer fallback, not-in-corpus product).
  • eval/retrievers.py: 4 retrievers (dense / bm25 / hybrid / hybrid+rerank) sharing the same filter shape as the production server.
  • eval/run_eval.py: runs the 4-way comparison; produces eval/results/baseline.md with per-retriever and per-query breakdowns.
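
For readers unfamiliar with the metrics, here is a rough sketch of how Recall, P@1, and MRR can be computed per retriever. It is not the code in `eval/run_eval.py`: the function names are hypothetical, and it assumes "recall" means "at least one relevant chunk appears in the top k", which the PR does not spell out.

```python
from typing import Optional


def first_hit_rank(retrieved: list[str], relevant: set[str], k: int) -> Optional[int]:
    """1-based rank of the first relevant id within the top k, else None."""
    for rank, cid in enumerate(retrieved[:k], start=1):
        if cid in relevant:
            return rank
    return None


def summarize(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict:
    """Aggregate metrics over (retrieved_ids, relevant_ids) pairs, one per query."""
    ranks = [first_hit_rank(retrieved, relevant, k) for retrieved, relevant in runs]
    n = len(ranks)
    return {
        "recall": sum(r is not None for r in ranks) / n,  # hit in top k
        "p_at_1": sum(r == 1 for r in ranks) / n,         # hit at rank 1
        "mrr": sum(1.0 / r for r in ranks if r is not None) / n,
    }
```

Under these definitions MRR can never be lower than P@1, since every rank-1 hit contributes a full 1.0 and later hits add a fraction on top.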

Numbers (k=5, 21 queries)

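| Retriever          | Passed | Recall | P@1 | MRR   | Avg ms |
|--------------------|--------|--------|-----|-------|--------|
| hybrid+rerank      | 21/21  | 100%   | 90% | 0.905 | 2064   |
| bm25               | 20/21  | 95%    | 81% | 0.833 | 5      |
| hybrid (no rerank) | 15/21  | 71%    | 62% | 0.619 | 73     |
| dense              | 14/21  | 67%    | 38% | 0.440 | 79     |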

Surprising findings worth flagging

  1. Dense embedding alone is the weakest config. Variety codes (DKC62-08RIB, AG29XF4, AP Iliad), gene names (Rps3a), and trait codes (XF) have no semantic neighbors — nomic-embed-text returns noise on them.
  2. The hybrid layer (no rerank) is WORSE than BM25 alone. RRF dilutes the strong BM25 ranking with dense noise. We never would have caught this without eval.
  3. BM25-only is a great fallback (95% recall, 5 ms latency). The variety-code prefilter does a lot of work here — explains why search_docs feels good even on the simpler retriever paths.
  4. Anti-hallucination queries pass on ALL retrievers including dense-only. The must_not_contain + expected_empty design holds.
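
The dilution in finding 2 can be reproduced with a toy reciprocal rank fusion. This is a sketch, not the server's fusion code: the function name, the chunk ids, and the constant k=60 (the conventional RRF default) are assumptions.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# BM25 puts the exact variety-code match "A" first; dense misses it entirely.
bm25 = ["A", "B", "C"]
dense = ["B", "C", "D"]
print(rrf_fuse([bm25, dense]))  # ['B', 'C', 'A', 'D']
```

Chunk "A" appears in only one list, so it collects a single 1/61 term, while "B" and "C" each collect two terms and overtake it. A correct BM25 top hit therefore drops to rank 3 whenever the dense list does not agree.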

Deploy decision

Deploy with HYBRID_SEARCH=true and RERANK_URL set. Both are already configured in .gitea/workflows/refresh.yml, image-only.yml, and deploy/docker-compose.yml. The shared llama-rerank sidecar on trashpanda's Tesla P4 is already running for crop-chem-docs.

Local test setup (for reproducibility)

docker run -d --name local-rerank --gpus all -p 18080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
  --reranking --host 0.0.0.0 --port 8080 --n-gpu-layers 99 \
  --ctx-size 8192 --batch-size 4096 --ubatch-size 4096 --parallel 4

OLLAMA_URL=http://192.168.0.125:11434 \
PRODUCT_NAME=crop_seed \
RERANK_URL=http://localhost:18080 \
python -m eval.run_eval --k 5 --output eval/results/baseline.md

Note --ubatch-size 4096: the default of 512 is too small for ~600-token seed chunks. Without it, the rerank server logs "input (606 tokens) is too large to process" errors.

justin added 1 commit 2026-05-25 17:03:28 -04:00
Phase 6 — Reranker integration
- New _rerank(query, [(cid, doc), ...]) helper in server.py calls
  llama.cpp's /v1/rerank endpoint, returns reranker-ordered ids
  or None on failure (graceful fallback — search never blocks
  on the sidecar).
- search_docs + search_trials both call _rerank() on the post-
  hybrid pool BEFORE truncating to k. The variety-code prefilter
  still pins exact matches on top.
- Per-doc truncation to 2000 chars to fit
  jina-reranker-v2-base-multilingual's per-pair token budget.
  Full chunk text is still returned to the caller; truncation is
  rerank-input-only.
- Telemetry adds `reranked: true|false` so usage logs distinguish
  reranked calls.
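
The ordering described in the bullets above (pin exact matches, rerank the rest, then truncate to k) might look roughly like this. It is a sketch under assumptions: the function name and arguments are hypothetical, and the PR does not show how the prefilter actually pins its matches.

```python
from typing import Optional


def final_order(pool: list[str], pinned: list[str],
                reranked: Optional[list[str]], k: int) -> tuple[list[str], bool]:
    """Return (top-k chunk ids, reranked flag for telemetry).

    pool:     ids in post-hybrid order
    pinned:   exact variety-code matches that must stay on top
    reranked: ids in reranker order, or None if the sidecar call failed
    """
    base = reranked if reranked is not None else pool  # fall back to hybrid order
    pinned_set = set(pinned)
    ordered = list(pinned) + [cid for cid in base if cid not in pinned_set]
    # Truncate only after reranking so the reranker sees the whole pool.
    return ordered[:k], reranked is not None
```

Truncating before the rerank call would discard candidates the reranker might have promoted, which is why the order of these two steps matters.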

Phase 7 — Eval harness
- eval/queries.jsonl: 21 golden queries spanning:
    * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5,
      AP Iliad)
    * semantic variety queries (drought-tolerant corn, SCN MG-3
      soy, Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt)
    * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701 head-
      to-head, silage Ton/Acre, product=DKC65-95)
    * anti-hallucination (Pioneer P1142 fallback, DKC65-20 not-in-
      corpus expected_empty)
- eval/retrievers.py: 4 named retrievers — dense, bm25, hybrid
  (dense+bm25+RRF), hybrid+rerank — all sharing the same filter
  shape as docs_mcp/server.py._build_where.
- eval/run_eval.py: runs each retriever against each query,
  reports Recall / Precision@1 / MRR / avg latency. Markdown
  output in eval/results/baseline.md.
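
As an illustration of how a golden query could be judged, here is a hedged sketch. Only the `must_not_contain` and `expected_empty` field names come from this PR; `query`, `must_contain`, the `text` key on results, and the function itself are hypothetical, and the sample spec is not copied from `eval/queries.jsonl`.

```python
import json


def query_passes(spec: dict, results: list[dict]) -> bool:
    """Check one golden-query spec against a retriever's results."""
    if spec.get("expected_empty"):
        # Not-in-corpus queries pass only when nothing is returned.
        return len(results) == 0
    texts = [r["text"].lower() for r in results]
    for banned in spec.get("must_not_contain", []):
        if any(banned.lower() in t for t in texts):
            return False
    wanted = spec.get("must_contain", [])  # hypothetical field
    return all(any(w.lower() in t for t in texts) for w in wanted)


# Illustrative JSONL line, parsed the way a harness might read the file.
spec = json.loads('{"query": "DKC65-20", "expected_empty": true}')
print(query_passes(spec, []))  # True
```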

Baseline results (k=5, 21 queries):

  | Retriever       | Pass  | Recall | P@1   | MRR   | Avg ms |
  |-----------------|-------|--------|-------|-------|--------|
  | hybrid+rerank   | 21/21 | 100%   | 90%   | 0.905 | 2064   |
  | bm25            | 20/21 |  95%   | 81%   | 0.833 |    5   |
  | hybrid          | 15/21 |  71%   | 62%   | 0.619 |   73   |
  | dense           | 14/21 |  67%   | 38%   | 0.440 |   79   |

Key findings:
1. hybrid+rerank wins on quality — 100% pass, 90% P@1.
2. BM25 alone is surprisingly competitive (95% pass) at 5 ms —
   excellent fallback when rerank is down. The variety-code
   prefilter in search_docs is doing a lot of work here.
3. Dense embedding alone is the WEAKEST configuration on this
   corpus — variety identity tokens (DKC62-08RIB, AP Iliad,
   Rps3a) have no semantic neighbors, so nomic-embed-text returns
   noise. The hybrid (no rerank) layer actively hurts because
   RRF dilutes the BM25 ranking with dense noise.
4. Anti-hallucination queries (Pioneer fallback, DKC65-20 not-
   in-corpus) pass on ALL retrievers including dense-only —
   the must_not_contain + expected_empty design holds.

Deploy decision: HYBRID_SEARCH=true + RERANK_URL set
(production env already has both — refresh.yml + image-only.yml
+ deploy/docker-compose.yml all configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 038475e7fd into main 2026-05-25 17:03:38 -04:00

Reference: justin/seed-mcp#10