Phase 6/7: rerank wiring + eval harness (hybrid+rerank = 100% pass, 90% P@1) #10
Summary
You called this out before deploy — we'd shipped BM25 + rerank stubs but never actually wired rerank or measured retrieval quality. This PR closes both gaps and proves the deploy config is the right one.
Phase 6: reranker integration

- `_rerank()` in `server.py` calls llama.cpp's `/v1/rerank`; returns `None` on any failure (graceful fallback, search never blocks on the sidecar).
- `search_docs` and `search_trials` call rerank on the post-hybrid pool before truncating to `k`. The variety-code prefilter still pins exact matches.
- Per-doc rerank input is truncated to 2000 chars to fit `jina-reranker-v2-base-multilingual`'s per-pair token budget. Full chunk text still goes back to the LLM; truncation is rerank-input-only.
- Telemetry adds `reranked: true|false`.

Phase 7: eval harness
- `eval/queries.jsonl`: 21 golden queries spanning variety-code lookups, semantic ag queries, trial queries (regional + product-filter + cross-vendor), and anti-hallucination tests (Pioneer fallback, not-in-corpus product).
- `eval/retrievers.py`: 4 retrievers (dense / bm25 / hybrid / hybrid+rerank) sharing the same filter shape as the production server.
- `eval/run_eval.py`: runs the 4-way comparison; produces `eval/results/baseline.md` with per-retriever and per-query breakdowns.

Numbers (k=5, 21 queries)
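The commit message for this PR reports Recall, Precision@1 and MRR at k=5. As a rough illustration of how such per-retriever numbers can be computed, here is a minimal sketch. It is not the code in `eval/run_eval.py`: the names `evaluate` and `reciprocal_rank` are hypothetical, "recall" is assumed to mean "at least one relevant id in the top k", and the handling of `expected_empty` / `must_not_contain` queries is not shown in this PR, so it is left out.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant id, or 0.0 if none appears."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Average hit rate, P@1 and MRR over (ranked_ids, relevant_ids) pairs.

    Assumption: one pair per golden query; "recall" here is a per-query
    hit rate at k, which may differ from the harness's own definition.
    """
    if not runs:
        raise ValueError("need at least one query")
    hits = p_at_1 = mrr = 0.0
    for ranked_ids, relevant in runs:
        top_k = list(ranked_ids[:k])
        hits += 1.0 if any(cid in relevant for cid in top_k) else 0.0
        p_at_1 += 1.0 if top_k and top_k[0] in relevant else 0.0
        mrr += reciprocal_rank(top_k, relevant)
    n = len(runs)
    return {"recall": hits / n, "p_at_1": p_at_1 / n, "mrr": mrr / n}


# Toy data only: three queries, relevant id at rank 1, at rank 2, and missing.
example_runs = [
    (["a", "b"], {"a"}),
    (["x", "b"], {"b"}),
    (["x", "y"], {"z"}),
]
example_scores = evaluate(example_runs, k=5)
```

On the toy data, P@1 counts only the first query while MRR also gives half credit to the second, which is why the two metrics can diverge for a retriever that finds the right chunk but ranks it below the top.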
Surprising findings worth flagging
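- Anti-hallucination queries (Pioneer fallback, DKC65-20 not-in-corpus) pass on all retrievers, including dense-only: the `must_not_contain` + `expected_empty` design holds.

Deploy decision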
`HYBRID_SEARCH=true` + `RERANK_URL` set (already configured in both `.gitea/workflows/refresh.yml` + `image-only.yml` and in `deploy/docker-compose.yml`). The shared `llama-rerank` sidecar on trashpanda's Tesla P4 is already running for crop-chem-docs.

Local test setup (for reproducibility)
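With a rerank server running locally, the call-and-fall-back behaviour described for `_rerank()` can be exercised with a sketch like the one below. This is not the helper in `server.py`: the name `rerank_ids`, the injectable `post` transport, and the timeout are illustrative, and the response shape (`results` entries carrying `index` and `relevance_score`) is an assumption about what the `/v1/rerank` endpoint returns. The 2000-character truncation is the figure quoted in this PR.

```python
import json
import urllib.request
from typing import Callable, List, Optional, Sequence, Tuple

MAX_RERANK_CHARS = 2000  # per-doc truncation quoted in this PR


def http_post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def rerank_ids(
    query: str,
    candidates: Sequence[Tuple[str, str]],
    url: str,
    post: Callable[[str, dict], dict] = http_post_json,
) -> Optional[List[str]]:
    """Return candidate ids ordered by reranker score, or None on any failure.

    `candidates` is a list of (chunk_id, text) pairs. Only the text sent to
    the reranker is truncated; callers keep the full chunk text themselves.
    """
    if not candidates:
        return []  # nothing to rerank, no request needed
    try:
        payload = {
            "query": query,
            "documents": [text[:MAX_RERANK_CHARS] for _, text in candidates],
        }
        body = post(url, payload)
        # Assumed response shape:
        # {"results": [{"index": int, "relevance_score": float}, ...]}
        ranked = sorted(
            body["results"], key=lambda r: r["relevance_score"], reverse=True
        )
        return [candidates[r["index"]][0] for r in ranked]
    except Exception:
        # Any network, decoding, or shape error: the caller keeps its
        # pre-rerank ordering instead of failing the search.
        return None
```

A caller would treat `None` as "keep the hybrid ordering", for example `order = rerank_ids(q, pool, url) or [cid for cid, _ in pool]`, where `url` points at the sidecar's `/v1/rerank` endpoint. Passing a stub as `post` makes the fallback path testable without a running server.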
Note
`--ubatch-size 4096`: the default 512 is too small for ~600-token seed chunks (the rerank server logs `input (606 tokens) is too large to process` errors otherwise).

Phase 6: Reranker integration

- New _rerank(query, [(cid, doc), ...]) helper in server.py calls llama.cpp's /v1/rerank endpoint, returns reranker-ordered ids or None on failure (graceful fallback: search never blocks on the sidecar).
- search_docs + search_trials both call _rerank() on the post-hybrid pool BEFORE truncating to k. The variety-code prefilter still pins exact matches on top.
- Per-doc truncation to 2000 chars to fit jina-reranker-v2-base's per-pair token budget. Full chunk text still returned to the caller; truncation is rerank-input-only.
- Telemetry adds `reranked: true|false` so usage logs distinguish reranked calls.

Phase 7: Eval harness

- eval/queries.jsonl: 21 golden queries spanning:
  * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5, AP Iliad)
  * semantic variety queries (drought-tolerant corn, SCN MG-3 soy, Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt)
  * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701 head-to-head, silage Ton/Acre, product=DKC65-95)
  * anti-hallucination (Pioneer P1142 fallback, DKC65-20 not-in-corpus expected_empty)
- eval/retrievers.py: 4 named retrievers (dense, bm25, hybrid (dense+bm25+RRF), hybrid+rerank), all sharing the same filter shape as docs_mcp/server.py._build_where.
- eval/run_eval.py: runs each retriever against each query, reports Recall / Precision@1 / MRR / avg latency. Markdown output in eval/results/baseline.md.

Baseline results (k=5, 21 queries):

| Retriever     | Pass  | Recall | P@1 | MRR   | Avg ms |
|---------------|-------|--------|-----|-------|--------|
| hybrid+rerank | 21/21 | 100%   | 90% | 0.905 | 2064   |
| bm25          | 20/21 | 95%    | 81% | 0.833 | 5      |
| hybrid        | 15/21 | 71%    | 62% | 0.619 | 73     |
| dense         | 14/21 | 67%    | 38% | 0.440 | 79     |

Key findings:

1. hybrid+rerank wins on quality: 100% pass, 90% P@1.
2. BM25 alone is surprisingly competitive (95% pass) at 5 ms, an excellent fallback when rerank is down. The variety-code prefilter in search_docs is doing a lot of work here.
3. Dense embedding alone is the WEAKEST configuration on this corpus: variety identity tokens (DKC62-08RIB, AP Iliad, Rps3a) have no semantic neighbors, so nomic-embed-text returns noise. The hybrid (no rerank) layer actively hurts because RRF dilutes the BM25 ranking with dense noise.
4. Anti-hallucination queries (Pioneer fallback, DKC65-20 not-in-corpus) pass on ALL retrievers including dense-only: the must_not_contain + expected_empty design holds.

Deploy decision: HYBRID_SEARCH=true + RERANK_URL set (production env already has both: refresh.yml + image-only.yml + deploy/docker-compose.yml all configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
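
Finding 3 above attributes the weak hybrid-without-rerank result to RRF mixing dense noise into the BM25 ranking. The effect is easy to see in a minimal reciprocal rank fusion sketch. This is not the code in `eval/retrievers.py`: the function name `rrf_fuse`, the constant `k=60` (a common default for RRF, not a value stated in this PR), and the toy ids are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Toy example: BM25 ranks the exact variety-code match first, but the
# dense retriever never returns it.
bm25_ranking = ["exact", "x", "y"]
dense_ranking = ["x", "y", "noise"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
# fused == ["x", "y", "exact", "noise"]
```

In the toy data, ids that appear in both lists collect two score contributions and overtake the exact match that only BM25 found, so a correct top hit drops to rank 3. That is consistent with the lower P@1 reported for hybrid versus bm25, and it suggests why a reranker over the fused pool, or the variety-code prefilter, is needed to put the exact match back on top.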