Phase 6/7: wire rerank + eval harness — 100% pass on 21 golden queries

Phase 6 — Reranker integration
- New _rerank(query, [(cid, doc), ...]) helper in server.py calls
  llama.cpp's /v1/rerank endpoint, returns reranker-ordered ids
  or None on failure (graceful fallback — search never blocks
  on the sidecar).
- search_docs + search_trials both call _rerank() on the post-
  hybrid pool BEFORE truncating to k. The variety-code prefilter
  still pins exact matches on top.
- Per-doc truncation to 2000 chars to fit jina-reranker-v2-base's
  per-pair token budget. Full chunk text still returned to the
  caller — truncation is rerank-input-only.
- Telemetry adds `reranked: true|false` so usage logs distinguish
  reranked calls.
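
A minimal sketch of the rerank-before-truncate flow described above, under stated assumptions: `rerank_then_truncate`, `rerank_fn`, and the two stub rerankers are hypothetical names used only for illustration, and the real wiring lives inside search_docs / search_trials.

```python
from typing import Callable, Optional

Reranker = Callable[[str, list[tuple[str, str]]], Optional[list[str]]]


def rerank_then_truncate(
    query: str,
    pool_ids: list[str],
    id_to_doc: dict[str, str],
    rerank_fn: Reranker,
    k: int = 5,
    max_chars: int = 2000,
) -> tuple[list[str], bool]:
    """Rerank the whole candidate pool, then cut to k.

    Returns (ids, reranked_flag). If rerank_fn returns None or an empty
    list, the input order is kept, so a sidecar outage never blocks search.
    """
    # Truncation only affects what the reranker sees; id_to_doc is untouched.
    candidates = [(cid, (id_to_doc.get(cid) or "")[:max_chars]) for cid in pool_ids]
    ordered = rerank_fn(query, candidates)
    if not ordered:
        return pool_ids[:k], False
    seen = set(ordered)
    # Append ids the reranker did not return, preserving their original order.
    merged = ordered + [cid for cid in pool_ids if cid not in seen]
    return merged[:k], True


def reverse_reranker(query: str, candidates: list[tuple[str, str]]) -> list[str]:
    """Stub reranker (hypothetical): simply reverses the input order."""
    return [cid for cid, _ in reversed(candidates)]


def down_reranker(query: str, candidates: list[tuple[str, str]]) -> None:
    """Stub reranker (hypothetical): simulates an unreachable sidecar."""
    return None


docs = {"a": "corn hybrid", "b": "soy variety", "c": "wheat trial"}
print(rerank_then_truncate("q", ["a", "b", "c"], docs, reverse_reranker, k=2))
# (['c', 'b'], True)
print(rerank_then_truncate("q", ["a", "b", "c"], docs, down_reranker, k=2))
# (['a', 'b'], False)
```

Truncating to k only after the rerank step is what lets a chunk ranked below k by the hybrid pool still reach the final results.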

Phase 7 — Eval harness
- eval/queries.jsonl: 21 golden queries spanning:
    * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5,
      AP Iliad)
    * semantic variety queries (drought-tolerant corn, SCN MG-3
      soy, Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt)
    * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701 head-
      to-head, silage Ton/Acre, product=DKC65-95)
    * anti-hallucination (Pioneer P1142 fallback, DKC65-20 not-in-
      corpus expected_empty)
- eval/retrievers.py: 4 named retrievers: dense, bm25, hybrid
  (dense+bm25+RRF), and hybrid+rerank. All share the same filter
  shape as _build_where in docs_mcp/server.py.
- eval/run_eval.py: runs each retriever against each query,
  reports Recall / Precision@1 / MRR / avg latency. Markdown
  output in eval/results/baseline.md.
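
A rough sketch of how the per-query scoring could work. This is not the code in eval/run_eval.py: the `GoldenQuery` shape, the `expected_ids` field, and the choice to count expected_empty queries as 0 toward P@1 and MRR are assumptions made for illustration. Only `must_not_contain` and `expected_empty` are named in this commit.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenQuery:
    # Hypothetical record shape; expected_ids is an assumed field name.
    expected_ids: set[str] = field(default_factory=set)
    must_not_contain: list[str] = field(default_factory=list)
    expected_empty: bool = False


def score_query(q: GoldenQuery, hits: list[tuple[str, str]]) -> dict:
    """Score one ranked list of (chunk_id, text) hits against a golden query."""
    if q.expected_empty:
        # Passing means returning nothing; there is no rank to reward.
        return {"passed": not hits, "p_at_1": 0.0, "rr": 0.0}
    forbidden = any(bad in text for _, text in hits for bad in q.must_not_contain)
    # 1-based rank of the first relevant hit, or None if none came back.
    rank = next((i for i, (cid, _) in enumerate(hits, 1) if cid in q.expected_ids), None)
    return {
        "passed": rank is not None and not forbidden,
        "p_at_1": 1.0 if rank == 1 else 0.0,
        "rr": 1.0 / rank if rank else 0.0,
    }


def summarize(scores: list[dict]) -> dict:
    """Average the per-query scores into pass rate, P@1, and MRR."""
    n = len(scores)
    return {
        "pass_rate": sum(s["passed"] for s in scores) / n,
        "p_at_1": sum(s["p_at_1"] for s in scores) / n,
        "mrr": sum(s["rr"] for s in scores) / n,
    }


scores = [
    score_query(GoldenQuery(expected_ids={"c1"}), [("c1", "DKC62-08RIB"), ("c2", "other")]),
    score_query(GoldenQuery(expected_ids={"c9"}), [("c2", "other"), ("c9", "match")]),
    score_query(GoldenQuery(expected_empty=True), []),
]
summary = summarize(scores)
print(summary["pass_rate"], summary["mrr"])  # 1.0 0.5
```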

Baseline results (k=5, 21 queries):

  | Retriever       | Pass  | Recall | P@1   | MRR   | Avg ms |
  |-----------------|-------|--------|-------|-------|--------|
  | hybrid+rerank   | 21/21 | 100%   | 90%   | 0.905 | 2064   |
  | bm25            | 20/21 |  95%   | 81%   | 0.833 |    5   |
  | hybrid          | 15/21 |  71%   | 62%   | 0.619 |   73   |
  | dense           | 14/21 |  67%   | 38%   | 0.440 |   79   |

Key findings:
1. hybrid+rerank wins on quality — 100% pass, 90% P@1.
2. BM25 alone is surprisingly competitive (95% pass) at 5 ms —
   excellent fallback when rerank is down. The variety-code
   prefilter in search_docs is doing a lot of work here.
3. Dense embedding alone is the WEAKEST configuration on this
   corpus (67% pass, 38% P@1): variety identity tokens (DKC62-08RIB,
   AP Iliad, Rps3a) have no semantic neighbors, so nomic-embed-text
   returns noise. Hybrid without rerank actively hurts (71% pass vs.
   95% for BM25 alone) because RRF dilutes the BM25 ranking with
   dense noise.
4. Anti-hallucination queries (Pioneer fallback, DKC65-20 not-
   in-corpus) pass on ALL retrievers including dense-only —
   the must_not_contain + expected_empty design holds.
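
To illustrate the dilution effect in finding 3, here is a small Reciprocal Rank Fusion sketch. It mirrors the signature of `_rrf_fuse` in server.py, but the body is a reconstruction, `k=60` is an assumed value for RRF_K, and the rankings are made-up toy data.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Toy data: BM25 puts the exact-token match first; dense misses it entirely.
bm25 = ["target", "n1", "n2"]
dense = ["n1", "n2", "n3"]

fused = rrf_fuse([bm25, dense])
print(fused)  # ['n1', 'n2', 'target', 'n3']
```

Documents that appear in both lists collect two reciprocal-rank terms, so they outrank a document that only BM25 found, even when BM25 ranked it first. Here the target drops from rank 1 to rank 3.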

Deploy decision: HYBRID_SEARCH=true + RERANK_URL set
(production env already has both — refresh.yml + image-only.yml
+ deploy/docker-compose.yml all configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 17:02:57 -04:00
parent d60d747858
commit bd71f30ca7
5 changed files with 643 additions and 89 deletions
@@ -289,6 +289,75 @@ def _rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[str]:
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Per-doc character cap when sending to the reranker. jina-reranker-v2-base
# accepts up to ~1024 tokens PER QUERY+DOC PAIR (n_ctx_train) and rejects
# the WHOLE BATCH if any one pair exceeds it. Truncating each doc to
# ~2000 chars (≈ 500-700 tokens) leaves headroom for the query + chat
# template overhead. The truncation is reranking-only; full chunk text
# still goes back to the LLM caller.
RERANK_DOC_MAX_CHARS = 2000
def _rerank(query: str, candidates: list[tuple[str, str]]) -> list[str] | None:
    """Call the llama.cpp /v1/rerank endpoint and return the candidate
    chunk ids in reranker-preferred order.

    Args:
        query: the user's natural-language query
        candidates: list of ``(chunk_id, chunk_text)`` to rerank.

    Returns:
        A list of chunk_ids ordered best-first by reranker score, OR
        ``None`` if reranking is disabled, the endpoint is unreachable,
        or any other error occurs. The caller treats ``None`` as "fall
        back to the input ranking"; rerank failures must NEVER block a
        search.

    Anti-hallucination: rerank only reorders chunks the retrievers
    already surfaced. It cannot introduce content not in the corpus.
    """
    if not RERANK_URL or not candidates:
        return None
    try:
        import httpx
    except ImportError:
        return None

    # Truncate each doc to fit the per-pair token budget. jina-reranker
    # rejects the entire batch on any oversize doc.
    docs = [(text[:RERANK_DOC_MAX_CHARS] if text else "") for _cid, text in candidates]
    ids = [cid for cid, _ in candidates]
    try:
        with httpx.Client(timeout=RERANK_TIMEOUT) as c:
            r = c.post(
                f"{RERANK_URL}/v1/rerank",
                json={
                    "model": "rerank",  # llama.cpp ignores this; jina passes through
                    "query": query,
                    "documents": docs,
                },
            )
            r.raise_for_status()
            payload = r.json()
    except Exception as exc:  # noqa: BLE001
        log.warning("rerank request failed (%s) — falling back to input order", exc)
        return None

    # Guard against a non-dict payload so .get() cannot raise here.
    results = payload.get("results") if isinstance(payload, dict) else None
    if not results:
        log.warning("rerank returned empty results — falling back to input order")
        return None

    # llama.cpp returns results as [{"index": int, "relevance_score": float}, ...]
    # Higher relevance_score = better; sort descending.
    try:
        ordered = sorted(results, key=lambda item: -item.get("relevance_score", float("-inf")))
        return [ids[item["index"]] for item in ordered if 0 <= item.get("index", -1) < len(ids)]
    except (AttributeError, KeyError, IndexError, TypeError) as exc:
        # AttributeError covers non-dict entries, which have no .get().
        log.warning("rerank response malformed (%s) — falling back to input order", exc)
        return None
def _structured_ratings_block(sidecar: dict) -> str:
    """Render the sidecar's grouped characteristics + identity as a
    fact-checkable block, with the source URL pinned at top.
@@ -534,6 +603,34 @@ def search_docs(
        else:
            fuzzy_ids = dense_ids

        # Optional reranker pass over the fuzzy pool BEFORE truncating
        # to k. The cross-encoder is much more accurate at the query/
        # doc pairing than dense embedding alone, especially when the
        # query mentions specific ag terms that dense cosine similarity
        # might miss. Skipped if RERANK_URL is unset or the call
        # fails; search is never blocked on the sidecar.
        used_rerank = False
        if RERANK_URL and fuzzy_ids:
            # Need docs to rerank; fetch any missing.
            need = [i for i in fuzzy_ids if i not in id_to_doc]
            if need:
                try:
                    extra = col.get(ids=need[:RERANK_POOL], include=["documents", "metadatas"])
                    for cid, doc, meta in zip(
                        extra.get("ids") or [],
                        extra.get("documents") or [],
                        extra.get("metadatas") or [],
                    ):
                        id_to_doc[cid] = doc
                        id_to_meta[cid] = meta
                except Exception as exc:  # noqa: BLE001
                    log.warning("pre-rerank get-by-id failed: %s", exc)
            pool = [(cid, id_to_doc.get(cid, "")) for cid in fuzzy_ids[:RERANK_POOL]]
            reranked = _rerank(query, pool)
            if reranked:
                reranked_set = set(reranked)  # build once, not per element
                fuzzy_ids = reranked + [c for c in fuzzy_ids if c not in reranked_set]
                used_rerank = True

        # Pin exact-code matches at top, then fill remainder from fuzzy
        # retrieval (deduped). Pinned matches are deterministic and
        # high-confidence; they should never lose to a fuzzy match.
@@ -566,6 +663,7 @@ def search_docs(
        _call.set(
            hits_returned=len(final_ids),
            hybrid=used_hybrid,
            reranked=used_rerank,
            pool_size=pool_size,
        )
@@ -885,6 +983,30 @@ def search_trials(
        else:
            fuzzy_ids = dense_ids

        # Optional reranker pass over the fuzzy pool, same shape as
        # in search_docs. Skipped silently if RERANK_URL is unset or
        # the rerank call fails.
        used_rerank = False
        if RERANK_URL and fuzzy_ids:
            need = [i for i in fuzzy_ids if i not in id_to_doc]
            if need:
                try:
                    extra = col.get(ids=need[:RERANK_POOL], include=["documents", "metadatas"])
                    for cid, doc, meta in zip(
                        extra.get("ids") or [],
                        extra.get("documents") or [],
                        extra.get("metadatas") or [],
                    ):
                        id_to_doc[cid] = doc
                        id_to_meta[cid] = meta
                except Exception as exc:  # noqa: BLE001
                    log.warning("pre-rerank get-by-id failed: %s", exc)
            pool = [(cid, id_to_doc.get(cid, "")) for cid in fuzzy_ids[:RERANK_POOL]]
            reranked = _rerank(full_query, pool)
            if reranked:
                reranked_set = set(reranked)  # build once, not per element
                fuzzy_ids = reranked + [c for c in fuzzy_ids if c not in reranked_set]
                used_rerank = True

        # Optional product-substring post-filter: if user supplied
        # ``product``, require the chunk to actually contain the
        # token. This re-checks the bytes since BM25 only sees stems.
@@ -931,6 +1053,7 @@ def search_trials(
        _call.set(
            hits_returned=len(final_ids),
            hybrid=used_hybrid,
            reranked=used_rerank,
            pool_size=pool_size,
            data_type="trial",
        )