Phase 6/7: wire rerank + eval harness (100% pass on 21 golden queries)

Phase 6: Reranker integration
- New _rerank(query, [(cid, doc), ...]) helper in server.py calls
  llama.cpp's /v1/rerank endpoint and returns reranker-ordered ids,
  or None on failure (graceful fallback: search never blocks on
  the sidecar).
- search_docs and search_trials both call _rerank() on the
  post-hybrid pool BEFORE truncating to k. The variety-code
  prefilter still pins exact matches on top.
- Each doc is truncated to 2000 chars to fit jina-reranker-v2-base's
  per-pair token budget. The full chunk text is still returned to
  the caller; truncation applies to the rerank input only.
- Telemetry adds `reranked: true|false` so usage logs distinguish
  reranked calls from non-reranked ones.
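
The ordering described in the bullets above (rerank the pool, keep unranked leftovers behind it, pin exact-code matches on top, then truncate to k) can be sketched roughly as follows. This is a minimal illustration, not the code in server.py: the function name `assemble_results` and its parameters are hypothetical, and the real pinning logic is not part of this diff.

```python
def assemble_results(pinned_ids, fuzzy_ids, reranked_ids, k):
    """Hypothetical helper: pin exact matches, then fill from the fuzzy pool.

    reranked_ids is what a reranker returned for the pool, or None when
    reranking was skipped or failed (fall back to the input order).
    """
    if reranked_ids:
        seen = set(reranked_ids)
        # Reranked ids first; anything the reranker did not return keeps
        # its original relative order behind them.
        fuzzy_ids = list(reranked_ids) + [c for c in fuzzy_ids if c not in seen]

    final, used = [], set()
    for cid in list(pinned_ids) + list(fuzzy_ids):
        if cid not in used:  # dedupe while preserving order
            used.add(cid)
            final.append(cid)
    return final[:k]  # truncate only AFTER reranking and pinning


print(assemble_results(["p1"], ["a", "b", "c", "p1"], ["c", "a"], k=3))
print(assemble_results(["p1"], ["a", "b", "c", "p1"], None, k=3))
```

Truncating last is the point of the design: the reranker sees the whole pool, so a chunk that hybrid retrieval ranked below k can still reach the final list.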

Phase 7: Eval harness
- eval/queries.jsonl: 21 golden queries spanning:
  * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5,
    AP Iliad)
  * semantic variety queries (drought-tolerant corn, SCN MG-3
    soy, Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt)
  * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701
    head-to-head, silage Ton/Acre, product=DKC65-95)
  * anti-hallucination (Pioneer P1142 fallback, DKC65-20
    not-in-corpus expected_empty)
- eval/retrievers.py: 4 named retrievers (dense, bm25,
  hybrid = dense+bm25+RRF, hybrid+rerank), all sharing the same
  filter shape as docs_mcp/server.py._build_where.
- eval/run_eval.py: runs each retriever against each query and
  reports Recall / Precision@1 / MRR / avg latency. Markdown
  output goes to eval/results/baseline.md.
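
For readers unfamiliar with the reported metrics, here is a minimal sketch of how Recall, P@1 and MRR are commonly computed from a ranked id list. It assumes each golden query carries a set of acceptable chunk ids; the function names and that data shape are assumptions, and eval/run_eval.py may define a "pass" differently (for example via must_not_contain or expected_empty checks).

```python
def score_query(ranked_ids, relevant_ids, k=5):
    """Score one query: was a relevant id found in the top k, and at what rank?"""
    top = ranked_ids[:k]
    hit_rank = next(
        (rank for rank, cid in enumerate(top, start=1) if cid in relevant_ids),
        None,
    )
    return {
        "recall": 1.0 if hit_rank else 0.0,       # any relevant id in top k
        "p_at_1": 1.0 if hit_rank == 1 else 0.0,  # relevant id at rank 1
        "rr": 1.0 / hit_rank if hit_rank else 0.0,  # reciprocal rank
    }


def aggregate(per_query):
    """Average per-query scores; MRR is the mean of the reciprocal ranks."""
    n = len(per_query)
    return {
        "recall": sum(q["recall"] for q in per_query) / n,
        "p_at_1": sum(q["p_at_1"] for q in per_query) / n,
        "mrr": sum(q["rr"] for q in per_query) / n,
    }


scores = [
    score_query(["x", "a", "y"], {"a"}),  # hit at rank 2
    score_query(["a"], {"a"}),            # hit at rank 1
    score_query(["x", "y"], {"a"}),       # miss
]
print(aggregate(scores))
```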

Baseline results (k=5, 21 queries):

| Retriever     | Pass  | Recall | P@1 | MRR   | Avg ms |
|---------------|-------|--------|-----|-------|--------|
| hybrid+rerank | 21/21 | 100%   | 90% | 0.905 | 2064   |
| bm25          | 20/21 | 95%    | 81% | 0.833 | 5      |
| hybrid        | 15/21 | 71%    | 62% | 0.619 | 73     |
| dense         | 14/21 | 67%    | 38% | 0.440 | 79     |

Key findings:
1. hybrid+rerank wins on quality: 100% pass, 90% P@1, at the cost
   of ~2 s average latency.
2. BM25 alone is surprisingly competitive (95% pass) at 5 ms,
   which makes it an excellent fallback when rerank is down. The
   variety-code prefilter in search_docs is doing a lot of work
   here.
3. Dense embedding alone is the WEAKEST configuration on this
   corpus (67% pass, 38% P@1). Variety identity tokens
   (DKC62-08RIB, AP Iliad, Rps3a) have no semantic neighbors, so
   nomic-embed-text returns noise. Hybrid without rerank scores
   below BM25 alone (71% vs 95% pass) because RRF dilutes the
   BM25 ranking with dense noise.
4. Anti-hallucination queries (Pioneer fallback, DKC65-20
   not-in-corpus) pass on ALL retrievers, including dense-only,
   so the must_not_contain + expected_empty design holds.
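
The dilution effect in finding 3 can be reproduced with a small standalone reciprocal rank fusion sketch. The function below mirrors the `_rrf_fuse` signature visible in the diff's hunk header, but its body and the value `RRF_K = 60` are assumptions (60 is a commonly used default; the project's actual constant is not shown in this commit). The doc ids are made up for illustration.

```python
from collections import defaultdict

RRF_K = 60  # assumption: the real constant is not shown in this commit


def rrf_fuse(rankings, k=RRF_K):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# BM25 puts the exact variety match first; the dense retriever never
# surfaces it and instead agrees with BM25 only on two weaker candidates.
bm25 = ["target", "b", "c"]
dense = ["b", "c", "x"]

print(rrf_fuse([bm25]))         # BM25 alone keeps "target" at rank 1
print(rrf_fuse([bm25, dense]))  # fused: "b" and "c" now outrank "target"
```

Because RRF rewards agreement between lists, any doc that appears in both rankings beats a doc that only one retriever found, even if that retriever ranked it first. That is how a noisy dense list can push a correct BM25 top hit down.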

Deploy decision: HYBRID_SEARCH=true + RERANK_URL set
(production env already has both: refresh.yml, image-only.yml,
and deploy/docker-compose.yml are all configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -289,6 +289,75 @@ def _rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[str]:
     return sorted(scores, key=lambda d: scores[d], reverse=True)
 
 
+# Per-doc character cap when sending to the reranker. jina-reranker-v2-base
+# accepts up to ~1024 tokens PER QUERY+DOC PAIR (n_ctx_train) and rejects
+# the WHOLE BATCH if any one pair exceeds it. Truncating each doc to
+# ~2000 chars (≈ 500-700 tokens) leaves headroom for the query + chat
+# template overhead. The truncation is reranking-only — full chunk text
+# still goes back to the LLM caller.
+RERANK_DOC_MAX_CHARS = 2000
+
+
+def _rerank(query: str, candidates: list[tuple[str, str]]) -> list[str] | None:
+    """Call the llama.cpp /v1/rerank endpoint and return the candidate
+    chunk ids in reranker-preferred order.
+
+    Args:
+        query: the user's natural-language query
+        candidates: list of ``(chunk_id, chunk_text)`` to rerank.
+
+    Returns:
+        A list of chunk_ids ordered best-first by reranker score, OR
+        ``None`` if reranking is disabled, the endpoint is unreachable,
+        or any other error. The caller treats ``None`` as "fall back to
+        the input ranking" — rerank failures must NEVER block a search.
+
+    Anti-hallucination: rerank only reorders chunks the retrievers
+    already surfaced. It cannot introduce content not in the corpus.
+    """
+    if not RERANK_URL or not candidates:
+        return None
+    try:
+        import httpx
+    except ImportError:
+        return None
+
+    # Truncate each doc to fit the per-pair token budget. jina-reranker
+    # rejects the entire batch on any oversize doc.
+    docs = [(text[:RERANK_DOC_MAX_CHARS] if text else "") for _cid, text in candidates]
+    ids = [cid for cid, _ in candidates]
+
+    try:
+        with httpx.Client(timeout=RERANK_TIMEOUT) as c:
+            r = c.post(
+                f"{RERANK_URL}/v1/rerank",
+                json={
+                    "model": "rerank",  # llama.cpp ignores this; jina passes through
+                    "query": query,
+                    "documents": docs,
+                },
+            )
+            r.raise_for_status()
+            payload = r.json()
+    except Exception as exc:  # noqa: BLE001
+        log.warning("rerank request failed (%s) — falling back to input order", exc)
+        return None
+
+    results = payload.get("results") or []
+    if not results:
+        log.warning("rerank returned empty results — falling back to input order")
+        return None
+
+    # llama.cpp returns results as [{"index": int, "relevance_score": float}, ...]
+    # Higher relevance_score = better; sort descending.
+    try:
+        ordered = sorted(results, key=lambda r: -r.get("relevance_score", float("-inf")))
+        return [ids[r["index"]] for r in ordered if 0 <= r.get("index", -1) < len(ids)]
+    except (KeyError, IndexError, TypeError) as exc:
+        log.warning("rerank response malformed (%s) — falling back to input order", exc)
+        return None
+
+
 def _structured_ratings_block(sidecar: dict) -> str:
     """Render the sidecar's grouped characteristics + identity as a
     fact-checkable block, with the source URL pinned at top.
@@ -534,6 +603,34 @@ def search_docs(
         else:
             fuzzy_ids = dense_ids
 
+        # Optional reranker pass over the fuzzy pool BEFORE truncating
+        # to k. The cross-encoder is much more accurate at the query/
+        # doc pairing than dense embedding alone, especially when the
+        # query mentions specific ag terms that share-token-cosine
+        # might miss. Skipped if RERANK_URL is unset or the call
+        # fails — search is never blocked on the sidecar.
+        used_rerank = False
+        if RERANK_URL and fuzzy_ids:
+            # Need docs to rerank — fetch any missing.
+            need = [i for i in fuzzy_ids if i not in id_to_doc]
+            if need:
+                try:
+                    extra = col.get(ids=need[:RERANK_POOL], include=["documents", "metadatas"])
+                    for cid, doc, meta in zip(
+                        extra.get("ids") or [],
+                        extra.get("documents") or [],
+                        extra.get("metadatas") or [],
+                    ):
+                        id_to_doc[cid] = doc
+                        id_to_meta[cid] = meta
+                except Exception as exc:  # noqa: BLE001
+                    log.warning("pre-rerank get-by-id failed: %s", exc)
+            pool = [(cid, id_to_doc.get(cid, "")) for cid in fuzzy_ids[:RERANK_POOL]]
+            reranked = _rerank(query, pool)
+            if reranked:
+                fuzzy_ids = reranked + [c for c in fuzzy_ids if c not in set(reranked)]
+                used_rerank = True
+
         # Pin exact-code matches at top, then fill remainder from fuzzy
         # retrieval (deduped). Pinned matches are deterministic and
         # high-confidence; they should never lose to a fuzzy match.
@@ -566,6 +663,7 @@ def search_docs(
         _call.set(
             hits_returned=len(final_ids),
             hybrid=used_hybrid,
+            reranked=used_rerank,
             pool_size=pool_size,
         )
 
@@ -885,6 +983,30 @@ def search_trials(
         else:
             fuzzy_ids = dense_ids
 
+        # Optional reranker pass over the fuzzy pool — same shape as
+        # in search_docs. Skipped silently if RERANK_URL is unset or
+        # the rerank call fails.
+        used_rerank = False
+        if RERANK_URL and fuzzy_ids:
+            need = [i for i in fuzzy_ids if i not in id_to_doc]
+            if need:
+                try:
+                    extra = col.get(ids=need[:RERANK_POOL], include=["documents", "metadatas"])
+                    for cid, doc, meta in zip(
+                        extra.get("ids") or [],
+                        extra.get("documents") or [],
+                        extra.get("metadatas") or [],
+                    ):
+                        id_to_doc[cid] = doc
+                        id_to_meta[cid] = meta
+                except Exception as exc:  # noqa: BLE001
+                    log.warning("pre-rerank get-by-id failed: %s", exc)
+            pool = [(cid, id_to_doc.get(cid, "")) for cid in fuzzy_ids[:RERANK_POOL]]
+            reranked = _rerank(full_query, pool)
+            if reranked:
+                fuzzy_ids = reranked + [c for c in fuzzy_ids if c not in set(reranked)]
+                used_rerank = True
+
         # Optional product-substring post-filter: if user supplied
         # ``product``, require the chunk to actually contain the
         # token. This re-checks the bytes since BM25 only sees stems.
@@ -931,6 +1053,7 @@ def search_trials(
         _call.set(
             hits_returned=len(final_ids),
             hybrid=used_hybrid,
+            reranked=used_rerank,
             pool_size=pool_size,
             data_type="trial",
         )
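
The response-handling tail of `_rerank` can be exercised without a running sidecar by isolating it as a pure function. The sketch below copies the sort-and-filter logic from the diff into a hypothetical `order_ids` helper (that name is not in the codebase) and additionally catches `AttributeError`, which is an assumption about how non-dict entries should be treated rather than the committed behavior.

```python
def order_ids(ids, payload):
    """Hypothetical pure version of the parsing step at the end of _rerank."""
    results = payload.get("results") or []
    if not results:
        return None  # empty response: caller falls back to input order
    try:
        # Higher relevance_score is better; entries with no score sort last.
        ordered = sorted(results, key=lambda r: -r.get("relevance_score", float("-inf")))
        # Entries whose index is out of range are silently dropped.
        return [ids[r["index"]] for r in ordered if 0 <= r.get("index", -1) < len(ids)]
    except (KeyError, IndexError, TypeError, AttributeError):
        return None  # malformed response: caller falls back to input order


ids = ["a", "b", "c"]
payload = {
    "results": [
        {"index": 0, "relevance_score": 0.1},
        {"index": 2, "relevance_score": 0.9},
        {"index": 7, "relevance_score": 0.5},  # out of range, dropped
    ]
}
print(order_ids(ids, payload))
print(order_ids(ids, {}))
```

Because out-of-range entries are dropped, the returned list can be shorter than the candidate pool. That is why both callers append the ids the reranker did not return instead of replacing `fuzzy_ids` outright.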