Phase 6/7: wire rerank + eval harness — 100% pass on 21 golden queries

Phase 6: Reranker integration

- New _rerank(query, [(cid, doc), ...]) helper in server.py calls
  llama.cpp's /v1/rerank endpoint and returns reranker-ordered ids, or
  None on failure (graceful fallback: search never blocks on the
  sidecar).
- search_docs + search_trials both call _rerank() on the post-hybrid
  pool BEFORE truncating to k. The variety-code prefilter still pins
  exact matches on top.
- Per-doc truncation to 2000 chars to fit jina-reranker-v2-base's
  per-pair token budget. Full chunk text is still returned to the
  caller; truncation is rerank-input-only.
- Telemetry adds `reranked: true|false` so usage logs distinguish
  reranked calls.

Phase 7: Eval harness

- eval/queries.jsonl: 21 golden queries spanning:
  * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5,
    AP Iliad)
  * semantic variety queries (drought-tolerant corn, SCN MG-3 soy,
    Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt)
  * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701
    head-to-head, silage Ton/Acre, product=DKC65-95)
  * anti-hallucination (Pioneer P1142 fallback, DKC65-20
    not-in-corpus expected_empty)
- eval/retrievers.py: 4 named retrievers: dense, bm25, hybrid
  (dense+bm25+RRF), and hybrid+rerank. All share the same filter
  shape as docs_mcp/server.py._build_where.
- eval/run_eval.py: runs each retriever against each query, reports
  Recall / Precision@1 / MRR / avg latency. Markdown output in
  eval/results/baseline.md.

Baseline results (k=5, 21 queries):

| Retriever       | Pass  | Recall | P@1   | MRR   | Avg ms |
|-----------------|-------|--------|-------|-------|--------|
| hybrid+rerank   | 21/21 | 100%   | 90%   | 0.905 | 2064   |
| bm25            | 20/21 | 95%    | 81%   | 0.833 | 5      |
| hybrid          | 15/21 | 71%    | 62%   | 0.619 | 73     |
| dense           | 14/21 | 67%    | 38%   | 0.440 | 79     |

Key findings:

1. hybrid+rerank wins on quality: 100% pass, 90% P@1.
2. BM25 alone is surprisingly competitive (95% pass) at 5 ms, an
   excellent fallback when rerank is down. The variety-code prefilter
   in search_docs is doing a lot of work here.
3. Dense embedding alone is the WEAKEST configuration on this corpus:
   variety identity tokens (DKC62-08RIB, AP Iliad, Rps3a) have no
   semantic neighbors, so nomic-embed-text returns noise. The hybrid
   (no rerank) layer actively hurts because RRF dilutes the BM25
   ranking with dense noise.
4. Anti-hallucination queries (Pioneer fallback, DKC65-20
   not-in-corpus) pass on ALL retrievers including dense-only; the
   must_not_contain + expected_empty design holds.

Deploy decision: HYBRID_SEARCH=true + RERANK_URL set (production env
already has both: refresh.yml + image-only.yml +
deploy/docker-compose.yml are all configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
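
Finding 3 in the message above (RRF diluting the BM25 ranking with dense noise) can be illustrated with a small sketch. It assumes the standard Reciprocal Rank Fusion formula, since the body of `_rrf_fuse` is not shown in this diff, and it assumes `RRF_K = 60`, a common default rather than the value confirmed for server.py. The chunk ids are hypothetical, and `reciprocal_rank` is an illustrative helper for the per-query term behind MRR, not the code in eval/run_eval.py.

```python
from collections import defaultdict

RRF_K = 60  # assumption: the real constant lives elsewhere in server.py


def rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[str]:
    """Standard RRF: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def reciprocal_rank(ranking: list[str], expected: str) -> float:
    """Return 1 / rank of the expected id, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id == expected:
            return 1.0 / rank
    return 0.0


# Hypothetical ids: "exact" is the chunk that contains the variety code.
# BM25 ranks it first; the dense retriever does not surface it at all.
bm25 = ["exact", "near1", "near2"]
dense = ["near2", "noise", "near1"]

fused = rrf_fuse([bm25, dense])
rr_bm25 = reciprocal_rank(bm25, "exact")
rr_fused = reciprocal_rank(fused, "exact")
print(fused, rr_bm25, round(rr_fused, 3))
```

In this toy case the exact match appears in only one ranking, so chunks that appear in both lists overtake it and its reciprocal rank falls from 1.0 to 1/3. That is the same mechanism the baseline table suggests for the gap between the bm25 and hybrid rows.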
2026-05-25 17:02:57 -04:00
parent d60d747858
commit bd71f30ca7
5 changed files with 643 additions and 89 deletions
@@ -289,6 +289,75 @@ def _rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[str]:
    return sorted(scores, key=lambda d: scores[d], reverse=True)


+# Per-doc character cap when sending to the reranker. jina-reranker-v2-base
+# accepts up to ~1024 tokens PER QUERY+DOC PAIR (n_ctx_train) and rejects
+# the WHOLE BATCH if any one pair exceeds it. Truncating each doc to
+# ~2000 chars (≈ 500-700 tokens) leaves headroom for the query + chat
+# template overhead. The truncation is reranking-only — full chunk text
+# still goes back to the LLM caller.
+RERANK_DOC_MAX_CHARS = 2000
+
+
+def _rerank(query: str, candidates: list[tuple[str, str]]) -> list[str] | None:
+    """Call the llama.cpp /v1/rerank endpoint and return the candidate
+    chunk ids in reranker-preferred order.
+
+    Args:
+      query:      the user's natural-language query
+      candidates: list of ``(chunk_id, chunk_text)`` to rerank.
+
+    Returns:
+      A list of chunk_ids ordered best-first by reranker score, OR
+      ``None`` if reranking is disabled, the endpoint is unreachable,
+      or any other error. The caller treats ``None`` as "fall back to
+      the input ranking" — rerank failures must NEVER block a search.
+
+    Anti-hallucination: rerank only reorders chunks the retrievers
+    already surfaced. It cannot introduce content not in the corpus.
+    """
+    if not RERANK_URL or not candidates:
+        return None
+    try:
+        import httpx
+    except ImportError:
+        return None
+
+    # Truncate each doc to fit the per-pair token budget. jina-reranker
+    # rejects the entire batch on any oversize doc.
+    docs = [(text[:RERANK_DOC_MAX_CHARS] if text else "") for _cid, text in candidates]
+    ids = [cid for cid, _ in candidates]
+
+    try:
+        with httpx.Client(timeout=RERANK_TIMEOUT) as c:
+            r = c.post(
+                f"{RERANK_URL}/v1/rerank",
+                json={
+                    "model": "rerank",  # llama.cpp ignores this; jina passes through
+                    "query": query,
+                    "documents": docs,
+                },
+            )
+            r.raise_for_status()
+            payload = r.json()
+    except Exception as exc:  # noqa: BLE001
+        log.warning("rerank request failed (%s) — falling back to input order", exc)
+        return None
+
+    results = payload.get("results") or []
+    if not results:
+        log.warning("rerank returned empty results — falling back to input order")
+        return None
+
+    # llama.cpp returns results as [{"index": int, "relevance_score": float}, ...]
+    # Higher relevance_score = better; sort descending.
+    try:
+            ordered = sorted(results, key=lambda item: -item.get("relevance_score", float("-inf")))
+            return [ids[item["index"]] for item in ordered if 0 <= item.get("index", -1) < len(ids)]
+    except (KeyError, IndexError, TypeError) as exc:
+        log.warning("rerank response malformed (%s) — falling back to input order", exc)
+        return None
+
+
 def _structured_ratings_block(sidecar: dict) -> str:
    """Render the sidecar's grouped characteristics + identity as a
    fact-checkable block, with the source URL pinned at top.
@@ -534,6 +603,34 @@ def search_docs(
        else:
            fuzzy_ids = dense_ids

+        # Optional reranker pass over the fuzzy pool BEFORE truncating
+        # to k. The cross-encoder scores each query/doc pair jointly,
+        # so it is much more accurate than dense cosine similarity
+        # alone, especially for specific ag terms the embedding model
+        # represents poorly. Skipped if RERANK_URL is unset or the
+        # call fails; search is never blocked on the sidecar.
+        used_rerank = False
+        if RERANK_URL and fuzzy_ids:
+            # Need docs to rerank — fetch any missing.
+            need = [i for i in fuzzy_ids if i not in id_to_doc]
+            if need:
+                try:
+                    extra = col.get(ids=need[:RERANK_POOL], include=["documents", "metadatas"])
+                    for cid, doc, meta in zip(
+                        extra.get("ids") or [],
+                        extra.get("documents") or [],
+                        extra.get("metadatas") or [],
+                    ):
+                        id_to_doc[cid] = doc
+                        id_to_meta[cid] = meta
+                except Exception as exc:  # noqa: BLE001
+                    log.warning("pre-rerank get-by-id failed: %s", exc)
+            pool = [(cid, id_to_doc.get(cid, "")) for cid in fuzzy_ids[:RERANK_POOL]]
+            reranked = _rerank(query, pool)
+            if reranked:
+                fuzzy_ids = list(dict.fromkeys([*reranked, *fuzzy_ids]))
+                used_rerank = True
+
        # Pin exact-code matches at top, then fill remainder from fuzzy
        # retrieval (deduped). Pinned matches are deterministic and
        # high-confidence; they should never lose to a fuzzy match.
@@ -566,6 +663,7 @@ def search_docs(
        _call.set(
            hits_returned=len(final_ids),
            hybrid=used_hybrid,
+            reranked=used_rerank,
            pool_size=pool_size,
        )

@@ -885,6 +983,30 @@ def search_trials(
        else:
            fuzzy_ids = dense_ids

+        # Optional reranker pass over the fuzzy pool, same shape as
+        # in search_docs. Skipped if RERANK_URL is unset; a failed
+        # rerank call logs a warning and keeps the input order.
+        used_rerank = False
+        if RERANK_URL and fuzzy_ids:
+            need = [i for i in fuzzy_ids if i not in id_to_doc]
+            if need:
+                try:
+                    extra = col.get(ids=need[:RERANK_POOL], include=["documents", "metadatas"])
+                    for cid, doc, meta in zip(
+                        extra.get("ids") or [],
+                        extra.get("documents") or [],
+                        extra.get("metadatas") or [],
+                    ):
+                        id_to_doc[cid] = doc
+                        id_to_meta[cid] = meta
+                except Exception as exc:  # noqa: BLE001
+                    log.warning("pre-rerank get-by-id failed: %s", exc)
+            pool = [(cid, id_to_doc.get(cid, "")) for cid in fuzzy_ids[:RERANK_POOL]]
+            reranked = _rerank(full_query, pool)
+            if reranked:
+                fuzzy_ids = list(dict.fromkeys([*reranked, *fuzzy_ids]))
+                used_rerank = True
+
        # Optional product-substring post-filter: if user supplied
        # ``product``, require the chunk to actually contain the
        # token. This re-checks the bytes since BM25 only sees stems.
@@ -931,6 +1053,7 @@ def search_trials(
        _call.set(
            hits_returned=len(final_ids),
            hybrid=used_hybrid,
+            reranked=used_rerank,
            pool_size=pool_size,
            data_type="trial",
        )
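
The ordering and fallback behaviour added in the hunks above can be exercised in isolation without a running sidecar. The following is a minimal sketch, not the code from server.py: `order_from_payload` and `merge_reranked` are hypothetical names, the sample payloads are invented, and the parsing is slightly stricter than `_rerank` (a missing `relevance_score` is treated as malformed here rather than sorted last).

```python
from typing import Optional


def order_from_payload(ids: list[str], payload: dict) -> Optional[list[str]]:
    """Map a /v1/rerank-style response back to ids, best score first.

    Returns None when the payload is empty or malformed, which the
    caller treats as "keep the input order".
    """
    results = payload.get("results") or []
    if not results:
        return None
    try:
        ranked = sorted(results, key=lambda item: item["relevance_score"], reverse=True)
        return [ids[item["index"]] for item in ranked if 0 <= item["index"] < len(ids)]
    except (KeyError, TypeError):
        return None


def merge_reranked(fuzzy_ids: list[str], reranked: Optional[list[str]]) -> list[str]:
    """Put reranked ids first, then the remaining ids in their original order."""
    if not reranked:
        return list(fuzzy_ids)
    # dict.fromkeys dedupes while preserving first-seen order.
    return list(dict.fromkeys([*reranked, *fuzzy_ids]))


fuzzy_ids = ["a", "b", "c", "d"]
pool_ids = fuzzy_ids[:3]  # stands in for fuzzy_ids[:RERANK_POOL]

good = {"results": [
    {"index": 0, "relevance_score": 0.4},
    {"index": 2, "relevance_score": 0.9},
    {"index": 1, "relevance_score": 0.1},
]}
bad = {"results": [{"index": 0}]}  # no relevance_score field

merged_good = merge_reranked(fuzzy_ids, order_from_payload(pool_ids, good))
merged_bad = merge_reranked(fuzzy_ids, order_from_payload(pool_ids, bad))
merged_empty = merge_reranked(fuzzy_ids, order_from_payload(pool_ids, {}))
print(merged_good, merged_bad, merged_empty)
```

Ids outside the rerank pool ("d" here) keep their position at the tail, so a reranker failure or a partial response never drops a candidate from the list that is later truncated to k.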