Phase 7+8: eval harness + hybrid retrieval

## Phase 7: Eval harness

eval/retrievers.py + rag/retrieval.py: Retriever protocol with DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60), RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a thin shim re-exporting from rag.retrieval so the MCP server can use the same code at request time without making eval/ a runtime dep.

eval/run_eval.py: drives N retrievers against eval/queries.jsonl, computes MRR / Recall@K / nDCG@K, emits a markdown report with a summary table + per-query breakdown for the first retriever. Each query carries expected (source, source_key) tuples, which matches the labels-domain page-level keying.

eval/queries.jsonl: 35 curated queries: 25 brand-name (Warrant, Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro, Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic ("what controls horseweed before soybean", "fungicide for fusarium head blight", "rainfast interval for glyphosate", ...).

## Phase 8: Hybrid retrieval (BM25 + dense + RRF)

docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env. When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py existing impl), fuses on chunk_id with reciprocal-rank-fusion (RRF k=60), and returns the combined pool. Dense-only path unchanged when HYBRID_SEARCH is unset. The rendering layer (_format_hit) is untouched.

The RERANK_URL hook is also wired (_rerank_pool sends docs to llama.cpp /v1/rerank, truncated to 2000 chars per the jina-reranker n_ctx_train=1024 batch-rejection gotcha). Fails open to base order on any exception.

## Baseline numbers (k=5, pool=50, 35 queries)

| Retriever  | MRR   | Recall@5 | nDCG@5 |
|------------|-------|----------|--------|
| dense      | 0.027 | 0.086    | 0.041  |
| bm25       | 0.544 | 0.586    | 0.524  |
| hybrid-rrf | 0.114 | 0.114    | 0.108  |

Headline: BM25 dominates because farmers search for products by brand name, and brand names are exact-match tokens that lexical search nails.

Dense is poor: semantic embeddings spread across similar products and don't preferentially weight brand-name tokens. Textbook RRF hurts when one retriever is much weaker than the other: dense's irrelevant top-50 pollute the fused pool with ties at 1/(60+rank). Phase 6 reranker is the planned fix: the reranker scores each (query, chunk) pair independently and can recover the right answer regardless of base order.

Per-query report at eval/results/baseline.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
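
To make the RRF observation above concrete, here is a minimal sketch of textbook reciprocal rank fusion with k=60. It is an illustration only, not the code in rag/retrieval.py or docs_mcp/server.py: the function name `rrf_fuse`, the toy chunk ids, and the three-item rankings are all hypothetical. It shows how a correct BM25 top hit can lose its position once a weak dense list is fused in.

```python
from collections import defaultdict


def rrf_fuse(rankings, k_rrf=60):
    """Fuse ranked lists of ids with reciprocal rank fusion.

    Each id scores sum(1 / (k_rrf + rank)) over the lists it appears in,
    with rank starting at 1. Ids with equal scores keep first-seen order,
    because dicts preserve insertion order and Python's sort is stable.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k_rrf + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical toy rankings: BM25 puts the right chunk ("warrant") first,
# while the dense list never retrieves it.
bm25 = ["warrant", "x1", "x2"]
dense = ["d1", "x2", "d3"]

fused = rrf_fuse([bm25, dense])

# "x2" appears in both lists, so 1/63 + 1/62 beats the single 1/61 that
# "warrant" earns. "d1" ties with "warrant" at exactly 1/61.
reciprocal_rank = 1.0 / (fused.index("warrant") + 1)
print(fused, reciprocal_rank)
```

In this toy case the relevant chunk falls from rank 1 to rank 2, and it would fall to rank 3 if the dense list were passed first, since the tie at 1/61 is broken by first-seen order. A reranker that scores each (query, chunk) pair independently is not bound by these fused positions.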
2026-05-24 10:19:05 -04:00
parent 97a2a05b24
commit 335c33465b
6 changed files with 636 additions and 101 deletions
@@ -1,62 +1,21 @@
-"""Retriever protocol + concrete implementations.
+"""Eval-time shim — re-exports the retrievers from rag.retrieval.

-A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
-so the eval harness can compare them apples-to-apples. Implement these
-once at Phase 7 and reuse them across every retrieval change.
-
-Each retriever returns a ranked list of (bundle_id, page_id) tuples
-deduplicated to the page level (chunks within the same page collapse
-to one entry; the highest-ranked chunk's position wins).
+The retrievers live in rag/ so the MCP server can use them at request
+time without making eval/ a runtime dependency. This file exists so
+old import paths (`from eval.retrievers import ...`) keep working.
 """
-from __future__ import annotations
+from rag.retrieval import (
+    Retriever,
+    DenseRetriever,
+    BM25Retriever,
+    HybridRetriever,
+    RerankedRetriever,
+)

-from typing import Protocol, Iterable
-
-
-class Retriever(Protocol):
-    name: str
-
-    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
-        """Return up to k (bundle_id, page_id) tuples in rank order."""
-        ...
-
-
-def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
-    """Take a stream of (bundle_id, page_id, chunk_ordinal) and return
-    the first k unique pages in their first-seen order."""
-    seen: set[tuple[str, str]] = set()
-    out: list[tuple[str, str]] = []
-    for bid, pid, _ord in chunk_ids:
-        key = (bid, pid)
-        if key in seen:
-            continue
-        seen.add(key)
-        out.append(key)
-        if len(out) >= k:
-            break
-    return out
-
-
-# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
-# in place. Each one is small (15-30 LOC). The eval harness imports
-# from this module by class name.
-#
-# class DenseRetriever:
-#     name = "dense"
-#     def __init__(self, collection): self.col = collection
-#     def retrieve(self, query, k=10): ...
-#
-# class RerankedRetriever:
-#     name = "dense+rerank"
-#     def __init__(self, collection, rerank_url, pool=200): ...
-#     def retrieve(self, query, k=10): ...
-#
-# class BM25Retriever:
-#     name = "bm25"
-#     def __init__(self, bm25_index): ...
-#     def retrieve(self, query, k=10): ...
-#
-# class HybridRetriever:
-#     name = "bm25+dense+rrf"
-#     def __init__(self, dense, bm25, k_rrf=60): ...
-#     def retrieve(self, query, k=10): ...
+__all__ = [
+    "Retriever",
+    "DenseRetriever",
+    "BM25Retriever",
+    "HybridRetriever",
+    "RerankedRetriever",
+]