Phase 7+8: eval harness + hybrid retrieval
## Phase 7 — Eval harness
eval/retrievers.py + rag/retrieval.py: Retriever protocol with
DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60),
RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a
thin shim re-exporting from rag.retrieval so the MCP server can
use the same code at request time without making eval/ a runtime
dep.
eval/run_eval.py: drives N retrievers against eval/queries.jsonl,
computes MRR / Recall@K / nDCG@K, emits a markdown report with a
summary table + per-query breakdown for the first retriever. Each
query carries expected (source, source_key) tuples — matches the
labels-domain page-level keying.
eval/queries.jsonl: 35 curated queries — 25 brand-name (Warrant,
Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro,
Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic
("what controls horseweed before soybean", "fungicide for fusarium
head blight", "rainfast interval for glyphosate", ...).
## Phase 8 — Hybrid retrieval (BM25 + dense + RRF)
docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env.
When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py
existing impl), fuses on chunk_id with reciprocal-rank-fusion
(RRF k=60), and returns the combined pool. Dense-only path
unchanged when HYBRID_SEARCH is unset. The rendering layer
(_format_hit) is untouched.
The RERANK_URL hook is also wired (_rerank_pool sends docs to
llama.cpp /v1/rerank, truncated to 2000 chars per the jina-reranker
n_ctx_train=1024 batch-rejection gotcha). Fails open to base order
on any exception.
## Baseline numbers (k=5, pool=50, 35 queries)
| Retriever | MRR | Recall@5 | nDCG@5 |
|------------|-------|----------|--------|
| dense | 0.027 | 0.086 | 0.041 |
| bm25 | 0.544 | 0.586 | 0.524 |
| hybrid-rrf | 0.114 | 0.114 | 0.108 |
Headline: BM25 dominates because farmers search for products by
brand name, and brand names are exact-match tokens that lexical
search nails. Dense is poor — semantic embeddings spread across
similar products and don't preferentially weight brand-name tokens.
Textbook RRF hurts when one retriever is much weaker than the
other: dense's irrelevant top-50 pollute the fused pool with
ties at 1/(60+rank). Phase 6 reranker is the planned fix —
the reranker scores each (query, chunk) pair independently
and can recover the right answer regardless of base order.
Per-query report at eval/results/baseline.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+18
-59
@@ -1,62 +1,21 @@
|
||||
"""Retriever protocol + concrete implementations.
|
||||
"""Eval-time shim — re-exports the retrievers from rag.retrieval.
|
||||
|
||||
A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
|
||||
so the eval harness can compare them apples-to-apples. Implement these
|
||||
once at Phase 7 and reuse them across every retrieval change.
|
||||
|
||||
Each retriever returns a ranked list of (bundle_id, page_id) tuples
|
||||
deduplicated to the page level (chunks within the same page collapse
|
||||
to one entry; the highest-ranked chunk's position wins).
|
||||
The retrievers live in rag/ so the MCP server can use them at request
|
||||
time without making eval/ a runtime dependency. This file exists so
|
||||
old import paths (`from eval.retrievers import ...`) keep working.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
from rag.retrieval import (
|
||||
Retriever,
|
||||
DenseRetriever,
|
||||
BM25Retriever,
|
||||
HybridRetriever,
|
||||
RerankedRetriever,
|
||||
)
|
||||
|
||||
from typing import Protocol, Iterable
|
||||
|
||||
|
||||
class Retriever(Protocol):
|
||||
name: str
|
||||
|
||||
def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
|
||||
"""Return up to k (bundle_id, page_id) tuples in rank order."""
|
||||
...
|
||||
|
||||
|
||||
def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
|
||||
"""Take a stream of (bundle_id, page_id, chunk_ordinal) and return
|
||||
the first k unique pages in their first-seen order."""
|
||||
seen: set[tuple[str, str]] = set()
|
||||
out: list[tuple[str, str]] = []
|
||||
for bid, pid, _ord in chunk_ids:
|
||||
key = (bid, pid)
|
||||
if key in seen:
|
||||
continue
|
||||
seen.add(key)
|
||||
out.append(key)
|
||||
if len(out) >= k:
|
||||
break
|
||||
return out
|
||||
|
||||
|
||||
# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
|
||||
# in place. Each one is small (15-30 LOC). The eval harness imports
|
||||
# from this module by class name.
|
||||
#
|
||||
# class DenseRetriever:
|
||||
# name = "dense"
|
||||
# def __init__(self, collection): self.col = collection
|
||||
# def retrieve(self, query, k=10): ...
|
||||
#
|
||||
# class RerankedRetriever:
|
||||
# name = "dense+rerank"
|
||||
# def __init__(self, collection, rerank_url, pool=200): ...
|
||||
# def retrieve(self, query, k=10): ...
|
||||
#
|
||||
# class BM25Retriever:
|
||||
# name = "bm25"
|
||||
# def __init__(self, bm25_index): ...
|
||||
# def retrieve(self, query, k=10): ...
|
||||
#
|
||||
# class HybridRetriever:
|
||||
# name = "bm25+dense+rrf"
|
||||
# def __init__(self, dense, bm25, k_rrf=60): ...
|
||||
# def retrieve(self, query, k=10): ...
|
||||
__all__ = [
|
||||
"Retriever",
|
||||
"DenseRetriever",
|
||||
"BM25Retriever",
|
||||
"HybridRetriever",
|
||||
"RerankedRetriever",
|
||||
]
|
||||
|
||||
Reference in New Issue
Block a user