Phase 6/7: wire rerank + eval harness — 100% pass on 21 golden queries

Phase 6 — Reranker integration - New _rerank(query, [(cid, doc), ...]) helper in server.py calls llama.cpp's /v1/rerank endpoint, returns reranker-ordered ids or None on failure (graceful fallback — search never blocks on the sidecar). - search_docs + search_trials both call _rerank() on the post- hybrid pool BEFORE truncating to k. The variety-code prefilter still pins exact matches on top. - Per-doc truncation to 2000 chars to fit jina-reranker-v2-base's per-pair token budget. Full chunk text still returned to the caller — truncation is rerank-input-only. - Telemetry adds `reranked: true|false` so usage logs distinguish reranked calls. Phase 7 — Eval harness - eval/queries.jsonl: 21 golden queries spanning: * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5, AP Iliad) * semantic variety queries (drought-tolerant corn, SCN MG-3 soy, Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt) * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701 head- to-head, silage Ton/Acre, product=DKC65-95) * anti-hallucination (Pioneer P1142 fallback, DKC65-20 not-in- corpus expected_empty) - eval/retrievers.py: 4 named retrievers — dense, bm25, hybrid (dense+bm25+RRF), hybrid+rerank — all sharing the same filter shape as docs_mcp/server.py._build_where. - eval/run_eval.py: runs each retriever against each query, reports Recall / Precision@1 / MRR / avg latency. Markdown output in eval/results/baseline.md. Baseline results (k=5, 21 queries): | Retriever | Pass | Recall | P@1 | MRR | Avg ms | |-----------------|-------|--------|-------|-------|--------| | hybrid+rerank | 21/21 | 100% | 90% | 0.905 | 2064 | | bm25 | 20/21 | 95% | 81% | 0.833 | 5 | | hybrid | 15/21 | 71% | 62% | 0.619 | 73 | | dense | 14/21 | 67% | 38% | 0.440 | 79 | Key findings: 1. hybrid+rerank wins on quality — 100% pass, 90% P@1. 2. BM25 alone is surprisingly competitive (95% pass) at 5 ms — excellent fallback when rerank is down. The variety-code prefilter in search_docs is doing a lot of work here. 3. Dense embedding alone is the WEAKEST configuration on this corpus — variety identity tokens (DKC62-08RIB, AP Iliad, Rps3a) have no semantic neighbors, so nomic-embed-text returns noise. The hybrid (no rerank) layer actively hurts because RRF dilutes the BM25 ranking with dense noise. 4. Anti-hallucination queries (Pioneer fallback, DKC65-20 not- in-corpus) pass on ALL retrievers including dense-only — the must_not_contain + expected_empty design holds. Deploy decision: HYBRID_SEARCH=true + RERANK_URL set (production env already has both — refresh.yml + image-only.yml + deploy/docker-compose.yml all configured). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 17:02:57 -04:00
parent d60d747858
commit bd71f30ca7
5 changed files with 643 additions and 89 deletions
@@ -1,32 +1,60 @@
 """Run all retrievers against eval/queries.jsonl, emit a markdown report.

+For seed-mcp, the "expected" answer for many queries isn't a single
+chunk — it's "a chunk satisfying these constraints." So per-query
+scoring is one of:
+
+  expected_source_keys     — at least one of these source_keys appears
+                             in top-k (used for variety-code queries
+                             with a single canonical answer)
+  expected_metadata        — all top-k must match these key=value
+                             constraints (e.g. crop=corn, year=2024)
+  expected_substrings      — at least one top-k chunk's text/metadata
+                             contains each substring (e.g. "SCN" must
+                             appear when querying SCN resistance)
+  must_not_contain_source_keys — anti-hallucination: NO top-k chunk's
+                                  source_key may contain these tokens
+                                  (Pioneer fallback queries)
+  expected_empty           — top-k MUST be empty (anti-hallucination)
+  expect_lessons_call      — the agent should call api_lessons; not
+                             measurable from retrieval alone, recorded
+                             as an advisory note
+
 Metrics computed per retriever:

-  MRR        — mean reciprocal rank of the FIRST expected page in the
-               ranked result list (0 if not in top-k).
-  Recall@K   — fraction of expected pages that appear in top-K.
-  nDCG@K     — discounted gain weighted by rank position.
+  recall_known    — fraction of queries where the retriever returned
+                    a chunk satisfying the query's expectations
+  precision_top1  — fraction of queries where the FIRST result
+                    satisfied expectations
+  mrr             — mean reciprocal rank of the FIRST satisfying chunk

-The "right" number depends on what you're measuring. MRR tracks "the
-first-line answer is correct"; Recall@K tracks "everything relevant
-is there to draw from"; nDCG@K is a smoother combination of both.
-For docs-RAG, MRR is usually the headline metric.
+Plus a per-query breakdown table so you can see exactly where each
+retriever wins or loses.

 Usage:
-
    python -m eval.run_eval \\
        --queries eval/queries.jsonl \\
        --k 5 \\
+        --rerank-url http://localhost:18080 \\
        --output eval/results/baseline.md
 """
 from __future__ import annotations

 import argparse
 import json
-import math
+import logging
+import os
+import sys
 import time
 from pathlib import Path
-from typing import Iterable
+
+# Add repo root for imports
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from eval.retrievers import build_all_retrievers  # noqa: E402
+
+logging.getLogger("chromadb").setLevel(logging.ERROR)
+logging.getLogger("httpx").setLevel(logging.ERROR)


 def load_queries(path: Path) -> list[dict]:
@@ -34,31 +62,203 @@ def load_queries(path: Path) -> list[dict]:
        return [json.loads(line) for line in fh if line.strip()]


-def reciprocal_rank(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]]) -> float:
-    expected_set = set(expected)
-    for i, page in enumerate(retrieved, start=1):
-        if page in expected_set:
-            return 1.0 / i
-    return 0.0
+def _doc_satisfies(meta: dict, doc: str, query_spec: dict) -> bool:
+    """Does this single retrieved (metadata, doc) tuple satisfy the
+    query spec? Used by the 'first satisfying' metric."""
+    sk = meta.get("source_key") or ""
+    # exact source_key match
+    if "expected_source_keys" in query_spec:
+        for want in query_spec["expected_source_keys"]:
+            if want.lower() == sk.lower():
+                return True
+        return False
+    # all metadata constraints match
+    if "expected_metadata" in query_spec:
+        for k, v in query_spec["expected_metadata"].items():
+            mv = meta.get(k)
+            if isinstance(v, int):
+                if mv != v:
+                    return False
+            else:
+                if (mv or "").lower() != str(v).lower():
+                    return False
+        # if no substring requirement, metadata match is enough
+        if "expected_substrings" not in query_spec:
+            return True
+    # at least one substring present (in doc OR metadata values)
+    if "expected_substrings" in query_spec:
+        haystack = (doc + " " + " ".join(str(v) for v in meta.values())).lower()
+        return any(s.lower() in haystack for s in query_spec["expected_substrings"])
+    return False


-def recall_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
-    if not expected:
-        return 0.0
-    retrieved_set = set(retrieved[:k])
-    hits = sum(1 for e in expected if e in retrieved_set)
-    return hits / len(expected)
+def _evaluate_one(retriever, query_spec: dict, k: int, col) -> dict:
+    """Return per-query metrics for one retriever."""
+    query = query_spec["query"]
+    filters = dict(query_spec.get("filters") or {})
+    # search_trials queries imply data_type=trial; search_docs implies variety
+    tool = query_spec.get("tool", "search_docs")
+    if tool == "search_trials":
+        filters.setdefault("data_type", "trial")
+    elif tool == "search_docs":
+        filters.setdefault("data_type", "variety")
+    # 'product' is a server-side post-filter, not Chroma; strip
+    product = filters.pop("product", None)
+
+    t0 = time.monotonic()
+    ids = retriever.retrieve(query, k, filters)
+    elapsed_ms = (time.monotonic() - t0) * 1000
+
+    # Anti-hallucination queries: expected_empty should return nothing
+    # (BUT we still allow the retriever to surface chunks if the
+    # product filter would filter them out at the server level — so
+    # we re-apply the product filter here).
+    if product:
+        try:
+            extra = col.get(ids=ids, include=["documents"])
+            id_to_doc = dict(zip(extra.get("ids") or [], extra.get("documents") or []))
+        except Exception:
+            id_to_doc = {}
+        ids = [cid for cid in ids if product.lower() in id_to_doc.get(cid, "").lower()]
+
+    if query_spec.get("expected_empty"):
+        passed = len(ids) == 0
+        return {
+            "query": query, "retriever": retriever.name,
+            "k": k, "n_hits": len(ids), "rank_first_match": None,
+            "passed": passed, "elapsed_ms": round(elapsed_ms, 1),
+            "kind": "expected_empty",
+        }
+
+    if "must_not_contain_source_keys" in query_spec:
+        bad_tokens = [t.lower() for t in query_spec["must_not_contain_source_keys"]]
+        try:
+            extra = col.get(ids=ids, include=["metadatas"])
+            metas = extra.get("metadatas") or []
+        except Exception:
+            metas = []
+        # PASS = no top-k chunk's source_key contains a forbidden token
+        for m in metas:
+            sk = (m.get("source_key") or "").lower()
+            if any(t in sk for t in bad_tokens):
+                return {
+                    "query": query, "retriever": retriever.name,
+                    "k": k, "n_hits": len(ids), "rank_first_match": None,
+                    "passed": False, "elapsed_ms": round(elapsed_ms, 1),
+                    "kind": "must_not_contain",
+                }
+        return {
+            "query": query, "retriever": retriever.name,
+            "k": k, "n_hits": len(ids), "rank_first_match": None,
+            "passed": True, "elapsed_ms": round(elapsed_ms, 1),
+            "kind": "must_not_contain",
+        }
+
+    # Positive-match query: pull docs+meta and check each
+    try:
+        extra = col.get(ids=ids, include=["documents", "metadatas"])
+        docs = extra.get("documents") or []
+        metas = extra.get("metadatas") or []
+        ext_ids = extra.get("ids") or []
+        order_idx = {cid: i for i, cid in enumerate(ext_ids)}
+    except Exception:
+        docs = []
+        metas = []
+        order_idx = {}
+
+    rank_first = None
+    for rank, cid in enumerate(ids, start=1):
+        i = order_idx.get(cid)
+        if i is None:
+            continue
+        if _doc_satisfies(metas[i], docs[i], query_spec):
+            rank_first = rank
+            break
+
+    return {
+        "query": query, "retriever": retriever.name,
+        "k": k, "n_hits": len(ids),
+        "rank_first_match": rank_first,
+        "passed": rank_first is not None,
+        "elapsed_ms": round(elapsed_ms, 1),
+        "kind": "positive",
+    }


-def ndcg_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
-    expected_set = set(expected)
-    dcg = 0.0
-    for i, page in enumerate(retrieved[:k], start=1):
-        if page in expected_set:
-            dcg += 1.0 / math.log2(i + 1)
-    # Ideal DCG: every expected page in the top positions.
-    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
-    return dcg / idcg if idcg else 0.0
+def _aggregate(results: list[dict]) -> dict:
+    """Aggregate per-query results into MRR / recall / precision@1."""
+    by_retriever: dict[str, list[dict]] = {}
+    for r in results:
+        by_retriever.setdefault(r["retriever"], []).append(r)
+    out: dict[str, dict] = {}
+    for name, rows in by_retriever.items():
+        n = len(rows)
+        passed = sum(1 for r in rows if r["passed"])
+        ranks = [r["rank_first_match"] for r in rows
+                 if r["passed"] and r.get("rank_first_match")]
+        mrr = sum(1.0 / r for r in ranks) / n if n else 0.0
+        precision1 = sum(1 for r in rows if r["passed"] and r.get("rank_first_match") == 1) / n if n else 0.0
+        avg_ms = sum(r["elapsed_ms"] for r in rows) / n if n else 0.0
+        out[name] = {
+            "n_queries": n,
+            "passed": passed,
+            "recall_known": passed / n if n else 0.0,
+            "precision_top1": precision1,
+            "mrr": mrr,
+            "avg_latency_ms": round(avg_ms, 1),
+        }
+    return out
+
+
+def _emit_markdown(queries: list[dict], results: list[dict],
+                   summary: dict, k: int) -> str:
+    lines: list[str] = []
+    lines.append(f"# seed-mcp retrieval eval — k={k}")
+    lines.append("")
+    lines.append(f"_{len(queries)} golden queries × {len(summary)} retrievers_")
+    lines.append("")
+    lines.append("## Summary")
+    lines.append("")
+    lines.append("| Retriever | Passed | Recall | P@1 | MRR | Avg ms |")
+    lines.append("|---|---|---|---|---|---|")
+    for name in sorted(summary, key=lambda n: -summary[n]["mrr"]):
+        s = summary[name]
+        lines.append(
+            f"| **{name}** | {s['passed']}/{s['n_queries']} "
+            f"| {s['recall_known']:.2%} | {s['precision_top1']:.2%} "
+            f"| {s['mrr']:.3f} | {s['avg_latency_ms']:.0f} |"
+        )
+    lines.append("")
+    lines.append("**Recall** = % of queries where ≥1 top-k chunk satisfied the spec. "
+                 "**P@1** = % where the very first result satisfied it. "
+                 "**MRR** = mean of `1 / rank-of-first-satisfying-result` (0 if missed).")
+    lines.append("")
+
+    # Per-query breakdown
+    lines.append("## Per-query results")
+    lines.append("")
+    by_query: dict[str, list[dict]] = {}
+    for r in results:
+        by_query.setdefault(r["query"], []).append(r)
+    retriever_names = sorted({r["retriever"] for r in results})
+    header = "| Query | " + " | ".join(retriever_names) + " |"
+    sep = "|" + "---|" * (len(retriever_names) + 1)
+    lines.append(header)
+    lines.append(sep)
+    for q in queries:
+        cells = [f"`{q['query'][:60]}`"]
+        for name in retriever_names:
+            r = next((x for x in by_query.get(q["query"], []) if x["retriever"] == name), None)
+            if r is None:
+                cells.append("?")
+            elif r["passed"]:
+                rk = r.get("rank_first_match")
+                cells.append(f"✅ #{rk}" if rk else "✅")
+            else:
+                cells.append("❌")
+        lines.append("| " + " | ".join(cells) + " |")
+    lines.append("")
+    return "\n".join(lines) + "\n"


 def main() -> int:
@@ -66,25 +266,56 @@ def main() -> int:
    p.add_argument("--queries", type=Path, default=Path("eval/queries.jsonl"))
    p.add_argument("--k", type=int, default=5)
    p.add_argument("--output", type=Path, default=Path("eval/results/baseline.md"))
+    p.add_argument("--rerank-url", default=os.environ.get("RERANK_URL", ""))
+    p.add_argument("--product-name", default=os.environ.get("PRODUCT_NAME", "crop_seed"))
    args = p.parse_args()

    if not args.queries.exists():
        print(f"queries file not found: {args.queries}")
-        print("hint: copy eval/queries.jsonl.example and edit")
        return 1

    queries = load_queries(args.queries)
    print(f"loaded {len(queries)} queries")

-    # TODO Phase 7: instantiate the retrievers you implemented in
-    # eval/retrievers.py and run each one against each query.
-    # Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
-    # markdown table to args.output. Commit the file alongside the
-    # PR that changes retrieval.
-    raise NotImplementedError(
-        "Wire up the retrievers in eval/retrievers.py first, then "
-        "fill in this evaluation loop. See PLAN.md Phase 7."
+    # Connect to Chroma + BM25
+    import chromadb
+    from chromadb.config import Settings
+    from rag.embeddings import embedding_function
+    from rag.bm25 import BM25Index
+
+    repo_root = Path(__file__).resolve().parent.parent
+    client = chromadb.PersistentClient(
+        path=str(repo_root / "chroma"),
+        settings=Settings(anonymized_telemetry=False),
    )
+    col = client.get_collection(f"{args.product_name}_docs",
+                                embedding_function=embedding_function())
+    bm25 = BM25Index(repo_root / "bm25" / f"{args.product_name}_docs.db")
+    print(f"chroma: {col.count()} chunks; bm25: {bm25.count()} chunks")
+
+    retrievers = build_all_retrievers(col, bm25, args.rerank_url or None)
+    print(f"retrievers: {[r.name for r in retrievers]}")
+
+    all_results: list[dict] = []
+    for r in retrievers:
+        print(f"running {r.name}...")
+        for q in queries:
+            res = _evaluate_one(r, q, args.k, col)
+            all_results.append(res)
+
+    summary = _aggregate(all_results)
+    md = _emit_markdown(queries, all_results, summary, args.k)
+    args.output.parent.mkdir(parents=True, exist_ok=True)
+    args.output.write_text(md, encoding="utf-8")
+    print(f"\nreport: {args.output}")
+    print()
+    # Print summary to stdout too
+    for line in md.split("\n"):
+        if line.startswith("|"):
+            print(line)
+        if line.startswith("## Per-query"):
+            break
+    return 0


 if __name__ == "__main__":