seed-mcp/eval/results/baseline.md
justin bd71f30ca7 Phase 6/7: wire rerank + eval harness — 100% pass on 21 golden queries
Phase 6 — Reranker integration
- New _rerank(query, [(cid, doc), ...]) helper in server.py calls
  llama.cpp's /v1/rerank endpoint, returns reranker-ordered ids
  or None on failure (graceful fallback — search never blocks
  on the sidecar).
- search_docs + search_trials both call _rerank() on the post-
  hybrid pool BEFORE truncating to k. The variety-code prefilter
  still pins exact matches on top.
- Per-doc truncation to 2000 chars to fit jina-reranker-v2-base's
  per-pair token budget. Full chunk text still returned to the
  caller — truncation is rerank-input-only.
- Telemetry adds `reranked: true|false` so usage logs distinguish
  reranked calls.
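
The rerank-with-fallback behaviour described above can be sketched roughly as follows. This is not the code in server.py: the name `rerank_ids`, the injectable `post` parameter, and the constant name are hypothetical, and the reply shape (a `results` list whose items carry `index` and `relevance_score`) is assumed to follow the Jina-style rerank API that llama.cpp's server exposes.

```python
import json
import urllib.request
from typing import Callable, Optional, Sequence

MAX_RERANK_CHARS = 2000  # rerank-input-only truncation, per the notes above


def _http_post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST a JSON body and decode the JSON reply (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def rerank_ids(
    query: str,
    candidates: Sequence[tuple[str, str]],
    url: str,
    post: Callable[[str, dict], dict] = _http_post_json,
) -> Optional[list[str]]:
    """Return candidate ids in reranker order, or None on any failure."""
    if not candidates:
        return []
    payload = {
        "query": query,
        # Only the text sent to the reranker is truncated; callers keep full chunks.
        "documents": [doc[:MAX_RERANK_CHARS] for _, doc in candidates],
    }
    try:
        results = post(url, payload)["results"]
        ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
        return [candidates[r["index"]][0] for r in ranked]
    except Exception:
        # Sidecar down, timeout, or malformed reply: caller keeps the hybrid order.
        return None
```

A caller would treat a `None` return as "keep the existing hybrid order" and record `reranked: false` in telemetry.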

Phase 7 — Eval harness
- eval/queries.jsonl: 21 golden queries spanning:
    * variety-code lookups (DKC62-08RIB, AG29XF4, WB6430, E085Z5,
      AP Iliad)
    * semantic variety queries (drought-tolerant corn, SCN MG-3
      soy, Rps3a, XtendFlex, HRS stripe rust, SWW PNW, Goss's Wilt)
    * trial queries (IA/IN/MN regional, AP Iliad ID, NK1701 head-
      to-head, silage Ton/Acre, product=DKC65-95)
    * anti-hallucination (Pioneer P1142 fallback, DKC65-20 not-in-
      corpus expected_empty)
- eval/retrievers.py: 4 named retrievers — dense, bm25, hybrid
  (dense+bm25+RRF), hybrid+rerank — all sharing the same filter
  shape as docs_mcp/server.py._build_where.
- eval/run_eval.py: runs each retriever against each query,
  reports Recall / Precision@1 / MRR / avg latency. Markdown
  output in eval/results/baseline.md.
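
For readers unfamiliar with the RRF step in the hybrid retriever, here is a minimal sketch of reciprocal rank fusion. The function name `rrf_fuse` is hypothetical, and `k=60` is the conventional constant rather than a value confirmed for eval/retrievers.py.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ids = ["A", "B", "C"]   # toy lexical ranking
dense_ids = ["X", "Y", "B"]  # toy dense ranking
print(rrf_fuse([bm25_ids, dense_ids]))  # ['B', 'A', 'X', 'Y', 'C']
```

In this toy input, BM25's top hit `A` slips to second place because `B` appears in both lists, which is the kind of reordering a reranker pass can then correct.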

Baseline results (k=5, 21 queries):

  | Retriever       | Pass  | Recall | P@1   | MRR   | Avg ms |
  |-----------------|-------|--------|-------|-------|--------|
  | hybrid+rerank   | 21/21 | 100%   | 90%   | 0.905 | 2064   |
  | bm25            | 20/21 |  95%   | 81%   | 0.833 |    5   |
  | hybrid          | 15/21 |  71%   | 62%   | 0.619 |   73   |
  | dense           | 14/21 |  67%   | 38%   | 0.440 |   79   |

Key findings:
1. hybrid+rerank wins on quality — 100% pass, 90% P@1.
2. BM25 alone is surprisingly competitive (95% pass) at 5 ms —
   excellent fallback when rerank is down. The variety-code
   prefilter in search_docs is doing a lot of work here.
3. Dense embedding alone is the WEAKEST configuration on this
   corpus (14/21 pass, 38% P@1): variety identity tokens
   (DKC62-08RIB, AP Iliad, Rps3a) have no semantic neighbors, so
   nomic-embed-text returns noise. Hybrid without rerank scores
   below BM25 alone (15/21 vs 20/21) because RRF dilutes the
   BM25 ranking with that dense noise.
4. Anti-hallucination queries (Pioneer fallback, DKC65-20 not-
   in-corpus) pass on ALL retrievers including dense-only —
   the must_not_contain + expected_empty design holds.
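
One way such a negative check could work is sketched below. The field names `must_not_contain` and `expected_empty` come from the notes above, but their exact semantics, the case-insensitive matching, and the function name `passes_negative_spec` are assumptions for illustration only.

```python
from typing import Sequence


def passes_negative_spec(
    chunks: Sequence[str],
    must_not_contain: Sequence[str] = (),
    expected_empty: bool = False,
) -> bool:
    """Pass when results are empty (if required) and no forbidden term appears."""
    if expected_empty:
        # Assumed meaning: the retriever must return nothing after filtering.
        return len(chunks) == 0
    lowered = [chunk.lower() for chunk in chunks]
    # Assumed meaning: no returned chunk may mention any forbidden term.
    return not any(term.lower() in text for term in must_not_contain for text in lowered)
```

Under these assumptions a query passes without any rank being recorded, which matches the unranked check marks in the per-query table.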

Deploy decision: HYBRID_SEARCH=true + RERANK_URL set
(production env already has both — refresh.yml + image-only.yml
+ deploy/docker-compose.yml all configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 17:02:57 -04:00

# seed-mcp retrieval eval — k=5
_21 golden queries × 4 retrievers_
## Summary
| Retriever | Passed | Recall | P@1 | MRR | Avg ms |
|---|---|---|---|---|---|
| **hybrid+rerank** | 21/21 | 100.00% | 90.48% | 0.905 | 2064 |
| **bm25** | 20/21 | 95.24% | 80.95% | 0.833 | 5 |
| **hybrid** | 15/21 | 71.43% | 61.90% | 0.619 | 73 |
| **dense** | 14/21 | 66.67% | 38.10% | 0.440 | 79 |
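
**Recall** = % of queries where ≥1 top-k chunk satisfied the spec. **P@1** = % where the very first result satisfied it. **MRR** = mean of `1 / rank-of-first-satisfying-result` (0 if missed). The two anti-hallucination queries pass without a rank, so they appear to count toward Recall but contribute 0 to P@1 and MRR, which would make 19/21 = 90.48% the P@1 ceiling for this query set.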
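
As a sanity check on these definitions, the sketch below recomputes the bm25 summary row from the per-query ranks. The `summarize` helper and the `(passed, rank)` encoding are hypothetical, not the code in `eval/run_eval.py`; queries that pass without a rank are assumed to add 0 to P@1 and MRR.

```python
from typing import Optional, Sequence


def summarize(outcomes: Sequence[tuple[bool, Optional[int]]]) -> dict:
    """Compute Recall, P@1 and MRR from (passed, rank-of-first-hit) pairs."""
    n = len(outcomes)
    recall = sum(1 for passed, _ in outcomes if passed) / n
    p_at_1 = sum(1 for _, rank in outcomes if rank == 1) / n
    # A missing rank (miss, or a rank-less pass) contributes 0 to the mean.
    mrr = sum(1.0 / rank for _, rank in outcomes if rank) / n
    return {"recall": recall, "p_at_1": p_at_1, "mrr": mrr}


# bm25 column: 17 hits at #1, one at #2, one miss, two rank-less passes.
bm25 = [(True, 1)] * 17 + [(True, 2)] + [(False, None)] + [(True, None)] * 2
stats = summarize(bm25)
print(f"{stats['recall']:.2%} {stats['p_at_1']:.2%} {stats['mrr']:.3f}")
# 95.24% 80.95% 0.833
```
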
## Per-query results
| Query | bm25 | dense | hybrid | hybrid+rerank |
|---|---|---|---|---|
| `DKC62-08RIB ratings` | ✅ #1 | ❌ | ❌ | ✅ #1 |
| `AG29XF4 disease ratings` | ✅ #1 | ❌ | ❌ | ✅ #1 |
| `WB6430 westbred wheat` | ✅ #1 | ❌ | ❌ | ✅ #1 |
| `E085Z5 corn` | ✅ #1 | ❌ | ❌ | ✅ #1 |
| `AP Iliad wheat performance` | ✅ #1 | ❌ | ❌ | ✅ #1 |
| `drought tolerant corn for sandy soil short season Iowa` | ✅ #2 | ✅ #1 | ✅ #1 | ✅ #1 |
| `soybean cyst nematode SCN resistant variety` | ✅ #1 | ✅ #1 | ✅ #1 | ✅ #1 |
| `Phytophthora resistance Rps3a soybean` | ✅ #1 | ✅ #2 | ✅ #1 | ✅ #1 |
| `XtendFlex soybean Northern Plains` | ❌ | ✅ #1 | ✅ #1 | ✅ #1 |
| `Hard Red Spring wheat stripe rust resistance` | ✅ #1 | ✅ #3 | ✅ #1 | ✅ #1 |
| `Soft White Winter wheat Pacific Northwest` | ✅ #1 | ✅ #5 | ✅ #1 | ✅ #1 |
| `Goss's Wilt resistance corn` | ✅ #1 | ✅ #1 | ✅ #1 | ✅ #1 |
| `best corn 2024 Iowa` | ✅ #1 | ✅ #1 | ✅ #1 | ✅ #1 |
| `Indiana corn yield comparison 2024` | ✅ #1 | ✅ #1 | ✅ #1 | ✅ #1 |
| `AP Iliad Idaho wheat trial` | ✅ #1 | ✅ #5 | ✅ #1 | ✅ #1 |
| `DKC65-95 corn yield in trials` | ✅ #1 | ❌ | ✅ #1 | ✅ #1 |
| `NK1701 corn trials head to head` | ✅ #1 | ❌ | ❌ | ✅ #1 |
| `silage corn high milk per acre dairy` | ✅ #1 | ✅ #1 | ✅ #1 | ✅ #1 |
| `soybean 2025 Minnesota top performers` | ✅ #1 | ✅ #1 | ✅ #1 | ✅ #1 |
| `Pioneer P1142 hybrid recommendation` | ✅ | ✅ | ✅ | ✅ |
| `DKC65-20 yield Alabama trial` | ✅ | ✅ | ✅ | ✅ |