crop-chem-docs/eval/results/baseline.md
justin 335c33465b Phase 7+8: eval harness + hybrid retrieval
## Phase 7 — Eval harness

eval/retrievers.py + rag/retrieval.py: Retriever protocol with
DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60),
RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a
thin shim re-exporting from rag.retrieval so the MCP server can
use the same code at request time without making eval/ a runtime
dep.
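
A minimal sketch of what a shared retriever interface could look like, assuming a structural `typing.Protocol`. The `Hit` dataclass, the `retrieve` signature, and `KeywordOverlapRetriever` are hypothetical illustrations, not the actual definitions in rag/retrieval.py:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    """Hypothetical result record; the real field names may differ."""
    chunk_id: str
    score: float


class Retriever(Protocol):
    """Anything with a matching retrieve() satisfies the protocol."""

    def retrieve(self, query: str, k: int) -> list[Hit]: ...


class KeywordOverlapRetriever:
    """Toy stand-in (NOT the repo's BM25Retriever): counts shared terms."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.docs = docs

    def retrieve(self, query: str, k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = [
            Hit(chunk_id, float(len(terms & set(text.lower().split()))))
            for chunk_id, text in self.docs.items()
        ]
        # Highest score first; chunk_id breaks ties deterministically.
        hits.sort(key=lambda h: (-h.score, h.chunk_id))
        return hits[:k]


def top_ids(retriever: Retriever, query: str, k: int = 5) -> list[str]:
    """Caller code depends only on the protocol, not on a concrete class."""
    return [hit.chunk_id for hit in retriever.retrieve(query, k)]
```

Because the protocol is structural, the eval harness and the MCP server could both call `top_ids` with any retriever without importing each other.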

eval/run_eval.py: drives N retrievers against eval/queries.jsonl,
computes MRR / Recall@K / nDCG@K, emits a markdown report with a
summary table + per-query breakdown for the first retriever. Each
query carries expected (source, source_key) tuples — matches the
labels-domain page-level keying.
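
The three metrics can be sketched as below, assuming binary relevance, a deduplicated ranked list of (source, source_key) tuples, and a cutoff at k for all three. Whether eval/run_eval.py cuts MRR at k or at the pool size is not stated here, so treat the function names and cutoffs as assumptions:

```python
import math

Key = tuple[str, str]  # (source, source_key)


def reciprocal_rank(ranked: list[Key], expected: set[Key], k: int) -> float:
    """1/rank of the first expected key within the top k, else 0."""
    for rank, key in enumerate(ranked[:k], start=1):
        if key in expected:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[Key], expected: set[Key], k: int) -> float:
    """Fraction of expected keys that appear within the top k."""
    if not expected:
        return 0.0
    return len(set(ranked[:k]) & expected) / len(expected)


def ndcg_at_k(ranked: list[Key], expected: set[Key], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, key in enumerate(ranked[:k], start=1)
        if key in expected
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(expected), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Dense result for the "Poncho 600 clothianidin seed treatment corn" query.
ranked = [("epa_ppls", "7969-459"), ("epa_ppls", "7969-458"), ("bayer", "poncho-beta")]
expected = {("epa_ppls", "7969-458")}
rr = reciprocal_rank(ranked, expected, k=5)
rec = recall_at_k(ranked, expected, k=5)
ndcg = ndcg_at_k(ranked, expected, k=5)
```

With the expected page at rank 2, this gives a reciprocal rank of 0.5 and a recall of 1.0, which agrees with the 0.50 / 1.00 reported for that query in the dense per-query breakdown.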

eval/queries.jsonl: 35 curated queries — 25 brand-name (Warrant,
Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro,
Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic
("what controls horseweed before soybean", "fungicide for fusarium
head blight", "rainfast interval for glyphosate", ...).
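
For illustration only, a query record might be loaded as sketched here. The JSON field names `query` and `expected` are assumptions (the actual schema of eval/queries.jsonl is not shown); the sample values come from the Warrant query in the per-query report:

```python
import json

# Hypothetical record layout; only the query text and expected keys are real.
SAMPLE = (
    '{"query": "Warrant herbicide rate for soybean", '
    '"expected": [["bayer", "warrant"], ["epa_ppls", "524-591"]]}\n'
)


def parse_queries(lines):
    """Parse JSONL lines into (query, set of (source, source_key)) pairs."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        expected = {tuple(pair) for pair in record["expected"]}
        parsed.append((record["query"], expected))
    return parsed


queries = parse_queries([SAMPLE, "\n"])
```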

## Phase 8 — Hybrid retrieval (BM25 + dense + RRF)

docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env.
When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py
existing impl), fuses on chunk_id with reciprocal-rank-fusion
(RRF k=60), and returns the combined pool. Dense-only path
unchanged when HYBRID_SEARCH is unset. The rendering layer
(_format_hit) is untouched.
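
Reciprocal-rank fusion on chunk_id can be sketched as follows. This is a generic RRF with k=60, not the actual `_search_chunks` code; `rrf_fuse` and the sample chunk ids are made up for the example, and each input list is assumed to contain no duplicate ids:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked chunk_id lists: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk_id breaks ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_ids = ["c1", "c2", "c3"]  # hypothetical dense (Chroma) order
bm25_ids = ["c3", "c4", "c1"]   # hypothetical BM25 order
fused_ids = [chunk_id for chunk_id, _ in rrf_fuse([dense_ids, bm25_ids])]
```

Chunks found by both retrievers (`c1`, `c3`) accumulate two terms and rise above chunks found by only one, which is the behaviour RRF relies on.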

The RERANK_URL hook is also wired: _rerank_pool sends docs to
llama.cpp /v1/rerank, each truncated to 2000 chars to work around
the jina-reranker n_ctx_train=1024 batch-rejection gotcha. It
fails open to base order on any exception.
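
The fail-open behaviour can be sketched as below. The scoring call is injected as `score_fn` so the example stays runnable without a server; that parameter, `rerank_fail_open`, and the length check are assumptions for illustration, not the real `_rerank_pool` signature:

```python
from typing import Callable, Sequence

MAX_DOC_CHARS = 2000  # per-document truncation limit from the commit message


def rerank_fail_open(
    query: str,
    docs: Sequence[str],
    score_fn: Callable[[str, list[str]], list[float]],
) -> list[int]:
    """Return doc indices, best first; fall back to base order on any error."""
    base_order = list(range(len(docs)))
    try:
        truncated = [doc[:MAX_DOC_CHARS] for doc in docs]
        scores = score_fn(query, truncated)
        if len(scores) != len(docs):
            raise ValueError("reranker returned an unexpected number of scores")
        # sorted() is stable, so equal scores keep their base order.
        return sorted(base_order, key=lambda i: -scores[i])
    except Exception:
        # Fail open: a broken reranker must not break search.
        return base_order


def _broken(query: str, docs: list[str]) -> list[float]:
    raise ConnectionError("reranker unreachable")


ok_order = rerank_fail_open("q", ["a", "b", "c"], lambda q, d: [0.1, 0.9, 0.5])
fallback_order = rerank_fail_open("q", ["a", "b", "c"], _broken)
```

In a real deployment `score_fn` would wrap the HTTP call to the reranker; keeping it injectable also makes the fallback path easy to test.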

## Baseline numbers (k=5, pool=50, 35 queries)

  | Retriever  | MRR   | Recall@5 | nDCG@5 |
  |------------|-------|----------|--------|
  | dense      | 0.027 | 0.086    | 0.041  |
  | bm25       | 0.544 | 0.586    | 0.524  |
  | hybrid-rrf | 0.114 | 0.114    | 0.108  |

Headline: BM25 dominates because farmers search for products by
brand name, and brand names are exact-match tokens that lexical
search nails. Dense is poor: semantic embeddings spread across
similar products and don't preferentially weight brand-name tokens.
Textbook RRF hurts when one retriever is much weaker than the
other: a dense hit at rank r scores 1/(60+r), exactly what the
BM25 hit at the same rank scores, so dense's irrelevant top-50
tie with BM25's good hits and crowd them out of the fused top-k.
Phase 6 reranker is the planned fix: the reranker scores each
(query, chunk) pair independently and can recover the right
answer regardless of base order, as long as it is in the pool.
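
The tie effect can be reproduced with a toy case, assuming the two retrievers return completely disjoint lists. The ids are made up and the tie-break on id is an arbitrary choice, so this illustrates the mechanism rather than reproducing the numbers above:

```python
def rrf_scores(rankings, k=60):
    """Plain RRF: each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


bm25 = ["b1", "b2", "b3", "b4", "b5"]   # assume these are the useful hits
dense = ["d1", "d2", "d3", "d4", "d5"]  # assume these are all irrelevant

scores = rrf_scores([bm25, dense])
# Sort by fused score, breaking exact ties on the id.
fused = sorted(scores, key=lambda doc: (-scores[doc], doc))
top5 = fused[:5]
survivors = [doc for doc in top5 if doc in bm25]
```

Every BM25 hit ties exactly with the dense hit at the same rank, so the two lists interleave: a hit at BM25 rank r lands at fused position 2r-1 or 2r, and the hits from BM25 ranks 4 and 5 drop out of the top 5 whichever way ties break.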

Per-query report at eval/results/baseline.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 10:19:05 -04:00


Eval results — queries.jsonl

  • queries: 35
  • k: 5
  • pool: 50
  • retrievers: dense, bm25, hybrid-rrf

Summary

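| Retriever  | MRR   | Recall@5 | nDCG@5 | Errors | Time (s) |
|------------|-------|----------|--------|--------|----------|
| dense      | 0.027 | 0.086    | 0.041  | 0      | 5.4      |
| bm25       | 0.544 | 0.586    | 0.524  | 0      | 4.7      |
| hybrid-rrf | 0.114 | 0.114    | 0.108  | 0      | 8.4      |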

Per-query — dense

| Query | Expected | Top retrieved | MRR | Recall |
|-------|----------|---------------|-----|--------|
| Warrant herbicide rate for soybean | bayer/warrant, epa_ppls/524-591 | epa_ppls/524-508, epa_ppls/524-521, epa_ppls/42750-176 | 0.00 | 0.00 |
| Huskie wheat herbicide tank mix | bayer/huskie, bayer/huskie-complete | epa_ppls/71368-64, epa_ppls/279-9610, epa_ppls/10182-134 | 0.00 | 0.00 |
| Harness 20G granular corn herbicide | bayer/harness, epa_ppls/524-487 | epa_ppls/352-612, epa_ppls/352-608, epa_ppls/352-817 | 0.00 | 0.00 |
| Laudis tembotrione post-emergence corn | bayer/laudis, epa_ppls/264-860 | bayer/diflexx, epa_ppls/70506-331, epa_ppls/84229-48 | 0.00 | 0.00 |
| Roundup Custom glyphosate burndown application rate | epa_ppls/524-677, epa_ppls/524-475 | epa_ppls/42750-122, epa_ppls/5905-656, epa_ppls/228-666 | 0.00 | 0.00 |
| Liberty 280 SL glufosinate ammonium soybean | epa_ppls/7969-448 | epa_ppls/71368-111, epa_ppls/84229-45, epa_ppls/7969-500 | 0.00 | 0.00 |
| Atrazine 4L corn pre-emergence rate per acre | epa_ppls/5905-7877 | epa_ppls/5905-624, epa_ppls/89167-75, epa_ppls/7969-140 | 0.00 | 0.00 |
| Albaugh dicamba DMA salt application restrictions | epa_ppls/42750-40 | epa_ppls/5905-638, epa_ppls/34704-861, epa_ppls/5905-624 | 0.20 | 1.00 |
| Authority 4F sulfentrazone soybean residual | epa_ppls/279-3146 | epa_ppls/279-9663, epa_ppls/87290-70, epa_ppls/66222-248 | 0.00 | 0.00 |
| Prowl 10-G pendimethalin granular pre-plant | epa_ppls/241-254 | epa_ppls/70506-333, epa_ppls/42750-340, epa_ppls/91234-231 | 0.00 | 0.00 |
| Callisto GT mesotrione corn postemergence broadleaf control | epa_ppls/100-1470 | epa_ppls/100-1131, epa_ppls/89167-51, epa_ppls/100-1349 | 0.00 | 0.00 |
| Acuron Flexi corn pre-emergence S-metolachlor | epa_ppls/100-1568 | epa_ppls/62719-312, epa_ppls/42750-122, epa_ppls/5905-638 | 0.00 | 0.00 |
| Sencor 4 flowable metribuzin soybean waterhemp | epa_ppls/264-735 | epa_ppls/1381-259, epa_ppls/279-9624, epa_ppls/89167-101 | 0.00 | 0.00 |
| Broadstrike trifluralin pre-plant incorporated | epa_ppls/62719-222 | epa_ppls/87290-81, epa_ppls/70506-333, epa_ppls/91234-73 | 0.00 | 0.00 |
| Headline azoxystrobin pyraclostrobin wheat foliar fungicide | epa_ppls/7969-186 | epa_ppls/100-1222, epa_ppls/100-1164, epa_ppls/87290-63 | 0.00 | 0.00 |
| Trivapro pydiflumetofen corn fungicide tar spot | epa_ppls/100-1613 | epa_ppls/66222-250, epa_ppls/264-1209, epa_ppls/62719-346 | 0.00 | 0.00 |
| Poncho 600 clothianidin seed treatment corn | epa_ppls/7969-458 | epa_ppls/7969-459, epa_ppls/7969-458, bayer/poncho-beta | 0.50 | 1.00 |
| Gustafson Lorsban 30 chlorpyrifos granular corn rootworm | epa_ppls/264-932 | epa_ppls/89167-78, epa_ppls/5481-525, epa_ppls/1381-193 | 0.00 | 0.00 |
| RT-3 glyphosate potassium salt herbicide | bayer/rt-3 | bayer/roundup-powermax-3, epa_ppls/19713-597, epa_ppls/19713-606 | 0.25 | 1.00 |
| Roundup PowerMAX 3 glyphosate K-salt rate | bayer/roundup-powermax-3, epa_ppls/524-659 | epa_ppls/19713-597, epa_ppls/19713-606, epa_ppls/51036-333 | 0.00 | 0.00 |
| Nortron SC ethofumesate sugar beet | bayer/nortron-sc | epa_ppls/71368-25, epa_ppls/42750-122, epa_ppls/524-715 | 0.00 | 0.00 |
| DiFlexx Duo tembotrione dicamba corn | bayer/diflexx-duo | epa_ppls/71368-65, epa_ppls/1812-434, epa_ppls/1381-191 | 0.00 | 0.00 |
| Corvus thiencarbazone-methyl isoxaflutole corn pre-emergence | bayer/corvus, epa_ppls/264-1066 | epa_ppls/42750-122, bayer/scoparia, epa_ppls/70506-331 | 0.00 | 0.00 |
| Capreno tembotrione thiencarbazone corn herbicide | bayer/capreno, epa_ppls/264-1063 | epa_ppls/91234-314, epa_ppls/352-894, epa_ppls/42750-32 | 0.00 | 0.00 |
| Tilt propiconazole wheat fungicide rust | epa_ppls/100-617 | epa_ppls/19713-692, epa_ppls/34704-1113, epa_ppls/228-670 | 0.00 | 0.00 |
| what controls horseweed marestail before planting soybean | epa_ppls/524-475, epa_ppls/524-677 | epa_ppls/524-716, epa_ppls/524-717, epa_ppls/524-722 | 0.00 | 0.00 |
| what can I tank mix with 2,4-D for burndown in spring | epa_ppls/5905-7877, epa_ppls/228-666 | epa_ppls/34704-1158, epa_ppls/264-738, epa_ppls/228-364 | 0.00 | 0.00 |
| best fungicide for corn tar spot foliar application | epa_ppls/100-1613, epa_ppls/100-1547 | epa_ppls/100-1178, epa_ppls/87290-63, epa_ppls/100-1262 | 0.00 | 0.00 |
| seed treatment to control wireworm in corn | epa_ppls/7969-458, epa_ppls/7969-459 | epa_ppls/10182-212, epa_ppls/1381-231, epa_ppls/42750-300 | 0.00 | 0.00 |
| pre-emergence residual herbicide for soybean for waterhemp | epa_ppls/279-3146, epa_ppls/264-735 | epa_ppls/352-675, epa_ppls/279-3564, epa_ppls/279-3589 | 0.00 | 0.00 |
| what insecticide for soybean aphid foliar | epa_ppls/279-3206, epa_ppls/264-840 | epa_ppls/264-1157, epa_ppls/264-1159, epa_ppls/279-9615 | 0.00 | 0.00 |
| what is the rainfast interval for glyphosate | epa_ppls/524-475, epa_ppls/524-677 | epa_ppls/89167-56, epa_ppls/524-523, epa_ppls/524-707 | 0.00 | 0.00 |
| wheat fungicide for fusarium head blight | epa_ppls/7969-186, epa_ppls/100-1547 | bayer/stratego, epa_ppls/7969-246, epa_ppls/66222-250 | 0.00 | 0.00 |
| endangered species act precautions for pesticide application | epa_ppls/524-475, epa_ppls/524-591 | epa_ppls/70506-318, epa_ppls/70506-324, epa_ppls/34704-1044 | 0.00 | 0.00 |
| what herbicide do I use for postemergence broadleaf in corn | bayer/laudis, bayer/capreno, bayer/diflexx-duo | epa_ppls/352-842, epa_ppls/100-1349, epa_ppls/89167-51 | 0.00 | 0.00 |
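
As a sanity check, the dense summary row can be recomputed from the per-query values above. This sketch assumes the summary is a plain mean over all 35 queries, which the numbers are consistent with; only three queries have a non-zero score:

```python
N_QUERIES = 35

# Non-zero dense rows: Albaugh dicamba, Poncho 600, RT-3 (all others are 0.00).
nonzero_rr = [0.20, 0.50, 0.25]
nonzero_recall = [1.00, 1.00, 1.00]

# Zero rows add nothing to the sums, so only the divisor needs all 35 queries.
dense_mrr = round(sum(nonzero_rr) / N_QUERIES, 3)
dense_recall = round(sum(nonzero_recall) / N_QUERIES, 3)
```

Both values round to the figures in the summary (MRR 0.027, Recall@5 0.086).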