seed-mcp/README.md
justin d65c7d0d67 README: reflect deployed state — 5,073 chunks, eval numbers, 6 tools
The scaffold-era README was out of sync with the shipped product:
- Vendor counts stale (recon estimates, not actual deployed counts)
- Trial data sources (gh_plot_reports + agripro_trials) entirely
  unmentioned
- Tool list listed `corpus_status` (doesn't exist) and missed both
  `lookup_variety` and `search_trials`
- Build-phase table showed everything as "pending" / "next" but
  Phases 1-8 + 11 all shipped

Rewrite to reflect the deployed state:
- Corpus inventory: 760 variety records + 4,313 trial documents =
  5,073 chunks across 6 sources
- All 6 MCP tools documented with their purpose
- Eval baseline table (hybrid+rerank wins 100%, P@1 90%, MRR 0.905)
  with the surprising findings (dense alone is noise; hybrid w/o
  rerank is WORSE than BM25 alone)
- Deploy mechanics: Watchtower chain, 4-GPU embedder pool, shared
  llama-rerank sidecar with the network-attach gotcha
- Status table: ✅ on the phases that shipped, deferred work list
  (becks_pfr, 2023 plot backfill, NK trials, Channel Seed brand)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 17:50:05 -04:00

# seed-mcp
MCP server over the public catalogs of major US row-crop seed
vendors — **variety identity** (what each hybrid IS) plus **yield-trial data** (how they actually perform in real cooperator fields). Sibling project to
[`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
(pesticide labels), feeding the same Drawbar farm-advisor AI.

**Deployed 2026-05-25** on trashpanda as a sibling sidecar to
`chem-mcp`; the Drawbar advisor calls it via the `seed:` prefix.
## What's in the corpus
**5,073 indexed chunks** across two complementary surfaces:
### Variety identity — 760 records
| Source | Count | Vendor | Brand |
|---|---|---|---|
| `bayer_seeds` | 475 | Bayer | DEKALB (corn) / Asgrow (soy) / WestBred (wheat) |
| `golden_harvest` | 139 | Syngenta | Golden Harvest (corn / soy) |
| `nk` | 122 | Syngenta | NK (corn / soy) |
| `agripro` | 24 | Syngenta | AgriPro (wheat — HRW / HRS / HWS / SWW) |
### Yield-trial data — 4,313 documents
| Source | Count | Notes |
|---|---|---|
| `gh_plot_reports` | 4,299 | Golden Harvest plot reports 2024+2025. **Cross-vendor head-to-head** — DEKALB / NK / GH / Pioneer / Channel all appear in the same trial rankings. The closest thing to independent comparison data the corpus has. |
| `agripro_trials` | 14 | Regional wheat trial PDF summaries (PNW, Western Plains, Northern Plains, etc.) |
### Not in the corpus (documented in `docs_mcp/lessons.md`)
- **Pioneer / Corteva** — ToS bans automation. Curated fallback lesson points the farmer at pioneer.com / a local dealer.
- **NK yield-results** — fiddly ASMX/SOAP endpoint, needs a dedicated reverse-engineer session.
- **Bayer per-variety trial data** — not publicly indexed (DEKALB / Asgrow trial data flows through Channel reps). Partially covered by the GH plot reports' cross-vendor results.
## MCP tools (6)
| Tool | Purpose |
|---|---|
| `search_docs` | Variety IDENTITY — what a hybrid IS (disease ratings, traits, maturity). Hybrid dense+BM25 + cross-encoder rerank + variety-code prefilter. |
| `search_trials` | Variety PERFORMANCE — head-to-head yield trial results. Filterable by crop, state, year, product. |
| `get_page` | Full canonical record for one variety + structured ratings header sourced from the sidecar JSON. |
| `lookup_variety` | Raw sidecar JSON for one variety — **fact-check tool**; call before quoting any specific rating value. |
| `list_versions` | Discover facets (sources, vendors, brands, crops) currently indexed. |
| `crop_seed_api_lessons` | Curated knowledge: Pioneer fallback policy, scale-direction differences across vendors, trait glossary, SCN race coverage notes. |

`search_docs` defaults to `data_type="variety"`; `search_trials` uses `data_type="trial"`. Both tools query a single Chroma collection and differ only in the metadata filter.
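To illustrate the single-collection design, here is a minimal sketch of how a `data_type` value plus optional facets could be turned into a Chroma-style `where` clause. The helper name `build_where` and the metadata keys `crop`, `state`, and `year` are assumptions for illustration; the real field names live in the indexer and may differ.

```python
from typing import Any, Optional


def build_where(data_type: str, **facets: Optional[Any]) -> dict:
    """Build a Chroma-style `where` filter (hypothetical helper, not the server's code)."""
    clauses = [{"data_type": data_type}]
    # Facets left as None are dropped so they do not constrain the query.
    clauses += [{key: value} for key, value in facets.items() if value is not None]
    # Chroma's $and operator needs at least two clauses; a lone clause is passed bare.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


variety_filter = build_where("variety")
trial_filter = build_where("trial", crop="corn", state="IA", year=2025)
print(variety_filter)
print(trial_filter)
```

A dict of this shape would be passed as the `where` argument of a Chroma collection query, so one collection can serve both tools without duplicating the index.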
## Retrieval — eval-validated
From `eval/results/baseline.md` (21 golden queries, k=5):
| Retriever | Pass | Recall | P@1 | MRR | Avg ms |
|---|---|---|---|---|---|
| **hybrid+rerank** | **21/21** | **100%** | **90%** | **0.905** | 2064 |
| bm25 | 20/21 | 95% | 81% | 0.833 | 5 |
| hybrid (no rerank) | 15/21 | 71% | 62% | 0.619 | 73 |
| dense | 14/21 | 67% | 38% | 0.440 | 79 |

**Deploy config**: `HYBRID_SEARCH=true` + `RERANK_URL=http://llama-rerank:8080`.
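As a rough sketch of how these two variables could map onto the retriever configurations in the table, the function below selects hybrid+rerank only when both are set and otherwise falls back to BM25, since the table shows unreranked hybrid (71% recall) trailing BM25 (95%). The function name `choose_retriever`, the accepted truthy strings, and the fallback policy are all assumptions; the server's actual startup logic is not shown in this README.

```python
import os
from typing import Mapping

TRUTHY = {"1", "true", "yes"}  # assumed spellings; the server may accept others


def choose_retriever(env: Mapping[str, str] = os.environ) -> str:
    """Pick a retriever mode from environment variables (hypothetical policy)."""
    hybrid_enabled = env.get("HYBRID_SEARCH", "").strip().lower() in TRUTHY
    rerank_url = env.get("RERANK_URL", "").strip()
    if hybrid_enabled and rerank_url:
        return "hybrid+rerank"
    # Without a reranker, plain BM25 scored better than hybrid in the eval table.
    return "bm25"


deployed = {"HYBRID_SEARCH": "true", "RERANK_URL": "http://llama-rerank:8080"}
print(choose_retriever(deployed))
print(choose_retriever({"HYBRID_SEARCH": "true"}))
```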
Some surprises worth knowing:
1. **Dense embedding alone is the weakest config**. Variety codes (DKC62-08RIB), gene names (Rps3a), and trait codes (XF) have no semantic neighbors — nomic-embed-text returns noise on them.
2. **Hybrid alone is WORSE than BM25 alone.** Reciprocal rank fusion (RRF) dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
3. **BM25-alone (95% recall, 5 ms) is an excellent fallback** when the rerank sidecar is unavailable. The variety-code prefilter in `search_docs` does heavy lifting.
4. **Anti-hallucination queries pass on every retriever** — Pioneer fallback + not-in-corpus product checks hold across all configs.
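The dilution effect in finding 2 can be reproduced with a small, self-contained sketch of reciprocal rank fusion. The constant `k=60` is the conventional default from the RRF literature, not necessarily the value this repo uses, and the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence


def rrf(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break exact ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# BM25 puts the exact variety-code match first; the dense list misses it entirely.
bm25_ranking = ["dkc62-08rib", "nk-soy-a", "gh-corn-b"]
dense_ranking = ["gh-corn-c", "nk-soy-a", "gh-corn-b"]

fused = rrf([bm25_ranking, dense_ranking])
print(fused)
print("target position after fusion:", fused.index("dkc62-08rib") + 1)
```

In this toy case the exact match drops from position 1 to position 3, because documents that appear in both lists collect two contributions while the BM25-only hit collects one. A cross-encoder rerank over the fused candidates is what restores the ordering.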
## Quick start
```bash
git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Sample-scrape just to verify wiring:
python -m scrape.runner --source bayer_seeds --limit 3
# Full refresh (all 6 sources; expect ~25 min for gh_plot_reports
# with 4 concurrent workers):
python -m scrape.runner --all --force
# Rebuild Chroma + BM25 from the corpus:
OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_seed \
python -m rag.index --rebuild
# Run the eval harness:
RERANK_URL=http://localhost:18080 python -m eval.run_eval \
--queries eval/queries.jsonl --k 5 \
--output eval/results/baseline.md
# Local MCP server (stdio for Claude Desktop dev):
PRODUCT_NAME=crop_seed python -m docs_mcp.server --transport stdio
# Local HTTP server (matches production transport):
PRODUCT_NAME=crop_seed python -m docs_mcp.server \
--transport streamable-http --port 8000
```
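For readers interpreting the eval output, the sketch below shows one conventional way to compute the reported metrics. It assumes Recall@k is a per-query hit rate (at least one relevant document in the top k), which is consistent with the Pass and Recall columns agreeing in the baseline table; the actual definitions in `eval/run_eval.py` may differ, and the function names here are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(results: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Aggregate hit-rate Recall@k, Precision@1, and MRR over all queries."""
    n = len(results)
    hits = sum(1 for ranked, rel in results if any(d in rel for d in ranked[:k]))
    top1 = sum(1 for ranked, rel in results if ranked and ranked[0] in rel)
    mrr = sum(reciprocal_rank(ranked[:k], rel) for ranked, rel in results)
    return {"recall@k": hits / n, "p@1": top1 / n, "mrr": mrr / n}


# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
toy_results = [
    (["a", "b"], {"a"}),
    (["x", "b", "c"], {"b"}),
    (["x", "y"], {"z"}),
]
metrics = evaluate(toy_results, k=5)
print(metrics)
```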
## Repo layout
```
.
├── CLAUDE.md                     # Canonical agent guide. Read first.
├── PLAN.md                       # Template's 13-phase build guide.
├── README.md
├── requirements.txt
├── Dockerfile
├── sources.json                  # Source catalog (one entry per scraper)
├── deploy/docker-compose.yml     # Drop-in compose snippet for Drawbar
├── .gitea/workflows/
│   ├── refresh.yml               # Monthly cron: scrape + index + image push
│   └── image-only.yml            # On-demand code-only ship cycle
├── scrape/
│   ├── runner.py                 # `python -m scrape.runner --source <id>`
│   ├── changelog.py              # Reused from template
│   └── sources/
│       ├── bayer_seeds.py        # ~475 varieties across 3 brands
│       ├── golden_harvest.py     # ~139 varieties (post-discontinued filter)
│       ├── nk.py                 # 122 varieties (corn + soy)
│       ├── agripro.py            # 24 wheat varieties
│       ├── gh_plot_reports.py    # 4,299 cross-vendor yield trials
│       ├── agripro_trials.py     # 14 regional trial PDFs
│       └── becks_pfr.py          # stub: Sanity GROQ research corpus
├── rag/
│   ├── embeddings.py             # nomic-embed-text via Ollama
│   ├── chunk.py                  # one-chunk-per-variety + trial chunker
│   ├── index.py                  # Chroma + BM25 builder
│   └── bm25.py                   # FTS5 lexical index w/ seed-domain facets
├── docs_mcp/
│   ├── server.py                 # FastMCP: 6 tools, hybrid+rerank
│   ├── lessons.md                # Curated knowledge layer (Pioneer fallback)
│   └── usage.py                  # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl             # 21 golden queries
│   ├── retrievers.py             # dense / bm25 / hybrid / hybrid+rerank
│   ├── run_eval.py               # MRR / Recall@k / Precision@1
│   └── results/baseline.md       # Current deploy-config eval numbers
└── corpus/                       # Committed scrape output (CI-refreshed)
    ├── bayer_seeds/
    ├── golden_harvest/
    ├── nk/
    ├── agripro/
    ├── gh_plot_reports/
    └── agripro_trials/
```
## Infrastructure
- **Registry**: CI pushes to `192.168.0.2:1234` (LAN, no Cloudflare body-size cap); deploys pull `git.jpaul.io/justin/seed-mcp:latest` (public, via Cloudflare tunnel). Each image is also tagged `:<sha12>` for rollback pinning and `:corpus-YYYY.MM.DD` for snapshot pinning.
- **Embedder pool (CI)**: 3 GPU-pinned Ollama endpoints, weighted toward `.0.125` (RTX 40-series, 242 embeds/sec):
  - `.0.125:11434` ×4 (4090)
  - `.0.2:11436` ×2 (GPU-pinned)
  - `.0.2:11435` ×1 (GPU-pinned)
  - Do NOT use `.0.2:11434` (not GPU-pinned) or `localhost:11434` (works in dev, breaks in CI because the runner container has no Ollama on its loopback).
- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla P4 (jina-reranker-v2-base via llama.cpp). One container serves both seed-mcp and crop-chem-docs. **Must be on `drawbar-backend_default` Docker network** — see `deploy/docker-compose.yml` for the network-attach gotcha that caused silent rerank degradation on chem-mcp prior to 2026-05-25.
- **PRODUCT_NAME**: `crop_seed` — used in the Chroma collection name (`crop_seed_docs`), the BM25 db filename (`bm25/crop_seed_docs.db`), and the `crop_seed_api_lessons` tool name. Not `seed_mcp` — that would conflict with the container/service name.
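The ×4 / ×2 / ×1 weighting of the embedder pool can be pictured as a weighted round-robin. The sketch below is one simple way to realize it and is an assumption about the scheduling strategy, not the CI's confirmed implementation; the endpoint labels are copied from the list above and the function names are hypothetical.

```python
import itertools
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Labels and weights as listed in this README; scheduling policy is assumed.
POOL_WEIGHTS = {
    ".0.125:11434": 4,
    ".0.2:11436": 2,
    ".0.2:11435": 1,
}


def build_schedule(weights: Dict[str, int]) -> List[str]:
    """Expand each endpoint into `weight` slots so heavier endpoints recur more often."""
    return [endpoint for endpoint, weight in weights.items() for _ in range(weight)]


def assign_batches(batch_ids: Iterable[int], weights: Dict[str, int]) -> List[Tuple[int, str]]:
    """Pair each batch with the next endpoint in a repeating weighted cycle."""
    slots = itertools.cycle(build_schedule(weights))
    return [(batch_id, next(slots)) for batch_id in batch_ids]


assignments = assign_batches(range(14), POOL_WEIGHTS)
load = Counter(endpoint for _, endpoint in assignments)
print(load)
```

Over any multiple of seven batches the load splits 4:2:1, which keeps the fastest GPU busiest without starving the other two endpoints.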
## Deploy mechanics
Watchtower handles auto-deploy. Every push to `seed-mcp/main` that touches `docs_mcp/`, `rag/`, `scrape/`, `requirements.txt`, `Dockerfile`, or `sources.json` triggers `image-only.yml`:
1. Checks out main with full corpus
2. Rebuilds Chroma + BM25 (~3 min on the GPU pool)
3. `docker build` + push three tags to the LAN registry
4. Links the package to the repo via Gitea API
5. Watchtower on trashpanda polls `:latest` every 5 min → pulls + recreates `drawbar-backend-seed-mcp-1`

Corpus refresh runs monthly via `refresh.yml` (1st of each month, 06:00 UTC): it re-scrapes all GREEN sources, commits any corpus diff, rebuilds the indexes, and ships a new image tagged `:corpus-YYYY.MM.DD`.

See `CLAUDE.md` for canonical sidecar schemas, the reversed disease-scale gotcha (NK + AgriPro publish 1=best, vs Bayer/GH 9=best), and the scraper conventions.
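To make the reversed-scale gotcha concrete, here is a minimal sketch that converts a rating to a common 9=best direction before two vendors are compared. It assumes every source uses a 1 to 9 scale, and the names `REVERSED_SCALE_SOURCES` and `normalize_rating` are hypothetical; whether the corpus stores raw or normalized values is documented in `CLAUDE.md`, not here.

```python
# Sources that publish 1 = best, per the note above; all others are treated as 9 = best.
REVERSED_SCALE_SOURCES = {"nk", "agripro"}


def normalize_rating(source: str, rating: int, scale_max: int = 9) -> int:
    """Return the rating expressed on a 9 = best scale (assumes a 1..scale_max range)."""
    if not 1 <= rating <= scale_max:
        raise ValueError(f"rating {rating} is outside 1..{scale_max}")
    if source in REVERSED_SCALE_SOURCES:
        # Mirror the scale: 1 becomes 9, 2 becomes 8, and so on.
        return scale_max + 1 - rating
    return rating


print(normalize_rating("nk", 1))
print(normalize_rating("bayer_seeds", 9))
print(normalize_rating("agripro", 3))
```

Without a step like this, an NK rating of 1 and a DEKALB rating of 9 look like opposites even though both mean the best score.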
## Status
| Phase | Status |
|---|---|
| 0 — scaffold | ✅ |
| 1 — scrapers (bayer_seeds / golden_harvest / nk / agripro / gh_plot_reports / agripro_trials) | ✅ |
| 2 — chunk + index | ✅ |
| 3 — MCP tools (6) | ✅ |
| 4-5 — Dockerfile + Gitea CI | ✅ |
| 6 — reranker integration | ✅ (eval-validated; deploy uses hybrid+rerank) |
| 7 — eval harness | ✅ (21 golden queries, baseline committed) |
| 8 — hybrid search | ✅ (default ON) |
| 11 — `crop_seed_api_lessons` curated layer | ✅ (Pioneer fallback + 7 other lessons) |
| 13 — weekly_digest | not planned for seed-mcp |

Remaining work (deferred, not blocking):
- `becks_pfr` scraper (2,089 research docs via public Sanity GROQ)
- 2023 GH plot reports backfill (~3,619 more docs)
- NK yield-results endpoint reverse-engineer
- Channel Seed brand (~320 more Bayer varieties — separate brand under the same sitemap)