claude a54fac240f
Add university-extension trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 cross-vendor trial docs) (#19)
Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
2026-06-10 08:36:19 -04:00

# seed-mcp
MCP server over the public catalogs of major US row-crop seed
vendors — **variety identity** (what each hybrid IS) plus **yield-trial data** (how they actually perform in real cooperator fields). Sibling project to
[`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
(pesticide labels), feeding the same Drawbar farm-advisor AI.
**Deployed 2026-05-25** on trashpanda as a sibling sidecar to
`chem-mcp`; the Drawbar advisor calls it via the `seed:` prefix.
## What's in the corpus
**~9,300 indexed records** (one chunk each) across two complementary surfaces:
### Variety identity — 2,398 records
| Source | Count | Vendor | Brand |
|---|---|---|---|
| `bayer_seeds` | 931 | Bayer | DEKALB / Channel (corn) / Asgrow (soy) / WestBred (wheat) / Deltapine |
| `latham` | 264 | Latham Hi-Tech Seeds | Latham (corn / soy) — **independent family brand, Alexander IA** |
| `stine` | 217 | Stine Seed Company | Stine (corn / soy) — **largest US independent, Adel IA** |
| `lg_seeds` | 170 | AgReliant | LG Seeds (corn / soy / sorghum) |
| `golden_harvest` | 139 | Syngenta | Golden Harvest (corn / soy) |
| `robseeco` | 130 | RobSeeCo | Rob-See-Co / Innotech (corn / soy) — **independent, Elkhorn NE; from the seed-guide PDF** |
| `nk` | 122 | Syngenta | NK (corn / soy) |
| `proharvest` | 119 | ProHarvest Seeds | ProHarvest / Apex (corn / soy / wheat) — **independent Corn Belt brand** |
| `agrigold` | 111 | AgReliant | AgriGold (corn / soy) |
| `first_choice` | 78 | 1st Choice Seeds | 1st Choice (corn / soy / wheat) — **employee-owned independent, Rushville IN** |
| `burrus` | 64 | Burrus Seed | Burrus / Power Plus / DONMARIO (corn / soy) — **independent family, Arenzville IL** |
| `ebberts_seeds` | 29 | Ebbert's Seeds | Ebbert's (corn / soy / wheat) — independent E. Corn Belt breeder |
| `agripro` | 24 | Syngenta | AgriPro (wheat — HRW / HRS / HWS / SWW) |
### Yield-trial data — 6,910 documents
| Source | Count | Notes |
|---|---|---|
| `gh_plot_reports` | 4,299 | Golden Harvest plot reports 2024+2025. **Cross-vendor head-to-head** — DEKALB / NK / GH / Pioneer / Channel all appear in the same trial rankings. |
| `lg_plot_reports` | 1,307 | LG Seeds (AgReliant) cross-vendor plots, top-5 per site, 2024+2025. |
| `agrigold_plot_reports` | 1,006 | AgriGold (AgReliant) cross-vendor plots, full ranking + rich plot management, 2024+2025. |
| `proharvest_plots` | 161 | ProHarvest Seeds per-cooperator harvest reports (corn / soy, 2024+2025). Many are **cross-vendor** (ProHarvest / Apex vs Pioneer / DEKALB / Becks / Channel / Wyffels). Structured rank/yield/%H2O/test-wt where the PDF fits the template; off-template third-party reports kept verbatim. |
| `ohio_ocpt_trials` | 69 | **University-extension** trial (OSU/CFAES) — corn + soy per-site, 2024+2025. Independent third-party; ranks CHANNEL / DEKALB / NK / Golden Harvest / LG / AgriGold / Beck's etc. side-by-side. |
| `illinois_vt_trials` | 30 | **University-extension** trial (U of Illinois VT) — corn + soy + **wheat**, 2024+2025. Pioneer / NK + many regionals; rich per-site agronomic metadata. |
| `iowa_icpt_trials` | 24 | **University-extension** trial (Iowa State / ICPT) — corn + soy by district, 2024+2025. Pioneer / DEKALB / Asgrow / NK / Golden Harvest. |
| `agripro_trials` | 14 | Regional wheat trial PDF summaries (PNW, Western Plains, Northern Plains, etc.) |
> The three university-extension sources (`ohio_ocpt_trials`, `illinois_vt_trials`, `iowa_icpt_trials`) are **independent third-party** performance data: land-grant programs that test every entered brand side-by-side with replication + LSD stats, including majors whose own trial data we can't scrape directly (such as **Pioneer / DEKALB / Brevant**). The publisher is the university; the seed brands live in each row's `brand`.
### Not in the corpus (documented in `docs_mcp/lessons.md`)
- **Pioneer / Corteva (all brands)** — ToS bans automation. This now covers the whole Corteva family — Pioneer, Brevant, **Hoegemeyer** (the consolidation brand absorbing Seed Consultants / Dairyland / Nu-Tech / Terral), and the upcoming Vylor spinoff — all share the same corteva.com ToU. Curated fallback lesson points the farmer at a local dealer; legitimate Corteva-data paths are an official license (openinnovation@corteva.com) or university-extension trial data.
- **NK yield-results** — fiddly ASMX/SOAP endpoint, needs a dedicated reverse-engineer session.
- **Bayer per-variety trial data** — not publicly indexed (DEKALB / Asgrow trial data flows through Channel reps). Partially covered by the GH plot reports' cross-vendor results.
## MCP tools (6)
| Tool | Purpose |
|---|---|
| `search_docs` | Variety IDENTITY — what a hybrid IS (disease ratings, traits, maturity). Hybrid dense+BM25 + cross-encoder rerank + variety-code prefilter. |
| `search_trials` | Variety PERFORMANCE — head-to-head yield trial results. Filterable by crop, state, year, product. |
| `get_page` | Full canonical record for one variety + structured ratings header sourced from the sidecar JSON. |
| `lookup_variety` | Raw sidecar JSON for one variety — **fact-check tool**; call before quoting any specific rating value. |
| `list_versions` | Discover facets (sources, vendors, brands, crops) currently indexed. |
| `crop_seed_api_lessons` | Curated knowledge: Pioneer fallback policy, scale-direction differences across vendors, trait glossary, SCN race coverage notes. |
`search_docs` defaults to `data_type="variety"` and `search_trials` uses `data_type="trial"`; both query a single Chroma collection and are separated only by that metadata filter.
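A minimal sketch of how such a metadata filter could be assembled, assuming Chroma's `where` operator syntax (`$eq`, `$and`). Only `data_type` comes from this README; the function name `build_where` and the facet field names `crop`, `state`, and `year` are hypothetical stand-ins for whatever the server actually stores.

```python
from typing import Any, Optional


def build_where(
    data_type: str,
    crop: Optional[str] = None,
    state: Optional[str] = None,
    year: Optional[int] = None,
) -> dict[str, Any]:
    """Build a Chroma-style `where` filter for one logical surface."""
    clauses: list[dict[str, Any]] = [{"data_type": {"$eq": data_type}}]
    # Optional facets. These field names are assumptions, not the repo's schema.
    for field, value in (("crop", crop), ("state", state), ("year", year)):
        if value is not None:
            clauses.append({field: {"$eq": value}})
    # `$and` needs at least two clauses, so a lone clause is returned as-is.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


variety_filter = build_where("variety")
trial_filter = build_where("trial", crop="corn", state="IA", year=2025)
```

The resulting dict would be passed as the `where=` argument of a Chroma `collection.query(...)` call, so one collection can serve both tools.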
## Retrieval — eval-validated
From `eval/results/baseline.md` (21 golden queries, k=5):
| Retriever | Pass | Recall | P@1 | MRR | Avg ms |
|---|---|---|---|---|---|
| **hybrid+rerank** | **21/21** | **100%** | **90%** | **0.905** | 2064 |
| bm25 | 20/21 | 95% | 81% | 0.833 | 5 |
| hybrid (no rerank) | 15/21 | 71% | 62% | 0.619 | 73 |
| dense | 14/21 | 67% | 38% | 0.440 | 79 |
**Deploy config**: `HYBRID_SEARCH=true` + `RERANK_URL=http://llama-rerank:8080`.
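For readers unfamiliar with the metric columns in the table above, here is a rough sketch of how Recall, P@1, and MRR are conventionally computed at a cutoff `k`. It is not the code in `eval/run_eval.py`; in particular, treating recall as "any relevant document appears in the top k" is an assumption, and the toy runs are made up for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k):
    """Return 1/rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    rr = [reciprocal_rank(ranked, relevant, k) for ranked, relevant in runs]
    n = len(rr)
    return {
        "recall_at_k": sum(1 for x in rr if x > 0) / n,  # hit anywhere in top k
        "p_at_1": sum(1 for x in rr if x == 1.0) / n,    # hit at rank 1
        "mrr": sum(rr) / n,                              # mean reciprocal rank
    }


# Toy data, not taken from eval/queries.jsonl.
toy_runs = [
    (["a", "b", "c"], {"a"}),    # relevant doc at rank 1
    (["x", "y", "z"], {"y"}),    # relevant doc at rank 2
    (["p", "q", "r"], {"zzz"}),  # miss
    (["m", "n"], {"m"}),         # relevant doc at rank 1
]
scores = evaluate(toy_runs, k=5)
```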
Some surprises worth knowing:
1. **Dense embedding alone is the weakest config**. Variety codes (DKC62-08RIB), gene names (Rps3a), and trait codes (XF) have no semantic neighbors — nomic-embed-text returns noise on them.
2. **Hybrid alone is WORSE than BM25 alone.** RRF dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
3. **BM25-alone (95% recall, 5 ms) is an excellent fallback** when the rerank sidecar is unavailable. The variety-code prefilter in `search_docs` does heavy lifting.
4. **Anti-hallucination queries pass on every retriever** — Pioneer fallback + not-in-corpus product checks hold across all configs.
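Point 2 is easier to see with a small example. The sketch below implements plain reciprocal rank fusion (RRF) over two ranked lists; `k=60` is the commonly used constant and may not match this repo's setting, and the document ids are placeholders.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# BM25 ranks the exact-code match first, followed by lexically close records.
bm25_ranking = ["target", "near_1", "near_2"]
# The dense retriever surfaces unrelated records ahead of the target.
dense_ranking = ["noise_1", "noise_2", "target"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
```

RRF only sees ranks, not scores, so the dense list's top hit gets the same credit as BM25's top hit. Here `noise_1` ends up ahead of BM25's second result, which is the kind of dilution a cross-encoder rerank then has to undo.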
## Quick start
```bash
git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Sample-scrape just to verify wiring:
python -m scrape.runner --source bayer_seeds --limit 3
# Full refresh (all configured sources; expect ~25 min for gh_plot_reports
# with 4 concurrent workers):
python -m scrape.runner --all --force
# Rebuild Chroma + BM25 from the corpus:
OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_seed \
python -m rag.index --rebuild
# Run the eval harness:
RERANK_URL=http://localhost:18080 python -m eval.run_eval \
--queries eval/queries.jsonl --k 5 \
--output eval/results/baseline.md
# Local MCP server (stdio for Claude Desktop dev):
PRODUCT_NAME=crop_seed python -m docs_mcp.server --transport stdio
# Local HTTP server (matches production transport):
PRODUCT_NAME=crop_seed python -m docs_mcp.server \
--transport streamable-http --port 8000
```
## Repo layout
```
.
├── CLAUDE.md                     # Canonical agent guide. Read first.
├── PLAN.md                       # Template's 13-phase build guide.
├── README.md
├── requirements.txt
├── Dockerfile
├── sources.json                  # Source catalog (one entry per scraper)
├── deploy/docker-compose.yml     # Drop-in compose snippet for Drawbar
├── .gitea/workflows/
│   ├── refresh.yml               # Monthly cron: scrape + index + image push
│   └── image-only.yml            # On-demand code-only ship cycle
├── scrape/
│   ├── runner.py                 # `python -m scrape.runner --source <id>`
│   ├── changelog.py              # Reused from template
│   └── sources/
│       ├── bayer_seeds.py        # ~475 varieties across 3 brands
│       ├── golden_harvest.py     # ~139 varieties (post-discontinued filter)
│       ├── nk.py                 # 122 varieties (corn + soy)
│       ├── agripro.py            # 24 wheat varieties
│       ├── gh_plot_reports.py    # 4,299 cross-vendor yield trials
│       ├── agripro_trials.py     # 14 regional trial PDFs
│       └── becks_pfr.py          # stub — Sanity GROQ research corpus
├── rag/
│   ├── embeddings.py             # nomic-embed-text via Ollama
│   ├── chunk.py                  # one-chunk-per-variety + trial chunker
│   ├── index.py                  # Chroma + BM25 builder
│   └── bm25.py                   # FTS5 lexical index w/ seed-domain facets
├── docs_mcp/
│   ├── server.py                 # FastMCP — 6 tools, hybrid+rerank
│   ├── lessons.md                # Curated knowledge layer (Pioneer fallback)
│   └── usage.py                  # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl             # 21 golden queries
│   ├── retrievers.py             # dense / bm25 / hybrid / hybrid+rerank
│   ├── run_eval.py               # MRR / Recall@k / Precision@1
│   └── results/baseline.md       # Current deploy-config eval numbers
└── corpus/                       # Committed scrape output (CI-refreshed)
    ├── bayer_seeds/
    ├── golden_harvest/
    ├── nk/
    ├── agripro/
    ├── gh_plot_reports/
    └── agripro_trials/
```
## Infrastructure
- **Registry**: CI pushes to `192.168.0.2:1234` (LAN, so no Cloudflare body-size cap); deploys pull `git.jpaul.io/justin/seed-mcp:latest` (public, via Cloudflare tunnel). Each image is also tagged `:<sha12>` for rollback pinning and `:corpus-YYYY.MM.DD` for snapshot pinning.
- **Embedder pool (CI)**: 3 GPU-pinned Ollama endpoints, weighted toward `.0.125` (RTX 40-series, 242 embeds/sec):
  - `.0.125:11434` ×4 (4090)
  - `.0.2:11436` ×2 (GPU-pinned)
  - `.0.2:11435` ×1 (GPU-pinned)
  - Do NOT use `.0.2:11434` (not GPU-pinned) or `localhost:11434` (works in dev, breaks in CI: the runner container has no Ollama on its loopback).
- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla P4 (jina-reranker-v2-base via llama.cpp). One container serves both seed-mcp and crop-chem-docs. **Must be on `drawbar-backend_default` Docker network** — see `deploy/docker-compose.yml` for the network-attach gotcha that caused silent rerank degradation on chem-mcp prior to 2026-05-25.
- **PRODUCT_NAME**: `crop_seed` — used in the Chroma collection name (`crop_seed_docs`), the BM25 db filename (`bm25/crop_seed_docs.db`), and the `crop_seed_api_lessons` tool name. Not `seed_mcp` — that would conflict with the container/service name.
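The ×4 / ×2 / ×1 weighting of the embedder pool can be pictured as a weighted round-robin. The sketch below is an illustration only, not the CI's actual scheduler. The full `192.168.` host prefixes are inferred from the `OLLAMA_URL` and registry addresses elsewhere in this README, and `expand_pool` is a hypothetical helper.

```python
from collections import Counter
from itertools import cycle, islice

# Endpoint -> weight, per the pool description above.
# The `192.168.` prefix is an inference, not stated for every host.
POOL_WEIGHTS = {
    "http://192.168.0.125:11434": 4,
    "http://192.168.0.2:11436": 2,
    "http://192.168.0.2:11435": 1,
}


def expand_pool(weights):
    """Repeat each endpoint `weight` times so round-robin honours the ratio."""
    slots = []
    for url, weight in weights.items():
        slots.extend([url] * weight)
    return slots


slots = expand_pool(POOL_WEIGHTS)

# Hand out endpoints for 14 embedding batches (two full passes over the slots).
assignments = list(islice(cycle(slots), 14))
load = Counter(assignments)
```

This simple expansion sends consecutive batches to the same host. A real scheduler would more likely interleave slots or dispatch to whichever worker is idle, while keeping the same 4:2:1 ratio.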
## Deploy mechanics
Watchtower handles auto-deploy. Every push to `seed-mcp/main` that touches `docs_mcp/`, `rag/`, `scrape/`, `requirements.txt`, `Dockerfile`, or `sources.json` triggers `image-only.yml`:
1. Checks out main with full corpus
2. Rebuilds Chroma + BM25 (~3 min on the GPU pool)
3. `docker build` + push three tags to the LAN registry
4. Links the package to the repo via Gitea API
5. Watchtower on trashpanda polls `:latest` every 5 min → pulls + recreates `drawbar-backend-seed-mcp-1`
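Step 3 pushes three tags. Going by the scheme listed under Infrastructure (`:latest`, `:<sha12>`, `:corpus-YYYY.MM.DD`), the tag set for one build could be derived roughly as follows. The helper name `image_tags` and the sample commit hash are hypothetical, and the workflow itself may compute these in shell rather than Python.

```python
from datetime import date


def image_tags(commit_sha: str, corpus_date: date) -> list[str]:
    """Return the three tags for one build: moving, rollback pin, snapshot pin."""
    return [
        "latest",                            # the tag Watchtower polls
        commit_sha[:12],                     # <sha12> for rollback pinning
        f"corpus-{corpus_date:%Y.%m.%d}",    # corpus snapshot pin
    ]


# Placeholder 40-character hash, not a real commit in this repo.
tags = image_tags("0123456789abcdef0123456789abcdef01234567", date(2026, 6, 1))
```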
Corpus refresh runs monthly via `refresh.yml` (1st of each month, 06:00 UTC) — re-scrapes all GREEN sources, commits any corpus diff, rebuilds indexes, ships a new image with `:corpus-YYYY.MM.DD` tagged.
See `CLAUDE.md` for canonical sidecar schemas, the reversed disease-scale gotcha (NK + AgriPro publish 1=best, vs Bayer/GH 9=best), and the scraper conventions.
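To make the reversed-scale gotcha concrete, here is a small sketch that normalizes disease ratings to a single "9 = best" direction. It assumes every vendor uses an integer 1 to 9 scale and that the source ids map to vendors as shown. Both are assumptions; `CLAUDE.md` remains the authority on the real sidecar schema.

```python
# Scale direction per source id, following the note above.
ONE_IS_BEST = {"nk", "agripro"}
NINE_IS_BEST = {"bayer_seeds", "golden_harvest"}


def to_nine_is_best(source: str, rating: int) -> int:
    """Convert a 1-9 rating to the 9 = best direction (assumed 1-9 scale)."""
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of range: {rating}")
    if source in ONE_IS_BEST:
        return 10 - rating  # flip: 1 -> 9, 9 -> 1
    if source in NINE_IS_BEST:
        return rating
    # Refuse to guess for sources whose direction is not documented here.
    raise KeyError(f"scale direction unknown for source {source!r}")


nk_top = to_nine_is_best("nk", 1)
bayer_same = to_nine_is_best("bayer_seeds", 7)
```

Raising on an unknown source is deliberate: silently passing a rating through would risk reporting a vendor's worst score as its best.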
## Status
| Phase | Status |
|---|---|
| 0 — scaffold | ✅ |
| 1 — scrapers (bayer_seeds / golden_harvest / nk / agripro / gh_plot_reports / agripro_trials) | ✅ |
| 2 — chunk + index | ✅ |
| 3 — MCP tools (6) | ✅ |
| 4-5 — Dockerfile + Gitea CI | ✅ |
| 6 — reranker integration | ✅ (eval-validated; deploy uses hybrid+rerank) |
| 7 — eval harness | ✅ (21 golden queries, baseline committed) |
| 8 — hybrid search | ✅ (default ON) |
| 11 — `crop_seed_api_lessons` curated layer | ✅ (Pioneer fallback + 7 other lessons) |
| 13 — weekly_digest | not planned for seed-mcp |
Remaining work (deferred, not blocking):
- `becks_pfr` scraper (2,089 research docs via public Sanity GROQ)
- 2023 GH plot reports backfill (~3,619 more docs)
- NK yield-results endpoint reverse-engineer
- Channel Seed brand (~320 more Bayer varieties — separate brand under the same sitemap)