claude 9600235466 Add 4 independent seed brands: Latham, Stine, 1st Choice, Burrus (+623 varieties)
Four independent regional brands across IA/IN/IL (variety-identity sources,
each parsed into structured characteristics_groups so ratings embed):

- latham (264: 155 corn + 109 soy) — Latham Hi-Tech Seeds, Alexander IA.
  WordPress REST enum (/wp-json/wp/v2/varieties) + /products/<slug>/ detail
  HTML. Scale 1-9 LOWER=better (reversed, like NK/AgriPro).
- stine (217: 58 corn + 159 soy) — Stine Seed, Adel IA (largest US
  independent). sitemap enum + /{crop}/traits/<slug>/<code>/ detail HTML.
  Corn 1-9 (9=best); soy qualitative.
- first_choice (78: 52 corn + 22 soy + 4 wheat) — 1st Choice Seeds,
  Rushville IN (employee-owned). Per-crop sitemap -> detail HTML. Scale
  0-10 higher=better. ~40 older corn pages thin at source; wheat
  identity-only.
- burrus (64: 38 corn + 26 soy) — Burrus Seed, Arenzville IL. Seedware
  JSON API. Scale 1-10 (10=best). Brands Burrus/Power Plus/DONMARIO.
  robots ai-train=no + named-bot blocks; operator opted in, scraper uses a
  non-blacklisted UA + honors Crawl-delay 10.

All 623 validated through rag.chunk.chunks_from_variety (0 errors; 6
identity-only pages from source gaps). No chunk.py change needed (identity
sources auto-route to chunks_from_variety).

Docs:
- sources.json: 4 entries + Hoegemeyer added to _excluded_sources. The
  Corteva ToU (shared across pioneer.com / hoegemeyer.com / therightseed.com
  / corteva.com + the Vylor spinoff) bans scrapers + competitive use, so the
  whole Corteva family is one excluded ToU domain.
- docs_mcp/lessons.md: rating-scales updated with all 4 directions +
  an explicit cross-vendor warning (Latham 1=best vs Stine/Burrus higher=best
  — never compare raw numbers without _scale_direction).
- README + CLAUDE corpus inventory: now 2,268 variety + 6,787 trial records.

CI rebuilds the index from the committed corpus.
2026-06-04 21:57:30 -04:00

# seed-mcp
MCP server over the public catalogs of major US row-crop seed
vendors — **variety identity** (what each hybrid IS) plus **yield-trial data** (how they actually perform in real cooperator fields). Sibling project to
[`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
(pesticide labels), feeding the same Drawbar farm-advisor AI.
**Deployed 2026-05-25** on trashpanda as a sibling sidecar to
`chem-mcp`; the Drawbar advisor calls it via the `seed:` prefix.
## What's in the corpus
**~9,050 indexed records** (one chunk each) across two complementary surfaces:
### Variety identity — 2,268 records
| Source | Count | Vendor | Brand |
|---|---|---|---|
| `bayer_seeds` | 931 | Bayer | DEKALB / Channel (corn) / Asgrow (soy) / WestBred (wheat) / Deltapine |
| `latham` | 264 | Latham Hi-Tech Seeds | Latham (corn / soy) — **independent family brand, Alexander IA** |
| `stine` | 217 | Stine Seed Company | Stine (corn / soy) — **largest US independent, Adel IA** |
| `lg_seeds` | 170 | AgReliant | LG Seeds (corn / soy / sorghum) |
| `golden_harvest` | 139 | Syngenta | Golden Harvest (corn / soy) |
| `nk` | 122 | Syngenta | NK (corn / soy) |
| `proharvest` | 119 | ProHarvest Seeds | ProHarvest / Apex (corn / soy / wheat) — **independent Corn Belt brand** |
| `agrigold` | 111 | AgReliant | AgriGold (corn / soy) |
| `first_choice` | 78 | 1st Choice Seeds | 1st Choice (corn / soy / wheat) — **employee-owned independent, Rushville IN** |
| `burrus` | 64 | Burrus Seed | Burrus / Power Plus / DONMARIO (corn / soy) — **independent family, Arenzville IL** |
| `ebberts_seeds` | 29 | Ebbert's Seeds | Ebbert's (corn / soy / wheat) — independent E. Corn Belt breeder |
| `agripro` | 24 | Syngenta | AgriPro (wheat — HRW / HRS / HWS / SWW) |
### Yield-trial data — 6,787 documents
| Source | Count | Notes |
|---|---|---|
| `gh_plot_reports` | 4,299 | Golden Harvest plot reports 2024+2025. **Cross-vendor head-to-head** — DEKALB / NK / GH / Pioneer / Channel all appear in the same trial rankings. |
| `lg_plot_reports` | 1,307 | LG Seeds (AgReliant) cross-vendor plots, top-5 per site, 2024+2025. |
| `agrigold_plot_reports` | 1,006 | AgriGold (AgReliant) cross-vendor plots, full ranking + rich plot management, 2024+2025. |
| `proharvest_plots` | 161 | ProHarvest Seeds per-cooperator harvest reports (corn / soy, 2024+2025). Many are **cross-vendor** (ProHarvest / Apex vs Pioneer / DEKALB / Becks / Channel / Wyffels). Structured rank/yield/%H2O/test-wt where the PDF fits the template; off-template third-party reports kept verbatim. |
| `agripro_trials` | 14 | Regional wheat trial PDF summaries (PNW, Western Plains, Northern Plains, etc.) |
### Not in the corpus (documented in `docs_mcp/lessons.md`)
- **Pioneer / Corteva (all brands)** — ToS bans automation. This now covers the whole Corteva family — Pioneer, Brevant, **Hoegemeyer** (the consolidation brand absorbing Seed Consultants / Dairyland / Nu-Tech / Terral), and the upcoming Vylor spinoff — all share the same corteva.com ToU. Curated fallback lesson points the farmer at a local dealer; legitimate Corteva-data paths are an official license (openinnovation@corteva.com) or university-extension trial data.
- **NK yield-results** — fiddly ASMX/SOAP endpoint, needs a dedicated reverse-engineer session.
- **Bayer per-variety trial data** — not publicly indexed (DEKALB / Asgrow trial data flows through Channel reps). Partially covered by the GH plot reports' cross-vendor results.
## MCP tools (6)
| Tool | Purpose |
|---|---|
| `search_docs` | Variety IDENTITY — what a hybrid IS (disease ratings, traits, maturity). Hybrid dense+BM25 + cross-encoder rerank + variety-code prefilter. |
| `search_trials` | Variety PERFORMANCE — head-to-head yield trial results. Filterable by crop, state, year, product. |
| `get_page` | Full canonical record for one variety + structured ratings header sourced from the sidecar JSON. |
| `lookup_variety` | Raw sidecar JSON for one variety — **fact-check tool**; call before quoting any specific rating value. |
| `list_versions` | Discover facets (sources, vendors, brands, crops) currently indexed. |
| `crop_seed_api_lessons` | Curated knowledge: Pioneer fallback policy, scale-direction differences across vendors, trait glossary, SCN race coverage notes. |

`search_docs` defaults to `data_type="variety"`; `search_trials` uses `data_type="trial"`. Both query a single Chroma collection, separated by a metadata filter.
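
A minimal sketch of how one collection can serve both tools through metadata filtering. The `where` operators (`$eq`, `$and`) are Chroma's documented filter syntax; the helper name `build_where` and the metadata keys `crop`, `state`, and `year` are assumptions for illustration, not the names used in `docs_mcp/server.py`.

```python
from typing import Any, Dict, Optional


def build_where(data_type: str,
                crop: Optional[str] = None,
                state: Optional[str] = None,
                year: Optional[int] = None) -> Dict[str, Any]:
    """Build a Chroma-style `where` filter for one data_type plus optional facets."""
    clauses = [{"data_type": {"$eq": data_type}}]
    # Hypothetical facet keys; only the ones the caller supplied are added.
    for key, value in (("crop", crop), ("state", state), ("year", year)):
        if value is not None:
            clauses.append({key: {"$eq": value}})
    # Chroma takes a bare clause for one condition and `$and` for two or more.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


variety_filter = build_where("variety")
trial_filter = build_where("trial", crop="corn", state="IA", year=2025)
print(variety_filter)
print(trial_filter)
```

The resulting dict would be passed as the `where=` argument of a Chroma `collection.query(...)` call, so the variety and trial surfaces never mix in one result set.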
## Retrieval — eval-validated
From `eval/results/baseline.md` (21 golden queries, k=5):
| Retriever | Pass | Recall | P@1 | MRR | Avg ms |
|---|---|---|---|---|---|
| **hybrid+rerank** | **21/21** | **100%** | **90%** | **0.905** | 2064 |
| bm25 | 20/21 | 95% | 81% | 0.833 | 5 |
| hybrid (no rerank) | 15/21 | 71% | 62% | 0.619 | 73 |
| dense | 14/21 | 67% | 38% | 0.440 | 79 |
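
For readers unfamiliar with the columns above, here is a small sketch of how Recall@k, P@1, and MRR are commonly computed from ranked results. It assumes "recall" means "at least one relevant document in the top k" (a hit rate); `eval/run_eval.py` may define it differently, and the function names and toy data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def first_hit_rank(ranked: Sequence[str], relevant: Set[str], k: int) -> Optional[int]:
    """Return the 1-based rank of the first relevant id in the top k, or None."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return rank
    return None


def summarize(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Aggregate hit-rate recall, precision at 1, and mean reciprocal rank."""
    ranks = [first_hit_rank(ranked, relevant, k) for ranked, relevant in runs]
    n = len(ranks)
    return {
        "recall_at_k": sum(r is not None for r in ranks) / n,
        "p_at_1": sum(r == 1 for r in ranks) / n,
        "mrr": sum(1.0 / r for r in ranks if r is not None) / n,
    }


# Toy data: one hit at rank 1, one hit at rank 3, one miss.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "b", "c"], {"c"}),
    (["x", "y", "z"], {"q"}),
]
summary = summarize(runs, k=5)
print(summary)
```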

**Deploy config**: `HYBRID_SEARCH=true` + `RERANK_URL=http://llama-rerank:8080`.

Some surprises worth knowing:
1. **Dense embedding alone is the weakest config**. Variety codes (DKC62-08RIB), gene names (Rps3a), and trait codes (XF) have no semantic neighbors — nomic-embed-text returns noise on them.
2. **Hybrid alone is WORSE than BM25 alone.** RRF dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
3. **BM25-alone (95% recall, 5 ms) is an excellent fallback** when the rerank sidecar is unavailable. The variety-code prefilter in `search_docs` does heavy lifting.
4. **Anti-hallucination queries pass on every retriever** — Pioneer fallback + not-in-corpus product checks hold across all configs.
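
The dilution effect in point 2 can be illustrated with a generic Reciprocal Rank Fusion sketch. This is the textbook RRF formula with the conventional constant `k=60`, not necessarily the fusion code in this repo; the document ids are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# BM25 nails the exact variety code; dense returns unrelated neighbours.
bm25_ranking = ["exact_match", "near_a", "near_b"]
dense_ranking = ["noise_1", "noise_2", "near_a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

In this toy case `near_a` appears in both lists and overtakes `exact_match`, which only BM25 found. That is the kind of reordering a cross-encoder rerank step is meant to correct.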
## Quick start
```bash
git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Sample-scrape just to verify wiring:
python -m scrape.runner --source bayer_seeds --limit 3
# Full refresh (all configured sources; expect ~25 min for gh_plot_reports
# with 4 concurrent workers):
python -m scrape.runner --all --force
# Rebuild Chroma + BM25 from the corpus:
OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_seed \
python -m rag.index --rebuild
# Run the eval harness:
RERANK_URL=http://localhost:18080 python -m eval.run_eval \
--queries eval/queries.jsonl --k 5 \
--output eval/results/baseline.md
# Local MCP server (stdio for Claude Desktop dev):
PRODUCT_NAME=crop_seed python -m docs_mcp.server --transport stdio
# Local HTTP server (matches production transport):
PRODUCT_NAME=crop_seed python -m docs_mcp.server \
--transport streamable-http --port 8000
```
## Repo layout
```
.
├── CLAUDE.md                     # Canonical agent guide. Read first.
├── PLAN.md                       # Template's 13-phase build guide.
├── README.md
├── requirements.txt
├── Dockerfile
├── sources.json                  # Source catalog (one entry per scraper)
├── deploy/docker-compose.yml     # Drop-in compose snippet for Drawbar
├── .gitea/workflows/
│   ├── refresh.yml               # Monthly cron: scrape + index + image push
│   └── image-only.yml            # On-demand code-only ship cycle
├── scrape/
│   ├── runner.py                 # `python -m scrape.runner --source <id>`
│   ├── changelog.py              # Reused from template
│   └── sources/
│       ├── bayer_seeds.py        # ~475 varieties across 3 brands
│       ├── golden_harvest.py     # ~139 varieties (post-discontinued filter)
│       ├── nk.py                 # 122 varieties (corn + soy)
│       ├── agripro.py            # 24 wheat varieties
│       ├── gh_plot_reports.py    # 4,299 cross-vendor yield trials
│       ├── agripro_trials.py     # 14 regional trial PDFs
│       └── becks_pfr.py          # stub: Sanity GROQ research corpus
├── rag/
│   ├── embeddings.py             # nomic-embed-text via Ollama
│   ├── chunk.py                  # one-chunk-per-variety + trial chunker
│   ├── index.py                  # Chroma + BM25 builder
│   └── bm25.py                   # FTS5 lexical index w/ seed-domain facets
├── docs_mcp/
│   ├── server.py                 # FastMCP: 6 tools, hybrid+rerank
│   ├── lessons.md                # Curated knowledge layer (Pioneer fallback)
│   └── usage.py                  # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl             # 21 golden queries
│   ├── retrievers.py             # dense / bm25 / hybrid / hybrid+rerank
│   ├── run_eval.py               # MRR / Recall@k / Precision@1
│   └── results/baseline.md       # Current deploy-config eval numbers
└── corpus/                       # Committed scrape output (CI-refreshed)
    ├── bayer_seeds/
    ├── golden_harvest/
    ├── nk/
    ├── agripro/
    ├── gh_plot_reports/
    └── agripro_trials/
```
## Infrastructure
- **Registry**: CI pushes to `192.168.0.2:1234` (LAN, no Cloudflare body cap); deploys pull `git.jpaul.io/justin/seed-mcp:latest` (public, via Cloudflare tunnel). Images are also tagged `:<sha12>` for rollback pinning and `:corpus-YYYY.MM.DD` for snapshot pinning.
- **Embedder pool (CI)**: 3 GPU-pinned Ollama endpoints, weighted toward `.0.125` (RTX 40-series, 242 embeds/sec):
  - `.0.125:11434` ×4 (4090)
  - `.0.2:11436` ×2 (GPU-pinned)
  - `.0.2:11435` ×1 (GPU-pinned)
  - Do NOT use `.0.2:11434` (not GPU-pinned) or `localhost:11434` (works in dev, breaks in CI because the runner container has no Ollama on its loopback).
- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla P4 (jina-reranker-v2-base via llama.cpp). One container serves both seed-mcp and crop-chem-docs. **Must be on `drawbar-backend_default` Docker network** — see `deploy/docker-compose.yml` for the network-attach gotcha that caused silent rerank degradation on chem-mcp prior to 2026-05-25.
- **PRODUCT_NAME**: `crop_seed` — used in the Chroma collection name (`crop_seed_docs`), the BM25 db filename (`bm25/crop_seed_docs.db`), and the `crop_seed_api_lessons` tool name. Not `seed_mcp` — that would conflict with the container/service name.
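
As an illustration of the embedder-pool weighting above (×4 / ×2 / ×1), here is a minimal weighted round-robin sketch. The full host addresses are expanded from the `.0.x` shorthand as an assumption, and `weighted_cycle` is a hypothetical helper; the CI workflow may distribute embed batches differently.

```python
import itertools
from typing import Dict, Iterator, List

# Weights from the pool list above; full addresses are assumed, not confirmed.
EMBEDDER_WEIGHTS: Dict[str, int] = {
    "http://192.168.0.125:11434": 4,
    "http://192.168.0.2:11436": 2,
    "http://192.168.0.2:11435": 1,
}


def weighted_cycle(weights: Dict[str, int]) -> Iterator[str]:
    """Yield endpoints forever, each appearing `weight` times per cycle."""
    slots: List[str] = [url for url, w in weights.items() for _ in range(w)]
    return itertools.cycle(slots)


pool = weighted_cycle(EMBEDDER_WEIGHTS)
first_cycle = [next(pool) for _ in range(7)]
print(first_cycle)
```

This version sends consecutive batches to the same endpoint before moving on. An interleaved schedule would spread load more evenly, at the cost of a slightly longer helper.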
## Deploy mechanics
Watchtower handles auto-deploy. Every push to `seed-mcp/main` that touches `docs_mcp/`, `rag/`, `scrape/`, `requirements.txt`, `Dockerfile`, or `sources.json` triggers `image-only.yml`:
1. Checks out main with full corpus
2. Rebuilds Chroma + BM25 (~3 min on the GPU pool)
3. `docker build` + push three tags to the LAN registry
4. Links the package to the repo via Gitea API
5. Watchtower on trashpanda polls `:latest` every 5 min → pulls + recreates `drawbar-backend-seed-mcp-1`

Corpus refresh runs monthly via `refresh.yml` (1st of each month, 06:00 UTC): it re-scrapes all GREEN sources, commits any corpus diff, rebuilds the indexes, and ships a new image tagged `:corpus-YYYY.MM.DD`.
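
The tagging scheme (`:latest`, `:<sha12>`, `:corpus-YYYY.MM.DD`) can be sketched as a small helper. `image_tags` is a hypothetical function, and the commit SHA below is a placeholder; the actual workflow presumably derives these values in shell inside the Gitea Actions YAML.

```python
from datetime import date
from typing import List, Optional


def image_tags(commit_sha: str, corpus_date: Optional[date] = None) -> List[str]:
    """Return the tag names for one image build."""
    tags = ["latest", commit_sha[:12]]  # moving tag + 12-char rollback pin
    if corpus_date is not None:
        # Snapshot pin in the YYYY.MM.DD form used by the README.
        tags.append(f"corpus-{corpus_date:%Y.%m.%d}")
    return tags


# Placeholder 40-character SHA, not a real commit in this repo.
tags = image_tags("0123456789abcdef0123456789abcdef01234567", date(2026, 6, 1))
print(tags)
```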
See `CLAUDE.md` for canonical sidecar schemas, the reversed disease-scale gotcha (NK + AgriPro publish 1=best, vs Bayer/GH 9=best), and the scraper conventions.
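
To make the scale-direction gotcha concrete, here is a hedged sketch that maps raw ratings onto a common 0..1 axis where 1.0 is always best. The ranges and directions are the ones stated for Latham, Stine (corn), 1st Choice, and Burrus; the linear rescaling, the `Scale` dataclass, and `normalize` are assumptions for illustration, not how the sidecar's `_scale_direction` field is consumed in this repo.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Scale:
    low: float
    high: float
    higher_is_better: bool


# Ranges and directions as documented per vendor; Stine applies to corn only.
SCALES: Dict[str, Scale] = {
    "latham": Scale(1, 9, higher_is_better=False),
    "stine": Scale(1, 9, higher_is_better=True),
    "first_choice": Scale(0, 10, higher_is_better=True),
    "burrus": Scale(1, 10, higher_is_better=True),
}


def normalize(source: str, raw: float) -> float:
    """Map a raw rating to 0..1 with 1.0 = best, regardless of vendor direction."""
    scale = SCALES[source]
    if not scale.low <= raw <= scale.high:
        raise ValueError(f"{raw} is outside the {source} scale")
    frac = (raw - scale.low) / (scale.high - scale.low)
    return frac if scale.higher_is_better else 1.0 - frac


# A Latham "1" and a Burrus "10" are both top ratings despite the raw numbers.
print(normalize("latham", 1), normalize("burrus", 10))
```

Even after normalization, cross-vendor comparisons remain approximate, since each vendor rates against its own lineup.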
## Status
| Phase | Status |
|---|---|
| 0 — scaffold | ✅ |
| 1 — scrapers (bayer_seeds / golden_harvest / nk / agripro / gh_plot_reports / agripro_trials) | ✅ |
| 2 — chunk + index | ✅ |
| 3 — MCP tools (6) | ✅ |
| 4-5 — Dockerfile + Gitea CI | ✅ |
| 6 — reranker integration | ✅ (eval-validated; deploy uses hybrid+rerank) |
| 7 — eval harness | ✅ (21 golden queries, baseline committed) |
| 8 — hybrid search | ✅ (default ON) |
| 11 — `crop_seed_api_lessons` curated layer | ✅ (Pioneer fallback + 7 other lessons) |
| 13 — weekly_digest | not planned for seed-mcp |

Remaining work (deferred, not blocking):
- `becks_pfr` scraper (2,089 research docs via public Sanity GROQ)
- 2023 GH plot reports backfill (~3,619 more docs)
- NK yield-results endpoint reverse-engineer
- Channel Seed brand (~320 more Bayer varieties — separate brand under the same sitemap)