justin 30b182e28a Three new brand scrapers: LG Seeds + AgriGold + Ebbert's Seeds (+310 varieties)
User flagged that LG, AgriGold, and Ebbert's (local Ohio breeder) are
all active in farmer territory. Built three scrapers; the corpus now
covers 5,839 chunks across 11 brands.

Net new varieties: 310
  lg_seeds        170 — corn 78 + soy 63 + alfalfa 16 + sorghum 13
                  → adds FIRST alfalfa coverage (FD 3-5 range)
  agrigold        111 — corn 60 + soy 51
  ebberts_seeds    29 — corn 17 + soy 12 (regional OH/IN breeder)

scrape/sources/lg_seeds.py — embedded-JSON pattern (cleanest):
- /products/<crop> pages have a `var products = [...]` blob with the
  variety summary (Variety, Maturity, Traits[], Bullets[], CropType).
- Per-variety detail page (/products/<crop>/<Variety>) carries the
  ratings as `<span class="bar-N">` where N is 1-9 on the canonical
  scale. Same 9=best direction as Bayer / Golden Harvest.
- Three sections per page: Characteristics / Management / Disease
  Tolerance, plus a few qualitative bars ("Tar Spot Susceptible",
  "Fungicide Response High") preserved as text values.
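
A minimal sketch of the embedded-JSON pattern described above, using only the standard library. The sample markup, the regexes, and the function names are assumptions for illustration; the real pages and scrape/sources/lg_seeds.py may differ.

```python
import json
import re

# Hypothetical sample markup; field values are illustrative only.
LISTING_HTML = """
<script>
var products = [{"Variety": "LG5701", "Maturity": "116",
                 "Traits": ["SmartStax"], "Bullets": [], "CropType": "corn"}];
</script>
"""
DETAIL_HTML = """
<h3>Disease Tolerance</h3>
<div>Northern Leaf Blight <span class="bar-8"></span></div>
<div>Southern Leaf Blight <span class="bar-9"></span></div>
"""

# Capture the array literal assigned to `var products`, up to the closing "];".
PRODUCTS_RE = re.compile(r"var\s+products\s*=\s*(\[.*?\])\s*;", re.DOTALL)
# Capture a label followed by a bar-N span, where N is 1-9.
BAR_RE = re.compile(r'<div>\s*([^<]+?)\s*<span class="bar-([1-9])">')


def extract_products(html):
    """Return the list embedded in the `var products = [...]` blob, or []."""
    match = PRODUCTS_RE.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


def extract_ratings(html):
    """Map each rating label to its integer bar value."""
    return {label: int(value) for label, value in BAR_RE.findall(html)}


print(extract_products(LISTING_HTML)[0]["Variety"])
print(extract_ratings(DETAIL_HTML))
```

The non-greedy match stops at the first `];`, so a blob that contains that sequence inside a string value would need a real JavaScript-aware parser instead.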

scrape/sources/agrigold.py — 5-circle scale:
- Listing page has 60+ /corn/explore-corn-hybrids/<CODE> URLs.
- Detail page renders ratings as <div class="scale"> blocks with 5
  child <div class="circle"> elements, of which N have class
  "circle selected" → rating N on a 1-5 scale.
- 7 sections per page incl. Silage Characteristics (Dairy Silage
  Rating, NDFd 30 Hr, Crude Protein), Planting Applications, Soil
  Adaptability, Plant Characteristics, Product Features.
- Distinct rating scale (1-5 vs Bayer's 1-9), declared in
  _scale_direction so the chunker preamble renders correctly.
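
The circle-counting rule above can be sketched with the standard-library HTML parser. The class name, helper name, and sample markup are hypothetical; only the `scale` / `circle selected` class names come from the description.

```python
from html.parser import HTMLParser


class CircleScaleParser(HTMLParser):
    """Collects one rating per <div class="scale"> block."""

    def __init__(self):
        super().__init__()
        self.ratings = []    # one int per scale block, in document order
        self._depth = 0      # div nesting depth inside the current scale block
        self._selected = 0   # selected circles seen in the current block

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            # Outside any scale block: only a "scale" div opens one.
            if "scale" in classes:
                self._depth = 1
                self._selected = 0
            return
        self._depth += 1
        if "circle" in classes and "selected" in classes:
            self._selected += 1

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # The scale block just closed: record its rating.
            self.ratings.append(self._selected)


def parse_circle_ratings(html):
    parser = CircleScaleParser()
    parser.feed(html)
    return parser.ratings


SAMPLE = """
<div class="scale">
  <div class="circle selected"></div>
  <div class="circle selected"></div>
  <div class="circle selected"></div>
  <div class="circle"></div>
  <div class="circle"></div>
</div>
"""
print(parse_circle_ratings(SAMPLE))
```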

scrape/sources/ebberts_seeds.py — small regional breeder, verbatim
text approach:
- Single page per crop (corn / soybeans / wheat). Each variety is an
  <h1> + multi-section CSS-grid block where labels and values are in
  separate adjacent cells. Reconstructing perfectly aligned columns
  for 29 varieties total isn't worth the engineering, so the chunk
  body carries the verbatim text in document order and the LLM reads
  the tabular content directly.
- Scale: 1-5 (1 = best, lower = more resistant), inferred from
  marketing-vs-rating cross-checks ("Robust tall plants" + STANDABILITY
  1.0 → 1 = best).
- Politeness: robots.txt asks for Crawl-delay: 5; honored.
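
One way to honor a Crawl-delay directive, sketched with `urllib.robotparser`. The `Throttle` class, the function name, and the one-second default are assumptions; the actual scraper may implement politeness differently.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_delay_from_robots(robots_txt, user_agent="*", default=1.0):
    """Return the Crawl-delay for user_agent in seconds, or `default` if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class Throttle:
    """Sleeps so that consecutive wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self._delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


ROBOTS_TXT = "User-agent: *\nCrawl-delay: 5\n"
print(crawl_delay_from_robots(ROBOTS_TXT))
```

Calling `Throttle(delay).wait()` before each request keeps the fetch loop itself free of timing logic.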

All three new scrapers smoke-tested:
- LG corn LG5701 RM 116 SmartStax → 3 characteristic groups with
  Disease Tolerance ratings (Northern/Southern Leaf Blight 8-9, etc.)
- AgriGold A616-30 RM 86 VT2RIB → 7 groups incl. silage and soil
  adaptability ratings
- Ebbert's 7000TR RIB RM 100 → 1098-char verbatim body covering
  CHARACTERISTICS, DISEASE RATINGS, herbicide tolerance, etc.

Corpus state after this PR:
- 5,839 chunks (was 5,529)
- 11 brands (was 8)
- 8 crops (corn 3047, soy 2209, silage 359, wheat 123, sorghum 49,
  cotton 30, alfalfa 16, canola 6) — alfalfa is brand-new

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:42:23 -04:00

seed-mcp

MCP server over the public catalogs of major US row-crop seed vendors — variety identity (what each hybrid IS) plus yield-trial data (how they actually perform in real cooperator fields). Sibling project to crop-chem-docs (pesticide labels), feeding the same Drawbar farm-advisor AI.

Deployed 2026-05-25 on trashpanda as a sibling sidecar to chem-mcp; the Drawbar advisor calls it via the seed: prefix.

What's in the corpus

5,073 indexed chunks across two complementary surfaces:

Variety identity — 760 records

| Source | Count | Vendor | Brand |
|---|---|---|---|
| bayer_seeds | 475 | Bayer | DEKALB (corn) / Asgrow (soy) / WestBred (wheat) |
| golden_harvest | 139 | Syngenta | Golden Harvest (corn / soy) |
| nk | 122 | Syngenta | NK (corn / soy) |
| agripro | 24 | Syngenta | AgriPro (wheat: HRW / HRS / HWS / SWW) |

Yield-trial data — 4,313 documents

| Source | Count | Notes |
|---|---|---|
| gh_plot_reports | 4,299 | Golden Harvest plot reports 2024+2025. Cross-vendor head-to-head: DEKALB / NK / GH / Pioneer / Channel all appear in the same trial rankings. The closest thing to independent comparison data the corpus has. |
| agripro_trials | 14 | Regional wheat trial PDF summaries (PNW, Western Plains, Northern Plains, etc.) |

Not in the corpus (documented in docs_mcp/lessons.md)

  • Pioneer / Corteva — ToS bans automation. Curated fallback lesson points the farmer at pioneer.com / a local dealer.
  • NK yield-results — fiddly ASMX/SOAP endpoint, needs a dedicated reverse-engineer session.
  • Bayer per-variety trial data — not publicly indexed (DEKALB / Asgrow trial data flows through Channel reps). Partially covered by the GH plot reports' cross-vendor results.

MCP tools (6)

| Tool | Purpose |
|---|---|
| search_docs | Variety IDENTITY: what a hybrid IS (disease ratings, traits, maturity). Hybrid dense+BM25 + cross-encoder rerank + variety-code prefilter. |
| search_trials | Variety PERFORMANCE: head-to-head yield trial results. Filterable by crop, state, year, product. |
| get_page | Full canonical record for one variety + structured ratings header sourced from the sidecar JSON. |
| lookup_variety | Raw sidecar JSON for one variety. Fact-check tool; call before quoting any specific rating value. |
| list_versions | Discover facets (sources, vendors, brands, crops) currently indexed. |
| crop_seed_api_lessons | Curated knowledge: Pioneer fallback policy, scale-direction differences across vendors, trait glossary, SCN race coverage notes. |

search_docs defaults to data_type="variety"; search_trials uses data_type="trial" — single Chroma collection, metadata-filtered.
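
A sketch of how one collection can serve both tools through metadata filters. It only builds the filter dict in Chroma's `where` syntax; the `build_where` helper and the facet key names (`crop`, `state`, `year`) are assumptions, not the server's actual code.

```python
def build_where(data_type, **facets):
    """Build a Chroma-style `where` filter from a data_type plus optional facets.

    Facets set to None are skipped. A single clause is returned bare; several
    clauses are combined with the `$and` operator.
    """
    clauses = [{"data_type": data_type}]
    clauses += [{key: value} for key, value in facets.items() if value is not None]
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


# search_docs-style filter: identity records only.
print(build_where("variety"))
# search_trials-style filter: trial records narrowed by hypothetical facet keys.
print(build_where("trial", crop="corn", state="OH", year=None))
```

The resulting dict would be passed as the `where` argument of a collection query, so both surfaces share one embedding index.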

Retrieval — eval-validated

From eval/results/baseline.md (21 golden queries, k=5):

| Retriever | Pass | Recall | P@1 | MRR | Avg ms |
|---|---|---|---|---|---|
| hybrid+rerank | 21/21 | 100% | 90% | 0.905 | 2064 |
| bm25 | 20/21 | 95% | 81% | 0.833 | 5 |
| hybrid (no rerank) | 15/21 | 71% | 62% | 0.619 | 73 |
| dense | 14/21 | 67% | 38% | 0.440 | 79 |
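
For readers unfamiliar with the column headers, here is a minimal sketch of how Recall@k, P@1, and MRR are commonly computed. It assumes hit-based recall (a query passes if any relevant document appears in the top k), which matches Pass and Recall agreeing in the table; eval/run_eval.py may define them differently.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none is present."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def summarize(results, k=5):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    n = len(results)
    hits = sum(1 for ranked, rel in results if any(d in rel for d in ranked[:k]))
    top1 = sum(1 for ranked, rel in results if ranked[:1] and ranked[0] in rel)
    mrr = sum(reciprocal_rank(ranked[:k], rel) for ranked, rel in results)
    return {"recall_at_k": hits / n, "p_at_1": top1 / n, "mrr": mrr / n}


# Three toy queries: relevant doc at rank 1, at rank 2, and missing entirely.
toy = [
    (["a", "b", "c"], {"a"}),
    (["x", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q"}),
]
print(summarize(toy))
```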

Deploy config: HYBRID_SEARCH=true + RERANK_URL=http://llama-rerank:8080.

Some surprises worth knowing:

  1. Dense embedding alone is the weakest config. Variety codes (DKC62-08RIB), gene names (Rps3a), and trait codes (XF) have no semantic neighbors — nomic-embed-text returns noise on them.
  2. Hybrid alone is WORSE than BM25 alone. RRF dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
  3. BM25-alone (95% recall, 5 ms) is an excellent fallback when the rerank sidecar is unavailable. The variety-code prefilter in search_docs does heavy lifting.
  4. Anti-hallucination queries pass on every retriever — Pioneer fallback + not-in-corpus product checks hold across all configs.
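
The dilution in point 2 can be reproduced with a toy reciprocal rank fusion. This is a sketch of standard RRF with the conventional constant k=60; the constant and the document ids are assumptions, not values taken from this repo.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists with reciprocal rank fusion, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# BM25 nails the exact variety code; dense retrieval never surfaces it.
bm25_ranking = ["DKC62-08RIB", "generic-a", "generic-b"]
dense_ranking = ["unrelated-x", "generic-a", "generic-b"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Documents that are mediocre in both lists collect two contributions, so they outrank the exact match that only BM25 found. A reranker applied to the fused candidates can restore the correct order, which is consistent with the hybrid+rerank row in the table.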

Quick start

git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Sample-scrape just to verify wiring:
python -m scrape.runner --source bayer_seeds --limit 3

# Full refresh (all 6 sources; expect ~25 min for gh_plot_reports
# with 4 concurrent workers):
python -m scrape.runner --all --force

# Rebuild Chroma + BM25 from the corpus:
OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_seed \
  python -m rag.index --rebuild

# Run the eval harness:
RERANK_URL=http://localhost:18080 python -m eval.run_eval \
  --queries eval/queries.jsonl --k 5 \
  --output eval/results/baseline.md

# Local MCP server (stdio for Claude Desktop dev):
PRODUCT_NAME=crop_seed python -m docs_mcp.server --transport stdio

# Local HTTP server (matches production transport):
PRODUCT_NAME=crop_seed python -m docs_mcp.server \
  --transport streamable-http --port 8000

Repo layout

.
├── CLAUDE.md                      # Canonical agent guide. Read first.
├── PLAN.md                        # Template's 13-phase build guide.
├── README.md
├── requirements.txt
├── Dockerfile
├── sources.json                   # Source catalog (one entry per scraper)
├── deploy/docker-compose.yml      # Drop-in compose snippet for Drawbar
├── .gitea/workflows/
│   ├── refresh.yml                # Monthly cron: scrape + index + image push
│   └── image-only.yml             # On-demand code-only ship cycle
├── scrape/
│   ├── runner.py                  # `python -m scrape.runner --source <id>`
│   ├── changelog.py               # Reused from template
│   └── sources/
│       ├── bayer_seeds.py         # ~475 varieties across 3 brands
│       ├── golden_harvest.py      # ~139 varieties (post-discontinued filter)
│       ├── nk.py                  # 122 varieties (corn + soy)
│       ├── agripro.py             # 24 wheat varieties
│       ├── gh_plot_reports.py     # 4,299 cross-vendor yield trials
│       ├── agripro_trials.py      # 14 regional trial PDFs
│       └── becks_pfr.py           # stub — Sanity GROQ research corpus
├── rag/
│   ├── embeddings.py              # nomic-embed-text via Ollama
│   ├── chunk.py                   # one-chunk-per-variety + trial chunker
│   ├── index.py                   # Chroma + BM25 builder
│   └── bm25.py                    # FTS5 lexical index w/ seed-domain facets
├── docs_mcp/
│   ├── server.py                  # FastMCP — 6 tools, hybrid+rerank
│   ├── lessons.md                 # Curated knowledge layer (Pioneer fallback)
│   └── usage.py                   # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl              # 21 golden queries
│   ├── retrievers.py              # dense / bm25 / hybrid / hybrid+rerank
│   ├── run_eval.py                # MRR / Recall@k / Precision@1
│   └── results/baseline.md        # Current deploy-config eval numbers
└── corpus/                        # Committed scrape output (CI-refreshed)
    ├── bayer_seeds/
    ├── golden_harvest/
    ├── nk/
    ├── agripro/
    ├── gh_plot_reports/
    └── agripro_trials/

Infrastructure

  • Registry: pushes to 192.168.0.2:1234 (LAN, no CF body cap); deploys pull git.jpaul.io/justin/seed-mcp:latest (public, CF tunnel). Also tagged :<sha12> for rollback pinning and :corpus-YYYY.MM.DD for snapshot pinning.
  • Embedder pool (CI): 3 GPU-pinned Ollama endpoints, weighted toward .0.125 (RTX 40-series, 242 embeds/sec):
    • .0.125:11434 ×4 (4090)
    • .0.2:11436 ×2 (GPU-pinned)
    • .0.2:11435 ×1 (GPU-pinned)
    • Do NOT use .0.2:11434 (not GPU-pinned) or localhost:11434 (works in dev, breaks in CI — runner container has no Ollama on its loopback).
  • Reranker: shared llama-rerank sidecar on trashpanda's Tesla P4 (jina-reranker-v2-base via llama.cpp). One container serves both seed-mcp and crop-chem-docs. Must be on drawbar-backend_default Docker network — see deploy/docker-compose.yml for the network-attach gotcha that caused silent rerank degradation on chem-mcp prior to 2026-05-25.
  • PRODUCT_NAME: crop_seed — used in the Chroma collection name (crop_seed_docs), the BM25 db filename (bm25/crop_seed_docs.db), and the crop_seed_api_lessons tool name. Not seed_mcp — that would conflict with the container/service name.
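
The naming rule in the last bullet can be sketched as a single function. The function name and dict keys are hypothetical; only the three resulting names come from the text above.

```python
from pathlib import PurePosixPath


def derived_names(product_name):
    """Derive the collection, BM25 db path, and lessons tool name from PRODUCT_NAME."""
    collection = f"{product_name}_docs"
    return {
        "chroma_collection": collection,
        "bm25_db": str(PurePosixPath("bm25") / f"{collection}.db"),
        "lessons_tool": f"{product_name}_api_lessons",
    }


print(derived_names("crop_seed"))
```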

Deploy mechanics

Watchtower handles auto-deploy. Every push to seed-mcp/main that touches docs_mcp/, rag/, scrape/, requirements.txt, Dockerfile, or sources.json triggers image-only.yml:

  1. Checks out main with full corpus
  2. Rebuilds Chroma + BM25 (~3 min on the GPU pool)
  3. docker build + push three tags to the LAN registry
  4. Links the package to the repo via Gitea API
  5. Watchtower on trashpanda polls :latest every 5 min → pulls + recreates drawbar-backend-seed-mcp-1
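
A small model of the trigger rule and the three image tags, for illustration only. The real filtering is done by the workflow's path configuration, and reading `<sha12>` as the first 12 characters of the commit SHA is an assumption.

```python
from datetime import date

# Paths named in the text above; entries ending in "/" are directory prefixes.
TRIGGER_PATHS = (
    "docs_mcp/", "rag/", "scrape/",
    "requirements.txt", "Dockerfile", "sources.json",
)


def triggers_image_build(changed_files):
    """True if any changed file matches a trigger path."""
    return any(
        path == trigger or (trigger.endswith("/") and path.startswith(trigger))
        for path in changed_files
        for trigger in TRIGGER_PATHS
    )


def image_tags(commit_sha, corpus_date):
    """Return the three tags: latest, 12-char SHA prefix, dated corpus snapshot."""
    return ["latest", commit_sha[:12], f"corpus-{corpus_date:%Y.%m.%d}"]


print(triggers_image_build(["README.md"]))
print(image_tags("0123456789abcdef0123456789abcdef01234567", date(2026, 5, 26)))
```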

Corpus refresh runs monthly via refresh.yml (1st of each month, 06:00 UTC) — re-scrapes all GREEN sources, commits any corpus diff, rebuilds indexes, ships a new image with :corpus-YYYY.MM.DD tagged.

See CLAUDE.md for canonical sidecar schemas, the reversed disease-scale gotcha (NK + AgriPro publish 1=best, vs Bayer/GH 9=best), and the scraper conventions.
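
To make the scale-direction gotcha concrete, here is a hypothetical scale declaration that renders a preamble and, purely for illustration, maps ratings onto a common 0-to-1 axis. The `Scale` class is not the repo's `_scale_direction` schema, and nothing above says the project normalizes ratings this way. The two example scales use ranges stated on this page (Bayer 1-9 with 9 = best, Ebbert's 1-5 with 1 = best).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    low: int
    high: int
    best: int  # which end of the range is the best rating

    def __post_init__(self):
        if self.best not in (self.low, self.high):
            raise ValueError("best must equal low or high")

    def preamble(self):
        """One-line description a chunker could prepend to a record."""
        worst = self.low if self.best == self.high else self.high
        return (f"Ratings use a {self.low}-{self.high} scale: "
                f"{self.best} = best, {worst} = worst.")

    def to_unit(self, rating):
        """Map a rating onto 0.0 (worst) .. 1.0 (best)."""
        position = (rating - self.low) / (self.high - self.low)
        return position if self.best == self.high else 1.0 - position


BAYER = Scale(low=1, high=9, best=9)
EBBERTS = Scale(low=1, high=5, best=1)

print(BAYER.preamble())
print(EBBERTS.preamble())
print(BAYER.to_unit(9), EBBERTS.to_unit(1.0))
```

Under this sketch a Bayer 9 and an Ebbert's 1.0 both land at the best end of the axis, which is the comparison a reader would otherwise get backwards.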

Status

Phase Status
0 — scaffold
1 — scrapers (bayer_seeds / golden_harvest / nk / agripro / gh_plot_reports / agripro_trials)
2 — chunk + index
3 — MCP tools (6)
4-5 — Dockerfile + Gitea CI
6 — reranker integration (eval-validated; deploy uses hybrid+rerank)
7 — eval harness (21 golden queries, baseline committed)
8 — hybrid search (default ON)
11 — crop_seed_api_lessons curated layer (Pioneer fallback + 7 other lessons)
13 — weekly_digest not planned for seed-mcp

Remaining work (deferred, not blocking):

  • becks_pfr scraper (2,089 research docs via public Sanity GROQ)
  • 2023 GH plot reports backfill (~3,619 more docs)
  • NK yield-results endpoint reverse-engineer
  • Channel Seed brand (~320 more Bayer varieties — separate brand under the same sitemap)
Description
MCP server over US row-crop seed/hybrid variety data (corn, soybeans, wheat). Sibling to crop-chem-docs. Feeds Drawbar farmer advisor.