justin 1409c2617d golden_harvest: implement scraper (~175 Syngenta corn + soy)
Sitemap-driven scraper for goldenharvestseeds.com. Walks
sitemap-ghs-hybrids.xml to discover product URLs under
/products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates).
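
The discovery step can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: it assumes the sitemap is a standard sitemaps.org `urlset`, and the function name, the `www.` host, and the sample slugs are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemaps.org namespace (assumed to be what the sitemap uses).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Path prefixes named in the commit message.
PRODUCT_PREFIXES = {"corn": "/products/corn/", "soybean": "/products/soybean/"}


def discover_product_urls(sitemap_xml: str) -> dict[str, list[str]]:
    """Group sitemap <loc> URLs by crop, keeping only product detail pages."""
    root = ET.fromstring(sitemap_xml)
    found: dict[str, list[str]] = {crop: [] for crop in PRODUCT_PREFIXES}
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        path = urlparse(url).path
        for crop, prefix in PRODUCT_PREFIXES.items():
            # Require a slug after the prefix so a bare category page is skipped.
            if path.startswith(prefix) and len(path) > len(prefix):
                found[crop].append(url)
    return found


# Hypothetical sample; the slugs are placeholders, not real products.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.goldenharvestseeds.com/products/corn/example-hybrid</loc></url>
  <url><loc>https://www.goldenharvestseeds.com/products/soybean/example-variety</loc></url>
  <url><loc>https://www.goldenharvestseeds.com/products/corn/</loc></url>
  <url><loc>https://www.goldenharvestseeds.com/about</loc></url>
</urlset>"""

urls = discover_product_urls(SAMPLE)
print({crop: len(found) for crop, found in urls.items()})
```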

Per-variety detail parsed from server-rendered HTML:

- product code (from <h1> / <title>)
- positioning (from <meta name="Description">)
- maturity (from <div class="product-label"><div class="right">):
  integer days for corn, decimal MG for soybeans
- traits derived from product-code suffix (XF, E3, VIP3, GT, Z, etc.)
- 9-row disease tolerance bar chart (#dvDiseaseTolerance) where
  data-percentage / 10 = rating on 1-9 (9 = best) scale
- 9-row agronomic characteristics bar chart (#dvAgronomicChar)
- recommended environment list (.AgronomicMange — upstream typo)
- all 2-column tables (plant description, seed quality, herbicide
  responses, Phytophthora gene, SCN race coverage)
- tech-sheet PDF URL from live HTML (not sitemap — that's stale)
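
Two of the fields above lend themselves to a short sketch: the bar-chart ratings (data-percentage / 10) and the maturity value. The code below is an assumption-laden illustration using only the standard library. The container ids come from the list above, but the inner markup, the class and function names, and the sample HTML are hypothetical; the real page structure may differ.

```python
import re
from html.parser import HTMLParser

# Tags with no closing tag; they must not affect depth tracking.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class BarChartParser(HTMLParser):
    """Collect data-percentage values found inside one container id."""

    def __init__(self, container_id: str) -> None:
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # > 0 while the parser is inside the container
        self.percentages: list[float] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.depth == 0:
            # Outside the container: only look for its opening tag.
            if attr.get("id") == self.container_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if "data-percentage" in attr:
            self.percentages.append(float(attr["data-percentage"]))

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def bar_ratings(html: str, container_id: str) -> list[float]:
    """Convert each bar's data-percentage to a 1-9 rating (9 = best)."""
    parser = BarChartParser(container_id)
    parser.feed(html)
    return [pct / 10 for pct in parser.percentages]


def parse_maturity(text: str, crop: str):
    """Integer days for corn, decimal maturity group for soybeans."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no maturity value in {text!r}")
    number = float(match.group())
    return int(number) if crop == "corn" else number


# Hypothetical markup; only the two container ids are taken from the source.
SAMPLE_HTML = """
<div id="dvDiseaseTolerance">
  <div class="bar" data-percentage="90"><span>Example disease A</span></div>
  <div class="bar" data-percentage="60"><span>Example disease B</span></div>
</div>
<div id="dvAgronomicChar">
  <div class="bar" data-percentage="80"></div>
</div>
"""

print(bar_ratings(SAMPLE_HTML, "dvDiseaseTolerance"))
print(bar_ratings(SAMPLE_HTML, "dvAgronomicChar"))
```

Tracking nesting depth keeps the two charts separate even though both use the same `data-percentage` attribute.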

302 redirects to /product-finder treated as "discontinued" and
skipped (Golden Harvest still sitemap-lists some retired SKUs).
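
One way to express the discontinued check, as a sketch only: it assumes the fetch layer does not follow redirects automatically (otherwise the 302 is never seen) and that the redirect target path is exactly `/product-finder`. The function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

REDIRECT_TARGET = "/product-finder"  # path named in the commit message


def is_discontinued(status_code: int, location: Optional[str]) -> bool:
    """True when a product URL answers with a 302 pointing at the product finder."""
    if status_code != 302 or not location:
        return False
    # Location may be absolute or relative; compare on the path only.
    return urlparse(location).path.rstrip("/") == REDIRECT_TARGET


# Hypothetical (status, Location header) pairs as a fetcher might report them.
responses = [
    (200, None),
    (302, "/product-finder"),
    (302, "https://www.goldenharvestseeds.com/product-finder/"),
    (302, "/products/corn/renamed-slug"),
]
print([is_discontinued(status, loc) for status, loc in responses])
```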

Rating scale: 1-9 (9 = best) — same as Bayer despite recon's
"9-to-1" descriptor (that referred to chart-axis direction, not
numeric meaning). _scale_direction is set explicitly so the chunker
stays forward-compatible.
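
To illustrate why an explicit direction flag helps, here is a minimal sketch of how a downstream consumer might normalize ratings. Only the `_scale_direction` key comes from the commit message; the string values `"high_is_best"` and `"low_is_best"`, the other sidecar fields, and the function name are assumptions made for this example.

```python
def normalize_rating(value: float, scale_direction: str, lo: int = 1, hi: int = 9) -> float:
    """Return the rating on a scale where `hi` is best."""
    if not lo <= value <= hi:
        raise ValueError(f"rating {value} outside {lo}-{hi}")
    if scale_direction == "high_is_best":
        return value
    if scale_direction == "low_is_best":
        # Mirror the value around the scale midpoint: 1 <-> 9, 2 <-> 8, ...
        return lo + hi - value
    raise ValueError(f"unknown scale direction: {scale_direction!r}")


# Hypothetical sidecar fragment; the ratings are placeholders.
sidecar = {
    "_scale_direction": "high_is_best",
    "disease_tolerance": {"Example disease A": 7, "Example disease B": 4},
}

normalized = {
    name: normalize_rating(rating, sidecar["_scale_direction"])
    for name, rating in sidecar["disease_tolerance"].items()
}
print(normalized)
```

With the direction stored alongside the numbers, a vendor whose scale really is inverted only needs a different flag, not a different code path.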

PDFs are NOT downloaded (recon flagged ~14MB each); tech-sheet URLs
are captured in the sidecar for future enrichment.

Smoke-tested all branches: 6 corn varieties (E085Z5, E092W5,
E094Z4, E095D3, E097K6, E100A3) with all 6 characteristics groups
+ tech-sheet URL; 3 soy varieties (GH00864XF MG 0.08, GH00973E3
MG 0.09, GH0225XF MG 0.2) with disease + agronomic bars; 302
redirects skipped cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:30:30 -04:00

seed-mcp

MCP server over the public catalogs of major US row-crop seed vendors — corn, soybeans, wheat. Sibling project to crop-chem-docs (pesticide labels), feeding the same Drawbar farm-advisor AI.

The server exposes per-variety records with agronomic ratings, disease tolerance, trait stack, maturity, and regional notes — so the advisor can answer questions like "which corn hybrid for sandy soil, drought-prone, RM ≤105 in northeast Iowa?" without rummaging through individual brand sites.
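
The numeric part of such a question reduces to filters over per-variety records. The sketch below is purely illustrative: the `Variety` fields, the `shortlist` helper, and the placeholder records are assumptions, not the server's schema or real catalog data, and soil type and region matching are left out.

```python
from dataclasses import dataclass


@dataclass
class Variety:
    name: str
    crop: str
    relative_maturity: float  # RM in days for corn
    drought_rating: int       # 1-9, 9 = best (hypothetical field)


def shortlist(varieties, crop: str, max_rm: float, min_drought: int):
    """Keep varieties matching crop and RM cap, best drought rating first."""
    hits = [
        v for v in varieties
        if v.crop == crop
        and v.relative_maturity <= max_rm
        and v.drought_rating >= min_drought
    ]
    return sorted(hits, key=lambda v: v.drought_rating, reverse=True)


# Placeholder records, not real products.
catalog = [
    Variety("PLACEHOLDER-A", "corn", 102, 7),
    Variety("PLACEHOLDER-B", "corn", 105, 8),
    Variety("PLACEHOLDER-C", "corn", 110, 9),
    Variety("PLACEHOLDER-D", "soybean", 2.5, 8),
]

picks = shortlist(catalog, crop="corn", max_rm=105, min_drought=7)
print([v.name for v in picks])
```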

Vendor coverage

| Vendor | Verdict | Varieties | Notes |
|---|---|---|---|
| Bayer seeds (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | Same cropscience.bayer.us Next.js infra as crop-chem-docs |
| Golden Harvest (Syngenta) | 🟢 | ~175 | Sitemap + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | Shares PDF fetcher with Golden Harvest |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views, server-rendered |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API (no auth) |
| Beck's products | 🟡 | 860 | Identity-only until SeedIQ XHR sniffed |
| Pioneer (Corteva) | 🔴 | | ToS bans automation; curated fallback lesson instead |

Quick start

git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server (stdio for Claude Desktop dev)
python -m docs_mcp.server --transport stdio

Tools exposed

| Tool | Purpose |
|---|---|
| search_docs | Hybrid + rerank variety search with crop / RM / trait / region filters |
| get_page | Full variety record by (source, source_key) |
| list_versions | Discover crops, brands, traits, RM/MG ranges, wheat classes |
| corpus_status | Counts + freshness; useful for health probes |
| crop_seed_api_lessons | Curated agronomy lessons: Pioneer fallback, disease-scale normalization, regional placement heuristics |

Build phases

This is a clone of docs-mcp-template. The 13 phases in PLAN.md apply:

| Phase | Status |
|---|---|
| 0: scaffold | done |
| 1: first scraper (bayer_seeds) | next |
| 2: chunk + index | pending |
| 3: baseline MCP tools | template defaults |
| 4-5: Dockerfile + CI | done (placeholders filled) |
| 6: reranker | shares llama-rerank sidecar with crop-chem-docs |
| 7: eval harness | pending (curate ~25 queries) |
| 8: hybrid search | done (template) |
| 9: diff_versions, list_cluster | optional |
| 11: crop_seed_api_lessons curated layer | pending |

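See CLAUDE.md for the canonical sidecar schema and the disease-scale-normalization gotcha. Recon described Golden Harvest's scale as reversed ("9-to-1"), but that referred to chart-axis direction: its ratings are 1-9 with 9 = best, the same as Bayer.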

Infrastructure

  • Registry: git.jpaul.io/justin/seed-mcp:latest (Watchtower) / :corpus-YYYY.MM.DD (production pin)
  • Embedder: shared Ollama pool with crop-chem-docs (Gitea-host GPUs + Windows Ollama; CI never hits trashpanda's production Ollama)
  • Reranker: shared llama-rerank sidecar on trashpanda's Tesla P4 (one container, both MCPs use it)
  • PRODUCT_NAME: crop_seed (not seed_mcp — used in Chroma collection, BM25 db filename, and crop_seed_api_lessons tool)
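
As a small illustration of the two tag styles above, a production pin could be derived from a build date like this. The helper names are hypothetical, and the zero-padded YYYY.MM.DD rendering and the use of the build date are assumptions.

```python
from datetime import date

IMAGE = "git.jpaul.io/justin/seed-mcp"  # registry path from the list above


def corpus_tag(build_date: date) -> str:
    """Immutable tag for pinning production to one corpus build."""
    return f"{IMAGE}:corpus-{build_date:%Y.%m.%d}"


def latest_tag() -> str:
    """Floating tag that Watchtower tracks."""
    return f"{IMAGE}:latest"


print(corpus_tag(date(2026, 5, 25)))
print(latest_tag())
```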