Four independent regional brands across IA/IN/IL (variety-identity sources,
each parsed into structured characteristics_groups so ratings embed):
- latham (264: 155 corn + 109 soy) — Latham Hi-Tech Seeds, Alexander IA.
WordPress REST enum (/wp-json/wp/v2/varieties) + /products/<slug>/ detail
HTML. Scale 1-9 LOWER=better (reversed, like NK/AgriPro).
- stine (217: 58 corn + 159 soy) — Stine Seed, Adel IA (largest US
independent). sitemap enum + /{crop}/traits/<slug>/<code>/ detail HTML.
Corn 1-9 (9=best); soy qualitative.
- first_choice (78: 52 corn + 22 soy + 4 wheat) — 1st Choice Seeds,
Rushville IN (employee-owned). Per-crop sitemap -> detail HTML. Scale
0-10 higher=better. ~40 older corn pages thin at source; wheat
identity-only.
- burrus (64: 38 corn + 26 soy) — Burrus Seed, Arenzville IL. Seedware
JSON API. Scale 1-10 (10=best). Brands Burrus/Power Plus/DONMARIO.
robots ai-train=no + named-bot blocks; operator opted in, scraper uses a
non-blacklisted UA + honors Crawl-delay 10.
All 623 validated through rag.chunk.chunks_from_variety (0 errors; 6
identity-only pages from source gaps). No chunk.py change needed (identity
sources auto-route to chunks_from_variety).
Docs:
- sources.json: 4 entries + Hoegemeyer added to _excluded_sources. The
Corteva ToU (shared across pioneer.com / hoegemeyer.com / therightseed.com
/ corteva.com + the Vylor spinoff) bans scrapers + competitive use, so the
whole Corteva family is one excluded ToU domain.
- docs_mcp/lessons.md: rating-scales updated with all 4 directions +
an explicit cross-vendor warning (Latham 1=best vs Stine/Burrus higher=best
— never compare raw numbers without _scale_direction).
- README + CLAUDE corpus inventory: now 2,268 variety + 6,787 trial records.
CI rebuilds the index from the committed corpus.
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Purpose
seed-mcp is an MCP server over the public catalogs of major US
row-crop seed vendors (corn / soybeans / wheat). It is the sibling
project to crop-chem-docs
— same MCP-template lineage, same Drawbar consumer (the farm
advisor AI), but the corpus is seed/hybrid varieties rather than
pesticide labels.
The MCP exposes per-variety records with agronomic ratings, disease tolerance, trait stack, maturity, and regional notes — so the advisor can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105 in northeast Iowa?" without rummaging through individual brand sites.
PRODUCT_NAME for this build: crop_seed (lowercase, underscore;
ends up in the MCP server name, Chroma collection, BM25 db filename,
and the crop_seed_api_lessons tool).
Vendor scope
| Vendor | Verdict | Varieties | Source pattern |
|---|---|---|---|
| Bayer (DEKALB + Channel + Asgrow + WestBred + Deltapine) | 🟢 | 931 | cropscience.bayer.us Next.js __NEXT_DATA__ (same infra as crop-chem-docs) |
| Latham Hi-Tech Seeds (independent, IA) | 🟢 | 264 | WordPress REST enum (/wp-json/wp/v2/varieties) + /products/<slug>/ detail HTML. Scale 1-9 LOWER=better (reversed) |
| Stine Seed (independent, IA — largest US) | 🟢 | 217 | custom PHP; sitemap.xml enum + /{crop}/traits/<slug>/<code>/ detail HTML. Corn 1-9 (9=best); soy qualitative |
| LG Seeds (AgReliant) | 🟢 | 170 | lgseeds.com JSON XHR (+ lg_plot_reports trials) |
| Golden Harvest (Syngenta) | 🟢 | 139 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs (+ gh_plot_reports trials) |
| NK (Syngenta) | 🟢 | 122 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
| ProHarvest Seeds (independent, IL) | 🟢 | 119 | WordPress REST API (/wp/v2/seed + /seed/<slug>/ detail pages) (+ proharvest_plots trials) |
| AgriGold (AgReliant) | 🟢 | 111 | agrigold.com server-rendered HTML (+ agrigold_plot_reports trials) |
| 1st Choice Seeds (independent, IN) | 🟢 | 78 | WordPress (CPTs not in REST); per-crop sitemap → detail HTML. Scale 0-10 higher=better. corn/soy/wheat |
| Burrus Seed (independent, IL) | 🟢 | 64 | Seedware JSON API (burrus25.seedware.net, callback+Referer). Scale 1-10 (10=best). robots ai-train=no — operator opted in |
| Ebbert's Seeds (independent, OH/IN) | 🟢 | 29 | WordPress per-crop catalog pages, verbatim body |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at mc8v24rf.api.sanity.io (no auth) |
| Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
| Pioneer + Hoegemeyer + Brevant (Corteva) | 🔴 | — | DROP. Shared corteva.com ToU bans automation (scrapers + "competitive service"). Treat ALL *.corteva.com / pioneer.com / hoegemeyer.com / therightseed.com + Vylor brands as one excluded ToU domain |
Trial-only sources (cross-vendor yield plots, data_type=trial): gh_plot_reports, lg_plot_reports, agrigold_plot_reports, proharvest_plots, agripro_trials. See the README corpus table for counts.
Scale-direction warning (read before any cross-vendor numeric comparison): the independents do NOT agree on direction. Bayer + Stine (corn) + ProHarvest (disease) + Burrus = HIGHER is better (Burrus 1-10, others 1-9). Latham + NK + AgriPro = LOWER is better (1 = best). 1st Choice = 0-10 higher=better. Stine soy is qualitative. Always consult each record's _scale_direction (the chunker attaches it) before comparing numbers across brands.
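As a rough illustration of why raw numbers cannot be compared directly, the sketch below maps ratings from three of the sources onto a common 0.0-1.0 range where 1.0 is always best. The `SCALES` table, the direction strings, and both function names are assumptions made for this example; they are not the chunker's actual `_scale_direction` values or API.

```python
# Assumed per-source scale metadata, copied from the warning above.
# The (low, high, direction) tuple layout is illustrative only.
SCALES = {
    "latham": (1, 9, "lower_is_better"),
    "burrus": (1, 10, "higher_is_better"),
    "first_choice": (0, 10, "higher_is_better"),
}


def to_unit_score(raw: float, low: float, high: float, direction: str) -> float:
    """Map a raw rating onto 0.0-1.0, where 1.0 is always the best score."""
    if not low <= raw <= high:
        raise ValueError(f"rating {raw} is outside the {low}-{high} scale")
    fraction = (raw - low) / (high - low)
    if direction == "lower_is_better":
        return 1.0 - fraction
    if direction == "higher_is_better":
        return fraction
    raise ValueError(f"unknown scale direction: {direction!r}")


def comparable(source: str, raw: float) -> float:
    """Look up the source's scale and return a direction-neutral score."""
    low, high, direction = SCALES[source]
    return to_unit_score(raw, low, high, direction)
```

Under these assumptions a Latham 1 and a Burrus 10 both come out as 1.0, even though the raw numbers sit at opposite ends of their scales.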
Build priority order (shared-infra first → biggest yield):
1. bayer_seeds: lift-and-shift from crop-chem-docs' Bayer scraper
2. golden_harvest: biggest unique Syngenta brand
3. nk: reuses Golden Harvest's PDF fetcher
4. agripro: only wheat coverage in the corpus
5. becks_pfr: research goldmine, public Sanity GROQ
6. becks_products: identity-only, deferred until SeedIQ XHR known
Pioneer fallback
Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer. The MCP's lessons layer contains a Pioneer-fallback entry: when the LLM detects a Pioneer / P-series query, it should reply:
"Pioneer does not allow AI or other automation techniques to scrape and index their data. For Pioneer brand seed information, reach out to a local dealer directly via pioneer.com."
Pioneer's dealer locator is login-gated — there is no public API to surface dealer contact info, so the lesson stays a plain link.
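The lesson itself is an instruction to the LLM, but a deterministic pre-check could look roughly like the sketch below. The regular expression is an assumption about how Pioneer / P-series queries tend to be written ("P" followed by four digits and an optional letter suffix), and `pioneer_fallback` is a hypothetical helper, not part of the lessons layer.

```python
import re
from typing import Optional

# Reply text quoted from the Pioneer-fallback lesson above.
PIONEER_REPLY = (
    "Pioneer does not allow AI or other automation techniques to scrape and "
    "index their data. For Pioneer brand seed information, reach out to a "
    "local dealer directly via pioneer.com."
)

# Assumption: the brand name, or "P" + 4 digits + up to 4 suffix letters.
_PIONEER_PATTERN = re.compile(r"\bpioneer\b|\bP\d{4}[A-Z]{0,4}\b", re.IGNORECASE)


def pioneer_fallback(query: str) -> Optional[str]:
    """Return the canned reply for Pioneer-looking queries, else None."""
    if _PIONEER_PATTERN.search(query):
        return PIONEER_REPLY
    return None
```

A pattern like this will miss unusual phrasings and may flag unrelated product codes, so it would only complement the lesson, not replace it.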
Schema notes per crop
- Corn: RM (relative maturity days), trait stack (SmartStax, VT Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB / Goss's / Anthracnose / Tar Spot ratings, standability, drought tolerance, ear flex, grain-vs-silage flag.
- Soy: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 / LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora (race + Rps gene) / frogeye / brown stem rot ratings, IDC tolerance (critical for upper Midwest), branching habit.
- Wheat: Class (HRW / HRS / SRW / SWW / SWS / durum), heading (early / medium / late), stripe rust / leaf rust / stem rust / FHB (scab) / Septoria / tan spot ratings, test weight, protein, falling number, straw strength, CoAXium trait flag.
Disease scale gotcha: Golden Harvest publishes ratings on a 9-to-1 scale (9 = best, 1 = worst) — the REVERSE of the typical 1-9 convention used by Bayer/NK/AgriPro. Normalize at chunk time so the corpus has a single direction; document it in a chunk_0 preamble.
Canonical sidecar schema (per variety)
{
"source": "bayer_seeds",
"source_key": "dekalb-dkc62-08rib",
"vendor": "Bayer",
"brand": "DEKALB",
"product_name": "DKC62-08RIB",
"crop": "corn",
"relative_maturity": 112,
"maturity_group": null,
"wheat_class": null,
"trait_stack": ["SmartStax", "RIB"],
"agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
"disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
"regional_recommendation": ["IA-N", "MN-S", "WI-W"],
"source_urls": ["https://cropscience.bayer.us/..."],
"fetched_at": "2026-05-25T12:34:56Z"
}
maturity_group is for soy, relative_maturity is for corn,
wheat_class is for wheat. Use null for fields that don't apply.
Disease/agronomic rating direction is normalized 1-9 (9 = best)
post-scrape — original direction noted in chunk_0 if the source
publishes differently.
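A minimal sketch of checking a sidecar against the rules above: maturity fields that belong to other crops must be null, and numeric ratings must fall in the normalized 1-9 range. `sidecar_problems` is a hypothetical helper, and the crop value "soy" is assumed (the sample only shows "corn").

```python
from typing import Any, Dict, List

# Which maturity-style field belongs to which crop, per the note above.
# The crop keys "soy" and "wheat" are assumed spellings.
CROP_FIELD = {
    "corn": "relative_maturity",
    "soy": "maturity_group",
    "wheat": "wheat_class",
}


def sidecar_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the record passes."""
    crop = record.get("crop")
    if crop not in CROP_FIELD:
        return [f"unknown crop: {crop!r}"]
    problems: List[str] = []
    # Fields owned by a different crop must be null.
    for owner, field in CROP_FIELD.items():
        if owner != crop and record.get(field) is not None:
            problems.append(f"{field} must be null for {crop}")
    # Numeric ratings must sit on the normalized 1-9 scale.
    for group in ("agronomic_ratings", "disease_ratings"):
        for name, rating in (record.get(group) or {}).items():
            if isinstance(rating, bool) or not isinstance(rating, (int, float)):
                continue  # qualitative ratings are skipped in this sketch
            if not 1 <= rating <= 9:
                problems.append(f"{group}.{name}={rating} is outside 1-9")
    return problems
```

The check deliberately does not require the crop's own maturity field to be set, since identity-only pages may lack it.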
Working with this repo
Identifying the current phase
This is a clone of the docs-mcp-template; phases follow the template's PLAN.md.
| Signal | Likely phase |
|---|---|
| corpus/ doesn't exist | Phase 1 (first scraper) |
| corpus/bayer_seeds/ exists, no chroma/ | Phase 2 (indexing) |
| chroma/ exists, no bm25/ | Phase 8 (hybrid search) |
| No eval/results/ | Phase 7 (eval harness) |
| _api_lessons is NotImplementedError | Phase 11 |
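The filesystem signals in the table can be checked mechanically. The sketch below walks the rows top to bottom and returns the first match; `likely_phase` is a hypothetical helper, the top-to-bottom ordering is an assumption, and the `_api_lessons` row is left out because it cannot be detected from directory names alone.

```python
from pathlib import Path


def likely_phase(repo_root: Path) -> str:
    """Return the first phase whose filesystem signal matches the repo."""
    root = Path(repo_root)
    if not (root / "corpus").is_dir():
        return "Phase 1 (first scraper)"
    if (root / "corpus" / "bayer_seeds").is_dir() and not (root / "chroma").is_dir():
        return "Phase 2 (indexing)"
    if (root / "chroma").is_dir() and not (root / "bm25").is_dir():
        return "Phase 8 (hybrid search)"
    if not (root / "eval" / "results").is_dir():
        return "Phase 7 (eval harness)"
    # Remaining signals (such as _api_lessons) need a code-level check.
    return "no filesystem signal matched"
```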
Layout
.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json              # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│   └── docker-compose.yml
├── .gitea/workflows/
│   ├── refresh.yml           # Monthly cron: scrape + index + image
│   └── image-only.yml        # On-demand: code-only ship cycle
├── scrape/
│   ├── runner.py             # `python -m scrape.runner --source bayer_seeds`
│   ├── changelog.py
│   └── sources/
│       ├── bayer_seeds.py
│       ├── golden_harvest.py
│       ├── nk.py
│       ├── agripro.py
│       ├── becks_pfr.py
│       └── becks_products.py
├── rag/                      # chunk + embed + Chroma + BM25
├── docs_mcp/                 # FastMCP server + lessons.md
├── eval/                     # Golden-query harness
└── scripts/                  # registry_gc.py, usage_report.py
Conventions
- Vendor sub-corpora: each scraper writes corpus/<source>/<source_key>.{md,json}. The .md file is the LLM-visible text (chunk_0 preamble + body); the .json file is the sidecar metadata.
- Tool docstrings are user interface: the LLM uses them to decide whether to call. Treat like button labels.
- Defensive fallback for retrieval — reranker/BM25/external deps must catch their specific exception and degrade to baseline. The MCP is in front of farmers making real seed-buying decisions.
- Verify retrieval changes with eval/ — ship a retrieval change with numbers in the commit message.
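The defensive-fallback convention can be sketched as follows. `RerankerUnavailable` and `rerank_or_baseline` are hypothetical names for illustration, and the reranker is passed in as a plain callable rather than modeled on the real client.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


class RerankerUnavailable(Exception):
    """Hypothetical error a reranker client raises on timeouts or HTTP failures."""


def rerank_or_baseline(
    query: str,
    candidates: Sequence[str],
    rerank: Callable[[str, Sequence[str]], List[str]],
) -> List[str]:
    """Return reranked candidates, or the baseline order if the reranker fails."""
    try:
        return rerank(query, candidates)
    except RerankerUnavailable as exc:
        # Catch only the dependency's own exception so real bugs still surface.
        logger.warning("reranker degraded to baseline order: %s", exc)
        return list(candidates)
```

Catching the specific exception, rather than a bare `Exception`, keeps programming errors visible while still returning usable results when the external dependency is down.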
Standard infrastructure choices
- Embedding: nomic-embed-text via Ollama (768-dim)
- Reranker: jina-reranker-v2-base GGUF via llama.cpp /v1/rerank (shared llama-rerank sidecar with crop-chem-docs on trashpanda Tesla P4)
- Vector store: Chroma PersistentClient
- Lexical store: SQLite FTS5
- Fusion: RRF k=60
- Transport: streamable-HTTP in prod, stdio for local dev
- MCP framework: FastMCP with stateless_http=True
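For reference, reciprocal rank fusion with k=60 scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic version of that formula; `rrf_fuse` and the alphabetical tie-break are assumptions for this example, not the code in rag/.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # e.g. results from the vector store
lexical_hits = ["b", "d", "a"]  # e.g. results from the lexical store
fused = rrf_fuse([vector_hits, lexical_hits])
```

Here `fused` is `["b", "a", "d", "c"]`: "b" and "a" appear in both lists and outrank the single-list hits, with "b" ahead because its ranks (2 and 1) beat those of "a" (1 and 3).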
Image name and package linking are repo-name-derived
IMAGE and --package derive from the repo at runtime via
${{ github.repository_owner }} / ${{ github.event.repository.name }}.
The only workflow placeholders customized per clone are
REGISTRY_PUSH=192.168.0.2:1234, REGISTRY_PULL=git.jpaul.io,
and the OLLAMA_URL embed pool.
Common commands
# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Run one scraper
python -m scrape.runner --source bayer_seeds --force
# Rebuild indexes
python -m rag.index --rebuild
# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000
# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
Gotchas
- fetch-depth: 0 on actions/checkout@v4 in both workflows.
- Reranker per-pair token limit: jina-reranker GGUF rejects the ENTIRE batch if any doc exceeds n_ctx_train=1024. Truncate reranked docs to ~2000 chars.
- FastMCP stateless_http=True: critical for prod.
- Runner shell is /bin/sh (dash) in CI, so no ${VAR::N}.
- Cloudflare 100 MB body cap: push via LAN endpoint 192.168.0.2:1234, pull via git.jpaul.io.
- Golden Harvest disease scale is reversed (9 = best); normalize at chunk time.
- Sitemap-listed PDF dates on Golden Harvest are stale — resolve the live PDF URL from the product HTML page.
- No IPv6 — DNS for git.jpaul.io returns IPv6-only. Clone via HTTPS, not SSH (port 22 returns Network unreachable).
- Pioneer is OFF-LIMITS: do NOT add a pioneer.py scraper.
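The reranker gotcha above (one oversized document rejects the whole batch) suggests trimming every candidate before the call. A minimal sketch, assuming a character budget is an acceptable proxy for the 1024-token limit; `truncate_for_rerank` is a hypothetical helper.

```python
from typing import List, Sequence

MAX_RERANK_CHARS = 2000  # the "~2000 chars" budget from the gotcha list


def truncate_for_rerank(doc: str, limit: int = MAX_RERANK_CHARS) -> str:
    """Trim a document to the budget, preferring a word boundary."""
    if len(doc) <= limit:
        return doc
    # Look for the last space at or before the limit to avoid splitting a word.
    cut = doc.rfind(" ", 0, limit + 1)
    return doc[:cut] if cut > 0 else doc[:limit]


def truncate_batch(docs: Sequence[str], limit: int = MAX_RERANK_CHARS) -> List[str]:
    """Apply the same budget to every document in a rerank batch."""
    return [truncate_for_rerank(doc, limit) for doc in docs]
```

Characters only approximate tokens, so dense text such as tables of ratings may still need a smaller budget.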
Out-of-scope concerns
- Reverse proxy / TLS — Drawbar's compose handles it
- MetaMCP — separate aggregator
- GPU container orchestration: shared llama-rerank sidecar