claude 54094a0d43 Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data — land-grant programs that test every
entered brand side-by-side with replication + LSD stats. This is the legitimate
way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't
scrape directly (data_type=trial, results[] shape; falls through the trial
chunker).

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
  (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
  corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
  (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
  (pdfplumber; per-site column groups split by header Yield-token count +
  x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university
  sources route through it so the region/district is in the embedded chunk +
  labeled "variety trial (cross-vendor, independent third-party)". Existing plot
  sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
  trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
  indirectly via these trials).

Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0
out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in
each tos_note. CI rebuilds the index from the committed corpus.

2026-06-10 08:35:50 -04:00

12 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Purpose

seed-mcp is an MCP server over the public catalogs of major US row-crop seed vendors (corn / soybeans / wheat). It is the sibling project to crop-chem-docs — same MCP-template lineage, same Drawbar consumer (the farm advisor AI), but the corpus is seed/hybrid varieties rather than pesticide labels.

The MCP exposes per-variety records with agronomic ratings, disease tolerance, trait stack, maturity, and regional notes — so the advisor can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105 in northeast Iowa?" without rummaging through individual brand sites.

PRODUCT_NAME for this build: crop_seed (lowercase, underscore; ends up in the MCP server name, Chroma collection, BM25 db filename, and the crop_seed_api_lessons tool).

Vendor scope

Vendor	Verdict	Varieties	Source pattern
Bayer (DEKALB + Channel + Asgrow + WestBred + Deltapine)	🟢	931	`cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs)
LG Seeds (AgReliant)	🟢	170	`lgseeds.com` JSON XHR (+ `lg_plot_reports` trials)
Golden Harvest (Syngenta)	🟢	139	sitemap.xml + server-rendered HTML + Syngenta CDN PDFs (+ `gh_plot_reports` trials)
NK (Syngenta)	🟢	122	static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest)
Latham Hi-Tech Seeds (independent, IA)	🟢	264	WordPress REST enum (`/wp-json/wp/v2/varieties`) + `/products/<slug>/` detail HTML. Scale 1-9 LOWER=better (reversed)
Stine Seed (independent, IA — largest US)	🟢	217	custom PHP; `sitemap.xml` enum + `/{crop}/traits/<slug>/<code>/` detail HTML. Corn 1-9 (9=best); soy qualitative
RobSeeCo (independent, NE)	🟢	130	PDF-extraction of the 2026 Seed Guide (Squarespace; no web catalog). Rob-See-Co + Innotech corn/soy. Scale 1-9 (9=best). Pages duplicated → dedup
ProHarvest Seeds (independent, IL)	🟢	119	WordPress REST API (`/wp/v2/seed` + `/seed/<slug>/` detail pages) (+ `proharvest_plots` trials)
AgriGold (AgReliant)	🟢	111	`agrigold.com` server-rendered HTML (+ `agrigold_plot_reports` trials)
1st Choice Seeds (independent, IN)	🟢	78	WordPress (CPTs not in REST); per-crop sitemap → detail HTML. Scale 0-10 higher=better. corn/soy/wheat
Burrus Seed (independent, IL)	🟢	64	Seedware JSON API (`burrus25.seedware.net`, callback+Referer). Scale 1-10 (10=best). robots `ai-train=no` — operator opted in
Ebbert's Seeds (independent, OH/IN)	🟢	29	WordPress per-crop catalog pages, verbatim body
AgriPro (Syngenta wheat)	🟢	24	Drupal Views form, server-rendered HTML
Beck's PFR	🟡	2,089	Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth)
Beck's products	🟡	860	Same Sanity API — identity-only until SeedIQ XHR is sniffed
Pioneer + Hoegemeyer + Brevant (Corteva)	🔴	—	DROP. Shared corteva.com ToU bans automation (scrapers + "competitive service"). Treat ALL `*.corteva.com / pioneer.com / hoegemeyer.com / therightseed.com` + Vylor brands as one excluded ToU domain

Trial-only sources (cross-vendor yield, data_type=trial): vendor plot reports gh_plot_reports, lg_plot_reports, agrigold_plot_reports, proharvest_plots, agripro_trials; university-extension variety trials illinois_vt_trials (IL, +wheat), iowa_icpt_trials (IA), ohio_ocpt_trials (OH). The university trials are independent third-party data that rank the majors side-by-side, including brands whose performance data we can't scrape directly (Pioneer/DEKALB/Brevant). The university sources route through _render_gh_plot_chunk(include_region=True) so the region/district is in the embedded chunk. See the README corpus table for counts.
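
A minimal sketch of what gating the region into the embedded text can look like. The real `_render_gh_plot_chunk` signature and the trial record shape are not shown in this file, so `render_trial_chunk` and the `region`, `year`, `product`, and `yield_bu_ac` keys are hypothetical stand-ins, and the sample values are made up.

```python
def render_trial_chunk(trial, include_region=False):
    """Render one trial doc as embeddable text (hypothetical stand-in)."""
    lines = [f"{trial['source']} {trial['crop']} {trial['year']}"]
    if include_region and trial.get("region"):
        # University sources: put the region/district into the chunk text
        # and label the data as independent.
        lines.append(f"Region: {trial['region']}")
        lines.append("variety trial (cross-vendor, independent third-party)")
    for row in trial.get("results", []):
        lines.append(f"{row['brand']} {row['product']}: {row['yield_bu_ac']} bu/ac")
    return "\n".join(lines)


# Made-up record for illustration only; not real trial data.
example = {
    "source": "iowa_icpt_trials",
    "crop": "corn",
    "year": 2025,
    "region": "District 1",
    "results": [{"brand": "ExampleBrand", "product": "EX-100", "yield_bu_ac": 200.0}],
}

text_with_region = render_trial_chunk(example, include_region=True)
text_without_region = render_trial_chunk(example)
```

Keeping the flag off by default mirrors the stated intent that existing plot sources stay unchanged.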

Scale-direction warning (read before any cross-vendor numeric comparison): the independents do NOT agree on direction. Bayer + Stine(corn) + ProHarvest(disease) + Burrus = HIGHER is better (Burrus 1-10, others 1-9). Latham + NK + AgriPro = LOWER is better (1 = best). 1st Choice = 0-10 higher=better. Stine soy is qualitative. Always consult each record's _scale_direction (the chunker attaches it) before comparing numbers across brands.
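
A minimal sketch of putting two ratings on a common footing before comparing them. The string values `"higher_is_better"` and `"lower_is_better"` are assumptions about what `_scale_direction` holds, and `comparable_score` is a hypothetical helper. It also assumes each scale is linear, which vendors do not guarantee, so treat the result as a rough guide. Qualitative scales (Stine soy) are out of scope here.

```python
def comparable_score(value, lo, hi, direction):
    """Rescale a rating to 0..1 where 1.0 is always the best end of the scale."""
    if not lo <= value <= hi:
        raise ValueError(f"rating {value} outside {lo}-{hi}")
    fraction = (value - lo) / (hi - lo)
    if direction == "higher_is_better":
        return fraction
    if direction == "lower_is_better":
        # Flip so that the vendor's best value (the low end) maps to 1.0.
        return 1.0 - fraction
    raise ValueError(f"unknown scale direction: {direction!r}")


latham_best = comparable_score(1, 1, 9, "lower_is_better")          # 1 is best
burrus_best = comparable_score(10, 1, 10, "higher_is_better")       # 10 is best
first_choice_mid = comparable_score(5, 0, 10, "higher_is_better")   # midpoint
```

Rescaling to 0..1 rather than only flipping matters because the ranges differ (1-9, 1-10, 0-10).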

Build priority order (shared-infra first → biggest yield):

1. bayer_seeds: lift-and-shift from crop-chem-docs' Bayer scraper
2. golden_harvest: biggest unique Syngenta brand
3. nk: reuses Golden Harvest's PDF fetcher
4. agripro: only wheat coverage in the corpus
5. becks_pfr: research goldmine, public Sanity GROQ
6. becks_products: identity-only, deferred until SeedIQ XHR known

Pioneer fallback

Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer. The MCP's lessons layer contains a Pioneer-fallback entry: when the LLM detects a Pioneer / P-series query, it should reply:

"Pioneer does not allow AI or other automation techniques to scrape and index their data. For Pioneer brand seed information, reach out to a local dealer directly via pioneer.com."

Pioneer's dealer locator is login-gated — there is no public API to surface dealer contact info, so the lesson stays a plain link.
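
A rough sketch of the fallback as a plain function, useful mainly for testing the wording. In the MCP the detection is done by the LLM via the lessons layer, not by a regex; the pattern below (the word "Pioneer", or a code shaped like `P` plus four digits and an optional letter suffix) is an assumption about what a P-series query looks like, and `pioneer_fallback` is a hypothetical name.

```python
import re
from typing import Optional

PIONEER_FALLBACK = (
    "Pioneer does not allow AI or other automation techniques to scrape and "
    "index their data. For Pioneer brand seed information, reach out to a "
    "local dealer directly via pioneer.com."
)

# Assumed trigger: brand name, or a product code like P1185AM.
_PIONEER_QUERY = re.compile(r"\bpioneer\b|\bP\d{4}[A-Z]*\b", re.IGNORECASE)


def pioneer_fallback(query: str) -> Optional[str]:
    """Return the canned reply for Pioneer-looking queries, else None."""
    return PIONEER_FALLBACK if _PIONEER_QUERY.search(query) else None
```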

Schema notes per crop

Corn: RM (relative maturity days), trait stack (SmartStax, VT Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB / Goss's / Anthracnose / Tar Spot ratings, standability, drought tolerance, ear flex, grain-vs-silage flag.
Soy: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 / LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora (race + Rps gene) / frogeye / brown stem rot ratings, IDC tolerance (critical for upper Midwest), branching habit.
Wheat: Class (HRW / HRS / SRW / SWW / SWS / durum), heading (early / medium / late), stripe rust / leaf rust / stem rust / FHB (scab) / Septoria / tan spot ratings, test weight, protein, falling number, straw strength, CoAXium trait flag.

Disease scale gotcha: Golden Harvest publishes ratings on a 9-to-1 scale (9 = best, 1 = worst), the REVERSE of the 1 = best convention used by NK/AgriPro. Normalize at chunk time so the corpus has a single direction; document it in a chunk_0 preamble.

Canonical sidecar schema (per variety)

{
  "source": "bayer_seeds",
  "source_key": "dekalb-dkc62-08rib",
  "vendor": "Bayer",
  "brand": "DEKALB",
  "product_name": "DKC62-08RIB",
  "crop": "corn",
  "relative_maturity": 112,
  "maturity_group": null,
  "wheat_class": null,
  "trait_stack": ["SmartStax", "RIB"],
  "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
  "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
  "regional_recommendation": ["IA-N", "MN-S", "WI-W"],
  "source_urls": ["https://cropscience.bayer.us/..."],
  "fetched_at": "2026-05-25T12:34:56Z"
}

maturity_group is for soy, relative_maturity is for corn, wheat_class is for wheat. Use null for fields that don't apply. Disease/agronomic rating direction is normalized 1-9 (9 = best) post-scrape — original direction noted in chunk_0 if the source publishes differently.
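
The per-crop null rule above can be checked mechanically. This is a sketch, not the repo's validator: `sidecar_problems` is a hypothetical helper, and the crop values `"soy"` and `"wheat"` are assumed by analogy with the `"corn"` value in the example sidecar.

```python
# Which maturity-style field each crop is expected to populate.
MATURITY_FIELD = {
    "corn": "relative_maturity",
    "soy": "maturity_group",
    "wheat": "wheat_class",
}


def sidecar_problems(sidecar):
    """Return a list of rule violations; an empty list means the sidecar passes."""
    crop = sidecar.get("crop")
    if crop not in MATURITY_FIELD:
        return [f"unknown crop: {crop!r}"]
    problems = []
    for field_crop, field in MATURITY_FIELD.items():
        value = sidecar.get(field)
        if field_crop == crop and value is None:
            problems.append(f"{field} must be set for {crop}")
        elif field_crop != crop and value is not None:
            problems.append(f"{field} must be null for {crop}")
    return problems


corn_ok = sidecar_problems(
    {"crop": "corn", "relative_maturity": 112, "maturity_group": None, "wheat_class": None}
)
soy_bad = sidecar_problems(
    {"crop": "soy", "relative_maturity": 112, "maturity_group": None, "wheat_class": None}
)
```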

Working with this repo

Identifying the current phase

This is a clone of the docs-mcp-template; phases follow the template's PLAN.md.

Signal	Likely phase
`corpus/` doesn't exist	Phase 1 (first scraper)
`corpus/bayer_seeds/` exists, no `chroma/`	Phase 2 (indexing)
`chroma/` exists, no `bm25/`	Phase 8 (hybrid search)
No `eval/results/`	Phase 7 (eval harness)
`_api_lessons` is `NotImplementedError`	Phase 11
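
The filesystem signals in the table can be checked with a few lines of `pathlib`. This sketch assumes the rows are meant to be evaluated top to bottom; `likely_phase` is a hypothetical helper, and the `_api_lessons` row is left as a manual check because it is not a filesystem signal.

```python
import tempfile
from pathlib import Path


def likely_phase(repo: Path) -> str:
    """Map the directory signals from the table to a likely phase."""
    if not (repo / "corpus").exists():
        return "Phase 1 (first scraper)"
    if (repo / "corpus" / "bayer_seeds").exists() and not (repo / "chroma").exists():
        return "Phase 2 (indexing)"
    if (repo / "chroma").exists() and not (repo / "bm25").exists():
        return "Phase 8 (hybrid search)"
    if not (repo / "eval" / "results").exists():
        return "Phase 7 (eval harness)"
    return "unknown: check whether _api_lessons still raises NotImplementedError"


# Demo against a throwaway directory rather than the real checkout.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    empty_repo_phase = likely_phase(repo)
    (repo / "corpus" / "bayer_seeds").mkdir(parents=True)
    scraped_repo_phase = likely_phase(repo)
    (repo / "chroma").mkdir()
    indexed_repo_phase = likely_phase(repo)
```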

Layout

.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json                  # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│   └── docker-compose.yml
├── .gitea/workflows/
│   ├── refresh.yml               # Monthly cron: scrape + index + image
│   └── image-only.yml            # On-demand: code-only ship cycle
├── scrape/
│   ├── runner.py                 # `python -m scrape.runner --source bayer_seeds`
│   ├── changelog.py
│   └── sources/
│       ├── bayer_seeds.py
│       ├── golden_harvest.py
│       ├── nk.py
│       ├── agripro.py
│       ├── becks_pfr.py
│       └── becks_products.py
├── rag/                          # chunk + embed + Chroma + BM25
├── docs_mcp/                     # FastMCP server + lessons.md
├── eval/                         # Golden-query harness
└── scripts/                      # registry_gc.py, usage_report.py

Conventions

Vendor sub-corpora: each scraper writes corpus/<source>/<source_key>.{md,json}. .md is the LLM-visible text (chunk_0 preamble + body); .json is the sidecar metadata.
Tool docstrings are user interface — the LLM uses them to decide whether to call. Treat like button labels.
Defensive fallback for retrieval — reranker/BM25/external deps must catch their specific exception and degrade to baseline. The MCP is in front of farmers making real seed-buying decisions.
Verify retrieval changes with eval/ — ship a retrieval change with numbers in the commit message.
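
One way the defensive-fallback convention can look in code. This is a sketch under assumptions: `RerankerUnavailable` is a hypothetical exception standing in for whatever the real rerank client raises, and `rerank_or_baseline` is not a function from this repo.

```python
import logging

log = logging.getLogger("retrieval")


class RerankerUnavailable(Exception):
    """Hypothetical error raised by the rerank client on timeout or HTTP failure."""


def rerank_or_baseline(query, docs, rerank):
    """Return reranked docs, or the baseline order if the reranker fails."""
    try:
        return rerank(query, docs)
    except RerankerUnavailable as exc:
        # Catch only the reranker's own failure; unrelated bugs should still surface.
        log.warning("reranker unavailable, serving baseline order: %s", exc)
        return list(docs)


def broken_rerank(query, docs):
    raise RerankerUnavailable("sidecar down")


baseline = rerank_or_baseline("corn RM 105", ["a", "b"], broken_rerank)
reranked = rerank_or_baseline("corn RM 105", ["a", "b"], lambda q, d: d[::-1])
```

Catching the specific exception instead of a bare `except` keeps real bugs visible while the user still gets baseline results.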

Standard infrastructure choices

Embedding: nomic-embed-text via Ollama (768-dim)
Reranker: jina-reranker-v2-base GGUF via llama.cpp /v1/rerank (shared llama-rerank sidecar with crop-chem-docs on trashpanda Tesla P4)
Vector store: Chroma PersistentClient
Lexical store: SQLite FTS5
Fusion: RRF k=60
Transport: streamable-HTTP in prod, stdio for local dev
MCP framework: FastMCP with stateless_http=True
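
For reference, reciprocal rank fusion with k=60 scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic implementation, not the code in `rag/`; `rrf_fuse` is a hypothetical name and the tie-break on document id is an arbitrary choice to make the output deterministic.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids; ranks are 1-based, higher score is better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, then by id so ties are stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]    # e.g. from the vector store
lexical_hits = ["b", "d", "a"]   # e.g. from the lexical store
fused = rrf_fuse([vector_hits, lexical_hits])
```

Here `b` wins because it ranks high in both lists, while `c` and `d` each appear in only one.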

Image name and package linking are repo-name-derived

IMAGE and --package derive from the repo at runtime via ${{ github.repository_owner }} / ${{ github.event.repository.name }}. The only workflow placeholders customized per clone are REGISTRY_PUSH=192.168.0.2:1234, REGISTRY_PULL=git.jpaul.io, and the OLLAMA_URL embed pool.

Common commands

# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000

# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md

Gotchas

fetch-depth: 0 on actions/checkout@v4 in both workflows.
Reranker per-pair token limit: jina-reranker GGUF rejects the ENTIRE batch if any doc exceeds n_ctx_train=1024. Truncate reranked docs to ~2000 chars.
FastMCP stateless_http=True: critical for prod.
Runner shell is /bin/sh (dash) in CI — no ${VAR::N}.
Cloudflare 100 MB body cap: push via LAN endpoint 192.168.0.2:1234, pull via git.jpaul.io.
Golden Harvest disease scale is reversed (9 = best) — normalize at chunk time.
Sitemap-listed PDF dates on Golden Harvest are stale — resolve the live PDF URL from the product HTML page.
No IPv6 — DNS for git.jpaul.io returns IPv6-only. Clone via HTTPS, not SSH (port 22 returns Network unreachable).
Pioneer is OFF-LIMITS — do NOT add a pioneer.py scraper.
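
For the reranker token-limit gotcha above, a minimal sketch of the truncation step. The ~2000-character cap comes from that gotcha and is a character proxy for the token limit, not an exact bound; `truncate_for_rerank` is a hypothetical helper.

```python
MAX_RERANK_CHARS = 2000  # character proxy for the reranker's per-doc token limit


def truncate_for_rerank(docs, max_chars=MAX_RERANK_CHARS):
    """Cap every doc so a single long chunk cannot sink the whole batch."""
    return [doc[:max_chars] for doc in docs]


docs = ["short doc", "x" * 5000]
safe_docs = truncate_for_rerank(docs)
```

The output keeps the same order and length as the input, so rerank scores can be mapped back to the untruncated chunks by index.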

Out-of-scope concerns

Reverse proxy / TLS — Drawbar's compose handles it
MetaMCP — separate aggregator
GPU container orchestration — shared llama-rerank sidecar
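University extension trial data beyond Illinois VT, Iowa ICPT, and Ohio OCPT (those three are already in the corpus as trial-only sources): deferred to v1.5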
