claude a54fac240f
Add university-extension trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 cross-vendor trial docs) (#19)
Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
2026-06-10 08:36:19 -04:00


# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when
working with code in this repository.
## Purpose
`seed-mcp` is an MCP server over the **public catalogs of major US
row-crop seed vendors** (corn / soybeans / wheat). It is the sibling
project to [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
— same MCP-template lineage, same Drawbar consumer (the farm
advisor AI), but the corpus is **seed/hybrid varieties** rather than
pesticide labels.
The MCP exposes per-variety records with agronomic ratings, disease
tolerance, trait stack, maturity, and regional notes — so the advisor
can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105
in northeast Iowa?" without rummaging through individual brand sites.
PRODUCT_NAME for this build: **`crop_seed`** (lowercase, underscore;
ends up in the MCP server name, Chroma collection, BM25 db filename,
and the `crop_seed_api_lessons` tool).
## Vendor scope
| Vendor | Verdict | Varieties | Source pattern |
|---|---|---|---|
| Bayer (DEKALB + Channel + Asgrow + WestBred + Deltapine) | 🟢 | 931 | `cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs) |
| LG Seeds (AgReliant) | 🟢 | 170 | `lgseeds.com` JSON XHR (+ `lg_plot_reports` trials) |
| Golden Harvest (Syngenta) | 🟢 | 139 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs (+ `gh_plot_reports` trials) |
| NK (Syngenta) | 🟢 | 122 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
| **Latham Hi-Tech Seeds** (independent, IA) | 🟢 | **264** | WordPress REST enum (`/wp-json/wp/v2/varieties`) + `/products/<slug>/` detail HTML. Scale 1-9 **LOWER=better** (reversed) |
| **Stine Seed** (independent, IA — largest US) | 🟢 | **217** | custom PHP; `sitemap.xml` enum + `/{crop}/traits/<slug>/<code>/` detail HTML. Corn 1-9 (9=best); soy qualitative |
| **RobSeeCo** (independent, NE) | 🟢 | **130** | **PDF-extraction** of the 2026 Seed Guide (Squarespace; no web catalog). Rob-See-Co + Innotech corn/soy. Scale 1-9 (9=best). Pages duplicated → dedup |
| **ProHarvest Seeds** (independent, IL) | 🟢 | **119** | WordPress REST API (`/wp/v2/seed` + `/seed/<slug>/` detail pages) (+ `proharvest_plots` trials) |
| AgriGold (AgReliant) | 🟢 | 111 | `agrigold.com` server-rendered HTML (+ `agrigold_plot_reports` trials) |
| **1st Choice Seeds** (independent, IN) | 🟢 | **78** | WordPress (CPTs not in REST); per-crop sitemap → detail HTML. Scale 0-10 higher=better. corn/soy/wheat |
| **Burrus Seed** (independent, IL) | 🟢 | **64** | Seedware JSON API (`burrus25.seedware.net`, callback+Referer). Scale 1-10 (10=best). robots `ai-train=no` — operator opted in |
| Ebbert's Seeds (independent, OH/IN) | 🟢 | 29 | WordPress per-crop catalog pages, verbatim body |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth) |
| Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
| Pioneer + Hoegemeyer + Brevant (Corteva) | 🔴 | — | DROP. Shared corteva.com ToU bans automation (scrapers + "competitive service"). Treat ALL `*.corteva.com / pioneer.com / hoegemeyer.com / therightseed.com` + Vylor brands as one excluded ToU domain |
Trial-only sources (cross-vendor yield, `data_type=trial`): vendor plot reports `gh_plot_reports`, `lg_plot_reports`, `agrigold_plot_reports`, `proharvest_plots`, `agripro_trials`; **university-extension variety trials** `illinois_vt_trials` (IL, +wheat), `iowa_icpt_trials` (IA), `ohio_ocpt_trials` (OH) — independent third-party data that ranks the majors we can't catalog directly (Pioneer/DEKALB/Brevant) side-by-side. The university sources route through `_render_gh_plot_chunk(include_region=True)` so the region/district is in the embedded chunk. See the README corpus table for counts.
> **Scale-direction warning (read before any cross-vendor numeric comparison):** the independents do NOT agree on direction. Bayer + Stine(corn) + ProHarvest(disease) + Burrus = HIGHER is better (Burrus 1-10, others 1-9). **Latham + NK + AgriPro = LOWER is better (1 = best).** 1st Choice = 0-10 higher=better. Stine soy is qualitative. Always consult each record's `_scale_direction` (the chunker attaches it) before comparing numbers across brands.
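
The warning above comes down to two adjustments before numbers from different brands can sit side by side: flip the scales where lower is better, and stretch the 0-10 and 1-10 scales onto 1-9. The following is a minimal sketch of that arithmetic, not the repo's chunker. The function name and the `"higher_better"` / `"lower_better"` labels are assumptions; the actual values stored in `_scale_direction` may differ.

```python
from typing import Literal

# Hypothetical labels; the real `_scale_direction` values may be spelled differently.
Direction = Literal["higher_better", "lower_better"]


def normalize_rating(value: float, scale_min: float, scale_max: float,
                     direction: Direction) -> float:
    """Map a vendor rating onto a 1-9 scale where 9 is best."""
    if not scale_min <= value <= scale_max:
        raise ValueError(f"{value} is outside {scale_min}-{scale_max}")
    # Position within the vendor's range: 0.0 at the low end, 1.0 at the high end.
    fraction = (value - scale_min) / (scale_max - scale_min)
    if direction == "lower_better":
        fraction = 1.0 - fraction  # flip so the best rating ends up highest
    return round(1 + 8 * fraction, 2)


# Latham: 1-9 with 1 = best, so a published 2 becomes an 8.
latham = normalize_rating(2, 1, 9, "lower_better")
# Burrus: 1-10 with 10 = best.
burrus = normalize_rating(10, 1, 10, "higher_better")
# 1st Choice: 0-10, higher = better.
first_choice = normalize_rating(5, 0, 10, "higher_better")
```

Qualitative ratings such as Stine soy have no numeric range and would need separate handling.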
**Build priority order** (shared-infra first → biggest yield):
1. `bayer_seeds` — lift-and-shift from crop-chem-docs' Bayer scraper
2. `golden_harvest` — biggest unique Syngenta brand
3. `nk` — reuses Golden Harvest's PDF fetcher
4. `agripro` — only wheat coverage in the corpus
5. `becks_pfr` — research goldmine, public Sanity GROQ
6. `becks_products` — identity-only, deferred until SeedIQ XHR known
### Pioneer fallback
Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer.
The MCP's lessons layer contains a Pioneer-fallback entry: when the
LLM detects a Pioneer / P-series query, it should reply:
> "Pioneer does not allow AI or other automation techniques to
> scrape and index their data. For Pioneer brand seed information,
> reach out to a local dealer directly via
> [pioneer.com](https://www.pioneer.com)."
Pioneer's dealer locator is login-gated — there is no public API
to surface dealer contact info, so the lesson stays a plain link.
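The fallback itself lives in the lessons layer and is applied by the LLM. If a deterministic pre-check were ever wanted alongside it, it might look like the sketch below. The regex is an assumption about how Pioneer products are usually written ("P" plus four digits plus an optional trait suffix), and `pioneer_fallback` is a hypothetical helper, not something in the repo.

```python
import re
from typing import Optional

# Assumption: Pioneer products are typically written as "P" + four digits +
# an optional trait suffix (e.g. "P1185AM"). Illustrative, not exhaustive.
PIONEER_PATTERN = re.compile(r"\bpioneer\b|\bP\d{4}[A-Z]{0,4}\b", re.IGNORECASE)

PIONEER_REPLY = (
    "Pioneer does not allow AI or other automation techniques to scrape and "
    "index their data. For Pioneer brand seed information, reach out to a "
    "local dealer directly via [pioneer.com](https://www.pioneer.com)."
)


def pioneer_fallback(query: str) -> Optional[str]:
    """Return the canned reply if the query looks Pioneer-related, else None."""
    return PIONEER_REPLY if PIONEER_PATTERN.search(query) else None
```

A caller would return the reply as-is when it is not `None` and otherwise continue to normal retrieval.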
## Schema notes per crop
- **Corn**: RM (relative maturity days), trait stack (SmartStax, VT
Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB /
Goss's / Anthracnose / Tar Spot ratings, standability, drought
tolerance, ear flex, grain-vs-silage flag.
- **Soy**: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 /
LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora
(race + Rps gene) / frogeye / brown stem rot ratings, IDC
tolerance (critical for upper Midwest), branching habit.
- **Wheat**: Class (HRW / HRS / SRW / SWW / SWS / durum), heading
(early / medium / late), stripe rust / leaf rust / stem rust /
FHB (scab) / Septoria / tan spot ratings, test weight, protein,
falling number, straw strength, CoAXium trait flag.
**Disease scale gotcha**: Golden Harvest publishes ratings on a
**9-to-1 scale** (9 = best, 1 = worst), the REVERSE of the 1 = best
convention used by NK and AgriPro. Normalize at chunk time so the
corpus has a single direction (9 = best), and document each source's
original direction in a chunk_0 preamble.
## Canonical sidecar schema (per variety)
```json
{
"source": "bayer_seeds",
"source_key": "dekalb-dkc62-08rib",
"vendor": "Bayer",
"brand": "DEKALB",
"product_name": "DKC62-08RIB",
"crop": "corn",
"relative_maturity": 112,
"maturity_group": null,
"wheat_class": null,
"trait_stack": ["SmartStax", "RIB"],
"agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
"disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
"regional_recommendation": ["IA-N", "MN-S", "WI-W"],
"source_urls": ["https://cropscience.bayer.us/..."],
"fetched_at": "2026-05-25T12:34:56Z"
}
```
`maturity_group` is for soy, `relative_maturity` is for corn,
`wheat_class` is for wheat. Use `null` for fields that don't apply.
Disease/agronomic rating direction is **normalized 1-9 (9 = best)**
post-scrape — original direction noted in chunk_0 if the source
publishes differently.
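The field rules above are easy to check mechanically. The sketch below is one possible sanity check over a sidecar dict, not a validator that exists in the repo. `check_sidecar` is a hypothetical name, and the crop value `"soy"` is an assumption (only `"corn"` appears in the example record).

```python
# Which maturity field belongs to which crop, per the notes above.
# Assumption: crops are spelled "corn", "soy", and "wheat" in the sidecar.
CROP_FIELD = {
    "corn": "relative_maturity",
    "soy": "maturity_group",
    "wheat": "wheat_class",
}


def check_sidecar(record: dict) -> list:
    """Return a list of problems; an empty list means nothing was flagged."""
    crop = record.get("crop")
    if crop not in CROP_FIELD:
        return [f"unknown crop: {crop!r}"]
    problems = []
    # Fields that belong to a different crop should be null.
    for other_crop, field in CROP_FIELD.items():
        if other_crop != crop and record.get(field) is not None:
            problems.append(f"{field} should be null for {crop}")
    # Ratings are expected on the normalized 1-9 scale.
    for group in ("agronomic_ratings", "disease_ratings"):
        for name, score in (record.get(group) or {}).items():
            if not 1 <= score <= 9:
                problems.append(f"{group}.{name}={score} is outside 1-9")
    return problems


example = {
    "crop": "corn",
    "relative_maturity": 112,
    "maturity_group": None,
    "wheat_class": None,
    "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
    "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
}
issues = check_sidecar(example)
```

The check deliberately does not require the crop's own maturity field to be present, since identity-only sources may not have it.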
## Working with this repo
### Identifying the current phase
This is a clone of the docs-mcp-template; phases follow the
template's PLAN.md.
| Signal | Likely phase |
|---|---|
| `corpus/` doesn't exist | Phase 1 (first scraper) |
| `corpus/bayer_seeds/` exists, no `chroma/` | Phase 2 (indexing) |
| `chroma/` exists, no `bm25/` | Phase 8 (hybrid search) |
| No `eval/results/` | Phase 7 (eval harness) |
| `_api_lessons` is `NotImplementedError` | Phase 11 |
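The filesystem signals in the table can be checked in order with a few lines of Python. This is a rough sketch that assumes `corpus/`, `chroma/`, `bm25/`, and `eval/results/` sit at the repo root and that the rows are meant to be read top to bottom. `guess_phase` is a hypothetical helper, and the last row (`_api_lessons`) is left as a manual check because it depends on code rather than on a directory.

```python
from pathlib import Path


def guess_phase(repo_root: str = ".") -> str:
    """Walk the signal table top to bottom and return the first match."""
    root = Path(repo_root)
    if not (root / "corpus").is_dir():
        return "Phase 1 (first scraper)"
    if (root / "corpus" / "bayer_seeds").is_dir() and not (root / "chroma").is_dir():
        return "Phase 2 (indexing)"
    if (root / "chroma").is_dir() and not (root / "bm25").is_dir():
        return "Phase 8 (hybrid search)"
    if not (root / "eval" / "results").is_dir():
        return "Phase 7 (eval harness)"
    # The remaining signal cannot be read from the filesystem.
    return "Check `_api_lessons`: Phase 11 if it raises NotImplementedError"


if __name__ == "__main__":
    print(guess_phase("."))
```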
## Layout
```
.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│ └── docker-compose.yml
├── .gitea/workflows/
│ ├── refresh.yml # Monthly cron: scrape + index + image
│ └── image-only.yml # On-demand: code-only ship cycle
├── scrape/
│ ├── runner.py # `python -m scrape.runner --source bayer_seeds`
│ ├── changelog.py
│ └── sources/
│ ├── bayer_seeds.py
│ ├── golden_harvest.py
│ ├── nk.py
│ ├── agripro.py
│ ├── becks_pfr.py
│ └── becks_products.py
├── rag/ # chunk + embed + Chroma + BM25
├── docs_mcp/ # FastMCP server + lessons.md
├── eval/ # Golden-query harness
└── scripts/ # registry_gc.py, usage_report.py
```
## Conventions
- **Vendor sub-corpora**: each scraper writes
`corpus/<source>/<source_key>.{md,json}`. `.md` is the LLM-visible
text (chunk_0 preamble + body); `.json` is the sidecar metadata.
- **Tool docstrings are user interface** — the LLM uses them to
decide whether to call. Treat like button labels.
- **Defensive fallback for retrieval** — reranker/BM25/external
deps must catch their specific exception and degrade to baseline.
The MCP is in front of farmers making real seed-buying decisions.
- **Verify retrieval changes with eval/** — ship a retrieval change
with numbers in the commit message.
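The defensive-fallback convention can be illustrated with a small wrapper: catch the one exception the dependency is known to raise, log it, and return the baseline order. This is a sketch under assumptions. `RerankerUnavailable` and `rerank_or_baseline` are hypothetical names, and the real code would catch whatever specific exception its rerank client raises.

```python
import logging
from typing import Callable, List, Sequence

log = logging.getLogger(__name__)


class RerankerUnavailable(Exception):
    """Hypothetical error raised when the rerank sidecar cannot be used."""


def rerank_or_baseline(
    query: str,
    candidates: Sequence[str],
    rerank: Callable[[str, Sequence[str]], List[str]],
) -> List[str]:
    """Try the reranker; on its specific failure, keep the baseline order."""
    try:
        return rerank(query, candidates)
    except RerankerUnavailable as exc:
        # Degrade rather than fail: the caller still gets usable results.
        log.warning("reranker failed, returning baseline order: %s", exc)
        return list(candidates)


def broken_reranker(query: str, candidates: Sequence[str]) -> List[str]:
    raise RerankerUnavailable("sidecar is down")


results = rerank_or_baseline("drought corn", ["doc-a", "doc-b"], broken_reranker)
```

Catching only the named exception keeps unrelated bugs visible instead of hiding them behind the fallback.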
### Standard infrastructure choices
- **Embedding**: `nomic-embed-text` via Ollama (768-dim)
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
`/v1/rerank` (shared `llama-rerank` sidecar with crop-chem-docs
on trashpanda Tesla P4)
- **Vector store**: Chroma `PersistentClient`
- **Lexical store**: SQLite FTS5
- **Fusion**: RRF k=60
- **Transport**: streamable-HTTP in prod, stdio for local dev
- **MCP framework**: FastMCP with `stateless_http=True`
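For reference, reciprocal rank fusion with k=60 scores each document as the sum of `1 / (k + rank)` over every ranked list it appears in. The sketch below shows that formula on two toy rankings. It is a generic illustration, not the repo's fusion code; `rrf_fuse` and the document ids are placeholders.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists with reciprocal rank fusion (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]    # e.g. order from the vector store
lexical = ["b", "d", "a"]  # e.g. order from the lexical store
fused = rrf_fuse([dense, lexical])
# "b" (ranks 2 and 1) edges out "a" (ranks 1 and 3); "d" beats "c".
```

Because only ranks are used, the vector and lexical scores never have to be put on a common scale.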
### Image name and package linking are repo-name-derived
`IMAGE` and `--package` derive from the repo at runtime via
`${{ github.repository_owner }}` / `${{ github.event.repository.name }}`.
The only workflow placeholders customized per clone are
`REGISTRY_PUSH=192.168.0.2:1234`, `REGISTRY_PULL=git.jpaul.io`,
and the `OLLAMA_URL` embed pool.
## Common commands
```bash
# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Run one scraper
python -m scrape.runner --source bayer_seeds --force
# Rebuild indexes
python -m rag.index --rebuild
# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000
# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
```
## Gotchas
- **Set `fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the
ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate
reranked docs to ~2000 chars.
- **FastMCP `stateless_http=True`**: critical for prod.
- **Runner shell is `/bin/sh` (dash)** in CI — no `${VAR::N}`.
- **Cloudflare 100 MB body cap**: push via LAN endpoint
`192.168.0.2:1234`, pull via `git.jpaul.io`.
- **Golden Harvest disease scale is reversed (9 = best)** —
normalize at chunk time.
- **Sitemap-listed PDF dates on Golden Harvest are stale** —
resolve the live PDF URL from the product HTML page.
- **No IPv6**: DNS for `git.jpaul.io` returns IPv6-only, so SSH
(port 22) fails with `Network unreachable`. Clone via HTTPS instead.
- **Pioneer is OFF-LIMITS** — do NOT add a `pioneer.py` scraper.
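One way to make the Pioneer exclusion hard to violate by accident is a host check in front of any fetch. The sketch below uses the domains named in the vendor table. `is_excluded` is a hypothetical helper, and the Vylor brand domains are not listed in this file, so they are not covered here.

```python
from urllib.parse import urlparse

# Domains named in the vendor table. Vylor brand domains are not listed in
# this file and would need to be added separately.
EXCLUDED_DOMAINS = (
    "corteva.com",
    "pioneer.com",
    "hoegemeyer.com",
    "therightseed.com",
)


def is_excluded(url: str) -> bool:
    """True if the URL's host is an excluded domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in EXCLUDED_DOMAINS
    )


blocked = is_excluded("https://www.pioneer.com/us/products")
allowed = is_excluded("https://cropscience.bayer.us/")
```

Matching on the full host or a dot-prefixed suffix avoids false positives on unrelated hosts that merely end with the same letters.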
## Out-of-scope concerns
- **Reverse proxy / TLS** — Drawbar's compose handles it
- **MetaMCP** — separate aggregator
- **GPU container orchestration** — shared `llama-rerank` sidecar
- **University extension trial data** — deferred to v1.5