# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Purpose

`seed-mcp` is an MCP server over the **public catalogs of major US row-crop seed vendors** (corn / soybeans / wheat). It is the sibling project to [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs) — same MCP-template lineage, same Drawbar consumer (the farm advisor AI), but the corpus is **seed/hybrid varieties** rather than pesticide labels.

The MCP exposes per-variety records with agronomic ratings, disease tolerance, trait stack, maturity, and regional notes — so the advisor can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105 in northeast Iowa?" without rummaging through individual brand sites.

PRODUCT_NAME for this build: **`crop_seed`** (lowercase, underscore; ends up in the MCP server name, Chroma collection, BM25 db filename, and the `crop_seed_api_lessons` tool).

## Vendor scope

| Vendor | Verdict | Varieties | Source pattern |
|---|---|---|---|
| Bayer (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | `cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs) |
| Golden Harvest (Syngenta) | 🟢 | ~175 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth) |
| Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
| Pioneer (Corteva) | 🔴 | — | DROP. ToS bans automation; dealer locator login-gated too |

**Build priority order** (shared-infra first → biggest yield):

1. `bayer_seeds` — lift-and-shift from crop-chem-docs' Bayer scraper
2. `golden_harvest` — biggest unique Syngenta brand
3. `nk` — reuses Golden Harvest's PDF fetcher
4. `agripro` — only wheat coverage in the corpus
5. `becks_pfr` — research goldmine, public Sanity GROQ
6. `becks_products` — identity-only, deferred until SeedIQ XHR known

### Pioneer fallback

Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer. The MCP's lessons layer contains a Pioneer-fallback entry: when the LLM detects a Pioneer / P-series query, it should reply:

> "Pioneer does not allow AI or other automation techniques to
> scrape and index their data. For Pioneer brand seed information,
> reach out to a local dealer directly via
> [pioneer.com](https://www.pioneer.com)."

Pioneer's dealer locator is login-gated — there is no public API to surface dealer contact info, so the lesson stays a plain link.

## Schema notes per crop

- **Corn**: RM (relative maturity days), trait stack (SmartStax, VT Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB / Goss's / Anthracnose / Tar Spot ratings, standability, drought tolerance, ear flex, grain-vs-silage flag.
- **Soy**: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 / LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora (race + Rps gene) / frogeye / brown stem rot ratings, IDC tolerance (critical for upper Midwest), branching habit.
- **Wheat**: Class (HRW / HRS / SRW / SWW / SWS / durum), heading (early / medium / late), stripe rust / leaf rust / stem rust / FHB (scab) / Septoria / tan spot ratings, test weight, protein, falling number, straw strength, CoAXium trait flag.

**Disease scale gotcha**: Golden Harvest publishes ratings on a **9-to-1 scale** (9 = best, 1 = worst) — the REVERSE of the typical 1-9 convention used by Bayer/NK/AgriPro. Normalize at chunk time so the corpus has a single direction; document it in a chunk_0 preamble.

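
A minimal sketch of the chunk-time normalization described in the gotcha above. The function names and the per-source direction map are hypothetical and do not come from this repo. The map assumes that the "typical 1-9 convention" means 1 = best, and that the target direction is 9 = best; confirm both against each vendor's published legend before relying on it.

```python
from typing import Dict, Optional

# ASSUMPTION: direction flags inferred from the gotcha text, not verified
# against vendor legends. True = source publishes 9 = best, False = 1 = best.
SOURCE_NINE_IS_BEST: Dict[str, bool] = {
    "golden_harvest": True,
    "bayer_seeds": False,
    "nk": False,
    "agripro": False,
}


def normalize_rating(source: str, raw: Optional[int]) -> Optional[int]:
    """Return the rating on a 1-9 scale where 9 = best, or None if missing."""
    if raw is None:
        return None
    if not 1 <= raw <= 9:
        raise ValueError(f"rating out of range for {source}: {raw}")
    # Sources that already publish 9 = best pass through unchanged;
    # the others are mirrored around the midpoint (1 <-> 9, 2 <-> 8, ...).
    return raw if SOURCE_NINE_IS_BEST[source] else 10 - raw


def normalize_ratings(
    source: str, ratings: Dict[str, Optional[int]]
) -> Dict[str, Optional[int]]:
    """Normalize every rating in a sidecar-style ratings dict."""
    return {name: normalize_rating(source, value) for name, value in ratings.items()}
```

An unknown source raises `KeyError` rather than silently passing through, so a newly added scraper cannot ship without a declared direction.
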
## Canonical sidecar schema (per variety)

```json
{
  "source": "bayer_seeds",
  "source_key": "dekalb-dkc62-08rib",
  "vendor": "Bayer",
  "brand": "DEKALB",
  "product_name": "DKC62-08RIB",
  "crop": "corn",
  "relative_maturity": 112,
  "maturity_group": null,
  "wheat_class": null,
  "trait_stack": ["SmartStax", "RIB"],
  "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
  "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
  "regional_recommendation": ["IA-N", "MN-S", "WI-W"],
  "source_urls": ["https://cropscience.bayer.us/..."],
  "fetched_at": "2026-05-25T12:34:56Z"
}
```

`maturity_group` is for soy, `relative_maturity` is for corn, `wheat_class` is for wheat. Use `null` for fields that don't apply.

Disease/agronomic rating direction is **normalized 1-9 (9 = best)** post-scrape — original direction noted in chunk_0 if the source publishes differently.

## Working with this repo

### Identifying the current phase

This is a clone of the docs-mcp-template; phases follow the template's PLAN.md.

| Signal | Likely phase |
|---|---|
| `corpus/` doesn't exist | Phase 1 (first scraper) |
| `corpus/bayer_seeds/` exists, no `chroma/` | Phase 2 (indexing) |
| `chroma/` exists, no `bm25/` | Phase 8 (hybrid search) |
| No `eval/results/` | Phase 7 (eval harness) |
| `_api_lessons` is `NotImplementedError` | Phase 11 |

## Layout

```
.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json              # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│   └── docker-compose.yml
├── .gitea/workflows/
│   ├── refresh.yml           # Monthly cron: scrape + index + image
│   └── image-only.yml        # On-demand: code-only ship cycle
├── scrape/
│   ├── runner.py             # `python -m scrape.runner --source bayer_seeds`
│   ├── changelog.py
│   └── sources/
│       ├── bayer_seeds.py
│       ├── golden_harvest.py
│       ├── nk.py
│       ├── agripro.py
│       ├── becks_pfr.py
│       └── becks_products.py
├── rag/                      # chunk + embed + Chroma + BM25
├── docs_mcp/                 # FastMCP server + lessons.md
├── eval/                     # Golden-query harness
└── scripts/                  # registry_gc.py, usage_report.py
```

## Conventions

- **Vendor sub-corpora**: each scraper writes `corpus//.{md,json}`. `.md` is the LLM-visible text (chunk_0 preamble + body); `.json` is the sidecar metadata.
- **Tool docstrings are user interface** — the LLM uses them to decide whether to call. Treat like button labels.
- **Defensive fallback for retrieval** — reranker/BM25/external deps must catch their specific exception and degrade to baseline. The MCP is in front of farmers making real seed-buying decisions.
- **Verify retrieval changes with eval/** — ship a retrieval change with numbers in the commit message.

### Standard infrastructure choices

- **Embedding**: `nomic-embed-text` via Ollama (768-dim)
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp `/v1/rerank` (shared `llama-rerank` sidecar with crop-chem-docs on trashpanda Tesla P4)
- **Vector store**: Chroma `PersistentClient`
- **Lexical store**: SQLite FTS5
- **Fusion**: RRF k=60
- **Transport**: streamable-HTTP in prod, stdio for local dev
- **MCP framework**: FastMCP with `stateless_http=True`

### Image name and package linking are repo-name-derived

`IMAGE` and `--package` derive from the repo at runtime via `${{ github.repository_owner }}` / `${{ github.event.repository.name }}`.
The only workflow placeholders customized per clone are `REGISTRY_PUSH=192.168.0.2:1234`, `REGISTRY_PULL=git.jpaul.io`, and the `OLLAMA_URL` embed pool.

## Common commands

```bash
# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000

# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
```

## Gotchas

- **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate reranked docs to ~2000 chars.
- **FastMCP `stateless_http=True`**: critical for prod.
- **Runner shell is `/bin/sh` (dash)** in CI — no `${VAR::N}`.
- **Cloudflare 100 MB body cap**: push via LAN endpoint `192.168.0.2:1234`, pull via `git.jpaul.io`.
- **Golden Harvest disease scale is reversed (9 = best)** — normalize at chunk time.
- **Sitemap-listed PDF dates on Golden Harvest are stale** — resolve the live PDF URL from the product HTML page.
- **No IPv6** — DNS for git.jpaul.io returns IPv6-only. Clone via HTTPS, not SSH (port 22 returns Network unreachable).
- **Pioneer is OFF-LIMITS** — do NOT add a `pioneer.py` scraper.

## Out-of-scope concerns

- **Reverse proxy / TLS** — Drawbar's compose handles it
- **MetaMCP** — separate aggregator
- **GPU container orchestration** — shared `llama-rerank` sidecar
- **University extension trial data** — deferred to v1.5
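
Returning to the reranker entry under Gotchas and the defensive-fallback convention: the sketch below shows one way the two could fit together. It is illustrative only. `rerank_or_baseline`, `truncate_for_rerank`, and the injected `score_fn` are hypothetical names, and the exception tuple is an assumption; a real client for the `/v1/rerank` sidecar would raise its own error types, which should be caught specifically.

```python
from typing import Callable, List, Sequence

# "~2000 chars" comes from the gotcha; the exact cutoff is a tunable.
MAX_RERANK_CHARS = 2000


def truncate_for_rerank(docs: Sequence[str], limit: int = MAX_RERANK_CHARS) -> List[str]:
    """Clip each document so one long doc cannot fail the whole batch."""
    return [doc[:limit] for doc in docs]


def rerank_or_baseline(
    query: str,
    docs: Sequence[str],
    score_fn: Callable[[str, List[str]], List[float]],
) -> List[str]:
    """Reorder docs by reranker score, or return them unchanged on failure.

    score_fn is a hypothetical wrapper around the reranker service that
    returns one relevance score per document, in input order.
    """
    try:
        scores = score_fn(query, truncate_for_rerank(docs))
    except (ConnectionError, TimeoutError, ValueError):
        # ASSUMPTION: stand-ins for the client's real exception types.
        return list(docs)
    if len(scores) != len(docs):
        # Malformed response: keep the baseline order rather than guess.
        return list(docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Return the original (untruncated) documents in reranked order.
    return [docs[i] for i in order]
```

Only the text sent for scoring is truncated; callers still receive the full documents, so truncation affects ranking quality at the margin but never the content returned to the advisor.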