ac40e05734
Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.
What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
canonical sidecar schema (per-crop), Golden Harvest disease-scale
reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
packages API URL + UA header to bypass CF block on default
Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
PRODUCT_DOCS_URL → repo URL
Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.
Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
231 lines
9.0 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when
working with code in this repository.

## Purpose

`seed-mcp` is an MCP server over the **public catalogs of major US
row-crop seed vendors** (corn / soybeans / wheat). It is the sibling
project to [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs):
same MCP-template lineage, same Drawbar consumer (the farm
advisor AI), but the corpus is **seed/hybrid varieties** rather than
pesticide labels.

The MCP exposes per-variety records with agronomic ratings, disease
tolerance, trait stack, maturity, and regional notes, so the advisor
can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105
in northeast Iowa?" without rummaging through individual brand sites.

PRODUCT_NAME for this build: **`crop_seed`** (lowercase, underscore;
ends up in the MCP server name, Chroma collection, BM25 db filename,
and the `crop_seed_api_lessons` tool).

## Vendor scope

| Vendor | Verdict | Varieties | Source pattern |
|---|---|---|---|
| Bayer (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | `cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs) |
| Golden Harvest (Syngenta) | 🟢 | ~175 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth) |
| Beck's products | 🟡 | 860 | Same Sanity API; identity-only until SeedIQ XHR is sniffed |
| Pioneer (Corteva) | 🔴 | n/a | DROP. ToS bans automation; dealer locator login-gated too |

**Build priority order** (shared-infra first → biggest yield):
1. `bayer_seeds`: lift-and-shift from crop-chem-docs' Bayer scraper
2. `golden_harvest`: biggest unique Syngenta brand
3. `nk`: reuses Golden Harvest's PDF fetcher
4. `agripro`: only wheat coverage in the corpus
5. `becks_pfr`: research goldmine, public Sanity GROQ
6. `becks_products`: identity-only, deferred until SeedIQ XHR known

### Pioneer fallback

Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer.
The MCP's lessons layer contains a Pioneer-fallback entry: when the
LLM detects a Pioneer / P-series query, it should reply:

> "Pioneer does not allow AI or other automation techniques to
> scrape and index their data. For Pioneer brand seed information,
> reach out to a local dealer directly via
> [pioneer.com](https://www.pioneer.com)."

Pioneer's dealer locator is login-gated; there is no public API
to surface dealer contact info, so the lesson stays a plain link.

## Schema notes per crop

- **Corn**: RM (relative maturity days), trait stack (SmartStax, VT
  Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB /
  Goss's / Anthracnose / Tar Spot ratings, standability, drought
  tolerance, ear flex, grain-vs-silage flag.
- **Soy**: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 /
  LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora
  (race + Rps gene) / frogeye / brown stem rot ratings, IDC
  tolerance (critical for upper Midwest), branching habit.
- **Wheat**: Class (HRW / HRS / SRW / SWW / SWS / durum), heading
  (early / medium / late), stripe rust / leaf rust / stem rust /
  FHB (scab) / Septoria / tan spot ratings, test weight, protein,
  falling number, straw strength, CoAXium trait flag.

**Disease scale gotcha**: Golden Harvest publishes ratings on a
**9-to-1 scale** (9 = best, 1 = worst), the REVERSE of the typical
1-9 convention (1 = best) used by Bayer/NK/AgriPro. Normalize at
chunk time so the corpus has a single direction, and document the
source's original direction in a chunk_0 preamble.

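The normalization step can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: it assumes the canonical direction is 9 = best (as the sidecar schema section states), and the `HIGH_IS_BEST` table and both function names are hypothetical. Which sources actually get flipped depends on the direction each vendor publishes, so the flags should be confirmed against the live catalogs.

```python
# Canonical scale for the corpus: 1-9 with 9 = best.
SCALE_MIN, SCALE_MAX = 1, 9

# Hypothetical per-source flag (an assumption for illustration):
# True when the vendor already publishes 9 = best.
HIGH_IS_BEST = {
    "golden_harvest": True,
    "bayer_seeds": False,
    "nk": False,
    "agripro": False,
}


def normalize_rating(source: str, raw: int) -> int:
    """Return the rating on the canonical 9 = best scale."""
    if not SCALE_MIN <= raw <= SCALE_MAX:
        raise ValueError(f"rating {raw!r} outside {SCALE_MIN}-{SCALE_MAX}")
    if HIGH_IS_BEST[source]:
        return raw
    # Mirror the scale: 1 becomes 9, 2 becomes 8, and so on.
    return SCALE_MIN + SCALE_MAX - raw


def normalize_ratings(source: str, ratings: dict) -> dict:
    """Normalize every rating in a disease or agronomic ratings dict."""
    return {name: normalize_rating(source, value) for name, value in ratings.items()}
```

Keeping the direction flag in one table means adding a vendor is a one-line change, and an unknown source fails loudly with a `KeyError` instead of being indexed in the wrong direction.
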
## Canonical sidecar schema (per variety)

```json
{
  "source": "bayer_seeds",
  "source_key": "dekalb-dkc62-08rib",
  "vendor": "Bayer",
  "brand": "DEKALB",
  "product_name": "DKC62-08RIB",
  "crop": "corn",
  "relative_maturity": 112,
  "maturity_group": null,
  "wheat_class": null,
  "trait_stack": ["SmartStax", "RIB"],
  "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
  "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
  "regional_recommendation": ["IA-N", "MN-S", "WI-W"],
  "source_urls": ["https://cropscience.bayer.us/..."],
  "fetched_at": "2026-05-25T12:34:56Z"
}
```

`maturity_group` is for soy, `relative_maturity` is for corn,
`wheat_class` is for wheat. Use `null` for fields that don't apply.
Disease/agronomic rating direction is **normalized 1-9 (9 = best)**
post-scrape; the original direction is noted in chunk_0 if the source
publishes differently.

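The per-crop field rule and the 1-9 rating range lend themselves to a small check before a sidecar is written. The sketch below is illustrative only: `validate_sidecar` is a hypothetical helper, and the crop values `"soy"` and `"wheat"` are assumptions (the example above only shows `"corn"`).

```python
# Which maturity-style field each crop is expected to populate.
# The "soy" and "wheat" crop values are assumed for illustration.
MATURITY_FIELD_BY_CROP = {
    "corn": "relative_maturity",
    "soy": "maturity_group",
    "wheat": "wheat_class",
}


def validate_sidecar(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    crop = record.get("crop")
    expected = MATURITY_FIELD_BY_CROP.get(crop)
    if expected is None:
        return [f"unknown crop: {crop!r}"]

    errors = []
    for field in MATURITY_FIELD_BY_CROP.values():
        value = record.get(field)
        if field == expected and value is None:
            errors.append(f"{field} is required for {crop}")
        elif field != expected and value is not None:
            errors.append(f"{field} must be null for {crop}")

    # Ratings are expected to be integers on the normalized 1-9 scale.
    for group in ("agronomic_ratings", "disease_ratings"):
        for name, rating in (record.get(group) or {}).items():
            if not isinstance(rating, int) or not 1 <= rating <= 9:
                errors.append(f"{group}.{name} out of range 1-9: {rating!r}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a scraper log every issue for a variety in one pass.
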
## Working with this repo

### Identifying the current phase

This is a clone of the docs-mcp-template; phases follow the
template's PLAN.md.

| Signal | Likely phase |
|---|---|
| `corpus/` doesn't exist | Phase 1 (first scraper) |
| `corpus/bayer_seeds/` exists, no `chroma/` | Phase 2 (indexing) |
| `chroma/` exists, no `bm25/` | Phase 8 (hybrid search) |
| No `eval/results/` | Phase 7 (eval harness) |
| `crop_seed_api_lessons` raises `NotImplementedError` | Phase 11 |

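The filesystem signals in the table can be checked mechanically. The following is a rough sketch, assuming the signal paths are directories relative to the repo root and that the rows are evaluated top to bottom; `guess_phase` is a hypothetical helper, and the lessons-tool signal is left as a manual check because it cannot be read from the directory layout.

```python
from pathlib import Path


def guess_phase(repo_root: str = ".") -> str:
    """Map the directory signals from the table to a likely phase."""
    root = Path(repo_root)
    corpus, chroma = root / "corpus", root / "chroma"

    if not corpus.is_dir():
        return "Phase 1 (first scraper)"
    if not chroma.is_dir():
        if (corpus / "bayer_seeds").is_dir():
            return "Phase 2 (indexing)"
        # The table does not cover this combination.
        return "unclear: corpus/ exists without bayer_seeds/ or chroma/"
    if not (root / "bm25").is_dir():
        return "Phase 8 (hybrid search)"
    if not (root / "eval" / "results").is_dir():
        return "Phase 7 (eval harness)"
    return "Phase 11 or later (check whether the lessons tool is implemented)"
```
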
## Layout

```
.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json              # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│   └── docker-compose.yml
├── .gitea/workflows/
│   ├── refresh.yml           # Monthly cron: scrape + index + image
│   └── image-only.yml        # On-demand: code-only ship cycle
├── scrape/
│   ├── runner.py             # `python -m scrape.runner --source bayer_seeds`
│   ├── changelog.py
│   └── sources/
│       ├── bayer_seeds.py
│       ├── golden_harvest.py
│       ├── nk.py
│       ├── agripro.py
│       ├── becks_pfr.py
│       └── becks_products.py
├── rag/                      # chunk + embed + Chroma + BM25
├── docs_mcp/                 # FastMCP server + lessons.md
├── eval/                     # Golden-query harness
└── scripts/                  # registry_gc.py, usage_report.py
```

## Conventions

- **Vendor sub-corpora**: each scraper writes
  `corpus/<source>/<source_key>.{md,json}`. `.md` is the LLM-visible
  text (chunk_0 preamble + body); `.json` is the sidecar metadata.
- **Tool docstrings are user interface**: the LLM uses them to
  decide whether to call a tool. Treat them like button labels.
- **Defensive fallback for retrieval**: calls to the reranker, BM25,
  and other external deps must catch their specific exception and
  degrade to baseline. The MCP is in front of farmers making real
  seed-buying decisions.
- **Verify retrieval changes with eval/**: ship a retrieval change
  with numbers in the commit message.

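The defensive-fallback convention above might look like the sketch below. Everything named here is hypothetical: `RerankError` stands in for whatever specific exception the real rerank client raises, and `rerank_or_baseline` is not a function from this repo.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


class RerankError(Exception):
    """Hypothetical exception type raised by the rerank client."""


def rerank_or_baseline(
    query: str,
    docs: Sequence[str],
    rerank: Callable[[str, Sequence[str]], List[str]],
) -> List[str]:
    """Return reranked docs, or the baseline order if the reranker fails."""
    try:
        return rerank(query, docs)
    except RerankError as exc:
        # Catch only the reranker's own failure mode; unrelated bugs still surface.
        logger.warning("reranker failed, serving baseline order: %s", exc)
        return list(docs)
```

Catching one named exception rather than a bare `Exception` keeps genuine programming errors visible while still guaranteeing the caller gets an answer when the sidecar is down.
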
### Standard infrastructure choices

- **Embedding**: `nomic-embed-text` via Ollama (768-dim)
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
  `/v1/rerank` (shared `llama-rerank` sidecar with crop-chem-docs
  on trashpanda Tesla P4)
- **Vector store**: Chroma `PersistentClient`
- **Lexical store**: SQLite FTS5
- **Fusion**: RRF k=60
- **Transport**: streamable-HTTP in prod, stdio for local dev
- **MCP framework**: FastMCP with `stateless_http=True`

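For readers unfamiliar with the fusion step, reciprocal rank fusion (RRF) scores each document as the sum of `1 / (k + rank)` over every ranked list it appears in. The sketch below shows the standard formula with `k=60`; the function name and the alphabetical tie-break are illustrative choices, not taken from this repo's `rag/` code.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc ids using reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example: one list from the vector store, one from the lexical store.
vector_hits = ["a", "b"]
lexical_hits = ["b", "c"]
fused = rrf_fuse([vector_hits, lexical_hits])
```

Because RRF uses only ranks, it needs no score calibration between the cosine distances from the vector store and the BM25 scores from FTS5.
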
### Image name and package linking are repo-name-derived

`IMAGE` and `--package` derive from the repo at runtime via
`${{ github.repository_owner }}` / `${{ github.event.repository.name }}`.
The only workflow placeholders customized per clone are
`REGISTRY_PUSH=192.168.0.2:1234`, `REGISTRY_PULL=git.jpaul.io`,
and the `OLLAMA_URL` embed pool.

## Common commands

```bash
# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000

# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
```

## Gotchas

- **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the
  ENTIRE batch if any doc exceeds `n_ctx_train=1024` tokens. Truncate
  each doc to ~2000 chars before reranking.
- **FastMCP `stateless_http=True`**: critical for prod.
- **Runner shell is `/bin/sh` (dash)** in CI: no bash-only substring
  expansion such as `${VAR::N}`.
- **Cloudflare 100 MB body cap**: push via LAN endpoint
  `192.168.0.2:1234`, pull via `git.jpaul.io`.
- **Golden Harvest disease scale is reversed (9 = best)**:
  normalize at chunk time.
- **Sitemap-listed PDF dates on Golden Harvest are stale**:
  resolve the live PDF URL from the product HTML page.
- **No IPv6**: DNS for git.jpaul.io returns IPv6-only. Clone via
  HTTPS, not SSH (port 22 returns Network unreachable).
- **Pioneer is OFF-LIMITS**: do NOT add a `pioneer.py` scraper.

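The reranker truncation from the list above can be as small as the sketch below. The helper name and constant are hypothetical; the ~2000-character figure comes from this document and is a character-count proxy for the 1024-token limit, not an exact token budget.

```python
from typing import List, Sequence

# Character cap taken from the gotcha above; a rough stand-in for the token limit.
MAX_RERANK_CHARS = 2000


def truncate_for_rerank(
    docs: Sequence[str], max_chars: int = MAX_RERANK_CHARS
) -> List[str]:
    """Cap every doc so one oversized chunk cannot fail the whole batch."""
    return [doc[:max_chars] for doc in docs]
```

Truncating every document unconditionally is cheaper than counting tokens, at the cost of occasionally cutting a chunk that would have fit.
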
## Out-of-scope concerns

- **Reverse proxy / TLS**: Drawbar's compose handles it
- **MetaMCP**: separate aggregator
- **GPU container orchestration**: shared `llama-rerank` sidecar
- **University extension trial data**: deferred to v1.5