justin a766756a05 Phase 2/3: chunker + indexer + MCP server tools

Phase 2 — Chunking and indexing
- rag/chunk.py: replace template chunker with seed-variety-specific
  chunks_from_variety(). One chunk per variety (varieties are small
  and named-rating retrieval signal is best kept together). Output
  is rebuilt deterministically from the sidecar JSON: every value is
  verbatim from the source, only framing language ("Disease ratings
  (1-9, 9=best):") is template glue. Anti-hallucination contract:
  same sidecar in → same chunk out, never a fabricated rating.
  Metadata flattened to Chroma-safe primitives (str/int/float/bool):
  source, source_key, vendor, brand, crop, product_name,
  product_id, source_url, rm (corn), mg (soy), wheat_class,
  release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars
  via the new chunker. Default PRODUCT_NAME=crop_seed so the
  Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
  (source/vendor/brand/crop/source_key) instead of the template's
  version/platform/product.
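
The Phase 2 contract above (same sidecar in, same chunk out, metadata reduced to Chroma-safe primitives) might look roughly like the sketch below. This is not the code in rag/chunk.py: the sidecar shape, the placeholder variety "EXAMPLE-01", and the helper names flatten_metadata and chunk_text are all assumptions made for illustration. The canonical sidecar schema is in CLAUDE.md.

```python
import json

# Hypothetical sidecar for a made-up variety; not real catalog data.
SIDECAR = {
    "source": "bayer_seeds",
    "source_key": "EXAMPLE-01",
    "brand": "DEKALB",
    "crop": "corn",
    "product_name": "EXAMPLE-01",
    "rm": 100,
    "trait_codes": ["RIB", "SS"],
    "ratings": {"Gray Leaf Spot": 6, "Anthracnose": 7},
}

CHROMA_SAFE = (str, int, float, bool)


def flatten_metadata(sidecar):
    """Keep primitive values; join string lists into a *_csv field."""
    meta = {}
    for key, value in sidecar.items():
        if isinstance(value, CHROMA_SAFE):
            meta[key] = value
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            meta[f"{key}_csv"] = ",".join(value)
        # Nested dicts (e.g. ratings) stay out of metadata; they go in the text.
    return meta


def chunk_text(sidecar):
    """Render one chunk per variety. Only the framing lines are template
    glue; every value is copied from the sidecar unchanged."""
    lines = [f"{sidecar['brand']} {sidecar['product_name']} ({sidecar['crop']})"]
    lines.append("Disease ratings (1-9, 9=best):")
    for name in sorted(sidecar["ratings"]):  # sorted keys => stable output
        lines.append(f"- {name}: {sidecar['ratings'][name]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(chunk_text(SIDECAR))
    print(flatten_metadata(SIDECAR))
```

Sorting the rating names keeps the output independent of key order in the JSON file, which is one simple way to get the determinism the contract asks for.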

Phase 3 — MCP server tools wired up
- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with
  RRF fusion. Optional filters: crop, brand, vendor, source.
  Variety-code prefilter pins exact source_key / product_name /
  hybrid_prefix matches at the top — dense embeddings have no
  semantic neighbor for tokens like "DKC62-08RIB" and RRF can let
  noise float to #1 without this pin. Each response carries the
  variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header
  (verbatim from sidecar, table per characteristics group, vendor
  positioning, regional listings) followed by the raw indexed body.
  This is the canonical fact-check surface.
- list_versions(): facet discovery — distinct sources, vendors,
  brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON
  for one variety. The agent should call this BEFORE quoting any
  specific rating value to a farmer — guaranteed verbatim.
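
To illustrate why search_docs pins exact variety-code matches, here is a minimal sketch of reciprocal rank fusion (RRF) followed by a pin step. The rankings are invented, the function names rrf_fuse and search are hypothetical, and k=60 is the conventional RRF constant rather than a value confirmed for this repo. The real prefilter also matches product_name and hybrid_prefix, which this sketch omits.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


def search(query, dense, bm25, known_keys):
    """Fuse both rankings, then move exact source_key matches to the top."""
    fused = rrf_fuse([dense, bm25])
    tokens = {t.upper() for t in query.split()}
    pinned = sorted(key for key in known_keys if key.upper() in tokens)
    rest = [doc_id for doc_id in fused if doc_id not in pinned]
    return pinned + rest


# Invented rankings: the dense retriever has no useful neighbor for the
# code token, so the exact match only reaches rank 3 there.
DENSE = ["AG29XF4", "WB6430", "DKC62-08RIB"]
BM25 = ["DKC62-08RIB", "AG29XF4"]
KNOWN_KEYS = {"DKC62-08RIB", "AG29XF4", "WB6430"}

if __name__ == "__main__":
    print(rrf_fuse([DENSE, BM25]))
    print(search("DKC62-08RIB ratings", DENSE, BM25, KNOWN_KEYS))
```

With these inputs, plain RRF ranks AG29XF4 first because it sits near the top of both lists, while the pinned search returns DKC62-08RIB first.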

Smoke tests against 475 indexed Bayer varieties:
- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
  3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season
  drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band);
  SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings;
  Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
  wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4,
  WB6430 all return as #1 hit. BM25 confirms the ranking
  unambiguously: top-1 score -13.2 vs -8.5 for #2 on
  "DKC62-08RIB ratings" (FTS5 bm25() scores are negative; lower
  is better).
- Server boots cleanly in stdio AND streamable-http modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 13:14:16 -04:00

Name	Last commit	Date
.gitea/workflows	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
deploy	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
docs_mcp	Phase 2/3: chunker + indexer + MCP server tools	2026-05-25 13:14:16 -04:00
eval	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
rag	Phase 2/3: chunker + indexer + MCP server tools	2026-05-25 13:14:16 -04:00
scrape	bayer_seeds: implement Phase 1 scraper for DEKALB + Asgrow + WestBred	2026-05-25 12:53:46 -04:00
scripts	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
.gitignore	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
CLAUDE.md	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
Dockerfile	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
PLAN.md	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
README.md	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
requirements.txt	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00
sources.json	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME	2026-05-25 12:28:49 -04:00

README.md

seed-mcp

MCP server over the public catalogs of major US row-crop seed vendors — corn, soybeans, wheat. Sibling project to crop-chem-docs (pesticide labels), feeding the same Drawbar farm-advisor AI.

The server exposes per-variety records with agronomic ratings, disease tolerance, trait stack, maturity, and regional notes — so the advisor can answer questions like "which corn hybrid for sandy soil, drought-prone, RM ≤105 in northeast Iowa?" without rummaging through individual brand sites.

Vendor coverage

Vendor	Verdict	Varieties	Notes
Bayer seeds (DEKALB + Asgrow + WestBred)	🟢	~475	Same `cropscience.bayer.us` Next.js infra as crop-chem-docs
Golden Harvest (Syngenta)	🟢	~175	Sitemap + server-rendered HTML + Syngenta CDN PDFs
NK (Syngenta)	🟢	29	Shares PDF fetcher with Golden Harvest
AgriPro (Syngenta wheat)	🟢	24	Drupal Views, server-rendered
Beck's PFR	🟡	2,089	Public Sanity GROQ API (no auth)
Beck's products	🟡	860	Identity-only until SeedIQ XHR sniffed
Pioneer (Corteva)	🔴	—	ToS bans automation — curated fallback lesson instead

Quick start

git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server (stdio for Claude Desktop dev)
python -m docs_mcp.server --transport stdio

Tools exposed

Tool	Purpose
`search_docs`	Hybrid + rerank variety search with crop / RM / trait / region filters
`get_page`	Full variety record by `(source, source_key)`
`list_versions`	Discover crops, brands, traits, RM/MG ranges, wheat classes
`corpus_status`	Counts + freshness; useful for health probes
`crop_seed_api_lessons`	Curated agronomy lessons — Pioneer fallback, disease-scale normalization, regional placement heuristics

Build phases

This is a clone of docs-mcp-template. The 13 phases in PLAN.md apply:

Phase	Status
0 — scaffold	done
1 — first scraper (bayer_seeds)	next
2 — chunk + index	pending
3 — baseline MCP tools	template defaults
4-5 — Dockerfile + CI	done (placeholders filled)
6 — reranker	shares `llama-rerank` sidecar with crop-chem-docs
7 — eval harness	pending (curate ~25 queries)
8 — hybrid search	done (template)
9 — diff_versions, list_cluster	optional
11 — `crop_seed_api_lessons` curated layer	pending

See CLAUDE.md for the canonical sidecar schema and the disease-scale-normalization gotcha (Golden Harvest is reversed).
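
As a rough illustration of the disease-scale gotcha, the sketch below maps ratings from a reversed-scale source onto a common "9 = best" direction. It assumes both scales run 1-9 and that reversal means 1 = best; neither detail is confirmed here, and the source id "golden_harvest" and the function name normalize_rating are hypothetical. CLAUDE.md remains the authority on the actual scales.

```python
# Hypothetical source ids whose published scale is assumed to be 1 = best.
REVERSED_SOURCES = {"golden_harvest"}


def normalize_rating(source, rating, lo=1, hi=9):
    """Return the rating on a 'hi = best' scale, flipping reversed sources."""
    if not lo <= rating <= hi:
        raise ValueError(f"rating {rating} outside {lo}-{hi}")
    if source in REVERSED_SOURCES:
        return lo + hi - rating  # on 1-9: 1 -> 9, 2 -> 8, ..., 9 -> 1
    return rating


if __name__ == "__main__":
    print(normalize_rating("golden_harvest", 2))
    print(normalize_rating("bayer_seeds", 7))
```

A normalized value like this would be for comparison and filtering only. Anything quoted to a farmer should still be the vendor's raw value with its own scale stated.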

Infrastructure

Registry: git.jpaul.io/justin/seed-mcp:latest (Watchtower) / :corpus-YYYY.MM.DD (production pin)
Embedder: shared Ollama pool with crop-chem-docs (Gitea-host GPUs + Windows Ollama; CI never hits trashpanda's production Ollama)
Reranker: shared llama-rerank sidecar on trashpanda's Tesla P4 (one container, both MCPs use it)
PRODUCT_NAME: crop_seed (not seed_mcp — used in Chroma collection, BM25 db filename, and crop_seed_api_lessons tool)