Phase 2/3: chunker + indexer + MCP server tools (475 Bayer varieties searchable) #2

justin commented

2026-05-25 13:14:48 -04:00

Summary

Phase 2 — Chunking and indexing: rewrite rag/chunk.py for seed varieties (one chunk per variety, deterministic from sidecar JSON, every value verbatim — anti-hallucination contract). Update rag/index.py to walk sidecars via the new chunker; default PRODUCT_NAME=crop_seed. Update rag/bm25.py schema with seed-domain filter columns (source/vendor/brand/crop/source_key).
Phase 3 — MCP server: implement search_docs (hybrid dense + BM25 with RRF + a variety-code prefilter), get_page (structured ratings header from sidecar + indexed body), list_versions (facet discovery), and lookup_variety (canonical sidecar JSON — the agent's fact-check tool).
All 475 indexed Bayer varieties (288 DEKALB / 102 Asgrow / 85 WestBred) are searchable end to end; Chroma + BM25 rebuild in ~20s on the 3-node Ollama pool.
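
For readers unfamiliar with the RRF step in search_docs, here is a minimal sketch of Reciprocal Rank Fusion over two ranked id lists. It is not the code from this PR: the function name `rrf_fuse`, the constant k=60 (a commonly used default), and two of the three ids are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    taken from this PR.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Illustrative ids only; the real rankings come from Chroma and BM25.
dense = ["dekalb-dkc44-97rib", "asgrow-ag29xf4", "dekalb-dkc62-08rib"]
bm25 = ["dekalb-dkc62-08rib", "dekalb-dkc44-97rib"]
fused = rrf_fuse([dense, bm25])
print(fused)
```

With these inputs the BM25 top hit lands second after fusion, because another id ranks well in both lists. That is the kind of behaviour the variety-code prefilter in this PR is meant to override for exact code matches.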

Anti-hallucination design notes

Every chunk's text and metadata are rebuilt deterministically from the variety's sidecar JSON. Given the same scrape, the chunker always produces the same chunk text. The retriever can never surface a rating value that wasn't in the source.
Every search result carries the variety's source URL inline so the agent can cite.
lookup_variety(source_key) is the canonical fact-check surface: returns the raw sidecar JSON unmodified. The tool docstring instructs the agent to call this before quoting any specific rating value to a farmer.
get_page emits a structured ratings table per characteristics group with values verbatim — no paraphrasing, no rounding.
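
The contract above (same sidecar in, same chunk out, values copied verbatim) can be sketched roughly as follows. This is an illustration under assumptions, not the chunker in rag/chunk.py: the function name `render_chunk`, the `characteristics` layout, the sorted iteration order, the example URL, and the placeholder rating names and values are all hypothetical.

```python
def render_chunk(sidecar: dict) -> dict:
    """Build one chunk from one sidecar, copying every value as-is."""
    lines = [f"{sidecar['brand']} {sidecar['product_name']} ({sidecar['crop']})"]
    groups = sidecar.get("characteristics", {})
    # Sorted iteration keeps the text independent of dict insertion order.
    for group in sorted(groups):
        lines.append(f"{group}:")
        for name, value in sorted(groups[group].items()):
            # The value is inserted unchanged: no rounding, no rewording.
            lines.append(f"  {name}: {value}")
    lines.append(f"Source: {sidecar['source_url']}")
    # Keep only flat primitives for the vector-store metadata.
    metadata = {
        key: value
        for key, value in sidecar.items()
        if isinstance(value, (str, int, float, bool))
    }
    return {
        "id": sidecar["source_key"],
        "text": "\n".join(lines),
        "metadata": metadata,
    }


# Hypothetical sidecar; the ratings are placeholders, not real Bayer data.
sidecar = {
    "source": "bayer_seeds",
    "source_key": "dekalb-dkc62-08rib",
    "brand": "DEKALB",
    "crop": "corn",
    "product_name": "DKC62-08RIB",
    "source_url": "https://example.com/dkc62-08rib",
    "characteristics": {
        "Placeholder group": {
            "Placeholder rating B": "7",
            "Placeholder rating A": "3",
        }
    },
}
chunk = render_chunk(sidecar)
print(chunk["text"])
```

Because the only non-source text is the fixed framing, any rating that appears in a chunk can be traced back to a field in the sidecar.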

The exact-code prefilter

Variety codes ("DKC62-08RIB", "AG29XF4", "WB6430") have no semantic neighbors, so dense retrieval misses them entirely (DKC62-08RIB was not in the dense top 100 for "DKC62-08RIB ratings"). BM25 alone finds them cleanly, but RRF fusion was letting dense noise float to #1 because "ratings" matches every chunk. Solution: scan sidecars at server boot to build an in-memory {token → [(source, source_key)]} index whose tokens come from source_key, hybrid_prefix, product_name, and code-like sub-tokens. Any query token that exactly matches an index entry pins those varieties to the top of the response, ahead of fuzzy retrieval.
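
A rough sketch of that prefilter, assuming a simple tokenizer and a "contains a letter and a digit" rule for code-like sub-tokens. The helper names (`tokenize`, `is_code_like`, `build_code_index`, `pin_exact_matches`), the second source_key, and both rules are hypothetical; the server's actual rules may differ.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase; keep hyphens so "DKC62-08RIB" survives as one token.
    return [t for t in re.split(r"[^a-z0-9\-]+", text.lower()) if t]


def is_code_like(token):
    # Assumed rule: a code mixes letters and digits (e.g. "ag29xf4").
    return any(c.isalpha() for c in token) and any(c.isdigit() for c in token)


def build_code_index(sidecars):
    """Map token -> [(source, source_key)], built once at server boot."""
    index = defaultdict(list)
    for s in sidecars:
        ref = (s["source"], s["source_key"])
        tokens = set()
        for field in ("source_key", "hybrid_prefix", "product_name"):
            value = str(s.get(field, "")).lower()
            if not value:
                continue
            tokens.add(value)
            tokens.update(t for t in value.split("-") if is_code_like(t))
        for token in sorted(tokens):
            index[token].append(ref)
    return index


def pin_exact_matches(query, index, fused):
    """Put exact code matches first, then the fused results."""
    pinned = []
    for token in tokenize(query):
        for ref in index.get(token, []):
            if ref not in pinned:
                pinned.append(ref)
    return pinned + [ref for ref in fused if ref not in pinned]


# Hypothetical sidecars reduced to the fields the index needs.
sidecars = [
    {"source": "bayer_seeds", "source_key": "dekalb-dkc62-08rib",
     "product_name": "DKC62-08RIB"},
    {"source": "bayer_seeds", "source_key": "asgrow-ag29xf4",
     "product_name": "AG29XF4"},
]
index = build_code_index(sidecars)
fused = [("bayer_seeds", "asgrow-ag29xf4"),
         ("bayer_seeds", "dekalb-dkc62-08rib")]
results = pin_exact_matches("DKC62-08RIB ratings", index, fused)
print(results[0])
```

A query word such as "ratings" never enters the index, so it cannot pin anything; only tokens that exactly equal an indexed code move results to the front.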

Test plan

[x] list_versions returns 475 varieties / 1 source / 3 brands / 3 crops with correct counts.
[x] Semantic queries find the right candidates:
- short-season drought corn (RM 90-95) → DKC44-97RIB at RM 94
- SCN-resistant MG 3 soy → Asgrow XF varieties with R1/R3 SCN ratings
- Phytophthora Rps3a soy → AG07XF4 (correct gene)
- Pacific Northwest stripe-rust wheat → WestBred WB1376CLP with Yellow Rust rating 2 (best on the 1-9 scale Bayer uses)
[x] Variety-code lookups via prefilter: DKC62-08RIB, AG29XF4, WB6430 each return as #1 hit.
[x] lookup_variety('dekalb-dkc62-08rib') returns canonical sidecar JSON.
[x] get_page('bayer_seeds', 'dekalb-dkc62-08rib') returns structured ratings table + body.
[x] Server boots in both stdio and streamable-http modes; HTTP POST /mcp responds 200 to MCP initialize.
[ ] Production deploy via .gitea/workflows/refresh.yml: pending Phase 4-5 (container deploy work).

Phase status

Phase 1 ✅ (Bayer scraper)
Phase 2 ✅ (chunker, embedder, Chroma + BM25 indexer)
Phase 3 ✅ (MCP server tools)
Phase 4-5: Dockerfile / compose / Gitea workflow customization — next session
Phase 1 continued: golden_harvest / nk / agripro / becks_pfr scrapers — next session


justin added 1 commit 2026-05-25 13:14:48 -04:00

Phase 2/3: chunker + indexer + MCP server tools a766756a05

Phase 2 — Chunking and indexing
- rag/chunk.py: replace template chunker with seed-variety-specific
  chunks_from_variety(). One chunk per variety (varieties are small
  and named-rating retrieval signal is best kept together). Output
  is rebuilt deterministically from the sidecar JSON: every value is
  verbatim from the source, only framing language ("Disease ratings
  (1-9, 9=best):") is template glue. Anti-hallucination contract:
  same sidecar in → same chunk out, never a fabricated rating.
  Metadata flattened to Chroma-safe primitives (str/int/float/bool):
  source, source_key, vendor, brand, crop, product_name,
  product_id, source_url, rm (corn), mg (soy), wheat_class,
  release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars
  via the new chunker. Default PRODUCT_NAME=crop_seed so the
  Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
  (source/vendor/brand/crop/source_key) instead of the template's
  version/platform/product.

Phase 3 — MCP server tools wired up
- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with
  RRF fusion. Optional filters: crop, brand, vendor, source.
  Variety-code prefilter pins exact source_key / product_name /
  hybrid_prefix matches at the top — dense embeddings have no
  semantic neighbor for tokens like "DKC62-08RIB" and RRF can let
  noise float to #1 without this pin. Each response carries the
  variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header
  (verbatim from sidecar, table per characteristics group, vendor
  positioning, regional listings) followed by the raw indexed body.
  This is the canonical fact-check surface.
- list_versions(): facet discovery — distinct sources, vendors,
  brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON
  for one variety. The agent should call this BEFORE quoting any
  specific rating value to a farmer — guaranteed verbatim.

Smoke tests against 475 indexed Bayer varieties:
- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
  3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season
  drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band);
  SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings;
  Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
  wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4,
  WB6430 all return as #1 hit. BM25 confirms ranking unambiguously
  (top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings"; FTS5
  bm25() scores are negative, so the lower value is the better match).
- Server boots cleanly in stdio AND streamable-http modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

justin merged commit 3cab941c08 into main

2026-05-25 13:14:58 -04:00

justin referenced this issue from a commit

2026-05-25 13:15:00 -04:00

Phase 2/3: chunker + indexer + MCP server tools

Reference: justin/seed-mcp#2