Phase 2/3: chunker + indexer + MCP server tools (475 Bayer varieties searchable) #2

Merged
justin merged 1 commit from phase-2-3-retrieval into main 2026-05-25 13:14:58 -04:00
Owner

Summary

  • Phase 2 — Chunking and indexing: rewrite rag/chunk.py for seed varieties (one chunk per variety, deterministic from sidecar JSON, every value verbatim — anti-hallucination contract). Update rag/index.py to walk sidecars via the new chunker; default PRODUCT_NAME=crop_seed. Update rag/bm25.py schema with seed-domain filter columns (source/vendor/brand/crop/source_key).
  • Phase 3 — MCP server: implement search_docs (hybrid dense + BM25 with RRF + a variety-code prefilter), get_page (structured ratings header from sidecar + indexed body), list_versions (facet discovery), and lookup_variety (canonical sidecar JSON — the agent's fact-check tool).
  • All 475 indexed Bayer varieties (288 DEKALB / 102 Asgrow / 85 WestBred) end-to-end searchable; Chroma + BM25 rebuild in ~20s on the 3-node Ollama pool.
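For readers unfamiliar with RRF, the fusion step behind search_docs can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the function name `rrf_fuse`, the constant `k=60` (the value commonly used for Reciprocal Rank Fusion), and every source_key other than `dekalb-dkc62-08rib` are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked outputs from the dense and BM25 retrievers.
dense = ["dekalb-dkc44-97rib", "asgrow-ag29xf4", "westbred-wb6430"]
bm25 = ["asgrow-ag29xf4", "dekalb-dkc62-08rib", "dekalb-dkc44-97rib"]

fused = rrf_fuse([dense, bm25])
print(fused)
```

Because RRF only looks at ranks, an id that is mid-ranked in both lists can outrank an id that is #1 in only one of them, which is the behaviour the variety-code prefilter is meant to compensate for.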

Anti-hallucination design notes

  • Every chunk's text and metadata is rebuilt deterministically from the variety's sidecar JSON. Given the same scrape, the chunker always produces the same chunk text. The retriever can never surface a rating value that wasn't in the source.
  • Every search result carries the variety's source URL inline so the agent can cite.
  • lookup_variety(source_key) is the canonical fact-check surface: returns the raw sidecar JSON unmodified. The tool docstring instructs the agent to call this before quoting any specific rating value to a farmer.
  • get_page emits a structured ratings table per characteristics group with values verbatim — no paraphrasing, no rounding.
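The determinism contract above can be illustrated with a small sketch. It is not the actual `chunks_from_variety()` from rag/chunk.py: the sidecar field names (`characteristics`, `group`, `ratings`, `name`, `value`) and the sample record are hypothetical, chosen only to show that every value is copied verbatim and that the same sidecar always yields the same text.

```python
import json


def chunk_text_from_sidecar(sidecar):
    """Build chunk text from a sidecar dict without altering any value.

    Only the framing (headings, indentation, "Source:") is template glue;
    rating values are inserted exactly as stored. Field names are assumed.
    """
    lines = [f"{sidecar['brand']} {sidecar['product_name']} ({sidecar['crop']})"]
    for group in sidecar.get("characteristics", []):
        lines.append(f"{group['group']}:")
        for rating in group["ratings"]:
            # No rounding, no paraphrasing: the stored value is emitted as-is.
            lines.append(f"  {rating['name']}: {rating['value']}")
    lines.append(f"Source: {sidecar['source_url']}")
    return "\n".join(lines)


# Made-up sample record, not real variety data.
sample = {
    "brand": "ExampleBrand",
    "product_name": "EX00-00",
    "crop": "corn",
    "source_url": "https://example.invalid/ex00-00",
    "characteristics": [
        {"group": "Disease ratings", "ratings": [
            {"name": "Example rating A", "value": "7"},
            {"name": "Example rating B", "value": "3"},
        ]},
    ],
}

first = chunk_text_from_sidecar(sample)
# A JSON round trip stands in for re-reading the same scrape from disk.
second = chunk_text_from_sidecar(json.loads(json.dumps(sample)))
print(first)
```

A check like `first == second` is the kind of assertion a regression test for the contract could make against real sidecars.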

The exact-code prefilter

Variety codes ("DKC62-08RIB", "AG29XF4", "WB6430") have no semantic neighbors, so dense retrieval misses them entirely: DKC62-08RIB is not in the dense top 100 for the query "DKC62-08RIB ratings". BM25 alone finds them cleanly, but RRF fusion was letting dense noise float to #1 because "ratings" matches every chunk. Solution: scan the sidecars at server boot to build an in-memory {token → [(source, source_key)]} index whose keys are each variety's source_key, hybrid_prefix, product_name, and code-like sub-tokens. Any query token that exactly matches an index key pins those varieties to the top of the response, ahead of the fuzzy retrieval results.
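A rough sketch of the prefilter idea, under stated assumptions: the helper names (`query_tokens`, `is_code_like`, `build_code_index`, `pin_exact_matches`), the tokenization rules, and the `asgrow-ag29xf4` source_key are all hypothetical, and `hybrid_prefix` is read only if present because its shape is not shown in this PR.

```python
import re


def query_tokens(text):
    # Lowercase and keep hyphenated codes such as "dkc62-08rib" as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def is_code_like(token):
    # Assumed heuristic: a code mixes letters and digits.
    return any(c.isalpha() for c in token) and any(c.isdigit() for c in token)


def build_code_index(sidecars):
    """Map each exact-match key to the (source, source_key) pairs it identifies."""
    index = {}
    for sc in sidecars:
        entry = (sc["source"], sc["source_key"])
        name = sc["product_name"].lower()
        keys = {sc["source_key"].lower(), name}
        if sc.get("hybrid_prefix"):
            keys.add(sc["hybrid_prefix"].lower())
        # Code-like sub-tokens of the product name, e.g. "dkc62" and "08rib".
        keys.update(t for t in re.findall(r"[a-z0-9]+", name) if is_code_like(t))
        for key in keys:
            bucket = index.setdefault(key, [])
            if entry not in bucket:
                bucket.append(entry)
    return index


def pin_exact_matches(query, index, fused):
    """Put exact code matches first, then the fused results minus duplicates."""
    pinned = []
    for token in query_tokens(query):
        for entry in index.get(token, []):
            if entry not in pinned:
                pinned.append(entry)
    return pinned + [entry for entry in fused if entry not in pinned]


sidecars = [
    {"source": "bayer_seeds", "source_key": "dekalb-dkc62-08rib",
     "product_name": "DKC62-08RIB"},
    {"source": "bayer_seeds", "source_key": "asgrow-ag29xf4",
     "product_name": "AG29XF4"},
]
index = build_code_index(sidecars)

# Pretend fused retrieval ranked the wrong variety first.
fused = [("bayer_seeds", "asgrow-ag29xf4"), ("bayer_seeds", "dekalb-dkc62-08rib")]
result = pin_exact_matches("DKC62-08RIB ratings", index, fused)
print(result)
```

A query with no code token, such as "ratings", leaves the fused order untouched, so the pin only intervenes when an exact key is present.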

Test plan

  • [x] list_versions returns 475 varieties / 1 source / 3 brands / 3 crops with correct counts.
  • [x] Semantic queries find the right candidates:
    • short-season drought corn (RM 90-95) → DKC44-97RIB at RM 94
    • SCN-resistant MG 3 soy → Asgrow XF varieties with R1/R3 SCN ratings
    • Phytophthora Rps3a soy → AG07XF4 (correct gene)
    • Pacific Northwest stripe-rust wheat → WestBred WB1376CLP with Yellow Rust rating 2 (best on the 1-9 scale Bayer uses)
  • [x] Variety-code lookups via prefilter: DKC62-08RIB, AG29XF4, WB6430 each return as #1 hit.
  • [x] lookup_variety('dekalb-dkc62-08rib') returns canonical sidecar JSON.
  • [x] get_page('bayer_seeds', 'dekalb-dkc62-08rib') returns structured ratings table + body.
  • [x] Server boots in both stdio and streamable-http modes; HTTP POST /mcp responds 200 to MCP initialize.
  • [ ] Production deploy via .gitea/workflows/refresh.yml: pending Phase 4-5 (container deploy work).

Phase status

  • Phase 1 ✅ (Bayer scraper)
  • Phase 2 ✅ (chunker, embedder, Chroma + BM25 indexer)
  • Phase 3 ✅ (MCP server tools)
  • Phase 4-5: Dockerfile / compose / Gitea workflow customization — next session
  • Phase 1 continued: golden_harvest / nk / agripro / becks_pfr scrapers — next session
justin added 1 commit 2026-05-25 13:14:48 -04:00
Phase 2 — Chunking and indexing
- rag/chunk.py: replace template chunker with seed-variety-specific
  chunks_from_variety(). One chunk per variety (varieties are small
  and named-rating retrieval signal is best kept together). Output
  is rebuilt deterministically from the sidecar JSON: every value is
  verbatim from the source, only framing language ("Disease ratings
  (1-9, 9=best):") is template glue. Anti-hallucination contract:
  same sidecar in → same chunk out, never a fabricated rating.
  Metadata flattened to Chroma-safe primitives (str/int/float/bool):
  source, source_key, vendor, brand, crop, product_name,
  product_id, source_url, rm (corn), mg (soy), wheat_class,
  release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars
  via the new chunker. Default PRODUCT_NAME=crop_seed so the
  Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
  (source/vendor/brand/crop/source_key) instead of the template's
  version/platform/product.
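
The metadata flattening described above might look roughly like the sketch below. It is an illustration only: `flatten_metadata`, the `trait_codes` input field, and the sample values are assumptions, and the real chunker emits more fields than shown here.

```python
PRIMITIVES = (str, int, float, bool)


def flatten_metadata(sidecar):
    """Reduce a sidecar dict to flat str/int/float/bool metadata values.

    Nested or missing values are skipped rather than coerced, and the list
    of trait codes (assumed field name: trait_codes) becomes one CSV string.
    """
    meta = {}
    for key in ("source", "source_key", "vendor", "brand", "crop",
                "product_name", "source_url", "release_year"):
        value = sidecar.get(key)
        if isinstance(value, PRIMITIVES):
            meta[key] = value
    codes = sidecar.get("trait_codes") or []
    meta["trait_codes_csv"] = ",".join(str(code) for code in codes)
    return meta


# Made-up sample record, not real variety data.
sample = {
    "source": "bayer_seeds",
    "source_key": "example-ex00-00",
    "brand": "ExampleBrand",
    "crop": "corn",
    "release_year": None,            # missing value: dropped, not stored
    "trait_codes": ["T1", "T2"],     # list: joined into a CSV string
    "characteristics": [{"group": "Disease ratings"}],  # nested: not copied
}

meta = flatten_metadata(sample)
print(meta)
```

Joining list values into a `_csv` string keeps every metadata value a single primitive, which is the constraint the commit message calls Chroma-safe.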

Phase 3 — MCP server tools wired up
- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with
  RRF fusion. Optional filters: crop, brand, vendor, source.
  Variety-code prefilter pins exact source_key / product_name /
  hybrid_prefix matches at the top — dense embeddings have no
  semantic neighbor for tokens like "DKC62-08RIB" and RRF can let
  noise float to #1 without this pin. Each response carries the
  variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header
  (verbatim from sidecar, table per characteristics group, vendor
  positioning, regional listings) followed by the raw indexed body.
  This is the canonical fact-check surface.
- list_versions(): facet discovery — distinct sources, vendors,
  brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON
  for one variety. The agent should call this BEFORE quoting any
  specific rating value to a farmer — guaranteed verbatim.

Smoke tests against 475 indexed Bayer varieties:
- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
  3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season
  drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band);
  SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings;
  Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
  wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4,
  WB6430 all return as #1 hit. BM25 confirms ranking unambiguously
  (top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings").
- Server boots cleanly in stdio AND streamable-http modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 3cab941c08 into main 2026-05-25 13:14:58 -04:00
Reference: justin/seed-mcp#2