Phase 2/3: chunker + indexer + MCP server tools (475 Bayer varieties searchable) #2
Summary
- rag/chunk.py for seed varieties (one chunk per variety, deterministic from sidecar JSON, every value verbatim: the anti-hallucination contract).
- Update rag/index.py to walk sidecars via the new chunker; default PRODUCT_NAME=crop_seed.
- Update rag/bm25.py schema with seed-domain filter columns (source/vendor/brand/crop/source_key).
- search_docs (hybrid dense + BM25 with RRF + a variety-code prefilter), get_page (structured ratings header from sidecar + indexed body), list_versions (facet discovery), and lookup_variety (canonical sidecar JSON, the agent's fact-check tool).

Anti-hallucination design notes
- lookup_variety(source_key) is the canonical fact-check surface: it returns the raw sidecar JSON unmodified. The tool docstring instructs the agent to call this before quoting any specific rating value to a farmer.
- get_page emits a structured ratings table per characteristics group with values verbatim: no paraphrasing, no rounding.

The exact-code prefilter
Variety codes ("DKC62-08RIB", "AG29XF4", "WB6430") have no semantic neighbors. Dense retrieval misses them entirely: DKC62-08RIB is not in the dense top 100 for the query "DKC62-08RIB ratings". BM25 alone finds them cleanly, but RRF fusion was letting dense noise float to #1 because "ratings" matches every chunk.

Solution: scan the sidecars at server boot to build an in-memory {token → [(source, source_key)]} index keyed by source_key, hybrid_prefix, product_name, and code-like sub-tokens. Any query token that exactly matches an index key pins the matching varieties to the top of the response, ahead of the fuzzy retrieval results.

Test plan
- list_versions returns 475 varieties / 1 source / 3 brands / 3 crops with correct counts.
- lookup_variety('dekalb-dkc62-08rib') returns canonical sidecar JSON.
- get_page('bayer_seeds', 'dekalb-dkc62-08rib') returns structured ratings table + body.
- POST /mcp responds 200 to MCP initialize.

Phase status
Phase 2: Chunking and indexing

- rag/chunk.py: replace template chunker with seed-variety-specific chunks_from_variety(). One chunk per variety (varieties are small, and the named-rating retrieval signal is best kept together). Output is rebuilt deterministically from the sidecar JSON: every value is verbatim from the source, only framing language ("Disease ratings (1-9, 9=best):") is template glue. Anti-hallucination contract: same sidecar in → same chunk out, never a fabricated rating. Metadata flattened to Chroma-safe primitives (str/int/float/bool): source, source_key, vendor, brand, crop, product_name, product_id, source_url, rm (corn), mg (soy), wheat_class, release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars via the new chunker. Default PRODUCT_NAME=crop_seed so the Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets (source/vendor/brand/crop/source_key) instead of the template's version/platform/product.

Phase 3: MCP server tools wired up

- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with RRF fusion. Optional filters: crop, brand, vendor, source. Variety-code prefilter pins exact source_key / product_name / hybrid_prefix matches at the top, because dense embeddings have no semantic neighbor for tokens like "DKC62-08RIB" and RRF can let noise float to #1 without this pin. Each response carries the variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header (verbatim from sidecar, table per characteristics group, vendor positioning, regional listings) followed by the raw indexed body. This is the canonical fact-check surface.
- list_versions(): facet discovery: distinct sources, vendors, brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON for one variety. The agent should call this BEFORE quoting any specific rating value to a farmer; the result is guaranteed verbatim.

Smoke tests against 475 indexed Bayer varieties:

- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands, 3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band); SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings; Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4, WB6430 all return as #1 hit. BM25 confirms ranking unambiguously (top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings").
- Server boots cleanly in stdio AND streamable-http modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
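
For reviewers who want a concrete picture of the exact-code prefilter, here is a minimal sketch of the idea, not the code in this PR. It assumes sidecars are plain dicts with source, source_key, and product_name fields, uses the conventional RRF constant k=60 (the PR does not state its value), and leaves out hybrid_prefix and code-like sub-token keys. The function names, the two non-DEKALB source_key values, and the toy rankings are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sidecar records. Only 'dekalb-dkc62-08rib' and 'bayer_seeds'
# appear in the PR text; the other source_key values are made up here.
SIDECARS = [
    {"source": "bayer_seeds", "source_key": "dekalb-dkc62-08rib", "product_name": "DKC62-08RIB"},
    {"source": "bayer_seeds", "source_key": "asgrow-ag29xf4", "product_name": "AG29XF4"},
    {"source": "bayer_seeds", "source_key": "westbred-wb6430", "product_name": "WB6430"},
]


def normalize(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.strip(".,;:!?\"'()").lower()


def build_code_index(sidecars):
    """Build {token: [(source, source_key), ...]} from sidecar records."""
    index = defaultdict(list)
    for rec in sidecars:
        ref = (rec["source"], rec["source_key"])
        for field in ("source_key", "product_name"):
            value = rec.get(field)
            if value and ref not in index[normalize(value)]:
                index[normalize(value)].append(ref)
    return dict(index)


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(ref) = sum over rankings of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, ref in enumerate(ranking, start=1):
            scores[ref] += 1.0 / (k + rank)
    return sorted(scores, key=lambda ref: -scores[ref])


def search(query, dense, bm25, index):
    """Pin exact code matches first, then append the RRF-fused remainder."""
    pinned = []
    for token in query.split():
        for ref in index.get(normalize(token), []):
            if ref not in pinned:
                pinned.append(ref)
    fused = rrf_fuse([dense, bm25])
    return pinned + [ref for ref in fused if ref not in pinned]


DKC = ("bayer_seeds", "dekalb-dkc62-08rib")
AG = ("bayer_seeds", "asgrow-ag29xf4")
WB = ("bayer_seeds", "westbred-wb6430")

INDEX = build_code_index(SIDECARS)
# Toy rankings: dense never sees the code, BM25 ranks it first.
DENSE = [AG, WB]
BM25 = [DKC, AG]
RESULTS = search("DKC62-08RIB ratings", DENSE, BM25, INDEX)
```

In this toy setup, plain fusion ranks the Asgrow record first because it appears in both lists, which mirrors the "dense noise floats to #1" failure described in the PR. The pin moves the exact match to the front and leaves the fused order of the remaining results unchanged.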