Phase 2/3: chunker + indexer + MCP server tools

Phase 2: Chunking and indexing

- rag/chunk.py: replace template chunker with seed-variety-specific
  chunks_from_variety(). One chunk per variety (varieties are small and
  named-rating retrieval signal is best kept together). Output is
  rebuilt deterministically from the sidecar JSON: every value is
  verbatim from the source, only framing language
  ("Disease ratings (1-9, 9=best):") is template glue.
  Anti-hallucination contract: same sidecar in → same chunk out, never
  a fabricated rating. Metadata flattened to Chroma-safe primitives
  (str/int/float/bool): source, source_key, vendor, brand, crop,
  product_name, product_id, source_url, rm (corn), mg (soy),
  wheat_class, release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars via
  the new chunker. Default PRODUCT_NAME=crop_seed so the Chroma
  collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
  (source/vendor/brand/crop/source_key) instead of the template's
  version/platform/product.

Phase 3: MCP server tools wired up

- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with RRF
  fusion. Optional filters: crop, brand, vendor, source. Variety-code
  prefilter pins exact source_key / product_name / hybrid_prefix
  matches at the top: dense embeddings have no semantic neighbor for
  tokens like "DKC62-08RIB", and RRF can let noise float to #1 without
  this pin. Each response carries the variety's source URL inline so
  the agent can cite.
- get_page(source, source_key): emits a structured ratings header
  (verbatim from sidecar, table per characteristics group, vendor
  positioning, regional listings) followed by the raw indexed body.
  This is the canonical fact-check surface.
- list_versions(): facet discovery (distinct sources, vendors, brands,
  crops across the corpus).
- lookup_variety(source_key, source?): returns the raw sidecar JSON for
  one variety. The agent should call this BEFORE quoting any specific
  rating value to a farmer; the values are guaranteed verbatim.

Smoke tests against 475 indexed Bayer varieties:

- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
  3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates:
  short-season drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95
  band); SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3
  ratings; Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
  wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4, WB6430
  all return as #1 hit. BM25 confirms ranking unambiguously (top-1
  score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings").
- Server boots cleanly in stdio AND streamable-http modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
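
The search_docs description above combines two ideas: reciprocal rank fusion of the dense and BM25 rankings, then pinning exact variety-code matches to the top. A minimal sketch of that idea follows. It is not the server's actual code: the function names, the k=60 constant, the example rankings, and the use of bare variety codes as ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank).

    k=60 is a commonly used default, assumed here; the commit does not
    say which constant the server uses.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def pin_exact_matches(query: str, fused: list[str], known_keys: set[str]) -> list[str]:
    """Move ids that appear verbatim in the query to the front (hypothetical helper)."""
    q = query.lower()
    pinned = [d for d in fused if d in known_keys and d.lower() in q]
    rest = [d for d in fused if d not in pinned]
    return pinned + rest


# Hypothetical rankings, keyed by lowercase variety code.
dense = ["dkc44-97rib", "dkc62-08rib", "ag29xf4"]
bm25 = ["wb6430", "dkc44-97rib", "dkc62-08rib"]

fused = rrf_fuse([dense, bm25])
result = pin_exact_matches("DKC62-08RIB ratings", fused, set(fused))
print(fused)
print(result)
```

In this toy input, fusion alone ranks dkc44-97rib first because it sits high in both lists. The pin step is what moves the variety named in the query to the top.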
2026-05-25 13:14:16 -04:00
parent 0fb8d9d92d
commit a766756a05
4 changed files with 982 additions and 369 deletions
@@ -1,126 +1,324 @@
-"""Markdown chunker — paragraph-aware, ~400-600 token target.
+"""Chunker for seed-variety corpus.

-Adjust the chunking strategy per product if your page format differs
-significantly from prose. The output shape (id, text, metadata) is
-fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
-change that.
+Each variety becomes ONE chunk by default. Variety pages are small
+(typically 2-3 KB of useful signal) and nomic-embed-text handles up
+to ~8 K tokens cleanly. Splitting a variety across chunks dilutes
+the named-rating embeddings (e.g. "SCN resistance 7") that farmers
+search by — keep them together.

-The key knob you'll tune per product is chunk-0. Dense retrieval lands
-on chunk 0 first for most queries. Make it a synthetic chunk built
-from:
+The chunk text is a synthetic preamble assembled deterministically
+from the sidecar JSON. Every value in the chunk text comes verbatim
+from the source. The framing words ("Disease ratings (1-9, 9=best):",
+"Maturity group:", etc.) are template glue — *we add structure, we
+do NOT add facts*. Given the same sidecar, this chunker always
+produces the same chunk text. That's the anti-hallucination
+contract: the retriever can never surface a rating value that
+wasn't in the source.

-  - the page title (as natural-language H1)
-  - a 1-sentence task description (you'll have to generate this — for
-    pages that already have a "## Overview" or "## Introduction" the
-    first sentence usually works)
-  - a keyword bag of important terms (filenames, API names, error
-    codes — the rare technical tokens that BM25 lights up on)
+Metadata is flattened to Chroma-safe primitives (str/int/float/bool):

-Without a rich chunk 0, dense retrieval gets dominated by the much
-larger prose body, and short pages (script examples, reference cards)
-get buried.
+  source             "bayer_seeds"
+  source_key         "dekalb-dkc075-70rib"
+  vendor             "Bayer"
+  brand              "DEKALB"
+  crop               "corn" | "soybeans" | "wheat"
+  product_name       "DKC075-70RIB BRAND BLEND"
+  product_id         canonical full id
+  source_url         the variety's page URL
+  rm                 corn RM as int when parseable (else absent)
+  mg                 soy MG as float when parseable (else absent)
+  release_year       int when known
+  trait_codes_csv    comma-padded trait codes (",A,B,") for substring search
+  rating_scale       "1-9 (9 = best)" — chunker should ALWAYS attach
+                     this so downstream code can detect a flip
+  ordinal            chunk index within variety (0-based)
+
+Lists like ``regional_recommendations`` and the full per-rating dicts
+do NOT fit Chroma's metadata constraints — they stay in the sidecar
+JSON, surfaced by ``get_page`` / ``lookup_variety``.
 """
 from __future__ import annotations

+import json
 import re
+from pathlib import Path
 from typing import Iterator


-# Approximate token estimate from char count. Tunable — set per
-# embedder if the default 4 chars/token is wrong.
-CHARS_PER_TOKEN = 4
-TARGET_TOKENS = 500
-TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
+# Rating-group classification. The source publishes characteristics
+# grouped by label; we map those labels to one of three buckets in
+# the chunk preamble so retrieval gets coherent text. Group labels not
+# listed here fall into "other" and are still emitted, just in their
+# own section.
+DISEASE_GROUP_LABELS = {
+    "DISEASE RATINGS",
+    "PEST AND DISEASE RESISTANCE",
+}
+AGRONOMIC_GROUP_LABELS = {
+    "GROWTH",
+    "HARVEST",
+    "PRODUCTION",
+    "KEY CHARACTERISTICS",
+    "QUALITY",
+}
+MANAGEMENT_GROUP_LABELS = {
+    "MANAGEMENT",
+    "HERBICIDE",
+    "SENSITIVITY",
+    "PLANT DESCRIPTION",
+}


-def estimate_tokens(text: str) -> int:
-    return max(1, len(text) // CHARS_PER_TOKEN)
+def _parse_rm(value: object) -> int | None:
+    """Best-effort RM-days int. Returns None if not a clean integer
+    (e.g. wheat's qualitative 'Early'/'Medium-Early' values)."""
+    if value is None:
+        return None
+    s = str(value).strip()
+    if not s:
+        return None
+    try:
+        # Handle floats stored as strings ("105.0"); a fractional RM
+        # (tenths, sometimes seen on early corn) is truncated to an int.
+        return int(float(s))
+    except (ValueError, OverflowError):
+        return None


-def split_paragraphs(md: str) -> list[str]:
-    """Split markdown into paragraph-ish blocks.
+def _parse_mg(value: object) -> float | None:
+    """Best-effort MG float. Soy MGs go from 00 to 9.0 with one decimal."""
+    if value is None:
+        return None
+    s = str(value).strip()
+    if not s:
+        return None
+    try:
+        return float(s)
+    except ValueError:
+        return None

-    Keeps fenced code blocks together (don't slice through ```).
-    Headings start new paragraphs.
+
+def _format_items(items: list[dict]) -> str:
+    """Render `[{characteristic, value}, ...]` to a compact inline list."""
+    out: list[str] = []
+    for it in items:
+        ch = (it.get("characteristic") or "").strip()
+        v = "" if (raw := it.get("value")) is None else str(raw).strip()
+        if ch and v:
+            out.append(f"{ch} {v}")
+        elif ch:
+            out.append(f"{ch} —")
+    return ", ".join(out)
+
+
+def _render_variety_chunk(sidecar: dict) -> str:
+    """Build the dense preamble for one variety from its sidecar JSON.
+
+    Faithful to source: every numeric/categorical *value* is verbatim
+    from ``sidecar``. The only generated text is the framing language.
    """
-    blocks: list[str] = []
-    current: list[str] = []
-    in_fence = False
-    for line in md.splitlines(keepends=True):
-        stripped = line.strip()
-        if stripped.startswith("```"):
-            in_fence = not in_fence
-            current.append(line)
+    lines: list[str] = []
+
+    # ---- Identity line --------------------------------------------------
+    name = sidecar.get("product_name") or sidecar.get("source_key") or ""
+    brand = (sidecar.get("brand") or "").strip()
+    vendor = sidecar.get("vendor") or ""
+    crop = (sidecar.get("crop") or "").strip()
+    crop_label = crop.capitalize() if crop else ""
+    ident = f"# {name}"
+    sub = " ".join(filter(None, [
+        f"({brand.title()} {crop_label} variety, {vendor})" if brand and crop_label and vendor else "",
+    ]))
+    lines.append(ident)
+    if sub:
+        lines.append("")
+        lines.append(sub)
+
+    # ---- Identity body --------------------------------------------------
+    facts: list[str] = []
+
+    rm = sidecar.get("relative_maturity")
+    mg = sidecar.get("maturity_group")
+    wc = sidecar.get("wheat_class")
+    if crop == "corn" and rm:
+        facts.append(f"Relative maturity {rm}")
+    elif crop == "soybeans" and mg:
+        facts.append(f"Maturity group {mg}")
+    elif crop == "wheat":
+        if rm:
+            facts.append(f"Maturity {rm}")
+        if wc:
+            facts.append(f"Wheat class {wc}")
+
+    traits = sidecar.get("trait_stack") or []
+    trait_descs = sidecar.get("trait_descriptions") or []
+    if traits:
+        if trait_descs:
+            facts.append(
+                "Trait stack: "
+                + ", ".join(traits)
+                + " ("
+                + "; ".join(trait_descs)
+                + ")"
+            )
+        else:
+            facts.append("Trait stack: " + ", ".join(traits))
+
+    if sidecar.get("release_year"):
+        facts.append(f"Released {sidecar['release_year']}")
+
+    if facts:
+        lines.append("")
+        lines.append(". ".join(facts) + ".")
+
+    # ---- Positioning ----------------------------------------------------
+    pos = (sidecar.get("positioning_statement") or "").strip()
+    if pos:
+        lines.append("")
+        lines.append(f"Positioning: {pos}")
+
+    # ---- Ratings, bucketed for retrieval --------------------------------
+    scale = sidecar.get("_scale_direction") or "(scale direction not declared)"
+    groups = sidecar.get("characteristics_groups") or []
+    disease: list[dict] = []
+    agronomic: list[dict] = []
+    management: list[dict] = []
+    other: list[tuple[str, list[dict]]] = []
+    for g in groups:
+        label = (g.get("label") or "").upper().strip()
+        items = g.get("items") or []
+        if not items:
            continue
-        if in_fence:
-            current.append(line)
-            continue
-        if stripped.startswith("#"):
-            if current:
-                blocks.append("".join(current).strip())
-                current = []
-            current.append(line)
-            continue
-        if not stripped and current and not "".join(current).strip().endswith("\n\n"):
-            current.append(line)
-            blocks.append("".join(current).strip())
-            current = []
-            continue
-        current.append(line)
-    if current:
-        blocks.append("".join(current).strip())
-    return [b for b in blocks if b]
+        if label in DISEASE_GROUP_LABELS:
+            disease.extend(items)
+        elif label in AGRONOMIC_GROUP_LABELS:
+            agronomic.extend(items)
+        elif label in MANAGEMENT_GROUP_LABELS:
+            management.extend(items)
+        else:
+            other.append((g.get("label") or "Other", items))
+
+    if disease:
+        lines.append("")
+        lines.append(f"Disease ratings ({scale}): {_format_items(disease)}.")
+    if agronomic:
+        lines.append("")
+        lines.append(f"Agronomic ratings ({scale}): {_format_items(agronomic)}.")
+    if management:
+        lines.append("")
+        lines.append(f"Management notes: {_format_items(management)}.")
+    for label, items in other:
+        lines.append("")
+        lines.append(f"{label.title()}: {_format_items(items)}.")
+
+    # ---- Strengths narrative --------------------------------------------
+    strengths = sidecar.get("strengths") or []
+    if strengths:
+        lines.append("")
+        lines.append("Strengths and management notes:")
+        for s in strengths:
+            s = (s or "").strip()
+            if s:
+                lines.append(f"- {s}")
+
+    # ---- Regional listings (compact, not the agronomist emails) ---------
+    rec = sidecar.get("regional_recommendations") or []
+    if rec:
+        names = sorted({
+            (r.get("product_list_name") or "").strip()
+            for r in rec
+            if (r.get("product_list_name") or "").strip()
+        })
+        if names:
+            lines.append("")
+            lines.append("Listed in regional seed guides: " + "; ".join(names) + ".")
+
+    # ---- Provenance footer (must always be in the chunk text so it
+    #      can never be lost between retrieval and LLM rendering) --------
+    urls = sidecar.get("source_urls") or []
+    if urls:
+        lines.append("")
+        lines.append(f"Source: {urls[0]}")
+
+    return "\n".join(lines).strip() + "\n"


-def chunks_from_page(
-    text: str,
-    page_id: str,
-    metadata: dict,
+def _flat_metadata(sidecar: dict) -> dict:
+    """Distil sidecar into Chroma-safe metadata (primitives only)."""
+    md: dict = {
+        "source": sidecar.get("source") or "",
+        "source_key": sidecar.get("source_key") or "",
+        "vendor": sidecar.get("vendor") or "",
+        "brand": sidecar.get("brand") or "",
+        "crop": sidecar.get("crop") or "",
+        "product_name": sidecar.get("product_name") or "",
+        "product_id": sidecar.get("product_id") or "",
+        "source_url": (sidecar.get("source_urls") or [""])[0],
+        "rating_scale": sidecar.get("_scale_direction") or "",
+    }
+    rm = _parse_rm(sidecar.get("relative_maturity"))
+    mg = _parse_mg(sidecar.get("maturity_group"))
+    if rm is not None:
+        md["rm"] = rm
+    if mg is not None:
+        md["mg"] = mg
+    ry = sidecar.get("release_year")
+    if isinstance(ry, int):
+        md["release_year"] = ry
+    traits = sidecar.get("trait_stack") or []
+    if traits:
+        # Comma-delimited for partial-match / human eyeballing.
+        # Comma-padded so `LIKE '%,XF,%'` finds whole tokens.
+        md["trait_codes_csv"] = "," + ",".join(traits) + ","
+    if sidecar.get("wheat_class"):
+        md["wheat_class"] = sidecar["wheat_class"]
+    return md
+
+
+def chunks_from_variety(
+    sidecar_path: Path | str,
+    *,
+    md_path: Path | str | None = None,
 ) -> Iterator[dict]:
-    """Yield chunk dicts ready for index.py to upsert.
+    """Yield chunk dict(s) for one variety. Currently emits exactly one.

-    The synthetic chunk 0 is the per-product customization point. The
-    default below is a simple title + body-first-paragraph; rewrite
-    for richer retrieval signal (see module docstring).
+    Args:
+      sidecar_path: path to the variety's JSON sidecar.
+      md_path:      ignored (the chunker rebuilds from sidecar); kept
+                    in the signature in case a future split-chunker
+                    wants the rendered body.
    """
-    paragraphs = split_paragraphs(text)
-    if not paragraphs:
-        return
-
-    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
-    title = metadata.get("title") or page_id
-    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
-    chunk0_body = (
-        f"# {title}\n\n"
-        f"{first_para[:300]}"
-        # TODO per product: append a keyword bag here (filenames,
-        # API names, error codes) for BM25 + dense joint coverage.
-    )
+    sidecar = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))
+    text = _render_variety_chunk(sidecar)
+    meta = _flat_metadata(sidecar)
+    chunk_id = f"{meta['source']}::{meta['source_key']}::0"
    yield {
-        "id":       f"{metadata['bundle_id']}::{page_id}::0",
-        "text":     chunk0_body,
-        "metadata": {**metadata, "ordinal": 0},
+        "id": chunk_id,
+        "text": text,
+        "metadata": {**meta, "ordinal": 0},
    }

-    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
-    ordinal = 1
-    buf: list[str] = []
-    buf_chars = 0
-    for p in paragraphs:
-        if buf_chars + len(p) > TARGET_CHARS and buf:
-            yield {
-                "id":       f"{metadata['bundle_id']}::{page_id}::{ordinal}",
-                "text":     "\n\n".join(buf),
-                "metadata": {**metadata, "ordinal": ordinal},
-            }
-            ordinal += 1
-            buf = []
-            buf_chars = 0
-        buf.append(p)
-        buf_chars += len(p)
-    if buf:
-        yield {
-            "id":       f"{metadata['bundle_id']}::{page_id}::{ordinal}",
-            "text":     "\n\n".join(buf),
-            "metadata": {**metadata, "ordinal": ordinal},
-        }
+
+# ----- Backwards-compat shim for the template's index.py -------------------
+#
+# The template's ``rag.index.page_records`` calls
+# ``chunks_from_page(md, page_id, base_meta)`` which doesn't know about
+# sidecar JSON. We still accept that signature for compatibility; index.py has
+# been updated to use ``chunks_from_variety`` directly, and this shim
+# is here only so a stray import of the old name doesn't break.
+#
+def chunks_from_page(text: str, page_id: str, metadata: dict) -> Iterator[dict]:
+    """Deprecated for seed-mcp; prefer ``chunks_from_variety``."""
+    # Best-effort: if metadata carries a sidecar_path, dispatch.
+    sidecar_path = metadata.get("_sidecar_path")
+    if sidecar_path:
+        yield from chunks_from_variety(sidecar_path)
+        return
+    # Fallback — emit a single chunk of the raw markdown with whatever
+    # metadata we have. Better than crashing if someone calls this.
+    chunk_id = f"{metadata.get('source', 'unknown')}::{page_id}::0"
+    yield {
+        "id": chunk_id,
+        "text": text,
+        "metadata": {**metadata, "ordinal": 0},
+    }
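
The `trait_codes_csv` value built in `_flat_metadata` is wrapped in a leading and trailing comma so that a substring test can match whole trait codes only. A small sketch of why the padding matters is below. The helper names and the trait codes "XF2" and "ABC" are hypothetical placeholders, not values from the corpus.

```python
def pad_traits(traits: list[str]) -> str:
    """Mirror the padding used for trait_codes_csv: ',A,B,'.

    The chunker only sets the field when the list is non-empty;
    this helper assumes the same.
    """
    return "," + ",".join(traits) + ","


def has_trait(trait_codes_csv: str, code: str) -> bool:
    """Whole-token test: the code must be delimited by commas on both sides."""
    return f",{code}," in trait_codes_csv


# Hypothetical trait stack for illustration only.
padded = pad_traits(["XF2", "ABC"])

# A naive substring test on the unpadded CSV matches a prefix of another code.
naive = "XF" in "XF2,ABC"

# The padded test does not, because ",XF," never occurs in ",XF2,ABC,".
exact = has_trait(padded, "XF")

print(padded, naive, exact)
```

The same delimiter trick carries over to a SQL `LIKE '%,XF,%'` pattern, which is the use the code comment in `_flat_metadata` describes.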