Phase 2/3: chunker + indexer + MCP server tools
Phase 2 — Chunking and indexing
- rag/chunk.py: replace template chunker with seed-variety-specific
chunks_from_variety(). One chunk per variety (varieties are small
and named-rating retrieval signal is best kept together). Output
is rebuilt deterministically from the sidecar JSON: every value is
verbatim from the source, only framing language ("Disease ratings
(1-9, 9=best):") is template glue. Anti-hallucination contract:
same sidecar in → same chunk out, never a fabricated rating.
Metadata flattened to Chroma-safe primitives (str/int/float/bool):
source, source_key, vendor, brand, crop, product_name,
product_id, source_url, rm (corn), mg (soy), wheat_class,
release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars
via the new chunker. Default PRODUCT_NAME=crop_seed so the
Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
(source/vendor/brand/crop/source_key) instead of the template's
version/platform/product.
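
A minimal sketch of the trait_codes_csv padding mentioned in the metadata list above. This is not the code from rag/chunk.py: the helper names flatten_traits and has_trait are made up for illustration, and the trait codes in the usage note are placeholders. The idea is that a leading and trailing comma let a plain substring test match whole codes only.

```python
def flatten_traits(traits: list[str]) -> str:
    """Join trait codes into one string with a leading and trailing comma.

    Returns "" for an empty list so the key can simply be skipped.
    """
    if not traits:
        return ""
    return "," + ",".join(traits) + ","


def has_trait(trait_codes_csv: str, code: str) -> bool:
    """Whole-token test: ',XF,' matches the code 'XF' but not 'XFE'."""
    return f",{code}," in trait_codes_csv
```

Usage, with placeholder codes: has_trait(flatten_traits(["RR2X", "XF"]), "XF") is True, while has_trait(flatten_traits(["XFE"]), "XF") is False because the padded form ",XF," does not occur in ",XFE,".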
Phase 3 — MCP server tools wired up
- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with
RRF fusion. Optional filters: crop, brand, vendor, source.
Variety-code prefilter pins exact source_key / product_name /
hybrid_prefix matches at the top — dense embeddings have no
semantic neighbor for tokens like "DKC62-08RIB" and RRF can let
noise float to #1 without this pin. Each response carries the
variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header
(verbatim from sidecar, table per characteristics group, vendor
positioning, regional listings) followed by the raw indexed body.
This is the canonical fact-check surface.
- list_versions(): facet discovery — distinct sources, vendors,
brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON
for one variety. The agent should call this BEFORE quoting any
specific rating value to a farmer — guaranteed verbatim.
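
A rough sketch of how RRF fusion plus an exact-match pin can be combined, under stated assumptions. It is not the server's implementation: the function names rrf_fuse and fuse_with_pins are hypothetical, and k=60 is only the commonly used RRF constant, not a value taken from this commit. Each ranked list contributes 1 / (k + rank) per document; pinned ids (exact variety-code matches) are placed first and the fused order follows.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over several ranked id lists (best first)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def fuse_with_pins(
    dense: list[str], bm25: list[str], pinned: list[str], k: int = 60
) -> list[str]:
    """Put exact-match ids first, then the RRF order of everything else."""
    pins = list(dict.fromkeys(pinned))  # de-duplicate, keep order
    pin_set = set(pins)
    fused = rrf_fuse([dense, bm25], k=k)
    return pins + [d for d in fused if d not in pin_set]
```

Without the pin, an id that appears low in only one list (or in neither list's top results) cannot reach #1; with the pin it is returned first regardless of its fused score, which is the behaviour the variety-code prefilter is described as providing.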
Smoke tests against 475 indexed Bayer varieties:
- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season
drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band);
SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings;
Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4,
WB6430 all return as #1 hit. BM25 confirms ranking unambiguously
(top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings").
- Server boots cleanly in stdio AND streamable-http modes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+298
-100
@@ -1,126 +1,324 @@
"""Markdown chunker — paragraph-aware, ~400-600 token target.
"""Chunker for seed-variety corpus.

Adjust the chunking strategy per product if your page format differs
significantly from prose. The output shape (id, text, metadata) is
fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
change that.
Each variety becomes ONE chunk by default. Variety pages are small
(typically 2-3 KB of useful signal) and nomic-embed-text handles up
to ~8 K tokens cleanly. Splitting a variety across chunks dilutes
the named-rating embeddings (e.g. "SCN resistance 7") that farmers
search by — keep them together.

The key knob you'll tune per product is chunk-0. Dense retrieval lands
on chunk 0 first for most queries. Make it a synthetic chunk built
from:
The chunk text is a synthetic preamble assembled deterministically
from the sidecar JSON. Every value in the chunk text comes verbatim
from the source. The framing words ("Disease ratings (1-9, 9=best):",
"Maturity group:", etc.) are template glue — *we add structure, we
do NOT add facts*. Given the same sidecar, this chunker always
produces the same chunk text. That's the anti-hallucination
contract: the retriever can never surface a rating value that
wasn't in the source.

  - the page title (as natural-language H1)
  - a 1-sentence task description (you'll have to generate this — for
    pages that already have a "## Overview" or "## Introduction" the
    first sentence usually works)
  - a keyword bag of important terms (filenames, API names, error
    codes — the rare technical tokens that BM25 lights up on)
Metadata is flattened to Chroma-safe primitives (str/int/float/bool):

Without a rich chunk 0, dense retrieval gets dominated by the much
larger prose body, and short pages (script examples, reference cards)
get buried.
    source          "bayer_seeds"
    source_key      "dekalb-dkc075-70rib"
    vendor          "Bayer"
    brand           "DEKALB"
    crop            "corn" | "soybeans" | "wheat"
    product_name    "DKC075-70RIB BRAND BLEND"
    product_id      canonical full id
    source_url      the variety's page URL
    rm              corn RM as int when parseable (else absent)
    mg              soy MG as float when parseable (else absent)
    release_year    int when known
    trait_codes_csv comma-separated trait codes for substring search
    rating_scale    "1-9 (9 = best)" — chunker should ALWAYS attach
                    this so downstream code can detect a flip
    ordinal         chunk index within variety (0-based)

Lists like ``regional_recommendations`` and the full per-rating dicts
do NOT fit Chroma's metadata constraints — they stay in the sidecar
JSON, surfaced by ``get_page`` / ``lookup_variety``.
"""
from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Iterator


# Approximate token estimate from char count. Tunable — set per
# embedder if the default 4 chars/token is wrong.
CHARS_PER_TOKEN = 4
TARGET_TOKENS = 500
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
# Rating-group classification. The source publishes characteristics
# grouped by label; we map those labels to one of three buckets in
# the chunk preamble so retrieval gets coherent text. Group labels not
# listed here fall into "other" and are still emitted, just in their
# own section.
DISEASE_GROUP_LABELS = {
    "DISEASE RATINGS",
    "PEST AND DISEASE RESISTANCE",
}
AGRONOMIC_GROUP_LABELS = {
    "GROWTH",
    "HARVEST",
    "PRODUCTION",
    "KEY CHARACTERISTICS",
    "QUALITY",
}
MANAGEMENT_GROUP_LABELS = {
    "MANAGEMENT",
    "HERBICIDE",
    "SENSITIVITY",
    "PLANT DESCRIPTION",
}


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)
def _parse_rm(value: object) -> int | None:
    """Best-effort RM-days int. Returns None if not a clean integer
    (e.g. wheat's qualitative 'Early'/'Medium-Early' values)."""
    if value is None:
        return None
    s = str(value).strip()
    if not s:
        return None
    try:
        # Handle floats stored as strings ("105.0") and the trailing
        # tenths sometimes seen on early corn ("75").
        return int(float(s))
    except ValueError:
        return None


def split_paragraphs(md: str) -> list[str]:
    """Split markdown into paragraph-ish blocks.
def _parse_mg(value: object) -> float | None:
    """Best-effort MG float. Soy MGs go from 00 to 9.0 with one decimal."""
    if value is None:
        return None
    s = str(value).strip()
    if not s:
        return None
    try:
        return float(s)
    except ValueError:
        return None

    Keeps fenced code blocks together (don't slice through ```).
    Headings start new paragraphs.

def _format_items(items: list[dict]) -> str:
    """Render `[{characteristic, value}, ...]` to a compact inline list."""
    out: list[str] = []
    for it in items:
        ch = (it.get("characteristic") or "").strip()
        v = (it.get("value") or "").strip()
        if ch and v:
            out.append(f"{ch} {v}")
        elif ch:
            out.append(f"{ch} —")
    return ", ".join(out)


def _render_variety_chunk(sidecar: dict) -> str:
    """Build the dense preamble for one variety from its sidecar JSON.

    Faithful to source: every numeric/categorical *value* is verbatim
    from ``sidecar``. The only generated text is the framing language.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            current.append(line)
    lines: list[str] = []

    # ---- Identity line --------------------------------------------------
    name = sidecar.get("product_name") or sidecar.get("source_key") or ""
    brand = (sidecar.get("brand") or "").strip()
    vendor = sidecar.get("vendor") or ""
    crop = (sidecar.get("crop") or "").strip()
    crop_label = crop.capitalize() if crop else ""
    ident = f"# {name}"
    sub = " ".join(filter(None, [
        f"({brand.title()} {crop_label} variety, {vendor})" if brand and crop_label and vendor else "",
    ]))
    lines.append(ident)
    if sub:
        lines.append("")
        lines.append(sub)

    # ---- Identity body --------------------------------------------------
    facts: list[str] = []

    rm = sidecar.get("relative_maturity")
    mg = sidecar.get("maturity_group")
    wc = sidecar.get("wheat_class")
    if crop == "corn" and rm:
        facts.append(f"Relative maturity {rm}")
    elif crop == "soybeans" and mg:
        facts.append(f"Maturity group {mg}")
    elif crop == "wheat":
        if rm:
            facts.append(f"Maturity {rm}")
        if wc:
            facts.append(f"Wheat class {wc}")

    traits = sidecar.get("trait_stack") or []
    trait_descs = sidecar.get("trait_descriptions") or []
    if traits:
        if trait_descs:
            facts.append(
                "Trait stack: "
                + ", ".join(traits)
                + " ("
                + "; ".join(trait_descs)
                + ")"
            )
        else:
            facts.append("Trait stack: " + ", ".join(traits))

    if sidecar.get("release_year"):
        facts.append(f"Released {sidecar['release_year']}")

    if facts:
        lines.append("")
        lines.append(". ".join(facts) + ".")

    # ---- Positioning ----------------------------------------------------
    pos = (sidecar.get("positioning_statement") or "").strip()
    if pos:
        lines.append("")
        lines.append(f"Positioning: {pos}")

    # ---- Ratings, bucketed for retrieval --------------------------------
    scale = sidecar.get("_scale_direction") or "(scale direction not declared)"
    groups = sidecar.get("characteristics_groups") or []
    disease: list[dict] = []
    agronomic: list[dict] = []
    management: list[dict] = []
    other: list[tuple[str, list[dict]]] = []
    for g in groups:
        label = (g.get("label") or "").upper().strip()
        items = g.get("items") or []
        if not items:
            continue
        if in_fence:
            current.append(line)
            continue
        if stripped.startswith("#"):
            if current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
            continue
        if not stripped and current and not "".join(current).strip().endswith("\n\n"):
            current.append(line)
            blocks.append("".join(current).strip())
            current = []
            continue
        current.append(line)
    if current:
        blocks.append("".join(current).strip())
    return [b for b in blocks if b]
        if label in DISEASE_GROUP_LABELS:
            disease.extend(items)
        elif label in AGRONOMIC_GROUP_LABELS:
            agronomic.extend(items)
        elif label in MANAGEMENT_GROUP_LABELS:
            management.extend(items)
        else:
            other.append((g.get("label") or "Other", items))

    if disease:
        lines.append("")
        lines.append(f"Disease ratings ({scale}): {_format_items(disease)}.")
    if agronomic:
        lines.append("")
        lines.append(f"Agronomic ratings ({scale}): {_format_items(agronomic)}.")
    if management:
        lines.append("")
        lines.append(f"Management notes: {_format_items(management)}.")
    for label, items in other:
        lines.append("")
        lines.append(f"{label.title()}: {_format_items(items)}.")

    # ---- Strengths narrative --------------------------------------------
    strengths = sidecar.get("strengths") or []
    if strengths:
        lines.append("")
        lines.append("Strengths and management notes:")
        for s in strengths:
            s = (s or "").strip()
            if s:
                lines.append(f"- {s}")

    # ---- Regional listings (compact, not the agronomist emails) ---------
    rec = sidecar.get("regional_recommendations") or []
    if rec:
        names = sorted({
            (r.get("product_list_name") or "").strip()
            for r in rec
            if (r.get("product_list_name") or "").strip()
        })
        if names:
            lines.append("")
            lines.append("Listed in regional seed guides: " + "; ".join(names) + ".")

    # ---- Provenance footer (must always be in the chunk text so it
    #      can never be lost between retrieval and LLM rendering) --------
    urls = sidecar.get("source_urls") or []
    if urls:
        lines.append("")
        lines.append(f"Source: {urls[0]}")

    return "\n".join(lines).strip() + "\n"


def chunks_from_page(
    text: str,
    page_id: str,
    metadata: dict,
def _flat_metadata(sidecar: dict) -> dict:
    """Distil sidecar into Chroma-safe metadata (primitives only)."""
    md: dict = {
        "source": sidecar.get("source") or "",
        "source_key": sidecar.get("source_key") or "",
        "vendor": sidecar.get("vendor") or "",
        "brand": sidecar.get("brand") or "",
        "crop": sidecar.get("crop") or "",
        "product_name": sidecar.get("product_name") or "",
        "product_id": sidecar.get("product_id") or "",
        "source_url": (sidecar.get("source_urls") or [""])[0],
        "rating_scale": sidecar.get("_scale_direction") or "",
    }
    rm = _parse_rm(sidecar.get("relative_maturity"))
    mg = _parse_mg(sidecar.get("maturity_group"))
    if rm is not None:
        md["rm"] = rm
    if mg is not None:
        md["mg"] = mg
    ry = sidecar.get("release_year")
    if isinstance(ry, int):
        md["release_year"] = ry
    traits = sidecar.get("trait_stack") or []
    if traits:
        # Comma-delimited for partial-match / human eyeballing.
        # Padded with a leading and trailing comma so `LIKE '%,XF,%'`
        # finds whole tokens.
        md["trait_codes_csv"] = "," + ",".join(traits) + ","
    if sidecar.get("wheat_class"):
        md["wheat_class"] = sidecar["wheat_class"]
    return md


def chunks_from_variety(
    sidecar_path: Path | str,
    *,
    md_path: Path | str | None = None,
) -> Iterator[dict]:
    """Yield chunk dicts ready for index.py to upsert.
    """Yield chunk dict(s) for one variety. Currently emits exactly one.

    The synthetic chunk 0 is the per-product customization point. The
    default below is a simple title + body-first-paragraph; rewrite
    for richer retrieval signal (see module docstring).
    Args:
        sidecar_path: path to the variety's JSON sidecar.
        md_path: ignored (the chunker rebuilds from sidecar); kept
            in the signature in case a future split-chunker
            wants the rendered body.
    """
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return

    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
    title = metadata.get("title") or page_id
    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
    chunk0_body = (
        f"# {title}\n\n"
        f"{first_para[:300]}"
        # TODO per product: append a keyword bag here (filenames,
        # API names, error codes) for BM25 + dense joint coverage.
    )
    sidecar = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))
    text = _render_variety_chunk(sidecar)
    meta = _flat_metadata(sidecar)
    chunk_id = f"{meta['source']}::{meta['source_key']}::0"
    yield {
        "id": f"{metadata['bundle_id']}::{page_id}::0",
        "text": chunk0_body,
        "metadata": {**metadata, "ordinal": 0},
        "id": chunk_id,
        "text": text,
        "metadata": {**meta, "ordinal": 0},
    }

    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
    ordinal = 1
    buf: list[str] = []
    buf_chars = 0
    for p in paragraphs:
        if buf_chars + len(p) > TARGET_CHARS and buf:
            yield {
                "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
                "text": "\n\n".join(buf),
                "metadata": {**metadata, "ordinal": ordinal},
            }
            ordinal += 1
            buf = []
            buf_chars = 0
        buf.append(p)
        buf_chars += len(p)
    if buf:
        yield {
            "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
            "text": "\n\n".join(buf),
            "metadata": {**metadata, "ordinal": ordinal},
        }

# ----- Backwards-compat shim for the template's index.py -------------------
#
# The template's ``rag.index.page_records`` calls
# ``chunks_from_page(md, page_id, base_meta)`` which doesn't know about
# sidecar JSON. We accept that signature but ignore it — index.py has
# been updated to use ``chunks_from_variety`` directly, and this shim
# is here only so a stray import of the old name doesn't break.
#
def chunks_from_page(text: str, page_id: str, metadata: dict) -> Iterator[dict]:
    """Deprecated for seed-mcp; prefer ``chunks_from_variety``."""
    # Best-effort: if metadata carries a sidecar_path, dispatch.
    sidecar_path = metadata.get("_sidecar_path")
    if sidecar_path:
        yield from chunks_from_variety(sidecar_path)
        return
    # Fallback — emit a single chunk of the raw markdown with whatever
    # metadata we have. Better than crashing if someone calls this.
    chunk_id = f"{metadata.get('source','unknown')}::{page_id}::0"
    yield {
        "id": chunk_id,
        "text": text,
        "metadata": {**metadata, "ordinal": 0},
    }