Phase 2/3: chunker + indexer + MCP server tools
Phase 2 — Chunking and indexing
- rag/chunk.py: replace template chunker with seed-variety-specific
chunks_from_variety(). One chunk per variety (varieties are small
and named-rating retrieval signal is best kept together). Output
is rebuilt deterministically from the sidecar JSON: every value is
verbatim from the source, only framing language ("Disease ratings
(1-9, 9=best):") is template glue. Anti-hallucination contract:
same sidecar in → same chunk out, never a fabricated rating.
Metadata flattened to Chroma-safe primitives (str/int/float/bool):
source, source_key, vendor, brand, crop, product_name,
product_id, source_url, rm (corn), mg (soy), wheat_class,
release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars
via the new chunker. Default PRODUCT_NAME=crop_seed so the
Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
(source/vendor/brand/crop/source_key) instead of the template's
version/platform/product.
Phase 3 — MCP server tools wired up
- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with
RRF fusion. Optional filters: crop, brand, vendor, source.
Variety-code prefilter pins exact source_key / product_name /
hybrid_prefix matches at the top — dense embeddings have no
semantic neighbor for tokens like "DKC62-08RIB" and RRF can let
noise float to #1 without this pin. Each response carries the
variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header
(verbatim from sidecar, table per characteristics group, vendor
positioning, regional listings) followed by the raw indexed body.
This is the canonical fact-check surface.
- list_versions(): facet discovery — distinct sources, vendors,
brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON
for one variety. The agent should call this BEFORE quoting any
specific rating value to a farmer — guaranteed verbatim.
Smoke tests against 475 indexed Bayer varieties:
- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season
drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band);
SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings;
Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4,
WB6430 all return as #1 hit. BM25 confirms ranking unambiguously
(top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings").
- Server boots cleanly in stdio AND streamable-http modes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+607
-132
@@ -1,20 +1,20 @@
|
||||
"""MCP server skeleton — fill in PRODUCT_NAME and the tool bodies.
|
||||
"""seed-mcp — MCP server over public US row-crop seed catalogs.
|
||||
|
||||
This file is the template's structural anchor. The phases described in
|
||||
PLAN.md add or extend pieces of this file:
|
||||
Tools (all read-only):
|
||||
|
||||
Phase 3 — search_docs, get_page, list_versions stubs (you are here)
|
||||
Phase 6 — reranker integration in search_docs
|
||||
Phase 8 — BM25 + hybrid retrieval (HYBRID_SEARCH env gate, _rrf_fuse)
|
||||
Phase 9 — diff_versions, list_cluster, bundle_changelog
|
||||
Phase 10 — TimedCall wiring (already imported below)
|
||||
Phase 11 — <product>_api_lessons tool
|
||||
Phase 12 — find_doc_inconsistencies, submit_doc_bug
|
||||
Phase 13 — weekly_digest + _digest_history reader
|
||||
search_docs natural-language retrieval over the corpus
|
||||
get_page the full markdown body of one variety + sidecar
|
||||
list_versions facet discovery (crops / brands / vendors / sources)
|
||||
lookup_variety canonical sidecar JSON by source_key — fact-check
|
||||
anything you surface from search_docs against this
|
||||
|
||||
Every stub below has a docstring + `raise NotImplementedError`. Replace
|
||||
the body when you reach the corresponding phase. Keep the signatures
|
||||
stable across products — clients depend on them.
|
||||
The contract with the calling agent is **never fabricate**. Every chunk
|
||||
we return is verbatim from the source's published catalog (the chunker
|
||||
rebuilds chunks deterministically from the sidecar JSON). Every
|
||||
response carries the variety's source URL so the agent can cite. The
|
||||
``lookup_variety`` tool exists specifically so the agent can validate
|
||||
specific rating values without having to trust paraphrased retrieval
|
||||
text.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
@@ -23,7 +23,7 @@ import logging
|
||||
import os
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Annotated
|
||||
from typing import Annotated, Any
|
||||
|
||||
from mcp.server.fastmcp import FastMCP
|
||||
from pydantic import Field
|
||||
@@ -33,7 +33,7 @@ from .usage import TimedCall
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Product-specific configuration. Set these for each new build.
|
||||
# Product-specific configuration.
|
||||
# ---------------------------------------------------------------------------
|
||||
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "crop_seed")
|
||||
PRODUCT_DOCS_URL = os.environ.get("PRODUCT_DOCS_URL", "https://git.jpaul.io/justin/seed-mcp")
|
||||
@@ -44,59 +44,176 @@ ROOT = Path(__file__).resolve().parent.parent
|
||||
CORPUS = ROOT / "corpus"
|
||||
CHROMA_DIR = ROOT / "chroma"
|
||||
BM25_DB = Path(os.environ.get("BM25_DB", str(ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
|
||||
BUNDLES_JSON = ROOT / "bundles.json"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Feature flags (Phase 6 / 8 / 12 enable these as you ship each phase).
|
||||
# Feature flags.
|
||||
# ---------------------------------------------------------------------------
|
||||
RERANK_URL = os.environ.get("RERANK_URL", "").rstrip("/") or None
|
||||
RERANK_POOL = int(os.environ.get("RERANK_POOL", "50"))
|
||||
RERANK_TIMEOUT = float(os.environ.get("RERANK_TIMEOUT", "30"))
|
||||
|
||||
HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "").lower() in ("true", "1", "yes", "on")
|
||||
HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "true").lower() in ("true", "1", "yes", "on")
|
||||
RRF_K = int(os.environ.get("RRF_K", "60"))
|
||||
|
||||
DOC_BUG_SUBMIT_ENABLED = os.environ.get("DOC_BUG_SUBMIT_ENABLED", "").lower() in ("true", "1", "yes", "on")
|
||||
DOC_BUG_API_URL = os.environ.get("DOC_BUG_API_URL", "") # product-specific endpoint
|
||||
DOC_BUG_TIMEOUT = float(os.environ.get("DOC_BUG_TIMEOUT", "15"))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# FastMCP setup.
|
||||
#
|
||||
# stateless_http=True — every request creates an ephemeral session and
|
||||
# discards it on return. Critical for production: clients don't get
|
||||
# 404 storms when the container is recreated by Watchtower.
|
||||
# ---------------------------------------------------------------------------
|
||||
mcp = FastMCP(f"{PRODUCT_NAME}-docs", stateless_http=True)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Lazy helpers — instantiate expensive things only when actually needed,
|
||||
# so the server still starts when (e.g.) Ollama is briefly unreachable.
|
||||
# Lazy singletons. Instantiate on first use so the server can start even
|
||||
# when (e.g.) Ollama is briefly unreachable — degraded modes fall back
|
||||
# gracefully rather than refusing to boot.
|
||||
# ---------------------------------------------------------------------------
|
||||
_chroma_client = None
|
||||
_chroma_collection = None
|
||||
_bm25 = None
|
||||
_code_index: dict[str, list[tuple[str, str]]] | None = None
|
||||
|
||||
|
||||
def _build_code_index() -> dict[str, list[tuple[str, str]]]:
|
||||
"""Walk sidecars once and index varieties by every lookup key a
|
||||
farmer or LLM might paste: ``source_key``, ``hybrid_prefix``,
|
||||
``product_name`` (normalized), each token from
|
||||
``hybrid_prefix``/``product_name`` longer than three chars.
|
||||
|
||||
Returns a dict mapping the normalized key → list of
|
||||
``(source, source_key)`` tuples. Multiple-source matches are kept
|
||||
so we can warn the agent about ambiguity.
|
||||
"""
|
||||
idx: dict[str, list[tuple[str, str]]] = {}
|
||||
|
||||
def _add(key: str, value: tuple[str, str]) -> None:
|
||||
key = (key or "").lower().strip()
|
||||
if not key or len(key) < 4:
|
||||
return
|
||||
idx.setdefault(key, [])
|
||||
if value not in idx[key]:
|
||||
idx[key].append(value)
|
||||
|
||||
if not CORPUS.exists():
|
||||
return idx
|
||||
for source_dir in CORPUS.iterdir():
|
||||
if not source_dir.is_dir() or source_dir.name.startswith("."):
|
||||
continue
|
||||
for sc in source_dir.glob("*.json"):
|
||||
try:
|
||||
d = json.loads(sc.read_text(encoding="utf-8"))
|
||||
except (OSError, json.JSONDecodeError):
|
||||
continue
|
||||
source = d.get("source") or source_dir.name
|
||||
sk = d.get("source_key") or sc.stem
|
||||
value = (source, sk)
|
||||
_add(sk, value)
|
||||
_add(d.get("hybrid_prefix") or "", value)
|
||||
_add(d.get("product_name") or "", value)
|
||||
# Token-split product_name and hybrid_prefix so "DKC62-08RIB"
|
||||
# in a query matches "DKC62-08RIB BRAND BLEND" via the
|
||||
# "DKC62-08RIB" token as well as the full string.
|
||||
for source_field in (d.get("product_name"), d.get("hybrid_prefix")):
|
||||
if not source_field:
|
||||
continue
|
||||
for tok in re.split(r"[\s()/,]+", source_field):
|
||||
tok = tok.strip()
|
||||
# Only retain tokens that LOOK like codes — at least
|
||||
# one digit, mostly alphanumeric/dash. Skips "BRAND"
|
||||
# / "BLEND" / etc.
|
||||
if re.match(r"^[A-Za-z]*\d[\w\-]*$", tok):
|
||||
_add(tok, value)
|
||||
return idx
|
||||
|
||||
|
||||
def _code_lookup() -> dict[str, list[tuple[str, str]]]:
|
||||
global _code_index
|
||||
if _code_index is None:
|
||||
_code_index = _build_code_index()
|
||||
log.info("code-index: %d lookup keys built", len(_code_index))
|
||||
return _code_index
|
||||
|
||||
|
||||
# Tokens that LOOK like a variety code — at least one digit, otherwise
|
||||
# alphanumeric / dash. Catches "DKC62-08RIB", "AG29XF4", "WB1376CLP",
|
||||
# "VT2PRIB"; skips ordinary words like "ratings".
|
||||
_CODE_TOKEN_RE = re.compile(r"\b([A-Za-z][A-Za-z0-9\-]{2,})\b")
|
||||
|
||||
|
||||
def _exact_code_matches(query: str) -> list[tuple[str, str]]:
|
||||
"""Find varieties whose source_key / product_name / token-split
|
||||
components contain a query token. Used as a high-confidence pin
|
||||
before fuzzy retrieval."""
|
||||
idx = _code_lookup()
|
||||
if not idx:
|
||||
return []
|
||||
out: list[tuple[str, str]] = []
|
||||
seen: set[tuple[str, str]] = set()
|
||||
for m in _CODE_TOKEN_RE.finditer(query or ""):
|
||||
tok = m.group(1).lower()
|
||||
if len(tok) < 4 or not any(c.isdigit() for c in tok):
|
||||
continue
|
||||
for v in idx.get(tok, []):
|
||||
if v not in seen:
|
||||
seen.add(v)
|
||||
out.append(v)
|
||||
return out
|
||||
|
||||
|
||||
def _collection():
|
||||
"""Return the Chroma collection, opening it lazily."""
|
||||
global _chroma_client, _chroma_collection
|
||||
if _chroma_collection is not None:
|
||||
return _chroma_collection
|
||||
import chromadb
|
||||
from chromadb.config import Settings
|
||||
from rag.embeddings import embedding_function
|
||||
|
||||
_chroma_client = chromadb.PersistentClient(
|
||||
path=str(CHROMA_DIR),
|
||||
settings=Settings(anonymized_telemetry=False),
|
||||
)
|
||||
_chroma_collection = _chroma_client.get_collection(
|
||||
COLLECTION, embedding_function=embedding_function()
|
||||
)
|
||||
return _chroma_collection
|
||||
|
||||
|
||||
def _bm25_index():
|
||||
"""Return the BM25 index, or None if it doesn't exist on disk."""
|
||||
global _bm25
|
||||
if _bm25 is not None:
|
||||
return _bm25
|
||||
from rag.bm25 import BM25Index
|
||||
idx = BM25Index(BM25_DB)
|
||||
if not idx.exists():
|
||||
return None
|
||||
_bm25 = idx
|
||||
return _bm25
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers — local file IO + filter builder.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _bundles() -> dict[str, dict]:
|
||||
"""Cached load of bundles.json into a {slug: bundle_dict} mapping.
|
||||
|
||||
bundles.json is the product-specific catalog written by the Phase 1
|
||||
scraper. See PLAN.md Phase 1 for the schema.
|
||||
"""
|
||||
if not BUNDLES_JSON.exists():
|
||||
return {}
|
||||
cat = json.loads(BUNDLES_JSON.read_text())
|
||||
return {b["slug"]: b for b in cat}
|
||||
|
||||
|
||||
def _build_where(version: str | None, platform: str | None, bundle_id: str | None) -> dict | None:
|
||||
def _build_where(
|
||||
crop: str | None,
|
||||
brand: str | None,
|
||||
vendor: str | None,
|
||||
source: str | None,
|
||||
source_key: str | None,
|
||||
) -> dict | None:
|
||||
"""Translate filter args into a Chroma `where` clause."""
|
||||
conds: list[dict] = []
|
||||
if version:
|
||||
conds.append({"version": version})
|
||||
if platform:
|
||||
conds.append({"platform": platform})
|
||||
if bundle_id:
|
||||
conds.append({"bundle_id": bundle_id})
|
||||
if crop:
|
||||
conds.append({"crop": crop.lower()})
|
||||
if brand:
|
||||
conds.append({"brand": brand.upper()})
|
||||
if vendor:
|
||||
conds.append({"vendor": vendor})
|
||||
if source:
|
||||
conds.append({"source": source})
|
||||
if source_key:
|
||||
conds.append({"source_key": source_key})
|
||||
if not conds:
|
||||
return None
|
||||
if len(conds) == 1:
|
||||
@@ -104,13 +221,152 @@ def _build_where(version: str | None, platform: str | None, bundle_id: str | Non
|
||||
return {"$and": conds}
|
||||
|
||||
|
||||
def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
|
||||
"""Read a corpus page off disk. Returns (markdown_body, metadata_dict)."""
|
||||
md_path = CORPUS / bundle_id / (page_id + ".md")
|
||||
json_path = CORPUS / bundle_id / (page_id + ".json")
|
||||
if not md_path.exists() or not json_path.exists():
|
||||
def _read_sidecar(source: str, source_key: str) -> dict | None:
|
||||
"""Read a variety's sidecar JSON off disk."""
|
||||
path = CORPUS / source / f"{source_key}.json"
|
||||
if not path.exists():
|
||||
return None
|
||||
return md_path.read_text(), json.loads(json_path.read_text())
|
||||
try:
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
except (OSError, json.JSONDecodeError) as exc:
|
||||
log.warning("sidecar read failed for %s/%s: %s", source, source_key, exc)
|
||||
return None
|
||||
|
||||
|
||||
def _read_markdown(source: str, source_key: str) -> str | None:
|
||||
path = CORPUS / source / f"{source_key}.md"
|
||||
if not path.exists():
|
||||
return None
|
||||
try:
|
||||
return path.read_text(encoding="utf-8")
|
||||
except OSError:
|
||||
return None
|
||||
|
||||
|
||||
def _format_hit(doc: str, meta: dict, distance: float | None = None) -> str:
|
||||
"""Render one retrieval hit as a fenced markdown block with full
|
||||
provenance attached. ``doc`` is the chunk text; ``meta`` is the
|
||||
chunk's metadata dict."""
|
||||
src_url = meta.get("source_url") or ""
|
||||
src_key = meta.get("source_key") or ""
|
||||
src = meta.get("source") or ""
|
||||
vendor = meta.get("vendor") or ""
|
||||
brand = meta.get("brand") or ""
|
||||
crop = meta.get("crop") or ""
|
||||
name = meta.get("product_name") or src_key
|
||||
|
||||
header = (
|
||||
f"### {name} \n"
|
||||
f"`{src}::{src_key}` — {vendor} / {brand} / {crop} \n"
|
||||
f"<{src_url}>"
|
||||
)
|
||||
if distance is not None:
|
||||
header += f" \n_(distance={distance:.4f})_"
|
||||
return f"{header}\n\n{doc.strip()}\n"
|
||||
|
||||
|
||||
def _rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[str]:
|
||||
"""Reciprocal Rank Fusion — merge multiple ranked id lists into one.
|
||||
|
||||
Score(id) = sum over each ranking R of 1 / (k + rank_R(id)).
|
||||
Robust to score-scale differences between dense (cosine) and lexical
|
||||
(BM25). k=60 is the literature default; not particularly sensitive.
|
||||
"""
|
||||
scores: dict[str, float] = {}
|
||||
for ranking in rankings:
|
||||
for rank, doc_id in enumerate(ranking):
|
||||
scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
|
||||
return sorted(scores, key=lambda d: scores[d], reverse=True)
|
||||
|
||||
|
||||
def _structured_ratings_block(sidecar: dict) -> str:
|
||||
"""Render the sidecar's grouped characteristics + identity as a
|
||||
fact-checkable block, with the source URL pinned at top.
|
||||
|
||||
This is the response body for ``get_page`` and the embedded
|
||||
"canonical data" block agents should trust over any free-text
|
||||
paraphrase."""
|
||||
lines: list[str] = []
|
||||
name = sidecar.get("product_name") or sidecar.get("source_key", "")
|
||||
lines.append(f"# {name}")
|
||||
lines.append("")
|
||||
|
||||
facets: list[str] = []
|
||||
if sidecar.get("vendor"):
|
||||
facets.append(f"**Vendor:** {sidecar['vendor']}")
|
||||
if sidecar.get("brand"):
|
||||
facets.append(f"**Brand:** {str(sidecar['brand']).title()}")
|
||||
if sidecar.get("crop"):
|
||||
facets.append(f"**Crop:** {str(sidecar['crop']).capitalize()}")
|
||||
if sidecar.get("relative_maturity") and sidecar.get("crop") == "corn":
|
||||
facets.append(f"**Relative maturity:** {sidecar['relative_maturity']}")
|
||||
if sidecar.get("maturity_group") and sidecar.get("crop") == "soybeans":
|
||||
facets.append(f"**Maturity group:** {sidecar['maturity_group']}")
|
||||
if sidecar.get("relative_maturity") and sidecar.get("crop") == "wheat":
|
||||
facets.append(f"**Maturity:** {sidecar['relative_maturity']}")
|
||||
if sidecar.get("wheat_class"):
|
||||
facets.append(f"**Wheat class:** {sidecar['wheat_class']}")
|
||||
if sidecar.get("release_year"):
|
||||
facets.append(f"**Released:** {sidecar['release_year']}")
|
||||
if sidecar.get("trait_stack"):
|
||||
facets.append(f"**Trait stack:** {', '.join(sidecar['trait_stack'])}")
|
||||
if facets:
|
||||
lines.append(" · ".join(facets))
|
||||
lines.append("")
|
||||
|
||||
urls = sidecar.get("source_urls") or []
|
||||
if urls:
|
||||
lines.append(f"**Source:** {urls[0]}")
|
||||
lines.append("")
|
||||
|
||||
scale = sidecar.get("_scale_direction") or "scale direction not declared by source"
|
||||
lines.append(f"**Rating scale (as published):** {scale}")
|
||||
lines.append("")
|
||||
|
||||
# Canonical ratings — one table per group, values verbatim.
|
||||
for g in sidecar.get("characteristics_groups") or []:
|
||||
label = g.get("label") or "Characteristics"
|
||||
items = g.get("items") or []
|
||||
if not items:
|
||||
continue
|
||||
lines.append(f"## {label.title()}")
|
||||
lines.append("")
|
||||
lines.append("| Characteristic | Value |")
|
||||
lines.append("|---|---|")
|
||||
for it in items:
|
||||
ch = (it.get("characteristic") or "").strip()
|
||||
v = (it.get("value") or "").strip()
|
||||
lines.append(f"| {ch} | {v} |")
|
||||
lines.append("")
|
||||
|
||||
if sidecar.get("positioning_statement"):
|
||||
lines.append("## Vendor positioning")
|
||||
lines.append("")
|
||||
lines.append(sidecar["positioning_statement"].strip())
|
||||
lines.append("")
|
||||
|
||||
if sidecar.get("strengths"):
|
||||
lines.append("## Strengths / management notes (vendor copy)")
|
||||
lines.append("")
|
||||
for s in sidecar["strengths"]:
|
||||
lines.append(f"- {s}")
|
||||
lines.append("")
|
||||
|
||||
rec = sidecar.get("regional_recommendations") or []
|
||||
if rec:
|
||||
names = sorted({
|
||||
(r.get("product_list_name") or "").strip()
|
||||
for r in rec
|
||||
if (r.get("product_list_name") or "").strip()
|
||||
})
|
||||
if names:
|
||||
lines.append("## Listed in vendor regional seed guides")
|
||||
lines.append("")
|
||||
for n in names:
|
||||
lines.append(f"- {n}")
|
||||
lines.append("")
|
||||
|
||||
return "\n".join(lines).rstrip() + "\n"
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
@@ -119,119 +375,340 @@ def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
|
||||
|
||||
@mcp.tool()
|
||||
def search_docs(
|
||||
query: Annotated[str, Field(description=f"Natural-language query about {PRODUCT_NAME}.")],
|
||||
version: Annotated[
|
||||
query: Annotated[str, Field(description=(
|
||||
"Natural-language query about row-crop seed varieties. "
|
||||
"Mention crop (corn/soybeans/wheat), maturity (RM days / "
|
||||
"MG number / wheat class), traits, disease tolerances, "
|
||||
"or soil/region constraints — they all carry retrieval signal."
|
||||
))],
|
||||
crop: Annotated[
|
||||
str | None,
|
||||
Field(description="OPTIONAL version filter — restrict to one product version."),
|
||||
Field(description="OPTIONAL filter: corn, soybeans, or wheat."),
|
||||
] = None,
|
||||
platform: Annotated[
|
||||
brand: Annotated[
|
||||
str | None,
|
||||
Field(description="OPTIONAL platform filter. Set to one of the platforms listed by list_versions(); omit for all platforms."),
|
||||
Field(description=(
|
||||
"OPTIONAL brand filter — DEKALB, ASGROW, WESTBRED, "
|
||||
"GoldenHarvest, NK, AgriPro, Becks. Case-insensitive."
|
||||
)),
|
||||
] = None,
|
||||
bundle_id: Annotated[
|
||||
vendor: Annotated[
|
||||
str | None,
|
||||
Field(description="OPTIONAL bundle filter — pin to a specific doc bundle slug."),
|
||||
Field(description="OPTIONAL vendor filter — Bayer, Syngenta, Beck's."),
|
||||
] = None,
|
||||
source: Annotated[
|
||||
str | None,
|
||||
Field(description=(
|
||||
"OPTIONAL source filter — e.g. 'bayer_seeds'. Use "
|
||||
"list_versions() to see what's indexed."
|
||||
)),
|
||||
] = None,
|
||||
k: Annotated[int, Field(description="Number of results to return.", ge=1, le=50)] = 10,
|
||||
) -> str:
|
||||
"""Search the {product} docs corpus.
|
||||
"""Search the seed-variety corpus for hybrids/varieties matching a query.
|
||||
|
||||
Returns the top-k most relevant chunks (with full source page URLs)
|
||||
given a natural-language query. Optional filters narrow the search
|
||||
to one version, one platform, or one bundle. Use list_versions()
|
||||
first if you need to discover the available facet values.
|
||||
Returns the top-k variety chunks with their full source URLs, ratings,
|
||||
maturity, traits, and regional listings. Optional filters narrow to one
|
||||
crop, brand, vendor, or scraper source. Use **list_versions()** first
|
||||
to discover valid facet values. Use **lookup_variety()** on any
|
||||
candidate the user is serious about — that returns the canonical
|
||||
sidecar so you can verify exact rating values without trusting the
|
||||
free-text chunk.
|
||||
|
||||
Call this tool whenever the user asks anything that should be
|
||||
answerable from the official product documentation.
|
||||
Call this tool whenever an ag professional or farmer asks anything
|
||||
about choosing a seed variety, comparing hybrids, or finding seed
|
||||
matched to a maturity / soil / region / disease constraint.
|
||||
|
||||
NEVER answer seed-selection questions from prior knowledge alone —
|
||||
seed catalogs change yearly and brand-specific. Always search first.
|
||||
"""
|
||||
with TimedCall("search_docs", {
|
||||
"query": query, "version": version, "platform": platform,
|
||||
"bundle_id": bundle_id, "k": k,
|
||||
"query": query, "crop": crop, "brand": brand,
|
||||
"vendor": vendor, "source": source, "k": k,
|
||||
}) as _call:
|
||||
# TODO Phase 2-3: query Chroma collection (see rag/index.py for
|
||||
# how it was built). Render the top-k chunks as markdown with
|
||||
# source URLs.
|
||||
# TODO Phase 6: optional reranker via _rerank() if RERANK_URL set.
|
||||
# TODO Phase 8: hybrid retrieval if HYBRID_SEARCH=true — run
|
||||
# dense + BM25 in parallel, RRF-fuse, hand merged pool to rerank.
|
||||
_call.set(hits_returned=0)
|
||||
raise NotImplementedError("Phase 2/3: implement Chroma query + rendering")
|
||||
where = _build_where(crop, brand, vendor, source, None)
|
||||
pool_size = max(k * 3, RERANK_POOL)
|
||||
|
||||
# Exact-code pre-filter. Variety codes ("DKC62-08RIB", "AG29XF4")
|
||||
# have NO semantic neighbors — dense retrieval misses them and
|
||||
# RRF can let off-topic noise float to top-1. If the query
|
||||
# contains a token that exactly matches a variety's identifier,
|
||||
# pin those varieties to the top of results.
|
||||
pinned_ids: list[str] = []
|
||||
for src, sk in _exact_code_matches(query):
|
||||
pinned_ids.append(f"{src}::{sk}::0")
|
||||
|
||||
# Dense retrieval via Chroma.
|
||||
try:
|
||||
col = _collection()
|
||||
except Exception as exc: # noqa: BLE001
|
||||
_call.set(error_dense=str(exc), hits_returned=0)
|
||||
return (
|
||||
"_(retrieval unavailable — Chroma collection not found. "
|
||||
"Has the indexer run? `python -m rag.index --rebuild`.)_"
|
||||
)
|
||||
|
||||
try:
|
||||
dense = col.query(
|
||||
query_texts=[query],
|
||||
n_results=pool_size,
|
||||
where=where,
|
||||
)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
_call.set(error_dense=str(exc), hits_returned=0)
|
||||
return f"_(retrieval failed: {exc})_"
|
||||
|
||||
dense_ids: list[str] = (dense.get("ids") or [[]])[0]
|
||||
dense_docs: list[str] = (dense.get("documents") or [[]])[0]
|
||||
dense_metas: list[dict] = (dense.get("metadatas") or [[]])[0]
|
||||
dense_dists: list[float] = (dense.get("distances") or [[]])[0]
|
||||
|
||||
id_to_doc = dict(zip(dense_ids, dense_docs))
|
||||
id_to_meta = dict(zip(dense_ids, dense_metas))
|
||||
id_to_dist = dict(zip(dense_ids, dense_dists))
|
||||
|
||||
# Hybrid: optionally fuse with BM25. Both are pulled at pool_size,
|
||||
# fused via RRF, top-k returned.
|
||||
used_hybrid = False
|
||||
if HYBRID_SEARCH:
|
||||
bm25 = _bm25_index()
|
||||
if bm25 is not None:
|
||||
bm25_hits = bm25.query(query, n=pool_size, where=where)
|
||||
bm25_ids = [h[0] for h in bm25_hits]
|
||||
if bm25_ids:
|
||||
fused = _rrf_fuse([dense_ids, bm25_ids])
|
||||
fuzzy_ids = fused
|
||||
used_hybrid = True
|
||||
else:
|
||||
fuzzy_ids = dense_ids
|
||||
else:
|
||||
fuzzy_ids = dense_ids
|
||||
else:
|
||||
fuzzy_ids = dense_ids
|
||||
|
||||
# Pin exact-code matches at top, then fill remainder from fuzzy
|
||||
# retrieval (deduped). Pinned matches are deterministic and
|
||||
# high-confidence; they should never lose to a fuzzy match.
|
||||
final_ids: list[str] = []
|
||||
seen: set[str] = set()
|
||||
for cid in pinned_ids + fuzzy_ids:
|
||||
if cid in seen:
|
||||
continue
|
||||
seen.add(cid)
|
||||
final_ids.append(cid)
|
||||
if len(final_ids) >= k:
|
||||
break
|
||||
|
||||
# For ids returned by BM25 but not by Chroma, we need their docs/
|
||||
# metadata too. Re-fetch by id.
|
||||
missing = [i for i in final_ids if i not in id_to_doc]
|
||||
if missing:
|
||||
try:
|
||||
extra = col.get(ids=missing, include=["documents", "metadatas"])
|
||||
for cid, doc, meta in zip(
|
||||
extra.get("ids") or [],
|
||||
extra.get("documents") or [],
|
||||
extra.get("metadatas") or [],
|
||||
):
|
||||
id_to_doc[cid] = doc
|
||||
id_to_meta[cid] = meta
|
||||
except Exception as exc: # noqa: BLE001
|
||||
log.warning("get-by-id for BM25-only hits failed: %s", exc)
|
||||
|
||||
_call.set(
|
||||
hits_returned=len(final_ids),
|
||||
hybrid=used_hybrid,
|
||||
pool_size=pool_size,
|
||||
)
|
||||
|
||||
if not final_ids:
|
||||
return (
|
||||
"_(no varieties matched. Try broadening the query or "
|
||||
"removing filters — call list_versions() to see what's "
|
||||
"indexed.)_"
|
||||
)
|
||||
|
||||
blocks: list[str] = []
|
||||
for cid in final_ids:
|
||||
doc = id_to_doc.get(cid, "")
|
||||
meta = id_to_meta.get(cid, {})
|
||||
dist = id_to_dist.get(cid) if not used_hybrid else None
|
||||
blocks.append(_format_hit(doc, meta, dist))
|
||||
|
||||
header = (
|
||||
f"# Search results — {len(final_ids)} variety chunk"
|
||||
f"{'s' if len(final_ids) != 1 else ''}"
|
||||
f"{' (dense + BM25 hybrid)' if used_hybrid else ' (dense only)'}\n"
|
||||
f"_Use `lookup_variety(source_key=...)` on any candidate "
|
||||
f"to fact-check ratings from the canonical sidecar._\n\n---\n\n"
|
||||
)
|
||||
return header + "\n---\n\n".join(blocks)
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def get_page(
|
||||
bundle_id: Annotated[str, Field(description="Bundle slug.")],
|
||||
page_id: Annotated[str, Field(description="Page filename within the bundle.")],
|
||||
source: Annotated[str, Field(description=(
|
||||
"Scraper source id — e.g. 'bayer_seeds'. Same as the `source` "
|
||||
"field in search_docs results."
|
||||
))],
|
||||
source_key: Annotated[str, Field(description=(
|
||||
"Per-variety stable key — e.g. 'dekalb-dkc62-08rib'. Same as "
|
||||
"the `source_key` field in search_docs results."
|
||||
))],
|
||||
) -> str:
|
||||
"""Return the full markdown for one page, plus a metadata header.
|
||||
"""Return the full canonical record for one variety.
|
||||
|
||||
Use after search_docs surfaces a relevant page and the user (or you)
|
||||
want the complete text — not just the matched chunks.
|
||||
Emits a structured ratings header (identity, all characteristic
|
||||
groups, vendor positioning, regional listings) sourced from the
|
||||
variety's sidecar JSON, followed by the raw markdown body the
|
||||
chunker indexed.
|
||||
|
||||
Use this when the user is comparing varieties closely or wants to
|
||||
see every published rating value, not just the ones that matched a
|
||||
query. The structured header is the canonical fact-check surface;
|
||||
use it to validate any specific rating value before quoting it.
|
||||
"""
|
||||
with TimedCall("get_page", {"bundle_id": bundle_id, "page_id": page_id}) as _call:
|
||||
data = _read_page(bundle_id, page_id)
|
||||
if data is None:
|
||||
with TimedCall("get_page", {"source": source, "source_key": source_key}) as _call:
|
||||
sidecar = _read_sidecar(source, source_key)
|
||||
if sidecar is None:
|
||||
_call.set(found=False)
|
||||
return f"Page not found: {bundle_id}/{page_id}"
|
||||
md, meta = data
|
||||
_call.set(found=True, page_chars=len(md))
|
||||
# TODO: add a metadata header (title, version, source URL) above
|
||||
# the body. Product-specific shape.
|
||||
return md
|
||||
return (
|
||||
f"_(variety not found: source='{source}' "
|
||||
f"source_key='{source_key}'. Use list_versions() to see "
|
||||
f"available sources or search_docs() to find candidates.)_"
|
||||
)
|
||||
body = _read_markdown(source, source_key) or ""
|
||||
structured = _structured_ratings_block(sidecar)
|
||||
_call.set(found=True, body_chars=len(body), groups=len(sidecar.get("characteristics_groups") or []))
|
||||
sep = "\n\n---\n\n## Indexed body (chunk source text)\n\n"
|
||||
return structured + sep + body
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def list_versions() -> str:
|
||||
"""List the available version/platform facets across all bundles.
|
||||
"""List the available scraper sources and the crop/brand/vendor
|
||||
facets across all indexed varieties.
|
||||
|
||||
Use this to discover valid filter values for search_docs.
|
||||
Use this BEFORE search_docs() the first time you query, to discover
|
||||
which scraper sources, brands, vendors, and crops are actually in
|
||||
the corpus right now. The agent should not assume — the corpus
|
||||
grows as more vendors are scraped.
|
||||
"""
|
||||
with TimedCall("list_versions", {}) as _call:
|
||||
cat = _bundles()
|
||||
if not cat:
|
||||
return "_(no bundles indexed yet — run the scraper + indexer)_"
|
||||
versions = sorted({b.get("version") for b in cat.values() if b.get("version")})
|
||||
platforms = sorted({b.get("platform") for b in cat.values() if b.get("platform")})
|
||||
_call.set(versions=len(versions), platforms=len(platforms))
|
||||
lines = [f"# Facets across {len(cat)} bundle(s)", ""]
|
||||
if versions:
|
||||
lines.append("## Versions"); lines.append("")
|
||||
for v in versions: lines.append(f"- `{v}`")
|
||||
facets: dict[str, dict[str, int]] = {
|
||||
"source": {}, "vendor": {}, "brand": {}, "crop": {},
|
||||
}
|
||||
n_varieties = 0
|
||||
if CORPUS.exists():
|
||||
for source_dir in sorted(CORPUS.iterdir()):
|
||||
if not source_dir.is_dir() or source_dir.name.startswith("."):
|
||||
continue
|
||||
for sc in source_dir.glob("*.json"):
|
||||
try:
|
||||
d = json.loads(sc.read_text(encoding="utf-8"))
|
||||
except (OSError, json.JSONDecodeError):
|
||||
continue
|
||||
n_varieties += 1
|
||||
for k in facets:
|
||||
v = d.get(k)
|
||||
if v:
|
||||
facets[k][v] = facets[k].get(v, 0) + 1
|
||||
|
||||
if n_varieties == 0:
|
||||
return (
|
||||
"_(no varieties indexed yet — run scrapers + indexer "
|
||||
"before calling search_docs)_"
|
||||
)
|
||||
|
||||
_call.set(
|
||||
varieties=n_varieties,
|
||||
sources=len(facets["source"]),
|
||||
vendors=len(facets["vendor"]),
|
||||
brands=len(facets["brand"]),
|
||||
crops=len(facets["crop"]),
|
||||
)
|
||||
|
||||
lines = [f"# Corpus facets ({n_varieties} varieties indexed)", ""]
|
||||
for label, counts in facets.items():
|
||||
if not counts:
|
||||
continue
|
||||
lines.append(f"## {label}")
|
||||
lines.append("")
|
||||
if platforms:
|
||||
lines.append("## Platforms"); lines.append("")
|
||||
for p in platforms: lines.append(f"- `{p}`")
|
||||
return "\n".join(lines)
|
||||
for k, n in sorted(counts.items(), key=lambda kv: -kv[1]):
|
||||
lines.append(f"- `{k}` — {n} varieties")
|
||||
lines.append("")
|
||||
return "\n".join(lines).rstrip()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stubs for later phases — keep the signatures in this file so refactors
|
||||
# don't lose the contracts. Implementations come per phase.
|
||||
# ---------------------------------------------------------------------------
|
||||
@mcp.tool()
|
||||
def lookup_variety(
|
||||
source_key: Annotated[str, Field(description=(
|
||||
"Per-variety stable key — e.g. 'dekalb-dkc62-08rib'."
|
||||
))],
|
||||
source: Annotated[
|
||||
str | None,
|
||||
Field(description=(
|
||||
"OPTIONAL scraper source ('bayer_seeds' etc). If omitted, "
|
||||
"scans all sources for the source_key."
|
||||
)),
|
||||
] = None,
|
||||
) -> str:
|
||||
"""Return the canonical sidecar JSON for one variety, verbatim.
|
||||
|
||||
# @mcp.tool() # Phase 9
|
||||
# def list_cluster(bundle_id: str, page_id: str) -> str: ...
|
||||
USE THIS to fact-check any rating value before quoting it to the
|
||||
farmer. The output is the exact data the scraper captured from the
|
||||
vendor's published catalog — no paraphrasing, no inference.
|
||||
|
||||
# @mcp.tool() # Phase 9
|
||||
# def diff_versions(bundle_id: str, page_id: str, against_bundle_id: str, context: int = 3) -> str: ...
|
||||
Call this whenever the user (or you) needs an exact value for a
|
||||
specific rating, trait, or maturity — not just a semantic match
|
||||
from search_docs.
|
||||
"""
|
||||
with TimedCall("lookup_variety", {"source_key": source_key, "source": source}) as _call:
|
||||
if source:
|
||||
sidecar = _read_sidecar(source, source_key)
|
||||
if sidecar is not None:
|
||||
_call.set(found=True, source=source)
|
||||
return "```json\n" + json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n```"
|
||||
_call.set(found=False, source=source)
|
||||
return f"_(variety '{source_key}' not found under source '{source}')_"
|
||||
|
||||
# @mcp.tool() # Phase 9
|
||||
# def bundle_changelog(bundle_id_new: str, bundle_id_old: str, min_churn: int = 5, max_changed: int = 50) -> str: ...
|
||||
# No source given — scan all source dirs for a match.
|
||||
if not CORPUS.exists():
|
||||
_call.set(found=False)
|
||||
return "_(corpus is empty — no scrapers have run yet)_"
|
||||
matches: list[tuple[str, dict]] = []
|
||||
for source_dir in sorted(CORPUS.iterdir()):
|
||||
if not source_dir.is_dir() or source_dir.name.startswith("."):
|
||||
continue
|
||||
sidecar = _read_sidecar(source_dir.name, source_key)
|
||||
if sidecar is not None:
|
||||
matches.append((source_dir.name, sidecar))
|
||||
|
||||
# @mcp.tool() # Phase 13
|
||||
# def weekly_digest(days: int = 7, version: str | None = None, platform: str | None = None, ...) -> str: ...
|
||||
if not matches:
|
||||
_call.set(found=False)
|
||||
return (
|
||||
f"_(no variety with source_key '{source_key}' found in any "
|
||||
f"source. Use search_docs() to find the right key.)_"
|
||||
)
|
||||
|
||||
# @mcp.tool() # Phase 9 (or 3 — useful early)
|
||||
# def corpus_status() -> str: ...
|
||||
_call.set(found=True, matches=len(matches))
|
||||
if len(matches) == 1:
|
||||
src_name, sidecar = matches[0]
|
||||
return (
|
||||
f"_(matched in source `{src_name}`)_\n\n"
|
||||
"```json\n"
|
||||
+ json.dumps(sidecar, indent=2, ensure_ascii=False)
|
||||
+ "\n```"
|
||||
)
|
||||
|
||||
# @mcp.tool() # Phase 11
|
||||
# def myproduct_api_lessons(topic: str | None = None) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 12
|
||||
# def find_doc_inconsistencies(scope_query: str, ...) -> str: ...
|
||||
|
||||
# @mcp.tool() # Phase 12
|
||||
# def submit_doc_bug(page_url: str, content: str, email: str | None = None, ...) -> str: ...
|
||||
# Multi-match: serialize each labeled.
|
||||
out: list[str] = [f"_(matched {len(matches)} sources)_\n"]
|
||||
for src_name, sidecar in matches:
|
||||
out.append(f"### source: `{src_name}`")
|
||||
out.append("```json")
|
||||
out.append(json.dumps(sidecar, indent=2, ensure_ascii=False))
|
||||
out.append("```")
|
||||
return "\n".join(out)
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
@@ -252,8 +729,6 @@ def main() -> None:
|
||||
else:
|
||||
mcp.settings.host = args.host
|
||||
mcp.settings.port = args.port
|
||||
# DNS-rebinding protection defaults to localhost-only — disable for
|
||||
# container-network DNS hostnames. See PLAN.md "Hosting" notes.
|
||||
if os.environ.get("MCP_DISABLE_DNS_REBINDING_PROTECTION") in {"1", "true", "yes"}:
|
||||
mcp.settings.transport_security.enable_dns_rebinding_protection = False
|
||||
mcp.run(transport=args.transport)
|
||||
|
||||
Reference in New Issue
Block a user