Merge pull request 'Phase 11: crop_seed_api_lessons tool + Pioneer fallback' (#3) from api-lessons-pioneer-fallback into main

2026-05-25 13:19:17 -04:00
parent 3cab941c08 4009dc0b78
commit 28d8cb83b3
2 changed files with 365 additions and 0 deletions
@@ -0,0 +1,259 @@
 # crop_seed API lessons
 Curated knowledge that **does not live in the scraped corpus** but
 that an agent needs to interpret search_docs / get_page results
 correctly. This file is the source for `crop_seed_api_lessons(topic)`.
 Each section starts with a `## <slug>` heading; the tool returns
 sections whose slug contains the topic (case-insensitive substring
 match), falls back to sections whose body contains it when no slug
 matches, and returns all sections when `topic` is None.
 This file is the *only* place the MCP injects opinionated content.
 Everything else returned by the server comes verbatim from a vendor
 catalog. Lessons here should be: factual, well-cited, and conservative
 about what the corpus actually contains.
 ---
 ## pioneer
 **Pioneer (Corteva) is NOT in the seed-mcp corpus.** The MCP intentionally
 does not scrape pioneer.com. Per Pioneer's published Terms of Service:
 > "you shall not use any manual or automated software, devices or
 > other processes (including but not limited to spiders, robots,
 > scrapers, crawlers, avatars, data mining tools or the like) to
 > 'scrape' or download data from the Services"
 Pioneer additionally runs Imperva WAF on their catalog API and
 login-walls dealer lookup, so a polite scrape isn't technically
 feasible either.
 **Agent behavior when a farmer asks about Pioneer / P-series hybrids:**
 1. Tell the farmer plainly: "I don't have Pioneer's variety data
   indexed — Pioneer doesn't allow automated indexing of their
   catalog."
 2. Recommend pioneer.com → *Find a Sales Rep* (or call 1-800-247-6803)
   for hybrid specs and local recommendations.
 3. If the farmer wants an independent third-party rating for a
   specific Pioneer hybrid, mention that **university extension
   performance trials** (Iowa State, Illinois Crop Sciences, Purdue,
   Nebraska, Ohio State) publish multi-location yield data on
   Pioneer hybrids alongside competitors — useful if they want
   apples-to-apples data without going through Pioneer's own
   marketing.
 4. **Do NOT invent Pioneer hybrid ratings.** If asked "what's the
   disease tolerance of P1142AM?", the only correct answer is
   "I don't have that data — please consult Pioneer or an
   extension service."
 This is the canonical anti-hallucination policy for the seed-mcp.
 There is no Pioneer data; there is no inference. Direct the farmer
 to a primary source.
 ---
 ## rating-scales
 Different vendors publish ratings on different conventions. The
 chunker normalizes the *labels* in the chunk preamble but always
 preserves the source's `_scale_direction` field in the sidecar.
 **Bayer (DEKALB / Asgrow / WestBred)**: `1-9 (9 = best)`. A
 GRAY LEAF SPOT rating of 8 means EXCELLENT tolerance. A rating of 2
 means SUSCEPTIBLE.
 **Syngenta Golden Harvest**: `9-to-1 (9 = best, 1 = worst)`. Golden
 Harvest lists the scale from 9 down to 1, but high numbers mean the
 same thing as on Bayer's scale: high = best. When the chunker
 "normalizes" a Golden Harvest rating, it only restates the scale as
 `1-9 (9 = best)` in the chunk preamble; the sidecar's
 `_scale_direction` field still says `9-to-1`, so you can detect the
 provenance.
 **Syngenta NK / AgriPro**: `1-9 (9 = best)`. Same as Bayer.
 **Beck's**: ratings live behind SeedIQ login; only identity-level
 data is publicly available, so most disease/agronomic ratings are
 absent from Beck's records in this corpus.
 **Always check the chunk's "Rating scale" line or call
 `lookup_variety(source_key)` and look at `_scale_direction` if you
 are unsure.** Cross-vendor comparisons are valid AFTER you've
 confirmed each side uses the same direction.
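The check described above can be sketched roughly as follows. This is a minimal illustration, not the server's code: the `ratings` mapping and the presence of `source_key` inside the sidecar dict are assumptions made for the example, and only the scale strings quoted in this lesson are treated as known.

```python
from typing import Optional

# Scale strings quoted in this lesson. Both mean "higher number = better".
# Anything else is treated as unknown rather than guessed at.
HIGH_IS_BEST = {"1-9 (9 = best)", "9-to-1", "9-to-1 (9 = best, 1 = worst)"}


def comparable(sidecar_a: dict, sidecar_b: dict) -> bool:
    """True only when both sidecars declare a known high-is-best scale."""
    return all(
        s.get("_scale_direction") in HIGH_IS_BEST for s in (sidecar_a, sidecar_b)
    )


def better_rated(sidecar_a: dict, sidecar_b: dict, trait: str) -> Optional[str]:
    """Return the source_key with the higher rating, or None if unsafe to compare."""
    if not comparable(sidecar_a, sidecar_b):
        return None
    # "ratings" is a hypothetical key used only for this sketch.
    a = sidecar_a.get("ratings", {}).get(trait)
    b = sidecar_b.get("ratings", {}).get(trait)
    # Non-numeric values such as "R" or "MR" are not on the 1-9 scale.
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        return None
    if a == b:
        return None
    return sidecar_a["source_key"] if a > b else sidecar_b["source_key"]
```

Returning `None` for an unknown scale string keeps the comparison conservative: the agent should say it cannot compare rather than assume a direction.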
 **Non-numeric values** appear for some characteristics and should be
 read literally:
 - `R`, `MR`, `S` for soybean disease resistance = Resistant / Moderately
  Resistant / Susceptible (not 1-9).
 - `Rps1c`, `Rps3a`, `Rps1k`, etc. = specific Phytophthora resistance
  gene present.
 - `R1`, `R3` (under SOYBEAN CYST NEMATODE) = effective against
  SCN race 1 / race 3.
 - `A`, `B`, `C` under HERBICIDE sensitivity = grade letters where A
  is most tolerant.
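As a rough sketch of reading these values literally, the helper below (a hypothetical name, not part of the server) classifies a raw value string. It deliberately leaves out `R1` / `R3` and the `A` / `B` / `C` grades, because their meaning depends on which characteristic they appear under.

```python
import re

# Soybean disease-resistance letters listed in this lesson.
SOY_LETTERS = {
    "R": "Resistant",
    "MR": "Moderately Resistant",
    "S": "Susceptible",
}


def describe_value(value: str) -> str:
    """Describe a raw characteristic value without converting it to a number."""
    v = value.strip()
    if v in SOY_LETTERS:
        return SOY_LETTERS[v]
    # Phytophthora gene names such as Rps1c, Rps3a, Rps1k.
    if re.fullmatch(r"Rps\d+[a-z]?", v):
        return f"Phytophthora resistance gene {v} present"
    if re.fullmatch(r"[1-9]", v):
        return f"numeric rating {v} (check the scale direction)"
    # Context-dependent or unknown codes are passed through untouched.
    return f"unrecognized value {v!r}; quote it literally"
```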
 ---
 ## maturity-semantics
 Maturity is encoded differently per crop. Don't conflate the units.
 **Corn — Relative maturity (RM days)**: integer roughly 75-120.
 Lower = shorter season, suitable for higher latitudes / shorter
 growing windows. 110 RM is a Central Iowa default; 85 RM suits
 northern Minnesota or short-season silage; 115+ RM fits southern
 Indiana / southern Illinois / Missouri Delta. The number is
 **Pioneer-style RM days**, normalized across the industry.
 **Soybeans — Maturity group (MG)**: a value from 00 (zero-zero) to 9.0,
 written with one decimal. A "3.5 MG" soybean is for central
 Iowa. Northern North Dakota / Minnesota plant 0.0–1.5 MG. Mid-South
 plants 5.0+. Each full MG ≈ 7-10 days of additional season, so a
 tenth of an MG is roughly one day.
 Sidecar field: `maturity_group` (e.g. "3.5", "0.7").
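Because `maturity_group` is a string and "00" sorts before "0.0", a plain `float()` conversion would conflate the two. A minimal sketch of one way to order MG strings follows; the helper names and the choice of -1.0 for "00" are assumptions made for ordering only, not agronomic quantities.

```python
def mg_sort_key(maturity_group: str) -> float:
    """Map a maturity_group string to a sortable number.

    "00" is earlier than "0.0", so the 00 band is shifted to start at -1.0.
    That offset is an arbitrary placeholder chosen for this sketch.
    """
    text = maturity_group.strip()
    if text.startswith("00"):
        fraction = text[2:]  # "" for "00", ".5" for "00.5"
        return -1.0 + (float("0" + fraction) if fraction else 0.0)
    return float(text)


def in_mg_range(maturity_group: str, low: str, high: str) -> bool:
    """True when maturity_group falls within [low, high], inclusive."""
    return mg_sort_key(low) <= mg_sort_key(maturity_group) <= mg_sort_key(high)
```

For example, `in_mg_range("0.7", "0.0", "1.5")` holds for a northern North Dakota / Minnesota query, while a "3.5" variety falls outside that range.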
 **Wheat — Class + heading**: Winter / spring decision is separate
 from "class" (HRW / HRS / SRW / SWW / SWS / durum):
 - HRW = Hard Red Winter — Plains states bread wheat
 - HRS = Hard Red Spring — Northern Plains, North Dakota, Montana
 - SRW = Soft Red Winter — Eastern Corn Belt, Ohio Valley
 - SWW = Soft White Winter — Pacific Northwest
 - SWS = Soft White Spring — Pacific Northwest
 - Durum — North Dakota / Montana, pasta wheat
 Maturity is qualitative: Early / Medium-Early / Medium / Medium-Late / Late.
 **WestBred's product page JSON does not always expose the wheat class
 as a structured field** — sometimes it's only in the marketing
 narrative (e.g. "WB1376CLP is a Soft White Winter Clearfield® Plus
 Wheat variety"). Read `positioning_statement` carefully when the
 sidecar's `wheat_class` is null.
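The fallback described above might look like the sketch below. It assumes the sidecar is a plain dict carrying the `wheat_class` and `positioning_statement` fields named in this lesson; the phrase table and function name are illustrative only.

```python
from typing import Optional

# Class names from the list above, matched case-insensitively in the narrative.
WHEAT_CLASSES = {
    "hard red winter": "HRW",
    "hard red spring": "HRS",
    "soft red winter": "SRW",
    "soft white winter": "SWW",
    "soft white spring": "SWS",
    "durum": "Durum",
}


def wheat_class(sidecar: dict) -> Optional[str]:
    """Prefer the structured field; fall back to the marketing narrative."""
    structured = sidecar.get("wheat_class")
    if structured:
        return structured
    narrative = (sidecar.get("positioning_statement") or "").lower()
    for phrase, code in WHEAT_CLASSES.items():
        if phrase in narrative:
            return code
    # Nothing found: report the class as unknown rather than guessing.
    return None
```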
 ---
 ## trait-glossary
 Common trait codes that appear in `trait_stack`:
 **Corn:**
 - `SSRIB` — SmartStax® RIB Complete® corn blend (above + below-ground
  insect protection + Roundup Ready + LibertyLink, with refuge-in-bag)
 - `VT2PRIB` — VT Double PRO® RIB Complete® (above-ground insect
  protection + Roundup Ready, refuge-in-bag)
 - `VT4PRIB` — VT4 PRO® RIB Complete® (newer above-ground protection)
 - `Trecepta` — Trecepta® (Trecepta + Roundup Ready + LibertyLink, for
  earworm + western bean cutworm pressure)
 - `SmartStax PRO` — SmartStax® PRO® (RNAi corn rootworm)
 - `PowerCore` — PowerCore® Refuge Advanced (older above-ground stack)
 - `Conventional` — no biotech traits (organic / specialty channels)
 **Soybeans:**
 - `XF` — XtendFlex® (glyphosate + dicamba + glufosinate)
 - `Xtend` — Roundup Ready 2 Xtend® (dicamba + glyphosate)
 - `RR2Y` — Roundup Ready 2 Yield® (glyphosate only)
 - `E3` — Enlist E3® (2,4-D + glyphosate + glufosinate)
 - `LL/LL+GT27` — LibertyLink® / LibertyLink + GT27 (glufosinate +
  glyphosate + isoxaflutole)
 - `Conkesta E3` — Bt-stack for caterpillar pressure (BR/AR markets)
 - `SR` — SR® (sulfonylurea-tolerant, Asgrow-specific)
 **Wheat:**
 - `Clearfield` / `CLP` — Clearfield® / Clearfield® Plus (imazamox
  tolerance)
 - `CoAXium` — CoAXium® (quizalofop tolerance) — note: AgriPro's
  catalog flag, NOT in the WestBred corpus.
 Always render the full trait name (`trait_descriptions`) when telling
 the farmer "this variety has X trait" — bare trait codes are
 ambiguous in print.
 ---
 ## scn-resistance
 Soybean Cyst Nematode resistance ratings are critical for fields
 with SCN pressure (most of the Corn Belt). Read carefully:
 - `R3` under SOYBEAN CYST NEMATODE = Resistant to race 3 (the most
  common race nationally). Most "SCN-resistant" soybeans on the
  market are R3.
 - `R1, R3` = Resistant to both race 1 AND race 3. Higher value;
  useful in long-rotation SCN fields where race shifts have occurred.
 - `MR3` = Moderately Resistant to race 3. Some yield loss expected
  under high SCN pressure.
 - `S` = Susceptible.
 - Some Bayer Asgrow XF lines (e.g. AG29XF4) use **Peking-type SCN
  resistance**, which is genetically distinct from the more common
  PI 88788 source. Peking is more durable when SCN populations
  have eroded PI 88788 effectiveness. Look for "Peking type" in the
  positioning statement.
 **Recommended workflow when a farmer asks about SCN:** call
 `search_docs` with the user's MG range + "SCN-resistant", then
 `lookup_variety` on the top 2-3 candidates to verify the exact race
 coverage and resistance source.
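Once `lookup_variety` has returned a sidecar, verifying race coverage and resistance source could be sketched as below. The function names are hypothetical, and the value formats (`R3`, `R1, R3`, `MR3`, `S`) are the ones listed in this lesson; anything else is ignored rather than interpreted.

```python
import re


def scn_coverage(value: str) -> dict[int, str]:
    """Map SCN race number to 'R' or 'MR'.

    An empty dict means susceptible ("S") or a value this sketch
    does not recognize.
    """
    coverage: dict[int, str] = {}
    for token in value.split(","):
        match = re.fullmatch(r"(MR|R)(\d+)", token.strip())
        if match:
            coverage[int(match.group(2))] = match.group(1)
    return coverage


def uses_peking(sidecar: dict) -> bool:
    """True when the positioning statement mentions Peking-type resistance."""
    narrative = (sidecar.get("positioning_statement") or "").lower()
    return "peking" in narrative
```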
 ---
 ## regional-listings
 The `regional_recommendations` array in each sidecar is sourced from
 Bayer's "local profiles" — varieties get assigned to regional Seed
 Guide bundles (e.g. *"2026 Washington, Oregon, SEED GUIDE"*) with a
 named regional agronomist contact. This is the closest signal we have
 to *"is this variety recommended for the farmer's geography?"* but
 note:
 - A variety being absent from a regional listing **does not** mean
  it's unsuitable; Bayer's local agronomists hand-pick these lists,
  so they are selective rather than exhaustive.
 - Listings are vendor-side recommendations, not third-party trial
  data.
 - When the farmer mentions a region, try filtering or scanning for
  varieties whose `regional_recommendations[].product_list_name`
  mentions that region.
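A minimal sketch of that region scan, assuming each sidecar is a dict whose `regional_recommendations` entries carry a `product_list_name` string as described above (the function name is illustrative):

```python
def recommended_for(sidecars: list[dict], region: str) -> list[dict]:
    """Return sidecars with a regional listing whose name mentions `region`."""
    needle = region.strip().lower()
    hits = []
    for sidecar in sidecars:
        for rec in sidecar.get("regional_recommendations") or []:
            if needle in (rec.get("product_list_name") or "").lower():
                hits.append(sidecar)
                break  # one matching listing is enough for this variety
    return hits
```

An empty result only means no listing mentioned the region; per the notes above, it should not be reported to the farmer as "unsuitable".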
 Other vendors handle regional placement differently. Golden Harvest
 publishes a separate "plot report" system per state/year/site;
 NK publishes ratings as PDF tech sheets without regional flags.
 ---
 ## sources-not-yet-indexed
 These vendors are planned but not yet in the corpus. Don't assume
 their data is present:
 - **Golden Harvest (Syngenta)** — ~175 varieties, sitemap-driven
  scrape pending.
 - **NK (Syngenta)** — 29 varieties.
 - **AgriPro (Syngenta wheat)** — 24 wheat varieties (HRW, HRS, HWS,
  SWW, SWS). The only wheat coverage we expect to have outside
  WestBred.
 - **Beck's PFR (research)** — 2,089 head-to-head trial documents.
  Different shape from variety records — these are studies, not
  hybrids.
 - **Beck's products** — 860 products. Identity-only (SeedIQ login
  gates the ratings).
 If `list_versions()` doesn't show a vendor in the `vendor` facet, the
 corpus does not have it yet. Direct the farmer to that vendor's
 public catalog or their seed dealer.
 ---
 ## checking-your-work
 Before quoting a specific number to a farmer, **always** call
 `lookup_variety(source_key=...)` to confirm. The chunk text inside a
 search_docs response is a faithful render of the sidecar, but the
 sidecar IS the source of truth. Quoting from the canonical sidecar
 makes you robust against:
 - Chunk-text formatting bugs (e.g. a rare unicode issue trimming a
  value).
 - Future chunker changes (a re-index might rewrite the body).
 - Cross-vendor scale-direction differences (the sidecar's
  `_scale_direction` lets you state the convention explicitly).
 If `lookup_variety` returns "not found" but `search_docs` surfaced the
 chunk, that's a bug — please report it. (In normal operation, every
 chunk's `source_key` round-trips to a valid sidecar.)
@@ -369,6 +369,40 @@ def _structured_ratings_block(sidecar: dict) -> str:
    return "\n".join(lines).rstrip() + "\n"
 # ---------------------------------------------------------------------------
 # Curated lessons — docs_mcp/lessons.md is the canonical source.
 # ---------------------------------------------------------------------------
 LESSONS_FILE = Path(__file__).resolve().parent / "lessons.md"
 _lessons_cache: list[tuple[str, str]] | None = None
 def _load_lessons() -> list[tuple[str, str]]:
    """Parse lessons.md into ``[(slug, body), ...]`` sections.
    Sections are delimited by ``## <slug>`` headings. The slug is the
    ``<slug>`` token (whitespace stripped); the body is everything between
    that heading and the next ``## `` heading (or EOF).
    """
    global _lessons_cache
    if _lessons_cache is not None:
        return _lessons_cache
    out: list[tuple[str, str]] = []
    if not LESSONS_FILE.exists():
        _lessons_cache = out
        return out
    text = LESSONS_FILE.read_text(encoding="utf-8")
    parts = re.split(r"(?m)^## (.+)$", text)
    # parts = [preamble, slug1, body1, slug2, body2, ...]
    for i in range(1, len(parts), 2):
        slug = parts[i].strip()
        body = parts[i + 1] if i + 1 < len(parts) else ""
        # Drop trailing horizontal rule that separates sections.
        body = re.sub(r"\n---\s*$", "", body).strip()
        out.append((slug, body))
    _lessons_cache = out
    return out
 # ===========================================================================
 # Tools
 # ===========================================================================
@@ -711,6 +745,78 @@ def lookup_variety(
        return "\n".join(out)
@mcp.tool()
 def crop_seed_api_lessons(
    topic: Annotated[
        str | None,
        Field(description=(
            "OPTIONAL topic — match against lesson section slugs or body "
            "(substring, case-insensitive). Known slugs: pioneer, "
            "rating-scales, maturity-semantics, trait-glossary, "
            "scn-resistance, regional-listings, sources-not-yet-indexed, "
            "checking-your-work. Omit for the full curated index."
        )),
    ] = None,
 ) -> str:
    """Curated knowledge that does NOT live in the scraped corpus —
    vendor scale-direction notes, trait glossary, maturity semantics,
    SCN resistance interpretation, the **Pioneer fallback policy**,
    and rules for fact-checking your work.
    Call this tool when:
    * The user asks about **Pioneer** or any P-series hybrid — Pioneer
      is intentionally NOT scraped (ToS bans it); the lesson tells you
      what to say instead.
    * You need to compare ratings across vendors — different vendors
      publish on different scale directions.
    * You're parsing a trait code or disease abbreviation you don't
      recognize.
    * Before quoting a specific rating value to a farmer — the
      ``checking-your-work`` lesson reminds you to call
      ``lookup_variety`` to confirm.
    This tool is **the only source of opinionated content** in the
    server. Everything else returned by search_docs / get_page /
    lookup_variety is verbatim from vendor catalogs.
    """
    with TimedCall("crop_seed_api_lessons", {"topic": topic}) as _call:
        sections = _load_lessons()
        if not sections:
            _call.set(sections_returned=0)
            return "_(no lessons file present — docs_mcp/lessons.md missing)_"
        if not topic:
            _call.set(sections_returned=len(sections))
            return "\n\n---\n\n".join(
                f"## {slug}\n\n{body}" for slug, body in sections
            )
        needle = topic.strip().lower()
        # Prefer slug matches (most specific). Fall back to body match
        # only when no slug matches — keeps a query like "rating" from
        # returning every section that happens to mention the word.
        slug_matches: list[tuple[str, str]] = []
        body_matches: list[tuple[str, str]] = []
        for slug, body in sections:
            if needle in slug.lower():
                slug_matches.append((slug, body))
            elif needle in body.lower():
                body_matches.append((slug, body))
        matched = slug_matches if slug_matches else body_matches
        _call.set(sections_returned=len(matched), topic=topic)
        if not matched:
            slugs = ", ".join(s for s, _ in sections)
            return (
                f"_(no lesson section matched topic '{topic}'. "
                f"Available slugs: {slugs}.)_"
            )
        return "\n\n---\n\n".join(
            f"## {slug}\n\n{body}" for slug, body in matched
        )
 # ===========================================================================
 # Entry point
 # ===========================================================================