From 4009dc0b78f5ed524a48ee185704906064592815 Mon Sep 17 00:00:00 2001
From: Justin Paul <justin@jpaul.me>
Date: Mon, 25 May 2026 13:18:57 -0400
Subject: [PATCH] Phase 11: crop_seed_api_lessons tool + Pioneer fallback
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add the fifth MCP tool — crop_seed_api_lessons(topic?) — backed by
docs_mcp/lessons.md, the ONLY source of opinionated content in the
server. Everything else (search_docs, get_page, lookup_variety)
returns verbatim from vendor catalogs; lessons.md fills the gaps
the corpus can't cover.

The Pioneer fallback is the critical anti-hallucination piece:
Pioneer's ToS bans automation, so the corpus has no Pioneer data.
Without this tool, an agent might surface Bayer/Asgrow chunks as
mediocre matches for a Pioneer query. The tool's docstring tells
the agent to call it on any Pioneer / P-series question; the
'pioneer' section says clearly:

  "I don't have Pioneer's variety data indexed... please consult
  Pioneer or an extension service."

  "Do NOT invent Pioneer hybrid ratings."

Other lesson sections cover knowledge the agent needs to interpret
search_docs / get_page output correctly:

- rating-scales: Bayer 1-9, Golden Harvest 9-to-1, what
  R/MR/S/Rps1c/R3 mean in soybean disease columns
- maturity-semantics: corn RM days vs soybean MG vs wheat class +
  qualitative early/medium/late
- trait-glossary: SSRIB, VT2PRIB, XF, E3, Conkesta, Clearfield, etc.
- scn-resistance: race coverage + Peking vs PI 88788 source
- regional-listings: how to interpret Bayer's "local profiles"
- sources-not-yet-indexed: which vendors aren't in the corpus yet
- checking-your-work: always call lookup_variety before quoting

Lesson lookup prefers slug-match (returns just `rating-scales` for
topic="rating", not every section that mentions ratings); falls
back to body-match only when no slug matches.

Smoke-tested with topic=pioneer, topic=rating, topic=trait,
topic=zzzzzz (no match), and topic=None (full index = 10K chars,
8 sections).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs_mcp/lessons.md | 259 ++++++++++++++++++++++++++++++++++++++++++++
 docs_mcp/server.py  | 106 ++++++++++++++++++
 2 files changed, 365 insertions(+)
 create mode 100644 docs_mcp/lessons.md
diff --git a/docs_mcp/lessons.md b/docs_mcp/lessons.md
new file mode 100644
index 00000000..8936c5cf
--- /dev/null
+++ b/docs_mcp/lessons.md
@@ -0,0 +1,259 @@
+# crop_seed API lessons
+
+Curated knowledge that **does not live in the scraped corpus** but
+that an agent needs to interpret search_docs / get_page results
+correctly. This file is the source for `crop_seed_api_lessons(topic)`.
+
+Each section starts with a `## <slug>` heading; the tool returns
+sections whose slug matches the topic (substring match) or all
+sections when `topic` is None.
+
+This file is the *only* place the MCP injects opinionated content.
+Everything else returned by the server comes verbatim from a vendor
+catalog. Lessons here should be: factual, well-cited, and conservative
+about what the corpus actually contains.
+
+---
+
+## pioneer
+
+**Pioneer (Corteva) is NOT in the seed-mcp corpus.** The MCP intentionally
+does not scrape pioneer.com. Per Pioneer's published Terms of Service:
+
+> "you shall not use any manual or automated software, devices or
+> other processes (including but not limited to spiders, robots,
+> scrapers, crawlers, avatars, data mining tools or the like) to
+> 'scrape' or download data from the Services"
+
+Pioneer additionally runs Imperva WAF on their catalog API and
+login-walls dealer lookup, so a polite scrape isn't technically
+feasible either.
+
+**Agent behavior when a farmer asks about Pioneer / P-series hybrids:**
+
+1. Tell the farmer plainly: "I don't have Pioneer's variety data
+   indexed — Pioneer doesn't allow automated indexing of their
+   catalog."
+2. Recommend pioneer.com → *Find a Sales Rep* (or call 1-800-247-6803)
+   for hybrid specs and local recommendations.
+3. If the farmer wants an independent third-party rating for a
+   specific Pioneer hybrid, mention that **university extension
+   performance trials** (Iowa State, Illinois Crop Sciences, Purdue,
+   Nebraska, Ohio State) publish multi-location yield data on
+   Pioneer hybrids alongside competitors — useful if they want
+   apples-to-apples data without going through Pioneer's own
+   marketing.
+4. **Do NOT invent Pioneer hybrid ratings.** If asked "what's the
+   disease tolerance of P1142AM?", the only correct answer is
+   "I don't have that data — please consult Pioneer or an
+   extension service."
+
+This is the canonical anti-hallucination policy for the seed-mcp.
+There is no Pioneer data; there is no inference. Direct the farmer
+to a primary source.
+
+---
+
+## rating-scales
+
+Different vendors publish ratings on different conventions. The
+chunker normalizes the *labels* in the chunk preamble but always
+preserves the source's `_scale_direction` field in the sidecar.
+
+**Bayer (DEKALB / Asgrow / WestBred)**: `1-9 (9 = best)`. A
+GRAY LEAF SPOT rating of 8 means EXCELLENT tolerance. A rating of 2
+means SUSCEPTIBLE.
+
+**Syngenta Golden Harvest**: `9-to-1 (9 = best, 1 = worst)` —
+this is the *direction* Golden Harvest publishes, but the *meaning*
+of high numbers is the same: high = best. Where the chunker says
+"normalize" for Golden Harvest, that just means we've already
+re-stated it as `1-9 (9 = best)` in the chunk preamble; the source's
+`_scale_direction` field still says `9-to-1` so you can detect the
+provenance.
+
+**Syngenta NK / AgriPro**: `1-9 (9 = best)`. Same as Bayer.
+
+**Beck's**: ratings live behind SeedIQ login; only identity-level
+data is publicly available, so most disease/agronomic ratings are
+absent from Beck's records in this corpus.
+
+**Always check the chunk's "Rating scale" line or call
+`lookup_variety(source_key)` and look at `_scale_direction` if you
+are unsure.** Cross-vendor comparisons are valid AFTER you've
+confirmed each side uses the same direction.
+
+**Non-numeric values** appear for some characteristics and should be
+read literally:
+- `R`, `MR`, `S` for soybean disease resistance = Resistant / Moderately
+  Resistant / Susceptible (not 1-9).
+- `Rps1c`, `Rps3a`, `Rps1k`, etc. = specific Phytophthora resistance
+  gene present.
+- `R1`, `R3` (under SOYBEAN CYST NEMATODE) = effective against
+  SCN race 1 / race 3.
+- `A`, `B`, `C` under HERBICIDE sensitivity = grade letters where A
+  is most tolerant.
+
+---
+
+## maturity-semantics
+
+Maturity is encoded differently per crop. Don't conflate the units.
+
+**Corn — Relative maturity (RM days)**: integer roughly 75-120.
+Lower = shorter season, suitable for higher latitudes / shorter
+growing windows. 110 RM is a Central Iowa default; 85 RM suits
+northern Minnesota or short-season silage; 115+ RM fits southern
+Indiana / southern Illinois / Missouri Delta. The number is
+**Pioneer-style RM days**, normalized across the industry.
+
+**Soybeans — Maturity group (MG)**: float 00 (zero-zero) to 9.0
+expressed with one decimal. A "3.5 MG" soybean is for central
+Iowa. Northern North Dakota / Minnesota plant 0.0–1.5 MG. Mid-South
+plants 5.0+. Each tenth of an MG ≈ 7-10 days of additional season.
+Sidecar field: `maturity_group` (e.g. "3.5", "0.7").
+
+**Wheat — Class + heading**: Winter / spring decision is separate
+from "class" (HRW / HRS / SRW / SWW / SWS / durum):
+- HRW = Hard Red Winter — Plains states bread wheat
+- HRS = Hard Red Spring — Northern Plains, North Dakota, Montana
+- SRW = Soft Red Winter — Eastern Corn Belt, Ohio Valley
+- SWW = Soft White Winter — Pacific Northwest
+- SWS = Soft White Spring — Pacific Northwest
+- Durum — North Dakota / Montana, pasta wheat
+Maturity is qualitative: Early / Medium-Early / Medium / Medium-Late / Late.
+**WestBred's product page JSON does not always expose the wheat class
+as a structured field** — sometimes it's only in the marketing
+narrative (e.g. "WB1376CLP is a Soft White Winter Clearfield® Plus
+Wheat variety"). Read `positioning_statement` carefully when the
+sidecar's `wheat_class` is null.
+
+---
+
+## trait-glossary
+
+Common trait codes that appear in `trait_stack`:
+
+**Corn:**
+- `SSRIB` — SmartStax® RIB Complete® corn blend (above + below-ground
+  insect protection + Roundup Ready + LibertyLink, with refuge-in-bag)
+- `VT2PRIB` — VT Double PRO® RIB Complete® (above-ground insect
+  protection + Roundup Ready, refuge-in-bag)
+- `VT4PRIB` — VT4 PRO® RIB Complete® (newer above-ground protection)
+- `Trecepta` — Trecepta® (Trecepta + Roundup Ready + LibertyLink, for
+  earworm + western bean cutworm pressure)
+- `SmartStax PRO` — SmartStax® PRO® (RNAi corn rootworm)
+- `PowerCore` — PowerCore® Refuge Advanced (older above-ground stack)
+- `Conventional` — no biotech traits (organic / specialty channels)
+
+**Soybeans:**
+- `XF` — XtendFlex® (Roundup Ready 2 Xtend + dicamba + glufosinate)
+- `Xtend` — Roundup Ready 2 Xtend® (dicamba + glyphosate)
+- `RR2Y` — Roundup Ready 2 Yield® (glyphosate only)
+- `E3` — Enlist E3® (2,4-D + glyphosate + glufosinate)
+- `LL/LL+GT27` — LibertyLink® / LibertyLink + GT27 (glufosinate +
+  glyphosate + isoxaflutole)
+- `Conkesta E3` — Bt-stack for caterpillar pressure (BR/AR markets)
+- `SR` — SR® (sulfonylurea-tolerant, Asgrow-specific)
+
+**Wheat:**
+- `Clearfield` / `CLP` — Clearfield® / Clearfield® Plus (imazamox
+  tolerance)
+- `CoAXium` — CoAXium® (quizalofop tolerance) — note: AgriPro's
+  catalog flag, NOT in the WestBred corpus.
+
+Always render the full trait name (`trait_descriptions`) when telling
+the farmer "this variety has X trait" — bare trait codes are
+ambiguous in print.
+
+---
+
+## scn-resistance
+
+Soybean Cyst Nematode resistance ratings are critical for fields
+with SCN pressure (most of the Corn Belt). Read carefully:
+
+- `R3` under SOYBEAN CYST NEMATODE = Resistant to race 3 (the most
+  common race nationally). Most "SCN-resistant" soybeans on the
+  market are R3.
+- `R1, R3` = Resistant to both race 1 AND race 3. Higher value;
+  useful in long-rotation SCN fields where race shifts have occurred.
+- `MR3` = Moderately Resistant to race 3. Some yield loss expected
+  under high SCN pressure.
+- `S` = Susceptible.
+- Some Bayer Asgrow XF lines (e.g. AG29XF4) use **Peking-type SCN
+  resistance**, which is genetically distinct from the more common
+  PI 88788 source. Peking is more durable when SCN populations
+  have eroded PI 88788 effectiveness. Look for "Peking type" in the
+  positioning statement.
+
+**Recommended workflow when a farmer asks about SCN:** call
+`search_docs` with the user's MG range + "SCN-resistant", then
+`lookup_variety` on the top 2-3 candidates to verify the exact race
+coverage and resistance source.
+
+---
+
+## regional-listings
+
+The `regional_recommendations` array in each sidecar is sourced from
+Bayer's "local profiles" — varieties get assigned to regional Seed
+Guide bundles (e.g. *"2026 Washington, Oregon, SEED GUIDE"*) with a
+named regional agronomist contact. This is the closest signal we have
+to *"is this variety recommended for the farmer's geography?"* but
+note:
+
+- A variety being absent from a regional listing **does not** mean
+  it's unsuitable — Bayer's local agronomists curate these lists.
+- Listings are vendor-side recommendations, not third-party trial
+  data.
+- When the farmer mentions a region, try filtering or scanning for
+  varieties whose `regional_recommendations[].product_list_name`
+  mentions that region.
+
+Other vendors handle regional placement differently. Golden Harvest
+publishes a separate "plot report" system per state/year/site;
+NK publishes ratings as PDF tech sheets without regional flags.
+
+---
+
+## sources-not-yet-indexed
+
+These vendors are planned but not yet in the corpus. Don't assume
+their data is present:
+
+- **Golden Harvest (Syngenta)** — ~175 varieties, sitemap-driven
+  scrape pending.
+- **NK (Syngenta)** — 29 varieties.
+- **AgriPro (Syngenta wheat)** — 24 wheat varieties (HRW, HRS, HWS,
+  SWW, SWS). The only wheat coverage we expect to have outside
+  WestBred.
+- **Beck's PFR (research)** — 2,089 head-to-head trial documents.
+  Different shape from variety records — these are studies, not
+  hybrids.
+- **Beck's products** — 860 products. Identity-only (SeedIQ login
+  gates the ratings).
+
+If `list_versions()` doesn't show a vendor in the `vendor` facet, the
+corpus does not have it yet. Direct the farmer to that vendor's
+public catalog or their seed dealer.
+
+---
+
+## checking-your-work
+
+Before quoting a specific number to a farmer, **always** call
+`lookup_variety(source_key=...)` to confirm. The chunk text inside a
+search_docs response is a faithful render of the sidecar, but the
+sidecar IS the source of truth. Quoting from the canonical sidecar
+makes you robust against:
+
+- Chunk-text formatting bugs (e.g. a rare unicode issue trimming a
+  value).
+- Future chunker changes (a re-index might rewrite the body).
+- Cross-vendor scale-direction differences (the sidecar's
+  `_scale_direction` lets you state the convention explicitly).
+
+If `lookup_variety` returns "not found" but `search_docs` surfaced the
+chunk, that's a bug — please report it. (In normal operation, every
+chunk's `source_key` round-trips to a valid sidecar.)
diff --git a/docs_mcp/server.py b/docs_mcp/server.py
index cce034bb..e33eb7e8 100644
--- a/docs_mcp/server.py
+++ b/docs_mcp/server.py
@@ -369,6 +369,40 @@ def _structured_ratings_block(sidecar: dict) -> str:
     return "\n".join(lines).rstrip() + "\n"
 
 
+# ---------------------------------------------------------------------------
+# Curated lessons — docs_mcp/lessons.md is the canonical source.
+# ---------------------------------------------------------------------------
+LESSONS_FILE = Path(__file__).resolve().parent / "lessons.md"
+_lessons_cache: list[tuple[str, str]] | None = None
+
+
+def _load_lessons() -> list[tuple[str, str]]:
+    """Parse lessons.md into ``[(slug, body), ...]`` sections.
+
+    Sections are delimited by ``## <slug>`` headings. The slug is the
+    `<slug>` token (whitespace stripped); the body is everything between
+    that heading and the next ``## `` heading (or EOF).
+    """
+    global _lessons_cache
+    if _lessons_cache is not None:
+        return _lessons_cache
+    out: list[tuple[str, str]] = []
+    if not LESSONS_FILE.exists():
+        _lessons_cache = out
+        return out
+    text = LESSONS_FILE.read_text(encoding="utf-8")
+    parts = re.split(r"(?m)^## (.+)$", text)
+    # parts = [preamble, slug1, body1, slug2, body2, ...]
+    for i in range(1, len(parts), 2):
+        slug = parts[i].strip()
+        body = parts[i + 1] if i + 1 < len(parts) else ""
+        # Drop trailing horizontal rule that separates sections.
+        body = re.sub(r"\n---\s*$", "", body).strip()
+        out.append((slug, body))
+    _lessons_cache = out
+    return out
+
+
 # ===========================================================================
 # Tools
 # ===========================================================================
@@ -711,6 +745,78 @@ def lookup_variety(
         return "\n".join(out)
 
 
+@mcp.tool()
+def crop_seed_api_lessons(
+    topic: Annotated[
+        str | None,
+        Field(description=(
+            "OPTIONAL topic — match against lesson section slugs or body "
+            "(substring, case-insensitive). Known slugs: pioneer, "
+            "rating-scales, maturity-semantics, trait-glossary, "
+            "scn-resistance, regional-listings, sources-not-yet-indexed, "
+            "checking-your-work. Omit for the full curated index."
+        )),
+    ] = None,
+) -> str:
+    """Curated knowledge that does NOT live in the scraped corpus —
+    vendor scale-direction notes, trait glossary, maturity semantics,
+    SCN resistance interpretation, the **Pioneer fallback policy**,
+    and rules for fact-checking your work.
+
+    Call this tool when:
+
+    * The user asks about **Pioneer** or any P-series hybrid — Pioneer
+      is intentionally NOT scraped (ToS bans it); the lesson tells you
+      what to say instead.
+    * You need to compare ratings across vendors — different vendors
+      publish on different scale directions.
+    * You're parsing a trait code or disease abbreviation you don't
+      recognize.
+    * Before quoting a specific rating value to a farmer — the
+      ``checking-your-work`` lesson reminds you to call
+      ``lookup_variety`` to confirm.
+
+    This tool is **the only source of opinionated content** in the
+    server. Everything else returned by search_docs / get_page /
+    lookup_variety is verbatim from vendor catalogs.
+    """
+    with TimedCall("crop_seed_api_lessons", {"topic": topic}) as _call:
+        sections = _load_lessons()
+        if not sections:
+            _call.set(sections_returned=0)
+            return "_(no lessons file present — docs_mcp/lessons.md missing)_"
+
+        if not topic:
+            _call.set(sections_returned=len(sections))
+            return "\n\n---\n\n".join(
+                f"## {slug}\n\n{body}" for slug, body in sections
+            )
+
+        needle = topic.strip().lower()
+        # Prefer slug matches (most specific). Fall back to body match
+        # only when no slug matches — keeps a query like "rating" from
+        # returning every section that happens to mention the word.
+        slug_matches: list[tuple[str, str]] = []
+        body_matches: list[tuple[str, str]] = []
+        for slug, body in sections:
+            if needle in slug.lower():
+                slug_matches.append((slug, body))
+            elif needle in body.lower():
+                body_matches.append((slug, body))
+        matched = slug_matches if slug_matches else body_matches
+
+        _call.set(sections_returned=len(matched), topic=topic)
+        if not matched:
+            slugs = ", ".join(s for s, _ in sections)
+            return (
+                f"_(no lesson section matched topic '{topic}'. "
+                f"Available slugs: {slugs}.)_"
+            )
+        return "\n\n---\n\n".join(
+            f"## {slug}\n\n{body}" for slug, body in matched
+        )
+
+
 # ===========================================================================
 # Entry point
 # ===========================================================================
-- 
2.52.0