From 84b49d8360617cb0af51408a8c9d01017038c249 Mon Sep 17 00:00:00 2001
From: Justin Paul <justin@jpaul.me>
Date: Mon, 25 May 2026 15:22:08 -0400
Subject: [PATCH] trial data: workflow scrape steps + lessons.md trial-data
 guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

.gitea/workflows/refresh.yml — add scrape steps for the new trial
sources (agripro_trials, gh_plot_reports) so the monthly cron
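refreshes them alongside the variety sources. gh_plot_reports
is the heaviest single source (~4,600 docs @ 1 req/sec ≈ 77 min),
so it runs late: a failure in an earlier step then aborts the job
before that time is spent. The refresh commit message now also
reports the agripro_trials and gh_plot_reports corpus counts,
and the Beck's PFR step comment now notes that the scraper is a
deferred stub.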

docs_mcp/lessons.md — new "trial-data" section telling the agent:

- The two surfaces (search_docs = identity, search_trials = perf)
  are complementary; how to route a farmer question to each.
- What's indexed (GH plot reports cross-vendor, AgriPro regional
  PDFs) vs what's not (Bayer per-variety trials, NK yield results,
  Pioneer, university extension trials).
- Recommended workflow: search_trials → identify top performers →
  lookup_variety on each to verify identity → don't fabricate.
- How to read a GH plot report (column headers vary by crop:
  corn/soy use Yield/MST/Test Weight, silage uses Ton/Acre +
  Milk + Beef columns).
- Single-data-point caveat: one plot is one cooperator's field;
  look across multiple plots for a robust recommendation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
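Note on the gh_plot_reports runtime figure: it is a back-of-the-envelope
lower bound. A minimal sketch of the arithmetic, assuming one HTTP request
per document and ignoring parse time and retries; `scrape_minutes` is a
hypothetical helper for illustration, not a function in `scrape.runner`:

```python
def scrape_minutes(doc_count: int, requests_per_second: float) -> float:
    """Lower-bound wall time in minutes for one request per document."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    # seconds = documents / rate; divide by 60 for minutes
    return doc_count / requests_per_second / 60.0


# ~4,600 plot reports at the self-imposed 1 req/sec limit
print(f"{scrape_minutes(4600, 1.0):.1f} min")  # prints "76.7 min"
```

Actual CI time will be somewhat higher once parsing and network latency are
included, which is the reason for scheduling this step late.
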
 .gitea/workflows/refresh.yml | 19 ++++++++--
 docs_mcp/lessons.md          | 73 ++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+), 3 deletions(-)
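
Note on the recommended workflow in the new lessons.md section
(search_trials, pick top performers, lookup_variety, don't fabricate): it
could be sketched roughly as below. The tool names and the crop / state /
year / source_key arguments come from the guide; the return shapes (a list
of plots, each with `rows` carrying `rank` and `source_key`), the
`min_plots` threshold, and all demo data are assumptions made purely for
illustration.

```python
from collections import defaultdict
from statistics import mean


def shortlist(search_trials, lookup_variety, crop, state, year,
              min_plots=2, top_n=3):
    """Rank products across several plots, then verify each one's identity."""
    plots = search_trials(crop=crop, state=state, year=year)

    # Collect every rank a product achieved, one entry per plot.
    ranks = defaultdict(list)
    for plot in plots:
        for row in plot["rows"]:
            ranks[row["source_key"]].append(row["rank"])

    # A single plot is one data point: keep products seen in enough plots,
    # ordered by mean rank (ties broken by key for determinism).
    candidates = sorted(
        (key for key, seen in ranks.items() if len(seen) >= min_plots),
        key=lambda key: (mean(ranks[key]), key),
    )

    results = []
    for key in candidates[:top_n]:
        # identity stays None when the variety is not in the corpus;
        # the caller should say so rather than invent traits or ratings.
        results.append({
            "source_key": key,
            "mean_rank": mean(ranks[key]),
            "plots": len(ranks[key]),
            "identity": lookup_variety(source_key=key),
        })
    return results


# Hypothetical stand-ins for the MCP tools, with made-up data.
def fake_search_trials(crop, state, year):
    return [
        {"rows": [{"source_key": "demo-a", "rank": 1},
                  {"source_key": "demo-b", "rank": 2},
                  {"source_key": "demo-c", "rank": 3}]},
        {"rows": [{"source_key": "demo-b", "rank": 1},
                  {"source_key": "demo-a", "rank": 2}]},
        {"rows": [{"source_key": "demo-a", "rank": 1},
                  {"source_key": "demo-d", "rank": 2}]},
    ]


def fake_lookup_variety(source_key):
    catalog = {"demo-a": {"rm": 105, "traits": ["demo-trait"]}}
    return catalog.get(source_key)


results = shortlist(fake_search_trials, fake_lookup_variety,
                    crop="corn", state="IA", year=2025)
for entry in results:
    status = "verified" if entry["identity"] else "identity not in corpus"
    print(entry["source_key"], entry["plots"], status)
```

Passing the two tools in as callables keeps the sketch independent of how
the MCP client is actually wired up.
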

diff --git a/.gitea/workflows/refresh.yml b/.gitea/workflows/refresh.yml
index faf6a8d2..ec3ca06e 100644
--- a/.gitea/workflows/refresh.yml
+++ b/.gitea/workflows/refresh.yml
@@ -83,10 +83,21 @@ jobs:
         if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
         run: python -m scrape.runner --source agripro --force
 
+      - name: Scrape AgriPro regional trial PDFs
+        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro_trials') }}
+        run: python -m scrape.runner --source agripro_trials --force
+
+      - name: Scrape Golden Harvest plot reports (cross-vendor yield trials)
+        if: ${{ inputs.sources == '' || contains(inputs.sources, 'gh_plot_reports') }}
+        # Heaviest single source: ~4,600 docs at 1 req/sec ≈ 77 min.
+        # Accounts for the bulk of CI time; runs late so that a failure
+        # in an earlier step aborts the job before this time is spent.
+        run: python -m scrape.runner --source gh_plot_reports --force
+
       - name: Scrape Beck's PFR research corpus
         if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
-        # Heaviest source — ~2,089 docs via public Sanity GROQ.
-        # No auth, but rate-limit ourselves to be polite.
+        # Deferred (returns 0 cleanly from a stub) — implementation
+        # pending. Public Sanity GROQ at mc8v24rf.api.sanity.io.
         run: python -m scrape.runner --source becks_pfr --force
 
       # ---- Commit corpus changes + retry-on-race -----------------
@@ -107,8 +118,10 @@ jobs:
           n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
           n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
           n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
+          n_agt=$(find corpus/agripro_trials -name '*.json' 2>/dev/null | wc -l)
+          n_ghpr=$(find corpus/gh_plot_reports -name '*.json' 2>/dev/null | wc -l)
           n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
-          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
+          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} ag_trials=${n_agt} gh_plot_reports=${n_ghpr} pfr=${n_pfr}"
           attempt=1
           while [ $attempt -le 3 ]; do
             if git push; then
diff --git a/docs_mcp/lessons.md b/docs_mcp/lessons.md
index c2595f35..e4139fe7 100644
--- a/docs_mcp/lessons.md
+++ b/docs_mcp/lessons.md
@@ -252,6 +252,79 @@ public catalog or their seed dealer.
 
 ---
 
+## trial-data
+
+The MCP exposes TWO complementary surfaces:
+
+* **`search_docs`** — variety IDENTITY (what a hybrid IS):
+  disease ratings, trait stack, maturity, vendor positioning.
+* **`search_trials`** — variety PERFORMANCE (how it ACTUALLY did):
+  ranked yield at specific cooperator fields and regions.
+
+**Indexed trial sources**:
+
+- **Golden Harvest plot reports** (~4,600 cross-vendor head-to-head
+  trials, 2024 and 2025). Each trial is one cooperator's field in a
+  given state and year, comparing products from multiple brands
+  (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel, etc.)
+  side by side. **This is the closest thing to independent
+  comparison data the corpus has** — Bayer doesn't publish its own
+  trial data, but GH publishes plots where DEKALB hybrids appear
+  alongside their competitors.
+- **AgriPro regional trial PDFs** (14 PDFs) — multi-year
+  multi-location wheat performance for Northern Plains / Pacific
+  Northwest / Plains regions. Variety + per-location yields
+  preserved verbatim.
+
+**Recommended workflow when a farmer asks about performance**:
+
+1. Call `search_trials(crop, state, year, ...)` to find trials
+   from the relevant region/season.
+2. Identify the top performers in the rankings.
+3. Call `lookup_variety(source_key=...)` for each leading hybrid to
+   verify identity (RM, traits, disease ratings) — confirm the
+   variety actually fits the farmer's situation, not just that it
+   won someone else's trial.
+4. If the leading variety is from a brand whose trial data isn't
+   directly published (e.g. DEKALB), the GH plot reports often
+   show it competing — that's still the agent's best public
+   third-party signal.
+
+**Trial data NOT in the corpus** (don't fabricate):
+
+- **DEKALB / Asgrow / Channel** per-variety yield trials —
+  Bayer keeps these in rep tools, not on the public catalog. The
+  GH plot reports surface DEKALB/Asgrow performance indirectly,
+  but per-variety dedicated trials aren't indexed.
+- **NK yield results** — the data exists at
+  `syngenta-us.com/nk/yield-results` but the ASMX endpoint is
+  fiddly; not yet scraped. The variety identity is in the corpus
+  (`search_docs` finds it), just not the per-region trial yields.
+- **Pioneer trials** — ToS bans automation, so we have neither
+  variety identity nor trial data. Direct the farmer to a
+  Pioneer dealer.
+- **University extension trials** (Iowa State, Illinois,
+  Purdue, etc.) are third-party trials that include Pioneer
+  alongside competitors. Not in the corpus today; could be added
+  in a future enrichment.
+
+**Reading a GH plot report**:
+
+Each plot has a cooperator name (the farmer running the trial), a
+state, a year, planting/harvest dates, population, row width, and a
+ranked table of products. The columns vary by crop:
+
+- **Corn / Soy**: Rank | Brand | Product | Traits | Yield BU/Ac
+  | %MST | Test Weight | Gross Revenue
+- **Silage**: Rank | Brand | Product | Traits | Ton/Acre
+  | Milk Per Acre | Milk Per Ton | Beef Per Acre | Beef Per Ton
+
+Rank 1 = top performer at that site/year. Note that a single plot
+is one data point — for a robust recommendation, look across
+multiple plots from the same region.
+
+---
+
 ## checking-your-work
 
 Before quoting a specific number to a farmer, **always** call
-- 
2.52.0