2026-05-25 15:22:26 -04:00
2 changed files with 89 additions and 3 deletions
@@ -83,10 +83,21 @@ jobs:
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
        run: python -m scrape.runner --source agripro --force

+      - name: Scrape AgriPro regional trial PDFs
+        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro_trials') }}
+        run: python -m scrape.runner --source agripro_trials --force
+
+      - name: Scrape Golden Harvest plot reports (cross-vendor yield trials)
+        if: ${{ inputs.sources == '' || contains(inputs.sources, 'gh_plot_reports') }}
+        # Heaviest single source: ~4,600 docs at 1 req/sec ≈ 77 min.
+        # Accounts for the bulk of CI time; runs late so a failure in an
+        # earlier step surfaces before this ~77 min scrape starts.
+        run: python -m scrape.runner --source gh_plot_reports --force
+
      - name: Scrape Beck's PFR research corpus
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
-        # Heaviest source — ~2,089 docs via public Sanity GROQ.
-        # No auth, but rate-limit ourselves to be polite.
+        # Deferred (returns 0 cleanly from a stub) — implementation
+        # pending. Public Sanity GROQ at mc8v24rf.api.sanity.io.
        run: python -m scrape.runner --source becks_pfr --force

      # ---- Commit corpus changes + retry-on-race -----------------
@@ -107,8 +118,10 @@ jobs:
          n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
          n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
          n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
+          n_agt=$(find corpus/agripro_trials -name '*.json' 2>/dev/null | wc -l)
+          n_ghpr=$(find corpus/gh_plot_reports -name '*.json' 2>/dev/null | wc -l)
          n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
-          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
+          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} ag_trials=${n_agt} gh_plot_reports=${n_ghpr} pfr=${n_pfr}"
          attempt=1
          while [ $attempt -le 3 ]; do
            if git push; then
@@ -252,6 +252,79 @@ public catalog or their seed dealer.

 ---

+## trial-data
+
+The MCP exposes TWO complementary surfaces:
+
+* **`search_docs`** — variety IDENTITY (what a hybrid IS):
+  disease ratings, trait stack, maturity, vendor positioning.
+* **`search_trials`** — variety PERFORMANCE (how it ACTUALLY did):
+  ranked yield at specific cooperator fields and regions.
+
+**Indexed trial sources**:
+
+- **Golden Harvest plot reports** (~4,600 cross-vendor head-to-head
+  trials, 2024+2025). Each trial = one cooperator's field at a
+  specific state/year, comparing products from multiple brands
+  (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel, etc.)
+  side by side. **This is the closest thing to independent
+  comparison data the corpus has** — Bayer doesn't publish its own
+  trial data, but GH publishes plots where DEKALB hybrids appear
+  alongside their competitors.
+- **AgriPro regional trial PDFs** (14 PDFs) — multi-year
+  multi-location wheat performance for Northern Plains / Pacific
+  Northwest / Plains regions. Variety + per-location yields
+  preserved verbatim.
+
+**Recommended workflow when a farmer asks about performance**:
+
+1. Call `search_trials(crop, state, year, ...)` to find trials
+   from the relevant region/season.
+2. Identify the top performers in the rankings.
+3. Call `lookup_variety(source_key=...)` for each leading hybrid to
+   verify identity (RM, traits, disease ratings) — confirm the
+   variety actually fits the farmer's situation, not just that it
+   won someone else's trial.
+4. If the leading variety is from a brand whose trial data isn't
+   directly published (e.g. DEKALB), the GH plot reports often
+   show it competing — that's still the agent's best public
+   third-party signal.
+
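The workflow above could be sketched roughly as follows. This is a minimal illustration, not the MCP's actual client code: `search_trials` and `lookup_variety` are passed in as plain callables, and the return shapes (`rows`, `rank`, `product`, `source_key`) and the `top_n` cutoff are assumptions made for the example.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def shortlist(
    search_trials: Callable[..., Iterable[dict]],
    lookup_variety: Callable[..., Optional[dict]],
    crop: str,
    state: str,
    year: int,
    top_n: int = 3,
) -> list[dict]:
    """Find trials, take each trial's leaders, then look up their identity.

    Assumed (hypothetical) shapes: each trial is a dict with a "rows" list,
    and each row carries "rank", "product" and "source_key".
    """
    seen: set[str] = set()
    candidates: list[dict] = []
    # Step 1: trials from the relevant region and season.
    for trial in search_trials(crop=crop, state=state, year=year):
        # Step 2: the top performers in this trial's ranking.
        leaders = sorted(trial["rows"], key=lambda row: row["rank"])[:top_n]
        for row in leaders:
            key = row["source_key"]
            if key in seen:
                continue  # already looked up from an earlier trial
            seen.add(key)
            # Step 3: verify identity. None means the corpus has no identity
            # record, so nothing about the variety should be asserted.
            identity = lookup_variety(source_key=key)
            candidates.append(
                {"product": row["product"], "source_key": key, "identity": identity}
            )
    return candidates
```

Passing the tools in as arguments keeps the sketch independent of how the MCP client is actually wired up. Entries whose `identity` is `None` are kept rather than dropped, so the caller can tell the farmer that the variety placed well but could not be verified.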
+**Trial data NOT in the corpus** (don't fabricate):
+
+- **DEKALB / Asgrow / Channel** per-variety yield trials —
+  Bayer keeps these in rep tools, not on the public catalog. The
+  GH plot reports surface these brands' performance indirectly,
+  but dedicated per-variety trials aren't indexed.
+- **NK yield results** — the data exists at
+  `syngenta-us.com/nk/yield-results`, but the ASMX endpoint is
+  awkward to scrape and isn't indexed yet. NK variety identity is
+  in the corpus (`search_docs` finds it); only the per-region
+  trial yields are missing.
+- **Pioneer trials** — ToS bans automation, so we have neither
+  variety identity nor trial data. Direct the farmer to a
+  Pioneer dealer.
+- **University extension trials** (Iowa State, Illinois,
+  Purdue, etc.): third-party trials whose published results
+  include Pioneer alongside competitors. Not in the corpus today;
+  could be added in a future enrichment.
+
+**Reading a GH plot report**:
+
+Each plot has a cooperator name (the farmer running the trial), a
+state, a year, planting/harvest dates, population, row width, and a
+ranked table of products. The columns vary by crop:
+
+- **Corn / Soy**: Rank | Brand | Product | Traits | Yield BU/Ac
+  | %MST | Test Weight | Gross Revenue
+- **Silage**: Rank | Brand | Product | Traits | Ton/Acre
+  | Milk Per Acre | Milk Per Ton | Beef Per Acre | Beef Per Ton
+
+Rank 1 = top performer at that site/year. Note that a single plot
+is one data point — for a robust recommendation, look across
+multiple plots from the same region.
+
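One way to look across multiple plots is to average each product's yield and ignore products that appear in too few plots. The sketch below assumes hypothetical field names (`rows`, `brand`, `product`, `yield_bu_ac`) and an arbitrary `min_plots` threshold; none of these come from the actual plot-report schema.

```python
from collections import defaultdict
from statistics import mean


def summarize_plots(plots, min_plots=3):
    """Average each product's yield across plots.

    Products seen in fewer than `min_plots` plots are dropped, since a
    single plot is only one data point. Field names are assumptions.
    """
    yields = defaultdict(list)
    for plot in plots:
        for row in plot["rows"]:
            yields[(row["brand"], row["product"])].append(row["yield_bu_ac"])

    summary = [
        {"brand": brand, "product": product, "plots": len(vals), "mean_yield": mean(vals)}
        for (brand, product), vals in yields.items()
        if len(vals) >= min_plots
    ]
    # Highest average yield first.
    return sorted(summary, key=lambda s: s["mean_yield"], reverse=True)
```

A plain mean is only a rough signal: plots differ in soil, weather, and which products were entered, so products that share few plots are not being compared under the same conditions.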
+---
+
 ## checking-your-work

 Before quoting a specific number to a farmer, **always** call