trial data: workflow scrape steps + lessons.md trial-data guide

.gitea/workflows/refresh.yml — add scrape steps for the new trial
sources (agripro_trials, gh_plot_reports) so the monthly cron
refreshes them alongside the variety sources. gh_plot_reports
is the heaviest single source (~4,600 docs @ 1 req/sec ≈ 77 min), so it
runs late in the job: if an earlier step fails, the job stops before that
time is spent.
Commit-message variable count expanded to surface the trial counts.

docs_mcp/lessons.md — new "trial-data" section telling the agent:

- The two surfaces (search_docs = identity, search_trials = perf)
  are complementary; how to route a farmer question to each.
- What's indexed (GH plot reports cross-vendor, AgriPro regional
  PDFs) vs what's not (Bayer per-variety trials, NK yield results,
  Pioneer, university extension trials).
- Recommended workflow: search_trials → identify top performers →
  lookup_variety on each to verify identity → don't fabricate.
- How to read a GH plot report (per-column headers vary by crop:
  corn/soy use Yield/MST/Test Weight, silage uses Ton/Acre +
  Milk + Beef columns).
- Single-data-point caveat: one plot is one cooperator's field;
  look across multiple plots for a robust recommendation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:22:08 -04:00
parent 17260c32c8
commit 84b49d8360
2 changed files with 89 additions and 3 deletions
@@ -252,6 +252,79 @@ public catalog or their seed dealer.
---
## trial-data
The MCP exposes TWO complementary surfaces:
* **`search_docs`** — variety IDENTITY (what a hybrid IS):
disease ratings, trait stack, maturity, vendor positioning.
* **`search_trials`** — variety PERFORMANCE (how it ACTUALLY did):
ranked yield at specific cooperator fields and regions.
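As a rough illustration of the routing idea, the sketch below picks a tool from keywords in the question. Only the tool names `search_docs` and `search_trials` come from this guide; the `route_question` helper and both cue lists are hypothetical and not part of the MCP server.

```python
# Hypothetical keyword router. The cue lists are illustrative assumptions;
# a real agent would decide from the full question, not from substrings.
PERFORMANCE_CUES = ("yield", "trial", "plot", "rank", "outperform", "bu/ac")
IDENTITY_CUES = ("disease", "trait", "maturity", "rating", "tolerance")


def route_question(question: str) -> list[str]:
    """Return the tool names to call, in order, for a farmer question."""
    text = question.lower()
    wants_performance = any(cue in text for cue in PERFORMANCE_CUES)
    wants_identity = any(cue in text for cue in IDENTITY_CUES)

    tools = []
    if wants_performance:
        tools.append("search_trials")
    if wants_identity or not wants_performance:
        # With no performance cue at all, fall back to the identity surface.
        tools.append("search_docs")
    return tools
```

A question that mixes both concerns ("how did it yield, and what are its disease ratings?") routes to both tools, trials first, which matches the complementary framing above.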
**Indexed trial sources**:
- **Golden Harvest plot reports** (~4,600 cross-vendor head-to-head
trials, 2024+2025). Each trial = one cooperator's field at a
specific state/year, comparing products from multiple brands
(NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel, etc.)
side by side. **This is the closest thing to independent
comparison data the corpus has** — Bayer doesn't publish its own
trial data, but GH publishes plots where DEKALB hybrids appear
alongside their competitors.
- **AgriPro regional trial PDFs** (14 PDFs) — multi-year
multi-location wheat performance for Northern Plains / Pacific
Northwest / Plains regions. Variety + per-location yields
preserved verbatim.
**Recommended workflow when a farmer asks about performance**:
1. Call `search_trials(crop, state, year, ...)` to find trials
from the relevant region/season.
2. Identify the top performers in the rankings.
3. Call `lookup_variety(source_key=...)` for each leading hybrid to
verify identity (RM, traits, disease ratings) — confirm the
variety actually fits the farmer's situation, not just that it
won someone else's trial.
4. If the leading variety is from a brand whose trial data isn't
directly published (e.g. DEKALB), the GH plot reports often
show it competing — that's still the agent's best public
third-party signal.
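The steps above can be sketched as a small loop. This is a minimal sketch under stated assumptions: the parameter names (`crop`, `state`, `year`, `source_key`) come from this guide, but the stub functions, their return shapes, and every sample row are invented for illustration and do not reflect the real tool responses.

```python
# Stand-in stubs for the MCP tools. The return shapes and all sample
# rows are made up; only the parameter names come from the guide.
FAKE_TRIALS = [
    {"rank": 2, "brand": "Brand B", "product": "B-202", "source_key": "b-202"},
    {"rank": 1, "brand": "Brand A", "product": "A-101", "source_key": "a-101"},
    {"rank": 3, "brand": "Brand A", "product": "A-303", "source_key": None},
]
FAKE_VARIETIES = {
    "a-101": {"rm": 105, "traits": "hypothetical stack"},
}


def search_trials(crop, state, year):
    """Stub: pretend every query returns the same ranked rows."""
    return FAKE_TRIALS


def lookup_variety(source_key):
    """Stub: return identity data, or None when the variety is not indexed."""
    return FAKE_VARIETIES.get(source_key)


def recommend(crop, state, year, top_n=2):
    """Rank trial rows, then verify each leader's identity before reporting."""
    rows = sorted(search_trials(crop, state, year), key=lambda r: r["rank"])
    results = []
    for row in rows[:top_n]:
        key = row["source_key"]
        identity = lookup_variety(key) if key else None
        results.append({
            "product": row["product"],
            "rank": row["rank"],
            "identity": identity,
            # Unverified leaders are flagged, never filled in by guessing.
            "verified": identity is not None,
        })
    return results
```

Carrying an explicit `verified` flag keeps the "don't fabricate" rule visible in the output: a top-ranked product with no identity record is reported as unverified rather than described from memory.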
**Trial data NOT in the corpus** (don't fabricate):
- **DEKALB / Asgrow / Channel** per-variety yield trials —
Bayer keeps these in rep tools, not on the public catalog. The
GH plot reports surface DEKALB/Asgrow performance indirectly,
but per-variety dedicated trials aren't indexed.
- **NK yield results** — the data exists at
`syngenta-us.com/nk/yield-results` but the ASMX endpoint is
fiddly; not yet scraped. The variety identity is in the corpus
(`search_docs` finds it), just not the per-region trial yields.
- **Pioneer trials** — ToS bans automation, so we have neither
variety identity nor trial data. Direct the farmer to a
Pioneer dealer.
- **University extension trials** (Iowa State, Illinois,
Purdue, etc.) — third-party trials that report Pioneer results
alongside competitors. Not in the corpus today; could be added in a
future enrichment.
**Reading a GH plot report**:
Each plot has a cooperator name (the farmer running the trial), a
state, a year, planting/harvest dates, population, row width, and a
ranked table of products. The columns vary by crop:
- **Corn / Soy**: Rank | Brand | Product | Traits | Yield BU/Ac
| %MST | Test Weight | Gross Revenue
- **Silage**: Rank | Brand | Product | Traits | Ton/Acre
| Milk Per Acre | Milk Per Ton | Beef Per Acre | Beef Per Ton
Rank 1 = top performer at that site/year. Note that a single plot
is one data point — for a robust recommendation, look across
multiple plots from the same region.
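To make the single-data-point caveat concrete, the sketch below averages the primary metric per product across several plots and flags products seen at too few sites. The column names follow the lists above; the dict layout, the `summarize` helper, the `min_plots` threshold, and all sample numbers are assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Primary ranking column per crop, as named in the guide above.
PRIMARY_METRIC = {"corn": "Yield BU/Ac", "soy": "Yield BU/Ac", "silage": "Ton/Acre"}

# Invented sample data: two plots, one product appearing in both.
SAMPLE_PLOTS = [
    {"cooperator": "Plot 1", "rows": [
        {"Product": "X1", "Yield BU/Ac": 240.0},
        {"Product": "Y2", "Yield BU/Ac": 232.0},
    ]},
    {"cooperator": "Plot 2", "rows": [
        {"Product": "X1", "Yield BU/Ac": 228.0},
    ]},
]


def summarize(plots, crop, min_plots=2):
    """Average each product's primary metric and flag thin evidence."""
    metric = PRIMARY_METRIC[crop]
    values = defaultdict(list)
    for plot in plots:
        for row in plot["rows"]:
            values[row["Product"]].append(row[metric])
    return {
        product: {
            "plots": len(vals),
            "mean": round(mean(vals), 1),
            # A product seen at fewer than min_plots sites is one data point.
            "robust": len(vals) >= min_plots,
        }
        for product, vals in values.items()
    }
```

With the sample data, `summarize(SAMPLE_PLOTS, "corn")` marks `X1` as robust (two plots) and `Y2` as not robust (one plot), even though their averages are close.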
---
## checking-your-work
Before quoting a specific number to a farmer, **always** call