trial data: workflow scrape steps + lessons.md trial-data guide #8

Merged
justin merged 1 commit from workflow-and-lessons-for-trials into main 2026-05-25 15:22:26 -04:00
2 changed files with 89 additions and 3 deletions
+16 -3
@@ -83,10 +83,21 @@ jobs:
if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
run: python -m scrape.runner --source agripro --force
- name: Scrape AgriPro regional trial PDFs
if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro_trials') }}
run: python -m scrape.runner --source agripro_trials --force
- name: Scrape Golden Harvest plot reports (cross-vendor yield trials)
if: ${{ inputs.sources == '' || contains(inputs.sources, 'gh_plot_reports') }}
# Heaviest single source — ~4,600 docs at 1 req/sec ≈ 77 min.
# Accounts for the bulk of CI time; runs late so a failure in an
# earlier source surfaces before the long scrape starts.
run: python -m scrape.runner --source gh_plot_reports --force
- name: Scrape Beck's PFR research corpus
if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
- # Heaviest source — ~2,089 docs via public Sanity GROQ.
- # No auth, but rate-limit ourselves to be polite.
+ # Deferred (returns 0 cleanly from a stub) — implementation
+ # pending. Public Sanity GROQ at mc8v24rf.api.sanity.io.
run: python -m scrape.runner --source becks_pfr --force
# ---- Commit corpus changes + retry-on-race -----------------
@@ -107,8 +118,10 @@ jobs:
n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
n_agt=$(find corpus/agripro_trials -name '*.json' 2>/dev/null | wc -l)
n_ghpr=$(find corpus/gh_plot_reports -name '*.json' 2>/dev/null | wc -l)
n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
- git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
+ git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} ag_trials=${n_agt} gh_plot_reports=${n_ghpr} pfr=${n_pfr}"
attempt=1
while [ $attempt -le 3 ]; do
if git push; then
+73
@@ -252,6 +252,79 @@ public catalog or their seed dealer.
---
## trial-data
The MCP exposes TWO complementary surfaces:
* **`search_docs`** — variety IDENTITY (what a hybrid IS):
disease ratings, trait stack, maturity, vendor positioning.
* **`search_trials`** — variety PERFORMANCE (how it ACTUALLY did):
ranked yield at specific cooperator fields and regions.
**Indexed trial sources**:
- **Golden Harvest plot reports** (~4,600 cross-vendor head-to-head
trials, 2024+2025). Each trial = one cooperator's field at a
specific state/year, comparing products from multiple brands
(NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel, etc.)
side by side. **This is the closest thing to independent
comparison data the corpus has** — Bayer doesn't publish its own
trial data, but GH publishes plots where DEKALB hybrids appear
alongside their competitors.
- **AgriPro regional trial PDFs** (14 PDFs) — multi-year
multi-location wheat performance for Northern Plains / Pacific
Northwest / Plains regions. Variety + per-location yields
preserved verbatim.
**Recommended workflow when a farmer asks about performance**:
1. Call `search_trials(crop, state, year, ...)` to find trials
from the relevant region/season.
2. Identify the top performers in the rankings.
3. Call `lookup_variety(source_key=...)` for each leading hybrid to
verify identity (RM, traits, disease ratings) — confirm the
variety actually fits the farmer's situation, not just that it
won someone else's trial.
4. If the leading variety is from a brand whose trial data isn't
directly published (e.g. DEKALB), the GH plot reports often
show it competing — that's still the agent's best public
third-party signal.
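A minimal sketch of steps 1 to 3 above, for illustration only. The `search_trials` and `lookup_variety` functions here are local stand-ins that return made-up rows; the real MCP tools may take different arguments and return different shapes. `TrialRow`, `shortlist`, the `HYBRID-A` / `HYBRID-B` product names, the RM values, and the `source_key` strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrialRow:
    rank: int
    brand: str
    product: str
    source_key: str      # hypothetical key accepted by lookup_variety
    yield_bu_ac: float


def search_trials(crop: str, state: str, year: int) -> list[list[TrialRow]]:
    """Stand-in for the MCP tool: one ranked table per matching plot."""
    return [
        [TrialRow(1, "DEKALB", "HYBRID-A", "dekalb/hybrid-a", 241.0),
         TrialRow(2, "NK", "HYBRID-B", "nk/hybrid-b", 236.5)],
        [TrialRow(1, "NK", "HYBRID-B", "nk/hybrid-b", 229.8),
         TrialRow(2, "DEKALB", "HYBRID-A", "dekalb/hybrid-a", 228.1)],
    ]


def lookup_variety(source_key: str) -> dict:
    """Stand-in for the MCP tool: identity facts for one variety."""
    catalog = {
        "dekalb/hybrid-a": {"rm": 112},
        "nk/hybrid-b": {"rm": 105},
    }
    return catalog[source_key]


def shortlist(crop, state, year, rm_min, rm_max, top_n=1):
    """Trial leaders whose relative maturity fits the farmer's window."""
    plots = search_trials(crop, state, year)                  # step 1
    leaders = {row.source_key for plot in plots
               for row in plot if row.rank <= top_n}          # step 2
    fits = []
    for key in sorted(leaders):                               # step 3
        variety = lookup_variety(key)
        if rm_min <= variety["rm"] <= rm_max:
            fits.append(key)
    return fits


print(shortlist("corn", "IA", 2025, rm_min=108, rm_max=114))
```

With the stub data, both hybrids win one plot each, but only one falls inside the 108 to 114 RM window, so winning a trial alone does not put a variety on the shortlist.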
**Trial data NOT in the corpus** (don't fabricate):
- **DEKALB / Asgrow / Channel** per-variety yield trials —
Bayer keeps these in rep tools, not on the public catalog. The
GH plot reports surface DEKALB/Asgrow performance indirectly,
but per-variety dedicated trials aren't indexed.
- **NK yield results** — the data exists at
`syngenta-us.com/nk/yield-results` but the ASMX endpoint is
fiddly; not yet scraped. The variety identity is in the corpus
(`search_docs` finds it), just not the per-region trial yields.
- **Pioneer trials** — Pioneer's ToS bans automation, so we scrape
neither its variety identity nor its own trial data (Pioneer
products still appear as competitors in GH plot reports).
Direct the farmer to a Pioneer dealer.
- **University extension trials** (Iowa State, Illinois,
Purdue, etc.) — independent third-party trials that report
Pioneer alongside competitors. Not in the corpus today; could
be added in a future enrichment.
**Reading a GH plot report**:
Each plot has a cooperator name (the farmer running the trial), a
state, a year, planting/harvest dates, population, row width, and a
ranked table of products. The columns vary by crop:
- **Corn / Soy**: Rank | Brand | Product | Traits | Yield BU/Ac
| %MST | Test Weight | Gross Revenue
- **Silage**: Rank | Brand | Product | Traits | Ton/Acre
| Milk Per Acre | Milk Per Ton | Beef Per Acre | Beef Per Ton
Rank 1 = top performer at that site/year. Note that a single plot
is one data point — for a robust recommendation, look across
multiple plots from the same region.
---
## checking-your-work
Before quoting a specific number to a farmer, **always** call