From 84b49d8360617cb0af51408a8c9d01017038c249 Mon Sep 17 00:00:00 2001
From: Justin Paul
Date: Mon, 25 May 2026 15:22:08 -0400
Subject: [PATCH] trial data: workflow scrape steps + lessons.md trial-data guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

.gitea/workflows/refresh.yml — add scrape steps for the new trial
sources (agripro_trials, gh_plot_reports) so the monthly cron refreshes
them alongside the variety sources. gh_plot_reports is the heaviest
single source (~4,600 docs @ 1 req/sec ≈ 70 min); runs late so an
earlier failure doesn't waste time before failing. Commit-message
variable count expanded to surface the trial counts.

docs_mcp/lessons.md — new "trial-data" section telling the agent:
- The two surfaces (search_docs = identity, search_trials = perf) are
  complementary; how to route a farmer question to each.
- What's indexed (GH plot reports cross-vendor, AgriPro regional PDFs)
  vs what's not (Bayer per-variety trials, NK yield results, Pioneer,
  university extension trials).
- Recommended workflow: search_trials → identify top performers →
  lookup_variety on each to verify identity → don't fabricate.
- How to read a GH plot report (per-column headers vary by crop:
  corn/soy use Yield/MST/Test Weight, silage uses Ton/Acre + Milk +
  Beef columns).
- Single-data-point caveat: one plot is one cooperator's field; look
  across multiple plots for a robust recommendation.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .gitea/workflows/refresh.yml | 19 ++++++++--
 docs_mcp/lessons.md          | 73 ++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/.gitea/workflows/refresh.yml b/.gitea/workflows/refresh.yml
index faf6a8d2..ec3ca06e 100644
--- a/.gitea/workflows/refresh.yml
+++ b/.gitea/workflows/refresh.yml
@@ -83,10 +83,21 @@ jobs:
         if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
         run: python -m scrape.runner --source agripro --force
 
+      - name: Scrape AgriPro regional trial PDFs
+        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro_trials') }}
+        run: python -m scrape.runner --source agripro_trials --force
+
+      - name: Scrape Golden Harvest plot reports (cross-vendor yield trials)
+        if: ${{ inputs.sources == '' || contains(inputs.sources, 'gh_plot_reports') }}
+        # Heaviest single source — ~4,600 docs at 1 req/sec ≈ 70 min.
+        # Wraps the bulk of CI time; runs late so an earlier failure
+        # doesn't waste 70 min before failing.
+        run: python -m scrape.runner --source gh_plot_reports --force
+
       - name: Scrape Beck's PFR research corpus
         if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
-        # Heaviest source — ~2,089 docs via public Sanity GROQ.
-        # No auth, but rate-limit ourselves to be polite.
+        # Deferred (returns 0 cleanly from a stub) — implementation
+        # pending. Public Sanity GROQ at mc8v24rf.api.sanity.io.
         run: python -m scrape.runner --source becks_pfr --force
 
       # ---- Commit corpus changes + retry-on-race -----------------
@@ -107,8 +118,10 @@ jobs:
           n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
           n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
           n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
+          n_agt=$(find corpus/agripro_trials -name '*.json' 2>/dev/null | wc -l)
+          n_ghpr=$(find corpus/gh_plot_reports -name '*.json' 2>/dev/null | wc -l)
           n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
-          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
+          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} ag_trials=${n_agt} gh_plot_reports=${n_ghpr} pfr=${n_pfr}"
           attempt=1
           while [ $attempt -le 3 ]; do
             if git push; then
diff --git a/docs_mcp/lessons.md b/docs_mcp/lessons.md
index c2595f35..e4139fe7 100644
--- a/docs_mcp/lessons.md
+++ b/docs_mcp/lessons.md
@@ -252,6 +252,79 @@ public catalog or their seed dealer.
 
 ---
 
+## trial-data
+
+The MCP exposes TWO complementary surfaces:
+
+* **`search_docs`** — variety IDENTITY (what a hybrid IS):
+  disease ratings, trait stack, maturity, vendor positioning.
+* **`search_trials`** — variety PERFORMANCE (how it ACTUALLY did):
+  ranked yield at specific cooperator fields and regions.
+
+**Indexed trial sources**:
+
+- **Golden Harvest plot reports** (~4,600 cross-vendor head-to-head
+  trials, 2024+2025). Each trial = one cooperator's field at a
+  specific state/year, comparing products from multiple brands
+  (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel, etc.)
+  side by side. **This is the closest thing to independent
+  comparison data the corpus has** — Bayer doesn't publish its own
+  trial data, but GH publishes plots where DEKALB hybrids appear
+  alongside their competitors.
+- **AgriPro regional trial PDFs** (14 PDFs) — multi-year
+  multi-location wheat performance for Northern Plains / Pacific
+  Northwest / Plains regions. Variety + per-location yields
+  preserved verbatim.
+
+**Recommended workflow when a farmer asks about performance**:
+
+1. Call `search_trials(crop, state, year, ...)` to find trials
+   from the relevant region/season.
+2. Identify the top performers in the rankings.
+3. Call `lookup_variety(source_key=...)` for each leading hybrid to
+   verify identity (RM, traits, disease ratings) — confirm the
+   variety actually fits the farmer's situation, not just that it
+   won someone else's trial.
+4. If the leading variety is from a brand whose trial data isn't
+   directly published (e.g. DEKALB), the GH plot reports often
+   show it competing — that's still the agent's best public
+   third-party signal.
+
+**Trial data NOT in the corpus** (don't fabricate):
+
+- **DEKALB / Asgrow / Channel** per-variety yield trials —
+  Bayer keeps these in rep tools, not on the public catalog. The
+  GH plot reports surface DEKALB/Asgrow performance indirectly,
+  but per-variety dedicated trials aren't indexed.
+- **NK yield results** — the data exists at
+  `syngenta-us.com/nk/yield-results` but the ASMX endpoint is
+  fiddly; not yet scraped. The variety identity is in the corpus
+  (`search_docs` finds it), just not the per-region trial yields.
+- **Pioneer trials** — ToS bans automation, so we have neither
+  variety identity nor trial data. Direct the farmer to a
+  Pioneer dealer.
+- **University extension trials** (Iowa State, Illinois,
+  Purdue, etc.) — third-party trial data that publishes Pioneer +
+  competitors. Not in the corpus today; could be added in a
+  future enrichment.
+
+**Reading a GH plot report**:
+
+Each plot has a cooperator name (the farmer running the trial), a
+state, a year, planting/harvest dates, population, row width, and a
+ranked table of products. The columns vary by crop:
+
+- **Corn / Soy**: Rank | Brand | Product | Traits | Yield BU/Ac
+  | %MST | Test Weight | Gross Revenue
+- **Silage**: Rank | Brand | Product | Traits | Ton/Acre
+  | Milk Per Acre | Milk Per Ton | Beef Per Acre | Beef Per Ton
+
+Rank 1 = top performer at that site/year. Note that a single plot
+is one data point — for a robust recommendation, look across
+multiple plots from the same region.
+
+---
+
 ## checking-your-work
 
 Before quoting a specific number to a farmer, **always** call
-- 
2.52.0
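
Reviewer sketch, not part of the patch: the recommended workflow in the new trial-data section (search trials, pick leaders across several plots, then verify identity) could look roughly like the Python below. The patch does not show the return shapes of `search_trials` or `lookup_variety`, so the `rows`, `source_key`, and `yield_bu_ac` field names, the `shortlist` helper, and the in-memory stubs are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def shortlist(search_trials, lookup_variety, crop, state, year,
              top_n=3, min_plots=2):
    """Rank products by mean yield across plots, then attach identity.

    Both callables are passed in because their real signatures and
    return shapes are assumptions here, not the MCP's actual API.
    """
    yields = defaultdict(list)
    for trial in search_trials(crop=crop, state=state, year=year):
        for row in trial["rows"]:  # assumed shape: one dict per product
            yields[row["source_key"]].append(row["yield_bu_ac"])

    # A single plot is one cooperator's field, so only keep products
    # that appear in at least `min_plots` plots.
    robust = {k: mean(v) for k, v in yields.items() if len(v) >= min_plots}
    leaders = sorted(robust, key=robust.get, reverse=True)[:top_n]

    results = []
    for key in leaders:
        results.append({
            "source_key": key,
            "mean_yield": robust[key],
            "plots": len(yields[key]),
            # None means identity is not indexed: report that, don't invent it.
            "identity": lookup_variety(source_key=key),
        })
    return results


# Hypothetical in-memory stand-ins for the two MCP tools.
_FAKE_TRIALS = [
    {"rows": [{"source_key": "A", "yield_bu_ac": 240.0},
              {"source_key": "B", "yield_bu_ac": 250.0},
              {"source_key": "C", "yield_bu_ac": 260.0}]},
    {"rows": [{"source_key": "A", "yield_bu_ac": 250.0},
              {"source_key": "B", "yield_bu_ac": 230.0}]},
]
_FAKE_IDENTITY = {"A": {"rm": 105, "traits": "example"}}


def fake_search_trials(crop, state, year):
    return _FAKE_TRIALS


def fake_lookup_variety(source_key):
    return _FAKE_IDENTITY.get(source_key)


if __name__ == "__main__":
    for item in shortlist(fake_search_trials, fake_lookup_variety,
                          crop="corn", state="IA", year=2025):
        print(item)
```

In this sketch product C wins its only plot but is dropped by `min_plots`, which mirrors the single-data-point caveat; averaging yield is just one possible aggregation, and mean rank would work as well.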