Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data: land-grant programs that test every
entered brand side by side, with replication and LSD statistics. This is the
legitimate way to get the Pioneer / DEKALB / Brevant / Channel performance data
that the corpus can't scrape directly. Docs use data_type=trial with the
results[] shape, so they go through the trial chunker.

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
  (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
  corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
  (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
  (pdfplumber; per-site column groups split by header Yield-token count +
  x-coord footnote bucketing), corn + soy per site, 2024+2025.
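
The Ohio column-group split can be pictured with a minimal sketch. It is not
the scraper's actual code: it assumes header and row words arrive as dicts
with `text` and `x0` keys (the shape pdfplumber's `extract_words()` returns),
and the function names and sample values are hypothetical.

```python
from bisect import bisect_right

def group_starts(header_words):
    """Return the sorted x0 of every 'Yield' header token (one per site group)."""
    return sorted(w["x0"] for w in header_words if w["text"] == "Yield")

def bucket_by_x(words, starts):
    """Assign each word to the site group that starts at or left of its x0."""
    groups = [[] for _ in starts]
    for w in words:
        i = bisect_right(starts, w["x0"]) - 1
        if i >= 0:  # words left of the first group (brand/hybrid names) are skipped
            groups[i].append(w["text"])
    return groups

# Illustrative values only; not taken from any OCPT report.
header = [
    {"text": "Brand", "x0": 10.0},
    {"text": "Yield", "x0": 120.0}, {"text": "Moist", "x0": 160.0},
    {"text": "Yield", "x0": 240.0}, {"text": "Moist", "x0": 280.0},
]
row = [
    {"text": "DEKALB", "x0": 10.0},
    {"text": "231.4", "x0": 121.5}, {"text": "17.2", "x0": 161.0},
    {"text": "228.9", "x0": 241.0}, {"text": "16.8", "x0": 281.2},
]
starts = group_starts(header)
site_groups = bucket_by_x(row, starts)
```

Counting `Yield` tokens gives the number of sites without hard-coding it per
report. Right-aligned numbers can start slightly left of their header, so a
real parser would likely need a tolerance or midpoint boundaries.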

91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)
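
One way to surface the "absent here, present elsewhere" case is sketched
below. This is an assumption-laden illustration: the row fields (`brand`,
`source`) and both function names are hypothetical, and the data alone cannot
prove that a brand did not enter a program; that was confirmed separately.

```python
def entered_programs(rows):
    """Map each brand to the set of programs it has at least one row in."""
    seen = {}
    for r in rows:
        seen.setdefault(r["brand"], set()).add(r["source"])
    return seen

def presence(brand, source, seen):
    """Classify a brand/program pair so absence is reported explicitly."""
    if source in seen.get(brand, set()):
        return "present"
    if brand in seen:
        return "absent_here"        # has rows in another program
    return "absent_everywhere"      # no rows in any indexed program

# Illustrative rows only.
rows = [
    {"brand": "DEKALB", "source": "ohio_ocpt_trials"},
    {"brand": "Pioneer", "source": "illinois_vt_trials"},
]
seen = entered_programs(rows)
status = presence("DEKALB", "illinois_vt_trials", seen)
```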

- rag/chunk.py: added a gated `include_region` option to _render_gh_plot_chunk;
  the 3 university sources route through it, so the region/district lands in
  the embedded chunk and the chunk is labeled "variety trial (cross-vendor,
  independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest)
  are unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
  trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
  indirectly via these trials).
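
The `include_region` gating might look roughly like the sketch below. It is
not the real _render_gh_plot_chunk: the function names and the doc fields
other than `brand` and `results` are assumptions made for illustration.

```python
UNIVERSITY_SOURCES = {"illinois_vt_trials", "iowa_icpt_trials", "ohio_ocpt_trials"}

def render_chunk(doc, include_region=False):
    """Render a trial doc to text; region and label appear only when gated on."""
    lines = []
    if include_region:
        lines.append("variety trial (cross-vendor, independent third-party)")
        if doc.get("region"):
            lines.append(f"region: {doc['region']}")
    for row in doc.get("results", []):
        lines.append(f"{row['brand']} {row['variety']}: {row['yield']}")
    return "\n".join(lines)

def render_for_source(doc):
    """Turn the flag on only for the university sources."""
    return render_chunk(doc, include_region=doc.get("source") in UNIVERSITY_SOURCES)

# Illustrative docs only; field names "source", "region", "variety", "yield" are assumed.
results = [{"brand": "Pioneer", "variety": "hybrid-A", "yield": 230.0}]
uni_text = render_for_source(
    {"source": "iowa_icpt_trials", "region": "District 3", "results": results}
)
plot_text = render_for_source({"source": "gh", "region": "District 3", "results": results})
```

Defaulting the flag to off is what keeps the output of the existing plot
sources unchanged.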

Validation: all 123 docs chunk via rag.chunk.chunks_from_trial (0 errors), with
0 out-of-range yields and 0 duplicate keys. Public land-grant data; attribution
is recorded in each tos_note. CI rebuilds the index from the committed corpus.
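
The range and duplicate-key checks could be sketched as follows. The yield
bounds, the key fields, and the function name are assumptions; the commit does
not state which were actually used.

```python
from collections import Counter

def validate(rows, lo=0.0, hi=400.0):
    """Return (out_of_range_rows, duplicate_keys) for a list of trial rows."""
    out_of_range = [r for r in rows if not (lo <= r["yield"] <= hi)]
    # Assumed uniqueness key: one row per source, site, brand and variety.
    keys = Counter((r["source"], r["site"], r["brand"], r["variety"]) for r in rows)
    dups = [k for k, n in keys.items() if n > 1]
    return out_of_range, dups

# Illustrative rows only.
rows = [
    {"source": "ohio_ocpt_trials", "site": "A", "brand": "NK", "variety": "v1", "yield": 210.0},
    {"source": "ohio_ocpt_trials", "site": "A", "brand": "NK", "variety": "v1", "yield": 212.0},
    {"source": "ohio_ocpt_trials", "site": "B", "brand": "LG", "variety": "v2", "yield": 2100.0},
]
bad, dups = validate(rows)
```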
2026-06-10 08:35:50 -04:00
parent 0bac06b7b6
commit 54094a0d43
255 changed files with 105410 additions and 13 deletions
+23 -7
@@ -338,6 +338,19 @@ The MCP exposes TWO complementary surfaces:
tables where the PDF fits ProHarvest's template; foreign-format
third-party reports are kept verbatim (`raw_text`) so the yields
are still searchable. Image-only PDFs (no text layer) are skipped.
+- **University-extension variety trials** (`illinois_vt_trials`,
+`iowa_icpt_trials`, `ohio_ocpt_trials`, 2024+2025) — **the
+independent third-party gold standard.** Land-grant programs (U of
+Illinois VT, Iowa State ICPT, Ohio OCPT) that test every *entered*
+brand side-by-side at the same sites with replication + LSD stats.
+The publisher is the university; the seed brands are in each row's
+`brand`. **This is where Pioneer / DEKALB / Channel / Brevant
+performance is legitimately available** (they enter these public
+trials even though we can't scrape their own sites). Caveat: a brand
+only appears where it *entered* — e.g. Brevant didn't enter Iowa
+ICPT, DEKALB/Channel didn't enter Illinois VT; absence in one
+program is a true negative, not missing data. Illinois adds wheat;
+Iowa/Ohio are corn+soy. (Purdue PCPP + other states deferred.)
**Recommended workflow when a farmer asks about performance**:
@@ -363,13 +376,16 @@ The MCP exposes TWO complementary surfaces:
`syngenta-us.com/nk/yield-results` but the ASMX endpoint is
fiddly; not yet scraped. The variety identity is in the corpus
(`search_docs` finds it), just not the per-region trial yields.
-- **Pioneer trials** — ToS bans automation, so we have neither
-variety identity nor trial data. Direct the farmer to a
-Pioneer dealer.
-- **University extension trials** (Iowa State, Illinois,
-Purdue, etc.) — third-party trial data that publishes Pioneer
-+ competitors. Not in the corpus today; could be added in a
-future enrichment.
+- **Pioneer trials** — ToS bans automation, so we have no Pioneer
+*identity* data and don't scrape Pioneer's own results. BUT
+Pioneer *performance* IS now available indirectly via the
+university-extension trials (and the GH/ProHarvest plots) where
+Pioneer entered — search those for Pioneer head-to-head yields;
+for Pioneer variety specs, direct the farmer to a dealer.
+- **University extension trials** — NOW INDEXED for IL / IA / OH
+(`illinois_vt_trials` / `iowa_icpt_trials` / `ohio_ocpt_trials`,
+2024+2025). Purdue PCPP and other states (NE / WI / MN / the
+Dakotas / Kansas wheat) are not yet indexed — a future enrichment.
**Reading a GH plot report**: