Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)
Independent third-party performance data: land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows): U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows): Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows): OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program, e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois: true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials).

Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.
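The validation pass named in the commit message (no out-of-range yields, no duplicate keys) and the per-brand row tally could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's code: the `id`, `hybrid`, `site`, and `yield_bu_ac` field names, the 0 to 400 bu/ac bounds, the sample rows, and the `validate_trial_docs` helper are all hypothetical. Only `data_type`, `results[]`, and `brand` come from the commit message.

```python
from collections import Counter

# Assumed plausibility bounds in bu/ac; the real range check is not shown in the commit.
YIELD_MIN, YIELD_MAX = 0.0, 400.0


def validate_trial_docs(docs):
    """Return (out_of_range_keys, duplicate_keys, brand_row_counts)."""
    out_of_range = []
    duplicates = []
    seen = set()
    brand_counts = Counter()
    for doc in docs:
        if doc.get("data_type") != "trial":
            continue  # only trial-shaped documents carry results[]
        for row in doc.get("results", []):
            brand_counts[row["brand"]] += 1
            # Hypothetical row key: one entry per brand/hybrid/site within a document.
            key = (doc["id"], row["brand"], row["hybrid"], row["site"])
            if key in seen:
                duplicates.append(key)
            seen.add(key)
            if not (YIELD_MIN <= row["yield_bu_ac"] <= YIELD_MAX):
                out_of_range.append(key)
    return out_of_range, duplicates, brand_counts


# Made-up sample document; hybrid names and yields are placeholders.
docs = [
    {"id": "ohio-2025-corn-site1", "data_type": "trial", "results": [
        {"brand": "DEKALB", "hybrid": "X1", "site": "S1", "yield_bu_ac": 231.4},
        {"brand": "Pioneer", "hybrid": "Y2", "site": "S1", "yield_bu_ac": 228.9},
    ]},
]
bad, dups, counts = validate_trial_docs(docs)
print(len(bad), len(dups), dict(counts))
```

Returning the offending keys rather than bare counts keeps a failing run debuggable: a non-empty list points straight at the document and row that need a parser fix.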
+23
-7
@@ -338,6 +338,19 @@ The MCP exposes TWO complementary surfaces:
  tables where the PDF fits ProHarvest's template; foreign-format
  third-party reports are kept verbatim (`raw_text`) so the yields
  are still searchable. Image-only PDFs (no text layer) are skipped.
+ - **University-extension variety trials** (`illinois_vt_trials`,
+ `iowa_icpt_trials`, `ohio_ocpt_trials`, 2024+2025) — **the
+ independent third-party gold standard.** Land-grant programs (U of
+ Illinois VT, Iowa State ICPT, Ohio OCPT) that test every *entered*
+ brand side-by-side at the same sites with replication + LSD stats.
+ The publisher is the university; the seed brands are in each row's
+ `brand`. **This is where Pioneer / DEKALB / Channel / Brevant
+ performance is legitimately available** (they enter these public
+ trials even though we can't scrape their own sites). Caveat: a brand
+ only appears where it *entered* — e.g. Brevant didn't enter Iowa
+ ICPT, DEKALB/Channel didn't enter Illinois VT; absence in one
+ program is a true negative, not missing data. Illinois adds wheat;
+ Iowa/Ohio are corn+soy. (Purdue PCPP + other states deferred.)

  **Recommended workflow when a farmer asks about performance**:

@@ -363,13 +376,16 @@ The MCP exposes TWO complementary surfaces:
  `syngenta-us.com/nk/yield-results` but the ASMX endpoint is
  fiddly; not yet scraped. The variety identity is in the corpus
  (`search_docs` finds it), just not the per-region trial yields.
- - **Pioneer trials** — ToS bans automation, so we have neither
- variety identity nor trial data. Direct the farmer to a
- Pioneer dealer.
- - **University extension trials** (Iowa State, Illinois,
- Purdue, etc.) — third-party trial data that publishes Pioneer
- + competitors. Not in the corpus today; could be added in a
- future enrichment.
+ - **Pioneer trials** — ToS bans automation, so we have no Pioneer
+ *identity* data and don't scrape Pioneer's own results. BUT
+ Pioneer *performance* IS now available indirectly via the
+ university-extension trials (and the GH/ProHarvest plots) where
+ Pioneer entered — search those for Pioneer head-to-head yields;
+ for Pioneer variety specs, direct the farmer to a dealer.
+ - **University extension trials** — NOW INDEXED for IL / IA / OH
+ (`illinois_vt_trials` / `iowa_icpt_trials` / `ohio_ocpt_trials`,
+ 2024+2025). Purdue PCPP and other states (NE / WI / MN / the
+ Dakotas / Kansas wheat) are not yet indexed — a future enrichment.

  **Reading a GH plot report**:

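The "true negative" caveat in the documentation change (a brand appears only in programs it entered) can be made concrete with a small lookup that reports, per program, either the brand's yields or an explicit "did not enter" marker. This is a minimal sketch under stated assumptions: the flat `(source, brand, yield)` tuples, the sample values, and the `brand_by_program` helper are hypothetical and not part of the MCP surface. Only the three source names come from the diff.

```python
PROGRAMS = ("illinois_vt_trials", "iowa_icpt_trials", "ohio_ocpt_trials")


def brand_by_program(rows, brand):
    """Map each program to the brand's yields, or None if the brand did not enter."""
    found = {program: [] for program in PROGRAMS}
    for source, row_brand, yield_bu_ac in rows:
        if source in found and row_brand.lower() == brand.lower():
            found[source].append(yield_bu_ac)
    # An empty list means the brand has no entries in that program: report None
    # so callers treat it as "did not enter" rather than as missing data.
    return {program: (yields or None) for program, yields in found.items()}


# Made-up illustrative rows, not real trial results.
rows = [
    ("ohio_ocpt_trials", "Brevant", 224.0),
    ("illinois_vt_trials", "Brevant", 219.5),
    ("iowa_icpt_trials", "Pioneer", 230.1),
]
for program, yields in brand_by_program(rows, "brevant").items():
    print(program, "did not enter" if yields is None else yields)
```

Keeping the three programs as explicit keys means an absent brand still shows up in the answer, which is the point of the caveat: the caller sees "did not enter" instead of silently getting fewer results.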