Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)
Independent third-party performance data: land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows): U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows): Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows): OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program, e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois. These are true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials).

Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.
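The validation counts above (0 out-of-range yields, 0 dup keys) could come from a check along these lines. This is only a minimal sketch, not the repository's actual validator: the `key`, `crop`, and `yield` field names, the `YIELD_BOUNDS` limits, and the sample documents are all assumptions made for illustration. Only `data_type`, `results[]`, and the per-row `brand` are named in this commit.

```python
from collections import Counter

# Hypothetical plausibility bounds in bu/ac; the real limits are not stated in the commit.
YIELD_BOUNDS = {"corn": (20.0, 400.0), "soybean": (5.0, 130.0), "wheat": (10.0, 200.0)}


def validate_trial_docs(docs):
    """Return (out_of_range_rows, duplicate_keys) for a list of trial documents."""
    out_of_range = []
    key_counts = Counter()
    for doc in docs:
        key_counts[doc["key"]] += 1  # "key" is an assumed unique document id
        low, high = YIELD_BOUNDS[doc["crop"]]
        for row in doc["results"]:
            if not low <= row["yield"] <= high:
                out_of_range.append((doc["key"], row["brand"], row["yield"]))
    # A key seen more than once would mean two documents collide in the index.
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return out_of_range, duplicates


# Made-up sample documents: one mis-parsed yield and one repeated key.
docs = [
    {"key": "ohio-corn-a", "data_type": "trial", "crop": "corn",
     "results": [{"brand": "DEKALB", "yield": 231.4}, {"brand": "NK", "yield": 2250.0}]},
    {"key": "iowa-soy-1", "data_type": "trial", "crop": "soybean",
     "results": [{"brand": "Asgrow", "yield": 68.2}]},
    {"key": "iowa-soy-1", "data_type": "trial", "crop": "soybean",
     "results": [{"brand": "Pioneer", "yield": 71.0}]},
]

bad_rows, dup_keys = validate_trial_docs(docs)
print(bad_rows)
print(dup_keys)
```

A clean corpus would return two empty lists, matching the zero counts reported in the message.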
@@ -10,7 +10,7 @@ vendors — **variety identity** (what each hybrid IS) plus **yield-trial data**
 
 ## What's in the corpus
 
-**~9,200 indexed records** (one chunk each) across two complementary surfaces:
+**~9,300 indexed records** (one chunk each) across two complementary surfaces:
 
 ### Variety identity — 2,398 records
 
@@ -30,7 +30,7 @@ vendors — **variety identity** (what each hybrid IS) plus **yield-trial data**
 | `ebberts_seeds` | 29 | Ebbert's Seeds | Ebbert's (corn / soy / wheat) — independent E. Corn Belt breeder |
 | `agripro` | 24 | Syngenta | AgriPro (wheat — HRW / HRS / HWS / SWW) |
 
-### Yield-trial data — 6,787 documents
+### Yield-trial data — 6,910 documents
 
 | Source | Count | Notes |
 |---|---|---|
@@ -38,8 +38,13 @@ vendors — **variety identity** (what each hybrid IS) plus **yield-trial data**
 | `lg_plot_reports` | 1,307 | LG Seeds (AgReliant) cross-vendor plots, top-5 per site, 2024+2025. |
 | `agrigold_plot_reports` | 1,006 | AgriGold (AgReliant) cross-vendor plots, full ranking + rich plot management, 2024+2025. |
 | `proharvest_plots` | 161 | ProHarvest Seeds per-cooperator harvest reports (corn / soy, 2024+2025). Many are **cross-vendor** (ProHarvest / Apex vs Pioneer / DEKALB / Becks / Channel / Wyffels). Structured rank/yield/%H2O/test-wt where the PDF fits the template; off-template third-party reports kept verbatim. |
+| `ohio_ocpt_trials` | 69 | **University-extension** trial (OSU/CFAES) — corn + soy per-site, 2024+2025. Independent third-party; ranks CHANNEL / DEKALB / NK / Golden Harvest / LG / AgriGold / Beck's etc. side-by-side. |
+| `illinois_vt_trials` | 30 | **University-extension** trial (U of Illinois VT) — corn + soy + **wheat**, 2024+2025. Pioneer / NK + many regionals; rich per-site agronomic metadata. |
+| `iowa_icpt_trials` | 24 | **University-extension** trial (Iowa State / ICPT) — corn + soy by district, 2024+2025. Pioneer / DEKALB / Asgrow / NK / Golden Harvest. |
 | `agripro_trials` | 14 | Regional wheat trial PDF summaries (PNW, Western Plains, Northern Plains, etc.) |
 
+> The three `*_trials` university sources are **independent third-party** performance data — land-grant programs that test every entered brand (incl. majors we can't catalog directly, like **Pioneer / DEKALB / Brevant**) side-by-side with replication + LSD stats. The publisher is the university; the seed brands live in each row's `brand`.
+
 ### Not in the corpus (documented in `docs_mcp/lessons.md`)
 
 - **Pioneer / Corteva (all brands)** — ToS bans automation. This now covers the whole Corteva family — Pioneer, Brevant, **Hoegemeyer** (the consolidation brand absorbing Seed Consultants / Dairyland / Nu-Tech / Terral), and the upcoming Vylor spinoff — all share the same corteva.com ToU. Curated fallback lesson points the farmer at a local dealer; legitimate Corteva-data paths are an official license (openinnovation@corteva.com) or university-extension trial data.
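Because the README note says the publisher is the university and the seed brands live in each row's `brand`, a per-brand view has to be built from the rows rather than from the source name. The sketch below shows one way that tally might look. The `source` field name and the sample documents are assumptions for illustration; `data_type`, `results`, and `brand` follow the names used in this commit.

```python
from collections import Counter, defaultdict


def brand_presence(docs):
    """Count result rows per brand and record which trial sources each brand entered."""
    rows_per_brand = Counter()
    sources_per_brand = defaultdict(set)
    for doc in docs:
        if doc.get("data_type") != "trial":
            continue  # variety-identity records carry no results[] rows
        for row in doc.get("results", []):
            brand = row["brand"].strip()
            rows_per_brand[brand] += 1
            sources_per_brand[brand].add(doc["source"])  # "source" is an assumed field
    return rows_per_brand, sources_per_brand


# Made-up sample: brands placed only in programs the commit message says they entered.
docs = [
    {"source": "illinois_vt_trials", "data_type": "trial",
     "results": [{"brand": "Pioneer"}, {"brand": "NK"}]},
    {"source": "iowa_icpt_trials", "data_type": "trial",
     "results": [{"brand": "Pioneer"}, {"brand": "DEKALB"}]},
    {"source": "ohio_ocpt_trials", "data_type": "trial",
     "results": [{"brand": "DEKALB"}, {"brand": "Channel"}]},
]

rows_per_brand, sources_per_brand = brand_presence(docs)
print(rows_per_brand.most_common())
print(sorted(sources_per_brand["DEKALB"]))
```

A brand missing from a source's set is read here as "did not enter that program", in line with the true-negative note in the commit message, not as a parse gap.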