Add university-extension trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 cross-vendor trial docs) #19

Merged
claude merged 1 commit from add-university-trials into main 2026-06-10 08:36:20 -04:00

Adds the university-extension variety trials as cross-vendor data_type=trial sources — the legitimate, independent path to Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly. Land-grant programs test every entered brand side-by-side at the same sites with replication + LSD stats.

| Source | Program | Docs | Rows | Access | Crops |
|---|---|---|---|---|---|
| `illinois_vt_trials` | U of Illinois VT | 30 | 1,392 | per-region XLSX (openpyxl) | corn + soy + wheat |
| `iowa_icpt_trials` | Iowa State ICPT | 24 | 674 | ASP.NET GridView (viewstate postback) | corn + soy |
| `ohio_ocpt_trials` | OSU/CFAES OCPT | 69 | 4,647 | report PDF (pdfplumber) | corn + soy |

+123 trial docs / 6,713 ranked entries. 91 distinct seed brands, with the majors we couldn't catalog directly now independently present: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand appears only where it entered a program — Brevant absent from Iowa, DEKALB/Channel absent from Illinois — verified true negatives, not parse gaps.)

Chunker: added a gated include_region to _render_gh_plot_chunk; the three university sources route through it so the region/district lands in the embedded chunk (many tables share the same state and year, so the region is what tells them apart) and the chunk is framed as "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) verified unchanged (no region, "plot report" wording).
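
To illustrate the gating idea, here is a minimal sketch of an opt-in `include_region` flag on a chunk header renderer. It is not the code in `rag/chunk.py`: the function name `render_trial_header`, the document fields (`year`, `state`, `crop`, `region`), and the ` | ` separator are all assumptions made for the example. Only the flag name and the two wording variants come from the description above.

```python
def render_trial_header(doc: dict, include_region: bool = False) -> str:
    """Hypothetical stand-in for the header line of a plot/trial chunk."""
    # The flag is off by default, so existing plot sources keep their wording.
    label = (
        "variety trial (cross-vendor, independent third-party)"
        if include_region
        else "plot report"
    )
    parts = [str(doc["year"]), doc["state"], doc["crop"], label]
    if include_region and doc.get("region"):
        # Region disambiguates tables that share state, year, and crop.
        parts.insert(2, doc["region"])
    return " | ".join(parts)
```

Defaulting the flag to `False` is what keeps the existing callers byte-for-byte unchanged; only sources that pass `include_region=True` get the region and the new framing.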

Hard parts handled: Iowa's year/district is an ASP.NET viewstate POSTBACK (no GET URLs); Ohio's PDF has per-site column groups split by the header's Yield-token count + x-coordinate footnote bucketing, with a site-count sanity gate (0 skips/fallbacks at baseline); Illinois uses header-anchored XLSX cell mapping + a self-locating metadata block.
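
For readers unfamiliar with ASP.NET WebForms postbacks, the sketch below shows the general shape of the Iowa problem: the page's hidden fields (`__VIEWSTATE`, `__EVENTVALIDATION`, and so on) must be echoed back in a POST together with `__EVENTTARGET` naming the control that changed. This is a stdlib-only illustration, not the scraper in this PR. The helper names and the control name `ctl00$ddlYear` in the usage note are hypothetical, and the real page's control names are not given in the description.

```python
from __future__ import annotations

from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html: str, event_target: str, selections: dict[str, str]) -> dict[str, str]:
    """Build the form payload for a WebForms postback (hypothetical helper)."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)  # carries __VIEWSTATE, __EVENTVALIDATION, ...
    payload["__EVENTTARGET"] = event_target  # control that "fired" the change
    payload["__EVENTARGUMENT"] = ""
    payload.update(selections)  # e.g. the chosen year or district value
    return payload
```

A caller would fetch the page once, call something like `build_postback(page_html, "ctl00$ddlYear", {"ctl00$ddlYear": "2025"})`, POST the result to the same URL, and repeat with the returned page's fresh viewstate for the next selection.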

Validation: all 123 docs chunk via chunks_from_trial (0 errors), with 0 out-of-range yields and 0 duplicate keys; "Yield" is the canonical metric key throughout.
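
The two data checks named above (yield range and duplicate keys) could look roughly like the sketch below. Everything specific here is an assumption for illustration: the function name, the row shape, the fields that make up the uniqueness key, and the numeric bounds. The PR description does not state the actual range or key definition. Only the `"Yield"` metric key is taken from the description.

```python
def validate_trial_rows(rows, yield_range=(0.0, 400.0)):
    """Return (out_of_range_rows, duplicate_keys) for a list of trial rows.

    yield_range is a placeholder bound, not the project's real threshold.
    """
    low, high = yield_range
    out_of_range = []
    duplicates = []
    seen = set()
    for row in rows:
        value = row["metrics"]["Yield"]  # canonical metric key
        if not (low <= value <= high):
            out_of_range.append(row)
        # Hypothetical uniqueness key: one entry per brand/entry per site.
        key = (row["source"], row["site"], row["brand"], row["entry"])
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return out_of_range, duplicates
```

A clean baseline in this sketch corresponds to both returned lists being empty.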

Legality: all three are public land-grant extension data (published for farmers); no anti-scraping clauses — attribution recorded per tos_note (UIUC VT / Iowa ICPT-ISU / Ohio OCPT-OSU CFAES). openpyxl added (Illinois XLSX; scrape-time only — not imported by the image-only CI rebuild). 2024+2025 baseline; older years + Purdue deferred behind --include-old.

Docs: README/CLAUDE inventory (now 2,398 variety + 6,910 trial) + lessons trial-data/Pioneer entries updated. CI rebuilds the index from the committed corpus.

claude added 1 commit 2026-06-10 08:36:10 -04:00
Independent third-party performance data — land-grant programs that test every
entered brand side-by-side with replication + LSD stats. This is the legitimate
way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't
scrape directly (data_type=trial, results[] shape; falls through the trial
chunker).

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
  (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
  corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
  (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
  (pdfplumber; per-site column groups split by header Yield-token count +
  x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university
  sources route through it so the region/district is in the embedded chunk +
  labeled "variety trial (cross-vendor, independent third-party)". Existing plot
  sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
  trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
  indirectly via these trials).

Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0
out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in
each tos_note. CI rebuilds the index from the committed corpus.
claude merged commit a54fac240f into main 2026-06-10 08:36:20 -04:00
claude deleted branch add-university-trials 2026-06-10 08:36:20 -04:00

Reference: justin/seed-mcp#19