Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data: land-grant programs that test every
entered brand side-by-side with replication + LSD stats. This is the legitimate
way to get Pioneer / DEKALB / Brevant / Channel performance that the corpus
can't scrape directly (data_type=trial, results[] shape; falls through to the
trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
  (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
  corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
  (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
  (pdfplumber; per-site column groups split by header Yield-token count +
  x-coord footnote bucketing), corn + soy per site, 2024+2025.
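
The Ohio column-group split can be pictured with a minimal sketch. It assumes
each site's column group begins with a "Yield" column, so the number of
"Yield" tokens in the header equals the number of sites. The function name,
column names, and sample values are hypothetical, not taken from the scraper,
and the x-coord footnote bucketing is not shown.

```python
def split_site_groups(header, row):
    """Split one table row into identity cells plus one dict per site.

    Assumption (illustrative only): every site's column group starts with a
    "Yield" column, so counting "Yield" tokens in the header gives the number
    of sites in the table.
    """
    starts = [i for i, name in enumerate(header) if name == "Yield"]
    if not starts:
        raise ValueError("header has no Yield column")
    # Cells before the first Yield column identify the entry (brand, hybrid).
    identity = dict(zip(header[:starts[0]], row[:starts[0]]))
    # Each group runs from one Yield column up to the next one (or row end).
    bounds = starts + [len(header)]
    groups = [
        dict(zip(header[lo:hi], row[lo:hi]))
        for lo, hi in zip(bounds, bounds[1:])
    ]
    return identity, groups


# Hypothetical two-site header and row, for illustration only.
header = ["Brand", "Hybrid", "Yield", "Moist", "Yield", "Moist"]
row = ["BrandA", "H1", "240.1", "17.2", "232.5", "18.0"]
identity, groups = split_site_groups(header, row)
```

Here `identity` holds the brand and hybrid cells and `groups` holds one
Yield/Moist dict per site, which is the shape a per-site trial doc needs.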

91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university
  sources route through it so the region/district is in the embedded chunk +
  labeled "variety trial (cross-vendor, independent third-party)". Existing plot
  sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
  trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
  indirectly via these trials).
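
A rough sketch of what a gated `include_region` flag might look like. This is
not the actual `_render_gh_plot_chunk`; the function name, document fields
(`title`, `region`, `results`), and output layout are assumptions made for
illustration. The point is that the default path stays unchanged for existing
plot sources, while university sources opt in to the region line and label.

```python
UNIVERSITY_LABEL = "variety trial (cross-vendor, independent third-party)"


def render_plot_chunk_sketch(doc, include_region=False):
    """Render a trial doc as chunk text (hypothetical stand-in renderer)."""
    header = [doc["title"]]
    if include_region:
        # Only sources that opt in get the label and the region/district line.
        header.append(UNIVERSITY_LABEL)
        if doc.get("region"):
            header.append(f"Region/district: {doc['region']}")
    rows = [
        f"{r['brand']} {r['product']}: {r['yield']}"
        for r in doc["results"]
    ]
    return "\n".join(header + rows)


# Hypothetical doc, for illustration only.
doc = {
    "title": "Example trial",
    "region": "North",
    "results": [{"brand": "BrandA", "product": "H1", "yield": 200}],
}
plain = render_plot_chunk_sketch(doc)
with_region = render_plot_chunk_sketch(doc, include_region=True)
```

Defaulting the flag to `False` is what keeps the existing callers' output
byte-for-byte the same in this sketch.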

Validation: all 123 docs chunk via rag.chunk.chunks_from_trial (0 errors), with
0 out-of-range yields and 0 duplicate keys. Public land-grant data; attribution
recorded in each tos_note. CI rebuilds the index from the committed corpus.
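
The range and duplicate-key checks could be approximated as below. This is a
sketch under stated assumptions, not the validation that was actually run: the
`key` and `yield` field names and the 0 to 400 yield bounds are hypothetical.

```python
from collections import Counter


def validate_trial_docs(docs, yield_range=(0.0, 400.0)):
    """Count out-of-range yields and duplicated doc keys.

    Assumptions (illustrative): each doc has a unique "key", a "results" list
    of rows with a numeric "yield", and the yield bounds are placeholders.
    """
    lo, hi = yield_range
    out_of_range = 0
    keys = Counter()
    for doc in docs:
        keys[doc["key"]] += 1
        for row in doc["results"]:
            if not lo <= row["yield"] <= hi:
                out_of_range += 1
    # A key counts as a duplicate if it appears on more than one doc.
    dup_keys = sum(1 for n in keys.values() if n > 1)
    return {"docs": len(docs), "out_of_range": out_of_range, "dup_keys": dup_keys}


# Hypothetical docs, for illustration only.
sample = [
    {"key": "a", "results": [{"yield": 210.5}, {"yield": 65.0}]},
    {"key": "b", "results": [{"yield": 9999.0}]},
    {"key": "a", "results": []},
]
report = validate_trial_docs(sample)
```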
2026-06-10 08:35:50 -04:00
parent 0bac06b7b6
commit 54094a0d43
255 changed files with 105410 additions and 13 deletions
+1 -1
@@ -45,7 +45,7 @@ and the `crop_seed_api_lessons` tool).
| Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
| Pioneer + Hoegemeyer + Brevant (Corteva) | 🔴 | — | DROP. Shared corteva.com ToU bans automation (scrapers + "competitive service"). Treat ALL `*.corteva.com / pioneer.com / hoegemeyer.com / therightseed.com` + Vylor brands as one excluded ToU domain |
-Trial-only sources (cross-vendor yield plots, `data_type=trial`): `gh_plot_reports`, `lg_plot_reports`, `agrigold_plot_reports`, `proharvest_plots`, `agripro_trials`. See the README corpus table for counts.
+Trial-only sources (cross-vendor yield, `data_type=trial`): vendor plot reports `gh_plot_reports`, `lg_plot_reports`, `agrigold_plot_reports`, `proharvest_plots`, `agripro_trials`; **university-extension variety trials** `illinois_vt_trials` (IL, +wheat), `iowa_icpt_trials` (IA), `ohio_ocpt_trials` (OH) — independent third-party data that ranks the majors we can't catalog directly (Pioneer/DEKALB/Brevant) side-by-side. The university sources route through `_render_gh_plot_chunk(include_region=True)` so the region/district is in the embedded chunk. See the README corpus table for counts.
> **Scale-direction warning (read before any cross-vendor numeric comparison):** the independents do NOT agree on direction. Bayer + Stine(corn) + ProHarvest(disease) + Burrus = HIGHER is better (Burrus 1-10, others 1-9). **Latham + NK + AgriPro = LOWER is better (1 = best).** 1st Choice = 0-10 higher=better. Stine soy is qualitative. Always consult each record's `_scale_direction` (the chunker attaches it) before comparing numbers across brands.