Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data: land-grant programs test every
entered brand side by side, with replication and LSD statistics. This is the
legitimate way to get Pioneer / DEKALB / Brevant / Channel performance data
that we can't scrape directly (data_type=trial, results[] shape; handled by the
trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
  (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
  corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
  (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
  (pdfplumber; per-site column groups split by header Yield-token count +
  x-coord footnote bucketing), corn + soy per site, 2024+2025.
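
For reference, a minimal sketch of the viewstate postback pattern the Iowa
scraper relies on. It is an illustration under stated assumptions, not the
scraper's actual code: `build_postback`, `HiddenFieldParser`, and the control
name `ctl00$ddlYear` are hypothetical. Only the `__VIEWSTATE`,
`__EVENTVALIDATION`, `__EVENTTARGET`, and `__EVENTARGUMENT` field names are
standard ASP.NET WebForms conventions.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on the page."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html: str, event_target: str, selections: dict[str, str]) -> dict[str, str]:
    """Build the form payload that re-submits a WebForms page with new selections."""
    parser = HiddenFieldParser()
    parser.feed(html)
    # Echo back every hidden field (viewstate, event validation, ...).
    payload = dict(parser.fields)
    # Name the control that "fired" the postback, e.g. a year dropdown.
    payload["__EVENTTARGET"] = event_target
    payload.setdefault("__EVENTARGUMENT", "")
    # Apply the dropdown values we want (year, district, ...).
    payload.update(selections)
    return payload


# Made-up page fragment; the control name is hypothetical.
page = """
<form>
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
<select name="ctl00$ddlYear"><option>2024</option></select>
</form>
"""
payload = build_postback(page, "ctl00$ddlYear", {"ctl00$ddlYear": "2025"})
```

The payload would then be POSTed back to the same URL, and the hidden fields
re-read from each response before the next year/district selection.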

91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)

- rag/chunk.py: added a gated `include_region` kwarg to _render_gh_plot_chunk;
  the 3 university sources route through it, so the region/district lands in
  the embedded chunk text and the chunk is labeled "variety trial (cross-vendor,
  independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest)
  are unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
  trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
  indirectly via these trials).
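
A rough standalone sketch of what `include_region` changes in the rendered
header. `render_header` is a hypothetical stand-in that mirrors only the
title/facts lines touched by this commit; the crop-label logic and the sample
sidecar values (including the `brand` string) are assumptions.

```python
def render_header(sidecar: dict, *, include_region: bool = False) -> tuple[str, list[str]]:
    """Mirror of the title + first facts built in _render_gh_plot_chunk (simplified)."""
    # Assumption: the real crop_label derivation is not shown in this diff.
    crop_label = (sidecar.get("crop") or "").title()
    state = sidecar.get("state") or sidecar.get("state_abbrev") or ""
    year = sidecar.get("year") or ""
    # Region is only read when the caller opts in (university-trial sources).
    region = (sidecar.get("region") or "").strip() if include_region else ""

    title = f"# {crop_label} yield trial — {state}, {year}"
    if region:
        title += f" ({region})"

    publisher_brand = sidecar.get("brand") or "Golden Harvest"
    kind = (
        "variety trial (cross-vendor, independent third-party)"
        if include_region
        else "plot report (cross-vendor)"
    )
    facts = [f"{publisher_brand} {kind}"]
    if region:
        facts.append(f"region {region}")
    return title, facts


# Hypothetical sidecar for illustration only.
sidecar = {
    "crop": "corn",
    "state": "IA",
    "year": 2024,
    "region": "District South",
    "brand": "Iowa State ICPT",
}
uni_title, uni_facts = render_header(sidecar, include_region=True)
plot_title, plot_facts = render_header(sidecar)
```

With the flag off, the same sidecar renders exactly as before, which is why
the existing plot sources are unaffected.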

Validation: all 123 docs chunk via rag.chunk.chunks_from_trial (0 errors), with
0 out-of-range yields and 0 dup keys. Public land-grant data; attribution is
recorded in each tos_note. CI rebuilds the index from the committed corpus.
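
A minimal sketch of the kind of check behind the yield-range and dup-key
numbers above. Everything here is assumed for illustration: the `validate`
helper, the `yield` / `id` field names, and the bu/ac bounds are hypothetical
and not taken from the repo.

```python
from collections import Counter

# Hypothetical plausibility bounds in bu/ac, per crop.
YIELD_BOUNDS = {
    "corn": (50.0, 400.0),
    "soybean": (10.0, 120.0),
    "wheat": (20.0, 200.0),
}


def validate(docs: list[dict]) -> dict[str, list]:
    """Return out-of-range yield rows and duplicated (source, id) keys."""
    key_counts = Counter((d.get("source"), d.get("id")) for d in docs)
    problems: dict[str, list] = {
        "out_of_range": [],
        "dup_keys": [key for key, n in key_counts.items() if n > 1],
    }
    for doc in docs:
        crop = (doc.get("crop") or "").lower()
        # Unknown crops get an open-ended range rather than a false alarm.
        lo, hi = YIELD_BOUNDS.get(crop, (0.0, float("inf")))
        for row in doc.get("results") or []:
            value = row.get("yield")
            if value is None or not (lo <= value <= hi):
                problems["out_of_range"].append((doc.get("id"), row.get("brand"), value))
    return problems


# Made-up documents: one implausible yield and one repeated key.
docs = [
    {
        "source": "iowa_icpt_trials",
        "id": "ia-2024-south-corn",
        "crop": "corn",
        "results": [
            {"brand": "DEKALB", "yield": 231.4},
            {"brand": "Pioneer", "yield": 2314.0},
        ],
    },
    {"source": "iowa_icpt_trials", "id": "ia-2024-south-corn", "crop": "corn", "results": []},
]
report = validate(docs)
```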
2026-06-10 08:35:50 -04:00
parent 0bac06b7b6
commit 54094a0d43
255 changed files with 105410 additions and 13 deletions
+20 -3
@@ -330,7 +330,7 @@ def chunks_from_variety(
 # signal for queries like "best corn for sandy soil Iowa 2024".
-def _render_gh_plot_chunk(sidecar: dict) -> str:
+def _render_gh_plot_chunk(sidecar: dict, *, include_region: bool = False) -> str:
     """Render a cross-vendor plot report (per-site head-to-head).
     Originally GH-specific; now also handles ``lg_plot_reports`` and
@@ -340,6 +340,12 @@ def _render_gh_plot_chunk(sidecar: dict) -> str:
     queries should still find DEKALB results inside a GH or AgriGold
     plot — search filters target ``brand_in_results``, not the
     publisher's brand).
+
+    ``include_region`` (university-trial sources) folds the
+    region/district into the title + facts so it's in the embedded
+    text — these sources publish many same-state/year tables that are
+    only distinguished by region (e.g. Iowa "District South"), and
+    without this the region lived only in metadata/the .md body.
     """
     lines: list[str] = []
     crop = (sidecar.get("crop") or "").lower()
@@ -350,12 +356,18 @@ def _render_gh_plot_chunk(sidecar: dict) -> str:
     state = sidecar.get("state") or sidecar.get("state_abbrev") or ""
     year = sidecar.get("year") or ""
     cooperator = sidecar.get("cooperator") or ""
+    region = (sidecar.get("region") or "").strip() if include_region else ""

-    lines.append(f"# {crop_label} yield trial — {state}, {year}")
+    title = f"# {crop_label} yield trial — {state}, {year}"
+    if region:
+        title += f" ({region})"
+    lines.append(title)
     lines.append("")
     # Publisher label — emphasizes the source brand for retrieval.
     publisher_brand = sidecar.get("brand") or "Golden Harvest"
-    facts = [f"{publisher_brand} plot report (cross-vendor)"]
+    facts = [f"{publisher_brand} {'variety trial (cross-vendor, independent third-party)' if include_region else 'plot report (cross-vendor)'}"]
+    if region:
+        facts.append(f"region {region}")
     if cooperator:
         facts.append(f"cooperator {cooperator}")
     if sidecar.get("planted_date"):
@@ -509,6 +521,11 @@ def _render_trial_chunk(sidecar: dict, md_text: str | None = None) -> str:
     # for each (Golden Harvest / LG Seeds / AgriGold).
     if source in ("gh_plot_reports", "lg_plot_reports", "agrigold_plot_reports"):
         return _render_gh_plot_chunk(sidecar)
+    if source in ("illinois_vt_trials", "iowa_icpt_trials", "ohio_ocpt_trials"):
+        # University-extension variety trials — same results[] shape, but
+        # fold region/district into the embedded text (many same-state/year
+        # tables) + label as an independent third-party variety trial.
+        return _render_gh_plot_chunk(sidecar, include_region=True)
     if source == "proharvest_plots":
         # Structured rows → shared cross-vendor renderer (publisher brand
         # read from the sidecar). Foreign-format third-party PDFs that