Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)
Independent third-party performance data: land-grant programs that test every entered brand side by side, with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance that the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows): U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + wheat, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs. corn-following-soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows): Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows): OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three sources; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program, e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois. These are true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk and the chunk is labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials).

Validation: all 123 docs chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.
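
The validation pass mentioned above (out-of-range yields, duplicate keys) could look roughly like the sketch below. It is an illustration only, not the repository's actual check: the `results` list is taken from the commit message, but the `doc_id` key, the `yield_bu_ac` row field, and the 0-400 bounds are hypothetical names and values chosen for the example.

```python
from collections import Counter

# Hypothetical bounds; the commit does not state the real accepted range.
YIELD_MIN, YIELD_MAX = 0.0, 400.0


def validate_trials(sidecars):
    """Return (out_of_range_rows, duplicate_keys) for a list of trial sidecars.

    Assumes each sidecar is a dict with a hypothetical ``doc_id`` key and a
    ``results`` list whose rows carry a hypothetical ``yield_bu_ac`` number.
    """
    out_of_range = []
    key_counts = Counter(s.get("doc_id") for s in sidecars)
    for sidecar in sidecars:
        for row in sidecar.get("results") or []:
            value = row.get("yield_bu_ac")
            if value is None:
                continue  # a missing yield is not counted as out of range here
            if not (YIELD_MIN <= value <= YIELD_MAX):
                out_of_range.append((sidecar.get("doc_id"), row))
    # Keys seen more than once, in first-seen order.
    duplicates = [key for key, count in key_counts.items() if count > 1]
    return out_of_range, duplicates
```

A run that matches the commit's claim would return two empty lists for the 123 documents.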
+20
-3
@@ -330,7 +330,7 @@ def chunks_from_variety(
 # signal for queries like "best corn for sandy soil Iowa 2024".
 
 
-def _render_gh_plot_chunk(sidecar: dict) -> str:
+def _render_gh_plot_chunk(sidecar: dict, *, include_region: bool = False) -> str:
     """Render a cross-vendor plot report (per-site head-to-head).
 
     Originally GH-specific; now also handles ``lg_plot_reports`` and
@@ -340,6 +340,12 @@ def _render_gh_plot_chunk(sidecar: dict) -> str:
     queries should still find DEKALB results inside a GH or AgriGold
     plot — search filters target ``brand_in_results``, not the
     publisher's brand).
+
+    ``include_region`` (university-trial sources) folds the
+    region/district into the title + facts so it's in the embedded
+    text — these sources publish many same-state/year tables that are
+    only distinguished by region (e.g. Iowa "District South"), and
+    without this the region lived only in metadata/the .md body.
     """
     lines: list[str] = []
     crop = (sidecar.get("crop") or "").lower()
@@ -350,12 +356,18 @@ def _render_gh_plot_chunk(sidecar: dict) -> str:
     state = sidecar.get("state") or sidecar.get("state_abbrev") or ""
     year = sidecar.get("year") or ""
     cooperator = sidecar.get("cooperator") or ""
+    region = (sidecar.get("region") or "").strip() if include_region else ""
 
-    lines.append(f"# {crop_label} yield trial — {state}, {year}")
+    title = f"# {crop_label} yield trial — {state}, {year}"
+    if region:
+        title += f" ({region})"
+    lines.append(title)
     lines.append("")
     # Publisher label — emphasizes the source brand for retrieval.
     publisher_brand = sidecar.get("brand") or "Golden Harvest"
-    facts = [f"{publisher_brand} plot report (cross-vendor)"]
+    facts = [f"{publisher_brand} {'variety trial (cross-vendor, independent third-party)' if include_region else 'plot report (cross-vendor)'}"]
+    if region:
+        facts.append(f"region {region}")
     if cooperator:
         facts.append(f"cooperator {cooperator}")
     if sidecar.get("planted_date"):
@@ -509,6 +521,11 @@ def _render_trial_chunk(sidecar: dict, md_text: str | None = None) -> str:
     # for each (Golden Harvest / LG Seeds / AgriGold).
     if source in ("gh_plot_reports", "lg_plot_reports", "agrigold_plot_reports"):
         return _render_gh_plot_chunk(sidecar)
+    if source in ("illinois_vt_trials", "iowa_icpt_trials", "ohio_ocpt_trials"):
+        # University-extension variety trials — same results[] shape, but
+        # fold region/district into the embedded text (many same-state/year
+        # tables) + label as an independent third-party variety trial.
+        return _render_gh_plot_chunk(sidecar, include_region=True)
     if source == "proharvest_plots":
         # Structured rows → shared cross-vendor renderer (publisher brand
         # read from the sidecar). Foreign-format third-party PDFs that
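
To see what the `include_region` gate changes in the embedded text, here is a standalone sketch of just the title and first facts built in the diff. It is not the real `_render_gh_plot_chunk`: the helper name `render_header` is hypothetical, `crop_label` is assumed to be a capitalized crop name (the diff does not show how it is derived), and the sample sidecar values are invented.

```python
def render_header(sidecar: dict, *, include_region: bool = False) -> tuple[str, list[str]]:
    """Sketch of the title + leading facts; mirrors the changed lines only."""
    # Assumption: the real code computes crop_label elsewhere.
    crop_label = (sidecar.get("crop") or "").capitalize()
    state = sidecar.get("state") or sidecar.get("state_abbrev") or ""
    year = sidecar.get("year") or ""
    # Region is read only when the caller opts in (university-trial sources).
    region = (sidecar.get("region") or "").strip() if include_region else ""

    title = f"# {crop_label} yield trial — {state}, {year}"
    if region:
        title += f" ({region})"

    publisher_brand = sidecar.get("brand") or "Golden Harvest"
    kind = (
        "variety trial (cross-vendor, independent third-party)"
        if include_region
        else "plot report (cross-vendor)"
    )
    facts = [f"{publisher_brand} {kind}"]
    if region:
        facts.append(f"region {region}")
    return title, facts


# Invented sample sidecar for illustration.
sample = {"crop": "corn", "state": "IA", "year": 2024,
          "region": "District South", "brand": "Iowa State ICPT"}
print(render_header(sample, include_region=True))
print(render_header(sample))
```

With the flag off, the region never reaches the title or facts even when the sidecar carries one, which is why the existing gh/lg/agrigold/proharvest output stays unchanged.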