Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data — land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker). - illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct. - iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season. - ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025. 91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois — true negatives, not parse gaps.) - rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged. - requirements.txt: openpyxl (Illinois XLSX; scrape-time only). - sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials). Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.
2026-06-10 08:35:50 -04:00
parent 0bac06b7b6
commit 54094a0d43
255 changed files with 105410 additions and 13 deletions
@@ -0,0 +1,945 @@
+"""University of Illinois Variety Testing — cross-vendor yield trials.
+
+The University of Illinois Crop Sciences Variety Testing program
+(``vt.cropsci.illinois.edu``) is a long-running, independent land-grant
+testing program. Seed companies pay an entry fee to enter hybrids /
+varieties; UIUC plants them in replicated regional trials and publishes
+the results. Because the program is third-party and cross-vendor, a
+single regional table ranks Pioneer / Brevant / DEKALB / Channel /
+Burrus / Stine / Viking and dozens of others head-to-head — the
+legitimate, independent way to get major-brand performance the
+single-vendor corpus can't scrape directly.
+
+This is a ``data_type: "trial"`` source (NOT variety identity). It emits
+the same per-site cross-vendor sidecar shape as ``gh_plot_reports`` /
+``agrigold_plot_reports`` / ``proharvest_plots`` (``results: [{rank,
+brand, product, traits, metrics}]``), so it falls through to the shared
+``_render_gh_plot_chunk`` renderer in ``rag/chunk.py`` with no chunk.py
+edit. The published table lists entries alphabetically by company, so we
+**synthesize ``rank`` by sorting on Yield descending** (highest yield =
+rank 1) per document.
+
+Data layout (verified Nov 2025):
+  - Static XLSX (+ PDF) per region per year at WordPress upload URLs.
+    The month segment of the upload path varies (``/2025/11/``,
+    ``/2025/12/``, ``/2025/07/`` for wheat), so we DISCOVER the .xlsx
+    hrefs by fetching the /corn/, /soybeans/, /wheat/ index pages — we
+    never guess URLs.
+  - Corn regional tables: ``Company | Name | IST1 | GT2(+spill) | HT3 |
+    Relative Maturity | Yield bu/a | Moisture % | Lodging | <per-site
+    cols> | 2-yr Avg | 3-yr Avg``. Per-site metadata lives in a separate
+    "Trial Info" sheet (regional tables) or in trailing columns
+    (single-site CFC tables).
+  - Soybean regional tables: ``COMPANY | NAME | Herbicide Trait1 | ST2 |
+    Yield bu/a | Maturity Date | Lodging | Height | <2yr/3yr Yield> |
+    Protein @13% | Oil @13%`` with the per-site metadata block in
+    trailing columns.
+  - Wheat regional means tables: ``Company | Variety | ST1 | Yield |
+    Yield Rank | Test wt. | Height | <per-site Yield/Test wt.> |
+    Maturity date | Jointing time | FHB Score | FHB Category``. Wheat
+    publishes its own Yield Rank, which we honor.
+
+The variety table sits below a 2-4 row header band (a group-header row,
+a column-name row, and a units row). Columns are positionally stable
+within a sheet (multi-word brands like "Viking | Blue River" live in a
+single COMPANY cell), so we anchor on the header row that carries
+"Company" + "Name"/"Variety" and map the metric columns from the
+column-name + units rows. The leftmost "Yield (bu/a)" column is the
+**regional** yield — the primary metric.
+
+Section markers interleaved in the table ("Early RM", "Any RM",
+"Non-GMO Hybrids", "Early MG: 1.9-2.7", "Late MG:") and summary rows
+("Average", "L.S.D 25% Level", "CV (%)") are skipped — a data row must
+have a real company, a variety name, and a plausible numeric yield.
+
+Rotation distinction: regional tables are corn-following-soybean (the
+default rotation); "CFC" / "Corn Following Corn" single-site tables
+(Monmouth, Urbana, DeKalb) get ``previous_crop="corn"`` and a distinct
+``-cfc-`` source_key so they never collide with the regional table.
+
+robots/ToS: no usage terms posted on the VT site (publicly-funded
+land-grant; companies pay an entry fee, which doesn't restrict published
+result reuse). Polite UA + light rate limit.
+
+Output:
+  corpus/illinois_vt_trials/<source_key>.md      LLM-visible body
+  corpus/illinois_vt_trials/<source_key>.json    sidecar metadata
+
+source_key: ``ilvt-<crop>-<year>-r<region>`` e.g. ``ilvt-corn-2025-r1``;
+CFC single-site tables: ``ilvt-corn-2024-cfc-monmouth``.
+
+CLI:
+  python -m scrape.sources.illinois_vt_trials --year 2025 --limit 2
+  python -m scrape.sources.illinois_vt_trials --crop corn --force
+  python -m scrape.sources.illinois_vt_trials --include-old --force
+"""
+
+from __future__ import annotations
+
+import argparse
+import io
+import json
+import logging
+import os
+import random
+import re
+import sys
+import time
+from dataclasses import dataclass, field
+from datetime import date, datetime, timezone
+from pathlib import Path
+from typing import Any
+from urllib.parse import urljoin
+
+import openpyxl
+import requests
+from bs4 import BeautifulSoup
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://vt.cropsci.illinois.edu"
+
+TOS_NOTE = (
+    "No usage terms posted on UIUC VT site; publicly-funded land-grant "
+    "data; attribute University of Illinois Variety Testing."
+)
+
+BASELINE_YEARS = {2024, 2025}
+OLD_YEAR_MIN = 2000
+OLD_YEAR_MAX = 2023
+
+# Index pages per crop. PLURAL "soybeans" is the corpus crop value.
+CROP_INDEX = {
+    "corn": "/corn/",
+    "soybeans": "/soybeans/",
+    "wheat": "/wheat/",
+}
+
+REQ_INTERVAL_SEC = 1.0
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "illinois_vt_trials"
+
+log = logging.getLogger("scrape.illinois_vt_trials")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(self, method: str, url: str, *, max_retries: int = 4,
+                timeout: float = 60.0, **kw: Any) -> requests.Response:
+        last_exc: Exception | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s — retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(
+                    30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s — retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if last_exc:
+            raise last_exc
+        return resp  # type: ignore[return-value]
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class TrialDoc:
+    source_key: str
+    crop: str               # corn / soybeans / wheat
+    year: int
+    region: str             # e.g. "Region 1", "Monmouth CFC"
+    xlsx_url: str
+    index_url: str
+
+    rotation: str | None = None          # "corn following soybean" / "corn following corn"
+    previous_crop: str | None = None      # "corn" for CFC tables
+    cooperator: str | None = None         # site host
+    county: str | None = None
+    soil_type: str | None = None
+    tillage: str | None = None
+    planted_date: str | None = None
+    harvested_date: str | None = None
+    row_width: str | None = None
+    latitude: float | None = None
+    longitude: float | None = None
+    sites: list[str] = field(default_factory=list)
+
+    results: list[dict] = field(default_factory=list)
+
+
+# --------------------------------------------------------------------- discovery
+
+
+# Year/Region extraction from a file name or upload path. The VT site has
+# used many naming schemes across years, so try several.
+_YEAR_RE = re.compile(r"(20\d{2})")
+_REGION_NUM_RE = re.compile(r"region[\s_-]*([1-9])", re.I)
+
+
+def _norm_url(href: str) -> str:
+    """Resolve relative hrefs against BASE; also fold the legacy
+    ``varietytesting.web.illinois.edu`` host (same WP install) to BASE so
+    older files download cleanly."""
+    href = href.strip()
+    if href.startswith("http"):
+        return href
+    return urljoin(BASE + "/", href.lstrip("/"))
+
+
+def _file_year(url: str) -> int | None:
+    """Harvest year of a file. Prefer the upload-path year segment
+    (``/uploads/2025/11/...`` or ``/uploads/corn/2021/...``); fall back to
+    the first 20xx in the file name."""
+    m = re.search(r"/uploads/(?:[a-z]+/)?(20\d{2})/", url)
+    if m:
+        return int(m.group(1))
+    fn = url.rsplit("/", 1)[-1]
+    m = _YEAR_RE.search(fn)
+    return int(m.group(1)) if m else None
+
+
+def _classify_region(url: str) -> tuple[str, str | None, str | None] | None:
+    """Return ``(region_label, rotation, previous_crop)`` for a file, or
+    None if it isn't a per-region/per-site VARIETY table we ingest.
+
+    We INCLUDE regional tables (Region 1-5, North/South/East/West) and
+    single-site CFC (corn-following-corn) tables. We EXCLUDE entry lists,
+    agronomic-characteristic sheets, FHB/scab summaries, and disease
+    tables (those aren't head-to-head yield rankings)."""
+    fn = url.rsplit("/", 1)[-1]
+    low = fn.lower()
+
+    # Exclusions — not cross-vendor yield ranking tables.
+    EXCLUDE = ("entry", "entries", "agronomic", "charateristic",  # sic (site typo)
+               "characteristic", "scab", "fhb", "disease")
+    if any(tok in low for tok in EXCLUDE):
+        return None
+
+    # Numbered regions: "Region 1", "2025-Region-1", "northtbl20"
+    m = _REGION_NUM_RE.search(low)
+    if m:
+        return (f"Region {m.group(1)}", "corn following soybean", None)
+
+    # Named regions (corn 2022 + wheat use compass names).
+    for name, label in (("north", "North Region"), ("south", "South Region"),
+                        ("east", "East Region"), ("west", "West Region")):
+        if re.search(rf"\b{name}\b", low) or low.startswith(name + "tbl"):
+            return (label, "corn following soybean", None)
+
+    # Single-site Corn-Following-Corn tables (Monmouth / Urbana / DeKalb).
+    cfc = re.search(r"([a-z]+)[\s_-]*cfc", low)
+    if cfc or "cfc" in low:
+        site = (cfc.group(1).title() if cfc and cfc.group(1) else "CFC")
+        return (f"{site} CFC", "corn following corn", "corn")
+
+    # Wheat single-site summary tables (e.g. "2025-Urbana-Summary",
+    # "2024-Elkville-Table"). Capture the site name as the region.
+    m = re.search(r"20\d{2}[\s_-]+([a-z]+)[\s_-]+(?:summary|table)", low)
+    if m:
+        return (m.group(1).title(), None, None)
+    m = re.search(r"([a-z]+)[\s_-]+(?:summary|table)[\s_-]*20\d{2}", low)
+    if m and m.group(1) not in ("region", "regional"):
+        return (m.group(1).title(), None, None)
+
+    return None
+
+
+def discover_files(http: RateLimitedSession, *, crops: set[str],
+                   years: set[int]) -> list[TrialDoc]:
+    """Fetch each crop index page, extract .xlsx hrefs, classify them into
+    per-region/per-site variety tables, and keep the ones in scope."""
+    out: list[TrialDoc] = []
+    seen_keys: set[str] = set()
+    for crop in sorted(crops):
+        index_path = CROP_INDEX.get(crop)
+        if not index_path:
+            log.warning("unknown crop %r, skipping", crop)
+            continue
+        index_url = BASE + index_path
+        log.info("GET %s", index_url)
+        r = http.get(index_url)
+        r.raise_for_status()
+        soup = BeautifulSoup(r.text, "html.parser")
+        xlsx_hrefs = [a["href"] for a in soup.find_all("a", href=True)
+                      if a["href"].lower().endswith(".xlsx")]
+        # Dedupe while preserving order.
+        seen_href: set[str] = set()
+        for href in xlsx_hrefs:
+            url = _norm_url(href)
+            if url in seen_href:
+                continue
+            seen_href.add(url)
+            year = _file_year(url)
+            if year is None or year not in years:
+                continue
+            cls = _classify_region(url)
+            if cls is None:
+                continue
+            region, rotation, prev_crop = cls
+            # source_key: ilvt-<crop>-<year>-r<n> for numbered regions,
+            # else a slug of the region name.
+            mnum = _REGION_NUM_RE.search(url.rsplit("/", 1)[-1].lower())
+            if mnum and "cfc" not in region.lower():
+                region_slug = f"r{mnum.group(1)}"
+            else:
+                region_slug = re.sub(r"[^a-z0-9]+", "-",
+                                     region.lower()).strip("-")
+            sk = f"ilvt-{crop}-{year}-{region_slug}"
+            if sk in seen_keys:
+                # Two files map to the same key (e.g. a "-final" + a "-1"
+                # duplicate). Keep the first; log the collision.
+                log.info("duplicate source_key %s from %s — skipping dupe",
+                         sk, url)
+                continue
+            seen_keys.add(sk)
+            out.append(TrialDoc(
+                source_key=sk, crop=crop, year=year, region=region,
+                xlsx_url=url, index_url=index_url,
+                rotation=rotation, previous_crop=prev_crop,
+            ))
+        log.info("  %s: %d in-scope variety tables", crop,
+                 sum(1 for d in out if d.crop == crop))
+    return out
+
+
+# --------------------------------------------------------------------- parse
+
+
+def _to_num(v: Any) -> float | int | None:
+    """Coerce a cell to a number. Strips '*', commas; returns None for the
+    VT missing-value markers ('*.*', '-', '') and non-numeric text."""
+    if v is None:
+        return None
+    if isinstance(v, bool):
+        return None
+    if isinstance(v, (int, float)):
+        f = float(v)
+        return int(f) if f.is_integer() else f
+    s = str(v).strip()
+    if not s or s in ("*", "*.*", "-", "—", "."):
+        return None
+    s = s.replace(",", "")
+    if not re.match(r"^-?\d+(?:\.\d+)?$", s):
+        return None
+    f = float(s)
+    return int(f) if f.is_integer() else f
+
+
+def _iso_date(v: Any) -> str | None:
+    if isinstance(v, datetime):
+        return v.date().isoformat()
+    if isinstance(v, date):
+        return v.isoformat()
+    if v is None:
+        return None
+    s = str(v).strip()
+    if not s or s.lower().startswith("did not"):
+        return None
+    # ISO YYYY-MM-DD (with optional trailing " 00:00:00" time).
+    m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T].*)?$", s)
+    if m:
+        yr, mo, dy = m.groups()
+        try:
+            return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
+        except ValueError:
+            return None
+    # US MM/DD/YYYY.
+    m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{2,4})$", s)
+    if m:
+        mo, dy, yr = m.groups()
+        if len(yr) == 2:
+            yr = "20" + yr
+        try:
+            return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
+        except ValueError:
+            return None
+    return None
+
+
+def _txt(v: Any) -> str:
+    if v is None:
+        return ""
+    if isinstance(v, datetime):
+        return v.date().isoformat()
+    return str(v).strip()
+
+
+def _norm(s: Any) -> str:
+    return re.sub(r"\s+", " ", _txt(s)).strip().lower().rstrip(".")
+
+
+# Section markers / summary labels that are NOT data rows.
+_NONDATA_NAME = re.compile(
+    r"^(average|avg\.?|l\.?s\.?d\.?|c\.?v\.?|coeff|mean|std|"
+    r"early mg|late mg|early rm|any rm|non-?gmo|conventional|"
+    r"public|check)\b", re.I)
+_SECTION_COMPANY = re.compile(
+    r"^(early rm|any rm|late rm|early mg|late mg|non-?gmo|conventional|"
+    r"gmo hybrids?|hybrids?)\b", re.I)
+
+
+def _find_header_row(rows: list[tuple]) -> int | None:
+    """Index of the column-name row — the one carrying 'Company' (col 0)
+    and a 'Name' / 'Variety' (col 1)-ish header."""
+    for i, row in enumerate(rows[:15]):
+        c0 = _norm(row[0] if len(row) > 0 else "")
+        c1 = _norm(row[1] if len(row) > 1 else "")
+        if c0 == "company" and c1 in ("name", "variety"):
+            return i
+    return None
+
+
+def _build_colmap(rows: list[tuple], hdr_i: int) -> dict[str, int]:
+    """Map metric -> column index by merging the header band: the
+    column-name row (hdr_i), the row below it, and the group-header row
+    above it. Layouts vary — corn carries the units (bu/a, %) IN the
+    header row with the Yield/Moisture/Lodging labels in the group row
+    above; soy/wheat carry the units in the row below. We want the
+    REGIONAL (leftmost) Yield, not the per-site repeats, so we take the
+    leftmost yield-units column as the primary Yield.
+
+    Returns keys among: company, name, herb_trait, gt, ist, st, maturity,
+    yield, lodging, height, moisture, protein, oil, rank, testwt,
+    yield_2yr, yield_3yr.
+    """
+    name_row = rows[hdr_i]
+    below_row = rows[hdr_i + 1] if hdr_i + 1 < len(rows) else ()
+    group_row = rows[hdr_i - 1] if hdr_i - 1 >= 0 else ()
+
+    def g(row: tuple, i: int) -> str:
+        return _norm(row[i]) if i < len(row) else ""
+
+    ncols = max(len(name_row), len(below_row), len(group_row))
+
+    def band(i: int) -> tuple[str, str, str]:
+        """(group-above, header, below) normalized text for column i."""
+        return (g(group_row, i), g(name_row, i), g(below_row, i))
+
+    cm: dict[str, int] = {"company": 0, "name": 1}
+
+    # Identity / trait columns — these sit on the header (column-name) row.
+    for i in range(ncols):
+        nm = g(name_row, i)
+        if nm in ("herbicide trait1", "herbicide trait", "ht3", "ht"):
+            cm.setdefault("herb_trait", i)
+        elif nm in ("gt2", "gt"):
+            cm.setdefault("gt", i)
+        elif nm in ("ist1", "ist"):
+            cm.setdefault("ist", i)
+        elif nm in ("st1", "st2", "st"):
+            cm.setdefault("st", i)
+        elif nm in ("relative", "maturity", "relative maturity", "maturity date"):
+            cm.setdefault("maturity", i)
+        elif nm in ("yield rank", "rank"):
+            cm.setdefault("rank", i)
+
+    # Metric columns — match across the whole band. A column is a Yield
+    # column if any band row says "yield" OR carries a bu-per-acre unit.
+    yield_cols: list[int] = []
+    moisture_cols: list[int] = []
+    testwt_cols: list[int] = []
+    lodging_cols: list[int] = []
+    height_cols: list[int] = []
+    protein_cols: list[int] = []
+    oil_cols: list[int] = []
+    maturity_cols: list[int] = []
+    YIELD_UNITS = {"bu/a", "bu/ac", "bu/acre"}
+    for i in range(ncols):
+        gp, nm, bl = band(i)
+        cells = {gp, nm, bl}
+        if "yield" in cells or cells & YIELD_UNITS:
+            yield_cols.append(i)
+        if "moisture" in cells:
+            moisture_cols.append(i)
+        if {"test wt", "test weight"} & cells or "lb/bu" in cells:
+            testwt_cols.append(i)
+        if "lodging" in cells:
+            lodging_cols.append(i)
+        if "height" in cells:
+            height_cols.append(i)
+        if "protein" in cells:
+            protein_cols.append(i)
+        if "oil" in cells:
+            oil_cols.append(i)
+        if {"relative", "maturity", "relative maturity", "maturity date"} & cells:
+            maturity_cols.append(i)
+
+    if yield_cols:
+        cm["yield"] = yield_cols[0]
+        # 2yr / 3yr averages: yield-unit columns whose group header says avg.
+        for i in yield_cols[1:]:
+            grp = g(group_row, i)
+            if "2-yr" in grp or "2 yr" in grp or "2yr" in grp:
+                cm.setdefault("yield_2yr", i)
+            elif "3-yr" in grp or "3 yr" in grp or "3yr" in grp:
+                cm.setdefault("yield_3yr", i)
+    if moisture_cols:
+        cm["moisture"] = moisture_cols[0]
+    if testwt_cols:
+        cm["testwt"] = testwt_cols[0]
+    if lodging_cols:
+        cm["lodging"] = lodging_cols[0]
+    if height_cols:
+        cm["height"] = height_cols[0]
+    if protein_cols:
+        cm["protein"] = protein_cols[0]
+    if oil_cols:
+        cm["oil"] = oil_cols[0]
+    if "maturity" not in cm and maturity_cols:
+        cm["maturity"] = maturity_cols[0]
+
+    return cm
+
+
+# The metadata block always carries these labels; we locate its starting
+# column by finding where they appear, so the per-site yield columns that
+# sit between the metric block and the metadata block don't get scanned as
+# labels (they lead with numbers, not a label).
+_META_LABELS = {
+    "location", "county", "site location", "host", "soil type",
+    "planting date", "harvest date", "nitrogen applied", "pesticides",
+    "tillage", "latitude", "longitude", "rainfall", "fungicide",
+}
+
+
+def _scan_meta_labels(src: list[tuple]) -> dict[str, list[str]]:
+    """From a label/value block, build ``{normalized_label: [values...]}``.
+
+    First find the column where the metadata labels live (the leftmost
+    column that holds a known metadata label in some row); the label in
+    each row is the first non-empty cell AT OR AFTER that column, and the
+    values are the non-empty cells to its right. This skips the per-site
+    yield columns that can sit to the left of the metadata block."""
+    # Find the metadata-label column.
+    label_col: int | None = None
+    for row in src:
+        for i, c in enumerate(row):
+            if _norm(c) in _META_LABELS:
+                if label_col is None or i < label_col:
+                    label_col = i
+                break
+    found: dict[str, list[str]] = {}
+    for row in src:
+        label = None
+        label_idx = None
+        start = label_col if label_col is not None else 0
+        for i in range(start, len(row)):
+            t = _txt(row[i])
+            if t:
+                label = t
+                label_idx = i
+                break
+        if label is None:
+            continue
+        key = _norm(label)
+        values = [_txt(c) for i, c in enumerate(row)
+                  if i > (label_idx or 0) and _txt(c)]
+        if values and key not in found:  # keep the first occurrence
+            found[key] = values
+    return found
+
+
+def _first_real(values: list[str]) -> str | None:
+    for v in values:
+        if v and v.lower() not in ("click to see map", "click here for directions"):
+            return v
+    return None
+
+
+def _apply_site_metadata(doc: TrialDoc, found: dict[str, list[str]], *,
+                         crop: str) -> None:
+    """Apply a scanned metadata block to the doc, filling only fields that
+    are still unset (so the first/preferred source wins)."""
+    def setif(attr: str, val: Any) -> None:
+        if val and getattr(doc, attr) is None:
+            setattr(doc, attr, val)
+
+    if "host" in found:
+        setif("cooperator", _first_real(found["host"]))
+    if "location" in found and not doc.sites:
+        doc.sites = [v for v in found["location"]
+                     if v and v.lower() != "click to see map"]
+    if "county" in found:
+        setif("county", _first_real(found["county"]))
+    if "soil type" in found:
+        setif("soil_type", _first_real(found["soil type"]))
+    if "planting date" in found:
+        setif("planted_date", _iso_date(_first_real(found["planting date"])))
+    if "harvest date" in found:
+        setif("harvested_date", _iso_date(_first_real(found["harvest date"])))
+    if "tillage" in found:
+        setif("tillage", _first_real(found["tillage"]))
+    elif "spring" in found and crop != "wheat":
+        # Corn/soy: "Spring"/"Fall" are tillage operations. Wheat: those
+        # same labels are nitrogen rates — never tillage.
+        setif("tillage", _first_real(found["spring"]))
+    if "latitude" in found:
+        lat = _to_num(_first_real(found["latitude"]) or "")
+        if isinstance(lat, (int, float)):
+            setif("latitude", float(lat))
+    if "longitude" in found:
+        lon = _to_num(_first_real(found["longitude"]) or "")
+        if isinstance(lon, (int, float)):
+            setif("longitude", float(lon))
+
+
+def _assemble_traits(row: tuple, cm: dict[str, int]) -> str:
+    """Combine the herbicide-trait + GT (genetic trait, may spill across
+    cols) + seed-treatment columns into one traits string."""
+    bits: list[str] = []
+    # GT can spill from its col up to (but not including) the herb_trait col.
+    if "gt" in cm:
+        gt_start = cm["gt"]
+        gt_end = cm.get("herb_trait", cm.get("maturity", gt_start + 1))
+        gt_toks = [_txt(row[i]) for i in range(gt_start, gt_end)
+                   if i < len(row) and _txt(row[i])]
+        if gt_toks:
+            bits.append("GT:" + "".join(gt_toks))
+    if "herb_trait" in cm:
+        ht = _txt(row[cm["herb_trait"]]) if cm["herb_trait"] < len(row) else ""
+        if ht:
+            bits.append("HT:" + ht)
+    for k, lbl in (("ist", "IST"), ("st", "ST")):
+        if k in cm and cm[k] < len(row):
+            v = _txt(row[cm[k]])
+            if v:
+                bits.append(f"{lbl}:{v}")
+    return " ".join(bits)
+
+
+def _is_data_row(row: tuple, cm: dict[str, int]) -> bool:
+    company = _txt(row[0]) if len(row) > 0 else ""
+    name = _txt(row[1]) if len(row) > 1 else ""
+    if not company or not name:
+        return False
+    if _NONDATA_NAME.match(name) or _NONDATA_NAME.match(company):
+        return False
+    if _SECTION_COMPANY.match(company):
+        return False
+    # Must have a plausible numeric yield.
+    y = _to_num(row[cm["yield"]]) if "yield" in cm and cm["yield"] < len(row) else None
+    if not isinstance(y, (int, float)):
+        return False
+    return True
+
+
+def _plausible_yield(crop: str, y: float) -> bool:
+    if crop == "corn":
+        return 30 < y < 400
+    if crop == "soybeans":
+        return 10 < y < 150
+    if crop == "wheat":
+        return 20 < y < 200
+    return 0 < y < 500
+
+
+def parse_xlsx(content: bytes, doc: TrialDoc) -> None:
+    wb = openpyxl.load_workbook(io.BytesIO(content), data_only=True,
+                                read_only=True)
+    # The yield table is the first sheet whose first ~15 rows contain a
+    # Company/Name header.
+    data_ws = None
+    data_rows: list[tuple] = []
+    hdr_i = None
+    for name in wb.sheetnames:
+        rows = list(wb[name].iter_rows(values_only=True))
+        hi = _find_header_row(rows)
+        if hi is not None:
+            data_ws, data_rows, hdr_i = name, rows, hi
+            break
+    if data_ws is None or hdr_i is None:
+        raise ValueError("no Company/Name header row found in any sheet")
+
+    cm = _build_colmap(data_rows, hdr_i)
+    if "yield" not in cm:
+        raise ValueError("no Yield column located")
+
+    # Site metadata lives in (a) trailing columns of the data sheet
+    # (co-located with the results — most current) and/or (b) a dedicated
+    # "Trial Info" sheet. Read the trailing-column block FIRST so it wins,
+    # then let the info sheet fill any gaps. (Some regional files carry a
+    # stale info sheet — e.g. a 2025 table whose Trial Info sheet still
+    # shows 2021 planting dates — so trailing columns are preferred.)
+    # _scan_meta_labels self-locates the metadata-label column, so the
+    # per-site yield columns between the metric block and the metadata
+    # block aren't mis-read as labels.
+    _apply_site_metadata(doc, _scan_meta_labels(data_rows), crop=doc.crop)
+    info_sheet = next((s for s in wb.sheetnames
+                       if "trial info" in s.lower()
+                       or "trial information" in s.lower()), None)
+    if info_sheet:
+        _apply_site_metadata(
+            doc,
+            _scan_meta_labels(list(wb[info_sheet].iter_rows(values_only=True))),
+            crop=doc.crop)
+
+    results: list[dict] = []
+    for row in data_rows[hdr_i + 2:]:
+        if not _is_data_row(row, cm):
+            continue
+        y = _to_num(row[cm["yield"]])
+        if not isinstance(y, (int, float)) or not _plausible_yield(doc.crop, float(y)):
+            continue
+        brand = _txt(row[0])
+        product = _txt(row[1])
+        # Sanity: a numeric/blank brand is junk.
+        if not brand or brand.isdigit() or len(brand) <= 1:
+            continue
+        metrics: dict[str, Any] = {"Yield": y}
+        for key, label in (("moisture", "Moisture"), ("lodging", "Lodging"),
+                           ("height", "Height"), ("testwt", "Test Wt."),
+                           ("protein", "Protein"), ("oil", "Oil"),
+                           ("maturity", "Maturity"),
+                           ("yield_2yr", "Yield 2yr avg"),
+                           ("yield_3yr", "Yield 3yr avg")):
+            if key in cm and cm[key] < len(row):
+                v = _to_num(row[cm[key]])
+                if v is not None:
+                    metrics[label] = v
+        rec_rank = None
+        if "rank" in cm and cm["rank"] < len(row):
+            rk = _to_num(row[cm["rank"]])
+            if isinstance(rk, (int, float)):
+                rec_rank = int(rk)
+        results.append({
+            "rank": rec_rank,  # synthesized below if None
+            "brand": brand,
+            "product": product,
+            "traits": _assemble_traits(row, cm) or None,
+            "metrics": metrics,
+        })
+
+    # Synthesize rank by Yield DESC when the sheet didn't publish one
+    # (corn/soy list alphabetically). Wheat carries Yield Rank already, but
+    # we re-rank only if any row is missing a rank to keep them consistent.
+    if results and any(r["rank"] is None for r in results):
+        ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
+        for i, r in enumerate(ordered, start=1):
+            r["rank"] = i
+        results = ordered
+    else:
+        results.sort(key=lambda r: r["rank"])
+
+    doc.results = results
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(doc: TrialDoc) -> str:
+    crop_label = {"corn": "Corn", "soybeans": "Soybean",
+                  "wheat": "Wheat"}.get(doc.crop, doc.crop.title())
+    head: list[str] = [
+        f"# {crop_label} yield trial — University of Illinois VT, "
+        f"{doc.region}, IL {doc.year}",
+        "",
+        "- **Publisher:** University of Illinois Variety Testing "
+        "(independent cross-vendor trial)",
+        f"- **Crop:** {crop_label}",
+        "- **State:** IL",
+        f"- **Year:** {doc.year}",
+        f"- **Region:** {doc.region}",
+    ]
+    for label, val in (
+        ("Rotation", doc.rotation),
+        ("Previous crop", doc.previous_crop),
+        ("Cooperator / host", doc.cooperator),
+        ("County", doc.county),
+        ("Sites", ", ".join(doc.sites) if doc.sites else None),
+        ("Soil type", doc.soil_type),
+        ("Tillage", doc.tillage),
+        ("Planted", doc.planted_date),
+        ("Harvested", doc.harvested_date),
+        ("Row width", doc.row_width),
+    ):
+        if val:
+            head.append(f"- **{label}:** {val}")
+    head += [
+        f"- **Source XLSX:** {doc.xlsx_url}",
+        f"- **Index page:** {doc.index_url}",
+        "", "---", "",
+        "## Results (ranked by regional yield)", "",
+    ]
+
+    # Discover metric columns from the first result.
+    metric_keys: list[str] = []
+    for r in doc.results:
+        if r.get("metrics"):
+            metric_keys = list(r["metrics"].keys())
+            break
+    headers = ["Rank", "Brand", "Product", "Traits"] + metric_keys
+    head.append("| " + " | ".join(headers) + " |")
+    head.append("|" + "|".join(["---"] * len(headers)) + "|")
+    for r in doc.results:
+        row = [str(r.get("rank") or "-"), r.get("brand") or "-",
+               r.get("product") or "-", r.get("traits") or "-"]
+        m = r.get("metrics") or {}
+        for k in metric_keys:
+            v = m.get(k)
+            row.append("-" if v is None else str(v))
+        head.append("| " + " | ".join(row) + " |")
+    head.append("")
+    return "\n".join(head)
+
+
+def write_doc(doc: TrialDoc, body_md: str) -> None:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    (CORPUS_DIR / f"{doc.source_key}.md").write_text(body_md, encoding="utf-8")
+    sidecar = {
+        "source": "illinois_vt_trials",
+        "source_key": doc.source_key,
+        "data_type": "trial",
+        "vendor": "University of Illinois",
+        "brand_aggregator": "University of Illinois Variety Testing publishes",
+        "brand": "University of Illinois VT",
+        "crop": doc.crop,
+        "state": "IL",
+        "state_abbrev": "il",
+        "year": doc.year,
+        "region": doc.region,
+        "rotation": doc.rotation,
+        "previous_crop": doc.previous_crop,
+        "cooperator": doc.cooperator,
+        "county": doc.county,
+        "sites": doc.sites or None,
+        "soil_type": doc.soil_type,
+        "tillage": doc.tillage,
+        "planted_date": doc.planted_date,
+        "harvested_date": doc.harvested_date,
+        "row_width": doc.row_width,
+        "latitude": doc.latitude,
+        "longitude": doc.longitude,
+        "results": doc.results,
+        "n_results": len(doc.results),
+        "source_urls": [doc.xlsx_url, doc.index_url],
+        "tos_note": TOS_NOTE,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    (CORPUS_DIR / f"{doc.source_key}.json").write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8")
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def process_doc(http: RateLimitedSession, doc: TrialDoc, *,
+                force: bool) -> str:
+    md_path = CORPUS_DIR / f"{doc.source_key}.md"
+    if md_path.exists() and not force:
+        return "skipped"
+    try:
+        r = http.get(doc.xlsx_url)
+        r.raise_for_status()
+        parse_xlsx(r.content, doc)
+    except Exception as exc:  # noqa: BLE001
+        log.error("%s parse failed (%s): %s", doc.source_key, doc.xlsx_url, exc)
+        return "failed"
+    if not doc.results:
+        log.warning("%s — no valid result rows parsed; skipping", doc.source_key)
+        return "empty"
+    write_doc(doc, render_markdown(doc))
+    log.info("%s written | %s %s %s | %d results | top: %s",
+             doc.source_key, doc.crop, doc.region, doc.year, len(doc.results),
+             doc.results[0]["brand"] + " " + doc.results[0]["product"]
+             if doc.results else "-")
+    return "written"
+
+
+def run(*, crops: set[str], years: set[int], limit: int | None,
+        force: bool) -> int:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    docs = discover_files(http, crops=crops, years=years)
+    log.info("discovered %d in-scope variety tables", len(docs))
+    if limit is not None:
+        docs = docs[:limit]
+
+    counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
+    for i, doc in enumerate(docs, start=1):
+        status = process_doc(http, doc, force=force)
+        counts[status] = counts.get(status, 0) + 1
+        if status != "written" or i <= 5 or i % 20 == 0:
+            log.info("[%d/%d] %s -> %s", i, len(docs), doc.source_key, status)
+
+    log.info("done: written=%d skipped=%d empty=%d failed=%d (of %d)",
+             counts["written"], counts["skipped"], counts["empty"],
+             counts["failed"], len(docs))
+    return 0 if counts["failed"] == 0 else 1
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.illinois_vt_trials",
+        description="Scrape University of Illinois Variety Testing "
+                    "cross-vendor yield trials (XLSX) into the corpus.")
+    p.add_argument("--year", type=int, default=None,
+                   help="Scrape a single harvest year (default: 2024+2025).")
+    p.add_argument("--include-old", action="store_true",
+                   help="Also scrape 2000–2023 (deferred by default).")
+    p.add_argument("--crop", default=None, choices=tuple(CROP_INDEX.keys()),
+                   help="Limit to one crop (corn / soybeans / wheat).")
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after processing N tables (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-fetch even if the markdown file already exists.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr)
+
+    crops = {args.crop} if args.crop else set(CROP_INDEX.keys())
+    if args.year is not None:
+        years = {args.year}
+    elif args.include_old:
+        years = set(range(OLD_YEAR_MIN, OLD_YEAR_MAX + 1)) | BASELINE_YEARS
+    else:
+        years = set(BASELINE_YEARS)
+
+    return run(crops=crops, years=years, limit=args.limit, force=args.force)
+
+
+if __name__ == "__main__":
+    sys.exit(main())