Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data: land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows): U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows): Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows): OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program, e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois; these are true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials).

Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys.
Public land-grant data; attribution recorded in each tos_note.
CI rebuilds the index from the committed corpus.
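
The validation counts above could be reproduced with a check along these lines. This is a minimal sketch, not the script actually used for this commit: `validate_sidecars` is a hypothetical helper, the top-level sidecar keys `source_key` and `crop` are assumed to mirror the scraper's `TrialDoc` fields, "dup keys" is read here as duplicate `source_key` values, and the sample rows are made up for illustration. The yield windows mirror `_plausible_yield` in the Illinois scraper.

```python
from collections import Counter

# Plausible yield windows in bu/a, mirroring _plausible_yield in the scraper.
YIELD_RANGE = {"corn": (30, 400), "soybeans": (10, 150), "wheat": (20, 200)}


def validate_sidecars(sidecars):
    """Return (out_of_range_rows, duplicate_source_keys) for trial sidecars."""
    out_of_range = []
    for doc in sidecars:
        lo, hi = YIELD_RANGE.get(doc["crop"], (0, 500))
        for row in doc["results"]:
            y = row["metrics"]["Yield"]
            if not (lo < y < hi):
                out_of_range.append((doc["source_key"], row["product"], y))
    counts = Counter(doc["source_key"] for doc in sidecars)
    dup_keys = sorted(k for k, n in counts.items() if n > 1)
    return out_of_range, dup_keys


# Made-up sample: one implausible corn yield and one repeated source_key.
sample = [
    {"source_key": "ilvt-corn-2025-r1", "crop": "corn", "results": [
        {"rank": 1, "brand": "ExampleBrand", "product": "EX-1",
         "traits": None, "metrics": {"Yield": 245.3}},
        {"rank": 2, "brand": "ExampleBrand", "product": "EX-2",
         "traits": None, "metrics": {"Yield": 12.0}},
    ]},
    {"source_key": "ilvt-corn-2025-r1", "crop": "corn", "results": []},
]

out_of_range, dup_keys = validate_sidecars(sample)
print(out_of_range)
print(dup_keys)
```

A clean corpus would yield two empty lists, matching the "0 out-of-range yields, 0 dup keys" line.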
2026-06-10 08:35:50 -04:00
parent 0bac06b7b6
commit 54094a0d43
255 changed files with 105410 additions and 13 deletions
@@ -0,0 +1,945 @@
+"""University of Illinois Variety Testing — cross-vendor yield trials.
+
+The University of Illinois Crop Sciences Variety Testing program
+(``vt.cropsci.illinois.edu``) is a long-running, independent land-grant
+testing program. Seed companies pay an entry fee to enter hybrids /
+varieties; UIUC plants them in replicated regional trials and publishes
+the results. Because the program is third-party and cross-vendor, a
+single regional table ranks Pioneer / Brevant / DEKALB / Channel /
+Burrus / Stine / Viking and dozens of others head-to-head — the
+legitimate, independent way to get major-brand performance the
+single-vendor corpus can't scrape directly.
+
+This is a ``data_type: "trial"`` source (NOT variety identity). It emits
+the same per-site cross-vendor sidecar shape as ``gh_plot_reports`` /
+``agrigold_plot_reports`` / ``proharvest_plots`` (``results: [{rank,
+brand, product, traits, metrics}]``), so it routes through the shared
+``_render_gh_plot_chunk`` renderer in ``rag/chunk.py`` (via the new
+``include_region`` gate). Corn and soybean tables list entries
+alphabetically by company, so we **synthesize ``rank`` by sorting on
+Yield descending** (highest yield = rank 1) per document.
+
+Data layout (verified Nov 2025):
+  - Static XLSX (+ PDF) per region per year at WordPress upload URLs.
+    The month segment of the upload path varies (``/2025/11/``,
+    ``/2025/12/``, ``/2025/07/`` for wheat), so we DISCOVER the .xlsx
+    hrefs by fetching the /corn/, /soybeans/, /wheat/ index pages — we
+    never guess URLs.
+  - Corn regional tables: ``Company | Name | IST1 | GT2(+spill) | HT3 |
+    Relative Maturity | Yield bu/a | Moisture % | Lodging | <per-site
+    cols> | 2-yr Avg | 3-yr Avg``. Per-site metadata lives in a separate
+    "Trial Info" sheet (regional tables) or in trailing columns
+    (single-site CFC tables).
+  - Soybean regional tables: ``COMPANY | NAME | Herbicide Trait1 | ST2 |
+    Yield bu/a | Maturity Date | Lodging | Height | <2yr/3yr Yield> |
+    Protein @13% | Oil @13%`` with the per-site metadata block in
+    trailing columns.
+  - Wheat regional means tables: ``Company | Variety | ST1 | Yield |
+    Yield Rank | Test wt. | Height | <per-site Yield/Test wt.> |
+    Maturity date | Jointing time | FHB Score | FHB Category``. Wheat
+    publishes its own Yield Rank, which we honor.
+
+The variety table sits below a 2-4 row header band (a group-header row,
+a column-name row, and a units row). Columns are positionally stable
+within a sheet (multi-word brands like "Viking | Blue River" live in a
+single COMPANY cell), so we anchor on the header row that carries
+"Company" + "Name"/"Variety" and map the metric columns from the
+column-name + units rows. The leftmost "Yield (bu/a)" column is the
+**regional** yield — the primary metric.
+
+Section markers interleaved in the table ("Early RM", "Any RM",
+"Non-GMO Hybrids", "Early MG: 1.9-2.7", "Late MG:") and summary rows
+("Average", "L.S.D 25% Level", "CV (%)") are skipped — a data row must
+have a real company, a variety name, and a plausible numeric yield.
+
+Rotation distinction: regional tables are corn-following-soybean (the
+default rotation); "CFC" / "Corn Following Corn" single-site tables
+(Monmouth, Urbana, DeKalb) get ``previous_crop="corn"`` and a distinct
+``-cfc-`` source_key so they never collide with the regional table.
+
+robots/ToS: no usage terms posted on the VT site (publicly-funded
+land-grant; companies pay an entry fee, which doesn't restrict published
+result reuse). Polite UA + light rate limit.
+
+Output:
+  corpus/illinois_vt_trials/<source_key>.md      LLM-visible body
+  corpus/illinois_vt_trials/<source_key>.json    sidecar metadata
+
+source_key: ``ilvt-<crop>-<year>-r<region>`` e.g. ``ilvt-corn-2025-r1``;
+CFC single-site tables: ``ilvt-corn-2024-cfc-monmouth``.
+
+CLI:
+  python -m scrape.sources.illinois_vt_trials --year 2025 --limit 2
+  python -m scrape.sources.illinois_vt_trials --crop corn --force
+  python -m scrape.sources.illinois_vt_trials --include-old --force
+"""
+
+from __future__ import annotations
+
+import argparse
+import io
+import json
+import logging
+import os
+import random
+import re
+import sys
+import time
+from dataclasses import dataclass, field
+from datetime import date, datetime, timezone
+from pathlib import Path
+from typing import Any
+from urllib.parse import urljoin
+
+import openpyxl
+import requests
+from bs4 import BeautifulSoup
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://vt.cropsci.illinois.edu"
+
+TOS_NOTE = (
+    "No usage terms posted on UIUC VT site; publicly-funded land-grant "
+    "data; attribute University of Illinois Variety Testing."
+)
+
+BASELINE_YEARS = {2024, 2025}
+OLD_YEAR_MIN = 2000
+OLD_YEAR_MAX = 2023
+
+# Index pages per crop. PLURAL "soybeans" is the corpus crop value.
+CROP_INDEX = {
+    "corn": "/corn/",
+    "soybeans": "/soybeans/",
+    "wheat": "/wheat/",
+}
+
+REQ_INTERVAL_SEC = 1.0
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "illinois_vt_trials"
+
+log = logging.getLogger("scrape.illinois_vt_trials")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(self, method: str, url: str, *, max_retries: int = 4,
+                timeout: float = 60.0, **kw: Any) -> requests.Response:
+        last_exc: Exception | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s — retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(
+                    30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s — retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if last_exc:
+            raise last_exc
+        return resp  # type: ignore[return-value]
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class TrialDoc:
+    source_key: str
+    crop: str               # corn / soybeans / wheat
+    year: int
+    region: str             # e.g. "Region 1", "Monmouth CFC"
+    xlsx_url: str
+    index_url: str
+
+    rotation: str | None = None          # "corn following soybean" / "corn following corn"
+    previous_crop: str | None = None      # "corn" for CFC tables
+    cooperator: str | None = None         # site host
+    county: str | None = None
+    soil_type: str | None = None
+    tillage: str | None = None
+    planted_date: str | None = None
+    harvested_date: str | None = None
+    row_width: str | None = None
+    latitude: float | None = None
+    longitude: float | None = None
+    sites: list[str] = field(default_factory=list)
+
+    results: list[dict] = field(default_factory=list)
+
+
+# --------------------------------------------------------------------- discovery
+
+
+# Year/Region extraction from a file name or upload path. The VT site has
+# used many naming schemes across years, so try several.
+_YEAR_RE = re.compile(r"(20\d{2})")
+_REGION_NUM_RE = re.compile(r"region[\s_-]*([1-9])", re.I)
+
+
+def _norm_url(href: str) -> str:
+    """Resolve relative hrefs against BASE. Absolute URLs, including ones
+    on the legacy ``varietytesting.web.illinois.edu`` host (same WP
+    install), are returned unchanged."""
+    href = href.strip()
+    if href.startswith("http"):
+        return href
+    return urljoin(BASE + "/", href.lstrip("/"))
+
+
+def _file_year(url: str) -> int | None:
+    """Harvest year of a file. Prefer the upload-path year segment
+    (``/uploads/2025/11/...`` or ``/uploads/corn/2021/...``); fall back to
+    the first 20xx in the file name."""
+    m = re.search(r"/uploads/(?:[a-z]+/)?(20\d{2})/", url)
+    if m:
+        return int(m.group(1))
+    fn = url.rsplit("/", 1)[-1]
+    m = _YEAR_RE.search(fn)
+    return int(m.group(1)) if m else None
+
+
+def _classify_region(url: str) -> tuple[str, str | None, str | None] | None:
+    """Return ``(region_label, rotation, previous_crop)`` for a file, or
+    None if it isn't a per-region/per-site VARIETY table we ingest.
+
+    We INCLUDE regional tables (Region 1-5, North/South/East/West) and
+    single-site CFC (corn-following-corn) tables. We EXCLUDE entry lists,
+    agronomic-characteristic sheets, FHB/scab summaries, and disease
+    tables (those aren't head-to-head yield rankings)."""
+    fn = url.rsplit("/", 1)[-1]
+    low = fn.lower()
+
+    # Exclusions — not cross-vendor yield ranking tables.
+    EXCLUDE = ("entry", "entries", "agronomic", "charateristic",  # sic (site typo)
+               "characteristic", "scab", "fhb", "disease")
+    if any(tok in low for tok in EXCLUDE):
+        return None
+
+    # Numbered regions: "Region 1", "2025-Region-1".
+    m = _REGION_NUM_RE.search(low)
+    if m:
+        return (f"Region {m.group(1)}", "corn following soybean", None)
+
+    # Named regions (corn 2022 + wheat use compass names, e.g. "northtbl20").
+    for name, label in (("north", "North Region"), ("south", "South Region"),
+                        ("east", "East Region"), ("west", "West Region")):
+        if re.search(rf"\b{name}\b", low) or low.startswith(name + "tbl"):
+            return (label, "corn following soybean", None)
+
+    # Single-site Corn-Following-Corn tables (Monmouth / Urbana / DeKalb).
+    cfc = re.search(r"([a-z]+)[\s_-]*cfc", low)
+    if cfc or "cfc" in low:
+        site = (cfc.group(1).title() if cfc and cfc.group(1) else "CFC")
+        return (f"{site} CFC", "corn following corn", "corn")
+
+    # Wheat single-site summary tables (e.g. "2025-Urbana-Summary",
+    # "2024-Elkville-Table"). Capture the site name as the region.
+    m = re.search(r"20\d{2}[\s_-]+([a-z]+)[\s_-]+(?:summary|table)", low)
+    if m:
+        return (m.group(1).title(), None, None)
+    m = re.search(r"([a-z]+)[\s_-]+(?:summary|table)[\s_-]*20\d{2}", low)
+    if m and m.group(1) not in ("region", "regional"):
+        return (m.group(1).title(), None, None)
+
+    return None
+
+
+def discover_files(http: RateLimitedSession, *, crops: set[str],
+                   years: set[int]) -> list[TrialDoc]:
+    """Fetch each crop index page, extract .xlsx hrefs, classify them into
+    per-region/per-site variety tables, and keep the ones in scope."""
+    out: list[TrialDoc] = []
+    seen_keys: set[str] = set()
+    for crop in sorted(crops):
+        index_path = CROP_INDEX.get(crop)
+        if not index_path:
+            log.warning("unknown crop %r, skipping", crop)
+            continue
+        index_url = BASE + index_path
+        log.info("GET %s", index_url)
+        r = http.get(index_url)
+        r.raise_for_status()
+        soup = BeautifulSoup(r.text, "html.parser")
+        xlsx_hrefs = [a["href"] for a in soup.find_all("a", href=True)
+                      if a["href"].lower().endswith(".xlsx")]
+        # Dedupe while preserving order.
+        seen_href: set[str] = set()
+        for href in xlsx_hrefs:
+            url = _norm_url(href)
+            if url in seen_href:
+                continue
+            seen_href.add(url)
+            year = _file_year(url)
+            if year is None or year not in years:
+                continue
+            cls = _classify_region(url)
+            if cls is None:
+                continue
+            region, rotation, prev_crop = cls
+            # source_key: ilvt-<crop>-<year>-r<n> for numbered regions,
+            # else a slug of the region name.
+            mnum = _REGION_NUM_RE.search(url.rsplit("/", 1)[-1].lower())
+            if mnum and "cfc" not in region.lower():
+                region_slug = f"r{mnum.group(1)}"
+            else:
+                region_slug = re.sub(r"[^a-z0-9]+", "-",
+                                     region.lower()).strip("-")
+            sk = f"ilvt-{crop}-{year}-{region_slug}"
+            if sk in seen_keys:
+                # Two files map to the same key (e.g. a "-final" + a "-1"
+                # duplicate). Keep the first; log the collision.
+                log.info("duplicate source_key %s from %s — skipping dupe",
+                         sk, url)
+                continue
+            seen_keys.add(sk)
+            out.append(TrialDoc(
+                source_key=sk, crop=crop, year=year, region=region,
+                xlsx_url=url, index_url=index_url,
+                rotation=rotation, previous_crop=prev_crop,
+            ))
+        log.info("  %s: %d in-scope variety tables", crop,
+                 sum(1 for d in out if d.crop == crop))
+    return out
+
+
+# --------------------------------------------------------------------- parse
+
+
+def _to_num(v: Any) -> float | int | None:
+    """Coerce a cell to a number. Strips '*', commas; returns None for the
+    VT missing-value markers ('*.*', '-', '') and non-numeric text."""
+    if v is None:
+        return None
+    if isinstance(v, bool):
+        return None
+    if isinstance(v, (int, float)):
+        f = float(v)
+        return int(f) if f.is_integer() else f
+    s = str(v).strip()
+    if not s or s in ("*", "*.*", "-", "—", "."):
+        return None
+    s = s.replace(",", "").replace("*", "").strip()  # drop footnote stars
+    if not re.match(r"^-?\d+(?:\.\d+)?$", s):
+        return None
+    f = float(s)
+    return int(f) if f.is_integer() else f
+
+
+def _iso_date(v: Any) -> str | None:
+    if isinstance(v, datetime):
+        return v.date().isoformat()
+    if isinstance(v, date):
+        return v.isoformat()
+    if v is None:
+        return None
+    s = str(v).strip()
+    if not s or s.lower().startswith("did not"):
+        return None
+    # ISO YYYY-MM-DD (optional trailing time); date() rejects impossible days.
+    m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T].*)?$", s)
+    if m:
+        yr, mo, dy = m.groups()
+        try:
+            return date(int(yr), int(mo), int(dy)).isoformat()
+        except ValueError:
+            return None
+    # US MM/DD/YYYY.
+    m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{2,4})$", s)
+    if m:
+        mo, dy, yr = m.groups()
+        if len(yr) == 2:
+            yr = "20" + yr
+        try:
+            return date(int(yr), int(mo), int(dy)).isoformat()
+        except ValueError:
+            return None
+    return None
+
+
+def _txt(v: Any) -> str:
+    if v is None:
+        return ""
+    if isinstance(v, datetime):
+        return v.date().isoformat()
+    return str(v).strip()
+
+
+def _norm(s: Any) -> str:
+    return re.sub(r"\s+", " ", _txt(s)).strip().lower().rstrip(".")
+
+
+# Section markers / summary labels that are NOT data rows.
+_NONDATA_NAME = re.compile(
+    r"^(average|avg\.?|l\.?s\.?d\.?|c\.?v\.?|coeff|mean|std|"
+    r"early mg|late mg|early rm|any rm|non-?gmo|conventional|"
+    r"public|check)\b", re.I)
+_SECTION_COMPANY = re.compile(
+    r"^(early rm|any rm|late rm|early mg|late mg|non-?gmo|conventional|"
+    r"gmo hybrids?|hybrids?)\b", re.I)
+
+
+def _find_header_row(rows: list[tuple]) -> int | None:
+    """Index of the column-name row: 'Company' in col 0 and 'Name' or
+    'Variety' in col 1, searched within the first 15 rows."""
+    for i, row in enumerate(rows[:15]):
+        c0 = _norm(row[0] if len(row) > 0 else "")
+        c1 = _norm(row[1] if len(row) > 1 else "")
+        if c0 == "company" and c1 in ("name", "variety"):
+            return i
+    return None
+
+
+def _build_colmap(rows: list[tuple], hdr_i: int) -> dict[str, int]:
+    """Map metric -> column index by merging the header band: the
+    column-name row (hdr_i), the row below it, and the group-header row
+    above it. Layouts vary — corn carries the units (bu/a, %) IN the
+    header row with the Yield/Moisture/Lodging labels in the group row
+    above; soy/wheat carry the units in the row below. We want the
+    REGIONAL (leftmost) Yield, not the per-site repeats, so we take the
+    leftmost yield-units column as the primary Yield.
+
+    Returns keys among: company, name, herb_trait, gt, ist, st, maturity,
+    yield, lodging, height, moisture, protein, oil, rank, testwt,
+    yield_2yr, yield_3yr.
+    """
+    name_row = rows[hdr_i]
+    below_row = rows[hdr_i + 1] if hdr_i + 1 < len(rows) else ()
+    group_row = rows[hdr_i - 1] if hdr_i - 1 >= 0 else ()
+
+    def g(row: tuple, i: int) -> str:
+        return _norm(row[i]) if i < len(row) else ""
+
+    ncols = max(len(name_row), len(below_row), len(group_row))
+
+    def band(i: int) -> tuple[str, str, str]:
+        """(group-above, header, below) normalized text for column i."""
+        return (g(group_row, i), g(name_row, i), g(below_row, i))
+
+    cm: dict[str, int] = {"company": 0, "name": 1}
+
+    # Identity / trait columns — these sit on the header (column-name) row.
+    for i in range(ncols):
+        nm = g(name_row, i)
+        if nm in ("herbicide trait1", "herbicide trait", "ht3", "ht"):
+            cm.setdefault("herb_trait", i)
+        elif nm in ("gt2", "gt"):
+            cm.setdefault("gt", i)
+        elif nm in ("ist1", "ist"):
+            cm.setdefault("ist", i)
+        elif nm in ("st1", "st2", "st"):
+            cm.setdefault("st", i)
+        elif nm in ("relative", "maturity", "relative maturity", "maturity date"):
+            cm.setdefault("maturity", i)
+        elif nm in ("yield rank", "rank"):
+            cm.setdefault("rank", i)
+
+    # Metric columns — match across the whole band. A column is a Yield
+    # column if any band row says "yield" OR carries a bu-per-acre unit.
+    yield_cols: list[int] = []
+    moisture_cols: list[int] = []
+    testwt_cols: list[int] = []
+    lodging_cols: list[int] = []
+    height_cols: list[int] = []
+    protein_cols: list[int] = []
+    oil_cols: list[int] = []
+    maturity_cols: list[int] = []
+    YIELD_UNITS = {"bu/a", "bu/ac", "bu/acre"}
+    for i in range(ncols):
+        gp, nm, bl = band(i)
+        cells = {gp, nm, bl}
+        if "yield" in cells or cells & YIELD_UNITS:
+            yield_cols.append(i)
+        if "moisture" in cells:
+            moisture_cols.append(i)
+        if {"test wt", "test weight"} & cells or "lb/bu" in cells:
+            testwt_cols.append(i)
+        if "lodging" in cells:
+            lodging_cols.append(i)
+        if "height" in cells:
+            height_cols.append(i)
+        if "protein" in cells:
+            protein_cols.append(i)
+        if "oil" in cells:
+            oil_cols.append(i)
+        if {"relative", "maturity", "relative maturity", "maturity date"} & cells:
+            maturity_cols.append(i)
+
+    if yield_cols:
+        cm["yield"] = yield_cols[0]
+        # 2yr / 3yr averages: yield-unit columns whose group header says avg.
+        for i in yield_cols[1:]:
+            grp = g(group_row, i)
+            if "2-yr" in grp or "2 yr" in grp or "2yr" in grp:
+                cm.setdefault("yield_2yr", i)
+            elif "3-yr" in grp or "3 yr" in grp or "3yr" in grp:
+                cm.setdefault("yield_3yr", i)
+    if moisture_cols:
+        cm["moisture"] = moisture_cols[0]
+    if testwt_cols:
+        cm["testwt"] = testwt_cols[0]
+    if lodging_cols:
+        cm["lodging"] = lodging_cols[0]
+    if height_cols:
+        cm["height"] = height_cols[0]
+    if protein_cols:
+        cm["protein"] = protein_cols[0]
+    if oil_cols:
+        cm["oil"] = oil_cols[0]
+    if "maturity" not in cm and maturity_cols:
+        cm["maturity"] = maturity_cols[0]
+
+    return cm
+
+
+# The metadata block always carries these labels; we locate its starting
+# column by finding where they appear, so the per-site yield columns that
+# sit between the metric block and the metadata block don't get scanned as
+# labels (they lead with numbers, not a label).
+_META_LABELS = {
+    "location", "county", "site location", "host", "soil type",
+    "planting date", "harvest date", "nitrogen applied", "pesticides",
+    "tillage", "latitude", "longitude", "rainfall", "fungicide",
+}
+
+
+def _scan_meta_labels(src: list[tuple]) -> dict[str, list[str]]:
+    """From a label/value block, build ``{normalized_label: [values...]}``.
+
+    First find the column where the metadata labels live (the leftmost
+    column that holds a known metadata label in some row); the label in
+    each row is the first non-empty cell AT OR AFTER that column, and the
+    values are the non-empty cells to its right. This skips the per-site
+    yield columns that can sit to the left of the metadata block."""
+    # Find the metadata-label column.
+    label_col: int | None = None
+    for row in src:
+        for i, c in enumerate(row):
+            if _norm(c) in _META_LABELS:
+                if label_col is None or i < label_col:
+                    label_col = i
+                break
+    found: dict[str, list[str]] = {}
+    for row in src:
+        label = None
+        label_idx = None
+        start = label_col if label_col is not None else 0
+        for i in range(start, len(row)):
+            t = _txt(row[i])
+            if t:
+                label = t
+                label_idx = i
+                break
+        if label is None:
+            continue
+        key = _norm(label)
+        values = [_txt(c) for i, c in enumerate(row)
+                  if i > (label_idx or 0) and _txt(c)]
+        if values and key not in found:  # keep the first occurrence
+            found[key] = values
+    return found
+
+
+def _first_real(values: list[str]) -> str | None:
+    for v in values:
+        if v and v.lower() not in ("click to see map", "click here for directions"):
+            return v
+    return None
+
+
+def _apply_site_metadata(doc: TrialDoc, found: dict[str, list[str]], *,
+                         crop: str) -> None:
+    """Apply a scanned metadata block to the doc, filling only fields that
+    are still unset (so the first/preferred source wins)."""
+    def setif(attr: str, val: Any) -> None:
+        if val and getattr(doc, attr) is None:
+            setattr(doc, attr, val)
+
+    if "host" in found:
+        setif("cooperator", _first_real(found["host"]))
+    if "location" in found and not doc.sites:
+        doc.sites = [v for v in found["location"]
+                     if _first_real([v]) is not None]
+    if "county" in found:
+        setif("county", _first_real(found["county"]))
+    if "soil type" in found:
+        setif("soil_type", _first_real(found["soil type"]))
+    if "planting date" in found:
+        setif("planted_date", _iso_date(_first_real(found["planting date"])))
+    if "harvest date" in found:
+        setif("harvested_date", _iso_date(_first_real(found["harvest date"])))
+    if "tillage" in found:
+        setif("tillage", _first_real(found["tillage"]))
+    elif "spring" in found and crop != "wheat":
+        # Corn/soy: "Spring"/"Fall" are tillage operations. Wheat: those
+        # same labels are nitrogen rates — never tillage.
+        setif("tillage", _first_real(found["spring"]))
+    if "latitude" in found:
+        lat = _to_num(_first_real(found["latitude"]) or "")
+        if isinstance(lat, (int, float)):
+            setif("latitude", float(lat))
+    if "longitude" in found:
+        lon = _to_num(_first_real(found["longitude"]) or "")
+        if isinstance(lon, (int, float)):
+            setif("longitude", float(lon))
+
+
+def _assemble_traits(row: tuple, cm: dict[str, int]) -> str:
+    """Combine the herbicide-trait + GT (genetic trait, may spill across
+    cols) + seed-treatment columns into one traits string."""
+    bits: list[str] = []
+    # GT can spill from its col up to (but not including) the herb_trait col.
+    if "gt" in cm:
+        gt_start = cm["gt"]
+        gt_end = cm.get("herb_trait", cm.get("maturity", gt_start + 1))
+        gt_toks = [_txt(row[i]) for i in range(gt_start, gt_end)
+                   if i < len(row) and _txt(row[i])]
+        if gt_toks:
+            bits.append("GT:" + "".join(gt_toks))
+    if "herb_trait" in cm:
+        ht = _txt(row[cm["herb_trait"]]) if cm["herb_trait"] < len(row) else ""
+        if ht:
+            bits.append("HT:" + ht)
+    for k, lbl in (("ist", "IST"), ("st", "ST")):
+        if k in cm and cm[k] < len(row):
+            v = _txt(row[cm[k]])
+            if v:
+                bits.append(f"{lbl}:{v}")
+    return " ".join(bits)
+
+
+def _is_data_row(row: tuple, cm: dict[str, int]) -> bool:
+    company = _txt(row[0]) if len(row) > 0 else ""
+    name = _txt(row[1]) if len(row) > 1 else ""
+    if not company or not name:
+        return False
+    if _NONDATA_NAME.match(name) or _NONDATA_NAME.match(company):
+        return False
+    if _SECTION_COMPANY.match(company):
+        return False
+    # Must have a numeric yield (the caller checks plausibility per crop).
+    y = _to_num(row[cm["yield"]]) if "yield" in cm and cm["yield"] < len(row) else None
+    if not isinstance(y, (int, float)):
+        return False
+    return True
+
+
+def _plausible_yield(crop: str, y: float) -> bool:
+    if crop == "corn":
+        return 30 < y < 400
+    if crop == "soybeans":
+        return 10 < y < 150
+    if crop == "wheat":
+        return 20 < y < 200
+    return 0 < y < 500
+
+
+def parse_xlsx(content: bytes, doc: TrialDoc) -> None:
+    wb = openpyxl.load_workbook(io.BytesIO(content), data_only=True,
+                                read_only=True)
+    # The yield table is the first sheet whose first ~15 rows contain a
+    # Company/Name header.
+    data_ws = None
+    data_rows: list[tuple] = []
+    hdr_i = None
+    for name in wb.sheetnames:
+        rows = list(wb[name].iter_rows(values_only=True))
+        hi = _find_header_row(rows)
+        if hi is not None:
+            data_ws, data_rows, hdr_i = name, rows, hi
+            break
+    if data_ws is None or hdr_i is None:
+        raise ValueError("no Company/Name header row found in any sheet")
+
+    cm = _build_colmap(data_rows, hdr_i)
+    if "yield" not in cm:
+        raise ValueError("no Yield column located")
+
+    # Site metadata lives in (a) trailing columns of the data sheet
+    # (co-located with the results — most current) and/or (b) a dedicated
+    # "Trial Info" sheet. Read the trailing-column block FIRST so it wins,
+    # then let the info sheet fill any gaps. (Some regional files carry a
+    # stale info sheet — e.g. a 2025 table whose Trial Info sheet still
+    # shows 2021 planting dates — so trailing columns are preferred.)
+    # _scan_meta_labels self-locates the metadata-label column, so the
+    # per-site yield columns between the metric block and the metadata
+    # block aren't mis-read as labels.
+    _apply_site_metadata(doc, _scan_meta_labels(data_rows), crop=doc.crop)
+    info_sheet = next((s for s in wb.sheetnames
+                       if "trial info" in s.lower()
+                       or "trial information" in s.lower()), None)
+    if info_sheet:
+        _apply_site_metadata(
+            doc,
+            _scan_meta_labels(list(wb[info_sheet].iter_rows(values_only=True))),
+            crop=doc.crop)
+
+    results: list[dict] = []
+    for row in data_rows[hdr_i + 1:]:  # _is_data_row rejects any units row
+        if not _is_data_row(row, cm):
+            continue
+        y = _to_num(row[cm["yield"]])
+        if not isinstance(y, (int, float)) or not _plausible_yield(doc.crop, float(y)):
+            continue
+        brand = _txt(row[0])
+        product = _txt(row[1])
+        # Sanity: a numeric/blank brand is junk.
+        if not brand or brand.isdigit() or len(brand) <= 1:
+            continue
+        metrics: dict[str, Any] = {"Yield": y}
+        for key, label in (("moisture", "Moisture"), ("lodging", "Lodging"),
+                           ("height", "Height"), ("testwt", "Test Wt."),
+                           ("protein", "Protein"), ("oil", "Oil"),
+                           ("maturity", "Maturity"),
+                           ("yield_2yr", "Yield 2yr avg"),
+                           ("yield_3yr", "Yield 3yr avg")):
+            if key in cm and cm[key] < len(row):
+                v = _to_num(row[cm[key]])
+                if v is not None:
+                    metrics[label] = v
+        rec_rank = None
+        if "rank" in cm and cm["rank"] < len(row):
+            rk = _to_num(row[cm["rank"]])
+            if isinstance(rk, (int, float)):
+                rec_rank = int(rk)
+        results.append({
+            "rank": rec_rank,  # synthesized below if None
+            "brand": brand,
+            "product": product,
+            "traits": _assemble_traits(row, cm) or None,
+            "metrics": metrics,
+        })
+
+    # Synthesize rank by Yield DESC when the sheet didn't publish one
+    # (corn/soy list alphabetically). Wheat carries a Yield Rank already; if
+    # any row lacks one, re-rank every row so the ranks stay consistent.
+    if results and any(r["rank"] is None for r in results):
+        ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
+        for i, r in enumerate(ordered, start=1):
+            r["rank"] = i
+        results = ordered
+    else:
+        results.sort(key=lambda r: r["rank"])
+
+    doc.results = results
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(doc: TrialDoc) -> str:
+    crop_label = {"corn": "Corn", "soybeans": "Soybean",
+                  "wheat": "Wheat"}.get(doc.crop, doc.crop.title())
+    head: list[str] = [
+        f"# {crop_label} yield trial — University of Illinois VT, "
+        f"{doc.region}, IL {doc.year}",
+        "",
+        "- **Publisher:** University of Illinois Variety Testing "
+        "(independent cross-vendor trial)",
+        f"- **Crop:** {crop_label}",
+        "- **State:** IL",
+        f"- **Year:** {doc.year}",
+        f"- **Region:** {doc.region}",
+    ]
+    for label, val in (
+        ("Rotation", doc.rotation),
+        ("Previous crop", doc.previous_crop),
+        ("Cooperator / host", doc.cooperator),
+        ("County", doc.county),
+        ("Sites", ", ".join(doc.sites) if doc.sites else None),
+        ("Soil type", doc.soil_type),
+        ("Tillage", doc.tillage),
+        ("Planted", doc.planted_date),
+        ("Harvested", doc.harvested_date),
+        ("Row width", doc.row_width),
+    ):
+        if val:
+            head.append(f"- **{label}:** {val}")
+    head += [
+        f"- **Source XLSX:** {doc.xlsx_url}",
+        f"- **Index page:** {doc.index_url}",
+        "", "---", "",
+        "## Results (ranked by regional yield)", "",
+    ]
+
+    # Union of metric columns across all results, in first-seen order, so a
+    metric_keys: list[str] = []  # metric missing from row 1 isn't dropped.
+    for r in doc.results:
+        for k in (r.get("metrics") or {}):
+            if k not in metric_keys:
+                metric_keys.append(k)
+    headers = ["Rank", "Brand", "Product", "Traits"] + metric_keys
+    head.append("| " + " | ".join(headers) + " |")
+    head.append("|" + "|".join(["---"] * len(headers)) + "|")
+    for r in doc.results:
+        row = [str(r.get("rank") or "-"), r.get("brand") or "-",
+               r.get("product") or "-", r.get("traits") or "-"]
+        m = r.get("metrics") or {}
+        for k in metric_keys:
+            v = m.get(k)
+            row.append("-" if v is None else str(v))
+        head.append("| " + " | ".join(row) + " |")
+    head.append("")
+    return "\n".join(head)
+
+
+def write_doc(doc: TrialDoc, body_md: str) -> None:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    (CORPUS_DIR / f"{doc.source_key}.md").write_text(body_md, encoding="utf-8")
+    sidecar = {
+        "source": "illinois_vt_trials",
+        "source_key": doc.source_key,
+        "data_type": "trial",
+        "vendor": "University of Illinois",
+        "brand_aggregator": "University of Illinois Variety Testing publishes",
+        "brand": "University of Illinois VT",
+        "crop": doc.crop,
+        "state": "IL",
+        "state_abbrev": "il",
+        "year": doc.year,
+        "region": doc.region,
+        "rotation": doc.rotation,
+        "previous_crop": doc.previous_crop,
+        "cooperator": doc.cooperator,
+        "county": doc.county,
+        "sites": doc.sites or None,
+        "soil_type": doc.soil_type,
+        "tillage": doc.tillage,
+        "planted_date": doc.planted_date,
+        "harvested_date": doc.harvested_date,
+        "row_width": doc.row_width,
+        "latitude": doc.latitude,
+        "longitude": doc.longitude,
+        "results": doc.results,
+        "n_results": len(doc.results),
+        "source_urls": [doc.xlsx_url, doc.index_url],
+        "tos_note": TOS_NOTE,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    (CORPUS_DIR / f"{doc.source_key}.json").write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8")
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def process_doc(http: RateLimitedSession, doc: TrialDoc, *,
+                force: bool) -> str:
+    md_path = CORPUS_DIR / f"{doc.source_key}.md"
+    if md_path.exists() and not force:
+        return "skipped"
+    try:
+        r = http.get(doc.xlsx_url)
+        r.raise_for_status()
+        parse_xlsx(r.content, doc)
+    except Exception as exc:  # noqa: BLE001
+        log.error("%s parse failed (%s): %s", doc.source_key, doc.xlsx_url, exc)
+        return "failed"
+    if not doc.results:
+        log.warning("%s — no valid result rows parsed; skipping", doc.source_key)
+        return "empty"
+    write_doc(doc, render_markdown(doc))
+    log.info("%s written | %s %s %s | %d results | top: %s",
+             doc.source_key, doc.crop, doc.region, doc.year, len(doc.results),
+             doc.results[0]["brand"] + " " + doc.results[0]["product"]
+             if doc.results else "-")
+    return "written"
+
+
+def run(*, crops: set[str], years: set[int], limit: int | None,
+        force: bool) -> int:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    docs = discover_files(http, crops=crops, years=years)
+    log.info("discovered %d in-scope variety tables", len(docs))
+    if limit is not None:
+        docs = docs[:limit]
+
+    counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
+    for i, doc in enumerate(docs, start=1):
+        status = process_doc(http, doc, force=force)
+        counts[status] = counts.get(status, 0) + 1
+        if status != "written" or i <= 5 or i % 20 == 0:
+            log.info("[%d/%d] %s -> %s", i, len(docs), doc.source_key, status)
+
+    log.info("done: written=%d skipped=%d empty=%d failed=%d (of %d)",
+             counts["written"], counts["skipped"], counts["empty"],
+             counts["failed"], len(docs))
+    return 0 if counts["failed"] == 0 else 1
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.illinois_vt_trials",
+        description="Scrape University of Illinois Variety Testing "
+                    "cross-vendor yield trials (XLSX) into the corpus.")
+    p.add_argument("--year", type=int, default=None,
+                   help="Scrape a single harvest year (default: 2024+2025).")
+    p.add_argument("--include-old", action="store_true",
+                   help="Also scrape 2000–2023 (deferred by default).")
+    p.add_argument("--crop", default=None, choices=tuple(CROP_INDEX.keys()),
+                   help="Limit to one crop (corn / soybeans / wheat).")
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after processing N tables (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-fetch even if the markdown file already exists.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr)
+
+    crops = {args.crop} if args.crop else set(CROP_INDEX.keys())
+    if args.year is not None:
+        years = {args.year}
+    elif args.include_old:
+        years = set(range(OLD_YEAR_MIN, OLD_YEAR_MAX + 1)) | set(BASELINE_YEARS)
+    else:
+        years = set(BASELINE_YEARS)
+
+    return run(crops=crops, years=years, limit=args.limit, force=args.force)
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -0,0 +1,651 @@
+"""Iowa Crop Performance Tests (ICPT) — cross-vendor yield trials.
+
+Iowa State University / the Iowa Crop Improvement Association run the
+**Iowa Crop Performance Tests**, an independent, third-party variety
+trial program. Because the trial is publisher-neutral, a single
+district table ranks every ENTERED brand head-to-head (Pioneer, DEKALB,
+NuTech, Renk, Legacy, Epley Brothers, etc.) on identical plots.
+That makes it the highest-trust ``data_type: "trial"`` source
+in the corpus: unlike the vendor plot reports (Golden Harvest, LG,
+AgriGold, ProHarvest), no seed company controls the entry list or the
+agronomy, so there's no home-brand bias.
+
+Site shape (ASP.NET, server-rendered GridView tables — requests +
+BeautifulSoup, no JS / headless browser needed):
+
+  Corn:    https://www.croptesting.iastate.edu/corn/CornDistrict2.aspx
+  Soybean: https://www.croptesting.iastate.edu/Soybean/SoybeanDistrict2.aspx
+
+``...District2.aspx`` is the ONLY live district URL — the district
+(North / Central / South) is chosen *on that same page* via a
+``radLstDistrict`` radio (1/2/3) ASP.NET **postback**, NOT a separate
+URL (CornDistrict1/3.aspx 302-redirect away). Likewise the year
+(``cmbYear`` dropdown, 2025→2014) and the maturity season
+(``radListSeason``: 1=Early, 2=Full) are postbacks — there are no
+stable GET URLs for them. So we GET the page once to harvest the
+ASP.NET hidden fields (``__VIEWSTATE`` / ``__VIEWSTATEGENERATOR`` /
+``__VIEWSTATEENCRYPTED``), then POST the form with the desired
+year/district/season + ``btnFilter=Filter`` to drive the view.
+``CornDistrict.aspx`` (no number) is the 2013-and-older legacy page —
+out of scope.
+
+A district table is a multi-site aggregate: the GridView carries the
+district-wide Yield plus a West/East sub-region split (Wyld/Eyld) and a
+per-site yield column for each cooperator location in the district.
+That makes **one district × season × year table the natural document
+granularity** — one sidecar per ``(crop, year, district, season)``.
+
+GridView column → field map:
+  corn:    Company | Entry | RM | Herb Tech | Trait Package |
+           Yield | Yldp | Moist | Wyld | Eyld | <site> ...
+  soybean: Company | Entry | MG | Herb Tech |
+           Yield | WestYield | EastYield | <site> ...
+  Company       -> result.brand   (the seed COMPANY — critical)
+  Entry         -> result.product (variety / hybrid code)
+  Herb Tech +
+    Trait Package -> result.traits
+  everything else (RM/MG, Yield, Yldp, Moist, Wyld/Eyld, per-site)
+                  -> result.metrics  ("Yield" kept verbatim as the
+                     primary key the chunker's top-N picker reads)
+Rows are pre-sorted by Yield DESC on the page; we re-sort defensively
+and assign rank ourselves (the table has no rank column).
+
+We emit the SAME sidecar shape as ``agrigold_plot_reports`` /
+``lg_plot_reports`` / ``gh_plot_reports`` / ``proharvest_plots``
+(``results: [{rank, brand, product, traits, metrics}]``). The trial
+chunker routes ``iowa_icpt_trials`` through the shared
+``_render_gh_plot_chunk`` renderer with ``include_region`` enabled, so
+the district label is part of the embedded chunk.
+
+Output:
+  corpus/iowa_icpt_trials/<source_key>.md      LLM-visible body
+  corpus/iowa_icpt_trials/<source_key>.json    sidecar metadata
+
+source_key: ``icpt-<crop>-<year>-<district>-<season>``
+  e.g. ``icpt-corn-2025-north-early``, ``icpt-soybeans-2024-south-full``.
+
+Scope: 2024 + 2025 baseline. ``--include-old`` adds 2014–2023.
+
+robots/ToS: no robots.txt (the ASP.NET app 404s it); footer
+"Copyright (c) 1995-2016 Iowa State University ... All rights reserved"
+carries no automation clause. Public land-grant ICPT data, polite UA,
+low request rate. (See ``tos_note`` in the sidecar.)
+
+CLI:
+  python -m scrape.sources.iowa_icpt_trials --limit 4
+  python -m scrape.sources.iowa_icpt_trials --crop corn --year 2025
+  python -m scrape.sources.iowa_icpt_trials --include-old --force
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import random
+import re
+import sys
+import time
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import requests
+from bs4 import BeautifulSoup
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://www.croptesting.iastate.edu"
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "iowa_icpt_trials"
+
+REQ_INTERVAL_SEC = 2.0  # land-grant box; be polite, single-threaded
+
+BASELINE_YEARS = [2024, 2025]
+OLD_YEARS = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]
+
+TOS_NOTE = (
+    "Footer 'Copyright (c) ...ISU...All rights reserved' (no automation "
+    "clause, no robots.txt); public ICPT data; low request rate; attribute "
+    "Iowa Crop Performance Tests / ISU."
+)
+
+# crop -> (district-results page URL, RM/MG header label)
+CROPS: dict[str, tuple[str, str]] = {
+    "corn":     (f"{BASE}/corn/CornDistrict2.aspx", "RM"),
+    "soybeans": (f"{BASE}/Soybean/SoybeanDistrict2.aspx", "MG"),
+}
+
+# radLstDistrict radio value -> (slug, label)
+DISTRICTS: dict[str, tuple[str, str]] = {
+    "1": ("north", "North"),
+    "2": ("central", "Central"),
+    "3": ("south", "South"),
+}
+# radListSeason radio value -> (slug, label)
+SEASONS: dict[str, tuple[str, str]] = {
+    "1": ("early", "Early Season"),
+    "2": ("full", "Full Season"),
+}
+
+# ASP.NET control names
+C_YEAR = "ctl00$MainContent$cmbYear"
+C_DISTRICT = "ctl00$MainContent$radLstDistrict"
+C_SEASON = "ctl00$MainContent$radListSeason"
+C_SHOW = "ctl00$MainContent$radLstShowOptions"
+C_FILTER = "ctl00$MainContent$btnFilter"
+
+# GridView header labels that are NOT metric columns.
+BRAND_COL = "company"
+PRODUCT_COL = "entry"
+TRAIT_COLS = {"herb tech", "trait package"}
+
+log = logging.getLogger("scrape.iowa_icpt_trials")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    """Single-threaded rate-limited requests.Session (ASP.NET viewstate
+    flow is inherently sequential per page, so no global lock needed)."""
+
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(self, method: str, url: str, *, max_retries: int = 4,
+                timeout: float = 45.0, **kw: Any) -> requests.Response:
+        last_exc: Exception | None = None
+        resp: requests.Response | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc, resp = exc, None  # keep only the latest outcome
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s; retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(
+                    30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s; retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if resp is None:  # final attempt was a network error
+            assert last_exc is not None
+            raise last_exc
+        return resp  # final attempt was a 429/5xx response
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+    def post(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("POST", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class TrialResult:
+    rank: int | None = None
+    brand: str = ""
+    product: str = ""
+    traits: str = ""
+    metrics: dict[str, float | str | None] = field(default_factory=dict)
+
+
+@dataclass
+class DistrictTrial:
+    source_key: str
+    source_url: str
+    crop: str            # "corn" / "soybeans"
+    year: int
+    district_slug: str   # north / central / south
+    district_label: str  # North / Central / South
+    season_slug: str     # early / full
+    season_label: str    # Early Season / Full Season
+    sites: list[str] = field(default_factory=list)   # cooperator locations
+    experiment_mean: float | None = None
+    results: list[TrialResult] = field(default_factory=list)
+
+
+# --------------------------------------------------------------------- parse
+
+
+def _hidden_fields(soup: BeautifulSoup) -> dict[str, str]:
+    out: dict[str, str] = {}
+    for inp in soup.find_all("input", {"type": "hidden"}):
+        name = inp.get("name")
+        if name:
+            out[name] = inp.get("value") or ""
+    return out
+
+
+_NUM_RE = re.compile(r"^-?\d+(?:\.\d+)?$")
+
+
+def _to_num(s: str | None) -> float | int | None:
+    s = (s or "").strip()
+    if not s or s == "-" or not _NUM_RE.match(s):
+        return None
+    f = float(s)
+    return int(f) if f.is_integer() else f
+
+
+def _norm(s: str) -> str:
+    return re.sub(r"\s+", " ", (s or "").strip()).lower()
+
+
+def _grid_rows(soup: BeautifulSoup, table_id: str) -> list[list[str]]:
+    table = soup.find("table", {"id": table_id})
+    if table is None:
+        return []
+    rows: list[list[str]] = []
+    for tr in table.find_all("tr"):
+        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
+        if cells:
+            rows.append(cells)
+    return rows
+
+
+def _experiment_mean(soup: BeautifulSoup) -> float | None:
+    """Pull the district-wide 'Experiment Mean' Yield from the summary
+    GridView (second column of the 'Experiment Mean' row)."""
+    rows = _grid_rows(soup, "MainContent_gvDataSummary")
+    for r in rows:
+        if r and _norm(r[0]).startswith("experiment mean") and len(r) > 1:
+            return _to_num(r[1])  # type: ignore[return-value]
+    return None
+
+
+def parse_district_table(
+    soup: BeautifulSoup,
+    *,
+    rm_mg_label: str,
+) -> tuple[list[TrialResult], list[str], float | None]:
+    """Parse the ``MainContent_gvData`` GridView into ranked results.
+
+    Returns ``(results, site_columns, experiment_mean)``. Rows arrive
+    pre-sorted by Yield DESC; we re-sort by Yield DESC defensively and
+    assign rank ourselves (no rank column on the page)."""
+    rows = _grid_rows(soup, "MainContent_gvData")
+    if len(rows) < 2:
+        return [], [], None
+
+    header = rows[0]
+    hkeys = [_norm(h) for h in header]
+
+    # Locate the structural columns.
+    def find_col(*want: str) -> int | None:
+        for w in want:
+            for i, h in enumerate(hkeys):
+                if h == w:
+                    return i
+        return None
+
+    i_brand = find_col(BRAND_COL)
+    i_product = find_col(PRODUCT_COL)
+    i_traits = [i for i, h in enumerate(hkeys) if h in TRAIT_COLS]
+
+    # Identify the per-site (cooperator-location) yield columns. On the
+    # page they follow the West/East sub-region split (Wyld/Eyld /
+    # WestYield/EastYield), but we detect them by header rather than by
+    # position: any column that isn't brand/product/trait and whose
+    # header isn't a reserved metric label is treated as a site.
+    reserved_metric = {
+        _norm(rm_mg_label), "yield", "yldp", "yield pct", "yield %",
+        "moist", "wyld", "eyld", "westyield", "eastyield",
+    }
+    sites: list[str] = []
+    for i, h in enumerate(hkeys):
+        if i == i_brand or i == i_product or i in i_traits:
+            continue
+        if h and h not in reserved_metric:
+            sites.append(header[i])
+
+    skip = {i_brand, i_product, *i_traits}
+    metric_cols = [(header[i], i) for i in range(len(header)) if i not in skip and header[i]]
+
+    results: list[TrialResult] = []
+    for raw in rows[1:]:
+        # Pad short rows to header width defensively.
+        cells = raw + [""] * (len(header) - len(raw))
+
+        def cell(i: int | None) -> str:
+            return cells[i].strip() if i is not None and 0 <= i < len(cells) else ""
+
+        brand = cell(i_brand)
+        product = cell(i_product)
+        traits = " ".join(
+            t for t in (cell(i) for i in i_traits)
+            if t and _norm(t) != "none"
+        ).strip()
+
+        metrics: dict[str, float | str | None] = {}
+        for name, idx in metric_cols:
+            raw_val = cell(idx)
+            num = _to_num(raw_val)
+            if num is not None:
+                metrics[name] = num
+            elif raw_val and raw_val != "-":
+                metrics[name] = raw_val
+            # else: leave the column out (empty)
+
+        res = TrialResult(brand=brand, product=product, traits=traits, metrics=metrics)
+        if _row_ok(res):
+            results.append(res)
+
+    # Re-sort by Yield DESC (page is already sorted, but don't trust it),
+    # then assign rank. Rows with no numeric Yield sink to the bottom.
+    def _ysort(r: TrialResult) -> tuple[int, float]:
+        y = r.metrics.get("Yield")
+        if isinstance(y, (int, float)):
+            return (0, -float(y))
+        return (1, 0.0)
+
+    results.sort(key=_ysort)
+    for n, r in enumerate(results, start=1):
+        r.rank = n
+
+    return results, sites, _experiment_mean(soup)
+
+
+def _row_ok(r: TrialResult) -> bool:
+    """Per-row sanity gate. A sound entry has a real (non-numeric)
+    company/brand, a product code, and a plausible bu/a Yield. Drops
+    summary/blank rows and any leaked aggregate line."""
+    brand = (r.brand or "").strip()
+    product = (r.product or "").strip()
+    if not brand or brand.isdigit():
+        return False
+    if _norm(brand) in ("summary", "experiment mean", "minimum mean",
+                         "maximum mean", "lsd", "coefficient of variability"):
+        return False
+    if not product:
+        return False
+    y = r.metrics.get("Yield")
+    # Corn runs ~120-280 bu/a, soy ~30-90; gate generously but reject
+    # garbage / a moisture/RM value that leaked into the Yield slot.
+    if not isinstance(y, (int, float)) or not (10 < float(y) < 400):
+        return False
+    return True
+
+
+# --------------------------------------------------------------------- fetch
+
+
+def source_key_for(crop: str, year: int, district_slug: str, season_slug: str) -> str:
+    return f"icpt-{crop}-{year}-{district_slug}-{season_slug}"
+
+
+def fetch_view(
+    http: RateLimitedSession,
+    *,
+    crop: str,
+    year: int,
+    district: str,   # radio value "1"/"2"/"3"
+    season: str,     # radio value "1"/"2"
+) -> DistrictTrial | None:
+    """GET the district page (for viewstate), then POST the filter form
+    to switch to the requested year/district/season. Returns a parsed
+    DistrictTrial, or None if the table is empty for that combination."""
+    url, rm_mg_label = CROPS[crop]
+    district_slug, district_label = DISTRICTS[district]
+    season_slug, season_label = SEASONS[season]
+
+    seed = http.get(url)
+    seed.raise_for_status()
+    seed_soup = BeautifulSoup(seed.text, "html.parser")
+
+    payload = _hidden_fields(seed_soup)
+    payload[C_YEAR] = str(year)
+    payload[C_DISTRICT] = district
+    payload[C_SEASON] = season
+    payload[C_SHOW] = "yield"  # yield view carries Yield/Yldp/Moist + per-SITE yields
+    payload[C_FILTER] = "Filter"
+
+    resp = http.post(url, data=payload)
+    resp.raise_for_status()
+    soup = BeautifulSoup(resp.text, "html.parser")
+
+    results, sites, mean = parse_district_table(soup, rm_mg_label=rm_mg_label)
+    if not results:
+        return None
+
+    return DistrictTrial(
+        source_key=source_key_for(crop, year, district_slug, season_slug),
+        source_url=url,
+        crop=crop,
+        year=year,
+        district_slug=district_slug,
+        district_label=district_label,
+        season_slug=season_slug,
+        season_label=season_label,
+        sites=sites,
+        experiment_mean=mean,
+        results=results,
+    )
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(t: DistrictTrial) -> str:
+    crop_label = {"corn": "Corn", "soybeans": "Soybean"}.get(t.crop, t.crop.title())
+    head: list[str] = [
+        f"# {crop_label} yield trial — Iowa {t.district_label} District "
+        f"({t.season_label}), {t.year}",
+        "",
+        "- **Source:** Iowa Crop Performance Tests (independent third-party trial)",
+        "- **Publisher:** Iowa State University / Iowa Crop Improvement Association",
+        f"- **Crop:** {crop_label}",
+        "- **State:** IA",
+        f"- **District:** {t.district_label}",
+        f"- **Maturity season:** {t.season_label}",
+        f"- **Year:** {t.year}",
+    ]
+    if t.experiment_mean is not None:
+        head.append(f"- **Experiment mean yield:** {t.experiment_mean} bu/a")
+    if t.sites:
+        head.append(f"- **Cooperator sites:** {', '.join(t.sites)}")
+    head += [f"- **URL:** {t.source_url}", "", "---", ""]
+
+    # Union of metric columns across all results, in first-seen order, so a
+    metric_keys: list[str] = []  # cell that is empty in row 1 isn't dropped.
+    for r in t.results:
+        for k in r.metrics:
+            if k not in metric_keys:
+                metric_keys.append(k)
+
+    sections: list[str] = ["## Results (by yield, all brands)", ""]
+    headers = ["Rank", "Company", "Entry", "Traits"] + metric_keys
+    sections.append("| " + " | ".join(headers) + " |")
+    sections.append("|" + "|".join(["---"] * len(headers)) + "|")
+    for r in t.results:
+        row = [
+            str(r.rank) if r.rank is not None else "-",
+            r.brand or "-",
+            r.product or "-",
+            r.traits or "-",
+        ]
+        for k in metric_keys:
+            v = r.metrics.get(k)
+            row.append("-" if v is None else str(v))
+        sections.append("| " + " | ".join(row) + " |")
+    sections.append("")
+
+    # Compact top-N (at most 5) line for embedder signal.
+    top = [r for r in t.results if isinstance(r.metrics.get("Yield"), (int, float))][:5]
+    if top:
+        bits = [f"{r.product} ({r.brand}) {r.metrics['Yield']}" for r in top]
+        sections.append(f"Top {len(top)} by Yield: " + ", ".join(bits) + ".")
+        sections.append("")
+
+    return "\n".join(head + sections)
+
+
+# --------------------------------------------------------------------- write
+
+
+def write_trial(t: DistrictTrial, body_md: str) -> None:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    (CORPUS_DIR / f"{t.source_key}.md").write_text(body_md, encoding="utf-8")
+    sidecar = {
+        "source": "iowa_icpt_trials",
+        "source_key": t.source_key,
+        "data_type": "trial",
+        "vendor": "Iowa State University",
+        "brand_aggregator": "Iowa Crop Performance Tests publishes",
+        "brand": "Iowa Crop Performance Tests",
+        "crop": t.crop,
+        "state": "IA",
+        "state_abbrev": "ia",
+        "year": t.year,
+        "region": f"District {t.district_label}",
+        "district": t.district_label,
+        "season": t.season_label,
+        "cooperator_sites": t.sites,
+        "experiment_mean_yield": t.experiment_mean,
+        "results": [
+            {
+                "rank": r.rank,
+                "brand": r.brand,
+                "product": r.product,
+                "traits": r.traits,
+                "metrics": r.metrics,
+            }
+            for r in t.results
+        ],
+        "n_results": len(t.results),
+        "source_urls": [t.source_url],
+        "tos_note": TOS_NOTE,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    (CORPUS_DIR / f"{t.source_key}.json").write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def run(
+    *,
+    crops: set[str],
+    years: list[int],
+    limit: int | None,
+    force: bool,
+) -> int:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
+    processed = 0
+
+    targets: list[tuple[str, int, str, str]] = []
+    for crop in sorted(crops):
+        for year in years:
+            for district in DISTRICTS:        # 1/2/3
+                for season in SEASONS:        # 1/2
+                    targets.append((crop, year, district, season))
+
+    log.info("planned %d (crop x year x district x season) targets", len(targets))
+
+    for crop, year, district, season in targets:
+        if limit is not None and processed >= limit:
+            break
+        district_slug = DISTRICTS[district][0]
+        season_slug = SEASONS[season][0]
+        sk = source_key_for(crop, year, district_slug, season_slug)
+        md_path = CORPUS_DIR / f"{sk}.md"
+        if md_path.exists() and not force:
+            counts["skipped"] += 1
+            continue
+        processed += 1
+        try:
+            trial = fetch_view(http, crop=crop, year=year,
+                               district=district, season=season)
+        except Exception as exc:  # noqa: BLE001
+            counts["failed"] += 1
+            log.error("[%s] fetch failed: %s", sk, exc)
+            continue
+        if trial is None:
+            counts["empty"] += 1
+            log.info("[%s] empty table (no entries) — skipping", sk)
+            continue
+        write_trial(trial, render_markdown(trial))
+        counts["written"] += 1
+        log.info("[%s] written | %d entries | %d sites | brands=%d",
+                 sk, len(trial.results), len(trial.sites),
+                 len({r.brand for r in trial.results}))
+
+    log.info("done: written=%d skipped=%d empty=%d failed=%d (processed=%d)",
+             counts["written"], counts["skipped"], counts["empty"],
+             counts["failed"], processed)
+    return 0 if counts["failed"] == 0 else 1
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.iowa_icpt_trials",
+        description="Scrape Iowa Crop Performance Tests (ICPT) cross-vendor "
+                    "yield trials (corn + soybean district tables).",
+    )
+    p.add_argument("--year", type=int, default=None,
+                   choices=tuple(BASELINE_YEARS + OLD_YEARS),
+                   help="Limit to a single year (default: 2024+2025 baseline).")
+    p.add_argument("--include-old", action="store_true",
+                   help="Also scrape 2014-2023 (deferred by default).")
+    p.add_argument("--crop", default=None, choices=tuple(CROPS.keys()),
+                   help="Limit to one crop (default: both).")
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after N fetches, not counting skips (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-fetch even if the markdown file already exists.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr,
+    )
+    crops = {args.crop} if args.crop else set(CROPS.keys())
+    if args.year is not None:
+        years = [args.year]
+    elif args.include_old:
+        years = sorted(set(OLD_YEARS + BASELINE_YEARS))
+    else:
+        years = list(BASELINE_YEARS)
+    return run(crops=crops, years=years, limit=args.limit, force=args.force)
+
+
+if __name__ == "__main__":
+    sys.exit(main())