a54fac240f
Image rebuild (skip scrape) / build (push) Successful in 5m54s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
946 lines
36 KiB
Python
946 lines
36 KiB
Python
"""University of Illinois Variety Testing — cross-vendor yield trials.
|
||
|
||
The University of Illinois Crop Sciences Variety Testing program
|
||
(``vt.cropsci.illinois.edu``) is a long-running, independent land-grant
|
||
testing program. Seed companies pay an entry fee to enter hybrids /
|
||
varieties; UIUC plants them in replicated regional trials and publishes
|
||
the results. Because the program is third-party and cross-vendor, a
|
||
single regional table ranks Pioneer / Brevant / DEKALB / Channel /
|
||
Burrus / Stine / Viking and dozens of others head-to-head — the
|
||
legitimate, independent way to get major-brand performance the
|
||
single-vendor corpus can't scrape directly.
|
||
|
||
This is a ``data_type: "trial"`` source (NOT variety identity). It emits
|
||
the same per-site cross-vendor sidecar shape as ``gh_plot_reports`` /
|
||
``agrigold_plot_reports`` / ``proharvest_plots`` (``results: [{rank,
|
||
brand, product, traits, metrics}]``), so it falls through to the shared
|
||
``_render_gh_plot_chunk`` renderer in ``rag/chunk.py`` with no chunk.py
|
||
edit. The published table lists entries alphabetically by company, so we
|
||
**synthesize ``rank`` by sorting on Yield descending** (highest yield =
|
||
rank 1) per document.
|
||
|
||
Data layout (verified Nov 2025):
|
||
- Static XLSX (+ PDF) per region per year at WordPress upload URLs.
|
||
The month segment of the upload path varies (``/2025/11/``,
|
||
``/2025/12/``, ``/2025/07/`` for wheat), so we DISCOVER the .xlsx
|
||
hrefs by fetching the /corn/, /soybeans/, /wheat/ index pages — we
|
||
never guess URLs.
|
||
- Corn regional tables: ``Company | Name | IST1 | GT2(+spill) | HT3 |
|
||
Relative Maturity | Yield bu/a | Moisture % | Lodging | <per-site
|
||
cols> | 2-yr Avg | 3-yr Avg``. Per-site metadata lives in a separate
|
||
"Trial Info" sheet (regional tables) or in trailing columns
|
||
(single-site CFC tables).
|
||
- Soybean regional tables: ``COMPANY | NAME | Herbicide Trait1 | ST2 |
|
||
Yield bu/a | Maturity Date | Lodging | Height | <2yr/3yr Yield> |
|
||
Protein @13% | Oil @13%`` with the per-site metadata block in
|
||
trailing columns.
|
||
- Wheat regional means tables: ``Company | Variety | ST1 | Yield |
|
||
Yield Rank | Test wt. | Height | <per-site Yield/Test wt.> |
|
||
Maturity date | Jointing time | FHB Score | FHB Category``. Wheat
|
||
publishes its own Yield Rank, which we honor.
|
||
|
||
The variety table sits below a 2-4 row header band (a group-header row,
|
||
a column-name row, and a units row). Columns are positionally stable
|
||
within a sheet (multi-word brands like "Viking | Blue River" live in a
|
||
single COMPANY cell), so we anchor on the header row that carries
|
||
"Company" + "Name"/"Variety" and map the metric columns from the
|
||
column-name + units rows. The leftmost "Yield (bu/a)" column is the
|
||
**regional** yield — the primary metric.
|
||
|
||
Section markers interleaved in the table ("Early RM", "Any RM",
|
||
"Non-GMO Hybrids", "Early MG: 1.9-2.7", "Late MG:") and summary rows
|
||
("Average", "L.S.D 25% Level", "CV (%)") are skipped — a data row must
|
||
have a real company, a variety name, and a plausible numeric yield.
|
||
|
||
Rotation distinction: regional tables are corn-following-soybean (the
|
||
default rotation); "CFC" / "Corn Following Corn" single-site tables
|
||
(Monmouth, Urbana, DeKalb) get ``previous_crop="corn"`` and a distinct
|
||
``-cfc-`` source_key so they never collide with the regional table.
|
||
|
||
robots/ToS: no usage terms posted on the VT site (publicly-funded
|
||
land-grant; companies pay an entry fee, which doesn't restrict published
|
||
result reuse). Polite UA + light rate limit.
|
||
|
||
Output:
|
||
corpus/illinois_vt_trials/<source_key>.md LLM-visible body
|
||
corpus/illinois_vt_trials/<source_key>.json sidecar metadata
|
||
|
||
source_key: ``ilvt-<crop>-<year>-r<region>`` e.g. ``ilvt-corn-2025-r1``;
|
||
CFC single-site tables: ``ilvt-corn-2024-cfc-monmouth``.
|
||
|
||
CLI:
|
||
python -m scrape.sources.illinois_vt_trials --year 2025 --limit 2
|
||
python -m scrape.sources.illinois_vt_trials --crop corn --force
|
||
python -m scrape.sources.illinois_vt_trials --include-old --force
|
||
"""
|
||
|
||
from __future__ import annotations
|
||
|
||
import argparse
|
||
import io
|
||
import json
|
||
import logging
|
||
import os
|
||
import random
|
||
import re
|
||
import sys
|
||
import time
|
||
from dataclasses import dataclass, field
|
||
from datetime import date, datetime, timezone
|
||
from pathlib import Path
|
||
from typing import Any
|
||
from urllib.parse import urljoin
|
||
|
||
import openpyxl
|
||
import requests
|
||
from bs4 import BeautifulSoup
|
||
|
||
SCRAPER_VERSION = "0.1.0"
|
||
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
|
||
BASE = "https://vt.cropsci.illinois.edu"
|
||
|
||
TOS_NOTE = (
|
||
"No usage terms posted on UIUC VT site; publicly-funded land-grant "
|
||
"data; attribute University of Illinois Variety Testing."
|
||
)
|
||
|
||
BASELINE_YEARS = {2024, 2025}
|
||
OLD_YEAR_MIN = 2000
|
||
OLD_YEAR_MAX = 2023
|
||
|
||
# Index pages per crop. PLURAL "soybeans" is the corpus crop value.
|
||
CROP_INDEX = {
|
||
"corn": "/corn/",
|
||
"soybeans": "/soybeans/",
|
||
"wheat": "/wheat/",
|
||
}
|
||
|
||
REQ_INTERVAL_SEC = 1.0
|
||
|
||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
|
||
CORPUS_DIR = CORPUS_ROOT / "illinois_vt_trials"
|
||
|
||
log = logging.getLogger("scrape.illinois_vt_trials")
|
||
|
||
|
||
# --------------------------------------------------------------------- HTTP
|
||
|
||
|
||
class RateLimitedSession:
|
||
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
|
||
self.s = requests.Session()
|
||
self.s.headers["User-Agent"] = USER_AGENT
|
||
self.interval = interval
|
||
self._last = 0.0
|
||
|
||
def _wait(self) -> None:
|
||
delta = time.monotonic() - self._last
|
||
if delta < self.interval:
|
||
time.sleep(self.interval - delta)
|
||
self._last = time.monotonic()
|
||
|
||
def request(self, method: str, url: str, *, max_retries: int = 4,
|
||
timeout: float = 60.0, **kw: Any) -> requests.Response:
|
||
last_exc: Exception | None = None
|
||
for attempt in range(max_retries):
|
||
self._wait()
|
||
try:
|
||
resp = self.s.request(method, url, timeout=timeout, **kw)
|
||
except requests.RequestException as exc:
|
||
last_exc = exc
|
||
backoff = min(30.0, (2 ** attempt) + random.random())
|
||
log.warning("network error on %s %s: %s — retry in %.1fs",
|
||
method, url, exc, backoff)
|
||
time.sleep(backoff)
|
||
continue
|
||
if resp.status_code == 429 or 500 <= resp.status_code < 600:
|
||
ra = resp.headers.get("Retry-After")
|
||
backoff = float(ra) if (ra and ra.isdigit()) else min(
|
||
30.0, (2 ** attempt) + random.random())
|
||
log.warning("HTTP %d on %s %s — retry in %.1fs",
|
||
resp.status_code, method, url, backoff)
|
||
time.sleep(backoff)
|
||
continue
|
||
return resp
|
||
if last_exc:
|
||
raise last_exc
|
||
return resp # type: ignore[return-value]
|
||
|
||
def get(self, url: str, **kw: Any) -> requests.Response:
|
||
return self.request("GET", url, **kw)
|
||
|
||
|
||
# --------------------------------------------------------------------- model
|
||
|
||
|
||
@dataclass
|
||
class TrialDoc:
|
||
source_key: str
|
||
crop: str # corn / soybeans / wheat
|
||
year: int
|
||
region: str # e.g. "Region 1", "Monmouth CFC"
|
||
xlsx_url: str
|
||
index_url: str
|
||
|
||
rotation: str | None = None # "corn following soybean" / "corn following corn"
|
||
previous_crop: str | None = None # "corn" for CFC tables
|
||
cooperator: str | None = None # site host
|
||
county: str | None = None
|
||
soil_type: str | None = None
|
||
tillage: str | None = None
|
||
planted_date: str | None = None
|
||
harvested_date: str | None = None
|
||
row_width: str | None = None
|
||
latitude: float | None = None
|
||
longitude: float | None = None
|
||
sites: list[str] = field(default_factory=list)
|
||
|
||
results: list[dict] = field(default_factory=list)
|
||
|
||
|
||
# --------------------------------------------------------------------- discovery
|
||
|
||
|
||
# Year/Region extraction from a file name or upload path. The VT site has
|
||
# used many naming schemes across years, so try several.
|
||
_YEAR_RE = re.compile(r"(20\d{2})")
|
||
_REGION_NUM_RE = re.compile(r"region[\s_-]*([1-9])", re.I)
|
||
|
||
|
||
def _norm_url(href: str) -> str:
|
||
"""Resolve relative hrefs against BASE; also fold the legacy
|
||
``varietytesting.web.illinois.edu`` host (same WP install) to BASE so
|
||
older files download cleanly."""
|
||
href = href.strip()
|
||
if href.startswith("http"):
|
||
return href
|
||
return urljoin(BASE + "/", href.lstrip("/"))
|
||
|
||
|
||
def _file_year(url: str) -> int | None:
|
||
"""Harvest year of a file. Prefer the upload-path year segment
|
||
(``/uploads/2025/11/...`` or ``/uploads/corn/2021/...``); fall back to
|
||
the first 20xx in the file name."""
|
||
m = re.search(r"/uploads/(?:[a-z]+/)?(20\d{2})/", url)
|
||
if m:
|
||
return int(m.group(1))
|
||
fn = url.rsplit("/", 1)[-1]
|
||
m = _YEAR_RE.search(fn)
|
||
return int(m.group(1)) if m else None
|
||
|
||
|
||
def _classify_region(url: str) -> tuple[str, str | None, str | None] | None:
|
||
"""Return ``(region_label, rotation, previous_crop)`` for a file, or
|
||
None if it isn't a per-region/per-site VARIETY table we ingest.
|
||
|
||
We INCLUDE regional tables (Region 1-5, North/South/East/West) and
|
||
single-site CFC (corn-following-corn) tables. We EXCLUDE entry lists,
|
||
agronomic-characteristic sheets, FHB/scab summaries, and disease
|
||
tables (those aren't head-to-head yield rankings)."""
|
||
fn = url.rsplit("/", 1)[-1]
|
||
low = fn.lower()
|
||
|
||
# Exclusions — not cross-vendor yield ranking tables.
|
||
EXCLUDE = ("entry", "entries", "agronomic", "charateristic", # sic (site typo)
|
||
"characteristic", "scab", "fhb", "disease")
|
||
if any(tok in low for tok in EXCLUDE):
|
||
return None
|
||
|
||
# Numbered regions: "Region 1", "2025-Region-1", "northtbl20"
|
||
m = _REGION_NUM_RE.search(low)
|
||
if m:
|
||
return (f"Region {m.group(1)}", "corn following soybean", None)
|
||
|
||
# Named regions (corn 2022 + wheat use compass names).
|
||
for name, label in (("north", "North Region"), ("south", "South Region"),
|
||
("east", "East Region"), ("west", "West Region")):
|
||
if re.search(rf"\b{name}\b", low) or low.startswith(name + "tbl"):
|
||
return (label, "corn following soybean", None)
|
||
|
||
# Single-site Corn-Following-Corn tables (Monmouth / Urbana / DeKalb).
|
||
cfc = re.search(r"([a-z]+)[\s_-]*cfc", low)
|
||
if cfc or "cfc" in low:
|
||
site = (cfc.group(1).title() if cfc and cfc.group(1) else "CFC")
|
||
return (f"{site} CFC", "corn following corn", "corn")
|
||
|
||
# Wheat single-site summary tables (e.g. "2025-Urbana-Summary",
|
||
# "2024-Elkville-Table"). Capture the site name as the region.
|
||
m = re.search(r"20\d{2}[\s_-]+([a-z]+)[\s_-]+(?:summary|table)", low)
|
||
if m:
|
||
return (m.group(1).title(), None, None)
|
||
m = re.search(r"([a-z]+)[\s_-]+(?:summary|table)[\s_-]*20\d{2}", low)
|
||
if m and m.group(1) not in ("region", "regional"):
|
||
return (m.group(1).title(), None, None)
|
||
|
||
return None
|
||
|
||
|
||
def discover_files(http: RateLimitedSession, *, crops: set[str],
|
||
years: set[int]) -> list[TrialDoc]:
|
||
"""Fetch each crop index page, extract .xlsx hrefs, classify them into
|
||
per-region/per-site variety tables, and keep the ones in scope."""
|
||
out: list[TrialDoc] = []
|
||
seen_keys: set[str] = set()
|
||
for crop in sorted(crops):
|
||
index_path = CROP_INDEX.get(crop)
|
||
if not index_path:
|
||
log.warning("unknown crop %r, skipping", crop)
|
||
continue
|
||
index_url = BASE + index_path
|
||
log.info("GET %s", index_url)
|
||
r = http.get(index_url)
|
||
r.raise_for_status()
|
||
soup = BeautifulSoup(r.text, "html.parser")
|
||
xlsx_hrefs = [a["href"] for a in soup.find_all("a", href=True)
|
||
if a["href"].lower().endswith(".xlsx")]
|
||
# Dedupe while preserving order.
|
||
seen_href: set[str] = set()
|
||
for href in xlsx_hrefs:
|
||
url = _norm_url(href)
|
||
if url in seen_href:
|
||
continue
|
||
seen_href.add(url)
|
||
year = _file_year(url)
|
||
if year is None or year not in years:
|
||
continue
|
||
cls = _classify_region(url)
|
||
if cls is None:
|
||
continue
|
||
region, rotation, prev_crop = cls
|
||
# source_key: ilvt-<crop>-<year>-r<n> for numbered regions,
|
||
# else a slug of the region name.
|
||
mnum = _REGION_NUM_RE.search(url.rsplit("/", 1)[-1].lower())
|
||
if mnum and "cfc" not in region.lower():
|
||
region_slug = f"r{mnum.group(1)}"
|
||
else:
|
||
region_slug = re.sub(r"[^a-z0-9]+", "-",
|
||
region.lower()).strip("-")
|
||
sk = f"ilvt-{crop}-{year}-{region_slug}"
|
||
if sk in seen_keys:
|
||
# Two files map to the same key (e.g. a "-final" + a "-1"
|
||
# duplicate). Keep the first; log the collision.
|
||
log.info("duplicate source_key %s from %s — skipping dupe",
|
||
sk, url)
|
||
continue
|
||
seen_keys.add(sk)
|
||
out.append(TrialDoc(
|
||
source_key=sk, crop=crop, year=year, region=region,
|
||
xlsx_url=url, index_url=index_url,
|
||
rotation=rotation, previous_crop=prev_crop,
|
||
))
|
||
log.info(" %s: %d in-scope variety tables", crop,
|
||
sum(1 for d in out if d.crop == crop))
|
||
return out
|
||
|
||
|
||
# --------------------------------------------------------------------- parse
|
||
|
||
|
||
def _to_num(v: Any) -> float | int | None:
|
||
"""Coerce a cell to a number. Strips '*', commas; returns None for the
|
||
VT missing-value markers ('*.*', '-', '') and non-numeric text."""
|
||
if v is None:
|
||
return None
|
||
if isinstance(v, bool):
|
||
return None
|
||
if isinstance(v, (int, float)):
|
||
f = float(v)
|
||
return int(f) if f.is_integer() else f
|
||
s = str(v).strip()
|
||
if not s or s in ("*", "*.*", "-", "—", "."):
|
||
return None
|
||
s = s.replace(",", "")
|
||
if not re.match(r"^-?\d+(?:\.\d+)?$", s):
|
||
return None
|
||
f = float(s)
|
||
return int(f) if f.is_integer() else f
|
||
|
||
|
||
def _iso_date(v: Any) -> str | None:
|
||
if isinstance(v, datetime):
|
||
return v.date().isoformat()
|
||
if isinstance(v, date):
|
||
return v.isoformat()
|
||
if v is None:
|
||
return None
|
||
s = str(v).strip()
|
||
if not s or s.lower().startswith("did not"):
|
||
return None
|
||
# ISO YYYY-MM-DD (with optional trailing " 00:00:00" time).
|
||
m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T].*)?$", s)
|
||
if m:
|
||
yr, mo, dy = m.groups()
|
||
try:
|
||
return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
|
||
except ValueError:
|
||
return None
|
||
# US MM/DD/YYYY.
|
||
m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{2,4})$", s)
|
||
if m:
|
||
mo, dy, yr = m.groups()
|
||
if len(yr) == 2:
|
||
yr = "20" + yr
|
||
try:
|
||
return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
|
||
except ValueError:
|
||
return None
|
||
return None
|
||
|
||
|
||
def _txt(v: Any) -> str:
|
||
if v is None:
|
||
return ""
|
||
if isinstance(v, datetime):
|
||
return v.date().isoformat()
|
||
return str(v).strip()
|
||
|
||
|
||
def _norm(s: Any) -> str:
|
||
return re.sub(r"\s+", " ", _txt(s)).strip().lower().rstrip(".")
|
||
|
||
|
||
# Section markers / summary labels that are NOT data rows.
|
||
_NONDATA_NAME = re.compile(
|
||
r"^(average|avg\.?|l\.?s\.?d\.?|c\.?v\.?|coeff|mean|std|"
|
||
r"early mg|late mg|early rm|any rm|non-?gmo|conventional|"
|
||
r"public|check)\b", re.I)
|
||
_SECTION_COMPANY = re.compile(
|
||
r"^(early rm|any rm|late rm|early mg|late mg|non-?gmo|conventional|"
|
||
r"gmo hybrids?|hybrids?)\b", re.I)
|
||
|
||
|
||
def _find_header_row(rows: list[tuple]) -> int | None:
|
||
"""Index of the column-name row — the one carrying 'Company' (col 0)
|
||
and a 'Name' / 'Variety' (col 1)-ish header."""
|
||
for i, row in enumerate(rows[:15]):
|
||
c0 = _norm(row[0] if len(row) > 0 else "")
|
||
c1 = _norm(row[1] if len(row) > 1 else "")
|
||
if c0 == "company" and c1 in ("name", "variety"):
|
||
return i
|
||
return None
|
||
|
||
|
||
def _build_colmap(rows: list[tuple], hdr_i: int) -> dict[str, int]:
|
||
"""Map metric -> column index by merging the header band: the
|
||
column-name row (hdr_i), the row below it, and the group-header row
|
||
above it. Layouts vary — corn carries the units (bu/a, %) IN the
|
||
header row with the Yield/Moisture/Lodging labels in the group row
|
||
above; soy/wheat carry the units in the row below. We want the
|
||
REGIONAL (leftmost) Yield, not the per-site repeats, so we take the
|
||
leftmost yield-units column as the primary Yield.
|
||
|
||
Returns keys among: company, name, herb_trait, gt, ist, st, maturity,
|
||
yield, lodging, height, moisture, protein, oil, rank, testwt,
|
||
yield_2yr, yield_3yr.
|
||
"""
|
||
name_row = rows[hdr_i]
|
||
below_row = rows[hdr_i + 1] if hdr_i + 1 < len(rows) else ()
|
||
group_row = rows[hdr_i - 1] if hdr_i - 1 >= 0 else ()
|
||
|
||
def g(row: tuple, i: int) -> str:
|
||
return _norm(row[i]) if i < len(row) else ""
|
||
|
||
ncols = max(len(name_row), len(below_row), len(group_row))
|
||
|
||
def band(i: int) -> tuple[str, str, str]:
|
||
"""(group-above, header, below) normalized text for column i."""
|
||
return (g(group_row, i), g(name_row, i), g(below_row, i))
|
||
|
||
cm: dict[str, int] = {"company": 0, "name": 1}
|
||
|
||
# Identity / trait columns — these sit on the header (column-name) row.
|
||
for i in range(ncols):
|
||
nm = g(name_row, i)
|
||
if nm in ("herbicide trait1", "herbicide trait", "ht3", "ht"):
|
||
cm.setdefault("herb_trait", i)
|
||
elif nm in ("gt2", "gt"):
|
||
cm.setdefault("gt", i)
|
||
elif nm in ("ist1", "ist"):
|
||
cm.setdefault("ist", i)
|
||
elif nm in ("st1", "st2", "st"):
|
||
cm.setdefault("st", i)
|
||
elif nm in ("relative", "maturity", "relative maturity", "maturity date"):
|
||
cm.setdefault("maturity", i)
|
||
elif nm in ("yield rank", "rank"):
|
||
cm.setdefault("rank", i)
|
||
|
||
# Metric columns — match across the whole band. A column is a Yield
|
||
# column if any band row says "yield" OR carries a bu-per-acre unit.
|
||
yield_cols: list[int] = []
|
||
moisture_cols: list[int] = []
|
||
testwt_cols: list[int] = []
|
||
lodging_cols: list[int] = []
|
||
height_cols: list[int] = []
|
||
protein_cols: list[int] = []
|
||
oil_cols: list[int] = []
|
||
maturity_cols: list[int] = []
|
||
YIELD_UNITS = {"bu/a", "bu/ac", "bu/acre"}
|
||
for i in range(ncols):
|
||
gp, nm, bl = band(i)
|
||
cells = {gp, nm, bl}
|
||
if "yield" in cells or cells & YIELD_UNITS:
|
||
yield_cols.append(i)
|
||
if "moisture" in cells:
|
||
moisture_cols.append(i)
|
||
if {"test wt", "test weight"} & cells or "lb/bu" in cells:
|
||
testwt_cols.append(i)
|
||
if "lodging" in cells:
|
||
lodging_cols.append(i)
|
||
if "height" in cells:
|
||
height_cols.append(i)
|
||
if "protein" in cells:
|
||
protein_cols.append(i)
|
||
if "oil" in cells:
|
||
oil_cols.append(i)
|
||
if {"relative", "maturity", "relative maturity", "maturity date"} & cells:
|
||
maturity_cols.append(i)
|
||
|
||
if yield_cols:
|
||
cm["yield"] = yield_cols[0]
|
||
# 2yr / 3yr averages: yield-unit columns whose group header says avg.
|
||
for i in yield_cols[1:]:
|
||
grp = g(group_row, i)
|
||
if "2-yr" in grp or "2 yr" in grp or "2yr" in grp:
|
||
cm.setdefault("yield_2yr", i)
|
||
elif "3-yr" in grp or "3 yr" in grp or "3yr" in grp:
|
||
cm.setdefault("yield_3yr", i)
|
||
if moisture_cols:
|
||
cm["moisture"] = moisture_cols[0]
|
||
if testwt_cols:
|
||
cm["testwt"] = testwt_cols[0]
|
||
if lodging_cols:
|
||
cm["lodging"] = lodging_cols[0]
|
||
if height_cols:
|
||
cm["height"] = height_cols[0]
|
||
if protein_cols:
|
||
cm["protein"] = protein_cols[0]
|
||
if oil_cols:
|
||
cm["oil"] = oil_cols[0]
|
||
if "maturity" not in cm and maturity_cols:
|
||
cm["maturity"] = maturity_cols[0]
|
||
|
||
return cm
|
||
|
||
|
||
# The metadata block always carries these labels; we locate its starting
|
||
# column by finding where they appear, so the per-site yield columns that
|
||
# sit between the metric block and the metadata block don't get scanned as
|
||
# labels (they lead with numbers, not a label).
|
||
_META_LABELS = {
|
||
"location", "county", "site location", "host", "soil type",
|
||
"planting date", "harvest date", "nitrogen applied", "pesticides",
|
||
"tillage", "latitude", "longitude", "rainfall", "fungicide",
|
||
}
|
||
|
||
|
||
def _scan_meta_labels(src: list[tuple]) -> dict[str, list[str]]:
|
||
"""From a label/value block, build ``{normalized_label: [values...]}``.
|
||
|
||
First find the column where the metadata labels live (the leftmost
|
||
column that holds a known metadata label in some row); the label in
|
||
each row is the first non-empty cell AT OR AFTER that column, and the
|
||
values are the non-empty cells to its right. This skips the per-site
|
||
yield columns that can sit to the left of the metadata block."""
|
||
# Find the metadata-label column.
|
||
label_col: int | None = None
|
||
for row in src:
|
||
for i, c in enumerate(row):
|
||
if _norm(c) in _META_LABELS:
|
||
if label_col is None or i < label_col:
|
||
label_col = i
|
||
break
|
||
found: dict[str, list[str]] = {}
|
||
for row in src:
|
||
label = None
|
||
label_idx = None
|
||
start = label_col if label_col is not None else 0
|
||
for i in range(start, len(row)):
|
||
t = _txt(row[i])
|
||
if t:
|
||
label = t
|
||
label_idx = i
|
||
break
|
||
if label is None:
|
||
continue
|
||
key = _norm(label)
|
||
values = [_txt(c) for i, c in enumerate(row)
|
||
if i > (label_idx or 0) and _txt(c)]
|
||
if values and key not in found: # keep the first occurrence
|
||
found[key] = values
|
||
return found
|
||
|
||
|
||
def _first_real(values: list[str]) -> str | None:
|
||
for v in values:
|
||
if v and v.lower() not in ("click to see map", "click here for directions"):
|
||
return v
|
||
return None
|
||
|
||
|
||
def _apply_site_metadata(doc: TrialDoc, found: dict[str, list[str]], *,
|
||
crop: str) -> None:
|
||
"""Apply a scanned metadata block to the doc, filling only fields that
|
||
are still unset (so the first/preferred source wins)."""
|
||
def setif(attr: str, val: Any) -> None:
|
||
if val and getattr(doc, attr) is None:
|
||
setattr(doc, attr, val)
|
||
|
||
if "host" in found:
|
||
setif("cooperator", _first_real(found["host"]))
|
||
if "location" in found and not doc.sites:
|
||
doc.sites = [v for v in found["location"]
|
||
if v and v.lower() != "click to see map"]
|
||
if "county" in found:
|
||
setif("county", _first_real(found["county"]))
|
||
if "soil type" in found:
|
||
setif("soil_type", _first_real(found["soil type"]))
|
||
if "planting date" in found:
|
||
setif("planted_date", _iso_date(_first_real(found["planting date"])))
|
||
if "harvest date" in found:
|
||
setif("harvested_date", _iso_date(_first_real(found["harvest date"])))
|
||
if "tillage" in found:
|
||
setif("tillage", _first_real(found["tillage"]))
|
||
elif "spring" in found and crop != "wheat":
|
||
# Corn/soy: "Spring"/"Fall" are tillage operations. Wheat: those
|
||
# same labels are nitrogen rates — never tillage.
|
||
setif("tillage", _first_real(found["spring"]))
|
||
if "latitude" in found:
|
||
lat = _to_num(_first_real(found["latitude"]) or "")
|
||
if isinstance(lat, (int, float)):
|
||
setif("latitude", float(lat))
|
||
if "longitude" in found:
|
||
lon = _to_num(_first_real(found["longitude"]) or "")
|
||
if isinstance(lon, (int, float)):
|
||
setif("longitude", float(lon))
|
||
|
||
|
||
def _assemble_traits(row: tuple, cm: dict[str, int]) -> str:
|
||
"""Combine the herbicide-trait + GT (genetic trait, may spill across
|
||
cols) + seed-treatment columns into one traits string."""
|
||
bits: list[str] = []
|
||
# GT can spill from its col up to (but not including) the herb_trait col.
|
||
if "gt" in cm:
|
||
gt_start = cm["gt"]
|
||
gt_end = cm.get("herb_trait", cm.get("maturity", gt_start + 1))
|
||
gt_toks = [_txt(row[i]) for i in range(gt_start, gt_end)
|
||
if i < len(row) and _txt(row[i])]
|
||
if gt_toks:
|
||
bits.append("GT:" + "".join(gt_toks))
|
||
if "herb_trait" in cm:
|
||
ht = _txt(row[cm["herb_trait"]]) if cm["herb_trait"] < len(row) else ""
|
||
if ht:
|
||
bits.append("HT:" + ht)
|
||
for k, lbl in (("ist", "IST"), ("st", "ST")):
|
||
if k in cm and cm[k] < len(row):
|
||
v = _txt(row[cm[k]])
|
||
if v:
|
||
bits.append(f"{lbl}:{v}")
|
||
return " ".join(bits)
|
||
|
||
|
||
def _is_data_row(row: tuple, cm: dict[str, int]) -> bool:
|
||
company = _txt(row[0]) if len(row) > 0 else ""
|
||
name = _txt(row[1]) if len(row) > 1 else ""
|
||
if not company or not name:
|
||
return False
|
||
if _NONDATA_NAME.match(name) or _NONDATA_NAME.match(company):
|
||
return False
|
||
if _SECTION_COMPANY.match(company):
|
||
return False
|
||
# Must have a plausible numeric yield.
|
||
y = _to_num(row[cm["yield"]]) if "yield" in cm and cm["yield"] < len(row) else None
|
||
if not isinstance(y, (int, float)):
|
||
return False
|
||
return True
|
||
|
||
|
||
def _plausible_yield(crop: str, y: float) -> bool:
|
||
if crop == "corn":
|
||
return 30 < y < 400
|
||
if crop == "soybeans":
|
||
return 10 < y < 150
|
||
if crop == "wheat":
|
||
return 20 < y < 200
|
||
return 0 < y < 500
|
||
|
||
|
||
def parse_xlsx(content: bytes, doc: TrialDoc) -> None:
|
||
wb = openpyxl.load_workbook(io.BytesIO(content), data_only=True,
|
||
read_only=True)
|
||
# The yield table is the first sheet whose first ~15 rows contain a
|
||
# Company/Name header.
|
||
data_ws = None
|
||
data_rows: list[tuple] = []
|
||
hdr_i = None
|
||
for name in wb.sheetnames:
|
||
rows = list(wb[name].iter_rows(values_only=True))
|
||
hi = _find_header_row(rows)
|
||
if hi is not None:
|
||
data_ws, data_rows, hdr_i = name, rows, hi
|
||
break
|
||
if data_ws is None or hdr_i is None:
|
||
raise ValueError("no Company/Name header row found in any sheet")
|
||
|
||
cm = _build_colmap(data_rows, hdr_i)
|
||
if "yield" not in cm:
|
||
raise ValueError("no Yield column located")
|
||
|
||
# Site metadata lives in (a) trailing columns of the data sheet
|
||
# (co-located with the results — most current) and/or (b) a dedicated
|
||
# "Trial Info" sheet. Read the trailing-column block FIRST so it wins,
|
||
# then let the info sheet fill any gaps. (Some regional files carry a
|
||
# stale info sheet — e.g. a 2025 table whose Trial Info sheet still
|
||
# shows 2021 planting dates — so trailing columns are preferred.)
|
||
# _scan_meta_labels self-locates the metadata-label column, so the
|
||
# per-site yield columns between the metric block and the metadata
|
||
# block aren't mis-read as labels.
|
||
_apply_site_metadata(doc, _scan_meta_labels(data_rows), crop=doc.crop)
|
||
info_sheet = next((s for s in wb.sheetnames
|
||
if "trial info" in s.lower()
|
||
or "trial information" in s.lower()), None)
|
||
if info_sheet:
|
||
_apply_site_metadata(
|
||
doc,
|
||
_scan_meta_labels(list(wb[info_sheet].iter_rows(values_only=True))),
|
||
crop=doc.crop)
|
||
|
||
results: list[dict] = []
|
||
for row in data_rows[hdr_i + 2:]:
|
||
if not _is_data_row(row, cm):
|
||
continue
|
||
y = _to_num(row[cm["yield"]])
|
||
if not isinstance(y, (int, float)) or not _plausible_yield(doc.crop, float(y)):
|
||
continue
|
||
brand = _txt(row[0])
|
||
product = _txt(row[1])
|
||
# Sanity: a numeric/blank brand is junk.
|
||
if not brand or brand.isdigit() or len(brand) <= 1:
|
||
continue
|
||
metrics: dict[str, Any] = {"Yield": y}
|
||
for key, label in (("moisture", "Moisture"), ("lodging", "Lodging"),
|
||
("height", "Height"), ("testwt", "Test Wt."),
|
||
("protein", "Protein"), ("oil", "Oil"),
|
||
("maturity", "Maturity"),
|
||
("yield_2yr", "Yield 2yr avg"),
|
||
("yield_3yr", "Yield 3yr avg")):
|
||
if key in cm and cm[key] < len(row):
|
||
v = _to_num(row[cm[key]])
|
||
if v is not None:
|
||
metrics[label] = v
|
||
rec_rank = None
|
||
if "rank" in cm and cm["rank"] < len(row):
|
||
rk = _to_num(row[cm["rank"]])
|
||
if isinstance(rk, (int, float)):
|
||
rec_rank = int(rk)
|
||
results.append({
|
||
"rank": rec_rank, # synthesized below if None
|
||
"brand": brand,
|
||
"product": product,
|
||
"traits": _assemble_traits(row, cm) or None,
|
||
"metrics": metrics,
|
||
})
|
||
|
||
# Synthesize rank by Yield DESC when the sheet didn't publish one
|
||
# (corn/soy list alphabetically). Wheat carries Yield Rank already, but
|
||
# we re-rank only if any row is missing a rank to keep them consistent.
|
||
if results and any(r["rank"] is None for r in results):
|
||
ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
|
||
for i, r in enumerate(ordered, start=1):
|
||
r["rank"] = i
|
||
results = ordered
|
||
else:
|
||
results.sort(key=lambda r: r["rank"])
|
||
|
||
doc.results = results
|
||
|
||
|
||
# --------------------------------------------------------------------- render
|
||
|
||
|
||
def render_markdown(doc: TrialDoc) -> str:
|
||
crop_label = {"corn": "Corn", "soybeans": "Soybean",
|
||
"wheat": "Wheat"}.get(doc.crop, doc.crop.title())
|
||
head: list[str] = [
|
||
f"# {crop_label} yield trial — University of Illinois VT, "
|
||
f"{doc.region}, IL {doc.year}",
|
||
"",
|
||
"- **Publisher:** University of Illinois Variety Testing "
|
||
"(independent cross-vendor trial)",
|
||
f"- **Crop:** {crop_label}",
|
||
"- **State:** IL",
|
||
f"- **Year:** {doc.year}",
|
||
f"- **Region:** {doc.region}",
|
||
]
|
||
for label, val in (
|
||
("Rotation", doc.rotation),
|
||
("Previous crop", doc.previous_crop),
|
||
("Cooperator / host", doc.cooperator),
|
||
("County", doc.county),
|
||
("Sites", ", ".join(doc.sites) if doc.sites else None),
|
||
("Soil type", doc.soil_type),
|
||
("Tillage", doc.tillage),
|
||
("Planted", doc.planted_date),
|
||
("Harvested", doc.harvested_date),
|
||
("Row width", doc.row_width),
|
||
):
|
||
if val:
|
||
head.append(f"- **{label}:** {val}")
|
||
head += [
|
||
f"- **Source XLSX:** {doc.xlsx_url}",
|
||
f"- **Index page:** {doc.index_url}",
|
||
"", "---", "",
|
||
"## Results (ranked by regional yield)", "",
|
||
]
|
||
|
||
# Discover metric columns from the first result.
|
||
metric_keys: list[str] = []
|
||
for r in doc.results:
|
||
if r.get("metrics"):
|
||
metric_keys = list(r["metrics"].keys())
|
||
break
|
||
headers = ["Rank", "Brand", "Product", "Traits"] + metric_keys
|
||
head.append("| " + " | ".join(headers) + " |")
|
||
head.append("|" + "|".join(["---"] * len(headers)) + "|")
|
||
for r in doc.results:
|
||
row = [str(r.get("rank") or "-"), r.get("brand") or "-",
|
||
r.get("product") or "-", r.get("traits") or "-"]
|
||
m = r.get("metrics") or {}
|
||
for k in metric_keys:
|
||
v = m.get(k)
|
||
row.append("-" if v is None else str(v))
|
||
head.append("| " + " | ".join(row) + " |")
|
||
head.append("")
|
||
return "\n".join(head)
|
||
|
||
|
||
def write_doc(doc: TrialDoc, body_md: str) -> None:
|
||
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
|
||
(CORPUS_DIR / f"{doc.source_key}.md").write_text(body_md, encoding="utf-8")
|
||
sidecar = {
|
||
"source": "illinois_vt_trials",
|
||
"source_key": doc.source_key,
|
||
"data_type": "trial",
|
||
"vendor": "University of Illinois",
|
||
"brand_aggregator": "University of Illinois Variety Testing publishes",
|
||
"brand": "University of Illinois VT",
|
||
"crop": doc.crop,
|
||
"state": "IL",
|
||
"state_abbrev": "il",
|
||
"year": doc.year,
|
||
"region": doc.region,
|
||
"rotation": doc.rotation,
|
||
"previous_crop": doc.previous_crop,
|
||
"cooperator": doc.cooperator,
|
||
"county": doc.county,
|
||
"sites": doc.sites or None,
|
||
"soil_type": doc.soil_type,
|
||
"tillage": doc.tillage,
|
||
"planted_date": doc.planted_date,
|
||
"harvested_date": doc.harvested_date,
|
||
"row_width": doc.row_width,
|
||
"latitude": doc.latitude,
|
||
"longitude": doc.longitude,
|
||
"results": doc.results,
|
||
"n_results": len(doc.results),
|
||
"source_urls": [doc.xlsx_url, doc.index_url],
|
||
"tos_note": TOS_NOTE,
|
||
"fetched_at": datetime.now(timezone.utc).isoformat(),
|
||
"scraper_version": SCRAPER_VERSION,
|
||
}
|
||
(CORPUS_DIR / f"{doc.source_key}.json").write_text(
|
||
json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
|
||
encoding="utf-8")
|
||
|
||
|
||
# --------------------------------------------------------------------- pipeline
|
||
|
||
|
||
def process_doc(http: RateLimitedSession, doc: TrialDoc, *,
|
||
force: bool) -> str:
|
||
md_path = CORPUS_DIR / f"{doc.source_key}.md"
|
||
if md_path.exists() and not force:
|
||
return "skipped"
|
||
try:
|
||
r = http.get(doc.xlsx_url)
|
||
r.raise_for_status()
|
||
parse_xlsx(r.content, doc)
|
||
except Exception as exc: # noqa: BLE001
|
||
log.error("%s parse failed (%s): %s", doc.source_key, doc.xlsx_url, exc)
|
||
return "failed"
|
||
if not doc.results:
|
||
log.warning("%s — no valid result rows parsed; skipping", doc.source_key)
|
||
return "empty"
|
||
write_doc(doc, render_markdown(doc))
|
||
log.info("%s written | %s %s %s | %d results | top: %s",
|
||
doc.source_key, doc.crop, doc.region, doc.year, len(doc.results),
|
||
doc.results[0]["brand"] + " " + doc.results[0]["product"]
|
||
if doc.results else "-")
|
||
return "written"
|
||
|
||
|
||
def run(*, crops: set[str], years: set[int], limit: int | None,
|
||
force: bool) -> int:
|
||
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
|
||
http = RateLimitedSession()
|
||
docs = discover_files(http, crops=crops, years=years)
|
||
log.info("discovered %d in-scope variety tables", len(docs))
|
||
if limit is not None:
|
||
docs = docs[:limit]
|
||
|
||
counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
|
||
for i, doc in enumerate(docs, start=1):
|
||
status = process_doc(http, doc, force=force)
|
||
counts[status] = counts.get(status, 0) + 1
|
||
if status != "written" or i <= 5 or i % 20 == 0:
|
||
log.info("[%d/%d] %s -> %s", i, len(docs), doc.source_key, status)
|
||
|
||
log.info("done: written=%d skipped=%d empty=%d failed=%d (of %d)",
|
||
counts["written"], counts["skipped"], counts["empty"],
|
||
counts["failed"], len(docs))
|
||
return 0 if counts["failed"] == 0 else 1
|
||
|
||
|
||
# --------------------------------------------------------------------- CLI
|
||
|
||
|
||
def _build_argparser() -> argparse.ArgumentParser:
|
||
p = argparse.ArgumentParser(
|
||
prog="scrape.sources.illinois_vt_trials",
|
||
description="Scrape University of Illinois Variety Testing "
|
||
"cross-vendor yield trials (XLSX) into the corpus.")
|
||
p.add_argument("--year", type=int, default=None,
|
||
help="Scrape a single harvest year (default: 2024+2025).")
|
||
p.add_argument("--include-old", action="store_true",
|
||
help="Also scrape 2000–2023 (deferred by default).")
|
||
p.add_argument("--crop", default=None, choices=tuple(CROP_INDEX.keys()),
|
||
help="Limit to one crop (corn / soybeans / wheat).")
|
||
p.add_argument("--limit", type=int, default=None,
|
||
help="Stop after processing N tables (default: all).")
|
||
p.add_argument("--force", action="store_true",
|
||
help="Re-fetch even if the markdown file already exists.")
|
||
p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
|
||
return p
|
||
|
||
|
||
def main(argv: list[str] | None = None) -> int:
|
||
args = _build_argparser().parse_args(argv)
|
||
logging.basicConfig(
|
||
level=args.log_level.upper(),
|
||
format="%(asctime)s %(levelname)s %(name)s %(message)s",
|
||
stream=sys.stderr)
|
||
|
||
crops = {args.crop} if args.crop else set(CROP_INDEX.keys())
|
||
if args.year is not None:
|
||
years = {args.year}
|
||
elif args.include_old:
|
||
years = set(range(OLD_YEAR_MIN, OLD_YEAR_MAX + 1)) | BASELINE_YEARS
|
||
else:
|
||
years = set(BASELINE_YEARS)
|
||
|
||
return run(crops=crops, years=years, limit=args.limit, force=args.force)
|
||
|
||
|
||
if __name__ == "__main__":
|
||
sys.exit(main())
|