Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)
Independent third-party performance data — land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker). - illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct. - iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season. - ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025. 91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois — true negatives, not parse gaps.) - rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged. - requirements.txt: openpyxl (Illinois XLSX; scrape-time only). - sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials). Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.
This commit is contained in:
@@ -0,0 +1,945 @@
|
||||
"""University of Illinois Variety Testing — cross-vendor yield trials.
|
||||
|
||||
The University of Illinois Crop Sciences Variety Testing program
|
||||
(``vt.cropsci.illinois.edu``) is a long-running, independent land-grant
|
||||
testing program. Seed companies pay an entry fee to enter hybrids /
|
||||
varieties; UIUC plants them in replicated regional trials and publishes
|
||||
the results. Because the program is third-party and cross-vendor, a
|
||||
single regional table ranks Pioneer / Brevant / DEKALB / Channel /
|
||||
Burrus / Stine / Viking and dozens of others head-to-head — the
|
||||
legitimate, independent way to get major-brand performance the
|
||||
single-vendor corpus can't scrape directly.
|
||||
|
||||
This is a ``data_type: "trial"`` source (NOT variety identity). It emits
|
||||
the same per-site cross-vendor sidecar shape as ``gh_plot_reports`` /
|
||||
``agrigold_plot_reports`` / ``proharvest_plots`` (``results: [{rank,
|
||||
brand, product, traits, metrics}]``), so it falls through to the shared
|
||||
``_render_gh_plot_chunk`` renderer in ``rag/chunk.py`` with no chunk.py
|
||||
edit. The published table lists entries alphabetically by company, so we
|
||||
**synthesize ``rank`` by sorting on Yield descending** (highest yield =
|
||||
rank 1) per document.
|
||||
|
||||
Data layout (verified Nov 2025):
|
||||
- Static XLSX (+ PDF) per region per year at WordPress upload URLs.
|
||||
The month segment of the upload path varies (``/2025/11/``,
|
||||
``/2025/12/``, ``/2025/07/`` for wheat), so we DISCOVER the .xlsx
|
||||
hrefs by fetching the /corn/, /soybeans/, /wheat/ index pages — we
|
||||
never guess URLs.
|
||||
- Corn regional tables: ``Company | Name | IST1 | GT2(+spill) | HT3 |
|
||||
Relative Maturity | Yield bu/a | Moisture % | Lodging | <per-site
|
||||
cols> | 2-yr Avg | 3-yr Avg``. Per-site metadata lives in a separate
|
||||
"Trial Info" sheet (regional tables) or in trailing columns
|
||||
(single-site CFC tables).
|
||||
- Soybean regional tables: ``COMPANY | NAME | Herbicide Trait1 | ST2 |
|
||||
Yield bu/a | Maturity Date | Lodging | Height | <2yr/3yr Yield> |
|
||||
Protein @13% | Oil @13%`` with the per-site metadata block in
|
||||
trailing columns.
|
||||
- Wheat regional means tables: ``Company | Variety | ST1 | Yield |
|
||||
Yield Rank | Test wt. | Height | <per-site Yield/Test wt.> |
|
||||
Maturity date | Jointing time | FHB Score | FHB Category``. Wheat
|
||||
publishes its own Yield Rank, which we honor.
|
||||
|
||||
The variety table sits below a 2-4 row header band (a group-header row,
|
||||
a column-name row, and a units row). Columns are positionally stable
|
||||
within a sheet (multi-word brands like "Viking | Blue River" live in a
|
||||
single COMPANY cell), so we anchor on the header row that carries
|
||||
"Company" + "Name"/"Variety" and map the metric columns from the
|
||||
column-name + units rows. The leftmost "Yield (bu/a)" column is the
|
||||
**regional** yield — the primary metric.
|
||||
|
||||
Section markers interleaved in the table ("Early RM", "Any RM",
|
||||
"Non-GMO Hybrids", "Early MG: 1.9-2.7", "Late MG:") and summary rows
|
||||
("Average", "L.S.D 25% Level", "CV (%)") are skipped — a data row must
|
||||
have a real company, a variety name, and a plausible numeric yield.
|
||||
|
||||
Rotation distinction: regional tables are corn-following-soybean (the
|
||||
default rotation); "CFC" / "Corn Following Corn" single-site tables
|
||||
(Monmouth, Urbana, DeKalb) get ``previous_crop="corn"`` and a distinct
|
||||
``-cfc-`` source_key so they never collide with the regional table.
|
||||
|
||||
robots/ToS: no usage terms posted on the VT site (publicly-funded
|
||||
land-grant; companies pay an entry fee, which doesn't restrict published
|
||||
result reuse). Polite UA + light rate limit.
|
||||
|
||||
Output:
|
||||
corpus/illinois_vt_trials/<source_key>.md LLM-visible body
|
||||
corpus/illinois_vt_trials/<source_key>.json sidecar metadata
|
||||
|
||||
source_key: ``ilvt-<crop>-<year>-r<region>`` e.g. ``ilvt-corn-2025-r1``;
|
||||
CFC single-site tables: ``ilvt-corn-2024-cfc-monmouth``.
|
||||
|
||||
CLI:
|
||||
python -m scrape.sources.illinois_vt_trials --year 2025 --limit 2
|
||||
python -m scrape.sources.illinois_vt_trials --crop corn --force
|
||||
python -m scrape.sources.illinois_vt_trials --include-old --force
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import io
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import date, datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
from urllib.parse import urljoin
|
||||
|
||||
import openpyxl
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
SCRAPER_VERSION = "0.1.0"
|
||||
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
|
||||
BASE = "https://vt.cropsci.illinois.edu"
|
||||
|
||||
TOS_NOTE = (
|
||||
"No usage terms posted on UIUC VT site; publicly-funded land-grant "
|
||||
"data; attribute University of Illinois Variety Testing."
|
||||
)
|
||||
|
||||
BASELINE_YEARS = {2024, 2025}
|
||||
OLD_YEAR_MIN = 2000
|
||||
OLD_YEAR_MAX = 2023
|
||||
|
||||
# Index pages per crop. PLURAL "soybeans" is the corpus crop value.
|
||||
CROP_INDEX = {
|
||||
"corn": "/corn/",
|
||||
"soybeans": "/soybeans/",
|
||||
"wheat": "/wheat/",
|
||||
}
|
||||
|
||||
REQ_INTERVAL_SEC = 1.0
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||||
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
|
||||
CORPUS_DIR = CORPUS_ROOT / "illinois_vt_trials"
|
||||
|
||||
log = logging.getLogger("scrape.illinois_vt_trials")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- HTTP
|
||||
|
||||
|
||||
class RateLimitedSession:
|
||||
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
|
||||
self.s = requests.Session()
|
||||
self.s.headers["User-Agent"] = USER_AGENT
|
||||
self.interval = interval
|
||||
self._last = 0.0
|
||||
|
||||
def _wait(self) -> None:
|
||||
delta = time.monotonic() - self._last
|
||||
if delta < self.interval:
|
||||
time.sleep(self.interval - delta)
|
||||
self._last = time.monotonic()
|
||||
|
||||
def request(self, method: str, url: str, *, max_retries: int = 4,
|
||||
timeout: float = 60.0, **kw: Any) -> requests.Response:
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(max_retries):
|
||||
self._wait()
|
||||
try:
|
||||
resp = self.s.request(method, url, timeout=timeout, **kw)
|
||||
except requests.RequestException as exc:
|
||||
last_exc = exc
|
||||
backoff = min(30.0, (2 ** attempt) + random.random())
|
||||
log.warning("network error on %s %s: %s — retry in %.1fs",
|
||||
method, url, exc, backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
if resp.status_code == 429 or 500 <= resp.status_code < 600:
|
||||
ra = resp.headers.get("Retry-After")
|
||||
backoff = float(ra) if (ra and ra.isdigit()) else min(
|
||||
30.0, (2 ** attempt) + random.random())
|
||||
log.warning("HTTP %d on %s %s — retry in %.1fs",
|
||||
resp.status_code, method, url, backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
return resp
|
||||
if last_exc:
|
||||
raise last_exc
|
||||
return resp # type: ignore[return-value]
|
||||
|
||||
def get(self, url: str, **kw: Any) -> requests.Response:
|
||||
return self.request("GET", url, **kw)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- model
|
||||
|
||||
|
||||
@dataclass
|
||||
class TrialDoc:
|
||||
source_key: str
|
||||
crop: str # corn / soybeans / wheat
|
||||
year: int
|
||||
region: str # e.g. "Region 1", "Monmouth CFC"
|
||||
xlsx_url: str
|
||||
index_url: str
|
||||
|
||||
rotation: str | None = None # "corn following soybean" / "corn following corn"
|
||||
previous_crop: str | None = None # "corn" for CFC tables
|
||||
cooperator: str | None = None # site host
|
||||
county: str | None = None
|
||||
soil_type: str | None = None
|
||||
tillage: str | None = None
|
||||
planted_date: str | None = None
|
||||
harvested_date: str | None = None
|
||||
row_width: str | None = None
|
||||
latitude: float | None = None
|
||||
longitude: float | None = None
|
||||
sites: list[str] = field(default_factory=list)
|
||||
|
||||
results: list[dict] = field(default_factory=list)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- discovery
|
||||
|
||||
|
||||
# Year/Region extraction from a file name or upload path. The VT site has
|
||||
# used many naming schemes across years, so try several.
|
||||
_YEAR_RE = re.compile(r"(20\d{2})")
|
||||
_REGION_NUM_RE = re.compile(r"region[\s_-]*([1-9])", re.I)
|
||||
|
||||
|
||||
def _norm_url(href: str) -> str:
|
||||
"""Resolve relative hrefs against BASE; also fold the legacy
|
||||
``varietytesting.web.illinois.edu`` host (same WP install) to BASE so
|
||||
older files download cleanly."""
|
||||
href = href.strip()
|
||||
if href.startswith("http"):
|
||||
return href
|
||||
return urljoin(BASE + "/", href.lstrip("/"))
|
||||
|
||||
|
||||
def _file_year(url: str) -> int | None:
|
||||
"""Harvest year of a file. Prefer the upload-path year segment
|
||||
(``/uploads/2025/11/...`` or ``/uploads/corn/2021/...``); fall back to
|
||||
the first 20xx in the file name."""
|
||||
m = re.search(r"/uploads/(?:[a-z]+/)?(20\d{2})/", url)
|
||||
if m:
|
||||
return int(m.group(1))
|
||||
fn = url.rsplit("/", 1)[-1]
|
||||
m = _YEAR_RE.search(fn)
|
||||
return int(m.group(1)) if m else None
|
||||
|
||||
|
||||
def _classify_region(url: str) -> tuple[str, str | None, str | None] | None:
|
||||
"""Return ``(region_label, rotation, previous_crop)`` for a file, or
|
||||
None if it isn't a per-region/per-site VARIETY table we ingest.
|
||||
|
||||
We INCLUDE regional tables (Region 1-5, North/South/East/West) and
|
||||
single-site CFC (corn-following-corn) tables. We EXCLUDE entry lists,
|
||||
agronomic-characteristic sheets, FHB/scab summaries, and disease
|
||||
tables (those aren't head-to-head yield rankings)."""
|
||||
fn = url.rsplit("/", 1)[-1]
|
||||
low = fn.lower()
|
||||
|
||||
# Exclusions — not cross-vendor yield ranking tables.
|
||||
EXCLUDE = ("entry", "entries", "agronomic", "charateristic", # sic (site typo)
|
||||
"characteristic", "scab", "fhb", "disease")
|
||||
if any(tok in low for tok in EXCLUDE):
|
||||
return None
|
||||
|
||||
# Numbered regions: "Region 1", "2025-Region-1", "northtbl20"
|
||||
m = _REGION_NUM_RE.search(low)
|
||||
if m:
|
||||
return (f"Region {m.group(1)}", "corn following soybean", None)
|
||||
|
||||
# Named regions (corn 2022 + wheat use compass names).
|
||||
for name, label in (("north", "North Region"), ("south", "South Region"),
|
||||
("east", "East Region"), ("west", "West Region")):
|
||||
if re.search(rf"\b{name}\b", low) or low.startswith(name + "tbl"):
|
||||
return (label, "corn following soybean", None)
|
||||
|
||||
# Single-site Corn-Following-Corn tables (Monmouth / Urbana / DeKalb).
|
||||
cfc = re.search(r"([a-z]+)[\s_-]*cfc", low)
|
||||
if cfc or "cfc" in low:
|
||||
site = (cfc.group(1).title() if cfc and cfc.group(1) else "CFC")
|
||||
return (f"{site} CFC", "corn following corn", "corn")
|
||||
|
||||
# Wheat single-site summary tables (e.g. "2025-Urbana-Summary",
|
||||
# "2024-Elkville-Table"). Capture the site name as the region.
|
||||
m = re.search(r"20\d{2}[\s_-]+([a-z]+)[\s_-]+(?:summary|table)", low)
|
||||
if m:
|
||||
return (m.group(1).title(), None, None)
|
||||
m = re.search(r"([a-z]+)[\s_-]+(?:summary|table)[\s_-]*20\d{2}", low)
|
||||
if m and m.group(1) not in ("region", "regional"):
|
||||
return (m.group(1).title(), None, None)
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def discover_files(http: RateLimitedSession, *, crops: set[str],
|
||||
years: set[int]) -> list[TrialDoc]:
|
||||
"""Fetch each crop index page, extract .xlsx hrefs, classify them into
|
||||
per-region/per-site variety tables, and keep the ones in scope."""
|
||||
out: list[TrialDoc] = []
|
||||
seen_keys: set[str] = set()
|
||||
for crop in sorted(crops):
|
||||
index_path = CROP_INDEX.get(crop)
|
||||
if not index_path:
|
||||
log.warning("unknown crop %r, skipping", crop)
|
||||
continue
|
||||
index_url = BASE + index_path
|
||||
log.info("GET %s", index_url)
|
||||
r = http.get(index_url)
|
||||
r.raise_for_status()
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
xlsx_hrefs = [a["href"] for a in soup.find_all("a", href=True)
|
||||
if a["href"].lower().endswith(".xlsx")]
|
||||
# Dedupe while preserving order.
|
||||
seen_href: set[str] = set()
|
||||
for href in xlsx_hrefs:
|
||||
url = _norm_url(href)
|
||||
if url in seen_href:
|
||||
continue
|
||||
seen_href.add(url)
|
||||
year = _file_year(url)
|
||||
if year is None or year not in years:
|
||||
continue
|
||||
cls = _classify_region(url)
|
||||
if cls is None:
|
||||
continue
|
||||
region, rotation, prev_crop = cls
|
||||
# source_key: ilvt-<crop>-<year>-r<n> for numbered regions,
|
||||
# else a slug of the region name.
|
||||
mnum = _REGION_NUM_RE.search(url.rsplit("/", 1)[-1].lower())
|
||||
if mnum and "cfc" not in region.lower():
|
||||
region_slug = f"r{mnum.group(1)}"
|
||||
else:
|
||||
region_slug = re.sub(r"[^a-z0-9]+", "-",
|
||||
region.lower()).strip("-")
|
||||
sk = f"ilvt-{crop}-{year}-{region_slug}"
|
||||
if sk in seen_keys:
|
||||
# Two files map to the same key (e.g. a "-final" + a "-1"
|
||||
# duplicate). Keep the first; log the collision.
|
||||
log.info("duplicate source_key %s from %s — skipping dupe",
|
||||
sk, url)
|
||||
continue
|
||||
seen_keys.add(sk)
|
||||
out.append(TrialDoc(
|
||||
source_key=sk, crop=crop, year=year, region=region,
|
||||
xlsx_url=url, index_url=index_url,
|
||||
rotation=rotation, previous_crop=prev_crop,
|
||||
))
|
||||
log.info(" %s: %d in-scope variety tables", crop,
|
||||
sum(1 for d in out if d.crop == crop))
|
||||
return out
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- parse
|
||||
|
||||
|
||||
def _to_num(v: Any) -> float | int | None:
|
||||
"""Coerce a cell to a number. Strips '*', commas; returns None for the
|
||||
VT missing-value markers ('*.*', '-', '') and non-numeric text."""
|
||||
if v is None:
|
||||
return None
|
||||
if isinstance(v, bool):
|
||||
return None
|
||||
if isinstance(v, (int, float)):
|
||||
f = float(v)
|
||||
return int(f) if f.is_integer() else f
|
||||
s = str(v).strip()
|
||||
if not s or s in ("*", "*.*", "-", "—", "."):
|
||||
return None
|
||||
s = s.replace(",", "")
|
||||
if not re.match(r"^-?\d+(?:\.\d+)?$", s):
|
||||
return None
|
||||
f = float(s)
|
||||
return int(f) if f.is_integer() else f
|
||||
|
||||
|
||||
def _iso_date(v: Any) -> str | None:
|
||||
if isinstance(v, datetime):
|
||||
return v.date().isoformat()
|
||||
if isinstance(v, date):
|
||||
return v.isoformat()
|
||||
if v is None:
|
||||
return None
|
||||
s = str(v).strip()
|
||||
if not s or s.lower().startswith("did not"):
|
||||
return None
|
||||
# ISO YYYY-MM-DD (with optional trailing " 00:00:00" time).
|
||||
m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T].*)?$", s)
|
||||
if m:
|
||||
yr, mo, dy = m.groups()
|
||||
try:
|
||||
return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
|
||||
except ValueError:
|
||||
return None
|
||||
# US MM/DD/YYYY.
|
||||
m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{2,4})$", s)
|
||||
if m:
|
||||
mo, dy, yr = m.groups()
|
||||
if len(yr) == 2:
|
||||
yr = "20" + yr
|
||||
try:
|
||||
return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
|
||||
except ValueError:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def _txt(v: Any) -> str:
|
||||
if v is None:
|
||||
return ""
|
||||
if isinstance(v, datetime):
|
||||
return v.date().isoformat()
|
||||
return str(v).strip()
|
||||
|
||||
|
||||
def _norm(s: Any) -> str:
|
||||
return re.sub(r"\s+", " ", _txt(s)).strip().lower().rstrip(".")
|
||||
|
||||
|
||||
# Section markers / summary labels that are NOT data rows.
|
||||
_NONDATA_NAME = re.compile(
|
||||
r"^(average|avg\.?|l\.?s\.?d\.?|c\.?v\.?|coeff|mean|std|"
|
||||
r"early mg|late mg|early rm|any rm|non-?gmo|conventional|"
|
||||
r"public|check)\b", re.I)
|
||||
_SECTION_COMPANY = re.compile(
|
||||
r"^(early rm|any rm|late rm|early mg|late mg|non-?gmo|conventional|"
|
||||
r"gmo hybrids?|hybrids?)\b", re.I)
|
||||
|
||||
|
||||
def _find_header_row(rows: list[tuple]) -> int | None:
|
||||
"""Index of the column-name row — the one carrying 'Company' (col 0)
|
||||
and a 'Name' / 'Variety' (col 1)-ish header."""
|
||||
for i, row in enumerate(rows[:15]):
|
||||
c0 = _norm(row[0] if len(row) > 0 else "")
|
||||
c1 = _norm(row[1] if len(row) > 1 else "")
|
||||
if c0 == "company" and c1 in ("name", "variety"):
|
||||
return i
|
||||
return None
|
||||
|
||||
|
||||
def _build_colmap(rows: list[tuple], hdr_i: int) -> dict[str, int]:
|
||||
"""Map metric -> column index by merging the header band: the
|
||||
column-name row (hdr_i), the row below it, and the group-header row
|
||||
above it. Layouts vary — corn carries the units (bu/a, %) IN the
|
||||
header row with the Yield/Moisture/Lodging labels in the group row
|
||||
above; soy/wheat carry the units in the row below. We want the
|
||||
REGIONAL (leftmost) Yield, not the per-site repeats, so we take the
|
||||
leftmost yield-units column as the primary Yield.
|
||||
|
||||
Returns keys among: company, name, herb_trait, gt, ist, st, maturity,
|
||||
yield, lodging, height, moisture, protein, oil, rank, testwt,
|
||||
yield_2yr, yield_3yr.
|
||||
"""
|
||||
name_row = rows[hdr_i]
|
||||
below_row = rows[hdr_i + 1] if hdr_i + 1 < len(rows) else ()
|
||||
group_row = rows[hdr_i - 1] if hdr_i - 1 >= 0 else ()
|
||||
|
||||
def g(row: tuple, i: int) -> str:
|
||||
return _norm(row[i]) if i < len(row) else ""
|
||||
|
||||
ncols = max(len(name_row), len(below_row), len(group_row))
|
||||
|
||||
def band(i: int) -> tuple[str, str, str]:
|
||||
"""(group-above, header, below) normalized text for column i."""
|
||||
return (g(group_row, i), g(name_row, i), g(below_row, i))
|
||||
|
||||
cm: dict[str, int] = {"company": 0, "name": 1}
|
||||
|
||||
# Identity / trait columns — these sit on the header (column-name) row.
|
||||
for i in range(ncols):
|
||||
nm = g(name_row, i)
|
||||
if nm in ("herbicide trait1", "herbicide trait", "ht3", "ht"):
|
||||
cm.setdefault("herb_trait", i)
|
||||
elif nm in ("gt2", "gt"):
|
||||
cm.setdefault("gt", i)
|
||||
elif nm in ("ist1", "ist"):
|
||||
cm.setdefault("ist", i)
|
||||
elif nm in ("st1", "st2", "st"):
|
||||
cm.setdefault("st", i)
|
||||
elif nm in ("relative", "maturity", "relative maturity", "maturity date"):
|
||||
cm.setdefault("maturity", i)
|
||||
elif nm in ("yield rank", "rank"):
|
||||
cm.setdefault("rank", i)
|
||||
|
||||
# Metric columns — match across the whole band. A column is a Yield
|
||||
# column if any band row says "yield" OR carries a bu-per-acre unit.
|
||||
yield_cols: list[int] = []
|
||||
moisture_cols: list[int] = []
|
||||
testwt_cols: list[int] = []
|
||||
lodging_cols: list[int] = []
|
||||
height_cols: list[int] = []
|
||||
protein_cols: list[int] = []
|
||||
oil_cols: list[int] = []
|
||||
maturity_cols: list[int] = []
|
||||
YIELD_UNITS = {"bu/a", "bu/ac", "bu/acre"}
|
||||
for i in range(ncols):
|
||||
gp, nm, bl = band(i)
|
||||
cells = {gp, nm, bl}
|
||||
if "yield" in cells or cells & YIELD_UNITS:
|
||||
yield_cols.append(i)
|
||||
if "moisture" in cells:
|
||||
moisture_cols.append(i)
|
||||
if {"test wt", "test weight"} & cells or "lb/bu" in cells:
|
||||
testwt_cols.append(i)
|
||||
if "lodging" in cells:
|
||||
lodging_cols.append(i)
|
||||
if "height" in cells:
|
||||
height_cols.append(i)
|
||||
if "protein" in cells:
|
||||
protein_cols.append(i)
|
||||
if "oil" in cells:
|
||||
oil_cols.append(i)
|
||||
if {"relative", "maturity", "relative maturity", "maturity date"} & cells:
|
||||
maturity_cols.append(i)
|
||||
|
||||
if yield_cols:
|
||||
cm["yield"] = yield_cols[0]
|
||||
# 2yr / 3yr averages: yield-unit columns whose group header says avg.
|
||||
for i in yield_cols[1:]:
|
||||
grp = g(group_row, i)
|
||||
if "2-yr" in grp or "2 yr" in grp or "2yr" in grp:
|
||||
cm.setdefault("yield_2yr", i)
|
||||
elif "3-yr" in grp or "3 yr" in grp or "3yr" in grp:
|
||||
cm.setdefault("yield_3yr", i)
|
||||
if moisture_cols:
|
||||
cm["moisture"] = moisture_cols[0]
|
||||
if testwt_cols:
|
||||
cm["testwt"] = testwt_cols[0]
|
||||
if lodging_cols:
|
||||
cm["lodging"] = lodging_cols[0]
|
||||
if height_cols:
|
||||
cm["height"] = height_cols[0]
|
||||
if protein_cols:
|
||||
cm["protein"] = protein_cols[0]
|
||||
if oil_cols:
|
||||
cm["oil"] = oil_cols[0]
|
||||
if "maturity" not in cm and maturity_cols:
|
||||
cm["maturity"] = maturity_cols[0]
|
||||
|
||||
return cm
|
||||
|
||||
|
||||
# The metadata block always carries these labels; we locate its starting
|
||||
# column by finding where they appear, so the per-site yield columns that
|
||||
# sit between the metric block and the metadata block don't get scanned as
|
||||
# labels (they lead with numbers, not a label).
|
||||
_META_LABELS = {
|
||||
"location", "county", "site location", "host", "soil type",
|
||||
"planting date", "harvest date", "nitrogen applied", "pesticides",
|
||||
"tillage", "latitude", "longitude", "rainfall", "fungicide",
|
||||
}
|
||||
|
||||
|
||||
def _scan_meta_labels(src: list[tuple]) -> dict[str, list[str]]:
|
||||
"""From a label/value block, build ``{normalized_label: [values...]}``.
|
||||
|
||||
First find the column where the metadata labels live (the leftmost
|
||||
column that holds a known metadata label in some row); the label in
|
||||
each row is the first non-empty cell AT OR AFTER that column, and the
|
||||
values are the non-empty cells to its right. This skips the per-site
|
||||
yield columns that can sit to the left of the metadata block."""
|
||||
# Find the metadata-label column.
|
||||
label_col: int | None = None
|
||||
for row in src:
|
||||
for i, c in enumerate(row):
|
||||
if _norm(c) in _META_LABELS:
|
||||
if label_col is None or i < label_col:
|
||||
label_col = i
|
||||
break
|
||||
found: dict[str, list[str]] = {}
|
||||
for row in src:
|
||||
label = None
|
||||
label_idx = None
|
||||
start = label_col if label_col is not None else 0
|
||||
for i in range(start, len(row)):
|
||||
t = _txt(row[i])
|
||||
if t:
|
||||
label = t
|
||||
label_idx = i
|
||||
break
|
||||
if label is None:
|
||||
continue
|
||||
key = _norm(label)
|
||||
values = [_txt(c) for i, c in enumerate(row)
|
||||
if i > (label_idx or 0) and _txt(c)]
|
||||
if values and key not in found: # keep the first occurrence
|
||||
found[key] = values
|
||||
return found
|
||||
|
||||
|
||||
def _first_real(values: list[str]) -> str | None:
|
||||
for v in values:
|
||||
if v and v.lower() not in ("click to see map", "click here for directions"):
|
||||
return v
|
||||
return None
|
||||
|
||||
|
||||
def _apply_site_metadata(doc: TrialDoc, found: dict[str, list[str]], *,
|
||||
crop: str) -> None:
|
||||
"""Apply a scanned metadata block to the doc, filling only fields that
|
||||
are still unset (so the first/preferred source wins)."""
|
||||
def setif(attr: str, val: Any) -> None:
|
||||
if val and getattr(doc, attr) is None:
|
||||
setattr(doc, attr, val)
|
||||
|
||||
if "host" in found:
|
||||
setif("cooperator", _first_real(found["host"]))
|
||||
if "location" in found and not doc.sites:
|
||||
doc.sites = [v for v in found["location"]
|
||||
if v and v.lower() != "click to see map"]
|
||||
if "county" in found:
|
||||
setif("county", _first_real(found["county"]))
|
||||
if "soil type" in found:
|
||||
setif("soil_type", _first_real(found["soil type"]))
|
||||
if "planting date" in found:
|
||||
setif("planted_date", _iso_date(_first_real(found["planting date"])))
|
||||
if "harvest date" in found:
|
||||
setif("harvested_date", _iso_date(_first_real(found["harvest date"])))
|
||||
if "tillage" in found:
|
||||
setif("tillage", _first_real(found["tillage"]))
|
||||
elif "spring" in found and crop != "wheat":
|
||||
# Corn/soy: "Spring"/"Fall" are tillage operations. Wheat: those
|
||||
# same labels are nitrogen rates — never tillage.
|
||||
setif("tillage", _first_real(found["spring"]))
|
||||
if "latitude" in found:
|
||||
lat = _to_num(_first_real(found["latitude"]) or "")
|
||||
if isinstance(lat, (int, float)):
|
||||
setif("latitude", float(lat))
|
||||
if "longitude" in found:
|
||||
lon = _to_num(_first_real(found["longitude"]) or "")
|
||||
if isinstance(lon, (int, float)):
|
||||
setif("longitude", float(lon))
|
||||
|
||||
|
||||
def _assemble_traits(row: tuple, cm: dict[str, int]) -> str:
|
||||
"""Combine the herbicide-trait + GT (genetic trait, may spill across
|
||||
cols) + seed-treatment columns into one traits string."""
|
||||
bits: list[str] = []
|
||||
# GT can spill from its col up to (but not including) the herb_trait col.
|
||||
if "gt" in cm:
|
||||
gt_start = cm["gt"]
|
||||
gt_end = cm.get("herb_trait", cm.get("maturity", gt_start + 1))
|
||||
gt_toks = [_txt(row[i]) for i in range(gt_start, gt_end)
|
||||
if i < len(row) and _txt(row[i])]
|
||||
if gt_toks:
|
||||
bits.append("GT:" + "".join(gt_toks))
|
||||
if "herb_trait" in cm:
|
||||
ht = _txt(row[cm["herb_trait"]]) if cm["herb_trait"] < len(row) else ""
|
||||
if ht:
|
||||
bits.append("HT:" + ht)
|
||||
for k, lbl in (("ist", "IST"), ("st", "ST")):
|
||||
if k in cm and cm[k] < len(row):
|
||||
v = _txt(row[cm[k]])
|
||||
if v:
|
||||
bits.append(f"{lbl}:{v}")
|
||||
return " ".join(bits)
|
||||
|
||||
|
||||
def _is_data_row(row: tuple, cm: dict[str, int]) -> bool:
|
||||
company = _txt(row[0]) if len(row) > 0 else ""
|
||||
name = _txt(row[1]) if len(row) > 1 else ""
|
||||
if not company or not name:
|
||||
return False
|
||||
if _NONDATA_NAME.match(name) or _NONDATA_NAME.match(company):
|
||||
return False
|
||||
if _SECTION_COMPANY.match(company):
|
||||
return False
|
||||
# Must have a plausible numeric yield.
|
||||
y = _to_num(row[cm["yield"]]) if "yield" in cm and cm["yield"] < len(row) else None
|
||||
if not isinstance(y, (int, float)):
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def _plausible_yield(crop: str, y: float) -> bool:
|
||||
if crop == "corn":
|
||||
return 30 < y < 400
|
||||
if crop == "soybeans":
|
||||
return 10 < y < 150
|
||||
if crop == "wheat":
|
||||
return 20 < y < 200
|
||||
return 0 < y < 500
|
||||
|
||||
|
||||
def parse_xlsx(content: bytes, doc: TrialDoc) -> None:
|
||||
wb = openpyxl.load_workbook(io.BytesIO(content), data_only=True,
|
||||
read_only=True)
|
||||
# The yield table is the first sheet whose first ~15 rows contain a
|
||||
# Company/Name header.
|
||||
data_ws = None
|
||||
data_rows: list[tuple] = []
|
||||
hdr_i = None
|
||||
for name in wb.sheetnames:
|
||||
rows = list(wb[name].iter_rows(values_only=True))
|
||||
hi = _find_header_row(rows)
|
||||
if hi is not None:
|
||||
data_ws, data_rows, hdr_i = name, rows, hi
|
||||
break
|
||||
if data_ws is None or hdr_i is None:
|
||||
raise ValueError("no Company/Name header row found in any sheet")
|
||||
|
||||
cm = _build_colmap(data_rows, hdr_i)
|
||||
if "yield" not in cm:
|
||||
raise ValueError("no Yield column located")
|
||||
|
||||
# Site metadata lives in (a) trailing columns of the data sheet
|
||||
# (co-located with the results — most current) and/or (b) a dedicated
|
||||
# "Trial Info" sheet. Read the trailing-column block FIRST so it wins,
|
||||
# then let the info sheet fill any gaps. (Some regional files carry a
|
||||
# stale info sheet — e.g. a 2025 table whose Trial Info sheet still
|
||||
# shows 2021 planting dates — so trailing columns are preferred.)
|
||||
# _scan_meta_labels self-locates the metadata-label column, so the
|
||||
# per-site yield columns between the metric block and the metadata
|
||||
# block aren't mis-read as labels.
|
||||
_apply_site_metadata(doc, _scan_meta_labels(data_rows), crop=doc.crop)
|
||||
info_sheet = next((s for s in wb.sheetnames
|
||||
if "trial info" in s.lower()
|
||||
or "trial information" in s.lower()), None)
|
||||
if info_sheet:
|
||||
_apply_site_metadata(
|
||||
doc,
|
||||
_scan_meta_labels(list(wb[info_sheet].iter_rows(values_only=True))),
|
||||
crop=doc.crop)
|
||||
|
||||
results: list[dict] = []
|
||||
for row in data_rows[hdr_i + 2:]:
|
||||
if not _is_data_row(row, cm):
|
||||
continue
|
||||
y = _to_num(row[cm["yield"]])
|
||||
if not isinstance(y, (int, float)) or not _plausible_yield(doc.crop, float(y)):
|
||||
continue
|
||||
brand = _txt(row[0])
|
||||
product = _txt(row[1])
|
||||
# Sanity: a numeric/blank brand is junk.
|
||||
if not brand or brand.isdigit() or len(brand) <= 1:
|
||||
continue
|
||||
metrics: dict[str, Any] = {"Yield": y}
|
||||
for key, label in (("moisture", "Moisture"), ("lodging", "Lodging"),
|
||||
("height", "Height"), ("testwt", "Test Wt."),
|
||||
("protein", "Protein"), ("oil", "Oil"),
|
||||
("maturity", "Maturity"),
|
||||
("yield_2yr", "Yield 2yr avg"),
|
||||
("yield_3yr", "Yield 3yr avg")):
|
||||
if key in cm and cm[key] < len(row):
|
||||
v = _to_num(row[cm[key]])
|
||||
if v is not None:
|
||||
metrics[label] = v
|
||||
rec_rank = None
|
||||
if "rank" in cm and cm["rank"] < len(row):
|
||||
rk = _to_num(row[cm["rank"]])
|
||||
if isinstance(rk, (int, float)):
|
||||
rec_rank = int(rk)
|
||||
results.append({
|
||||
"rank": rec_rank, # synthesized below if None
|
||||
"brand": brand,
|
||||
"product": product,
|
||||
"traits": _assemble_traits(row, cm) or None,
|
||||
"metrics": metrics,
|
||||
})
|
||||
|
||||
# Synthesize rank by Yield DESC when the sheet didn't publish one
|
||||
# (corn/soy list alphabetically). Wheat carries Yield Rank already, but
|
||||
# we re-rank only if any row is missing a rank to keep them consistent.
|
||||
if results and any(r["rank"] is None for r in results):
|
||||
ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
|
||||
for i, r in enumerate(ordered, start=1):
|
||||
r["rank"] = i
|
||||
results = ordered
|
||||
else:
|
||||
results.sort(key=lambda r: r["rank"])
|
||||
|
||||
doc.results = results
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- render
|
||||
|
||||
|
||||
def render_markdown(doc: TrialDoc) -> str:
|
||||
crop_label = {"corn": "Corn", "soybeans": "Soybean",
|
||||
"wheat": "Wheat"}.get(doc.crop, doc.crop.title())
|
||||
head: list[str] = [
|
||||
f"# {crop_label} yield trial — University of Illinois VT, "
|
||||
f"{doc.region}, IL {doc.year}",
|
||||
"",
|
||||
"- **Publisher:** University of Illinois Variety Testing "
|
||||
"(independent cross-vendor trial)",
|
||||
f"- **Crop:** {crop_label}",
|
||||
"- **State:** IL",
|
||||
f"- **Year:** {doc.year}",
|
||||
f"- **Region:** {doc.region}",
|
||||
]
|
||||
for label, val in (
|
||||
("Rotation", doc.rotation),
|
||||
("Previous crop", doc.previous_crop),
|
||||
("Cooperator / host", doc.cooperator),
|
||||
("County", doc.county),
|
||||
("Sites", ", ".join(doc.sites) if doc.sites else None),
|
||||
("Soil type", doc.soil_type),
|
||||
("Tillage", doc.tillage),
|
||||
("Planted", doc.planted_date),
|
||||
("Harvested", doc.harvested_date),
|
||||
("Row width", doc.row_width),
|
||||
):
|
||||
if val:
|
||||
head.append(f"- **{label}:** {val}")
|
||||
head += [
|
||||
f"- **Source XLSX:** {doc.xlsx_url}",
|
||||
f"- **Index page:** {doc.index_url}",
|
||||
"", "---", "",
|
||||
"## Results (ranked by regional yield)", "",
|
||||
]
|
||||
|
||||
# Discover metric columns from the first result.
|
||||
metric_keys: list[str] = []
|
||||
for r in doc.results:
|
||||
if r.get("metrics"):
|
||||
metric_keys = list(r["metrics"].keys())
|
||||
break
|
||||
headers = ["Rank", "Brand", "Product", "Traits"] + metric_keys
|
||||
head.append("| " + " | ".join(headers) + " |")
|
||||
head.append("|" + "|".join(["---"] * len(headers)) + "|")
|
||||
for r in doc.results:
|
||||
row = [str(r.get("rank") or "-"), r.get("brand") or "-",
|
||||
r.get("product") or "-", r.get("traits") or "-"]
|
||||
m = r.get("metrics") or {}
|
||||
for k in metric_keys:
|
||||
v = m.get(k)
|
||||
row.append("-" if v is None else str(v))
|
||||
head.append("| " + " | ".join(row) + " |")
|
||||
head.append("")
|
||||
return "\n".join(head)
|
||||
|
||||
|
||||
def write_doc(doc: TrialDoc, body_md: str) -> None:
|
||||
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
(CORPUS_DIR / f"{doc.source_key}.md").write_text(body_md, encoding="utf-8")
|
||||
sidecar = {
|
||||
"source": "illinois_vt_trials",
|
||||
"source_key": doc.source_key,
|
||||
"data_type": "trial",
|
||||
"vendor": "University of Illinois",
|
||||
"brand_aggregator": "University of Illinois Variety Testing publishes",
|
||||
"brand": "University of Illinois VT",
|
||||
"crop": doc.crop,
|
||||
"state": "IL",
|
||||
"state_abbrev": "il",
|
||||
"year": doc.year,
|
||||
"region": doc.region,
|
||||
"rotation": doc.rotation,
|
||||
"previous_crop": doc.previous_crop,
|
||||
"cooperator": doc.cooperator,
|
||||
"county": doc.county,
|
||||
"sites": doc.sites or None,
|
||||
"soil_type": doc.soil_type,
|
||||
"tillage": doc.tillage,
|
||||
"planted_date": doc.planted_date,
|
||||
"harvested_date": doc.harvested_date,
|
||||
"row_width": doc.row_width,
|
||||
"latitude": doc.latitude,
|
||||
"longitude": doc.longitude,
|
||||
"results": doc.results,
|
||||
"n_results": len(doc.results),
|
||||
"source_urls": [doc.xlsx_url, doc.index_url],
|
||||
"tos_note": TOS_NOTE,
|
||||
"fetched_at": datetime.now(timezone.utc).isoformat(),
|
||||
"scraper_version": SCRAPER_VERSION,
|
||||
}
|
||||
(CORPUS_DIR / f"{doc.source_key}.json").write_text(
|
||||
json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
|
||||
encoding="utf-8")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- pipeline
|
||||
|
||||
|
||||
def process_doc(http: RateLimitedSession, doc: TrialDoc, *,
|
||||
force: bool) -> str:
|
||||
md_path = CORPUS_DIR / f"{doc.source_key}.md"
|
||||
if md_path.exists() and not force:
|
||||
return "skipped"
|
||||
try:
|
||||
r = http.get(doc.xlsx_url)
|
||||
r.raise_for_status()
|
||||
parse_xlsx(r.content, doc)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
log.error("%s parse failed (%s): %s", doc.source_key, doc.xlsx_url, exc)
|
||||
return "failed"
|
||||
if not doc.results:
|
||||
log.warning("%s — no valid result rows parsed; skipping", doc.source_key)
|
||||
return "empty"
|
||||
write_doc(doc, render_markdown(doc))
|
||||
log.info("%s written | %s %s %s | %d results | top: %s",
|
||||
doc.source_key, doc.crop, doc.region, doc.year, len(doc.results),
|
||||
doc.results[0]["brand"] + " " + doc.results[0]["product"]
|
||||
if doc.results else "-")
|
||||
return "written"
|
||||
|
||||
|
||||
def run(*, crops: set[str], years: set[int], limit: int | None,
|
||||
force: bool) -> int:
|
||||
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
http = RateLimitedSession()
|
||||
docs = discover_files(http, crops=crops, years=years)
|
||||
log.info("discovered %d in-scope variety tables", len(docs))
|
||||
if limit is not None:
|
||||
docs = docs[:limit]
|
||||
|
||||
counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
|
||||
for i, doc in enumerate(docs, start=1):
|
||||
status = process_doc(http, doc, force=force)
|
||||
counts[status] = counts.get(status, 0) + 1
|
||||
if status != "written" or i <= 5 or i % 20 == 0:
|
||||
log.info("[%d/%d] %s -> %s", i, len(docs), doc.source_key, status)
|
||||
|
||||
log.info("done: written=%d skipped=%d empty=%d failed=%d (of %d)",
|
||||
counts["written"], counts["skipped"], counts["empty"],
|
||||
counts["failed"], len(docs))
|
||||
return 0 if counts["failed"] == 0 else 1
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- CLI
|
||||
|
||||
|
||||
def _build_argparser() -> argparse.ArgumentParser:
|
||||
p = argparse.ArgumentParser(
|
||||
prog="scrape.sources.illinois_vt_trials",
|
||||
description="Scrape University of Illinois Variety Testing "
|
||||
"cross-vendor yield trials (XLSX) into the corpus.")
|
||||
p.add_argument("--year", type=int, default=None,
|
||||
help="Scrape a single harvest year (default: 2024+2025).")
|
||||
p.add_argument("--include-old", action="store_true",
|
||||
help="Also scrape 2000–2023 (deferred by default).")
|
||||
p.add_argument("--crop", default=None, choices=tuple(CROP_INDEX.keys()),
|
||||
help="Limit to one crop (corn / soybeans / wheat).")
|
||||
p.add_argument("--limit", type=int, default=None,
|
||||
help="Stop after processing N tables (default: all).")
|
||||
p.add_argument("--force", action="store_true",
|
||||
help="Re-fetch even if the markdown file already exists.")
|
||||
p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
|
||||
return p
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
args = _build_argparser().parse_args(argv)
|
||||
logging.basicConfig(
|
||||
level=args.log_level.upper(),
|
||||
format="%(asctime)s %(levelname)s %(name)s %(message)s",
|
||||
stream=sys.stderr)
|
||||
|
||||
crops = {args.crop} if args.crop else set(CROP_INDEX.keys())
|
||||
if args.year is not None:
|
||||
years = {args.year}
|
||||
elif args.include_old:
|
||||
years = set(range(OLD_YEAR_MIN, OLD_YEAR_MAX + 1)) | BASELINE_YEARS
|
||||
else:
|
||||
years = set(BASELINE_YEARS)
|
||||
|
||||
return run(crops=crops, years=years, limit=args.limit, force=args.force)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
Reference in New Issue
Block a user