Add university-extension trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 cross-vendor trial docs) (#19)
Image rebuild (skip scrape) / build (push) Successful in 5m54s

Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #19.
This commit is contained in:
2026-06-10 08:36:19 -04:00
committed by Claude (agent)
parent 0bac06b7b6
commit a54fac240f
255 changed files with 105410 additions and 13 deletions
+945
View File
@@ -0,0 +1,945 @@
"""University of Illinois Variety Testing — cross-vendor yield trials.
The University of Illinois Crop Sciences Variety Testing program
(``vt.cropsci.illinois.edu``) is a long-running, independent land-grant
testing program. Seed companies pay an entry fee to enter hybrids /
varieties; UIUC plants them in replicated regional trials and publishes
the results. Because the program is third-party and cross-vendor, a
single regional table ranks Pioneer / Brevant / DEKALB / Channel /
Burrus / Stine / Viking and dozens of others head-to-head — the
legitimate, independent way to get major-brand performance the
single-vendor corpus can't scrape directly.
This is a ``data_type: "trial"`` source (NOT variety identity). It emits
the same per-site cross-vendor sidecar shape as ``gh_plot_reports`` /
``agrigold_plot_reports`` / ``proharvest_plots`` (``results: [{rank,
brand, product, traits, metrics}]``), so it falls through to the shared
``_render_gh_plot_chunk`` renderer in ``rag/chunk.py`` with no chunk.py
edit. The published table lists entries alphabetically by company, so we
**synthesize ``rank`` by sorting on Yield descending** (highest yield =
rank 1) per document.
Data layout (verified Nov 2025):
- Static XLSX (+ PDF) per region per year at WordPress upload URLs.
The month segment of the upload path varies (``/2025/11/``,
``/2025/12/``, ``/2025/07/`` for wheat), so we DISCOVER the .xlsx
hrefs by fetching the /corn/, /soybeans/, /wheat/ index pages — we
never guess URLs.
- Corn regional tables: ``Company | Name | IST1 | GT2(+spill) | HT3 |
Relative Maturity | Yield bu/a | Moisture % | Lodging | <per-site
cols> | 2-yr Avg | 3-yr Avg``. Per-site metadata lives in a separate
"Trial Info" sheet (regional tables) or in trailing columns
(single-site CFC tables).
- Soybean regional tables: ``COMPANY | NAME | Herbicide Trait1 | ST2 |
Yield bu/a | Maturity Date | Lodging | Height | <2yr/3yr Yield> |
Protein @13% | Oil @13%`` with the per-site metadata block in
trailing columns.
- Wheat regional means tables: ``Company | Variety | ST1 | Yield |
Yield Rank | Test wt. | Height | <per-site Yield/Test wt.> |
Maturity date | Jointing time | FHB Score | FHB Category``. Wheat
publishes its own Yield Rank, which we honor.
The variety table sits below a 2-4 row header band (a group-header row,
a column-name row, and a units row). Columns are positionally stable
within a sheet (multi-word brands like "Viking | Blue River" live in a
single COMPANY cell), so we anchor on the header row that carries
"Company" + "Name"/"Variety" and map the metric columns from the
column-name + units rows. The leftmost "Yield (bu/a)" column is the
**regional** yield — the primary metric.
Section markers interleaved in the table ("Early RM", "Any RM",
"Non-GMO Hybrids", "Early MG: 1.9-2.7", "Late MG:") and summary rows
("Average", "L.S.D 25% Level", "CV (%)") are skipped — a data row must
have a real company, a variety name, and a plausible numeric yield.
Rotation distinction: regional tables are corn-following-soybean (the
default rotation); "CFC" / "Corn Following Corn" single-site tables
(Monmouth, Urbana, DeKalb) get ``previous_crop="corn"`` and a distinct
``-cfc-`` source_key so they never collide with the regional table.
robots/ToS: no usage terms posted on the VT site (publicly-funded
land-grant; companies pay an entry fee, which doesn't restrict published
result reuse). Polite UA + light rate limit.
Output:
corpus/illinois_vt_trials/<source_key>.md LLM-visible body
corpus/illinois_vt_trials/<source_key>.json sidecar metadata
source_key: ``ilvt-<crop>-<year>-r<region>`` e.g. ``ilvt-corn-2025-r1``;
CFC single-site tables: ``ilvt-corn-2024-cfc-monmouth``.
CLI:
python -m scrape.sources.illinois_vt_trials --year 2025 --limit 2
python -m scrape.sources.illinois_vt_trials --crop corn --force
python -m scrape.sources.illinois_vt_trials --include-old --force
"""
from __future__ import annotations
import argparse
import io
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Any
from urllib.parse import urljoin
import openpyxl
import requests
from bs4 import BeautifulSoup
SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://vt.cropsci.illinois.edu"
TOS_NOTE = (
"No usage terms posted on UIUC VT site; publicly-funded land-grant "
"data; attribute University of Illinois Variety Testing."
)
BASELINE_YEARS = {2024, 2025}
OLD_YEAR_MIN = 2000
OLD_YEAR_MAX = 2023
# Index pages per crop. PLURAL "soybeans" is the corpus crop value.
CROP_INDEX = {
"corn": "/corn/",
"soybeans": "/soybeans/",
"wheat": "/wheat/",
}
REQ_INTERVAL_SEC = 1.0
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "illinois_vt_trials"
log = logging.getLogger("scrape.illinois_vt_trials")
# --------------------------------------------------------------------- HTTP
class RateLimitedSession:
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
self.s = requests.Session()
self.s.headers["User-Agent"] = USER_AGENT
self.interval = interval
self._last = 0.0
def _wait(self) -> None:
delta = time.monotonic() - self._last
if delta < self.interval:
time.sleep(self.interval - delta)
self._last = time.monotonic()
def request(self, method: str, url: str, *, max_retries: int = 4,
timeout: float = 60.0, **kw: Any) -> requests.Response:
last_exc: Exception | None = None
for attempt in range(max_retries):
self._wait()
try:
resp = self.s.request(method, url, timeout=timeout, **kw)
except requests.RequestException as exc:
last_exc = exc
backoff = min(30.0, (2 ** attempt) + random.random())
log.warning("network error on %s %s: %s — retry in %.1fs",
method, url, exc, backoff)
time.sleep(backoff)
continue
if resp.status_code == 429 or 500 <= resp.status_code < 600:
ra = resp.headers.get("Retry-After")
backoff = float(ra) if (ra and ra.isdigit()) else min(
30.0, (2 ** attempt) + random.random())
log.warning("HTTP %d on %s %s — retry in %.1fs",
resp.status_code, method, url, backoff)
time.sleep(backoff)
continue
return resp
if last_exc:
raise last_exc
return resp # type: ignore[return-value]
def get(self, url: str, **kw: Any) -> requests.Response:
return self.request("GET", url, **kw)
# --------------------------------------------------------------------- model
@dataclass
class TrialDoc:
source_key: str
crop: str # corn / soybeans / wheat
year: int
region: str # e.g. "Region 1", "Monmouth CFC"
xlsx_url: str
index_url: str
rotation: str | None = None # "corn following soybean" / "corn following corn"
previous_crop: str | None = None # "corn" for CFC tables
cooperator: str | None = None # site host
county: str | None = None
soil_type: str | None = None
tillage: str | None = None
planted_date: str | None = None
harvested_date: str | None = None
row_width: str | None = None
latitude: float | None = None
longitude: float | None = None
sites: list[str] = field(default_factory=list)
results: list[dict] = field(default_factory=list)
# --------------------------------------------------------------------- discovery
# Year/Region extraction from a file name or upload path. The VT site has
# used many naming schemes across years, so try several.
_YEAR_RE = re.compile(r"(20\d{2})")
_REGION_NUM_RE = re.compile(r"region[\s_-]*([1-9])", re.I)
def _norm_url(href: str) -> str:
"""Resolve relative hrefs against BASE; also fold the legacy
``varietytesting.web.illinois.edu`` host (same WP install) to BASE so
older files download cleanly."""
href = href.strip()
if href.startswith("http"):
return href
return urljoin(BASE + "/", href.lstrip("/"))
def _file_year(url: str) -> int | None:
"""Harvest year of a file. Prefer the upload-path year segment
(``/uploads/2025/11/...`` or ``/uploads/corn/2021/...``); fall back to
the first 20xx in the file name."""
m = re.search(r"/uploads/(?:[a-z]+/)?(20\d{2})/", url)
if m:
return int(m.group(1))
fn = url.rsplit("/", 1)[-1]
m = _YEAR_RE.search(fn)
return int(m.group(1)) if m else None
def _classify_region(url: str) -> tuple[str, str | None, str | None] | None:
"""Return ``(region_label, rotation, previous_crop)`` for a file, or
None if it isn't a per-region/per-site VARIETY table we ingest.
We INCLUDE regional tables (Region 1-5, North/South/East/West) and
single-site CFC (corn-following-corn) tables. We EXCLUDE entry lists,
agronomic-characteristic sheets, FHB/scab summaries, and disease
tables (those aren't head-to-head yield rankings)."""
fn = url.rsplit("/", 1)[-1]
low = fn.lower()
# Exclusions — not cross-vendor yield ranking tables.
EXCLUDE = ("entry", "entries", "agronomic", "charateristic", # sic (site typo)
"characteristic", "scab", "fhb", "disease")
if any(tok in low for tok in EXCLUDE):
return None
# Numbered regions: "Region 1", "2025-Region-1", "northtbl20"
m = _REGION_NUM_RE.search(low)
if m:
return (f"Region {m.group(1)}", "corn following soybean", None)
# Named regions (corn 2022 + wheat use compass names).
for name, label in (("north", "North Region"), ("south", "South Region"),
("east", "East Region"), ("west", "West Region")):
if re.search(rf"\b{name}\b", low) or low.startswith(name + "tbl"):
return (label, "corn following soybean", None)
# Single-site Corn-Following-Corn tables (Monmouth / Urbana / DeKalb).
cfc = re.search(r"([a-z]+)[\s_-]*cfc", low)
if cfc or "cfc" in low:
site = (cfc.group(1).title() if cfc and cfc.group(1) else "CFC")
return (f"{site} CFC", "corn following corn", "corn")
# Wheat single-site summary tables (e.g. "2025-Urbana-Summary",
# "2024-Elkville-Table"). Capture the site name as the region.
m = re.search(r"20\d{2}[\s_-]+([a-z]+)[\s_-]+(?:summary|table)", low)
if m:
return (m.group(1).title(), None, None)
m = re.search(r"([a-z]+)[\s_-]+(?:summary|table)[\s_-]*20\d{2}", low)
if m and m.group(1) not in ("region", "regional"):
return (m.group(1).title(), None, None)
return None
def discover_files(http: RateLimitedSession, *, crops: set[str],
years: set[int]) -> list[TrialDoc]:
"""Fetch each crop index page, extract .xlsx hrefs, classify them into
per-region/per-site variety tables, and keep the ones in scope."""
out: list[TrialDoc] = []
seen_keys: set[str] = set()
for crop in sorted(crops):
index_path = CROP_INDEX.get(crop)
if not index_path:
log.warning("unknown crop %r, skipping", crop)
continue
index_url = BASE + index_path
log.info("GET %s", index_url)
r = http.get(index_url)
r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")
xlsx_hrefs = [a["href"] for a in soup.find_all("a", href=True)
if a["href"].lower().endswith(".xlsx")]
# Dedupe while preserving order.
seen_href: set[str] = set()
for href in xlsx_hrefs:
url = _norm_url(href)
if url in seen_href:
continue
seen_href.add(url)
year = _file_year(url)
if year is None or year not in years:
continue
cls = _classify_region(url)
if cls is None:
continue
region, rotation, prev_crop = cls
# source_key: ilvt-<crop>-<year>-r<n> for numbered regions,
# else a slug of the region name.
mnum = _REGION_NUM_RE.search(url.rsplit("/", 1)[-1].lower())
if mnum and "cfc" not in region.lower():
region_slug = f"r{mnum.group(1)}"
else:
region_slug = re.sub(r"[^a-z0-9]+", "-",
region.lower()).strip("-")
sk = f"ilvt-{crop}-{year}-{region_slug}"
if sk in seen_keys:
# Two files map to the same key (e.g. a "-final" + a "-1"
# duplicate). Keep the first; log the collision.
log.info("duplicate source_key %s from %s — skipping dupe",
sk, url)
continue
seen_keys.add(sk)
out.append(TrialDoc(
source_key=sk, crop=crop, year=year, region=region,
xlsx_url=url, index_url=index_url,
rotation=rotation, previous_crop=prev_crop,
))
log.info(" %s: %d in-scope variety tables", crop,
sum(1 for d in out if d.crop == crop))
return out
# --------------------------------------------------------------------- parse
def _to_num(v: Any) -> float | int | None:
"""Coerce a cell to a number. Strips '*', commas; returns None for the
VT missing-value markers ('*.*', '-', '') and non-numeric text."""
if v is None:
return None
if isinstance(v, bool):
return None
if isinstance(v, (int, float)):
f = float(v)
return int(f) if f.is_integer() else f
s = str(v).strip()
if not s or s in ("*", "*.*", "-", "", "."):
return None
s = s.replace(",", "")
if not re.match(r"^-?\d+(?:\.\d+)?$", s):
return None
f = float(s)
return int(f) if f.is_integer() else f
def _iso_date(v: Any) -> str | None:
if isinstance(v, datetime):
return v.date().isoformat()
if isinstance(v, date):
return v.isoformat()
if v is None:
return None
s = str(v).strip()
if not s or s.lower().startswith("did not"):
return None
# ISO YYYY-MM-DD (with optional trailing " 00:00:00" time).
m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T].*)?$", s)
if m:
yr, mo, dy = m.groups()
try:
return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
except ValueError:
return None
# US MM/DD/YYYY.
m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{2,4})$", s)
if m:
mo, dy, yr = m.groups()
if len(yr) == 2:
yr = "20" + yr
try:
return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
except ValueError:
return None
return None
def _txt(v: Any) -> str:
if v is None:
return ""
if isinstance(v, datetime):
return v.date().isoformat()
return str(v).strip()
def _norm(s: Any) -> str:
return re.sub(r"\s+", " ", _txt(s)).strip().lower().rstrip(".")
# Section markers / summary labels that are NOT data rows.
_NONDATA_NAME = re.compile(
r"^(average|avg\.?|l\.?s\.?d\.?|c\.?v\.?|coeff|mean|std|"
r"early mg|late mg|early rm|any rm|non-?gmo|conventional|"
r"public|check)\b", re.I)
_SECTION_COMPANY = re.compile(
r"^(early rm|any rm|late rm|early mg|late mg|non-?gmo|conventional|"
r"gmo hybrids?|hybrids?)\b", re.I)
def _find_header_row(rows: list[tuple]) -> int | None:
"""Index of the column-name row — the one carrying 'Company' (col 0)
and a 'Name' / 'Variety' (col 1)-ish header."""
for i, row in enumerate(rows[:15]):
c0 = _norm(row[0] if len(row) > 0 else "")
c1 = _norm(row[1] if len(row) > 1 else "")
if c0 == "company" and c1 in ("name", "variety"):
return i
return None
def _build_colmap(rows: list[tuple], hdr_i: int) -> dict[str, int]:
"""Map metric -> column index by merging the header band: the
column-name row (hdr_i), the row below it, and the group-header row
above it. Layouts vary — corn carries the units (bu/a, %) IN the
header row with the Yield/Moisture/Lodging labels in the group row
above; soy/wheat carry the units in the row below. We want the
REGIONAL (leftmost) Yield, not the per-site repeats, so we take the
leftmost yield-units column as the primary Yield.
Returns keys among: company, name, herb_trait, gt, ist, st, maturity,
yield, lodging, height, moisture, protein, oil, rank, testwt,
yield_2yr, yield_3yr.
"""
name_row = rows[hdr_i]
below_row = rows[hdr_i + 1] if hdr_i + 1 < len(rows) else ()
group_row = rows[hdr_i - 1] if hdr_i - 1 >= 0 else ()
def g(row: tuple, i: int) -> str:
return _norm(row[i]) if i < len(row) else ""
ncols = max(len(name_row), len(below_row), len(group_row))
def band(i: int) -> tuple[str, str, str]:
"""(group-above, header, below) normalized text for column i."""
return (g(group_row, i), g(name_row, i), g(below_row, i))
cm: dict[str, int] = {"company": 0, "name": 1}
# Identity / trait columns — these sit on the header (column-name) row.
for i in range(ncols):
nm = g(name_row, i)
if nm in ("herbicide trait1", "herbicide trait", "ht3", "ht"):
cm.setdefault("herb_trait", i)
elif nm in ("gt2", "gt"):
cm.setdefault("gt", i)
elif nm in ("ist1", "ist"):
cm.setdefault("ist", i)
elif nm in ("st1", "st2", "st"):
cm.setdefault("st", i)
elif nm in ("relative", "maturity", "relative maturity", "maturity date"):
cm.setdefault("maturity", i)
elif nm in ("yield rank", "rank"):
cm.setdefault("rank", i)
# Metric columns — match across the whole band. A column is a Yield
# column if any band row says "yield" OR carries a bu-per-acre unit.
yield_cols: list[int] = []
moisture_cols: list[int] = []
testwt_cols: list[int] = []
lodging_cols: list[int] = []
height_cols: list[int] = []
protein_cols: list[int] = []
oil_cols: list[int] = []
maturity_cols: list[int] = []
YIELD_UNITS = {"bu/a", "bu/ac", "bu/acre"}
for i in range(ncols):
gp, nm, bl = band(i)
cells = {gp, nm, bl}
if "yield" in cells or cells & YIELD_UNITS:
yield_cols.append(i)
if "moisture" in cells:
moisture_cols.append(i)
if {"test wt", "test weight"} & cells or "lb/bu" in cells:
testwt_cols.append(i)
if "lodging" in cells:
lodging_cols.append(i)
if "height" in cells:
height_cols.append(i)
if "protein" in cells:
protein_cols.append(i)
if "oil" in cells:
oil_cols.append(i)
if {"relative", "maturity", "relative maturity", "maturity date"} & cells:
maturity_cols.append(i)
if yield_cols:
cm["yield"] = yield_cols[0]
# 2yr / 3yr averages: yield-unit columns whose group header says avg.
for i in yield_cols[1:]:
grp = g(group_row, i)
if "2-yr" in grp or "2 yr" in grp or "2yr" in grp:
cm.setdefault("yield_2yr", i)
elif "3-yr" in grp or "3 yr" in grp or "3yr" in grp:
cm.setdefault("yield_3yr", i)
if moisture_cols:
cm["moisture"] = moisture_cols[0]
if testwt_cols:
cm["testwt"] = testwt_cols[0]
if lodging_cols:
cm["lodging"] = lodging_cols[0]
if height_cols:
cm["height"] = height_cols[0]
if protein_cols:
cm["protein"] = protein_cols[0]
if oil_cols:
cm["oil"] = oil_cols[0]
if "maturity" not in cm and maturity_cols:
cm["maturity"] = maturity_cols[0]
return cm
# The metadata block always carries these labels; we locate its starting
# column by finding where they appear, so the per-site yield columns that
# sit between the metric block and the metadata block don't get scanned as
# labels (they lead with numbers, not a label).
_META_LABELS = {
"location", "county", "site location", "host", "soil type",
"planting date", "harvest date", "nitrogen applied", "pesticides",
"tillage", "latitude", "longitude", "rainfall", "fungicide",
}
def _scan_meta_labels(src: list[tuple]) -> dict[str, list[str]]:
"""From a label/value block, build ``{normalized_label: [values...]}``.
First find the column where the metadata labels live (the leftmost
column that holds a known metadata label in some row); the label in
each row is the first non-empty cell AT OR AFTER that column, and the
values are the non-empty cells to its right. This skips the per-site
yield columns that can sit to the left of the metadata block."""
# Find the metadata-label column.
label_col: int | None = None
for row in src:
for i, c in enumerate(row):
if _norm(c) in _META_LABELS:
if label_col is None or i < label_col:
label_col = i
break
found: dict[str, list[str]] = {}
for row in src:
label = None
label_idx = None
start = label_col if label_col is not None else 0
for i in range(start, len(row)):
t = _txt(row[i])
if t:
label = t
label_idx = i
break
if label is None:
continue
key = _norm(label)
values = [_txt(c) for i, c in enumerate(row)
if i > (label_idx or 0) and _txt(c)]
if values and key not in found: # keep the first occurrence
found[key] = values
return found
def _first_real(values: list[str]) -> str | None:
for v in values:
if v and v.lower() not in ("click to see map", "click here for directions"):
return v
return None
def _apply_site_metadata(doc: TrialDoc, found: dict[str, list[str]], *,
crop: str) -> None:
"""Apply a scanned metadata block to the doc, filling only fields that
are still unset (so the first/preferred source wins)."""
def setif(attr: str, val: Any) -> None:
if val and getattr(doc, attr) is None:
setattr(doc, attr, val)
if "host" in found:
setif("cooperator", _first_real(found["host"]))
if "location" in found and not doc.sites:
doc.sites = [v for v in found["location"]
if v and v.lower() != "click to see map"]
if "county" in found:
setif("county", _first_real(found["county"]))
if "soil type" in found:
setif("soil_type", _first_real(found["soil type"]))
if "planting date" in found:
setif("planted_date", _iso_date(_first_real(found["planting date"])))
if "harvest date" in found:
setif("harvested_date", _iso_date(_first_real(found["harvest date"])))
if "tillage" in found:
setif("tillage", _first_real(found["tillage"]))
elif "spring" in found and crop != "wheat":
# Corn/soy: "Spring"/"Fall" are tillage operations. Wheat: those
# same labels are nitrogen rates — never tillage.
setif("tillage", _first_real(found["spring"]))
if "latitude" in found:
lat = _to_num(_first_real(found["latitude"]) or "")
if isinstance(lat, (int, float)):
setif("latitude", float(lat))
if "longitude" in found:
lon = _to_num(_first_real(found["longitude"]) or "")
if isinstance(lon, (int, float)):
setif("longitude", float(lon))
def _assemble_traits(row: tuple, cm: dict[str, int]) -> str:
"""Combine the herbicide-trait + GT (genetic trait, may spill across
cols) + seed-treatment columns into one traits string."""
bits: list[str] = []
# GT can spill from its col up to (but not including) the herb_trait col.
if "gt" in cm:
gt_start = cm["gt"]
gt_end = cm.get("herb_trait", cm.get("maturity", gt_start + 1))
gt_toks = [_txt(row[i]) for i in range(gt_start, gt_end)
if i < len(row) and _txt(row[i])]
if gt_toks:
bits.append("GT:" + "".join(gt_toks))
if "herb_trait" in cm:
ht = _txt(row[cm["herb_trait"]]) if cm["herb_trait"] < len(row) else ""
if ht:
bits.append("HT:" + ht)
for k, lbl in (("ist", "IST"), ("st", "ST")):
if k in cm and cm[k] < len(row):
v = _txt(row[cm[k]])
if v:
bits.append(f"{lbl}:{v}")
return " ".join(bits)
def _is_data_row(row: tuple, cm: dict[str, int]) -> bool:
company = _txt(row[0]) if len(row) > 0 else ""
name = _txt(row[1]) if len(row) > 1 else ""
if not company or not name:
return False
if _NONDATA_NAME.match(name) or _NONDATA_NAME.match(company):
return False
if _SECTION_COMPANY.match(company):
return False
# Must have a plausible numeric yield.
y = _to_num(row[cm["yield"]]) if "yield" in cm and cm["yield"] < len(row) else None
if not isinstance(y, (int, float)):
return False
return True
def _plausible_yield(crop: str, y: float) -> bool:
if crop == "corn":
return 30 < y < 400
if crop == "soybeans":
return 10 < y < 150
if crop == "wheat":
return 20 < y < 200
return 0 < y < 500
def parse_xlsx(content: bytes, doc: TrialDoc) -> None:
wb = openpyxl.load_workbook(io.BytesIO(content), data_only=True,
read_only=True)
# The yield table is the first sheet whose first ~15 rows contain a
# Company/Name header.
data_ws = None
data_rows: list[tuple] = []
hdr_i = None
for name in wb.sheetnames:
rows = list(wb[name].iter_rows(values_only=True))
hi = _find_header_row(rows)
if hi is not None:
data_ws, data_rows, hdr_i = name, rows, hi
break
if data_ws is None or hdr_i is None:
raise ValueError("no Company/Name header row found in any sheet")
cm = _build_colmap(data_rows, hdr_i)
if "yield" not in cm:
raise ValueError("no Yield column located")
# Site metadata lives in (a) trailing columns of the data sheet
# (co-located with the results — most current) and/or (b) a dedicated
# "Trial Info" sheet. Read the trailing-column block FIRST so it wins,
# then let the info sheet fill any gaps. (Some regional files carry a
# stale info sheet — e.g. a 2025 table whose Trial Info sheet still
# shows 2021 planting dates — so trailing columns are preferred.)
# _scan_meta_labels self-locates the metadata-label column, so the
# per-site yield columns between the metric block and the metadata
# block aren't mis-read as labels.
_apply_site_metadata(doc, _scan_meta_labels(data_rows), crop=doc.crop)
info_sheet = next((s for s in wb.sheetnames
if "trial info" in s.lower()
or "trial information" in s.lower()), None)
if info_sheet:
_apply_site_metadata(
doc,
_scan_meta_labels(list(wb[info_sheet].iter_rows(values_only=True))),
crop=doc.crop)
results: list[dict] = []
for row in data_rows[hdr_i + 2:]:
if not _is_data_row(row, cm):
continue
y = _to_num(row[cm["yield"]])
if not isinstance(y, (int, float)) or not _plausible_yield(doc.crop, float(y)):
continue
brand = _txt(row[0])
product = _txt(row[1])
# Sanity: a numeric/blank brand is junk.
if not brand or brand.isdigit() or len(brand) <= 1:
continue
metrics: dict[str, Any] = {"Yield": y}
for key, label in (("moisture", "Moisture"), ("lodging", "Lodging"),
("height", "Height"), ("testwt", "Test Wt."),
("protein", "Protein"), ("oil", "Oil"),
("maturity", "Maturity"),
("yield_2yr", "Yield 2yr avg"),
("yield_3yr", "Yield 3yr avg")):
if key in cm and cm[key] < len(row):
v = _to_num(row[cm[key]])
if v is not None:
metrics[label] = v
rec_rank = None
if "rank" in cm and cm["rank"] < len(row):
rk = _to_num(row[cm["rank"]])
if isinstance(rk, (int, float)):
rec_rank = int(rk)
results.append({
"rank": rec_rank, # synthesized below if None
"brand": brand,
"product": product,
"traits": _assemble_traits(row, cm) or None,
"metrics": metrics,
})
# Synthesize rank by Yield DESC when the sheet didn't publish one
# (corn/soy list alphabetically). Wheat carries Yield Rank already, but
# we re-rank only if any row is missing a rank to keep them consistent.
if results and any(r["rank"] is None for r in results):
ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
for i, r in enumerate(ordered, start=1):
r["rank"] = i
results = ordered
else:
results.sort(key=lambda r: r["rank"])
doc.results = results
# --------------------------------------------------------------------- render
def render_markdown(doc: TrialDoc) -> str:
crop_label = {"corn": "Corn", "soybeans": "Soybean",
"wheat": "Wheat"}.get(doc.crop, doc.crop.title())
head: list[str] = [
f"# {crop_label} yield trial — University of Illinois VT, "
f"{doc.region}, IL {doc.year}",
"",
"- **Publisher:** University of Illinois Variety Testing "
"(independent cross-vendor trial)",
f"- **Crop:** {crop_label}",
"- **State:** IL",
f"- **Year:** {doc.year}",
f"- **Region:** {doc.region}",
]
for label, val in (
("Rotation", doc.rotation),
("Previous crop", doc.previous_crop),
("Cooperator / host", doc.cooperator),
("County", doc.county),
("Sites", ", ".join(doc.sites) if doc.sites else None),
("Soil type", doc.soil_type),
("Tillage", doc.tillage),
("Planted", doc.planted_date),
("Harvested", doc.harvested_date),
("Row width", doc.row_width),
):
if val:
head.append(f"- **{label}:** {val}")
head += [
f"- **Source XLSX:** {doc.xlsx_url}",
f"- **Index page:** {doc.index_url}",
"", "---", "",
"## Results (ranked by regional yield)", "",
]
# Discover metric columns from the first result.
metric_keys: list[str] = []
for r in doc.results:
if r.get("metrics"):
metric_keys = list(r["metrics"].keys())
break
headers = ["Rank", "Brand", "Product", "Traits"] + metric_keys
head.append("| " + " | ".join(headers) + " |")
head.append("|" + "|".join(["---"] * len(headers)) + "|")
for r in doc.results:
row = [str(r.get("rank") or "-"), r.get("brand") or "-",
r.get("product") or "-", r.get("traits") or "-"]
m = r.get("metrics") or {}
for k in metric_keys:
v = m.get(k)
row.append("-" if v is None else str(v))
head.append("| " + " | ".join(row) + " |")
head.append("")
return "\n".join(head)
def write_doc(doc: TrialDoc, body_md: str) -> None:
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
(CORPUS_DIR / f"{doc.source_key}.md").write_text(body_md, encoding="utf-8")
sidecar = {
"source": "illinois_vt_trials",
"source_key": doc.source_key,
"data_type": "trial",
"vendor": "University of Illinois",
"brand_aggregator": "University of Illinois Variety Testing publishes",
"brand": "University of Illinois VT",
"crop": doc.crop,
"state": "IL",
"state_abbrev": "il",
"year": doc.year,
"region": doc.region,
"rotation": doc.rotation,
"previous_crop": doc.previous_crop,
"cooperator": doc.cooperator,
"county": doc.county,
"sites": doc.sites or None,
"soil_type": doc.soil_type,
"tillage": doc.tillage,
"planted_date": doc.planted_date,
"harvested_date": doc.harvested_date,
"row_width": doc.row_width,
"latitude": doc.latitude,
"longitude": doc.longitude,
"results": doc.results,
"n_results": len(doc.results),
"source_urls": [doc.xlsx_url, doc.index_url],
"tos_note": TOS_NOTE,
"fetched_at": datetime.now(timezone.utc).isoformat(),
"scraper_version": SCRAPER_VERSION,
}
(CORPUS_DIR / f"{doc.source_key}.json").write_text(
json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
encoding="utf-8")
# --------------------------------------------------------------------- pipeline
def process_doc(http: RateLimitedSession, doc: TrialDoc, *,
force: bool) -> str:
md_path = CORPUS_DIR / f"{doc.source_key}.md"
if md_path.exists() and not force:
return "skipped"
try:
r = http.get(doc.xlsx_url)
r.raise_for_status()
parse_xlsx(r.content, doc)
except Exception as exc: # noqa: BLE001
log.error("%s parse failed (%s): %s", doc.source_key, doc.xlsx_url, exc)
return "failed"
if not doc.results:
log.warning("%s — no valid result rows parsed; skipping", doc.source_key)
return "empty"
write_doc(doc, render_markdown(doc))
log.info("%s written | %s %s %s | %d results | top: %s",
doc.source_key, doc.crop, doc.region, doc.year, len(doc.results),
doc.results[0]["brand"] + " " + doc.results[0]["product"]
if doc.results else "-")
return "written"
def run(*, crops: set[str], years: set[int], limit: int | None,
force: bool) -> int:
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
http = RateLimitedSession()
docs = discover_files(http, crops=crops, years=years)
log.info("discovered %d in-scope variety tables", len(docs))
if limit is not None:
docs = docs[:limit]
counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
for i, doc in enumerate(docs, start=1):
status = process_doc(http, doc, force=force)
counts[status] = counts.get(status, 0) + 1
if status != "written" or i <= 5 or i % 20 == 0:
log.info("[%d/%d] %s -> %s", i, len(docs), doc.source_key, status)
log.info("done: written=%d skipped=%d empty=%d failed=%d (of %d)",
counts["written"], counts["skipped"], counts["empty"],
counts["failed"], len(docs))
return 0 if counts["failed"] == 0 else 1
# --------------------------------------------------------------------- CLI
def _build_argparser() -> argparse.ArgumentParser:
p = argparse.ArgumentParser(
prog="scrape.sources.illinois_vt_trials",
description="Scrape University of Illinois Variety Testing "
"cross-vendor yield trials (XLSX) into the corpus.")
p.add_argument("--year", type=int, default=None,
help="Scrape a single harvest year (default: 2024+2025).")
p.add_argument("--include-old", action="store_true",
help="Also scrape 20002023 (deferred by default).")
p.add_argument("--crop", default=None, choices=tuple(CROP_INDEX.keys()),
help="Limit to one crop (corn / soybeans / wheat).")
p.add_argument("--limit", type=int, default=None,
help="Stop after processing N tables (default: all).")
p.add_argument("--force", action="store_true",
help="Re-fetch even if the markdown file already exists.")
p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
return p
def main(argv: list[str] | None = None) -> int:
args = _build_argparser().parse_args(argv)
logging.basicConfig(
level=args.log_level.upper(),
format="%(asctime)s %(levelname)s %(name)s %(message)s",
stream=sys.stderr)
crops = {args.crop} if args.crop else set(CROP_INDEX.keys())
if args.year is not None:
years = {args.year}
elif args.include_old:
years = set(range(OLD_YEAR_MIN, OLD_YEAR_MAX + 1)) | BASELINE_YEARS
else:
years = set(BASELINE_YEARS)
return run(crops=crops, years=years, limit=args.limit, force=args.force)
if __name__ == "__main__":
sys.exit(main())
+651
View File
@@ -0,0 +1,651 @@
"""Iowa Crop Performance Tests (ICPT) — cross-vendor yield trials.
Iowa State University / the Iowa Crop Improvement Association run the
**Iowa Crop Performance Tests**, an independent, third-party variety
trial program. Because the trial is publisher-neutral, a single
district table ranks EVERY brand head-to-head — Pioneer, DEKALB,
Brevant, NuTech, Renk, Legacy, Epley Brothers, etc. — on identical
plots. That makes it the highest-trust ``data_type: "trial"`` source
in the corpus: unlike the vendor plot reports (Golden Harvest, LG,
AgriGold, ProHarvest), no seed company controls the entry list or the
agronomy, so there's no home-brand bias.
Site shape (ASP.NET, server-rendered GridView tables — requests +
BeautifulSoup, no JS / headless browser needed):
Corn: https://www.croptesting.iastate.edu/corn/CornDistrict2.aspx
Soybean: https://www.croptesting.iastate.edu/Soybean/SoybeanDistrict2.aspx
``...District2.aspx`` is the ONLY live district URL — the district
(North / Central / South) is chosen *on that same page* via a
``radLstDistrict`` radio (1/2/3) ASP.NET **postback**, NOT a separate
URL (CornDistrict1/3.aspx 302-redirect away). Likewise the year
(``cmbYear`` dropdown, 2025→2014) and the maturity season
(``radListSeason``: 1=Early, 2=Full) are postbacks — there are no
stable GET URLs for them. So we GET the page once to harvest the
ASP.NET hidden fields (``__VIEWSTATE`` / ``__VIEWSTATEGENERATOR`` /
``__VIEWSTATEENCRYPTED``), then POST the form with the desired
year/district/season + ``btnFilter=Filter`` to drive the view.
``CornDistrict.aspx`` (no number) is the 2013-and-older legacy page —
out of scope.
A district table is a multi-site aggregate: the GridView carries the
district-wide Yield plus a West/East sub-region split (Wyld/Eyld) and a
per-site yield column for each cooperator location in the district.
That makes **one district × season × year table the natural document
granularity** — one sidecar per ``(crop, year, district, season)``.
GridView column → field map:
corn: Company | Entry | RM | Herb Tech | Trait Package |
Yield | Yldp | Moist | Wyld | Eyld | <site> ...
soybean: Company | Entry | MG | Herb Tech |
Yield | WestYield | EastYield | <site> ...
Company -> result.brand (the seed COMPANY — critical)
Entry -> result.product (variety / hybrid code)
Herb Tech +
Trait Package -> result.traits
everything else (RM/MG, Yield, Yldp, Moist, Wyld/Eyld, per-site)
-> result.metrics ("Yield" kept verbatim as the
primary key the chunker's top-N picker reads)
Rows are pre-sorted by Yield DESC on the page; we re-sort defensively
and assign rank ourselves (the table has no rank column).
We emit the SAME sidecar shape as ``agrigold_plot_reports`` /
``lg_plot_reports`` / ``gh_plot_reports`` / ``proharvest_plots``
(``results: [{rank, brand, product, traits, metrics}]``). The trial
chunker's source dispatch doesn't list ``iowa_icpt_trials`` explicitly,
so it FALLS THROUGH to the shared ``_render_gh_plot_chunk`` renderer —
no ``rag/chunk.py`` edit required.
Output:
corpus/iowa_icpt_trials/<source_key>.md LLM-visible body
corpus/iowa_icpt_trials/<source_key>.json sidecar metadata
source_key: ``icpt-<crop>-<year>-<district>[-<season>]``
e.g. ``icpt-corn-2025-north-early``, ``icpt-soybeans-2024-south-full``.
Scope: 2024 + 2025 baseline. ``--include-old`` walks 20142023.
robots/ToS: no robots.txt (the ASP.NET app 404s it); footer
"Copyright (c) 1995-2016 Iowa State University ... All rights reserved"
carries no automation clause. Public land-grant ICPT data, polite UA,
low request rate. (See ``tos_note`` in the sidecar.)
CLI:
python -m scrape.sources.iowa_icpt_trials --limit 4
python -m scrape.sources.iowa_icpt_trials --crop corn --year 2025
python -m scrape.sources.iowa_icpt_trials --include-old --force
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.croptesting.iastate.edu"
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "iowa_icpt_trials"
REQ_INTERVAL_SEC = 2.0 # land-grant box; be polite, single-threaded
BASELINE_YEARS = [2024, 2025]
OLD_YEARS = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]
TOS_NOTE = (
"Footer 'Copyright (c) ...ISU...All rights reserved' (no automation "
"clause, no robots.txt); public ICPT data; low request rate; attribute "
"Iowa Crop Performance Tests / ISU."
)
# crop -> (district-results page URL, RM/MG header label)
CROPS: dict[str, tuple[str, str]] = {
"corn": (f"{BASE}/corn/CornDistrict2.aspx", "RM"),
"soybeans": (f"{BASE}/Soybean/SoybeanDistrict2.aspx", "MG"),
}
# radLstDistrict radio value -> (slug, label)
DISTRICTS: dict[str, tuple[str, str]] = {
"1": ("north", "North"),
"2": ("central", "Central"),
"3": ("south", "South"),
}
# radListSeason radio value -> (slug, label)
SEASONS: dict[str, tuple[str, str]] = {
"1": ("early", "Early Season"),
"2": ("full", "Full Season"),
}
# ASP.NET control names
C_YEAR = "ctl00$MainContent$cmbYear"
C_DISTRICT = "ctl00$MainContent$radLstDistrict"
C_SEASON = "ctl00$MainContent$radListSeason"
C_SHOW = "ctl00$MainContent$radLstShowOptions"
C_FILTER = "ctl00$MainContent$btnFilter"
# GridView header labels that are NOT metric columns.
BRAND_COL = "company"
PRODUCT_COL = "entry"
TRAIT_COLS = {"herb tech", "trait package"}
log = logging.getLogger("scrape.iowa_icpt_trials")
# --------------------------------------------------------------------- HTTP
class RateLimitedSession:
"""Single-threaded rate-limited requests.Session (ASP.NET viewstate
flow is inherently sequential per page, so no global lock needed)."""
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
self.s = requests.Session()
self.s.headers["User-Agent"] = USER_AGENT
self.interval = interval
self._last = 0.0
def _wait(self) -> None:
delta = time.monotonic() - self._last
if delta < self.interval:
time.sleep(self.interval - delta)
self._last = time.monotonic()
def request(self, method: str, url: str, *, max_retries: int = 4,
timeout: float = 45.0, **kw: Any) -> requests.Response:
last_exc: Exception | None = None
resp: requests.Response | None = None
for attempt in range(max_retries):
self._wait()
try:
resp = self.s.request(method, url, timeout=timeout, **kw)
except requests.RequestException as exc:
last_exc = exc
backoff = min(30.0, (2 ** attempt) + random.random())
log.warning("network error on %s %s: %s — retry in %.1fs",
method, url, exc, backoff)
time.sleep(backoff)
continue
if resp.status_code == 429 or 500 <= resp.status_code < 600:
ra = resp.headers.get("Retry-After")
backoff = float(ra) if (ra and ra.isdigit()) else min(
30.0, (2 ** attempt) + random.random())
log.warning("HTTP %d on %s %s — retry in %.1fs",
resp.status_code, method, url, backoff)
time.sleep(backoff)
continue
return resp
if last_exc:
raise last_exc
assert resp is not None
return resp
def get(self, url: str, **kw: Any) -> requests.Response:
return self.request("GET", url, **kw)
def post(self, url: str, **kw: Any) -> requests.Response:
return self.request("POST", url, **kw)
# --------------------------------------------------------------------- model
@dataclass
class TrialResult:
rank: int | None = None
brand: str = ""
product: str = ""
traits: str = ""
metrics: dict[str, float | str | None] = field(default_factory=dict)
@dataclass
class DistrictTrial:
source_key: str
source_url: str
crop: str # "corn" / "soybeans"
year: int
district_slug: str # north / central / south
district_label: str # North / Central / South
season_slug: str # early / full
season_label: str # Early Season / Full Season
sites: list[str] = field(default_factory=list) # cooperator locations
experiment_mean: float | None = None
results: list[TrialResult] = field(default_factory=list)
# --------------------------------------------------------------------- parse
def _hidden_fields(soup: BeautifulSoup) -> dict[str, str]:
out: dict[str, str] = {}
for inp in soup.find_all("input", {"type": "hidden"}):
name = inp.get("name")
if name:
out[name] = inp.get("value") or ""
return out
_NUM_RE = re.compile(r"^-?\d+(?:\.\d+)?$")
def _to_num(s: str | None) -> float | int | None:
s = (s or "").strip()
if not s or s == "-" or not _NUM_RE.match(s):
return None
f = float(s)
return int(f) if f.is_integer() else f
def _norm(s: str) -> str:
return re.sub(r"\s+", " ", (s or "").strip()).lower()
def _grid_rows(soup: BeautifulSoup, table_id: str) -> list[list[str]]:
table = soup.find("table", {"id": table_id})
if table is None:
return []
rows: list[list[str]] = []
for tr in table.find_all("tr"):
cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
if cells:
rows.append(cells)
return rows
def _experiment_mean(soup: BeautifulSoup) -> float | None:
"""Pull the district-wide 'Experiment Mean' Yield from the summary
GridView (first data row, second column)."""
rows = _grid_rows(soup, "MainContent_gvDataSummary")
for r in rows:
if r and _norm(r[0]).startswith("experiment mean") and len(r) > 1:
return _to_num(r[1]) # type: ignore[return-value]
return None
def parse_district_table(
soup: BeautifulSoup,
*,
rm_mg_label: str,
) -> tuple[list[TrialResult], list[str], float | None]:
"""Parse the ``MainContent_gvData`` GridView into ranked results.
Returns ``(results, site_columns, experiment_mean)``. Rows arrive
pre-sorted by Yield DESC; we re-sort by Yield DESC defensively and
assign rank ourselves (no rank column on the page)."""
rows = _grid_rows(soup, "MainContent_gvData")
if len(rows) < 2:
return [], [], None
header = rows[0]
hkeys = [_norm(h) for h in header]
# Locate the structural columns.
def find_col(*want: str) -> int | None:
for w in want:
for i, h in enumerate(hkeys):
if h == w:
return i
return None
i_brand = find_col(BRAND_COL)
i_product = find_col(PRODUCT_COL)
i_traits = [i for i, h in enumerate(hkeys) if h in TRAIT_COLS]
# Identify the per-site (cooperator-location) yield columns: they
# come AFTER the West/East sub-region split (Wyld/Eyld /
# WestYield/EastYield), and their header is a location name, not a
# known metric. Anything that isn't brand/product/trait is a metric;
# per-site columns are metrics whose header isn't a reserved label.
reserved_metric = {
_norm(rm_mg_label), "yield", "yldp", "yield pct", "yield %",
"moist", "wyld", "eyld", "westyield", "eastyield",
}
sites: list[str] = []
for i, h in enumerate(hkeys):
if i == i_brand or i == i_product or i in i_traits:
continue
if h and h not in reserved_metric:
sites.append(header[i])
skip = {i_brand, i_product, *i_traits}
metric_cols = [(header[i], i) for i in range(len(header)) if i not in skip and header[i]]
results: list[TrialResult] = []
for raw in rows[1:]:
# Pad/truncate row to header width defensively.
cells = raw + [""] * (len(header) - len(raw))
def cell(i: int | None) -> str:
return cells[i].strip() if i is not None and 0 <= i < len(cells) else ""
brand = cell(i_brand)
product = cell(i_product)
traits = " ".join(
t for t in (cell(i) for i in i_traits)
if t and _norm(t) != "none"
).strip()
metrics: dict[str, float | str | None] = {}
for name, idx in metric_cols:
raw_val = cell(idx)
num = _to_num(raw_val)
if num is not None:
metrics[name] = num
elif raw_val and raw_val != "-":
metrics[name] = raw_val
# else: leave the column out (empty)
res = TrialResult(brand=brand, product=product, traits=traits, metrics=metrics)
if _row_ok(res):
results.append(res)
# Re-sort by Yield DESC (page is already sorted, but don't trust it),
# then assign rank. Rows with no numeric Yield sink to the bottom.
def _ysort(r: TrialResult) -> tuple[int, float]:
y = r.metrics.get("Yield")
if isinstance(y, (int, float)):
return (0, -float(y))
return (1, 0.0)
results.sort(key=_ysort)
for n, r in enumerate(results, start=1):
r.rank = n
return results, sites, _experiment_mean(soup)
def _row_ok(r: TrialResult) -> bool:
"""Per-row sanity gate. A sound entry has a real (non-numeric)
company/brand, a product code, and a plausible bu/a Yield. Drops
summary/blank rows and any leaked aggregate line."""
brand = (r.brand or "").strip()
product = (r.product or "").strip()
if not brand or brand.isdigit():
return False
if _norm(brand) in ("summary", "experiment mean", "minimum mean",
"maximum mean", "lsd", "coefficient of variability"):
return False
if not product:
return False
y = r.metrics.get("Yield")
# Corn runs ~120-280 bu/a, soy ~30-90; gate generously but reject
# garbage / a moisture/RM value that leaked into the Yield slot.
if not isinstance(y, (int, float)) or not (10 < float(y) < 400):
return False
return True
# --------------------------------------------------------------------- fetch
def source_key_for(crop: str, year: int, district_slug: str, season_slug: str) -> str:
return f"icpt-{crop}-{year}-{district_slug}-{season_slug}"
def fetch_view(
http: RateLimitedSession,
*,
crop: str,
year: int,
district: str, # radio value "1"/"2"/"3"
season: str, # radio value "1"/"2"
) -> DistrictTrial | None:
"""GET the district page (for viewstate), then POST the filter form
to switch to the requested year/district/season. Returns a parsed
DistrictTrial, or None if the table is empty for that combination."""
url, rm_mg_label = CROPS[crop]
district_slug, district_label = DISTRICTS[district]
season_slug, season_label = SEASONS[season]
seed = http.get(url)
seed.raise_for_status()
seed_soup = BeautifulSoup(seed.text, "html.parser")
payload = _hidden_fields(seed_soup)
payload[C_YEAR] = str(year)
payload[C_DISTRICT] = district
payload[C_SEASON] = season
payload[C_SHOW] = "yield" # yield view carries Yield/Yldp/Moist + per-SITE yields
payload[C_FILTER] = "Filter"
resp = http.post(url, data=payload)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
results, sites, mean = parse_district_table(soup, rm_mg_label=rm_mg_label)
if not results:
return None
return DistrictTrial(
source_key=source_key_for(crop, year, district_slug, season_slug),
source_url=url,
crop=crop,
year=year,
district_slug=district_slug,
district_label=district_label,
season_slug=season_slug,
season_label=season_label,
sites=sites,
experiment_mean=mean,
results=results,
)
# --------------------------------------------------------------------- render
def render_markdown(t: DistrictTrial) -> str:
crop_label = {"corn": "Corn", "soybeans": "Soybean"}.get(t.crop, t.crop.title())
head: list[str] = [
f"# {crop_label} yield trial — Iowa {t.district_label} District "
f"({t.season_label}), {t.year}",
"",
"- **Source:** Iowa Crop Performance Tests (independent third-party trial)",
"- **Publisher:** Iowa State University / Iowa Crop Improvement Association",
f"- **Crop:** {crop_label}",
"- **State:** IA",
f"- **District:** {t.district_label}",
f"- **Maturity season:** {t.season_label}",
f"- **Year:** {t.year}",
]
if t.experiment_mean is not None:
head.append(f"- **Experiment mean yield:** {t.experiment_mean} bu/a")
if t.sites:
head.append(f"- **Cooperator sites:** {', '.join(t.sites)}")
head += [f"- **URL:** {t.source_url}", "", "---", ""]
# Discover metric column order from the first result with metrics.
metric_keys: list[str] = []
for r in t.results:
if r.metrics:
metric_keys = list(r.metrics.keys())
break
sections: list[str] = ["## Results (by yield, all brands)", ""]
headers = ["Rank", "Company", "Entry", "Traits"] + metric_keys
sections.append("| " + " | ".join(headers) + " |")
sections.append("|" + "|".join(["---"] * len(headers)) + "|")
for r in t.results:
row = [
str(r.rank) if r.rank is not None else "-",
r.brand or "-",
r.product or "-",
r.traits or "-",
]
for k in metric_keys:
v = r.metrics.get(k)
row.append("-" if v is None else str(v))
sections.append("| " + " | ".join(row) + " |")
sections.append("")
# Compact top-5 line for embedder signal.
top = [r for r in t.results if isinstance(r.metrics.get("Yield"), (int, float))][:5]
if top:
bits = [f"{r.product} ({r.brand}) {r.metrics['Yield']}" for r in top]
sections.append(f"Top 5 by Yield: " + ", ".join(bits) + ".")
sections.append("")
return "\n".join(head) + "\n".join(sections)
# --------------------------------------------------------------------- write
def write_trial(t: DistrictTrial, body_md: str) -> None:
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
(CORPUS_DIR / f"{t.source_key}.md").write_text(body_md, encoding="utf-8")
sidecar = {
"source": "iowa_icpt_trials",
"source_key": t.source_key,
"data_type": "trial",
"vendor": "Iowa State University",
"brand_aggregator": "Iowa Crop Performance Tests publishes",
"brand": "Iowa Crop Performance Tests",
"crop": t.crop,
"state": "IA",
"state_abbrev": "ia",
"year": t.year,
"region": f"District {t.district_label}",
"district": t.district_label,
"season": t.season_label,
"cooperator_sites": t.sites,
"experiment_mean_yield": t.experiment_mean,
"results": [
{
"rank": r.rank,
"brand": r.brand,
"product": r.product,
"traits": r.traits,
"metrics": r.metrics,
}
for r in t.results
],
"n_results": len(t.results),
"source_urls": [t.source_url],
"tos_note": TOS_NOTE,
"fetched_at": datetime.now(timezone.utc).isoformat(),
"scraper_version": SCRAPER_VERSION,
}
(CORPUS_DIR / f"{t.source_key}.json").write_text(
json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
encoding="utf-8",
)
# --------------------------------------------------------------------- pipeline
def run(
*,
crops: set[str],
years: list[int],
limit: int | None,
force: bool,
) -> int:
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
http = RateLimitedSession()
counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
processed = 0
targets: list[tuple[str, int, str, str]] = []
for crop in sorted(crops):
for year in years:
for district in DISTRICTS: # 1/2/3
for season in SEASONS: # 1/2
targets.append((crop, year, district, season))
log.info("planned %d (crop x year x district x season) targets", len(targets))
for crop, year, district, season in targets:
if limit is not None and processed >= limit:
break
district_slug = DISTRICTS[district][0]
season_slug = SEASONS[season][0]
sk = source_key_for(crop, year, district_slug, season_slug)
md_path = CORPUS_DIR / f"{sk}.md"
if md_path.exists() and not force:
counts["skipped"] += 1
continue
processed += 1
try:
trial = fetch_view(http, crop=crop, year=year,
district=district, season=season)
except Exception as exc: # noqa: BLE001
counts["failed"] += 1
log.error("[%s] fetch failed: %s", sk, exc)
continue
if trial is None:
counts["empty"] += 1
log.info("[%s] empty table (no entries) — skipping", sk)
continue
write_trial(trial, render_markdown(trial))
counts["written"] += 1
log.info("[%s] written | %d entries | %d sites | brands=%d",
sk, len(trial.results), len(trial.sites),
len({r.brand for r in trial.results}))
log.info("done: written=%d skipped=%d empty=%d failed=%d (processed=%d)",
counts["written"], counts["skipped"], counts["empty"],
counts["failed"], processed)
return 0 if counts["failed"] == 0 else 1
# --------------------------------------------------------------------- CLI
def _build_argparser() -> argparse.ArgumentParser:
p = argparse.ArgumentParser(
prog="scrape.sources.iowa_icpt_trials",
description="Scrape Iowa Crop Performance Tests (ICPT) cross-vendor "
"yield trials (corn + soybean district tables).",
)
p.add_argument("--year", type=int, default=None,
choices=tuple(BASELINE_YEARS + OLD_YEARS),
help="Limit to a single year (default: 2024+2025 baseline).")
p.add_argument("--include-old", action="store_true",
help="Also scrape 2014-2023 (deferred by default).")
p.add_argument("--crop", default=None, choices=tuple(CROPS.keys()),
help="Limit to one crop (default: both).")
p.add_argument("--limit", type=int, default=None,
help="Stop after writing/processing N tables (default: all).")
p.add_argument("--force", action="store_true",
help="Re-fetch even if the markdown file already exists.")
p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
return p
def main(argv: list[str] | None = None) -> int:
args = _build_argparser().parse_args(argv)
logging.basicConfig(
level=args.log_level.upper(),
format="%(asctime)s %(levelname)s %(name)s %(message)s",
stream=sys.stderr,
)
crops = {args.crop} if args.crop else set(CROPS.keys())
if args.year is not None:
years = [args.year]
elif args.include_old:
years = sorted(set(OLD_YEARS + BASELINE_YEARS))
else:
years = list(BASELINE_YEARS)
return run(crops=crops, years=years, limit=args.limit, force=args.force)
if __name__ == "__main__":
sys.exit(main())
File diff suppressed because it is too large Load Diff