Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)
Independent third-party performance data — land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker).

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois — true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials).

Validation: all 123 docs chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.
@@ -0,0 +1,945 @@
"""University of Illinois Variety Testing — cross-vendor yield trials.

The University of Illinois Crop Sciences Variety Testing program
(``vt.cropsci.illinois.edu``) is a long-running, independent land-grant
testing program. Seed companies pay an entry fee to enter hybrids /
varieties; UIUC plants them in replicated regional trials and publishes
the results. Because the program is third-party and cross-vendor, a
single regional table ranks Pioneer / Brevant / DEKALB / Channel /
Burrus / Stine / Viking and dozens of others head-to-head — the
legitimate, independent way to get major-brand performance the
single-vendor corpus can't scrape directly.

This is a ``data_type: "trial"`` source (NOT variety identity). It emits
the same per-site cross-vendor sidecar shape as ``gh_plot_reports`` /
``agrigold_plot_reports`` / ``proharvest_plots`` (``results: [{rank,
brand, product, traits, metrics}]``), so it routes through the shared
``_render_gh_plot_chunk`` renderer in ``rag/chunk.py`` rather than
needing a renderer of its own. The published corn and soybean tables
list entries alphabetically by company, so we **synthesize ``rank`` by
sorting on Yield descending** (highest yield = rank 1) per document.
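
A minimal sketch of that rank synthesis, assuming each result is a dict
that carries ``metrics["Yield"]``. ``synthesize_rank`` and the sample
brands are hypothetical and are not part of this module:

```python
def synthesize_rank(results):
    # Sort a copy by yield, highest first, then number the rows from 1.
    ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
    for i, r in enumerate(ordered, start=1):
        r["rank"] = i
    return ordered


# Made-up rows in the alphabetical order a published table would use.
rows = [
    {"brand": "Brand A", "product": "A1", "metrics": {"Yield": 231.4}},
    {"brand": "Brand B", "product": "B7", "metrics": {"Yield": 244.9}},
    {"brand": "Brand C", "product": "C3", "metrics": {"Yield": 238.0}},
]
ranked = synthesize_rank(rows)
```

Ties keep their published (alphabetical) order because ``sorted`` is
stable.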

Data layout (verified Nov 2025):
- Static XLSX (+ PDF) per region per year at WordPress upload URLs.
  The month segment of the upload path varies (``/2025/11/``,
  ``/2025/12/``, ``/2025/07/`` for wheat), so we DISCOVER the .xlsx
  hrefs by fetching the /corn/, /soybeans/, /wheat/ index pages — we
  never guess URLs.
- Corn regional tables: ``Company | Name | IST1 | GT2(+spill) | HT3 |
  Relative Maturity | Yield bu/a | Moisture % | Lodging | <per-site
  cols> | 2-yr Avg | 3-yr Avg``. Per-site metadata lives in a separate
  "Trial Info" sheet (regional tables) or in trailing columns
  (single-site CFC tables).
- Soybean regional tables: ``COMPANY | NAME | Herbicide Trait1 | ST2 |
  Yield bu/a | Maturity Date | Lodging | Height | <2yr/3yr Yield> |
  Protein @13% | Oil @13%`` with the per-site metadata block in
  trailing columns.
- Wheat regional means tables: ``Company | Variety | ST1 | Yield |
  Yield Rank | Test wt. | Height | <per-site Yield/Test wt.> |
  Maturity date | Jointing time | FHB Score | FHB Category``. Wheat
  publishes its own Yield Rank, which we honor.
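
The discover-don't-guess step can be sketched with the standard library
alone. This is an illustration under assumptions: the module itself uses
BeautifulSoup, ``XlsxLinkParser`` / ``discover_xlsx`` are hypothetical
names, and the upload paths in ``SAMPLE_HTML`` are made up:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsxLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep only spreadsheet links, deduped in page order.
        if href.lower().endswith(".xlsx") and href not in self.hrefs:
            self.hrefs.append(href)


def discover_xlsx(index_url, html):
    parser = XlsxLinkParser()
    parser.feed(html)
    # Relative hrefs resolve against the index page; absolute ones pass through.
    return [urljoin(index_url, h) for h in parser.hrefs]


SAMPLE_HTML = (
    '<a href="/wp-content/uploads/2025/11/2025-Region-1.xlsx">Region 1</a>'
    '<a href="/wp-content/uploads/2025/11/2025-Region-1.pdf">PDF</a>'
    '<a href="https://vt.cropsci.illinois.edu/wp-content/uploads/2025/07/West-Region.xlsx">West</a>'
)
links = discover_xlsx("https://vt.cropsci.illinois.edu/wheat/", SAMPLE_HTML)
```

Only hrefs that actually appear on the index page are returned, so a
change in the month segment of the upload path needs no code change.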

The variety table sits below a 2-4 row header band (a group-header row,
a column-name row, and a units row). Columns are positionally stable
within a sheet (multi-word brands like "Viking | Blue River" live in a
single COMPANY cell), so we anchor on the header row that carries
"Company" + "Name"/"Variety" and map the metric columns from the
column-name + units rows. The leftmost "Yield (bu/a)" column is the
**regional** yield — the primary metric.
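
A simplified sketch of that anchoring, assuming rows arrive as tuples of
cell values (as ``openpyxl`` yields with ``values_only=True``).
``find_header_row``, ``leftmost_yield_col`` and ``SHEET`` are
illustrative stand-ins, not the module's own helpers:

```python
def find_header_row(rows, scan=15):
    # The column-name row carries "Company" in col 0 and "Name"/"Variety" in col 1.
    for i, row in enumerate(rows[:scan]):
        c0 = str(row[0] or "").strip().lower() if len(row) > 0 else ""
        c1 = str(row[1] or "").strip().lower() if len(row) > 1 else ""
        if c0 == "company" and c1 in ("name", "variety"):
            return i
    return None


def leftmost_yield_col(rows, hdr_i):
    # Look across the band: the row above, the header row, the row below.
    band = rows[max(hdr_i - 1, 0): hdr_i + 2]
    width = max(len(r) for r in band)
    for col in range(width):
        cells = {str(r[col] or "").strip().lower() for r in band if col < len(r)}
        if "yield" in cells or "bu/a" in cells:
            return col  # first hit from the left is the regional yield
    return None


SHEET = [
    ("2025 Region 1 corn", None, None, None, None),
    (None, None, None, "Yield", "Yield"),
    ("Company", "Name", "Relative Maturity", "bu/a", "Urbana"),
    ("Brand A", "A1", 110, 231.4, 228.0),
]
hdr = find_header_row(SHEET)
ycol = leftmost_yield_col(SHEET, hdr)
```

Scanning left to right is what keeps the per-site yield repeats from
being mistaken for the regional figure.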

Section markers interleaved in the table ("Early RM", "Any RM",
"Non-GMO Hybrids", "Early MG: 1.9-2.7", "Late MG:") and summary rows
("Average", "L.S.D 25% Level", "CV (%)") are skipped — a data row must
have a real company, a variety name, and a plausible numeric yield.
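
The three-part test can be illustrated as follows. This is a hedged
sketch: the module matches markers with regexes and applies per-crop
yield windows, whereas ``is_data_row`` and ``SKIP_PREFIXES`` here are
simplified, hypothetical stand-ins with one generic window:

```python
SKIP_PREFIXES = ("average", "l.s.d", "cv (%)", "early rm", "any rm",
                 "non-gmo", "early mg", "late mg")


def is_data_row(company, name, yield_cell):
    company = (company or "").strip()
    name = (name or "").strip()
    # Section markers usually fill one cell only, so a missing name rejects them.
    if not company or not name:
        return False
    # Summary rows start with a known label.
    if company.lower().startswith(SKIP_PREFIXES) or name.lower().startswith(SKIP_PREFIXES):
        return False
    try:
        y = float(yield_cell)
    except (TypeError, ValueError):
        return False  # missing-value markers such as "*.*" land here
    return 0 < y < 500
```

For example, ``is_data_row("Brand A", "A1", 231.4)`` passes, while a bare
``"Early RM"`` marker row or an ``"Average"`` summary row does not.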

Rotation distinction: regional tables are corn-following-soybean (the
default rotation); "CFC" / "Corn Following Corn" single-site tables
(Monmouth, Urbana, DeKalb) get ``previous_crop="corn"`` and a distinct
``-cfc`` source_key suffix so they never collide with the regional table.

robots/ToS: no usage terms posted on the VT site (publicly-funded
land-grant; companies pay an entry fee, which doesn't restrict published
result reuse). Polite UA + light rate limit.

Output:
    corpus/illinois_vt_trials/<source_key>.md      LLM-visible body
    corpus/illinois_vt_trials/<source_key>.json    sidecar metadata

source_key: ``ilvt-<crop>-<year>-r<region>`` e.g. ``ilvt-corn-2025-r1``;
CFC single-site tables: ``ilvt-corn-2024-monmouth-cfc``.
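
A small sketch of how such a key can be assembled from a region label,
assuming numbered regions shorten to ``r<n>`` and any other label is
slugified. ``make_source_key`` is a hypothetical helper, not a function
in this module:

```python
import re


def make_source_key(crop, year, region):
    # "Region 3" -> "r3"; anything else becomes a lowercase hyphenated slug.
    m = re.fullmatch("region ([1-9])", region.lower())
    if m:
        slug = "r" + m.group(1)
    else:
        slug = re.sub("[^a-z0-9]+", "-", region.lower()).strip("-")
    return f"ilvt-{crop}-{year}-{slug}"


key_numbered = make_source_key("corn", 2025, "Region 1")
key_named = make_source_key("wheat", 2025, "West Region")
```

Because crop, year and region are all part of the key, two tables
collide only when they describe the same trial.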

CLI:
    python -m scrape.sources.illinois_vt_trials --year 2025 --limit 2
    python -m scrape.sources.illinois_vt_trials --crop corn --force
    python -m scrape.sources.illinois_vt_trials --include-old --force
"""

from __future__ import annotations

import argparse
import io
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Any
from urllib.parse import urljoin

import openpyxl
import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://vt.cropsci.illinois.edu"

TOS_NOTE = (
    "No usage terms posted on UIUC VT site; publicly-funded land-grant "
    "data; attribute University of Illinois Variety Testing."
)

BASELINE_YEARS = {2024, 2025}
OLD_YEAR_MIN = 2000
OLD_YEAR_MAX = 2023

# Index pages per crop. PLURAL "soybeans" is the corpus crop value.
CROP_INDEX = {
    "corn": "/corn/",
    "soybeans": "/soybeans/",
    "wheat": "/wheat/",
}

REQ_INTERVAL_SEC = 1.0

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "illinois_vt_trials"

log = logging.getLogger("scrape.illinois_vt_trials")


# --------------------------------------------------------------------- HTTP


class RateLimitedSession:
    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(self, method: str, url: str, *, max_retries: int = 4,
                timeout: float = 60.0, **kw: Any) -> requests.Response:
        last_exc: Exception | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = float(ra) if (ra and ra.isdigit()) else min(
                    30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc:
            raise last_exc
        return resp  # type: ignore[return-value]

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)


# --------------------------------------------------------------------- model


@dataclass
class TrialDoc:
    source_key: str
    crop: str  # corn / soybeans / wheat
    year: int
    region: str  # e.g. "Region 1", "Monmouth CFC"
    xlsx_url: str
    index_url: str

    rotation: str | None = None  # "corn following soybean" / "corn following corn"
    previous_crop: str | None = None  # "corn" for CFC tables
    cooperator: str | None = None  # site host
    county: str | None = None
    soil_type: str | None = None
    tillage: str | None = None
    planted_date: str | None = None
    harvested_date: str | None = None
    row_width: str | None = None
    latitude: float | None = None
    longitude: float | None = None
    sites: list[str] = field(default_factory=list)

    results: list[dict] = field(default_factory=list)


# --------------------------------------------------------------------- discovery


# Year/Region extraction from a file name or upload path. The VT site has
# used many naming schemes across years, so try several.
_YEAR_RE = re.compile(r"(20\d{2})")
_REGION_NUM_RE = re.compile(r"region[\s_-]*([1-9])", re.I)


def _norm_url(href: str) -> str:
    """Resolve relative hrefs against BASE. Absolute URLs, including links
    to the legacy ``varietytesting.web.illinois.edu`` host (same WP
    install), are returned unchanged."""
    href = href.strip()
    if href.startswith("http"):
        return href
    return urljoin(BASE + "/", href.lstrip("/"))


def _file_year(url: str) -> int | None:
    """Harvest year of a file. Prefer the upload-path year segment
    (``/uploads/2025/11/...`` or ``/uploads/corn/2021/...``); fall back to
    the first 20xx in the file name."""
    m = re.search(r"/uploads/(?:[a-z]+/)?(20\d{2})/", url)
    if m:
        return int(m.group(1))
    fn = url.rsplit("/", 1)[-1]
    m = _YEAR_RE.search(fn)
    return int(m.group(1)) if m else None


def _classify_region(url: str) -> tuple[str, str | None, str | None] | None:
    """Return ``(region_label, rotation, previous_crop)`` for a file, or
    None if it isn't a per-region/per-site VARIETY table we ingest.

    We INCLUDE regional tables (Region 1-5, North/South/East/West) and
    single-site CFC (corn-following-corn) tables. We EXCLUDE entry lists,
    agronomic-characteristic sheets, FHB/scab summaries, and disease
    tables (those aren't head-to-head yield rankings)."""
    fn = url.rsplit("/", 1)[-1]
    low = fn.lower()

    # Exclusions — not cross-vendor yield ranking tables.
    EXCLUDE = ("entry", "entries", "agronomic", "charateristic",  # sic (site typo)
               "characteristic", "scab", "fhb", "disease")
    if any(tok in low for tok in EXCLUDE):
        return None

    # Numbered regions: "Region 1", "2025-Region-1".
    m = _REGION_NUM_RE.search(low)
    if m:
        return (f"Region {m.group(1)}", "corn following soybean", None)

    # Named regions (corn 2022 + wheat use compass names), including the
    # legacy "northtbl20"-style file names.
    for name, label in (("north", "North Region"), ("south", "South Region"),
                        ("east", "East Region"), ("west", "West Region")):
        if re.search(rf"\b{name}\b", low) or low.startswith(name + "tbl"):
            return (label, "corn following soybean", None)

    # Single-site Corn-Following-Corn tables (Monmouth / Urbana / DeKalb).
    cfc = re.search(r"([a-z]+)[\s_-]*cfc", low)
    if cfc or "cfc" in low:
        site = (cfc.group(1).title() if cfc and cfc.group(1) else "CFC")
        return (f"{site} CFC", "corn following corn", "corn")

    # Wheat single-site summary tables (e.g. "2025-Urbana-Summary",
    # "2024-Elkville-Table"). Capture the site name as the region.
    m = re.search(r"20\d{2}[\s_-]+([a-z]+)[\s_-]+(?:summary|table)", low)
    if m:
        return (m.group(1).title(), None, None)
    m = re.search(r"([a-z]+)[\s_-]+(?:summary|table)[\s_-]*20\d{2}", low)
    if m and m.group(1) not in ("region", "regional"):
        return (m.group(1).title(), None, None)

    return None


def discover_files(http: RateLimitedSession, *, crops: set[str],
                   years: set[int]) -> list[TrialDoc]:
    """Fetch each crop index page, extract .xlsx hrefs, classify them into
    per-region/per-site variety tables, and keep the ones in scope."""
    out: list[TrialDoc] = []
    seen_keys: set[str] = set()
    for crop in sorted(crops):
        index_path = CROP_INDEX.get(crop)
        if not index_path:
            log.warning("unknown crop %r, skipping", crop)
            continue
        index_url = BASE + index_path
        log.info("GET %s", index_url)
        r = http.get(index_url)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")
        xlsx_hrefs = [a["href"] for a in soup.find_all("a", href=True)
                      if a["href"].lower().endswith(".xlsx")]
        # Dedupe while preserving order.
        seen_href: set[str] = set()
        for href in xlsx_hrefs:
            url = _norm_url(href)
            if url in seen_href:
                continue
            seen_href.add(url)
            year = _file_year(url)
            if year is None or year not in years:
                continue
            cls = _classify_region(url)
            if cls is None:
                continue
            region, rotation, prev_crop = cls
            # source_key: ilvt-<crop>-<year>-r<n> for numbered regions,
            # else a slug of the region name.
            mnum = _REGION_NUM_RE.search(url.rsplit("/", 1)[-1].lower())
            if mnum and "cfc" not in region.lower():
                region_slug = f"r{mnum.group(1)}"
            else:
                region_slug = re.sub(r"[^a-z0-9]+", "-",
                                     region.lower()).strip("-")
            sk = f"ilvt-{crop}-{year}-{region_slug}"
            if sk in seen_keys:
                # Two files map to the same key (e.g. a "-final" + a "-1"
                # duplicate). Keep the first; log the collision.
                log.info("duplicate source_key %s from %s — skipping dupe",
                         sk, url)
                continue
            seen_keys.add(sk)
            out.append(TrialDoc(
                source_key=sk, crop=crop, year=year, region=region,
                xlsx_url=url, index_url=index_url,
                rotation=rotation, previous_crop=prev_crop,
            ))
        log.info(" %s: %d in-scope variety tables", crop,
                 sum(1 for d in out if d.crop == crop))
    return out


# --------------------------------------------------------------------- parse


def _to_num(v: Any) -> float | int | None:
    """Coerce a cell to a number. Strips commas; returns None for the VT
    missing-value markers ('*', '*.*', '-', '') and non-numeric text."""
    if v is None:
        return None
    if isinstance(v, bool):
        return None
    if isinstance(v, (int, float)):
        f = float(v)
        return int(f) if f.is_integer() else f
    s = str(v).strip()
    if not s or s in ("*", "*.*", "-", "—", "."):
        return None
    s = s.replace(",", "")
    if not re.match(r"^-?\d+(?:\.\d+)?$", s):
        return None
    f = float(s)
    return int(f) if f.is_integer() else f


def _iso_date(v: Any) -> str | None:
    if isinstance(v, datetime):
        return v.date().isoformat()
    if isinstance(v, date):
        return v.isoformat()
    if v is None:
        return None
    s = str(v).strip()
    if not s or s.lower().startswith("did not"):
        return None
    # ISO YYYY-MM-DD (with optional trailing " 00:00:00" time).
    m = re.match(r"^(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T].*)?$", s)
    if m:
        yr, mo, dy = m.groups()
        try:
            return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
        except ValueError:
            return None
    # US MM/DD/YYYY.
    m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{2,4})$", s)
    if m:
        mo, dy, yr = m.groups()
        if len(yr) == 2:
            yr = "20" + yr
        try:
            return f"{int(yr):04d}-{int(mo):02d}-{int(dy):02d}"
        except ValueError:
            return None
    return None


def _txt(v: Any) -> str:
    if v is None:
        return ""
    if isinstance(v, datetime):
        return v.date().isoformat()
    return str(v).strip()


def _norm(s: Any) -> str:
    return re.sub(r"\s+", " ", _txt(s)).strip().lower().rstrip(".")


# Section markers / summary labels that are NOT data rows.
_NONDATA_NAME = re.compile(
    r"^(average|avg\.?|l\.?s\.?d\.?|c\.?v\.?|coeff|mean|std|"
    r"early mg|late mg|early rm|any rm|non-?gmo|conventional|"
    r"public|check)\b", re.I)
_SECTION_COMPANY = re.compile(
    r"^(early rm|any rm|late rm|early mg|late mg|non-?gmo|conventional|"
    r"gmo hybrids?|hybrids?)\b", re.I)


def _find_header_row(rows: list[tuple]) -> int | None:
    """Index of the column-name row — the one carrying 'Company' (col 0)
    and a 'Name' / 'Variety' (col 1)-ish header."""
    for i, row in enumerate(rows[:15]):
        c0 = _norm(row[0] if len(row) > 0 else "")
        c1 = _norm(row[1] if len(row) > 1 else "")
        if c0 == "company" and c1 in ("name", "variety"):
            return i
    return None


def _build_colmap(rows: list[tuple], hdr_i: int) -> dict[str, int]:
    """Map metric -> column index by merging the header band: the
    column-name row (hdr_i), the row below it, and the group-header row
    above it. Layouts vary — corn carries the units (bu/a, %) IN the
    header row with the Yield/Moisture/Lodging labels in the group row
    above; soy/wheat carry the units in the row below. We want the
    REGIONAL (leftmost) Yield, not the per-site repeats, so we take the
    leftmost yield-units column as the primary Yield.

    Returns keys among: company, name, herb_trait, gt, ist, st, maturity,
    yield, lodging, height, moisture, protein, oil, rank, testwt,
    yield_2yr, yield_3yr.
    """
    name_row = rows[hdr_i]
    below_row = rows[hdr_i + 1] if hdr_i + 1 < len(rows) else ()
    group_row = rows[hdr_i - 1] if hdr_i - 1 >= 0 else ()

    def g(row: tuple, i: int) -> str:
        return _norm(row[i]) if i < len(row) else ""

    ncols = max(len(name_row), len(below_row), len(group_row))

    def band(i: int) -> tuple[str, str, str]:
        """(group-above, header, below) normalized text for column i."""
        return (g(group_row, i), g(name_row, i), g(below_row, i))

    cm: dict[str, int] = {"company": 0, "name": 1}

    # Identity / trait columns — these sit on the header (column-name) row.
    for i in range(ncols):
        nm = g(name_row, i)
        if nm in ("herbicide trait1", "herbicide trait", "ht3", "ht"):
            cm.setdefault("herb_trait", i)
        elif nm in ("gt2", "gt"):
            cm.setdefault("gt", i)
        elif nm in ("ist1", "ist"):
            cm.setdefault("ist", i)
        elif nm in ("st1", "st2", "st"):
            cm.setdefault("st", i)
        elif nm in ("relative", "maturity", "relative maturity", "maturity date"):
            cm.setdefault("maturity", i)
        elif nm in ("yield rank", "rank"):
            cm.setdefault("rank", i)

    # Metric columns — match across the whole band. A column is a Yield
    # column if any band row says "yield" OR carries a bu-per-acre unit.
    yield_cols: list[int] = []
    moisture_cols: list[int] = []
    testwt_cols: list[int] = []
    lodging_cols: list[int] = []
    height_cols: list[int] = []
    protein_cols: list[int] = []
    oil_cols: list[int] = []
    maturity_cols: list[int] = []
    YIELD_UNITS = {"bu/a", "bu/ac", "bu/acre"}
    for i in range(ncols):
        gp, nm, bl = band(i)
        cells = {gp, nm, bl}
        if "yield" in cells or cells & YIELD_UNITS:
            yield_cols.append(i)
        if "moisture" in cells:
            moisture_cols.append(i)
        if {"test wt", "test weight"} & cells or "lb/bu" in cells:
            testwt_cols.append(i)
        if "lodging" in cells:
            lodging_cols.append(i)
        if "height" in cells:
            height_cols.append(i)
        if "protein" in cells:
            protein_cols.append(i)
        if "oil" in cells:
            oil_cols.append(i)
        if {"relative", "maturity", "relative maturity", "maturity date"} & cells:
            maturity_cols.append(i)

    if yield_cols:
        cm["yield"] = yield_cols[0]
        # 2yr / 3yr averages: yield-unit columns whose group header says avg.
        for i in yield_cols[1:]:
            grp = g(group_row, i)
            if "2-yr" in grp or "2 yr" in grp or "2yr" in grp:
                cm.setdefault("yield_2yr", i)
            elif "3-yr" in grp or "3 yr" in grp or "3yr" in grp:
                cm.setdefault("yield_3yr", i)
    if moisture_cols:
        cm["moisture"] = moisture_cols[0]
    if testwt_cols:
        cm["testwt"] = testwt_cols[0]
    if lodging_cols:
        cm["lodging"] = lodging_cols[0]
    if height_cols:
        cm["height"] = height_cols[0]
    if protein_cols:
        cm["protein"] = protein_cols[0]
    if oil_cols:
        cm["oil"] = oil_cols[0]
    if "maturity" not in cm and maturity_cols:
        cm["maturity"] = maturity_cols[0]

    return cm


# The metadata block always carries these labels; we locate its starting
# column by finding where they appear, so the per-site yield columns that
# sit between the metric block and the metadata block don't get scanned as
# labels (they lead with numbers, not a label).
_META_LABELS = {
    "location", "county", "site location", "host", "soil type",
    "planting date", "harvest date", "nitrogen applied", "pesticides",
    "tillage", "latitude", "longitude", "rainfall", "fungicide",
}


def _scan_meta_labels(src: list[tuple]) -> dict[str, list[str]]:
    """From a label/value block, build ``{normalized_label: [values...]}``.

    First find the column where the metadata labels live (the leftmost
    column that holds a known metadata label in some row); the label in
    each row is the first non-empty cell AT OR AFTER that column, and the
    values are the non-empty cells to its right. This skips the per-site
    yield columns that can sit to the left of the metadata block."""
    # Find the metadata-label column.
    label_col: int | None = None
    for row in src:
        for i, c in enumerate(row):
            if _norm(c) in _META_LABELS:
                if label_col is None or i < label_col:
                    label_col = i
                break
    found: dict[str, list[str]] = {}
    for row in src:
        label = None
        label_idx = None
        start = label_col if label_col is not None else 0
        for i in range(start, len(row)):
            t = _txt(row[i])
            if t:
                label = t
                label_idx = i
                break
        if label is None:
            continue
        key = _norm(label)
        values = [_txt(c) for i, c in enumerate(row)
                  if i > (label_idx or 0) and _txt(c)]
        if values and key not in found:  # keep the first occurrence
            found[key] = values
    return found


def _first_real(values: list[str]) -> str | None:
    for v in values:
        if v and v.lower() not in ("click to see map", "click here for directions"):
            return v
    return None


def _apply_site_metadata(doc: TrialDoc, found: dict[str, list[str]], *,
                         crop: str) -> None:
    """Apply a scanned metadata block to the doc, filling only fields that
    are still unset (so the first/preferred source wins)."""
    def setif(attr: str, val: Any) -> None:
        if val and getattr(doc, attr) is None:
            setattr(doc, attr, val)

    if "host" in found:
        setif("cooperator", _first_real(found["host"]))
    if "location" in found and not doc.sites:
        doc.sites = [v for v in found["location"]
                     if v and v.lower() != "click to see map"]
    if "county" in found:
        setif("county", _first_real(found["county"]))
    if "soil type" in found:
        setif("soil_type", _first_real(found["soil type"]))
    if "planting date" in found:
        setif("planted_date", _iso_date(_first_real(found["planting date"])))
    if "harvest date" in found:
        setif("harvested_date", _iso_date(_first_real(found["harvest date"])))
    if "tillage" in found:
        setif("tillage", _first_real(found["tillage"]))
    elif "spring" in found and crop != "wheat":
        # Corn/soy: "Spring"/"Fall" are tillage operations. Wheat: those
        # same labels are nitrogen rates — never tillage.
        setif("tillage", _first_real(found["spring"]))
    if "latitude" in found:
        lat = _to_num(_first_real(found["latitude"]) or "")
        if isinstance(lat, (int, float)):
            setif("latitude", float(lat))
    if "longitude" in found:
        lon = _to_num(_first_real(found["longitude"]) or "")
        if isinstance(lon, (int, float)):
            setif("longitude", float(lon))


def _assemble_traits(row: tuple, cm: dict[str, int]) -> str:
    """Combine the herbicide-trait + GT (genetic trait, may spill across
    cols) + seed-treatment columns into one traits string."""
    bits: list[str] = []
    # GT can spill from its col up to (but not including) the herb_trait col.
    if "gt" in cm:
        gt_start = cm["gt"]
        gt_end = cm.get("herb_trait", cm.get("maturity", gt_start + 1))
        gt_toks = [_txt(row[i]) for i in range(gt_start, gt_end)
                   if i < len(row) and _txt(row[i])]
        if gt_toks:
            bits.append("GT:" + "".join(gt_toks))
    if "herb_trait" in cm:
        ht = _txt(row[cm["herb_trait"]]) if cm["herb_trait"] < len(row) else ""
        if ht:
            bits.append("HT:" + ht)
    for k, lbl in (("ist", "IST"), ("st", "ST")):
        if k in cm and cm[k] < len(row):
            v = _txt(row[cm[k]])
            if v:
                bits.append(f"{lbl}:{v}")
    return " ".join(bits)


def _is_data_row(row: tuple, cm: dict[str, int]) -> bool:
    company = _txt(row[0]) if len(row) > 0 else ""
    name = _txt(row[1]) if len(row) > 1 else ""
    if not company or not name:
        return False
    if _NONDATA_NAME.match(name) or _NONDATA_NAME.match(company):
        return False
    if _SECTION_COMPANY.match(company):
        return False
    # Must have a plausible numeric yield.
    y = _to_num(row[cm["yield"]]) if "yield" in cm and cm["yield"] < len(row) else None
    if not isinstance(y, (int, float)):
        return False
    return True


def _plausible_yield(crop: str, y: float) -> bool:
    if crop == "corn":
        return 30 < y < 400
    if crop == "soybeans":
        return 10 < y < 150
    if crop == "wheat":
        return 20 < y < 200
    return 0 < y < 500
|
||||
|
||||
|
||||
def parse_xlsx(content: bytes, doc: TrialDoc) -> None:
|
||||
wb = openpyxl.load_workbook(io.BytesIO(content), data_only=True,
|
||||
read_only=True)
|
||||
# The yield table is the first sheet whose first ~15 rows contain a
|
||||
# Company/Name header.
|
||||
data_ws = None
|
||||
data_rows: list[tuple] = []
|
||||
hdr_i = None
|
||||
for name in wb.sheetnames:
|
||||
rows = list(wb[name].iter_rows(values_only=True))
|
||||
hi = _find_header_row(rows)
|
||||
if hi is not None:
|
||||
data_ws, data_rows, hdr_i = name, rows, hi
|
||||
break
|
||||
if data_ws is None or hdr_i is None:
|
||||
raise ValueError("no Company/Name header row found in any sheet")
|
||||
|
||||
cm = _build_colmap(data_rows, hdr_i)
|
||||
if "yield" not in cm:
|
||||
raise ValueError("no Yield column located")
|
||||
|
||||
# Site metadata lives in (a) trailing columns of the data sheet
|
||||
# (co-located with the results — most current) and/or (b) a dedicated
|
||||
# "Trial Info" sheet. Read the trailing-column block FIRST so it wins,
|
||||
# then let the info sheet fill any gaps. (Some regional files carry a
|
||||
# stale info sheet — e.g. a 2025 table whose Trial Info sheet still
|
||||
# shows 2021 planting dates — so trailing columns are preferred.)
|
||||
# _scan_meta_labels self-locates the metadata-label column, so the
|
||||
# per-site yield columns between the metric block and the metadata
|
||||
# block aren't mis-read as labels.
|
||||
_apply_site_metadata(doc, _scan_meta_labels(data_rows), crop=doc.crop)
|
||||
info_sheet = next((s for s in wb.sheetnames
|
||||
if "trial info" in s.lower()
|
||||
or "trial information" in s.lower()), None)
|
||||
if info_sheet:
|
||||
_apply_site_metadata(
|
||||
doc,
|
||||
_scan_meta_labels(list(wb[info_sheet].iter_rows(values_only=True))),
|
||||
crop=doc.crop)
|
||||
|
||||
    results: list[dict] = []
    for row in data_rows[hdr_i + 2:]:
        if not _is_data_row(row, cm):
            continue
        y = _to_num(row[cm["yield"]])
        if not isinstance(y, (int, float)) or not _plausible_yield(doc.crop, float(y)):
            continue
        brand = _txt(row[0])
        product = _txt(row[1])
        # Sanity: a numeric/blank brand is junk.
        if not brand or brand.isdigit() or len(brand) <= 1:
            continue
        metrics: dict[str, Any] = {"Yield": y}
        for key, label in (("moisture", "Moisture"), ("lodging", "Lodging"),
                           ("height", "Height"), ("testwt", "Test Wt."),
                           ("protein", "Protein"), ("oil", "Oil"),
                           ("maturity", "Maturity"),
                           ("yield_2yr", "Yield 2yr avg"),
                           ("yield_3yr", "Yield 3yr avg")):
            if key in cm and cm[key] < len(row):
                v = _to_num(row[cm[key]])
                if v is not None:
                    metrics[label] = v
        rec_rank = None
        if "rank" in cm and cm["rank"] < len(row):
            rk = _to_num(row[cm["rank"]])
            if isinstance(rk, (int, float)):
                rec_rank = int(rk)
        results.append({
            "rank": rec_rank,          # synthesized below if None
            "brand": brand,
            "product": product,
            "traits": _assemble_traits(row, cm) or None,
            "metrics": metrics,
        })

    # Synthesize rank by Yield DESC when the sheet didn't publish one
    # (corn/soy list alphabetically). Wheat already carries a Yield Rank,
    # so we re-rank only when at least one row lacks a rank; that keeps
    # the ranks within a table consistent.
    if results and any(r["rank"] is None for r in results):
        ordered = sorted(results, key=lambda r: -float(r["metrics"]["Yield"]))
        for i, r in enumerate(ordered, start=1):
            r["rank"] = i
        results = ordered
    else:
        results.sort(key=lambda r: r["rank"])

    doc.results = results


# --------------------------------------------------------------------- render


def render_markdown(doc: TrialDoc) -> str:
    crop_label = {"corn": "Corn", "soybeans": "Soybean",
                  "wheat": "Wheat"}.get(doc.crop, doc.crop.title())
    head: list[str] = [
        f"# {crop_label} yield trial — University of Illinois VT, "
        f"{doc.region}, IL {doc.year}",
        "",
        "- **Publisher:** University of Illinois Variety Testing "
        "(independent cross-vendor trial)",
        f"- **Crop:** {crop_label}",
        "- **State:** IL",
        f"- **Year:** {doc.year}",
        f"- **Region:** {doc.region}",
    ]
    for label, val in (
        ("Rotation", doc.rotation),
        ("Previous crop", doc.previous_crop),
        ("Cooperator / host", doc.cooperator),
        ("County", doc.county),
        ("Sites", ", ".join(doc.sites) if doc.sites else None),
        ("Soil type", doc.soil_type),
        ("Tillage", doc.tillage),
        ("Planted", doc.planted_date),
        ("Harvested", doc.harvested_date),
        ("Row width", doc.row_width),
    ):
        if val:
            head.append(f"- **{label}:** {val}")
    head += [
        f"- **Source XLSX:** {doc.xlsx_url}",
        f"- **Index page:** {doc.index_url}",
        "", "---", "",
        "## Results (ranked by regional yield)", "",
    ]

    # Discover metric columns from the first result.
    metric_keys: list[str] = []
    for r in doc.results:
        if r.get("metrics"):
            metric_keys = list(r["metrics"].keys())
            break
    headers = ["Rank", "Brand", "Product", "Traits"] + metric_keys
    head.append("| " + " | ".join(headers) + " |")
    head.append("|" + "|".join(["---"] * len(headers)) + "|")
    for r in doc.results:
        row = [str(r.get("rank") or "-"), r.get("brand") or "-",
               r.get("product") or "-", r.get("traits") or "-"]
        m = r.get("metrics") or {}
        for k in metric_keys:
            v = m.get(k)
            row.append("-" if v is None else str(v))
        head.append("| " + " | ".join(row) + " |")
    head.append("")
    return "\n".join(head)


def write_doc(doc: TrialDoc, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    (CORPUS_DIR / f"{doc.source_key}.md").write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "illinois_vt_trials",
        "source_key": doc.source_key,
        "data_type": "trial",
        "vendor": "University of Illinois",
        "brand_aggregator": "University of Illinois Variety Testing publishes",
        "brand": "University of Illinois VT",
        "crop": doc.crop,
        "state": "IL",
        "state_abbrev": "il",
        "year": doc.year,
        "region": doc.region,
        "rotation": doc.rotation,
        "previous_crop": doc.previous_crop,
        "cooperator": doc.cooperator,
        "county": doc.county,
        "sites": doc.sites or None,
        "soil_type": doc.soil_type,
        "tillage": doc.tillage,
        "planted_date": doc.planted_date,
        "harvested_date": doc.harvested_date,
        "row_width": doc.row_width,
        "latitude": doc.latitude,
        "longitude": doc.longitude,
        "results": doc.results,
        "n_results": len(doc.results),
        "source_urls": [doc.xlsx_url, doc.index_url],
        "tos_note": TOS_NOTE,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    (CORPUS_DIR / f"{doc.source_key}.json").write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8")


# --------------------------------------------------------------------- pipeline


def process_doc(http: RateLimitedSession, doc: TrialDoc, *,
                force: bool) -> str:
    md_path = CORPUS_DIR / f"{doc.source_key}.md"
    if md_path.exists() and not force:
        return "skipped"
    try:
        r = http.get(doc.xlsx_url)
        r.raise_for_status()
        parse_xlsx(r.content, doc)
    except Exception as exc:  # noqa: BLE001
        log.error("%s parse failed (%s): %s", doc.source_key, doc.xlsx_url, exc)
        return "failed"
    if not doc.results:
        log.warning("%s — no valid result rows parsed; skipping", doc.source_key)
        return "empty"
    write_doc(doc, render_markdown(doc))
    log.info("%s written | %s %s %s | %d results | top: %s",
             doc.source_key, doc.crop, doc.region, doc.year, len(doc.results),
             doc.results[0]["brand"] + " " + doc.results[0]["product"]
             if doc.results else "-")
    return "written"


def run(*, crops: set[str], years: set[int], limit: int | None,
        force: bool) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    docs = discover_files(http, crops=crops, years=years)
    log.info("discovered %d in-scope variety tables", len(docs))
    if limit is not None:
        docs = docs[:limit]

    counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
    for i, doc in enumerate(docs, start=1):
        status = process_doc(http, doc, force=force)
        counts[status] = counts.get(status, 0) + 1
        if status != "written" or i <= 5 or i % 20 == 0:
            log.info("[%d/%d] %s -> %s", i, len(docs), doc.source_key, status)

    log.info("done: written=%d skipped=%d empty=%d failed=%d (of %d)",
             counts["written"], counts["skipped"], counts["empty"],
             counts["failed"], len(docs))
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.illinois_vt_trials",
        description="Scrape University of Illinois Variety Testing "
                    "cross-vendor yield trials (XLSX) into the corpus.")
    p.add_argument("--year", type=int, default=None,
                   help="Scrape a single harvest year (default: 2024+2025).")
    p.add_argument("--include-old", action="store_true",
                   help="Also scrape 2000–2023 (deferred by default).")
    p.add_argument("--crop", default=None, choices=tuple(CROP_INDEX.keys()),
                   help="Limit to one crop (corn / soybeans / wheat).")
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N tables (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr)

    crops = {args.crop} if args.crop else set(CROP_INDEX.keys())
    if args.year is not None:
        years = {args.year}
    elif args.include_old:
        years = set(range(OLD_YEAR_MIN, OLD_YEAR_MAX + 1)) | BASELINE_YEARS
    else:
        years = set(BASELINE_YEARS)

    return run(crops=crops, years=years, limit=args.limit, force=args.force)


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,651 @@
"""Iowa Crop Performance Tests (ICPT): cross-vendor yield trials.

Iowa State University / the Iowa Crop Improvement Association run the
**Iowa Crop Performance Tests**, an independent, third-party variety
trial program. Because the trial is publisher-neutral, a single
district table ranks every entered brand head-to-head on identical
plots (for example Pioneer, DEKALB, Brevant, NuTech, Renk, Legacy,
Epley Brothers, depending on which companies entered that year).
That makes it the highest-trust ``data_type: "trial"`` source
in the corpus: unlike the vendor plot reports (Golden Harvest, LG,
AgriGold, ProHarvest), no seed company controls the entry list or the
agronomy, so there's no home-brand bias.

Site shape (ASP.NET, server-rendered GridView tables; requests +
BeautifulSoup, no JS / headless browser needed):

    Corn:    https://www.croptesting.iastate.edu/corn/CornDistrict2.aspx
    Soybean: https://www.croptesting.iastate.edu/Soybean/SoybeanDistrict2.aspx

``...District2.aspx`` is the ONLY live district URL: the district
(North / Central / South) is chosen *on that same page* via a
``radLstDistrict`` radio (1/2/3) ASP.NET **postback**, NOT a separate
URL (CornDistrict1/3.aspx 302-redirect away). Likewise the year
(``cmbYear`` dropdown, 2025→2014) and the maturity season
(``radListSeason``: 1=Early, 2=Full) are postbacks; there are no
stable GET URLs for them. So for each view we GET the page to harvest
the ASP.NET hidden fields (``__VIEWSTATE`` / ``__VIEWSTATEGENERATOR`` /
``__VIEWSTATEENCRYPTED``), then POST the form with the desired
year/district/season + ``btnFilter=Filter`` to drive the view.
``CornDistrict.aspx`` (no number) is the 2013-and-older legacy page,
which is out of scope.
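
The payload assembly described above can be sketched offline. This is a
minimal illustration under stated assumptions, not the module's actual
code path: ``SAMPLE_HTML`` is made-up stand-in markup (not a capture of
the live page), ``HiddenFieldParser`` and ``build_payload`` are
hypothetical names, and the stdlib ``html.parser`` stands in for
BeautifulSoup so the sketch has no third-party dependency. The control
names are the ones this module defines as constants.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    # Collects name -> value for every <input type="hidden"> on the page.

    def __init__(self) -> None:
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        is_hidden = (a.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""


def build_payload(html, *, year, district, season):
    # Start from the hidden ASP.NET fields, then overlay the filter controls.
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload["ctl00$MainContent$cmbYear"] = str(year)
    payload["ctl00$MainContent$radLstDistrict"] = district
    payload["ctl00$MainContent$radListSeason"] = season
    payload["ctl00$MainContent$btnFilter"] = "Filter"
    return payload


# Stand-in markup with hypothetical values, for illustration only.
SAMPLE_HTML = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__VIEWSTATEGENERATOR" value="DEADBEEF" />'
    '<input type="text" name="ignored" value="x" /></form>'
)

payload = build_payload(SAMPLE_HTML, year=2025, district="1", season="2")
```

The resulting dict is what would be passed as ``data=`` to the POST;
visible inputs such as the text box are deliberately left out.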

A district table is a multi-site aggregate: the GridView carries the
district-wide Yield plus a West/East sub-region split (Wyld/Eyld) and a
per-site yield column for each cooperator location in the district.
That makes **one district × season × year table the natural document
granularity**: one sidecar per ``(crop, year, district, season)``.
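
As a rough sketch of that granularity, the baseline document set can be
enumerated as a plain cross product. The tuples below are illustrative
stand-ins for this module's ``CROPS`` / ``DISTRICTS`` / ``SEASONS``
tables, and the count is an upper bound, since a combination whose table
comes back empty produces no document.

```python
from itertools import product

# Illustrative stand-ins for the module-level lookup tables.
CROP_SLUGS = ("corn", "soybeans")
YEARS = (2024, 2025)
DISTRICT_SLUGS = ("north", "central", "south")
SEASON_SLUGS = ("early", "full")

# One document key per (crop, year, district, season) combination.
keys = [
    f"icpt-{crop}-{year}-{district}-{season}"
    for crop, year, district, season in product(
        CROP_SLUGS, YEARS, DISTRICT_SLUGS, SEASON_SLUGS
    )
]

print(len(keys))  # 24
print(keys[0])    # icpt-corn-2024-north-early
```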

GridView column → field map:
    corn:    Company | Entry | RM | Herb Tech | Trait Package |
             Yield | Yldp | Moist | Wyld | Eyld | <site> ...
    soybean: Company | Entry | MG | Herb Tech |
             Yield | WestYield | EastYield | <site> ...
    Company       -> result.brand    (the seed COMPANY; critical)
    Entry         -> result.product  (variety / hybrid code)
    Herb Tech +
    Trait Package -> result.traits
    everything else (RM/MG, Yield, Yldp, Moist, Wyld/Eyld, per-site)
                  -> result.metrics  ("Yield" kept verbatim as the
                                      primary key the chunker's top-N picker reads)
Rows are pre-sorted by Yield DESC on the page; we re-sort defensively
and assign rank ourselves (the table has no rank column).
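
A small sketch of that defensive re-sort, using made-up rows (the brand
and product names are hypothetical). Rows with a numeric Yield sort
first, highest to lowest; a row without one sinks to the bottom instead
of raising.

```python
rows = [
    {"brand": "Brand A", "product": "A1", "metrics": {"Yield": 231.4}},
    {"brand": "Brand B", "product": "B7", "metrics": {"Yield": 244.0}},
    {"brand": "Brand C", "product": "C3", "metrics": {}},  # no Yield
    {"brand": "Brand D", "product": "D2", "metrics": {"Yield": 238}},
]


def yield_sort_key(row):
    # Group 0 = numeric yield (descending); group 1 = everything else.
    y = row["metrics"].get("Yield")
    if isinstance(y, (int, float)):
        return (0, -float(y))
    return (1, 0.0)


ranked = sorted(rows, key=yield_sort_key)
for rank, row in enumerate(ranked, start=1):
    row["rank"] = rank

print([(r["rank"], r["product"]) for r in ranked])
# [(1, 'B7'), (2, 'D2'), (3, 'A1'), (4, 'C3')]
```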

We emit the SAME sidecar shape as ``agrigold_plot_reports`` /
``lg_plot_reports`` / ``gh_plot_reports`` / ``proharvest_plots``
(``results: [{rank, brand, product, traits, metrics}]``). The trial
chunker's source dispatch doesn't list ``iowa_icpt_trials`` explicitly,
so it FALLS THROUGH to the shared ``_render_gh_plot_chunk`` renderer;
no ``rag/chunk.py`` edit required.

Output:
    corpus/iowa_icpt_trials/<source_key>.md     LLM-visible body
    corpus/iowa_icpt_trials/<source_key>.json   sidecar metadata

source_key: ``icpt-<crop>-<year>-<district>-<season>``
    e.g. ``icpt-corn-2025-north-early``, ``icpt-soybeans-2024-south-full``.

Scope: 2024 + 2025 baseline. ``--include-old`` walks 2014–2023.

robots/ToS: no robots.txt (the ASP.NET app 404s it); footer
"Copyright (c) 1995-2016 Iowa State University ... All rights reserved"
carries no automation clause. Public land-grant ICPT data, polite UA,
low request rate. (See ``tos_note`` in the sidecar.)

CLI:
    python -m scrape.sources.iowa_icpt_trials --limit 4
    python -m scrape.sources.iowa_icpt_trials --crop corn --year 2025
    python -m scrape.sources.iowa_icpt_trials --include-old --force
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.croptesting.iastate.edu"

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "iowa_icpt_trials"

REQ_INTERVAL_SEC = 2.0   # land-grant box; be polite, single-threaded

BASELINE_YEARS = [2024, 2025]
OLD_YEARS = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

TOS_NOTE = (
    "Footer 'Copyright (c) ...ISU...All rights reserved' (no automation "
    "clause, no robots.txt); public ICPT data; low request rate; attribute "
    "Iowa Crop Performance Tests / ISU."
)

# crop -> (district-results page URL, RM/MG header label)
CROPS: dict[str, tuple[str, str]] = {
    "corn": (f"{BASE}/corn/CornDistrict2.aspx", "RM"),
    "soybeans": (f"{BASE}/Soybean/SoybeanDistrict2.aspx", "MG"),
}

# radLstDistrict radio value -> (slug, label)
DISTRICTS: dict[str, tuple[str, str]] = {
    "1": ("north", "North"),
    "2": ("central", "Central"),
    "3": ("south", "South"),
}
# radListSeason radio value -> (slug, label)
SEASONS: dict[str, tuple[str, str]] = {
    "1": ("early", "Early Season"),
    "2": ("full", "Full Season"),
}

# ASP.NET control names
C_YEAR = "ctl00$MainContent$cmbYear"
C_DISTRICT = "ctl00$MainContent$radLstDistrict"
C_SEASON = "ctl00$MainContent$radListSeason"
C_SHOW = "ctl00$MainContent$radLstShowOptions"
C_FILTER = "ctl00$MainContent$btnFilter"

# GridView header labels that are NOT metric columns.
BRAND_COL = "company"
PRODUCT_COL = "entry"
TRAIT_COLS = {"herb tech", "trait package"}

log = logging.getLogger("scrape.iowa_icpt_trials")


# --------------------------------------------------------------------- HTTP


class RateLimitedSession:
    """Single-threaded rate-limited requests.Session (ASP.NET viewstate
    flow is inherently sequential per page, so no global lock needed)."""

    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(self, method: str, url: str, *, max_retries: int = 4,
                timeout: float = 45.0, **kw: Any) -> requests.Response:
        last_exc: Exception | None = None
        resp: requests.Response | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = float(ra) if (ra and ra.isdigit()) else min(
                    30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc:
            raise last_exc
        assert resp is not None
        return resp

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)

    def post(self, url: str, **kw: Any) -> requests.Response:
        return self.request("POST", url, **kw)


# --------------------------------------------------------------------- model


@dataclass
class TrialResult:
    rank: int | None = None
    brand: str = ""
    product: str = ""
    traits: str = ""
    metrics: dict[str, float | str | None] = field(default_factory=dict)


@dataclass
class DistrictTrial:
    source_key: str
    source_url: str
    crop: str                       # "corn" / "soybeans"
    year: int
    district_slug: str              # north / central / south
    district_label: str             # North / Central / South
    season_slug: str                # early / full
    season_label: str               # Early Season / Full Season
    sites: list[str] = field(default_factory=list)   # cooperator locations
    experiment_mean: float | None = None
    results: list[TrialResult] = field(default_factory=list)


# --------------------------------------------------------------------- parse


def _hidden_fields(soup: BeautifulSoup) -> dict[str, str]:
    out: dict[str, str] = {}
    for inp in soup.find_all("input", {"type": "hidden"}):
        name = inp.get("name")
        if name:
            out[name] = inp.get("value") or ""
    return out


_NUM_RE = re.compile(r"^-?\d+(?:\.\d+)?$")


def _to_num(s: str | None) -> float | int | None:
    s = (s or "").strip()
    if not s or s == "-" or not _NUM_RE.match(s):
        return None
    f = float(s)
    return int(f) if f.is_integer() else f


def _norm(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip()).lower()


def _grid_rows(soup: BeautifulSoup, table_id: str) -> list[list[str]]:
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []
    rows: list[list[str]] = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def _experiment_mean(soup: BeautifulSoup) -> float | None:
    """Pull the district-wide 'Experiment Mean' Yield from the summary
    GridView (second column of the row labelled 'Experiment Mean')."""
    rows = _grid_rows(soup, "MainContent_gvDataSummary")
    for r in rows:
        if r and _norm(r[0]).startswith("experiment mean") and len(r) > 1:
            return _to_num(r[1])  # type: ignore[return-value]
    return None


def parse_district_table(
    soup: BeautifulSoup,
    *,
    rm_mg_label: str,
) -> tuple[list[TrialResult], list[str], float | None]:
    """Parse the ``MainContent_gvData`` GridView into ranked results.

    Returns ``(results, site_columns, experiment_mean)``. Rows arrive
    pre-sorted by Yield DESC; we re-sort by Yield DESC defensively and
    assign rank ourselves (no rank column on the page)."""
    rows = _grid_rows(soup, "MainContent_gvData")
    if len(rows) < 2:
        return [], [], None

    header = rows[0]
    hkeys = [_norm(h) for h in header]

    # Locate the structural columns.
    def find_col(*want: str) -> int | None:
        for w in want:
            for i, h in enumerate(hkeys):
                if h == w:
                    return i
        return None

    i_brand = find_col(BRAND_COL)
    i_product = find_col(PRODUCT_COL)
    i_traits = [i for i, h in enumerate(hkeys) if h in TRAIT_COLS]

    # Identify the per-site (cooperator-location) yield columns: they
    # come AFTER the West/East sub-region split (Wyld/Eyld /
    # WestYield/EastYield), and their header is a location name, not a
    # known metric. Anything that isn't brand/product/trait is a metric;
    # per-site columns are metrics whose header isn't a reserved label.
    reserved_metric = {
        _norm(rm_mg_label), "yield", "yldp", "yield pct", "yield %",
        "moist", "wyld", "eyld", "westyield", "eastyield",
    }
    sites: list[str] = []
    for i, h in enumerate(hkeys):
        if i == i_brand or i == i_product or i in i_traits:
            continue
        if h and h not in reserved_metric:
            sites.append(header[i])

    skip = {i_brand, i_product, *i_traits}
    metric_cols = [(header[i], i) for i in range(len(header)) if i not in skip and header[i]]

    results: list[TrialResult] = []
    for raw in rows[1:]:
        # Pad short rows to header width defensively (longer rows are
        # left as-is; extra cells are simply never indexed).
        cells = raw + [""] * (len(header) - len(raw))

        def cell(i: int | None) -> str:
            return cells[i].strip() if i is not None and 0 <= i < len(cells) else ""

        brand = cell(i_brand)
        product = cell(i_product)
        traits = " ".join(
            t for t in (cell(i) for i in i_traits)
            if t and _norm(t) != "none"
        ).strip()

        metrics: dict[str, float | str | None] = {}
        for name, idx in metric_cols:
            raw_val = cell(idx)
            num = _to_num(raw_val)
            if num is not None:
                metrics[name] = num
            elif raw_val and raw_val != "-":
                metrics[name] = raw_val
            # else: leave the column out (empty)

        res = TrialResult(brand=brand, product=product, traits=traits, metrics=metrics)
        if _row_ok(res):
            results.append(res)

    # Re-sort by Yield DESC (page is already sorted, but don't trust it),
    # then assign rank. Rows with no numeric Yield sink to the bottom.
    def _ysort(r: TrialResult) -> tuple[int, float]:
        y = r.metrics.get("Yield")
        if isinstance(y, (int, float)):
            return (0, -float(y))
        return (1, 0.0)

    results.sort(key=_ysort)
    for n, r in enumerate(results, start=1):
        r.rank = n

    return results, sites, _experiment_mean(soup)


def _row_ok(r: TrialResult) -> bool:
    """Per-row sanity gate. A sound entry has a real (non-numeric)
    company/brand, a product code, and a plausible bu/a Yield. Drops
    summary/blank rows and any leaked aggregate line."""
    brand = (r.brand or "").strip()
    product = (r.product or "").strip()
    if not brand or brand.isdigit():
        return False
    if _norm(brand) in ("summary", "experiment mean", "minimum mean",
                        "maximum mean", "lsd", "coefficient of variability"):
        return False
    if not product:
        return False
    y = r.metrics.get("Yield")
    # Corn runs ~120-280 bu/a, soy ~30-90; gate generously but reject
    # garbage / a moisture/RM value that leaked into the Yield slot.
    if not isinstance(y, (int, float)) or not (10 < float(y) < 400):
        return False
    return True


# --------------------------------------------------------------------- fetch


def source_key_for(crop: str, year: int, district_slug: str, season_slug: str) -> str:
    return f"icpt-{crop}-{year}-{district_slug}-{season_slug}"


def fetch_view(
    http: RateLimitedSession,
    *,
    crop: str,
    year: int,
    district: str,         # radio value "1"/"2"/"3"
    season: str,           # radio value "1"/"2"
) -> DistrictTrial | None:
    """GET the district page (for viewstate), then POST the filter form
    to switch to the requested year/district/season. Returns a parsed
    DistrictTrial, or None if the table is empty for that combination."""
    url, rm_mg_label = CROPS[crop]
    district_slug, district_label = DISTRICTS[district]
    season_slug, season_label = SEASONS[season]

    seed = http.get(url)
    seed.raise_for_status()
    seed_soup = BeautifulSoup(seed.text, "html.parser")

    payload = _hidden_fields(seed_soup)
    payload[C_YEAR] = str(year)
    payload[C_DISTRICT] = district
    payload[C_SEASON] = season
    payload[C_SHOW] = "yield"   # yield view carries Yield/Yldp/Moist + per-SITE yields
    payload[C_FILTER] = "Filter"

    resp = http.post(url, data=payload)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    results, sites, mean = parse_district_table(soup, rm_mg_label=rm_mg_label)
    if not results:
        return None

    return DistrictTrial(
        source_key=source_key_for(crop, year, district_slug, season_slug),
        source_url=url,
        crop=crop,
        year=year,
        district_slug=district_slug,
        district_label=district_label,
        season_slug=season_slug,
        season_label=season_label,
        sites=sites,
        experiment_mean=mean,
        results=results,
    )


# --------------------------------------------------------------------- render


def render_markdown(t: DistrictTrial) -> str:
    crop_label = {"corn": "Corn", "soybeans": "Soybean"}.get(t.crop, t.crop.title())
    head: list[str] = [
        f"# {crop_label} yield trial — Iowa {t.district_label} District "
        f"({t.season_label}), {t.year}",
        "",
        "- **Source:** Iowa Crop Performance Tests (independent third-party trial)",
        "- **Publisher:** Iowa State University / Iowa Crop Improvement Association",
        f"- **Crop:** {crop_label}",
        "- **State:** IA",
        f"- **District:** {t.district_label}",
        f"- **Maturity season:** {t.season_label}",
        f"- **Year:** {t.year}",
    ]
    if t.experiment_mean is not None:
        head.append(f"- **Experiment mean yield:** {t.experiment_mean} bu/a")
    if t.sites:
        head.append(f"- **Cooperator sites:** {', '.join(t.sites)}")
    head += [f"- **URL:** {t.source_url}", "", "---", ""]

    # Discover metric column order from the first result with metrics.
    metric_keys: list[str] = []
    for r in t.results:
        if r.metrics:
            metric_keys = list(r.metrics.keys())
            break

    sections: list[str] = ["## Results (by yield, all brands)", ""]
    headers = ["Rank", "Company", "Entry", "Traits"] + metric_keys
    sections.append("| " + " | ".join(headers) + " |")
    sections.append("|" + "|".join(["---"] * len(headers)) + "|")
    for r in t.results:
        row = [
            str(r.rank) if r.rank is not None else "-",
            r.brand or "-",
            r.product or "-",
            r.traits or "-",
        ]
        for k in metric_keys:
            v = r.metrics.get(k)
            row.append("-" if v is None else str(v))
        sections.append("| " + " | ".join(row) + " |")
    sections.append("")

    # Compact top-5 line for embedder signal.
    top = [r for r in t.results if isinstance(r.metrics.get("Yield"), (int, float))][:5]
    if top:
        bits = [f"{r.product} ({r.brand}) {r.metrics['Yield']}" for r in top]
        sections.append("Top 5 by Yield: " + ", ".join(bits) + ".")
        sections.append("")

    # Join once so the blank line after the "---" rule is kept before the
    # "## Results" heading (joining the two lists separately dropped it).
    return "\n".join(head + sections)


# --------------------------------------------------------------------- write


def write_trial(t: DistrictTrial, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    (CORPUS_DIR / f"{t.source_key}.md").write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "iowa_icpt_trials",
        "source_key": t.source_key,
        "data_type": "trial",
        "vendor": "Iowa State University",
        "brand_aggregator": "Iowa Crop Performance Tests publishes",
        "brand": "Iowa Crop Performance Tests",
        "crop": t.crop,
        "state": "IA",
        "state_abbrev": "ia",
        "year": t.year,
        "region": f"District {t.district_label}",
        "district": t.district_label,
        "season": t.season_label,
        "cooperator_sites": t.sites,
        "experiment_mean_yield": t.experiment_mean,
        "results": [
            {
                "rank": r.rank,
                "brand": r.brand,
                "product": r.product,
                "traits": r.traits,
                "metrics": r.metrics,
            }
            for r in t.results
        ],
        "n_results": len(t.results),
        "source_urls": [t.source_url],
        "tos_note": TOS_NOTE,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    (CORPUS_DIR / f"{t.source_key}.json").write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline


def run(
    *,
    crops: set[str],
    years: list[int],
    limit: int | None,
    force: bool,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
    processed = 0

    targets: list[tuple[str, int, str, str]] = []
    for crop in sorted(crops):
        for year in years:
            for district in DISTRICTS:        # 1/2/3
                for season in SEASONS:        # 1/2
                    targets.append((crop, year, district, season))

    log.info("planned %d (crop x year x district x season) targets", len(targets))

    for crop, year, district, season in targets:
        if limit is not None and processed >= limit:
            break
        district_slug = DISTRICTS[district][0]
        season_slug = SEASONS[season][0]
        sk = source_key_for(crop, year, district_slug, season_slug)
        md_path = CORPUS_DIR / f"{sk}.md"
        if md_path.exists() and not force:
            counts["skipped"] += 1
            continue
        processed += 1
        try:
            trial = fetch_view(http, crop=crop, year=year,
                               district=district, season=season)
        except Exception as exc:  # noqa: BLE001
            counts["failed"] += 1
            log.error("[%s] fetch failed: %s", sk, exc)
            continue
        if trial is None:
            counts["empty"] += 1
            log.info("[%s] empty table (no entries) — skipping", sk)
            continue
        write_trial(trial, render_markdown(trial))
        counts["written"] += 1
        log.info("[%s] written | %d entries | %d sites | brands=%d",
                 sk, len(trial.results), len(trial.sites),
                 len({r.brand for r in trial.results}))

    log.info("done: written=%d skipped=%d empty=%d failed=%d (processed=%d)",
             counts["written"], counts["skipped"], counts["empty"],
             counts["failed"], processed)
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.iowa_icpt_trials",
        description="Scrape Iowa Crop Performance Tests (ICPT) cross-vendor "
                    "yield trials (corn + soybean district tables).",
    )
    p.add_argument("--year", type=int, default=None,
                   choices=tuple(BASELINE_YEARS + OLD_YEARS),
                   help="Limit to a single year (default: 2024+2025 baseline).")
    p.add_argument("--include-old", action="store_true",
                   help="Also scrape 2014-2023 (deferred by default).")
    p.add_argument("--crop", default=None, choices=tuple(CROPS.keys()),
                   help="Limit to one crop (default: both).")
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after fetching N tables; already-present files "
                        "that are skipped don't count (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    crops = {args.crop} if args.crop else set(CROPS.keys())
    if args.year is not None:
        years = [args.year]
    elif args.include_old:
        years = sorted(set(OLD_YEARS + BASELINE_YEARS))
    else:
        years = list(BASELINE_YEARS)
    return run(crops=crops, years=years, limit=args.limit, force=args.force)


if __name__ == "__main__":
    sys.exit(main())