Add university-extension trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 cross-vendor trial docs) (#19)
Image rebuild (skip scrape) / build (push) Successful in 5m54s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #19.
@@ -307,6 +307,55 @@
"tos_check_date": "2026-06-09",
"tos_note": "Squarespace site; robots.txt does NOT block AI/content crawling (the AI-bot UAs incl. ClaudeBot/anthropic-ai are grouped with User-agent:* under standard Squarespace exclusions only — no Disallow:/, AI-block toggle off, no expressed AI opt-out). No anti-scraping ToS clause found. Lineup is published as a public seed-guide PDF on a Squarespace CDN URL (not disallowed). UA seed-mcp-scraper.",
"schema_notes": "PDF-extraction source (no structured web catalog — Squarespace visual grid). Download the 2026 Seed Guide PDF (https://www.robseeco.com/s/2026_RobSeeCo-Seed-Guide_FINAL-LR-Single.pdf, follow redirect to static1.squarespace.com; ~18MB, 52 pages; cached under var/, gitignored). EVERY content page is DUPLICATED (p5==p6, p9==p10, ...) → dedup by source_key. Sections: corn ratings table p5-8 + 2-col descriptive cards p9-18; soy ratings table p19-26 + cards. Masters Choice silage (p27-38) + sorghum (p39-42) scoped OUT. Rotated/vertical column headers reconstructed by clustering rotated words by x0; each data cell mapped to its column by X-CENTER alignment (whitespace tokenization is unreliable around sparse cells). Cards joined by code to enrich trait_stack (corn -RR2/-VT2P/-Conv suffixes) + strengths bullets. characteristics_groups: AGRONOMIC (emergence/vigor/root/stalk/greensnap/staygreen/drydown/drought/plant+ear height/test wt) + DISEASE (GLS/Goss/NCLB/Tar Spot/fungicide response; soy SCN source+score/IDC/Phytophthora gene+PRR/BSR/SWM/SDS). SCALE: 1-9, 9=Best (HIGHER=better, same direction as Bayer/Stine-corn); '-'=not available; soy disease letter codes R/MR/S; Product Fit Geography A/C/E/W/CW. Column map verified against descriptive-card bullets; 0 card-only fallbacks (all 130 parsed from the table)."
},
{
"name": "illinois_vt_trials",
"vendor": "University of Illinois",
"brand_aggregator": "University of Illinois Variety Testing publishes",
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "green",
"expected_count": 30,
"base_url": "https://vt.cropsci.illinois.edu",
"scope_filter": "University-extension variety trials (independent third-party; tests ALL entered brands incl. majors). 2024+2025 baseline (--include-old walks 2000-2023). One doc per region trial table; corn-following-corn vs corn-following-soybean kept as distinct docs.",
"tos_check_date": "2026-06-10",
"tos_note": "No usage terms posted on the UIUC VT site (verified corn/soy/wheat + About pages); publicly-funded land-grant data published for farmers (companies pay an entry fee, which doesn't restrict result reuse). Attribute 'University of Illinois Variety Testing'. Polite UA seed-mcp-scraper, ~1s rate.",
"schema_notes": "data_type=trial; emits the gh_plot results[] shape and routes through _render_gh_plot_chunk with include_region=True (region folded into the embedded chunk). Discover the per-region XLSX hrefs from the /corn /soybeans /wheat index pages (the upload URL's month segment varies; do NOT guess). Parse XLSX with openpyxl (added to requirements; scrape-time only — not needed in the image-only CI rebuild). Header-anchored cell map (multi-word company names + variable trait cols shift fixed positions): col0=Company->brand, col1=Name->product; metric columns resolved by merging the group-header/column-name/units rows; 'Yield'=the Regional/Regional-Average bu/a column (canonical key), + Moisture/Lodging/Height/Protein/Oil/Maturity + 2yr/3yr avg. traits = GT/HT/IST seed-treatment cols. Per-site metadata (Host->cooperator, County, Soil type, Planting/Harvest date->ISO, Tillage, Lat/Long) read from a self-locating label block. rank synthesized by Yield DESC (corn/soy list alphabetically; wheat's own Yield Rank honored). Per-row sanity gate drops summary/blank-brand rows + yields outside 1-400. NOTE: DEKALB/Channel/Brevant/AgriGold are NOT entered in the IL program (true negatives, not parse gaps); Pioneer/NK + many regionals (FS InSPIRE, AgriMAXX, Burrus, Cornelius) are."
},
{
"name": "iowa_icpt_trials",
"vendor": "Iowa State University",
"brand_aggregator": "Iowa Crop Performance Tests publishes",
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 24,
"base_url": "https://www.croptesting.iastate.edu",
"scope_filter": "University-extension variety trials (independent third-party; all entered brands). Corn + soybean (no wheat — Iowa ICPT doesn't run a wheat trial). 2024+2025 baseline (--include-old 2014-2023). One doc per (crop, year, district, maturity-season); districts North/Central/South.",
"tos_check_date": "2026-06-10",
"tos_note": "No robots.txt (ASP.NET 404). Footer 'Copyright (c) 1995-2016 Iowa State University ... All rights reserved' — no automation/scraping clause. Public ICPT land-grant data. Attribute 'Iowa Crop Performance Tests / Iowa State University'. Polite UA seed-mcp-scraper, single-threaded, 2s interval.",
"schema_notes": "data_type=trial; gh_plot results[] shape → _render_gh_plot_chunk(include_region=True). Server-rendered ASP.NET GridView tables (requests + BeautifulSoup; no JS). NAVIGATION IS POSTBACK, not GET: only the District2.aspx pages are live; GET the page to harvest hidden fields (__VIEWSTATE/__VIEWSTATEGENERATOR/__VIEWSTATEENCRYPTED), then POST cmbYear + radLstDistrict (1/2/3) + radListSeason + radLstShowOptions=yield + btnFilter. Column map: Company->brand, Entry->product, Herb Tech + Trait Package->traits, Yield (canonical key) + Yldp(yield % of mean) + Moist + per-site Wyld/Eyld->metrics. rank synthesized by Yield DESC (no rank column). Per-row sanity gate (10-400 bu/a) drops summary rows. Single-location pages skipped (redundant with the district per-site columns). Majors confirmed present (Pioneer/DEKALB/Asgrow/NK/Golden Harvest); ISU/LOYAL BRAND/P3 are legit public/regional lines; Brevant is a true-negative (not an entrant)."
},
{
"name": "ohio_ocpt_trials",
"vendor": "The Ohio State University",
"brand_aggregator": "Ohio Corn/Soybean Performance Test publishes",
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 69,
"base_url": "https://ohiocroptest.cfaes.osu.edu",
"scope_filter": "University-extension variety trials (independent third-party; all entered brands). Corn + soybean. 2024+2025 baseline (--include-old against the u.osu.edu/perf archive). Corn = one doc per site (Early+Full merged) + region summaries; soy = one doc per region x maturity.",
"tos_check_date": "2026-06-10",
"tos_note": "Report PDF carries '(c) The Ohio State University' + an explicit no-endorsement clause; entry-fee-funded CFAES extension publication for farmers. Attribute 'Ohio Corn/Soybean Performance Test, OSU CFAES'. Polite UA seed-mcp-scraper, low rate.",
"schema_notes": "data_type=trial; gh_plot results[] shape → _render_gh_plot_chunk(include_region=True). PDF (pdfplumber). Discover the report PDF hrefs live from /corntrials/?year=N and /soyN/ (no hardcoded filename); the regions.asp web tool is JS-rendered and IGNORED. CORN: per-site tables repeat 'Brand|Hybrid| Yield Mst Ldg Std Emg' x N sites + trailing TW; anchor on the trailing numeric run, split into N groups of 5 (group count taken authoritatively from the header's 'Yield' token count — a mismatch SKIPS the table), brand/hybrid split via an ALL-CAPS known-brand dictionary (longest-match-first), per-site agronomic footnotes (soil/prev-crop/tillage/cooperator/county) via word x-coordinate bucketing (text alone mis-assigns columns), Table-10 hybrid->trait-codes joined onto results. SOY: 'Variety|Brand|Type|Seed Treatment|RM|per-site yields|Mean' anchored on the Type column (EN/CV/XF/STS incl. comma-compounds). 'Yield' canonical key. Per-row + site-count sanity gates; 0 verbatim fallbacks needed at baseline. Majors present (CHANNEL/DEKALB/NK/Golden Harvest/LG/AgriGold/Asgrow/Xitavo/Beck's); Pioneer/Brevant didn't enter Ohio these years (true negatives)."
}
],
"_excluded_sources": [
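The `schema_notes` for the Illinois and Iowa sources describe a per-row sanity gate followed by a rank "synthesized by Yield DESC". A minimal sketch of that step, under assumptions, might look like the following. The function name `synthesize_ranks`, the row dictionary shape, and the tie handling are hypothetical and are not taken from the scraper itself; only the `brand` and `Yield` keys and the 1-400 bounds come from the notes.

```python
from typing import Any


def synthesize_ranks(
    rows: list[dict[str, Any]], lo: float = 1.0, hi: float = 400.0
) -> list[dict[str, Any]]:
    """Drop implausible rows, then rank the survivors by Yield, highest first.

    Assumed row shape (hypothetical): {"brand": str, "product": str, "Yield": float}.
    """
    kept = []
    for row in rows:
        brand = (row.get("brand") or "").strip()
        yld = row.get("Yield")
        # Sanity gate: summary rows usually have no brand; yields must be numeric
        # and fall inside the plausible bu/a window.
        if not brand or not isinstance(yld, (int, float)):
            continue
        if not (lo <= yld <= hi):
            continue
        kept.append(dict(row))  # copy so the caller's rows are not mutated

    # Stable sort: rows with equal yields keep their input order (an assumption).
    kept.sort(key=lambda r: r["Yield"], reverse=True)
    for position, row in enumerate(kept, start=1):
        row["rank"] = position
    return kept


# Made-up sample rows, for illustration only.
sample = [
    {"brand": "Brand A", "product": "A-101", "Yield": 210.5},
    {"brand": "", "product": "Mean", "Yield": 205.0},         # summary row
    {"brand": "Brand B", "product": "B-202", "Yield": 231.0},
    {"brand": "Brand C", "product": "C-303", "Yield": 4200.0},  # out of range
]
ranked = synthesize_ranks(sample)
```

For Iowa the notes give a 10-400 bu/a window, which in this sketch would be `synthesize_ranks(rows, lo=10.0)`. Wheat in the Illinois program carries its own Yield Rank column, so a step like this would presumably be skipped there.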
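The Ohio corn notes describe anchoring on the trailing numeric run of each row, splitting it into N groups of 5 (`Yield Mst Ldg Std Emg`) plus a trailing TW, with N taken from the header's `Yield` token count, and splitting brand from hybrid with a longest-match-first dictionary. The sketch below illustrates that idea on plain text lines. All function names are hypothetical, the sample row is invented, and skipping a single mismatched row (rather than the whole table, as the notes say) is a simplification made here.

```python
import re

_NUM = re.compile(r"^-?\d+(?:\.\d+)?$")
SITE_FIELDS = ("Yield", "Mst", "Ldg", "Std", "Emg")


def count_sites(header: str) -> int:
    """The number of 'Yield' tokens in the header gives the number of sites."""
    return header.split().count("Yield")


def split_corn_row(line: str, n_sites: int):
    """Peel numeric tokens off the end of the line and group them per site."""
    tokens = line.split()
    run = []
    while tokens and _NUM.match(tokens[-1]):
        run.append(float(tokens.pop()))
    run.reverse()
    expected = n_sites * len(SITE_FIELDS) + 1  # +1 for the trailing TW
    if n_sites < 1 or len(run) != expected:
        return None  # mismatch: skip instead of guessing a column alignment
    *site_values, tw = run
    step = len(SITE_FIELDS)
    sites = [
        dict(zip(SITE_FIELDS, site_values[i:i + step]))
        for i in range(0, len(site_values), step)
    ]
    return {"label": " ".join(tokens), "sites": sites, "TW": tw}


def split_brand(label: str, known_brands: set[str]):
    """Longest-match-first lookup against an ALL-CAPS brand dictionary."""
    upper = label.upper()
    for brand in sorted(known_brands, key=len, reverse=True):
        if upper == brand or upper.startswith(brand + " "):
            return brand, label[len(brand):].strip()
    return None


# Invented header and row, for illustration only.
header = "Brand Hybrid Yield Mst Ldg Std Emg Yield Mst Ldg Std Emg TW"
row = "ACME SEEDS AC123X 245.1 17.2 0 32.1 95 238.4 16.9 1 31.8 93 57.2"
parsed = split_corn_row(row, count_sites(header))
brand_hybrid = split_brand(parsed["label"], {"ACME", "ACME SEEDS"})
```

A cell printed as `-` or left blank would break the numeric run in this sketch, which is one reason a real parser would likely lean on word x-coordinates as the notes describe for the footnotes.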
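The Iowa notes say navigation is a postback: GET the page, harvest the hidden ASP.NET fields, then POST them back with the filter controls. The sketch below shows only the payload-building half, using the standard library's `html.parser` in place of BeautifulSoup so it stays self-contained. The control names (`cmbYear`, `radLstDistrict`, `radListSeason`, `radLstShowOptions`, `btnFilter`) come from the notes; the function name, the `season` value format, the `btnFilter` value, and the absence of any `ctl00$...` name prefix are assumptions that would need checking against the live form.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden">."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(page_html: str, year: int, district: int, season: str) -> dict[str, str]:
    """Merge the harvested hidden fields with the filter controls."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)  # __VIEWSTATE and friends must be echoed back
    payload.update(
        {
            "cmbYear": str(year),
            "radLstDistrict": str(district),  # 1/2/3 per the notes
            "radListSeason": season,          # value format is an assumption
            "radLstShowOptions": "yield",
            "btnFilter": "Filter",            # submit value is an assumption
        }
    )
    return payload


# Tiny stand-in for the GET response; the real page carries much larger values.
page = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc" />'
    '<input type="hidden" name="__VIEWSTATEGENERATOR" value="123">'
    '<input type="text" name="q" value="x"></form>'
)
payload = build_postback(page, year=2025, district=2, season="early")
```

With `requests`, the resulting dictionary would typically be sent as `session.post(url, data=payload)` on the same session that issued the GET, keeping to the single-threaded, 2s interval stated in `tos_note`.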