Add RobSeeCo (Rob-See-Co + Innotech): 130 corn/soy varieties from the seed-guide PDF
Independent regional brand (Elkhorn, NE; rolled up Federal Hybrids / Big Cob / Kiser / Rupp's grain-forage). No structured web catalog (the lineup lives in the 2026 Seed Guide PDF), so this is a PDF-extraction identity source.

- robseeco (130: 87 corn + 43 soy; Rob-See-Co 105 + Innotech 25). Downloads the guide (cached under var/, gitignored), dedups the duplicated pages, parses the corn (p5-8) + soy (p19-26) ratings tables. Rotated/vertical column headers reconstructed by clustering rotated words; cells mapped by x-center alignment; descriptive 2-col cards joined by code for trait_stack + strengths. Masters Choice silage + sorghum scoped out (row-crop core only).
- SCALE 1-9, 9=Best (higher=better, like Bayer/Stine-corn); column map verified against the card bullets (e.g. RC2500 "rapid drydown"->Drydown 8, "short plant"->Plant Height 5; RC4779 "industry-leading tar spot"->Tar Spot 7).

Validation: all 130 chunk via rag.chunk.chunks_from_variety (0 errors), 0 duplicate keys, 0 out-of-range ratings (misalignment check), RM/MG all sane.

robseeco.com robots permissive (Squarespace AI-block toggle off; no ToS scrape clause; PDF on a public CDN).

docs: sources.json + README/CLAUDE inventory (2,398 variety records) + rating-scales lesson (added RobSeeCo to the higher=better group + the cross-vendor direction warning).
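
For readers unfamiliar with the header-clustering and x-center mapping mentioned above, here is a minimal sketch of the idea. It is not the scraper's actual code: the Word type, the function names, the tolerances, the bottom-to-top reading order of rotated headers, and all coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word box, as a PDF word extractor might report it."""
    text: str
    x0: float
    x1: float
    top: float = 0.0


def cluster_headers(rotated_words, tol=2.0):
    """Group rotated header words with near-equal x0 into one column header.

    Returns a list of (x_center, label) sorted left to right.
    """
    groups = []
    for word in sorted(rotated_words, key=lambda w: w.x0):
        if groups and abs(word.x0 - groups[-1][-1].x0) <= tol:
            groups[-1].append(word)
        else:
            groups.append([word])

    headers = []
    for group in groups:
        # Assumption: rotated text reads bottom-to-top, so the first word
        # of a label has the largest `top` value.
        ordered = sorted(group, key=lambda w: -w.top)
        label = " ".join(w.text for w in ordered)
        center = sum((w.x0 + w.x1) / 2 for w in group) / len(group)
        headers.append((center, label))
    return headers


def assign_cells(cells, headers, max_dist=8.0):
    """Map each data cell to the header whose x-center is nearest."""
    row = {}
    for cell in cells:
        cx = (cell.x0 + cell.x1) / 2
        center, label = min(headers, key=lambda h: abs(h[0] - cx))
        if abs(center - cx) <= max_dist:
            row[label] = cell.text
    return row


# Illustrative coordinates only: three headers, one of them two words long.
headers = cluster_headers([
    Word("Drydown", 100.0, 108.0, top=80.0),
    Word("Drought", 120.0, 128.0, top=80.0),
    Word("Plant", 140.0, 148.0, top=80.0),
    Word("Height", 140.5, 148.5, top=50.0),
])

# A sparse row: the Drought cell is blank, so splitting on whitespace would
# shift "5" into the wrong column; x-center alignment does not.
row = assign_cells([Word("8", 102.0, 106.0), Word("5", 142.0, 146.0)], headers)
print(row)
```

Matching on x-centers rather than token order is what keeps a blank cell from shifting every later value one column to the left.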
@@ -288,6 +288,25 @@
"tos_check_date": "2026-06-04",
"tos_note": "Data via the Seedware JSON API (burrus25.seedware.net). burrusseed.com robots.txt blocks ~33 NAMED AI/scraper bots and carries Content-signal: ai-train=no + Crawl-delay 10; User-agent:* IS allowed and the ToS has NO scraping clause. Operator chose to include this source despite the ai-train=no signal; scraper uses a non-blacklisted UA (seed-mcp-scraper) and honors Crawl-delay 10 (>=10s between requests).",
"schema_notes": "Seedware JSONP API: GET https://burrus25.seedware.net/app/_queries/crop_varieties.php?crop_pkey=101(corn)|102(soy)&callback=cb (requires a callback param + Referer https://burrusseed.com/; strip the JSONP wrapper). ~40 fields/record incl. brand, maturity (RM/MG), released, and many stat_* ratings → mapped into characteristics_groups: DISEASE RATINGS (gray leaf spot, tar spot, BSR, SDS, phytophthora), AGRONOMIC CHARACTERISTICS (drought, greensnap, stalk/root strength, standability, emergence, etc.), HERBICIDE TOLERANCE (glyphosate/glufosinate/2,4-D/dicamba/FOPs, Yes/No) + Bt insect-protection (Yes/No). SCALE: numeric agronomic+disease 1-10, 10 = best/most-tolerant (HIGHER = better; observed 4-10); NR/blank/0/'-' = not rated. Per-variety tech-sheet PDFs exist (getTechSheet/<pkey>) — not ingested this pass."
},
{
"name": "robseeco",
"vendor": "RobSeeCo",
"brands": [
"Rob-See-Co",
"Innotech"
],
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 130,
"base_url": "https://www.robseeco.com",
"scope_filter": "Independent regional seed co (Elkhorn, NE; rolled up Federal Hybrids/Big Cob/Kiser/Rupp's grain-forage). Row-crop core: corn (Rob-See-Co brand, 87) + soybean (Rob-See-Co RS#### + Innotech IS####, 43). Masters Choice silage corn + sorghum sections EXCLUDED (out of row-crop scope).",
"tos_check_date": "2026-06-09",
"tos_note": "Squarespace site; robots.txt does NOT block AI/content crawling (the AI-bot UAs incl. ClaudeBot/anthropic-ai are grouped with User-agent:* under standard Squarespace exclusions only — no Disallow:/, AI-block toggle off, no expressed AI opt-out). No anti-scraping ToS clause found. Lineup is published as a public seed-guide PDF on a Squarespace CDN URL (not disallowed). UA seed-mcp-scraper.",
"schema_notes": "PDF-extraction source (no structured web catalog — Squarespace visual grid). Download the 2026 Seed Guide PDF (https://www.robseeco.com/s/2026_RobSeeCo-Seed-Guide_FINAL-LR-Single.pdf, follow redirect to static1.squarespace.com; ~18MB, 52 pages; cached under var/, gitignored). EVERY content page is DUPLICATED (p5==p6, p9==p10, ...) → dedup by source_key. Sections: corn ratings table p5-8 + 2-col descriptive cards p9-18; soy ratings table p19-26 + cards. Masters Choice silage (p27-38) + sorghum (p39-42) scoped OUT. Rotated/vertical column headers reconstructed by clustering rotated words by x0; each data cell mapped to its column by X-CENTER alignment (whitespace tokenization is unreliable around sparse cells). Cards joined by code to enrich trait_stack (corn -RR2/-VT2P/-Conv suffixes) + strengths bullets. characteristics_groups: AGRONOMIC (emergence/vigor/root/stalk/greensnap/staygreen/drydown/drought/plant+ear height/test wt) + DISEASE (GLS/Goss/NCLB/Tar Spot/fungicide response; soy SCN source+score/IDC/Phytophthora gene+PRR/BSR/SWM/SDS). SCALE: 1-9, 9=Best (HIGHER=better, same direction as Bayer/Stine-corn); '-'=not available; soy disease letter codes R/MR/S; Product Fit Geography A/C/E/W/CW. Column map verified against descriptive-card bullets; 0 card-only fallbacks (all 130 parsed from the table)."
}
],
"_excluded_sources": [
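
The robseeco schema_notes describe deduplicating the doubled PDF pages by source_key and a 1-9 rating scale where '-' means not available and soy disease columns may carry R/MR/S letter codes. A minimal sketch of those two checks follows. The record layout (a dict with "source_key" and "ratings"), the function names, and the RC0000 record are hypothetical; the real parser's structures are not shown in this commit.

```python
NOT_RATED = {"-"}                # '-' = not available, per schema_notes
LETTER_CODES = {"R", "MR", "S"}  # soy disease letter codes, per schema_notes


def dedup_by_source_key(records):
    """Keep the first record seen for each source_key (pages are doubled)."""
    unique = {}
    for record in records:
        unique.setdefault(record["source_key"], record)
    return list(unique.values())


def out_of_range_ratings(records, low=1, high=9):
    """Return (source_key, column, raw) for values outside the 1-9 scale.

    An out-of-range value usually means a cell landed in the wrong column,
    which is why this doubles as a misalignment check.
    """
    problems = []
    for record in records:
        for column, raw in record["ratings"].items():
            if raw in NOT_RATED or raw in LETTER_CODES:
                continue
            if not raw.isdigit() or not low <= int(raw) <= high:
                problems.append((record["source_key"], column, raw))
    return problems


# RC2500 appears twice, as it would from duplicated pages (p5 == p6).
# RC0000 is a made-up record with a deliberately bad value.
parsed = [
    {"source_key": "RC2500", "ratings": {"Drydown": "8", "Plant Height": "5"}},
    {"source_key": "RC2500", "ratings": {"Drydown": "8", "Plant Height": "5"}},
    {"source_key": "RC0000", "ratings": {"Drydown": "12", "Tar Spot": "-"}},
]

unique = dedup_by_source_key(parsed)
bad = out_of_range_ratings(unique)
print(len(unique), bad)
```

On the real data, the commit message reports 0 duplicate keys and 0 out-of-range ratings across all 130 records.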