Add RobSeeCo (Rob-See-Co + Innotech): 130 corn/soy varieties from the seed-guide PDF #18

Merged
claude merged 1 commit from add-robseeco into main 2026-06-09 23:29:39 -04:00

Adds RobSeeCo (independent regional seed co, Elkhorn NE — markets Rob-See-Co® + Innotech®, rolled up Federal Hybrids / Big Cob / Kiser / Rupp's grain-forage) as a variety-identity source.

Unlike the other independents, RobSeeCo has no structured web catalog (Squarespace visual grid) — the lineup lives in the 2026 Seed Guide PDF, so robseeco is a PDF-extraction source.

| | |
|---|---|
| Records | 130: 87 corn + 43 soy (Rob-See-Co 105 + Innotech 25) |
| Source | 2026 Seed Guide PDF (~18 MB, 52 pp; cached under var/, gitignored) |
| Scale | 1-9, 9=Best (higher=better, like Bayer/Stine-corn); -=n/a; soy letter codes R/MR/S |
| Scope | corn + soybean row-crop core; Masters Choice silage + sorghum excluded |

The hard part — and how it was de-risked. The guide's ratings tables have rotated/vertical column headers and every page is duplicated. The scraper dedups pages, reconstructs the rotated headers by clustering words, and maps each data cell to its column by x-center alignment (whitespace tokenization is unreliable around sparse cells). The column map was verified against the descriptive-card bullets per variety, and I independently re-checked before merge:

  • RC2500 "rapid drydown" → Drydown=8; "moderately short plant" → Plant Height=5 ✓
  • RC4779 "industry-leading tar spot tolerance" → Tar Spot=7 ✓
  • IS1162E3 "Peking source SCN" → SCN source=Peking, Rps3a, PRR=8 ✓
  • 0 out-of-range ratings (a shifted column would dump a GDU like 1825 into a 1–9 slot), 0 dup keys, RM/MG all sane, all 130 chunk cleanly.
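
A minimal sketch of the x-center alignment idea described above, not the scraper's actual code. The `Word` type, the `assign_to_columns` helper, the `max_dist` tolerance, and all coordinates and cell values are hypothetical; it assumes the PDF library has already yielded each word with its left and right x-coordinates and that the header centers were recovered from the rotated header words.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A token with its horizontal extent on the page (hypothetical shape)."""
    text: str
    x0: float
    x1: float

    @property
    def x_center(self) -> float:
        return (self.x0 + self.x1) / 2.0


def assign_to_columns(cells, column_centers, max_dist=12.0):
    """Map each cell to the nearest column header by x-center distance.

    Cells farther than max_dist from every header are dropped rather than
    guessed, so an empty (sparse) cell never shifts its neighbours.
    """
    row = {}
    for cell in cells:
        name, center = min(
            column_centers.items(),
            key=lambda item: abs(item[1] - cell.x_center),
        )
        if abs(center - cell.x_center) <= max_dist:
            row[name] = cell.text
    return row


# Illustrative header centers and one row whose middle cell is blank.
columns = {"Drydown": 100.0, "Plant Height": 130.0, "Tar Spot": 160.0}
cells = [Word("8", 97.0, 103.0), Word("7", 157.0, 163.0)]
row = assign_to_columns(cells, columns)
print(row)
```

Splitting the same row on whitespace would yield two tokens and assign the second one to Plant Height; matching on position leaves Plant Height absent and keeps the second value under Tar Spot.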

Legality: robseeco.com is Squarespace with the AI-crawler block off (the AI-bot UAs are grouped with * under standard exclusions only — no Disallow: /), no anti-scraping ToS clause, and the guide PDF is on a public CDN URL. UA seed-mcp-scraper.
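
For reference, a robots.txt check of this kind can be reproduced with the standard library's `urllib.robotparser`. This is a sketch only: the `is_allowed` helper, the sample robots.txt bodies, and the `example.com` URLs are illustrative assumptions, not the site's actual file or the guide's real CDN path.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative "AI block off" shape: AI-bot UAs share a group with "*"
# and only a few standard paths are excluded (no "Disallow: /").
PERMISSIVE = """\
User-agent: GPTBot
User-agent: *
Disallow: /config
Disallow: /search
"""

# Illustrative "AI block on" shape: the AI-bot UA is shut out entirely.
BLOCKED = """\
User-agent: GPTBot
Disallow: /
"""

pdf_url = "https://example.com/files/seed-guide.pdf"
print(is_allowed(PERMISSIVE, "seed-mcp-scraper", pdf_url))
print(is_allowed(PERMISSIVE, "GPTBot", pdf_url))
print(is_allowed(BLOCKED, "GPTBot", pdf_url))
```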

Docs: sources.json + README/CLAUDE inventory (now 2,398 variety + 6,787 trial records) + rating-scales lesson (RobSeeCo added to the higher=better group + the cross-vendor direction warning). CI rebuilds the index from the committed corpus.

claude added 1 commit 2026-06-09 23:29:31 -04:00
Independent regional brand (Elkhorn, NE; rolled up Federal Hybrids / Big Cob /
Kiser / Rupp's grain-forage). No structured web catalog — the lineup lives in
the 2026 Seed Guide PDF — so this is a PDF-extraction identity source.

- robseeco (130: 87 corn + 43 soy; Rob-See-Co 105 + Innotech 25). Downloads the
  guide (cached under var/, gitignored), dedups the duplicated pages, parses the
  corn (p5-8) + soy (p19-26) ratings tables. Rotated/vertical column headers
  reconstructed by clustering rotated words; cells mapped by x-center alignment;
  descriptive 2-col cards joined by code for trait_stack + strengths. Masters
  Choice silage + sorghum scoped out (row-crop core only).
- SCALE 1-9, 9=Best (higher=better, like Bayer/Stine-corn); column map verified
  against the card bullets (e.g. RC2500 "rapid drydown"->Drydown 8, "short
  plant"->Plant Height 5; RC4779 "industry-leading tar spot"->Tar Spot 7).
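
The commit message does not say how duplicated pages are detected. One plausible approach, sketched here under that assumption, is to hash each page's whitespace-normalized text and keep the first occurrence. The `dedup_pages` helper and the sample page strings are hypothetical.

```python
import hashlib


def dedup_pages(page_texts):
    """Keep the first occurrence of each page, compared by normalized text.

    Returns (original_index, text) pairs so later steps can still refer to
    page numbers in the source PDF.
    """
    seen = set()
    unique = []
    for index, text in enumerate(page_texts):
        # Collapse runs of whitespace so layout jitter does not defeat the hash.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append((index, text))
    return unique


# Made-up page contents: the second page repeats the first with different spacing.
pages = ["Corn ratings\nRC2500 8 5", "Corn  ratings RC2500 8 5", "Soy ratings"]
kept = dedup_pages(pages)
print([index for index, _ in kept])
```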

Validation: all 130 chunk via rag.chunk.chunks_from_variety (0 errors), 0
duplicate keys, 0 out-of-range ratings (misalignment check), RM/MG all sane.
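
The out-of-range check works as a misalignment detector because a shifted column puts a non-rating value (such as a GDU) into a 1-9 slot. A rough sketch of that check, assuming a hypothetical record shape of `{"code": ..., "ratings": {...}}` with string cell values; the `out_of_range` helper and the sample records are illustrative and not taken from the corpus.

```python
def out_of_range(records, lo=1, hi=9, na="-", letter_codes=("R", "MR", "S")):
    """Return (code, field, value) for every rating outside the expected scale.

    Accepted values: the n/a marker, the soy letter codes, or a number in
    [lo, hi]. Anything else suggests a column shift during extraction.
    """
    problems = []
    for record in records:
        for field, value in record["ratings"].items():
            if value == na or value in letter_codes:
                continue
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append((record["code"], field, value))
                continue
            if not lo <= number <= hi:
                problems.append((record["code"], field, value))
    return problems


# Made-up records: the second simulates a GDU landing in a rating column.
records = [
    {"code": "OK-1", "ratings": {"Drydown": "8", "Tar Spot": "-", "PRR": "MR"}},
    {"code": "BAD-1", "ratings": {"Drydown": "1825", "Tar Spot": "7"}},
]
problems = out_of_range(records)
print(problems)
```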

robseeco.com robots permissive (Squarespace AI-block toggle off; no ToS scrape
clause; PDF on a public CDN). docs: sources.json + README/CLAUDE inventory
(2,398 variety records) + rating-scales lesson (added RobSeeCo to the
higher=better group + the cross-vendor direction warning).
claude merged commit 0bac06b7b6 into main 2026-06-09 23:29:39 -04:00
claude deleted branch add-robseeco 2026-06-09 23:29:39 -04:00
Reference: justin/seed-mcp#18