Add RobSeeCo (Rob-See-Co + Innotech): 130 corn/soy varieties from the seed-guide PDF

Independent regional brand (Elkhorn, NE; rolled up Federal Hybrids / Big Cob / Kiser / Rupp's grain-forage). No structured web catalog (the lineup lives in the 2026 Seed Guide PDF), so this is a PDF-extraction identity source.

- robseeco (130: 87 corn + 43 soy; Rob-See-Co 105 + Innotech 25). Downloads the guide (cached under var/, gitignored), dedups the duplicated pages, parses the corn (p5-8) + soy (p19-26) ratings tables. Rotated/vertical column headers reconstructed by clustering rotated words; cells mapped by x-center alignment; descriptive 2-col cards joined by code for trait_stack + strengths. Masters Choice silage + sorghum scoped out (row-crop core only).
- SCALE 1-9, 9=Best (higher=better, like Bayer/Stine-corn); column map verified against the card bullets (e.g. RC2500 "rapid drydown"->Drydown 8, "short plant"->Plant Height 5; RC4779 "industry-leading tar spot"->Tar Spot 7).

Validation: all 130 chunk via rag.chunk.chunks_from_variety (0 errors), 0 duplicate keys, 0 out-of-range ratings (misalignment check), RM/MG all sane.

robseeco.com robots permissive (Squarespace AI-block toggle off; no ToS scrape clause; PDF on a public CDN).

docs: sources.json + README/CLAUDE inventory (2,398 variety records) + rating-scales lesson (added RobSeeCo to the higher=better group + the cross-vendor direction warning).
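
For reviewers, a minimal sketch of the two alignment steps named above (rotated-header reconstruction, then x-center cell mapping). It assumes pdfplumber-style word dicts; the word values, the row contents, and the helper names `reconstruct_headers` / `nearest_col` are synthetic stand-ins for illustration, not the scraper's exact code or real guide data:

```python
from collections import defaultdict

# Synthetic header words: rotated text arrives reversed, with upright=False.
header_words = [
    {"text": "nwodyrD", "x0": 100.0, "x1": 108.0, "top": 200.0, "upright": False},
    {"text": "topS", "x0": 130.0, "x1": 138.0, "top": 210.0, "upright": False},
    {"text": "raT", "x0": 130.0, "x1": 138.0, "top": 240.0, "upright": False},
]


def reconstruct_headers(words, top_max):
    """Cluster rotated words by rounded x0, read bottom-to-top, reverse."""
    cols = defaultdict(list)
    for w in words:
        if not w["upright"] and w["top"] < top_max:
            cols[round(w["x0"])].append(w)
    out = []
    for x0 in sorted(cols):
        ws = sorted(cols[x0], key=lambda w: -w["top"])  # bottom word first
        label = "".join(w["text"] for w in ws)[::-1]    # undo the reversal
        cx = sum((w["x0"] + w["x1"]) / 2 for w in ws) / len(ws)
        out.append((cx, label))
    return out


def nearest_col(cx, headers, tol):
    """Label of the header whose x-center is closest to cx, within tol."""
    best, best_d = None, tol
    for x, label in headers:
        d = abs(x - cx)
        if d < best_d:
            best, best_d = label, d
    return best


headers = reconstruct_headers(header_words, top_max=320)

# Synthetic data cells for one row; the last one sits far from any header.
row_cells = [
    {"text": "8", "x0": 101.0, "x1": 107.0},
    {"text": "7", "x0": 131.5, "x1": 136.5},
    {"text": "42", "x0": 295.0, "x1": 305.0},
]
ratings = {}
for c in row_cells:
    label = nearest_col((c["x0"] + c["x1"]) / 2, headers, tol=13.0)
    if label is not None:
        ratings[label] = c["text"]

print(headers)
print(ratings)
```

With these synthetic inputs the stray cell near x=300 falls outside the tolerance and is dropped rather than forced into the nearest column, which is the property the out-of-range misalignment check relies on.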
2026-06-09 23:29:11 -04:00
parent 84ad2b1de6
commit 2425a79f0c
265 changed files with 23133 additions and 6 deletions
@@ -0,0 +1,969 @@
+"""RobSeeCo (Rob-See-Co / Innotech) seed-guide scraper.
+
+Source: the 2026 RobSeeCo Seed Guide PDF
+(``robseeco.com/s/2026_RobSeeCo-Seed-Guide_FINAL-LR-Single.pdf`` — a
+302 to a static1.squarespace.com asset). Rob-See-Co is an independent
+seed company serving the Western/Central Corn Belt + Dakotas (Eastern
+Corn Belt via Federal Hybrids / Kiser Seed); the guide carries the
+Rob-See-Co corn line, the Rob-See-Co + Innotech soybean lines, plus
+Masters Choice silage corn and sorghum (both OUT OF SCOPE for the
+row-crop advisor — skipped).
+
+This is a marketing-laid-out PDF, the hardest extraction class. The
+value lives in two complementary places per crop:
+
+1. A dense **ratings table** (corn p5-8, soy p19-26) — one row per
+   hybrid/variety with ~14 single-digit 1-9 agronomic + disease
+   ratings. The numeric column headers are ROTATED 90deg (rendered as
+   reversed text), so we reconstruct each header label by clustering the
+   rotated glyphs by their x-position, ordering bottom-to-top, and
+   reversing the string. We then map each numeric data cell to its
+   column by **x-center alignment** (the most reliable signal — the
+   whitespace-tokenised text drops/duplicates cells around the sparse
+   "Fungicide Response" / soil-fit columns and the trailing geo letters).
+
+2. Descriptive **cards** (corn p9-18 in 2 columns, soy p21-26 in 3
+   columns): code + trait variants (corn) + 3-6 marketing bullets. We
+   split the page into equal-width columns, cluster each into lines,
+   and group consecutive title lines (font size 14) + their following
+   bullet lines (size 9.5) into one card. Cards enrich the table record
+   with the corn trait-variant suffixes (-RR2 / -VT2P / -Conv / …) and
+   the bullets (``strengths``).
+
+The table is authoritative for the structured 1-9 ``characteristics_groups``;
+cards add the trait stack + bullets. A table row that won't parse cleanly
+(too few aligned numeric cells) falls back to a **card-only** record
+(identity + bullets) rather than emit mis-assigned ratings.
+
+EVERY content page in the PDF is DUPLICATED consecutively (p5 == p6,
+p7 == p8, …), and the two covers carry mirrored text — we dedup by
+source_key and only ever process the first occurrence of a code.
+
+Output:
+  corpus/robseeco/<source_key>.md
+  corpus/robseeco/<source_key>.json
+
+source_key: ``robseeco-<code>`` lowercased, e.g. ``robseeco-rc2500``,
+``robseeco-is1162e3``.
+
+CLI:
+  python -m scrape.sources.robseeco --crop corn --limit 5
+  python -m scrape.sources.robseeco --force
+  python -m scrape.sources.robseeco --product robseeco-rc2500
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import random
+import re
+import sys
+import time
+from collections import defaultdict
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import pdfplumber
+import requests
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+PDF_URL = "https://www.robseeco.com/s/2026_RobSeeCo-Seed-Guide_FINAL-LR-Single.pdf"
+PRODUCTS_URL = "https://www.robseeco.com/products"
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "robseeco"
+PDF_CACHE = Path(os.environ.get("ROBSEECO_PDF_CACHE")
+                 or (REPO_ROOT / "var" / "robseeco_2026_seed_guide.pdf"))
+
+REQ_INTERVAL_SEC = 1.5
+
+# Scale direction — taken verbatim from the legend on p7-8 of the guide.
+SCALE_DIRECTION = (
+    "Agronomic and disease ratings 1-9, 9=Best, 1=Worst, -=not available "
+    "(higher is better). Plant Height 9=Tall/1=Short; Ear Height 9=High/1=Low. "
+    "Planting Rate Guideline L=Low / ML=Medium-Low / M=Medium / MH=Medium-High / "
+    "H=High. Disease/pest letter codes (soy): R=resistant, MR=moderately resistant, "
+    "S=susceptible. Product Fit Geography A=All, C=Central (IA/MN/WI), "
+    "E=East (IN/MI/OH/PA/MD), W=West (ND/SD/NE/KS/OK/TX & West), CW=Central+West."
+)
+
+REGIONAL_REC = [{
+    "product_list_name": ("RobSeeCo dealer network (Western/Central Corn Belt + "
+                          "Dakotas; Eastern Corn Belt via Federal Hybrids/Kiser Seed)"),
+    "agronomist": None,
+    "agronomist_email": None,
+    "variant_id": None,
+}]
+
+# Page sections (0-indexed). Content pages are duplicated consecutively,
+# so the ranges deliberately span both copies — dedup by source_key
+# handles the repeat.
+CORN_TABLE_PAGES = range(5, 9)      # p5-8
+CORN_CARD_PAGES = range(9, 19)      # p9-18
+SOY_TABLE_PAGES = range(19, 27)     # p19-26
+SOY_CARD_PAGES = range(21, 27)      # p21-26 (soy cards interleave after the table)
+
+log = logging.getLogger("scrape.robseeco")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    """Polite session with backoff. We make exactly one request (the PDF)
+    so this is mostly belt-and-suspenders, but it keeps the source uniform
+    with the other scrapers and gives retry/backoff on a flaky CDN."""
+
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(self, method: str, url: str, *, max_retries: int = 4,
+                timeout: float = 90.0, **kw: Any) -> requests.Response:
+        last_exc: Exception | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout,
+                                      allow_redirects=True, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s; retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            last_exc = None  # this attempt got a response; drop any stale error
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(
+                    30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s; retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        # Retries exhausted: surface the outcome of the final attempt.
+        if last_exc:
+            raise last_exc
+        return resp  # type: ignore[return-value]
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+
+def fetch_pdf(http: RateLimitedSession, *, force: bool) -> Path:
+    """Download the seed guide to the cache path (reused unless --force).
+    Reproducible for the monthly refresh — the scraper owns the download."""
+    if PDF_CACHE.exists() and PDF_CACHE.stat().st_size > 1_000_000 and not force:
+        log.info("using cached PDF %s (%d bytes)", PDF_CACHE, PDF_CACHE.stat().st_size)
+        return PDF_CACHE
+    PDF_CACHE.parent.mkdir(parents=True, exist_ok=True)
+    log.info("downloading seed guide %s", PDF_URL)
+    r = http.get(PDF_URL)
+    r.raise_for_status()
+    ct = r.headers.get("Content-Type", "")
+    if "pdf" not in ct.lower() and r.content[:4] != b"%PDF":
+        raise RuntimeError(f"unexpected content-type {ct!r} for {PDF_URL}")
+    PDF_CACHE.write_bytes(r.content)
+    log.info("cached PDF -> %s (%d bytes)", PDF_CACHE, len(r.content))
+    return PDF_CACHE
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class RSVariety:
+    source_key: str
+    crop: str                          # "corn" | "soybeans"
+    brand: str                         # "Rob-See-Co" | "Innotech"
+    product_name: str                  # the base code
+    relative_maturity: int | None = None     # corn
+    maturity_group: float | None = None      # soy
+    trait_stack: list[str] = field(default_factory=list)
+    positioning: str | None = None
+    strengths: list[str] = field(default_factory=list)
+    # characteristics_groups: [{label, items:[{characteristic,value}]}]
+    groups: list[dict] = field(default_factory=list)
+    from_table: bool = True            # False = card-only fallback
+
+
+# --------------------------------------------------------------------- header reconstruction
+
+
+def reconstruct_rotated_headers(page: pdfplumber.page.Page,
+                                top_max: float) -> list[tuple[float, str]]:
+    """Return ``[(x_center, label), ...]`` for the rotated 90deg column
+    headers in the top header band of a ratings-table page.
+
+    Rotated text comes back as ``upright=False`` words; each column shares
+    an x-position and reads bottom-to-top, so we cluster by ``round(x0)``,
+    order each column's glyphs by descending ``top`` (read order), join,
+    and reverse the string to recover the human-readable label.
+    """
+    words = page.extract_words(use_text_flow=False, keep_blank_chars=False,
+                               extra_attrs=["upright"])
+    rot = [w for w in words if not w["upright"] and w["top"] < top_max]
+    cols: dict[int, list[dict]] = defaultdict(list)
+    for w in rot:
+        cols[round(w["x0"])].append(w)
+    out: list[tuple[float, str]] = []
+    for x0 in sorted(cols):
+        ws = cols[x0]
+        ws.sort(key=lambda w: -w["top"])          # bottom glyph first
+        label = "".join(w["text"] for w in ws)[::-1]
+        cx = sum((w["x0"] + w["x1"]) / 2 for w in ws) / len(ws)
+        out.append((round(cx, 2), label))
+    return out
+
+
+# --------------------------------------------------------------------- column maps
+#
+# The decoded (reversed) rotated-header label -> a canonical column key.
+# Built once from the reconstructed headers per page; the per-row cell
+# alignment then keys on x-center, so these maps only need to recognise
+# the header *strings* the reconstruction produces.
+
+# Canonical corn columns (decoded label fragments -> canonical key).
+# We match on the de-spaced reversed string the reconstructor emits.
+CORN_HEADER_CANON = {
+    "RM": "RM",
+    "SilktoRM": "RM to Silk",
+    "BlacklayertoRM": "RM to Blacklayer",
+    "BlacklayertoGDU": "GDU to Blacklayer",
+    "Emergence": "Emergence",
+    "VigorSeedling": "Seedling Vigor",
+    "StrengthRoot": "Root Strength",
+    "StrengthStalk": "Stalk Strength",
+    "SnapGreen": "Green Snap",
+    "Staygreen": "Staygreen",
+    "Drydown": "Drydown",
+    "ToleranceDrought": "Drought Tolerance",
+    "HeightPlant": "Plant Height",
+    "HeightEar": "Ear Height",
+    "WeightTest": "Test Weight",
+    "SpotLeafGray": "Gray Leaf Spot",
+    "WiltGoss’s": "Goss's Wilt",
+    "WiltGoss's": "Goss's Wilt",
+    "CornNorthern": "Northern Corn Leaf Blight",
+    "BlightLeaf": "Northern Corn Leaf Blight",
+    "SpotTar": "Tar Spot",
+    "ResponseFungicide": "Fungicide Response (Absence of Disease in Continuous Corn)",
+    "DiseaseofAbsencein": "Fungicide Response (Absence of Disease in Continuous Corn)",
+    "RatePlanting": "Planting Rate Guideline",
+    "Guidelines": "Planting Rate Guideline",
+    "SoilsProductiveHighly": "Continuous-corn fit: Highly Productive Soils",
+    "SoilsVariable": "Continuous-corn fit: Variable Soils",
+    "SoilsDrainedPoorly": "Continuous-corn fit: Poorly Drained Soils",
+    "AgronomicCorn:onCorn": "Corn-on-Corn Agronomic Characteristics",
+    "Characteristics": "Corn-on-Corn Agronomic Characteristics",
+    "RegionRecommended": "Product Fit Geography",
+}
+
+# Corn classification of canonical column -> chunker bucket label.
+CORN_DISEASE = {
+    "Gray Leaf Spot", "Goss's Wilt", "Northern Corn Leaf Blight", "Tar Spot",
+    "Fungicide Response (Absence of Disease in Continuous Corn)",
+}
+CORN_AGRONOMIC = {
+    "Emergence", "Seedling Vigor", "Root Strength", "Stalk Strength",
+    "Green Snap", "Staygreen", "Drydown", "Drought Tolerance",
+    "Plant Height", "Ear Height", "Test Weight",
+}
+# Everything else (RM-to-Silk/Blacklayer, GDU, planting rate, soil fit,
+# product-fit geography, corn-on-corn) is passthrough placement context.
+CORN_PASSTHROUGH = "Planting & Placement"
+
+# Canonical soy columns (decoded reversed-label fragments -> key). The
+# soy reconstruction joins the two label words; match on the de-spaced
+# string. Two-word labels like "Maturity Relative" reverse word-order.
+SOY_HEADER_CANON = {
+    "MaturityRelative": "Relative Maturity",
+    "TraitHerbicide": "Herbicide Trait",
+    "Herbicide": "Herbicide Trait",
+    "TraitTolerance": "Herbicide Trait",
+    "ResistanceSCN": "SCN Resistance Source",
+    "Source": "SCN Resistance Source",
+    "ChlorosisIron": "Iron Chlorosis (IDC) Tolerance",
+    "(IDC)Tolerance": "Iron Chlorosis (IDC) Tolerance",
+    "Phytophthora": "Phytophthora",   # disambiguated by x-order below
+    "GeneResistance": "Phytophthora Gene Resistance",
+    "(PRR)RatingField": "Phytophthora (PRR) Field Rating",
+    "RotStemBrown": "Brown Stem Rot (BSR)",
+    "(BSR)": "Brown Stem Rot (BSR)",
+    "MoldWhite": "White Mold (SWM) Tolerance",
+    "(SWM)Tolerance": "White Mold (SWM) Tolerance",
+    "DeathSudden": "Sudden Death Syndrome (SDS) Tolerance",
+    "(SDS)Syndrome": "Sudden Death Syndrome (SDS) Tolerance",
+    "Emergence": "Emergence",
+    "Standability": "Standability",
+    "HeightPlant": "Plant Height for Maturity",
+    "Maturityfor": "Plant Height for Maturity",
+    "WidthCanopy": "Canopy Width / Plant Type",
+    "SoilsVariable": "Variable Soils fit",
+    "ProductiveHigh": "High Productive Soil fit",
+    "Soil": "High Productive Soil fit",
+    "StressandVariableEnvironments": "Stress & Variable Environments fit",
+    "EnvironmentsYieldHigh": "High Yield Environments fit",
+    "RegionRecommended": "Product Fit Geography",
+    "Recommended": "Product Fit Geography",
+    "Region": "Product Fit Geography",
+}
+SOY_DISEASE = {
+    "SCN Resistance Source", "Iron Chlorosis (IDC) Tolerance",
+    "Phytophthora Gene Resistance", "Phytophthora (PRR) Field Rating",
+    "Brown Stem Rot (BSR)", "White Mold (SWM) Tolerance",
+    "Sudden Death Syndrome (SDS) Tolerance",
+}
+SOY_AGRONOMIC = {
+    "Emergence", "Standability", "Plant Height for Maturity",
+    "Canopy Width / Plant Type",
+}
+SOY_PASSTHROUGH = "Placement"
+
+
+def _despace(s: str) -> str:
+    return re.sub(r"\s+", "", s or "")
+
+
+def build_corn_colmap(headers: list[tuple[float, str]]) -> list[tuple[float, str, str]]:
+    """Return ``[(x_center, canonical_key, bucket), ...]`` sorted by x.
+    bucket in {"disease","agronomic","pass"}."""
+    out: list[tuple[float, str, str]] = []
+    for cx, raw in headers:
+        key = CORN_HEADER_CANON.get(_despace(raw))
+        if not key:
+            continue
+        if key in CORN_DISEASE:
+            bucket = "disease"
+        elif key in CORN_AGRONOMIC:
+            bucket = "agronomic"
+        else:
+            bucket = "pass"
+        out.append((cx, key, bucket))
+    out.sort(key=lambda t: t[0])
+    return out
+
+
+def build_soy_colmap(headers: list[tuple[float, str]]) -> list[tuple[float, str, str]]:
+    out: list[tuple[float, str, str]] = []
+    seen_phyto = False
+    for cx, raw in sorted(headers, key=lambda t: t[0]):
+        ds = _despace(raw)
+        key = SOY_HEADER_CANON.get(ds)
+        if key == "Phytophthora":
+            # First "Phytophthora" header = the gene-resistance column;
+            # the (PRR)RatingField label sits at the next column.
+            key = "Phytophthora Gene Resistance" if not seen_phyto else None
+            seen_phyto = True
+        if not key:
+            continue
+        if key in SOY_DISEASE:
+            bucket = "disease"
+        elif key in SOY_AGRONOMIC:
+            bucket = "agronomic"
+        else:
+            bucket = "pass"
+        out.append((cx, key, bucket))
+    out.sort(key=lambda t: t[0])
+    return out
+
+
+# --------------------------------------------------------------------- table row parsing
+
+
+_CORN_CODE = re.compile(r"^(RC\d{4}|[A-Z]\d{2}-\d{2})$")
+_SOY_CODE = re.compile(r"^(IS|RS)\d.*$")
+
+
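+def _nearest_col(cx: float, colmap: list[tuple[float, str, str]],
+                 tol: float) -> tuple[str, str] | None:
+    """Return ``(key, bucket)`` of the column whose x-center is closest
+    to ``cx``, or ``None`` when no column lies within ``tol``."""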
+    best = None
+    bestd = tol
+    for x, key, bucket in colmap:
+        d = abs(x - cx)
+        if d < bestd:
+            bestd = d
+            best = (key, bucket)
+    return best
+
+
+def _row_words(page: pdfplumber.page.Page) -> dict[int, list[dict]]:
+    """Group upright words on a page into rows keyed by rounded top."""
+    words = [w for w in page.extract_words(use_text_flow=False,
+                                            keep_blank_chars=False,
+                                            extra_attrs=["upright"])
+             if w["upright"]]
+    rows: dict[int, list[dict]] = defaultdict(list)
+    for w in words:
+        # snap near-equal tops together
+        key = None
+        for k in list(rows):
+            if abs(k - w["top"]) < 4:
+                key = k
+                break
+        rows[key if key is not None else round(w["top"])].append(w)
+    return rows
+
+
+def parse_corn_table(page: pdfplumber.page.Page,
+                     colmap: list[tuple[float, str, str]]) -> list[RSVariety]:
+    out: list[RSVariety] = []
+    rows = _row_words(page)
+    for _top, ws in rows.items():
+        ws = sorted(ws, key=lambda w: w["x0"])
+        if not ws:
+            continue
+        code = ws[0]["text"]
+        if not _CORN_CODE.match(code) or ws[0]["x0"] > 80:
+            continue
+        cells = ws[1:]
+        # A genuine data row carries numeric cells (the RM integer follows
+        # the code); skip rows that have none.
+        nums = [c for c in cells if re.fullmatch(r"-?\d+(?:\.\d+)?", c["text"])]
+        if not nums:
+            continue
+        v = _assemble_corn(code, cells, colmap)
+        if v:
+            out.append(v)
+    return out
+
+
+def _assemble_corn(code: str, cells: list[dict],
+                   colmap: list[tuple[float, str, str]]) -> RSVariety | None:
+    disease: list[dict] = []
+    agronomic: list[dict] = []
+    passthrough: list[dict] = []
+    rm: int | None = None
+    placed = 0
+    for c in cells:
+        cx = (c["x0"] + c["x1"]) / 2
+        # The page-number column sits far right (x-center > 738); ignore it.
+        if cx > 738:
+            continue
+        hit = _nearest_col(cx, colmap, tol=13.0)
+        if not hit:
+            continue
+        key, bucket = hit
+        val = c["text"].strip()
+        if key == "RM":
+            try:
+                rm = int(float(val))
+            except ValueError:
+                pass
+            continue
+        # Record a "-" placeholder for blank cells so the chunk still
+        # shows the column.
+        item = {"characteristic": key, "value": val or "-"}
+        if bucket == "disease":
+            disease.append(item)
+        elif bucket == "agronomic":
+            agronomic.append(item)
+        else:
+            passthrough.append(item)
+        placed += 1
+    # A clean corn row aligns into a healthy number of columns; if almost
+    # nothing lined up, the caller falls back to a card-only record.
+    if placed < 8 or rm is None:
+        return None
+    groups: list[dict] = []
+    if agronomic:
+        groups.append({"label": "AGRONOMIC CHARACTERISTICS", "items": agronomic})
+    if disease:
+        groups.append({"label": "DISEASE RATINGS", "items": disease})
+    if passthrough:
+        groups.append({"label": CORN_PASSTHROUGH, "items": passthrough})
+    return RSVariety(
+        source_key=f"robseeco-{code.lower()}",
+        crop="corn",
+        brand="Rob-See-Co",
+        product_name=code,
+        relative_maturity=rm,
+        groups=groups,
+    )
+
+
+def parse_soy_table(page: pdfplumber.page.Page,
+                    colmap: list[tuple[float, str, str]]) -> list[RSVariety]:
+    out: list[RSVariety] = []
+    rows = _row_words(page)
+    for _top, ws in rows.items():
+        ws = sorted(ws, key=lambda w: w["x0"])
+        if not ws:
+            continue
+        code = ws[0]["text"]
+        if not _SOY_CODE.match(code) or ws[0]["x0"] > 80:
+            continue
+        v = _assemble_soy(code, ws[1:], colmap)
+        if v:
+            out.append(v)
+    return out
+
+
+def _assemble_soy(code: str, cells: list[dict],
+                  colmap: list[tuple[float, str, str]]) -> RSVariety | None:
+    disease: list[dict] = []
+    agronomic: list[dict] = []
+    passthrough: list[dict] = []
+    mg: float | None = None
+    trait: str | None = None
+    # Phytophthora gene cells can be two tokens ("Rps1c, Rps3a") split by
+    # a space; merge any cell that lands in the same column.
+    bycol: dict[str, list[str]] = defaultdict(list)
+    colbucket: dict[str, str] = {}
+    placed = 0
+    for c in cells:
+        cx = (c["x0"] + c["x1"]) / 2
+        if cx > 738:
+            continue
+        hit = _nearest_col(cx, colmap, tol=16.0)
+        if not hit:
+            continue
+        key, bucket = hit
+        bycol[key].append(c["text"].strip())
+        colbucket[key] = bucket
+        placed += 1
+    for key, parts in bycol.items():
+        val = " ".join(p for p in parts if p).strip()
+        if key == "Relative Maturity":
+            try:
+                mg = float(val)
+            except ValueError:
+                pass
+            continue
+        if key == "Herbicide Trait":
+            trait = val or None
+            continue
+        item = {"characteristic": key, "value": val or "-"}
+        b = colbucket[key]
+        if b == "disease":
+            disease.append(item)
+        elif b == "agronomic":
+            agronomic.append(item)
+        else:
+            passthrough.append(item)
+    if placed < 6 or mg is None:
+        return None
+    brand = "Innotech" if code.upper().startswith("IS") else "Rob-See-Co"
+    trait_stack = [trait] if trait else _trait_from_soy_code(code)
+    groups: list[dict] = []
+    if agronomic:
+        groups.append({"label": "AGRONOMIC CHARACTERISTICS", "items": agronomic})
+    if disease:
+        groups.append({"label": "DISEASE RATINGS", "items": disease})
+    if passthrough:
+        groups.append({"label": SOY_PASSTHROUGH, "items": passthrough})
+    return RSVariety(
+        source_key=f"robseeco-{code.lower()}",
+        crop="soybeans",
+        brand=brand,
+        product_name=code,
+        maturity_group=mg,
+        trait_stack=trait_stack,
+        groups=groups,
+    )
+
+
+def _trait_from_soy_code(code: str) -> list[str]:
+    """Fallback trait from the code suffix (E3 / E3S / XF / XFS)."""
+    m = re.search(r"(E3S|E3|XFS|XF)$", code.upper())
+    if not m:
+        return []
+    mapping = {"E3": "E3", "E3S": "E3/STS", "XF": "XF", "XFS": "XF/STS"}
+    return [mapping[m.group(1)]]
+
+
+# --------------------------------------------------------------------- card parsing
+
+
+# Try the hyphenated D94-26 form FIRST so a code like ``N97-55-Organic``
+# splits to base ``N97-55`` + trait ``Organic`` rather than base ``N97``.
+_CARD_CODE = re.compile(r"^([A-Z]\d{2}-\d{2}|[A-Z]{1,3}\d{2,4}[A-Z]?)(?:-(\S+))?$")
+
+# Recognised corn trait suffixes — used to filter card-derived traits so a
+# mis-split sibling fragment (e.g. "55-Conv") never leaks into trait_stack.
+_KNOWN_CORN_TRAITS = {
+    "Conv", "RR2", "VT2P", "DGVT2P", "GT", "GTA", "GT/LL", "GT/LLA", "SS",
+    "SSP", "TRE", "AA", "DV", "D", "V", "VZ", "PCE", "3110A", "Artesian",
+    "Organic",
+}
+
+
+# A soybean card code is the full variety code (IS0325E3 / RS3437XFS) —
+# no trait suffix, so the whole token is the base.
+_SOY_CARD_CODE = re.compile(r"^(IS|RS)\d{2,5}(E3S|E3|XFS|XF)$", re.I)
+
+
+def _card_base(token: str) -> tuple[str, str | None]:
+    """``RC4185-VT2P`` -> ("RC4185","VT2P"); ``D94-26-VT2P`` -> ("D94-26","VT2P");
+    ``IS0325E3`` -> ("IS0325E3", None)."""
+    if _SOY_CARD_CODE.match(token):
+        return token, None
+    m = _CARD_CODE.match(token)
+    if not m:
+        return token, None
+    return m.group(1), m.group(2)
+
+
+def _is_card_code(token: str) -> bool:
+    return bool(_SOY_CARD_CODE.match(token) or _CARD_CODE.match(token))
+
+
+def parse_cards(page: pdfplumber.page.Page, *, ncols: int = 2) -> list[dict]:
+    """Return ``[{bases:set, traits:set, bullets:[str]}, ...]`` for the
+    descriptive cards on a card page. Corn cards are a 2-column layout,
+    soy cards a 3-column layout — ``ncols`` sets the equal-width column
+    split. Within a column a card is a run of size-14 code lines followed
+    by its size-9.5 bullet lines (sub-11pt continuation lines fold into
+    the preceding bullet)."""
+    width = page.width
+    bounds = [(width * i / ncols, width * (i + 1) / ncols) for i in range(ncols)]
+    words = [w for w in page.extract_words(use_text_flow=False,
+                                           keep_blank_chars=False,
+                                           extra_attrs=["upright", "size"])
+             if w["upright"]]
+    cards: list[dict] = []
+    for lo, hi in bounds:
+        colw = [w for w in words if lo <= w["x0"] < hi and 140 < w["top"] < 1180]
+        colw.sort(key=lambda w: (round(w["top"]), w["x0"]))
+        lines: list[dict] = []
+        for w in colw:
+            if lines and abs(w["top"] - lines[-1]["top"]) < 5:
+                lines[-1]["ws"].append(w)
+            else:
+                lines.append({"top": w["top"], "ws": [w]})
+        cur: dict | None = None
+        for ln in lines:
+            ws = sorted(ln["ws"], key=lambda w: w["x0"])
+            txt = " ".join(w["text"] for w in ws).strip()
+            sz = max(w["size"] for w in ws)
+            collapsed = txt.replace(" ", "")
+            if sz >= 13 and _is_card_code(collapsed) and len(collapsed) <= 20:
+                if cur is None or cur["bullets"]:
+                    cur = {"bases": set(), "traits": set(), "bullets": []}
+                    cards.append(cur)
+                base, trait = _card_base(collapsed)
+                cur["bases"].add(base)
+                if trait:
+                    cur["traits"].add(trait)
+            elif txt.startswith("•"):
+                if cur is not None:
+                    cur["bullets"].append(_clean_bullet(txt.lstrip("• ")))
+            elif sz < 11 and cur is not None and cur["bullets"]:
+                cur["bullets"][-1] = _clean_bullet(cur["bullets"][-1] + " " + txt)
+    return cards
+
+
+def _clean_bullet(s: str) -> str:
+    s = re.sub(r"\s+", " ", (s or "")).strip()
+    s = re.sub(r"\s*®", "®", s)             # "Artesian ®" -> "Artesian®"
+    s = re.sub(r"\s*\bNEW\b\s*$", "", s).strip()  # trailing NEW badge
+    return s
+
+
+def build_card_groups(pdf: pdfplumber.PDF, pages: range, *, ncols: int = 2) -> list[dict]:
+    """Return the list of distinct card groups across the card pages,
+    each ``{bases:set(UPPER), traits:set, bullets:[...]}``. Pages are
+    deduped by their reconstructed card signature so the duplicated page
+    copy doesn't double the cards. A card can group several sibling codes
+    for one hybrid (e.g. legacy A94-16 / G94-86 alongside the table's
+    D94-26) — keeping the grouping lets the caller fold those into the
+    table record instead of minting duplicate identities."""
+    groups: list[dict] = []
+    seen_sig: set[tuple] = set()
+    for pi in pages:
+        cards = parse_cards(pdf.pages[pi], ncols=ncols)
+        sig = tuple(sorted(b for c in cards for b in c["bases"]))
+        if not sig or sig in seen_sig:
+            continue
+        seen_sig.add(sig)
+        for c in cards:
+            if not c["bases"]:
+                continue
+            groups.append({
+                "bases": {b.upper() for b in c["bases"]},
+                "traits": set(c["traits"]),
+                "bullets": [b for b in c["bullets"] if b],
+            })
+    return groups
+
+
+# --------------------------------------------------------------------- extraction orchestration
+
+
+# Corn trait suffix ordering preference for a stable trait_stack.
+_CORN_TRAIT_ORDER = ["Conv", "RR2", "VT2P", "DGVT2P", "GT", "GTA", "GT/LL",
+                     "GT/LLA", "SS", "SSP", "TRE", "AA", "DV", "D", "V", "VZ",
+                     "PCE", "3110A", "Artesian", "Organic"]
+
+
+def _sorted_corn_traits(traits: set[str]) -> list[str]:
+    order = {t: i for i, t in enumerate(_CORN_TRAIT_ORDER)}
+    clean = {t for t in traits if t in _KNOWN_CORN_TRAITS}
+    return sorted(clean, key=lambda t: (order.get(t, 999), t))
+
+
+def _group_for_base(groups: list[dict], base: str) -> dict | None:
+    for g in groups:
+        if base in g["bases"]:
+            return g
+    return None
+
+
+def extract(pdf: pdfplumber.PDF, *, only_crop: str | None) -> list[RSVariety]:
+    varieties: dict[str, RSVariety] = {}
+
+    # ---- Corn ----------------------------------------------------------
+    if only_crop in (None, "corn"):
+        corn_groups = build_card_groups(pdf, CORN_CARD_PAGES)
+        table_bases: set[str] = set()
+        # Pass 1: table records (authoritative ratings), enriched from the
+        # card group that contains the base.
+        for pi in CORN_TABLE_PAGES:
+            page = pdf.pages[pi]
+            headers = reconstruct_rotated_headers(page, top_max=320)
+            colmap = build_corn_colmap(headers)
+            if len(colmap) < 12:
+                log.warning("p%d corn header reconstruction thin (%d cols) — skipping",
+                            pi, len(colmap))
+                continue
+            for v in parse_corn_table(page, colmap):
+                if v.source_key in varieties:
+                    continue
+                base = v.product_name.upper()
+                g = _group_for_base(corn_groups, base)
+                if g:
+                    v.trait_stack = _sorted_corn_traits(g["traits"])
+                    v.strengths = list(g["bullets"])
+                varieties[v.source_key] = v
+                table_bases.add(base)
+        # Pass 2: card-only fallback — one record per card group that has
+        # NO table-backed sibling. Sibling legacy codes that share a card
+        # with a table hybrid fold into that record (no duplicate identity).
+        for g in corn_groups:
+            if g["bases"] & table_bases or not g["bullets"]:
+                continue
+            # Pick a deterministic primary base (prefer an RC#### code).
+            cand = [b for b in g["bases"] if _CORN_CODE.match(b)]
+            if not cand:
+                continue
+            primary = min(cand, key=lambda b: (not b.startswith("RC"), b))
+            sk = f"robseeco-{primary.lower()}"
+            if sk in varieties:
+                continue
+            varieties[sk] = RSVariety(
+                source_key=sk, crop="corn", brand="Rob-See-Co",
+                product_name=primary,
+                trait_stack=_sorted_corn_traits(g["traits"]),
+                strengths=list(g["bullets"]), from_table=False)
+
+    # ---- Soy -----------------------------------------------------------
+    if only_crop in (None, "soybeans"):
+        soy_groups = build_card_groups(pdf, SOY_CARD_PAGES, ncols=3)
+        soy_table_bases: set[str] = set()
+        for pi in SOY_TABLE_PAGES:
+            page = pdf.pages[pi]
+            headers = reconstruct_rotated_headers(page, top_max=330)
+            colmap = build_soy_colmap(headers)
+            if len(colmap) < 8:
+                log.warning("p%d soy header reconstruction thin (%d cols), skipping",
+                            pi, len(colmap))
+                continue
+            for v in parse_soy_table(page, colmap):
+                if v.source_key in varieties:
+                    continue
+                g = _group_for_base(soy_groups, v.product_name.upper())
+                if g:
+                    v.strengths = list(g["bullets"])
+                varieties[v.source_key] = v
+                soy_table_bases.add(v.product_name.upper())
+        for g in soy_groups:
+            if g["bases"] & soy_table_bases or not g["bullets"]:
+                continue
+            cand = sorted(b for b in g["bases"] if _SOY_CODE.match(b))
+            if not cand:
+                continue
+            primary = cand[0]
+            sk = f"robseeco-{primary.lower()}"
+            if sk in varieties:
+                continue
+            brand = "Innotech" if primary.startswith("IS") else "Rob-See-Co"
+            varieties[sk] = RSVariety(
+                source_key=sk, crop="soybeans", brand=brand, product_name=primary,
+                trait_stack=_trait_from_soy_code(primary),
+                strengths=list(g["bullets"]), from_table=False)
+
+    return list(varieties.values())
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(v: RSVariety) -> str:
+    """Render one variety as Markdown: header bullets, rating groups, strengths."""
+    crop_label = "Corn" if v.crop == "corn" else "Soybeans"
+    head: list[str] = [
+        f"# {v.product_name}",
+        "",
+        "- **Vendor:** RobSeeCo",
+        f"- **Brand:** {v.brand}",
+        f"- **Crop:** {crop_label}",
+    ]
+    if v.crop == "corn" and v.relative_maturity is not None:
+        head.append(f"- **Relative maturity:** {v.relative_maturity} days")
+    if v.crop == "soybeans" and v.maturity_group is not None:
+        head.append(f"- **Maturity group:** {v.maturity_group}")
+    if v.trait_stack:
+        head.append(f"- **Trait(s):** {', '.join(v.trait_stack)}")
+    head.append(f"- **Source:** {PRODUCTS_URL}")
+    head.append(f"- **Rating scale:** {SCALE_DIRECTION}")
+    head.append("- **Service area:** RobSeeCo dealer network — Western/Central "
+                "Corn Belt + Dakotas (Eastern Corn Belt via Federal Hybrids/Kiser Seed)")
+    head.append("")
+    if not v.from_table:
+        head += ["_Identity from the descriptive card; no structured ratings table "
+                 "row matched._", ""]
+    head += ["---", ""]
+    for g in v.groups:
+        head.append(f"## {g['label'].title()}")
+        head.append("")
+        for it in g["items"]:
+            head.append(f"- **{it['characteristic']}:** {it['value'] or '—'}")
+        head.append("")
+    if v.strengths:
+        head += ["## Strengths", ""]
+        for s in v.strengths:
+            head.append(f"- {s}")
+        head.append("")
+    return "\n".join(head)
+
+
+def write_variety(v: RSVariety) -> None:
+    """Write the Markdown page and its JSON sidecar for ``v`` into CORPUS_DIR."""
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    (CORPUS_DIR / f"{v.source_key}.md").write_text(render_markdown(v), encoding="utf-8")
+    sidecar = {
+        "source": "robseeco",
+        "source_key": v.source_key,
+        "vendor": "RobSeeCo",
+        "brand": v.brand,
+        "product_name": v.product_name,
+        "product_id": v.product_name,
+        "crop": v.crop,
+        "release_year": None,
+        "relative_maturity": v.relative_maturity,
+        "maturity_group": v.maturity_group,
+        "wheat_class": None,
+        "trait_stack": v.trait_stack,
+        "trait_descriptions": [],
+        "positioning_statement": v.positioning,
+        "strengths": v.strengths,
+        "characteristics_groups": v.groups,
+        "_scale_direction": SCALE_DIRECTION,
+        "regional_recommendations": REGIONAL_REC,
+        "image_url": None,
+        "source_urls": [PRODUCTS_URL],
+        "sitemap_last_modified": None,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    (CORPUS_DIR / f"{v.source_key}.json").write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
+
+
+# --------------------------------------------------------------------- pipeline
+
+
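+def run(*, limit: int | None, force: bool,
+        only_crop: str | None, only_product: str | None) -> int:
+    """Fetch the guide, extract varieties, and write one .md/.json pair each.
+
+    Returns a process exit code: 0 on success, 2 if ``only_product``
+    matched nothing. Existing files are skipped unless ``force`` is set.
+    """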
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    pdf_path = fetch_pdf(http, force=force)
+
+    with pdfplumber.open(pdf_path) as pdf:
+        varieties = extract(pdf, only_crop=only_crop)
+
+    if only_product:
+        key = only_product.lower()
+        varieties = [v for v in varieties
+                     if v.source_key == key or v.product_name.lower() == key]
+        if not varieties:
+            log.error("no variety matched --product=%s", only_product)
+            return 2
+
+    varieties.sort(key=lambda v: (v.crop, v.brand, v.source_key))
+
+    counts = {"written": 0, "skipped": 0, "card_only": 0,
+              "corn": 0, "soybeans": 0}
+    processed = 0
+    for v in varieties:
+        if limit is not None and processed >= limit:
+            break
+        processed += 1
+        md_path = CORPUS_DIR / f"{v.source_key}.md"
+        if md_path.exists() and not force:
+            counts["skipped"] += 1
+            log.info("[%d/%d] %s skipped", processed, len(varieties), v.source_key)
+            continue
+        write_variety(v)
+        counts["written"] += 1
+        counts[v.crop] += 1
+        if not v.from_table:
+            counts["card_only"] += 1
+        log.info("[%d/%d] %s written | crop=%s brand=%s rm/mg=%s traits=%s "
+                 "groups=%d strengths=%d%s",
+                 processed, len(varieties), v.source_key, v.crop, v.brand,
+                 v.relative_maturity if v.crop == "corn" else v.maturity_group,
+                 ",".join(v.trait_stack) or "-", len(v.groups), len(v.strengths),
+                 " [CARD-ONLY]" if not v.from_table else "")
+
+    log.info("done: processed=%d written=%d skipped=%d card_only=%d "
+             "| corn=%d soybeans=%d (of %d)",
+             processed, counts["written"], counts["skipped"], counts["card_only"],
+             counts["corn"], counts["soybeans"], len(varieties))
+    return 0
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.robseeco",
+        description="Scrape the RobSeeCo (Rob-See-Co / Innotech) 2026 Seed Guide "
+                    "PDF — corn + soybean variety identity + 1-9 ratings.")
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after processing N varieties, including ones "
+                        "skipped as already written (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-download the PDF and re-write all variety files.")
+    p.add_argument("--crop", default=None, choices=["corn", "soybeans"],
+                   help="Limit to one crop.")
+    p.add_argument("--product", default=None,
+                   help="Process a single variety by source_key or code.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr)
+    return run(limit=args.limit, force=args.force,
+               only_crop=args.crop, only_product=args.product)
+
+
+if __name__ == "__main__":
+    sys.exit(main())