Add 4 independent seed brands: Latham, Stine, 1st Choice, Burrus (+623 varieties)

Four independent regional brands across IA/IN/IL (variety-identity sources, each parsed into structured characteristics_groups so ratings embed):

- latham (264: 155 corn + 109 soy): Latham Hi-Tech Seeds, Alexander IA. WordPress REST enum (/wp-json/wp/v2/varieties) + /products/<slug>/ detail HTML. Scale 1-9 LOWER=better (reversed, like NK/AgriPro).
- stine (217: 58 corn + 159 soy): Stine Seed, Adel IA (largest US independent). Sitemap enum + /{crop}/traits/<slug>/<code>/ detail HTML. Corn 1-9 (9=best); soy qualitative.
- first_choice (78: 52 corn + 22 soy + 4 wheat): 1st Choice Seeds, Rushville IN (employee-owned). Per-crop sitemap -> detail HTML. Scale 0-10 higher=better. ~40 older corn pages thin at source; wheat identity-only.
- burrus (64: 38 corn + 26 soy): Burrus Seed, Arenzville IL. Seedware JSON API. Scale 1-10 (10=best). Brands Burrus/Power Plus/DONMARIO. robots ai-train=no + named-bot blocks; operator opted in, scraper uses a non-blacklisted UA + honors Crawl-delay 10.

All 623 validated through rag.chunk.chunks_from_variety (0 errors; 6 identity-only pages from source gaps). No chunk.py change needed (identity sources auto-route to chunks_from_variety).

Docs:

- sources.json: 4 entries + Hoegemeyer added to _excluded_sources. The Corteva ToU (shared across pioneer.com / hoegemeyer.com / therightseed.com / corteva.com + the Vylor spinoff) bans scrapers + competitive use, so the whole Corteva family is one excluded ToU domain.
- docs_mcp/lessons.md: rating-scales updated with all 4 directions + an explicit cross-vendor warning (Latham 1=best vs Stine/Burrus higher=best; never compare raw numbers without _scale_direction).
- README + CLAUDE corpus inventory: now 2,268 variety + 6,787 trial records.

CI rebuilds the index from the committed corpus.
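The cross-vendor warning above can be made concrete with a small sketch. It assumes a hypothetical `SCALES` table transcribed from the scale notes in this message (the real `_scale_direction` field is free text, and neither `Scale` nor `normalize_rating` exists in the repository); it only illustrates why raw numbers from different brands should not be compared directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    lo: float
    hi: float
    higher_is_better: bool


# Hypothetical table, transcribed from the commit message (not from the codebase).
SCALES = {
    "latham": Scale(1, 9, higher_is_better=False),       # 1 = best
    "stine": Scale(1, 9, higher_is_better=True),         # corn only; soy is qualitative
    "first_choice": Scale(0, 10, higher_is_better=True),
    "burrus": Scale(1, 10, higher_is_better=True),
}


def normalize_rating(source: str, raw: float) -> float:
    """Map a raw vendor rating onto 0.0 (worst) .. 1.0 (best)."""
    scale = SCALES[source]
    if not scale.lo <= raw <= scale.hi:
        raise ValueError(f"{raw} is outside the {source} scale {scale.lo}-{scale.hi}")
    frac = (raw - scale.lo) / (scale.hi - scale.lo)
    # Flip reversed scales so that 1.0 always means "best".
    return frac if scale.higher_is_better else 1.0 - frac
```

Under these assumptions a Latham 1 and a Stine corn 9 both normalize to 1.0, even though the raw numbers sit at opposite ends of the same 1-9 range.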
2026-06-04 21:57:30 -04:00
parent 22e8092faf
commit 9600235466
1254 changed files with 103589 additions and 4 deletions
@@ -0,0 +1,671 @@
+"""1st Choice Seeds scraper — employee-owned independent (Rushville, IN).
+
+Source: ``www.1stchoiceseeds.com`` — a plain Apache/PHP WordPress site
+(All in One SEO). 1st Choice Seeds is an **independent, employee-owned**
+seed company in Rushville, Indiana, serving the Eastern Corn Belt
+(IN/OH/KY/TN). Corn hybrids / soybeans / wheat (plus a cover-crop line
+that is out of scope for the row-crop advisor).
+
+Discovery is by **sitemap**, NOT the WP REST API: the catalog custom
+post types (corn-hybrids / soybeans / wheat) are NOT exposed to
+``/wp-json/`` (every variety route returns ``rest_no_route``). Instead we
+fetch ``/sitemap.xml`` (an All-in-One-SEO sitemap *index*) and follow the
+per-crop child sitemaps:
+
+  - ``/corn-hybrids-sitemap.xml`` -> ``/corn-hybrids/<slug>/``  (~52 URLs)
+  - ``/soybeans-sitemap.xml``     -> ``/soybeans/<slug>/``      (~22 URLs)
+  - ``/wheat-sitemap.xml``        -> ``/wheat/<slug>/``         (~4 URLs)
+
+robots.txt is permissive (``User-agent: *`` / ``Disallow: /wp-admin/`` /
+``Allow: /wp-admin/admin-ajax.php`` + a ``Sitemap:`` line). No Crawl-delay,
+no Terms-of-Use page, no bot wall. We use a descriptive UA and ~1.2 s
+between requests.
+
+Detail-page DOM (server-rendered, no JS needed for the text):
+  * Product name: the second ``<h1>`` inside ``article.content`` (the
+    first is the site logo "1st Choice Seeds").
+  * Corn — three ``<h2>`` sections + a side table:
+      - "Hybrid Characteristics": a single ``<p>`` of ``label • value``
+        lines split on ``<br>`` (Seedling Vigor, Plant Height, Ear
+        Placement, Root Rating, Stalk Rating, Foliar Health, Drydown,
+        Ear Length/Girth/Flex, Test Weight). Some hybrids only publish
+        Seedling Vigor (genuinely thin pages — still written).
+      - "Hybrid Ratings": a ``ul.chart-key`` legend + a ``div.d3-chart``
+        (the numeric 0-10 bars are drawn client-side by d3 and are NOT
+        in the HTML). The legend IS the scale: 0-4 Below Average … 9-10
+        Superior, so higher = better.
+      - "Management Tips": ``label: value`` lines (Corn-On-Corn,
+        Productivity / soil guidance, Silage Rating).
+      - A ``<table>`` carrying Relative Maturity, Degree Days (GDU), and
+        the Low/Medium/High recommended planting populations.
+  * Soybeans — three ``<h2>`` sections:
+      - "Field Notes": a ``<ul>`` of strengths (often includes SCN
+        source / PRR gene call-outs).
+      - "Soybean Ratings": ``ul.chart-key`` legend only (same d3 chart).
+      - "Variety Description": ``div`` blocks of ``<b>Label:</b> value``
+        pairs (Maturity = MG, Plant Type, Plant Height, PRR Gene, Flower
+        Color, Pubescence, Pod, Hilum).
+  * Wheat — thin (title + date only; wheat is private-label). We still
+    write an identity record so the variety is discoverable.
+
+Rating scale: the published legend is **0-10, higher = better**
+("Below Average 0-4, Average 5, Good 6, Very Good 7, Excellent 8,
+Superior 9-10"). 1st Choice publishes the *qualitative* word
+(Excellent / Very Good / …) in the HTML — those map directly onto that
+legend — while the numeric bar is d3-rendered and absent from the
+markup. NA / blank = not rated.
+
+Output:
+  corpus/first_choice/<source_key>.md
+  corpus/first_choice/<source_key>.json
+
+source_key: ``firstchoice-<slug>`` lowercased, e.g.
+``firstchoice-fc-8455-vt2p`` or ``firstchoice-fb-2733-en``.
+
+CLI:
+  python -m scrape.sources.first_choice --crop corn --limit 5
+  python -m scrape.sources.first_choice --force
+  python -m scrape.sources.first_choice --product firstchoice-fc-8455-vt2p
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import random
+import re
+import sys
+import time
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import requests
+from bs4 import BeautifulSoup, NavigableString, Tag
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://www.1stchoiceseeds.com"
+SITEMAP_INDEX = f"{BASE}/sitemap.xml"
+
+# Per-crop child sitemap -> chunker crop value. The chunker keys on
+# "soybeans" (plural) for the MG branch, so map accordingly. The
+# cover-crops sitemap is intentionally omitted (out of scope for the
+# row-crop advisor).
+CROP_SITEMAPS = {
+    "corn": "corn-hybrids-sitemap.xml",
+    "soybeans": "soybeans-sitemap.xml",
+    "wheat": "wheat-sitemap.xml",
+}
+
+# URL path prefix that confirms a sitemap entry is a variety detail page
+# (vs. a category/archive page that can sneak into a child sitemap).
+CROP_PATH = {
+    "corn": "/corn-hybrids/",
+    "soybeans": "/soybeans/",
+    "wheat": "/wheat/",
+}
+
+# robots.txt declares no Crawl-delay; we stay polite. The full row-crop
+# catalog is ~78 detail pages, so ~1.2 s/req finishes in a couple min.
+REQ_INTERVAL_SEC = 1.2
+
+RATING_SCALE_DIRECTION = (
+    "0-10, higher = better (legend: 0-4 Below Average, 5 Average, "
+    "6 Good, 7 Very Good, 8 Excellent, 9-10 Superior); 1st Choice "
+    "publishes the qualitative word in HTML (the numeric bar is "
+    "d3-rendered, not in markup); blank/NA = not rated"
+)
+
+# Corn "Hybrid Characteristics" lines that are foliar/disease in nature
+# bucket into DISEASE RATINGS; the rest are agronomic/plant ratings.
+_CORN_DISEASE_LABELS = {"foliar health", "foliar rating", "foliar"}
+
+# Trait-suffix -> human label, derived from the slug tokens. Best-effort;
+# a token with no entry here is skipped by _trait_stack (not emitted).
+TRAIT_LABELS = {
+    # corn
+    "vt2p": "VT Double PRO (VT2P)",
+    "gt": "Glyphosate Tolerant (GT)",
+    "c": "Conventional",
+    "pc": "PowerCore (PC)",
+    "tre": "Trecepta (TRE)",
+    "ss": "SmartStax (SS)",
+    "v": "VT (V)",
+    "dv": "Double VT (DV)",
+    "aa": "Agrisure Artesian (AA)",
+    # soybeans
+    "en": "Enlist E3 (EN)",
+    "xf": "XtendFlex (XF)",
+    "sts": "STS",
+    # wheat
+    "b": "Bin-run / branded (B)",
+    "s": "Soft (S)",
+}
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "first_choice"
+
+log = logging.getLogger("scrape.first_choice")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    """Polite session with backoff. The 1st Choice row-crop catalog is
+    small (~78 detail pages + 4 sitemaps) so 1.2 s/req still finishes in
+    a couple minutes."""
+
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(self, method: str, url: str, *, max_retries: int = 4,
+                timeout: float = 30.0, **kw: Any) -> requests.Response:
+        last_exc: Exception | None = None
+        resp: requests.Response | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s — retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(
+                    30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s — retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if last_exc:
+            raise last_exc
+        assert resp is not None
+        return resp
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class FCVariety:
+    source_key: str
+    source_url: str
+    crop: str                          # chunker value: corn / soybeans / wheat
+    product_name: str = ""             # "FC 8455 VT2P"
+    relative_maturity: int | None = None      # corn (days)
+    maturity_group: float | None = None       # soy
+    wheat_class: str | None = None             # wheat
+    trait_stack: list[str] = field(default_factory=list)
+    positioning: str | None = None
+    strengths: list[str] = field(default_factory=list)
+    # [{label, items:[{characteristic, value}]}] — chunker source of truth
+    groups: list[dict] = field(default_factory=list)
+    sitemap_last_modified: str | None = None
+
+
+# --------------------------------------------------------------------- discovery (sitemaps)
+
+
+_LOC_RE = re.compile(r"<loc>\s*(?:<!\[CDATA\[)?\s*(.*?)\s*(?:\]\]>)?\s*</loc>",
+                     re.IGNORECASE | re.DOTALL)
+_URL_BLOCK_RE = re.compile(r"<url>(.*?)</url>", re.IGNORECASE | re.DOTALL)
+_LASTMOD_RE = re.compile(r"<lastmod>\s*(?:<!\[CDATA\[)?\s*(.*?)\s*(?:\]\]>)?\s*</lastmod>",
+                         re.IGNORECASE | re.DOTALL)
+
+
+def _slug_from_url(url: str) -> str:
+    return url.rstrip("/").rsplit("/", 1)[-1].lower()
+
+
+def discover(http: RateLimitedSession, *, only_crop: str | None) -> list[dict]:
+    """Return [{crop, url, slug, lastmod}] for in-scope row-crop varieties
+    by walking the per-crop child sitemaps under /sitemap.xml.
+
+    We fetch each known child sitemap directly (their names are stable
+    All-in-One-SEO conventions) rather than trusting the index ordering,
+    but we still confirm against the index so a renamed sitemap is caught.
+    """
+    # Pull the sitemap index once so we can warn if a crop sitemap is
+    # missing/renamed (defensive; we still target the known names).
+    index_locs: set[str] = set()
+    try:
+        idx = http.get(SITEMAP_INDEX)
+        idx.raise_for_status()
+        index_locs = {m.strip() for m in _LOC_RE.findall(idx.text)}
+    except requests.RequestException as exc:
+        log.warning("could not read sitemap index %s: %s (continuing with "
+                    "known child sitemap names)", SITEMAP_INDEX, exc)
+
+    records: list[dict] = []
+    for crop, child in CROP_SITEMAPS.items():
+        if only_crop and crop != only_crop:
+            continue
+        child_url = f"{BASE}/{child}"
+        if index_locs and child_url not in index_locs:
+            log.warning("crop sitemap %s not listed in the index — site may "
+                        "have renamed it; trying anyway", child_url)
+        r = http.get(child_url)
+        if r.status_code == 404:
+            log.warning("crop sitemap %s -> 404; skipping %s", child_url, crop)
+            continue
+        r.raise_for_status()
+        prefix = CROP_PATH[crop]
+        seen: set[str] = set()
+        n = 0
+        for block in _URL_BLOCK_RE.findall(r.text):
+            loc_m = _LOC_RE.search(block)
+            if not loc_m:
+                continue
+            url = loc_m.group(1).strip()
+            if prefix not in url:
+                continue  # category/archive page leaked into the sitemap
+            slug = _slug_from_url(url)
+            if not slug or slug in seen:
+                continue
+            seen.add(slug)
+            lm_m = _LASTMOD_RE.search(block)
+            records.append({
+                "crop": crop,
+                "url": url,
+                "slug": slug,
+                "lastmod": lm_m.group(1).strip() if lm_m else None,
+            })
+            n += 1
+        log.info("crop sitemap %-22s (%s): %d varieties", child, crop, n)
+    log.info("total varieties discovered: %d", len(records))
+    return records
+
+
+# --------------------------------------------------------------------- detail parse
+
+
+def _clean(s: str) -> str:
+    return re.sub(r"\s+", " ", s or "").strip()
+
+
+def _direct_text(el: Tag) -> str:
+    return _clean("".join(c for c in el.children if isinstance(c, NavigableString)))
+
+
+def _br_lines(el: Tag) -> list[str]:
+    """Text of an element with <br> treated as a line break."""
+    el = BeautifulSoup(str(el), "html.parser")  # re-parsed copy; original tree stays intact
+    for br in el.find_all("br"):
+        br.replace_with("\n")
+    return [ln.strip() for ln in el.get_text("\n").split("\n") if ln.strip()]
+
+
+def _product_name(article: Tag, slug: str) -> str:
+    """The variety name is the 2nd <h1> in article.content (the 1st is the
+    site-logo "1st Choice Seeds"). Fall back to a tidied slug."""
+    for h1 in article.find_all("h1"):
+        txt = _clean(h1.get_text(" ", strip=True))
+        if txt and txt.lower() != "1st choice seeds":
+            return txt
+    return slug.upper().replace("-", " ")
+
+
+def _trait_stack(slug: str, crop: str) -> list[str]:
+    """Derive a trait label from the slug tail (e.g. fc-8455-vt2p -> VT2P,
+    fb-3545-c-sts -> Conventional + STS). The leading model token
+    (fc-8455 / fb-2733 / fw-2035 / 20rw36) is not a trait."""
+    parts = slug.split("-")
+    # No explicit stripping of the leading model identifier is needed:
+    # model tokens ("fc", "8455", "20rw36") are not keys in TRAIT_LABELS,
+    # so only known trait suffixes are picked up. `crop` is currently unused.
+    traits: list[str] = []
+    for tok in parts:
+        t = tok.lower()
+        if t in TRAIT_LABELS:
+            label = TRAIT_LABELS[t]
+            if label not in traits:
+                traits.append(label)
+    # Tokens are scanned left -> right, so labels keep slug order; tokens
+    # that are not in TRAIT_LABELS (model numbers etc.) are skipped.
+    return traits
+
+
+def _parse_corn(article: Tag, v: FCVariety) -> None:
+    """Populate corn ratings from Hybrid Characteristics + Management Tips
+    + the Relative Maturity / Degree Days side table."""
+    agronomic: list[dict] = []
+    disease: list[dict] = []
+    management: list[dict] = []
+
+    # Hybrid Characteristics: a <p> of "label • value" lines.
+    hc = next((h for h in article.find_all("h2")
+               if _clean(h.get_text()) == "Hybrid Characteristics"), None)
+    if hc is not None:
+        sib = hc.find_next_sibling()
+        if sib is not None and sib.name == "p":
+            for ln in _br_lines(sib):
+                # split on bullet (•) or fall back to first colon
+                if "•" in ln:
+                    k, _, val = ln.partition("•")
+                elif ":" in ln:
+                    k, _, val = ln.partition(":")
+                else:
+                    k, val = ln, ""
+                k, val = _clean(k), _clean(val)
+                if not k:
+                    continue
+                item = {"characteristic": k, "value": val}
+                if k.lower() in _CORN_DISEASE_LABELS:
+                    disease.append(item)
+                else:
+                    agronomic.append(item)
+
+    # Management Tips: "label: value" lines (Corn-On-Corn / Productivity /
+    # Silage Rating). Stop pulling once we wander into the footer address.
+    mt = next((h for h in article.find_all("h2")
+               if _clean(h.get_text()) == "Management Tips"), None)
+    if mt is not None:
+        sib = mt.find_next_sibling()
+        if sib is not None and sib.name == "p":
+            for ln in _br_lines(sib):
+                if ":" not in ln:
+                    continue
+                k, _, val = ln.partition(":")
+                k, val = _clean(k), _clean(val)
+                # Footer noise (address / © line) has no useful colon form.
+                if k and val and not k.startswith("©") and "rights reserved" not in ln.lower():
+                    management.append({"characteristic": k, "value": val})
+
+    # Side table: Relative Maturity / Degree Days + planting populations.
+    pop_rows: list[str] = []
+    for tbl in article.find_all("table"):
+        for tr in tbl.find_all("tr"):
+            cells = [_clean(c.get_text(" ", strip=True))
+                     for c in tr.find_all(["td", "th"])]
+            cells = [c for c in cells if c]
+            if not cells:
+                continue
+            joined = " ".join(cells).lower()
+            if cells[0].lower().startswith("relative maturity") and len(cells) >= 2:
+                m = re.search(r"(\d+)", cells[1])
+                if m:
+                    v.relative_maturity = int(m.group(1))
+                agronomic.insert(0, {"characteristic": "Relative Maturity",
+                                     "value": cells[1]})
+            elif cells[0].lower().startswith("degree days") and len(cells) >= 2:
+                agronomic.append({"characteristic": "Degree Days (GDU)",
+                                  "value": cells[1]})
+            elif joined.startswith("low") and ("medium" in joined or "high" in joined):
+                pop_rows.append(" / ".join(cells))
+    if pop_rows:
+        management.append({"characteristic": "Recommended Planting Population",
+                           "value": "; ".join(pop_rows)})
+
+    if agronomic:
+        v.groups.append({"label": "AGRONOMIC CHARACTERISTICS", "items": agronomic})
+    if disease:
+        v.groups.append({"label": "DISEASE RATINGS", "items": disease})
+    if management:
+        v.groups.append({"label": "MANAGEMENT", "items": management})
+
+
+def _parse_soy(article: Tag, v: FCVariety) -> None:
+    """Populate soy MG + agronomic descriptors + field-note strengths."""
+    # Field Notes -> strengths (and positioning from the first one).
+    fn = next((h for h in article.find_all("h2")
+               if _clean(h.get_text()) == "Field Notes"), None)
+    if fn is not None:
+        sib = fn.find_next_sibling()
+        if sib is not None and sib.name == "ul":
+            notes = [_clean(li.get_text(" ", strip=True)) for li in sib.find_all("li")]
+            v.strengths = [n for n in notes if n]
+            if v.strengths and not v.positioning:
+                v.positioning = v.strengths[0]
+
+    # Variety Description -> [{characteristic, value}] from <b>Label:</b> value.
+    agronomic: list[dict] = []
+    vd = next((h for h in article.find_all("h2")
+               if _clean(h.get_text()) == "Variety Description"), None)
+    if vd is not None:
+        for el in vd.find_all_next():
+            if el.name == "h2" and el is not vd:
+                break
+            if not isinstance(el, Tag):
+                continue
+            # Stop at the action buttons / right-nav / footer region.
+            cls = el.get("class") or []
+            if el.name == "div" and any(
+                    c in cls for c in ("btn", "right-bar", "right-navigation",
+                                       "address", "wrapper")):
+                break
+            b = el.find("b", recursive=False) if el.name == "div" else None
+            if b is not None:
+                k = _clean(b.get_text(" ", strip=True)).rstrip(":")
+                val = _direct_text(el)
+                if not k:
+                    continue
+                if k.lower() == "maturity":
+                    try:
+                        v.maturity_group = float(re.search(r"[\d.]+", val).group(0))
+                    except (AttributeError, ValueError):
+                        pass
+                    agronomic.append({"characteristic": "Maturity Group", "value": val})
+                else:
+                    agronomic.append({"characteristic": k, "value": val})
+    if agronomic:
+        v.groups.append({"label": "AGRONOMIC CHARACTERISTICS", "items": agronomic})
+
+
+def parse_detail(http: RateLimitedSession, rec: dict) -> FCVariety:
+    crop = rec["crop"]
+    slug = rec["slug"]
+    url = rec["url"]
+    v = FCVariety(
+        source_key=f"firstchoice-{slug}",
+        source_url=url,
+        crop=crop,
+        trait_stack=_trait_stack(slug, crop),
+        sitemap_last_modified=rec.get("lastmod"),
+    )
+    r = http.get(url)
+    r.raise_for_status()
+    soup = BeautifulSoup(r.text, "html.parser")
+    article = soup.find("article", class_="content") or soup
+    v.product_name = _product_name(article, slug)
+
+    if crop == "corn":
+        _parse_corn(article, v)
+    elif crop == "soybeans":
+        _parse_soy(article, v)
+    # wheat: thin pages — identity only (no spec sections to parse).
+    return v
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(v: FCVariety) -> str:
+    crop_label = {"corn": "Corn", "soybeans": "Soybeans",
+                  "wheat": "Wheat"}.get(v.crop, v.crop.title())
+    head: list[str] = [
+        f"# {v.product_name}",
+        "",
+        "- **Vendor:** 1st Choice Seeds (independent, employee-owned)",
+        "- **Brand:** 1st Choice Seeds",
+        f"- **Crop:** {crop_label}",
+    ]
+    if v.crop == "corn" and v.relative_maturity is not None:
+        head.append(f"- **Relative maturity:** {v.relative_maturity} day")
+    if v.crop == "soybeans" and v.maturity_group is not None:
+        head.append(f"- **Maturity group:** {v.maturity_group}")
+    if v.crop == "wheat" and v.wheat_class:
+        head.append(f"- **Wheat class:** {v.wheat_class}")
+    if v.trait_stack:
+        head.append(f"- **Trait(s):** {', '.join(v.trait_stack)}")
+    head.append(f"- **Source:** {v.source_url}")
+    head.append(f"- **Rating scale:** {RATING_SCALE_DIRECTION}")
+    head.append("- **Service area:** 1st Choice Seeds dealer network — "
+                "Eastern Corn Belt (IN/OH/KY/TN), Rushville, IN")
+    head.append("")
+    if v.positioning:
+        head += ["---", "", f"_{v.positioning}_", ""]
+    if v.strengths:
+        head += ["---", "", "## Field Notes", ""]
+        head += [f"- {s}" for s in v.strengths]
+        head.append("")
+    head += ["---", ""]
+    for g in v.groups:
+        head.append(f"## {g['label'].title()}")
+        head.append("")
+        for it in g["items"]:
+            ch = it["characteristic"]
+            val = it["value"] or "—"
+            head.append(f"- **{ch}:** {val}")
+        head.append("")
+    if not v.groups and v.crop == "wheat":
+        head += ["_Identity record only — 1st Choice wheat is private-label "
+                 "and the catalog page carries no agronomic spec block._", ""]
+    return "\n".join(head)
+
+
+def write_variety(v: FCVariety, body_md: str) -> None:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    (CORPUS_DIR / f"{v.source_key}.md").write_text(body_md, encoding="utf-8")
+    sidecar = {
+        "source": "first_choice",
+        "source_key": v.source_key,
+        "vendor": "1st Choice Seeds",
+        "brand": "1st Choice Seeds",
+        "product_name": v.product_name,
+        "product_id": v.product_name,
+        "crop": v.crop,
+        "release_year": None,
+        "relative_maturity": v.relative_maturity,
+        "maturity_group": v.maturity_group,
+        "wheat_class": v.wheat_class,
+        "trait_stack": v.trait_stack,
+        "trait_descriptions": [],
+        "positioning_statement": v.positioning,
+        "strengths": v.strengths,
+        "characteristics_groups": v.groups,
+        "_scale_direction": RATING_SCALE_DIRECTION,
+        "regional_recommendations": [
+            {"product_list_name": "1st Choice Seeds dealer network "
+                                  "(Eastern Corn Belt — IN/OH/KY/TN)",
+             "agronomist": None, "agronomist_email": None, "variant_id": None},
+        ],
+        "image_url": None,
+        "source_urls": [v.source_url],
+        "sitemap_last_modified": v.sitemap_last_modified,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    (CORPUS_DIR / f"{v.source_key}.json").write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def run(*, limit: int | None, force: bool,
+        only_crop: str | None, only_product: str | None) -> int:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    records = discover(http, only_crop=only_crop)
+
+    if only_product:
+        key = only_product.lower()
+        records = [r for r in records
+                   if f"firstchoice-{r['slug']}" == key or r["slug"] == key]
+        if not records:
+            log.error("no variety matched --product=%s", only_product)
+            return 2
+
+    counts = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
+    processed = 0
+    for rec in records:
+        if limit is not None and processed >= limit:
+            break
+        processed += 1
+        source_key = f"firstchoice-{rec['slug']}"
+        md_path = CORPUS_DIR / f"{source_key}.md"
+        if md_path.exists() and not force:
+            counts["skipped"] += 1
+            log.info("[%d/%d] %s skipped", processed, len(records), source_key)
+            continue
+        try:
+            v = parse_detail(http, rec)
+        except requests.RequestException as exc:  # HTTP errors + exhausted network retries
+            counts["failed"] += 1
+            log.error("[%d/%d] %s detail fetch failed: %s",
+                      processed, len(records), source_key, exc)
+            continue
+        if not v.groups:
+            counts["empty"] += 1
+            log.warning("[%d/%d] %s — no spec groups parsed (writing identity%s)",
+                        processed, len(records), source_key,
+                        "; thin wheat page" if v.crop == "wheat" else "")
+        write_variety(v, render_markdown(v))
+        counts["written"] += 1
+        log.info("[%d/%d] %s written | crop=%s rm/mg=%s groups=%d traits=%s",
+                 processed, len(records), source_key, v.crop,
+                 v.relative_maturity or v.maturity_group or "-",
+                 len(v.groups), ",".join(v.trait_stack) or "-")
+
+    log.info("done: processed=%d written=%d skipped=%d empty_groups=%d failed=%d (of %d)",
+             processed, counts["written"], counts["skipped"], counts["empty"],
+             counts["failed"], len(records))
+    return 0
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.first_choice",
+        description="Scrape 1st Choice Seeds (independent, employee-owned — "
+                    "Rushville, IN) — corn / soybeans / wheat via sitemaps "
+                    "+ detail pages.")
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after processing N varieties (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-fetch even if the markdown file already exists.")
+    p.add_argument("--crop", default=None, choices=sorted(CROP_SITEMAPS),
+                   help="Limit to one crop (corn / soybeans / wheat).")
+    p.add_argument("--product", default=None,
+                   help="Process a single variety by source_key or slug.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr)
+    return run(limit=args.limit, force=args.force,
+               only_crop=args.crop, only_product=args.product)
+
+
+if __name__ == "__main__":
+    sys.exit(main())
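
The module docstring notes that 1st Choice publishes only the qualitative word (Excellent / Very Good / ...) in the HTML, and that each word maps onto the 0-10 legend. A minimal sketch of that mapping follows; `LEGEND` and `legend_range` are hypothetical names for illustration and are not part of the scraper above, which stores the word as-is.

```python
# Legend transcribed from the module docstring: 0-10, higher = better.
LEGEND = {
    "below average": (0, 4),
    "average": (5, 5),
    "good": (6, 6),
    "very good": (7, 7),
    "excellent": (8, 8),
    "superior": (9, 10),
}


def legend_range(word):
    """Return the (low, high) numeric band for a qualitative rating word.

    Blank, "NA", and unrecognized values return None (not rated).
    """
    key = " ".join((word or "").lower().split())  # trim + collapse whitespace
    if not key or key == "na":
        return None
    return LEGEND.get(key)
```

Returning a band rather than a single number keeps the sketch honest about what the source publishes: the exact d3-rendered bar value is not in the markup, so only the legend range is recoverable.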