bayer_seeds: implement Phase 1 scraper for DEKALB + Asgrow + WestBred

Replace stub with working scraper for all three Bayer seed brands.
Discovery uses the public sitemap-dynamic.xml (475 varieties: 288 DEKALB
corn + 102 Asgrow soy + 85 WestBred wheat, matching recon). Per-variety
detail comes from the page's __NEXT_DATA__ JSON island.

Each variety writes corpus/bayer_seeds/<source_key>.{md,json} with:

- Identity (brand, crop, hybridLabel, productId, releaseYear)
- Maturity routed per crop (RM for corn, MG for soy, qualitative for wheat)
- Trait stack (code + full name)
- Positioning + strengths narrative
- Characteristics groups (DISEASE RATINGS, GROWTH, MANAGEMENT, HARVEST,
  etc.) preserved verbatim from source so the chunker can re-bucket into
  canonical disease/agronomic flats per CLAUDE.md schema
- Regional seed-guide listings with agronomist contacts
- _scale_direction tag (Bayer = "1-9 (9 = best)") for chunker

Smoke-tested all three brands (--limit 2 each, plus --product, --force,
and scrape.runner dispatch).

Politeness: 1 req/sec, retries on 429/5xx with Retry-After honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:53:46 -04:00
parent ac40e05734
commit 2a4c0d4aba
1 changed file with 668 additions and 38 deletions
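
As a rough illustration of the extraction the commit message describes, the sketch below pulls a `__NEXT_DATA__` island out of an HTML string and files `relativeMaturity` under a crop-specific key. The sample HTML and the `extract_product_details` / `route_maturity` helpers are hypothetical and not part of the commit; only the field names (`props.pageProps.productDetails`, `crop`, `relativeMaturity`) come from the module docstring in the diff.

```python
import json
import re

# Same shape as the pattern used in the diff: capture the JSON body of the
# <script id="__NEXT_DATA__"> tag.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S
)

# Hypothetical page fragment; real pages carry many more fields.
SAMPLE_HTML = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"productDetails": '
    '{"brand": "Asgrow", "crop": "Soybeans", "relativeMaturity": "0.05"}}}}'
    '</script></body></html>'
)


def extract_product_details(html: str) -> dict:
    """Return props.pageProps.productDetails, or {} if any level is missing."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    data = json.loads(match.group(1))
    page_props = (data.get("props") or {}).get("pageProps") or {}
    return page_props.get("productDetails") or {}


def route_maturity(details: dict) -> dict:
    """Hypothetical helper: file relativeMaturity under a crop-specific key."""
    crop = (details.get("crop") or "").lower()
    value = details.get("relativeMaturity")
    # Soy maturity is a maturity group; corn and wheat keep relative_maturity.
    key = "maturity_group" if crop == "soybeans" else "relative_maturity"
    return {key: None if value is None else str(value)}


details = extract_product_details(SAMPLE_HTML)
routed = route_maturity(details)
print(routed)
```

The value is kept as a string on purpose, mirroring the commit's choice to preserve maturity as published rather than coerce it to a number.
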
@@ -1,56 +1,686 @@
 """Bayer seeds scraper — DEKALB (corn) + Asgrow (soy) + WestBred (wheat).

-Source: ``cropscience.bayer.us`` — same Next.js + ``__NEXT_DATA__``
-infrastructure used by crop-chem-docs' Bayer crop-protection scraper.
-That scraper is the reference; this one lifts ~80% of its plumbing
-and adapts the per-product field mapping for seed schema.
+Source: ``www.cropscience.bayer.us`` — the same Next.js + ``__NEXT_DATA__``
+infrastructure that powers Bayer's crop-protection catalog (which
+``crop-chem-docs`` already scrapes). robots.txt explicitly whitelists
+*"artificial intelligence retrieval augmented generation"* use of the
+content, which is what this corpus feeds.

-Catalog index pages:
-  /corn/dekalb/seed-catalog
-  /soybeans/asgrow/seed-catalog
-  /wheat/westbred/seed-catalog
+Discovery: ``/sitemap-dynamic.xml`` enumerates every variety URL under
+``/corn/dekalb/``, ``/soybeans/asgrow/``, ``/wheat/westbred/`` — counts
+on 2026-05-25: 288 / 102 / 85 = 475 total, matching recon. The seed
+catalog landing pages SSR only the first 12 of N products via React
+Query state hydration; we sidestep that entirely by walking the
+sitemap.

-Each catalog page is a Next.js route; the per-variety data lives in
-``__NEXT_DATA__.props.pageProps.{whatever}``. The buildId in the
-script tag rotates — fetch the index page first, extract the
-buildId, then fetch the per-variety JSON.
+Per-variety detail comes from the product page itself. Each page
+embeds a full ``__NEXT_DATA__`` JSON island whose
+``props.pageProps.productDetails`` carries:

-Output layout:
+  - Identity:  ``brand``, ``crop``, ``productId``,
+    ``hybridLabel``, ``hybridPrefix``, ``hybridSuffix``,
+    ``releaseYear``
+  - Maturity:  ``relativeMaturity`` (corn = RM days, soy = MG,
+    wheat = qualitative early/medium/late)
+  - Traits:    ``traits[]`` of ``{trait, traitFullName}``
+  - Narrative: ``positioningStatement``, ``strengthsAndManagement[]``
+  - Ratings:   ``characteristics[]`` of
+    ``{label, type, items: [{characteristic, value}]}`` —
+    groups vary by crop:
+      corn: DISEASE RATINGS / GROWTH / MANAGEMENT / HARVEST /
+            HERBICIDE / PLANT DESCRIPTION
+      soy:  DISEASE RATINGS / SENSITIVITY / MANAGEMENT /
+            PLANT DESCRIPTION / PRODUCTION
+      wheat: KEY CHARACTERISTICS / MANAGEMENT / PRODUCTION /
+             QUALITY / PEST AND DISEASE RESISTANCE
+  - Regional:  ``localProfiles[]`` of regional seed-guide listings
+    incl. agronomist name + email
+
+Bayer ratings are on the canonical **1-9 (9 = best)** scale already,
+so no flip is needed (unlike Golden Harvest, which is documented in
+CLAUDE.md). Non-numeric ratings (S/R for soy disease resistance,
+gene names like Rps1c, sensitivity letters A/B/C) are preserved
+verbatim — the chunker (Phase 2) handles surfacing.
+
+Output:
  corpus/bayer_seeds/<source_key>.md     LLM-visible body
-  corpus/bayer_seeds/<source_key>.json    Sidecar metadata
+  corpus/bayer_seeds/<source_key>.json   sidecar metadata

-source_key convention: ``<brand>-<product-slug>`` lowercased, e.g.
-``dekalb-dkc62-08rib`` or ``asgrow-ag34xf2``.
+source_key convention: ``<brand>-<sku>`` lowercased — derived from the
+URL terminal slug minus the trailing crop suffix
+(``-corn``/``-soybeans``/``-wheat``). E.g.
+``dekalb-dkc075-70rib`` or ``asgrow-ag005xf3``.

-Sidecar schema (per CLAUDE.md):
-  source: "bayer_seeds"
-  source_key: str
-  vendor: "Bayer"
-  brand: "DEKALB" | "Asgrow" | "WestBred"
-  product_name: str
-  crop: "corn" | "soybeans" | "wheat"
-  relative_maturity: int | null         # corn only
-  maturity_group: float | null          # soy only
-  wheat_class: str | null               # wheat only
-  trait_stack: list[str]
-  agronomic_ratings: dict[str, int]     # normalized 1-9 (9 = best)
-  disease_ratings: dict[str, int]       # normalized 1-9 (9 = best)
-  regional_recommendation: list[str]
-  source_urls: list[str]
-  fetched_at: str (ISO 8601 UTC)
-
-TODO: implement. Reference: ~/github/crop-chem-docs/scrape/sources/bayer.py
+CLI:
+  python -m scrape.sources.bayer_seeds --limit 5
+  python -m scrape.sources.bayer_seeds --brand dekalb --limit 20
+  python -m scrape.sources.bayer_seeds --force
 """
+
 from __future__ import annotations

+import argparse
+import json
+import logging
+import os
+import random
+import re
 import sys
+import time
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import requests
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://www.cropscience.bayer.us"
+SITEMAP_URL = f"{BASE}/sitemap-dynamic.xml"
+
+# Brand → (URL path segment, crop label). Keys double as the `--brand`
+# choices; the walk order follows the sitemap, not this dict.
+BRANDS: dict[str, tuple[str, str]] = {
+    "dekalb": ("/corn/dekalb/", "corn"),
+    "asgrow": ("/soybeans/asgrow/", "soybeans"),
+    "westbred": ("/wheat/westbred/", "wheat"),
+}
+
+# Per-brand crop-suffix to strip off the URL's terminal slug when
+# computing source_key (so ``dekalb-dkc075-70rib-corn`` → ``dekalb-dkc075-70rib``).
+CROP_SUFFIX = {
+    "dekalb": "-corn",
+    "asgrow": "-soybeans",
+    "westbred": "-wheat",
+}
+
+# Catalog/landing pages that live under the brand path but are NOT
+# individual varieties. Skip these during discovery.
+NON_VARIETY_PATH_TAILS = {
+    "seed-catalog",
+    "product-compare",
+    "find-a-dealer",
+    "find-a-rep",
+    "saved-products",
+}
+
+# Bayer publishes seed ratings on the canonical 1-9 scale (9 = best),
+# unlike Golden Harvest. This goes into the sidecar so the chunker
+# knows not to flip.
+RATING_SCALE_DIRECTION = "1-9 (9 = best)"
+
+# Repo root: scrape/sources/bayer_seeds.py -> 3 parents up.
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "bayer_seeds"
+
+REQ_INTERVAL_SEC = 1.0
+
+log = logging.getLogger("scrape.bayer_seeds")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    """``requests.Session`` wrapper with sleep-based rate limiting and
+    polite retries on 429/5xx. Lifted from crop-chem-docs' Bayer scraper
+    — same host, same politeness story."""
+
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(
+        self,
+        method: str,
+        url: str,
+        *,
+        max_retries: int = 4,
+        timeout: float = 30.0,
+        **kw: Any,
+    ) -> requests.Response:
+        last_exc: Exception | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s — retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s — retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if last_exc:
+            raise last_exc
+        return resp  # type: ignore[return-value]
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class BayerSeedProduct:
+    # Identity
+    source_key: str                                  # e.g. "dekalb-dkc075-70rib"
+    source_url: str                                  # full product page URL
+    brand: str                                       # "DEKALB" | "ASGROW" | "WESTBRED"
+    crop: str                                        # "corn" | "soybeans" | "wheat"
+    product_name: str = ""                           # hybridLabel, e.g. "DKC075-70RIB BRAND BLEND"
+    product_id: str | None = None                    # full Bayer productId
+    hybrid_prefix: str | None = None                 # e.g. "DKC075-70RIB"
+    hybrid_suffix: str | None = None                 # e.g. "BRAND BLEND"
+    release_year: int | None = None
+
+    # Maturity — semantics vary by crop, value preserved as-published.
+    relative_maturity: str | None = None             # corn: RM days as string; wheat: qualitative
+    maturity_group: str | None = None                # soy MG as string
+    wheat_class: str | None = None                   # not exposed in productDetails — left null
+
+    # Traits
+    trait_codes: list[str] = field(default_factory=list)          # ["VT2PRIB"]
+    trait_descriptions: list[str] = field(default_factory=list)   # full names
+
+    # Narrative
+    positioning_statement: str | None = None
+    strengths: list[str] = field(default_factory=list)
+
+    # Ratings — preserved as the source's grouped form. The chunker
+    # re-buckets into the canonical disease/agronomic flats from
+    # seed-mcp/CLAUDE.md.
+    characteristics_groups: list[dict] = field(default_factory=list)
+
+    # Regional recommendations (Bayer's "local profiles").
+    regional_recommendations: list[dict] = field(default_factory=list)
+
+    # Media
+    image_url: str | None = None
+
+    # Discovery
+    sitemap_last_modified: str | None = None
+
+
+# --------------------------------------------------------------------- helpers
+
+
+_NEXT_DATA_RE = re.compile(
+    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S
+)
+
+
+def parse_next_data(html: str) -> dict[str, Any]:
+    """Pull the ``__NEXT_DATA__`` JSON blob out of a Next.js page."""
+    m = _NEXT_DATA_RE.search(html)
+    if not m:
+        raise RuntimeError("no __NEXT_DATA__ script tag found")
+    return json.loads(m.group(1))
+
+
+def source_key_from_url(url: str, brand: str) -> str:
+    """Derive ``<brand>-<sku>`` slug from the product URL.
+
+    Drops the trailing ``-<crop>`` suffix Bayer puts on every product
+    URL terminal segment (``dekalb-dkc075-70rib-corn`` →
+    ``dekalb-dkc075-70rib``).
+    """
+    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
+    suffix = CROP_SUFFIX.get(brand, "")
+    if suffix and tail.endswith(suffix):
+        tail = tail[: -len(suffix)]
+    return tail
+
+
+def looks_like_variety_url(url: str, brand_path: str) -> bool:
+    """True if ``url`` is a per-variety product page under ``brand_path``
+    (not a catalog/landing page or sub-tool)."""
+    rest = url.split(brand_path, 1)[-1].strip("/")
+    if not rest or "/" in rest:
+        return False  # empty (the brand index) or a sub-path tool
+    if rest in NON_VARIETY_PATH_TAILS:
+        return False
+    return True
+
+
+# --------------------------------------------------------------------- discovery
+
+
+def discover_varieties(
+    http: RateLimitedSession,
+    *,
+    only_brand: str | None = None,
+) -> list[tuple[str, str, str, str]]:
+    """Return ``[(url, brand, crop, lastmod), ...]`` for every Bayer
+    seed variety found in the dynamic sitemap.
+
+    ``brand`` is the lowercase brand key (matches ``BRANDS``).
+    ``lastmod`` is the ISO 8601 timestamp from the sitemap entry.
+    """
+    log.info("fetching sitemap %s", SITEMAP_URL)
+    r = http.get(SITEMAP_URL)
+    r.raise_for_status()
+    xml = r.text
+
+    # Tiny regex parse — sitemap is flat and well-formed; no need for
+    # the lxml dependency on a single 600KB file.
+    entries = re.findall(
+        r"<url>\s*<loc>([^<]+)</loc>\s*(?:<lastmod>([^<]+)</lastmod>)?",
+        xml,
+    )
+    log.info("sitemap parsed: %d total URLs", len(entries))
+
+    out: list[tuple[str, str, str, str]] = []
+    for url, lastmod in entries:
+        for brand, (brand_path, crop) in BRANDS.items():
+            if only_brand and brand != only_brand:
+                continue
+            if brand_path in url and looks_like_variety_url(url, brand_path):
+                out.append((url, brand, crop, lastmod or ""))
+                break
+
+    by_brand: dict[str, int] = {}
+    for _, b, _, _ in out:
+        by_brand[b] = by_brand.get(b, 0) + 1
+    log.info("variety URLs found: %s (total=%d)",
+             ", ".join(f"{k}={v}" for k, v in sorted(by_brand.items())),
+             len(out))
+    return out
+
+
+# --------------------------------------------------------------------- detail
+
+
+def fetch_product_detail(
+    http: RateLimitedSession, url: str, brand: str, crop: str, lastmod: str
+) -> BayerSeedProduct:
+    """Fetch + parse one product page into a ``BayerSeedProduct``."""
+    r = http.get(url)
+    r.raise_for_status()
+    data = parse_next_data(r.text)
+    pp = (data.get("props") or {}).get("pageProps") or {}
+    pd = pp.get("productDetails") or {}
+
+    prod = BayerSeedProduct(
+        source_key=source_key_from_url(url, brand),
+        source_url=url,
+        brand=(pd.get("brand") or brand).upper(),
+        crop=(pd.get("crop") or crop).lower(),
+        sitemap_last_modified=lastmod or None,
+    )
+
+    prod.product_name = pd.get("hybridLabel") or pd.get("productName") or prod.source_key
+    prod.product_id = pd.get("productId")
+    prod.hybrid_prefix = pd.get("hybridPrefix")
+    prod.hybrid_suffix = pd.get("hybridSuffix")
+
+    ry = pd.get("releaseYear")
+    if isinstance(ry, int):
+        prod.release_year = ry
+    elif isinstance(ry, str) and ry.isdigit():
+        prod.release_year = int(ry)
+
+    # Maturity routing per crop. Source stores all three in
+    # `relativeMaturity` as a string; we split by crop semantics.
+    rm = pd.get("relativeMaturity")
+    if rm is not None:
+        rm_str = str(rm)
+        if prod.crop == "corn":
+            prod.relative_maturity = rm_str
+        elif prod.crop == "soybeans":
+            prod.maturity_group = rm_str
+        elif prod.crop == "wheat":
+            # WestBred encodes Early/Medium/Late as the qualitative
+            # maturity. The class (HRW/HRS/SWW/...) is not in
+            # productDetails — it's only in the marketing narrative.
+            # We surface what we have; a future enrichment step can
+            # parse the narrative if needed.
+            prod.wheat_class = None  # explicit: not exposed in this JSON
+            prod.relative_maturity = rm_str
+
+    # Traits
+    for t in pd.get("traits") or []:
+        code = (t or {}).get("trait")
+        full = (t or {}).get("traitFullName")
+        if code:
+            prod.trait_codes.append(code)
+        if full:
+            prod.trait_descriptions.append(full)
+
+    # Narrative
+    prod.positioning_statement = pd.get("positioningStatement")
+    sm = pd.get("strengthsAndManagement") or pd.get("strengths") or []
+    if isinstance(sm, list):
+        prod.strengths = [str(s).strip() for s in sm if s]
+
+    # Ratings groups — preserved verbatim (label / type / items).
+    chars = pd.get("characteristics") or []
+    cleaned_groups: list[dict] = []
+    for g in chars:
+        if not isinstance(g, dict):
+            continue
+        items = [
+            {"characteristic": (it.get("characteristic") or "").strip(),
+             "value": ("" if it.get("value") is None else str(it.get("value"))).strip()}
+            for it in (g.get("items") or [])
+            if isinstance(it, dict) and it.get("characteristic")
+        ]
+        if not items:
+            continue
+        cleaned_groups.append({
+            "label": (g.get("label") or "").strip(),
+            "type": (g.get("type") or "").strip(),
+            "items": items,
+        })
+    prod.characteristics_groups = cleaned_groups
+
+    # Regional recommendations.
+    lp = pd.get("localProfiles") or []
+    if isinstance(lp, list):
+        for p in lp:
+            if not isinstance(p, dict):
+                continue
+            prod.regional_recommendations.append({
+                "product_list_name": p.get("productListName"),
+                "agronomist": p.get("agronomist"),
+                "agronomist_email": p.get("agronomistEmailAddress"),
+                "variant_id": p.get("variantId"),
+            })
+
+    # Image (just the first one)
+    imgs = pp.get("images") or []
+    if isinstance(imgs, list) and imgs and isinstance(imgs[0], dict):
+        prod.image_url = imgs[0].get("url")
+
+    return prod
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(p: BayerSeedProduct) -> str:
+    """Build the markdown body for the variety. The Phase 2 chunker will
+    rewrite chunk_0 with a tighter preamble; for now this is the readable
+    copy, and it already covers everything searchable.
+    """
+    title = p.product_name or p.source_key
+    crop_label = p.crop.capitalize()
+
+    maturity_lines: list[str] = []
+    if p.relative_maturity is not None and p.crop == "corn":
+        maturity_lines.append(f"- **Relative maturity:** {p.relative_maturity}")
+    if p.maturity_group is not None and p.crop == "soybeans":
+        maturity_lines.append(f"- **Maturity group:** {p.maturity_group}")
+    if p.relative_maturity is not None and p.crop == "wheat":
+        maturity_lines.append(f"- **Maturity:** {p.relative_maturity}")
+    if p.wheat_class:
+        maturity_lines.append(f"- **Wheat class:** {p.wheat_class}")
+
+    trait_line = ""
+    if p.trait_codes:
+        codes = ", ".join(p.trait_codes)
+        if p.trait_descriptions:
+            descs = "; ".join(p.trait_descriptions)
+            trait_line = f"- **Traits:** {codes} ({descs})"
+        else:
+            trait_line = f"- **Traits:** {codes}"
+
+    header_lines = [
+        f"# {title}",
+        "",
+        "- **Vendor:** Bayer",
+        f"- **Brand:** {p.brand.title() if p.brand else '(unknown)'}",
+        f"- **Crop:** {crop_label}",
+        *maturity_lines,
+    ]
+    if trait_line:
+        header_lines.append(trait_line)
+    if p.release_year:
+        header_lines.append(f"- **Release year:** {p.release_year}")
+    header_lines.append(f"- **Source:** {p.source_url}")
+    header_lines.append(f"- **Rating scale (Bayer):** {RATING_SCALE_DIRECTION}")
+    header_lines.append("")
+    header_lines.append("---")
+    header_lines.append("")
+
+    sections: list[str] = []
+
+    if p.positioning_statement:
+        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
+
+    if p.strengths:
+        bullets = "\n".join(f"- {s}" for s in p.strengths)
+        sections.append("## Strengths & management\n\n" + bullets + "\n")
+
+    # Render each characteristics group as its own table for readability.
+    for g in p.characteristics_groups:
+        label = g.get("label") or "Characteristics"
+        items = g.get("items") or []
+        if not items:
+            continue
+        rows = "\n".join(
+            f"| {it['characteristic']} | {it['value']} |"
+            for it in items
+        )
+        sections.append(
+            f"## {label.title()}\n\n"
+            "| Characteristic | Value |\n"
+            "|---|---|\n"
+            f"{rows}\n"
+        )
+
+    if p.regional_recommendations:
+        seen: set[str] = set()
+        rows: list[str] = []
+        for r in p.regional_recommendations:
+            name = (r.get("product_list_name") or "").strip()
+            agronomist = (r.get("agronomist") or "").strip()
+            key = f"{name}||{agronomist}"
+            if key in seen or not name:
+                continue
+            seen.add(key)
+            rows.append(f"- **{name}** — agronomist: {agronomist or '(unlisted)'}")
+        if rows:
+            sections.append("## Regional seed-guide listings\n\n" + "\n".join(rows) + "\n")
+
+    return "\n".join(header_lines) + "\n".join(sections)
+
+
+# --------------------------------------------------------------------- write
+
+
+def write_product(prod: BayerSeedProduct, body_md: str) -> None:
+    """Write the markdown body + sidecar JSON. Schema documented in
+    seed-mcp/CLAUDE.md."""
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    md_path = CORPUS_DIR / f"{prod.source_key}.md"
+    json_path = CORPUS_DIR / f"{prod.source_key}.json"
+
+    md_path.write_text(body_md, encoding="utf-8")
+
+    sidecar = {
+        "source": "bayer_seeds",
+        "source_key": prod.source_key,
+        "vendor": "Bayer",
+        "brand": prod.brand,
+        "product_name": prod.product_name,
+        "product_id": prod.product_id,
+        "hybrid_prefix": prod.hybrid_prefix,
+        "hybrid_suffix": prod.hybrid_suffix,
+        "crop": prod.crop,
+        "release_year": prod.release_year,
+        "relative_maturity": prod.relative_maturity,
+        "maturity_group": prod.maturity_group,
+        "wheat_class": prod.wheat_class,
+        "trait_stack": prod.trait_codes,
+        "trait_descriptions": prod.trait_descriptions,
+        "positioning_statement": prod.positioning_statement,
+        "strengths": prod.strengths,
+        # Raw grouped ratings preserved as published. Chunker re-buckets
+        # into canonical disease/agronomic flats per CLAUDE.md schema.
+        "characteristics_groups": prod.characteristics_groups,
+        "_scale_direction": RATING_SCALE_DIRECTION,
+        "regional_recommendations": prod.regional_recommendations,
+        "image_url": prod.image_url,
+        "source_urls": [prod.source_url],
+        "sitemap_last_modified": prod.sitemap_last_modified,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    json_path.write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def process_product(
+    http: RateLimitedSession,
+    *,
+    url: str,
+    brand: str,
+    crop: str,
+    lastmod: str,
+    force: bool,
+) -> tuple[str, BayerSeedProduct | None]:
+    """Returns ``(status, prod or None)`` where status is one of
+    ``written`` / ``skipped`` / ``failed``."""
+    source_key = source_key_from_url(url, brand)
+    md_path = CORPUS_DIR / f"{source_key}.md"
+    if md_path.exists() and not force:
+        return "skipped", None
+
+    try:
+        prod = fetch_product_detail(http, url, brand, crop, lastmod)
+    except Exception as exc:  # noqa: BLE001
+        log.error("detail fetch failed for %s: %s", url, exc)
+        return "failed", None
+
+    body = render_markdown(prod)
+    write_product(prod, body)
+    return "written", prod
+
+
+def run(
+    *,
+    limit: int | None,
+    force: bool,
+    only_brand: str | None,
+    only_product: str | None,
+) -> int:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+
+    targets = discover_varieties(http, only_brand=only_brand)
+    if only_product:
+        targets = [
+            (u, b, c, lm) for (u, b, c, lm) in targets
+            if source_key_from_url(u, b) == only_product
+            or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product
+        ]
+        if not targets:
+            log.error("no variety matched --product=%s", only_product)
+            return 2
+
+    counts = {"written": 0, "skipped": 0, "failed": 0}
+    processed = 0
+    for url, brand, crop, lastmod in targets:
+        if limit is not None and processed >= limit:
+            break
+        processed += 1
+        status, prod = process_product(
+            http, url=url, brand=brand, crop=crop, lastmod=lastmod, force=force,
+        )
+        counts[status] = counts.get(status, 0) + 1
+
+        if prod is not None:
+            log.info(
+                "[%d/%s] %s %s | crop=%s rm/mg=%s traits=%s ratings_groups=%d",
+                processed, str(limit) if limit else "all",
+                prod.source_key, status, prod.crop,
+                prod.relative_maturity or prod.maturity_group or "-",
+                ",".join(prod.trait_codes) or "-",
+                len(prod.characteristics_groups),
+            )
+        else:
+            log.info("[%d/%s] %s %s",
+                     processed, str(limit) if limit else "all",
+                     source_key_from_url(url, brand), status)
+
+    log.info(
+        "done: processed=%d written=%d skipped=%d failed=%d (out of %d candidates)",
+        processed, counts["written"], counts["skipped"], counts["failed"], len(targets),
+    )
+    return 0 if counts["failed"] == 0 else 1
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.bayer_seeds",
+        description="Scrape Bayer DEKALB / Asgrow / WestBred seed varieties.",
+    )
+    p.add_argument(
+        "--limit", type=int, default=None,
+        help="Stop after processing N varieties (default: all).",
+    )
+    p.add_argument(
+        "--force", action="store_true",
+        help="Re-fetch even if the markdown file already exists.",
+    )
+    p.add_argument(
+        "--brand", default=None, choices=sorted(BRANDS),
+        help="Limit to one Bayer seed brand.",
+    )
+    p.add_argument(
+        "--product", default=None,
+        help="Process a single variety by source_key "
+             "(e.g. 'dekalb-dkc62-08rib') or terminal URL slug.",
+    )
+    p.add_argument(
+        "--log-level", default=os.environ.get("LOG_LEVEL", "INFO"),
+        help="Python logging level (default INFO).",
+    )
+    return p


 def main(argv: list[str] | None = None) -> int:
-    print("bayer_seeds: not implemented yet — see ~/github/crop-chem-docs/scrape/sources/bayer.py for the reference Next.js extraction pattern",
-          file=sys.stderr)
-    return 2
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr,
+    )
+    return run(
+        limit=args.limit,
+        force=args.force,
+        only_brand=args.brand,
+        only_product=args.product,
+    )


 if __name__ == "__main__":
-    sys.exit(main(sys.argv[1:]))
+    sys.exit(main())
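
`RateLimitedSession.request` in this diff honors `Retry-After` only when the header is a plain integer number of seconds (`ra.isdigit()`); HTTP also allows an HTTP-date form, which currently falls through to the exponential backoff. A minimal sketch of a parser that accepts both forms follows. The helper name `retry_after_seconds`, its `fallback` parameter, and the injectable `now` are assumptions for illustration and are not part of this commit.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, fallback, now=None):
    """Hypothetical helper: turn a Retry-After header value into seconds.

    Accepts delta-seconds ("120") or an HTTP-date
    ("Wed, 21 Oct 2015 07:28:00 GMT"); returns ``fallback`` otherwise.
    """
    if not value:
        return fallback
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: let the caller's own backoff apply.
        return fallback
    if when.tzinfo is None:
        # Treat a zone-less date as UTC so the subtraction below is valid.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

If adopted, the call site could pass its existing jittered backoff as `fallback`, so behavior for missing or malformed headers would stay as it is today.
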