agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)

agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the three
  rated sections (Agronomics / Grain / Disease) with their
  <div class="row"><div class="label">label</div><div>value</div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley.
  This provides the Northern Plains HRS coverage WestBred lacks.

nk (122 varieties; recon's "29" was outdated, the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the title
  <span>, "Brands Available" trait variants, marketing positioning +
  bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength, Drought)
  are not text-extractable; we record the labels with an explicit
  "rated in PDF chart, see tech sheet" value so the agent can direct
  the farmer at the source for those numbers.

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant, the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst". AgriPro
  positioning on stripe-rust-resistant varieties (AP Iliad with
  Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR + Beck's
  products; everything else IS now in the corpus.

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475  (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122  (41 corn / 81 soy)
  agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).

requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
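
A minimal sketch of the "Label N" parsing step mentioned in the message, assuming the tech-sheet page text has already been extracted to a plain string. The regex, the name parse_label_number_lines, and the sample ratings are illustrative assumptions, not the scraper's actual code or real NK values:

```python
import re

# Hypothetical pattern: a label made of letters, spaces, apostrophes,
# slashes, dots or hyphens, followed by a single 1-9 rating.
LABEL_NUMBER_RE = re.compile(
    r"(?P<label>[A-Za-z][A-Za-z' /.\-]*?)[ \t]+(?P<value>[1-9])"
)


def parse_label_number_lines(text: str) -> list[dict[str, str]]:
    """Return [{characteristic, value}, ...] for lines shaped like "Label N"."""
    rows: list[dict[str, str]] = []
    for line in text.splitlines():
        # fullmatch rejects lines with anything besides one label and one digit.
        m = LABEL_NUMBER_RE.fullmatch(line.strip())
        if m:
            # Values stay verbatim strings; scale direction is declared elsewhere.
            rows.append({"characteristic": m.group("label"),
                         "value": m.group("value")})
    return rows


# Made-up sample text, shaped like the lines the message describes.
SAMPLE_TEXT = """\
Gray Leaf Spot 3
Goss's Wilt 2
Emergence
1-9 Scale: 1 = Best, 9 = Worst
"""

if __name__ == "__main__":
    for row in parse_label_number_lines(SAMPLE_TEXT):
        print(row["characteristic"], row["value"])
```

In this sketch a bare chart label such as Emergence carries no number and falls through unmatched, which fits recording it separately with a "see tech sheet" value; the scale footer is also skipped because it does not start with a letter.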
2026-05-25 14:16:36 -04:00
parent 2588ebafa1
commit 9ce920f622
296 changed files with 23233 additions and 60 deletions
@@ -1,37 +1,503 @@
-"""AgriPro scraper (Syngenta wheat brand).
+"""AgriPro (Syngenta) wheat scraper.

-Source: ``https://www.agriprowheat.com`` — Drupal Views form,
-server-rendered HTML. No headless browser needed.
+Source: ``agriprowheat.com`` — Drupal site, server-rendered HTML.
+robots.txt is empty (no Disallow).

-Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
-plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
-under a separate brand and is out of scope for AgriPro.
+Expected count: 24 varieties spanning Hard Red Winter (HRW), Hard
+Red Spring (HRS), Hard White Spring (HWS), Soft White Winter (SWW),
+Soft White Spring (SWS), and durum, plus barley. NO SRW: Syngenta's
+Soft Red Winter sits at GrowProGenetics.com under a separate brand,
+out of scope for AgriPro.

-Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
-implicit in product family naming, not always a separate field).
+Discovery: the variety listing at
+``/search-agripro-brand-varieties`` server-renders only the
+filter form; the actual variety rows are populated by a Drupal
+Views AJAX call. We sidestep the AJAX by passing the filter values
+as GET params on the same path:

-Schema notes:
- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
- ``relative_maturity`` and ``maturity_group`` are null for wheat
- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
-  Septoria / tan spot
- Quality: test weight, protein, falling number, straw strength
+  /search-agripro-brand-varieties?title=&variety_type_value=All

-TODO: implement.
+That returns the fully-rendered list (24 rows in
+``.block-views-blockvarieties-search-varieties-search-block``) with
+links to ``/variety/<slug>`` pages.
+
+Per-variety detail comes from the variety page HTML. Useful fields:
+
+  - ``<h1>`` — product name (e.g. "AP Exceed")
+  - ``.field--node--variety-type--variety`` — wheat class
+    ("Soft White Winter", "Hard Red Spring", etc.)
+  - ``.field--node--tag-line--variety`` — short positioning slogan
+  - ``.field--node--body`` — full positioning narrative
+  - Three sections delimited by ``<h3>``: Agronomics / Grain /
+    Disease, each containing ``.row`` divs with
+    ``<div class="label">…</div><div>…</div>`` pairs.
+
+**Rating-scale direction**: AgriPro publishes disease tolerance on a
+1-9 scale where **1 = best (most resistant)** — REVERSED from
+Bayer's and Golden Harvest's "9 = best" convention. The chunker
+preserves values verbatim and the sidecar's ``_scale_direction``
+field declares the direction, so the LLM's chunk-preamble framing
+will correctly say "(1 = best)" — anti-hallucination guarantee
+holds even across vendors with opposite scales.
+
+(Agronomic ratings on AgriPro are qualitative — "Excellent / Very
+Good / Good / Fair / Poor" — and don't have a numeric direction
+issue. They're preserved verbatim.)
+
+Output:
+  corpus/agripro/<source_key>.md
+  corpus/agripro/<source_key>.json
+
+source_key convention: ``agripro-<slug>`` lowercased, e.g.
+``agripro-ap-exceed`` or ``agripro-sy-assure``.
+
+CLI:
+  python -m scrape.sources.agripro --limit 5
+  python -m scrape.sources.agripro --force
 """
+
 from __future__ import annotations

+import argparse
+import json
+import logging
+import os
+import random
+import re
 import sys
+import time
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import requests
+from bs4 import BeautifulSoup
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://agriprowheat.com"
+LIST_URL = f"{BASE}/search-agripro-brand-varieties?title=&variety_type_value=All"
+
+# AgriPro disease ratings: 1-9, LOWER number = MORE resistant. This
+# is the inverse of Bayer/Golden-Harvest's 1-9 (9 = best) convention.
+# Document this in the sidecar so the chunker / LLM never mis-renders.
+RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"
+
+# Class abbreviations for the wheat_class field. AgriPro renders the
+# full English name; we map it to the canonical short form the rest
+# of the corpus uses (matches schema notes in seed-mcp/CLAUDE.md).
+WHEAT_CLASS_MAP = {
+    "hard red winter": "HRW",
+    "hard red spring": "HRS",
+    "hard white spring": "HWS",
+    "hard white winter": "HWW",
+    "soft white winter": "SWW",
+    "soft white spring": "SWS",
+    "soft red winter": "SRW",
+    "durum": "Durum",
+}
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "agripro"
+
+REQ_INTERVAL_SEC = 1.0
+
+log = logging.getLogger("scrape.agripro")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(
+        self,
+        method: str,
+        url: str,
+        *,
+        max_retries: int = 4,
+        timeout: float = 30.0,
+        **kw: Any,
+    ) -> requests.Response:
+        last_exc: Exception | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s — retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s — retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if last_exc:
+            raise last_exc
+        return resp  # type: ignore[return-value]
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class APProduct:
+    source_key: str
+    source_url: str
+    product_name: str = ""
+    wheat_class: str | None = None
+    positioning_statement: str | None = None
+    tagline: str | None = None
+    characteristics_groups: list[dict] = field(default_factory=list)
+
+
+# --------------------------------------------------------------------- discovery
+
+
+def discover_varieties(http: RateLimitedSession) -> list[str]:
+    """Fetch the variety-search page and return the list of
+    ``/variety/<slug>`` URLs found in it.
+
+    Deduplicates the two links each row carries (the hero-image link
+    and the "view full details" link both point to the same page).
+    """
+    log.info("fetching variety list %s", LIST_URL)
+    r = http.get(LIST_URL)
+    r.raise_for_status()
+    soup = BeautifulSoup(r.text, "html.parser")
+    urls: list[str] = []
+    seen: set[str] = set()
+    for a in soup.find_all("a", href=re.compile(r"^/variety/")):
+        h = a["href"]
+        if h in seen:
+            continue
+        seen.add(h)
+        urls.append(BASE + h)
+    log.info("variety URLs found: %d", len(urls))
+    return urls
+
+
+# --------------------------------------------------------------------- helpers
+
+
+def source_key_for(url: str) -> str:
+    """``/variety/ap-exceed`` → ``agripro-ap-exceed``."""
+    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
+    return f"agripro-{tail}"
+
+
+def normalize_wheat_class(raw: str | None) -> str | None:
+    if not raw:
+        return None
+    key = raw.strip().lower()
+    return WHEAT_CLASS_MAP.get(key, raw.strip())
+
+
+def _rows_in_section(soup: BeautifulSoup, h3_text: str) -> list[dict]:
+    """Walk the variety page for the section heading matching
+    ``h3_text``, then collect every ``.row`` inside the same
+    container. Returns ``[{characteristic, value}, ...]``."""
+    items: list[dict] = []
+    for h3 in soup.find_all("h3"):
+        if h3.get_text(strip=True).lower() != h3_text.lower():
+            continue
+        # Scope the search to the h3's immediate parent: every .row
+        # descendant of that container is treated as part of this
+        # section.
+        parent = h3.parent
+        if parent is None:
+            continue
+        for row in parent.find_all(class_="row"):
+            label_el = row.find(class_="label")
+            if not label_el:
+                continue
+            label = label_el.get_text(" ", strip=True)
+            # The value is whatever <div> sibling follows the label
+            # (NOT the .label div itself).
+            value: str | None = None
+            for child in row.find_all("div"):
+                if "label" in (child.get("class") or []):
+                    continue
+                # First non-label <div> with non-empty text wins.
+                t = child.get_text(" ", strip=True)
+                if t:
+                    value = t
+                    break
+            if label and value:
+                items.append({"characteristic": label, "value": value})
+        break
+    return items
+
+
+# --------------------------------------------------------------------- detail
+
+
+def fetch_product_detail(
+    http: RateLimitedSession, url: str
+) -> APProduct | None:
+    r = http.get(url)
+    if r.status_code == 404:
+        return None
+    r.raise_for_status()
+    soup = BeautifulSoup(r.text, "html.parser")
+
+    prod = APProduct(
+        source_key=source_key_for(url),
+        source_url=url,
+    )
+
+    h1 = soup.find("h1")
+    if h1:
+        prod.product_name = h1.get_text(strip=True)
+
+    vt = soup.find(class_="field--node--variety-type--variety")
+    if vt:
+        prod.wheat_class = normalize_wheat_class(vt.get_text(strip=True))
+
+    tl = soup.find(class_="field--node--tag-line--variety")
+    if tl:
+        prod.tagline = tl.get_text(strip=True) or None
+
+    # Body text — the long-form positioning narrative.
+    body = soup.find(class_=re.compile(r"field--node--body"))
+    if body:
+        prod.positioning_statement = body.get_text(" ", strip=True) or None
+
+    # Tagline alone if no body — better than nothing.
+    if not prod.positioning_statement and prod.tagline:
+        prod.positioning_statement = prod.tagline
+
+    # The three rated sections on every variety page.
+    groups: list[dict] = []
+    for label, h3 in (
+        ("AGRONOMICS", "Agronomics"),
+        ("GRAIN", "Grain"),
+        ("DISEASE RATINGS", "Disease"),
+    ):
+        items = _rows_in_section(soup, h3)
+        if items:
+            groups.append({"label": label, "type": "fields", "items": items})
+    prod.characteristics_groups = groups
+
+    return prod
+
+
+# --------------------------------------------------------------------- render
+
+
+def render_markdown(p: APProduct) -> str:
+    title = p.product_name or p.source_key
+    head: list[str] = [
+        f"# {title}",
+        "",
+        "- **Vendor:** Syngenta",
+        "- **Brand:** AgriPro",
+        "- **Crop:** Wheat",
+    ]
+    if p.wheat_class:
+        head.append(f"- **Wheat class:** {p.wheat_class}")
+    if p.tagline:
+        head.append(f"- **Tagline:** {p.tagline}")
+    head.append(f"- **Source:** {p.source_url}")
+    head.append(f"- **Rating scale (AgriPro):** {RATING_SCALE_DIRECTION}")
+    head.append("")
+    head.append("---")
+    head.append("")
+
+    sections: list[str] = []
+    if p.positioning_statement and p.positioning_statement != p.tagline:
+        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
+
+    for g in p.characteristics_groups:
+        label = (g.get("label") or "Characteristics").title()
+        items = g.get("items") or []
+        if not items:
+            continue
+        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
+        sections.append(
+            f"## {label}\n\n"
+            "| Characteristic | Value |\n"
+            "|---|---|\n"
+            f"{rows}\n"
+        )
+    return "\n".join(head) + "\n".join(sections)
+
+
+# --------------------------------------------------------------------- write
+
+
+def write_product(prod: APProduct, body_md: str) -> None:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    md_path = CORPUS_DIR / f"{prod.source_key}.md"
+    json_path = CORPUS_DIR / f"{prod.source_key}.json"
+
+    md_path.write_text(body_md, encoding="utf-8")
+    sidecar = {
+        "source": "agripro",
+        "source_key": prod.source_key,
+        "vendor": "Syngenta",
+        "brand": "AgriPro",
+        "product_name": prod.product_name,
+        "product_id": None,
+        "hybrid_prefix": prod.product_name,
+        "hybrid_suffix": None,
+        "crop": "wheat",
+        "release_year": None,
+        "relative_maturity": None,
+        "maturity_group": None,
+        "wheat_class": prod.wheat_class,
+        "trait_stack": [],
+        "trait_descriptions": [],
+        "positioning_statement": prod.positioning_statement,
+        "tagline": prod.tagline,
+        "strengths": [],
+        "characteristics_groups": prod.characteristics_groups,
+        # AgriPro's reversed direction is the load-bearing field here:
+        # any cross-vendor disease-resistance comparison MUST consult
+        # this before interpreting values. The chunker reads it; the
+        # api_lessons file's rating-scales section documents the
+        # convention.
+        "_scale_direction": RATING_SCALE_DIRECTION,
+        "regional_recommendations": [],
+        "image_url": None,
+        "source_urls": [prod.source_url],
+        "sitemap_last_modified": None,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    json_path.write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def process_product(
+    http: RateLimitedSession,
+    *,
+    url: str,
+    force: bool,
+) -> tuple[str, APProduct | None]:
+    source_key = source_key_for(url)
+    md_path = CORPUS_DIR / f"{source_key}.md"
+    if md_path.exists() and not force:
+        return "skipped", None
+    try:
+        prod = fetch_product_detail(http, url)
+    except Exception as exc:  # noqa: BLE001
+        log.error("detail fetch failed for %s: %s", url, exc)
+        return "failed", None
+    if prod is None:
+        return "missing", None
+    body = render_markdown(prod)
+    write_product(prod, body)
+    return "written", prod
+
+
+def run(
+    *,
+    limit: int | None,
+    force: bool,
+    only_product: str | None,
+) -> int:
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    targets = discover_varieties(http)
+    if only_product:
+        targets = [
+            u for u in targets
+            if source_key_for(u) == only_product
+            or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product.lower()
+        ]
+        if not targets:
+            log.error("no variety matched --product=%s", only_product)
+            return 2
+
+    counts = {"written": 0, "skipped": 0, "missing": 0, "failed": 0}
+    processed = 0
+    for url in targets:
+        if limit is not None and processed >= limit:
+            break
+        processed += 1
+        status, prod = process_product(http, url=url, force=force)
+        counts[status] = counts.get(status, 0) + 1
+        if prod is not None:
+            log.info(
+                "[%d/%s] %s %s | class=%s groups=%d",
+                processed, str(limit) if limit else "all",
+                prod.source_key, status,
+                prod.wheat_class or "-",
+                len(prod.characteristics_groups),
+            )
+        else:
+            log.info("[%d/%s] %s %s",
+                     processed, str(limit) if limit else "all",
+                     source_key_for(url), status)
+
+    log.info(
+        "done: processed=%d written=%d skipped=%d missing=%d failed=%d (of %d candidates)",
+        processed, counts["written"], counts["skipped"],
+        counts["missing"], counts["failed"], len(targets),
+    )
+    return 0 if counts["failed"] == 0 else 1
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.agripro",
+        description="Scrape AgriPro (Syngenta) wheat varieties.",
+    )
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after processing N varieties (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-fetch even if the markdown file already exists.")
+    p.add_argument("--product", default=None,
+                   help="Process a single variety by source_key or URL tail.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p


 def main(argv: list[str] | None = None) -> int:
-    print("agripro: deferred — Drupal Views form, only wheat in the corpus, no SRW (separate brand). See reference_seed_vendor_recon.md.",
-          file=sys.stderr)
-    # Return 0 so the monthly CI workflow doesn't fail when this
-    # source is listed but not yet implemented. Real implementation
-    # will return 0 on success / 1 on failure.
-    return 0
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr,
+    )
+    return run(
+        limit=args.limit,
+        force=args.force,
+        only_product=args.product,
+    )


 if __name__ == "__main__":
-    sys.exit(main(sys.argv[1:]))
+    sys.exit(main())
@@ -1,38 +1,828 @@
-"""NK scraper (Syngenta brand).
+"""NK (Syngenta) seed scraper — corn + soybeans.

-Source: ``https://www.syngenta-us.com`` — static HTML product pages
-plus tech-sheet PDFs on the Syngenta CDN at
-``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
+Source: ``syngenta-us.com`` — ASP.NET WebForms catalog with an
+ASMX-style JSON endpoint for the seed-finder UI, plus tech-sheet
+PDFs on the Syngenta CDN at
+``assets.syngenta-us.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.

-Expected count: 29 varieties (12 corn + 17 soy). No wheat.
+Expected count: 122 varieties (41 corn + 81 soy on 2026-05-25). No
+wheat.

-The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
-``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
-helper module under ``scrape.sources._syngenta_pdf`` once both
-scrapers are written.
+Discovery: the HTML catalog pages (``/corn/nk/products``,
+``/soybeans/nk/products``) load product cards via JS. The JS calls

-Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
-have marketing copy only). Use pdfplumber for table extraction.
+  POST /NKSeeds/CornProductFinder.aspx/GetProducts
+  POST /NKSeeds/SoyProductFinder.aspx/GetProducts

-Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
-etc. — additional supplemental context worth ingesting once the
-per-variety scrape is solid.
+Both endpoints return ASP.NET's ``{"d": "..."}`` wrapper where ``d``
+is a string of HTML fragments separated by `` @ `` containing one
+``<div class="sf-result">`` per variety. Each card carries:

-TODO: implement.
+  - product code (e.g. ``NK8005`` / ``NK008-P8XF``)
+  - RM days (corn) / MG decimal (soy) in a ``<span>`` next to the
+    title
+  - "Brands Available" line listing trait variants
+    (NK8005-V, NK8005-GT/LL — these are trait-specific SKUs)
+  - positioning slogan + bullet-list of strengths
+  - tech-sheet PDF URL
+
+Per-variety disease ratings live ONLY in the PDF tech sheets (the
+HTML cards have marketing text but no rating numbers). We extract
+disease ratings via ``pdfplumber`` text extraction — they appear as
+"Label Number" lines that we parse with a regex.
+
+**Rating-scale direction**: NK explicitly publishes
+``1-9 Scale: 1 = Best, Tallest or Highest; 9 = Worst, Shortest or
+Lowest`` on every tech sheet — REVERSED from Bayer/Golden Harvest.
+The chunker preserves values verbatim and the sidecar's
+``_scale_direction`` field declares this so the LLM correctly
+interprets the chunk preamble.
+
+**Agronomic ratings**: rendered as horizontal bar charts in the
+PDF; pdfplumber's text extraction captures the LABELS (Emergence,
+Stalk Strength, Drought, etc.) but NOT the bar values. Surfacing
+those would require either OCR of the bar positions or pdfplumber's
+geometric layout parsing — deferred. For now the chunk records the
+labels and an explicit "agronomic ratings rendered as chart bars in
+the source PDF — values not currently extracted" annotation so the
+agent knows to direct the farmer at the tech-sheet PDF for those
+numbers.
+
+Tech-sheet PDF URLs are read from the API response, not constructed:
+the assets-host filenames embed a YYMMDD stamp that changes.
+
+Output:
+  corpus/nk/<source_key>.md
+  corpus/nk/<source_key>.json
+
+source_key convention: ``nk-<code>`` lowercased, e.g.
+``nk-nk8005`` or ``nk-nk008-p8xf``.
+
+CLI:
+  python -m scrape.sources.nk --limit 5
+  python -m scrape.sources.nk --crop corn --limit 12
+  python -m scrape.sources.nk --force
 """
+
 from __future__ import annotations

+import argparse
+import io
+import json
+import logging
+import os
+import random
+import re
 import sys
+import time
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import requests
+from bs4 import BeautifulSoup
+import pdfplumber
+
+SCRAPER_VERSION = "0.1.0"
+USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
+BASE = "https://www.syngenta-us.com"
+CORN_LIST_URL = f"{BASE}/corn/nk/products"
+SOY_LIST_URL = f"{BASE}/soybeans/nk/products"
+CORN_API = f"{BASE}/NKSeeds/CornProductFinder.aspx/GetProducts"
+SOY_API = f"{BASE}/NKSeeds/SoyProductFinder.aspx/GetProducts"
+
+# NK + AgriPro both use the "1 = best, lower = more resistant" convention.
+# Confirmed by tech-sheet footer: "1-9 Scale: 1 = Best...; 9 = Worst..."
+RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "nk"
+
+REQ_INTERVAL_SEC = 1.0
+
+log = logging.getLogger("scrape.nk")
+
+
+# --------------------------------------------------------------------- HTTP
+
+
+class RateLimitedSession:
+    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
+        self.s = requests.Session()
+        self.s.headers["User-Agent"] = USER_AGENT
+        self.interval = interval
+        self._last = 0.0
+
+    def _wait(self) -> None:
+        delta = time.monotonic() - self._last
+        if delta < self.interval:
+            time.sleep(self.interval - delta)
+        self._last = time.monotonic()
+
+    def request(
+        self,
+        method: str,
+        url: str,
+        *,
+        max_retries: int = 4,
+        timeout: float = 30.0,
+        **kw: Any,
+    ) -> requests.Response:
+        last_exc: Exception | None = None
+        for attempt in range(max_retries):
+            self._wait()
+            try:
+                resp = self.s.request(method, url, timeout=timeout, **kw)
+            except requests.RequestException as exc:
+                last_exc = exc
+                backoff = min(30.0, (2 ** attempt) + random.random())
+                log.warning("network error on %s %s: %s — retry in %.1fs",
+                            method, url, exc, backoff)
+                time.sleep(backoff)
+                continue
+            if resp.status_code == 429 or 500 <= resp.status_code < 600:
+                ra = resp.headers.get("Retry-After")
+                backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
+                log.warning("HTTP %d on %s %s — retry in %.1fs",
+                            resp.status_code, method, url, backoff)
+                time.sleep(backoff)
+                continue
+            return resp
+        if last_exc:
+            raise last_exc
+        return resp  # type: ignore[return-value]
+
+    def get(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("GET", url, **kw)
+
+    def post(self, url: str, **kw: Any) -> requests.Response:
+        return self.request("POST", url, **kw)
+
+
+# --------------------------------------------------------------------- model
+
+
+@dataclass
+class NKProduct:
+    source_key: str
+    source_url: str            # the brand catalog page (closest thing to a per-variety URL)
+    crop: str                  # "corn" | "soybeans"
+    product_code: str = ""     # NK8005 / NK008-P8XF
+    relative_maturity: str | None = None   # corn
+    maturity_group: str | None = None      # soy
+    brand_variants: list[str] = field(default_factory=list)   # ["NK8005-V", "NK8005-GT/LL"]
+    trait_codes: list[str] = field(default_factory=list)
+    trait_descriptions: list[str] = field(default_factory=list)
+    positioning_statement: str | None = None
+    strengths: list[str] = field(default_factory=list)
+    techsheet_url: str | None = None
+    characteristics_groups: list[dict] = field(default_factory=list)
+
+
+# --------------------------------------------------------------------- discovery
+
+
+def _api_payload_corn(rm_low: str, rm_high: str) -> str:
+    """Payload for ``CornProductFinder.aspx/GetProducts``."""
+    return json.dumps({
+        "cornCount": "1",
+        "rmLowerRange": rm_low,
+        "rmUpperRange": rm_high,
+        "brands": "NK",
+        "agisuraTraits": "",
+        "insectResistance": "",
+        "herbicideTolerance": "",
+        "waterOptimization": "",
+        "reducedRefuge": "",
+        "diseaseResistence": "",
+        "silage": "",
+        "path": "false",
+        "currentUrl": CORN_LIST_URL,
+        "fieldForged": "",
+        "newProduct": "",
+    })
+
+
+def _api_payload_soy(rm_low: str, rm_high: str) -> str:
+    return json.dumps({
+        "soyaBeanCount": "1",
+        "rmLowerRange": rm_low,
+        "rmUpperRange": rm_high,
+        "herbicideTolerance": "",
+        "diseaseFilter": "",
+        "nematodeFilter": "",
+        "agroPlantCharFilter": "",
+        "plantHeightFilter": "",
+        "brands": "NK",
+        "browserURL": SOY_LIST_URL,
+        "fieldForged": "",
+        "newProduct": "",
+    })
+
+
+def _parse_card(html_chunk: str, crop: str) -> NKProduct | None:
+    """Parse one ``<div class="sf-result">`` card from the API
+    response into an NKProduct."""
+    soup = BeautifulSoup(html_chunk, "html.parser")
+    title_el = soup.find(class_="sf-result-title")
+    if not title_el:
+        return None
+    # Title is the product code as a text node, then an RM/MG <span>.
+    code = next((s.strip() for s in title_el.contents[:1] if isinstance(s, str)), "")
+    if not code:
+        return None
+    rm_str: str | None = None
+    span = title_el.find("span")
+    if span:
+        # span text is like "RM\n80" — strip to digits/decimal
+        text = span.get_text(" ", strip=True)
+        m = re.search(r"(\d+(?:\.\d+)?)", text)
+        if m:
+            rm_str = m.group(1)
+
+    prod = NKProduct(
+        source_key=f"nk-{code.lower()}",
+        # NK doesn't expose per-variety URLs; the brand catalog is the
+        # nearest equivalent. lookup_variety / get_page will still work
+        # via source_key.
+        source_url=CORN_LIST_URL if crop == "corn" else SOY_LIST_URL,
+        crop=crop,
+        product_code=code,
+    )
+    if rm_str is not None:
+        if crop == "corn":
+            prod.relative_maturity = rm_str
+        else:
+            prod.maturity_group = rm_str
+
+    # Brands Available (trait variants).
+    inner = soup.find(class_="sf-result-content-inner")
+    if inner:
+        # The first <strong> with "Brands available:" or
+        # "Herbicide Tolerant Trait(s):" sets the trait context.
+        for strong in inner.find_all("strong"):
+            text = strong.get_text(" ", strip=True)
+            if text.lower().startswith("brands available"):
+                rest = text.split(":", 1)[1] if ":" in text else ""
+                for v in rest.split("|"):
+                    v = v.strip()
+                    if v:
+                        prod.brand_variants.append(v)
+            elif text.lower().startswith("herbicide tolerant trait"):
+                rest = text.split(":", 1)[1] if ":" in text else ""
+                for t in rest.split(","):
+                    t = t.strip()
+                    if t:
+                        prod.trait_codes.append(t)
+            else:
+                # Positioning slogan is also rendered as a bare <strong>.
+                if not prod.positioning_statement and len(text) > 12:
+                    prod.positioning_statement = text
+
+        # Bullet strengths
+        ul = inner.find("ul")
+        if ul:
+            for li in ul.find_all("li"):
+                t = li.get_text(" ", strip=True)
+                if t:
+                    prod.strengths.append(t)
+
+    # Tech-sheet PDF URL.
+    for a in soup.find_all("a", href=True):
+        h = a["href"]
+        if "assets.syngenta-us.com/pdf/techsheets/" in h and h.lower().endswith(".pdf"):
+            prod.techsheet_url = h
+            break
+
+    return prod
+
+
+def discover_products(
+    http: RateLimitedSession,
+    *,
+    only_crop: str | None = None,
+) -> list[NKProduct]:
+    """Hit the corn + soy product-finder APIs and parse the returned
+    HTML cards into NKProducts. Returns identity-level data only;
+    ratings come from the per-variety tech-sheet PDF in
+    ``enrich_with_pdf``."""
+    # Warm the session cookie (some Syngenta deployments need it).
+    http.get(CORN_LIST_URL)
+
+    out: list[NKProduct] = []
+    headers = {
+        "Content-Type": "application/json; charset=utf-8",
+        "X-Requested-With": "XMLHttpRequest",
+    }
+
+    def _parse_response(html_blob: str, crop: str) -> int:
+        """Parse the API response's inner HTML into NKProducts.
+
+        The endpoint emits one ``<div class="sf-result">`` per variety,
+        each wrapped in a ``<div class="col-md-6">`` column. Strip the
+        leading ``@`` markers and let BeautifulSoup tokenize the whole
+        blob — no per-chunk split (the API doesn't actually delimit
+        with ``@`` reliably, despite appearances).
+        """
+        n = 0
+        # Remove every "@" in the blob (noise rendered by the JS when
+        # filters change, not a structural delimiter). This also drops
+        # any "@" that appears inside card text.
+        cleaned = html_blob.replace("@", "").strip()
+        soup = BeautifulSoup(cleaned, "html.parser")
+        for card in soup.find_all("div", class_="sf-result"):
+            prod = _parse_card(str(card), crop)
+            if prod:
+                out.append(prod)
+                n += 1
+        return n
+
+    if only_crop in (None, "corn"):
+        log.info("fetching NK corn product list")
+        r = http.post(
+            CORN_API,
+            data=_api_payload_corn("75", "120"),
+            headers={**headers, "Referer": CORN_LIST_URL},
+        )
+        r.raise_for_status()
+        n = _parse_response(r.json().get("d") or "", "corn")
+        log.info("corn cards parsed: %d", n)
+
+    if only_crop in (None, "soybeans"):
+        log.info("fetching NK soy product list")
+        r = http.post(
+            SOY_API,
+            data=_api_payload_soy("0", "9.9"),
+            headers={**headers, "Referer": SOY_LIST_URL},
+        )
+        r.raise_for_status()
+        n = _parse_response(r.json().get("d") or "", "soybeans")
+        log.info("soy cards parsed: %d", n)
+
+    log.info("total: %d NK varieties", len(out))
+    return out
+
+
+# --------------------------------------------------------------------- PDF
+
+
+def _extract_disease_ratings(text: str) -> list[dict]:
+    """Pull disease-tolerance ratings out of the tech-sheet PDF text.
+
+    The PDF renders disease ratings as a left-column-label /
+    right-column-number layout. pdfplumber's ``extract_text``
+    interleaves the agronomic-chart labels (no number) with the
+    disease-rating labels + numbers, so we only accept lines made of
+    a known disease label followed by a single-digit rating (1-9) or
+    a literal ``-`` (not available).
+
+    Returns a list of ``{characteristic, value}``. Values are
+    preserved as strings (including ``-`` for "not available").
+    """
+    # The disease list per tech sheet is small (~10 conditions) and
+    # the labels are stable. We anchor on the known label set rather
+    # than try to guess by layout.
+    known_diseases = [
+        "Gray Leaf Spot",
+        "Northern Corn Leaf Blight",
+        "Goss's Wilt",
+        "Goss's wilt",
+        "Bacterial Leaf Streak",
+        "Bacterial Corn Leaf Streak",
+        "Southern Corn Leaf Blight",
+        "Anthracnose Stalk Rot",
+        "Anthracnose Leaf Blight",
+        "Tar Spot",
+        "Fusarium Crown Rot",
+        "Common Rust",
+        "Southern Rust",
+        "Eye Spot",
+        "Stewart's Bacterial Wilt",
+        # Soybean
+        "Brown Stem Rot",
+        "Charcoal Rot",
+        "Frogeye Leaf Spot",
+        "Iron Deficiency Chlorosis",
+        "Phytophthora Root Rot",
+        "Sclerotinia White Mold",
+        "White Mold",
+        "Soybean Cyst Nematode",
+        "Sudden Death Syndrome",
+        "Southern Stem Canker",
+        "Stem Canker",
+        "Soybean Mosaic Virus",
+    ]
+    items: list[dict] = []
+    for line in text.splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        # Match "<label> <value>" where label is one of known_diseases
+        # and value is a single digit or "-".
+        for d in known_diseases:
+            m = re.match(rf"^{re.escape(d)}\s+([1-9]|-)\s*$", line)
+            if m:
+                items.append({"characteristic": d, "value": m.group(1)})
+                break
+    # Dedup while preserving order
+    seen: set[str] = set()
+    deduped: list[dict] = []
+    for it in items:
+        if it["characteristic"] not in seen:
+            seen.add(it["characteristic"])
+            deduped.append(it)
+    return deduped
+
+
+def _extract_phytophthora_genes(text: str) -> str | None:
+    """Soybean tech sheets list the Phytophthora Root Rot (PRR) source
+    genes (Rps1c / Rps3a / etc.). The exact line wording varies; we
+    accept several common phrasings."""
+    patterns = (
+        r"Phytophthora Root Rot\s*\(PRR\)\s*Source\s+(.+)",
+        r"PRR Source\s*[:\-]?\s*(.+)",
+        r"Phytophthora Gene\s*[:\-]?\s*(.+)",
+    )
+    for line in text.splitlines():
+        line = line.strip()
+        for p in patterns:
+            m = re.match(p, line, re.I)
+            if m:
+                val = m.group(1).strip()
+                # Trim trailing words that obviously aren't gene names
+                # ("Source Rps1c, Rps3a Emergence 3" can run together).
+                val = re.split(r"\s+(?:Emergence|Soybean|Standability|Root)\b", val, maxsplit=1)[0].strip()
+                if val and val.lower() not in ("-", "na", "n/a", "none"):
+                    return val
+    return None
+
+
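+def _extract_scn_source(text: str) -> str | None:
+    """Return the value of an ``SCN Source`` / ``Cyst Nematode Source``
+    line, or None when the line is absent or its value is ``-``."""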
+    for line in text.splitlines():
+        line = line.strip()
+        m = re.match(r"^(SCN Source|Cyst Nematode Source)\s*[:\-]?\s*(.+)$", line, re.I)
+        if m:
+            val = m.group(2).strip()
+            if val and val != "-":
+                return val
+    return None
+
+
+def _extract_scn_races(text: str) -> str | None:
+    """Return the value that follows ``Soybean Cyst Nematode (SCN)
+    Races`` on a soybean tech sheet (e.g. ``S`` or ``R3``), or None."""
+    for line in text.splitlines():
+        line = line.strip()
+        m = re.match(
+            r"^Soybean Cyst Nematode \(SCN\) Races\s+(.+)$", line, re.I,
+        )
+        if m:
+            val = m.group(1).strip()
+            if val:
+                return val
+    return None
+
+
+# Soy agronomic ratings rendered as text "Label N" pairs in the PDF.
+# These ARE extractable (unlike the bar charts).
+_SOY_AGRO_LABELS = (
+    "Emergence", "Standability", "Shatter Tolerance",
+    "Green Stem", "% Protein at 13% mst.", "% Oil at 13% mst.",
+)
+
+
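+def _extract_soy_agronomic_text(text: str) -> list[dict]:
+    """Collect ``{characteristic, value}`` for each label in
+    ``_SOY_AGRO_LABELS``; the first match per label wins."""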
+    out: list[dict] = []
+    for label in _SOY_AGRO_LABELS:
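+        # Allow trailing decimal for %Protein / %Oil; single digit
+        # for the 1-9 ratings. "\b" never matches between "-" and
+        # whitespace, so the "-" (not available) branch uses a
+        # lookahead instead.
+        m = re.search(
+            rf"{re.escape(label)}\s+(\d+(?:\.\d+)?\b|-(?!\S))",
+            text,
+        )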
+        if m:
+            out.append({"characteristic": label, "value": m.group(1)})
+    return out
+
+
+# Soil-type adaptation lines on soy PDFs: "Drought Prone Best",
+# "Narrow Rows Best", "High pH* Good", etc.
+_SOY_SOIL_LABELS = (
+    "Drought Prone", "Narrow Rows", "High pH",
+    "Wide Rows", "Highly Productive",
+    "Moderate/Variable Environments", "Poorly Drained",
+)
+
+
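+def _extract_soy_soil_adaptation(text: str) -> list[dict]:
+    """Collect the Best/Good/Fair/Poor adaptation value for each label
+    in ``_SOY_SOIL_LABELS``; the first match per label wins."""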
+    out: list[dict] = []
+    for label in _SOY_SOIL_LABELS:
+        m = re.search(
+            rf"{re.escape(label)}\*?\s+(Best|Good|Fair|Poor)\b",
+            text,
+        )
+        if m:
+            out.append({"characteristic": label, "value": m.group(1)})
+    return out
+
+
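+def enrich_with_pdf(
+    http: RateLimitedSession, prod: NKProduct
+) -> None:
+    """Fetch the tech-sheet PDF and append to
+    ``prod.characteristics_groups``: disease ratings, the soybean
+    text-extractable groups, and (corn only) placeholders for
+    chart-only agronomic labels. Fetch or parse failures are logged
+    and leave ``prod`` with identity-level data only."""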
+    if not prod.techsheet_url:
+        log.info("%s: no tech sheet URL — identity only", prod.source_key)
+        return
+    try:
+        r = http.get(prod.techsheet_url)
+        r.raise_for_status()
+    except Exception as exc:  # noqa: BLE001
+        log.warning("%s: PDF fetch failed (%s) — identity only",
+                    prod.source_key, exc)
+        return
+    try:
+        with pdfplumber.open(io.BytesIO(r.content)) as pdf:
+            text = "\n".join((p.extract_text() or "") for p in pdf.pages)
+    except Exception as exc:  # noqa: BLE001
+        log.warning("%s: PDF parse failed (%s) — identity only",
+                    prod.source_key, exc)
+        return
+
+    disease = _extract_disease_ratings(text)
+    if disease:
+        prod.characteristics_groups.append({
+            "label": "DISEASE RATINGS",
+            "type": "pdf-text",
+            "items": disease,
+        })
+
+    if prod.crop == "soybeans":
+        misc_items: list[dict] = []
+        prr = _extract_phytophthora_genes(text)
+        if prr:
+            misc_items.append({"characteristic": "Phytophthora Gene", "value": prr})
+        scn = _extract_scn_source(text)
+        if scn:
+            misc_items.append({"characteristic": "SCN Source", "value": scn})
+        scn_races = _extract_scn_races(text)
+        if scn_races:
+            misc_items.append({"characteristic": "SCN Race Coverage", "value": scn_races})
+        if misc_items:
+            prod.characteristics_groups.append({
+                "label": "DISEASE GENETICS",
+                "type": "pdf-text",
+                "items": misc_items,
+            })
+
+        soy_agro = _extract_soy_agronomic_text(text)
+        if soy_agro:
+            prod.characteristics_groups.append({
+                "label": "AGRONOMIC TRAITS",
+                "type": "pdf-text",
+                "items": soy_agro,
+            })
+
+        soil = _extract_soy_soil_adaptation(text)
+        if soil:
+            prod.characteristics_groups.append({
+                "label": "SOIL TYPE ADAPTATION",
+                "type": "pdf-text",
+                "items": soil,
+            })
+
+    # Surface labels for charted-only agronomic ratings so search_docs
+    # can match queries like "drought" / "stalk strength". The values
+    # aren't extractable via text (the source PDF renders them as bar
+    # positions). We record only labels NOT already present in
+    # text-extractable groups, with an explicit "rated in PDF chart"
+    # value so the LLM directs the farmer to the tech sheet for those
+    # numbers. (For soy this is mostly redundant, since text extraction
+    # already got the agronomic numbers, so we skip the chart-label
+    # group there.)
+    if prod.crop == "corn":
+        agronomic_labels_corn = (
+            "Emergence", "Seedling Vigor", "Root Strength",
+            "Stalk Strength", "Green Snap", "Staygreen",
+            "Drydown", "Test Weight", "Drought",
+        )
+        # Skip any label already present with a non-empty value.
+        already_rated = {
+            it["characteristic"]
+            for g in prod.characteristics_groups
+            for it in g.get("items") or []
+            if str(it.get("value", "")).strip()
+        }
+        present = [label for label in agronomic_labels_corn
+                   if label in text and label not in already_rated]
+        if present:
+            prod.characteristics_groups.append({
+                "label": "AGRONOMIC CHARACTERISTICS",
+                "type": "pdf-chart",
+                "items": [
+                    {"characteristic": label, "value": "rated in tech-sheet PDF chart (not text-extractable)"}
+                    for label in present
+                ],
+            })
+
+
+# --------------------------------------------------------------------- render
+
+
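+def render_markdown(p: NKProduct) -> str:
+    """Render one product as Markdown: a metadata bullet header, then
+    Positioning, Strengths, and one table per characteristics group."""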
+    title = p.product_code or p.source_key
+    crop_label = "Corn" if p.crop == "corn" else "Soybeans"
+
+    head: list[str] = [
+        f"# {title}",
+        "",
+        "- **Vendor:** Syngenta",
+        "- **Brand:** NK",
+        f"- **Crop:** {crop_label}",
+    ]
+    if p.crop == "corn" and p.relative_maturity:
+        head.append(f"- **Relative maturity:** {p.relative_maturity}")
+    if p.crop == "soybeans" and p.maturity_group:
+        head.append(f"- **Maturity group:** {p.maturity_group}")
+    if p.brand_variants:
+        head.append(f"- **Brand variants:** {', '.join(p.brand_variants)}")
+    if p.trait_codes:
+        head.append(f"- **Traits:** {', '.join(p.trait_codes)}")
+    head.append(f"- **Catalog page:** {p.source_url}")
+    if p.techsheet_url:
+        head.append(f"- **Tech sheet (PDF):** {p.techsheet_url}")
+    head.append(f"- **Rating scale (NK):** {RATING_SCALE_DIRECTION}")
+    head.append("")
+    head.append("---")
+    head.append("")
+
+    sections: list[str] = []
+    if p.positioning_statement:
+        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
+    if p.strengths:
+        bullets = "\n".join(f"- {s}" for s in p.strengths)
+        sections.append("## Strengths\n\n" + bullets + "\n")
+
+    for g in p.characteristics_groups:
+        label = (g.get("label") or "Characteristics").title()
+        items = g.get("items") or []
+        if not items:
+            continue
+        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
+        sections.append(
+            f"## {label}\n\n"
+            "| Characteristic | Value |\n"
+            "|---|---|\n"
+            f"{rows}\n"
+        )
+    return "\n".join(head) + "\n".join(sections)
+
+
+# --------------------------------------------------------------------- write
+
+
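+def write_product(prod: NKProduct, body_md: str) -> None:
+    """Write ``<source_key>.md`` plus a ``<source_key>.json`` metadata
+    sidecar into ``CORPUS_DIR``."""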
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    md_path = CORPUS_DIR / f"{prod.source_key}.md"
+    json_path = CORPUS_DIR / f"{prod.source_key}.json"
+
+    md_path.write_text(body_md, encoding="utf-8")
+    sidecar = {
+        "source": "nk",
+        "source_key": prod.source_key,
+        "vendor": "Syngenta",
+        "brand": "NK",
+        "product_name": prod.product_code,
+        "product_id": None,
+        "hybrid_prefix": prod.product_code,
+        "hybrid_suffix": None,
+        "crop": prod.crop,
+        "release_year": None,
+        "relative_maturity": prod.relative_maturity,
+        "maturity_group": prod.maturity_group,
+        "wheat_class": None,
+        "trait_stack": prod.trait_codes,
+        "trait_descriptions": prod.trait_descriptions,
+        "brand_variants": prod.brand_variants,
+        "positioning_statement": prod.positioning_statement,
+        "strengths": prod.strengths,
+        "characteristics_groups": prod.characteristics_groups,
+        "_scale_direction": RATING_SCALE_DIRECTION,
+        "regional_recommendations": [],
+        "image_url": None,
+        "techsheet_url": prod.techsheet_url,
+        "source_urls": [prod.source_url],
+        "sitemap_last_modified": None,
+        "fetched_at": datetime.now(timezone.utc).isoformat(),
+        "scraper_version": SCRAPER_VERSION,
+    }
+    json_path.write_text(
+        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+
+
+# --------------------------------------------------------------------- pipeline
+
+
+def process_product(
+    http: RateLimitedSession,
+    prod: NKProduct,
+    *,
+    force: bool,
+) -> str:
+    """Enrich, render, and write one product. Returns ``"skipped"`` when
+    the markdown file already exists and ``force`` is false, otherwise
+    ``"written"``."""
+    md_path = CORPUS_DIR / f"{prod.source_key}.md"
+    if md_path.exists() and not force:
+        return "skipped"
+    enrich_with_pdf(http, prod)
+    body = render_markdown(prod)
+    write_product(prod, body)
+    return "written"
+
+
+def run(
+    *,
+    limit: int | None,
+    force: bool,
+    only_crop: str | None,
+    only_product: str | None,
+) -> int:
+    """Discover products, apply the CLI filters, and process each one.
+
+    Returns 0 when nothing failed, 1 when at least one product failed,
+    and 2 when ``only_product`` matched no variety.
+    """
+    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
+    http = RateLimitedSession()
+    targets = discover_products(http, only_crop=only_crop)
+
+    if only_product:
+        targets = [
+            p for p in targets
+            if p.source_key == only_product
+            or p.product_code.lower() == only_product.lower()
+        ]
+        if not targets:
+            log.error("no variety matched --product=%s", only_product)
+            return 2
+
+    counts = {"written": 0, "skipped": 0, "failed": 0}
+    processed = 0
+    for prod in targets:
+        if limit is not None and processed >= limit:
+            break
+        processed += 1
+        try:
+            status = process_product(http, prod, force=force)
+        except Exception as exc:  # noqa: BLE001
+            log.error("%s failed: %s", prod.source_key, exc)
+            status = "failed"
+        counts[status] = counts.get(status, 0) + 1
+        log.info(
+            "[%d/%s] %s %s | crop=%s rm/mg=%s variants=%d traits=%s groups=%d",
+            processed, str(limit) if limit else "all",
+            prod.source_key, status, prod.crop,
+            prod.relative_maturity or prod.maturity_group or "-",
+            len(prod.brand_variants),
+            ",".join(prod.trait_codes) or "-",
+            len(prod.characteristics_groups),
+        )
+
+    log.info(
+        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
+        processed, counts["written"], counts["skipped"],
+        counts["failed"], len(targets),
+    )
+    return 0 if counts["failed"] == 0 else 1
+
+
+# --------------------------------------------------------------------- CLI
+
+
+def _build_argparser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="scrape.sources.nk",
+        description="Scrape NK (Syngenta) corn + soybean varieties.",
+    )
+    p.add_argument("--limit", type=int, default=None,
+                   help="Stop after processing N varieties (default: all).")
+    p.add_argument("--force", action="store_true",
+                   help="Re-fetch even if the markdown file already exists.")
+    p.add_argument("--crop", default=None, choices=("corn", "soybeans"),
+                   help="Limit to one crop.")
+    p.add_argument("--product", default=None,
+                   help="Process a single variety by source_key or product code.")
+    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
+    return p


 def main(argv: list[str] | None = None) -> int:
-    print("nk: deferred — disease/agronomic ratings come from CDN tech-sheet PDFs only, use pdfplumber. See reference_seed_vendor_recon.md.",
-          file=sys.stderr)
-    # Return 0 so the monthly CI workflow doesn't fail when this
-    # source is listed but not yet implemented. Real implementation
-    # will return 0 on success / 1 on failure.
-    return 0
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr,
+    )
+    return run(
+        limit=args.limit,
+        force=args.force,
+        only_crop=args.crop,
+        only_product=args.product,
+    )


 if __name__ == "__main__":
-    sys.exit(main(sys.argv[1:]))
+    sys.exit(main())