seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy, canonical sidecar schema (per-crop), Golden Harvest disease-scale reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea packages API URL + UA header to bypass CF block on default Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed, PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer locator is login-gated. The MCP returns a curated fallback lesson directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:28:49 -04:00
commit ac40e05734
35 changed files with 3833 additions and 0 deletions
@@ -0,0 +1,61 @@
+# scrape/
+
+Per-vendor seed catalog scrapers + the runner that dispatches to
+them. Each source lives in `scrape/sources/<name>.py` with a `main()`
+entrypoint. The runner is a thin shim:
+
+```bash
+python -m scrape.runner --source bayer_seeds --force
+python -m scrape.runner --source golden_harvest --limit 20
+python -m scrape.runner --all                # only GREEN sources
+```
+
+## Output layout
+
+Each scraper writes:
+
+- `corpus/<source>/<source_key>.md` — LLM-visible body (chunk_0
+  preamble + the variety's marketing + agronomic narrative)
+- `corpus/<source>/<source_key>.json` — sidecar metadata (per
+  CLAUDE.md's canonical schema)
+
+`source_key` is a stable per-vendor slug — typically `<brand>-<sku>`
+lowercased, e.g. `dekalb-dkc62-08rib`. Stability matters: it's the
+join key the MCP uses for `get_page(source, source_key)`.
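
A minimal sketch of how such a slug could be derived, assuming the brand and SKU arrive as display strings. `make_source_key` is a hypothetical helper, not part of the scaffold, and the exact normalization rules (anything outside `a-z0-9` collapses to a hyphen) are an assumption.

```python
import re


def make_source_key(brand: str, sku: str) -> str:
    """Build a `<brand>-<sku>` slug (hypothetical helper, not in the repo)."""
    raw = f"{brand}-{sku}".lower()
    # Assumption: every run of characters outside a-z / 0-9 becomes one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw)
    return slug.strip("-")


print(make_source_key("DEKALB", "DKC62-08RIB"))  # dekalb-dkc62-08rib
```

Because the function is pure and deterministic, re-scraping the same product yields the same key, which is what keeps `get_page(source, source_key)` lookups stable across refreshes.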
+
+## Sources
+
+| Source | Module | Verdict | Notes |
+|---|---|---|---|
+| `bayer_seeds` | `bayer_seeds.py` | 🟢 | DEKALB + Asgrow + WestBred, ~475 varieties |
+| `golden_harvest` | `golden_harvest.py` | 🟢 | ~175 varieties, 9-to-1 disease scale (reverse) |
+| `nk` | `nk.py` | 🟢 | 29 varieties, ratings in CDN PDFs |
+| `agripro` | `agripro.py` | 🟢 | 24 wheat varieties |
+| `becks_pfr` | `becks_pfr.py` | 🟡 | 2,089 research docs via public Sanity GROQ |
+| `becks_products` | `becks_products.py` | 🟡 | 860 products, identity-only (SeedIQ-gated) |
+
+Pioneer is intentionally absent — see `CLAUDE.md` and the curated
+Pioneer fallback in `docs_mcp/lessons.md`.
+
+## Tips
+
+- **Sniff before you scrape.** Most catalogs are SPAs that call a
+  backend API. The recon docs in
+  `~/.claude/projects/-home-justin/memory/reference_seed_vendor_recon.md`
+  already capture the endpoints; if you find new ones, update that file.
+- **Idempotent re-scrapes.** Without `--force`, skip pages already
+  on disk. With `--force`, re-fetch everything — that's the
+  monthly cron mode.
+- **Respect the portals.** Back off on 429s. Set a recognizable
+  user-agent (`seed-mcp-scraper/<version>`).
+- **Normalize at chunk time, not at scrape time.** The chunker
+  (Phase 2) handles the 9-to-1 → 1-9 disease-scale flip for Golden
+  Harvest, NOT this scraper. Sidecar JSON should preserve the
+  vendor's raw values + a `_scale_direction` field; the chunker
+  reads that and normalizes the markdown body.
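
The chunk-time normalization in the last tip could look roughly like the sketch below. It assumes ratings are integers on a 1-9 scale stored under `disease_ratings`, and that `_scale_direction` holds either `"high_is_best"` or `"low_is_best"`. Those two values and the `normalize_ratings` name are placeholders; the scaffold only names the field.

```python
def normalize_ratings(sidecar: dict) -> dict[str, int]:
    """Return disease ratings on a 1-9 scale where 9 = best.

    Assumption: `_scale_direction` is "high_is_best" or "low_is_best".
    The raw sidecar values are left untouched.
    """
    raw = sidecar.get("disease_ratings", {})
    direction = sidecar.get("_scale_direction", "high_is_best")
    if direction == "high_is_best":
        return dict(raw)
    if direction == "low_is_best":
        # On a 1-9 scale the mirror image of v is 10 - v.
        return {name: 10 - value for name, value in raw.items()}
    raise ValueError(f"unknown _scale_direction: {direction!r}")


sidecar = {
    "disease_ratings": {"gray_leaf_spot": 2, "goss_wilt": 7},  # made-up values
    "_scale_direction": "low_is_best",
}
print(normalize_ratings(sidecar))  # {'gray_leaf_spot': 8, 'goss_wilt': 3}
```

Keeping the flip in the chunker means a re-chunk can change the normalization policy without re-scraping the vendor.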
+
+## changelog.py
+
+Reusable as-is from the template. Walks `git diff --name-status`
+output for the commit summary, and `git log` for the digest history
+(Phase 13).
@@ -0,0 +1,272 @@
+"""Generate a summary of corpus changes.
+
+Two output shapes for two consumers:
+
+  1. Human-readable text (default) — written into the weekly-refresh
+     commit message so the commit log is greppable for *"what changed
+     this week"* instead of *"806 files changed"*.
+
+  2. Structured JSON (``--json``) and rolling JSONL history
+     (``--history-out``) — consumed by the ``weekly_digest`` MCP tool.
+     Computed in CI and committed at ``corpus/.digest/history.jsonl``;
+     the tool reads it at runtime because the prod container is a
+     static filesystem COPY with no git available.
+
+Usage:
+
+    # Commit-message helper (existing behavior — unchanged)
+    python -m scrape.changelog [--cached] [--ref REF]
+
+    # One-shot JSON for the current diff range
+    python -m scrape.changelog --cached --json
+
+    # Build / refresh the digest history file (CI use)
+    python -m scrape.changelog --history-out corpus/.digest/history.jsonl \\
+        --history-days 120
+
+The history walker only includes commits that change ``.md`` or ``.json``
+files under ``corpus/<bundle>/``; it skips pure code/CI commits, including
+ones that touch only ``bundles.json``. Each emitted record
+carries the commit's short sha, ISO timestamp, subject, and the same
+structured summary the ``--json`` path produces, so the consumer can
+treat history records and one-shot summaries interchangeably.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import subprocess
+import sys
+from collections import defaultdict
+from typing import Any
+
+
+def git(*args: str) -> str:
+    return subprocess.check_output(["git", *args], text=True)
+
+
+def summarize_diff(diff_output: str) -> dict[str, Any]:
+    """Parse ``git diff --name-status`` output into a structured summary.
+
+    Pure function (no IO, no git calls) so the same logic is exercised
+    by the human-readable, JSON-one-shot, and history-walking paths.
+
+    Returns a dict with:
+
+        md_count           int       — total .md files changed
+        json_count         int       — total .json sidecars changed
+        content_bundles    dict      — {bundle_id: [page_id_without_.md, ...]}
+                                       Only bundles where at least one .md
+                                       file moved. Lists are in the order
+                                       git emitted them.
+        json_only_bundles  list[str] — bundles whose ONLY change was sidecar
+                                       drift (no .md changes). Sorted.
+        new_bundles        list[str] — bundles with at least one .md Added
+                                       in this diff (heuristic: an existing
+                                       bundle that gains a page is flagged
+                                       too). Sorted.
+        other_files        list[str] — any non-corpus path mentioned in the
+                                       diff, as ``"STATUS path"`` strings.
+    """
+    md_changes: dict[str, list[str]] = defaultdict(list)
+    json_only_bundles: set[str] = set()
+    new_bundles: set[str] = set()
+    md_count = json_count = 0
+    other_files: list[str] = []
+
+    for line in diff_output.splitlines():
+        if not line.strip():
+            continue
+        # status<TAB>path (or status<TAB>old<TAB>new for renames; we take
+        # the post-rename path as the canonical location).
+        parts = line.split("\t")
+        status, path = parts[0], parts[-1]
+        if not path.startswith("corpus/"):
+            other_files.append(f"{status} {path}")
+            continue
+        segs = path.split("/", 2)
+        if len(segs) < 3:
+            # corpus/<filename> with no bundle dir — skip.
+            continue
+        _, bundle, page = segs
+        if page.endswith(".md"):
+            md_changes[bundle].append(page[:-3])
+            md_count += 1
+            if status == "A":
+                new_bundles.add(bundle)
+        elif page.endswith(".json"):
+            json_count += 1
+            json_only_bundles.add(bundle)
+
+    # A bundle counts as "content-changing" if it had any .md edit. Sidecar-
+    # only drift goes in the separate bucket so the commit message doesn't
+    # report timestamp churn as if it were real edits.
+    content_bundles_set = set(md_changes)
+    drift_only = sorted(json_only_bundles - content_bundles_set)
+
+    return {
+        "md_count":          md_count,
+        "json_count":        json_count,
+        "content_bundles":   dict(md_changes),   # cast back to plain dict for JSON
+        "json_only_bundles": drift_only,
+        "new_bundles":       sorted(new_bundles),
+        "other_files":       other_files,
+    }
+
+
+def render_human(summary: dict[str, Any]) -> str:
+    """Format a summary dict as the multi-line commit-message text.
+
+    Matches the historical output exactly so existing commit-message
+    tooling and downstream readers don't have to change.
+    """
+    lines: list[str] = []
+    content_bundles = sorted(summary["content_bundles"])
+    md_count = summary["md_count"]
+    json_count = summary["json_count"]
+    new_bundles = set(summary["new_bundles"])
+    drift_only = summary["json_only_bundles"]
+    other_files = summary["other_files"]
+
+    lines.append(f"{md_count} content change(s) across {len(content_bundles)} bundle(s)")
+    lines.append(f"{json_count} sidecar metadata update(s)")
+    if new_bundles:
+        lines.append(f"{len(new_bundles)} new bundle(s) added")
+    if other_files:
+        lines.append(f"{len(other_files)} other file change(s)")
+
+    if content_bundles:
+        lines.append("")
+        lines.append("Bundles with content changes:")
+        for b in content_bundles:
+            pages = summary["content_bundles"][b]
+            tag = " (NEW)" if b in new_bundles else ""
+            lines.append(f"  {b}{tag}: {len(pages)} page(s)")
+            for p in pages[:5]:
+                lines.append(f"    - {p}")
+            if len(pages) > 5:
+                lines.append(f"    ... and {len(pages) - 5} more")
+    if drift_only:
+        lines.append("")
+        head = ", ".join(drift_only[:10])
+        suffix = " …" if len(drift_only) > 10 else ""
+        lines.append(f"Bundles with sidecar-only drift ({len(drift_only)}): {head}{suffix}")
+    return "\n".join(lines)
+
+
+def walk_history(history_days: int) -> list[dict[str, Any]]:
+    """Walk recent corpus-touching commits, emit one summary per commit.
+
+    Uses ``git log --first-parent main`` to keep the rolling
+    weekly-refresh line clean of branch-merge noise. Only commits whose
+    diff changes ``.md`` or ``.json`` files under ``corpus/<bundle>/`` are
+    emitted; pure code commits are skipped (they have nothing to digest).
+
+    Each record:
+
+        {
+          "sha":       "<short sha>",
+          "timestamp": "<strict ISO 8601 committer date, with its UTC offset>",
+          "subject":   "<commit subject line>",
+          ... + every field from summarize_diff()
+        }
+    """
+    # Find candidate commits. --first-parent keeps the linear refresh history
+    # on main and ignores branch-side merges. We still need to filter by what
+    # the commit actually touched, because non-corpus commits can land on
+    # main (PR merges for code, CI tweaks, etc.).
+    raw = git(
+        "log",
+        f"--since={history_days} days ago",
+        "--first-parent",
+        "main",
+        "--pretty=format:%H%x09%cI%x09%s",
+    )
+
+    records: list[dict[str, Any]] = []
+    for line in raw.splitlines():
+        if not line.strip():
+            continue
+        parts = line.split("\t", 2)
+        if len(parts) < 3:
+            continue
+        sha, ts, subject = parts
+
+        # What did this commit actually touch? Cheap: just the name-status diff
+        # against its first parent. Empty stdout = commit didn't change any
+        # files we care about. Root commits (no parent) error out — suppress
+        # the stderr noise and skip them.
+        try:
+            diff = subprocess.check_output(
+                ["git", "diff", "--name-status", f"{sha}^..{sha}"],
+                text=True,
+                stderr=subprocess.DEVNULL,
+            )
+        except subprocess.CalledProcessError:
+            continue
+        if not diff.strip():
+            continue
+
+        summary = summarize_diff(diff)
+        # Skip pure code commits — only emit records that have actual corpus
+        # content motion. This is what makes the history "interesting" for
+        # the weekly digest.
+        if summary["md_count"] == 0 and summary["json_count"] == 0 and not summary["new_bundles"]:
+            continue
+
+        records.append({
+            "sha":       sha[:12],
+            "timestamp": ts,
+            "subject":   subject,
+            **summary,
+        })
+
+    return records
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument("--cached", action="store_true",
+                   help="Summarize staged changes instead of a ref range.")
+    p.add_argument("--ref", default="HEAD^..HEAD",
+                   help="Diff range to summarize (default: HEAD^..HEAD).")
+    p.add_argument("--json", dest="as_json", action="store_true",
+                   help="Emit one JSON object instead of the human-readable form.")
+    p.add_argument("--history-out", metavar="PATH",
+                   help="Walk recent corpus-touching commits and write a "
+                        "JSONL history file at PATH. Overwrites if it exists. "
+                        "Implies the history walker; --cached/--ref are ignored.")
+    p.add_argument("--history-days", type=int, default=120,
+                   help="How far back the history walker looks (default 120).")
+    args = p.parse_args()
+
+    # History-walker path: build the JSONL file consumed by the
+    # weekly_digest MCP tool, then exit. CI uses this.
+    if args.history_out:
+        records = walk_history(args.history_days)
+        # Sort by timestamp ascending so the file is roughly stable
+        # across rebuilds (commits within a single run could otherwise
+        # depend on git log default ordering).
+        records.sort(key=lambda r: r["timestamp"])
+        with open(args.history_out, "w") as fh:
+            for rec in records:
+                fh.write(json.dumps(rec, separators=(",", ":")) + "\n")
+        # Brief stdout signal for CI logs — easy to spot in the workflow run.
+        print(f"wrote {len(records)} commit record(s) to {args.history_out} "
+              f"covering up to {args.history_days} days")
+        return 0
+
+    # One-shot summary path. Unchanged behavior for --cached / --ref.
+    if args.cached:
+        diff_args = ["diff", "--name-status", "--cached"]
+    else:
+        diff_args = ["diff", "--name-status", args.ref]
+    diff = git(*diff_args)
+    summary = summarize_diff(diff)
+
+    if args.as_json:
+        print(json.dumps(summary, separators=(",", ":")))
+    else:
+        print(render_human(summary))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -0,0 +1,93 @@
+"""Thin dispatcher that routes ``--source <id>`` to the right per-source
+scraper module.
+
+Convention: one source per module under ``scrape.sources.<id>``. Each
+module is independently runnable via ``python -m scrape.sources.<id>``
+and accepts its own flags — this runner is a convenience shim for CI.
+
+Examples:
+
+    python -m scrape.runner --source bayer_seeds --force
+    python -m scrape.runner --source golden_harvest --limit 20
+    python -m scrape.runner --all          # run every GREEN source in sources.json
+
+Anything after the recognized flags is passed through to the source
+scraper, so:
+
+    python -m scrape.runner --source bayer_seeds --force --brand dekalb
+
+dispatches to ``scrape.sources.bayer_seeds`` with ``--force --brand
+dekalb`` as argv.
+
+Sources whose ``verdict`` in sources.json is anything other than
+``"green"`` are skipped by ``--all`` (Beck's products is yellow until
+the SeedIQ XHR is captured). Pass ``--source becks_products`` to run
+a yellow source explicitly.
+"""
+
+from __future__ import annotations
+
+import argparse
+import importlib
+import json
+import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SOURCES_JSON = REPO_ROOT / "sources.json"
+
+
+def _load_sources() -> list[dict]:
+    if not SOURCES_JSON.exists():
+        return []
+    try:
+        data = json.loads(SOURCES_JSON.read_text())
+        return data.get("sources", []) if isinstance(data, dict) else data
+    except json.JSONDecodeError:
+        return []
+
+
+def _run_source(source_id: str, passthrough: list[str]) -> int:
+    mod_name = f"scrape.sources.{source_id}"
+    try:
+        mod = importlib.import_module(mod_name)
+    except ImportError as exc:
+        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
+        return 2
+    main = getattr(mod, "main", None)
+    if not callable(main):
+        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
+        return 2
+    return int(main(passthrough) or 0)
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(prog="scrape.runner")
+    parser.add_argument("--source", help="Source id (matches sources.json)")
+    parser.add_argument("--all", action="store_true",
+                        help="Run every GREEN source listed in sources.json")
+    args, passthrough = parser.parse_known_args(argv)
+
+    if not args.source and not args.all:
+        parser.error("specify --source <id> or --all")
+
+    sources = _load_sources()
+    if args.all:
+        ids = [s["name"] for s in sources if s.get("verdict") == "green"]
+        if not ids:
+            print("runner: no GREEN sources in sources.json", file=sys.stderr)
+            return 2
+    else:
+        # If the source isn't registered in sources.json yet, dispatch anyway
+        # so the scraper can be exercised during initial development.
+        ids = [args.source]
+
+    rc = 0
+    for sid in ids:
+        print(f"=== scrape.runner: dispatching to {sid} ===")
+        rc |= _run_source(sid, passthrough)
+    return rc
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -0,0 +1,34 @@
+"""AgriPro scraper (Syngenta wheat brand).
+
+Source: ``https://www.agriprowheat.com`` — Drupal Views form,
+server-rendered HTML. No headless browser needed.
+
+Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
+plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
+under a separate brand and is out of scope for AgriPro.
+
+Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
+implicit in product family naming, not always a separate field).
+
+Schema notes:
+- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
+- ``relative_maturity`` and ``maturity_group`` are null for wheat
+- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
+  Septoria / tan spot
+- Quality: test weight, protein, falling number, straw strength
+
+TODO: implement.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("agripro: not implemented yet — Drupal Views form, only wheat in the corpus, no SRW (separate brand)",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,56 @@
+"""Bayer seeds scraper — DEKALB (corn) + Asgrow (soy) + WestBred (wheat).
+
+Source: ``cropscience.bayer.us`` — same Next.js + ``__NEXT_DATA__``
+infrastructure used by crop-chem-docs' Bayer crop-protection scraper.
+That scraper is the reference; this one lifts ~80% of its plumbing
+and adapts the per-product field mapping for seed schema.
+
+Catalog index pages:
+  /corn/dekalb/seed-catalog
+  /soybeans/asgrow/seed-catalog
+  /wheat/westbred/seed-catalog
+
+Each catalog page is a Next.js route; the per-variety data lives in
+``__NEXT_DATA__.props.pageProps.{whatever}``. The buildId in the
+script tag rotates — fetch the index page first, extract the
+buildId, then fetch the per-variety JSON.
+
+Output layout:
+  corpus/bayer_seeds/<source_key>.md      LLM-visible body
+  corpus/bayer_seeds/<source_key>.json    Sidecar metadata
+
+source_key convention: ``<brand>-<product-slug>`` lowercased, e.g.
+``dekalb-dkc62-08rib`` or ``asgrow-ag34xf2``.
+
+Sidecar schema (per CLAUDE.md):
+  source: "bayer_seeds"
+  source_key: str
+  vendor: "Bayer"
+  brand: "DEKALB" | "Asgrow" | "WestBred"
+  product_name: str
+  crop: "corn" | "soybeans" | "wheat"
+  relative_maturity: int | null         # corn only
+  maturity_group: float | null          # soy only
+  wheat_class: str | null               # wheat only
+  trait_stack: list[str]
+  agronomic_ratings: dict[str, int]     # normalized 1-9 (9 = best)
+  disease_ratings: dict[str, int]       # normalized 1-9 (9 = best)
+  regional_recommendation: list[str]
+  source_urls: list[str]
+  fetched_at: str (ISO 8601 UTC)
+
+TODO: implement. Reference: ~/github/crop-chem-docs/scrape/sources/bayer.py
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("bayer_seeds: not implemented yet — see ~/github/crop-chem-docs/scrape/sources/bayer.py for the reference Next.js extraction pattern",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
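
For the `__NEXT_DATA__` step described in the `bayer_seeds` docstring, a minimal sketch of the extraction follows. It assumes the payload sits in a `<script id="__NEXT_DATA__">` tag, which is the usual Next.js layout. `extract_next_data` is a hypothetical name, and the sample HTML and its `products` key are made up; the real `pageProps` keys still have to be discovered.

```python
import json
import re

# Assumption: the tag carries id="__NEXT_DATA__" and holds a single JSON object.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> dict:
    """Return the parsed JSON payload from the __NEXT_DATA__ script tag."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


# Made-up page for illustration only.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"buildId": "abc123", "props": {"pageProps": {"products": []}}}'
    "</script></body></html>"
)
data = extract_next_data(sample_html)
print(data["buildId"])  # abc123
```

Reading `buildId` from the index page on every run, rather than caching it, sidesteps the rotation the docstring warns about.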
@@ -0,0 +1,45 @@
+"""Beck's PFR (Practical Farm Research) scraper.
+
+Source: Public Sanity GROQ API at ``https://mc8v24rf.api.sanity.io``.
+No authentication required — Beck's exposes their CMS content store
+publicly. ~2,089 documents going back to 2015.
+
+Sanity query endpoint:
+  ``/v1/data/query/production?query=<groq>``
+
+Useful GROQ for PFR docs (the projectId / dataset are public):
+
+  *[_type == "pfrStudy"] {
+    _id, title, year, crop, slug, summary, body, attachments
+  }
+
+Records are research studies, not variety identity — head-to-head
+yield trials, fungicide timing, planting-date studies, hybrid-by-
+population, biological seed treatments, etc.
+
+Treat differently from variety scrapers:
+- One record per study, not per variety
+- chunk_0 preamble includes the study's tl;dr finding (extract from
+  the ``summary`` field if present, or first paragraph of ``body``)
+- Crop tag (corn/soy/wheat) for filtering
+- Year tag — older PFR studies are still relevant but search should
+  let the user weight recency
+
+Polite rate limit: Sanity is generous but no auth means we should
+keep concurrency ≤4 and pause ~250ms between batches.
+
+TODO: implement.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("becks_pfr: not implemented yet — public Sanity GROQ at mc8v24rf.api.sanity.io, ~2089 research docs",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
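
The query endpoint named in the `becks_pfr` docstring takes the GROQ text as a URL-encoded `query` parameter. The sketch below only builds the URL and makes no network call. `build_query_url` is a hypothetical helper, and the `[0...50]` page size is an assumption rather than a known limit.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SANITY_BASE = "https://mc8v24rf.api.sanity.io"
QUERY_PATH = "/v1/data/query/production"


def build_query_url(groq: str) -> str:
    """URL-encode a GROQ query into the Sanity query endpoint."""
    return f"{SANITY_BASE}{QUERY_PATH}?{urlencode({'query': groq})}"


# GROQ slice syntax ([0...50]) pages through results; 50 is an assumed page size.
groq = '*[_type == "pfrStudy"][0...50]{_id, title, year, crop}'
url = build_query_url(groq)
print(url)

# Round-trip check: decoding the query string gives back the original GROQ.
print(parse_qs(urlsplit(url).query)["query"][0] == groq)  # True
```

Paging in fixed slices pairs naturally with the concurrency and pause guidance in the docstring, since each slice is one request.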
@@ -0,0 +1,46 @@
+"""Beck's product catalog scraper (identity-only until SeedIQ XHR sniff lands).
+
+Source: Same public Sanity GROQ API as ``becks_pfr`` (no auth).
+Expected count: ~860 products (corn + soy + wheat).
+
+Current limitation: Beck's exposes IDENTITY fields publicly (product
+name, RM/MG, basic trait stack) but routes the AGRONOMIC + DISEASE
+ratings through their SeedIQ application, which is gated behind a
+browser session cookie. The public Sanity records do not include
+ratings.
+
+What we CAN ship without SeedIQ:
+- Product identity for confirmation ("yes Beck's has hybrid X at RM 112")
+- RM (corn) / MG (soy) / class (wheat)
+- Trait stack
+- Basic descriptive text
+
+What needs the SeedIQ XHR endpoint (BLOCKED on user sniff):
+- Disease ratings (GLS, NCLB, Goss's, etc.)
+- Agronomic ratings (standability, drought, etc.)
+- Regional recommendations
+
+For now this scraper is DEFERRED. Run when:
+- User captures the SeedIQ XHR URL + cookie/header pattern from
+  browser dev tools, OR
+- We decide to ship Beck's as identity-only and let the LLM say
+  "Beck's has this hybrid; ask your Beck's rep for full agronomic
+  ratings" (less useful but avoids the empty-data UX).
+
+Yellow verdict in sources.json reflects this — ``--all`` skips it.
+
+TODO: implement (deferred).
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("becks_products: deferred — SeedIQ XHR sniff required for ratings, run only if user has captured the endpoint",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,42 @@
+"""Golden Harvest scraper (Syngenta brand).
+
+Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
+every variety page. Server-rendered HTML — no headless browser
+required. Tech-sheet PDFs live on the Syngenta CDN at
+``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
+fetcher pattern as NK.
+
+Two gotchas:
+
+1. **Sitemap PDF dates are stale** (the sitemap was generated
+   2025-03-31 and never updated). Resolve the LIVE PDF URL from the
+   product HTML page, not from the sitemap entry.
+
+2. **Disease scale is reversed.** Golden Harvest publishes ratings
+   on a 9-to-1 scale (1 = best, 9 = worst). Bayer/NK/AgriPro use
+   1-9 (9 = best). Normalize at chunk time so the corpus has a
+   single direction. Record the original direction in the chunk_0
+   preamble: "Note: ratings normalized to 1-9 (9 = best). Golden
+   Harvest publishes on a 9-to-1 scale natively."
+
+Expected count: ~175 varieties (89 corn + 86 soy). No wheat.
+
+Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
+yield trial records. Out of scope for v1 but a high-value future
+ingest for regional placement recommendations.
+
+TODO: implement. Reuse the PDF-fetch helper that NK uses.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("golden_harvest: not implemented yet — see CLAUDE.md for the disease-scale-reversal gotcha and the live-PDF-URL-resolution requirement",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
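
The sitemap-driven discovery in the `golden_harvest` docstring could start from something like the sketch below. It assumes a standard sitemaps.org `urlset` document. `variety_urls`, the `/corn/` path marker, and the sample URLs are all made up for illustration, so the real variety-page URL pattern still needs to be confirmed against the live sitemap.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def variety_urls(sitemap_xml: str, path_marker: str) -> list[str]:
    """Return <loc> URLs from a sitemap whose text contains path_marker."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [u for u in locs if path_marker in u]


# Made-up sitemap; the real URL structure is an assumption.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.goldenharvestseeds.com/corn/example-hybrid-a</loc></url>
  <url><loc>https://www.goldenharvestseeds.com/about</loc></url>
</urlset>"""
print(variety_urls(sample, "/corn/"))
```

Only the page URLs are taken from the sitemap here; per the first gotcha, the tech-sheet PDF link should be read from each product page instead.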
@@ -0,0 +1,35 @@
+"""NK scraper (Syngenta brand).
+
+Source: ``https://www.syngenta-us.com`` — static HTML product pages
+plus tech-sheet PDFs on the Syngenta CDN at
+``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
+
+Expected count: 29 varieties (12 corn + 17 soy). No wheat.
+
+The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
+``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
+helper module under ``scrape.sources._syngenta_pdf`` once both
+scrapers are written.
+
+Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
+have marketing copy only). Use pdfplumber for table extraction.
+
+Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
+etc. — additional supplemental context worth ingesting once the
+per-variety scrape is solid.
+
+TODO: implement.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("nk: not implemented yet — disease/agronomic ratings come from CDN tech-sheet PDFs only, use pdfplumber",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
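
A shared helper for the `<CODE>_YYMMDD.pdf` convention might begin with a filename parser like this sketch. `parse_techsheet_url` and the `EXAMPLE1` product code are hypothetical, and treating the six digits as a two-digit-year date is an assumption drawn from the `YYMMDD` label.

```python
import re
from datetime import date, datetime

# Assumption: the last path segment is <CODE>_<six digits>.pdf.
TECHSHEET_RE = re.compile(r"(?P<code>[^/]+)_(?P<stamp>\d{6})\.pdf$")


def parse_techsheet_url(url: str) -> tuple[str, date]:
    """Split a tech-sheet URL into (product code, sheet date)."""
    match = TECHSHEET_RE.search(url)
    if match is None:
        raise ValueError(f"not a tech-sheet URL: {url}")
    stamp = datetime.strptime(match.group("stamp"), "%y%m%d").date()
    return match.group("code"), stamp


# Hypothetical URL for illustration.
url = "https://assets.syngentaebiz.com/pdf/techsheets/EXAMPLE1_250331.pdf"
print(parse_techsheet_url(url))  # ('EXAMPLE1', datetime.date(2025, 3, 31))
```

Exposing the date separately lets a scraper compare the sheet it just resolved against the one recorded in the sidecar and skip the download when nothing changed.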