agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)

agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the
  three rated sections (Agronomics / Grain / Disease) with their
  <div class="row"><div class="label">label</div><div>value</div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
  — provides the Northern Plains HRS coverage WestBred lacks.
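
The label/value pairing described above can be sketched with only the
standard library. This is a minimal illustration, not the scraper's
code (which uses BeautifulSoup): the sample markup and the names
RowPairParser and parse_rows are assumptions made for the example,
and the Stripe Rust 1 / Eyespot 2 values are the AP Iliad ratings
quoted in this message.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like one rated section of a variety page.
SAMPLE_HTML = """
<section>
  <h3>Disease</h3>
  <div class="row"><div class="label">Stripe Rust</div><div>1</div></div>
  <div class="row"><div class="label">Eyespot</div><div>2</div></div>
</section>
"""


class RowPairParser(HTMLParser):
    """Collect (label, value) pairs from <div class="row"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: list[tuple[str, str]] = []
        self._depth = 0      # <div> nesting depth; 0 means outside any row
        self._slot = None    # "label" or "value" while inside a child <div>
        self._label = ""
        self._value = ""

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "row" in classes:
                self._depth = 1
                self._label = self._value = ""
            return
        self._depth += 1
        if self._depth == 2:
            # Direct child of the row: either the label or a value cell.
            self._slot = "label" if "label" in classes else "value"

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 1:
            self._slot = None
        elif self._depth == 0 and self._label.strip() and self._value.strip():
            self.pairs.append((self._label.strip(), self._value.strip()))

    def handle_data(self, data):
        if self._slot == "label":
            self._label += data
        elif self._slot == "value":
            self._value += data


def parse_rows(html: str) -> list[tuple[str, str]]:
    parser = RowPairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Unlike the scraper, which keeps the first non-empty non-label <div>,
this sketch concatenates the text of every non-label child, so it is
only equivalent when a row has a single value cell.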

nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
  title <span>, "Brands Available" trait variants, marketing
  positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
  Drought) are not text-extractable; we record the labels with an
  explicit "rated in PDF chart, see tech sheet" value so the agent
  can point the farmer to the tech sheet for those numbers.
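
The NK steps above (unwrapping the {"d": ...} envelope, reading
"Label N" lines, and flagging chart-only labels) can be sketched as
follows. This is a simplified illustration under stated assumptions:
the sample response and sample text are invented, the function names
are hypothetical, and the generic regex stands in for the scraper's
approach of anchoring on a known list of disease labels.

```python
import json
import re

# Invented sample data; the real endpoint returns far more markup and
# the real tech sheets carry different numbers.
SAMPLE_RESPONSE = json.dumps({"d": '<div class="sf-result">NK8005</div>'})
SAMPLE_TEXT = """Gray Leaf Spot 3
Emergence
Tar Spot 4
Goss's Wilt -
"""

# Placeholder wording taken from this commit message.
CHART_PLACEHOLDER = "rated in PDF chart, see tech sheet"

# A label made of letters, spaces and apostrophes, then a single
# rating digit 1-9 or "-" (not available).
RATING_LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z' ]*?)\s+(?P<value>[1-9]|-)$")


def unwrap_d(body: str) -> str:
    """Return the inner HTML string from the {"d": "..."} envelope."""
    return json.loads(body).get("d") or ""


def parse_rating_lines(text: str) -> list[tuple[str, str]]:
    """Return (label, value) for every line that ends in a rating."""
    out: list[tuple[str, str]] = []
    for line in text.splitlines():
        m = RATING_LINE.match(line.strip())
        if m:
            out.append((m.group("label"), m.group("value")))
    return out


def chart_only_labels(
    text: str,
    labels: tuple[str, ...] = ("Emergence", "Stalk Strength", "Drought"),
) -> list[tuple[str, str]]:
    """Labels that appear alone on a line get the placeholder value."""
    lines = {line.strip() for line in text.splitlines()}
    return [(lab, CHART_PLACEHOLDER) for lab in labels if lab in lines]
```

Values stay strings throughout, so a "-" rating survives unchanged.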

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
  AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
  with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
  Beck's products — everything else IS now in the corpus.
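
To show why the scale direction matters for cross-vendor comparison,
here is a small sketch. It is hypothetical: the corpus stores ratings
verbatim and nothing in this commit rescales them, so
to_higher_is_better is an illustrative helper only, assuming both
conventions use the integer range 1-9.

```python
def to_higher_is_better(value: int, one_is_best: bool) -> int:
    """Map a 1-9 rating onto a common "9 = best" axis.

    one_is_best=True  -> NK / AgriPro convention (1 = best)
    one_is_best=False -> Bayer / Golden Harvest convention (9 = best)
    """
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range: {value}")
    # Mirroring around the midpoint: 1 <-> 9, 2 <-> 8, ..., 5 <-> 5.
    return 10 - value if one_is_best else value
```

Under that assumption, an AgriPro Stripe Rust rating of 1 and a
"9 = best" vendor's rating of 9 land on the same point of the axis.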

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475 (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122  (41 corn / 81 soy)
  agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat-brand entries now, up from 85; AgriPro's 24 include 1 barley).

requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:16:36 -04:00
parent 2588ebafa1
commit 9ce920f622
296 changed files with 23233 additions and 60 deletions
+488 -22
@@ -1,37 +1,503 @@
"""AgriPro scraper (Syngenta wheat brand).
"""AgriPro (Syngenta) wheat scraper.
Source: ``https://www.agriprowheat.com`` — Drupal Views form,
server-rendered HTML. No headless browser needed.
Source: ``agriprowheat.com`` — Drupal site, server-rendered HTML.
robots.txt is empty (no Disallow).
Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
under a separate brand and is out of scope for AgriPro.
Expected count: 24 varieties spanning Hard Red Winter (HRW), Hard
Red Spring (HRS), Hard White Spring (HWS), Soft White Winter (SWW),
Soft White Spring (SWS), durum, and barley. NO SRW: Syngenta's Soft
Red Winter sits at GrowProGenetics.com under a separate brand, out
of scope for AgriPro.
Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
implicit in product family naming, not always a separate field).
Discovery: the variety listing at
``/search-agripro-brand-varieties`` server-renders only the
filter form; the actual variety rows are populated by a Drupal
Views AJAX call. We sidestep the AJAX by passing the filter values
as GET params on the same path:
Schema notes:
- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
- ``relative_maturity`` and ``maturity_group`` are null for wheat
- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
Septoria / tan spot
- Quality: test weight, protein, falling number, straw strength
/search-agripro-brand-varieties?title=&variety_type_value=All
TODO: implement.
That returns the fully-rendered list (24 rows in
``.block-views-blockvarieties-search-varieties-search-block``) with
links to ``/variety/<slug>`` pages.
Per-variety detail comes from the variety page HTML. Useful fields:
- ``<h1>`` — product name (e.g. "AP Exceed")
- ``.field--node--variety-type--variety`` — wheat class
("Soft White Winter", "Hard Red Spring", etc.)
- ``.field--node--tag-line--variety`` — short positioning slogan
- ``.field--node--body`` — full positioning narrative
- Three sections delimited by ``<h3>``: Agronomics / Grain /
Disease, each containing ``.row`` divs with
``<div class="label">…</div><div>…</div>`` pairs.
**Rating-scale direction**: AgriPro publishes disease tolerance on a
1-9 scale where **1 = best (most resistant)** — REVERSED from
Bayer's and Golden Harvest's "9 = best" convention. The chunker
preserves values verbatim and the sidecar's ``_scale_direction``
field declares the direction, so the LLM's chunk-preamble framing
will correctly say "(1 = best)" — anti-hallucination guarantee
holds even across vendors with opposite scales.
(Agronomic ratings on AgriPro are qualitative — "Excellent / Very
Good / Good / Fair / Poor" — and don't have a numeric direction
issue. They're preserved verbatim.)
Output:
corpus/agripro/<source_key>.md
corpus/agripro/<source_key>.json
source_key convention: ``agripro-<slug>`` lowercased, e.g.
``agripro-ap-exceed`` or ``agripro-sy-assure``.
CLI:
python -m scrape.sources.agripro --limit 5
python -m scrape.sources.agripro --force
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://agriprowheat.com"
LIST_URL = f"{BASE}/search-agripro-brand-varieties?title=&variety_type_value=All"
# AgriPro disease ratings: 1-9, LOWER number = MORE resistant. This
# is the inverse of Bayer/Golden-Harvest's 1-9 (9 = best) convention.
# Document this in the sidecar so the chunker / LLM never mis-renders.
RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"
# Class abbreviations for the wheat_class field. AgriPro renders the
# full English name; we map it to the canonical short form the rest
# of the corpus uses (matches schema notes in seed-mcp/CLAUDE.md).
WHEAT_CLASS_MAP = {
"hard red winter": "HRW",
"hard red spring": "HRS",
"hard white spring": "HWS",
"hard white winter": "HWW",
"soft white winter": "SWW",
"soft white spring": "SWS",
"soft red winter": "SRW",
"durum": "Durum",
}
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "agripro"
REQ_INTERVAL_SEC = 1.0
log = logging.getLogger("scrape.agripro")
# --------------------------------------------------------------------- HTTP
class RateLimitedSession:
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
self.s = requests.Session()
self.s.headers["User-Agent"] = USER_AGENT
self.interval = interval
self._last = 0.0
def _wait(self) -> None:
delta = time.monotonic() - self._last
if delta < self.interval:
time.sleep(self.interval - delta)
self._last = time.monotonic()
def request(
self,
method: str,
url: str,
*,
max_retries: int = 4,
timeout: float = 30.0,
**kw: Any,
) -> requests.Response:
last_exc: Exception | None = None
for attempt in range(max_retries):
self._wait()
try:
resp = self.s.request(method, url, timeout=timeout, **kw)
except requests.RequestException as exc:
last_exc = exc
backoff = min(30.0, (2 ** attempt) + random.random())
log.warning("network error on %s %s: %s — retry in %.1fs",
method, url, exc, backoff)
time.sleep(backoff)
continue
if resp.status_code == 429 or 500 <= resp.status_code < 600:
ra = resp.headers.get("Retry-After")
backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
log.warning("HTTP %d on %s %s — retry in %.1fs",
resp.status_code, method, url, backoff)
time.sleep(backoff)
continue
return resp
if last_exc:
raise last_exc
return resp # type: ignore[return-value]
def get(self, url: str, **kw: Any) -> requests.Response:
return self.request("GET", url, **kw)
# --------------------------------------------------------------------- model
@dataclass
class APProduct:
source_key: str
source_url: str
product_name: str = ""
wheat_class: str | None = None
positioning_statement: str | None = None
tagline: str | None = None
characteristics_groups: list[dict] = field(default_factory=list)
# --------------------------------------------------------------------- discovery
def discover_varieties(http: RateLimitedSession) -> list[str]:
    """Fetch the variety-search page and return the list of
    ``/variety/<slug>`` URLs found in it.

    Each row links to its variety page twice (the hero image and the
    "view full details" link), so duplicate hrefs are dropped while
    preserving first-seen order.
    """
    log.info("fetching variety list %s", LIST_URL)
    r = http.get(LIST_URL)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    urls: list[str] = []
    seen: set[str] = set()
    for a in soup.find_all("a", href=re.compile(r"^/variety/")):
        h = a["href"]
        if h in seen:
            continue
        seen.add(h)
        urls.append(BASE + h)
    log.info("variety URLs found: %d", len(urls))
    return urls
# --------------------------------------------------------------------- helpers
def source_key_for(url: str) -> str:
"""``/variety/ap-exceed`` → ``agripro-ap-exceed``."""
tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
return f"agripro-{tail}"
def normalize_wheat_class(raw: str | None) -> str | None:
if not raw:
return None
key = raw.strip().lower()
return WHEAT_CLASS_MAP.get(key, raw.strip())
def _rows_in_section(soup: BeautifulSoup, h3_text: str) -> list[dict]:
    """Find the first ``<h3>`` whose text matches ``h3_text``
    (case-insensitive), then collect every ``.row`` inside that
    heading's parent element. Returns ``[{characteristic, value}, ...]``."""
    items: list[dict] = []
    for h3 in soup.find_all("h3"):
        if h3.get_text(strip=True).lower() != h3_text.lower():
            continue
        # Scope the search to the heading's immediate parent, which is
        # assumed to wrap the .row elements belonging to this section.
        parent = h3.parent
        if parent is None:
            continue
        for row in parent.find_all(class_="row"):
            label_el = row.find(class_="label")
            if not label_el:
                continue
            label = label_el.get_text(" ", strip=True)
            # The value is the first non-label <div> in the row that
            # has non-empty text.
            value: str | None = None
            for child in row.find_all("div"):
                if "label" in (child.get("class") or []):
                    continue
                t = child.get_text(" ", strip=True)
                if t:
                    value = t
                    break
            if label and value:
                items.append({"characteristic": label, "value": value})
        # Only the first matching heading is used.
        break
    return items
# --------------------------------------------------------------------- detail
def fetch_product_detail(
http: RateLimitedSession, url: str
) -> APProduct | None:
r = http.get(url)
if r.status_code == 404:
return None
r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")
prod = APProduct(
source_key=source_key_for(url),
source_url=url,
)
h1 = soup.find("h1")
if h1:
prod.product_name = h1.get_text(strip=True)
vt = soup.find(class_="field--node--variety-type--variety")
if vt:
prod.wheat_class = normalize_wheat_class(vt.get_text(strip=True))
tl = soup.find(class_="field--node--tag-line--variety")
if tl:
prod.tagline = tl.get_text(strip=True) or None
# Body text — the long-form positioning narrative.
body = soup.find(class_=re.compile(r"field--node--body"))
if body:
prod.positioning_statement = body.get_text(" ", strip=True) or None
# Tagline alone if no body — better than nothing.
if not prod.positioning_statement and prod.tagline:
prod.positioning_statement = prod.tagline
# The three rated sections on every variety page.
groups: list[dict] = []
for label, h3 in (
("AGRONOMICS", "Agronomics"),
("GRAIN", "Grain"),
("DISEASE RATINGS", "Disease"),
):
items = _rows_in_section(soup, h3)
if items:
groups.append({"label": label, "type": "fields", "items": items})
prod.characteristics_groups = groups
return prod
# --------------------------------------------------------------------- render
def render_markdown(p: APProduct) -> str:
title = p.product_name or p.source_key
head: list[str] = [
f"# {title}",
"",
"- **Vendor:** Syngenta",
"- **Brand:** AgriPro",
"- **Crop:** Wheat",
]
if p.wheat_class:
head.append(f"- **Wheat class:** {p.wheat_class}")
if p.tagline:
head.append(f"- **Tagline:** {p.tagline}")
head.append(f"- **Source:** {p.source_url}")
head.append(f"- **Rating scale (AgriPro):** {RATING_SCALE_DIRECTION}")
head.append("")
head.append("---")
head.append("")
sections: list[str] = []
if p.positioning_statement and p.positioning_statement != p.tagline:
sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
for g in p.characteristics_groups:
label = (g.get("label") or "Characteristics").title()
items = g.get("items") or []
if not items:
continue
rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
sections.append(
f"## {label}\n\n"
"| Characteristic | Value |\n"
"|---|---|\n"
f"{rows}\n"
)
return "\n".join(head) + "\n".join(sections)
# --------------------------------------------------------------------- write
def write_product(prod: APProduct, body_md: str) -> None:
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
md_path = CORPUS_DIR / f"{prod.source_key}.md"
json_path = CORPUS_DIR / f"{prod.source_key}.json"
md_path.write_text(body_md, encoding="utf-8")
sidecar = {
"source": "agripro",
"source_key": prod.source_key,
"vendor": "Syngenta",
"brand": "AgriPro",
"product_name": prod.product_name,
"product_id": None,
"hybrid_prefix": prod.product_name,
"hybrid_suffix": None,
"crop": "wheat",
"release_year": None,
"relative_maturity": None,
"maturity_group": None,
"wheat_class": prod.wheat_class,
"trait_stack": [],
"trait_descriptions": [],
"positioning_statement": prod.positioning_statement,
"tagline": prod.tagline,
"strengths": [],
"characteristics_groups": prod.characteristics_groups,
# AgriPro's reversed direction is the load-bearing field here:
# any cross-vendor disease-resistance comparison MUST consult
# this before interpreting values. The chunker reads it; the
# api_lessons file's rating-scales section documents the
# convention.
"_scale_direction": RATING_SCALE_DIRECTION,
"regional_recommendations": [],
"image_url": None,
"source_urls": [prod.source_url],
"sitemap_last_modified": None,
"fetched_at": datetime.now(timezone.utc).isoformat(),
"scraper_version": SCRAPER_VERSION,
}
json_path.write_text(
json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
encoding="utf-8",
)
# --------------------------------------------------------------------- pipeline
def process_product(
http: RateLimitedSession,
*,
url: str,
force: bool,
) -> tuple[str, APProduct | None]:
source_key = source_key_for(url)
md_path = CORPUS_DIR / f"{source_key}.md"
if md_path.exists() and not force:
return "skipped", None
try:
prod = fetch_product_detail(http, url)
except Exception as exc: # noqa: BLE001
log.error("detail fetch failed for %s: %s", url, exc)
return "failed", None
if prod is None:
return "missing", None
body = render_markdown(prod)
write_product(prod, body)
return "written", prod
def run(
*,
limit: int | None,
force: bool,
only_product: str | None,
) -> int:
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
http = RateLimitedSession()
targets = discover_varieties(http)
if only_product:
targets = [
u for u in targets
if source_key_for(u) == only_product
or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product.lower()
]
if not targets:
log.error("no variety matched --product=%s", only_product)
return 2
counts = {"written": 0, "skipped": 0, "missing": 0, "failed": 0}
processed = 0
for url in targets:
if limit is not None and processed >= limit:
break
processed += 1
status, prod = process_product(http, url=url, force=force)
counts[status] = counts.get(status, 0) + 1
if prod is not None:
log.info(
"[%d/%s] %s %s | class=%s groups=%d",
processed, str(limit) if limit else "all",
prod.source_key, status,
prod.wheat_class or "-",
len(prod.characteristics_groups),
)
else:
log.info("[%d/%s] %s %s",
processed, str(limit) if limit else "all",
source_key_for(url), status)
log.info(
"done: processed=%d written=%d skipped=%d missing=%d failed=%d (of %d candidates)",
processed, counts["written"], counts["skipped"],
counts["missing"], counts["failed"], len(targets),
)
return 0 if counts["failed"] == 0 else 1
# --------------------------------------------------------------------- CLI
def _build_argparser() -> argparse.ArgumentParser:
p = argparse.ArgumentParser(
prog="scrape.sources.agripro",
description="Scrape AgriPro (Syngenta) wheat varieties.",
)
p.add_argument("--limit", type=int, default=None,
help="Stop after processing N varieties (default: all).")
p.add_argument("--force", action="store_true",
help="Re-fetch even if the markdown file already exists.")
p.add_argument("--product", default=None,
help="Process a single variety by source_key or URL tail.")
p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
return p
def main(argv: list[str] | None = None) -> int:
print("agripro: deferred — Drupal Views form, only wheat in the corpus, no SRW (separate brand). See reference_seed_vendor_recon.md.",
file=sys.stderr)
# Return 0 so the monthly CI workflow doesn't fail when this
# source is listed but not yet implemented. Real implementation
# will return 0 on success / 1 on failure.
return 0
args = _build_argparser().parse_args(argv)
logging.basicConfig(
level=args.log_level.upper(),
format="%(asctime)s %(levelname)s %(name)s %(message)s",
stream=sys.stderr,
)
return run(
limit=args.limit,
force=args.force,
only_product=args.product,
)
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
sys.exit(main())
+812 -22
@@ -1,38 +1,828 @@
"""NK scraper (Syngenta brand).
"""NK (Syngenta) seed scraper — corn + soybeans.
Source: ``https://www.syngenta-us.com`` — static HTML product pages
plus tech-sheet PDFs on the Syngenta CDN at
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
Source: ``syngenta-us.com`` — ASP.NET WebForms catalog with an
ASMX-style JSON endpoint for the seed-finder UI, plus tech-sheet
PDFs on the Syngenta CDN at
``assets.syngenta-us.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
Expected count: 29 varieties (12 corn + 17 soy). No wheat.
Expected count: 122 varieties (41 corn + 81 soy on 2026-05-25). No
wheat.
The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
helper module under ``scrape.sources._syngenta_pdf`` once both
scrapers are written.
Discovery: the HTML catalog pages (``/corn/nk/products``,
``/soybeans/nk/products``) load product cards via JS. The JS calls
Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
have marketing copy only). Use pdfplumber for table extraction.
POST /NKSeeds/CornProductFinder.aspx/GetProducts
POST /NKSeeds/SoyProductFinder.aspx/GetProducts
Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
etc. — additional supplemental context worth ingesting once the
per-variety scrape is solid.
Both endpoints return ASP.NET's ``{"d": "..."}`` wrapper where ``d``
is a string of HTML containing one ``<div class="sf-result">`` per
variety. Stray `` @ `` markers appear between fragments but are not a
reliable delimiter. Each card carries:
TODO: implement.
- product code (e.g. ``NK8005`` / ``NK008-P8XF``)
- RM days (corn) / MG decimal (soy) in a ``<span>`` next to the
title
- "Brands Available" line listing trait variants
(NK8005-V, NK8005-GT/LL — these are trait-specific SKUs)
- positioning slogan + bullet-list of strengths
- tech-sheet PDF URL
Per-variety disease ratings live ONLY in the PDF tech sheets (the
HTML cards have marketing text but no rating numbers). We extract
disease ratings via ``pdfplumber`` text extraction — they appear as
"Label Number" lines that we parse with a regex.
**Rating-scale direction**: NK explicitly publishes
``1-9 Scale: 1 = Best, Tallest or Highest; 9 = Worst, Shortest or
Lowest`` on every tech sheet — REVERSED from Bayer/Golden Harvest.
The chunker preserves values verbatim and the sidecar's
``_scale_direction`` field declares this so the LLM correctly
interprets the chunk preamble.
**Agronomic ratings**: rendered as horizontal bar charts in the
PDF; pdfplumber's text extraction captures the LABELS (Emergence,
Stalk Strength, Drought, etc.) but NOT the bar values. Surfacing
those would require either OCR of the bar positions or pdfplumber's
geometric layout parsing — deferred. For now the chunk records the
labels and an explicit "agronomic ratings rendered as chart bars in
the source PDF — values not currently extracted" annotation so the
agent knows to direct the farmer at the tech-sheet PDF for those
numbers.
Tech-sheet PDF URLs come from the API response (live URL is
correct; the assets-host filenames include a YYMMDD that changes).
Output:
corpus/nk/<source_key>.md
corpus/nk/<source_key>.json
source_key convention: ``nk-<code>`` lowercased, e.g.
``nk-nk8005`` or ``nk-nk008-p8xf``.
CLI:
python -m scrape.sources.nk --limit 5
python -m scrape.sources.nk --crop corn --limit 12
python -m scrape.sources.nk --force
"""
from __future__ import annotations
import argparse
import io
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import requests
from bs4 import BeautifulSoup
import pdfplumber
SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.syngenta-us.com"
CORN_LIST_URL = f"{BASE}/corn/nk/products"
SOY_LIST_URL = f"{BASE}/soybeans/nk/products"
CORN_API = f"{BASE}/NKSeeds/CornProductFinder.aspx/GetProducts"
SOY_API = f"{BASE}/NKSeeds/SoyProductFinder.aspx/GetProducts"
# NK + AgriPro both use the "1 = best, lower = more resistant" convention.
# Confirmed by tech-sheet footer: "1-9 Scale: 1 = Best...; 9 = Worst..."
RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "nk"
REQ_INTERVAL_SEC = 1.0
log = logging.getLogger("scrape.nk")
# --------------------------------------------------------------------- HTTP
class RateLimitedSession:
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
self.s = requests.Session()
self.s.headers["User-Agent"] = USER_AGENT
self.interval = interval
self._last = 0.0
def _wait(self) -> None:
delta = time.monotonic() - self._last
if delta < self.interval:
time.sleep(self.interval - delta)
self._last = time.monotonic()
def request(
self,
method: str,
url: str,
*,
max_retries: int = 4,
timeout: float = 30.0,
**kw: Any,
) -> requests.Response:
last_exc: Exception | None = None
for attempt in range(max_retries):
self._wait()
try:
resp = self.s.request(method, url, timeout=timeout, **kw)
except requests.RequestException as exc:
last_exc = exc
backoff = min(30.0, (2 ** attempt) + random.random())
log.warning("network error on %s %s: %s — retry in %.1fs",
method, url, exc, backoff)
time.sleep(backoff)
continue
if resp.status_code == 429 or 500 <= resp.status_code < 600:
ra = resp.headers.get("Retry-After")
backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
log.warning("HTTP %d on %s %s — retry in %.1fs",
resp.status_code, method, url, backoff)
time.sleep(backoff)
continue
return resp
if last_exc:
raise last_exc
return resp # type: ignore[return-value]
def get(self, url: str, **kw: Any) -> requests.Response:
return self.request("GET", url, **kw)
def post(self, url: str, **kw: Any) -> requests.Response:
return self.request("POST", url, **kw)
# --------------------------------------------------------------------- model
@dataclass
class NKProduct:
source_key: str
source_url: str # the brand catalog page (closest thing to a per-variety URL)
crop: str # "corn" | "soybeans"
product_code: str = "" # NK8005 / NK008-P8XF
relative_maturity: str | None = None # corn
maturity_group: str | None = None # soy
brand_variants: list[str] = field(default_factory=list) # ["NK8005-V", "NK8005-GT/LL"]
trait_codes: list[str] = field(default_factory=list)
trait_descriptions: list[str] = field(default_factory=list)
positioning_statement: str | None = None
strengths: list[str] = field(default_factory=list)
techsheet_url: str | None = None
characteristics_groups: list[dict] = field(default_factory=list)
# --------------------------------------------------------------------- discovery
def _api_payload_corn(rm_low: str, rm_high: str) -> str:
"""Payload for ``CornProductFinder.aspx/GetProducts``."""
return json.dumps({
"cornCount": "1",
"rmLowerRange": rm_low,
"rmUpperRange": rm_high,
"brands": "NK",
"agisuraTraits": "",
"insectResistance": "",
"herbicideTolerance": "",
"waterOptimization": "",
"reducedRefuge": "",
"diseaseResistence": "",
"silage": "",
"path": "false",
"currentUrl": CORN_LIST_URL,
"fieldForged": "",
"newProduct": "",
})
def _api_payload_soy(rm_low: str, rm_high: str) -> str:
return json.dumps({
"soyaBeanCount": "1",
"rmLowerRange": rm_low,
"rmUpperRange": rm_high,
"herbicideTolerance": "",
"diseaseFilter": "",
"nematodeFilter": "",
"agroPlantCharFilter": "",
"plantHeightFilter": "",
"brands": "NK",
"browserURL": SOY_LIST_URL,
"fieldForged": "",
"newProduct": "",
})
def _parse_card(html_chunk: str, crop: str) -> NKProduct | None:
    """Parse one ``<div class="sf-result">`` card from the API
    response into an NKProduct."""
    soup = BeautifulSoup(html_chunk, "html.parser")
    title_el = soup.find(class_="sf-result-title")
    if not title_el:
        return None
    # Title holds the product code as leading text, followed by the
    # RM/MG <span>.
    first = title_el.contents[0] if title_el.contents else ""
    # contents[0] may be a Tag rather than text; only a text node can
    # carry the code.
    code = first.strip() if isinstance(first, str) else ""
    if not code:
        return None
    rm_str: str | None = None
    span = title_el.find("span")
    if span:
        # span text is like "RM\n80"; keep only the digits/decimal
        text = span.get_text(" ", strip=True)
        m = re.search(r"(\d+(?:\.\d+)?)", text)
        if m:
            rm_str = m.group(1)
    prod = NKProduct(
        source_key=f"nk-{code.lower()}",
        # NK doesn't expose per-variety URLs; the brand catalog is the
        # nearest equivalent. lookup_variety / get_page will still work
        # via source_key.
        source_url=CORN_LIST_URL if crop == "corn" else SOY_LIST_URL,
        crop=crop,
        product_code=code,
    )
    if rm_str is not None:
        if crop == "corn":
            prod.relative_maturity = rm_str
        else:
            prod.maturity_group = rm_str
    # Brands Available (trait variants).
    inner = soup.find(class_="sf-result-content-inner")
    if inner:
        # A <strong> starting with "Brands available:" or
        # "Herbicide Tolerant Trait(s):" carries trait data; any other
        # sufficiently long <strong> is treated as the slogan.
        for strong in inner.find_all("strong"):
            text = strong.get_text(" ", strip=True)
            if text.lower().startswith("brands available"):
                rest = text.split(":", 1)[1] if ":" in text else ""
                for v in rest.split("|"):
                    v = v.strip()
                    if v:
                        prod.brand_variants.append(v)
            elif text.lower().startswith("herbicide tolerant trait"):
                rest = text.split(":", 1)[1] if ":" in text else ""
                for t in rest.split(","):
                    t = t.strip()
                    if t:
                        prod.trait_codes.append(t)
            else:
                # Positioning slogan is also rendered as a bare <strong>.
                if not prod.positioning_statement and len(text) > 12:
                    prod.positioning_statement = text
        # Bullet strengths
        ul = inner.find("ul")
        if ul:
            for li in ul.find_all("li"):
                t = li.get_text(" ", strip=True)
                if t:
                    prod.strengths.append(t)
    # Tech-sheet PDF URL.
    for a in soup.find_all("a", href=True):
        h = a["href"]
        if "assets.syngenta-us.com/pdf/techsheets/" in h and h.lower().endswith(".pdf"):
            prod.techsheet_url = h
            break
    return prod
def discover_products(
    http: RateLimitedSession,
    *,
    only_crop: str | None = None,
) -> list[NKProduct]:
    """Hit the corn + soy product-finder APIs and parse the returned
    HTML cards into NKProducts. Returns identity-level data only;
    ratings come from the per-variety tech-sheet PDF in
    ``enrich_with_pdf``."""
    # Warm the session cookie (some Syngenta deployments need it).
    http.get(CORN_LIST_URL)
    out: list[NKProduct] = []
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "X-Requested-With": "XMLHttpRequest",
    }

    def _parse_response(html_blob: str, crop: str) -> int:
        """Parse the API response's inner HTML into NKProducts.

        The endpoint emits one ``<div class="sf-result">`` per variety,
        each wrapped in a ``<div class="col-md-6">`` column. Remove the
        ``@`` markers and let BeautifulSoup tokenize the whole blob
        rather than splitting per chunk (the API doesn't actually
        delimit with ``@`` reliably, despite appearances).
        """
        n = 0
        # Remove the " @ " noise (rendered by the JS when filters
        # change, not a structural delimiter). Note that replace()
        # drops every "@" in the blob, not only the leading ones.
        cleaned = html_blob.replace("@", "").strip()
        soup = BeautifulSoup(cleaned, "html.parser")
        for card in soup.find_all("div", class_="sf-result"):
            prod = _parse_card(str(card), crop)
            if prod:
                out.append(prod)
                n += 1
        return n

    if only_crop in (None, "corn"):
        log.info("fetching NK corn product list")
        r = http.post(
            CORN_API,
            data=_api_payload_corn("75", "120"),
            headers={**headers, "Referer": CORN_LIST_URL},
        )
        r.raise_for_status()
        n = _parse_response(r.json().get("d") or "", "corn")
        log.info("corn cards parsed: %d", n)
    if only_crop in (None, "soybeans"):
        log.info("fetching NK soy product list")
        r = http.post(
            SOY_API,
            data=_api_payload_soy("0", "9.9"),
            headers={**headers, "Referer": SOY_LIST_URL},
        )
        r.raise_for_status()
        n = _parse_response(r.json().get("d") or "", "soybeans")
        log.info("soy cards parsed: %d", n)
    log.info("total: %d NK varieties", len(out))
    return out


# --------------------------------------------------------------------- PDF

def _extract_disease_ratings(text: str) -> list[dict]:
    """Pull disease-tolerance ratings out of the tech-sheet PDF text.

    The PDF renders disease ratings as a left-column-label /
    right-column-number layout. pdfplumber's ``extract_text`` interleaves
    the agronomic-chart labels (no number) with the disease-rating
    labels + numbers, so we just look for lines ending in a numeric
    rating or a literal ``-`` (not available).

    Returns a list of ``{characteristic, value}``. Values are
    preserved as strings (including ``-`` for "not available").
    """
    # The disease list per tech sheet is small (~10 conditions) and
    # the labels are stable. We anchor on the known label set rather
    # than try to guess by layout.
    known_diseases = [
        "Gray Leaf Spot",
        "Northern Corn Leaf Blight",
        "Goss's Wilt",
        "Goss's wilt",
        "Bacterial Leaf Streak",
        "Bacterial Corn Leaf Streak",
        "Southern Corn Leaf Blight",
        "Anthracnose Stalk Rot",
        "Anthracnose Leaf Blight",
        "Tar Spot",
        "Fusarium Crown Rot",
        "Common Rust",
        "Southern Rust",
        "Eye Spot",
        "Stewart's Bacterial Wilt",
        # Soybean
        "Brown Stem Rot",
        "Charcoal Rot",
        "Frogeye Leaf Spot",
        "Iron Deficiency Chlorosis",
        "Phytophthora Root Rot",
        "Sclerotinia White Mold",
        "White Mold",
        "Soybean Cyst Nematode",
        "Sudden Death Syndrome",
        "Southern Stem Canker",
        "Stem Canker",
        "Soybean Mosaic Virus",
    ]
    items: list[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Match "<label> <value>" where label is one of known_diseases
        # and value is a single digit or "-".
        for d in known_diseases:
            m = re.match(rf"^{re.escape(d)}\s+([1-9]|-)\s*$", line)
            if m:
                items.append({"characteristic": d, "value": m.group(1)})
                break
    # Dedup while preserving order
    seen: set[str] = set()
    deduped: list[dict] = []
    for it in items:
        if it["characteristic"] not in seen:
            seen.add(it["characteristic"])
            deduped.append(it)
    return deduped
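

# Illustrative sketch, not part of the scraper: the disease-rating matcher
# tries one regex per known label on every line. Assuming the same
# "<label> <value>" line shape, the labels can instead be folded into a single
# precompiled alternation. `_example_rating_matcher` and the sample lines in
# the usage note are hypothetical names and data, chosen only for illustration.
import re


def _example_rating_matcher(labels):
    """Build a ``line -> (label, value) | None`` matcher for the given labels."""
    # Longest labels first, so a longer label wins over a shorter one that
    # happens to be its prefix.
    alternation = "|".join(
        re.escape(lbl) for lbl in sorted(labels, key=len, reverse=True)
    )
    pattern = re.compile(rf"^({alternation})\s+([1-9]|-)\s*$")

    def match(line):
        m = pattern.match(line.strip())
        return (m.group(1), m.group(2)) if m else None

    return match


# Usage (made-up lines, not taken from a real tech sheet):
#   match = _example_rating_matcher(["Gray Leaf Spot", "Tar Spot"])
#   match("Gray Leaf Spot 3")  returns ("Gray Leaf Spot", "3")
#   match("Tar Spot -")        returns ("Tar Spot", "-")
#   match("Stalk Strength")    returns None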


def _extract_phytophthora_genes(text: str) -> str | None:
    """Soybean tech sheets list the Phytophthora Root Rot (PRR) source
    genes (Rps1c / Rps3a / etc.). The exact line wording varies; we
    accept several common phrasings."""
    patterns = (
        r"Phytophthora Root Rot\s*\(PRR\)\s*Source\s+(.+)",
        r"PRR Source\s*[:\-]?\s*(.+)",
        r"Phytophthora Gene\s*[:\-]?\s*(.+)",
    )
    for line in text.splitlines():
        line = line.strip()
        for p in patterns:
            m = re.match(p, line, re.I)
            if m:
                val = m.group(1).strip()
                # Trim trailing words that obviously aren't gene names
                # ("Source Rps1c, Rps3a Emergence 3" can run together).
                # maxsplit is passed by keyword: the positional form is
                # deprecated as of Python 3.13.
                val = re.split(
                    r"\s+(?:Emergence|Soybean|Standability|Root)\b",
                    val,
                    maxsplit=1,
                )[0].strip()
                if val and val.lower() not in ("-", "na", "n/a", "none"):
                    return val
    return None


def _extract_scn_source(text: str) -> str | None:
    """Soy: value of the 'SCN Source' / 'Cyst Nematode Source' line,
    or None when the line is missing or holds only ``-``."""
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^(SCN Source|Cyst Nematode Source)\s*[:\-]?\s*(.+)$", line, re.I)
        if m:
            val = m.group(2).strip()
            if val and val != "-":
                return val
    return None


def _extract_scn_races(text: str) -> str | None:
    """Soy: 'Soybean Cyst Nematode (SCN) Races S' / 'R3' etc."""
    for line in text.splitlines():
        line = line.strip()
        m = re.match(
            r"^Soybean Cyst Nematode \(SCN\) Races\s+(.+)$", line, re.I,
        )
        if m:
            val = m.group(1).strip()
            if val:
                return val
    return None


# Soy agronomic ratings rendered as text "Label N" pairs in the PDF.
# These ARE extractable (unlike the bar charts).
_SOY_AGRO_LABELS = (
    "Emergence", "Standability", "Shatter Tolerance",
    "Green Stem", "% Protein at 13% mst.", "% Oil at 13% mst.",
)


def _extract_soy_agronomic_text(text: str) -> list[dict]:
    out: list[dict] = []
    for label in _SOY_AGRO_LABELS:
        # Allow trailing decimal for %Protein / %Oil; single digit
        # for the 1-9 ratings. A bare "-" placeholder is closed with a
        # lookahead instead of \b, because \b cannot match between "-"
        # and a following space or end of line.
        m = re.search(
            rf"{re.escape(label)}\s+(\d+(?:\.\d+)?\b|-(?!\S))",
            text,
        )
        if m:
            out.append({"characteristic": label, "value": m.group(1)})
    return out
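

# Illustrative sketch, not part of the scraper: a note on matching the "-"
# placeholder in "Label N" pairs. ``\b`` only matches between a word character
# and a non-word character, so it never matches between "-" and a following
# space or end of line; a negative lookahead ``(?!\S)`` ("not followed by a
# non-space character") does. The names and sample strings here are
# hypothetical and exist only for this illustration.
import re

_EXAMPLE_DASH_WITH_B = re.compile(r"Emergence\s+(\d+(?:\.\d+)?|-)\b")
_EXAMPLE_DASH_WITH_LOOKAHEAD = re.compile(
    r"Emergence\s+(\d+(?:\.\d+)?\b|-(?!\S))"
)


def _example_first_value(pattern, text):
    """Return the captured rating from ``text``, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


# Usage (made-up text, not taken from a real tech sheet):
#   _example_first_value(_EXAMPLE_DASH_WITH_B, "Emergence - Standability 3")
#       returns None
#   _example_first_value(_EXAMPLE_DASH_WITH_LOOKAHEAD, "Emergence - Standability 3")
#       returns "-"
#   Both patterns return "3" for "Emergence 3".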


# Soil-type adaptation lines on soy PDFs: "Drought Prone Best",
# "Narrow Rows Best", "High pH* Good", etc.
_SOY_SOIL_LABELS = (
    "Drought Prone", "Narrow Rows", "High pH",
    "Wide Rows", "Highly Productive",
    "Moderate/Variable Environments", "Poorly Drained",
)


def _extract_soy_soil_adaptation(text: str) -> list[dict]:
    out: list[dict] = []
    for label in _SOY_SOIL_LABELS:
        m = re.search(
            rf"{re.escape(label)}\*?\s+(Best|Good|Fair|Poor)\b",
            text,
        )
        if m:
            out.append({"characteristic": label, "value": m.group(1)})
    return out


def enrich_with_pdf(
    http: RateLimitedSession, prod: NKProduct
) -> None:
    """Fetch the tech-sheet PDF and add disease ratings + relevant
    soybean fields to ``prod.characteristics_groups``."""
    if not prod.techsheet_url:
        log.info("%s: no tech sheet URL, identity only", prod.source_key)
        return
    try:
        r = http.get(prod.techsheet_url)
        r.raise_for_status()
    except Exception as exc:  # noqa: BLE001
        log.warning("%s: PDF fetch failed (%s), identity only",
                    prod.source_key, exc)
        return
    try:
        with pdfplumber.open(io.BytesIO(r.content)) as pdf:
            text = "\n".join((p.extract_text() or "") for p in pdf.pages)
    except Exception as exc:  # noqa: BLE001
        log.warning("%s: PDF parse failed (%s), identity only",
                    prod.source_key, exc)
        return

    disease = _extract_disease_ratings(text)
    if disease:
        prod.characteristics_groups.append({
            "label": "DISEASE RATINGS",
            "type": "pdf-text",
            "items": disease,
        })

    if prod.crop == "soybeans":
        misc_items: list[dict] = []
        prr = _extract_phytophthora_genes(text)
        if prr:
            misc_items.append({"characteristic": "Phytophthora Gene", "value": prr})
        scn = _extract_scn_source(text)
        if scn:
            misc_items.append({"characteristic": "SCN Source", "value": scn})
        scn_races = _extract_scn_races(text)
        if scn_races:
            misc_items.append({"characteristic": "SCN Race Coverage", "value": scn_races})
        if misc_items:
            prod.characteristics_groups.append({
                "label": "DISEASE GENETICS",
                "type": "pdf-text",
                "items": misc_items,
            })
        soy_agro = _extract_soy_agronomic_text(text)
        if soy_agro:
            prod.characteristics_groups.append({
                "label": "AGRONOMIC TRAITS",
                "type": "pdf-text",
                "items": soy_agro,
            })
        soil = _extract_soy_soil_adaptation(text)
        if soil:
            prod.characteristics_groups.append({
                "label": "SOIL TYPE ADAPTATION",
                "type": "pdf-text",
                "items": soil,
            })

    # Surface labels for charted-only agronomic ratings so search_docs
    # can match queries like "drought" / "stalk strength". The values
    # aren't extractable via text (the source PDF renders them as bar
    # positions). We record only labels NOT already present in
    # text-extractable groups, with an explicit "rated in PDF chart"
    # value so the LLM directs the farmer at the tech sheet for those
    # numbers. (For soy this is mostly redundant, since text extraction
    # got the agronomic numbers, so we skip the chart-label group there.)
    if prod.crop == "corn":
        agronomic_labels_corn = (
            "Emergence", "Seedling Vigor", "Root Strength",
            "Stalk Strength", "Green Snap", "Staygreen",
            "Drydown", "Test Weight", "Drought",
        )
        # Skip any label that already carries a non-empty value in a
        # previously extracted group.
        already_rated = {
            it["characteristic"]
            for g in prod.characteristics_groups
            for it in g.get("items") or []
            if str(it.get("value", "")).strip()
        }
        present = [
            label for label in agronomic_labels_corn
            if label in text and label not in already_rated
        ]
        if present:
            prod.characteristics_groups.append({
                "label": "AGRONOMIC CHARACTERISTICS",
                "type": "pdf-chart",
                "items": [
                    {
                        "characteristic": label,
                        "value": "rated in tech-sheet PDF chart (not text-extractable)",
                    }
                    for label in present
                ],
            })


# --------------------------------------------------------------------- render

def render_markdown(p: NKProduct) -> str:
    title = p.product_code or p.source_key
    crop_label = "Corn" if p.crop == "corn" else "Soybeans"
    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** Syngenta",
        "- **Brand:** NK",
        f"- **Crop:** {crop_label}",
    ]
    if p.crop == "corn" and p.relative_maturity:
        head.append(f"- **Relative maturity:** {p.relative_maturity}")
    if p.crop == "soybeans" and p.maturity_group:
        head.append(f"- **Maturity group:** {p.maturity_group}")
    if p.brand_variants:
        head.append(f"- **Brand variants:** {', '.join(p.brand_variants)}")
    if p.trait_codes:
        head.append(f"- **Traits:** {', '.join(p.trait_codes)}")
    head.append(f"- **Catalog page:** {p.source_url}")
    if p.techsheet_url:
        head.append(f"- **Tech sheet (PDF):** {p.techsheet_url}")
    head.append(f"- **Rating scale (NK):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")

    sections: list[str] = []
    if p.positioning_statement:
        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
    if p.strengths:
        bullets = "\n".join(f"- {s}" for s in p.strengths)
        sections.append("## Strengths\n\n" + bullets + "\n")
    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    return "\n".join(head) + "\n".join(sections)


# --------------------------------------------------------------------- write

def write_product(prod: NKProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"
    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "nk",
        "source_key": prod.source_key,
        "vendor": "Syngenta",
        "brand": "NK",
        "product_name": prod.product_code,
        "product_id": None,
        "hybrid_prefix": prod.product_code,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        "relative_maturity": prod.relative_maturity,
        "maturity_group": prod.maturity_group,
        "wheat_class": None,
        "trait_stack": prod.trait_codes,
        "trait_descriptions": prod.trait_descriptions,
        "brand_variants": prod.brand_variants,
        "positioning_statement": prod.positioning_statement,
        "strengths": prod.strengths,
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "techsheet_url": prod.techsheet_url,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline

def process_product(
    http: RateLimitedSession,
    prod: NKProduct,
    *,
    force: bool,
) -> str:
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    if md_path.exists() and not force:
        return "skipped"
    enrich_with_pdf(http, prod)
    body = render_markdown(prod)
    write_product(prod, body)
    return "written"


def run(
    *,
    limit: int | None,
    force: bool,
    only_crop: str | None,
    only_product: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_products(http, only_crop=only_crop)
    if only_product:
        targets = [
            p for p in targets
            if p.source_key == only_product
            or p.product_code.lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
    for prod in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        try:
            status = process_product(http, prod, force=force)
        except Exception as exc:  # noqa: BLE001
            log.error("%s failed: %s", prod.source_key, exc)
            status = "failed"
        counts[status] = counts.get(status, 0) + 1
        log.info(
            "[%d/%s] %s %s | crop=%s rm/mg=%s variants=%d traits=%s groups=%d",
            processed, str(limit) if limit else "all",
            prod.source_key, status, prod.crop,
            prod.relative_maturity or prod.maturity_group or "-",
            len(prod.brand_variants),
            ",".join(prod.trait_codes) or "-",
            len(prod.characteristics_groups),
        )
    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI

def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.nk",
        description="Scrape NK (Syngenta) corn + soybean varieties.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=("corn", "soybeans"),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or product code.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    # argv=None lets argparse fall back to sys.argv[1:].
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit,
        force=args.force,
        only_crop=args.crop,
        only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())