9ce920f622
agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the
  three rated sections (Agronomics / Grain / Disease) with their
  <div class="row"><div class="label">label</div><div>value</div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley;
  this provides the Northern Plains HRS coverage WestBred lacks.
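
The label/value row parse described above can be sketched with the standard library alone. This is a minimal illustration, not the scraper's actual code: the class name `RowPairParser`, the helper `parse_rows`, and the sample markup are assumptions made for this example; only the row structure and the AP Iliad values (Stripe Rust 1, Eyespot 2) come from this message.

```python
from html.parser import HTMLParser

# Hypothetical sample following the row structure named in this message.
SAMPLE = (
    '<div class="row"><div class="label">Stripe Rust</div><div>1</div></div>'
    '<div class="row"><div class="label">Eyespot</div><div>2</div></div>'
)


class RowPairParser(HTMLParser):
    """Collect (label, value) pairs from <div class="row"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: list[tuple[str, str]] = []
        self._stack: list[str] = []  # class attribute of each open <div>
        self._label: list[str] = []
        self._value: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append(dict(attrs).get("class") or "")

    def handle_data(self, data):
        # Only text inside a direct child <div> of a row is collected.
        if len(self._stack) >= 2 and self._stack[-2] == "row":
            target = self._label if self._stack[-1] == "label" else self._value
            target.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        closed = self._stack.pop()
        if closed == "row":
            label = "".join(self._label).strip()
            value = "".join(self._value).strip()
            if label:
                self.pairs.append((label, value))
            self._label, self._value = [], []


def parse_rows(html: str) -> list[tuple[str, str]]:
    parser = RowPairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


print(parse_rows(SAMPLE))
```

The sketch matches the class attribute exactly ("row", "label"); real pages may carry extra classes, in which case a token-based comparison would be needed.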

nk (122 varieties; recon's "29" was outdated, the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
  title <span>, "Brands Available" trait variants, marketing
  positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
  Drought) are not text-extractable; we record the labels with an
  explicit "rated in PDF chart, see tech sheet" value so the agent
  can direct the farmer at the source for those numbers.
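
As a rough illustration of the "Label N" extraction and the chart-label fallback described above, the sketch below runs on a made-up text sample. The function names, the shortened label tuples, and the sample values are assumptions for this example; the real module anchors on a longer label list and reads its text from pdfplumber.

```python
import re

# Shortened, illustrative label sets (the real lists are longer).
KNOWN_LABELS = ("Gray Leaf Spot", "Goss's Wilt", "Tar Spot")
CHART_LABELS = ("Emergence", "Stalk Strength", "Drought")
CHART_NOTE = "rated in PDF chart, see tech sheet"

# Hypothetical extracted text: chart labels carry no number,
# disease labels end in a 1-9 rating or "-" (not available).
SAMPLE = """Emergence
Gray Leaf Spot 3
Goss's Wilt -
Tar Spot 5
Stalk Strength
"""


def extract_label_ratings(text: str) -> dict[str, str]:
    """Return {label: value} for lines of the form '<label> <1-9 or ->'."""
    ratings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        for label in KNOWN_LABELS:
            m = re.fullmatch(rf"{re.escape(label)}\s+([1-9]|-)", line)
            if m:
                ratings.setdefault(label, m.group(1))  # keep first hit
                break
    return ratings


def chart_placeholders(text: str) -> dict[str, str]:
    """Record chart-only labels with an explicit placeholder value."""
    return {label: CHART_NOTE for label in CHART_LABELS if label in text}


print(extract_label_ratings(SAMPLE))
print(chart_placeholders(SAMPLE))
```

Values stay strings so that "-" survives unchanged next to the numeric ratings.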

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant: the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
  AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
  with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
  Beck's products; everything else IS now in the corpus.
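
To make the scale-direction note concrete, here is a small sketch of how a consumer could compare ratings across vendors. It is not part of this change: the scraper stores values verbatim. The helper name `to_higher_is_better` is hypothetical, and it assumes the Bayer / Golden Harvest convention is 9 = best, which this message implies but does not state outright.

```python
def to_higher_is_better(rating: int, one_is_best: bool) -> int:
    """Map a 1-9 rating onto a common 'higher is better' orientation."""
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of range: {rating}")
    # On a 1 = best scale, 1 maps to 9 and 9 maps to 1.
    return 10 - rating if one_is_best else rating


# AP Iliad, Stripe Rust 1 on the AgriPro (1 = best) scale.
print(to_higher_is_better(1, one_is_best=True))
# The same number on an assumed 9 = best scale is left unchanged.
print(to_higher_is_better(1, one_is_best=False))
```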

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475 (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122 (41 corn / 81 soy)
  agripro          24 (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).
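
The totals above can be re-derived from the per-source breakdown. The dictionary layout and the helper name below are illustrative only; the numbers are the ones listed in this message. Note that the 109 wheat figure works out as WestBred 85 plus all 24 AgriPro entries, so it appears to include the single barley entry.

```python
# Per-source breakdown as listed in this message.
COVERAGE = {
    "bayer_seeds": {"DEKALB": 288, "Asgrow": 102, "WestBred": 85},
    "golden_harvest": {"Golden Harvest": 139},
    "nk": {"corn": 41, "soy": 81},
    "agripro": {"HRS": 12, "SWW": 7, "HRW": 3, "HWS": 1, "Barley": 1},
}


def source_totals(coverage: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum the breakdown of each source."""
    return {name: sum(parts.values()) for name, parts in coverage.items()}


totals = source_totals(COVERAGE)
grand_total = sum(totals.values())
wheat_total = COVERAGE["bayer_seeds"]["WestBred"] + totals["agripro"]

print(totals)
print(grand_total, wheat_total)
```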
requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
829 lines
28 KiB
Python
"""NK (Syngenta) seed scraper — corn + soybeans.

Source: ``syngenta-us.com``, an ASP.NET WebForms catalog with an
ASMX-style JSON endpoint for the seed-finder UI, plus tech-sheet
PDFs on the Syngenta CDN at
``assets.syngenta-us.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.

Expected count: 122 varieties (41 corn + 81 soy, per the current NK
seed finder). The earlier recon figure of 29 (12 corn + 17 soy,
dated 2026-05-25) was outdated. No wheat.

Discovery: the HTML catalog pages (``/corn/nk/products``,
``/soybeans/nk/products``) load product cards via JS. The JS calls

    POST /NKSeeds/CornProductFinder.aspx/GetProducts
    POST /NKSeeds/SoyProductFinder.aspx/GetProducts

Both endpoints return ASP.NET's ``{"d": "..."}`` wrapper where ``d``
is a string of HTML containing one ``<div class="sf-result">`` per
variety. Stray `` @ `` markers in that string are noise, not a
reliable delimiter. Each card carries:

- product code (e.g. ``NK8005`` / ``NK008-P8XF``)
- RM days (corn) / MG decimal (soy) in a ``<span>`` next to the
  title
- "Brands Available" line listing trait variants
  (NK8005-V, NK8005-GT/LL; these are trait-specific SKUs)
- positioning slogan + bullet-list of strengths
- tech-sheet PDF URL

Per-variety disease ratings live ONLY in the PDF tech sheets (the
HTML cards have marketing text but no rating numbers). We extract
disease ratings via ``pdfplumber`` text extraction: they appear as
"Label Number" lines that we parse with a regex.

**Rating-scale direction**: NK explicitly publishes
``1-9 Scale: 1 = Best, Tallest or Highest; 9 = Worst, Shortest or
Lowest`` on every tech sheet, REVERSED from Bayer/Golden Harvest.
The chunker preserves values verbatim and the sidecar's
``_scale_direction`` field declares this so the LLM correctly
interprets the chunk preamble.

**Agronomic ratings (corn)**: rendered as horizontal bar charts in
the PDF; pdfplumber's text extraction captures the LABELS (Emergence,
Stalk Strength, Drought, etc.) but NOT the bar values. Surfacing
those would require either OCR of the bar positions or pdfplumber's
geometric layout parsing, which is deferred. For now the chunk
records the labels with an explicit "rated in tech-sheet PDF chart
(not text-extractable)" value so the agent knows to direct the
farmer to the tech-sheet PDF for those numbers. Soybean agronomic
ratings are printed as text and are extracted.

Tech-sheet PDF URLs come from the API response (live URL is
correct; the assets-host filenames include a YYMMDD that changes).

Output:
    corpus/nk/<source_key>.md
    corpus/nk/<source_key>.json

source_key convention: ``nk-<code>`` lowercased, e.g.
``nk-nk8005`` or ``nk-nk008-p8xf``.

CLI:
    python -m scrape.sources.nk --limit 5
    python -m scrape.sources.nk --crop corn --limit 12
    python -m scrape.sources.nk --force
"""

from __future__ import annotations

import argparse
import io
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup
import pdfplumber

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.syngenta-us.com"
CORN_LIST_URL = f"{BASE}/corn/nk/products"
SOY_LIST_URL = f"{BASE}/soybeans/nk/products"
CORN_API = f"{BASE}/NKSeeds/CornProductFinder.aspx/GetProducts"
SOY_API = f"{BASE}/NKSeeds/SoyProductFinder.aspx/GetProducts"

# NK + AgriPro both use the "1 = best, lower = more resistant" convention.
# Confirmed by tech-sheet footer: "1-9 Scale: 1 = Best...; 9 = Worst..."
RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "nk"

REQ_INTERVAL_SEC = 1.0

log = logging.getLogger("scrape.nk")


# --------------------------------------------------------------------- HTTP


class RateLimitedSession:
    """``requests.Session`` wrapper with a minimum request interval
    and retry/backoff on network errors, HTTP 429 and HTTP 5xx."""

    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(
        self,
        method: str,
        url: str,
        *,
        max_retries: int = 4,
        timeout: float = 30.0,
        **kw: Any,
    ) -> requests.Response:
        last_exc: Exception | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s - retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                # Honor a numeric Retry-After header when present.
                ra = resp.headers.get("Retry-After")
                if ra and ra.isdigit():
                    backoff = float(ra)
                else:
                    backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s - retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc:
            raise last_exc
        return resp  # type: ignore[return-value]

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)

    def post(self, url: str, **kw: Any) -> requests.Response:
        return self.request("POST", url, **kw)


# --------------------------------------------------------------------- model


@dataclass
class NKProduct:
    source_key: str
    source_url: str  # the brand catalog page (closest thing to a per-variety URL)
    crop: str  # "corn" | "soybeans"
    product_code: str = ""  # NK8005 / NK008-P8XF
    relative_maturity: str | None = None  # corn
    maturity_group: str | None = None  # soy
    brand_variants: list[str] = field(default_factory=list)  # ["NK8005-V", "NK8005-GT/LL"]
    trait_codes: list[str] = field(default_factory=list)
    trait_descriptions: list[str] = field(default_factory=list)
    positioning_statement: str | None = None
    strengths: list[str] = field(default_factory=list)
    techsheet_url: str | None = None
    characteristics_groups: list[dict] = field(default_factory=list)


# --------------------------------------------------------------------- discovery


def _api_payload_corn(rm_low: str, rm_high: str) -> str:
    """Payload for ``CornProductFinder.aspx/GetProducts``."""
    return json.dumps({
        "cornCount": "1",
        "rmLowerRange": rm_low,
        "rmUpperRange": rm_high,
        "brands": "NK",
        "agisuraTraits": "",
        "insectResistance": "",
        "herbicideTolerance": "",
        "waterOptimization": "",
        "reducedRefuge": "",
        "diseaseResistence": "",
        "silage": "",
        "path": "false",
        "currentUrl": CORN_LIST_URL,
        "fieldForged": "",
        "newProduct": "",
    })


def _api_payload_soy(rm_low: str, rm_high: str) -> str:
    """Payload for ``SoyProductFinder.aspx/GetProducts``."""
    return json.dumps({
        "soyaBeanCount": "1",
        "rmLowerRange": rm_low,
        "rmUpperRange": rm_high,
        "herbicideTolerance": "",
        "diseaseFilter": "",
        "nematodeFilter": "",
        "agroPlantCharFilter": "",
        "plantHeightFilter": "",
        "brands": "NK",
        "browserURL": SOY_LIST_URL,
        "fieldForged": "",
        "newProduct": "",
    })


def _parse_card(html_chunk: str, crop: str) -> NKProduct | None:
    """Parse one ``<div class="sf-result">`` card from the API
    response into an NKProduct."""
    soup = BeautifulSoup(html_chunk, "html.parser")
    title_el = soup.find(class_="sf-result-title")
    if not title_el:
        return None
    # Title contains code + RM <span> tail
    code = (title_el.contents[0].strip() if title_el.contents else "").strip()
    if not code:
        return None
    rm_str: str | None = None
    span = title_el.find("span")
    if span:
        # span text is like "RM\n80"; strip to digits/decimal
        text = span.get_text(" ", strip=True)
        m = re.search(r"(\d+(?:\.\d+)?)", text)
        if m:
            rm_str = m.group(1)

    prod = NKProduct(
        source_key=f"nk-{code.lower()}",
        # NK doesn't expose per-variety URLs; the brand catalog is the
        # nearest equivalent. lookup_variety / get_page will still work
        # via source_key.
        source_url=CORN_LIST_URL if crop == "corn" else SOY_LIST_URL,
        crop=crop,
        product_code=code,
    )
    if rm_str is not None:
        if crop == "corn":
            prod.relative_maturity = rm_str
        else:
            prod.maturity_group = rm_str

    # Brands Available (trait variants).
    inner = soup.find(class_="sf-result-content-inner")
    if inner:
        # The first <strong> with "Brands available:" or
        # "Herbicide Tolerant Trait(s):" sets the trait context.
        for strong in inner.find_all("strong"):
            text = strong.get_text(" ", strip=True)
            if text.lower().startswith("brands available"):
                rest = text.split(":", 1)[1] if ":" in text else ""
                for v in rest.split("|"):
                    v = v.strip()
                    if v:
                        prod.brand_variants.append(v)
            elif text.lower().startswith("herbicide tolerant trait"):
                rest = text.split(":", 1)[1] if ":" in text else ""
                for t in rest.split(","):
                    t = t.strip()
                    if t:
                        prod.trait_codes.append(t)
            else:
                # Positioning slogan is also rendered as a bare <strong>.
                if not prod.positioning_statement and len(text) > 12:
                    prod.positioning_statement = text

        # Bullet strengths
        ul = inner.find("ul")
        if ul:
            for li in ul.find_all("li"):
                t = li.get_text(" ", strip=True)
                if t:
                    prod.strengths.append(t)

    # Tech-sheet PDF URL.
    for a in soup.find_all("a", href=True):
        h = a["href"]
        if "assets.syngenta-us.com/pdf/techsheets/" in h and h.lower().endswith(".pdf"):
            prod.techsheet_url = h
            break

    return prod


def discover_products(
    http: RateLimitedSession,
    *,
    only_crop: str | None = None,
) -> list[NKProduct]:
    """Hit the corn + soy product-finder APIs and parse the returned
    HTML cards into NKProducts. Returns identity-level data only;
    ratings come from the per-variety tech-sheet PDF in
    ``enrich_with_pdf``."""
    # Warm the session cookie (some Syngenta deployments need it).
    http.get(CORN_LIST_URL)

    out: list[NKProduct] = []
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "X-Requested-With": "XMLHttpRequest",
    }

    def _parse_response(html_blob: str, crop: str) -> int:
        """Parse the API response's inner HTML into NKProducts.

        The endpoint emits one ``<div class="sf-result">`` per variety,
        each wrapped in a ``<div class="col-md-6">`` column. Strip the
        ``@`` markers and let BeautifulSoup tokenize the whole blob,
        with no per-chunk split (the API doesn't actually delimit
        with ``@`` reliably, despite appearances).
        """
        n = 0
        # Strip " @ " noise (rendered by the JS when filters
        # change, not a structural delimiter).
        cleaned = html_blob.replace("@", "").strip()
        soup = BeautifulSoup(cleaned, "html.parser")
        for card in soup.find_all("div", class_="sf-result"):
            prod = _parse_card(str(card), crop)
            if prod:
                out.append(prod)
                n += 1
        return n

    if only_crop in (None, "corn"):
        log.info("fetching NK corn product list")
        r = http.post(
            CORN_API,
            data=_api_payload_corn("75", "120"),
            headers={**headers, "Referer": CORN_LIST_URL},
        )
        r.raise_for_status()
        n = _parse_response(r.json().get("d") or "", "corn")
        log.info("corn cards parsed: %d", n)

    if only_crop in (None, "soybeans"):
        log.info("fetching NK soy product list")
        r = http.post(
            SOY_API,
            data=_api_payload_soy("0", "9.9"),
            headers={**headers, "Referer": SOY_LIST_URL},
        )
        r.raise_for_status()
        n = _parse_response(r.json().get("d") or "", "soybeans")
        log.info("soy cards parsed: %d", n)

    log.info("total: %d NK varieties", len(out))
    return out


# --------------------------------------------------------------------- PDF


def _extract_disease_ratings(text: str) -> list[dict]:
    """Pull disease-tolerance ratings out of the tech-sheet PDF text.

    The PDF renders disease ratings as a left-column-label / right-
    column-number layout. pdfplumber's ``extract_text`` interleaves
    the agronomic-chart labels (no number) with the disease-rating
    labels + numbers, so we just look for lines ending in a numeric
    rating or a literal ``-`` (not available).

    Returns a list of ``{characteristic, value}``. Values are
    preserved as strings (including ``-`` for "not available").
    """
    # The disease list per tech sheet is small (~10 conditions) and
    # the labels are stable. We anchor on the known label set rather
    # than try to guess by layout.
    known_diseases = [
        "Gray Leaf Spot",
        "Northern Corn Leaf Blight",
        "Goss's Wilt",
        "Goss's wilt",
        "Bacterial Leaf Streak",
        "Bacterial Corn Leaf Streak",
        "Southern Corn Leaf Blight",
        "Anthracnose Stalk Rot",
        "Anthracnose Leaf Blight",
        "Tar Spot",
        "Fusarium Crown Rot",
        "Common Rust",
        "Southern Rust",
        "Eye Spot",
        "Stewart's Bacterial Wilt",
        # Soybean
        "Brown Stem Rot",
        "Charcoal Rot",
        "Frogeye Leaf Spot",
        "Iron Deficiency Chlorosis",
        "Phytophthora Root Rot",
        "Sclerotinia White Mold",
        "White Mold",
        "Soybean Cyst Nematode",
        "Sudden Death Syndrome",
        "Southern Stem Canker",
        "Stem Canker",
        "Soybean Mosaic Virus",
    ]
    items: list[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Match "<label> <value>" where label is one of known_diseases
        # and value is a single digit or "-".
        for d in known_diseases:
            m = re.match(rf"^{re.escape(d)}\s+([1-9]|-)\s*$", line)
            if m:
                items.append({"characteristic": d, "value": m.group(1)})
                break
    # Dedup while preserving order
    seen: set[str] = set()
    deduped: list[dict] = []
    for it in items:
        if it["characteristic"] not in seen:
            seen.add(it["characteristic"])
            deduped.append(it)
    return deduped


def _extract_phytophthora_genes(text: str) -> str | None:
    """Soybean tech sheets list the Phytophthora Root Rot (PRR) source
    genes (Rps1c / Rps3a / etc.). The exact line wording varies; we
    accept several common phrasings."""
    patterns = (
        r"Phytophthora Root Rot\s*\(PRR\)\s*Source\s+(.+)",
        r"PRR Source\s*[:\-]?\s*(.+)",
        r"Phytophthora Gene\s*[:\-]?\s*(.+)",
    )
    for line in text.splitlines():
        line = line.strip()
        for p in patterns:
            m = re.match(p, line, re.I)
            if m:
                val = m.group(1).strip()
                # Trim trailing words that obviously aren't gene names
                # ("Source Rps1c, Rps3a Emergence 3" can run together).
                val = re.split(r"\s+(?:Emergence|Soybean|Standability|Root)\b", val, 1)[0].strip()
                if val and val.lower() not in ("-", "na", "n/a", "none"):
                    return val
    return None


def _extract_scn_source(text: str) -> str | None:
    """Soy: 'SCN Source' / 'Cyst Nematode Source' line, if present."""
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^(SCN Source|Cyst Nematode Source)\s*[:\-]?\s*(.+)$", line, re.I)
        if m:
            val = m.group(2).strip()
            if val and val != "-":
                return val
    return None


def _extract_scn_races(text: str) -> str | None:
    """Soy: 'Soybean Cyst Nematode (SCN) Races S' / 'R3' etc."""
    for line in text.splitlines():
        line = line.strip()
        m = re.match(
            r"^Soybean Cyst Nematode \(SCN\) Races\s+(.+)$", line, re.I,
        )
        if m:
            val = m.group(1).strip()
            if val:
                return val
    return None


# Soy agronomic ratings rendered as text "Label N" pairs in the PDF.
# These ARE extractable (unlike the bar charts).
_SOY_AGRO_LABELS = (
    "Emergence", "Standability", "Shatter Tolerance",
    "Green Stem", "% Protein at 13% mst.", "% Oil at 13% mst.",
)


def _extract_soy_agronomic_text(text: str) -> list[dict]:
    out: list[dict] = []
    for label in _SOY_AGRO_LABELS:
        # Allow trailing decimal for %Protein / %Oil; single digit
        # for the 1-9 ratings. The ``(?!\w)`` lookahead replaces a
        # trailing ``\b``, which could not match after a bare "-"
        # (no word character on either side), so "-" values were
        # silently dropped.
        m = re.search(
            rf"{re.escape(label)}\s+(\d+(?:\.\d+)?|-)(?!\w)",
            text,
        )
        if m:
            out.append({"characteristic": label, "value": m.group(1)})
    return out


# Soil-type adaptation lines on soy PDFs: "Drought Prone Best",
# "Narrow Rows Best", "High pH* Good", etc.
_SOY_SOIL_LABELS = (
    "Drought Prone", "Narrow Rows", "High pH",
    "Wide Rows", "Highly Productive",
    "Moderate/Variable Environments", "Poorly Drained",
)


def _extract_soy_soil_adaptation(text: str) -> list[dict]:
    out: list[dict] = []
    for label in _SOY_SOIL_LABELS:
        m = re.search(
            rf"{re.escape(label)}\*?\s+(Best|Good|Fair|Poor)\b",
            text,
        )
        if m:
            out.append({"characteristic": label, "value": m.group(1)})
    return out


def enrich_with_pdf(
    http: RateLimitedSession, prod: NKProduct
) -> None:
    """Fetch the tech-sheet PDF and add disease ratings + relevant
    soybean fields to ``prod.characteristics_groups``."""
    if not prod.techsheet_url:
        log.info("%s: no tech sheet URL - identity only", prod.source_key)
        return
    try:
        r = http.get(prod.techsheet_url)
        r.raise_for_status()
    except Exception as exc:  # noqa: BLE001
        log.warning("%s: PDF fetch failed (%s) - identity only",
                    prod.source_key, exc)
        return
    try:
        with pdfplumber.open(io.BytesIO(r.content)) as pdf:
            text = "\n".join((p.extract_text() or "") for p in pdf.pages)
    except Exception as exc:  # noqa: BLE001
        log.warning("%s: PDF parse failed (%s) - identity only",
                    prod.source_key, exc)
        return

    disease = _extract_disease_ratings(text)
    if disease:
        prod.characteristics_groups.append({
            "label": "DISEASE RATINGS",
            "type": "pdf-text",
            "items": disease,
        })

    if prod.crop == "soybeans":
        misc_items: list[dict] = []
        prr = _extract_phytophthora_genes(text)
        if prr:
            misc_items.append({"characteristic": "Phytophthora Gene", "value": prr})
        scn = _extract_scn_source(text)
        if scn:
            misc_items.append({"characteristic": "SCN Source", "value": scn})
        scn_races = _extract_scn_races(text)
        if scn_races:
            misc_items.append({"characteristic": "SCN Race Coverage", "value": scn_races})
        if misc_items:
            prod.characteristics_groups.append({
                "label": "DISEASE GENETICS",
                "type": "pdf-text",
                "items": misc_items,
            })

        soy_agro = _extract_soy_agronomic_text(text)
        if soy_agro:
            prod.characteristics_groups.append({
                "label": "AGRONOMIC TRAITS",
                "type": "pdf-text",
                "items": soy_agro,
            })

        soil = _extract_soy_soil_adaptation(text)
        if soil:
            prod.characteristics_groups.append({
                "label": "SOIL TYPE ADAPTATION",
                "type": "pdf-text",
                "items": soil,
            })

    # Surface labels for charted-only agronomic ratings so search_docs
    # can match queries like "drought" / "stalk strength". Values
    # aren't extractable via text (the source PDF renders them as bar
    # positions). We record only labels NOT already present in
    # text-extractable groups, with an explicit "rated in PDF chart"
    # value so the LLM directs the farmer to the tech sheet for those
    # numbers. (For soy this is mostly redundant, since text extraction
    # got the agronomic numbers, so we skip the chart-label group there.)
    if prod.crop == "corn":
        agronomic_labels_corn = (
            "Emergence", "Seedling Vigor", "Root Strength",
            "Stalk Strength", "Green Snap", "Staygreen",
            "Drydown", "Test Weight", "Drought",
        )
        # Skip any label already present with a non-empty value.
        already_rated = {
            it["characteristic"]
            for g in prod.characteristics_groups
            for it in g.get("items") or []
            if str(it.get("value", "")).strip()
        }
        present = [label for label in agronomic_labels_corn
                   if label in text and label not in already_rated]
        if present:
            prod.characteristics_groups.append({
                "label": "AGRONOMIC CHARACTERISTICS",
                "type": "pdf-chart",
                "items": [
                    {"characteristic": label,
                     "value": "rated in tech-sheet PDF chart (not text-extractable)"}
                    for label in present
                ],
            })


# --------------------------------------------------------------------- render


def render_markdown(p: NKProduct) -> str:
    title = p.product_code or p.source_key
    crop_label = "Corn" if p.crop == "corn" else "Soybeans"

    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** Syngenta",
        "- **Brand:** NK",
        f"- **Crop:** {crop_label}",
    ]
    if p.crop == "corn" and p.relative_maturity:
        head.append(f"- **Relative maturity:** {p.relative_maturity}")
    if p.crop == "soybeans" and p.maturity_group:
        head.append(f"- **Maturity group:** {p.maturity_group}")
    if p.brand_variants:
        head.append(f"- **Brand variants:** {', '.join(p.brand_variants)}")
    if p.trait_codes:
        head.append(f"- **Traits:** {', '.join(p.trait_codes)}")
    head.append(f"- **Catalog page:** {p.source_url}")
    if p.techsheet_url:
        head.append(f"- **Tech sheet (PDF):** {p.techsheet_url}")
    head.append(f"- **Rating scale (NK):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")

    sections: list[str] = []
    if p.positioning_statement:
        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
    if p.strengths:
        bullets = "\n".join(f"- {s}" for s in p.strengths)
        sections.append("## Strengths\n\n" + bullets + "\n")

    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    return "\n".join(head) + "\n".join(sections)


# --------------------------------------------------------------------- write


def write_product(prod: NKProduct, body_md: str) -> None:
    """Write the markdown body and its JSON sidecar under CORPUS_DIR."""
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"

    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "nk",
        "source_key": prod.source_key,
        "vendor": "Syngenta",
        "brand": "NK",
        "product_name": prod.product_code,
        "product_id": None,
        "hybrid_prefix": prod.product_code,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        "relative_maturity": prod.relative_maturity,
        "maturity_group": prod.maturity_group,
        "wheat_class": None,
        "trait_stack": prod.trait_codes,
        "trait_descriptions": prod.trait_descriptions,
        "brand_variants": prod.brand_variants,
        "positioning_statement": prod.positioning_statement,
        "strengths": prod.strengths,
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "techsheet_url": prod.techsheet_url,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline


def process_product(
    http: RateLimitedSession,
    prod: NKProduct,
    *,
    force: bool,
) -> str:
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    if md_path.exists() and not force:
        return "skipped"
    enrich_with_pdf(http, prod)
    body = render_markdown(prod)
    write_product(prod, body)
    return "written"


def run(
    *,
    limit: int | None,
    force: bool,
    only_crop: str | None,
    only_product: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_products(http, only_crop=only_crop)

    if only_product:
        targets = [
            p for p in targets
            if p.source_key == only_product
            or p.product_code.lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
    for prod in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        try:
            status = process_product(http, prod, force=force)
        except Exception as exc:  # noqa: BLE001
            log.error("%s failed: %s", prod.source_key, exc)
            status = "failed"
        counts[status] = counts.get(status, 0) + 1
        log.info(
            "[%d/%s] %s %s | crop=%s rm/mg=%s variants=%d traits=%s groups=%d",
            processed, str(limit) if limit else "all",
            prod.source_key, status, prod.crop,
            prod.relative_maturity or prod.maturity_group or "-",
            len(prod.brand_variants),
            ",".join(prod.trait_codes) or "-",
            len(prod.characteristics_groups),
        )

    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.nk",
        description="Scrape NK (Syngenta) corn + soybean varieties.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=("corn", "soybeans"),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or product code.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit,
        force=args.force,
        only_crop=args.crop,
        only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())