agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)
agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
explicit GET params (sidesteps the AJAX-only-on-load default that
returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
.field--node--tag-line--variety, .field--node--body, plus the
three rated sections (Agronomics / Grain / Disease) with their
<div class="row"><div class="label">label</div><div>value</div></div>
pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
— provides the Northern Plains HRS coverage WestBred lacks.
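
The explicit-GET-params approach from the first bullet can be sketched with the standard library alone. This is a minimal illustration rather than the scraper's own code: `build_list_url` is a hypothetical helper name, and the two parameter names are copied from the listing URL used in the diff.

```python
from urllib.parse import urlencode

BASE = "https://agriprowheat.com"
LIST_PATH = "/search-agripro-brand-varieties"


def build_list_url(title: str = "", variety_type: str = "All") -> str:
    """Hypothetical helper: spell out the Views filter values as GET params.

    Sending the filter values explicitly is what makes the server render
    the result rows instead of only the empty filter form.
    """
    query = urlencode({"title": title, "variety_type_value": variety_type})
    return f"{BASE}{LIST_PATH}?{query}"


print(build_list_url())
```

With the defaults, the helper reproduces the `?title=&variety_type_value=All` query string that the scraper hard-codes.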
nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
{"d": "<html>"} where the inner HTML is one <div class="sf-result">
per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
title <span>, "Brands Available" trait variants, marketing
positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
* corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
"Label N" lines (text-extractable)
* soybean Phytophthora source genes (Rps1c, Rps3a, ...)
* soybean SCN race coverage
* soybean agronomic ratings (Emergence, Standability, Shatter
Tolerance, Green Stem) with text-extractable 1-9 values
* soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
Drought) are not text-extractable; we record the labels with an
explicit "rated in PDF chart, see tech sheet" value so the agent
can direct the farmer at the source for those numbers.
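
A rough sketch of the "Label N" parsing and the chart-bar placeholder described above, using only `re`. The regex, the `parse_ratings` name, and the sample text are assumptions made for illustration; the real tech-sheet text layout may need a looser pattern.

```python
import re

# Assumed line shape: a label followed by one 1-9 digit, e.g. "Gray Leaf Spot 3".
RATING_LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z' /().-]*?)\s+(?P<value>[1-9])$")

# Corn labels that the commit message says are drawn as chart bars.
CHART_ONLY_LABELS = ("Emergence", "Stalk Strength", "Drought")
CHART_PLACEHOLDER = "rated in PDF chart, see tech sheet"


def parse_ratings(text: str) -> dict[str, str]:
    """Collect numeric ratings verbatim; mark chart-only labels with a placeholder."""
    ratings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = RATING_LINE.match(line)
        if match:
            # Keep the value as a string so it is stored exactly as printed.
            ratings[match.group("label")] = match.group("value")
        elif line in CHART_ONLY_LABELS:
            ratings[line] = CHART_PLACEHOLDER
    return ratings


# Invented sample text, shaped like the lines the commit message describes.
sample = "\n".join([
    "Gray Leaf Spot 3",
    "Goss's Wilt 4",
    "Emergence",
    "Stalk Strength",
    "1-9 Scale: 1 = Best, 9 = Worst",
])
print(parse_ratings(sample))
```

The footer line is skipped because it starts with a digit, and values stay as strings so nothing is reinterpreted at scrape time.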
Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
Beck's products — everything else IS now in the corpus.
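
To make the direction difference concrete, here is a small read-time comparison sketch. It assumes the four source names from the coverage table and the directions stated above; `resistance_rank` is a hypothetical helper, and stored ratings would stay verbatim as the scrapers write them.

```python
# Direction per corpus source, as stated in this commit message.
BEST_IS_LOW = {
    "nk": True,               # 1 = best
    "agripro": True,          # 1 = best
    "bayer_seeds": False,     # 9 = best
    "golden_harvest": False,  # 9 = best
}


def resistance_rank(source: str, rating: int) -> int:
    """Map a 1-9 rating to a common rank where 1 is always the most resistant."""
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of range: {rating}")
    # Flip only the sources whose published scale runs the other way.
    return rating if BEST_IS_LOW[source] else 10 - rating


# An AgriPro "1" and a Bayer "9" both mean top-tier resistance.
print(resistance_rank("agripro", 1), resistance_rank("bayer_seeds", 9))
```

The flip is applied only when comparing across vendors, so the corpus files never contain a converted number.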
Cross-vendor coverage after this PR: 760 varieties.
bayer_seeds     475  (DEKALB 288 / Asgrow 102 / WestBred 85)
golden_harvest  139
nk              122  (41 corn / 81 soy)
agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).
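
The totals above can be re-derived from the per-brand numbers. The sketch below copies the figures from this message; treating the 85 earlier wheat entries as the WestBred rows is an inference from the numbers, not something stated outright.

```python
# Per-source breakdown, copied from the coverage table in this commit message.
coverage = {
    "bayer_seeds": {"DEKALB": 288, "Asgrow": 102, "WestBred": 85},
    "golden_harvest": {"Golden Harvest": 139},
    "nk": {"corn": 41, "soy": 81},
    "agripro": {"HRS": 12, "SWW": 7, "HRW": 3, "HWS": 1, "Barley": 1},
}

per_source = {name: sum(parts.values()) for name, parts in coverage.items()}
total = sum(per_source.values())

# Varieties added by this commit: the two new Syngenta scrapers.
added = per_source["nk"] + per_source["agripro"]

# Assumption: the 85 pre-existing wheat entries are the WestBred rows.
wheat = coverage["bayer_seeds"]["WestBred"] + per_source["agripro"]

print(per_source)
print(total, added, wheat)  # 760 146 109
```

Note that the wheat figure of 109 counts the single barley entry from AgriPro.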
requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+488
-22
@@ -1,37 +1,503 @@
|
||||
"""AgriPro scraper (Syngenta wheat brand).
|
||||
"""AgriPro (Syngenta) wheat scraper.
|
||||
|
||||
Source: ``https://www.agriprowheat.com`` — Drupal Views form,
|
||||
server-rendered HTML. No headless browser needed.
|
||||
Source: ``agriprowheat.com`` — Drupal site, server-rendered HTML.
|
||||
robots.txt is empty (no Disallow).
|
||||
|
||||
Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
|
||||
plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
|
||||
under a separate brand and is out of scope for AgriPro.
|
||||
Expected count: 24 varieties spanning Hard Red Winter (HRW), Hard
|
||||
Red Spring (HRS), Hard White Spring (HWS), Soft White Winter (SWW),
|
||||
Soft White Spring (SWS), durum, and barley. NO SRW: Syngenta's Soft Red
|
||||
Winter sits at GrowProGenetics.com under a separate brand, out of
|
||||
scope for AgriPro.
|
||||
|
||||
Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
|
||||
implicit in product family naming, not always a separate field).
|
||||
Discovery: the variety listing at
|
||||
``/search-agripro-brand-varieties`` server-renders only the
|
||||
filter form; the actual variety rows are populated by a Drupal
|
||||
Views AJAX call. We sidestep the AJAX by passing the filter values
|
||||
as GET params on the same path:
|
||||
|
||||
Schema notes:
|
||||
- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
|
||||
- ``relative_maturity`` and ``maturity_group`` are null for wheat
|
||||
- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
|
||||
Septoria / tan spot
|
||||
- Quality: test weight, protein, falling number, straw strength
|
||||
/search-agripro-brand-varieties?title=&variety_type_value=All
|
||||
|
||||
TODO: implement.
|
||||
That returns the fully-rendered list (24 rows in
|
||||
``.block-views-blockvarieties-search-varieties-search-block``) with
|
||||
links to ``/variety/<slug>`` pages.
|
||||
|
||||
Per-variety detail comes from the variety page HTML. Useful fields:
|
||||
|
||||
- ``<h1>`` — product name (e.g. "AP Exceed")
|
||||
- ``.field--node--variety-type--variety`` — wheat class
|
||||
("Soft White Winter", "Hard Red Spring", etc.)
|
||||
- ``.field--node--tag-line--variety`` — short positioning slogan
|
||||
- ``.field--node--body`` — full positioning narrative
|
||||
- Three sections delimited by ``<h3>``: Agronomics / Grain /
|
||||
Disease, each containing ``.row`` divs with
|
||||
``<div class="label">…</div><div>…</div>`` pairs.
|
||||
|
||||
**Rating-scale direction**: AgriPro publishes disease tolerance on a
|
||||
1-9 scale where **1 = best (most resistant)** — REVERSED from
|
||||
Bayer's and Golden Harvest's "9 = best" convention. The chunker
|
||||
preserves values verbatim and the sidecar's ``_scale_direction``
|
||||
field declares the direction, so the LLM's chunk-preamble framing
|
||||
will correctly say "(1 = best)" — anti-hallucination guarantee
|
||||
holds even across vendors with opposite scales.
|
||||
|
||||
(Agronomic ratings on AgriPro are qualitative — "Excellent / Very
|
||||
Good / Good / Fair / Poor" — and don't have a numeric direction
|
||||
issue. They're preserved verbatim.)
|
||||
|
||||
Output:
|
||||
corpus/agripro/<source_key>.md
|
||||
corpus/agripro/<source_key>.json
|
||||
|
||||
source_key convention: ``agripro-<slug>`` lowercased, e.g.
|
||||
``agripro-ap-exceed`` or ``agripro-sy-assure``.
|
||||
|
||||
CLI:
|
||||
python -m scrape.sources.agripro --limit 5
|
||||
python -m scrape.sources.agripro --force
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
SCRAPER_VERSION = "0.1.0"
|
||||
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
|
||||
BASE = "https://agriprowheat.com"
|
||||
LIST_URL = f"{BASE}/search-agripro-brand-varieties?title=&variety_type_value=All"
|
||||
|
||||
# AgriPro disease ratings: 1-9, LOWER number = MORE resistant. This
|
||||
# is the inverse of Bayer/Golden-Harvest's 1-9 (9 = best) convention.
|
||||
# Document this in the sidecar so the chunker / LLM never mis-renders.
|
||||
RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"
|
||||
|
||||
# Class abbreviations for the wheat_class field. AgriPro renders the
# full English name; we map it to the canonical short form the rest
# of the corpus uses (matches schema notes in seed-mcp/CLAUDE.md).
WHEAT_CLASS_MAP = {
    "hard red winter": "HRW",
    "hard red spring": "HRS",
    "hard white spring": "HWS",
    "hard white winter": "HWW",
    "soft white winter": "SWW",
    "soft white spring": "SWS",
    "soft red winter": "SRW",
    "durum": "Durum",
}
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||||
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
|
||||
CORPUS_DIR = CORPUS_ROOT / "agripro"
|
||||
|
||||
REQ_INTERVAL_SEC = 1.0
|
||||
|
||||
log = logging.getLogger("scrape.agripro")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- HTTP
|
||||
|
||||
|
||||
class RateLimitedSession:
|
||||
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
|
||||
self.s = requests.Session()
|
||||
self.s.headers["User-Agent"] = USER_AGENT
|
||||
self.interval = interval
|
||||
self._last = 0.0
|
||||
|
||||
def _wait(self) -> None:
|
||||
delta = time.monotonic() - self._last
|
||||
if delta < self.interval:
|
||||
time.sleep(self.interval - delta)
|
||||
self._last = time.monotonic()
|
||||
|
||||
def request(
|
||||
self,
|
||||
method: str,
|
||||
url: str,
|
||||
*,
|
||||
max_retries: int = 4,
|
||||
timeout: float = 30.0,
|
||||
**kw: Any,
|
||||
) -> requests.Response:
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(max_retries):
|
||||
self._wait()
|
||||
try:
|
||||
resp = self.s.request(method, url, timeout=timeout, **kw)
|
||||
except requests.RequestException as exc:
|
||||
last_exc = exc
|
||||
backoff = min(30.0, (2 ** attempt) + random.random())
|
||||
log.warning("network error on %s %s: %s — retry in %.1fs",
|
||||
method, url, exc, backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
if resp.status_code == 429 or 500 <= resp.status_code < 600:
|
||||
ra = resp.headers.get("Retry-After")
|
||||
backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
|
||||
log.warning("HTTP %d on %s %s — retry in %.1fs",
|
||||
resp.status_code, method, url, backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
return resp
|
||||
if last_exc:
|
||||
raise last_exc
|
||||
return resp # type: ignore[return-value]
|
||||
|
||||
def get(self, url: str, **kw: Any) -> requests.Response:
|
||||
return self.request("GET", url, **kw)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- model
|
||||
|
||||
|
||||
@dataclass
class APProduct:
    source_key: str
    source_url: str
    product_name: str = ""
    wheat_class: str | None = None
    positioning_statement: str | None = None
    tagline: str | None = None
    characteristics_groups: list[dict] = field(default_factory=list)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- discovery
|
||||
|
||||
|
||||
def discover_varieties(http: RateLimitedSession) -> list[str]:
    """Fetch the variety-search page and return the list of
    ``/variety/<slug>`` URLs found in it.

    Dedupes per-row twice-listed links (the row's hero image link
    and its "view full details" link both point to the same place).
    """
    log.info("fetching variety list %s", LIST_URL)
    r = http.get(LIST_URL)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    urls: list[str] = []
    seen: set[str] = set()
    for a in soup.find_all("a", href=re.compile(r"^/variety/")):
        h = a["href"]
        if h in seen:
            continue
        seen.add(h)
        urls.append(BASE + h)
    log.info("variety URLs found: %d", len(urls))
    return urls
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- helpers
|
||||
|
||||
|
||||
def source_key_for(url: str) -> str:
    """``/variety/ap-exceed`` → ``agripro-ap-exceed``."""
    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
    return f"agripro-{tail}"


def normalize_wheat_class(raw: str | None) -> str | None:
    if not raw:
        return None
    key = raw.strip().lower()
    return WHEAT_CLASS_MAP.get(key, raw.strip())
|
||||
|
||||
|
||||
def _rows_in_section(soup: BeautifulSoup, h3_text: str) -> list[dict]:
    """Walk the variety page for the section heading matching
    ``h3_text``, then collect every ``.row`` inside the same
    container. Returns ``[{characteristic, value}, ...]``."""
    items: list[dict] = []
    for h3 in soup.find_all("h3"):
        if h3.get_text(strip=True).lower() != h3_text.lower():
            continue
        # Walk up to the enclosing section (the parent that scopes
        # the .row siblings of the h3). The simplest reliable scope:
        # the row siblings within the immediate parent.
        parent = h3.parent
        if parent is None:
            continue
        for row in parent.find_all(class_="row"):
            label_el = row.find(class_="label")
            if not label_el:
                continue
            label = label_el.get_text(" ", strip=True)
            # The value is whatever <div> sibling follows the label
            # (NOT the .label div itself).
            value: str | None = None
            for child in row.find_all("div"):
                if "label" in (child.get("class") or []):
                    continue
                # First non-label <div> with non-empty text wins.
                t = child.get_text(" ", strip=True)
                if t:
                    value = t
                    break
            if label and value:
                items.append({"characteristic": label, "value": value})
        break
    return items
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- detail
|
||||
|
||||
|
||||
def fetch_product_detail(
|
||||
http: RateLimitedSession, url: str
|
||||
) -> APProduct | None:
|
||||
r = http.get(url)
|
||||
if r.status_code == 404:
|
||||
return None
|
||||
r.raise_for_status()
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
|
||||
prod = APProduct(
|
||||
source_key=source_key_for(url),
|
||||
source_url=url,
|
||||
)
|
||||
|
||||
h1 = soup.find("h1")
|
||||
if h1:
|
||||
prod.product_name = h1.get_text(strip=True)
|
||||
|
||||
vt = soup.find(class_="field--node--variety-type--variety")
|
||||
if vt:
|
||||
prod.wheat_class = normalize_wheat_class(vt.get_text(strip=True))
|
||||
|
||||
tl = soup.find(class_="field--node--tag-line--variety")
|
||||
if tl:
|
||||
prod.tagline = tl.get_text(strip=True) or None
|
||||
|
||||
# Body text — the long-form positioning narrative.
|
||||
body = soup.find(class_=re.compile(r"field--node--body"))
|
||||
if body:
|
||||
prod.positioning_statement = body.get_text(" ", strip=True) or None
|
||||
|
||||
# Tagline alone if no body — better than nothing.
|
||||
if not prod.positioning_statement and prod.tagline:
|
||||
prod.positioning_statement = prod.tagline
|
||||
|
||||
# The three rated sections on every variety page.
|
||||
groups: list[dict] = []
|
||||
for label, h3 in (
|
||||
("AGRONOMICS", "Agronomics"),
|
||||
("GRAIN", "Grain"),
|
||||
("DISEASE RATINGS", "Disease"),
|
||||
):
|
||||
items = _rows_in_section(soup, h3)
|
||||
if items:
|
||||
groups.append({"label": label, "type": "fields", "items": items})
|
||||
prod.characteristics_groups = groups
|
||||
|
||||
return prod
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- render
|
||||
|
||||
|
||||
def render_markdown(p: APProduct) -> str:
|
||||
title = p.product_name or p.source_key
|
||||
head: list[str] = [
|
||||
f"# {title}",
|
||||
"",
|
||||
"- **Vendor:** Syngenta",
|
||||
"- **Brand:** AgriPro",
|
||||
"- **Crop:** Wheat",
|
||||
]
|
||||
if p.wheat_class:
|
||||
head.append(f"- **Wheat class:** {p.wheat_class}")
|
||||
if p.tagline:
|
||||
head.append(f"- **Tagline:** {p.tagline}")
|
||||
head.append(f"- **Source:** {p.source_url}")
|
||||
head.append(f"- **Rating scale (AgriPro):** {RATING_SCALE_DIRECTION}")
|
||||
head.append("")
|
||||
head.append("---")
|
||||
head.append("")
|
||||
|
||||
sections: list[str] = []
|
||||
if p.positioning_statement and p.positioning_statement != p.tagline:
|
||||
sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
|
||||
|
||||
for g in p.characteristics_groups:
|
||||
label = (g.get("label") or "Characteristics").title()
|
||||
items = g.get("items") or []
|
||||
if not items:
|
||||
continue
|
||||
rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
|
||||
sections.append(
|
||||
f"## {label}\n\n"
|
||||
"| Characteristic | Value |\n"
|
||||
"|---|---|\n"
|
||||
f"{rows}\n"
|
||||
)
|
||||
return "\n".join(head) + "\n" + "\n".join(sections)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- write
|
||||
|
||||
|
||||
def write_product(prod: APProduct, body_md: str) -> None:
|
||||
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
md_path = CORPUS_DIR / f"{prod.source_key}.md"
|
||||
json_path = CORPUS_DIR / f"{prod.source_key}.json"
|
||||
|
||||
md_path.write_text(body_md, encoding="utf-8")
|
||||
sidecar = {
|
||||
"source": "agripro",
|
||||
"source_key": prod.source_key,
|
||||
"vendor": "Syngenta",
|
||||
"brand": "AgriPro",
|
||||
"product_name": prod.product_name,
|
||||
"product_id": None,
|
||||
"hybrid_prefix": prod.product_name,
|
||||
"hybrid_suffix": None,
|
||||
"crop": "wheat",
|
||||
"release_year": None,
|
||||
"relative_maturity": None,
|
||||
"maturity_group": None,
|
||||
"wheat_class": prod.wheat_class,
|
||||
"trait_stack": [],
|
||||
"trait_descriptions": [],
|
||||
"positioning_statement": prod.positioning_statement,
|
||||
"tagline": prod.tagline,
|
||||
"strengths": [],
|
||||
"characteristics_groups": prod.characteristics_groups,
|
||||
# AgriPro's reversed direction is the load-bearing field here:
|
||||
# any cross-vendor disease-resistance comparison MUST consult
|
||||
# this before interpreting values. The chunker reads it; the
|
||||
# api_lessons file's rating-scales section documents the
|
||||
# convention.
|
||||
"_scale_direction": RATING_SCALE_DIRECTION,
|
||||
"regional_recommendations": [],
|
||||
"image_url": None,
|
||||
"source_urls": [prod.source_url],
|
||||
"sitemap_last_modified": None,
|
||||
"fetched_at": datetime.now(timezone.utc).isoformat(),
|
||||
"scraper_version": SCRAPER_VERSION,
|
||||
}
|
||||
json_path.write_text(
|
||||
json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- pipeline
|
||||
|
||||
|
||||
def process_product(
|
||||
http: RateLimitedSession,
|
||||
*,
|
||||
url: str,
|
||||
force: bool,
|
||||
) -> tuple[str, APProduct | None]:
|
||||
source_key = source_key_for(url)
|
||||
md_path = CORPUS_DIR / f"{source_key}.md"
|
||||
if md_path.exists() and not force:
|
||||
return "skipped", None
|
||||
try:
|
||||
prod = fetch_product_detail(http, url)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
log.error("detail fetch failed for %s: %s", url, exc)
|
||||
return "failed", None
|
||||
if prod is None:
|
||||
return "missing", None
|
||||
body = render_markdown(prod)
|
||||
write_product(prod, body)
|
||||
return "written", prod
|
||||
|
||||
|
||||
def run(
|
||||
*,
|
||||
limit: int | None,
|
||||
force: bool,
|
||||
only_product: str | None,
|
||||
) -> int:
|
||||
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
http = RateLimitedSession()
|
||||
targets = discover_varieties(http)
|
||||
if only_product:
|
||||
targets = [
|
||||
u for u in targets
|
||||
if source_key_for(u) == only_product
|
||||
or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product.lower()
|
||||
]
|
||||
if not targets:
|
||||
log.error("no variety matched --product=%s", only_product)
|
||||
return 2
|
||||
|
||||
counts = {"written": 0, "skipped": 0, "missing": 0, "failed": 0}
|
||||
processed = 0
|
||||
for url in targets:
|
||||
if limit is not None and processed >= limit:
|
||||
break
|
||||
processed += 1
|
||||
status, prod = process_product(http, url=url, force=force)
|
||||
counts[status] = counts.get(status, 0) + 1
|
||||
if prod is not None:
|
||||
log.info(
|
||||
"[%d/%s] %s %s | class=%s groups=%d",
|
||||
processed, str(limit) if limit else "all",
|
||||
prod.source_key, status,
|
||||
prod.wheat_class or "-",
|
||||
len(prod.characteristics_groups),
|
||||
)
|
||||
else:
|
||||
log.info("[%d/%s] %s %s",
|
||||
processed, str(limit) if limit else "all",
|
||||
source_key_for(url), status)
|
||||
|
||||
log.info(
|
||||
"done: processed=%d written=%d skipped=%d missing=%d failed=%d (of %d candidates)",
|
||||
processed, counts["written"], counts["skipped"],
|
||||
counts["missing"], counts["failed"], len(targets),
|
||||
)
|
||||
return 0 if counts["failed"] == 0 else 1
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- CLI
|
||||
|
||||
|
||||
def _build_argparser() -> argparse.ArgumentParser:
|
||||
p = argparse.ArgumentParser(
|
||||
prog="scrape.sources.agripro",
|
||||
description="Scrape AgriPro (Syngenta) wheat varieties.",
|
||||
)
|
||||
p.add_argument("--limit", type=int, default=None,
|
||||
help="Stop after processing N varieties (default: all).")
|
||||
p.add_argument("--force", action="store_true",
|
||||
help="Re-fetch even if the markdown file already exists.")
|
||||
p.add_argument("--product", default=None,
|
||||
help="Process a single variety by source_key or URL tail.")
|
||||
p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
|
||||
return p
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
print("agripro: deferred — Drupal Views form, only wheat in the corpus, no SRW (separate brand). See reference_seed_vendor_recon.md.",
|
||||
file=sys.stderr)
|
||||
# Return 0 so the monthly CI workflow doesn't fail when this
|
||||
# source is listed but not yet implemented. Real implementation
|
||||
# will return 0 on success / 1 on failure.
|
||||
return 0
|
||||
args = _build_argparser().parse_args(argv)
|
||||
logging.basicConfig(
|
||||
level=args.log_level.upper(),
|
||||
format="%(asctime)s %(levelname)s %(name)s %(message)s",
|
||||
stream=sys.stderr,
|
||||
)
|
||||
return run(
|
||||
limit=args.limit,
|
||||
force=args.force,
|
||||
only_product=args.product,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main(sys.argv[1:]))
|
||||
sys.exit(main())
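
The NK endpoint described in the commit message wraps its HTML in ASP.NET's `{"d": "..."}` JSON envelope. A minimal unwrapping sketch follows, assuming the cards are separated by ` @ ` as the NK module docstring states; `unwrap_cards` and the sample markup are invented for illustration, and the scraper itself hands the HTML to BeautifulSoup.

```python
import json


def unwrap_cards(response_text: str) -> list[str]:
    """Pull the HTML blob out of the {"d": ...} envelope and split it into cards."""
    payload = json.loads(response_text)
    blob = payload["d"]  # ASP.NET page methods return their result under "d"
    return [
        chunk.strip()
        for chunk in blob.split(" @ ")
        if 'class="sf-result"' in chunk  # drop any fragment that is not a card
    ]


# Invented sample response; real cards carry titles, traits, and PDF links.
sample_response = json.dumps({
    "d": '<div class="sf-result">NK8005</div> @ <div class="sf-result">NK008-P8XF</div>'
})

cards = unwrap_cards(sample_response)
print(len(cards))
```

Each returned string is one card's HTML, ready to be parsed on its own.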
|
||||
|
||||
+812
-22
@@ -1,38 +1,828 @@
|
||||
"""NK scraper (Syngenta brand).
|
||||
"""NK (Syngenta) seed scraper — corn + soybeans.
|
||||
|
||||
Source: ``https://www.syngenta-us.com`` — static HTML product pages
|
||||
plus tech-sheet PDFs on the Syngenta CDN at
|
||||
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
|
||||
Source: ``syngenta-us.com`` — ASP.NET WebForms catalog with an
|
||||
ASMX-style JSON endpoint for the seed-finder UI, plus tech-sheet
|
||||
PDFs on the Syngenta CDN at
|
||||
``assets.syngenta-us.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
|
||||
|
||||
Expected count: 29 varieties (12 corn + 17 soy). No wheat.
|
||||
Expected count: 122 varieties (41 corn + 81 soy in the current NK seed
finder). The earlier figure of 29 (12 corn + 17 soy on 2026-05-25) is
outdated. No wheat.
|
||||
|
||||
The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
|
||||
``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
|
||||
helper module under ``scrape.sources._syngenta_pdf`` once both
|
||||
scrapers are written.
|
||||
Discovery: the HTML catalog pages (``/corn/nk/products``,
|
||||
``/soybeans/nk/products``) load product cards via JS. The JS calls
|
||||
|
||||
Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
|
||||
have marketing copy only). Use pdfplumber for table extraction.
|
||||
POST /NKSeeds/CornProductFinder.aspx/GetProducts
|
||||
POST /NKSeeds/SoyProductFinder.aspx/GetProducts
|
||||
|
||||
Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
|
||||
etc. — additional supplemental context worth ingesting once the
|
||||
per-variety scrape is solid.
|
||||
Both endpoints return ASP.NET's ``{"d": "..."}`` wrapper where ``d``
|
||||
is a string of HTML fragments separated by `` @ `` containing one
|
||||
``<div class="sf-result">`` per variety. Each card carries:
|
||||
|
||||
TODO: implement.
|
||||
- product code (e.g. ``NK8005`` / ``NK008-P8XF``)
|
||||
- RM days (corn) / MG decimal (soy) in a ``<span>`` next to the
|
||||
title
|
||||
- "Brands Available" line listing trait variants
|
||||
(NK8005-V, NK8005-GT/LL — these are trait-specific SKUs)
|
||||
- positioning slogan + bullet-list of strengths
|
||||
- tech-sheet PDF URL
|
||||
|
||||
Per-variety disease ratings live ONLY in the PDF tech sheets (the
|
||||
HTML cards have marketing text but no rating numbers). We extract
|
||||
disease ratings via ``pdfplumber`` text extraction — they appear as
|
||||
"Label Number" lines that we parse with a regex.
|
||||
|
||||
**Rating-scale direction**: NK explicitly publishes
|
||||
``1-9 Scale: 1 = Best, Tallest or Highest; 9 = Worst, Shortest or
|
||||
Lowest`` on every tech sheet — REVERSED from Bayer/Golden Harvest.
|
||||
The chunker preserves values verbatim and the sidecar's
|
||||
``_scale_direction`` field declares this so the LLM correctly
|
||||
interprets the chunk preamble.
|
||||
|
||||
**Agronomic ratings**: rendered as horizontal bar charts in the
|
||||
PDF; pdfplumber's text extraction captures the LABELS (Emergence,
|
||||
Stalk Strength, Drought, etc.) but NOT the bar values. Surfacing
|
||||
those would require either OCR of the bar positions or pdfplumber's
|
||||
geometric layout parsing — deferred. For now the chunk records the
|
||||
labels and an explicit "agronomic ratings rendered as chart bars in
|
||||
the source PDF — values not currently extracted" annotation so the
|
||||
agent knows to direct the farmer at the tech-sheet PDF for those
|
||||
numbers.
|
||||
|
||||
Tech-sheet PDF URLs are taken from the API response rather than built
from the product code, because the assets-host filenames include a
YYMMDD stamp that changes.
|
||||
|
||||
Output:
|
||||
corpus/nk/<source_key>.md
|
||||
corpus/nk/<source_key>.json
|
||||
|
||||
source_key convention: ``nk-<code>`` lowercased, e.g.
|
||||
``nk-nk8005`` or ``nk-nk008-p8xf``.
|
||||
|
||||
CLI:
|
||||
python -m scrape.sources.nk --limit 5
|
||||
python -m scrape.sources.nk --crop corn --limit 12
|
||||
python -m scrape.sources.nk --force
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import io
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
import pdfplumber
|
||||
|
||||
SCRAPER_VERSION = "0.1.0"
|
||||
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
|
||||
BASE = "https://www.syngenta-us.com"
|
||||
CORN_LIST_URL = f"{BASE}/corn/nk/products"
|
||||
SOY_LIST_URL = f"{BASE}/soybeans/nk/products"
|
||||
CORN_API = f"{BASE}/NKSeeds/CornProductFinder.aspx/GetProducts"
|
||||
SOY_API = f"{BASE}/NKSeeds/SoyProductFinder.aspx/GetProducts"
|
||||
|
||||
# NK + AgriPro both use the "1 = best, lower = more resistant" convention.
|
||||
# Confirmed by tech-sheet footer: "1-9 Scale: 1 = Best...; 9 = Worst..."
|
||||
RATING_SCALE_DIRECTION = "1-9 (1 = best, lower = more resistant)"
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||||
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
|
||||
CORPUS_DIR = CORPUS_ROOT / "nk"
|
||||
|
||||
REQ_INTERVAL_SEC = 1.0
|
||||
|
||||
log = logging.getLogger("scrape.nk")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- HTTP
|
||||
|
||||
|
||||
class RateLimitedSession:
|
||||
def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
|
||||
self.s = requests.Session()
|
||||
self.s.headers["User-Agent"] = USER_AGENT
|
||||
self.interval = interval
|
||||
self._last = 0.0
|
||||
|
||||
def _wait(self) -> None:
|
||||
delta = time.monotonic() - self._last
|
||||
if delta < self.interval:
|
||||
time.sleep(self.interval - delta)
|
||||
self._last = time.monotonic()
|
||||
|
||||
def request(
|
||||
self,
|
||||
method: str,
|
||||
url: str,
|
||||
*,
|
||||
max_retries: int = 4,
|
||||
timeout: float = 30.0,
|
||||
**kw: Any,
|
||||
) -> requests.Response:
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(max_retries):
|
||||
self._wait()
|
||||
try:
|
||||
resp = self.s.request(method, url, timeout=timeout, **kw)
|
||||
except requests.RequestException as exc:
|
||||
last_exc = exc
|
||||
backoff = min(30.0, (2 ** attempt) + random.random())
|
||||
log.warning("network error on %s %s: %s — retry in %.1fs",
|
||||
method, url, exc, backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
if resp.status_code == 429 or 500 <= resp.status_code < 600:
|
||||
ra = resp.headers.get("Retry-After")
|
||||
backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
|
||||
log.warning("HTTP %d on %s %s — retry in %.1fs",
|
||||
resp.status_code, method, url, backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
return resp
|
||||
if last_exc:
|
||||
raise last_exc
|
||||
return resp # type: ignore[return-value]
|
||||
|
||||
def get(self, url: str, **kw: Any) -> requests.Response:
|
||||
return self.request("GET", url, **kw)
|
||||
|
||||
def post(self, url: str, **kw: Any) -> requests.Response:
|
||||
return self.request("POST", url, **kw)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- model
|
||||
|
||||
|
||||
@dataclass
class NKProduct:
    source_key: str
    source_url: str  # the brand catalog page (closest thing to a per-variety URL)
    crop: str  # "corn" | "soybeans"
    product_code: str = ""  # NK8005 / NK008-P8XF
    relative_maturity: str | None = None  # corn
    maturity_group: str | None = None  # soy
    brand_variants: list[str] = field(default_factory=list)  # ["NK8005-V", "NK8005-GT/LL"]
    trait_codes: list[str] = field(default_factory=list)
    trait_descriptions: list[str] = field(default_factory=list)
    positioning_statement: str | None = None
    strengths: list[str] = field(default_factory=list)
    techsheet_url: str | None = None
    characteristics_groups: list[dict] = field(default_factory=list)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- discovery
|
||||
|
||||
|
||||
def _api_payload_corn(rm_low: str, rm_high: str) -> str:
|
||||
"""Payload for ``CornProductFinder.aspx/GetProducts``."""
|
||||
return json.dumps({
|
||||
"cornCount": "1",
|
||||
"rmLowerRange": rm_low,
|
||||
"rmUpperRange": rm_high,
|
||||
"brands": "NK",
|
||||
"agisuraTraits": "",
|
||||
"insectResistance": "",
|
||||
"herbicideTolerance": "",
|
||||
"waterOptimization": "",
|
||||
"reducedRefuge": "",
|
||||
"diseaseResistence": "",
|
||||
"silage": "",
|
||||
"path": "false",
|
||||
"currentUrl": CORN_LIST_URL,
|
||||
"fieldForged": "",
|
||||
"newProduct": "",
|
||||
})
|
||||
|
||||
|
||||
def _api_payload_soy(rm_low: str, rm_high: str) -> str:
|
||||
return json.dumps({
|
||||
"soyaBeanCount": "1",
|
||||
"rmLowerRange": rm_low,
|
||||
"rmUpperRange": rm_high,
|
||||
"herbicideTolerance": "",
|
||||
"diseaseFilter": "",
|
||||
"nematodeFilter": "",
|
||||
"agroPlantCharFilter": "",
|
||||
"plantHeightFilter": "",
|
||||
"brands": "NK",
|
||||
"browserURL": SOY_LIST_URL,
|
||||
"fieldForged": "",
|
||||
"newProduct": "",
|
||||
})
|
||||
|
||||
|
||||
def _parse_card(html_chunk: str, crop: str) -> NKProduct | None:
|
||||
"""Parse one ``<div class="sf-result">`` card from the API
|
||||
response into an NKProduct."""
|
||||
soup = BeautifulSoup(html_chunk, "html.parser")
|
||||
title_el = soup.find(class_="sf-result-title")
|
||||
if not title_el:
|
||||
return None
|
||||
# Title contains code + RM <span> tail
|
||||
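first = title_el.contents[0] if title_el.contents else ""
# Only a leading text node carries the code; a leading tag has no usable .strip().
code = first.strip() if isinstance(first, str) else ""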
|
||||
if not code:
|
||||
return None
|
||||
rm_str: str | None = None
|
||||
span = title_el.find("span")
|
||||
if span:
|
||||
# span text is like "RM\n80" — strip to digits/decimal
|
||||
text = span.get_text(" ", strip=True)
|
||||
m = re.search(r"(\d+(?:\.\d+)?)", text)
|
||||
if m:
|
||||
rm_str = m.group(1)
|
||||
|
||||
prod = NKProduct(
|
||||
source_key=f"nk-{code.lower()}",
|
||||
# NK doesn't expose per-variety URLs; the brand catalog is the
|
||||
# nearest equivalent. lookup_variety / get_page will still work
|
||||
# via source_key.
|
||||
source_url=CORN_LIST_URL if crop == "corn" else SOY_LIST_URL,
|
||||
crop=crop,
|
||||
product_code=code,
|
||||
)
|
||||
if rm_str is not None:
|
||||
if crop == "corn":
|
||||
prod.relative_maturity = rm_str
|
||||
else:
|
||||
prod.maturity_group = rm_str
|
||||
|
||||
# Brands Available (trait variants).
|
||||
inner = soup.find(class_="sf-result-content-inner")
|
||||
if inner:
|
||||
# The first <strong> with "Brands available:" or
|
||||
# "Herbicide Tolerant Trait(s):" sets the trait context.
|
||||
for strong in inner.find_all("strong"):
|
||||
text = strong.get_text(" ", strip=True)
|
||||
if text.lower().startswith("brands available"):
|
||||
rest = text.split(":", 1)[1] if ":" in text else ""
|
||||
for v in rest.split("|"):
|
||||
v = v.strip()
|
||||
if v:
|
||||
prod.brand_variants.append(v)
|
||||
elif text.lower().startswith("herbicide tolerant trait"):
|
||||
rest = text.split(":", 1)[1] if ":" in text else ""
|
||||
for t in rest.split(","):
|
||||
t = t.strip()
|
||||
if t:
|
||||
prod.trait_codes.append(t)
|
||||
else:
|
||||
# Positioning slogan is also rendered as a bare <strong>.
|
||||
if not prod.positioning_statement and len(text) > 12:
|
||||
prod.positioning_statement = text
|
||||
|
||||
# Bullet strengths
|
||||
ul = inner.find("ul")
|
||||
if ul:
|
||||
for li in ul.find_all("li"):
|
||||
t = li.get_text(" ", strip=True)
|
||||
if t:
|
||||
prod.strengths.append(t)
|
||||
|
||||
# Tech-sheet PDF URL.
|
||||
for a in soup.find_all("a", href=True):
|
||||
h = a["href"]
|
||||
if "assets.syngenta-us.com/pdf/techsheets/" in h and h.lower().endswith(".pdf"):
|
||||
prod.techsheet_url = h
|
||||
break
|
||||
|
||||
return prod
|
||||
|
||||
|
||||
def discover_products(
    http: RateLimitedSession,
    *,
    only_crop: str | None = None,
) -> list[NKProduct]:
    """Hit the corn + soy product-finder APIs and parse the returned
    HTML cards into NKProducts. Returns identity-level data only;
    ratings come from the per-variety tech-sheet PDF in
    ``enrich_with_pdf``."""
    # Warm the session cookie (some Syngenta deployments need it).
    http.get(CORN_LIST_URL)

    out: list[NKProduct] = []
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "X-Requested-With": "XMLHttpRequest",
    }

    def _parse_response(html_blob: str, crop: str) -> int:
        """Parse the API response's inner HTML into NKProducts.

        The endpoint emits one ``<div class="sf-result">`` per variety,
        each wrapped in a ``<div class="col-md-6">`` column. Remove the
        ``@`` markers and let BeautifulSoup tokenize the whole blob,
        with no per-chunk split (the API doesn't actually delimit
        with ``@`` reliably, despite appearances).
        """
        n = 0
        # Remove the " @ " noise (rendered by the JS when filters
        # change, not a structural delimiter). Note that this drops
        # every "@" in the blob, not only the leading ones.
        cleaned = html_blob.replace("@", "").strip()
        soup = BeautifulSoup(cleaned, "html.parser")
        for card in soup.find_all("div", class_="sf-result"):
            prod = _parse_card(str(card), crop)
            if prod:
                out.append(prod)
                n += 1
        return n

    if only_crop in (None, "corn"):
        log.info("fetching NK corn product list")
        r = http.post(
            CORN_API,
            data=_api_payload_corn("75", "120"),
            headers={**headers, "Referer": CORN_LIST_URL},
        )
        r.raise_for_status()
        n = _parse_response(r.json().get("d") or "", "corn")
        log.info("corn cards parsed: %d", n)

    if only_crop in (None, "soybeans"):
        log.info("fetching NK soy product list")
        r = http.post(
            SOY_API,
            data=_api_payload_soy("0", "9.9"),
            headers={**headers, "Referer": SOY_LIST_URL},
        )
        r.raise_for_status()
        n = _parse_response(r.json().get("d") or "", "soybeans")
        log.info("soy cards parsed: %d", n)

    log.info("total: %d NK varieties", len(out))
    return out


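A rough, standard-library-only sketch of the envelope handling in discover_products, for readers who want to sanity-check a captured response without BeautifulSoup installed. It assumes the body has the {"d": "<html>"} shape described in the commit message; CardCounter, count_cards, the sample body, and the second product code are all made up for illustration and are not part of the scraper.

```python
import json
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Count <div> start tags whose class list contains "sf-result"."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # Compare whole class tokens so "sf-result-title" is not counted.
        classes = (dict(attrs).get("class") or "").split()
        if "sf-result" in classes:
            self.count += 1


def count_cards(response_body: str) -> int:
    """Unwrap the {"d": "<html>"} envelope and count the cards inside."""
    inner_html = json.loads(response_body).get("d") or ""
    counter = CardCounter()
    counter.feed(inner_html.replace("@", "").strip())
    counter.close()
    return counter.count


# Hypothetical response body; NK0000 is a placeholder product code.
sample = json.dumps({
    "d": '@<div class="col-md-6"><div class="sf-result">'
         '<div class="sf-result-title">NK8005 <span>RM 80</span></div>'
         '</div></div>'
         '@<div class="col-md-6"><div class="sf-result">'
         '<div class="sf-result-title">NK0000 <span>RM 95</span></div>'
         '</div></div>'
})
print(count_cards(sample))
```

A count that differs from the "cards parsed" log line would suggest that _parse_card rejected some cards (for example, a missing title element) rather than that the endpoint returned fewer of them.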
# --------------------------------------------------------------------- PDF


def _extract_disease_ratings(text: str) -> list[dict]:
    """Pull disease-tolerance ratings out of the tech-sheet PDF text.

    The PDF renders disease ratings as a left-column-label /
    right-column-number layout. pdfplumber's ``extract_text``
    interleaves the agronomic-chart labels (no number) with the
    disease-rating labels + numbers, so we just look for lines ending
    in a numeric rating or a literal ``-`` (not available).

    Returns a list of ``{characteristic, value}``. Values are
    preserved as strings (including ``-`` for "not available").
    """
    # The disease list per tech sheet is small (~10 conditions) and
    # the labels are stable. We anchor on the known label set rather
    # than try to guess by layout.
    known_diseases = [
        "Gray Leaf Spot",
        "Northern Corn Leaf Blight",
        "Goss's Wilt",
        "Goss's wilt",
        "Bacterial Leaf Streak",
        "Bacterial Corn Leaf Streak",
        "Southern Corn Leaf Blight",
        "Anthracnose Stalk Rot",
        "Anthracnose Leaf Blight",
        "Tar Spot",
        "Fusarium Crown Rot",
        "Common Rust",
        "Southern Rust",
        "Eye Spot",
        "Stewart's Bacterial Wilt",
        # Soybean
        "Brown Stem Rot",
        "Charcoal Rot",
        "Frogeye Leaf Spot",
        "Iron Deficiency Chlorosis",
        "Phytophthora Root Rot",
        "Sclerotinia White Mold",
        "White Mold",
        "Soybean Cyst Nematode",
        "Sudden Death Syndrome",
        "Southern Stem Canker",
        "Stem Canker",
        "Soybean Mosaic Virus",
    ]
    items: list[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Match "<label> <value>" where label is one of known_diseases
        # and value is a single digit or "-".
        for d in known_diseases:
            m = re.match(rf"^{re.escape(d)}\s+([1-9]|-)\s*$", line)
            if m:
                items.append({"characteristic": d, "value": m.group(1)})
                break
    # Dedup while preserving order
    seen: set[str] = set()
    deduped: list[dict] = []
    for it in items:
        if it["characteristic"] not in seen:
            seen.add(it["characteristic"])
            deduped.append(it)
    return deduped


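To make the label-anchored matching in _extract_disease_ratings concrete, here is a small sketch run against invented text. It folds the labels into one compiled alternation instead of looping over them, which is a choice of this sketch and not what the scraper does; LABELS, rated_lines, and the sample lines are hypothetical and do not come from a real tech sheet.

```python
import re

# A short, illustrative subset of labels (not the scraper's full list).
LABELS = ("Gray Leaf Spot", "Tar Spot", "Common Rust")


def rated_lines(text: str, labels: tuple[str, ...] = LABELS) -> dict[str, str]:
    """Map each known label to the first single-digit or "-" rating found."""
    alternation = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(rf"^({alternation})\s+([1-9]|-)\s*$")
    found: dict[str, str] = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        # Keep the first occurrence only, mirroring the order-preserving dedup.
        if m and m.group(1) not in found:
            found[m.group(1)] = m.group(2)
    return found


# Invented text: chart labels without numbers are interleaved with ratings.
sample = """Emergence
Gray Leaf Spot 3
Stalk Strength
Tar Spot -
Gray Leaf Spot 5
Common Rust 12
"""
print(rated_lines(sample))
```

The bare chart labels and the two-digit "Common Rust 12" line are skipped because the pattern requires a known label followed by exactly one digit from 1 to 9 or a dash at the end of the line.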
def _extract_phytophthora_genes(text: str) -> str | None:
    """Soybean tech sheets list the Phytophthora Root Rot (PRR) source
    genes (Rps1c / Rps3a / etc.). The exact line wording varies; we
    accept several common phrasings."""
    patterns = (
        r"Phytophthora Root Rot\s*\(PRR\)\s*Source\s+(.+)",
        r"PRR Source\s*[:\-]?\s*(.+)",
        r"Phytophthora Gene\s*[:\-]?\s*(.+)",
    )
    for line in text.splitlines():
        line = line.strip()
        for p in patterns:
            m = re.match(p, line, re.I)
            if m:
                val = m.group(1).strip()
                # Trim trailing words that obviously aren't gene names
                # ("Source Rps1c, Rps3a Emergence 3" can run together).
                # maxsplit is passed by keyword: the positional form is
                # deprecated as of Python 3.13.
                val = re.split(
                    r"\s+(?:Emergence|Soybean|Standability|Root)\b",
                    val,
                    maxsplit=1,
                )[0].strip()
                if val and val.lower() not in ("-", "na", "n/a", "none"):
                    return val
    return None


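The trailing-word trim in _extract_phytophthora_genes is easy to check in isolation. This is a minimal sketch that assumes the captured value can run straight into the next label on the same extracted line; STOP_WORDS and trim_gene_list are hypothetical names used only here.

```python
import re

# Labels that can follow the gene list on the same extracted line.
STOP_WORDS = r"\s+(?:Emergence|Soybean|Standability|Root)\b"


def trim_gene_list(raw: str) -> str:
    """Cut the captured value at the first label that is not a gene name."""
    return re.split(STOP_WORDS, raw, maxsplit=1)[0].strip()


print(trim_gene_list("Rps1c, Rps3a Emergence 3"))
print(trim_gene_list("Rps1k"))
```

A value with no trailing label passes through unchanged, since re.split returns the whole string as its only element when the pattern does not match.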
def _extract_scn_source(text: str) -> str | None:
    """Soy: 'SCN Source ...' / 'Cyst Nematode Source ...' line, if any."""
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^(SCN Source|Cyst Nematode Source)\s*[:\-]?\s*(.+)$", line, re.I)
        if m:
            val = m.group(2).strip()
            if val and val != "-":
                return val
    return None


def _extract_scn_races(text: str) -> str | None:
    """Soy: 'Soybean Cyst Nematode (SCN) Races S' / 'R3' etc."""
    for line in text.splitlines():
        line = line.strip()
        m = re.match(
            r"^Soybean Cyst Nematode \(SCN\) Races\s+(.+)$", line, re.I,
        )
        if m:
            val = m.group(1).strip()
            if val:
                return val
    return None


# Soy agronomic ratings rendered as text "Label N" pairs in the PDF.
# These ARE extractable (unlike the bar charts).
_SOY_AGRO_LABELS = (
    "Emergence", "Standability", "Shatter Tolerance",
    "Green Stem", "% Protein at 13% mst.", "% Oil at 13% mst.",
)


def _extract_soy_agronomic_text(text: str) -> list[dict]:
    out: list[dict] = []
    for label in _SOY_AGRO_LABELS:
        # Allow trailing decimal for %Protein / %Oil; single digit
        # for the 1-9 ratings. The word boundary applies to the numeric
        # branch only: "\b" after "-" would match only when a word
        # character follows, so a bare "-" at end of line was missed.
        m = re.search(
            rf"{re.escape(label)}\s+(\d+(?:\.\d+)?\b|-)",
            text,
        )
        if m:
            out.append({"characteristic": label, "value": m.group(1)})
    return out


# Soil-type adaptation lines on soy PDFs: "Drought Prone Best",
# "Narrow Rows Best", "High pH* Good", etc.
_SOY_SOIL_LABELS = (
    "Drought Prone", "Narrow Rows", "High pH",
    "Wide Rows", "Highly Productive",
    "Moderate/Variable Environments", "Poorly Drained",
)


def _extract_soy_soil_adaptation(text: str) -> list[dict]:
    out: list[dict] = []
    for label in _SOY_SOIL_LABELS:
        m = re.search(
            rf"{re.escape(label)}\*?\s+(Best|Good|Fair|Poor)\b",
            text,
        )
        if m:
            out.append({"characteristic": label, "value": m.group(1)})
    return out


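A quick way to exercise the soil-adaptation pattern is to feed it a few invented lines, including the footnote asterisk seen in "High pH* Good". The sketch below reuses the same regex shape on a shortened label list; SOIL_LABELS, soil_adaptation, and the sample text are assumptions made for illustration.

```python
import re

# Shortened, illustrative label list.
SOIL_LABELS = ("Drought Prone", "High pH", "Poorly Drained")


def soil_adaptation(text: str) -> dict[str, str]:
    """Return {label: Best|Good|Fair|Poor} for each label found in text."""
    out: dict[str, str] = {}
    for label in SOIL_LABELS:
        # "\*?" tolerates a footnote marker glued to the label.
        m = re.search(
            rf"{re.escape(label)}\*?\s+(Best|Good|Fair|Poor)\b",
            text,
        )
        if m:
            out[label] = m.group(1)
    return out


# Invented sample; the last line is deliberately malformed.
sample = "Drought Prone Best\nHigh pH* Good\nPoorly Drained Poorer\n"
print(soil_adaptation(sample))
```

The malformed "Poorer" line is ignored because the trailing word boundary requires the rating word to end right after "Poor".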
def enrich_with_pdf(
    http: RateLimitedSession, prod: NKProduct
) -> None:
    """Fetch the tech-sheet PDF and add disease ratings + relevant
    soybean fields to ``prod.characteristics_groups``."""
    if not prod.techsheet_url:
        log.info("%s: no tech sheet URL; identity only", prod.source_key)
        return
    try:
        r = http.get(prod.techsheet_url)
        r.raise_for_status()
    except Exception as exc:  # noqa: BLE001
        log.warning("%s: PDF fetch failed (%s); identity only",
                    prod.source_key, exc)
        return
    try:
        with pdfplumber.open(io.BytesIO(r.content)) as pdf:
            text = "\n".join((p.extract_text() or "") for p in pdf.pages)
    except Exception as exc:  # noqa: BLE001
        log.warning("%s: PDF parse failed (%s); identity only",
                    prod.source_key, exc)
        return

    disease = _extract_disease_ratings(text)
    if disease:
        prod.characteristics_groups.append({
            "label": "DISEASE RATINGS",
            "type": "pdf-text",
            "items": disease,
        })

    if prod.crop == "soybeans":
        misc_items: list[dict] = []
        prr = _extract_phytophthora_genes(text)
        if prr:
            misc_items.append({"characteristic": "Phytophthora Gene", "value": prr})
        scn = _extract_scn_source(text)
        if scn:
            misc_items.append({"characteristic": "SCN Source", "value": scn})
        scn_races = _extract_scn_races(text)
        if scn_races:
            misc_items.append({"characteristic": "SCN Race Coverage", "value": scn_races})
        if misc_items:
            prod.characteristics_groups.append({
                "label": "DISEASE GENETICS",
                "type": "pdf-text",
                "items": misc_items,
            })

        soy_agro = _extract_soy_agronomic_text(text)
        if soy_agro:
            prod.characteristics_groups.append({
                "label": "AGRONOMIC TRAITS",
                "type": "pdf-text",
                "items": soy_agro,
            })

        soil = _extract_soy_soil_adaptation(text)
        if soil:
            prod.characteristics_groups.append({
                "label": "SOIL TYPE ADAPTATION",
                "type": "pdf-text",
                "items": soil,
            })

    # Surface labels for charted-only agronomic ratings so search_docs
    # can match queries like "drought" / "stalk strength". The values
    # aren't extractable via text (the source PDF renders them as bar
    # positions). We record only labels NOT already present in
    # text-extractable groups, with an explicit "rated in PDF chart"
    # value so the LLM directs the farmer at the tech sheet for those
    # numbers. (For soy this is mostly redundant, because text
    # extraction already got the agronomic numbers, so we skip the
    # chart-label group there.)
    if prod.crop == "corn":
        agronomic_labels_corn = (
            "Emergence", "Seedling Vigor", "Root Strength",
            "Stalk Strength", "Green Snap", "Staygreen",
            "Drydown", "Test Weight", "Drought",
        )
        # Skip any label already present with a non-empty value.
        already_rated = {
            it["characteristic"]
            for g in prod.characteristics_groups
            for it in g.get("items") or []
            if str(it.get("value", "")).strip() not in ("",)
        }
        present = [label for label in agronomic_labels_corn
                   if label in text and label not in already_rated]
        if present:
            prod.characteristics_groups.append({
                "label": "AGRONOMIC CHARACTERISTICS",
                "type": "pdf-chart",
                "items": [
                    {
                        "characteristic": label,
                        "value": "rated in tech-sheet PDF chart (not text-extractable)",
                    }
                    for label in present
                ],
            })


# --------------------------------------------------------------------- render


def render_markdown(p: NKProduct) -> str:
    title = p.product_code or p.source_key
    crop_label = "Corn" if p.crop == "corn" else "Soybeans"

    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** Syngenta",
        "- **Brand:** NK",
        f"- **Crop:** {crop_label}",
    ]
    if p.crop == "corn" and p.relative_maturity:
        head.append(f"- **Relative maturity:** {p.relative_maturity}")
    if p.crop == "soybeans" and p.maturity_group:
        head.append(f"- **Maturity group:** {p.maturity_group}")
    if p.brand_variants:
        head.append(f"- **Brand variants:** {', '.join(p.brand_variants)}")
    if p.trait_codes:
        head.append(f"- **Traits:** {', '.join(p.trait_codes)}")
    head.append(f"- **Catalog page:** {p.source_url}")
    if p.techsheet_url:
        head.append(f"- **Tech sheet (PDF):** {p.techsheet_url}")
    head.append(f"- **Rating scale (NK):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")

    sections: list[str] = []
    if p.positioning_statement:
        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
    if p.strengths:
        bullets = "\n".join(f"- {s}" for s in p.strengths)
        sections.append("## Strengths\n\n" + bullets + "\n")

    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    return "\n".join(head) + "\n".join(sections)


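For a feel of what one characteristics group looks like once rendered, the sketch below lifts the table-building step out of render_markdown into a standalone helper. group_to_table and the sample group are hypothetical; the dict shape ("label", "type", "items") follows what enrich_with_pdf appends.

```python
def group_to_table(group: dict) -> str:
    """Render one characteristics group as a Markdown section with a table."""
    label = (group.get("label") or "Characteristics").title()
    items = group.get("items") or []
    rows = "\n".join(
        f"| {it['characteristic']} | {it['value']} |" for it in items
    )
    return (
        f"## {label}\n\n"
        "| Characteristic | Value |\n"
        "|---|---|\n"
        f"{rows}\n"
    )


# Invented group with the same keys the scraper stores.
group = {
    "label": "DISEASE RATINGS",
    "type": "pdf-text",
    "items": [
        {"characteristic": "Gray Leaf Spot", "value": "3"},
        {"characteristic": "Tar Spot", "value": "-"},
    ],
}
print(group_to_table(group))
```

The upper-case group label is title-cased for the heading, while the "-" placeholder for an unavailable rating is written into the table cell unchanged.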
# --------------------------------------------------------------------- write


def write_product(prod: NKProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"

    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "nk",
        "source_key": prod.source_key,
        "vendor": "Syngenta",
        "brand": "NK",
        "product_name": prod.product_code,
        "product_id": None,
        "hybrid_prefix": prod.product_code,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        "relative_maturity": prod.relative_maturity,
        "maturity_group": prod.maturity_group,
        "wheat_class": None,
        "trait_stack": prod.trait_codes,
        "trait_descriptions": prod.trait_descriptions,
        "brand_variants": prod.brand_variants,
        "positioning_statement": prod.positioning_statement,
        "strengths": prod.strengths,
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "techsheet_url": prod.techsheet_url,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline


def process_product(
    http: RateLimitedSession,
    prod: NKProduct,
    *,
    force: bool,
) -> str:
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    if md_path.exists() and not force:
        return "skipped"
    enrich_with_pdf(http, prod)
    body = render_markdown(prod)
    write_product(prod, body)
    return "written"


def run(
    *,
    limit: int | None,
    force: bool,
    only_crop: str | None,
    only_product: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_products(http, only_crop=only_crop)

    if only_product:
        targets = [
            p for p in targets
            if p.source_key == only_product
            or p.product_code.lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
    for prod in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        try:
            status = process_product(http, prod, force=force)
        except Exception as exc:  # noqa: BLE001
            log.error("%s failed: %s", prod.source_key, exc)
            status = "failed"
        counts[status] = counts.get(status, 0) + 1
        log.info(
            "[%d/%s] %s %s | crop=%s rm/mg=%s variants=%d traits=%s groups=%d",
            processed, str(limit) if limit else "all",
            prod.source_key, status, prod.crop,
            prod.relative_maturity or prod.maturity_group or "-",
            len(prod.brand_variants),
            ",".join(prod.trait_codes) or "-",
            len(prod.characteristics_groups),
        )

    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.nk",
        description="Scrape NK (Syngenta) corn + soybean varieties.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=("corn", "soybeans"),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or product code.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit,
        force=args.force,
        only_crop=args.crop,
        only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())