justin 30b182e28a Three new brand scrapers: LG Seeds + AgriGold + Ebbert's Seeds (+310 varieties)
User flagged that LG, AgriGold, and Ebbert's (a local Ohio breeder)
are all active in farmer territory. Built three scrapers; the corpus
now covers 5,839 chunks across 11 brands.

Net new varieties: 310
  lg_seeds        170 — corn 78 + soy 63 + alfalfa 16 + sorghum 13
                  → adds FIRST alfalfa coverage (FD 3-5 range)
  agrigold        111 — corn 60 + soy 51
  ebberts_seeds    29 — corn 17 + soy 12 (regional OH/IN breeder)

scrape/sources/lg_seeds.py — embedded-JSON pattern (cleanest):
- /products/<crop> pages have a `var products = [...]` blob with the
  variety summary (Variety, Maturity, Traits[], Bullets[], CropType).
- Per-variety detail page (/products/<crop>/<Variety>) carries the
  ratings as `<span class="bar-N">` where N is 1-9 on the canonical
  scale. Same 9=best direction as Bayer / Golden Harvest.
- Three sections per page: Characteristics / Management / Disease
  Tolerance, plus a few qualitative bars ("Tar Spot Susceptible",
  "Fungicide Response High") preserved as text values.

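As a rough illustration of the LG pattern above, the sketch below pulls the
`var products = [...]` blob out of a listing page and reads a rating from a
`bar-N` class. It uses only the standard library. The sample markup, the
`CropType` value, and the helper names `extract_products` and `bar_rating`
are assumptions made for the example, not the scraper's actual code.

```python
import json
import re
from typing import Optional

# Non-greedy match up to the first "];" after the assignment. This assumes
# no string inside the blob contains "];", which holds for the sample below.
PRODUCTS_RE = re.compile(r"var\s+products\s*=\s*(\[.*?\]);", re.S)
BAR_RE = re.compile(r'class="[^"]*\bbar-([1-9])\b')


def extract_products(listing_html: str) -> list:
    """Return the parsed product summaries, or [] if no blob is found."""
    m = PRODUCTS_RE.search(listing_html)
    return json.loads(m.group(1)) if m else []


def bar_rating(span_html: str) -> Optional[int]:
    """Return N from a bar-N class (1-9, 9 = best), or None if absent."""
    m = BAR_RE.search(span_html)
    return int(m.group(1)) if m else None


# Hypothetical sample markup, shaped like the fields named in this commit.
LISTING = (
    '<script>var products = [{"Variety": "LG5701", "Maturity": 116, '
    '"Traits": ["SmartStax"], "Bullets": [], "CropType": "Corn"}];</script>'
)

products = extract_products(LISTING)
rating = bar_rating('<span class="bar-8"></span>')
qualitative = bar_rating('<span class="label">Susceptible</span>')
```

A span without a `bar-N` class yields `None`, which is the case where the
qualitative text would be kept as the value instead of a number.
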
scrape/sources/agrigold.py — 5-circle scale:
- Listing page has 60+ /corn/explore-corn-hybrids/<CODE> URLs.
- Detail page renders ratings as <div class="scale"> blocks with 5
  child <div class="circle"> elements, of which N have class
  "circle selected" → rating N on a 1-5 scale.
- 7 sections per page incl. Silage Characteristics (Dairy Silage
  Rating, NDFd 30 Hr, Crude Protein), Planting Applications, Soil
  Adaptability, Plant Characteristics, Product Features.
- Different rating scale (1-5 vs Bayer's 1-9); declared in
  _scale_direction so the chunker preamble renders correctly.
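
The circle-counting rule above can be sketched with the standard library's
`html.parser`. This is an illustrative reconstruction under assumptions: the
sample markup is invented to match the description, and `CircleScaleParser`
and `parse_circle_ratings` are hypothetical names, not the code in
agrigold.py.

```python
from html.parser import HTMLParser


class CircleScaleParser(HTMLParser):
    """Collect one rating per <div class="scale"> block by counting its
    child divs whose class list contains both "circle" and "selected"."""

    def __init__(self) -> None:
        super().__init__()
        self.ratings = []
        self._depth = 0      # div nesting inside the current scale; 0 = outside
        self._selected = 0   # selected circles seen in the current scale

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "scale" in classes:
                self._depth = 1
                self._selected = 0
            return
        self._depth += 1
        if "circle" in classes and "selected" in classes:
            self._selected += 1

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:  # the scale block just closed
            self.ratings.append(self._selected)


def parse_circle_ratings(html: str) -> list:
    parser = CircleScaleParser()
    parser.feed(html)
    return parser.ratings


# Hypothetical sample: three of five circles selected, so the rating is 3.
SAMPLE = (
    '<div class="scale">'
    + '<div class="circle selected"></div>' * 3
    + '<div class="circle"></div>' * 2
    + "</div>"
)
ratings = parse_circle_ratings(SAMPLE)
```
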

scrape/sources/ebberts_seeds.py — small regional breeder, verbatim
text approach:
- Single page per crop (corn / soybeans / wheat). Each variety is an
  <h1> + multi-section CSS-grid block where labels and values are in
  separate adjacent cells. Reconstructing perfectly-aligned columns
  for a 29-variety total isn't worth the engineering — chunk body
  carries the verbatim text in document order, LLM can read the
  tabular content.
- Scale: 1-5 (1 = best, lower = more resistant), inferred from
  marketing-vs-rating cross-checks ("Robust tall plants" + STANDABILITY
  1.0 → 1 = best).
- Politeness: robots.txt asks for Crawl-delay: 5; honored.
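
One way to honor a Crawl-delay directive is to read it with
`urllib.robotparser` and use it as the request interval. The sketch below is
an assumption about how that could look, not the code in ebberts_seeds.py:
the robots.txt text is a made-up sample, and `request_interval` and
`DEFAULT_INTERVAL_SEC` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample; the site's real robots.txt is not reproduced here.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow:
"""

DEFAULT_INTERVAL_SEC = 1.0  # hypothetical fallback when no delay is declared


def request_interval(robots_txt: str, user_agent: str) -> float:
    """Return the Crawl-delay for user_agent, or the fallback interval."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_INTERVAL_SEC


interval = request_interval(ROBOTS_TXT, "seed-mcp-scraper")
no_rule = request_interval("User-agent: *\nDisallow:\n", "seed-mcp-scraper")
```

The returned value could then be passed as the `interval` of a rate-limited
session so that consecutive requests are spaced at least that far apart.
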

All three new scrapers smoke-tested:
- LG corn LG5701 RM 116 SmartStax → 3 characteristic groups with
  Disease Tolerance ratings (Northern/Southern Leaf Blight 8-9, etc.)
- AgriGold A616-30 RM 86 VT2RIB → 7 groups incl. silage and soil
  adaptability ratings
- Ebbert's 7000TR RIB RM 100 → 1098-char verbatim body covering
  CHARACTERISTICS, DISEASE RATINGS, herbicide tolerance, etc.

Corpus state after this PR:
- 5,839 chunks (was 5,529)
- 11 brands (was 8)
- 8 crops (corn 3047, soy 2209, silage 359, wheat 123, sorghum 49,
  cotton 30, alfalfa 16, canola 6) — alfalfa is brand-new

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 12:42:23 -04:00

504 lines
18 KiB
Python

"""LG Seeds scraper — AgReliant Genetics brand.

Source: ``www.lgseeds.com``, a WordPress site. Empty robots.txt
(no Disallow). Catalog covers 4 crops: corn, soybeans, alfalfa,
sorghum.

Two-layer fetch:

1. **Listing page** (one per crop): inline JavaScript variable
   ``products = [{...}, ...]`` carries the full variety summary:
   Variety code, Maturity, Traits[], Bullets[], CropType. No
   per-variety HTTP needed for identity.
2. **Detail page** (``/products/<crop>/<Variety>``): rich plant
   characteristics + disease tolerance + management ratings,
   rendered as ``<div class="characteristics-bar">`` blocks with
   ``<span class="bar-N">`` where N ∈ 1-9 is the rating. Same
   convention as Bayer/Golden Harvest (9 = best).

LG Seeds is a regional brand (Eastern Corn Belt focus) under
AgReliant Genetics, the same parent as AgriGold. Brand voice is
distinct so we keep them in separate scrapers.

Rating scale: ``1-9 (9 = best)``, verified empirically on the
bar-N markup; matches Bayer / Golden Harvest convention.

Output:
    corpus/lg_seeds/<source_key>.md
    corpus/lg_seeds/<source_key>.json

source_key: ``lg-<variety>`` lowercased, e.g. ``lg-lg5701``,
``lg-c3400`` (soybean; codes don't use LG prefix), ``lg-7c300``
(alfalfa), ``lg-silo-max-100`` (sorghum).

CLI:
    python -m scrape.sources.lg_seeds --crop corn --limit 5
    python -m scrape.sources.lg_seeds --force
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.lgseeds.com"

# Crops listed in nav. Each has a listing page at /products/<crop>
# with an inline `var products = [...]` JSON blob.
LISTING_PATHS = {
    "corn": "/products/corn",
    "soybeans": "/products/soybeans",
    "alfalfa": "/products/alfalfa",
    "sorghum": "/products/sorghum",
}

RATING_SCALE_DIRECTION = "1-9 (9 = best)"

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "lg_seeds"

REQ_INTERVAL_SEC = 1.0

log = logging.getLogger("scrape.lg_seeds")


# --------------------------------------------------------------------- HTTP
class RateLimitedSession:
    """``requests.Session`` wrapper that spaces out requests and retries
    on network errors, HTTP 429, and 5xx responses."""

    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        # Sleep just long enough to keep `interval` seconds between requests.
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(self, method: str, url: str, *, max_retries: int = 4,
                timeout: float = 30.0, **kw: Any) -> requests.Response:
        resp: requests.Response | None = None
        last_exc: Exception | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                resp, last_exc = None, exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s, retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = (float(ra) if (ra and ra.isdigit())
                           else min(30.0, (2 ** attempt) + random.random()))
                log.warning("HTTP %d on %s %s, retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        # Retries exhausted. If the final attempt produced a response, return
        # it so the caller's raise_for_status() reports the real status;
        # otherwise re-raise the last network error.
        if resp is not None:
            return resp
        if last_exc is not None:
            raise last_exc
        raise ValueError("max_retries must be at least 1")

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)


# --------------------------------------------------------------------- model
@dataclass
class LGProduct:
    source_key: str
    source_url: str
    crop: str
    product_name: str = ""
    product_id: int | None = None
    maturity_raw: str | None = None  # corn RM days / soy MG / alfalfa FD / sorghum days
    fall_dormancy: str | None = None  # alfalfa only
    trait_descriptions: list[str] = field(default_factory=list)
    bullets: list[str] = field(default_factory=list)
    characteristics_groups: list[dict] = field(default_factory=list)


# --------------------------------------------------------------------- discovery
_VAR_RE = re.compile(
    r'var\s+\w+\s*=\s*(\[\{"Variety":.+?\}\]);', re.S,
)


def discover_varieties(
    http: RateLimitedSession, *, only_crop: str | None = None,
) -> list[tuple[str, dict]]:
    """Return ``[(crop, summary_dict), ...]`` from each listing page's
    inline JSON. Summary dict has Variety / Id / Maturity / Traits /
    Bullets / CropType / FallDormancy."""
    out: list[tuple[str, dict]] = []
    for crop, path in LISTING_PATHS.items():
        if only_crop and crop != only_crop:
            continue
        log.info("fetching listing %s%s", BASE, path)
        r = http.get(f"{BASE}{path}")
        r.raise_for_status()
        m = _VAR_RE.search(r.text)
        if not m:
            log.warning("no products array in %s", path)
            continue
        try:
            items = json.loads(m.group(1))
        except json.JSONDecodeError as exc:
            log.error("JSON parse failed for %s: %s", path, exc)
            continue
        log.info("  %s: %d varieties", crop, len(items))
        for it in items:
            out.append((crop, it))
    log.info("total varieties discovered: %d", len(out))
    return out


# --------------------------------------------------------------------- helpers
def source_key_for(variety: str) -> str:
    """Slugify the variety code into a stable source_key."""
    slug = re.sub(r"[^a-zA-Z0-9-]+", "-", variety).strip("-").lower()
    return f"lg-{slug}"


_BAR_CLASS_RE = re.compile(r"^bar-(\d)$")


def _parse_bar_value(span_classes: list[str]) -> int | None:
    """Extract the integer rating from a ``bar-N`` CSS class."""
    for c in span_classes or []:
        m = _BAR_CLASS_RE.match(c)
        if m:
            return int(m.group(1))
    return None


# --------------------------------------------------------------------- detail
def fetch_product_detail(
    http: RateLimitedSession, summary: dict, crop: str,
) -> LGProduct:
    """Fetch the detail page and merge characteristics into an
    LGProduct seeded by the listing-page summary."""
    variety = summary.get("Variety") or ""
    # LG's detail URL is /products/<crop>/<Variety>. The Variety in the
    # listing JSON appears in correct case; LG seems to accept any case
    # but we use what's published.
    url = f"{BASE}/products/{crop}/{variety}"
    prod = LGProduct(
        source_key=source_key_for(variety),
        source_url=url,
        crop=crop,
        product_name=variety,
        product_id=summary.get("Id"),
        maturity_raw=str(summary.get("Maturity")) if summary.get("Maturity") is not None else None,
        fall_dormancy=str(summary.get("FallDormancy")) if summary.get("FallDormancy") else None,
        trait_descriptions=list(summary.get("Traits") or []),
        bullets=list(summary.get("Bullets") or []),
    )
    try:
        r = http.get(url)
        r.raise_for_status()
    except Exception as exc:  # noqa: BLE001
        log.warning("detail fetch failed for %s: %s", variety, exc)
        return prod  # identity-only fallback
    soup = BeautifulSoup(r.text, "html.parser")
    # The detail page has multiple .product-section blocks; each has
    # a heading + a collection of .characteristics-bar rows. We bucket
    # by the section's text content. Common LG section labels:
    # "Characteristics" / "Management" / "Disease Tolerance".
    sections: list[tuple[str, list[dict]]] = []
    for section in soup.find_all("div", class_=re.compile(r"product-section")):
        # Heading is the first text node inside the section, before bars.
        # The section class often includes a hint like "disease-toler",
        # "plantCharacteristics", "management-pr".
        section_classes = " ".join(section.get("class", []))
        bars = section.find_all("div", class_="characteristics-bar")
        if not bars:
            continue
        # Section label: use the first heading-like element or the
        # text right after the section class anchor.
        label = ""
        for h in section.find_all(["h2", "h3", "h4"]):
            t = h.get_text(strip=True)
            if t:
                label = t
                break
        if not label:
            # fallback: section_classes hint
            if "disease" in section_classes.lower():
                label = "Disease Tolerance"
            elif "management" in section_classes.lower():
                label = "Management"
            elif "plantcharacteristics" in section_classes.lower():
                label = "Characteristics"
        items: list[dict] = []
        for bar in bars:
            name_el = bar.find(class_="product-name")
            value_span = bar.find("span", class_=_BAR_CLASS_RE)
            name = (name_el.get_text(" ", strip=True) if name_el else "").strip()
            rating = _parse_bar_value(value_span.get("class") if value_span else [])
            if not name:
                continue
            # Some "bars" are actually qualitative (e.g. "Tar Spot Susceptible",
            # "Fungicide Response High"). For those we keep the label as the
            # value text rather than a missing rating.
            if rating is None:
                # Look inside the bar element for a non-name text snippet
                inner_text = bar.get_text(" ", strip=True)
                # Strip the label off the front
                if inner_text.startswith(name):
                    inner_text = inner_text[len(name):].strip()
                items.append({"characteristic": name, "value": inner_text or "-"})
            else:
                items.append({"characteristic": name, "value": str(rating)})
        if items:
            sections.append((label or "Characteristics", items))
    prod.characteristics_groups = [
        {"label": label.upper(), "type": "bars", "items": items}
        for label, items in sections
    ]
    return prod


# --------------------------------------------------------------------- render
def render_markdown(p: LGProduct) -> str:
    title = p.product_name or p.source_key
    crop_label = {
        "corn": "Corn", "soybeans": "Soybeans",
        "alfalfa": "Alfalfa", "sorghum": "Sorghum",
    }.get(p.crop, p.crop.title())
    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** AgReliant Genetics",
        "- **Brand:** LG Seeds",
        f"- **Crop:** {crop_label}",
    ]
    if p.maturity_raw:
        if p.crop == "corn":
            head.append(f"- **Relative maturity:** {p.maturity_raw}")
        elif p.crop == "soybeans":
            head.append(f"- **Maturity group:** {p.maturity_raw}")
        elif p.crop == "alfalfa":
            head.append(f"- **Fall dormancy / maturity:** {p.maturity_raw}")
        elif p.crop == "sorghum":
            head.append(f"- **Days to maturity:** {p.maturity_raw}")
    if p.trait_descriptions:
        head.append(f"- **Traits:** {', '.join(p.trait_descriptions)}")
    head.append(f"- **Source:** {p.source_url}")
    head.append(f"- **Rating scale (LG Seeds):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")
    sections: list[str] = []
    if p.bullets:
        bullets = "\n".join(f"- {b}" for b in p.bullets)
        sections.append("## Strengths\n\n" + bullets + "\n")
    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    # join() adds no newline after the last header element, so add one
    # explicitly to keep a blank line between "---" and the first section.
    return "\n".join(head) + "\n" + "\n".join(sections)


# --------------------------------------------------------------------- write
def write_product(prod: LGProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"
    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "lg_seeds",
        "source_key": prod.source_key,
        "vendor": "AgReliant Genetics",
        "brand": "LG Seeds",
        "product_name": prod.product_name,
        "product_id": prod.product_id,
        "hybrid_prefix": prod.product_name,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        # Maturity routing: corn = RM days, soy = MG, alfalfa = FD,
        # sorghum = days-to-maturity. Stored in the canonical fields
        # so the chunker's crop-aware preamble works.
        "relative_maturity": prod.maturity_raw if prod.crop in ("corn", "sorghum") else None,
        "maturity_group": prod.maturity_raw if prod.crop == "soybeans" else None,
        "fall_dormancy": prod.maturity_raw if prod.crop == "alfalfa" else prod.fall_dormancy,
        "wheat_class": None,
        "trait_stack": prod.trait_descriptions,  # LG publishes full names, not codes
        "trait_descriptions": prod.trait_descriptions,
        "positioning_statement": None,
        "strengths": prod.bullets,
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline
def process_product(
    http: RateLimitedSession, summary: dict, crop: str, *, force: bool,
) -> tuple[str, LGProduct | None]:
    variety = summary.get("Variety") or ""
    source_key = source_key_for(variety)
    md_path = CORPUS_DIR / f"{source_key}.md"
    if md_path.exists() and not force:
        return "skipped", None
    try:
        prod = fetch_product_detail(http, summary, crop)
    except Exception as exc:  # noqa: BLE001
        log.error("variety %s failed: %s", variety, exc)
        return "failed", None
    body = render_markdown(prod)
    write_product(prod, body)
    return "written", prod


def run(
    *, limit: int | None, force: bool,
    only_crop: str | None, only_product: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_varieties(http, only_crop=only_crop)
    if only_product:
        targets = [
            (c, s) for (c, s) in targets
            if source_key_for(s.get("Variety", "")) == only_product
            or s.get("Variety", "").lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2
    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
    for crop, summary in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        status, prod = process_product(http, summary, crop, force=force)
        counts[status] = counts.get(status, 0) + 1
        if prod is not None:
            log.info(
                "[%d/%s] %s %s | crop=%s maturity=%s traits=%d groups=%d",
                processed, str(limit) if limit else "all",
                prod.source_key, status, prod.crop,
                prod.maturity_raw or "-",
                len(prod.trait_descriptions),
                len(prod.characteristics_groups),
            )
        else:
            log.info("[%d/%s] %s %s",
                     processed, str(limit) if limit else "all",
                     source_key_for(summary.get("Variety", "")), status)
    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI
def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.lg_seeds",
        description="Scrape LG Seeds (AgReliant Genetics): corn / "
                    "soybeans / alfalfa / sorghum.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=list(LISTING_PATHS),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or Variety code.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit, force=args.force,
        only_crop=args.crop, only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())