Three new brand scrapers: LG Seeds + AgriGold + Ebbert's Seeds (+310 varieties)
User flagged that LG, AgriGold, and Ebbert's (a local Ohio breeder) are
all active in farmer territory. Built three scrapers; the corpus now
covers 5,839 chunks across 11 brands.
Net new varieties: 310
lg_seeds 170 — corn 78 + soy 63 + alfalfa 16 + sorghum 13
→ adds FIRST alfalfa coverage (FD 3-5 range)
agrigold 111 — corn 60 + soy 51
ebberts_seeds 29 — corn 17 + soy 12 (regional OH/IN breeder)
scrape/sources/lg_seeds.py — embedded-JSON pattern (cleanest):
- /products/<crop> pages have a `var products = [...]` blob with the
variety summary (Variety, Maturity, Traits[], Bullets[], CropType).
- Per-variety detail page (/products/<crop>/<Variety>) carries the
ratings as `<span class="bar-N">` where N is 1-9 on the canonical
scale. Same 9=best direction as Bayer / Golden Harvest.
- Three sections per page: Characteristics / Management / Disease
Tolerance, plus a few qualitative bars ("Tar Spot Susceptible",
"Fungicide Response High") preserved as text values.
scrape/sources/agrigold.py — 5-circle scale:
- Listing page has 60+ /corn/explore-corn-hybrids/<CODE> URLs.
- Detail page renders ratings as <div class="scale"> blocks with 5
child <div class="circle"> elements, of which N have class
"circle selected" → rating N on a 1-5 scale.
- 7 sections per page incl. Silage Characteristics (Dairy Silage
Rating, NDFd 30 Hr, Crude Protein), Planting Applications, Soil
Adaptability, Plant Characteristics, Product Features.
- Distinct rating scale (1-5 with 5 = best, vs Bayer's 1-9 with 9 = best),
declared in _scale_direction so the chunker preamble renders correctly.
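The chunker itself is not part of this diff, so the following is only a minimal sketch of how a preamble might be rendered from the sidecar's `_scale_direction`. The function name `scale_preamble`, the fallback constant, and the sentence wording are all assumptions made for illustration.

```python
# Hypothetical fallback for sidecars that declare no scale; the real
# chunker's default (if it has one) is not shown in this commit.
DEFAULT_SCALE = "1-9 (9 = best)"


def scale_preamble(sidecar: dict) -> str:
    """Build a one-line preamble naming the brand's rating scale."""
    scale = sidecar.get("_scale_direction") or DEFAULT_SCALE
    brand = sidecar.get("brand") or "this brand"
    return f"Ratings for {brand} use a {scale} scale."


print(scale_preamble({"brand": "AgriGold", "_scale_direction": "1-5 (5 = best)"}))
```

Declaring the scale per sidecar keeps the 1-5 and 1-9 brands from being read on the same axis.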
scrape/sources/ebberts_seeds.py — small regional breeder, verbatim
text approach:
- Single page per crop (corn / soybeans / wheat). Each variety is an
<h1> + multi-section CSS-grid block where labels and values are in
separate adjacent cells. Reconstructing perfectly-aligned columns
for a 29-variety total isn't worth the engineering — chunk body
carries the verbatim text in document order, LLM can read the
tabular content.
- Scale: 1-5 (1 = best, lower = more resistant), inferred from
marketing-vs-rating cross-checks ("Robust tall plants" + STANDABILITY
1.0 → 1 = best).
- Politeness: robots.txt asks for Crawl-delay: 5; honored.
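The scraper hard-codes the 5-second delay. As an alternative sketch, the value could be read from robots.txt with the standard library's `urllib.robotparser`. The helper name `crawl_delay_seconds` and its `default` parameter are hypothetical, and the sample robots.txt text is illustrative rather than a copy of the live file.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_seconds(robots_txt: str, user_agent: str,
                        default: float = 1.0) -> float:
    """Return the Crawl-delay that applies to ``user_agent``, else ``default``."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no delay is declared
    return float(delay) if delay is not None else default


sample = "User-agent: *\nCrawl-delay: 5\n"
print(crawl_delay_seconds(sample, "seed-mcp-scraper"))
```

Note that `RobotFileParser` only honors a `Crawl-delay` line that sits inside a `User-agent` group, so a file with a bare directive would fall back to `default`.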
All three new scrapers smoke-tested:
- LG corn LG5701 RM 116 SmartStax → 3 characteristic groups with
Disease Tolerance ratings (Northern/Southern Leaf Blight 8-9, etc.)
- AgriGold A616-30 RM 86 VT2RIB → 7 groups incl. silage and soil
adaptability ratings
- Ebbert's 7000TR RIB RM 100 → 1098-char verbatim body covering
CHARACTERISTICS, DISEASE RATINGS, herbicide tolerance, etc.
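lg_seeds.py is not included in this excerpt. The sketch below shows one way the embedded-JSON pattern described for it could be parsed using only the standard library. The helper names are hypothetical, and it assumes the `var products = [...]` blob is valid JSON and that ratings appear as `class="... bar-N ..."` on `<span>` elements.

```python
import json
import re

_PRODUCTS_START_RE = re.compile(r"var\s+products\s*=\s*")
_BAR_RE = re.compile(r'<span[^>]*class="[^"]*\bbar-([1-9])\b[^"]*"')


def extract_products(html: str) -> list:
    """Decode the array literal that follows ``var products =``."""
    m = _PRODUCTS_START_RE.search(html)
    if m is None:
        return []
    # raw_decode stops at the end of the first complete JSON value, so
    # nested arrays (e.g. Traits) and trailing JavaScript are handled.
    products, _end = json.JSONDecoder().raw_decode(html, m.end())
    return products


def extract_bar_ratings(html: str) -> list[int]:
    """Return the N from every ``bar-N`` span, in document order."""
    return [int(n) for n in _BAR_RE.findall(html)]


page = '<script>var products = [{"Variety": "LG5701", "Traits": ["SmartStax"]}];</script>'
print(extract_products(page))
```

If the blob turns out to be JavaScript rather than strict JSON (single quotes, trailing commas), `raw_decode` raises `json.JSONDecodeError` and a more tolerant parser would be needed.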
Corpus state after this PR:
- 5,839 chunks (was 5,529)
- 11 brands (was 8)
- 8 crops (corn 3047, soy 2209, silage 359, wheat 123, sorghum 49,
cotton 30, alfalfa 16, canola 6) — alfalfa is brand-new
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,502 @@
"""AgriGold scraper — AgReliant Genetics brand.

Source: ``www.agrigold.com`` — WordPress site, empty robots.txt
(no Disallow). Catalog covers corn + soybeans. Sibling of LG Seeds
under the same parent (AgReliant) but distinct branding /
positioning, so kept in its own scraper.

Discovery: the listing page ``/corn/explore-corn-hybrids`` (and
the soybean equivalent) is server-rendered HTML that contains
``<a href="/corn/explore-corn-hybrids/<CODE>">`` for every variety.
Codes look like ``A616-30``, ``A623-88``, etc. Parse the listing
HTML, collect distinct variety URLs.

Per-variety detail (``/corn/explore-corn-hybrids/<CODE>``) renders
several ``<div class="product-section ...">`` blocks. Each section
has a ``<div class="title">`` heading + multiple ``.detail-item``
rows shaped as ``<div class="label">N</div><div class="value">V</div>``.

The ``<div class="value">`` content is one of:

- **5-circle rating scale** (Agronomic Rating, Disease Tolerance,
  Silage Characteristics): ``<div class="scale">`` containing 5
  children, where N have class ``circle selected`` and 5-N have
  class ``circle``. Count = rating on a **1-5 scale** (5 = best).
  Distinct from Bayer / LG Seeds' 1-9 convention — documented in
  the sidecar's ``_scale_direction``.

- **Numeric value** (GDUs, year, plant population): bare number.

- **Categorical / qualitative** (Ear Flex Type "KERNEL",
  Leaf Orientation "SEMI UPRIGHT", Cob Color "Red"): the literal
  text.

- **NA**: rated but not yet measured.

Rating scale: ``1-5 (5 = best)`` — distinct from the other brands;
the chunker reads ``_scale_direction`` to render the correct
preamble.

Output:
    corpus/agrigold/<source_key>.md
    corpus/agrigold/<source_key>.json

source_key: ``agrigold-<code>`` lowercased, e.g.
``agrigold-a616-30``.

CLI:
    python -m scrape.sources.agrigold --crop corn --limit 5
    python -m scrape.sources.agrigold --force
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.agrigold.com"

LISTING_PATHS = {
    "corn": "/corn/explore-corn-hybrids",
    "soybeans": "/soybeans/explore-soybean-varieties",
}

# AgriGold publishes ratings on a 1-5 scale (5 = best), counted from
# the selected circles in the per-rating scale block. The chunker
# preserves this verbatim — every chunk preamble declares the scale
# so the LLM doesn't conflate with Bayer's 1-9.
RATING_SCALE_DIRECTION = "1-5 (5 = best)"

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "agrigold"

REQ_INTERVAL_SEC = 1.0

log = logging.getLogger("scrape.agrigold")


# --------------------------------------------------------------------- HTTP


class RateLimitedSession:
    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(self, method: str, url: str, *, max_retries: int = 4,
                timeout: float = 30.0, **kw: Any) -> requests.Response:
        last_exc: Exception | None = None
        resp: requests.Response | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            # This attempt produced a response, so any earlier network
            # error is stale and must not be re-raised after the loop.
            last_exc = None
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc is not None:
            raise last_exc
        if resp is None:
            raise ValueError("max_retries must be at least 1")
        # Retries exhausted on 429/5xx: return the last response so the
        # caller's raise_for_status() reports the failure.
        return resp

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)


# --------------------------------------------------------------------- model


@dataclass
class AGProduct:
    source_key: str
    source_url: str
    crop: str
    product_name: str = ""
    relative_maturity: str | None = None  # corn RM days from .maturity
    maturity_group: str | None = None  # soy MG
    trait_descriptions: list[str] = field(default_factory=list)
    characteristics_groups: list[dict] = field(default_factory=list)


# --------------------------------------------------------------------- discovery


def discover_varieties(
    http: RateLimitedSession, *, only_crop: str | None = None,
) -> list[tuple[str, str, str]]:
    """Return ``[(url, crop, variety_code), ...]`` for every variety in
    the listing pages."""
    out: list[tuple[str, str, str]] = []
    for crop, path in LISTING_PATHS.items():
        if only_crop and crop != only_crop:
            continue
        log.info("fetching listing %s%s", BASE, path)
        r = http.get(f"{BASE}{path}")
        r.raise_for_status()
        # Collect distinct hrefs that look like /<crop>/explore-X-{hybrids,
        # varieties}/<CODE>. Codes are alphanumeric with dashes.
        href_re = re.compile(rf"^{re.escape(path)}/([\w\-]+)$")
        seen: set[str] = set()
        soup = BeautifulSoup(r.text, "html.parser")
        for a in soup.find_all("a", href=True):
            m = href_re.match(a["href"])
            if not m:
                continue
            code = m.group(1)
            # Sanity-check the code shape: 3-31 characters, starting with
            # a letter or digit. Note that word-like catalog-tool tails
            # ("filter", "browse", etc.) also pass this check; rejecting
            # them would need a stricter pattern.
            if not re.match(r"^[A-Z0-9][\w\-]{2,30}$", code, re.I):
                continue
            if code in seen:
                continue
            seen.add(code)
            out.append((f"{BASE}{path}/{code}", crop, code))
        log.info(" %s: %d varieties", crop, len(seen))
    log.info("total varieties discovered: %d", len(out))
    return out


# --------------------------------------------------------------------- helpers


def source_key_for(code: str) -> str:
    slug = re.sub(r"[^a-zA-Z0-9-]+", "-", code).strip("-").lower()
    return f"agrigold-{slug}"


# Section class hint -> normalized label for the sidecar.
SECTION_LABEL_MAP = {
    "agronomic-rating": "AGRONOMIC RATING",
    "disease-tolerance": "DISEASE TOLERANCE",
    "plant-characteristics": "PLANT CHARACTERISTICS",
    "plant-features": "PRODUCT FEATURES",
    "silage-characteristics": "SILAGE CHARACTERISTICS",
    "planting-applications": "PLANTING APPLICATIONS",
    "planting-population": "PLANTING POPULATION",
}


def _parse_scale(value_el) -> int | None:
    """Count selected circles in a ``<div class="scale">`` block.

    Returns the count (0-5; 0 means the block has no selected circle),
    or None if no scale block is present."""
    if value_el is None:
        return None
    scale = value_el.find("div", class_="scale")
    if scale is None:
        return None
    selected = scale.find_all("div", class_=lambda c: c and "selected" in c)
    return len(selected)


def _parse_value(value_el) -> str:
    """Extract a non-scale value: raw text contents, trimmed."""
    if value_el is None:
        return ""
    # If it has a .scale child we should have caught it above. Otherwise
    # return the leaf text.
    text = value_el.get_text(" ", strip=True)
    return text


# --------------------------------------------------------------------- detail


def fetch_product_detail(
    http: RateLimitedSession, url: str, crop: str, code: str,
) -> AGProduct:
    r = http.get(url)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    prod = AGProduct(
        source_key=source_key_for(code),
        source_url=url,
        crop=crop,
        product_name=code,
    )

    # Maturity — often rendered as ``<div class="maturity">86 days</div>``.
    mat_el = soup.find(class_="maturity")
    if mat_el:
        text = mat_el.get_text(strip=True)
        m = re.search(r"(\d+(?:\.\d+)?)", text)
        if m:
            if crop == "corn":
                prod.relative_maturity = m.group(1)
            elif crop == "soybeans":
                prod.maturity_group = m.group(1)

    # Trait package — from .product-details / "Trait Package"
    pd = soup.find(class_="product-details")
    if pd:
        # The details block renders pairs of label / value text:
        # "Genetic Family | Icon-J | Trait Package | VT2RIB | ..."
        # Parse the labels we recognize.
        text = pd.get_text(" | ", strip=True)
        m = re.search(r"Trait Package\s*\|\s*([^|]+?)(?:\s*\||$)", text)
        if m:
            tp = m.group(1).strip()
            if tp and tp.lower() not in ("none", "-"):
                prod.trait_descriptions = [tp]

    # Iterate all product-section blocks; bucket items per section.
    for section in soup.find_all("div", class_=re.compile(r"product-section")):
        section_classes = section.get("class", [])
        label = ""
        for cls in section_classes:
            if cls in SECTION_LABEL_MAP:
                label = SECTION_LABEL_MAP[cls]
                break
        if not label:
            title_el = section.find(class_="title")
            label = (title_el.get_text(strip=True).upper()
                     if title_el else "OTHER")

        items: list[dict] = []
        for detail in section.find_all("div", class_="detail-item"):
            label_el = detail.find("div", class_="label")
            value_el = detail.find("div", class_="value")
            ch = (label_el.get_text(" ", strip=True) if label_el else "").strip()
            if not ch:
                continue

            scale = _parse_scale(value_el)
            if scale is not None:
                items.append({"characteristic": ch, "value": str(scale)})
            else:
                v = _parse_value(value_el)
                # Special-case the "Row Type" header row from planting-population
                # which holds nested headers, not a real rating.
                if ch.lower() == "row type" and v.lower() in (
                    "low medium high", "low / medium / high",
                ):
                    continue
                if v:
                    items.append({"characteristic": ch, "value": v})

        if items:
            prod.characteristics_groups.append({
                "label": label, "type": "scale-or-value", "items": items,
            })

    return prod


# --------------------------------------------------------------------- render


def render_markdown(p: AGProduct) -> str:
    title = p.product_name or p.source_key
    crop_label = "Corn" if p.crop == "corn" else "Soybeans"
    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** AgReliant Genetics",
        "- **Brand:** AgriGold",
        f"- **Crop:** {crop_label}",
    ]
    if p.relative_maturity and p.crop == "corn":
        head.append(f"- **Relative maturity:** {p.relative_maturity}")
    if p.maturity_group and p.crop == "soybeans":
        head.append(f"- **Maturity group:** {p.maturity_group}")
    if p.trait_descriptions:
        head.append(f"- **Traits:** {', '.join(p.trait_descriptions)}")
    head.append(f"- **Source:** {p.source_url}")
    head.append(f"- **Rating scale (AgriGold):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")

    sections: list[str] = []
    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    # The extra "\n" keeps a blank line between the "---" rule that ends
    # the header and the first "## ..." section heading.
    return "\n".join(head) + "\n" + "\n".join(sections)


# --------------------------------------------------------------------- write


def write_product(prod: AGProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"

    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "agrigold",
        "source_key": prod.source_key,
        "vendor": "AgReliant Genetics",
        "brand": "AgriGold",
        "product_name": prod.product_name,
        "product_id": None,
        "hybrid_prefix": prod.product_name,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        "relative_maturity": prod.relative_maturity,
        "maturity_group": prod.maturity_group,
        "wheat_class": None,
        "trait_stack": prod.trait_descriptions,
        "trait_descriptions": prod.trait_descriptions,
        "positioning_statement": None,
        "strengths": [],
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline


def process_product(
    http: RateLimitedSession, *, url: str, crop: str, code: str, force: bool,
) -> tuple[str, AGProduct | None]:
    source_key = source_key_for(code)
    md_path = CORPUS_DIR / f"{source_key}.md"
    if md_path.exists() and not force:
        return "skipped", None
    try:
        prod = fetch_product_detail(http, url, crop, code)
    except Exception as exc:  # noqa: BLE001
        log.error("variety %s failed: %s", code, exc)
        return "failed", None
    body = render_markdown(prod)
    write_product(prod, body)
    return "written", prod


def run(*, limit: int | None, force: bool,
        only_crop: str | None, only_product: str | None) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_varieties(http, only_crop=only_crop)
    if only_product:
        targets = [
            (u, c, k) for (u, c, k) in targets
            if source_key_for(k) == only_product
            or k.lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
    for url, crop, code in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        status, prod = process_product(
            http, url=url, crop=crop, code=code, force=force,
        )
        counts[status] = counts.get(status, 0) + 1
        if prod is not None:
            log.info(
                "[%d/%s] %s %s | crop=%s rm/mg=%s traits=%s groups=%d",
                processed, str(limit) if limit else "all",
                prod.source_key, status, prod.crop,
                prod.relative_maturity or prod.maturity_group or "-",
                ",".join(prod.trait_descriptions) or "-",
                len(prod.characteristics_groups),
            )
        else:
            log.info("[%d/%s] %s %s",
                     processed, str(limit) if limit else "all",
                     source_key_for(code), status)

    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.agrigold",
        description="Scrape AgriGold (AgReliant Genetics) corn + soybean varieties.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=list(LISTING_PATHS),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or variety code.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit, force=args.force,
        only_crop=args.crop, only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,412 @@
"""Ebbert's Seeds scraper — small regional Ohio/Indiana breeder.

Source: ``www.ebbertsseeds.com`` — WordPress site. robots.txt is
permissive (``Crawl-delay: 5`` only, no Disallow). Covington, OH +
Decatur, IN — Eastern Corn Belt focus.

Catalog is structured as one scrollable page PER CROP, with each
variety rendered as a CSS-grid block of `<h1>NAME TRAIT <n> RM</h1>`
(the trailing number + "RM" is what ``_VARIETY_HEADING_RE`` matches)
+ several sub-sections (MANAGEMENT & POSITIONING / CHARACTERISTICS
/ DISEASE RATINGS) where the labels and numeric values live in
separate adjacent grid cells. Reconstructing a perfectly-aligned
{characteristic: value} dict from the multi-column layout is
fiddly; the small variety count (~17 corn + similar soy/wheat)
doesn't justify the engineering. We instead **preserve the full
text body of each variety's container** in the chunk markdown so
the LLM can read the tabular text as-is.

Pages scraped: `/corn/`, `/soybeans-2/`, `/wheat/`. Grass-seed /
forage / cover-crop pages are out of scope for the row-crop
advisor.

Rating scale: ``1-5 (1 = best, lower = more resistant)`` — same
direction as AgriPro / NK. Inferred by cross-referencing
positioning text against published values (a variety described as
"Robust tall plants" has STANDABILITY 1.0 → 1 = best).

Output:
    corpus/ebberts_seeds/<source_key>.md
    corpus/ebberts_seeds/<source_key>.json

source_key: ``ebberts-<slug>`` lowercased, e.g.
``ebberts-7000tr-rib`` or ``ebberts-1335-conventional``.

CLI:
    python -m scrape.sources.ebberts_seeds --crop corn --limit 5
    python -m scrape.sources.ebberts_seeds --force
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.ebbertsseeds.com"

# Ebbert's per-crop catalog pages. URL paths confirmed via homepage
# nav links 2026-05-26.
CROP_PAGES = {
    "corn": "/corn/",
    "soybeans": "/soybeans-2/",
    "wheat": "/wheat/",
}

# Per robots.txt: Crawl-delay: 5 (seconds). We respect that.
REQ_INTERVAL_SEC = 5.0

RATING_SCALE_DIRECTION = "1-5 (1 = best, lower = more resistant)"

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "ebberts_seeds"

log = logging.getLogger("scrape.ebberts_seeds")


# --------------------------------------------------------------------- HTTP


class RateLimitedSession:
    """robots.txt asks for a 5-second Crawl-delay; we honor it. This
    scraper fetches one catalog page per crop (see ``CROP_PAGES``), so
    the delay adds only a few seconds to a full run."""

    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(self, method: str, url: str, *, max_retries: int = 4,
                timeout: float = 30.0, **kw: Any) -> requests.Response:
        last_exc: Exception | None = None
        resp: requests.Response | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            # This attempt produced a response, so any earlier network
            # error is stale and must not be re-raised after the loop.
            last_exc = None
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc is not None:
            raise last_exc
        if resp is None:
            raise ValueError("max_retries must be at least 1")
        # Retries exhausted on 429/5xx: return the last response so the
        # caller's raise_for_status() reports the failure.
        return resp

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)


# --------------------------------------------------------------------- model


@dataclass
class EbProduct:
    source_key: str
    source_url: str  # the per-crop page URL (Ebbert's doesn't have per-variety pages)
    crop: str
    product_name: str = ""  # "7000TR RIB", "1335 CONVENTIONAL"
    trait_label: str | None = None  # "RIB", "CONVENTIONAL", "PC", "SSX RIB", etc.
    relative_maturity: str | None = None  # corn
    maturity_group: str | None = None  # soy
    body_text: str = ""  # verbatim text of the variety's container


# --------------------------------------------------------------------- discovery + parse


_VARIETY_HEADING_RE = re.compile(
    r"^(?P<name>\S+(?:\s+\S+)*?)\s+(?P<rm>\d+(?:\.\d+)?)\s*RM$",
    re.IGNORECASE,
)


def _variety_text(h1, next_h1) -> str:
    """Collect the visible text from this variety's <h1> up to (but
    not including) the next variety's <h1>, walking the DOM in
    document order.

    Ebbert's grid layout spreads each variety's content across many
    sibling ``.x-cell`` blocks in the outer container; the h1's
    immediate parent only holds the title cell. The correct boundary
    is the next variety h1 in document order.
    """
    chunks: list[str] = [h1.get_text(strip=True)]
    for node in h1.find_all_next(string=True):
        ancestors = list(node.parents)
        # find_all_next() starts with the h1's own strings. The title is
        # already in ``chunks``, so skip them to avoid a duplicate.
        if any(a is h1 for a in ancestors):
            continue
        # Stop once we cross into the next variety's h1.
        if next_h1 is not None and any(a is next_h1 for a in ancestors):
            break
        text = str(node).strip()
        if text:
            chunks.append(text)
    body = " | ".join(chunks)
    body = re.sub(r"\s*\|\s*\|\s*", " | ", body)
    body = re.sub(r"\s+", " ", body).strip()
    return body


def _slug(text: str) -> str:
    s = re.sub(r"[^a-zA-Z0-9]+", "-", text).strip("-").lower()
    return s


def discover_and_parse(
    http: RateLimitedSession, *, only_crop: str | None = None,
) -> list[EbProduct]:
    """Fetch one page per crop and extract every variety container."""
    out: list[EbProduct] = []
    for crop, path in CROP_PAGES.items():
        if only_crop and crop != only_crop:
            continue
        url = f"{BASE}{path}"
        log.info("fetching %s", url)
        r = http.get(url)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")

        # Every variety is anchored by an <h1>NAME ... <n> RM</h1>.
        v_h1s = [
            h for h in soup.find_all("h1")
            if _VARIETY_HEADING_RE.match(h.get_text(strip=True))
        ]
        log.info(" %s: %d varieties", crop, len(v_h1s))

        for i, h1 in enumerate(v_h1s):
            title = h1.get_text(strip=True)
            m = _VARIETY_HEADING_RE.match(title)
            if not m:
                continue
            name = m.group("name").strip()
            maturity = m.group("rm")

            next_h1 = v_h1s[i + 1] if i + 1 < len(v_h1s) else None
            body = _variety_text(h1, next_h1)

            prod = EbProduct(
                source_key=f"ebberts-{_slug(name)}",
                source_url=url,
                crop=crop,
                product_name=name,
                relative_maturity=maturity if crop == "corn" else None,
                maturity_group=maturity if crop == "soybeans" else None,
                body_text=body,
            )
            # Derive trait_label from everything after the first token of
            # the name (CONVENTIONAL, RIB, PC, SSX RIB, TR RIB, etc.).
            # Best-effort, doesn't have to be perfect.
            parts = name.split(maxsplit=1)
            if len(parts) == 2:
                prod.trait_label = parts[1]
            out.append(prod)
    log.info("total varieties discovered: %d", len(out))
    return out


# --------------------------------------------------------------------- render


def render_markdown(p: EbProduct) -> str:
    title = p.product_name or p.source_key
    crop_label = {"corn": "Corn", "soybeans": "Soybeans",
                  "wheat": "Wheat"}.get(p.crop, p.crop.title())
    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** Ebbert's Seeds (independent regional breeder)",
        "- **Brand:** Ebbert's Seeds",
        f"- **Crop:** {crop_label}",
    ]
    if p.relative_maturity and p.crop == "corn":
        head.append(f"- **Relative maturity:** {p.relative_maturity}")
    if p.maturity_group and p.crop == "soybeans":
        head.append(f"- **Maturity group:** {p.maturity_group}")
    if p.trait_label:
        head.append(f"- **Trait stack (label):** {p.trait_label}")
    head.append(f"- **Source:** {p.source_url}")
    head.append(f"- **Rating scale (Ebbert's):** {RATING_SCALE_DIRECTION}")
    head.append("- **Service area:** Covington, OH + Decatur, IN — Eastern Corn Belt regional")
    head.append("")
    head.append("---")
    head.append("")
    head.append("## Variety detail (verbatim from page)")
    head.append("")
    head.append(p.body_text)
    head.append("")
    return "\n".join(head)


# --------------------------------------------------------------------- write


def write_product(prod: EbProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"

    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "ebberts_seeds",
        "source_key": prod.source_key,
        "vendor": "Ebbert's Seeds",
        "brand": "Ebbert's Seeds",
        "product_name": prod.product_name,
        "product_id": None,
        "hybrid_prefix": prod.product_name,
        "hybrid_suffix": prod.trait_label,
        "crop": prod.crop,
        "release_year": None,
        "relative_maturity": prod.relative_maturity,
        "maturity_group": prod.maturity_group,
        "wheat_class": None,
        "trait_stack": [prod.trait_label] if prod.trait_label else [],
        "trait_descriptions": [],
        "positioning_statement": None,
        "strengths": [],
        # No structured groups — the body markdown carries the table
        # text verbatim. characteristics_groups stays empty so the
        # chunker doesn't try to bucket non-existent items.
        "characteristics_groups": [],
        "page_text_chars": len(prod.body_text),
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [
            {"product_list_name": "Ebbert's service area (Eastern Corn Belt — OH/IN/IL)",
             "agronomist": None, "agronomist_email": None, "variant_id": None},
        ],
        "image_url": None,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- pipeline


def run(*, limit: int | None, force: bool,
        only_crop: str | None, only_product: str | None) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    products = discover_and_parse(http, only_crop=only_crop)

    if only_product:
        products = [
            p for p in products
            if p.source_key == only_product
            or p.product_name.lower() == only_product.lower()
        ]
        if not products:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0}
    processed = 0
    for prod in products:
        if limit is not None and processed >= limit:
            break
        processed += 1
        md_path = CORPUS_DIR / f"{prod.source_key}.md"
        if md_path.exists() and not force:
            counts["skipped"] += 1
            log.info("[%d/%s] %s skipped",
                     processed, str(limit) if limit else len(products),
                     prod.source_key)
            continue
        body = render_markdown(prod)
        write_product(prod, body)
        counts["written"] += 1
        log.info(
            "[%d/%s] %s written | crop=%s rm/mg=%s trait=%s chars=%d",
            processed, str(limit) if limit else len(products),
            prod.source_key, prod.crop,
            prod.relative_maturity or prod.maturity_group or "-",
            prod.trait_label or "-", len(prod.body_text),
        )

    log.info(
        "done: processed=%d written=%d skipped=%d (of %d varieties)",
        processed, counts["written"], counts["skipped"], len(products),
    )
    return 0


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.ebberts_seeds",
        description="Scrape Ebbert's Seeds (regional Eastern Corn Belt breeder) — "
                    "corn / soybeans / wheat.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    # The crop pages are always fetched during discovery; --force only
    # controls whether existing output files are overwritten.
    p.add_argument("--force", action="store_true",
                   help="Rewrite output even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=list(CROP_PAGES),
                   help="Limit to one crop (corn / soybeans / wheat).")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or product name.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit, force=args.force,
        only_crop=args.crop, only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())
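The trait-label heuristic at the end of the Ebbert's discovery loop can be tried in isolation. What follows is a minimal sketch rather than the scraper's own code: the helper name split_trait_label and the sample variety names are hypothetical, and the function only mirrors the name.split(maxsplit=1) step shown in the diff above.

```python
from __future__ import annotations


def split_trait_label(name: str) -> tuple[str, str | None]:
    """Split a variety name into (code, trait_label).

    Everything after the first whitespace-separated token is treated as
    the trait label. Nothing verifies that the remainder is really a
    trait, so this is best-effort only.
    """
    parts = name.split(maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # Single-token names carry no trait label.
    return name.strip(), None


# Hypothetical names, for illustration only.
print(split_trait_label("7250 SSX RIB"))
print(split_trait_label("E418"))
```

Because the split keeps the whole remainder, multi-word labels such as "SSX RIB" stay intact, at the cost of also capturing any non-trait text that happens to follow the code.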
@@ -0,0 +1,503 @@
"""LG Seeds scraper — AgReliant Genetics brand.

Source: ``www.lgseeds.com`` — WordPress site. Empty robots.txt
(no Disallow). Catalog covers 4 crops: corn, soybeans, alfalfa,
sorghum.

Two-layer fetch:

1. **Listing page** (one per crop): inline JavaScript variable
   ``products = [{...}, ...]`` carries the full variety summary —
   Variety code, Maturity, Traits[], Bullets[], CropType. No
   per-variety HTTP needed for identity.

2. **Detail page** (``/products/<crop>/<Variety>``): rich plant
   characteristics + disease tolerance + management ratings,
   rendered as ``<div class="characteristics-bar">`` blocks with
   ``<span class="bar-N">`` where N ∈ 1-9 is the rating. Same
   convention as Bayer/Golden Harvest (9 = best).

LG Seeds is a regional brand (Eastern Corn Belt focus) under
AgReliant Genetics, the same parent as AgriGold. Brand voice is
distinct so we keep them in separate scrapers.

Rating scale: ``1-9 (9 = best)`` — verified empirically on the
bar-N markup; matches Bayer / Golden Harvest convention.

Output:
    corpus/lg_seeds/<source_key>.md
    corpus/lg_seeds/<source_key>.json

source_key: ``lg-<variety>`` lowercased, e.g. ``lg-lg5701``,
``lg-c3400`` (soybean — codes don't use LG prefix), ``lg-7c300``
(alfalfa), ``lg-silo-max-100`` (sorghum).

CLI:
    python -m scrape.sources.lg_seeds --crop corn --limit 5
    python -m scrape.sources.lg_seeds --force
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.lgseeds.com"

# Crops listed in nav. Each has a listing page at /products/<crop>
# with an inline `var products = [...]` JSON blob.
LISTING_PATHS = {
    "corn": "/products/corn",
    "soybeans": "/products/soybeans",
    "alfalfa": "/products/alfalfa",
    "sorghum": "/products/sorghum",
}

RATING_SCALE_DIRECTION = "1-9 (9 = best)"

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "lg_seeds"

REQ_INTERVAL_SEC = 1.0

log = logging.getLogger("scrape.lg_seeds")


# --------------------------------------------------------------------- HTTP


class RateLimitedSession:
    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(self, method: str, url: str, *, max_retries: int = 4,
                timeout: float = 30.0, **kw: Any) -> requests.Response:
        last_exc: Exception | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = float(ra) if (ra and ra.isdigit()) else min(30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc:
            raise last_exc
        return resp  # type: ignore[return-value]

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)


# --------------------------------------------------------------------- model


@dataclass
class LGProduct:
    source_key: str
    source_url: str
    crop: str
    product_name: str = ""
    product_id: int | None = None
    maturity_raw: str | None = None  # corn RM days / soy MG / alfalfa FD / sorghum days
    fall_dormancy: str | None = None  # alfalfa only
    trait_descriptions: list[str] = field(default_factory=list)
    bullets: list[str] = field(default_factory=list)
    characteristics_groups: list[dict] = field(default_factory=list)


# --------------------------------------------------------------------- discovery


_VAR_RE = re.compile(
    r'var\s+\w+\s*=\s*(\[\{"Variety":.+?\}\]);', re.S,
)


def discover_varieties(
    http: RateLimitedSession, *, only_crop: str | None = None,
) -> list[tuple[str, dict]]:
    """Return ``[(crop, summary_dict), ...]`` from each listing page's
    inline JSON. Summary dict has Variety / Id / Maturity / Traits /
    Bullets / CropType / FallDormancy."""
    out: list[tuple[str, dict]] = []
    for crop, path in LISTING_PATHS.items():
        if only_crop and crop != only_crop:
            continue
        log.info("fetching listing %s%s", BASE, path)
        r = http.get(f"{BASE}{path}")
        r.raise_for_status()
        m = _VAR_RE.search(r.text)
        if not m:
            log.warning("no products array in %s", path)
            continue
        try:
            items = json.loads(m.group(1))
        except json.JSONDecodeError as exc:
            log.error("JSON parse failed for %s: %s", path, exc)
            continue
        log.info(" %s: %d varieties", crop, len(items))
        for it in items:
            out.append((crop, it))
    log.info("total varieties discovered: %d", len(out))
    return out


# --------------------------------------------------------------------- helpers


def source_key_for(variety: str) -> str:
    """Slugify the variety code into a stable source_key."""
    slug = re.sub(r"[^a-zA-Z0-9-]+", "-", variety).strip("-").lower()
    return f"lg-{slug}"


_BAR_CLASS_RE = re.compile(r"^bar-(\d)$")


def _parse_bar_value(span_classes: list[str]) -> int | None:
    """Extract the integer rating from a ``bar-N`` CSS class."""
    for c in span_classes or []:
        m = _BAR_CLASS_RE.match(c)
        if m:
            return int(m.group(1))
    return None


# --------------------------------------------------------------------- detail


def fetch_product_detail(
    http: RateLimitedSession, summary: dict, crop: str,
) -> LGProduct:
    """Fetch the detail page and merge characteristics into an
    LGProduct seeded by the listing-page summary."""
    variety = summary.get("Variety") or ""
    # LG's detail URL is /products/<crop>/<Variety>. The Variety in the
    # listing JSON appears in correct case; LG seems to accept any case
    # but we use what's published.
    url = f"{BASE}/products/{crop}/{variety}"
    prod = LGProduct(
        source_key=source_key_for(variety),
        source_url=url,
        crop=crop,
        product_name=variety,
        product_id=summary.get("Id"),
        maturity_raw=str(summary.get("Maturity")) if summary.get("Maturity") is not None else None,
        fall_dormancy=str(summary.get("FallDormancy")) if summary.get("FallDormancy") else None,
        trait_descriptions=list(summary.get("Traits") or []),
        bullets=list(summary.get("Bullets") or []),
    )

    try:
        r = http.get(url)
        r.raise_for_status()
    except Exception as exc:  # noqa: BLE001
        log.warning("detail fetch failed for %s: %s", variety, exc)
        return prod  # identity-only fallback

    soup = BeautifulSoup(r.text, "html.parser")

    # The detail page has multiple .product-section blocks; each has
    # a heading + a collection of .characteristics-bar rows. We bucket
    # by the section's text content. Common LG section labels:
    # "Characteristics" / "Management" / "Disease Tolerance".
    sections: list[tuple[str, list[dict]]] = []
    for section in soup.find_all("div", class_=re.compile(r"product-section")):
        # Heading is the first text node inside the section, before bars.
        # The section class often includes a hint like "disease-toler",
        # "plantCharacteristics", "management-pr".
        section_classes = " ".join(section.get("class", []))
        bars = section.find_all("div", class_="characteristics-bar")
        if not bars:
            continue

        # Section label — use the first heading-like element or the
        # text right after the section class anchor.
        label = ""
        for h in section.find_all(["h2", "h3", "h4"]):
            t = h.get_text(strip=True)
            if t:
                label = t
                break
        if not label:
            # fallback: section_classes hint
            if "disease" in section_classes.lower():
                label = "Disease Tolerance"
            elif "management" in section_classes.lower():
                label = "Management"
            elif "plantcharacteristics" in section_classes.lower():
                label = "Characteristics"

        items: list[dict] = []
        for bar in bars:
            name_el = bar.find(class_="product-name")
            value_span = bar.find("span", class_=_BAR_CLASS_RE)
            name = (name_el.get_text(" ", strip=True) if name_el else "").strip()
            rating = _parse_bar_value(value_span.get("class") if value_span else [])
            if not name:
                continue
            # Some "bars" are actually qualitative (e.g. "Tar Spot Susceptible",
            # "Fungicide Response High"). For those we keep the label as the
            # value text rather than a missing rating.
            if rating is None:
                # Look inside the bar element for a non-name text snippet
                inner_text = bar.get_text(" ", strip=True)
                # Strip the label off the front
                if inner_text.startswith(name):
                    inner_text = inner_text[len(name):].strip()
                items.append({"characteristic": name, "value": inner_text or "-"})
            else:
                items.append({"characteristic": name, "value": str(rating)})

        if items:
            sections.append((label or "Characteristics", items))

    prod.characteristics_groups = [
        {"label": label.upper(), "type": "bars", "items": items}
        for label, items in sections
    ]

    return prod


# --------------------------------------------------------------------- render


def render_markdown(p: LGProduct) -> str:
    title = p.product_name or p.source_key
    crop_label = {
        "corn": "Corn", "soybeans": "Soybeans",
        "alfalfa": "Alfalfa", "sorghum": "Sorghum",
    }.get(p.crop, p.crop.title())

    head: list[str] = [
        f"# {title}",
        "",
        "- **Vendor:** AgReliant Genetics",
        "- **Brand:** LG Seeds",
        f"- **Crop:** {crop_label}",
    ]
    if p.maturity_raw:
        if p.crop == "corn":
            head.append(f"- **Relative maturity:** {p.maturity_raw}")
        elif p.crop == "soybeans":
            head.append(f"- **Maturity group:** {p.maturity_raw}")
        elif p.crop == "alfalfa":
            head.append(f"- **Fall dormancy / maturity:** {p.maturity_raw}")
        elif p.crop == "sorghum":
            head.append(f"- **Days to maturity:** {p.maturity_raw}")
    if p.trait_descriptions:
        head.append(f"- **Traits:** {', '.join(p.trait_descriptions)}")
    head.append(f"- **Source:** {p.source_url}")
    head.append(f"- **Rating scale (LG Seeds):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")

    sections: list[str] = []
    if p.bullets:
        bullets = "\n".join(f"- {b}" for b in p.bullets)
        sections.append("## Strengths\n\n" + bullets + "\n")

    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    return "\n".join(head) + "\n".join(sections)


# --------------------------------------------------------------------- write


def write_product(prod: LGProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"

    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "lg_seeds",
        "source_key": prod.source_key,
        "vendor": "AgReliant Genetics",
        "brand": "LG Seeds",
        "product_name": prod.product_name,
        "product_id": prod.product_id,
        "hybrid_prefix": prod.product_name,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        # Maturity routing: corn = RM days, soy = MG, alfalfa = FD,
        # sorghum = days-to-maturity. Stored in the canonical fields
        # so the chunker's crop-aware preamble works.
        "relative_maturity": prod.maturity_raw if prod.crop in ("corn", "sorghum") else None,
        "maturity_group": prod.maturity_raw if prod.crop == "soybeans" else None,
        "fall_dormancy": prod.maturity_raw if prod.crop == "alfalfa" else prod.fall_dormancy,
        "wheat_class": None,
        "trait_stack": prod.trait_descriptions,  # LG publishes full names, not codes
        "trait_descriptions": prod.trait_descriptions,
        "positioning_statement": None,
        "strengths": prod.bullets,
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": None,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline


def process_product(
    http: RateLimitedSession, summary: dict, crop: str, *, force: bool,
) -> tuple[str, LGProduct | None]:
    variety = summary.get("Variety") or ""
    source_key = source_key_for(variety)
    md_path = CORPUS_DIR / f"{source_key}.md"
    if md_path.exists() and not force:
        return "skipped", None
    try:
        prod = fetch_product_detail(http, summary, crop)
    except Exception as exc:  # noqa: BLE001
        log.error("variety %s failed: %s", variety, exc)
        return "failed", None
    body = render_markdown(prod)
    write_product(prod, body)
    return "written", prod


def run(
    *, limit: int | None, force: bool,
    only_crop: str | None, only_product: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_varieties(http, only_crop=only_crop)
    if only_product:
        targets = [
            (c, s) for (c, s) in targets
            if source_key_for(s.get("Variety", "")) == only_product
            or s.get("Variety", "").lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
    for crop, summary in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        status, prod = process_product(http, summary, crop, force=force)
        counts[status] = counts.get(status, 0) + 1
        if prod is not None:
            log.info(
                "[%d/%s] %s %s | crop=%s maturity=%s traits=%d groups=%d",
                processed, str(limit) if limit else "all",
                prod.source_key, status, prod.crop,
                prod.maturity_raw or "-",
                len(prod.trait_descriptions),
                len(prod.characteristics_groups),
            )
        else:
            log.info("[%d/%s] %s %s",
                     processed, str(limit) if limit else "all",
                     source_key_for(summary.get("Variety", "")), status)

    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI


def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.lg_seeds",
        description="Scrape LG Seeds (AgReliant Genetics) — corn / "
                    "soybeans / alfalfa / sorghum.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=list(LISTING_PATHS),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or Variety code.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p


def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit, force=args.force,
        only_crop=args.crop, only_product=args.product,
    )


if __name__ == "__main__":
    sys.exit(main())
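The two LG Seeds parsing steps (inline JSON on the listing page, bar-N class on the detail page) can be exercised without network access. This is a minimal sketch, assuming the same two regular expressions as the diff above; the SAMPLE markup, the variety code LG0000XX, and the function names extract_products and bar_rating are made up for illustration and are not taken from the LG site or the scraper.

```python
import json
import re

# Same patterns as the scraper's _VAR_RE and _BAR_CLASS_RE.
VAR_RE = re.compile(r'var\s+\w+\s*=\s*(\[\{"Variety":.+?\}\]);', re.S)
BAR_RE = re.compile(r"^bar-(\d)$")

# Hypothetical listing-page fragment, for illustration only.
SAMPLE = '''
<script>
var products = [{"Variety":"LG0000XX","Maturity":107,"Traits":["Example Trait"]}];
</script>
'''


def extract_products(html):
    """Return the list embedded in `var <name> = [...];`, or [] if absent."""
    m = VAR_RE.search(html)
    return json.loads(m.group(1)) if m else []


def bar_rating(classes):
    """Return the single-digit rating from a `bar-N` class, else None."""
    for c in classes:
        m = BAR_RE.match(c)
        if m:
            return int(m.group(1))
    return None


products = extract_products(SAMPLE)
print(products[0]["Variety"], products[0]["Maturity"])
print(bar_rating(["bar", "bar-7"]))
```

One caveat with this approach: the non-greedy match ends at the first `}];`, so a string value containing that sequence would truncate the blob, and json.loads would then most likely raise instead of returning partial data. The single-digit pattern also means a class such as bar-10 is treated as "no rating", which matches the 1-9 scale described in the docstring.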