bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton

User flagged that Channel is expanding into their area. Re-walked the
cropscience.bayer.us sitemap and found 8 additional brand×crop paths
beyond the original DEKALB/Asgrow/WestBred triple. This patches the
scraper to walk all of them; total Bayer varieties roughly doubles from
475 to 931, and the corpus picks up first-ever coverage in sorghum (36),
cotton (30), canola (6), and silage as a distinct crop (it was
previously conflated with corn).

Net new varieties: 456
  Channel    corn=181 soy=67 silage=54 sorghum=18  (320)
  DEKALB     silage=82 sorghum=18 canola=6         (106)
  Deltapine  cotton=30                             (30)

scrape/sources/bayer_seeds.py
  - Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1
    suffix) with a flatter `BRAND_PATHS` list of
    (brand, url_path, crop, is_primary_for_brand) entries. Channel and
    DEKALB are now multi-crop brands; the same scraper walks every
    brand×crop pair.
  - source_key derivation: for a brand's PRIMARY crop, strip the
    trailing `-<crop>` suffix (matches the existing deployed source
    keys for DEKALB corn / Asgrow soy / WestBred wheat). For SECONDARY
    crops, KEEP the suffix so that the same DEKALB SKU sold as both
    grain corn and silage gets two distinct source_keys (collision-safe
    and unambiguous for `lookup_variety`).
  - New `--crop` CLI filter for incremental backfills.
  - Log line shows brand + crop alongside source_key for visibility.

rag/chunk.py
  - Channel + Deltapine pages use slightly different characteristics
    group labels (DISEASE not DISEASE RATINGS, AGRONOMIC
    CHARACTERISTICS not GROWTH/HARVEST, plus MATURITY / ADAPTATION /
    HERBICIDES / OTHER). Fold them into the DISEASE / AGRONOMIC /
    MANAGEMENT label sets so the chunker buckets them correctly into
    the standard sections.

Smoke-tested cross-brand × cross-crop queries against the rebuilt index
(5,529 chunks total). All 6 sample queries surface the right brand+crop
in the top 3:

  Channel corn 110 RM     → 210-25TRE BRAND
  Channel soy 2.5 MG IA   → 2622RXF BRAND
  Deltapine cotton XF     → DP 1820 B3XF BRAND
  Sorghum dryland Kansas  → 6B95 BRAND (Channel)
  Silage corn WI dairy    → DKC64-44RIB BRAND BLEND (silage variant)
  Canola Northern Plains  → DK401TL BRAND

Watchtower will pull the new image on the next push; deploy is
unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
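
The rag/chunk.py change is described in the message but its diff is not shown here, so the following is only a rough sketch of what "folding" the new group labels into the three label sets could look like. The label strings come from the commit message; the names `DISEASE_LABELS`, `AGRONOMIC_LABELS`, `MANAGEMENT_LABELS`, and `section_for` are hypothetical, and the bucket chosen for MATURITY / ADAPTATION / HERBICIDES / OTHER is a guess, not the author's mapping.

```python
# Hypothetical sketch: the real rag/chunk.py is not part of this diff.
# Label strings are taken from the commit message; which bucket each of
# MATURITY / ADAPTATION / HERBICIDES / OTHER belongs to is an assumption.
DISEASE_LABELS = {"DISEASE RATINGS", "DISEASE"}
AGRONOMIC_LABELS = {
    "GROWTH/HARVEST",
    "AGRONOMIC CHARACTERISTICS",
    "MATURITY",      # assumed bucket
    "ADAPTATION",    # assumed bucket
}
MANAGEMENT_LABELS = {
    "HERBICIDES",    # assumed bucket
    "OTHER",         # assumed bucket
}


def section_for(group_label: str) -> str:
    """Map a characteristics group label to a standard section name."""
    # Normalise case and internal whitespace before the set lookup.
    label = " ".join(group_label.upper().split())
    if label in DISEASE_LABELS:
        return "disease"
    if label in AGRONOMIC_LABELS:
        return "agronomic"
    if label in MANAGEMENT_LABELS:
        return "management"
    return "unclassified"


print(section_for("Disease"))                    # Channel/Deltapine style label
print(section_for("DISEASE RATINGS"))            # original style label
print(section_for("Agronomic  Characteristics"))
```

Keeping the labels in plain sets means a new brand's wording can be supported by adding one string, without touching the bucketing logic.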
+100
-50
@@ -79,21 +79,41 @@ USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
 BASE = "https://www.cropscience.bayer.us"
 SITEMAP_URL = f"{BASE}/sitemap-dynamic.xml"
 
-# Brand → (URL path segment, crop label). Ordering here defines the
-# `--all` walk order and the `--brand` choices.
-BRANDS: dict[str, tuple[str, str]] = {
-    "dekalb": ("/corn/dekalb/", "corn"),
-    "asgrow": ("/soybeans/asgrow/", "soybeans"),
-    "westbred": ("/wheat/westbred/", "wheat"),
-}
+# All Bayer brand × crop paths in the cropscience.bayer.us sitemap.
+# Each entry: (brand_key, url_path_prefix, crop_label, is_primary_for_brand).
+#
+# `is_primary` controls source_key derivation: for a brand's primary
+# crop we STRIP the trailing crop suffix from the URL tail (so
+# DEKALB corn `dekalb-dkc62-08rib-corn` → source_key
+# `dekalb-dkc62-08rib`, matching the corpus we deployed 2026-05-25).
+# For non-primary crops we KEEP the suffix (so DEKALB silage
+# `dekalb-dkc093-05rib-silage` → source_key
+# `dekalb-dkc093-05rib-silage`, distinct from the corn key and
+# collision-safe when the same SKU is marketed as both grain and
+# silage).
+#
+# Counts as of 2026-05-25 sitemap walk:
+# DEKALB corn=288 silage=82 sorghum=18 canola=6
+# Asgrow soy=102
+# WestBred wheat=85
+# Channel corn=181 soy=67 silage=54 sorghum=18
+# Deltapine cotton=30
+BRAND_PATHS: list[tuple[str, str, str, bool]] = [
+    ("dekalb", "/corn/dekalb/", "corn", True),
+    ("dekalb", "/silage/dekalb/", "silage", False),
+    ("dekalb", "/sorghum/dekalb/", "sorghum", False),
+    ("dekalb", "/canola/dekalb/", "canola", False),
+    ("asgrow", "/soybeans/asgrow/", "soybeans", True),
+    ("westbred", "/wheat/westbred/", "wheat", True),
+    ("channel", "/corn/channel/", "corn", True),
+    ("channel", "/soybeans/channel/", "soybeans", False),
+    ("channel", "/silage/channel/", "silage", False),
+    ("channel", "/sorghum/channel/", "sorghum", False),
+    ("deltapine", "/cotton/deltapine/", "cotton", True),
+]
 
-# Per-brand crop-suffix to strip off the URL's terminal slug when
-# computing source_key (so ``dekalb-dkc075-70rib-corn`` → ``dekalb-dkc075-70rib``).
-CROP_SUFFIX = {
-    "dekalb": "-corn",
-    "asgrow": "-soybeans",
-    "westbred": "-wheat",
-}
+# Distinct brand-key list for the --brand CLI choices.
+BRANDS = sorted({b for b, _p, _c, _pri in BRAND_PATHS})
 
 # Catalog/landing pages that live under the brand path but are NOT
 # individual varieties. Skip these during discovery.
@@ -237,17 +257,25 @@ def parse_next_data(html: str) -> dict[str, Any]:
     return json.loads(m.group(1))
 
 
-def source_key_from_url(url: str, brand: str) -> str:
+def source_key_from_url(url: str, brand: str, crop: str, is_primary: bool) -> str:
     """Derive ``<brand>-<sku>`` slug from the product URL.
 
-    Drops the trailing ``-<crop>`` suffix Bayer puts on every product
-    URL terminal segment (``dekalb-dkc075-70rib-corn`` →
-    ``dekalb-dkc075-70rib``).
+    For a brand's PRIMARY crop (DEKALB/corn, Asgrow/soybeans,
+    WestBred/wheat, Channel/corn, Deltapine/cotton): strip the
+    trailing ``-<crop>`` suffix Bayer puts on every URL — so
+    ``dekalb-dkc075-70rib-corn`` becomes ``dekalb-dkc075-70rib``.
+
+    For SECONDARY crops on a multi-crop brand (DEKALB silage /
+    sorghum / canola; Channel soybeans / silage / sorghum): KEEP
+    the crop suffix so the same SKU marketed under multiple crops
+    gets distinct source_keys and `lookup_variety(...)` stays
+    unambiguous.
     """
     tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
-    suffix = CROP_SUFFIX.get(brand, "")
-    if suffix and tail.endswith(suffix):
-        tail = tail[: -len(suffix)]
+    if is_primary:
+        suffix = f"-{crop}"
+        if tail.endswith(suffix):
+            tail = tail[: -len(suffix)]
     return tail
 
 
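
As a quick check of the primary/secondary rule in the hunk above, here is the new function body lifted into a standalone script and run on the two slugs quoted in the `BRAND_PATHS` comment. The full URLs are assumptions (the `BASE` host plus the path prefix plus the slug); only the slugs themselves appear in the source.

```python
def source_key_from_url(url: str, brand: str, crop: str, is_primary: bool) -> str:
    """Standalone copy of the new derivation rule, for illustration only."""
    # Last path segment of the URL, lowercased; a trailing slash is ignored.
    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
    if is_primary:
        # Primary crop: drop the "-<crop>" suffix to match the deployed keys.
        suffix = f"-{crop}"
        if tail.endswith(suffix):
            tail = tail[: -len(suffix)]
    # Secondary crop: the suffix is kept, so the key stays crop-specific.
    return tail


BASE = "https://www.cropscience.bayer.us"

# Assumed URL layout: BASE + path prefix + slug.
corn_key = source_key_from_url(
    f"{BASE}/corn/dekalb/dekalb-dkc62-08rib-corn", "dekalb", "corn", True
)
silage_key = source_key_from_url(
    f"{BASE}/silage/dekalb/dekalb-dkc093-05rib-silage/", "dekalb", "silage", False
)

print(corn_key)    # dekalb-dkc62-08rib
print(silage_key)  # dekalb-dkc093-05rib-silage
```

In the body shown in this hunk, `brand` no longer influences the result; the outcome depends only on `crop` and `is_primary`.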
@@ -269,11 +297,16 @@ def discover_varieties(
     http: RateLimitedSession,
     *,
     only_brand: str | None = None,
-) -> list[tuple[str, str, str, str]]:
-    """Return ``[(url, brand, crop, lastmod), ...]`` for every Bayer
-    seed variety found in the dynamic sitemap.
+    only_crop: str | None = None,
+) -> list[tuple[str, str, str, bool, str]]:
+    """Return ``[(url, brand, crop, is_primary, lastmod), ...]`` for
+    every Bayer seed variety found in the dynamic sitemap.
 
-    ``brand`` is the lowercase brand key (matches ``BRANDS``).
+    ``brand`` is the lowercase brand key (one of ``BRANDS``).
+    ``crop`` is the crop label (corn/soybeans/wheat/silage/sorghum/
+    canola/cotton) determined by the URL path segment.
+    ``is_primary`` is True when this is the brand's primary crop —
+    drives the source_key suffix-stripping rule.
     ``lastmod`` is the ISO 8601 timestamp from the sitemap entry.
     """
     log.info("fetching sitemap %s", SITEMAP_URL)
@@ -281,29 +314,31 @@ def discover_varieties(
     r.raise_for_status()
     xml = r.text
 
-    # Tiny regex parse — sitemap is flat and well-formed; no need for
-    # the lxml dependency on a single 600KB file.
     entries = re.findall(
         r"<url>\s*<loc>([^<]+)</loc>\s*(?:<lastmod>([^<]+)</lastmod>)?",
         xml,
     )
     log.info("sitemap parsed: %d total URLs", len(entries))
 
-    out: list[tuple[str, str, str, str]] = []
+    out: list[tuple[str, str, str, bool, str]] = []
     for url, lastmod in entries:
-        for brand, (brand_path, crop) in BRANDS.items():
+        for brand, brand_path, crop, is_primary in BRAND_PATHS:
             if only_brand and brand != only_brand:
                 continue
+            if only_crop and crop != only_crop:
+                continue
             if brand_path in url and looks_like_variety_url(url, brand_path):
-                out.append((url, brand, crop, lastmod or ""))
+                out.append((url, brand, crop, is_primary, lastmod or ""))
                 break
 
-    by_brand: dict[str, int] = {}
-    for _, b, _, _ in out:
-        by_brand[b] = by_brand.get(b, 0) + 1
-    log.info("variety URLs found: %s (total=%d)",
-             ", ".join(f"{k}={v}" for k, v in sorted(by_brand.items())),
-             len(out))
+    by_brand_crop: dict[tuple[str, str], int] = {}
+    for _, b, c, _, _ in out:
+        by_brand_crop[(b, c)] = by_brand_crop.get((b, c), 0) + 1
+    log.info(
+        "variety URLs found: %s (total=%d)",
+        ", ".join(f"{b}/{c}={n}" for (b, c), n in sorted(by_brand_crop.items())),
+        len(out),
+    )
     return out
 
 
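
The matching loop in the hunk above can be tried in isolation. The sketch below is a simplified stand-in, not the scraper itself: it omits the `looks_like_variety_url` check (which is not shown in this diff), uses a trimmed `BRAND_PATHS`, and the sample URLs, including the `channel-6b95-sorghum` slug and the `example.invalid` host, are made up for illustration.

```python
from collections import Counter

# Trimmed copy of BRAND_PATHS: (brand, url_path_prefix, crop, is_primary).
BRAND_PATHS = [
    ("dekalb", "/corn/dekalb/", "corn", True),
    ("dekalb", "/silage/dekalb/", "silage", False),
    ("channel", "/sorghum/channel/", "sorghum", False),
]


def match_entries(entries, only_brand=None, only_crop=None):
    """Tag each sitemap (url, lastmod) pair with the first matching brand×crop."""
    out = []
    for url, lastmod in entries:
        for brand, path, crop, is_primary in BRAND_PATHS:
            if only_brand and brand != only_brand:
                continue
            if only_crop and crop != only_crop:
                continue
            if path in url:  # the real code also calls looks_like_variety_url()
                out.append((url, brand, crop, is_primary, lastmod or ""))
                break  # first matching path wins
    return out


# Hypothetical sitemap entries; an empty lastmod mimics a missing <lastmod>.
entries = [
    ("https://example.invalid/corn/dekalb/dekalb-dkc62-08rib-corn", "2026-05-01"),
    ("https://example.invalid/silage/dekalb/dekalb-dkc093-05rib-silage", ""),
    ("https://example.invalid/sorghum/channel/channel-6b95-sorghum", "2026-05-02"),
    ("https://example.invalid/about", ""),
]

found = match_entries(entries)
by_brand_crop = Counter((b, c) for _, b, c, _, _ in found)
summary = ", ".join(f"{b}/{c}={n}" for (b, c), n in sorted(by_brand_crop.items()))
print(summary)  # channel/sorghum=1, dekalb/corn=1, dekalb/silage=1

silage_only = match_entries(entries, only_crop="silage")
print(len(silage_only))  # 1
```

`collections.Counter` is used here only for brevity; the commit itself counts with a plain dict and `.get((b, c), 0) + 1`, which gives the same tallies.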
@@ -311,7 +346,8 @@ def discover_varieties(
 
 
 def fetch_product_detail(
-    http: RateLimitedSession, url: str, brand: str, crop: str, lastmod: str
+    http: RateLimitedSession, url: str, brand: str, crop: str,
+    is_primary: bool, lastmod: str,
 ) -> BayerSeedProduct:
     """Fetch + parse one product page into a ``BayerSeedProduct``."""
     r = http.get(url)
@@ -321,7 +357,7 @@ def fetch_product_detail(
     pd = pp.get("productDetails") or {}
 
     prod = BayerSeedProduct(
-        source_key=source_key_from_url(url, brand),
+        source_key=source_key_from_url(url, brand, crop, is_primary),
         source_url=url,
         brand=(pd.get("brand") or brand).upper(),
         crop=(pd.get("crop") or crop).lower(),
@@ -561,18 +597,19 @@ def process_product(
     url: str,
     brand: str,
     crop: str,
+    is_primary: bool,
     lastmod: str,
     force: bool,
 ) -> tuple[str, BayerSeedProduct | None]:
     """Returns ``(status, prod or None)`` where status is one of
     ``written`` / ``skipped`` / ``failed``."""
-    source_key = source_key_from_url(url, brand)
+    source_key = source_key_from_url(url, brand, crop, is_primary)
     md_path = CORPUS_DIR / f"{source_key}.md"
     if md_path.exists() and not force:
         return "skipped", None
 
     try:
-        prod = fetch_product_detail(http, url, brand, crop, lastmod)
+        prod = fetch_product_detail(http, url, brand, crop, is_primary, lastmod)
     except Exception as exc:  # noqa: BLE001
         log.error("detail fetch failed for %s: %s", url, exc)
         return "failed", None
@@ -587,16 +624,17 @@ def run(
     limit: int | None,
     force: bool,
     only_brand: str | None,
+    only_crop: str | None,
     only_product: str | None,
 ) -> int:
     CORPUS_DIR.mkdir(parents=True, exist_ok=True)
     http = RateLimitedSession()
 
-    targets = discover_varieties(http, only_brand=only_brand)
+    targets = discover_varieties(http, only_brand=only_brand, only_crop=only_crop)
     if only_product:
         targets = [
-            (u, b, c, lm) for (u, b, c, lm) in targets
-            if source_key_from_url(u, b) == only_product
+            (u, b, c, p, lm) for (u, b, c, p, lm) in targets
+            if source_key_from_url(u, b, c, p) == only_product
             or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product
         ]
         if not targets:
@@ -605,20 +643,21 @@ def run(
 
     counts = {"written": 0, "skipped": 0, "failed": 0}
     processed = 0
-    for url, brand, crop, lastmod in targets:
+    for url, brand, crop, is_primary, lastmod in targets:
         if limit is not None and processed >= limit:
             break
         processed += 1
         status, prod = process_product(
-            http, url=url, brand=brand, crop=crop, lastmod=lastmod, force=force,
+            http, url=url, brand=brand, crop=crop,
+            is_primary=is_primary, lastmod=lastmod, force=force,
         )
         counts[status] = counts.get(status, 0) + 1
 
         if prod is not None:
             log.info(
-                "[%d/%s] %s %s | crop=%s rm/mg=%s traits=%s ratings_groups=%d",
+                "[%d/%s] %s %s | brand=%s crop=%s rm/mg=%s traits=%s groups=%d",
                 processed, str(limit) if limit else "all",
-                prod.source_key, status, prod.crop,
+                prod.source_key, status, prod.brand, prod.crop,
                 prod.relative_maturity or prod.maturity_group or "-",
                 ",".join(prod.trait_codes) or "-",
                 len(prod.characteristics_groups),
@@ -626,7 +665,7 @@ def run(
         else:
             log.info("[%d/%s] %s %s",
                      processed, str(limit) if limit else "all",
-                     source_key_from_url(url, brand), status)
+                     source_key_from_url(url, brand, crop, is_primary), status)
 
     log.info(
         "done: processed=%d written=%d skipped=%d failed=%d (out of %d candidates)",
@@ -638,10 +677,15 @@ def run(
 # --------------------------------------------------------------------- CLI
 
 
+_ALL_CROPS = sorted({c for _b, _p, c, _pri in BRAND_PATHS})
+
+
 def _build_argparser() -> argparse.ArgumentParser:
     p = argparse.ArgumentParser(
         prog="scrape.sources.bayer_seeds",
-        description="Scrape Bayer DEKALB / Asgrow / WestBred seed varieties.",
+        description="Scrape Bayer seed varieties — DEKALB / Asgrow / "
+                    "WestBred / Channel / Deltapine across corn, "
+                    "soybeans, wheat, silage, sorghum, canola, cotton.",
     )
     p.add_argument(
         "--limit", type=int, default=None,
@@ -652,9 +696,14 @@ def _build_argparser() -> argparse.ArgumentParser:
         help="Re-fetch even if the markdown file already exists.",
     )
     p.add_argument(
-        "--brand", default=None, choices=sorted(BRANDS),
+        "--brand", default=None, choices=BRANDS,
         help="Limit to one Bayer seed brand.",
     )
+    p.add_argument(
+        "--crop", default=None, choices=_ALL_CROPS,
+        help="Limit to one crop. Useful for incrementally backfilling "
+             "(e.g. `--crop sorghum` to grab just the sorghum lines).",
+    )
     p.add_argument(
         "--product", default=None,
         help="Process a single variety by source_key "
@@ -678,6 +727,7 @@ def main(argv: list[str] | None = None) -> int:
         limit=args.limit,
         force=args.force,
         only_brand=args.brand,
+        only_crop=args.crop,
         only_product=args.product,
     )
 
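
To see how the `--brand` and `--crop` choices fall out of `BRAND_PATHS`, here is a minimal, self-contained sketch of just that part of the argument parser. It copies the `BRAND_PATHS` entries from the first hunk; the function name `build_parser` and the `prog` string are placeholders, and the other flags (`--limit`, `--force`, `--product`) are left out.

```python
import argparse

# Copied from the BRAND_PATHS hunk: (brand, url_path_prefix, crop, is_primary).
BRAND_PATHS = [
    ("dekalb", "/corn/dekalb/", "corn", True),
    ("dekalb", "/silage/dekalb/", "silage", False),
    ("dekalb", "/sorghum/dekalb/", "sorghum", False),
    ("dekalb", "/canola/dekalb/", "canola", False),
    ("asgrow", "/soybeans/asgrow/", "soybeans", True),
    ("westbred", "/wheat/westbred/", "wheat", True),
    ("channel", "/corn/channel/", "corn", True),
    ("channel", "/soybeans/channel/", "soybeans", False),
    ("channel", "/silage/channel/", "silage", False),
    ("channel", "/sorghum/channel/", "sorghum", False),
    ("deltapine", "/cotton/deltapine/", "cotton", True),
]

# Both choice lists are derived, so adding a BRAND_PATHS row updates the CLI.
BRANDS = sorted({b for b, _p, _c, _pri in BRAND_PATHS})
ALL_CROPS = sorted({c for _b, _p, c, _pri in BRAND_PATHS})


def build_parser() -> argparse.ArgumentParser:
    """Placeholder parser carrying only the two filter flags."""
    p = argparse.ArgumentParser(prog="bayer_seeds_sketch")
    p.add_argument("--brand", default=None, choices=BRANDS)
    p.add_argument("--crop", default=None, choices=ALL_CROPS)
    return p


args = build_parser().parse_args(["--brand", "channel", "--crop", "sorghum"])
print(BRANDS)
print(ALL_CROPS)
print(args.brand, args.crop)
```

Because both filters are applied inside the same loop in `discover_varieties`, passing `--brand channel --crop sorghum` together should narrow the walk to a single brand×crop pair.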