bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton

A user flagged that Channel is expanding into their area, so this
re-walks the cropscience.bayer.us sitemap, which turned up 8
additional brand×crop paths beyond the original DEKALB/Asgrow/WestBred
triple. Patches the scraper to walk all of them; the total Bayer
variety count roughly doubles from 475 to 931, and the corpus picks
up first-ever coverage of sorghum (36), cotton (30), canola (6), and
silage as a distinct crop (136; previously conflated with corn).

Net new varieties: 456
  Channel    corn=181  soy=67   silage=54  sorghum=18    (320)
  DEKALB     silage=82 sorghum=18  canola=6              (106)
  Deltapine  cotton=30                                    (30)
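
The totals above can be re-derived from the per-path counts. A quick
sketch, using only numbers quoted in this commit (the message and the
`BRAND_PATHS` comment); the dictionary and helper names are
illustrative and do not exist in the scraper:

```python
# Counts for the three paths that were already being scraped.
existing = {
    ("dekalb", "corn"): 288,
    ("asgrow", "soybeans"): 102,
    ("westbred", "wheat"): 85,
}

# Counts for the 8 brand×crop paths added by this commit.
added = {
    ("channel", "corn"): 181,
    ("channel", "soybeans"): 67,
    ("channel", "silage"): 54,
    ("channel", "sorghum"): 18,
    ("dekalb", "silage"): 82,
    ("dekalb", "sorghum"): 18,
    ("dekalb", "canola"): 6,
    ("deltapine", "cotton"): 30,
}


def totals_by_crop(counts: dict) -> dict:
    """Sum variety counts per crop across brands."""
    out: dict = {}
    for (_brand, crop), n in counts.items():
        out[crop] = out.get(crop, 0) + n
    return out


net_new = sum(added.values())
before = sum(existing.values())
new_by_crop = totals_by_crop(added)

print(before, net_new, before + net_new)
print(new_by_crop["sorghum"], new_by_crop["silage"])
```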

scrape/sources/bayer_seeds.py
- Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1
  suffix) with a flatter `BRAND_PATHS` list of (brand, url_path,
  crop, is_primary_for_brand) entries. Channel and DEKALB are now
  multi-crop brands; the same scraper walks every brand×crop pair.
- source_key derivation: for a brand's PRIMARY crop, strip the
  trailing `-<crop>` suffix (matches the existing deployed source
  keys for DEKALB corn / Asgrow soy / WestBred wheat). For
  SECONDARY crops, KEEP the suffix so that a DEKALB SKU sold as
  both grain corn and silage gets two distinct source_keys
  (collision-safe and unambiguous for `lookup_variety`).
- New `--crop` CLI filter for incremental backfills.
- Log line shows brand + crop alongside source_key for visibility.
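
The primary/secondary rule can be sketched in isolation. This mirrors
the logic of `source_key_from_url` in the diff but drops the `brand`
argument, which the new rule does not need. The URLs are hypothetical,
built from the slugs quoted in the code comments rather than taken
from the sitemap:

```python
def source_key(url: str, crop: str, is_primary: bool) -> str:
    """Derive a source_key from the last path segment of a product URL."""
    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
    suffix = f"-{crop}"
    # Primary crop: strip the crop suffix to match already-deployed keys.
    # Secondary crop: keep it so the key stays distinct per crop.
    if is_primary and tail.endswith(suffix):
        tail = tail[: -len(suffix)]
    return tail


# Hypothetical URLs; only the final slug matters here.
corn_key = source_key(
    "https://example.invalid/corn/dekalb/dekalb-dkc62-08rib-corn", "corn", True
)
silage_key = source_key(
    "https://example.invalid/silage/dekalb/dekalb-dkc093-05rib-silage/",
    "silage",
    False,
)
print(corn_key)    # dekalb-dkc62-08rib
print(silage_key)  # dekalb-dkc093-05rib-silage
```

Because secondary keys always end in the crop name, a SKU listed under
both corn and silage cannot collide with itself.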

rag/chunk.py
- Channel + Deltapine pages use slightly different characteristics
  group labels (DISEASE not DISEASE RATINGS, AGRONOMIC
  CHARACTERISTICS not GROWTH/HARVEST, plus MATURITY / ADAPTATION /
  HERBICIDES / OTHER). Fold them into the DISEASE / AGRONOMIC /
  MANAGEMENT label sets so the chunker buckets them correctly
  into the standard sections.
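
A minimal sketch of what folding the labels could look like. The set
names, the `bucket_for` helper, the treatment of GROWTH/HARVEST as a
single label, and in particular which bucket MATURITY, ADAPTATION,
HERBICIDES, and OTHER land in are all assumptions for illustration;
the real assignments live in rag/chunk.py, which this commit view does
not show:

```python
from __future__ import annotations

# Assumed label sets. Membership of the last four labels is a guess.
DISEASE_LABELS = {"DISEASE RATINGS", "DISEASE"}
AGRONOMIC_LABELS = {
    "GROWTH/HARVEST",
    "AGRONOMIC CHARACTERISTICS",
    "MATURITY",      # assumption
    "ADAPTATION",    # assumption
}
MANAGEMENT_LABELS = {
    "HERBICIDES",    # assumption
    "OTHER",         # assumption
}


def bucket_for(label: str) -> str | None:
    """Map a page's characteristics-group label to a standard section."""
    key = " ".join(label.upper().split())  # normalize case and whitespace
    if key in DISEASE_LABELS:
        return "disease"
    if key in AGRONOMIC_LABELS:
        return "agronomic"
    if key in MANAGEMENT_LABELS:
        return "management"
    return None  # unknown label: caller decides how to handle it


print(bucket_for("Disease"))
print(bucket_for("AGRONOMIC  CHARACTERISTICS"))
print(bucket_for("Something new"))
```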

Smoke-tested cross-brand × cross-crop queries against the rebuilt
index (5,529 chunks total); all 6 sample queries surface the
right brand+crop in the top 3:
  Channel corn 110 RM       → 210-25TRE BRAND
  Channel soy 2.5 MG IA     → 2622RXF BRAND
  Deltapine cotton XF       → DP 1820 B3XF BRAND
  Sorghum dryland Kansas    → 6B95 BRAND (Channel)
  Silage corn WI dairy      → DKC64-44RIB BRAND BLEND (silage variant)
  Canola Northern Plains    → DK401TL BRAND

Watchtower will pull the new image on the next push; deploy is
unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:54:30 -04:00
parent c76df4c44a
commit eaa7e0789b
914 changed files with 176117 additions and 52 deletions
+100 -50
@@ -79,21 +79,41 @@ USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
 BASE = "https://www.cropscience.bayer.us"
 SITEMAP_URL = f"{BASE}/sitemap-dynamic.xml"
-# Brand → (URL path segment, crop label). Ordering here defines the
-# `--all` walk order and the `--brand` choices.
-BRANDS: dict[str, tuple[str, str]] = {
-    "dekalb": ("/corn/dekalb/", "corn"),
-    "asgrow": ("/soybeans/asgrow/", "soybeans"),
-    "westbred": ("/wheat/westbred/", "wheat"),
-}
+# All Bayer brand × crop paths in the cropscience.bayer.us sitemap.
+# Each entry: (brand_key, url_path_prefix, crop_label, is_primary_for_brand).
+#
+# `is_primary` controls source_key derivation: for a brand's primary
+# crop we STRIP the trailing crop suffix from the URL tail (so
+# DEKALB corn `dekalb-dkc62-08rib-corn` → source_key
+# `dekalb-dkc62-08rib`, matching the corpus we deployed 2026-05-25).
+# For non-primary crops we KEEP the suffix (so DEKALB silage
+# `dekalb-dkc093-05rib-silage` → source_key
+# `dekalb-dkc093-05rib-silage`, distinct from the corn key and
+# collision-safe when the same SKU is marketed as both grain and
+# silage).
+#
+# Counts as of 2026-05-25 sitemap walk:
+#   DEKALB     corn=288 silage=82 sorghum=18 canola=6
+#   Asgrow     soy=102
+#   WestBred   wheat=85
+#   Channel    corn=181 soy=67 silage=54 sorghum=18
+#   Deltapine  cotton=30
+BRAND_PATHS: list[tuple[str, str, str, bool]] = [
+    ("dekalb", "/corn/dekalb/", "corn", True),
+    ("dekalb", "/silage/dekalb/", "silage", False),
+    ("dekalb", "/sorghum/dekalb/", "sorghum", False),
+    ("dekalb", "/canola/dekalb/", "canola", False),
+    ("asgrow", "/soybeans/asgrow/", "soybeans", True),
+    ("westbred", "/wheat/westbred/", "wheat", True),
+    ("channel", "/corn/channel/", "corn", True),
+    ("channel", "/soybeans/channel/", "soybeans", False),
+    ("channel", "/silage/channel/", "silage", False),
+    ("channel", "/sorghum/channel/", "sorghum", False),
+    ("deltapine", "/cotton/deltapine/", "cotton", True),
+]
-# Per-brand crop-suffix to strip off the URL's terminal slug when
-# computing source_key (so ``dekalb-dkc075-70rib-corn`` → ``dekalb-dkc075-70rib``).
-CROP_SUFFIX = {
-    "dekalb": "-corn",
-    "asgrow": "-soybeans",
-    "westbred": "-wheat",
-}
+# Distinct brand-key list for the --brand CLI choices.
+BRANDS = sorted({b for b, _p, _c, _pri in BRAND_PATHS})
 # Catalog/landing pages that live under the brand path but are NOT
 # individual varieties. Skip these during discovery.
@@ -237,17 +257,25 @@ def parse_next_data(html: str) -> dict[str, Any]:
     return json.loads(m.group(1))
-def source_key_from_url(url: str, brand: str) -> str:
+def source_key_from_url(url: str, brand: str, crop: str, is_primary: bool) -> str:
     """Derive ``<brand>-<sku>`` slug from the product URL.
-    Drops the trailing ``-<crop>`` suffix Bayer puts on every product
-    URL terminal segment (``dekalb-dkc075-70rib-corn`` →
-    ``dekalb-dkc075-70rib``).
+    For a brand's PRIMARY crop (DEKALB/corn, Asgrow/soybeans,
+    WestBred/wheat, Channel/corn, Deltapine/cotton): strip the
+    trailing ``-<crop>`` suffix Bayer puts on every URL — so
+    ``dekalb-dkc075-70rib-corn`` becomes ``dekalb-dkc075-70rib``.
+    For SECONDARY crops on a multi-crop brand (DEKALB silage /
+    sorghum / canola; Channel soybeans / silage / sorghum): KEEP
+    the crop suffix so the same SKU marketed under multiple crops
+    gets distinct source_keys and `lookup_variety(...)` stays
+    unambiguous.
     """
     tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
-    suffix = CROP_SUFFIX.get(brand, "")
-    if suffix and tail.endswith(suffix):
-        tail = tail[: -len(suffix)]
+    if is_primary:
+        suffix = f"-{crop}"
+        if tail.endswith(suffix):
+            tail = tail[: -len(suffix)]
     return tail
@@ -269,11 +297,16 @@ def discover_varieties(
     http: RateLimitedSession,
     *,
     only_brand: str | None = None,
-) -> list[tuple[str, str, str, str]]:
-    """Return ``[(url, brand, crop, lastmod), ...]`` for every Bayer
-    seed variety found in the dynamic sitemap.
+    only_crop: str | None = None,
+) -> list[tuple[str, str, str, bool, str]]:
+    """Return ``[(url, brand, crop, is_primary, lastmod), ...]`` for
+    every Bayer seed variety found in the dynamic sitemap.
-    ``brand`` is the lowercase brand key (matches ``BRANDS``).
+    ``brand`` is the lowercase brand key (one of ``BRANDS``).
+    ``crop`` is the crop label (corn/soybeans/wheat/silage/sorghum/
+    canola/cotton) determined by the URL path segment.
+    ``is_primary`` is True when this is the brand's primary crop —
+    drives the source_key suffix-stripping rule.
     ``lastmod`` is the ISO 8601 timestamp from the sitemap entry.
     """
     log.info("fetching sitemap %s", SITEMAP_URL)
@@ -281,29 +314,31 @@ def discover_varieties(
     r.raise_for_status()
     xml = r.text
     # Tiny regex parse — sitemap is flat and well-formed; no need for
     # the lxml dependency on a single 600KB file.
     entries = re.findall(
         r"<url>\s*<loc>([^<]+)</loc>\s*(?:<lastmod>([^<]+)</lastmod>)?",
         xml,
     )
     log.info("sitemap parsed: %d total URLs", len(entries))
-    out: list[tuple[str, str, str, str]] = []
+    out: list[tuple[str, str, str, bool, str]] = []
     for url, lastmod in entries:
-        for brand, (brand_path, crop) in BRANDS.items():
+        for brand, brand_path, crop, is_primary in BRAND_PATHS:
             if only_brand and brand != only_brand:
                 continue
+            if only_crop and crop != only_crop:
+                continue
             if brand_path in url and looks_like_variety_url(url, brand_path):
-                out.append((url, brand, crop, lastmod or ""))
+                out.append((url, brand, crop, is_primary, lastmod or ""))
                 break
-    by_brand: dict[str, int] = {}
-    for _, b, _, _ in out:
-        by_brand[b] = by_brand.get(b, 0) + 1
-    log.info("variety URLs found: %s (total=%d)",
-             ", ".join(f"{k}={v}" for k, v in sorted(by_brand.items())),
-             len(out))
+    by_brand_crop: dict[tuple[str, str], int] = {}
+    for _, b, c, _, _ in out:
+        by_brand_crop[(b, c)] = by_brand_crop.get((b, c), 0) + 1
+    log.info(
+        "variety URLs found: %s (total=%d)",
+        ", ".join(f"{b}/{c}={n}" for (b, c), n in sorted(by_brand_crop.items())),
+        len(out),
+    )
     return out
@@ -311,7 +346,8 @@ def discover_varieties(
 def fetch_product_detail(
-    http: RateLimitedSession, url: str, brand: str, crop: str, lastmod: str
+    http: RateLimitedSession, url: str, brand: str, crop: str,
+    is_primary: bool, lastmod: str,
 ) -> BayerSeedProduct:
     """Fetch + parse one product page into a ``BayerSeedProduct``."""
     r = http.get(url)
@@ -321,7 +357,7 @@ def fetch_product_detail(
     pd = pp.get("productDetails") or {}
     prod = BayerSeedProduct(
-        source_key=source_key_from_url(url, brand),
+        source_key=source_key_from_url(url, brand, crop, is_primary),
         source_url=url,
         brand=(pd.get("brand") or brand).upper(),
         crop=(pd.get("crop") or crop).lower(),
@@ -561,18 +597,19 @@ def process_product(
     url: str,
     brand: str,
     crop: str,
+    is_primary: bool,
     lastmod: str,
     force: bool,
 ) -> tuple[str, BayerSeedProduct | None]:
     """Returns ``(status, prod or None)`` where status is one of
     ``written`` / ``skipped`` / ``failed``."""
-    source_key = source_key_from_url(url, brand)
+    source_key = source_key_from_url(url, brand, crop, is_primary)
     md_path = CORPUS_DIR / f"{source_key}.md"
     if md_path.exists() and not force:
         return "skipped", None
     try:
-        prod = fetch_product_detail(http, url, brand, crop, lastmod)
+        prod = fetch_product_detail(http, url, brand, crop, is_primary, lastmod)
     except Exception as exc:  # noqa: BLE001
         log.error("detail fetch failed for %s: %s", url, exc)
         return "failed", None
@@ -587,16 +624,17 @@ def run(
     limit: int | None,
     force: bool,
     only_brand: str | None,
+    only_crop: str | None,
     only_product: str | None,
 ) -> int:
     CORPUS_DIR.mkdir(parents=True, exist_ok=True)
     http = RateLimitedSession()
-    targets = discover_varieties(http, only_brand=only_brand)
+    targets = discover_varieties(http, only_brand=only_brand, only_crop=only_crop)
     if only_product:
         targets = [
-            (u, b, c, lm) for (u, b, c, lm) in targets
-            if source_key_from_url(u, b) == only_product
+            (u, b, c, p, lm) for (u, b, c, p, lm) in targets
+            if source_key_from_url(u, b, c, p) == only_product
             or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product
         ]
         if not targets:
@@ -605,20 +643,21 @@ def run(
     counts = {"written": 0, "skipped": 0, "failed": 0}
     processed = 0
-    for url, brand, crop, lastmod in targets:
+    for url, brand, crop, is_primary, lastmod in targets:
         if limit is not None and processed >= limit:
             break
         processed += 1
         status, prod = process_product(
-            http, url=url, brand=brand, crop=crop, lastmod=lastmod, force=force,
+            http, url=url, brand=brand, crop=crop,
+            is_primary=is_primary, lastmod=lastmod, force=force,
         )
         counts[status] = counts.get(status, 0) + 1
         if prod is not None:
             log.info(
-                "[%d/%s] %s %s | crop=%s rm/mg=%s traits=%s ratings_groups=%d",
+                "[%d/%s] %s %s | brand=%s crop=%s rm/mg=%s traits=%s groups=%d",
                 processed, str(limit) if limit else "all",
-                prod.source_key, status, prod.crop,
+                prod.source_key, status, prod.brand, prod.crop,
                 prod.relative_maturity or prod.maturity_group or "-",
                 ",".join(prod.trait_codes) or "-",
                 len(prod.characteristics_groups),
@@ -626,7 +665,7 @@ def run(
         else:
             log.info("[%d/%s] %s %s",
                      processed, str(limit) if limit else "all",
-                     source_key_from_url(url, brand), status)
+                     source_key_from_url(url, brand, crop, is_primary), status)
     log.info(
         "done: processed=%d written=%d skipped=%d failed=%d (out of %d candidates)",
@@ -638,10 +677,15 @@ def run(
 # --------------------------------------------------------------------- CLI
+_ALL_CROPS = sorted({c for _b, _p, c, _pri in BRAND_PATHS})
 def _build_argparser() -> argparse.ArgumentParser:
     p = argparse.ArgumentParser(
         prog="scrape.sources.bayer_seeds",
-        description="Scrape Bayer DEKALB / Asgrow / WestBred seed varieties.",
+        description="Scrape Bayer seed varieties — DEKALB / Asgrow / "
+                    "WestBred / Channel / Deltapine across corn, "
+                    "soybeans, wheat, silage, sorghum, canola, cotton.",
     )
     p.add_argument(
         "--limit", type=int, default=None,
@@ -652,9 +696,14 @@ def _build_argparser() -> argparse.ArgumentParser:
         help="Re-fetch even if the markdown file already exists.",
     )
     p.add_argument(
-        "--brand", default=None, choices=sorted(BRANDS),
+        "--brand", default=None, choices=BRANDS,
         help="Limit to one Bayer seed brand.",
     )
+    p.add_argument(
+        "--crop", default=None, choices=_ALL_CROPS,
+        help="Limit to one crop. Useful for incrementally backfilling "
+             "(e.g. `--crop sorghum` to grab just the sorghum lines).",
+    )
     p.add_argument(
         "--product", default=None,
         help="Process a single variety by source_key "
@@ -678,6 +727,7 @@ def main(argv: list[str] | None = None) -> int:
         limit=args.limit,
         force=args.force,
         only_brand=args.brand,
+        only_crop=args.crop,
         only_product=args.product,
     )