bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton

User flagged that Channel is expanding into their area. Re-walked the
cropscience.bayer.us sitemap and found 8 additional brand×crop paths
beyond the original DEKALB/Asgrow/WestBred triple. Patches the scraper
to walk all of them; total Bayer varieties roughly doubles from 475 to
931, and the corpus picks up first-ever coverage in sorghum (36),
cotton (30), canola (6), and silage as a distinct crop (it was
conflated with corn before).

Net new varieties: 456
  Channel    corn=181   soy=67      silage=54  sorghum=18  (320)
  DEKALB     silage=82  sorghum=18  canola=6               (106)
  Deltapine  cotton=30                                     (30)

scrape/sources/bayer_seeds.py
- Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1
  suffix) with a flatter `BRAND_PATHS` list of
  (brand, url_path, crop, is_primary_for_brand) entries. Channel and
  DEKALB are now multi-crop brands; the same scraper walks every
  brand×crop pair.
- source_key derivation: for a brand's PRIMARY crop, strip the trailing
  `-<crop>` suffix (matches the existing deployed source keys for
  DEKALB corn / Asgrow soy / WestBred wheat). For SECONDARY crops, KEEP
  the suffix so that the same DEKALB SKU sold as both grain corn and
  silage gets two distinct source_keys (collision-safe and unambiguous
  for `lookup_variety`).
- New `--crop` CLI filter for incremental backfills.
- Log line shows brand + crop alongside source_key for visibility.

rag/chunk.py
- Channel + Deltapine pages use slightly different characteristics
  group labels (DISEASE not DISEASE RATINGS, AGRONOMIC CHARACTERISTICS
  not GROWTH/HARVEST, plus MATURITY / ADAPTATION / HERBICIDES / OTHER).
  Fold them into the DISEASE / AGRONOMIC / MANAGEMENT label sets so the
  chunker buckets them correctly into the standard sections.

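
For reference, the source_key rule described above can be sketched in
isolation. This is a minimal illustration, not the scraper's code: the
helper name `source_key_from_tail` is hypothetical and takes the
already-extracted URL tail rather than a full URL. The two slugs are the
ones quoted in the `BRAND_PATHS` comment.

```python
def source_key_from_tail(tail: str, crop: str, is_primary: bool) -> str:
    """Strip the trailing ``-<crop>`` only for a brand's primary crop.

    Hypothetical helper for illustration; the real function in this
    commit is ``source_key_from_url`` and works on the full product URL.
    """
    tail = tail.lower()
    suffix = f"-{crop}"
    if is_primary and tail.endswith(suffix):
        return tail[: -len(suffix)]
    # Secondary crops keep the suffix so keys stay distinct per crop.
    return tail


# Primary crop (DEKALB corn): suffix is stripped.
corn_key = source_key_from_tail("dekalb-dkc62-08rib-corn", "corn", True)
# Secondary crop (DEKALB silage): suffix is kept.
silage_key = source_key_from_tail("dekalb-dkc093-05rib-silage", "silage", False)

print(corn_key)
print(silage_key)
```

Keeping the suffix on secondary crops means already-deployed keys for
the primary crops do not change, while any SKU listed under two crops
maps to two different markdown files.
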
Smoke-tested cross-brand × cross-crop queries against the rebuilt index
(5,529 chunks total). All 6 sample queries surface the right brand+crop
in the top 3:

  Channel corn 110 RM      → 210-25TRE BRAND
  Channel soy 2.5 MG IA    → 2622RXF BRAND
  Deltapine cotton XF      → DP 1820 B3XF BRAND
  Sorghum dryland Kansas   → 6B95 BRAND (Channel)
  Silage corn WI dairy     → DKC64-44RIB BRAND BLEND (silage variant)
  Canola Northern Plains   → DK401TL BRAND

Watchtower will pull the new image on the next push; deploy is
unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:54:30 -04:00
parent c76df4c44a
commit eaa7e0789b
914 changed files with 176117 additions and 52 deletions
@@ -79,21 +79,41 @@ USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"
 BASE = "https://www.cropscience.bayer.us"
 SITEMAP_URL = f"{BASE}/sitemap-dynamic.xml"

-# Brand → (URL path segment, crop label). Ordering here defines the
-# `--all` walk order and the `--brand` choices.
-BRANDS: dict[str, tuple[str, str]] = {
-    "dekalb": ("/corn/dekalb/", "corn"),
-    "asgrow": ("/soybeans/asgrow/", "soybeans"),
-    "westbred": ("/wheat/westbred/", "wheat"),
-}
+# All Bayer brand × crop paths in the cropscience.bayer.us sitemap.
+# Each entry: (brand_key, url_path_prefix, crop_label, is_primary_for_brand).
+#
+# `is_primary` controls source_key derivation: for a brand's primary
+# crop we STRIP the trailing crop suffix from the URL tail (so
+# DEKALB corn `dekalb-dkc62-08rib-corn` → source_key
+# `dekalb-dkc62-08rib`, matching the corpus we deployed 2026-05-25).
+# For non-primary crops we KEEP the suffix (so DEKALB silage
+# `dekalb-dkc093-05rib-silage` → source_key
+# `dekalb-dkc093-05rib-silage`, distinct from the corn key and
+# collision-safe when the same SKU is marketed as both grain and
+# silage).
+#
+# Counts as of 2026-05-25 sitemap walk:
+#   DEKALB    corn=288  silage=82  sorghum=18  canola=6
+#   Asgrow    soy=102
+#   WestBred  wheat=85
+#   Channel   corn=181  soy=67     silage=54   sorghum=18
+#   Deltapine cotton=30
+BRAND_PATHS: list[tuple[str, str, str, bool]] = [
+    ("dekalb",    "/corn/dekalb/",      "corn",     True),
+    ("dekalb",    "/silage/dekalb/",    "silage",   False),
+    ("dekalb",    "/sorghum/dekalb/",   "sorghum",  False),
+    ("dekalb",    "/canola/dekalb/",    "canola",   False),
+    ("asgrow",    "/soybeans/asgrow/",  "soybeans", True),
+    ("westbred",  "/wheat/westbred/",   "wheat",    True),
+    ("channel",   "/corn/channel/",     "corn",     True),
+    ("channel",   "/soybeans/channel/", "soybeans", False),
+    ("channel",   "/silage/channel/",   "silage",   False),
+    ("channel",   "/sorghum/channel/",  "sorghum",  False),
+    ("deltapine", "/cotton/deltapine/", "cotton",   True),
+]

-# Per-brand crop-suffix to strip off the URL's terminal slug when
-# computing source_key (so ``dekalb-dkc075-70rib-corn`` → ``dekalb-dkc075-70rib``).
-CROP_SUFFIX = {
-    "dekalb": "-corn",
-    "asgrow": "-soybeans",
-    "westbred": "-wheat",
-}
+# Distinct brand-key list for the --brand CLI choices.
+BRANDS = sorted({b for b, _p, _c, _pri in BRAND_PATHS})

 # Catalog/landing pages that live under the brand path but are NOT
 # individual varieties. Skip these during discovery.
@@ -237,17 +257,25 @@ def parse_next_data(html: str) -> dict[str, Any]:
    return json.loads(m.group(1))


-def source_key_from_url(url: str, brand: str) -> str:
+def source_key_from_url(url: str, brand: str, crop: str, is_primary: bool) -> str:
    """Derive ``<brand>-<sku>`` slug from the product URL.

-    Drops the trailing ``-<crop>`` suffix Bayer puts on every product
-    URL terminal segment (``dekalb-dkc075-70rib-corn`` →
-    ``dekalb-dkc075-70rib``).
+    For a brand's PRIMARY crop (DEKALB/corn, Asgrow/soybeans,
+    WestBred/wheat, Channel/corn, Deltapine/cotton): strip the
+    trailing ``-<crop>`` suffix Bayer puts on every URL — so
+    ``dekalb-dkc075-70rib-corn`` becomes ``dekalb-dkc075-70rib``.
+
+    For SECONDARY crops on a multi-crop brand (DEKALB silage /
+    sorghum / canola; Channel soybeans / silage / sorghum): KEEP
+    the crop suffix so the same SKU marketed under multiple crops
+    gets distinct source_keys and `lookup_variety(...)` stays
+    unambiguous.
    """
    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
-    suffix = CROP_SUFFIX.get(brand, "")
-    if suffix and tail.endswith(suffix):
-        tail = tail[: -len(suffix)]
+    if is_primary:
+        suffix = f"-{crop}"
+        if tail.endswith(suffix):
+            tail = tail[: -len(suffix)]
    return tail


@@ -269,11 +297,16 @@ def discover_varieties(
    http: RateLimitedSession,
    *,
    only_brand: str | None = None,
-) -> list[tuple[str, str, str, str]]:
-    """Return ``[(url, brand, crop, lastmod), ...]`` for every Bayer
-    seed variety found in the dynamic sitemap.
+    only_crop: str | None = None,
+) -> list[tuple[str, str, str, bool, str]]:
+    """Return ``[(url, brand, crop, is_primary, lastmod), ...]`` for
+    every Bayer seed variety found in the dynamic sitemap.

-    ``brand`` is the lowercase brand key (matches ``BRANDS``).
+    ``brand`` is the lowercase brand key (one of ``BRANDS``).
+    ``crop`` is the crop label (corn/soybeans/wheat/silage/sorghum/
+    canola/cotton) determined by the URL path segment.
+    ``is_primary`` is True when this is the brand's primary crop —
+    drives the source_key suffix-stripping rule.
    ``lastmod`` is the ISO 8601 timestamp from the sitemap entry.
    """
    log.info("fetching sitemap %s", SITEMAP_URL)
@@ -281,29 +314,31 @@ def discover_varieties(
    r.raise_for_status()
    xml = r.text

-    # Tiny regex parse — sitemap is flat and well-formed; no need for
-    # the lxml dependency on a single 600KB file.
    entries = re.findall(
        r"<url>\s*<loc>([^<]+)</loc>\s*(?:<lastmod>([^<]+)</lastmod>)?",
        xml,
    )
    log.info("sitemap parsed: %d total URLs", len(entries))

-    out: list[tuple[str, str, str, str]] = []
+    out: list[tuple[str, str, str, bool, str]] = []
    for url, lastmod in entries:
-        for brand, (brand_path, crop) in BRANDS.items():
+        for brand, brand_path, crop, is_primary in BRAND_PATHS:
            if only_brand and brand != only_brand:
                continue
+            if only_crop and crop != only_crop:
+                continue
            if brand_path in url and looks_like_variety_url(url, brand_path):
-                out.append((url, brand, crop, lastmod or ""))
+                out.append((url, brand, crop, is_primary, lastmod or ""))
                break

-    by_brand: dict[str, int] = {}
-    for _, b, _, _ in out:
-        by_brand[b] = by_brand.get(b, 0) + 1
-    log.info("variety URLs found: %s (total=%d)",
-             ", ".join(f"{k}={v}" for k, v in sorted(by_brand.items())),
-             len(out))
+    by_brand_crop: dict[tuple[str, str], int] = {}
+    for _, b, c, _, _ in out:
+        by_brand_crop[(b, c)] = by_brand_crop.get((b, c), 0) + 1
+    log.info(
+        "variety URLs found: %s (total=%d)",
+        ", ".join(f"{b}/{c}={n}" for (b, c), n in sorted(by_brand_crop.items())),
+        len(out),
+    )
    return out


@@ -311,7 +346,8 @@ def discover_varieties(


 def fetch_product_detail(
-    http: RateLimitedSession, url: str, brand: str, crop: str, lastmod: str
+    http: RateLimitedSession, url: str, brand: str, crop: str,
+    is_primary: bool, lastmod: str,
 ) -> BayerSeedProduct:
    """Fetch + parse one product page into a ``BayerSeedProduct``."""
    r = http.get(url)
@@ -321,7 +357,7 @@ def fetch_product_detail(
    pd = pp.get("productDetails") or {}

    prod = BayerSeedProduct(
-        source_key=source_key_from_url(url, brand),
+        source_key=source_key_from_url(url, brand, crop, is_primary),
        source_url=url,
        brand=(pd.get("brand") or brand).upper(),
        crop=(pd.get("crop") or crop).lower(),
@@ -561,18 +597,19 @@ def process_product(
    url: str,
    brand: str,
    crop: str,
+    is_primary: bool,
    lastmod: str,
    force: bool,
 ) -> tuple[str, BayerSeedProduct | None]:
    """Returns ``(status, prod or None)`` where status is one of
    ``written`` / ``skipped`` / ``failed``."""
-    source_key = source_key_from_url(url, brand)
+    source_key = source_key_from_url(url, brand, crop, is_primary)
    md_path = CORPUS_DIR / f"{source_key}.md"
    if md_path.exists() and not force:
        return "skipped", None

    try:
-        prod = fetch_product_detail(http, url, brand, crop, lastmod)
+        prod = fetch_product_detail(http, url, brand, crop, is_primary, lastmod)
    except Exception as exc:  # noqa: BLE001
        log.error("detail fetch failed for %s: %s", url, exc)
        return "failed", None
@@ -587,16 +624,17 @@ def run(
    limit: int | None,
    force: bool,
    only_brand: str | None,
+    only_crop: str | None,
    only_product: str | None,
 ) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()

-    targets = discover_varieties(http, only_brand=only_brand)
+    targets = discover_varieties(http, only_brand=only_brand, only_crop=only_crop)
    if only_product:
        targets = [
-            (u, b, c, lm) for (u, b, c, lm) in targets
-            if source_key_from_url(u, b) == only_product
+            (u, b, c, p, lm) for (u, b, c, p, lm) in targets
+            if source_key_from_url(u, b, c, p) == only_product
            or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product
        ]
        if not targets:
@@ -605,20 +643,21 @@ def run(

    counts = {"written": 0, "skipped": 0, "failed": 0}
    processed = 0
-    for url, brand, crop, lastmod in targets:
+    for url, brand, crop, is_primary, lastmod in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        status, prod = process_product(
-            http, url=url, brand=brand, crop=crop, lastmod=lastmod, force=force,
+            http, url=url, brand=brand, crop=crop,
+            is_primary=is_primary, lastmod=lastmod, force=force,
        )
        counts[status] = counts.get(status, 0) + 1

        if prod is not None:
            log.info(
-                "[%d/%s] %s %s | crop=%s rm/mg=%s traits=%s ratings_groups=%d",
+                "[%d/%s] %s %s | brand=%s crop=%s rm/mg=%s traits=%s groups=%d",
                processed, str(limit) if limit else "all",
-                prod.source_key, status, prod.crop,
+                prod.source_key, status, prod.brand, prod.crop,
                prod.relative_maturity or prod.maturity_group or "-",
                ",".join(prod.trait_codes) or "-",
                len(prod.characteristics_groups),
@@ -626,7 +665,7 @@ def run(
        else:
            log.info("[%d/%s] %s %s",
                     processed, str(limit) if limit else "all",
-                     source_key_from_url(url, brand), status)
+                     source_key_from_url(url, brand, crop, is_primary), status)

    log.info(
        "done: processed=%d written=%d skipped=%d failed=%d (out of %d candidates)",
@@ -638,10 +677,15 @@ def run(
 # --------------------------------------------------------------------- CLI


+_ALL_CROPS = sorted({c for _b, _p, c, _pri in BRAND_PATHS})
+
+
 def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.bayer_seeds",
-        description="Scrape Bayer DEKALB / Asgrow / WestBred seed varieties.",
+        description="Scrape Bayer seed varieties — DEKALB / Asgrow / "
+                    "WestBred / Channel / Deltapine across corn, "
+                    "soybeans, wheat, silage, sorghum, canola, cotton.",
    )
    p.add_argument(
        "--limit", type=int, default=None,
@@ -652,9 +696,14 @@ def _build_argparser() -> argparse.ArgumentParser:
        help="Re-fetch even if the markdown file already exists.",
    )
    p.add_argument(
-        "--brand", default=None, choices=sorted(BRANDS),
+        "--brand", default=None, choices=BRANDS,
        help="Limit to one Bayer seed brand.",
    )
+    p.add_argument(
+        "--crop", default=None, choices=_ALL_CROPS,
+        help="Limit to one crop. Useful for incrementally backfilling "
+             "(e.g. `--crop sorghum` to grab just the sorghum lines).",
+    )
    p.add_argument(
        "--product", default=None,
        help="Process a single variety by source_key "
@@ -678,6 +727,7 @@ def main(argv: list[str] | None = None) -> int:
        limit=args.limit,
        force=args.force,
        only_brand=args.brand,
+        only_crop=args.crop,
        only_product=args.product,
    )