bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton

User flagged that Channel is expanding into their area — re-walked
the cropscience.bayer.us sitemap and found 8 additional brand×crop
paths beyond the original DEKALB/Asgrow/WestBred triple. Patches
the scraper to walk all of them; total Bayer varieties roughly
doubles from 475 to 931 and the corpus picks up first-ever
coverage in sorghum (36), cotton (30), canola (6), and silage as a
distinct crop (was conflated with corn before).

Net new varieties: 456
  Channel    corn=181  soy=67   silage=54  sorghum=18    (320)
  DEKALB     silage=82 sorghum=18  canola=6              (106)
  Deltapine  cotton=30                                    (30)

scrape/sources/bayer_seeds.py
- Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1
  suffix) with a flatter `BRAND_PATHS` list of (brand, url_path,
  crop, is_primary_for_brand) entries. Channel and DEKALB are now
  multi-crop brands; the same scraper walks every brand×crop pair.
- source_key derivation: for a brand's PRIMARY crop, strip the
  trailing `-<crop>` suffix (matches the existing deployed source
  keys for DEKALB corn / Asgrow soy / WestBred wheat). For
  SECONDARY crops, KEEP the suffix so DEKALB-the-same-SKU sold as
  both grain corn and silage gets two distinct source_keys
  (collision-safe and unambiguous for `lookup_variety`).
- New `--crop` CLI filter for incremental backfills.
- Log line shows brand + crop alongside source_key for visibility.

rag/chunk.py
- Channel + Deltapine pages use slightly different characteristics
  group labels (DISEASE not DISEASE RATINGS, AGRONOMIC
  CHARACTERISTICS not GROWTH/HARVEST, plus MATURITY / ADAPTATION /
  HERBICIDES / OTHER). Fold them into the DISEASE / AGRONOMIC /
  MANAGEMENT label sets so the chunker buckets them correctly
  into the standard sections.

Smoke-tested cross-brand × cross-crop queries against the rebuilt
index (5,529 chunks total) — all 6 sample queries surface the
right brand+crop at top-3:
  Channel corn 110 RM       → 210-25TRE BRAND
  Channel soy 2.5 MG IA     → 2622RXF BRAND
  Deltapine cotton XF       → DP 1820 B3XF BRAND
  Sorghum dryland Kansas    → 6B95 BRAND (Channel)
  Silage corn WI dairy      → DKC64-44RIB BRAND BLEND (silage variant)
  Canola Northern Plains    → DK401TL BRAND

Watchtower will pull the new image on the next push; deploy is
unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-26 11:54:30 -04:00
parent c76df4c44a
commit eaa7e0789b
914 changed files with 176117 additions and 52 deletions
+8 -2
View File
@@ -51,8 +51,9 @@ from typing import Iterator
# listed here fall into "other" and are still emitted, just in their
# own section.
DISEASE_GROUP_LABELS = {
"DISEASE RATINGS",
"PEST AND DISEASE RESISTANCE",
"DISEASE RATINGS", # Bayer DEKALB / Asgrow / WestBred
"PEST AND DISEASE RESISTANCE", # WestBred wheat
"DISEASE", # Channel + Deltapine (Bayer)
}
AGRONOMIC_GROUP_LABELS = {
"GROWTH",
@@ -60,12 +61,17 @@ AGRONOMIC_GROUP_LABELS = {
"PRODUCTION",
"KEY CHARACTERISTICS",
"QUALITY",
"AGRONOMIC CHARACTERISTICS", # Channel + Deltapine
"MATURITY", # Channel — RM / GDU
}
MANAGEMENT_GROUP_LABELS = {
"MANAGEMENT",
"HERBICIDE",
"SENSITIVITY",
"PLANT DESCRIPTION",
"HERBICIDES", # Channel (plural)
"ADAPTATION", # Channel — regional placement
"OTHER", # Channel — misc trait/management
}