bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton (+456 varieties) #13

Merged
justin merged 1 commit from bayer-expand-channel-deltapine into main 2026-05-26 11:55:01 -04:00

Summary

The user flagged that Channel is moving into their area. Re-walking the cropscience.bayer.us sitemap turned up 8 brand×crop paths that had not been scraped, beyond the original DEKALB/Asgrow/WestBred triple. Net new: 456 varieties and 4 first-time crop categories.

Coverage delta

| Metric | Before | After |
|---|---|---|
| Bayer varieties | 475 | 931 (+96%) |
| Total chunks indexed | 5,073 | 5,529 (+9%) |
| Crops in corpus | 3 (corn / soy / wheat) | 7 (+silage as distinct, +sorghum, +cotton, +canola) |
| Brands | 6 | 8 (+Channel, +Deltapine) |

| Path added | Count | Notes |
|---|---|---|
| /corn/channel/ | 181 | Channel corn (user's ask) |
| /soybeans/channel/ | 67 | Channel soy |
| /silage/channel/ | 54 | Channel silage |
| /sorghum/channel/ | 18 | Channel sorghum |
| /silage/dekalb/ | 82 | DEKALB silage: separate records from /corn/dekalb/, verified (e.g. SILAGE_DKC093-05RIB_DEKALB_US is a distinct product with silage-specific ratings) |
| /sorghum/dekalb/ | 18 | DEKALB sorghum |
| /canola/dekalb/ | 6 | DEKALB canola (Northern Plains) |
| /cotton/deltapine/ | 30 | Deltapine, Bayer's cotton brand |
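
The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below only re-adds numbers stated in the tables; the variable names are illustrative and are not taken from the repository.

```python
# Per-path counts copied from the "Path added" table.
PATH_COUNTS = {
    "/corn/channel/": 181,
    "/soybeans/channel/": 67,
    "/silage/channel/": 54,
    "/sorghum/channel/": 18,
    "/silage/dekalb/": 82,
    "/sorghum/dekalb/": 18,
    "/canola/dekalb/": 6,
    "/cotton/deltapine/": 30,
}

# "Before" / "After" values copied from the metric table.
BEFORE_VARIETIES = 475
BEFORE_CHUNKS, AFTER_CHUNKS = 5073, 5529

net_new = sum(PATH_COUNTS.values())
after_varieties = BEFORE_VARIETIES + net_new
chunk_delta = AFTER_CHUNKS - BEFORE_CHUNKS

# Percentage growth, rounded to whole percent as in the table.
variety_growth_pct = round(100 * net_new / BEFORE_VARIETIES)
chunk_growth_pct = round(100 * chunk_delta / BEFORE_CHUNKS)

print(net_new, after_varieties, chunk_delta)
print(variety_growth_pct, chunk_growth_pct)
```

The chunk delta equals the number of new varieties, which is consistent with the index gaining one chunk per new variety, although the PR does not state that explicitly.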

Implementation

  • scrape/sources/bayer_seeds.py — replaced the 1:1 BRANDS / CROP_SUFFIX dicts with a flatter BRAND_PATHS list of (brand, url_path, crop, is_primary_for_brand) tuples. Channel and DEKALB are now multi-crop brands.
  • source_key collision safety: for a brand's PRIMARY crop the URL -<crop> suffix is stripped (matches existing deployed keys); for SECONDARY crops the suffix is kept, so a DEKALB SKU sold as both grain corn and silage gets two distinct source_keys.
  • rag/chunk.py — Channel + Deltapine pages use slightly different characteristics group labels (DISEASE not DISEASE RATINGS; AGRONOMIC CHARACTERISTICS not GROWTH/HARVEST; plus MATURITY/ADAPTATION/HERBICIDES/OTHER). Folded them into the existing bucketing sets so chunks render under the standard sections.
  • New --crop CLI filter in bayer_seeds for incremental backfills.
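
The primary/secondary suffix rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scrape/sources/bayer_seeds.py`: the `BrandPath` type, the `derive_source_key` helper, the slug format, and the choice of corn as Channel's primary crop are all assumptions made for the example.

```python
from typing import NamedTuple


class BrandPath(NamedTuple):
    """One (brand, url_path, crop, is_primary_for_brand) entry."""
    brand: str
    url_path: str
    crop: str
    is_primary_for_brand: bool


# Illustrative subset only; which crop counts as primary for Channel
# is an assumption, not something the PR states.
BRAND_PATHS = [
    BrandPath("DEKALB", "/corn/dekalb/", "corn", True),
    BrandPath("DEKALB", "/silage/dekalb/", "silage", False),
    BrandPath("Channel", "/corn/channel/", "corn", True),
    BrandPath("Channel", "/silage/channel/", "silage", False),
    BrandPath("Deltapine", "/cotton/deltapine/", "cotton", True),
]


def derive_source_key(slug: str, entry: BrandPath) -> str:
    """Hypothetical helper: strip the '-<crop>' suffix for primary crops only.

    `slug` stands in for the last URL segment of a product page; its
    exact format here is assumed for illustration.
    """
    suffix = f"-{entry.crop}"
    if entry.is_primary_for_brand and slug.endswith(suffix):
        return slug[: -len(suffix)]  # matches the previously deployed key shape
    return slug  # secondary crop: keep the suffix so keys cannot collide


grain_key = derive_source_key("dkc64-44rib-corn", BRAND_PATHS[0])
silage_key = derive_source_key("dkc64-44rib-silage", BRAND_PATHS[1])
print(grain_key, silage_key)
```

Under these assumptions the same SKU yields one key without a suffix for grain corn and one with `-silage` for the silage record, so the two records never share a key.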

Test plan

  • All 8 new brand×crop paths produce clean records on smoke test (Channel soy MG 0.05-0.5, DEKALB silage SSRIB, DEKALB sorghum, DEKALB canola TF/LL, Deltapine cotton B3XF)
  • Re-index 5,529 chunks in 3:18 on the 4-GPU pool
  • Cross-brand smoke queries surface correct results:
    • Channel corn 110 RM → 210-25TRE BRAND ✓
    • Channel soy 2.5 MG → 2622RXF BRAND ✓
    • Deltapine cotton XF → DP 1820 B3XF BRAND ✓
    • Sorghum dryland Kansas → Channel 6B95 BRAND ✓
    • Silage corn WI dairy → DKC64-44RIB BRAND BLEND (silage variant) ✓
    • Canola Northern Plains → DK401TL BRAND ✓
  • Pending: CI builds new image, Watchtower auto-deploys to trashpanda (in flight on merge)
justin added 1 commit 2026-05-26 11:55:00 -04:00
User flagged that Channel is expanding into their area — re-walked
the cropscience.bayer.us sitemap and found 8 additional brand×crop
paths beyond the original DEKALB/Asgrow/WestBred triple. Patches
the scraper to walk all of them; total Bayer varieties roughly
doubles from 475 to 931 and the corpus picks up first-ever
coverage in sorghum (36), cotton (30), canola (6), and silage as a
distinct crop (was conflated with corn before).

Net new varieties: 456
  Channel    corn=181  soy=67   silage=54  sorghum=18    (320)
  DEKALB     silage=82 sorghum=18  canola=6              (106)
  Deltapine  cotton=30                                    (30)

scrape/sources/bayer_seeds.py
- Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1
  suffix) with a flatter `BRAND_PATHS` list of (brand, url_path,
  crop, is_primary_for_brand) entries. Channel and DEKALB are now
  multi-crop brands; the same scraper walks every brand×crop pair.
- source_key derivation: for a brand's PRIMARY crop, strip the
  trailing `-<crop>` suffix (matches the existing deployed source
  keys for DEKALB corn / Asgrow soy / WestBred wheat). For
  SECONDARY crops, KEEP the suffix so that a DEKALB SKU sold as
  both grain corn and silage gets two distinct source_keys
  (collision-safe and unambiguous for `lookup_variety`).
- New `--crop` CLI filter for incremental backfills.
- Log line shows brand + crop alongside source_key for visibility.
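
A `--crop` filter of the kind listed above could look roughly like this with `argparse`. It is a sketch under assumptions: the real flag may accept different values or be repeatable, and `parse_args` / `select_paths` are hypothetical names, as is the plain-tuple form of the entries.

```python
import argparse

# Illustrative (brand, url_path, crop, is_primary_for_brand) entries.
BRAND_PATHS = [
    ("Channel", "/corn/channel/", "corn", True),
    ("Channel", "/sorghum/channel/", "sorghum", False),
    ("DEKALB", "/sorghum/dekalb/", "sorghum", False),
    ("Deltapine", "/cotton/deltapine/", "cotton", True),
]


def parse_args(argv=None):
    """Parse the command line; --crop is optional (assumed single-valued)."""
    parser = argparse.ArgumentParser(prog="bayer_seeds")
    parser.add_argument(
        "--crop",
        default=None,
        help="only scrape brand×crop paths for this crop (e.g. sorghum)",
    )
    return parser.parse_args(argv)


def select_paths(paths, crop):
    """Return every path when no crop is given, else only the matching ones."""
    if crop is None:
        return list(paths)
    wanted = crop.lower()
    return [p for p in paths if p[2].lower() == wanted]


args = parse_args(["--crop", "sorghum"])
selected = select_paths(BRAND_PATHS, args.crop)
for brand, url_path, crop, _primary in selected:
    print(brand, url_path, crop)
```

Filtering on the crop field of each entry means an incremental backfill for one crop touches every brand that sells it, without re-walking the other paths.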

rag/chunk.py
- Channel + Deltapine pages use slightly different characteristics
  group labels (DISEASE not DISEASE RATINGS, AGRONOMIC
  CHARACTERISTICS not GROWTH/HARVEST, plus MATURITY / ADAPTATION /
  HERBICIDES / OTHER). Fold them into the DISEASE / AGRONOMIC /
  MANAGEMENT label sets so the chunker buckets them correctly
  into the standard sections.
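
As a rough picture of the label folding, the sketch below maps both the older and the Channel/Deltapine group labels onto three buckets. The set names, the `bucket_for` helper, and in particular which bucket MATURITY, ADAPTATION, HERBICIDES, and OTHER land in are assumptions; the commit only says they were folded into the DISEASE / AGRONOMIC / MANAGEMENT label sets.

```python
from typing import Optional

# Older labels plus the Channel/Deltapine variants named in the commit.
DISEASE_LABELS = {"DISEASE RATINGS", "DISEASE"}
AGRONOMIC_LABELS = {
    "GROWTH",
    "HARVEST",
    "AGRONOMIC CHARACTERISTICS",
    "MATURITY",    # assumed placement
    "ADAPTATION",  # assumed placement
}
MANAGEMENT_LABELS = {"HERBICIDES", "OTHER"}  # assumed placement

BUCKETS = {
    "disease": DISEASE_LABELS,
    "agronomic": AGRONOMIC_LABELS,
    "management": MANAGEMENT_LABELS,
}


def bucket_for(label: str) -> Optional[str]:
    """Hypothetical helper: map a page's group label to a standard section."""
    # Normalise case and collapse whitespace before the set lookup.
    key = " ".join(label.upper().split())
    for bucket, labels in BUCKETS.items():
        if key in labels:
            return bucket
    return None  # unknown label: the caller decides how to handle it


for label in ("DISEASE RATINGS", "Disease", "Agronomic  Characteristics", "Herbicides"):
    print(label, "->", bucket_for(label))
```

Keeping the old and new spellings in the same set lets pages from any brand render under the same section without per-brand branching.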

Smoke-tested cross-brand × cross-crop queries against the rebuilt
index (5,529 chunks total); all 6 sample queries surface the
right brand+crop in the top 3:
  Channel corn 110 RM       → 210-25TRE BRAND
  Channel soy 2.5 MG IA     → 2622RXF BRAND
  Deltapine cotton XF       → DP 1820 B3XF BRAND
  Sorghum dryland Kansas    → 6B95 BRAND (Channel)
  Silage corn WI dairy      → DKC64-44RIB BRAND BLEND (silage variant)
  Canola Northern Plains    → DK401TL BRAND

Watchtower will pull the new image on the next push; deploy is
unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 06461ade1d into main 2026-05-26 11:55:01 -04:00
Reference: justin/seed-mcp#13