bayer_seeds: implement Phase 1 scraper (DEKALB + Asgrow + WestBred) #1

Merged
justin merged 1 commit from bayer-seeds-scraper into main 2026-05-25 12:54:51 -04:00
Owner

Summary

  • Replaces the bayer_seeds stub with a working scraper for all three Bayer seed brands (DEKALB corn, Asgrow soy, WestBred wheat — 475 varieties total).
  • Discovery uses sitemap-dynamic.xml (288 + 102 + 85 = 475, exactly matching recon).
  • Per-variety detail comes from each product page's __NEXT_DATA__ JSON island — no PDFs, no rendering required.
  • Sidecar JSON preserves the source's grouped characteristics form verbatim plus a _scale_direction tag ("1-9 (9 = best)") so the Phase 2 chunker can re-bucket into the canonical disease/agronomic flats from CLAUDE.md.
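The sitemap-based discovery step can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function name `group_by_brand`, the `example.com` URLs, and the assumption that each product slug starts with its brand name (suggested by `dekalb-dkc081-18rib`) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard namespace used by sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
BRANDS = ("dekalb", "asgrow", "westbred")


def group_by_brand(sitemap_xml: str) -> dict[str, list[str]]:
    """Group sitemap <loc> URLs by brand, keyed on the slug prefix (assumed)."""
    root = ET.fromstring(sitemap_xml)
    grouped = defaultdict(list)
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        slug = url.rstrip("/").rsplit("/", 1)[-1]
        for brand in BRANDS:
            if slug.startswith(brand + "-"):
                grouped[brand].append(url)
                break  # a slug belongs to at most one brand
    return dict(grouped)


# Made-up sample; the real sitemap-dynamic.xml layout may differ.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/dekalb-dkc081-18rib</loc></url>
  <url><loc>https://example.com/products/asgrow-demo-soy/</loc></url>
  <url><loc>https://example.com/products/westbred-demo-wheat</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

counts = {brand: len(urls) for brand, urls in group_by_brand(SAMPLE).items()}
```

Comparing per-brand counts like these against the recon totals (288 + 102 + 85) is a cheap check that discovery has not drifted.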

Test plan

  • --limit 2 --brand dekalb writes 2 corn varieties with 6 characteristics groups each
  • --limit 2 --brand asgrow writes 2 soy varieties with 5 groups each, MG routed correctly
  • --limit 2 --brand westbred writes 2 wheat varieties with 5 groups each, qualitative maturity preserved
  • --force re-fetches existing files; default mode skips
  • --product dekalb-dkc081-18rib resolves to a single variety
  • python -m scrape.runner --source bayer_seeds --limit 1 --brand dekalb --force dispatches correctly
  • Full run (475 varieties, ~8 min @ 1 req/sec) — deferred until first CI refresh
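The skip-versus-force behaviour exercised above could look something like this. It is a sketch under assumptions: `needs_fetch` is a hypothetical helper, and treating the product slug as the `<source_key>` file stem is a guess rather than something the PR states.

```python
import tempfile
from pathlib import Path


def needs_fetch(corpus_dir: Path, source_key: str, force: bool = False) -> bool:
    """Return True when a variety should be (re-)fetched.

    Default mode skips a variety only if both the .md and the .json
    sidecar already exist; force=True always re-fetches.
    """
    if force:
        return True
    md = corpus_dir / f"{source_key}.md"
    sidecar = corpus_dir / f"{source_key}.json"
    return not (md.exists() and sidecar.exists())


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    key = "dekalb-dkc081-18rib"
    before = needs_fetch(corpus, key)              # nothing on disk yet
    (corpus / f"{key}.md").write_text("# stub\n")
    (corpus / f"{key}.json").write_text("{}\n")
    after = needs_fetch(corpus, key)               # both files present
    forced = needs_fetch(corpus, key, force=True)  # force overrides the skip
```

Requiring both files means a run interrupted between the two writes is retried rather than silently skipped.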

Phase status

Phase 1 (first scraper) complete for the largest GREEN source. Next:

  • Phase 2: chunker (rag/chunk.py) with seed-specific chunk_0 preambles + Golden Harvest 9→1 normalization
  • Then: golden_harvest scraper
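One way the planned normalization might use the `_scale_direction` tag, sketched under the assumption that Golden Harvest rates on a reversed 1-9 scale where 1 is best. The `REVERSED` tag string and the `normalize_rating` function are hypothetical; only the Bayer tag value comes from this PR.

```python
CANONICAL = "1-9 (9 = best)"  # tag written by the bayer_seeds scraper
REVERSED = "1-9 (1 = best)"   # hypothetical tag for a reversed-scale source


def normalize_rating(rating: int, scale_direction: str) -> int:
    """Map a 1-9 rating onto the canonical direction (9 = best)."""
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of range: {rating}")
    if scale_direction == CANONICAL:
        return rating
    if scale_direction == REVERSED:
        return 10 - rating  # 1 -> 9, 5 -> 5, 9 -> 1
    raise ValueError(f"unknown scale direction: {scale_direction!r}")
```

Raising on an unknown tag, rather than passing the value through, keeps a mis-tagged source from silently mixing scale directions in the corpus.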
justin added 1 commit 2026-05-25 12:54:38 -04:00
Replace stub with working scraper for all three Bayer seed brands.
Discovery uses the public sitemap-dynamic.xml (475 varieties:
288 DEKALB corn + 102 Asgrow soy + 85 WestBred wheat — matches recon).
Per-variety detail comes from the page's __NEXT_DATA__ JSON island.
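
A minimal sketch of pulling the `__NEXT_DATA__` island out of a page using only the standard library. Next.js embeds this JSON in a `<script id="__NEXT_DATA__">` tag; the `extract_next_data` name and the `product` key in the sample are illustrative assumptions, not the real payload shape.

```python
import json
from html.parser import HTMLParser


class _NextDataParser(HTMLParser):
    """Collects the text content of <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_next_data(html: str) -> dict:
    """Return the parsed JSON island, or raise if the tag is absent."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads("".join(parser.chunks))


# Made-up page; the real key layout under pageProps is not shown in this PR.
PAGE = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"product": {"hybridLabel": "DKC081-18RIB"}}}}'
    "</script></body></html>"
)
data = extract_next_data(PAGE)
```

Because the island is plain JSON in the server-rendered HTML, no headless browser is needed.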

Each variety writes corpus/bayer_seeds/<source_key>.{md,json} with:
- Identity (brand, crop, hybridLabel, productId, releaseYear)
- Maturity routed per crop (RM for corn, MG for soy, qualitative for wheat)
- Trait stack (code + full name)
- Positioning + strengths narrative
- Characteristics groups (DISEASE RATINGS, GROWTH, MANAGEMENT, HARVEST,
  etc.) preserved verbatim from source so the chunker can re-bucket
  into canonical disease/agronomic flats per CLAUDE.md schema
- Regional seed-guide listings with agronomist contacts
- _scale_direction tag (Bayer = "1-9 (9 = best)") for chunker
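
The per-crop maturity routing and the sidecar shape described in the list above might be sketched like this. Key names such as `maturity`, `kind`, and `value`, and both function names, are hypothetical; only `characteristics`, `_scale_direction`, and the identity field names appear in this PR.

```python
# Which maturity notion applies to each crop, per the list above.
MATURITY_KIND = {"corn": "RM", "soy": "MG", "wheat": "qualitative"}


def route_maturity(crop: str, raw_value: str) -> dict:
    """Label a raw maturity value with the notion used for its crop."""
    try:
        kind = MATURITY_KIND[crop]
    except KeyError:
        raise ValueError(f"unsupported crop: {crop!r}") from None
    return {"kind": kind, "value": raw_value}


def build_sidecar(identity: dict, crop: str, raw_maturity: str,
                  characteristics: list) -> dict:
    """Assemble a sidecar dict; characteristics pass through unchanged."""
    return {
        **identity,
        "crop": crop,
        "maturity": route_maturity(crop, raw_maturity),
        "characteristics": characteristics,    # verbatim from source
        "_scale_direction": "1-9 (9 = best)",  # Bayer's rating direction
    }


# Made-up values for illustration only.
sidecar = build_sidecar(
    {"brand": "DEKALB", "hybridLabel": "DKC081-18RIB"},
    "corn",
    "81",
    [{"group": "DISEASE RATINGS", "items": []}],
)
```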

Smoke-tested all three brands (--limit 2 each, plus --product, --force,
and scrape.runner dispatch). Politeness: 1 req/sec, retries on 429/5xx
with Retry-After honored.
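
The retry policy could be sketched along these lines. Assumptions are labeled in the comments: the exponential-backoff fallback, the attempt cap, and the injected `send` callable are illustrative, and only the delta-seconds form of `Retry-After` is handled (the header may also carry an HTTP date).

```python
import time
from typing import Callable, Optional

RETRYABLE = {429} | set(range(500, 600))  # 429 and all 5xx


def retry_delay(status: int, retry_after: Optional[str], attempt: int,
                base: float = 1.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if no retry is warranted."""
    if status not in RETRYABLE:
        return None
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())  # honor the server's hint
    return base * (2 ** attempt)           # assumed fallback: backoff


def fetch_with_retries(send: Callable, url: str, max_attempts: int = 4,
                       sleep: Callable = time.sleep):
    """Call send(url) -> (status, headers, body) until success or cap."""
    for attempt in range(max_attempts):
        status, headers, body = send(url)
        delay = retry_delay(status, headers.get("Retry-After"), attempt)
        if delay is None:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)
    return status, body  # last retryable response after the cap


# Fake transport so the sketch runs without network access.
responses = iter([
    (429, {"Retry-After": "3"}, b""),
    (503, {}, b""),
    (200, {}, b"ok"),
])
sleeps = []
result = fetch_with_retries(lambda url: next(responses),
                            "https://example.com/p", sleep=sleeps.append)
```

Injecting `send` and `sleep` keeps the policy testable without real requests or real waiting.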

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 0fb8d9d92d into main 2026-05-25 12:54:51 -04:00
Reference: justin/seed-mcp#1