bayer_seeds: implement Phase 1 scraper (DEKALB + Asgrow + WestBred) #1
Summary
- Discovery uses the public `sitemap-dynamic.xml` (288 + 102 + 85 = 475, exactly matching recon).
- Per-variety detail comes from the page's `__NEXT_DATA__` JSON island: no PDFs, no rendering required.
- Characteristics groups are kept in their source `characteristics` form verbatim, plus a `_scale_direction` tag ("1-9 (9 = best)"), so the Phase 2 chunker can re-bucket into the canonical disease/agronomic flats from CLAUDE.md.

Test plan
- `--limit 2 --brand dekalb` writes 3 corn varieties with 6 characteristics groups each
- `--limit 2 --brand asgrow` writes 2 soy varieties with 5 groups each, MG routed correctly
- `--limit 2 --brand westbred` writes 2 wheat varieties with 5 groups each, qualitative maturity preserved
- `--force` re-fetches existing files; default mode skips
- `--product dekalb-dkc081-18rib` resolves to a single variety
- `python -m scrape.runner --source bayer_seeds --limit 1 --brand dekalb --force` dispatches correctly

Phase status
Phase 1 (first scraper) complete for the largest GREEN source. Next:
- Phase 2 chunker (`rag/chunk.py`) with seed-specific chunk_0 preambles + Golden Harvest 9→1 normalization
- `golden_harvest` scraper

Replace stub with a working scraper for all three Bayer seed brands.

Discovery uses the public sitemap-dynamic.xml (475 varieties: 288 DEKALB corn + 102 Asgrow soy + 85 WestBred wheat, matching recon). Per-variety detail comes from the page's __NEXT_DATA__ JSON island.

Each variety writes corpus/bayer_seeds/<source_key>.{md,json} with:
- Identity (brand, crop, hybridLabel, productId, releaseYear)
- Maturity routed per crop (RM for corn, MG for soy, qualitative for wheat)
- Trait stack (code + full name)
- Positioning + strengths narrative
- Characteristics groups (DISEASE RATINGS, GROWTH, MANAGEMENT, HARVEST, etc.) preserved verbatim from source so the chunker can re-bucket into canonical disease/agronomic flats per CLAUDE.md schema
- Regional seed-guide listings with agronomist contacts
- _scale_direction tag (Bayer = "1-9 (9 = best)") for chunker

Smoke-tested all three brands (--limit 2 each, plus --product, --force, and scrape.runner dispatch).
Politeness: 1 req/sec, retries on 429/5xx with Retry-After honored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
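For reviewers unfamiliar with the `__NEXT_DATA__` approach, here is a minimal sketch of how the per-variety step could work. It is not the code in this PR. The names `NextDataParser`, `extract_next_data`, `route_maturity`, and the `{"kind", "value"}` record shape are hypothetical, chosen only for illustration. Only the RM / MG / qualitative routing and the use of the `__NEXT_DATA__` script tag come from the description above.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed JSON island, or raise ValueError if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads("".join(parser.chunks))


# Maturity label per crop, as listed in the PR description.
MATURITY_KIND = {"corn": "RM", "soy": "MG", "wheat": "qualitative"}


def route_maturity(crop, raw_value):
    """Tag a raw maturity value with its crop-specific kind (assumed record shape)."""
    return {"kind": MATURITY_KIND[crop], "value": raw_value}
```

Parsing the embedded JSON avoids a headless browser, since the page ships its data server-side. The keys inside the island are site-specific, so a real scraper would need to confirm them against a fetched page.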