1409c2617d
Sitemap-driven scraper for goldenharvestseeds.com.

Walks sitemap-ghs-hybrids.xml to discover product URLs under
/products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates).

Per-variety detail parsed from server-rendered HTML:
- product code (from <h1> / <title>)
- positioning (from <meta name="Description">)
- maturity (from <div class="product-label"><div class="right">):
  integer days for corn, decimal MG for soybeans
- traits derived from product-code suffix (XF, E3, VIP3, GT, Z, etc.)
- 9-row disease tolerance bar chart (#dvDiseaseTolerance) where
  data-percentage / 10 = rating on 1-9 (9 = best) scale
- 9-row agronomic characteristics bar chart (#dvAgronomicChar)
- recommended environment list (.AgronomicMange, an upstream typo)
- all 2-column tables (plant description, seed quality, herbicide
  responses, Phytophthora gene, SCN race coverage)
- tech-sheet PDF URL from live HTML (not the sitemap, which is stale)

302 redirects to /product-finder are treated as "discontinued" and
skipped (Golden Harvest still sitemap-lists some retired SKUs).

Rating scale: 1-9 (9 = best), same as Bayer despite recon's "9-to-1"
descriptor (that referred to chart-axis direction, not numeric
meaning). _scale_direction is set explicitly so the chunker stays
forward-compatible.

PDFs are NOT downloaded (recon flagged ~14MB each); tech-sheet URLs
are captured in the sidecar for future enrichment.

Smoke-tested all branches: 6 corn varieties (E085Z5, E092W5, E094Z4,
E095D3, E097K6, E100A3) with all 6 characteristics groups + tech-sheet
URL; 3 soy varieties (GH00864XF MG 0.08, GH00973E3 MG 0.09, GH0225XF
MG 0.2) with disease + agronomic bars; 302 redirects skipped cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
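
Note: a minimal sketch of the data-percentage / 10 conversion described
above, using only the standard library. The names BarChartParser,
ratings_from_fragment and SAMPLE are hypothetical and do not come from
this commit, and the sample markup is an assumption; the real structure
inside #dvDiseaseTolerance may differ.

```python
from html.parser import HTMLParser


class BarChartParser(HTMLParser):
    """Collect data-percentage values from a bar-chart HTML fragment."""

    def __init__(self):
        super().__init__()
        self.percentages = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        value = dict(attrs).get("data-percentage")
        if value is not None:
            self.percentages.append(float(value))


def ratings_from_fragment(fragment):
    """Map each bar's data-percentage to a rating on the 1-9 scale (9 = best)."""
    parser = BarChartParser()
    parser.feed(fragment)
    return [pct / 10 for pct in parser.percentages]


# Hypothetical fragment (assumed markup, for illustration only).
SAMPLE = (
    '<div id="dvDiseaseTolerance">'
    '<div class="bar" data-percentage="70"></div>'
    '<div class="bar" data-percentage="90"></div>'
    '</div>'
)

if __name__ == "__main__":
    print(ratings_from_fragment(SAMPLE))
```

The sketch expects the caller to pass only the chart's own fragment, so
it does not track which container a bar belongs to.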