golden_harvest: implement scraper (~175 Syngenta corn + soy) #4

Merged
justin merged 1 commit from golden-harvest-scraper into main 2026-05-25 13:31:24 -04:00

Summary

  • Sitemap-driven scraper for goldenharvestseeds.com. Walks sitemap-ghs-hybrids.xml for product URLs under /products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates).
  • Per-variety data parsed from server-rendered HTML: identity from <h1> / <title> / <meta name="Description">; maturity from <div class="product-label"><div class="right"> (integer days for corn, decimal MG for soy); traits derived from product-code suffix (XF, E3, VIP3, GT, Z).
  • Ratings: disease tolerance (#dvDiseaseTolerance) + agronomic characteristics (#dvAgronomicChar) bar charts. Each data-percentage / 10 = rating on 1-9 (9 = best) scale — same direction as Bayer.
  • Recommended environments parsed from .AgronomicMange (upstream typo, not ours).
  • Tech-sheet PDF URL captured from live HTML (sitemap-listed dates are stale — recon was correct).
  • 302 redirects → /<crop>/product-finder treated as "discontinued" and skipped.
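
The sitemap walk in the first bullet could be sketched roughly as follows. This is a minimal illustration, not the merged code: the function name `product_urls` is hypothetical, and it assumes the sitemap uses the standard sitemaps.org namespace and that "under" means the URL path starts with the crop prefix.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

# Standard sitemap namespace (assumed to be what sitemap-ghs-hybrids.xml uses).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Path prefixes named in the summary.
CROP_PREFIXES = ("/products/corn/", "/products/soybean/")


def product_urls(sitemap_xml: str) -> List[str]:
    """Return the <loc> URLs whose path starts with a corn or soybean prefix."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if urlparse(url).path.startswith(CROP_PREFIXES):
            urls.append(url)
    return urls
```

Filtering on the parsed path rather than the raw string keeps the check independent of scheme and host.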

Anti-hallucination notes

  • All values passed through verbatim. The chunker already preserves the source's _scale_direction field, so cross-vendor rating comparison can be done correctly.
  • The previously-feared "9-to-1 reversal" turned out to be GH's visual chart axis direction (9 on top, 1 on bottom), not the numeric meaning. Bayer and Golden Harvest both use the canonical 1-9 / 9-best convention. Note added in code comments.
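
To illustrate why carrying `_scale_direction` matters for cross-vendor comparison, here is a hedged sketch of a normalizer. The direction labels `"9_best"` and `"1_best"` and the function name are assumptions for illustration; the actual field values are not shown in this PR.

```python
def to_nine_best(rating: float, scale_direction: str) -> float:
    """Map a 1-9 rating onto the 9-is-best convention.

    The direction labels are hypothetical placeholders for whatever
    _scale_direction actually stores.
    """
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of 1-9 range: {rating}")
    if scale_direction == "9_best":
        return rating           # Bayer and Golden Harvest: already canonical
    if scale_direction == "1_best":
        return 10 - rating      # mirror a reversed scale around the midpoint 5
    raise ValueError(f"unknown scale direction: {scale_direction}")
```

With both current vendors on the 9-best convention the function is an identity, but a future reversed-scale source would only need the correct label.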

Test plan

  • [x] Corn smoke test: 6 varieties scraped (E085Z5 RM 85, E092W5 RM 92, E094Z4 RM 94, E095D3 RM 95, E097K6 RM 97, E100A3 RM 100) with 6 characteristics groups each + tech-sheet URLs resolved.
  • [x] Soy smoke test: GH00864XF / GH00973E3 / GH0225XF parsed with MG hero + trait codes + disease bars + Phytophthora gene + SCN race tables.
  • [x] Discontinued varieties skipped (GH00615E3, GH0116E3; both 302 to product-finder).
  • [ ] Full scrape (~175 varieties): in progress in this session, not yet complete.
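
The maturity values checked above (integer RM for corn, decimal MG for soy) could be parsed along these lines. This is only a sketch: `parse_maturity` is a hypothetical name, and it assumes the label text contains a single plain number, which this PR does not confirm.

```python
import re
from typing import Union


def parse_maturity(text: str, crop: str) -> Union[int, float]:
    """Extract the first number from a maturity label.

    Corn is returned as integer days (RM); anything else is treated as
    soybean and returned as a decimal maturity group (MG).
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no maturity value in {text!r}")
    number = match.group()
    if crop == "corn":
        if "." in number:
            raise ValueError(f"expected integer days for corn, got {number}")
        return int(number)
    return float(number)
```

Raising on a decimal corn value, rather than truncating it, keeps the value verbatim or fails loudly.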

Coverage after merge

  • Corpus is rebuilt with Bayer + Golden Harvest combined: 475 + ~155 (after discontinued skips) = ~630 indexed varieties across 2 vendors / 4 brands / 3 crops.
justin added 1 commit 2026-05-25 13:31:14 -04:00
Sitemap-driven scraper for goldenharvestseeds.com. Walks
sitemap-ghs-hybrids.xml to discover product URLs under
/products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates).

Per-variety detail parsed from server-rendered HTML:

- product code (from <h1> / <title>)
- positioning (from <meta name="Description">)
- maturity (from <div class="product-label"><div class="right">):
  integer days for corn, decimal MG for soybeans
- traits derived from product-code suffix (XF, E3, VIP3, GT, Z, etc.)
- 9-row disease tolerance bar chart (#dvDiseaseTolerance) where
  data-percentage / 10 = rating on 1-9 (9 = best) scale
- 9-row agronomic characteristics bar chart (#dvAgronomicChar)
- recommended environment list (.AgronomicMange — upstream typo)
- all 2-column tables (plant description, seed quality, herbicide
  responses, Phytophthora gene, SCN race coverage)
- tech-sheet PDF URL from live HTML (not sitemap — that's stale)
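
As an illustration of the bar-chart items in the list above, the `data-percentage / 10` conversion might look like the sketch below. It uses only the standard-library HTML parser; the class and function names are hypothetical, and the assumption that each bar is a descendant element of the container carrying a `data-percentage` attribute is inferred from the description, not confirmed against the live markup.

```python
from html.parser import HTMLParser
from typing import List

# Tags that never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BarRatingParser(HTMLParser):
    """Collect data-percentage / 10 for every bar inside one container id."""

    def __init__(self, container_id: str) -> None:
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # 0 means we are outside the container
        self.ratings: List[float] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.depth == 0:
            if attr_map.get("id") == self.container_id:
                self.depth = 1
            return
        pct = attr_map.get("data-percentage")
        if pct is not None:
            self.ratings.append(float(pct) / 10)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def bar_ratings(html: str, container_id: str) -> List[float]:
    """Return ratings in document order for e.g. 'dvDiseaseTolerance'."""
    parser = BarRatingParser(container_id)
    parser.feed(html)
    return parser.ratings
```

Tracking depth scopes the search to one chart, so bars in `#dvAgronomicChar` are not mixed into the `#dvDiseaseTolerance` results.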

302 redirects to /product-finder treated as "discontinued" and
skipped (Golden Harvest still sitemap-lists some retired SKUs).
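
The discontinued check described above could be expressed as a small predicate over the response status and `Location` header. This is a sketch under assumptions: `is_discontinued` is a hypothetical name, and it presumes the HTTP client is configured not to follow redirects automatically so the 302 is visible to the caller.

```python
from typing import Optional
from urllib.parse import urlparse


def is_discontinued(status_code: int, location: Optional[str]) -> bool:
    """True when a product URL 302-redirects to a product-finder page."""
    if status_code != 302 or not location:
        return False
    # Location may be relative or absolute; compare on the path only.
    path = urlparse(location).path.rstrip("/")
    return path.endswith("/product-finder")
```

Matching on the path suffix covers both the corn and soybean finder pages without hard-coding the crop segment.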

Rating scale: 1-9 (9 = best) — same as Bayer despite recon's
"9-to-1" descriptor (that referred to chart-axis direction, not
numeric meaning). _scale_direction is set explicitly so the chunker
stays forward-compatible.

PDFs are NOT downloaded (recon flagged ~14MB each); tech-sheet URLs
are captured in the sidecar for future enrichment.

Smoke-tested all branches: 6 corn varieties (E085Z5, E092W5,
E094Z4, E095D3, E097K6, E100A3) with full 6 characteristics groups
+ tech-sheet URL; 3 soy varieties (GH00864XF MG 0.08, GH00973E3
MG 0.09, GH0225XF MG 0.2) with disease + agronomic bars; 302
redirects skipped cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 9d4a490731 into main 2026-05-25 13:31:24 -04:00

Reference: justin/seed-mcp#4