seed-mcp/scrape/sources/golden_harvest.py

"""Golden Harvest scraper (Syngenta brand).

Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
every variety page. Server-rendered HTML — no headless browser
required. Tech-sheet PDFs live on the Syngenta CDN at
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
fetcher pattern as NK.

Two gotchas:

1. **Sitemap PDF dates are stale** (the sitemap was generated
   2025-03-31 and never updated). Resolve the LIVE PDF URL from the
   product HTML page, not from the sitemap entry.

2. **Disease scale is reversed.** Golden Harvest publishes ratings
   on a 9-to-1 scale (1 = best, 9 = worst). Bayer/NK/AgriPro use
   1-9 (9 = best). Normalize at chunk time (``10 - rating``) so the
   corpus has a single direction. Record the original direction in
   the chunk_0 preamble: "Note: ratings normalized to 1-9 (9 = best).
   Golden Harvest publishes on a 9-to-1 scale natively."

Expected count: ~175 varieties (89 corn + 86 soy). No wheat.

Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
yield trial records. Out of scope for v1 but a high-value future
ingest for regional placement recommendations.

TODO: implement. Reuse the PDF-fetch helper that NK uses.
"""
from __future__ import annotations

import sys
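
# --- Sketch: sitemap discovery (illustrative, not the final scraper) ---
# The module docstring says sitemap.xml lists every variety page. One
# possible way to pull page URLs out of a sitemap with only the standard
# library is shown below. Assumptions: the sitemap uses the standard
# sitemaps.org namespace, and ``path_hint`` is a hypothetical substring
# filter, since the real variety-page URL layout is not verified here.
import xml.etree.ElementTree as ET

_SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_urls(xml_text: str, path_hint: str = "") -> list:
    """Return every <loc> URL in a sitemap, optionally filtered by substring."""
    root = ET.fromstring(xml_text)
    urls = [
        (loc.text or "").strip()
        for loc in root.findall("sm:url/sm:loc", _SITEMAP_NS)
    ]
    # An empty ``path_hint`` matches every URL, so the filter is a no-op.
    return [u for u in urls if u and path_hint in u]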


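# --- Sketch: live tech-sheet URL resolution (illustrative) ---
# Gotcha 1 in the docstring says the PDF URL must come from the product
# HTML, not the sitemap. A minimal approach is a regex over the page source
# for the CDN pattern named in the docstring. Assumptions: the link appears
# verbatim in the HTML, <CODE> contains no "_" or "/", the date is exactly
# six digits, and when several links match, the newest date is the live
# one. The function name is hypothetical.
import re
from typing import Optional

_TECHSHEET_RE = re.compile(
    r"(?:https?:)?//assets\.syngentaebiz\.com/pdf/techsheets/"
    r"(?P<code>[^/_\"'\s]+)_(?P<date>\d{6})\.pdf",
    re.IGNORECASE,
)


def find_live_techsheet_url(html: str) -> Optional[str]:
    """Return the newest-dated tech-sheet URL found in ``html``, or None."""
    best = None
    for match in _TECHSHEET_RE.finditer(html):
        # YYMMDD strings sort chronologically within a single century.
        if best is None or match.group("date") > best.group("date"):
            best = match
    if best is None:
        return None
    url = best.group(0)
    # Protocol-relative links ("//host/...") get an explicit scheme.
    return "https:" + url if url.startswith("//") else url

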
def main(argv: list[str] | None = None) -> int:
    # Stub: exit non-zero so callers do not mistake this for a successful run.
    print(
        "golden_harvest: not implemented yet; see CLAUDE.md for the "
        "disease-scale-reversal gotcha and the live-PDF-URL-resolution "
        "requirement",
        file=sys.stderr,
    )
    return 2
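

# --- Sketch: disease-rating normalization (illustrative) ---
# Gotcha 2 in the docstring calls for a single rating direction across the
# corpus. Assuming Golden Harvest's native scale runs opposite to the
# corpus convention (native 1 = best, corpus 9 = best) and both use the
# integers 1..9, the flip is ``10 - rating``. The names below are
# hypothetical; the note text is quoted from the docstring.
SCALE_NOTE = (
    "Note: ratings normalized to 1-9 (9 = best). Golden "
    "Harvest publishes on a 9-to-1 scale natively."
)


def normalize_rating(native: int) -> int:
    """Flip a native 1-9 rating so that 9 = best."""
    if not 1 <= native <= 9:
        raise ValueError(f"rating out of range 1-9: {native!r}")
    return 10 - native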


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))