golden_harvest: implement scraper (~175 Syngenta corn + soy)

Sitemap-driven scraper for goldenharvestseeds.com. Walks
sitemap-ghs-hybrids.xml to discover product URLs under
/products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates).

Per-variety detail parsed from server-rendered HTML:

- product code (from <h1> / <title>)
- positioning (from <meta name="Description">)
- maturity (from <div class="product-label"><div class="right">):
  integer days for corn, decimal MG for soybeans
- traits derived from product-code suffix (XF, E3, VIP3, GT, Z, etc.)
- 9-row disease tolerance bar chart (#dvDiseaseTolerance) where
  data-percentage / 10 = rating on 1-9 (9 = best) scale
- 9-row agronomic characteristics bar chart (#dvAgronomicChar)
- recommended environment list (.AgronomicMange — upstream typo)
- all 2-column tables (plant description, seed quality, herbicide
  responses, Phytophthora gene, SCN race coverage)
- tech-sheet PDF URL from live HTML (not sitemap — that's stale)
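
A minimal sketch of the two numeric conversions listed above, using only
the standard library. The helper names (percentage_to_rating,
parse_maturity) are illustrative and do not appear in the scraper;
returning None for out-of-range ratings is an assumption, not something
the module does.

```python
import re


def percentage_to_rating(raw):
    """Map a data-percentage attribute value to a 1-9 rating (9 = best).

    Returns None when the attribute is missing, empty, or not an integer.
    Rejecting values outside 1-9 is an assumption made for this sketch.
    """
    if raw is None or not str(raw).strip():
        return None
    try:
        rating = int(str(raw).strip()) // 10
    except ValueError:
        return None
    return rating if 1 <= rating <= 9 else None


def parse_maturity(label_text, crop):
    """Extract the maturity value from the product-label text.

    Corn yields integer days (as a string); soybeans yield a decimal
    maturity group (as a string). Returns None if nothing parses.
    """
    text = re.sub(r"^RM\s*", "", label_text.strip())
    pattern = r"^(\d{2,3})" if crop == "corn" else r"^(\d+(?:\.\d+)?)"
    m = re.match(pattern, text)
    return m.group(1) if m else None
```

Both helpers return strings or None so that a missing value stays
distinguishable from a parsed one.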

302 redirects to /product-finder treated as "discontinued" and
skipped (Golden Harvest still sitemap-lists some retired SKUs).
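
The redirect rule can be sketched as a small status classifier. This
assumes the page was fetched with redirects disabled (for example
requests.get(url, allow_redirects=False)); the function name and the
"live" / "error" labels are hypothetical and only illustrate the
decision.

```python
# Status codes treated as "this product page has moved", i.e. discontinued.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify_response(status_code, location=None):
    """Classify a product-page response fetched without following redirects.

    `location` is the Location header, accepted only so a caller can log
    it; it does not affect the result in this sketch.
    """
    if status_code in REDIRECT_CODES:
        return "discontinued"
    if 200 <= status_code < 300:
        return "live"
    return "error"
```

Treating every redirect code the same way keeps the rule simple; a
stricter variant could also check that the Location header points at the
product-finder page.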

Rating scale: 1-9 (9 = best) — same as Bayer despite recon's
"9-to-1" descriptor (that referred to chart-axis direction, not
numeric meaning). _scale_direction is set explicitly so the chunker
stays forward-compatible.
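
As an illustration of why the explicit field helps, a chunker could
branch on it roughly as follows. This is a hypothetical sketch: the
chunker is not part of this commit, normalize_rating is an invented
name, and the "1-9 (1 = best)" label is an assumed value for a vendor
that genuinely reverses the scale.

```python
CANONICAL = "1-9 (9 = best)"
REVERSED = "1-9 (1 = best)"  # assumed label; no current source emits it


def normalize_rating(value, scale_direction):
    """Return the rating on the canonical 1-9 (9 = best) scale."""
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range: {value}")
    if scale_direction == CANONICAL:
        return value
    if scale_direction == REVERSED:
        # Mirror around the midpoint: 1 <-> 9, 2 <-> 8, and so on.
        return 10 - value
    raise ValueError(f"unknown scale direction: {scale_direction!r}")
```

Failing loudly on an unknown direction avoids silently mixing scales in
the corpus.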

PDFs are NOT downloaded (recon flagged ~14MB each); tech-sheet URLs
are captured in the sidecar for future enrichment.

Smoke-tested all branches: 6 corn varieties (E085Z5, E092W5,
E094Z4, E095D3, E097K6, E100A3) with all 6 characteristics groups
+ tech-sheet URL; 3 soy varieties (GH00864XF MG 0.08, GH00973E3
MG 0.09, GH0225XF MG 0.2) with disease + agronomic bars; 302
redirects skipped cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:30:30 -04:00
parent 28d8cb83b3
commit 1409c2617d
+662 -25
@@ -1,42 +1,679 @@
-"""Golden Harvest scraper (Syngenta brand).
-
-Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
-every variety page. Server-rendered HTML no headless browser
-required. Tech-sheet PDFs live on the Syngenta CDN at
-``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
-fetcher pattern as NK.
-
-Two gotchas:
-
-1. **Sitemap PDF dates are stale** (the sitemap was generated
-   2025-03-31 and never updated). Resolve the LIVE PDF URL from the
-   product HTML page, not from the sitemap entry.
-
-2. **Disease scale is reversed.** Golden Harvest publishes ratings
-   on a 9-to-1 scale (9 = best, 1 = worst). Bayer/NK/AgriPro use
-   1-9 (9 = best). Normalize at chunk time so the corpus has a
-   single direction. Record the original direction in the chunk_0
-   preamble: "Note: ratings normalized to 1-9 (9 = best). Golden
-   Harvest publishes on a 9-to-1 scale natively."
-
-Expected count: ~175 varieties (89 corn + 86 soy). No wheat.
-
-Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
-yield trial records. Out of scope for v1 but a high-value future
-ingest for regional placement recommendations.
-
-TODO: implement. Reuse the PDF-fetch helper that NK uses.
-"""
+"""Golden Harvest (Syngenta) seed scraper — corn + soybeans.
+
+Source: ``www.goldenharvestseeds.com`` — ASP.NET WebForms site,
+server-rendered HTML (no Next.js / SPA). robots.txt is permissive
+(no Disallow for /products/).
+
+Discovery: ``/sitemap-ghs-hybrids.xml`` lists ~175 product URLs
+under ``/products/corn/`` and ``/products/soybean/``. The sitemap
+also references thousands of regional plot-report pages we are NOT
+indexing (those are head-to-head trial results, useful but a separate
+corpus from variety identity — defer to a future ``gh_plot_reports``
+source).
+
+A subset of the sitemap-listed product URLs 302-redirect to the
+generic ``/<crop>/product-finder/`` page — those are discontinued
+varieties Golden Harvest still lists in the sitemap. We do NOT
+follow redirects; 302 → skip.
+
+Per-variety data lives in the page HTML in two shapes:
+
+1. **Tables** — ``<table>`` elements with two columns
+   (label, value). For corn pages: plant description, maturity
+   (RM days / GDU), planting rate. For soy pages: plant description,
+   seed quality + herbicide responses, Phytophthora / SCN genes.
+
+2. **Bar charts** — ``<div class="bar-row">`` elements inside
+   ``#dvDiseaseTolerance`` and ``#dvAgronomicChar``. Each bar's
+   ``data-percentage="N"`` value encodes the rating: percent / 10
+   = rating on the 1-9 scale (9 = best, same as Bayer). Empty
+   ``<div class="bar-wrapper">`` content means "no data".
+
+Per CLAUDE.md the recon described GH ratings as a "9-to-1 reversed"
+scale, but inspection of the rendered bars + the published "rating
+9 = best" convention shows GH uses the canonical 1-9 (9 = best)
+direction — same as Bayer. No flip needed. The sidecar's
+``_scale_direction`` field declares this so the chunker can be
+forward-compatible if a future vendor genuinely reverses.
+
+Tech-sheet PDFs: a link to ``assets.syngentaebiz.com/pdf/techsheets/
+<CODE>_YYMMDD.pdf`` appears in the product HTML. The sitemap's
+``sitemap-ghs-techsheets.xml`` has STALE date stamps (250331) so we
+always read the live URL from the product page, never the sitemap.
+PDFs aren't ingested yet (recon flagged they're 14MB each, large)
+but the URL is captured in the sidecar for the chunker / future
+enrichment.
+
+Output:
+    corpus/golden_harvest/<source_key>.md      LLM-visible body
+    corpus/golden_harvest/<source_key>.json    sidecar metadata
+
+source_key convention: ``golden_harvest-<sku>`` lowercased, e.g.
+``golden_harvest-e085z5`` or ``golden_harvest-gh00864xf``.
+
+CLI:
+    python -m scrape.sources.golden_harvest --limit 5
+    python -m scrape.sources.golden_harvest --crop corn --limit 20
+    python -m scrape.sources.golden_harvest --force
+"""
from __future__ import annotations

import argparse
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import requests
from bs4 import BeautifulSoup

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "seed-mcp-scraper/0.1 (+https://drawbar.example/contact)"

BASE = "https://www.goldenharvestseeds.com"
SITEMAP_HYBRIDS = f"{BASE}/sitemap-ghs-hybrids.xml"

CROP_PATHS = {
    "corn": "/products/corn/",
    "soybeans": "/products/soybean/",  # URL uses "soybean", schema uses "soybeans"
}

# Bayer + Golden Harvest publish on identical 1-9 (9 = best) ratings
# despite recon mentioning "9-to-1" — the direction descriptor referred
# to the visual chart order, not the numeric meaning. Verified empirically.
RATING_SCALE_DIRECTION = "1-9 (9 = best)"

# Trait suffix → full name. Best-effort mapping from product-code
# suffix, since GH's HTML doesn't expose trait stack as a structured
# field. Maps verified against tech-sheet PDFs + public marketing.
TRAIT_SUFFIX_MAP = {
    # Corn
    "VIP3": "Agrisure Viptera® 3220 E-Z Refuge®",
    "VIP4": "Agrisure Viptera® 4 Trecepta®",
    "GT": "Agrisure GT (glyphosate tolerance)",
    "Z": "Agrisure Duracade® 5222 E-Z Refuge® (above + below-ground)",
    # Soy
    "XF": "XtendFlex® (Roundup Ready 2 Xtend + dicamba + glufosinate)",
    "E3": "Enlist E3® (2,4-D + glyphosate + glufosinate)",
}

REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "golden_harvest"

REQ_INTERVAL_SEC = 1.0

log = logging.getLogger("scrape.golden_harvest")


# --------------------------------------------------------------------- HTTP

class RateLimitedSession:
    """Same shape as bayer_seeds' session. Sleep-based rate limiting
    + polite retries on 429/5xx. We do NOT follow redirects by default:
    302 from a product page → discontinued variety, skip."""

    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(
        self,
        method: str,
        url: str,
        *,
        max_retries: int = 4,
        timeout: float = 30.0,
        allow_redirects: bool = False,
        **kw: Any,
    ) -> requests.Response:
        last_exc: Exception | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(
                    method, url, timeout=timeout,
                    allow_redirects=allow_redirects, **kw,
                )
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                ra = resp.headers.get("Retry-After")
                backoff = (
                    float(ra) if (ra and ra.isdigit())
                    else min(30.0, (2 ** attempt) + random.random())
                )
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc:
            raise last_exc
        return resp  # type: ignore[return-value]

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)


# --------------------------------------------------------------------- model

@dataclass
class GHProduct:
    source_key: str
    source_url: str
    crop: str  # "corn" | "soybeans"
    product_name: str = ""  # e.g. "E085Z5"
    positioning_statement: str | None = None
    relative_maturity: str | None = None  # corn (string of int)
    maturity_group: str | None = None  # soy (string of decimal)
    trait_codes: list[str] = field(default_factory=list)
    trait_descriptions: list[str] = field(default_factory=list)
    characteristics_groups: list[dict] = field(default_factory=list)
    techsheet_url: str | None = None
    sitemap_last_modified: str | None = None


# --------------------------------------------------------------------- discovery

def discover_products(
    http: RateLimitedSession,
    *,
    only_crop: str | None = None,
) -> list[tuple[str, str, str]]:
    """Return ``[(url, crop, lastmod), ...]`` for every GH product page in
    the hybrids sitemap."""
    log.info("fetching sitemap %s", SITEMAP_HYBRIDS)
    r = http.get(SITEMAP_HYBRIDS, allow_redirects=True)
    r.raise_for_status()
    entries = re.findall(
        r"<url>\s*<loc>([^<]+)</loc>\s*(?:<lastmod>([^<]+)</lastmod>)?",
        r.text,
    )
    out: list[tuple[str, str, str]] = []
    for url, lastmod in entries:
        for crop, path in CROP_PATHS.items():
            if only_crop and crop != only_crop:
                continue
            if path in url and url.rstrip("/").count("/") >= 5:
                tail = url.rstrip("/").rsplit("/", 1)[-1]
                if not tail or tail in ("corn", "soybean"):
                    continue
                out.append((url, crop, lastmod or ""))
                break
    by_crop: dict[str, int] = {}
    for _, c, _ in out:
        by_crop[c] = by_crop.get(c, 0) + 1
    log.info("variety URLs found: %s (total=%d)",
             ", ".join(f"{k}={v}" for k, v in sorted(by_crop.items())),
             len(out))
    return out


# --------------------------------------------------------------------- helpers

def source_key_for(url: str) -> str:
    """``.../products/corn/e085z5`` → ``golden_harvest-e085z5``."""
    tail = url.rstrip("/").rsplit("/", 1)[-1].lower()
    return f"golden_harvest-{tail}"


_TRAIT_SUFFIX_RE = re.compile(r"(VIP3|VIP4|VIP|E3|XF|GT)$", re.I)


def derive_traits(product_code: str) -> tuple[list[str], list[str]]:
    """Pull the trait suffix off the product code. Returns
    ``(codes, descriptions)``. Empty if no recognized suffix."""
    if not product_code:
        return [], []
    code = product_code.upper()
    m = _TRAIT_SUFFIX_RE.search(code)
    if not m:
        # The "Z" suffix encodes Duracade-class above + below ground
        # protection on Golden Harvest's corn naming convention.
        # E085Z5 → Z is the Duracade tag.
        if re.search(r"[A-Z]\d+Z\d+$", code):
            return ["Z"], [TRAIT_SUFFIX_MAP.get("Z", "")]
        return [], []
    tok = m.group(0).upper()
    return [tok], [TRAIT_SUFFIX_MAP.get(tok, "")]


def _table_to_items(tbl) -> list[dict]:
    items: list[dict] = []
    for r in tbl.find_all("tr"):
        cells = r.find_all(["th", "td"])
        if len(cells) < 2:
            continue
        label = cells[0].get_text(" ", strip=True)
        value = cells[1].get_text(" ", strip=True)
        if label and value:
            items.append({"characteristic": label, "value": value})
    return items


def _bars_to_items(container) -> list[dict]:
    items: list[dict] = []
    for row in container.find_all("div", class_="bar-row"):
        label_el = row.find("div", class_="bar-label")
        if not label_el:
            continue
        label = label_el.get_text(" ", strip=True)
        bar = row.find("div", class_="bar")
        pct = bar.get("data-percentage") if bar else None
        if pct is None or str(pct).strip() == "":
            items.append({"characteristic": label, "value": "-"})
            continue
        try:
            rating = int(int(pct) / 10)
        except (TypeError, ValueError):
            rating = None
        if rating is None:
            items.append({"characteristic": label, "value": str(pct)})
        else:
            items.append({"characteristic": label, "value": str(rating)})
    return items


CHART_SECTIONS = [
    # (label_for_sidecar, div_id)
    ("DISEASE RATINGS", "dvDiseaseTolerance"),
    ("AGRONOMIC CHARACTERISTICS", "dvAgronomicChar"),
]


# --------------------------------------------------------------------- detail

def fetch_product_detail(
    http: RateLimitedSession, url: str, crop: str, lastmod: str
) -> GHProduct | None:
    """Fetch + parse one product page. Returns None for discontinued
    varieties (302 → product-finder)."""
    r = http.get(url, allow_redirects=False)
    if r.status_code in (301, 302, 303, 307, 308):
        # Separate the source URL from the redirect target in the log line.
        log.info("skip discontinued (redirect): %s -> %s",
                 url, r.headers.get("Location"))
        return None
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    prod = GHProduct(
        source_key=source_key_for(url),
        source_url=url,
        crop=crop,
        sitemap_last_modified=lastmod or None,
    )

    # Product name (the code) — prefer <h1>, fall back to <title>.
    h1 = soup.find("h1")
    if h1:
        prod.product_name = h1.get_text(strip=True)
    if not prod.product_name:
        t = soup.find("title")
        if t:
            txt = t.get_text(strip=True)
            if "|" in txt:
                prod.product_name = txt.rsplit("|", 1)[-1].strip()

    # Positioning — meta name="Description"
    meta = soup.find("meta", attrs={"name": "Description"})
    if meta and meta.get("content"):
        desc = meta["content"].strip()
        if prod.product_name:
            prefix = prod.product_name + "."
            if desc.startswith(prefix):
                desc = desc[len(prefix):].strip()
        prod.positioning_statement = desc or None

    # Traits inferred from product code.
    prod.trait_codes, prod.trait_descriptions = derive_traits(prod.product_name)

    # Tables: capture every two-column table we find, labeled by the
    # nearest preceding heading text.
    table_groups: list[dict] = []
    for tbl in soup.find_all("table"):
        items = _table_to_items(tbl)
        if not items:
            continue
        label = None
        cur = tbl
        for _ in range(8):
            cur = cur.find_previous(["h2", "h3", "h4", "strong"])
            if cur is None:
                break
            t = cur.get_text(strip=True)
            if t:
                label = t
                break
        label = label or "PRODUCT DATA"
        table_groups.append({
            "label": label.upper(),
            "type": "table",
            "items": items,
        })

    # Bar-chart sections.
    chart_groups: list[dict] = []
    for label, div_id in CHART_SECTIONS:
        container = soup.find(id=div_id)
        if not container:
            continue
        items = _bars_to_items(container)
        if items:
            chart_groups.append({
                "label": label,
                "type": "chart",
                "items": items,
            })

    # Recommended environments / management ("AgronomicMange" — typo
    # in upstream class name). Rendered as a flat list of strings.
    am = soup.find(class_="AgronomicMange")
    if am:
        recs = [t.strip() for t in am.stripped_strings if t.strip()]
        if recs:
            chart_groups.append({
                "label": "RECOMMENDED MANAGEMENT",
                "type": "list",
                "items": [{"characteristic": x, "value": ""} for x in recs],
            })

    prod.characteristics_groups = chart_groups + table_groups

    # Maturity routing per crop. The canonical place GH publishes the
    # maturity number is the product-label hero block:
    #   <div class="product-label"><div class="right"><span>RM</span>NN</div></div>
    # — same DOM shape on corn and soybean pages, just different units
    # (integer days for corn, MG decimal for soy). The maturity table
    # (corn only) is a useful fallback.
    label_rm = None
    pl = soup.find(class_="product-label")
    if pl:
        right = pl.find(class_="right")
        if right:
            # The <span>RM</span> sits before the value; get_text drops
            # the span boundary, so strip the literal "RM" prefix.
            t = right.get_text(" ", strip=True)
            t = re.sub(r"^RM\s*", "", t).strip()
            if t:
                label_rm = t
    if label_rm:
        if prod.crop == "corn":
            m = re.match(r"^(\d{2,3})", label_rm)
            if m:
                prod.relative_maturity = m.group(1)
        elif prod.crop == "soybeans":
            m = re.match(r"^(\d+(?:\.\d+)?)", label_rm)
            if m:
                prod.maturity_group = m.group(1)

    # Corn-table fallback if the hero header was missing.
    if prod.crop == "corn" and prod.relative_maturity is None:
        for grp in prod.characteristics_groups:
            for it in grp.get("items") or []:
                if "relative maturity" in (it.get("characteristic") or "").lower():
                    m = re.match(r"^(\d{2,3})", (it.get("value") or "").strip())
                    if m:
                        prod.relative_maturity = m.group(1)
                        break
            if prod.relative_maturity:
                break

    # Tech-sheet PDF link.
    ts = soup.find("a", href=re.compile(r"assets\.syngentaebiz\.com/pdf/techsheets/"))
    if ts:
        prod.techsheet_url = ts["href"]
    else:
        m = re.search(
            r'(https?://assets\.syngentaebiz\.com/pdf/techsheets/[^"\s<>]+\.pdf)',
            r.text,
        )
        if m:
            prod.techsheet_url = m.group(1)

    return prod


# --------------------------------------------------------------------- render

def render_markdown(p: GHProduct) -> str:
    title = p.product_name or p.source_key
    crop_label = "Corn" if p.crop == "corn" else "Soybeans"

    maturity_lines: list[str] = []
    if p.relative_maturity and p.crop == "corn":
        maturity_lines.append(f"- **Relative maturity:** {p.relative_maturity}")
    if p.maturity_group and p.crop == "soybeans":
        maturity_lines.append(f"- **Maturity group:** {p.maturity_group}")

    trait_line = ""
    if p.trait_codes:
        codes = ", ".join(p.trait_codes)
        if p.trait_descriptions and any(p.trait_descriptions):
            trait_line = f"- **Traits:** {codes} ({'; '.join(p.trait_descriptions)})"
        else:
            trait_line = f"- **Traits:** {codes}"

    head = [
        f"# {title}",
        "",
        "- **Vendor:** Syngenta",
        "- **Brand:** Golden Harvest",
        f"- **Crop:** {crop_label}",
        *maturity_lines,
    ]
    if trait_line:
        head.append(trait_line)
    head.append(f"- **Source:** {p.source_url}")
    if p.techsheet_url:
        head.append(f"- **Tech sheet (PDF):** {p.techsheet_url}")
    head.append(f"- **Rating scale (Golden Harvest):** {RATING_SCALE_DIRECTION}")
    head.append("")
    head.append("---")
    head.append("")

    sections: list[str] = []
    if p.positioning_statement:
        sections.append("## Positioning\n\n" + p.positioning_statement.strip() + "\n")
    for g in p.characteristics_groups:
        label = (g.get("label") or "Characteristics").title()
        items = g.get("items") or []
        if not items:
            continue
        rows = "\n".join(f"| {it['characteristic']} | {it['value']} |" for it in items)
        sections.append(
            f"## {label}\n\n"
            "| Characteristic | Value |\n"
            "|---|---|\n"
            f"{rows}\n"
        )
    return "\n".join(head) + "\n".join(sections)


# --------------------------------------------------------------------- write

def write_product(prod: GHProduct, body_md: str) -> None:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.source_key}.md"
    json_path = CORPUS_DIR / f"{prod.source_key}.json"
    md_path.write_text(body_md, encoding="utf-8")
    sidecar = {
        "source": "golden_harvest",
        "source_key": prod.source_key,
        "vendor": "Syngenta",
        "brand": "Golden Harvest",
        "product_name": prod.product_name,
        "product_id": None,
        "hybrid_prefix": prod.product_name,
        "hybrid_suffix": None,
        "crop": prod.crop,
        "release_year": None,
        "relative_maturity": prod.relative_maturity,
        "maturity_group": prod.maturity_group,
        "wheat_class": None,
        "trait_stack": prod.trait_codes,
        "trait_descriptions": prod.trait_descriptions,
        "positioning_statement": prod.positioning_statement,
        "strengths": [],
        "characteristics_groups": prod.characteristics_groups,
        "_scale_direction": RATING_SCALE_DIRECTION,
        "regional_recommendations": [],
        "image_url": None,
        "techsheet_url": prod.techsheet_url,
        "source_urls": [prod.source_url],
        "sitemap_last_modified": prod.sitemap_last_modified,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline

def process_product(
    http: RateLimitedSession,
    *,
    url: str,
    crop: str,
    lastmod: str,
    force: bool,
) -> tuple[str, GHProduct | None]:
    source_key = source_key_for(url)
    md_path = CORPUS_DIR / f"{source_key}.md"
    if md_path.exists() and not force:
        return "skipped", None
    try:
        prod = fetch_product_detail(http, url, crop, lastmod)
    except Exception as exc:  # noqa: BLE001
        log.error("detail fetch failed for %s: %s", url, exc)
        return "failed", None
    if prod is None:
        return "discontinued", None
    body = render_markdown(prod)
    write_product(prod, body)
    return "written", prod


def run(
    *,
    limit: int | None,
    force: bool,
    only_crop: str | None,
    only_product: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    targets = discover_products(http, only_crop=only_crop)
    if only_product:
        targets = [
            (u, c, lm) for (u, c, lm) in targets
            if source_key_for(u) == only_product
            or u.rstrip("/").rsplit("/", 1)[-1].lower() == only_product.lower()
        ]
        if not targets:
            log.error("no variety matched --product=%s", only_product)
            return 2

    counts = {"written": 0, "skipped": 0, "discontinued": 0, "failed": 0}
    processed = 0
    for url, crop, lastmod in targets:
        if limit is not None and processed >= limit:
            break
        processed += 1
        status, prod = process_product(
            http, url=url, crop=crop, lastmod=lastmod, force=force,
        )
        counts[status] = counts.get(status, 0) + 1
        if prod is not None:
            log.info(
                "[%d/%s] %s %s | crop=%s rm/mg=%s traits=%s groups=%d techsheet=%s",
                processed, str(limit) if limit else "all",
                prod.source_key, status, prod.crop,
                prod.relative_maturity or prod.maturity_group or "-",
                ",".join(prod.trait_codes) or "-",
                len(prod.characteristics_groups),
                "y" if prod.techsheet_url else "n",
            )
        else:
            log.info("[%d/%s] %s %s",
                     processed, str(limit) if limit else "all",
                     source_key_for(url), status)

    log.info(
        "done: processed=%d written=%d skipped=%d discontinued=%d failed=%d "
        "(of %d candidates)",
        processed, counts["written"], counts["skipped"],
        counts["discontinued"], counts["failed"], len(targets),
    )
    return 0 if counts["failed"] == 0 else 1


# --------------------------------------------------------------------- CLI

def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.golden_harvest",
        description="Scrape Golden Harvest (Syngenta) corn + soybean varieties.",
    )
    p.add_argument("--limit", type=int, default=None,
                   help="Stop after processing N varieties (default: all).")
    p.add_argument("--force", action="store_true",
                   help="Re-fetch even if the markdown file already exists.")
    p.add_argument("--crop", default=None, choices=("corn", "soybeans"),
                   help="Limit to one crop.")
    p.add_argument("--product", default=None,
                   help="Process a single variety by source_key or URL tail.")
    p.add_argument("--log-level", default=os.environ.get("LOG_LEVEL", "INFO"))
    return p
 def main(argv: list[str] | None = None) -> int:
-    print("golden_harvest: not implemented yet — see CLAUDE.md for the disease-scale-reversal gotcha and the live-PDF-URL-resolution requirement",
-          file=sys.stderr)
-    return 2
+    args = _build_argparser().parse_args(argv)
+    logging.basicConfig(
+        level=args.log_level.upper(),
+        format="%(asctime)s %(levelname)s %(name)s %(message)s",
+        stream=sys.stderr,
+    )
+    return run(
+        limit=args.limit,
+        force=args.force,
+        only_crop=args.crop,
+        only_product=args.product,
+    )
 if __name__ == "__main__":
-    sys.exit(main(sys.argv[1:]))
+    sys.exit(main())