# scrape/

Per-vendor seed catalog scrapers + the runner that dispatches to them.

Each source lives in `scrape/sources/.py` with a `main()` entrypoint. The runner is a thin shim:

```bash
python -m scrape.runner --source bayer_seeds --force
python -m scrape.runner --source golden_harvest --limit 20
python -m scrape.runner --all    # only GREEN sources
```

## Output layout

Each scraper writes:

- `corpus//.md`: LLM-visible body (chunk_0 preamble + the variety's marketing + agronomic narrative)
- `corpus//.json`: sidecar metadata (per CLAUDE.md's canonical schema)

`source_key` is a stable per-vendor slug, typically `-` lowercased, e.g. `dekalb-dkc62-08rib`. Stability matters: it's the join key the MCP uses for `get_page(source, source_key)`.

## Sources

| Source | Module | Verdict | Notes |
|---|---|---|---|
| `bayer_seeds` | `bayer_seeds.py` | 🟢 | DEKALB + Asgrow + WestBred, ~475 varieties |
| `golden_harvest` | `golden_harvest.py` | 🟢 | ~175 varieties, 9-to-1 disease scale (reverse) |
| `nk` | `nk.py` | 🟢 | 29 varieties, ratings in CDN PDFs |
| `agripro` | `agripro.py` | 🟢 | 24 wheat varieties |
| `becks_pfr` | `becks_pfr.py` | 🟡 | 2,089 research docs via public Sanity GROQ |
| `becks_products` | `becks_products.py` | 🟡 | 860 products, identity-only (SeedIQ-gated) |

Pioneer is intentionally absent; see `CLAUDE.md` and the curated Pioneer fallback in `docs_mcp/lessons.md`.

## Tips

- **Sniff before you scrape.** Most catalogs are SPAs that call a backend API. The recon docs in `~/.claude/projects/-home-justin/memory/reference_seed_vendor_recon.md` already capture the endpoints; if you find new ones, update that file.
- **Idempotent re-scrapes.** Without `--force`, skip pages already on disk. With `--force`, re-fetch everything; that's the monthly cron mode.
- **Respect the portals.** Back off on 429s. Set a recognizable user-agent (`seed-mcp-scraper/`).
- **Normalize at chunk time, not at scrape time.** The chunker (Phase 2) handles the 9-to-1 → 1-9 disease-scale flip for Golden Harvest, NOT this scraper. Sidecar JSON should preserve the vendor's raw values + a `_scale_direction` field; the chunker reads that and normalizes the markdown body.

## changelog.py

Reusable as-is from the template. Walks `git diff --name-status` output for the commit summary, and `git log` for the digest history (Phase 13).
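
For orientation, here is a minimal sketch of what walking `git diff --name-status` output for a commit summary might look like. It is not the actual `changelog.py`; the names `parse_name_status`, `summarize`, and `STATUS_LABELS` are hypothetical, and the summary wording is an assumption. It relies only on the documented output shape: a status letter, a tab, then the path, with renames and copies carrying a similarity score and two paths.

```python
from collections import Counter

# Status letters emitted by `git diff --name-status`.
STATUS_LABELS = {
    "A": "added",
    "M": "modified",
    "D": "deleted",
    "R": "renamed",
    "C": "copied",
}


def parse_name_status(output: str) -> list[tuple[str, str]]:
    """Parse `git diff --name-status` text into (label, path) pairs.

    Hypothetical helper, not the real changelog.py implementation.
    """
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        code = fields[0][:1]  # "R100" -> "R"
        label = STATUS_LABELS.get(code, "other")
        path = fields[-1]  # for renames/copies the last field is the new path
        entries.append((label, path))
    return entries


def summarize(entries: list[tuple[str, str]]) -> str:
    """Build a one-line summary such as '2 added, 1 modified'."""
    counts = Counter(label for label, _ in entries)
    return ", ".join(f"{counts[label]} {label}" for label in sorted(counts))


if __name__ == "__main__":
    sample = (
        "A\tcorpus/nk/a.md\n"
        "M\tcorpus/nk/b.md\n"
        "R100\tcorpus/nk/old.md\tcorpus/nk/new.md\n"
        "D\tcorpus/nk/c.json\n"
    )
    print(summarize(parse_name_status(sample)))
```

In practice the input string would come from something like `subprocess.run(["git", "diff", "--name-status", ...], capture_output=True, text=True).stdout`; which revisions to diff is left open here because the template's choice is not stated.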