b98965a68a
Adds the **first non-Syngenta trial coverage** to the corpus:
| Source | Docs | Publisher | URL pattern |
|---|---|---|---|
| lg_plot_reports | 1,304 | LG Seeds (AgReliant) | lgseeds.com/performance/{crop} JSON XHR |
| agrigold_plot_reports | 1,003 | AgriGold (AgReliant) | agrigold.com/{crop}/performance/{crop}-yield-results |
Total trial coverage now: gh_plot_reports (4,299) + agripro_trials (14) +
lg_plot_reports (1,304) + agrigold_plot_reports (1,003) = 6,620 trial docs.
**Both scrapers follow the gh_plot_reports template** — same RateLimitedSession
primitive, same TrialResult/PlotReport dataclass shape, same data_type="trial"
sidecar convention. The trial chunker (`rag/chunk.py:_render_trial_chunk`) is
extended to recognize both new sources; they share `_render_gh_plot_chunk`
since their sidecars are structurally identical (just different brand label).
**LG specifics:**
- POST `/performance/{crop}/GetPlots/` returns sparse listing (id, year, lat/lng)
- GET `/performance/{crop}/GetPlotData/?PlotId=X&IsSilage=Y` returns full detail
with state, cooperator, planting/harvest dates, and **top-5 hybrids** (LG +
competitors). Top-5 is what LG publishes publicly; not the full ranking.
- 4 crops: corn (963), soybeans (287), sorghum (10), silage (50). Alfalfa is
absent because LG doesn't run alfalfa plots; that's variety-only data.
- 301 gotcha: www.lgseeds.com redirects to lgseeds.com which drops POST body,
so the scraper hits the apex host directly.
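
A minimal sketch of the two LG endpoints described above, limited to URL construction so it stays network-free. The `https` scheme, the helper names, and the `true`/`false` serialization of `IsSilage` are assumptions for illustration; the real scraper may differ.

```python
from urllib.parse import urlencode

# Apex host on purpose: the www host 301-redirects and the POST body is dropped.
LG_BASE = "https://lgseeds.com"  # scheme is an assumption


def plots_listing_url(crop: str) -> str:
    """URL for the POST that returns the sparse listing (id, year, lat/lng)."""
    return f"{LG_BASE}/performance/{crop}/GetPlots/"


def plot_data_url(crop: str, plot_id: int, is_silage: bool) -> str:
    """URL for the GET that returns full detail for a single plot."""
    # Assumption: IsSilage is sent as "true"/"false"; the actual encoding
    # is not documented here.
    query = urlencode({"PlotId": plot_id, "IsSilage": str(is_silage).lower()})
    return f"{LG_BASE}/performance/{crop}/GetPlotData/?{query}"
```

A scraper would POST to `plots_listing_url(crop)` once per crop, then call `plot_data_url` for each returned plot id.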
**AgriGold specifics:**
- Listing: GET `/{crop}/performance/{crop}-yield-results?harvestYear={year}`
(server-rendered HTML, ~1MB; 408 corn plots in 2025 alone)
- Detail: GET `/{crop_url}/performance/{slug}/{plot_id}` returns the **full
ranking** (not just top-5) plus rich plot management metadata: tillage,
previous crop, fungicide, herbicide, insecticide, irrigation, soil type,
row width, population. Most metadata-rich of the three trial sources.
- Soybean URL slug is singular: `/soybeans/performance/soybean-yield-results/`
- Columns: Rank | Brand | Product | Trait | Ck | H20 (moisture) | Test Wt. |
Yield | Adj Yield (check-adjusted)
- 2 crops: corn (849) + soybeans (157)
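
The singular-slug quirk can be captured in a small lookup. This is a sketch only: `listing_url`, `RESULTS_SLUG`, and the `https` scheme are hypothetical names and assumptions, not taken from the scraper itself.

```python
from urllib.parse import urlencode

AGRIGOLD_BASE = "https://agrigold.com"  # scheme is an assumption

# Crop path segment -> prefix of the "{slug}-yield-results" page.
# Soybeans is the odd one out: plural in the path, singular in the slug.
RESULTS_SLUG = {"corn": "corn", "soybeans": "soybean"}


def listing_url(crop: str, harvest_year: int) -> str:
    """Build the server-rendered listing URL for one crop and harvest year."""
    slug = RESULTS_SLUG[crop]  # KeyError on crops AgriGold does not publish
    query = urlencode({"harvestYear": harvest_year})
    return f"{AGRIGOLD_BASE}/{crop}/performance/{slug}-yield-results?{query}"
```

Keeping the mapping explicit means an unsupported crop fails loudly instead of producing a URL that returns a 404.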
**Indexer needs no changes** — `rag/index.py` auto-discovers any directory
under corpus/ and routes by data_type. Both new sources flow into the
existing trial collection and surface via `search_trials`.
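
To illustrate why no indexer change is needed, here is a rough sketch of directory auto-discovery with routing by `data_type`. It is not the code in `rag/index.py`; the function name and the choice to skip sidecars without a `data_type` are assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path


def discover_by_data_type(corpus_root: Path) -> dict[str, list[Path]]:
    """Group every sidecar under corpus_root by its data_type field."""
    routed: dict[str, list[Path]] = defaultdict(list)
    for source_dir in sorted(p for p in corpus_root.iterdir() if p.is_dir()):
        for sidecar in sorted(source_dir.glob("*.json")):
            meta = json.loads(sidecar.read_text(encoding="utf-8"))
            data_type = meta.get("data_type")
            if data_type is None:
                continue  # assumption: untyped sidecars are ignored
            routed[data_type].append(sidecar)
    return dict(routed)
```

Under this scheme a new source only has to write sidecars with `data_type="trial"` into its own directory to land in the trial collection.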
Years scraped: 2024+2025 (matching gh_plot_reports baseline). 2023 is
available via `--include-2023` on either scraper for future backfill.
scrape/
Per-vendor seed catalog scrapers + the runner that dispatches to
them. Each source lives in scrape/sources/<name>.py with a main()
entrypoint. The runner is a thin shim:
python -m scrape.runner --source bayer_seeds --force
python -m scrape.runner --source golden_harvest --limit 20
python -m scrape.runner --all # only GREEN sources
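
For orientation, a shim of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the actual runner: the GREEN list is copied from the 🟢 rows of the Sources table, treating `--source` and `--all` as mutually exclusive is a guess, and how flags are handed to each `main()` is not specified here.

```python
import argparse
import importlib

# Sources marked 🟢 in the Sources table.
GREEN_SOURCES = ["bayer_seeds", "golden_harvest", "nk", "agripro"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--source", help="module name under scrape/sources/")
    target.add_argument("--all", action="store_true", help="run GREEN sources")
    parser.add_argument("--force", action="store_true", help="re-fetch everything")
    parser.add_argument("--limit", type=int, default=None, help="cap pages fetched")
    return parser


def select_sources(args: argparse.Namespace) -> list[str]:
    """Resolve the parsed flags to the list of source modules to run."""
    return list(GREEN_SOURCES) if args.all else [args.source]


def run(argv: list[str]) -> None:
    args = build_parser().parse_args(argv)
    for name in select_sources(args):
        module = importlib.import_module(f"scrape.sources.{name}")
        # Assumption: main() takes no arguments; the real entrypoints may
        # accept force/limit in some other way.
        module.main()
```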
Output layout
Each scraper writes:
- `corpus/<source>/<source_key>.md`: LLM-visible body (chunk_0 preamble + the variety's marketing + agronomic narrative)
- `corpus/<source>/<source_key>.json`: sidecar metadata (per CLAUDE.md's canonical schema)
source_key is a stable per-vendor slug — typically <brand>-<sku>
lowercased, e.g. dekalb-dkc62-08rib. Stability matters: it's the
join key the MCP uses for get_page(source, source_key).
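
One way to derive such a key, shown as a sketch: `make_source_key` is a hypothetical helper, and collapsing non-alphanumeric runs to a single hyphen is an assumption beyond the lowercasing described above.

```python
import re


def make_source_key(brand: str, sku: str) -> str:
    """Build a stable <brand>-<sku> slug, lowercased."""
    raw = f"{brand}-{sku}".lower()
    # Assumption: spaces, slashes, etc. collapse to a single hyphen so the
    # key is safe as a filename and stable across re-scrapes.
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


print(make_source_key("DEKALB", "DKC62-08RIB"))  # dekalb-dkc62-08rib
```

Because the function is pure, re-running a scraper yields the same key for the same variety, which is what keeps `get_page(source, source_key)` lookups valid.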
Sources
| Source | Module | Verdict | Notes |
|---|---|---|---|
| bayer_seeds | bayer_seeds.py | 🟢 | DEKALB + Asgrow + WestBred, ~475 varieties |
| golden_harvest | golden_harvest.py | 🟢 | ~175 varieties, 9-to-1 disease scale (reverse) |
| nk | nk.py | 🟢 | 29 varieties, ratings in CDN PDFs |
| agripro | agripro.py | 🟢 | 24 wheat varieties |
| becks_pfr | becks_pfr.py | 🟡 | 2,089 research docs via public Sanity GROQ |
| becks_products | becks_products.py | 🟡 | 860 products, identity-only (SeedIQ-gated) |
Pioneer is intentionally absent — see CLAUDE.md and the curated
Pioneer fallback in docs_mcp/lessons.md.
Tips
- Sniff before you scrape. Most catalogs are SPAs that call a backend API. The recon docs in `~/.claude/projects/-home-justin/memory/reference_seed_vendor_recon.md` already capture the endpoints; if you find new ones, update that file.
- Idempotent re-scrapes. Without `--force`, skip pages already on disk. With `--force`, re-fetch everything; that's the monthly cron mode.
- Respect the portals. Back off on 429s. Set a recognizable user-agent (`seed-mcp-scraper/<version>`).
- Normalize at chunk time, not at scrape time. The chunker (Phase 2) handles the 9-to-1 → 1-9 disease-scale flip for Golden Harvest, NOT this scraper. Sidecar JSON should preserve the vendor's raw values + a `_scale_direction` field; the chunker reads that and normalizes the markdown body.
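
The chunk-time flip from the last tip can be sketched as follows. The `_scale_direction` values `"reverse"` and `"standard"` are hypothetical placeholders, since the accepted values are not listed here; only the 9-to-1 → 1-9 arithmetic comes from the text.

```python
def normalize_rating(raw: int, scale_direction: str) -> int:
    """Map a vendor rating on a 1-9 scale onto the canonical direction."""
    if not 1 <= raw <= 9:
        raise ValueError(f"rating out of range: {raw}")
    if scale_direction == "reverse":   # hypothetical value for 9-to-1 vendors
        return 10 - raw                # 9 -> 1, 5 -> 5, 1 -> 9
    if scale_direction == "standard":  # hypothetical value for 1-9 vendors
        return raw
    raise ValueError(f"unknown _scale_direction: {scale_direction!r}")
```

Keeping this in the chunker means the sidecar still records exactly what the vendor published, so a wrong guess about direction can be fixed without re-scraping.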
changelog.py
Reusable as-is from the template. Walks git diff --name-status
output for the commit summary, and git log for the digest history
(Phase 13).
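
As a rough illustration of the first half of that walk, the sketch below parses `git diff --name-status` output into (status, path) pairs. `parse_name_status` is a hypothetical name, not necessarily what changelog.py defines; the tab-separated format with similarity scores on renames and copies is standard git behavior.

```python
def parse_name_status(output: str) -> list[tuple[str, str]]:
    """Parse `git diff --name-status` output into (status, path) pairs."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        # Renames and copies look like "R100\told\tnew"; keep the status
        # letter without the score, and report the new path.
        changes.append((status[0], paths[-1]))
    return changes
```

The output string would typically come from `subprocess.run(["git", "diff", "--name-status", ...], capture_output=True, text=True).stdout`, with the revision range chosen by the caller.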