Two new trial sources: LG Seeds + AgriGold plot reports (+2,307 cross-vendor yield trials) #15

Merged
justin merged 1 commits from lg-agrigold-trials into main 2026-05-26 22:27:17 -04:00

Summary

  • Adds lg_plot_reports (1,304 plots) and agrigold_plot_reports (1,003 plots) — first non-Syngenta trial coverage in the corpus.
  • Both publishers are AgReliant Genetics (LG + AgriGold are sister brands). Results inside each plot are cross-vendor (LG/DEKALB/Pioneer/Dairyland/etc.).
  • Together with gh_plot_reports (4,299) and agripro_trials (14), total trial coverage is now 6,620 docs.

Implementation

  • Both scrapers follow the gh_plot_reports template (RateLimitedSession, TrialResult/PlotReport dataclasses, sidecar shape).
  • rag/chunk.py:_render_trial_chunk extended to dispatch lg_plot_reports / agrigold_plot_reports through the same renderer, since their sidecars are structurally identical to the gh_plot_reports sidecars.
  • sources.json updated with both entries (verdict: green).
  • Indexer auto-discovers via data_type=trial — no index.py changes needed.

LG specifics

  • POST /performance/{crop}/GetPlots/ → listing, GET /performance/{crop}/GetPlotData/ → detail (top-5).
  • 4 crops: corn (963), soybeans (287), sorghum (10), silage (50) for 2024+2025.
  • 301 gotcha: must hit apex lgseeds.com (no www) or POST body is dropped.

AgriGold specifics

  • Server-rendered HTML; listing page lists 408 plots in one shot for 2025 corn.
  • Detail page exposes full ranking (not just top-5) + rich plot management metadata (tillage, prev crop, fungicide, soil type, irrigation).
  • Most metadata-rich of the three plot-report sources.

Test plan

  • [x] Smoke test: 3 plots per source; both produce clean .md + .json sidecars
  • [x] Full scrape: 1,304 LG + 1,003 AgriGold, 0 failed, 3 missing (non-JSON edge cases)
  • [ ] CI rebuilds Chroma + BM25 from corpus (auto on merge)
  • [ ] Watchtower picks up new :latest image on trashpanda
  • [ ] Smoke test through drawbar-backend-api proxy: query seed:search_trials brand=LG
justin added 1 commit 2026-05-26 22:27:04 -04:00
Adds the **first non-Syngenta trial coverage** to the corpus:

| Source | Docs | Publisher | URL pattern |
|---|---|---|---|
| lg_plot_reports | 1,304 | LG Seeds (AgReliant) | lgseeds.com/performance/{crop} JSON XHR |
| agrigold_plot_reports | 1,003 | AgriGold (AgReliant) | agrigold.com/{crop}/performance/{crop}-yield-results |

Total trial coverage now: gh_plot_reports (4,299) + agripro_trials (14) +
lg_plot_reports (1,304) + agrigold_plot_reports (1,003) = 6,620 trial docs.
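
The totals above can be re-derived with a few lines of arithmetic. This is only a sanity check on the counts quoted in this PR; the dictionary is illustrative and is not read from the corpus.

```python
# Per-source document counts exactly as stated in this PR (not measured here).
trial_docs = {
    "gh_plot_reports": 4299,
    "agripro_trials": 14,
    "lg_plot_reports": 1304,
    "agrigold_plot_reports": 1003,
}

# Docs contributed by the two new sources, and the corpus-wide trial total.
new_docs = trial_docs["lg_plot_reports"] + trial_docs["agrigold_plot_reports"]
total_docs = sum(trial_docs.values())

print(f"new: {new_docs}, total: {total_docs}")
```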

**Both scrapers follow the gh_plot_reports template** — same RateLimitedSession
primitive, same TrialResult/PlotReport dataclass shape, same data_type="trial"
sidecar convention. The trial chunker (`rag/chunk.py:_render_trial_chunk`) is
extended to recognize both new sources; they share `_render_gh_plot_chunk`
since their sidecars are structurally identical (only the brand label differs).

**LG specifics:**
- POST `/performance/{crop}/GetPlots/` returns sparse listing (id, year, lat/lng)
- GET `/performance/{crop}/GetPlotData/?PlotId=X&IsSilage=Y` returns full detail
  with state, cooperator, planting/harvest dates, and **top-5 hybrids** (LG +
  competitors). Top-5 is what LG publishes publicly; not the full ranking.
- 4 crops: corn (963), soybeans (287), sorghum (10), silage (50). Alfalfa is
  absent because LG doesn't run alfalfa plots; its alfalfa data is variety-only.
- 301 gotcha: www.lgseeds.com redirects to lgseeds.com which drops POST body,
  so the scraper hits the apex host directly.
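
The endpoint shapes and the redirect workaround above can be sketched as URL helpers. Many HTTP clients re-issue a redirected POST as a GET, which is one way a body gets lost on a 301, so building every URL on the apex host sidesteps the redirect entirely. The helper names, the `https` scheme, and the `true`/`false` encoding of `IsSilage` are assumptions; the paths and query parameter names are the ones listed above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

LG_APEX_HOST = "lgseeds.com"  # apex host; the www host answers with a 301


def to_apex(url: str) -> str:
    """Strip a leading 'www.' so a request never passes through the 301."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def lg_listing_url(crop: str) -> str:
    """URL for the POST that returns the sparse plot listing."""
    return f"https://{LG_APEX_HOST}/performance/{crop}/GetPlots/"


def lg_detail_url(crop: str, plot_id: int, is_silage: bool) -> str:
    """URL for the GET that returns one plot's detail (top-5 hybrids)."""
    # Assumption: the endpoint accepts lowercase true/false for IsSilage.
    query = urlencode({"PlotId": plot_id, "IsSilage": str(is_silage).lower()})
    return f"https://{LG_APEX_HOST}/performance/{crop}/GetPlotData/?{query}"
```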

**AgriGold specifics:**
- Listing: GET `/{crop}/performance/{crop}-yield-results?harvestYear={year}`
  (server-rendered HTML, ~1MB; 408 corn plots in 2025 alone)
- Detail: GET `/{crop_url}/performance/{slug}/{plot_id}` returns the **full
  ranking** (not just top-5) plus rich plot management metadata: tillage,
  previous crop, fungicide, herbicide, insecticide, irrigation, soil type,
  row width, population. Most metadata-rich of the three plot-report sources.
- Soybean URL slug is singular: `/soybeans/performance/soybean-yield-results/`
- Columns: Rank | Brand | Product | Trait | Ck | H2O (moisture) | Test Wt. |
  Yield | Adj Yield (check-adjusted)
- 2 crops: corn (849) + soybeans (157)
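
The singular-slug quirk above is easy to get wrong when building listing URLs, so a small lookup table helps. This is a sketch under assumptions: the helper name, the `https://agrigold.com` base, and the table are illustrative, while the path pattern and the `harvestYear` parameter are the ones described in the list.

```python
from urllib.parse import urlencode

AGRIGOLD_BASE = "https://agrigold.com"  # scheme is an assumption

# URL path segment -> stem used in the "-yield-results" slug.
# Soybeans is the odd one out: plural path segment, singular slug.
SLUG_STEM = {
    "corn": "corn",
    "soybeans": "soybean",
}


def agrigold_listing_url(crop: str, harvest_year: int) -> str:
    """Build the server-rendered listing URL for one crop and harvest year."""
    stem = SLUG_STEM[crop]  # KeyError on an unsupported crop is intentional
    query = urlencode({"harvestYear": harvest_year})
    return f"{AGRIGOLD_BASE}/{crop}/performance/{stem}-yield-results?{query}"
```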

**Indexer needs no changes** — `rag/index.py` auto-discovers any directory
under corpus/ and routes by data_type. Both new sources flow into the
existing trial collection and surface via `search_trials`.
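
To illustrate why no indexer change is needed, here is a rough sketch of directory auto-discovery keyed on `data_type`. It is not the code in `rag/index.py`: the function name, the one-JSON-sidecar-per-document layout, and the choice to skip sidecars without a `data_type` are assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def discover_by_data_type(corpus_root: Path) -> Dict[str, List[Path]]:
    """Group every sidecar under corpus_root by its data_type field."""
    routed: Dict[str, List[Path]] = defaultdict(list)
    # Every subdirectory is treated as a source; none are named in code,
    # so a newly added source directory is picked up automatically.
    for source_dir in sorted(p for p in corpus_root.iterdir() if p.is_dir()):
        for sidecar in sorted(source_dir.glob("*.json")):
            meta = json.loads(sidecar.read_text(encoding="utf-8"))
            data_type = meta.get("data_type")
            if data_type:  # sidecars without a data_type are skipped
                routed[data_type].append(sidecar)
    return dict(routed)
```

Under this layout, sidecars from `lg_plot_reports` and `agrigold_plot_reports` would land in the same `"trial"` bucket as the existing sources.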

Years scraped: 2024+2025 (matching gh_plot_reports baseline). 2023 is
available via `--include-2023` on either scraper for future backfill.
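
A minimal sketch of how an opt-in backfill flag like this can be wired with `argparse`. Only the `--include-2023` flag name comes from this PR; the function names and the way the year list is assembled are assumptions.

```python
import argparse
from typing import List

BASELINE_YEARS = [2024, 2025]  # matches the gh_plot_reports baseline


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape plot reports")
    parser.add_argument(
        "--include-2023",
        action="store_true",
        help="also scrape the 2023 harvest year (backfill)",
    )
    return parser


def years_to_scrape(argv: List[str]) -> List[int]:
    """Return the harvest years to scrape for the given CLI arguments."""
    args = build_parser().parse_args(argv)
    # argparse exposes --include-2023 as the attribute include_2023.
    return ([2023] if args.include_2023 else []) + BASELINE_YEARS
```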
justin merged commit e40de053c4 into main 2026-05-26 22:27:17 -04:00
Reference: justin/seed-mcp#15