Two new trial sources: LG Seeds + AgriGold plot reports (+2,307 cross-vendor yield trials) #15
Summary
- `lg_plot_reports` (1,304 plots) and `agrigold_plot_reports` (1,003 plots): first non-Syngenta trial coverage in the corpus.
- With `gh_plot_reports` (4,299) and `agripro_trials` (14), total trial coverage is now 6,620 docs.

Implementation

- Both scrapers follow the `gh_plot_reports` template (RateLimitedSession, TrialResult/PlotReport dataclasses, sidecar shape).
- `rag/chunk.py:_render_trial_chunk` extended to dispatch `lg_plot_reports` / `agrigold_plot_reports` through the same renderer; sidecars are structurally identical to GH's.
- `sources.json` updated with both entries (verdict: green).
- `data_type=trial`, so no `index.py` changes needed.

LG specifics

- POST `/performance/{crop}/GetPlots/` → listing, GET `/performance/{crop}/GetPlotData/` → detail (top-5).
- Requests go to `lgseeds.com` (no www), or the POST body is dropped.

AgriGold specifics
Test plan
`:latest` image on trashpandaseed:search_trials brand=LG

Adds the **first non-Syngenta trial coverage** to the corpus:

| Source | Docs | Publisher | URL pattern |
|---|---|---|---|
| lg_plot_reports | 1,304 | LG Seeds (AgReliant) | lgseeds.com/performance/{crop} JSON XHR |
| agrigold_plot_reports | 1,003 | AgriGold (AgReliant) | agrigold.com/{crop}/performance/{crop}-yield-results |

Total trial coverage now: gh_plot_reports (4,299) + agripro_trials (14) + lg_plot_reports (1,304) + agrigold_plot_reports (1,003) = 6,620 trial docs.

**Both scrapers follow the gh_plot_reports template**: same RateLimitedSession primitive, same TrialResult/PlotReport dataclass shape, same data_type="trial" sidecar convention. The trial chunker (`rag/chunk.py:_render_trial_chunk`) is extended to recognize both new sources; they share `_render_gh_plot_chunk` since their sidecars are structurally identical (just a different brand label).

**LG specifics:**

- POST `/performance/{crop}/GetPlots/` returns a sparse listing (id, year, lat/lng).
- GET `/performance/{crop}/GetPlotData/?PlotId=X&IsSilage=Y` returns full detail with state, cooperator, planting/harvest dates, and **top-5 hybrids** (LG + competitors). Top-5 is what LG publishes publicly, not the full ranking.
- 4 crops: corn (963), soybeans (287), sorghum (10), silage (50). Alfalfa is absent because LG doesn't run alfalfa plots; that's variety-only data.
- 301 gotcha: www.lgseeds.com redirects to lgseeds.com, and the redirect drops the POST body, so the scraper hits the apex host directly.

**AgriGold specifics:**

- Listing: GET `/{crop}/performance/{crop}-yield-results?harvestYear={year}` (server-rendered HTML, ~1MB; 408 corn plots in 2025 alone).
- Detail: GET `/{crop_url}/performance/{slug}/{plot_id}` returns the **full ranking** (not just top-5) plus rich plot management metadata: tillage, previous crop, fungicide, herbicide, insecticide, irrigation, soil type, row width, population. Most metadata-rich of the three trial sources.
- Soybean URL slug is singular: `/soybeans/performance/soybean-yield-results/`
- Columns: Rank | Brand | Product | Trait | Ck | H20 (moisture) | Test Wt. | Yield | Adj Yield (check-adjusted)
- 2 crops: corn (849) + soybeans (157)

**Indexer needs no changes**: `rag/index.py` auto-discovers any directory under corpus/ and routes by data_type. Both new sources flow into the existing trial collection and surface via `search_trials`.

Years scraped: 2024+2025 (matching the gh_plot_reports baseline). 2023 is available via `--include-2023` on either scraper for future backfill.
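
The URL patterns listed under the LG and AgriGold specifics can be sketched as small builder functions. This is a minimal illustration, not the scrapers' actual code: the function names, the `https` scheme, and the lowercase-boolean encoding of `IsSilage` are assumptions. Only the hosts, paths, query parameter names, the singular soybean slug, and the apex-host requirement for LG come from the description.

```python
from urllib.parse import urlencode

# Apex host on purpose: per the description, www.lgseeds.com 301-redirects
# to lgseeds.com and the POST body is lost along the way.
LG_BASE = "https://lgseeds.com"          # assumption: https scheme
AGRIGOLD_BASE = "https://agrigold.com"   # assumption: https scheme

# AgriGold path segment -> slug prefix. The soybean slug is singular.
AGRIGOLD_SLUGS = {"corn": "corn", "soybeans": "soybean"}


def lg_listing_url(crop: str) -> str:
    """URL for the POST that returns the sparse plot listing."""
    return f"{LG_BASE}/performance/{crop}/GetPlots/"


def lg_detail_url(crop: str, plot_id: int, is_silage: bool = False) -> str:
    """URL for the GET that returns one plot's detail (top-5 hybrids)."""
    # Assumption: IsSilage is sent as a lowercase boolean string.
    query = urlencode({"PlotId": plot_id, "IsSilage": str(is_silage).lower()})
    return f"{LG_BASE}/performance/{crop}/GetPlotData/?{query}"


def agrigold_listing_url(crop: str, year: int) -> str:
    """URL for the server-rendered listing page of one harvest year."""
    slug = AGRIGOLD_SLUGS[crop]
    query = urlencode({"harvestYear": year})
    return f"{AGRIGOLD_BASE}/{crop}/performance/{slug}-yield-results?{query}"
```

Keeping the host constants and the slug mapping in one place means the two gotchas noted above (the LG redirect and the singular soybean slug) are each handled once, not at every call site.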