Two new trial sources: LG Seeds + AgriGold plot reports (+2,307 cross-vendor yield trials)
Adds the **first non-Syngenta trial coverage** to the corpus:
| Source | Docs | Publisher | URL pattern |
|---|---|---|---|
| lg_plot_reports | 1,304 | LG Seeds (AgReliant) | lgseeds.com/performance/{crop} JSON XHR |
| agrigold_plot_reports | 1,003 | AgriGold (AgReliant) | agrigold.com/{crop}/performance/{crop}-yield-results |
Total trial coverage now: gh_plot_reports (4,299) + agripro_trials (14) +
lg_plot_reports (1,304) + agrigold_plot_reports (1,003) = 6,620 trial docs.
**Both scrapers follow the gh_plot_reports template** — same RateLimitedSession
primitive, same TrialResult/PlotReport dataclass shape, same data_type="trial"
sidecar convention. The trial chunker (`rag/chunk.py:_render_trial_chunk`) is
extended to recognize both new sources; they share `_render_gh_plot_chunk`
since their sidecars are structurally identical (just different brand label).
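
A minimal sketch of the routing described above, assuming a sidecar is a plain dict with `source` and `brand` keys. The names `render_plot_chunk` and `render_trial_chunk` are simplified stand-ins for illustration, not the actual functions in `rag/chunk.py`.

```python
# Sources whose sidecars share the gh_plot_reports shape.
PLOT_REPORT_SOURCES = ("gh_plot_reports", "lg_plot_reports", "agrigold_plot_reports")


def render_plot_chunk(sidecar: dict) -> str:
    """Hypothetical stand-in for the shared renderer: returns only the publisher line."""
    # Fall back to "Golden Harvest" when the sidecar carries no brand.
    publisher = sidecar.get("brand") or "Golden Harvest"
    return f"{publisher} plot report (cross-vendor)"


def render_trial_chunk(sidecar: dict) -> str:
    """Route a trial sidecar to a renderer based on its source."""
    source = sidecar.get("source")
    if source in PLOT_REPORT_SOURCES:
        return render_plot_chunk(sidecar)
    raise ValueError(f"no renderer for source {source!r}")
```

Keeping the source list in one tuple means a further plot-report source with the same sidecar shape would only need one new entry.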
**LG specifics:**
- POST `/performance/{crop}/GetPlots/` returns sparse listing (id, year, lat/lng)
- GET `/performance/{crop}/GetPlotData/?PlotId=X&IsSilage=Y` returns full detail
with state, cooperator, planting/harvest dates, and **top-5 hybrids** (LG +
competitors). Top-5 is what LG publishes publicly; not the full ranking.
- 4 crops: corn (963), soybeans (287), sorghum (10), silage (50). Alfalfa is
absent because LG doesn't run alfalfa plots; that's variety-only data.
- 301 gotcha: www.lgseeds.com redirects to lgseeds.com which drops POST body,
so the scraper hits the apex host directly.
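
The two LG endpoints and the apex-host workaround can be sketched as URL helpers. This is an illustration only: the `https` scheme, the helper names, and the lowercase `true`/`false` encoding of `IsSilage` are assumptions, not details confirmed by the scraper.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

LG_APEX = "https://lgseeds.com"  # apex host; scheme is an assumption


def to_apex(url: str) -> str:
    """Strip a leading 'www.' so a POST is never sent through the 301 redirect."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def plots_listing_url(crop: str) -> str:
    """URL of the sparse listing endpoint (requested with POST)."""
    return f"{LG_APEX}/performance/{crop}/GetPlots/"


def plot_detail_url(crop: str, plot_id: int, is_silage: bool) -> str:
    """URL of the per-plot detail endpoint (requested with GET)."""
    # Assumption: the site accepts lowercase true/false for IsSilage.
    query = urlencode({"PlotId": plot_id, "IsSilage": str(is_silage).lower()})
    return f"{LG_APEX}/performance/{crop}/GetPlotData/?{query}"
```

Building every URL from the apex constant avoids the redirect entirely, so the POST body is not at risk of being dropped.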
**AgriGold specifics:**
- Listing: GET `/{crop}/performance/{crop}-yield-results?harvestYear={year}`
(server-rendered HTML, ~1MB; 408 corn plots in 2025 alone)
- Detail: GET `/{crop_url}/performance/{slug}/{plot_id}` returns the **full
ranking** (not just top-5) plus rich plot management metadata: tillage,
previous crop, fungicide, herbicide, insecticide, irrigation, soil type,
row width, population. Most metadata-rich of the three trial sources.
- Soybean URL slug is singular: `/soybeans/performance/soybean-yield-results/`
- Columns: Rank | Brand | Product | Trait | Ck | H2O (moisture) | Test Wt. |
Yield | Adj Yield (check-adjusted)
- 2 crops: corn (849) + soybeans (157)
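
A rough sketch of the AgriGold listing URL (including the singular soybean slug) and of mapping one result row onto the column order listed above. The host form, the key names in `COLUMNS`, and both helper names are hypothetical; the real scraper may structure this differently.

```python
AGRIGOLD_BASE = "https://agrigold.com"  # assumed scheme and host form

# The results slug uses a singular crop name for soybeans.
RESULTS_SLUG = {"corn": "corn", "soybeans": "soybean"}

# Hypothetical key names, in the same order as the published columns.
COLUMNS = ("rank", "brand", "product", "trait", "ck",
           "moisture", "test_wt", "yield", "adj_yield")


def listing_url(crop: str, harvest_year: int) -> str:
    """Build the server-rendered listing URL for one crop and harvest year."""
    slug = RESULTS_SLUG[crop]
    return f"{AGRIGOLD_BASE}/{crop}/performance/{slug}-yield-results?harvestYear={harvest_year}"


def row_to_result(cells: list[str]) -> dict:
    """Map the cell texts of one table row onto the column names."""
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    return dict(zip(COLUMNS, (cell.strip() for cell in cells)))
```

Rejecting rows with an unexpected cell count is a deliberate choice here: a layout change on the site then fails loudly instead of silently shifting values into the wrong columns.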
**Indexer needs no changes** — `rag/index.py` auto-discovers any directory
under corpus/ and routes by data_type. Both new sources flow into the
existing trial collection and surface via `search_trials`.
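
To illustrate why no indexer change is needed, here is a minimal sketch of directory auto-discovery with routing by `data_type`. It assumes sidecars are `*.json` files sitting directly inside each source directory; `discover_sources` is a hypothetical name, and `rag/index.py` may work differently.

```python
import json
from collections import defaultdict
from pathlib import Path


def discover_sources(corpus_root: Path) -> dict[str, list[str]]:
    """Group source directories under corpus_root by the data_type in their sidecars."""
    by_type: dict[str, list[str]] = defaultdict(list)
    for source_dir in sorted(p for p in corpus_root.iterdir() if p.is_dir()):
        # Assumption: one sidecar is representative of the whole directory.
        first = next(iter(sorted(source_dir.glob("*.json"))), None)
        if first is None:
            continue  # empty directory, nothing to index
        data_type = json.loads(first.read_text(encoding="utf-8")).get("data_type")
        if data_type:
            by_type[data_type].append(source_dir.name)
    return dict(by_type)
```

Under this scheme, dropping `lg_plot_reports/` and `agrigold_plot_reports/` into the corpus directory is enough for both to land in the `trial` group.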
Years scraped: 2024+2025 (matching gh_plot_reports baseline). 2023 is
available via `--include-2023` on either scraper for future backfill.
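
One plausible way to wire up the `--include-2023` flag with `argparse` is sketched below. Only the flag name comes from the text above; `BASELINE_YEARS` and `years_to_scrape` are hypothetical names.

```python
import argparse

BASELINE_YEARS = (2024, 2025)  # matches the gh_plot_reports baseline


def years_to_scrape(argv: list[str]) -> list[int]:
    """Return the harvest years to scrape for the given command-line arguments."""
    parser = argparse.ArgumentParser(description="plot-report scraper (sketch)")
    parser.add_argument("--include-2023", action="store_true",
                        help="also backfill the 2023 harvest year")
    args = parser.parse_args(argv)
    years = list(BASELINE_YEARS)
    if args.include_2023:  # argparse turns the dash into an underscore
        years.insert(0, 2023)
    return years
```

Called with an empty argument list, the helper returns only the baseline years, so a later backfill run is an explicit opt-in.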
+23
-4
@@ -331,17 +331,31 @@ def chunks_from_variety(
 
 
 def _render_gh_plot_chunk(sidecar: dict) -> str:
-    """Render a Golden Harvest plot report (per-site cross-vendor)."""
+    """Render a cross-vendor plot report (per-site head-to-head).
+
+    Originally GH-specific; now also handles ``lg_plot_reports`` and
+    ``agrigold_plot_reports`` since they emit the same sidecar shape.
+    The preamble's "Source:" line uses the actual brand from the
+    sidecar so the LLM sees who PUBLISHED the trial (Bayer-side
+    queries should still find DEKALB results inside a GH or AgriGold
+    plot — search filters target ``brand_in_results``, not the
+    publisher's brand).
+    """
     lines: list[str] = []
     crop = (sidecar.get("crop") or "").lower()
-    crop_label = {"corn": "Corn", "soybeans": "Soybean", "silage": "Silage"}.get(crop, crop.title())
+    crop_label = {
+        "corn": "Corn", "soybeans": "Soybean", "silage": "Silage",
+        "sorghum": "Sorghum",
+    }.get(crop, crop.title())
     state = sidecar.get("state") or sidecar.get("state_abbrev") or ""
     year = sidecar.get("year") or ""
     cooperator = sidecar.get("cooperator") or ""
 
     lines.append(f"# {crop_label} yield trial — {state}, {year}")
     lines.append("")
-    facts = ["Golden Harvest plot report (cross-vendor)"]
+    # Publisher label — emphasizes the source brand for retrieval.
+    publisher_brand = sidecar.get("brand") or "Golden Harvest"
+    facts = [f"{publisher_brand} plot report (cross-vendor)"]
     if cooperator:
         facts.append(f"cooperator {cooperator}")
     if sidecar.get("planted_date"):
@@ -488,7 +502,12 @@ def _render_trial_chunk(sidecar: dict, md_text: str | None = None) -> str:
     verbatim trial body for sources whose value lives in the body text
     (currently agripro_trials)."""
     source = sidecar.get("source")
-    if source == "gh_plot_reports":
+    # Cross-vendor plot-report sources all share the gh_plot_reports
+    # sidecar shape (results: [{rank,brand,product,traits,metrics}]),
+    # so they route through the same renderer. The renderer reads
+    # ``brand`` from the sidecar so the publisher label is correct
+    # for each (Golden Harvest / LG Seeds / AgriGold).
+    if source in ("gh_plot_reports", "lg_plot_reports", "agrigold_plot_reports"):
         return _render_gh_plot_chunk(sidecar)
     if source == "agripro_trials":
         header = _render_agripro_trial_chunk(sidecar)