Trial-data scrapers: gh_plot_reports + agripro_trials + search_trials tool

This PR introduces TRIAL data — yield-performance results from real
field trials — as a SEPARATE data type alongside variety identity.
The two are complementary:

  search_docs  → "What's the disease resistance of DKC62-08RIB?"
                  (variety identity — what it IS)
  search_trials → "Which corn hybrid won the IA 2024 trials?"
                  (performance data — how it PERFORMED)

scrape/sources/gh_plot_reports.py — Golden Harvest plot reports
- 4,618 reports expected (2024 + 2025; 2023 deferred to a backfill pass).
- URL: /<crop>/plot-report/<state>/<year>/<plot_id>
- Cross-vendor: each plot lists products from multiple brands
  (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side
  by side at one cooperator's field — the kind of independent
  comparison data Bayer doesn't publish itself.
- Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for
  corn+soy, Ton/Acre + Milk + Beef columns for silage).
- Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow.
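
A minimal sketch of the politeness rules above, not the code in
gh_plot_reports.py. PoliteFetcher, plot_report_url and the injected
send callable are hypothetical names. send is assumed to return
(status, body) and to be configured by the caller so that it does not
follow redirects. The retry cap and the backoff schedule are also
assumptions.

```python
import time

def plot_report_url(base, crop, state, year, plot_id):
    """Build a URL of the shape /<crop>/plot-report/<state>/<year>/<plot_id>."""
    return f"{base.rstrip('/')}/{crop}/plot-report/{state}/{year}/{plot_id}"

class PoliteFetcher:
    """Spaces requests at least min_interval seconds apart and retries 429/5xx."""

    def __init__(self, send, min_interval=1.0, max_retries=3,
                 sleep=time.sleep, clock=time.monotonic):
        self._send = send              # hypothetical: callable(url) -> (status, body)
        self._min_interval = min_interval
        self._max_retries = max_retries
        self._sleep = sleep
        self._clock = clock
        self._last = None              # time of the previous request, if any

    def _wait_for_slot(self):
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

    def get(self, url):
        """Return the body on 200, or None on redirects, 4xx, or exhausted retries."""
        for attempt in range(self._max_retries + 1):
            self._wait_for_slot()
            status, body = self._send(url)
            if status == 200:
                return body
            retryable = status == 429 or 500 <= status < 600
            if retryable and attempt < self._max_retries:
                self._sleep(2 ** attempt)   # assumed backoff: 1s, 2s, 4s
                continue
            return None                     # 3xx is reported, never followed
        return None
```

Injecting send, sleep and clock keeps the pacing logic testable
without touching the network.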

scrape/sources/agripro_trials.py — AgriPro regional trial PDFs
- 14 unique PDFs (38 sitemap links deduped) at /trials-data
- pdfplumber text extraction, region/year detection from filename
- Verbatim PDF text preserved in chunk body so variety + yield
  number adjacency drives retrieval (AP Iliad's Aberdeen ID yield
  matches a query about "AP Iliad Idaho yield")
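
A rough sketch of the link dedup and filename-based detection
described above. It is illustrative only: the function names, the
REGION_KEYWORDS table and the example filenames are assumptions, and
the real scraper may key dedup on something other than the URL path.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

YEAR_RE = re.compile(r"(?<!\d)(20\d{2})(?!\d)")

# Hypothetical keyword table; the real region list may differ.
REGION_KEYWORDS = {
    "pnw": "PNW",
    "western-plains": "Western Plains",
    "ne-colorado": "NE Colorado",
}

def dedupe_pdf_links(links):
    """Keep the first link per PDF path, ignoring query strings and case."""
    seen, unique = set(), []
    for link in links:
        key = urlsplit(link).path.lower()
        if key.endswith(".pdf") and key not in seen:
            seen.add(key)
            unique.append(link)
    return unique

def detect_region_year(url):
    """Return (region, year) guessed from the PDF filename; either may be None."""
    stem = PurePosixPath(urlsplit(url).path).stem.lower()
    normalized = re.sub(r"[_\s]+", "-", stem)
    match = YEAR_RE.search(stem)
    year = int(match.group(1)) if match else None
    region = next(
        (label for key, label in REGION_KEYWORDS.items() if key in normalized),
        None,
    )
    return region, year
```

Returning None for an undetected field lets the caller decide whether
to skip the PDF or index it without that filter value.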

rag/chunk.py — chunks_from_trial() dispatching by source
- Plot reports: identity preamble + Top-5 by primary metric + full
  ranking table. Metric labels chosen from the data (corn/soy use
  "Yield", silage uses "Ton/Acre").
- AgriPro PDFs: identity preamble + verbatim trial body inline so
  per-location yields surface for region+variety queries.
- Variety chunks get data_type="variety" metadata; trial chunks get
  data_type="trial". Single Chroma collection; the tool router
  filters by data_type rather than maintaining two collections.
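
The plot-report chunk layout above could look roughly like this. This
is a sketch, not the body of chunks_from_trial(): the report dict
shape, the function names and the line formatting are assumptions.
Only the "Yield" / "Ton/Acre" labels and the data_type="trial"
metadata come from this PR.

```python
PRIMARY_METRIC_CANDIDATES = ("Yield", "Ton/Acre")

def pick_primary_metric(entries):
    """Choose the metric label from the data: first candidate any entry carries."""
    for label in PRIMARY_METRIC_CANDIDATES:
        if any(label in entry["metrics"] for entry in entries):
            return label
    return None

def chunk_from_plot_report(report, top_n=5):
    """Build one chunk: identity preamble, Top-N list, then the full ranking."""
    entries = report["entries"]
    metric = pick_primary_metric(entries)
    ranked = sorted(
        (e for e in entries if metric in e["metrics"]),
        key=lambda e: e["metrics"][metric],
        reverse=True,
    )
    preamble = (
        f"{report['year']} {report['state']} {report['crop']} "
        f"plot report {report['plot_id']}"
    )
    lines = [preamble, "", f"Top {min(top_n, len(ranked))} by {metric}:"]
    for rank, e in enumerate(ranked[:top_n], start=1):
        lines.append(f"{rank}. {e['brand']} {e['product']}: {e['metrics'][metric]}")
    lines += ["", "Full ranking:"]
    for rank, e in enumerate(ranked, start=1):
        columns = ", ".join(f"{k}={v}" for k, v in e["metrics"].items())
        lines.append(f"{rank}. {e['brand']} {e['product']} ({columns})")
    metadata = {
        "data_type": "trial",   # variety chunks would carry "variety" instead
        "crop": report["crop"],
        "state": report["state"],
        "year": report["year"],
    }
    return {"text": "\n".join(lines), "metadata": metadata}
```

Repeating the leaders in a short Top-N block ahead of the full table
keeps the winning products near the top of the chunk text.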

rag/index.py — dispatch by sidecar's data_type field
rag/bm25.py — new filter columns (data_type, year, state)

docs_mcp/server.py — sixth MCP tool: search_trials(crop?, state?,
year?, product?, k=10)
- Filters trial chunks via where={"data_type": "trial", ...}
- Optional product substring post-filter for "DKC62-08RIB Iowa 2024"
  style searches
- search_docs now defaults to data_type="variety" so trial chunks
  don't bleed into variety identity queries
- Tool docstring routes the agent: "use lookup_variety to verify
  identity details on any trial winner you surface"
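
One way the filter and post-filter could be assembled, as a sketch
under stated assumptions. build_trial_where and filter_by_product are
hypothetical helpers, the crop metadata key is assumed, and hits are
assumed to be dicts with a "text" field. Chroma's where syntax, as I
understand it, expects multiple field conditions to be combined under
"$and", so the shorthand in the bullet above is expanded here.

```python
def build_trial_where(crop=None, state=None, year=None):
    """Build a metadata filter that always pins data_type to "trial"."""
    clauses = [{"data_type": "trial"}]
    if crop:
        clauses.append({"crop": crop})
    if state:
        clauses.append({"state": state})
    if year is not None:
        clauses.append({"year": year})
    # A single condition is passed bare; several are wrapped in "$and".
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def filter_by_product(hits, product=None):
    """Optional case-insensitive substring post-filter on chunk text."""
    if not product:
        return hits
    needle = product.casefold()
    return [hit for hit in hits if needle in hit["text"].casefold()]
```

Usage would be along the lines of passing
build_trial_where(state="IA", year=2024) as the where argument of the
collection query, then running filter_by_product over the returned
chunks before truncating to k.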

NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented
as deferred — the ASMX-SOAP shape returned empty XML on initial
probe. Bayer per-variety yield data is not publicly indexed at all
— documented in the trial-scope note (DEKALB/Asgrow trial data flows
through Channel reps, not the web). AgRevival research books exist
as 10 large annual PDFs but are deferred (low ROI per parse).

Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618
Golden Harvest plot reports are still being scraped in the background
and will be added in a follow-up corpus-snapshot PR (~70 min ETA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:19:03 -04:00
parent 7b3da908e0
commit c737871c4c
35 changed files with 3302 additions and 25 deletions
+85 -18
@@ -5,8 +5,16 @@
{
"name": "bayer_seeds",
"vendor": "Bayer",
-"brands": ["DEKALB", "Asgrow", "WestBred"],
-"crops": ["corn", "soybeans", "wheat"],
+"brands": [
+"DEKALB",
+"Asgrow",
+"WestBred"
+],
+"crops": [
+"corn",
+"soybeans",
+"wheat"
+],
"verdict": "green",
"expected_count": 475,
"base_url": "https://cropscience.bayer.us",
@@ -17,65 +25,124 @@
{
"name": "golden_harvest",
"vendor": "Syngenta",
-"brands": ["Golden Harvest"],
-"crops": ["corn", "soybeans"],
+"brands": [
+"Golden Harvest"
+],
+"crops": [
+"corn",
+"soybeans"
+],
"verdict": "green",
"expected_count": 175,
"base_url": "https://www.goldenharvestseeds.com",
"scope_filter": "All sitemap-listed corn + soybean varieties.",
"tos_check_date": "2026-05-25",
-"schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) — resolve live URL from product HTML, not sitemap entry."
+"schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) \u2014 resolve live URL from product HTML, not sitemap entry."
},
{
"name": "nk",
"vendor": "Syngenta",
-"brands": ["NK"],
-"crops": ["corn", "soybeans"],
+"brands": [
+"NK"
+],
+"crops": [
+"corn",
+"soybeans"
+],
"verdict": "green",
"expected_count": 29,
"base_url": "https://www.syngenta-us.com",
"pdf_cdn": "https://assets.syngentaebiz.com/pdf/techsheets/",
"scope_filter": "All NK corn + soy varieties. No wheat (NK doesn't sell wheat in US).",
"tos_check_date": "2026-05-24",
-"schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only — need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
+"schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only \u2014 need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
},
{
"name": "agripro",
"vendor": "Syngenta",
-"brands": ["AgriPro"],
-"crops": ["wheat", "barley"],
+"brands": [
+"AgriPro"
+],
+"crops": [
+"wheat",
+"barley"
+],
"verdict": "green",
"expected_count": 24,
"base_url": "https://www.agriprowheat.com",
-"scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
+"scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW \u2014 Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
"tos_check_date": "2026-05-24",
"schema_notes": "Drupal Views form; server-rendered HTML. CoAXium trait flag is implicit in product family; Clearfield/CL2 trait IS in this catalog."
},
{
"name": "becks_pfr",
"vendor": "Beck's Hybrids",
-"brands": ["Beck's PFR"],
-"crops": ["corn", "soybeans", "wheat"],
+"brands": [
+"Beck's PFR"
+],
+"crops": [
+"corn",
+"soybeans",
+"wheat"
+],
"verdict": "yellow",
"expected_count": 2089,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
-"scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials — fungicide timing, planting-date studies, hybrid-by-population, etc.",
+"scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials \u2014 fungicide timing, planting-date studies, hybrid-by-population, etc.",
"tos_check_date": "2026-05-24",
-"schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records — the chunk_0 includes the study's tl;dr finding."
+"schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records \u2014 the chunk_0 includes the study's tl;dr finding."
},
{
"name": "becks_products",
"vendor": "Beck's Hybrids",
-"brands": ["Beck's"],
-"crops": ["corn", "soybeans", "wheat"],
+"brands": [
+"Beck's"
+],
+"crops": [
+"corn",
+"soybeans",
+"wheat"
+],
"verdict": "yellow",
"expected_count": 860,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
-"scope_filter": "All Beck's product records — corn + soy + wheat. Identity + RM/MG only.",
+"scope_filter": "All Beck's product records \u2014 corn + soy + wheat. Identity + RM/MG only.",
"tos_check_date": "2026-05-24",
"schema_notes": "Sanity GROQ exposes identity (name, RM/MG, basic traits) but agronomic + disease ratings are SeedIQ-gated (requires browser cookie). Deferred until the SeedIQ XHR endpoint is captured from a logged-in browser session. Without ratings, products are reference-only; the MCP can confirm 'Beck's has hybrid X at RM 112 with Enlist trait' but not 'rate it against drought'."
+},
+{
+"name": "gh_plot_reports",
+"vendor": "Syngenta",
+"brand_aggregator": "Golden Harvest publishes",
+"crops": [
+"corn",
+"soybeans",
+"silage"
+],
+"verdict": "green",
+"expected_count": 4618,
+"base_url": "https://www.goldenharvestseeds.com",
+"scope_filter": "sitemap-listed plot reports 2024 and 2025 (4,618 reports). 2023 (3,619 reports) deferred to a future pass \u2014 most recent data is most relevant for current decisions.",
+"tos_check_date": "2026-05-25",
+"schema_notes": "Cross-vendor head-to-head yield trials at specific state/year/site. Each report lists products from multiple brands (NK, DEKALB, GH, etc.) with rank, yield, %MST, test weight, gross revenue. URL: /<crop>/plot-report/<state>/<year>/<id>. Same site/auth as golden_harvest variety scraper.",
+"data_type": "trial"
+},
+{
+"name": "agripro_trials",
+"vendor": "Syngenta",
+"brand_aggregator": "AgriPro publishes",
+"crops": [
+"wheat"
+],
+"verdict": "green",
+"expected_count": 38,
+"base_url": "https://agriprowheat.com",
+"scope_filter": "PDF trial summaries linked from /trials-data. Regional wheat performance (PNW, Western Plains, NE Colorado, etc.).",
+"tos_check_date": "2026-05-25",
+"schema_notes": "PDF tables of varieties tested per region per year. pdfplumber for table extraction.",
+"data_type": "trial"
}
],
"_excluded_sources": [