Trial-data scrapers + search_trials MCP tool (cross-vendor yield trials) #7
Summary
First trial-data PR. Introduces yield-trial scrapers and a new search_trials MCP tool, with trial data as a SEPARATE data type alongside variety identity.

Why separate: variety chunks answer "what is DKC62-08RIB?", while trial chunks answer "which corn hybrid won the IA 2024 trials?" Mixing them would muddy the embedding signal. Both live in a single Chroma collection with a data_type metadata filter: clean ops, two clean entry points.

New scrapers
- gh_plot_reports: Golden Harvest plot reports, ~4,618 cross-vendor head-to-head yield trials (2024+2025). Each plot lists products from multiple brands at one cooperator's field, the kind of independent comparison data Bayer doesn't publish itself. URL: /<crop>/plot-report/<state>/<year>/<plot_id>. A generic per-column metrics dict lets silage's Ton/Acre, Milk Per Acre, and Beef Per Ton columns work alongside corn/soy's Yield, %MST, and Test Weight.
- agripro_trials: 14 unique PDFs from /trials-data (38 sitemap links deduped). Verbatim PDF text is preserved in the chunk body so variety + yield number adjacency drives retrieval ("AP Iliad Aberdeen ID yield 116.3" matches a query about "AP Iliad Idaho").

Architecture changes
- rag/chunk.py: new chunks_from_trial() dispatching by source. Variety chunks get data_type="variety", trial chunks get data_type="trial".
- rag/index.py: dispatches by the sidecar's data_type field.
- rag/bm25.py: new filter columns data_type, year, state.
- docs_mcp/server.py: sixth MCP tool search_trials(crop?, state?, year?, product?, k=10). search_docs now defaults to data_type="variety" so trial chunks don't bleed into identity queries.

Documented as deferred / unavailable
- NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult): ASMX/SOAP, initial probe returned empty XML. Reverse-engineering needs a dedicated session.

What's in this PR's corpus
- 14 AgriPro trial PDFs (from /trials-data).

Following this PR
Test plan
This PR introduces TRIAL data (yield-performance results from real field trials) as a SEPARATE data type alongside variety identity. The two are complementary:

search_docs → "What's the disease resistance of DKC62-08RIB?" (variety identity: what it IS)
search_trials → "Which corn hybrid won the IA 2024 trials?" (performance data: how it PERFORMED)

scrape/sources/gh_plot_reports.py: Golden Harvest plot reports
- 4,618 expected (2024+2025; 2023 deferred to a backfill pass).
- URL: /<crop>/plot-report/<state>/<year>/<plot_id>
- Cross-vendor: each plot lists products from multiple brands (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side by side at one cooperator's field, the kind of independent comparison data Bayer doesn't publish itself.
- Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for corn+soy, Ton/Acre + Milk + Beef columns for silage).
- Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow.

scrape/sources/agripro_trials.py: AgriPro regional trial PDFs
- 14 unique PDFs (38 sitemap links deduped) at /trials-data
- pdfplumber text extraction, region/year detection from filename
- Verbatim PDF text preserved in chunk body so variety + yield number adjacency drives retrieval (AP Iliad's Aberdeen ID yield matches a query about "AP Iliad Idaho yield")

rag/chunk.py: chunks_from_trial() dispatching by source
- Plot reports: identity preamble + Top-5 by primary metric + full ranking table. Metric labels chosen from the data (corn/soy use "Yield", silage uses "Ton/Acre").
- AgriPro PDFs: identity preamble + verbatim trial body inline so per-location yields surface for region+variety queries.
- Variety chunks get data_type="variety" metadata; trial chunks get data_type="trial". Single Chroma collection; the tool router filters by data_type rather than maintaining two collections.
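
To make the plot-report chunk layout concrete, here is a minimal sketch of the "identity preamble + Top-5 + full ranking" shape described above. It is not the code in rag/chunk.py: PlotEntry, Chunk, primary_metric, and chunk_plot_report are hypothetical names, the crop metadata key is an assumption, and the brands and numbers are made-up placeholders rather than scraped values.

```python
from dataclasses import dataclass, field


@dataclass
class PlotEntry:  # hypothetical record for one product row in a plot report
    brand: str
    product: str
    metrics: dict  # column label -> numeric value, e.g. {"Yield": 1.0}


@dataclass
class Chunk:  # hypothetical chunk container: text plus filterable metadata
    text: str
    metadata: dict = field(default_factory=dict)


def primary_metric(entries):
    """Choose the ranking column from the data: "Yield" if present, else "Ton/Acre"."""
    labels = {label for e in entries for label in e.metrics}
    for candidate in ("Yield", "Ton/Acre"):
        if candidate in labels:
            return candidate
    raise ValueError("no known primary metric in plot columns")


def chunk_plot_report(crop, state, year, plot_id, entries):
    metric = primary_metric(entries)
    # Highest value first; entries missing the primary metric sort last.
    ranked = sorted(
        entries,
        key=lambda e: e.metrics.get(metric, float("-inf")),
        reverse=True,
    )

    def row(rank, e):
        cols = ", ".join(f"{k}={v}" for k, v in e.metrics.items())
        return f"{rank}. {e.brand} {e.product}: {cols}"

    lines = [f"{year} {state} {crop} plot report {plot_id}, ranked by {metric}"]
    lines.append(f"Top 5 by {metric}:")
    lines += [row(i, e) for i, e in enumerate(ranked[:5], start=1)]
    lines.append("Full ranking:")
    lines += [row(i, e) for i, e in enumerate(ranked, start=1)]
    # data_type="trial" is what lets one collection serve both tools.
    metadata = {"data_type": "trial", "crop": crop, "state": state, "year": year}
    return Chunk(text="\n".join(lines), metadata=metadata)


# Placeholder silage rows (values invented for illustration only).
entries = [
    PlotEntry("BrandA", "A-100", {"Ton/Acre": 24.1, "Milk Per Acre": 30500}),
    PlotEntry("BrandB", "B-200", {"Ton/Acre": 26.3, "Milk Per Acre": 31200}),
]
chunk = chunk_plot_report("silage", "WI", 2024, "demo-plot", entries)
print(chunk.text)
```

Because the metrics dict is keyed by column label, the same function handles corn/soy and silage rows without crop-specific branches; only the choice of ranking column depends on the data.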

rag/index.py: dispatch by sidecar's data_type field

rag/bm25.py: new filter columns (data_type, year, state)

docs_mcp/server.py: sixth MCP tool, search_trials(crop?, state?, year?, product?, k=10)
- Filters trial chunks via where={"data_type": "trial", ...}
- Optional product substring post-filter for "DKC62-08RIB Iowa 2024" style searches
- search_docs now defaults to data_type="variety" so trial chunks don't bleed into variety identity queries
- Tool docstring routes the agent: "use lookup_variety to verify identity details on any trial winner you surface"

NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented as deferred: the ASMX-SOAP shape returned empty XML on initial probe. Bayer per-variety yield data is not publicly indexed at all; this is documented in the trial-scope note (DEKALB/Asgrow trial data flows through Channel reps, not the web). AgRevival research books exist as 10 large annual PDFs but are deferred (low ROI per parse).

Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618 Golden Harvest plot reports are scraping in the background and will be added in a follow-up corpus-snapshot PR (~70 min ETA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
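
The search_trials filtering described above (metadata filter first, optional product substring afterwards) could look roughly like the sketch below. It is an illustration, not the code in docs_mcp/server.py: build_trial_where and product_post_filter are hypothetical helpers, the crop metadata key and the hit shape (a dict with a "text" field) are assumptions, and wrapping multiple conditions in "$and" reflects my understanding of how Chroma expects multi-field where filters, which may vary by version.

```python
def build_trial_where(crop=None, state=None, year=None):
    """Build a metadata filter that always pins data_type to "trial"."""
    conditions = [{"data_type": "trial"}]
    for key, value in (("crop", crop), ("state", state), ("year", year)):
        if value is not None:
            conditions.append({key: value})
    # A single condition is passed as-is; several are combined under "$and"
    # (assumption about the vector store's filter syntax).
    return conditions[0] if len(conditions) == 1 else {"$and": conditions}


def product_post_filter(hits, product=None, k=10):
    """Keep hits whose text contains the product name, case-insensitively."""
    if product:
        needle = product.lower()
        hits = [h for h in hits if needle in h["text"].lower()]
    return hits[:k]


where = build_trial_where(state="IA", year=2024)

# Placeholder hits standing in for vector-store results.
hits = [
    {"text": "2024 IA corn plot report: DKC62-08RIB listed"},
    {"text": "2024 IA corn plot report: another hybrid listed"},
]
matches = product_post_filter(hits, product="dkc62-08rib")
print(where)
print(len(matches))
```

Doing the product match as a post-filter keeps the metadata filter limited to exact-match fields, at the cost of possibly returning fewer than k results when few retrieved chunks mention the product.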