Trial-data scrapers + search_trials MCP tool (cross-vendor yield trials) #7

Merged
justin merged 1 commit from trial-data-scrapers into main 2026-05-25 15:19:43 -04:00

Summary

First trial-data PR. Introduces yield-trial scrapers + a new search_trials MCP tool as a SEPARATE data type alongside variety identity.

Why separate: variety chunks answer "what is DKC62-08RIB?"; trial chunks answer "which corn hybrid won the IA 2024 trials?" Mixing them would muddy the embedding signal. Both data types live in a single Chroma collection and are separated by a data_type metadata filter, which keeps operations simple (one collection to index and back up) while giving each data type its own entry point.
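A minimal sketch of how one collection can serve both entry points through a metadata filter. The helper name `build_where` and the `crop` metadata key are assumptions for illustration; the PR itself only states that trial chunks are filtered with `where={"data_type": "trial", ...}`. Multiple conditions are combined with Chroma's `$and` operator.

```python
from typing import Any, Optional


def build_where(data_type: str, **filters: Optional[Any]) -> dict:
    """Build a Chroma-style `where` filter for one entry point (hypothetical helper)."""
    clauses = [{"data_type": data_type}]
    # Skip filters the caller left unset so they do not constrain the query.
    clauses += [{key: value} for key, value in filters.items() if value is not None]
    # A single condition is passed as-is; several are combined with $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


# search_docs-style default: identity chunks only.
variety_where = build_where("variety")
# search_trials-style query: trial chunks narrowed by optional filters.
trial_where = build_where("trial", crop="corn", state="IA", year=2024)
```

Either dict could then be passed as the `where` argument of a collection query, so the two tools differ only in the filter they build.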

New scrapers

  • gh_plot_reports — Golden Harvest plot reports, ~4,618 cross-vendor head-to-head yield trials (2024+2025). Each plot lists products from multiple brands at one cooperator's field — the kind of independent comparison data Bayer doesn't publish itself. URL: /<crop>/plot-report/<state>/<year>/<plot_id>. Generic per-column metrics dict so silage's Ton/Acre / Milk Per Acre / Beef Per Ton columns work alongside corn/soy's Yield / %MST / Test Weight.
  • agripro_trials — 14 unique PDFs from /trials-data (38 sitemap links deduped). Verbatim PDF text preserved in chunk body so variety + yield number adjacency drives retrieval ("AP Iliad Aberdeen ID yield 116.3" matches a query about "AP Iliad Idaho").
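The URL pattern and the generic per-column metrics dict described for `gh_plot_reports` can be sketched roughly as follows. This is an illustration under assumptions, not the scraper's actual code: the function names, the example path, and the `%MST` and silage values are made up, and only the 192.9 yield figure comes from the test plan.

```python
import re

# Mirrors the documented shape /<crop>/plot-report/<state>/<year>/<plot_id>.
PLOT_URL = re.compile(
    r"^/(?P<crop>[^/]+)/plot-report/(?P<state>[^/]+)/(?P<year>\d{4})/(?P<plot_id>[^/]+)/?$"
)


def parse_plot_url(path: str) -> dict:
    """Split a plot-report path into its identity fields (hypothetical helper)."""
    match = PLOT_URL.match(path)
    if match is None:
        raise ValueError(f"not a plot-report path: {path!r}")
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    return parts


def row_to_metrics(headers: list[str], cells: list[str]) -> dict[str, float]:
    """Keep whatever numeric columns the table has, keyed by column header."""
    metrics = {}
    for header, cell in zip(headers, cells):
        try:
            metrics[header] = float(cell.replace(",", "").replace("$", ""))
        except ValueError:
            continue  # non-numeric cell (e.g. a product name) is not a metric
    return metrics


identity = parse_plot_url("/corn/plot-report/ia/2024/12345")  # made-up path
corn = row_to_metrics(["Product", "Yield", "%MST"], ["NK1748-3110", "192.9", "15.1"])
silage = row_to_metrics(["Product", "Ton/Acre", "Milk Per Acre"], ["EXAMPLE-1", "24.5", "31,200"])
```

Because the dict is keyed by whatever headers the table carries, the silage columns need no schema change relative to corn and soy.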

Architecture changes

  • rag/chunk.py: new chunks_from_trial() dispatching by source. Variety chunks get data_type="variety", trial chunks get data_type="trial".
  • rag/index.py: dispatches by sidecar's data_type field.
  • rag/bm25.py: new filter columns data_type, year, state.
  • docs_mcp/server.py: sixth MCP tool search_trials(crop?, state?, year?, product?, k=10). search_docs now defaults to data_type="variety" so trial chunks don't bleed into identity queries.
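As a rough illustration of the trial chunker, the sketch below builds a plot-report chunk with an identity preamble, a top-N list by primary metric (the commit message describes a Top-5), and `data_type="trial"` metadata. The function names and the input dict layout are assumptions, the full ranking table is omitted for brevity, and the two entries are the AL 2023 corn results quoted in the test plan.

```python
def primary_metric(entries: list[dict]) -> str:
    """Pick the ranking column from the data: Yield for corn/soy, Ton/Acre for silage."""
    for name in ("Yield", "Ton/Acre"):
        if any(name in entry["metrics"] for entry in entries):
            return name
    raise ValueError("no known primary metric in entries")


def chunk_from_plot(plot: dict, top_n: int = 5) -> dict:
    """Build one trial chunk: preamble + top-N ranking + metadata (hypothetical layout)."""
    metric = primary_metric(plot["entries"])
    ranked = sorted(
        (entry for entry in plot["entries"] if metric in entry["metrics"]),
        key=lambda entry: entry["metrics"][metric],
        reverse=True,
    )
    lines = [f"{plot['year']} {plot['state']} {plot['crop']} plot report"]
    lines.append(f"Top {min(top_n, len(ranked))} by {metric}:")
    for rank, entry in enumerate(ranked[:top_n], start=1):
        # Keep product name and number adjacent so both land in the same text span.
        lines.append(f"{rank}. {entry['brand']} {entry['product']} {metric} {entry['metrics'][metric]}")
    metadata = {
        "data_type": "trial",
        "crop": plot["crop"],
        "state": plot["state"],
        "year": plot["year"],
    }
    return {"text": "\n".join(lines), "metadata": metadata}


plot = {
    "crop": "corn", "state": "AL", "year": 2023,
    "entries": [
        {"brand": "DEKALB", "product": "DKC65-20", "metrics": {"Yield": 191.2}},
        {"brand": "NK", "product": "NK1748-3110", "metrics": {"Yield": 192.9}},
    ],
}
chunk = chunk_from_plot(plot)
```

A variety chunk would carry `data_type="variety"` in the same metadata slot, which is what lets `search_docs` and `search_trials` share one collection.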

Documented as deferred / unavailable

  • NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) — ASMX/SOAP, initial probe returned empty XML. Reverse-engineering needs a dedicated session.
  • Bayer per-variety trial data — not publicly indexed. DEKALB/Asgrow trial data flows through Channel reps. Documented as a known gap.
  • AgRevival research books — 10 large annual PDFs (2016-2025). Quasi-independent, but low ROI per parse, so deferred.

What's in this PR's corpus

  • 14 AgriPro trial PDFs (full coverage of /trials-data).

Following this PR

  • The full Golden Harvest plot-report corpus (~4,618 docs) is being scraped in the background and will be added in a follow-up data-only PR (~70 min ETA).
  • Re-index + smoke tests against the combined corpus also follow.

Test plan

  • [x] gh_plot_reports parses corn (Yield/MST/Test Weight/Gross Revenue), soy (same), silage (Ton/Acre/Milk Per Acre/Milk Per Ton/Beef Per Acre/Beef Per Ton).
  • [x] Cross-vendor brand split correct on AL 2023 corn plot: NK NK1748-3110 (rank 1, 192.9 BU), DEKALB DKC65-20 (rank 2, 191.2 BU).
  • [x] agripro_trials parses all 14 PDFs. AP Iliad's Aberdeen/Craigmont/Twin Falls per-location yields preserved verbatim.
  • [x] Trial chunker output verified on both sources; varieties + numeric yields adjacent in chunk text.
  • [x] Six MCP tools registered: search_docs, get_page, list_versions, lookup_variety, search_trials, crop_seed_api_lessons.
  • [ ] End-to-end search_trials smoke test against full indexed corpus (follow-up PR).
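The cross-vendor brand split checked in the test plan could look something like this. It is a sketch only: the helper name and the longest-prefix approach are assumptions, and the brand list is taken from the commit message (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel).

```python
from typing import Optional

KNOWN_BRANDS = ("NK", "DEKALB", "Golden Harvest", "Enogen", "Pioneer", "Channel")


def split_brand(label: str) -> tuple[Optional[str], str]:
    """Split 'BRAND PRODUCT' into (brand, product); brand is None if unrecognized."""
    text = " ".join(label.split())  # collapse stray whitespace from the HTML table
    # Try longer brand names first so multi-word brands match as a whole.
    for brand in sorted(KNOWN_BRANDS, key=len, reverse=True):
        if text.lower().startswith(brand.lower() + " "):
            return brand, text[len(brand) + 1:]
    return None, text


first = split_brand("NK NK1748-3110")
second = split_brand("DEKALB DKC65-20")
```

Requiring a space after the brand keeps a bare product code such as `NK1748-3110` from being mistaken for the NK brand followed by a product.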
justin added 1 commit 2026-05-25 15:19:33 -04:00
This PR introduces TRIAL data — yield-performance results from real
field trials — as a SEPARATE data type alongside variety identity.
The two are complementary:

  search_docs  → "What's the disease resistance of DKC62-08RIB?"
                  (variety identity — what it IS)
  search_trials → "Which corn hybrid won the IA 2024 trials?"
                  (performance data — how it PERFORMED)

scrape/sources/gh_plot_reports.py — Golden Harvest plot reports
- 4,618 expected (2024+2025; 2023 deferred to a backfill pass).
- URL: /<crop>/plot-report/<state>/<year>/<plot_id>
- Cross-vendor: each plot lists products from multiple brands
  (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side
  by side at one cooperator's field — the kind of independent
  comparison data Bayer doesn't publish itself.
- Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for
  corn+soy, Ton/Acre + Milk + Beef columns for silage).
- Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow.

scrape/sources/agripro_trials.py — AgriPro regional trial PDFs
- 14 unique PDFs (38 sitemap links deduped) at /trials-data
- pdfplumber text extraction, region/year detection from filename
- Verbatim PDF text preserved in chunk body so variety + yield
  number adjacency drives retrieval (AP Iliad's Aberdeen ID yield
  matches a query about "AP Iliad Idaho yield")

rag/chunk.py — chunks_from_trial() dispatching by source
- Plot reports: identity preamble + Top-5 by primary metric + full
  ranking table. Metric labels chosen from the data (corn/soy use
  "Yield", silage uses "Ton/Acre").
- AgriPro PDFs: identity preamble + verbatim trial body inline so
  per-location yields surface for region+variety queries.
- Variety chunks get data_type="variety" metadata; trial chunks get
  data_type="trial". Single Chroma collection; the tool router
  filters by data_type rather than maintaining two collections.

rag/index.py — dispatch by sidecar's data_type field
rag/bm25.py — new filter columns (data_type, year, state)

docs_mcp/server.py — sixth MCP tool: search_trials(crop?, state?,
year?, product?, k=10)
- Filters trial chunks via where={"data_type": "trial", ...}
- Optional product substring post-filter for "DKC62-08RIB Iowa 2024"
  style searches
- search_docs now defaults to data_type="variety" so trial chunks
  don't bleed into variety identity queries
- Tool docstring routes the agent: "use lookup_variety to verify
  identity details on any trial winner you surface"

NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented
as deferred — the ASMX-SOAP shape returned empty XML on initial
probe. Bayer per-variety yield data is not publicly indexed at all
— documented in the trial-scope note (DEKALB/Asgrow trial data flows
through Channel reps, not the web). AgRevival research books exist
as 10 large annual PDFs but are deferred (low ROI per parse).

Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618
Golden Harvest plot reports are scraping in background and will be
added in a follow-up corpus-snapshot PR (~70 min ETA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 17260c32c8 into main 2026-05-25 15:19:43 -04:00
Reference: justin/seed-mcp#7