Independent third-party performance data — land-grant programs that test every
entered brand side-by-side with replication + LSD stats. This is the legitimate
way to get Pioneer / DEKALB / Brevant / Channel performance data that the
corpus can't scrape directly (data_type=trial, results[] shape; passes through
the trial chunker).
- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
(openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
(viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
(pdfplumber; per-site column groups split by header Yield-token count +
x-coord footnote bucketing), corn + soy per site, 2024+2025.
91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)
- rag/chunk.py: added a gated `include_region` flag to _render_gh_plot_chunk; the
3 university sources route through it so the region/district is in the embedded
chunk and the chunk is labeled "variety trial (cross-vendor, independent
third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
indirectly via these trials).
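A minimal sketch of the `include_region` gating described above. The function
name `render_plot_chunk`, the record keys, the bu/ac unit and the output layout
are assumptions for illustration; the real _render_gh_plot_chunk in rag/chunk.py
may look different.

```python
def render_plot_chunk(doc: dict, include_region: bool = False) -> str:
    """Render one trial doc as text for embedding (hypothetical layout)."""
    lines = [f"{doc['title']} ({doc['crop']}, {doc['year']})"]
    if include_region:
        # Only the university sources pass include_region=True, so the
        # region/district and the provenance label land in the chunk text.
        lines.append(f"Region: {doc.get('region', 'unknown')}")
        lines.append("variety trial (cross-vendor, independent third-party)")
    for row in doc["results"]:
        lines.append(f"{row['brand']} {row['product']}: {row['yield']} bu/ac")
    return "\n".join(lines)


# Illustrative record, not real trial data.
doc = {
    "title": "Example trial", "crop": "corn", "year": 2025, "region": "North",
    "results": [{"brand": "BrandA", "product": "A100", "yield": 231.4}],
}
legacy = render_plot_chunk(doc)                        # existing plot sources
university = render_plot_chunk(doc, include_region=True)
```

Defaulting the flag to False is what would keep the existing plot sources'
chunk text byte-identical.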
Validation: all 123 docs chunk via rag.chunk.chunks_from_trial (0 errors), 0
out-of-range yields, 0 duplicate keys. Public land-grant data; attribution
recorded in each tos_note. CI rebuilds the index from the committed corpus.
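The range and duplicate-key checks can be sketched roughly as below. The row
keys (site, brand, product, yield), the key tuple, and the 0-400 bounds are
assumptions, not the values used by the actual validation run.

```python
from collections import Counter


def validate_rows(rows, lo=0.0, hi=400.0):
    """Return (out_of_range_rows, duplicate_keys) for a list of trial rows."""
    # A yield counts as in range when lo < yield <= hi (assumed bounds).
    out_of_range = [r for r in rows if not (lo < r["yield"] <= hi)]
    # Assumed uniqueness key: one entry per product per site.
    counts = Counter((r["site"], r["brand"], r["product"]) for r in rows)
    duplicates = [key for key, n in counts.items() if n > 1]
    return out_of_range, duplicates


# Illustrative rows only.
rows = [
    {"site": "S1", "brand": "BrandA", "product": "A1", "yield": 231.0},
    {"site": "S1", "brand": "BrandA", "product": "A1", "yield": 229.0},
    {"site": "S2", "brand": "BrandB", "product": "B1", "yield": 2310.0},
]
bad, dups = validate_rows(rows)
```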
.gitea/workflows/refresh.yml — add scrape steps for the new trial
sources (agripro_trials, gh_plot_reports) so the monthly cron
refreshes them alongside the variety sources. gh_plot_reports
is the heaviest single source (~4,600 docs @ 1 req/sec ≈ 77 min),
so it runs late in the job: a failure in an earlier step aborts the
run before the long scrape starts. The commit message gains extra
variables to surface the trial counts.
docs_mcp/lessons.md — new "trial-data" section telling the agent:
- The two surfaces (search_docs = identity, search_trials = perf)
are complementary; how to route a farmer question to each.
- What's indexed (GH plot reports cross-vendor, AgriPro regional
PDFs) vs what's not (Bayer per-variety trials, NK yield results,
Pioneer, university extension trials).
- Recommended workflow: search_trials → identify top performers →
lookup_variety on each to verify identity → don't fabricate.
- How to read a GH plot report (per-column headers vary by crop:
corn/soy use Yield/MST/Test Weight, silage uses Ton/Acre +
Milk + Beef columns).
- Single-data-point caveat: one plot is one cooperator's field;
look across multiple plots for a robust recommendation.
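One way to apply the single-data-point caveat is to rank only varieties seen in
several plots. This is a sketch: the record keys and the `min_plots` threshold
are assumptions, and it is not part of search_trials itself.

```python
from collections import defaultdict
from statistics import mean


def rank_across_plots(results, min_plots=3):
    """Average yield per variety, keeping only varieties with enough plots."""
    by_variety = defaultdict(list)
    for r in results:
        by_variety[r["variety"]].append(r["yield"])
    ranked = [
        (variety, round(mean(yields), 1), len(yields))
        for variety, yields in by_variety.items()
        if len(yields) >= min_plots  # a single cooperator field is not enough
    ]
    return sorted(ranked, key=lambda t: t[1], reverse=True)


# Illustrative numbers only.
results = [
    {"variety": "A", "yield": 200}, {"variety": "A", "yield": 210},
    {"variety": "A", "yield": 220}, {"variety": "B", "yield": 250},
    {"variety": "C", "yield": 190}, {"variety": "C", "yield": 200},
    {"variety": "C", "yield": 195},
]
top = rank_across_plots(results)
```

Variety B has the highest single yield but only one plot, so it is dropped
rather than recommended.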
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
explicit GET params (sidesteps the AJAX-only-on-load default that
returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
.field--node--tag-line--variety, .field--node--body, plus the
three rated sections (Agronomics / Grain / Disease) with their
<div class="row"><div class="label">label</div><div>value</div></div>
pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
— provides the Northern Plains HRS coverage WestBred lacks.
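A stdlib-only sketch of pulling label/value pairs out of the rated sections.
The scraper itself may use a different parser; the class name `RatingRowParser`
and the sample markup are assumptions built from the row pattern above.

```python
from html.parser import HTMLParser


class RatingRowParser(HTMLParser):
    """Collect {label: value} from <div class="label"> + following <div>."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._state = None   # "label" or "value" while inside that div
        self._label = None   # label waiting for its value div
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "label" in classes:
            self._state, self._buf = "label", []
        elif self._label is not None and self._state is None:
            # First div after a label div is treated as its value.
            self._state, self._buf = "value", []

    def handle_data(self, data):
        if self._state:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or self._state is None:
            return
        text = "".join(self._buf).strip()
        if self._state == "label":
            self._label = text
        else:
            self.pairs[self._label] = text
            self._label = None
        self._state = None


sample = (
    '<div class="row"><div class="label">Stripe Rust</div><div>1</div></div>'
    '<div class="row"><div class="label">Eyespot</div><div>2</div></div>'
)
parser = RatingRowParser()
parser.feed(sample)
pairs = parser.pairs
```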
nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
{"d": "<html>"} where the inner HTML is one <div class="sf-result">
per variety. BeautifulSoup parses the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
title <span>, "Brands Available" trait variants, marketing
positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
* corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
"Label N" lines (text-extractable)
* soybean Phytophthora source genes (Rps1c, Rps3a, ...)
* soybean SCN race coverage
* soybean agronomic ratings (Emergence, Standability, Shatter
Tolerance, Green Stem) with text-extractable 1-9 values
* soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
Drought) are not text-extractable; we record the labels with an
explicit "rated in PDF chart, see tech sheet" value so the agent
can point the farmer to the source for those numbers.
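The "Label N" extraction and the chart-only placeholder can be sketched like
this, operating on text already extracted from a tech sheet. The regex, the
function name `parse_ratings`, and the sample text are assumptions; the real
parser may handle more line shapes.

```python
import re

# A rating line is assumed to be a text label followed by a single 1-9 digit.
RATING_LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z' ]*?)\s+(?P<score>[1-9])$")
CORN_CHART_ONLY = ("Emergence", "Stalk Strength", "Drought")
PLACEHOLDER = "rated in PDF chart, see tech sheet"


def parse_ratings(page_text, chart_only=CORN_CHART_ONLY):
    """Return {label: int score}, plus placeholders for chart-only labels."""
    ratings = {}
    for line in page_text.splitlines():
        m = RATING_LINE.match(line.strip())
        if m:
            ratings[m["label"]] = int(m["score"])
    for label in chart_only:
        # Never overwrite a value that was actually extracted from text.
        ratings.setdefault(label, PLACEHOLDER)
    return ratings


# Illustrative text, not copied from a real tech sheet.
text = "Gray Leaf Spot 3\nGoss's Wilt 4\nPlant Height Medium\n"
ratings = parse_ratings(text)
```

Recording an explicit placeholder instead of omitting the label keeps the agent
from reading a missing value as "not rated".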
Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
Beck's products — everything else IS now in the corpus.
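If ratings from different sources ever need to be compared, the direction can
be normalized first. This sketch assumes, per the correction above, that nk and
agripro use 1 = best and that bayer_seeds and golden_harvest use the opposite
direction on the same 1-9 range; the helper name is hypothetical and nothing
like it is claimed to exist in the repo.

```python
# True where 1 is the best score (direction taken from the note above).
BEST_IS_ONE = {
    "nk": True,
    "agripro": True,
    "bayer_seeds": False,
    "golden_harvest": False,
}


def to_higher_is_better(source: str, rating: int) -> int:
    """Map a 1-9 rating onto a common scale where 9 is always best."""
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of range: {rating}")
    return 10 - rating if BEST_IS_ONE[source] else rating


nk_best = to_higher_is_better("nk", 1)
bayer_same = to_higher_is_better("bayer_seeds", 7)
```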
Cross-vendor coverage after this PR: 760 varieties.
bayer_seeds 475 (DEKALB 288 / Asgrow 102 / WestBred 85)
golden_harvest 139
nk 122 (41 corn / 81 soy)
agripro 24 (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat-category varieties now, up from 85; the 24 AgriPro entries
include 1 barley).
requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the fifth MCP tool — crop_seed_api_lessons(topic?) — backed by
docs_mcp/lessons.md, the ONLY source of opinionated content in the
server. Everything else (search_docs, get_page, lookup_variety)
returns verbatim from vendor catalogs; lessons.md fills the gaps
the corpus can't cover.
The Pioneer fallback is the critical anti-hallucination piece:
Pioneer's ToS bans automation, so the corpus has no Pioneer data.
Without this tool, an agent might surface Bayer/Asgrow chunks as
mediocre matches for a Pioneer query. The tool's docstring tells
the agent to call it on any Pioneer / P-series question; the
'pioneer' section says clearly:
"I don't have Pioneer's variety data indexed... please consult
Pioneer or an extension service."
"Do NOT invent Pioneer hybrid ratings."
Other lesson sections cover knowledge the agent needs to interpret
search_docs / get_page output correctly:
- rating-scales: Bayer 1-9, Golden Harvest 9-to-1, what
R/MR/S/Rps1c/R3 mean in soybean disease columns
- maturity-semantics: corn RM days vs soybean MG vs wheat class +
qualitative early/medium/late
- trait-glossary: SSRIB, VT2PRIB, XF, E3, Conkesta, Clearfield, etc.
- scn-resistance: race coverage + Peking vs PI 88788 source
- regional-listings: how to interpret Bayer's "local profiles"
- sources-not-yet-indexed: which vendors aren't in the corpus yet
- checking-your-work: always call lookup_variety before quoting
Lesson lookup prefers slug-match (returns just `rating-scales` for
topic="rating", not every section that mentions ratings); falls
back to body-match only when no slug matches.
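A sketch of the slug-first lookup, assuming sections are held as a slug-to-body
mapping and that matching is a case-insensitive substring test; both are
assumptions about docs_mcp, and `find_lessons` is a hypothetical name.

```python
def find_lessons(sections, topic=None):
    """Return matching slugs: slug matches first, body matches as fallback."""
    if topic is None:
        return list(sections)          # full index
    needle = topic.lower()
    slug_hits = [slug for slug in sections if needle in slug.lower()]
    if slug_hits:
        return slug_hits               # do not widen to body mentions
    return [slug for slug, body in sections.items() if needle in body.lower()]


# Illustrative section bodies.
sections = {
    "rating-scales": "How to read vendor rating scales.",
    "scn-resistance": "Race coverage; ratings shown as R/MR/S.",
    "pioneer": "I don't have Pioneer's variety data indexed.",
}
by_slug = find_lessons(sections, "rating")
by_body = find_lessons(sections, "race")
no_match = find_lessons(sections, "zzzzzz")
```

Here "rating" also appears in the scn-resistance body, but the slug hit wins,
so only rating-scales is returned.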
Smoke-tested with topic=pioneer, topic=rating, topic=trait,
topic=zzzzzz (no match), and topic=None (full index = 10K chars,
8 sections).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>