agripro + nk scrapers — 146 Syngenta varieties added (760 total in corpus) #6

Merged
justin merged 1 commit from agripro-nk-scrapers into main 2026-05-25 14:17:19 -04:00

Summary

  • agripro (24 wheat varieties): Drupal Views form scrape via /search-agripro-brand-varieties?title=&variety_type_value=All. Three rated sections per variety (Agronomics / Grain / Disease) with <div class="row"> label/value pairs. Wheat class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley. Closes the Northern Plains HRS gap WestBred didn't cover.
  • nk (122 varieties — recon's "29" was outdated; current NK seed-finder lists 41 corn + 81 soy): ASP.NET WebForms endpoint POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returning {"d": "<html>"} with one <div class="sf-result"> per variety. pdfplumber text extraction on tech-sheet PDFs pulls disease ratings (corn), Phytophthora source genes, SCN race coverage, agronomic numeric ratings (soy), and soil-type adaptation labels (soy).
  • lessons.md scale-direction correction: NK and AgriPro both use 1 = best, lower = more resistant (REVERSED vs Bayer/Golden Harvest). My earlier lesson incorrectly grouped them with Bayer. NK's PDF footer literally prints "1-9 Scale: 1 = Best, 9 = Worst". The chunker reads _scale_direction from each variety's sidecar, so the chunk preamble correctly tells the LLM how to interpret the numbers — anti-hallucination guarantee holds across the reversed-scale vendors.
  • requirements.txt: pdfplumber>=0.11 added for NK tech-sheet text extraction.
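
A minimal sketch of unwrapping the NK response envelope described above, assuming the body is JSON of the form {"d": "<html>"} with one `<div class="sf-result">` per variety. The scraper itself uses BeautifulSoup; this sketch swaps in the standard-library `html.parser` so it runs without extra dependencies. `CardCounter`, `count_cards`, and the sample payload are hypothetical names and data for illustration.

```python
import json
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Counts <div class="sf-result"> elements in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cards = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "sf-result" in classes:
            self.cards += 1


def count_cards(response_body: str) -> int:
    """Unwrap the {"d": "<html>"} envelope and count variety cards."""
    inner_html = json.loads(response_body)["d"]
    counter = CardCounter()
    counter.feed(inner_html)
    return counter.cards


# Hypothetical sample payload shaped like the envelope described in this PR.
sample = json.dumps({
    "d": '<div class="sf-result"><span>NK8005</span></div>'
         '<div class="sf-result"><span>NK008-P8XF</span></div>'
})
print(count_cards(sample))
```

A card count like this is the kind of check the nk smoke test relies on (122 cards, 41 corn / 81 soy).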

Coverage after merge

| Source | Count | Brands |
|---|---|---|
| bayer_seeds | 475 | DEKALB / Asgrow / WestBred |
| golden_harvest | 139 | Golden Harvest |
| nk | 122 | NK |
| agripro | 24 | AgriPro |
| Total | 760 | 6 brands across 2 vendors |

109 wheat varieties (up from 85) — now includes HRS/SWW/HRW/HWS classes across both AgriPro and WestBred.

Known v1 limitations (documented, not blockers)

  • NK corn agronomic chart bars (Drought, Stalk Strength, Emergence) aren't text-extractable from the PDF. Chunk records the label list with an explicit "rated in tech-sheet PDF chart, see source" annotation; full numeric values would need pdf2image+OCR or pdfplumber geometric layout parsing.
  • AgriPro non-code variety lookup: a search for "AP Iliad" may not rank Iliad at #1 because "Iliad" does not match a code-token pattern. Workaround: lookup_variety('agripro-ap-iliad') for exact lookup; brand='AgriPro' filter for branded search.
  • Cross-vendor result diversity: Dense embedding favors Bayer for broad queries because Bayer chunks are longer/richer. A reranker (Phase 6, deferred) would help; for now the agent can use brand filters for vendor-specific results.

Test plan

  • agripro smoke: all 24 wheat varieties parsed; AP Iliad has Stripe Rust 1, Eyespot 2 (1 = best for AgriPro).
  • nk smoke: 122 cards parsed (41 corn / 81 soy); 77 of 81 soy varieties have Phytophthora gene extracted; corn disease ratings extract correctly (e.g. NK8005 Tar Spot 2 = excellent).
  • Re-index produced 760 chunks in 26s on the 3-node Ollama pool.
  • list_versions reports 4 sources / 2 vendors / 6 brands / 3 crops with correct counts.
  • crop_seed_api_lessons(topic='rating-scales') returns the corrected guidance with NK + AgriPro in the reversed-scale bucket.
justin added 1 commit 2026-05-25 14:17:08 -04:00
agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the
  three rated sections (Agronomics / Grain / Disease) with their
  <div class="row"><div class="label">label</div><div>value</div></div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
  — provides the Northern Plains HRS coverage WestBred lacks.
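
The label/value row layout above can be read roughly as follows. This is a sketch, not the scraper's actual code: it assumes each row holds one `<div class="label">` followed by one plain `<div>` for the value, and it uses the standard-library `html.parser` rather than BeautifulSoup to stay dependency-free. `RowPairParser` is a hypothetical name; the sample values are the AP Iliad ratings quoted in this PR.

```python
from html.parser import HTMLParser


class RowPairParser(HTMLParser):
    """Collects (label, value) pairs from <div class="row"> blocks."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._stack = []  # role of each open <div>: "row", "label", "value", None
        self._label = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "row" in classes:
            role = "row"
            self._label, self._value = [], []
        elif "row" in self._stack and "label" in classes:
            role = "label"
        elif self._stack and self._stack[-1] == "row":
            role = "value"  # any other direct child of a row
        else:
            role = None
        self._stack.append(role)

    def handle_data(self, data):
        role = self._stack[-1] if self._stack else None
        if role == "label":
            self._label.append(data)
        elif role == "value":
            self._value.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        if self._stack.pop() == "row":
            label = "".join(self._label).strip()
            value = "".join(self._value).strip()
            if label:
                self.pairs.append((label, value))
            self._label, self._value = [], []


sample = """
<h3>Disease</h3>
<div class="row"><div class="label">Stripe Rust</div><div>1</div></div>
<div class="row"><div class="label">Eyespot</div><div>2</div></div>
"""
parser = RowPairParser()
parser.feed(sample)
print(parser.pairs)
```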

nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
  title <span>, "Brands Available" trait variants, marketing
  positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
  Drought) are not text-extractable; we record the labels with an
  explicit "rated in PDF chart, see tech sheet" value so the agent
  can direct the farmer to the source for those numbers.
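
A rough sketch of the "Label N" parsing and the chart-only fallback described above. It assumes the tech-sheet text has already been extracted (for example with pdfplumber's `page.extract_text()`) and that each rating sits on its own line as a label followed by a single 1-9 digit. `RATING_LINE`, `CHART_ONLY`, and `parse_ratings` are hypothetical names, and apart from Tar Spot 2 (the NK8005 value quoted in the test plan) the sample ratings are made up.

```python
import re

# A text label followed by a single 1-9 score, e.g. "Gray Leaf Spot 3".
RATING_LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z' ]*?)\s+(?P<score>[1-9])$")

# Placeholder stored for ratings that only exist as chart bars in the PDF.
CHART_ONLY = "rated in PDF chart, see tech sheet"


def parse_ratings(text, chart_labels=()):
    """Return {label: score}; chart-only labels get the placeholder string."""
    ratings = {}
    for line in text.splitlines():
        match = RATING_LINE.match(line.strip())
        if match:
            ratings[match.group("label")] = int(match.group("score"))
    for label in chart_labels:
        # Never overwrite a numeric value that was text-extractable.
        ratings.setdefault(label, CHART_ONLY)
    return ratings


# Hypothetical extracted text; only the Tar Spot value comes from this PR.
sample_text = """Gray Leaf Spot 3
Tar Spot 2
Goss's Wilt 4
1-9 Scale: 1 = Best, 9 = Worst"""

ratings = parse_ratings(
    sample_text, chart_labels=("Emergence", "Stalk Strength", "Drought")
)
print(ratings)
```

The footer line is skipped because it starts with a digit, so the scale legend never ends up as a rating.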

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
  AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
  with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
  Beck's products — everything else IS now in the corpus.
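
For illustration, the chunk preamble could depend on the sidecar's `_scale_direction` field along these lines. This is a sketch under assumptions: the field name comes from this PR, but the values "lower_is_better" and "higher_is_better", the `scale_preamble` function, and the preamble wording are hypothetical, and "higher is better" for Bayer / Golden Harvest is inferred from the word "reversed".

```python
def scale_preamble(sidecar: dict) -> str:
    """Build the sentence that tells the LLM how to read a variety's ratings."""
    direction = sidecar.get("_scale_direction")
    if direction == "lower_is_better":
        # NK and AgriPro convention per this PR: 1 = best.
        return "Ratings: lower numbers are better (1 = best)."
    if direction == "higher_is_better":
        # Assumed convention for Bayer / Golden Harvest.
        return "Ratings: higher numbers are better."
    # Missing or unrecognized direction: refuse to guess.
    return "Ratings: scale direction unknown; do not rank by these numbers."


# Hypothetical sidecars; only the _scale_direction key is named in this PR.
nk_sidecar = {"brand": "NK", "_scale_direction": "lower_is_better"}
bayer_sidecar = {"brand": "DEKALB", "_scale_direction": "higher_is_better"}

print(scale_preamble(nk_sidecar))
print(scale_preamble(bayer_sidecar))
print(scale_preamble({}))
```

Falling back to an explicit "unknown" sentence rather than a default direction keeps a missing sidecar field from silently inverting a rating.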

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475 (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122  (41 corn / 81 soy)
  agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).

requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 7b3da908e0 into main 2026-05-25 14:17:19 -04:00

Reference: justin/seed-mcp#6