agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)

agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the
  three rated sections (Agronomics / Grain / Disease), each a run of
  <div class="row"><div class="label">label</div><div>value</div></div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
  — provides the Northern Plains HRS coverage WestBred lacks.
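
A minimal sketch of the label/value row parse described above. It
uses the stdlib html.parser so it runs without dependencies; the
scraper's actual parser may differ, and the class and function names
here are hypothetical.

```python
from html.parser import HTMLParser


class RatedRowParser(HTMLParser):
    """Collect (label, value) pairs from <div class="row"> blocks."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # finished (label, value) tuples
        self._depth = 0        # div nesting inside the current row; 0 = outside
        self._cells = []       # text of each direct child div of the row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "row" in classes:
                self._depth = 1
                self._cells = []
        else:
            self._depth += 1
            if self._depth == 2:      # a direct child of the row starts a cell
                self._cells.append("")
                self._in_cell = True

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        if self._depth == 2:
            self._in_cell = False
        self._depth -= 1
        if self._depth == 0 and len(self._cells) >= 2:
            label, value = (c.strip() for c in self._cells[:2])
            self.pairs.append((label, value))

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data


def parse_rated_rows(section_html):
    """Return the (label, value) pairs found in one rated section."""
    parser = RatedRowParser()
    parser.feed(section_html)
    return parser.pairs
```

Values come back as strings, so numeric disease scores and
qualitative agronomic labels can share one code path.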

nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup parses the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
  title <span>, "Brands Available" trait variants, marketing
  positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
  Drought) are not text-extractable; we record the labels with an
  explicit "rated in PDF chart, see tech sheet" value so the agent
  can point the farmer to the source for those numbers.

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
  AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
  with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
  Beck's products — everything else IS now in the corpus.
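
For the scale-direction point above, a sketch of putting both
conventions on one footing before comparing across vendors. On a
1-9 scale the flip is 10 - score. The function name and the boolean
flag are hypothetical; in practice the direction would be read from
each sidecar's `_scale_direction` field.

```python
def to_nine_is_best(score: int, one_is_best: bool) -> int:
    """Re-express a 1-9 rating so that 9 is always the best value."""
    if not 1 <= score <= 9:
        raise ValueError(f"rating out of range: {score}")
    # 1 <-> 9, 2 <-> 8, ... 5 stays 5.
    return 10 - score if one_is_best else score


# AP Iliad, Stripe Rust 1 on AgriPro's 1 = best scale:
print(to_nine_is_best(1, one_is_best=True))  # 9
```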

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475 (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122  (41 corn / 81 soy)
  agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat-brand varieties now, up from 85; the AgriPro 24 includes
1 barley).
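
A quick arithmetic check of the totals, using only the counts
listed in the table above (variable names are illustrative):

```python
# Per-source counts copied from the coverage table.
coverage = {"bayer_seeds": 475, "golden_harvest": 139, "nk": 122, "agripro": 24}
bayer_brands = {"DEKALB": 288, "Asgrow": 102, "WestBred": 85}

total = sum(coverage.values())
wheat_brands = bayer_brands["WestBred"] + coverage["agripro"]

# The Bayer brand breakdown should add up to the Bayer source total.
assert sum(bayer_brands.values()) == coverage["bayer_seeds"]
print(total, wheat_brands)  # 760 109
```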

requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.
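
Once pdfplumber has produced page text, the "Label N" rating lines
can be picked out with a regex. The sketch below works on
already-extracted text, so it does not call pdfplumber itself. The
pattern and function name are assumptions, and the sample scores in
the usage note are made up.

```python
import re

# A label (letters, spaces, apostrophes, dots, slashes, hyphens)
# followed by a single 1-9 score at the end of the line.
RATING_LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z' ./-]*?)\s+(?P<score>[1-9])$")
CHART_ONLY = "rated in PDF chart, see tech sheet"


def parse_rating_lines(page_text, chart_only_labels=()):
    """Map rating labels to int scores; chart-only labels get a placeholder."""
    ratings = {}
    for line in page_text.splitlines():
        match = RATING_LINE.match(line.strip())
        if match:
            ratings[match.group("label")] = int(match.group("score"))
    for label in chart_only_labels:
        # Bars drawn as graphics leave no number in the text layer.
        ratings.setdefault(label, CHART_ONLY)
    return ratings
```

For example, `parse_rating_lines("Gray Leaf Spot 3\nGoss's Wilt 4",
chart_only_labels=["Emergence"])` yields the two integer scores plus
the placeholder string for Emergence. The scale footer line starts
with a digit, so it is not mistaken for a rating.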

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:16:36 -04:00
parent 2588ebafa1
commit 9ce920f622
296 changed files with 23233 additions and 60 deletions
+28 -16
@@ -72,7 +72,21 @@ re-stated it as `1-9 (9 = best)` in the chunk preamble; the source's
`_scale_direction` field still says `9-to-1` so you can detect the
provenance.
-**Syngenta NK / AgriPro**: `1-9 (9 = best)`. Same as Bayer.
+**Syngenta NK and AgriPro**: `1-9 (1 = best, lower = more
+resistant)`. **REVERSED from Bayer and Golden Harvest.** NK's
+tech-sheet PDFs literally print *"1-9 Scale: 1 = Best, 9 = Worst"*
+in the footer; AgriPro's positioning on stripe-rust-resistant
+varieties (e.g. AP Iliad with Stripe Rust 1, Eyespot 2) confirms
+the same direction. On NK, this applies both to disease tolerance
+AND to numeric agronomic ratings (Emergence, Standability, Shatter
+Tolerance, Green Stem — all 1 = best). Cross-vendor comparisons
+MUST consult the `_scale_direction` field in each side's sidecar
+before drawing conclusions.
+(Agronomic ratings on AgriPro are qualitative —
+"Excellent / Very Good / Good / Fair" — and have no direction
+issue. NK's soybean tech sheets ALSO publish soil-type adaptation
+as Best/Good/Fair/Poor labels which are qualitative.)
**Beck's**: ratings live behind SeedIQ login; only identity-level
data is publicly available, so most disease/agronomic ratings are
@@ -219,25 +233,23 @@ NK publishes ratings as PDF tech sheets without regional flags.
## sources-not-yet-indexed
-These vendors are planned but not yet in the corpus. Don't assume
-their data is present:
-- **Golden Harvest (Syngenta)** — ~175 varieties, sitemap-driven
-scrape pending.
-- **NK (Syngenta)** — 29 varieties.
-- **AgriPro (Syngenta wheat)** — 24 wheat varieties (HRW, HRS, HWS,
-SWW, SWS). The only wheat coverage we expect to have outside
-WestBred.
-- **Beck's PFR (research)** — 2,089 head-to-head trial documents.
-Different shape from variety records — these are studies, not
-hybrids.
-- **Beck's products** — 860 products. Identity-only (SeedIQ login
-gates the ratings).
If `list_versions()` doesn't show a vendor in the `vendor` facet, the
corpus does not have it yet. Direct the farmer to that vendor's
public catalog or their seed dealer.
+**Already indexed**: Bayer (DEKALB / Asgrow / WestBred), Syngenta
+(Golden Harvest, NK, AgriPro).
+**Not yet indexed**:
+- **Beck's PFR (research)** — 2,089 head-to-head trial documents
+on the public Sanity GROQ API. Different shape from variety
+records — these are studies, not hybrids. Surfacing them would
+benefit a separate tool (e.g. `search_pfr_studies`) rather than
+share a corpus with variety identity.
+- **Beck's products** — ~860 products. Identity-only (SeedIQ login
+gates the ratings).
---
## checking-your-work