Three new brand scrapers: LG Seeds + AgriGold + Ebbert's Seeds (+310 varieties) #14

Merged
justin merged 1 commit from lg-agrigold-ebberts into main 2026-05-26 12:42:56 -04:00

Summary

A user flagged that LG Seeds, AgriGold, and Ebbert's (a local Ohio/Indiana breeder) are all active in farmer territory. This PR adds three scrapers, one per site layout, bringing 310 new varieties, three new brands, and the corpus's first alfalfa coverage.

| Source | Count | Crops | Pattern |
|---|---|---|---|
| `lg_seeds` | 170 | corn / soy / alfalfa / sorghum | Listing page has embedded `var products = [...]` JSON; per-variety detail has `<span class="bar-N">` ratings on 1-9 (9=best) |
| `agrigold` | 111 | corn / soy | 60+ `/corn/explore-corn-hybrids/<CODE>` URLs; ratings rendered as 5-circle scale (1-5, 5=best), distinct from the 1-9 brands |
| `ebberts_seeds` | 29 | corn / soy / wheat | Single page per crop, multi-section CSS grid; preserving verbatim text since the 29-variety total doesn't justify perfect column parsing. Honors robots.txt `Crawl-delay: 5`. 1-5 scale (1=best, lower = more resistant) |

Corpus state after merge

| Metric | Before | After |
|---|---|---|
| Total chunks | 5,529 | 5,839 (+310) |
| Brands | 8 | 11 |
| Crops | 7 | 8 (+alfalfa) |
| Vendors | 2 | 4 (+AgReliant Genetics, +Ebbert's) |

Smoke tests

  • brand="LG Seeds" + crop=corn → returns LG corn varieties (LG65C89, etc.) ✓
  • brand="AgriGold" + crop=corn → returns AgriGold varieties (A626-39, A618-34) ✓
  • brand="Ebbert's Seeds" + crop=corn → returns Ebbert's regional (7442PC, 7188PC) ✓
  • Variety-code prefilter pins exact matches at #1 for all three brands: LG5701 → LG5701, A616-30 → A616-30, 7000TR RIB → 7000TR RIB ✓
  • LG alfalfa query returns alfalfa varieties (7C300, 5R300, 4R300), the first alfalfa data the advisor can answer on ✓
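The variety-code prefilter checked in the fourth test can be sketched roughly as below. This is an illustration under assumptions, not the repository's code: `pin_exact_code_matches`, the `variety` key, and the normalisation rule are all hypothetical names and choices.

```python
import re


def _norm(text: str) -> str:
    """Uppercase, then turn every run of non [A-Z0-9-] characters into one space."""
    return re.sub(r"[^A-Z0-9-]+", " ", text.upper()).strip()


def pin_exact_code_matches(query: str, ranked: list[dict]) -> list[dict]:
    """Move chunks whose variety code appears as a whole token run in the query
    to the front, keeping the retriever's relative order within each group.

    Assumes each chunk dict carries a `variety` key (hypothetical field name).
    """
    padded_query = f" {_norm(query)} "
    exact, rest = [], []
    for chunk in ranked:
        code = _norm(chunk.get("variety", ""))
        # Space padding prevents LG5701 from matching inside LG57010.
        hit = bool(code) and f" {code} " in padded_query
        (exact if hit else rest).append(chunk)
    return exact + rest
```

Multi-token codes such as 7000TR RIB match because the comparison is done on whitespace-normalised text rather than on single tokens.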

Notable choices

  • AgriGold rates on a 1-5 scale (5=best), unlike the 1-9 brands, and Ebbert's uses 1-5 with 1=best. _scale_direction is set on every chunk so the LLM doesn't conflate the scales.
  • Ebbert's content is preserved verbatim rather than column-parsed. The multi-cell CSS grid layout would need bespoke alignment logic; 29 varieties doesn't justify it. The chunk body carries the full visible text so retrieval still works (the LLM reads the table text).
  • LG Seeds publishes alfalfa FD ratings, which is genuinely useful for hay-acreage farmers — that data didn't exist anywhere else in the corpus before.
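One way the per-chunk scale declaration could work is sketched below. The scale ranges and best-ends come from the table in this PR; the `RatingScale` dataclass, the `SCALES` dict, and `scale_preamble` are hypothetical and may not match how `_scale_direction` is actually stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingScale:
    low: int
    high: int
    best: int  # which end of the range means "best"


# Values from the PR table; the layout of this mapping is an assumption.
SCALES = {
    "lg_seeds": RatingScale(1, 9, best=9),
    "agrigold": RatingScale(1, 5, best=5),
    "ebberts_seeds": RatingScale(1, 5, best=1),
}


def scale_preamble(source: str) -> str:
    """Render a one-line note that could be prepended to a chunk body."""
    s = SCALES[source]
    worst = s.low if s.best == s.high else s.high
    return f"Ratings use a {s.low}-{s.high} scale ({s.best} = best, {worst} = worst)."
```

Stating both ends explicitly matters most for Ebbert's, where a rating of 1 would read as poor under either of the other two scales.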

What's not in this PR (possible future)

  • LG /performance/corn and AgriGold /corn/performance/corn-yield-results look like they publish plot reports analogous to Golden Harvest's (GH). Could become two more trial sources (lg_plot_reports, agrigold_plot_reports).
  • Other small regional breeders that could reuse the Ebbert's approach: Renk, Wyffels, Stine, Hoegemeyer, Brett Young (canola). Each would take roughly 30 minutes.
justin added 1 commit 2026-05-26 12:42:55 -04:00
User flagged that LG Seeds, AgriGold, and Ebbert's (local Ohio/Indiana
breeder) are all active in farmer territory. Built three scrapers; the
corpus now covers 5,839 chunks across 11 brands.

Net new varieties: 310
  lg_seeds        170 — corn 78 + soy 63 + alfalfa 16 + sorghum 13
                  → adds FIRST alfalfa coverage (FD 3-5 range)
  agrigold        111 — corn 60 + soy 51
  ebberts_seeds    29 — corn 17 + soy 12 (regional OH/IN breeder)

scrape/sources/lg_seeds.py — embedded-JSON pattern (cleanest):
- /products/<crop> pages have a `var products = [...]` blob with the
  variety summary (Variety, Maturity, Traits[], Bullets[], CropType).
- Per-variety detail page (/products/<crop>/<Variety>) carries the
  ratings as `<span class="bar-N">` where N is 1-9 on the canonical
  scale. Same 9=best direction as Bayer / Golden Harvest.
- Three sections per page: Characteristics / Management / Disease
  Tolerance, plus a few qualitative bars ("Tar Spot Susceptible",
  "Fungicide Response High") preserved as text values.

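The embedded-JSON pattern described for lg_seeds.py might look something like the sketch below. It assumes the `var products = [...]` blob is strict JSON (quoted keys, no trailing commas) and returns the `bar-N` values in document order without pairing them to labels, since the label markup is not described here. The function names are hypothetical.

```python
import json
import re

_PRODUCTS_START = re.compile(r"var\s+products\s*=\s*")
_BAR_CLASS = re.compile(r'<span\b[^>]*\bclass="[^"]*\bbar-([1-9])\b')


def extract_products(html: str) -> list[dict]:
    """Decode the JSON array assigned to `var products` on a listing page."""
    match = _PRODUCTS_START.search(html)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows the closing bracket.
    products, _end = json.JSONDecoder().raw_decode(html, match.end())
    return products


def bar_ratings(html: str) -> list[int]:
    """Return the N of every <span class="... bar-N ..."> in document order."""
    return [int(n) for n in _BAR_CLASS.findall(html)]
```

Using `raw_decode` rather than a bracket-matching regex avoids mis-splitting on the nested `Traits[]` and `Bullets[]` arrays.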
scrape/sources/agrigold.py — 5-circle scale:
- Listing page has 60+ /corn/explore-corn-hybrids/<CODE> URLs.
- Detail page renders ratings as <div class="scale"> blocks with 5
  child <div class="circle"> elements, of which N have class
  "circle selected" → rating N on a 1-5 scale.
- 7 sections per page incl. Silage Characteristics (Dairy Silage
  Rating, NDFd 30 Hr, Crude Protein), Planting Applications, Soil
  Adaptability, Plant Characteristics, Product Features.
- Distinct rating scale (1-5 vs Bayer's 1-9, both higher = better),
  declared in _scale_direction so the chunker preamble renders correctly.
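
The circle-counting step above can be sketched with the standard-library HTML parser. This is a minimal illustration, assuming the markup is exactly as described (a `scale` div whose `circle` children carry `selected` when filled); `CircleScaleParser` and `circle_ratings` are hypothetical names, and pairing each rating with its label is left out.

```python
from html.parser import HTMLParser


class CircleScaleParser(HTMLParser):
    """Collect one rating per <div class="scale"> by counting selected circles."""

    def __init__(self) -> None:
        super().__init__()
        self.ratings: list[int] = []
        self._depth = 0      # div nesting depth inside the current scale block
        self._selected = 0   # selected circles seen in the current block

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:                      # already inside a scale block
            self._depth += 1
            if "circle" in classes and "selected" in classes:
                self._selected += 1
        elif "scale" in classes:             # entering a new scale block
            self._depth, self._selected = 1, 0

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:             # the scale block just closed
                self.ratings.append(self._selected)


def circle_ratings(html: str) -> list[int]:
    """Return the rating of each scale block in document order."""
    parser = CircleScaleParser()
    parser.feed(html)
    parser.close()
    return parser.ratings
```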

scrape/sources/ebberts_seeds.py — small regional breeder, verbatim
text approach:
- Single page per crop (corn / soybeans / wheat). Each variety is an
  <h1> + multi-section CSS-grid block where labels and values are in
  separate adjacent cells. Reconstructing perfectly-aligned columns
  for a 29-variety total isn't worth the engineering — chunk body
  carries the verbatim text in document order, LLM can read the
  tabular content.
- Scale: 1-5 (1 = best, lower = more resistant), inferred from
  marketing-vs-rating cross-checks ("Robust tall plants" + STANDABILITY
  1.0 → 1 = best).
- Politeness: robots.txt asks for Crawl-delay: 5; honored.
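
The two Ebbert's behaviours above, honoring Crawl-delay and keeping text verbatim in document order, can be sketched with the standard library only. This is a rough illustration under assumptions: `declared_crawl_delay` and `verbatim_text` are hypothetical names, and the real scraper may fetch and segment pages differently.

```python
import urllib.robotparser
from html.parser import HTMLParser


def declared_crawl_delay(robots_txt: str, user_agent: str = "*") -> int:
    """Return the Crawl-delay (seconds) that applies to user_agent, or 0."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent) or 0


class _TextCollector(HTMLParser):
    """Gather visible text nodes in document order, skipping script/style."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse internal whitespace
        if text and not self._skip:
            self.parts.append(text)


def verbatim_text(html: str) -> str:
    """Return one line per text node, so a label cell is followed by its value."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)
```

A caller would sleep for `declared_crawl_delay(...)` seconds between page fetches and store `verbatim_text(...)` as the chunk body.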

All three new scrapers smoke-tested:
- LG corn LG5701 RM 116 SmartStax → 3 characteristic groups with
  Disease Tolerance ratings (Northern/Southern Leaf Blight 8-9, etc.)
- AgriGold A616-30 RM 86 VT2RIB → 7 groups incl. silage and soil
  adaptability ratings
- Ebbert's 7000TR RIB RM 100 → 1098-char verbatim body covering
  CHARACTERISTICS, DISEASE RATINGS, herbicide tolerance, etc.

Corpus state after this PR:
- 5,839 chunks (was 5,529)
- 11 brands (was 8)
- 8 crops (corn 3047, soy 2209, silage 359, wheat 123, sorghum 49,
  cotton 30, alfalfa 16, canola 6) — alfalfa is brand-new

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit db1567f84a into main 2026-05-26 12:42:56 -04:00

Reference: justin/seed-mcp#14