Add 4 independent seed brands: Latham + Stine + 1st Choice + Burrus (+623 varieties) #17

Merged
claude merged 1 commit from add-four-independents into main 2026-06-04 21:58:08 -04:00

Adds **four independent regional seed brands** across the states you asked about (OH/IN/IL/IA), as variety-identity sources. Each parses agronomic + disease ratings into structured `characteristics_groups` so they actually embed (not body-only like the early Ebbert's pass).

| Source | Brand / HQ | Count | Mechanism | Scale |
|---|---|---|---|---|
| `latham` | Latham Hi-Tech Seeds, Alexander, **IA** (independent family) | 264 (155 corn + 109 soy) | WP REST enum + `/products/<slug>/` detail HTML | **1-9, LOWER=better** (reversed) |
| `stine` | Stine Seed, Adel, **IA** (largest US independent) | 217 (58 corn + 159 soy) | `sitemap.xml` enum + `/{crop}/traits/<slug>/<code>/` detail HTML | corn 1-9 (9=best); soy qualitative |
| `first_choice` | 1st Choice Seeds, Rushville, **IN** (employee-owned) | 78 (52 corn + 22 soy + 4 wheat) | per-crop sitemap → detail HTML | 0-10, higher=better |
| `burrus` | Burrus Seed, Arenzville, **IL** (independent family) | 64 (38 corn + 26 soy) | Seedware JSON API | 1-10 (10=best) |
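
The sitemap-based enumeration used for `stine` could look roughly like the sketch below. It is not the scraper from this PR: the function name `enumerate_details` and the regex are hypothetical, and the only detail taken from the description is the `/{crop}/traits/<slug>/<code>/` path shape. It takes sitemap XML as a string and keeps only the URLs that match that shape.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed detail-page shape: /{crop}/traits/<slug>/<code>/
DETAIL_PATH = re.compile(
    r"^/(?P<crop>[^/]+)/traits/(?P<slug>[^/]+)/(?P<code>[^/]+)/$"
)


def enumerate_details(sitemap_xml: str) -> list[dict]:
    """Return one dict per sitemap URL that looks like a variety detail page."""
    root = ET.fromstring(sitemap_xml)
    found = []
    for loc in root.iterfind("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        match = DETAIL_PATH.match(urlsplit(url).path)
        if match:
            found.append({"url": url, **match.groupdict()})
    return found
```

Filtering on the path rather than the full URL keeps the check independent of scheme and host, so the same function works against a saved copy of the sitemap.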

**623 new varieties.** All validated through `rag.chunk.chunks_from_variety`: 0 errors, 0 short chunks; 6 identity-only pages from source-side data gaps (2 Latham corn, 4 1st Choice wheat). No `chunk.py` change needed, since identity sources auto-route to `chunks_from_variety`.

**Two brands assessed and rejected** (recorded for the file): **Wyffels Hybrids** (IL), whose ToS §3.8 explicitly bans incorporating content into AI datasets; and **Seed Consultants** (OH), which is Corteva-owned and being decommissioned.

**Corteva exclusion formalized.** Added `hoegemeyer` to `_excluded_sources`. hoegemeyer.com's footer ToS resolves to the same `corteva.com` ToU that bans "spiders, robots, scrapers, crawlers, data mining tools" (clause e) and building "a similar or competitive service" (clause f). This is the identical basis to the Pioneer exclusion. The note treats **all `*.corteva.com` / `pioneer.com` / `hoegemeyer.com` / `therightseed.com` + the Vylor spinoff as one excluded ToU domain** (so the legacy regionals Dairyland/Nu-Tech/Terral fall under it as they migrate). Legitimate Corteva-data paths (official license / university-extension trials) noted.
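
As an illustration of treating the Corteva family as one excluded ToU domain, a URL guard might look like the sketch below. The names `EXCLUDED_TOU_DOMAINS` and `is_excluded` are hypothetical and are not the `_excluded_sources` structure in `sources.json`. The Vylor spinoff is left out because its domain is not given here.

```python
from urllib.parse import urlsplit

# Domains named in the exclusion note; the tuple itself is an assumption.
EXCLUDED_TOU_DOMAINS = (
    "corteva.com",
    "pioneer.com",
    "hoegemeyer.com",
    "therightseed.com",
)


def is_excluded(url: str) -> bool:
    """True if the URL's host is an excluded domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in EXCLUDED_TOU_DOMAINS
    )
```

Matching on `"." + domain` rather than a bare suffix avoids false positives on unrelated hosts that merely end with the same letters.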

**Scale-direction safety:** the independents disagree on direction (Latham 1=best vs Stine-corn/Burrus higher=best). The rating-scales lesson now spells out every direction + an explicit "never compare raw numbers without `_scale_direction`" warning.
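
To make the hazard concrete, here is a minimal sketch of direction-aware normalization. The ranges and directions come from the table in this description; the `Scale` class, the `SCALES` mapping, and `normalize` are hypothetical and do not reflect how `_scale_direction` is actually stored in the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    low: float
    high: float
    higher_is_better: bool


# Numeric scales as listed in this PR (Stine soy is qualitative, so omitted).
SCALES = {
    "latham": Scale(1, 9, higher_is_better=False),
    "stine_corn": Scale(1, 9, higher_is_better=True),
    "first_choice": Scale(0, 10, higher_is_better=True),
    "burrus": Scale(1, 10, higher_is_better=True),
}


def normalize(source: str, raw: float) -> float:
    """Map a raw vendor rating onto 0.0 (worst) .. 1.0 (best)."""
    scale = SCALES[source]
    if not scale.low <= raw <= scale.high:
        raise ValueError(f"{raw} is outside the {source} scale")
    fraction = (raw - scale.low) / (scale.high - scale.low)
    # Reversed scales (lower = better) are flipped so 1.0 always means best.
    return fraction if scale.higher_is_better else 1.0 - fraction
```

Under these assumptions a Latham 1 and a Stine corn 9 both normalize to 1.0, while comparing the raw numbers would rank them at opposite ends.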

**Robots/ToS:** Latham/Stine/1st Choice permissive (no anti-scrape clause). Burrus robots carries `ai-train=no` + blocks named AI bots. The operator (you) opted in; the scraper uses a non-blacklisted UA + honors Crawl-delay 10.
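
A sketch of how a scraper can read a crawl delay and per-agent rules with the standard library's `urllib.robotparser`. The robots.txt text below is illustrative only and is not the actual Burrus file; the user-agent string `seed-research-bot` is a made-up placeholder.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: one named bot blocked, everyone else rate-limited.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Allow: /
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
USER_AGENT = "seed-research-bot"  # hypothetical UA, not the one in this PR

# Seconds to sleep between requests; falls back to 0 if none is declared.
delay = rules.crawl_delay(USER_AGENT) or 0
allowed = rules.can_fetch(USER_AGENT, "https://example.com/products/")
```

A fetch loop would then call `time.sleep(delay)` between requests and skip any URL for which `can_fetch` returns `False`.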

**Docs:** README + CLAUDE inventory updated (2,268 variety + 6,787 trial records). CI (`image-only.yml`) rebuilds the Chroma+BM25 index from the committed corpus → image → Watchtower deploy.

claude added 1 commit 2026-06-04 21:57:58 -04:00
Four independent regional brands across IA/IN/IL (variety-identity sources,
each parsed into structured characteristics_groups so ratings embed):

- latham (264: 155 corn + 109 soy) — Latham Hi-Tech Seeds, Alexander IA.
  WordPress REST enum (/wp-json/wp/v2/varieties) + /products/<slug>/ detail
  HTML. Scale 1-9 LOWER=better (reversed, like NK/AgriPro).
- stine (217: 58 corn + 159 soy) — Stine Seed, Adel IA (largest US
  independent). sitemap enum + /{crop}/traits/<slug>/<code>/ detail HTML.
  Corn 1-9 (9=best); soy qualitative.
- first_choice (78: 52 corn + 22 soy + 4 wheat) — 1st Choice Seeds,
  Rushville IN (employee-owned). Per-crop sitemap -> detail HTML. Scale
  0-10 higher=better. ~40 older corn pages thin at source; wheat
  identity-only.
- burrus (64: 38 corn + 26 soy) — Burrus Seed, Arenzville IL. Seedware
  JSON API. Scale 1-10 (10=best). Brands Burrus/Power Plus/DONMARIO.
  robots ai-train=no + named-bot blocks; operator opted in, scraper uses a
  non-blacklisted UA + honors Crawl-delay 10.

All 623 validated through rag.chunk.chunks_from_variety (0 errors; 6
identity-only pages from source gaps). No chunk.py change needed (identity
sources auto-route to chunks_from_variety).

Docs:
- sources.json: 4 entries + Hoegemeyer added to _excluded_sources. The
  Corteva ToU (shared across pioneer.com / hoegemeyer.com / therightseed.com
  / corteva.com + the Vylor spinoff) bans scrapers + competitive use, so the
  whole Corteva family is one excluded ToU domain.
- docs_mcp/lessons.md: rating-scales updated with all 4 directions +
  an explicit cross-vendor warning (Latham 1=best vs Stine/Burrus higher=best
  — never compare raw numbers without _scale_direction).
- README + CLAUDE corpus inventory: now 2,268 variety + 6,787 trial records.

CI rebuilds the index from the committed corpus.
claude merged commit 84ad2b1de6 into main 2026-06-04 21:58:08 -04:00
claude deleted branch add-four-independents 2026-06-04 21:58:08 -04:00

Reference: justin/seed-mcp#17