Add 4 independent seed brands: Latham + Stine + 1st Choice + Burrus (+623 varieties) #17
Adds four independent regional seed brands across the states you asked about (OH/IN/IL/IA), as variety-identity sources. Each parses agronomic + disease ratings into structured characteristics_groups so they actually embed (not body-only like the early Ebbert's pass).

- latham: WordPress REST enum + /products/<slug>/ detail HTML
- stine: sitemap.xml enum + /{crop}/traits/<slug>/<code>/ detail HTML
- first_choice: per-crop sitemap enum + detail HTML
- burrus: Seedware JSON API

623 new varieties. All validated through rag.chunk.chunks_from_variety: 0 errors, 0 short chunks; 6 identity-only pages from source-side data gaps (2 Latham corn, 4 1st Choice wheat). No chunk.py change needed; identity sources auto-route to chunks_from_variety.

Two brands assessed and rejected (recorded for the file):

- Wyffels Hybrids (IL): ToS §3.8 explicitly bans incorporating content into AI datasets.
- Seed Consultants (OH): Corteva-owned and being decommissioned.
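As a rough illustration of the difference between a page whose ratings embed and an identity-only page, here is a minimal sketch. The `Variety` record shape, `is_identity_only`, and `render_for_embedding` are hypothetical names invented for this example; the real schema and the real `chunks_from_variety` logic are not shown in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Variety:
    # Hypothetical record shape, assumed for illustration only.
    source: str
    name: str
    crop: str
    # group name -> {trait name -> rating as scraped}
    characteristics_groups: Dict[str, Dict[str, str]] = field(default_factory=dict)


def is_identity_only(v: Variety) -> bool:
    """True when no group holds any rating, so only identity fields can embed."""
    return not any(v.characteristics_groups.values())


def render_for_embedding(v: Variety) -> str:
    """Flatten identity plus every rating into one text block for a chunk."""
    lines = [f"{v.name} ({v.crop}, source: {v.source})"]
    for group, ratings in v.characteristics_groups.items():
        for trait, value in ratings.items():
            lines.append(f"{group} / {trait}: {value}")
    return "\n".join(lines)
```

A record with an empty `characteristics_groups` (or only empty groups) would render as a single identity line, which is the case the 6 identity-only pages fall into.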
Corteva exclusion formalized. Added hoegemeyer to _excluded_sources. hoegemeyer.com's footer ToS resolves to the same corteva.com ToU that bans "spiders, robots, scrapers, crawlers, data mining tools" (clause e) and building "a similar or competitive service" (clause f), the identical basis to the Pioneer exclusion. The note treats all *.corteva.com / pioneer.com / hoegemeyer.com / therightseed.com + the Vylor spinoff as one excluded ToU domain (so the legacy regionals Dairyland/Nu-Tech/Terral fall under it as they migrate). Legitimate Corteva-data paths (official license / university-extension trials) noted.

Scale-direction safety: the independents disagree on direction (Latham 1=best vs Stine-corn/Burrus higher=best). The rating-scales lesson now spells out every direction + an explicit "never compare raw numbers without _scale_direction" warning.

Robots/ToS: Latham/Stine/1st Choice permissive (no anti-scrape clause). Burrus robots carries ai-train=no + blocks named AI bots; operator (you) opted in; scraper uses a non-blacklisted UA + honors Crawl-delay 10.

Docs: README + CLAUDE inventory updated (2,268 variety + 6,787 trial records). CI (image-only.yml) rebuilds the Chroma+BM25 index from the committed corpus → image → Watchtower deploy.

Four independent regional brands across IA/IN/IL (variety-identity sources, each parsed into structured characteristics_groups so ratings embed):

- latham (264: 155 corn + 109 soy): Latham Hi-Tech Seeds, Alexander IA. WordPress REST enum (/wp-json/wp/v2/varieties) + /products/<slug>/ detail HTML. Scale 1-9 LOWER=better (reversed, like NK/AgriPro).
- stine (217: 58 corn + 159 soy): Stine Seed, Adel IA (largest US independent). sitemap enum + /{crop}/traits/<slug>/<code>/ detail HTML. Corn 1-9 (9=best); soy qualitative.
- first_choice (78: 52 corn + 22 soy + 4 wheat): 1st Choice Seeds, Rushville IN (employee-owned). Per-crop sitemap -> detail HTML. Scale 0-10 higher=better. ~40 older corn pages thin at source; wheat identity-only.
- burrus (64: 38 corn + 26 soy): Burrus Seed, Arenzville IL. Seedware JSON API. Scale 1-10 (10=best). Brands Burrus/Power Plus/DONMARIO. robots ai-train=no + named-bot blocks; operator opted in, scraper uses a non-blacklisted UA + honors Crawl-delay 10.

All 623 validated through rag.chunk.chunks_from_variety (0 errors; 6 identity-only pages from source gaps). No chunk.py change needed (identity sources auto-route to chunks_from_variety).

Docs:

- sources.json: 4 entries + Hoegemeyer added to _excluded_sources. The Corteva ToU (shared across pioneer.com / hoegemeyer.com / therightseed.com / corteva.com + the Vylor spinoff) bans scrapers + competitive use, so the whole Corteva family is one excluded ToU domain.
- docs_mcp/lessons.md: rating-scales updated with all 4 directions + an explicit cross-vendor warning (Latham 1=best vs Stine/Burrus higher=best; never compare raw numbers without _scale_direction).
- README + CLAUDE corpus inventory: now 2,268 variety + 6,787 trial records. CI rebuilds the index from the committed corpus.
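To make the "never compare raw numbers without _scale_direction" warning concrete, here is a minimal sketch of direction-aware normalization. The `SCALES` table and `normalize` helper are hypothetical and not part of this PR; the ranges and directions are taken from the text above, and applying each brand's scale to both its corn and soy is an assumption (Stine soy is qualitative and 1st Choice wheat is identity-only, so neither has an entry).

```python
from typing import Dict, Optional, Tuple

# (source, crop) -> (min, max, direction). Hypothetical table built from the
# scales listed in the PR text; per-crop coverage is an assumption.
SCALES: Dict[Tuple[str, str], Tuple[int, int, str]] = {
    ("latham", "corn"): (1, 9, "lower_better"),
    ("latham", "soy"): (1, 9, "lower_better"),
    ("stine", "corn"): (1, 9, "higher_better"),
    ("first_choice", "corn"): (0, 10, "higher_better"),
    ("first_choice", "soy"): (0, 10, "higher_better"),
    ("burrus", "corn"): (1, 10, "higher_better"),
    ("burrus", "soy"): (1, 10, "higher_better"),
}


def normalize(source: str, crop: str, raw: float) -> Optional[float]:
    """Map a raw rating to 0.0 (worst) .. 1.0 (best).

    Returns None when the source/crop pair has no numeric scale.
    """
    scale = SCALES.get((source, crop))
    if scale is None:
        return None
    lo, hi, direction = scale
    if not lo <= raw <= hi:
        raise ValueError(f"{raw} is outside the {lo}-{hi} scale for {source}/{crop}")
    frac = (raw - lo) / (hi - lo)
    # Flip reversed scales so that 1.0 always means "best".
    return 1.0 - frac if direction == "lower_better" else frac
```

Under these assumptions a Latham 1 and a Stine corn 9 both normalize to 1.0, whereas comparing the raw numbers would rank them at opposite ends.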