Add 4 independent seed brands: Latham + Stine + 1st Choice + Burrus (+623 varieties) (#17)
Image rebuild (skip scrape) / build (push) Successful in 4m44s
Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #17.
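The "+623 varieties" figure in the title can be cross-checked against the per-source `expected_count` values and the per-crop breakdowns given in each `scope_filter` in the diff. The sketch below is only an illustration: the numbers are copied from the diff, while the `SOURCES` table and the `check_counts` helper are hypothetical names, not part of the repository.

```python
# Per-crop breakdowns as stated in each scope_filter, paired with the
# expected_count from the same entry. SOURCES and check_counts are
# hypothetical names used only for this check.
SOURCES = {
    "latham": ({"corn": 155, "soybeans": 109}, 264),
    "stine": ({"corn": 58, "soybeans": 159}, 217),
    "first_choice": ({"corn": 52, "soybeans": 22, "wheat": 4}, 78),
    "burrus": ({"corn": 38, "soybeans": 26}, 64),
}


def check_counts(sources):
    """Verify each per-crop breakdown, then return the grand total."""
    total = 0
    for name, (by_crop, expected) in sources.items():
        crop_sum = sum(by_crop.values())
        if crop_sum != expected:
            raise ValueError(f"{name}: crops sum to {crop_sum}, expected {expected}")
        total += expected
    return total


print(check_counts(SOURCES))  # 623
```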
@@ -213,6 +213,81 @@
"tos_check_date": "2026-06-04",
"schema_notes": "Same sidecar shape as agrigold/lg/gh plot reports (results:[{rank,brand,product,traits,metrics}]) — routed through _render_gh_plot_chunk (proharvest_plots added to that source list in rag/chunk.py). API gives clean location metadata (city/state/county/year/product/lat-long/PDF); PDF gives the management block (planted/harvested/prev-crop/population/tillage/irrigation) + results. THREE PDF realities: ruled tables (extract_tables splits columns), unruled tables (text-line fallback anchored on trailing numerics; soy reports drop the Test Wt. column so rows carry 4 vs 5 numerics), and off-template third-party reports (e.g. a university land-lab with extra RM/harvest-weight columns) which a per-row + per-plot sanity gate redirects to verbatim raw_text so cross-vendor yields aren't corrupted or lost. Many plots are cross-vendor (Pioneer/DEKALB/Becks/Channel/Wyffels vs ProHarvest/Apex). Image-only PDFs (no text layer) are skipped + counted (no silent cap). metrics key 'Yield' is canonical so the chunker top-N picker finds it.",
"data_type": "trial"
},
{
"name": "latham",
"vendor": "Latham Hi-Tech Seeds",
"brands": [
"Latham Hi-Tech Seeds"
],
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 264,
"base_url": "https://www.lathamseeds.com",
"scope_filter": "Independent family brand (Alexander, IA — Upper Midwest). Row-crop varieties (155 corn + 109 soy; no wheat). Alfalfa taxonomy term is count=0 in the catalog.",
"tos_check_date": "2026-06-04",
"tos_note": "robots.txt permissive (only /wp-admin/ disallowed, no Crawl-delay); no Terms-of-Use anti-scraping clause located. Public WordPress catalog. UA seed-mcp-scraper, ~1.5s/req.",
"schema_notes": "WordPress. Enumerate the variety list via /wp-json/wp/v2/varieties (paginated, open) or variety-sitemap1.xml; crop/trait/year come from the variety_crop / variety_trait / variety_year taxonomies (acf+content are NOT in REST). Agronomic specs are parsed from each /products/<slug>/ detail page (server-rendered <li><span>label</span><span>value</span>); parsed into characteristics_groups. SCALE: numeric ~1-9 where LOWER = BETTER (1 = best/most-tolerant/most-resistant) — REVERSED from Bayer; derived empirically (no on-page legend) by cross-referencing Product Overview prose vs values. Categorical values (SCN source 'PI 88788', Phytophthora 'Rps 1k', Anthracnose 'ASR') pass through verbatim; NA/blank = not rated. 2 corn pages are identity-only (source-side empty spec sections)."
},
{
"name": "stine",
"vendor": "Stine Seed Company",
"brands": [
"Stine"
],
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 217,
"base_url": "https://www.stineseed.com",
"scope_filter": "Largest independent US seed company (Adel, IA). Row-crop varieties (58 corn + 159 soy; no wheat — Stine doesn't breed wheat).",
"tos_check_date": "2026-06-04",
"tos_note": "No robots.txt (404). /legal/ has only a standard copyright/no-reproduce clause — NO automation/scraping ban (same posture as other corpus vendors). UA seed-mcp-scraper, ~1.5s/req.",
"schema_notes": "Custom PHP site (NOT WordPress). Enumerate via sitemap.xml (filter to 4-segment /{crop}/traits/{slug}/{code}/ URLs → 58 corn + 159 soy); the /ajax/{corn,soybean}-comparison/filter_products.php endpoint is wired as a fallback (returns the full historical/discontinued set). Detail pages are clean server-rendered HTML (<section class=agronomic-details> → <ul class=agronomy-chart> → <li><strong>label</strong><span class=value>value</span>); parsed into characteristics_groups. SCALE: corn agronomic+disease 1-9 numeric, 9 = Excellent/best (HIGHER = better, same direction as Bayer; read from the on-page corn legend ul.tm-legend); soybeans are QUALITATIVE (Excellent/Very Good/Good; disease Resistant/Strong/Good/Susceptible) with SCN source + RPS gene passed through. Corn Maturity is an RM range (e.g. '79-81') → representative int + range kept as a characteristic; soy MG from the 2-3 digit code (/10, or /100 for leading-zero ultra-earlies)."
},
{
"name": "first_choice",
"vendor": "1st Choice Seeds",
"brands": [
"1st Choice Seeds"
],
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "green",
"expected_count": 78,
"base_url": "https://www.1stchoiceseeds.com",
"scope_filter": "Employee-owned independent (Rushville, IN — Eastern Corn Belt). 52 corn + 22 soy + 4 wheat. Fills the Indiana gap (Beck's already covered; Ebbert's covers OH/IN).",
"tos_check_date": "2026-06-04",
"tos_note": "robots.txt permissive (only /wp-admin/, allows admin-ajax.php, no Crawl-delay); NO Terms-of-Use page exists at all, no bot wall. UA seed-mcp-scraper, ~1.2s/req.",
"schema_notes": "WordPress, but the catalog custom post types are NOT exposed via WP REST (rest_no_route). Enumerate via the per-crop sitemaps (corn-hybrids-sitemap.xml / soybeans-sitemap.xml / wheat-sitemap.xml) → fetch each /corn-hybrids|/soybeans|/wheat/<slug>/ server-rendered HTML spec block → characteristics_groups. SCALE: 0-10, HIGHER = better (legend: 0-4 Below Average, 5 Average, 6 Good, 7 Very Good, 8 Excellent, 9-10 Superior). The numeric bar is d3-rendered (not in markup); 1st Choice also prints the qualitative word in HTML, which IS captured. CAVEATS: ~40 of 52 corn pages are THIN at the source (only Seedling Vigor in the ratings panel; still carry RM/GDU/management/populations); the 12 newer GT/PC hybrids have the full panel. 4 wheat pages are identity-only (private-label, no spec block)."
},
{
"name": "burrus",
"vendor": "Burrus Seed",
"brands": [
"Burrus",
"Power Plus",
"DONMARIO"
],
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 64,
"base_url": "https://burrusseed.com",
"scope_filter": "Independent family co (Arenzville, IL, since 1935; IL/IN/IA/MO/WI). 38 corn + 26 soy. Sells own Burrus brand + distributed Power Plus (corn) and DONMARIO (soy) lines; brand stored per-record.",
"tos_check_date": "2026-06-04",
"tos_note": "Data via the Seedware JSON API (burrus25.seedware.net). burrusseed.com robots.txt blocks ~33 NAMED AI/scraper bots and carries Content-signal: ai-train=no + Crawl-delay 10; User-agent:* IS allowed and the ToS has NO scraping clause. Operator chose to include this source despite the ai-train=no signal; scraper uses a non-blacklisted UA (seed-mcp-scraper) and honors Crawl-delay 10 (>=10s between requests).",
"schema_notes": "Seedware JSONP API: GET https://burrus25.seedware.net/app/_queries/crop_varieties.php?crop_pkey=101(corn)|102(soy)&callback=cb (requires a callback param + Referer https://burrusseed.com/; strip the JSONP wrapper). ~40 fields/record incl. brand, maturity (RM/MG), released, and many stat_* ratings → mapped into characteristics_groups: DISEASE RATINGS (gray leaf spot, tar spot, BSR, SDS, phytophthora), AGRONOMIC CHARACTERISTICS (drought, greensnap, stalk/root strength, standability, emergence, etc.), HERBICIDE TOLERANCE (glyphosate/glufosinate/2,4-D/dicamba/FOPs, Yes/No) + Bt insect-protection (Yes/No). SCALE: numeric agronomic+disease 1-10, 10 = best/most-tolerant (HIGHER = better; observed 4-10); NR/blank/0/'-' = not rated. Per-variety tech-sheet PDFs exist (getTechSheet/<pkey>) — not ingested this pass."
}
],
"_excluded_sources": [
@@ -221,6 +296,12 @@
"vendor": "Corteva",
"verdict": "red",
"reason": "Explicit ToS prohibits automated scraping. Dealer locator at /us/sales-representatives/my-local-team.html is login-gated; no public API for dealer contact info. The MCP returns a curated fallback lesson instead of erroring."
},
{
"name": "hoegemeyer",
"vendor": "Corteva",
"verdict": "red",
"reason": "Corteva consolidation brand (absorbing Seed Consultants, Dairyland, Nu-Tech, Terral for 2027; part of the upcoming 'Vylor' spinoff with Pioneer + Brevant). hoegemeyer.com's footer 'Terms of Use' links to corteva.com/terms-and-conditions.html (= /us/terms-of-use.html) — the SAME Corteva ToU that bans 'spiders, robots, scrapers, crawlers, data mining tools' (clause e) and building 'a similar or competitive service' (clause f). Same legal basis as the Pioneer exclusion. Even setting ToS aside, data is a single 15.8 MB Seed-Guide PDF behind an Imperva-walled AEM SPA. TREAT ALL *.corteva.com / corteva.us / pioneer.com / hoegemeyer.com / therightseed.com and the Vylor brands as ONE ToU domain: as the legacy regionals (Dairyland, Nu-Tech, Terral, Seed Consultants) migrate to Corteva's ToU, they fall under the same exclusion. Legitimate Corteva-data paths: an official Corteva data license (openinnovation@corteva.com) or university-extension trial data."
}
]
}
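
The four new `schema_notes` entries describe rating scales that differ in range and direction: Latham is 1-9 with lower = better, Stine corn is 1-9 with higher = better, 1st Choice is 0-10 with higher = better, and Burrus is 1-10 with higher = better. A minimal sketch of how a consumer might put them on one axis follows. The `SCALES` table and `normalize_rating` function are hypothetical and do not reflect how the repository's chunker actually handles ratings; the 0.0-1.0 output range is likewise an assumption chosen for illustration.

```python
from typing import Optional

# (low, high, higher_is_better) per source, taken from the schema_notes.
# The "stine" row applies to corn only; Stine soybean ratings are
# qualitative words and are not handled here. Hypothetical table.
SCALES = {
    "latham": (1, 9, False),
    "stine": (1, 9, True),
    "first_choice": (0, 10, True),
    "burrus": (1, 10, True),
}

# Markers the notes list as "not rated" (NA/blank for Latham,
# NR/blank/'-' for Burrus).
NOT_RATED = {"", "NA", "NR", "-"}


def normalize_rating(source: str, raw: str) -> Optional[float]:
    """Map a raw rating onto 0.0 (worst) .. 1.0 (best), or None if unrated."""
    value = raw.strip()
    if value in NOT_RATED:
        return None
    try:
        number = float(value)
    except ValueError:
        # Categorical values such as 'PI 88788' are not numeric ratings.
        return None
    low, high, higher_is_better = SCALES[source]
    if not low <= number <= high:
        # Also covers Burrus' 0 = not rated, since its scale starts at 1.
        return None
    fraction = (number - low) / (high - low)
    return fraction if higher_is_better else 1.0 - fraction
```

With this sketch, a Latham `1` and a Stine corn `9` both map to `1.0`, which makes the reversed Latham direction explicit rather than leaving it to each caller.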
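
The Burrus `schema_notes` say the Seedware endpoint returns JSONP and that the wrapper must be stripped. A minimal sketch of that step, operating on an already-fetched response body, is shown here. The `strip_jsonp` helper is hypothetical, the sample payload is made up, and the assumption that the payload is a list of records is not confirmed by the diff; only the `cb` callback name and the `brand` and `maturity` field names come from the notes.

```python
import json
import re

# Matches callback(<payload>) with an optional trailing semicolon.
# The greedy group backtracks to the last closing parenthesis, so
# parentheses inside the JSON payload itself are preserved.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def strip_jsonp(body: str):
    """Remove a JSONP callback wrapper and parse the JSON inside it."""
    match = _JSONP.match(body)
    if match is None:
        raise ValueError("response is not a JSONP wrapper")
    return json.loads(match.group(1))


# Made-up payload for illustration; real records carry ~40 fields.
sample = 'cb([{"brand": "Power Plus", "maturity": "112"}]);'
records = strip_jsonp(sample)
print(records[0]["brand"])  # Power Plus
```

Parsing with `json.loads` rather than evaluating the response keeps the scraper from executing anything the remote server sends.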