seed-mcp/sources.json
claude 89eea0f2b4 Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports
ProHarvest Seeds (independent Corn Belt brand, proharvestseeds.com) exposes
a public, no-auth WordPress REST API — cleaner ingestion than the HTML-only
independents. Two new sources:

- `proharvest` (variety identity, 119 row-crop varieties: 70 corn / 47 soy /
  2 wheat). Enumerated via /wp/v2/seed (seed-type taxonomy), agronomics
  parsed from each /seed/<slug>/ detail page into structured
  characteristics_groups so the ratings actually embed. Mixed scale: disease
  1-9 numeric (9=best, no flip), agronomic/general qualitative, soil HR/R.

- `proharvest_plots` (trials, data_type=trial, 161 plots, 2024+2025). Per-
  cooperator harvest reports via the custom /wp-json/proharvest/v1/plots?y=
  endpoint + PDF table extraction. Many are cross-vendor head-to-head
  (ProHarvest/Apex vs Pioneer/DEKALB/Beck's/Channel/Wyffels). Handles ruled
  tables, unruled tables (text-line fallback; soy reports drop the Test Wt.
  column, so rows carry 4 numerics instead of 5), and off-template third-party
  reports (a sanity gate falls back to verbatim raw text so junk rows never
  ship). Image-only PDFs are skipped and counted.
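
A minimal sketch of the unruled-table fallback and sanity gate described above, not the repo's extractor. The column order, the plausibility bounds, the helper names (`parse_row`, `row_is_sane`, `extract_plot`) and the sample rows are all assumptions; only the 5-vs-4 trailing-numeric count and the fall-back-to-verbatim behaviour come from the notes.

```python
import re

NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")

# Assumed column order; only the 5 (corn) vs 4 (soy) numeric count is from the notes.
COLUMNS = {
    "corn": ["Moisture", "Test Wt", "Yield", "Adj Yield", "Rank"],
    "soybeans": ["Moisture", "Yield", "Adj Yield", "Rank"],
}
# Assumed plausibility bounds (bu/ac) for the per-row gate.
YIELD_RANGE = {"corn": (20.0, 400.0), "soybeans": (5.0, 150.0)}


def parse_row(line, crop):
    """Peel the expected number of numeric tokens off the right end of a line."""
    cols = COLUMNS[crop]
    tokens = line.split()
    if len(tokens) <= len(cols):
        return None
    tail = tokens[-len(cols):]
    if not all(NUM_RE.fullmatch(t) for t in tail):
        return None  # header or free-text line
    return {
        "product": " ".join(tokens[:-len(cols)]),
        "metrics": dict(zip(cols, (float(t) for t in tail))),
    }


def row_is_sane(row, crop):
    """Per-row gate: yield and moisture must fall in a plausible range."""
    lo, hi = YIELD_RANGE[crop]
    m = row["metrics"]
    return lo <= m["Yield"] <= hi and 0.0 < m["Moisture"] < 50.0


def extract_plot(text, crop):
    """Per-plot gate: any implausible row sends the whole plot to verbatim."""
    parsed = [r for r in (parse_row(ln, crop) for ln in text.splitlines()) if r]
    if len(parsed) < 2 or not all(row_is_sane(r, crop) for r in parsed):
        return {"raw_text": text}
    return {"results": parsed}
```

Peeling a fixed count from the right keeps a numeric product code such as `5510` in the label instead of treating it as a metric. An off-template report with extra columns shifts values into the wrong slots, fails the row gate, and ships as raw text.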

- rag/chunk.py: route proharvest_plots through the shared cross-vendor plot
  renderer (structured) / verbatim body (raw_text fallback).
- sources.json + lessons.md (rating-scales, trial-data).
- README/CLAUDE.md corpus inventory brought current (it had drifted: bayer
  931 not 475; ebberts/lg/agrigold were unlisted). New totals: 1,645 variety
  + 6,787 trial records.
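
The chunk routing in the first bullet above can be pictured roughly as follows. This is a hedged sketch, not the code in rag/chunk.py: `PLOT_SOURCES`, `render_structured` and `render_chunk` are hypothetical names, and the output line format is invented. It assumes the sidecar shape `results:[{rank,brand,product,traits,metrics}]` with `Yield` as the canonical metrics key.

```python
# Hypothetical stand-ins for the shared cross-vendor plot renderer and its source list.
PLOT_SOURCES = {
    "gh_plot_reports",
    "lg_plot_reports",
    "agrigold_plot_reports",
    "proharvest_plots",
}


def render_structured(sidecar, top_n=5):
    """Render the top-N rows by the canonical 'Yield' metric, one per line."""
    rows = sorted(
        sidecar["results"],
        key=lambda r: r["metrics"]["Yield"],
        reverse=True,
    )
    return "\n".join(
        f'{r["brand"]} {r["product"]}: {r["metrics"]["Yield"]} bu/ac'
        for r in rows[:top_n]
    )


def render_chunk(source, sidecar):
    """Structured rows go through the shared renderer; otherwise ship verbatim."""
    if source in PLOT_SOURCES and sidecar.get("results"):
        return render_structured(sidecar)
    return sidecar.get("raw_text", "")
```

Keeping `Yield` as the single canonical key is what lets one top-N picker serve every plot source.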

robots.txt permissive (only search + /dealer-* disallowed); no ToS
automation clause. CI rebuilds the index from the committed corpus.
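
One way to re-check the robots.txt claim offline is Python's standard `urllib.robotparser`, fed rules by hand. The rule list below is a simplified transcription of the summary above, not the live file: the trailing wildcards are dropped because this parser does prefix matching, and the `/?s=` and AdsBot rules are omitted. The `allowed` helper and the user-agent string are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Simplified, hand-transcribed rules (assumption: not the site's literal robots.txt).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /search/",
    "Disallow: /dealer-files/",
    "Disallow: /dealer-section/",
]


def allowed(path, agent="seed-mcp-scraper"):
    """Return True if the given path may be fetched under ROBOTS_LINES."""
    rp = RobotFileParser()
    rp.parse(ROBOTS_LINES)  # parse in-memory lines; no network request
    return rp.can_fetch(agent, "https://proharvestseeds.com" + path)
```

For example, `allowed("/wp-json/wp/v2/seed")` and `allowed("/seed/example-slug/")` return `True`, while `allowed("/dealer-files/x.pdf")` returns `False`.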
2026-06-04 21:04:33 -04:00

227 lines
12 KiB
JSON

{
"_description": "seed-mcp source catalog. Each scraper module under scrape/sources/ corresponds to one entry. Run via `python -m scrape.runner --source <name>`. The MCP container bakes this file in so corpus_status / list_versions can reflect provenance without re-scraping.",
"_pioneer_excluded": "Pioneer (Corteva) is intentionally absent. Per their ToS: 'you shall not use any manual or automated software, devices or other processes (including but not limited to spiders, robots, scrapers, crawlers, avatars, data mining tools or the like) to scrape or download data from the Services'. The MCP returns a curated fallback lesson directing the user to pioneer.com / a local dealer.",
"sources": [
{
"name": "bayer_seeds",
"vendor": "Bayer",
"brands": [
"DEKALB",
"Asgrow",
"WestBred"
],
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "green",
"expected_count": 475,
"base_url": "https://cropscience.bayer.us",
"scope_filter": "All listed varieties; no regional filter applied at scrape time (regional recommendations parsed into sidecar so the MCP can filter at search time).",
"tos_check_date": "2026-05-24",
"tos_note": "robots.txt explicitly whitelists RAG/LLM use cases. Same legal stance as crop-chem-docs scraper."
},
{
"name": "golden_harvest",
"vendor": "Syngenta",
"brands": [
"Golden Harvest"
],
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 175,
"base_url": "https://www.goldenharvestseeds.com",
"scope_filter": "All sitemap-listed corn + soybean varieties.",
"tos_check_date": "2026-05-25",
"schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) \u2014 resolve live URL from product HTML, not sitemap entry."
},
{
"name": "nk",
"vendor": "Syngenta",
"brands": [
"NK"
],
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 29,
"base_url": "https://www.syngenta-us.com",
"pdf_cdn": "https://assets.syngentaebiz.com/pdf/techsheets/",
"scope_filter": "All NK corn + soy varieties. No wheat (NK doesn't sell wheat in US).",
"tos_check_date": "2026-05-24",
"schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only \u2014 need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
},
{
"name": "agripro",
"vendor": "Syngenta",
"brands": [
"AgriPro"
],
"crops": [
"wheat",
"barley"
],
"verdict": "green",
"expected_count": 24,
"base_url": "https://www.agriprowheat.com",
"scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW \u2014 Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
"tos_check_date": "2026-05-24",
"schema_notes": "Drupal Views form; server-rendered HTML. CoAXium trait flag is implicit in product family; Clearfield/CL2 trait IS in this catalog."
},
{
"name": "becks_pfr",
"vendor": "Beck's Hybrids",
"brands": [
"Beck's PFR"
],
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "yellow",
"expected_count": 2089,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
"scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials \u2014 fungicide timing, planting-date studies, hybrid-by-population, etc.",
"tos_check_date": "2026-05-24",
"schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records \u2014 the chunk_0 includes the study's tl;dr finding."
},
{
"name": "becks_products",
"vendor": "Beck's Hybrids",
"brands": [
"Beck's"
],
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "yellow",
"expected_count": 860,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
"scope_filter": "All Beck's product records \u2014 corn + soy + wheat. Identity + RM/MG only.",
"tos_check_date": "2026-05-24",
"schema_notes": "Sanity GROQ exposes identity (name, RM/MG, basic traits) but agronomic + disease ratings are SeedIQ-gated (requires browser cookie). Deferred until the SeedIQ XHR endpoint is captured from a logged-in browser session. Without ratings, products are reference-only; the MCP can confirm 'Beck's has hybrid X at RM 112 with Enlist trait' but not 'rate it against drought'."
},
{
"name": "gh_plot_reports",
"vendor": "Syngenta",
"brand_aggregator": "Golden Harvest publishes",
"crops": [
"corn",
"soybeans",
"silage"
],
"verdict": "green",
"expected_count": 4618,
"base_url": "https://www.goldenharvestseeds.com",
"scope_filter": "sitemap-listed plot reports 2024 and 2025 (4,618 reports). 2023 (3,619 reports) deferred to a future pass \u2014 most recent data is most relevant for current decisions.",
"tos_check_date": "2026-05-25",
"schema_notes": "Cross-vendor head-to-head yield trials at specific state/year/site. Each report lists products from multiple brands (NK, DEKALB, GH, etc.) with rank, yield, %MST, test weight, gross revenue. URL: /<crop>/plot-report/<state>/<year>/<id>. Same site/auth as golden_harvest variety scraper.",
"data_type": "trial"
},
{
"name": "agripro_trials",
"vendor": "Syngenta",
"brand_aggregator": "AgriPro publishes",
"crops": [
"wheat"
],
"verdict": "green",
"expected_count": 38,
"base_url": "https://agriprowheat.com",
"scope_filter": "PDF trial summaries linked from /trials-data. Regional wheat performance (PNW, Western Plains, NE Colorado, etc.).",
"tos_check_date": "2026-05-25",
"schema_notes": "PDF tables of varieties tested per region per year. pdfplumber for table extraction.",
"data_type": "trial"
},
{
"name": "lg_plot_reports",
"vendor": "AgReliant Genetics",
"brand_aggregator": "LG Seeds publishes",
"crops": [
"corn",
"soybeans",
"sorghum",
"silage"
],
"verdict": "green",
"expected_count": 1310,
"base_url": "https://lgseeds.com",
"scope_filter": "Cross-vendor plot-report data exposed by lgseeds.com/performance/{crop} JSON XHR (GetPlots + GetPlotData). 2024+2025 baseline (older 2023 deferred to --include-2023). Top-5 hybrids per plot (LG + competitors), not full ranking — that's what LG publishes publicly.",
"tos_check_date": "2026-05-26",
"schema_notes": "JSON API behind www→apex redirect; POST GetPlots returns list with year+coords, GET GetPlotData returns top-5 + state/cooperator/dates. 301 redirect drops POST body so hit `lgseeds.com` (no www). Top-5 means each plot is partial coverage — multiple plots cover the same site/cooperator with different LG hybrid lineups.",
"data_type": "trial"
},
{
"name": "agrigold_plot_reports",
"vendor": "AgReliant Genetics",
"brand_aggregator": "AgriGold publishes",
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 1006,
"base_url": "https://www.agrigold.com",
"scope_filter": "Cross-vendor plot reports at agrigold.com/{crop}/performance/{crop}-yield-results, year-filtered via ?harvestYear=. Detail page exposes FULL ranking (not just top-5) plus rich plot management (tillage, prev crop, fungicide, soil type, irrigation). 2024+2025 baseline.",
"tos_check_date": "2026-05-26",
"schema_notes": "Server-rendered HTML; detail URL is /{crop_url}/performance/{slug}/{id}. Soybeans URL slug is singular: /soybeans/performance/soybean-yield-results/{id}. Columns: Rank, Brand, Product, Trait, Ck, H20 (moisture %), Test Wt., Yield, Adj Yield. Most metadata-rich of the three trial sources.",
"data_type": "trial"
},
{
"name": "proharvest",
"vendor": "ProHarvest Seeds",
"brands": [
"ProHarvest Seeds",
"Apex"
],
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "green",
"expected_count": 119,
"base_url": "https://proharvestseeds.com",
"scope_filter": "Row-crop varieties via the WordPress REST API (/wp/v2/seed) filtered to seed-type corn-hybrid (70) / soybean (47) / wheat (2). Excludes alfalfa/forage/grass/cover-crop/sweet-corn/blend terms (out of scope for the row-crop advisor).",
"tos_check_date": "2026-06-04",
"tos_note": "robots.txt permissive — disallows only /?s=, /search/, /dealer-files/*, /dealer-section/* and AdsBot. The /wp-json/ REST API + /seed/ detail pages are open. No ToS automation clause found.",
"schema_notes": "REST list enumerates id/slug/title/link + seed-trait taxonomy, but acf+content are NOT registered to REST (empty), so agronomic data is parsed from each /seed/<slug>/ detail page (<h2> spec sections of <strong>label</strong><div>value</div> pairs). Parsed into characteristics_groups so ratings embed (unlike ebberts_seeds, which left them body-only). Ratings MIXED: Disease Tolerance 1-9 numeric (9=best, same direction as Bayer/NK — NO flip; NA=not rated); General/Agronomic qualitative (Excellent/Very Good/Good/Average); Soil Adaptability HR/R. RM in '<h1>Maturity: N Days</h1>'."
},
{
"name": "proharvest_plots",
"vendor": "ProHarvest Seeds",
"brand_aggregator": "ProHarvest Seeds publishes",
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 161,
"base_url": "https://proharvestseeds.com",
"scope_filter": "Per-cooperator harvest-report plots via the custom GET /wp-json/proharvest/v1/plots?y=<year> endpoint, 2024+2025 baseline (~162 plots). Older 2015-2023 deferred to --include-old. Each plot links a harvest-report PDF with a head-to-head results table.",
"tos_check_date": "2026-06-04",
"schema_notes": "Same sidecar shape as agrigold/lg/gh plot reports (results:[{rank,brand,product,traits,metrics}]) — routed through _render_gh_plot_chunk (proharvest_plots added to that source list in rag/chunk.py). API gives clean location metadata (city/state/county/year/product/lat-long/PDF); PDF gives the management block (planted/harvested/prev-crop/population/tillage/irrigation) + results. THREE PDF realities: ruled tables (extract_tables splits columns), unruled tables (text-line fallback anchored on trailing numerics; soy reports drop the Test Wt. column so rows carry 4 vs 5 numerics), and off-template third-party reports (e.g. a university land-lab with extra RM/harvest-weight columns) which a per-row + per-plot sanity gate redirects to verbatim raw_text so cross-vendor yields aren't corrupted or lost. Many plots are cross-vendor (Pioneer/DEKALB/Becks/Channel/Wyffels vs ProHarvest/Apex). Image-only PDFs (no text layer) are skipped + counted (no silent cap). metrics key 'Yield' is canonical so the chunker top-N picker finds it.",
"data_type": "trial"
}
],
"_excluded_sources": [
{
"name": "pioneer",
"vendor": "Corteva",
"verdict": "red",
"reason": "Explicit ToS prohibits automated scraping. Dealer locator at /us/sales-representatives/my-local-team.html is login-gated; no public API for dealer contact info. The MCP returns a curated fallback lesson instead of erroring."
}
]
}
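
The `proharvest` schema_notes describe detail pages as `<h2>` spec sections holding `<strong>label</strong><div>value</div>` pairs. A minimal sketch of turning that markup into grouped characteristics, using only the standard library, might look like the following. The `SpecParser` class, the `parse_specs` helper, the output shape, and the sample labels are assumptions for illustration; the real scraper and the exact `characteristics_groups` layout may differ.

```python
from html.parser import HTMLParser


class SpecParser(HTMLParser):
    """Collect {group: {label: value}} from <h2> / <strong> / <div> runs."""

    def __init__(self):
        super().__init__()
        self.groups = {}
        self._group = None    # current <h2> section title
        self._label = None    # pending <strong> label awaiting its <div> value
        self._capture = None  # tag whose text is being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Only capture a <div> when a label is pending, so wrapper divs are ignored.
        if tag in ("h2", "strong") or (tag == "div" and self._label is not None):
            self._capture, self._buf = tag, []

    def handle_data(self, data):
        if self._capture:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buf).strip()
        self._capture = None
        if tag == "h2":
            self._group = text
            self.groups[text] = {}
        elif tag == "strong":
            self._label = text
        elif self._group is not None:  # closing <div> holding the value
            self.groups[self._group][self._label] = text
            self._label = None


def parse_specs(html):
    parser = SpecParser()
    parser.feed(html)
    parser.close()
    return parser.groups
```

Values stay as strings here (`"7"`, `"NA"`, `"HR"`), which suits the mixed numeric and qualitative scales; any conversion would happen later.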