seed-mcp/sources.json
justin ac40e05734
seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME
Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 sources instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL
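
The "--all = GREEN-only" dispatch rule above can be sketched as follows. This is a minimal illustration, not the contents of scrape/runner.py: the function name select_sources and the trimmed inline catalog are assumptions, and only the "name" and "verdict" fields from sources.json are used.

```python
import json
from typing import List, Optional


def select_sources(catalog: dict, name: Optional[str] = None,
                   run_all: bool = False) -> List[str]:
    """Return the source names a run should dispatch to (hypothetical helper)."""
    sources = catalog["sources"]
    if run_all:
        # --all is GREEN-only: yellow sources must be requested by name.
        return [s["name"] for s in sources if s["verdict"] == "green"]
    known = {s["name"] for s in sources}
    if name not in known:
        raise ValueError(f"unknown source: {name!r}")
    return [name]


# Trimmed stand-in for sources.json; verdicts copied from the catalog.
CATALOG = json.loads("""
{"sources": [
  {"name": "bayer_seeds", "verdict": "green"},
  {"name": "golden_harvest", "verdict": "green"},
  {"name": "nk", "verdict": "green"},
  {"name": "agripro", "verdict": "green"},
  {"name": "becks_pfr", "verdict": "yellow"},
  {"name": "becks_products", "verdict": "yellow"}
]}
""")

print(select_sources(CATALOG, run_all=True))
print(select_sources(CATALOG, name="becks_pfr"))
```

Keeping the verdict check in one place means a source can be promoted from yellow to green by editing sources.json alone, under the assumption that the runner reads the catalog at start-up.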

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.
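
One way the fallback behaviour could look, as a rough sketch only: the trigger terms, the lesson wording, and the function name fallback_for are all hypothetical, and the real text is planned for lessons.md.

```python
from typing import Optional

# Hypothetical lesson text; the actual wording is planned for lessons.md.
PIONEER_FALLBACK = (
    "Pioneer (Corteva) varieties are not in this corpus because the vendor's "
    "ToS bans automated collection. See pioneer.com or a local dealer."
)

# Hypothetical trigger terms for the excluded vendor.
EXCLUDED_TERMS = ("pioneer", "corteva")


def fallback_for(query: str) -> Optional[str]:
    """Return the curated lesson when a query names the excluded vendor."""
    lowered = query.lower()
    if any(term in lowered for term in EXCLUDED_TERMS):
        return PIONEER_FALLBACK
    # None means: proceed with the normal corpus search.
    return None


print(fallback_for("Pioneer corn hybrid for drought"))
print(fallback_for("DEKALB corn hybrid for drought"))
```

Returning a lesson rather than raising keeps the tool call successful from the client's point of view, which matches the stated intent of not erroring on Pioneer queries.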

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:28:49 -04:00


{
"_description": "seed-mcp source catalog. Each scraper module under scrape/sources/ corresponds to one entry. Run via `python -m scrape.runner --source <name>`. The MCP container bakes this file in so corpus_status / list_versions can reflect provenance without re-scraping.",
"_pioneer_excluded": "Pioneer (Corteva) is intentionally absent. Per their ToS: 'you shall not use any manual or automated software, devices or other processes (including but not limited to spiders, robots, scrapers, crawlers, avatars, data mining tools or the like) to scrape or download data from the Services'. The MCP returns a curated fallback lesson directing the user to pioneer.com / a local dealer.",
"sources": [
{
"name": "bayer_seeds",
"vendor": "Bayer",
"brands": ["DEKALB", "Asgrow", "WestBred"],
"crops": ["corn", "soybeans", "wheat"],
"verdict": "green",
"expected_count": 475,
"base_url": "https://cropscience.bayer.us",
"scope_filter": "All listed varieties; no regional filter applied at scrape time (regional recommendations parsed into sidecar so the MCP can filter at search time).",
"tos_check_date": "2026-05-24",
"tos_note": "robots.txt explicitly whitelists RAG/LLM use cases. Same legal stance as crop-chem-docs scraper."
},
{
"name": "golden_harvest",
"vendor": "Syngenta",
"brands": ["Golden Harvest"],
"crops": ["corn", "soybeans"],
"verdict": "green",
"expected_count": 175,
"base_url": "https://www.goldenharvestseeds.com",
"scope_filter": "All sitemap-listed corn + soybean varieties.",
"tos_check_date": "2026-05-25",
"schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) — resolve live URL from product HTML, not sitemap entry."
},
{
"name": "nk",
"vendor": "Syngenta",
"brands": ["NK"],
"crops": ["corn", "soybeans"],
"verdict": "green",
"expected_count": 29,
"base_url": "https://www.syngenta-us.com",
"pdf_cdn": "https://assets.syngentaebiz.com/pdf/techsheets/",
"scope_filter": "All NK corn + soy varieties. No wheat (NK doesn't sell wheat in US).",
"tos_check_date": "2026-05-24",
"schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only — need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
},
{
"name": "agripro",
"vendor": "Syngenta",
"brands": ["AgriPro"],
"crops": ["wheat", "barley"],
"verdict": "green",
"expected_count": 24,
"base_url": "https://www.agriprowheat.com",
"scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
"tos_check_date": "2026-05-24",
"schema_notes": "Drupal Views form; server-rendered HTML. CoAXium trait flag is implicit in product family; Clearfield/CL2 trait IS in this catalog."
},
{
"name": "becks_pfr",
"vendor": "Beck's Hybrids",
"brands": ["Beck's PFR"],
"crops": ["corn", "soybeans", "wheat"],
"verdict": "yellow",
"expected_count": 2089,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
"scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials — fungicide timing, planting-date studies, hybrid-by-population, etc.",
"tos_check_date": "2026-05-24",
"schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records — the chunk_0 includes the study's tl;dr finding."
},
{
"name": "becks_products",
"vendor": "Beck's Hybrids",
"brands": ["Beck's"],
"crops": ["corn", "soybeans", "wheat"],
"verdict": "yellow",
"expected_count": 860,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
"scope_filter": "All Beck's product records — corn + soy + wheat. Identity + RM/MG only.",
"tos_check_date": "2026-05-24",
"schema_notes": "Sanity GROQ exposes identity (name, RM/MG, basic traits) but agronomic + disease ratings are SeedIQ-gated (requires browser cookie). Deferred until the SeedIQ XHR endpoint is captured from a logged-in browser session. Without ratings, products are reference-only; the MCP can confirm 'Beck's has hybrid X at RM 112 with Enlist trait' but not 'rate it against drought'."
}
],
"_excluded_sources": [
{
"name": "pioneer",
"vendor": "Corteva",
"verdict": "red",
"reason": "Explicit ToS prohibits automated scraping. Dealer locator at /us/sales-representatives/my-local-team.html is login-gated; no public API for dealer contact info. The MCP returns a curated fallback lesson instead of erroring."
}
]
}
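
The disease-scale normalization mentioned in the golden_harvest schema_notes can be illustrated with a small sketch. It assumes integer ratings on a 1-9 scale and a per-source flag saying which end is best; the function names and the trait keys in the example are hypothetical, not taken from the scraper.

```python
from typing import Dict, Optional

SCALE_MIN, SCALE_MAX = 1, 9


def flip_rating(rating: int) -> int:
    """Mirror a rating on the 1-9 scale: 1 <-> 9, 2 <-> 8, 5 stays 5."""
    if not SCALE_MIN <= rating <= SCALE_MAX:
        raise ValueError(f"rating out of range: {rating}")
    return SCALE_MIN + SCALE_MAX - rating


def normalize_ratings(ratings: Dict[str, Optional[int]],
                      source_best: int) -> Dict[str, Optional[int]]:
    """Return ratings on the 9 = best convention; None (unrated) passes through."""
    if source_best not in (SCALE_MIN, SCALE_MAX):
        raise ValueError("source_best must be 1 or 9")
    if source_best == SCALE_MAX:
        # Already on the target convention; return a copy unchanged.
        return dict(ratings)
    return {k: (None if v is None else flip_rating(v))
            for k, v in ratings.items()}


# Hypothetical trait keys, for illustration only.
example = normalize_ratings({"gray_leaf_spot": 2, "goss_wilt": None},
                            source_best=1)
print(example)
```

Recording source_best alongside each source would also give the chunk_0 preamble a single place to read the original direction from.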