seed-mcp

Author	SHA1	Message	Date
justin	c737871c4c	Trial-data scrapers: gh_plot_reports + agripro_trials + search_trials tool This PR introduces TRIAL data — yield-performance results from real field trials — as a SEPARATE data type alongside variety identity. The two are complementary: search_docs → "What's the disease resistance of DKC62-08RIB?" (variety identity — what it IS) search_trials → "Which corn hybrid won the IA 2024 trials?" (performance data — how it PERFORMED) scrape/sources/gh_plot_reports.py — Golden Harvest plot reports - 4,618 expected (2024+2025; 2023 deferred to a backfill pass). - URL: /<crop>/plot-report/<state>/<year>/<plot_id> - Cross-vendor: each plot lists products from multiple brands (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side by side at one cooperator's field — the kind of independent comparison data Bayer doesn't publish itself. - Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for corn+soy, Ton/Acre + Milk + Beef columns for silage). - Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow. scrape/sources/agripro_trials.py — AgriPro regional trial PDFs - 14 unique PDFs (38 sitemap links deduped) at /trials-data - pdfplumber text extraction, region/year detection from filename - Verbatim PDF text preserved in chunk body so variety + yield number adjacency drives retrieval (AP Iliad's Aberdeen ID yield matches a query about "AP Iliad Idaho yield") rag/chunk.py — chunks_from_trial() dispatching by source - Plot reports: identity preamble + Top-5 by primary metric + full ranking table. Metric labels chosen from the data (corn/soy use "Yield", silage uses "Ton/Acre"). - AgriPro PDFs: identity preamble + verbatim trial body inline so per-location yields surface for region+variety queries. - Variety chunks get data_type="variety" metadata; trial chunks get data_type="trial". Single Chroma collection; the tool router filters by data_type rather than maintaining two collections. 
rag/index.py — dispatch by sidecar's data_type field rag/bm25.py — new filter columns (data_type, year, state) docs_mcp/server.py — sixth MCP tool: search_trials(crop?, state?, year?, product?, k=10) - Filters trial chunks via where={"data_type": "trial", ...} - Optional product substring post-filter for "DKC62-08RIB Iowa 2024" style searches - search_docs now defaults to data_type="variety" so trial chunks don't bleed into variety identity queries - Tool docstring routes the agent: "use lookup_variety to verify identity details on any trial winner you surface" NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented as deferred — the ASMX-SOAP shape returned empty XML on initial probe. Bayer per-variety yield data is not publicly indexed at all — documented in the trial-scope note (DEKALB/Asgrow trial data flows through Channel reps, not the web). AgRevival research books exist as 10 large annual PDFs but are deferred (low ROI per parse). Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618 Golden Harvest plot reports are scraping in background and will be added in a follow-up corpus-snapshot PR (~70 min ETA). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:19:03 -04:00
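The search_trials filtering described in commit c737871c4c (a `where` filter pinned to data_type="trial", plus an optional product substring post-filter) could look roughly like the sketch below. It is illustrative only and not the code in docs_mcp/server.py: the function names, the `crop` metadata key, the upper-casing of `state`, and the `{"text": ...}` hit shape are all assumptions. The `$and` wrapper follows Chroma's rule that a filter with several conditions must combine them under one operator.

```python
from typing import Any, Optional


def build_trial_where(crop: Optional[str] = None,
                      state: Optional[str] = None,
                      year: Optional[int] = None) -> dict[str, Any]:
    """Build a Chroma-style metadata filter restricted to trial chunks."""
    clauses: list[dict[str, Any]] = [{"data_type": "trial"}]
    if crop:
        clauses.append({"crop": crop.lower()})      # assumed metadata key
    if state:
        clauses.append({"state": state.upper()})    # assumed casing
    if year is not None:
        clauses.append({"year": year})
    # One condition can be passed as a bare dict; several need "$and".
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def product_post_filter(hits: list[dict[str, Any]],
                        product: Optional[str]) -> list[dict[str, Any]]:
    """Keep only hits whose text mentions the product (case-insensitive)."""
    if not product:
        return hits
    needle = product.lower()
    return [h for h in hits if needle in h["text"].lower()]
```

Running the substring match after retrieval keeps the metadata filter simple, at the cost of sometimes returning fewer than k results.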
justin	9ce920f622	agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy) agripro (24 varieties) - Drupal Views form scrape via /search-agripro-brand-varieties with explicit GET params (sidesteps the AJAX-only-on-load default that returns an empty form skeleton). - Per-variety parse: <h1>, .field--node--variety-type--variety, .field--node--tag-line--variety, .field--node--body, plus the three rated sections (Agronomics / Grain / Disease) with their <div class="row"><div class="label">label</div><div>value</div> pairs. - Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley — provides the Northern Plains HRS coverage WestBred lacks. nk (122 varieties — recon's "29" was outdated; the current NK seed finder lists 41 corn + 81 soy) - ASP.NET WebForms endpoint: POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns {"d": "<html>"} where the inner HTML is one <div class="sf-result"> per variety. BeautifulSoup tokenizes the whole blob. - Per-card: product code (NK8005, NK008-P8XF), RM/MG from the title <span>, "Brands Available" trait variants, marketing positioning + bullet strengths, tech-sheet PDF URL. - pdfplumber text extraction on the tech-sheet PDFs adds: * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt, Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints "Label N" lines (text-extractable) * soybean Phytophthora source genes (Rps1c, Rps3a, ...) * soybean SCN race coverage * soybean agronomic ratings (Emergence, Standability, Shatter Tolerance, Green Stem) with text-extractable 1-9 values * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought prone / high pH / poorly drained / etc. - Agronomic rating BARS for corn (Emergence, Stalk Strength, Drought) are not text-extractable; we record the labels with an explicit "rated in PDF chart, see tech sheet" value so the agent can direct the farmer at the source for those numbers. 
Scale-direction correction in lessons.md: - NK and AgriPro both use 1 = best, lower = more resistant — the REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet footer literally prints "1-9 Scale: 1 = Best, 9 = Worst". AgriPro positioning on stripe-rust-resistant varieties (AP Iliad with Stripe Rust 1, Eyespot 2) confirms the same direction. - sources-not-yet-indexed section trimmed to just Beck's PFR + Beck's products — everything else IS now in the corpus. Cross-vendor coverage after this PR: 760 varieties. bayer_seeds 475 (DEKALB 288 / Asgrow 102 / WestBred 85) golden_harvest 139 nk 122 (41 corn / 81 soy) agripro 24 (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley) Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109 wheat now, up from 85). requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:16:36 -04:00
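The scale-direction note in commit 9ce920f622 implies that a raw rating only means something alongside its vendor's convention. The sketch below shows one way to map any 1-9 rating onto a common "9 = best" axis. The commit does not say whether the chunker actually converts ratings or only records the direction, so treat this as an assumption. The function name is hypothetical. The "1-9 (9 = best)" tag and the NK footer wording come from the commit messages; the "1-9 (1 = best)" tag format is assumed.

```python
def to_nine_is_best(rating: int, scale_direction: str) -> int:
    """Map a 1-9 rating onto a common scale where 9 is always best."""
    if not 1 <= rating <= 9:
        raise ValueError(f"rating out of range: {rating}")
    # Ignore case and spacing so "1 = Best" and "1=best" compare equal.
    direction = scale_direction.lower().replace(" ", "")
    if "9=best" in direction:
        return rating            # Bayer / Golden Harvest convention
    if "1=best" in direction:
        return 10 - rating       # NK / AgriPro convention, mirrored
    raise ValueError(f"unknown scale direction: {scale_direction!r}")
```

Under this mapping, AP Iliad's Stripe Rust 1 (1 = best) becomes 9, so it compares directly with a Bayer rating of 9.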
justin	75f714b454	Phase 4-5: deployable container + corpus snapshot + CI fixes deploy/docker-compose.yml — replace <product>/<registry> placeholders with concrete values for Drawbar's stack: - image: git.jpaul.io/justin/seed-mcp:latest (CF tunnel for pulls; CI pushes via LAN 192.168.0.2:1234 to avoid 100 MB body cap) - container_name: seed-mcp - port 8001:8000 (8001 host-side to not collide with crop-chem-docs on 8000) - PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP - llama-rerank shared with crop-chem-docs (NOT redefined here — expected to already be in Drawbar's parent compose network) - networks.drawbar-mcp external: true so seed-mcp joins the existing cross-MCP shared network .gitignore — corpus/ is now COMMITTED, not ignored. The monthly refresh workflow scrapes and commits corpus changes; the image-only workflow rebuilds indexes from the committed corpus. Allowing the corpus to flow through git means the :corpus-YYYY.MM.DD image tag pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain ignored — those are deterministically derived from corpus. Initial committed snapshot: 614 varieties. - bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85) - golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs 302-redirected = discontinued) rag/chunk.py — normalize brand and crop to uppercase/lowercase in Chroma metadata so cross-vendor brand-filter lookups don't break on casing inconsistency (Bayer stores "DEKALB", Golden Harvest stores "Golden Harvest"; _build_where uppercases user-supplied brand which matched the former but not the latter pre-fix). Sidecar JSON keeps original casing for display. Stub scrapers (nk, agripro, becks_pfr, becks_products) — change return code from 2 to 0 so the monthly-refresh CI workflow doesn't fail on deferred sources. Real implementations will return 0 on success / 1 on failure when they ship. 
Smoke-tested cross-vendor retrieval against the 614-chunk index: - list_versions shows both vendors with correct facet counts - broad "corn hybrid 100 RM" query returns both DEKALB and Golden Harvest hits in top 5 - brand='Golden Harvest' filter returns 3 GH-only varieties - variety-code prefilter still works (E085Z5 → top hit on GH) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 13:40:05 -04:00
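The casing fix in commit 75f714b454 can be illustrated with a small sketch. It assumes brand is stored uppercase and crop lowercase in the index metadata, which is one reading of the commit message, and the helper names are hypothetical. The sidecar dict itself is left unchanged so the original casing stays available for display.

```python
def normalize_metadata(sidecar: dict) -> dict:
    """Return index metadata with canonical casing; sidecar is not mutated."""
    return {
        **sidecar,
        "brand": sidecar["brand"].strip().upper(),
        "crop": sidecar["crop"].strip().lower(),
    }


def brand_clause(user_brand: str) -> dict:
    """Build a brand filter using the same casing as the stored metadata."""
    return {"brand": user_brand.strip().upper()}
```

Because both sides apply the same transformation, a user-supplied "golden harvest" matches a sidecar that stores "Golden Harvest".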
justin	1409c2617d	golden_harvest: implement scraper (~175 Syngenta corn + soy) Sitemap-driven scraper for goldenharvestseeds.com. Walks sitemap-ghs-hybrids.xml to discover product URLs under /products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates). Per-variety detail parsed from server-rendered HTML: - product code (from <h1> / <title>) - positioning (from <meta name="Description">) - maturity (from <div class="product-label"><div class="right">): integer days for corn, decimal MG for soybeans - traits derived from product-code suffix (XF, E3, VIP3, GT, Z, etc.) - 9-row disease tolerance bar chart (#dvDiseaseTolerance) where data-percentage / 10 = rating on 1-9 (9 = best) scale - 9-row agronomic characteristics bar chart (#dvAgronomicChar) - recommended environment list (.AgronomicMange — upstream typo) - all 2-column tables (plant description, seed quality, herbicide responses, Phytophthora gene, SCN race coverage) - tech-sheet PDF URL from live HTML (not sitemap — that's stale) 302 redirects to /product-finder treated as "discontinued" and skipped (Golden Harvest still sitemap-lists some retired SKUs). Rating scale: 1-9 (9 = best) — same as Bayer despite recon's "9-to-1" descriptor (that referred to chart-axis direction, not numeric meaning). _scale_direction is set explicitly so the chunker stays forward-compatible. PDFs are NOT downloaded (recon flagged ~14MB each); tech-sheet URLs are captured in the sidecar for future enrichment. Smoke-tested all branches: 6 corn varieties (E085Z5, E092W5, E094Z4, E095D3, E097K6, E100A3) with full 6 characteristics groups + tech-sheet URL; 3 soy varieties (GH00864XF MG 0.08, GH00973E3 MG 0.09, GH0225XF MG 0.2) with disease + agronomic bars; 302 redirects skipped cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 13:30:30 -04:00
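The bar-chart rule in commit 1409c2617d (data-percentage / 10 = rating on the 1-9 scale) can be sketched with the standard-library HTML parser. The container id `dvDiseaseTolerance` and the `data-percentage` attribute come from the commit message. The class name, the nesting of the sample markup, and the void-tag list are assumptions, and the real scraper may well use a different parser.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no closing tag


class BarChartParser(HTMLParser):
    """Collect data-percentage values found inside one container id."""

    def __init__(self, container_id: str) -> None:
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # > 0 while inside the target container
        self.percentages: list[float] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        if self.depth:
            self.depth += 1
        elif attr.get("id") == self.container_id:
            self.depth = 1
        if self.depth and "data-percentage" in attr:
            self.percentages.append(float(attr["data-percentage"]))

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def ratings_from_html(html: str, container_id: str) -> list[float]:
    """Convert each bar's data-percentage into a 1-9 rating (9 = best)."""
    parser = BarChartParser(container_id)
    parser.feed(html)
    return [pct / 10 for pct in parser.percentages]


# Hypothetical markup: only the ids and the attribute name are from the source.
SAMPLE = (
    '<div id="dvDiseaseTolerance">'
    '<div class="bar" data-percentage="80"></div>'
    '<div class="bar" data-percentage="60"></div>'
    '</div>'
    '<div id="dvAgronomicChar"><div class="bar" data-percentage="90"></div></div>'
)
print(ratings_from_html(SAMPLE, "dvDiseaseTolerance"))  # [8.0, 6.0]
```

Tracking nesting depth keeps the two charts separate, so agronomic bars are not mistaken for disease ratings.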
justin	2a4c0d4aba	bayer_seeds: implement Phase 1 scraper for DEKALB + Asgrow + WestBred Replace stub with working scraper for all three Bayer seed brands. Discovery uses the public sitemap-dynamic.xml (475 varieties: 288 DEKALB corn + 102 Asgrow soy + 85 WestBred wheat — matches recon). Per-variety detail comes from the page's __NEXT_DATA__ JSON island. Each variety writes corpus/bayer_seeds/<source_key>.{md,json} with: - Identity (brand, crop, hybridLabel, productId, releaseYear) - Maturity routed per crop (RM for corn, MG for soy, qualitative for wheat) - Trait stack (code + full name) - Positioning + strengths narrative - Characteristics groups (DISEASE RATINGS, GROWTH, MANAGEMENT, HARVEST, etc.) preserved verbatim from source so the chunker can re-bucket into canonical disease/agronomic flats per CLAUDE.md schema - Regional seed-guide listings with agronomist contacts - _scale_direction tag (Bayer = "1-9 (9 = best)") for chunker Smoke-tested all three brands (--limit 2 each, plus --product, --force, and scrape.runner dispatch). Politeness: 1 req/sec, retries on 429/5xx with Retry-After honored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:53:46 -04:00
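The politeness policy in commit 2a4c0d4aba (1 req/sec, retries on 429/5xx with Retry-After honored) might be structured like the sketch below. It is not the scraper's actual code. The `fetch` and `sleep` callables are injected so the logic runs without a network, and the `Response` tuple shape, the exponential fallback delay, and the attempt limit are assumptions. Only the delta-seconds form of Retry-After is handled (the header may also be an HTTP date), and the header lookup is case-sensitive.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def polite_get(url: str,
               fetch: Callable[[str], Response],
               sleep: Callable[[float], None] = time.sleep,
               min_interval: float = 1.0,
               max_attempts: int = 4) -> bytes:
    """Fetch url, retrying on 429/5xx and pausing after every request."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch(url)
        if status == 200:
            sleep(min_interval)  # keep the overall pace near 1 req/sec
            return body
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts:
            raise RuntimeError(f"GET {url} failed with HTTP {status}")
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)            # server-specified wait
        else:
            delay = min_interval * 2 ** attempt   # assumed backoff fallback
        sleep(max(delay, min_interval))
    raise RuntimeError("unreachable")
```

Non-retryable statuses such as 404 fail immediately rather than consuming the retry budget.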
justin	ac40e05734	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is seed/hybrid varieties across 6 vendors instead of pesticide labels. What's customized vs. the template: - CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy, canonical sidecar schema (per-crop), Golden Harvest disease-scale reversal gotcha, no-IPv6 / HTTPS-clone note - README.md: vendor coverage table, tool list, phase status - Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS defaults (same llama-rerank sidecar as crop-chem-docs) - .gitea/workflows/refresh.yml: monthly cron (seed catalogs move slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar pinning, continue-on-error on GC step - .gitea/workflows/image-only.yml: paths filter + cancel-in-progress concurrency group - scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea packages API URL + UA header to bypass CF block on default Python-urllib UA) - sources.json: catalog of 6 sources + scope_filter + per-source schema notes + Pioneer-exclusion rationale - scrape/runner.py: dispatcher with --all = GREEN-only - scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr, becks_products}.py: stub modules with implementation notes - docs_mcp/server.py: PRODUCT_NAME default → crop_seed, PRODUCT_DOCS_URL → repo URL Pioneer is intentionally NOT a source. ToS bans automation; dealer locator is login-gated. The MCP returns a curated fallback lesson directing the user to pioneer.com. 
Next phases: - Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs Bayer scraper; same __NEXT_DATA__ infra) - Phase 7: curate eval/queries.jsonl - Phase 11: lessons.md with Pioneer fallback + disease-scale notes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:28:49 -04:00

6 Commits