seed-mcp

Author	SHA1	Message	Date
claude	22e8092faf	Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports (#16) Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>	2026-06-04 21:05:30 -04:00
justin	b98965a68a	Two new trial sources: LG Seeds + AgriGold plot reports (+2,307 cross-vendor yield trials)

Adds the first non-Syngenta trial coverage to the corpus:

| Source | Docs | Publisher | URL pattern |
|---|---|---|---|
| lg_plot_reports | 1,304 | LG Seeds (AgReliant) | lgseeds.com/performance/{crop} JSON XHR |
| agrigold_plot_reports | 1,003 | AgriGold (AgReliant) | agrigold.com/{crop}/performance/{crop}-yield-results |

Total trial coverage now: gh_plot_reports (4,299) + agripro_trials (14) + lg_plot_reports (1,304) + agrigold_plot_reports (1,003) = 6,620 trial docs.

Both scrapers follow the gh_plot_reports template: same RateLimitedSession primitive, same TrialResult/PlotReport dataclass shape, same data_type="trial" sidecar convention. The trial chunker (`rag/chunk.py:_render_trial_chunk`) is extended to recognize both new sources; they share `_render_gh_plot_chunk` since their sidecars are structurally identical (only the brand label differs).

LG specifics:
- POST `/performance/{crop}/GetPlots/` returns a sparse listing (id, year, lat/lng)
- GET `/performance/{crop}/GetPlotData/?PlotId=X&IsSilage=Y` returns full detail with state, cooperator, planting/harvest dates, and top-5 hybrids (LG + competitors). Top-5 is what LG publishes publicly; not the full ranking.
- 4 crops: corn (963), soybeans (287), sorghum (10), silage (50). Alfalfa is absent because LG doesn't run alfalfa plots; that's variety-only data.
- 301 gotcha: www.lgseeds.com redirects to lgseeds.com, and the redirect drops the POST body, so the scraper hits the apex host directly.

AgriGold specifics:
- Listing: GET `/{crop}/performance/{crop}-yield-results?harvestYear={year}` (server-rendered HTML, ~1MB; 408 corn plots in 2025 alone)
- Detail: GET `/{crop_url}/performance/{slug}/{plot_id}` returns the full ranking (not just top-5) plus rich plot management metadata: tillage, previous crop, fungicide, herbicide, insecticide, irrigation, soil type, row width, population. Most metadata-rich of the three trial sources.
- Soybean URL slug is singular: `/soybeans/performance/soybean-yield-results/`
- Columns: Rank | Brand | Product | Trait | Ck | H20 (moisture) | Test Wt. | Yield | Adj Yield (check-adjusted)
- 2 crops: corn (849) + soybeans (157)

Indexer needs no changes: `rag/index.py` auto-discovers any directory under corpus/ and routes by data_type. Both new sources flow into the existing trial collection and surface via `search_trials`.

Years scraped: 2024+2025 (matching gh_plot_reports baseline). 2023 is available via `--include-2023` on either scraper for future backfill.	2026-05-26 22:26:24 -04:00
justin	c737871c4c	Trial-data scrapers: gh_plot_reports + agripro_trials + search_trials tool

This PR introduces TRIAL data (yield-performance results from real field trials) as a SEPARATE data type alongside variety identity. The two are complementary:

search_docs → "What's the disease resistance of DKC62-08RIB?" (variety identity: what it IS)
search_trials → "Which corn hybrid won the IA 2024 trials?" (performance data: how it PERFORMED)

scrape/sources/gh_plot_reports.py: Golden Harvest plot reports
- 4,618 expected (2024+2025; 2023 deferred to a backfill pass).
- URL: /<crop>/plot-report/<state>/<year>/<plot_id>
- Cross-vendor: each plot lists products from multiple brands (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side by side at one cooperator's field, the kind of independent comparison data Bayer doesn't publish itself.
- Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for corn+soy, Ton/Acre + Milk + Beef columns for silage).
- Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow.

scrape/sources/agripro_trials.py: AgriPro regional trial PDFs
- 14 unique PDFs (38 sitemap links deduped) at /trials-data
- pdfplumber text extraction, region/year detection from filename
- Verbatim PDF text preserved in chunk body so variety + yield number adjacency drives retrieval (AP Iliad's Aberdeen ID yield matches a query about "AP Iliad Idaho yield")

rag/chunk.py: chunks_from_trial() dispatching by source
- Plot reports: identity preamble + Top-5 by primary metric + full ranking table. Metric labels chosen from the data (corn/soy use "Yield", silage uses "Ton/Acre").
- AgriPro PDFs: identity preamble + verbatim trial body inline so per-location yields surface for region+variety queries.
- Variety chunks get data_type="variety" metadata; trial chunks get data_type="trial". Single Chroma collection; the tool router filters by data_type rather than maintaining two collections.

rag/index.py: dispatch by sidecar's data_type field

rag/bm25.py: new filter columns (data_type, year, state)

docs_mcp/server.py: sixth MCP tool, search_trials(crop?, state?, year?, product?, k=10)
- Filters trial chunks via where={"data_type": "trial", ...}
- Optional product substring post-filter for "DKC62-08RIB Iowa 2024" style searches
- search_docs now defaults to data_type="variety" so trial chunks don't bleed into variety identity queries
- Tool docstring routes the agent: "use lookup_variety to verify identity details on any trial winner you surface"

NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented as deferred: the ASMX-SOAP shape returned empty XML on initial probe. Bayer per-variety yield data is not publicly indexed at all; this is documented in the trial-scope note (DEKALB/Asgrow trial data flows through Channel reps, not the web). AgRevival research books exist as 10 large annual PDFs but are deferred (low ROI per parse).

Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618 Golden Harvest plot reports are scraping in background and will be added in a follow-up corpus-snapshot PR (~70 min ETA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:19:03 -04:00
justin	ac40e05734	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy, canonical sidecar schema (per-crop), Golden Harvest disease-scale reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea packages API URL + UA header to bypass CF block on default Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed, PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer locator is login-gated. The MCP returns a curated fallback lesson directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:28:49 -04:00

4 Commits
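
Commit c737871c4c describes search_trials as filtering on data_type="trial" plus optional crop/state/year, followed by an optional product substring post-filter. The sketch below illustrates that flow in plain Python over an in-memory list. It is an illustration under stated assumptions, not the repository's code: `build_where`, `matches`, and the chunk layout (`metadata` and `text` keys) are hypothetical, and the real tool queries a Chroma collection rather than a list.

```python
from typing import Any, Dict, List, Optional

Chunk = Dict[str, Any]  # hypothetical shape: {"text": str, "metadata": {...}}


def build_where(crop: Optional[str] = None,
                state: Optional[str] = None,
                year: Optional[int] = None) -> Dict[str, Any]:
    """Build a flat metadata filter that always pins data_type to "trial"."""
    where: Dict[str, Any] = {"data_type": "trial"}
    for key, value in (("crop", crop), ("state", state), ("year", year)):
        if value is not None:  # omitted arguments add no constraint
            where[key] = value
    return where


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """True when every filter key equals the chunk's metadata value."""
    return all(metadata.get(k) == v for k, v in where.items())


def search_trials(chunks: List[Chunk],
                  crop: Optional[str] = None,
                  state: Optional[str] = None,
                  year: Optional[int] = None,
                  product: Optional[str] = None,
                  k: int = 10) -> List[Chunk]:
    """Metadata filter first, then a case-insensitive product substring filter."""
    where = build_where(crop, state, year)
    hits = [c for c in chunks if matches(c["metadata"], where)]
    if product:
        needle = product.lower()
        hits = [c for c in hits if needle in c["text"].lower()]
    return hits[:k]


# Made-up sample chunks, for illustration only.
SAMPLE = [
    {"text": "DKC62-08RIB identity sheet",
     "metadata": {"data_type": "variety", "crop": "corn"}},
    {"text": "Plot report: DKC62-08RIB ranked first",
     "metadata": {"data_type": "trial", "crop": "corn", "state": "IA", "year": 2024}},
    {"text": "Plot report: another hybrid ranked first",
     "metadata": {"data_type": "trial", "crop": "corn", "state": "IL", "year": 2024}},
]

iowa_hits = search_trials(SAMPLE, crop="corn", state="IA", year=2024, product="dkc62")
```

Applying the substring check after the metadata filter keeps the product match independent of how product names are stored in metadata, which fits the "DKC62-08RIB Iowa 2024" style of query mentioned in the commit.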