seed-mcp

Author	SHA1	Message	Date
justin	9ce920f622	agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy) agripro (24 varieties) - Drupal Views form scrape via /search-agripro-brand-varieties with explicit GET params (sidesteps the AJAX-only-on-load default that returns an empty form skeleton). - Per-variety parse: <h1>, .field--node--variety-type--variety, .field--node--tag-line--variety, .field--node--body, plus the three rated sections (Agronomics / Grain / Disease) with their <div class="row"><div class="label">label</div><div>value</div> pairs. - Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley — provides the Northern Plains HRS coverage WestBred lacks. nk (122 varieties — recon's "29" was outdated; the current NK seed finder lists 41 corn + 81 soy) - ASP.NET WebForms endpoint: POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns {"d": "<html>"} where the inner HTML is one <div class="sf-result"> per variety. BeautifulSoup tokenizes the whole blob. - Per-card: product code (NK8005, NK008-P8XF), RM/MG from the title <span>, "Brands Available" trait variants, marketing positioning + bullet strengths, tech-sheet PDF URL. - pdfplumber text extraction on the tech-sheet PDFs adds: * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt, Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints "Label N" lines (text-extractable) * soybean Phytophthora source genes (Rps1c, Rps3a, ...) * soybean SCN race coverage * soybean agronomic ratings (Emergence, Standability, Shatter Tolerance, Green Stem) with text-extractable 1-9 values * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought prone / high pH / poorly drained / etc. - Agronomic rating BARS for corn (Emergence, Stalk Strength, Drought) are not text-extractable; we record the labels with an explicit "rated in PDF chart, see tech sheet" value so the agent can direct the farmer at the source for those numbers. 
Scale-direction correction in lessons.md: - NK and AgriPro both use 1 = best, lower = more resistant — the REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet footer literally prints "1-9 Scale: 1 = Best, 9 = Worst". AgriPro positioning on stripe-rust-resistant varieties (AP Iliad with Stripe Rust 1, Eyespot 2) confirms the same direction. - sources-not-yet-indexed section trimmed to just Beck's PFR + Beck's products — everything else IS now in the corpus. Cross-vendor coverage after this PR: 760 varieties. bayer_seeds 475 (DEKALB 288 / Asgrow 102 / WestBred 85) golden_harvest 139 nk 122 (41 corn / 81 soy) agripro 24 (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley) Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109 wheat now, up from 85). requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:16:36 -04:00
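A minimal sketch of the NK response handling described in the commit above: the endpoint returns `{"d": "<html>"}`, and the inner HTML holds one `<div class="sf-result">` per variety. The commit says the scraper uses BeautifulSoup; this sketch swaps in the standard-library `html.parser` so it runs without dependencies. The names `ResultCardParser` and `parse_get_products` are hypothetical, and the sample markup is invented for illustration, not captured from the live site.

```python
import json
from html.parser import HTMLParser


class ResultCardParser(HTMLParser):
    """Collect the visible text of each <div class="sf-result"> card."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one list of text fragments per card
        self._depth = 0   # <div> nesting depth inside the current card

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0 and "sf-result" in classes:
            self.cards.append([])   # a new variety card starts here
            self._depth = 1
        elif self._depth > 0:
            self._depth += 1        # nested <div> inside the current card

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth > 0 and text:
            self.cards[-1].append(text)


def parse_get_products(response_body):
    """Unwrap the {"d": "<html>"} envelope and split it into cards."""
    inner_html = json.loads(response_body)["d"]
    parser = ResultCardParser()
    parser.feed(inner_html)
    parser.close()
    return parser.cards


# Invented sample payload in the shape the commit describes.
sample_body = json.dumps({
    "d": '<div class="sf-result"><span>NK8005</span>'
         '<div>Brands Available</div></div>'
         '<div class="sf-result"><span>NK008-P8XF</span></div>'
})
cards = parse_get_products(sample_body)
```

Tracking `<div>` depth keeps nested elements inside a card from being mistaken for the end of that card.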
justin	4009dc0b78	Phase 11: crop_seed_api_lessons tool + Pioneer fallback Add the fifth MCP tool — crop_seed_api_lessons(topic?) — backed by docs_mcp/lessons.md, the ONLY source of opinionated content in the server. Everything else (search_docs, get_page, lookup_variety) returns verbatim from vendor catalogs; lessons.md fills the gaps the corpus can't cover. The Pioneer fallback is the critical anti-hallucination piece: Pioneer's ToS bans automation, so the corpus has no Pioneer data. Without this tool, an agent might surface Bayer/Asgrow chunks as mediocre matches for a Pioneer query. The tool's docstring tells the agent to call it on any Pioneer / P-series question; the 'pioneer' section says clearly: "I don't have Pioneer's variety data indexed... please consult Pioneer or an extension service." "Do NOT invent Pioneer hybrid ratings." Other lesson sections cover knowledge the agent needs to interpret search_docs / get_page output correctly: - rating-scales: Bayer 1-9, Golden Harvest 9-to-1, what R/MR/S/Rps1c/R3 mean in soybean disease columns - maturity-semantics: corn RM days vs soybean MG vs wheat class + qualitative early/medium/late - trait-glossary: SSRIB, VT2PRIB, XF, E3, Conkesta, Clearfield, etc. - scn-resistance: race coverage + Peking vs PI 88788 source - regional-listings: how to interpret Bayer's "local profiles" - sources-not-yet-indexed: which vendors aren't in the corpus yet - checking-your-work: always call lookup_variety before quoting Lesson lookup prefers slug-match (returns just `rating-scales` for topic="rating", not every section that mentions ratings); falls back to body-match only when no slug matches. Smoke-tested with topic=pioneer, topic=rating, topic=trait, topic=zzzzzz (no match), and topic=None (full index = 10K chars, 8 sections). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 13:18:57 -04:00
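The lookup order described in the commit above (slug match first, body match only when no slug matches, full index when no topic is given) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `docs_mcp`: the function name `find_lessons`, the dict-of-sections input, and the sample bodies are all hypothetical.

```python
from typing import Dict, List, Optional


def find_lessons(sections: Dict[str, str], topic: Optional[str]) -> List[str]:
    """Return the slugs of the lesson sections that match `topic`.

    `sections` maps a slug (e.g. "rating-scales") to its body text.
    """
    if topic is None:
        return list(sections)  # no topic: return the full index

    needle = topic.lower()

    # Preferred path: match on the slug only, so topic="rating" does not
    # pull in every section whose body merely mentions ratings.
    slug_hits = [slug for slug in sections if needle in slug.lower()]
    if slug_hits:
        return slug_hits

    # Fallback: search the bodies only when no slug matched.
    return [slug for slug, body in sections.items() if needle in body.lower()]


# Slugs come from the commit message; the bodies are invented placeholders.
lessons = {
    "rating-scales": "How to read each vendor's 1-9 scale.",
    "scn-resistance": "Race coverage and how it relates to a rating.",
    "pioneer": "No Pioneer variety data is indexed.",
}
```

With these placeholder sections, `find_lessons(lessons, "rating")` returns only `rating-scales` even though the `scn-resistance` body also contains the word, and an unmatched topic such as `zzzzzz` returns an empty list.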
justin	a766756a05	Phase 2/3: chunker + indexer + MCP server tools Phase 2 — Chunking and indexing - rag/chunk.py: replace template chunker with seed-variety-specific chunks_from_variety(). One chunk per variety (varieties are small and named-rating retrieval signal is best kept together). Output is rebuilt deterministically from the sidecar JSON: every value is verbatim from the source, only framing language ("Disease ratings (1-9, 9=best):") is template glue. Anti-hallucination contract: same sidecar in → same chunk out, never a fabricated rating. Metadata flattened to Chroma-safe primitives (str/int/float/bool): source, source_key, vendor, brand, crop, product_name, product_id, source_url, rm (corn), mg (soy), wheat_class, release_year, trait_codes_csv, rating_scale. - rag/index.py: walks corpus/<source>/<source_key>.json sidecars via the new chunker. Default PRODUCT_NAME=crop_seed so the Chroma collection is crop_seed_docs. - rag/bm25.py: filterable columns updated to seed-domain facets (source/vendor/brand/crop/source_key) instead of the template's version/platform/product. Phase 3 — MCP server tools wired up - search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with RRF fusion. Optional filters: crop, brand, vendor, source. Variety-code prefilter pins exact source_key / product_name / hybrid_prefix matches at the top — dense embeddings have no semantic neighbor for tokens like "DKC62-08RIB" and RRF can let noise float to #1 without this pin. Each response carries the variety's source URL inline so the agent can cite. - get_page(source, source_key): emits a structured ratings header (verbatim from sidecar, table per characteristics group, vendor positioning, regional listings) followed by the raw indexed body. This is the canonical fact-check surface. - list_versions(): facet discovery — distinct sources, vendors, brands, crops across the corpus. - lookup_variety(source_key, source?): returns the raw sidecar JSON for one variety. 
The agent should call this BEFORE quoting any specific rating value to a farmer — guaranteed verbatim. Smoke tests against 475 indexed Bayer varieties: - list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands, 3 crops with correct per-brand counts (288/102/85). - Semantic ag queries find the right candidates: short-season drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band); SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings; Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust wheat → WestBred WB1376CLP (Yellow Rust 2 = best). - Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4, WB6430 all return as #1 hit. BM25 confirms ranking unambiguously (top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings"). - Server boots cleanly in stdio AND streamable-http modes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 13:14:16 -04:00
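To illustrate why the variety-code pin in the commit above matters, here is a small sketch of reciprocal rank fusion followed by an exact-match pin. It assumes the conventional RRF constant `k = 60` and a plain whitespace tokenizer. The function names, the `known_keys` set, and the example rankings are hypothetical and do not come from the repository.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def search_with_pin(query, dense_ranking, bm25_ranking, known_keys):
    """Fuse both rankings, then pin exact variety-code matches to the top."""
    fused = rrf_fuse([dense_ranking, bm25_ranking])
    tokens = {token.upper() for token in query.split()}
    pinned = [key for key in sorted(known_keys) if key.upper() in tokens]
    rest = [doc_id for doc_id in fused if doc_id not in pinned]
    return pinned + rest


# Invented rankings: the exact code never reaches rank 1 in either list.
dense = ["AG29XF4", "DKC44-97RIB", "DKC62-08RIB"]
bm25 = ["DKC44-97RIB", "DKC62-08RIB"]
keys = {"AG29XF4", "DKC44-97RIB", "DKC62-08RIB"}

unpinned = rrf_fuse([dense, bm25])
pinned_result = search_with_pin("DKC62-08RIB ratings", dense, bm25, keys)
```

In this made-up case plain RRF ranks `DKC44-97RIB` first because it sits higher in both lists, while the pinned search returns the queried code `DKC62-08RIB` at the top.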
justin	ac40e05734	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is seed/hybrid varieties across 6 vendors instead of pesticide labels. What's customized vs. the template: - CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy, canonical sidecar schema (per-crop), Golden Harvest disease-scale reversal gotcha, no-IPv6 / HTTPS-clone note - README.md: vendor coverage table, tool list, phase status - Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS defaults (same llama-rerank sidecar as crop-chem-docs) - .gitea/workflows/refresh.yml: monthly cron (seed catalogs move slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar pinning, continue-on-error on GC step - .gitea/workflows/image-only.yml: paths filter + cancel-in-progress concurrency group - scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea packages API URL + UA header to bypass CF block on default Python-urllib UA) - sources.json: catalog of 6 sources + scope_filter + per-source schema notes + Pioneer-exclusion rationale - scrape/runner.py: dispatcher with --all = GREEN-only - scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr, becks_products}.py: stub modules with implementation notes - docs_mcp/server.py: PRODUCT_NAME default → crop_seed, PRODUCT_DOCS_URL → repo URL Pioneer is intentionally NOT a source. ToS bans automation; dealer locator is login-gated. The MCP returns a curated fallback lesson directing the user to pioneer.com. 
Next phases: - Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs Bayer scraper; same __NEXT_DATA__ infra) - Phase 7: curate eval/queries.jsonl - Phase 11: lessons.md with Pioneer fallback + disease-scale notes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:28:49 -04:00

4 Commits