diff --git a/README.md b/README.md
index 2b86c6e8..9b77f267 100644
--- a/README.md
+++ b/README.md
@@ -1,27 +1,71 @@
 # seed-mcp
 
 MCP server over the public catalogs of major US row-crop seed
-vendors — corn, soybeans, wheat. Sibling project to
+vendors — **variety identity** (what each hybrid IS) plus **yield-trial data** (how they actually perform in real cooperator fields). Sibling project to
 [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
 (pesticide labels), feeding the same Drawbar farm-advisor AI.
 
-The server exposes per-variety records with **agronomic ratings**,
-**disease tolerance**, **trait stack**, **maturity**, and
-**regional notes** — so the advisor can answer questions like
-"which corn hybrid for sandy soil, drought-prone, RM ≤105 in
-northeast Iowa?" without rummaging through individual brand sites.
+**Deployed 2026-05-25** on trashpanda as a sibling sidecar to
+`chem-mcp`; the Drawbar advisor calls it via the `seed:` prefix.
 
-## Vendor coverage
+## What's in the corpus
 
-| Vendor | Verdict | Varieties | Notes |
+**5,073 indexed chunks** across two complementary surfaces:
+
+### Variety identity — 760 records
+
+| Source | Count | Vendor | Brand |
 |---|---|---|---|
-| Bayer seeds (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | Same `cropscience.bayer.us` Next.js infra as crop-chem-docs |
-| Golden Harvest (Syngenta) | 🟢 | ~175 | Sitemap + server-rendered HTML + Syngenta CDN PDFs |
-| NK (Syngenta) | 🟢 | 29 | Shares PDF fetcher with Golden Harvest |
-| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views, server-rendered |
-| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API (no auth) |
-| Beck's products | 🟡 | 860 | Identity-only until SeedIQ XHR sniffed |
-| Pioneer (Corteva) | 🔴 | — | ToS bans automation — curated fallback lesson instead |
+| `bayer_seeds` | 475 | Bayer | DEKALB (corn) / Asgrow (soy) / WestBred (wheat) |
+| `golden_harvest` | 139 | Syngenta | Golden Harvest (corn / soy) |
+| `nk` | 122 | Syngenta | NK (corn / soy) |
+| `agripro` | 24 | Syngenta | AgriPro (wheat — HRW / HRS / HWS / SWW) |
+
+### Yield-trial data — 4,313 documents
+
+| Source | Count | Notes |
+|---|---|---|
+| `gh_plot_reports` | 4,299 | Golden Harvest plot reports 2024+2025. **Cross-vendor head-to-head** — DEKALB / NK / GH / Pioneer / Channel all appear in the same trial rankings. The closest thing to independent comparison data the corpus has. |
+| `agripro_trials` | 14 | Regional wheat trial PDF summaries (PNW, Western Plains, Northern Plains, etc.) |
+
+### Not in the corpus (documented in `docs_mcp/lessons.md`)
+
+- **Pioneer / Corteva** — ToS bans automation. Curated fallback lesson points the farmer at pioneer.com / a local dealer.
+- **NK yield-results** — fiddly ASMX/SOAP endpoint, needs a dedicated reverse-engineer session.
+- **Bayer per-variety trial data** — not publicly indexed (DEKALB / Asgrow trial data flows through Channel reps). Partially covered by the GH plot reports' cross-vendor results.
+
+## MCP tools (6)
+
+| Tool | Purpose |
+|---|---|
+| `search_docs` | Variety IDENTITY — what a hybrid IS (disease ratings, traits, maturity). Hybrid dense+BM25 + cross-encoder rerank + variety-code prefilter. |
+| `search_trials` | Variety PERFORMANCE — head-to-head yield trial results. Filterable by crop, state, year, product. |
+| `get_page` | Full canonical record for one variety + structured ratings header sourced from the sidecar JSON. |
+| `lookup_variety` | Raw sidecar JSON for one variety — **fact-check tool**; call before quoting any specific rating value. |
+| `list_versions` | Discover facets (sources, vendors, brands, crops) currently indexed. |
+| `crop_seed_api_lessons` | Curated knowledge: Pioneer fallback policy, scale-direction differences across vendors, trait glossary, SCN race coverage notes. |
+
+`search_docs` defaults to `data_type="variety"`; `search_trials` uses `data_type="trial"` — single Chroma collection, metadata-filtered.
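The single-collection, metadata-filtered design described for `search_docs` and `search_trials` can be sketched as a small filter builder. This is a minimal sketch under stated assumptions, not the server's actual code: the helper name `build_where` and the facet keys `state`, `crop`, and `year` are hypothetical, and only `data_type` with its two values comes from the text. The output follows Chroma's `where` filter syntax (`$eq`, `$and`).

```python
from typing import Any


def build_where(data_type: str, **facets: Any) -> dict[str, Any]:
    """Build a Chroma-style `where` filter for one data_type plus optional facets.

    Hypothetical helper: the facet keys are assumed metadata fields, not
    confirmed by the repo. Facets passed as None are skipped.
    """
    if data_type not in ("variety", "trial"):
        raise ValueError(f"unknown data_type: {data_type!r}")

    clauses: list[dict[str, Any]] = [{"data_type": {"$eq": data_type}}]
    for key, value in sorted(facets.items()):
        if value is not None:
            clauses.append({key: {"$eq": value}})

    # A single condition is passed bare; several conditions are wrapped in `$and`.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


variety_filter = build_where("variety")
trial_filter = build_where("trial", state="IA", crop=None)
```

The resulting dict would be passed as the `where` argument of a Chroma `collection.query(...)` call, so both tools can share one collection and differ only in the filter.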
+
+## Retrieval — eval-validated
+
+From `eval/results/baseline.md` (21 golden queries, k=5):
+
+| Retriever | Pass | Recall | P@1 | MRR | Avg ms |
+|---|---|---|---|---|---|
+| **hybrid+rerank** | **21/21** | **100%** | **90%** | **0.905** | 2064 |
+| bm25 | 20/21 | 95% | 81% | 0.833 | 5 |
+| hybrid (no rerank) | 15/21 | 71% | 62% | 0.619 | 73 |
+| dense | 14/21 | 67% | 38% | 0.440 | 79 |
+
+**Deploy config**: `HYBRID_SEARCH=true` + `RERANK_URL=http://llama-rerank:8080`.
+
+Some surprises worth knowing:
+
+1. **Dense embedding alone is the weakest config**. Variety codes (DKC62-08RIB), gene names (Rps3a), and trait codes (XF) have no semantic neighbors — nomic-embed-text returns noise on them.
+2. **Hybrid alone is WORSE than BM25 alone.** RRF dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
+3. **BM25-alone (95% recall, 5 ms) is an excellent fallback** when the rerank sidecar is unavailable. The variety-code prefilter in `search_docs` does heavy lifting.
+4. **Anti-hallucination queries pass on every retriever** — Pioneer fallback + not-in-corpus product checks hold across all configs.
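The dilution effect in surprise 2 can be illustrated with a toy reciprocal rank fusion (RRF). This is a sketch of the general technique, not the repo's fusion code: the constant `k=60` is the conventional default and is assumed here, and every document ID other than `DKC62-08RIB` is a hypothetical placeholder.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# BM25 puts the exact variety-code match first.
bm25_ranking = ["DKC62-08RIB", "hypothetical-B", "hypothetical-C"]
# Dense retrieval returns unrelated neighbours and only overlaps on B.
dense_ranking = ["hypothetical-X", "hypothetical-Y", "hypothetical-B"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused[0])  # hypothetical-B: it appears in both lists, so it overtakes the exact match
```

Because RRF rewards agreement between lists rather than the strength of any single list, a document that both retrievers rank moderately can outrank the one that BM25 alone ranked first. A cross-encoder reranker applied after fusion can restore the exact match to the top.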
 
 ## Quick start
 
@@ -31,54 +75,121 @@ cd seed-mcp
 python -m venv venv && source venv/bin/activate
 pip install -r requirements.txt
 
-# Run one scraper
-python -m scrape.runner --source bayer_seeds --force
+# Sample-scrape just to verify wiring:
+python -m scrape.runner --source bayer_seeds --limit 3
 
-# Rebuild indexes
-python -m rag.index --rebuild
+# Full refresh (all 6 sources; expect ~25 min for gh_plot_reports
+# with 4 concurrent workers):
+python -m scrape.runner --all --force
 
-# Local MCP server (stdio for Claude Desktop dev)
-python -m docs_mcp.server --transport stdio
+# Rebuild Chroma + BM25 from the corpus:
+OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_seed \
+  python -m rag.index --rebuild
+
+# Run the eval harness:
+RERANK_URL=http://localhost:18080 python -m eval.run_eval \
+  --queries eval/queries.jsonl --k 5 \
+  --output eval/results/baseline.md
+
+# Local MCP server (stdio for Claude Desktop dev):
+PRODUCT_NAME=crop_seed python -m docs_mcp.server --transport stdio
+
+# Local HTTP server (matches production transport):
+PRODUCT_NAME=crop_seed python -m docs_mcp.server \
+  --transport streamable-http --port 8000
 ```
 
-## Tools exposed
+## Repo layout
 
-| Tool | Purpose |
-|---|---|
-| `search_docs` | Hybrid + rerank variety search with crop / RM / trait / region filters |
-| `get_page` | Full variety record by `(source, source_key)` |
-| `list_versions` | Discover crops, brands, traits, RM/MG ranges, wheat classes |
-| `corpus_status` | Counts + freshness; useful for health probes |
-| `crop_seed_api_lessons` | Curated agronomy lessons — Pioneer fallback, disease-scale normalization, regional placement heuristics |
-
-## Build phases
-
-This is a clone of [`docs-mcp-template`](https://git.jpaul.io/justin/docs-mcp-template).
-The 13 phases in `PLAN.md` apply:
-
-| Phase | Status |
-|---|---|
-| 0 — scaffold | done |
-| 1 — first scraper (bayer_seeds) | next |
-| 2 — chunk + index | pending |
-| 3 — baseline MCP tools | template defaults |
-| 4-5 — Dockerfile + CI | done (placeholders filled) |
-| 6 — reranker | shares `llama-rerank` sidecar with crop-chem-docs |
-| 7 — eval harness | pending (curate ~25 queries) |
-| 8 — hybrid search | done (template) |
-| 9 — diff_versions, list_cluster | optional |
-| 11 — `crop_seed_api_lessons` curated layer | pending |
-
-See `CLAUDE.md` for the canonical sidecar schema and the
-disease-scale-normalization gotcha (Golden Harvest is reversed).
+```
+.
+├── CLAUDE.md                    # Canonical agent guide. Read first.
+├── PLAN.md                      # Template's 13-phase build guide.
+├── README.md
+├── requirements.txt
+├── Dockerfile
+├── sources.json                 # Source catalog (one entry per scraper)
+├── deploy/docker-compose.yml    # Drop-in compose snippet for Drawbar
+├── .gitea/workflows/
+│   ├── refresh.yml              # Monthly cron: scrape + index + image push
+│   └── image-only.yml           # On-demand code-only ship cycle
+├── scrape/
+│   ├── runner.py                # `python -m scrape.runner --source `
+│   ├── changelog.py             # Reused from template
+│   └── sources/
+│       ├── bayer_seeds.py       # ~475 varieties across 3 brands
+│       ├── golden_harvest.py    # ~139 varieties (post-discontinued filter)
+│       ├── nk.py                # 122 varieties (corn + soy)
+│       ├── agripro.py           # 24 wheat varieties
+│       ├── gh_plot_reports.py   # 4,299 cross-vendor yield trials
+│       ├── agripro_trials.py    # 14 regional trial PDFs
+│       └── becks_pfr.py         # stub — Sanity GROQ research corpus
+├── rag/
+│   ├── embeddings.py            # nomic-embed-text via Ollama
+│   ├── chunk.py                 # one-chunk-per-variety + trial chunker
+│   ├── index.py                 # Chroma + BM25 builder
+│   └── bm25.py                  # FTS5 lexical index w/ seed-domain facets
+├── docs_mcp/
+│   ├── server.py                # FastMCP — 6 tools, hybrid+rerank
+│   ├── lessons.md               # Curated knowledge layer (Pioneer fallback)
+│   └── usage.py                 # TimedCall + JSONL telemetry
+├── eval/
+│   ├── queries.jsonl            # 21 golden queries
+│   ├── retrievers.py            # dense / bm25 / hybrid / hybrid+rerank
+│   ├── run_eval.py              # MRR / Recall@k / Precision@1
+│   └── results/baseline.md      # Current deploy-config eval numbers
+└── corpus/                      # Committed scrape output (CI-refreshed)
+    ├── bayer_seeds/
+    ├── golden_harvest/
+    ├── nk/
+    ├── agripro/
+    ├── gh_plot_reports/
+    └── agripro_trials/
+```
 
 ## Infrastructure
 
-- **Registry**: `git.jpaul.io/justin/seed-mcp:latest` (Watchtower) /
-  `:corpus-YYYY.MM.DD` (production pin)
-- **Embedder**: shared Ollama pool with crop-chem-docs (Gitea-host
-  GPUs + Windows Ollama; CI never hits trashpanda's production Ollama)
-- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla
-  P4 (one container, both MCPs use it)
-- **PRODUCT_NAME**: `crop_seed` (not `seed_mcp` — used in Chroma
-  collection, BM25 db filename, and `crop_seed_api_lessons` tool)
+- **Registry**: pushes to `192.168.0.2:1234` (LAN, no CF body cap); deploys pull `git.jpaul.io/justin/seed-mcp:latest` (public, CF tunnel). Also tagged `:` for rollback pinning and `:corpus-YYYY.MM.DD` for snapshot pinning.
+- **Embedder pool (CI)**: 3 GPU-pinned Ollama endpoints, weighted toward `.0.125` (RTX 40-series, 242 embeds/sec):
+  - `.0.125:11434` ×4 (4090)
+  - `.0.2:11436` ×2 (GPU-pinned)
+  - `.0.2:11435` ×1 (GPU-pinned)
+  - Do NOT use `.0.2:11434` (not GPU-pinned) or `localhost:11434` (works in dev, breaks in CI — runner container has no Ollama on its loopback).
+- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla P4 (jina-reranker-v2-base via llama.cpp). One container serves both seed-mcp and crop-chem-docs. **Must be on `drawbar-backend_default` Docker network** — see `deploy/docker-compose.yml` for the network-attach gotcha that caused silent rerank degradation on chem-mcp prior to 2026-05-25.
+- **PRODUCT_NAME**: `crop_seed` — used in the Chroma collection name (`crop_seed_docs`), the BM25 db filename (`bm25/crop_seed_docs.db`), and the `crop_seed_api_lessons` tool name. Not `seed_mcp` — that would conflict with the container/service name.
+
+## Deploy mechanics
+
+Watchtower handles auto-deploy. Every push to `seed-mcp/main` that touches `docs_mcp/`, `rag/`, `scrape/`, `requirements.txt`, `Dockerfile`, or `sources.json` triggers `image-only.yml`:
+
+1. Checks out main with full corpus
+2. Rebuilds Chroma + BM25 (~3 min on the GPU pool)
+3. `docker build` + push three tags to the LAN registry
+4. Links the package to the repo via Gitea API
+5. Watchtower on trashpanda polls `:latest` every 5 min → pulls + recreates `drawbar-backend-seed-mcp-1`
+
+Corpus refresh runs monthly via `refresh.yml` (1st of each month, 06:00 UTC) — re-scrapes all GREEN sources, commits any corpus diff, rebuilds indexes, ships a new image with `:corpus-YYYY.MM.DD` tagged.
+
+See `CLAUDE.md` for canonical sidecar schemas, the reversed disease-scale gotcha (NK + AgriPro publish 1=best, vs Bayer/GH 9=best), and the scraper conventions.
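The reversed disease-scale gotcha (NK and AgriPro publish 1=best, Bayer and Golden Harvest publish 9=best) can be handled with a small normalization step before ratings are compared across vendors. The sketch below is illustrative only: it assumes every vendor uses an integer 1 to 9 scale and that the target convention is 9=best, neither of which is confirmed here; `CLAUDE.md` remains the canonical reference. The function name `normalize_rating` is hypothetical, while the source keys match the corpus source names.

```python
# Sources that publish 1=best, per the README text. Everything else is
# assumed to publish 9=best already.
REVERSED_SOURCES = frozenset({"nk", "agripro"})


def normalize_rating(source: str, rating: int, lo: int = 1, hi: int = 9) -> int:
    """Map a vendor rating onto a common 9=best scale.

    Assumes an integer lo..hi scale for all vendors (an assumption, not
    something the README states).
    """
    if not lo <= rating <= hi:
        raise ValueError(f"rating {rating} outside {lo}..{hi} for source {source!r}")
    if source in REVERSED_SOURCES:
        # Mirror the scale: 1 becomes 9, 2 becomes 8, and so on.
        return (lo + hi) - rating
    return rating


print(normalize_rating("nk", 1))           # 9
print(normalize_rating("bayer_seeds", 9))  # 9
```

Keeping the raw vendor value in the sidecar and normalizing only at comparison time would preserve the ability to quote the vendor's own number, which matters for the fact-check role of `lookup_variety`.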
+
+## Status
+
+| Phase | Status |
+|---|---|
+| 0 — scaffold | ✅ |
+| 1 — scrapers (bayer_seeds / golden_harvest / nk / agripro / gh_plot_reports / agripro_trials) | ✅ |
+| 2 — chunk + index | ✅ |
+| 3 — MCP tools (6) | ✅ |
+| 4-5 — Dockerfile + Gitea CI | ✅ |
+| 6 — reranker integration | ✅ (eval-validated; deploy uses hybrid+rerank) |
+| 7 — eval harness | ✅ (21 golden queries, baseline committed) |
+| 8 — hybrid search | ✅ (default ON) |
+| 11 — `crop_seed_api_lessons` curated layer | ✅ (Pioneer fallback + 7 other lessons) |
+| 13 — weekly_digest | not planned for seed-mcp |
+
+Remaining work (deferred, not blocking):
+
+- `becks_pfr` scraper (2,089 research docs via public Sanity GROQ)
+- 2023 GH plot reports backfill (~3,619 more docs)
+- NK yield-results endpoint reverse-engineer
+- Channel Seed brand (~320 more Bayer varieties — separate brand under the same sitemap)