seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS
  defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source schema
  notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,becks_products}.py:
  stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:28:49 -04:00
commit ac40e05734
35 changed files with 3833 additions and 0 deletions
@@ -0,0 +1,34 @@
+"""AgriPro scraper (Syngenta wheat brand).
+
+Source: ``https://www.agriprowheat.com`` — Drupal Views form,
+server-rendered HTML. No headless browser needed.
+
+Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
+plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
+under a separate brand and is out of scope for AgriPro.
+
+Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
+implicit in product family naming, not always a separate field).
+
+Schema notes:
+- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
+- ``relative_maturity`` and ``maturity_group`` are null for wheat
+- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
+  Septoria / tan spot
+- Quality: test weight, protein, falling number, straw strength
+
+TODO: implement.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("agripro: not implemented yet — Drupal Views form, only wheat in the corpus, no SRW (separate brand)",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
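The schema notes in the AgriPro docstring can be turned into a small check that runs before a sidecar is written. The sketch below is illustrative only: the function name `validate_wheat_sidecar` and the choice to return a list of problems are assumptions, not part of the commit. The allowed class values are copied from the docstring.

```python
# Wheat classes listed in the AgriPro schema notes.
WHEAT_CLASSES = {"HRW", "HRS", "HWS", "SWW", "SWS", "durum", "barley"}


def validate_wheat_sidecar(sidecar: dict) -> list:
    """Return a list of schema problems; an empty list means the record passes.

    Hypothetical helper: the name and return shape are assumptions.
    """
    problems = []
    if sidecar.get("wheat_class") not in WHEAT_CLASSES:
        problems.append(f"wheat_class must be one of {sorted(WHEAT_CLASSES)}")
    # Corn/soy maturity fields do not apply to wheat and must stay null.
    for field in ("relative_maturity", "maturity_group"):
        if sidecar.get(field) is not None:
            problems.append(f"{field} must be null for wheat")
    return problems
```

A record with an SRW class would be flagged here, which matches the note that SRW is out of scope for this source.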
@@ -0,0 +1,56 @@
+"""Bayer seeds scraper — DEKALB (corn) + Asgrow (soy) + WestBred (wheat).
+
+Source: ``cropscience.bayer.us`` — same Next.js + ``__NEXT_DATA__``
+infrastructure used by crop-chem-docs' Bayer crop-protection scraper.
+That scraper is the reference; this one lifts ~80% of its plumbing
+and adapts the per-product field mapping for seed schema.
+
+Catalog index pages:
+  /corn/dekalb/seed-catalog
+  /soybeans/asgrow/seed-catalog
+  /wheat/westbred/seed-catalog
+
+Each catalog page is a Next.js route; the per-variety data lives
+under ``__NEXT_DATA__.props.pageProps`` (exact key not yet confirmed).
+The buildId in the script tag rotates, so fetch the index page first,
+extract the buildId, then fetch the per-variety JSON.
+
+Output layout:
+  corpus/bayer_seeds/<source_key>.md      LLM-visible body
+  corpus/bayer_seeds/<source_key>.json    Sidecar metadata
+
+source_key convention: ``<brand>-<product-slug>`` lowercased, e.g.
+``dekalb-dkc62-08rib`` or ``asgrow-ag34xf2``.
+
+Sidecar schema (per CLAUDE.md):
+  source: "bayer_seeds"
+  source_key: str
+  vendor: "Bayer"
+  brand: "DEKALB" | "Asgrow" | "WestBred"
+  product_name: str
+  crop: "corn" | "soybeans" | "wheat"
+  relative_maturity: int | null         # corn only
+  maturity_group: float | null          # soy only
+  wheat_class: str | null               # wheat only
+  trait_stack: list[str]
+  agronomic_ratings: dict[str, int]     # normalized 1-9 (9 = best)
+  disease_ratings: dict[str, int]       # normalized 1-9 (9 = best)
+  regional_recommendation: list[str]
+  source_urls: list[str]
+  fetched_at: str (ISO 8601 UTC)
+
+TODO: implement. Reference: ~/github/crop-chem-docs/scrape/sources/bayer.py
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("bayer_seeds: not implemented yet — see ~/github/crop-chem-docs/scrape/sources/bayer.py for the reference Next.js extraction pattern",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
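The two mechanical pieces described in the Bayer docstring, pulling the `__NEXT_DATA__` blob out of a page and building a `source_key`, can be sketched with the standard library alone. This is a minimal sketch, not the reference scraper's code: the helper names `extract_next_data` and `make_source_key` are hypothetical, and the slug rule (collapse any run of non-alphanumerics to one hyphen) is an assumption that happens to reproduce the two example keys in the docstring.

```python
import json
import re

# Next.js embeds page data in <script id="__NEXT_DATA__" type="application/json">.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> dict:
    """Parse the __NEXT_DATA__ JSON blob out of a rendered Next.js page."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


def make_source_key(brand: str, product_name: str) -> str:
    """Build ``<brand>-<product-slug>``, lowercased (slug rule is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", product_name.lower()).strip("-")
    return f"{brand.lower()}-{slug}"
```

The rotating `buildId` is a top-level key of the parsed blob, so after fetching an index page it can be read as `extract_next_data(html)["buildId"]`.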
@@ -0,0 +1,45 @@
+"""Beck's PFR (Practical Farm Research) scraper.
+
+Source: Public Sanity GROQ API at ``https://mc8v24rf.api.sanity.io``.
+No authentication required — Beck's exposes their CMS content store
+publicly. ~2,089 documents going back to 2015.
+
+Sanity query endpoint:
+  ``/v1/data/query/production?query=<groq>``
+
+Useful GROQ for PFR docs (the projectId / dataset are public):
+
+  *[_type == "pfrStudy"] {
+    _id, title, year, crop, slug, summary, body, attachments
+  }
+
+Records are research studies, not variety identity — head-to-head
+yield trials, fungicide timing, planting-date studies, hybrid-by-
+population, biological seed treatments, etc.
+
+Treat differently from variety scrapers:
+- One record per study, not per variety
+- chunk_0 preamble includes the study's tl;dr finding (extract from
+  the ``summary`` field if present, or first paragraph of ``body``)
+- Crop tag (corn/soy/wheat) for filtering
+- Year tag — older PFR studies are still relevant but search should
+  let the user weight recency
+
+Polite rate limit: Sanity is generous but no auth means we should
+keep concurrency ≤4 and pause ~250ms between batches.
+
+TODO: implement.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("becks_pfr: not implemented yet — public Sanity GROQ at mc8v24rf.api.sanity.io, ~2089 research docs",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
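The query endpoint and GROQ projection from the PFR docstring can be combined into paged request URLs. The sketch below only builds URLs and performs no network I/O. The helper names, the `order(_id)` clause, and the page size are assumptions added for illustration; the `pfrStudy` type name and field list are taken from the docstring as-is and have not been verified against the live dataset.

```python
from urllib.parse import quote

# Host and endpoint path as given in the becks_pfr docstring.
SANITY_HOST = "https://mc8v24rf.api.sanity.io"
QUERY_PATH = "/v1/data/query/production"


def build_pfr_query_url(offset: int, limit: int) -> str:
    """Return a query URL for one page of PFR studies (hypothetical helper)."""
    groq = (
        '*[_type == "pfrStudy"] | order(_id) '
        # GROQ's three-dot slice excludes the end index.
        f"[{offset}...{offset + limit}] "
        "{_id, title, year, crop, slug, summary, body, attachments}"
    )
    return f"{SANITY_HOST}{QUERY_PATH}?query={quote(groq, safe='')}"


def page_offsets(total: int, page_size: int) -> range:
    """Start offsets needed to cover ``total`` documents."""
    return range(0, total, page_size)
```

With roughly 2,089 documents and an assumed page size of 500, `page_offsets(2089, 500)` yields five pages, which leaves room to stay within the concurrency and pause limits noted in the docstring.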
@@ -0,0 +1,46 @@
+"""Beck's product catalog scraper (identity-only until SeedIQ XHR sniff lands).
+
+Source: Same public Sanity GROQ API as ``becks_pfr`` (no auth).
+Expected count: ~860 products (corn + soy + wheat).
+
+Current limitation: Beck's exposes IDENTITY fields publicly (product
+name, RM/MG, basic trait stack) but routes the AGRONOMIC + DISEASE
+ratings through their SeedIQ application, which is gated behind a
+browser session cookie. The public Sanity records do not include
+ratings.
+
+What we CAN ship without SeedIQ:
+- Product identity for confirmation ("yes Beck's has hybrid X at RM 112")
+- RM (corn) / MG (soy) / class (wheat)
+- Trait stack
+- Basic descriptive text
+
+What needs the SeedIQ XHR endpoint (BLOCKED on user sniff):
+- Disease ratings (GLS, NCLB, Goss's, etc.)
+- Agronomic ratings (standability, drought, etc.)
+- Regional recommendations
+
+For now this scraper is DEFERRED. Run when:
+- User captures the SeedIQ XHR URL + cookie/header pattern from
+  browser dev tools, OR
+- We decide to ship Beck's as identity-only and let the LLM say
+  "Beck's has this hybrid; ask your Beck's rep for full agronomic
+  ratings" (less useful but avoids the empty-data UX).
+
+Yellow verdict in sources.json reflects this — ``--all`` skips it.
+
+TODO: implement (deferred).
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("becks_products: deferred — SeedIQ XHR sniff required for ratings, run only if user has captured the endpoint",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,42 @@
+"""Golden Harvest scraper (Syngenta brand).
+
+Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
+every variety page. Server-rendered HTML — no headless browser
+required. Tech-sheet PDFs live on the Syngenta CDN at
+``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
+fetcher pattern as NK.
+
+Two gotchas:
+
+1. **Sitemap PDF dates are stale** (the sitemap was generated
+   2025-03-31 and never updated). Resolve the LIVE PDF URL from the
+   product HTML page, not from the sitemap entry.
+
+2. **Disease scale is reversed.** Golden Harvest publishes ratings
+   on a 9-to-1 scale (1 = best, 9 = worst). Bayer/NK/AgriPro use
+   1-9 (9 = best). Normalize at chunk time so the corpus has a
+   single direction. Record the original direction in the chunk_0
+   preamble: "Note: ratings normalized to 1-9 (9 = best). Golden
+   Harvest publishes on a 9-to-1 scale natively."
+
+Expected count: ~175 varieties (89 corn + 86 soy). No wheat.
+
+Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
+yield trial records. Out of scope for v1 but a high-value future
+ingest for regional placement recommendations.
+
+TODO: implement. Reuse the PDF-fetch helper that NK uses.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("golden_harvest: not implemented yet — see CLAUDE.md for the disease-scale-reversal gotcha and the live-PDF-URL-resolution requirement",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
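The disease-scale normalization in gotcha 2 amounts to reflecting each rating around the midpoint of the scale. The sketch below assumes, as the "reversed" wording implies, that Golden Harvest's native scale treats 1 as best, so the corpus value is `10 - native`. The function names are hypothetical; the preamble text is copied from the docstring.

```python
# Preamble note quoted from the golden_harvest docstring.
PREAMBLE_NOTE = (
    "Note: ratings normalized to 1-9 (9 = best). Golden "
    "Harvest publishes on a 9-to-1 scale natively."
)


def flip_rating(value: int) -> int:
    """Map a native rating onto the corpus scale (assumes native 1 = best)."""
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range 1-9: {value}")
    return 10 - value


def normalize_ratings(raw: dict) -> dict:
    """Flip every rating in a {trait: rating} mapping."""
    return {trait: flip_rating(value) for trait, value in raw.items()}
```

Rejecting out-of-range values rather than clamping them is a deliberate choice in this sketch: a rating outside 1-9 more likely signals a table-parsing error than a real score.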
@@ -0,0 +1,35 @@
+"""NK scraper (Syngenta brand).
+
+Source: ``https://www.syngenta-us.com`` — static HTML product pages
+plus tech-sheet PDFs on the Syngenta CDN at
+``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
+
+Expected count: 29 varieties (12 corn + 17 soy). No wheat.
+
+The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
+``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
+helper module under ``scrape.sources._syngenta_pdf`` once both
+scrapers are written.
+
+Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
+have marketing copy only). Use pdfplumber for table extraction.
+
+Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
+etc. — additional supplemental context worth ingesting once the
+per-variety scrape is solid.
+
+TODO: implement.
+"""
+from __future__ import annotations
+
+import sys
+
+
+def main(argv: list[str] | None = None) -> int:
+    print("nk: not implemented yet — disease/agronomic ratings come from CDN tech-sheet PDFs only, use pdfplumber",
+          file=sys.stderr)
+    return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
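The shared `<CODE>_YYMMDD.pdf` filename convention is a natural first piece of the planned `_syngenta_pdf` helper. The sketch below parses the product code and date out of a tech-sheet URL and picks the newest of several candidates. The function names are hypothetical, and two assumptions are baked in: product codes contain no underscore, and the two-digit year means 20YY.

```python
import re
from datetime import date

# Matches .../pdf/techsheets/<CODE>_YYMMDD.pdf (CODE assumed underscore-free).
_TECHSHEET_RE = re.compile(
    r"/pdf/techsheets/(?P<code>[^/_]+)_(?P<ymd>\d{6})\.pdf$"
)


def parse_techsheet_url(url: str) -> tuple:
    """Return (product_code, sheet_date) for a Syngenta CDN tech-sheet URL."""
    match = _TECHSHEET_RE.search(url)
    if match is None:
        raise ValueError(f"not a tech-sheet URL: {url}")
    ymd = match.group("ymd")
    # Assumption: the two-digit year is 20YY.
    sheet_date = date(2000 + int(ymd[:2]), int(ymd[2:4]), int(ymd[4:6]))
    return match.group("code"), sheet_date


def newest_techsheet(urls: list) -> str:
    """Pick the URL with the latest embedded date."""
    return max(urls, key=lambda u: parse_techsheet_url(u)[1])
```

For example, with a made-up product code, `parse_techsheet_url("https://assets.syngentaebiz.com/pdf/techsheets/NK1234_250331.pdf")` returns the code `NK1234` and the date 2025-03-31.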