seed-mcp

justin/seed-mcp

Fork 0

Commit Graph

Author	SHA1	Message	Date
claude	54094a0d43	Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs) Independent third-party performance data — land-grant programs that test every entered brand side-by-side with replication + LSD stats. This is the legitimate way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't scrape directly (data_type=trial, results[] shape; falls through the trial chunker). - illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata; corn-following-corn vs -soybean kept distinct. - iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView (viewstate postback for year/district), corn + soy by district x season. - ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF (pdfplumber; per-site column groups split by header Yield-token count + x-coord footnote bucketing), corn + soy per site, 2024+2025. 91 distinct seed brands across the three; majors confirmed present in the independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212, Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in Illinois — true negatives, not parse gaps.) - rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university sources route through it so the region/district is in the embedded chunk + labeled "variety trial (cross-vendor, independent third-party)". Existing plot sources (gh/lg/agrigold/proharvest) unchanged. - requirements.txt: openpyxl (Illinois XLSX; scrape-time only). - sources.json + README/CLAUDE/lessons: registered + attributed; lessons trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available indirectly via these trials). Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0 out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in each tos_note. CI rebuilds the index from the committed corpus.	2026-06-10 08:35:50 -04:00
justin	9ce920f622	agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy) agripro (24 varieties) - Drupal Views form scrape via /search-agripro-brand-varieties with explicit GET params (sidesteps the AJAX-only-on-load default that returns an empty form skeleton). - Per-variety parse: <h1>, .field--node--variety-type--variety, .field--node--tag-line--variety, .field--node--body, plus the three rated sections (Agronomics / Grain / Disease) with their <div class="row"><div class="label">label</div><div>value</div> pairs. - Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley — provides the Northern Plains HRS coverage WestBred lacks. nk (122 varieties — recon's "29" was outdated; the current NK seed finder lists 41 corn + 81 soy) - ASP.NET WebForms endpoint: POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns {"d": "<html>"} where the inner HTML is one <div class="sf-result"> per variety. BeautifulSoup tokenizes the whole blob. - Per-card: product code (NK8005, NK008-P8XF), RM/MG from the title <span>, "Brands Available" trait variants, marketing positioning + bullet strengths, tech-sheet PDF URL. - pdfplumber text extraction on the tech-sheet PDFs adds: * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt, Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints "Label N" lines (text-extractable) * soybean Phytophthora source genes (Rps1c, Rps3a, ...) * soybean SCN race coverage * soybean agronomic ratings (Emergence, Standability, Shatter Tolerance, Green Stem) with text-extractable 1-9 values * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought prone / high pH / poorly drained / etc. - Agronomic rating BARS for corn (Emergence, Stalk Strength, Drought) are not text-extractable; we record the labels with an explicit "rated in PDF chart, see tech sheet" value so the agent can direct the farmer at the source for those numbers. Scale-direction correction in lessons.md: - NK and AgriPro both use 1 = best, lower = more resistant — the REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet footer literally prints "1-9 Scale: 1 = Best, 9 = Worst". AgriPro positioning on stripe-rust-resistant varieties (AP Iliad with Stripe Rust 1, Eyespot 2) confirms the same direction. - sources-not-yet-indexed section trimmed to just Beck's PFR + Beck's products — everything else IS now in the corpus. Cross-vendor coverage after this PR: 760 varieties. bayer_seeds 475 (DEKALB 288 / Asgrow 102 / WestBred 85) golden_harvest 139 nk 122 (41 corn / 81 soy) agripro 24 (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley) Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109 wheat now, up from 85). requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:16:36 -04:00
justin	ac40e05734	seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME Image rebuild (skip scrape) / build (push) Failing after 7s Details Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is seed/hybrid varieties across 6 vendors instead of pesticide labels. What's customized vs. the template: - CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy, canonical sidecar schema (per-crop), Golden Harvest disease-scale reversal gotcha, no-IPv6 / HTTPS-clone note - README.md: vendor coverage table, tool list, phase status - Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS defaults (same llama-rerank sidecar as crop-chem-docs) - .gitea/workflows/refresh.yml: monthly cron (seed catalogs move slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar pinning, continue-on-error on GC step - .gitea/workflows/image-only.yml: paths filter + cancel-in-progress concurrency group - scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea packages API URL + UA header to bypass CF block on default Python-urllib UA) - sources.json: catalog of 6 sources + scope_filter + per-source schema notes + Pioneer-exclusion rationale - scrape/runner.py: dispatcher with --all = GREEN-only - scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr, becks_products}.py: stub modules with implementation notes - docs_mcp/server.py: PRODUCT_NAME default → crop_seed, PRODUCT_DOCS_URL → repo URL Pioneer is intentionally NOT a source. ToS bans automation; dealer locator is login-gated. The MCP returns a curated fallback lesson directing the user to pioneer.com. Next phases: - Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs Bayer scraper; same __NEXT_DATA__ infra) - Phase 7: curate eval/queries.jsonl - Phase 11: lessons.md with Pioneer fallback + disease-scale notes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:28:49 -04:00

Author

SHA1

Message

Date

claude

54094a0d43

Add university-extension variety trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 trial docs)

Independent third-party performance data — land-grant programs that test every
entered brand side-by-side with replication + LSD stats. This is the legitimate
way to get Pioneer / DEKALB / Brevant / Channel performance the corpus can't
scrape directly (data_type=trial, results[] shape; falls through the trial
chunker).

- illinois_vt_trials (30 docs, 1,392 rows) — U of Illinois VT. Per-region XLSX
  (openpyxl), corn + soy + WHEAT, 2024+2025. Rich per-site agronomic metadata;
  corn-following-corn vs -soybean kept distinct.
- iowa_icpt_trials (24 docs, 674 rows) — Iowa State ICPT. ASP.NET GridView
  (viewstate postback for year/district), corn + soy by district x season.
- ohio_ocpt_trials (69 docs, 4,647 rows) — OSU/CFAES OCPT. Report PDF
  (pdfplumber; per-site column groups split by header Yield-token count +
  x-coord footnote bucketing), corn + soy per site, 2024+2025.

91 distinct seed brands across the three; majors confirmed present in the
independent rankings: DEKALB 395, Golden Harvest 249, Channel 241, NK 212,
Xitavo 135, LG 103, Pioneer 88, Asgrow 59. (A brand only appears where it
ENTERED a given program — e.g. Brevant not in Iowa, DEKALB/Channel not in
Illinois — true negatives, not parse gaps.)

- rag/chunk.py: gated `include_region` on _render_gh_plot_chunk; the 3 university
  sources route through it so the region/district is in the embedded chunk +
  labeled "variety trial (cross-vendor, independent third-party)". Existing plot
  sources (gh/lg/agrigold/proharvest) unchanged.
- requirements.txt: openpyxl (Illinois XLSX; scrape-time only).
- sources.json + README/CLAUDE/lessons: registered + attributed; lessons
  trial-data + Pioneer entries updated (Pioneer/DEKALB performance now available
  indirectly via these trials).

Validation: all 123 chunk via rag.chunk.chunks_from_trial (0 errors), 0
out-of-range yields, 0 dup keys. Public land-grant data; attribution recorded in
each tos_note. CI rebuilds the index from the committed corpus.

2026-06-10 08:35:50 -04:00

justin

9ce920f622

agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)

agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the
  three rated sections (Agronomics / Grain / Disease) with their
  <div class="row"><div class="label">label</div><div>value</div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
  — provides the Northern Plains HRS coverage WestBred lacks.

nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
  title <span>, "Brands Available" trait variants, marketing
  positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
  Drought) are not text-extractable; we record the labels with an
  explicit "rated in PDF chart, see tech sheet" value so the agent
  can direct the farmer at the source for those numbers.

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
  AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
  with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
  Beck's products — everything else IS now in the corpus.

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475 (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122  (41 corn / 81 soy)
  agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).

requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 14:16:36 -04:00

justin

ac40e05734

seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME

Image rebuild (skip scrape) / build (push) Failing after 7s

Details

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 12:28:49 -04:00

3 Commits