Commit Graph

13 Commits

Author SHA1 Message Date
justin 7b3da908e0 Merge pull request 'agripro + nk scrapers — 146 Syngenta varieties added (760 total in corpus)' (#6) from agripro-nk-scrapers into main
Image rebuild (skip scrape) / build (push) Failing after 35s
2026-05-25 14:17:18 -04:00
justin 9ce920f622 agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)
agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
  explicit GET params (sidesteps the AJAX-only-on-load default that
  returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
  .field--node--tag-line--variety, .field--node--body, plus the
  three rated sections (Agronomics / Grain / Disease) with their
  <div class="row"><div class="label">label</div><div>value</div>
  pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
  — provides the Northern Plains HRS coverage WestBred lacks.

nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
  POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
  {"d": "<html>"} where the inner HTML is one <div class="sf-result">
  per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
  title <span>, "Brands Available" trait variants, marketing
  positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
  * corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
    Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
    "Label N" lines (text-extractable)
  * soybean Phytophthora source genes (Rps1c, Rps3a, ...)
  * soybean SCN race coverage
  * soybean agronomic ratings (Emergence, Standability, Shatter
    Tolerance, Green Stem) with text-extractable 1-9 values
  * soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
    prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
  Drought) are not text-extractable; we record the labels with an
  explicit "rated in PDF chart, see tech sheet" value so the agent
  can direct the farmer at the source for those numbers.
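
The split between text-extractable ratings and chart-only ratings could be handled roughly as below. This is a minimal sketch, assuming the pdfplumber text is already available as a string; `parse_rating_lines`, `CHART_ONLY`, and the sample text are hypothetical names and illustrative values, not the scraper's actual code or real tech-sheet data.

```python
import re

# Placeholder recorded for ratings that exist only as chart bars in the PDF.
CHART_ONLY = "rated in PDF chart, see tech sheet"

# Matches a "Label N" line: a text label followed by a single 1-9 digit.
RATING_LINE = re.compile(r"^(?P<label>[A-Za-z][A-Za-z' ]*?)\s+(?P<value>[1-9])$")


def parse_rating_lines(text, chart_only_labels=()):
    """Collect 'Label N' ratings from extracted PDF text.

    Labels in chart_only_labels that have no numeric line get the
    CHART_ONLY placeholder instead of a guessed number.
    """
    ratings = {}
    for line in text.splitlines():
        match = RATING_LINE.match(line.strip())
        if match:
            ratings[match.group("label")] = int(match.group("value"))
    for label in chart_only_labels:
        ratings.setdefault(label, CHART_ONLY)
    return ratings


# Illustrative text only; the numbers are made up for the example.
sample = "Gray Leaf Spot 3\nGoss's Wilt 4\nStalk Strength\n"
print(parse_rating_lines(sample, chart_only_labels=("Stalk Strength",)))
```

Using `setdefault` means a number that is found in the text always wins over the placeholder, so the placeholder never overwrites a real value.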

Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
  REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
  footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
  AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
  with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
  Beck's products — everything else IS now in the corpus.
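
To illustrate why the direction tag matters when comparing across vendors, here is a minimal sketch that maps either convention onto a "9 = best" scale. The tag string for the 9 = best case is the one quoted in the Bayer scraper commit; the 1 = best tag spelling and the function name are assumptions. It is for comparison only and does not imply the corpus stores converted values.

```python
NINE_IS_BEST = "1-9 (9 = best)"  # tag quoted for Bayer in this log
ONE_IS_BEST = "1-9 (1 = best)"   # assumed spelling for NK / AgriPro


def to_nine_is_best(value, scale_direction):
    """Return the rating expressed on a 1-9 scale where 9 is best."""
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range: {value}")
    if scale_direction == NINE_IS_BEST:
        return value
    if scale_direction == ONE_IS_BEST:
        # Mirror the scale: 1 becomes 9, 5 stays 5, 9 becomes 1.
        return 10 - value
    raise ValueError(f"unknown scale direction: {scale_direction!r}")


# AgriPro-style "Stripe Rust 1" is the top rating once mirrored.
print(to_nine_is_best(1, ONE_IS_BEST))
```

Raising on an unknown tag, instead of defaulting to one direction, keeps a mislabeled source from silently inverting its ratings.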

Cross-vendor coverage after this PR: 760 varieties.
  bayer_seeds     475 (DEKALB 288 / Asgrow 102 / WestBred 85)
  golden_harvest  139
  nk              122  (41 corn / 81 soy)
  agripro          24  (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).

requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:16:36 -04:00
justin 2588ebafa1 Merge pull request 'Phase 4-5: deployable container + corpus snapshot (614 varieties)' (#5) from phase-4-5-deploy into main
Image rebuild (skip scrape) / build (push) Failing after 29s
2026-05-25 13:40:41 -04:00
justin 75f714b454 Phase 4-5: deployable container + corpus snapshot + CI fixes
deploy/docker-compose.yml — replace <product>/<registry> placeholders
with concrete values for Drawbar's stack:
- image: git.jpaul.io/justin/seed-mcp:latest (CF tunnel for pulls; CI
  pushes via LAN 192.168.0.2:1234 to avoid 100 MB body cap)
- container_name: seed-mcp
- port 8001:8000 (8001 host-side to not collide with crop-chem-docs
  on 8000)
- PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP
- llama-rerank shared with crop-chem-docs (NOT redefined here —
  expected to already be in Drawbar's parent compose network)
- networks.drawbar-mcp external: true so seed-mcp joins the existing
  cross-MCP shared network

.gitignore — corpus/ is now COMMITTED, not ignored. The monthly
refresh workflow scrapes and commits corpus changes; the image-only
workflow rebuilds indexes from the committed corpus. Allowing the
corpus to flow through git means the :corpus-YYYY.MM.DD image tag
pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain
ignored — those are deterministically derived from corpus.
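
The snapshot tag format can be sketched as below. This assumes the tag is derived from the refresh date; the helper name `corpus_tag` is hypothetical and the workflow may build the tag differently (for example in shell).

```python
from datetime import date


def corpus_tag(snapshot_date):
    """Format an image tag of the form corpus-YYYY.MM.DD."""
    return snapshot_date.strftime("corpus-%Y.%m.%d")


print(corpus_tag(date(2026, 5, 25)))  # corpus-2026.05.25
```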

Initial committed snapshot: 614 varieties.
- bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85)
- golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs
  302-redirected = discontinued)

rag/chunk.py: normalize brand to uppercase and crop to lowercase in
Chroma metadata so cross-vendor brand-filter lookups don't break on
casing inconsistency (Bayer stores "DEKALB", Golden Harvest stores
"Golden Harvest"; _build_where uppercases the user-supplied brand,
which matched the former but not the latter before this fix). Sidecar
JSON keeps original casing for display.
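
The casing fix can be sketched as follows. This is a simplified illustration under assumptions: `normalize_facets` and `build_brand_filter` are hypothetical stand-ins, not the real rag/chunk.py or `_build_where` code.

```python
def normalize_facets(brand, crop):
    """Normalize filterable metadata at index time."""
    return {"brand": brand.strip().upper(), "crop": crop.strip().lower()}


def build_brand_filter(user_brand):
    """Normalize the user-supplied brand the same way at query time."""
    return {"brand": user_brand.strip().upper()}


indexed = normalize_facets("Golden Harvest", "Corn")
query = build_brand_filter("golden harvest")

# Both sides are uppercased, so the filter matches regardless of input casing.
print(indexed["brand"] == query["brand"])  # True
```

Normalizing on both the index side and the query side is what makes the match independent of how each vendor spells its brand.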

Stub scrapers (nk, agripro, becks_pfr, becks_products) — change
return code from 2 to 0 so the monthly-refresh CI workflow doesn't
fail on deferred sources. Real implementations will return 0 on
success / 1 on failure when they ship.

Smoke-tested cross-vendor retrieval against the 614-chunk index:
- list_versions shows both vendors with correct facet counts
- broad "corn hybrid 100 RM" query returns both DEKALB and Golden
  Harvest hits in top 5
- brand='Golden Harvest' filter returns 3 GH-only varieties
- variety-code prefilter still works (E085Z5 → top hit on GH)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:40:05 -04:00
justin 9d4a490731 Merge pull request 'golden_harvest: implement scraper (~175 Syngenta corn + soy)' (#4) from golden-harvest-scraper into main
Image rebuild (skip scrape) / build (push) Failing after 6s
2026-05-25 13:31:24 -04:00
justin 1409c2617d golden_harvest: implement scraper (~175 Syngenta corn + soy)
Sitemap-driven scraper for goldenharvestseeds.com. Walks
sitemap-ghs-hybrids.xml to discover product URLs under
/products/corn/ and /products/soybean/ (~89 + 86 = 175 candidates).

Per-variety detail parsed from server-rendered HTML:

- product code (from <h1> / <title>)
- positioning (from <meta name="Description">)
- maturity (from <div class="product-label"><div class="right">):
  integer days for corn, decimal MG for soybeans
- traits derived from product-code suffix (XF, E3, VIP3, GT, Z, etc.)
- 9-row disease tolerance bar chart (#dvDiseaseTolerance) where
  data-percentage / 10 = rating on 1-9 (9 = best) scale
- 9-row agronomic characteristics bar chart (#dvAgronomicChar)
- recommended environment list (.AgronomicMange — upstream typo)
- all 2-column tables (plant description, seed quality, herbicide
  responses, Phytophthora gene, SCN race coverage)
- tech-sheet PDF URL from live HTML (not sitemap — that's stale)
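
Two of the conversions in the list above, sketched under assumptions: the bar's `data-percentage` attribute and the maturity text are taken as already extracted from the HTML, and the function names are hypothetical.

```python
def rating_from_percentage(data_percentage):
    """Convert a bar's data-percentage attribute to a 1-9 rating."""
    rating = float(data_percentage) / 10
    if not 1 <= rating <= 9:
        raise ValueError(f"rating outside 1-9: {rating}")
    return rating


def parse_maturity(raw, crop):
    """Corn maturity is integer days (RM); soybean is a decimal group (MG)."""
    text = raw.strip()
    if crop == "corn":
        return {"rm": int(text)}
    if crop == "soybean":
        return {"mg": float(text)}
    raise ValueError(f"unsupported crop: {crop!r}")


print(rating_from_percentage("80"))
print(parse_maturity("0.2", "soybean"))
```

Rejecting values outside 1-9 surfaces a changed chart layout as an error instead of writing an impossible rating into the sidecar.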

302 redirects to /product-finder treated as "discontinued" and
skipped (Golden Harvest still sitemap-lists some retired SKUs).
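
The discontinued check could look roughly like this. It is a sketch that assumes the HTTP client is configured not to follow redirects, so the 302 status and Location header are visible; `classify_fetch` and its return labels are hypothetical.

```python
from urllib.parse import urlsplit


def classify_fetch(status, location=None):
    """Classify a product-page response as live, discontinued, or unexpected."""
    if status == 302 and location:
        # Compare only the path so absolute and relative Location values both work.
        path = urlsplit(location).path.rstrip("/")
        if path.endswith("/product-finder"):
            return "discontinued"
    if status == 200:
        return "live"
    return "unexpected"


print(classify_fetch(302, "/product-finder"))
print(classify_fetch(200))
```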

Rating scale: 1-9 (9 = best) — same as Bayer despite recon's
"9-to-1" descriptor (that referred to chart-axis direction, not
numeric meaning). _scale_direction is set explicitly so the chunker
stays forward-compatible.

PDFs are NOT downloaded (recon flagged ~14MB each); tech-sheet URLs
are captured in the sidecar for future enrichment.

Smoke-tested all branches: 6 corn varieties (E085Z5, E092W5,
E094Z4, E095D3, E097K6, E100A3) with full 6 characteristics groups
+ tech-sheet URL; 3 soy varieties (GH00864XF MG 0.08, GH00973E3
MG 0.09, GH0225XF MG 0.2) with disease + agronomic bars; 302
redirects skipped cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:30:30 -04:00
justin 28d8cb83b3 Merge pull request 'Phase 11: crop_seed_api_lessons tool + Pioneer fallback' (#3) from api-lessons-pioneer-fallback into main
Image rebuild (skip scrape) / build (push) Failing after 7s
2026-05-25 13:19:17 -04:00
justin 4009dc0b78 Phase 11: crop_seed_api_lessons tool + Pioneer fallback
Add the fifth MCP tool — crop_seed_api_lessons(topic?) — backed by
docs_mcp/lessons.md, the ONLY source of opinionated content in the
server. Everything else (search_docs, get_page, lookup_variety)
returns verbatim from vendor catalogs; lessons.md fills the gaps
the corpus can't cover.

The Pioneer fallback is the critical anti-hallucination piece:
Pioneer's ToS bans automation, so the corpus has no Pioneer data.
Without this tool, an agent might surface Bayer/Asgrow chunks as
mediocre matches for a Pioneer query. The tool's docstring tells
the agent to call it on any Pioneer / P-series question; the
'pioneer' section says clearly:

  "I don't have Pioneer's variety data indexed... please consult
  Pioneer or an extension service."

  "Do NOT invent Pioneer hybrid ratings."

Other lesson sections cover knowledge the agent needs to interpret
search_docs / get_page output correctly:

- rating-scales: Bayer 1-9, Golden Harvest 9-to-1, what
  R/MR/S/Rps1c/R3 mean in soybean disease columns
- maturity-semantics: corn RM days vs soybean MG vs wheat class +
  qualitative early/medium/late
- trait-glossary: SSRIB, VT2PRIB, XF, E3, Conkesta, Clearfield, etc.
- scn-resistance: race coverage + Peking vs PI 88788 source
- regional-listings: how to interpret Bayer's "local profiles"
- sources-not-yet-indexed: which vendors aren't in the corpus yet
- checking-your-work: always call lookup_variety before quoting

Lesson lookup prefers slug-match (returns just `rating-scales` for
topic="rating", not every section that mentions ratings); falls
back to body-match only when no slug matches.
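
The slug-first lookup described above can be sketched as follows. This is a minimal illustration, assuming lessons are held as a slug-to-body mapping; `find_lessons` and the abbreviated bodies are hypothetical, not the server's actual code or lessons.md text.

```python
def find_lessons(sections, topic=None):
    """Return matching section slugs, preferring slug matches over body matches."""
    if topic is None:
        return list(sections)  # no topic: full index
    needle = topic.lower()
    slug_hits = [slug for slug in sections if needle in slug]
    if slug_hits:
        return slug_hits
    # Fall back to body text only when no slug matched.
    return [slug for slug, body in sections.items() if needle in body.lower()]


sections = {
    "rating-scales": "How each vendor's 1-9 scale reads.",
    "scn-resistance": "Race coverage and source ratings.",
    "pioneer": "Do NOT invent Pioneer hybrid ratings.",
}
print(find_lessons(sections, "rating"))  # ['rating-scales']
print(find_lessons(sections, "zzzzzz"))  # []
```

All three sample bodies mention ratings, yet `topic="rating"` returns only the section whose slug matches, which is the behavior the commit describes.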

Smoke-tested with topic=pioneer, topic=rating, topic=trait,
topic=zzzzzz (no match), and topic=None (full index = 10K chars,
8 sections).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:18:57 -04:00
justin 3cab941c08 Phase 2/3: chunker + indexer + MCP server tools (#2)
Image rebuild (skip scrape) / build (push) Failing after 6s
2026-05-25 13:14:58 -04:00
justin a766756a05 Phase 2/3: chunker + indexer + MCP server tools
Phase 2 — Chunking and indexing
- rag/chunk.py: replace template chunker with seed-variety-specific
  chunks_from_variety(). One chunk per variety (varieties are small
  and named-rating retrieval signal is best kept together). Output
  is rebuilt deterministically from the sidecar JSON: every value is
  verbatim from the source, only framing language ("Disease ratings
  (1-9, 9=best):") is template glue. Anti-hallucination contract:
  same sidecar in → same chunk out, never a fabricated rating.
  Metadata flattened to Chroma-safe primitives (str/int/float/bool):
  source, source_key, vendor, brand, crop, product_name,
  product_id, source_url, rm (corn), mg (soy), wheat_class,
  release_year, trait_codes_csv, rating_scale.
- rag/index.py: walks corpus/<source>/<source_key>.json sidecars
  via the new chunker. Default PRODUCT_NAME=crop_seed so the
  Chroma collection is crop_seed_docs.
- rag/bm25.py: filterable columns updated to seed-domain facets
  (source/vendor/brand/crop/source_key) instead of the template's
  version/platform/product.
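
The "same sidecar in, same chunk out" contract can be sketched as below. The sidecar keys `disease_ratings` and `trait_codes`, the function names, and the sample values are assumptions for illustration; they are not the real schema or catalog data.

```python
def flatten_metadata(sidecar):
    """Keep only Chroma-safe primitives and join list fields into CSV strings."""
    meta = {}
    for key in ("source", "source_key", "vendor", "brand", "crop", "product_name"):
        value = sidecar.get(key)
        if isinstance(value, (str, int, float, bool)):
            meta[key] = value
    meta["trait_codes_csv"] = ",".join(sidecar.get("trait_codes", []))
    return meta


def render_chunk(sidecar):
    """Render chunk text: values come from the sidecar, only framing is templated."""
    lines = [f"{sidecar['brand']} {sidecar['product_name']} ({sidecar['crop']})"]
    ratings = sidecar.get("disease_ratings", {})
    if ratings:
        lines.append(f"Disease ratings ({sidecar['rating_scale']}):")
        lines.extend(f"- {label}: {value}" for label, value in ratings.items())
    return "\n".join(lines)


# Illustrative sidecar; the rating value is made up for the example.
sidecar = {
    "source": "bayer_seeds", "source_key": "dkc62-08rib", "vendor": "Bayer",
    "brand": "DEKALB", "crop": "corn", "product_name": "DKC62-08RIB",
    "rating_scale": "1-9, 9=best", "trait_codes": ["SSRIB"],
    "disease_ratings": {"Gray Leaf Spot": 6},
}
print(render_chunk(sidecar))
print(flatten_metadata(sidecar))
```

Because the renderer reads nothing but the sidecar, a missing rating simply produces no line, never an invented one.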

Phase 3 — MCP server tools wired up
- search_docs: hybrid dense (Chroma) + BM25 (FTS5) retrieval with
  RRF fusion. Optional filters: crop, brand, vendor, source.
  Variety-code prefilter pins exact source_key / product_name /
  hybrid_prefix matches at the top — dense embeddings have no
  semantic neighbor for tokens like "DKC62-08RIB" and RRF can let
  noise float to #1 without this pin. Each response carries the
  variety's source URL inline so the agent can cite.
- get_page(source, source_key): emits a structured ratings header
  (verbatim from sidecar, table per characteristics group, vendor
  positioning, regional listings) followed by the raw indexed body.
  This is the canonical fact-check surface.
- list_versions(): facet discovery — distinct sources, vendors,
  brands, crops across the corpus.
- lookup_variety(source_key, source?): returns the raw sidecar JSON
  for one variety. The agent should call this BEFORE quoting any
  specific rating value to a farmer — guaranteed verbatim.
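
A minimal sketch of reciprocal rank fusion with a variety-code pin, to show why the pin matters. Assumptions: `k=60` is the commonly used RRF constant (the value used by the server is not stated here), and `rrf_fuse`, `search`, and `known_keys` are hypothetical names.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def search(query, dense_hits, bm25_hits, known_keys):
    """Pin exact variety-code matches ahead of the fused ranking."""
    fused = rrf_fuse([dense_hits, bm25_hits])
    tokens = {token.upper() for token in query.split()}
    pinned = sorted(key for key in known_keys if key.upper() in tokens)
    return pinned + [doc_id for doc_id in fused if doc_id not in pinned]


dense = ["DKC44-97RIB", "AG29XF4"]   # dense retrieval misses the exact code
bm25 = ["AG29XF4", "DKC62-08RIB"]    # BM25 finds it, but only at rank 2
keys = {"DKC44-97RIB", "AG29XF4", "DKC62-08RIB"}

print(rrf_fuse([dense, bm25]))
print(search("DKC62-08RIB ratings", dense, bm25, keys))
```

In this toy input, plain fusion ranks the exact code last because only one retriever returned it; the pin moves it to the top.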

Smoke tests against 475 indexed Bayer varieties:
- list_versions returns 475 varieties, 1 source, 1 vendor, 3 brands,
  3 crops with correct per-brand counts (288/102/85).
- Semantic ag queries find the right candidates: short-season
  drought-tolerant corn → DKC44-97RIB at RM 94 (in 90-95 band);
  SCN+MG3 soybean → Asgrow XF varieties with explicit SCN R3 ratings;
  Phytophthora Rps3a soy → AG07XF4 (right gene); stripe-rust
  wheat → WestBred WB1376CLP (Yellow Rust 2 = best).
- Variety-code lookups work via prefilter: DKC62-08RIB, AG29XF4,
  WB6430 all return as #1 hit. BM25 confirms ranking unambiguously
  (top-1 score -13.2 vs -8.5 for #2 on "DKC62-08RIB ratings").
- Server boots cleanly in stdio AND streamable-http modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:14:16 -04:00
justin 0fb8d9d92d bayer_seeds: implement Phase 1 scraper (#1)
Image rebuild (skip scrape) / build (push) Failing after 6s
2026-05-25 12:54:50 -04:00
justin 2a4c0d4aba bayer_seeds: implement Phase 1 scraper for DEKALB + Asgrow + WestBred
Replace stub with working scraper for all three Bayer seed brands.
Discovery uses the public sitemap-dynamic.xml (475 varieties:
288 DEKALB corn + 102 Asgrow soy + 85 WestBred wheat — matches recon).
Per-variety detail comes from the page's __NEXT_DATA__ JSON island.
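
Pulling the `__NEXT_DATA__` island out of a page can be sketched as below. This assumes the standard Next.js script tag shape; the `pageProps` layout in the sample is a guess for illustration, and `extract_next_data` is a hypothetical helper, not the scraper's actual function.

```python
import json
import re

# Next.js embeds page state in a JSON script tag with this id.
NEXT_DATA = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Illustrative page; the nesting under pageProps is an assumption.
page = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"hybridLabel": "DKC62-08RIB"}}}'
    "</script></body></html>"
)
data = extract_next_data(page)
print(data["props"]["pageProps"]["hybridLabel"])  # DKC62-08RIB
```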

Each variety writes corpus/bayer_seeds/<source_key>.{md,json} with:
- Identity (brand, crop, hybridLabel, productId, releaseYear)
- Maturity routed per crop (RM for corn, MG for soy, qualitative for wheat)
- Trait stack (code + full name)
- Positioning + strengths narrative
- Characteristics groups (DISEASE RATINGS, GROWTH, MANAGEMENT, HARVEST,
  etc.) preserved verbatim from source so the chunker can re-bucket
  into canonical disease/agronomic flats per CLAUDE.md schema
- Regional seed-guide listings with agronomist contacts
- _scale_direction tag (Bayer = "1-9 (9 = best)") for chunker

Smoke-tested all three brands (--limit 2 each, plus --product, --force,
and scrape.runner dispatch). Politeness: 1 req/sec, retries on 429/5xx
with Retry-After honored.
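
The retry policy can be sketched as a pure function that decides how long to wait. Assumptions: exponential backoff is used when no Retry-After header is present, only the delta-seconds form of Retry-After is handled (not the HTTP-date form), and `retry_delay` with its defaults is hypothetical.

```python
def retry_delay(status, headers, attempt, base=1.0, cap=60.0):
    """Return seconds to wait before retrying, or None if the status is not retryable."""
    if status != 429 and not 500 <= status < 600:
        return None
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        # Honor the server's hint, bounded by the cap.
        return min(float(retry_after), cap)
    # No usable hint: back off exponentially from the base delay.
    return min(base * 2 ** attempt, cap)


print(retry_delay(429, {"Retry-After": "7"}, attempt=0))
print(retry_delay(503, {}, attempt=2))
print(retry_delay(404, {}, attempt=0))
```

Keeping the decision separate from the HTTP call makes the policy easy to test without network access.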

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:53:46 -04:00
justin ac40e05734 seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME
Image rebuild (skip scrape) / build (push) Failing after 7s
Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:28:49 -04:00