4 Commits

Author SHA1 Message Date
seed-mcp-refresh a892a83507 monthly refresh: 2026-07-01T06:26Z — bayer=946 gh=139 nk=122 agripro=24 ag_trials=14 gh_plot_reports=4299 pfr=0 2026-07-01 06:26:16 +00:00
seed-mcp-refresh e356633d4f monthly refresh: 2026-06-01T06:53Z — bayer=931 gh=139 nk=122 agripro=24 ag_trials=14 gh_plot_reports=4299 pfr=0 2026-06-01 06:53:15 +00:00
justin eaa7e0789b bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton
User flagged that Channel is expanding into their area — re-walked
the cropscience.bayer.us sitemap and found 8 additional brand×crop
paths beyond the original DEKALB/Asgrow/WestBred triple. Patches
the scraper to walk all of them; total Bayer varieties roughly
doubles from 475 to 931 and the corpus picks up first-ever
coverage in sorghum (36), cotton (30), canola (6), and silage as a
distinct crop (was conflated with corn before).

Net new varieties: 456
  Channel    corn=181  soy=67   silage=54  sorghum=18    (320)
  DEKALB     silage=82 sorghum=18  canola=6              (106)
  Deltapine  cotton=30                                    (30)

scrape/sources/bayer_seeds.py
- Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1
  suffix) with a flatter `BRAND_PATHS` list of (brand, url_path,
  crop, is_primary_for_brand) entries. Channel and DEKALB are now
  multi-crop brands; the same scraper walks every brand×crop pair.
- source_key derivation: for a brand's PRIMARY crop, strip the
  trailing `-<crop>` suffix (matches the existing deployed source
  keys for DEKALB corn / Asgrow soy / WestBred wheat). For
  SECONDARY crops, KEEP the suffix so that a single DEKALB SKU sold
  as both grain corn and silage gets two distinct source_keys
  (collision-safe and unambiguous for `lookup_variety`).
- New `--crop` CLI filter for incremental backfills.
- Log line shows brand + crop alongside source_key for visibility.
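
A minimal sketch of the source_key rule and the `--crop` filter described above. It assumes each variety has a URL slug ending in `-<crop>`. The `BrandPath` type, the `derive_source_key` and `select_paths` helpers, the placeholder paths, and the example slugs are all hypothetical and are not taken from bayer_seeds.py.

```python
from typing import NamedTuple, Optional


class BrandPath(NamedTuple):
    brand: str
    url_path: str               # placeholder paths, not the real sitemap paths
    crop: str
    is_primary_for_brand: bool


# Hypothetical subset for illustration; the real list is in bayer_seeds.py.
BRAND_PATHS = [
    BrandPath("DEKALB", "/placeholder/dekalb-corn", "corn", True),
    BrandPath("DEKALB", "/placeholder/dekalb-silage", "silage", False),
    BrandPath("Asgrow", "/placeholder/asgrow-soy", "soy", True),
]


def derive_source_key(slug: str, entry: BrandPath) -> str:
    """Drop the trailing -<crop> suffix only for the brand's primary crop."""
    suffix = f"-{entry.crop}"
    if entry.is_primary_for_brand and slug.endswith(suffix):
        return slug[: -len(suffix)]
    return slug  # secondary crops keep the suffix to avoid key collisions


def select_paths(crop: Optional[str] = None) -> list:
    """Mimic a --crop filter: None walks every brand x crop pair."""
    return [e for e in BRAND_PATHS if crop is None or e.crop == crop]


# Assumed slug format: the same SKU listed under grain corn and under silage.
corn_key = derive_source_key("dkc64-44rib-corn", BRAND_PATHS[0])
silage_key = derive_source_key("dkc64-44rib-silage", BRAND_PATHS[1])
```

Under these assumptions the primary-crop key loses its suffix while the silage key keeps it, so the two listings cannot collide.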

rag/chunk.py
- Channel + Deltapine pages use slightly different characteristics
  group labels (DISEASE not DISEASE RATINGS, AGRONOMIC
  CHARACTERISTICS not GROWTH/HARVEST, plus MATURITY / ADAPTATION /
  HERBICIDES / OTHER). Fold them into the DISEASE / AGRONOMIC /
  MANAGEMENT label sets so the chunker buckets them correctly
  into the standard sections.
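
The label folding above could look roughly like the sketch below. The label spellings come from this message, but the `SECTION_LABELS` name, the `section_for` helper, and the assignment of MATURITY, ADAPTATION, HERBICIDES, and OTHER to particular sections are assumptions made for illustration; rag/chunk.py may group them differently.

```python
from typing import Optional

# Assumed grouping: which section each extra label belongs to is a guess.
SECTION_LABELS = {
    "DISEASE": {"DISEASE RATINGS", "DISEASE"},
    "AGRONOMIC": {
        "GROWTH/HARVEST",
        "AGRONOMIC CHARACTERISTICS",
        "MATURITY",
        "ADAPTATION",
    },
    "MANAGEMENT": {"HERBICIDES", "OTHER"},
}


def section_for(group_label: str) -> Optional[str]:
    """Map a page's characteristics-group label to a standard section."""
    # Uppercase and collapse whitespace so minor page differences still match.
    normalized = " ".join(group_label.upper().split())
    for section, labels in SECTION_LABELS.items():
        if normalized in labels:
            return section
    return None  # unknown labels are left for the caller to handle
```

Returning `None` for unrecognized labels keeps new vendor-specific headings visible instead of silently dropping them into the wrong section.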

Smoke-tested cross-brand × cross-crop queries against the rebuilt
index (5,529 chunks total) — all 6 sample queries surface the
right brand+crop in the top 3:
  Channel corn 110 RM       → 210-25TRE BRAND
  Channel soy 2.5 MG IA     → 2622RXF BRAND
  Deltapine cotton XF       → DP 1820 B3XF BRAND
  Sorghum dryland Kansas    → 6B95 BRAND (Channel)
  Silage corn WI dairy      → DKC64-44RIB BRAND BLEND (silage variant)
  Canola Northern Plains    → DK401TL BRAND

Watchtower will pull the new image on the next push; deploy is
unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:54:30 -04:00
justin 75f714b454 Phase 4-5: deployable container + corpus snapshot + CI fixes
deploy/docker-compose.yml — replace <product>/<registry> placeholders
with concrete values for Drawbar's stack:
- image: git.jpaul.io/justin/seed-mcp:latest (CF tunnel for pulls; CI
  pushes via LAN 192.168.0.2:1234 to avoid the 100 MB body cap)
- container_name: seed-mcp
- port 8001:8000 (8001 on the host side to avoid colliding with
  crop-chem-docs on 8000)
- PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP
- llama-rerank shared with crop-chem-docs (NOT redefined here —
  expected to already be in Drawbar's parent compose network)
- networks.drawbar-mcp external: true so seed-mcp joins the existing
  cross-MCP shared network

.gitignore — corpus/ is now COMMITTED, not ignored. The monthly
refresh workflow scrapes and commits corpus changes; the image-only
workflow rebuilds indexes from the committed corpus. Allowing the
corpus to flow through git means the :corpus-YYYY.MM.DD image tag
pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain
ignored — those are deterministically derived from corpus.

Initial committed snapshot: 614 varieties.
- bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85)
- golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs
  302-redirected = discontinued)

rag/chunk.py — normalize brand to uppercase and crop to lowercase in
Chroma metadata so cross-vendor brand-filter lookups don't break on
inconsistent casing (Bayer stores "DEKALB", Golden Harvest stores
"Golden Harvest"; before the fix, _build_where uppercased the
user-supplied brand, which matched the former but not the latter).
Sidecar JSON keeps original casing for display.
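
A minimal sketch of the casing fix, assuming brand is stored uppercase and crop lowercase. The `normalize_metadata` and `build_where` functions are hypothetical stand-ins, not the code in rag/chunk.py or the real `_build_where`; the filter dicts follow Chroma's `where` syntax.

```python
from typing import Optional


def normalize_metadata(brand: str, crop: str) -> dict:
    """Index-time normalization: brand uppercase, crop lowercase."""
    return {"brand": brand.strip().upper(), "crop": crop.strip().lower()}


def build_where(brand: Optional[str] = None,
                crop: Optional[str] = None) -> Optional[dict]:
    """Query-time filter using the same casing rules as the index."""
    clauses = []
    if brand:
        clauses.append({"brand": brand.strip().upper()})
    if crop:
        clauses.append({"crop": crop.strip().lower()})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}  # Chroma combines multiple conditions with $and


stored = normalize_metadata("Golden Harvest", "Corn")
query = build_where(brand="golden harvest")
```

Because both sides apply the same rule, a filter on 'Golden Harvest' now matches the stored "GOLDEN HARVEST" value, while the sidecar JSON can still carry the display casing.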

Stub scrapers (nk, agripro, becks_pfr, becks_products) — change
return code from 2 to 0 so the monthly-refresh CI workflow doesn't
fail on deferred sources. Real implementations will return 0 on
success / 1 on failure when they ship.
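
As an illustration of the exit-code convention, a deferred stub might look like the sketch below. The `main` signature and the log text are assumptions, not the contents of the actual stub modules.

```python
import sys
from typing import Optional, Sequence


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Deferred source: report that nothing was scraped and exit cleanly."""
    print("stub scraper: source deferred, nothing scraped", file=sys.stderr)
    # Previously 2, which made the monthly-refresh workflow fail.
    return 0
```

A script entry point would pass this return value to `sys.exit`, so the CI step sees exit status 0 for deferred sources.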

Smoke-tested cross-vendor retrieval against the 614-chunk index:
- list_versions shows both vendors with correct facet counts
- broad "corn hybrid 100 RM" query returns both DEKALB and Golden
  Harvest hits in top 5
- brand='Golden Harvest' filter returns 3 GH-only varieties
- variety-code prefilter still works (E085Z5 → top hit on GH)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:40:05 -04:00