Phase 4-5: deployable container + corpus snapshot (614 varieties) #5

Merged
justin merged 1 commit from phase-4-5-deploy into main 2026-05-25 13:40:42 -04:00

Summary

  • deploy/docker-compose.yml — fill in concrete values for Drawbar's stack: image git.jpaul.io/justin/seed-mcp:latest, host port 8001 (so we don't collide with crop-chem-docs on 8000), PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP. The shared llama-rerank is NOT redefined here — Drawbar's parent stack already has it from the crop-chem-docs deploy.
  • .gitignore — corpus/ is now COMMITTED, not ignored. The monthly refresh workflow scrapes and pushes corpus diffs; the image-only workflow rebuilds indexes from the committed corpus. This lets :corpus-YYYY.MM.DD image tags pin to a specific seed-catalog snapshot. chroma/ and bm25/ remain ignored.
  • Initial committed corpus: 614 varieties indexed.
    • bayer_seeds: 475 (DEKALB 288 / Asgrow 102 / WestBred 85)
    • golden_harvest: 139 (36 sitemap URLs returned 302 redirects and were treated as discontinued)
  • rag/chunk.py — normalize brand to uppercase in Chroma metadata so brand-filter lookups don't break on inconsistent vendor casing (Bayer stores "DEKALB", Syngenta stores "Golden Harvest"; _build_where uppercases the user-supplied brand, which matched the former but not the latter). Sidecar JSON keeps original casing for display.
  • Stub scrapers (nk, agripro, becks_pfr, becks_products) — return code 0 instead of 2 so the monthly refresh CI workflow doesn't fail on deferred sources.
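The brand normalization described above can be sketched roughly as follows. This is an illustration, not the code in rag/chunk.py: the record shape and the helper names (normalize_brand, split_record, build_where) are assumptions, and the list comprehension stands in for a Chroma equality filter.

```python
def normalize_brand(brand: str) -> str:
    """Canonical form shared by stored metadata and query filters."""
    return brand.strip().upper()


def split_record(record: dict) -> tuple[dict, dict]:
    """Return (index_metadata, sidecar) for one scraped variety.

    The record shape is hypothetical, not the project's schema.
    """
    metadata = {"brand": normalize_brand(record["brand"])}  # filter key
    sidecar = {"brand": record["brand"]}                    # display casing
    return metadata, sidecar


def build_where(brand: str) -> dict:
    """Equality filter on the normalized brand."""
    return {"brand": normalize_brand(brand)}


records = [{"brand": "DEKALB"}, {"brand": "Golden Harvest"}]
indexed = [split_record(r) for r in records]

# Stand-in for a metadata-filtered query against the index.
where = build_where("Golden Harvest")
hits = [sidecar["brand"] for metadata, sidecar in indexed
        if metadata["brand"] == where["brand"]]
print(hits)
```

Normalizing on both the write path and the query path means a filter for "Golden Harvest", "golden harvest", or "GOLDEN HARVEST" resolves to the same stored value, while the sidecar still carries the vendor's own casing for display.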

What this PR makes possible

After merge:

  • gh workflow run image-only.yml rebuilds the image from the committed corpus and pushes to 192.168.0.2:1234/justin/seed-mcp:latest (CI pushes over the LAN address to avoid the 100 MB body cap on the CF tunnel).
  • Drawbar's compose can pull git.jpaul.io/justin/seed-mcp:latest through the CF tunnel and mount usage logs, which makes the seed-mcp tools live for the farm-advisor agent.
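The two image references above, plus the :corpus-YYYY.MM.DD tag scheme, could be composed along these lines. This is a sketch only: the helper names are hypothetical, the snapshot date is an example value, and the real workflow may build its tags differently.

```python
from datetime import date

PUSH_REGISTRY = "192.168.0.2:1234"  # LAN address used by CI pushes (from the PR)
PULL_REGISTRY = "git.jpaul.io"      # hostname used by Drawbar's compose pulls
REPOSITORY = "justin/seed-mcp"


def corpus_tag(snapshot: date) -> str:
    """Tag that pins an image to one committed corpus snapshot."""
    return f"corpus-{snapshot:%Y.%m.%d}"


def image_ref(registry: str, tag: str = "latest") -> str:
    """Full image reference for a given registry host and tag."""
    return f"{registry}/{REPOSITORY}:{tag}"


snapshot = date(2026, 5, 25)  # example date only
print(image_ref(PUSH_REGISTRY))
print(image_ref(PULL_REGISTRY, corpus_tag(snapshot)))
```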

Test plan

  • Cross-vendor retrieval: list_versions shows both vendors with correct facet counts (614 = 475 + 139)
  • Broad "corn hybrid 100 RM" query returns BOTH DEKALB and Golden Harvest in top 5 (post-normalization)
  • brand='Golden Harvest' filter returns 3 GH-only varieties (pre-fix was zero results due to case mismatch)
  • Variety-code prefilter works across both vendors: E085Z5 → Golden Harvest top hit
  • Local server boot in HTTP mode + tools registered (5 tools: search_docs, get_page, list_versions, lookup_variety, crop_seed_api_lessons)
  • CI workflow run end-to-end — to verify after merge
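The facet-count check in the first item is simple arithmetic over the figures reported in this PR. A minimal sketch, assuming a nested vendor-to-brand mapping (the structure is illustrative and is not the output format of list_versions):

```python
# Counts as reported in the PR summary.
COUNTS = {
    "bayer_seeds": {"DEKALB": 288, "Asgrow": 102, "WestBred": 85},
    "golden_harvest": {"Golden Harvest": 139},
}


def vendor_totals(counts: dict) -> dict:
    """Sum per-brand counts into per-vendor totals."""
    return {vendor: sum(brands.values()) for vendor, brands in counts.items()}


totals = vendor_totals(COUNTS)
grand_total = sum(totals.values())
brand_count = sum(len(brands) for brands in COUNTS.values())

assert totals == {"bayer_seeds": 475, "golden_harvest": 139}
assert grand_total == 614
assert brand_count == 4
```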

Coverage now

  • 2 vendors / 4 brands / 3 crops / 614 indexed varieties
  • Pioneer fallback policy via crop_seed_api_lessons(topic='pioneer')
  • Deferred sources surfaced cleanly: nk, agripro, becks_pfr, becks_products (the lessons tool's sources-not-yet-indexed section tells the agent which vendors aren't there yet)
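The stub scrapers for these deferred sources now exit 0 rather than 2, and real scrapers are expected to return 0 on success and 1 on failure. The sketch below shows that convention. The function names and the way the refresh job aggregates exit codes are assumptions, not the project's implementation.

```python
import sys

EXIT_OK = 0      # success, or a deferred source that was skipped
EXIT_FAILED = 1  # a real scraper ran and failed


def stub_scraper(source: str) -> int:
    """Placeholder for a source that is not scraped yet."""
    print(f"{source}: not implemented yet, skipping", file=sys.stderr)
    return EXIT_OK


def run_refresh(scrapers: dict) -> int:
    """Run every scraper; report failure if any returns non-zero."""
    codes = {name: run(name) for name, run in scrapers.items()}
    failed = [name for name, code in codes.items() if code != EXIT_OK]
    return EXIT_FAILED if failed else EXIT_OK


DEFERRED = ["nk", "agripro", "becks_pfr", "becks_products"]
status = run_refresh({name: stub_scraper for name in DEFERRED})
print(status)
```

With the previous return code of 2, the same aggregation would have failed the monthly refresh even though nothing was actually broken.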
justin added 1 commit 2026-05-25 13:40:31 -04:00
deploy/docker-compose.yml — replace <product>/<registry> placeholders
with concrete values for Drawbar's stack:
- image: git.jpaul.io/justin/seed-mcp:latest (CF tunnel for pulls; CI
  pushes via LAN 192.168.0.2:1234 to avoid 100 MB body cap)
- container_name: seed-mcp
- port 8001:8000 (8001 host-side to not collide with crop-chem-docs
  on 8000)
- PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP
- llama-rerank shared with crop-chem-docs (NOT redefined here —
  expected to already be in Drawbar's parent compose network)
- networks.drawbar-mcp external: true so seed-mcp joins the existing
  cross-MCP shared network

.gitignore — corpus/ is now COMMITTED, not ignored. The monthly
refresh workflow scrapes and commits corpus changes; the image-only
workflow rebuilds indexes from the committed corpus. Allowing the
corpus to flow through git means the :corpus-YYYY.MM.DD image tag
pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain
ignored — those are deterministically derived from corpus.

Initial committed snapshot: 614 varieties.
- bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85)
- golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs
  302-redirected = discontinued)

rag/chunk.py — normalize brand to uppercase and crop to lowercase in
Chroma metadata so cross-vendor brand-filter lookups don't break on
casing inconsistency (Bayer stores "DEKALB", Golden Harvest stores
"Golden Harvest"; _build_where uppercases the user-supplied brand,
which matched the former but not the latter before this fix). Sidecar
JSON keeps original casing for display.

Stub scrapers (nk, agripro, becks_pfr, becks_products) — change
return code from 2 to 0 so the monthly-refresh CI workflow doesn't
fail on deferred sources. Real implementations will return 0 on
success / 1 on failure when they ship.

Smoke-tested cross-vendor retrieval against the 614-chunk index:
- list_versions shows both vendors with correct facet counts
- broad "corn hybrid 100 RM" query returns both DEKALB and Golden
  Harvest hits in top 5
- brand='Golden Harvest' filter returns 3 GH-only varieties
- variety-code prefilter still works (E085Z5 → top hit on GH)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit 2588ebafa1 into main 2026-05-25 13:40:42 -04:00
Reference: justin/seed-mcp#5