Phase 4-5: deployable container + corpus snapshot + CI fixes

deploy/docker-compose.yml — replace <product>/<registry> placeholders
with concrete values for Drawbar's stack:
- image: git.jpaul.io/justin/seed-mcp:latest (pulled through the
  Cloudflare tunnel; CI pushes over the LAN via 192.168.0.2:1234 to
  avoid the 100 MB body cap)
- container_name: seed-mcp
- port 8001:8000 (8001 host-side to not collide with crop-chem-docs
  on 8000)
- PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP
- llama-rerank shared with crop-chem-docs (NOT redefined here —
  expected to already be in Drawbar's parent compose network)
- networks.drawbar-mcp external: true so seed-mcp joins the existing
  cross-MCP shared network

.gitignore — corpus/ is now COMMITTED, not ignored. The monthly
refresh workflow scrapes and commits corpus changes; the image-only
workflow rebuilds indexes from the committed corpus. Allowing the
corpus to flow through git means the :corpus-YYYY.MM.DD image tag
pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain
ignored — those are deterministically derived from corpus.
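
A minimal sketch of how a :corpus-YYYY.MM.DD tag could be derived from
a snapshot date. The function name corpus_tag and the choice of which
date to use (scrape date vs. commit date) are assumptions for
illustration; the actual CI workflow may compute the tag differently.

```python
from datetime import date


def corpus_tag(snapshot: date) -> str:
    """Format a snapshot date as an image tag, e.g. corpus-2026.05.25.

    Hypothetical helper: zero-pads month and day so tags sort
    lexicographically in the same order as the snapshot dates.
    """
    return snapshot.strftime("corpus-%Y.%m.%d")
```

Zero-padded components keep a plain string sort of the tags in
chronological order, which is convenient when listing registry tags.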

Initial committed snapshot: 614 varieties.
- bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85)
- golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs
  302-redirected = discontinued)

rag/chunk.py — normalize brand to uppercase and crop to lowercase in
Chroma metadata so cross-vendor brand-filter lookups don't break on
inconsistent casing. Bayer stores "DEKALB" and Golden Harvest stores
"Golden Harvest"; _build_where uppercases the user-supplied brand,
which matched the former but not the latter before this fix. Sidecar
JSON keeps the original casing for display.
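
The casing fix can be sketched roughly as follows. This is not the
code in rag/chunk.py: normalize_metadata and build_where are
hypothetical stand-ins, the "brand" and "crop" key names are assumed,
and lowercasing crop is an assumption about which field gets which
case. The filter shape follows Chroma's where syntax (a single
equality dict, or "$and" over several).

```python
from typing import Optional


def normalize_metadata(record: dict) -> dict:
    """Return a copy with brand uppercased and crop lowercased.

    Applied at index time, so stored values share one casing
    regardless of how each vendor's site spells them.
    """
    meta = dict(record)
    if isinstance(meta.get("brand"), str):
        meta["brand"] = meta["brand"].strip().upper()
    if isinstance(meta.get("crop"), str):
        meta["crop"] = meta["crop"].strip().lower()
    return meta


def build_where(brand: Optional[str] = None,
                crop: Optional[str] = None) -> Optional[dict]:
    """Build a metadata filter using the same normalization as indexing."""
    clauses = []
    if brand:
        clauses.append({"brand": brand.strip().upper()})
    if crop:
        clauses.append({"crop": crop.strip().lower()})
    if not clauses:
        return None  # no filter requested
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```

Because both sides apply the same transformation, a user-supplied
brand of "Golden Harvest" yields a filter that matches the stored
"GOLDEN HARVEST" value, while the sidecar JSON can still carry the
vendor's original spelling for display.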

Stub scrapers (nk, agripro, becks_pfr, becks_products) — change
return code from 2 to 0 so the monthly-refresh CI workflow doesn't
fail on deferred sources. Real implementations will return 0 on
success / 1 on failure when they ship.
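
A rough illustration of the exit-code convention described above,
assuming a runner that treats any nonzero code as a failure. The names
nk_stub and run_scrapers are hypothetical and are not taken from the
repository.

```python
from typing import Callable, Dict


def nk_stub() -> int:
    """Deferred source: nothing to scrape yet, so report success (0)."""
    return 0


def run_scrapers(scrapers: Dict[str, Callable[[], int]]) -> int:
    """Run every scraper and return 1 if any returned nonzero, else 0."""
    failed = [name for name, run in scrapers.items() if run() != 0]
    return 1 if failed else 0
```

Under this convention a stub that returned 2 would fail the whole run,
which is why the deferred sources return 0 until real implementations
exist.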

Smoke-tested cross-vendor retrieval against the 614-chunk index:
- list_versions shows both vendors with correct facet counts
- broad "corn hybrid 100 RM" query returns both DEKALB and Golden
  Harvest hits in top 5
- brand='Golden Harvest' filter returns 3 GH-only varieties
- variety-code prefilter still works (E085Z5 → top hit on GH)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:40:05 -04:00
parent 9d4a490731
commit 75f714b454
1235 changed files with 284483 additions and 94 deletions
+55 -83
@@ -1,111 +1,83 @@
# Hosting stack for a docs MCP server.
# Hosting stack for the seed-mcp MCP server.
#
# Replace <product> below with your product name on first deploy.
# Volumes: usage logs are mounted to a host path so they survive
# Watchtower-driven container recreates.
# This compose file is meant to live in Drawbar's deploy stack and is
# included here as the canonical reference. The seed-mcp image is
# self-contained — corpus + Chroma + BM25 are baked in by CI at build
# time — so the only host-side concerns are usage-log persistence and
# the shared reranker / Ollama sidecars.
#
# This template assumes a reverse proxy / Cloudflare Tunnel terminates
# TLS in front of port 8000. Adjust if your infra differs.
# The reranker container (llama-rerank) is SHARED with crop-chem-docs.
# Drawbar's compose already has it from the crop-chem-docs deploy;
# don't duplicate it here when stacking the two MCPs together.
#
# Watchtower auto-pulls on :latest changes — but ONLY for containers
# labeled `com.centurylinklabs.watchtower.enable=true`.
services:
# The MCP server. Watchtower auto-pulls on :latest changes.
<product>-docs-mcp:
image: <registry>/<owner>/<product>-docs-mcp:latest
container_name: <product>-docs-mcp
# The seed-mcp server. Image is rebuilt nightly by .gitea/workflows/
# refresh.yml; pulled via the public git.jpaul.io endpoint (CF
# tunnels in front, so the 100 MB body cap doesn't matter on pulls).
seed-mcp:
image: git.jpaul.io/justin/seed-mcp:latest
container_name: seed-mcp
restart: unless-stopped
ports:
- "8000:8000"
- "8001:8000"
environment:
PRODUCT_NAME: "<product>"
PRODUCT_DOCS_URL: "https://docs.example.com"
PRODUCT_NAME: "crop_seed"
PRODUCT_DOCS_URL: "https://git.jpaul.io/justin/seed-mcp"
# Streamable-HTTP transport. Stateless mode is required for
# production: clients don't lose sessions when Watchtower
# recreates the container.
# Streamable-HTTP transport, stateless mode (every request gets
# a fresh ephemeral session). Required for production: avoids
# 404 storms when Watchtower recreates the container while
# clients hold session IDs from the previous instance.
MCP_TRANSPORT: streamable-http
MCP_HOST: 0.0.0.0
MCP_PORT: "8000"
MCP_DISABLE_DNS_REBINDING_PROTECTION: "1"
# If you run MetaMCP or another gateway in front and reach
# this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
# add that hostname here. "*" disables the rebind check entirely.
MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"
# Embedding pool. Drawbar's compose puts the seed-mcp on the
# same docker network as Ollama; comma-separate multiple
# endpoints (one per GPU) for indexing throughput. At runtime
# only search_docs hits this (one embed per query, ~5ms).
OLLAMA_URL: "http://ollama:11434"
# Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
RERANK_URL: http://<product>-rerank:8080
# Reranker. The llama.cpp sidecar serving jina-reranker-v2-base
# is SHARED with crop-chem-docs. Drawbar's compose already
# defines llama-rerank from the crop-chem-docs deploy; we just
# point at the same DNS name. Falls back to dense-only on any
# rerank error so MCP requests never block on the sidecar.
RERANK_URL: "http://llama-rerank:8080"
RERANK_POOL: "200"
RERANK_TIMEOUT: "30"
# Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
# only after the eval harness shows the dense-only path
# missing technical-term queries that BM25 catches.
# Hybrid retrieval (BM25 + dense + RRF + exact-code prefilter).
# Worth it for seed-mcp because farmer queries often contain
# rare technical tokens — variety codes (DKC62-08RIB), trait
# codes (XF/VT2PRIB), Rps gene names, disease abbreviations.
HYBRID_SEARCH: "true"
RRF_K: "60"
# Phase 10 — usage telemetry.
# Usage telemetry. JSONL with daily rotation; 90-day retention.
USAGE_LOG_DIR: /app/var/logs
USAGE_LOG_KEEP_DAYS: "90"
# Phase 12 — doc-bug submission gate. Off by default; on only
# in production after you've verified the endpoint contract.
DOC_BUG_SUBMIT_ENABLED: "false"
# DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
volumes:
# Usage logs persist across container recreates.
- ./<product>-docs-mcp-logs:/app/var/logs
depends_on:
- <product>-rerank
# Usage logs persist across container recreates. Mount point
# creates host directory `./seed-mcp-logs/` on first run.
- ./seed-mcp-logs:/app/var/logs
labels:
# Watchtower polls *only* containers with this label set true.
# Watchtower polls only containers with this label = true.
com.centurylinklabs.watchtower.enable: "true"
networks:
- mcp
- drawbar-mcp
# Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
# Requires GPU access; adjust runtime/devices for your hardware.
<product>-rerank:
image: ghcr.io/ggml-org/llama.cpp:server-cuda
container_name: <product>-rerank
restart: unless-stopped
# Mount the GGUF model from the host. Download from huggingface
# (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
volumes:
- /path/to/models:/models:ro
command: >
--model /models/jina-reranker-v2-base.Q8_0.gguf
--reranking
--host 0.0.0.0
--port 8080
--n-gpu-layers 99
--ctx-size 4096
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
networks:
- mcp
# Watchtower — auto-pulls :latest on push.
# Only watches containers labeled `com.centurylinklabs.watchtower.enable=true`.
watchtower:
image: containrrr/watchtower:latest
container_name: watchtower
restart: unless-stopped
volumes:
- /var/run/docker.sock:/var/run/docker.sock
environment:
WATCHTOWER_POLL_INTERVAL: "300" # 5 min
WATCHTOWER_LABEL_ENABLE: "true"
WATCHTOWER_CLEANUP: "true" # remove old images after pull
# If your registry requires auth, mount a docker config:
# volumes:
# - ./registry-auth.json:/config.json:ro
networks:
- mcp
# NOTE: do NOT include llama-rerank or ollama here if you're stacking
# this compose alongside crop-chem-docs. They're already defined in
# the parent stack. The networks: external: true block below assumes
# those services live on the drawbar-mcp shared network.
networks:
mcp:
driver: bridge
drawbar-mcp:
external: true
name: drawbar-mcp