seed-mcp/scrape/sources/agripro.py
justin 75f714b454 Phase 4-5: deployable container + corpus snapshot + CI fixes
deploy/docker-compose.yml — replace <product>/<registry> placeholders
with concrete values for Drawbar's stack:
- image: git.jpaul.io/justin/seed-mcp:latest (CF tunnel for pulls; CI
  pushes via LAN 192.168.0.2:1234 to avoid 100 MB body cap)
- container_name: seed-mcp
- port 8001:8000 (8001 host-side to not collide with crop-chem-docs
  on 8000)
- PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP
- llama-rerank shared with crop-chem-docs (NOT redefined here —
  expected to already be in Drawbar's parent compose network)
- networks.drawbar-mcp external: true so seed-mcp joins the existing
  cross-MCP shared network

.gitignore — corpus/ is now COMMITTED, not ignored. The monthly
refresh workflow scrapes and commits corpus changes; the image-only
workflow rebuilds indexes from the committed corpus. Allowing the
corpus to flow through git means the :corpus-YYYY.MM.DD image tag
pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain
ignored — those are deterministically derived from corpus.

Initial committed snapshot: 614 varieties.
- bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85)
- golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs
  returned 302 redirects and are treated as discontinued)

rag/chunk.py — normalize brand to uppercase and crop to lowercase in
Chroma metadata so cross-vendor brand-filter lookups don't break on
inconsistent casing (Bayer stores "DEKALB", Golden Harvest stores
"Golden Harvest"; _build_where uppercases the user-supplied brand,
which matched the former but not the latter before this fix). Sidecar
JSON keeps the original casing for display.
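
A minimal sketch of the casing normalization described above. The
helper names normalize_facets and build_where are hypothetical
stand-ins, not the actual functions in rag/chunk.py, and the sketch
assumes brand is stored uppercase and crop lowercase. The filter shape
follows Chroma's where syntax, which uses $and to combine conditions.

```python
from __future__ import annotations


def normalize_facets(metadata: dict) -> dict:
    """Return a copy of chunk metadata with normalized facet casing.

    Hypothetical helper: brand is uppercased and crop is lowercased
    (assumed convention) so that index-time and query-time values agree.
    """
    out = dict(metadata)
    if isinstance(out.get("brand"), str):
        out["brand"] = out["brand"].strip().upper()
    if isinstance(out.get("crop"), str):
        out["crop"] = out["crop"].strip().lower()
    return out


def build_where(brand: str | None = None, crop: str | None = None) -> dict | None:
    """Build a Chroma-style where filter using the same normalization."""
    clauses = []
    if brand:
        clauses.append({"brand": brand.strip().upper()})
    if crop:
        clauses.append({"crop": crop.strip().lower()})
    if not clauses:
        return None  # no filter requested
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}  # $and needs at least two conditions


# Index-time metadata and query-time filter now agree for both vendors.
stored = normalize_facets({"brand": "Golden Harvest", "crop": "Corn"})
query = build_where(brand="golden harvest")
```

Because both sides go through the same transformation, a user-supplied
brand in any casing compares equal to the stored value, while the
sidecar JSON can still carry the display casing untouched.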

Stub scrapers (nk, agripro, becks_pfr, becks_products) — change
return code from 2 to 0 so the monthly-refresh CI workflow doesn't
fail on deferred sources. Real implementations will return 0 on
success / 1 on failure when they ship.
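
To illustrate why the stub exit code matters, here is a sketch of a
runner that calls each source's main() and fails the job if any source
returns nonzero. The run_sources function and the two example mains
are hypothetical; the actual monthly-refresh workflow may invoke the
scrapers differently.

```python
from __future__ import annotations

import sys
from typing import Callable


def stub_main(argv: list[str] | None = None) -> int:
    """Deferred source: report to stderr but exit 0 so CI stays green."""
    print("stub: deferred", file=sys.stderr)
    return 0


def failing_main(argv: list[str] | None = None) -> int:
    """Stand-in for a real scraper that hit an error."""
    return 1


def run_sources(sources: dict[str, Callable[[list[str] | None], int]]) -> int:
    """Run every source; return 1 if any returned a nonzero code."""
    failed = [name for name, main in sources.items() if main([]) != 0]
    for name in failed:
        print(f"{name}: failed", file=sys.stderr)
    return 1 if failed else 0
```

Under this model a stub returning 2 would fail the whole refresh even
though nothing went wrong, which is the situation the change avoids.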

Smoke-tested cross-vendor retrieval against the 614-chunk index:
- list_versions shows both vendors with correct facet counts
- broad "corn hybrid 100 RM" query returns both DEKALB and Golden
  Harvest hits in top 5
- brand='Golden Harvest' filter returns 3 GH-only varieties
- variety-code prefilter still works (E085Z5 → top hit on GH)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:40:05 -04:00

38 lines
1.3 KiB
Python

"""AgriPro scraper (Syngenta wheat brand).

Source: ``https://www.agriprowheat.com`` — Drupal Views form,
server-rendered HTML. No headless browser needed.

Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
under a separate brand and is out of scope for AgriPro.

Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
implicit in product family naming, not always a separate field).

Schema notes:
- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
- ``relative_maturity`` and ``maturity_group`` are null for wheat
- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
  Septoria / tan spot
- Quality: test weight, protein, falling number, straw strength

TODO: implement.
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("agripro: deferred — Drupal Views form, only wheat in the corpus, no SRW (separate brand). See reference_seed_vendor_recon.md.",
          file=sys.stderr)
    # Return 0 so the monthly CI workflow doesn't fail when this
    # source is listed but not yet implemented. Real implementation
    # will return 0 on success / 1 on failure.
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))