ac40e05734
Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
63 lines
2.0 KiB
Python
"""Retriever protocol + concrete implementations.

A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
so the eval harness can compare them apples-to-apples. Implement these
once at Phase 7 and reuse them across every retrieval change.

Each retriever returns a ranked list of (bundle_id, page_id) tuples
deduplicated to the page level (chunks within the same page collapse
to one entry; the highest-ranked chunk's position wins).
"""
from __future__ import annotations

from typing import Protocol, Iterable


class Retriever(Protocol):
    name: str

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        """Return up to k (bundle_id, page_id) tuples in rank order."""
        ...


def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
    """Take a stream of (bundle_id, page_id, chunk_ordinal) and return
    the first k unique pages in their first-seen order."""
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for bid, pid, _ord in chunk_ids:
        key = (bid, pid)
        if key in seen:
            continue
        seen.add(key)
        out.append(key)
        if len(out) >= k:
            break
    return out
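A small usage sketch of the page-level collapse described above. The helper is repeated here only so the snippet runs by itself; the bundle and page identifiers are hypothetical and do not come from the real corpus. It shows that a page keeps the position of its highest-ranked chunk and that the input stream is consumed only until k pages are found.

```python
from __future__ import annotations

from typing import Iterable


def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
    """Copy of the helper above, included so this example is self-contained."""
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for bid, pid, _ord in chunk_ids:
        key = (bid, pid)
        if key in seen:
            continue
        seen.add(key)
        out.append(key)
        if len(out) >= k:
            break
    return out


# Hypothetical ranked chunk hits: (bundle_id, page_id, chunk_ordinal).
hits = [
    ("bayer", "page-12", "3"),  # best chunk of page-12 sets that page's rank
    ("bayer", "page-12", "0"),  # same page again: collapsed away
    ("nk", "page-4", "1"),
    ("bayer", "page-7", "2"),
    ("nk", "page-4", "5"),
]

pages = _collapse_to_pages(hits, k=2)

# Passing an iterator shows the early exit: once k pages are collected,
# the rest of the stream is left unread.
stream = iter(hits)
_collapse_to_pages(stream, k=2)
remaining = list(stream)

print(pages)
print(remaining)
```

Because the loop stops as soon as k pages are collected, a caller can hand in a lazy generator over a large candidate pool without paying for the tail.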


# TODO Phase 2/3: implement these once Chroma + the bm25 module are
# in place. Each one is small (15-30 LOC). The eval harness imports
# from this module by class name.
#
# class DenseRetriever:
#     name = "dense"
#     def __init__(self, collection): self.col = collection
#     def retrieve(self, query, k=10): ...
#
# class RerankedRetriever:
#     name = "dense+rerank"
#     def __init__(self, collection, rerank_url, pool=200): ...
#     def retrieve(self, query, k=10): ...
#
# class BM25Retriever:
#     name = "bm25"
#     def __init__(self, bm25_index): ...
#     def retrieve(self, query, k=10): ...
#
# class HybridRetriever:
#     name = "bm25+dense+rrf"
#     def __init__(self, dense, bm25, k_rrf=60): ...
#     def retrieve(self, query, k=10): ...
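One way the HybridRetriever stub above might be filled in is sketched below, using reciprocal rank fusion (RRF): each page scores the sum of 1 / (k_rrf + rank) over the rankings it appears in. This is an illustration under assumptions, not the project's implementation. `StaticRetriever` is a hypothetical stand-in that replays a fixed ranking, the `pool` parameter is not in the stub signature, and the tie-break by page tuple is an arbitrary choice made for determinism.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Protocol


class Retriever(Protocol):
    name: str

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]: ...


class StaticRetriever:
    """Hypothetical stand-in: replays a fixed page ranking for any query."""

    def __init__(self, name: str, ranking: list[tuple[str, str]]) -> None:
        self.name = name
        self._ranking = ranking

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        return self._ranking[:k]


class HybridRetriever:
    name = "bm25+dense+rrf"

    def __init__(self, dense: Retriever, bm25: Retriever, k_rrf: int = 60, pool: int = 50) -> None:
        self.dense = dense
        self.bm25 = bm25
        self.k_rrf = k_rrf
        self.pool = pool  # assumed: candidates requested from each sub-retriever

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        scores: defaultdict[tuple[str, str], float] = defaultdict(float)
        for retriever in (self.dense, self.bm25):
            ranking = retriever.retrieve(query, k=self.pool)
            for rank, page in enumerate(ranking, start=1):
                # RRF: each list contributes 1 / (k_rrf + rank) for the page.
                scores[page] += 1.0 / (self.k_rrf + rank)
        # Highest fused score first; ties broken by page tuple for determinism.
        ranked = sorted(scores, key=lambda page: (-scores[page], page))
        return ranked[:k]


# Hypothetical rankings, already deduplicated to the page level.
dense = StaticRetriever("dense", [("b1", "p1"), ("b1", "p2"), ("b2", "p1")])
bm25 = StaticRetriever("bm25", [("b1", "p2"), ("b3", "p9"), ("b1", "p1")])
hybrid = HybridRetriever(dense, bm25)

top3 = hybrid.retrieve("any query", k=3)
print(top3)
```

Fusing on ranks rather than raw scores avoids having to normalize BM25 scores against embedding distances, which is the usual reason to reach for RRF in a dense + lexical setup.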