seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:28:49 -04:00
commit ac40e05734
35 changed files with 3833 additions and 0 deletions
@@ -0,0 +1,117 @@
name: Image rebuild (skip scrape)
# Fast path for code-only changes. Skips the scrape and goes straight
# to: rebuild indexes (from corpus already committed on main) + image
# build + push. Runtime ~10 min vs ~2-3 h for the full monthly refresh.
#
# Use when a PR only changes code/config — anything where the upstream
# seed catalogs haven't moved but we want the new Python in the
# running image.
on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - "docs_mcp/**"
      - "rag/**"
      - "scrape/**"
      - "requirements.txt"
      - "Dockerfile"
      - "sources.json"
# If multiple pushes land in quick succession, cancel the older one
# rather than queueing both — each run is non-trivial and the older
# commit's image just gets overwritten by the newer one anyway.
concurrency:
  group: image-only
  cancel-in-progress: true
env:
  REGISTRY_PUSH: 192.168.0.2:1234
  REGISTRY_PULL: git.jpaul.io
  IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
  OLLAMA_URL: http://192.168.0.2:11434,http://192.168.0.2:11435,http://192.168.0.125:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: crop_seed
jobs:
  build:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt
      - name: Verify committed corpus is present
        run: |
          test -d corpus || { echo "ERROR: corpus/ missing on this ref"; exit 1; }
          n_md=$(find corpus -name '*.md' | wc -l)
          n_json=$(find corpus -name '*.json' | wc -l)
          echo "corpus: $(du -sh corpus | cut -f1) on disk, ${n_md} .md / ${n_json} .json"
      - name: Rebuild indexes from committed corpus
        run: python -m rag.index --rebuild
      - name: Log in to Gitea container registry
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
      - name: Build & push image
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          CORPUS_TAG="corpus-$(date -u +%Y.%m.%d)"
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}"
      - name: Link container package to this repo
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          OWNER="${{ github.repository_owner }}"
          PKG="${{ github.event.repository.name }}"
          BODY=$(mktemp)
          CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
            -H "Authorization: token ${GITEA_TOKEN}" \
            "https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
          echo "link http=$CODE body=$(cat "$BODY")"
          case "$CODE" in
            201) echo "linked package to ${OWNER}/${PKG}" ;;
            400) echo "already linked — ok" ;;
            *) echo "unexpected status $CODE"; exit 1 ;;
          esac
      - name: Prune old container versions
        # GC requires broader scope than REGISTRY_TOKEN's push perms
        # (HTTP 403 on /packages/.../versions). Non-critical —
        # housekeeping only. Don't fail the whole run.
        # TODO: issue separate PAT with admin:package scope and set
        # as PACKAGES_ADMIN_TOKEN.
        continue-on-error: true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner "${{ github.repository_owner }}" \
            --package "${{ github.event.repository.name }}" \
            --keep-days 180 \
            --keep-latest 6
@@ -0,0 +1,186 @@
name: Monthly seed catalog refresh
# Runs the full pipeline: scrape all GREEN sources → rebuild indexes
# → push image. Cron'd once a month (1st @ 06:00 UTC). Skip the
# reindex + image-push if the scrape produced no diff against the
# committed corpus.
#
# Seed catalogs move slowly (vendors release new hybrids 1-2x/year
# at field-day timing); monthly cadence is plenty.
#
# Total runtime budget: ~2-3 h end-to-end across all 5 GREEN sources.
# Bayer is the longest (~475 varieties, ~45 min). Beck's PFR is the
# heaviest single-source (~2,089 docs via Sanity GROQ pagination).
on:
  schedule:
    - cron: "0 6 1 * *" # 1st of each month, 06:00 UTC
  workflow_dispatch:
    inputs:
      force_build:
        description: "Rebuild indexes + push image even if corpus is unchanged"
        type: boolean
        default: false
      sources:
        description: "Sources to scrape (comma-separated, blank = all GREEN)"
        type: string
        default: ""
env:
  # Self-hosted Gitea registry on the same LAN as the runner.
  # CF caps push body at 100 MB, so push via LAN endpoint; pull
  # through the public hostname (response bodies aren't capped).
  REGISTRY_PUSH: 192.168.0.2:1234
  REGISTRY_PULL: git.jpaul.io
  IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
  # Embedder pool. Two Ollama instances on the Gitea/runner host
  # (one per GPU) + the Windows Ollama. Trashpanda's Ollama is
  # production-shared with Drawbar; CI does NOT hit it.
  OLLAMA_URL: http://192.168.0.2:11434,http://192.168.0.2:11435,http://192.168.0.125:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: crop_seed
jobs:
  refresh:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Full history — required for the digest-history step
          # to walk git log. Default fetch-depth: 1 silently
          # produces a 0-byte history file.
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt
      # ---- Phase 1: scrape ---------------------------------------
      - name: Scrape Bayer seeds (DEKALB + Asgrow + WestBred)
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'bayer_seeds') }}
        run: python -m scrape.runner --source bayer_seeds --force
      - name: Scrape Golden Harvest
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'golden_harvest') }}
        run: python -m scrape.runner --source golden_harvest --force
      - name: Scrape NK
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'nk') }}
        run: python -m scrape.runner --source nk --force
      - name: Scrape AgriPro
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
        run: python -m scrape.runner --source agripro --force
      - name: Scrape Beck's PFR research corpus
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
        # Heaviest source — ~2,089 docs via public Sanity GROQ.
        # No auth, but rate-limit ourselves to be polite.
        run: python -m scrape.runner --source becks_pfr --force
      # ---- Commit corpus changes + retry-on-race -----------------
      - name: Commit corpus changes (if any)
        id: commit
        run: |
          git config user.name "seed-mcp-refresh"
          git config user.email "actions@jpaul.io"
          git add sources.json corpus
          if git diff --cached --quiet; then
            echo "no corpus changes — skipping reindex and image build"
            echo "changed=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "changed=true" >> "$GITHUB_OUTPUT"
          ts=$(date -u +"%Y-%m-%dT%H:%MZ")
          n_bayer=$(find corpus/bayer_seeds -name '*.json' 2>/dev/null | wc -l)
          n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
          n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
          n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
          n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
          attempt=1
          while [ $attempt -le 3 ]; do
            if git push; then
              echo "pushed corpus changes (attempt $attempt)"
              break
            fi
            if [ $attempt -eq 3 ]; then
              echo "push still failing after 3 attempts"; exit 1
            fi
            git fetch origin main
            git rebase origin/main || { echo "rebase conflict"; exit 1; }
            attempt=$((attempt + 1))
          done
      # ---- Rebuild Chroma + BM25 ---------------------------------
      - name: Rebuild indexes
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: python -m rag.index --rebuild
      # ---- Build & push image ------------------------------------
      - name: Log in to Gitea container registry
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
      - name: Build & push image
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        # Tags: :latest (Watchtower target), :<sha12> (rollback pin),
        # :corpus-<YYYY.MM.DD> (links image to corpus version so
        # Drawbar can pin to a specific seed-catalog snapshot).
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          CORPUS_TAG="corpus-$(date -u +%Y.%m.%d)"
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}"
      - name: Link container package to this repo
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          OWNER="${{ github.repository_owner }}"
          PKG="${{ github.event.repository.name }}"
          BODY=$(mktemp)
          CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
            -H "Authorization: token ${GITEA_TOKEN}" \
            "https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
          echo "link http=$CODE body=$(cat "$BODY")"
          case "$CODE" in
            201) echo "linked package to ${OWNER}/${PKG}" ;;
            400) echo "already linked — ok" ;;
            *) echo "unexpected status $CODE"; exit 1 ;;
          esac
      - name: Prune old container versions
        # GC requires broader scope than REGISTRY_TOKEN's push perms
        # (HTTP 403 on /packages/.../versions). Non-critical
        # housekeeping. TODO: issue separate PAT with admin:package
        # scope. Until then continue-on-error keeps a failed prune
        # from breaking the whole refresh.
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        continue-on-error: true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner "${{ github.repository_owner }}" \
            --package "${{ github.event.repository.name }}" \
            --keep-days 180 \
            --keep-latest 6
@@ -0,0 +1,31 @@
# Virtualenv
venv/
.venv/
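# Regenerable from the committed corpus in CI.
# corpus/ itself is NOT ignored: refresh.yml runs
# `git add sources.json corpus` and image-only.yml rebuilds from it.
chroma/
bm25/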
# Python detritus
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.mypy_cache/
.ruff_cache/
# Eval results (regenerable; commit only the headline baseline if you want)
# eval/results/
# Usage logs (host-mounted volume in prod; don't commit dev logs)
var/
# Local-only env
.env
.env.local
# IDE
.vscode/
.idea/
*.swp
@@ -0,0 +1,230 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when
working with code in this repository.
## Purpose
`seed-mcp` is an MCP server over the **public catalogs of major US
row-crop seed vendors** (corn / soybeans / wheat). It is the sibling
project to [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
— same MCP-template lineage, same Drawbar consumer (the farm
advisor AI), but the corpus is **seed/hybrid varieties** rather than
pesticide labels.
The MCP exposes per-variety records with agronomic ratings, disease
tolerance, trait stack, maturity, and regional notes — so the advisor
can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105
in northeast Iowa?" without rummaging through individual brand sites.
PRODUCT_NAME for this build: **`crop_seed`** (lowercase, underscore;
ends up in the MCP server name, Chroma collection, BM25 db filename,
and the `crop_seed_api_lessons` tool).
## Vendor scope
| Vendor | Verdict | Varieties | Source pattern |
|---|---|---|---|
| Bayer (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | `cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs) |
| Golden Harvest (Syngenta) | 🟢 | ~175 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth) |
| Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
| Pioneer (Corteva) | 🔴 | — | DROP. ToS bans automation; dealer locator login-gated too |
**Build priority order** (shared-infra first → biggest yield):
1. `bayer_seeds` — lift-and-shift from crop-chem-docs' Bayer scraper
2. `golden_harvest` — biggest unique Syngenta brand
3. `nk` — reuses Golden Harvest's PDF fetcher
4. `agripro` — only wheat coverage in the corpus
5. `becks_pfr` — research goldmine, public Sanity GROQ
6. `becks_products` — identity-only, deferred until SeedIQ XHR known
### Pioneer fallback
Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer.
The MCP's lessons layer contains a Pioneer-fallback entry: when the
LLM detects a Pioneer / P-series query, it should reply:
> "Pioneer does not allow AI or other automation techniques to
> scrape and index their data. For Pioneer brand seed information,
> reach out to a local dealer directly via
> [pioneer.com](https://www.pioneer.com)."
Pioneer's dealer locator is login-gated — there is no public API
to surface dealer contact info, so the lesson stays a plain link.
## Schema notes per crop
- **Corn**: RM (relative maturity days), trait stack (SmartStax, VT
Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB /
Goss's / Anthracnose / Tar Spot ratings, standability, drought
tolerance, ear flex, grain-vs-silage flag.
- **Soy**: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 /
LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora
(race + Rps gene) / frogeye / brown stem rot ratings, IDC
tolerance (critical for upper Midwest), branching habit.
- **Wheat**: Class (HRW / HRS / SRW / SWW / SWS / durum), heading
(early / medium / late), stripe rust / leaf rust / stem rust /
FHB (scab) / Septoria / tan spot ratings, test weight, protein,
falling number, straw strength, CoAXium trait flag.
**Disease scale gotcha**: Golden Harvest publishes ratings on a
**9-to-1 scale** (9 = best, 1 = worst) — the REVERSE of the typical
1-9 convention used by Bayer/NK/AgriPro. Normalize at chunk time so
the corpus has a single direction; document it in a chunk_0
preamble.
## Canonical sidecar schema (per variety)
```json
{
  "source": "bayer_seeds",
  "source_key": "dekalb-dkc62-08rib",
  "vendor": "Bayer",
  "brand": "DEKALB",
  "product_name": "DKC62-08RIB",
  "crop": "corn",
  "relative_maturity": 112,
  "maturity_group": null,
  "wheat_class": null,
  "trait_stack": ["SmartStax", "RIB"],
  "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
  "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
  "regional_recommendation": ["IA-N", "MN-S", "WI-W"],
  "source_urls": ["https://cropscience.bayer.us/..."],
  "fetched_at": "2026-05-25T12:34:56Z"
}
```
`maturity_group` is for soy, `relative_maturity` is for corn,
`wheat_class` is for wheat. Use `null` for fields that don't apply.
Disease/agronomic rating direction is **normalized 1-9 (9 = best)**
post-scrape — original direction noted in chunk_0 if the source
publishes differently.
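
The normalization step can be sketched as below. This is a minimal illustration, not the repo's implementation: the function name and the `nine_is_best` flag are hypothetical, and which direction each vendor actually publishes has to be confirmed per source before the flag is set.

```python
def normalize_rating(value: int, nine_is_best: bool) -> int:
    """Map a 1-9 vendor rating onto the corpus convention (9 = best).

    `nine_is_best` describes the SOURCE scale; it is a hypothetical
    per-source setting, not a field from the sidecar schema.
    """
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range: {value}")
    # A scale where 1 is best is mirrored around 5; a scale where 9 is
    # best already matches the corpus convention and passes through.
    return value if nine_is_best else 10 - value
```

Keeping the source direction as explicit per-source configuration, rather than inferring it from the data, makes the chunk_0 note about the original direction easy to generate from the same setting.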
## Working with this repo
### Identifying the current phase
This is a clone of the docs-mcp-template; phases follow the
template's PLAN.md.
| Signal | Likely phase |
|---|---|
| `corpus/` doesn't exist | Phase 1 (first scraper) |
| `corpus/bayer_seeds/` exists, no `chroma/` | Phase 2 (indexing) |
| `chroma/` exists, no `bm25/` | Phase 8 (hybrid search) |
| No `eval/results/` | Phase 7 (eval harness) |
| `_api_lessons` is `NotImplementedError` | Phase 11 |
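
The signal table can be approximated with a few filesystem checks. The sketch below is an assumption-laden helper (the name `guess_phase` and the order of the checks are not from the template), and it cannot see the `_api_lessons` signal, which needs a look at the server code.

```python
from pathlib import Path


def guess_phase(repo_root: str) -> str:
    """Guess the current build phase from which directories exist."""
    root = Path(repo_root)
    if not (root / "corpus").is_dir():
        return "Phase 1 (first scraper)"
    if not (root / "chroma").is_dir():
        return "Phase 2 (indexing)"
    if not (root / "bm25").is_dir():
        return "Phase 8 (hybrid search)"
    if not (root / "eval" / "results").is_dir():
        return "Phase 7 (eval harness)"
    # The remaining signal lives in code, not on disk.
    return "Phase 11 or later (check `_api_lessons`)"
```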
## Layout
```
.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json            # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│   └── docker-compose.yml
├── .gitea/workflows/
│   ├── refresh.yml         # Monthly cron: scrape + index + image
│   └── image-only.yml      # On-demand: code-only ship cycle
├── scrape/
│   ├── runner.py           # `python -m scrape.runner --source bayer_seeds`
│   ├── changelog.py
│   └── sources/
│       ├── bayer_seeds.py
│       ├── golden_harvest.py
│       ├── nk.py
│       ├── agripro.py
│       ├── becks_pfr.py
│       └── becks_products.py
├── rag/                    # chunk + embed + Chroma + BM25
├── docs_mcp/               # FastMCP server + lessons.md
├── eval/                   # Golden-query harness
└── scripts/                # registry_gc.py, usage_report.py
```
## Conventions
- **Vendor sub-corpora**: each scraper writes
`corpus/<source>/<source_key>.{md,json}`. `.md` is the LLM-visible
text (chunk_0 preamble + body); `.json` is the sidecar metadata.
- **Tool docstrings are user interface** — the LLM uses them to
decide whether to call. Treat like button labels.
- **Defensive fallback for retrieval** — reranker/BM25/external
deps must catch their specific exception and degrade to baseline.
The MCP is in front of farmers making real seed-buying decisions.
- **Verify retrieval changes with eval/** — ship a retrieval change
with numbers in the commit message.
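
The defensive-fallback convention can be sketched as follows. `RerankUnavailable` and `rerank_or_baseline` are hypothetical names; the real server would catch whatever specific exception its reranker client raises.

```python
from typing import Callable, Sequence


class RerankUnavailable(Exception):
    """Hypothetical error raised when the reranker sidecar cannot be used."""


def rerank_or_baseline(
    query: str,
    docs: Sequence[str],
    rerank: Callable[[str, Sequence[str]], list[str]],
) -> list[str]:
    """Return reranked docs, or the baseline order if the reranker is down."""
    try:
        return rerank(query, docs)
    except RerankUnavailable:
        # Degrade to the pre-rerank order instead of failing the tool call.
        return list(docs)
```

Catching only the specific exception keeps genuine bugs (a `TypeError`, say) visible instead of hiding them behind the fallback.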
### Standard infrastructure choices
- **Embedding**: `nomic-embed-text` via Ollama (768-dim)
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
`/v1/rerank` (shared `llama-rerank` sidecar with crop-chem-docs
on trashpanda Tesla P4)
- **Vector store**: Chroma `PersistentClient`
- **Lexical store**: SQLite FTS5
- **Fusion**: RRF k=60
- **Transport**: streamable-HTTP in prod, stdio for local dev
- **MCP framework**: FastMCP with `stateless_http=True`
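
For reference, reciprocal rank fusion with k=60 can be sketched in a few lines. This is a generic illustration of the technique, assuming each retriever returns an ordered list of document IDs; the function name `rrf_fuse` is not taken from the repo.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses only ranks, the dense (Chroma) and lexical (FTS5) scores never need to be put on a common scale.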
### Image name and package linking are repo-name-derived
`IMAGE` and `--package` derive from the repo at runtime via
`${{ github.repository_owner }}` / `${{ github.event.repository.name }}`.
The only workflow placeholders customized per clone are
`REGISTRY_PUSH=192.168.0.2:1234`, `REGISTRY_PULL=git.jpaul.io`,
and the `OLLAMA_URL` embed pool.
## Common commands
```bash
# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Run one scraper
python -m scrape.runner --source bayer_seeds --force
# Rebuild indexes
python -m rag.index --rebuild
# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000
# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
```
## Gotchas
- **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the
ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate
reranked docs to ~2000 chars.
- **FastMCP `stateless_http=True`**: critical for prod.
- **Runner shell is `/bin/sh` (dash)** in CI — no `${VAR::N}`.
- **Cloudflare 100 MB body cap**: push via LAN endpoint
`192.168.0.2:1234`, pull via `git.jpaul.io`.
- **Golden Harvest disease scale is reversed (9 = best)** —
normalize at chunk time.
- **Sitemap-listed PDF dates on Golden Harvest are stale** —
resolve the live PDF URL from the product HTML page.
- **No IPv6** — DNS for git.jpaul.io returns IPv6-only. Clone via
HTTPS, not SSH (port 22 returns Network unreachable).
- **Pioneer is OFF-LIMITS** — do NOT add a `pioneer.py` scraper.
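
The reranker truncation gotcha above might be handled with a helper like this one. It is a sketch under the assumption that a character cap is an acceptable proxy for the token limit; the names are hypothetical.

```python
MAX_RERANK_CHARS = 2000  # the "~2000 chars" guidance from the gotcha list


def truncate_for_rerank(docs: list[str], limit: int = MAX_RERANK_CHARS) -> list[str]:
    """Cap every document so one long doc cannot sink the whole batch."""
    return [doc[:limit] for doc in docs]
```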
## Out-of-scope concerns
- **Reverse proxy / TLS** — Drawbar's compose handles it
- **MetaMCP** — separate aggregator
- **GPU container orchestration** — shared `llama-rerank` sidecar
- **University extension trial data** — deferred to v1.5
@@ -0,0 +1,61 @@
# seed-mcp MCP server — production image.
#
# Structure: install dependencies and copy code first, then the
# regenerable indexes last, so a corpus-only change reuses the cached
# dependency and code layers.
#
# The container runs the MCP server via streamable-http on PORT 8000.
# Override via MCP_HOST / MCP_PORT env if you front it with a different
# reverse-proxy setup.
#
# Image is self-contained — corpus, Chroma collection, and BM25 db are
# all baked in. Drawbar's docker-compose pulls the image and runs it;
# no host volume mounts required for serve.
#
# RERANK_URL is set at compose time (points at the llama.cpp sidecar
# on trashpanda's Tesla P4 — SHARED with crop-chem-docs). OLLAMA_URL
# is set at compose time too. Defaults below assume same-stack Docker
# DNS names.
FROM python:3.12-slim
WORKDIR /app
# Install Python deps first for cacheability.
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
# Code.
COPY scrape /app/scrape
COPY rag /app/rag
COPY docs_mcp /app/docs_mcp
# Source catalog. Lists the corpus sources (Bayer seeds + Golden
# Harvest + NK + AgriPro + Beck's PFR + Beck's products).
COPY sources.json /app/
# Regenerable indexes. CI builds these from corpus/ in the same job
# that builds the image. Listed last so changes to these (much larger)
# directories don't invalidate the cached dependency and code layers above.
#
# bm25/ is only consulted when HYBRID_SEARCH=true (the server falls
# back to dense-only if it's missing).
COPY corpus /app/corpus
COPY chroma /app/chroma
COPY bm25 /app/bm25
ENV PYTHONUNBUFFERED=1 \
PRODUCT_NAME=crop_seed \
MCP_TRANSPORT=streamable-http \
MCP_HOST=0.0.0.0 \
MCP_PORT=8000 \
HYBRID_SEARCH=true \
OLLAMA_URL=http://ollama:11434 \
RERANK_URL=http://llama-rerank:8080
# Defaults above assume the MCP container shares a Docker network
# with services named `ollama` and `llama-rerank`. Override either
# in the compose `environment:` block if your stack uses different
# service names or if you want to point at off-stack hosts.
EXPOSE 8000
ENTRYPOINT ["python", "-m", "docs_mcp.server"]
@@ -0,0 +1,647 @@
# Docs MCP Server — Build Guide
A reusable recipe for building a hosted MCP server over a product's
public documentation. Distilled from one production build; everything
product-specific has been factored out.
The end product is a streamable-HTTP MCP server with ~15 tools that
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, find inconsistencies, and (optionally) submit doc bugs
back upstream.
---
## What you're building
A pipeline with these stages:
```
upstream docs portal
    scrape          ──► corpus/<bundle>/<page>.md + .json sidecar
    chunk + embed   ──► chroma/ (dense vectors)
          │         ──► bm25/   (FTS5 lexical index)
    MCP server      ──► search_docs / get_page / diff_versions / weekly_digest /
                        find_doc_inconsistencies / submit_doc_bug / ...
    reverse proxy / Cloudflare Tunnel ──► public endpoint
```
Two CI cadences:
- **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
image build & push.
- **On-demand image-only** (~18 min): code-only rebuild from
committed corpus, image build & push.
A container registry (self-hosted Gitea works well), a host running
Docker Compose, Watchtower auto-updating from `:latest`, and a
reverse proxy in front.
---
## Build phases
Each phase is a discrete, shippable unit. Build them in order; each
one is useful on its own and unlocks the next. Realistic effort per
phase is given as a rough order of magnitude. Total: roughly 2-3
weeks of focused work for the full stack.
### Phase 0 — Project skeleton *(half a day)*
Goals: directory layout, dependency manifest, virtualenv.
- Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`,
`docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
- `requirements.txt` with the dependencies you'll need across all
phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML
parser, ollama or sentence-transformers client, etc.).
- `python -m venv venv` and pin Python version (3.11 or 3.12 — be
conservative; some embedding libraries have version-specific
wheels).
- `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/`
(regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`,
`.pytest_cache/`.
### Phase 1 — Scraper *(2-4 days, product-specific)*
This is the most product-dependent phase. The goal is to write a
scraper that produces a normalized corpus layout regardless of
upstream portal shape.
Output shape (mandatory):
```
corpus/
  <bundle_id>/          # one dir per "doc bundle" — see Glossary
    <page_id>.md        # markdown body
    <page_id>.json      # sidecar with structured metadata
    ...
bundles.json            # catalog of bundles with metadata
**Bundle metadata** (`bundles.json` is a list of these):
```json
{
  "slug": "<bundle_id>",
  "title": "User-facing title",
  "version": "10.9",
  "platform": "VMware vSphere",
  "product": "Admin Guide",
  "language": "en-US",
  "page_count": 127,
  "dates": {
    "Added on": "2024-01-15",
    "Updated on": "2026-05-20"
  },
  "landing_page": "<page_id>"
}
```
`platform` may be `null`; `product` is optional but useful.
**Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
The one field that matters cross-cutting is `topic_cluster` (see
Phase 9):
```json
{
"bundle_id": "<bundle_id>",
"page_id": "<page_id>",
"title": "How to ...",
"ordinal": 42,
"topic_cluster": {
"clustering_title": "How to ...",
"clustered_topics": [
{"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
{"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
]
}
}
```
If the portal exposes a cross-version "this page corresponds to that
page" mapping, capture it here. If it doesn't, you can synthesize a
filename-based fallback (same filename across bundle versions = same
topic) and live without the editor-curated mapping. The features that
read `topic_cluster` (`list_cluster`, `diff_versions`,
`find_doc_inconsistencies`, parts of `weekly_digest`) will work
either way; they're more accurate with real clusters.
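
The filename-based fallback could look roughly like this. It is a sketch that assumes pages are available as `(bundle_id, page_id)` pairs; `synthesize_clusters` is a hypothetical name, and it fills in only the membership list, not an editor-curated `clustering_title`.

```python
from collections import defaultdict


def synthesize_clusters(
    pages: list[tuple[str, str]],
) -> dict[str, list[dict[str, str]]]:
    """Group (bundle_id, page_id) pairs that share a filename across bundles."""
    by_filename: dict[str, list[dict[str, str]]] = defaultdict(list)
    for bundle_id, page_id in pages:
        by_filename[page_id].append({"bundle_id": bundle_id, "page_id": page_id})
    # A filename seen in only one bundle has nothing to cluster with.
    return {name: members for name, members in by_filename.items() if len(members) > 1}
```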
**Patterns that recur across doc portals:**
- Most modern doc portals are SPAs. Plain `requests.get` won't see
rendered content. Either find the underlying API the SPA calls (the
cheapest, most reliable path), or fall back to a headless browser
(Playwright). The API path is almost always available; sniff the
network tab.
- Portals usually expose a "bundle/topic" hierarchy under the hood
(Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
it to `bundles.json` + `corpus/<bundle>/<page>`.
- Many portals expose `?save_local=` or `.pdf` rendered versions; the
HTML they serve is structurally cleaner than what the page shows
through the SPA shell.
**`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
`summarize_diff()`, `render_human()`, `walk_history()` and the
`--json` / `--history-out` modes. Mostly reusable as-is; the only
product-specific bit is the path layout assumption.
### Phase 2 — Chunking + embeddings + Chroma *(2 days)*
Goal: build a queryable dense index from the scraped corpus.
- `rag/chunk.py` — split each page's markdown into ~400-600 token
chunks. Strategy that works: paragraph-aware splitter with a
rich "chunk 0" containing the page title + 1-sentence summary +
bag-of-words from key terms. Chunk 0 is what dense retrieval lands
on first; getting it right dominates retrieval quality.
- `rag/embeddings.py` — pluggable embedder. Recommended start:
Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
Other defensible choices: `text-embedding-3-small` (OpenAI),
`bge-m3` (also via Ollama). The embedder is a Chroma
`EmbeddingFunction` that returns `list[list[float]]` for a list
of texts.
- `rag/index.py` — orchestrates: read corpus → emit chunks (with
metadata: bundle_id, page_id, version, platform, ordinal) →
upsert into Chroma collection. `--rebuild` flag for a clean
reindex. Run via `python -m rag.index --rebuild`.
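
A paragraph-aware splitter of the kind described for `rag/chunk.py` might start out like the sketch below. It counts whitespace-separated words as a stand-in for tokens, and the 300-word default is an arbitrary assumption rather than a tuned value. It also leaves out the "chunk 0" preamble.

```python
def chunk_paragraphs(markdown: str, max_words: int = 300) -> list[str]:
    """Pack whole paragraphs into chunks of at most ~max_words words."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        n_words = len(para.split())
        # Flush before a paragraph that would overflow the current chunk.
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than `max_words` becomes its own oversized chunk here. A production splitter would likely need a sentence-level fallback for that case.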
Chroma settings: `PersistentClient(path="chroma/")` and
`Settings(anonymized_telemetry=False)`. Single collection
(`<product>_docs`).
**GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
(via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two
GPUs in parallel: ~5 minutes. The orchestrator just needs to load-
balance HTTP requests across multiple Ollama endpoints.
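
One simple way to spread work across endpoints is round-robin assignment, sketched below. It assumes the endpoints arrive as a comma-separated string (the shape of `OLLAMA_URL` in the workflows). The helper names are hypothetical, and the actual HTTP calls are left out.

```python
from itertools import cycle
from typing import Iterator


def endpoint_cycle(ollama_url: str) -> Iterator[str]:
    """Yield endpoints from a comma-separated list, round-robin, forever."""
    endpoints = [u.strip() for u in ollama_url.split(",") if u.strip()]
    if not endpoints:
        raise ValueError("no Ollama endpoints configured")
    return cycle(endpoints)


def assign_batches(batches: list[list[str]], ollama_url: str) -> list[tuple[str, list[str]]]:
    """Pair each batch of texts with the endpoint that should embed it."""
    pool = endpoint_cycle(ollama_url)
    return [(next(pool), batch) for batch in batches]
```

Round-robin ignores differences in GPU speed; a work queue that each endpoint pulls from would balance uneven hardware better.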
### Phase 3 — MCP server skeleton *(1 day)*
Goal: working FastMCP server with three tools — `search_docs`,
`get_page`, `list_versions`.
- `docs_mcp/server.py`: `FastMCP("<product>-docs", stateless_http=True)`.
`stateless_http=True` is critical for production hosting: every
request creates an ephemeral session, so container recreates don't
produce a 404 storm from stale `mcp-session-id` headers on
clients.
- Lazy initialization for everything expensive (Chroma client,
embedder, bundles catalog) so the server starts cleanly even when
Ollama is briefly unreachable.
- Tool: `search_docs(query, version=None, platform=None,
bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
source URLs.
- Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
metadata.
- Tool: `list_versions()`. Returns the version/platform facets
available, drawn from `bundles.json`. Helps the LLM pick filter
values.
Transports: stdio (for local Claude Desktop dev), streamable-HTTP
(for hosted production). One argparse switch.
```python
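from typing import Annotated

from pydantic import Field


@mcp.tool()  # `mcp` is the FastMCP instance created in docs_mcp/server.py
def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
    # ... remaining filters (platform, bundle_id, k) elided
) -> str:
    ...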
```
The tool descriptions are first-class context — the LLM reads them
and decides whether to call the tool. Treat them as button labels;
use "Call when..." / "Use proactively whenever..." phrasings.
### Phase 4 — Containerization *(1 day)*
Goal: image you can run anywhere.
- `Dockerfile`: Python 3.12-slim base, install requirements, COPY
`scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
+ (later) `bm25/`. Don't COPY `scripts/` — those stay external
for ops use only.
- `ENTRYPOINT ["python", "-m", "docs_mcp.server",
"--transport", "streamable-http"]`. Configurable host/port via env.
- `deploy/docker-compose.yml`: one service, named volumes for usage
logs and any state, Watchtower label, depends_on for the reranker
sidecar (Phase 6).
Smoke-test locally: `docker compose up` should expose
`http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC.
### Phase 5 — CI on self-hosted Gitea Actions *(1-2 days)*
Goal: weekly cron rebuild + on-demand code-only ship cycle.
**Two workflows, two cadences:**
| Workflow | Trigger | Steps | Runtime |
|---|---|---|---|
| `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
| `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |
**Critical settings (learned the hard way):**
- `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
(shallow), which breaks any step that walks git history (changelog,
digest history walker). Pay the ~10 second cost; never debug a
"0-byte history file" mystery.
- `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
- Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
expansion doesn't exist; use `cut` / `printf` / `awk`.
**Retry-on-race pattern for long-running scrapes:**
```bash
attempt=1
while [ $attempt -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  [ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
done
```
Works because scrape commits only touch `corpus/` + `bundles.json`,
and code merges only touch `.py` / `.yml` — disjoint paths, trivially
clean rebases.
**Image tagging — three tags per build:**
| Tag | Purpose |
|---|---|
| `:latest` | Watchtower watches this for auto-deploy |
| `:<sha12>` | Immutable; rollback target |
| `:<YYYY.MM.DD>` | Human-readable in incident notes |
Same tag set on every build; rollback is a one-line compose edit
to pin `:<sha>` instead of `:latest`.
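As a rough sketch of the tagging scheme above, the three tags could be derived from the commit SHA and the build date as follows. The helper name `image_tags` and the example repository path are hypothetical, chosen only for illustration.

```python
from datetime import date


def image_tags(repo: str, commit_sha: str, build_date: date) -> list[str]:
    """Return the three tags pushed on every build (hypothetical helper)."""
    sha12 = commit_sha[:12]                   # immutable tag, the rollback target
    stamp = build_date.strftime("%Y.%m.%d")   # human-readable date tag
    return [f"{repo}:latest", f"{repo}:{sha12}", f"{repo}:{stamp}"]


if __name__ == "__main__":
    for tag in image_tags(
        "registry.example.com/owner/product-docs-mcp",  # placeholder path
        "0123456789abcdef0123456789abcdef01234567",
        date(2025, 1, 6),
    ):
        print(tag)
```

Computing all three tags in one place keeps the workflow step and any rollback tooling in agreement about what the tag names look like.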
**Container registry behind Cloudflare:**
Cloudflare's free tier has a 100 MB request body limit. Big image
layers (Chroma index can easily be 800+ MB) exceed it on push. The
fix is a LAN registry endpoint for push, public hostname for pull:
```yaml
env:
  REGISTRY_PUSH: <lan-ip>:<port>      # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>    # response bodies aren't capped
```
Runner needs the LAN endpoint in `/etc/docker/daemon.json`
`insecure-registries`. Costs nothing operationally; saves hours
of debugging.
**Registry GC:** weekly cron in the workflow that walks the package
versions, keeps `:latest` + N most-recent date tags + anything
pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the
package GC on the Gitea side reclaims disk after.
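The retention rule described above can be sketched as a pure function that is easy to test separately from the registry API calls. This is a minimal sketch, not the actual `registry_gc.py`: the function name, the `(tag, pushed_at)` input shape, and the default of 5 date tags are assumptions.

```python
import re
from datetime import datetime, timedelta

# Matches the YYYY.MM.DD date tags from the tagging table.
DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")


def tags_to_delete(
    versions: list[tuple[str, datetime]],
    now: datetime,
    keep_date_tags: int = 5,   # assumed value for "N most-recent date tags"
    keep_days: int = 90,
) -> list[str]:
    """Pick tags to delete from a list of (tag, pushed_at) pairs.

    Keeps `latest`, the N most-recent date tags, and anything pushed
    within the last `keep_days` days; everything else is returned.
    """
    cutoff = now - timedelta(days=keep_days)
    # YYYY.MM.DD tags sort chronologically as plain strings.
    date_tags = sorted((tag for tag, _ in versions if DATE_TAG.match(tag)), reverse=True)
    keep = {"latest", *date_tags[:keep_date_tags]}
    return sorted(
        tag for tag, pushed_at in versions
        if tag not in keep and pushed_at < cutoff
    )
```

Keeping the selection logic free of HTTP calls means the deletion step only has to loop over the returned tags, and a dry-run mode is just printing them.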
### Phase 6 — Reranker *(half a day)*
Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
dense candidates.
- A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
`jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
GPU strongly recommended (CPU latency is unworkable for live
queries).
- `_rerank(query, docs)` helper in the server: POST to the endpoint,
apply the scores, re-sort the top-N candidates. Defensive: on any
failure log a warning and fall through to dense-only.
- Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
pull candidates for reranking; 200 is a good default),
`RERANK_TIMEOUT` (30s for cold-start tolerance).
- **Watch the per-pair token limit.** Jina's GGUF reports
`n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
any pair exceeds it. Truncate doc text to ~2000 chars before
reranking. The full untruncated chunk still goes back to the user;
truncation is only for the reranker scoring path.
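A minimal sketch of the rerank helper described above, assuming the endpoint accepts `{"query", "documents"}` and returns `{"results": [{"index", "relevance_score"}]}`. That payload shape is an assumption to verify against your llama.cpp build. The `post` parameter is a hypothetical injection point so the fallback path can be tested without a live sidecar.

```python
import json
import logging
import urllib.request

log = logging.getLogger(__name__)
MAX_DOC_CHARS = 2000  # truncation applies to the scoring path only


def _post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def rerank(query, docs, url, timeout=30.0, post=_post_json):
    """Return indices into `docs`, best first; dense order on any failure."""
    dense_order = list(range(len(docs)))
    if not url or not docs:
        return dense_order  # reranker is off by default
    payload = {"query": query, "documents": [d[:MAX_DOC_CHARS] for d in docs]}
    try:
        results = post(f"{url}/v1/rerank", payload, timeout)["results"]
        scores = {r["index"]: r["relevance_score"] for r in results}
        # Unscored docs sink to the bottom; ties keep their dense order.
        return sorted(dense_order, key=lambda i: scores.get(i, float("-inf")), reverse=True)
    except Exception as exc:  # defensive: never fail the search on rerank errors
        log.warning("rerank failed, falling back to dense order: %s", exc)
        return dense_order
```

Returning indices rather than reordered text lets the caller hand the full untruncated chunks back to the user.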
### Phase 7 — Eval harness *(1 day)*
Goal: hand-curated golden queries + standard metrics so you can
measure the impact of any retrieval change.
- `eval/queries.jsonl`: 20-25 hand-curated queries with expected
pages. Spread across versions, platforms, and difficulty levels.
Include the queries that "obviously" should work and DON'T —
those are the ones to track.
- `eval/retrievers.py`: a `Retriever` protocol with concrete
implementations: `DenseRetriever`, `RerankedRetriever`,
`BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
matrix dimension per knob.
- `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
retrievers; emits a markdown comparison table at
`eval/results/<baseline>.md`. Commit the result so PRs land with
the A/B evidence in the diff.
Three numbers are enough — don't overengineer. The hand-curated
queries are the value; the metrics are just a stable way to score
them.
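For reference, the three metrics fit in a few lines each. This is a sketch using binary relevance (a page is either in the expected set or not) and assuming each ranked list contains no duplicate pages. The function names are illustrative, not the template's `run_eval.py`.

```python
import math


def reciprocal_rank(ranked: list[str], expected: set[str]) -> float:
    """1/rank of the first expected page, or 0.0 if none was retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if page in expected:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], expected: set[str], k: int = 5) -> float:
    """Fraction of expected pages that appear in the top k."""
    if not expected:
        return 0.0
    return len(set(ranked[:k]) & expected) / len(expected)


def ndcg_at_k(ranked: list[str], expected: set[str], k: int = 5) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, page in enumerate(ranked[:k], start=1)
        if page in expected
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
    return dcg / ideal if ideal else 0.0


def mean(values: list[float]) -> float:
    """MRR is just mean(reciprocal_rank(...)) over all golden queries."""
    return sum(values) / len(values) if values else 0.0
```

Each function scores one query; the comparison table is the mean of each metric per retriever.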
### Phase 8 — BM25 + Hybrid retrieval *(half a day, conditional)*
**Skip unless your eval shows specific failure modes.** Dense
embeddings + cross-encoder reranker handle most queries. The case
where they don't: queries with rare technical tokens (filenames,
language names, error codes) get buried at dense rank 1000+ by a
much larger prose corpus that's semantically nearby. The reranker
only sees top-200, so it never gets a shot.
- `rag/bm25.py`: SQLite FTS5 index, in the stdlib, on-disk
(`bm25/<product>.db`). Two tables — metadata table keyed by
rowid, FTS5 virtual table for full-text. Sanitize the query
(strip FTS5 reserved keywords, OR-join tokens for recall). ~210
LOC.
- `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
ordered ids plus per-retriever contribution dict for telemetry.
- `search_docs` hybrid path: run dense + BM25 in parallel,
RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
`HYBRID_SEARCH=true`.
- Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
to usage logs so you can measure whether BM25 is actually earning
its keep on production traffic.
If after 4-6 weeks of production data you see `bm25_only >= 80%`,
you can simplify to BM25-only (much less infrastructure). If
`both >= 50%`, hybrid is acting as tie-breaker not rescue — keep it
or simplify depending on how much you care about the long tail.
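The fusion step and the `top1_source` label can be sketched as below, assuming 1-based ranks and retrievers named `dense` and `bm25`. The exact function signatures are illustrative and may differ from the server's `_rrf_fuse()`.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60):
    """Reciprocal Rank Fusion over {retriever_name: [doc_id, ...]} (best first).

    Returns (ordered_ids, contributions), where contributions maps each
    doc id to the per-retriever share of its fused score.
    """
    scores: dict[str, float] = defaultdict(float)
    contrib: dict[str, dict[str, float]] = defaultdict(dict)
    for name, ids in rankings.items():
        for rank, doc_id in enumerate(ids, start=1):
            part = 1.0 / (k + rank)
            scores[doc_id] += part
            contrib[doc_id][name] = part
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ordered, dict(contrib)


def top1_source(ordered: list[str], contrib: dict[str, dict[str, float]]) -> str:
    """Label which retriever(s) surfaced the top fused result."""
    if not ordered:
        return "none"
    names = set(contrib[ordered[0]])
    if names == {"dense"}:
        return "dense_only"
    if names == {"bm25"}:
        return "bm25_only"
    return "both"
```

Because RRF only uses ranks, no score calibration between the dense distances and the BM25 scores is needed.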
### Phase 9 — Multi-version diff tooling *(1 day, if applicable)*
**Only relevant if the product has multiple maintained versions.**
- `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
diff between two versions of the same page. Two matching
strategies: editor-curated `topic_cluster` peer (if the portal
exposes it), or same-filename fallback.
- `list_cluster(bundle_id, page_id)`: list cross-version peers
for one page.
- `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
removed / changed pages between two bundles, sorted by churn.
- `_diff_churn(a, b)`: small helper, ~15 LOC of `difflib.unified_diff`
line counting with zero context lines (`n=0`). Used by `bundle_changelog`,
`find_doc_inconsistencies`, and `weekly_digest`.
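A minimal sketch of the churn helper, assuming "churn" means added plus removed lines. The two file-header lines are dropped by position rather than by prefix, so a removed markdown horizontal rule (`---`) is still counted.

```python
import difflib


def diff_churn(a: str, b: str) -> int:
    """Count added + removed lines between two page bodies."""
    diff = list(difflib.unified_diff(a.splitlines(), b.splitlines(), n=0, lineterm=""))
    body = diff[2:]  # skip the '---' / '+++' file header lines
    # With n=0 there are no context lines: only '@@' hunk headers and +/- lines remain.
    return sum(1 for line in body if line.startswith(("+", "-")))
```

Dividing the result by the page's line count gives a churn ratio that can be compared against a band.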
### Phase 10 — Usage logging *(half a day)*
Goal: per-call JSONL telemetry so you can answer "what are people
actually asking for" and "is the new feature getting used."
- `docs_mcp/usage.py`: `TimedCall` context manager that captures
tool name, args, elapsed time, hits returned, any extra fields
set by the tool via `_call.set(key=value)`. Writes JSONL to
`var/logs/usage.jsonl`, rotated daily, kept 90 days.
- Mount the log dir as a named compose volume so logs survive
container recreates.
- `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
the JSONL files, prints per-tool counts, top queries, 0-hit
queries, filter usage histogram, reranker activity. Markdown
output flag for piping into weekly digest emails.
What to log: query text, filters, hits returned, elapsed_ms,
reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
to log: anything PII-shaped. The corpus is public, queries are
usually about the product, not personal — but be deliberate.
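The context manager could look roughly like this. It is a sketch rather than the template's `usage.py`: the date-stamped filename stands in for daily rotation, and the `log_dir` parameter and the `ts` / `error` fields are assumptions.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


class TimedCall:
    """Record one tool call as a JSONL line when the `with` block exits."""

    def __init__(self, tool: str, args: dict, log_dir: str = "var/logs"):
        self.record = {"tool": tool, "args": args}
        self.log_dir = Path(log_dir)

    def set(self, **fields) -> None:
        """Attach extra fields (hits_returned, reranker_fired, ...)."""
        self.record.update(fields)

    def __enter__(self) -> "TimedCall":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        now = datetime.now(timezone.utc)
        self.record["ts"] = now.isoformat()
        self.record["elapsed_ms"] = round((time.perf_counter() - self._start) * 1000, 1)
        if exc_type is not None:
            self.record["error"] = exc_type.__name__
        try:
            self.log_dir.mkdir(parents=True, exist_ok=True)
            path = self.log_dir / f"usage-{now:%Y-%m-%d}.jsonl"  # one file per day
            with path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(self.record, default=str) + "\n")
        except OSError:
            pass  # telemetry must never break a tool call
        return False  # do not swallow the tool's own exception
```

With one file per day, the 90-day retention becomes a simple delete of files older than the cutoff.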
### Phase 11 — Curated knowledge layer *(2 days)*
The "RAG can't tell you what isn't in the docs" gap. Surfaces:
- **API quickstart repos** if the product has them. Ingest the
example scripts (Python, PowerShell, curl) into the corpus.
Rewrite chunk-0 for each script to embed naturally — explicit
natural-language H1, task description sentence, keyword bag.
Dense embeddings need an anchor.
- **A curated `<product>_api_lessons` markdown doc** for things
the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
patterns, schema bugs you've hit, platform-detection quirks.
Surface as a dedicated MCP tool whose description tells the LLM:
*"Call proactively whenever the user asks you to write a script
/ integrate with the API / debug a 4xx response."*
- **An auto-hint banner** in `search_docs` results — when the
query matches a script/API trigger word, render a one-line nudge
at the top of results pointing at the dedicated tool. Belt-and-
suspenders for queries where the LLM doesn't think to call it
proactively.
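The auto-hint banner can be as small as a regex check. The trigger-word list below is hypothetical and should be tuned against real usage logs, and `hint_banner` is an illustrative name.

```python
import re

# Hypothetical trigger words; tune these against real 0-hit and API-flavored queries.
_API_TRIGGERS = re.compile(
    r"\b(api|script|curl|powershell|swagger|openapi|sdk)\b", re.IGNORECASE
)


def hint_banner(query: str, lessons_tool: str) -> str:
    """Return a one-line nudge to prepend to search results, or ''."""
    if _API_TRIGGERS.search(query):
        return f"> Tip: for scripting or API work, also call `{lessons_tool}`.\n\n"
    return ""
```

The banner is prepended to the rendered results, so ordinary queries are unaffected.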
### Phase 12 — Doc-bug workflow tools *(1 day, optional)*
Two tools that pair up to enable a *"check the docs for
inconsistencies, draft bugs, confirm, submit"* workflow.
- `find_doc_inconsistencies(scope_query, version=None, platform=None,
max_pages=30, checks=None)`: deterministic, read-only. Two checks:
cross-version drift (pages whose content shifted between immediate-
previous versions in the actionable 10-60% churn band) and
redirect-chain detection (short pages whose body is just a "see
[other page] for details" pointer). Heavy lifting is line-level
diff (`difflib`) against editor-curated cluster peers; the model
judges which findings are real bugs.
- `submit_doc_bug(page_url, content, email=None, rating=None,
like=None)`: POSTs to the docs portal's feedback endpoint.
Env-gated by `DOC_BUG_SUBMIT_ENABLED=true` so dev/staging
deployments can't accidentally hit the upstream. The tool's
docstring is loud about a mandatory operator-confirmation
workflow per submission — LLM must draft, show, ask, then
submit. Explicit *"do not loop"* instruction. Defensive
validation upfront (URL host matches expected portal, content
non-empty, etc.) so the LLM gets a clean error instead of a
rejected POST.
**You'll need to find the docs portal's feedback endpoint.** Most
portals route the "Was this helpful?" widget through a backend
API; sniff the browser network tab on the live site. The payload
shape varies; common fields: content/body, page url/href, optional
email, optional rating, optional thumbs. Most accept anonymous
POSTs with no captcha at the JSON-API layer (even if the widget
shows a captcha). Validate before you ship — and if the endpoint
has rate limits or captcha enforcement, the tool returns a clean
"submission rejected — paste manually at <url>" fallback.
The whole point is the per-bug operator confirmation in the
LLM-side conversation flow; the tool description enforces it. Do
not bypass.
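The upfront validation for `submit_doc_bug` might be sketched as a pure function that returns an error string instead of raising, so the LLM sees a clean message. The function name, its parameters, and the https requirement are assumptions for illustration.

```python
from urllib.parse import urlparse


def validate_doc_bug(page_url: str, content: str, expected_host: str, enabled: bool):
    """Return an error string, or None when the submission may proceed."""
    if not enabled:
        return "submission disabled: set DOC_BUG_SUBMIT_ENABLED=true to enable"
    parsed = urlparse(page_url)
    # Assumed policy: only https URLs on the expected docs portal host.
    if parsed.scheme != "https" or parsed.hostname != expected_host:
        return f"page_url must be an https URL on {expected_host}"
    if not content or not content.strip():
        return "content must be non-empty"
    return None
```

Running these checks before the POST keeps the env gate and the host check in one place. The operator-confirmation step still lives in the tool description.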
### Phase 13 — Weekly digest tool *(half a day)*
Goal: a tool that answers *"what changed in the docs in the last N
days?"* with no runtime git dependency (the prod container has no
git).
- Extend `scrape/changelog.py` with `--json` (one-shot structured
output) and `--history-out PATH` (walks `git log --first-parent
--since="<N> days ago"` for corpus-touching commits, writes one
JSON line per commit to a JSONL file).
- CI workflows write the JSONL file into the image at build time:
`corpus/.digest/history.jsonl`. Both `refresh.yml` and
`image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
- New MCP tool `weekly_digest(days=7, version=None, platform=None,
max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
filters to the window, applies version/platform via
`bundles.json` metadata, aggregates per-bundle change counts and
page lists, renders markdown.
- Post-filter totals are critical: the headline "X page changes
across Y bundles" must compute X from the filtered set, not the
raw record count. Otherwise filtered calls look wrong to the
reader.
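The post-filter totals point can be sketched as follows. The record shape is hypothetical (one record per commit with a `date` and a list of `changes`), and the headline is computed only from what survives the window and facet filters.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def summarize(records, bundles, now, days=7, version=None, platform=None):
    """Aggregate per-bundle page changes; totals come from the filtered set."""
    cutoff = now - timedelta(days=days)
    per_bundle: dict[str, list[str]] = defaultdict(list)
    for rec in records:
        if datetime.fromisoformat(rec["date"]) < cutoff:
            continue  # outside the requested window
        for change in rec["changes"]:
            meta = bundles.get(change["bundle_id"], {})
            if version and meta.get("version") != version:
                continue
            if platform and meta.get("platform") != platform:
                continue
            per_bundle[change["bundle_id"]].append(change["page_id"])
    total = sum(len(pages) for pages in per_bundle.values())
    headline = f"{total} page changes across {len(per_bundle)} bundles"
    return headline, {b: sorted(p) for b, p in per_bundle.items()}
```

Because the headline is built after filtering, a call with `version="10"` never reports counts from other versions.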
Out of scope but trivial bolt-ons: scheduled HTML email of the
digest, auto-publish to a blog, per-page diff excerpts as a
follow-up tool.
---
## Standard tool set
By the end you'll have ~15 tools registered. Production-tested
shape:
| Tool | What it does |
|---|---|
| `search_docs` | Semantic search with version/platform/bundle filters |
| `get_page` | Full markdown + metadata for one page |
| `list_versions` | Discover available facet values |
| `list_cluster` | Cross-version peers for one page (if applicable) |
| `diff_versions` | Unified diff of a page across two versions |
| `bundle_changelog` | Added / removed / changed pages between two bundles |
| `weekly_digest` | What changed in the last N days, with filters |
| `corpus_status` | Freshness + size of the knowledge base |
| `find_doc_inconsistencies` | Scoped scan for doc bugs |
| `submit_doc_bug` | Submit a drafted bug (env-gated, operator-confirmed) |
| `<product>_api_lessons` | Curated API gotchas, proactively-called |
| product-specific tools | Interop matrix, lifecycle queries, etc. |
---
## Per-product customization checklist
When applying this template to a new product, here's what you have
to figure out yourself — everything else is shared infrastructure:
- **Doc portal mechanics**
- URL pattern for pages
- Bundle/version concept (Zoomin "bundle", Madcap "project",
GitBook "space", Docusaurus "docs version" — same idea, different
name)
- SPA backing API (sniff the network tab) or fallback to
headless browser
- How `topic_cluster`-equivalent cross-version peers are exposed
(or whether you synthesize them from filenames)
- **Bundle metadata schema**
- What does `version` look like? Semver, calendar, named?
- What does `platform` mean for this product? Is there a useful
facet at all?
- Other useful facets (language, product line, edition)?
- **Filterable facets** for `search_docs`
- One filter per high-cardinality facet
- Skip filters that have <5 distinct values — they're not worth
the surface area
- **Feedback endpoint** (for `submit_doc_bug`, if you want it)
- URL of the POST endpoint
- Required + optional payload fields
- Captcha / rate-limit behavior
- Whether anonymous submissions are accepted
- **Curated knowledge** for the `_api_lessons` tool
- What does the product's API documentation NOT say that you've
learned from real integration work?
- **Quickstart / example repos**
- Does the vendor publish working code? Ingest it; rewrite
chunk-0 for natural-language retrieval.
---
## Decisions worth carrying forward
Things you'll save time on by deciding the same way again:
- **Tool descriptions are user interface.** The LLM reads them
verbatim and decides whether to call the tool. *"Use when..."*
and *"Call proactively whenever..."* are real surfaces; treat
them like button labels. Most retrieval improvements turn out
to be tool-description rewrites in disguise.
- **`stateless_http=True`** on the FastMCP server. Eliminates
whole categories of session-ID-related 404 storms after
container recreates.
- **Pre-bake everything at CI time.** No runtime calls to git,
external services, or anything you wouldn't trust on a
Cloudflare outage. If the digest needs git history, write a
JSONL file at CI time. If the lessons doc needs to load fast,
bake it into the image.
- **Env-gate every side-effecting tool.** Off by default in dev;
on only in production compose. Belt and suspenders against
accidental writes from staging environments.
- **Operator-confirmation pattern for side-effecting tools.**
The tool docstring is the only place to enforce
human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
"show-confirm-then-submit" — those phrasings work.
- **Verify with hand-curated golden queries before shipping any
retrieval change.** Numbers in the diff, in the commit message.
Don't ship retrieval changes on vibes.
- **Two-cadence CI** (weekly scrape vs on-demand code-only)
saves hours per code iteration once you're past the
one-iteration-a-week stage.
- **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
what Watchtower watches; `:<sha>` is your safety net. Rollback
is a one-line compose edit, not a redeploy.
- **Usage logging is non-negotiable.** You will be wrong about
what people use. Capture the truth from day one; let it tell
you which features to keep building and which to delete.
---
## Glossary
- **Bundle** — one logical doc set in the portal. Zoomin calls
them bundles; Madcap calls them projects; the concept is the
same: a versioned, titled collection of pages. One dir under
`corpus/`.
- **Page** — one HTML page in a bundle. One `.md` + one `.json`
sidecar under the bundle dir.
- **Topic cluster** — Zoomin's name for "this page in version
10.9 corresponds to that page in version 10.8." Stored in the
per-page sidecar. The portal-agnostic concept is "cross-version
peer mapping."
- **Chunk** — a unit of text that gets independently embedded and
stored in Chroma. Target ~400-600 tokens; preserve paragraph
boundaries.
- **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
lists from independent retrievers without score calibration.
---
## What's deliberately NOT in this template
Decisions you should make per-product (not copy from the original
build):
- The reverse proxy and TLS termination layer. Could be Caddy,
nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
- The Gateway / aggregator in front of multiple MCPs (MetaMCP is one
option; you may not need any aggregator if you're running a
single product MCP).
- The specific embedding model — `nomic-embed-text` is a strong
default but newer / domain-specific models may be better for
some products.
- The Ollama containers / GPU setup — depends on what hardware you
have. The pattern is one container per GPU with explicit
`NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
across them.
- Whether to publish a blog series alongside the build. Strongly
recommended (forces clarity, builds an audience), but optional.
@@ -0,0 +1,84 @@
# seed-mcp
MCP server over the public catalogs of major US row-crop seed
vendors — corn, soybeans, wheat. Sibling project to
[`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
(pesticide labels), feeding the same Drawbar farm-advisor AI.
The server exposes per-variety records with **agronomic ratings**,
**disease tolerance**, **trait stack**, **maturity**, and
**regional notes** — so the advisor can answer questions like
"which corn hybrid for sandy soil, drought-prone, RM ≤105 in
northeast Iowa?" without rummaging through individual brand sites.
## Vendor coverage
| Vendor | Verdict | Varieties | Notes |
|---|---|---|---|
| Bayer seeds (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | Same `cropscience.bayer.us` Next.js infra as crop-chem-docs |
| Golden Harvest (Syngenta) | 🟢 | ~175 | Sitemap + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | Shares PDF fetcher with Golden Harvest |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views, server-rendered |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API (no auth) |
| Beck's products | 🟡 | 860 | Identity-only until SeedIQ XHR sniffed |
| Pioneer (Corteva) | 🔴 | — | ToS bans automation — curated fallback lesson instead |
## Quick start
```bash
git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Run one scraper
python -m scrape.runner --source bayer_seeds --force
# Rebuild indexes
python -m rag.index --rebuild
# Local MCP server (stdio for Claude Desktop dev)
python -m docs_mcp.server --transport stdio
```
## Tools exposed
| Tool | Purpose |
|---|---|
| `search_docs` | Hybrid + rerank variety search with crop / RM / trait / region filters |
| `get_page` | Full variety record by `(source, source_key)` |
| `list_versions` | Discover crops, brands, traits, RM/MG ranges, wheat classes |
| `corpus_status` | Counts + freshness; useful for health probes |
| `crop_seed_api_lessons` | Curated agronomy lessons — Pioneer fallback, disease-scale normalization, regional placement heuristics |
## Build phases
This is a clone of [`docs-mcp-template`](https://git.jpaul.io/justin/docs-mcp-template).
The 13 phases in `PLAN.md` apply:
| Phase | Status |
|---|---|
| 0 — scaffold | done |
| 1 — first scraper (bayer_seeds) | next |
| 2 — chunk + index | pending |
| 3 — baseline MCP tools | template defaults |
| 4-5 — Dockerfile + CI | done (placeholders filled) |
| 6 — reranker | shares `llama-rerank` sidecar with crop-chem-docs |
| 7 — eval harness | pending (curate ~25 queries) |
| 8 — hybrid search | done (template) |
| 9 — diff_versions, list_cluster | optional |
| 11 — `crop_seed_api_lessons` curated layer | pending |
See `CLAUDE.md` for the canonical sidecar schema and the
disease-scale-normalization gotcha (Golden Harvest is reversed).
## Infrastructure
- **Registry**: `git.jpaul.io/justin/seed-mcp:latest` (Watchtower) /
`:corpus-YYYY.MM.DD` (production pin)
- **Embedder**: shared Ollama pool with crop-chem-docs (Gitea-host
GPUs + Windows Ollama; CI never hits trashpanda's production Ollama)
- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla
P4 (one container, both MCPs use it)
- **PRODUCT_NAME**: `crop_seed` (not `seed_mcp` — used in Chroma
collection, BM25 db filename, and `crop_seed_api_lessons` tool)
@@ -0,0 +1,111 @@
# Hosting stack for a docs MCP server.
#
# Replace <product> below with your product name on first deploy.
# Volumes: usage logs are mounted to a host path so they survive
# Watchtower-driven container recreates.
#
# This template assumes a reverse proxy / Cloudflare Tunnel terminates
# TLS in front of port 8000. Adjust if your infra differs.
services:
  # The MCP server. Watchtower auto-pulls on :latest changes.
  <product>-docs-mcp:
    image: <registry>/<owner>/<product>-docs-mcp:latest
    container_name: <product>-docs-mcp
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      PRODUCT_NAME: "<product>"
      PRODUCT_DOCS_URL: "https://docs.example.com"
      # Streamable-HTTP transport. Stateless mode is required for
      # production: clients don't lose sessions when Watchtower
      # recreates the container.
      MCP_TRANSPORT: streamable-http
      MCP_HOST: 0.0.0.0
      MCP_PORT: "8000"
      # If you run MetaMCP or another gateway in front and reach
      # this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
      # add that hostname here. "*" disables the rebind check entirely.
      MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"
      # Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
      RERANK_URL: http://<product>-rerank:8080
      RERANK_POOL: "200"
      RERANK_TIMEOUT: "30"
      # Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
      # only after the eval harness shows the dense-only path
      # missing technical-term queries that BM25 catches.
      HYBRID_SEARCH: "true"
      # Phase 10 — usage telemetry.
      USAGE_LOG_DIR: /app/var/logs
      USAGE_LOG_KEEP_DAYS: "90"
      # Phase 12 — doc-bug submission gate. Off by default; on only
      # in production after you've verified the endpoint contract.
      DOC_BUG_SUBMIT_ENABLED: "false"
      # DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
    volumes:
      # Usage logs persist across container recreates.
      - ./<product>-docs-mcp-logs:/app/var/logs
    depends_on:
      - <product>-rerank
    labels:
      # Watchtower polls *only* containers with this label set true.
      com.centurylinklabs.watchtower.enable: "true"
    networks:
      - mcp

  # Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
  # Requires GPU access; adjust runtime/devices for your hardware.
  <product>-rerank:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: <product>-rerank
    restart: unless-stopped
    # Mount the GGUF model from the host. Download from huggingface
    # (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
    volumes:
      - /path/to/models:/models:ro
    command: >
      --model /models/jina-reranker-v2-base.Q8_0.gguf
      --reranking
      --host 0.0.0.0
      --port 8080
      --n-gpu-layers 99
      --ctx-size 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - mcp

  # Watchtower — auto-pulls :latest on push.
  # Only watches containers labeled `com.centurylinklabs.watchtower.enable=true`.
  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      WATCHTOWER_POLL_INTERVAL: "300"   # 5 min
      WATCHTOWER_LABEL_ENABLE: "true"
      WATCHTOWER_CLEANUP: "true"        # remove old images after pull
    # If your registry requires auth, mount a docker config:
    # volumes:
    #   - ./registry-auth.json:/config.json:ro
    networks:
      - mcp

networks:
  mcp:
    driver: bridge
@@ -0,0 +1,263 @@
"""MCP server skeleton — fill in PRODUCT_NAME and the tool bodies.
This file is the template's structural anchor. The phases described in
PLAN.md add or extend pieces of this file:
Phase 3 — search_docs, get_page, list_versions stubs (you are here)
Phase 6 — reranker integration in search_docs
Phase 8 — BM25 + hybrid retrieval (HYBRID_SEARCH env gate, _rrf_fuse)
Phase 9 — diff_versions, list_cluster, bundle_changelog
Phase 10 — TimedCall wiring (already imported below)
Phase 11 — <product>_api_lessons tool
Phase 12 — find_doc_inconsistencies, submit_doc_bug
Phase 13 — weekly_digest + _digest_history reader
Every stub below has a docstring + `raise NotImplementedError`. Replace
the body when you reach the corresponding phase. Keep the signatures
stable across products — clients depend on them.
"""
from __future__ import annotations
import json
import logging
import os
import re
from pathlib import Path
from typing import Annotated
from mcp.server.fastmcp import FastMCP
from pydantic import Field
from .usage import TimedCall
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Product-specific configuration. Set these for each new build.
# ---------------------------------------------------------------------------
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "crop_seed")
PRODUCT_DOCS_URL = os.environ.get("PRODUCT_DOCS_URL", "https://git.jpaul.io/justin/seed-mcp")
COLLECTION = f"{PRODUCT_NAME}_docs"
# Paths inside the deployed container (and matching layout locally for dev).
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
CHROMA_DIR = ROOT / "chroma"
BM25_DB = Path(os.environ.get("BM25_DB", str(ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
BUNDLES_JSON = ROOT / "bundles.json"
# ---------------------------------------------------------------------------
# Feature flags (Phase 6 / 8 / 12 enable these as you ship each phase).
# ---------------------------------------------------------------------------
RERANK_URL = os.environ.get("RERANK_URL", "").rstrip("/") or None
RERANK_POOL = int(os.environ.get("RERANK_POOL", "50"))
RERANK_TIMEOUT = float(os.environ.get("RERANK_TIMEOUT", "30"))
HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "").lower() in ("true", "1", "yes", "on")
RRF_K = int(os.environ.get("RRF_K", "60"))
DOC_BUG_SUBMIT_ENABLED = os.environ.get("DOC_BUG_SUBMIT_ENABLED", "").lower() in ("true", "1", "yes", "on")
DOC_BUG_API_URL = os.environ.get("DOC_BUG_API_URL", "") # product-specific endpoint
DOC_BUG_TIMEOUT = float(os.environ.get("DOC_BUG_TIMEOUT", "15"))
# ---------------------------------------------------------------------------
# FastMCP setup.
#
# stateless_http=True — every request creates an ephemeral session and
# discards it on return. Critical for production: clients don't get
# 404 storms when the container is recreated by Watchtower.
# ---------------------------------------------------------------------------
mcp = FastMCP(f"{PRODUCT_NAME}-docs", stateless_http=True)
# ---------------------------------------------------------------------------
# Lazy helpers — instantiate expensive things only when actually needed,
# so the server still starts when (e.g.) Ollama is briefly unreachable.
# ---------------------------------------------------------------------------
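_BUNDLES_CACHE: dict[str, dict] | None = None


def _bundles() -> dict[str, dict]:
    """Cached load of bundles.json into a {slug: bundle_dict} mapping.

    bundles.json is the product-specific catalog written by the Phase 1
    scraper. See PLAN.md Phase 1 for the schema.
    """
    global _BUNDLES_CACHE
    if _BUNDLES_CACHE is None:
        if not BUNDLES_JSON.exists():
            # Don't cache the miss, so a later scrape is picked up.
            return {}
        cat = json.loads(BUNDLES_JSON.read_text())
        _BUNDLES_CACHE = {b["slug"]: b for b in cat}
    return _BUNDLES_CACHE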
def _build_where(version: str | None, platform: str | None, bundle_id: str | None) -> dict | None:
    """Translate filter args into a Chroma `where` clause."""
    conds: list[dict] = []
    if version:
        conds.append({"version": version})
    if platform:
        conds.append({"platform": platform})
    if bundle_id:
        conds.append({"bundle_id": bundle_id})
    if not conds:
        return None
    if len(conds) == 1:
        return conds[0]
    return {"$and": conds}


def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
    """Read a corpus page off disk. Returns (markdown_body, metadata_dict)."""
    md_path = CORPUS / bundle_id / (page_id + ".md")
    json_path = CORPUS / bundle_id / (page_id + ".json")
    if not md_path.exists() or not json_path.exists():
        return None
    return md_path.read_text(), json.loads(json_path.read_text())
# ===========================================================================
# Tools
# ===========================================================================
@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description=f"Natural-language query about {PRODUCT_NAME}.")],
    version: Annotated[
        str | None,
        Field(description="OPTIONAL version filter — restrict to one product version."),
    ] = None,
    platform: Annotated[
        str | None,
        Field(description="OPTIONAL platform filter. Set to one of the platforms listed by list_versions(); omit for all platforms."),
    ] = None,
    bundle_id: Annotated[
        str | None,
        Field(description="OPTIONAL bundle filter — pin to a specific doc bundle slug."),
    ] = None,
    k: Annotated[int, Field(description="Number of results to return.", ge=1, le=50)] = 10,
) -> str:
    """Search the product docs corpus.

    Returns the top-k most relevant chunks (with full source page URLs)
    given a natural-language query. Optional filters narrow the search
    to one version, one platform, or one bundle. Use list_versions()
    first if you need to discover the available facet values.

    Call this tool whenever the user asks anything that should be
    answerable from the official product documentation.
    """
    with TimedCall("search_docs", {
        "query": query, "version": version, "platform": platform,
        "bundle_id": bundle_id, "k": k,
    }) as _call:
        # TODO Phase 2-3: query Chroma collection (see rag/index.py for
        # how it was built). Render the top-k chunks as markdown with
        # source URLs.
        # TODO Phase 6: optional reranker via _rerank() if RERANK_URL set.
        # TODO Phase 8: hybrid retrieval if HYBRID_SEARCH=true — run
        # dense + BM25 in parallel, RRF-fuse, hand merged pool to rerank.
        _call.set(hits_returned=0)
        raise NotImplementedError("Phase 2/3: implement Chroma query + rendering")
@mcp.tool()
def get_page(
    bundle_id: Annotated[str, Field(description="Bundle slug.")],
    page_id: Annotated[str, Field(description="Page filename within the bundle.")],
) -> str:
    """Return the full markdown for one page, plus a metadata header.

    Use after search_docs surfaces a relevant page and the user (or you)
    want the complete text — not just the matched chunks.
    """
    with TimedCall("get_page", {"bundle_id": bundle_id, "page_id": page_id}) as _call:
        data = _read_page(bundle_id, page_id)
        if data is None:
            _call.set(found=False)
            return f"Page not found: {bundle_id}/{page_id}"
        md, meta = data
        _call.set(found=True, page_chars=len(md))
        # TODO: add a metadata header (title, version, source URL) above
        # the body. Product-specific shape.
        return md
@mcp.tool()
def list_versions() -> str:
    """List the available version/platform facets across all bundles.

    Use this to discover valid filter values for search_docs.
    """
    with TimedCall("list_versions", {}) as _call:
        cat = _bundles()
        if not cat:
            return "_(no bundles indexed yet — run the scraper + indexer)_"
        versions = sorted({b.get("version") for b in cat.values() if b.get("version")})
        platforms = sorted({b.get("platform") for b in cat.values() if b.get("platform")})
        _call.set(versions=len(versions), platforms=len(platforms))
        lines = [f"# Facets across {len(cat)} bundle(s)", ""]
        if versions:
            lines += ["## Versions", ""]
            lines += [f"- `{v}`" for v in versions]
            lines.append("")
        if platforms:
            lines += ["## Platforms", ""]
            lines += [f"- `{p}`" for p in platforms]
        return "\n".join(lines)
# ---------------------------------------------------------------------------
# Stubs for later phases — keep the signatures in this file so refactors
# don't lose the contracts. Implementations come per phase.
# ---------------------------------------------------------------------------
# @mcp.tool() # Phase 9
# def list_cluster(bundle_id: str, page_id: str) -> str: ...
# @mcp.tool() # Phase 9
# def diff_versions(bundle_id: str, page_id: str, against_bundle_id: str, context: int = 3) -> str: ...
# @mcp.tool() # Phase 9
# def bundle_changelog(bundle_id_new: str, bundle_id_old: str, min_churn: int = 5, max_changed: int = 50) -> str: ...
# @mcp.tool() # Phase 13
# def weekly_digest(days: int = 7, version: str | None = None, platform: str | None = None, ...) -> str: ...
# @mcp.tool() # Phase 9 (or 3 — useful early)
# def corpus_status() -> str: ...
# @mcp.tool() # Phase 11
# def myproduct_api_lessons(topic: str | None = None) -> str: ...
# @mcp.tool() # Phase 12
# def find_doc_inconsistencies(scope_query: str, ...) -> str: ...
# @mcp.tool() # Phase 12
# def submit_doc_bug(page_url: str, content: str, email: str | None = None, ...) -> str: ...
# ===========================================================================
# Entry point
# ===========================================================================
def main() -> None:
    import argparse

    p = argparse.ArgumentParser(description=f"{PRODUCT_NAME} docs MCP server")
    p.add_argument("--transport", choices=["stdio", "streamable-http", "sse"],
                   default=os.environ.get("MCP_TRANSPORT", "stdio"))
    p.add_argument("--host", default=os.environ.get("MCP_HOST", "0.0.0.0"))
    p.add_argument("--port", type=int, default=int(os.environ.get("MCP_PORT", "8000")))
    args = p.parse_args()

    if args.transport == "stdio":
        mcp.run()
    else:
        mcp.settings.host = args.host
        mcp.settings.port = args.port
        # DNS-rebinding protection defaults to localhost-only; disable it for
        # container-network DNS hostnames. See PLAN.md "Hosting" notes.
        if os.environ.get("MCP_DISABLE_DNS_REBINDING_PROTECTION") in {"1", "true", "yes"}:
            mcp.settings.transport_security.enable_dns_rebinding_protection = False
        mcp.run(transport=args.transport)


if __name__ == "__main__":
    main()
+127
@@ -0,0 +1,127 @@
"""Per-call usage telemetry: JSONL with daily rotation and retention.

Reusable as-is across products. Drop the import + `with TimedCall(...)`
into any tool body and the call gets logged with the tool name, args,
elapsed time, and any extra fields the tool sets via `_call.set(...)`.

The log file is `var/logs/usage.jsonl` by default (override with the
`USAGE_LOG_DIR` env). Rotation is daily; rotated files older than
`USAGE_LOG_KEEP_DAYS` (default 90) are deleted by the rotation check,
which runs on write at most once every 5 minutes.

Layout of one record (the `#` annotations are explanatory, not part of
the JSON):

    {
      "ts": "2026-05-22T13:14:15+00:00",
      "tool": "search_docs",
      "args": {"query": "...", "version": "10.9", "k": 10},
      "elapsed_ms": 142.5,
      "hits_returned": 7,        # optional, set by the tool
      "reranked": true,          # optional, set by the tool
      # ... any other key the tool sets via _call.set(...)
    }
"""
from __future__ import annotations
import json
import os
import time
import threading
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
USAGE_LOG_DIR = Path(os.environ.get("USAGE_LOG_DIR", "var/logs"))
USAGE_LOG_KEEP_DAYS = int(os.environ.get("USAGE_LOG_KEEP_DAYS", "90"))
# Single global lock to serialize writes from multiple request handlers.
# JSONL appends are atomic at the OS level for short records on most
# filesystems, but the lock is cheap and saves you from cross-platform
# surprises.
_lock = threading.Lock()
_last_rotation_check: float = 0.0
def _maybe_rotate() -> None:
    """Rotate usage.jsonl to usage.jsonl.<date> once the UTC date has rolled.

    `<date>` is the UTC date of the active file's last write (its mtime),
    which is normally yesterday. Cheap to call: the filesystem is touched
    at most once every 5 minutes; every other call returns immediately.
    """
    global _last_rotation_check
    now = time.time()
    if now - _last_rotation_check < 300:  # 5 min cap between checks
        return
    _last_rotation_check = now
    USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
    active = USAGE_LOG_DIR / "usage.jsonl"
    if active.exists():
        try:
            mtime = datetime.fromtimestamp(active.stat().st_mtime, tz=timezone.utc).date()
            today = datetime.now(timezone.utc).date()
            if mtime < today:
                rotated = USAGE_LOG_DIR / f"usage.jsonl.{mtime.isoformat()}"
                if not rotated.exists():
                    active.rename(rotated)
        except OSError:
            pass
    # Retention: delete usage.jsonl.YYYY-MM-DD files older than the
    # retention window. The active file is never deleted by this.
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=USAGE_LOG_KEEP_DAYS)
    for f in USAGE_LOG_DIR.glob("usage.jsonl.*"):
        try:
            datestamp = f.name.split(".", 2)[-1]
            if datetime.fromisoformat(datestamp).date() < cutoff:
                f.unlink()
        except (ValueError, OSError):
            continue
class TimedCall:
    """Context manager that captures one tool call's telemetry record.

    Usage:
        with TimedCall("search_docs", {"query": q, ...}) as call:
            ... do the work ...
            call.set(hits_returned=len(results), reranked=True)

    On exit, writes one JSONL record to usage.jsonl. Exceptions are
    captured into the `error` field; the exception is re-raised so
    the tool's caller sees the failure.
    """

    def __init__(self, tool: str, args: dict[str, Any]):
        self.tool = tool
        self.args = args
        self.extra: dict[str, Any] = {}
        self._t0: float = 0.0

    def set(self, **kwargs: Any) -> None:
        """Attach extra fields to the eventual telemetry record."""
        self.extra.update(kwargs)

    def __enter__(self) -> "TimedCall":
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        record: dict[str, Any] = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "tool": self.tool,
            "args": self.args,
            "elapsed_ms": round(elapsed_ms, 2),
        }
        if exc_type is not None:
            record["error"] = f"{exc_type.__name__}: {exc_val}"
        record.update(self.extra)
        # Rotate and append under the same lock so a concurrent writer
        # never races the rename of the active file.
        with _lock:
            _maybe_rotate()
            USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
            with open(USAGE_LOG_DIR / "usage.jsonl", "a", encoding="utf-8") as fh:
                # default=str keeps a non-JSON-serializable arg from raising
                # here and masking the tool's own exception.
                fh.write(json.dumps(record, separators=(",", ":"), default=str) + "\n")
        # Returning None (falsy) lets any in-flight exception propagate.
+4
@@ -0,0 +1,4 @@
{"query": "how to install <product> on Linux", "expected": [{"bundle_id": "Install.Linux.10.0", "page_id": "Installation.htm"}], "tags": ["install", "linux"]}
{"query": "configure database connection for high availability", "expected": [{"bundle_id": "Admin.10.0", "page_id": "HA_Setup.htm"}], "tags": ["ha", "config"]}
{"query": "API endpoint to list users", "expected": [{"bundle_id": "API.10.0", "page_id": "Users_API.htm"}], "tags": ["api"]}
{"query": "what changed between 10.0 and 10.1", "expected": [{"bundle_id": "Release_Notes.10.1", "page_id": "Whats_New.htm"}], "tags": ["release-notes"]}
+62
@@ -0,0 +1,62 @@
"""Retriever protocol + concrete implementations.
A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
so the eval harness can compare them apples-to-apples. Implement these
once at Phase 7 and reuse them across every retrieval change.
Each retriever returns a ranked list of (bundle_id, page_id) tuples
deduplicated to the page level (chunks within the same page collapse
to one entry; the highest-ranked chunk's position wins).
"""
from __future__ import annotations
from typing import Protocol, Iterable
class Retriever(Protocol):
    name: str

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        """Return up to k (bundle_id, page_id) tuples in rank order."""
        ...


def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
    """Take a stream of (bundle_id, page_id, chunk_ordinal) and return
    the first k unique pages in their first-seen order."""
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for bid, pid, _ord in chunk_ids:
        key = (bid, pid)
        if key in seen:
            continue
        seen.add(key)
        out.append(key)
        if len(out) >= k:
            break
    return out
# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
# in place. Each one is small (15-30 LOC). The eval harness imports
# from this module by class name.
#
# class DenseRetriever:
# name = "dense"
# def __init__(self, collection): self.col = collection
# def retrieve(self, query, k=10): ...
#
# class RerankedRetriever:
# name = "dense+rerank"
# def __init__(self, collection, rerank_url, pool=200): ...
# def retrieve(self, query, k=10): ...
#
# class BM25Retriever:
# name = "bm25"
# def __init__(self, bm25_index): ...
# def retrieve(self, query, k=10): ...
#
# class HybridRetriever:
# name = "bm25+dense+rrf"
# def __init__(self, dense, bm25, k_rrf=60): ...
# def retrieve(self, query, k=10): ...
+91
@@ -0,0 +1,91 @@
"""Run all retrievers against eval/queries.jsonl, emit a markdown report.
Metrics computed per retriever:
MRR — mean reciprocal rank of the FIRST expected page in the
ranked result list (0 if not in top-k).
Recall@K — fraction of expected pages that appear in top-K.
nDCG@K — discounted gain weighted by rank position.
The "right" number depends on what you're measuring. MRR tracks "the
first-line answer is correct"; Recall@K tracks "everything relevant
is there to draw from"; nDCG@K is a smoother combination of both.
For docs-RAG, MRR is usually the headline metric.
Usage:
python -m eval.run_eval \\
--queries eval/queries.jsonl \\
--k 5 \\
--output eval/results/baseline.md
"""
from __future__ import annotations
import argparse
import json
import math
import time
from pathlib import Path
from typing import Iterable
def load_queries(path: Path) -> list[dict]:
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def reciprocal_rank(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]]) -> float:
    # No cutoff is applied here; pass `retrieved` already truncated to
    # top-k if the score should be 0 for hits beyond k.
    expected_set = set(expected)
    for i, page in enumerate(retrieved, start=1):
        if page in expected_set:
            return 1.0 / i
    return 0.0


def recall_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
    if not expected:
        return 0.0
    retrieved_set = set(retrieved[:k])
    hits = sum(1 for e in expected if e in retrieved_set)
    return hits / len(expected)


def ndcg_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
    expected_set = set(expected)
    dcg = 0.0
    for i, page in enumerate(retrieved[:k], start=1):
        if page in expected_set:
            dcg += 1.0 / math.log2(i + 1)
    # Ideal DCG: every expected page in the top positions.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
    return dcg / idcg if idcg else 0.0
def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--queries", type=Path, default=Path("eval/queries.jsonl"))
    p.add_argument("--k", type=int, default=5)
    p.add_argument("--output", type=Path, default=Path("eval/results/baseline.md"))
    args = p.parse_args()

    if not args.queries.exists():
        print(f"queries file not found: {args.queries}")
        print("hint: copy eval/queries.jsonl.example and edit")
        return 1

    queries = load_queries(args.queries)
    print(f"loaded {len(queries)} queries")

    # TODO Phase 7: instantiate the retrievers you implemented in
    # eval/retrievers.py and run each one against each query.
    # Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
    # markdown table to args.output. Commit the file alongside the
    # PR that changes retrieval.
    raise NotImplementedError(
        "Wire up the retrievers in eval/retrievers.py first, then "
        "fill in this evaluation loop. See PLAN.md Phase 7."
    )


if __name__ == "__main__":
    raise SystemExit(main())
+277
@@ -0,0 +1,277 @@
"""SQLite FTS5-backed BM25 retrieval over the same chunks Chroma indexes.

Hybrid retrieval (BM25 + dense + Reciprocal Rank Fusion) addresses a
limit of single-tower dense embeddings: when a query has specific
technical terms (filenames, language names, error codes, API paths),
the dense embedding often fails to bridge from the query to a short
code-focused chunk. The chunk loses to the much larger crowd of
prose chunks that semantically match the query topic.

BM25 handles this directly. Lexical overlap on rare terms ("python",
"create_vpg.py", "PROTECTED_SITE_ID", "applyUpgrade") scores those
chunks high. Fused with the dense ranking via RRF, the hybrid result
has been better than either retriever alone on the queries we've seen
fail.

Why SQLite FTS5:
  - `sqlite3` is in the stdlib. Zero new deps (FTS5 must be compiled
    into the underlying SQLite library, which typical CPython builds
    include).
  - On-disk. Same persistence model as Chroma: Docker COPY the dir,
    `rag.index --rebuild` regenerates from corpus.
  - Built-in `bm25()` ranking function. No knobs to tune that matter
    for our use case (k1=1.2, b=0.75 defaults are fine).
  - Builds 70k+ chunks in seconds. Faster than the Chroma rebuild's
    embedding step by 100×, so it adds basically nothing to the
    full-rebuild cycle.

Schema is two tables to keep filtering clean. FTS5 doesn't filter
nicely on its own columns, so metadata lives in a plain table and the
FTS5 table is joined to it by rowid (build() inserts into both in the
same order so the rowids line up):

    CREATE TABLE chunks_meta (
        rowid INTEGER PRIMARY KEY AUTOINCREMENT,
        id TEXT UNIQUE NOT NULL,
        bundle_id TEXT, page_id TEXT, version TEXT,
        platform TEXT, product TEXT, ordinal INTEGER
    );
    CREATE VIRTUAL TABLE chunks_fts USING fts5(
        text,
        tokenize = 'porter unicode61 remove_diacritics 2'
    );

Queries (MATCH needs the unaliased FTS5 table name on its left):

    SELECT m.id, bm25(chunks_fts) AS score
    FROM chunks_fts
    JOIN chunks_meta m ON m.rowid = chunks_fts.rowid
    WHERE chunks_fts MATCH ?
      AND m.version = ?          -- optional metadata filter
    ORDER BY bm25(chunks_fts)    -- lower = better in FTS5
    LIMIT ?;
"""
from __future__ import annotations
import logging
import re
import sqlite3
from pathlib import Path
from typing import Any
log = logging.getLogger(__name__)
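# Default location: bm25/<product>_docs.db at the repo root, next to chroma/.
# The file name is derived from PRODUCT_NAME (same env var and fallback as
# rag/index.py) so the default read path matches what the indexer writes.
import os

ROOT = Path(__file__).resolve().parent.parent
DEFAULT_DB_DIR = ROOT / "bm25"
DEFAULT_DB_NAME = f"{os.environ.get('PRODUCT_NAME', 'myproduct')}_docs.db"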
# Columns we expose as filterable metadata. Mirrors what _build_where in
# docs_mcp/server.py accepts so the same filter dicts work for both
# Chroma and BM25 without per-retriever translation in the caller.
FILTER_COLUMNS = ("bundle_id", "page_id", "version", "platform", "product", "ordinal")
# Allowlist tokenizer for free-text queries. FTS5's parser chokes on lots
# of punctuation we routinely see in user queries (".10.9", "?", "VPG's",
# em-dash, etc.). Rather than blocklist every operator, just keep
# alphanumerics + a few separators and replace everything else with a
# space. This loses the ability to phrase-search ("exact match") but we
# don't expose that to users anyway — they ask natural-language questions
# and want the answer, not a Boolean DSL.
_KEEP_RE = re.compile(r"[^A-Za-z0-9_\s]")
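# FTS5 reserves these Boolean operator KEYWORDS at the token level. The
# keywords are case-sensitive, so only the uppercase forms are stripped.
# Stripping them keeps a bare "AND", "OR", "NOT", or "NEAR" in a user
# query from being parsed as an operator, which would either change the
# match semantics or raise a syntax error.
_BOOLEAN_KW_RE = re.compile(r"(?<!\w)(AND|OR|NOT|NEAR)(?!\w)")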
def _sanitize_query(text: str) -> str:
    """Reduce a natural-language query to an FTS5 OR-of-tokens query.

    Three transformations:
      1. Every character other than ASCII letters, digits, underscore,
         and whitespace becomes a space (drops punctuation; "10.9?"
         becomes "10 9"). Lets us handle versions, parens, question
         marks, etc. without inviting FTS5 parse errors.
      2. Boolean keywords stripped (FTS5 reserves AND/OR/NOT/NEAR).
      3. Tokens explicitly OR'd. FTS5's default is AND-of-tokens; for
         any non-trivial natural-language query that usually means zero
         hits (no chunk contains every word). OR semantics is what we
         want: BM25 already ranks documents containing more query terms
         higher, so the top of the list stays precise while recall
         improves.
    """
    cleaned = _KEEP_RE.sub(" ", text)
    cleaned = _BOOLEAN_KW_RE.sub(" ", cleaned)
    tokens = cleaned.split()
    if not tokens:
        return ""
    return " OR ".join(tokens)
def _where_to_sql(where: dict | None) -> tuple[str, list[Any]]:
    """Translate a Chroma-shaped filter dict into a SQL fragment + params.

    Accepts the same shapes ``docs_mcp.server._build_where`` produces:

        None                       → ("", [])
        {"version": "10.9"}        → ("AND m.version = ?", ["10.9"])
        {"$and": [{...}, {...}]}   → ("AND m.X = ? AND m.Y = ?", [...])

    Unknown keys are silently dropped (defensive: better to over-match
    than to crash on a filter we don't know).
    """
    if not where:
        return "", []
    parts: list[str] = []
    params: list[Any] = []

    def _emit_eq(cond: dict[str, Any]) -> None:
        for k, v in cond.items():
            # Column names come only from the FILTER_COLUMNS allowlist;
            # values always go through bound parameters.
            if k in FILTER_COLUMNS:
                parts.append(f"m.{k} = ?")
                params.append(v)

    if "$and" in where:
        for sub in where["$and"]:
            _emit_eq(sub)
    else:
        _emit_eq(where)
    if not parts:
        return "", []
    return "AND " + " AND ".join(parts), params
class BM25Index:
    """Thin wrapper around an FTS5-backed sqlite db.

    Single-writer model. Reads are connection-per-call (sqlite handles
    concurrency through file locks; for our read-heavy workload that's
    fine and avoids cross-thread connection sharing issues with the MCP
    server's request handlers).
    """

    def __init__(self, db_path: Path | None = None):
        self.db_path = Path(db_path) if db_path else (DEFAULT_DB_DIR / DEFAULT_DB_NAME)

    # -- build ----------------------------------------------------------
    def build(self, records: list[dict]) -> int:
        """Rebuild the index from scratch from `records`.

        `records` is the list built from ``rag.index.page_records``:
        ``[{"id": ..., "text": ..., "metadata": {...}}, ...]``. Bulk
        insert wrapped in a transaction; single-digit seconds for the
        full 73k-chunk corpus.
        """
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        # Drop and recreate. Idempotent rebuild.
        if self.db_path.exists():
            self.db_path.unlink()
        # `with con` only scopes the transaction (commit or rollback); the
        # connection is closed explicitly so the file handle is released.
        con = sqlite3.connect(self.db_path)
        try:
            with con:
                con.executescript(self._schema_sql())
                con.executemany(
                    "INSERT INTO chunks_meta (id, bundle_id, page_id, version, "
                    "platform, product, ordinal) VALUES (?, ?, ?, ?, ?, ?, ?)",
                    [
                        (
                            r["id"],
                            r["metadata"].get("bundle_id") or "",
                            r["metadata"].get("page_id") or "",
                            r["metadata"].get("version") or "",
                            r["metadata"].get("platform") or "",
                            r["metadata"].get("product") or "",
                            int(r["metadata"].get("ordinal") or 0),
                        )
                        for r in records
                    ],
                )
                # Populate the FTS5 table by rowid. The db file was just
                # recreated, so chunks_meta rowids run 1..N in insertion
                # order and `i + 1` lines up with them.
                con.executemany(
                    "INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
                    [(i + 1, r["text"]) for i, r in enumerate(records)],
                )
        finally:
            con.close()
        log.info("bm25: indexed %d chunks → %s", len(records), self.db_path)
        return len(records)
    # -- query ----------------------------------------------------------
    def query(
        self,
        text: str,
        n: int = 200,
        where: dict | None = None,
    ) -> list[tuple[str, float]]:
        """Return up to `n` (chunk_id, bm25_score) pairs, lowest score first.

        FTS5's bm25() returns NEGATIVE numbers: more relevant docs have
        smaller (more negative) scores. We order ASC so the first row is
        the most relevant. Callers that need a "rank" should enumerate
        the returned list.
        """
        sanitized = _sanitize_query(text)
        if not sanitized:
            return []
        # sqlite3.connect() would create an empty file at a missing path,
        # which would make exists() report True afterwards.
        if not self.exists():
            log.warning("bm25 db missing at %s", self.db_path)
            return []
        where_sql, params = _where_to_sql(where)
        # FTS5 MATCH wants the unaliased table name on its left, so we use
        # chunks_fts (no alias) and JOIN by rowid against chunks_meta.
        sql = (
            "SELECT m.id, bm25(chunks_fts) AS score "
            "FROM chunks_fts "
            "JOIN chunks_meta m ON m.rowid = chunks_fts.rowid "
            f"WHERE chunks_fts MATCH ? {where_sql} "
            "ORDER BY bm25(chunks_fts) "
            "LIMIT ?"
        )
        try:
            con = sqlite3.connect(self.db_path)
            try:
                cur = con.execute(sql, [sanitized, *params, n])
                return [(row[0], float(row[1])) for row in cur.fetchall()]
            finally:
                con.close()
        except sqlite3.OperationalError as e:
            # FTS5 syntax error (rare after sanitization) or unreadable db.
            # Caller decides whether to fall back to dense-only.
            log.warning("bm25 query failed (%s); query=%r", e, sanitized[:80])
            return []

    def exists(self) -> bool:
        """Cheap probe: does the index file exist on disk?"""
        return self.db_path.exists()

    def count(self) -> int:
        """Number of chunks indexed. 0 if the db is missing or empty."""
        if not self.exists():
            return 0
        try:
            con = sqlite3.connect(self.db_path)
            try:
                return con.execute("SELECT COUNT(*) FROM chunks_meta").fetchone()[0]
            finally:
                con.close()
        except sqlite3.OperationalError:
            return 0

    # -- schema ---------------------------------------------------------
    @staticmethod
    def _schema_sql() -> str:
        return """
        CREATE TABLE chunks_meta (
            rowid INTEGER PRIMARY KEY AUTOINCREMENT,
            id TEXT UNIQUE NOT NULL,
            bundle_id TEXT,
            page_id TEXT,
            version TEXT,
            platform TEXT,
            product TEXT,
            ordinal INTEGER
        );
        CREATE INDEX idx_meta_version ON chunks_meta(version);
        CREATE INDEX idx_meta_platform ON chunks_meta(platform);
        CREATE INDEX idx_meta_bundle ON chunks_meta(bundle_id);
        CREATE VIRTUAL TABLE chunks_fts USING fts5(
            text,
            tokenize = 'porter unicode61 remove_diacritics 2'
        );
        """
+126
@@ -0,0 +1,126 @@
"""Markdown chunker — paragraph-aware, ~400-600 token target.
Adjust the chunking strategy per product if your page format differs
significantly from prose. The output shape (id, text, metadata) is
fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
change that.
The key knob you'll tune per product is chunk-0. Dense retrieval lands
on chunk 0 first for most queries. Make it a synthetic chunk built
from:
- the page title (as natural-language H1)
- a 1-sentence task description (you'll have to generate this — for
pages that already have a "## Overview" or "## Introduction" the
first sentence usually works)
- a keyword bag of important terms (filenames, API names, error
codes — the rare technical tokens that BM25 lights up on)
Without a rich chunk 0, dense retrieval gets dominated by the much
larger prose body, and short pages (script examples, reference cards)
get buried.
"""
from __future__ import annotations
import re
from typing import Iterator
# Approximate token estimate from char count. Tunable — set per
# embedder if the default 4 chars/token is wrong.
CHARS_PER_TOKEN = 4
TARGET_TOKENS = 500
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
def estimate_tokens(text: str) -> int:
return max(1, len(text) // CHARS_PER_TOKEN)
def split_paragraphs(md: str) -> list[str]:
    """Split markdown into paragraph-ish blocks.

    Keeps fenced code blocks together (don't slice through ```).
    Headings start new paragraphs.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            current.append(line)
            continue
        if in_fence:
            current.append(line)
            continue
        if stripped.startswith("#"):
            if current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
            continue
        # A blank line closes the block in progress. (The original extra
        # check on a trailing "\n\n" was always true after .strip().)
        if not stripped and current:
            blocks.append("".join(current).strip())
            current = []
            continue
        current.append(line)
    if current:
        blocks.append("".join(current).strip())
    return [b for b in blocks if b]
def chunks_from_page(
    text: str,
    page_id: str,
    metadata: dict,
) -> Iterator[dict]:
    """Yield chunk dicts ready for index.py to upsert.

    The synthetic chunk 0 is the per-product customization point. The
    default below is a simple title + body-first-paragraph; rewrite
    for richer retrieval signal (see module docstring).
    """
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return

    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
    title = metadata.get("title") or page_id
    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
    chunk0_body = (
        f"# {title}\n\n"
        f"{first_para[:300]}"
        # TODO per product: append a keyword bag here (filenames,
        # API names, error codes) for BM25 + dense joint coverage.
    )
    yield {
        "id": f"{metadata['bundle_id']}::{page_id}::0",
        "text": chunk0_body,
        "metadata": {**metadata, "ordinal": 0},
    }

    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
    # A single paragraph longer than TARGET_CHARS is emitted whole.
    ordinal = 1
    buf: list[str] = []
    buf_chars = 0
    for p in paragraphs:
        if buf_chars + len(p) > TARGET_CHARS and buf:
            yield {
                "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
                "text": "\n\n".join(buf),
                "metadata": {**metadata, "ordinal": ordinal},
            }
            ordinal += 1
            buf = []
            buf_chars = 0
        buf.append(p)
        buf_chars += len(p)
    if buf:
        yield {
            "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
            "text": "\n\n".join(buf),
            "metadata": {**metadata, "ordinal": ordinal},
        }
+72
@@ -0,0 +1,72 @@
"""Embedding function for Chroma — Ollama-hosted nomic-embed-text by default.
Swappable: implement the same `embedding_function()` interface returning
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
Defaults (override via env):
OLLAMA_URL one or more comma-separated URLs (load-balanced)
EMBED_MODEL model name; default 'nomic-embed-text'
EMBED_DIM expected embedding dim; default 768 (nomic-embed-text)
"""
from __future__ import annotations
import os
import logging
from typing import Any
import httpx
from chromadb import EmbeddingFunction, Documents, Embeddings
log = logging.getLogger(__name__)
OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
"http://localhost:11434").split(",") if u.strip()]
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
class OllamaEmbeddings(EmbeddingFunction):
    """Calls /api/embed across N Ollama endpoints, naive round-robin.

    For indexing throughput on multiple GPUs, run one Ollama container
    per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
    in OLLAMA_URL; the embedder picks the next endpoint per batch.
    """

    def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
        if not urls:
            # An empty list would otherwise fail later with a
            # ZeroDivisionError in the round-robin modulo.
            raise ValueError("OLLAMA_URL resolved to an empty endpoint list")
        self.urls = urls
        self.model = model
        self._next = 0

    def __call__(self, input: Documents) -> Embeddings:
        url = self.urls[self._next % len(self.urls)]
        self._next += 1
        with httpx.Client(timeout=300) as c:
            r = c.post(f"{url}/api/embed",
                       json={"model": self.model, "input": list(input)})
            r.raise_for_status()
            data = r.json()
        embeddings = data.get("embeddings") or []
        # Fail here rather than hand Chroma a short or empty batch.
        if len(embeddings) != len(input):
            raise RuntimeError(
                f"{url}/api/embed returned {len(embeddings)} embeddings "
                f"for {len(input)} inputs"
            )
        return embeddings

    def name(self) -> str:  # newer chromadb requires this
        return f"ollama:{self.model}"

    @staticmethod
    def build_from_config(config: dict) -> "OllamaEmbeddings":  # newer chromadb
        return OllamaEmbeddings(
            urls=config.get("urls", OLLAMA_URLS),
            model=config.get("model", EMBED_MODEL),
        )

    def get_config(self) -> dict:  # newer chromadb
        return {"urls": self.urls, "model": self.model}

    def default_space(self) -> str:
        return "cosine"

    def supported_spaces(self) -> list[str]:
        return ["cosine", "l2", "ip"]


def embedding_function() -> EmbeddingFunction:
    return OllamaEmbeddings()
+134
@@ -0,0 +1,134 @@
"""Build Chroma (and optionally BM25) indexes from corpus on disk.

Reads `corpus/<bundle>/<page>.{md,json}` and chunks each page. With
--rebuild, drops + recreates the Chroma collection (clean state),
upserts every chunk, then rebuilds the BM25 index. With --bm25-only,
skips Chroma and rebuilds only the FTS5 index, which is useful for
fast iteration when chunking didn't change. With neither flag the
script loads the corpus and exits without writing anything.
"""
from __future__ import annotations

import argparse
import json
import logging
import os
import time
from pathlib import Path
from typing import Iterator

import chromadb
from chromadb.config import Settings

from .chunk import chunks_from_page
from .embeddings import embedding_function

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
CHROMA_DIR = ROOT / "chroma"

# Collection name convention: <product>_docs. Override via env if needed.
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
COLLECTION = f"{PRODUCT_NAME}_docs"
def page_records() -> Iterator[dict]:
    """Walk corpus/, yield chunks for every page."""
    if not CORPUS.exists():
        log.error("corpus/ doesn't exist; run the scraper first")
        return
    for bundle_dir in sorted(CORPUS.iterdir()):
        if not bundle_dir.is_dir() or bundle_dir.name.startswith("."):
            continue
        for md_path in sorted(bundle_dir.glob("*.md")):
            page_id = md_path.stem
            sidecar = md_path.with_suffix(".json")
            if not sidecar.exists():
                log.warning("skipping %s: no JSON sidecar", md_path)
                continue
            # Read as UTF-8 explicitly so the result does not depend on
            # the platform's default encoding.
            md = md_path.read_text(encoding="utf-8")
            meta = json.loads(sidecar.read_text(encoding="utf-8"))
            # Surface common filter fields at the chunk-metadata level
            # so Chroma's `where` filter can use them.
            base_meta = {
                "bundle_id": bundle_dir.name,
                "page_id": page_id,
                "title": meta.get("title") or "",
                "version": meta.get("version") or "",
                "platform": meta.get("platform") or "",
                "product": meta.get("product") or "",
            }
            yield from chunks_from_page(md, page_id, base_meta)
def upsert_to_chroma(records: list[dict]) -> int:
    client = chromadb.PersistentClient(
        path=str(CHROMA_DIR),
        settings=Settings(anonymized_telemetry=False),
    )
    # Drop + recreate for --rebuild semantics. The delete is allowed to
    # fail because the collection may not exist yet.
    try:
        client.delete_collection(COLLECTION)
    except Exception:
        pass
    col = client.create_collection(COLLECTION, embedding_function=embedding_function())

    BATCH = 64
    total = 0
    for i in range(0, len(records), BATCH):
        chunk = records[i:i + BATCH]
        col.upsert(
            ids=[r["id"] for r in chunk],
            documents=[r["text"] for r in chunk],
            metadatas=[r["metadata"] for r in chunk],
        )
        total += len(chunk)
        log.info("upserted %d / %d chunks", total, len(records))
    return total
def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--rebuild", action="store_true",
                   help="Drop and recreate the Chroma collection.")
    p.add_argument("--bm25-only", action="store_true",
                   help="Rebuild only the BM25 index, skip Chroma.")
    p.add_argument("--bm25-db", type=Path,
                   default=ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db",
                   help="Path to the BM25 sqlite db.")
    args = p.parse_args()

    log.info("reading corpus from %s", CORPUS)
    t0 = time.time()
    records = list(page_records())
    log.info("loaded %d chunks in %.1fs", len(records), time.time() - t0)

    if args.bm25_only:
        from .bm25 import BM25Index
        log.info("--bm25-only: building FTS5 only")
        BM25Index(args.bm25_db).build(records)
        return 0

    if not args.rebuild:
        log.info("no --rebuild; nothing to do. (Use --rebuild to upsert.)")
        return 0

    t_c = time.time()
    n = upsert_to_chroma(records)
    log.info("chroma: %d chunks in %.1fs", n, time.time() - t_c)

    # Build BM25 too (see PLAN.md Phase 8). Safe to remove this block
    # for products that don't need hybrid retrieval.
    try:
        from .bm25 import BM25Index
        t_b = time.time()
        BM25Index(args.bm25_db).build(records)
        log.info("bm25 done in %.1fs", time.time() - t_b)
    except ImportError:
        log.info("rag.bm25 not available; skipping BM25 build")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
+19
@@ -0,0 +1,19 @@
# MCP server
mcp[fastmcp]>=1.0.0
pydantic>=2.0
httpx>=0.27
# Vector store + embeddings
chromadb>=0.5.0
ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not
# Scraping (Phase 1; adjust per product)
beautifulsoup4>=4.12
requests>=2.31
# playwright>=1.40 # uncomment if you need headless browser fallback
# Evaluation
numpy>=1.26
# Dev / utility
python-dateutil>=2.8
+61
@@ -0,0 +1,61 @@
# scrape/
Per-vendor seed catalog scrapers + the runner that dispatches to
them. Each source lives in `scrape/sources/<name>.py` with a `main()`
entrypoint. The runner is a thin shim:
```bash
python -m scrape.runner --source bayer_seeds --force
python -m scrape.runner --source golden_harvest --limit 20
python -m scrape.runner --all # only GREEN sources
```
## Output layout
Each scraper writes:
- `corpus/<source>/<source_key>.md` — LLM-visible body (chunk_0
preamble + the variety's marketing + agronomic narrative)
- `corpus/<source>/<source_key>.json` — sidecar metadata (per
CLAUDE.md's canonical schema)
`source_key` is a stable per-vendor slug, typically `<brand>-<sku>`
lowercased, e.g. `dekalb-dkc62-08rib`. Stability matters: it's the
join key the MCP uses for `get_page(bundle_id, page_id)`, where
`bundle_id` is the source directory name and `page_id` is the
`source_key`.
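
As a rough illustration of the slug rule above, the sketch below lowercases a brand and SKU and collapses anything non-alphanumeric into a single hyphen. The helper name `make_source_key` and the exact normalization are assumptions made for this example; individual scrapers may need vendor-specific handling.

```python
import re


def make_source_key(brand: str, sku: str) -> str:
    """Hypothetical helper: build a `<brand>-<sku>` slug, lowercased."""
    raw = f"{brand}-{sku}".lower()
    # Collapse any run of characters outside [a-z0-9] into one hyphen,
    # then trim hyphens left dangling at either end.
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


print(make_source_key("DEKALB", "DKC62-08RIB"))  # dekalb-dkc62-08rib
```

Keeping the function pure (no timestamps, no counters) is what makes the key stable across re-scrapes.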
## Sources
| Source | Module | Verdict | Notes |
|---|---|---|---|
| `bayer_seeds` | `bayer_seeds.py` | 🟢 | DEKALB + Asgrow + WestBred, ~475 varieties |
| `golden_harvest` | `golden_harvest.py` | 🟢 | ~175 varieties, 9-to-1 disease scale (reverse) |
| `nk` | `nk.py` | 🟢 | 29 varieties, ratings in CDN PDFs |
| `agripro` | `agripro.py` | 🟢 | 24 wheat varieties |
| `becks_pfr` | `becks_pfr.py` | 🟡 | 2,089 research docs via public Sanity GROQ |
| `becks_products` | `becks_products.py` | 🟡 | 860 products, identity-only (SeedIQ-gated) |
Pioneer is intentionally absent — see `CLAUDE.md` and the curated
Pioneer fallback in `docs_mcp/lessons.md`.
## Tips
- **Sniff before you scrape.** Most catalogs are SPAs that call a
backend API. The recon notes in
`~/.claude/projects/-home-justin/memory/reference_seed_vendor_recon.md`
already capture the endpoints; if you find new ones, update that file.
- **Idempotent re-scrapes.** Without `--force`, skip pages already
on disk. With `--force`, re-fetch everything — that's the
monthly cron mode.
- **Respect the portals.** Backoff on 429s. Set a recognizable
user-agent (`seed-mcp-scraper/<version>`).
- **Normalize at chunk time, not at scrape time.** The chunker
(Phase 2) handles the 9-to-1 → 1-9 disease-scale flip for Golden
Harvest, NOT this scraper. Sidecar JSON should preserve the
vendor's raw values + a `_scale_direction` field; the chunker
reads that and normalizes the markdown body.
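
The chunk-time normalization in the last tip could look roughly like the sketch below. It assumes the reversed vendor scale rates 1 as best, and the `_scale_direction` values (`"9_is_best"`, `"1_is_best"`) are hypothetical; the sidecar schema in `CLAUDE.md` is the authority on the real field values.

```python
from __future__ import annotations


def normalize_ratings(ratings: dict[str, int], scale_direction: str) -> dict[str, int]:
    """Return ratings on the corpus-wide 1-9 scale where 9 = best."""
    if scale_direction == "9_is_best":
        # Already in the canonical direction; copy so the raw dict stays untouched.
        return dict(ratings)
    if scale_direction == "1_is_best":
        # On a 1-9 scale, flipping direction maps 1 -> 9, 5 -> 5, 9 -> 1.
        return {name: 10 - value for name, value in ratings.items()}
    raise ValueError(f"unknown scale direction: {scale_direction!r}")
```

Because the sidecar keeps the vendor's raw values, the flip can be re-run or corrected later without re-scraping.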
## changelog.py
Reusable as-is from the template. Walks `git diff --name-status`
output for the commit summary, and `git log` for the digest history
(Phase 13).
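
For orientation, a consumer of the digest history might read the JSONL file along these lines. This is a sketch under assumptions: `load_recent_records` is a hypothetical name, and the only record fields relied on are the `timestamp` (ISO 8601) and whatever else `changelog.py` writes per commit.

```python
from __future__ import annotations

import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def load_recent_records(history_path: Path, days: int,
                        now: datetime | None = None) -> list[dict]:
    """Return history records whose timestamp falls within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    records: list[dict] = []
    if not history_path.exists():
        return records
    for line in history_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        ts = datetime.fromisoformat(rec["timestamp"])
        if ts.tzinfo is None:
            # Treat offset-less timestamps as UTC (assumption).
            ts = ts.replace(tzinfo=timezone.utc)
        if ts >= cutoff:
            records.append(rec)
    return records
```

Reading a committed file rather than shelling out to git matches the constraint that the production container has no git available.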
@@ -0,0 +1,272 @@
"""Generate a summary of corpus changes.
Two output shapes for two consumers:
1. Human-readable text (default) — written into the weekly-refresh
commit message so the commit log is greppable for *"what changed
this week"* instead of *"806 files changed"*.
2. Structured JSON (``--json``) and rolling JSONL history
(``--history-out``) — consumed by the ``weekly_digest`` MCP tool.
Computed in CI and committed at ``corpus/.digest/history.jsonl``;
the tool reads it at runtime because the prod container is a
static filesystem COPY with no git available.
Usage:
# Commit-message helper (existing behavior — unchanged)
python -m scrape.changelog [--cached] [--ref REF]
# One-shot JSON for the current diff range
python -m scrape.changelog --cached --json
# Build / refresh the digest history file (CI use)
python -m scrape.changelog --history-out corpus/.digest/history.jsonl \\
--history-days 120
The history walker only emits commits that change ``.md`` or ``.json``
files under ``corpus/<bundle>/``; it skips pure code/CI commits,
including commits that touch only ``bundles.json``. Each emitted record
carries the commit's short sha, ISO timestamp, subject, and the same
structured summary the ``--json`` path produces, so the consumer can
treat history records and one-shot summaries interchangeably.
"""
from __future__ import annotations
import argparse
import json
import subprocess
import sys
from collections import defaultdict
from typing import Any
def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True)
def summarize_diff(diff_output: str) -> dict[str, Any]:
"""Parse ``git diff --name-status`` output into a structured summary.
Pure function (no IO, no git calls) so the same logic is exercised
by the human-readable, JSON-one-shot, and history-walking paths.
Returns a dict with:
md_count int — total .md files changed
json_count int — total .json sidecars changed
content_bundles dict — {bundle_id: [page_id_without_.md, ...]}
Only bundles where at least one .md
file moved. Lists are in the order
git emitted them.
json_only_bundles list[str] — bundles whose ONLY change was sidecar
drift (no .md changes). Sorted.
new_bundles list[str] — bundles whose first .md was Added
in this diff. Sorted.
other_files list[str] — any non-corpus path mentioned in the
diff, as ``"STATUS path"`` strings.
"""
md_changes: dict[str, list[str]] = defaultdict(list)
json_only_bundles: set[str] = set()
new_bundles: set[str] = set()
md_count = json_count = 0
other_files: list[str] = []
for line in diff_output.splitlines():
if not line.strip():
continue
# status<TAB>path (or status<TAB>old<TAB>new for renames; we take
# the post-rename path as the canonical location).
parts = line.split("\t")
status, path = parts[0], parts[-1]
if not path.startswith("corpus/"):
other_files.append(f"{status} {path}")
continue
segs = path.split("/", 2)
if len(segs) < 3:
# corpus/<filename> with no bundle dir — skip.
continue
_, bundle, page = segs
if page.endswith(".md"):
md_changes[bundle].append(page[:-3])
md_count += 1
if status == "A":
new_bundles.add(bundle)
elif page.endswith(".json"):
json_count += 1
json_only_bundles.add(bundle)
# A bundle counts as "content-changing" if it had any .md edit. Sidecar-
# only drift goes in the separate bucket so the commit message doesn't
# report timestamp churn as if it were real edits.
content_bundles_set = set(md_changes)
drift_only = sorted(json_only_bundles - content_bundles_set)
return {
"md_count": md_count,
"json_count": json_count,
"content_bundles": dict(md_changes), # cast back to plain dict for JSON
"json_only_bundles": drift_only,
"new_bundles": sorted(new_bundles),
"other_files": other_files,
}
def render_human(summary: dict[str, Any]) -> str:
    """Format a summary dict as the multi-line commit-message text.

    Matches the historical output exactly so existing commit-message
    tooling and downstream readers don't have to change.
    """
    lines: list[str] = []
    content_bundles = sorted(summary["content_bundles"])
    md_count = summary["md_count"]
    json_count = summary["json_count"]
    new_bundles = set(summary["new_bundles"])
    drift_only = summary["json_only_bundles"]
    other_files = summary["other_files"]
    lines.append(f"{md_count} content change(s) across {len(content_bundles)} bundle(s)")
    lines.append(f"{json_count} sidecar metadata update(s)")
    if new_bundles:
        lines.append(f"{len(new_bundles)} new bundle(s) added")
    if other_files:
        lines.append(f"{len(other_files)} other file change(s)")
    if content_bundles:
        lines.append("")
        lines.append("Bundles with content changes:")
        for b in content_bundles:
            pages = summary["content_bundles"][b]
            tag = " (NEW)" if b in new_bundles else ""
            lines.append(f" {b}{tag}: {len(pages)} page(s)")
            # Show at most five pages per bundle to keep the message short.
            for p in pages[:5]:
                lines.append(f" - {p}")
            if len(pages) > 5:
                lines.append(f" ... and {len(pages) - 5} more")
    if drift_only:
        lines.append("")
        head = ", ".join(drift_only[:10])
        # Mark the list as truncated when more than 10 bundles drifted.
        suffix = ", ..." if len(drift_only) > 10 else ""
        lines.append(f"Bundles with sidecar-only drift ({len(drift_only)}): {head}{suffix}")
    return "\n".join(lines)
def walk_history(history_days: int) -> list[dict[str, Any]]:
    """Walk recent corpus-touching commits, emit one summary per commit.

    Uses ``git log --first-parent main`` to keep the rolling
    weekly-refresh line clean of branch-merge noise. Only commits that
    change ``.md`` or ``.json`` files under ``corpus/<bundle>/`` are
    emitted; pure code commits are skipped (they have nothing to digest).

    Each record:
        {
          "sha": "<short sha>",
          "timestamp": "<ISO 8601 committer date, with UTC offset>",
          "subject": "<commit subject line>",
          ... + every field from summarize_diff()
        }
    """
# Find candidate commits. --first-parent keeps the linear refresh history
# on main and ignores branch-side merges. We still need to filter by what
# the commit actually touched, because non-corpus commits can land on
# main (PR merges for code, CI tweaks, etc.).
raw = git(
"log",
f"--since={history_days} days ago",
"--first-parent",
"main",
"--pretty=format:%H%x09%cI%x09%s",
)
records: list[dict[str, Any]] = []
for line in raw.splitlines():
if not line.strip():
continue
parts = line.split("\t", 2)
if len(parts) < 3:
continue
sha, ts, subject = parts
# What did this commit actually touch? Cheap: just the name-status diff
# against its first parent. Empty stdout = commit didn't change any
# files we care about. Root commits (no parent) error out — suppress
# the stderr noise and skip them.
try:
diff = subprocess.check_output(
["git", "diff", "--name-status", f"{sha}^..{sha}"],
text=True,
stderr=subprocess.DEVNULL,
)
except subprocess.CalledProcessError:
continue
if not diff.strip():
continue
summary = summarize_diff(diff)
# Skip pure code commits — only emit records that have actual corpus
# content motion. This is what makes the history "interesting" for
# the weekly digest.
if summary["md_count"] == 0 and summary["json_count"] == 0 and not summary["new_bundles"]:
continue
records.append({
"sha": sha[:12],
"timestamp": ts,
"subject": subject,
**summary,
})
return records
def main() -> int:
p = argparse.ArgumentParser(description=__doc__)
p.add_argument("--cached", action="store_true",
help="Summarize staged changes instead of a ref range.")
p.add_argument("--ref", default="HEAD^..HEAD",
help="Diff range to summarize (default: HEAD^..HEAD).")
p.add_argument("--json", dest="as_json", action="store_true",
help="Emit one JSON object instead of the human-readable form.")
p.add_argument("--history-out", metavar="PATH",
help="Walk recent corpus-touching commits and write a "
"JSONL history file at PATH. Overwrites if it exists. "
"Implies the history walker; --cached/--ref are ignored.")
p.add_argument("--history-days", type=int, default=120,
help="How far back the history walker looks (default 120).")
args = p.parse_args()
# History-walker path: build the JSONL file consumed by the
# weekly_digest MCP tool, then exit. CI uses this.
if args.history_out:
records = walk_history(args.history_days)
# Sort by timestamp ascending so the file is roughly stable
# across rebuilds (commits within a single run could otherwise
# depend on git log default ordering).
records.sort(key=lambda r: r["timestamp"])
with open(args.history_out, "w") as fh:
for rec in records:
fh.write(json.dumps(rec, separators=(",", ":")) + "\n")
# Brief stdout signal for CI logs — easy to spot in the workflow run.
print(f"wrote {len(records)} commit record(s) to {args.history_out} "
f"covering up to {args.history_days} days")
return 0
# One-shot summary path. Unchanged behavior for --cached / --ref.
if args.cached:
diff_args = ["diff", "--name-status", "--cached"]
else:
diff_args = ["diff", "--name-status", args.ref]
diff = git(*diff_args)
summary = summarize_diff(diff)
if args.as_json:
print(json.dumps(summary, separators=(",", ":")))
else:
print(render_human(summary))
return 0
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,93 @@
"""Thin dispatcher that routes ``--source <id>`` to the right per-source
scraper module.
Convention: one source per module under ``scrape.sources.<id>``. Each
module is independently runnable via ``python -m scrape.sources.<id>``
and accepts its own flags — this runner is a convenience shim for CI.
Examples:
python -m scrape.runner --source bayer_seeds --force
python -m scrape.runner --source golden_harvest --limit 20
python -m scrape.runner --all # walk every GREEN source in sources.json
Anything after the recognized flags is passed through to the source
scraper, so:
python -m scrape.runner --source bayer_seeds --force --brand dekalb
dispatches to ``scrape.sources.bayer_seeds`` with ``--force --brand
dekalb`` as argv.
Sources whose ``verdict`` in sources.json is anything other than
``"green"`` are skipped by ``--all`` (Beck's products is yellow until
the SeedIQ XHR is captured). Pass ``--source becks_products`` to run
a yellow source explicitly.
"""
from __future__ import annotations
import argparse
import importlib
import json
import sys
from pathlib import Path
REPO_ROOT = Path(__file__).resolve().parents[1]
SOURCES_JSON = REPO_ROOT / "sources.json"
def _load_sources() -> list[dict]:
    # A missing or malformed sources.json yields an empty list, not a crash.
    if not SOURCES_JSON.exists():
        return []
    try:
        data = json.loads(SOURCES_JSON.read_text())
        return data.get("sources", []) if isinstance(data, dict) else data
    except json.JSONDecodeError:
        return []


def _run_source(source_id: str, passthrough: list[str]) -> int:
    mod_name = f"scrape.sources.{source_id}"
    try:
        mod = importlib.import_module(mod_name)
    except ImportError as exc:
        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
        return 2
    main = getattr(mod, "main", None)
    if not callable(main):
        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
        return 2
    # A scraper main() that returns None is treated as success (exit code 0).
    return int(main(passthrough) or 0)
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(prog="scrape.runner")
parser.add_argument("--source", help="Source id (matches sources.json)")
parser.add_argument("--all", action="store_true",
help="Run every GREEN source listed in sources.json")
args, passthrough = parser.parse_known_args(argv)
if not args.source and not args.all:
parser.error("specify --source <id> or --all")
sources = _load_sources()
if args.all:
ids = [s["name"] for s in sources if s.get("verdict") == "green"]
if not ids:
print("runner: no GREEN sources in sources.json", file=sys.stderr)
return 2
else:
# If the source isn't registered in sources.json yet, dispatch anyway
# so the scraper can be exercised during initial development.
ids = [args.source]
rc = 0
for sid in ids:
print(f"=== scrape.runner: dispatching to {sid} ===")
rc |= _run_source(sid, passthrough)
return rc
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,34 @@
"""AgriPro scraper (Syngenta wheat brand).
Source: ``https://www.agriprowheat.com`` — Drupal Views form,
server-rendered HTML. No headless browser needed.
Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
under a separate brand and is out of scope for AgriPro.
Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
implicit in product family naming, not always a separate field).
Schema notes:
- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
- ``relative_maturity`` and ``maturity_group`` are null for wheat
- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
Septoria / tan spot
- Quality: test weight, protein, falling number, straw strength
TODO: implement.
"""
from __future__ import annotations
import sys
def main(argv: list[str] | None = None) -> int:
print("agripro: not implemented yet — Drupal Views form, only wheat in the corpus, no SRW (separate brand)",
file=sys.stderr)
return 2
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,56 @@
"""Bayer seeds scraper — DEKALB (corn) + Asgrow (soy) + WestBred (wheat).
Source: ``cropscience.bayer.us`` — same Next.js + ``__NEXT_DATA__``
infrastructure used by crop-chem-docs' Bayer crop-protection scraper.
That scraper is the reference; this one lifts ~80% of its plumbing
and adapts the per-product field mapping for seed schema.
Catalog index pages:
/corn/dekalb/seed-catalog
/soybeans/asgrow/seed-catalog
/wheat/westbred/seed-catalog
Each catalog page is a Next.js route; the per-variety data lives
under ``__NEXT_DATA__.props.pageProps`` (the exact key varies by
route; confirm it against a live page). The buildId in the script
tag rotates, so fetch the index page first, extract the buildId,
then fetch the per-variety JSON.
Output layout:
corpus/bayer_seeds/<source_key>.md LLM-visible body
corpus/bayer_seeds/<source_key>.json Sidecar metadata
source_key convention: ``<brand>-<product-slug>`` lowercased, e.g.
``dekalb-dkc62-08rib`` or ``asgrow-ag34xf2``.
Sidecar schema (per CLAUDE.md):
source: "bayer_seeds"
source_key: str
vendor: "Bayer"
brand: "DEKALB" | "Asgrow" | "WestBred"
product_name: str
crop: "corn" | "soybeans" | "wheat"
relative_maturity: int | null # corn only
maturity_group: float | null # soy only
wheat_class: str | null # wheat only
trait_stack: list[str]
agronomic_ratings: dict[str, int] # normalized 1-9 (9 = best)
disease_ratings: dict[str, int] # normalized 1-9 (9 = best)
regional_recommendation: list[str]
source_urls: list[str]
fetched_at: str # ISO 8601 UTC
TODO: implement. Reference: ~/github/crop-chem-docs/scrape/sources/bayer.py
"""
from __future__ import annotations
import sys
def main(argv: list[str] | None = None) -> int:
print("bayer_seeds: not implemented yet — see ~/github/crop-chem-docs/scrape/sources/bayer.py for the reference Next.js extraction pattern",
file=sys.stderr)
return 2
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,45 @@
"""Beck's PFR (Practical Farm Research) scraper.
Source: Public Sanity GROQ API at ``https://mc8v24rf.api.sanity.io``.
No authentication required — Beck's exposes their CMS content store
publicly. ~2,089 documents going back to 2015.
Sanity query endpoint:
``/v1/data/query/production?query=<groq>``
Useful GROQ for PFR docs (the projectId / dataset are public):
*[_type == "pfrStudy"] {
_id, title, year, crop, slug, summary, body, attachments
}
Records are research studies, not variety identity — head-to-head
yield trials, fungicide timing, planting-date studies, hybrid-by-
population, biological seed treatments, etc.
Treat differently from variety scrapers:
- One record per study, not per variety
- chunk_0 preamble includes the study's tl;dr finding (extract from
the ``summary`` field if present, or first paragraph of ``body``)
- Crop tag (corn/soy/wheat) for filtering
- Year tag — older PFR studies are still relevant but search should
let the user weight recency
Polite rate limit: Sanity is generous but no auth means we should
keep concurrency ≤4 and pause ~250ms between batches.
TODO: implement.
"""
from __future__ import annotations
import sys
def main(argv: list[str] | None = None) -> int:
print("becks_pfr: not implemented yet — public Sanity GROQ at mc8v24rf.api.sanity.io, ~2089 research docs",
file=sys.stderr)
return 2
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,46 @@
"""Beck's product catalog scraper (identity-only until SeedIQ XHR sniff lands).
Source: Same public Sanity GROQ API as ``becks_pfr`` (no auth).
Expected count: ~860 products (corn + soy + wheat).
Current limitation: Beck's exposes IDENTITY fields publicly (product
name, RM/MG, basic trait stack) but routes the AGRONOMIC + DISEASE
ratings through their SeedIQ application, which is gated behind a
browser session cookie. The public Sanity records do not include
ratings.
What we CAN ship without SeedIQ:
- Product identity for confirmation ("yes Beck's has hybrid X at RM 112")
- RM (corn) / MG (soy) / class (wheat)
- Trait stack
- Basic descriptive text
What needs the SeedIQ XHR endpoint (BLOCKED on user sniff):
- Disease ratings (GLS, NCLB, Goss's, etc.)
- Agronomic ratings (standability, drought, etc.)
- Regional recommendations
For now this scraper is DEFERRED. Run when:
- User captures the SeedIQ XHR URL + cookie/header pattern from
browser dev tools, OR
- We decide to ship Beck's as identity-only and let the LLM say
"Beck's has this hybrid; ask your Beck's rep for full agronomic
ratings" (less useful but avoids the empty-data UX).
Yellow verdict in sources.json reflects this — ``--all`` skips it.
TODO: implement (deferred).
"""
from __future__ import annotations
import sys
def main(argv: list[str] | None = None) -> int:
print("becks_products: deferred — SeedIQ XHR sniff required for ratings, run only if user has captured the endpoint",
file=sys.stderr)
return 2
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,42 @@
"""Golden Harvest scraper (Syngenta brand).
Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
every variety page. Server-rendered HTML — no headless browser
required. Tech-sheet PDFs live on the Syngenta CDN at
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
fetcher pattern as NK.
Two gotchas:
1. **Sitemap PDF dates are stale** (the sitemap was generated
2025-03-31 and never updated). Resolve the LIVE PDF URL from the
product HTML page, not from the sitemap entry.
2. **Disease scale is reversed.** Golden Harvest publishes ratings
on a 9-to-1 scale (1 = best, 9 = worst). Bayer/NK/AgriPro use
1-9 (9 = best). Normalize at chunk time so the corpus has a
single direction. Record the original direction in the chunk_0
preamble: "Note: ratings normalized to 1-9 (9 = best). Golden
Harvest publishes on a 9-to-1 scale natively."
Expected count: ~175 varieties (89 corn + 86 soy). No wheat.
Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
yield trial records. Out of scope for v1 but a high-value future
ingest for regional placement recommendations.
TODO: implement. Reuse the PDF-fetch helper that NK uses.
"""
from __future__ import annotations
import sys
def main(argv: list[str] | None = None) -> int:
print("golden_harvest: not implemented yet — see CLAUDE.md for the disease-scale-reversal gotcha and the live-PDF-URL-resolution requirement",
file=sys.stderr)
return 2
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,35 @@
"""NK scraper (Syngenta brand).
Source: ``https://www.syngenta-us.com`` — static HTML product pages
plus tech-sheet PDFs on the Syngenta CDN at
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
Expected count: 29 varieties (12 corn + 17 soy). No wheat.
The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
helper module under ``scrape.sources._syngenta_pdf`` once both
scrapers are written.
Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
have marketing copy only). Use pdfplumber for table extraction.
Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
etc. — additional supplemental context worth ingesting once the
per-variety scrape is solid.
TODO: implement.
"""
from __future__ import annotations
import sys
def main(argv: list[str] | None = None) -> int:
print("nk: not implemented yet — disease/agronomic ratings come from CDN tech-sheet PDFs only, use pdfplumber",
file=sys.stderr)
return 2
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,167 @@
"""Gitea container-registry garbage collection.
Prunes old container tags from a Gitea registry package. Always
preserves:
- The ``latest`` tag (Watchtower auto-pull target)
- Any ``corpus-*`` tag (production pins; Drawbar may have them locked)
- The ``--keep-latest`` most-recent OTHER tags (typically commit-sha pins)
- Anything pushed within ``--keep-days`` days
The actual disk reclaim happens on Gitea's next package GC cron
(admin site settings). This script marks versions for deletion.
Why this script doesn't use the Docker Registry v2 API: that API has
tag listing + manifest delete by digest, but no per-tag created-at
timestamp without an extra blob-fetch round-trip. Gitea's packages
API gives us {tag, created_at} in one call, which is what the keep
policy needs.
The endpoint shape that actually works (matches Gitea 1.21+):
GET /api/v1/packages/{owner}?type=container&q={name}
→ JSON array, ONE entry per tag, each with id + version=tag + created_at
DELETE /api/v1/packages/{owner}/container/{name}/{tag}
→ 204 on success, 404 if already gone
Auth: GITEA_TOKEN env var (PAT with delete:packages scope; the
push-only PAT we use as REGISTRY_TOKEN may not be enough — if you
see 403s, mint a separate PAT and pass it as GITEA_TOKEN here).
Usage:
python scripts/registry_gc.py \\
--owner justin \\
--package crop-chem-docs \\
--keep-days 180 \\
--keep-latest 6 \\
[--dry-run]
"""
from __future__ import annotations
import argparse
import json
import os
import sys
from datetime import datetime, timedelta, timezone
from urllib.error import HTTPError
from urllib.request import Request, urlopen
GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
def api(token: str, method: str, path: str) -> object:
    # User-Agent matters: Cloudflare in front of git.jpaul.io returns
    # 403 to the default `Python-urllib/3.x` UA. Any non-Python UA
    # passes. Curl works, requests works, we just need to not look
    # like a vanilla urllib script.
    req = Request(
        f"{GITEA_HOST}{path}",
        headers={
            "Authorization": f"token {token}",
            "User-Agent": "crop-chem-docs-registry-gc/0.1",
        },
        method=method,
    )
    try:
        with urlopen(req, timeout=30) as r:
            body = r.read()
            # DELETE returns 204 with an empty body, hence the None fallback.
            return json.loads(body) if body else None
    except HTTPError as e:
        # 404 means "already gone" (or no such package); treat it as empty.
        if e.code == 404:
            return None
        raise


def _parse_created(version: dict) -> datetime:
    """Gitea returns RFC3339 with offset like '2026-05-24T16:07:50-04:00'.
    Python 3.11+ handles this directly via fromisoformat."""
    return datetime.fromisoformat(version["created_at"])
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--owner", required=True)
p.add_argument("--package", required=True)
p.add_argument("--keep-days", type=int, default=180)
p.add_argument("--keep-latest", type=int, default=6,
help="Keep this many most-recent commit-sha (etc.) "
"tags BEFORE applying --keep-days. corpus-* and "
":latest are kept regardless.")
p.add_argument("--dry-run", action="store_true",
help="Show what would be deleted without calling DELETE.")
args = p.parse_args()
token = os.environ.get("GITEA_TOKEN")
if not token:
print("GITEA_TOKEN env var not set", file=sys.stderr)
return 1
# Gitea's q= is a substring match; filter to exact name so we don't
# accidentally GC a sibling package that shares the prefix.
versions = api(
token, "GET",
f"/api/v1/packages/{args.owner}?type=container&q={args.package}",
) or []
versions = [v for v in versions if v.get("name") == args.package]
if not versions:
print(f"no versions found for {args.owner}/{args.package} — nothing to GC")
return 0
cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
versions.sort(key=_parse_created, reverse=True) # newest first
keep: list[tuple[str, str]] = [] # (tag, reason)
delete: list[dict] = []
other_kept = 0
for v in versions:
tag = v.get("version", "")
created = _parse_created(v)
if tag == "latest":
keep.append((tag, "always-keep (:latest)"))
continue
if tag.startswith("corpus-"):
keep.append((tag, "production pin (corpus-*)"))
continue
if other_kept < args.keep_latest:
other_kept += 1
keep.append((tag, f"keep-latest #{other_kept}/{args.keep_latest}"))
continue
if created >= cutoff:
keep.append((tag, f"within --keep-days ({args.keep_days})"))
continue
delete.append(v)
print(f"=== {args.owner}/{args.package}: {len(versions)} total tag(s) ===")
for tag, reason in keep:
print(f" KEEP {tag:<28} {reason}")
for v in delete:
print(f" DEL {v['version']:<28} created={v['created_at']}")
if not delete:
print("nothing to delete")
return 0
if args.dry_run:
print(f"--dry-run; would delete {len(delete)} tag(s)")
return 0
failed = 0
for v in delete:
tag = v["version"]
try:
api(token, "DELETE",
f"/api/v1/packages/{args.owner}/container/{args.package}/{tag}")
print(f" ✓ deleted {tag}")
except HTTPError as e:
print(f" ✗ failed {tag}: HTTP {e.code} {e.reason}", file=sys.stderr)
failed += 1
print(f"done: deleted {len(delete) - failed} / {len(delete)} tag(s)")
return 0 if failed == 0 else 1
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,251 @@
"""Summarize usage logs from docs_mcp.usage into a quick scan.
Reads one or more usage.jsonl* files and prints sections for:
- per-tool call counts
- top search_docs queries by frequency
- 0-hit queries (where we returned nothing — high-signal for tuning)
- filter usage histogram (which version / platform / bundle filters get hit)
- reranker effectiveness (calls where the reranker fired vs not)
- hybrid retrieval top-1 attribution (dense vs bm25 vs both)
Usage:
# Point at a log directory (default: /app/var/logs, the path used in
# the production container)
python scripts/usage_report.py --logs-dir /path/to/usage/logs
# Last N days only:
python scripts/usage_report.py --logs-dir <dir> --since 7d
# Markdown output (for piping into a weekly digest email, etc):
python scripts/usage_report.py --logs-dir <dir> --format markdown
The script doesn't depend on anything in the docs_mcp package — it's a
standalone tool that can run anywhere with the log files available
(scp them off the host, point it at the directory).
----------------------------------------------------------------------
FOLLOW-UP CHECKS
----------------------------------------------------------------------
Pattern: when you ship a retrieval change with a hypothesis attached
(e.g. "hybrid will rescue queries dense misses"), add a note HERE
describing what the usage report should show and at what threshold
the change earns its keep. Future-you running the report a month
later will be glad. Example:
Q: Does the dense leg of hybrid retrieval earn its keep on
real traffic, or could we simplify to BM25-only?
- bm25_only >= 80%% --> dense not doing much; consider
simplifying to BM25 mode
- both >= 50%% --> hybrid is tie-breaking; keep it
- dense_only > bm25_only --> dense is the workhorse; keep
Also worth a glance every month:
- 0-hit queries list (tuning candidates)
- reranker p95 latency drift (slow reranker = bad UX)
- filter usage (does anyone actually use version/platform
filters? if not, simplify the tool surface)
"""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Iterable
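def parse_since(s: str | None) -> datetime | None:
    """Accept '7d', '24h', '30m', or an ISO timestamp. None → no cutoff.

    An ISO timestamp without an offset is treated as UTC so the cutoff
    can be compared against offset-carrying record timestamps.
    """
    if not s:
        return None
    m = re.fullmatch(r"(\d+)([dhm])", s)
    if m:
        n, unit = int(m.group(1)), m.group(2)
        delta = {"d": timedelta(days=n), "h": timedelta(hours=n), "m": timedelta(minutes=n)}[unit]
        return datetime.now(timezone.utc) - delta
    parsed = datetime.fromisoformat(s.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Comparing naive and aware datetimes raises TypeError; assume UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed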
def load_events(logs_dir: Path, since: datetime | None) -> Iterable[dict[str, Any]]:
"""Yield every JSONL record across all files in logs_dir."""
if not logs_dir.exists():
print(f"warning: logs dir {logs_dir} does not exist", file=sys.stderr)
return
# usage.jsonl is the active file; usage.jsonl.YYYY-MM-DD are rotated.
files = sorted(logs_dir.glob("usage.jsonl*"))
for f in files:
with open(f) as fh:
for ln, line in enumerate(fh, start=1):
line = line.strip()
if not line:
continue
try:
rec = json.loads(line)
except json.JSONDecodeError as e:
print(f" ! skipping {f}:{ln}: {e}", file=sys.stderr)
continue
if since:
ts = rec.get("ts", "")
try:
rec_ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
except ValueError:
continue
if rec_ts < since:
continue
yield rec
def main() -> int:
p = argparse.ArgumentParser(description=__doc__)
p.add_argument("--logs-dir", type=Path, default=Path("/app/var/logs"),
help="directory with usage.jsonl* files")
p.add_argument("--since", default=None,
help="time window: '7d', '24h', '30m', or ISO timestamp")
p.add_argument("--top", type=int, default=25,
help="how many top queries / filters to show")
p.add_argument("--format", choices=("text", "markdown"), default="text")
args = p.parse_args()
since = parse_since(args.since)
events = list(load_events(args.logs_dir, since))
if not events:
print("(no events in window)")
return 0
print(f"# Usage report — {len(events)} events"
+ (f" since {since.isoformat()}" if since else "")
+ f" from {args.logs_dir}")
print()
# 1. Per-tool counts
by_tool = Counter(e["tool"] for e in events)
print("## Per-tool call counts")
print()
if args.format == "markdown":
print("| tool | calls |")
print("|---|---|")
for tool, n in by_tool.most_common():
print(f"| `{tool}` | {n} |")
else:
for tool, n in by_tool.most_common():
print(f" {tool:<25s} {n:>6d}")
print()
# 2. Top search_docs queries
search_events = [e for e in events if e["tool"] == "search_docs"]
queries = Counter(e["args"].get("query", "") for e in search_events)
print(f"## Top {args.top} search_docs queries (of {len(search_events)} searches)")
print()
if args.format == "markdown":
print("| count | query |")
print("|---|---|")
for q, n in queries.most_common(args.top):
print(f"| {n} | `{q}` |")
else:
for q, n in queries.most_common(args.top):
print(f" {n:>5d} {q!r}")
print()
# 3. 0-hit queries — the highest-signal data for tuning
zero_hit = [e for e in search_events if e.get("hits_returned") == 0]
zero_q = Counter(e["args"].get("query", "") for e in zero_hit)
print(f"## 0-hit queries ({len(zero_hit)} of {len(search_events)} searches returned nothing)")
print()
if zero_q:
if args.format == "markdown":
print("| count | query | filters |")
print("|---|---|---|")
# Group by query, show filter examples for each
examples_by_query: dict[str, list[dict]] = defaultdict(list)
for e in zero_hit:
examples_by_query[e["args"].get("query", "")].append(e["args"])
for q, n in zero_q.most_common(args.top):
ex = examples_by_query[q][0]
f = {k: v for k, v in ex.items()
if k in ("version", "platform", "bundle_id") and v}
print(f"| {n} | `{q}` | `{f}` |")
else:
for q, n in zero_q.most_common(args.top):
print(f" {n:>5d} {q!r}")
else:
print(" _(no 0-hit queries in window)_")
print()
    # 4. Filter usage
    filter_use = Counter()
    for e in search_events:
        a = e["args"]
        v = a.get("version")
        p_ = a.get("platform")
        b = a.get("bundle_id")
        if v:
            filter_use[f"version={v}"] += 1
        if p_:
            filter_use[f"platform={p_}"] += 1
        if b:
            filter_use[f"bundle_id={b}"] += 1
        if not (v or p_ or b):
            filter_use["(no filter)"] += 1
    print("## search_docs filter usage")
    print()
    if args.format == "markdown":
        print("| filter | count |")
        print("|---|---|")
        for f, n in filter_use.most_common(args.top):
            print(f"| `{f}` | {n} |")
    else:
        for f, n in filter_use.most_common(args.top):
            print(f" {n:>5d} {f}")
    print()
    # 5. Reranker effectiveness
    reranked = [e for e in search_events if e.get("reranked") is True]
    dense_only = [e for e in search_events if e.get("reranked") is False]
    print("## Reranker activity")
    print()
    print(f" reranked: {len(reranked):>5d}")
    print(f" dense only: {len(dense_only):>5d} (filter too narrow or 0 results)")
    if reranked:
        elapsed = [e["elapsed_ms"] for e in reranked if e.get("elapsed_ms") is not None]
        if elapsed:
            elapsed.sort()
            p50 = elapsed[len(elapsed) // 2]
            p95 = elapsed[int(len(elapsed) * 0.95)]
            print(f" reranked latency p50: {p50:.0f} ms, p95: {p95:.0f} ms")
    print()
    # 6. Hybrid retrieval activity: which retriever contributed the top-1?
    # Empty unless HYBRID_SEARCH=true is set on the MCP container.
    hybrid_events = [e for e in search_events if e.get("retrieval_mode") == "hybrid"]
    if hybrid_events:
        by_source = Counter(e.get("top1_source") for e in hybrid_events
                            if e.get("top1_source"))
        print("## Hybrid retrieval — top-1 attribution")
        print()
        print(f" hybrid mode events: {len(hybrid_events)}")
        # `or 1` guards the percentage division when no event carries top1_source.
        total = sum(by_source.values()) or 1
        for src in ("both", "dense_only", "bm25_only"):
            n = by_source.get(src, 0)
            pct = 100.0 * n / total
            label = {
                "both": "in BOTH retrievers' top-N",
                "dense_only": "dense found it, BM25 didn't",
                "bm25_only": "BM25 found it, dense didn't",
            }[src]
            print(f" {src:<11s} {n:>5d} ({pct:5.1f}%) — {label}")
        rescued = by_source.get("bm25_only", 0)
        if rescued:
            print(f"\n{rescued} ({100.0 * rescued / total:.1f}%) of hybrid queries had a top-1 "
                  "result that ONLY BM25 surfaced. Without hybrid, those would have been dense-misses.")
    return 0
if __name__ == "__main__":
    sys.exit(main())
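The p50/p95 figures in section 5 come from indexing straight into the sorted latency list rather than interpolating between neighbours. A minimal sketch of that index arithmetic in isolation, assuming a hypothetical helper named `index_percentile` (not a function in this script) and made-up latency values:

```python
def index_percentile(values, q):
    """Return the element at position int(len * q) of the sorted values.

    Mirrors the indexing used for p50 and p95 in the report above.
    `index_percentile` is a hypothetical name used only for illustration.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Clamp so that q = 1.0 still maps to the last element.
    idx = min(int(len(ordered) * q), len(ordered) - 1)
    return ordered[idx]


# Made-up latencies in milliseconds, for illustration only.
sample_ms = [120, 95, 310, 88, 150, 101, 97, 240, 133, 90]
p50 = index_percentile(sample_ms, 0.50)
p95 = index_percentile(sample_ms, 0.95)
print(f"p50: {p50} ms, p95: {p95} ms")
```

With only ten samples the p95 index is 9, so p95 is simply the slowest call; on small log windows the tail figure is sensitive to a single outlier.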
@@ -0,0 +1,89 @@
{
"_description": "seed-mcp source catalog. Each scraper module under scrape/sources/ corresponds to one entry. Run via `python -m scrape.runner --source <name>`. The MCP container bakes this file in so corpus_status / list_versions can reflect provenance without re-scraping.",
"_pioneer_excluded": "Pioneer (Corteva) is intentionally absent. Per their ToS: 'you shall not use any manual or automated software, devices or other processes (including but not limited to spiders, robots, scrapers, crawlers, avatars, data mining tools or the like) to scrape or download data from the Services'. The MCP returns a curated fallback lesson directing the user to pioneer.com / a local dealer.",
"sources": [
{
"name": "bayer_seeds",
"vendor": "Bayer",
"brands": ["DEKALB", "Asgrow", "WestBred"],
"crops": ["corn", "soybeans", "wheat"],
"verdict": "green",
"expected_count": 475,
"base_url": "https://cropscience.bayer.us",
"scope_filter": "All listed varieties; no regional filter applied at scrape time (regional recommendations parsed into sidecar so the MCP can filter at search time).",
"tos_check_date": "2026-05-24",
"tos_note": "robots.txt explicitly whitelists RAG/LLM use cases. Same legal stance as crop-chem-docs scraper."
},
{
"name": "golden_harvest",
"vendor": "Syngenta",
"brands": ["Golden Harvest"],
"crops": ["corn", "soybeans"],
"verdict": "green",
"expected_count": 175,
"base_url": "https://www.goldenharvestseeds.com",
"scope_filter": "All sitemap-listed corn + soybean varieties.",
"tos_check_date": "2026-05-25",
"schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) — resolve live URL from product HTML, not sitemap entry."
},
{
"name": "nk",
"vendor": "Syngenta",
"brands": ["NK"],
"crops": ["corn", "soybeans"],
"verdict": "green",
"expected_count": 29,
"base_url": "https://www.syngenta-us.com",
"pdf_cdn": "https://assets.syngentaebiz.com/pdf/techsheets/",
"scope_filter": "All NK corn + soy varieties. No wheat (NK doesn't sell wheat in US).",
"tos_check_date": "2026-05-24",
"schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only — need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
},
{
"name": "agripro",
"vendor": "Syngenta",
"brands": ["AgriPro"],
"crops": ["wheat", "barley"],
"verdict": "green",
"expected_count": 24,
"base_url": "https://www.agriprowheat.com",
"scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
"tos_check_date": "2026-05-24",
"schema_notes": "Drupal Views form; server-rendered HTML. CoAXium trait flag is implicit in product family; Clearfield/CL2 trait IS in this catalog."
},
{
"name": "becks_pfr",
"vendor": "Beck's Hybrids",
"brands": ["Beck's PFR"],
"crops": ["corn", "soybeans", "wheat"],
"verdict": "yellow",
"expected_count": 2089,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
"scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials — fungicide timing, planting-date studies, hybrid-by-population, etc.",
"tos_check_date": "2026-05-24",
"schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records — the chunk_0 includes the study's tl;dr finding."
},
{
"name": "becks_products",
"vendor": "Beck's Hybrids",
"brands": ["Beck's"],
"crops": ["corn", "soybeans", "wheat"],
"verdict": "yellow",
"expected_count": 860,
"base_url": "https://www.beckshybrids.com",
"api_base": "https://mc8v24rf.api.sanity.io",
"scope_filter": "All Beck's product records — corn + soy + wheat. Identity + RM/MG only.",
"tos_check_date": "2026-05-24",
"schema_notes": "Sanity GROQ exposes identity (name, RM/MG, basic traits) but agronomic + disease ratings are SeedIQ-gated (requires browser cookie). Deferred until the SeedIQ XHR endpoint is captured from a logged-in browser session. Without ratings, products are reference-only; the MCP can confirm 'Beck's has hybrid X at RM 112 with Enlist trait' but not 'rate it against drought'."
}
],
"_excluded_sources": [
{
"name": "pioneer",
"vendor": "Corteva",
"verdict": "red",
"reason": "Explicit ToS prohibits automated scraping. Dealer locator at /us/sales-representatives/my-local-team.html is login-gated; no public API for dealer contact info. The MCP returns a curated fallback lesson instead of erroring."
}
]
}
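The `verdict` field is what allows a dispatcher to treat green sources differently from yellow ones. A minimal sketch of selecting sources by verdict, assuming the catalog has already been parsed into a dict; the function name `sources_by_verdict` and the trimmed inline catalog are illustrative and are not code from this repository:

```python
import json

# Trimmed copy of the catalog above: only the fields this sketch needs.
CATALOG_JSON = """
{
  "sources": [
    {"name": "bayer_seeds",    "verdict": "green",  "expected_count": 475},
    {"name": "golden_harvest", "verdict": "green",  "expected_count": 175},
    {"name": "nk",             "verdict": "green",  "expected_count": 29},
    {"name": "agripro",        "verdict": "green",  "expected_count": 24},
    {"name": "becks_pfr",      "verdict": "yellow", "expected_count": 2089},
    {"name": "becks_products", "verdict": "yellow", "expected_count": 860}
  ]
}
"""


def sources_by_verdict(catalog, verdict):
    """Return catalog entries whose "verdict" equals the given value.

    Hypothetical helper: the real scrape/runner.py is not shown here.
    """
    return [s for s in catalog.get("sources", []) if s.get("verdict") == verdict]


catalog = json.loads(CATALOG_JSON)
green = sources_by_verdict(catalog, "green")
print([s["name"] for s in green])
print(sum(s["expected_count"] for s in green))
```

The commit message describes `--all` as green-only; how scrape/runner.py implements that is not shown in this diff, so this is one plausible shape rather than the actual dispatcher.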