seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME
Image rebuild (skip scrape) / build (push) Failing after 7s

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,117 @@
name: Image rebuild (skip scrape)

# Fast path for code-only changes. Skips the scrape and goes straight
# to: rebuild indexes (from corpus already committed on main) + image
# build + push. Runtime ~10 min vs ~2-3 h for the full monthly refresh.
#
# Use when a PR only changes code/config — anything where the upstream
# seed catalogs haven't moved but we want the new Python in the
# running image.

on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - "docs_mcp/**"
      - "rag/**"
      - "scrape/**"
      - "requirements.txt"
      - "Dockerfile"
      - "sources.json"

# If multiple pushes land in quick succession, cancel the older one
# rather than queueing both — each run is non-trivial and the older
# commit's image just gets overwritten by the newer one anyway.
concurrency:
  group: image-only
  cancel-in-progress: true

env:
  REGISTRY_PUSH: 192.168.0.2:1234
  REGISTRY_PULL: git.jpaul.io
  IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
  OLLAMA_URL: http://192.168.0.2:11434,http://192.168.0.2:11435,http://192.168.0.125:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: crop_seed

jobs:
  build:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt

      - name: Verify committed corpus is present
        run: |
          test -d corpus || { echo "ERROR: corpus/ missing on this ref"; exit 1; }
          n_md=$(find corpus -name '*.md' | wc -l)
          n_json=$(find corpus -name '*.json' | wc -l)
          echo "corpus: $(du -sh corpus | cut -f1) on disk, ${n_md} .md / ${n_json} .json"

      - name: Rebuild indexes from committed corpus
        run: python -m rag.index --rebuild

      - name: Log in to Gitea container registry
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin

      - name: Build & push image
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          CORPUS_TAG="corpus-$(date -u +%Y.%m.%d)"
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}"

      - name: Link container package to this repo
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          OWNER="${{ github.repository_owner }}"
          PKG="${{ github.event.repository.name }}"
          BODY=$(mktemp)
          CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
            -H "Authorization: token ${GITEA_TOKEN}" \
            "https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
          echo "link http=$CODE body=$(cat "$BODY")"
          case "$CODE" in
            201) echo "linked package to ${OWNER}/${PKG}" ;;
            400) echo "already linked — ok" ;;
            *) echo "unexpected status $CODE"; exit 1 ;;
          esac

      - name: Prune old container versions
        # GC requires broader scope than REGISTRY_TOKEN's push perms
        # (HTTP 403 on /packages/.../versions). Non-critical —
        # housekeeping only. Don't fail the whole run.
        # TODO: issue separate PAT with admin:package scope and set
        # as PACKAGES_ADMIN_TOKEN.
        continue-on-error: true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner "${{ github.repository_owner }}" \
            --package "${{ github.event.repository.name }}" \
            --keep-days 180 \
            --keep-latest 6
@@ -0,0 +1,186 @@
name: Monthly seed catalog refresh

# Runs the full pipeline: scrape all GREEN sources → rebuild indexes
# → push image. Cron'd once a month (1st @ 06:00 UTC). Skip the
# reindex + image-push if the scrape produced no diff against the
# committed corpus.
#
# Seed catalogs move slowly (vendors release new hybrids 1-2x/year
# at field-day timing); monthly cadence is plenty.
#
# Total runtime budget: ~2-3 h end-to-end across all 5 GREEN sources.
# Bayer is the longest (~475 varieties, ~45 min). Beck's PFR is the
# heaviest single-source (~2,089 docs via Sanity GROQ pagination).

on:
  schedule:
    - cron: "0 6 1 * *"  # 1st of each month, 06:00 UTC
  workflow_dispatch:
    inputs:
      force_build:
        description: "Rebuild indexes + push image even if corpus is unchanged"
        type: boolean
        default: false
      sources:
        description: "Sources to scrape (comma-separated, blank = all GREEN)"
        type: string
        default: ""

env:
  # Self-hosted Gitea registry on the same LAN as the runner.
  # CF caps push body at 100 MB, so push via LAN endpoint; pull
  # through the public hostname (response bodies aren't capped).
  REGISTRY_PUSH: 192.168.0.2:1234
  REGISTRY_PULL: git.jpaul.io
  IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}

  # Embedder pool. Two Ollama instances on the Gitea/runner host
  # (one per GPU) + the Windows Ollama. Trashpanda's Ollama is
  # production-shared with Drawbar; CI does NOT hit it.
  OLLAMA_URL: http://192.168.0.2:11434,http://192.168.0.2:11435,http://192.168.0.125:11434
  EMBED_MODEL: nomic-embed-text

  PRODUCT_NAME: crop_seed

jobs:
  refresh:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Full history — required for the digest-history step
          # to walk git log. Default fetch-depth: 1 silently
          # produces a 0-byte history file.
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt

      # ---- Phase 1: scrape ---------------------------------------
      - name: Scrape Bayer seeds (DEKALB + Asgrow + WestBred)
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'bayer_seeds') }}
        run: python -m scrape.runner --source bayer_seeds --force

      - name: Scrape Golden Harvest
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'golden_harvest') }}
        run: python -m scrape.runner --source golden_harvest --force

      - name: Scrape NK
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'nk') }}
        run: python -m scrape.runner --source nk --force

      - name: Scrape AgriPro
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
        run: python -m scrape.runner --source agripro --force

      - name: Scrape Beck's PFR research corpus
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
        # Heaviest source — ~2,089 docs via public Sanity GROQ.
        # No auth, but rate-limit ourselves to be polite.
        run: python -m scrape.runner --source becks_pfr --force

      # ---- Commit corpus changes + retry-on-race -----------------
      - name: Commit corpus changes (if any)
        id: commit
        run: |
          git config user.name "seed-mcp-refresh"
          git config user.email "actions@jpaul.io"
          git add sources.json corpus
          if git diff --cached --quiet; then
            echo "no corpus changes — skipping reindex and image build"
            echo "changed=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "changed=true" >> "$GITHUB_OUTPUT"
          ts=$(date -u +"%Y-%m-%dT%H:%MZ")
          n_bayer=$(find corpus/bayer_seeds -name '*.json' 2>/dev/null | wc -l)
          n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
          n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
          n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
          n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
          attempt=1
          while [ $attempt -le 3 ]; do
            if git push; then
              echo "pushed corpus changes (attempt $attempt)"
              break
            fi
            if [ $attempt -eq 3 ]; then
              echo "push still failing after 3 attempts"; exit 1
            fi
            git fetch origin main
            git rebase origin/main || { echo "rebase conflict"; exit 1; }
            attempt=$((attempt + 1))
          done

      # ---- Rebuild Chroma + BM25 ---------------------------------
      - name: Rebuild indexes
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: python -m rag.index --rebuild

      # ---- Build & push image ------------------------------------
      - name: Log in to Gitea container registry
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin

      - name: Build & push image
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        # Tags: :latest (Watchtower target), :<sha12> (rollback pin),
        # :corpus-<YYYY.MM.DD> (links image to corpus version so
        # Drawbar can pin to a specific seed-catalog snapshot).
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          CORPUS_TAG="corpus-$(date -u +%Y.%m.%d)"
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}"

      - name: Link container package to this repo
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          OWNER="${{ github.repository_owner }}"
          PKG="${{ github.event.repository.name }}"
          BODY=$(mktemp)
          CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
            -H "Authorization: token ${GITEA_TOKEN}" \
            "https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
          echo "link http=$CODE body=$(cat "$BODY")"
          case "$CODE" in
            201) echo "linked package to ${OWNER}/${PKG}" ;;
            400) echo "already linked — ok" ;;
            *) echo "unexpected status $CODE"; exit 1 ;;
          esac

      - name: Prune old container versions
        # GC requires broader scope than REGISTRY_TOKEN's push perms
        # (HTTP 403 on /packages/.../versions). Non-critical
        # housekeeping. TODO: issue separate PAT with admin:package
        # scope. Until then continue-on-error keeps a failed prune
        # from breaking the whole refresh.
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        continue-on-error: true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner "${{ github.repository_owner }}" \
            --package "${{ github.event.repository.name }}" \
            --keep-days 180 \
            --keep-latest 6
@@ -0,0 +1,31 @@
# Virtualenv
venv/
.venv/

# Regenerable from corpus + CI
corpus/
chroma/
bm25/

# Python detritus
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.mypy_cache/
.ruff_cache/

# Eval results (regenerable; commit only the headline baseline if you want)
# eval/results/

# Usage logs (host-mounted volume in prod; don't commit dev logs)
var/

# Local-only env
.env
.env.local

# IDE
.vscode/
.idea/
*.swp
@@ -0,0 +1,230 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when
working with code in this repository.

## Purpose

`seed-mcp` is an MCP server over the **public catalogs of major US
row-crop seed vendors** (corn / soybeans / wheat). It is the sibling
project to [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
— same MCP-template lineage, same Drawbar consumer (the farm
advisor AI), but the corpus is **seed/hybrid varieties** rather than
pesticide labels.

The MCP exposes per-variety records with agronomic ratings, disease
tolerance, trait stack, maturity, and regional notes — so the advisor
can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105
in northeast Iowa?" without rummaging through individual brand sites.

PRODUCT_NAME for this build: **`crop_seed`** (lowercase, underscore;
ends up in the MCP server name, Chroma collection, BM25 db filename,
and the `crop_seed_api_lessons` tool).

## Vendor scope

| Vendor | Verdict | Varieties | Source pattern |
|---|---|---|---|
| Bayer (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | `cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs) |
| Golden Harvest (Syngenta) | 🟢 | ~175 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth) |
| Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
| Pioneer (Corteva) | 🔴 | — | DROP. ToS bans automation; dealer locator login-gated too |

**Build priority order** (shared-infra first → biggest yield):
1. `bayer_seeds` — lift-and-shift from crop-chem-docs' Bayer scraper
2. `golden_harvest` — biggest unique Syngenta brand
3. `nk` — reuses Golden Harvest's PDF fetcher
4. `agripro` — only wheat coverage in the corpus
5. `becks_pfr` — research goldmine, public Sanity GROQ
6. `becks_products` — identity-only, deferred until SeedIQ XHR known

### Pioneer fallback

Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer.
The MCP's lessons layer contains a Pioneer-fallback entry: when the
LLM detects a Pioneer / P-series query, it should reply:

> "Pioneer does not allow AI or other automation techniques to
> scrape and index their data. For Pioneer brand seed information,
> reach out to a local dealer directly via
> [pioneer.com](https://www.pioneer.com)."

Pioneer's dealer locator is login-gated — there is no public API
to surface dealer contact info, so the lesson stays a plain link.

## Schema notes per crop

- **Corn**: RM (relative maturity days), trait stack (SmartStax, VT
  Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB /
  Goss's / Anthracnose / Tar Spot ratings, standability, drought
  tolerance, ear flex, grain-vs-silage flag.
- **Soy**: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 /
  LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora
  (race + Rps gene) / frogeye / brown stem rot ratings, IDC
  tolerance (critical for upper Midwest), branching habit.
- **Wheat**: Class (HRW / HRS / SRW / SWW / SWS / durum), heading
  (early / medium / late), stripe rust / leaf rust / stem rust /
  FHB (scab) / Septoria / tan spot ratings, test weight, protein,
  falling number, straw strength, CoAXium trait flag.

**Disease scale gotcha**: Golden Harvest publishes ratings on a
**9-to-1 scale** (9 = best, 1 = worst) — the REVERSE of the typical
1-9 convention used by Bayer/NK/AgriPro. Normalize at chunk time so
the corpus has a single direction; document it in a chunk_0
preamble.
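
A minimal sketch of the direction flip, assuming integer ratings on a 1-9 scale and a corpus-wide target of 9 = best. The function name `normalize_rating` and the `best_is_high` flag are hypothetical and do not exist in this repo; which sources need the flip is for each scraper to record.

```python
def normalize_rating(value: int, best_is_high: bool) -> int:
    """Return a rating on a 1-9 scale where 9 is best.

    `best_is_high` is a hypothetical per-source flag: True when the
    vendor already publishes 9 = best, False when 1 = best.
    """
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range: {value}")
    # Flipping a 1-9 scale maps 1<->9, 2<->8, ..., and leaves 5 unchanged.
    return value if best_is_high else 10 - value
```

Keeping the flip in one helper means the chunk_0 preamble only has to state the original direction once per source.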

## Canonical sidecar schema (per variety)

```json
{
  "source": "bayer_seeds",
  "source_key": "dekalb-dkc62-08rib",
  "vendor": "Bayer",
  "brand": "DEKALB",
  "product_name": "DKC62-08RIB",
  "crop": "corn",
  "relative_maturity": 112,
  "maturity_group": null,
  "wheat_class": null,
  "trait_stack": ["SmartStax", "RIB"],
  "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
  "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
  "regional_recommendation": ["IA-N", "MN-S", "WI-W"],
  "source_urls": ["https://cropscience.bayer.us/..."],
  "fetched_at": "2026-05-25T12:34:56Z"
}
```

`maturity_group` is for soy, `relative_maturity` is for corn,
`wheat_class` is for wheat. Use `null` for fields that don't apply.
Disease/agronomic rating direction is **normalized 1-9 (9 = best)**
post-scrape — original direction noted in chunk_0 if the source
publishes differently.
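
The two rules above (crop-specific maturity fields, ratings in 1-9) could be checked before a sidecar is written. This is a sketch only: `check_sidecar` is a hypothetical helper, and the crop keys `"soy"` and `"wheat"` are assumed by analogy with the `"corn"` value in the example record.

```python
from typing import Any

# Which maturity-style field applies to each crop. The "soy" and
# "wheat" keys are assumptions; only "corn" appears in the example.
CROP_FIELD = {
    "corn": "relative_maturity",
    "soy": "maturity_group",
    "wheat": "wheat_class",
}


def check_sidecar(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    crop = record.get("crop")
    if crop not in CROP_FIELD:
        return [f"unknown crop: {crop!r}"]
    problems = []
    # Fields that belong to a different crop must be null.
    for other_crop, field in CROP_FIELD.items():
        if other_crop != crop and record.get(field) is not None:
            problems.append(f"{field} must be null for {crop}")
    # Ratings are expected on the normalized 1-9 scale.
    for group in ("agronomic_ratings", "disease_ratings"):
        for name, rating in (record.get(group) or {}).items():
            if not isinstance(rating, (int, float)) or not 1 <= rating <= 9:
                problems.append(f"{group}.{name} out of range: {rating!r}")
    return problems
```

Returning a list rather than raising on the first problem lets a scraper log every issue for a variety in one pass.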

## Working with this repo

### Identifying the current phase

This is a clone of the docs-mcp-template; phases follow the
template's PLAN.md.

| Signal | Likely phase |
|---|---|
| `corpus/` doesn't exist | Phase 1 (first scraper) |
| `corpus/bayer_seeds/` exists, no `chroma/` | Phase 2 (indexing) |
| `chroma/` exists, no `bm25/` | Phase 8 (hybrid search) |
| No `eval/results/` | Phase 7 (eval harness) |
| `_api_lessons` is `NotImplementedError` | Phase 11 |
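
The filesystem rows of the table can be checked mechanically. A rough sketch, assuming the signals are evaluated in table order; `likely_phase` is a hypothetical helper, and the last row (`_api_lessons`) needs code inspection, so it is left out.

```python
from pathlib import Path
from typing import Optional


def likely_phase(repo: Path) -> Optional[str]:
    """Map the directory signals from the table to a likely phase."""
    corpus = repo / "corpus"
    if not corpus.is_dir():
        return "Phase 1 (first scraper)"
    if (corpus / "bayer_seeds").is_dir() and not (repo / "chroma").is_dir():
        return "Phase 2 (indexing)"
    if (repo / "chroma").is_dir() and not (repo / "bm25").is_dir():
        return "Phase 8 (hybrid search)"
    if not (repo / "eval" / "results").is_dir():
        return "Phase 7 (eval harness)"
    # No directory signal fired; check `_api_lessons` by hand.
    return None
```

Calling `likely_phase(Path("."))` from the repo root gives a first guess; treat it as a hint, not a verdict.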

## Layout

```
.
├── PLAN.md
├── README.md
├── CLAUDE.md
├── sources.json              # Vendor catalog (corn/soy/wheat by source)
├── requirements.txt
├── Dockerfile
├── deploy/
│   └── docker-compose.yml
├── .gitea/workflows/
│   ├── refresh.yml           # Monthly cron: scrape + index + image
│   └── image-only.yml        # On-demand: code-only ship cycle
├── scrape/
│   ├── runner.py             # `python -m scrape.runner --source bayer_seeds`
│   ├── changelog.py
│   └── sources/
│       ├── bayer_seeds.py
│       ├── golden_harvest.py
│       ├── nk.py
│       ├── agripro.py
│       ├── becks_pfr.py
│       └── becks_products.py
├── rag/                      # chunk + embed + Chroma + BM25
├── docs_mcp/                 # FastMCP server + lessons.md
├── eval/                     # Golden-query harness
└── scripts/                  # registry_gc.py, usage_report.py
```

## Conventions

- **Vendor sub-corpora**: each scraper writes
  `corpus/<source>/<source_key>.{md,json}`. `.md` is the LLM-visible
  text (chunk_0 preamble + body); `.json` is the sidecar metadata.
- **Tool docstrings are user interface** — the LLM uses them to
  decide whether to call. Treat like button labels.
- **Defensive fallback for retrieval** — reranker/BM25/external
  deps must catch their specific exception and degrade to baseline.
  The MCP is in front of farmers making real seed-buying decisions.
- **Verify retrieval changes with eval/** — ship a retrieval change
  with numbers in the commit message.
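
The defensive-fallback convention can be sketched as follows. `RerankUnavailable` and `rerank_or_baseline` are hypothetical names for illustration; the real exception type depends on the client used to reach the reranker.

```python
import logging
from typing import Callable, Sequence

log = logging.getLogger(__name__)


class RerankUnavailable(Exception):
    """Hypothetical error for an unreachable or failing rerank sidecar."""


def rerank_or_baseline(
    query: str,
    hits: Sequence[str],
    rerank: Callable[[str, Sequence[str]], list[str]],
) -> list[str]:
    """Try the reranker; on its specific failure, keep the baseline order."""
    try:
        return rerank(query, hits)
    except RerankUnavailable as exc:
        # Catch only the dependency's own exception so real bugs still surface.
        log.warning("reranker unavailable, serving baseline order: %s", exc)
        return list(hits)
```

Catching the narrow exception, rather than `Exception`, keeps the degrade path from hiding programming errors.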

### Standard infrastructure choices

- **Embedding**: `nomic-embed-text` via Ollama (768-dim)
- **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
  `/v1/rerank` (shared `llama-rerank` sidecar with crop-chem-docs
  on trashpanda Tesla P4)
- **Vector store**: Chroma `PersistentClient`
- **Lexical store**: SQLite FTS5
- **Fusion**: RRF k=60
- **Transport**: streamable-HTTP in prod, stdio for local dev
- **MCP framework**: FastMCP with `stateless_http=True`
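
For reference, reciprocal rank fusion with k=60 scores each document as the sum of `1 / (k + rank)` over the rankings that contain it. A self-contained sketch; `rrf_fuse` is a hypothetical name and this is not necessarily how `rag/` implements it.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]    # e.g. Chroma order
lexical = ["b", "c", "d"]  # e.g. FTS5 order
fused = rrf_fuse([dense, lexical])
```

Because only ranks are used, the dense and lexical scores never need to be put on a common scale.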

### Image name and package linking are repo-name-derived

`IMAGE` and `--package` derive from the repo at runtime via
`${{ github.repository_owner }}` / `${{ github.event.repository.name }}`.
The only workflow placeholders customized per clone are
`REGISTRY_PUSH=192.168.0.2:1234`, `REGISTRY_PULL=git.jpaul.io`,
and the `OLLAMA_URL` embed pool.

## Common commands

```bash
# Dev environment
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server
python -m docs_mcp.server --transport stdio
python -m docs_mcp.server --transport streamable-http --port 8000

# Eval
python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
```

## Gotchas

- **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
- **Reranker per-pair token limit**: jina-reranker GGUF rejects the
  ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate
  reranked docs to ~2000 chars.
- **FastMCP `stateless_http=True`**: critical for prod.
- **Runner shell is `/bin/sh` (dash)** in CI — no `${VAR::N}`.
- **Cloudflare 100 MB body cap**: push via LAN endpoint
  `192.168.0.2:1234`, pull via `git.jpaul.io`.
- **Golden Harvest disease scale is reversed (9 = best)** —
  normalize at chunk time.
- **Sitemap-listed PDF dates on Golden Harvest are stale** —
  resolve the live PDF URL from the product HTML page.
- **No IPv6** — DNS for git.jpaul.io returns IPv6-only. Clone via
  HTTPS, not SSH (port 22 returns Network unreachable).
- **Pioneer is OFF-LIMITS** — do NOT add a `pioneer.py` scraper.
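
The reranker gotcha above comes down to capping every document before the batch is sent, since one oversized document fails the whole request. A sketch under the assumption that a plain character cap is an acceptable proxy for the token limit; `truncate_for_rerank` is a hypothetical helper.

```python
def truncate_for_rerank(docs: list[str], max_chars: int = 2000) -> list[str]:
    """Cap each candidate so no single doc can fail the whole batch."""
    # 2000 chars is the figure from the gotcha; the true limit is in
    # tokens, so this is a conservative character-based stand-in.
    return [doc[:max_chars] for doc in docs]
```

The cap applies only to the text sent for scoring; the full document can still be returned to the caller.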

## Out-of-scope concerns

- **Reverse proxy / TLS** — Drawbar's compose handles it
- **MetaMCP** — separate aggregator
- **GPU container orchestration** — shared `llama-rerank` sidecar
- **University extension trial data** — deferred to v1.5
@@ -0,0 +1,61 @@
# seed-mcp MCP server — production image.
#
# Structure: copy code first, then the regenerable indexes last so a
# code-only change doesn't invalidate the corpus COPY layer.
#
# The container runs the MCP server via streamable-http on PORT 8000.
# Override via MCP_HOST / MCP_PORT env if you front it with a different
# reverse-proxy setup.
#
# Image is self-contained — corpus, Chroma collection, and BM25 db are
# all baked in. Drawbar's docker-compose pulls the image and runs it;
# no host volume mounts required for serve.
#
# RERANK_URL is set at compose time (points at the llama.cpp sidecar
# on trashpanda's Tesla P4 — SHARED with crop-chem-docs). OLLAMA_URL
# is set at compose time too. Defaults below assume same-stack Docker
# DNS names.

FROM python:3.12-slim

WORKDIR /app

# Install Python deps first for cacheability.
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt

# Code.
COPY scrape /app/scrape
COPY rag /app/rag
COPY docs_mcp /app/docs_mcp

# Source catalog. Lists the corpus sources (Bayer seeds + Golden
# Harvest + NK + AgriPro + Beck's PFR + Beck's products).
COPY sources.json /app/

# Regenerable indexes. CI builds these from corpus/ in the same job
# that builds the image. Listed last so code changes don't invalidate
# the COPY layer cache for these (much larger) directories.
#
# bm25/ is only consulted when HYBRID_SEARCH=true (the server falls
# back to dense-only if it's missing).
COPY corpus /app/corpus
COPY chroma /app/chroma
COPY bm25 /app/bm25

ENV PYTHONUNBUFFERED=1 \
    PRODUCT_NAME=crop_seed \
    MCP_TRANSPORT=streamable-http \
    MCP_HOST=0.0.0.0 \
    MCP_PORT=8000 \
    HYBRID_SEARCH=true \
    OLLAMA_URL=http://ollama:11434 \
    RERANK_URL=http://llama-rerank:8080
# Defaults above assume the MCP container shares a Docker network
# with services named `ollama` and `llama-rerank`. Override either
# in the compose `environment:` block if your stack uses different
# service names or if you want to point at off-stack hosts.

EXPOSE 8000

ENTRYPOINT ["python", "-m", "docs_mcp.server"]
@@ -0,0 +1,647 @@
# Docs MCP Server — Build Guide

A reusable recipe for building a hosted MCP server over a product's
public documentation. Distilled from one production build; everything
product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that
any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, find inconsistencies, and (optionally) submit doc bugs
back upstream.

---

## What you're building

A pipeline with these stages:

```
upstream docs portal
        │
        ▼
   scrape         ──► corpus/<bundle>/<page>.md + .json sidecar
        │
        ▼
   chunk + embed  ──► chroma/ (dense vectors)
        │         ──► bm25/ (FTS5 lexical index)
        ▼
   MCP server     ──► search_docs / get_page / diff_versions / weekly_digest /
                      find_doc_inconsistencies / submit_doc_bug / ...
        │
        ▼
   reverse proxy / Cloudflare Tunnel ──► public endpoint
```

Two CI cadences:

- **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
  image build & push.
- **On-demand image-only** (~18 min): code-only rebuild from
  committed corpus, image build & push.

Supporting infrastructure: a container registry (self-hosted Gitea
works well), a host running Docker Compose, Watchtower auto-updating
from `:latest`, and a reverse proxy in front.

---

## Build phases

Each phase is a discrete, shippable unit. Build them in order; each
one is useful on its own and unlocks the next. Realistic effort per
phase is given as a rough order of magnitude. Total: roughly 2–3
weeks of focused work for the full stack.

### Phase 0 — Project skeleton *(half a day)*

Goals: directory layout, dependency manifest, virtualenv.

- Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`,
  `docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
- `requirements.txt` with the dependencies you'll need across all
  phases (FastMCP, chromadb, httpx, beautifulsoup4 or another HTML
  parser, an ollama or sentence-transformers client, etc.).
- `python -m venv venv`, and pin the Python version (3.11 or 3.12 — be
  conservative; some embedding libraries have version-specific
  wheels).
- `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/`
  (regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`,
  `.pytest_cache/`.

### Phase 1 — Scraper *(2–4 days, product-specific)*

This is the most product-dependent phase. The goal is to write a
scraper that produces a normalized corpus layout regardless of
upstream portal shape.

Output shape (mandatory):

```
corpus/
  <bundle_id>/            # one dir per "doc bundle" — see Glossary
    <page_id>.md          # markdown body
    <page_id>.json        # sidecar with structured metadata
    ...
bundles.json              # catalog of bundles with metadata
```

**Bundle metadata** (`bundles.json` is a list of these):

```json
{
  "slug": "<bundle_id>",
  "title": "User-facing title",
  "version": "10.9",
  "platform": "VMware vSphere", // may be null
  "product": "Admin Guide", // optional but useful
  "language": "en-US",
  "page_count": 127,
  "dates": {
    "Added on": "2024-01-15",
    "Updated on": "2026-05-20"
  },
  "landing_page": "<page_id>"
}
```

**Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
The one field that matters cross-cutting is `topic_cluster` (see
Phase 9):

```json
{
  "bundle_id": "<bundle_id>",
  "page_id": "<page_id>",
  "title": "How to ...",
  "ordinal": 42,
  "topic_cluster": {
    "clustering_title": "How to ...",
    "clustered_topics": [
      {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
      {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
    ]
  }
}
```

If the portal exposes a cross-version "this page corresponds to that
page" mapping, capture it here. If it doesn't, you can synthesize a
filename-based fallback (same filename across bundle versions = same
topic) and live without the editor-curated mapping. The features that
read `topic_cluster` (`list_cluster`, `diff_versions`,
`find_doc_inconsistencies`, parts of `weekly_digest`) will work
either way; they're more accurate with real clusters.
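
A filename-based fallback of the kind described above might look like the sketch below. It assumes the scraper can hand over `(bundle_id, page_id, title)` tuples; `synthesize_clusters` is a hypothetical helper name, not part of the template.

```python
from collections import defaultdict


def synthesize_clusters(pages):
    """Group (bundle_id, page_id, title) tuples by filename across bundles.

    Returns {(bundle_id, page_id): topic_cluster_dict}. Pages whose filename
    appears in only one bundle get no cluster entry.
    """
    by_filename = defaultdict(list)
    for bundle_id, page_id, title in pages:
        by_filename[page_id].append((bundle_id, title))

    clusters = {}
    for page_id, members in by_filename.items():
        if len(members) < 2:
            continue  # no cross-version peer to point at
        peers = [
            {"bundle_id": b, "page_id": page_id, "clustering_title": t}
            for b, t in sorted(members)
        ]
        for bundle_id, title in members:
            clusters[(bundle_id, page_id)] = {
                "clustering_title": title,
                "clustered_topics": peers,
            }
    return clusters
```

Each returned value has the same shape as the `topic_cluster` field in the sidecar example, so it can be written straight into `<page_id>.json`.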

**Patterns that recur across doc portals:**

- Most modern doc portals are SPAs. Plain `requests.get` won't see
  rendered content. Either find the underlying API the SPA calls (the
  cheapest, most reliable path), or fall back to a headless browser
  (Playwright). The API path is almost always available; sniff the
  network tab.
- Portals usually expose a "bundle/topic" hierarchy under the hood
  (Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
  it to `bundles.json` + `corpus/<bundle>/<page>`.
- Many portals expose `?save_local=` or `.pdf` rendered versions; the
  HTML they serve is structurally cleaner than what the page shows
  through the SPA shell.

**`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
`summarize_diff()`, `render_human()`, `walk_history()` and the
`--json` / `--history-out` modes. Mostly reusable as-is; the only
product-specific bit is the path layout assumption.

### Phase 2 — Chunking + embeddings + Chroma *(2 days)*

Goal: build a queryable dense index from the scraped corpus.

- `rag/chunk.py` — split each page's markdown into ~400-600 token
  chunks. Strategy that works: paragraph-aware splitter with a
  rich "chunk 0" containing the page title + 1-sentence summary +
  bag-of-words from key terms. Chunk 0 is what dense retrieval lands
  on first; getting it right dominates retrieval quality.
- `rag/embeddings.py` — pluggable embedder. Recommended start:
  Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
  Other defensible choices: `text-embedding-3-small` (OpenAI),
  `bge-m3` (also via Ollama). The embedder is a Chroma
  `EmbeddingFunction` that returns `list[list[float]]` for a list
  of texts.
- `rag/index.py` — orchestrates: read corpus → emit chunks (with
  metadata: bundle_id, page_id, version, platform, ordinal) →
  upsert into Chroma collection. `--rebuild` flag for a clean
  reindex. Run via `python -m rag.index --rebuild`.
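
The paragraph-aware splitter with a rich chunk 0 can be sketched roughly as follows. This is an illustration under stated assumptions, not the template's `rag/chunk.py`: token counts are approximated by whitespace-separated words, and the `chunk_page` name and its parameters are hypothetical.

```python
import re


def chunk_page(title, summary, body, key_terms=(), max_tokens=500):
    """Split markdown into paragraph-aligned chunks, led by a rich chunk 0.

    Token counts are approximated by whitespace-separated words.
    """
    # Chunk 0: title + one-sentence summary + keyword bag.
    chunk0 = f"{title}\n\n{summary}"
    if key_terms:
        chunk0 += "\n\nKey terms: " + ", ".join(key_terms)
    chunks = [chunk0]

    current, size = [], 0
    for para in re.split(r"\n\s*\n", body.strip()):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        # Close the running chunk before it would exceed the budget.
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than `max_tokens` is kept whole in this sketch; a real splitter would probably fall back to sentence boundaries for that case.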

Chroma settings: `PersistentClient(path="chroma/")` and
`Settings(anonymized_telemetry=False)`. Single collection
(`<product>_docs`).

**GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
(via Ollama with `NVIDIA_VISIBLE_DEVICES`) takes ~10 minutes. Two
GPUs in parallel: ~5 minutes. The orchestrator just needs to
load-balance HTTP requests across multiple Ollama endpoints.

### Phase 3 — MCP server skeleton *(1 day)*

Goal: working FastMCP server with three tools — `search_docs`,
`get_page`, `list_versions`.

- `docs_mcp/server.py` — `FastMCP("<product>-docs", stateless_http=True)`.
  `stateless_http=True` is critical for production hosting: every
  request creates an ephemeral session, so container recreates don't
  produce a 404 storm from stale `mcp-session-id` headers on
  clients.
- Lazy initialization for everything expensive (Chroma client,
  embedder, bundles catalog) so the server starts cleanly even when
  Ollama is briefly unreachable.
- Tool: `search_docs(query, version=None, platform=None,
  bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
  source URLs.
- Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
  metadata.
- Tool: `list_versions()`. Returns the version/platform facets
  available, drawn from `bundles.json`. Helps the LLM pick filter
  values.

Transports: stdio (for local Claude Desktop dev), streamable-HTTP
(for hosted production). One argparse switch.

```python
from typing import Annotated

from pydantic import Field

# `mcp` is the FastMCP instance created in docs_mcp/server.py.
@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
    # platform, bundle_id, and k follow the same Annotated[...] pattern
) -> str:
    ...
```

The tool descriptions are first-class context — the LLM reads them
and decides whether to call the tool. Treat them as button labels;
use "Call when..." / "Use proactively whenever..." phrasings.

### Phase 4 — Containerization *(1 day)*

Goal: image you can run anywhere.

- `Dockerfile`: Python 3.12-slim base, install requirements, COPY
  `scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
  + (later) `bm25/`. Don't COPY `scripts/` — those stay external
  for ops use only.
- `ENTRYPOINT ["python", "-m", "docs_mcp.server",
  "--transport", "streamable-http"]`. Configurable host/port via env.
- `deploy/docker-compose.yml`: one service, named volumes for usage
  logs and any state, Watchtower label, depends_on for the reranker
  sidecar (Phase 6).

Smoke-test locally: `docker compose up` should expose
`http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC.

### Phase 5 — CI on self-hosted Gitea Actions *(1–2 days)*

Goal: weekly cron rebuild + on-demand code-only ship cycle.

**Two workflows, two cadences:**

| Workflow | Trigger | Steps | Runtime |
|---|---|---|---|
| `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
| `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |

**Critical settings (learned the hard way):**

- `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
  (shallow), which breaks any step that walks git history (changelog,
  digest history walker). Pay the ~10 second cost; never debug a
  "0-byte history file" mystery.
- `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
- Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
  expansion doesn't exist; use `cut` / `printf` / `awk`.

**Retry-on-race pattern for long-running scrapes:**

```bash
attempt=1
while [ $attempt -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  [ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
done
```

Works because scrape commits only touch `corpus/` + `bundles.json`,
and code merges only touch `.py` / `.yml` — disjoint paths, trivially
clean rebases.

**Image tagging — three tags per build:**

| Tag | Purpose |
|---|---|
| `:latest` | Watchtower watches this for auto-deploy |
| `:<sha12>` | Immutable; rollback target |
| `:<YYYY.MM.DD>` | Human-readable in incident notes |

Same tag set on every build; rollback is a one-line compose edit
to pin `:<sha>` instead of `:latest`.

**Container registry behind Cloudflare:**

Cloudflare's free tier has a 100 MB request body limit. Big image
layers (Chroma index can easily be 800+ MB) exceed it on push. The
fix is a LAN registry endpoint for push, public hostname for pull:

```yaml
env:
  REGISTRY_PUSH: <lan-ip>:<port>      # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>    # response bodies aren't capped
```

The runner needs the LAN endpoint listed under `insecure-registries`
in `/etc/docker/daemon.json`. Costs nothing operationally; saves hours
of debugging.

**Registry GC:** a weekly cron step in the workflow walks the package
versions, keeps `:latest`, the N most-recent date tags, and anything
pushed in the last 90 days, and deletes the rest. It is worth the
~50 LOC; the package GC on the Gitea side reclaims the disk afterwards.

### Phase 6 — Reranker *(half a day)*

Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
dense candidates.

- A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
  `jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
  GPU strongly recommended (CPU latency is unworkable for live
  queries).
- `_rerank(query, docs)` helper in the server: POST to the endpoint,
  apply the scores, re-sort the top-N candidates. Defensive: on any
  failure log a warning and fall through to dense-only.
- Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
  pull candidates for reranking; 200 is a good default),
  `RERANK_TIMEOUT` (30s for cold-start tolerance).
- **Watch the per-pair token limit.** Jina's GGUF reports
  `n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
  any pair exceeds it. Truncate doc text to ~2000 chars before
  reranking. The full untruncated chunk still goes back to the user;
  truncation is only for the reranker scoring path.
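
The truncate-score-fallback flow above can be sketched without committing to a particular HTTP payload. In this sketch `score_fn` is a hypothetical stand-in for the POST to `RERANK_URL`, and the `rerank` signature is an assumption rather than the template's `_rerank`.

```python
import logging

log = logging.getLogger("rerank")
MAX_RERANK_CHARS = 2000  # applies to the scoring path only


def rerank(query, docs, score_fn, top_n=10):
    """Re-sort docs by reranker score; keep the input order on any failure.

    score_fn(query, texts) -> list[float] is a hypothetical stand-in for
    the HTTP call to the rerank endpoint.
    """
    truncated = [d[:MAX_RERANK_CHARS] for d in docs]
    try:
        scores = score_fn(query, truncated)
        if len(scores) != len(docs):
            raise ValueError("score count does not match doc count")
    except Exception as exc:  # any failure degrades to dense-only ordering
        log.warning("rerank failed, using dense order: %s", exc)
        return docs[:top_n]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Hand back the full, untruncated text for each winner.
    return [docs[i] for i in order[:top_n]]
```

Passing the scorer in as a callable keeps the fallback logic testable without a running sidecar.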

### Phase 7 — Eval harness *(1 day)*

Goal: hand-curated golden queries + standard metrics so you can
measure the impact of any retrieval change.

- `eval/queries.jsonl`: 20–25 hand-curated queries with expected
  pages. Spread across versions, platforms, and difficulty levels.
  Include the queries that "obviously" should work and DON'T —
  those are the ones to track.
- `eval/retrievers.py`: a `Retriever` protocol with concrete
  implementations: `DenseRetriever`, `RerankedRetriever`,
  `BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
  matrix dimension per knob.
- `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
  retrievers; emits a markdown comparison table at
  `eval/results/<baseline>.md`. Commit the result so PRs land with
  the A/B evidence in the diff.

Three numbers are enough — don't overengineer. The hand-curated
queries are the value; the metrics are just a stable way to score
them.
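
For reference, the three metrics reduce to a few lines each. The sketch below assumes binary relevance (a page is either in the expected set or not) and per-query inputs of a ranked page-id list plus a set of expected page ids; the function names are illustrative, not the harness's own.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit, or 0.0 if none is retrieved."""
    for i, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / i
    return 0.0


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the expected pages that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, page in enumerate(ranked[:k], start=1)
        if page in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```

MRR is then the mean of `reciprocal_rank` over all queries in `eval/queries.jsonl`; the other two are averaged the same way.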

### Phase 8 — BM25 + Hybrid retrieval *(half a day, conditional)*

**Skip unless your eval shows specific failure modes.** Dense
embeddings + cross-encoder reranker handle most queries. The case
where they don't: queries with rare technical tokens (filenames,
language names, error codes) get buried at dense rank 1000+ by a
much larger prose corpus that's semantically nearby. The reranker
only sees top-200, so it never gets a shot.

- `rag/bm25.py`: SQLite FTS5 index via the stdlib `sqlite3` module,
  on-disk (`bm25/<product>.db`). Two tables — metadata table keyed by
  rowid, FTS5 virtual table for full-text. Sanitize the query
  (strip FTS5 reserved keywords, OR-join tokens for recall). ~210
  LOC.
- `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
  Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
  ordered ids plus per-retriever contribution dict for telemetry.
- `search_docs` hybrid path: run dense + BM25 in parallel,
  RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
  `HYBRID_SEARCH=true`.
- Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
  to usage logs so you can measure whether BM25 is actually earning
  its keep on production traffic.
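
The query sanitization mentioned for `rag/bm25.py` could look roughly like this. It is a sketch, not the template's code: `sanitize_fts5_query` is a hypothetical name, and dropping operator words case-insensitively is an assumption (FTS5 itself only treats the uppercase forms as operators).

```python
import re

# Uppercase forms of these words are operators in FTS5 query syntax.
_FTS5_KEYWORDS = {"AND", "OR", "NOT", "NEAR"}


def sanitize_fts5_query(query: str) -> str:
    """Turn free text into an OR-joined FTS5 MATCH expression."""
    tokens = re.findall(r"\w+", query)
    kept = [t for t in tokens if t.upper() not in _FTS5_KEYWORDS]
    # Double-quoting makes FTS5 treat each token as a literal string.
    return " OR ".join(f'"{t}"' for t in kept)
```

An empty return value means nothing searchable survived, in which case the caller would presumably skip the BM25 leg rather than run an empty `MATCH`.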

If after 4–6 weeks of production data you see `bm25_only >= 80%`,
you can simplify to BM25-only (much less infrastructure). If
`both >= 50%`, hybrid is acting as tie-breaker not rescue — keep it
or simplify depending on how much you care about the long tail.
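
The fusion and `top1_source` labelling described in this phase can be sketched as below. Ranks are taken as 1-based, which is an assumption (the formula above does not say), and the function names and return shapes are illustrative.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists; `rankings` maps retriever name -> ordered ids.

    Returns (fused_ids, contributions), where contributions[id] maps each
    retriever name to the score it added for that id.
    """
    scores = defaultdict(float)
    contributions = defaultdict(dict)
    for name, ids in rankings.items():
        for rank, doc_id in enumerate(ids, start=1):
            part = 1.0 / (k + rank)
            scores[doc_id] += part
            contributions[doc_id][name] = part
    # Highest fused score first; ties broken by id for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused, dict(contributions)


def top1_source(fused, contributions):
    """Label the top hit as '<retriever>_only' or 'both'."""
    names = set(contributions[fused[0]])
    return "both" if len(names) > 1 else f"{next(iter(names))}_only"
```

With retrievers named `dense` and `bm25`, `top1_source` yields exactly the `dense_only` / `bm25_only` / `both` values the usage log expects.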

### Phase 9 — Multi-version diff tooling *(1 day, if applicable)*

**Only relevant if the product has multiple maintained versions.**

- `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
  diff between two versions of the same page. Two matching
  strategies: editor-curated `topic_cluster` peer (if the portal
  exposes it), or same-filename fallback.
- `list_cluster(bundle_id, page_id)`: list cross-version peers
  for one page.
- `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
  removed / changed pages between two bundles, sorted by churn.
- `_diff_churn(a, b)`: small helper, ~15 LOC that counts the changed
  lines from `difflib.unified_diff(..., n=0)` (zero context lines).
  Used by `bundle_changelog`, `find_doc_inconsistencies`, and
  `weekly_digest`.
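
A churn helper along these lines might look like the following. The normalization (changed lines divided by the longer page's line count) is an assumption made for this sketch; the guide only says the helper counts diff lines.

```python
import difflib


def diff_churn(a: str, b: str) -> float:
    """Changed lines (added + removed) relative to the longer page."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    changed = 0
    diff = difflib.unified_diff(a_lines, b_lines, n=0, lineterm="")
    for i, line in enumerate(diff):
        if i < 2:
            continue  # the '---' / '+++' file header lines
        if line.startswith("@@"):
            continue  # hunk marker
        # With n=0 there are no context lines, so this is a '+' or '-' line.
        changed += 1
    total = max(len(a_lines), len(b_lines), 1)
    return changed / total
```

Skipping the two header lines by position rather than by prefix matters for markdown, where a removed `---` rule shows up in the diff as `----`. A replaced line counts twice (one removal, one addition), so the ratio can exceed 1.0 for a full rewrite.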

### Phase 10 — Usage logging *(half a day)*

Goal: per-call JSONL telemetry so you can answer "what are people
actually asking for" and "is the new feature getting used."

- `docs_mcp/usage.py`: `TimedCall` context manager that captures
  tool name, args, elapsed time, hits returned, any extra fields
  set by the tool via `_call.set(key=value)`. Writes JSONL to
  `var/logs/usage.jsonl`, rotated daily, kept 90 days.
- Mount the log dir as a named compose volume so logs survive
  container recreates.
- `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
  the JSONL files, prints per-tool counts, top queries, 0-hit
  queries, filter usage histogram, reranker activity. Markdown
  output flag for piping into weekly digest emails.

What to log: query text, filters, hits returned, elapsed_ms,
reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
to log: anything PII-shaped. The corpus is public, queries are
usually about the product, not personal — but be deliberate.

### Phase 11 — Curated knowledge layer *(2 days)*

The "RAG can't tell you what isn't in the docs" gap. Surfaces:

- **API quickstart repos** if the product has them. Ingest the
  example scripts (Python, PowerShell, curl) into the corpus.
  Rewrite chunk-0 for each script to embed naturally — explicit
  natural-language H1, task description sentence, keyword bag.
  Dense embeddings need an anchor.
- **A curated `<product>_api_lessons` markdown doc** for things
  the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
  patterns, schema bugs you've hit, platform-detection quirks.
  Surface as a dedicated MCP tool whose description tells the LLM:
  *"Call proactively whenever the user asks you to write a script
  / integrate with the API / debug a 4xx response."*
- **An auto-hint banner** in `search_docs` results — when the
  query matches a script/API trigger word, render a one-line nudge
  at the top of results pointing at the dedicated tool.
  Belt-and-suspenders for queries where the LLM doesn't think to
  call it proactively.

### Phase 12 — Doc-bug workflow tools *(1 day, optional)*

Two tools that pair up to enable a *"check the docs for
inconsistencies, draft bugs, confirm, submit"* workflow.

- `find_doc_inconsistencies(scope_query, version=None, platform=None,
  max_pages=30, checks=None)`: deterministic, read-only. Two checks:
  cross-version drift (pages whose content shifted between
  immediate-previous versions in the actionable 10–60% churn band) and
  redirect-chain detection (short pages whose body is just a "see
  [other page] for details" pointer). Heavy lifting is line-level
  diff (`difflib`) against editor-curated cluster peers; the model
  judges which findings are real bugs.

- `submit_doc_bug(page_url, content, email=None, rating=None,
  like=None)`: POSTs to the docs portal's feedback endpoint.
  Env-gated by `DOC_BUG_SUBMIT_ENABLED=true` so dev/staging
  deployments can't accidentally hit the upstream. The tool's
  docstring is loud about a mandatory operator-confirmation
  workflow per submission — LLM must draft, show, ask, then
  submit. Explicit *"do not loop"* instruction. Defensive
  validation upfront (URL host matches expected portal, content
  non-empty, etc.) so the LLM gets a clean error instead of a
  rejected POST.

**You'll need to find the docs portal's feedback endpoint.** Most
portals route the "Was this helpful?" widget through a backend
API; sniff the browser network tab on the live site. The payload
shape varies; common fields: content/body, page url/href, optional
email, optional rating, optional thumbs. Most accept anonymous
POSTs with no captcha at the JSON-API layer (even if the widget
shows a captcha). Validate before you ship — and if the endpoint
has rate limits or captcha enforcement, the tool returns a clean
"submission rejected — paste manually at <url>" fallback.

The whole point is the per-bug operator confirmation in the
LLM-side conversation flow; the tool description enforces it. Do
not bypass.

### Phase 13 — Weekly digest tool *(half a day)*

Goal: a tool that answers *"what changed in the docs in the last N
days?"* with no runtime git dependency (the prod container has no
git).

- Extend `scrape/changelog.py` with `--json` (one-shot structured
  output) and `--history-out PATH` (walks `git log --first-parent
  --since="<N> days ago"` for corpus-touching commits, writes one
  JSON line per commit to a JSONL file).
- CI workflows write the JSONL file into the image at build time:
  `corpus/.digest/history.jsonl`. Both `refresh.yml` and
  `image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
- New MCP tool `weekly_digest(days=7, version=None, platform=None,
  max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
  filters to the window, applies version/platform via
  `bundles.json` metadata, aggregates per-bundle change counts and
  page lists, renders markdown.
- Post-filter totals are critical: the headline "X page changes
  across Y bundles" must compute X from the filtered set, not the
  raw record count. Otherwise filtered calls look wrong to the
  reader.
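
The post-filter totals rule can be illustrated with a small aggregation sketch. The record and catalog shapes here are hypothetical (the guide does not specify the JSONL schema), and counting each page once per bundle is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def digest_totals(records, bundles, days=7, version=None, platform=None, now=None):
    """Aggregate per-bundle page changes after window and facet filters.

    records: dicts like {"date": ISO-8601 str with offset,
                         "bundle_id": str, "pages": [str]}
    bundles: {bundle_id: {"version": ..., "platform": ...}}
    Both shapes are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    per_bundle = defaultdict(set)
    for rec in records:
        if datetime.fromisoformat(rec["date"]) < cutoff:
            continue  # outside the requested window
        meta = bundles.get(rec["bundle_id"], {})
        if version is not None and meta.get("version") != version:
            continue
        if platform is not None and meta.get("platform") != platform:
            continue
        per_bundle[rec["bundle_id"]].update(rec["pages"])
    # The headline is computed from the filtered set, never the raw records.
    total_pages = sum(len(p) for p in per_bundle.values())
    headline = f"{total_pages} page changes across {len(per_bundle)} bundles"
    return headline, {b: sorted(p) for b, p in per_bundle.items()}
```

Because both numbers in the headline come from `per_bundle`, a call filtered to one version can never report totals that include other versions.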

Out of scope but trivial bolt-ons: scheduled HTML email of the
digest, auto-publish to a blog, per-page diff excerpts as a
follow-up tool.

---

## Standard tool set

By the end you'll have ~15 tools registered. Production-tested
shape:

| Tool | What it does |
|---|---|
| `search_docs` | Semantic search with version/platform/bundle filters |
| `get_page` | Full markdown + metadata for one page |
| `list_versions` | Discover available facet values |
| `list_cluster` | Cross-version peers for one page (if applicable) |
| `diff_versions` | Unified diff of a page across two versions |
| `bundle_changelog` | Added / removed / changed pages between two bundles |
| `weekly_digest` | What changed in the last N days, with filters |
| `corpus_status` | Freshness + size of the knowledge base |
| `find_doc_inconsistencies` | Scoped scan for doc bugs |
| `submit_doc_bug` | Submit a drafted bug (env-gated, operator-confirmed) |
| `<product>_api_lessons` | Curated API gotchas, proactively-called |
| product-specific tools | Interop matrix, lifecycle queries, etc. |

---

## Per-product customization checklist

When applying this template to a new product, here's what you have
to figure out yourself — everything else is shared infrastructure:

- **Doc portal mechanics**
  - URL pattern for pages
  - Bundle/version concept (Zoomin "bundle", Madcap "project",
    GitBook "space", Docusaurus "docs version" — same idea, different
    name)
  - SPA backing API (sniff the network tab) or fallback to
    headless browser
  - How `topic_cluster`-equivalent cross-version peers are exposed
    (or whether you synthesize them from filenames)
- **Bundle metadata schema**
  - What does `version` look like? Semver, calendar, named?
  - What does `platform` mean for this product? Is there a useful
    facet at all?
  - Other useful facets (language, product line, edition)?
- **Filterable facets** for `search_docs`
  - One filter per high-cardinality facet
  - Skip filters that have <5 distinct values — they're not worth
    the surface area
- **Feedback endpoint** (for `submit_doc_bug`, if you want it)
  - URL of the POST endpoint
  - Required + optional payload fields
  - Captcha / rate-limit behavior
  - Whether anonymous submissions are accepted
- **Curated knowledge** for the `_api_lessons` tool
  - What does the product's API documentation NOT say that you've
    learned from real integration work?
- **Quickstart / example repos**
  - Does the vendor publish working code? Ingest it; rewrite
    chunk-0 for natural-language retrieval.

---

## Decisions worth carrying forward

Things you'll save time on by deciding the same way again:

- **Tool descriptions are user interface.** The LLM reads them
  verbatim and decides whether to call the tool. *"Use when..."*
  and *"Call proactively whenever..."* are real surfaces; treat
  them like button labels. Most retrieval improvements turn out
  to be tool-description rewrites in disguise.
- **`stateless_http=True`** on the FastMCP server. Eliminates
  whole categories of session-ID-related 404 storms after
  container recreates.
- **Pre-bake everything at CI time.** No runtime calls to git,
  external services, or anything you wouldn't trust on a
  Cloudflare outage. If the digest needs git history, write a
  JSONL file at CI time. If the lessons doc needs to load fast,
  bake it into the image.
- **Env-gate every side-effecting tool.** Off by default in dev;
  on only in production compose. Belt and suspenders against
  accidental writes from staging environments.
- **Operator-confirmation pattern for side-effecting tools.**
  The tool docstring is the only place to enforce
  human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
  "show-confirm-then-submit" — those phrasings work.
- **Verify with hand-curated golden queries before shipping any
  retrieval change.** Numbers in the diff, in the commit message.
  Don't ship retrieval changes on vibes.
- **Two-cadence CI** (weekly scrape vs on-demand code-only)
  saves hours per code iteration once you're past the
  one-iteration-a-week stage.
- **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
  what Watchtower watches; `:<sha>` is your safety net. Rollback
  is a one-line compose edit, not a redeploy.
- **Usage logging is non-negotiable.** You will be wrong about
  what people use. Capture the truth from day one; let it tell
  you which features to keep building and which to delete.

---

## Glossary

- **Bundle** — one logical doc set in the portal. Zoomin calls
  them bundles; Madcap calls them projects; the concept is the
  same: a versioned, titled collection of pages. One dir under
  `corpus/`.
- **Page** — one HTML page in a bundle. One `.md` + one `.json`
  sidecar under the bundle dir.
- **Topic cluster** — Zoomin's name for "this page in version
  10.9 corresponds to that page in version 10.8." Stored in the
  per-page sidecar. The portal-agnostic concept is "cross-version
  peer mapping."
- **Chunk** — a unit of text that gets independently embedded and
  stored in Chroma. Target ~400-600 tokens; preserve paragraph
  boundaries.
- **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
  lists from independent retrievers without score calibration.

---

## What's deliberately NOT in this template

Decisions you should make per-product (not copy from the original
build):

- The reverse proxy and TLS termination layer. Could be Caddy,
  nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
- The Gateway / aggregator in front of multiple MCPs (MetaMCP is one
  option; you may not need any aggregator if you're running a
  single product MCP).
- The specific embedding model — `nomic-embed-text` is a strong
  default but newer / domain-specific models may be better for
  some products.
- The Ollama containers / GPU setup — depends on what hardware you
  have. The pattern is one container per GPU with explicit
  `NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
  across them.
- Whether to publish a blog series alongside the build. Strongly
  recommended (forces clarity, builds an audience), but optional.
@@ -0,0 +1,84 @@
# seed-mcp

MCP server over the public catalogs of major US row-crop seed
vendors — corn, soybeans, wheat. Sibling project to
[`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
(pesticide labels), feeding the same Drawbar farm-advisor AI.

The server exposes per-variety records with **agronomic ratings**,
**disease tolerance**, **trait stack**, **maturity**, and
**regional notes** — so the advisor can answer questions like
"which corn hybrid for sandy soil, drought-prone, RM ≤105 in
northeast Iowa?" without rummaging through individual brand sites.

## Vendor coverage

| Vendor | Verdict | Varieties | Notes |
|---|---|---|---|
| Bayer seeds (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | Same `cropscience.bayer.us` Next.js infra as crop-chem-docs |
| Golden Harvest (Syngenta) | 🟢 | ~175 | Sitemap + server-rendered HTML + Syngenta CDN PDFs |
| NK (Syngenta) | 🟢 | 29 | Shares PDF fetcher with Golden Harvest |
| AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views, server-rendered |
| Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API (no auth) |
| Beck's products | 🟡 | 860 | Identity-only until SeedIQ XHR sniffed |
| Pioneer (Corteva) | 🔴 | — | ToS bans automation — curated fallback lesson instead |

## Quick start

```bash
git clone https://git.jpaul.io/justin/seed-mcp.git
cd seed-mcp
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Run one scraper
python -m scrape.runner --source bayer_seeds --force

# Rebuild indexes
python -m rag.index --rebuild

# Local MCP server (stdio for Claude Desktop dev)
python -m docs_mcp.server --transport stdio
```

## Tools exposed

| Tool | Purpose |
|---|---|
| `search_docs` | Hybrid + rerank variety search with crop / RM / trait / region filters |
| `get_page` | Full variety record by `(source, source_key)` |
| `list_versions` | Discover crops, brands, traits, RM/MG ranges, wheat classes |
| `corpus_status` | Counts + freshness; useful for health probes |
| `crop_seed_api_lessons` | Curated agronomy lessons — Pioneer fallback, disease-scale normalization, regional placement heuristics |

## Build phases

This is a clone of [`docs-mcp-template`](https://git.jpaul.io/justin/docs-mcp-template).
The 13 phases in `PLAN.md` apply:

| Phase | Status |
|---|---|
| 0 — scaffold | done |
| 1 — first scraper (bayer_seeds) | next |
| 2 — chunk + index | pending |
| 3 — baseline MCP tools | template defaults |
| 4-5 — Dockerfile + CI | done (placeholders filled) |
| 6 — reranker | shares `llama-rerank` sidecar with crop-chem-docs |
| 7 — eval harness | pending (curate ~25 queries) |
| 8 — hybrid search | done (template) |
| 9 — diff_versions, list_cluster | optional |
| 11 — `crop_seed_api_lessons` curated layer | pending |

See `CLAUDE.md` for the canonical sidecar schema and the
disease-scale-normalization gotcha (Golden Harvest is reversed).
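
To show what normalizing a reversed disease scale could look like, here is a hedged sketch. It assumes integer ratings on a 1-9 scale and that "reversed" means the best and worst ends are swapped; the actual ranges and the canonical direction are defined in `CLAUDE.md`, not here. The `normalize_rating` helper is hypothetical.

```python
def normalize_rating(value: int, reversed_scale: bool, lo: int = 1, hi: int = 9) -> int:
    """Map a vendor rating onto the canonical direction.

    ASSUMPTION: ratings are integers in [lo, hi]; a reversed scale is
    mirrored around the midpoint so that lo <-> hi.
    """
    if not lo <= value <= hi:
        raise ValueError(f"rating {value} outside [{lo}, {hi}]")
    return lo + hi - value if reversed_scale else value


# A vendor flagged as reversed has its 2 mapped to 8; others pass through.
flipped = normalize_rating(2, reversed_scale=True)
unchanged = normalize_rating(2, reversed_scale=False)
```

Doing this once at scrape time, rather than at query time, keeps every stored record comparable across vendors.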

## Infrastructure

- **Registry**: `git.jpaul.io/justin/seed-mcp:latest` (Watchtower) /
  `:corpus-YYYY.MM.DD` (production pin)
- **Embedder**: shared Ollama pool with crop-chem-docs (Gitea-host
  GPUs + Windows Ollama; CI never hits trashpanda's production Ollama)
- **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla
  P4 (one container, both MCPs use it)
- **PRODUCT_NAME**: `crop_seed` (not `seed_mcp` — used in Chroma
  collection, BM25 db filename, and `crop_seed_api_lessons` tool)
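
The reason `PRODUCT_NAME` matters is that several identifiers are derived from it. The sketch below mirrors the f-string patterns visible in `docs_mcp/server.py` (`{PRODUCT_NAME}_docs`, `{PRODUCT_NAME}_docs.db`, `<product>_api_lessons`); the `derived_names` helper itself is illustrative only.

```python
def derived_names(product_name: str) -> dict[str, str]:
    """Identifiers that follow from PRODUCT_NAME in this template."""
    return {
        "chroma_collection": f"{product_name}_docs",
        "bm25_db_filename": f"{product_name}_docs.db",
        "lessons_tool": f"{product_name}_api_lessons",
    }


names = derived_names("crop_seed")
```

Changing `PRODUCT_NAME` after indexing would therefore point the server at a different collection and BM25 file than the ones the indexer wrote.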
@@ -0,0 +1,111 @@
# Hosting stack for a docs MCP server.
#
# Replace <product> below with your product name on first deploy.
# Volumes: usage logs are mounted to a host path so they survive
# Watchtower-driven container recreates.
#
# This template assumes a reverse proxy / Cloudflare Tunnel terminates
# TLS in front of port 8000. Adjust if your infra differs.

services:

  # The MCP server. Watchtower auto-pulls on :latest changes.
  <product>-docs-mcp:
    image: <registry>/<owner>/<product>-docs-mcp:latest
    container_name: <product>-docs-mcp
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      PRODUCT_NAME: "<product>"
      PRODUCT_DOCS_URL: "https://docs.example.com"

      # Streamable-HTTP transport. Stateless mode is required for
      # production: clients don't lose sessions when Watchtower
      # recreates the container.
      MCP_TRANSPORT: streamable-http
      MCP_HOST: 0.0.0.0
      MCP_PORT: "8000"

      # If you run MetaMCP or another gateway in front and reach
      # this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
      # add that hostname here. "*" disables the rebind check entirely.
      MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"

      # Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
      RERANK_URL: http://<product>-rerank:8080
      RERANK_POOL: "200"
      RERANK_TIMEOUT: "30"

      # Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
      # only after the eval harness shows the dense-only path
      # missing technical-term queries that BM25 catches.
      HYBRID_SEARCH: "true"

      # Phase 10 — usage telemetry.
      USAGE_LOG_DIR: /app/var/logs
      USAGE_LOG_KEEP_DAYS: "90"

      # Phase 12 — doc-bug submission gate. Off by default; on only
      # in production after you've verified the endpoint contract.
      DOC_BUG_SUBMIT_ENABLED: "false"
      # DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
    volumes:
      # Usage logs persist across container recreates.
      - ./<product>-docs-mcp-logs:/app/var/logs
    depends_on:
      - <product>-rerank
    labels:
      # Watchtower polls *only* containers with this label set true.
      com.centurylinklabs.watchtower.enable: "true"
    networks:
      - mcp

  # Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
  # Requires GPU access; adjust runtime/devices for your hardware.
  <product>-rerank:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: <product>-rerank
    restart: unless-stopped
    # Mount the GGUF model from the host. Download from huggingface
    # (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
    volumes:
      - /path/to/models:/models:ro
    command: >
      --model /models/jina-reranker-v2-base.Q8_0.gguf
      --reranking
      --host 0.0.0.0
      --port 8080
      --n-gpu-layers 99
      --ctx-size 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - mcp

  # Watchtower — auto-pulls :latest on push.
  # Only watches containers labeled `com.centurylinklabs.watchtower.enable=true`.
  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      WATCHTOWER_POLL_INTERVAL: "300"  # 5 min
      WATCHTOWER_LABEL_ENABLE: "true"
      WATCHTOWER_CLEANUP: "true"  # remove old images after pull
    # If your registry requires auth, mount a docker config:
    # volumes:
    #   - ./registry-auth.json:/config.json:ro
    networks:
      - mcp

networks:
  mcp:
    driver: bridge
@@ -0,0 +1,263 @@
"""MCP server skeleton — fill in PRODUCT_NAME and the tool bodies.

This file is the template's structural anchor. The phases described in
PLAN.md add or extend pieces of this file:

  Phase 3 — search_docs, get_page, list_versions stubs (you are here)
  Phase 6 — reranker integration in search_docs
  Phase 8 — BM25 + hybrid retrieval (HYBRID_SEARCH env gate, _rrf_fuse)
  Phase 9 — diff_versions, list_cluster, bundle_changelog
  Phase 10 — TimedCall wiring (already imported below)
  Phase 11 — <product>_api_lessons tool
  Phase 12 — find_doc_inconsistencies, submit_doc_bug
  Phase 13 — weekly_digest + _digest_history reader

Every stub below has a docstring + `raise NotImplementedError`. Replace
the body when you reach the corresponding phase. Keep the signatures
stable across products — clients depend on them.
"""
from __future__ import annotations

import json
import logging
import os
import re
from pathlib import Path
from typing import Annotated

from mcp.server.fastmcp import FastMCP
from pydantic import Field

from .usage import TimedCall

log = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Product-specific configuration. Set these for each new build.
# ---------------------------------------------------------------------------
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "crop_seed")
PRODUCT_DOCS_URL = os.environ.get("PRODUCT_DOCS_URL", "https://git.jpaul.io/justin/seed-mcp")
COLLECTION = f"{PRODUCT_NAME}_docs"

# Paths inside the deployed container (and matching layout locally for dev).
ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
CHROMA_DIR = ROOT / "chroma"
BM25_DB = Path(os.environ.get("BM25_DB", str(ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
BUNDLES_JSON = ROOT / "bundles.json"

# ---------------------------------------------------------------------------
# Feature flags (Phase 6 / 8 / 12 enable these as you ship each phase).
# ---------------------------------------------------------------------------
RERANK_URL = os.environ.get("RERANK_URL", "").rstrip("/") or None
RERANK_POOL = int(os.environ.get("RERANK_POOL", "50"))
RERANK_TIMEOUT = float(os.environ.get("RERANK_TIMEOUT", "30"))

HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "").lower() in ("true", "1", "yes", "on")
RRF_K = int(os.environ.get("RRF_K", "60"))

DOC_BUG_SUBMIT_ENABLED = os.environ.get("DOC_BUG_SUBMIT_ENABLED", "").lower() in ("true", "1", "yes", "on")
DOC_BUG_API_URL = os.environ.get("DOC_BUG_API_URL", "")  # product-specific endpoint
DOC_BUG_TIMEOUT = float(os.environ.get("DOC_BUG_TIMEOUT", "15"))


# ---------------------------------------------------------------------------
# FastMCP setup.
#
# stateless_http=True — every request creates an ephemeral session and
# discards it on return. Critical for production: clients don't get
# 404 storms when the container is recreated by Watchtower.
# ---------------------------------------------------------------------------
mcp = FastMCP(f"{PRODUCT_NAME}-docs", stateless_http=True)


# ---------------------------------------------------------------------------
# Lazy helpers — instantiate expensive things only when actually needed,
# so the server still starts when (e.g.) Ollama is briefly unreachable.
# ---------------------------------------------------------------------------

def _bundles() -> dict[str, dict]:
    """Cached load of bundles.json into a {slug: bundle_dict} mapping.

    bundles.json is the product-specific catalog written by the Phase 1
    scraper. See PLAN.md Phase 1 for the schema.
    """
    if not BUNDLES_JSON.exists():
        return {}
    cat = json.loads(BUNDLES_JSON.read_text())
    return {b["slug"]: b for b in cat}


def _build_where(version: str | None, platform: str | None, bundle_id: str | None) -> dict | None:
    """Translate filter args into a Chroma `where` clause."""
    conds: list[dict] = []
    if version:
        conds.append({"version": version})
    if platform:
        conds.append({"platform": platform})
    if bundle_id:
        conds.append({"bundle_id": bundle_id})
    if not conds:
        return None
    if len(conds) == 1:
        return conds[0]
    return {"$and": conds}


def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
    """Read a corpus page off disk. Returns (markdown_body, metadata_dict)."""
    md_path = CORPUS / bundle_id / (page_id + ".md")
    json_path = CORPUS / bundle_id / (page_id + ".json")
    if not md_path.exists() or not json_path.exists():
        return None
    return md_path.read_text(), json.loads(json_path.read_text())


# ===========================================================================
# Tools
# ===========================================================================

@mcp.tool()
def search_docs(
    query: Annotated[str, Field(description=f"Natural-language query about {PRODUCT_NAME}.")],
    version: Annotated[
        str | None,
        Field(description="OPTIONAL version filter — restrict to one product version."),
    ] = None,
    platform: Annotated[
        str | None,
        Field(description="OPTIONAL platform filter. Set to one of the platforms listed by list_versions(); omit for all platforms."),
    ] = None,
    bundle_id: Annotated[
        str | None,
        Field(description="OPTIONAL bundle filter — pin to a specific doc bundle slug."),
    ] = None,
    k: Annotated[int, Field(description="Number of results to return.", ge=1, le=50)] = 10,
) -> str:
    """Search the {product} docs corpus.

    Returns the top-k most relevant chunks (with full source page URLs)
    given a natural-language query. Optional filters narrow the search
    to one version, one platform, or one bundle. Use list_versions()
    first if you need to discover the available facet values.

    Call this tool whenever the user asks anything that should be
    answerable from the official product documentation.
    """
    with TimedCall("search_docs", {
        "query": query, "version": version, "platform": platform,
        "bundle_id": bundle_id, "k": k,
    }) as _call:
        # TODO Phase 2-3: query Chroma collection (see rag/index.py for
        # how it was built). Render the top-k chunks as markdown with
        # source URLs.
        # TODO Phase 6: optional reranker via _rerank() if RERANK_URL set.
        # TODO Phase 8: hybrid retrieval if HYBRID_SEARCH=true — run
        # dense + BM25 in parallel, RRF-fuse, hand merged pool to rerank.
        _call.set(hits_returned=0)
        raise NotImplementedError("Phase 2/3: implement Chroma query + rendering")


@mcp.tool()
def get_page(
    bundle_id: Annotated[str, Field(description="Bundle slug.")],
    page_id: Annotated[str, Field(description="Page filename within the bundle.")],
) -> str:
    """Return the full markdown for one page, plus a metadata header.

    Use after search_docs surfaces a relevant page and the user (or you)
    want the complete text — not just the matched chunks.
    """
    with TimedCall("get_page", {"bundle_id": bundle_id, "page_id": page_id}) as _call:
        data = _read_page(bundle_id, page_id)
        if data is None:
            _call.set(found=False)
            return f"Page not found: {bundle_id}/{page_id}"
        md, meta = data
        _call.set(found=True, page_chars=len(md))
        # TODO: add a metadata header (title, version, source URL) above
        # the body. Product-specific shape.
        return md


@mcp.tool()
def list_versions() -> str:
    """List the available version/platform facets across all bundles.

    Use this to discover valid filter values for search_docs.
    """
    with TimedCall("list_versions", {}) as _call:
        cat = _bundles()
        if not cat:
            return "_(no bundles indexed yet — run the scraper + indexer)_"
        versions = sorted({b.get("version") for b in cat.values() if b.get("version")})
        platforms = sorted({b.get("platform") for b in cat.values() if b.get("platform")})
        _call.set(versions=len(versions), platforms=len(platforms))
        lines = [f"# Facets across {len(cat)} bundle(s)", ""]
        if versions:
            lines.append("## Versions"); lines.append("")
            for v in versions: lines.append(f"- `{v}`")
            lines.append("")
        if platforms:
            lines.append("## Platforms"); lines.append("")
            for p in platforms: lines.append(f"- `{p}`")
        return "\n".join(lines)


# ---------------------------------------------------------------------------
# Stubs for later phases — keep the signatures in this file so refactors
# don't lose the contracts. Implementations come per phase.
# ---------------------------------------------------------------------------

# @mcp.tool() # Phase 9
# def list_cluster(bundle_id: str, page_id: str) -> str: ...

# @mcp.tool() # Phase 9
# def diff_versions(bundle_id: str, page_id: str, against_bundle_id: str, context: int = 3) -> str: ...

# @mcp.tool() # Phase 9
# def bundle_changelog(bundle_id_new: str, bundle_id_old: str, min_churn: int = 5, max_changed: int = 50) -> str: ...

# @mcp.tool() # Phase 13
# def weekly_digest(days: int = 7, version: str | None = None, platform: str | None = None, ...) -> str: ...

# @mcp.tool() # Phase 9 (or 3 — useful early)
# def corpus_status() -> str: ...

# @mcp.tool() # Phase 11
# def myproduct_api_lessons(topic: str | None = None) -> str: ...

# @mcp.tool() # Phase 12
# def find_doc_inconsistencies(scope_query: str, ...) -> str: ...

# @mcp.tool() # Phase 12
# def submit_doc_bug(page_url: str, content: str, email: str | None = None, ...) -> str: ...


# ===========================================================================
# Entry point
# ===========================================================================

def main() -> None:
    import argparse
    p = argparse.ArgumentParser(description=f"{PRODUCT_NAME} docs MCP server")
    p.add_argument("--transport", choices=["stdio", "streamable-http", "sse"],
                   default=os.environ.get("MCP_TRANSPORT", "stdio"))
    p.add_argument("--host", default=os.environ.get("MCP_HOST", "0.0.0.0"))
    p.add_argument("--port", type=int, default=int(os.environ.get("MCP_PORT", "8000")))
    args = p.parse_args()

    if args.transport == "stdio":
        mcp.run()
    else:
        mcp.settings.host = args.host
        mcp.settings.port = args.port
        # DNS-rebinding protection defaults to localhost-only — disable for
        # container-network DNS hostnames. See PLAN.md "Hosting" notes.
        if os.environ.get("MCP_DISABLE_DNS_REBINDING_PROTECTION") in {"1", "true", "yes"}:
            mcp.settings.transport_security.enable_dns_rebinding_protection = False
        mcp.run(transport=args.transport)


if __name__ == "__main__":
    main()
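
To illustrate how the `_build_where` helper in this file shapes its output, here is a standalone re-statement of the same logic with three sample calls. The name `build_where` and the sample filter values are illustrative; the point is that a single filter is returned bare and `$and` only appears when two or more filters are set (as far as I know, Chroma rejects an `$and` with fewer than two operands, which would explain the special case).

```python
def build_where(version=None, platform=None, bundle_id=None):
    """Same branching as _build_where: None, one bare condition, or $and."""
    pairs = (("version", version), ("platform", platform), ("bundle_id", bundle_id))
    conds = [{key: value} for key, value in pairs if value]
    if not conds:
        return None
    if len(conds) == 1:
        return conds[0]
    return {"$and": conds}


no_filter = build_where()
one_filter = build_where(version="10.0")
two_filters = build_where(version="10.0", bundle_id="Admin.10.0")
```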
@@ -0,0 +1,127 @@
"""Per-call usage telemetry — JSONL with daily rotation and retention.

Reusable as-is across products. Drop the import + `with TimedCall(...)`
into any tool body and the call gets logged with the tool name, args,
elapsed time, and any extra fields the tool sets via `_call.set(...)`.

The log file is `var/logs/usage.jsonl` by default (override with the
`USAGE_LOG_DIR` env). Daily rotation; files older than
`USAGE_LOG_KEEP_DAYS` (default 90) are deleted on next write.

Layout of one record:

    {
      "ts": "2026-05-22T13:14:15+00:00",
      "tool": "search_docs",
      "args": {"query": "...", "version": "10.9", "k": 10},
      "elapsed_ms": 142.5,
      "hits_returned": 7, # optional, set by the tool
      "reranked": true, # optional, set by the tool
      // ... any other key the tool sets via _call.set(...)
    }
"""
from __future__ import annotations

import json
import os
import time
import threading
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any


USAGE_LOG_DIR = Path(os.environ.get("USAGE_LOG_DIR", "var/logs"))
USAGE_LOG_KEEP_DAYS = int(os.environ.get("USAGE_LOG_KEEP_DAYS", "90"))

# Single global lock to serialize writes from multiple request handlers.
# JSONL appends are atomic at the OS level for short records on most
# filesystems, but the lock is cheap and saves you from cross-platform
# surprises.
_lock = threading.Lock()
_last_rotation_check: float = 0.0


def _maybe_rotate() -> None:
    """Move usage.jsonl → usage.jsonl.<yesterday> if the date has rolled.

    Cheap to call; we only do filesystem work when a day has actually
    passed since the last check.
    """
    global _last_rotation_check
    now = time.time()
    if now - _last_rotation_check < 300:  # 5 min cap between checks
        return
    _last_rotation_check = now

    USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
    active = USAGE_LOG_DIR / "usage.jsonl"
    if active.exists():
        try:
            mtime = datetime.fromtimestamp(active.stat().st_mtime, tz=timezone.utc).date()
            today = datetime.now(timezone.utc).date()
            if mtime < today:
                rotated = USAGE_LOG_DIR / f"usage.jsonl.{mtime.isoformat()}"
                if not rotated.exists():
                    active.rename(rotated)
        except OSError:
            pass

    # Retention: delete usage.jsonl.YYYY-MM-DD files older than the
    # retention window. The active file is never deleted by this.
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=USAGE_LOG_KEEP_DAYS)
    for f in USAGE_LOG_DIR.glob("usage.jsonl.*"):
        try:
            datestamp = f.name.split(".", 2)[-1]
            if datetime.fromisoformat(datestamp).date() < cutoff:
                f.unlink()
        except (ValueError, OSError):
            continue


class TimedCall:
    """Context manager that captures one tool call's telemetry record.

    Usage:

        with TimedCall("search_docs", {"query": q, ...}) as call:
            ... do the work ...
            call.set(hits_returned=len(results), reranked=True)

    On exit, writes one JSONL record to usage.jsonl. Exceptions are
    captured into the `error` field; the exception is re-raised so
    the tool's caller sees the failure.
    """

    def __init__(self, tool: str, args: dict[str, Any]):
        self.tool = tool
        self.args = args
        self.extra: dict[str, Any] = {}
        self._t0: float = 0.0

    def set(self, **kwargs: Any) -> None:
        """Attach extra fields to the eventual telemetry record."""
        self.extra.update(kwargs)

    def __enter__(self) -> "TimedCall":
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        record: dict[str, Any] = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "tool": self.tool,
            "args": self.args,
            "elapsed_ms": round(elapsed_ms, 2),
        }
        if exc_type is not None:
            record["error"] = f"{exc_type.__name__}: {exc_val}"
        record.update(self.extra)

        _maybe_rotate()
        with _lock:
            USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
            with open(USAGE_LOG_DIR / "usage.jsonl", "a") as fh:
                fh.write(json.dumps(record, separators=(",", ":")) + "\n")
        # Don't swallow the exception — the caller still needs to see it.
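
Because each telemetry record is one JSON object per line with `tool`, `elapsed_ms`, and an optional `error` key, the log is easy to aggregate. The following is a small sketch of a per-tool summary; the `summarize` helper and the sample records are hypothetical and not part of the template.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Aggregate usage.jsonl lines into per-tool call counts and timings."""
    calls = defaultdict(int)
    total_ms = defaultdict(float)
    errors = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        tool = rec["tool"]
        calls[tool] += 1
        total_ms[tool] += rec["elapsed_ms"]
        if "error" in rec:
            errors[tool] += 1
    return {
        tool: {
            "calls": calls[tool],
            "mean_ms": total_ms[tool] / calls[tool],
            "errors": errors[tool],
        }
        for tool in calls
    }


sample = [
    '{"ts":"2026-05-22T13:14:15+00:00","tool":"search_docs","args":{},"elapsed_ms":100.0}',
    '{"ts":"2026-05-22T13:14:16+00:00","tool":"search_docs","args":{},"elapsed_ms":200.0,"error":"NotImplementedError: x"}',
    '{"ts":"2026-05-22T13:14:17+00:00","tool":"get_page","args":{},"elapsed_ms":5.0}',
]
summary = summarize(sample)
```

In practice the input would be the lines of `usage.jsonl` plus any rotated `usage.jsonl.YYYY-MM-DD` files.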
@@ -0,0 +1,4 @@
{"query": "how to install <product> on Linux", "expected": [{"bundle_id": "Install.Linux.10.0", "page_id": "Installation.htm"}], "tags": ["install", "linux"]}
{"query": "configure database connection for high availability", "expected": [{"bundle_id": "Admin.10.0", "page_id": "HA_Setup.htm"}], "tags": ["ha", "config"]}
{"query": "API endpoint to list users", "expected": [{"bundle_id": "API.10.0", "page_id": "Users_API.htm"}], "tags": ["api"]}
{"query": "what changed between 10.0 and 10.1", "expected": [{"bundle_id": "Release_Notes.10.1", "page_id": "Whats_New.htm"}], "tags": ["release-notes"]}
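
The retrievers and metrics in this commit work on `(bundle_id, page_id)` tuples, while the query file stores `expected` as a list of objects. A minimal sketch of the conversion, using one of the example records above; the `expected_pages` helper is hypothetical.

```python
import json


def expected_pages(line: str) -> list[tuple[str, str]]:
    """Turn one queries.jsonl line into (bundle_id, page_id) tuples."""
    record = json.loads(line)
    return [(e["bundle_id"], e["page_id"]) for e in record["expected"]]


line = (
    '{"query": "API endpoint to list users", '
    '"expected": [{"bundle_id": "API.10.0", "page_id": "Users_API.htm"}], '
    '"tags": ["api"]}'
)
pages = expected_pages(line)
```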
@@ -0,0 +1,62 @@
"""Retriever protocol + concrete implementations.

A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
so the eval harness can compare them apples-to-apples. Implement these
once at Phase 7 and reuse them across every retrieval change.

Each retriever returns a ranked list of (bundle_id, page_id) tuples
deduplicated to the page level (chunks within the same page collapse
to one entry; the highest-ranked chunk's position wins).
"""
from __future__ import annotations

from typing import Protocol, Iterable


class Retriever(Protocol):
    name: str

    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        """Return up to k (bundle_id, page_id) tuples in rank order."""
        ...


def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
    """Take a stream of (bundle_id, page_id, chunk_ordinal) and return
    the first k unique pages in their first-seen order."""
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for bid, pid, _ord in chunk_ids:
        key = (bid, pid)
        if key in seen:
            continue
        seen.add(key)
        out.append(key)
        if len(out) >= k:
            break
    return out


# TODO Phase 2/3 — implement these once Chroma + the bm25 module are
# in place. Each one is small (15-30 LOC). The eval harness imports
# from this module by class name.
#
# class DenseRetriever:
#     name = "dense"
#     def __init__(self, collection): self.col = collection
#     def retrieve(self, query, k=10): ...
#
# class RerankedRetriever:
#     name = "dense+rerank"
#     def __init__(self, collection, rerank_url, pool=200): ...
#     def retrieve(self, query, k=10): ...
#
# class BM25Retriever:
#     name = "bm25"
#     def __init__(self, bm25_index): ...
#     def retrieve(self, query, k=10): ...
#
# class HybridRetriever:
#     name = "bm25+dense+rrf"
#     def __init__(self, dense, bm25, k_rrf=60): ...
#     def retrieve(self, query, k=10): ...
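
The page-level deduplication described in the module docstring can be checked with a small worked example. The function below restates the `_collapse_to_pages` logic so the snippet is self-contained; the bundle and page ids are made up.

```python
from typing import Iterable


def collapse_to_pages(
    chunk_ids: Iterable[tuple[str, str, str]], k: int
) -> list[tuple[str, str]]:
    """First k unique (bundle_id, page_id) pairs, in first-seen order."""
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for bundle_id, page_id, _ordinal in chunk_ids:
        key = (bundle_id, page_id)
        if key in seen:
            continue  # a higher-ranked chunk of this page already counted
        seen.add(key)
        out.append(key)
        if len(out) >= k:
            break
    return out


ranked_chunks = [
    ("A", "p1", "0"),
    ("A", "p1", "1"),  # same page as the top hit, collapses away
    ("B", "p2", "0"),
    ("A", "p3", "2"),
]
top_pages = collapse_to_pages(ranked_chunks, k=2)
```

Since a page keeps the position of its best chunk, a retriever should fetch more than `k` chunks before collapsing, or it may return fewer than `k` pages.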
@@ -0,0 +1,91 @@
|
|||||||
|
"""Run all retrievers against eval/queries.jsonl, emit a markdown report.
|
||||||
|
|
||||||
|
Metrics computed per retriever:
|
||||||
|
|
||||||
|
MRR — mean reciprocal rank of the FIRST expected page in the
|
||||||
|
ranked result list (0 if not in top-k).
|
||||||
|
Recall@K — fraction of expected pages that appear in top-K.
|
||||||
|
nDCG@K — discounted gain weighted by rank position.
|
||||||
|
|
||||||
|
The "right" number depends on what you're measuring. MRR tracks "the
|
||||||
|
first-line answer is correct"; Recall@K tracks "everything relevant
|
||||||
|
is there to draw from"; nDCG@K is a smoother combination of both.
|
||||||
|
For docs-RAG, MRR is usually the headline metric.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
|
||||||
|
python -m eval.run_eval \\
|
||||||
|
--queries eval/queries.jsonl \\
|
||||||
|
--k 5 \\
|
||||||
|
--output eval/results/baseline.md
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import math
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Iterable
|
||||||
|
|
||||||
|
|
||||||
|
def load_queries(path: Path) -> list[dict]:
    """Load one JSON object per non-blank line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def reciprocal_rank(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]]) -> float:
    """Return 1/rank of the first retrieved page that is expected, else 0.0."""
    expected_set = set(expected)
    for i, page in enumerate(retrieved, start=1):
        if page in expected_set:
            return 1.0 / i
    return 0.0


def recall_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
    """Return the fraction of expected pages found in the top-k retrieved."""
    if not expected:
        return 0.0
    retrieved_set = set(retrieved[:k])
    hits = sum(1 for e in expected if e in retrieved_set)
    return hits / len(expected)


def ndcg_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
    """Return binary-relevance nDCG over the top-k retrieved pages."""
    expected_set = set(expected)
    dcg = 0.0
    for i, page in enumerate(retrieved[:k], start=1):
        if page in expected_set:
            dcg += 1.0 / math.log2(i + 1)
    # Ideal DCG: every expected page in the top positions.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
    return dcg / idcg if idcg else 0.0


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--queries", type=Path, default=Path("eval/queries.jsonl"))
    p.add_argument("--k", type=int, default=5)
    p.add_argument("--output", type=Path, default=Path("eval/results/baseline.md"))
    args = p.parse_args()

    if not args.queries.exists():
        print(f"queries file not found: {args.queries}")
        print("hint: copy eval/queries.jsonl.example and edit")
        return 1

    queries = load_queries(args.queries)
    print(f"loaded {len(queries)} queries")

    # TODO Phase 7: instantiate the retrievers you implemented in
    # eval/retrievers.py and run each one against each query.
    # Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
    # markdown table to args.output. Commit the file alongside the
    # PR that changes retrieval.
    raise NotImplementedError(
        "Wire up the retrievers in eval/retrievers.py first, then "
        "fill in this evaluation loop. See PLAN.md Phase 7."
    )


if __name__ == "__main__":
    raise SystemExit(main())
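As a quick sanity check on the three metric functions above, here is a toy ranking worked by hand. The page identifiers and the choice of k=3 are invented for illustration, and the functions are compact restatements of the ones in this file so the snippet runs on its own.

```python
import math

Page = tuple[str, str]  # (bundle_id, page_id); the identifiers below are invented


def reciprocal_rank(retrieved: list[Page], expected: list[Page]) -> float:
    expected_set = set(expected)
    for i, page in enumerate(retrieved, start=1):
        if page in expected_set:
            return 1.0 / i
    return 0.0


def recall_at_k(retrieved: list[Page], expected: list[Page], k: int) -> float:
    if not expected:
        return 0.0
    top = set(retrieved[:k])
    return sum(1 for e in expected if e in top) / len(expected)


def ndcg_at_k(retrieved: list[Page], expected: list[Page], k: int) -> float:
    expected_set = set(expected)
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, page in enumerate(retrieved[:k], start=1)
        if page in expected_set
    )
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
    return dcg / idcg if idcg else 0.0


retrieved = [("demo", "p3"), ("demo", "p1"), ("demo", "p7"), ("demo", "p2")]
expected = [("demo", "p1"), ("demo", "p2")]

rr = reciprocal_rank(retrieved, expected)     # first hit is at rank 2 -> 0.5
recall = recall_at_k(retrieved, expected, 3)  # only p1 is in the top 3 -> 0.5
ndcg = ndcg_at_k(retrieved, expected, 3)      # (1/log2(3)) / (1 + 1/log2(3))
print(rr, recall, round(ndcg, 3))
```

The second expected page sits at rank 4, so it counts toward neither Recall@3 nor nDCG@3, while MRR only looks at the first hit.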
@@ -0,0 +1,277 @@
"""SQLite FTS5-backed BM25 retrieval over the same chunks Chroma indexes.

Hybrid retrieval (BM25 + dense + Reciprocal Rank Fusion) addresses a
limit of single-tower dense embeddings: when a query has specific
technical terms (filenames, language names, error codes, API paths),
the dense embedding doesn't bridge from the query into a short
code-focused chunk. The chunk loses to the much larger crowd of
prose chunks that semantically match the query topic.

BM25 handles this directly. Lexical overlap on rare terms ("python",
"create_vpg.py", "PROTECTED_SITE_ID", "applyUpgrade") scores those
chunks high. Fused with the dense ranking via RRF, the hybrid result
has beaten either retriever alone on the queries we've seen fail.

Why SQLite FTS5:
  - In the stdlib. Zero new deps.
  - On-disk. Same persistence model as Chroma: Docker COPY the dir,
    `rag.index --rebuild` regenerates from corpus.
  - Built-in `bm25()` ranking function. No knobs to tune that matter
    for our use case (k1=1.2, b=0.75 defaults are fine).
  - Builds 70k+ chunks in seconds. Faster than the Chroma rebuild's
    embedding step by 100×, so it adds basically nothing to the
    full-rebuild cycle.

Schema is two tables to keep filtering clean. FTS5 doesn't filter
nicely on its own columns, so metadata lives in a plain table and the
FTS5 rows are inserted with matching rowids, which keeps the two
joinable by rowid:

    CREATE TABLE chunks_meta (
        rowid INTEGER PRIMARY KEY AUTOINCREMENT,
        id TEXT UNIQUE NOT NULL,
        bundle_id TEXT, page_id TEXT, version TEXT,
        platform TEXT, product TEXT, ordinal INTEGER
    );
    CREATE VIRTUAL TABLE chunks_fts USING fts5(
        text,
        tokenize = 'porter unicode61 remove_diacritics 2'
    );

Queries:

    SELECT m.id, bm25(chunks_fts) AS score
    FROM chunks_fts
    JOIN chunks_meta m ON m.rowid = chunks_fts.rowid
    WHERE chunks_fts MATCH ?
      AND m.version = ?            -- optional metadata filter
    ORDER BY bm25(chunks_fts)      -- lower = better in FTS5
    LIMIT ?;
"""
from __future__ import annotations

import logging
import re
import sqlite3
from pathlib import Path
from typing import Any

log = logging.getLogger(__name__)

# Default location: bm25/<product>_docs.db at the repo root, next to chroma/.
# "<product>" is the template placeholder; rag/index.py builds its default
# --bm25-db path from PRODUCT_NAME instead.
ROOT = Path(__file__).resolve().parent.parent
DEFAULT_DB_DIR = ROOT / "bm25"
DEFAULT_DB_NAME = "<product>_docs.db"

# Columns we expose as filterable metadata. Mirrors what _build_where in
# docs_mcp/server.py accepts so the same filter dicts work for both
# Chroma and BM25 without per-retriever translation in the caller.
FILTER_COLUMNS = ("bundle_id", "page_id", "version", "platform", "product", "ordinal")


# Allowlist tokenizer for free-text queries. FTS5's parser chokes on lots
# of punctuation we routinely see in user queries (".10.9", "?", "VPG's",
# em-dash, etc.). Rather than blocklist every operator, just keep
# alphanumerics, underscores and whitespace, and replace everything else
# with a space. This loses the ability to phrase-search ("exact match")
# but we don't expose that to users anyway: they ask natural-language
# questions and want the answer, not a Boolean DSL.
_KEEP_RE = re.compile(r"[^A-Za-z0-9_\s]")
# FTS5 reserves these Boolean operator KEYWORDS at the token level;
# stripping them avoids accidental operator behavior when a user
# query happens to contain bare "AND", "OR", "NOT", "NEAR".
_BOOLEAN_KW_RE = re.compile(r"(?<!\w)(AND|OR|NOT|NEAR)(?!\w)")


def _sanitize_query(text: str) -> str:
    """Reduce a natural-language query to an FTS5 OR-of-tokens query.

    Three transformations:

      1. Non-alphanumeric → space (drops punctuation; "10.9?" becomes
         "10 9"). Lets us handle versions, parens, question marks, etc.
         without inviting FTS5 parse errors.
      2. Boolean keywords stripped (FTS5 reserves AND/OR/NOT/NEAR).
      3. Tokens explicitly OR'd. FTS5's default is AND-of-tokens; for
         any non-trivial natural-language query that usually means zero
         hits (no chunk contains every word). OR semantics is what we
         want: BM25 already weights documents containing more query
         terms higher, so we don't lose precision, but we DO gain recall.
    """
    cleaned = _KEEP_RE.sub(" ", text)
    cleaned = _BOOLEAN_KW_RE.sub(" ", cleaned)
    tokens = cleaned.split()
    if not tokens:
        return ""
    return " OR ".join(tokens)


def _where_to_sql(where: dict | None) -> tuple[str, list[Any]]:
    """Translate a Chroma-shaped filter dict into a SQL fragment + params.

    Accepts the same shapes ``docs_mcp.server._build_where`` produces:

        None                        → ("", [])
        {"version": "10.9"}         → ("AND m.version = ?", ["10.9"])
        {"$and": [{...}, {...}]}    → ("AND m.X = ? AND m.Y = ?", [...])

    Unknown keys are silently dropped (defensive: better to over-match
    than to crash on a filter we don't know).
    """
    if not where:
        return "", []
    parts: list[str] = []
    params: list[Any] = []

    def _emit_eq(cond: dict[str, Any]) -> None:
        for k, v in cond.items():
            if k in FILTER_COLUMNS:
                parts.append(f"m.{k} = ?")
                params.append(v)

    if "$and" in where:
        for sub in where["$and"]:
            _emit_eq(sub)
    else:
        _emit_eq(where)
    if not parts:
        return "", []
    return "AND " + " AND ".join(parts), params


class BM25Index:
    """Thin wrapper around an FTS5-backed sqlite db.

    Single-writer model. Reads are connection-per-call (sqlite handles
    concurrency through file locks; for our read-heavy workload that's
    fine and avoids cross-thread connection sharing issues with the MCP
    server's request handlers).
    """

    def __init__(self, db_path: Path | None = None):
        self.db_path = Path(db_path) if db_path else (DEFAULT_DB_DIR / DEFAULT_DB_NAME)

    # -- build ----------------------------------------------------------

    def build(self, records: list[dict]) -> int:
        """Rebuild the index from scratch from `records`.

        `records` holds the same dicts ``rag.index.page_records`` yields:
        ``[{"id": ..., "text": ..., "metadata": {...}}, ...]``. Bulk
        insert wrapped in a transaction; single-digit seconds for the
        full 73k-chunk corpus.
        """
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        # Drop and recreate. Idempotent rebuild.
        if self.db_path.exists():
            self.db_path.unlink()
        with sqlite3.connect(self.db_path) as con:
            con.executescript(self._schema_sql())
            con.executemany(
                "INSERT INTO chunks_meta (id, bundle_id, page_id, version, "
                "platform, product, ordinal) VALUES (?, ?, ?, ?, ?, ?, ?)",
                [
                    (
                        r["id"],
                        r["metadata"].get("bundle_id") or "",
                        r["metadata"].get("page_id") or "",
                        r["metadata"].get("version") or "",
                        r["metadata"].get("platform") or "",
                        r["metadata"].get("product") or "",
                        int(r["metadata"].get("ordinal") or 0),
                    )
                    for r in records
                ],
            )
            # Populate the FTS5 table with explicit rowids. chunks_meta was
            # filled first on a fresh db, so its AUTOINCREMENT rowids are
            # 1..N in insertion order and i + 1 lines up with each record.
            con.executemany(
                "INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
                [
                    (i + 1, r["text"])
                    for i, r in enumerate(records)
                ],
            )
            con.commit()
        log.info("bm25: indexed %d chunks → %s", len(records), self.db_path)
        return len(records)

    # -- query ----------------------------------------------------------

    def query(
        self,
        text: str,
        n: int = 200,
        where: dict | None = None,
    ) -> list[tuple[str, float]]:
        """Return up to `n` (chunk_id, bm25_score) pairs, lowest score first.

        FTS5's bm25() returns NEGATIVE numbers: more relevant docs have
        smaller (more negative) scores. We order ASC so the first row is
        the most relevant. Callers that need a "rank" should enumerate
        the returned list.
        """
        sanitized = _sanitize_query(text)
        if not sanitized:
            return []
        where_sql, params = _where_to_sql(where)
        # FTS5 MATCH wants the unaliased table name on its left, so we use
        # chunks_fts (no alias) and JOIN by rowid against chunks_meta.
        sql = (
            "SELECT m.id, bm25(chunks_fts) AS score "
            "FROM chunks_fts "
            "JOIN chunks_meta m ON m.rowid = chunks_fts.rowid "
            f"WHERE chunks_fts MATCH ? {where_sql} "
            "ORDER BY bm25(chunks_fts) "
            "LIMIT ?"
        )
        try:
            with sqlite3.connect(self.db_path) as con:
                cur = con.execute(sql, [sanitized, *params, n])
                return [(row[0], float(row[1])) for row in cur.fetchall()]
        except sqlite3.OperationalError as e:
            # FTS5 syntax error (rare after sanitization) or db missing.
            # Caller decides whether to fall back to dense-only.
            log.warning("bm25 query failed (%s); query=%r", e, sanitized[:80])
            return []

    def exists(self) -> bool:
        """Cheap probe: does the index file exist on disk?"""
        return self.db_path.exists()

    def count(self) -> int:
        """Number of chunks indexed. 0 if the db is missing or empty."""
        if not self.exists():
            return 0
        try:
            with sqlite3.connect(self.db_path) as con:
                return con.execute("SELECT COUNT(*) FROM chunks_meta").fetchone()[0]
        except sqlite3.OperationalError:
            return 0

    # -- schema ---------------------------------------------------------

    @staticmethod
    def _schema_sql() -> str:
        return """
        CREATE TABLE chunks_meta (
            rowid INTEGER PRIMARY KEY AUTOINCREMENT,
            id TEXT UNIQUE NOT NULL,
            bundle_id TEXT,
            page_id TEXT,
            version TEXT,
            platform TEXT,
            product TEXT,
            ordinal INTEGER
        );
        CREATE INDEX idx_meta_version ON chunks_meta(version);
        CREATE INDEX idx_meta_platform ON chunks_meta(platform);
        CREATE INDEX idx_meta_bundle ON chunks_meta(bundle_id);

        CREATE VIRTUAL TABLE chunks_fts USING fts5(
            text,
            tokenize = 'porter unicode61 remove_diacritics 2'
        );
        """
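The module docstring refers to fusing the BM25 and dense rankings with Reciprocal Rank Fusion, but the fusion step itself is not part of this file. A minimal sketch of RRF follows, assuming each retriever returns chunk ids best-first. `rrf_fuse` is a hypothetical helper name, the chunk ids are invented, and the constant `k=60` is a commonly used default rather than anything this repo specifies.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse best-first id lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


bm25_ids = ["c", "a", "b"]    # e.g. [cid for cid, _ in BM25Index(...).query(text)]
dense_ids = ["a", "d", "c"]   # ids from the dense retriever, best first
fused = rrf_fuse([bm25_ids, dense_ids])
print([chunk_id for chunk_id, _ in fused])
```

Because RRF only uses ranks, the negative FTS5 `bm25()` scores and the dense distances never need to be put on a common scale.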
@@ -0,0 +1,126 @@
"""Markdown chunker: paragraph-aware, ~400-600 token target.

Adjust the chunking strategy per product if your page format differs
significantly from prose. The output shape (id, text, metadata) is
fixed by the downstream Chroma + BM25 indexing in rag/index.py; don't
change that.

The key knob you'll tune per product is chunk-0. Dense retrieval lands
on chunk 0 first for most queries. Make it a synthetic chunk built
from:

  - the page title (as natural-language H1)
  - a 1-sentence task description (you'll have to generate this; for
    pages that already have a "## Overview" or "## Introduction" the
    first sentence usually works)
  - a keyword bag of important terms (filenames, API names, error
    codes: the rare technical tokens that BM25 lights up on)

Without a rich chunk 0, dense retrieval gets dominated by the much
larger prose body, and short pages (script examples, reference cards)
get buried.
"""
from __future__ import annotations

import re
from typing import Iterator


# Approximate token estimate from char count. Tunable: set per
# embedder if the default 4 chars/token is wrong.
CHARS_PER_TOKEN = 4
TARGET_TOKENS = 500
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)


def split_paragraphs(md: str) -> list[str]:
    """Split markdown into paragraph-ish blocks.

    Keeps fenced code blocks together (don't slice through ```).
    Headings start new paragraphs.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            # Opening or closing fence: toggle and keep the line in the block.
            in_fence = not in_fence
            current.append(line)
            continue
        if in_fence:
            current.append(line)
            continue
        if stripped.startswith("#"):
            # A heading closes the running block and opens a new one.
            if current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
            continue
        if not stripped and current:
            # Blank line outside a fence ends the current block. (The
            # original also tested a stripped string for a trailing "\n\n",
            # which can never match, so that clause is dropped.)
            current.append(line)
            blocks.append("".join(current).strip())
            current = []
            continue
        current.append(line)
    if current:
        blocks.append("".join(current).strip())
    return [b for b in blocks if b]


def chunks_from_page(
    text: str,
    page_id: str,
    metadata: dict,
) -> Iterator[dict]:
    """Yield chunk dicts ready for index.py to upsert.

    The synthetic chunk 0 is the per-product customization point. The
    default below is a simple title + body-first-paragraph; rewrite
    for richer retrieval signal (see module docstring).
    """
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return

    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
    title = metadata.get("title") or page_id
    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
    chunk0_body = (
        f"# {title}\n\n"
        f"{first_para[:300]}"
        # TODO per product: append a keyword bag here (filenames,
        # API names, error codes) for BM25 + dense joint coverage.
    )
    yield {
        "id": f"{metadata['bundle_id']}::{page_id}::0",
        "text": chunk0_body,
        "metadata": {**metadata, "ordinal": 0},
    }

    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
    ordinal = 1
    buf: list[str] = []
    buf_chars = 0
    for p in paragraphs:
        if buf_chars + len(p) > TARGET_CHARS and buf:
            yield {
                "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
                "text": "\n\n".join(buf),
                "metadata": {**metadata, "ordinal": ordinal},
            }
            ordinal += 1
            buf = []
            buf_chars = 0
        buf.append(p)
        buf_chars += len(p)
    if buf:
        yield {
            "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
            "text": "\n\n".join(buf),
            "metadata": {**metadata, "ordinal": ordinal},
        }
@@ -0,0 +1,72 @@
"""Embedding function for Chroma: Ollama-hosted nomic-embed-text by default.

Swappable: implement the same `embedding_function()` interface returning
a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.

Defaults (override via env):
    OLLAMA_URL    one or more comma-separated URLs (load-balanced)
    EMBED_MODEL   model name; default 'nomic-embed-text'
    EMBED_DIM     expected embedding dim; default 768 (nomic-embed-text)
"""
from __future__ import annotations

import logging
import os
from typing import Any

import httpx
from chromadb import EmbeddingFunction, Documents, Embeddings

log = logging.getLogger(__name__)

OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
               "http://localhost:11434").split(",") if u.strip()]
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))


class OllamaEmbeddings(EmbeddingFunction):
    """Calls /api/embed across N Ollama endpoints, naive round-robin.

    For indexing throughput on multiple GPUs, run one Ollama container
    per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
    in OLLAMA_URL; the embedder picks the next endpoint per batch.
    """

    def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
        self.urls = urls
        self.model = model
        self._next = 0

    def __call__(self, input: Documents) -> Embeddings:
        url = self.urls[self._next % len(self.urls)]
        self._next += 1
        with httpx.Client(timeout=300) as c:
            r = c.post(f"{url}/api/embed",
                       json={"model": self.model, "input": list(input)})
            r.raise_for_status()
            data = r.json()
        return data.get("embeddings") or []

    def name(self) -> str:  # newer chromadb requires this
        return f"ollama:{self.model}"

    @staticmethod
    def build_from_config(config: dict) -> "OllamaEmbeddings":  # newer chromadb
        return OllamaEmbeddings(
            urls=config.get("urls", OLLAMA_URLS),
            model=config.get("model", EMBED_MODEL),
        )

    def get_config(self) -> dict:  # newer chromadb
        return {"urls": self.urls, "model": self.model}

    def default_space(self) -> str:
        return "cosine"

    def supported_spaces(self) -> list[str]:
        return ["cosine", "l2", "ip"]


def embedding_function() -> EmbeddingFunction:
    return OllamaEmbeddings()
@@ -0,0 +1,134 @@
"""Build Chroma (and optionally BM25) indexes from corpus on disk.

Reads `corpus/<bundle>/<page>.{md,json}`, chunks each page, upserts
into Chroma. With --rebuild, drops + recreates the collection (clean
state). With --bm25-only, skips Chroma and rebuilds only the FTS5
index, which is useful for fast iteration when chunking didn't change.
"""
from __future__ import annotations

import argparse
import json
import logging
import os
import time
from pathlib import Path
from typing import Iterator

import chromadb
from chromadb.config import Settings

from .chunk import chunks_from_page
from .embeddings import embedding_function

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

ROOT = Path(__file__).resolve().parent.parent
CORPUS = ROOT / "corpus"
CHROMA_DIR = ROOT / "chroma"

# Collection name; convention: <product>_docs. Override via env if needed.
PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
COLLECTION = f"{PRODUCT_NAME}_docs"


def page_records() -> Iterator[dict]:
    """Walk corpus/, yield chunks for every page."""
    if not CORPUS.exists():
        log.error("corpus/ doesn't exist; run the scraper first")
        return
    for bundle_dir in sorted(CORPUS.iterdir()):
        # Skip stray files and dot-directories such as corpus/.digest.
        if not bundle_dir.is_dir() or bundle_dir.name.startswith("."):
            continue
        for md_path in sorted(bundle_dir.glob("*.md")):
            page_id = md_path.stem
            sidecar = md_path.with_suffix(".json")
            if not sidecar.exists():
                log.warning("skipping %s: no JSON sidecar", md_path)
                continue
            # Read as UTF-8 explicitly so results don't depend on the locale.
            md = md_path.read_text(encoding="utf-8")
            meta = json.loads(sidecar.read_text(encoding="utf-8"))
            # Surface common filter fields at the chunk-metadata level
            # so Chroma's `where` filter can use them.
            base_meta = {
                "bundle_id": bundle_dir.name,
                "page_id": page_id,
                "title": meta.get("title") or "",
                "version": meta.get("version") or "",
                "platform": meta.get("platform") or "",
                "product": meta.get("product") or "",
            }
            yield from chunks_from_page(md, page_id, base_meta)


def upsert_to_chroma(records: list[dict]) -> int:
    client = chromadb.PersistentClient(
        path=str(CHROMA_DIR),
        settings=Settings(anonymized_telemetry=False),
    )
    # Drop + recreate for --rebuild semantics
    try:
        client.delete_collection(COLLECTION)
    except Exception:
        pass
    col = client.create_collection(COLLECTION, embedding_function=embedding_function())

    BATCH = 64
    total = 0
    for i in range(0, len(records), BATCH):
        chunk = records[i:i + BATCH]
        col.upsert(
            ids=[r["id"] for r in chunk],
            documents=[r["text"] for r in chunk],
            metadatas=[r["metadata"] for r in chunk],
        )
        total += len(chunk)
        log.info("upserted %d / %d chunks", total, len(records))
    return total


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--rebuild", action="store_true",
                   help="Drop and recreate the Chroma collection.")
    p.add_argument("--bm25-only", action="store_true",
                   help="Rebuild only the BM25 index, skip Chroma.")
    p.add_argument("--bm25-db", type=Path,
                   default=ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db",
                   help="Path to the BM25 sqlite db.")
    args = p.parse_args()

    log.info("reading corpus from %s", CORPUS)
    t0 = time.time()
    records = list(page_records())
    log.info("loaded %d chunks in %.1fs", len(records), time.time() - t0)

    if args.bm25_only:
        from .bm25 import BM25Index
        log.info("--bm25-only: building FTS5 only")
        BM25Index(args.bm25_db).build(records)
        return 0

    if not args.rebuild:
        log.info("no --rebuild; nothing to do. (Use --rebuild to upsert.)")
        return 0

    t_c = time.time()
    n = upsert_to_chroma(records)
    log.info("chroma: %d chunks in %.1fs", n, time.time() - t_c)

    # Build BM25 too; see PLAN.md Phase 8. Safe to remove this block
    # for products that don't need hybrid retrieval.
    try:
        from .bm25 import BM25Index
        t_b = time.time()
        BM25Index(args.bm25_db).build(records)
        log.info("bm25 done in %.1fs", time.time() - t_b)
    except ImportError:
        log.info("rag.bm25 not available; skipping BM25 build")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,19 @@
# MCP server
mcp[fastmcp]>=1.0.0
pydantic>=2.0
httpx>=0.27

# Vector store + embeddings
chromadb>=0.5.0
ollama>=0.4.0  # if using Ollama-hosted embedder; swap if not

# Scraping (Phase 1; adjust per product)
beautifulsoup4>=4.12
requests>=2.31
# playwright>=1.40  # uncomment if you need headless browser fallback

# Evaluation
numpy>=1.26

# Dev / utility
python-dateutil>=2.8
@@ -0,0 +1,61 @@
# scrape/

Per-vendor seed catalog scrapers + the runner that dispatches to
them. Each source lives in `scrape/sources/<name>.py` with a `main()`
entrypoint. The runner is a thin shim:

```bash
python -m scrape.runner --source bayer_seeds --force
python -m scrape.runner --source golden_harvest --limit 20
python -m scrape.runner --all     # only GREEN sources
```

## Output layout

Each scraper writes:

- `corpus/<source>/<source_key>.md`: LLM-visible body (chunk_0
  preamble + the variety's marketing + agronomic narrative)
- `corpus/<source>/<source_key>.json`: sidecar metadata (per
  CLAUDE.md's canonical schema)

`source_key` is a stable per-vendor slug, typically `<brand>-<sku>`
lowercased, e.g. `dekalb-dkc62-08rib`. Stability matters: it's the
join key the MCP uses for `get_page(source, source_key)`.

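A slug of this shape could be derived as sketched below. `make_source_key` is a hypothetical helper, and collapsing every non-alphanumeric run into a single hyphen is an assumption; individual scrapers may need vendor-specific rules to keep keys stable across re-scrapes.

```python
import re


def make_source_key(brand: str, sku: str) -> str:
    """Lowercase `<brand>-<sku>` and collapse other characters to hyphens."""
    raw = f"{brand}-{sku}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


print(make_source_key("DEKALB", "DKC62-08RIB"))  # dekalb-dkc62-08rib
```
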
## Sources

| Source | Module | Verdict | Notes |
|---|---|---|---|
| `bayer_seeds` | `bayer_seeds.py` | 🟢 | DEKALB + Asgrow + WestBred, ~475 varieties |
| `golden_harvest` | `golden_harvest.py` | 🟢 | ~175 varieties, 9-to-1 disease scale (reversed) |
| `nk` | `nk.py` | 🟢 | 29 varieties, ratings in CDN PDFs |
| `agripro` | `agripro.py` | 🟢 | 24 wheat varieties |
| `becks_pfr` | `becks_pfr.py` | 🟡 | 2,089 research docs via public Sanity GROQ |
| `becks_products` | `becks_products.py` | 🟡 | 860 products, identity-only (SeedIQ-gated) |

Pioneer is intentionally absent; see `CLAUDE.md` and the curated
Pioneer fallback in `docs_mcp/lessons.md`.

## Tips

- **Sniff before you scrape.** Most catalogs are SPAs that call a
  backend API. The recon docs in
  `~/.claude/projects/-home-justin/memory/reference_seed_vendor_recon.md`
  already capture the endpoints; if you find new ones, update that
  file.
- **Idempotent re-scrapes.** Without `--force`, skip pages already
  on disk. With `--force`, re-fetch everything; that's the
  monthly cron mode.
- **Respect the portals.** Back off on 429s. Set a recognizable
  user-agent (`seed-mcp-scraper/<version>`).
- **Normalize at chunk time, not at scrape time.** The chunker
  (Phase 2) handles the 9-to-1 → 1-9 disease-scale flip for Golden
  Harvest, NOT this scraper. Sidecar JSON should preserve the
  vendor's raw values + a `_scale_direction` field; the chunker
  reads that and normalizes the markdown body.

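To illustrate the last tip, a chunk-time normalizer might look like the sketch below. It assumes ratings are integers on a 1-9 scale and that `_scale_direction` holds either `"9-to-1"` (reversed, as for Golden Harvest) or `"1-to-9"`. Those two string values, the `ratings` key, the trait names, and both function names are assumptions for illustration, not the canonical sidecar schema from `CLAUDE.md`.

```python
def normalize_rating(raw: int, scale_direction: str) -> int:
    """Map a vendor rating onto the 1-9 direction; a reversed scale flips as 10 - raw."""
    if not 1 <= raw <= 9:
        raise ValueError(f"rating out of range: {raw}")
    if scale_direction == "9-to-1":   # assumed marker for a reversed scale
        return 10 - raw
    if scale_direction == "1-to-9":   # assumed marker for the target direction
        return raw
    raise ValueError(f"unknown scale direction: {scale_direction!r}")


def normalize_sidecar_ratings(sidecar: dict) -> dict[str, int]:
    """Return normalized ratings without mutating the raw sidecar values."""
    direction = sidecar["_scale_direction"]
    return {
        name: normalize_rating(value, direction)
        for name, value in sidecar.get("ratings", {}).items()
    }


raw_sidecar = {
    "_scale_direction": "9-to-1",
    "ratings": {"gray_leaf_spot": 7, "goss_wilt": 2},  # invented example values
}
print(normalize_sidecar_ratings(raw_sidecar))  # {'gray_leaf_spot': 3, 'goss_wilt': 8}
```

Keeping the raw values in the sidecar means the flip can be audited or re-run without re-scraping.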
## changelog.py

Reusable as-is from the template. Walks `git diff --name-status`
output for the commit summary, and `git log` for the digest history
(Phase 13).
@@ -0,0 +1,272 @@
"""Generate a summary of corpus changes.

Two output shapes for two consumers:

  1. Human-readable text (default): written into the weekly-refresh
     commit message so the commit log is greppable for *"what changed
     this week"* instead of *"806 files changed"*.

  2. Structured JSON (``--json``) and rolling JSONL history
     (``--history-out``): consumed by the ``weekly_digest`` MCP tool.
     Computed in CI and committed at ``corpus/.digest/history.jsonl``;
     the tool reads it at runtime because the prod container is a
     static filesystem COPY with no git available.

Usage:

    # Commit-message helper (existing behavior, unchanged)
    python -m scrape.changelog [--cached] [--ref REF]

    # One-shot JSON for the current diff range
    python -m scrape.changelog --cached --json

    # Build / refresh the digest history file (CI use)
    python -m scrape.changelog --history-out corpus/.digest/history.jsonl \\
        --history-days 120

The history walker only includes commits that touch ``corpus/`` (or
``bundles.json``); it skips pure code/CI commits. Each emitted record
carries the commit's short sha, ISO timestamp, subject, and the same
structured summary the ``--json`` path produces, so the consumer can
treat history records and one-shot summaries interchangeably.
"""
from __future__ import annotations

import argparse
import json
import subprocess
import sys
from collections import defaultdict
from typing import Any


def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True)


def summarize_diff(diff_output: str) -> dict[str, Any]:
    """Parse ``git diff --name-status`` output into a structured summary.

    Pure function (no IO, no git calls) so the same logic is exercised
    by the human-readable, JSON-one-shot, and history-walking paths.

    Returns a dict with:

      md_count          int       — total .md files changed
      json_count        int       — total .json sidecars changed
      content_bundles   dict      — {bundle_id: [page_id_without_.md, ...]}
                                    Only bundles where at least one .md
                                    file moved. Lists are in the order
                                    git emitted them.
      json_only_bundles list[str] — bundles whose ONLY change was sidecar
                                    drift (no .md changes). Sorted.
      new_bundles       list[str] — bundles with at least one .md Added
                                    in this diff (a heuristic: a page
                                    added to an existing bundle also
                                    counts). Sorted.
      other_files       list[str] — any non-corpus path mentioned in the
                                    diff, as ``"STATUS path"`` strings.
    """
    md_changes: dict[str, list[str]] = defaultdict(list)
    json_only_bundles: set[str] = set()
    new_bundles: set[str] = set()
    md_count = json_count = 0
    other_files: list[str] = []

    for line in diff_output.splitlines():
        if not line.strip():
            continue
        # status<TAB>path (or status<TAB>old<TAB>new for renames; we take
        # the post-rename path as the canonical location).
        parts = line.split("\t")
        status, path = parts[0], parts[-1]
        if not path.startswith("corpus/"):
            other_files.append(f"{status} {path}")
            continue
        segs = path.split("/", 2)
        if len(segs) < 3:
            # corpus/<filename> with no bundle dir — skip.
            continue
        _, bundle, page = segs
        if page.endswith(".md"):
            md_changes[bundle].append(page[:-3])
            md_count += 1
            if status == "A":
                new_bundles.add(bundle)
        elif page.endswith(".json"):
            json_count += 1
            json_only_bundles.add(bundle)

    # A bundle counts as "content-changing" if it had any .md edit. Sidecar-
    # only drift goes in the separate bucket so the commit message doesn't
    # report timestamp churn as if it were real edits.
    content_bundles_set = set(md_changes)
    drift_only = sorted(json_only_bundles - content_bundles_set)

    return {
        "md_count": md_count,
        "json_count": json_count,
        "content_bundles": dict(md_changes),  # cast back to plain dict for JSON
        "json_only_bundles": drift_only,
        "new_bundles": sorted(new_bundles),
        "other_files": other_files,
    }


def render_human(summary: dict[str, Any]) -> str:
    """Format a summary dict as the multi-line commit-message text.

    Matches the historical output exactly so existing commit-message
    tooling and downstream readers don't have to change.
    """
    lines: list[str] = []
    content_bundles = sorted(summary["content_bundles"])
    md_count = summary["md_count"]
    json_count = summary["json_count"]
    new_bundles = set(summary["new_bundles"])
    drift_only = summary["json_only_bundles"]
    other_files = summary["other_files"]

    lines.append(f"{md_count} content change(s) across {len(content_bundles)} bundle(s)")
    lines.append(f"{json_count} sidecar metadata update(s)")
    if new_bundles:
        lines.append(f"{len(new_bundles)} new bundle(s) added")
    if other_files:
        lines.append(f"{len(other_files)} other file change(s)")

    if content_bundles:
        lines.append("")
        lines.append("Bundles with content changes:")
        for b in content_bundles:
            pages = summary["content_bundles"][b]
            tag = " (NEW)" if b in new_bundles else ""
            lines.append(f" {b}{tag}: {len(pages)} page(s)")
            for p in pages[:5]:
                lines.append(f" - {p}")
            if len(pages) > 5:
                lines.append(f" ... and {len(pages) - 5} more")
    if drift_only:
        lines.append("")
        head = ", ".join(drift_only[:10])
        suffix = " …" if len(drift_only) > 10 else ""
        lines.append(f"Bundles with sidecar-only drift ({len(drift_only)}): {head}{suffix}")
    return "\n".join(lines)


def walk_history(history_days: int) -> list[dict[str, Any]]:
    """Walk recent corpus-touching commits, emit one summary per commit.

    Uses ``git log --first-parent main`` to keep the rolling weekly-refresh
    line clean of branch-merge noise. Only commits whose diff changes
    ``.md`` or ``.json`` files under ``corpus/<bundle>/`` are emitted;
    pure code commits, including ones that touch only ``bundles.json``,
    are skipped (they have nothing to digest).

    Each record:

        {
          "sha": "<short sha>",
          "timestamp": "<ISO 8601 committer date (%cI), with its UTC offset>",
          "subject": "<commit subject line>",
          ... + every field from summarize_diff()
        }
    """
    # Find candidate commits. --first-parent keeps the linear refresh history
    # on main and ignores branch-side merges. We still need to filter by what
    # the commit actually touched, because non-corpus commits can land on
    # main (PR merges for code, CI tweaks, etc.).
    raw = git(
        "log",
        f"--since={history_days} days ago",
        "--first-parent",
        "main",
        "--pretty=format:%H%x09%cI%x09%s",
    )

    records: list[dict[str, Any]] = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue
        sha, ts, subject = parts

        # What did this commit actually touch? Cheap: just the name-status diff
        # against its first parent. Empty stdout = commit didn't change any
        # files we care about. Root commits (no parent) error out — suppress
        # the stderr noise and skip them.
        try:
            diff = subprocess.check_output(
                ["git", "diff", "--name-status", f"{sha}^..{sha}"],
                text=True,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.CalledProcessError:
            continue
        if not diff.strip():
            continue

        summary = summarize_diff(diff)
        # Skip pure code commits — only emit records that have actual corpus
        # content motion. This is what makes the history "interesting" for
        # the weekly digest.
        if summary["md_count"] == 0 and summary["json_count"] == 0 and not summary["new_bundles"]:
            continue

        records.append({
            "sha": sha[:12],
            "timestamp": ts,
            "subject": subject,
            **summary,
        })

    return records


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--cached", action="store_true",
                   help="Summarize staged changes instead of a ref range.")
    p.add_argument("--ref", default="HEAD^..HEAD",
                   help="Diff range to summarize (default: HEAD^..HEAD).")
    p.add_argument("--json", dest="as_json", action="store_true",
                   help="Emit one JSON object instead of the human-readable form.")
    p.add_argument("--history-out", metavar="PATH",
                   help="Walk recent corpus-touching commits and write a "
                        "JSONL history file at PATH. Overwrites if it exists. "
                        "Implies the history walker; --cached/--ref are ignored.")
    p.add_argument("--history-days", type=int, default=120,
                   help="How far back the history walker looks (default 120).")
    args = p.parse_args()

    # History-walker path: build the JSONL file consumed by the
    # weekly_digest MCP tool, then exit. CI uses this.
    if args.history_out:
        records = walk_history(args.history_days)
        # Sort by timestamp ascending so the file is roughly stable
        # across rebuilds (commits within a single run could otherwise
        # depend on git log default ordering).
        records.sort(key=lambda r: r["timestamp"])
        with open(args.history_out, "w") as fh:
            for rec in records:
                fh.write(json.dumps(rec, separators=(",", ":")) + "\n")
        # Brief stdout signal for CI logs — easy to spot in the workflow run.
        print(f"wrote {len(records)} commit record(s) to {args.history_out} "
              f"covering up to {args.history_days} days")
        return 0

    # One-shot summary path. Unchanged behavior for --cached / --ref.
    if args.cached:
        diff_args = ["diff", "--name-status", "--cached"]
    else:
        diff_args = ["diff", "--name-status", args.ref]
    diff = git(*diff_args)
    summary = summarize_diff(diff)

    if args.as_json:
        print(json.dumps(summary, separators=(",", ":")))
    else:
        print(render_human(summary))
    return 0


if __name__ == "__main__":
    sys.exit(main())
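One caveat with sorting history records by the raw `timestamp` string: `%cI` keeps each committer's UTC offset, so a plain string sort matches chronological order only when every commit carries the same offset. The sketch below illustrates the difference on two made-up records shaped like the history output. The helper name `sort_records_chronologically` and the sample shas are hypothetical and not part of `changelog.py`.

```python
from datetime import datetime


def sort_records_chronologically(records: list[dict]) -> list[dict]:
    """Sort by the parsed instant rather than by the raw ISO 8601 string."""
    return sorted(records, key=lambda r: datetime.fromisoformat(r["timestamp"]))


# Two made-up records with different UTC offsets.
records = [
    {"sha": "aaaaaaaaaaaa", "timestamp": "2026-05-24T23:30:00-04:00"},  # 03:30 UTC, May 25
    {"sha": "bbbbbbbbbbbb", "timestamp": "2026-05-25T01:00:00+00:00"},  # 01:00 UTC, May 25
]

by_string = sorted(records, key=lambda r: r["timestamp"])
by_instant = sort_records_chronologically(records)

print([r["sha"] for r in by_string])
print([r["sha"] for r in by_instant])
```

If CI always commits in UTC the two orderings agree, so this only matters once commits from machines in other time zones land on `main`.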
@@ -0,0 +1,93 @@
"""Thin dispatcher that routes ``--source <id>`` to the right per-source
scraper module.

Convention: one source per module under ``scrape.sources.<id>``. Each
module is independently runnable via ``python -m scrape.sources.<id>``
and accepts its own flags — this runner is a convenience shim for CI.

Examples:

    python -m scrape.runner --source bayer_seeds --force
    python -m scrape.runner --source golden_harvest --limit 20
    python -m scrape.runner --all   # run every GREEN source in sources.json

Anything after the recognized flags is passed through to the source
scraper, so:

    python -m scrape.runner --source bayer_seeds --force --brand dekalb

dispatches to ``scrape.sources.bayer_seeds`` with ``--force --brand
dekalb`` as argv.

Sources whose ``verdict`` in sources.json is anything other than
``"green"`` are skipped by ``--all`` (Beck's products is yellow until
the SeedIQ XHR is captured). Pass ``--source becks_products`` to run
a yellow source explicitly.
"""

from __future__ import annotations

import argparse
import importlib
import json
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]
SOURCES_JSON = REPO_ROOT / "sources.json"


def _load_sources() -> list[dict]:
    if not SOURCES_JSON.exists():
        return []
    try:
        data = json.loads(SOURCES_JSON.read_text())
        return data.get("sources", []) if isinstance(data, dict) else data
    except json.JSONDecodeError:
        return []


def _run_source(source_id: str, passthrough: list[str]) -> int:
    mod_name = f"scrape.sources.{source_id}"
    try:
        mod = importlib.import_module(mod_name)
    except ImportError as exc:
        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
        return 2
    main = getattr(mod, "main", None)
    if not callable(main):
        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
        return 2
    return int(main(passthrough) or 0)


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    parser.add_argument("--source", help="Source id (matches sources.json)")
    parser.add_argument("--all", action="store_true",
                        help="Run every GREEN source listed in sources.json")
    args, passthrough = parser.parse_known_args(argv)

    if not args.source and not args.all:
        parser.error("specify --source <id> or --all")

    sources = _load_sources()
    if args.all:
        ids = [s["name"] for s in sources if s.get("verdict") == "green"]
        if not ids:
            print("runner: no GREEN sources in sources.json", file=sys.stderr)
            return 2
    else:
        # If the source isn't registered in sources.json yet, dispatch anyway
        # so the scraper can be exercised during initial development.
        ids = [args.source]

    rc = 0
    for sid in ids:
        print(f"=== scrape.runner: dispatching to {sid} ===")
        rc |= _run_source(sid, passthrough)
    return rc


if __name__ == "__main__":
    sys.exit(main())
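The passthrough and GREEN-only behaviour described in the runner docstring can be sketched in isolation. This is a minimal illustration using the same `argparse.parse_known_args` call as the runner. The in-memory `sources` list is a hypothetical stand-in for `sources.json`, whose real contents are not shown here.

```python
import argparse

parser = argparse.ArgumentParser(prog="scrape.runner")
parser.add_argument("--source")
parser.add_argument("--all", action="store_true")

# Flags the runner does not recognize are returned untouched, in order,
# and would be handed to the source module's main().
args, passthrough = parser.parse_known_args(
    ["--source", "bayer_seeds", "--force", "--brand", "dekalb"]
)

# Hypothetical stand-in for the parsed sources.json entries.
sources = [
    {"name": "bayer_seeds", "verdict": "green"},
    {"name": "becks_products", "verdict": "yellow"},
]
green_ids = [s["name"] for s in sources if s.get("verdict") == "green"]

print(args.source, passthrough, green_ids)
```

Because `argparse` accepts unambiguous prefixes by default, a passthrough flag such as `--so` would be consumed as `--source`; per-source flags are safest when they do not abbreviate a runner flag.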
@@ -0,0 +1,34 @@
"""AgriPro scraper (Syngenta wheat brand).

Source: ``https://www.agriprowheat.com`` — Drupal Views form,
server-rendered HTML. No headless browser needed.

Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
under a separate brand and is out of scope for AgriPro.

Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
implicit in product family naming, not always a separate field).

Schema notes:
- ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
- ``relative_maturity`` and ``maturity_group`` are null for wheat
- Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
  Septoria / tan spot
- Quality: test weight, protein, falling number, straw strength

TODO: implement.
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("agripro: not implemented yet — Drupal Views form, only wheat in the corpus, no SRW (separate brand)",
          file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,56 @@
"""Bayer seeds scraper — DEKALB (corn) + Asgrow (soy) + WestBred (wheat).

Source: ``cropscience.bayer.us`` — same Next.js + ``__NEXT_DATA__``
infrastructure used by crop-chem-docs' Bayer crop-protection scraper.
That scraper is the reference; this one lifts ~80% of its plumbing
and adapts the per-product field mapping for seed schema.

Catalog index pages:
    /corn/dekalb/seed-catalog
    /soybeans/asgrow/seed-catalog
    /wheat/westbred/seed-catalog

Each catalog page is a Next.js route; the per-variety data lives in
``__NEXT_DATA__.props.pageProps.{whatever}``. The buildId in the
script tag rotates — fetch the index page first, extract the
buildId, then fetch the per-variety JSON.

Output layout:
    corpus/bayer_seeds/<source_key>.md      LLM-visible body
    corpus/bayer_seeds/<source_key>.json    Sidecar metadata

source_key convention: ``<brand>-<product-slug>`` lowercased, e.g.
``dekalb-dkc62-08rib`` or ``asgrow-ag34xf2``.

Sidecar schema (per CLAUDE.md):
    source: "bayer_seeds"
    source_key: str
    vendor: "Bayer"
    brand: "DEKALB" | "Asgrow" | "WestBred"
    product_name: str
    crop: "corn" | "soybeans" | "wheat"
    relative_maturity: int | null       # corn only
    maturity_group: float | null        # soy only
    wheat_class: str | null             # wheat only
    trait_stack: list[str]
    agronomic_ratings: dict[str, int]   # normalized 1-9 (9 = best)
    disease_ratings: dict[str, int]     # normalized 1-9 (9 = best)
    regional_recommendation: list[str]
    source_urls: list[str]
    fetched_at: str (ISO 8601 UTC)

TODO: implement. Reference: ~/github/crop-chem-docs/scrape/sources/bayer.py
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("bayer_seeds: not implemented yet — see ~/github/crop-chem-docs/scrape/sources/bayer.py for the reference Next.js extraction pattern",
          file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
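The "fetch the index page, extract the buildId, then fetch the per-variety JSON" flow described in the docstring might look roughly like the sketch below. It is not the reference scraper: the sample HTML, the `products` key under `pageProps`, and the helper names are all hypothetical, and the `/_next/data/<buildId>/<route>.json` path is the general Next.js convention, assumed here to apply to this site.

```python
import json
import re

# Next.js embeds page data in a <script id="__NEXT_DATA__"> tag as JSON.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ payload from a page's HTML."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


def next_data_url(base: str, build_id: str, route: str) -> str:
    """Build the per-route JSON URL for the current (rotating) buildId."""
    return f"{base}/_next/data/{build_id}/{route.strip('/')}.json"


# Hypothetical page body; a real run would download the catalog index page.
SAMPLE_HTML = """<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"buildId": "abc123", "props": {"pageProps": {"products": [{"name": "DKC62-08RIB"}]}}}
</script></body></html>"""

data = extract_next_data(SAMPLE_HTML)
url = next_data_url(
    "https://cropscience.bayer.us", data["buildId"], "/corn/dekalb/seed-catalog"
)
print(url)
```

Re-reading the buildId at the start of every run avoids stale JSON URLs after the site redeploys.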
@@ -0,0 +1,45 @@
"""Beck's PFR (Practical Farm Research) scraper.

Source: Public Sanity GROQ API at ``https://mc8v24rf.api.sanity.io``.
No authentication required — Beck's exposes their CMS content store
publicly. ~2,089 documents going back to 2015.

Sanity query endpoint:
    ``/v1/data/query/production?query=<groq>``

Useful GROQ for PFR docs (the projectId / dataset are public):

    *[_type == "pfrStudy"] {
      _id, title, year, crop, slug, summary, body, attachments
    }

Records are research studies, not variety identity — head-to-head
yield trials, fungicide timing, planting-date studies, hybrid-by-
population, biological seed treatments, etc.

Treat differently from variety scrapers:
- One record per study, not per variety
- chunk_0 preamble includes the study's tl;dr finding (extract from
  the ``summary`` field if present, or first paragraph of ``body``)
- Crop tag (corn/soy/wheat) for filtering
- Year tag — older PFR studies are still relevant but search should
  let the user weight recency

Polite rate limit: Sanity is generous but no auth means we should
keep concurrency ≤4 and pause ~250ms between batches.

TODO: implement.
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("becks_pfr: not implemented yet — public Sanity GROQ at mc8v24rf.api.sanity.io, ~2089 research docs",
          file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
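Building the query URL for the endpoint named in the docstring is mostly a matter of URL-encoding the GROQ string. The sketch below does only that and makes no network call. The helper name `build_query_url` is hypothetical, and the `pfrStudy` type and field names are copied from the docstring rather than verified against the live dataset.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SANITY_BASE = "https://mc8v24rf.api.sanity.io"
QUERY_PATH = "/v1/data/query/production"


def build_query_url(groq: str) -> str:
    """Return the GET URL that carries a GROQ query in the `query` parameter."""
    return f"{SANITY_BASE}{QUERY_PATH}?{urlencode({'query': groq})}"


groq = (
    '*[_type == "pfrStudy"] '
    "{ _id, title, year, crop, slug, summary, body, attachments }"
)
url = build_query_url(groq)

# Round-trip check: decoding the query string yields the original GROQ.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

A real scraper would fetch this URL in batches and apply the concurrency and pause limits from the docstring.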
@@ -0,0 +1,46 @@
"""Beck's product catalog scraper (identity-only until SeedIQ XHR sniff lands).

Source: Same public Sanity GROQ API as ``becks_pfr`` (no auth).
Expected count: ~860 products (corn + soy + wheat).

Current limitation: Beck's exposes IDENTITY fields publicly (product
name, RM/MG, basic trait stack) but routes the AGRONOMIC + DISEASE
ratings through their SeedIQ application, which is gated behind a
browser session cookie. The public Sanity records do not include
ratings.

What we CAN ship without SeedIQ:
- Product identity for confirmation ("yes Beck's has hybrid X at RM 112")
- RM (corn) / MG (soy) / class (wheat)
- Trait stack
- Basic descriptive text

What needs the SeedIQ XHR endpoint (BLOCKED on user sniff):
- Disease ratings (GLS, NCLB, Goss's, etc.)
- Agronomic ratings (standability, drought, etc.)
- Regional recommendations

For now this scraper is DEFERRED. Run when:
- User captures the SeedIQ XHR URL + cookie/header pattern from
  browser dev tools, OR
- We decide to ship Beck's as identity-only and let the LLM say
  "Beck's has this hybrid; ask your Beck's rep for full agronomic
  ratings" (less useful but avoids the empty-data UX).

Yellow verdict in sources.json reflects this — ``--all`` skips it.

TODO: implement (deferred).
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("becks_products: deferred — SeedIQ XHR sniff required for ratings, run only if user has captured the endpoint",
          file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,42 @@
"""Golden Harvest scraper (Syngenta brand).

Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
every variety page. Server-rendered HTML — no headless browser
required. Tech-sheet PDFs live on the Syngenta CDN at
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
fetcher pattern as NK.

Two gotchas:

1. **Sitemap PDF dates are stale** (the sitemap was generated
   2025-03-31 and never updated). Resolve the LIVE PDF URL from the
   product HTML page, not from the sitemap entry.

2. **Disease scale is reversed.** Golden Harvest publishes ratings
   on a 9-to-1 scale (1 = best, 9 = worst). Bayer/NK/AgriPro use
   1-9 (9 = best). Normalize at chunk time so the corpus has a
   single direction. Record the original direction in the chunk_0
   preamble: "Note: ratings normalized to 1-9 (9 = best). Golden
   Harvest publishes on a 9-to-1 scale natively."

Expected count: ~175 varieties (89 corn + 86 soy). No wheat.

Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
yield trial records. Out of scope for v1 but a high-value future
ingest for regional placement recommendations.

TODO: implement. Reuse the PDF-fetch helper that NK uses.
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("golden_harvest: not implemented yet — see CLAUDE.md for the disease-scale-reversal gotcha and the live-PDF-URL-resolution requirement",
          file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
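The scale normalization in the second gotcha comes down to flipping a 1-9 rating around its midpoint. The sketch below assumes the vendor's native scale treats 1 as best, which is the reading under which the scale counts as reversed. The `scale_direction` values `"high_is_best"` and `"low_is_best"`, the function name, and the sample rating keys are hypothetical; the real `_scale_direction` vocabulary is not shown in this commit.

```python
def normalize_rating(raw: int, scale_direction: str) -> int:
    """Map a vendor rating onto the corpus convention: 1-9, 9 = best."""
    if not 1 <= raw <= 9:
        raise ValueError(f"rating out of range: {raw}")
    if scale_direction == "high_is_best":
        return raw          # already in corpus direction
    if scale_direction == "low_is_best":
        return 10 - raw     # 1 -> 9, 5 -> 5, 9 -> 1
    raise ValueError(f"unknown scale direction: {scale_direction}")


# Hypothetical raw sidecar values on a 1 = best scale.
raw_ratings = {"gray_leaf_spot": 2, "goss_wilt": 7}
normalized = {
    name: normalize_rating(value, "low_is_best")
    for name, value in raw_ratings.items()
}
print(normalized)
```

Keeping the raw values in the sidecar and normalizing only in the chunker means a wrong direction flag can be corrected later without re-scraping.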
@@ -0,0 +1,35 @@
"""NK scraper (Syngenta brand).

Source: ``https://www.syngenta-us.com`` — static HTML product pages
plus tech-sheet PDFs on the Syngenta CDN at
``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.

Expected count: 29 varieties (12 corn + 17 soy). No wheat.

The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
helper module under ``scrape.sources._syngenta_pdf`` once both
scrapers are written.

Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
have marketing copy only). Use pdfplumber for table extraction.

Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
etc. — additional supplemental context worth ingesting once the
per-variety scrape is solid.

TODO: implement.
"""
from __future__ import annotations

import sys


def main(argv: list[str] | None = None) -> int:
    print("nk: not implemented yet — disease/agronomic ratings come from CDN tech-sheet PDFs only, use pdfplumber",
          file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
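A shared helper for the `<CODE>_YYMMDD.pdf` convention could start from a filename parser like the one sketched below. The function name, the regular expression, and the product code `NK1234` are hypothetical; only the URL prefix and the filename pattern come from the docstring.

```python
import re
from datetime import date, datetime

# Last path segment: <CODE>_<YYMMDD>.pdf
TECHSHEET_RE = re.compile(r"(?P<code>[^/]+)_(?P<stamp>\d{6})\.pdf$")


def parse_techsheet_url(url: str) -> tuple[str, date]:
    """Split a tech-sheet URL into its product code and revision date."""
    match = TECHSHEET_RE.search(url)
    if match is None:
        raise ValueError(f"not a tech-sheet URL: {url}")
    stamp = datetime.strptime(match.group("stamp"), "%y%m%d").date()
    return match.group("code"), stamp


code, published = parse_techsheet_url(
    "https://assets.syngentaebiz.com/pdf/techsheets/NK1234_250331.pdf"
)
print(code, published.isoformat())
```

Comparing the parsed date across runs gives a cheap way to skip PDFs that have not been reissued.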
@@ -0,0 +1,167 @@
"""Gitea container-registry garbage collection.

Prunes old container tags from a Gitea registry package. Always
preserves:

- The ``latest`` tag (Watchtower auto-pull target)
- Any ``corpus-*`` tag (production pins; Drawbar may have them locked)
- The ``--keep-latest`` most-recent OTHER tags (typically commit-sha pins)
- Anything pushed within ``--keep-days`` days

The actual disk reclaim happens on Gitea's next package GC cron
(admin site settings). This script marks versions for deletion.

Why this script doesn't use the Docker Registry v2 API: that API has
tag listing + manifest delete by digest, but no per-tag created-at
timestamp without an extra blob-fetch round-trip. Gitea's packages
API gives us {tag, created_at} in one call, which is what the keep
policy needs.

The endpoint shape that actually works (matches Gitea 1.21+):

    GET /api/v1/packages/{owner}?type=container&q={name}
      → JSON array, ONE entry per tag, each with id + version=tag + created_at
    DELETE /api/v1/packages/{owner}/container/{name}/{tag}
      → 204 on success, 404 if already gone

Auth: GITEA_TOKEN env var (PAT with delete:packages scope; the
push-only PAT we use as REGISTRY_TOKEN may not be enough — if you
see 403s, mint a separate PAT and pass it as GITEA_TOKEN here).

Usage:

    python scripts/registry_gc.py \\
        --owner justin \\
        --package crop-chem-docs \\
        --keep-days 180 \\
        --keep-latest 6 \\
        [--dry-run]
"""
from __future__ import annotations

import argparse
import json
import os
import sys
from datetime import datetime, timedelta, timezone
from urllib.error import HTTPError
from urllib.request import Request, urlopen


GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")


def api(token: str, method: str, path: str) -> object:
    # User-Agent matters: Cloudflare in front of git.jpaul.io returns
    # 403 to the default `Python-urllib/3.x` UA. Any non-Python UA
    # passes. Curl works, requests works, we just need to not look
    # like a vanilla urllib script.
    req = Request(
        f"{GITEA_HOST}{path}",
        headers={
            "Authorization": f"token {token}",
            "User-Agent": "crop-chem-docs-registry-gc/0.1",
        },
        method=method,
    )
    try:
        with urlopen(req, timeout=30) as r:
            body = r.read()
            return json.loads(body) if body else None
    except HTTPError as e:
        if e.code == 404:
            return None
        raise


def _parse_created(version: dict) -> datetime:
    """Gitea returns RFC3339 with offset like '2026-05-24T16:07:50-04:00'.
    Python 3.11+ handles this directly via fromisoformat."""
    return datetime.fromisoformat(version["created_at"])


def main() -> int:
|
||||||
|
p = argparse.ArgumentParser()
|
||||||
|
p.add_argument("--owner", required=True)
|
||||||
|
p.add_argument("--package", required=True)
|
||||||
|
p.add_argument("--keep-days", type=int, default=180)
|
||||||
|
p.add_argument("--keep-latest", type=int, default=6,
|
||||||
|
help="Keep this many most-recent commit-sha (etc.) "
|
||||||
|
"tags BEFORE applying --keep-days. corpus-* and "
|
||||||
|
":latest are kept regardless.")
|
||||||
|
p.add_argument("--dry-run", action="store_true",
|
||||||
|
help="Show what would be deleted without calling DELETE.")
|
||||||
|
args = p.parse_args()
|
||||||
|
|
||||||
|
token = os.environ.get("GITEA_TOKEN")
|
||||||
|
if not token:
|
||||||
|
print("GITEA_TOKEN env var not set", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
# Gitea's q= is a substring match; filter to exact name so we don't
|
||||||
|
# accidentally GC a sibling package that shares the prefix.
|
||||||
|
versions = api(
|
||||||
|
token, "GET",
|
||||||
|
f"/api/v1/packages/{args.owner}?type=container&q={args.package}",
|
||||||
|
) or []
|
||||||
|
versions = [v for v in versions if v.get("name") == args.package]
|
||||||
|
|
||||||
|
if not versions:
|
||||||
|
        print(f"no versions found for {args.owner}/{args.package} — nothing to GC")
        return 0

    cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
    versions.sort(key=_parse_created, reverse=True)  # newest first

    keep: list[tuple[str, str]] = []  # (tag, reason)
    delete: list[dict] = []
    other_kept = 0

    for v in versions:
        tag = v.get("version", "")
        created = _parse_created(v)
        if tag == "latest":
            keep.append((tag, "always-keep (:latest)"))
            continue
        if tag.startswith("corpus-"):
            keep.append((tag, "production pin (corpus-*)"))
            continue
        if other_kept < args.keep_latest:
            other_kept += 1
            keep.append((tag, f"keep-latest #{other_kept}/{args.keep_latest}"))
            continue
        if created >= cutoff:
            keep.append((tag, f"within --keep-days ({args.keep_days})"))
            continue
        delete.append(v)

    print(f"=== {args.owner}/{args.package}: {len(versions)} total tag(s) ===")
    for tag, reason in keep:
        print(f" KEEP {tag:<28} {reason}")
    for v in delete:
        print(f" DEL {v['version']:<28} created={v['created_at']}")

    if not delete:
        print("nothing to delete")
        return 0
    if args.dry_run:
        print(f"--dry-run; would delete {len(delete)} tag(s)")
        return 0

    failed = 0
    for v in delete:
        tag = v["version"]
        try:
            api(token, "DELETE",
                f"/api/v1/packages/{args.owner}/container/{args.package}/{tag}")
            print(f" ✓ deleted {tag}")
        except HTTPError as e:
            print(f" ✗ failed {tag}: HTTP {e.code} {e.reason}", file=sys.stderr)
            failed += 1

    print(f"done: deleted {len(delete) - failed} / {len(delete)} tag(s)")
    return 0 if failed == 0 else 1


if __name__ == "__main__":
    sys.exit(main())
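The retention rules in the loop above can be read as a pure function, which makes them easy to check without touching a registry. This is a minimal sketch under stated assumptions, not the script's actual code: `classify_tags`, its parameters, the `(tag, created)` tuple input, and the `sha-*` tag names are all hypothetical and introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone


def classify_tags(tags, keep_latest, keep_days, now):
    """Split (tag, created) pairs into kept and deleted tag names.

    Hypothetical helper mirroring the loop above: 'latest' and 'corpus-*'
    are always kept, then the newest `keep_latest` other tags, then anything
    created within `keep_days`; everything else is deleted.
    """
    cutoff = now - timedelta(days=keep_days)
    keep, delete = [], []
    other_kept = 0
    # Newest first, so the keep_latest budget goes to the most recent tags.
    for tag, created in sorted(tags, key=lambda t: t[1], reverse=True):
        if tag == "latest" or tag.startswith("corpus-"):
            keep.append(tag)
        elif other_kept < keep_latest:
            other_kept += 1
            keep.append(tag)
        elif created >= cutoff:
            keep.append(tag)
        else:
            delete.append(tag)
    return keep, delete


# Made-up sample data: one pinned corpus tag, three ordinary build tags.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
tags = [
    ("latest", now),
    ("corpus-2026.05.01", now - timedelta(days=31)),
    ("sha-aaa", now - timedelta(days=1)),
    ("sha-bbb", now - timedelta(days=3)),
    ("sha-ccc", now - timedelta(days=40)),
]
keep, delete = classify_tags(tags, keep_latest=1, keep_days=7, now=now)
print("keep:  ", keep)
print("delete:", delete)
```

With these inputs only `sha-ccc` is marked for deletion: it is neither pinned, nor the newest ordinary tag, nor inside the 7-day window. The 31-day-old `corpus-*` tag survives because pins are checked before the age rule.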
@@ -0,0 +1,251 @@
"""Summarize usage logs from docs_mcp.usage into a quick scan.

Reads one or more usage.jsonl* files and prints sections for:

  - per-tool call counts
  - top search_docs queries by frequency
  - 0-hit queries (where we returned nothing — high-signal for tuning)
  - filter usage histogram (which version / platform / bundle filters get hit)
  - reranker effectiveness (calls where the reranker fired vs not)
  - hybrid retrieval top-1 attribution (dense vs bm25 vs both)

Usage:

    # Default: read /app/var/logs in the production container
    python scripts/usage_report.py --logs-dir /path/to/usage/logs

    # Last N days only:
    python scripts/usage_report.py --logs-dir <dir> --since 7d

    # Markdown output (for piping into a weekly digest email, etc):
    python scripts/usage_report.py --logs-dir <dir> --format markdown

The script doesn't depend on anything in the docs_mcp package — it's a
standalone tool that can run anywhere with the log files available
(scp them off the host, point it at the directory).

----------------------------------------------------------------------
FOLLOW-UP CHECKS
----------------------------------------------------------------------

Pattern: when you ship a retrieval change with a hypothesis attached
(e.g. "hybrid will rescue queries dense misses"), add a note HERE
describing what the usage report should show and at what threshold
the change earns its keep. Future-you running the report a month
later will be glad. Example:

    Q: Does the dense leg of hybrid retrieval earn its keep on
       real traffic, or could we simplify to BM25-only?

    - bm25_only >= 80%%        --> dense not doing much; consider
                                   simplifying to BM25 mode
    - both >= 50%%             --> hybrid is tie-breaking; keep it
    - dense_only > bm25_only   --> dense is the workhorse; keep

Also worth a glance every month:

    - 0-hit queries list (tuning candidates)
    - reranker p95 latency drift (slow reranker = bad UX)
    - filter usage (does anyone actually use version/platform
      filters? if not, simplify the tool surface)
"""
from __future__ import annotations

import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Iterable


def parse_since(s: str | None) -> datetime | None:
    """Accept '7d', '24h', '30m', or an ISO timestamp. None → no cutoff."""
    if not s:
        return None
    m = re.fullmatch(r"(\d+)([dhm])", s)
    if m:
        n, unit = int(m.group(1)), m.group(2)
        delta = {"d": timedelta(days=n), "h": timedelta(hours=n), "m": timedelta(minutes=n)}[unit]
        return datetime.now(timezone.utc) - delta
    return datetime.fromisoformat(s.replace("Z", "+00:00"))


def load_events(logs_dir: Path, since: datetime | None) -> Iterable[dict[str, Any]]:
    """Yield every JSONL record across all files in logs_dir."""
    if not logs_dir.exists():
        print(f"warning: logs dir {logs_dir} does not exist", file=sys.stderr)
        return
    # usage.jsonl is the active file; usage.jsonl.YYYY-MM-DD are rotated.
    files = sorted(logs_dir.glob("usage.jsonl*"))
    for f in files:
        with open(f) as fh:
            for ln, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError as e:
                    print(f" ! skipping {f}:{ln}: {e}", file=sys.stderr)
                    continue
                if since:
                    ts = rec.get("ts", "")
                    try:
                        rec_ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    except ValueError:
                        continue
                    if rec_ts < since:
                        continue
                yield rec


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--logs-dir", type=Path, default=Path("/app/var/logs"),
                   help="directory with usage.jsonl* files")
    p.add_argument("--since", default=None,
                   help="time window: '7d', '24h', '30m', or ISO timestamp")
    p.add_argument("--top", type=int, default=25,
                   help="how many top queries / filters to show")
    p.add_argument("--format", choices=("text", "markdown"), default="text")
    args = p.parse_args()

    since = parse_since(args.since)
    events = list(load_events(args.logs_dir, since))
    if not events:
        print("(no events in window)")
        return 0

    print(f"# Usage report — {len(events)} events"
          + (f" since {since.isoformat()}" if since else "")
          + f" from {args.logs_dir}")
    print()

    # 1. Per-tool counts
    by_tool = Counter(e["tool"] for e in events)
    print("## Per-tool call counts")
    print()
    if args.format == "markdown":
        print("| tool | calls |")
        print("|---|---|")
        for tool, n in by_tool.most_common():
            print(f"| `{tool}` | {n} |")
    else:
        for tool, n in by_tool.most_common():
            print(f" {tool:<25s} {n:>6d}")
    print()

    # 2. Top search_docs queries
    search_events = [e for e in events if e["tool"] == "search_docs"]
    queries = Counter(e["args"].get("query", "") for e in search_events)
    print(f"## Top {args.top} search_docs queries (of {len(search_events)} searches)")
    print()
    if args.format == "markdown":
        print("| count | query |")
        print("|---|---|")
        for q, n in queries.most_common(args.top):
            print(f"| {n} | `{q}` |")
    else:
        for q, n in queries.most_common(args.top):
            print(f" {n:>5d} {q!r}")
    print()

    # 3. 0-hit queries — the highest-signal data for tuning
    zero_hit = [e for e in search_events if e.get("hits_returned") == 0]
    zero_q = Counter(e["args"].get("query", "") for e in zero_hit)
    print(f"## 0-hit queries ({len(zero_hit)} of {len(search_events)} searches returned nothing)")
    print()
    if zero_q:
        if args.format == "markdown":
            print("| count | query | filters |")
            print("|---|---|---|")
            # Group by query, show filter examples for each
            examples_by_query: dict[str, list[dict]] = defaultdict(list)
            for e in zero_hit:
                examples_by_query[e["args"].get("query", "")].append(e["args"])
            for q, n in zero_q.most_common(args.top):
                ex = examples_by_query[q][0]
                f = {k: v for k, v in ex.items()
                     if k in ("version", "platform", "bundle_id") and v}
                print(f"| {n} | `{q}` | `{f}` |")
        else:
            for q, n in zero_q.most_common(args.top):
                print(f" {n:>5d} {q!r}")
    else:
        print(" _(no 0-hit queries in window)_")
    print()

    # 4. Filter usage
    filter_use = Counter()
    for e in search_events:
        a = e["args"]
        v = a.get("version")
        p_ = a.get("platform")
        b = a.get("bundle_id")
        if v:
            filter_use[f"version={v}"] += 1
        if p_:
            filter_use[f"platform={p_}"] += 1
        if b:
            filter_use[f"bundle_id={b}"] += 1
        if not (v or p_ or b):
            filter_use["(no filter)"] += 1
    print(f"## search_docs filter usage")
    print()
    if args.format == "markdown":
        print("| filter | count |")
        print("|---|---|")
        for f, n in filter_use.most_common(args.top):
            print(f"| `{f}` | {n} |")
    else:
        for f, n in filter_use.most_common(args.top):
            print(f" {n:>5d} {f}")
    print()

    # 5. Reranker effectiveness
    reranked = [e for e in search_events if e.get("reranked") is True]
    dense_only = [e for e in search_events if e.get("reranked") is False]
    print(f"## Reranker activity")
    print()
    print(f" reranked: {len(reranked):>5d}")
    print(f" dense only: {len(dense_only):>5d} (filter too narrow or 0 results)")
    if reranked:
        elapsed = [e["elapsed_ms"] for e in reranked if e.get("elapsed_ms") is not None]
        if elapsed:
            elapsed.sort()
            p50 = elapsed[len(elapsed) // 2]
            p95 = elapsed[int(len(elapsed) * 0.95)]
            print(f" reranked latency p50: {p50:.0f} ms, p95: {p95:.0f} ms")
    print()

    # 6. Hybrid retrieval activity — which retriever contributed the top-1?
    # Empty unless HYBRID_SEARCH=true is set on the MCP container.
    hybrid_events = [e for e in search_events if e.get("retrieval_mode") == "hybrid"]
    if hybrid_events:
        by_source = Counter(e.get("top1_source") for e in hybrid_events
                            if e.get("top1_source"))
        print("## Hybrid retrieval — top-1 attribution")
        print()
        print(f" hybrid mode events: {len(hybrid_events)}")
        total = sum(by_source.values()) or 1
        for src in ("both", "dense_only", "bm25_only"):
            n = by_source.get(src, 0)
            pct = 100.0 * n / total
            label = {
                "both": "in BOTH retrievers' top-N",
                "dense_only": "dense found it, BM25 didn't",
                "bm25_only": "BM25 found it, dense didn't",
            }[src]
            print(f" {src:<11s} {n:>5d} ({pct:5.1f}%) — {label}")
        rescued = by_source.get("bm25_only", 0)
        if rescued and total:
            print(f"\n → {rescued} ({100.0 * rescued / total:.1f}%) of hybrid queries had the top-1 "
                  "result that ONLY BM25 surfaced. Without hybrid those would have been dense-misses.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
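The FOLLOW-UP CHECKS note in the module docstring gives example thresholds for reading the top-1 attribution section. As a rough sketch of how those thresholds might be applied mechanically: `attribution_verdict` is a hypothetical helper, the order in which the rules are checked and the "inconclusive" fallback are assumptions, and the sample events are invented. Only the field names `retrieval_mode` and `top1_source` come from the script itself.

```python
from collections import Counter


def attribution_verdict(by_source):
    """Apply the docstring's example thresholds to top-1 attribution counts.

    Hypothetical helper: rule order and the fallback value are assumptions.
    """
    total = sum(by_source.values())
    if total == 0:
        return "no hybrid events"
    share = {k: by_source.get(k, 0) / total
             for k in ("both", "dense_only", "bm25_only")}
    if share["bm25_only"] >= 0.80:
        return "dense not doing much; consider BM25-only"
    if share["both"] >= 0.50:
        return "hybrid is tie-breaking; keep it"
    if share["dense_only"] > share["bm25_only"]:
        return "dense is the workhorse; keep"
    return "inconclusive"


# Invented events carrying only the fields the attribution section reads.
events = (
    [{"retrieval_mode": "hybrid", "top1_source": "both"}] * 60
    + [{"retrieval_mode": "hybrid", "top1_source": "dense_only"}] * 25
    + [{"retrieval_mode": "hybrid", "top1_source": "bm25_only"}] * 15
)
by_source = Counter(e.get("top1_source") for e in events if e.get("top1_source"))
verdict = attribution_verdict(by_source)
print(verdict)
```

Keeping the decision in one small function means the thresholds live next to the hypothesis they test, and a monthly run can print the verdict alongside the raw percentages.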
@@ -0,0 +1,89 @@
{
  "_description": "seed-mcp source catalog. Each scraper module under scrape/sources/ corresponds to one entry. Run via `python -m scrape.runner --source <name>`. The MCP container bakes this file in so corpus_status / list_versions can reflect provenance without re-scraping.",
  "_pioneer_excluded": "Pioneer (Corteva) is intentionally absent. Per their ToS: 'you shall not use any manual or automated software, devices or other processes (including but not limited to spiders, robots, scrapers, crawlers, avatars, data mining tools or the like) to scrape or download data from the Services'. The MCP returns a curated fallback lesson directing the user to pioneer.com / a local dealer.",
  "sources": [
    {
      "name": "bayer_seeds",
      "vendor": "Bayer",
      "brands": ["DEKALB", "Asgrow", "WestBred"],
      "crops": ["corn", "soybeans", "wheat"],
      "verdict": "green",
      "expected_count": 475,
      "base_url": "https://cropscience.bayer.us",
      "scope_filter": "All listed varieties; no regional filter applied at scrape time (regional recommendations parsed into sidecar so the MCP can filter at search time).",
      "tos_check_date": "2026-05-24",
      "tos_note": "robots.txt explicitly whitelists RAG/LLM use cases. Same legal stance as crop-chem-docs scraper."
    },
    {
      "name": "golden_harvest",
      "vendor": "Syngenta",
      "brands": ["Golden Harvest"],
      "crops": ["corn", "soybeans"],
      "verdict": "green",
      "expected_count": 175,
      "base_url": "https://www.goldenharvestseeds.com",
      "scope_filter": "All sitemap-listed corn + soybean varieties.",
      "tos_check_date": "2026-05-25",
      "schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) — resolve live URL from product HTML, not sitemap entry."
    },
    {
      "name": "nk",
      "vendor": "Syngenta",
      "brands": ["NK"],
      "crops": ["corn", "soybeans"],
      "verdict": "green",
      "expected_count": 29,
      "base_url": "https://www.syngenta-us.com",
      "pdf_cdn": "https://assets.syngentaebiz.com/pdf/techsheets/",
      "scope_filter": "All NK corn + soy varieties. No wheat (NK doesn't sell wheat in US).",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only — need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
    },
    {
      "name": "agripro",
      "vendor": "Syngenta",
      "brands": ["AgriPro"],
      "crops": ["wheat", "barley"],
      "verdict": "green",
      "expected_count": 24,
      "base_url": "https://www.agriprowheat.com",
      "scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Drupal Views form; server-rendered HTML. CoAXium trait flag is implicit in product family; Clearfield/CL2 trait IS in this catalog."
    },
    {
      "name": "becks_pfr",
      "vendor": "Beck's Hybrids",
      "brands": ["Beck's PFR"],
      "crops": ["corn", "soybeans", "wheat"],
      "verdict": "yellow",
      "expected_count": 2089,
      "base_url": "https://www.beckshybrids.com",
      "api_base": "https://mc8v24rf.api.sanity.io",
      "scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials — fungicide timing, planting-date studies, hybrid-by-population, etc.",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records — the chunk_0 includes the study's tl;dr finding."
    },
    {
      "name": "becks_products",
      "vendor": "Beck's Hybrids",
      "brands": ["Beck's"],
      "crops": ["corn", "soybeans", "wheat"],
      "verdict": "yellow",
      "expected_count": 860,
      "base_url": "https://www.beckshybrids.com",
      "api_base": "https://mc8v24rf.api.sanity.io",
      "scope_filter": "All Beck's product records — corn + soy + wheat. Identity + RM/MG only.",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Sanity GROQ exposes identity (name, RM/MG, basic traits) but agronomic + disease ratings are SeedIQ-gated (requires browser cookie). Deferred until the SeedIQ XHR endpoint is captured from a logged-in browser session. Without ratings, products are reference-only; the MCP can confirm 'Beck's has hybrid X at RM 112 with Enlist trait' but not 'rate it against drought'."
    }
  ],
  "_excluded_sources": [
    {
      "name": "pioneer",
      "vendor": "Corteva",
      "verdict": "red",
      "reason": "Explicit ToS prohibits automated scraping. Dealer locator at /us/sales-representatives/my-local-team.html is login-gated; no public API for dealer contact info. The MCP returns a curated fallback lesson instead of erroring."
    }
  ]
}
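The commit message describes the runner's `--all` flag as GREEN-only. One way such a selection could be derived from this catalog is sketched below. This is an illustration, not the code in `scrape/runner.py`: `green_sources` is a hypothetical helper, and the inline catalog is a trimmed copy of two entries from the file above so the snippet runs without the file on disk.

```python
import json

# Trimmed stand-in for sources.json; a real caller would read the file instead.
catalog = json.loads("""
{
  "sources": [
    {"name": "bayer_seeds", "verdict": "green", "expected_count": 475},
    {"name": "becks_pfr", "verdict": "yellow", "expected_count": 2089}
  ],
  "_excluded_sources": [
    {"name": "pioneer", "verdict": "red"}
  ]
}
""")


def green_sources(catalog):
    """Return names of sources whose verdict is 'green' (hypothetical helper).

    Only the "sources" list is consulted, so entries under
    "_excluded_sources" can never be selected by accident.
    """
    return [s["name"] for s in catalog["sources"] if s.get("verdict") == "green"]


selected = green_sources(catalog)
print(selected)
```

Filtering on `verdict` rather than keeping a separate allow-list means that promoting a yellow source only requires changing one field in the catalog.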