seed-mcp scaffold: clone docs-mcp-template, customize for crop_seed PRODUCT_NAME

Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker DNS
  defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source schema
  notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,becks_products}.py:
  stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:28:49 -04:00
commit ac40e05734
35 changed files with 3833 additions and 0 deletions
@@ -0,0 +1,117 @@
 name: Image rebuild (skip scrape)
 # Fast path for code-only changes. Skips the scrape and goes straight
 # to: rebuild indexes (from corpus already committed on main) + image
 # build + push. Runtime ~10 min vs ~2-3 h for the full monthly refresh.
 #
 # Use when a PR only changes code/config — anything where the upstream
 # seed catalogs haven't moved but we want the new Python in the
 # running image.
 on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - "docs_mcp/**"
      - "rag/**"
      - "scrape/**"
      - "requirements.txt"
      - "Dockerfile"
      - "sources.json"
 # If multiple pushes land in quick succession, cancel the older one
 # rather than queueing both — each run is non-trivial and the older
 # commit's image just gets overwritten by the newer one anyway.
 concurrency:
  group: image-only
  cancel-in-progress: true
 env:
  REGISTRY_PUSH: 192.168.0.2:1234
  REGISTRY_PULL: git.jpaul.io
  IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
  OLLAMA_URL: http://192.168.0.2:11434,http://192.168.0.2:11435,http://192.168.0.125:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: crop_seed
 jobs:
  build:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt
      - name: Verify committed corpus is present
        run: |
          test -d corpus || { echo "ERROR: corpus/ missing on this ref"; exit 1; }
          n_md=$(find corpus -name '*.md' | wc -l)
          n_json=$(find corpus -name '*.json' | wc -l)
          echo "corpus: $(du -sh corpus | cut -f1) on disk, ${n_md} .md / ${n_json} .json"
      - name: Rebuild indexes from committed corpus
        run: python -m rag.index --rebuild
      - name: Log in to Gitea container registry
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
      - name: Build & push image
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          CORPUS_TAG="corpus-$(date -u +%Y.%m.%d)"
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}"
      - name: Link container package to this repo
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          OWNER="${{ github.repository_owner }}"
          PKG="${{ github.event.repository.name }}"
          BODY=$(mktemp)
          CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
            -H "Authorization: token ${GITEA_TOKEN}" \
            "https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
          echo "link http=$CODE  body=$(cat "$BODY")"
          case "$CODE" in
            201) echo "linked package to ${OWNER}/${PKG}" ;;
            400) echo "already linked — ok" ;;
            *)   echo "unexpected status $CODE"; exit 1 ;;
          esac
      - name: Prune old container versions
        # GC requires broader scope than REGISTRY_TOKEN's push perms
        # (HTTP 403 on /packages/.../versions). Non-critical —
        # housekeeping only. Don't fail the whole run.
        # TODO: issue separate PAT with admin:package scope and set
        # as PACKAGES_ADMIN_TOKEN.
        continue-on-error: true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner "${{ github.repository_owner }}" \
            --package "${{ github.event.repository.name }}" \
            --keep-days 180 \
            --keep-latest 6
@@ -0,0 +1,186 @@
 name: Monthly seed catalog refresh
 # Runs the full pipeline: scrape all GREEN sources → rebuild indexes
 # → push image. Cron'd once a month (1st @ 06:00 UTC). Skip the
 # reindex + image-push if the scrape produced no diff against the
 # committed corpus.
 #
 # Seed catalogs move slowly (vendors release new hybrids 1-2x/year
 # at field-day timing); monthly cadence is plenty.
 #
 # Total runtime budget: ~2-3 h end-to-end across all 5 GREEN sources.
 # Bayer is the longest (~475 varieties, ~45 min). Beck's PFR is the
 # heaviest single-source (~2,089 docs via Sanity GROQ pagination).
 on:
  schedule:
    - cron: "0 6 1 * *"     # 1st of each month, 06:00 UTC
  workflow_dispatch:
    inputs:
      force_build:
        description: "Rebuild indexes + push image even if corpus is unchanged"
        type: boolean
        default: false
      sources:
        description: "Sources to scrape (comma-separated, blank = all GREEN)"
        type: string
        default: ""
 env:
  # Self-hosted Gitea registry on the same LAN as the runner.
  # CF caps push body at 100 MB, so push via LAN endpoint; pull
  # through the public hostname (response bodies aren't capped).
  REGISTRY_PUSH: 192.168.0.2:1234
  REGISTRY_PULL: git.jpaul.io
  IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
  # Embedder pool. Two Ollama instances on the Gitea/runner host
  # (one per GPU) + the Windows Ollama. Trashpanda's Ollama is
  # production-shared with Drawbar; CI does NOT hit it.
  OLLAMA_URL: http://192.168.0.2:11434,http://192.168.0.2:11435,http://192.168.0.125:11434
  EMBED_MODEL: nomic-embed-text
  PRODUCT_NAME: crop_seed
 jobs:
  refresh:
    runs-on: docker
    container:
      image: catthehacker/ubuntu:act-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Full history — required for the digest-history step
          # to walk git log. Default fetch-depth: 1 silently
          # produces a 0-byte history file.
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install -q --upgrade pip
          python -m pip install -q -r requirements.txt
      # ---- Phase 1: scrape ---------------------------------------
      - name: Scrape Bayer seeds (DEKALB + Asgrow + WestBred)
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'bayer_seeds') }}
        run: python -m scrape.runner --source bayer_seeds --force
      - name: Scrape Golden Harvest
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'golden_harvest') }}
        run: python -m scrape.runner --source golden_harvest --force
      - name: Scrape NK
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'nk') }}
        run: python -m scrape.runner --source nk --force
      - name: Scrape AgriPro
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'agripro') }}
        run: python -m scrape.runner --source agripro --force
      - name: Scrape Beck's PFR research corpus
        if: ${{ inputs.sources == '' || contains(inputs.sources, 'becks_pfr') }}
        # Heaviest source — ~2,089 docs via public Sanity GROQ.
        # No auth, but rate-limit ourselves to be polite.
        run: python -m scrape.runner --source becks_pfr --force
      # ---- Commit corpus changes + retry-on-race -----------------
      - name: Commit corpus changes (if any)
        id: commit
        run: |
          git config user.name "seed-mcp-refresh"
          git config user.email "actions@jpaul.io"
          git add sources.json corpus
          if git diff --cached --quiet; then
            echo "no corpus changes — skipping reindex and image build"
            echo "changed=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "changed=true" >> "$GITHUB_OUTPUT"
          ts=$(date -u +"%Y-%m-%dT%H:%MZ")
          n_bayer=$(find corpus/bayer_seeds -name '*.json' 2>/dev/null | wc -l)
          n_gh=$(find corpus/golden_harvest -name '*.json' 2>/dev/null | wc -l)
          n_nk=$(find corpus/nk -name '*.json' 2>/dev/null | wc -l)
          n_ag=$(find corpus/agripro -name '*.json' 2>/dev/null | wc -l)
          n_pfr=$(find corpus/becks_pfr -name '*.json' 2>/dev/null | wc -l)
          git commit -m "monthly refresh: ${ts} — bayer=${n_bayer} gh=${n_gh} nk=${n_nk} agripro=${n_ag} pfr=${n_pfr}"
          attempt=1
          while [ $attempt -le 3 ]; do
            if git push; then
              echo "pushed corpus changes (attempt $attempt)"
              break
            fi
            if [ $attempt -eq 3 ]; then
              echo "push still failing after 3 attempts"; exit 1
            fi
            git fetch origin main
            git rebase origin/main || { echo "rebase conflict"; exit 1; }
            attempt=$((attempt + 1))
          done
      # ---- Rebuild Chroma + BM25 ---------------------------------
      - name: Rebuild indexes
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: python -m rag.index --rebuild
      # ---- Build & push image ------------------------------------
      - name: Log in to Gitea container registry
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login "${REGISTRY_PUSH}" -u "${{ github.repository_owner }}" --password-stdin
      - name: Build & push image
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        # Tags: :latest (Watchtower target), :<sha12> (rollback pin),
        # :corpus-<YYYY.MM.DD> (links image to corpus version so
        # Drawbar can pin to a specific seed-catalog snapshot).
        run: |
          SHA_TAG=$(echo "$GITHUB_SHA" | cut -c1-12)
          CORPUS_TAG="corpus-$(date -u +%Y.%m.%d)"
          docker build \
            -t "${REGISTRY_PUSH}/${IMAGE}:latest" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}" \
            -t "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}" \
            .
          docker push "${REGISTRY_PUSH}/${IMAGE}:latest"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${SHA_TAG}"
          docker push "${REGISTRY_PUSH}/${IMAGE}:${CORPUS_TAG}"
      - name: Link container package to this repo
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          OWNER="${{ github.repository_owner }}"
          PKG="${{ github.event.repository.name }}"
          BODY=$(mktemp)
          CODE=$(curl -sS -o "$BODY" -w "%{http_code}" -X POST \
            -H "Authorization: token ${GITEA_TOKEN}" \
            "https://${REGISTRY_PULL}/api/v1/packages/${OWNER}/container/${PKG}/-/link/${PKG}")
          echo "link http=$CODE  body=$(cat "$BODY")"
          case "$CODE" in
            201) echo "linked package to ${OWNER}/${PKG}" ;;
            400) echo "already linked — ok" ;;
            *)   echo "unexpected status $CODE"; exit 1 ;;
          esac
      - name: Prune old container versions
        # GC requires broader scope than REGISTRY_TOKEN's push perms
        # (HTTP 403 on /packages/.../versions). Non-critical
        # housekeeping. TODO: issue separate PAT with admin:package
        # scope. Until then continue-on-error keeps a failed prune
        # from breaking the whole refresh.
        if: steps.commit.outputs.changed == 'true' || inputs.force_build == true
        continue-on-error: true
        env:
          GITEA_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: |
          python scripts/registry_gc.py \
            --owner "${{ github.repository_owner }}" \
            --package "${{ github.event.repository.name }}" \
            --keep-days 180 \
            --keep-latest 6
@@ -0,0 +1,31 @@
 # Virtualenv
 venv/
 .venv/
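 # Regenerable from corpus + CI. corpus/ itself is tracked: refresh.yml
 # commits it (`git add corpus` fails on an ignored path) and
 # image-only.yml rebuilds the indexes from the committed copy.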
 chroma/
 bm25/
 # Python detritus
 __pycache__/
 *.py[cod]
 *.egg-info/
 .pytest_cache/
 .mypy_cache/
 .ruff_cache/
 # Eval results (regenerable; commit only the headline baseline if you want)
 # eval/results/
 # Usage logs (host-mounted volume in prod; don't commit dev logs)
 var/
 # Local-only env
 .env
 .env.local
 # IDE
 .vscode/
 .idea/
 *.swp
@@ -0,0 +1,230 @@
 # CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) when
 working with code in this repository.
 ## Purpose
 `seed-mcp` is an MCP server over the **public catalogs of major US
 row-crop seed vendors** (corn / soybeans / wheat). It is the sibling
 project to [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
 — same MCP-template lineage, same Drawbar consumer (the farm
 advisor AI), but the corpus is **seed/hybrid varieties** rather than
 pesticide labels.
 The MCP exposes per-variety records with agronomic ratings, disease
 tolerance, trait stack, maturity, and regional notes — so the advisor
 can answer "which corn hybrid for sandy soil, drought-prone, RM ≤105
 in northeast Iowa?" without rummaging through individual brand sites.
 PRODUCT_NAME for this build: **`crop_seed`** (lowercase, underscore;
 ends up in the MCP server name, Chroma collection, BM25 db filename,
 and the `crop_seed_api_lessons` tool).
 ## Vendor scope
 | Vendor | Verdict | Varieties | Source pattern |
 |---|---|---|---|
 | Bayer (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | `cropscience.bayer.us` Next.js `__NEXT_DATA__` (same infra as crop-chem-docs) |
 | Golden Harvest (Syngenta) | 🟢 | ~175 | sitemap.xml + server-rendered HTML + Syngenta CDN PDFs |
 | NK (Syngenta) | 🟢 | 29 | static HTML + Syngenta CDN PDFs (shares fetcher with Golden Harvest) |
 | AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views form, server-rendered HTML |
 | Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API at `mc8v24rf.api.sanity.io` (no auth) |
 | Beck's products | 🟡 | 860 | Same Sanity API — identity-only until SeedIQ XHR is sniffed |
 | Pioneer (Corteva) | 🔴 | — | DROP. ToS bans automation; dealer locator login-gated too |
 **Build priority order** (shared-infra first → biggest yield):
 1. `bayer_seeds` — lift-and-shift from crop-chem-docs' Bayer scraper
 2. `golden_harvest` — biggest unique Syngenta brand
 3. `nk` — reuses Golden Harvest's PDF fetcher
 4. `agripro` — only wheat coverage in the corpus
 5. `becks_pfr` — research goldmine, public Sanity GROQ
 6. `becks_products` — identity-only, deferred until SeedIQ XHR known
 ### Pioneer fallback
 Per user direction (2026-05-25), seed-mcp does NOT scrape Pioneer.
 The MCP's lessons layer contains a Pioneer-fallback entry: when the
 LLM detects a Pioneer / P-series query, it should reply:
 > "Pioneer does not allow AI or other automation techniques to
 > scrape and index their data. For Pioneer brand seed information,
 > reach out to a local dealer directly via
 > [pioneer.com](https://www.pioneer.com)."
 Pioneer's dealer locator is login-gated — there is no public API
 to surface dealer contact info, so the lesson stays a plain link.
 ## Schema notes per crop
 - **Corn**: RM (relative maturity days), trait stack (SmartStax, VT
  Double PRO, Enlist, PowerCore, Trecepta, etc.), GLS / NCLB /
  Goss's / Anthracnose / Tar Spot ratings, standability, drought
  tolerance, ear flex, grain-vs-silage flag.
 - **Soy**: Maturity group (e.g. 3.4), trait stack (XF / Xtend / E3 /
  LL+GT27 / RR2Y / Conkesta), SDS / white mold / SCN / Phytophthora
  (race + Rps gene) / frogeye / brown stem rot ratings, IDC
  tolerance (critical for upper Midwest), branching habit.
 - **Wheat**: Class (HRW / HRS / SRW / SWW / SWS / durum), heading
  (early / medium / late), stripe rust / leaf rust / stem rust /
  FHB (scab) / Septoria / tan spot ratings, test weight, protein,
  falling number, straw strength, CoAXium trait flag.
 **Disease scale gotcha**: Golden Harvest publishes ratings on a
 **9-to-1 scale** (9 = best, 1 = worst) — the REVERSE of the typical
 1-9 convention used by Bayer/NK/AgriPro. Normalize at chunk time so
 the corpus has a single direction; document it in a chunk_0
 preamble.
 ## Canonical sidecar schema (per variety)
 ```json
 {
  "source": "bayer_seeds",
  "source_key": "dekalb-dkc62-08rib",
  "vendor": "Bayer",
  "brand": "DEKALB",
  "product_name": "DKC62-08RIB",
  "crop": "corn",
  "relative_maturity": 112,
  "maturity_group": null,
  "wheat_class": null,
  "trait_stack": ["SmartStax", "RIB"],
  "agronomic_ratings": {"standability": 7, "drought_tolerance": 6},
  "disease_ratings": {"GLS": 6, "NCLB": 7, "Goss_wilt": 5},
  "regional_recommendation": ["IA-N", "MN-S", "WI-W"],
  "source_urls": ["https://cropscience.bayer.us/..."],
  "fetched_at": "2026-05-25T12:34:56Z"
 }
 ```
 `maturity_group` is for soy, `relative_maturity` is for corn,
 `wheat_class` is for wheat. Use `null` for fields that don't apply.
 Disease/agronomic rating direction is **normalized 1-9 (9 = best)**
 post-scrape — original direction noted in chunk_0 if the source
 publishes differently.
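
A minimal sketch of the direction flip, assuming ratings are integers from 1 to 9. The names `normalize_rating`, `normalize_ratings`, and the `nine_is_best` flag are hypothetical, not part of this repo; which sources actually need flipping should be checked against each vendor's published legend.

```python
def normalize_rating(value: int, nine_is_best: bool) -> int:
    """Return the rating on the corpus scale (1-9, 9 = best)."""
    if not 1 <= value <= 9:
        raise ValueError(f"rating out of range: {value}")
    # A 1-9 scale is mirrored around 5, so flipping direction is 10 - value.
    return value if nine_is_best else 10 - value


def normalize_ratings(ratings: dict[str, int], nine_is_best: bool) -> dict[str, int]:
    """Normalize every entry of a sidecar's rating dict (hypothetical helper)."""
    return {name: normalize_rating(v, nine_is_best) for name, v in ratings.items()}
```

Raising on out-of-range values rather than clamping keeps a scraper parsing error from silently turning into a plausible-looking rating.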
 ## Working with this repo
 ### Identifying the current phase
 This is a clone of the docs-mcp-template; phases follow the
 template's PLAN.md.
 | Signal | Likely phase |
 |---|---|
 | `corpus/` doesn't exist | Phase 1 (first scraper) |
 | `corpus/bayer_seeds/` exists, no `chroma/` | Phase 2 (indexing) |
 | `chroma/` exists, no `bm25/` | Phase 8 (hybrid search) |
 | No `eval/results/` | Phase 7 (eval harness) |
 | `_api_lessons` raises `NotImplementedError` | Phase 11 |
 ## Layout
 ```
 .
 ├── PLAN.md
 ├── README.md
 ├── CLAUDE.md
 ├── sources.json                  # Vendor catalog (corn/soy/wheat by source)
 ├── requirements.txt
 ├── Dockerfile
 ├── deploy/
 │   └── docker-compose.yml
 ├── .gitea/workflows/
 │   ├── refresh.yml               # Monthly cron: scrape + index + image
 │   └── image-only.yml            # On-demand: code-only ship cycle
 ├── scrape/
 │   ├── runner.py                 # `python -m scrape.runner --source bayer_seeds`
 │   ├── changelog.py
 │   └── sources/
 │       ├── bayer_seeds.py
 │       ├── golden_harvest.py
 │       ├── nk.py
 │       ├── agripro.py
 │       ├── becks_pfr.py
 │       └── becks_products.py
 ├── rag/                          # chunk + embed + Chroma + BM25
 ├── docs_mcp/                     # FastMCP server + lessons.md
 ├── eval/                         # Golden-query harness
 └── scripts/                      # registry_gc.py, usage_report.py
 ```
 ## Conventions
 - **Vendor sub-corpora**: each scraper writes
  `corpus/<source>/<source_key>.{md,json}`. `.md` is the LLM-visible
  text (chunk_0 preamble + body); `.json` is the sidecar metadata.
 - **Tool docstrings are user interface** — the LLM uses them to
  decide whether to call. Treat like button labels.
 - **Defensive fallback for retrieval** — reranker/BM25/external
  deps must catch their specific exception and degrade to baseline.
  The MCP is in front of farmers making real seed-buying decisions.
 - **Verify retrieval changes with eval/** — ship a retrieval change
  with numbers in the commit message.
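
The vendor sub-corpora convention above could be implemented roughly as follows. This is a sketch only: `write_variety` is a hypothetical helper name, and it assumes the sidecar dict already carries the `source` and `source_key` fields from the canonical schema.

```python
import json
from pathlib import Path


def write_variety(corpus_root: Path, sidecar: dict, body_md: str) -> tuple[Path, Path]:
    """Write corpus/<source>/<source_key>.md and .json for one variety."""
    out_dir = corpus_root / sidecar["source"]
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"{sidecar['source_key']}.md"
    json_path = out_dir / f"{sidecar['source_key']}.json"
    md_path.write_text(body_md, encoding="utf-8")
    # sort_keys keeps key order stable across runs so git diffs stay small.
    json_path.write_text(
        json.dumps(sidecar, indent=2, sort_keys=True, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return md_path, json_path
```

Returning both paths lets a caller log or count what was written without re-deriving the layout.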
 ### Standard infrastructure choices
 - **Embedding**: `nomic-embed-text` via Ollama (768-dim)
 - **Reranker**: `jina-reranker-v2-base` GGUF via llama.cpp
  `/v1/rerank` (shared `llama-rerank` sidecar with crop-chem-docs
  on trashpanda Tesla P4)
 - **Vector store**: Chroma `PersistentClient`
 - **Lexical store**: SQLite FTS5
 - **Fusion**: RRF k=60
 - **Transport**: streamable-HTTP in prod, stdio for local dev
 - **MCP framework**: FastMCP with `stateless_http=True`
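
For reference, reciprocal rank fusion with k=60 can be sketched in a few lines. This illustrates the general technique, not the code in `rag/`; the function name `rrf_fuse` and the choice to fuse plain lists of document IDs are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; tie-break on the ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because only ranks enter the score, the dense and lexical retrievers never need their raw scores calibrated against each other.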
 ### Image name and package linking are repo-name-derived
 `IMAGE` and `--package` derive from the repo at runtime via
 `${{ github.repository_owner }}` / `${{ github.event.repository.name }}`.
 The only workflow placeholders customized per clone are
 `REGISTRY_PUSH=192.168.0.2:1234`, `REGISTRY_PULL=git.jpaul.io`,
 and the `OLLAMA_URL` embed pool.
 ## Common commands
 ```bash
 # Dev environment
 python -m venv venv && source venv/bin/activate
 pip install -r requirements.txt
 # Run one scraper
 python -m scrape.runner --source bayer_seeds --force
 # Rebuild indexes
 python -m rag.index --rebuild
 # Local MCP server
 python -m docs_mcp.server --transport stdio
 python -m docs_mcp.server --transport streamable-http --port 8000
 # Eval
 python -m eval.run_eval --queries eval/queries.jsonl --output eval/results/baseline.md
 ```
 ## Gotchas
 - **`fetch-depth: 0` on `actions/checkout@v4`** in both workflows.
 - **Reranker per-pair token limit**: jina-reranker GGUF rejects the
  ENTIRE batch if any doc exceeds `n_ctx_train=1024`. Truncate
  reranked docs to ~2000 chars.
 - **FastMCP `stateless_http=True`**: critical for prod.
 - **Runner shell is `/bin/sh` (dash)** in CI — no `${VAR::N}`.
 - **Cloudflare 100 MB body cap**: push via LAN endpoint
  `192.168.0.2:1234`, pull via `git.jpaul.io`.
 - **Golden Harvest disease scale is reversed (9 = best)** —
  normalize at chunk time.
 - **Sitemap-listed PDF dates on Golden Harvest are stale** —
  resolve the live PDF URL from the product HTML page.
 - **No IPv6** — DNS for git.jpaul.io returns IPv6-only. Clone via
  HTTPS, not SSH (port 22 returns Network unreachable).
 - **Pioneer is OFF-LIMITS** — do NOT add a `pioneer.py` scraper.
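
The reranker gotcha and the defensive-fallback convention can be combined in one wrapper. A sketch under stated assumptions: `rerank_or_baseline` and the `rerank` callable's signature are hypothetical, and the exception tuple should be narrowed to whatever the real HTTP client raises.

```python
from typing import Callable, Sequence

MAX_RERANK_CHARS = 2000  # the ~2000-char cap from the reranker gotcha


def rerank_or_baseline(
    query: str,
    docs: Sequence[str],
    rerank: Callable[[str, list[str]], list[int]],
) -> list[int]:
    """Return doc indices in reranked order, or the original order on failure."""
    # Truncate every doc so one long document cannot sink the whole batch.
    truncated = [d[:MAX_RERANK_CHARS] for d in docs]
    try:
        return rerank(query, truncated)
    except (OSError, ValueError):
        # Sidecar unreachable or batch rejected: degrade to the baseline order.
        return list(range(len(docs)))
```

Passing the reranker in as a callable keeps the fallback path testable without a running `llama-rerank` sidecar.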
 ## Out-of-scope concerns
 - **Reverse proxy / TLS** — Drawbar's compose handles it
 - **MetaMCP** — separate aggregator
 - **GPU container orchestration** — shared `llama-rerank` sidecar
 - **University extension trial data** — deferred to v1.5
@@ -0,0 +1,61 @@
 # seed-mcp MCP server — production image.
 #
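 # Structure: dependencies first, then code, then the corpus and its
 # regenerable indexes. A change to any layer invalidates every layer
 # after it, so a corpus-only refresh reuses the cached dependency and
 # code layers, while a code change re-runs the corpus/index COPYs.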
 #
 # The container runs the MCP server via streamable-http on PORT 8000.
 # Override via MCP_HOST / MCP_PORT env if you front it with a different
 # reverse-proxy setup.
 #
 # Image is self-contained — corpus, Chroma collection, and BM25 db are
 # all baked in. Drawbar's docker-compose pulls the image and runs it;
 # no host volume mounts required for serve.
 #
 # RERANK_URL is set at compose time (points at the llama.cpp sidecar
 # on trashpanda's Tesla P4 — SHARED with crop-chem-docs). OLLAMA_URL
 # is set at compose time too. Defaults below assume same-stack Docker
 # DNS names.
 FROM python:3.12-slim
 WORKDIR /app
 # Install Python deps first for cacheability.
 COPY requirements.txt /app/
 RUN pip install --no-cache-dir -r requirements.txt
 # Code.
 COPY scrape /app/scrape
 COPY rag /app/rag
 COPY docs_mcp /app/docs_mcp
 # Source catalog. Lists the corpus sources (Bayer seeds + Golden
 # Harvest + NK + AgriPro + Beck's PFR + Beck's products).
 COPY sources.json /app/
 # Regenerable indexes. CI builds these from corpus/ in the same job
 # that builds the image. Listed last so a corpus refresh doesn't
 # invalidate the cached dependency and code layers above.
 #
 # bm25/ is only consulted when HYBRID_SEARCH=true (the server falls
 # back to dense-only if it's missing).
 COPY corpus /app/corpus
 COPY chroma /app/chroma
 COPY bm25 /app/bm25
 ENV PYTHONUNBUFFERED=1 \
    PRODUCT_NAME=crop_seed \
    MCP_TRANSPORT=streamable-http \
    MCP_HOST=0.0.0.0 \
    MCP_PORT=8000 \
    HYBRID_SEARCH=true \
    OLLAMA_URL=http://ollama:11434 \
    RERANK_URL=http://llama-rerank:8080
    # Defaults above assume the MCP container shares a Docker network
    # with services named `ollama` and `llama-rerank`. Override either
    # in the compose `environment:` block if your stack uses different
    # service names or if you want to point at off-stack hosts.
 EXPOSE 8000
 ENTRYPOINT ["python", "-m", "docs_mcp.server"]
@@ -0,0 +1,647 @@
 # Docs MCP Server — Build Guide
 A reusable recipe for building a hosted MCP server over a product's
 public documentation. Distilled from one production build; everything
 product-specific has been factored out.
 The end product is a streamable-HTTP MCP server with ~15 tools that
 any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
 call to answer questions against the docs, surface what changed
 recently, find inconsistencies, and (optionally) submit doc bugs
 back upstream.
 ---
 ## What you're building
 A pipeline with these stages:
 ```
 upstream docs portal
        │
        ▼
   scrape  ──► corpus/<bundle>/<page>.md + .json sidecar
        │
        ▼
    chunk + embed  ──► chroma/  (dense vectors)
        │           ──► bm25/   (FTS5 lexical index)
        ▼
   MCP server  ──► search_docs / get_page / diff_versions / weekly_digest /
                   find_doc_inconsistencies / submit_doc_bug / ...
        │
        ▼
   reverse proxy / Cloudflare Tunnel ──► public endpoint
 ```
 Two CI cadences:
 - **Weekly cron** (~40 min): full re-scrape, re-chunk, re-embed,
  image build & push.
 - **On-demand image-only** (~18 min): code-only rebuild from
  committed corpus, image build & push.
 Hosting assumes a container registry (self-hosted Gitea works well),
 a host running Docker Compose, Watchtower auto-updating from
 `:latest`, and a reverse proxy in front.
 ---
 ## Build phases
 Each phase is a discrete, shippable unit. Build them in order; each
 one is useful on its own and unlocks the next. Realistic effort per
 phase is given as a rough order of magnitude. Total: roughly 2–3
 weeks of focused work for the full stack.
 ### Phase 0 — Project skeleton  *(half a day)*
 Goals: directory layout, dependency manifest, virtualenv.
 - Top-level dirs: `scrape/`, `corpus/` (gitignored), `rag/`,
  `docs_mcp/`, `eval/`, `scripts/`, `deploy/`, `.gitea/workflows/`.
 - `requirements.txt` with the dependencies you'll need across all
  phases (FastMCP, chromadb, httpx, beautifulsoup4 or whatever HTML
  parser, ollama or sentence-transformers client, etc.).
 - `python -m venv venv` and pin Python version (3.11 or 3.12 — be
  conservative; some embedding libraries have version-specific
  wheels).
 - `.gitignore`: `venv/`, `corpus/` (regenerable), `chroma/`
  (regenerable), `bm25/` (regenerable), `*.pyc`, `__pycache__/`,
  `.pytest_cache/`.
 ### Phase 1 — Scraper  *(2–4 days, product-specific)*
 This is the most product-dependent phase. The goal is to write a
 scraper that produces a normalized corpus layout regardless of
 upstream portal shape.
 Output shape (mandatory):
 ```
 corpus/
  <bundle_id>/             # one dir per "doc bundle" — see Glossary
    <page_id>.md           # markdown body
    <page_id>.json         # sidecar with structured metadata
  ...
 bundles.json               # catalog of bundles with metadata
 ```
 **Bundle metadata** (`bundles.json` is a list of these; `platform`
 may be null, and `product` is optional but useful):
 ```json
 {
  "slug":          "<bundle_id>",
  "title":         "User-facing title",
  "version":       "10.9",
  "platform":      "VMware vSphere",
  "product":       "Admin Guide",
  "language":      "en-US",
  "page_count":    127,
  "dates": {
    "Added on":    "2024-01-15",
    "Updated on":  "2026-05-20"
  },
  "landing_page":  "<page_id>"
 }
 ```
 **Per-page sidecar** (`<page_id>.json`) carries page-level metadata.
 The one field that matters cross-cutting is `topic_cluster` (see
 Phase 9):
 ```json
 {
  "bundle_id":     "<bundle_id>",
  "page_id":       "<page_id>",
  "title":         "How to ...",
  "ordinal":       42,
  "topic_cluster": {
    "clustering_title": "How to ...",
    "clustered_topics": [
      {"bundle_id": "...10.8", "page_id": "How_to_X.htm", "clustering_title": "..."},
      {"bundle_id": "...10.9", "page_id": "How_to_X.htm", "clustering_title": "..."}
    ]
  }
 }
 ```
 If the portal exposes a cross-version "this page corresponds to that
 page" mapping, capture it here. If it doesn't, you can synthesize a
 filename-based fallback (same filename across bundle versions = same
 topic) and live without the editor-curated mapping. The features that
 read `topic_cluster` (`list_cluster`, `diff_versions`,
 `find_doc_inconsistencies`, parts of `weekly_digest`) will work
 either way; they're more accurate with real clusters.
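
The filename-based fallback described above might look like this. It is a sketch, assuming each page is a dict with `bundle_id`, `page_id`, and `title`; `synthesize_clusters` is a hypothetical name, and the page title stands in for the `clustering_title` an editor-curated mapping would supply.

```python
from collections import defaultdict


def synthesize_clusters(pages: list[dict]) -> dict[tuple[str, str], dict]:
    """Build a topic_cluster per page by grouping pages on page_id (filename)."""
    by_filename: dict[str, list[dict]] = defaultdict(list)
    for page in pages:
        by_filename[page["page_id"]].append(page)

    clusters: dict[tuple[str, str], dict] = {}
    for page in pages:
        # Same filename across bundle versions is treated as the same topic.
        siblings = by_filename[page["page_id"]]
        clusters[(page["bundle_id"], page["page_id"])] = {
            "clustering_title": page["title"],
            "clustered_topics": [
                {
                    "bundle_id": s["bundle_id"],
                    "page_id": s["page_id"],
                    "clustering_title": s["title"],
                }
                for s in siblings
            ],
        }
    return clusters
```

A page that was renamed between versions ends up in a cluster of one, which is the accuracy gap the paragraph above refers to.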
 **Patterns that recur across doc portals:**
 - Most modern doc portals are SPAs. Plain `requests.get` won't see
  rendered content. Either find the underlying API the SPA calls (the
  cheapest, most reliable path), or fall back to a headless browser
  (Playwright). The API path is almost always available; sniff the
  network tab.
 - Portals usually expose a "bundle/topic" hierarchy under the hood
  (Zoomin, Madcap Flare, Paligo, GitBook, Docusaurus all do). Map
  it to `bundles.json` + `corpus/<bundle>/<page>`.
 - Many portals expose `?save_local=` or `.pdf` rendered versions; the
  HTML they serve is structurally cleaner than what the page shows
  through the SPA shell.
 **`scrape/changelog.py`** (~250 LOC; see Phase 13) — provides
 `summarize_diff()`, `render_human()`, `walk_history()` and the
 `--json` / `--history-out` modes. Mostly reusable as-is; the only
 product-specific bit is the path layout assumption.
 ### Phase 2 — Chunking + embeddings + Chroma  *(2 days)*
 Goal: build a queryable dense index from the scraped corpus.
 - `rag/chunk.py` — split each page's markdown into ~400-600 token
  chunks. Strategy that works: paragraph-aware splitter with a
  rich "chunk 0" containing the page title + 1-sentence summary +
  bag-of-words from key terms. Chunk 0 is what dense retrieval lands
  on first; getting it right dominates retrieval quality.
 - `rag/embeddings.py` — pluggable embedder. Recommended start:
  Ollama-hosted `nomic-embed-text` (768-dim, free, good baseline).
  Other defensible choices: `text-embedding-3-small` (OpenAI),
  `bge-m3` (also via Ollama). The embedder is a Chroma
  `EmbeddingFunction` that returns `list[list[float]]` for a list
  of texts.
 - `rag/index.py` — orchestrates: read corpus → emit chunks (with
  metadata: bundle_id, page_id, version, platform, ordinal) →
  upsert into Chroma collection. `--rebuild` flag for a clean
  reindex. Run via `python -m rag.index --rebuild`.
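The paragraph-aware splitter with a rich chunk 0 could look roughly like the sketch below. It is a minimal illustration, not the template's actual `rag/chunk.py`: the function names, the `key_terms` argument, and the whitespace-based token estimate are all assumptions made for the example.

```python
from __future__ import annotations

import re


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


def chunk_page(title: str, summary: str, key_terms: list[str], body: str,
               max_tokens: int = 500) -> list[str]:
    """Split a page into paragraph-aligned chunks, preceded by a rich chunk 0."""
    # Chunk 0: page title + one-sentence summary + keyword bag.
    chunks = [f"# {title}\n\n{summary}\n\nKey terms: {', '.join(key_terms)}"]

    current: list[str] = []
    current_len = 0
    for para in re.split(r"\n\s*\n", body.strip()):
        para = para.strip()
        if not para:
            continue
        n = approx_tokens(para)
        # Close the current chunk if this paragraph would push it past the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs are never split in this sketch, so a single paragraph longer than `max_tokens` becomes its own oversized chunk; a real splitter would likely need a sentence-level fallback for that case.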
 Chroma settings: `PersistentClient(path="chroma/")` and
 `Settings(anonymized_telemetry=False)`. Single collection
 (`<product>_docs`).
 **GPU note**: embedding 70K chunks on CPU takes hours; on a GPU
 (via Ollama with `NVIDIA_VISIBLE_DEVICES`) it takes ~10 minutes. With
 two GPUs in parallel: ~5 minutes. The orchestrator just needs to
 load-balance HTTP requests across multiple Ollama endpoints.
 ### Phase 3 — MCP server skeleton  *(1 day)*
 Goal: working FastMCP server with three tools — `search_docs`,
 `get_page`, `list_versions`.
 - `docs_mcp/server.py` — `FastMCP("<product>-docs", stateless_http=True)`.
  `stateless_http=True` is critical for production hosting: every
  request creates an ephemeral session, so container recreates don't
  produce a 404 storm from stale `mcp-session-id` headers on
  clients.
 - Lazy initialization for everything expensive (Chroma client,
  embedder, bundles catalog) so the server starts cleanly even when
  Ollama is briefly unreachable.
 - Tool: `search_docs(query, version=None, platform=None,
  bundle_id=None, k=10)`. Returns markdown of top-k chunks with full
  source URLs.
 - Tool: `get_page(bundle_id, page_id)`. Returns full page markdown +
  metadata.
 - Tool: `list_versions()`. Returns the version/platform facets
  available, drawn from `bundles.json`. Helps the LLM pick filter
  values.
 Transports: stdio (for local Claude Desktop dev), streamable-HTTP
 (for hosted production). One argparse switch.
 ```python
@mcp.tool()
 def search_docs(
    query: Annotated[str, Field(description="Natural-language query about <product>.")],
    version: Annotated[str | None, Field(description="Restrict to one version")] = None,
     # platform, bundle_id, and k follow the same Annotated pattern
 ) -> str:
    ...
 ```
 The tool descriptions are first-class context — the LLM reads them
 and decides whether to call the tool. Treat them as button labels;
 use "Call when..." / "Use proactively whenever..." phrasings.
 ### Phase 4 — Containerization  *(1 day)*
 Goal: image you can run anywhere.
 - `Dockerfile`: Python 3.12-slim base, install requirements, COPY
  `scrape rag diff docs_mcp` + `bundles.json` + `corpus/ chroma/`
  + (later) `bm25/`. Don't COPY `scripts/` — those stay external
  for ops use only.
 - `ENTRYPOINT ["python", "-m", "docs_mcp.server",
  "--transport", "streamable-http"]`. Configurable host/port via env.
 - `deploy/docker-compose.yml`: one service, named volumes for usage
  logs and any state, Watchtower label, depends_on for the reranker
  sidecar (Phase 6).
 Smoke-test locally: `docker compose up` should expose
 `http://localhost:8000/mcp` and respond to an MCP `initialize` JSON-RPC.
 ### Phase 5 — CI on self-hosted Gitea Actions  *(1–2 days)*
 Goal: weekly cron rebuild + on-demand code-only ship cycle.
 **Two workflows, two cadences:**
 | Workflow | Trigger | Steps | Runtime |
 |---|---|---|---|
 | `refresh.yml` | Monday cron + manual dispatch | scrape → commit corpus → rebuild indexes → build & push image | ~40 min |
 | `image-only.yml` | manual dispatch only | rebuild indexes from committed corpus → build & push image | ~18 min |
 **Critical settings (learned the hard way):**
 - `fetch-depth: 0` on `actions/checkout@v4`. The default depth is 1
  (shallow), which breaks any step that walks git history (changelog,
  digest history walker). Pay the ~10 second cost; never debug a
  "0-byte history file" mystery.
 - `runs-on: docker` (Gitea convention, not `ubuntu-latest`).
 - Runner shell is `/bin/sh` (dash), not bash. `${VAR::N}` substring
  expansion doesn't exist; use `cut` / `printf` / `awk`.
 **Retry-on-race pattern for long-running scrapes:**
 ```bash
 attempt=1
 while [ $attempt -le 3 ]; do
  if git push; then
    echo "pushed (attempt $attempt)"
    break
  fi
  [ $attempt -eq 3 ] && { echo "still failing"; exit 1; }
  git fetch origin main
  git rebase origin/main || { echo "conflict — bail"; exit 1; }
  attempt=$((attempt + 1))
 done
 ```
 Works because scrape commits only touch `corpus/` + `bundles.json`,
 and code merges only touch `.py` / `.yml` — disjoint paths, trivially
 clean rebases.
 **Image tagging — three tags per build:**
 | Tag | Purpose |
 |---|---|
 | `:latest` | Watchtower watches this for auto-deploy |
 | `:<sha12>` | Immutable; rollback target |
 | `:<YYYY.MM.DD>` | Human-readable in incident notes |
 Same tag set on every build; rollback is a one-line compose edit
 to pin `:<sha12>` instead of `:latest`.
 **Container registry behind Cloudflare:**
 Cloudflare's free tier has a 100 MB request body limit. Big image
 layers (Chroma index can easily be 800+ MB) exceed it on push. The
 fix is a LAN registry endpoint for push, public hostname for pull:
 ```yaml
 env:
  REGISTRY_PUSH: <lan-ip>:<port>     # bypasses Cloudflare
  REGISTRY_PULL: <public-hostname>   # response bodies aren't capped
 ```
 Runner needs the LAN endpoint in `/etc/docker/daemon.json`
 `insecure-registries`. Costs nothing operationally; saves hours
 of debugging.
 **Registry GC:** weekly cron in the workflow that walks the package
 versions, keeps `:latest` + N most-recent date tags + anything
 pushed in the last 90 days, deletes the rest. Worth ~50 LOC; the
 package GC on the Gitea side reclaims disk after.
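The retention rule above can be expressed as a pure function that is easy to test before it is pointed at a real registry. The sketch below is an assumption-laden illustration: the `versions` record shape, the `tags_to_delete` name, and the `YYYY.MM.DD` pattern check are hypothetical, and fetching or deleting package versions through the Gitea API is deliberately left out.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone

DATE_TAG = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")


def tags_to_delete(versions: list[dict], keep_recent: int = 5,
                   keep_days: int = 90, now: datetime | None = None) -> list[str]:
    """Return tag names that fall outside the retention policy.

    Each entry in `versions` is assumed to look like
    {"tag": "2026.05.25", "created": <timezone-aware datetime>}.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    # Newest first, so the first `keep_recent` date tags are the ones to keep.
    ordered = sorted(versions, key=lambda v: v["created"], reverse=True)
    date_tags = [v["tag"] for v in ordered if DATE_TAG.match(v["tag"])]
    keep = {"latest", *date_tags[:keep_recent]}
    # Delete only what is both unprotected and older than the cutoff.
    return [v["tag"] for v in ordered
            if v["tag"] not in keep and v["created"] < cutoff]
```

Keeping the selection logic separate from the HTTP calls means a dry-run mode is just a matter of printing the returned list.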
 ### Phase 6 — Reranker  *(half a day)*
 Goal: lift retrieval quality 3× by cross-encoder reranking the top-N
 dense candidates.
 - A `/v1/rerank` HTTP endpoint backed by `llama.cpp` serving
  `jina-reranker-v2-base` (GGUF). Runs as a sidecar in compose.
  GPU strongly recommended (CPU latency is unworkable for live
  queries).
 - `_rerank(query, docs)` helper in the server: POST to the endpoint,
  apply the scores, re-sort the top-N candidates. Defensive: on any
  failure log a warning and fall through to dense-only.
 - Env: `RERANK_URL` (off by default), `RERANK_POOL` (how deep to
  pull candidates for reranking; 200 is a good default),
  `RERANK_TIMEOUT` (30s for cold-start tolerance).
 - **Watch the per-pair token limit.** Jina's GGUF reports
  `n_ctx_train=1024` and llama.cpp will reject the ENTIRE batch if
  any pair exceeds it. Truncate doc text to ~2000 chars before
  reranking. The full untruncated chunk still goes back to the user;
  truncation is only for the reranker scoring path.
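A defensive rerank helper along these lines might look like the sketch below. It assumes the endpoint accepts `{"query", "documents"}` and answers with `{"results": [{"index", "relevance_score"}, ...]}`; verify that shape against the server you actually deploy. The function names and the injectable `post` parameter are illustrative choices, not the template's code.

```python
import json
import logging
import urllib.request

log = logging.getLogger("rerank")
MAX_DOC_CHARS = 2000  # keeps each query/doc pair under the reranker's context limit


def _post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def rerank(query, docs, url, timeout=30.0, post=_post_json):
    """Return docs re-sorted by reranker score; keep the input order on any failure."""
    if not docs:
        return list(docs)
    try:
        body = post(
            f"{url}/v1/rerank",
            # Truncate only what the reranker scores; callers keep the full text.
            {"query": query, "documents": [d[:MAX_DOC_CHARS] for d in docs]},
            timeout,
        )
        # Assumed response shape: {"results": [{"index": i, "relevance_score": s}, ...]}
        scores = {r["index"]: r["relevance_score"] for r in body["results"]}
        order = sorted(range(len(docs)),
                       key=lambda i: scores.get(i, float("-inf")), reverse=True)
        return [docs[i] for i in order]
    except Exception as exc:  # defensive: any failure falls through to dense-only order
        log.warning("rerank failed, using dense order: %s", exc)
        return list(docs)
```

Passing the transport in as `post` keeps the fallback path testable without a running sidecar.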
 ### Phase 7 — Eval harness  *(1 day)*
 Goal: hand-curated golden queries + standard metrics so you can
 measure the impact of any retrieval change.
 - `eval/queries.jsonl`: 20–25 hand-curated queries with expected
  pages. Spread across versions, platforms, and difficulty levels.
  Include the queries that "obviously" should work and DON'T —
  those are the ones to track.
 - `eval/retrievers.py`: a `Retriever` protocol with concrete
  implementations: `DenseRetriever`, `RerankedRetriever`,
  `BM25Retriever` (Phase 8), `HybridRetriever` (Phase 8). One
  matrix dimension per knob.
 - `eval/run_eval.py`: computes MRR / Recall@5 / nDCG@5 across all
  retrievers; emits a markdown comparison table at
  `eval/results/<baseline>.md`. Commit the result so PRs land with
  the A/B evidence in the diff.
 Three numbers are enough — don't overengineer. The hand-curated
 queries are the value; the metrics are just a stable way to score
 them.
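For reference, the three metrics fit in a few lines each. The sketch below uses binary relevance (a page is either expected or not), which is an assumption about how `eval/queries.jsonl` is labeled; the function names are illustrative.

```python
import math


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant hit, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """Fraction of the expected pages that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

MRR is then the mean of `reciprocal_rank` over all queries, and the other two are averaged the same way per retriever.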
 ### Phase 8 — BM25 + Hybrid retrieval  *(half a day, conditional)*
 **Skip unless your eval shows specific failure modes.** Dense
 embeddings + cross-encoder reranker handle most queries. The case
 where they don't: queries with rare technical tokens (filenames,
 language names, error codes) get buried at dense rank 1000+ by a
 much larger prose corpus that's semantically nearby. The reranker
 only sees top-200, so it never gets a shot.
 - `rag/bm25.py`: SQLite FTS5 index via the stdlib `sqlite3` module,
  on-disk (`bm25/<product>_docs.db`). Two tables: a metadata table
  keyed by rowid, and an FTS5 virtual table for full-text. Sanitize
  the query
  (strip FTS5 reserved keywords, OR-join tokens for recall). ~210
  LOC.
 - `_rrf_fuse()` in the server — Reciprocal Rank Fusion with `k=60`.
  Per-id score = `sum_over_retrievers(1 / (k + rank))`. Returns
  ordered ids plus per-retriever contribution dict for telemetry.
 - `search_docs` hybrid path: run dense + BM25 in parallel,
  RRF-fuse, hand the merged top-200 to the reranker. Env-gated:
  `HYBRID_SEARCH=true`.
 - Log `top1_source` per call (`dense_only` / `bm25_only` / `both`)
  to usage logs so you can measure whether BM25 is actually earning
  its keep on production traffic.
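Reciprocal Rank Fusion and the `top1_source` label are small enough to sketch in full. The version below follows the formula above with `k=60`; the function names, the `rankings` dict shape, and the `"none"` label for an empty result are assumptions for illustration.

```python
from __future__ import annotations


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60):
    """Fuse ranked id lists: score(id) = sum over retrievers of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    contrib: dict[str, dict[str, float]] = {}
    for name, ids in rankings.items():
        for rank, doc_id in enumerate(ids, start=1):
            part = 1.0 / (k + rank)
            scores[doc_id] = scores.get(doc_id, 0.0) + part
            contrib.setdefault(doc_id, {})[name] = part
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ordered, contrib


def top1_source(ordered: list[str], contrib: dict[str, dict[str, float]]) -> str:
    """Label which retriever(s) surfaced the top fused result."""
    if not ordered:
        return "none"
    names = set(contrib[ordered[0]])
    if names == {"dense"}:
        return "dense_only"
    if names == {"bm25"}:
        return "bm25_only"
    return "both"
```

Because RRF only looks at ranks, the dense distances and BM25 scores never need to be put on a common scale.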
 If after 4–6 weeks of production data `top1_source` is `bm25_only`
 on 80% or more of calls, you can simplify to BM25-only (much less
 infrastructure). If it is `both` on 50% or more, hybrid is acting as
 a tie-breaker rather than a rescue; keep it or simplify depending on
 how much you care about the long tail.
 ### Phase 9 — Multi-version diff tooling  *(1 day, if applicable)*
 **Only relevant if the product has multiple maintained versions.**
 - `diff_versions(bundle_id, page_id, against_bundle_id)`: unified
  diff between two versions of the same page. Two matching
  strategies: editor-curated `topic_cluster` peer (if the portal
  exposes it), or same-filename fallback.
 - `list_cluster(bundle_id, page_id)`: list cross-version peers
  for one page.
 - `bundle_changelog(bundle_id_new, bundle_id_old)`: added /
  removed / changed pages between two bundles, sorted by churn.
 - `_diff_churn(a, b)`: small helper, ~15 LOC that counts changed
  lines from `difflib.unified_diff(..., n=0)`. Used by
  `bundle_changelog`, `find_doc_inconsistencies`, and `weekly_digest`.
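A churn helper of this kind can be sketched as follows. The return shape (added and removed line counts) and the `churn_ratio` definition are assumptions; the template may normalize churn differently.

```python
import difflib


def diff_churn(a: str, b: str) -> tuple[int, int]:
    """Count (added, removed) lines between two page bodies using a zero-context diff."""
    added = removed = 0
    diff = difflib.unified_diff(a.splitlines(), b.splitlines(), n=0, lineterm="")
    for i, line in enumerate(diff):
        if i < 2:
            continue  # the first two lines are the "---" / "+++" file headers
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
        # lines starting with "@@" are hunk headers and are not counted
    return added, removed


def churn_ratio(a: str, b: str) -> float:
    """Changed lines relative to the longer side (hypothetical normalization)."""
    added, removed = diff_churn(a, b)
    longest = max(len(a.splitlines()), len(b.splitlines()), 1)
    return (added + removed) / longest
```

Skipping the headers by position rather than by prefix avoids miscounting content lines that themselves begin with `--` or `++`.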
 ### Phase 10 — Usage logging  *(half a day)*
 Goal: per-call JSONL telemetry so you can answer "what are people
 actually asking for" and "is the new feature getting used."
 - `docs_mcp/usage.py`: `TimedCall` context manager that captures
  tool name, args, elapsed time, hits returned, any extra fields
  set by the tool via `_call.set(key=value)`. Writes JSONL to
  `var/logs/usage.jsonl`, rotated daily, kept 90 days.
 - Mount the log dir as a named compose volume so logs survive
  container recreates.
 - `scripts/usage_report.py` (standalone, no docs_mcp deps): reads
  the JSONL files, prints per-tool counts, top queries, 0-hit
  queries, filter usage histogram, reranker activity. Markdown
  output flag for piping into weekly digest emails.
 What to log: query text, filters, hits returned, elapsed_ms,
 reranker_fired flag, hybrid top1_source, retrieval_mode. What NOT
 to log: anything PII-shaped. The corpus is public, queries are
 usually about the product, not personal — but be deliberate.
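A context manager in the spirit of `TimedCall` might be sketched like this. The constructor signature, the per-day file naming used to get daily rotation, and the recorded field names are assumptions; retention pruning is left out.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


class TimedCall:
    """Time one tool call and append a JSONL record when the block exits."""

    def __init__(self, tool: str, log_dir: Path, **args):
        self.record = {"tool": tool, "args": args}
        self.log_dir = Path(log_dir)

    def set(self, **fields) -> None:
        """Attach extra fields from inside the tool body, e.g. hits or top1_source."""
        self.record.update(fields)

    def __enter__(self) -> "TimedCall":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        now = datetime.now(timezone.utc)
        self.record["ts"] = now.isoformat()
        self.record["elapsed_ms"] = round((time.perf_counter() - self._start) * 1000, 1)
        self.record["error"] = exc_type.__name__ if exc_type else None
        self.log_dir.mkdir(parents=True, exist_ok=True)
        # One file per UTC day gives daily rotation without a background job.
        path = self.log_dir / f"usage-{now:%Y-%m-%d}.jsonl"
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(self.record) + "\n")
        return False  # never swallow the tool's own exception
```

A tool body would wrap its work in `with TimedCall("search_docs", log_dir, query=query) as call:` and call `call.set(hits=len(results))` before returning.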
 ### Phase 11 — Curated knowledge layer  *(2 days)*
 The "RAG can't tell you what isn't in the docs" gap. Surfaces:
 - **API quickstart repos** if the product has them. Ingest the
  example scripts (Python, PowerShell, curl) into the corpus.
  Rewrite chunk-0 for each script to embed naturally — explicit
  natural-language H1, task description sentence, keyword bag.
  Dense embeddings need an anchor.
 - **A curated `<product>_api_lessons` markdown doc** for things
  the swagger / OpenAPI doesn't say: auth flow gotchas, async-task
  patterns, schema bugs you've hit, platform-detection quirks.
  Surface as a dedicated MCP tool whose description tells the LLM:
  *"Call proactively whenever the user asks you to write a script
  / integrate with the API / debug a 4xx response."*
 - **An auto-hint banner** in `search_docs` results — when the
  query matches a script/API trigger word, render a one-line nudge
  at the top of results pointing at the dedicated tool. Belt-and-
  suspenders for queries where the LLM doesn't think to call it
  proactively.
 ### Phase 12 — Doc-bug workflow tools  *(1 day, optional)*
 Two tools that pair up to enable a *"check the docs for
 inconsistencies, draft bugs, confirm, submit"* workflow.
 - `find_doc_inconsistencies(scope_query, version=None, platform=None,
  max_pages=30, checks=None)`: deterministic, read-only. Two checks:
  cross-version drift (pages whose content changed relative to the
  immediately preceding version, within the actionable 10–60% churn
  band) and
  redirect-chain detection (short pages whose body is just a "see
  [other page] for details" pointer). Heavy lifting is line-level
  diff (`difflib`) against editor-curated cluster peers; the model
  judges which findings are real bugs.
 - `submit_doc_bug(page_url, content, email=None, rating=None,
  like=None)`: POSTs to the docs portal's feedback endpoint.
  Env-gated by `DOC_BUG_SUBMIT_ENABLED=true` so dev/staging
  deployments can't accidentally hit the upstream. The tool's
  docstring is loud about a mandatory operator-confirmation
  workflow per submission — LLM must draft, show, ask, then
  submit. Explicit *"do not loop"* instruction. Defensive
  validation upfront (URL host matches expected portal, content
  non-empty, etc.) so the LLM gets a clean error instead of a
  rejected POST.
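The up-front validation for `submit_doc_bug` can be kept as a small pure function so the LLM receives a readable error before any network call is made. The sketch below is hypothetical: the function name, the `expected_host` parameter, and the https-only rule are assumptions, not the portal's contract.

```python
from __future__ import annotations

from urllib.parse import urlparse


def validate_doc_bug(page_url: str, content: str, expected_host: str,
                     enabled: bool) -> str | None:
    """Return an error message, or None when the submission may proceed."""
    if not enabled:
        return "submission disabled: set DOC_BUG_SUBMIT_ENABLED=true to enable"
    parsed = urlparse(page_url)
    # Assumption: the docs portal is served over https on a single known host.
    if parsed.scheme != "https" or parsed.hostname != expected_host:
        return f"page_url must be an https URL on {expected_host}"
    if not content or not content.strip():
        return "content must not be empty"
    return None
```

Checking the env gate first means a dev or staging deployment rejects every call the same way, regardless of what the model passed in.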
 **You'll need to find the docs portal's feedback endpoint.** Most
 portals route the "Was this helpful?" widget through a backend
 API; sniff the browser network tab on the live site. The payload
 shape varies; common fields: content/body, page url/href, optional
 email, optional rating, optional thumbs. Most accept anonymous
 POSTs with no captcha at the JSON-API layer (even if the widget
 shows a captcha). Validate before you ship — and if the endpoint
 has rate limits or captcha enforcement, the tool returns a clean
 "submission rejected — paste manually at <url>" fallback.
 The whole point is the per-bug operator confirmation in the
 LLM-side conversation flow; the tool description enforces it. Do
 not bypass.
 ### Phase 13 — Weekly digest tool  *(half a day)*
 Goal: a tool that answers *"what changed in the docs in the last N
 days?"* with no runtime git dependency (the prod container has no
 git).
 - Extend `scrape/changelog.py` with `--json` (one-shot structured
  output) and `--history-out PATH` (walks `git log --first-parent
  --since="<N> days ago"` for corpus-touching commits, writes one
  JSON line per commit to a JSONL file).
 - CI workflows write the JSONL file into the image at build time:
  `corpus/.digest/history.jsonl`. Both `refresh.yml` and
  `image-only.yml`. **`fetch-depth: 0` is required** — see Phase 5.
 - New MCP tool `weekly_digest(days=7, version=None, platform=None,
  max_bundles=25, max_pages_per_bundle=10)`: reads the JSONL,
  filters to the window, applies version/platform via
  `bundles.json` metadata, aggregates per-bundle change counts and
  page lists, renders markdown.
 - Post-filter totals are critical: the headline "X page changes
  across Y bundles" must compute X from the filtered set, not the
  raw record count. Otherwise filtered calls look wrong to the
  reader.
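The post-filter totals point is easiest to see in code: the headline is computed from the same filtered structure that gets rendered. The record shape below (`date` plus a list of `changes` with `bundle_id` and `page_id`) is a hypothetical stand-in for whatever `history.jsonl` actually contains, and counting each page once per window is a design choice of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def summarize(records, bundles, days=7, version=None, platform=None, now=None):
    """Aggregate per-bundle page changes inside the window, after filtering."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    pages_by_bundle = defaultdict(set)
    for rec in records:
        # Assumed: ISO-8601 timestamps with an explicit UTC offset.
        if datetime.fromisoformat(rec["date"]) < cutoff:
            continue
        for change in rec["changes"]:
            meta = bundles.get(change["bundle_id"], {})
            if version and meta.get("version") != version:
                continue
            if platform and meta.get("platform") != platform:
                continue
            pages_by_bundle[change["bundle_id"]].add(change["page_id"])
    # Headline totals come from the filtered set, never from len(records).
    total = sum(len(pages) for pages in pages_by_bundle.values())
    headline = f"{total} page changes across {len(pages_by_bundle)} bundles"
    return headline, {b: sorted(p) for b, p in pages_by_bundle.items()}
```

With this structure a filtered call can never report more changes in its headline than it lists in its body.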
 Out of scope but trivial bolt-ons: scheduled HTML email of the
 digest, auto-publish to a blog, per-page diff excerpts as a
 follow-up tool.
 ---
 ## Standard tool set
 By the end you'll have ~15 tools registered. Production-tested
 shape:
 | Tool | What it does |
 |---|---|
 | `search_docs` | Semantic search with version/platform/bundle filters |
 | `get_page` | Full markdown + metadata for one page |
 | `list_versions` | Discover available facet values |
 | `list_cluster` | Cross-version peers for one page (if applicable) |
 | `diff_versions` | Unified diff of a page across two versions |
 | `bundle_changelog` | Added / removed / changed pages between two bundles |
 | `weekly_digest` | What changed in the last N days, with filters |
 | `corpus_status` | Freshness + size of the knowledge base |
 | `find_doc_inconsistencies` | Scoped scan for doc bugs |
 | `submit_doc_bug` | Submit a drafted bug (env-gated, operator-confirmed) |
 | `<product>_api_lessons` | Curated API gotchas, proactively-called |
 | product-specific tools | Interop matrix, lifecycle queries, etc. |
 ---
 ## Per-product customization checklist
 When applying this template to a new product, here's what you have
 to figure out yourself — everything else is shared infrastructure:
 - **Doc portal mechanics**
  - URL pattern for pages
  - Bundle/version concept (Zoomin "bundle", Madcap "project",
    GitBook "space", Docusaurus "docs version" — same idea, different
    name)
  - SPA backing API (sniff the network tab) or fallback to
    headless browser
  - How `topic_cluster`-equivalent cross-version peers are exposed
    (or whether you synthesize them from filenames)
 - **Bundle metadata schema**
  - What does `version` look like? Semver, calendar, named?
  - What does `platform` mean for this product? Is there a useful
    facet at all?
  - Other useful facets (language, product line, edition)?
 - **Filterable facets** for `search_docs`
  - One filter per high-cardinality facet
  - Skip filters that have <5 distinct values — they're not worth
    the surface area
 - **Feedback endpoint** (for `submit_doc_bug`, if you want it)
  - URL of the POST endpoint
  - Required + optional payload fields
  - Captcha / rate-limit behavior
  - Whether anonymous submissions are accepted
 - **Curated knowledge** for the `_api_lessons` tool
  - What does the product's API documentation NOT say that you've
    learned from real integration work?
 - **Quickstart / example repos**
  - Does the vendor publish working code? Ingest it; rewrite
    chunk-0 for natural-language retrieval.
 ---
 ## Decisions worth carrying forward
 Things you'll save time on by deciding the same way again:
 - **Tool descriptions are user interface.** The LLM reads them
  verbatim and decides whether to call the tool. *"Use when..."*
  and *"Call proactively whenever..."* are real surfaces; treat
  them like button labels. Most retrieval improvements turn out
  to be tool-description rewrites in disguise.
 - **`stateless_http=True`** on the FastMCP server. Eliminates
  whole categories of session-ID-related 404 storms after
  container recreates.
 - **Pre-bake everything at CI time.** No runtime calls to git,
  external services, or anything you wouldn't trust on a
  Cloudflare outage. If the digest needs git history, write a
  JSONL file at CI time. If the lessons doc needs to load fast,
  bake it into the image.
 - **Env-gate every side-effecting tool.** Off by default in dev;
  on only in production compose. Belt and suspenders against
  accidental writes from staging environments.
 - **Operator-confirmation pattern for side-effecting tools.**
  The tool docstring is the only place to enforce
  human-in-the-loop. Make it loud. "MANDATORY", "Do not loop",
  "show-confirm-then-submit" — those phrasings work.
 - **Verify with hand-curated golden queries before shipping any
  retrieval change.** Numbers in the diff, in the commit message.
  Don't ship retrieval changes on vibes.
 - **Two-cadence CI** (weekly scrape vs on-demand code-only)
  saves hours per code iteration once you're past the
  one-iteration-a-week stage.
 - **Rolling tag + sha-pinned tag** deploy pattern. `:latest` is
  what Watchtower watches; `:<sha>` is your safety net. Rollback
  is a one-line compose edit, not a redeploy.
 - **Usage logging is non-negotiable.** You will be wrong about
  what people use. Capture the truth from day one; let it tell
  you which features to keep building and which to delete.
 ---
 ## Glossary
 - **Bundle** — one logical doc set in the portal. Zoomin calls
  them bundles; Madcap calls them projects; the concept is the
  same: a versioned, titled collection of pages. One dir under
  `corpus/`.
 - **Page** — one HTML page in a bundle. One `.md` + one `.json`
  sidecar under the bundle dir.
 - **Topic cluster** — Zoomin's name for "this page in version
  10.9 corresponds to that page in version 10.8." Stored in the
  per-page sidecar. The portal-agnostic concept is "cross-version
  peer mapping."
 - **Chunk** — a unit of text that gets independently embedded and
  stored in Chroma. Target ~400-600 tokens; preserve paragraph
  boundaries.
 - **RRF** — Reciprocal Rank Fusion. The way to merge two ranked
  lists from independent retrievers without score calibration.
 ---
 ## What's deliberately NOT in this template
 Decisions you should make per-product (not copy from the original
 build):
 - The reverse proxy and TLS termination layer. Could be Caddy,
  nginx, Traefik, Cloudflare Tunnel — pick what your infra uses.
 - The gateway / aggregator in front of multiple MCPs (MetaMCP is one
  option; you may not need any aggregator if you're running a
  single product MCP).
 - The specific embedding model — `nomic-embed-text` is a strong
  default but newer / domain-specific models may be better for
  some products.
 - The Ollama containers / GPU setup — depends on what hardware you
  have. The pattern is one container per GPU with explicit
  `NVIDIA_VISIBLE_DEVICES` pinning; the indexer load-balances
  across them.
 - Whether to publish a blog series alongside the build. Strongly
  recommended (forces clarity, builds an audience), but optional.
@@ -0,0 +1,84 @@
 # seed-mcp
 MCP server over the public catalogs of major US row-crop seed
 vendors — corn, soybeans, wheat. Sibling project to
 [`crop-chem-docs`](https://git.jpaul.io/justin/crop-chem-docs)
 (pesticide labels), feeding the same Drawbar farm-advisor AI.
 The server exposes per-variety records with **agronomic ratings**,
 **disease tolerance**, **trait stack**, **maturity**, and
 **regional notes** — so the advisor can answer questions like
 "which corn hybrid for sandy soil, drought-prone, RM ≤105 in
 northeast Iowa?" without rummaging through individual brand sites.
 ## Vendor coverage
 | Vendor | Verdict | Varieties | Notes |
 |---|---|---|---|
 | Bayer seeds (DEKALB + Asgrow + WestBred) | 🟢 | ~475 | Same `cropscience.bayer.us` Next.js infra as crop-chem-docs |
 | Golden Harvest (Syngenta) | 🟢 | ~175 | Sitemap + server-rendered HTML + Syngenta CDN PDFs |
 | NK (Syngenta) | 🟢 | 29 | Shares PDF fetcher with Golden Harvest |
 | AgriPro (Syngenta wheat) | 🟢 | 24 | Drupal Views, server-rendered |
 | Beck's PFR | 🟡 | 2,089 | Public Sanity GROQ API (no auth) |
 | Beck's products | 🟡 | 860 | Identity-only until SeedIQ XHR sniffed |
 | Pioneer (Corteva) | 🔴 | — | ToS bans automation — curated fallback lesson instead |
 ## Quick start
 ```bash
 git clone https://git.jpaul.io/justin/seed-mcp.git
 cd seed-mcp
 python -m venv venv && source venv/bin/activate
 pip install -r requirements.txt
 # Run one scraper
 python -m scrape.runner --source bayer_seeds --force
 # Rebuild indexes
 python -m rag.index --rebuild
 # Local MCP server (stdio for Claude Desktop dev)
 python -m docs_mcp.server --transport stdio
 ```
 ## Tools exposed
 | Tool | Purpose |
 |---|---|
 | `search_docs` | Hybrid + rerank variety search with crop / RM / trait / region filters |
 | `get_page` | Full variety record by `(source, source_key)` |
 | `list_versions` | Discover crops, brands, traits, RM/MG ranges, wheat classes |
 | `corpus_status` | Counts + freshness; useful for health probes |
 | `crop_seed_api_lessons` | Curated agronomy lessons — Pioneer fallback, disease-scale normalization, regional placement heuristics |
 ## Build phases
 This is a clone of [`docs-mcp-template`](https://git.jpaul.io/justin/docs-mcp-template).
 The 13 phases in `PLAN.md` apply:
 | Phase | Status |
 |---|---|
 | 0 — scaffold | done |
 | 1 — first scraper (bayer_seeds) | next |
 | 2 — chunk + index | pending |
 | 3 — baseline MCP tools | template defaults |
 | 4-5 — Dockerfile + CI | done (placeholders filled) |
 | 6 — reranker | shares `llama-rerank` sidecar with crop-chem-docs |
 | 7 — eval harness | pending (curate ~25 queries) |
 | 8 — hybrid search | done (template) |
 | 9 — diff_versions, list_cluster | optional |
 | 11 — `crop_seed_api_lessons` curated layer | pending |
 See `CLAUDE.md` for the canonical sidecar schema and the
 disease-scale-normalization gotcha (Golden Harvest is reversed).
 ## Infrastructure
 - **Registry**: `git.jpaul.io/justin/seed-mcp:latest` (Watchtower) /
  `:corpus-YYYY.MM.DD` (production pin)
 - **Embedder**: shared Ollama pool with crop-chem-docs (Gitea-host
  GPUs + Windows Ollama; CI never hits trashpanda's production Ollama)
 - **Reranker**: shared `llama-rerank` sidecar on trashpanda's Tesla
  P4 (one container, both MCPs use it)
 - **PRODUCT_NAME**: `crop_seed` (not `seed_mcp` — used in Chroma
  collection, BM25 db filename, and `crop_seed_api_lessons` tool)
@@ -0,0 +1,111 @@
 # Hosting stack for a docs MCP server.
 #
 # Replace <product> below with your product name on first deploy.
 # Volumes: usage logs are mounted to a host path so they survive
 # Watchtower-driven container recreates.
 #
 # This template assumes a reverse proxy / Cloudflare Tunnel terminates
 # TLS in front of port 8000. Adjust if your infra differs.
 services:
  # The MCP server. Watchtower auto-pulls on :latest changes.
  <product>-docs-mcp:
    image: <registry>/<owner>/<product>-docs-mcp:latest
    container_name: <product>-docs-mcp
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      PRODUCT_NAME: "<product>"
      PRODUCT_DOCS_URL: "https://docs.example.com"
      # Streamable-HTTP transport. Stateless mode is required for
      # production: clients don't lose sessions when Watchtower
      # recreates the container.
      MCP_TRANSPORT: streamable-http
      MCP_HOST: 0.0.0.0
      MCP_PORT: "8000"
      # If you run MetaMCP or another gateway in front and reach
      # this container via its compose DNS name (e.g. <product>-docs-mcp:8000),
      # add that hostname here. "*" disables the rebind check entirely.
      MCP_ALLOWED_HOSTS: "<product>-docs-mcp,localhost,127.0.0.1"
      # Phase 6 — reranker sidecar (jina-reranker-v2-base via llama.cpp).
      RERANK_URL: http://<product>-rerank:8080
      RERANK_POOL: "200"
      RERANK_TIMEOUT: "30"
      # Phase 8 — hybrid retrieval (BM25 + dense + RRF). Set true
      # only after the eval harness shows the dense-only path
      # missing technical-term queries that BM25 catches.
      HYBRID_SEARCH: "true"
      # Phase 10 — usage telemetry.
      USAGE_LOG_DIR: /app/var/logs
      USAGE_LOG_KEEP_DAYS: "90"
      # Phase 12 — doc-bug submission gate. Off by default; on only
      # in production after you've verified the endpoint contract.
      DOC_BUG_SUBMIT_ENABLED: "false"
      # DOC_BUG_API_URL: "https://docs-be.example.com/api/feedback"
    volumes:
      # Usage logs persist across container recreates.
      - ./<product>-docs-mcp-logs:/app/var/logs
    depends_on:
      - <product>-rerank
    labels:
      # Watchtower polls *only* containers with this label set true.
      com.centurylinklabs.watchtower.enable: "true"
    networks:
      - mcp
  # Reranker sidecar — llama.cpp serving jina-reranker-v2-base.
  # Requires GPU access; adjust runtime/devices for your hardware.
  <product>-rerank:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: <product>-rerank
    restart: unless-stopped
    # Mount the GGUF model from the host. Download from huggingface
    # (gguf-org/jina-reranker-v2-base-multilingual-GGUF) first.
    volumes:
      - /path/to/models:/models:ro
    command: >
      --model /models/jina-reranker-v2-base.Q8_0.gguf
      --reranking
      --host 0.0.0.0
      --port 8080
      --n-gpu-layers 99
      --ctx-size 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - mcp
  # Watchtower — auto-pulls :latest on push.
  # Only watches containers labeled `com.centurylinklabs.watchtower.enable=true`.
  watchtower:
    image: containrrr/watchtower:latest
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      WATCHTOWER_POLL_INTERVAL: "300"   # 5 min
      WATCHTOWER_LABEL_ENABLE: "true"
      WATCHTOWER_CLEANUP: "true"        # remove old images after pull
    # If your registry requires auth, mount a docker config:
    #  volumes:
    #    - ./registry-auth.json:/config.json:ro
    networks:
      - mcp
 networks:
  mcp:
    driver: bridge
@@ -0,0 +1,263 @@
 """MCP server skeleton — fill in PRODUCT_NAME and the tool bodies.
 This file is the template's structural anchor. The phases described in
 PLAN.md add or extend pieces of this file:
  Phase 3  — search_docs, get_page, list_versions stubs (you are here)
  Phase 6  — reranker integration in search_docs
  Phase 8  — BM25 + hybrid retrieval (HYBRID_SEARCH env gate, _rrf_fuse)
  Phase 9  — diff_versions, list_cluster, bundle_changelog
  Phase 10 — TimedCall wiring (already imported below)
  Phase 11 — <product>_api_lessons tool
  Phase 12 — find_doc_inconsistencies, submit_doc_bug
  Phase 13 — weekly_digest + _digest_history reader
 Every stub below has a docstring + `raise NotImplementedError`. Replace
 the body when you reach the corresponding phase. Keep the signatures
 stable across products — clients depend on them.
 """
 from __future__ import annotations
 import json
 import logging
 import os
 import re
 from pathlib import Path
 from typing import Annotated
 from mcp.server.fastmcp import FastMCP
 from pydantic import Field
 from .usage import TimedCall
 log = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Product-specific configuration. Set these for each new build.
 # ---------------------------------------------------------------------------
 PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "crop_seed")
 PRODUCT_DOCS_URL = os.environ.get("PRODUCT_DOCS_URL", "https://git.jpaul.io/justin/seed-mcp")
 COLLECTION = f"{PRODUCT_NAME}_docs"
 # Paths inside the deployed container (and matching layout locally for dev).
 ROOT = Path(__file__).resolve().parent.parent
 CORPUS = ROOT / "corpus"
 CHROMA_DIR = ROOT / "chroma"
 BM25_DB = Path(os.environ.get("BM25_DB", str(ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
 BUNDLES_JSON = ROOT / "bundles.json"
 # ---------------------------------------------------------------------------
 # Feature flags (Phase 6 / 8 / 12 enable these as you ship each phase).
 # ---------------------------------------------------------------------------
 RERANK_URL = os.environ.get("RERANK_URL", "").rstrip("/") or None
 RERANK_POOL = int(os.environ.get("RERANK_POOL", "50"))
 RERANK_TIMEOUT = float(os.environ.get("RERANK_TIMEOUT", "30"))
 HYBRID_SEARCH = os.environ.get("HYBRID_SEARCH", "").lower() in ("true", "1", "yes", "on")
 RRF_K = int(os.environ.get("RRF_K", "60"))
 DOC_BUG_SUBMIT_ENABLED = os.environ.get("DOC_BUG_SUBMIT_ENABLED", "").lower() in ("true", "1", "yes", "on")
 DOC_BUG_API_URL = os.environ.get("DOC_BUG_API_URL", "")  # product-specific endpoint
 DOC_BUG_TIMEOUT = float(os.environ.get("DOC_BUG_TIMEOUT", "15"))
 # ---------------------------------------------------------------------------
 # FastMCP setup.
 #
 # stateless_http=True — every request creates an ephemeral session and
 # discards it on return. Critical for production: clients don't get
 # 404 storms when the container is recreated by Watchtower.
 # ---------------------------------------------------------------------------
 mcp = FastMCP(f"{PRODUCT_NAME}-docs", stateless_http=True)
 # ---------------------------------------------------------------------------
 # Lazy helpers — instantiate expensive things only when actually needed,
 # so the server still starts when (e.g.) Ollama is briefly unreachable.
 # ---------------------------------------------------------------------------
 def _bundles() -> dict[str, dict]:
     """Load bundles.json into a {slug: bundle_dict} mapping (re-read on each call).
    bundles.json is the product-specific catalog written by the Phase 1
    scraper. See PLAN.md Phase 1 for the schema.
    """
    if not BUNDLES_JSON.exists():
        return {}
    cat = json.loads(BUNDLES_JSON.read_text())
    return {b["slug"]: b for b in cat}
 def _build_where(version: str | None, platform: str | None, bundle_id: str | None) -> dict | None:
    """Translate filter args into a Chroma `where` clause."""
    conds: list[dict] = []
    if version:
        conds.append({"version": version})
    if platform:
        conds.append({"platform": platform})
    if bundle_id:
        conds.append({"bundle_id": bundle_id})
    if not conds:
        return None
    if len(conds) == 1:
        return conds[0]
    return {"$and": conds}
 def _read_page(bundle_id: str, page_id: str) -> tuple[str, dict] | None:
    """Read a corpus page off disk. Returns (markdown_body, metadata_dict)."""
    md_path = CORPUS / bundle_id / (page_id + ".md")
    json_path = CORPUS / bundle_id / (page_id + ".json")
    if not md_path.exists() or not json_path.exists():
        return None
    return md_path.read_text(), json.loads(json_path.read_text())
 # ===========================================================================
 # Tools
 # ===========================================================================
@mcp.tool()
 def search_docs(
    query: Annotated[str, Field(description=f"Natural-language query about {PRODUCT_NAME}.")],
    version: Annotated[
        str | None,
        Field(description="OPTIONAL version filter — restrict to one product version."),
    ] = None,
    platform: Annotated[
        str | None,
        Field(description="OPTIONAL platform filter. Set to one of the platforms listed by list_versions(); omit for all platforms."),
    ] = None,
    bundle_id: Annotated[
        str | None,
        Field(description="OPTIONAL bundle filter — pin to a specific doc bundle slug."),
    ] = None,
    k: Annotated[int, Field(description="Number of results to return.", ge=1, le=50)] = 10,
 ) -> str:
    """Search the product docs corpus.
    Returns the top-k most relevant chunks (with full source page URLs)
    given a natural-language query. Optional filters narrow the search
    to one version, one platform, or one bundle. Use list_versions()
    first if you need to discover the available facet values.
    Call this tool whenever the user asks anything that should be
    answerable from the official product documentation.
    """
    with TimedCall("search_docs", {
        "query": query, "version": version, "platform": platform,
        "bundle_id": bundle_id, "k": k,
    }) as _call:
        # TODO Phase 2-3: query Chroma collection (see rag/index.py for
        # how it was built). Render the top-k chunks as markdown with
        # source URLs.
        # TODO Phase 6: optional reranker via _rerank() if RERANK_URL set.
        # TODO Phase 8: hybrid retrieval if HYBRID_SEARCH=true — run
        # dense + BM25 in parallel, RRF-fuse, hand merged pool to rerank.
        _call.set(hits_returned=0)
        raise NotImplementedError("Phase 2/3: implement Chroma query + rendering")
@mcp.tool()
 def get_page(
    bundle_id: Annotated[str, Field(description="Bundle slug.")],
    page_id: Annotated[str, Field(description="Page filename within the bundle.")],
 ) -> str:
    """Return the full markdown for one page, plus a metadata header.
    Use after search_docs surfaces a relevant page and the user (or you)
    want the complete text — not just the matched chunks.
    """
    with TimedCall("get_page", {"bundle_id": bundle_id, "page_id": page_id}) as _call:
        data = _read_page(bundle_id, page_id)
        if data is None:
            _call.set(found=False)
            return f"Page not found: {bundle_id}/{page_id}"
        md, meta = data
        _call.set(found=True, page_chars=len(md))
        # TODO: add a metadata header (title, version, source URL) above
        # the body. Product-specific shape.
        return md
@mcp.tool()
 def list_versions() -> str:
    """List the available version/platform facets across all bundles.
    Use this to discover valid filter values for search_docs.
    """
    with TimedCall("list_versions", {}) as _call:
        cat = _bundles()
        if not cat:
            return "_(no bundles indexed yet — run the scraper + indexer)_"
        versions = sorted({b.get("version") for b in cat.values() if b.get("version")})
        platforms = sorted({b.get("platform") for b in cat.values() if b.get("platform")})
        _call.set(versions=len(versions), platforms=len(platforms))
        lines = [f"# Facets across {len(cat)} bundle(s)", ""]
        if versions:
            lines += ["## Versions", ""]
            lines += [f"- `{v}`" for v in versions]
            lines.append("")
        if platforms:
            lines += ["## Platforms", ""]
            lines += [f"- `{p}`" for p in platforms]
        return "\n".join(lines)
 # ---------------------------------------------------------------------------
 # Stubs for later phases — keep the signatures in this file so refactors
 # don't lose the contracts. Implementations come per phase.
 # ---------------------------------------------------------------------------
 # @mcp.tool()  # Phase 9
 # def list_cluster(bundle_id: str, page_id: str) -> str: ...
 # @mcp.tool()  # Phase 9
 # def diff_versions(bundle_id: str, page_id: str, against_bundle_id: str, context: int = 3) -> str: ...
 # @mcp.tool()  # Phase 9
 # def bundle_changelog(bundle_id_new: str, bundle_id_old: str, min_churn: int = 5, max_changed: int = 50) -> str: ...
 # @mcp.tool()  # Phase 13
 # def weekly_digest(days: int = 7, version: str | None = None, platform: str | None = None, ...) -> str: ...
 # @mcp.tool()  # Phase 9 (or 3 — useful early)
 # def corpus_status() -> str: ...
 # @mcp.tool()  # Phase 11
 # def myproduct_api_lessons(topic: str | None = None) -> str: ...
 # @mcp.tool()  # Phase 12
 # def find_doc_inconsistencies(scope_query: str, ...) -> str: ...
 # @mcp.tool()  # Phase 12
 # def submit_doc_bug(page_url: str, content: str, email: str | None = None, ...) -> str: ...
 # ===========================================================================
 # Entry point
 # ===========================================================================
 def main() -> None:
    import argparse
    p = argparse.ArgumentParser(description=f"{PRODUCT_NAME} docs MCP server")
    p.add_argument("--transport", choices=["stdio", "streamable-http", "sse"],
                   default=os.environ.get("MCP_TRANSPORT", "stdio"))
    p.add_argument("--host", default=os.environ.get("MCP_HOST", "0.0.0.0"))
    p.add_argument("--port", type=int, default=int(os.environ.get("MCP_PORT", "8000")))
    args = p.parse_args()
    if args.transport == "stdio":
        mcp.run()
    else:
        mcp.settings.host = args.host
        mcp.settings.port = args.port
        # DNS-rebinding protection defaults to localhost-only — disable for
        # container-network DNS hostnames. See PLAN.md "Hosting" notes.
        if os.environ.get("MCP_DISABLE_DNS_REBINDING_PROTECTION", "").lower() in ("true", "1", "yes", "on"):
            mcp.settings.transport_security.enable_dns_rebinding_protection = False
        mcp.run(transport=args.transport)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,127 @@
 """Per-call usage telemetry — JSONL with daily rotation and retention.
 Reusable as-is across products. Drop the import + `with TimedCall(...)`
 into any tool body and the call gets logged with the tool name, args,
 elapsed time, and any extra fields the tool sets via `_call.set(...)`.
 The log file is `var/logs/usage.jsonl` by default (override with the
 `USAGE_LOG_DIR` env). Daily rotation; files older than
 `USAGE_LOG_KEEP_DAYS` (default 90) are deleted on next write.
 Layout of one record:
    {
      "ts":           "2026-05-22T13:14:15+00:00",
      "tool":         "search_docs",
      "args":         {"query": "...", "version": "10.9", "k": 10},
      "elapsed_ms":   142.5,
      "hits_returned": 7,           # optional, set by the tool
      "reranked":     true,         # optional, set by the tool
      // ... any other key the tool sets via _call.set(...)
    }
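A minimal sketch of consuming these records, assuming the layout
above. The `summarize` helper and the sample records are hypothetical
and are not part of this module.

```python
import json
from collections import defaultdict

# Hypothetical sample records in the layout described above.
SAMPLE_RECORDS = [
    {"ts": "2026-05-22T13:14:15+00:00", "tool": "search_docs",
     "args": {"query": "a"}, "elapsed_ms": 100.0},
    {"ts": "2026-05-22T13:14:16+00:00", "tool": "search_docs",
     "args": {"query": "b"}, "elapsed_ms": 200.0},
    {"ts": "2026-05-22T13:14:17+00:00", "tool": "get_page",
     "args": {}, "elapsed_ms": 10.0, "error": "ValueError: boom"},
]


def summarize(lines):
    # Aggregate call count, error count, and mean latency per tool.
    stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        entry = stats[rec["tool"]]
        entry["calls"] += 1
        entry["total_ms"] += rec["elapsed_ms"]
        if "error" in rec:
            entry["errors"] += 1
    return {
        tool: {
            "calls": s["calls"],
            "errors": s["errors"],
            "mean_ms": s["total_ms"] / s["calls"],
        }
        for tool, s in stats.items()
    }


# One JSON document per line, as the log file stores them.
jsonl_lines = [json.dumps(r) for r in SAMPLE_RECORDS]
summary = summarize(jsonl_lines)
```

In practice the lines would come from iterating over the open log
file rather than from an in-memory list.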
 """
 from __future__ import annotations
 import json
 import os
 import time
 import threading
 from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any
 USAGE_LOG_DIR = Path(os.environ.get("USAGE_LOG_DIR", "var/logs"))
 USAGE_LOG_KEEP_DAYS = int(os.environ.get("USAGE_LOG_KEEP_DAYS", "90"))
 # Single global lock to serialize writes from multiple request handlers.
 # JSONL appends are atomic at the OS level for short records on most
 # filesystems, but the lock is cheap and saves you from cross-platform
 # surprises.
 _lock = threading.Lock()
 _last_rotation_check: float = 0.0
 def _maybe_rotate() -> None:
    """Rename usage.jsonl → usage.jsonl.<date of its last write> once the UTC date has rolled.
    Cheap to call: the filesystem checks run at most once every five
    minutes, and a rename happens only when the active file's mtime
    falls on an earlier day.
    """
    global _last_rotation_check
    now = time.time()
    if now - _last_rotation_check < 300:  # 5 min cap between checks
        return
    _last_rotation_check = now
    USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
    active = USAGE_LOG_DIR / "usage.jsonl"
    if active.exists():
        try:
            mtime = datetime.fromtimestamp(active.stat().st_mtime, tz=timezone.utc).date()
            today = datetime.now(timezone.utc).date()
            if mtime < today:
                rotated = USAGE_LOG_DIR / f"usage.jsonl.{mtime.isoformat()}"
                if not rotated.exists():
                    active.rename(rotated)
        except OSError:
            pass
    # Retention: delete usage.jsonl.YYYY-MM-DD files older than the
    # retention window. The active file is never deleted by this.
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=USAGE_LOG_KEEP_DAYS)
    for f in USAGE_LOG_DIR.glob("usage.jsonl.*"):
        try:
            datestamp = f.name.split(".", 2)[-1]
            if datetime.fromisoformat(datestamp).date() < cutoff:
                f.unlink()
        except (ValueError, OSError):
            continue
 class TimedCall:
    """Context manager that captures one tool call's telemetry record.
    Usage:
        with TimedCall("search_docs", {"query": q, ...}) as call:
            ... do the work ...
            call.set(hits_returned=len(results), reranked=True)
    On exit, writes one JSONL record to usage.jsonl. Exceptions are
    captured into the `error` field; the exception is re-raised so
    the tool's caller sees the failure.
    """
    def __init__(self, tool: str, args: dict[str, Any]):
        self.tool = tool
        self.args = args
        self.extra: dict[str, Any] = {}
        self._t0: float = 0.0
    def set(self, **kwargs: Any) -> None:
        """Attach extra fields to the eventual telemetry record."""
        self.extra.update(kwargs)
    def __enter__(self) -> "TimedCall":
        self._t0 = time.perf_counter()
        return self
    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        record: dict[str, Any] = {
            "ts":         datetime.now(timezone.utc).isoformat(),
            "tool":       self.tool,
            "args":       self.args,
            "elapsed_ms": round(elapsed_ms, 2),
        }
        if exc_type is not None:
            record["error"] = f"{exc_type.__name__}: {exc_val}"
        record.update(self.extra)
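        with _lock:
            # Rotate while holding the lock so the rename cannot race a
            # concurrent append from another request handler.
            _maybe_rotate()
            USAGE_LOG_DIR.mkdir(parents=True, exist_ok=True)
            with open(USAGE_LOG_DIR / "usage.jsonl", "a") as fh:
                fh.write(json.dumps(record, separators=(",", ":")) + "\n")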
        # Don't swallow the exception — the caller still needs to see it.
@@ -0,0 +1,4 @@
 {"query": "how to install <product> on Linux", "expected": [{"bundle_id": "Install.Linux.10.0", "page_id": "Installation.htm"}], "tags": ["install", "linux"]}
 {"query": "configure database connection for high availability", "expected": [{"bundle_id": "Admin.10.0", "page_id": "HA_Setup.htm"}], "tags": ["ha", "config"]}
 {"query": "API endpoint to list users", "expected": [{"bundle_id": "API.10.0", "page_id": "Users_API.htm"}], "tags": ["api"]}
 {"query": "what changed between 10.0 and 10.1", "expected": [{"bundle_id": "Release_Notes.10.1", "page_id": "Whats_New.htm"}], "tags": ["release-notes"]}
@@ -0,0 +1,62 @@
 """Retriever protocol + concrete implementations.
 A single matrix dimension per knob (dense / reranked / bm25 / hybrid)
 so the eval harness can compare them apples-to-apples. Implement these
 once at Phase 7 and reuse them across every retrieval change.
 Each retriever returns a ranked list of (bundle_id, page_id) tuples
 deduplicated to the page level (chunks within the same page collapse
 to one entry; the highest-ranked chunk's position wins).
 """
 from __future__ import annotations
 from typing import Protocol, Iterable
 class Retriever(Protocol):
    name: str
    def retrieve(self, query: str, k: int = 10) -> list[tuple[str, str]]:
        """Return up to k (bundle_id, page_id) tuples in rank order."""
        ...
 def _collapse_to_pages(chunk_ids: Iterable[tuple[str, str, str]], k: int) -> list[tuple[str, str]]:
    """Take a stream of (bundle_id, page_id, chunk_ordinal) and return
    the first k unique pages in their first-seen order."""
    seen: set[tuple[str, str]] = set()
    out: list[tuple[str, str]] = []
    for bid, pid, _ord in chunk_ids:
        key = (bid, pid)
        if key in seen:
            continue
        seen.add(key)
        out.append(key)
        if len(out) >= k:
            break
    return out
 # TODO Phase 2/3 — implement these once Chroma + the bm25 module are
 # in place. Each one is small (15-30 LOC). The eval harness imports
 # from this module by class name.
 #
 # class DenseRetriever:
 #     name = "dense"
 #     def __init__(self, collection): self.col = collection
 #     def retrieve(self, query, k=10): ...
 #
 # class RerankedRetriever:
 #     name = "dense+rerank"
 #     def __init__(self, collection, rerank_url, pool=200): ...
 #     def retrieve(self, query, k=10): ...
 #
 # class BM25Retriever:
 #     name = "bm25"
 #     def __init__(self, bm25_index): ...
 #     def retrieve(self, query, k=10): ...
 #
 # class HybridRetriever:
 #     name = "bm25+dense+rrf"
 #     def __init__(self, dense, bm25, k_rrf=60): ...
 #     def retrieve(self, query, k=10): ...
@@ -0,0 +1,91 @@
 """Run all retrievers against eval/queries.jsonl, emit a markdown report.
 Metrics computed per retriever:
  MRR        — mean reciprocal rank of the FIRST expected page in the
               ranked result list (0 if not in top-k).
  Recall@K   — fraction of expected pages that appear in top-K.
  nDCG@K     — discounted gain weighted by rank position.
 The "right" number depends on what you're measuring. MRR tracks "the
 first-line answer is correct"; Recall@K tracks "everything relevant
 is there to draw from"; nDCG@K is a smoother combination of both.
 For docs-RAG, MRR is usually the headline metric.
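A worked example of the three metrics for a single query, as a sketch.
The page ids are made up; the formulas follow the definitions above.

```python
import math

# Hypothetical ranked results and gold pages for one query.
retrieved = [
    ("B1", "a.htm"), ("B1", "b.htm"), ("B2", "c.htm"),
    ("B2", "d.htm"), ("B3", "e.htm"),
]
expected = [("B1", "b.htm"), ("B2", "d.htm")]
k = 5

expected_set = set(expected)
# 1-based ranks at which an expected page shows up: here [2, 4].
hit_ranks = [
    i for i, page in enumerate(retrieved[:k], start=1)
    if page in expected_set
]

# Reciprocal rank of the first hit: 1 / 2.
rr = 1.0 / hit_ranks[0] if hit_ranks else 0.0
# Recall@5: both expected pages were found.
recall = len(hit_ranks) / len(expected)
# DCG sums 1 / log2(rank + 1) over the hits; IDCG assumes the hits
# occupied ranks 1 and 2.
dcg = sum(1.0 / math.log2(i + 1) for i in hit_ranks)
idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
ndcg = dcg / idcg
```

Here MRR and nDCG are both penalized because the first hit sits at
rank 2, while Recall@5 is perfect.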
 Usage:
    python -m eval.run_eval \\
        --queries eval/queries.jsonl \\
        --k 5 \\
        --output eval/results/baseline.md
 """
 from __future__ import annotations
 import argparse
 import json
 import math
 import time
 from pathlib import Path
 from typing import Iterable
 def load_queries(path: Path) -> list[dict]:
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]
 def reciprocal_rank(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]]) -> float:
    expected_set = set(expected)
    for i, page in enumerate(retrieved, start=1):
        if page in expected_set:
            return 1.0 / i
    return 0.0
 def recall_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
    if not expected:
        return 0.0
    retrieved_set = set(retrieved[:k])
    hits = sum(1 for e in expected if e in retrieved_set)
    return hits / len(expected)
 def ndcg_at_k(retrieved: list[tuple[str, str]], expected: list[tuple[str, str]], k: int) -> float:
    expected_set = set(expected)
    dcg = 0.0
    for i, page in enumerate(retrieved[:k], start=1):
        if page in expected_set:
            dcg += 1.0 / math.log2(i + 1)
    # Ideal DCG: every expected page in the top positions.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(expected), k) + 1))
    return dcg / idcg if idcg else 0.0
 def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--queries", type=Path, default=Path("eval/queries.jsonl"))
    p.add_argument("--k", type=int, default=5)
    p.add_argument("--output", type=Path, default=Path("eval/results/baseline.md"))
    args = p.parse_args()
    if not args.queries.exists():
        print(f"queries file not found: {args.queries}")
        print("hint: copy eval/queries.jsonl.example and edit")
        return 1
    queries = load_queries(args.queries)
    print(f"loaded {len(queries)} queries")
    # TODO Phase 7: instantiate the retrievers you implemented in
    # eval/retrievers.py and run each one against each query.
    # Aggregate MRR / Recall@K / nDCG@K per retriever. Emit a
    # markdown table to args.output. Commit the file alongside the
    # PR that changes retrieval.
    raise NotImplementedError(
        "Wire up the retrievers in eval/retrievers.py first, then "
        "fill in this evaluation loop. See PLAN.md Phase 7."
    )
 if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,277 @@
 """SQLite FTS5-backed BM25 retrieval over the same chunks Chroma indexes.
 Hybrid retrieval (BM25 + dense + Reciprocal Rank Fusion) addresses a
 limit of single-tower dense embeddings: when a query has specific
 technical terms (filenames, language names, error codes, API paths),
 the dense embedding doesn't bridge from the query into a short
 code-focused chunk. The chunk loses to the much larger crowd of
 prose chunks that semantically match the query topic.
 BM25 handles this directly. Lexical overlap on rare terms ("python",
 "create_vpg.py", "PROTECTED_SITE_ID", "applyUpgrade") scores those
 chunks high. Fused with the dense ranking via RRF, the hybrid result
 is strictly better than either alone for the queries we've seen
 fail.
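A minimal sketch of Reciprocal Rank Fusion, assuming each retriever
returns chunk ids in rank order. `rrf_fuse` and the sample ids are
hypothetical; the constant 60 mirrors the RRF_K default in
docs_mcp/server.py.

```python
def rrf_fuse(rankings, k_rrf=60):
    # score(d) = sum over ranked lists of 1 / (k_rrf + rank of d).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c1", "c2", "c3"]
bm25 = ["c3", "c1", "c4"]
fused = rrf_fuse([dense, bm25])
```

Chunks that appear in both lists (c1, c3) rise above chunks that only
one retriever found, without comparing the raw scores of the two
systems.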
 Why SQLite FTS5:
  - In the stdlib. Zero new deps.
  - On-disk. Same persistence model as Chroma — Docker COPY the dir,
    `rag.index --rebuild` regenerates from corpus.
  - Built-in `bm25()` ranking function. No knobs to tune that matter
    for our use case (k1=1.2, b=0.75 defaults are fine).
  - Builds 70k+ chunks in seconds. Faster than the Chroma rebuild's
    embedding step by 100×, so it adds basically nothing to the
    full-rebuild cycle.
 Schema is two tables to keep filtering clean. FTS5 doesn't filter
 nicely on its own columns, so metadata lives in a regular table and
 the FTS5 table stores the chunk text under the same rowid, which
 keeps the two joinable:
    CREATE TABLE chunks_meta (
        rowid INTEGER PRIMARY KEY AUTOINCREMENT,
        id TEXT UNIQUE NOT NULL,
        bundle_id TEXT, page_id TEXT, version TEXT,
        platform TEXT, product TEXT, ordinal INTEGER
    );
    CREATE VIRTUAL TABLE chunks_fts USING fts5(
        text,
        tokenize = 'porter unicode61 remove_diacritics 2'
    );
 Queries (MATCH needs the unaliased FTS5 table name on its left):
    SELECT m.id, bm25(chunks_fts) AS score
    FROM chunks_fts
    JOIN chunks_meta m ON m.rowid = chunks_fts.rowid
    WHERE chunks_fts MATCH ?
      AND m.version = ?            -- optional metadata filter
    ORDER BY bm25(chunks_fts)      -- lower = better in FTS5
    LIMIT ?;
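A small in-memory sketch of the query pattern above, assuming the
local SQLite build includes FTS5 (it falls back to an empty result if
not). The `demo_fts` table and its three rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
try:
    con.execute(
        "CREATE VIRTUAL TABLE demo_fts USING fts5("
        "text, tokenize = 'porter unicode61 remove_diacritics 2')"
    )
    fts5_available = True
except sqlite3.OperationalError:
    # This SQLite build was compiled without the FTS5 extension.
    fts5_available = False

hits = []
if fts5_available:
    con.executemany(
        "INSERT INTO demo_fts (rowid, text) VALUES (?, ?)",
        [
            (1, "Overview of replication concepts and general planning"),
            (2, "Run create_vpg.py to create a protection group"),
            (3, "Troubleshooting network connectivity"),
        ],
    )
    # OR-of-tokens query; ascending bm25() puts the best match first.
    cur = con.execute(
        "SELECT rowid, bm25(demo_fts) AS score FROM demo_fts "
        "WHERE demo_fts MATCH ? ORDER BY bm25(demo_fts) LIMIT ?",
        ("vpg OR connectivity", 5),
    )
    hits = cur.fetchall()
con.close()
```

Rows 2 and 3 match, each with a negative score; row 1 contains
neither term and is not returned.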
 """
 from __future__ import annotations
 import logging
 import re
 import sqlite3
 from pathlib import Path
 from typing import Any
 log = logging.getLogger(__name__)
 # Default location: bm25/<product>_docs.db at the repo root, next to chroma/.
 ROOT = Path(__file__).resolve().parent.parent
 DEFAULT_DB_DIR = ROOT / "bm25"
 DEFAULT_DB_NAME = "<product>_docs.db"
 # Columns we expose as filterable metadata. Mirrors what _build_where in
 # docs_mcp/server.py accepts so the same filter dicts work for both
 # Chroma and BM25 without per-retriever translation in the caller.
 FILTER_COLUMNS = ("bundle_id", "page_id", "version", "platform", "product", "ordinal")
 # Allowlist tokenizer for free-text queries. FTS5's parser chokes on lots
 # of punctuation we routinely see in user queries (".10.9", "?", "VPG's",
 # em-dash, etc.). Rather than blocklist every operator, just keep
 # alphanumerics + a few separators and replace everything else with a
 # space. This loses the ability to phrase-search ("exact match") but we
 # don't expose that to users anyway — they ask natural-language questions
 # and want the answer, not a Boolean DSL.
 _KEEP_RE = re.compile(r"[^A-Za-z0-9_\s]")
 # FTS5 reserves these Boolean operator KEYWORDS at the token level.
 # Stripping them avoids accidental Boolean-operator parsing when a
 # user query happens to contain bare "AND", "OR", "NOT", "NEAR".
 _BOOLEAN_KW_RE = re.compile(r"(?<!\w)(AND|OR|NOT|NEAR)(?!\w)")
 def _sanitize_query(text: str) -> str:
    """Reduce a natural-language query to an FTS5 OR-of-tokens query.
    Three transformations:
    1. Non-alphanumeric → space (drops punctuation; "10.9?" becomes
       "10 9"). Lets us handle versions, parens, question marks, etc.
       without inviting FTS5 parse errors.
    2. Boolean keywords stripped (FTS5 reserves AND/OR/NOT/NEAR).
    3. Tokens explicitly OR'd. FTS5's default is AND-of-tokens — for
       any non-trivial natural-language query that means zero hits
       (no chunk contains every word). OR semantics is what we want:
       BM25 already weights documents containing more query terms
       higher, so we don't lose precision, but we DO gain recall.
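A regex-free restatement of the steps above, as a sketch rather than
the implementation. `sanitize_query_sketch` is a hypothetical name; it
is intended to behave like the function below for plain-text input.

```python
FTS5_KEYWORDS = {"AND", "OR", "NOT", "NEAR"}


def sanitize_query_sketch(text):
    # Step 1: keep ASCII letters, digits, and underscore; every other
    # character becomes a space.
    cleaned = "".join(
        ch if (ch.isascii() and (ch.isalnum() or ch == "_")) else " "
        for ch in text
    )
    # Step 2: drop bare uppercase Boolean keywords.
    tokens = [t for t in cleaned.split() if t not in FTS5_KEYWORDS]
    # Step 3: OR the surviving tokens together.
    return " OR ".join(tokens)


example = sanitize_query_sketch("What changed in 10.9 AND why?")
empty = sanitize_query_sketch("?!")
```

The version number splits into two tokens and the bare AND is
dropped; a query made only of punctuation sanitizes to the empty
string, which the caller treats as "no results".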
    """
    cleaned = _KEEP_RE.sub(" ", text)
    cleaned = _BOOLEAN_KW_RE.sub(" ", cleaned)
    tokens = cleaned.split()
    if not tokens:
        return ""
    return " OR ".join(tokens)
 def _where_to_sql(where: dict | None) -> tuple[str, list[Any]]:
    """Translate a Chroma-shaped filter dict into a SQL fragment + params.
    Accepts the same shapes ``docs_mcp.server._build_where`` produces:
        None                          → ("", [])
        {"version": "10.9"}           → ("AND m.version = ?", ["10.9"])
        {"$and": [{...}, {...}]}      → ("AND m.X = ? AND m.Y = ?", [...])
    Unknown keys are silently dropped (defensive — better to over-match
    than to crash on a filter we don't know).
    """
    if not where:
        return "", []
    parts: list[str] = []
    params: list[Any] = []
    def _emit_eq(cond: dict[str, Any]) -> None:
        for k, v in cond.items():
            if k in FILTER_COLUMNS:
                parts.append(f"m.{k} = ?")
                params.append(v)
    if "$and" in where:
        for sub in where["$and"]:
            _emit_eq(sub)
    else:
        _emit_eq(where)
    if not parts:
        return "", []
    return "AND " + " AND ".join(parts), params
 class BM25Index:
    """Thin wrapper around an FTS5-backed sqlite db.
    Single-writer model. Reads are connection-per-call (sqlite handles
    concurrency through file locks; for our read-heavy workload that's
    fine and avoids cross-thread connection sharing issues with the MCP
    server's request handlers).
    """
    def __init__(self, db_path: Path | None = None):
        self.db_path = Path(db_path) if db_path else (DEFAULT_DB_DIR / DEFAULT_DB_NAME)
    # -- build ----------------------------------------------------------
    def build(self, records: list[dict]) -> int:
        """Rebuild the index from scratch from `records`.
        `records` is the same list ``rag.index.page_records`` produces:
        ``[{"id": ..., "text": ..., "metadata": {...}}, ...]``. Bulk
        insert wrapped in a transaction — single-digit seconds for the
        full 73k-chunk corpus.
        """
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        # Drop and recreate. Idempotent rebuild.
        if self.db_path.exists():
            self.db_path.unlink()
        with sqlite3.connect(self.db_path) as con:
            con.executescript(self._schema_sql())
            con.executemany(
                "INSERT INTO chunks_meta (id, bundle_id, page_id, version, "
                "platform, product, ordinal) VALUES (?, ?, ?, ?, ?, ?, ?)",
                [
                    (
                        r["id"],
                        r["metadata"].get("bundle_id") or "",
                        r["metadata"].get("page_id") or "",
                        r["metadata"].get("version") or "",
                        r["metadata"].get("platform") or "",
                        r["metadata"].get("product") or "",
                        int(r["metadata"].get("ordinal") or 0),
                    )
                    for r in records
                ],
            )
            # Populate the FTS5 contentless-ish table by rowid. We populated
            # chunks_meta first; rowids align with insertion order.
            con.executemany(
                "INSERT INTO chunks_fts (rowid, text) VALUES (?, ?)",
                [
                    (i + 1, r["text"])
                    for i, r in enumerate(records)
                ],
            )
            con.commit()
        log.info("bm25: indexed %d chunks → %s", len(records), self.db_path)
        return len(records)
    # -- query ----------------------------------------------------------
    def query(
        self,
        text: str,
        n: int = 200,
        where: dict | None = None,
    ) -> list[tuple[str, float]]:
        """Return up to `n` (chunk_id, bm25_score) pairs, lowest score first.
        FTS5's bm25() returns NEGATIVE numbers — more relevant docs have
        smaller (more negative) scores. We order ASC so the first row is
        the most relevant. Callers that need a "rank" should enumerate
        the returned list.
        """
        sanitized = _sanitize_query(text)
        if not sanitized:
            return []
        where_sql, params = _where_to_sql(where)
        # FTS5 MATCH wants the unaliased table name on its left, so we use
        # chunks_fts (no alias) and JOIN by rowid against chunks_meta.
        sql = (
            "SELECT m.id, bm25(chunks_fts) AS score "
            "FROM chunks_fts "
            "JOIN chunks_meta m ON m.rowid = chunks_fts.rowid "
            f"WHERE chunks_fts MATCH ? {where_sql} "
            "ORDER BY bm25(chunks_fts) "
            "LIMIT ?"
        )
        try:
            with sqlite3.connect(self.db_path) as con:
                cur = con.execute(sql, [sanitized, *params, n])
                return [(row[0], float(row[1])) for row in cur.fetchall()]
        except sqlite3.OperationalError as e:
            # FTS5 syntax error (rare after sanitization) or db missing.
            # Caller decides whether to fall back to dense-only.
            log.warning("bm25 query failed (%s); query=%r", e, sanitized[:80])
            return []
    def exists(self) -> bool:
        """Cheap probe — does the index file exist on disk?"""
        return self.db_path.exists()
    def count(self) -> int:
        """Number of chunks indexed. 0 if the db is missing or empty."""
        if not self.exists():
            return 0
        try:
            with sqlite3.connect(self.db_path) as con:
                return con.execute("SELECT COUNT(*) FROM chunks_meta").fetchone()[0]
        except sqlite3.OperationalError:
            return 0
    # -- schema ---------------------------------------------------------
    @staticmethod
    def _schema_sql() -> str:
        return """
        CREATE TABLE chunks_meta (
            rowid     INTEGER PRIMARY KEY AUTOINCREMENT,
            id        TEXT UNIQUE NOT NULL,
            bundle_id TEXT,
            page_id   TEXT,
            version   TEXT,
            platform  TEXT,
            product   TEXT,
            ordinal   INTEGER
        );
        CREATE INDEX idx_meta_version  ON chunks_meta(version);
        CREATE INDEX idx_meta_platform ON chunks_meta(platform);
        CREATE INDEX idx_meta_bundle   ON chunks_meta(bundle_id);
        CREATE VIRTUAL TABLE chunks_fts USING fts5(
            text,
            tokenize = 'porter unicode61 remove_diacritics 2'
        );
        """
@@ -0,0 +1,126 @@
 """Markdown chunker — paragraph-aware, ~400-600 token target.
 Adjust the chunking strategy per product if your page format differs
 significantly from prose. The output shape (id, text, metadata) is
 fixed by the downstream Chroma + BM25 indexing in rag/index.py — don't
 change that.
 The key knob you'll tune per product is chunk-0. Dense retrieval lands
 on chunk 0 first for most queries. Make it a synthetic chunk built
 from:
  - the page title (as natural-language H1)
  - a 1-sentence task description (you'll have to generate this — for
    pages that already have a "## Overview" or "## Introduction" the
    first sentence usually works)
  - a keyword bag of important terms (filenames, API names, error
    codes — the rare technical tokens that BM25 lights up on)
 Without a rich chunk 0, dense retrieval gets dominated by the much
 larger prose body, and short pages (script examples, reference cards)
 get buried.
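One possible shape for a richer chunk 0, as a sketch. The
`keyword_bag` heuristic (tokens containing "_" or ".") and the
`build_chunk0` helper are assumptions for illustration, not part of
this module; tune both per product.

```python
def keyword_bag(text, limit=12):
    # Hypothetical heuristic: tokens that contain "_" or "." tend to be
    # filenames, identifiers, or version strings.
    seen = []
    for raw in text.split():
        tok = raw.strip(".,;:()[]")
        if tok and ("_" in tok or "." in tok) and tok not in seen:
            seen.append(tok)
        if len(seen) >= limit:
            break
    return seen


def build_chunk0(title, overview, body):
    # Title as an H1, the first overview sentence, then the keyword bag.
    first_sentence = overview.split(". ")[0].strip().rstrip(".") + "."
    parts = ["# " + title, first_sentence]
    bag = keyword_bag(body)
    if bag:
        parts.append("Keywords: " + ", ".join(bag))
    return parts


chunk0_lines = build_chunk0(
    "Create a protection group",
    "This page shows how to script group creation. It assumes API access.",
    "Run create_vpg.py with PROTECTED_SITE_ID set, then call applyUpgrade.",
)
```

Joining the returned parts with blank lines gives the chunk text. Note
that this heuristic misses camelCase names such as applyUpgrade, which
is one reason the bag is worth customizing.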
 """
 from __future__ import annotations
 import re
 from typing import Iterator
 # Approximate token estimate from char count. Tunable — set per
 # embedder if the default 4 chars/token is wrong.
 CHARS_PER_TOKEN = 4
 TARGET_TOKENS = 500
 TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
 def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)
 def split_paragraphs(md: str) -> list[str]:
    """Split markdown into paragraph-ish blocks.
    Keeps fenced code blocks together (don't slice through ```).
    Headings start new paragraphs.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            current.append(line)
            continue
        if in_fence:
            current.append(line)
            continue
        if stripped.startswith("#"):
            if current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
            continue
        if not stripped and current:
            current.append(line)
            blocks.append("".join(current).strip())
            current = []
            continue
        current.append(line)
    if current:
        blocks.append("".join(current).strip())
    return [b for b in blocks if b]
 def chunks_from_page(
    text: str,
    page_id: str,
    metadata: dict,
 ) -> Iterator[dict]:
    """Yield chunk dicts ready for index.py to upsert.
    The synthetic chunk 0 is the per-product customization point. The
    default below is a simple title + body-first-paragraph; rewrite
    for richer retrieval signal (see module docstring).
    """
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return
    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
    title = metadata.get("title") or page_id
    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
    chunk0_body = (
        f"# {title}\n\n"
        f"{first_para[:300]}"
        # TODO per product: append a keyword bag here (filenames,
        # API names, error codes) for BM25 + dense joint coverage.
    )
    yield {
        "id":       f"{metadata['bundle_id']}::{page_id}::0",
        "text":     chunk0_body,
        "metadata": {**metadata, "ordinal": 0},
    }
    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
    ordinal = 1
    buf: list[str] = []
    buf_chars = 0
    for p in paragraphs:
        if buf_chars + len(p) > TARGET_CHARS and buf:
            yield {
                "id":       f"{metadata['bundle_id']}::{page_id}::{ordinal}",
                "text":     "\n\n".join(buf),
                "metadata": {**metadata, "ordinal": ordinal},
            }
            ordinal += 1
            buf = []
            buf_chars = 0
        buf.append(p)
        buf_chars += len(p)
    if buf:
        yield {
            "id":       f"{metadata['bundle_id']}::{page_id}::{ordinal}",
            "text":     "\n\n".join(buf),
            "metadata": {**metadata, "ordinal": ordinal},
        }
@@ -0,0 +1,72 @@
 """Embedding function for Chroma — Ollama-hosted nomic-embed-text by default.
 Swappable: implement the same `embedding_function()` interface returning
 a Chroma `EmbeddingFunction` and the rest of the pipeline doesn't care.
 Defaults (override via env):
  OLLAMA_URL    one or more comma-separated URLs (load-balanced)
  EMBED_MODEL   model name; default 'nomic-embed-text'
  EMBED_DIM     expected embedding dim; default 768 (nomic-embed-text)
 """
 from __future__ import annotations
 import os
 import logging
 from typing import Any
 import httpx
 from chromadb import EmbeddingFunction, Documents, Embeddings
 log = logging.getLogger(__name__)
 OLLAMA_URLS = [u.strip() for u in os.environ.get("OLLAMA_URL",
               "http://localhost:11434").split(",") if u.strip()]
 EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
 EMBED_DIM = int(os.environ.get("EMBED_DIM", "768"))
 class OllamaEmbeddings(EmbeddingFunction):
    """Calls /api/embed across N Ollama endpoints, naive round-robin.
    For indexing throughput on multiple GPUs, run one Ollama container
    per GPU (pinned via NVIDIA_VISIBLE_DEVICES) and pass all their URLs
    in OLLAMA_URL — the embedder picks the next endpoint per batch.
    """
    def __init__(self, urls: list[str] = OLLAMA_URLS, model: str = EMBED_MODEL):
        self.urls = urls
        self.model = model
        self._next = 0
    def __call__(self, input: Documents) -> Embeddings:
        url = self.urls[self._next % len(self.urls)]
        self._next += 1
        with httpx.Client(timeout=300) as c:
            r = c.post(f"{url}/api/embed",
                       json={"model": self.model, "input": list(input)})
            r.raise_for_status()
            data = r.json()
        return data.get("embeddings") or []
    def name(self) -> str:                  # newer chromadb requires this
        return f"ollama:{self.model}"
    @staticmethod
    def build_from_config(config: dict) -> "OllamaEmbeddings":  # newer chromadb
        return OllamaEmbeddings(
            urls=config.get("urls", OLLAMA_URLS),
            model=config.get("model", EMBED_MODEL),
        )
    def get_config(self) -> dict:           # newer chromadb
        return {"urls": self.urls, "model": self.model}
    def default_space(self) -> str:
        return "cosine"
    def supported_spaces(self) -> list[str]:
        return ["cosine", "l2", "ip"]
 def embedding_function() -> EmbeddingFunction:
    return OllamaEmbeddings()
@@ -0,0 +1,134 @@
 """Build Chroma (and optionally BM25) indexes from corpus on disk.
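 Reads `corpus/<bundle>/<page>.{md,json}` and chunks each page. With
 --rebuild, drops + recreates the Chroma collection (clean state),
 upserts every chunk, then rebuilds the BM25 index. With --bm25-only,
 skips Chroma and rebuilds only the FTS5 index, which is useful for
 fast iteration when chunking didn't change. Without either flag the
 script loads the corpus, logs the chunk count, and exits without
 writing anything.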
 """
 from __future__ import annotations
 import argparse
 import json
 import logging
 import os
 import time
 from pathlib import Path
 from typing import Iterator
 import chromadb
 from chromadb.config import Settings
 from .chunk import chunks_from_page
 from .embeddings import embedding_function
 log = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO, format="%(asctime)s  %(message)s")
 ROOT = Path(__file__).resolve().parent.parent
 CORPUS = ROOT / "corpus"
 CHROMA_DIR = ROOT / "chroma"
 # Collection name convention: <product>_docs. Override via env if needed.
 PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "myproduct")
 COLLECTION = f"{PRODUCT_NAME}_docs"
 def page_records() -> Iterator[dict]:
    """Walk corpus/, yield chunks for every page."""
    if not CORPUS.exists():
        log.error("corpus/ doesn't exist; run the scraper first")
        return
    for bundle_dir in sorted(CORPUS.iterdir()):
        if not bundle_dir.is_dir() or bundle_dir.name.startswith("."):
            continue
        for md_path in sorted(bundle_dir.glob("*.md")):
            page_id = md_path.stem
            sidecar = md_path.with_suffix(".json")
            if not sidecar.exists():
                log.warning("skipping %s — no JSON sidecar", md_path)
                continue
            md = md_path.read_text()
            meta = json.loads(sidecar.read_text())
            # Surface common filter fields at the chunk-metadata level
            # so Chroma's `where` filter can use them.
            base_meta = {
                "bundle_id": bundle_dir.name,
                "page_id":   page_id,
                "title":     meta.get("title") or "",
                "version":   meta.get("version") or "",
                "platform":  meta.get("platform") or "",
                "product":   meta.get("product") or "",
            }
            yield from chunks_from_page(md, page_id, base_meta)
 def upsert_to_chroma(records: list[dict]) -> int:
    client = chromadb.PersistentClient(
        path=str(CHROMA_DIR),
        settings=Settings(anonymized_telemetry=False),
    )
    # Drop + recreate for --rebuild semantics
    try:
        client.delete_collection(COLLECTION)
    except Exception:
        pass
    col = client.create_collection(COLLECTION, embedding_function=embedding_function())
    BATCH = 64
    total = 0
    for i in range(0, len(records), BATCH):
        chunk = records[i:i + BATCH]
        col.upsert(
            ids=[r["id"] for r in chunk],
            documents=[r["text"] for r in chunk],
            metadatas=[r["metadata"] for r in chunk],
        )
        total += len(chunk)
        log.info("upserted %d / %d chunks", total, len(records))
    return total
 def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--rebuild", action="store_true",
                   help="Drop and recreate the Chroma collection.")
    p.add_argument("--bm25-only", action="store_true",
                   help="Rebuild only the BM25 index, skip Chroma.")
    p.add_argument("--bm25-db", type=Path,
                   default=ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db",
                   help="Path to the BM25 sqlite db.")
    args = p.parse_args()
    log.info("reading corpus from %s", CORPUS)
    t0 = time.time()
    records = list(page_records())
    log.info("loaded %d chunks in %.1fs", len(records), time.time() - t0)
    if args.bm25_only:
        from .bm25 import BM25Index
        log.info("--bm25-only: building FTS5 only")
        BM25Index(args.bm25_db).build(records)
        return 0
    if not args.rebuild:
        log.info("no --rebuild; nothing to do. (Use --rebuild to upsert.)")
        return 0
    t_c = time.time()
    n = upsert_to_chroma(records)
    log.info("chroma: %d chunks in %.1fs", n, time.time() - t_c)
    # Build BM25 too — see PLAN.md Phase 8. Safe to remove this block
    # for products that don't need hybrid retrieval.
    try:
        from .bm25 import BM25Index
        t_b = time.time()
        BM25Index(args.bm25_db).build(records)
        log.info("bm25 done in %.1fs", time.time() - t_b)
    except ImportError:
        log.info("rag.bm25 not available — skipping BM25 build")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,19 @@
 # MCP server
 mcp[fastmcp]>=1.0.0
 pydantic>=2.0
 httpx>=0.27
 # Vector store + embeddings
 chromadb>=0.5.0
 ollama>=0.4.0      # if using Ollama-hosted embedder; swap if not
 # Scraping (Phase 1; adjust per product)
 beautifulsoup4>=4.12
 requests>=2.31
 # playwright>=1.40  # uncomment if you need headless browser fallback
 # Evaluation
 numpy>=1.26
 # Dev / utility
 python-dateutil>=2.8
@@ -0,0 +1,61 @@
 # scrape/
 Per-vendor seed catalog scrapers + the runner that dispatches to
 them. Each source lives in `scrape/sources/<name>.py` with a `main()`
 entrypoint. The runner is a thin shim:
 ```bash
 python -m scrape.runner --source bayer_seeds --force
 python -m scrape.runner --source golden_harvest --limit 20
 python -m scrape.runner --all                # only GREEN sources
 ```
 ## Output layout
 Each scraper writes:
 - `corpus/<source>/<source_key>.md` — LLM-visible body (chunk_0
  preamble + the variety's marketing + agronomic narrative)
 - `corpus/<source>/<source_key>.json` — sidecar metadata (per
  CLAUDE.md's canonical schema)
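
A minimal sketch of how a scraper might write that pair of files. The helper name `write_page` and its signature are hypothetical (nothing in the template defines them); the skip-unless-forced behavior mirrors the `--force` convention described under Tips.

```python
import json
from pathlib import Path


def write_page(corpus_root: Path, source: str, source_key: str,
               body_md: str, sidecar: dict, force: bool = False) -> bool:
    """Write <source_key>.md and <source_key>.json; return True if written.

    Hypothetical helper: when both files already exist and force is
    False, the page is left untouched and False is returned.
    """
    out_dir = corpus_root / source
    md_path = out_dir / f"{source_key}.md"
    json_path = out_dir / f"{source_key}.json"
    if not force and md_path.exists() and json_path.exists():
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path.write_text(body_md, encoding="utf-8")
    # sort_keys keeps the sidecar byte-stable when the values are unchanged,
    # which keeps re-scrape diffs small.
    json_path.write_text(
        json.dumps(sidecar, indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
    return True
```

Writing the body and the sidecar through one function makes it harder to end up with a `.md` that has no matching `.json`, which the indexer would otherwise skip.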
 `source_key` is a stable per-vendor slug — typically `<brand>-<sku>`
 lowercased, e.g. `dekalb-dkc62-08rib`. Stability matters: it's the
 join key the MCP uses for `get_page(source, source_key)`.
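
One way to derive such a slug, shown as a sketch only. `make_source_key` is a hypothetical name, and collapsing every non-alphanumeric run to a single hyphen is an assumption; a vendor whose SKUs differ only in punctuation would need a different rule.

```python
import re


def make_source_key(brand: str, sku: str) -> str:
    """Build a lowercase `<brand>-<sku>` slug (hypothetical helper)."""
    raw = f"{brand}-{sku}".lower()
    # Collapse any run of characters outside [a-z0-9] into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw)
    return slug.strip("-")
```

With this rule, `make_source_key("DEKALB", "DKC62-08RIB")` gives `dekalb-dkc62-08rib`, matching the example above. Because the function depends only on its inputs, re-scraping the same product always lands on the same file.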
 ## Sources
 | Source | Module | Verdict | Notes |
 |---|---|---|---|
 | `bayer_seeds` | `bayer_seeds.py` | 🟢 | DEKALB + Asgrow + WestBred, ~475 varieties |
 | `golden_harvest` | `golden_harvest.py` | 🟢 | ~175 varieties, 9-to-1 disease scale (reversed vs. the other vendors) |
 | `nk` | `nk.py` | 🟢 | 29 varieties, ratings in CDN PDFs |
 | `agripro` | `agripro.py` | 🟢 | 24 wheat varieties |
 | `becks_pfr` | `becks_pfr.py` | 🟡 | 2,089 research docs via public Sanity GROQ |
 | `becks_products` | `becks_products.py` | 🟡 | 860 products, identity-only (SeedIQ-gated) |
 Pioneer is intentionally absent — see `CLAUDE.md` and the curated
 Pioneer fallback in `docs_mcp/lessons.md`.
 ## Tips
 - **Sniff before you scrape.** Most catalogs are SPAs that call a
  backend API. The recon docs in
  `~/.claude/projects/-home-justin/memory/reference_seed_vendor_recon.md`
  already capture the endpoints; if you find new ones, update that
  file.
 - **Idempotent re-scrapes.** Without `--force`, skip pages already
  on disk. With `--force`, re-fetch everything — that's the
  monthly cron mode.
 - **Respect the portals.** Backoff on 429s. Set a recognizable
  user-agent (`seed-mcp-scraper/<version>`).
 - **Normalize at chunk time, not at scrape time.** The chunker
  (Phase 2) handles the 9-to-1 → 1-9 disease-scale flip for Golden
  Harvest, NOT this scraper. Sidecar JSON should preserve the
  vendor's raw values + a `_scale_direction` field; the chunker
  reads that and normalizes the markdown body.
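
The chunk-time flip from the last tip could look roughly like this. It is a sketch under stated assumptions: the `_scale_direction` values `"9_is_best"` and `"1_is_best"` are invented here for illustration (the canonical values belong in `CLAUDE.md`), and `normalize_ratings` is a hypothetical name.

```python
def normalize_ratings(sidecar: dict) -> dict[str, int]:
    """Return disease ratings on the corpus scale: 1-9, where 9 = best.

    Assumes the sidecar keeps the vendor's raw values in
    `disease_ratings` and marks their direction in `_scale_direction`.
    """
    ratings = sidecar.get("disease_ratings") or {}
    direction = sidecar.get("_scale_direction", "9_is_best")
    if direction == "9_is_best":
        return dict(ratings)
    if direction == "1_is_best":
        # On a 1-9 scale the mirror image of a value v is 10 - v.
        return {name: 10 - value for name, value in ratings.items()}
    raise ValueError(f"unknown _scale_direction: {direction!r}")
```

Raising on an unknown direction is a deliberate choice in this sketch: a silent pass-through would put un-normalized ratings into the corpus without anyone noticing.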
 ## changelog.py
 Reusable as-is from the template. Walks `git diff --name-status`
 output for the commit summary, and `git log` for the digest history
 (Phase 13).
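
For orientation, a simplified sketch of the kind of parsing involved: grouping the `.md` paths from `git diff --name-status` output by bundle. This is not the code in `changelog.py`; `pages_by_bundle` is a hypothetical name and the sketch ignores sidecars and non-corpus files.

```python
from collections import defaultdict


def pages_by_bundle(name_status: str) -> dict[str, list[str]]:
    """Group changed corpus pages by bundle (illustrative only)."""
    changed: dict[str, list[str]] = defaultdict(list)
    for line in name_status.splitlines():
        # Each line is status<TAB>path, or status<TAB>old<TAB>new for renames.
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        path = parts[-1]  # for renames, use the post-rename path
        segs = path.split("/", 2)
        if len(segs) == 3 and segs[0] == "corpus" and segs[2].endswith(".md"):
            changed[segs[1]].append(segs[2][:-3])
    return dict(changed)
```

Feeding it the output of `git diff --name-status HEAD^..HEAD` yields a mapping from source directory to the page ids that changed in the last commit.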
@@ -0,0 +1,272 @@
 """Generate a summary of corpus changes.
 Two output shapes for two consumers:
  1. Human-readable text (default) — written into the weekly-refresh
     commit message so the commit log is greppable for *"what changed
     this week"* instead of *"806 files changed"*.
  2. Structured JSON (``--json``) and rolling JSONL history
     (``--history-out``) — consumed by the ``weekly_digest`` MCP tool.
     Computed in CI and committed at ``corpus/.digest/history.jsonl``;
     the tool reads it at runtime because the prod container is a
     static filesystem COPY with no git available.
 Usage:
    # Commit-message helper (existing behavior — unchanged)
    python -m scrape.changelog [--cached] [--ref REF]
    # One-shot JSON for the current diff range
    python -m scrape.changelog --cached --json
    # Build / refresh the digest history file (CI use)
    python -m scrape.changelog --history-out corpus/.digest/history.jsonl \\
        --history-days 120
 The history walker only includes commits that change at least one
 ``.md`` or ``.json`` file under ``corpus/<bundle>/``; it skips pure
 code/CI commits, including ones that touch only ``bundles.json``. Each
 emitted record carries the commit's short sha, ISO timestamp, subject,
 and the same structured summary the ``--json`` path produces, so the
 consumer can treat history records and one-shot summaries
 interchangeably.
 """
 from __future__ import annotations
 import argparse
 import json
 import subprocess
 import sys
 from collections import defaultdict
 from typing import Any
 def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True)
 def summarize_diff(diff_output: str) -> dict[str, Any]:
    """Parse ``git diff --name-status`` output into a structured summary.
    Pure function (no IO, no git calls) so the same logic is exercised
    by the human-readable, JSON-one-shot, and history-walking paths.
    Returns a dict with:
        md_count           int       — total .md files changed
        json_count         int       — total .json sidecars changed
        content_bundles    dict      — {bundle_id: [page_id_without_.md, ...]}
                                       Only bundles where at least one .md
                                       file moved. Lists are in the order
                                       git emitted them.
        json_only_bundles  list[str] — bundles whose ONLY change was sidecar
                                       drift (no .md changes). Sorted.
        new_bundles        list[str] — bundles with at least one .md Added
                                       in this diff (heuristic: an existing
                                       bundle that gains a page also
                                       counts). Sorted.
        other_files        list[str] — any non-corpus path mentioned in the
                                       diff, as ``"STATUS path"`` strings.
    """
    md_changes: dict[str, list[str]] = defaultdict(list)
    json_only_bundles: set[str] = set()
    new_bundles: set[str] = set()
    md_count = json_count = 0
    other_files: list[str] = []
    for line in diff_output.splitlines():
        if not line.strip():
            continue
        # status<TAB>path (or status<TAB>old<TAB>new for renames; we take
        # the post-rename path as the canonical location).
        parts = line.split("\t")
        status, path = parts[0], parts[-1]
        if not path.startswith("corpus/"):
            other_files.append(f"{status} {path}")
            continue
        segs = path.split("/", 2)
        if len(segs) < 3:
            # corpus/<filename> with no bundle dir — skip.
            continue
        _, bundle, page = segs
        if page.endswith(".md"):
            md_changes[bundle].append(page[:-3])
            md_count += 1
            if status == "A":
                new_bundles.add(bundle)
        elif page.endswith(".json"):
            json_count += 1
            json_only_bundles.add(bundle)
    # A bundle counts as "content-changing" if it had any .md edit. Sidecar-
    # only drift goes in the separate bucket so the commit message doesn't
    # report timestamp churn as if it were real edits.
    content_bundles_set = set(md_changes)
    drift_only = sorted(json_only_bundles - content_bundles_set)
    return {
        "md_count":          md_count,
        "json_count":        json_count,
        "content_bundles":   dict(md_changes),   # cast back to plain dict for JSON
        "json_only_bundles": drift_only,
        "new_bundles":       sorted(new_bundles),
        "other_files":       other_files,
    }
 def render_human(summary: dict[str, Any]) -> str:
    """Format a summary dict as the multi-line commit-message text.
    Matches the historical output exactly so existing commit-message
    tooling and downstream readers don't have to change.
    """
    lines: list[str] = []
    content_bundles = sorted(summary["content_bundles"])
    md_count = summary["md_count"]
    json_count = summary["json_count"]
    new_bundles = set(summary["new_bundles"])
    drift_only = summary["json_only_bundles"]
    other_files = summary["other_files"]
    lines.append(f"{md_count} content change(s) across {len(content_bundles)} bundle(s)")
    lines.append(f"{json_count} sidecar metadata update(s)")
    if new_bundles:
        lines.append(f"{len(new_bundles)} new bundle(s) added")
    if other_files:
        lines.append(f"{len(other_files)} other file change(s)")
    if content_bundles:
        lines.append("")
        lines.append("Bundles with content changes:")
        for b in content_bundles:
            pages = summary["content_bundles"][b]
            tag = " (NEW)" if b in new_bundles else ""
            lines.append(f"  {b}{tag}: {len(pages)} page(s)")
            for p in pages[:5]:
                lines.append(f"    - {p}")
            if len(pages) > 5:
                lines.append(f"    ... and {len(pages) - 5} more")
    if drift_only:
        lines.append("")
        head = ", ".join(drift_only[:10])
        suffix = " …" if len(drift_only) > 10 else ""
        lines.append(f"Bundles with sidecar-only drift ({len(drift_only)}): {head}{suffix}")
    return "\n".join(lines)
 def walk_history(history_days: int) -> list[dict[str, Any]]:
    """Walk recent corpus-touching commits, emit one summary per commit.
    Uses ``git log --first-parent main`` to keep the rolling
    weekly-refresh line clean of branch-merge noise. Only commits whose
    diff changes a ``.md`` or ``.json`` file under ``corpus/<bundle>/``
    are emitted; pure code commits are skipped (they have nothing to
    digest).
    Each record:
        {
          "sha":       "<short sha>",
          "timestamp": "<ISO 8601 committer date, with the committer's UTC offset>",
          "subject":   "<commit subject line>",
          ... + every field from summarize_diff()
        }
    """
    # Find candidate commits. --first-parent keeps the linear refresh history
    # on main and ignores branch-side merges. We still need to filter by what
    # the commit actually touched, because non-corpus commits can land on
    # main (PR merges for code, CI tweaks, etc.).
    raw = git(
        "log",
        f"--since={history_days} days ago",
        "--first-parent",
        "main",
        "--pretty=format:%H%x09%cI%x09%s",
    )
    records: list[dict[str, Any]] = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue
        sha, ts, subject = parts
        # What did this commit actually touch? Cheap: just the name-status diff
        # against its first parent. Empty stdout = commit didn't change any
        # files we care about. Root commits (no parent) error out — suppress
        # the stderr noise and skip them.
        try:
            diff = subprocess.check_output(
                ["git", "diff", "--name-status", f"{sha}^..{sha}"],
                text=True,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.CalledProcessError:
            continue
        if not diff.strip():
            continue
        summary = summarize_diff(diff)
        # Skip pure code commits — only emit records that have actual corpus
        # content motion. This is what makes the history "interesting" for
        # the weekly digest.
        if summary["md_count"] == 0 and summary["json_count"] == 0 and not summary["new_bundles"]:
            continue
        records.append({
            "sha":       sha[:12],
            "timestamp": ts,
            "subject":   subject,
            **summary,
        })
    return records
 def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--cached", action="store_true",
                   help="Summarize staged changes instead of a ref range.")
    p.add_argument("--ref", default="HEAD^..HEAD",
                   help="Diff range to summarize (default: HEAD^..HEAD).")
    p.add_argument("--json", dest="as_json", action="store_true",
                   help="Emit one JSON object instead of the human-readable form.")
    p.add_argument("--history-out", metavar="PATH",
                   help="Walk recent corpus-touching commits and write a "
                        "JSONL history file at PATH. Overwrites if it exists. "
                        "Implies the history walker; --cached/--ref are ignored.")
    p.add_argument("--history-days", type=int, default=120,
                   help="How far back the history walker looks (default 120).")
    args = p.parse_args()
    # History-walker path: build the JSONL file consumed by the
    # weekly_digest MCP tool, then exit. CI uses this.
    if args.history_out:
        records = walk_history(args.history_days)
        # Sort by timestamp ascending so the file is roughly stable
        # across rebuilds (commits within a single run could otherwise
        # depend on git log default ordering).
        records.sort(key=lambda r: r["timestamp"])
        with open(args.history_out, "w") as fh:
            for rec in records:
                fh.write(json.dumps(rec, separators=(",", ":")) + "\n")
        # Brief stdout signal for CI logs — easy to spot in the workflow run.
        print(f"wrote {len(records)} commit record(s) to {args.history_out} "
              f"covering up to {args.history_days} days")
        return 0
    # One-shot summary path. Unchanged behavior for --cached / --ref.
    if args.cached:
        diff_args = ["diff", "--name-status", "--cached"]
    else:
        diff_args = ["diff", "--name-status", args.ref]
    diff = git(*diff_args)
    summary = summarize_diff(diff)
    if args.as_json:
        print(json.dumps(summary, separators=(",", ":")))
    else:
        print(render_human(summary))
    return 0
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,93 @@
 """Thin dispatcher that routes ``--source <id>`` to the right per-source
 scraper module.
 Convention: one source per module under ``scrape.sources.<id>``. Each
 module is independently runnable via ``python -m scrape.sources.<id>``
 and accepts its own flags — this runner is a convenience shim for CI.
 Examples:
    python -m scrape.runner --source bayer_seeds --force
    python -m scrape.runner --source golden_harvest --limit 20
    python -m scrape.runner --all          # run every GREEN source in sources.json
 Anything after the recognized flags is passed through to the source
 scraper, so:
    python -m scrape.runner --source bayer_seeds --force --brand dekalb
 dispatches to ``scrape.sources.bayer_seeds`` with ``--force --brand
 dekalb`` as argv.
 Sources whose ``verdict`` in sources.json is anything other than
 ``"green"`` are skipped by ``--all`` (Beck's products is yellow until
 the SeedIQ XHR is captured). Pass ``--source becks_products`` to run
 a yellow source explicitly.
 """
 from __future__ import annotations
 import argparse
 import importlib
 import json
 import sys
 from pathlib import Path
 REPO_ROOT = Path(__file__).resolve().parents[1]
 SOURCES_JSON = REPO_ROOT / "sources.json"
 def _load_sources() -> list[dict]:
    if not SOURCES_JSON.exists():
        return []
    try:
        data = json.loads(SOURCES_JSON.read_text())
        return data.get("sources", []) if isinstance(data, dict) else data
    except json.JSONDecodeError:
        return []
 def _run_source(source_id: str, passthrough: list[str]) -> int:
    mod_name = f"scrape.sources.{source_id}"
    try:
        mod = importlib.import_module(mod_name)
    except ImportError as exc:
        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
        return 2
    main = getattr(mod, "main", None)
    if not callable(main):
        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
        return 2
    return int(main(passthrough) or 0)
 def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    parser.add_argument("--source", help="Source id (matches sources.json)")
    parser.add_argument("--all", action="store_true",
                        help="Run every GREEN source listed in sources.json")
    args, passthrough = parser.parse_known_args(argv)
    if not args.source and not args.all:
        parser.error("specify --source <id> or --all")
    sources = _load_sources()
    if args.all:
        ids = [s["name"] for s in sources if s.get("verdict") == "green"]
        if not ids:
            print("runner: no GREEN sources in sources.json", file=sys.stderr)
            return 2
    else:
        # If the source isn't registered in sources.json yet, dispatch anyway
        # so the scraper can be exercised during initial development.
        ids = [args.source]
    rc = 0
    for sid in ids:
        print(f"=== scrape.runner: dispatching to {sid} ===")
        rc |= _run_source(sid, passthrough)
    return rc
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,34 @@
 """AgriPro scraper (Syngenta wheat brand).
 Source: ``https://www.agriprowheat.com`` — Drupal Views form,
 server-rendered HTML. No headless browser needed.
 Expected count: 24 varieties. Covers HRW / HRS / HWS / SWW / SWS
 plus barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com
 under a separate brand and is out of scope for AgriPro.
 Trait flags to capture: Clearfield (CL2), CoAXium (NB: CoAXium is
 implicit in product family naming, not always a separate field).
 Schema notes:
 - ``wheat_class`` is required (HRW/HRS/HWS/SWW/SWS/durum/barley)
 - ``relative_maturity`` and ``maturity_group`` are null for wheat
 - Disease panel: stripe rust / leaf rust / stem rust / FHB (scab) /
  Septoria / tan spot
 - Quality: test weight, protein, falling number, straw strength
 TODO: implement.
 """
 from __future__ import annotations
 import sys
 def main(argv: list[str] | None = None) -> int:
    print("agripro: not implemented yet — Drupal Views form, only wheat in the corpus, no SRW (separate brand)",
          file=sys.stderr)
    return 2
 if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,56 @@
 """Bayer seeds scraper — DEKALB (corn) + Asgrow (soy) + WestBred (wheat).
 Source: ``cropscience.bayer.us`` — same Next.js + ``__NEXT_DATA__``
 infrastructure used by crop-chem-docs' Bayer crop-protection scraper.
 That scraper is the reference; this one lifts ~80% of its plumbing
 and adapts the per-product field mapping for seed schema.
 Catalog index pages:
  /corn/dekalb/seed-catalog
  /soybeans/asgrow/seed-catalog
  /wheat/westbred/seed-catalog
 Each catalog page is a Next.js route; the per-variety data lives in
 ``__NEXT_DATA__.props.pageProps.{whatever}``. The buildId in the
 script tag rotates — fetch the index page first, extract the
 buildId, then fetch the per-variety JSON.
 Output layout:
  corpus/bayer_seeds/<source_key>.md      LLM-visible body
  corpus/bayer_seeds/<source_key>.json    Sidecar metadata
 source_key convention: ``<brand>-<product-slug>`` lowercased, e.g.
 ``dekalb-dkc62-08rib`` or ``asgrow-ag34xf2``.
 Sidecar schema (per CLAUDE.md):
  source: "bayer_seeds"
  source_key: str
  vendor: "Bayer"
  brand: "DEKALB" | "Asgrow" | "WestBred"
  product_name: str
  crop: "corn" | "soybeans" | "wheat"
  relative_maturity: int | null         # corn only
  maturity_group: float | null          # soy only
  wheat_class: str | null               # wheat only
  trait_stack: list[str]
  agronomic_ratings: dict[str, int]     # normalized 1-9 (9 = best)
  disease_ratings: dict[str, int]       # normalized 1-9 (9 = best)
  regional_recommendation: list[str]
  source_urls: list[str]
  fetched_at: str (ISO 8601 UTC)
 TODO: implement. Reference: ~/github/crop-chem-docs/scrape/sources/bayer.py
 """
 from __future__ import annotations
 import sys
 def main(argv: list[str] | None = None) -> int:
    print("bayer_seeds: not implemented yet — see ~/github/crop-chem-docs/scrape/sources/bayer.py for the reference Next.js extraction pattern",
          file=sys.stderr)
    return 2
 if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,45 @@
 """Beck's PFR (Practical Farm Research) scraper.
 Source: Public Sanity GROQ API at ``https://mc8v24rf.api.sanity.io``.
 No authentication required — Beck's exposes their CMS content store
 publicly. ~2,089 documents going back to 2015.
 Sanity query endpoint:
  ``/v1/data/query/production?query=<groq>``
 Useful GROQ for PFR docs (the projectId / dataset are public):
  *[_type == "pfrStudy"] {
    _id, title, year, crop, slug, summary, body, attachments
  }
 Records are research studies, not variety identity — head-to-head
 yield trials, fungicide timing, planting-date studies, hybrid-by-
 population, biological seed treatments, etc.
 Treat differently from variety scrapers:
 - One record per study, not per variety
 - chunk_0 preamble includes the study's tl;dr finding (extract from
  the ``summary`` field if present, or first paragraph of ``body``)
 - Crop tag (corn/soy/wheat) for filtering
 - Year tag — older PFR studies are still relevant but search should
  let the user weight recency
 Polite rate limit: Sanity is generous but no auth means we should
 keep concurrency ≤4 and pause ~250ms between batches.
 TODO: implement.
 """
 from __future__ import annotations
 import sys
 def main(argv: list[str] | None = None) -> int:
    print("becks_pfr: not implemented yet — public Sanity GROQ at mc8v24rf.api.sanity.io, ~2089 research docs",
          file=sys.stderr)
    return 2
 if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,46 @@
 """Beck's product catalog scraper (identity-only until SeedIQ XHR sniff lands).
 Source: Same public Sanity GROQ API as ``becks_pfr`` (no auth).
 Expected count: ~860 products (corn + soy + wheat).
 Current limitation: Beck's exposes IDENTITY fields publicly (product
 name, RM/MG, basic trait stack) but routes the AGRONOMIC + DISEASE
 ratings through their SeedIQ application, which is gated behind a
 browser session cookie. The public Sanity records do not include
 ratings.
 What we CAN ship without SeedIQ:
 - Product identity for confirmation ("yes Beck's has hybrid X at RM 112")
 - RM (corn) / MG (soy) / class (wheat)
 - Trait stack
 - Basic descriptive text
 What needs the SeedIQ XHR endpoint (BLOCKED on user sniff):
 - Disease ratings (GLS, NCLB, Goss's, etc.)
 - Agronomic ratings (standability, drought, etc.)
 - Regional recommendations
 For now this scraper is DEFERRED. Run when:
 - User captures the SeedIQ XHR URL + cookie/header pattern from
  browser dev tools, OR
 - We decide to ship Beck's as identity-only and let the LLM say
  "Beck's has this hybrid; ask your Beck's rep for full agronomic
  ratings" (less useful but avoids the empty-data UX).
 Yellow verdict in sources.json reflects this — ``--all`` skips it.
 TODO: implement (deferred).
 """
 from __future__ import annotations
 import sys
 def main(argv: list[str] | None = None) -> int:
    print("becks_products: deferred — SeedIQ XHR sniff required for ratings, run only if user has captured the endpoint",
          file=sys.stderr)
    return 2
 if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,42 @@
 """Golden Harvest scraper (Syngenta brand).
 Discovery: ``https://www.goldenharvestseeds.com/sitemap.xml`` lists
 every variety page. Server-rendered HTML — no headless browser
 required. Tech-sheet PDFs live on the Syngenta CDN at
 ``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf`` — same
 fetcher pattern as NK.
 Two gotchas:
 1. **Sitemap PDF dates are stale** (the sitemap was generated
   2025-03-31 and never updated). Resolve the LIVE PDF URL from the
   product HTML page, not from the sitemap entry.
 2. **Disease scale is reversed.** Golden Harvest publishes ratings
   on a 9-to-1 scale (1 = best, 9 = worst). Bayer/NK/AgriPro use
   1-9 (9 = best). Normalize at chunk time so the corpus has a
   single direction. Record the original direction in the chunk_0
   preamble: "Note: ratings normalized to 1-9 (9 = best). Golden
   Harvest publishes on a 9-to-1 scale natively."
 Expected count: ~175 varieties (89 corn + 86 soy). No wheat.
 Bonus dataset: ``/plot-report/<state>/<year>/<id>`` — ~7,800 regional
 yield trial records. Out of scope for v1 but a high-value future
 ingest for regional placement recommendations.
 TODO: implement. Reuse the PDF-fetch helper that NK uses.
 """
 from __future__ import annotations
 import sys
 def main(argv: list[str] | None = None) -> int:
    print("golden_harvest: not implemented yet — see CLAUDE.md for the disease-scale-reversal gotcha and the live-PDF-URL-resolution requirement",
          file=sys.stderr)
    return 2
 if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,35 @@
 """NK scraper (Syngenta brand).
 Source: ``https://www.syngenta-us.com`` — static HTML product pages
 plus tech-sheet PDFs on the Syngenta CDN at
 ``assets.syngentaebiz.com/pdf/techsheets/<CODE>_YYMMDD.pdf``.
 Expected count: 29 varieties (12 corn + 17 soy). No wheat.
 The PDF fetcher is shared with ``golden_harvest`` — same CDN, same
 ``<CODE>_YYMMDD.pdf`` filename convention. Factor that into a
 helper module under ``scrape.sources._syngenta_pdf`` once both
 scrapers are written.
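 A possible starting point for that shared helper, assuming the
 ``<CODE>_YYMMDD.pdf`` convention holds for every tech sheet.
 ``parse_techsheet_name`` and ``newest_techsheet`` are hypothetical
 names, not existing functions in this repo:

```python
from datetime import date, datetime


def parse_techsheet_name(filename: str) -> tuple[str, date]:
    # Split "<CODE>_YYMMDD.pdf" into the variety code and the sheet date.
    stem = filename.removesuffix(".pdf")
    code, _, stamp = stem.rpartition("_")
    if not code or len(stamp) != 6 or not stamp.isdigit():
        raise ValueError(f"unexpected tech-sheet filename: {filename!r}")
    return code, datetime.strptime(stamp, "%y%m%d").date()


def newest_techsheet(urls: list[str]) -> str:
    # Pick the URL whose filename carries the latest date stamp.
    return max(urls, key=lambda u: parse_techsheet_name(u.rsplit("/", 1)[-1])[1])
```

 Splitting on the last underscore keeps any underscores inside the
 variety code intact.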
 Disease + agronomic ratings live INSIDE the PDFs (the HTML pages
 have marketing copy only). Use pdfplumber for table extraction.
 Bonus: regional "Seed Guide" PDFs (~14 MB each) for IA, IL, MN,
 etc. These are supplemental context worth ingesting once the
 per-variety scrape is solid.
 TODO: implement.
 """
 from __future__ import annotations
 import sys
 def main(argv: list[str] | None = None) -> int:
    print("nk: not implemented yet — disease/agronomic ratings come from CDN tech-sheet PDFs only, use pdfplumber",
          file=sys.stderr)
    return 2
 if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
@@ -0,0 +1,167 @@
 """Gitea container-registry garbage collection.
 Prunes old container tags from a Gitea registry package. Always
 preserves:
  - The ``latest`` tag (Watchtower auto-pull target)
  - Any ``corpus-*`` tag (production pins; Drawbar may have them locked)
  - The ``--keep-latest`` most-recent OTHER tags (typically commit-sha pins)
  - Anything pushed within ``--keep-days`` days
 The actual disk reclaim happens on Gitea's next package GC cron
 (admin site settings). This script marks versions for deletion.
 Why this script doesn't use the Docker Registry v2 API: that API has
 tag listing + manifest delete by digest, but no per-tag created-at
 timestamp without an extra blob-fetch round-trip. Gitea's packages
 API gives us {tag, created_at} in one call, which is what the keep
 policy needs.
 The endpoint shape that actually works (matches Gitea 1.21+):
  GET    /api/v1/packages/{owner}?type=container&q={name}
         → JSON array, ONE entry per tag, each with id + version=tag + created_at
  DELETE /api/v1/packages/{owner}/container/{name}/{tag}
         → 204 on success, 404 if already gone
 Auth: GITEA_TOKEN env var (PAT with delete:packages scope; the
 push-only PAT we use as REGISTRY_TOKEN may not be enough — if you
 see 403s, mint a separate PAT and pass it as GITEA_TOKEN here).
 Usage:
    python scripts/registry_gc.py \\
        --owner justin \\
        --package crop-chem-docs \\
        --keep-days 180 \\
        --keep-latest 6 \\
        [--dry-run]
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sys
 from datetime import datetime, timedelta, timezone
 from urllib.error import HTTPError
 from urllib.request import Request, urlopen
 GITEA_HOST = os.environ.get("GITEA_HOST", "https://git.jpaul.io")
 def api(token: str, method: str, path: str) -> object:
    # User-Agent matters: Cloudflare in front of git.jpaul.io returns
    # 403 to the default `Python-urllib/3.x` UA. Any non-Python UA
    # passes. Curl works, requests works, we just need to not look
    # like a vanilla urllib script.
    req = Request(
        f"{GITEA_HOST}{path}",
        headers={
            "Authorization": f"token {token}",
            "User-Agent": "crop-chem-docs-registry-gc/0.1",
        },
        method=method,
    )
    try:
        with urlopen(req, timeout=30) as r:
            body = r.read()
            return json.loads(body) if body else None
    except HTTPError as e:
        if e.code == 404:
            return None
        raise
 def _parse_created(version: dict) -> datetime:
    """Gitea returns RFC3339 with offset like '2026-05-24T16:07:50-04:00'.
    datetime.fromisoformat parses that form directly (a trailing 'Z'
    instead of a numeric offset needs Python 3.11+)."""
    return datetime.fromisoformat(version["created_at"])
 def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--owner", required=True)
    p.add_argument("--package", required=True)
    p.add_argument("--keep-days", type=int, default=180)
    p.add_argument("--keep-latest", type=int, default=6,
                   help="Keep this many most-recent commit-sha (etc.) "
                        "tags BEFORE applying --keep-days. corpus-* and "
                        ":latest are kept regardless.")
    p.add_argument("--dry-run", action="store_true",
                   help="Show what would be deleted without calling DELETE.")
    args = p.parse_args()
    token = os.environ.get("GITEA_TOKEN")
    if not token:
        print("GITEA_TOKEN env var not set", file=sys.stderr)
        return 1
    # Gitea's q= is a substring match; filter to exact name so we don't
    # accidentally GC a sibling package that shares the prefix.
    versions = api(
        token, "GET",
        f"/api/v1/packages/{args.owner}?type=container&q={args.package}",
    ) or []
    versions = [v for v in versions if v.get("name") == args.package]
    if not versions:
        print(f"no versions found for {args.owner}/{args.package} — nothing to GC")
        return 0
    cutoff = datetime.now(timezone.utc) - timedelta(days=args.keep_days)
    versions.sort(key=_parse_created, reverse=True)  # newest first
    keep: list[tuple[str, str]] = []     # (tag, reason)
    delete: list[dict] = []
    other_kept = 0
    for v in versions:
        tag = v.get("version", "")
        created = _parse_created(v)
        if tag == "latest":
            keep.append((tag, "always-keep (:latest)"))
            continue
        if tag.startswith("corpus-"):
            keep.append((tag, "production pin (corpus-*)"))
            continue
        if other_kept < args.keep_latest:
            other_kept += 1
            keep.append((tag, f"keep-latest #{other_kept}/{args.keep_latest}"))
            continue
        if created >= cutoff:
            keep.append((tag, f"within --keep-days ({args.keep_days})"))
            continue
        delete.append(v)
    print(f"=== {args.owner}/{args.package}: {len(versions)} total tag(s) ===")
    for tag, reason in keep:
        print(f"  KEEP  {tag:<28}  {reason}")
    for v in delete:
        print(f"  DEL   {v['version']:<28}  created={v['created_at']}")
    if not delete:
        print("nothing to delete")
        return 0
    if args.dry_run:
        print(f"--dry-run; would delete {len(delete)} tag(s)")
        return 0
    failed = 0
    for v in delete:
        tag = v["version"]
        try:
            api(token, "DELETE",
                f"/api/v1/packages/{args.owner}/container/{args.package}/{tag}")
            print(f"  ✓ deleted {tag}")
        except HTTPError as e:
            print(f"  ✗ failed {tag}: HTTP {e.code} {e.reason}", file=sys.stderr)
            failed += 1
    print(f"done: deleted {len(delete) - failed} / {len(delete)} tag(s)")
    return 0 if failed == 0 else 1
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,251 @@
 """Summarize usage logs from docs_mcp.usage into a quick scan.
 Reads one or more usage.jsonl* files and prints sections for:
  - per-tool call counts
  - top search_docs queries by frequency
  - 0-hit queries (where we returned nothing — high-signal for tuning)
  - filter usage histogram (which version / platform / bundle filters get hit)
  - reranker effectiveness (calls where the reranker fired vs not)
  - hybrid retrieval top-1 attribution (dense vs bm25 vs both)
 Usage:
    # Explicit logs dir (omit --logs-dir to read the default,
    # /app/var/logs, in the production container):
    python scripts/usage_report.py --logs-dir /path/to/usage/logs
    # Last N days only:
    python scripts/usage_report.py --logs-dir <dir> --since 7d
    # Markdown output (for piping into a weekly digest email, etc):
    python scripts/usage_report.py --logs-dir <dir> --format markdown
 The script doesn't depend on anything in the docs_mcp package — it's a
 standalone tool that can run anywhere with the log files available
 (scp them off the host, point it at the directory).
 ----------------------------------------------------------------------
 FOLLOW-UP CHECKS
 ----------------------------------------------------------------------
 Pattern: when you ship a retrieval change with a hypothesis attached
 (e.g. "hybrid will rescue queries dense misses"), add a note HERE
 describing what the usage report should show and at what threshold
 the change earns its keep. Future-you running the report a month
 later will be glad. Example:
  Q: Does the dense leg of hybrid retrieval earn its keep on
     real traffic, or could we simplify to BM25-only?
  - bm25_only >= 80%  --> dense not doing much; consider
                          simplifying to BM25 mode
  - both     >= 50%  --> hybrid is tie-breaking; keep it
  - dense_only > bm25_only --> dense is the workhorse; keep
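 The example thresholds above could be checked mechanically. This is
 only a sketch: ``hybrid_verdict`` is a hypothetical name, and the order
 in which the rules are tried is an assumption, since the note does not
 say which rule wins when more than one applies:

```python
def hybrid_verdict(bm25_only: int, dense_only: int, both: int) -> str:
    # Inputs are top-1 attribution counts from the hybrid section of the report.
    total = bm25_only + dense_only + both
    if total == 0:
        return "no data"
    if bm25_only / total >= 0.80:
        return "dense not doing much; consider BM25-only"
    if both / total >= 0.50:
        return "hybrid is tie-breaking; keep it"
    if dense_only > bm25_only:
        return "dense is the workhorse; keep"
    # None of the listed thresholds was met.
    return "inconclusive"
```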
 Also worth a glance every month:
  - 0-hit queries list (tuning candidates)
  - reranker p95 latency drift (slow reranker = bad UX)
  - filter usage (does anyone actually use version/platform
    filters? if not, simplify the tool surface)
 """
 from __future__ import annotations
 import argparse
 import json
 import re
 import sys
 from collections import Counter, defaultdict
 from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any, Iterable
 def parse_since(s: str | None) -> datetime | None:
    """Accept '7d', '24h', '30m', or an ISO timestamp. None → no cutoff."""
    if not s:
        return None
    m = re.fullmatch(r"(\d+)([dhm])", s)
    if m:
        n, unit = int(m.group(1)), m.group(2)
        delta = {"d": timedelta(days=n), "h": timedelta(hours=n), "m": timedelta(minutes=n)}[unit]
        return datetime.now(timezone.utc) - delta
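    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    # Assume UTC when the timestamp has no offset, so the cutoff is always
    # timezone-aware like the relative ('7d') form.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)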
 def load_events(logs_dir: Path, since: datetime | None) -> Iterable[dict[str, Any]]:
    """Yield JSONL records from every usage.jsonl* file in logs_dir,
    skipping records older than `since` when a cutoff is given."""
    if not logs_dir.exists():
        print(f"warning: logs dir {logs_dir} does not exist", file=sys.stderr)
        return
    # usage.jsonl is the active file; usage.jsonl.YYYY-MM-DD are rotated.
    files = sorted(logs_dir.glob("usage.jsonl*"))
    for f in files:
        with open(f, encoding="utf-8") as fh:
            for ln, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError as e:
                    print(f"  ! skipping {f}:{ln}: {e}", file=sys.stderr)
                    continue
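                if since:
                    ts = rec.get("ts", "")
                    try:
                        rec_ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
                    except (AttributeError, ValueError):
                        # Missing, non-string, or malformed timestamp: skip the record.
                        continue
                    if rec_ts.tzinfo is None:
                        # Treat naive timestamps as UTC so they compare against an aware cutoff.
                        rec_ts = rec_ts.replace(tzinfo=timezone.utc)
                    if rec_ts < since:
                        continue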
                yield rec
 def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--logs-dir", type=Path, default=Path("/app/var/logs"),
                   help="directory with usage.jsonl* files")
    p.add_argument("--since", default=None,
                   help="time window: '7d', '24h', '30m', or ISO timestamp")
    p.add_argument("--top", type=int, default=25,
                   help="how many top queries / filters to show")
    p.add_argument("--format", choices=("text", "markdown"), default="text")
    args = p.parse_args()
    since = parse_since(args.since)
    events = list(load_events(args.logs_dir, since))
    if not events:
        print("(no events in window)")
        return 0
    print(f"# Usage report — {len(events)} events"
          + (f" since {since.isoformat()}" if since else "")
          + f" from {args.logs_dir}")
    print()
    # 1. Per-tool counts
    by_tool = Counter(e["tool"] for e in events)
    print("## Per-tool call counts")
    print()
    if args.format == "markdown":
        print("| tool | calls |")
        print("|---|---|")
        for tool, n in by_tool.most_common():
            print(f"| `{tool}` | {n} |")
    else:
        for tool, n in by_tool.most_common():
            print(f"  {tool:<25s} {n:>6d}")
    print()
    # 2. Top search_docs queries
    search_events = [e for e in events if e["tool"] == "search_docs"]
    queries = Counter(e["args"].get("query", "") for e in search_events)
    print(f"## Top {args.top} search_docs queries  (of {len(search_events)} searches)")
    print()
    if args.format == "markdown":
        print("| count | query |")
        print("|---|---|")
        for q, n in queries.most_common(args.top):
            print(f"| {n} | `{q}` |")
    else:
        for q, n in queries.most_common(args.top):
            print(f"  {n:>5d}  {q!r}")
    print()
    # 3. 0-hit queries — the highest-signal data for tuning
    zero_hit = [e for e in search_events if e.get("hits_returned") == 0]
    zero_q = Counter(e["args"].get("query", "") for e in zero_hit)
    print(f"## 0-hit queries  ({len(zero_hit)} of {len(search_events)} searches returned nothing)")
    print()
    if zero_q:
        if args.format == "markdown":
            print("| count | query | filters |")
            print("|---|---|---|")
            # Group by query, show filter examples for each
            examples_by_query: dict[str, list[dict]] = defaultdict(list)
            for e in zero_hit:
                examples_by_query[e["args"].get("query", "")].append(e["args"])
            for q, n in zero_q.most_common(args.top):
                ex = examples_by_query[q][0]
                f = {k: v for k, v in ex.items()
                     if k in ("version", "platform", "bundle_id") and v}
                print(f"| {n} | `{q}` | `{f}` |")
        else:
            for q, n in zero_q.most_common(args.top):
                print(f"  {n:>5d}  {q!r}")
    else:
        print("  _(no 0-hit queries in window)_")
    print()
    # 4. Filter usage
    filter_use = Counter()
    for e in search_events:
        a = e["args"]
        v = a.get("version")
        p_ = a.get("platform")
        b = a.get("bundle_id")
        if v:
            filter_use[f"version={v}"] += 1
        if p_:
            filter_use[f"platform={p_}"] += 1
        if b:
            filter_use[f"bundle_id={b}"] += 1
        if not (v or p_ or b):
            filter_use["(no filter)"] += 1
    print("## search_docs filter usage")
    print()
    if args.format == "markdown":
        print("| filter | count |")
        print("|---|---|")
        for f, n in filter_use.most_common(args.top):
            print(f"| `{f}` | {n} |")
    else:
        for f, n in filter_use.most_common(args.top):
            print(f"  {n:>5d}  {f}")
    print()
    # 5. Reranker effectiveness
    reranked = [e for e in search_events if e.get("reranked") is True]
    dense_only = [e for e in search_events if e.get("reranked") is False]
    print("## Reranker activity")
    print()
    print(f"  reranked:    {len(reranked):>5d}")
    print(f"  dense only:  {len(dense_only):>5d}  (filter too narrow or 0 results)")
    if reranked:
        elapsed = [e["elapsed_ms"] for e in reranked if e.get("elapsed_ms") is not None]
        if elapsed:
            elapsed.sort()
            p50 = elapsed[len(elapsed) // 2]
            p95 = elapsed[int(len(elapsed) * 0.95)]
            print(f"  reranked latency p50: {p50:.0f} ms,  p95: {p95:.0f} ms")
    print()
    # 6. Hybrid retrieval activity — which retriever contributed the top-1?
    # Empty unless HYBRID_SEARCH=true is set on the MCP container.
    hybrid_events = [e for e in search_events if e.get("retrieval_mode") == "hybrid"]
    if hybrid_events:
        by_source = Counter(e.get("top1_source") for e in hybrid_events
                            if e.get("top1_source"))
        print("## Hybrid retrieval — top-1 attribution")
        print()
        print(f"  hybrid mode events: {len(hybrid_events)}")
        total = sum(by_source.values()) or 1
        for src in ("both", "dense_only", "bm25_only"):
            n = by_source.get(src, 0)
            pct = 100.0 * n / total
            label = {
                "both":       "in BOTH retrievers' top-N",
                "dense_only": "dense found it, BM25 didn't",
                "bm25_only":  "BM25 found it, dense didn't",
            }[src]
            print(f"  {src:<11s} {n:>5d}  ({pct:5.1f}%)  — {label}")
        rescued = by_source.get("bm25_only", 0)
        if rescued:
            print(f"\n  → {rescued} ({100.0 * rescued / total:.1f}%) of attributed hybrid queries had a top-1 "
                  "result that ONLY BM25 surfaced. Without hybrid, those would have been dense misses.")
    return 0
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,89 @@
 {
  "_description": "seed-mcp source catalog. Each scraper module under scrape/sources/ corresponds to one entry. Run via `python -m scrape.runner --source <name>`. The MCP container bakes this file in so corpus_status / list_versions can reflect provenance without re-scraping.",
  "_pioneer_excluded": "Pioneer (Corteva) is intentionally absent. Per their ToS: 'you shall not use any manual or automated software, devices or other processes (including but not limited to spiders, robots, scrapers, crawlers, avatars, data mining tools or the like) to scrape or download data from the Services'. The MCP returns a curated fallback lesson directing the user to pioneer.com / a local dealer.",
  "sources": [
    {
      "name": "bayer_seeds",
      "vendor": "Bayer",
      "brands": ["DEKALB", "Asgrow", "WestBred"],
      "crops": ["corn", "soybeans", "wheat"],
      "verdict": "green",
      "expected_count": 475,
      "base_url": "https://cropscience.bayer.us",
      "scope_filter": "All listed varieties; no regional filter applied at scrape time (regional recommendations parsed into sidecar so the MCP can filter at search time).",
      "tos_check_date": "2026-05-24",
      "tos_note": "robots.txt explicitly whitelists RAG/LLM use cases. Same legal stance as crop-chem-docs scraper."
    },
    {
      "name": "golden_harvest",
      "vendor": "Syngenta",
      "brands": ["Golden Harvest"],
      "crops": ["corn", "soybeans"],
      "verdict": "green",
      "expected_count": 175,
      "base_url": "https://www.goldenharvestseeds.com",
      "scope_filter": "All sitemap-listed corn + soybean varieties.",
      "tos_check_date": "2026-05-25",
      "schema_notes": "Disease ratings published on 9-to-1 scale (9 = best). Normalize to 1-9 (9 = best) at chunk time to match Bayer/NK/AgriPro convention. Note original direction in chunk_0 preamble. Tech-sheet PDF URLs in the sitemap are stale (250331) — resolve live URL from product HTML, not sitemap entry."
    },
    {
      "name": "nk",
      "vendor": "Syngenta",
      "brands": ["NK"],
      "crops": ["corn", "soybeans"],
      "verdict": "green",
      "expected_count": 29,
      "base_url": "https://www.syngenta-us.com",
      "pdf_cdn": "https://assets.syngentaebiz.com/pdf/techsheets/",
      "scope_filter": "All NK corn + soy varieties. No wheat (NK doesn't sell wheat in US).",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Disease + agronomic ratings live in tech-sheet PDFs only — need pdfplumber. PDF URLs share format `<CODE>_YYMMDD.pdf` with Golden Harvest, so the same fetcher works for both."
    },
    {
      "name": "agripro",
      "vendor": "Syngenta",
      "brands": ["AgriPro"],
      "crops": ["wheat", "barley"],
      "verdict": "green",
      "expected_count": 24,
      "base_url": "https://www.agriprowheat.com",
      "scope_filter": "All wheat classes (HRW/HRS/HWS/SWW/SWS) + barley. NO SRW — Syngenta's SRW lives at GrowProGenetics.com under a separate brand.",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Drupal Views form; server-rendered HTML. CoAXium trait flag is implicit in product family; Clearfield/CL2 trait IS in this catalog."
    },
    {
      "name": "becks_pfr",
      "vendor": "Beck's Hybrids",
      "brands": ["Beck's PFR"],
      "crops": ["corn", "soybeans", "wheat"],
      "verdict": "yellow",
      "expected_count": 2089,
      "base_url": "https://www.beckshybrids.com",
      "api_base": "https://mc8v24rf.api.sanity.io",
      "scope_filter": "All Practical Farm Research publications since 2015. PFR is head-to-head agronomy trials — fungicide timing, planting-date studies, hybrid-by-population, etc.",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Public Sanity GROQ API, no auth required. Records have title/year/crop/key-findings/full-text. Treat PFR docs as a research corpus, not variety records — the chunk_0 includes the study's tl;dr finding."
    },
    {
      "name": "becks_products",
      "vendor": "Beck's Hybrids",
      "brands": ["Beck's"],
      "crops": ["corn", "soybeans", "wheat"],
      "verdict": "yellow",
      "expected_count": 860,
      "base_url": "https://www.beckshybrids.com",
      "api_base": "https://mc8v24rf.api.sanity.io",
      "scope_filter": "All Beck's product records — corn + soy + wheat. Identity + RM/MG only.",
      "tos_check_date": "2026-05-24",
      "schema_notes": "Sanity GROQ exposes identity (name, RM/MG, basic traits) but agronomic + disease ratings are SeedIQ-gated (requires browser cookie). Deferred until the SeedIQ XHR endpoint is captured from a logged-in browser session. Without ratings, products are reference-only; the MCP can confirm 'Beck's has hybrid X at RM 112 with Enlist trait' but not 'rate it against drought'."
    }
  ],
  "_excluded_sources": [
    {
      "name": "pioneer",
      "vendor": "Corteva",
      "verdict": "red",
      "reason": "Explicit ToS prohibits automated scraping. Dealer locator at /us/sales-representatives/my-local-team.html is login-gated; no public API for dealer contact info. The MCP returns a curated fallback lesson instead of erroring."
    }
  ]
 }