justin b1a712308c README: rewrite for crop-chem-docs as a product (was template README)

The README had never been customized after cloning the
docs-mcp-template — title said "docs-mcp-template" and it read as
the template's generic introduction with no mention of EPA PPLS,
the Bayer scraper, the ~4k label corpus, or the production deploy.

Replace with a crop-chem-docs-specific README that covers:

- Corpus inventory: 4,159 indexed pages (91 Bayer + 4,068 EPA PPLS)
- MCP tool catalog with crop_chem_api_lessons specifics
- Eval baseline from eval/results/with_rerank.md showing
  hybrid+rerank wins (MRR 0.672) over BM25-only (0.544) and that
  hybrid-without-rerank actively HURTS (0.114) — same pattern
  seed-mcp found independently
- Note that the deployed rerank was silently failing through
  2026-05-25 due to the llama-rerank Docker network gotcha;
  fixed and re-running eval is on the followup list
- Quick-start commands
- Repo layout reference
- Infrastructure: registry, embedder pool, shared llama-rerank
  sidecar, PRODUCT_NAME=crop_chem
- Cross-link to the sibling seed-mcp project

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 17:50:36 -04:00

8.0 KiB


crop-chem-docs

MCP server over ~4,000 public US row-crop pesticide / herbicide / fertilizer labels — feeding the same Drawbar farm-advisor AI as the sibling seed-mcp. The advisor calls this MCP for label rates, REI/PHI, rotation restrictions, tank-mix guidance, and active-ingredient lookups.

Built on docs-mcp-template (same template lineage as seed-mcp). In production on trashpanda; the Drawbar advisor calls it via the chem: prefix.

What's in the corpus

4,159 indexed pages across two complementary sources:

| Source | Pages | Notes |
|---|---|---|
| `bayer` | 91 | Bayer Crop Science US product pages (Warrant, Harness, Roundup, Liberty, Capreno, etc.). Rich Next.js `__NEXT_DATA__` payload: active ingredients, label rates, MOA codes, supplemental PDFs (24c / 2EE / bulletins). robots.txt explicitly whitelists RAG indexing. |
| `epa_ppls` | 4,068 | EPA Pesticide Product Label System: every registered ag chemistry product. Authoritative source of truth for EPA reg numbers, master labels, signal words, registrant info, formulations. |

MCP tools

Same shape as the docs-mcp-template's standard tools (see docs_mcp/server.py):

| Tool | Purpose |
|---|---|
| `search_docs` | Hybrid dense + BM25 + rerank search over the label corpus, filterable by source. |
| `get_page` | Full label record by `(source, source_key)`. Returns marketing copy + extracted PDF text + sidecar metadata. |
| `list_versions` | Facet discovery (sources, EPA registrant codes, label categories). |
| `crop_chem_api_lessons` | Curated agronomy / regulatory lessons: EPA reg-number normalization, label-supersession ordering, common tank-mix gotchas. |

The template's standard `diff_versions`, `bundle_changelog`, and `weekly_digest` tools are also available if needed.

Retrieval — eval-validated

From eval/results/with_rerank.md (35 golden queries, k=5):

| Retriever | MRR | Recall@5 | nDCG@5 | Time (s) |
|---|---|---|---|---|
| hybrid+rerank | 0.672 | 0.638 | 0.621 | 823 |
| bm25 | 0.544 | 0.586 | 0.524 | 5 |
| dense+rerank | 0.171 | 0.143 | 0.149 | 805 |
| hybrid-rrf | 0.114 | 0.114 | 0.108 | 8 |
| dense | 0.027 | 0.086 | 0.041 | 5 |
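
For reference, the three quality columns follow standard definitions. The block below is a minimal sketch, assuming binary relevance (each golden query lists the IDs of its relevant pages) and a reciprocal rank truncated at k. The function names are illustrative and are not taken from eval/run_eval.py, which may differ in detail.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids, k=5):
    """Return 1/rank of the first relevant hit in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant IDs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-gain nDCG: DCG of the ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values):
    """Average per-query scores into the single number reported per retriever."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

MRR is then `mean(reciprocal_rank(...))` over all golden queries, and likewise for the other two columns.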

Deploy config: HYBRID_SEARCH=true + RERANK_URL=http://llama-rerank:8080.

Pattern matches what seed-mcp found independently:

- Dense embedding alone is essentially useless on this corpus (MRR 0.027). Variety codes, EPA reg numbers, and active-ingredient names have no semantic neighbors, so nomic-embed-text returns noise.
- Hybrid-rrf (no rerank) is worse than BM25 alone (MRR 0.114 vs 0.544). RRF dilutes BM25's strong ranking with dense noise. Don't ship without rerank.
- BM25 alone (MRR 0.544, 5 s) is a solid fallback when the rerank sidecar is unavailable.
- Rerank brings the win: hybrid+rerank MRR 0.672 is about 24% higher than BM25 alone (0.672 / 0.544 ≈ 1.235) and leads every other configuration on MRR, Recall@5, and nDCG@5, at 823 s versus 5 s in the Time column.
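
The dilution effect is easy to reproduce with a toy example. The sketch below implements plain reciprocal rank fusion with the conventional constant k=60; whether the repo's hybrid-rrf retriever uses that constant is an assumption, and `rrf_fuse`, the document IDs, and the alphabetical tie-break are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists by summing 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# BM25 finds useful pages; the dense list is pure noise with no overlap.
bm25_hits = ["b1", "b2", "b3", "b4", "b5"]
dense_hits = ["n1", "n2", "n3", "n4", "n5"]

fused_top5 = rrf_fuse([bm25_hits, dense_hits])[:5]
print(fused_top5)  # ['b1', 'n1', 'b2', 'n2', 'b3']
```

With no agreement between the two lists, RRF simply interleaves them, so BM25's fourth and fifth hits fall out of the top 5. A reranker applied after fusion can push the noise back down, which is consistent with the gap between the hybrid-rrf and hybrid+rerank rows.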

Note on rerank in production: through 2026-05-25 the llama-rerank sidecar was attached to Docker's default bridge network instead of drawbar-backend_default, so chem-mcp could not resolve the llama-rerank hostname through Docker's container DNS. RERANK_URL=http://llama-rerank:8080 instead resolved via public DNS to an unrelated IP, and connections were refused. The MCP silently fell back to dense+BM25, which is effectively the hybrid-rrf configuration (MRR 0.114 in the lab eval). Fixed via docker network connect drawbar-backend_default llama-rerank. Re-running the eval is on the follow-up list; expect the deployed MRR to lift toward the lab number.
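
One way to make this kind of failure loud rather than silent is a startup check that the rerank hostname resolves to an internal address. The following is a sketch only: `resolve_host` and `looks_internal` are hypothetical helpers, not part of docs_mcp/server.py, and the check assumes the Docker network allocates from private address ranges.

```python
import ipaddress
import socket


def resolve_host(host, port=8080):
    """Return the sorted, de-duplicated IP strings that `host` resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def looks_internal(addresses):
    """True only if there is at least one address and all are non-public.

    ipaddress's is_private covers the RFC 1918 ranges that Docker networks
    normally allocate from, as well as loopback.
    """
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

At startup, something like `looks_internal(resolve_host("llama-rerank"))` returning False (or `resolve_host` raising `socket.gaierror`) would be a reason to log a warning or refuse to start, instead of quietly serving results without rerank.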

Quick start

git clone https://git.jpaul.io/justin/crop-chem-docs.git
cd crop-chem-docs
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Sample-scrape to verify wiring:
python -m scrape.runner --source bayer --limit 5

# Full refresh (be polite — bayer is small, epa_ppls is hours):
python -m scrape.runner --source bayer --force
python -m scrape.runner --source epa_ppls --force

# Rebuild Chroma + BM25:
OLLAMA_URL=http://192.168.0.125:11434 PRODUCT_NAME=crop_chem \
  python -m rag.index --rebuild

# Run the eval harness:
RERANK_URL=http://localhost:18080 python -m eval.run_eval \
  --queries eval/queries.jsonl --k 5 \
  --output eval/results/baseline.md

# Local MCP server (stdio for Claude Desktop dev):
PRODUCT_NAME=crop_chem python -m docs_mcp.server --transport stdio

Repo layout

.
├── CLAUDE.md                      # Canonical agent guide
├── PLAN.md                        # Template's 13-phase build guide
├── README.md
├── requirements.txt
├── Dockerfile
├── deploy/
│   ├── docker-compose.yml         # Drop-in compose for Drawbar
│   ├── drawbar-compose-snippet.md # Notes on the parent compose merge
│   └── rerank-docker.md           # llama-rerank service deployment
├── .gitea/workflows/
│   ├── refresh.yml                # Monthly cron: scrape + index + image push
│   └── image-only.yml             # On-demand code-only ship cycle
├── scrape/
│   ├── runner.py                  # Dispatches `--source <id>`
│   ├── changelog.py               # Reusable: --json, --history-out
│   └── sources/
│       ├── bayer.py               # cropscience.bayer.us Next.js scraper
│       └── epa_ppls.py            # EPA PPLS pagination + label PDFs
├── rag/
│   ├── embeddings.py              # nomic-embed-text via Ollama
│   ├── chunk.py                   # Chunker w/ EPA-reg-number preamble
│   ├── index.py                   # Chroma + BM25 builder
│   └── bm25.py                    # FTS5 lexical index
├── docs_mcp/
│   ├── server.py                  # FastMCP — hybrid+rerank
│   ├── lessons.md                 # Curated knowledge layer
│   └── usage.py                   # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl              # 35 golden queries
│   ├── retrievers.py              # 5 named configurations
│   ├── run_eval.py                # MRR / Recall@k / nDCG@k
│   └── results/                   # Baseline + with_rerank measurements
├── scripts/
│   ├── usage_report.py
│   └── registry_gc.py             # Container registry cleanup
└── corpus/                        # Committed scrape output (CI-refreshed)
    ├── bayer/
    └── epa_ppls/
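
The layout describes rag/chunk.py as a chunker with an EPA-reg-number preamble. The idea, as a minimal sketch: repeat the identifying header on every chunk so that a lexical query for a registration number can match any chunk of that label, not only the first. The function name, the preamble format, the word-based splitting, and the placeholder registration number are all assumptions for illustration; the real chunker may work differently.

```python
def chunk_with_preamble(text, epa_reg_no, product_name, max_words=200):
    """Split `text` into word windows, prefixing each with an identifying line."""
    preamble = f"EPA Reg. No. {epa_reg_no} | {product_name}"
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_words):
        body = " ".join(words[start:start + max_words])
        chunks.append(f"{preamble}\n{body}")
    return chunks


# Placeholder values, not a real registration number or product.
example = chunk_with_preamble(
    "Apply before weed emergence. Do not graze treated fields.",
    epa_reg_no="0000-000",
    product_name="Example Herbicide",
    max_words=4,
)
```

Here `example` holds three chunks, and each one starts with the same `EPA Reg. No. 0000-000 | Example Herbicide` line.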

Infrastructure

- Registry: pushes to 192.168.0.2:1234 (LAN, no CF body cap); deploys pull git.jpaul.io/justin/crop-chem-docs:latest (public, CF tunnel). Also tagged :<sha12> for rollback pinning and :corpus-YYYY.MM.DD for snapshot pinning.
- Embedder pool (CI): 3 GPU-pinned Ollama endpoints, weighted toward 192.168.0.125 (RTX 40-series).
- Reranker: shared llama-rerank sidecar on trashpanda's Tesla P4 (jina-reranker-v2-base-multilingual via llama.cpp). Same container serves crop-chem-docs and seed-mcp.
- PRODUCT_NAME: crop_chem. It sets the crop_chem_docs Chroma collection name, the bm25/crop_chem_docs.db path, and the crop_chem_api_lessons tool name.
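
The three PRODUCT_NAME-derived names listed above follow a simple pattern. A small sketch of that mapping, where `derived_names` is a hypothetical helper and the template's actual code may build these strings elsewhere:

```python
import os


def derived_names(product_name):
    """Map a PRODUCT_NAME value to the names that depend on it."""
    return {
        "chroma_collection": f"{product_name}_docs",
        "bm25_path": f"bm25/{product_name}_docs.db",
        "lessons_tool": f"{product_name}_api_lessons",
    }


# Falls back to this repo's value when the variable is unset.
names = derived_names(os.environ.get("PRODUCT_NAME", "crop_chem"))
```

This is why PRODUCT_NAME has to match between the index build and the server: a mismatch would point the server at a collection and BM25 file that the build never wrote.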

Deploy mechanics

Same Watchtower auto-deploy chain as seed-mcp. On every push to main that touches docs_mcp/, rag/, scrape/, requirements.txt, Dockerfile, or sources.json:

1. image-only.yml checks out main plus the committed corpus.
2. Rebuilds Chroma + BM25 (a few minutes on the GPU pool).
3. Runs docker build and pushes three tags to the LAN registry.
4. Links the package to the repo via the Gitea API.
5. Watchtower on trashpanda polls :latest every 5 min and recreates drawbar-backend-chem-mcp-1 when the image changes.
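
The trigger condition and the three tags can be expressed as two small functions. This is an illustrative sketch, not the workflow's actual logic: the real path filter lives in .gitea/workflows/image-only.yml, `:<sha12>` is assumed to mean the first 12 characters of the commit SHA, and both function names are hypothetical.

```python
from datetime import date

# Paths named in this README as triggering an image rebuild.
TRIGGER_PATHS = (
    "docs_mcp/",
    "rag/",
    "scrape/",
    "requirements.txt",
    "Dockerfile",
    "sources.json",
)


def triggers_image_build(changed_files):
    """True if any changed file is a trigger file or sits under a trigger dir."""
    return any(
        path == trigger or (trigger.endswith("/") and path.startswith(trigger))
        for path in changed_files
        for trigger in TRIGGER_PATHS
    )


def image_tags(commit_sha, corpus_date):
    """Return the three tags: latest, short SHA, and corpus snapshot date."""
    return ["latest", commit_sha[:12], f"corpus-{corpus_date:%Y.%m.%d}"]
```

For example, a push that only edits README.md would not trigger a rebuild under this filter, while one touching rag/index.py would.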

Corpus refresh runs monthly via refresh.yml. EPA PPLS is the slow source: at 1 req/sec, a full-scale refresh takes hours.
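
A rough lower bound on the EPA PPLS refresh time follows from the rate limit alone. The sketch below assumes one request per indexed page, so it undercounts any pagination and label-PDF requests. `polite_fetch_all` is a hypothetical illustration of a 1 req/sec limiter, not the code in scrape/sources/epa_ppls.py.

```python
import time


def min_crawl_minutes(n_requests, req_per_sec=1.0):
    """Lower bound on wall-clock minutes imposed by the rate limit alone."""
    return n_requests / req_per_sec / 60.0


def polite_fetch_all(urls, fetch, min_interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, starting requests at least min_interval_s apart."""
    results = []
    last_start = None
    for url in urls:
        if last_start is not None:
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results


print(round(min_crawl_minutes(4068), 1))  # 67.8
```

So the 4,068 indexed pages alone account for a little over an hour at 1 req/sec, before counting any additional requests per product.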

Sibling

seed-mcp covers the row-crop seed-variety + yield-trial side of the advisor's tool catalog. Both MCPs are docs-mcp-template clones running side-by-side on trashpanda, sharing the Ollama pool and the llama-rerank sidecar.

See CLAUDE.md for canonical sidecar schemas, the EPA reg-number normalization rules, and label-supersession ordering.
