e9250de8e799781e690de593f5917364d1d766ca
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.
Sources shipped:
- bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
- epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint
Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
- active_ingredients always [{name, cas, percent}]
- label/* nested (url, filename, accepted_date, last_modified,
page_count, text_layer)
- all timestamps normalized to ISO 8601 UTC
- signal_word surfaced (operationally critical for the farmer advisor)
- source_key + epa_reg_no separate per-source PK from the
cross-source join key
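The unified fields above could be modelled roughly as follows. This is a minimal sketch, not the shipped schema: the field names come from the list above, while the dataclass names, the optional types, and the `to_iso_utc` helper are assumptions made for illustration. scrape/README.md remains the authoritative definition.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def to_iso_utc(dt: datetime) -> str:
    """Normalize a datetime to an ISO 8601 UTC string.

    Assumption: naive datetimes are treated as already being in UTC.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


@dataclass
class ActiveIngredient:
    name: str
    cas: Optional[str]        # CAS number, if the source publishes one
    percent: Optional[float]  # percent by weight, if the source publishes one


@dataclass
class Label:
    url: str
    filename: str
    accepted_date: Optional[str]  # ISO 8601 UTC
    last_modified: Optional[str]  # ISO 8601 UTC
    page_count: Optional[int]
    text_layer: Optional[bool]    # whether the PDF carries extractable text


@dataclass
class Sidecar:
    source_key: str  # per-source primary key
    epa_reg_no: str  # cross-source join key
    signal_word: Optional[str]
    label: Label
    active_ingredients: List[ActiveIngredient] = field(default_factory=list)


# A timestamp reported at UTC-5 is stored as the equivalent UTC instant.
local = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone(timedelta(hours=-5)))
stamp = to_iso_utc(local)
```

Keeping `source_key` and `epa_reg_no` as separate fields lets each source use whatever identifier is natural for it while records from different sources still join on the registration number.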
bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.
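The dispatch-by-id behaviour described above can be sketched as below. The shape of sources.json (a top-level "sources" list of objects with an "id"), the `SCRAPERS` registry, and the stub scraper functions are all hypothetical stand-ins; the real per-source modules fetch and write labels.

```python
import json
from typing import Callable, Dict, Optional


# Hypothetical stubs standing in for the per-source modules.
# Each returns the number of labels it "wrote".
def scrape_bayer(limit: Optional[int] = None) -> int:
    return limit or 0


def scrape_epa_ppls(limit: Optional[int] = None) -> int:
    return limit or 0


# Registry mapping a source id to its scraper entry point (assumed layout).
SCRAPERS: Dict[str, Callable[[Optional[int]], int]] = {
    "bayer": scrape_bayer,
    "epa_ppls": scrape_epa_ppls,
}


def run(sources_json: str, only: Optional[str] = None,
        limit: Optional[int] = None) -> Dict[str, int]:
    """Walk the sources list and dispatch each entry to its scraper by id."""
    written: Dict[str, int] = {}
    for entry in json.loads(sources_json)["sources"]:
        source_id = entry["id"]
        if only is not None and source_id != only:
            continue  # mirrors --source <id>: skip everything else
        if source_id not in SCRAPERS:
            raise ValueError(f"no scraper registered for source {source_id!r}")
        written[source_id] = SCRAPERS[source_id](limit)
    return written


config = '{"sources": [{"id": "bayer"}, {"id": "epa_ppls"}]}'
all_sources = run(config, limit=2)                # like --all --limit 2
bayer_only = run(config, only="bayer", limit=3)   # like --source bayer --limit 3
```

A plain dict registry keeps each source module importable and runnable on its own, since the runner only needs one callable per id.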
PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.
Smoke test:
python -m scrape.runner --all --limit 2 # works
python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips
python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs-mcp-template
A reusable template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.
The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.
What's here
- PLAN.md — comprehensive build guide. Phased approach (13 phases, ~2–3 weeks of focused work for the full stack). Includes the design decisions, the gotchas, and a per-product customization checklist.
- Scaffolded skeleton — working FastMCP server with stub tools,
Dockerfile, docker-compose, CI workflows, eval harness layout,
usage logging. Everything you need to
git clone and start filling in the product-specific bits.
Quick start
git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.
# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio
Repo layout
.
├── PLAN.md # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│ ├── refresh.yml # Weekly scrape + index + image push
│ └── image-only.yml # On-demand code-only ship
├── scrape/
│ ├── README.md # Product-specific scraper goes here
│ └── changelog.py # Reusable: --json, --history-out
├── rag/
│ ├── embeddings.py # Ollama embedder, swappable
│ ├── chunk.py # Chunker — adjust per page format
│ ├── index.py # Builds Chroma + (optionally) BM25
│ └── bm25.py # SQLite FTS5 lexical index
├── docs_mcp/
│ ├── server.py # FastMCP server with stub tools
│ └── usage.py # TimedCall + JSONL telemetry
├── eval/
│ ├── queries.jsonl.example # Curate ~25 hand-labeled queries
│ ├── retrievers.py # Retriever protocol + implementations
│ └── run_eval.py # MRR / Recall@k / nDCG@k harness
├── scripts/
│ ├── usage_report.py # Standalone log analyzer
│ └── registry_gc.py # Container registry cleanup
└── deploy/
  └── docker-compose.yml # Hosting stack: MCP + reranker + Watchtower
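For orientation, the three metrics named next to eval/run_eval.py in the layout above can be computed as in this sketch. It assumes binary relevance (a document is either relevant or not) and uses hypothetical function names; the harness in the repo may define them differently.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One labeled query: relevant documents sit at ranks 2 and 4.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
rr = reciprocal_rank(ranked, relevant)
recall3 = recall_at_k(ranked, relevant, k=3)
ndcg4 = ndcg_at_k(ranked, relevant, k=4)
```

MRR is the mean of `reciprocal_rank` over all labeled queries; the other two are averaged the same way.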
What's product-specific (must implement)
- scrape/: the scraper itself. The template gives you the corpus layout contract and a working changelog.py; the actual extraction logic is yours.
- The corpus on disk (gitignored; rebuilt by CI).
- The reranker GGUF model and llama.cpp container (commented in deploy/docker-compose.yml).
- The reverse proxy / TLS layer in front of the public endpoint.
- The hand-curated knowledge surface (your product's API gotchas, example scripts, anything the LLM should know that the docs don't say).
What's NOT product-specific (works as-is)
- FastMCP server skeleton + tool decoration pattern
- Chroma + Ollama embedding pipeline
- BM25 / SQLite FTS5 lexical index
- Hybrid retrieval (RRF) + reranker integration
- Eval harness (Retriever protocol, MRR/Recall/nDCG)
- Usage logging (TimedCall, JSONL, daily rotation)
- CI workflow shape (weekly + on-demand, retry-on-race, three-tag image scheme)
- Registry GC script
- Standard tools: search_docs, get_page, list_versions, diff_versions, bundle_changelog, weekly_digest, find_doc_inconsistencies, etc.
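As an illustration of the hybrid retrieval item in the list above, reciprocal rank fusion (RRF) merges a dense ranking and a lexical ranking by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic version of that technique, not the template's code: the function name `rrf_fuse` is hypothetical, and k = 60 is the commonly used default rather than a value confirmed by this repo.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores the sum of 1 / (k + rank) over the lists it appears
    in. Ties are broken by document id so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]    # e.g. ids from the embedding index
lexical = ["b", "c", "d"]  # e.g. ids from the BM25 index
fused = rrf_fuse([dense, lexical])
```

Because RRF only looks at ranks, it needs no score calibration between the embedding and BM25 retrievers; the fused list can then be passed to the reranker.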
License
Internal template. Adjust before publishing.
Description
MCP server over US row-crop pesticide labels (EPA PPLS + manufacturer sites). Feeds Drawbar farmer advisor.
Languages
Python 98.8%, Dockerfile 1.2%