
justin 38141c362e Phase 2: chunking + parallel Ollama embeddings + Chroma + BM25 indexes

End-to-end RAG pipeline for the pesticide-labels corpus. From the
4,066 labels on USB, the indexer produces 216,467 chunks, embeds
them via N parallel Ollama endpoints, upserts to Chroma, and builds
a BM25 lexical index.

## Files

- rag/index.py: adapted to labels schema (source / source_key /
  epa_reg_no / product_name / product_class / registrant /
  signal_word / active_ingredients flattened for Chroma where-filter);
  honors PPLS_CORPUS_ROOT (corpus on USB) and PPLS_CHROMA_DIR;
  upsert batch size auto-tuned to 64 * N URLs; --limit + --source
  flags for incremental work.
- rag/chunk.py: label-aware. ALL-CAPS section heading detector
  (heuristic) for EPA labels alongside markdown `#` headings.
  TARGET_CHARS=2000 (~500 tokens), MAX_CHUNK_CHARS=4000 (~1000
  tokens) hard cap with _force_split sentence/char fallback to
  defend against monolithic crop+rate tables. Chunk 0 is a synthetic
  anchor with product name, EPA Reg No, registrant, signal word,
  product class, active ingredients + keyword bag for joint
  dense/BM25 retrieval.
- rag/embeddings.py: parallel-dispatch across N Ollama URLs via
  ThreadPoolExecutor. Each __call__ stride-slices input into N
  shards, fires N concurrent HTTP requests, joins in original order.
  Bisect-resilient on 400 (context-length): recursively splits the
  failing shard down to single doc, logs+drops single bad doc with
  zero-vector placeholder so Chroma upsert never sees a gap. Real
  HTTP/connection errors still propagate.
- requirements.txt: chromadb already pinned via template.
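The dispatch strategy described for rag/embeddings.py can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `embed_parallel`, `embed_shard`, `ContextLengthError`, and the `embed(url, docs)` callable are hypothetical names, and the HTTP call to Ollama is abstracted away behind that callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (endpoint URL, batch of texts) -> one vector per text.
EmbedFn = Callable[[str, Sequence[str]], List[Vector]]


class ContextLengthError(Exception):
    """Assumed stand-in for an HTTP 400 caused by over-long input."""


def embed_shard(embed: EmbedFn, url: str, docs: Sequence[str], dim: int) -> List[Vector]:
    """Embed one shard, bisecting whenever the endpoint rejects the batch."""
    if not docs:
        return []
    try:
        return embed(url, docs)
    except ContextLengthError:
        if len(docs) == 1:
            # Single bad doc: drop it, but keep its slot with a zero vector
            # so the caller never sees a gap.
            return [[0.0] * dim]
        mid = len(docs) // 2
        return (embed_shard(embed, url, docs[:mid], dim)
                + embed_shard(embed, url, docs[mid:], dim))


def embed_parallel(embed: EmbedFn, urls: Sequence[str],
                   docs: Sequence[str], dim: int) -> List[Vector]:
    """Stride-slice docs across N endpoints and rejoin in original order."""
    n = len(urls)
    shards = [docs[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(embed_shard, embed, url, shard, dim)
                   for url, shard in zip(urls, shards)]
        # Any other exception (e.g. a connection error) propagates here.
        results = [f.result() for f in futures]
    out: List[Vector] = [[] for _ in docs]
    for i, vectors in enumerate(results):
        out[i::n] = vectors  # inverse of the stride slice
    return out
```

Because shard `i` holds documents `i, i+N, i+2N, ...`, writing each result back with the same stride restores the input order without tracking indices explicitly.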

## Run

  PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
    OLLAMA_URL=http://host1:11434,http://host2:11434,...  \
    PRODUCT_NAME=ppls \
    python -m rag.index --rebuild
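As an illustration of how a comma-separated OLLAMA_URL might be turned into an endpoint list and an upsert batch size of 64 per URL, here is a small sketch. The helper names `ollama_urls` and `upsert_batch_size` are hypothetical, and the fallback to Ollama's default local port is an assumption rather than documented behavior of rag/index.py.

```python
import os
from typing import List, Mapping


def ollama_urls(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split OLLAMA_URL on commas, ignoring blanks and trailing slashes."""
    # Assumed fallback: a single local Ollama on its default port.
    raw = env.get("OLLAMA_URL", "http://localhost:11434")
    return [u.strip().rstrip("/") for u in raw.split(",") if u.strip()]


def upsert_batch_size(n_urls: int, per_url: int = 64) -> int:
    """Scale the batch so each endpoint receives roughly per_url docs."""
    return per_url * max(1, n_urls)


if __name__ == "__main__":
    urls = ollama_urls({"OLLAMA_URL": "http://host1:11434,http://host2:11434"})
    print(urls, upsert_batch_size(len(urls)))
```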

## Build stats

  - 216,467 chunks across 4,066 labels (~53 chunks/label avg)
  - Wall time: 75.7 min on 4 parallel GPU-backed Ollama endpoints
    (production Ollama on trashpanda + 2× 192.168.0.2 +
    1× Windows 192.168.0.125); corpus covers Bayer-Crop / BASF /
    Corteva / FMC / Nufarm / Syngenta / etc. chemistry
  - 473 bisect-drops (0.22%), all from monolithic-table sections
    in 1970s-90s scanned PDFs whose pypdf-extracted text tokenized
    past the model's context length. Acceptable; the dropped chunks
    were garbled OCR with no useful content.
  - Chroma: 2.2 GB persistent SQLite at ./chroma/
  - BM25: 416 MB SQLite FTS5 at ./bm25/ppls_docs.db
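For readers unfamiliar with SQLite FTS5 as a BM25 backend, the following sketch shows the general shape of such an index. The table name `chunks`, its columns, and the helper functions are assumptions for illustration; the actual schema of ppls_docs.db is not shown here. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(conn: sqlite3.Connection, chunks: Iterable[Tuple[str, str]]) -> None:
    """Create an FTS5 table and load (chunk_id, text) rows into it."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
        "USING fts5(chunk_id UNINDEXED, text)"
    )
    conn.executemany("INSERT INTO chunks (chunk_id, text) VALUES (?, ?)", chunks)
    conn.commit()


def to_match(query: str) -> str:
    """Quote each term so punctuation is not parsed as FTS5 query syntax."""
    terms = ['"' + t.replace('"', '""') + '"' for t in query.split()]
    return " OR ".join(terms)


def search(conn: sqlite3.Connection, query: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (chunk_id, score) rows; lower bm25() means a better match."""
    match = to_match(query)
    if not match:
        return []
    return conn.execute(
        "SELECT chunk_id, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (match, k),
    ).fetchall()
```

FTS5's `bm25()` returns values where a smaller number indicates a better match, which is why the query sorts ascending.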

## Smoke-test queries (top-3 dense-only)

  "what can I spray on soybeans to control waterhemp"
    → Rage (glyphosate+carfentrazone), Sencor (metribuzin)
  "REI for dicamba on corn"
    → Nufarm Credit (DICAMBA tank-mix restrictions section)
  "fungicide for wheat head scab"
    → MCW 710 SC (azoxystrobin+tebuconazole), Sercadis (fluxapyroxad)

Distances 0.16-0.23. Dense-only quality is OK-not-great in spots
(exactly the failure mode Phase 6 reranker + Phase 8 hybrid BM25
fusion address).
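Hybrid fusion of dense and BM25 results is often done with reciprocal rank fusion (RRF), which the template README also lists. A minimal sketch, assuming each retriever returns an ordered list of chunk ids; the function name `rrf` and the constant k=60 (a commonly used default) are illustrative choices, not taken from this repository:

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = ["a", "b", "c"]
    lexical = ["b", "d", "a"]
    print(rrf([dense, lexical]))
```

RRF uses only ranks, so Chroma distances and FTS5 BM25 scores never need to be normalized onto a common scale.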

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-24 09:56:49 -04:00

| Path | Last commit | Date |
|---|---|---|
| .gitea/workflows | ci: default PRODUCT_NAME to repo name (caught by template dispatch test) | 2026-05-22 09:37:07 -04:00 |
| deploy | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| docs_mcp | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| eval | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| rag | Phase 2: chunking + parallel Ollama embeddings + Chroma + BM25 indexes | 2026-05-24 09:56:49 -04:00 |
| scrape | epa_ppls: add registrant allowlist pre-API filter | 2026-05-23 23:55:38 -04:00 |
| scripts | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| .gitignore | scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema | 2026-05-23 18:27:07 -04:00 |
| CLAUDE.md | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| Dockerfile | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| PLAN.md | scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema | 2026-05-23 18:27:07 -04:00 |
| README.md | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| requirements.txt | scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema | 2026-05-23 18:27:07 -04:00 |
| sources.json | epa_ppls: add registrant allowlist pre-API filter | 2026-05-23 23:55:38 -04:00 |

README.md

docs-mcp-template

A reusable template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.

What's here

- PLAN.md — comprehensive build guide. Phased approach (13 phases, ~2–3 weeks of focused work for the full stack). Includes the design decisions, the gotchas, and a per-product customization checklist.
- Scaffolded skeleton — working FastMCP server with stub tools, Dockerfile, docker-compose, CI workflows, eval harness layout, usage logging. Everything you need to git clone and start filling in the product-specific bits.

Quick start

git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin  # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.

# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio

Repo layout

.
├── PLAN.md                        # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│   ├── refresh.yml                # Weekly scrape + index + image push
│   └── image-only.yml             # On-demand code-only ship
├── scrape/
│   ├── README.md                  # Product-specific scraper goes here
│   └── changelog.py               # Reusable: --json, --history-out
├── rag/
│   ├── embeddings.py              # Ollama embedder, swappable
│   ├── chunk.py                   # Chunker — adjust per page format
│   ├── index.py                   # Builds Chroma + (optionally) BM25
│   └── bm25.py                    # SQLite FTS5 lexical index
├── docs_mcp/
│   ├── server.py                  # FastMCP server with stub tools
│   └── usage.py                   # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl.example      # Curate ~25 hand-labeled queries
│   ├── retrievers.py              # Retriever protocol + implementations
│   └── run_eval.py                # MRR / Recall@k / nDCG@k harness
├── scripts/
│   ├── usage_report.py            # Standalone log analyzer
│   └── registry_gc.py             # Container registry cleanup
└── deploy/
    └── docker-compose.yml         # Hosting stack: MCP + reranker + Watchtower

What's product-specific (must implement)

- scrape/ — the scraper itself. The template gives you the corpus layout contract and a working changelog.py; the actual extraction logic is yours.
- The corpus on disk (gitignored; rebuilt by CI).
- The reranker GGUF model and llama.cpp container (commented in deploy/docker-compose.yml).
- The reverse proxy / TLS layer in front of the public endpoint.
- The hand-curated knowledge surface (your product's API gotchas, example scripts, anything the LLM should know that the docs don't say).

What's NOT product-specific (works as-is)

- FastMCP server skeleton + tool decoration pattern
- Chroma + Ollama embedding pipeline
- BM25 / SQLite FTS5 lexical index
- Hybrid retrieval (RRF) + reranker integration
- Eval harness (Retriever protocol, MRR/Recall/nDCG)
- Usage logging (TimedCall, JSONL, daily rotation)
- CI workflow shape (weekly + on-demand, retry-on-race, three-tag image scheme)
- Registry GC script
- Standard tools: search_docs, get_page, list_versions, diff_versions, bundle_changelog, weekly_digest, find_doc_inconsistencies, etc.
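The metrics named for the eval harness (MRR, Recall@k, nDCG@k) can be computed per query along these lines. This is a sketch with binary relevance and hypothetical function names; the actual eval/run_eval.py may define them differently. It assumes the ranked list contains no duplicate ids.

```python
import math
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant hit, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```

MRR is then the mean of `reciprocal_rank` over all labeled queries, and the other two metrics are averaged across queries in the same way.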

License

Internal template. Adjust before publishing.
