justin 92a95d5e78 epa_ppls: add registrant allowlist pre-API filter

Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by
dropping products from non-row-crop-ag registrants BEFORE the per-
product API call. This is the biggest cost lever we have on the EPA
scraper — full backfill drops from ~28 h to ~3.5 h.

scrape/sources/epa_registrant_allowlist.json holds the 34 verified
ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm,
ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.).
Each entry was verified by querying the EPA PPLS API for the first
active product registered under that company number. Edit the JSON
freely; the scraper loads it at run time. Bypass the filter with
--no-registrant-filter when you suspect a row-crop product is
registered to a specialty company not on the list.
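
A minimal sketch of the pre-API filter described above, not the scraper's actual code. The JSON layout (an array of objects with a `company_number` key), the `reg_no` row field, and the function names are all assumptions made for illustration; the company numbers are made up. It relies on EPA registration numbers starting with the registrant's company number (`<company>-<product>`, optionally followed by a distributor number).

```python
import json

# Hypothetical file layout: a JSON array of {"company_number", "name"} objects.
# The real epa_registrant_allowlist.json may be shaped differently.
ALLOWLIST_JSON = """
[
  {"company_number": "11111", "name": "Example Ag Co (made up)"},
  {"company_number": "22222", "name": "Another Ag Co (made up)"}
]
"""


def load_allowlist(text):
    """Parse the allowlist JSON into a set of company-number strings."""
    return {str(entry["company_number"]) for entry in json.loads(text)}


def company_number(reg_no):
    """Return the leading company number of an EPA registration number."""
    return reg_no.split("-", 1)[0].strip()


def apply_registrant_filter(rows, allowlist, enabled=True):
    """Split rows into (kept, skipped_count) before any per-product API call.

    enabled=False mimics the --no-registrant-filter bypass: every row is kept.
    """
    kept, skipped = [], 0
    for row in rows:
        if enabled and company_number(row["reg_no"]) not in allowlist:
            skipped += 1  # dropped here, so no API call is paid for this row
            continue
        kept.append(row)
    return kept, skipped


# Made-up enumeration rows; "reg_no" is an assumed field name.
rows = [
    {"reg_no": "11111-42", "name": "row-crop herbicide"},
    {"reg_no": "99999-7", "name": "lawn product"},
    {"reg_no": "22222-3-55555", "name": "distributor label"},
]

allowlist = load_allowlist(ALLOWLIST_JSON)
kept, skipped = apply_registrant_filter(rows, allowlist)
kept_all, skipped_none = apply_registrant_filter(rows, allowlist, enabled=False)
```

Keeping the allowlist as a set makes each membership check constant-time, which matters little at 34 entries but keeps the filter cheap relative to the API call it avoids.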

Why a curated allowlist rather than a blacklist of consumer brands:
the 102K PPIS rows are 89% non-ag-relevant, so an allowlist is shorter
to maintain and less likely to produce false positives.

Excluded with intent (not omissions): Bayer Environmental Science
(turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon
(animal flea/tick), Control Solutions (structural pest), Cleary
(turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).

Smoke test --limit 100:
  - 1239 PPIS rows considered (in first slice of file)
  - 1139 skipped by registrant filter (no API call paid)
  - 100 hit API, 81 filtered by row-crop sites, 19 written
  - = 91% API-call reduction over the prior version
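
The smoke-test counters above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this message; the variable names are illustrative. The skip rate works out to about 91.9%, in line with the figure quoted above.

```python
# Counters copied from the smoke-test output above.
considered = 1239        # PPIS rows examined in the first slice of the file
skipped = 1139           # dropped by the registrant filter, no API call
api_calls = 100          # rows that reached the per-product API (--limit 100)
filtered_by_site = 81    # API results rejected by the row-crop site filter
written = 19             # labels actually written

# Every considered row is either skipped or sent to the API.
assert skipped + api_calls == considered
# Every API result is either filtered out or written.
assert filtered_by_site + written == api_calls

# Without the registrant filter, all considered rows would have cost a call.
reduction = skipped / considered
print(f"API-call reduction: {reduction:.1%}")
```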

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 23:55:38 -04:00

| Path | Last commit | Date |
| --- | --- | --- |
| .gitea/workflows | ci: default PRODUCT_NAME to repo name (caught by template dispatch test) | 2026-05-22 09:37:07 -04:00 |
| deploy | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| docs_mcp | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| eval | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| rag | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| scrape | epa_ppls: add registrant allowlist pre-API filter | 2026-05-23 23:55:38 -04:00 |
| scripts | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| .gitignore | scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema | 2026-05-23 18:27:07 -04:00 |
| CLAUDE.md | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| Dockerfile | initial: docs-mcp-template — build guide + scaffolded server | 2026-05-22 09:18:17 -04:00 |
| PLAN.md | scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema | 2026-05-23 18:27:07 -04:00 |
| README.md | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | 2026-05-23 17:51:56 -04:00 |
| requirements.txt | scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema | 2026-05-23 18:27:07 -04:00 |
| sources.json | epa_ppls: add registrant allowlist pre-API filter | 2026-05-23 23:55:38 -04:00 |

README.md

docs-mcp-template

A reusable template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out.

The end product is a streamable-HTTP MCP server with ~15 tools that any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can call to answer questions against the docs, surface what changed recently, and flag likely inconsistencies.

What's here

PLAN.md — comprehensive build guide. Phased approach (13 phases, ~2–3 weeks of focused work for the full stack). Includes the design decisions, the gotchas, and a per-product customization checklist.
Scaffolded skeleton — working FastMCP server with stub tools, Dockerfile, docker-compose, CI workflows, eval harness layout, usage logging. Everything you need to git clone and start filling in the product-specific bits.

Quick start

git clone https://git.jpaul.io/justin/docs-mcp-template.git my-product-docs
cd my-product-docs
git remote remove origin  # detach from template
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Read PLAN.md before doing anything else. Pay particular attention to
# Phase 1 (scraper) — that's the most product-specific phase.

# Run the stub server (no corpus yet — just verifies the wiring):
python -m docs_mcp.server --transport stdio

Repo layout

.
├── PLAN.md                        # The build guide. Read first.
├── README.md
├── requirements.txt
├── Dockerfile
├── .gitignore
├── .gitea/workflows/
│   ├── refresh.yml                # Weekly scrape + index + image push
│   └── image-only.yml             # On-demand code-only ship
├── scrape/
│   ├── README.md                  # Product-specific scraper goes here
│   └── changelog.py               # Reusable: --json, --history-out
├── rag/
│   ├── embeddings.py              # Ollama embedder, swappable
│   ├── chunk.py                   # Chunker — adjust per page format
│   ├── index.py                   # Builds Chroma + (optionally) BM25
│   └── bm25.py                    # SQLite FTS5 lexical index
├── docs_mcp/
│   ├── server.py                  # FastMCP server with stub tools
│   └── usage.py                   # TimedCall + JSONL telemetry
├── eval/
│   ├── queries.jsonl.example      # Curate ~25 hand-labeled queries
│   ├── retrievers.py              # Retriever protocol + implementations
│   └── run_eval.py                # MRR / Recall@k / nDCG@k harness
├── scripts/
│   ├── usage_report.py            # Standalone log analyzer
│   └── registry_gc.py             # Container registry cleanup
└── deploy/
    └── docker-compose.yml         # Hosting stack: MCP + reranker + Watchtower
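
For orientation, the metrics named next to `eval/run_eval.py` in the layout above can be sketched as follows. This is an illustrative version with binary relevance, not the template's harness; the function names and the toy queries are assumptions.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two toy hand-labeled queries: (retrieved ranking, set of relevant ids).
queries = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x", "z"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
recall2 = [recall_at_k(r, rel, k=2) for r, rel in queries]
ndcg3 = [ndcg_at_k(r, rel, k=3) for r, rel in queries]
```

MRR only rewards the first hit, while Recall@k and nDCG@k account for every relevant page, so reporting all three makes it easier to see whether a retriever change helped the top result or the overall ranking.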

What's product-specific (must implement)

scrape/ — the scraper itself. The template gives you the corpus layout contract and a working changelog.py; the actual extraction logic is yours.
The corpus on disk (gitignored; rebuilt by CI).
The reranker GGUF model and llama.cpp container (commented in deploy/docker-compose.yml).
The reverse proxy / TLS layer in front of the public endpoint.
The hand-curated knowledge surface (your product's API gotchas, example scripts, anything the LLM should know that the docs don't say).

What's NOT product-specific (works as-is)

FastMCP server skeleton + tool decoration pattern
Chroma + Ollama embedding pipeline
BM25 / SQLite FTS5 lexical index
Hybrid retrieval (RRF) + reranker integration
Eval harness (Retriever protocol, MRR/Recall/nDCG)
Usage logging (TimedCall, JSONL, daily rotation)
CI workflow shape (weekly + on-demand, retry-on-race, three-tag image scheme)
Registry GC script
Standard tools: search_docs, get_page, list_versions, diff_versions, bundle_changelog, weekly_digest, find_doc_inconsistencies, etc.
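
As an illustration of the hybrid retrieval item in the list above, reciprocal rank fusion (RRF) merges ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic version, not the template's code; the function name and page ids are hypothetical, and k = 60 is a commonly used constant rather than a value taken from this repo.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document (rank starts at 1).
    Ties are broken by id so the output order is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from a dense and a lexical retriever.
dense_hits = ["p1", "p2", "p3"]
lexical_hits = ["p3", "p1", "p4"]

fused = rrf_fuse([dense_hits, lexical_hits])
print(fused)  # ['p1', 'p3', 'p2', 'p4']
```

Because RRF uses only ranks, it needs no score normalization between the embedding similarity and the BM25 score, which is the usual reason to prefer it over a weighted sum. The fused list would then typically be truncated and passed to the reranker.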

License

Internal template. Adjust before publishing.
