2acba0aa867be202a02ba8b311a908ad7db625fe
21 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2acba0aa86 |
server: catch one more "PPLS" → "crop-chem-docs" rename miss in corpus_status header
Image rebuild (skip scrape) / build (push) Failing after 16m22s
Functional smoke test from trashpanda confirmed end-to-end working:
$ docker run -d ... git.jpaul.io/justin/crop-chem-docs:corpus-2026.05.24
$ docker exec ... python -c 'from docs_mcp.server import corpus_status; print(corpus_status())'
Output: 4,159 labels on disk (4,068 epa_ppls + 91 bayer), 216,467
chunks in Chroma collection `crop_chem_docs`, BM25 db 416 MB,
HYBRID_SEARCH=on, RERANK_URL=http://10.10.1.65:8082.
Image is production-ready for Drawbar compose.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8766d73327 |
deploy: Drawbar compose snippet — first image is published
Image pushed to git.jpaul.io/justin/crop-chem-docs with three tags:
:latest — Watchtower auto-pull target
:a97107de4636 — commit-sha rollback pin
:corpus-2026.05.24 — corpus-snapshot pin (prod-recommended)
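A minimal sketch of the three-tag scheme listed above, for illustration only. `image_tags` is a hypothetical helper and not the workflow's actual tag builder; the 12-character SHA prefix is assumed from the `:a97107de4636` example.

```python
from datetime import date


def image_tags(image: str, commit_sha: str, corpus_date: date) -> list[str]:
    """Build the three tags for one image push (hypothetical helper)."""
    return [
        f"{image}:latest",                          # Watchtower auto-pull target
        f"{image}:{commit_sha[:12]}",               # commit-sha rollback pin
        f"{image}:corpus-{corpus_date:%Y.%m.%d}",   # corpus-snapshot pin
    ]


tags = image_tags(
    "git.jpaul.io/justin/crop-chem-docs",
    "a97107de4636",
    date(2026, 5, 24),
)
for tag in tags:
    print(tag)
```

Pinning production to the corpus-snapshot tag keeps a code-only rebuild from silently changing which label set is served.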
Drawbar compose snippet at deploy/drawbar-compose-snippet.md.
Wires the container against the existing infra:
- Ollama pool: 192.168.0.2:11434, 192.168.0.2:11435,
192.168.0.125:11434, 10.10.1.65:11434
- Reranker: http://10.10.1.65:8082
- HYBRID_SEARCH=true (production retrieval — BM25 + dense + rerank)
- Exposes streamable-HTTP MCP on port 8000
Pull path uses git.jpaul.io (public hostname, CF-fronted; pull
response bodies aren't capped). Push path uses 192.168.0.2:1234
(LAN endpoint, bypasses CF 100MB body cap). Same registry,
different URLs — per the template gotcha doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
420b4fa2d8 |
workflows: use LAN registry endpoint for push (CF 100MB cap)
Cloudflare in front of git.jpaul.io caps HTTP request bodies at
100 MB, which kills container blob pushes for our 6 GB image (Chroma
layer alone is ~2 GB).
Per the template gotcha doc:
- Push via LAN endpoint (192.168.0.2:1234, plain HTTP, in the Gitea
host's insecure-registries list).
- Pull via public hostname (git.jpaul.io); pull response bodies
aren't capped.
REGISTRY_PUSH: 192.168.0.2:1234
REGISTRY_PULL: git.jpaul.io (unchanged; used for the package-link API)
This matches how hvm-docs / morpheus-docs / opsramp-docs / zerto-docs
CI workflows push successfully on the same Gitea host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a97107de46 |
docker: production image + Gitea Actions for monthly refresh
Image rebuild (skip scrape) / build (push) Failing after 1h37m12s
Dockerfile: self-contained image with corpus + Chroma + BM25 baked
in. Drawbar's compose pulls + runs without volume mounts. Built from
sources.json (labels schema), PRODUCT_NAME=crop_chem by default,
HYBRID_SEARCH=true (always-on for production quality). RERANK_URL +
OLLAMA_URL get set at compose time.
.gitea/workflows/refresh.yml: monthly cron (1st @ 06:00 UTC) does
full scrape → reindex → image push. Scrapes Bayer (~30 min) + EPA
PPLS row-crop filtered (~7h). Skips reindex+push if no corpus diff.
Tags pushed: :latest, :<sha12>, :corpus-<YYYY.MM.DD>.
.gitea/workflows/image-only.yml: on-demand or auto on code-only
pushes to main (paths: docs_mcp/, rag/, scrape/, requirements.txt,
Dockerfile, sources.json). Reindexes from committed corpus, builds
image, pushes. ~10 min vs ~9h full refresh.
.gitignore: corpus/ now COMMITTED (4,159 labels, 265 MB of .md +
sidecars). Lets image-only.yml rebuild indexes without re-scraping.
chroma/ + bm25/ still gitignored (regenerable binary indexes).
.dockerignore: drops venv, eval results, PLAN/README/CLAUDE.md,
deploy/, .git/ to keep the image lean. corpus + chroma + bm25
explicitly NOT in dockerignore (those go INTO the image).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1a45280e45 |
rename: ppls-docs → crop-chem-docs
Repo/project rename to better reflect scope. PPLS is EPA's term for
their Pesticide Product Label System — accurate when the corpus was
EPA-only, narrow now that it also pulls from Bayer's own catalog
(and may expand to Syngenta/Corteva/BASF/FMC labels in the future).
crop-chem-docs scopes flexibly without acronyms to explain.
Renames:
- directory: ppls-docs → crop-chem-docs
- PRODUCT_NAME: ppls → crop_chem
- Chroma collection: ppls_docs → crop_chem_docs (in-place via .modify(), no re-embed)
- BM25 db: bm25/ppls_docs.db → bm25/crop_chem_docs.db
- MCP tool name: ppls_api_lessons → crop_chem_api_lessons
- FastMCP server name: ppls-docs → crop-chem-docs
- Env vars: PPLS_CORPUS_ROOT → CORPUS_ROOT
PPLS_CHROMA_DIR → CHROMA_DIR_OVERRIDE
- User-Agent: ppls-docs-scraper → crop-chem-docs-scraper
Preserved (intentional, correct):
- epa_ppls (source id) — refers specifically to EPA's PPLS database
- "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings)
- PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in
scrape/sources/epa_ppls.py — these point at EPA's actual endpoints
Memory entries get updated in a follow-up commit so the rename is
isolated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3c3178a6ad |
eval: GPU rerank baseline + CLI fix
GPU eval (hybrid+rerank, RERANK_URL=http://10.10.1.65:8082):
MRR=0.672 Recall@5=0.638 nDCG@5=0.621
(35 queries, 1 transient 500, otherwise clean)
Quality identical to the CPU rerank run as expected; only latency
changed (single rerank call dropped from ~23s to ~0.7-1.5s on the
Tesla P4). Per-query report at eval/results/with_rerank_gpu.md.
CLI parser fix: `--retrievers dense+rerank,hybrid+rerank` now
correctly wires the dense+rerank variant. Previously only literal
"rerank" (without prefix) matched the dense+rerank branch, so
combined-retriever runs silently dropped dense+rerank.
(Note: the eval's RerankedRetriever does 50 individual Chroma `get`
calls per query to fetch chunk text by (source, source_key); this
adds ~15s per query of pure SQLite lookup overhead. Not a production
concern: docs_mcp/server.py's _rerank_pool reranks docs already in
the dense pool, no extra Chroma round-trips. Worth tightening the
eval-side impl on a later pass.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
af44d7a102 |
Phase 11 + Phase 6 GPU move
## Phase 11 — Curated agronomy / label-handling knowledge layer
docs_mcp/lessons.md: 13 topic-anchored markdown sections covering
the LLM-side context a farmer-advisor needs alongside the raw
label corpus —
- how-to-use-this-corpus
- epa-signal-words
- rei-phi-fundamentals
- rup-handling
- supplemental-labels-24c-2ee
- tank-mix-fundamentals
- resistance-management-hrac-frac-irac
- glufosinate-application-rules
- dicamba-application-rules
- lake-erie-watershed-ohio
- scn-and-other-seed-treatment-context
- drift-management-essentials
- how-to-format-recommendations
Each Topic block is independently retrievable via the new MCP tool:
ppls_api_lessons(topic="rup-handling")
Or with no topic to get the full TOC, or with a substring to
match-and-return matching sections ("dicamba" → dicamba-application-rules).
Tool docstring instructs the LLM to call this proactively before any
pesticide recommendation so the recommendation lands with regulatory
framing, resistance-group callouts, RUP applicator language, and the
canonical recommendation format — not just a rate from a label.
## Phase 6 — Reranker moved to GPU on trashpanda
Stopped the local CPU container and started on trashpanda's Tesla P4
(8 GB VRAM) via:
docker run -d --name llama-rerank --restart unless-stopped --gpus all \
-p 8082:8080 \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
--reranking --host 0.0.0.0 --port 8080 -ngl 99
The :server-cuda image variant (not :server) is required for CUDA
backend; -ngl 99 offloads all layers to GPU.
Latency: 50-doc rerank dropped from ~23 s on CPU to ~0.7-1.5 s on
the Tesla P4 — production-grade interactive speeds.
deploy/rerank-docker.md updated with the trashpanda deploy recipe,
troubleshooting (mostly "did you use server-cuda?"), and a perf
reference table. The MCP server's RERANK_URL just points at
http://10.10.1.65:8082 now.
GPU eval still completing in background; results land in
eval/results/with_rerank_gpu.md as a follow-up commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
278fe5f456 |
Phase 6: reranker sidecar (jina-reranker-v2-base via llama.cpp)
Wires the docs_mcp/server.py reranker hook into a real backend:
ghcr.io/ggml-org/llama.cpp:server \
-hf gpustack/jina-reranker-v2-base-multilingual-GGUF:Q8_0 \
--reranking --host 0.0.0.0 --port 8080
Setup recipe at deploy/rerank-docker.md. The MCP server already
honors RERANK_URL (added in Phase 7+8 commit); setting it to
http://<host>:8082 turns on rerank automatically.
## Eval results (35 queries, k=5, pool=50)
| Retriever | MRR | Recall@5 | nDCG@5 |
|----------------|-------|----------|--------|
| dense | 0.027 | 0.086 | 0.041 |
| bm25 | 0.544 | 0.586 | 0.524 |
| hybrid-rrf | 0.114 | 0.114 | 0.108 |
| dense+rerank | 0.171 | 0.143 | 0.149 |
| hybrid+rerank | 0.672 | 0.638 | 0.621 | ← winner
The reranker fixes hybrid's failure mode (dense noise polluting
the fused pool) by scoring each (query, chunk) pair independently.
Net: hybrid+rerank gives +24% MRR over BM25-only.
Smoke test for the reranker itself (query: "soybean herbicide for
waterhemp", 4 candidates):
index=1 SENCOR metribuzin waterhemp soybean → score=0.84 ← right
index=3 Headline wheat fungicide → score=-2.80
index=2 Lorsban corn rootworm → score=-2.91
index=0 Roundup fallow burndown → score=-3.44
Strong separation between the right doc and the rest.
## Production gotchas
- CPU-only reranker is slow (~23s for a 50-doc pool). For
interactive use put it on GPU (`--gpus all`); ~10-20× faster.
- jina-reranker rejects the ENTIRE batch if any pair exceeds
n_ctx_train=1024 — server truncates each doc to 2000 chars
before sending. Already handled in _rerank_pool.
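A hedged sketch of the truncate-then-rerank pattern with fail-open behaviour. This is not the real `_rerank_pool`; `rerank_pool` and `post_json` are hypothetical names, and the request/response shape (`query` + `documents` in, `results[].index` / `results[].relevance_score` out) is an assumption about the llama.cpp `/v1/rerank` endpoint that should be checked against the server version in use.

```python
import json
import urllib.request

MAX_DOC_CHARS = 2000  # per-doc truncation named in the gotcha above


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def rerank_pool(query: str, docs: list[str], rerank_url: str, post=post_json) -> list[str]:
    """Reorder docs by reranker score; return the base order on any failure."""
    if not rerank_url or not docs:
        return list(docs)
    payload = {"query": query, "documents": [doc[:MAX_DOC_CHARS] for doc in docs]}
    try:
        body = post(rerank_url.rstrip("/") + "/v1/rerank", payload)
        ranked = sorted(body["results"], key=lambda r: r["relevance_score"], reverse=True)
        return [docs[r["index"]] for r in ranked]
    except Exception:
        # Fail open: a broken reranker must not break search.
        return list(docs)
```

Passing `post` as a parameter keeps the function testable without a live reranker.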
Per-query rerank report at eval/results/with_rerank.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
335c33465b |
Phase 7+8: eval harness + hybrid retrieval
## Phase 7 — Eval harness
eval/retrievers.py + rag/retrieval.py: Retriever protocol with
DenseRetriever, BM25Retriever, HybridRetriever (RRF k=60),
RerankedRetriever (llama.cpp /v1/rerank). retrievers.py is now a
thin shim re-exporting from rag.retrieval so the MCP server can
use the same code at request time without making eval/ a runtime
dep.
eval/run_eval.py: drives N retrievers against eval/queries.jsonl,
computes MRR / Recall@K / nDCG@K, emits a markdown report with a
summary table + per-query breakdown for the first retriever. Each
query carries expected (source, source_key) tuples — matches the
labels-domain page-level keying.
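For reference, the three metrics can be sketched with binary relevance over (source, source_key) tuples. This is the standard textbook formulation, not a copy of eval/run_eval.py; the function names and the choice to compute reciprocal rank over the whole returned list are assumptions.

```python
import math

Key = tuple[str, str]  # (source, source_key)


def reciprocal_rank(ranked: list[Key], relevant: set[Key]) -> float:
    """1/rank of the first relevant hit, or 0.0 if none was returned."""
    for rank, key in enumerate(ranked, start=1):
        if key in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[Key], relevant: set[Key], k: int) -> float:
    """Fraction of the expected keys found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[Key], relevant: set[Key], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, key in enumerate(ranked[:k], start=1)
        if key in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Illustrative ranking: the expected label shows up at rank 2.
ranked = [("epa_ppls", "524-475"), ("epa_ppls", "524-591"), ("epa_ppls", "2749-614")]
relevant = {("epa_ppls", "524-591")}
print(reciprocal_rank(ranked, relevant), recall_at_k(ranked, relevant, 5), ndcg_at_k(ranked, relevant, 5))
```

MRR is then the mean of `reciprocal_rank` across all queries.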
eval/queries.jsonl: 35 curated queries — 25 brand-name (Warrant,
Huskie, Roundup Custom, Liberty, Authority, Headline, Trivapro,
Poncho, Lorsban, Sencor, Acuron, ...) + 10 intent/semantic
("what controls horseweed before soybean", "fungicide for fusarium
head blight", "rainfast interval for glyphosate", ...).
## Phase 8 — Hybrid retrieval (BM25 + dense + RRF)
docs_mcp/server.py: search_docs now branches on HYBRID_SEARCH env.
When on, _search_chunks runs both Chroma + BM25 (rag/bm25.py
existing impl), fuses on chunk_id with reciprocal-rank-fusion
(RRF k=60), and returns the combined pool. Dense-only path
unchanged when HYBRID_SEARCH is unset. The rendering layer
(_format_hit) is untouched.
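A minimal sketch of reciprocal-rank fusion on chunk_id with k=60, as described above. `rrf_fuse` is a hypothetical name and not the function in docs_mcp/server.py; the tie-break on chunk_id is an assumption added to make the output deterministic.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense = ["c3", "c1", "c2"]
bm25 = ["c1", "c4"]
print(rrf_fuse([dense, bm25]))
```

In this toy run `c1` wins because both lists contain it, while the dense-only `c3` at rank 1 (1/61) still outranks BM25's rank-2 hit `c4` (1/62). Rank position alone decides the score, regardless of how confident each retriever was.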
The RERANK_URL hook is also wired (_rerank_pool sends docs to
llama.cpp /v1/rerank, truncated to 2000 chars per the jina-reranker
n_ctx_train=1024 batch-rejection gotcha). Fails open to base order
on any exception.
## Baseline numbers (k=5, pool=50, 35 queries)
| Retriever | MRR | Recall@5 | nDCG@5 |
|------------|-------|----------|--------|
| dense | 0.027 | 0.086 | 0.041 |
| bm25 | 0.544 | 0.586 | 0.524 |
| hybrid-rrf | 0.114 | 0.114 | 0.108 |
Headline: BM25 dominates because farmers search for products by
brand name, and brand names are exact-match tokens that lexical
search nails. Dense is poor — semantic embeddings spread across
similar products and don't preferentially weight brand-name tokens.
Textbook RRF hurts when one retriever is much weaker than the
other: dense's irrelevant top-50 pollute the fused pool with
ties at 1/(60+rank). Phase 6 reranker is the planned fix —
the reranker scores each (query, chunk) pair independently
and can recover the right answer regardless of base order.
Per-query report at eval/results/baseline.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
97a2a05b24 |
Phase 3: MCP server tools for the labels corpus
Adapt docs_mcp/server.py from versioned-software-docs domain to
pesticide-labels domain. Standard MCP tool names preserved
(search_docs / get_page / list_versions) so existing MCP clients
(Claude Desktop, Cursor) still pick them up; docstrings + argument
shape are labels-domain.
Tools shipped:
- search_docs(query, source, product_class, registrant_contains,
signal_word, epa_reg_no, k) — dense Chroma query with optional
filters, post-filtered for registrant substring. Returns top-k
chunks rendered as markdown with product / reg / registrant /
actives / signal / section / label-PDF URL.
- get_page(source, source_key) — full label markdown + metadata
header. source_key is slug for MFR sources, EPA Reg No for EPA PPLS.
- list_versions() — discovers facet values: sources, product
classes, signal words, registrants (samples up to 50K chunks
from Chroma to enumerate distinct metadata values).
- corpus_status() — fast no-embedder counts: labels on disk per
source, chunks in Chroma, BM25 db size, active feature flags.
Wiring:
- Reads PPLS_CORPUS_ROOT + PPLS_CHROMA_DIR (matches the scrapers
and indexer).
- Uses sources.json (not the template's bundles.json).
- Lazy Chroma init so the server starts cleanly even when Ollama
is briefly down (e.g. during HVM corpus rebuilds).
- Phase 6 reranker + Phase 8 hybrid hooks left as feature flags
(RERANK_URL, HYBRID_SEARCH) — fail open to dense-only when unset.
Smoke test against the live 216K-chunk corpus:
- corpus_status: 4,157 labels / 216,467 chunks / 416 MB BM25 ✓
- search_docs("waterhemp control on soybeans", k=2) returns
Tackle Herbicide (FMC, 279-3564, glyph+imazethapyr) and
R14640 Herbicide (Bayer, 524-724, glyph) with section context
(ROUNDUP READY SOYBEANS / SOYBEAN) and dist-derived scores
of 0.76 each — highly relevant.
Run as stdio for Claude Desktop:
PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
OLLAMA_URL=http://gpu1:11434,http://gpu2:11434 \
PRODUCT_NAME=ppls \
python -m docs_mcp.server --transport stdio
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
38141c362e |
Phase 2: chunking + parallel Ollama embeddings + Chroma + BM25 indexes
End-to-end RAG pipeline for the pesticide-labels corpus. From the
4,066 labels on USB, the indexer produces 216,467 chunks, embeds
them via N parallel Ollama endpoints, upserts to Chroma, and builds
a BM25 lexical index.
## Files
- rag/index.py: adapted to labels schema (source / source_key /
epa_reg_no / product_name / product_class / registrant /
signal_word / active_ingredients flattened for Chroma where-filter);
honors PPLS_CORPUS_ROOT (corpus on USB) and PPLS_CHROMA_DIR;
upsert batch size auto-tuned to 64 * N URLs; --limit + --source
flags for incremental work.
- rag/chunk.py: label-aware. ALL-CAPS section heading detector
(heuristic) for EPA labels alongside markdown `#` headings.
TARGET_CHARS=2000 (~500 tokens), MAX_CHUNK_CHARS=4000 (~1000
tokens) hard cap with _force_split sentence/char fallback to
defend against monolithic crop+rate tables. Chunk 0 is a synthetic
anchor with product name, EPA Reg No, registrant, signal word,
product class, active ingredients + keyword bag for joint
dense/BM25 retrieval.
- rag/embeddings.py: parallel-dispatch across N Ollama URLs via
ThreadPoolExecutor. Each __call__ stride-slices input into N
shards, fires N concurrent HTTP requests, joins in original order.
Bisect-resilient on 400 (context-length): recursively splits the
failing shard down to single doc, logs+drops single bad doc with
zero-vector placeholder so Chroma upsert never sees a gap. Real
HTTP/connection errors still propagate.
- requirements.txt: chromadb already pinned via template.
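The stride-slice dispatch and bisect-on-failure behaviour described for rag/embeddings.py can be sketched as follows. This is an illustration under assumptions, not the module's code: `ContextLengthError` stands in for an HTTP 400 context-length rejection, each embedder is modelled as a plain callable instead of an Ollama HTTP client, and logging of dropped docs is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


class ContextLengthError(Exception):
    """Hypothetical stand-in for a 400 context-length response."""


def embed_resilient(embed_batch, texts, dim):
    """Embed a shard; on a context-length failure, bisect down to the single bad doc."""
    if not texts:
        return []
    try:
        return embed_batch(texts)
    except ContextLengthError:
        if len(texts) == 1:
            return [[0.0] * dim]  # zero-vector placeholder keeps positions aligned
        mid = len(texts) // 2
        return (embed_resilient(embed_batch, texts[:mid], dim)
                + embed_resilient(embed_batch, texts[mid:], dim))


def embed_parallel(embedders, texts, dim):
    """Stride-slice texts across N embedders, run concurrently, rejoin in input order."""
    n = len(embedders)
    shards = [texts[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(embed_resilient, fn, shard, dim)
                   for fn, shard in zip(embedders, shards)]
        results = [future.result() for future in futures]
    out = [None] * len(texts)
    for i, vectors in enumerate(results):
        out[i::n] = vectors  # inverse of the stride slice
    return out


def fake_embedder(batch):
    """Toy embedder: rejects any batch containing a text longer than 10 chars."""
    if any(len(text) > 10 for text in batch):
        raise ContextLengthError
    return [[float(len(text))] * 2 for text in batch]


vectors = embed_parallel([fake_embedder, fake_embedder], ["a", "bb", "x" * 20, "dddd", "eeeee"], dim=2)
print(vectors)
```

Other exception types are deliberately not caught, so real HTTP or connection errors would still propagate through `future.result()`.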
## Run
PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
OLLAMA_URL=http://host1:11434,http://host2:11434,... \
PRODUCT_NAME=ppls \
python -m rag.index --rebuild
## Build stats
- 216,467 chunks across 4,066 labels (~53 chunks/label avg)
- Wall time: 75.7 min on 4 parallel GPU-backed Ollama endpoints
(production Ollama on trashpanda + 2× 192.168.0.2 + 1× Windows
192.168.0.125). Corpus covers Bayer-Crop / BASF / Corteva / FMC /
Nufarm / Syngenta / etc. chemistry.
- 473 bisect-drops (0.22%) — all from monolithic-table sections
in 1970s-90s scanned PDFs whose pypdf extracts tokenized past
the model's context. Acceptable; the dropped chunks were
garbled OCR with no useful content.
- Chroma: 2.2 GB persistent SQLite at ./chroma/
- BM25: 416 MB SQLite FTS5 at ./bm25/ppls_docs.db
## Smoke-test queries (top-3 dense-only)
"what can I spray on soybeans to control waterhemp"
→ Rage (glyphosate+carfentrazone), Sencor (metribuzin)
"REI for dicamba on corn"
→ Nufarm Credit (DICAMBA tank-mix restrictions section)
"fungicide for wheat head scab"
→ MCW 710 SC (azoxystrobin+tebuconazole), Sercadis (fluxapyroxad)
Distances 0.16-0.23. Dense-only quality is OK-not-great in spots
(exactly the failure mode Phase 6 reranker + Phase 8 hybrid BM25
fusion address).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
92a95d5e78 |
epa_ppls: add registrant allowlist pre-API filter
Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by
dropping products from non-row-crop-ag registrants BEFORE the
per-product API call. This is the biggest cost lever we have on the
EPA scraper: full backfill drops from ~28 h to ~3.5 h.
scrape/sources/epa_registrant_allowlist.json holds the 34 verified
ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm,
ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.).
Each entry was verified by querying the EPA PPLS API for the first
active product registered under that company number. Edit the JSON
freely; scraper loads it at run time.
Bypass with --no-registrant-filter when you suspect a row-crop
product registered to a specialty company not on the list.
Why a curated allowlist rather than blacklist consumer brands: the
102K PPIS rows are 89% non-ag-relevant; an allowlist is shorter to
maintain and harder to false-positive.
Excluded with intent (not omissions): Bayer Environmental Science
(turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon
(animal flea/tick), Control Solutions (structural pest), Cleary
(turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).
Smoke test --limit 100:
- 1239 PPIS rows considered (in first slice of file)
- 1139 skipped by registrant filter (no API call paid)
- 100 hit API, 81 filtered by row-crop sites, 19 written
- = 91% API-call reduction over the prior version
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
420e00b44b |
bayer: dedup by EPA reg no across catalog product-type queries
Bayer's seed-treatment catalog query re-serves products from
herbicide/fungicide/insecticide queries that have seed-treatment use
sites listed. safe_slug() correctly strips the class suffix when the
catalog product type matches, but doesn't strip when querying as
seed-treatment, so the same product gets written twice: once as
"<base>" (canonical class) and once as "<base>-<class>"
(class=seed-treatment).
First full scrape produced 159 files for 87 unique EPA reg nos,
~45% redundant.
Fix:
- process_product accepts an optional seen_regs set and returns
"dup-skip" when the product's EPA reg no is already in it.
- run() seeds seen_regs from existing sidecars on disk via
_load_seen_regs() so dedup survives re-runs (force overrides).
- run() updates seen_regs after each successful write, so within-run
dedup works for the seed-treatment query (which iterates last).
Important nuance preserved: when two genuinely-different brand-name
products share the same EPA reg (e.g., Absolute Maxx + Adament Flow
both = 264-849), they are NOT treated as dups. They're different
catalog entries with different slugs and same canonical class. Only
the seed-treatment-clone pattern (slug = <canonical>-<class> AND
class=seed-treatment AND sibling at same reg with matching class) is
the bug we're fixing.
One-off cleanup of the existing USB corpus removed 68 dup pairs;
159 → 91 files (73 canonical-class + 18 true seed-treatments).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
717426f873 |
scrape: route corpus via PPLS_CORPUS_ROOT env var
Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on
external storage (USB drive, NAS mount, secondary partition) without
editing the repo. Default behavior unchanged: corpus/ at repo root
when the env var is unset.
Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/,
${PPLS_CORPUS_ROOT}/epa_ppls/, etc.
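The resolution rule above (env var if set, else corpus/ at repo root, then a per-source subdirectory) amounts to something like the sketch below. `corpus_dir` and its `repo_root` parameter are hypothetical names, not the scrapers' actual helper.

```python
import os
from pathlib import Path


def corpus_dir(source_id: str, repo_root: Path = Path(".")) -> Path:
    """Resolve the per-source corpus directory, honoring PPLS_CORPUS_ROOT."""
    override = os.environ.get("PPLS_CORPUS_ROOT")
    base = Path(override) if override else Path(repo_root) / "corpus"
    return base / source_id


print(corpus_dir("bayer"))
```

An empty `PPLS_CORPUS_ROOT` is treated the same as unset in this sketch, which is an assumption.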
Live-verified against /run/media/justin/USB (vfat, 59GB free):
PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
python -m scrape.runner --source epa_ppls --reg-no 524-475
-> wrote to USB, root disk untouched
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ea3aea5871 |
epa_ppls: narrow row-crop filter to corn/soy/wheat only
App focus is corn, soybeans, and wheat. Dropping the broader
US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa).
Empirical impact (random N=100 sample): broad list matched 17/100
products, narrow list matches 16/100. That is only a 6% reduction,
because corn/soy/wheat dominate ag-chem registrations so thoroughly
that products registered for cotton/sorghum/etc. are almost always
co-registered for one of corn/soy/wheat. One sampled product was
dropped: a peanut-only herbicide (2749-614).
Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN
sites); 2749-614 AG36448 (PEANUTS only) correctly filtered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
60657aa6df |
epa_ppls: filter PPLS enumeration to row-crop products
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. Filter is on by default; --no-row-crop-filter
overrides for one-off broader pulls.
Filter shape:
- Word-boundary regex match against each entry in the API's `sites`
array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
SHIPHOLDS" → drop even though it contains "OATS" as substring).
- Allowlist covers the major US row + small-grain + oilseed + sugar/
fiber crops, plus alfalfa as a common rotation crop. See
ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.
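The word-boundary behaviour is easy to check in isolation. In the sketch below, `EXAMPLE_KEYWORDS` is an illustrative subset chosen for the demo, not the real ROW_CROP_KEYWORDS list, and `has_row_crop_site` is a hypothetical name.

```python
import re

# Illustrative subset only; the real list lives in scrape/sources/epa_ppls.py.
EXAMPLE_KEYWORDS = ["CORN", "SOYBEAN", "SOYBEANS", "WHEAT", "OATS", "COTTON"]

_SITE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in EXAMPLE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_row_crop_site(sites: list[str]) -> bool:
    """True if any entry of the API's `sites` array contains a keyword as a whole word."""
    return any(_SITE_PATTERN.search(site) for site in sites)


print(has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"]))  # keep
print(has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"]))      # drop: OATS is only a substring
```

A plain substring test would keep the second product, because `"OATS" in "BOATS"` is true; `\b` requires a non-word character or string edge on both sides of the keyword.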
Cost model:
- 102K PPIS rows still need one API call each (no bulk filter
available upstream), so enumeration still takes ~28h at 1 req/sec.
- But PDF downloads drop from ~67K → ~5-10K (estimated row-crop
hit rate), saving ~17h wall time and ~60GB disk on a full backfill.
Smoke test (4 mixed reg nos):
524-475 Roundup Ultra → kept (CORN/SOYBEANS/COTTON sites)
524-591 Warrant → kept (CORN/SOYBEANS/SORGHUM sites)
100-1486 Advion Cockroach → filtered (building/transport sites only)
432-1276 (Bayer pet flea) → filtered (no row crops)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
e9250de8e7 |
scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.
Sources shipped:
- bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
- epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint
Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
- active_ingredients always [{name, cas, percent}]
- label/* nested (url, filename, accepted_date, last_modified,
page_count, text_layer)
- all timestamps normalized to ISO 8601 UTC
- signal_word surfaced (operationally critical for the farmer advisor)
- source_key + epa_reg_no separate per-source PK from the
cross-source join key
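To make the field list concrete, here is a rough sketch of a sidecar in that shape, plus a timestamp normaliser. Only the field names listed above come from the commit; the top-level `source` key, the helper names, the exact timestamp format, and every value except the 524-475 reg no are placeholders or assumptions. scrape/README.md remains the authority.

```python
from datetime import datetime, timedelta, timezone

LABEL_FIELDS = ("url", "filename", "accepted_date", "last_modified", "page_count", "text_layer")
INGREDIENT_FIELDS = ("name", "cas", "percent")
TOP_LEVEL_FIELDS = ("source_key", "epa_reg_no", "signal_word", "active_ingredients", "label")


def to_iso_utc(dt: datetime) -> str:
    """Normalise a datetime to an ISO 8601 UTC string (naive input assumed to be UTC)."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def schema_problems(sidecar: dict) -> list[str]:
    """List missing fields relative to the shape described above."""
    problems = [f"missing {key}" for key in TOP_LEVEL_FIELDS if key not in sidecar]
    for i, ingredient in enumerate(sidecar.get("active_ingredients", [])):
        problems += [f"missing active_ingredients[{i}].{f}" for f in INGREDIENT_FIELDS if f not in ingredient]
    problems += [f"missing label.{f}" for f in LABEL_FIELDS if f not in sidecar.get("label", {})]
    return problems


example = {
    "source": "epa_ppls",        # assumed key name
    "source_key": "524-475",     # per-source primary key
    "epa_reg_no": "524-475",     # cross-source join key
    "signal_word": None,         # placeholder
    "active_ingredients": [{"name": "placeholder", "cas": None, "percent": None}],
    "label": {
        "url": None, "filename": None, "accepted_date": None,
        "last_modified": to_iso_utc(datetime(2024, 1, 2, 3, 4, 5)),  # placeholder timestamp
        "page_count": None, "text_layer": None,
    },
}
print(schema_problems(example))
```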
bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.
PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.
Smoke test:
python -m scrape.runner --all --limit 2 # works
python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips
python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3ca96a3716 | Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP) | ||
|
|
43728320bf |
ci: default PRODUCT_NAME to repo name (caught by template dispatch test)
First dispatch on the empty template failed at Chroma collection
creation because PRODUCT_NAME was the literal string "<product>"
(YAML doesn't expand placeholders), and Chroma rejects collection
names containing characters outside [a-zA-Z0-9._-]:
chromadb.errors.InvalidArgumentError: Validation error: name:
Expected a name containing 3-512 characters from [a-zA-Z0-9._-],
starting and ending with a character in [a-zA-Z0-9]. Got:
<product>_docs
Same fix as the IMAGE env: derive from the repo name dynamically
via ${{ github.event.repository.name }}. Cloners can still override
explicitly, but a fresh clone now runs the index-rebuild step
cleanly out of the box.
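A quick pre-flight check can surface this class of failure before the indexer starts. The sketch below encodes only the rule quoted in the error message above; Chroma may enforce further constraints, and `collection_name` is a hypothetical helper.

```python
import re

# 3-512 chars from [a-zA-Z0-9._-], starting and ending with [a-zA-Z0-9].
_NAME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9._-]{1,510}[a-zA-Z0-9]")


def collection_name(product_name: str) -> str:
    """Derive '<product>_docs' and fail early if it breaks the quoted naming rule."""
    name = f"{product_name}_docs"
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid collection name: {name!r}")
    return name


print(collection_name("crop_chem"))
try:
    collection_name("<product>")
except ValueError as error:
    print(error)
```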
Verified by re-dispatch — should fail next at docker login (placeholder
REGISTRY_PUSH hostname), which is the next-expected fail point and a
real per-deployment config the cloner has to fill in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
33b0fd652e |
ci: derive image name + package linking from repo, add link step
Both workflows had a static IMAGE env (<owner>/<product>-docs-mcp)
and a static --package arg in the GC step. Switch both to Gitea
Actions context variables so a clone of the template into any repo
name works on the first CI run without find/replace:
IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }}
--owner ${{ github.repository_owner }}
--package ${{ github.event.repository.name }}
Also add the "Link container package to this repo" step that was
missing from the template (and which, naively copy-pasted from the
reference build, would have linked everything back to
docs-mcp-template). The new step derives owner + package + link-target all
from the running repo's context.
The github.* namespace is Gitea Actions' inherited GitHub-Actions
context — values come from the Gitea server, not github.com. Same
mechanism the reference build's $GITHUB_SHA tag-builder uses.
CLAUDE.md updated to note that image and package naming are
repo-derived; only registry endpoints and the Ollama URL need
per-clone editing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9ba615c8ee |
initial: docs-mcp-template — build guide + scaffolded server
Template for building hosted MCP servers over a product's public
documentation. Distilled from one production build; everything
product-specific has been factored out.
Contents:
- PLAN.md — comprehensive build guide. 13 phases from project
skeleton through weekly_digest. Includes the gotchas
("fetch-depth: 0 always", reranker per-pair token limit,
Cloudflare body cap, dash-not-bash on Gitea runners), the
decisions worth carrying forward, and a per-product
customization checklist.
- CLAUDE.md — guidance for Claude Code working in a clone of this
template. Phase identification table, conventions (env-gating +
operator confirmation for side-effecting tools, defensive
fallback for retrieval components), common commands.
- README.md — quick-start summary.
Scaffolded code (all signature-stable, with NotImplementedError
stubs where phase-specific work is required):
docs_mcp/server.py FastMCP server, stateless_http=True, with
search_docs / get_page / list_versions
baseline tools and commented stubs for the
rest of the phase set.
docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation,
90-day retention. Reusable as-is.
rag/embeddings.py Ollama embedder (nomic-embed-text default),
load-balanced across N URLs. Reusable.
rag/chunk.py Paragraph-aware chunker with synthetic
chunk 0. Per-product tunable.
rag/index.py Chroma + BM25 builder. --rebuild and
--bm25-only flags.
rag/bm25.py SQLite FTS5 lexical index. Reusable.
scrape/changelog.py --cached / --ref / --json / --history-out.
Reusable.
scrape/README.md What you write per-product.
eval/queries.jsonl.example
Curate ~25 hand-labeled queries here.
eval/retrievers.py Retriever protocol + stub classes.
eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton.
scripts/usage_report.py
Standalone log analyzer; the
FOLLOW-UP CHECKS pattern noted in the
module docstring.
scripts/registry_gc.py
Gitea container registry cleanup. Reusable.
Deployment + CI:
Dockerfile Python 3.12-slim; COPY corpus + chroma
+ bm25 last for cache efficiency.
deploy/docker-compose.yml MCP + reranker sidecar + Watchtower.
Templated with <placeholders>.
.gitea/workflows/refresh.yml Weekly cron + manual dispatch.
fetch-depth: 0, retry-on-race,
three-tag image scheme.
.gitea/workflows/image-only.yml Code-only ship cycle, ~18min.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|