crop-chem-docs

5 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date
justin	e9250de8e7	scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0" — there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source. Sources shipped: - bayer — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs) - epa_ppls — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint Canonical sidecar schema (see scrape/README.md) unifies fields across sources: - active_ingredients always [{name, cas, percent}] - label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer) - all timestamps normalized to ISO 8601 UTC - signal_word surfaced (operationally critical for the farmer advisor) - source_key + epa_reg_no separate per-source PK from the cross-source join key bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development. PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written. Smoke test: python -m scrape.runner --all --limit 2 # works python -m scrape.runner --source bayer --limit 3 # 3 written, idempotent re-run skips python -m scrape.runner --source epa_ppls --reg-no 524-475 # Roundup Ultra, 167 pages, ISO last_modified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 18:27:07 -04:00
justin	3ca96a3716	Strip submit_doc_bug tool and gate (Zerto-specific, not applicable to label MCP)	2026-05-23 17:51:56 -04:00
justin	43728320bf	ci: default PRODUCT_NAME to repo name (caught by template dispatch test) First dispatch on the empty template failed at Chroma collection creation because PRODUCT_NAME was the literal string "<product>" (YAML doesn't expand placeholders), and Chroma rejects collection names containing characters outside [a-zA-Z0-9._-]: chromadb.errors.InvalidArgumentError: Validation error: name: Expected a name containing 3-512 characters from [a-zA-Z0-9._-], starting and ending with a character in [a-zA-Z0-9]. Got: <product>_docs Same fix as the IMAGE env: derive from the repo name dynamically via ${{ github.event.repository.name }}. Cloners can still override explicitly, but a fresh clone now runs the index-rebuild step cleanly out of the box. Verified by re-dispatch — should fail next at docker login (placeholder REGISTRY_PUSH hostname), which is the next-expected fail point and a real per-deployment config the cloner has to fill in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:37:07 -04:00
justin	33b0fd652e	ci: derive image name + package linking from repo, add link step Both workflows had a static IMAGE env (<owner>/<product>-docs-mcp) and a static --package arg in the GC step. Switch both to Gitea Actions context variables so a clone of the template into any repo name works on the first CI run without find/replace: IMAGE: ${{ github.repository_owner }}/${{ github.event.repository.name }} --owner ${{ github.repository_owner }} --package ${{ github.event.repository.name }} Also add the "Link container package to this repo" step that was missing from the template (and which, naively copy-pasted from the reference build, would have linked everything back to docs-mcp-template). The new step derives owner + package + link-target all from the running repo's context. The github.* namespace is Gitea Actions' inherited GitHub-Actions context; values come from the Gitea server, not github.com. Same mechanism the reference build's $GITHUB_SHA tag-builder uses. CLAUDE.md updated to note that image and package naming are repo-derived; only registry endpoints and the Ollama URL need per-clone editing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:34:26 -04:00
justin	9ba615c8ee	initial: docs-mcp-template — build guide + scaffolded server Template for building hosted MCP servers over a product's public documentation. Distilled from one production build; everything product-specific has been factored out. Contents: - PLAN.md — comprehensive build guide. 13 phases from project skeleton through weekly_digest. Includes the gotchas ("fetch-depth: 0 always", reranker per-pair token limit, Cloudflare body cap, dash-not-bash on Gitea runners), the decisions worth carrying forward, and a per-product customization checklist. - CLAUDE.md — guidance for Claude Code working in a clone of this template. Phase identification table, conventions (env-gating + operator confirmation for side-effecting tools, defensive fallback for retrieval components), common commands. - README.md — quick-start summary. Scaffolded code (all signature-stable, with NotImplementedError stubs where phase-specific work is required): docs_mcp/server.py FastMCP server, stateless_http=True, with search_docs / get_page / list_versions baseline tools and commented stubs for the rest of the phase set. docs_mcp/usage.py TimedCall telemetry, JSONL, daily rotation, 90-day retention. Reusable as-is. rag/embeddings.py Ollama embedder (nomic-embed-text default), load-balanced across N URLs. Reusable. rag/chunk.py Paragraph-aware chunker with synthetic chunk 0. Per-product tunable. rag/index.py Chroma + BM25 builder. --rebuild and --bm25-only flags. rag/bm25.py SQLite FTS5 lexical index. Reusable. scrape/changelog.py --cached / --ref / --json / --history-out. Reusable. scrape/README.md What you write per-product. eval/queries.jsonl.example Curate ~25 hand-labeled queries here. eval/retrievers.py Retriever protocol + stub classes. eval/run_eval.py MRR / Recall@K / nDCG@K harness skeleton. scripts/usage_report.py Standalone log analyzer; the FOLLOW-UP CHECKS pattern noted in the module docstring. scripts/registry_gc.py Gitea container registry cleanup. Reusable. 
Deployment + CI: Dockerfile Python 3.12-slim; COPY corpus + chroma + bm25 last for cache efficiency. deploy/docker-compose.yml MCP + reranker sidecar + Watchtower. Templated with <placeholders>. .gitea/workflows/refresh.yml Weekly cron + manual dispatch. fetch-depth: 0, retry-on-race, three-tag image scheme. .gitea/workflows/image-only.yml Code-only ship cycle, ~18min. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 09:18:17 -04:00
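
The Chroma naming rule quoted in commit 43728320bf could be checked before the index-rebuild step runs, so that a placeholder like "<product>" fails early with a clearer message. The following is a minimal sketch that assumes the rule is exactly what the quoted error message states. The helper names `is_valid_collection_name` and `collection_name_for` are hypothetical; they are not part of the template or of chromadb. The `_docs` suffix is inferred from the "<product>_docs" name in the error.

```python
import re

# Rule as quoted in the chromadb error message in commit 43728320bf:
# 3-512 characters from [a-zA-Z0-9._-], starting and ending with [a-zA-Z0-9].
_NAME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9._-]{1,510}[a-zA-Z0-9]")


def is_valid_collection_name(name: str) -> bool:
    """Return True if the name satisfies the quoted naming rule."""
    return _NAME_RE.fullmatch(name) is not None


def collection_name_for(product_name: str) -> str:
    """Build '<product_name>_docs' (hypothetical helper) and fail early if invalid."""
    name = f"{product_name}_docs"
    if not is_valid_collection_name(name):
        raise ValueError(
            f"PRODUCT_NAME {product_name!r} yields an invalid collection name {name!r}"
        )
    return name
```

With this check, `collection_name_for("crop-chem-docs")` returns a usable name, while the unexpanded placeholder `"<product>"` raises a `ValueError` before any Chroma call is made.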
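
The canonical sidecar schema described in commit e9250de8e7 can be pictured as a small builder function. This is only a sketch: the field names come from the commit message, but the function names `to_iso_utc` and `build_sidecar` are hypothetical. Treating `accepted_date` as a timestamp, `text_layer` as a boolean, and naive datetimes as UTC are all assumptions. The authoritative definition is scrape/README.md.

```python
from datetime import datetime, timezone


def to_iso_utc(ts: datetime) -> str:
    """Normalize a datetime to an ISO 8601 UTC string.

    Assumption: a naive datetime is already in UTC.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_sidecar(source_key, epa_reg_no, signal_word, ingredients, label):
    """Assemble one sidecar record (hypothetical helper).

    ingredients: iterable of (name, cas, percent) tuples.
    label: dict with url, filename, accepted_date, last_modified,
           page_count, text_layer.
    """
    return {
        # Per-source primary key, kept separate from the cross-source join key.
        "source_key": source_key,
        "epa_reg_no": epa_reg_no,
        "signal_word": signal_word,
        # Always a list of {name, cas, percent}, whatever the source shape was.
        "active_ingredients": [
            {"name": name, "cas": cas, "percent": float(percent)}
            for name, cas, percent in ingredients
        ],
        "label": {
            "url": label["url"],
            "filename": label["filename"],
            "accepted_date": to_iso_utc(label["accepted_date"]),
            "last_modified": to_iso_utc(label["last_modified"]),
            "page_count": int(label["page_count"]),
            "text_layer": bool(label["text_layer"]),  # assumed boolean
        },
    }
```

Keeping `source_key` and `epa_reg_no` as separate fields mirrors the commit's stated intent: each source keeps its own primary key, while the EPA registration number remains the key for joining records across sources.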
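
Commit 9ba615c8ee describes eval/run_eval.py as an MRR / Recall@K / nDCG@K harness skeleton. For readers unfamiliar with these metrics, the sketch below shows their standard definitions with binary relevance. It is not the template's code, and the function names and the choice of binary relevance are assumptions made for illustration.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (ranked_ids, relevant_id_set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ordering places every relevant id at the top of the list.
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

Each function takes a ranked list of result ids and a set of hand-labeled relevant ids, which matches the shape suggested by the roughly 25 hand-labeled queries in eval/queries.jsonl.example.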