rename: ppls-docs → crop-chem-docs

Rename the repo/project to better reflect its scope. PPLS is EPA's
Pesticide Product Label System. The name was accurate when the corpus
was EPA-only, but it is too narrow now that the corpus also pulls from
Bayer's own catalog (and may expand to Syngenta/Corteva/BASF/FMC labels
in the future). crop-chem-docs covers all of these sources and needs no
acronym explained.

Renames:
- directory:           ppls-docs            → crop-chem-docs
- PRODUCT_NAME:        ppls                 → crop_chem
- Chroma collection:   ppls_docs            → crop_chem_docs  (in-place via .modify(), no re-embed)
- BM25 db:             bm25/ppls_docs.db    → bm25/crop_chem_docs.db
- MCP tool name:       ppls_api_lessons     → crop_chem_api_lessons
- FastMCP server name: ppls-docs            → crop-chem-docs
- Env vars:            PPLS_CORPUS_ROOT     → CORPUS_ROOT
                       PPLS_CHROMA_DIR      → CHROMA_DIR_OVERRIDE
- User-Agent:          ppls-docs-scraper    → crop-chem-docs-scraper
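
A rough sketch of the two in-place store renames listed above, not the
script actually used for this commit. The helper names
`rename_collection` and `rename_bm25_db` are hypothetical; `client` is
assumed to be a chromadb client (for example
`chromadb.PersistentClient(path="chroma")`), and the BM25 rename assumes
no process has the database open:

```python
from pathlib import Path


def rename_collection(client, old: str, new: str) -> None:
    """Rename a Chroma collection in place; stored embeddings are kept."""
    # Assumption: `client` exposes get_collection() and the returned
    # collection exposes modify(name=...), as chromadb clients do.
    client.get_collection(old).modify(name=new)


def rename_bm25_db(bm25_dir: Path, old_stem: str, new_stem: str) -> Path:
    """Rename <old_stem>.db to <new_stem>.db, refusing to overwrite."""
    src = bm25_dir / f"{old_stem}.db"
    dst = bm25_dir / f"{new_stem}.db"
    if dst.exists():
        # Hypothetical safety check: never clobber an existing index.
        raise FileExistsError(dst)
    src.rename(dst)
    return dst
```

Usage would look like `rename_collection(client, "ppls_docs", "crop_chem_docs")`
followed by `rename_bm25_db(Path("bm25"), "ppls_docs", "crop_chem_docs")`.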

Preserved (intentional, correct):
- epa_ppls (source id) — refers specifically to EPA's PPLS database
- "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings)
- PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in
  scrape/sources/epa_ppls.py — these point at EPA's actual endpoints

Memory entries get updated in a follow-up commit so the rename is
isolated.
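
Note for existing deployments: after this commit the code reads only the
new env var names, so a shell that still exports PPLS_CORPUS_ROOT or
PPLS_CHROMA_DIR falls back to the repo-root defaults. A minimal sketch of
a transitional lookup follows. It is not part of this commit, and
`env_path` and `LEGACY_ENV` are hypothetical names:

```python
import os
import warnings
from pathlib import Path

# New name -> old name, taken from the rename list in this commit.
LEGACY_ENV = {
    "CORPUS_ROOT": "PPLS_CORPUS_ROOT",
    "CHROMA_DIR_OVERRIDE": "PPLS_CHROMA_DIR",
}


def env_path(name: str, default: Path, environ=None) -> Path:
    """Resolve a path from `name`, falling back to its legacy env var."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        legacy = LEGACY_ENV.get(name)
        value = env.get(legacy) if legacy else None
        if value:
            # Nudge operators toward the new name without breaking them.
            warnings.warn(
                f"{legacy} is deprecated; set {name} instead",
                DeprecationWarning,
                stacklevel=2,
            )
    # Unset and empty values both resolve to the default.
    return Path(value) if value else default
```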

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:25:59 -04:00
parent 3c3178a6ad
commit 1a45280e45
9 changed files with 31 additions and 31 deletions
+2 -2
@@ -9,9 +9,9 @@ any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
 call to answer questions against the docs, surface what changed
 recently, and flag likely inconsistencies.
-> **Domain note for ppls-docs.** This template was originally written
+> **Domain note for crop-chem-docs.** This template was originally written
 > for versioned software product documentation (Zoomin bundles, Hugo
-> sites, etc.). For ppls-docs the domain is pesticide product labels —
+> sites, etc.). For crop-chem-docs the domain is pesticide product labels —
 > the "bundle" abstraction has been replaced with "source"
 > (manufacturer or regulator), and "page" with "product label". The
 > canonical on-disk schema lives in [`scrape/README.md`](scrape/README.md),
+3 -3
@@ -1,8 +1,8 @@
-# PPLS API Lessons
+# Crop-Chem API Lessons
 Curated agronomy + label-handling knowledge that an LLM should know
 *before* giving recommendations from the labels corpus. Surfaced by
-the `ppls_api_lessons` MCP tool.
+the `crop_chem_api_lessons` MCP tool.
 Each top-level `## Topic: <slug>` block is independently retrievable.
 The tool docstring tells the LLM to call this proactively before
@@ -12,7 +12,7 @@ answering any pesticide recommendation question.
 ## Topic: how-to-use-this-corpus
-The PPLS docs corpus is the source of truth for *what's on the label*.
+The crop-chem-docs label corpus is the source of truth for *what's on the label*.
 You should:
 1. **Run `search_docs` first** with the user's natural-language
+9 -9
@@ -1,4 +1,4 @@
-"""MCP server for the ppls-docs pesticide label corpus.
+"""MCP server for the crop-chem-docs pesticide label corpus.
 Adapted from the docs-mcp-template (which targeted versioned software
 docs) for the EPA pesticide-labels domain: ``bundle_id`` → ``source``,
@@ -34,7 +34,7 @@ log = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Product configuration.
 # ---------------------------------------------------------------------------
-PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "ppls")
+PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "crop_chem")
 PRODUCT_DOCS_URL = os.environ.get(
 "PRODUCT_DOCS_URL",
 "https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:1",
@@ -43,8 +43,8 @@ COLLECTION = f"{PRODUCT_NAME}_docs"
 # Paths — corpus on (possibly) external storage, indexes always at repo root.
 REPO_ROOT = Path(__file__).resolve().parent.parent
-CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
-CHROMA_DIR = Path(os.environ.get("PPLS_CHROMA_DIR") or REPO_ROOT / "chroma")
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CHROMA_DIR = Path(os.environ.get("CHROMA_DIR_OVERRIDE") or REPO_ROOT / "chroma")
 BM25_DB = Path(os.environ.get("BM25_DB",
 str(REPO_ROOT / "bm25" / f"{PRODUCT_NAME}_docs.db")))
 SOURCES_JSON = REPO_ROOT / "sources.json"
@@ -464,7 +464,7 @@ def list_versions() -> str:
 cat = _sources()
 # Source-level summary from sources.json
-lines: list[str] = ["# PPLS docs corpus"]
+lines: list[str] = ["# crop-chem-docs corpus"]
 # Live counts from Chroma (best-effort; the server should still
 # render a useful response if Chroma is unreachable)
@@ -628,7 +628,7 @@ def _load_lessons() -> tuple[str, list[tuple[str, str]]]:
 @mcp.tool()
-def ppls_api_lessons(
+def crop_chem_api_lessons(
 topic: Annotated[
 str | None,
 Field(description="OPTIONAL: topic slug or substring (e.g., "
@@ -654,7 +654,7 @@ def ppls_api_lessons(
 warnings that make them actionable. Call this first; cite specific
 lessons in your response.
 """
-with TimedCall("ppls_api_lessons", {"topic": topic}) as _call:
+with TimedCall("crop_chem_api_lessons", {"topic": topic}) as _call:
 full, sections = _load_lessons()
 if not sections:
 _call.set(sections=0)
@@ -663,9 +663,9 @@ def ppls_api_lessons(
 if not topic:
 _call.set(sections=len(sections), returned="toc")
 toc_lines = [
-"# PPLS API lessons — table of contents",
+"# Crop-Chem API lessons — table of contents",
 "",
-f"Call `ppls_api_lessons(topic='<slug>')` to fetch a specific section.",
+f"Call `crop_chem_api_lessons(topic='<slug>')` to fetch a specific section.",
 "",
 ]
 for slug, body in sections:
+4 -4
@@ -5,7 +5,7 @@ into Chroma. With --rebuild, drops + recreates the collection (clean
 state). With --bm25-only, skips Chroma and rebuilds only the FTS5
 index — useful for fast iteration when chunking didn't change.
-The corpus root honors PPLS_CORPUS_ROOT (matching the scrapers).
+The corpus root honors CORPUS_ROOT (matching the scrapers).
 The Chroma + BM25 stores stay at the repo root because both rely on
 filesystem locking semantics that vfat (typical USB drive) doesn't
 provide reliably.
@@ -30,11 +30,11 @@ log = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
 REPO_ROOT = Path(__file__).resolve().parent.parent
-CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
-CHROMA_DIR = Path(os.environ.get("PPLS_CHROMA_DIR") or REPO_ROOT / "chroma")
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
+CHROMA_DIR = Path(os.environ.get("CHROMA_DIR_OVERRIDE") or REPO_ROOT / "chroma")
 # Collection name — convention: <product>_docs. Override via env.
-PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "ppls")
+PRODUCT_NAME = os.environ.get("PRODUCT_NAME", "crop_chem")
 COLLECTION = f"{PRODUCT_NAME}_docs"
+3 -3
@@ -20,10 +20,10 @@ from typing import Iterable, Protocol
 log = logging.getLogger(__name__)
 REPO_ROOT = Path(__file__).resolve().parent.parent
-CHROMA_DIR = Path(os.environ.get("PPLS_CHROMA_DIR") or REPO_ROOT / "chroma")
+CHROMA_DIR = Path(os.environ.get("CHROMA_DIR_OVERRIDE") or REPO_ROOT / "chroma")
 BM25_DB = Path(os.environ.get("BM25_DB",
-str(REPO_ROOT / "bm25" / "ppls_docs.db")))
-COLLECTION = f"{os.environ.get('PRODUCT_NAME', 'ppls')}_docs"
+str(REPO_ROOT / "bm25" / "crop_chem_docs.db")))
+COLLECTION = f"{os.environ.get('PRODUCT_NAME', 'crop_chem')}_docs"
 class Retriever(Protocol):
+3 -3
@@ -44,13 +44,13 @@ re-fetch.
 ### Corpus location
 Default: `corpus/` at the repo root. Override with the
-`PPLS_CORPUS_ROOT` env var to route the corpus to external storage
+`CORPUS_ROOT` env var to route the corpus to external storage
 (USB drive, NAS mount, secondary partition):
 ```bash
-export PPLS_CORPUS_ROOT=/mnt/big-disk/ppls-corpus
+export CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus
 python -m scrape.runner --source bayer --limit 20
-# writes to /mnt/big-disk/ppls-corpus/bayer/...
+# writes to /mnt/big-disk/crop-chem-corpus/bayer/...
 ```
 All sources honor the same env var; each creates its own
+1 -1
@@ -1,7 +1,7 @@
 """Thin dispatcher that routes ``--source <id>`` to the right per-source
 scraper module.
-For ppls-docs the convention is **one source per scraper module** under
+For crop-chem-docs the convention is **one source per scraper module** under
 ``scrape.sources.<id>``. Each module is independently runnable via
 ``python -m scrape.sources.<id>`` and accepts its own flags — this
 runner is a convenience shim for CI / the weekly refresh workflow.
+3 -3
@@ -47,7 +47,7 @@ import requests
 from pypdf import PdfReader
 SCRAPER_VERSION = "0.1.0"
-USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
+USER_AGENT = "crop-chem-docs-scraper/0.1 (+https://drawbar.example/contact)"
 BASE = "https://www.cropscience.bayer.us"
 # Catalog product-type values used in the Next.js data API.
@@ -63,11 +63,11 @@ PRODUCT_CLASS = {
 }
 # Repo root: scrape/sources/bayer.py -> repo root is 3 parents up.
-# Corpus root is overridable via PPLS_CORPUS_ROOT for routing the
+# Corpus root is overridable via CORPUS_ROOT for routing the
 # corpus to external storage (USB drive, NAS mount, etc.) without
 # editing the repo.
 REPO_ROOT = Path(__file__).resolve().parents[2]
-CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
 CORPUS_DIR = CORPUS_ROOT / "bayer"
 # Politeness: target ~1 req/sec to Bayer. Each HTTP method goes through
+3 -3
@@ -63,7 +63,7 @@ from pypdf import PdfReader
 from pypdf.errors import PdfReadError
 SCRAPER_VERSION = "0.1.0"
-USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
+USER_AGENT = "crop-chem-docs-scraper/0.1 (+https://drawbar.example/contact)"
 PPIS_PRODUCT_ZIP_URL = "https://www3.epa.gov/pesticides/PPISdata/product.zip"
 PPLS_API_BASE = "https://ordspub.epa.gov/ords/pesticides/cswu/ppls"
@@ -73,10 +73,10 @@ PPLS_INDEX_URL_TEMPLATE = (
 )
 REPO_ROOT = Path(__file__).resolve().parents[2]
-# Corpus root is overridable via PPLS_CORPUS_ROOT for routing the
+# Corpus root is overridable via CORPUS_ROOT for routing the
 # corpus to external storage (USB drive, NAS mount, etc.) without
 # editing the repo.
-CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
 CORPUS_DIR = CORPUS_ROOT / "epa_ppls"
 REQUEST_DELAY_SECONDS = 1.1 # polite: ~1 req/sec
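
The path constants in these hunks use two slightly different env-var idioms: `CORPUS_ROOT` and `CHROMA_DIR` use `os.environ.get(...) or default`, while `BM25_DB` passes the default as the second argument to `get`. The sketch below illustrates how the two may differ when a variable is exported but empty. It is an illustration only: the `/repo` root and the function names are placeholders, and a plain dict stands in for `os.environ`.

```python
from pathlib import Path

REPO_ROOT = Path("/repo")  # placeholder for the real repo root


def resolve_corpus_root(env: dict) -> Path:
    # "get(...) or default": unset AND empty values both use the default.
    return Path(env.get("CORPUS_ROOT") or REPO_ROOT / "corpus")


def resolve_bm25_db(env: dict) -> Path:
    # "get(name, default)": only an unset variable uses the default;
    # an empty string is passed through to Path() unchanged.
    default = str(REPO_ROOT / "bm25" / "crop_chem_docs.db")
    return Path(env.get("BM25_DB", default))


# An exported-but-empty variable resolves differently in each idiom.
corpus = resolve_corpus_root({"CORPUS_ROOT": ""})
bm25 = resolve_bm25_db({"BM25_DB": ""})
```

Under these assumptions `corpus` falls back to the repo default while `bm25` becomes the current directory, so the `or` form is the more forgiving of the two for values set through shell profiles or CI templates.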