rename: ppls-docs → crop-chem-docs

Repo/project rename to better reflect scope. PPLS is EPA's term for
its Pesticide Product Label System: accurate when the corpus was
EPA-only, too narrow now that it also pulls from Bayer's own catalog
(and may expand to Syngenta/Corteva/BASF/FMC labels in the future).
crop-chem-docs covers all of these sources and needs no acronym
explained.

Renames:
- directory:           ppls-docs            → crop-chem-docs
- PRODUCT_NAME:        ppls                 → crop_chem
- Chroma collection:   ppls_docs            → crop_chem_docs  (in-place via .modify(), no re-embed)
- BM25 db:             bm25/ppls_docs.db    → bm25/crop_chem_docs.db
- MCP tool name:       ppls_api_lessons     → crop_chem_api_lessons
- FastMCP server name: ppls-docs            → crop-chem-docs
- Env vars:            PPLS_CORPUS_ROOT     → CORPUS_ROOT
                       PPLS_CHROMA_DIR      → CHROMA_DIR_OVERRIDE
- User-Agent:          ppls-docs-scraper    → crop-chem-docs-scraper

Preserved (intentional, correct):
- epa_ppls (source id) — refers specifically to EPA's PPLS database
- "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings)
- PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in
  scrape/sources/epa_ppls.py — these point at EPA's actual endpoints

Memory entries get updated in a follow-up commit so the rename is
isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:25:59 -04:00
parent 3c3178a6ad
commit 1a45280e45
9 changed files with 31 additions and 31 deletions
+3 -3
@@ -44,13 +44,13 @@ re-fetch.
### Corpus location
Default: `corpus/` at the repo root. Override with the
-`PPLS_CORPUS_ROOT` env var to route the corpus to external storage
+`CORPUS_ROOT` env var to route the corpus to external storage
(USB drive, NAS mount, secondary partition):
```bash
-export PPLS_CORPUS_ROOT=/mnt/big-disk/ppls-corpus
+export CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus
python -m scrape.runner --source bayer --limit 20
-# writes to /mnt/big-disk/ppls-corpus/bayer/...
+# writes to /mnt/big-disk/crop-chem-corpus/bayer/...
```
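For illustration, the override described above resolves roughly as follows. This is a minimal sketch rather than the repo's code: `resolve_corpus_root` and `source_dir` are hypothetical helper names (the scrapers in this commit inline the expression as module constants), and the `env` parameter exists only so the sketch can be exercised without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_corpus_root(repo_root: Path,
                        env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the corpus root, preferring the CORPUS_ROOT env var.

    Hypothetical helper: mirrors the inline expression
    Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus").
    """
    env = os.environ if env is None else env
    # `or` (rather than a .get() default) means an empty CORPUS_ROOT
    # value also falls back to <repo_root>/corpus.
    return Path(env.get("CORPUS_ROOT") or repo_root / "corpus")


def source_dir(repo_root: Path, source_id: str,
               env: Optional[Mapping[str, str]] = None) -> Path:
    """Each source writes under its own subdirectory of the corpus root."""
    return resolve_corpus_root(repo_root, env) / source_id
```

With `CORPUS_ROOT=/mnt/big-disk/crop-chem-corpus`, `source_dir(repo_root, "bayer")` resolves to `/mnt/big-disk/crop-chem-corpus/bayer`. In the two scrapers shown in this diff only `CORPUS_ROOT` is read after the rename, so a shell that still exports `PPLS_CORPUS_ROOT` would fall back to the default `corpus/` directory.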
All sources honor the same env var; each creates its own
+1 -1
@@ -1,7 +1,7 @@
"""Thin dispatcher that routes ``--source <id>`` to the right per-source
scraper module.
-For ppls-docs the convention is **one source per scraper module** under
+For crop-chem-docs the convention is **one source per scraper module** under
``scrape.sources.<id>``. Each module is independently runnable via
``python -m scrape.sources.<id>`` and accepts its own flags — this
runner is a convenience shim for CI / the weekly refresh workflow.
+3 -3
@@ -47,7 +47,7 @@ import requests
from pypdf import PdfReader
SCRAPER_VERSION = "0.1.0"
-USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
+USER_AGENT = "crop-chem-docs-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.cropscience.bayer.us"
# Catalog product-type values used in the Next.js data API.
@@ -63,11 +63,11 @@ PRODUCT_CLASS = {
}
# Repo root: scrape/sources/bayer.py -> repo root is 3 parents up.
-# Corpus root is overridable via PPLS_CORPUS_ROOT for routing the
+# Corpus root is overridable via CORPUS_ROOT for routing the
# corpus to external storage (USB drive, NAS mount, etc.) without
# editing the repo.
REPO_ROOT = Path(__file__).resolve().parents[2]
-CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "bayer"
# Politeness: target ~1 req/sec to Bayer. Each HTTP method goes through
+3 -3
@@ -63,7 +63,7 @@ from pypdf import PdfReader
from pypdf.errors import PdfReadError
SCRAPER_VERSION = "0.1.0"
-USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
+USER_AGENT = "crop-chem-docs-scraper/0.1 (+https://drawbar.example/contact)"
PPIS_PRODUCT_ZIP_URL = "https://www3.epa.gov/pesticides/PPISdata/product.zip"
PPLS_API_BASE = "https://ordspub.epa.gov/ords/pesticides/cswu/ppls"
@@ -73,10 +73,10 @@ PPLS_INDEX_URL_TEMPLATE = (
)
REPO_ROOT = Path(__file__).resolve().parents[2]
-# Corpus root is overridable via PPLS_CORPUS_ROOT for routing the
+# Corpus root is overridable via CORPUS_ROOT for routing the
# corpus to external storage (USB drive, NAS mount, etc.) without
# editing the repo.
-CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_ROOT = Path(os.environ.get("CORPUS_ROOT") or REPO_ROOT / "corpus")
CORPUS_DIR = CORPUS_ROOT / "epa_ppls"
REQUEST_DELAY_SECONDS = 1.1 # polite: ~1 req/sec