scrape: route corpus via PPLS_CORPUS_ROOT env var

Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on
external storage (USB drive, NAS mount, secondary partition) without
editing the repo. Default behavior is unchanged: corpus/ at the repo
root when the env var is unset or empty.

Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/,
${PPLS_CORPUS_ROOT}/epa_ppls/, etc.
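
A minimal sketch of the resolution rule described above. The helper
name corpus_dir_for and its env parameter are hypothetical and exist
only for illustration; the scrapers themselves compute module-level
constants rather than calling a function like this.

```python
import os
from pathlib import Path


def corpus_dir_for(source: str, repo_root: Path, env=None) -> Path:
    """Hypothetical helper: resolve the corpus directory for one source."""
    # env is injectable only so the rule can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    # An unset or empty PPLS_CORPUS_ROOT falls back to <repo_root>/corpus.
    corpus_root = Path(env.get("PPLS_CORPUS_ROOT") or repo_root / "corpus")
    # The per-source subdirectory is appended the same way in both cases.
    return corpus_root / source
```

Because the fallback uses `or`, an empty string is treated the same as
an unset variable, so an accidental PPLS_CORPUS_ROOT= in the shell does
not redirect the corpus to the current directory.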

Live-verified against /run/media/justin/USB (vfat, 59GB free):
  PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
    python -m scrape.runner --source epa_ppls --reg-no 524-475
  -> wrote to USB, root disk untouched

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 20:41:56 -04:00
parent ea3aea5871
commit 717426f873
3 changed files with 28 additions and 2 deletions
+5 -1
@@ -63,8 +63,12 @@ PRODUCT_CLASS = {
 }
 # Repo root: scrape/sources/bayer.py -> repo root is 3 parents up.
+# Corpus root is overridable via PPLS_CORPUS_ROOT for routing the
+# corpus to external storage (USB drive, NAS mount, etc.) without
+# editing the repo.
 REPO_ROOT = Path(__file__).resolve().parents[2]
-CORPUS_DIR = REPO_ROOT / "corpus" / "bayer"
+CORPUS_ROOT = Path(os.environ.get("PPLS_CORPUS_ROOT") or REPO_ROOT / "corpus")
+CORPUS_DIR = CORPUS_ROOT / "bayer"
 # Politeness: target ~1 req/sec to Bayer. Each HTTP method goes through
 # a tiny token-bucket sleeper to enforce this without per-call asyncio.