scrape: route corpus via PPLS_CORPUS_ROOT env var

Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on
external storage (USB drive, NAS mount, secondary partition) without
editing the repo. Default behavior unchanged: corpus/ at repo root
when the env var is unset.

Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/,
${PPLS_CORPUS_ROOT}/epa_ppls/, etc.

Live-verified against /run/media/justin/USB (vfat, 59GB free):
  PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
    python -m scrape.runner --source epa_ppls --reg-no 524-475
  -> wrote to USB, root disk untouched

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 20:41:56 -04:00
parent ea3aea5871
commit 717426f873
3 changed files with 28 additions and 2 deletions
@@ -41,6 +41,23 @@ Every scraper is **idempotent** by default — re-running with the
same arguments skips records already on disk. Use `--force` to
re-fetch.
### Corpus location

Default: `corpus/` at the repo root. Set the `PPLS_CORPUS_ROOT`
env var to route the corpus to external storage (USB drive, NAS
mount, secondary partition):

```bash
export PPLS_CORPUS_ROOT=/mnt/big-disk/ppls-corpus
python -m scrape.runner --source bayer --limit 20
# writes to /mnt/big-disk/ppls-corpus/bayer/...
```

All sources honor the same env var; each creates its own
`<source_id>/` subdirectory beneath it. Per-source code paths
resolve `CORPUS_DIR` the same way in both cases: under
`PPLS_CORPUS_ROOT` when it is set, and under `corpus/` at the repo
root when it is not.
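The resolution rule above can be sketched in a few lines of Python. This is an illustration only, not the code changed in this commit: `resolve_corpus_root` and `corpus_dir_for` are hypothetical names, the repo root is passed in explicitly, and treating an empty or whitespace-only value as "unset" is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "PPLS_CORPUS_ROOT"  # name taken from the README text above


def resolve_corpus_root(repo_root: Path,
                        env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the corpus root: the env override if set, else <repo>/corpus.

    Hypothetical helper; the real scrapers may resolve this differently.
    """
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as unset.
    override = env.get(ENV_VAR, "").strip()
    if override:
        return Path(override).expanduser()
    return repo_root / "corpus"


def corpus_dir_for(source_id: str, repo_root: Path,
                   env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the per-source directory, <root>/<source_id>/."""
    return resolve_corpus_root(repo_root, env) / source_id


if __name__ == "__main__":
    repo = Path("/home/user/ppls")  # placeholder repo location
    # Env var unset: falls back to corpus/ under the repo root.
    print(corpus_dir_for("bayer", repo, env={}))
    # Env var set: the same source lands under the override.
    print(corpus_dir_for(
        "bayer", repo, env={ENV_VAR: "/mnt/big-disk/ppls-corpus"}))
```

Passing the environment as a mapping keeps the helper testable without mutating `os.environ`; a real module would more likely read the variable once at import time.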
## Scope: corn / soybeans / wheat
The corpus is scoped to the three crops the consumer app focuses on: