crop-chem-docs/scrape/runner.py
justin e9250de8e7 scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.

Sources shipped:
  - bayer       — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
  - epa_ppls    — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
  - active_ingredients always [{name, cas, percent}]
  - label/* nested (url, filename, accepted_date, last_modified,
    page_count, text_layer)
  - all timestamps normalized to ISO 8601 UTC
  - signal_word surfaced (operationally critical for the farmer advisor)
  - source_key and epa_reg_no kept as separate fields: source_key is the
    per-source primary key, epa_reg_no is the cross-source join key
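
As a rough illustration of the schema bullets above, a sidecar record could be assembled along these lines. This is a sketch, not the code in scrape/: the helper names `to_iso_utc` and `make_sidecar`, the flat top-level layout, and every value in the example record are assumptions made for illustration. Only the field names come from the list above.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(dt: datetime) -> str:
    """Render a datetime as ISO 8601 in UTC; naive inputs are assumed to be UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def make_sidecar(source_key, epa_reg_no, signal_word, ingredients, label):
    """Assemble one sidecar dict (hypothetical helper; layout is an assumption)."""
    return {
        "source_key": source_key,      # per-source primary key
        "epa_reg_no": epa_reg_no,      # cross-source join key
        "signal_word": signal_word,
        "active_ingredients": [
            {"name": name, "cas": cas, "percent": percent}
            for name, cas, percent in ingredients
        ],
        "label": {
            "url": label.get("url"),
            "filename": label.get("filename"),
            "accepted_date": label.get("accepted_date"),
            "last_modified": to_iso_utc(label["last_modified"]),
            "page_count": label.get("page_count"),
            "text_layer": label.get("text_layer"),
        },
    }


# Placeholder values only; none of this describes a real product.
sidecar = make_sidecar(
    source_key="example-product",
    epa_reg_no="000-000",
    signal_word="CAUTION",
    ingredients=[("example ingredient", None, 1.0)],
    label={
        "filename": "example.pdf",
        "last_modified": datetime(2026, 1, 2, 8, 30,
                                  tzinfo=timezone(timedelta(hours=-5))),
        "page_count": 12,
        "text_layer": True,
    },
)
print(sidecar["label"]["last_modified"])  # 2026-01-02T13:30:00Z
```

Keeping `active_ingredients` a list even for single-ingredient products means downstream code never has to branch on shape.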

Renames: bundles.json → sources.json and --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.
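
A minimal sketch of how the runner might read ids out of sources.json, assuming the file is a JSON array of objects that each carry an "id" key. That shape is inferred from runner.py; the `source_ids` helper and the example document are hypothetical.

```python
import json

# Hypothetical sources.json content; the real file may carry more keys per entry.
EXAMPLE_SOURCES_JSON = """
[
  {"id": "bayer"},
  {"id": "epa_ppls"},
  {"note": "entries without an id are skipped"}
]
"""


def source_ids(raw: str) -> list[str]:
    """Return source ids in file order, or [] if the JSON is invalid or not a list."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [entry["id"] for entry in data
            if isinstance(entry, dict) and "id" in entry]


print(source_ids(EXAMPLE_SOURCES_JSON))  # ['bayer', 'epa_ppls']
```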

PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2     # works
  python -m scrape.runner --source bayer --limit 3    # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:27:07 -04:00

"""Thin dispatcher that routes ``--source <id>`` to the right per-source
scraper module.

For ppls-docs the convention is **one source per scraper module** under
``scrape.sources.<id>``. Each module is independently runnable via
``python -m scrape.sources.<id>`` and accepts its own flags; this
runner is a convenience shim for CI / the weekly refresh workflow.

Examples:

    python -m scrape.runner --source bayer --limit 20
    python -m scrape.runner --source epa_ppls --limit 20
    python -m scrape.runner --all            # walk every source in sources.json

Any argument the runner does not recognize is passed through to the
source scraper, so:

    python -m scrape.runner --source bayer --force --product warrant

just dispatches to ``scrape.sources.bayer`` with ``--force --product
warrant`` as argv.
"""
from __future__ import annotations

import argparse
import importlib
import json
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]
SOURCES_JSON = REPO_ROOT / "sources.json"
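

def _load_sources() -> list[dict]:
    """Return the entries in sources.json, or [] if it is missing or invalid."""
    if not SOURCES_JSON.exists():
        return []
    try:
        data = json.loads(SOURCES_JSON.read_text())
    except json.JSONDecodeError:
        return []
    # Guard against a top-level object or scalar; callers expect a list.
    return data if isinstance(data, list) else []


def _run_source(source_id: str, passthrough: list[str]) -> int:
    """Import ``scrape.sources.<source_id>`` and call its ``main(argv)``."""
    mod_name = f"scrape.sources.{source_id}"
    try:
        mod = importlib.import_module(mod_name)
    except ImportError as exc:
        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
        return 2
    main = getattr(mod, "main", None)
    if not callable(main):
        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
        return 2
    # A main() that returns None is treated as success.
    return int(main(passthrough) or 0)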


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    parser.add_argument("--source", help="Source id (matches sources.json)")
    parser.add_argument("--all", action="store_true",
                        help="Run every source listed in sources.json")
    # Unrecognized arguments are collected and forwarded to the source scraper.
    args, passthrough = parser.parse_known_args(argv)
    if not args.source and not args.all:
        parser.error("specify --source <id> or --all")

    sources = _load_sources()
    if args.all:
        ids = [s["id"] for s in sources if "id" in s]
        if not ids:
            print("runner: sources.json is empty or missing", file=sys.stderr)
            return 2
    else:
        # If the source isn't registered in sources.json yet, dispatch anyway
        # so the scraper can be exercised during initial development.
        ids = [args.source]

    rc = 0
    for sid in ids:
        # OR the exit codes so any failing source makes the whole run non-zero.
        rc |= _run_source(sid, passthrough)
    return rc


if __name__ == "__main__":
    sys.exit(main())