scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema

Adapts the docs-mcp-template scraping layer for the pesticide-labels domain. The template's bundle/version/platform concepts don't map to labels (there's no "Bayer 8.1.0"; there's just the current accepted label per EPA Reg No), so the scraper layer is reshaped around a "source" abstraction: one source per manufacturer or regulator, one per-product label per source.

Sources shipped:
- bayer: Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
- epa_ppls: EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across sources:
- active_ingredients always [{name, cas, percent}]
- label/* nested (url, filename, accepted_date, last_modified, page_count, text_layer)
- all timestamps normalized to ISO 8601 UTC
- signal_word surfaced (operationally critical for the farmer advisor)
- source_key + epa_reg_no separate per-source PK from the cross-source join key

bundles.json → sources.json. --bundle → --source. The runner walks sources.json and dispatches by id; per-source modules remain independently runnable for development.

PLAN.md gets a one-block domain note up front; later phases (chunking, embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2                       # works
  python -m scrape.runner --source bayer --limit 3              # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475    # Roundup Ultra, 167 pages, ISO last_modified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:27:07 -04:00
parent 3ca96a3716
commit e9250de8e7
9 changed files with 1531 additions and 45 deletions
@@ -29,3 +29,4 @@ var/
 .vscode/
 .idea/
 *.swp
 .claude/
@@ -9,6 +9,19 @@ any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
 call to answer questions against the docs, surface what changed
 recently, and flag likely inconsistencies.
 > **Domain note for ppls-docs.** This template was originally written
 > for versioned software product documentation (Zoomin bundles, Hugo
 > sites, etc.). For ppls-docs the domain is pesticide product labels —
 > the "bundle" abstraction has been replaced with "source"
 > (manufacturer or regulator), and "page" with "product label". The
 > canonical on-disk schema lives in [`scrape/README.md`](scrape/README.md),
 > not in this document. References below to `bundles.json`, `bundle_id`,
 > `--bundle`, `version`, and `platform` are template artifacts: read
 > the first three as `sources.json`, `source_id`, and `--source`, and
 > treat `version` and `platform` as (mostly) not applicable. Phase 1
 > (scraper) is the most heavily adapted; later
 > phases (chunking, embeddings, retrieval, eval) apply largely as
 > written.
 ---
 ## What you're building
@@ -10,6 +10,7 @@ ollama>=0.4.0      # if using Ollama-hosted embedder; swap if not
 # Scraping (Phase 1; adjust per product)
 beautifulsoup4>=4.12
 requests>=2.31
 pypdf>=4.0          # PDF -> text for label extraction
 # playwright>=1.40  # uncomment if you need headless browser fallback
 # Evaluation
@@ -1,59 +1,128 @@
 # scrape/
-Product-specific. **You implement this for each product.** The
+Per-source scrapers for pesticide / herbicide product labels. Each
-template gives you the contract; the extraction logic depends on
+module under `scrape/sources/` pulls a single upstream catalog and
-the upstream doc portal.
+writes its results into `corpus/<source_id>/` using the canonical
 sidecar schema documented below.
-See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
+## Architecture
 expects.
-## What you write
+```
-
+sources.json                       — registry of active sources
-At minimum, two scripts:
+scrape/runner.py                   — thin dispatcher (--source <id> | --all)
-
+scrape/sources/<id>.py             — one source per file
-### `scrape/bundles.py`
+corpus/<id>/<key>.md               — extracted label text (markdown)
-
+corpus/<id>/<key>.json             — canonical metadata sidecar
 Discovers the upstream portal's bundle catalog and writes
 `bundles.json` at the repo root. One entry per bundle (versioned doc
 set) with the schema in PLAN.md.
 ```bash
 python -m scrape.bundles
 ```
-### `scrape/runner.py`
+`<key>` is the per-source primary key — a slug for manufacturer
 sources (e.g. `warrant`, `roundup-powermax-3`) or an EPA Reg No
 for regulator sources (e.g. `524-475`). The sidecar's
 `epa_reg_no` field is the cross-source join key that lets the
 corpus consumer reconcile records from different sources for the
 same product.
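
As a rough illustration of that reconciliation, the sketch below groups sidecars from every source directory by `epa_reg_no`. The helper name `group_by_reg_no` is hypothetical and not part of this repo; it assumes only the `corpus/<id>/<key>.json` layout and the canonical fields described in this README, and it skips records whose `epa_reg_no` is `null`.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_reg_no(corpus_dir: Path) -> dict[str, list[dict]]:
    """Group sidecar records from all sources by their EPA Reg No.

    Hypothetical helper for illustration; not part of the scraper.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    # Layout assumed from this README: corpus/<source_id>/<key>.json
    for path in sorted(corpus_dir.glob("*/*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        reg_no = record.get("epa_reg_no")
        if reg_no:  # best-effort field; null means the record cannot be joined
            groups[reg_no].append(record)
    return dict(groups)
```

Calling `group_by_reg_no(Path("corpus"))` would return one list per registration number, each holding the records that share it. Because the match is exact, a distributor-suffixed number such as `524-591-12345` lands in its own group rather than under `524-591`; whether to fold those together is a consumer-side decision.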
-Scrapes the pages of each bundle (or a single bundle with `--bundle
+## CLI
 <slug>`). Writes:
 - `corpus/<bundle_id>/<page_id>.md` — extracted markdown body
 - `corpus/<bundle_id>/<page_id>.json` — per-page metadata sidecar
 ```bash
-python -m scrape.runner --all --force --concurrency 6
+# Run a single source
-python -m scrape.runner --bundle Admin.VC.HTML.10.9
+python -m scrape.runner --source bayer --limit 20
 python -m scrape.runner --source epa_ppls --reg-no 524-475
 # Run every source registered in sources.json
 python -m scrape.runner --all --limit 50
 # Per-source modules also run standalone
 python -m scrape.sources.bayer --class herbicide --limit 5
 python -m scrape.sources.epa_ppls --seed-file seeds.txt
 ```
-## Tips
+Every scraper is **idempotent** by default — re-running with the
 same arguments skips records already on disk. Use `--force` to
 re-fetch.
- **Sniff before you scrape.** Almost every modern doc portal is an
+## Canonical sidecar schema
  SPA that calls a backend API. Open the browser's Network tab,
  click around, find the underlying JSON. Scraping the API is 10×
  cheaper and 100× more reliable than scraping the rendered HTML.
 - **Idempotent re-scrapes.** Without `--force`, the runner should
  skip pages already on disk so a resume doesn't have to re-fetch
  everything. With `--force`, re-fetch every page — that's the
  weekly cron mode that catches edits.
 - **Respect the portal.** Backoff on 429s. Set a recognizable
  user-agent so the portal owner can identify you if they want to.
 - **Whitespace normalize.** Markdown that round-trips through HTML
  often has extra blank lines. Normalize to a single blank between
  paragraphs so diffs are clean (the changelog summary and digest
  tools care about line counts).
-## What's already reusable
+Every `corpus/<source>/<key>.json` conforms to this shape. Fields
 that don't apply to a given source are `null` (not omitted) so the
 JSON is uniform across sources.
-`scrape/changelog.py` is fully product-agnostic and ready to use
+```json
-as-is. It walks `git diff --name-status` output to produce a
+{
-structured summary, and walks `git log` for the digest history
+  "source": "bayer",
-(Phase 13).
+  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE — cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
 }
 ```
 ### Field reference
 | Field | Type | Required | Notes |
 |---|---|---|---|
 | `source` | string | yes | Matches an `id` in `sources.json`. |
 | `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
 | `epa_reg_no` | string \| null | best-effort | Canonical EPA registration number (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
 | `product_name` | string \| null | yes | Display name. |
 | `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
 | `registrant` | string \| null | best-effort | Expected for regulator sources; often `null` for manufacturer sources, where it would be redundant with `source`. |
 | `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
 | `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
 | `label.url` | string \| null | yes | Direct URL of the current label PDF. |
 | `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
 | `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. Manufacturer sources may not expose this. |
 | `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
 | `label.page_count` | int \| null | best-effort | After download. |
 | `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
 | `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
 | `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
 | `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
 | `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
 | `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
 | `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |
 Sources may add their own extra fields beyond the canonical schema
 (EPA's sidecars carry `registration_status` and
 `registrant_company_number`, for instance). Consumers should ignore
 unknown fields.
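
A minimal sketch of how a consumer might check a sidecar against the table above, assuming the rule stated earlier that inapplicable fields are `null` rather than omitted. The function name `missing_fields` and the key tuples are hypothetical, written for illustration; the check covers key presence only (not types) and ignores unknown extras.

```python
REQUIRED_TOP_LEVEL = (
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
)
LABEL_KEYS = ("url", "filename", "accepted_date", "last_modified",
              "page_count", "text_layer")
SOURCE_URL_KEYS = ("product_page", "label_api", "label_index")
INGREDIENT_KEYS = ("name", "cas", "percent")


def missing_fields(sidecar: dict) -> list[str]:
    """Return dotted names of canonical keys absent from ``sidecar``.

    Keys must be present even when their value is null. Extra,
    source-specific keys are ignored.
    """
    missing = [k for k in REQUIRED_TOP_LEVEL if k not in sidecar]
    label = sidecar.get("label") or {}
    missing += [f"label.{k}" for k in LABEL_KEYS if k not in label]
    urls = sidecar.get("source_urls") or {}
    missing += [f"source_urls.{k}" for k in SOURCE_URL_KEYS if k not in urls]
    for i, ingredient in enumerate(sidecar.get("active_ingredients") or []):
        missing += [
            f"active_ingredients[{i}].{k}"
            for k in INGREDIENT_KEYS
            if k not in ingredient
        ]
    return missing
```

An empty list means every canonical key is present. A check like this could run in CI over `corpus/*/*.json`, though nothing in this commit wires it up.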
 ## Adding a new source
 1. Write `scrape/sources/<id>.py` exposing a `main(argv: list[str]) -> int`
   that accepts at minimum `--limit N` and `--force`.
 2. Conform to the canonical sidecar schema. Add source-specific
   extras as additional top-level keys if they don't fit.
 3. Add an entry to `sources.json` (`id`, `title`, `type`, `homepage`,
   `scraper`, `scraper_version`, `license_note`).
 4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real
   User-Agent identifying the project, retry with backoff on 429/5xx,
   and respect robots.txt unless an explicit carve-out exists (e.g.
   Bayer's RAG allowlist).
 5. Scrapers MUST be idempotent: skip records already on disk unless
   `--force` is set.
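
The steps above can be sketched as a skeleton module. Everything here is a placeholder under stated assumptions: the source id `example`, the `discover()` function, and its hard-coded record are hypothetical, and the network side (rate limiting, User-Agent, retries, robots.txt) is omitted entirely. The sketch only shows the `main(argv)` contract, the `--limit` / `--force` flags, the idempotent skip, and a sidecar that carries every canonical key.

```python
import argparse
import json
from datetime import datetime, timezone
from pathlib import Path

SCRAPER_VERSION = "0.1.0"
SOURCE_ID = "example"  # hypothetical source id, not a real source
CORPUS_DIR = Path("corpus") / SOURCE_ID


def discover() -> list[dict]:
    """Placeholder catalog walk; a real source would query its upstream."""
    return [{"key": "demo-product", "name": "Demo Product",
             "text": "Label text.\n"}]


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(prog=f"scrape.sources.{SOURCE_ID}")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--force", action="store_true")
    args = parser.parse_args(argv)

    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    written = 0
    for item in discover()[: args.limit]:  # a limit of None keeps everything
        md_path = CORPUS_DIR / f"{item['key']}.md"
        json_path = CORPUS_DIR / f"{item['key']}.json"
        if md_path.exists() and not args.force:
            continue  # idempotent: record already on disk
        sidecar = {
            "source": SOURCE_ID,
            "source_key": item["key"],
            "epa_reg_no": None,
            "product_name": item["name"],
            "product_class": None,
            "registrant": None,
            "active_ingredients": [],
            "signal_word": None,
            "label": {"url": None, "filename": None, "accepted_date": None,
                      "last_modified": None, "page_count": None,
                      "text_layer": None},
            "supplemental_documents": [],
            "source_urls": {"product_page": None, "label_api": None,
                            "label_index": None},
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "scraper_version": SCRAPER_VERSION,
        }
        md_path.write_text(item["text"], encoding="utf-8")
        json_path.write_text(json.dumps(sidecar, indent=2) + "\n",
                             encoding="utf-8")
        written += 1
    print(f"{SOURCE_ID}: {written} written")
    return 0
```

Saved as `scrape/sources/example.py` and registered in `sources.json`, a module shaped like this would be reachable through `python -m scrape.runner --source example --limit 1`, since the runner only requires a callable `main` that takes the passthrough argv.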
@@ -0,0 +1,87 @@
 """Thin dispatcher that routes ``--source <id>`` to the right per-source
 scraper module.
 For ppls-docs the convention is **one source per scraper module** under
 ``scrape.sources.<id>``. Each module is independently runnable via
 ``python -m scrape.sources.<id>`` and accepts its own flags — this
 runner is a convenience shim for CI / the weekly refresh workflow.
 Examples:
    python -m scrape.runner --source bayer --limit 20
    python -m scrape.runner --source epa_ppls --limit 20
    python -m scrape.runner --all          # walk every source in sources.json
 Anything after the recognized flags is passed through to the source
 scraper, so:
    python -m scrape.runner --source bayer --force --product warrant
 just dispatches to ``scrape.sources.bayer`` with ``--force --product
 warrant`` as argv.
 """
 from __future__ import annotations
 import argparse
 import importlib
 import json
 import sys
 from pathlib import Path
 REPO_ROOT = Path(__file__).resolve().parents[1]
 SOURCES_JSON = REPO_ROOT / "sources.json"
 def _load_sources() -> list[dict]:
    if not SOURCES_JSON.exists():
        return []
    try:
        return json.loads(SOURCES_JSON.read_text())
    except json.JSONDecodeError:
        return []
 def _run_source(source_id: str, passthrough: list[str]) -> int:
    mod_name = f"scrape.sources.{source_id}"
    try:
        mod = importlib.import_module(mod_name)
    except ImportError as exc:
        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
        return 2
    main = getattr(mod, "main", None)
    if not callable(main):
        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
        return 2
    return int(main(passthrough) or 0)
 def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    parser.add_argument("--source", help="Source id (matches sources.json)")
    parser.add_argument("--all", action="store_true",
                        help="Run every source listed in sources.json")
    args, passthrough = parser.parse_known_args(argv)
    if not args.source and not args.all:
        parser.error("specify --source <id> or --all")
    sources = _load_sources()
    if args.all:
        ids = [s["id"] for s in sources if "id" in s]
        if not ids:
            print("runner: sources.json is empty or missing", file=sys.stderr)
            return 2
    else:
        # If the source isn't registered in sources.json yet, dispatch anyway
        # so the scraper can be exercised during initial development.
        ids = [args.source]
    rc = 0
    for sid in ids:
        rc |= _run_source(sid, passthrough)
    return rc
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,696 @@
 """Bayer Crop Science US label scraper.
 Pulls herbicide / fungicide / insecticide / seed-treatment product
 metadata and label PDFs from https://www.cropscience.bayer.us, extracts
 each PDF to markdown, and writes a metadata sidecar JSON per product.
 Output:
    corpus/bayer/<slug>.md     extracted label text
    corpus/bayer/<slug>.json   metadata sidecar (canonical schema in
                               scrape/README.md)
 The scraper resolves Bayer's rotating Next.js ``buildId`` from the
 homepage at runtime, then walks the catalog JSON API for each product
 class. It extracts the rest of the label/MSDS/supplemental download
 URLs from each product page's ``__NEXT_DATA__`` JSON island — this is
 strictly cheaper and more stable than scraping rendered HTML.
 robots.txt for cropscience.bayer.us explicitly allows scraping for
 "search engine indexing or artificial intelligence retrieval augmented
 generation" use cases, which is what this corpus feeds.
 CLI:
    python -m scrape.sources.bayer --limit 20
    python -m scrape.sources.bayer --limit 20 --force
    python -m scrape.sources.bayer --product warrant
    python -m scrape.sources.bayer --class herbicide --limit 5
 """
 from __future__ import annotations
 import argparse
 import io
 import json
 import logging
 import os
 import random
 import re
 import sys
 import time
 from dataclasses import dataclass, field
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Any, Iterable
 import requests
 from pypdf import PdfReader
 SCRAPER_VERSION = "0.1.0"
 USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
 BASE = "https://www.cropscience.bayer.us"
 # Catalog product-type values used in the Next.js data API.
 PRODUCT_TYPES = ("Herbicide", "Fungicide", "Insecticide", "Seed_Treatment")
 # Map product-type filter -> the canonical "product_class" we record
 # in the sidecar (matches the legacy URL segments).
 PRODUCT_CLASS = {
    "Herbicide": "herbicide",
    "Fungicide": "fungicide",
    "Insecticide": "insecticide",
    "Seed_Treatment": "seed-treatment",
 }
 # Repo root: scrape/sources/bayer.py -> repo root is 3 parents up.
 REPO_ROOT = Path(__file__).resolve().parents[2]
 CORPUS_DIR = REPO_ROOT / "corpus" / "bayer"
 # Politeness: target ~1 req/sec to Bayer. Every HTTP request goes through
 # a small minimum-interval sleeper (simpler than a true token bucket: no
 # bursts) to enforce this without needing asyncio.
 REQ_INTERVAL_SEC = 1.0
 log = logging.getLogger("scrape.bayer")
 # --------------------------------------------------------------------- HTTP
 class RateLimitedSession:
    """``requests.Session`` wrapper with sleep-based rate limiting and
    polite retries on 429/5xx."""
    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0
    def _wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()
    def request(
        self,
        method: str,
        url: str,
        *,
        max_retries: int = 4,
        timeout: float = 30.0,
        **kw: Any,
    ) -> requests.Response:
        last_exc: Exception | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
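                resp = self.s.request(method, url, timeout=timeout, **kw)
                # A response arrived, so any earlier network error is stale
                # and must not be re-raised after the loop.
                last_exc = None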
            except requests.RequestException as exc:
                last_exc = exc
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s — retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            if resp.status_code in (429,) or 500 <= resp.status_code < 600:
                # Honor Retry-After if present, else exponential backoff.
                ra = resp.headers.get("Retry-After")
                if ra and ra.isdigit():
                    backoff = float(ra)
                else:
                    backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s — retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        if last_exc:
            raise last_exc
        # Final response (still bad) returned for caller to handle.
        return resp
    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)
    def head(self, url: str, **kw: Any) -> requests.Response:
        kw.setdefault("allow_redirects", True)
        return self.request("HEAD", url, **kw)
 # --------------------------------------------------------------------- model
@dataclass
 class SupplementalDoc:
    kind: str
    title: str
    url: str
    last_modified: str | None = None
@dataclass
 class BayerProduct:
    slug: str                                # filesystem-safe slug, e.g. "warrant"
    catalog_slug: str                        # bayer's seoSlug, e.g. "warrant-herbicide"
    product_url_path: str                    # e.g. "/crop-protection/herbicide/warrant-herbicide"
    product_class: str                       # "herbicide" | "fungicide" | ...
    product_name: str = ""
    epa_reg_no: str | None = None
    active_ingredients: list[dict] = field(default_factory=list)  # [{name, cas, percent}]
    label_url: str | None = None
    label_filename: str | None = None
    label_last_modified: str | None = None
    label_page_count: int | None = None
    label_text_layer: bool | None = None
    supplemental_pdfs: list[SupplementalDoc] = field(default_factory=list)
    source_page_url: str = ""
 # --------------------------------------------------------------------- helpers
 _NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S
 )
 def parse_next_data(html: str) -> dict[str, Any]:
    """Pull the ``__NEXT_DATA__`` JSON blob out of a Next.js page."""
    m = _NEXT_DATA_RE.search(html)
    if not m:
        raise RuntimeError("no __NEXT_DATA__ script tag found")
    return json.loads(m.group(1))
 def fetch_build_id(http: RateLimitedSession) -> str:
    """Grab the rotating ``buildId`` from the Bayer homepage."""
    r = http.get(BASE + "/")
    r.raise_for_status()
    data = parse_next_data(r.text)
    bid = data.get("buildId")
    if not bid:
        raise RuntimeError("buildId missing from homepage __NEXT_DATA__")
    log.info("resolved Bayer buildId=%s", bid)
    return bid
 def normalize_epa_reg(raw: str | None) -> str | None:
    """Convert Bayer's padded reg number to canonical EPA form.
    Example: ``0000524-00591-AA-0000000`` -> ``524-591``.
    The trailing ``-AA-0000000`` is a Bayer-internal qualifier we
    don't surface. If the third segment is non-empty and not ``AA``,
    it is kept as a suffix (``524-591-<sub>``); this is rare.
    """
    if not raw:
        return None
    parts = raw.split("-")
    if len(parts) < 2:
        return raw.strip() or None
    company = parts[0].lstrip("0") or "0"
    product = parts[1].lstrip("0") or "0"
    epa = f"{company}-{product}"
    # If the third segment is something other than the default "AA",
    # it's likely a distributor sub-reg. Preserve it.
    if len(parts) >= 3 and parts[2] and parts[2] != "AA":
        epa += f"-{parts[2]}"
    return epa
 def classify_supplemental(title: str, url: str) -> str:
    """Classify a supplemental/auxiliary doc by its title or URL.
    Returns a short kind tag: ``24C`` (or ``24C-CA`` when a state code
    is found in the URL), ``2EE``, ``MSDS``, ``Bulletin``,
    ``Supplemental``, ``Label``, or ``Other``. The exact tag isn't
    load-bearing for the scraper; it's metadata to help the
    chunker/agent later. Best-effort substring matching in that order,
    first match wins; anything unmatched = ``Other``.
    """
    t = (title or "").upper()
    u = (url or "").upper()
    blob = f"{t} {u}"
    # State-specific 24c labels usually carry a two-letter state code,
    # but Bayer's titles rarely encode it. Best we can do is flag 24c.
    if "24C" in blob or "SECTION_24C" in blob or "SECTION 24C" in blob:
        # Try to spot a state suffix in the URL (e.g. "_24c_ca").
        m = re.search(r"24[_-]?C[_-]([A-Z]{2})\b", u)
        if m:
            return f"24C-{m.group(1)}"
        return "24C"
    if "2EE" in blob or "2_EE" in blob:
        return "2EE"
    if "MSDS" in blob or "SDS" in blob or "SAFETY DATA" in blob:
        return "MSDS"
    if "BULLETIN" in blob:
        return "Bulletin"
    if "SUPPLEMENTAL" in blob:
        return "Supplemental"
    if "LABEL" in blob:
        return "Label"
    return "Other"
 def safe_slug(catalog_slug: str, product_class: str) -> str:
    """Strip the trailing class suffix so ``warrant-herbicide`` becomes
    ``warrant``; falls back to the full slug for slugs that don't end
    with the class word."""
    suffix = f"-{product_class}"
    if catalog_slug.endswith(suffix):
        return catalog_slug[: -len(suffix)]
    # seed-treatment is sometimes split or omitted; just return as-is.
    return catalog_slug
 def iso_from_http_date(http_date: str | None) -> str | None:
    """RFC1123 -> ISO 8601 UTC. Returns None if unparseable."""
    if not http_date:
        return None
    try:
        from email.utils import parsedate_to_datetime
        dt = parsedate_to_datetime(http_date)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    except Exception:  # noqa: BLE001
        return None
 # --------------------------------------------------------------------- catalog
 def walk_catalog(
    http: RateLimitedSession, build_id: str
 ) -> Iterable[BayerProduct]:
    """Yield ``BayerProduct`` stubs for every product across all classes.
    Stubs carry only catalog-level info (slug, URL, class). The detail
    fetch (EPA reg, ingredients, PDFs) happens later via
    :func:`fetch_product_detail`.
    """
    for ptype in PRODUCT_TYPES:
        product_class = PRODUCT_CLASS[ptype]
        page = 1
        seen = 0
        while True:
            url = (
                f"{BASE}/_next/data/{build_id}/crop-protection/catalog.json"
                f"?productType={ptype}&p={page}"
            )
            r = http.get(url)
            if r.status_code != 200:
                log.warning("catalog %s p=%d -> HTTP %d, stopping class",
                            ptype, page, r.status_code)
                break
            data = r.json().get("pageProps", {})
            products = data.get("serverProducts") or []
            total = data.get("total") or 0
            if not products:
                break
            for p in products:
                slug = p.get("seoSlug") or ""
                product_url = p.get("productURL") or ""
                if not slug or not product_url:
                    continue
                yield BayerProduct(
                    slug=safe_slug(slug, product_class),
                    catalog_slug=slug,
                    product_url_path=product_url,
                    product_class=product_class,
                )
            seen += len(products)
            if seen >= total:
                break
            page += 1
 # --------------------------------------------------------------------- detail
 def fetch_product_detail(
    http: RateLimitedSession, prod: BayerProduct
 ) -> BayerProduct:
    """Populate EPA reg, active ingredients, and the full PDF list on
    a catalog stub by fetching its product page __NEXT_DATA__."""
    page_url = BASE + prod.product_url_path
    prod.source_page_url = page_url
    r = http.get(page_url)
    r.raise_for_status()
    data = parse_next_data(r.text)
    pp = (data.get("props") or {}).get("pageProps") or {}
    pd = pp.get("productDetails") or {}
    prod.product_name = pd.get("productLabel") or pd.get("productName") or prod.slug
    prod.epa_reg_no = normalize_epa_reg(pd.get("registrationNumber"))
    # Bayer's product page exposes ingredient names only — no CAS or percent.
    # Conform to the canonical schema by emitting objects with name set and
    # the other fields null; downstream consumers can hydrate from EPA PPLS.
    prod.active_ingredients = [
        {"name": a.get("ingredient"), "cas": None, "percent": None}
        for a in (pd.get("activeIngredients") or [])
        if a.get("ingredient")
    ]
    # Primary label: prefer downloadLabelUrl, then importantDocuments.
    important = (pp.get("importantDocuments") or {}).get("labelData") or []
    additional = (pp.get("additionalDownloads") or {}).get("labelData") or []
    download_url = pp.get("downloadLabelUrl")
    label_url: str | None = None
    if download_url and looks_like_pdf(download_url):
        label_url = download_url
    else:
        # First entry titled "Label" or simply the first PDF.
        for d in important:
            t = (d.get("title") or "").lower()
            u = d.get("url") or ""
            if not looks_like_pdf(u):
                continue
            if "label" in t and "msds" not in t and "sds" not in t:
                label_url = u
                break
        if not label_url:
            for d in important + additional:
                u = d.get("url") or ""
                if looks_like_pdf(u):
                    label_url = u
                    break
    prod.label_url = label_url
    if label_url:
        # Last URL segment is the Scene7 asset id (e.g. "Warrant_2025pdf").
        prod.label_filename = label_url.rsplit("/", 1)[-1]
    # Collect ALL other PDFs as supplementals (label/MSDS/24c/2EE/bulletin
    # /etc.). The kind tag is best-effort; the chunker can refine later.
    supplementals: list[SupplementalDoc] = []
    seen_urls: set[str] = set()
    if label_url:
        seen_urls.add(label_url)
    for d in important + additional:
        u = d.get("url") or ""
        t = d.get("title") or ""
        if not u or u in seen_urls:
            continue
        if not looks_like_pdf(u):
            continue
        seen_urls.add(u)
        supplementals.append(SupplementalDoc(
            kind=classify_supplemental(t, u),
            title=t,
            url=u,
        ))
    prod.supplemental_pdfs = supplementals
    return prod
 def looks_like_pdf(url: str) -> bool:
    """True if the URL is one of Bayer's PDF endpoints.
    Bayer serves PDFs via Adobe Scene7 with the literal ``pdf`` (no
    dot) appended to the asset ID, plus some assets on cs-contentapi
    with a real ``.pdf`` extension.
    """
    # ``endswith("pdf")`` covers both forms: the Scene7 asset id with a
    # bare ``pdf`` suffix and a conventional ``.pdf`` extension.
    return url.lower().endswith("pdf")
 # --------------------------------------------------------------------- PDF
 def head_last_modified(http: RateLimitedSession, url: str) -> str | None:
    """Resolve Last-Modified for a PDF URL. Returns ISO 8601 or None."""
    try:
        r = http.head(url)
    except requests.RequestException as exc:
        log.warning("HEAD failed for %s: %s", url, exc)
        return None
    if r.status_code != 200:
        log.warning("HEAD %s -> HTTP %d", url, r.status_code)
        return None
    return iso_from_http_date(r.headers.get("Last-Modified"))
 def fetch_pdf_text(http: RateLimitedSession, url: str) -> tuple[str, int, bool]:
    """Download a PDF and return ``(text, page_count, has_text_layer)``.
    Concatenates all pages, normalizes whitespace, and collapses runs
    of blank lines so the resulting markdown diffs cleanly. ``has_text_layer``
    is False for scanned PDFs whose pypdf extract produced no text.
    """
    r = http.get(url)
    r.raise_for_status()
    if "pdf" not in (r.headers.get("Content-Type") or "").lower():
        log.warning("expected PDF Content-Type at %s, got %s",
                    url, r.headers.get("Content-Type"))
    reader = PdfReader(io.BytesIO(r.content))
    page_count = len(reader.pages)
    chunks: list[str] = []
    for page in reader.pages:
        try:
            text = page.extract_text() or ""
        except Exception as exc:  # noqa: BLE001
            log.warning("pypdf extract_text failed on a page of %s: %s",
                        url, exc)
            text = ""
        chunks.append(text)
    raw = "\n\n".join(chunks)
    normalized = normalize_text(raw)
    has_text_layer = bool(normalized.strip())
    return normalized, page_count, has_text_layer
 def normalize_text(s: str) -> str:
    # Replace NBSPs (which pypdf often leaves behind) with plain spaces,
    # strip trailing spaces/tabs per line, and collapse runs of blank
    # lines down to a single blank line.
    s = s.replace("\u00a0", " ")
    s = re.sub(r"[ \t]+\n", "\n", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip() + "\n"
 # --------------------------------------------------------------------- write
 def write_product(prod: BayerProduct, body_md: str) -> None:
    """Write the canonical sidecar + markdown body. See scrape/README.md
    for the schema."""
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.slug}.md"
    json_path = CORPUS_DIR / f"{prod.slug}.json"
    # Lightweight markdown frontmatter for human eyeballing — canonical
    # metadata lives in the sidecar.
    title = prod.product_name or prod.slug
    ai_summary = ", ".join(a["name"] for a in prod.active_ingredients if a.get("name")) or "(unknown)"
    header = (
        f"# {title}\n\n"
        f"- **Product class:** {prod.product_class}\n"
        f"- **EPA Reg No:** {prod.epa_reg_no or '(unknown)'}\n"
        f"- **Active ingredients:** {ai_summary}\n"
        f"- **Source:** {prod.source_page_url}\n"
        f"- **Label PDF:** {prod.label_url or '(none on page)'}\n\n"
        "---\n\n"
    )
    md_path.write_text(header + body_md, encoding="utf-8")
    sidecar = {
        "source": "bayer",
        "source_key": prod.slug,
        "epa_reg_no": prod.epa_reg_no,
        "product_name": prod.product_name,
        "product_class": prod.product_class,
        "registrant": None,
        "active_ingredients": prod.active_ingredients,
        "signal_word": None,
        "label": {
            "url": prod.label_url,
            "filename": prod.label_filename,
            "accepted_date": None,
            "last_modified": prod.label_last_modified,
            "page_count": prod.label_page_count,
            "text_layer": prod.label_text_layer,
        },
        "supplemental_documents": [
            {
                "kind": s.kind,
                "title": s.title,
                "url": s.url,
                "last_modified": s.last_modified,
            }
            for s in prod.supplemental_pdfs
        ],
        "source_urls": {
            "product_page": prod.source_page_url,
            "label_api": None,
            "label_index": None,
        },
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
 # --------------------------------------------------------------------- pipeline
 def process_product(
    http: RateLimitedSession,
    prod: BayerProduct,
    *,
    force: bool,
 ) -> str:
    """Fetch detail + PDF and write to disk. Returns a status string
    suitable for logging: ``written``, ``skipped``, ``no-pdf``,
    ``failed``."""
    md_path = CORPUS_DIR / f"{prod.slug}.md"
    if md_path.exists() and not force:
        return "skipped"
    try:
        fetch_product_detail(http, prod)
    except Exception as exc:  # noqa: BLE001
        log.error("detail fetch failed for %s: %s", prod.slug, exc)
        return "failed"
    # Resolve Last-Modified for label + supplementals (HEAD only, cheap).
    if prod.label_url:
        prod.label_last_modified = head_last_modified(http, prod.label_url)
    for s in prod.supplemental_pdfs:
        s.last_modified = head_last_modified(http, s.url)
    if not prod.label_url:
        # Some Bayer products have no public label PDF (e.g. product was
        # discontinued or the page only carries a Product Bulletin). We
        # still record the metadata sidecar so the catalog is complete,
        # and write a stub body so the .md file count matches the catalog.
        log.info("%s — no label PDF; writing metadata only", prod.slug)
        prod.label_text_layer = False
        write_product(prod, "_(No label PDF was found on the product page.)_\n")
        return "no-pdf"
    try:
        body, page_count, text_layer = fetch_pdf_text(http, prod.label_url)
    except Exception as exc:  # noqa: BLE001
        log.error("PDF fetch/extract failed for %s (%s): %s",
                  prod.slug, prod.label_url, exc)
        return "failed"
    prod.label_page_count = page_count
    prod.label_text_layer = text_layer
    if not body.strip():
        log.warning("%s — extracted PDF was empty (scanned?)", prod.slug)
        body = "[SCANNED PDF — OCR REQUIRED]\n"
    write_product(prod, body)
    return "written"
 def run(
    *,
    limit: int | None,
    force: bool,
    only_product: str | None,
    only_class: str | None,
 ) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    build_id = fetch_build_id(http)
    products: list[BayerProduct] = []
    for prod in walk_catalog(http, build_id):
        if only_class and prod.product_class != only_class:
            continue
        if only_product and prod.slug != only_product and prod.catalog_slug != only_product:
            continue
        products.append(prod)
    if only_product and not products:
        log.error("no product matched --product=%s", only_product)
        return 2
    log.info("catalog yielded %d candidate product(s)", len(products))
    counts = {"written": 0, "skipped": 0, "no-pdf": 0, "failed": 0}
    processed = 0
    for prod in products:
        if limit is not None and processed >= limit:
            break
        processed += 1
        status = process_product(http, prod, force=force)
        counts[status] = counts.get(status, 0) + 1
        log.info(
            "[%d/%s] %s %s | class=%s epa=%s ai=%s label=%s",
            processed, str(limit) if limit else "all",
            prod.slug, status,
            prod.product_class,
            prod.epa_reg_no or "-",
            ",".join(a["name"] for a in prod.active_ingredients if a.get("name")) or "-",
            prod.label_url or "-",
        )
    log.info(
        "done: processed=%d written=%d skipped=%d no-pdf=%d failed=%d",
        processed,
        counts["written"], counts["skipped"], counts["no-pdf"], counts["failed"],
    )
    return 0 if counts["failed"] == 0 else 1
 # --------------------------------------------------------------------- CLI
 def _build_argparser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        prog="scrape.sources.bayer",
        description="Scrape Bayer Crop Science US product labels.",
    )
    p.add_argument(
        "--limit", type=int, default=None,
        help="Stop after processing N products (default: all).",
    )
    p.add_argument(
        "--force", action="store_true",
        help="Re-download even if the markdown file already exists.",
    )
    p.add_argument(
        "--product", default=None,
        help="Process a single product by slug (e.g. 'warrant' or "
             "'warrant-herbicide').",
    )
    p.add_argument(
        "--class", dest="product_class", default=None,
        choices=sorted(set(PRODUCT_CLASS.values())),
        help="Limit to one product class.",
    )
    p.add_argument(
        "--log-level", default=os.environ.get("LOG_LEVEL", "INFO"),
        help="Python logging level (default INFO).",
    )
    return p
 def main(argv: list[str] | None = None) -> int:
    args = _build_argparser().parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        stream=sys.stderr,
    )
    return run(
        limit=args.limit,
        force=args.force,
        only_product=args.product,
        only_class=args.product_class,
    )
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,599 @@
 """EPA PPLS (Pesticide Product Label System) scraper.
 Enumeration strategy
 ====================
 The PPLS Oracle APEX portal (ordspub.epa.gov/ords/pesticides/f?p=PPLS:1)
 is session-stateful and hostile to enumeration, so we use a three-phase
 approach that bypasses APEX entirely:
 1. **List products** via the public PPIS bulk download
   ``https://www3.epa.gov/pesticides/PPISdata/product.zip`` — a 107-char
   fixed-width flat file (``product.txt``, ~102K active Section 3
   registrations, refreshed every Tuesday). Gives us the universe of
   EPA Reg Nos (company-product), plus the product name.
 2. **Hydrate per product** via the PPLS REST data service at
   ``https://ordspub.epa.gov/ords/pesticides/cswu/ppls/{regno}`` —
   returns rich JSON: registrant, active ingredients (with CAS + %),
   formulations, status, signal word, AND a ``pdffiles`` array
   listing every stamped label PDF EPA has accepted for the product.
   The most recent entry gives us the canonical PDF filename
   (``{company6}-{product5}-{YYYYMMDD}.pdf``), solving the
   stamped-date-suffix problem without having to guess.
 3. **Fetch label PDF** from
   ``https://www3.epa.gov/pesticides/chem_search/ppls/{filename}``
   and extract text with pypdf. Many EPA labels are scans with no
   text layer — those are flagged ``text_layer: false`` and the .md
   body is a ``[SCANNED PDF — OCR REQUIRED]`` placeholder. OCR is
   deferred to Phase 2.
 Paths rejected and why
 ----------------------
 - ``/ords/pesticides/ppls/{reg}`` (no ``/cswu/`` prefix): returns the
  APEX HTML splash, not JSON. The undocumented ``/cswu/`` prefix is
  the actual ORDS REST handler.
 - Scraping the APEX UI: session-stateful, fragile, blocked.
 - data.gov mirror: redirects to the same APEX page, no extract.
 - NPIRS (Purdue): subscription-walled; PPIS is the same authoritative
  feed anyway.
 Required sidecar fields (per task spec): ``source``, ``epa_reg_no``,
 ``label_pdf_url``, ``fetched_at``. Everything else best-effort.
 """
 from __future__ import annotations
 import argparse
 import io
 import json
 import logging
 import re
 import sys
 import time
 import zipfile
 from dataclasses import dataclass, field
 from datetime import UTC, datetime
 from pathlib import Path
 from typing import Any, Iterable
 import httpx
 from pypdf import PdfReader
 from pypdf.errors import PdfReadError
 SCRAPER_VERSION = "0.1.0"
 USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
 PPIS_PRODUCT_ZIP_URL = "https://www3.epa.gov/pesticides/PPISdata/product.zip"
 PPLS_API_BASE = "https://ordspub.epa.gov/ords/pesticides/cswu/ppls"
 PPLS_PDF_BASE = "https://www3.epa.gov/pesticides/chem_search/ppls"
 PPLS_INDEX_URL_TEMPLATE = (
    "https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:{regno}"
 )
 REPO_ROOT = Path(__file__).resolve().parents[2]
 CORPUS_DIR = REPO_ROOT / "corpus" / "epa_ppls"
 REQUEST_DELAY_SECONDS = 1.1  # polite: ~1 req/sec
 HTTP_TIMEOUT = httpx.Timeout(60.0, connect=15.0)
 MAX_RETRIES = 4
 log = logging.getLogger("epa_ppls")
 # ---------------------------------------------------------------------------
 # HTTP helpers
 # ---------------------------------------------------------------------------
 def _client() -> httpx.Client:
    return httpx.Client(
        headers={"User-Agent": USER_AGENT, "Accept-Encoding": "gzip, deflate"},
        timeout=HTTP_TIMEOUT,
        follow_redirects=True,
    )
 def _get_with_retries(
    client: httpx.Client, url: str, *, expect_json: bool = False
 ) -> httpx.Response:
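    """GET with exponential backoff on 429/5xx responses and httpx errors.
    Note that ``httpx.HTTPError`` also covers the ``HTTPStatusError`` raised
    by ``raise_for_status()`` and the content-type check below, so other 4xx
    responses (e.g. 404) and non-JSON replies are retried as well.
    """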
    last_exc: Exception | None = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = client.get(url)
            if resp.status_code in (429, 500, 502, 503, 504):
                wait = min(2 ** attempt, 30)
                log.warning(
                    "HTTP %s on %s (attempt %d/%d) — sleeping %ds",
                    resp.status_code, url, attempt, MAX_RETRIES, wait,
                )
                time.sleep(wait)
                continue
            resp.raise_for_status()
            if expect_json:
                # ORDS sometimes returns text/html error pages with 200 — sanity
                ctype = resp.headers.get("content-type", "")
                if "json" not in ctype.lower():
                    raise httpx.HTTPError(
                        f"Expected JSON, got content-type={ctype!r} for {url}"
                    )
            return resp
        except (httpx.TransportError, httpx.HTTPError) as exc:
            last_exc = exc
            wait = min(2 ** attempt, 30)
            log.warning(
                "Network error on %s (attempt %d/%d): %s — sleeping %ds",
                url, attempt, MAX_RETRIES, exc, wait,
            )
            time.sleep(wait)
    raise RuntimeError(f"GET {url} failed after {MAX_RETRIES} attempts: {last_exc}")
 # ---------------------------------------------------------------------------
 # Enumeration: PPIS bulk product.zip
 # ---------------------------------------------------------------------------
@dataclass
 class PpisRow:
    """One row of PPIS product.txt — enough to hydrate via the API."""
    epa_reg_no: str
    product_name: str
    status_flag: str  # 'F' (federal/active) or 'T' (transferred)
    rup_flag: str    # 'Y' or 'N'
 def _parse_ppis_line(line: str) -> PpisRow | None:
    """Parse one 107-char PPIS product.txt row.
    Layout (1-indexed, inferred from inspection):
      1-6   company number (zero-padded, may contain trailing spaces)
      7-11  product number (zero-padded, may contain trailing spaces)
      33-102 product name (70 chars, space-padded)
      103   status flag ('F' or 'T')
      106   RUP flag ('Y' or 'N')
    """
    if len(line) < 106:
        return None
    company_raw = line[0:6].strip()
    product_raw = line[6:11].strip()
    if not company_raw or not product_raw:
        return None
    # Strip leading zeros for canonical EPA Reg No display
    try:
        company = str(int(company_raw))
        product = str(int(product_raw))
    except ValueError:
        return None
    name = line[32:102].strip()
    status_flag = line[102:103]
    rup_flag = line[105:106]  # always present: len(line) >= 106 checked above
    return PpisRow(
        epa_reg_no=f"{company}-{product}",
        product_name=name,
        status_flag=status_flag,
        rup_flag=rup_flag,
    )
 def fetch_ppis_index(client: httpx.Client) -> list[PpisRow]:
    """Download PPIS product.zip and parse into PpisRow list."""
    log.info("Fetching PPIS index from %s", PPIS_PRODUCT_ZIP_URL)
    resp = _get_with_retries(client, PPIS_PRODUCT_ZIP_URL)
    rows: list[PpisRow] = []
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        with zf.open("product.txt") as fh:
            for raw in fh:
                line = raw.decode("latin-1").rstrip("\n").rstrip("\r")
                row = _parse_ppis_line(line)
                if row is not None:
                    rows.append(row)
    log.info("Parsed %d rows from PPIS index", len(rows))
    return rows
 # ---------------------------------------------------------------------------
 # Hydration: PPLS JSON API
 # ---------------------------------------------------------------------------
 def _zero_pad_regno(regno: str) -> str:
    """524-475 -> 000524-00475 (canonical filename form). Distributor suffix
    (524-475-12345) -> 000524-00475-12345."""
    parts = regno.split("-")
    if len(parts) == 2:
        c, p = parts
        return f"{int(c):06d}-{int(p):05d}"
    if len(parts) == 3:
        c, p, d = parts
        return f"{int(c):06d}-{int(p):05d}-{int(d):05d}"
    return regno
 _MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
    "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
    "december": 12,
 }
 def _parse_label_date(text: str | None) -> str | None:
    """'October 18, 2016' -> '2016-10-18'. Returns None on any parse issue."""
    if not text:
        return None
    m = re.match(r"^([A-Za-z]+)\s+(\d{1,2}),\s+(\d{4})$", text.strip())
    if not m:
        return None
    month = _MONTHS.get(m.group(1).lower())
    if month is None:
        return None
    try:
        return f"{int(m.group(3)):04d}-{month:02d}-{int(m.group(2)):02d}"
    except ValueError:
        return None
 def _http_date_to_iso(http_date: str | None) -> str | None:
    """RFC1123 'Wed, 19 Oct 2016 17:48:09 GMT' -> ISO 8601 UTC.
    Returns None on unparseable input. Matches the canonical schema's
    requirement that all timestamps be ISO 8601.
    """
    if not http_date:
        return None
    try:
        from email.utils import parsedate_to_datetime
        dt = parsedate_to_datetime(http_date)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=UTC)
        return dt.astimezone(UTC).isoformat()
    except Exception:  # noqa: BLE001
        return None
@dataclass
 class ProductRecord:
    epa_reg_no: str
    product_name: str | None
    registrant: str | None
    registrant_company_number: str | None
    active_ingredients: list[dict[str, Any]]
    label_pdf_url: str | None
    label_pdf_filename: str | None
    label_accepted_date: str | None
    registration_status: str | None
    signal_word: str | None
    raw_api_item: dict[str, Any] | None = field(repr=False, default=None)
 def fetch_product_record(client: httpx.Client, regno: str) -> ProductRecord:
    """Call the PPLS API for one EPA Reg No; build a ProductRecord."""
    url = f"{PPLS_API_BASE}/{regno}"
    resp = _get_with_retries(client, url, expect_json=True)
    payload = resp.json()
    items = payload.get("items") or []
    if not items:
        return ProductRecord(
            epa_reg_no=regno,
            product_name=None,
            registrant=None,
            registrant_company_number=None,
            active_ingredients=[],
            label_pdf_url=None,
            label_pdf_filename=None,
            label_accepted_date=None,
            registration_status=None,
            signal_word=None,
            raw_api_item=None,
        )
    item = items[0]
    company_info = (item.get("companyinfo") or [{}])[0]
    registrant = company_info.get("name")
    company_num = regno.split("-")[0]
    ingredients = []
    for ai in item.get("active_ingredients") or []:
        ingredients.append({
            "name": ai.get("active_ing"),
            "cas": ai.get("cas_number"),
            "percent": ai.get("active_ing_percent"),
            "pc_code": ai.get("pc_code"),
        })
    pdffiles = item.get("pdffiles") or []
    # Pick the most recent PDF by accepted date. The API returns entries in
    # date-descending order, but we select with max() rather than rely on that.
    pdf_entry: dict[str, Any] | None = None
    if pdffiles:
        def _date_key(e: dict[str, Any]) -> str:
            d = _parse_label_date(e.get("pdffile_accepted_date"))
            return d or "0000-00-00"
        pdf_entry = max(pdffiles, key=_date_key)
    pdf_filename = pdf_entry.get("pdffile") if pdf_entry else None
    pdf_url = f"{PPLS_PDF_BASE}/{pdf_filename}" if pdf_filename else None
    accepted = _parse_label_date(pdf_entry.get("pdffile_accepted_date")) if pdf_entry else None
    return ProductRecord(
        epa_reg_no=regno,
        product_name=item.get("productname"),
        registrant=registrant,
        registrant_company_number=company_num,
        active_ingredients=ingredients,
        label_pdf_url=pdf_url,
        label_pdf_filename=pdf_filename,
        label_accepted_date=accepted,
        registration_status=item.get("product_status"),
        signal_word=item.get("signal_word"),
        raw_api_item=item,
    )
 # ---------------------------------------------------------------------------
 # PDF download + text extraction
 # ---------------------------------------------------------------------------
 def download_pdf(client: httpx.Client, url: str) -> tuple[bytes, str | None]:
    """Download a label PDF; return (bytes, Last-Modified header or None)."""
    resp = _get_with_retries(client, url)
    last_modified = resp.headers.get("last-modified")
    return resp.content, last_modified
 def extract_pdf_text(pdf_bytes: bytes) -> tuple[str, bool]:
    """Extract text from a PDF.
    Returns (text, has_text_layer). Concatenates pages, normalizes whitespace.
    If no extractable text is found, returns ("", False).
    """
    try:
        reader = PdfReader(io.BytesIO(pdf_bytes))
    except PdfReadError as exc:
        log.warning("pypdf failed to read PDF: %s", exc)
        return "", False
    chunks: list[str] = []
    for i, page in enumerate(reader.pages):
        try:
            page_text = page.extract_text() or ""
        except Exception as exc:  # pypdf can throw on malformed pages
            log.warning("pypdf extract_text failed on page %d: %s", i, exc)
            page_text = ""
        page_text = re.sub(r"[ \t]+", " ", page_text)
        page_text = re.sub(r"\n{3,}", "\n\n", page_text).strip()
        if page_text:
            chunks.append(page_text)
    combined = "\n\n".join(chunks).strip()
    return combined, bool(combined)
 # ---------------------------------------------------------------------------
 # Per-product processing
 # ---------------------------------------------------------------------------
 def _md_path(regno: str) -> Path:
    return CORPUS_DIR / f"{regno}.md"
 def _json_path(regno: str) -> Path:
    return CORPUS_DIR / f"{regno}.json"
 def process_one(
    client: httpx.Client,
    regno: str,
    *,
    force: bool = False,
 ) -> str:
    """Fetch + extract one product. Returns 'skipped'|'wrote'|'no-pdf'|'error'."""
    md_path = _md_path(regno)
    json_path = _json_path(regno)
    if not force and md_path.exists() and json_path.exists():
        log.info("[%s] skip (already on disk)", regno)
        return "skipped"
    try:
        record = fetch_product_record(client, regno)
    except Exception as exc:
        log.error("[%s] API fetch failed: %s", regno, exc)
        return "error"
    time.sleep(REQUEST_DELAY_SECONDS)
    def _build_sidecar(
        *,
        label_url: str | None,
        label_filename: str | None,
        label_last_modified_iso: str | None,
        page_count: int | None,
        text_layer: bool | None,
    ) -> dict[str, Any]:
        return {
            "source": "epa_ppls",
            "source_key": regno,
            "epa_reg_no": regno,
            "product_name": record.product_name,
            "product_class": None,  # EPA PPLS doesn't expose a clean class taxonomy
            "registrant": record.registrant,
            "active_ingredients": record.active_ingredients,
            "signal_word": record.signal_word,
            "label": {
                "url": label_url,
                "filename": label_filename,
                "accepted_date": record.label_accepted_date,
                "last_modified": label_last_modified_iso,
                "page_count": page_count,
                "text_layer": text_layer,
            },
            "supplemental_documents": [],  # EPA PPLS sidecar omits supplementals; query API per regno
            "source_urls": {
                "product_page": None,
                "label_api": f"{PPLS_API_BASE}/{regno}",
                "label_index": PPLS_INDEX_URL_TEMPLATE.format(regno=regno),
            },
            # EPA-specific extras (kept out of the strict canonical schema but
            # useful for joins back to EPA's data model)
            "registration_status": record.registration_status,
            "registrant_company_number": record.registrant_company_number,
            "fetched_at": datetime.now(UTC).isoformat(),
            "scraper_version": SCRAPER_VERSION,
        }
    if not record.label_pdf_url:
        log.warning("[%s] no label PDF available — writing sidecar only", regno)
        md_path.write_text(
            f"# {record.product_name or regno}\n\n"
            f"EPA Reg No: {regno}\n\n"
            "[NO LABEL PDF AVAILABLE FROM EPA PPLS]\n",
            encoding="utf-8",
        )
        sidecar = _build_sidecar(
            label_url=None, label_filename=None,
            label_last_modified_iso=None,
            page_count=None, text_layer=False,
        )
        json_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
        return "no-pdf"
    try:
        pdf_bytes, last_modified_raw = download_pdf(client, record.label_pdf_url)
    except Exception as exc:
        log.error("[%s] PDF download failed: %s", regno, exc)
        return "error"
    time.sleep(REQUEST_DELAY_SECONDS)
    text, has_text = extract_pdf_text(pdf_bytes)
    last_modified_iso = _http_date_to_iso(last_modified_raw)
    page_count: int | None = None
    try:
        page_count = len(PdfReader(io.BytesIO(pdf_bytes)).pages)
    except Exception:
        pass
    sidecar = _build_sidecar(
        label_url=record.label_pdf_url,
        label_filename=record.label_pdf_filename,
        label_last_modified_iso=last_modified_iso,
        page_count=page_count,
        text_layer=has_text,
    )
    header_lines = [f"# {record.product_name or regno}", ""]
    header_lines.append(f"- EPA Reg No: **{regno}**")
    if record.registrant:
        header_lines.append(f"- Registrant: {record.registrant}")
    if record.signal_word:
        header_lines.append(f"- Signal word: {record.signal_word}")
    if record.active_ingredients:
        ai_strs = [
            f"{ai.get('name')} ({ai.get('percent')}%)"
            for ai in record.active_ingredients
            if ai.get("name")
        ]
        if ai_strs:
            header_lines.append("- Active ingredients: " + "; ".join(ai_strs))
    if record.label_accepted_date:
        header_lines.append(f"- Label accepted: {record.label_accepted_date}")
    header_lines.append(f"- Source PDF: {record.label_pdf_url}")
    header_lines.append("")
    header_lines.append("---")
    header_lines.append("")
    if has_text:
        body = text
    else:
        body = "[SCANNED PDF — OCR REQUIRED]\n\nThis label has no extractable text layer."
        log.info("[%s] PDF has no text layer (scanned)", regno)
    md_content = "\n".join(header_lines) + body + "\n"
    md_path.write_text(md_content, encoding="utf-8")
    json_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    log.info(
        "[%s] wrote (text_layer=%s, pages=%s, name=%r)",
        regno, has_text, page_count, record.product_name,
    )
    return "wrote"
 # ---------------------------------------------------------------------------
 # CLI
 # ---------------------------------------------------------------------------
 def _iter_regnos(
    args: argparse.Namespace,
    client: httpx.Client,
 ) -> Iterable[str]:
    """Yield reg nos to process based on CLI args."""
    if args.reg_no:
        for r in args.reg_no:
            yield r
        return
    if args.seed_file:
        with open(args.seed_file, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.strip()
                if not line or line.startswith("#"):
                    continue
                yield line
        return
    # Default: enumerate via PPIS bulk index
    rows = fetch_ppis_index(client)
    count = 0
    for row in rows:
        # Skip transferred-out (status_flag 'T') entries by default; their
        # registration has moved to another company-product pairing.
        if row.status_flag == "T":
            continue
        yield row.epa_reg_no
        count += 1
        if args.limit and count >= args.limit:
            return
 def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(
        prog="python -m scrape.sources.epa_ppls",
        description="Scrape EPA PPLS pesticide labels into corpus/epa_ppls/.",
    )
    parser.add_argument(
        "--limit", type=int, default=None,
        help="Max products to process when enumerating from PPIS.",
    )
    parser.add_argument(
        "--force", action="store_true",
        help="Re-fetch even if .md/.json already exist.",
    )
    parser.add_argument(
        "--reg-no", action="append", metavar="REGNO",
        help="Process specific EPA Reg No (e.g. 524-475). Repeatable.",
    )
    parser.add_argument(
        "--seed-file", metavar="PATH",
        help="Text file with one EPA Reg No per line (# comments OK).",
    )
    parser.add_argument(
        "--log-level", default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    args = parser.parse_args(argv)
    logging.basicConfig(
        stream=sys.stderr,
        level=getattr(logging, args.log_level),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    summary = {"wrote": 0, "skipped": 0, "no-pdf": 0, "error": 0}
    with _client() as client:
        for regno in _iter_regnos(args, client):
            result = process_one(client, regno, force=args.force)
            summary[result] = summary.get(result, 0) + 1
    log.info("done: %s", summary)
    print(json.dumps(summary), file=sys.stderr)
    return 0 if summary["error"] == 0 else 1
 if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,20 @@
 [
  {
    "id": "bayer",
    "title": "Bayer Crop Science US — Product Labels",
    "type": "manufacturer",
    "homepage": "https://www.cropscience.bayer.us",
    "scraper": "scrape.sources.bayer",
    "scraper_version": "0.1.0",
    "license_note": "robots.txt explicitly permits scraping for AI retrieval-augmented generation (verified 2026-05)"
  },
  {
    "id": "epa_ppls",
    "title": "EPA Pesticide Product Label System",
    "type": "regulator",
    "homepage": "https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:1",
    "scraper": "scrape.sources.epa_ppls",
    "scraper_version": "0.1.0",
    "license_note": "US federal government — public domain (no ToS restriction)"
  }
 ]