scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema

Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.

Sources shipped:
  - bayer       — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
  - epa_ppls    — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
  - active_ingredients always [{name, cas, percent}]
  - label/* nested (url, filename, accepted_date, last_modified,
    page_count, text_layer)
  - all timestamps normalized to ISO 8601 UTC
  - signal_word surfaced (operationally critical for the farmer advisor)
  - source_key and epa_reg_no are separate fields: the per-source
    primary key and the cross-source join key

bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.

PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2     # works
  python -m scrape.runner --source bayer --limit 3    # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:27:07 -04:00
parent 3ca96a3716
commit e9250de8e7
9 changed files with 1531 additions and 45 deletions
+1
@@ -29,3 +29,4 @@ var/
.vscode/
.idea/
*.swp
.claude/
+13
@@ -9,6 +9,19 @@ any LLM client (Claude Desktop, Claude Code, Cursor, Copilot) can
call to answer questions against the docs, surface what changed
recently, and flag likely inconsistencies.
> **Domain note for ppls-docs.** This template was originally written
> for versioned software product documentation (Zoomin bundles, Hugo
> sites, etc.). For ppls-docs the domain is pesticide product labels —
> the "bundle" abstraction has been replaced with "source"
> (manufacturer or regulator), and "page" with "product label". The
> canonical on-disk schema lives in [`scrape/README.md`](scrape/README.md),
> not in this document. References below to `bundles.json`, `bundle_id`,
> `--bundle`, `version`, and `platform` are template artifacts — read
> them as `sources.json`, `source_id`, `--source`, and (mostly)
> not-applicable. Phase 1 (scraper) is the most heavily adapted; later
> phases (chunking, embeddings, retrieval, eval) apply largely as
> written.
---
## What you're building
+1
@@ -10,6 +10,7 @@ ollama>=0.4.0 # if using Ollama-hosted embedder; swap if not
# Scraping (Phase 1; adjust per product)
beautifulsoup4>=4.12
requests>=2.31
pypdf>=4.0 # PDF -> text for label extraction
# playwright>=1.40 # uncomment if you need headless browser fallback
# Evaluation
+114 -45
@@ -1,59 +1,128 @@
# scrape/

Removed (template README):

Product-specific. **You implement this for each product.** The
template gives you the contract; the extraction logic depends on
the upstream doc portal.

See `PLAN.md` Phase 1 for the corpus layout the rest of the pipeline
expects.

## What you write

At minimum, two scripts:

### `scrape/bundles.py`

Discovers the upstream portal's bundle catalog and writes
`bundles.json` at the repo root. One entry per bundle (versioned doc
set) with the schema in PLAN.md.

```bash
python -m scrape.bundles
```

### `scrape/runner.py`

Scrapes the pages of each bundle (or a single bundle with `--bundle
<slug>`). Writes:

- `corpus/<bundle_id>/<page_id>.md`: extracted markdown body
- `corpus/<bundle_id>/<page_id>.json`: per-page metadata sidecar

```bash
python -m scrape.runner --all --force --concurrency 6
python -m scrape.runner --bundle Admin.VC.HTML.10.9
```

## Tips

- **Sniff before you scrape.** Almost every modern doc portal is an
  SPA that calls a backend API. Open the browser's Network tab,
  click around, find the underlying JSON. Scraping the API is 10×
  cheaper and 100× more reliable than scraping the rendered HTML.
- **Idempotent re-scrapes.** Without `--force`, the runner should
  skip pages already on disk so a resume doesn't have to re-fetch
  everything. With `--force`, re-fetch every page; that's the
  weekly cron mode that catches edits.
- **Respect the portal.** Backoff on 429s. Set a recognizable
  user-agent so the portal owner can identify you if they want to.
- **Whitespace normalize.** Markdown that round-trips through HTML
  often has extra blank lines. Normalize to a single blank between
  paragraphs so diffs are clean (the changelog summary and digest
  tools care about line counts).

## What's already reusable

`scrape/changelog.py` is fully product-agnostic and ready to use
as-is. It walks `git diff --name-status` output to produce a
structured summary, and walks `git log` for the digest history
(Phase 13).

Added (ppls-docs README):

Per-source scrapers for pesticide / herbicide product labels. Each
module under `scrape/sources/` pulls a single upstream catalog and
writes its results into `corpus/<source_id>/` using the canonical
sidecar schema documented below.

## Architecture

```
sources.json             - registry of active sources
scrape/runner.py         - thin dispatcher (--source <id> | --all)
scrape/sources/<id>.py   - one source per file
corpus/<id>/<key>.md     - extracted label text (markdown)
corpus/<id>/<key>.json   - canonical metadata sidecar
```

`<key>` is the per-source primary key: a slug for manufacturer
sources (e.g. `warrant`, `roundup-powermax-3`) or an EPA Reg No
for regulator sources (e.g. `524-475`). The sidecar's
`epa_reg_no` field is the cross-source join key that lets the
corpus consumer reconcile records from different sources for the
same product.

## CLI

```bash
# Run a single source
python -m scrape.runner --source bayer --limit 20
python -m scrape.runner --source epa_ppls --reg-no 524-475

# Run every source registered in sources.json
python -m scrape.runner --all --limit 50

# Per-source modules also run standalone
python -m scrape.sources.bayer --class herbicide --limit 5
python -m scrape.sources.epa_ppls --seed-file seeds.txt
```

Every scraper is **idempotent** by default: re-running with the
same arguments skips records already on disk. Use `--force` to
re-fetch.

## Canonical sidecar schema

Every `corpus/<source>/<key>.json` conforms to this shape. Fields
that don't apply to a given source are `null` (not omitted) so the
JSON is uniform across sources.

```json
{
  "source": "bayer",
  "source_key": "warrant",
  "epa_reg_no": "524-591",
  "product_name": "Warrant Herbicide",
  "product_class": "herbicide",
  "registrant": null,
  "active_ingredients": [
    {"name": "acetochlor", "cas": "34256-82-1", "percent": 35.4}
  ],
  "signal_word": "Caution",
  "label": {
    "url": "https://cs-assets.bayer.com/is/content/bayer/Warrant_2025pdf",
    "filename": "Warrant_2025pdf",
    "accepted_date": "2024-01-15",
    "last_modified": "2026-05-15T20:21:54+00:00",
    "page_count": 24,
    "text_layer": true
  },
  "supplemental_documents": [
    {"kind": "2EE", "title": "Warrant tank-mix 2EE - cotton",
     "url": "https://cs-assets.bayer.com/.../...pdf",
     "last_modified": "2026-04-01T12:00:00+00:00"}
  ],
  "source_urls": {
    "product_page": "https://www.cropscience.bayer.us/products/herbicides/warrant/label-msds",
    "label_api": null,
    "label_index": null
  },
  "fetched_at": "2026-05-23T22:05:29+00:00",
  "scraper_version": "0.1.0"
}
```
### Field reference
| Field | Type | Required | Notes |
|---|---|---|---|
| `source` | string | yes | Matches an `id` in `sources.json`. |
| `source_key` | string | yes | Per-source primary key. Filesystem-safe. |
| `epa_reg_no` | string \| null | best-effort | Canonical EPA registration (e.g. `524-591`, or `524-591-12345` with distributor suffix). The cross-source join key. |
| `product_name` | string \| null | yes | Display name. |
| `product_class` | string \| null | best-effort | One of `herbicide`, `fungicide`, `insecticide`, `seed-treatment`, `rodenticide`, `other`. EPA PPLS leaves this `null`; manufacturer sources usually know. |
| `registrant` | string \| null | best-effort | Required-ish for regulator sources, often `null` for MFR sources where redundant. |
| `active_ingredients` | array of objects | yes (may be empty) | `[{name, cas, percent}]`. `cas` and `percent` are `null` when the source doesn't expose them. |
| `signal_word` | string \| null | best-effort | `Danger`, `Warning`, `Caution`, or `null`. Operationally critical for the farmer advisor. |
| `label.url` | string \| null | yes | Direct URL of the current label PDF. |
| `label.filename` | string \| null | best-effort | Last URL segment, useful for diffing revisions. |
| `label.accepted_date` | ISO date \| null | best-effort | EPA-stamped acceptance date. MFR sources may not expose this. |
| `label.last_modified` | ISO 8601 datetime \| null | best-effort | From the PDF's HTTP `Last-Modified` header. Always normalized to ISO 8601 UTC. |
| `label.page_count` | int \| null | best-effort | After download. |
| `label.text_layer` | bool \| null | best-effort | `false` for scanned PDFs that need OCR. |
| `supplemental_documents` | array | yes (may be empty) | 24(c) labels, 2(ee) bulletins, MSDS/SDS, product bulletins. EPA PPLS leaves this empty (those are separate API calls). |
| `source_urls.product_page` | string \| null | best-effort | The HTML product page on the source site. |
| `source_urls.label_api` | string \| null | best-effort | The JSON API endpoint that returned this record (for traceability). |
| `source_urls.label_index` | string \| null | best-effort | The human-readable index/search URL. |
| `fetched_at` | ISO 8601 datetime | yes | When this sidecar was generated. |
| `scraper_version` | string | yes | Source module's `SCRAPER_VERSION` constant. |
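
The `label.last_modified` row above says the value comes from the PDF's HTTP `Last-Modified` header and is normalized to ISO 8601 UTC. A minimal standard-library sketch of that conversion follows; the helper name `to_iso_utc` is made up for illustration and is not part of the schema.

```python
from __future__ import annotations

from datetime import timezone
from email.utils import parsedate_to_datetime


def to_iso_utc(http_date: str | None) -> str | None:
    """Convert an RFC 1123 HTTP date to an ISO 8601 UTC string, or None."""
    if not http_date:
        return None
    try:
        dt = parsedate_to_datetime(http_date)
    except (TypeError, ValueError):
        # Unparseable header: record null rather than guessing.
        return None
    if dt.tzinfo is None:
        # A naive result is treated as UTC (an assumption of this sketch).
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


print(to_iso_utc("Fri, 15 May 2026 20:21:54 GMT"))  # 2026-05-15T20:21:54+00:00
```

Converting with `astimezone(timezone.utc)` means a header carrying a non-UTC offset still lands on the same instant, so values from different sources compare as plain strings.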
Sources may add their own extra fields beyond the canonical schema
(EPA's sidecars carry `registration_status` and
`registrant_company_number`, for instance). Consumers should ignore
unknown fields.
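
As a rough illustration of how a consumer might use `epa_reg_no` as the join key while ignoring source-specific extras, here is a small sketch. The function names and the `corpus/<source>/<key>.json` glob are assumptions based on the layout described in this README, not code from this commit.

```python
from __future__ import annotations

import json
from collections import defaultdict
from pathlib import Path

# Top-level keys of the canonical schema, copied from the field reference.
CANONICAL_KEYS = {
    "source", "source_key", "epa_reg_no", "product_name", "product_class",
    "registrant", "active_ingredients", "signal_word", "label",
    "supplemental_documents", "source_urls", "fetched_at", "scraper_version",
}


def load_canonical(path: Path) -> dict:
    """Read one sidecar, keeping canonical keys only (missing ones become None)."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {key: raw.get(key) for key in CANONICAL_KEYS}


def join_by_reg_no(corpus_dir: Path) -> dict[str, list[dict]]:
    """Group sidecars from every source under their shared EPA Reg No."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for path in sorted(corpus_dir.glob("*/*.json")):
        record = load_canonical(path)
        if record["epa_reg_no"]:  # best-effort field, may be null
            grouped[record["epa_reg_no"]].append(record)
    return dict(grouped)
```

Calling `join_by_reg_no(Path("corpus"))` would return one list per registration number, with the Bayer and EPA PPLS records for the same product side by side. Records whose `epa_reg_no` is `null` are skipped here; a real consumer may want to keep them under a separate bucket.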
## Adding a new source
1. Write `scrape/sources/<id>.py` exposing a `main(argv: list[str]) -> int`
that accepts at minimum `--limit N` and `--force`.
2. Conform to the canonical sidecar schema. Add source-specific
extras as additional top-level keys if they don't fit.
3. Add an entry to `sources.json` (`id`, `title`, `type`, `homepage`,
`scraper`, `scraper_version`, `license_note`).
4. Scrapers MUST be polite: rate-limit to ≤1 req/sec, set a real
User-Agent identifying the project, retry with backoff on 429/5xx,
and respect robots.txt unless an explicit carve-out exists (e.g.
Bayer's RAG allowlist).
5. Scrapers MUST be idempotent: skip records already on disk unless
`--force` is set.
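
The checklist above can be sketched as a skeleton module. Everything here is hypothetical: the source id `example`, the hard-coded `discover()` stub standing in for a real catalog walk, and the extra `out_dir` parameter (added only so the sketch can be exercised against a temporary directory). Rate limiting and retries are omitted because the stub makes no network calls.

```python
from __future__ import annotations

import argparse
import json
from datetime import datetime, timezone
from pathlib import Path

SCRAPER_VERSION = "0.1.0"
SOURCE_ID = "example"  # hypothetical source id
CORPUS_DIR = Path("corpus") / SOURCE_ID


def discover() -> list[dict]:
    """Stand-in for the real catalog walk; returns hard-coded stubs."""
    return [{"key": "demo-product", "name": "Demo Product", "body": "Label text.\n"}]


def write_record(stub: dict, out_dir: Path) -> None:
    """Write the markdown body and a canonical-schema sidecar for one record."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sidecar = {
        "source": SOURCE_ID,
        "source_key": stub["key"],
        "epa_reg_no": None,
        "product_name": stub["name"],
        "product_class": None,
        "registrant": None,
        "active_ingredients": [],
        "signal_word": None,
        "label": {"url": None, "filename": None, "accepted_date": None,
                  "last_modified": None, "page_count": None, "text_layer": None},
        "supplemental_documents": [],
        "source_urls": {"product_page": None, "label_api": None, "label_index": None},
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    (out_dir / f"{stub['key']}.md").write_text(stub["body"], encoding="utf-8")
    (out_dir / f"{stub['key']}.json").write_text(
        json.dumps(sidecar, indent=2) + "\n", encoding="utf-8"
    )


def main(argv: list[str] | None = None, out_dir: Path = CORPUS_DIR) -> int:
    parser = argparse.ArgumentParser(prog=f"scrape.sources.{SOURCE_ID}")
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--force", action="store_true")
    args = parser.parse_args(argv)

    written = 0
    for stub in discover()[: args.limit]:
        # Idempotent by default: skip records already on disk unless --force.
        if (out_dir / f"{stub['key']}.md").exists() and not args.force:
            continue
        write_record(stub, out_dir)
        written += 1
    print(f"written={written}")
    return 0
```

A real module would end with the usual `if __name__ == "__main__":` guard calling `sys.exit(main())`, and would replace `discover()` with a polite, rate-limited fetch.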
+87
@@ -0,0 +1,87 @@
"""Thin dispatcher that routes ``--source <id>`` to the right per-source
scraper module.

For ppls-docs the convention is **one source per scraper module** under
``scrape.sources.<id>``. Each module is independently runnable via
``python -m scrape.sources.<id>`` and accepts its own flags; this
runner is a convenience shim for CI / the weekly refresh workflow.

Examples:
    python -m scrape.runner --source bayer --limit 20
    python -m scrape.runner --source epa_ppls --limit 20
    python -m scrape.runner --all    # walk every source in sources.json

Any flag the runner itself does not recognize is passed through to the
source scraper, so:

    python -m scrape.runner --source bayer --force --product warrant

just dispatches to ``scrape.sources.bayer`` with ``--force --product
warrant`` as argv.
"""
from __future__ import annotations

import argparse
import importlib
import json
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]
SOURCES_JSON = REPO_ROOT / "sources.json"


def _load_sources() -> list[dict]:
    # A missing or unparseable registry is treated as "no sources".
    if not SOURCES_JSON.exists():
        return []
    try:
        return json.loads(SOURCES_JSON.read_text())
    except json.JSONDecodeError:
        return []


def _run_source(source_id: str, passthrough: list[str]) -> int:
    mod_name = f"scrape.sources.{source_id}"
    try:
        mod = importlib.import_module(mod_name)
    except ImportError as exc:
        print(f"runner: no source module {mod_name}: {exc}", file=sys.stderr)
        return 2
    main = getattr(mod, "main", None)
    if not callable(main):
        print(f"runner: {mod_name} has no main() entrypoint", file=sys.stderr)
        return 2
    return int(main(passthrough) or 0)


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="scrape.runner")
    parser.add_argument("--source", help="Source id (matches sources.json)")
    parser.add_argument("--all", action="store_true",
                        help="Run every source listed in sources.json")
    args, passthrough = parser.parse_known_args(argv)
    if not args.source and not args.all:
        parser.error("specify --source <id> or --all")

    sources = _load_sources()
    if args.all:
        ids = [s["id"] for s in sources if "id" in s]
        if not ids:
            print("runner: sources.json is empty or missing", file=sys.stderr)
            return 2
    else:
        # If the source isn't registered in sources.json yet, dispatch anyway
        # so the scraper can be exercised during initial development.
        ids = [args.source]

    # Exit codes are OR-ed together, so any non-zero result makes the run fail.
    rc = 0
    for sid in ids:
        rc |= _run_source(sid, passthrough)
    return rc


if __name__ == "__main__":
    sys.exit(main())
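
The passthrough behaviour of the runner rests on `argparse`'s `parse_known_args`, which returns unrecognized arguments instead of raising an error. The standalone sketch below mirrors the runner's two flags to show what ends up in the passthrough list; it is an illustration, not part of the commit.

```python
import argparse

parser = argparse.ArgumentParser(prog="scrape.runner")
parser.add_argument("--source")
parser.add_argument("--all", action="store_true")

# Flags the runner does not define are collected, in order, as leftovers.
args, passthrough = parser.parse_known_args(
    ["--source", "bayer", "--force", "--product", "warrant"]
)
print(args.source)   # bayer
print(passthrough)   # ['--force', '--product', 'warrant']
```

One consequence of this design is that the same leftover list is handed to every source under `--all`, so a flag only one source understands (such as `--reg-no`) would presumably be rejected by the others.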
+696
@@ -0,0 +1,696 @@
"""Bayer Crop Science US label scraper.

Pulls herbicide / fungicide / insecticide / seed-treatment product
metadata and label PDFs from https://www.cropscience.bayer.us, extracts
each PDF to markdown, and writes a metadata sidecar JSON per product.

Output:
    corpus/bayer/<slug>.md      extracted label text
    corpus/bayer/<slug>.json    metadata sidecar (canonical schema in
                                scrape/README.md)

The scraper resolves Bayer's rotating Next.js ``buildId`` from the
homepage at runtime, then walks the catalog JSON API for each product
class. It then extracts the label/MSDS/supplemental download URLs from
each product page's ``__NEXT_DATA__`` JSON island, which is cheaper and
more stable than scraping rendered HTML.

robots.txt for cropscience.bayer.us explicitly allows scraping for
"search engine indexing or artificial intelligence retrieval augmented
generation" use cases, which is what this corpus feeds.

CLI:
    python -m scrape.sources.bayer --limit 20
    python -m scrape.sources.bayer --limit 20 --force
    python -m scrape.sources.bayer --product warrant
    python -m scrape.sources.bayer --class herbicide --limit 5
"""
from __future__ import annotations

import argparse
import io
import json
import logging
import os
import random
import re
import sys
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Iterable

import requests
from pypdf import PdfReader

SCRAPER_VERSION = "0.1.0"
USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
BASE = "https://www.cropscience.bayer.us"

# Catalog product-type values used in the Next.js data API.
PRODUCT_TYPES = ("Herbicide", "Fungicide", "Insecticide", "Seed_Treatment")

# Map product-type filter -> the canonical "product_class" we record
# in the sidecar (matches the legacy URL segments).
PRODUCT_CLASS = {
    "Herbicide": "herbicide",
    "Fungicide": "fungicide",
    "Insecticide": "insecticide",
    "Seed_Treatment": "seed-treatment",
}

# Repo root: scrape/sources/bayer.py -> repo root is 3 parents up (parents[2]).
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_DIR = REPO_ROOT / "corpus" / "bayer"

# Politeness: target ~1 req/sec to Bayer. Every request goes through a
# small minimum-interval sleeper (RateLimitedSession._wait) that enforces
# this without asyncio.
REQ_INTERVAL_SEC = 1.0

log = logging.getLogger("scrape.bayer")


# --------------------------------------------------------------------- HTTP
class RateLimitedSession:
    """``requests.Session`` wrapper with sleep-based rate limiting and
    polite retries on 429/5xx."""

    def __init__(self, interval: float = REQ_INTERVAL_SEC) -> None:
        self.s = requests.Session()
        self.s.headers["User-Agent"] = USER_AGENT
        self.interval = interval
        self._last = 0.0

    def _wait(self) -> None:
        # Sleep just long enough to keep requests `interval` seconds apart.
        delta = time.monotonic() - self._last
        if delta < self.interval:
            time.sleep(self.interval - delta)
        self._last = time.monotonic()

    def request(
        self,
        method: str,
        url: str,
        *,
        max_retries: int = 4,
        timeout: float = 30.0,
        **kw: Any,
    ) -> requests.Response:
        last_exc: Exception | None = None
        resp: requests.Response | None = None
        for attempt in range(max_retries):
            self._wait()
            try:
                resp = self.s.request(method, url, timeout=timeout, **kw)
            except requests.RequestException as exc:
                last_exc, resp = exc, None
                backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("network error on %s %s: %s; retry in %.1fs",
                            method, url, exc, backoff)
                time.sleep(backoff)
                continue
            # A response arrived, so any earlier network error is stale.
            last_exc = None
            if resp.status_code == 429 or 500 <= resp.status_code < 600:
                # Honor Retry-After if present, else exponential backoff.
                ra = resp.headers.get("Retry-After")
                if ra and ra.isdigit():
                    backoff = float(ra)
                else:
                    backoff = min(30.0, (2 ** attempt) + random.random())
                log.warning("HTTP %d on %s %s; retry in %.1fs",
                            resp.status_code, method, url, backoff)
                time.sleep(backoff)
                continue
            return resp
        # Retries exhausted: the outcome reflects the final attempt.
        if resp is not None:
            # Final response (still bad) returned for caller to handle.
            return resp
        if last_exc is not None:
            raise last_exc
        raise ValueError("max_retries must be >= 1")

    def get(self, url: str, **kw: Any) -> requests.Response:
        return self.request("GET", url, **kw)

    def head(self, url: str, **kw: Any) -> requests.Response:
        kw.setdefault("allow_redirects", True)
        return self.request("HEAD", url, **kw)


# --------------------------------------------------------------------- model
@dataclass
class SupplementalDoc:
    kind: str
    title: str
    url: str
    last_modified: str | None = None


@dataclass
class BayerProduct:
    slug: str              # filesystem-safe slug, e.g. "warrant"
    catalog_slug: str      # bayer's seoSlug, e.g. "warrant-herbicide"
    product_url_path: str  # e.g. "/crop-protection/herbicide/warrant-herbicide"
    product_class: str     # "herbicide" | "fungicide" | ...
    product_name: str = ""
    epa_reg_no: str | None = None
    active_ingredients: list[dict] = field(default_factory=list)  # [{name, cas, percent}]
    label_url: str | None = None
    label_filename: str | None = None
    label_last_modified: str | None = None
    label_page_count: int | None = None
    label_text_layer: bool | None = None
    supplemental_pdfs: list[SupplementalDoc] = field(default_factory=list)
    source_page_url: str = ""


# --------------------------------------------------------------------- helpers
_NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S
)


def parse_next_data(html: str) -> dict[str, Any]:
    """Pull the ``__NEXT_DATA__`` JSON blob out of a Next.js page."""
    m = _NEXT_DATA_RE.search(html)
    if not m:
        raise RuntimeError("no __NEXT_DATA__ script tag found")
    return json.loads(m.group(1))


def fetch_build_id(http: RateLimitedSession) -> str:
    """Grab the rotating ``buildId`` from the Bayer homepage."""
    r = http.get(BASE + "/")
    r.raise_for_status()
    data = parse_next_data(r.text)
    bid = data.get("buildId")
    if not bid:
        raise RuntimeError("buildId missing from homepage __NEXT_DATA__")
    log.info("resolved Bayer buildId=%s", bid)
    return bid


def normalize_epa_reg(raw: str | None) -> str | None:
    """Convert Bayer's padded reg number to canonical EPA form.

    Example: ``0000524-00591-AA-0000000`` -> ``524-591``.

    The trailing ``-AA-0000000`` is a Bayer-internal qualifier we
    don't surface. If the third segment is non-empty and not ``AA``
    (rare), it is kept as a suffix: ``524-591-<sub>``.
    """
    if not raw:
        return None
    parts = raw.split("-")
    if len(parts) < 2:
        return raw.strip() or None
    company = parts[0].lstrip("0") or "0"
    product = parts[1].lstrip("0") or "0"
    epa = f"{company}-{product}"
    # If the third segment is something other than the default "AA",
    # it's likely a distributor sub-reg. Preserve it.
    if len(parts) >= 3 and parts[2] and parts[2] != "AA":
        epa += f"-{parts[2]}"
    return epa


def classify_supplemental(title: str, url: str) -> str:
    """Classify a supplemental/auxiliary doc by its title or URL.

    Returns a short kind tag: ``24C`` (or ``24C-<ST>`` when a state
    code is visible in the URL, e.g. ``24C-CA``), ``2EE``, ``MSDS``,
    ``Bulletin``, ``Supplemental``, ``Label``, or ``Other``. The exact
    tag isn't load-bearing for the scraper; it's metadata to help the
    chunker/agent later. Best-effort matching; ambiguous = ``Other``.
    """
    t = (title or "").upper()
    u = (url or "").upper()
    blob = f"{t} {u}"
    # State-specific 24c labels usually carry a two-letter state code,
    # but Bayer's titles rarely encode it. Best we can do is flag 24c.
    if "24C" in blob or "SECTION_24C" in blob or "SECTION 24C" in blob:
        # Try to spot a state suffix in the URL (e.g. "_24c_ca").
        m = re.search(r"24[_-]?C[_-]([A-Z]{2})\b", u)
        if m:
            return f"24C-{m.group(1)}"
        return "24C"
    if "2EE" in blob or "2_EE" in blob:
        return "2EE"
    if "MSDS" in blob or "SDS" in blob or "SAFETY DATA" in blob:
        return "MSDS"
    if "BULLETIN" in blob:
        return "Bulletin"
    if "SUPPLEMENTAL" in blob:
        return "Supplemental"
    if "LABEL" in blob:
        return "Label"
    return "Other"


def safe_slug(catalog_slug: str, product_class: str) -> str:
    """Strip the trailing class suffix so ``warrant-herbicide`` becomes
    ``warrant``; falls back to the full slug for slugs that don't end
    with the class word."""
    suffix = f"-{product_class}"
    if catalog_slug.endswith(suffix):
        return catalog_slug[: -len(suffix)]
    # seed-treatment is sometimes split or omitted; just return as-is.
    return catalog_slug


def iso_from_http_date(http_date: str | None) -> str | None:
    """RFC1123 -> ISO 8601 UTC. Returns None if unparseable."""
    if not http_date:
        return None
    try:
        from email.utils import parsedate_to_datetime
        dt = parsedate_to_datetime(http_date)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    except Exception:  # noqa: BLE001
        return None


# --------------------------------------------------------------------- catalog
def walk_catalog(
    http: RateLimitedSession, build_id: str
) -> Iterable[BayerProduct]:
    """Yield ``BayerProduct`` stubs for every product across all classes.

    Stubs carry only catalog-level info (slug, URL, class). The detail
    fetch (EPA reg, ingredients, PDFs) happens later via
    :func:`fetch_product_detail`.
    """
    for ptype in PRODUCT_TYPES:
        product_class = PRODUCT_CLASS[ptype]
        page = 1
        seen = 0
        while True:
            url = (
                f"{BASE}/_next/data/{build_id}/crop-protection/catalog.json"
                f"?productType={ptype}&p={page}"
            )
            r = http.get(url)
            if r.status_code != 200:
                log.warning("catalog %s p=%d -> HTTP %d, stopping class",
                            ptype, page, r.status_code)
                break
            data = r.json().get("pageProps", {})
            products = data.get("serverProducts") or []
            total = data.get("total") or 0
            if not products:
                break
            for p in products:
                slug = p.get("seoSlug") or ""
                product_url = p.get("productURL") or ""
                if not slug or not product_url:
                    continue
                yield BayerProduct(
                    slug=safe_slug(slug, product_class),
                    catalog_slug=slug,
                    product_url_path=product_url,
                    product_class=product_class,
                )
            seen += len(products)
            if seen >= total:
                break
            page += 1


# --------------------------------------------------------------------- detail
def fetch_product_detail(
    http: RateLimitedSession, prod: BayerProduct
) -> BayerProduct:
    """Populate EPA reg, active ingredients, and the full PDF list on
    a catalog stub by fetching its product page __NEXT_DATA__."""
    page_url = BASE + prod.product_url_path
    prod.source_page_url = page_url
    r = http.get(page_url)
    r.raise_for_status()
    data = parse_next_data(r.text)
    pp = (data.get("props") or {}).get("pageProps") or {}
    pd = pp.get("productDetails") or {}
    prod.product_name = pd.get("productLabel") or pd.get("productName") or prod.slug
    prod.epa_reg_no = normalize_epa_reg(pd.get("registrationNumber"))

    # Bayer's product page exposes ingredient names only, with no CAS or
    # percent. Conform to the canonical schema by emitting objects with name
    # set and the other fields null; downstream consumers can hydrate from
    # EPA PPLS.
    prod.active_ingredients = [
        {"name": a.get("ingredient"), "cas": None, "percent": None}
        for a in (pd.get("activeIngredients") or [])
        if a.get("ingredient")
    ]

    # Primary label: prefer downloadLabelUrl, then importantDocuments.
    important = (pp.get("importantDocuments") or {}).get("labelData") or []
    additional = (pp.get("additionalDownloads") or {}).get("labelData") or []
    download_url = pp.get("downloadLabelUrl")
    label_url: str | None = None
    if download_url and looks_like_pdf(download_url):
        label_url = download_url
    else:
        # First entry titled "Label" or simply the first PDF.
        for d in important:
            t = (d.get("title") or "").lower()
            u = d.get("url") or ""
            if not looks_like_pdf(u):
                continue
            if "label" in t and "msds" not in t and "sds" not in t:
                label_url = u
                break
        if not label_url:
            for d in important + additional:
                u = d.get("url") or ""
                if looks_like_pdf(u):
                    label_url = u
                    break
    prod.label_url = label_url
    if label_url:
        # Last URL segment is the Scene7 asset id (e.g. "Warrant_2025pdf").
        prod.label_filename = label_url.rsplit("/", 1)[-1]

    # Collect ALL other PDFs as supplementals (label/MSDS/24c/2EE/bulletin
    # /etc.). The kind tag is best-effort; the chunker can refine later.
    supplementals: list[SupplementalDoc] = []
    seen_urls: set[str] = set()
    if label_url:
        seen_urls.add(label_url)
    for d in important + additional:
        u = d.get("url") or ""
        t = d.get("title") or ""
        if not u or u in seen_urls:
            continue
        if not looks_like_pdf(u):
            continue
        seen_urls.add(u)
        supplementals.append(SupplementalDoc(
            kind=classify_supplemental(t, u),
            title=t,
            url=u,
        ))
    prod.supplemental_pdfs = supplementals
    return prod


def looks_like_pdf(url: str) -> bool:
    """True if the URL looks like one of Bayer's PDF endpoints.

    Bayer serves PDFs via Adobe Scene7 with the literal ``pdf`` (no
    dot) appended to the asset ID, plus some assets on cs-contentapi
    with a real ``.pdf`` extension. One suffix check covers both,
    since any URL ending in ``.pdf`` also ends in ``pdf``.
    """
    return url.lower().endswith("pdf")


# --------------------------------------------------------------------- PDF
def head_last_modified(http: RateLimitedSession, url: str) -> str | None:
    """Resolve Last-Modified for a PDF URL. Returns ISO 8601 or None."""
    try:
        r = http.head(url)
    except requests.RequestException as exc:
        log.warning("HEAD failed for %s: %s", url, exc)
        return None
    if r.status_code != 200:
        log.warning("HEAD %s -> HTTP %d", url, r.status_code)
        return None
    return iso_from_http_date(r.headers.get("Last-Modified"))


def fetch_pdf_text(http: RateLimitedSession, url: str) -> tuple[str, int, bool]:
    """Download a PDF and return ``(text, page_count, has_text_layer)``.

    Concatenates all pages, normalizes whitespace, and collapses runs
    of blank lines so the resulting markdown diffs cleanly. ``has_text_layer``
    is False for scanned PDFs whose pypdf extract produced no text.
    """
    r = http.get(url)
    r.raise_for_status()
    if "pdf" not in (r.headers.get("Content-Type") or "").lower():
        log.warning("expected PDF Content-Type at %s, got %s",
                    url, r.headers.get("Content-Type"))
    reader = PdfReader(io.BytesIO(r.content))
    page_count = len(reader.pages)
    chunks: list[str] = []
    for page in reader.pages:
        try:
            text = page.extract_text() or ""
        except Exception as exc:  # noqa: BLE001
            log.warning("pypdf extract_text failed on a page of %s: %s",
                        url, exc)
            text = ""
        chunks.append(text)
    raw = "\n\n".join(chunks)
    normalized = normalize_text(raw)
    has_text_layer = bool(normalized.strip())
    return normalized, page_count, has_text_layer


def normalize_text(s: str) -> str:
    # Replace the NBSPs that pypdf often leaves behind with plain spaces,
    # strip trailing spaces per line, and collapse runs of 3+ newlines to 2.
    s = s.replace("\u00a0", " ")
    s = re.sub(r"[ \t]+\n", "\n", s)
    s = re.sub(r"\n{3,}", "\n\n", s)
    return s.strip() + "\n"


# --------------------------------------------------------------------- write
def write_product(prod: BayerProduct, body_md: str) -> None:
    """Write the canonical sidecar + markdown body. See scrape/README.md
    for the schema."""
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    md_path = CORPUS_DIR / f"{prod.slug}.md"
    json_path = CORPUS_DIR / f"{prod.slug}.json"

    # Lightweight markdown header for human eyeballing; canonical
    # metadata lives in the sidecar.
    title = prod.product_name or prod.slug
    ai_summary = ", ".join(
        a["name"] for a in prod.active_ingredients if a.get("name")
    ) or "(unknown)"
    header = (
        f"# {title}\n\n"
        f"- **Product class:** {prod.product_class}\n"
        f"- **EPA Reg No:** {prod.epa_reg_no or '(unknown)'}\n"
        f"- **Active ingredients:** {ai_summary}\n"
        f"- **Source:** {prod.source_page_url}\n"
        f"- **Label PDF:** {prod.label_url or '(none on page)'}\n\n"
        "---\n\n"
    )
    md_path.write_text(header + body_md, encoding="utf-8")

    sidecar = {
        "source": "bayer",
        "source_key": prod.slug,
        "epa_reg_no": prod.epa_reg_no,
        "product_name": prod.product_name,
        "product_class": prod.product_class,
        "registrant": None,
        "active_ingredients": prod.active_ingredients,
        "signal_word": None,
        "label": {
            "url": prod.label_url,
            "filename": prod.label_filename,
            "accepted_date": None,
            "last_modified": prod.label_last_modified,
            "page_count": prod.label_page_count,
            "text_layer": prod.label_text_layer,
        },
        "supplemental_documents": [
            {
                "kind": s.kind,
                "title": s.title,
                "url": s.url,
                "last_modified": s.last_modified,
            }
            for s in prod.supplemental_pdfs
        ],
        "source_urls": {
            "product_page": prod.source_page_url,
            "label_api": None,
            "label_index": None,
        },
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "scraper_version": SCRAPER_VERSION,
    }
    json_path.write_text(
        json.dumps(sidecar, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


# --------------------------------------------------------------------- pipeline
def process_product(
    http: RateLimitedSession,
    prod: BayerProduct,
    *,
    force: bool,
) -> str:
    """Fetch detail + PDF and write to disk. Returns a status string
    suitable for logging: ``written``, ``skipped``, ``no-pdf``,
    ``failed``."""
    md_path = CORPUS_DIR / f"{prod.slug}.md"
    if md_path.exists() and not force:
        return "skipped"
    try:
        fetch_product_detail(http, prod)
    except Exception as exc:  # noqa: BLE001
        log.error("detail fetch failed for %s: %s", prod.slug, exc)
        return "failed"

    # Resolve Last-Modified for label + supplementals (HEAD only, cheap).
    if prod.label_url:
        prod.label_last_modified = head_last_modified(http, prod.label_url)
    for s in prod.supplemental_pdfs:
        s.last_modified = head_last_modified(http, s.url)

    if not prod.label_url:
        # Some Bayer products have no public label PDF (e.g. product was
        # discontinued or the page only carries a Product Bulletin). We
        # still record the metadata sidecar so the catalog is complete,
        # but write a stub body so the file count reflects reality.
        log.info("%s: no label PDF; writing metadata only", prod.slug)
        prod.label_text_layer = False
        write_product(prod, "_(No label PDF was found on the product page.)_\n")
        return "no-pdf"

    try:
        body, page_count, text_layer = fetch_pdf_text(http, prod.label_url)
    except Exception as exc:  # noqa: BLE001
        log.error("PDF fetch/extract failed for %s (%s): %s",
                  prod.slug, prod.label_url, exc)
        return "failed"
    prod.label_page_count = page_count
    prod.label_text_layer = text_layer
    if not body.strip():
        log.warning("%s: extracted PDF was empty (scanned?)", prod.slug)
        body = "[SCANNED PDF — OCR REQUIRED]\n"
    write_product(prod, body)
    return "written"


def run(
    *,
    limit: int | None,
    force: bool,
    only_product: str | None,
    only_class: str | None,
) -> int:
    CORPUS_DIR.mkdir(parents=True, exist_ok=True)
    http = RateLimitedSession()
    build_id = fetch_build_id(http)

    products: list[BayerProduct] = []
    for prod in walk_catalog(http, build_id):
        if only_class and prod.product_class != only_class:
            continue
        if only_product and prod.slug != only_product and prod.catalog_slug != only_product:
            continue
        products.append(prod)
    if only_product and not products:
        log.error("no product matched --product=%s", only_product)
        return 2
    log.info("catalog yielded %d candidate product(s)", len(products))

    counts = {"written": 0, "skipped": 0, "no-pdf": 0, "failed": 0}
    processed = 0
    for prod in products:
        # --limit counts every product processed, including skipped ones.
        if limit is not None and processed >= limit:
            break
        processed += 1
        status = process_product(http, prod, force=force)
        counts[status] = counts.get(status, 0) + 1
        log.info(
            "[%d/%s] %s %s | class=%s epa=%s ai=%s label=%s",
            processed, str(limit) if limit else "all",
            prod.slug, status,
            prod.product_class,
            prod.epa_reg_no or "-",
            ",".join(a["name"] for a in prod.active_ingredients if a.get("name")) or "-",
            prod.label_url or "-",
        )
    log.info(
        "done: processed=%d written=%d skipped=%d no-pdf=%d failed=%d",
        processed,
        counts["written"], counts["skipped"], counts["no-pdf"], counts["failed"],
    )
    return 0 if counts["failed"] == 0 else 1
# --------------------------------------------------------------------- CLI
def _build_argparser() -> argparse.ArgumentParser:
p = argparse.ArgumentParser(
prog="scrape.sources.bayer",
description="Scrape Bayer Crop Science US product labels.",
)
p.add_argument(
"--limit", type=int, default=None,
help="Stop after processing N products (default: all).",
)
p.add_argument(
"--force", action="store_true",
help="Re-download even if the markdown file already exists.",
)
p.add_argument(
"--product", default=None,
help="Process a single product by slug (e.g. 'warrant' or "
"'warrant-herbicide').",
)
p.add_argument(
"--class", dest="product_class", default=None,
choices=sorted(set(PRODUCT_CLASS.values())),
help="Limit to one product class.",
)
p.add_argument(
"--log-level", default=os.environ.get("LOG_LEVEL", "INFO"),
help="Python logging level (default INFO).",
)
return p
def main(argv: list[str] | None = None) -> int:
args = _build_argparser().parse_args(argv)
logging.basicConfig(
level=args.log_level.upper(),
format="%(asctime)s %(levelname)s %(name)s %(message)s",
stream=sys.stderr,
)
return run(
limit=args.limit,
force=args.force,
only_product=args.product,
only_class=args.product_class,
)
if __name__ == "__main__":
sys.exit(main())
+599
@@ -0,0 +1,599 @@
"""EPA PPLS (Pesticide Product Label System) scraper.
Enumeration strategy
====================
The PPLS Oracle APEX portal (ordspub.epa.gov/ords/pesticides/f?p=PPLS:1)
is session-stateful and hostile to enumeration, so we use a three-step
approach that bypasses APEX entirely:
1. **List products** via the public PPIS bulk download
``https://www3.epa.gov/pesticides/PPISdata/product.zip`` — a 107-char
fixed-width flat file (``product.txt``, ~102K active Section 3
registrations, refreshed every Tuesday). Gives us the universe of
EPA Reg Nos (company-product), plus the product name.
2. **Hydrate per product** via the PPLS REST data service at
``https://ordspub.epa.gov/ords/pesticides/cswu/ppls/{regno}`` —
returns rich JSON: registrant, active ingredients (with CAS + %),
formulations, status, signal word, AND a ``pdffiles`` array
listing every stamped label PDF EPA has accepted for the product.
The most recent entry gives us the canonical PDF filename
(``{company6}-{product5}-{YYYYMMDD}.pdf``), solving the
stamped-date-suffix problem without having to guess.
3. **Fetch label PDF** from
``https://www3.epa.gov/pesticides/chem_search/ppls/{filename}``
and extract text with pypdf. Many EPA labels are scans with no
text layer — those are flagged ``text_layer: false`` and the .md
body is a ``[SCANNED PDF — OCR REQUIRED]`` placeholder. OCR is
deferred to Phase 2.
Paths rejected and why
----------------------
- ``/ords/pesticides/ppls/{reg}`` (no ``/cswu/`` prefix): returns the
APEX HTML splash, not JSON. The undocumented ``/cswu/`` prefix is
the actual ORDS REST handler.
- Scraping the APEX UI: session-stateful, fragile, blocked.
- data.gov mirror: redirects to the same APEX page, no extract.
- NPIRS (Purdue): subscription-walled; PPIS is the same authoritative
feed anyway.
Required sidecar fields (per task spec): ``source``, ``epa_reg_no``,
``label.url`` (named ``label_pdf_url`` in the task spec; nested under
``label`` in the canonical schema), ``fetched_at``. Everything else is
best-effort.
"""
from __future__ import annotations
import argparse
import io
import json
import logging
import re
import sys
import time
import zipfile
from dataclasses import dataclass, field
from datetime import UTC, datetime
from pathlib import Path
from typing import Any, Iterable
import httpx
from pypdf import PdfReader
from pypdf.errors import PdfReadError
SCRAPER_VERSION = "0.1.0"
USER_AGENT = "ppls-docs-scraper/0.1 (+https://drawbar.example/contact)"
PPIS_PRODUCT_ZIP_URL = "https://www3.epa.gov/pesticides/PPISdata/product.zip"
PPLS_API_BASE = "https://ordspub.epa.gov/ords/pesticides/cswu/ppls"
PPLS_PDF_BASE = "https://www3.epa.gov/pesticides/chem_search/ppls"
PPLS_INDEX_URL_TEMPLATE = (
"https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:{regno}"
)
REPO_ROOT = Path(__file__).resolve().parents[2]
CORPUS_DIR = REPO_ROOT / "corpus" / "epa_ppls"
REQUEST_DELAY_SECONDS = 1.1 # polite: ~1 req/sec
HTTP_TIMEOUT = httpx.Timeout(60.0, connect=15.0)
MAX_RETRIES = 4
log = logging.getLogger("epa_ppls")
# ---------------------------------------------------------------------------
# HTTP helpers
# ---------------------------------------------------------------------------
def _client() -> httpx.Client:
return httpx.Client(
headers={"User-Agent": USER_AGENT, "Accept-Encoding": "gzip, deflate"},
timeout=HTTP_TIMEOUT,
follow_redirects=True,
)
def _get_with_retries(
    client: httpx.Client, url: str, *, expect_json: bool = False
) -> httpx.Response:
    """GET with exponential backoff on 5xx/429/network errors.

    Other 4xx responses are not retried: ``raise_for_status()`` propagates
    them to the caller immediately.
    """
    retryable = (429, 500, 502, 503, 504)
    last_error = "no attempts made"
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = client.get(url)
        except httpx.TransportError as exc:
            last_error = f"network error: {exc}"
        else:
            if resp.status_code in retryable:
                last_error = f"HTTP {resp.status_code}"
            else:
                resp.raise_for_status()
                ctype = resp.headers.get("content-type", "")
                if not expect_json or "json" in ctype.lower():
                    return resp
                # ORDS sometimes returns text/html error pages with 200;
                # treat those as transient and retry.
                last_error = f"expected JSON, got content-type={ctype!r}"
        # Back off only when another attempt will follow.
        if attempt < MAX_RETRIES:
            wait = min(2 ** attempt, 30)
            log.warning(
                "%s on %s (attempt %d/%d); sleeping %ds",
                last_error, url, attempt, MAX_RETRIES, wait,
            )
            time.sleep(wait)
    raise RuntimeError(f"GET {url} failed after {MAX_RETRIES} attempts: {last_error}")
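The wait schedule implied by `min(2 ** attempt, 30)` can be worked out directly. This is a small sketch, assuming the same four-attempt limit and 30-second cap as the constants above; `backoff_schedule` is a hypothetical helper for illustration, not part of the module.

```python
def backoff_schedule(max_retries: int = 4, cap_seconds: int = 30) -> list[int]:
    """Seconds to wait after attempt 1, 2, ... fails (attempts are 1-indexed)."""
    return [min(2 ** attempt, cap_seconds) for attempt in range(1, max_retries + 1)]


if __name__ == "__main__":
    # Default settings: the cap is never reached.
    print(backoff_schedule())
    # With more attempts, the cap applies from the fifth attempt (2 ** 5 = 32).
    print(backoff_schedule(max_retries=6))
```

With four attempts the waits double each time and stay under the cap; the cap only matters if `MAX_RETRIES` is raised to five or more.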
# ---------------------------------------------------------------------------
# Enumeration: PPIS bulk product.zip
# ---------------------------------------------------------------------------
@dataclass
class PpisRow:
"""One row of PPIS product.txt — enough to hydrate via the API."""
epa_reg_no: str
product_name: str
status_flag: str # 'F' (federal/active) or 'T' (transferred)
rup_flag: str # 'Y' or 'N'
def _parse_ppis_line(line: str) -> PpisRow | None:
"""Parse one 107-char PPIS product.txt row.
Layout (1-indexed, inferred from inspection):
1-6 company number (zero-padded, may contain trailing spaces)
7-11 product number (zero-padded, may contain trailing spaces)
33-102 product name (70 chars, space-padded)
103 status flag ('F' or 'T')
106 RUP flag ('Y' or 'N')
"""
if len(line) < 106:
return None
company_raw = line[0:6].strip()
product_raw = line[6:11].strip()
if not company_raw or not product_raw:
return None
# Strip leading zeros for canonical EPA Reg No display
try:
company = str(int(company_raw))
product = str(int(product_raw))
except ValueError:
return None
name = line[32:102].strip()
status_flag = line[102:103]
rup_flag = line[105:106]  # safe: len(line) >= 106 was checked above
return PpisRow(
epa_reg_no=f"{company}-{product}",
product_name=name,
status_flag=status_flag,
rup_flag=rup_flag,
)
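The column layout in the docstring above can be exercised without downloading `product.zip`. The following is a standalone sketch, assuming that layout is correct (it is described as inferred from inspection). `Row`, `parse_fixed_width`, and `make_synthetic_line` are hypothetical names, and the sample row is made up rather than real PPIS data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Row:
    epa_reg_no: str
    product_name: str
    status_flag: str
    rup_flag: str


def parse_fixed_width(line: str) -> Row | None:
    """Slice one row using the 1-indexed column layout from the docstring."""
    if len(line) < 106:
        return None
    try:
        company = int(line[0:6])   # cols 1-6, zero-padded
        product = int(line[6:11])  # cols 7-11, zero-padded
    except ValueError:
        return None
    return Row(
        epa_reg_no=f"{company}-{product}",
        product_name=line[32:102].strip(),  # cols 33-102
        status_flag=line[102],              # col 103
        rup_flag=line[105],                 # col 106
    )


def make_synthetic_line(company: int, product: int, name: str,
                        status: str = "F", rup: str = "N") -> str:
    """Build a made-up 107-char row for testing; not real PPIS data."""
    return (
        f"{company:06d}{product:05d}"
        + " " * 21                 # cols 12-32: fields this sketch ignores
        + name.ljust(70)[:70]      # cols 33-102
        + status + "  " + rup + " "
    )


if __name__ == "__main__":
    sample = make_synthetic_line(524, 475, "EXAMPLE PRODUCT")
    print(len(sample), parse_fixed_width(sample))
```

Building the test row from the same offsets keeps the sketch self-checking: if a slice boundary is off by one, the round trip no longer returns the input values.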
def fetch_ppis_index(client: httpx.Client) -> list[PpisRow]:
"""Download PPIS product.zip and parse into PpisRow list."""
log.info("Fetching PPIS index from %s", PPIS_PRODUCT_ZIP_URL)
resp = _get_with_retries(client, PPIS_PRODUCT_ZIP_URL)
rows: list[PpisRow] = []
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
with zf.open("product.txt") as fh:
for raw in fh:
line = raw.decode("latin-1").rstrip("\n").rstrip("\r")
row = _parse_ppis_line(line)
if row is not None:
rows.append(row)
log.info("Parsed %d rows from PPIS index", len(rows))
return rows
# ---------------------------------------------------------------------------
# Hydration: PPLS JSON API
# ---------------------------------------------------------------------------
def _zero_pad_regno(regno: str) -> str:
"""524-475 -> 000524-00475 (canonical filename form). Distributor suffix
(524-475-12345) -> 000524-00475-12345."""
parts = regno.split("-")
if len(parts) == 2:
c, p = parts
return f"{int(c):06d}-{int(p):05d}"
if len(parts) == 3:
c, p, d = parts
return f"{int(c):06d}-{int(p):05d}-{int(d):05d}"
return regno
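The zero-padded form is the prefix of the stamped filename convention named in the module docstring, `{company6}-{product5}-{YYYYMMDD}.pdf`. As a rough illustration of that convention, the sketch below goes the other way and splits such a filename back into a display reg no and an ISO date. `split_label_filename` is a hypothetical helper, it handles only the two-part reg no form, and the filename in the usage line is illustrative rather than a confirmed EPA file.

```python
from __future__ import annotations

import re

# Two-part form only: {company6}-{product5}-{YYYYMMDD}.pdf
_FILENAME_RE = re.compile(r"^(\d{6})-(\d{5})-(\d{4})(\d{2})(\d{2})\.pdf$")


def split_label_filename(filename: str) -> tuple[str, str] | None:
    """Return (reg_no, iso_date) for a stamped label filename, else None."""
    m = _FILENAME_RE.match(filename)
    if m is None:
        return None
    company, product, yyyy, mm, dd = m.groups()
    # int() drops the zero padding to give the display form of the reg no.
    return f"{int(company)}-{int(product)}", f"{yyyy}-{mm}-{dd}"


if __name__ == "__main__":
    print(split_label_filename("000524-00475-20161018.pdf"))
```

The date suffix is not validated as a calendar date here; the scraper itself avoids guessing filenames by reading them from the API's `pdffiles` array.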
_MONTHS = {
"january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
"july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
"december": 12,
}
def _parse_label_date(text: str | None) -> str | None:
"""'October 18, 2016' -> '2016-10-18'. Returns None on any parse issue."""
if not text:
return None
m = re.match(r"^([A-Za-z]+)\s+(\d{1,2}),\s+(\d{4})$", text.strip())
if not m:
return None
month = _MONTHS.get(m.group(1).lower())
if month is None:
return None
try:
return f"{int(m.group(3)):04d}-{month:02d}-{int(m.group(2)):02d}"
except ValueError:
return None
def _http_date_to_iso(http_date: str | None) -> str | None:
"""RFC1123 'Wed, 19 Oct 2016 17:48:09 GMT' -> ISO 8601 UTC.
Returns None on unparseable input. Matches the canonical schema's
requirement that all timestamps be ISO 8601.
"""
if not http_date:
return None
try:
from email.utils import parsedate_to_datetime
dt = parsedate_to_datetime(http_date)
if dt.tzinfo is None:
dt = dt.replace(tzinfo=UTC)
return dt.astimezone(UTC).isoformat()
except Exception: # noqa: BLE001
return None
@dataclass
class ProductRecord:
epa_reg_no: str
product_name: str | None
registrant: str | None
registrant_company_number: str | None
active_ingredients: list[dict[str, Any]]
label_pdf_url: str | None
label_pdf_filename: str | None
label_accepted_date: str | None
registration_status: str | None
signal_word: str | None
raw_api_item: dict[str, Any] | None = field(repr=False, default=None)
def fetch_product_record(client: httpx.Client, regno: str) -> ProductRecord:
"""Call the PPLS API for one EPA Reg No; build a ProductRecord."""
url = f"{PPLS_API_BASE}/{regno}"
resp = _get_with_retries(client, url, expect_json=True)
payload = resp.json()
items = payload.get("items") or []
if not items:
return ProductRecord(
epa_reg_no=regno,
product_name=None,
registrant=None,
registrant_company_number=None,
active_ingredients=[],
label_pdf_url=None,
label_pdf_filename=None,
label_accepted_date=None,
registration_status=None,
signal_word=None,
raw_api_item=None,
)
item = items[0]
company_info = (item.get("companyinfo") or [{}])[0]
registrant = company_info.get("name")
company_num = regno.split("-")[0]
ingredients = []
for ai in item.get("active_ingredients") or []:
ingredients.append({
"name": ai.get("active_ing"),
"cas": ai.get("cas_number"),
"percent": ai.get("active_ing_percent"),
"pc_code": ai.get("pc_code"),
})
pdffiles = item.get("pdffiles") or []
# Pick the most recently accepted PDF. The API appears to return entries
# in date-descending order, but we select with max() defensively.
pdf_entry: dict[str, Any] | None = None
if pdffiles:
def _date_key(e: dict[str, Any]) -> str:
d = _parse_label_date(e.get("pdffile_accepted_date"))
return d or "0000-00-00"
pdf_entry = max(pdffiles, key=_date_key)
pdf_filename = pdf_entry.get("pdffile") if pdf_entry else None
pdf_url = f"{PPLS_PDF_BASE}/{pdf_filename}" if pdf_filename else None
accepted = _parse_label_date(pdf_entry.get("pdffile_accepted_date")) if pdf_entry else None
return ProductRecord(
epa_reg_no=regno,
product_name=item.get("productname"),
registrant=registrant,
registrant_company_number=company_num,
active_ingredients=ingredients,
label_pdf_url=pdf_url,
label_pdf_filename=pdf_filename,
label_accepted_date=accepted,
registration_status=item.get("product_status"),
signal_word=item.get("signal_word"),
raw_api_item=item,
)
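The "latest accepted label" selection can be shown in isolation. This is a minimal sketch, assuming entries shaped like the `pdffiles` items read above (`pdffile`, `pdffile_accepted_date`). It swaps in `datetime.strptime` for the module's regex and month table, which is shorter but depends on an English (C) locale for `%B`. `latest_pdf_entry` and the sample entries are hypothetical.

```python
from __future__ import annotations

from datetime import date, datetime
from typing import Any


def latest_pdf_entry(pdffiles: list[dict[str, Any]]) -> dict[str, Any] | None:
    """Return the entry with the newest accepted date, or None if empty."""

    def accepted(entry: dict[str, Any]) -> date:
        raw = entry.get("pdffile_accepted_date") or ""
        try:
            # Expects dates like "October 18, 2016".
            return datetime.strptime(raw.strip(), "%B %d, %Y").date()
        except ValueError:
            # Unparseable or missing dates sort before every real date.
            return date.min

    return max(pdffiles, key=accepted) if pdffiles else None


if __name__ == "__main__":
    # Made-up entries; real filenames come from the API response.
    entries = [
        {"pdffile": "older.pdf", "pdffile_accepted_date": "October 18, 2016"},
        {"pdffile": "newer.pdf", "pdffile_accepted_date": "May 3, 2021"},
        {"pdffile": "undated.pdf", "pdffile_accepted_date": None},
    ]
    print(latest_pdf_entry(entries))
```

Comparing `date` objects avoids relying on string ordering, and the `date.min` fallback plays the same role as the `"0000-00-00"` sentinel.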
# ---------------------------------------------------------------------------
# PDF download + text extraction
# ---------------------------------------------------------------------------
def download_pdf(client: httpx.Client, url: str) -> tuple[bytes, str | None]:
"""Download a label PDF; return (bytes, Last-Modified header or None)."""
resp = _get_with_retries(client, url)
last_modified = resp.headers.get("last-modified")
return resp.content, last_modified
def extract_pdf_text(pdf_bytes: bytes) -> tuple[str, bool]:
"""Extract text from a PDF.
Returns (text, has_text_layer). Concatenates pages, normalizes whitespace.
If no extractable text is found, returns ("", False).
"""
try:
reader = PdfReader(io.BytesIO(pdf_bytes))
except PdfReadError as exc:
log.warning("pypdf failed to read PDF: %s", exc)
return "", False
chunks: list[str] = []
for i, page in enumerate(reader.pages):
try:
page_text = page.extract_text() or ""
except Exception as exc: # pypdf can throw on malformed pages
log.warning("pypdf extract_text failed on page %d: %s", i, exc)
page_text = ""
page_text = re.sub(r"[ \t]+", " ", page_text)
page_text = re.sub(r"\n{3,}", "\n\n", page_text).strip()
if page_text:
chunks.append(page_text)
combined = "\n\n".join(chunks).strip()
return combined, bool(combined)
# ---------------------------------------------------------------------------
# Per-product processing
# ---------------------------------------------------------------------------
def _md_path(regno: str) -> Path:
return CORPUS_DIR / f"{regno}.md"
def _json_path(regno: str) -> Path:
return CORPUS_DIR / f"{regno}.json"
def process_one(
client: httpx.Client,
regno: str,
*,
force: bool = False,
) -> str:
"""Fetch + extract one product. Returns 'skipped'|'wrote'|'no-pdf'|'error'."""
md_path = _md_path(regno)
json_path = _json_path(regno)
if not force and md_path.exists() and json_path.exists():
log.info("[%s] skip (already on disk)", regno)
return "skipped"
try:
record = fetch_product_record(client, regno)
except Exception as exc:
log.error("[%s] API fetch failed: %s", regno, exc)
return "error"
time.sleep(REQUEST_DELAY_SECONDS)
def _build_sidecar(
*,
label_url: str | None,
label_filename: str | None,
label_last_modified_iso: str | None,
page_count: int | None,
text_layer: bool | None,
) -> dict[str, Any]:
return {
"source": "epa_ppls",
"source_key": regno,
"epa_reg_no": regno,
"product_name": record.product_name,
"product_class": None, # EPA PPLS doesn't expose a clean class taxonomy
"registrant": record.registrant,
"active_ingredients": record.active_ingredients,
"signal_word": record.signal_word,
"label": {
"url": label_url,
"filename": label_filename,
"accepted_date": record.label_accepted_date,
"last_modified": label_last_modified_iso,
"page_count": page_count,
"text_layer": text_layer,
},
"supplemental_documents": [], # EPA PPLS sidecar omits supplementals; query API per regno
"source_urls": {
"product_page": None,
"label_api": f"{PPLS_API_BASE}/{regno}",
"label_index": PPLS_INDEX_URL_TEMPLATE.format(regno=regno),
},
# EPA-specific extras (kept out of the strict canonical schema but
# useful for joins back to EPA's data model)
"registration_status": record.registration_status,
"registrant_company_number": record.registrant_company_number,
"fetched_at": datetime.now(UTC).isoformat(),
"scraper_version": SCRAPER_VERSION,
}
if not record.label_pdf_url:
log.warning("[%s] no label PDF available — writing sidecar only", regno)
md_path.write_text(
f"# {record.product_name or regno}\n\n"
f"EPA Reg No: {regno}\n\n"
"[NO LABEL PDF AVAILABLE FROM EPA PPLS]\n",
encoding="utf-8",
)
sidecar = _build_sidecar(
label_url=None, label_filename=None,
label_last_modified_iso=None,
page_count=None, text_layer=False,
)
json_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
return "no-pdf"
try:
pdf_bytes, last_modified_raw = download_pdf(client, record.label_pdf_url)
except Exception as exc:
log.error("[%s] PDF download failed: %s", regno, exc)
return "error"
time.sleep(REQUEST_DELAY_SECONDS)
text, has_text = extract_pdf_text(pdf_bytes)
last_modified_iso = _http_date_to_iso(last_modified_raw)
page_count: int | None = None
try:
page_count = len(PdfReader(io.BytesIO(pdf_bytes)).pages)
except Exception:
pass
sidecar = _build_sidecar(
label_url=record.label_pdf_url,
label_filename=record.label_pdf_filename,
label_last_modified_iso=last_modified_iso,
page_count=page_count,
text_layer=has_text,
)
header_lines = [f"# {record.product_name or regno}", ""]
header_lines.append(f"- EPA Reg No: **{regno}**")
if record.registrant:
header_lines.append(f"- Registrant: {record.registrant}")
if record.signal_word:
header_lines.append(f"- Signal word: {record.signal_word}")
if record.active_ingredients:
ai_strs = [
f"{ai.get('name')} ({ai.get('percent')}%)"
for ai in record.active_ingredients
if ai.get("name")
]
if ai_strs:
header_lines.append("- Active ingredients: " + "; ".join(ai_strs))
if record.label_accepted_date:
header_lines.append(f"- Label accepted: {record.label_accepted_date}")
header_lines.append(f"- Source PDF: {record.label_pdf_url}")
header_lines.append("")
header_lines.append("---")
header_lines.append("")
if has_text:
body = text
else:
body = "[SCANNED PDF — OCR REQUIRED]\n\nThis label has no extractable text layer."
log.info("[%s] PDF has no text layer (scanned)", regno)
md_content = "\n".join(header_lines) + body + "\n"
md_path.write_text(md_content, encoding="utf-8")
json_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
log.info(
"[%s] wrote (text_layer=%s, pages=%s, name=%r)",
regno, has_text, page_count, record.product_name,
)
return "wrote"
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def _iter_regnos(
    args: argparse.Namespace,
    client: httpx.Client,
) -> Iterable[str]:
    """Yield reg nos to process based on CLI args.

    ``--limit`` applies only to PPIS enumeration, not to ``--reg-no`` or
    ``--seed-file`` input.
    """
    if args.reg_no:
        yield from args.reg_no
        return
    if args.seed_file:
        with open(args.seed_file, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.strip()
                if not line or line.startswith("#"):
                    continue
                yield line
        return
    # Default: enumerate via PPIS bulk index
    rows = fetch_ppis_index(client)
    count = 0
    for row in rows:
        # Check before yielding so an explicit ``--limit 0`` yields nothing
        # instead of being treated as "no limit".
        if args.limit is not None and count >= args.limit:
            return
        # Skip transferred-out (status_flag 'T') entries by default; their
        # registration has moved to another company-product pairing.
        if row.status_flag == "T":
            continue
        yield row.epa_reg_no
        count += 1
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(
prog="python -m scrape.sources.epa_ppls",
description="Scrape EPA PPLS pesticide labels into corpus/epa_ppls/.",
)
parser.add_argument(
"--limit", type=int, default=None,
help="Max products to process when enumerating from PPIS.",
)
parser.add_argument(
"--force", action="store_true",
help="Re-fetch even if .md/.json already exist.",
)
parser.add_argument(
"--reg-no", action="append", metavar="REGNO",
help="Process specific EPA Reg No (e.g. 524-475). Repeatable.",
)
parser.add_argument(
"--seed-file", metavar="PATH",
help="Text file with one EPA Reg No per line (# comments OK).",
)
parser.add_argument(
"--log-level", default="INFO",
choices=["DEBUG", "INFO", "WARNING", "ERROR"],
)
args = parser.parse_args(argv)
logging.basicConfig(
stream=sys.stderr,
level=getattr(logging, args.log_level),
format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
CORPUS_DIR.mkdir(parents=True, exist_ok=True)
summary = {"wrote": 0, "skipped": 0, "no-pdf": 0, "error": 0}
with _client() as client:
for regno in _iter_regnos(args, client):
result = process_one(client, regno, force=args.force)
summary[result] = summary.get(result, 0) + 1
log.info("done: %s", summary)
print(json.dumps(summary), file=sys.stderr)
return 0 if summary["error"] == 0 else 1
if __name__ == "__main__":
sys.exit(main())
+20
@@ -0,0 +1,20 @@
[
{
"id": "bayer",
"title": "Bayer Crop Science US — Product Labels",
"type": "manufacturer",
"homepage": "https://www.cropscience.bayer.us",
"scraper": "scrape.sources.bayer",
"scraper_version": "0.1.0",
"license_note": "robots.txt explicitly permits scraping for AI retrieval-augmented generation (verified 2026-05)"
},
{
"id": "epa_ppls",
"title": "EPA Pesticide Product Label System",
"type": "regulator",
"homepage": "https://ordspub.epa.gov/ords/pesticides/f?p=PPLS:1",
"scraper": "scrape.sources.epa_ppls",
"scraper_version": "0.1.0",
"license_note": "US federal government — public domain (no ToS restriction)"
}
]
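The commit message says the runner walks sources.json and dispatches by id. The runner itself is not shown in this part of the diff, so the following is only a sketch of how such a dispatch could look. It assumes each entry's `scraper` field is an importable module exposing `main(argv) -> int`, as both scraper modules above do. `load_sources`, `scraper_module_for`, and `dispatch` are hypothetical names.

```python
from __future__ import annotations

import importlib
import json
from pathlib import Path


def load_sources(path: Path) -> dict[str, dict]:
    """Read the manifest and index its entries by their "id" field."""
    entries = json.loads(path.read_text(encoding="utf-8"))
    return {entry["id"]: entry for entry in entries}


def scraper_module_for(sources: dict[str, dict], source_id: str) -> str:
    """Return the dotted module path registered for a source id."""
    if source_id not in sources:
        raise ValueError(f"unknown source id: {source_id!r}")
    return sources[source_id]["scraper"]


def dispatch(sources: dict[str, dict], source_id: str, argv: list[str]) -> int:
    """Import the scraper module lazily and run its main() with argv."""
    module = importlib.import_module(scraper_module_for(sources, source_id))
    return module.main(argv)


if __name__ == "__main__":
    manifest = {"bayer": {"id": "bayer", "scraper": "scrape.sources.bayer"}}
    print(scraper_module_for(manifest, "bayer"))
```

Importing lazily by module path keeps each source independently runnable, since nothing is imported for sources that are not selected.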