Commit Graph

3 Commits

Author SHA1 Message Date
justin ea3aea5871 epa_ppls: narrow row-crop filter to corn/soy/wheat only
App focus is corn, soybeans, and wheat. Dropping the broader
US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/
sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa).

Empirical impact (random N=100 sample): broad list matched 17/100
products, narrow list matches 16/100 — only 6% reduction, because
corn/soy/wheat dominate ag-chem registrations so thoroughly that
products registered for cotton/sorghum/etc. are almost always
co-registered for one of corn/soy/wheat. One sampled product was
dropped: a peanut-only herbicide (2749-614).

Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN
sites); 2749-614 AG36448 (PEANUTS only) correctly filtered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:39:55 -04:00
justin 60657aa6df epa_ppls: filter PPLS enumeration to row-crop products
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. Filter is on by default; --no-row-crop-filter
overrides for one-off broader pulls.

Filter shape:
  - Word-boundary regex match against each entry in the API's `sites`
    array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
    SHIPHOLDS" → drop even though it contains "OATS" as substring).
  - Allowlist covers the major US row + small-grain + oilseed + sugar/
    fiber crops, plus alfalfa as a common rotation crop. See
    ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.

Cost model:
  - 102K PPIS rows still need one API call each (no bulk filter
    available upstream), so enumeration still takes ~28h at 1 req/sec.
  - But PDF downloads drop from ~67K → ~5-10K (estimated row-crop
    hit rate), saving ~17h wall time and ~60GB disk on a full backfill.

Smoke test (4 mixed reg nos):
  524-475 Roundup Ultra        → kept (CORN/SOYBEANS/COTTON sites)
  524-591 Warrant              → kept (CORN/SOYBEANS/SORGHUM sites)
  100-1486 Advion Cockroach    → filtered (building/transport sites only)
  432-1276 (Bayer pet flea)    → filtered (no row crops)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:05:26 -04:00
justin e9250de8e7 scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, one
per-product label per source.

Sources shipped:
  - bayer       — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
  - epa_ppls    — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
  - active_ingredients always [{name, cas, percent}]
  - label/* nested (url, filename, accepted_date, last_modified,
    page_count, text_layer)
  - all timestamps normalized to ISO 8601 UTC
  - signal_word surfaced (operationally critical for the farmer advisor)
  - source_key + epa_reg_no separate per-source PK from the
    cross-source join key

bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.

PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2     # works
  python -m scrape.runner --source bayer --limit 3    # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:27:07 -04:00