7 Commits

Author SHA1 Message Date
justin 1a45280e45 rename: ppls-docs → crop-chem-docs
Repo/project rename to better reflect scope. PPLS is EPA's term for
their Pesticide Product Label System — accurate when the corpus was
EPA-only, narrow now that it also pulls from Bayer's own catalog
(and may expand to Syngenta/Corteva/BASF/FMC labels in the future).
crop-chem-docs scopes flexibly without acronyms to explain.

Renames:
- directory:           ppls-docs            → crop-chem-docs
- PRODUCT_NAME:        ppls                 → crop_chem
- Chroma collection:   ppls_docs            → crop_chem_docs  (in-place via .modify(), no re-embed)
- BM25 db:             bm25/ppls_docs.db    → bm25/crop_chem_docs.db
- MCP tool name:       ppls_api_lessons     → crop_chem_api_lessons
- FastMCP server name: ppls-docs            → crop-chem-docs
- Env vars:            PPLS_CORPUS_ROOT     → CORPUS_ROOT
                       PPLS_CHROMA_DIR      → CHROMA_DIR_OVERRIDE
- User-Agent:          ppls-docs-scraper    → crop-chem-docs-scraper

Preserved (intentional, correct):
- epa_ppls (source id) — refers specifically to EPA's PPLS database
- "EPA PPLS" mentions in regulatory text (lessons.md, server docstrings)
- PPLS_API_BASE / PPLS_PDF_BASE / PPLS_INDEX_URL_TEMPLATE in
  scrape/sources/epa_ppls.py — these point at EPA's actual endpoints

Memory entries get updated in a follow-up commit so the rename is
isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:25:59 -04:00
justin 92a95d5e78 epa_ppls: add registrant allowlist pre-API filter
Cuts the PPIS-enumeration universe from 102K rows to ~11.5K rows by
dropping products from non-row-crop-ag registrants BEFORE the per-
product API call. This is the biggest cost lever we have on the EPA
scraper — full backfill drops from ~28 h to ~3.5 h.

scrape/sources/epa_registrant_allowlist.json holds the 34 verified
ag-chem company numbers (Syngenta, Bayer, BASF, Corteva, FMC, Nufarm,
ADAMA, UPL, Albaugh, Loveland, AMVAC, Helena, Drexel, Atticus, etc.).
Each entry was verified by querying the EPA PPLS API for the first
active product registered under that company number. Edit the JSON
freely; the scraper loads it at run time. Bypass with
--no-registrant-filter when you suspect a row-crop product is registered
to a specialty company not on the list.
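
A minimal sketch of the pre-API filter described above, not the shipped
code. It assumes the allowlist JSON is a flat list of company-number
strings and that the company number is the part of the EPA Reg No before
the first hyphen. The function names and the three-entry allowlist are
hypothetical; the real file holds 34 entries and may be shaped
differently.

```python
import json

# Assumed file shape: a flat JSON list of EPA company numbers as strings.
# Illustrative subset only, not the real epa_registrant_allowlist.json.
ALLOWLIST_JSON = '["100", "264", "524"]'


def load_allowlist(raw: str) -> frozenset:
    """Parse the allowlist at run time so JSON edits need no code change."""
    return frozenset(str(n) for n in json.loads(raw))


def company_number(reg_no: str) -> str:
    """Return the company-number prefix of an EPA Reg No ("524-475" -> "524")."""
    return reg_no.split("-", 1)[0].strip()


def split_by_registrant(reg_nos, allowlist, use_filter=True):
    """Partition reg nos into (call_api, skipped) before any API call is paid."""
    if not use_filter:  # hypothetical equivalent of --no-registrant-filter
        return list(reg_nos), []
    keep, skip = [], []
    for reg_no in reg_nos:
        (keep if company_number(reg_no) in allowlist else skip).append(reg_no)
    return keep, skip


allowlist = load_allowlist(ALLOWLIST_JSON)
keep, skip = split_by_registrant(["524-475", "432-1276", "264-849"], allowlist)
print("call API:", keep)
print("skipped: ", skip)
```

Only the rows in `keep` would go on to the per-product API call; the
rest cost nothing beyond reading the PPIS row.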

Why a curated allowlist rather than a blocklist of consumer brands:
about 89% of the 102K PPIS rows are not ag-relevant, so an allowlist is
shorter to maintain and less prone to false positives.

Excluded with intent (not omissions): Bayer Environmental Science
(turf/ornamental), Scotts (consumer lawn & garden), Wellmark/Zoecon
(animal flea/tick), Control Solutions (structural pest), Cleary
(turf), PBI/Gordon (mostly turf), Buckman Labs (industrial water).

Smoke test --limit 100:
  - 1239 PPIS rows considered (in first slice of file)
  - 1139 skipped by registrant filter (no API call paid)
  - 100 hit API, 81 filtered by row-crop sites, 19 written
  - = ~92% API-call reduction over the prior version (1139 of 1239
    rows skipped)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 23:55:38 -04:00
justin 420e00b44b bayer: dedup by EPA reg no across catalog product-type queries
Bayer's seed-treatment catalog query re-serves products from
herbicide/fungicide/insecticide queries that have seed-treatment use
sites listed. safe_slug() correctly strips the class suffix when the
catalog product type matches, but doesn't strip when querying as
seed-treatment, so the same product gets written twice — once as
"<base>" (canonical class) and once as "<base>-<class>"
(class=seed-treatment).

First full scrape produced 159 files for 87 unique EPA reg nos —
~45% redundant. Fix:

- process_product accepts an optional seen_regs set and returns
  "dup-skip" when the product's EPA reg no is already in it.
- run() seeds seen_regs from existing sidecars on disk via
  _load_seen_regs() so dedup survives re-runs (force overrides).
- run() updates seen_regs after each successful write, so within-run
  dedup works for the seed-treatment query (which iterates last).

Important nuance preserved: when two genuinely-different brand-name
products share the same EPA reg (e.g., Absolute Maxx + Adament Flow
both = 264-849), they are NOT treated as dups — they're different
catalog entries with different slugs and same canonical class. Only
the seed-treatment-clone pattern (slug = <canonical>-<class> AND
class=seed-treatment AND sibling at same reg with matching class) is
the bug we're fixing.
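
A rough sketch of the dedup flow, under stated assumptions rather than
the actual implementation. The commit only says process_product returns
"dup-skip" for a reg no already in seen_regs; the extra class check
below is an assumption added to reconcile that with the shared-reg
nuance. The dict keys, the "written" status, and all product values are
hypothetical placeholders.

```python
from typing import Optional

SEED_TREATMENT = "seed-treatment"


def process_product(product: dict, seen_regs: Optional[set] = None) -> str:
    """Return "dup-skip" for a seed-treatment clone, else "written"."""
    is_clone = (
        seen_regs is not None
        and product["product_class"] == SEED_TREATMENT  # assumed condition
        and product["epa_reg_no"] in seen_regs
    )
    if is_clone:
        return "dup-skip"
    # The real scraper would write the label and its sidecar here.
    return "written"


def run(products: list, seen_regs: Optional[set] = None) -> list:
    """Process products in order, updating seen_regs after each write."""
    seen = set(seen_regs or ())  # would be seeded from sidecars on disk
    results = []
    for product in products:
        status = process_product(product, seen)
        if status == "written":
            seen.add(product["epa_reg_no"])
        results.append(status)
    return results


# Placeholder catalog: two brands sharing one reg, one clone, one true
# seed treatment. The seed-treatment entries come last, as in the commit.
products = [
    {"slug": "brand-a", "epa_reg_no": "0000-1", "product_class": "fungicide"},
    {"slug": "brand-b", "epa_reg_no": "0000-1", "product_class": "fungicide"},
    {"slug": "brand-a-fungicide", "epa_reg_no": "0000-1",
     "product_class": SEED_TREATMENT},
    {"slug": "brand-c", "epa_reg_no": "0000-2", "product_class": SEED_TREATMENT},
]
print(run(products))
```

The two same-reg brands are both written, the clone is skipped, and the
genuine seed treatment is kept because its reg no has not been seen.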

One-off cleanup of the existing USB corpus removed 68 dup pairs;
159 → 91 files (73 canonical-class + 18 true seed-treatments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 21:27:45 -04:00
justin 717426f873 scrape: route corpus via PPLS_CORPUS_ROOT env var
Both scrapers now honor PPLS_CORPUS_ROOT so the corpus can land on
external storage (USB drive, NAS mount, secondary partition) without
editing the repo. Default behavior unchanged: corpus/ at repo root
when the env var is unset.

Per-source subdirectory layout preserved: ${PPLS_CORPUS_ROOT}/bayer/,
${PPLS_CORPUS_ROOT}/epa_ppls/, etc.
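
A minimal sketch of the resolution rule, assuming a helper along these
lines; corpus_dir and its parameters are hypothetical names, not the
scrapers' actual API.

```python
import os
from pathlib import Path


def corpus_dir(source_id: str, repo_root: Path, env=None) -> Path:
    """Resolve the per-source corpus directory.

    Uses PPLS_CORPUS_ROOT when it is set and non-empty, otherwise falls
    back to corpus/ at the repo root. The per-source subdirectory is
    appended in both cases.
    """
    env = os.environ if env is None else env
    override = env.get("PPLS_CORPUS_ROOT")
    root = Path(override).expanduser() if override else repo_root / "corpus"
    return root / source_id


# Unset: default location inside the repo.
print(corpus_dir("bayer", Path("/repo"), env={}))
# Set: corpus lands on external storage instead.
print(corpus_dir("epa_ppls", Path("/repo"),
                 env={"PPLS_CORPUS_ROOT": "/run/media/justin/USB/ppls-corpus"}))
```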

Live-verified against /run/media/justin/USB (vfat, 59GB free):
  PPLS_CORPUS_ROOT=/run/media/justin/USB/ppls-corpus \
    python -m scrape.runner --source epa_ppls --reg-no 524-475
  -> wrote to USB, root disk untouched

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 20:41:56 -04:00
justin ea3aea5871 epa_ppls: narrow row-crop filter to corn/soy/wheat only
App focus is corn, soybeans, and wheat. Dropping the broader
US-row-crops allowlist (cotton/rice/sorghum/milo/barley/oats/rye/
sunflower/peanut/sugar-beet/dry-bean/canola/alfalfa).

Empirical impact (random N=100 sample): broad list matched 17/100
products, narrow list matches 16/100 — only 6% reduction, because
corn/soy/wheat dominate ag-chem registrations so thoroughly that
products registered for cotton/sorghum/etc. are almost always
co-registered for one of corn/soy/wheat. One sampled product was
dropped: a peanut-only herbicide (2749-614).

Verified live: 524-475 Roundup + 524-591 Warrant kept (CORN/SOYBEAN
sites); 2749-614 AG36448 (PEANUTS only) correctly filtered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:39:55 -04:00
justin 60657aa6df epa_ppls: filter PPLS enumeration to row-crop products
The farmer-advisor consumer only cares about US row crops, so the EPA
scraper now drops products without at least one row-crop site in the
PPLS API response. Filter is on by default; --no-row-crop-filter
overrides for one-off broader pulls.

Filter shape:
  - Word-boundary regex match against each entry in the API's `sites`
    array (e.g., "SOYBEANS (FOLIAR TREATMENT)" → keep, "SHIPS, BOATS,
    SHIPHOLDS" → drop even though it contains "OATS" as substring).
  - Allowlist covers the major US row + small-grain + oilseed + sugar/
    fiber crops, plus alfalfa as a common rotation crop. See
    ROW_CROP_KEYWORDS in scrape/sources/epa_ppls.py for the full list.
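
A small sketch of the word-boundary match, for illustration only. The
keyword subset, the optional plural on SOYBEAN, and the helper name are
assumptions; the real ROW_CROP_KEYWORDS list is longer and may be
compiled differently.

```python
import re

# Illustrative subset only; see scrape/sources/epa_ppls.py for the real list.
ROW_CROP_KEYWORDS = ["CORN", "SOYBEANS?", "WHEAT", "OATS", "COTTON", "ALFALFA"]

# \b anchors each keyword at word boundaries, so "OATS" cannot match
# inside "BOATS".
ROW_CROP_RE = re.compile(
    r"\b(?:" + "|".join(ROW_CROP_KEYWORDS) + r")\b", re.IGNORECASE
)


def has_row_crop_site(sites) -> bool:
    """True when at least one entry in the API's sites array names a row crop."""
    return any(ROW_CROP_RE.search(site) for site in sites)


print(has_row_crop_site(["SOYBEANS (FOLIAR TREATMENT)"]))  # keep
print(has_row_crop_site(["SHIPS, BOATS, SHIPHOLDS"]))      # drop
```

A plain substring test would keep the second product; the boundary
anchors are what make the filter safe against that kind of collision.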

Cost model:
  - 102K PPIS rows still need one API call each (no bulk filter
    available upstream), so enumeration still takes ~28h at 1 req/sec.
  - But PDF downloads drop from ~67K → ~5-10K (estimated row-crop
    hit rate), saving ~17h wall time and ~60GB disk on a full backfill.
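
The arithmetic behind those figures, as a quick check. It assumes PDF
downloads are throttled to the same 1 req/sec as the API calls, which
the commit does not state explicitly.

```python
SECONDS_PER_HOUR = 3600
REQ_PER_SEC = 1.0  # assumed to apply to PDF downloads as well


def hours(requests: int, rate: float = REQ_PER_SEC) -> float:
    """Wall-clock hours to issue `requests` sequentially at `rate` per second."""
    return requests / rate / SECONDS_PER_HOUR


enumeration_h = hours(102_000)          # one API call per PPIS row
saved_low_h = hours(67_000 - 10_000)    # if ~10K PDFs are still fetched
saved_high_h = hours(67_000 - 5_000)    # if ~5K PDFs are still fetched

print(f"enumeration: {enumeration_h:.1f} h")
print(f"PDF time saved: {saved_low_h:.1f} to {saved_high_h:.1f} h")
```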

Smoke test (4 mixed reg nos):
  524-475 Roundup Ultra        → kept (CORN/SOYBEANS/COTTON sites)
  524-591 Warrant              → kept (CORN/SOYBEANS/SORGHUM sites)
  100-1486 Advion Cockroach    → filtered (building/transport sites only)
  432-1276 (Bayer pet flea)    → filtered (no row crops)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:05:26 -04:00
justin e9250de8e7 scrape: Phase 1 — Bayer + EPA PPLS scrapers with unified label schema
Adapts the docs-mcp-template scraping layer for the pesticide-labels
domain. The template's bundle/version/platform concepts don't map to
labels (there's no "Bayer 8.1.0" — there's just the current accepted
label per EPA Reg No), so the scraper layer is reshaped around a
"source" abstraction: one source per manufacturer or regulator, and one
label per product within each source.

Sources shipped:
  - bayer       — Bayer Crop Science US (Next.js JSON catalog + Scene7 PDFs)
  - epa_ppls    — EPA PPLS via PPIS bulk index + undocumented /cswu/ ORDS REST endpoint

Canonical sidecar schema (see scrape/README.md) unifies fields across
sources:
  - active_ingredients always [{name, cas, percent}]
  - label/* nested (url, filename, accepted_date, last_modified,
    page_count, text_layer)
  - all timestamps normalized to ISO 8601 UTC
  - signal_word surfaced (operationally critical for the farmer advisor)
  - source_key (per-source PK) kept separate from epa_reg_no (the
    cross-source join key)
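
A sketch of what a sidecar builder could look like under this schema;
scrape/README.md remains the authority. The field names come from the
list above, while build_sidecar, to_iso_utc, the "Z" suffix, and every
value passed in are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(ts: datetime) -> str:
    """Normalize a timezone-aware datetime to an ISO 8601 UTC string."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: the source timezone must be known")
    utc = ts.astimezone(timezone.utc)
    # "Z" instead of "+00:00" is a choice made for this sketch.
    return utc.isoformat(timespec="seconds").replace("+00:00", "Z")


def build_sidecar(source_key, epa_reg_no, signal_word, ingredients, label):
    """Assemble a sidecar dict with the unified field layout."""
    return {
        "source_key": source_key,   # per-source primary key
        "epa_reg_no": epa_reg_no,   # cross-source join key
        "signal_word": signal_word,
        "active_ingredients": [
            {"name": i["name"], "cas": i.get("cas"), "percent": i.get("percent")}
            for i in ingredients
        ],
        "label": {
            "url": label["url"],
            "filename": label["filename"],
            "accepted_date": label.get("accepted_date"),
            "last_modified": to_iso_utc(label["last_modified"]),
            "page_count": label.get("page_count"),
            "text_layer": label.get("text_layer"),
        },
    }


eastern = timezone(timedelta(hours=-4))
sidecar = build_sidecar(
    source_key="placeholder-key",
    epa_reg_no="0000-1",
    signal_word="CAUTION",
    ingredients=[{"name": "placeholder active"}],
    label={
        "url": "https://example.invalid/label.pdf",
        "filename": "label.pdf",
        "last_modified": datetime(2026, 5, 23, 18, 27, 7, tzinfo=eastern),
    },
)
print(sidecar["label"]["last_modified"])
```

Missing optional fields come through as None, so every sidecar has the
same keys regardless of which source produced it.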

bundles.json → sources.json. --bundle → --source. The runner walks
sources.json and dispatches by id; per-source modules remain
independently runnable for development.
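
A minimal sketch of dispatch-by-id, assuming sources.json is a list of
objects with an "id" field. The JSON shape, the stub scrapers, and
run_sources are hypothetical stand-ins for the real runner.

```python
import json

# Assumed sources.json shape; the real file may carry more fields.
SOURCES_JSON = '[{"id": "bayer"}, {"id": "epa_ppls"}]'


def run_bayer(limit):
    """Stub standing in for the Bayer source module."""
    return f"bayer(limit={limit})"


def run_epa_ppls(limit):
    """Stub standing in for the EPA PPLS source module."""
    return f"epa_ppls(limit={limit})"


# One entry per source id; adding a source means adding one mapping.
DISPATCH = {"bayer": run_bayer, "epa_ppls": run_epa_ppls}


def run_sources(raw, source=None, limit=None):
    """Walk the source list and dispatch by id.

    source=None behaves like --all; a source id behaves like --source.
    """
    results = []
    for entry in json.loads(raw):
        source_id = entry["id"]
        if source is not None and source_id != source:
            continue
        results.append(DISPATCH[source_id](limit))
    return results


print(run_sources(SOURCES_JSON, limit=2))
print(run_sources(SOURCES_JSON, source="bayer", limit=3))
```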

PLAN.md gets a one-block domain note up front; later phases (chunking,
embeddings, retrieval, eval) still apply as written.

Smoke test:
  python -m scrape.runner --all --limit 2     # works
  python -m scrape.runner --source bayer --limit 3    # 3 written, idempotent re-run skips
  python -m scrape.runner --source epa_ppls --reg-no 524-475   # Roundup Ultra, 167 pages, ISO last_modified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:27:07 -04:00