Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports #16

Merged
claude merged 1 commit from add-proharvest into main 2026-06-04 21:05:31 -04:00
Contributor

Adds ProHarvest Seeds (independent Corn Belt brand, proharvestseeds.com) to the corpus. ProHarvest exposes a public, no-auth WordPress REST API — cleaner than the HTML-only independents (Ebbert's). Two new sources.

proharvest — variety identity (119 records)

70 corn-hybrid + 47 soybean + 2 wheat (forage/grass/cover-crop/sweet-corn excluded — out of scope). Enumerated via /wp/v2/seed (seed-type taxonomy); acf/content aren't registered to REST, so agronomics are parsed from each /seed/<slug>/ detail page (<h2> spec sections of <strong>label</strong><div>value</div> pairs) into structured characteristics_groups so the ratings embed (unlike ebberts_seeds, which left them body-only). Mixed scale, documented in _scale_direction + lessons: Disease Tolerance 1-9 numeric (9=best, same direction as Bayer/NK — no flip), General/Agronomic qualitative (Good/Very Good), Soil Adaptability HR/R.
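The detail-page parse described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the scraper's actual code: the class and function names are hypothetical, the sample HTML and its values are made up, and the real pages may wrap the pairs differently.

```python
from html.parser import HTMLParser


class SpecParser(HTMLParser):
    """Collect <h2> titles and the <strong>label</strong><div>value</div> pairs under them."""

    def __init__(self):
        super().__init__()
        self.groups = {}       # section title -> {label: value}
        self._section = None   # text of the current <h2>
        self._label = None     # last <strong> text still waiting for its value
        self._capture = None   # tag whose text is currently being buffered
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Only a <div> that follows a pending label is treated as a value.
        if tag in ("h2", "strong") or (tag == "div" and self._label is not None):
            self._capture = tag
            self._buf = []

    def handle_data(self, data):
        if self._capture:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        self._capture = None
        if tag == "h2":
            self._section, self._label = text, None
        elif tag == "strong":
            self._label = text
        elif self._section and self._label:  # closing value <div>
            self.groups.setdefault(self._section, {})[self._label] = text
            self._label = None


def parse_characteristics(html):
    """Return a characteristics_groups-style dict (hypothetical shape)."""
    parser = SpecParser()
    parser.feed(html)
    return parser.groups


# Made-up sample page fragment, for illustration only.
sample = (
    "<h2>Disease Tolerance</h2>"
    "<div><strong>Gray Leaf Spot</strong><div>7</div></div>"
    "<h2>Soil Adaptability</h2>"
    "<div><strong>Heavy Clay</strong><div>HR</div></div>"
)
print(parse_characteristics(sample))
```

Values are kept as strings here so that the numeric 1-9 ratings, the qualitative ratings, and the HR/R codes can share one structure; any scale interpretation would happen downstream.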

proharvest_plots — cross-vendor yield trials (161 docs, data_type=trial)

Per-cooperator harvest reports via the custom GET /wp-json/proharvest/v1/plots?y=<year> endpoint (2024+2025 baseline; older years behind --include-old) + PDF table extraction. Emits the same sidecar shape as the gh/lg/agrigold plot reports → routed through the shared _render_gh_plot_chunk. Many are genuinely cross-vendor (ProHarvest/Apex vs Pioneer / DEKALB / Becks / Channel / Wyffels / NK / AgriGold / LG).
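As a rough sketch of how the per-year requests might be assembled: the endpoint path and the `y` parameter come from the description above, while the `https` scheme, the function name, and the way older years are supplied are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE = "https://proharvestseeds.com"  # scheme is an assumption
BASELINE_YEARS = (2024, 2025)         # baseline years named in this PR


def plot_urls(include_old=False, old_years=()):
    """Build one plots-endpoint URL per year (hypothetical helper).

    old_years is supplied by the caller; the set of older years that
    actually exist on the site is not stated in this PR.
    """
    years = set(BASELINE_YEARS)
    if include_old:
        years |= set(old_years)
    return [
        f"{BASE}/wp-json/proharvest/v1/plots?{urlencode({'y': year})}"
        for year in sorted(years)
    ]


for url in plot_urls():
    print(url)
```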

Robust against three PDF realities:

  • ruled tables → extract_tables() column split;
  • unruled tables → text-line fallback anchored on trailing numerics (soy reports drop the Test-Wt column → 4 vs 5 numerics, handled by right-anchored assignment);
  • off-template third-party reports (e.g. a university land-lab with extra RM/harvest-weight columns + multi-line header) → a per-row + per-plot sanity gate redirects them to verbatim raw_text so junk rows never ship and the cross-vendor yields stay searchable.
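The text-line fallback and the sanity gate above can be sketched like this. It is a simplified illustration, not the shipped parser: the column names, their order (Test-Wt assumed to be the left-most numeric), the plausibility bounds, and the 80% threshold are all assumptions, and the sample rows are made up.

```python
import re

NUM = re.compile(r"^-?\d+(?:\.\d+)?$")
# Hypothetical column order; soy reports are assumed to omit the left-most one.
COLUMNS = ("test_wt", "population", "moisture", "yield_bu_ac", "rank")


def parse_row(line, columns=COLUMNS):
    """Peel trailing numerics off a text line and assign them from the right."""
    tokens = line.split()
    nums = []
    while tokens and len(nums) < len(columns) and NUM.match(tokens[-1].replace(",", "")):
        nums.append(float(tokens.pop().replace(",", "")))
    nums.reverse()
    if not tokens or not nums:
        return None  # header line, footnote, or a row with no entry name
    row = {"entry": " ".join(tokens)}
    # Right-anchored: 4 numerics fill the last 4 columns, 5 fill all of them.
    row.update(zip(columns[-len(nums):], nums))
    return row


def row_ok(row):
    """Per-row gate: yield and moisture must look physically plausible."""
    return (
        row is not None
        and 0 < row.get("yield_bu_ac", 0) < 400
        and 0 < row.get("moisture", 0) < 50
    )


def build_sidecar(lines, min_ok=0.8):
    """Per-plot gate: ship structured rows only if most lines parse cleanly."""
    rows = [parse_row(line) for line in lines if line.strip()]
    good = [r for r in rows if row_ok(r)]
    if rows and len(good) / len(rows) >= min_ok:
        return {"rows": good}
    return {"raw_text": "\n".join(lines)}  # verbatim fallback stays searchable


corn = parse_row("BrandA X100 56.0 32000 18.5 231.4 1")
soy = parse_row("BrandB S200E 140000 12.1 68.3 2")
print(corn)
print(soy)
```

One known limitation of this kind of peeling is that an entry name ending in a bare number can be mistaken for a data column, which is one reason a gate after parsing is useful.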

Image-only PDFs (no text layer) are skipped and counted (no silent cap). Of 162 PDFs: 139 structured + 22 verbatim (the 161 docs) + 1 image-only skip.

Plumbing + docs

  • rag/chunk.py: proharvest_plots branch (structured → cross-vendor renderer; raw_text → verbatim body).
  • sources.json: 2 entries (tos_check_date 2026-06-04 — robots permissive, no ToS automation clause).
  • docs_mcp/lessons.md: rating-scales + trial-data entries.
  • README/CLAUDE.md corpus inventory brought current — it had drifted badly (claimed 760 variety / 4,313 trial; bayer listed 475 but is really 931; ebberts/lg/agrigold were entirely unlisted). New verified totals: 1,645 variety + 6,787 trial records.
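The `proharvest_plots` branch in `rag/chunk.py` amounts to a two-way route. A minimal sketch follows, assuming the structured sidecar carries a `rows` key (a hypothetical field name) and taking the shared renderer as a parameter, since the real signature of `_render_gh_plot_chunk` is not shown in this PR.

```python
def render_proharvest_plot(doc, render_structured):
    """Route a plot sidecar: structured rows go to the shared renderer,
    anything else falls back to the verbatim raw_text body.

    `render_structured` stands in for the shared cross-vendor renderer;
    the `rows` key is an assumed field name.
    """
    if doc.get("rows"):
        return render_structured(doc)
    raw = (doc.get("raw_text") or "").strip()
    if not raw:
        raise ValueError("sidecar has neither structured rows nor raw_text")
    return raw


# Stand-in renderer, for illustration only.
def fake_renderer(doc):
    return f"{len(doc['rows'])} structured rows"


print(render_proharvest_plot({"rows": [{"entry": "BrandA X100"}]}, fake_renderer))
print(render_proharvest_plot({"raw_text": "verbatim report text"}, fake_renderer))
```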

Validation

  • robots.txt permissive; data-quality pass: 0 numeric/junk brands, multi-word brands intact ("Golden Harvest" 35, "Seed Consultants" 28 — no splits).
  • All 280 new corpus files chunk cleanly (0 errors, 0 short chunks); py_compile clean; sources.json valid.
  • CI (image-only.yml) rebuilds the Chroma+BM25 indexes from the committed corpus, then builds+pushes the image → Watchtower deploys.
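A brand data-quality pass of the kind listed above might look like the following sketch. The function names and the junk pattern are assumptions, not the project's actual check; the two multi-word brand names are the ones cited in the validation notes, and the sample list is made up.

```python
import re
from collections import Counter

# A "junk" brand here means digits and punctuation only (assumed definition).
JUNK = re.compile(r"^[\d\W_]+$")
MULTI_WORD = ("Golden Harvest", "Seed Consultants")


def brand_report(brands):
    """Count brands and list any that look numeric or otherwise junk."""
    counts = Counter(b.strip() for b in brands if b and b.strip())
    junk = sorted(b for b in counts if JUNK.match(b))
    return counts, junk


def split_fragments(counts, multi_word=MULTI_WORD):
    """Flag brands that are a lone word of a known multi-word brand."""
    fragments = {word for name in multi_word for word in name.split()}
    return sorted(b for b in counts if b in fragments)


# Made-up input containing one numeric brand and one split brand.
counts, junk = brand_report(
    ["Golden Harvest", "Golden Harvest", "Seed Consultants", "231.4", "Golden"]
)
print(junk, split_fragments(counts))
```

On a clean corpus both lists come back empty, which corresponds to the "0 numeric/junk brands" and "no splits" results reported above.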
claude added 1 commit 2026-06-04 21:05:22 -04:00
ProHarvest Seeds (independent Corn Belt brand, proharvestseeds.com) exposes
a public, no-auth WordPress REST API — cleaner ingestion than the HTML-only
independents. Two new sources:

- `proharvest` (variety identity, 119 row-crop varieties: 70 corn / 47 soy /
  2 wheat). Enumerated via /wp/v2/seed (seed-type taxonomy), agronomics
  parsed from each /seed/<slug>/ detail page into structured
  characteristics_groups so the ratings actually embed. Mixed scale: disease
  1-9 numeric (9=best, no flip), agronomic/general qualitative, soil HR/R.

- `proharvest_plots` (trials, data_type=trial, 161 plots, 2024+2025). Per-
  cooperator harvest reports via the custom /wp-json/proharvest/v1/plots?y=
  endpoint + PDF table extraction. Many are cross-vendor head-to-head
  (ProHarvest/Apex vs Pioneer/DEKALB/Becks/Channel/Wyffels). Handles ruled
  tables, unruled tables (text fallback; soy drops the Test-Wt column → 4 vs
  5 numerics), and off-template third-party reports (sanity-gated to verbatim
  so junk rows never ship). Image-only PDFs skipped + counted.

- rag/chunk.py: route proharvest_plots through the shared cross-vendor plot
  renderer (structured) / verbatim body (raw_text fallback).
- sources.json + lessons.md (rating-scales, trial-data).
- README/CLAUDE.md corpus inventory brought current (it had drifted: bayer
  931 not 475; ebberts/lg/agrigold were unlisted). New totals: 1,645 variety
  + 6,787 trial records.

robots.txt permissive (only search + /dealer-* disallowed); no ToS
automation clause. CI rebuilds the index from the committed corpus.
claude merged commit 22e8092faf into main 2026-06-04 21:05:31 -04:00
claude deleted branch add-proharvest 2026-06-04 21:05:31 -04:00
Reference: justin/seed-mcp#16