Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports
ProHarvest Seeds (independent Corn Belt brand, proharvestseeds.com) exposes a public, no-auth WordPress REST API, which gives cleaner ingestion than the HTML-only independents.

Two new sources:

- `proharvest` (variety identity, 119 row-crop varieties: 70 corn / 47 soy / 2 wheat). Enumerated via /wp/v2/seed (seed-type taxonomy); agronomics are parsed from each /seed/<slug>/ detail page into structured characteristics_groups so the ratings actually embed. Mixed scale: disease 1-9 numeric (9=best, no flip), agronomic/general qualitative, soil HR/R.
- `proharvest_plots` (trials, data_type=trial, 161 plots, 2024+2025). Per-cooperator harvest reports via the custom /wp-json/proharvest/v1/plots?y= endpoint + PDF table extraction. Many are cross-vendor head-to-head (ProHarvest/Apex vs Pioneer/DEKALB/Becks/Channel/Wyffels). Handles ruled tables, unruled tables (text fallback; soy drops the Test-Wt column → 4 vs 5 numerics), and off-template third-party reports (sanity-gated to verbatim so junk rows never ship). Image-only PDFs skipped + counted.

- rag/chunk.py: route proharvest_plots through the shared cross-vendor plot renderer (structured) / verbatim body (raw_text fallback).
- sources.json + lessons.md (rating-scales, trial-data).
- README/CLAUDE.md corpus inventory brought current (it had drifted: bayer 931 not 475; ebberts/lg/agrigold were unlisted). New totals: 1,645 variety + 6,787 trial records.

robots.txt permissive (only search + /dealer-* disallowed); no ToS automation clause. CI rebuilds the index from the committed corpus.
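The unruled-table text fallback mentioned above (rows anchored on their trailing numerics, 5 with a Test-Wt column and 4 without) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this commit: the function name `split_trailing_numerics`, the returned keys, and the sample row layout are hypothetical; only the 4-vs-5 numeric-count rule comes from the commit message.

```python
import re
from typing import Optional

# A token counts as numeric if it is a plain integer or decimal (e.g. "231.4").
_NUMERIC = re.compile(r"^-?\d+(?:\.\d+)?$")


def split_trailing_numerics(line: str) -> Optional[dict]:
    """Split one text line of an unruled results table.

    Hypothetical helper: peels numeric tokens off the right-hand end of the
    line and treats whatever remains on the left as the brand/product label.
    Returns None when the row does not carry exactly 4 or 5 trailing
    numerics, so the caller can fall back to verbatim text for that report.
    """
    tokens = line.split()
    numerics = []
    while tokens and _NUMERIC.match(tokens[-1].replace(",", "")):
        numerics.append(float(tokens.pop().replace(",", "")))
    numerics.reverse()  # restore left-to-right column order

    # Sanity gate: wrong column count or no label means "not a data row".
    # Note: a purely numeric product name would be absorbed into the
    # numerics and fail this gate, which errs on the side of verbatim.
    if len(numerics) not in (4, 5) or not tokens:
        return None

    return {
        "label": " ".join(tokens),
        "numerics": numerics,
        # Assumption: 5 numerics means the Test-Wt column is present.
        "has_test_wt": len(numerics) == 5,
    }
```

Rejecting a row outright rather than guessing at column alignment matches the stated intent that junk rows never ship; which numeric maps to which metric is left to the caller because the real column order is not given here.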
@@ -177,6 +177,42 @@
      "tos_check_date": "2026-05-26",
      "schema_notes": "Server-rendered HTML; detail URL is /{crop_url}/performance/{slug}/{id}. Soybeans URL slug is singular: /soybeans/performance/soybean-yield-results/{id}. Columns: Rank, Brand, Product, Trait, Ck, H20 (moisture %), Test Wt., Yield, Adj Yield. Most metadata-rich of the three trial sources.",
      "data_type": "trial"
    },
    {
      "name": "proharvest",
      "vendor": "ProHarvest Seeds",
      "brands": [
        "ProHarvest Seeds",
        "Apex"
      ],
      "crops": [
        "corn",
        "soybeans",
        "wheat"
      ],
      "verdict": "green",
      "expected_count": 119,
      "base_url": "https://proharvestseeds.com",
      "scope_filter": "Row-crop varieties via the WordPress REST API (/wp/v2/seed) filtered to seed-type corn-hybrid (70) / soybean (47) / wheat (2). Excludes alfalfa/forage/grass/cover-crop/sweet-corn/blend terms (out of scope for the row-crop advisor).",
      "tos_check_date": "2026-06-04",
      "tos_note": "robots.txt permissive — disallows only /?s=, /search/, /dealer-files/*, /dealer-section/* and AdsBot. The /wp-json/ REST API + /seed/ detail pages are open. No ToS automation clause found.",
      "schema_notes": "REST list enumerates id/slug/title/link + seed-trait taxonomy, but acf+content are NOT registered to REST (empty), so agronomic data is parsed from each /seed/<slug>/ detail page (<h2> spec sections of <strong>label</strong><div>value</div> pairs). Parsed into characteristics_groups so ratings embed (unlike ebberts_seeds, which left them body-only). Ratings MIXED: Disease Tolerance 1-9 numeric (9=best, same direction as Bayer/NK — NO flip; NA=not rated); General/Agronomic qualitative (Excellent/Very Good/Good/Average); Soil Adaptability HR/R. RM in '<h1>Maturity: N Days</h1>'."
    },
    {
      "name": "proharvest_plots",
      "vendor": "ProHarvest Seeds",
      "brand_aggregator": "ProHarvest Seeds publishes",
      "crops": [
        "corn",
        "soybeans"
      ],
      "verdict": "green",
      "expected_count": 161,
      "base_url": "https://proharvestseeds.com",
      "scope_filter": "Per-cooperator harvest-report plots via the custom GET /wp-json/proharvest/v1/plots?y=<year> endpoint, 2024+2025 baseline (~162 plots). Older 2015-2023 deferred to --include-old. Each plot links a harvest-report PDF with a head-to-head results table.",
      "tos_check_date": "2026-06-04",
      "schema_notes": "Same sidecar shape as agrigold/lg/gh plot reports (results:[{rank,brand,product,traits,metrics}]) — routed through _render_gh_plot_chunk (proharvest_plots added to that source list in rag/chunk.py). API gives clean location metadata (city/state/county/year/product/lat-long/PDF); PDF gives the management block (planted/harvested/prev-crop/population/tillage/irrigation) + results. THREE PDF realities: ruled tables (extract_tables splits columns), unruled tables (text-line fallback anchored on trailing numerics; soy reports drop the Test Wt. column so rows carry 4 vs 5 numerics), and off-template third-party reports (e.g. a university land-lab with extra RM/harvest-weight columns) which a per-row + per-plot sanity gate redirects to verbatim raw_text so cross-vendor yields aren't corrupted or lost. Many plots are cross-vendor (Pioneer/DEKALB/Becks/Channel/Wyffels vs ProHarvest/Apex). Image-only PDFs (no text layer) are skipped + counted (no silent cap). metrics key 'Yield' is canonical so the chunker top-N picker finds it.",
      "data_type": "trial"
    }
  ],
  "_excluded_sources": [
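The `proharvest` schema_notes describe detail pages as `<h2>` spec sections holding `<strong>label</strong><div>value</div>` pairs. A minimal sketch of pulling those pairs into a grouped mapping, using only the standard library, might look like this. It is an assumption-laden illustration rather than the scraper in this commit: the class and function names, the returned dict shape (the real characteristics_groups layout is not shown here), and the sample labels are all hypothetical.

```python
from html.parser import HTMLParser


class SpecSectionParser(HTMLParser):
    """Collect <strong>label</strong><div>value</div> pairs under each <h2>."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # section title -> {label: value}
        self._section = None  # text of the most recent <h2>
        self._label = None    # pending label waiting for its value <div>
        self._capture = None  # tag whose text is currently being buffered
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Only a <div> that follows a pending label is treated as a value,
        # so wrapper <div>s around each pair are ignored.
        if tag in ("h2", "strong") or (tag == "div" and self._label is not None):
            self._capture = tag
            self._buf = []

    def handle_data(self, data):
        if self._capture:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buf).strip()
        self._capture = None
        if tag == "h2":
            self._section = text
            self._label = None
            self.groups.setdefault(text, {})
        elif tag == "strong":
            # Labels seen before any <h2> have no section and are dropped.
            self._label = text if self._section is not None else None
        else:  # value <div>
            self.groups[self._section][self._label] = text
            self._label = None


def parse_spec_sections(html: str) -> dict:
    """Hypothetical helper: return {section: {label: raw_value}} for a page."""
    parser = SpecSectionParser()
    parser.feed(html)
    parser.close()
    return parser.groups
```

Values are kept as raw strings (for example `"7"`, `"NA"`, `"HR"`) because the notes describe a mixed scale; interpreting 1-9 numerics versus qualitative or HR/R ratings would be a separate step.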