Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports (#16)
Image rebuild (skip scrape) / build (push) Successful in 5m46s

Co-authored-by: claude <claude@jpaul.io>
Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #16.
2026-06-04 21:05:30 -04:00
committed by Claude (agent)
parent e356633d4f
commit 22e8092faf
567 changed files with 80023 additions and 8 deletions
+36
@@ -177,6 +177,42 @@
"tos_check_date": "2026-05-26",
"schema_notes": "Server-rendered HTML; detail URL is /{crop_url}/performance/{slug}/{id}. Soybeans URL slug is singular: /soybeans/performance/soybean-yield-results/{id}. Columns: Rank, Brand, Product, Trait, Ck, H20 (moisture %), Test Wt., Yield, Adj Yield. Most metadata-rich of the three trial sources.",
"data_type": "trial"
},
{
"name": "proharvest",
"vendor": "ProHarvest Seeds",
"brands": [
"ProHarvest Seeds",
"Apex"
],
"crops": [
"corn",
"soybeans",
"wheat"
],
"verdict": "green",
"expected_count": 119,
"base_url": "https://proharvestseeds.com",
"scope_filter": "Row-crop varieties via the WordPress REST API (/wp/v2/seed) filtered to seed-type corn-hybrid (70) / soybean (47) / wheat (2). Excludes alfalfa/forage/grass/cover-crop/sweet-corn/blend terms (out of scope for the row-crop advisor).",
"tos_check_date": "2026-06-04",
"tos_note": "robots.txt permissive — disallows only /?s=, /search/, /dealer-files/*, /dealer-section/* and AdsBot. The /wp-json/ REST API + /seed/ detail pages are open. No ToS automation clause found.",
"schema_notes": "REST list enumerates id/slug/title/link + seed-trait taxonomy, but acf+content are NOT registered to REST (empty), so agronomic data is parsed from each /seed/<slug>/ detail page (<h2> spec sections of <strong>label</strong><div>value</div> pairs). Parsed into characteristics_groups so ratings embed (unlike ebberts_seeds, which left them body-only). Ratings MIXED: Disease Tolerance 1-9 numeric (9=best, same direction as Bayer/NK — NO flip; NA=not rated); General/Agronomic qualitative (Excellent/Very Good/Good/Average); Soil Adaptability HR/R. RM in '<h1>Maturity: N Days</h1>'."
},
{
"name": "proharvest_plots",
"vendor": "ProHarvest Seeds",
"brand_aggregator": "ProHarvest Seeds publishes",
"crops": [
"corn",
"soybeans"
],
"verdict": "green",
"expected_count": 161,
"base_url": "https://proharvestseeds.com",
"scope_filter": "Per-cooperator harvest-report plots via the custom GET /wp-json/proharvest/v1/plots?y=<year> endpoint, 2024+2025 baseline (~162 plots). Older 2015-2023 deferred to --include-old. Each plot links a harvest-report PDF with a head-to-head results table.",
"tos_check_date": "2026-06-04",
"schema_notes": "Same sidecar shape as agrigold/lg/gh plot reports (results:[{rank,brand,product,traits,metrics}]) — routed through _render_gh_plot_chunk (proharvest_plots added to that source list in rag/chunk.py). API gives clean location metadata (city/state/county/year/product/lat-long/PDF); PDF gives the management block (planted/harvested/prev-crop/population/tillage/irrigation) + results. THREE PDF realities: ruled tables (extract_tables splits columns), unruled tables (text-line fallback anchored on trailing numerics; soy reports drop the Test Wt. column so rows carry 4 vs 5 numerics), and off-template third-party reports (e.g. a university land-lab with extra RM/harvest-weight columns) which a per-row + per-plot sanity gate redirects to verbatim raw_text so cross-vendor yields aren't corrupted or lost. Many plots are cross-vendor (Pioneer/DEKALB/Becks/Channel/Wyffels vs ProHarvest/Apex). Image-only PDFs (no text layer) are skipped + counted (no silent cap). metrics key 'Yield' is canonical so the chunker top-N picker finds it.",
"data_type": "trial"
}
],
"_excluded_sources": [