Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports (#16)
Image rebuild (skip scrape) / build (push) Successful in 5m46s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io>
This commit was merged in pull request #16.
@@ -177,6 +177,42 @@
      "tos_check_date": "2026-05-26",
      "schema_notes": "Server-rendered HTML; detail URL is /{crop_url}/performance/{slug}/{id}. Soybeans URL slug is singular: /soybeans/performance/soybean-yield-results/{id}. Columns: Rank, Brand, Product, Trait, Ck, H20 (moisture %), Test Wt., Yield, Adj Yield. Most metadata-rich of the three trial sources.",
      "data_type": "trial"
    },
    {
      "name": "proharvest",
      "vendor": "ProHarvest Seeds",
      "brands": [
        "ProHarvest Seeds",
        "Apex"
      ],
      "crops": [
        "corn",
        "soybeans",
        "wheat"
      ],
      "verdict": "green",
      "expected_count": 119,
      "base_url": "https://proharvestseeds.com",
      "scope_filter": "Row-crop varieties via the WordPress REST API (/wp/v2/seed) filtered to seed-type corn-hybrid (70) / soybean (47) / wheat (2). Excludes alfalfa/forage/grass/cover-crop/sweet-corn/blend terms (out of scope for the row-crop advisor).",
      "tos_check_date": "2026-06-04",
      "tos_note": "robots.txt permissive — disallows only /?s=, /search/, /dealer-files/*, /dealer-section/* and AdsBot. The /wp-json/ REST API + /seed/ detail pages are open. No ToS automation clause found.",
      "schema_notes": "REST list enumerates id/slug/title/link + seed-trait taxonomy, but acf+content are NOT registered to REST (empty), so agronomic data is parsed from each /seed/<slug>/ detail page (<h2> spec sections of <strong>label</strong><div>value</div> pairs). Parsed into characteristics_groups so ratings embed (unlike ebberts_seeds, which left them body-only). Ratings MIXED: Disease Tolerance 1-9 numeric (9=best, same direction as Bayer/NK — NO flip; NA=not rated); General/Agronomic qualitative (Excellent/Very Good/Good/Average); Soil Adaptability HR/R. RM in '<h1>Maturity: N Days</h1>'."
    },
    {
      "name": "proharvest_plots",
      "vendor": "ProHarvest Seeds",
      "brand_aggregator": "ProHarvest Seeds publishes",
      "crops": [
        "corn",
        "soybeans"
      ],
      "verdict": "green",
      "expected_count": 161,
      "base_url": "https://proharvestseeds.com",
      "scope_filter": "Per-cooperator harvest-report plots via the custom GET /wp-json/proharvest/v1/plots?y=<year> endpoint, 2024+2025 baseline (~162 plots). Older 2015-2023 deferred to --include-old. Each plot links a harvest-report PDF with a head-to-head results table.",
      "tos_check_date": "2026-06-04",
      "schema_notes": "Same sidecar shape as agrigold/lg/gh plot reports (results:[{rank,brand,product,traits,metrics}]) — routed through _render_gh_plot_chunk (proharvest_plots added to that source list in rag/chunk.py). API gives clean location metadata (city/state/county/year/product/lat-long/PDF); PDF gives the management block (planted/harvested/prev-crop/population/tillage/irrigation) + results. THREE PDF realities: ruled tables (extract_tables splits columns), unruled tables (text-line fallback anchored on trailing numerics; soy reports drop the Test Wt. column so rows carry 4 vs 5 numerics), and off-template third-party reports (e.g. a university land-lab with extra RM/harvest-weight columns) which a per-row + per-plot sanity gate redirects to verbatim raw_text so cross-vendor yields aren't corrupted or lost. Many plots are cross-vendor (Pioneer/DEKALB/Becks/Channel/Wyffels vs ProHarvest/Apex). Image-only PDFs (no text layer) are skipped + counted (no silent cap). metrics key 'Yield' is canonical so the chunker top-N picker finds it.",
      "data_type": "trial"
    }
  ],
  "_excluded_sources": [
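The proharvest schema_notes describe detail pages built from <h2> spec sections of <strong>label</strong><div>value</div> pairs, with relative maturity in an <h1>Maturity: N Days</h1> heading. The sketch below shows one way such a page could be parsed with the standard library. It is not the scraper from this PR: the class name SeedSpecParser, the helpers parse_seed_page and parse_rating, and the output keys are all hypothetical, and the real markup may nest these tags differently.

```python
import re
from html.parser import HTMLParser


class SeedSpecParser(HTMLParser):
    """Hypothetical parser: collects <strong>/<div> pairs under each <h2>."""

    def __init__(self):
        super().__init__()
        self.groups = {}           # section title -> {label: value}
        self.maturity_days = None  # from "<h1>Maturity: N Days</h1>"
        self._tag = None           # tag whose text is being collected
        self._buf = []
        self._section = None
        self._label = None

    def handle_starttag(self, tag, attrs):
        # A <div> only counts as a value when a label is waiting for one.
        if tag in ("h1", "h2", "strong") or (tag == "div" and self._label is not None):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())
        self._tag = None
        if tag == "h1":
            match = re.search(r"Maturity:\s*(\d+)\s*Days", text)
            if match:
                self.maturity_days = int(match.group(1))
        elif tag == "h2":
            self._section = text
            self.groups.setdefault(text, {})
            self._label = None
        elif tag == "strong":
            self._label = text
        elif tag == "div":
            if self._section is not None:
                self.groups[self._section][self._label] = text
            self._label = None


def parse_rating(value):
    """Map 'NA' to None and 1-9 scores to int; keep qualitative text as-is."""
    if value.strip().upper() == "NA":
        return None
    return int(value) if value.strip().isdigit() else value.strip()


def parse_seed_page(html):
    parser = SeedSpecParser()
    parser.feed(html)
    groups = {
        section: {label: parse_rating(val) for label, val in pairs.items()}
        for section, pairs in parser.groups.items()
    }
    return {"maturity_days": parser.maturity_days, "characteristics_groups": groups}
```

Because the notes say 9 is best and no flip is needed, parse_rating leaves the numeric scores untouched. Qualitative values such as Excellent or HR pass through as strings.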
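The proharvest_plots schema_notes mention a text-line fallback for unruled PDF tables, anchored on trailing numerics, where soybean rows carry 4 numbers and corn rows 5 because soy reports drop the Test Wt. column. A minimal sketch of that idea follows. The function names and the column layouts in COLUMNS_BY_COUNT are assumptions for illustration; the source states only the 4-versus-5 count, the missing Test Wt. column, and that 'Yield' is the canonical metrics key.

```python
import re

_NUMERIC = re.compile(r"-?\d+(?:\.\d+)?")

# Hypothetical column layouts, keyed by how many trailing numbers a row has.
# Only the 4-vs-5 split and the 'Yield' / 'Test Wt.' names come from the notes.
COLUMNS_BY_COUNT = {
    5: ("Population", "Moisture", "Test Wt.", "Yield", "Adj Yield"),
    4: ("Population", "Moisture", "Yield", "Adj Yield"),
}


def split_trailing_numerics(line):
    """Peel numeric tokens off the right end of a text line."""
    tokens = line.split()
    values = []
    while tokens and _NUMERIC.fullmatch(tokens[-1].replace(",", "")):
        values.append(float(tokens.pop().replace(",", "")))
    values.reverse()  # restore left-to-right column order
    return " ".join(tokens), values


def parse_row(line):
    """Return a row dict, or None when the row does not fit a known layout."""
    label, values = split_trailing_numerics(line)
    columns = COLUMNS_BY_COUNT.get(len(values))
    if columns is None or not label:
        return None  # caller could fall back to keeping the raw text
    return {"label": label, "metrics": dict(zip(columns, values))}
```

A row with an unexpected number of trailing values, for instance an off-template report with extra columns or a product name ending in a bare number, returns None rather than a misaligned mapping. That loosely mirrors the per-row sanity gate the notes describe.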