Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports
ProHarvest Seeds (independent Corn Belt brand, proharvestseeds.com) exposes a public, no-auth WordPress REST API, which makes ingestion cleaner than for the HTML-only independents. Two new sources:

- `proharvest` (variety identity, 119 row-crop varieties: 70 corn / 47 soy / 2 wheat). Enumerated via /wp/v2/seed (seed-type taxonomy), agronomics parsed from each /seed/<slug>/ detail page into structured characteristics_groups so the ratings actually embed. Mixed scale: disease 1-9 numeric (9=best, no flip), agronomic/general qualitative, soil HR/R.
- `proharvest_plots` (trials, data_type=trial, 161 plots, 2024+2025). Per-cooperator harvest reports via the custom /wp-json/proharvest/v1/plots?y= endpoint + PDF table extraction. Many are cross-vendor head-to-head (ProHarvest/Apex vs Pioneer/DEKALB/Becks/Channel/Wyffels). Handles ruled tables, unruled tables (text fallback; soy drops the Test-Wt column → 4 vs 5 numerics), and off-template third-party reports (sanity-gated to verbatim so junk rows never ship). Image-only PDFs skipped + counted.

Other changes:

- rag/chunk.py: route proharvest_plots through the shared cross-vendor plot renderer (structured) / verbatim body (raw_text fallback).
- sources.json + lessons.md (rating-scales, trial-data).
- README/CLAUDE.md corpus inventory brought current (it had drifted: bayer 931 not 475; ebberts/lg/agrigold were unlisted). New totals: 1,645 variety + 6,787 trial records.

robots.txt permissive (only search + /dealer-* disallowed); no ToS automation clause. CI rebuilds the index from the committed corpus.
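The unruled-table text fallback mentioned for `proharvest_plots` can be illustrated with a minimal sketch. It assumes each data row is an entry name followed by whitespace-separated numeric columns, with five numerics for corn and four for soy because Test-Wt is absent. The helper `parse_unruled_row` and every column name are hypothetical; the commit states only the 4 vs 5 numeric counts.

```python
import re
from typing import Optional

# Hypothetical column layouts: the commit only says soy drops the Test-Wt
# column (4 numerics instead of 5). All other column names are assumptions.
CORN_COLUMNS = ["moisture", "test_wt", "yield_bu", "population", "rank"]
SOY_COLUMNS = [c for c in CORN_COLUMNS if c != "test_wt"]

# Plain or comma-grouped numbers such as 57.1, 241 or 34,000.
_NUMERIC = re.compile(r"^-?\d+(?:,\d{3})*(?:\.\d+)?$")


def parse_unruled_row(line: str, crop: str) -> Optional[dict]:
    """Split one text row into an entry name plus trailing numeric columns.

    Returns None when the row does not end in exactly the expected number
    of numeric tokens, so header lines and junk rows are rejected.
    """
    columns = SOY_COLUMNS if crop == "soybeans" else CORN_COLUMNS
    tokens = line.split()
    n = len(columns)
    if len(tokens) <= n:
        return None  # no room left for an entry name
    name_tokens, value_tokens = tokens[:-n], tokens[-n:]
    if not all(_NUMERIC.match(tok) for tok in value_tokens):
        return None
    values = [float(tok.replace(",", "")) for tok in value_tokens]
    row = {"entry": " ".join(name_tokens)}
    row.update(zip(columns, values))
    return row
```

Taking a fixed number of tokens from the right, rather than consuming every trailing number, keeps entry names that end in digits intact. A soy row parsed with the corn layout fails the numeric check and is dropped instead of being misaligned.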
@@ -509,6 +509,28 @@ def _render_trial_chunk(sidecar: dict, md_text: str | None = None) -> str:
     # for each (Golden Harvest / LG Seeds / AgriGold).
     if source in ("gh_plot_reports", "lg_plot_reports", "agrigold_plot_reports"):
         return _render_gh_plot_chunk(sidecar)
+    if source == "proharvest_plots":
+        # Structured rows → shared cross-vendor renderer (publisher brand
+        # read from the sidecar). Foreign-format third-party PDFs that
+        # couldn't be parsed into rows carry raw_text=True and the verbatim
+        # table text in the .md body — embed that so they're still found.
+        if sidecar.get("results"):
+            return _render_gh_plot_chunk(sidecar)
+        crop = (sidecar.get("crop") or "").lower()
+        crop_label = {"corn": "Corn", "soybeans": "Soybean"}.get(crop, crop.title())
+        coop = sidecar.get("cooperator") or ""
+        state = sidecar.get("state") or ""
+        year = sidecar.get("year") or ""
+        head = [
+            f"# {crop_label} yield trial — {coop} ({state}, {year})", "",
+            "ProHarvest Seeds plot report (cross-vendor, verbatim from PDF).", "",
+        ]
+        body = md_text or ""
+        sep = "## Trial data (verbatim from PDF)"
+        if sep in body:
+            body = body.split(sep, 1)[1].strip()
+        body = re.sub(r"```", "", body).strip()
+        return "\n".join(head) + "\n" + body + "\n"
     if source == "agripro_trials":
         header = _render_agripro_trial_chunk(sidecar)
         if md_text:
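To see what the raw-text branch does to a markdown body, the following standalone sketch mirrors its last few lines. `extract_verbatim_body` is a hypothetical name used only for illustration, and `str.replace` stands in for the `re.sub` call in the diff, which is equivalent here because the pattern is a literal string.

```python
from typing import Optional

# Heading string taken from the diff: it marks where the verbatim table starts.
SEP = "## Trial data (verbatim from PDF)"
# Built programmatically so this example does not embed a literal code fence.
FENCE = "`" * 3


def extract_verbatim_body(md_text: Optional[str]) -> str:
    """Return the table text after SEP, with code-fence markers removed.

    If SEP is absent, the whole body is kept (minus fences); a missing
    body yields an empty string.
    """
    body = md_text or ""
    if SEP in body:
        # Keep only what follows the first occurrence of the heading.
        body = body.split(SEP, 1)[1].strip()
    return body.replace(FENCE, "").strip()
```

Stripping the fence markers means the embedded chunk holds only the table text. Keeping the whole body when the heading is missing means an off-template document is still indexed rather than silently emptied.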