Trial-data scrapers: gh_plot_reports + agripro_trials + search_trials tool

This PR introduces TRIAL data (yield-performance results from real field trials) as a SEPARATE data type alongside variety identity. The two are complementary:

  search_docs   → "What's the disease resistance of DKC62-08RIB?" (variety identity: what it IS)
  search_trials → "Which corn hybrid won the IA 2024 trials?" (performance data: how it PERFORMED)

scrape/sources/gh_plot_reports.py: Golden Harvest plot reports
- 4,618 expected (2024+2025; 2023 deferred to a backfill pass).
- URL: /<crop>/plot-report/<state>/<year>/<plot_id>
- Cross-vendor: each plot lists products from multiple brands (NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side by side at one cooperator's field, the kind of independent comparison data Bayer doesn't publish itself.
- Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for corn+soy, Ton/Acre + Milk + Beef columns for silage).
- Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow.

scrape/sources/agripro_trials.py: AgriPro regional trial PDFs
- 14 unique PDFs (38 sitemap links deduped) at /trials-data
- pdfplumber text extraction, region/year detection from filename
- Verbatim PDF text preserved in chunk body so variety + yield number adjacency drives retrieval (AP Iliad's Aberdeen ID yield matches a query about "AP Iliad Idaho yield")

rag/chunk.py: chunks_from_trial() dispatching by source
- Plot reports: identity preamble + Top-5 by primary metric + full ranking table. Metric labels chosen from the data (corn/soy use "Yield", silage uses "Ton/Acre").
- AgriPro PDFs: identity preamble + verbatim trial body inline so per-location yields surface for region+variety queries.
- Variety chunks get data_type="variety" metadata; trial chunks get data_type="trial". Single Chroma collection; the tool router filters by data_type rather than maintaining two collections.

rag/index.py: dispatch by sidecar's data_type field

rag/bm25.py: new filter columns (data_type, year, state)

docs_mcp/server.py: sixth MCP tool, search_trials(crop?, state?, year?, product?, k=10)
- Filters trial chunks via where={"data_type": "trial", ...}
- Optional product substring post-filter for "DKC62-08RIB Iowa 2024" style searches
- search_docs now defaults to data_type="variety" so trial chunks don't bleed into variety identity queries
- Tool docstring routes the agent: "use lookup_variety to verify identity details on any trial winner you surface"

Deferred sources:
- NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented as deferred: the ASMX-SOAP shape returned empty XML on initial probe.
- Bayer per-variety yield data is not publicly indexed at all; this is documented in the trial-scope note (DEKALB/Asgrow trial data flows through Channel reps, not the web).
- AgRevival research books exist as 10 large annual PDFs but are deferred (low ROI per parse).

Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618 Golden Harvest plot reports are scraping in background and will be added in a follow-up corpus-snapshot PR (~70 min ETA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:19:03 -04:00
parent 7b3da908e0
commit c737871c4c
35 changed files with 3302 additions and 25 deletions
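The search_trials filtering described in the commit message could look roughly like the sketch below. The docs_mcp/server.py diff is not shown on this page, so the helper names (build_trial_where, product_post_filter) are hypothetical. Wrapping several conditions under one "$and" key is an assumption about how the Chroma where clause is shaped, and the normalization (lowercase crop, uppercase short state codes) mirrors _flat_trial_metadata.

```python
from __future__ import annotations


def build_trial_where(crop: str | None = None,
                      state: str | None = None,
                      year: int | None = None) -> dict:
    """Build a metadata filter that always pins data_type to "trial"."""
    clauses: list[dict] = [{"data_type": "trial"}]
    if crop:
        clauses.append({"crop": crop.lower()})
    if state:
        # Short codes are stored uppercased; longer names are stored as given.
        clauses.append({"state": state.upper() if len(state) <= 3 else state})
    if year is not None:
        clauses.append({"year": int(year)})
    # Assumption: multiple conditions are combined under a single "$and" key.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def product_post_filter(hits: list[dict], product: str | None) -> list[dict]:
    """Keep hits whose chunk text mentions the product (case-insensitive)."""
    if not product:
        return hits
    needle = product.lower()
    return [h for h in hits if needle in h["text"].lower()]


# Invented sample data, for illustration only.
where = build_trial_where(crop="Corn", state="ia", year=2024)
hits = [
    {"text": "#1 DEKALB DKC62-08RIB, sample row"},
    {"text": "#2 NK sample-product, sample row"},
]
matching = product_post_filter(hits, "dkc62-08rib")
```

The post-filter runs after retrieval because a product name lives in the chunk text rather than in a metadata column, so it cannot be part of the where clause.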
@@ -42,7 +42,12 @@ DEFAULT_DB_NAME = "crop_seed_docs.db"
 # Columns we expose as filterable metadata. Mirrors what
 # ``docs_mcp.server._build_where`` accepts so the same filter dict
 # works for both Chroma and BM25 without per-retriever translation.
-FILTER_COLUMNS = ("source", "vendor", "brand", "crop", "source_key", "ordinal")
+# data_type separates "variety" from "trial" chunks; year / state are
+# trial-specific facets that variety chunks leave empty.
+FILTER_COLUMNS = (
+    "source", "vendor", "brand", "crop", "source_key",
+    "data_type", "year", "state", "ordinal",
+)


 # Allowlist tokenizer for free-text queries. FTS5's parser chokes on
@@ -131,8 +136,9 @@ class BM25Index:
            con.executescript(self._schema_sql())
            con.executemany(
                "INSERT INTO chunks_meta "
-                "(id, source, vendor, brand, crop, source_key, ordinal) "
-                "VALUES (?, ?, ?, ?, ?, ?, ?)",
+                "(id, source, vendor, brand, crop, source_key, "
+                " data_type, year, state, ordinal) "
+                "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                [
                    (
                        r["id"],
@@ -141,6 +147,9 @@ class BM25Index:
                        r["metadata"].get("brand") or "",
                        r["metadata"].get("crop") or "",
                        r["metadata"].get("source_key") or "",
+                        r["metadata"].get("data_type") or "variety",
+                        int(r["metadata"]["year"]) if isinstance(r["metadata"].get("year"), int) else None,
+                        r["metadata"].get("state") or "",
                        int(r["metadata"].get("ordinal") or 0),
                    )
                    for r in records
@@ -216,12 +225,18 @@ class BM25Index:
            brand      TEXT,
            crop       TEXT,
            source_key TEXT,
+            data_type  TEXT,
+            year       INTEGER,
+            state      TEXT,
            ordinal    INTEGER
        );
        CREATE INDEX idx_meta_source     ON chunks_meta(source);
        CREATE INDEX idx_meta_crop       ON chunks_meta(crop);
        CREATE INDEX idx_meta_brand      ON chunks_meta(brand);
        CREATE INDEX idx_meta_source_key ON chunks_meta(source_key);
+        CREATE INDEX idx_meta_data_type  ON chunks_meta(data_type);
+        CREATE INDEX idx_meta_year       ON chunks_meta(year);
+        CREATE INDEX idx_meta_state      ON chunks_meta(state);

        CREATE VIRTUAL TABLE chunks_fts USING fts5(
            text,
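A minimal sketch of how the new columns narrow candidates, using an in-memory SQLite table that copies only a few chunks_meta columns from this hunk. The sample rows and the filter_ids helper are made up for illustration; how BM25Index actually combines metadata filters with FTS5 matches is not shown in this diff.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE chunks_meta (
        id TEXT PRIMARY KEY, crop TEXT, data_type TEXT,
        year INTEGER, state TEXT
    );
    CREATE INDEX idx_meta_data_type ON chunks_meta(data_type);
    CREATE INDEX idx_meta_year      ON chunks_meta(year);
    CREATE INDEX idx_meta_state     ON chunks_meta(state);
""")
con.executemany(
    "INSERT INTO chunks_meta VALUES (?, ?, ?, ?, ?)",
    [
        # Invented ids in the "<source>::<source_key>::<ordinal>" shape.
        ("gh_plot_reports::plot-a::0", "corn", "trial", 2024, "IA"),
        ("gh_plot_reports::plot-b::0", "corn", "trial", 2025, "IA"),
        ("some_source::variety-x::0", "corn", "variety", None, ""),
    ],
)

ALLOWED = ("crop", "data_type", "year", "state")


def filter_ids(con: sqlite3.Connection, **filters) -> list:
    """Return chunk ids matching every given column filter."""
    unknown = [c for c in filters if c not in ALLOWED]
    if unknown:
        raise ValueError(f"unsupported filter columns: {unknown}")
    # Only allowlisted column names are interpolated; values are bound.
    cols = list(filters)
    where = " AND ".join(f"{c} = ?" for c in cols) or "1"
    sql = f"SELECT id FROM chunks_meta WHERE {where} ORDER BY id"
    return [row[0] for row in con.execute(sql, [filters[c] for c in cols])]


trial_2024 = filter_ids(con, data_type="trial", year=2024)
```

Variety rows store NULL for year, so an equality filter on year never matches them even without an explicit data_type condition.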
@@ -253,6 +253,7 @@ def _flat_metadata(sidecar: dict) -> dict:
    md: dict = {
        "source": sidecar.get("source") or "",
        "source_key": sidecar.get("source_key") or "",
+        "data_type": "variety",
        "vendor": sidecar.get("vendor") or "",
        "brand": (sidecar.get("brand") or "").upper(),
        "crop": (sidecar.get("crop") or "").lower(),
@@ -304,6 +305,258 @@ def chunks_from_variety(
    }


+# ===========================================================================
+# Trial chunker — for sidecars with data_type="trial"
+# ===========================================================================
+#
+# Trial documents are a different shape from variety identity:
+# - GH plot reports: per-site head-to-head yield comparison across brands
+# - AgriPro trial PDFs: regional multi-year multi-location summary
+#
+# Both produce ONE chunk per document with a preamble that emphasizes
+# the trial's location/year/top performers so the embedder gets clean
+# signal for queries like "best corn for sandy soil Iowa 2024".
+
+
+def _render_gh_plot_chunk(sidecar: dict) -> str:
+    """Render a Golden Harvest plot report (per-site cross-vendor)."""
+    lines: list[str] = []
+    crop = (sidecar.get("crop") or "").lower()
+    crop_label = {"corn": "Corn", "soybeans": "Soybean", "silage": "Silage"}.get(crop, crop.title())
+    state = sidecar.get("state") or sidecar.get("state_abbrev") or ""
+    year = sidecar.get("year") or ""
+    cooperator = sidecar.get("cooperator") or ""
+
+    lines.append(f"# {crop_label} yield trial — {state}, {year}")
+    lines.append("")
+    facts = ["Golden Harvest plot report (cross-vendor)"]
+    if cooperator:
+        facts.append(f"cooperator {cooperator}")
+    if sidecar.get("planted_date"):
+        facts.append(f"planted {sidecar['planted_date']}")
+    if sidecar.get("harvested_date"):
+        facts.append(f"harvested {sidecar['harvested_date']}")
+    if sidecar.get("population_seeds_per_acre"):
+        facts.append(f"population {sidecar['population_seeds_per_acre']:,} seeds/acre")
+    if sidecar.get("row_width_in"):
+        facts.append(f"{sidecar['row_width_in']}\" rows")
+    lines.append(". ".join(facts) + ".")
+    lines.append("")
+
+    results = sidecar.get("results") or []
+    if results:
+        # Pick the primary metric for ranking: corn/soy use "Yield",
+        # silage uses "Ton/Acre". Find the first metric key with a
+        # numeric value in the top result.
+        def _primary(r: dict) -> tuple[str, float | None]:
+            metrics = r.get("metrics") or {}
+            # Back-compat: old sidecars had yield_bu_ac directly.
+            if not metrics and r.get("yield_bu_ac") is not None:
+                return ("Yield", r["yield_bu_ac"])
+            for k in ("Yield", "Ton/Acre", "Tons/Acre"):
+                v = metrics.get(k)
+                if isinstance(v, (int, float)):
+                    return (k, v)
+            for k, v in metrics.items():
+                if isinstance(v, (int, float)):
+                    return (k, v)
+            return ("", None)
+
+        top = results[:5]  # slicing already clamps to len(results)
+        primary_label, _ = _primary(top[0]) if top else ("", None)
+        rendered_top_parts: list[str] = []
+        for i, r in enumerate(top):
+            label, val = _primary(r)
+            piece = f"#{r.get('rank') or i+1} {r.get('brand','?')} {r.get('product','?')}"
+            if r.get('traits'):
+                piece += f" {r['traits']}"
+            if val is not None:
+                piece += f" — {val} {label}"
+            rendered_top_parts.append(piece)
+        if rendered_top_parts:
+            lines.append(
+                f"Top {len(top)} ({crop_label}, {state} {year}): "
+                + ", ".join(rendered_top_parts) + "."
+            )
+            lines.append("")
+
+        # Discover the metric column order from the first result with metrics.
+        metric_keys: list[str] = []
+        for r in results:
+            metrics = r.get("metrics") or {}
+            if metrics:
+                metric_keys = list(metrics.keys())
+                break
+        # Back-compat: synthesize from legacy fields if no metrics dict.
+        if not metric_keys and any(
+            r.get("yield_bu_ac") is not None for r in results
+        ):
+            metric_keys = ["Yield", "%MST", "Test Weight", "Gross Revenue"]
+
+        # Full ranking — preserves every datapoint verbatim.
+        col_headers = ["rank", "brand", "product", "traits"] + metric_keys
+        lines.append("Full ranking (" + " | ".join(col_headers) + "):")
+        for r in results:
+            row = [
+                f"#{r.get('rank') or '-'}",
+                r.get("brand") or "-",
+                r.get("product") or "-",
+                r.get("traits") or "-",
+            ]
+            metrics = r.get("metrics") or {}
+            # Back-compat shim
+            if not metrics:
+                metrics = {
+                    "Yield": r.get("yield_bu_ac"),
+                    "%MST": r.get("mst_pct"),
+                    "Test Weight": r.get("test_weight"),
+                    "Gross Revenue": r.get("gross_revenue_dol_ac"),
+                }
+            for k in metric_keys:
+                v = metrics.get(k)
+                if v is None:
+                    row.append("-")
+                elif isinstance(v, (int, float)):
+                    if "Revenue" in k or "$" in k:
+                        row.append(f"${v:.2f}")
+                    else:
+                        row.append(str(v))
+                else:
+                    row.append(str(v))
+            lines.append("  " + " | ".join(row))
+        lines.append("")
+
+    urls = sidecar.get("source_urls") or []
+    if urls:
+        lines.append(f"Source: {urls[0]}")
+    return "\n".join(lines).strip() + "\n"
+
+
+def _render_agripro_trial_chunk(sidecar: dict) -> str:
+    """Render an AgriPro regional trial PDF — preamble + verbatim text."""
+    lines: list[str] = []
+    title = sidecar.get("title") or sidecar.get("filename") or sidecar.get("source_key", "")
+    lines.append(f"# {title}")
+    lines.append("")
+
+    facts = ["AgriPro / Syngenta regional wheat trial"]
+    if sidecar.get("region"):
+        facts.append(f"region {sidecar['region']}")
+    if sidecar.get("wheat_class_section"):
+        facts.append(f"class {sidecar['wheat_class_section']}")
+    if sidecar.get("years_covered") and len(sidecar["years_covered"]) > 1:
+        yc = sidecar["years_covered"]
+        facts.append(f"years {yc[0]}–{yc[-1]}")
+    elif sidecar.get("year"):
+        facts.append(f"year {sidecar['year']}")
+    lines.append(". ".join(facts) + ".")
+    lines.append("")
+
+    varieties = sidecar.get("varieties_found") or []
+    if varieties:
+        lines.append("Varieties listed: " + ", ".join(varieties) + ".")
+        lines.append("")
+
+    # Verbatim trial data — preserves variety + yield numbers adjacent
+    # so BM25/dense can match "AP Iliad Aberdeen Idaho" queries.
+    lines.append("Trial data (verbatim from PDF):")
+    lines.append("")
+    # The verbatim text lives in the trial's .md body, not in the
+    # sidecar JSON, so this function renders only the preamble and
+    # the marker line above. _render_trial_chunk appends the body
+    # from md_text when the .md file is available. Without it, the
+    # chunk still carries title + varieties + region, which is
+    # usually enough embedding signal for region+variety queries
+    # but drops the per-location yield numbers.
+    return "\n".join(lines).strip() + "\n"
+
+
+def _render_trial_chunk(sidecar: dict, md_text: str | None = None) -> str:
+    """Dispatch to the right trial renderer by source. Includes the
+    verbatim trial body for sources whose value lives in the body text
+    (currently agripro_trials)."""
+    source = sidecar.get("source")
+    if source == "gh_plot_reports":
+        return _render_gh_plot_chunk(sidecar)
+    if source == "agripro_trials":
+        header = _render_agripro_trial_chunk(sidecar)
+        if md_text:
+            # Keep only the text after the "Trial data" heading so the
+            # body is the trial data itself, not the .md file's preamble.
+            body = md_text
+            sep = "## Trial data (verbatim from PDF)"
+            if sep in body:
+                body = body.split(sep, 1)[1].strip()
+                # Strip fence markers
+                body = re.sub(r"```", "", body).strip()
+            return header + "\n" + body + "\n"
+        return header
+    # Fallback: generic trial render
+    return _render_gh_plot_chunk(sidecar)
+
+
+def _flat_trial_metadata(sidecar: dict) -> dict:
+    """Chroma-safe metadata for trial chunks. Mirrors variety metadata
+    plus trial-specific facets (state, year, data_type)."""
+    md: dict = {
+        "source": sidecar.get("source") or "",
+        "source_key": sidecar.get("source_key") or "",
+        "data_type": sidecar.get("data_type") or "trial",
+        "vendor": sidecar.get("vendor") or "",
+        "brand": (sidecar.get("brand") or "").upper(),
+        "crop": (sidecar.get("crop") or "").lower(),
+        "source_url": (sidecar.get("source_urls") or [""])[0],
+    }
+    year = sidecar.get("year")
+    if isinstance(year, int):
+        md["year"] = year
+    state = sidecar.get("state_abbrev") or sidecar.get("state")
+    if state:
+        md["state"] = state.upper() if len(state) <= 3 else state
+        md["state_abbrev"] = (sidecar.get("state_abbrev") or "").upper()
+    if sidecar.get("region"):
+        md["region"] = sidecar["region"]
+    if sidecar.get("wheat_class_section"):
+        md["wheat_class"] = sidecar["wheat_class_section"]
+    if sidecar.get("plot_id"):
+        md["plot_id"] = sidecar["plot_id"]
+    if isinstance(sidecar.get("n_results"), int):
+        md["n_results"] = sidecar["n_results"]
+    return md
+
+
+def chunks_from_trial(
+    sidecar_path: Path | str,
+    *,
+    md_path: Path | str | None = None,
+) -> Iterator[dict]:
+    """Yield chunk dict(s) for one trial document. Emits exactly one
+    chunk per trial.
+
+    Args:
+      sidecar_path: path to the trial's JSON sidecar.
+      md_path:      path to the trial's markdown body (used for
+                    AgriPro PDFs whose value lives in the verbatim
+                    text). If omitted we infer it from sidecar_path.
+    """
+    sc_path = Path(sidecar_path)
+    sidecar = json.loads(sc_path.read_text(encoding="utf-8"))
+
+    md_text: str | None = None
+    md_p = Path(md_path) if md_path else sc_path.with_suffix(".md")
+    if md_p.exists():
+        md_text = md_p.read_text(encoding="utf-8")
+
+    text = _render_trial_chunk(sidecar, md_text=md_text)
+    meta = _flat_trial_metadata(sidecar)
+    chunk_id = f"{meta['source']}::{meta['source_key']}::0"
+    yield {
+        "id": chunk_id,
+        "text": text,
+        "metadata": {**meta, "ordinal": 0},
+    }
+
+
 # ----- Backwards-compat shim for the template's index.py -------------------
 #
 # The template's ``rag.index.page_records`` calls
@@ -12,6 +12,7 @@ Override via the PRODUCT_NAME env var.
 from __future__ import annotations

 import argparse
+import json
 import logging
 import os
 import time
@@ -21,7 +22,7 @@ from typing import Iterator
 import chromadb
 from chromadb.config import Settings

-from .chunk import chunks_from_variety
+from .chunk import chunks_from_variety, chunks_from_trial
 from .embeddings import embedding_function

 log = logging.getLogger(__name__)
@@ -37,7 +38,17 @@ COLLECTION = f"{PRODUCT_NAME}_docs"

 def variety_records() -> Iterator[dict]:
    """Walk ``corpus/<source>/<source_key>.json``, yield one chunk per
-    variety."""
+    document.
+
+    Dispatches by the sidecar's ``data_type`` field:
+      - ``"trial"`` → chunks_from_trial (gh_plot_reports, agripro_trials)
+      - anything else (or absent) → chunks_from_variety (default)
+
+    The output shape (id/text/metadata) is identical for both — only
+    the chunk text composition and metadata keys differ. Chroma + BM25
+    can index both into the same collection; downstream tools filter
+    by the ``data_type`` metadata field.
+    """
    if not CORPUS.exists():
        log.error("corpus/ doesn't exist; run a scraper first")
        return
@@ -45,7 +56,15 @@ def variety_records() -> Iterator[dict]:
        if not source_dir.is_dir() or source_dir.name.startswith("."):
            continue
        for sidecar_path in sorted(source_dir.glob("*.json")):
-            yield from chunks_from_variety(sidecar_path)
+            try:
+                head = json.loads(sidecar_path.read_text(encoding="utf-8"))
+            except (OSError, json.JSONDecodeError) as exc:
+                log.warning("skipping unreadable sidecar %s: %s", sidecar_path, exc)
+                continue
+            if head.get("data_type") == "trial":
+                yield from chunks_from_trial(sidecar_path)
+            else:
+                yield from chunks_from_variety(sidecar_path)


 def upsert_to_chroma(records: list[dict]) -> int:
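The data_type dispatch in variety_records can be exercised in isolation with throwaway sidecars. This is a sketch under stated assumptions: route is a hypothetical stand-in that only reports which chunker a file would go to, the file names are invented, and the isinstance guard is an extra that the diff itself does not include.

```python
import json
import tempfile
from pathlib import Path


def route(sidecar_path: Path) -> str:
    """Return which chunker a sidecar would be sent to, or 'skip'."""
    try:
        head = json.loads(sidecar_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return "skip"
    # Extra guard (not in the diff): valid JSON that is not an object.
    if not isinstance(head, dict):
        return "skip"
    return "trial" if head.get("data_type") == "trial" else "variety"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "plot.json").write_text(
        json.dumps({"data_type": "trial"}), encoding="utf-8")
    (root / "hybrid.json").write_text(
        json.dumps({"brand": "NK"}), encoding="utf-8")
    (root / "broken.json").write_text("{not json", encoding="utf-8")
    routes = {p.name: route(p) for p in sorted(root.glob("*.json"))}
```

Because a missing data_type falls through to the variety chunker, sidecars written before this PR keep indexing the same way without being rewritten.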