gh_plot_reports corpus (4,299 trials) + concurrency + 4-GPU pool #9

Merged
justin merged 1 commit from gh-plot-reports-corpus into main 2026-05-25 16:47:19 -04:00

Summary

  • 4,299 Golden Harvest plot reports added — head-to-head cross-vendor yield trials 2024 + 2025 (2023 deferred). Combined corpus is now 5,073 total chunks: 760 varieties + 14 AgriPro trials + 4,299 GH plot reports.
  • Scraper concurrency: 4 worker threads + shared 0.25-sec rate limiter. Net ~4 req/sec, ~25 min actual time (was 190 min ETA with the original 1-req/sec self-throttle). Diagnosis: zero 429s, zero retries — GH wasn't rate-limiting, my politeness floor was just too conservative.
  • Chunk-length cap (4,500 chars) for nomic-embed-text's 2,048-token window. Numeric-heavy trial chunks tokenize ~2.4 chars/token; capping at 4,500 leaves safe headroom across all source types. Full text stays in the on-disk .md for get_page to return verbatim — anti-hallucination contract intact.
  • 4-GPU embedder pool, weighted by measured throughput. Index rebuild for all 5,073 chunks now takes ~3 min (was 19+ min on the single-endpoint pool).
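
The headroom claim in the chunk-length bullet can be checked with a few lines of arithmetic. This is only a sketch: the ratios (3.5 chars/token for prose, 2.4 for numeric-heavy trial chunks, 2.2 as the worst case) are the ones quoted in the commit message, and the function and constant names are made up for illustration.

```python
CONTEXT_TOKENS = 2048  # nomic-embed-text window, as stated in this PR
CAP_CHARS = 4500       # chunk-length cap, as stated in this PR

# chars/token ratios quoted in the commit message
RATIOS = {"prose": 3.5, "numeric-heavy trial": 2.4, "worst case": 2.2}


def estimated_tokens(chars: int, chars_per_token: float) -> float:
    """Rough token estimate for a chunk of `chars` characters."""
    return chars / chars_per_token


def budget_table(cap: int = CAP_CHARS, window: int = CONTEXT_TOKENS) -> dict:
    """Map each ratio label to (estimated tokens, tokens left in the window)."""
    rows = {}
    for label, ratio in RATIOS.items():
        tokens = round(estimated_tokens(cap, ratio))
        rows[label] = (tokens, window - tokens)
    return rows


for label, (tokens, headroom) in budget_table().items():
    print(f"{label:>20}: ~{tokens} tokens, {headroom} to spare")
```

At 2.4 chars/token the cap works out to about 1,875 tokens. At the 2.2 worst case it is about 2,045, which still fits the window but only by a few tokens.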

GPU pool (in both workflows)

| Endpoint | GPU | Throughput | List weight |
|---|---|---|---|
| 192.168.0.125:11434 | RTX 40-series | 242 embeds/sec | ×4 |
| 192.168.0.2:11436 | GPU-pinned | 108 embeds/sec | ×2 |
| 192.168.0.2:11435 | GPU-pinned | 72 embeds/sec | ×1 |
| localhost:11434 | TITAN X | 37 embeds/sec | ×1 |

192.168.0.2:11434 explicitly excluded (not GPU-pinned).
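
The commit message says the weights are applied by listing a URL several times in OLLAMA_URL, since the embedder picks endpoints round-robin. A minimal sketch of that idea follows. The comma-separated format and the helper names are assumptions for illustration, not the workflow's actual syntax.

```python
from collections import Counter
from itertools import cycle, islice

# (endpoint, list weight) pairs from the table above
POOL = [
    ("192.168.0.125:11434", 4),
    ("192.168.0.2:11436", 2),
    ("192.168.0.2:11435", 1),
    ("localhost:11434", 1),
]


def expand_pool(pool):
    """Repeat each endpoint `weight` times so plain round-robin becomes weighted."""
    return [endpoint for endpoint, weight in pool for _ in range(weight)]


def request_share(endpoints, n_requests):
    """Count how many of `n_requests` round-robin picks land on each endpoint."""
    return Counter(islice(cycle(endpoints), n_requests))


endpoints = expand_pool(POOL)
ollama_url = ",".join(endpoints)  # assumed format: comma-separated list
share = request_share(endpoints, 800)
for endpoint, count in share.items():
    print(f"{endpoint:<22} {count / 800:.1%}")
```

With weights 4/2/1/1 the expanded list has eight entries, so the endpoints receive 50%, 25%, 12.5%, and 12.5% of requests. Listing order affects only how bursty each endpoint's load is, not its long-run share.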

Coverage after merge

  • 6 sources / 2 vendors / 6 brands / 4 crops
  • 5,073 indexed chunks
  • 4,313 trial documents (4,299 GH plot reports + 14 AgriPro PDFs)
  • 760 variety identity records

Smoke tests (post-rebuild)

  • search_trials({crop=corn, state=IA, year=2024}) → 3 IA 2024 corn trials
  • search_trials("Phytophthora resistance soybean trial") → NK NK43-W1XFS at #1 in LA 2024 trial (cross-vendor result; NK product surfaced via GH-published plot)
  • search_trials("AP Iliad Idaho wheat") → AgriPro Washington/N. Idaho 2025 trial
  • search_trials(product='DKC65-95') → 3 corn trials in IL/IA 2024 containing the hybrid
  • search_trials(product='NK1701') → 3 corn trials in AR/MS 2024
  • search_trials(product='DKC65-20') → empty (correctly! that's a 2023-only product; 2023 plots aren't in this corpus). Anti-hallucination contract holds.

Files

  • 8,598 corpus files added (corpus/gh_plot_reports/*.{md,json})
  • scrape/sources/gh_plot_reports.py: +120 lines (concurrency + workers)
  • rag/chunk.py: +37 lines (truncation helper + apply to both chunkers)
  • .gitea/workflows/{refresh,image-only}.yml: GPU pool restructure

What's not in this PR

  • 2023 plots (~3,619 docs). Could be a follow-up --include-2023 backfill PR if you want them.
  • NK / Bayer trial data — still no public source (see lessons.md).
justin added 1 commit 2026-05-25 16:47:05 -04:00
CORPUS: 4,299 GH plot reports added (3,797 written in this run +
502 from the earlier slow run). A further 319 sitemap-listed URLs
404'd as discontinued and are not counted. Combined with the prior
760 varieties + 14 AgriPro trials = 5,073 total chunks now indexed.

scrape/sources/gh_plot_reports.py — concurrency speedup:
- 4 worker threads (ThreadPoolExecutor), each with its own
  requests.Session for connection-pool efficiency.
- Shared class-level rate limiter (0.25 sec between ANY two
  requests across all threads). Net throughput is ~4 req/sec,
  well below the rate-limit thresholds public sites typically enforce.
- Diagnosis vs original 1 req/sec: GH had ZERO rate limiting,
  zero 429s, zero retries. The 1 sec self-throttle was just too
  conservative. Bench:
    1 worker  / 1.0 sec throttle:  ~0.4 plots/sec (190 min ETA)
    4 workers / 0.25 sec throttle: ~3 plots/sec  (~25 min actual)
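
A rough sketch of the pattern described above: a lock-protected limiter shared by every worker in a ThreadPoolExecutor. This is not the code in gh_plot_reports.py. The class and function names are hypothetical, and the per-thread requests.Session is replaced by a caller-supplied `fetch` callable so the sketch runs without network access.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hand out request start slots at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic time of the next free slot

    def wait(self) -> None:
        # Reserve a slot while holding the lock, then sleep outside it so
        # other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def fetch_all(urls, fetch, workers=4, min_interval=0.25):
    """Run `fetch(url)` for every URL on a thread pool behind one shared limiter."""
    limiter = RateLimiter(min_interval)

    def task(url):
        limiter.wait()  # every worker passes through the same limiter
        return fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order
        return list(pool.map(task, urls))
```

With a 0.25-second interval the limiter caps the pool at 4 request starts per second no matter how many workers are running. The workers only help overlap the response latency of in-flight requests.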

rag/chunk.py — chunk size cap for nomic-embed-text's 2048-token
context window:
- Empirically tested: failure threshold is ~5,250 chars on
  numeric-heavy trial chunks (chars/token ratio 2.4 vs 3.5 for
  prose). Cap at 4,500 chars to be safely under at worst-case
  2.2 chars/token.
- Applied to BOTH variety and trial chunks. Marked truncated
  chunks with metadata.embed_truncated = True; FULL text stays
  in the on-disk .md for get_page to return verbatim.
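
One way the cap-and-flag step could look, as a minimal sketch rather than the actual helper in rag/chunk.py. The `Chunk` type and the function name are hypothetical. Only the 4,500-character cap and the `embed_truncated` metadata key come from this commit message.

```python
from dataclasses import dataclass, field

EMBED_CHAR_CAP = 4500  # cap stated in this commit message


@dataclass
class Chunk:
    """Hypothetical chunk record: text to embed plus free-form metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def cap_for_embedding(chunk: Chunk, cap: int = EMBED_CHAR_CAP) -> Chunk:
    """Return a chunk whose text fits the cap, flagging it if it was cut.

    The input chunk is left untouched. Only the copy sent to the embedder
    is shortened, so the full text can still be served from disk.
    """
    if len(chunk.text) <= cap:
        return chunk
    metadata = dict(chunk.metadata, embed_truncated=True)
    return Chunk(text=chunk.text[:cap], metadata=metadata)
```

Returning a new object instead of mutating in place keeps the original text available to whatever writes the on-disk .md.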

.gitea/workflows/{refresh,image-only}.yml — OLLAMA_URL pool
restructured for the 4 GPU-pinned endpoints. Bench (50-chunk
batches on nomic-embed-text):

    .0.125:11434  (RTX 40-series)  242 embeds/sec  ← weight ×4
    .0.2:11436    (GPU-pinned)     108 embeds/sec  ← weight ×2
    .0.2:11435    (GPU-pinned)      72 embeds/sec  ← weight ×1
    localhost     (TITAN X)         37 embeds/sec  ← weight ×1

Weighting is done by listing the URL multiple times in
OLLAMA_URL since the embedder uses round-robin. .0.2:11434 is
explicitly EXCLUDED — it isn't pinned to a specific GPU.

Combined index rebuild for 5,073 chunks now finishes in ~3 min
(was 19+ min on the single-endpoint pool).

Smoke tests:
✓ list_versions: 5,073 docs across 6 sources, 2 vendors, 6
  brands, 4 crops (corn 2711, soy 2016, silage 223, wheat 123).
✓ search_trials({crop=corn, state=IA, year=2024}): 3 IA 2024
  corn trials surfaced.
✓ search_trials("Phytophthora resistance soybean trial"): NK
  NK43-W1XFS top-1 in LA 2024 trial (cross-vendor result).
✓ search_trials("AP Iliad Idaho wheat"): AgriPro Washington/N.
  Idaho 2025 trial surfaced.
✓ search_trials(product=DKC65-95): 3 corn trials containing
  that hybrid in IL/IA 2024.
✓ search_trials(product=NK1701): 3 corn trials in AR/MS 2024.
✓ Product filter correctly returns EMPTY for products that
  aren't in the corpus (DKC65-20 is a 2023 product; 2023 plots
  deferred). Anti-hallucination contract preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justin merged commit d60d747858 into main 2026-05-25 16:47:19 -04:00

Reference: justin/seed-mcp#9