a54fac240fb458dee7de0c2d088c318a0dd41732
12 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a54fac240f |
Add university-extension trials: Illinois VT + Iowa ICPT + Ohio OCPT (+123 cross-vendor trial docs) (#19)
Image rebuild (skip scrape) / build (push) Successful in 5m54s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io> |
||
|
|
0bac06b7b6 |
Add RobSeeCo (Rob-See-Co + Innotech): 130 corn/soy varieties from the seed-guide PDF (#18)
Image rebuild (skip scrape) / build (push) Successful in 4m48s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io> |
||
|
|
84ad2b1de6 |
Add 4 independent seed brands: Latham + Stine + 1st Choice + Burrus (+623 varieties) (#17)
Image rebuild (skip scrape) / build (push) Successful in 4m44s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io> |
||
|
|
22e8092faf |
Add ProHarvest Seeds: 119 varieties + 161 cross-vendor plot reports (#16)
Image rebuild (skip scrape) / build (push) Successful in 5m46s
Co-authored-by: claude <claude@jpaul.io> Co-committed-by: claude <claude@jpaul.io> |
||
|
|
e356633d4f | monthly refresh: 2026-06-01T06:53Z — bayer=931 gh=139 nk=122 agripro=24 ag_trials=14 gh_plot_reports=4299 pfr=0 | ||
|
|
b98965a68a |
Two new trial sources: LG Seeds + AgriGold plot reports (+2,307 cross-vendor yield trials)
Adds the **first non-Syngenta trial coverage** to the corpus:
| Source | Docs | Publisher | URL pattern |
|---|---|---|---|
| lg_plot_reports | 1,304 | LG Seeds (AgReliant) | lgseeds.com/performance/{crop} JSON XHR |
| agrigold_plot_reports | 1,003 | AgriGold (AgReliant) | agrigold.com/{crop}/performance/{crop}-yield-results |
Total trial coverage now: gh_plot_reports (4,299) + agripro_trials (14) +
lg_plot_reports (1,304) + agrigold_plot_reports (1,003) = 6,620 trial docs.
**Both scrapers follow the gh_plot_reports template** — same RateLimitedSession
primitive, same TrialResult/PlotReport dataclass shape, same data_type="trial"
sidecar convention. The trial chunker (`rag/chunk.py:_render_trial_chunk`) is
extended to recognize both new sources; they share `_render_gh_plot_chunk`
since their sidecars are structurally identical (just different brand label).
**LG specifics:**
- POST `/performance/{crop}/GetPlots/` returns sparse listing (id, year, lat/lng)
- GET `/performance/{crop}/GetPlotData/?PlotId=X&IsSilage=Y` returns full detail
with state, cooperator, planting/harvest dates, and **top-5 hybrids** (LG +
competitors). Top-5 is what LG publishes publicly; not the full ranking.
- 4 crops: corn (963), soybeans (287), sorghum (10), silage (50). Alfalfa is
absent because LG doesn't run alfalfa plots; that's variety-only data.
- 301 gotcha: www.lgseeds.com redirects to lgseeds.com which drops POST body,
so the scraper hits the apex host directly.
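A minimal sketch of the workaround for the 301 gotcha above, assuming a hypothetical helper `apex_url` (not taken from the scraper): rewrite any `www.` URL to the apex host before POSTing, so no redirect is ever followed and the body is never dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def apex_url(url: str) -> str:
    """Return url with a leading 'www.' removed from the host (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# The listing endpoint from the commit message, written against the www host.
listing = apex_url("https://www.lgseeds.com/performance/corn/GetPlots/")
print(listing)
```

Pairing this with redirect-following turned off in the HTTP client would make a reintroduced `www.` host fail loudly instead of silently returning an empty listing.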
**AgriGold specifics:**
- Listing: GET `/{crop}/performance/{crop}-yield-results?harvestYear={year}`
(server-rendered HTML, ~1MB; 408 corn plots in 2025 alone)
- Detail: GET `/{crop_url}/performance/{slug}/{plot_id}` returns the **full
ranking** (not just top-5) plus rich plot management metadata: tillage,
previous crop, fungicide, herbicide, insecticide, irrigation, soil type,
row width, population. Most metadata-rich of the three trial sources.
- Soybean URL slug is singular: `/soybeans/performance/soybean-yield-results/`
- Columns: Rank | Brand | Product | Trait | Ck | H20 (moisture) | Test Wt. |
Yield | Adj Yield (check-adjusted)
- 2 crops: corn (849) + soybeans (157)
**Indexer needs no changes** — `rag/index.py` auto-discovers any directory
under corpus/ and routes by data_type. Both new sources flow into the
existing trial collection and surface via `search_trials`.
Years scraped: 2024+2025 (matching gh_plot_reports baseline). 2023 is
available via `--include-2023` on either scraper for future backfill.
|
||
|
|
30b182e28a |
Three new brand scrapers: LG Seeds + AgriGold + Ebbert's Seeds (+310 varieties)
User flagged that LG, AgriGold, and Ebbert's (local Ohio breeder) are
all active in the farmer's territory. Built three scrapers; corpus now
covers 5,839 chunks across 11 brands.
Net new varieties: 310
lg_seeds 170 — corn 78 + soy 63 + alfalfa 16 + sorghum 13
→ adds FIRST alfalfa coverage (FD 3-5 range)
agrigold 111 — corn 60 + soy 51
ebberts_seeds 29 — corn 17 + soy 12 (regional OH/IN breeder)
scrape/sources/lg_seeds.py — embedded-JSON pattern (cleanest):
- /products/<crop> pages have a `var products = [...]` blob with the
variety summary (Variety, Maturity, Traits[], Bullets[], CropType).
- Per-variety detail page (/products/<crop>/<Variety>) carries the
ratings as `<span class="bar-N">` where N is 1-9 on the canonical
scale. Same 9=best direction as Bayer / Golden Harvest.
- Three sections per page: Characteristics / Management / Disease
Tolerance, plus a few qualitative bars ("Tar Spot Susceptible",
"Fungicide Response High") preserved as text values.
scrape/sources/agrigold.py — 5-circle scale:
- Listing page has 60+ /corn/explore-corn-hybrids/<CODE> URLs.
- Detail page renders ratings as <div class="scale"> blocks with 5
child <div class="circle"> elements, of which N have class
"circle selected" → rating N on a 1-5 scale.
- 7 sections per page incl. Silage Characteristics (Dairy Silage
Rating, NDFd 30 Hr, Crude Protein), Planting Applications, Soil
Adaptability, Plant Characteristics, Product Features.
- Distinct rating scale (1-5 vs Bayer's 1-9), declared in
_scale_direction so chunker preamble renders correctly.
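The circle-counting rule described above can be sketched with the standard library alone. The class and function names below are hypothetical, and the sample markup is a hand-written stand-in for a detail page, not a captured response.

```python
from html.parser import HTMLParser


class CircleScaleParser(HTMLParser):
    """Collect one rating per <div class="scale"> by counting selected circles."""

    def __init__(self) -> None:
        super().__init__()
        self.ratings: list[int] = []
        self._depth = 0  # div nesting depth inside the current scale block

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "scale" in classes:
                self.ratings.append(0)  # start a new rating
                self._depth = 1
            return
        self._depth += 1
        if "circle" in classes and "selected" in classes:
            self.ratings[-1] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1


def parse_ratings(html: str) -> list[int]:
    parser = CircleScaleParser()
    parser.feed(html)
    return parser.ratings


sample = (
    '<div class="scale">'
    '<div class="circle selected"></div><div class="circle selected"></div>'
    '<div class="circle selected"></div><div class="circle"></div>'
    '<div class="circle"></div></div>'
)
print(parse_ratings(sample))  # [3]
```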
scrape/sources/ebberts_seeds.py — small regional breeder, verbatim
text approach:
- Single page per crop (corn / soybeans / wheat). Each variety is an
<h1> + multi-section CSS-grid block where labels and values are in
separate adjacent cells. Reconstructing perfectly aligned columns
for only 29 varieties isn't worth the engineering: the chunk body
carries the verbatim text in document order, and the LLM can read
the tabular content.
- Scale: 1-5 (1 = best, lower = more resistant), inferred from
marketing-vs-rating cross-checks ("Robust tall plants" + STANDABILITY
1.0 → 1 = best).
- Politeness: robots.txt asks for Crawl-delay: 5; honored.
All three new scrapers smoke-tested:
- LG corn LG5701 RM 116 SmartStax → 3 characteristic groups with
Disease Tolerance ratings (Northern/Southern Leaf Blight 8-9, etc.)
- AgriGold A616-30 RM 86 VT2RIB → 7 groups incl. silage and soil
adaptability ratings
- Ebbert's 7000TR RIB RM 100 → 1098-char verbatim body covering
CHARACTERISTICS, DISEASE RATINGS, herbicide tolerance, etc.
Corpus state after this PR:
- 5,839 chunks (was 5,529)
- 11 brands (was 8)
- 8 crops (corn 3047, soy 2209, silage 359, wheat 123, sorghum 49,
cotton 30, alfalfa 16, canola 6) — alfalfa is brand-new
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eaa7e0789b |
bayer_seeds: add Channel + DEKALB silage/sorghum/canola + Deltapine cotton
User flagged that Channel is expanding into their area; re-walked the cropscience.bayer.us sitemap and found 8 additional brand×crop paths beyond the original DEKALB/Asgrow/WestBred triple. Patches the scraper to walk all of them; total Bayer varieties roughly doubles from 475 to 931 and the corpus picks up first-ever coverage in sorghum (36), cotton (30), canola (6), and silage as a distinct crop (was conflated with corn before).
Net new varieties: 456
Channel corn=181 soy=67 silage=54 sorghum=18 (320)
DEKALB silage=82 sorghum=18 canola=6 (106)
Deltapine cotton=30 (30)
scrape/sources/bayer_seeds.py
- Replace `BRANDS` (brand → 1 path) and `CROP_SUFFIX` (brand → 1 suffix) with a flatter `BRAND_PATHS` list of (brand, url_path, crop, is_primary_for_brand) entries. Channel and DEKALB are now multi-crop brands; the same scraper walks every brand×crop pair.
- source_key derivation: for a brand's PRIMARY crop, strip the trailing `-<crop>` suffix (matches the existing deployed source keys for DEKALB corn / Asgrow soy / WestBred wheat). For SECONDARY crops, KEEP the suffix so DEKALB-the-same-SKU sold as both grain corn and silage gets two distinct source_keys (collision-safe and unambiguous for `lookup_variety`).
- New `--crop` CLI filter for incremental backfills.
- Log line shows brand + crop alongside source_key for visibility.
rag/chunk.py
- Channel + Deltapine pages use slightly different characteristics group labels (DISEASE not DISEASE RATINGS, AGRONOMIC CHARACTERISTICS not GROWTH/HARVEST, plus MATURITY / ADAPTATION / HERBICIDES / OTHER). Fold them into the DISEASE / AGRONOMIC / MANAGEMENT label sets so the chunker buckets them correctly into the standard sections.
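A sketch of the source_key rule described above, under stated assumptions: the function name `derive_source_key` and the example slugs are hypothetical, and the real scraper may derive keys from a different part of the URL.

```python
def derive_source_key(url_slug: str, crop: str, is_primary: bool) -> str:
    """Strip the trailing '-<crop>' for a brand's primary crop; keep it otherwise."""
    suffix = f"-{crop}"
    if is_primary and url_slug.endswith(suffix):
        return url_slug[: -len(suffix)]
    # Secondary crops keep the suffix so the same SKU sold for two crops
    # yields two distinct keys.
    return url_slug


# Hypothetical slugs, for illustration only.
print(derive_source_key("dkc62-08rib-corn", "corn", is_primary=True))
print(derive_source_key("dkc64-44rib-silage", "silage", is_primary=False))
```

Keeping the suffix only on secondary crops leaves already-deployed keys stable while still avoiding collisions for multi-crop brands.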
Smoke-tested cross-brand × cross-crop queries against the rebuilt index (5,529 chunks total); all 6 sample queries surface the right brand+crop at top-3:
Channel corn 110 RM → 210-25TRE BRAND
Channel soy 2.5 MG IA → 2622RXF BRAND
Deltapine cotton XF → DP 1820 B3XF BRAND
Sorghum dryland Kansas → 6B95 BRAND (Channel)
Silage corn WI dairy → DKC64-44RIB BRAND BLEND (silage variant)
Canola Northern Plains → DK401TL BRAND
Watchtower will pull the new image on the next push; deploy is unchanged otherwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0e625553e5 |
gh_plot_reports corpus (4,299 plots) + concurrency + 4-GPU pool
CORPUS: 4,299 GH plot reports added (3,797 written in this run + 502
from the earlier slow run). The remaining 319 of the 4,618
sitemap-listed URLs 404'd as discontinued. Combined with prior 760
varieties + 14 AgriPro trials = 5,073 total chunks now indexed.
scrape/sources/gh_plot_reports.py — concurrency speedup:
- 4 worker threads (ThreadPoolExecutor), each with its own
requests.Session for connection-pool efficiency.
- Shared class-level rate limiter (0.25 sec between ANY two
requests across all threads). Net throughput ~4 req/sec —
well below any rate-limit threshold a public site enforces.
- Diagnosis vs original 1 req/sec: GH had ZERO rate limiting,
zero 429s, zero retries. The 1 sec self-throttle was just too
conservative. Bench:
1 worker / 1.0 sec throttle: ~0.4 plots/sec (190 min ETA)
4 workers / 0.25 sec throttle: ~3 plots/sec (~25 min actual)
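The shared limiter described above can be sketched as follows. This is an illustration, not the scraper's code: `SharedRateLimiter` and `fetch` are hypothetical names, and the demo uses a shorter interval than 0.25 sec so it finishes quickly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SharedRateLimiter:
    """Enforce a minimum gap between ANY two calls, across all threads."""

    _lock = threading.Lock()
    _next_allowed = 0.0  # monotonic timestamp of the next free request slot

    def __init__(self, min_interval: float = 0.25) -> None:
        self.min_interval = min_interval

    def wait(self) -> None:
        cls = type(self)
        with cls._lock:
            now = time.monotonic()
            slot = max(now, cls._next_allowed)  # reserve the next free slot
            cls._next_allowed = slot + self.min_interval
        # Sleep outside the lock so other threads can reserve later slots.
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def fetch(plot_id: int, limiter: SharedRateLimiter) -> float:
    limiter.wait()
    return time.monotonic()  # stand-in for a real session.get(...)


limiter = SharedRateLimiter(min_interval=0.05)
started = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    stamps = list(pool.map(lambda i: fetch(i, limiter), range(8)))
print(f"8 requests took {max(stamps) - started:.2f}s")
```

Because slots are reserved under the lock but slept outside it, four workers never exceed one request per interval combined, which matches the roughly 4 req/sec ceiling quoted for a 0.25 sec gap.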
rag/chunk.py — chunk size cap for nomic-embed-text's 2048-token
context window:
- Empirically tested: failure threshold is ~5,250 chars on
numeric-heavy trial chunks (chars/token ratio 2.4 vs 3.5 for
prose). Cap at 4,500 chars to be safely under at worst-case
2.2 chars/token.
- Applied to BOTH variety and trial chunks. Marked truncated
chunks with metadata.embed_truncated = True; FULL text stays
in the on-disk .md for get_page to return verbatim.
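The cap arithmetic and truncation flag above can be sketched like this. `cap_for_embedding` is a hypothetical name; only the 2048-token window, the 2.2 chars/token worst case, the 4,500-char cap, and the `embed_truncated` key come from the commit message.

```python
CONTEXT_TOKENS = 2048          # nomic-embed-text context window
WORST_CHARS_PER_TOKEN = 2.2    # worst-case ratio quoted above
MAX_EMBED_CHARS = 4_500        # cap chosen in the commit

# 2048 tokens * 2.2 chars/token is about 4,505 chars, so 4,500 stays under it.
WORST_CASE_BUDGET = int(CONTEXT_TOKENS * WORST_CHARS_PER_TOKEN)


def cap_for_embedding(text: str, metadata: dict, max_chars: int = MAX_EMBED_CHARS):
    """Return (embed_text, metadata); flag the metadata when text is cut."""
    meta = dict(metadata)  # copy so the caller's dict is not mutated
    if len(text) > max_chars:
        meta["embed_truncated"] = True
        return text[:max_chars], meta
    return text, meta


embed_text, meta = cap_for_embedding("x" * 6000, {"data_type": "trial"})
print(len(embed_text), meta)
```

Only the embedded text is shortened here; the full text would still be read from the on-disk .md when a page is returned verbatim.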
.gitea/workflows/{refresh,image-only}.yml — OLLAMA_URL pool
restructured for the 4 GPU-pinned endpoints. Bench (50-chunk
batches on nomic-embed-text):
.0.125:11434 (RTX 40-series) 242 embeds/sec ← weight ×4
.0.2:11436 (GPU-pinned) 108 embeds/sec ← weight ×2
.0.2:11435 (GPU-pinned) 72 embeds/sec ← weight ×1
localhost (TITAN X) 37 embeds/sec ← weight ×1
Weighting is done by listing the URL multiple times in
OLLAMA_URL since the embedder uses round-robin. .0.2:11434 is
explicitly EXCLUDED — it isn't pinned to a specific GPU.
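Weighting by repetition, as described above, can be sketched as below. The endpoint names are hypothetical placeholders (the real hosts are abbreviated in the commit), and the comma-separated OLLAMA_URL format is an assumption.

```python
from itertools import cycle

# Hypothetical endpoint names; the weights are the ones quoted in the bench.
WEIGHTS = {
    "http://gpu-a:11434": 4,
    "http://gpu-b:11436": 2,
    "http://gpu-c:11435": 1,
    "http://gpu-d:11434": 1,
}


def build_pool(weights: dict[str, int]) -> list[str]:
    """Repeat each URL `weight` times so plain round-robin becomes weighted."""
    return [url for url, weight in weights.items() for _ in range(weight)]


pool = build_pool(WEIGHTS)
ollama_url = ",".join(pool)  # assumed env-var format, not confirmed by the source
picker = cycle(pool)         # round-robin over the repeated entries
one_round = [next(picker) for _ in range(len(pool))]
print(one_round.count("http://gpu-a:11434"), "of", len(one_round))
```

The embedder stays a plain round-robin picker with no weighting logic of its own; the share each endpoint receives is set entirely by how often its URL is listed.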
Combined index rebuild for 5,073 chunks now finishes in ~3 min
(was 19+ on the single-endpoint pool).
Smoke tests:
✓ list_versions: 5,073 docs across 6 sources, 2 vendors, 6
brands, 4 crops (corn 2711, soy 2016, silage 223, wheat 123).
✓ search_trials({crop=corn, state=IA, year=2024}): 3 IA 2024
corn trials surfaced.
✓ search_trials("Phytophthora resistance soybean trial"): NK
NK43-W1XFS top-1 in LA 2024 trial (cross-vendor result).
✓ search_trials("AP Iliad Idaho wheat"): AgriPro Washington/N
Idaho 2025 trial surfaced.
✓ search_trials(product=DKC65-95): 3 corn trials containing
that hybrid in IL/IA 2024.
✓ search_trials(product=NK1701): 3 corn trials in AR/MS 2024.
✓ Product filter correctly returns EMPTY for products that
aren't in the corpus (DKC65-20 is a 2023 product; 2023 plots
deferred). Anti-hallucination contract preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c737871c4c |
Trial-data scrapers: gh_plot_reports + agripro_trials + search_trials tool
This PR introduces TRIAL data — yield-performance results from real
field trials — as a SEPARATE data type alongside variety identity.
The two are complementary:
search_docs → "What's the disease resistance of DKC62-08RIB?"
(variety identity — what it IS)
search_trials → "Which corn hybrid won the IA 2024 trials?"
(performance data — how it PERFORMED)
scrape/sources/gh_plot_reports.py — Golden Harvest plot reports
- 4,618 expected (2024+2025; 2023 deferred to a backfill pass).
- URL: /<crop>/plot-report/<state>/<year>/<plot_id>
- Cross-vendor: each plot lists products from multiple brands
(NK / DEKALB / Golden Harvest / Enogen / Pioneer / Channel) side
by side at one cooperator's field — the kind of independent
comparison data Bayer doesn't publish itself.
- Generic per-column metrics dict (Yield/MST/Test Weight/$/Ac for
corn+soy, Ton/Acre + Milk + Beef columns for silage).
- Politeness: 1 req/sec, retries on 429/5xx, no redirect-follow.
scrape/sources/agripro_trials.py — AgriPro regional trial PDFs
- 14 unique PDFs (38 sitemap links deduped) at /trials-data
- pdfplumber text extraction, region/year detection from filename
- Verbatim PDF text preserved in chunk body so variety + yield
number adjacency drives retrieval (AP Iliad's Aberdeen ID yield
matches a query about "AP Iliad Idaho yield")
rag/chunk.py — chunks_from_trial() dispatching by source
- Plot reports: identity preamble + Top-5 by primary metric + full
ranking table. Metric labels chosen from the data (corn/soy use
"Yield", silage uses "Ton/Acre").
- AgriPro PDFs: identity preamble + verbatim trial body inline so
per-location yields surface for region+variety queries.
- Variety chunks get data_type="variety" metadata; trial chunks get
data_type="trial". Single Chroma collection; the tool router
filters by data_type rather than maintaining two collections.
rag/index.py — dispatch by sidecar's data_type field
rag/bm25.py — new filter columns (data_type, year, state)
docs_mcp/server.py — sixth MCP tool: search_trials(crop?, state?,
year?, product?, k=10)
- Filters trial chunks via where={"data_type": "trial", ...}
- Optional product substring post-filter for "DKC62-08RIB Iowa 2024"
style searches
- search_docs now defaults to data_type="variety" so trial chunks
don't bleed into variety identity queries
- Tool docstring routes the agent: "use lookup_variety to verify
identity details on any trial winner you surface"
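A sketch of the filtering described above, assuming metadata keys named after the tool parameters (`crop`, `state`, `year`) and hits shaped as dicts with a `text` key; both are illustrative assumptions. Chroma's `where` syntax combines multiple conditions under `$and`, which is why the helper wraps them.

```python
from typing import Optional


def build_trial_where(crop: Optional[str] = None, state: Optional[str] = None,
                      year: Optional[int] = None) -> dict:
    """Build a metadata filter that always pins data_type to 'trial'."""
    clauses: list[dict] = [{"data_type": "trial"}]
    if crop:
        clauses.append({"crop": crop.lower()})
    if state:
        clauses.append({"state": state.upper()})
    if year is not None:
        clauses.append({"year": year})
    # A single condition is passed bare; several are wrapped in $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def product_post_filter(hits: list[dict], product: Optional[str]) -> list[dict]:
    """Keep only hits whose text mentions the product (case-insensitive)."""
    if not product:
        return hits
    needle = product.lower()
    return [hit for hit in hits if needle in hit["text"].lower()]


where = build_trial_where(crop="corn", state="ia", year=2024)
hits = [{"text": "DKC65-95 238.1 bu/ac"}, {"text": "NK1701 231.4 bu/ac"}]
print(where)
print(product_post_filter(hits, "dkc65-95"))
```

An empty list from the post-filter is a valid answer: a product that is not in the corpus should yield no hits rather than a nearest neighbour.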
NK trial endpoint (/NKSeeds/wsProxy.asmx/GetPlotResult) is documented
as deferred — the ASMX-SOAP shape returned empty XML on initial
probe. Bayer per-variety yield data is not publicly indexed at all
— documented in the trial-scope note (DEKALB/Asgrow trial data flows
through Channel reps, not the web). AgRevival research books exist
as 10 large annual PDFs but are deferred (low ROI per parse).
Initial corpus shipped in this PR: 14 AgriPro trial PDFs. The 4,618
Golden Harvest plot reports are scraping in background and will be
added in a follow-up corpus-snapshot PR (~70 min ETA).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9ce920f622 |
agripro + nk scrapers — 146 Syngenta varieties added (wheat + corn/soy)
agripro (24 varieties)
- Drupal Views form scrape via /search-agripro-brand-varieties with
explicit GET params (sidesteps the AJAX-only-on-load default that
returns an empty form skeleton).
- Per-variety parse: <h1>, .field--node--variety-type--variety,
.field--node--tag-line--variety, .field--node--body, plus the
three rated sections (Agronomics / Grain / Disease) with their
<div class="row"><div class="label">label</div><div>value</div>
pairs.
- Wheat-class distribution: 12 HRS, 7 SWW, 3 HRW, 1 HWS, 1 Barley
— provides the Northern Plains HRS coverage WestBred lacks.
nk (122 varieties — recon's "29" was outdated; the current NK seed
finder lists 41 corn + 81 soy)
- ASP.NET WebForms endpoint:
POST /NKSeeds/{Corn,Soy}ProductFinder.aspx/GetProducts returns
{"d": "<html>"} where the inner HTML is one <div class="sf-result">
per variety. BeautifulSoup tokenizes the whole blob.
- Per-card: product code (NK8005, NK008-P8XF), RM/MG from the
title <span>, "Brands Available" trait variants, marketing
positioning + bullet strengths, tech-sheet PDF URL.
- pdfplumber text extraction on the tech-sheet PDFs adds:
* corn disease ratings (Gray Leaf Spot, NCLB, Goss's Wilt,
Anthracnose, Tar Spot, Fusarium, etc.) where the PDF prints
"Label N" lines (text-extractable)
* soybean Phytophthora source genes (Rps1c, Rps3a, ...)
* soybean SCN race coverage
* soybean agronomic ratings (Emergence, Standability, Shatter
Tolerance, Green Stem) with text-extractable 1-9 values
* soybean soil-type adaptation (Best/Good/Fair/Poor) for drought
prone / high pH / poorly drained / etc.
- Agronomic rating BARS for corn (Emergence, Stalk Strength,
Drought) are not text-extractable; we record the labels with an
explicit "rated in PDF chart, see tech sheet" value so the agent
can direct the farmer at the source for those numbers.
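The `{"d": "<html>"}` envelope described above can be unpacked with the standard library alone. The commit uses BeautifulSoup; this sketch swaps in `html.parser` to stay self-contained, and the sample body is hand-written rather than a captured response.

```python
import json
from html.parser import HTMLParser


class ResultCardCounter(HTMLParser):
    """Count <div class="sf-result"> cards in the inner HTML blob."""

    def __init__(self) -> None:
        super().__init__()
        self.cards = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "sf-result" in classes:
            self.cards += 1


def count_cards(response_body: str) -> int:
    inner_html = json.loads(response_body)["d"]  # unwrap the ASP.NET envelope
    parser = ResultCardCounter()
    parser.feed(inner_html)
    return parser.cards


sample_body = json.dumps({
    "d": '<div class="sf-result"><span>NK8005</span></div>'
         '<div class="sf-result"><span>NK008-P8XF</span></div>'
})
print(count_cards(sample_body))  # 2
```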
Scale-direction correction in lessons.md:
- NK and AgriPro both use 1 = best, lower = more resistant — the
REVERSED convention vs Bayer / Golden Harvest. NK's tech-sheet
footer literally prints "1-9 Scale: 1 = Best, 9 = Worst".
AgriPro positioning on stripe-rust-resistant varieties (AP Iliad
with Stripe Rust 1, Eyespot 2) confirms the same direction.
- sources-not-yet-indexed section trimmed to just Beck's PFR +
Beck's products — everything else IS now in the corpus.
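One way to keep the two conventions from being confused is a per-source scale declaration that the chunker turns into a preamble sentence. The table and function below are a hypothetical sketch; only the 1-9 ranges and the best-end directions for these three sources come from the commit messages.

```python
# (low, high, best_end) per source; best_end says which end of the scale is best.
SCALE_DIRECTION = {
    "bayer_seeds": (1, 9, "high"),
    "golden_harvest": (1, 9, "high"),
    "nk": (1, 9, "low"),
}


def scale_preamble(source: str) -> str:
    """Render a one-line reminder of how to read this source's ratings."""
    low, high, best_end = SCALE_DIRECTION[source]
    best = high if best_end == "high" else low
    worst = low if best_end == "high" else high
    return f"Ratings use a {low}-{high} scale: {best} = best, {worst} = worst."


def is_better(source: str, a: int, b: int) -> bool:
    """True when rating a beats rating b under the source's convention."""
    return a > b if SCALE_DIRECTION[source][2] == "high" else a < b


print(scale_preamble("nk"))
print(is_better("nk", 2, 5), is_better("bayer_seeds", 2, 5))
```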
Cross-vendor coverage after this PR: 760 varieties.
bayer_seeds 475 (DEKALB 288 / Asgrow 102 / WestBred 85)
golden_harvest 139
nk 122 (41 corn / 81 soy)
agripro 24 (12 HRS / 7 SWW / 3 HRW / 1 HWS / 1 Barley)
Vendors: Bayer, Syngenta. Brands: 6. Crops: corn, soy, wheat (109
wheat now, up from 85).
requirements.txt: pdfplumber>=0.11 for NK tech-sheet parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
75f714b454 |
Phase 4-5: deployable container + corpus snapshot + CI fixes
deploy/docker-compose.yml: replace <product>/<registry> placeholders with concrete values for Drawbar's stack:
- image: git.jpaul.io/justin/seed-mcp:latest (CF tunnel for pulls; CI pushes via LAN 192.168.0.2:1234 to avoid 100 MB body cap)
- container_name: seed-mcp
- port 8001:8000 (8001 host-side to not collide with crop-chem-docs on 8000)
- PRODUCT_NAME=crop_seed, hybrid search enabled, stateless HTTP
- llama-rerank shared with crop-chem-docs (NOT redefined here; expected to already be in Drawbar's parent compose network)
- networks.drawbar-mcp external: true so seed-mcp joins the existing cross-MCP shared network
.gitignore: corpus/ is now COMMITTED, not ignored. The monthly refresh workflow scrapes and commits corpus changes; the image-only workflow rebuilds indexes from the committed corpus. Allowing the corpus to flow through git means the :corpus-YYYY.MM.DD image tag pins to a specific seed-catalog snapshot. chroma/ and bm25/ remain ignored; those are deterministically derived from corpus.
Initial committed snapshot: 614 varieties.
- bayer_seeds: 475 (DEKALB 288 + Asgrow 102 + WestBred 85)
- golden_harvest: 139 (Syngenta corn + soy; 36 sitemap URLs 302-redirected = discontinued)
rag/chunk.py: normalize brand and crop to uppercase/lowercase in Chroma metadata so cross-vendor brand-filter lookups don't break on casing inconsistency (Bayer stores "DEKALB", Golden Harvest stores "Golden Harvest"; _build_where uppercases user-supplied brand which matched the former but not the latter pre-fix). Sidecar JSON keeps original casing for display.
Stub scrapers (nk, agripro, becks_pfr, becks_products): change return code from 2 to 0 so the monthly-refresh CI workflow doesn't fail on deferred sources. Real implementations will return 0 on success / 1 on failure when they ship.
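The casing fix described for rag/chunk.py above can be sketched as below. It reads "uppercase/lowercase" as brand uppercased and crop lowercased, which is an interpretation, and `normalize_facets` is a hypothetical name.

```python
def normalize_facets(brand: str, crop: str) -> dict:
    """Normalize filterable metadata; display casing stays in the sidecar JSON."""
    return {"brand": brand.strip().upper(), "crop": crop.strip().lower()}


# Both vendors now land on the same casing the query side produces.
stored = normalize_facets("Golden Harvest", "Corn")
query_brand = "golden harvest".upper()  # mirrors _build_where uppercasing input
print(stored, stored["brand"] == query_brand)
```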
Smoke-tested cross-vendor retrieval against the 614-chunk index:
- list_versions shows both vendors with correct facet counts
- broad "corn hybrid 100 RM" query returns both DEKALB and Golden Harvest hits in top 5
- brand='Golden Harvest' filter returns 3 GH-only varieties
- variety-code prefilter still works (E085Z5 → top hit on GH)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |