ac40e05734
Sibling project to crop-chem-docs, same MCP-template lineage. Corpus is
seed/hybrid varieties across 6 vendors instead of pesticide labels.

What's customized vs. the template:
- CLAUDE.md: vendor matrix, build priority, Pioneer fallback policy,
  canonical sidecar schema (per-crop), Golden Harvest disease-scale
  reversal gotcha, no-IPv6 / HTTPS-clone note
- README.md: vendor coverage table, tool list, phase status
- Dockerfile: PRODUCT_NAME=crop_seed default, sources.json (not
  bundles.json), HYBRID_SEARCH=true, OLLAMA_URL + RERANK_URL Docker
  DNS defaults (same llama-rerank sidecar as crop-chem-docs)
- .gitea/workflows/refresh.yml: monthly cron (seed catalogs move
  slowly), 5 GREEN scraper steps, corpus-YYYY.MM.DD tag for Drawbar
  pinning, continue-on-error on GC step
- .gitea/workflows/image-only.yml: paths filter + cancel-in-progress
  concurrency group
- scripts/registry_gc.py: lifted from crop-chem-docs (correct Gitea
  packages API URL + UA header to bypass CF block on default
  Python-urllib UA)
- sources.json: catalog of 6 sources + scope_filter + per-source
  schema notes + Pioneer-exclusion rationale
- scrape/runner.py: dispatcher with --all = GREEN-only
- scrape/sources/{bayer_seeds,golden_harvest,nk,agripro,becks_pfr,
  becks_products}.py: stub modules with implementation notes
- docs_mcp/server.py: PRODUCT_NAME default → crop_seed,
  PRODUCT_DOCS_URL → repo URL

Pioneer is intentionally NOT a source. ToS bans automation; dealer
locator is login-gated. The MCP returns a curated fallback lesson
directing the user to pioneer.com.

Next phases:
- Phase 1: implement bayer_seeds (lift-and-shift from crop-chem-docs
  Bayer scraper; same __NEXT_DATA__ infra)
- Phase 7: curate eval/queries.jsonl
- Phase 11: lessons.md with Pioneer fallback + disease-scale notes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
127 lines
4.1 KiB
Python
"""Markdown chunker: paragraph-aware, ~400-600 token target.

Adjust the chunking strategy per product if your page format differs
significantly from prose. The output shape (id, text, metadata) is
fixed by the downstream Chroma + BM25 indexing in rag/index.py; don't
change that.

The key knob you'll tune per product is chunk 0. Dense retrieval lands
on chunk 0 first for most queries. Make it a synthetic chunk built
from:

- the page title (as natural-language H1)
- a 1-sentence task description (you'll have to generate this; for
  pages that already have a "## Overview" or "## Introduction" the
  first sentence usually works)
- a keyword bag of important terms (filenames, API names, error
  codes: the rare technical tokens that BM25 lights up on)

Without a rich chunk 0, dense retrieval gets dominated by the much
larger prose body, and short pages (script examples, reference cards)
get buried.
"""
from __future__ import annotations

import re
from typing import Iterator


# Approximate token estimate from char count. Tunable: set per
# embedder if the default 4 chars/token is wrong.
CHARS_PER_TOKEN = 4
TARGET_TOKENS = 500
TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)


def split_paragraphs(md: str) -> list[str]:
    """Split markdown into paragraph-ish blocks.

    Keeps fenced code blocks together (don't slice through ```).
    Headings start new paragraphs.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            # Opening or closing fence: toggle state, keep the line.
            in_fence = not in_fence
            current.append(line)
            continue
        if in_fence:
            # Inside a fence nothing splits, not even blank lines or "#".
            current.append(line)
            continue
        if stripped.startswith("#"):
            # A heading flushes the pending block and starts a new one.
            if current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
            continue
        if not stripped:
            # A blank line closes the pending block, if there is one.
            if current:
                blocks.append("".join(current).strip())
                current = []
            continue
        current.append(line)
    if current:
        blocks.append("".join(current).strip())
    return [b for b in blocks if b]


def chunks_from_page(
    text: str,
    page_id: str,
    metadata: dict,
) -> Iterator[dict]:
    """Yield chunk dicts ready for index.py to upsert.

    The synthetic chunk 0 is the per-product customization point. The
    default below is a simple title + body-first-paragraph; rewrite
    for richer retrieval signal (see module docstring).
    """
    paragraphs = split_paragraphs(text)
    if not paragraphs:
        return

    # ----- Chunk 0: synthetic anchor for dense retrieval ---------
    title = metadata.get("title") or page_id
    first_para = next((p for p in paragraphs if not p.startswith("#")), "")
    chunk0_body = (
        f"# {title}\n\n"
        f"{first_para[:300]}"
        # TODO per product: append a keyword bag here (filenames,
        # API names, error codes) for BM25 + dense joint coverage.
    )
    yield {
        "id": f"{metadata['bundle_id']}::{page_id}::0",
        "text": chunk0_body,
        "metadata": {**metadata, "ordinal": 0},
    }

    # ----- Body chunks: pack paragraphs up to TARGET_CHARS -------
    ordinal = 1
    buf: list[str] = []
    buf_chars = 0
    for p in paragraphs:
        # Flush before adding a paragraph that would overflow the target.
        if buf_chars + len(p) > TARGET_CHARS and buf:
            yield {
                "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
                "text": "\n\n".join(buf),
                "metadata": {**metadata, "ordinal": ordinal},
            }
            ordinal += 1
            buf = []
            buf_chars = 0
        buf.append(p)
        buf_chars += len(p)
    if buf:
        yield {
            "id": f"{metadata['bundle_id']}::{page_id}::{ordinal}",
            "text": "\n\n".join(buf),
            "metadata": {**metadata, "ordinal": ordinal},
        }
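The TODO in the chunk 0 block asks for a per-product keyword bag. One possible way to fill it in is sketched below. This is only an illustration under stated assumptions: `keyword_bag` and `build_chunk0` are hypothetical helpers that do not exist in this repo, and the "technical token" heuristic (a token that contains a digit or an internal `.`, `_` or `-`) is a guess that would need tuning against the real corpus.

```python
import re
from collections import Counter

# Hypothetical heuristic: runs of alphanumerics, optionally joined by
# ".", "_" or "-" (e.g. "config.yaml", "E1001", "XY12-34").
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*")


def keyword_bag(text: str, limit: int = 12) -> str:
    """Return up to `limit` technical-looking tokens, most frequent first."""
    counts = Counter()
    for tok in _TOKEN_RE.findall(text):
        # Keep only tokens with a digit or an internal separator;
        # plain dictionary words are dropped.
        if any(c.isdigit() for c in tok) or any(c in "._-" for c in tok):
            counts[tok] += 1
    # most_common sorts by count; ties keep first-seen order.
    return " ".join(tok for tok, _ in counts.most_common(limit))


def build_chunk0(title: str, first_para: str, body: str) -> str:
    """Assemble a synthetic chunk 0: title, lead sentence, keyword bag."""
    parts = [f"# {title}", first_para[:300]]
    bag = keyword_bag(body)
    if bag:
        parts.append(f"Keywords: {bag}")
    return "\n\n".join(parts)


sample = "See config.yaml and error E1001; config.yaml again. Plain words here."
print(keyword_bag(sample))
print(build_chunk0("Setup", "Intro.", "See config.yaml"))
```

If something along these lines were adopted, `chunk0_body` in `chunks_from_page` could be built by calling `build_chunk0(title, first_para, text)`, leaving the output shape (id, text, metadata) unchanged.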