fix(chunker): drop MAX_CHARS 6000 → 4000 for table-dense content
Qualification matrix run 107 crashed at chunk 65 with HTTP 400 from Ollama: "the input length exceeds the context length".

Reproduced locally. The offending chunk was 5839 chars (~1785 word-units), but nomic-embed-text's subword tokenizer counts every `|` table separator and short cell as its own token, pushing the real token count past 2048 even though my 4-chars/token heuristic put it at ~1460. Markdown tables tokenize ~1.4× denser than prose. The HVM Qualification Matrix's Server/Storage/ISV tables are exactly the kind of content that trips this.

Dropped MAX_CHARS from 6000 to 4000: empirically safe for the densest content we have, and it still leaves 2-3× headroom for the typical 400-600-token target.

Side effect: qualification matrix chunks went 19 → 11 (some merged back, some split further); max chunk size is now 3984 chars across the whole corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+6
-2
@@ -32,8 +32,12 @@ CHARS_PER_TOKEN = 4
 TARGET_TOKENS = 500
 TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
 # Hard cap: nomic-embed-text's context is 2048 tokens. Anything larger
-# 400s the entire embed batch. 6000 chars ≈ 1500 tokens leaves headroom.
-MAX_CHARS = 6000
+# 400s the entire embed batch. 6000 chars works for prose but markdown
+# tables with lots of `|` separators tokenize ~1.4× denser; a 5839-char
+# table chunk from the HVM qualification matrix tokenized past 2048 and
+# crashed the rebuild. 4000 chars stays under 2048 tokens even for
+# dense table content while leaving headroom for the query side.
+MAX_CHARS = 4000
 
 
 def _hard_split(text: str) -> list[str]:
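The arithmetic behind the new cap can be sketched as follows. This is an illustrative estimate only, not code from the chunker: `estimate_tokens`, `fits_context`, `overflow_density`, `CONTEXT_TOKENS`, and `TABLE_DENSITY` are hypothetical names, and the 1.4 factor is the approximate table-versus-prose ratio quoted in the commit message, not a measured tokenizer output.

```python
import math

CHARS_PER_TOKEN = 4      # prose heuristic used by the chunker
CONTEXT_TOKENS = 2048    # nomic-embed-text context window, per the code comment
TABLE_DENSITY = 1.4      # assumed: approximate table-vs-prose token density


def estimate_tokens(text: str, density: float = 1.0) -> int:
    """Estimate tokens from character length; density > 1 models table-heavy text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN * density)


def fits_context(text: str, density: float = TABLE_DENSITY) -> bool:
    """True if the density-adjusted estimate stays within the context window."""
    return estimate_tokens(text, density) <= CONTEXT_TOKENS


def overflow_density(n_chars: int) -> float:
    """Smallest density factor at which a chunk of n_chars fills the context window."""
    return CONTEXT_TOKENS * CHARS_PER_TOKEN / n_chars


if __name__ == "__main__":
    # The prose heuristic alone puts the 5839-char chunk well under the limit.
    print(estimate_tokens("x" * 5839))
    # A chunk at the old 6000-char cap overflows once table density is applied.
    print(fits_context("x" * 6000))
    # A chunk at the new 4000-char cap still fits at the same density.
    print(fits_context("x" * 4000))
    # Density the 5839-char chunk needed to reach 2048 tokens, and the
    # density a 4000-char chunk would need to do the same.
    print(round(overflow_density(5839), 3), round(overflow_density(4000), 3))
```

Under these assumptions a chunk at the old 6000-char cap exceeds 2048 tokens as soon as the table factor applies, while a 4000-char chunk would only overflow if the content tokenized more than about twice as densely as prose. The 5839-char chunk sits right at the boundary: it needs a factor of just over 1.40 to overflow, which is consistent with the "~1.4×" figure being a rounded observation.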