fix(chunker): MAX_CHARS 6000 → 4000 for table-dense content #6

Merged

justin merged 1 commit from fix/chunk-cap-table-density into main

2026-05-22 15:11:23 -04:00

Author

SHA1

Message

Date

justin

eb18335715

fix(chunker): drop MAX_CHARS 6000 → 4000 for table-dense content

Qualification matrix run 107 crashed at chunk 65 with HTTP 400 from
Ollama: "the input length exceeds the context length". Reproduced
locally: the offending chunk was 5839 chars (~1785 word-units), but
nomic-embed-text's WordPiece tokenizer counts every `|` table separator
and short cell as its own token, pushing the real token count past the
2048-token context limit even though my 4-chars/token heuristic put it
at ~1460.
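
For reference, the arithmetic behind those numbers can be reproduced with a few lines of Python. This is only a sketch of the heuristic described above; the names `CONTEXT_TOKENS`, `CHARS_PER_TOKEN` and `heuristic_tokens` are illustrative and not taken from the chunker source.

```python
import math

CONTEXT_TOKENS = 2048   # limit implied by the Ollama error quoted above
CHARS_PER_TOKEN = 4     # the prose heuristic mentioned in this message


def heuristic_tokens(n_chars: int) -> int:
    """Estimate the token count as chars / 4, rounded up."""
    return math.ceil(n_chars / CHARS_PER_TOKEN)


estimate = heuristic_tokens(5839)            # size of the offending chunk
implied_density = CONTEXT_TOKENS / estimate  # minimum factor needed to overflow

# The heuristic says the chunk fits, yet the request was rejected, so the
# real tokenization must be at least ~1.4x denser than the estimate.
print(estimate, estimate < CONTEXT_TOKENS, round(implied_density, 2))
# prints: 1460 True 1.4
```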

Markdown tables tokenize ~1.4× denser than prose. The HVM Qualification
Matrix's Server/Storage/ISV tables are exactly the kind of content that
trips this. Dropped MAX_CHARS from 6000 to 4000: empirically safe for
the densest content we have, and it still leaves 2-3× headroom over the
typical 400-600-token chunk target.
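
As an illustration of what a character cap like this enforces, here is a minimal sketch of a splitter that packs whole lines into chunks so that table rows are never cut mid-row. It is not the actual chunker from this repository; `split_on_lines` is a hypothetical helper, and only the `MAX_CHARS = 4000` value comes from this commit.

```python
MAX_CHARS = 4000  # cap introduced by this commit (previously 6000)


def split_on_lines(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole lines into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the cap is hard-split so the cap always holds.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the cap.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


# Usage: a synthetic 10,000-character table splits into row-aligned chunks.
table = "| a | b |\n" * 1000
chunks = split_on_lines(table)
print(len(chunks), max(len(c) for c in chunks))
```

A character cap is only a proxy for the model's token limit, so the safe value depends on the densest content in the corpus, which is why the cap here was chosen empirically.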

Side effect: qualification matrix chunks went 19 → 11 (some merged
back, some split further); max chunk size now 3984 chars across the
whole corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 15:11:21 -04:00