From eb1833571587ec3478a3d49119e2ead6c21ade3d Mon Sep 17 00:00:00 2001
From: Justin Paul <justin@jpaul.me>
Date: Fri, 22 May 2026 15:11:21 -0400
Subject: [PATCH] =?UTF-8?q?fix(chunker):=20drop=20MAX=5FCHARS=206000=20?=
 =?UTF-8?q?=E2=86=92=204000=20for=20table-dense=20content?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Qualification matrix run 107 crashed at chunk 65 with HTTP 400 from
Ollama: "the input length exceeds the context length". Reproduced
locally: the offending chunk was 5839 chars (~1785 word-units), but
nomic-embed-text's tokenizer counts every `|` table separator and
short cell as its own token, pushing the real token count past 2048
even though the 4-chars/token heuristic (CHARS_PER_TOKEN) put it at
~1460.

Markdown tables tokenize ~1.4× denser than prose. The HVM Qualification
Matrix's Server/Storage/ISV tables are exactly the kind of content that
trips this. Dropped MAX_CHARS from 6000 to 4000: empirically safe for
the densest content we have, and at ~1000 heuristic tokens it still
leaves roughly 2× headroom over the typical 400-600-token target.

Side effect: qualification matrix chunks went 19 → 11 (some merged
back, some split further); max chunk size now 3984 chars across the
whole corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
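
A rough sketch of the arithmetic behind the new cap, for reviewers. It
assumes the ~1.4× table density quoted above applies uniformly, which is
only an approximation (the 5839-char chunk estimates to ~2044 tokens at
1.4×, so its real density was evidently a little higher). The helpers
`estimate_tokens` and `fits_context` are hypothetical names used for
illustration and are not part of rag/chunk.py; only the real tokenizer
gives an exact count.

```python
CHARS_PER_TOKEN = 4      # heuristic already used in rag/chunk.py
CONTEXT_TOKENS = 2048    # embed context limit cited in the commit message
TABLE_DENSITY = 1.4      # assumed table-vs-prose factor (approximate)


def estimate_tokens(n_chars: int, density: float = 1.0) -> int:
    """Heuristic estimate: chars / CHARS_PER_TOKEN, scaled by a density factor."""
    return round(n_chars / CHARS_PER_TOKEN * density)


def fits_context(n_chars: int, density: float = 1.0) -> bool:
    """True if the heuristic estimate stays within the context limit."""
    return estimate_tokens(n_chars, density) <= CONTEXT_TOKENS


if __name__ == "__main__":
    # Compare the old cap, the crashing chunk, and the new cap.
    for n_chars in (6000, 5839, 4000):
        prose = estimate_tokens(n_chars)
        table = estimate_tokens(n_chars, TABLE_DENSITY)
        ok = fits_context(n_chars, TABLE_DENSITY)
        print(f"{n_chars} chars: prose~{prose}, table~{table}, fits={ok}")
```

Under these assumptions the old 6000-char cap estimates to about 2100
tokens at table density, over the 2048 limit, while 4000 chars estimates
to about 1400 and leaves a margin for content denser than 1.4×.
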
 rag/chunk.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/rag/chunk.py b/rag/chunk.py
index 81ef39c..c937c1f 100644
--- a/rag/chunk.py
+++ b/rag/chunk.py
@@ -32,8 +32,12 @@ CHARS_PER_TOKEN = 4
 TARGET_TOKENS = 500
 TARGET_CHARS = TARGET_TOKENS * CHARS_PER_TOKEN
 # Hard cap: nomic-embed-text's context is 2048 tokens. Anything larger
-# 400s the entire embed batch. 6000 chars ≈ 1500 tokens leaves headroom.
-MAX_CHARS = 6000
+# 400s the entire embed batch. 6000 chars works for prose but markdown
+# tables with lots of `|` separators tokenize ~1.4× denser; a 5839-char
+# table chunk from the HVM qualification matrix tokenized past 2048 and
+# crashed the rebuild. 4000 chars stays under 2048 tokens even for
+# dense table content while leaving headroom for the query side.
+MAX_CHARS = 4000
 
 
 def _hard_split(text: str) -> list[str]: