Problem
The chunker splits all content into ~300-500 token chunks using a uniform token budget. This is tuned for prose paragraphs but produces poor results for other content types:
- Tables: a table split across two chunks loses its header row in the second chunk, making it uninterpretable without reading both. The agent has no way to know the two chunks form a single table.
- Mathematical proofs / formulas: a one-line formula embedded in 400 tokens of surrounding prose wastes budget. Conversely, a multi-step derivation split mid-step loses logical coherence.
- Code listings: a function split at an arbitrary line boundary produces two syntactically invalid fragments.
- Lists / enumerations: splitting a numbered list mid-item loses the numbering context.
The agent sees chunks as atomic units — if a chunk is internally incoherent (half a table, half a proof), the answer quality degrades even when the right chunk is located.
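The table failure mode above is easy to reproduce. The sketch below is a hypothetical stand-in for the real chunker (its actual internals aren't shown in this issue): a uniform token-budget splitter that breaks only at line boundaries, run over a small SBOM-style table with a deliberately tiny budget so the split is visible.

```python
def naive_chunk(text: str, budget: int = 12) -> list[str]:
    """Uniform token-budget chunker: splits at line boundaries only.

    Whitespace split stands in for the real tokenizer; `budget` is
    shrunk from ~500 to 12 so the failure fits in a tiny example.
    """
    chunks, cur, n = [], [], 0
    for line in text.splitlines():
        t = len(line.split())
        if cur and n + t > budget:
            chunks.append("\n".join(cur))
            cur, n = [], 0
        cur.append(line)
        n += t
    if cur:
        chunks.append("\n".join(cur))
    return chunks

table = "\n".join([
    "| pkg | version |",
    "| --- | --- |",
    "| openssl | 3.0.2 |",
    "| zlib | 1.2.13 |",
])
chunks = naive_chunk(table)
# The second chunk holds data rows with no header row, so an agent
# reading it alone cannot tell what the columns mean.
```

Here the header and separator land in `chunks[0]` while the data rows land in `chunks[1]`, which is exactly the uninterpretable-fragment problem described above.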
Impact
This affects both token efficiency (agent reads extra chunks to get complete structures) and answer quality (partial structures lead to incomplete or wrong answers). The SBOM walkthrough's table-heavy content is likely the most affected domain.
Proposed solution
Content-aware boundary detection in chunker.py: detect structural blocks (tables, code fences, LaTeX environments, numbered lists) in the parsed text and treat them as atomic units that shouldn't be split. Allow chunks to exceed the 500-token soft cap when a structural block demands it, with a hard cap (e.g. 1000 tokens) to prevent degenerate cases.
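A minimal sketch of the proposed behavior, under stated assumptions: `chunker.py`'s real internals aren't shown here, so the names (`SOFT_CAP`, `HARD_CAP`, `split_blocks`, `chunk`) and the whitespace tokenizer are placeholders, and only fenced code, markdown table rows, and numbered-list items are detected as structural blocks (LaTeX environments would need an analogous pattern).

```python
import re

SOFT_CAP = 500   # preferred chunk size in tokens
HARD_CAP = 1000  # absolute ceiling; blocks beyond this get split anyway

# Lines that belong to a structural block: markdown table rows or
# numbered-list items. Code fences are handled separately below.
STRUCT = re.compile(r"^\s*(\|.*\||\d+\.\s)")

def token_count(text: str) -> int:
    # Whitespace split stands in for the real tokenizer.
    return len(text.split())

def split_blocks(text: str) -> list[tuple[str, bool]]:
    """Return (block_text, is_atomic) pairs in document order."""
    blocks, buf = [], []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.lstrip().startswith("```"):          # fenced code block
            if buf:
                blocks.append(("\n".join(buf), False)); buf = []
            fence = [line]; i += 1
            while i < len(lines):
                fence.append(lines[i])
                if lines[i].lstrip().startswith("```"):
                    i += 1
                    break
                i += 1
            blocks.append(("\n".join(fence), True))
        elif STRUCT.match(line):                      # table row / list item
            if buf:
                blocks.append(("\n".join(buf), False)); buf = []
            struct = []
            while i < len(lines) and STRUCT.match(lines[i]):
                struct.append(lines[i]); i += 1
            blocks.append(("\n".join(struct), True))
        else:
            buf.append(line); i += 1
    if buf:
        blocks.append(("\n".join(buf), False))
    return blocks

def chunk(text: str) -> list[str]:
    chunks, cur, n = [], [], 0
    def flush():
        nonlocal cur, n
        if cur:
            chunks.append("\n".join(cur)); cur, n = [], 0
    for block, atomic in split_blocks(text):
        t = token_count(block)
        if atomic and t <= HARD_CAP:
            # Atomic block: keep whole, exceeding SOFT_CAP if needed.
            if cur and n + t > SOFT_CAP:
                flush()
            cur.append(block); n += t
        else:
            # Prose (or degenerate oversize block): split line by line.
            for line in block.splitlines():
                lt = token_count(line)
                if cur and n + lt > SOFT_CAP:
                    flush()
                cur.append(line); n += lt
    flush()
    return chunks

# Demo: 490 tokens of prose, then a 50-token table. The table would
# straddle the 500-token boundary, so it is pushed whole into its own chunk.
table = "\n".join("| a | b |" for _ in range(10))
doc = ("p " * 490).strip() + "\n" + table
cs = chunk(doc)
```

The demo yields two chunks, with the table intact in the second; a table larger than `HARD_CAP` would still be split line by line, covering the degenerate case the hard cap is there for.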
Labels
enhancement, chunking