
Fixed chunk granularity breaks on non-prose content #8

@barkain

Description

Problem

The chunker splits all content into ~300-500 token chunks using a uniform token budget. This is tuned for prose paragraphs but produces poor results for other content types:

  • Tables: a table split across two chunks loses its header row in the second chunk, making it uninterpretable without reading both. The agent has no way to know the two chunks form a single table.
  • Mathematical proofs / formulas: a one-line formula embedded in 400 tokens of surrounding prose wastes budget. Conversely, a multi-step derivation split mid-step loses logical coherence.
  • Code listings: a function split at an arbitrary line boundary produces two syntactically invalid fragments.
  • Lists / enumerations: splitting a numbered list mid-item loses the numbering context.

The agent sees chunks as atomic units — if a chunk is internally incoherent (half a table, half a proof), the answer quality degrades even when the right chunk is located.

Impact

This affects both token efficiency (agent reads extra chunks to get complete structures) and answer quality (partial structures lead to incomplete or wrong answers). The SBOM walkthrough's table-heavy content is likely the most affected domain.

Proposed solution

Content-aware boundary detection in chunker.py: detect structural blocks (tables, code fences, LaTeX environments, numbered lists) in the parsed text and treat them as atomic units that shouldn't be split. Allow chunks to exceed the 500-token soft cap when a structural block demands it, with a hard cap (e.g. 1000 tokens) to prevent degenerate cases.
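A minimal sketch of what this could look like, assuming regex-based block detection over the parsed text. The pattern set, `SOFT_CAP`/`HARD_CAP` names, and the whitespace tokenizer stand-in are all illustrative, not the actual chunker.py internals:

```python
import re

SOFT_CAP = 500   # preferred max tokens per chunk
HARD_CAP = 1000  # absolute max, even for atomic structural blocks

# Hypothetical detectors for structural blocks that shouldn't be split.
BLOCK_PATTERNS = [
    re.compile(r"```.*?```", re.DOTALL),                        # fenced code
    re.compile(r"(?:^\|.*\|$\n?)+", re.MULTILINE),              # table rows
    re.compile(r"\\begin\{(\w+)\}.*?\\end\{\1\}", re.DOTALL),   # LaTeX envs
    re.compile(r"(?:^\d+\.\s.*$\n?)+", re.MULTILINE),           # numbered list
]

def token_count(text: str) -> int:
    # Stand-in for the real tokenizer; whitespace split is a rough proxy.
    return len(text.split())

def segment(text: str) -> list[tuple[str, bool]]:
    """Split text into (piece, is_atomic) pairs."""
    spans = sorted(m.span() for p in BLOCK_PATTERNS for m in p.finditer(text))
    merged: list[tuple[int, int]] = []
    for start, end in spans:  # merge overlapping block spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    pieces, pos = [], 0
    for start, end in merged:
        if pos < start:
            pieces.append((text[pos:start], False))
        pieces.append((text[start:end], True))
        pos = end
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces

def chunk(text: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for piece, atomic in segment(text):
        if atomic:
            # Flush prose, then emit the block whole (may exceed SOFT_CAP).
            if current:
                chunks.append("".join(current))
                current, current_tokens = [], 0
            if token_count(piece) <= HARD_CAP:
                chunks.append(piece)
            else:
                # Degenerate case: fall back to plain splitting.
                words = piece.split()
                for i in range(0, len(words), SOFT_CAP):
                    chunks.append(" ".join(words[i:i + SOFT_CAP]))
        else:
            for para in piece.split("\n\n"):
                n = token_count(para)
                if current and current_tokens + n > SOFT_CAP:
                    chunks.append("".join(current))
                    current, current_tokens = [], 0
                current.append(para + "\n\n")
                current_tokens += n
    if current:
        chunks.append("".join(current))
    return chunks
```

With this shape, a table surrounded by prose comes out as its own chunk with the header row intact, rather than being cut wherever the token budget happens to run out.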

Labels

enhancement, chunking
