
Fixed chunk granularity breaks on non-prose content #8

@barkain

Description

Problem

The chunker splits all content into ~300-500 token chunks using a uniform token budget. This is tuned for prose paragraphs but produces poor results for other content types:

  • Tables: a table split across two chunks loses its header row in the second chunk, making it uninterpretable without reading both. The agent has no way to know the two chunks form a single table.
  • Mathematical proofs / formulas: a one-line formula embedded in 400 tokens of surrounding prose wastes budget. Conversely, a multi-step derivation split mid-step loses logical coherence.
  • Code listings: a function split at an arbitrary line boundary produces two syntactically invalid fragments.
  • Lists / enumerations: splitting a numbered list mid-item loses the numbering context.

The agent sees chunks as atomic units — if a chunk is internally incoherent (half a table, half a proof), the answer quality degrades even when the right chunk is located.

Impact

This affects both token efficiency (agent reads extra chunks to get complete structures) and answer quality (partial structures lead to incomplete or wrong answers). The SBOM walkthrough's table-heavy content is likely the most affected domain.

Proposed solution

Content-aware boundary detection in chunker.py: detect structural blocks (tables, code fences, LaTeX environments, numbered lists) in the parsed text and treat them as atomic units that shouldn't be split. Allow chunks to exceed the 500-token soft cap when a structural block demands it, with a hard cap (e.g. 1000 tokens) to prevent degenerate cases.
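A minimal sketch of what this could look like, assuming regex-based block detection over the parsed text. The pattern set, `SOFT_CAP`/`HARD_CAP` names, and the whitespace tokenizer stand-in are all illustrative, not the actual chunker.py internals:

```python
import re

SOFT_CAP = 500   # preferred max tokens per chunk
HARD_CAP = 1000  # absolute max, even for atomic structural blocks

# Hypothetical detectors for structural blocks that shouldn't be split.
BLOCK_PATTERNS = [
    re.compile(r"```.*?```", re.DOTALL),                        # fenced code
    re.compile(r"(?:^\|.*\|$\n?)+", re.MULTILINE),              # table rows
    re.compile(r"\\begin\{(\w+)\}.*?\\end\{\1\}", re.DOTALL),   # LaTeX envs
    re.compile(r"(?:^\d+\.\s.*$\n?)+", re.MULTILINE),           # numbered list
]

def token_count(text: str) -> int:
    # Stand-in for the real tokenizer; whitespace split is a rough proxy.
    return len(text.split())

def segment(text: str) -> list[tuple[str, bool]]:
    """Split text into (piece, is_atomic) pairs."""
    spans = sorted(m.span() for p in BLOCK_PATTERNS for m in p.finditer(text))
    merged: list[tuple[int, int]] = []
    for start, end in spans:  # merge overlapping block spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    pieces, pos = [], 0
    for start, end in merged:
        if pos < start:
            pieces.append((text[pos:start], False))
        pieces.append((text[start:end], True))
        pos = end
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces

def chunk(text: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for piece, atomic in segment(text):
        if atomic:
            # Flush prose, then emit the block whole (may exceed SOFT_CAP).
            if current:
                chunks.append("".join(current))
                current, current_tokens = [], 0
            if token_count(piece) <= HARD_CAP:
                chunks.append(piece)
            else:
                # Degenerate case: fall back to plain splitting.
                words = piece.split()
                for i in range(0, len(words), SOFT_CAP):
                    chunks.append(" ".join(words[i:i + SOFT_CAP]))
        else:
            for para in piece.split("\n\n"):
                n = token_count(para)
                if current and current_tokens + n > SOFT_CAP:
                    chunks.append("".join(current))
                    current, current_tokens = [], 0
                current.append(para + "\n\n")
                current_tokens += n
    if current:
        chunks.append("".join(current))
    return chunks
```

With this shape, a table surrounded by prose comes out as its own chunk with the header row intact, rather than being cut wherever the token budget happens to run out.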

Labels

enhancement, chunking
