Generated by running the current quality benchmark suite against v1.0.0, v1.1.0, and v1.2.0 source code.
Compression ratio by scenario:

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Coding assistant | 1.68x | 1.94x | 1.94x | improved v1.0→v1.1 |
| Long Q&A | 6.16x | 4.90x | 4.90x | reduced (was over-compressing) |
| Tool-heavy | 1.30x | 1.41x | 1.40x | stable |
| Deep conversation | 2.12x | 2.50x | 2.50x | improved v1.0→v1.1 |
| Technical explanation | 1.24x | 1.24x | 1.24x | stable |
| Structured content | 1.24x | 1.26x | 1.26x | stable |
| Agentic coding session | 1.00x | 1.00x | 1.00x | no compression (correct) |
| Giant single message | 2.83x | 2.83x | 2.83x | stable |
| Entity-dense technical | 1.20x | 1.56x | 1.56x | improved v1.0→v1.1 |
| Prose-only conversation | 1.70x | 3.37x | 3.37x | large improvement v1.0→v1.1 |
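A ratio like the ones above is conventionally total input size over total output size, so 1.00x means no compression. A minimal sketch (the function name and character-based sizing are assumptions, not the benchmark's actual API):

```python
def compression_ratio(input_text: str, output_text: str) -> float:
    """Ratio of original size to compressed size; 1.00x means no compression."""
    if not output_text:
        raise ValueError("output must be non-empty")
    return len(input_text) / len(output_text)

# A 1.94x ratio means the output is roughly half the size of the input.
ratio = compression_ratio("x" * 1940, "x" * 1000)
print(f"{ratio:.2f}x")  # 1.94x
```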
Entity retention by scenario (scenarios with notable changes):

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Coding assistant | 94% | 94% | 94% | stable |
| Tool-heavy | 70% | 70% | 80% | improved in v1.2 |
| Structured content | 100% | 68% | 68% | regressed v1.0→v1.1 |
| Entity-dense technical | 68% | 53% | 53% | regressed v1.0→v1.1 |
| Mixed languages | 100% | 67% | 67% | regressed v1.0→v1.1 |
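An entity-retention metric of this shape is typically the fraction of known identifiers (hostnames, versions, PR numbers) that survive compression verbatim. A sketch under that assumption, using identifiers named later in this report:

```python
def entity_retention(entities: list[str], compressed: str) -> float:
    """Fraction of tracked entities that appear verbatim in the compressed
    output. Exact substring matching is an assumption about the metric."""
    if not entities:
        return 1.0
    kept = sum(1 for entity in entities if entity in compressed)
    return kept / len(entities)

# Identifiers from the Entity-dense scenario discussed below.
entities = ["redis-prod-001", "v22.3.0", "PR #142"]
summary = "Deployed v22.3.0 to the production cluster."
print(entity_retention(entities, summary))  # roughly 0.33: only v22.3.0 survives
```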
Probe pass rate by scenario (scenarios with notable changes):

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Long Q&A | 86% | 100% | 100% | improved |
| Deep conversation | 44% | 33% | 33% | regressed v1.0→v1.1 |
| Entity-dense technical | 75% | 63% | 63% | regressed v1.0→v1.1 |
| Prose-only conversation | 50% | 50% | 50% | stable |
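Probes check that specific facts survive compression; the report also mentions a minimum-output-length probe. Both kinds can be sketched as predicates over the compressed output (names and shapes here are illustrative, not the suite's real interface):

```python
def substring_probe(expected: str):
    """Passes if a specific fact or identifier survives compression verbatim."""
    return lambda output: expected in output

def min_length_probe(min_chars: int):
    """Passes if the output keeps at least min_chars characters,
    guarding against over-compression."""
    return lambda output: len(output) >= min_chars

def probe_pass_rate(output: str, probes) -> float:
    """Fraction of probes the compressed output satisfies."""
    results = [probe(output) for probe in probes]
    return sum(results) / len(results)

probes = [substring_probe("v22.3.0"), min_length_probe(800)]
short_summary = "Upgraded to v22.3.0."
print(probe_pass_rate(short_summary, probes))  # 0.5: fact kept, but over-compressed
```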
Code preservation: 100% across all versions and all scenarios. Code preservation has never failed.
v1.1.0 improved compression ratios across the board (Coding assistant 1.68x→1.94x, Prose-only 1.70x→3.37x), but this came at a cost: entity retention dropped on three scenarios where the engine started compressing content it should have preserved:
- Structured content: 100% → 68% entity retention — API keys and config values getting summarized
- Entity-dense technical: 68% → 53% — specific identifiers like `redis-prod-001`, `v22.3.0`, `PR #142` dropped
- Mixed languages: 100% → 67% — monitoring details lost in compression
The Long Q&A compression ratio decreased from 6.16x to 4.90x. This is actually an improvement: v1.0.0 was over-compressing and failing the minimum-output probe (output ≥ 800 chars).
v1.2.0 added flow chains, semantic clusters, and other v2 features, but in default mode they left the quality metrics essentially unchanged; the only movement was Tool-heavy entity retention (70%→80%). The v2 features are opt-in and don't affect the default compression path.
Running the quality benchmark with each opt-in feature enabled reveals their effect on compression quality.
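The per-feature runs described below amount to toggling one opt-in feature at a time on top of the defaults. A sketch of that run matrix; all option names here are hypothetical, since the engine's real configuration surface is not documented in this report:

```python
# Hypothetical feature flags, one per opt-in v2 feature named in this report.
DEFAULT = {
    "conversation_flow": False,
    "semantic_clustering": False,
    "importance_scoring": False,
    "contradiction_detection": False,
    "coreference_tracking": False,
}

def with_feature(name: str) -> dict:
    """Config for one benchmark run: defaults plus a single feature enabled."""
    config = dict(DEFAULT)
    config[name] = True
    return config

runs = [with_feature(name) for name in DEFAULT]
print(len(runs))  # 5: one benchmark pass per opt-in feature
```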
Importance scoring and contradiction detection: no measurable impact on any scenario. These features only activate when messages have clear forward-reference patterns or correction signals, and the benchmark scenarios don't trigger them strongly enough.
Semantic clustering: mostly neutral, but degrades Code-only conversation. The ratio rises from 1.00x to 1.30x and the probe pass rate drops from 100% to 75%: the clustering groups code-only messages and compresses them when it shouldn't.
Conversation flow: the most impactful feature, in both directions:
| Scenario | Baseline | With flow | Change |
|---|---|---|---|
| Deep conversation | 2.50x, 33% probes | 4.62x, 100% probes | +67% probe rate — groups Q&A pairs, preserves topic names |
| Long Q&A | 4.90x, 100% probes | 11.80x, 71% probes | -29% probe rate — over-compresses, loses terms |
| Technical explanation | 1.24x, 86% probes | 2.82x, 57% probes | -29% probe rate — loses technical details |
| Structured content | 1.26x, 100% probes | 1.54x, 100% probes | More compression, probes still pass |
| Mixed languages | 1.07x, 100% probes | 1.11x, 100% probes | Minimal change |
Conversation flow dramatically improves Deep conversation (the worst baseline scenario), but over-compresses Long Q&A and Technical explanation. The 25 coherence issues in Deep conversation (up from 6) suggest the summaries need work even though the topic probes pass.
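The Q&A grouping that drives the Deep conversation win can be sketched as pairing each user question with the assistant answer that follows it, so the summarizer compresses whole exchanges instead of isolated messages. The message shape here is a hypothetical data model, not the engine's:

```python
def group_qa_pairs(messages):
    """Group a flat message list into (question, answer) units so each
    exchange is summarized as one block, keeping topic names together."""
    pairs, pending_question = [], None
    for msg in messages:
        if msg["role"] == "user":
            if pending_question is not None:  # question never answered
                pairs.append((pending_question, None))
            pending_question = msg
        elif msg["role"] == "assistant" and pending_question is not None:
            pairs.append((pending_question, msg))
            pending_question = None
    if pending_question is not None:
        pairs.append((pending_question, None))
    return pairs

msgs = [
    {"role": "user", "content": "What is a B-tree?"},
    {"role": "assistant", "content": "A balanced tree for..."},
    {"role": "user", "content": "And an LSM tree?"},
    {"role": "assistant", "content": "A log-structured..."},
]
print(len(group_qa_pairs(msgs)))  # 2 exchanges
```

Over-compression on Long Q&A follows naturally from the same mechanism: once exchanges are merged into larger units, the summarizer has more latitude to drop individual terms.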
Coreference tracking: minimal impact. Entity-dense technical ratio drops from 1.56x to 1.27x (less compression) with slightly higher entity retention (57% vs 53%). The coreference tracking is inlining entity definitions into summaries, which preserves more context but reduces compression.
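The inlining trade-off can be sketched as expanding the first mention of each tracked entity with its stored definition: the summary grows (lower ratio) but keeps context (higher retention). Names and the exact expansion format are illustrative:

```python
def inline_definitions(summary: str, definitions: dict[str, str]) -> str:
    """Expand the first mention of each tracked entity to
    'entity (definition)', trading compression for retained context."""
    for entity, definition in definitions.items():
        if entity in summary:
            summary = summary.replace(entity, f"{entity} ({definition})", 1)
    return summary

defs = {"redis-prod-001": "primary Redis node in prod"}
out = inline_definitions("Restarted redis-prod-001 after the alert.", defs)
print(out)  # Restarted redis-prod-001 (primary Redis node in prod) after the alert.
```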
All features enabled together: combines the conversation flow wins and losses with semantic clustering's code-only regression:
- Deep conversation: 9/9 probes (up from 3/9) but 25 coherence issues
- Long Q&A: 5/7 probes (down from 7/7), entity retention crashes to 7%
- Code-only conversation: 3/4 probes (down from 4/4) from clustering
- Structured content: entity retention drops to 33%
- Conversation flow should be opt-in per scenario type — it helps long multi-topic conversations but hurts focused technical discussions
- Semantic clustering needs a guard against clustering code-only messages
- The v1.1.0 entity retention regression in Structured content, Entity-dense, and Mixed languages is the most actionable fix — the summarizer should preserve identifiers that v1.0.0 kept
- Importance scoring and contradiction detection need scenarios with stronger signal patterns to validate their impact
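The clustering guard recommended above could be as simple as excluding messages that are mostly code before clusters are formed. A sketch; the heuristic and the 60% threshold are assumptions, not the engine's logic:

```python
def is_code_only(message: str, threshold: float = 0.6) -> bool:
    """Heuristic guard: treat a message as code-only if most of its
    non-empty lines look like code (fences, indentation, or statement
    punctuation)."""
    lines = [ln for ln in message.splitlines() if ln.strip()]
    if not lines:
        return False

    def looks_like_code(ln: str) -> bool:
        stripped = ln.strip()
        return (
            ln.startswith(("    ", "\t"))
            or stripped.startswith(("```", "def ", "class ", "import ", "return "))
            or stripped.endswith((";", "{", "}", ":"))
        )

    code_lines = sum(1 for ln in lines if looks_like_code(ln))
    return code_lines / len(lines) >= threshold

def clusterable(messages):
    """Exclude code-only messages so clustering never compresses them."""
    return [m for m in messages if not is_code_only(m)]

print(is_code_only("def add(a, b):\n    return a + b"))  # True
```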