
# Quality History

Back to README | All docs | Benchmarks | Latest Results

Generated by running the current quality benchmark suite against v1.0.0, v1.1.0, and v1.2.0 source code.

## Version Comparison

### Compression Ratio

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Coding assistant | 1.68x | 1.94x | 1.94x | improved v1.0→v1.1 |
| Long Q&A | 6.16x | 4.90x | 4.90x | reduced (was over-compressing) |
| Tool-heavy | 1.30x | 1.41x | 1.40x | stable |
| Deep conversation | 2.12x | 2.50x | 2.50x | improved v1.0→v1.1 |
| Technical explanation | 1.24x | 1.24x | 1.24x | stable |
| Structured content | 1.24x | 1.26x | 1.26x | stable |
| Agentic coding session | 1.00x | 1.00x | 1.00x | no compression (correct) |
| Giant single message | 2.83x | 2.83x | 2.83x | stable |
| Entity-dense technical | 1.20x | 1.56x | 1.56x | improved v1.0→v1.1 |
| Prose-only conversation | 1.70x | 3.37x | 3.37x | large improvement v1.0→v1.1 |
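As a rough illustration of the metric (not the benchmark suite's actual implementation), a compression ratio is simply original size over compressed size:

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Ratio of original size to compressed size; 1.00x means no compression."""
    if not compressed:
        raise ValueError("compressed text is empty")
    return len(original) / len(compressed)

# 1000 chars reduced to 515 chars is roughly the 1.94x coding-assistant figure
print(round(compression_ratio("x" * 1000, "y" * 515), 2))
```

Whether the suite measures characters or tokens is not stated here; the shape of the calculation is the same either way.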

### Entity Retention

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Coding assistant | 94% | 94% | 94% | stable |
| Tool-heavy | 70% | 70% | 80% | improved in v1.2 |
| Structured content | 100% | 68% | 68% | regressed v1.0→v1.1 |
| Entity-dense technical | 68% | 53% | 53% | regressed v1.0→v1.1 |
| Mixed languages | 100% | 67% | 67% | regressed v1.0→v1.1 |
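Entity retention can be sketched as the fraction of distinctive identifiers that survive compression. The regex heuristic below is an assumption for illustration; a real harness would check a curated per-scenario entity list:

```python
import re

def entity_retention(original: str, compressed: str) -> float:
    """Fraction of distinctive identifiers from the original that survive.

    'Entities' are approximated here as tokens that contain a digit
    (e.g. redis-prod-001, v22.3.0, #142); this heuristic is illustrative.
    """
    pattern = r"[A-Za-z#]+[\w.#-]*\d[\w.#-]*"
    entities = set(re.findall(pattern, original))
    if not entities:
        return 1.0
    kept = {e for e in entities if e in compressed}
    return len(kept) / len(entities)

original = "Deployed redis-prod-001 on v22.3.0, see PR #142"
compressed = "Deployed the cache node on v22.3.0"
print(round(entity_retention(original, compressed), 2))  # 1 of 3 ids kept
```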

### Probe Pass Rate

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Long Q&A | 86% | 100% | 100% | improved |
| Deep conversation | 44% | 33% | 33% | regressed v1.0→v1.1 |
| Entity-dense technical | 75% | 63% | 63% | regressed v1.0→v1.1 |
| Prose-only conversation | 50% | 50% | 50% | stable |
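A probe pass rate can be sketched as the fraction of predicates that hold on the compressed output. The two probes below are illustrative; only the min-output-size probe (≥ 800 chars, from the Long Q&A scenario) appears in this document:

```python
def probe_pass_rate(output: str, probes: list) -> float:
    """Each probe is a predicate over the compressed output;
    the rate is the fraction of probes that pass."""
    results = [probe(output) for probe in probes]
    return sum(results) / len(results)

probes = [
    lambda out: len(out) >= 800,          # min-output-size probe (Long Q&A)
    lambda out: "redis-prod-001" in out,  # illustrative must-retain probe
]
print(probe_pass_rate("redis-prod-001 " * 60, probes))
```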

### Code Block Integrity

100% across all versions and all scenarios. Code preservation has never failed.
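A check behind a metric like this might extract fenced blocks and verify each survives verbatim. This is a sketch, not the suite's actual code:

```python
import re

FENCE = "`" * 3  # markdown code-fence delimiter

def code_blocks(text: str) -> list:
    """Extract the bodies of fenced code blocks from markdown text."""
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    return re.findall(pattern, text, flags=re.DOTALL)

def code_integrity(original: str, compressed: str) -> bool:
    """True when every fenced block from the original survives verbatim."""
    return all(body in compressed for body in code_blocks(original))

doc = f"Intro\n{FENCE}python\nprint('hi')\n{FENCE}\nOutro"
summary = f"Summary only.\n{FENCE}python\nprint('hi')\n{FENCE}"
print(code_integrity(doc, summary))
```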

## Key Findings

### v1.0.0 → v1.1.0: More aggressive, less precise

v1.1.0 improved compression ratios across the board (Coding assistant 1.68x→1.94x, Prose-only 1.70x→3.37x), but this came at a cost: entity retention dropped on three scenarios where the engine started compressing content it should have preserved:

- Structured content: 100% → 68% entity retention — API keys and config values getting summarized
- Entity-dense technical: 68% → 53% — specific identifiers like redis-prod-001, v22.3.0, PR #142 dropped
- Mixed languages: 100% → 67% — monitoring details lost in compression

The Long Q&A compression ratio decreased from 6.16x to 4.90x. This is actually an improvement — v1.0.0 was over-compressing and failing the min output ≥ 800 chars probe.

### v1.1.0 → v1.2.0: Stability

v1.2.0 added flow chains, semantic clusters, and other v2 features, but none of them changed quality metrics when running in default mode. The only improvement was Tool-heavy entity retention (70%→80%). The v2 features are opt-in and don't affect the default compression path.
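The configuration surface for these features isn't shown in this document. As a hypothetical sketch (the flag names are illustrative; only the feature list comes from the text), opt-in semantics might look like:

```python
# Hypothetical flag names; every feature defaults to off, matching the
# document's statement that v2 features don't affect the default path.
DEFAULT_FEATURES = {
    "importance_scoring": False,
    "contradiction_detection": False,
    "semantic_clustering": False,
    "conversation_flow": False,
    "coreference": False,
}

def make_config(**overrides):
    """Return a feature config with everything off unless explicitly enabled."""
    unknown = set(overrides) - set(DEFAULT_FEATURES)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    return {**DEFAULT_FEATURES, **overrides}

print(make_config(conversation_flow=True)["conversation_flow"])
```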

## Opt-in Feature Impact (v1.2.0)

Running the quality benchmark with each opt-in feature enabled reveals their effect on compression quality.

### importance + contradiction

No measurable impact on any scenario. These features only activate when messages have clear forward-reference patterns or correction signals — the benchmark scenarios don't trigger them strongly enough.

### semantic clustering

Mostly neutral, but degrades the Code-only conversation scenario: the ratio rises from 1.00x to 1.30x while the probe pass rate drops 25 points (100% to 75%). The clustering groups code-only messages and compresses them when it shouldn't.
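One possible guard against this regression, sketched here with an illustrative heuristic and threshold, would exempt mostly-code messages from clustering:

```python
import re

FENCE = "`" * 3  # markdown code-fence delimiter

def code_fraction(message: str) -> float:
    """Fraction of a message's characters inside fenced code blocks."""
    if not message:
        return 0.0
    bodies = re.findall(FENCE + r"[^\n]*\n(.*?)" + FENCE, message, flags=re.DOTALL)
    return sum(len(b) for b in bodies) / len(message)

def clusterable(messages, threshold=0.6):
    """Keep only messages safe to cluster; the 0.6 cutoff is illustrative."""
    return [m for m in messages if code_fraction(m) < threshold]

code_msg = f"{FENCE}python\nfor i in range(10):\n    print(i)\n{FENCE}"
prose_msg = "We discussed the deployment plan for the cache tier at length today."
print(len(clusterable([code_msg, prose_msg])))  # only the prose message remains
```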

### conversation flow

The most impactful feature — both positive and negative:

| Scenario | Baseline | With flow | Change |
|---|---|---|---|
| Deep conversation | 2.50x, 33% probes | 4.62x, 100% probes | +67 pts probe rate — groups Q&A pairs, preserves topic names |
| Long Q&A | 4.90x, 100% probes | 11.80x, 71% probes | −29 pts probe rate — over-compresses, loses terms |
| Technical explanation | 1.24x, 86% probes | 2.82x, 57% probes | −29 pts probe rate — loses technical details |
| Structured content | 1.26x, 100% probes | 1.54x, 100% probes | More compression, probes still pass |
| Mixed languages | 1.07x, 100% probes | 1.11x, 100% probes | Minimal change |

Conversation flow dramatically improves Deep conversation (the worst baseline scenario), but over-compresses Long Q&A and Technical explanation. The 25 coherence issues in Deep conversation (up from 6) suggest the summaries need work even though the topic probes pass.

### coreference

Minimal impact. Entity-dense technical ratio drops from 1.56x to 1.27x (less compression) with slightly higher entity retention (57% vs 53%). The coreference tracking is inlining entity definitions into summaries, which preserves more context but reduces compression.
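The inlining behavior described above could look roughly like this sketch (the gloss format and helper are assumptions, not the library's actual API):

```python
def inline_definitions(summary: str, definitions: dict) -> str:
    """Gloss the first mention of each tracked entity in a summary.

    `definitions` maps an identifier to the context it was introduced with.
    Inlining preserves more context but lengthens the summary, which matches
    the lower compression ratio reported for coreference.
    """
    out = summary
    for entity, gloss in definitions.items():
        marker = f"{entity} ({gloss})"
        if entity in out and marker not in out:
            out = out.replace(entity, marker, 1)  # gloss only the first mention
    return out

defs = {"redis-prod-001": "primary cache node"}
print(inline_definitions("Restarted redis-prod-001 after the upgrade.", defs))
```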

### all features combined

Combines the conversation flow wins and losses with semantic clustering's code-only regression:

- Deep conversation: 9/9 probes (up from 3/9) but 25 coherence issues
- Long Q&A: 5/7 probes (down from 7/7), entity retention crashes to 7%
- Code-only conversation: 3/4 probes (down from 4/4) from clustering
- Structured content: entity retention drops to 33%

## Recommendations

  1. Conversation flow should be opt-in per scenario type — it helps long multi-topic conversations but hurts focused technical discussions
  2. Semantic clustering needs a guard against clustering code-only messages
  3. The v1.1.0 entity retention regression in Structured content, Entity-dense, and Mixed languages is the most actionable fix — the summarizer should preserve identifiers that v1.0.0 kept
  4. Importance scoring and contradiction detection need scenarios with stronger signal patterns to validate their impact
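Recommendation 1 could be expressed as a per-scenario policy table. This is a sketch; the scenario keys and default-off behavior are assumptions, and only the help/hurt classification comes from the findings above:

```python
# Hypothetical per-scenario policy for the conversation-flow feature:
# it helps long multi-topic conversations but hurts focused technical ones.
FLOW_POLICY = {
    "deep_conversation": True,     # probe rate 33% -> 100% with flow enabled
    "long_qa": False,              # flow over-compresses, loses terms
    "technical_explanation": False,
}

def flow_enabled(scenario: str) -> bool:
    """Default to off for unknown scenarios, matching opt-in semantics."""
    return FLOW_POLICY.get(scenario, False)

print(flow_enabled("deep_conversation"), flow_enabled("long_qa"))
```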