
# Quality History

Back to README | All docs | Benchmarks | Latest Results

Generated by running the current quality benchmark suite against v1.0.0, v1.1.0, and v1.2.0 source code.

## Version Comparison

### Compression Ratio

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Coding assistant | 1.68x | 1.94x | 1.94x | improved v1.0→v1.1 |
| Long Q&A | 6.16x | 4.90x | 4.90x | reduced (was over-compressing) |
| Tool-heavy | 1.30x | 1.41x | 1.40x | stable |
| Deep conversation | 2.12x | 2.50x | 2.50x | improved v1.0→v1.1 |
| Technical explanation | 1.24x | 1.24x | 1.24x | stable |
| Structured content | 1.24x | 1.26x | 1.26x | stable |
| Agentic coding session | 1.00x | 1.00x | 1.00x | no compression (correct) |
| Giant single message | 2.83x | 2.83x | 2.83x | stable |
| Entity-dense technical | 1.20x | 1.56x | 1.56x | improved v1.0→v1.1 |
| Prose-only conversation | 1.70x | 3.37x | 3.37x | large improvement v1.0→v1.1 |
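As a rough illustration of the metric (not the benchmark suite's actual implementation), a compression ratio is simply original size over compressed size:

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Ratio of original size to compressed size; 1.00x means no compression."""
    if not compressed:
        raise ValueError("compressed text is empty")
    return len(original) / len(compressed)

# 1000 chars reduced to 515 chars is roughly the 1.94x coding-assistant figure
print(round(compression_ratio("x" * 1000, "y" * 515), 2))
```

Whether the suite measures characters or tokens is not stated here; the shape of the calculation is the same either way.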

### Entity Retention

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Coding assistant | 94% | 94% | 94% | stable |
| Tool-heavy | 70% | 70% | 80% | improved in v1.2 |
| Structured content | 100% | 68% | 68% | regressed v1.0→v1.1 |
| Entity-dense technical | 68% | 53% | 53% | regressed v1.0→v1.1 |
| Mixed languages | 100% | 67% | 67% | regressed v1.0→v1.1 |
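Entity retention can be sketched as the fraction of distinctive identifiers that survive compression. The regex heuristic below is an assumption for illustration; a real harness would check a curated per-scenario entity list:

```python
import re

def entity_retention(original: str, compressed: str) -> float:
    """Fraction of distinctive identifiers from the original that survive.

    'Entities' are approximated here as tokens that contain a digit
    (e.g. redis-prod-001, v22.3.0, #142); this heuristic is illustrative.
    """
    pattern = r"[A-Za-z#]+[\w.#-]*\d[\w.#-]*"
    entities = set(re.findall(pattern, original))
    if not entities:
        return 1.0
    kept = {e for e in entities if e in compressed}
    return len(kept) / len(entities)

original = "Deployed redis-prod-001 on v22.3.0, see PR #142"
compressed = "Deployed the cache node on v22.3.0"
print(round(entity_retention(original, compressed), 2))  # 1 of 3 ids kept
```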

### Probe Pass Rate

| Scenario | v1.0.0 | v1.1.0 | v1.2.0 | Trend |
|---|---|---|---|---|
| Long Q&A | 86% | 100% | 100% | improved |
| Deep conversation | 44% | 33% | 33% | regressed v1.0→v1.1 |
| Entity-dense technical | 75% | 63% | 63% | regressed v1.0→v1.1 |
| Prose-only conversation | 50% | 50% | 50% | stable |
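A probe pass rate can be sketched as the fraction of predicates that hold on the compressed output. The two probes below are illustrative; only the min-output-size probe (≥ 800 chars, from the Long Q&A scenario) appears in this document:

```python
def probe_pass_rate(output: str, probes: list) -> float:
    """Each probe is a predicate over the compressed output;
    the rate is the fraction of probes that pass."""
    results = [probe(output) for probe in probes]
    return sum(results) / len(results)

probes = [
    lambda out: len(out) >= 800,          # min-output-size probe (Long Q&A)
    lambda out: "redis-prod-001" in out,  # illustrative must-retain probe
]
print(probe_pass_rate("redis-prod-001 " * 60, probes))
```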

### Code Block Integrity

100% across all versions and all scenarios. Code preservation has never failed.
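A check behind a metric like this might extract fenced blocks and verify each survives verbatim. This is a sketch, not the suite's actual code:

```python
import re

FENCE = "`" * 3  # markdown code-fence delimiter

def code_blocks(text: str) -> list:
    """Extract the bodies of fenced code blocks from markdown text."""
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    return re.findall(pattern, text, flags=re.DOTALL)

def code_integrity(original: str, compressed: str) -> bool:
    """True when every fenced block from the original survives verbatim."""
    return all(body in compressed for body in code_blocks(original))

doc = f"Intro\n{FENCE}python\nprint('hi')\n{FENCE}\nOutro"
summary = f"Summary only.\n{FENCE}python\nprint('hi')\n{FENCE}"
print(code_integrity(doc, summary))
```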

## Key Findings

### v1.0.0 → v1.1.0: More aggressive, less precise

v1.1.0 improved compression ratios across the board (Coding assistant 1.68x→1.94x, Prose-only 1.70x→3.37x), but this came at a cost: entity retention dropped on three scenarios where the engine started compressing content it should have preserved:

- Structured content: 100% → 68% entity retention — API keys and config values getting summarized
- Entity-dense technical: 68% → 53% — specific identifiers like redis-prod-001, v22.3.0, PR #142 dropped
- Mixed languages: 100% → 67% — monitoring details lost in compression

The Long Q&A compression ratio decreased from 6.16x to 4.90x. This is actually an improvement — v1.0.0 was over-compressing and failing the min output ≥ 800 chars probe.

### v1.1.0 → v1.2.0: Stability

v1.2.0 added flow chains, semantic clusters, and other v2 features, but none of them changed quality metrics when running in default mode. The only improvement was Tool-heavy entity retention (70%→80%). The v2 features are opt-in and don't affect the default compression path.
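The configuration surface for these features isn't shown in this document. As a hypothetical sketch (the flag names are illustrative; only the feature list comes from the text), opt-in semantics might look like:

```python
# Hypothetical flag names; every feature defaults to off, matching the
# document's statement that v2 features don't affect the default path.
DEFAULT_FEATURES = {
    "importance_scoring": False,
    "contradiction_detection": False,
    "semantic_clustering": False,
    "conversation_flow": False,
    "coreference": False,
}

def make_config(**overrides):
    """Return a feature config with everything off unless explicitly enabled."""
    unknown = set(overrides) - set(DEFAULT_FEATURES)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    return {**DEFAULT_FEATURES, **overrides}

print(make_config(conversation_flow=True)["conversation_flow"])
```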

## Opt-in Feature Impact (v1.2.0)

Running the quality benchmark with each opt-in feature enabled reveals their effect on compression quality.

### importance + contradiction

No measurable impact on any scenario. These features only activate when messages have clear forward-reference patterns or correction signals — the benchmark scenarios don't trigger them strongly enough.

### semantic clustering

Mostly neutral, but degrades the Code-only conversation scenario: the ratio rises from 1.00x to 1.30x while the probe pass rate drops 25 points (100% to 75%). The clustering groups code-only messages and compresses them when it shouldn't.
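One possible guard against this regression, sketched here with an illustrative heuristic and threshold, would exempt mostly-code messages from clustering:

```python
import re

FENCE = "`" * 3  # markdown code-fence delimiter

def code_fraction(message: str) -> float:
    """Fraction of a message's characters inside fenced code blocks."""
    if not message:
        return 0.0
    bodies = re.findall(FENCE + r"[^\n]*\n(.*?)" + FENCE, message, flags=re.DOTALL)
    return sum(len(b) for b in bodies) / len(message)

def clusterable(messages, threshold=0.6):
    """Keep only messages safe to cluster; the 0.6 cutoff is illustrative."""
    return [m for m in messages if code_fraction(m) < threshold]

code_msg = f"{FENCE}python\nfor i in range(10):\n    print(i)\n{FENCE}"
prose_msg = "We discussed the deployment plan for the cache tier at length today."
print(len(clusterable([code_msg, prose_msg])))  # only the prose message remains
```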

### conversation flow

The most impactful feature — both positive and negative:

| Scenario | Baseline | With flow | Change |
|---|---|---|---|
| Deep conversation | 2.50x, 33% probes | 4.62x, 100% probes | +67 pts probe rate — groups Q&A pairs, preserves topic names |
| Long Q&A | 4.90x, 100% probes | 11.80x, 71% probes | −29 pts probe rate — over-compresses, loses terms |
| Technical explanation | 1.24x, 86% probes | 2.82x, 57% probes | −29 pts probe rate — loses technical details |
| Structured content | 1.26x, 100% probes | 1.54x, 100% probes | More compression, probes still pass |
| Mixed languages | 1.07x, 100% probes | 1.11x, 100% probes | Minimal change |

Conversation flow dramatically improves Deep conversation (the worst baseline scenario), but over-compresses Long Q&A and Technical explanation. The 25 coherence issues in Deep conversation (up from 6) suggest the summaries need work even though the topic probes pass.

### coreference

Minimal impact. Entity-dense technical ratio drops from 1.56x to 1.27x (less compression) with slightly higher entity retention (57% vs 53%). The coreference tracking is inlining entity definitions into summaries, which preserves more context but reduces compression.
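The inlining behavior described above could look roughly like this sketch (the gloss format and helper are assumptions, not the library's actual API):

```python
def inline_definitions(summary: str, definitions: dict) -> str:
    """Gloss the first mention of each tracked entity in a summary.

    `definitions` maps an identifier to the context it was introduced with.
    Inlining preserves more context but lengthens the summary, which matches
    the lower compression ratio reported for coreference.
    """
    out = summary
    for entity, gloss in definitions.items():
        marker = f"{entity} ({gloss})"
        if entity in out and marker not in out:
            out = out.replace(entity, marker, 1)  # gloss only the first mention
    return out

defs = {"redis-prod-001": "primary cache node"}
print(inline_definitions("Restarted redis-prod-001 after the upgrade.", defs))
```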

### all features combined

Combines the conversation flow wins and losses with semantic clustering's code-only regression:

- Deep conversation: 9/9 probes (up from 3/9) but 25 coherence issues
- Long Q&A: 5/7 probes (down from 7/7), entity retention crashes to 7%
- Code-only conversation: 3/4 probes (down from 4/4) from clustering
- Structured content: entity retention drops to 33%

## Recommendations

  1. Conversation flow should be opt-in per scenario type — it helps long multi-topic conversations but hurts focused technical discussions
  2. Semantic clustering needs a guard against clustering code-only messages
  3. The v1.1.0 entity retention regression in Structured content, Entity-dense, and Mixed languages is the most actionable fix — the summarizer should preserve identifiers that v1.0.0 kept
  4. Importance scoring and contradiction detection need scenarios with stronger signal patterns to validate their impact
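Recommendation 1 could be expressed as a per-scenario policy table. This is a sketch; the scenario keys and default-off behavior are assumptions, and only the help/hurt classification comes from the findings above:

```python
# Hypothetical per-scenario policy for the conversation-flow feature:
# it helps long multi-topic conversations but hurts focused technical ones.
FLOW_POLICY = {
    "deep_conversation": True,     # probe rate 33% -> 100% with flow enabled
    "long_qa": False,              # flow over-compresses, loses terms
    "technical_explanation": False,
}

def flow_enabled(scenario: str) -> bool:
    """Default to off for unknown scenarios, matching opt-in semantics."""
    return FLOW_POLICY.get(scenario, False)

print(flow_enabled("deep_conversation"), flow_enabled("long_qa"))
```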