All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Quality benchmark overhaul — replaced broken metrics (keywordRetention, factRetention, negationErrors) with five meaningful ones: task-based probes (~70 across 13 scenarios), information density, compressed-only quality score, negative compression detection, and summary coherence checks.
- Task-based probes — hand-curated per-scenario checks that verify whether specific critical information (identifiers, code patterns, config values) survives compression. Probe failures surface real quality issues.
- LLM-as-judge scoring (`--llm-judge` flag) — optional LLM evaluation of compression quality. Multi-provider support: OpenAI, Anthropic, Gemini (`@google/genai`), Ollama. Display-only; not used for regression testing.
- Gemini provider for LLM benchmarks via the `GEMINI_API_KEY` env var (default model: `gemini-2.5-flash`).
- Opt-in feature comparison (`--features` flag) — runs the quality benchmark with each opt-in feature enabled to measure its impact vs baseline.
- Quality history documentation (`docs/quality-history.md`) — version-over-version quality tracking across v1.0.0, v1.1.0, and v1.2.0 with opt-in feature impact analysis.
- Min-output-chars probes to catch over-aggressive compression.
- Code block language aliases in benchmarks (typescript/ts, python/py, yaml/yml).
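As a rough illustration of how such aliases can be handled, a lookup map suffices (the map and helper below are hypothetical sketches, not the benchmark's actual code):

```typescript
// Hypothetical alias map and normalizer covering the aliases listed above;
// the benchmark's real implementation may differ.
const LANG_ALIASES: Record<string, string> = {
  ts: "typescript",
  py: "python",
  yml: "yaml",
};

function canonicalLang(lang: string): string {
  const lower = lang.toLowerCase();
  return LANG_ALIASES[lower] ?? lower;
}
```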
- New npm scripts: `bench:quality:judge`, `bench:quality:features`.
- Coherence and negative compression regression thresholds now track increases from baseline, not just zero-to-nonzero transitions.
- Information density regression check only applies when compression actually occurs (ratio > 1.01).
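A minimal sketch of that gating rule (hypothetical helper; the benchmark's actual check is not reproduced in this changelog):

```typescript
// Skip the density regression check when the compression ratio shows no
// meaningful compression (ratio <= 1.01). Hypothetical helper for illustration.
function densityRegressed(
  ratio: number,
  baselineDensity: number,
  currentDensity: number
): boolean {
  if (ratio <= 1.01) return false; // no real compression occurred
  return currentDensity < baselineDensity; // density dropped after compression
}
```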
- Quality benchmark table now shows: `Ratio EntRet CodeOK InfDen Probes Pass NegCp Coher CmpQ`.
- `analyzeQuality()` accepts an optional `CompressOptions` for feature testing.
- Removed the `keywordRetention` metric (tautological — 100% on 12/13 scenarios), the `factRetention` and `factCount` metrics (fragile regex-based fact extractor), the `negationErrors` metric (noisy, rarely triggered), and the `extractFacts()` and `analyzeSemanticFidelity()` functions.
- Quality metrics — `entity_retention`, `structural_integrity`, `reference_coherence`, and a composite `quality_score` (0–1) computed automatically on every compression. Tracks identifier preservation, code fence survival, and reference coherence.
- Relevance threshold (`relevanceThreshold`) — drops low-value messages to compact stubs instead of producing low-quality summaries. Consecutive stubs are grouped. New stat: `messages_relevance_dropped`.
- Tiered budget strategy (`budgetStrategy: 'tiered'`) — alternative to binary search that keeps the recency window fixed and progressively compresses older content (tighten → stub → truncate).
- Entropy scorer (`entropyScorer`) — plug in a small causal LM for information-theoretic sentence scoring. Modes: `'augment'` (weighted average with the heuristic) or `'replace'` (entropy only).
- Conversation flow detection (`conversationFlow: true`) — groups Q&A pairs, request → action → confirmation chains, corrections, and acknowledgments into compression units for more coherent summaries.
- Cross-message coreference (`coreference: true`) — inlines entity definitions into compressed summaries when a preserved message references an entity defined only in a compressed message.
- Semantic clustering (`semanticClustering: true`) — groups consecutive messages by topic using TF-IDF cosine similarity plus entity-overlap Jaccard, and compresses each cluster as a unit.
- Compression depth (`compressionDepth`) — `'gentle'` (default), `'moderate'` (tighter budgets), `'aggressive'` (entity-only stubs), or `'auto'` (progressive escalation until `tokenBudget` fits).
- Discourse-aware summarization (`discourseAware: true`) — experimental EDU-lite decomposition with dependency tracking. Reduces ratio 8–28% without a custom ML scorer; use the exported `segmentEDUs` / `scoreEDUs` / `selectEDUs` directly instead.
- ML token classifier (`mlTokenClassifier`) — per-token keep/remove classification via a user-provided model (LLMLingua-2 style). Includes `createMockTokenClassifier` for testing.
- Importance-weighted retention (`importanceScoring: true`) — per-message importance scoring based on forward-reference density, decision/correction content signals, and recency. Default threshold raised to 0.65.
- Contradiction detection (`contradictionDetection: true`) — detects later messages that correct earlier ones. Superseded messages are compressed with a provenance annotation.
- A/B comparison tool (`npm run bench:compare`) — side-by-side comparison of default vs v2 features.
- V2 Features Comparison section in benchmark output — per-feature and recommended-combo results vs default.
- Adversarial test suite — 8 edge-case tests (pronoun-heavy, scattered entities, correction chains, code-interleaved prose, near-duplicates, 10k+ char messages, mixed SQL/JSON/bash, full round-trip with all features).
- New modules: `entities.ts`, `entropy.ts`, `flow.ts`, `coreference.ts`, `cluster.ts`, `discourse.ts`, `ml-classifier.ts`.
- New types: `ImportanceMap`, `ContradictionAnnotation`, `MLTokenClassifier`, `TokenClassification`, `FlowChain`, `MessageCluster`, `EDU`, `EntityDefinition`.
- Comprehensive V2 features documentation with tradeoff analysis per feature.
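Pulling the option names above together, a consumer-side sketch of the opt-in surface might look like this (the interface is assembled from this changelog's entries, not copied from the package; the real `CompressOptions` type may differ):

```typescript
// Hypothetical shape of the v2 opt-in options named in the entries above.
interface V2OptionsSketch {
  relevanceThreshold?: number;
  budgetStrategy?: "tiered";
  conversationFlow?: boolean;
  coreference?: boolean;
  semanticClustering?: boolean;
  compressionDepth?: "gentle" | "moderate" | "aggressive" | "auto";
  discourseAware?: boolean;
  importanceScoring?: boolean;
  contradictionDetection?: boolean;
}

// All features are opt-in; omitting them keeps the default path.
const opts: V2OptionsSketch = {
  compressionDepth: "moderate",
  semanticClustering: true,
};
```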
- Adaptive summary budgets scale with content density when `compressionDepth` is set to `'moderate'` or higher (entity-dense content gets up to 45% of the budget; sparse content down to 15%).
- Default path (no v2 options) produces output identical to v1.1.0 — all new features are opt-in.
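The adaptive budget rule above can be sketched as a clamp-and-interpolate step (linear interpolation is an assumption on my part; the changelog only states the 15%–45% range):

```typescript
// Hypothetical: map an entity-density score in [0, 1] to a summary budget
// fraction between 15% (sparse) and 45% (entity-dense), clamping input.
function summaryBudgetFraction(entityDensity: number): number {
  const d = Math.min(1, Math.max(0, entityDensity));
  return 0.15 + d * 0.3;
}
```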
- Quality metrics section added to benchmark reporter and generated docs.
- Flow chains no longer skip non-member messages between chain endpoints.
- Semantic clusters restricted to consecutive indices to preserve round-trip ordering.
- Flow chains exclude messages with code fences to prevent structural integrity loss.
## 1.1.0 - 2026-03-19
- Reasoning chain detection in the classifier — preserves chain-of-thought, step-by-step analysis, formal proofs, and multi-step logical arguments as hard T0 (verbatim). Uses a two-tier anchor system: strong anchors (explicit labels like `Reasoning:`, formal inference phrases) trigger on a single match; weak anchors (logical connectives like `therefore`, `hence`, `thus`) require 3+ distinct matches to fire. A defense-in-depth scoring boost in the summarizer ensures reasoning sentences survive even if classification is bypassed.
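The two-tier anchor rule can be sketched roughly like this (the anchor lists below are illustrative samples drawn from the entry above; the classifier's real patterns are not reproduced here):

```typescript
// Hypothetical two-tier anchor check: one strong anchor fires immediately;
// weak anchors need 3+ distinct hits before the text counts as reasoning.
const STRONG_ANCHORS = [/\breasoning:/i]; // e.g. an explicit "Reasoning:" label
const WEAK_ANCHORS = ["therefore", "hence", "thus"]; // logical connectives

function looksLikeReasoning(text: string): boolean {
  if (STRONG_ANCHORS.some((re) => re.test(text))) return true;
  const lower = text.toLowerCase();
  return WEAK_ANCHORS.filter((w) => lower.includes(w)).length >= 3;
}
```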
## 1.0.0 - 2025-02-24
First stable release. Published as `context-compression-engine`.
- Lossless context compression with `compress()` and `uncompress()`
- Code-aware classification: fences, SQL, JSON/YAML, API keys, URLs, and file paths preserved verbatim
- Paragraph-aware sentence scoring in `summarize()`
- Code-bearing message splitting to compress surrounding prose
- Exact and fuzzy cross-message deduplication (enabled by default)
- LLM-powered summarization with `createSummarizer()` and `createEscalatingSummarizer()`
- Three-level fallback: LLM → deterministic → size guard
- `tokenBudget` with binary search over `recencyWindow`
- `forceConverge` hard-truncation pass for guaranteed budget convergence
- Pluggable `tokenCounter` option (default: `ceil(content.length / 3.5)`)
- `embedSummaryId` option to embed summary IDs directly into message content
- Provenance tracking via `_cce_original` metadata (origin IDs, summary hashes, version chains)
- Verbatim store for lossless round-trip (`VerbatimMap` or lookup function)
- Recursive `uncompress()` for multi-round compression chains
- `preserve` option for role-based message protection
- `recencyWindow` to protect recent messages from compression
- Tool/function result compression through the classifier
- Compression stats: `ratio`, `token_ratio`, `messages_compressed`, `messages_removed`
- Input validation on public API surface
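The default token counter noted above can be reproduced directly from its formula; swapping in a real tokenizer is what the pluggable `tokenCounter` option is for:

```typescript
// Default heuristic token counter from the entry above: roughly 3.5
// characters per token, rounded up.
const defaultTokenCounter = (content: string): number =>
  Math.ceil(content.length / 3.5);
```

For example, a 35-character string counts as 10 tokens under this heuristic.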
- 333 tests with coverage across all compression paths
- Benchmark suite with synthetic and real-session scenarios
- LLM benchmark with multi-provider support (Claude, GPT, Gemini, Grok, Ollama)