Token compression for LLM prompts via a learned token selector.
UNTOKEN is an experimental architecture demonstrating adversarial-autoencoder-based token importance scoring. Given N tokens, it returns a subsequence of ~0.3N tokens. The model shipped here (pacifio/untoken-v1) is trained at small scale as a proof of concept: the architecture is the contribution, not the weights.
```bash
pip install untoken
```

Requires Python 3.10+ and PyTorch 2.1+. Works on CPU and GPU.
The model processes up to 480 tokens per chunk (DistilBERT's 512-token limit minus special tokens). At ~20 tokens per average English sentence, that is roughly 20–24 sentences per chunk. Longer inputs are automatically split at sentence boundaries and compressed independently — no truncation occurs.
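The chunking behavior described above can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: the whitespace-based token counter and the regex sentence splitter are stand-ins for the model's real tokenizer and sentence segmenter.

```python
import re


def chunk_sentences(text, max_tokens=480, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk is then compressed independently, so no input is ever truncated.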
For best results, keep individual inputs under ~20 sentences. The model was trained at small scale and performs most reliably on short, self-contained passages.
```python
from untoken import Untoken

ut = Untoken("pacifio/untoken-v2")

texts = [
    "The quick brown fox jumps over the lazy dog and then runs away into the forest.",
    "Scientists discovered a new species of deep-sea fish off the coast of Japan.",
    "The meeting was postponed due to a scheduling conflict with the board of directors.",
    "She completed the marathon in under four hours despite the difficult weather conditions.",
    "The server returned a 503 error after the deployment failed during the migration step.",
]

for text in texts:
    compressed, stats = ut.compress(text, ratio=0.4, return_stats=True)
    print(f"{text[:50]!r}...")
    print(f" -> {compressed!r}")
    print(f" -> {stats['original_tokens']} → {stats['compressed_tokens']} tokens ({stats['savings_pct']}% savings)\n")
```
"""
'The quick brown fox jumps over the lazy dog and th'...
-> 'the quick brown fox jumps over dog'
-> 19 → 9 tokens (52.6% savings)
'Scientists discovered a new species of deep-sea fi'...
-> 'scientists discovered a new species of sea'
-> 18 → 9 tokens (50.0% savings)
'The meeting was postponed due to a scheduling conf'...
-> 'the meeting was postponed due scheduling'
-> 17 → 8 tokens (52.9% savings)
'She completed the marathon in under four hours des'...
-> 'she completed the marathon in hours'
-> 16 → 8 tokens (50.0% savings)
'The server returned a 503 error after the deployme'...
-> 'the server returned a 503 the'
-> 18 → 9 tokens (50.0% savings)
"""Note on v1 weights: The current model was trained on a small dataset and exhibits a known failure mode — it assigns high importance to frequent function words (determiners, auxiliaries) rather than content words. This is a training data scale issue, not an architectural one. The v1 checkpoint demonstrates that the full pipeline runs end-to-end. Improving selection quality requires more training data and longer adversarial fine-tuning.
```python
compressed = ut.compress(text, ratio=0.5)  # keep 50%
compressed = ut.compress(text, ratio=0.2)  # keep 20%
```

No retraining is required: the ratio is applied at inference via top-k selection.
```bash
untoken --model pacifio/untoken-v1 --input prompt.txt --ratio 0.3
```

Inputs exceeding 480 tokens are automatically chunked at sentence boundaries.
with open("document.txt") as f:
text = f.read()
compressed = ut.compress(text, ratio=0.3)| Method | Cosine Sim | ROUGE-L | Compression Ratio |
|---|---|---|---|
| UNTOKEN | 0.878 | 0.459 | 0.304 |
| Random drop | 0.723 | 0.429 | 0.303 |
| Stopword removal | 0.933 | 0.824 | 0.761 |
+15.5pp cosine similarity over random drop at equivalent compression ratio.
The shipped artifact is a single ~300MB model:
- Encoder: DistilBERT-base-uncased (66M parameters)
- Importance head: Linear(768→256) → GELU → Dropout → Linear(256→1) → Sigmoid
- Selection: hard top-k over importance scores, preserving original token order
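The selection step can be sketched in plain Python. The function below is a hypothetical illustration: it consumes pre-computed per-token importance scores (as the sigmoid head would produce) and keeps the top k tokens in their original order.

```python
def select_tokens(tokens, scores, ratio):
    """Keep the round(len * ratio) highest-scoring tokens, in original order."""
    k = max(1, round(len(tokens) * ratio))
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the surviving indices so the subsequence preserves token order.
    return [tokens[i] for i in sorted(top)]
```

Because selection is a pure top-k over scores, any ratio can be applied at inference without retraining.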
Training is a three-phase adversarial autoencoder:
- Supervised warm-up — importance head trained on (original, compressed) pairs from MeetingBank
- Adversarial fine-tuning — full generator trained against a discriminator on CNN/DailyMail
- Hardening — Gumbel-softmax replaced with straight-through estimation to close the train/test gap
The reconstructor and discriminator are training-only and are not shipped.
See ARCHITECTURE.md for full details.
Primary metric — ROUGE-L:
| Target ratio | UNTOKEN v2 | LLMLingua-2 | Random drop | Actual ratio (UNTOKEN / LLMLingua-2) |
|---|---|---|---|---|
| 0.2 | 0.331 | 0.279 | 0.308 | 0.205 / 0.172 |
| 0.3 | 0.455 | 0.406 | 0.430 | 0.305 / 0.262 |
| 0.4 | 0.558 | 0.518 | 0.539 | 0.404 / 0.353 |
| 0.5 | 0.650 | 0.618 | 0.635 | 0.505 / 0.448 |
UNTOKEN v2 leads on ROUGE-L at every compression ratio tested. The gap over LLMLingua-2 is 4–5pp at the lower ratios, narrowing to ~3pp at 0.5. UNTOKEN also consistently outperforms random drop, the zero-learning baseline, which confirms the model is doing meaningful token selection rather than noise.
| Model | Parameters | Relative size |
|---|---|---|
| LLMLingua-2 (XLM-RoBERTa-large) | ~560M | 8.4× larger |
| LLMLingua-2 (BERT-base-multilingual) | ~179M | 2.7× larger |
| UNTOKEN v2 | 66.56M | 1× |
v2 was trained on 7 datasets across diverse domains:
| Dataset | Domain | Supervision type | ~Records |
|---|---|---|---|
| MeetingBank | Meeting transcripts | Paired (summary) | 20K |
| CNN/DailyMail | News articles | Unlabeled | 300K |
| XSum | BBC news | Paired (summary) | 200K |
| DialogSum | Conversation | Paired (summary) | 14K |
| BillSum | Legislation | Paired (summary) | 23K |
| BookSum | Long-form books | Paired (summary) | 12K |
| GSM8K | Math reasoning | Unlabeled (discriminator real pool) | 8K |
See report.md for more details.
- pacifio/untoken-v1: trained on MeetingBank + CNN/DailyMail at small scale.
- pacifio/untoken-v2: trained on the more diverse dataset mix above.
MIT