A minimal, from-scratch implementation of a Byte Pair Encoding (BPE) tokenizer built with TypeScript + Bun. This project demonstrates how modern LLM tokenizers are trained, encoded, and decoded at a low level using raw UTF-8 bytes.
- 🔤 Byte-level base vocabulary (0–255)
- 🔁 BPE training loop to learn merge rules
- 🧩 Subword token generation via pair merging
- 🧠 Custom encoder (text → token IDs)
- 🔄 Custom decoder (token IDs → original text)
- 📦 Fully self-contained — no tokenizer libraries used
```
.
├── data.txt          # Training corpus
├── encode_data.txt   # (Optional) text for encoding tests
├── index.ts          # Main tokenizer implementation
├── package.json
├── tsconfig.json
└── README.md
```
The tokenizer starts with a base vocabulary of 256 tokens, one for each possible byte value (0–255).
It then repeatedly:
- Counts all adjacent token pairs and finds the most frequent one
- Assigns a new token ID (256+)
- Merges that pair throughout the dataset
- Stores the merge rule
Training stops when:
- No pairs remain, or
- The most frequent pair appears only once
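The training loop above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `index.ts` — the function name `trainBPE` and the `"a,b"` string keys for pairs are assumptions:

```typescript
// Illustrative BPE training sketch. Pairs are keyed as "a,b" strings; the
// real project's data structures may differ.
function trainBPE(
  text: string,
  maxMerges = 50
): { tokens: number[]; merges: Map<string, number> } {
  // Base vocabulary: the raw UTF-8 bytes (0–255).
  let tokens: number[] = Array.from(new TextEncoder().encode(text));
  const merges = new Map<string, number>();
  let nextId = 256; // new token IDs start above the byte range

  for (let step = 0; step < maxMerges; step++) {
    // Count every adjacent token pair.
    const counts = new Map<string, number>();
    for (let i = 0; i < tokens.length - 1; i++) {
      const p = `${tokens[i]},${tokens[i + 1]}`;
      counts.set(p, (counts.get(p) ?? 0) + 1);
    }

    // Pick the most frequent pair; stop if the best pair appears only once.
    let best: string | null = null;
    let bestCount = 1;
    for (const [p, c] of counts) {
      if (c > bestCount) { best = p; bestCount = c; }
    }
    if (best === null) break; // no pairs remain, or none repeats

    // Store the merge rule and rewrite the token stream.
    merges.set(best, nextId);
    const [a, b] = best.split(",").map(Number);
    const out: number[] = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        out.push(nextId);
        i++; // skip the second element of the merged pair
      } else {
        out.push(tokens[i]);
      }
    }
    tokens = out;
    nextId++;
  }
  return { tokens, merges };
}
```

On the classic toy corpus `"aaabdaaabac"` this learns three merges (`aa`, then `aaa`, then `aaab`) and shrinks the stream from 11 byte tokens to 5.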
The encode() function:
- Converts text into UTF-8 bytes
- Applies merge rules in learned order
- Replaces matching adjacent pairs with merged token IDs
Example:

```
"h"  + "e" → token 256
"he" + "l" → token 300
```
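A minimal sketch of that encoding pass, assuming the merge rules are kept in a `Map` keyed by `"a,b"` pair strings in learned order (the function name and data layout are illustrative, not the project's actual API):

```typescript
// Illustrative encode() sketch: apply each learned merge rule, in order,
// over the byte stream.
function encode(text: string, merges: Map<string, number>): number[] {
  let tokens: number[] = Array.from(new TextEncoder().encode(text));
  for (const [pair, id] of merges) {
    const [a, b] = pair.split(",").map(Number);
    const out: number[] = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        out.push(id); // replace the matching adjacent pair with the merged ID
        i++;          // skip the consumed second token
      } else {
        out.push(tokens[i]);
      }
    }
    tokens = out;
  }
  return tokens;
}
```

For example, with a single rule merging bytes 104 (`h`) and 101 (`e`) into token 256, `encode("hello", …)` yields `[256, 108, 108, 111]`.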
The decode() function reverses the process:
- Expands merged tokens back into their original pairs
- Repeats until only base byte tokens remain
- Converts bytes back into a UTF-8 string
This guarantees:

```
decode(encode(text)) === text
```
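The decoding side can be sketched the same way, assuming merged IDs map back to the pairs they replaced (again an illustrative shape, not the project's exact code):

```typescript
// Illustrative decode() sketch: invert the merge table, then expand merged
// tokens repeatedly until only base byte tokens (0–255) remain.
function decode(ids: number[], merges: Map<string, number>): string {
  // Invert: merged token ID -> the pair it replaced.
  const expand = new Map<number, [number, number]>();
  for (const [pair, id] of merges) {
    const [a, b] = pair.split(",").map(Number);
    expand.set(id, [a, b]);
  }

  let tokens = [...ids];
  let changed = true;
  while (changed) {
    changed = false;
    const out: number[] = [];
    for (const t of tokens) {
      const pair = expand.get(t);
      if (pair) {
        out.push(pair[0], pair[1]); // expand one level of merging
        changed = true;
      } else {
        out.push(t); // already a base byte token
      }
    }
    tokens = out;
  }
  // All remaining tokens are bytes; rebuild the UTF-8 string.
  return new TextDecoder().decode(new Uint8Array(tokens));
}
```

Because every merged token expands deterministically back to its pair, the round trip is lossless.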
```
bun install
bun dev
```

You’ll see:
- Original byte length
- Tokenized length after merges
- Learned merge rules
- Encoded token sequence
- Decoded output (should match original text)
```
Original Length 1986
After Tokenization Length 898
Stopping — no frequent pairs left
```
This shows that repeated byte patterns were compressed into higher-level tokens.
| Concept | Description |
|---|---|
| UTF-8 Encoding | Text represented as byte sequences |
| Byte Pair Encoding | Iterative subword token learning |
| Vocabulary Growth | Tokens expand beyond raw bytes |
| Greedy Merging | Highest-frequency pair merged first |
| Reversible Tokenization | Lossless encode/decode cycle |
Modern LLMs (GPT, LLaMA, etc.) don’t read raw characters — they read tokens. This project shows exactly how those tokens are created and used.
You now understand:
- How token vocabularies are built
- Why token counts matter
- How multilingual text is handled
- How compression improves efficiency
- Save merge rules to a JSON file
- Load trained tokenizer without retraining
- Add special tokens (BOS, EOS, PAD)
- Benchmark compression ratio
- Visualize merge trees
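The first two ideas (saving and reloading merge rules) mostly reduce to serializing the merge table. One possible JSON shape, sketched here as plain helpers (the function names and format are hypothetical, not part of the project):

```typescript
// Hypothetical persistence helpers: store the merge table as a JSON array of
// ["a,b", id] entries so it round-trips through Map's entry iterator.
function serializeMerges(merges: Map<string, number>): string {
  return JSON.stringify(Array.from(merges.entries()));
}

function deserializeMerges(json: string): Map<string, number> {
  return new Map(JSON.parse(json) as [string, number][]);
}
```

Under Bun, the resulting string could be written with `Bun.write("merges.json", …)` and read back with `Bun.file("merges.json").text()`, letting `encode()`/`decode()` run without retraining.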
MIT — free to use, modify, and learn from.
Built for learning, inspired by real-world LLM tokenizers.