M09 Token Lab
Text, chopped into
pieces an LLM can count.
A live Byte Pair Encoding visualizer. Type any text and watch the algorithm merge the most frequent adjacent pairs into new tokens, one greedy step at a time. It is the same process that carved web text into the 50,257 vocabulary slots inside GPT-2.
Enter text below. The algorithm starts with individual bytes (256 possible values), then iteratively merges the most frequent pair into a new token. Watch the vocabulary grow, the token count shrink, and the compression ratio (bytes divided by tokens) climb. Nearly every modern LLM relies on some variant of this step before its network processes a single token.
§ I Why LLMs do not read words
Large language models do not see words. They see integers.
Before a transformer processes a single sentence, a tokenizer breaks the raw text into a sequence of tokens. Each token is an integer index into a fixed vocabulary. The model never encounters the string "hello". It encounters token 23748, or whatever integer the training run assigned to that byte sequence. This mapping is learned, not designed, and it is learned by the same greedy algorithm running below.
Byte Pair Encoding began as a data compression algorithm (Gage, 1994). Sennrich et al. adapted it to machine translation in 2016, and OpenAI later adopted a byte-level variant for GPT-2. That variant starts with the 256 possible byte values. It scans the training corpus, counts every adjacent pair, and merges the most common pair into a new token. Repeat this 50,000 times and you have a vocabulary that can represent common words in one slot, rare words in several, and never-seen characters by falling back to raw bytes. The result is a compression scheme that happens to be useful for language modeling.
Why does this matter? Because tokenization is where many LLM behaviors originate. A model can struggle with arithmetic because numbers get split into inconsistent chunks of digits rather than aligned place values. It can waste context on Python when its tokenizer never merged runs of spaces and so spends a token on each space of indentation, as GPT-2's did. And it sees "SolidGoldMagikarp" as a single token not because the name is famous, but because that exact string appeared often enough in the tokenizer's training data to earn its own slot, even though the model itself rarely saw it during training. Understanding tokenization is the first step toward understanding what the model actually receives.
§ II · M09.1 The tokenizer · type something
The text is first encoded as UTF-8 bytes. Each byte becomes its own token. Press Step to perform one merge, or toggle Auto-play to watch the algorithm run. The slider sets a target vocabulary size. The algorithm stops when it reaches that size or when no pair appears more than once.
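As a rough sketch of the loop those controls drive, the snippet below trains BPE on a single string and stops on either condition: the target vocabulary size is reached, or no pair appears more than once. The names trainBpe, merges, and compression are hypothetical, and the tie-breaking rule (the first pair counted wins) is an assumption, not necessarily what the demo does.

```javascript
// Minimal BPE training sketch. Names and tie-breaking are illustrative assumptions.
function trainBpe(text, targetVocabSize) {
  const bytes = Array.from(new TextEncoder().encode(text));
  let tokens = bytes.slice();
  const merges = []; // ledger of { pair: [left, right], id, count }
  let nextId = 256;  // IDs 0..255 are reserved for the raw bytes

  while (nextId < targetVocabSize) {
    // Count every adjacent pair in the current sequence.
    const counts = new Map();
    for (let i = 0; i < tokens.length - 1; i++) {
      const key = tokens[i] + "," + tokens[i + 1];
      counts.set(key, (counts.get(key) || 0) + 1);
    }

    // Pick the most frequent pair. Starting at 1 means a pair must occur
    // at least twice to qualify; the first pair counted wins ties.
    let bestKey = null;
    let bestCount = 1;
    for (const [key, count] of counts) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    if (bestKey === null) break; // no pair appears more than once

    // Replace every non-overlapping occurrence, scanning left to right.
    const [left, right] = bestKey.split(",").map(Number);
    const next = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === left && tokens[i + 1] === right) {
        next.push(nextId);
        i++; // skip the right half of the merged pair
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
    merges.push({ pair: [left, right], id: nextId, count: bestCount });
    nextId++;
  }

  const compression = tokens.length === 0 ? 1 : bytes.length / tokens.length;
  return { tokens, merges, vocabSize: nextId, compression };
}

const result = trainBpe("abababab", 300);
console.log(result.tokens, result.vocabSize, result.compression);
```

On "abababab" the loop stops after two merges, well short of the target of 300, because the final two-token sequence contains no pair that repeats.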
§ III How it works
Every character is a sequence of bytes
The tokenizer encodes the input as UTF-8, turning each character into one to four raw byte values. At the start, the vocabulary contains exactly 256 tokens, one for each possible byte. The token sequence is identical to the byte sequence.
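A small illustration of that starting state, using the standard TextEncoder API. The helper name toByteTokens is hypothetical; the characters are written as escapes so the byte counts do not depend on how this page is saved.

```javascript
// Encode a string as UTF-8 and return the byte values as a plain array.
// Before any merges, this array is the token sequence.
function toByteTokens(text) {
  return Array.from(new TextEncoder().encode(text));
}

const ascii = toByteTokens("hi");          // two characters, one byte each
const accented = toByteTokens("\u00e9");   // "é": one character, two bytes
const emoji = toByteTokens("\u{1F642}");   // slightly smiling face: one character, four bytes

console.log(ascii);    // [104, 105]
console.log(accented); // [195, 169]
console.log(emoji);    // [240, 159, 153, 130]
```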
Greedy frequency counting
On every iteration, the algorithm counts all adjacent token pairs in the current sequence. The pair that appears most often is replaced everywhere by a brand-new token ID. That new token is added to the vocabulary, and the sequence shrinks.
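One iteration can be sketched as three small functions. The names countPairs, mostFrequentPair, and mergePair are hypothetical, and two details are assumptions: ties go to the first pair counted, and overlapping occurrences (as in "aaa") are all counted but replaced left to right without overlap.

```javascript
// Count adjacent pairs in a token sequence. Keys are "left,right" strings.
function countPairs(tokens) {
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = `${tokens[i]},${tokens[i + 1]}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Return the most frequent pair, or null for sequences shorter than two tokens.
// Assumption: the first pair counted wins ties.
function mostFrequentPair(counts) {
  let bestKey = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      bestKey = key;
      bestCount = count;
    }
  }
  if (bestKey === null) return null;
  return { pair: bestKey.split(",").map(Number), count: bestCount };
}

// Replace every non-overlapping occurrence of pair with newId, left to right.
function mergePair(tokens, pair, newId) {
  const out = [];
  let i = 0;
  while (i < tokens.length) {
    if (i < tokens.length - 1 && tokens[i] === pair[0] && tokens[i + 1] === pair[1]) {
      out.push(newId);
      i += 2;
    } else {
      out.push(tokens[i]);
      i += 1;
    }
  }
  return out;
}

const tokens = Array.from(new TextEncoder().encode("aaabdaaabac"));
const best = mostFrequentPair(countPairs(tokens)); // pair (97, 97), i.e. "aa"
const merged = mergePair(tokens, best.pair, 256);  // 256 is the first free ID
console.log(best, merged.length);                  // sequence shrinks from 11 to 9
```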
Vocabulary size is a hyperparameter
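Training stops after a preset number of merges. GPT-2 used 50,257 total tokens: 256 byte tokens, 50,000 merges, and one end-of-text token. GPT-4's tokenizer uses roughly 100,000. The vocabulary and the ordered merge list are saved as lookup tables, and every future string is tokenized by encoding it to bytes and replaying the merges in the order they were learned.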
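To make the lookup table concrete, here is a minimal sketch that rebuilds a vocabulary from a merge list and uses it to decode token IDs back into text. The names buildVocab and decode and the shape of the merge entries are hypothetical; the two-entry merge list is a toy example, not GPT-2's.

```javascript
// GPT-2's total: 256 byte tokens + 50,000 merges + 1 end-of-text token.
const GPT2_VOCAB_SIZE = 256 + 50000 + 1; // 50257

// Build a map from token ID to the byte sequence it stands for.
// Each merged token is the concatenation of its two parents' bytes.
function buildVocab(merges) {
  const vocab = new Map();
  for (let b = 0; b < 256; b++) vocab.set(b, [b]);
  for (const { pair: [left, right], id } of merges) {
    vocab.set(id, [...vocab.get(left), ...vocab.get(right)]);
  }
  return vocab;
}

// Decode token IDs by concatenating their bytes and reading them as UTF-8.
function decode(tokens, vocab) {
  const bytes = tokens.flatMap((t) => vocab.get(t));
  return new TextDecoder().decode(new Uint8Array(bytes));
}

// Toy merge list (an assumption for illustration): "ab" -> 256, "abab" -> 257.
const toyMerges = [
  { pair: [97, 98], id: 256 },
  { pair: [256, 256], id: 257 },
];
const vocab = buildVocab(toyMerges);
console.log(vocab.size);                // 258
console.log(decode([257, 257], vocab)); // "abababab"
```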
§ IV · M09.2 The edge cases · GPT-2 reference
These are the token counts GPT-2 would produce after training on its full 40 GB WebText corpus, not what the demo above produces. The demo trains BPE on whatever you type, so on a single short string with no repeated pairs it has nothing to merge and falls back to one token per byte. Click any row to load the text into the demo and watch what BPE on one input actually does.
§ V Receipts
§ VI Methodology & Colophon
Pure JavaScript implementation of the original BPE training loop. The input string is encoded via TextEncoder into UTF-8 bytes. The algorithm maintains a token sequence, a vocabulary map, and a merge ledger. At every step it counts adjacent pairs with a frequency map, selects the maximum, and performs a global replacement. Runs at 60 merges per second in auto-play mode.
The demo visualizes the training process, not the inference pass. Real-world tokenizers replay the learned merge list at inference time: given a new string, they encode it to bytes, then greedily apply the merges in the order they were learned. The result is a deterministic sequence of subword tokens. No neural network is involved in tokenization itself.
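The inference pass described above might look like the sketch below. The function name encode and the merge-list shape are hypothetical, and the two merges are a toy example. Production tokenizers typically reach the same result with a faster rank lookup instead of one full scan per merge.

```javascript
// Tokenize a new string by replaying learned merges in training order.
// Each merge is { pair: [left, right], id } (an assumed shape).
function encode(text, merges) {
  let tokens = Array.from(new TextEncoder().encode(text));
  for (const { pair: [left, right], id } of merges) {
    const next = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === left && tokens[i + 1] === right) {
        next.push(id);
        i++; // skip the right half of the merged pair
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
  }
  return tokens;
}

// Toy merge list: "ab" -> 256 was learned first, then "abab" -> 257.
const learnedMerges = [
  { pair: [97, 98], id: 256 },
  { pair: [256, 256], id: 257 },
];

console.log(encode("ababab", learnedMerges)); // [257, 256]
console.log(encode("xyz", learnedMerges));    // [120, 121, 122], raw bytes only
```

The same input always yields the same tokens, and text that matches no merge still encodes, one token per byte.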
Sennrich et al. · Neural Machine Translation of Rare Words with Subword Units (2016)
Karpathy · Let's build the GPT Tokenizer (2024)
OpenAI · GPT-2 tokenizer source
This is the original BPE training algorithm, not the exact GPT-2 tokenizer. Real tokenizers include regex-based pre-tokenization splits, special tokens for padding and end-of-text, and byte-fallback handling for characters outside the vocabulary. The visualizer also omits the ranking tie-breaking rules used in production. It is accurate enough to teach the concept, not to reproduce OpenAI's outputs bit-for-bit.
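For a sense of what the pre-tokenization step adds, the sketch below splits text into chunks before any merging, so that merges never cross a chunk boundary. The pattern is modeled on the split pattern in the GPT-2 tokenizer source and should be treated as an illustrative assumption rather than a verified copy; the names GPT2_STYLE_SPLIT and preTokenize are hypothetical.

```javascript
// Split pattern in the style of GPT-2's pre-tokenizer (an assumption, see above):
// common English contractions, then letter runs, digit runs, and punctuation runs
// (each with an optional leading space), then whitespace.
const GPT2_STYLE_SPLIT =
  /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;

// Return the chunks that BPE would then process independently.
function preTokenize(text) {
  return text.match(GPT2_STYLE_SPLIT) || [];
}

console.log(preTokenize("Hello world 123!")); // ["Hello", " world", " 123", "!"]
console.log(preTokenize("don't stop"));       // ["don", "'t", " stop"]
```

Because letters, digits, and punctuation land in separate chunks, a pair such as "d!" can never become a token, no matter how often it occurs.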