M09 Token Lab

Text, chopped into
pieces an LLM can count.

A live Byte Pair Encoding visualizer. Type any text and watch the algorithm merge the most frequent adjacent pairs into new tokens, one greedy step at a time. The same process turned a corpus of web text into the 50,257 vocabulary slots inside GPT-2.

Enter text below. The algorithm starts with individual bytes (256 possible values), then iteratively merges the most frequent pair into a new token. Watch the vocabulary grow, the token count shrink, and the compression ratio climb. Nearly every modern LLM runs its input through a tokenizer built this way before the network sees anything at all.

Engine

Pure JS · No model · BPE algorithm

Base vocab

256 byte tokens

§ I Why LLMs do not read words

Large language models do not see words. They see integers.

Before a transformer processes a single sentence, a tokenizer breaks the raw text into a sequence of tokens. Each token is an integer index into a fixed vocabulary. The model never encounters the string "hello". It encounters token 23748, or whatever integer the training run assigned to that byte sequence. This mapping is learned, not designed, and it is learned by the same greedy algorithm running below.

Byte Pair Encoding, adapted for machine translation by Sennrich et al. in 2016 and later adopted by OpenAI in byte-level form for GPT-2, starts with a small base alphabet: in the byte-level version, the 256 possible byte values. It scans the training corpus, counts every adjacent pair, and merges the most common pair into a new token. Repeat this 50,000 times and you have a vocabulary that can represent common words in one slot, rare words in several, and never-seen characters by falling back to raw bytes. The result is a compression scheme that happens to be useful for language modeling.

Why does this matter? Because tokenization is where many LLM behaviors originate. A model struggles with arithmetic partly because numbers get split into irregular chunks of digits rather than consistent units. Code can tokenize efficiently where it repeats the same keywords and identifiers, although GPT-2's tokenizer spends a separate token on each space of indentation. And the model sees " SolidGoldMagikarp" as a single token not because the name is famous, but because that exact string appeared often enough in the tokenizer's training data to earn its own slot. Understanding tokenization is the first step toward understanding what the model actually receives.

§ II · M09.1 The tokenizer · type something

The text is first encoded as UTF-8 bytes. Each byte becomes its own token. Press Step to perform one merge, or toggle Auto-play to watch the algorithm run. The slider sets a target vocabulary size. The algorithm stops when it reaches that size or when no pair appears more than once.

Input text

Target vocabulary size: 16 · 32 · 64 · 128 · 256 · 512

Token stream · Merge history · Tokens · Vocab size (256 at start) · Compression (1.00× at start)

§ III How it works

Fig. A · Start with bytes

Every character is a sequence of bytes

The tokenizer encodes the input as UTF-8, turning each character into one to four raw byte values. At the start, the vocabulary contains exactly 256 tokens, one for each possible byte. The token sequence is identical to the byte sequence.
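The byte step can be tried directly in a browser console or in Node.js. This is a minimal sketch, not the demo's own source: the helper name `toByteTokens` is hypothetical, while `TextEncoder` is the standard API the colophon mentions.

```javascript
// Encode a string as UTF-8 and return the byte values as a plain array.
// Each byte value (0..255) doubles as its own initial token ID.
function toByteTokens(text) {
  return Array.from(new TextEncoder().encode(text));
}

console.log(toByteTokens("hi"));      // [104, 105]            one byte per ASCII character
console.log(toByteTokens("\u00e9"));  // [195, 169]            "é" takes two bytes
console.log(toByteTokens("🚀"));      // [240, 159, 154, 128]  the rocket emoji takes four
```

Before any merge has happened, the token count therefore equals the UTF-8 byte count, not the number of visible characters.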

Fig. B · Merge the winners

Greedy frequency counting

On every iteration, the algorithm counts all adjacent token pairs in the current sequence. The pair that appears most often is replaced everywhere by a brand-new token ID. That new token is added to the vocabulary, and the sequence shrinks.
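A single iteration of that count-and-replace step might look like the sketch below. The function names are hypothetical, and ties are broken in favor of the pair seen first, which is an assumption of this sketch rather than a documented rule of the demo.

```javascript
// Count every adjacent pair in a token sequence. Pairs are keyed as "a,b".
function countPairs(tokens) {
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = `${tokens[i]},${tokens[i + 1]}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Return the most frequent pair as [a, b], or null if there are no pairs.
// Assumption: on a tie, the pair encountered first wins.
function mostFrequentPair(tokens) {
  let best = null;
  let bestCount = 0;
  for (const [key, count] of countPairs(tokens)) {
    if (count > bestCount) {
      best = key.split(",").map(Number);
      bestCount = count;
    }
  }
  return best;
}

// Replace every non-overlapping occurrence of [a, b] with newId, left to right.
function mergePair(tokens, [a, b], newId) {
  const out = [];
  let i = 0;
  while (i < tokens.length) {
    if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
      out.push(newId);
      i += 2; // Skip both halves of the merged pair.
    } else {
      out.push(tokens[i]);
      i += 1;
    }
  }
  return out;
}

const bytes = Array.from(new TextEncoder().encode("abababc")); // [97, 98, 97, 98, 97, 98, 99]
const pair = mostFrequentPair(bytes);       // [97, 98], which occurs three times
const merged = mergePair(bytes, pair, 256); // [256, 256, 256, 99]
console.log(pair, merged);
```

Seven byte tokens become four after one merge, and ID 256 is the first slot beyond the base byte vocabulary.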

Fig. C · Stop when rich enough

Vocabulary size is a hyperparameter

Training stops after a preset number of merges. GPT-2 used 50,257 total tokens: 256 byte tokens, 50,000 merges, and one end-of-text token. GPT-4 uses roughly 100,256. The learned merges are saved as an ordered lookup table, and every future string is tokenized by replaying those merges in the order they were learned.

§ IV · M09.2 The edge cases · GPT-2 reference

These are the token counts GPT-2 would produce after training on its full 40 GB corpus, not what the demo above produces. The demo trains BPE on whatever you type, so on a short string with few repeated pairs it has little to merge and stays close to one token per byte. Click any row to load the text into the demo and watch what BPE on one input actually does.

hello

GPT-2: 1 token

Common word. Earned its own slot early in training.

Tokenization

GPT-2: 2 tokens

Token + ization. The suffix is common enough to split off.

127381

GPT-2: 5 tokens

Numbers split into irregular digit chunks, not whole values. Large integers are expensive for LLMs.

🚀

GPT-2: 4 tokens

Emoji fall back to UTF-8 bytes. A single glyph becomes four integers.

aaaaaaaaaa

GPT-2: 1 token

Repeating patterns compress beautifully. Ten bytes, one slot.

§ V Receipts

Base tokens

256

GPT-2 vocabulary

50,257

GPT-4 vocabulary

~100,256

Avg English tokens / word

1.3

Avg tokens / emoji

3–5

§ VI Methodology & Colophon

Engine

Pure JavaScript implementation of the original BPE training loop. The input text is encoded via TextEncoder into UTF-8 bytes. The algorithm maintains a token sequence, a vocabulary map, and a merge ledger. At every step it counts adjacent pairs with a frequency map, selects the maximum, and performs a global replacement. Auto-play runs at 60 merges per second.
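A training loop with those three pieces of state could be sketched as follows. This is not the demo's actual source: the name `trainBpe`, the shape of the returned object, the first-seen tie-breaking, and the definition of compression as original bytes divided by current tokens are all assumptions made for illustration.

```javascript
// Hypothetical sketch of a BPE training loop over a single input string.
function trainBpe(text, targetVocabSize) {
  let tokens = Array.from(new TextEncoder().encode(text)); // token sequence
  const byteCount = tokens.length;

  // Vocabulary map: token ID -> the raw bytes it stands for. IDs 0..255 are bytes.
  const vocab = new Map();
  for (let b = 0; b < 256; b++) vocab.set(b, [b]);

  const merges = []; // merge ledger, in the order merges were learned

  while (vocab.size < targetVocabSize) {
    // Frequency map over adjacent pairs, keyed as "a,b".
    const counts = new Map();
    for (let i = 0; i < tokens.length - 1; i++) {
      const key = `${tokens[i]},${tokens[i + 1]}`;
      counts.set(key, (counts.get(key) || 0) + 1);
    }

    // Select the maximum. Starting at 1 means a pair must occur at least twice.
    let bestKey = null;
    let bestCount = 1;
    for (const [key, count] of counts) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    if (bestKey === null) break; // no pair appears more than once

    const [a, b] = bestKey.split(",").map(Number);
    const id = vocab.size; // next free token ID
    vocab.set(id, [...vocab.get(a), ...vocab.get(b)]);
    merges.push({ pair: [a, b], id, count: bestCount });

    // Global replacement, left to right, without overlapping matches.
    const next = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        next.push(id);
        i++; // skip the second half of the merged pair
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
  }

  // Assumed definition: original byte count divided by current token count.
  const compression = tokens.length > 0 ? byteCount / tokens.length : 1;
  return { tokens, vocab, merges, compression };
}

const result = trainBpe("aaaaaaaaaa", 512);
console.log(result.tokens);        // [257, 257, 256]
console.log(result.merges.length); // 2
```

On ten repeated bytes the loop stops after two merges, well short of the target size, because the remaining pairs each occur only once.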

Inference

The demo visualizes the training process, not the inference pass. Real-world tokenizers replay the learned merge list: given a new string, they encode it to bytes, then apply the merges in the order they were learned. The result is deterministic, and its units are subwords. No neural network is involved in tokenization itself.
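For contrast with training, an inference-side encoder can be sketched in a few lines. The `encode` function and the `{ pair, id }` ledger shape are hypothetical, and the two-entry merge list is hand-written for the example rather than taken from any real tokenizer. Production tokenizers typically reach the same result with a rank lookup instead of a full pass per merge.

```javascript
// Encode new text by replaying a learned merge list in order.
// Assumption: merges is an array of { pair: [a, b], id }, earliest merge first.
function encode(text, merges) {
  let tokens = Array.from(new TextEncoder().encode(text));
  for (const { pair: [a, b], id } of merges) {
    const next = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        next.push(id);
        i++; // consume both halves of the pair
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
  }
  return tokens;
}

// Hand-made ledger: first "ab" -> 256, then "abab" -> 257.
const learned = [
  { pair: [97, 98], id: 256 },
  { pair: [256, 256], id: 257 },
];

console.log(encode("ababab", learned)); // [257, 256]
console.log(encode("cab", learned));    // [99, 256]
```

No counting happens here: the same input and the same ledger always produce the same tokens.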

Reading list

Sennrich et al. · Neural Machine Translation of Rare Words with Subword Units (2016)
Karpathy · Let's build the GPT Tokenizer (2024)
OpenAI · GPT-2 tokenizer source

Limitations

This is the original BPE training algorithm, not the exact GPT-2 tokenizer. Real tokenizers include regex-based pre-tokenization splits, special tokens for padding and end-of-text, and byte-fallback handling for characters outside the vocabulary. The visualizer also omits the ranking tie-breaking rules used in production. It is accurate enough to teach the concept, not to reproduce OpenAI's outputs bit-for-bit.
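To give a sense of what regex-based pre-tokenization does, here is a sketch that ports the split pattern from OpenAI's published GPT-2 encoder to a JavaScript regular expression. The `preTokenize` helper is hypothetical, and because the original runs on Python's `regex` module, whitespace and Unicode edge cases may not match exactly.

```javascript
// GPT-2's split pattern as a JavaScript regex. The `u` flag enables
// \p{L} (letters) and \p{N} (numbers); `g` makes match() return every chunk.
const GPT2_SPLIT = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;

// Split text into chunks. In GPT-2, BPE merges are applied inside each chunk
// separately, so a merge can never join a letter to punctuation or a digit.
function preTokenize(text) {
  return text.match(GPT2_SPLIT) || [];
}

console.log(preTokenize("Hello world, it's 2024!"));
// ["Hello", " world", ",", " it", "'s", " 2024", "!"]
```

The leading space travels with the word that follows it, which is why the same word can map to different tokens with and without a preceding space.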
