A tokenization approach.
To construct a vocab table of a target size:
- Assume all unique characters form the initial set of 1-character n-grams (i.e. the initial “tokens”).
- Then, the most frequent pair of adjacent characters is successively merged into a new, 2-character n-gram, and all instances of the pair are replaced by this new token.
- This is repeated until the target vocabulary size is reached.
This can also be done recursively, where new n-grams can merge previously created tokens (called recursive BPE).
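A minimal Python sketch of this training loop is below, assuming tokens are plain strings and a merged token is named by concatenating its parts; the function name `bpe_train` and the `num_merges` parameter (standing in for the target vocabulary size) are illustrative, not from any particular library. Because pairs are counted over the current tokens, previously created tokens can themselves be merged, i.e. the recursive variant.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Sketch of BPE training: repeatedly merge the most frequent adjacent pair.

    Ties between equally frequent pairs are broken arbitrarily, so the exact
    merges can vary for the same text.
    """
    tokens = list(text)            # start from 1-character tokens
    merges = {}                    # lookup table: (left, right) -> new token
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:              # no pair repeats; further merges gain nothing
            break
        new_token = left + right   # name the merged token by its parts
        merges[(left, right)] = new_token
        # Replace every non-overlapping occurrence of the pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (left, right):
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```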
Example
Given an example sequence `aaabdaaabac` and a target vocabulary of size 4:
- We start with an initial set of n-grams of just the unique characters `[a,b,c,d]` and an empty lookup table `{}`
- As the size of our lookup table (0) is still less than the vocab size of 4, we start by finding the most-frequent byte-pair
- Within this sequence, `aa` occurs the most frequently, so we replace it with a byte that isn’t in the data (let’s say `X`)
- This results in the sequence `XabdXabac`
- We also add this mapping to our lookup table: `{aa:X}`
- Then we keep repeating this until we reach a vocab size of 4
- The next most frequent byte-pair is `ab`, which we replace with `Y`
- This results in the sequence `XYdXYac`
- We also add this mapping to our lookup table: `{aa:X, ab:Y}`
- We can technically stop here because all remaining pairs only occur once, or we can do recursive BPE, since the byte-pair `XY` occurs twice and can be replaced with `Z`
- This results in the sequence `ZdZac`
- We also add this mapping to our lookup table: `{aa:X, ab:Y, XY:Z}`
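To check the worked example, the snippet below simply replays the three merges from the lookup table above with `str.replace`; it does not re-derive them, since in the second step `Xa` and `ab` are tied in frequency and the example’s choice of `ab` is one of two equally valid options.

```python
# Replay the merges {aa:X, ab:Y, XY:Z} from the example above.
sequence = "aaabdaaabac"
for pair, token in {"aa": "X", "ab": "Y", "XY": "Z"}.items():
    sequence = sequence.replace(pair, token)
    print(f"{pair} -> {token}: {sequence}")
# aa -> X: XabdXabac
# ab -> Y: XYdXYac
# XY -> Z: ZdZac
```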