Byte Pair Encoding (BPE) is a tokenization approach.

To construct a vocab table of size $n$:

- Assume all unique characters in the data form the initial set of 1-character-long n-grams (i.e. the initial “tokens”).
- Then, the most frequent pair of adjacent tokens is successively merged into a new, 2-character-long n-gram, and all instances of that pair are replaced by the new token.
- This is repeated until a vocabulary of size $n$ is obtained.

This can also be done recursively, where new n-grams can be built from previously created tokens (called recursive BPE).
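Below is a minimal Python sketch of this merge loop (the name `byte_pair_encode` and the minted `<0>`-style tokens are illustrative, not from any particular library). Ties between equally frequent pairs are broken by first occurrence here, so the merge order is just one of several valid ones. Since minted tokens rejoin the pool of merge candidates, later merges can combine them, which also covers the recursive case.

```python
from collections import Counter

def byte_pair_encode(sequence, target_vocab_size):
    """Merge the most frequent adjacent pair until the lookup table
    reaches `target_vocab_size` entries or no pair repeats."""
    tokens = list(sequence)  # start from 1-character tokens
    lookup = {}              # maps a pair of tokens -> its merged token
    while len(lookup) < target_vocab_size:
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # every remaining pair occurs only once: stop early
            break
        new_token = f"<{len(lookup)}>"  # a token guaranteed absent from the data
        lookup[pair] = new_token
        # Replace each non-overlapping occurrence of the pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, lookup

tokens, lookup = byte_pair_encode("aaabdaaabac", 4)
print(tokens)  # ['<2>', 'd', '<2>', 'a', 'c']
print(lookup)  # {('a', 'a'): '<0>', ('<0>', 'a'): '<1>', ('<1>', 'b'): '<2>'}
```

On this input the sketch happens to pick `('<0>', 'a')` over `('a', 'b')` at the tie in step two, so its lookup table differs from the walkthrough below even though the final sequence has the same shape.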

## Example

Given an example sequence `aaabdaaabac` and a target vocabulary of size 4:

- We start with an initial set of n-grams of just the unique characters `[a,b,c,d]` and an empty lookup table `{}`

- As the size of our lookup table (0) is still less than the target vocab size of 4, we start by finding the most frequent byte-pair
- Within this sequence, `aa` occurs the most frequently, so we replace it with a new byte that isn’t in the data (let’s say `X`)
- This results in the sequence `XabdXabac`
- We also add this mapping to our lookup table: `{aa:X}`

- Then we keep repeating this until we reach a vocab size of 4
- The next most frequent byte-pair is `ab`, which we replace with a new byte `Y`
- This results in the sequence `XYdXYac`
- We also add this mapping to our lookup table: `{aa:X, ab:Y}`

- We can technically stop here because all remaining pairs occur only once, or we can do recursive BPE, since the byte-pair `XY` (made of previously created tokens) occurs twice; we replace it with `Z`
- This results in the sequence `ZdZac`
- We also add this mapping to our lookup table: `{aa:X, ab:Y, XY:Z}` (see the decoding sketch below)

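As a sanity check that the merges are losslessly invertible, here is a minimal decoding sketch; the `decode` helper is a hypothetical illustration, and the hardcoded `lookup` table simply mirrors the walkthrough above.

```python
def decode(tokens, lookup):
    """Expand merged tokens back into base characters by repeatedly
    applying the lookup table in reverse."""
    inverse = {token: pair for pair, token in lookup.items()}
    out = list(tokens)
    while any(tok in inverse for tok in out):
        expanded = []
        for tok in out:
            # A merged token expands to its pair; base characters pass through.
            expanded.extend(inverse.get(tok, tok))
        out = expanded
    return "".join(out)

# The final lookup table and sequence from the walkthrough above.
lookup = {("a", "a"): "X", ("a", "b"): "Y", ("X", "Y"): "Z"}
print(decode("ZdZac", lookup))  # -> aaabdaaabac
```

Note that decoding just expands tokens until only base characters remain, while encoding new text must replay the merges in the order they were learned.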