A tokenization approach.
To construct a vocab table of a target size:
- Assume all unique characters form the initial set of 1-character n-grams (i.e. the initial “tokens”).
- Then, the most frequent pair of adjacent characters is successively merged into a new, 2-character n-gram, and all instances of the pair are replaced by this new token.
- This is repeated until the target vocabulary size is reached.
This can also be done recursively, where new n-grams can merge previously created tokens (called recursive BPE).
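A minimal Python sketch of this training loop is below, assuming tokens are plain strings and a merged token is named by concatenating its parts; the function name `bpe_train` and the `num_merges` parameter (standing in for the target vocabulary size) are illustrative, not from any particular library. Because pairs are counted over the current tokens, previously created tokens can themselves be merged, i.e. the recursive variant.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Sketch of BPE training: repeatedly merge the most frequent adjacent pair.

    Ties between equally frequent pairs are broken arbitrarily, so the exact
    merges can vary for the same text.
    """
    tokens = list(text)            # start from 1-character tokens
    merges = {}                    # lookup table: (left, right) -> new token
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:              # no pair repeats; further merges gain nothing
            break
        new_token = left + right   # name the merged token by its parts
        merges[(left, right)] = new_token
        # Replace every non-overlapping occurrence of the pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (left, right):
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```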
Example
Given an example sequence `aaabdaaabac` and a target vocabulary of size 4:
- We start with an initial set of n-grams of just the unique characters `[a,b,c,d]` and an empty lookup table `{}`
- As the size of our lookup table (0) is still less than the vocab size of 4, we start by finding the most-frequent byte-pair
- Within this sequence, `aa` occurs the most frequently, so we replace it with a byte that isn’t in the data (let’s say `X`)
- This results in the sequence `XabdXabac`
- We also add this mapping to our lookup table: `{aa:X}`
- Then we keep repeating this until we reach a vocab size of 4
- The next most frequent byte-pair is `ab`, which we replace with `Y`
- This results in the sequence `XYdXYac`
- We also add this mapping to our lookup table: `{aa:X, ab:Y}`
- We can technically stop here because all remaining pairs only occur once, or we can do recursive BPE, since the byte-pair `XY` occurs twice and can be replaced with `Z`
- This results in the sequence `ZdZac`
- We also add this mapping to our lookup table: `{aa:X, ab:Y, XY:Z}`
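To check the worked example, the snippet below simply replays the three merges from the lookup table above with `str.replace`; it does not re-derive them, since in the second step `Xa` and `ab` are tied in frequency and the example’s choice of `ab` is one of two equally valid options.

```python
# Replay the merges {aa:X, ab:Y, XY:Z} from the example above.
sequence = "aaabdaaabac"
for pair, token in {"aa": "X", "ab": "Y", "XY": "Z"}.items():
    sequence = sequence.replace(pair, token)
    print(f"{pair} -> {token}: {sequence}")
# aa -> X: XabdXabac
# ab -> Y: XYdXYac
# XY -> Z: ZdZac
```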