Byte Pair Encoding (BPE)#
Note
BPE is an algorithm that breaks text into subwords by repeatedly merging the most frequent character pairs to build a vocabulary.
Simple Example#
Let’s say we have these words:
low, lower, newest, widest
Initialize - Split into characters
l o w </w>
l o w e r </w>
n e w e s t </w>
w i d e s t </w>
where </w> means end of word.
Count character pair frequencies
'e' + 's' = 2 times (newest, widest)
's' + 't' = 2 times (newest, widest)
...
Merge most frequent pair (let’s say ‘es’)
l o w </w>
l o w e r </w>
n e w es t </w>
w i d es t </w>
Continue merging (let’s say ‘est’)
l o w </w>
l o w e r </w>
n e w est </w>
w i d est </w>
Why BPE is Great#
Tip
Handles unknown words: Even if we never saw “lowest”, we can break it into known subwords
Balanced efficiency: Not too granular like character-level, not too rigid like word-level
Compression: Common patterns get merged, reducing sequence length