Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE)#

Note

BPE is an algorithm that breaks text into subwords by repeatedly merging the most frequent character pairs to build a vocabulary.

Simple Example#

Let’s say we have these words:

low, lower, newest, widest
  1. Initialize - Split into characters

l o w </w>
l o w e r </w>
n e w e s t </w>
w i d e s t </w>

where </w> means end of word.

  1. Count character pair frequencies

'e' + 's' = 2 times (newest, widest)
's' + 't' = 2 times (newest, widest)
...
  1. Merge most frequent pair (let’s say ‘es’)

l o w </w>
l o w e r </w>
n e w es t </w>
w i d es t </w>
  1. Continue merging (let’s say ‘est’)

l o w </w>
l o w e r </w>
n e w est </w>
w i d est </w>

Why BPE is Great#

Tip

  • Handles unknown words: Even if we never saw “lowest”, we can break it into known subwords

  • Balanced efficiency: Not too granular like character-level, not too rigid like word-level

  • Compression: Common patterns get merged, reducing sequence length