A Formal Perspective on Byte-Pair Encoding

arXiv2306.16837MaRDI QIDQ6441960

Author name not available (Why is that?)

Publication date: 29 June 2023

Abstract: Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a

-approximation of an optimal merge sequence, where

is the total backward curvature with respect to the optimal merge sequence

. Empirically the lower bound of the approximation is

a p p r o x 0.37

. We provide a faster implementation of BPE which improves the runtime complexity from

m a t h c a l O l e f t (N M i g h t)

to

m a t h c a l O l e f t (N l o g M i g h t)

, where

N

is the sequence length and

M

is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.

Has companion code repository: https://github.com/zouharvi/formal-bpe

This page was built for publication: A Formal Perspective on Byte-Pair Encoding

Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q6441960)