Tags: text, nlp, compression, tokenize

Byte pair encoding when most bytes are already used


Byte pair encoding is apparently sometimes used to compress/tokenize text prior to running machine learning algorithms on it.

According to https://en.wikipedia.org/wiki/Byte_pair_encoding the basic step is one in which

the most common pair of contiguous bytes of data in a sequence are replaced with a byte that does not occur within the sequence
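
For concreteness, here is what I understand that step to look like in code. This is a minimal Python sketch of my own; the function name bpe_step and the left-to-right replacement order are my choices, not something the article specifies.

```python
from collections import Counter

def bpe_step(data: bytes) -> tuple[bytes, dict]:
    """One byte pair encoding step: replace the most common pair of
    contiguous bytes with a byte value that does not occur in the data."""
    used = set(data)
    free = [b for b in range(256) if b not in used]
    if not free:
        raise ValueError("every byte value already occurs in the data")

    pair_counts = Counter(zip(data, data[1:]))
    if not pair_counts:
        return data, {}
    (a, b), _ = pair_counts.most_common(1)[0]
    replacement = free[0]

    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(replacement)          # substitute the pair
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), {replacement: (a, b)}

# The Wikipedia article's example: "aaabdaaabac" becomes "ZabdZabac",
# with Z standing for "aa"; here Z is whichever free byte value is picked.
compressed, table = bpe_step(b"aaabdaaabac")
```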

I can see how this works for ASCII text, which typically uses fewer than 100 of the 256 possible byte values and so leaves roughly 160 of them unused.

It would seem to be inapplicable to binary data, which in general uses all of the possible byte values.

What about Unicode? That uses a lot more of the possible byte values than ASCII. Does the algorithm work less well here, does Unicode use fewer byte values than I am taking into account, or is there something else I am missing?


Solution

  • You did not specify how the Unicode text is encoded. Since you're talking about bytes and ASCII, I will answer for the most common and widely accepted encoding, which is UTF-8.

    The bytes 0xc0, 0xc1, and 0xf5 through 0xff can appear nowhere in a valid UTF-8 sequence. There are also many specific sequences of bytes that cannot appear in valid UTF-8. One example is any two-byte sequence of the form 0xxxxxxx 10xxxxxx, where x is any bit: a continuation byte can never directly follow an ASCII byte, so there are 128 × 64 = 8,192 such sequences to choose from. You can append more 10xxxxxx bytes after that, which would also never appear as a valid sequence, to get more bits.
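
    If you want to verify that list of unusable single bytes yourself, a brute-force check along these lines (my own sketch, not part of the original answer) encodes every code point and records which byte values ever appear:

    ```python
    # Which single byte values can ever appear in valid UTF-8?
    def utf8_byte_values() -> set[int]:
        used = set()
        for cp in range(0x110000):
            if 0xD800 <= cp <= 0xDFFF:      # surrogates are not encodable
                continue
            used.update(chr(cp).encode("utf-8"))
        return used

    never_used = sorted(set(range(256)) - utf8_byte_values())
    print([hex(b) for b in never_used])
    # ['0xc0', '0xc1', '0xf5', '0xf6', '0xf7', '0xf8', '0xf9', '0xfa',
    #  '0xfb', '0xfc', '0xfd', '0xfe', '0xff']
    ```

    Only 13 single-byte values come out as never used, which is why the multi-byte sequences above matter if you need more than 13 distinct symbols.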

    The least intrusive approach, I think, would be to decode the UTF-8 string, insert your tokens only between valid UTF-8 code points, and have each token be a sequence of any number of 10xxxxxx bytes. Then you can encode six bits per inserted byte, and you can easily tell where your token ends: at the next byte that is not of the form 10xxxxxx.
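
    Here is a rough sketch of that idea, assuming tokens are only ever spliced in at code point boundaries; the 6-bit packing and the helper names (encode_token, walk) are my own choices:

    ```python
    def encode_token(n: int) -> bytes:
        """Pack a non-negative token number into 10xxxxxx bytes, 6 bits each."""
        groups = [0x80 | (n & 0x3F)]
        n >>= 6
        while n:
            groups.append(0x80 | (n & 0x3F))
            n >>= 6
        return bytes(reversed(groups))

    def walk(data: bytes):
        """Yield ('token', n) and ('char', s) pieces, assuming tokens were
        only ever inserted between complete UTF-8 code points."""
        i = 0
        while i < len(data):
            b = data[i]
            if 0x80 <= b <= 0xBF:      # continuation byte at a boundary: a token
                n = 0
                while i < len(data) and 0x80 <= data[i] <= 0xBF:
                    n = (n << 6) | (data[i] & 0x3F)
                    i += 1
                yield ("token", n)
            else:                      # lead byte of an ordinary character
                length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
                yield ("char", data[i:i + length].decode("utf-8"))
                i += length

    text = "héllo".encode("utf-8")
    tagged = text[:3] + encode_token(300) + text[3:]   # insert token 300 after "hé"
    print(list(walk(tagged)))
    # [('char', 'h'), ('char', 'é'), ('token', 300), ('char', 'l'), ...]
    ```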

    So, you could take any one of those single invalid bytes (or all of them) and use each as a prefix to some number of bytes that represent your tokens, inserted anywhere in the data. Or you could use the sequence of 10xxxxxx bytes between valid code points, as sketched above.
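
    And a minimal sketch of the single-invalid-byte variant, where the choice of 0xff as the marker and the fixed two-byte token payload are arbitrary assumptions of mine:

    ```python
    MARKER = 0xFF   # one of the byte values that never occurs in valid UTF-8

    def insert_token(data: bytes, pos: int, token: int) -> bytes:
        """Splice MARKER plus a 2-byte big-endian token number in at offset pos,
        which should be a code point boundary in the original UTF-8 data."""
        return data[:pos] + bytes([MARKER]) + token.to_bytes(2, "big") + data[pos:]

    def extract(data: bytes):
        """Yield ('text', str) and ('token', n) pieces back out."""
        i = start = 0
        while i < len(data):
            if data[i] == MARKER:
                if start < i:
                    yield ("text", data[start:i].decode("utf-8"))
                yield ("token", int.from_bytes(data[i + 1:i + 3], "big"))
                i = start = i + 3
            else:
                i += 1
        if start < len(data):
            yield ("text", data[start:].decode("utf-8"))
    ```

    A fixed-width payload keeps the parsing trivial but caps you at 65,536 distinct tokens; a variable-length payload would need its own length or terminator convention.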