Tags: huggingface-transformers, decoding, byte-pair-encoding

Some doubts about huggingface's BPE algorithm


Most BPE (Byte-Pair Encoding) tutorials mention appending </w> to the end of each word. The purpose of this marker is to distinguish whether a subword sits at the beginning of a word or at its end.

We know that the model's input is a sequence of subwords (usually represented by IDs), and its output is likewise a sequence of subwords. Such a sequence is not very readable on its own, so we still need to merge the subwords back into normal text. The </w> marker is what makes this merging possible: without it, we have no way of knowing where one word ends and the next begins.
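For illustration, here is a minimal sketch of how the </w> marker makes word boundaries recoverable when decoding (the subword sequence below is made up, not the output of any particular tokenizer):

```python
# Hypothetical output of a BPE tokenizer that appends </w> to word-final subwords.
subwords = ["un", "relat", "ed</w>", "fact", "s</w>"]

# Concatenate everything, then turn each end-of-word marker into a space.
text = "".join(subwords).replace("</w>", " ").strip()
print(text)  # -> "unrelated facts"
```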

HuggingFace's BPE implementation is based on the source code of OpenAI's GPT-2. I went through that source code carefully and found no marker like </w>, so how do we recover a normal sequence during decoding?


Solution

  • The end-of-word marker </w> is attached to tokens during vocabulary creation; it is not a token per se.

    Once the BPE vocabulary has been created, the convention is normally inverted: instead of marking word ends, you mark the tokens that do not end a word. In the original implementation, this continuation was expressed as @@. That's why, to restore the original text, you simply had to remove every occurrence of "@@ ", so that tokens belonging to the same word were glued back together (see the first sketch after this list).

    The HuggingFace implementation mimics OpenAI's and takes a slightly different approach: the space is represented as part of the tokens themselves. For this, they use the \u0120 marker (rendered as Ġ), which you can see at the beginning of many tokens in the GPT-2 vocabulary (see the second sketch after this list). You can find details about this in this GitHub issue, and this HuggingFace discussion adds some context.

    That's why you won't see any end-of-word marker in BPE vocabularies.
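For illustration, a minimal sketch of the @@ convention described above (the token sequence is made up, not the output of any particular tokenizer):

```python
# subword-nmt style: "@@" marks a token that does NOT end a word.
bpe_tokens = ["un@@", "relat@@", "ed", "fact@@", "s"]

# Decoding: join with spaces, then delete every "@@ " so pieces of the
# same word are glued back together.
text = " ".join(bpe_tokens).replace("@@ ", "")
print(text)  # -> "unrelated facts"
```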
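And a sketch of the GPT-2/HuggingFace convention, where a leading \u0120 (Ġ) on a token means "this token starts with a space". The replace-based decoding below is a simplification that works for plain ASCII text; the real implementation maps every byte back through GPT-2's byte-to-unicode table, and in practice HuggingFace's tokenizer.decode() handles this for you.

```python
# GPT-2 / HuggingFace style: no end-of-word marker; instead "\u0120" ("Ġ")
# at the start of a token encodes the preceding space.
gpt2_style_tokens = ["Some", "\u0120unrelated", "\u0120facts"]  # made-up example

# Simplified decoding for ASCII-only text: concatenate, then map the
# space marker back to an actual space.
text = "".join(gpt2_style_tokens).replace("\u0120", " ")
print(text)  # -> "Some unrelated facts"
```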