Tags: tokenize, sentencepiece, byte-pair-encoding

Some doubts about SentencePiece


I recently ran into some questions while learning about Google's SentencePiece.

  • BPE, WordPiece and Unigram are all common subword algorithms, so how is SentencePiece related to them? Some tutorials say that SentencePiece is itself a subword algorithm, while others say it is an implementation of the subword algorithms above.
  • During preprocessing, SentencePiece seems to do nothing more than replace spaces with a special underscore-like symbol (▁). If there is no pre-tokenization stage, how can subword algorithms such as BPE and Unigram, which require pre-tokenization, be applied?

My own understanding:

  • I am more inclined to think that SentencePiece is an implementation of subword algorithms such as BPE and Unigram. If SentencePiece were itself a subword algorithm, why would expressions such as SentencePiece+BPE and SentencePiece+Unigram exist? (See the sketch after this list.)
  • SentencePiece supports BPE, Unigram and other algorithms, but those algorithms apparently require pre-tokenization, while SentencePiece does not need it. Isn't that a contradiction?
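My hunch is also supported by the sentencepiece Python package itself, where BPE and Unigram are just interchangeable settings of a single trainer. Below is a minimal sketch; the corpus file name, model prefixes and vocabulary size are made-up placeholders, not anything specific to my setup.

    import sentencepiece as spm

    # Sketch only: "corpus.txt", the model prefixes and vocab_size are placeholders.
    # The point is that the same trainer covers both algorithms, chosen via model_type.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="spm_bpe", vocab_size=8000, model_type="bpe"
    )
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="spm_unigram", vocab_size=8000, model_type="unigram"
    )

    # Either model is loaded and used the same way: raw text goes in directly,
    # without a separate pre-tokenization step.
    sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
    print(sp.encode("Hello world.", out_type=str))  # e.g. ['▁Hello', '▁world', '.']

Seen this way, "SentencePiece + BPE" reads naturally as "the SentencePiece tool configured with model_type=bpe".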

Solution

  • I had the same question a couple of days ago, so I did some research and here is my answer (which may or may not be 100% correct, but might be helpful):

    1. The SentencePiece paper never explicitly says that they don't pre-tokenize for BPE. They just say that some languages, like Chinese and Japanese, cannot be pre-tokenized on whitespace, and hence we need an algorithm that can handle these languages without the headache of handwritten rules.
    2. I checked the SentencePieceBPETokenizer class in Hugging Face's tokenizers library (which is what the code below imports from). Its pre_tokenizer has a method pre_tokenize_str, which implements the SentencePiece-style pre-tokenization (replacing whitespace with ▁). Look at the output below:
    from tokenizers import SentencePieceBPETokenizer
    tokenizer = SentencePieceBPETokenizer()
    tokenizer.pre_tokenizer.pre_tokenize_str("こんにちは世界")
    >>> [('▁こんにちは世界', (0, 7))]
    

    Since there are no whitespace delimiters in this sentence (which says "Hello, world", by the way), it does not split it at all: the whole sentence is kept as a single pre-token, which is presumably what gets handed to the BPE algorithm.

    tokenizer.pre_tokenizer.pre_tokenize_str("Hello world.")
    >>> [('▁Hello', (0, 5)), ('▁world.', (5, 12))]
    tokenizer.pre_tokenizer.pre_tokenize_str("Hello   world.")
    >>> [('▁Hello', (0, 5)), ('▁', (5, 6)), ('▁', (6, 7)), ('▁world.', (7, 14))]
    

    See how it pre-tokenizes the sentence on whitespace? The difference is that it preserves the whitespace information, which a naive whitespace pre-tokenizer (or any other pre-tokenizer) would have lost. Based on my reading of the paper, this is the main strength of SentencePiece: by keeping the whitespace information intact, it makes encoding and decoding lossless.
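    To make the lossless point concrete, here is a small round-trip sketch of my own (not something from the paper or the library docs): it rebuilds the original strings from the pre-tokenized pieces above simply by concatenating them and turning the meta symbol ▁ back into a space.

    from tokenizers import SentencePieceBPETokenizer

    tokenizer = SentencePieceBPETokenizer()

    def roundtrip(text):
        pieces = tokenizer.pre_tokenizer.pre_tokenize_str(text)
        # Concatenate the pieces and map the meta symbol ▁ (U+2581) back to a
        # space; the ▁ prepended to the first piece becomes a leading space,
        # which is stripped again.
        restored = "".join(piece for piece, _ in pieces).replace("▁", " ")
        return restored[1:] if restored.startswith(" ") else restored

    print(roundtrip("Hello   world.") == "Hello   world.")  # True: all three spaces survive
    print(roundtrip("こんにちは世界") == "こんにちは世界")  # True

    A plain whitespace split would have collapsed the three consecutive spaces, so the original string could not have been reconstructed exactly.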

    Reference:

    1. The comment section of this post might help: https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/