I recently ran into some questions while learning Google's SentencePiece.
My own understanding:
I had the same question a couple of days ago, so I did some research and here is my answer (which may or may not be 100% correct, but might be helpful):
from tokenizers import SentencePieceBPETokenizer
tokenizer = SentencePieceBPETokenizer()
tokenizer.pre_tokenizer.pre_tokenize_str("こんにちは世界")
>>> [('▁こんにちは世界', (0, 7))]
Since there are no whitespace delimiters in this sentence (which means "Hello World", by the way), the pre-tokenizer does not split it: it treats the whole sentence as a single pre-token and presumably passes that one token on to the BPE algorithm.
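As a rough sanity check (just a sketch: the toy corpus and vocab_size below are made up for illustration, and the resulting subwords depend entirely on the training data), you can train the tokenizer and confirm that any splitting of the Japanese sentence comes from the learned BPE merges rather than from whitespace:

from tokenizers import SentencePieceBPETokenizer

# Hypothetical toy corpus; a real setup would train on a large dataset.
corpus = ["こんにちは世界", "こんにちは", "世界のみなさん、こんにちは"]

bpe_tokenizer = SentencePieceBPETokenizer()
bpe_tokenizer.train_from_iterator(corpus, vocab_size=100, min_frequency=1)

# The whole sentence reaches the BPE model as one pre-token ("▁こんにちは世界"),
# so whatever subword boundaries appear here were learned by BPE,
# not taken from whitespace.
print(bpe_tokenizer.encode("こんにちは世界").tokens)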
tokenizer.pre_tokenizer.pre_tokenize_str("Hello world.")
>>> [('▁Hello', (0, 5)), ('▁world.', (5, 12))]
tokenizer.pre_tokenizer.pre_tokenize_str("Hello world.")
>>> [('▁Hello', (0, 5)), ('▁', (5, 6)), ('▁', (6, 7)), ('▁world.', (7, 14))]
See how it pre-tokenizes the sentence on whitespace? The difference is that it preserves the whitespace information, which a naive whitespace pre-tokenizer (or most other pre-tokenizers) would have thrown away. Based on what I understand from the paper, this is the main strength of SentencePiece: it makes encoding and decoding lossless by keeping the whitespace information intact.
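To make the "lossless" point concrete, here is a minimal round-trip sketch using the Metaspace pre-tokenizer directly (which, as far as I can tell, is what SentencePieceBPETokenizer uses under the hood). Because every space survives as a "▁" piece, the original string, multiple spaces and all, can be rebuilt exactly:

from tokenizers import pre_tokenizers

# Metaspace replaces every space with "▁" and prepends one to the first piece.
pre = pre_tokenizers.Metaspace()
pieces = [piece for piece, _ in pre.pre_tokenize_str("Hello   world.")]
print(pieces)  # expected: ['▁Hello', '▁', '▁', '▁world.'], same as above

# Nothing was thrown away, so the original whitespace can be restored:
# replace "▁" with " " and drop the prefix space added before the first piece.
restored = "".join(pieces).replace("▁", " ")[1:]
assert restored == "Hello   world."

A plain whitespace pre-tokenizer would have produced something like ['Hello', 'world.'] here, and the fact that there were three spaces between the words would be lost for good.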
Reference: