
In HuggingFace tokenizers: how can I split a sequence simply on spaces?


I am using the DistilBertTokenizer from HuggingFace.

I would like to tokenize my text by simply splitting it on spaces:

["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

instead of the default behavior, which is like this:

["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

I read their documentation on tokenization in general and on the BERT tokenizer specifically, but could not find an answer to this simple question :(

I assume it should be a parameter when loading the tokenizer, but I could not find it in the parameter list ...

EDIT: Minimal code example to reproduce:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print("Tokens: ", tokens)

Solution

  • The tokenizer does not work that way. The transformers library provides different types of tokenizers. DistilBERT uses a WordPiece tokenizer with a fixed vocabulary, the same one that was used to train the corresponding model, so it does not offer such a modification (as far as I know). What you can do instead is use the split() method of a Python string:

    text = "Don't you love 🤗 Transformers? We sure do."
    tokens = text.split()
    print("Tokens: ", tokens)
    

    Output:

    Tokens:  ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
    

    If you are looking for slightly more complex tokenization that also takes punctuation into account, you can use the basic_tokenizer:

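    from transformers import DistilBertTokenizer
    
    text = "Don't you love 🤗 Transformers? We sure do."
    
    # basic_tokenizer only splits on whitespace and punctuation (no WordPiece step)
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
    tokens = tokenizer.basic_tokenizer.tokenize(text)
    print("Tokens: ", tokens)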
    

    Output:

    Tokens:  ['Don', "'", 't', 'you', 'love', '🤗', 'Transformers', '?', 'We', 'sure', 'do', '.']
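
To illustrate why a WordPiece tokenizer is tied to its vocabulary, here is a minimal pure-Python sketch of the two stages described above: a punctuation split followed by a greedy longest-match lookup. Everything in it is an assumption made for illustration: TOY_VOCAB is a hypothetical vocabulary, and split_on_punctuation and toy_wordpiece are made-up helper names, not part of the transformers API. The real tokenizer has a far larger vocabulary and more normalization steps, so its output will differ.

```python
import re

# Hypothetical toy vocabulary, for illustration only. It is NOT the real
# distilbert-base-cased vocabulary.
TOY_VOCAB = {
    "Don", "'", "t", "you", "love", "Transform", "##ers",
    "?", "We", "sure", "do", ".", "[UNK]",
}


def split_on_punctuation(text):
    """Rough stand-in for a basic tokenizer: runs of word characters,
    plus every other non-space character as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces.
    Pieces that do not start the word carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: the whole word becomes unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


text = "Don't you love 🤗 Transformers? We sure do."
words = split_on_punctuation(text)
tokens = [piece for word in words for piece in toy_wordpiece(word, TOY_VOCAB)]
print("Words: ", words)
print("Tokens: ", tokens)
```

With this toy vocabulary, "Transformers" is broken into "Transform" and "##ers", and the emoji becomes "[UNK]" because no entry matches it. A token such as "Transformers?" produced by a plain split() would have no entry in such a vocabulary either, which is why the model's tokenizer cannot simply be switched to whitespace splitting.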