Tags: tensorflow, keras, text, tensorflow2.0

Tensorflow text tokenizer incorrect tokenization


I am trying to use the TF Tokenizer for an NLP model

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ", 
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]

tokenizer.fit_on_texts(sample_text)

print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))

Output:

[[1, 7, 8, 9]]

Word_Index:

print(tokenizer.index_word[8])  ===> 'ab'
print(tokenizer.index_word[9])  ===> 'cdefghijklmnopqrstuvwxyz'

The problem is that the tokenizer also creates tokens based on . in this case. I am passing split=" " to the Tokenizer, so I expect the following output:

[[1, 7, 8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'

That is, I want the tokenizer to create tokens based only on spaces (" ") and not on any special characters.

How do I make the tokenizer create tokens only on spaces?


Solution

  • The Tokenizer takes another argument called filters, which defaults to all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.

    If you look at the source code of the Tokenizer, and specifically at the method fit_on_texts, you will see that it uses the function text_to_word_sequence, which receives the filters characters and treats them the same as the split it also receives:

    def text_to_word_sequence(...):
        ...
        # every character in `filters` is mapped to the `split` string
        translate_dict = {c: split for c in filters}
        translate_map = maketrans(translate_dict)
        text = text.translate(translate_map)

        # so splitting on `split` also splits on every filter character
        seq = text.split(split)
        return [i for i in seq if i]
    

    So, in order to split on nothing but the specified split string, just pass an empty string to the filters argument, as in the sketch below.
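
    A minimal sketch of the fix, reusing the sample_text from the question. The only change from the snippet in the question is the added filters="" argument; the exact index values depend on word frequencies, so treat the printed sequence as illustrative:

    from tensorflow.keras.preprocessing.text import Tokenizer

    # filters="" turns off the punctuation replacement, so only split=" " separates tokens
    tokenizer = Tokenizer(num_words=200, split=" ", filters="")
    sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
                   "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
    tokenizer.fit_on_texts(sample_text)

    print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
    # 'ab.cdefghijklmnopqrstuvwxyz' is now a single vocabulary entry,
    # and 'ab' is no longer a token of its own
    print("ab.cdefghijklmnopqrstuvwxyz" in tokenizer.word_index)  # True
    print("ab" in tokenizer.word_index)                           # False

    Note that lower=True is still the default, so the entry appears in lower case in word_index, and the sample query now maps to three tokens instead of four.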