Tags: python, tensorflow, keras, model, tokenize

texts_to_sequences return empty string after loading the tokenizer


I'm working on a project in which I've trained and saved my model as well as the tokenizer:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# save tokenizer
import pickle
with open('english_tokenizer_test.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

total_words = len(tokenizer.word_index) + 1

# save model
model2.save('model.h5')

Then I load the model and the tokenizer:

import pickle
import tensorflow as tf

with open('../input/shyaridatasettesting/english_tokenizer_test.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

tokenizer = Tokenizer()
max_sequence_len = 24
model = tf.keras.models.load_model('../input/model.h5')
print(model)

But when I tokenize text with the loaded tokenizer, it returns an empty list:

token_list = tokenizer.texts_to_sequences(["this is something"])[0]

I want to use my model and tokenizer on my website, but I'm getting an empty token_list whenever I pass text into the tokenizer.

Please help me if I'm doing something wrong.


Solution

  • The problem is that, after loading your original tokenizer, you create a new, empty Tokenizer with the same variable name (tokenizer = Tokenizer()), which overwrites the loaded one. Remove that line and the loaded tokenizer works as expected. Here is a working example:

    import tensorflow as tf
    import pickle
    
    corpus = ['this is something', 'this is something more', 'this is nothing']
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(corpus)
    
    ### Save tokenizer
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
    ### Load tokenizer
    with open('tokenizer.pickle', 'rb') as handle:
        tokenizer = pickle.load(handle)
    token_list = tokenizer.texts_to_sequences(["this is something"])[0]
    print(token_list)
    
    [1, 2, 3]
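  • As an alternative to pickle, you can serialize the tokenizer to JSON with `to_json()` and restore it with `tokenizer_from_json`, which avoids pickle's security and version-compatibility caveats. This is a minimal sketch assuming TensorFlow 2.x, where `tf.keras.preprocessing.text.tokenizer_from_json` is available (it has been removed in Keras 3):

    ```python
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import tokenizer_from_json

    corpus = ['this is something', 'this is something more', 'this is nothing']
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(corpus)

    # Serialize the tokenizer config to a JSON string and write it to disk
    with open('tokenizer.json', 'w', encoding='utf-8') as f:
        f.write(tokenizer.to_json())

    # Rebuild the tokenizer from the JSON file -- no pickle involved
    with open('tokenizer.json', 'r', encoding='utf-8') as f:
        restored = tokenizer_from_json(f.read())

    token_list = restored.texts_to_sequences(["this is something"])[0]
    print(token_list)  # [1, 2, 3]
    ```

    The same caution applies here: don't assign a fresh `Tokenizer()` to the variable after restoring it.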