I have trained a Bi-LSTM model to do NER on a set of sentences. For this I took the distinct words present, created a mapping between each word and a number, and built the Bi-LSTM model on those numbers. I then create and pickle that model object.
Now I get a set of new sentences containing words that the training model has not seen, so these words do not yet have a numeric value. When I test these sentences on my previously saved model, it gives an error: it cannot find the words or features because numeric values for them do not exist.
To circumvent this error I assigned a new integer value to every new word I encounter.
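Roughly, I extend the mapping like this (a simplified sketch; word2idx and test_sentences are placeholder names for my training-time word-to-index mapping and the new tokenized sentences):
# sketch: assign the next free integer to every unseen test word
new_word2idx = dict(word2idx)
for sentence in test_sentences:
    for word in sentence:
        if word not in new_word2idx:
            new_word2idx[word] = len(new_word2idx)
X = [[new_word2idx[w] for w in s] for s in test_sentences]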
However, when I load the model and test it, it gives this error:
InvalidArgumentError: indices[0,24] = 5444 is not in [0, 5442)
[[Node: embedding_14_16/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_14_16/embeddings/read, embedding_14_16/Cast)]]
The training data contains 5445 words including the padding word, so the indices should run over [0, 5444].
5444 is the index value I have given to the padding in the test sentences. It is not clear to me why it assumes the index values must lie in [0, 5442).
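From the error it looks like the embedding layer only accepts indices smaller than its input_dim. A minimal illustration of what I mean (just a sketch, not my actual code):
from keras.layers import Embedding
# an Embedding with input_dim=5442 only accepts indices 0 .. 5441
emb_layer = Embedding(input_dim=5442, output_dim=50)
# feeding it an index of 5444 (my padding value) raises the InvalidArgumentError above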
I have used the base code available on the following link: https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm
The code:
from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense

input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)
model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(input, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
#number of epochs - Also for output file naming
epoch_num=20
domain="../data/Laptop_Prediction_Corrected"
output_file_name=domain+"_E"+str(epoch_num)+".xlsx"
model_name="../models/Laptop_Prediction_Corrected"
output_model_filename=model_name+"_E"+str(epoch_num)+".sav"
history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=epoch_num, validation_split=0.1, verbose=1)
Here max_len is the maximum sentence length (all sentences are padded to this length) and n_words is the vocab size. In the model the padding has been done using the following code, where n_words=5441:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words)
The padding in the new dataset:
max_len = 50
# this is to pad sentences to the maximum length possible
#-> so all records of X will be of the same length
#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=res_new_word2idx["pad_blank"])
#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=5441)
I am not sure which of these two padding values is correct.
However, the vocab only includes the words in the training data. When I say:
p = loaded_model.predict(X)
how do I use predict for text sentences which contain words that are not present in the initial vocab?
You can use the Keras Tokenizer class and its methods to easily tokenize and preprocess the input data. Specify the vocab size when instantiating it and then use its fit_on_texts() method on the training data to construct a vocabulary based on the given texts. After that you can use its texts_to_sequences() method to convert each text string to a list of word indices. The good thing is that only the words in the vocabulary are considered and all other words are ignored (alternatively, you can map all out-of-vocabulary words to a single index, 1, by passing an oov_token argument to the Tokenizer):
from keras.preprocessing.text import Tokenizer
# set num_words to limit the vocabulary to the most frequent words
tok = Tokenizer(num_words=n_words)
# you can also pass an arbitrary token as `oov_token` argument
# which will represent out-of-vocabulary words and its index would be 1
# tok = Tokenizer(num_words=n_words, oov_token='[unk]')
tok.fit_on_texts(X_train)
X_train = tok.texts_to_sequences(X_train)
X_test = tok.texts_to_sequences(X_test)  # use the same vocab to convert test data to sequences
You can optionally use the pad_sequences function to pad them with zeros or truncate them so they all have the same length:
from keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
Now, the vocab size would be equal to n_words+1 if you have not used the oov token, or n_words+2 if you have. Then you can pass the correct number to the embedding layer as its input_dim argument (the first positional argument):
Embedding(correct_num_words, embd_size, ...)
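For example, wiring this back into the model definition from the question (a sketch; use_oov_token is just a placeholder flag for whether you passed oov_token to the Tokenizer):
# sketch: choose the embedding input_dim based on whether an oov token was used
use_oov_token = False  # placeholder flag: set to True if you passed oov_token
correct_num_words = n_words + 2 if use_oov_token else n_words + 1
model = Embedding(input_dim=correct_num_words, output_dim=50, input_length=max_len)(input)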