I am working on an RNN. After training, I got high accuracy on the test set. However, when I make predictions on external data, the model performs very poorly. I also trained a plain artificial neural network on the same data set, which has over 300,000 texts and 57 classes, and it predicts very poorly as well. When I tried the same data set with a traditional machine learning model, it worked fine.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, LSTM, BatchNormalization
from keras.layers import Embedding
from sklearn.model_selection import train_test_split
df = pd.read_excel("data.xlsx", usecols=["X", "y"])
df = df.sample(frac = 1)
X = np.array(df["X"])
y = np.array(df["y"])
le = LabelEncoder()
y = le.fit_transform(y)
y = y.reshape(-1,1)
encoder = OneHotEncoder(sparse=False)
y = encoder.fit_transform(y)
num_words = 100000
token = Tokenizer(num_words=num_words)
token.fit_on_texts(X)
seq = token.texts_to_sequences(X)
X = sequence.pad_sequences(seq, padding = "pre", truncating = "pre")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = Sequential()
model.add(Embedding(num_words, 96, input_length = X.shape[1]))
model.add(LSTM(108, activation='relu', dropout=0.1, recurrent_dropout = 0.2))
model.add(BatchNormalization())
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="rmsprop", metrics=['accuracy'])
model.summary()
history = model.fit(X_train, y_train, epochs=4, batch_size=64, validation_data = (X_test, y_test))
loss, accuracy = model.evaluate(X_test, y_test)
Here are the history plots of the model:
After doing some research, I realized that the model itself was working fine. The problem was that I was using the Keras Tokenizer incorrectly.
At the end of my script, I had the following prediction code:
sentence = ["Example Sentence to Make Prediction."]
token.fit_on_texts(sentence) # <- This line is the problem: it refits the Tokenizer. Remove it.
seq = token.texts_to_sequences(sentence)
cx = sequence.pad_sequences(seq, maxlen = X.shape[1])
sx = np.argmax(model.predict(cx), axis=1)
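The problem was fitting the Tokenizer again on the new data. Calling fit_on_texts a second time updates the word counts and rebuilds the word index, so words can end up mapped to different integers than the ones the model was trained on. Removing that line solved the problem for me.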
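To illustrate why refitting hurts, here is a minimal sketch using a tiny stand-in tokenizer written in plain Python. `TinyTokenizer` is a hypothetical class made up for this example, not the Keras one. It only assumes the same general behavior: word IDs are assigned by frequency rank, and each fit call rebuilds the index from the accumulated counts.

```python
from collections import Counter


class TinyTokenizer:
    """Simplified, hypothetical stand-in for a frequency-ranked tokenizer."""

    def __init__(self):
        self.word_counts = Counter()
        self.word_index = {}

    def fit_on_texts(self, texts):
        # Counts accumulate across calls, then the whole index is rebuilt.
        for text in texts:
            self.word_counts.update(text.lower().split())
        # Most frequent word gets ID 1; ties keep first-seen order.
        ranked = sorted(self.word_counts.items(), key=lambda kv: -kv[1])
        self.word_index = {word: i + 1 for i, (word, _) in enumerate(ranked)}

    def texts_to_sequences(self, texts):
        # Words that are not in the index are silently skipped.
        return [
            [self.word_index[w] for w in text.lower().split() if w in self.word_index]
            for text in texts
        ]


train_texts = ["good movie", "good plot", "bad movie"]
new_text = ["bad bad bad ending"]

tok = TinyTokenizer()
tok.fit_on_texts(train_texts)
ids_trained = dict(tok.word_index)              # IDs the model would be trained on
correct_seq = tok.texts_to_sequences(new_text)  # encode only, no refit

tok.fit_on_texts(new_text)                      # the mistake: refit on new data
ids_refit = dict(tok.word_index)
shifted_seq = tok.texts_to_sequences(new_text)

print("trained IDs:", ids_trained)
print("refit IDs:  ", ids_refit)
print("correct:", correct_seq, "after refit:", shifted_seq)
```

In this toy case, "good" has ID 1 after training but ID 2 after the refit, and the new sentence is encoded differently, so an embedding layer would look up the wrong rows. This is why the fitted tokenizer should only be used to encode at prediction time.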