Tags: python, keras, deep-learning, nlp

Prediction is identical for all input data in Multi-Label Classification (NLP)


I'm trying to build a deep learning model that predicts the top 5 most probable movie genres from a movie's synopsis. There are 19 genres in the data, but regardless of the test input, the model always predicts the same 5 genres. Below is the code that builds the model. The training accuracy reaches 90%, so can you point me in the right direction as to what I'm doing wrong?

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Dense, LSTM
from keras.layers.embeddings import Embedding
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import re

data = pd.read_csv('train.csv', encoding='utf-8')
# Split the whitespace-separated genre string into a list of genres
data['genres_comma'] = data['genres'].str.split()
mlb = MultiLabelBinarizer()
# Build a new dataframe with one-hot encoded genre labels
train = pd.concat([
    data.drop(['genres', 'genres_comma'], axis=1),
    pd.DataFrame(mlb.fit_transform(data['genres_comma']), columns=mlb.classes_),
], axis=1)

genre_names = list(mlb.classes_)
genres = train.drop(['movie_id', 'synopsis'], axis=1)

def preprocess_text(sen):
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

X = []
sentences = list(train['synopsis'])
for sen in sentences:
    X.append(preprocess_text(sen))

y = genres.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Convert the texts to sequences of integer word indices (top 5,000 words)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# word_index holds every word seen during fitting, not just the top num_words,
# so the embedding matrix below is sized for the full vocabulary
vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

# Load pre-trained GloVe word vectors to initialize the embedding layer

embeddings_dictionary = dict()

with open('glove.6B.100d.txt', encoding='utf8') as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

# Model creation
deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
# One sigmoid unit per genre, so each genre gets an independent probability (multi-label)
dense_layer_1 = Dense(19, activation='sigmoid')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

model.summary()


history = model.fit(X_train, y_train, batch_size=128, epochs=5, verbose=1, validation_split=0.2)

score = model.evaluate(X_test, y_test, verbose=1)

Solution

  • Did you check the class distribution in your training data? Accuracy is not a good measure when classes are strongly imbalanced. If, say, those 5 genres account for the bulk of the positive labels, the model may never learn to classify any other genre and still reach 90% accuracy (a common failure mode is a constant output, which can score highly under class imbalance). So the first step is to count the training movies in each category, as in the quick check below.
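
    For instance, with the one-hot label dataframe genres built in your code, the check could look like this (a minimal sketch using the question's variables):

    # Count positive labels per genre and sort in descending order;
    # a few genres dominating the counts would confirm the imbalance
    genre_counts = genres.sum(axis=0).sort_values(ascending=False)
    print(genre_counts)
    print(genre_counts / len(genres))  # fraction of movies tagged with each genre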

    If my hunch is correct, look into class-balancing weights, or into loss functions that give more weight to rare classes; one possible sketch follows this paragraph. You may also want to fine-tune the model without class weights afterwards, once it has learned to classify all genres, so that it recovers the true prior probabilities.
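
    One way to do the weighting in Keras is a custom binary cross-entropy whose positive term is scaled per class. The snippet below is a sketch, not a drop-in fix: it assumes the y_train array, genre_names list, and model from your code, and the inverse-frequency weights are just one common heuristic.

    import numpy as np
    import keras.backend as K

    # Inverse-frequency positive weights: the rarer a genre, the larger its weight
    pos_counts = y_train.sum(axis=0)
    pos_weights = y_train.shape[0] / (len(genre_names) * np.maximum(pos_counts, 1))

    def weighted_binary_crossentropy(weights):
        w = K.constant(np.asarray(weights, dtype='float32'))
        def loss(y_true, y_pred):
            y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
            # Standard binary cross-entropy, with the positive term scaled per class
            bce = -(w * y_true * K.log(y_pred) + (1 - y_true) * K.log(1 - y_pred))
            return K.mean(bce, axis=-1)
        return loss

    model.compile(loss=weighted_binary_crossentropy(pos_weights), optimizer='adam')

    Evaluating with per-class precision and recall rather than plain accuracy will then show whether the rare genres are actually being learned.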