python tensorflow keras neural-network word-embedding

NN in Keras - expected dense_2 to have 3 dimensions, but got array with shape (10980, 3)

I want to train a neural network for multi-class sentiment analysis of tweets using word embeddings.

Here is my code:

import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, LSTM, GRU


from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

Import the data

df = pd.DataFrame()
df = pd.read_csv('Tweets.csv', encoding='utf-8')

Clean the tweets

def remove_mentions(input_text):
    return re.sub(r'@\w+', '', input_text)

def remove_stopwords(input_text):
    stopwords_list = stopwords.words('english')
    whitelist = ["n't", "not", "no"]
    words = input_text.split() 
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
    return " ".join(clean_words) 

df.text = df.text.apply(remove_stopwords).apply(remove_mentions)
df.text = [tweet for tweet in df.text if type(tweet) is str]

X = df['text']
y = df['airline_sentiment']

Split my data into train and test

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=37)

One-Hot Encode the field "Sentiment"

Originally the labels are of type string: 'neutral', 'positive', 'negative'. So I first transform them to integer and then apply one-hot encoding:

le = LabelEncoder()
y_train_num = le.fit_transform(y_train.values)
# Reuse the mapping learned from the training labels instead of refitting on the test set
y_test_num = le.transform(y_test.values)

nb_classes = 3
y_train = np_utils.to_categorical(y_train_num, nb_classes)
y_test = np_utils.to_categorical(y_test_num, nb_classes)
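
For illustration only, here is a minimal NumPy sketch of the same string-to-integer-to-one-hot idea. It is not what LabelEncoder or to_categorical do internally; the toy label arrays and the one_hot helper are hypothetical names made up for this example. It shows that the class order is learned once from the training labels and then reused for the test labels.

```python
import numpy as np

# Toy labels (hypothetical data, not taken from Tweets.csv)
train_labels = np.array(["neutral", "positive", "negative", "negative"])
test_labels = np.array(["positive", "negative"])

# Learn the class order once, from the training labels.
# np.unique returns the distinct values sorted alphabetically.
classes = np.unique(train_labels)  # ['negative' 'neutral' 'positive']

def one_hot(labels, classes):
    # searchsorted maps each label to its index in the sorted class array;
    # this assumes every label already appears in `classes`.
    idx = np.searchsorted(classes, labels)
    return np.eye(len(classes))[idx]

y_train_demo = one_hot(train_labels, classes)
y_test_demo = one_hot(test_labels, classes)

print(y_train_demo.shape)  # (4, 3)
print(y_test_demo)         # rows: [0, 0, 1] for 'positive', [1, 0, 0] for 'negative'
```

Each target row has exactly 3 entries, which is why the target array passed to model.fit() has shape (n_samples, 3).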

Prepare for Word Embedding

tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(X)
max_length = max([len(tweet.split()) for tweet in X])
print("max_length=%s" % (max_length))

vocab_size = len(tokenizer_obj.word_index) + 1 
print("vocab_size=%s" % (vocab_size))

X_train_tokenized = tokenizer_obj.texts_to_sequences(X_train)
X_test_tokenized = tokenizer_obj.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_tokenized, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokenized, maxlen=max_length, padding='post')

Define and Apply my NN Model

EMBEDDING_DIM = 100
    
model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
print(model.summary())

model.fit(X_train_pad, y_train, batch_size=128, epochs=25, validation_data=(X_test_pad, y_test), verbose=2)

I gave the last layer 3 output units because this is a multi-class classification task with 3 classes.

Here is the model summary:

Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 23, 100)           1488200   
_________________________________________________________________
dense_1 (Dense)              (None, 23, 8)             808       
_________________________________________________________________
dense_2 (Dense)              (None, 23, 3)             27        
=================================================================
Total params: 1,489,035
Trainable params: 1,489,035
Non-trainable params: 0
_________________________________________________________________

When the code gets to model.fit(), I get the following error:

ValueError: Error when checking target: expected dense_2 to have 3 dimensions, but got array with shape (10980, 3)

What am I doing wrong?

Solution

As the output of model.summary() shows, the model's output shape is (None, 23, 3), whereas your one-hot targets have shape (None, 3). This happens because a Dense layer is applied to the last axis of its input and does not automatically flatten inputs that have more than 2 dimensions. One way to resolve this is to add a Flatten layer right after the Embedding layer:

from keras.layers import Flatten

model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(Flatten())

This flattens the output of the Embedding layer from (None, 23, 100) to (None, 2300), so the following Dense layers produce 2D output and the final layer has shape (None, 3), matching your targets.
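
To see why the shapes behave this way, here is a rough NumPy sketch. It is not Keras code: it treats a Dense layer as a plain matrix multiplication on the last axis (bias and activation omitted), and the small batch size and random weights are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small batch for illustration; 23 and 100 match max_length and EMBEDDING_DIM above
batch, max_length, embedding_dim = 4, 23, 100

# Stand-in for the Embedding layer output: one 100-d vector per token
embedded = rng.normal(size=(batch, max_length, embedding_dim))

# Without Flatten: the matrix multiply acts on the last axis only,
# so the sequence axis of length 23 survives all the way to the output.
w_hidden = rng.normal(size=(embedding_dim, 8))
w_out = rng.normal(size=(8, 3))
unflattened_out = embedded @ w_hidden @ w_out
print(unflattened_out.shape)  # (4, 23, 3)

# With Flatten: each sample becomes a single vector of 23 * 100 values first.
flat = embedded.reshape(batch, -1)
w_hidden_flat = rng.normal(size=(max_length * embedding_dim, 8))
flattened_out = flat @ w_hidden_flat @ w_out
print(flattened_out.shape)    # (4, 3)
```

Only the second output has one row of 3 class scores per sample, which is the shape the one-hot targets require.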

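As a bonus, you might be able to get better accuracy if you use an LSTM layer right after the Embedding layer instead. By default it returns only its final output, so it also reduces each sequence to a single vector per sample: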

model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=max_length))
model.add(LSTM(32))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='softmax'))

However, this is not guaranteed. You must experiment and tune your model properly.