Tags: python, scikit-learn, sentiment-analysis, multilabel-classification

Imbalanced Dataset for Classification Report?


I am trying to train a model to classify sentiment from text. My two labels are "1" for positive and "0" for negative. When the classification report is run it produces this output:

            precision    recall  f1-score   support

           0       0.39      1.00      0.57      1081
           1       0.00      0.00      0.00      1660

    accuracy                           0.39      2741
   macro avg       0.20      0.50      0.28      2741
weighted avg       0.16      0.39      0.22      2741

So by the looks of it, it doesn't seem to classify label 1 at all. Looking at other Stack Overflow posts I thought it was an imbalanced-dataset problem, but that doesn't seem to be the case: to my understanding there is more data for label 1 than for label 0, so I am quite confused as to what the issue is here.
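For context, the numbers in the report above are exactly what a model that only ever predicts class 0 would produce. A minimal sketch (using just the support counts from the report, nothing from my actual data) that reproduces the same pattern:

import numpy as np
from sklearn.metrics import classification_report

# Labels with the same support as the report: 1081 zeros and 1660 ones
y_true = np.array([0] * 1081 + [1] * 1660)
# A degenerate "model" that predicts class 0 for every sample
y_pred = np.zeros_like(y_true)

# Recall is 1.00 for class 0, everything is 0.00 for class 1,
# and accuracy is 1081 / 2741, roughly 0.39 -- the same pattern as above
print(classification_report(y_true, y_pred, zero_division=0))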

Below are the relevant code snippets

import time
import numpy as np
import pandas as pd
# Import the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the precomputed word2vec training features from the saved file
word2vec_df = pd.read_csv(word2vec_filename)
# Initialize the model
clf_decision_word2vec = DecisionTreeClassifier()

start_time = time.time()
# Fit the model
clf_decision_word2vec.fit(word2vec_df, Y_train['Sentiment'])
print("Time taken to fit the model with word2vec vectors: " + str(time.time() - start_time))
from sklearn.metrics import classification_report
test_features_word2vec = []
for index, row in X_test.iterrows():
    model_vector = np.mean([sg_w2v_model[token] for token in row['stemmed_tokens']], axis=0)
    if type(model_vector) is list:
        test_features_word2vec.append(model_vector)
    else:
        test_features_word2vec.append(np.array([0 for i in range(1000)]))
test_predictions_word2vec = clf_decision_word2vec.predict(test_features_word2vec)
print(classification_report(Y_test['Sentiment'],test_predictions_word2vec))
for num in test_predictions_word2vec:
  print(num)

At the end of that code snippet I added a for loop to quickly check what data was in test_predictions_word2vec, and it looks like it is all zeroes.

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

Not too sure what happened for all the 1s to be left out (I only included a small subset here to show the 0s; looking at the full output on my console there were no 1s present).
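Rather than printing every prediction, a more compact check (a small sketch using the test_predictions_word2vec array from the snippet above) is to count the predicted classes directly:

import numpy as np

# Count how many times each class appears in the predictions
values, counts = np.unique(test_predictions_word2vec, return_counts=True)
print(values, counts)  # e.g. [0] [2741] if every prediction is class 0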

I'm assuming it is because of this line here:

test_features_word2vec.append(np.array([0 for i in range(1000)]))

where it looks like it is just appending 0s. Any help with this issue would be greatly appreciated!
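One quick way to check that hypothesis, independent of the rest of the pipeline, is to look at what np.mean actually returns for a list of vectors (a minimal sketch):

import numpy as np

# np.mean over a list of vectors returns a numpy.ndarray, not a Python list,
# so a check like `type(model_vector) is list` will never match it
model_vector = np.mean([np.ones(3), np.zeros(3)], axis=0)
print(type(model_vector))          # <class 'numpy.ndarray'>
print(type(model_vector) is list)  # False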

P.S. Snippet for the train-test split and its output:

from sklearn.model_selection import train_test_split
# Train Test Split Function
def split_train_test(split_data, test_size=0.3, shuffle_state=True):
    X_train, X_test, Y_train, Y_test = train_test_split(split_data[['movie_title', 'critics_consensus', 'tomatometer_status', 'tokenized_text', 'stemmed_tokens']],
                                                        split_data['Sentiment'], 
                                                        shuffle=shuffle_state,
                                                        test_size=test_size, 
                                                        random_state=42)
    print("Value counts for Train sentiments")
    print(Y_train.value_counts())
    print("Value counts for Test sentiments")
    print(Y_test.value_counts())
    print(type(X_train))
    print(type(Y_train))
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    Y_train = Y_train.to_frame()
    Y_train = Y_train.reset_index()
    Y_test = Y_test.to_frame()
    Y_test = Y_test.reset_index()
    print(X_train.head())

    return X_train, X_test, Y_train, Y_test

# Call the train_test_split
X_train, X_test, Y_train, Y_test = split_train_test(split_data)
Value counts for Train sentiments
1    3805
0    2588
Name: Sentiment, dtype: int64
Value counts for Test sentiments
1    1660
0    1081

EDIT: Adding output of 'word2vec_df'

Time taken to fit the model with word2vec vectors: 18.75066113471985
             0         1         2         3         4         5         6  \
0     0.009097 -0.014559 -0.021197  0.060744 -0.019707  0.102395  0.032876   
1     0.008102 -0.003382 -0.014465  0.066731 -0.024593  0.085185  0.023677   
2     0.013941 -0.005870 -0.001550  0.071456 -0.013130  0.094142  0.043876   
3     0.010195 -0.012312 -0.006310  0.069745 -0.012042  0.091056  0.034140   
4     0.006570 -0.010348 -0.016157  0.063258 -0.029932  0.098463  0.034469   
...        ...       ...       ...       ...       ...       ...       ...   
6388  0.000616 -0.000732 -0.006287  0.063298 -0.024651  0.055185 -0.000368   
6389  0.010891 -0.007447 -0.025401  0.063245 -0.028681  0.100588  0.029031   
6390  0.009561 -0.007456 -0.017953  0.076449 -0.029962  0.092921  0.040811   
6391  0.012995 -0.008843 -0.013079  0.058345 -0.027885  0.095623  0.024361   
6392  0.007881  0.003228 -0.013990  0.065434 -0.017051  0.090314  0.031072   

             7         8         9  ...       990       991       992  \
0     0.068392  0.120006  0.038360  ... -0.009643 -0.062597 -0.027641   
1     0.073042  0.101701  0.030647  ... -0.016221 -0.058624 -0.030524   
2     0.061665  0.117775  0.014894  ... -0.017982 -0.065756 -0.044015   
3     0.057861  0.117489  0.015533  ... -0.016098 -0.065427 -0.039047   
4     0.071677  0.100755  0.029278  ... -0.022267 -0.050894 -0.030283   
...        ...       ...       ...  ...       ...       ...       ...   
6388  0.058975  0.085394  0.028661  ... -0.016373 -0.050449 -0.008869   
6389  0.066502  0.106864  0.035051  ... -0.019567 -0.069977 -0.039586   
6390  0.061507  0.120290  0.030399  ...  0.000696 -0.054154 -0.041237   
6391  0.081338  0.111422  0.034755  ... -0.019699 -0.060718 -0.032540   
6392  0.054831  0.125640  0.032965  ... -0.002751 -0.084193 -0.040441   

           993       994       995       996       997       998       999  
0     0.078252  0.034909 -0.007387  0.057867 -0.052527 -0.072866 -0.010007  
1     0.075942  0.039987 -0.012127  0.042507 -0.054933 -0.072949 -0.010296  
2     0.065845  0.057452  0.002048  0.057100 -0.048846 -0.097791 -0.007207  
3     0.059275  0.051354  0.000843  0.050823 -0.046350 -0.090028 -0.005206  
4     0.066598  0.034786 -0.000143  0.056494 -0.046227 -0.070975 -0.007705  
...        ...       ...       ...       ...       ...       ...       ...  
6388  0.061066  0.017348 -0.018751  0.041088 -0.042949 -0.049911 -0.019149  
6389  0.071031  0.043249 -0.002368  0.040806 -0.046722 -0.085424  0.005255  
6390  0.076632  0.065442 -0.000805  0.050374 -0.047395 -0.085746  0.006119  
6391  0.083535  0.030460 -0.004143  0.047868 -0.058123 -0.069077 -0.012215  
6392  0.077906  0.075460 -0.013605  0.056237 -0.059329 -0.093779 -0.009383  

[6393 rows x 1000 columns]
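As a quick sanity check (a sketch, reusing word2vec_df and Y_train from the snippets above), the training features and labels do line up, and the 1000 columns match the length of the zero vectors being appended on the test side:

# 6393 training rows should match the number of training labels
# (3805 + 2588 from the value counts above), and 1000 columns should
# match the dimension of the test feature vectors
print(word2vec_df.shape)           # (6393, 1000)
print(len(Y_train['Sentiment']))   # 6393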

Solution

  • You are correct:

    np.array([0 for i in range(1000)])

    creates an array full of zeros. Since np.mean(...) returns a numpy.ndarray rather than a Python list, the check type(model_vector) is list is never true, so every test row falls through to the else branch and gets that zero vector, and the model predicts the same class for all of them.

    You should try:

    import numpy as np
    from sklearn.metrics import classification_report

    # Average the word vectors of each test row into a single feature vector
    averaged_test_vector = X_test['stemmed_tokens'].apply(
            lambda x: np.mean([sg_w2v_model[tok] for tok in x], axis=0)
        ).tolist()

    # Stack the per-row vectors into a 2D array of shape (n_samples, n_features)
    averaged_test_vector = np.vstack(averaged_test_vector)

    test_predictions_word2vec = clf_decision_word2vec.predict(averaged_test_vector)
    print(classification_report(Y_test['Sentiment'], test_predictions_word2vec))
    

    Generally speaking, I would use lower-dimensional embeddings if available; 1000 dimensions is a lot for a small dataset. And I wouldn't use DecisionTreeClassifier, as it overfits quickly. I would start with LinearSVC or RandomForestClassifier.
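    As a minimal sketch of that suggestion (assuming the same word2vec_df, averaged_test_vector, Y_train, and Y_test from the snippets above), swapping in LinearSVC would look like this:

    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    # A linear model is usually a safer starting point than a single decision
    # tree on high-dimensional averaged word vectors
    clf_svc = LinearSVC()
    clf_svc.fit(word2vec_df, Y_train['Sentiment'])

    svc_predictions = clf_svc.predict(averaged_test_vector)
    print(classification_report(Y_test['Sentiment'], svc_predictions))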