I am training a model to classify sentiment from text. My two labels are "1" for positive and "0" for negative. When the classification report is run it produces this output:
              precision    recall  f1-score   support

           0       0.39      1.00      0.57      1081
           1       0.00      0.00      0.00      1660

    accuracy                           0.39      2741
   macro avg       0.20      0.50      0.28      2741
weighted avg       0.16      0.39      0.22      2741
So by the looks of it, it doesn't seem to classify label 1 at all. Looking at other Stack Overflow posts I thought it was an imbalanced-dataset problem, but that doesn't seem to be the case: there is actually more data for label 1 than for label 0, so I am quite confused as to the issue here.
Below are the relevant code snippets:
import time
#Import the DecisionTreeeClassifier
from sklearn.tree import DecisionTreeClassifier
# Load from the filename
word2vec_df = pd.read_csv(word2vec_filename)
#Initialize the model
clf_decision_word2vec = DecisionTreeClassifier()
start_time = time.time()
# Fit the model
clf_decision_word2vec.fit(word2vec_df, Y_train['Sentiment'])
print("Time taken to fit the model with word2vec vectors: " + str(time.time() - start_time))
from sklearn.metrics import classification_report
test_features_word2vec = []
for index, row in X_test.iterrows():
    model_vector = np.mean([sg_w2v_model[token] for token in row['stemmed_tokens']], axis=0)
    if type(model_vector) is list:
        test_features_word2vec.append(model_vector)
    else:
        test_features_word2vec.append(np.array([0 for i in range(1000)]))
test_predictions_word2vec = clf_decision_word2vec.predict(test_features_word2vec)
print(classification_report(Y_test['Sentiment'],test_predictions_word2vec))
for num in test_predictions_word2vec:
    print(num)
At the end of that code snippet I added a for loop to quickly check what data was in test_predictions_word2vec, and it looks like all zeroes.
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Not too sure what happened where all the 1s were left out (I only included a small subset here to show the 0s; looking at the full output on my console, there were no 1s present).
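For a more compact check than printing every prediction, counting the unique predicted labels shows the same thing:
import numpy as np
# Count how many times each label was predicted; the full run shows only 0s.
print(np.unique(test_predictions_word2vec, return_counts=True))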
I'm assuming it is because of this line here:
test_features_word2vec.append(np.array([0 for i in range(1000)]))
where it looks like it is just appending 0s. Any help on this issue would be greatly appreciated!
P.S. Snippet for the train-test split and its output:
from sklearn.model_selection import train_test_split
# Train Test Split Function
def split_train_test(split_data, test_size=0.3, shuffle_state=True):
    X_train, X_test, Y_train, Y_test = train_test_split(split_data[['movie_title', 'critics_consensus', 'tomatometer_status', 'tokenized_text', 'stemmed_tokens']],
                                                        split_data['Sentiment'],
                                                        shuffle=shuffle_state,
                                                        test_size=test_size,
                                                        random_state=42)
    print("Value counts for Train sentiments")
    print(Y_train.value_counts())
    print("Value counts for Test sentiments")
    print(Y_test.value_counts())
    print(type(X_train))
    print(type(Y_train))
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    Y_train = Y_train.to_frame()
    Y_train = Y_train.reset_index()
    Y_test = Y_test.to_frame()
    Y_test = Y_test.reset_index()
    print(X_train.head())
    return X_train, X_test, Y_train, Y_test
# Call the train_test_split
X_train, X_test, Y_train, Y_test = split_train_test(split_data)
Value counts for Train sentiments
1    3805
0    2588
Name: Sentiment, dtype: int64
Value counts for Test sentiments
1    1660
0    1081
Name: Sentiment, dtype: int64
EDIT: Adding the output of word2vec_df:
Time taken to fit the model with word2vec vectors: 18.75066113471985
          0         1         2         3         4         5         6  \
0  0.009097 -0.014559 -0.021197  0.060744 -0.019707  0.102395  0.032876
1  0.008102 -0.003382 -0.014465  0.066731 -0.024593  0.085185  0.023677
2  0.013941 -0.005870 -0.001550  0.071456 -0.013130  0.094142  0.043876
3  0.010195 -0.012312 -0.006310  0.069745 -0.012042  0.091056  0.034140
4  0.006570 -0.010348 -0.016157  0.063258 -0.029932  0.098463  0.034469
...     ...       ...       ...       ...       ...       ...       ...

[6393 rows x 1000 columns]
You are correct:
np.array([0 for i in range(1000)])
creates an array full of zeros.
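The reason that line runs for every row is that np.mean over a list of vectors returns a NumPy ndarray, never a Python list, so the type(model_vector) is list check is always False and the zero vector is appended every time. A quick demonstration:
import numpy as np
vectors = [np.ones(3), np.zeros(3)]
model_vector = np.mean(vectors, axis=0)
print(type(model_vector))          # <class 'numpy.ndarray'>
print(type(model_vector) is list)  # False, so the else branch always ran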
You should try:
from sklearn.metrics import classification_report

averaged_test_vector = X_test['stemmed_tokens'].apply(
    lambda x: np.mean([sg_w2v_model[tok] for tok in x], axis=0)
).tolist()
averaged_test_vector = np.vstack(averaged_test_vector)

test_predictions_word2vec = clf_decision_word2vec.predict(averaged_test_vector)
print(classification_report(Y_test['Sentiment'], test_predictions_word2vec))
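If some rows can have an empty stemmed_tokens list, np.mean over an empty list will not produce a valid vector, so a zero-vector fallback may still be useful, keyed to emptiness rather than to the type check. A minimal sketch, assuming sg_w2v_model supports in for vocabulary membership and produces 1000-dimensional vectors:
EMBED_DIM = 1000  # assumed dimensionality of the sg_w2v_model vectors

def average_vector(tokens):
    # Keep only tokens the word2vec model actually knows about.
    vectors = [sg_w2v_model[tok] for tok in tokens if tok in sg_w2v_model]
    # Fall back to a zero vector only when no usable tokens remain.
    return np.mean(vectors, axis=0) if vectors else np.zeros(EMBED_DIM)

test_features = np.vstack(X_test['stemmed_tokens'].apply(average_vector))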
Generally speaking, I would use embeddings of lower dimension if available; 1000 is a lot for a small dataset. And I wouldn't use DecisionTreeClassifier, as it overfits quickly. I would start with LinearSVC or RandomForestClassifier.
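For example, a minimal LinearSVC baseline on the same features (a sketch, assuming word2vec_df and averaged_test_vector are built as above):
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# A linear model is often a stronger baseline than a single decision
# tree on high-dimensional averaged embeddings.
clf_svc = LinearSVC()
clf_svc.fit(word2vec_df, Y_train['Sentiment'])
print(classification_report(Y_test['Sentiment'], clf_svc.predict(averaged_test_vector)))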