I am building a logistic regression model for sentiment analysis and I get this error: ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]
It occurs when I try to split my dataset into x and y training and validation sets.
# splitting data into training and validation set
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
lreg = LogisticRegression() # training the model
lreg.fit(xtrain_bow, ytrain)
prediction = lreg.predict_proba(xvalid_bow) # predicting on the validation set
prediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 then 1 else 0
prediction_int = prediction_int.astype(int) # np.int was removed in NumPy 1.24; the built-in int works
f1_score(yvalid, prediction_int) # calculating f1 score for the validation set
I saw in some posts that this can happen because of the shapes of X and y, so I printed out the shapes of the dataset. I have split my dataset into 85% for training and the rest for testing/validation.
# Extracting train and test BoW features
split_frac = 0.85
split_num = int(len(combi['tidy_tweet']) * split_frac)
train_bow = bow[:split_num,:]
test_bow = bow[split_num:,:]
print(train_bow.shape)
print(test_bow.shape)
print(train['label'].shape)
(32979, 1000)
(5820, 1000)
(21602,)
The error is raised on this line:
----> 1 xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
2 lreg = LogisticRegression() # training the model
3 lreg.fit(xtrain_bow, ytrain)
I have no idea what is actually causing the problem. Can you please help? Thanks in advance.
You are getting the above error because the length of the second parameter, i.e., the label, in train_test_split() is 21602 while the length of the first parameter is 32979, which should not happen. X and y must have the same number of samples. So, check the lengths of train_bow and train['label'].
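One way to run that check before splitting is sketched below. The helper name safe_split and the toy arrays are hypothetical, not part of your code; the arrays only stand in for train_bow and train['label'].

```python
import numpy as np
from sklearn.model_selection import train_test_split

def safe_split(X, y, **kwargs):
    """Call train_test_split only when X and y have the same number of samples."""
    n_rows, n_labels = X.shape[0], len(y)
    if n_rows != n_labels:
        # Fail early with a message that names both lengths
        raise ValueError(f"X has {n_rows} rows but y has {n_labels} labels")
    return train_test_split(X, y, **kwargs)

# Hypothetical toy arrays standing in for train_bow and train['label']
X = np.zeros((10, 3))
y = np.zeros(10)
xtrain, xvalid, ytrain, yvalid = safe_split(X, y, test_size=0.3, random_state=42)
print(xtrain.shape, xvalid.shape)
```

With your shapes, the check would fail because 32979 != 21602, which is the same mismatch the ValueError reports.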
So, just change
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
to something like below:
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(bow[:split_num,:-1], bow[:split_num,-1], test_size=0.3, random_state=42)
(This assumes bow contains both features and labels, with the labels in the last column.)
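If bow holds only features instead, a different fix may fit better. The sketch below assumes that bow was built with CountVectorizer from combi['tidy_tweet'], that combi is the labelled train frame followed by the unlabelled test frame, and that the labels live only in train['label']. None of this is confirmed by the question, and the toy data is made up for illustration. Under those assumptions, slicing bow at len(train) rather than at an 85% cut keeps the features and labels aligned.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy frames standing in for the real train and test data
train = pd.DataFrame({
    "tidy_tweet": ["good day", "bad day", "great movie", "awful movie"],
    "label": [0, 1, 0, 1],
})
test = pd.DataFrame({"tidy_tweet": ["nice day", "poor movie"]})
combi = pd.concat([train, test], ignore_index=True)

# Assumption: bow is a feature-only matrix with one row per row of combi
bow = CountVectorizer().fit_transform(combi["tidy_tweet"])

# Cut at the number of labelled rows, not at a fraction of combi
n_train = len(train)
train_bow = bow[:n_train, :]
test_bow = bow[n_train:, :]

# Features and labels now have the same number of samples
assert train_bow.shape[0] == len(train["label"])

xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(
    train_bow, train["label"], test_size=0.25, random_state=42
)
```

The validation set still comes from train_test_split, so a separate 85% cut is not needed for that purpose.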
Read more about sklearn.model_selection.train_test_split in the scikit-learn documentation.