machine-learning text scikit-learn nlp train-test-split

How to fix 'ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]'?

I am building a Logistic Regression model for sentiment analysis. When I try to split my dataset into training and validation sets for x and y, I get this error: ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]

# splitting data into training and validation set 
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
lreg = LogisticRegression() # training the model 
lreg.fit(xtrain_bow, ytrain) 
prediction = lreg.predict_proba(xvalid_bow) # predicting on the validation set 
prediction_int = prediction[:,1] >= 0.3 # if the probability is greater than or equal to 0.3 then 1 else 0
prediction_int = prediction_int.astype(int) # np.int was removed in NumPy 1.24; the built-in int works
f1_score(yvalid, prediction_int) # calculating f1 score for the validation set

I saw in some posts that this can be caused by the shapes of X and y, so I printed the shapes of the dataset. I have split my dataset into 85% for training and the rest for test/validation.

# Extracting train and test BoW features
split_frac = 0.85

split_num = int(len(combi['tidy_tweet']) * split_frac)

train_bow = bow[:split_num,:] 
test_bow = bow[split_num:,:] 
print(train_bow.shape)
print(test_bow.shape)
print(train['label'].shape)

(32979, 1000)
(5820, 1000)
(21602,)

The error is raised on this line:

----> 1 xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
      2 lreg = LogisticRegression() # training the model
      3 lreg.fit(xtrain_bow, ytrain)

I have no idea what is actually causing the problem. Can anyone help? Thanks in advance.

Solution

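You are getting this error because the two arrays passed to train_test_split() have different lengths: the first argument, train_bow, has 32979 rows, while the second, train['label'], has only 21602. X and y must have the same number of samples. Note that split_num is computed from len(combi['tidy_tweet']), so train_bow holds 85% of the rows of combi (32979 of 32979 + 5820 = 38799), not the 21602 rows that have a label in train. So, check the lengths of train_bow and train['label'] before splitting.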
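The arithmetic behind the two numbers in the error message can be checked directly. This is only a sketch using the row counts printed in the question; the variable names n_train_bow, n_test_bow, n_labels and n_combi are made up for illustration.

```python
# Row counts copied from the shapes printed in the question.
n_train_bow = 32979   # rows in train_bow
n_test_bow = 5820     # rows in test_bow
n_labels = 21602      # rows in train['label']

# bow was sliced into train_bow and test_bow, so together they cover all of combi.
n_combi = n_train_bow + n_test_bow

# Same formula as in the question: 85% of the combined data.
split_frac = 0.85
split_num = int(n_combi * split_frac)

print(n_combi)                # 38799
print(split_num)              # 32979
print(split_num == n_labels)  # False: features and labels cannot be paired row by row
```

So the 32979 in the error message is simply 85% of combi, which has nothing to do with how many rows of train carry a label.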

So, change

xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)

to something like the following:

xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(bow[:split_num,:-1], bow[:split_num,-1], test_size=0.3, random_state=42)

(This assumes bow contains both the features and the labels, with the labels in the last column.)
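If bow holds only features, as the (32979, 1000) shape suggests it might, another option is to slice bow by the number of labelled rows instead of by a fraction of combi. The sketch below is not taken from the question: it assumes combi was built by stacking the rows of train first and the unlabelled test rows after them, and it uses tiny made-up arrays (bow, labels) in place of the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins with hypothetical sizes: 6 labelled rows followed by 4 unlabelled rows.
n_train, n_test, n_features = 6, 4, 3
bow = np.arange((n_train + n_test) * n_features).reshape(n_train + n_test, n_features)
labels = np.array([0, 1, 0, 1, 0, 1])  # plays the role of train['label']

# Slice by the number of labelled rows, not by a fraction of the combined data.
# ASSUMPTION: the first len(labels) rows of bow correspond to the rows of train.
train_bow = bow[:len(labels), :]
test_bow = bow[len(labels):, :]

# Cheap sanity check before splitting: X and y must have the same number of rows.
assert train_bow.shape[0] == len(labels)

xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(
    train_bow, labels, test_size=0.3, random_state=42
)
print(xtrain_bow.shape, xvalid_bow.shape, ytrain.shape, yvalid.shape)
```

With the real data this would correspond to using len(train) in place of split_num, provided the row order assumption holds. The validation set then comes from train_test_split, and test_bow is left for the final predictions.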

For more details, see the documentation for sklearn.model_selection.train_test_split.